Merge branch 'master' of https://github.com/tensorflow/tensorflow

author: Madiyar Aitbayev <madiyar@google.com> 2018-08-03 20:26:48 +0100
committer: Madiyar Aitbayev <madiyar@google.com> 2018-08-03 20:26:48 +0100
commit: 750ac07158c66839e675bacd30beb5b5ad0ea0f6 (patch)
tree: 3d7476ce4481b43eb530fd50503cb8eb1dafb8a2
parent: 3ce29e3b3341d8820442407fc260d3f3223b9ab8 (diff)
parent: dd45704d092dac87575b8ce39013f91f4f213dc0 (diff)
2651 files changed, 152713 insertions, 56873 deletions
diff --git a/.gitignore b/.gitignore
index b5306b8b79..5afe375f46 100644
--- a/.gitignore
+++ b/.gitignore
@@ -29,6 +29,7 @@ Podfile.lock
 /tensorflow/contrib/lite/examples/ios/simple/data/*.tflite
 xcuserdata/**
 /api_init_files_list.txt
+/estimator_api_init_files_list.txt
 
 # Android
 .gradle
diff --git a/ISSUE_TEMPLATE.md b/ISSUE_TEMPLATE.md
index 2f3df7cda9..52faed9297 100644
--- a/ISSUE_TEMPLATE.md
+++ b/ISSUE_TEMPLATE.md
@@ -15,9 +15,10 @@ If you open a GitHub issue, here is our policy:
 ### System information
 - **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**:
 - **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**:
+- **Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device**:
 - **TensorFlow installed from (source or binary)**:
 - **TensorFlow version (use command below)**:
-- **Python version**: 
+- **Python version**:
 - **Bazel version (if compiling from source)**:
 - **GCC/Compiler version (if compiling from source)**:
 - **CUDA/cuDNN version**:
diff --git a/README.md b/README.md
index 05fcb23f7e..82de010dd4 100644
--- a/README.md
+++ b/README.md
@@ -82,12 +82,12 @@ The TensorFlow project strives to abide by generally accepted best practices in
 | Build Type      | Status | Artifacts |
 | ---             | ---    | ---       |
 | **Linux CPU**   | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/ubuntu-cc.png) | [pypi](https://pypi.org/project/tf-nightly/) |
-| **Linux GPU**   | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/ubuntu-gpu-cc.png) | [pypi](https://pypi.org/project/tf-nightly-gpu/) |
-| **Linux XLA**   | TBA | TBA |
+| **Linux GPU**   | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/ubuntu-gpu-py3.png) | [pypi](https://pypi.org/project/tf-nightly-gpu/) |
+| **Linux XLA**   | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/ubuntu-xla.png) | TBA |
 | **MacOS**       | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/macos-py2-cc.png) | [pypi](https://pypi.org/project/tf-nightly/) |
-| **Windows CPU** | [![Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-win-cmake-py)](https://ci.tensorflow.org/job/tensorflow-master-win-cmake-py) | [pypi](https://pypi.org/project/tf-nightly/) |
-| **Windows GPU** | [![Status](http://ci.tensorflow.org/job/tf-master-win-gpu-cmake/badge/icon)](http://ci.tensorflow.org/job/tf-master-win-gpu-cmake/) | [pypi](https://pypi.org/project/tf-nightly-gpu/) |
-| **Android**     | [![Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-android)](https://ci.tensorflow.org/job/tensorflow-master-android) | [![Download](https://api.bintray.com/packages/google/tensorflow/tensorflow/images/download.svg)](https://bintray.com/google/tensorflow/tensorflow/_latestVersion) [demo APK](https://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/tensorflow_demo.apk), [native libs](https://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/native/) [build history](https://ci.tensorflow.org/view/Nightly/job/nightly-android/) |
+| **Windows CPU** | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/windows-cpu.png) | [pypi](https://pypi.org/project/tf-nightly/) |
+| **Windows GPU** | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/windows-gpu.png) | [pypi](https://pypi.org/project/tf-nightly-gpu/) |
+| **Android**     | ![Status](https://storage.googleapis.com/tensorflow-kokoro-build-badges/android.png) | [![Download](https://api.bintray.com/packages/google/tensorflow/tensorflow/images/download.svg)](https://bintray.com/google/tensorflow/tensorflow/_latestVersion) |
 
 
 ### Community Supported Builds
diff --git a/RELEASE.md b/RELEASE.md
index 4b03394427..6b67072f8e 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -21,7 +21,7 @@
 * The [distributions.Bijector](https://www.tensorflow.org/versions/r1.9/api_docs/python/tf/contrib/distributions/bijectors/Bijector)
   API supports broadcasting for Bijectors with new API changes.
   
-## Breaking Chances
+## Breaking Changes
   * If you're opening empty variable scopes; replace `variable_scope('', ...)` by
     `variable_scope(tf.get_variable_scope(), ...)`.
   * Headers used for building custom ops have been moved from site-packages/external into site-packages/tensorflow/include/external.
@@ -34,18 +34,22 @@
   * Using `tf.layers` in  a subclassed `tf.keras.Model` class. See
     [here](https://www.tensorflow.org/versions/r1.9/api_docs/python/tf/layers) for more details
 * `tf.data`:
-  * The `DatasetBase::DebugString()` method is now `const`.
-  * Added the `tf.contrib.data.sample_from_datasets()` API for randomly sampling from multiple datasets.
+  * `Dataset.from_generator()` now accepts an `args` list, in order to create nested generators.
+  * `Dataset.list_files()` now produces determinstic results when `shuffle=False` or a `seed` is passed.
+  * `tf.contrib.data.sample_from_datasets()` and `tf.contrib.data.choose_from_datasets()` make it easier to sample or deterministically choose elements from multiple datasets.
+  * `tf.contrib.data.make_csv_dataset()` now supports line breaks in quoted strings, and two infrequently used arguments removed.
+  * (C++) `DatasetBase::DebugString()` is now `const`.
+  * (C++) `DatasetBase::MakeIterator()` has been renamed to `DatasetBase::MakeIteratorInternal()`.
+  * (C++) `IteratorBase::Initialize()` method was added to support raising errors during iterator construction.
 * Eager Execution:
+  * Added the ability to pause recording operations for gradient computation via `tf.GradientTape.stop_recording`.
+  * Updated documentation, introductory notebooks.
 * `tf.keras`:
   * Move Keras code out of _impl folder and remove API files.
   * `tf.keras.Model.save_weights` now saves in TensorFlow format by default.
   * Enable dataset iterators to be passed to `tf.keras.Model` training/eval methods.
-* Accelerated Linear Algebra (XLA):
-* TensorFlow Debugger (tfdbg): fix an issue in which the TensorBoard Debugger Plugin could not handle total source file size exceeding gRPC message size limit (4 MB).
+* TensorFlow Debugger (tfdbg) CLI: fix an issue in which the TensorBoard Debugger Plugin could not handle total source file size exceeding gRPC message size limit (4 MB).
 * `tf.contrib`:
-  * Add `tf.contrib.data.choose_from_datasets()`.
-  * `tf.contrib.data.make_csv_dataset()` now supports line breaks in quoted strings. Two arguments were removed from `make_csv_dataset`.
   * `tf.contrib.framework.zero_initializer` supports ResourceVariable.
   * Adding "constrained_optimization" to tensorflow/contrib.
 * Other:
@@ -55,7 +59,6 @@
   * More consistent GcsFileSystem behavior for certain reads past EOF.
   * Update benchmark for tf.scan to match ranges across eager and graph modes.
   * Fixed bug in `tf.reduce_prod gradient` for complex dtypes.
-  * Add optional `args` argument to `Dataset.from_generator()`.
   * Allow the use of '.' in variables (e.g. "hparams.parse('a.b=1.0')"), which would previously raise an error. This will correspond to an attribute name with an embedded '.' symbol (e.g. 'a.b'), which can only be accessed indirectly (e.g. through getattr and setattr).  To set this up the user will first need to explicitly add the variable to the hparam object (e.g. "hparams.add_hparam(name='a.b', value=0.0)").
   * Benchmark for tf.scan in graph and eager modes.
   * Added complex128 support to FFT, FFT2D, FFT3D, IFFT, IFFT2D, and IFFT3D.
@@ -65,7 +68,6 @@
   * LinearOperator[1D,2D,3D]Circulant added to `tensorflow.linalg`.
   * Conv3D, Conv3DBackpropInput, Conv3DBackpropFilter now supports arbitrary.
   * Added `tf.train.Checkpoint` for reading/writing object-based checkpoints.
-  * `Dataset.list_files()` now produces determinstic results when `shuffle=False` or a `seed` is passed.
   * Added LinearOperatorKronecker, a dense-free implementation of the Kronecker Product.
   * Allow LinearOperator to broadcast.
   * SavedModelBuilder will now deduplicate asset names that point to files with the same basename and the same contents. Note that this may result in new asset files included in SavedModels in cases where assets with the same name but different contents were previously overwriting each other.
diff --git a/WORKSPACE b/WORKSPACE
index fd7570a80a..17961829a6 100644
--- a/WORKSPACE
+++ b/WORKSPACE
@@ -18,7 +18,7 @@ closure_repositories()
 # files, in case the parsing of those build files depends on the bazel
 # version we require here.
 load("//tensorflow:version_check.bzl", "check_bazel_version_at_least")
-check_bazel_version_at_least("0.10.0")
+check_bazel_version_at_least("0.15.0")
 
 load("//tensorflow:workspace.bzl", "tf_workspace")
 
diff --git a/configure.py b/configure.py
index 31a83b4a15..f97bf8a668 100644
--- a/configure.py
+++ b/configure.py
@@ -35,8 +35,8 @@ except ImportError:
 
 _DEFAULT_CUDA_VERSION = '9.0'
 _DEFAULT_CUDNN_VERSION = '7'
-_DEFAULT_NCCL_VERSION = '1.3'
-_DEFAULT_CUDA_COMPUTE_CAPABILITIES = '3.5,5.2'
+_DEFAULT_NCCL_VERSION = '2.2'
+_DEFAULT_CUDA_COMPUTE_CAPABILITIES = '3.5,7.0'
 _DEFAULT_CUDA_PATH = '/usr/local/cuda'
 _DEFAULT_CUDA_PATH_LINUX = '/opt/cuda'
 _DEFAULT_CUDA_PATH_WIN = ('C:/Program Files/NVIDIA GPU Computing '
@@ -680,7 +680,7 @@ def create_android_sdk_rule(environ_cp):
   if is_windows() or is_cygwin():
     default_sdk_path = cygpath('%s/Android/Sdk' % environ_cp['APPDATA'])
   elif is_macos():
-    default_sdk_path = '%s/library/Android/Sdk/ndk-bundle' % environ_cp['HOME']
+    default_sdk_path = '%s/library/Android/Sdk' % environ_cp['HOME']
   else:
     default_sdk_path = '%s/Android/Sdk' % environ_cp['HOME']
 
@@ -835,6 +835,8 @@ def set_tf_cuda_version(environ_cp):
                      '[Default is %s]: ') % (tf_cuda_version, default_cuda_path)
     cuda_toolkit_path = get_from_env_or_user_or_default(
         environ_cp, 'CUDA_TOOLKIT_PATH', ask_cuda_path, default_cuda_path)
+    if is_windows() or is_cygwin():
+      cuda_toolkit_path = cygpath(cuda_toolkit_path)
 
     if is_windows():
       cuda_rt_lib_path = 'lib/x64/cudart.lib'
@@ -880,7 +882,7 @@ def set_tf_cudnn_version(environ_cp):
     default_cudnn_path = environ_cp.get('CUDA_TOOLKIT_PATH')
     ask_cudnn_path = (r'Please specify the location where cuDNN %s library is '
                       'installed. Refer to README.md for more details. [Default'
-                      ' is %s]:') % (tf_cudnn_version, default_cudnn_path)
+                      ' is %s]: ') % (tf_cudnn_version, default_cudnn_path)
     cudnn_install_path = get_from_env_or_user_or_default(
         environ_cp, 'CUDNN_INSTALL_PATH', ask_cudnn_path, default_cudnn_path)
 
@@ -1095,8 +1097,10 @@ def set_tf_nccl_install_path(environ_cp):
     raise ValueError('Currently NCCL is only supported on Linux platforms.')
 
   ask_nccl_version = (
-      'Please specify the NCCL version you want to use. '
-      '[Leave empty to default to NCCL %s]: ') % _DEFAULT_NCCL_VERSION
+      'Please specify the NCCL version you want to use. If NCCL %s is not '
+      'installed, then you can use version 1.3 that can be fetched '
+      'automatically but it may have worse performance with multiple GPUs. '
+      '[Default is %s]: ') % (_DEFAULT_NCCL_VERSION, _DEFAULT_NCCL_VERSION)
 
   for _ in range(_DEFAULT_PROMPT_ASK_ATTEMPTS):
     tf_nccl_version = get_from_env_or_user_or_default(
@@ -1134,9 +1138,7 @@ def set_tf_nccl_install_path(environ_cp):
 
     nccl_lib_path = os.path.join(nccl_install_path, nccl_lib_path)
     nccl_hdr_path = os.path.join(nccl_install_path, 'include/nccl.h')
-    nccl_license_path = os.path.join(nccl_install_path, 'NCCL-SLA.txt')
-    if os.path.exists(nccl_lib_path) and os.path.exists(
-        nccl_hdr_path) and os.path.exists(nccl_license_path):
+    if os.path.exists(nccl_lib_path) and os.path.exists(nccl_hdr_path):
       # Set NCCL_INSTALL_PATH
       environ_cp['NCCL_INSTALL_PATH'] = nccl_install_path
       write_action_env_to_bazelrc('NCCL_INSTALL_PATH', nccl_install_path)
@@ -1199,7 +1201,7 @@ def set_tf_cuda_compute_capabilities(environ_cp):
         'https://developer.nvidia.com/cuda-gpus.\nPlease'
         ' note that each additional compute '
         'capability significantly increases your '
-        'build time and binary size. [Default is: %s]' %
+        'build time and binary size. [Default is: %s]: ' %
         default_cuda_compute_capabilities)
     tf_cuda_compute_capabilities = get_from_env_or_user_or_default(
         environ_cp, 'TF_CUDA_COMPUTE_CAPABILITIES',
@@ -1234,28 +1236,13 @@ def set_tf_cuda_compute_capabilities(environ_cp):
 
 def set_other_cuda_vars(environ_cp):
   """Set other CUDA related variables."""
-  if is_windows():
-    # The following three variables are needed for MSVC toolchain configuration
-    # in Bazel
-    environ_cp['CUDA_PATH'] = environ_cp.get('CUDA_TOOLKIT_PATH')
-    environ_cp['CUDA_COMPUTE_CAPABILITIES'] = environ_cp.get(
-        'TF_CUDA_COMPUTE_CAPABILITIES')
-    environ_cp['NO_WHOLE_ARCHIVE_OPTION'] = 1
-    write_action_env_to_bazelrc('CUDA_PATH', environ_cp.get('CUDA_PATH'))
-    write_action_env_to_bazelrc('CUDA_COMPUTE_CAPABILITIE',
-                                environ_cp.get('CUDA_COMPUTE_CAPABILITIE'))
-    write_action_env_to_bazelrc('NO_WHOLE_ARCHIVE_OPTION',
-                                environ_cp.get('NO_WHOLE_ARCHIVE_OPTION'))
-    write_to_bazelrc('build --config=win-cuda')
-    write_to_bazelrc('test --config=win-cuda')
+  # If CUDA is enabled, always use GPU during build and test.
+  if environ_cp.get('TF_CUDA_CLANG') == '1':
+    write_to_bazelrc('build --config=cuda_clang')
+    write_to_bazelrc('test --config=cuda_clang')
   else:
-    # If CUDA is enabled, always use GPU during build and test.
-    if environ_cp.get('TF_CUDA_CLANG') == '1':
-      write_to_bazelrc('build --config=cuda_clang')
-      write_to_bazelrc('test --config=cuda_clang')
-    else:
-      write_to_bazelrc('build --config=cuda')
-      write_to_bazelrc('test --config=cuda')
+    write_to_bazelrc('build --config=cuda')
+    write_to_bazelrc('test --config=cuda')
 
 
 def set_host_cxx_compiler(environ_cp):
@@ -1415,14 +1402,36 @@ def set_build_strip_flag():
   write_to_bazelrc('build --strip=always')
 
 
-def set_windows_build_flags():
-  if is_windows():
-    # The non-monolithic build is not supported yet
-    write_to_bazelrc('build --config monolithic')
-    # Suppress warning messages
-    write_to_bazelrc('build --copt=-w --host_copt=-w')
-    # Output more verbose information when something goes wrong
-    write_to_bazelrc('build --verbose_failures')
+def set_windows_build_flags(environ_cp):
+  """Set Windows specific build options."""
+  # The non-monolithic build is not supported yet
+  write_to_bazelrc('build --config monolithic')
+  # Suppress warning messages
+  write_to_bazelrc('build --copt=-w --host_copt=-w')
+  # Output more verbose information when something goes wrong
+  write_to_bazelrc('build --verbose_failures')
+  # The host and target platforms are the same in Windows build. So we don't
+  # have to distinct them. This avoids building the same targets twice.
+  write_to_bazelrc('build --distinct_host_configuration=false')
+  # Enable short object file path to avoid long path issue on Windows.
+  # TODO(pcloudy): Remove this flag when upgrading Bazel to 0.16.0
+  # Short object file path will be enabled by default.
+  write_to_bazelrc('build --experimental_shortened_obj_file_path=true')
+
+  if get_var(
+      environ_cp, 'TF_OVERRIDE_EIGEN_STRONG_INLINE', 'Eigen strong inline',
+      True,
+      ('Would you like to override eigen strong inline for some C++ '
+       'compilation to reduce the compilation time?'),
+      'Eigen strong inline overridden.',
+      'Not overriding eigen strong inline, '
+      'some compilations could take more than 20 mins.'):
+    # Due to a known MSVC compiler issue
+    # https://github.com/tensorflow/tensorflow/issues/10521
+    # Overriding eigen strong inline speeds up the compiling of
+    # conv_grad_ops_3d.cc and conv_ops_3d.cc by 20 minutes,
+    # but this also hurts the performance. Let users decide what they want.
+    write_to_bazelrc('build --define=override_eigen_strong_inline=true')
 
 
 def config_info_line(name, help_text):
@@ -1442,7 +1451,7 @@ def main():
   # environment variables.
   environ_cp = dict(os.environ)
 
-  check_bazel_version('0.10.0')
+  check_bazel_version('0.15.0')
 
   reset_tf_configure_bazelrc(args.workspace)
   cleanup_makefile()
@@ -1462,11 +1471,23 @@ def main():
     # TODO(ibiryukov): Investigate using clang as a cpu or cuda compiler on
     # Windows.
     environ_cp['TF_DOWNLOAD_CLANG'] = '0'
+    environ_cp['TF_ENABLE_XLA'] = '0'
+    environ_cp['TF_NEED_GDR'] = '0'
+    environ_cp['TF_NEED_VERBS'] = '0'
+    environ_cp['TF_NEED_MPI'] = '0'
+    environ_cp['TF_SET_ANDROID_WORKSPACE'] = '0'
 
   if is_macos():
     environ_cp['TF_NEED_JEMALLOC'] = '0'
     environ_cp['TF_NEED_TENSORRT'] = '0'
 
+  # The numpy package on ppc64le uses OpenBLAS which has multi-threading
+  # issues that lead to incorrect answers.  Set OMP_NUM_THREADS=1 at
+  # runtime to allow the Tensorflow testcases which compare numpy
+  # results to Tensorflow results to succeed.
+  if is_ppc64le():
+    write_action_env_to_bazelrc("OMP_NUM_THREADS", 1)
+
   set_build_var(environ_cp, 'TF_NEED_JEMALLOC', 'jemalloc as malloc',
                 'with_jemalloc', True)
   set_build_var(environ_cp, 'TF_NEED_GCP', 'Google Cloud Platform',
@@ -1538,7 +1559,8 @@ def main():
   set_grpc_build_flags()
   set_cc_opt_flags(environ_cp)
   set_build_strip_flag()
-  set_windows_build_flags()
+  if is_windows():
+    set_windows_build_flags(environ_cp)
 
   if get_var(
       environ_cp, 'TF_SET_ANDROID_WORKSPACE', 'android workspace',
@@ -1550,11 +1572,15 @@ def main():
     create_android_ndk_rule(environ_cp)
     create_android_sdk_rule(environ_cp)
 
-  print('Preconfigured Bazel build configs. You can use any of the below by '
-        'adding "--config=<>" to your build command. See tools/bazel.rc for '
-        'more details.')
-  config_info_line('mkl', 'Build with MKL support.')
-  config_info_line('monolithic', 'Config for mostly static monolithic build.')
+  # On Windows, we don't have MKL support and the build is always monolithic.
+  # So no need to print the following message.
+  # TODO(pcloudy): remove the following if check when they make sense on Windows
+  if not is_windows():
+    print('Preconfigured Bazel build configs. You can use any of the below by '
+          'adding "--config=<>" to your build command. See tools/bazel.rc for '
+          'more details.')
+    config_info_line('mkl', 'Build with MKL support.')
+    config_info_line('monolithic', 'Config for mostly static monolithic build.')
 
 if __name__ == '__main__':
   main()
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index 51eea94847..60db234c9c 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -20,10 +20,18 @@ load(
     "tf_additional_binary_deps",
 )
 load(
-    "//tensorflow/tools/api/generator:api_gen.bzl",
+    "//tensorflow/python/tools/api/generator:api_gen.bzl",
     "gen_api_init_files",  # @unused
 )
 
+# Config setting used when building for products
+# which requires restricted licenses to be avoided.
+config_setting(
+    name = "no_lgpl_deps",
+    values = {"define": "__TENSORFLOW_NO_LGPL_DEPS__=1"},
+    visibility = ["//visibility:public"],
+)
+
 # Config setting for determining if we are building for Android.
 config_setting(
     name = "android",
@@ -373,6 +381,14 @@ config_setting(
     },
 )
 
+# Setting to use when loading kernels dynamically
+config_setting(
+    name = "dynamic_loaded_kernels",
+    define_values = {
+        "dynamic_loaded_kernels": "true",
+    },
+)
+
 config_setting(
     name = "using_cuda_nvcc",
     define_values = {
@@ -400,14 +416,6 @@ config_setting(
     visibility = ["//visibility:public"],
 )
 
-# TODO(laigd): consider removing this option and make TensorRT enabled
-# automatically when CUDA is enabled.
-config_setting(
-    name = "with_tensorrt_support",
-    values = {"define": "with_tensorrt_support=true"},
-    visibility = ["//visibility:public"],
-)
-
 package_group(
     name = "internal",
     packages = [
diff --git a/tensorflow/c/c_api.cc b/tensorflow/c/c_api.cc
index 5c218d3f25..19ccb6e71d 100644
--- a/tensorflow/c/c_api.cc
+++ b/tensorflow/c/c_api.cc
@@ -33,6 +33,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/eval_const_tensor.h"
 #include "tensorflow/core/common_runtime/shape_refiner.h"
 #include "tensorflow/core/framework/allocation_description.pb.h"
+#include "tensorflow/core/framework/kernel_def.pb.h"
 #include "tensorflow/core/framework/log_memory.h"
 #include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/framework/op_kernel.h"
@@ -51,6 +52,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/mem.h"
 #include "tensorflow/core/platform/mutex.h"
@@ -327,6 +329,7 @@ TF_Buffer* TF_NewBufferFromString(const void* proto, size_t proto_len) {
 }
 
 void TF_DeleteBuffer(TF_Buffer* buffer) {
+  if (buffer == nullptr) return;
   if (buffer->data_deallocator != nullptr) {
     (*buffer->data_deallocator)(const_cast<void*>(buffer->data),
                                 buffer->length);
@@ -356,6 +359,7 @@ void TF_CloseDeprecatedSession(TF_DeprecatedSession* s, TF_Status* status) {
 
 void TF_DeleteDeprecatedSession(TF_DeprecatedSession* s, TF_Status* status) {
   status->status = Status::OK();
+  if (s == nullptr) return;
   delete s->session;
   delete s;
 }
@@ -906,6 +910,7 @@ TF_Library* TF_LoadLibrary(const char* library_filename, TF_Status* status) {
 TF_Buffer TF_GetOpList(TF_Library* lib_handle) { return lib_handle->op_list; }
 
 void TF_DeleteLibraryHandle(TF_Library* lib_handle) {
+  if (lib_handle == nullptr) return;
   tensorflow::port::Free(const_cast<void*>(lib_handle->op_list.data));
   delete lib_handle;
 }
@@ -963,6 +968,7 @@ TF_DEVICELIST_METHOD(const char*, TF_DeviceListName, name().c_str(), nullptr);
 TF_DEVICELIST_METHOD(const char*, TF_DeviceListType, device_type().c_str(),
                      nullptr);
 TF_DEVICELIST_METHOD(int64_t, TF_DeviceListMemoryBytes, memory_limit(), -1);
+TF_DEVICELIST_METHOD(uint64_t, TF_DeviceListIncarnation, incarnation(), 0);
 
 #undef TF_DEVICELIST_METHOD
 
@@ -1852,6 +1858,7 @@ TF_Graph::TF_Graph()
 TF_Graph* TF_NewGraph() { return new TF_Graph; }
 
 void TF_DeleteGraph(TF_Graph* g) {
+  if (g == nullptr) return;
   g->mu.lock();
   g->delete_requested = true;
   const bool del = g->sessions.empty();
@@ -2383,6 +2390,12 @@ void TF_AbortWhile(const TF_WhileParams* params) { FreeWhileResources(params); }
 
 void TF_AddGradients(TF_Graph* g, TF_Output* y, int ny, TF_Output* x, int nx,
                      TF_Output* dx, TF_Status* status, TF_Output* dy) {
+  TF_AddGradientsWithPrefix(g, nullptr, y, ny, x, nx, dx, status, dy);
+}
+
+void TF_AddGradientsWithPrefix(TF_Graph* g, const char* prefix, TF_Output* y,
+                               int ny, TF_Output* x, int nx, TF_Output* dx,
+                               TF_Status* status, TF_Output* dy) {
 #ifdef __ANDROID__
   status->status = tensorflow::errors::Unimplemented(
       "Adding gradients is not supported in Android. File a bug at "
@@ -2399,9 +2412,29 @@ void TF_AddGradients(TF_Graph* g, TF_Output* y, int ny, TF_Output* x, int nx,
 
     const int first_new_node_id = g->graph.num_node_ids();
 
+    string prefix_cmp;
+    const char* child_scope_name;
+    if (prefix == nullptr) {
+      child_scope_name = "gradients";
+    } else {
+      prefix_cmp = string(prefix) + "/";
+      // The operation should fail if the provided name prefix has already been
+      // used in this graph
+      for (const auto& pair : g->name_map) {
+        const string& name = pair.first;
+        if (name.compare(prefix) == 0 ||
+            tensorflow::str_util::StartsWith(name, prefix_cmp)) {
+          status->status = InvalidArgument(
+              "prefix [", prefix,
+              "] conflicts with existing node in the graph named [", name, "]");
+          return;
+        }
+      }
+      child_scope_name = prefix;
+    }
     tensorflow::Scope scope =
         NewInternalScope(&g->graph, &status->status, &g->refiner)
-            .NewSubScope("gradients");
+            .NewSubScope(child_scope_name);
 
     if (dx != nullptr) {
       std::vector<tensorflow::Output> dx_arg = OutputsFromTFOutputs(dx, ny);
@@ -2416,6 +2449,18 @@ void TF_AddGradients(TF_Graph* g, TF_Output* y, int ny, TF_Output* x, int nx,
     for (int i = first_new_node_id; i < g->graph.num_node_ids(); ++i) {
       Node* n = g->graph.FindNodeId(i);
       if (n == nullptr) continue;
+
+      // Adding the gradients to the graph can alter the prefix to prevent
+      // name collisions only if this prefix has not been provided explicitly
+      // by the user. If it was provided, assert that it remained intact.
+      if (prefix != nullptr &&
+          !tensorflow::str_util::StartsWith(n->name(), prefix_cmp)) {
+        status->status = tensorflow::errors::Internal(
+            "BUG: The gradients prefix have been unexpectedly altered when "
+            "adding the nodes to the graph. This is a bug. Please file an "
+            "issue at https://github.com/tensorflow/tensorflow/issues.");
+        return;
+      }
       // We have a convoluted scheme here: Using the C++ graph construction API
       // to add potentially many nodes to the graph without running the checks
       // (such as uniqueness of the names of nodes) we run with other functions
@@ -2527,6 +2572,7 @@ void TF_CloseSession(TF_Session* s, TF_Status* status) {
 
 void TF_DeleteSession(TF_Session* s, TF_Status* status) {
   status->status = Status::OK();
+  if (s == nullptr) return;
   TF_Graph* const graph = s->graph;
   if (graph != nullptr) {
     graph->mu.lock();
@@ -2725,7 +2771,34 @@ TF_Buffer* TF_ApiDefMapGet(TF_ApiDefMap* api_def_map, const char* name,
 
   TF_Buffer* ret = TF_NewBuffer();
   status->status = MessageToBuffer(*api_def, ret);
+  if (!status->status.ok()) {
+    TF_DeleteBuffer(ret);
+    return nullptr;
+  }
   return ret;
 #endif  // __ANDROID__
 }
+
+TF_Buffer* TF_GetAllRegisteredKernels(TF_Status* status) {
+  tensorflow::KernelList kernel_list = tensorflow::GetAllRegisteredKernels();
+  TF_Buffer* ret = TF_NewBuffer();
+  status->status = MessageToBuffer(kernel_list, ret);
+  if (!status->status.ok()) {
+    TF_DeleteBuffer(ret);
+    return nullptr;
+  }
+  return ret;
+}
+
+TF_Buffer* TF_GetRegisteredKernelsForOp(const char* name, TF_Status* status) {
+  tensorflow::KernelList kernel_list =
+      tensorflow::GetRegisteredKernelsForOp(name);
+  TF_Buffer* ret = TF_NewBuffer();
+  status->status = MessageToBuffer(kernel_list, ret);
+  if (!status->status.ok()) {
+    TF_DeleteBuffer(ret);
+    return nullptr;
+  }
+  return ret;
+}
 }  // end extern "C"
diff --git a/tensorflow/c/c_api.h b/tensorflow/c/c_api.h
index 1eb75ef11f..850f6ecd63 100644
--- a/tensorflow/c/c_api.h
+++ b/tensorflow/c/c_api.h
@@ -44,6 +44,7 @@ limitations under the License.
 // * size_t is used to represent byte sizes of objects that are
 //   materialized in the address space of the calling process.
 // * int is used as an index into arrays.
+// * Deletion functions are safe to call on nullptr.
 //
 // Questions left to address:
 // * Might at some point need a way for callers to provide their own Env.
@@ -1130,6 +1131,7 @@ TF_CAPI_EXPORT extern void TF_AbortWhile(const TF_WhileParams* params);
 
 // Adds operations to compute the partial derivatives of sum of `y`s w.r.t `x`s,
 // i.e., d(y_1 + y_2 + ...)/dx_1, d(y_1 + y_2 + ...)/dx_2...
+//
 // `dx` are used as initial gradients (which represent the symbolic partial
 // derivatives of some loss function `L` w.r.t. `y`).
 // `dx` must be nullptr or have size `ny`.
@@ -1138,6 +1140,12 @@ TF_CAPI_EXPORT extern void TF_AbortWhile(const TF_WhileParams* params);
 // The partial derivatives are returned in `dy`. `dy` should be allocated to
 // size `nx`.
 //
+// Gradient nodes are automatically named under the "gradients/" prefix. To
+// guarantee name uniqueness, subsequent calls to the same graph will
+// append an incremental tag to the prefix: "gradients_1/", "gradients_2/", ...
+// See TF_AddGradientsWithPrefix, which provides a means to specify a custom
+// name prefix for operations added to a graph to compute the gradients.
+//
 // WARNING: This function does not yet support all the gradients that python
 // supports. See
 // https://www.tensorflow.org/code/tensorflow/cc/gradients/README.md
@@ -1146,6 +1154,33 @@ TF_CAPI_EXPORT void TF_AddGradients(TF_Graph* g, TF_Output* y, int ny,
                                     TF_Output* x, int nx, TF_Output* dx,
                                     TF_Status* status, TF_Output* dy);
 
+// Adds operations to compute the partial derivatives of sum of `y`s w.r.t `x`s,
+// i.e., d(y_1 + y_2 + ...)/dx_1, d(y_1 + y_2 + ...)/dx_2...
+// This is a variant of TF_AddGradients that allows to caller to pass a custom
+// name prefix to the operations added to a graph to compute the gradients.
+//
+// `dx` are used as initial gradients (which represent the symbolic partial
+// derivatives of some loss function `L` w.r.t. `y`).
+// `dx` must be nullptr or have size `ny`.
+// If `dx` is nullptr, the implementation will use dx of `OnesLike` for all
+// shapes in `y`.
+// The partial derivatives are returned in `dy`. `dy` should be allocated to
+// size `nx`.
+// `prefix` names the scope into which all gradients operations are being added.
+// `prefix` must be unique within the provided graph otherwise this operation
+// will fail. If `prefix` is nullptr, the default prefixing behaviour takes
+// place, see TF_AddGradients for more details.
+//
+// WARNING: This function does not yet support all the gradients that python
+// supports. See
+// https://www.tensorflow.org/code/tensorflow/cc/gradients/README.md
+// for instructions on how to add C++ more gradients.
+TF_CAPI_EXPORT void TF_AddGradientsWithPrefix(TF_Graph* g, const char* prefix,
+                                              TF_Output* y, int ny,
+                                              TF_Output* x, int nx,
+                                              TF_Output* dx, TF_Status* status,
+                                              TF_Output* dy);
+
 // Create a TF_Function from a TF_Graph
 //
 // Params:
@@ -1235,6 +1270,11 @@ TF_CAPI_EXPORT extern TF_Function* TF_GraphToFunction(
     int noutputs, const TF_Output* outputs, const char* const* output_names,
     const TF_FunctionOptions* opts, const char* description, TF_Status* status);
 
+// Returns the name of the graph function.
+// The return value points to memory that is only usable until the next
+// mutation to *func.
+TF_CAPI_EXPORT extern const char* TF_FunctionName(TF_Function* func);
+
 // Write out a serialized representation of `func` (as a FunctionDef protocol
 // message) to `output_func_def` (allocated by TF_NewBuffer()).
 // `output_func_def`'s underlying buffer will be freed when TF_DeleteBuffer()
@@ -1521,6 +1561,13 @@ TF_CAPI_EXPORT extern const char* TF_DeviceListType(const TF_DeviceList* list,
 TF_CAPI_EXPORT extern int64_t TF_DeviceListMemoryBytes(
     const TF_DeviceList* list, int index, TF_Status* status);
 
+// Retrieve the incarnation number of a given device.
+//
+// If index is out of bounds, an error code will be set in the status object,
+// and 0 will be returned.
+TF_CAPI_EXPORT extern uint64_t TF_DeviceListIncarnation(
+    const TF_DeviceList* list, int index, TF_Status* status);
+
 // --------------------------------------------------------------------------
 // Load plugins containing custom ops and kernels
 
@@ -1603,6 +1650,18 @@ TF_CAPI_EXPORT extern TF_Buffer* TF_ApiDefMapGet(TF_ApiDefMap* api_def_map,
                                                  size_t name_len,
                                                  TF_Status* status);
 
+// --------------------------------------------------------------------------
+// Kernel definition information.
+
+// Returns a serialized KernelList protocol buffer containing KernelDefs for all
+// registered kernels.
+TF_CAPI_EXPORT extern TF_Buffer* TF_GetAllRegisteredKernels(TF_Status* status);
+
+// Returns a serialized KernelList protocol buffer containing KernelDefs for all
+// kernels registered for the operation named `name`.
+TF_CAPI_EXPORT extern TF_Buffer* TF_GetRegisteredKernelsForOp(
+    const char* name, TF_Status* status);
+
 #ifdef __cplusplus
 } /* end extern "C" */
 #endif
diff --git a/tensorflow/c/c_api_experimental.cc b/tensorflow/c/c_api_experimental.cc
index 95b04f9058..69b3ffe2a1 100644
--- a/tensorflow/c/c_api_experimental.cc
+++ b/tensorflow/c/c_api_experimental.cc
@@ -57,6 +57,45 @@ void TF_EnableXLACompilation(TF_SessionOptions* options, unsigned char enable) {
   }
 }
 
+TF_Buffer* TF_CreateConfig(unsigned char enable_xla_compilation,
+                           unsigned char gpu_memory_allow_growth) {
+  tensorflow::ConfigProto config;
+  auto* optimizer_options =
+      config.mutable_graph_options()->mutable_optimizer_options();
+  if (enable_xla_compilation) {
+    optimizer_options->set_global_jit_level(tensorflow::OptimizerOptions::ON_1);
+
+    // These XLA flags are needed to trigger XLA properly from C (more generally
+    // non-Python) clients. If this API is called again with `enable` set to
+    // false, it is safe to keep these flag values as is.
+    tensorflow::legacy_flags::MarkForCompilationPassFlags* flags =
+        tensorflow::legacy_flags::GetMarkForCompilationPassFlags();
+    flags->tf_xla_cpu_global_jit = true;
+    flags->tf_xla_min_cluster_size = 1;
+  } else {
+    optimizer_options->set_global_jit_level(tensorflow::OptimizerOptions::OFF);
+  }
+
+  auto* gpu_options = config.mutable_gpu_options();
+  gpu_options->set_allow_growth(gpu_memory_allow_growth);
+
+  TF_Buffer* ret = TF_NewBuffer();
+  TF_CHECK_OK(MessageToBuffer(config, ret));
+  return ret;
+}
+
+TF_Buffer* TF_CreateRunOptions(unsigned char enable_full_trace) {
+  tensorflow::RunOptions options;
+  if (enable_full_trace) {
+    options.set_trace_level(tensorflow::RunOptions::FULL_TRACE);
+  } else {
+    options.set_trace_level(tensorflow::RunOptions::NO_TRACE);
+  }
+  TF_Buffer* ret = TF_NewBuffer();
+  TF_CHECK_OK(MessageToBuffer(options, ret));
+  return ret;
+}
+
 const char* TF_GraphDebugString(TF_Graph* graph, size_t* len) {
   tensorflow::mutex_lock c(graph->mu);
   const auto& debug_str = graph->graph.ToGraphDefDebug().DebugString();
diff --git a/tensorflow/c/c_api_experimental.h b/tensorflow/c/c_api_experimental.h
index 20bdace40f..6617c5a572 100644
--- a/tensorflow/c/c_api_experimental.h
+++ b/tensorflow/c/c_api_experimental.h
@@ -55,11 +55,27 @@ extern "C" {
 // set XLA flag values to prepare for XLA compilation. Otherwise set
 // global_jit_level to OFF.
 //
-// This API is syntax sugar over TF_SetConfig(), and is used by clients that
-// cannot read/write the tensorflow.ConfigProto proto.
+// This and the next API are syntax sugar over TF_SetConfig(), and is used by
+// clients that cannot read/write the tensorflow.ConfigProto proto.
+// TODO: Migrate to TF_CreateConfig() below.
 TF_CAPI_EXPORT extern void TF_EnableXLACompilation(TF_SessionOptions* options,
                                                    unsigned char enable);
 
+// Create a serialized tensorflow.ConfigProto proto, where:
+//
+// a) ConfigProto.optimizer_options.global_jit_level is set to to ON_1 if
+// `enable_xla_compilation` is non-zero, and OFF otherwise.
+// b) ConfigProto.gpu_options.allow_growth is set to `gpu_memory_allow_growth`.
+TF_CAPI_EXPORT extern TF_Buffer* TF_CreateConfig(
+    unsigned char enable_xla_compilation,
+    unsigned char gpu_memory_allow_growth);
+
+// Create a serialized tensorflow.RunOptions proto, where RunOptions.trace_level
+// is set to FULL_TRACE if `enable_full_trace` is non-zero, and NO_TRACE
+// otherwise.
+TF_CAPI_EXPORT extern TF_Buffer* TF_CreateRunOptions(
+    unsigned char enable_full_trace);
+
 // Returns the graph content in a human-readable format, with length set in
 // `len`. The format is subject to change in the future.
 // The returned string is heap-allocated, and caller should call free() on it.
diff --git a/tensorflow/c/c_api_function.cc b/tensorflow/c/c_api_function.cc
index 384e6c8cb9..a2c5a42c11 100644
--- a/tensorflow/c/c_api_function.cc
+++ b/tensorflow/c/c_api_function.cc
@@ -536,6 +536,10 @@ TF_Function* TF_GraphToFunction(const TF_Graph* fn_body, const char* fn_name,
   return tf_function;
 }
 
+const char* TF_FunctionName(TF_Function* func) {
+  return func->fdef.signature().name().c_str();
+}
+
 void TF_GraphCopyFunction(TF_Graph* g, const TF_Function* func,
                           const TF_Function* grad, TF_Status* status) {
   if (func == nullptr) {
diff --git a/tensorflow/c/c_api_function_test.cc b/tensorflow/c/c_api_function_test.cc
index 610274696f..bb9433ce25 100644
--- a/tensorflow/c/c_api_function_test.cc
+++ b/tensorflow/c/c_api_function_test.cc
@@ -193,6 +193,7 @@ class CApiFunctionTest : public ::testing::Test {
 
     ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
     ASSERT_NE(func_, nullptr);
+    ASSERT_EQ(std::string(func_name_), std::string(TF_FunctionName(func_)));
     TF_GraphCopyFunction(host_graph_, func_, nullptr, s_);
     ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
   }
@@ -1516,7 +1517,8 @@ void DefineStatefulFunction(const char* name, TF_Function** func) {
 
   TF_Output inputs[] = {};
   TF_Output outputs[] = {{random, 0}};
-  *func = TF_GraphToFunction(func_graph.get(), name, /*append_hash=*/false, -1,
+  *func = TF_GraphToFunction(func_graph.get(), name,
+                             /*append_hash_to_fn_name=*/false, -1,
                              /*opers=*/nullptr, 0, inputs, 1, outputs,
                              /*output_names=*/nullptr,
                              /*opts=*/nullptr, "", s.get());
diff --git a/tensorflow/c/c_api_test.cc b/tensorflow/c/c_api_test.cc
index bc04b53fbb..aa2a537f03 100644
--- a/tensorflow/c/c_api_test.cc
+++ b/tensorflow/c/c_api_test.cc
@@ -29,9 +29,11 @@ limitations under the License.
 #include "tensorflow/core/framework/api_def.pb.h"
 #include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/graph.pb_text.h"
+#include "tensorflow/core/framework/kernel_def.pb.h"
 #include "tensorflow/core/framework/node_def.pb_text.h"
 #include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/partial_tensor_shape.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
@@ -1424,6 +1426,29 @@ TEST(CAPI, SavedModelNullArgsAreValid) {
   TF_DeleteStatus(s);
 }
 
+TEST(CAPI, DeletingNullPointerIsSafe) {
+  TF_Status* status = TF_NewStatus();
+
+  TF_DeleteStatus(nullptr);
+  TF_DeleteBuffer(nullptr);
+  TF_DeleteTensor(nullptr);
+  TF_DeleteSessionOptions(nullptr);
+  TF_DeleteGraph(nullptr);
+  TF_DeleteImportGraphDefOptions(nullptr);
+  TF_DeleteImportGraphDefResults(nullptr);
+  TF_DeleteFunction(nullptr);
+  TF_DeleteSession(nullptr, status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeletePRunHandle(nullptr);
+  TF_DeleteDeprecatedSession(nullptr, status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  TF_DeleteDeviceList(nullptr);
+  TF_DeleteLibraryHandle(nullptr);
+  TF_DeleteApiDefMap(nullptr);
+
+  TF_DeleteStatus(status);
+}
+
 REGISTER_OP("TestOpWithNoGradient")
     .Input("x: T")
     .Output("y: T")
@@ -1458,8 +1483,8 @@ class CApiGradientsTest : public ::testing::Test {
     BuildSuccessGraph(inputs, outputs);
     BuildExpectedGraph(grad_inputs_provided, expected_grad_outputs);
 
-    AddGradients(grad_inputs_provided, inputs, 2, outputs, 1, grad_outputs);
-
+    AddGradients(grad_inputs_provided, nullptr, inputs, 2, outputs, 1,
+                 grad_outputs);
     EXPECT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
 
     // Compare that the graphs match.
@@ -1480,7 +1505,8 @@ class CApiGradientsTest : public ::testing::Test {
 
     BuildErrorGraph(inputs, outputs);
 
-    AddGradients(grad_inputs_provided, inputs, 1, outputs, 1, grad_outputs);
+    AddGradients(grad_inputs_provided, nullptr, inputs, 1, outputs, 1,
+                 grad_outputs);
 
     string expected_msg =
         "No gradient defined for op: TestOpWithNoGradient. Please see "
@@ -1524,19 +1550,20 @@ class CApiGradientsTest : public ::testing::Test {
     EXPECT_EQ(*a_data, *b_data);
   }
 
-  void AddGradients(bool grad_inputs_provided, TF_Output* inputs, int ninputs,
-                    TF_Output* outputs, int noutputs, TF_Output* grad_outputs) {
+  void AddGradients(bool grad_inputs_provided, const char* prefix,
+                    TF_Output* inputs, int ninputs, TF_Output* outputs,
+                    int noutputs, TF_Output* grad_outputs) {
     if (grad_inputs_provided) {
       TF_Output grad_inputs[1];
       const float grad_inputs_val[] = {1.0, 1.0, 1.0, 1.0};
       TF_Operation* grad_inputs_op =
           FloatConst2x2(graph_, s_, grad_inputs_val, "GradInputs");
       grad_inputs[0] = TF_Output{grad_inputs_op, 0};
-      TF_AddGradients(graph_, outputs, noutputs, inputs, ninputs, grad_inputs,
-                      s_, grad_outputs);
+      TF_AddGradientsWithPrefix(graph_, prefix, outputs, noutputs, inputs,
+                                ninputs, grad_inputs, s_, grad_outputs);
     } else {
-      TF_AddGradients(graph_, outputs, noutputs, inputs, ninputs, nullptr, s_,
-                      grad_outputs);
+      TF_AddGradientsWithPrefix(graph_, prefix, outputs, noutputs, inputs,
+                                ninputs, nullptr, s_, grad_outputs);
     }
   }
 
@@ -1681,6 +1708,20 @@ class CApiGradientsTest : public ::testing::Test {
     return op;
   }
 
+  void BuildGraphAndAddGradientsWithPrefixes(const char* prefix1,
+                                             const char* prefix2 = nullptr) {
+    TF_Output inputs[2];
+    TF_Output outputs[1];
+    TF_Output grad_outputs[2];
+
+    BuildSuccessGraph(inputs, outputs);
+
+    AddGradients(false, prefix1, inputs, 2, outputs, 1, grad_outputs);
+    if (prefix2 != nullptr) {
+      AddGradients(false, prefix2, inputs, 2, outputs, 1, grad_outputs);
+    }
+  }
+
   TF_Status* s_;
   TF_Graph* graph_;
   TF_Graph* expected_graph_;
@@ -1700,6 +1741,56 @@ TEST_F(CApiGradientsTest, OpWithNoGradientRegistered_NoGradInputs) {
   TestGradientsError(false);
 }
 
+TEST_F(CApiGradientsTest, GradientsPrefix_PrefixIsOk) {
+  BuildGraphAndAddGradientsWithPrefixes("gradients");
+  ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_TwoGradientsWithDistinctPrefixes) {
+  BuildGraphAndAddGradientsWithPrefixes("gradients", "gradients_1");
+  ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_TwoGradientsInSameScope) {
+  BuildGraphAndAddGradientsWithPrefixes("scope/gradients", "scope/gradients_1");
+  ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_TwoGradientsInDifferentScopes) {
+  BuildGraphAndAddGradientsWithPrefixes("scope/gradients", "scope_1/gradients");
+  ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_2ndGradientsAsSubScopeOf1st) {
+  BuildGraphAndAddGradientsWithPrefixes("gradients", "gradients/sub");
+  ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_PrefixMatchesExistingNodeName) {
+  BuildGraphAndAddGradientsWithPrefixes("Const_0");
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_TwoGradientsWithIdenticalPrefixes) {
+  BuildGraphAndAddGradientsWithPrefixes("gradients", "gradients");
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_2ndGradientsMatchingNodeOf1st) {
+  BuildGraphAndAddGradientsWithPrefixes("gradients", "gradients/MatMul");
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_1stGradientsMatchingNodeOf2nd) {
+  BuildGraphAndAddGradientsWithPrefixes("gradients/MatMul", "gradients");
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(s_)) << TF_Message(s_);
+}
+
+TEST_F(CApiGradientsTest, GradientsPrefix_2ndGradientsAsParentScopeOf1st) {
+  BuildGraphAndAddGradientsWithPrefixes("gradients/sub", "gradients");
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(s_)) << TF_Message(s_);
+}
+
 void ScalarFloatFromTensor(const TF_Tensor* t, float* f) {
   ASSERT_TRUE(t != nullptr);
   ASSERT_EQ(TF_FLOAT, TF_TensorType(t));
@@ -2312,6 +2403,57 @@ TEST(TestApiDef, TestCreateApiDefWithOverwrites) {
   TF_DeleteLibraryHandle(lib);
 }
 
+class DummyKernel : public tensorflow::OpKernel {
+ public:
+  explicit DummyKernel(tensorflow::OpKernelConstruction* context)
+      : OpKernel(context) {}
+  void Compute(tensorflow::OpKernelContext* context) override {}
+};
+
+// Test we can query kernels
+REGISTER_OP("TestOpWithSingleKernel")
+    .Input("a: float")
+    .Input("b: float")
+    .Output("o: float");
+REGISTER_KERNEL_BUILDER(
+    Name("TestOpWithSingleKernel").Device(tensorflow::DEVICE_CPU), DummyKernel);
+
+TEST(TestKernel, TestGetAllRegisteredKernels) {
+  TF_Status* status = TF_NewStatus();
+  TF_Buffer* kernel_list_buf = TF_GetAllRegisteredKernels(status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  KernelList kernel_list;
+  kernel_list.ParseFromArray(kernel_list_buf->data, kernel_list_buf->length);
+  ASSERT_GT(kernel_list.kernel_size(), 0);
+  TF_DeleteBuffer(kernel_list_buf);
+  TF_DeleteStatus(status);
+}
+
+TEST(TestKernel, TestGetRegisteredKernelsForOp) {
+  TF_Status* status = TF_NewStatus();
+  TF_Buffer* kernel_list_buf =
+      TF_GetRegisteredKernelsForOp("TestOpWithSingleKernel", status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  KernelList kernel_list;
+  kernel_list.ParseFromArray(kernel_list_buf->data, kernel_list_buf->length);
+  ASSERT_EQ(kernel_list.kernel_size(), 1);
+  EXPECT_EQ(kernel_list.kernel(0).op(), "TestOpWithSingleKernel");
+  EXPECT_EQ(kernel_list.kernel(0).device_type(), "CPU");
+  TF_DeleteBuffer(kernel_list_buf);
+  TF_DeleteStatus(status);
+}
+
+TEST(TestKernel, TestGetRegisteredKernelsForOpNoKernels) {
+  TF_Status* status = TF_NewStatus();
+  TF_Buffer* kernel_list_buf = TF_GetRegisteredKernelsForOp("Unknown", status);
+  EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
+  KernelList kernel_list;
+  kernel_list.ParseFromArray(kernel_list_buf->data, kernel_list_buf->length);
+  ASSERT_EQ(kernel_list.kernel_size(), 0);
+  TF_DeleteBuffer(kernel_list_buf);
+  TF_DeleteStatus(status);
+}
+
 #undef EXPECT_TF_META
 
 }  // namespace
diff --git a/tensorflow/c/eager/c_api.cc b/tensorflow/c/eager/c_api.cc
index 82ca2be2cf..555dab3e89 100644
--- a/tensorflow/c/eager/c_api.cc
+++ b/tensorflow/c/eager/c_api.cc
@@ -288,7 +288,7 @@ TFE_Context* TFE_NewContext(const TFE_ContextOptions* opts, TF_Status* status) {
                          opts->async, std::move(device_mgr), r);
 }
 
-void TFE_DeleteContext(TFE_Context* ctx, TF_Status* status) { delete ctx; }
+void TFE_DeleteContext(TFE_Context* ctx) { delete ctx; }
 
 TF_DeviceList* TFE_ContextListDevices(TFE_Context* ctx, TF_Status* status) {
   TF_DeviceList* list = new TF_DeviceList;
@@ -336,7 +336,7 @@ TFE_TensorHandle* TFE_NewTensorHandle(TF_Tensor* t, TF_Status* status) {
 }
 
 void TFE_DeleteTensorHandle(TFE_TensorHandle* h) {
-  DCHECK(h);
+  if (h == nullptr) return;
   if (h->handle) {
     h->handle->Unref();
   }
@@ -348,6 +348,11 @@ TF_DataType TFE_TensorHandleDataType(TFE_TensorHandle* h) {
 }
 
 int TFE_TensorHandleNumDims(TFE_TensorHandle* h, TF_Status* status) {
+  if (h == nullptr || h->handle == nullptr) {
+    status->status = tensorflow::errors::InvalidArgument(
+        "The passed in handle is a nullptr");
+    return -1;
+  }
   int result;
   status->status = h->handle->NumDims(&result);
   return result;
@@ -355,12 +360,22 @@ int TFE_TensorHandleNumDims(TFE_TensorHandle* h, TF_Status* status) {
 
 int64_t TFE_TensorHandleDim(TFE_TensorHandle* h, int dim_index,
                             TF_Status* status) {
+  if (h == nullptr || h->handle == nullptr) {
+    status->status = tensorflow::errors::InvalidArgument(
+        "The passed in handle is a nullptr");
+    return -1;
+  }
   tensorflow::int64 result;
   status->status = h->handle->Dim(dim_index, &result);
   return result;
 }
 
 const char* TFE_TensorHandleDeviceName(TFE_TensorHandle* h, TF_Status* status) {
+  if (h == nullptr || h->handle == nullptr) {
+    status->status = tensorflow::errors::InvalidArgument(
+        "The passed in handle is a nullptr");
+    return nullptr;
+  }
   tensorflow::Device* d = nullptr;
   status->status = h->handle->OpDevice(&d);
   return (d == nullptr) ? "/job:localhost/replica:0/task:0/device:CPU:0"
@@ -368,6 +383,11 @@ const char* TFE_TensorHandleDeviceName(TFE_TensorHandle* h, TF_Status* status) {
 }
 
 TF_Tensor* TFE_TensorHandleResolve(TFE_TensorHandle* h, TF_Status* status) {
+  if (h == nullptr || h->handle == nullptr) {
+    status->status = tensorflow::errors::InvalidArgument(
+        "The passed in handle is a nullptr");
+    return nullptr;
+  }
   // TODO(agarwal): move this implementation inside TFE_TensorHandle.
   tensorflow::Device* d = nullptr;
   tensorflow::Device* op_device = nullptr;
@@ -664,17 +684,17 @@ TFE_TensorHandle* TFE_NewTensorHandle(const tensorflow::Tensor& t) {
 
 const tensorflow::Tensor* TFE_TensorHandleUnderlyingTensorInHostMemory(
     TFE_TensorHandle* h, TF_Status* status) {
-  tensorflow::Device* d = nullptr;
-  tensorflow::Device* op_device = nullptr;
-  const tensorflow::Tensor* t = nullptr;
-  status->status = h->handle->TensorAndDevice(&t, &d, &op_device);
-  if (!status->status.ok()) return nullptr;
-  if (d != nullptr) {
+  if (!h->handle->OnHostCPU()) {
     status->status = tensorflow::errors::FailedPrecondition(
         "TFE_TensorHandle is placed in device (not host) memory. Cannot return "
         "a tensorflow::Tensor");
     return nullptr;
   }
+  tensorflow::Device* d = nullptr;
+  tensorflow::Device* op_device = nullptr;
+  const tensorflow::Tensor* t = nullptr;
+  status->status = h->handle->TensorAndDevice(&t, &d, &op_device);
+  if (!status->status.ok()) return nullptr;
   return t;
 }
 
diff --git a/tensorflow/c/eager/c_api.h b/tensorflow/c/eager/c_api.h
index fdbd5374b2..ea019a5711 100644
--- a/tensorflow/c/eager/c_api.h
+++ b/tensorflow/c/eager/c_api.h
@@ -102,8 +102,7 @@ typedef struct TFE_Context TFE_Context;
 
 TF_CAPI_EXPORT extern TFE_Context* TFE_NewContext(
     const TFE_ContextOptions* opts, TF_Status* status);
-TF_CAPI_EXPORT extern void TFE_DeleteContext(TFE_Context* ctx,
-                                             TF_Status* status);
+TF_CAPI_EXPORT extern void TFE_DeleteContext(TFE_Context* ctx);
 TF_CAPI_EXPORT extern TF_DeviceList* TFE_ContextListDevices(TFE_Context* ctx,
                                                             TF_Status* status);
 
diff --git a/tensorflow/c/eager/c_api_test.cc b/tensorflow/c/eager/c_api_test.cc
index 3504a8b5e7..6f2fbee884 100644
--- a/tensorflow/c/eager/c_api_test.cc
+++ b/tensorflow/c/eager/c_api_test.cc
@@ -49,7 +49,7 @@ void BM_InitOp(int iters) {
   }
   tensorflow::testing::StopTiming();
   TFE_DeleteTensorHandle(m);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -80,7 +80,7 @@ void BM_Execute(int iters, int async) {
   tensorflow::testing::StopTiming();
   TFE_DeleteOp(matmul);
   TFE_DeleteTensorHandle(m);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -95,7 +95,7 @@ TEST(CAPI, Context) {
   TF_DeviceList* devices = TFE_ContextListDevices(ctx, status);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
 
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
 
   const int num_devices = TF_DeviceListCount(devices);
@@ -195,7 +195,7 @@ void TestRemoteExecute(bool async) {
   TFE_DeleteOp(matmul);
 
   TFE_ContextAsyncWait(ctx, status);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
 
   TF_DeleteStatus(status);
@@ -281,7 +281,7 @@ void TestRemoteExecuteSilentCopies(bool async) {
   TFE_DeleteOp(matmul);
 
   TFE_ContextAsyncWait(ctx, status);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
 
   TF_DeleteStatus(status);
@@ -380,8 +380,7 @@ void TensorHandleCopyBetweenDevices(bool async) {
   TF_DeleteDeviceList(devices);
   TF_DeleteTensor(t);
   TFE_DeleteTensorHandle(hcpu);
-  TFE_DeleteContext(ctx, status.get());
-  EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
+  TFE_DeleteContext(ctx);
 }
 
 TEST(CAPI, TensorHandleCopyBetweenDevices) {
@@ -418,7 +417,7 @@ void TensorHandleCopyBetweenDevicesError(bool async) {
   TFE_DeleteTensorHandle(hcopy);
   TFE_DeleteTensorHandle(hcpu);
   if (hdevice != nullptr) TFE_DeleteTensorHandle(hdevice);
-  TFE_DeleteContext(ctx, status.get());
+  TFE_DeleteContext(ctx);
 }
 
 TEST(CAPI, TensorHandleCopyBetweenDevicesError) {
@@ -451,7 +450,7 @@ void TensorHandleCopyBetweenTwoGPUDevices(bool async) {
     TF_DeleteDeviceList(devices);
     TF_DeleteTensor(t);
     TFE_DeleteTensorHandle(hcpu);
-    TFE_DeleteContext(ctx, status.get());
+    TFE_DeleteContext(ctx);
     return;
   }
   const string gpu_1_name(TF_DeviceListName(devices, 1, status.get()));
@@ -484,8 +483,7 @@ void TensorHandleCopyBetweenTwoGPUDevices(bool async) {
   TF_DeleteDeviceList(devices);
   TF_DeleteTensor(t);
   TFE_DeleteTensorHandle(hcpu);
-  TFE_DeleteContext(ctx, status.get());
-  EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
+  TFE_DeleteContext(ctx);
 }
 
 TEST(CAPI, TensorHandleCopyBetweenTwoGPUDevices) {
@@ -533,8 +531,7 @@ void TensorHandleSilentCopy(bool async) {
   TFE_DeleteTensorHandle(hcpu);
   TFE_ContextAsyncWait(ctx, status.get());
   EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
-  TFE_DeleteContext(ctx, status.get());
-  EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
+  TFE_DeleteContext(ctx);
 }
 
 TEST(CAPI, TensorHandleSilentCopy) { TensorHandleSilentCopy(false); }
@@ -580,8 +577,7 @@ void TensorHandleSilentCopyLocal(bool async) {
   TFE_DeleteTensorHandle(hcpu);
   TFE_ContextAsyncWait(ctx, status.get());
   EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
-  TFE_DeleteContext(ctx, status.get());
-  EXPECT_EQ(TF_OK, TF_GetCode(status.get())) << TF_Message(status.get());
+  TFE_DeleteContext(ctx);
 }
 TEST(CAPI, TensorHandleSilentCopyLocal) { TensorHandleSilentCopyLocal(false); }
 TEST(CAPI, TensorHandleSilentCopyLocalAsync) {
@@ -614,11 +610,47 @@ void SetAndGetOpDevices(bool async) {
 
   TFE_DeleteOp(matmul);
   TFE_DeleteTensorHandle(m);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
 
+TEST(CAPI, TensorHandleNullptr) {
+  TFE_TensorHandle* h = nullptr;
+  std::unique_ptr<TF_Status, decltype(&TF_DeleteStatus)> status(
+      TF_NewStatus(), TF_DeleteStatus);
+
+  TF_Tensor* t = TFE_TensorHandleResolve(h, status.get());
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(status.get()));
+  ASSERT_EQ(t, nullptr);
+  ASSERT_EQ("The passed in handle is a nullptr",
+            string(TF_Message(status.get())));
+
+  TF_SetStatus(status.get(), TF_OK, "");
+
+  const char* device_name = TFE_TensorHandleDeviceName(h, status.get());
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(status.get()));
+  ASSERT_EQ(device_name, nullptr);
+  ASSERT_EQ("The passed in handle is a nullptr",
+            string(TF_Message(status.get())));
+
+  TF_SetStatus(status.get(), TF_OK, "");
+
+  int num_dims = TFE_TensorHandleNumDims(h, status.get());
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(status.get()));
+  ASSERT_EQ(num_dims, -1);
+  ASSERT_EQ("The passed in handle is a nullptr",
+            string(TF_Message(status.get())));
+
+  TF_SetStatus(status.get(), TF_OK, "");
+
+  int dim = TFE_TensorHandleDim(h, 0, status.get());
+  ASSERT_EQ(TF_INVALID_ARGUMENT, TF_GetCode(status.get()));
+  ASSERT_EQ(dim, -1);
+  ASSERT_EQ("The passed in handle is a nullptr",
+            string(TF_Message(status.get())));
+}
+
 void Execute_MatMul_CPU(bool async) {
   TF_Status* status = TF_NewStatus();
   TFE_ContextOptions* opts = TFE_NewContextOptions();
@@ -640,7 +672,7 @@ void Execute_MatMul_CPU(bool async) {
   TF_Tensor* t = TFE_TensorHandleResolve(retvals[0], status);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TFE_DeleteTensorHandle(retvals[0]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   float product[4] = {0};
   EXPECT_EQ(sizeof(product), TF_TensorByteSize(t));
@@ -712,7 +744,7 @@ void Execute_MatMul_CPU_Runtime_Error(bool async) {
   TFE_DeleteTensorHandle(m1);
   TFE_DeleteTensorHandle(m2);
   TFE_DeleteTensorHandle(retvals[0]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   TF_DeleteStatus(status);
 }
 TEST(CAPI, Execute_MatMul_CPU_Runtime_Error) {
@@ -743,7 +775,7 @@ void Execute_MatMul_CPU_Type_Error(bool async) {
   if (retvals[0] != nullptr) {
     TFE_DeleteTensorHandle(retvals[0]);
   }
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   TF_DeleteStatus(status);
 }
 
@@ -781,7 +813,7 @@ TEST(CAPI, Execute_Min_CPU) {
   TF_DeleteTensor(t);
   EXPECT_EQ(1, output[0]);
   EXPECT_EQ(3, output[1]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -823,7 +855,7 @@ void Execute_MatMul_XLA_CPU(bool async) {
   EXPECT_EQ(10, product[1]);
   EXPECT_EQ(15, product[2]);
   EXPECT_EQ(22, product[3]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   TF_DeleteStatus(status);
 }
 TEST(CAPI, Execute_MatMul_XLA_CPU) { Execute_MatMul_XLA_CPU(false); }
@@ -862,7 +894,7 @@ void Execute_Min_XLA_CPU(bool async) {
   TF_DeleteTensor(t);
   EXPECT_EQ(1, output[0]);
   EXPECT_EQ(3, output[1]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   TF_DeleteStatus(status);
 }
 TEST(CAPI, Execute_Min_XLA_CPU) { Execute_Min_XLA_CPU(false); }
@@ -898,7 +930,7 @@ void ExecuteWithTracing(bool async) {
 
   TF_Tensor* t = TFE_TensorHandleResolve(retvals[0], status);
   TFE_DeleteTensorHandle(retvals[0]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   float product[4] = {0};
   EXPECT_EQ(sizeof(product), TF_TensorByteSize(t));
@@ -974,7 +1006,7 @@ TEST(CAPI, Function_ident_CPU) {
     TF_DeleteTensor(r);
     TFE_DeleteTensorHandle(result[0]);
   }
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -1044,7 +1076,7 @@ TEST(CAPI, Function_ident_XLA_CPU) {
     TF_DeleteTensor(r);
     TFE_DeleteTensorHandle(result[0]);
   }
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   ASSERT_TRUE(TF_GetCode(status) == TF_OK) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -1120,7 +1152,7 @@ void FunctionDefAndExecute(bool async) {
   EXPECT_EQ(10, product[1]);
   EXPECT_EQ(15, product[2]);
   EXPECT_EQ(22, product[3]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -1161,7 +1193,7 @@ void BM_ExecuteFunction(int iters, int async) {
   tensorflow::testing::StopTiming();
   TFE_DeleteTensorHandle(m);
   TFE_DeleteTensorHandle(retval[0]);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   EXPECT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -1249,7 +1281,7 @@ TEST(CAPI, Variables) {
 
   TFE_DeleteTensorHandle(var_handle);
   TFE_DeleteTensorHandle(value_handle);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
@@ -1288,7 +1320,7 @@ void BM_ReadVariable(int iters) {
   TFE_DeleteOp(op);
 
   TFE_DeleteTensorHandle(var_handle);
-  TFE_DeleteContext(ctx, status);
+  TFE_DeleteContext(ctx);
   CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   TF_DeleteStatus(status);
 }
diff --git a/tensorflow/c/eager/tape.h b/tensorflow/c/eager/tape.h
index 734e712daa..1adb0458c3 100644
--- a/tensorflow/c/eager/tape.h
+++ b/tensorflow/c/eager/tape.h
@@ -520,7 +520,12 @@ Status GradientTape<Gradient, BackwardFunction>::ComputeGradient(
         }
       } else {
         any_gradient_nonzero = true;
-        auto new_gradients = vspace.AggregateGradients(grad_it->second);
+        Gradient* new_gradients = nullptr;
+        if (grad_it->second.size() == 1) {
+          new_gradients = grad_it->second.at(0);
+        } else {
+          new_gradients = vspace.AggregateGradients(grad_it->second);
+        }
         if (sources_set.find(grad_it->first) == sources_set.end()) {
           gradients.erase(grad_it);
         } else {
diff --git a/tensorflow/c/python_api.cc b/tensorflow/c/python_api.cc
index e18fdf6c57..8486b585c8 100644
--- a/tensorflow/c/python_api.cc
+++ b/tensorflow/c/python_api.cc
@@ -155,7 +155,7 @@ void SetResourceHandleShapeAndType(TF_Graph* graph, TF_Output output,
     tensorflow::shape_inference::ShapeHandle shape;
     status->status =
         ic->MakeShapeFromShapeProto(shape_and_type_proto.shape(), &shape);
-    if (status->status.ok()) return;
+    if (!status->status.ok()) return;
     shapes_and_types.emplace_back(shape, shape_and_type_proto.dtype());
   }
   ic->set_output_handle_shapes_and_types(output.index, shapes_and_types);
diff --git a/tensorflow/cc/gradients/math_grad_test.cc b/tensorflow/cc/gradients/math_grad_test.cc
index fd7b6fe662..1c9bdff5e1 100644
--- a/tensorflow/cc/gradients/math_grad_test.cc
+++ b/tensorflow/cc/gradients/math_grad_test.cc
@@ -475,11 +475,7 @@ TEST_F(CWiseUnaryGradTest, Tan_Complex) {
   auto x_fn = [this](const int i) {
     return CRV({{1, 0}, {0, 1}, {2, -1}, {1, 2}, {3, 4}});
   };
-  // TODO(kbsriram)
-  // Enable when tan kernel supports complex inputs
-  if (false) {
-    TestCWiseGrad<complex64, complex64>(TAN, x_fn);
-  }
+  TestCWiseGrad<complex64, complex64>(TAN, x_fn);
 }
 
 TEST_F(CWiseUnaryGradTest, Atan) {
diff --git a/tensorflow/cc/gradients/nn_grad.cc b/tensorflow/cc/gradients/nn_grad.cc
index c73482d5f4..588e96cb19 100644
--- a/tensorflow/cc/gradients/nn_grad.cc
+++ b/tensorflow/cc/gradients/nn_grad.cc
@@ -47,6 +47,72 @@ Status SoftmaxGrad(const Scope& scope, const Operation& op,
 }
 REGISTER_GRADIENT_OP("Softmax", SoftmaxGrad);
 
+bool IsZero(const Scope& scope, const Output& grad) {
+  string op_type_name = grad.op().node()->type_string();
+  if (op_type_name == "ZerosLike" || op_type_name == "Zeros") {
+    return true;
+  }
+  // The Operation we were provided is not named something obvious so
+  // we need to actually look at its contents.
+  // The original python code did this by calling a utility function called
+  // tensor_util.constant_value.
+  // There is no C++ equivalent to tensor_util.constant_value so we do nothing
+  // for the moment.
+  return false;
+}
+
+// Multiply after broadcasting vec to match dimensions of mat.
+//   Args:
+//     vec: A 1-D tensor of dimension [D0]
+//     mat: A 2-D tensor of dimesnion [D0, D1]
+//
+//   Returns:
+//     A tensor of dimension [D0, D1], the result fo vec * mat.
+Output BroadcastMul(const Scope& scope, const Output& vec, const Output& mat) {
+  auto reshaped = ExpandDims(scope, vec, -1);
+  return Multiply(scope, reshaped, mat);
+}
+
+Status SoftmaxCrossEntropyWithLogitsGrad(const Scope& scope,
+                                         const Operation& op,
+                                         const std::vector<Output>& grad_inputs,
+                                         std::vector<Output>* grad_outputs) {
+  // Softmax gradient with cross entropy logits function.
+  // We multiply the backprop for cost with the gradients - op.output[1].
+  // There is no gradient for labels.
+
+  // The outputs of the network are at input index 0.
+  auto logits = op.input(0);
+  // The "truth" labels are at index 1.
+  auto softmax_grad = op.output(1);
+
+  // The loss is the output at index 0, and backprop is the output at index 1.
+  auto grad_loss = grad_inputs[0];
+  auto grad_grad = grad_inputs[1];
+
+  auto grad = BroadcastMul(scope, grad_loss, softmax_grad);
+  if (!IsZero(scope, grad_grad)) {
+    std::vector<int> axis;
+    auto logits_softmax = Softmax(scope, logits);
+
+    auto grad_grad_expand = ExpandDims(scope, grad_grad, 1);
+    auto logits_softmax_expand = ExpandDims(scope, logits_softmax, 2);
+    auto matmul_result =
+        BatchMatMul(scope, grad_grad_expand, logits_softmax_expand);
+    axis.push_back(1);
+    auto squeeze_result = Squeeze(scope, matmul_result, Squeeze::Axis(axis));
+    auto subtraction_result = Subtract(scope, grad_grad, squeeze_result);
+    auto multiply_result = Multiply(scope, subtraction_result, logits_softmax);
+    grad = Add(scope, grad, multiply_result);
+  }
+  auto minus_log_softmax = Multiply(scope, LogSoftmax(scope, logits), -1.0f);
+  grad_outputs->push_back(grad);
+  grad_outputs->push_back(BroadcastMul(scope, grad_loss, minus_log_softmax));
+  return scope.status();
+}
+REGISTER_GRADIENT_OP("SoftmaxCrossEntropyWithLogits",
+                     SoftmaxCrossEntropyWithLogitsGrad);
+
 Status LogSoftmaxGrad(const Scope& scope, const Operation& op,
                       const std::vector<Output>& grad_inputs,
                       std::vector<Output>* grad_outputs) {
@@ -195,9 +261,9 @@ Status MaxPool3DGradHelper(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
   MaxPool3DGrad::Attrs grad_attrs;
-  auto dx = MaxPool3DGrad(scope, op.input(0), op.output(0), grad_inputs[0],
-                          ksize, strides, padding,
-                          grad_attrs.DataFormat(data_format));
+  auto dx =
+      MaxPool3DGrad(scope, op.input(0), op.output(0), grad_inputs[0], ksize,
+                    strides, padding, grad_attrs.DataFormat(data_format));
   grad_outputs->push_back(dx);
   return scope.status();
 }
@@ -216,10 +282,9 @@ Status AvgPoolGradHelper(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
   internal::AvgPoolGrad::Attrs grad_attrs;
-  auto dx =
-      internal::AvgPoolGrad(scope, Shape(scope, op.input(0)), grad_inputs[0],
-                            ksize, strides, padding,
-                            grad_attrs.DataFormat(data_format));
+  auto dx = internal::AvgPoolGrad(scope, Shape(scope, op.input(0)),
+                                  grad_inputs[0], ksize, strides, padding,
+                                  grad_attrs.DataFormat(data_format));
   grad_outputs->push_back(dx);
   return scope.status();
 }
@@ -238,9 +303,9 @@ Status AvgPool3DGradHelper(const Scope& scope, const Operation& op,
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
   TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
   AvgPool3DGrad::Attrs grad_attrs;
-  auto dx = AvgPool3DGrad(scope, Shape(scope, op.input(0)), grad_inputs[0],
-                          ksize, strides, padding,
-                          grad_attrs.DataFormat(data_format));
+  auto dx =
+      AvgPool3DGrad(scope, Shape(scope, op.input(0)), grad_inputs[0], ksize,
+                    strides, padding, grad_attrs.DataFormat(data_format));
   grad_outputs->push_back(dx);
   return scope.status();
 }
diff --git a/tensorflow/cc/gradients/nn_grad_test.cc b/tensorflow/cc/gradients/nn_grad_test.cc
index b4d457a9d1..aa72cf7ba2 100644
--- a/tensorflow/cc/gradients/nn_grad_test.cc
+++ b/tensorflow/cc/gradients/nn_grad_test.cc
@@ -25,6 +25,8 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
+using ops::AvgPool;
+using ops::AvgPool3D;
 using ops::BiasAdd;
 using ops::Conv2D;
 using ops::Elu;
@@ -33,11 +35,9 @@ using ops::FractionalMaxPool;
 using ops::L2Loss;
 using ops::LogSoftmax;
 using ops::LRN;
-using ops::AvgPool;
-using ops::AvgPool3D;
 using ops::MaxPool;
-using ops::MaxPoolV2;
 using ops::MaxPool3D;
+using ops::MaxPoolV2;
 using ops::Placeholder;
 using ops::Relu;
 using ops::Relu6;
@@ -111,6 +111,20 @@ TEST_F(NNGradTest, SoftmaxGrad) {
   RunTest(x, shape, y, shape);
 }
 
+TEST_F(NNGradTest, SoftmaxCrossEntropyWithLogitsGrad) {
+  TensorShape logits_shape({5, 3});
+  TensorShape loss_shape({5});
+
+  auto logits = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(logits_shape));
+  auto labels = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(logits_shape));
+  auto y =
+      tensorflow::ops::SoftmaxCrossEntropyWithLogits(scope_, logits, labels);
+  // Note the reversal of the backprop and loss orders. Issue #18734 has been
+  // opened for this.
+  RunTest({logits, labels}, {logits_shape, logits_shape}, {y.backprop, y.loss},
+          {logits_shape, loss_shape});
+}
+
 TEST_F(NNGradTest, LogSoftmaxGrad) {
   TensorShape shape({5, 3});
   auto x = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(shape));
@@ -253,7 +267,7 @@ TEST_F(NNGradTest, AvgPool3DGradHelper) {
   RunTest(x, x_shape, y, y_shape);
 }
 
-TEST_F(NNGradTest, LRN){
+TEST_F(NNGradTest, LRN) {
   TensorShape x_shape({1, 1, 2, 1});
   auto x = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(x_shape));
   auto y = LRN(scope_, x);
diff --git a/tensorflow/cc/saved_model/BUILD b/tensorflow/cc/saved_model/BUILD
index 06a3be18e0..3d3895c8fa 100644
--- a/tensorflow/cc/saved_model/BUILD
+++ b/tensorflow/cc/saved_model/BUILD
@@ -34,6 +34,46 @@ cc_library(
 )
 
 cc_library(
+    name = "reader",
+    srcs = ["reader.cc"],
+    hdrs = ["reader.h"],
+    deps = [
+        ":constants",
+    ] + if_not_mobile([
+        # TODO(b/111634734): :lib and :protos_all contain dependencies that
+        # cannot be built on mobile platforms. Instead, include the appropriate
+        # tf_lib depending on the build platform.
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+    ]) + if_mobile([
+        # Mobile-friendly SavedModel proto. See go/portable-proto for more info.
+        "//tensorflow/core:saved_model_portable_proto",
+    ]) + if_android([
+        "//tensorflow/core:android_tensorflow_lib",
+    ]) + if_ios([
+        "//tensorflow/core:ios_tensorflow_lib",
+    ]),
+)
+
+tf_cc_test(
+    name = "reader_test",
+    srcs = ["reader_test.cc"],
+    data = [
+        ":saved_model_half_plus_two",
+    ],
+    linkstatic = 1,
+    deps = [
+        ":constants",
+        ":reader",
+        ":tag_constants",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
+
+cc_library(
     name = "loader",
     hdrs = ["loader.h"],
     deps = [
@@ -54,6 +94,7 @@ cc_library(
     hdrs = ["loader.h"],
     deps = [
         ":constants",
+        ":reader",
     ] + if_not_mobile([
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
diff --git a/tensorflow/cc/saved_model/loader.cc b/tensorflow/cc/saved_model/loader.cc
index faa1e378d0..98be66a6ad 100644
--- a/tensorflow/cc/saved_model/loader.cc
+++ b/tensorflow/cc/saved_model/loader.cc
@@ -18,8 +18,10 @@ limitations under the License.
 #include <unordered_set>
 
 #include "tensorflow/cc/saved_model/constants.h"
+#include "tensorflow/cc/saved_model/reader.h"
 #include "tensorflow/core/lib/io/path.h"
 #include "tensorflow/core/lib/monitoring/counter.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/platform/protobuf_internal.h"
@@ -43,56 +45,6 @@ auto* load_latency = monitoring::Counter<1>::New(
 constexpr char kLoadAttemptFail[] = "fail";
 constexpr char kLoadAttemptSuccess[] = "success";
 
-Status ReadSavedModel(const string& export_dir, SavedModel* saved_model_proto) {
-  const string saved_model_pb_path =
-      io::JoinPath(export_dir, kSavedModelFilenamePb);
-  if (Env::Default()->FileExists(saved_model_pb_path).ok()) {
-    return ReadBinaryProto(Env::Default(), saved_model_pb_path,
-                           saved_model_proto);
-  }
-  const string saved_model_pbtxt_path =
-      io::JoinPath(export_dir, kSavedModelFilenamePbTxt);
-  if (Env::Default()->FileExists(saved_model_pbtxt_path).ok()) {
-    return ReadTextProto(Env::Default(), saved_model_pbtxt_path,
-                         saved_model_proto);
-  }
-  return Status(error::Code::NOT_FOUND,
-                "Could not find SavedModel .pb or .pbtxt at supplied export "
-                "directory path: " +
-                    export_dir);
-}
-
-string GetTagsAsString(const std::unordered_set<string>& tags) {
-  string tags_as_string = "{ ";
-  for (const string& tag : tags) {
-    tags_as_string = strings::StrCat(tags_as_string, tag, " ");
-  }
-  tags_as_string = strings::StrCat(tags_as_string, "}");
-  return tags_as_string;
-}
-
-Status FindMetaGraphDefToLoad(const SavedModel& saved_model_proto,
-                              const std::unordered_set<string>& tags,
-                              MetaGraphDef* meta_graph_def_to_load) {
-  for (const MetaGraphDef& meta_graph_def : saved_model_proto.meta_graphs()) {
-    // Get tags from the meta_graph_def.
-    std::unordered_set<string> graph_tags;
-    for (const string& tag : meta_graph_def.meta_info_def().tags()) {
-      graph_tags.insert(tag);
-    }
-    // Match with the set of tags provided.
-    if (graph_tags == tags) {
-      *meta_graph_def_to_load = meta_graph_def;
-      return Status::OK();
-    }
-  }
-  return Status(error::Code::NOT_FOUND,
-                "Could not find meta graph def matching supplied tags: " +
-                    GetTagsAsString(tags) +
-                    ". To inspect available tag-sets in the SavedModel, please "
-                    "use the SavedModel CLI: `saved_model_cli`");
-}
-
 Status LoadMetaGraphIntoSession(const MetaGraphDef& meta_graph_def,
                                 const SessionOptions& session_options,
                                 std::unique_ptr<Session>* session) {
@@ -122,6 +74,54 @@ void AddAssetsTensorsToInputs(const StringPiece export_dir,
   }
 }
 
+// Like Session::Run(), but uses the Make/Run/ReleaseCallable() API to avoid
+// leaving behind non-GC'ed state.
+//
+// Detailed motivation behind this approach, from ashankar@:
+//
+// Each call to Session::Run() that identifies a new subgraph (based on feeds
+// and fetches) creates some datastructures that live as long as the session
+// (the partitioned graph, associated executors etc.).
+//
+// A pathological case of this would be if say the initialization op
+// (main_op/legacy_init_op) involves the use of a large constant. Then we
+// allocate memory for that large constant that will just stick around till the
+// session dies. With this Callable mechanism, that memory will be released
+// right after ReleaseCallable returns.
+//
+// However, the resource manager state remains.
+Status RunOnce(const RunOptions& run_options,
+               const std::vector<std::pair<string, Tensor>>& inputs,
+               const std::vector<string>& output_tensor_names,
+               const std::vector<string>& target_node_names,
+               std::vector<Tensor>* outputs, RunMetadata* run_metadata,
+               Session* session) {
+  CallableOptions callable_options;
+  std::vector<Tensor> feed_tensors;
+  *callable_options.mutable_run_options() = run_options;
+  for (const auto& input : inputs) {
+    const string& name = input.first;
+    const Tensor& tensor = input.second;
+    callable_options.add_feed(name);
+    feed_tensors.push_back(tensor);
+  }
+  for (const string& output_tensor_name : output_tensor_names) {
+    callable_options.add_fetch(output_tensor_name);
+  }
+  for (const string& target_node_name : target_node_names) {
+    callable_options.add_target(target_node_name);
+  }
+
+  Session::CallableHandle callable_handle;
+  TF_RETURN_IF_ERROR(session->MakeCallable(callable_options, &callable_handle));
+  const Status run_status = session->RunCallable(callable_handle, feed_tensors,
+                                                 outputs, run_metadata);
+  // Be sure to call ReleaseCallable() regardless of the outcome of
+  // RunCallable().
+  session->ReleaseCallable(callable_handle).IgnoreError();
+  return run_status;
+}
+
 bool HasMainOp(const MetaGraphDef& meta_graph_def) {
   const auto& collection_def_map = meta_graph_def.collection_def();
   if (collection_def_map.find(kSavedModelMainOpKey) !=
@@ -134,10 +134,11 @@ bool HasMainOp(const MetaGraphDef& meta_graph_def) {
 Status RunMainOp(const RunOptions& run_options, const string& export_dir,
                  const MetaGraphDef& meta_graph_def,
                  const std::vector<AssetFileDef>& asset_file_defs,
-                 Session* session) {
-  LOG(INFO) << "Running MainOp on SavedModel bundle.";
+                 Session* session, const string& main_op_key) {
+  LOG(INFO) << "Running MainOp with key " << main_op_key
+            << " on SavedModel bundle.";
   const auto& collection_def_map = meta_graph_def.collection_def();
-  const auto main_op_it = collection_def_map.find(kSavedModelMainOpKey);
+  const auto main_op_it = collection_def_map.find(main_op_key);
   if (main_op_it != collection_def_map.end()) {
     if (main_op_it->second.node_list().value_size() != 1) {
       return errors::FailedPrecondition(
@@ -147,8 +148,8 @@ Status RunMainOp(const RunOptions& run_options, const string& export_dir,
     AddAssetsTensorsToInputs(export_dir, asset_file_defs, &inputs);
     RunMetadata run_metadata;
     const StringPiece main_op_name = main_op_it->second.node_list().value(0);
-    return session->Run(run_options, inputs, {}, {main_op_name.ToString()},
-                        nullptr /* outputs */, &run_metadata);
+    return RunOnce(run_options, inputs, {}, {main_op_name.ToString()},
+                   nullptr /* outputs */, &run_metadata, session);
   }
   return Status::OK();
 }
@@ -185,32 +186,8 @@ Status RunRestore(const RunOptions& run_options, const string& export_dir,
   AddAssetsTensorsToInputs(export_dir, asset_file_defs, &inputs);
 
   RunMetadata run_metadata;
-  return session->Run(run_options, inputs, {}, {restore_op_name.ToString()},
-                      nullptr /* outputs */, &run_metadata);
-}
-
-Status RunLegacyInitOp(const RunOptions& run_options, const string& export_dir,
-                       const MetaGraphDef& meta_graph_def,
-                       const std::vector<AssetFileDef>& asset_file_defs,
-                       Session* session) {
-  LOG(INFO) << "Running LegacyInitOp on SavedModel bundle.";
-  const auto& collection_def_map = meta_graph_def.collection_def();
-  const auto init_op_it = collection_def_map.find(kSavedModelLegacyInitOpKey);
-  if (init_op_it != collection_def_map.end()) {
-    if (init_op_it->second.node_list().value_size() != 1) {
-      return errors::FailedPrecondition(strings::StrCat(
-          "Expected exactly one serving init op in : ", export_dir));
-    }
-    std::vector<std::pair<string, Tensor>> inputs;
-    AddAssetsTensorsToInputs(export_dir, asset_file_defs, &inputs);
-    RunMetadata run_metadata;
-    const StringPiece legacy_init_op_name =
-        init_op_it->second.node_list().value(0);
-    return session->Run(run_options, inputs, {},
-                        {legacy_init_op_name.ToString()}, nullptr /* outputs */,
-                        &run_metadata);
-  }
-  return Status::OK();
+  return RunOnce(run_options, inputs, {}, {restore_op_name.ToString()},
+                 nullptr /* outputs */, &run_metadata, session);
 }
 
 Status GetAssetFileDefs(const MetaGraphDef& meta_graph_def,
@@ -235,18 +212,8 @@ Status LoadSavedModelInternal(const SessionOptions& session_options,
                               const string& export_dir,
                               const std::unordered_set<string>& tags,
                               SavedModelBundle* const bundle) {
-  if (!MaybeSavedModelDirectory(export_dir)) {
-    return Status(error::Code::NOT_FOUND,
-                  "SavedModel not found in export directory: " + export_dir);
-  }
-  LOG(INFO) << "Loading SavedModel with tags: " << GetTagsAsString(tags)
-            << "; from: " << export_dir;
-
-  SavedModel saved_model_proto;
-  TF_RETURN_IF_ERROR(ReadSavedModel(export_dir, &saved_model_proto));
-
-  TF_RETURN_IF_ERROR(
-      FindMetaGraphDefToLoad(saved_model_proto, tags, &bundle->meta_graph_def));
+  TF_RETURN_IF_ERROR(ReadMetaGraphDefFromSavedModel(export_dir, tags,
+                                                    &bundle->meta_graph_def));
 
   TF_RETURN_IF_ERROR(LoadMetaGraphIntoSession(
       bundle->meta_graph_def, session_options, &bundle->session));
@@ -262,11 +229,11 @@ Status LoadSavedModelInternal(const SessionOptions& session_options,
   if (HasMainOp(bundle->meta_graph_def)) {
     TF_RETURN_IF_ERROR(RunMainOp(run_options, export_dir,
                                  bundle->meta_graph_def, asset_file_defs,
-                                 bundle->session.get()));
+                                 bundle->session.get(), kSavedModelMainOpKey));
   } else {
-    TF_RETURN_IF_ERROR(RunLegacyInitOp(run_options, export_dir,
-                                       bundle->meta_graph_def, asset_file_defs,
-                                       bundle->session.get()));
+    TF_RETURN_IF_ERROR(RunMainOp(
+        run_options, export_dir, bundle->meta_graph_def, asset_file_defs,
+        bundle->session.get(), kSavedModelLegacyInitOpKey));
   }
   return Status::OK();
 }
@@ -288,8 +255,8 @@ Status LoadSavedModel(const SessionOptions& session_options,
     return end_microseconds - start_microseconds;
   }();
   auto log_and_count = [&](const string& status_str) {
-    LOG(INFO) << "SavedModel load for tags " << GetTagsAsString(tags)
-              << "; Status: " << status_str << ". Took "
+    LOG(INFO) << "SavedModel load for tags { " << str_util::Join(tags, " ")
+              << " }; Status: " << status_str << ". Took "
               << load_latency_microsecs << " microseconds.";
     load_attempt_count->GetCell(export_dir, status_str)->IncrementBy(1);
   };
diff --git a/tensorflow/cc/saved_model/reader.cc b/tensorflow/cc/saved_model/reader.cc
new file mode 100644
index 0000000000..2146c8a197
--- /dev/null
+++ b/tensorflow/cc/saved_model/reader.cc
@@ -0,0 +1,88 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/cc/saved_model/reader.h"
+
+#include <unordered_set>
+
+#include "tensorflow/cc/saved_model/constants.h"
+#include "tensorflow/core/lib/io/path.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/protobuf/saved_model.pb.h"
+
+namespace tensorflow {
+namespace {
+
+Status ReadSavedModel(const string& export_dir, SavedModel* saved_model_proto) {
+  LOG(INFO) << "Reading SavedModel from: " << export_dir;
+
+  const string saved_model_pb_path =
+      io::JoinPath(export_dir, kSavedModelFilenamePb);
+  if (Env::Default()->FileExists(saved_model_pb_path).ok()) {
+    return ReadBinaryProto(Env::Default(), saved_model_pb_path,
+                           saved_model_proto);
+  }
+  const string saved_model_pbtxt_path =
+      io::JoinPath(export_dir, kSavedModelFilenamePbTxt);
+  if (Env::Default()->FileExists(saved_model_pbtxt_path).ok()) {
+    return ReadTextProto(Env::Default(), saved_model_pbtxt_path,
+                         saved_model_proto);
+  }
+  return Status(error::Code::NOT_FOUND,
+                "Could not find SavedModel .pb or .pbtxt at supplied export "
+                "directory path: " +
+                    export_dir);
+}
+
+Status FindMetaGraphDef(const SavedModel& saved_model_proto,
+                        const std::unordered_set<string>& tags,
+                        MetaGraphDef* meta_graph_def) {
+  LOG(INFO) << "Reading meta graph with tags { " << str_util::Join(tags, " ")
+            << " }";
+  for (const MetaGraphDef& graph_def : saved_model_proto.meta_graphs()) {
+    // Get tags from the graph_def.
+    std::unordered_set<string> graph_tags;
+    for (const string& tag : graph_def.meta_info_def().tags()) {
+      graph_tags.insert(tag);
+    }
+    // Match with the set of tags provided.
+    if (graph_tags == tags) {
+      *meta_graph_def = graph_def;
+      return Status::OK();
+    }
+  }
+  return Status(
+      error::Code::NOT_FOUND,
+      strings::StrCat(
+          "Could not find meta graph def matching supplied tags: { ",
+          str_util::Join(tags, " "),
+          " }. To inspect available tag-sets in the SavedModel, please "
+          "use the SavedModel CLI: `saved_model_cli`"));
+}
+
+}  // namespace
+
+Status ReadMetaGraphDefFromSavedModel(const string& export_dir,
+                                      const std::unordered_set<string>& tags,
+                                      MetaGraphDef* const meta_graph_def) {
+  SavedModel saved_model_proto;
+  TF_RETURN_IF_ERROR(ReadSavedModel(export_dir, &saved_model_proto));
+  TF_RETURN_IF_ERROR(FindMetaGraphDef(saved_model_proto, tags, meta_graph_def));
+  return Status::OK();
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/cc/saved_model/reader.h b/tensorflow/cc/saved_model/reader.h
new file mode 100644
index 0000000000..5815108df2
--- /dev/null
+++ b/tensorflow/cc/saved_model/reader.h
@@ -0,0 +1,39 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+/// Functions to read the SavedModel proto, or parts of it.
+
+#ifndef TENSORFLOW_CC_SAVED_MODEL_READER_H_
+#define TENSORFLOW_CC_SAVED_MODEL_READER_H_
+
+#include <string>
+#include <unordered_set>
+
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/protobuf/meta_graph.pb.h"
+
+namespace tensorflow {
+
+// Reads the SavedModel proto from saved_model.pb(txt) in the given directory,
+// finds the MetaGraphDef that matches the given set of tags and writes it to
+// the `meta_graph_def` parameter. Returns a failure status when the SavedModel
+// file does not exist or no MetaGraphDef matches the tags.
+Status ReadMetaGraphDefFromSavedModel(const string& export_dir,
+                                      const std::unordered_set<string>& tags,
+                                      MetaGraphDef* const meta_graph_def);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CC_SAVED_MODEL_READER_H_
diff --git a/tensorflow/cc/saved_model/reader_test.cc b/tensorflow/cc/saved_model/reader_test.cc
new file mode 100644
index 0000000000..620e9c2eec
--- /dev/null
+++ b/tensorflow/cc/saved_model/reader_test.cc
@@ -0,0 +1,108 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/cc/saved_model/reader.h"
+
+#include "tensorflow/cc/saved_model/constants.h"
+#include "tensorflow/cc/saved_model/tag_constants.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/io/path.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace {
+
+constexpr char kTestDataPbTxt[] =
+    "cc/saved_model/testdata/half_plus_two_pbtxt/00000123";
+constexpr char kTestDataSharded[] =
+    "cc/saved_model/testdata/half_plus_two/00000123";
+
+class ReaderTest : public ::testing::Test {
+ protected:
+  ReaderTest() {}
+
+  void CheckMetaGraphDef(const MetaGraphDef& meta_graph_def) {
+    const auto& tags = meta_graph_def.meta_info_def().tags();
+    EXPECT_TRUE(std::find(tags.begin(), tags.end(), kSavedModelTagServe) !=
+                tags.end());
+    EXPECT_NE(meta_graph_def.meta_info_def().tensorflow_version(), "");
+    EXPECT_EQ(
+        meta_graph_def.signature_def().at("serving_default").method_name(),
+        "tensorflow/serving/predict");
+  }
+};
+
+TEST_F(ReaderTest, TagMatch) {
+  MetaGraphDef meta_graph_def;
+
+  const string export_dir =
+      io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataSharded);
+  TF_ASSERT_OK(ReadMetaGraphDefFromSavedModel(export_dir, {kSavedModelTagServe},
+                                              &meta_graph_def));
+  CheckMetaGraphDef(meta_graph_def);
+}
+
+TEST_F(ReaderTest, NoTagMatch) {
+  MetaGraphDef meta_graph_def;
+
+  const string export_dir =
+      io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataSharded);
+  Status st = ReadMetaGraphDefFromSavedModel(export_dir, {"missing-tag"},
+                                             &meta_graph_def);
+  EXPECT_FALSE(st.ok());
+  EXPECT_TRUE(str_util::StrContains(
+      st.error_message(),
+      "Could not find meta graph def matching supplied tags: { missing-tag }"))
+      << st.error_message();
+}
+
+TEST_F(ReaderTest, NoTagMatchMultiple) {
+  MetaGraphDef meta_graph_def;
+
+  const string export_dir =
+      io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataSharded);
+  Status st = ReadMetaGraphDefFromSavedModel(
+      export_dir, {kSavedModelTagServe, "missing-tag"}, &meta_graph_def);
+  EXPECT_FALSE(st.ok());
+  EXPECT_TRUE(str_util::StrContains(
+      st.error_message(),
+      "Could not find meta graph def matching supplied tags: "))
+      << st.error_message();
+}
+
+TEST_F(ReaderTest, PbtxtFormat) {
+  MetaGraphDef meta_graph_def;
+
+  const string export_dir =
+      io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPbTxt);
+  TF_ASSERT_OK(ReadMetaGraphDefFromSavedModel(export_dir, {kSavedModelTagServe},
+                                              &meta_graph_def));
+  CheckMetaGraphDef(meta_graph_def);
+}
+
+TEST_F(ReaderTest, InvalidExportPath) {
+  MetaGraphDef meta_graph_def;
+
+  const string export_dir =
+      io::JoinPath(testing::TensorFlowSrcRoot(), "missing-path");
+  Status st = ReadMetaGraphDefFromSavedModel(export_dir, {kSavedModelTagServe},
+                                             &meta_graph_def);
+  EXPECT_FALSE(st.ok());
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/aot/BUILD b/tensorflow/compiler/aot/BUILD
index 2119c8ec47..fef8b8d4d4 100644
--- a/tensorflow/compiler/aot/BUILD
+++ b/tensorflow/compiler/aot/BUILD
@@ -68,6 +68,7 @@ cc_library(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:compile_only_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:compiler",
         "//tensorflow/compiler/xla/service/cpu:cpu_compiler",
         "//tensorflow/core:core_cpu_internal",
diff --git a/tensorflow/compiler/aot/compile.cc b/tensorflow/compiler/aot/compile.cc
index bbc35da2ef..2b5f97b34c 100644
--- a/tensorflow/compiler/aot/compile.cc
+++ b/tensorflow/compiler/aot/compile.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/tf2xla_util.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/compile_only_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_compiler.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/util.h"
diff --git a/tensorflow/compiler/aot/runtime.cc b/tensorflow/compiler/aot/runtime.cc
index 5e74079fc1..7606420ded 100644
--- a/tensorflow/compiler/aot/runtime.cc
+++ b/tensorflow/compiler/aot/runtime.cc
@@ -64,7 +64,7 @@ size_t align_to(size_t n, size_t align) {
 size_t aligned_buffer_bytes(const intptr_t* sizes, size_t n) {
   size_t total = 0;
   for (size_t i = 0; i < n; ++i) {
-    if (sizes[i] != -1) {
+    if (sizes[i] > 0) {
       total += align_to(sizes[i], kAlign);
     }
   }
@@ -85,7 +85,9 @@ void* MallocContiguousBuffers(const intptr_t* sizes, size_t n, void** bufs,
   }
   uintptr_t pos = reinterpret_cast<uintptr_t>(contiguous);
   for (size_t i = 0; i < n; ++i) {
-    if (sizes[i] == -1) {
+    if (sizes[i] < 0) {
+      // bufs[i] is either a constant, an entry parameter or a thread local
+      // allocation.
       bufs[i] = nullptr;
     } else {
       bufs[i] = reinterpret_cast<void*>(pos);
diff --git a/tensorflow/compiler/aot/tfcompile.bzl b/tensorflow/compiler/aot/tfcompile.bzl
index 5c57fee326..774a82d301 100644
--- a/tensorflow/compiler/aot/tfcompile.bzl
+++ b/tensorflow/compiler/aot/tfcompile.bzl
@@ -16,339 +16,367 @@ tf_library(
 )
 """
 
-load("//tensorflow:tensorflow.bzl",
-     "if_android", "tf_cc_test", "tf_copts")
-
-def tf_library(name, graph, config,
-               freeze_checkpoint=None, freeze_saver=None,
-               cpp_class=None, gen_test=True, gen_benchmark=True,
-               visibility=None, testonly=None,
-               tfcompile_flags=None,
-               tfcompile_tool="//tensorflow/compiler/aot:tfcompile",
-               include_standard_runtime_deps=True,
-               enable_xla_hlo_profiling=False, deps=None, tags=None):
-  """Runs tfcompile to compile a TensorFlow graph into executable code.
-
-  Given an invocation of tf_library(name="foo", ...), generates the following
-  build targets:
-    foo:           A cc_library containing the generated header and computation.
-    foo_test:      A cc_test with simple tests and benchmarks. Only created if
-                   gen_test=True.
-    foo_benchmark: A cc_binary that runs a minimal-dependency benchmark, useful
-                   for mobile devices or other platforms that can't compile the
-                   full test libraries. Only created if gen_benchmark=True.
-
-  Args:
-    name: The name of the build rule.
-    graph: The TensorFlow GraphDef to compile.  If the file ends in '.pbtxt' it
-      is expected to be in the human-readable proto text format, otherwise it is
-      expected to be in the proto binary format.
-    config: File containing tensorflow.tf2xla.Config proto.  If the file ends
-      in '.pbtxt' it is expected to be in the human-readable proto text format,
-      otherwise it is expected to be in the proto binary format.
-    freeze_checkpoint: If provided, run freeze_graph with this checkpoint to
-      convert variables into constants.
-    freeze_saver: If provided, run freeze_graph with this saver, in SaverDef
-      binary form, to convert variables into constants.
-    cpp_class: The name of the generated C++ class, wrapping the generated
-      function.  The syntax of this flag is
-      [[<optional_namespace>::],...]<class_name>.  This mirrors the C++ syntax
-      for referring to a class, where multiple namespaces may precede the class
-      name, separated by double-colons.  The class will be generated in the
-      given namespace(s), or if no namespaces are given, within the global
-      namespace.
-    gen_test: If True, also generate a cc_test rule that builds a simple
-      test and benchmark.
-    gen_benchmark: If True, also generate a binary with a simple benchmark.
-      Unlike the output of gen_test, this benchmark can be run on android.
-    visibility: Bazel build visibility.
-    testonly:   Bazel testonly attribute.
-    tfcompile_flags: Extra flags to pass to tfcompile to control compilation.
-    tfcompile_tool: The tfcompile binary. A non-default can be passed to
-      use a tfcompile built with extra dependencies.
-    include_standard_runtime_deps: If True, the standard list of kernel/runtime
-      deps is added to deps.  If False, deps must contain the full set of deps
-      needed by the generated library.
-    enable_xla_hlo_profiling: Enable XLA HLO profiling in the generated program,
-      and emit metadata that lets us pretty-print the gathered profile counters.
-    deps: a list of deps to include on the build rules for the generated
-      library, added to the standard deps if standard_runtime_deps is True.
-    tags: tags to apply to subsidiary build rules.
-
-  The output header is called <name>.h.
-  """
-  if not cpp_class:
-    fail("cpp_class must be specified")
-
-  tfcompile_graph = graph
-  if freeze_checkpoint or freeze_saver:
-    if not freeze_checkpoint:
-      fail("freeze_checkpoint must be specified when freeze_saver is specified")
+load(
+    "//tensorflow:tensorflow.bzl",
+    "if_android",
+    "tf_cc_test",
+    "tf_copts",
+)
 
-    freeze_name = "freeze_" + name
-    freeze_file = freeze_name + ".pb"
+def tf_library(
+        name,
+        graph,
+        config,
+        freeze_checkpoint = None,
+        freeze_saver = None,
+        cpp_class = None,
+        gen_test = True,
+        gen_benchmark = True,
+        visibility = None,
+        testonly = None,
+        tfcompile_flags = None,
+        tfcompile_tool = "//tensorflow/compiler/aot:tfcompile",
+        include_standard_runtime_deps = True,
+        enable_xla_hlo_profiling = False,
+        deps = None,
+        tags = None):
+    """Runs tfcompile to compile a TensorFlow graph into executable code.
 
-    # First run tfcompile to generate the list of out_nodes.
-    out_nodes_file = "out_nodes_" + freeze_name
-    native.genrule(
-        name=("gen_" + out_nodes_file),
-        srcs=[config],
-        outs=[out_nodes_file],
-        cmd=("$(location " + tfcompile_tool + ")" +
-             " --config=$(location " + config + ")" +
-             " --dump_fetch_nodes > $@"),
-        tools=[tfcompile_tool],
-        # Run tfcompile on the build host, rather than forge, since it's
-        # typically way faster on the local machine.
-        local=1,
-        tags=tags,
-    )
+    Given an invocation of tf_library(name="foo", ...), generates the following
+    build targets:
+      foo:           A cc_library containing the generated header and
+                     computation.
+      foo_test:      A cc_test with simple tests and benchmarks. Only created if
+                     gen_test=True.
+      foo_benchmark: A cc_binary that runs a minimal-dependency benchmark,
+                     useful for mobile devices or other platforms that can't
+                     compile the full test libraries. Only created if
+                     gen_benchmark=True.
+    The output header is called <name>.h.
 
-    # Now run freeze_graph to convert variables into constants.
-    freeze_args = (" --input_graph=$(location " + graph + ")" +
-                   " --checkpoint_version=1" +
-                   " --input_binary=" + str(not graph.endswith(".pbtxt")) +
-                   " --input_checkpoint=$(location " + freeze_checkpoint + ")" +
-                   " --output_graph=$(location " + freeze_file + ")" +
-                   " --output_node_names=$$(<$(location " + out_nodes_file +
-                   "))")
-    freeze_saver_srcs = []
-    if freeze_saver:
-      freeze_args += " --input_saver=$(location " + freeze_saver + ")"
-      freeze_saver_srcs += [freeze_saver]
-    native.genrule(
-        name=freeze_name,
-        srcs=[
-            graph,
-            freeze_checkpoint,
-            out_nodes_file,
-        ] + freeze_saver_srcs,
-        outs=[freeze_file],
-        cmd=("$(location //tensorflow/python/tools:freeze_graph)" +
-             freeze_args),
-        tools=["//tensorflow/python/tools:freeze_graph"],
-        tags=tags,
-    )
-    tfcompile_graph = freeze_file
+    Args:
+      name: The name of the build rule.
+      graph: The TensorFlow GraphDef to compile.  If the file ends in '.pbtxt'
+        it is expected to be in the human-readable proto text format, otherwise
+        it is expected to be in the proto binary format.
+      config: File containing tensorflow.tf2xla.Config proto.  If the file ends
+        in '.pbtxt' it is expected to be in the human-readable proto text
+        format, otherwise it is expected to be in the proto binary format.
+      freeze_checkpoint: If provided, run freeze_graph with this checkpoint to
+        convert variables into constants.
+      freeze_saver: If provided, run freeze_graph with this saver, in SaverDef
+        binary form, to convert variables into constants.
+      cpp_class: The name of the generated C++ class, wrapping the generated
+        function.  The syntax of this flag is
+        [[<optional_namespace>::],...]<class_name>.  This mirrors the C++ syntax
+        for referring to a class, where multiple namespaces may precede the
+        class name, separated by double-colons.  The class will be generated in
+        the given namespace(s), or if no namespaces are given, within the global
+        namespace.
+      gen_test: If True, also generate a cc_test rule that builds a simple
+        test and benchmark.
+      gen_benchmark: If True, also generate a binary with a simple benchmark.
+        Unlike the output of gen_test, this benchmark can be run on android.
+      visibility: Bazel build visibility.
+      testonly:   Bazel testonly attribute.
+      tfcompile_flags: Extra flags to pass to tfcompile to control compilation.
+      tfcompile_tool: The tfcompile binary. A non-default can be passed to
+        use a tfcompile built with extra dependencies.
+      include_standard_runtime_deps: If True, the standard list of
+        kernel/runtime deps is added to deps.  If False, deps must contain the
+        full set of deps needed by the generated library.
+      enable_xla_hlo_profiling: Enable XLA HLO profiling in the generated
+        program, and emit metadata that lets us pretty-print the gathered
+        profile counters.
+      deps: a list of deps to include on the build rules for the generated
+        library, added to the standard deps if standard_runtime_deps is True.
+      tags: tags to apply to subsidiary build rules.
+    """
+    if not cpp_class:
+        fail("cpp_class must be specified")
 
-  # Rule that runs tfcompile to produce the header and object file.
-  header_file = name + ".h"
-  metadata_object_file = name + "_tfcompile_metadata.o"
-  function_object_file = name + "_tfcompile_function.o"
-  ep = ("__" + native.package_name() + "__" + name).replace("/", "_")
-  if type(tfcompile_flags) == type(""):
-    flags = tfcompile_flags
-  else:
-    flags = " ".join(["'" + arg.replace("'", "'\\''") + "'" for arg in (tfcompile_flags or [])])
-  if enable_xla_hlo_profiling:
-    profiling_flag = "--xla_hlo_profile"
-  else:
-    profiling_flag = ""
-  native.genrule(
-      name=("gen_" + name),
-      srcs=[
-          tfcompile_graph,
-          config,
-      ],
-      outs=[
-          header_file,
-          metadata_object_file,
-          function_object_file,
-      ],
-      cmd=("$(location " + tfcompile_tool + ")" +
-           " --graph=$(location " + tfcompile_graph + ")" +
-           " --config=$(location " + config + ")" +
-           " --entry_point=" + ep +
-           " --cpp_class=" + cpp_class +
-           " --target_triple=" + target_llvm_triple() +
-           " --out_header=$(@D)/" + header_file +
-           " --out_metadata_object=$(@D)/" + metadata_object_file +
-           " --out_function_object=$(@D)/" + function_object_file +
-           " " + flags + " " + profiling_flag),
-      tools=[tfcompile_tool],
-      visibility=visibility,
-      testonly=testonly,
-      # Run tfcompile on the build host since it's typically faster on the local
-      # machine.
-      #
-      # Note that setting the local=1 attribute on a *test target* causes the
-      # test infrastructure to skip that test.  However this is a genrule, not a
-      # test target, and runs with --genrule_strategy=forced_forge, meaning the
-      # local=1 attribute is ignored, and the genrule is still run.
-      #
-      # https://www.bazel.io/versions/master/docs/be/general.html#genrule
-      local=1,
-      tags=tags,
-  )
+    tfcompile_graph = graph
+    if freeze_checkpoint or freeze_saver:
+        if not freeze_checkpoint:
+            fail("freeze_checkpoint must be specified when freeze_saver is " +
+                 "specified")
 
-  # Rule that runs tfcompile to produce the SessionModule proto, useful for
-  # debugging.  TODO(b/64813587): Once the SessionModule proto is
-  # deterministic, move this into the main rule above.
-  session_module_pb = name + "_session_module.pb"
-  native.genrule(
-      name=(name + "_session_module"),
-      srcs=[
-          tfcompile_graph,
-          config,
-      ],
-      outs=[
-          session_module_pb,
-      ],
-      cmd=("$(location " + tfcompile_tool + ")" +
-           " --graph=$(location " + tfcompile_graph + ")" +
-           " --config=$(location " + config + ")" +
-           " --entry_point=" + ep +
-           " --cpp_class=" + cpp_class +
-           " --target_triple=" + target_llvm_triple() +
-           " --out_session_module=$(@D)/" + session_module_pb +
-           " " + flags),
-      tools=[tfcompile_tool],
-      visibility=visibility,
-      testonly=testonly,
-      local=1,
-      tags=tags,
-  )
+        freeze_name = "freeze_" + name
+        freeze_file = freeze_name + ".pb"
 
-  # The cc_library rule packaging up the header and object file, and needed
-  # kernel implementations.
-  need_xla_data_proto = (flags and flags.find("--gen_program_shape") != -1)
-  native.cc_library(
-      name=name,
-      srcs=[function_object_file, metadata_object_file],
-      hdrs=[header_file],
-      visibility=visibility,
-      testonly=testonly,
-      deps = [
-          # These deps are required by all tf_library targets even if
-          # include_standard_runtime_deps is False.  Without them, the
-          # generated code will fail to compile.
-          "//tensorflow/compiler/tf2xla:xla_compiled_cpu_function",
-          "//tensorflow/core:framework_lite",
-      ] + (need_xla_data_proto and [
-          # If we're generating the program shape, we must depend on the proto.
-          "//tensorflow/compiler/xla:xla_data_proto",
-      ] or []) + (enable_xla_hlo_profiling and [
-          "//tensorflow/compiler/xla/service:hlo_profile_printer_data"
-      ] or []) + (include_standard_runtime_deps and [
-          # TODO(cwhipkey): only depend on kernel code that the model actually needed.
-          "//tensorflow/compiler/tf2xla/kernels:index_ops_kernel_argmax_float_1d",
-          "//tensorflow/compiler/tf2xla/kernels:index_ops_kernel_argmax_float_2d",
-          "//tensorflow/compiler/xla/service/cpu:runtime_conv2d",
-          "//tensorflow/compiler/xla/service/cpu:runtime_matmul",
-          "//tensorflow/compiler/xla/service/cpu:runtime_single_threaded_conv2d",
-          "//tensorflow/compiler/xla/service/cpu:runtime_single_threaded_matmul",
-          "//third_party/eigen3",
-      ] or []) + (deps or []),
-      tags=tags,
-  )
+        # First run tfcompile to generate the list of out_nodes.
+        out_nodes_file = "out_nodes_" + freeze_name
+        native.genrule(
+            name = ("gen_" + out_nodes_file),
+            srcs = [config],
+            outs = [out_nodes_file],
+            cmd = ("$(location " + tfcompile_tool + ")" +
+                   " --config=$(location " + config + ")" +
+                   " --dump_fetch_nodes > $@"),
+            tools = [tfcompile_tool],
+            # Run tfcompile on the build host, rather than forge, since it's
+            # typically way faster on the local machine.
+            local = 1,
+            tags = tags,
+        )
 
-  # Variables used for gen_test and gen_benchmark.
-  no_ns_name = ""
-  cpp_class_split = cpp_class.rsplit("::", maxsplit=2)
-  if len(cpp_class_split) == 1:
-    no_ns_name = cpp_class_split[0]
-  else:
-    no_ns_name = cpp_class_split[1]
-  sed_replace = (
-      "-e \"s|{{TFCOMPILE_HEADER}}|$(location " + header_file + ")|g\" " +
-      "-e \"s|{{TFCOMPILE_CPP_CLASS}}|" + cpp_class + "|g\" " +
-      "-e \"s|{{TFCOMPILE_NAME}}|" + no_ns_name + "|g\" ")
+        # Now run freeze_graph to convert variables into constants.
+        freeze_args = (
+            " --input_graph=$(location " + graph + ")" +
+            " --checkpoint_version=1" +
+            " --input_binary=" + str(not graph.endswith(".pbtxt")) +
+            " --input_checkpoint=$(location " + freeze_checkpoint + ")" +
+            " --output_graph=$(location " + freeze_file + ")" +
+            " --output_node_names=$$(<$(location " + out_nodes_file +
+            "))"
+        )
+        freeze_saver_srcs = []
+        if freeze_saver:
+            freeze_args += " --input_saver=$(location " + freeze_saver + ")"
+            freeze_saver_srcs += [freeze_saver]
+        native.genrule(
+            name = freeze_name,
+            srcs = [
+                graph,
+                freeze_checkpoint,
+                out_nodes_file,
+            ] + freeze_saver_srcs,
+            outs = [freeze_file],
+            cmd = ("$(location " +
+                   "//tensorflow/python/tools:freeze_graph)" +
+                   freeze_args),
+            tools = ["//tensorflow/python/tools:freeze_graph"],
+            tags = tags,
+        )
+        tfcompile_graph = freeze_file
 
-  if gen_test:
-    test_name = name + "_test"
-    test_file = test_name + ".cc"
-    # Rule to rewrite test.cc to produce the test_file.
+    # Rule that runs tfcompile to produce the header and object file.
+    header_file = name + ".h"
+    metadata_object_file = name + "_tfcompile_metadata.o"
+    function_object_file = name + "_tfcompile_function.o"
+    ep = ("__" + native.package_name() + "__" + name).replace("/", "_")
+    if type(tfcompile_flags) == type(""):
+        flags = tfcompile_flags
+    else:
+        flags = " ".join([
+            "'" + arg.replace("'", "'\\''") + "'"
+            for arg in (tfcompile_flags or [])
+        ])
+    if enable_xla_hlo_profiling:
+        profiling_flag = "--xla_hlo_profile"
+    else:
+        profiling_flag = ""
     native.genrule(
-        name=("gen_" + test_name),
-        testonly=1,
-        srcs=[
-            "//tensorflow/compiler/aot:test.cc",
+        name = ("gen_" + name),
+        srcs = [
+            tfcompile_graph,
+            config,
+        ],
+        outs = [
             header_file,
+            metadata_object_file,
+            function_object_file,
         ],
-        outs=[test_file],
-        cmd=("sed " + sed_replace +
-             " $(location //tensorflow/compiler/aot:test.cc) " +
-             "> $(OUTS)"),
-        tags=tags,
-    )
-
-    # The cc_test rule for the generated code.  To ensure that this works
-    # reliably across build configurations, we must use tf_cc_test instead of
-    # native.cc_test.  This is related to how we build
-    # //tensorflow/core:lib -- see the note in tensorflow/core/BUILD
-    # for more details.
-    tf_cc_test(
-        name=test_name,
-        srcs=[test_file],
-        deps=[
-            ":" + name,
-            "//tensorflow/compiler/aot:runtime",
-            "//tensorflow/compiler/aot:tf_library_test_main",
-            "//tensorflow/compiler/xla:executable_run_options",
-            "//third_party/eigen3",
-            "//tensorflow/core:lib",
-            "//tensorflow/core:test",
-            ],
-        tags=tags,
+        cmd = ("$(location " + tfcompile_tool + ")" +
+               " --graph=$(location " + tfcompile_graph + ")" +
+               " --config=$(location " + config + ")" +
+               " --entry_point=" + ep +
+               " --cpp_class=" + cpp_class +
+               " --target_triple=" + target_llvm_triple() +
+               " --out_header=$(@D)/" + header_file +
+               " --out_metadata_object=$(@D)/" + metadata_object_file +
+               " --out_function_object=$(@D)/" + function_object_file +
+               " " + flags + " " + profiling_flag),
+        tools = [tfcompile_tool],
+        visibility = visibility,
+        testonly = testonly,
+        # Run tfcompile on the build host since it's typically faster on the
+        # local machine.
+        #
+        # Note that setting the local=1 attribute on a *test target* causes the
+        # test infrastructure to skip that test.  However this is a genrule, not
+        # a test target, and runs with --genrule_strategy=forced_forge, meaning
+        # the local=1 attribute is ignored, and the genrule is still run.
+        #
+        # https://www.bazel.io/versions/master/docs/be/general.html#genrule
+        local = 1,
+        tags = tags,
     )
 
-  if gen_benchmark:
-    benchmark_name = name + "_benchmark"
-    benchmark_file = benchmark_name + ".cc"
-    benchmark_main = ("//tensorflow/compiler/aot:" +
-                      "benchmark_main.template")
-
-    # Rule to rewrite benchmark.cc to produce the benchmark_file.
+    # Rule that runs tfcompile to produce the SessionModule proto, useful for
+    # debugging.  TODO(b/64813587): Once the SessionModule proto is
+    # deterministic, move this into the main rule above.
+    session_module_pb = name + "_session_module.pb"
     native.genrule(
-        name=("gen_" + benchmark_name),
-        srcs=[
-            benchmark_main,
-            header_file,
+        name = (name + "_session_module"),
+        srcs = [
+            tfcompile_graph,
+            config,
         ],
+        outs = [
+            session_module_pb,
+        ],
+        cmd = ("$(location " + tfcompile_tool + ")" +
+               " --graph=$(location " + tfcompile_graph + ")" +
+               " --config=$(location " + config + ")" +
+               " --entry_point=" + ep +
+               " --cpp_class=" + cpp_class +
+               " --target_triple=" + target_llvm_triple() +
+               " --out_session_module=$(@D)/" + session_module_pb +
+               " " + flags),
+        tools = [tfcompile_tool],
+        visibility = visibility,
         testonly = testonly,
-        outs=[benchmark_file],
-        cmd=("sed " + sed_replace +
-             " $(location " + benchmark_main + ") " +
-             "> $(OUTS)"),
-        tags=tags,
+        local = 1,
+        tags = tags,
     )
 
-    # The cc_benchmark rule for the generated code.  This does not need the
-    # tf_cc_binary since we (by deliberate design) do not depend on
-    # //tensorflow/core:lib.
-    #
-    # Note: to get smaller size on android for comparison, compile with:
-    #    --copt=-fvisibility=hidden
-    #    --copt=-D_LIBCPP_TYPE_VIS=_LIBCPP_HIDDEN
-    #    --copt=-D_LIBCPP_EXCEPTION_ABI=_LIBCPP_HIDDEN
-    native.cc_binary(
-        name=benchmark_name,
-        srcs=[benchmark_file],
+    # The cc_library rule packaging up the header and object file, and needed
+    # kernel implementations.
+    need_xla_data_proto = (flags and flags.find("--gen_program_shape") != -1)
+    native.cc_library(
+        name = name,
+        srcs = [function_object_file, metadata_object_file],
+        hdrs = [header_file],
+        visibility = visibility,
         testonly = testonly,
-        copts = tf_copts(),
-        linkopts = if_android(["-pie", "-s"]),
-        deps=[
-            ":" + name,
-            "//tensorflow/compiler/aot:benchmark",
-            "//tensorflow/compiler/aot:runtime",
-            "//tensorflow/compiler/xla:executable_run_options",
+        deps = [
+            # These deps are required by all tf_library targets even if
+            # include_standard_runtime_deps is False.  Without them, the
+            # generated code will fail to compile.
+            "//tensorflow/compiler/tf2xla:xla_compiled_cpu_function",
+            "//tensorflow/core:framework_lite",
+        ] + (need_xla_data_proto and [
+            # If we're generating the program shape, we must depend on the
+            # proto.
+            "//tensorflow/compiler/xla:xla_data_proto",
+        ] or []) + (enable_xla_hlo_profiling and [
+            "//tensorflow/compiler/xla/service:hlo_profile_printer_data",
+        ] or []) + (include_standard_runtime_deps and [
+            # TODO(cwhipkey): only depend on kernel code that the model actually
+            # needed.
+            "//tensorflow/compiler/tf2xla/kernels:index_ops_kernel_argmax_float_1d",
+            "//tensorflow/compiler/tf2xla/kernels:index_ops_kernel_argmax_float_2d",
+            "//tensorflow/compiler/xla/service/cpu:runtime_conv2d",
+            "//tensorflow/compiler/xla/service/cpu:runtime_matmul",
+            "//tensorflow/compiler/xla/service/cpu:runtime_single_threaded_conv2d",
+            "//tensorflow/compiler/xla/service/cpu:runtime_single_threaded_matmul",
             "//third_party/eigen3",
-        ] + if_android([
-            "//tensorflow/compiler/aot:benchmark_extra_android",
-        ]),
-        tags=tags,
+        ] or []) + (deps or []),
+        tags = tags,
+    )
+
+    # Variables used for gen_test and gen_benchmark.
+    cpp_class_split = cpp_class.rsplit("::", maxsplit = 2)
+    if len(cpp_class_split) == 1:
+        no_ns_name = cpp_class_split[0]
+    else:
+        no_ns_name = cpp_class_split[1]
+    sed_replace = (
+        "-e \"s|{{TFCOMPILE_HEADER}}|$(location " + header_file + ")|g\" " +
+        "-e \"s|{{TFCOMPILE_CPP_CLASS}}|" + cpp_class + "|g\" " +
+        "-e \"s|{{TFCOMPILE_NAME}}|" + no_ns_name + "|g\" "
     )
 
+    if gen_test:
+        test_name = name + "_test"
+        test_file = test_name + ".cc"
+
+        # Rule to rewrite test.cc to produce the test_file.
+        native.genrule(
+            name = ("gen_" + test_name),
+            testonly = 1,
+            srcs = [
+                "//tensorflow/compiler/aot:test.cc",
+                header_file,
+            ],
+            outs = [test_file],
+            cmd = (
+                "sed " + sed_replace +
+                " $(location //tensorflow/compiler/aot:test.cc) " +
+                "> $(OUTS)"
+            ),
+            tags = tags,
+        )
+
+        # The cc_test rule for the generated code.  To ensure that this works
+        # reliably across build configurations, we must use tf_cc_test instead
+        # of native.cc_test.  This is related to how we build
+        # //tensorflow/core:lib -- see the note in
+        # tensorflow/core/BUILD for more details.
+        tf_cc_test(
+            name = test_name,
+            srcs = [test_file],
+            deps = [
+                ":" + name,
+                "//tensorflow/compiler/aot:runtime",
+                "//tensorflow/compiler/aot:tf_library_test_main",
+                "//tensorflow/compiler/xla:executable_run_options",
+                "//third_party/eigen3",
+                "//tensorflow/core:lib",
+                "//tensorflow/core:test",
+            ],
+            tags = tags,
+        )
+
+    if gen_benchmark:
+        benchmark_name = name + "_benchmark"
+        benchmark_file = benchmark_name + ".cc"
+        benchmark_main = ("//tensorflow/compiler/aot:" +
+                          "benchmark_main.template")
+
+        # Rule to rewrite benchmark.cc to produce the benchmark_file.
+        native.genrule(
+            name = ("gen_" + benchmark_name),
+            srcs = [
+                benchmark_main,
+                header_file,
+            ],
+            testonly = testonly,
+            outs = [benchmark_file],
+            cmd = ("sed " + sed_replace +
+                   " $(location " + benchmark_main + ") " +
+                   "> $(OUTS)"),
+            tags = tags,
+        )
+
+        # The cc_benchmark rule for the generated code.  This does not need the
+        # tf_cc_binary since we (by deliberate design) do not depend on
+        # //tensorflow/core:lib.
+        #
+        # Note: to get smaller size on android for comparison, compile with:
+        #    --copt=-fvisibility=hidden
+        #    --copt=-D_LIBCPP_TYPE_VIS=_LIBCPP_HIDDEN
+        #    --copt=-D_LIBCPP_EXCEPTION_ABI=_LIBCPP_HIDDEN
+        native.cc_binary(
+            name = benchmark_name,
+            srcs = [benchmark_file],
+            testonly = testonly,
+            copts = tf_copts(),
+            linkopts = if_android(["-pie", "-s"]),
+            deps = [
+                ":" + name,
+                "//tensorflow/compiler/aot:benchmark",
+                "//tensorflow/compiler/aot:runtime",
+                "//tensorflow/compiler/xla:executable_run_options",
+                "//third_party/eigen3",
+            ] + if_android([
+                "//tensorflow/compiler/aot:benchmark_extra_android",
+            ]),
+            tags = tags,
+        )
+
 def target_llvm_triple():
-  """Returns the target LLVM triple to be used for compiling the target."""
-  # TODO(toddw): Add target_triple for other targets.  For details see:
-  # http://llvm.org/docs/doxygen/html/Triple_8h_source.html
-  return select({
-      "//tensorflow:android_armeabi": "armv5-none-android",
-      "//tensorflow:android_arm": "armv7-none-android",
-      "//tensorflow:android_arm64": "aarch64-none-android",
-      "//tensorflow:android_x86": "i686-none-android",
-      "//tensorflow:linux_ppc64le": "ppc64le-ibm-linux-gnu",
-      "//tensorflow:darwin": "x86_64-none-darwin",
-      "//conditions:default": "x86_64-pc-linux",
-  })
+    """Returns the target LLVM triple to be used for compiling the target."""
+
+    # TODO(toddw): Add target_triple for other targets.  For details see:
+    # http://llvm.org/docs/doxygen/html/Triple_8h_source.html
+    return select({
+        "//tensorflow:android_armeabi": "armv5-none-android",
+        "//tensorflow:android_arm": "armv7-none-android",
+        "//tensorflow:android_arm64": "aarch64-none-android",
+        "//tensorflow:android_x86": "i686-none-android",
+        "//tensorflow:linux_ppc64le": "ppc64le-ibm-linux-gnu",
+        "//tensorflow:darwin": "x86_64-none-darwin",
+        "//conditions:default": "x86_64-pc-linux",
+    })
diff --git a/tensorflow/compiler/jit/BUILD b/tensorflow/compiler/jit/BUILD
index c2245b8eae..377cd11f44 100644
--- a/tensorflow/compiler/jit/BUILD
+++ b/tensorflow/compiler/jit/BUILD
@@ -166,6 +166,7 @@ cc_library(
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/service:stream_pool",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework",
@@ -301,14 +302,33 @@ tf_cc_test(
 )
 
 cc_library(
+    name = "deadness_analysis_internal",
+    testonly = True,
+    srcs = [
+        "deadness_analysis.cc",
+    ],
+    hdrs = [
+        "deadness_analysis.h",
+        "deadness_analysis_internal.h",
+    ],
+    deps = [
+        "//tensorflow/core:graph",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
     name = "compilation_passes",
     srcs = [
         "build_xla_launch_ops_pass.cc",
+        "deadness_analysis.cc",
+        "deadness_analysis_internal.h",
         "encapsulate_subgraphs_pass.cc",
         "mark_for_compilation_pass.cc",
     ],
     hdrs = [
         "build_xla_launch_ops_pass.h",
+        "deadness_analysis.h",
         "encapsulate_subgraphs_pass.h",
         "mark_for_compilation_pass.h",
     ],
@@ -325,6 +345,7 @@ cc_library(
         "//tensorflow/compiler/tf2xla:dump_graph",
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:util",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework",
@@ -374,6 +395,33 @@ tf_cc_test(
 )
 
 tf_cc_test(
+    name = "deadness_analysis_test",
+    size = "small",
+    srcs = [
+        "deadness_analysis_test.cc",
+    ],
+    deps = [
+        ":common",
+        ":deadness_analysis_internal",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:cc_ops_internal",
+        "//tensorflow/cc:function_ops",
+        "//tensorflow/cc:ops",
+        "//tensorflow/cc:sendrecv_ops",
+        "//tensorflow/compiler/jit/kernels:xla_launch_op",
+        "//tensorflow/compiler/tf2xla:xla_compiler",
+        "//tensorflow/compiler/tf2xla/kernels:xla_ops",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
+
+tf_cc_test(
     name = "compilation_passes_test",
     size = "small",
     srcs = [
@@ -387,6 +435,7 @@ tf_cc_test(
         "//tensorflow/cc:cc_ops_internal",
         "//tensorflow/cc:function_ops",
         "//tensorflow/cc:ops",
+        "//tensorflow/cc:sendrecv_ops",
         "//tensorflow/compiler/jit/kernels:xla_launch_op",
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/tf2xla/kernels:xla_ops",
@@ -458,6 +507,7 @@ cc_library(
     visibility = ["//visibility:public"],
     deps = [
         ":common",
+        ":compilation_passes",
         ":union_find",
         ":xla_cluster_util",
         "//tensorflow/compiler/jit/graphcycles",
diff --git a/tensorflow/compiler/jit/deadness_analysis.cc b/tensorflow/compiler/jit/deadness_analysis.cc
new file mode 100644
index 0000000000..8aff87e5e6
--- /dev/null
+++ b/tensorflow/compiler/jit/deadness_analysis.cc
@@ -0,0 +1,592 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/jit/deadness_analysis.h"
+#include "tensorflow/compiler/jit/deadness_analysis_internal.h"
+#include "tensorflow/core/graph/algorithm.h"
+#include "tensorflow/core/graph/tensor_id.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
+#include "tensorflow/core/lib/hash/hash.h"
+
+// ALGORITHM OVERVIEW
+//
+// We map every output produced by each node in the TensorFlow graph (including
+// control dependence) into an instance of the Predicate class.  Instances of
+// Predicate denote logical formulas and mapping a node `n` to a predicate
+// `pred` implies that `n` is executed whenver `pred` is true.  Then we can
+// deduce mismatching liveness in the inputs to node by comparing the predicate
+// those inputs are mapped to.
+//
+// Loops are handled pessimistically -- we map Merge nodes with backedges to
+// uninterpreted symbols (the same kind we use to represent Switch and _Recv).
+// Predicate equality has to hold over all possible assignments to these
+// uninterpreted symbols.
+
+namespace tensorflow {
+
+namespace {
+
+// Represents a logical predicate, used as described in the algorithm overview
+// above.
+class Predicate {
+ public:
+  enum class Kind { kAnd, kOr, kNot, kSymbol };
+
+  virtual string ToString() const = 0;
+  int64 hash() const { return hash_; }
+
+  virtual Kind kind() const = 0;
+  virtual ~Predicate() {}
+
+ protected:
+  explicit Predicate(int64 hash) : hash_(hash) {}
+
+ private:
+  const int64 hash_;
+
+  TF_DISALLOW_COPY_AND_ASSIGN(Predicate);
+};
+
+int64 HashPredicateSequence(Predicate::Kind kind,
+                            gtl::ArraySlice<Predicate*> preds) {
+  int64 hash = ::tensorflow::hash<Predicate::Kind>()(kind);
+  for (Predicate* pred : preds) {
+    hash = Hash64Combine(hash, pred->hash());
+  }
+  return hash;
+}
+
+// Represents a logical conjunction of a set of predicates.
+class AndPredicate : public Predicate {
+ public:
+  explicit AndPredicate(std::vector<Predicate*> operands)
+      : Predicate(HashPredicateSequence(Kind::kAnd, operands)),
+        operands_(std::move(operands)) {}
+
+  string ToString() const override {
+    if (operands().empty()) {
+      return "#true";
+    }
+
+    std::vector<string> operands_str;
+    std::transform(operands().begin(), operands().end(),
+                   std::back_inserter(operands_str),
+                   [](Predicate* pred) { return pred->ToString(); });
+
+    return strings::StrCat("(", str_util::Join(operands_str, " & "), ")");
+  }
+
+  Kind kind() const override { return Kind::kAnd; }
+
+  const gtl::ArraySlice<Predicate*> operands() const { return operands_; }
+
+ private:
+  std::vector<Predicate*> operands_;
+};
+
+// Represents a logical disjunction of a set of predicates.
+class OrPredicate : public Predicate {
+ public:
+  explicit OrPredicate(std::vector<Predicate*> operands)
+      : Predicate(HashPredicateSequence(Kind::kOr, operands)),
+        operands_(std::move(operands)) {}
+
+  string ToString() const override {
+    if (operands().empty()) {
+      return "#false";
+    }
+
+    std::vector<string> operands_str;
+    std::transform(operands().begin(), operands().end(),
+                   std::back_inserter(operands_str),
+                   [](Predicate* pred) { return pred->ToString(); });
+
+    return strings::StrCat("(", str_util::Join(operands_str, " | "), ")");
+  }
+
+  Kind kind() const override { return Kind::kOr; }
+  const gtl::ArraySlice<Predicate*> operands() const { return operands_; }
+
+ private:
+  std::vector<Predicate*> operands_;
+};
+
+// Represents a logical negation of a set of predicates.
+class NotPredicate : public Predicate {
+ public:
+  explicit NotPredicate(Predicate* operand)
+      : Predicate(HashPredicateSequence(Kind::kNot, {operand})),
+        operand_(operand) {}
+
+  string ToString() const override {
+    return strings::StrCat("~", operand()->ToString());
+  }
+
+  Kind kind() const override { return Kind::kNot; }
+  Predicate* operand() const { return operand_; }
+
+ private:
+  Predicate* operand_;
+};
+
+// Represents an uninterpreted symbol in a logical predicate.
+//
+// Two predicates are equivalent iff they are equivalent for all assignments to
+// the symbols contained in them.
+class SymbolPredicate : public Predicate {
+ public:
+  explicit SymbolPredicate(TensorId tensor_id, bool must_be_true)
+      : Predicate(Hash(tensor_id, must_be_true)),
+        tensor_id_(std::move(tensor_id)),
+        must_be_true_(must_be_true) {}
+
+  string ToString() const override {
+    return must_be_true() ? strings::StrCat("*", tensor_id_.ToString())
+                          : tensor_id_.ToString();
+  }
+
+  Kind kind() const override { return Kind::kSymbol; }
+
+  // If `must_be_true()` is true this SymbolPredicate represents the proposition
+  // "tensor_id() is live and evaluates to true".
+  //
+  // If `must_be_true()` is false then this SymbolPredicate represents the
+  // proposition "tensor_id() is live (and may evalutate to any value)"
+  TensorId tensor_id() const { return tensor_id_; }
+  bool must_be_true() const { return must_be_true_; }
+
+ private:
+  TensorId tensor_id_;
+  bool must_be_true_;
+
+  static int64 Hash(const TensorId tensor_id, bool must_be_true) {
+    return Hash64Combine(
+        ::tensorflow::hash<bool>()(must_be_true),
+        Hash64Combine(::tensorflow::hash<Predicate::Kind>()(Kind::kSymbol),
+                      TensorId::Hasher{}(tensor_id)));
+  }
+};
+
+// Creates and owns Predicate instances.  Simplifies predicates as it creates
+// them.
+class PredicateFactory {
+ public:
+  Predicate* MakeAndPredicate(gtl::ArraySlice<Predicate*> operands) {
+    return MakeAndOrImpl(operands, /*is_and=*/true);
+  }
+
+  Predicate* MakeOrPredicate(gtl::ArraySlice<Predicate*> operands) {
+    return MakeAndOrImpl(operands, /*is_and=*/false);
+  }
+
+  Predicate* MakeNotPredicate(Predicate* pred) {
+    SignatureForNot signature = pred;
+    auto it = interned_not_instances_.find(signature);
+    if (it == interned_not_instances_.end()) {
+      std::unique_ptr<Predicate> new_pred = Make<NotPredicate>(pred);
+      Predicate* new_pred_ptr = new_pred.get();
+      interned_not_instances_.emplace(signature, std::move(new_pred));
+      return new_pred_ptr;
+    } else {
+      return it->second.get();
+    }
+  }
+
+  Predicate* MakeSymbolPredicate(TensorId tensor_id, bool must_be_true) {
+    SignatureForSymbol signature = {tensor_id, must_be_true};
+    auto it = interned_symbol_instances_.find(signature);
+    if (it == interned_symbol_instances_.end()) {
+      std::unique_ptr<Predicate> new_pred =
+          Make<SymbolPredicate>(tensor_id, must_be_true);
+      Predicate* new_pred_ptr = new_pred.get();
+      interned_symbol_instances_.emplace(std::move(signature),
+                                         std::move(new_pred));
+      return new_pred_ptr;
+    } else {
+      return it->second.get();
+    }
+  }
+
+  Predicate* MakeTrue() { return MakeAndPredicate({}); }
+  Predicate* MakeFalse() { return MakeOrPredicate({}); }
+
+ private:
+  template <typename PredicateT, typename... Args>
+  std::unique_ptr<Predicate> Make(Args&&... args) {
+    return std::unique_ptr<PredicateT>(
+        new PredicateT(std::forward<Args>(args)...));
+  }
+
+  Predicate* MakeAndOrImpl(gtl::ArraySlice<Predicate*> operands, bool is_and);
+
+  // Predicate instances are interned, meaning that there is only a single
+  // instance of a Predicate object with a given content.  This makes checking
+  // for structural equality super-cheap -- we can just compare pointers.
+  //
+  // We intern predicates by maintaining a map from the content of a Predicate
+  // to the only instance of said predicate we allow to exist in the
+  // interned_and_or_instances_, interned_not_instances_ and
+  // interned_symbol_instances_ fields.  These maps also double up as storage
+  // for the owning pointers to predicate instances.
+
+  using SignatureForAndOr =
+      std::pair<Predicate::Kind, gtl::ArraySlice<Predicate*>>;
+  using SignatureForNot = Predicate*;
+  using SignatureForSymbol = std::pair<SafeTensorId, bool>;
+
+  struct HashSignatureForAndOr {
+    size_t operator()(const SignatureForAndOr& signature) const {
+      size_t hash = ::tensorflow::hash<Predicate::Kind>()(signature.first);
+      for (Predicate* p : signature.second) {
+        hash = Hash64Combine(hash, ::tensorflow::hash<Predicate*>()(p));
+      }
+      return hash;
+    }
+  };
+
+  struct HashSignatureForSymbol {
+    size_t operator()(const SignatureForSymbol& signature) const {
+      return Hash64Combine(SafeTensorId::Hasher()(signature.first),
+                           ::tensorflow::hash<bool>()(signature.second));
+    }
+  };
+
+  gtl::FlatMap<SignatureForAndOr, std::unique_ptr<Predicate>,
+               HashSignatureForAndOr>
+      interned_and_or_instances_;
+  gtl::FlatMap<SignatureForNot, std::unique_ptr<Predicate>>
+      interned_not_instances_;
+  gtl::FlatMap<SignatureForSymbol, std::unique_ptr<Predicate>,
+               HashSignatureForSymbol>
+      interned_symbol_instances_;
+};
+
+// Common code to create AndPredicate or OrPredicate instances.
+Predicate* PredicateFactory::MakeAndOrImpl(gtl::ArraySlice<Predicate*> operands,
+                                           bool is_and) {
+  Predicate::Kind pred_kind =
+      is_and ? Predicate::Kind::kAnd : Predicate::Kind::kOr;
+  gtl::FlatSet<Predicate*> simplified_ops_set;
+  std::vector<Predicate*> simplified_ops;
+  for (Predicate* op : operands) {
+    // Simplify A&A => A and  A|A => A.
+    if (!simplified_ops_set.insert(op).second) {
+      continue;
+    }
+
+    if (op->kind() == pred_kind) {
+      // "Inline" the operands of an inner And/Or into the parent And/Or.
+      gtl::ArraySlice<Predicate*> operands =
+          is_and ? dynamic_cast<AndPredicate*>(op)->operands()
+                 : dynamic_cast<OrPredicate*>(op)->operands();
+      for (Predicate* subop : operands) {
+        if (simplified_ops_set.insert(subop).second) {
+          simplified_ops.push_back(subop);
+        }
+      }
+    } else {
+      simplified_ops.push_back(op);
+    }
+  }
+
+  if (simplified_ops.size() == 1) {
+    return simplified_ops[0];
+  }
+
+  // Simplify "A&~A=>False" and "A|~A=>True".
+  gtl::FlatSet<Predicate*> negated_ops;
+  for (Predicate* op : simplified_ops) {
+    if (op->kind() == Predicate::Kind::kNot) {
+      negated_ops.insert(dynamic_cast<NotPredicate&>(*op).operand());
+    }
+  }
+
+  for (Predicate* op : simplified_ops) {
+    if (negated_ops.count(op)) {
+      return is_and ? MakeFalse() : MakeTrue();
+    }
+  }
+
+  std::stable_sort(
+      simplified_ops.begin(), simplified_ops.end(),
+      [](Predicate* a, Predicate* b) { return a->hash() < b->hash(); });
+
+  auto it = interned_and_or_instances_.find({pred_kind, simplified_ops});
+  if (it == interned_and_or_instances_.end()) {
+    simplified_ops.shrink_to_fit();
+    // NB!  Because we'll use a non-owning reference to simplified_ops in the
+    // key for interned_and_or_instances_ we need to be careful to std::move()
+    // it all the way through.
+    gtl::ArraySlice<Predicate*> operands_slice = simplified_ops;
+    std::unique_ptr<Predicate> new_pred =
+        is_and ? Make<AndPredicate>(std::move(simplified_ops))
+               : Make<OrPredicate>(std::move(simplified_ops));
+
+    Predicate* new_pred_ptr = new_pred.get();
+    CHECK(interned_and_or_instances_
+              .emplace(SignatureForAndOr(pred_kind, operands_slice),
+                       std::move(new_pred))
+              .second);
+    return new_pred_ptr;
+  } else {
+    return it->second.get();
+  }
+}
+
+class DeadnessAnalysisImpl : public DeadnessAnalysis {
+ public:
+  explicit DeadnessAnalysisImpl(const Graph* graph)
+      : graph_(*graph), vlog_(VLOG_IS_ON(2)) {}
+
+  Status Populate();
+  bool HasInputsWithMismatchingDeadness(const Node& node) override;
+  void Print() const override;
+  gtl::FlatMap<TensorId, string, TensorId::Hasher> PredicateMapAsString() const;
+
+ private:
+  enum class EdgeKind { kDataAndControl, kDataOnly, kControlOnly };
+
+  std::vector<Predicate*> GetIncomingPreds(Node* n, EdgeKind edge_kind);
+  void SetPred(Node* n, int output_idx, Predicate* pred) {
+    CHECK(
+        predicate_map_.insert({TensorId(n->name(), output_idx), pred}).second);
+  }
+  void SetPred(Node* n, gtl::ArraySlice<int> output_idxs, Predicate* pred) {
+    for (int output_idx : output_idxs) {
+      SetPred(n, output_idx, pred);
+    }
+  }
+
+  Status HandleSwitch(Node* n);
+  Status HandleMerge(Node* n);
+  Status HandleRecv(Node* n);
+  Status HandleGeneric(Node* n);
+
+  const Graph& graph_;
+  gtl::FlatMap<TensorId, Predicate*, TensorId::Hasher> predicate_map_;
+  PredicateFactory predicate_factory_;
+  bool vlog_;
+};
+
+TensorId InputEdgeToTensorId(const Edge* e) {
+  return TensorId(e->src()->name(), e->src_output());
+}
+
+std::vector<Predicate*> DeadnessAnalysisImpl::GetIncomingPreds(
+    Node* n, DeadnessAnalysisImpl::EdgeKind edge_kind) {
+  std::vector<Predicate*> incoming_preds;
+  for (const Edge* in_edge : n->in_edges()) {
+    bool should_process =
+        edge_kind == EdgeKind::kDataAndControl ||
+        (in_edge->IsControlEdge() && edge_kind == EdgeKind::kControlOnly) ||
+        (!in_edge->IsControlEdge() && edge_kind == EdgeKind::kDataOnly);
+
+    if (should_process) {
+      auto it = predicate_map_.find(InputEdgeToTensorId(in_edge));
+      CHECK(it != predicate_map_.end());
+      incoming_preds.push_back(it->second);
+    }
+  }
+  return incoming_preds;
+}
+
+Status DeadnessAnalysisImpl::HandleSwitch(Node* n) {
+  std::vector<Predicate*> input_preds =
+      GetIncomingPreds(n, EdgeKind::kDataAndControl);
+  const Edge* pred_edge;
+  TF_RETURN_IF_ERROR(n->input_edge(1, &pred_edge));
+  Predicate* true_switch = predicate_factory_.MakeSymbolPredicate(
+      TensorId(pred_edge->src()->name(), pred_edge->src_output()),
+      /*must_be_true=*/true);
+  Predicate* false_switch = predicate_factory_.MakeNotPredicate(true_switch);
+
+  // Output 0 is alive iff all inputs are alive and the condition is false.
+  input_preds.push_back(false_switch);
+  SetPred(n, 0, predicate_factory_.MakeAndPredicate(input_preds));
+  input_preds.pop_back();
+
+  // Output 1 is alive iff all inputs are alive and the condition is true.
+  input_preds.push_back(true_switch);
+  SetPred(n, 1, predicate_factory_.MakeAndPredicate(input_preds));
+  input_preds.pop_back();
+
+  // Control is alive iff any inputs are alive.
+  SetPred(n, Graph::kControlSlot,
+          predicate_factory_.MakeAndPredicate(input_preds));
+
+  return Status::OK();
+}
+
+Status DeadnessAnalysisImpl::HandleMerge(Node* n) {
+  // Merge ignores deadness of its control inputs.  A merge that isn't the
+  // target of a backedge has is alive iff any of its data inputs are.  We treat
+  // the liveness of a merge that is the target of a backedge symbolically.
+
+  bool has_backedge = std::any_of(
+      n->in_edges().begin(), n->in_edges().end(), [](const Edge* e) {
+        return !e->IsControlEdge() && e->src()->IsNextIteration();
+      });
+
+  Predicate* input_data_pred =
+      has_backedge ? predicate_factory_.MakeSymbolPredicate(
+                         TensorId(n->name(), 0), /*must_be_true=*/false)
+                   : predicate_factory_.MakeOrPredicate(
+                         GetIncomingPreds(n, EdgeKind::kDataOnly));
+
+  SetPred(n, {0, 1, Graph::kControlSlot}, input_data_pred);
+  return Status::OK();
+}
+
+Status DeadnessAnalysisImpl::HandleRecv(Node* n) {
+  // In addition to being alive or dead based on the inputs, a _Recv can also
+  // acquire a dead signal from a _Send.
+  std::vector<Predicate*> input_preds =
+      GetIncomingPreds(n, EdgeKind::kDataAndControl);
+  input_preds.push_back(predicate_factory_.MakeSymbolPredicate(
+      TensorId(n->name(), 0), /*must_be_true=*/false));
+  SetPred(n, {0, Graph::kControlSlot},
+          predicate_factory_.MakeAndPredicate(input_preds));
+  return Status::OK();
+}
+
+Status DeadnessAnalysisImpl::HandleGeneric(Node* n) {
+  // Generally nodes are alive iff all their inputs are alive.
+  Predicate* pred = predicate_factory_.MakeAndPredicate(
+      GetIncomingPreds(n, EdgeKind::kDataAndControl));
+  for (int output_idx = 0; output_idx < n->num_outputs(); output_idx++) {
+    SetPred(n, output_idx, pred);
+  }
+  SetPred(n, Graph::kControlSlot, pred);
+  return Status::OK();
+}
+
+Status DeadnessAnalysisImpl::Populate() {
+  std::vector<Node*> rpo;
+  GetReversePostOrder(graph_, &rpo, /*stable_comparator=*/{},
+                      /*edge_filter=*/[](const Edge& edge) {
+                        return !edge.src()->IsNextIteration();
+                      });
+
+  // This an abstract interpretation over the deadness propagation semantics of
+  // the graph executor.
+  for (Node* n : rpo) {
+    if (n->IsSwitch()) {
+      TF_RETURN_IF_ERROR(HandleSwitch(n));
+    } else if (n->IsMerge()) {
+      TF_RETURN_IF_ERROR(HandleMerge(n));
+    } else if (n->IsControlTrigger()) {
+      SetPred(n, Graph::kControlSlot, predicate_factory_.MakeTrue());
+    } else if (n->IsRecv() || n->IsHostRecv()) {
+      TF_RETURN_IF_ERROR(HandleRecv(n));
+    } else {
+      TF_RETURN_IF_ERROR(HandleGeneric(n));
+    }
+  }
+
+  return Status::OK();
+}
+
+bool DeadnessAnalysisImpl::HasInputsWithMismatchingDeadness(const Node& node) {
+  CHECK(!node.IsMerge());
+
+  if (vlog_) {
+    VLOG(2) << "HasInputsWithMismatchingDeadness(" << node.name() << ")";
+  }
+
+  Predicate* pred = nullptr;
+  for (const Edge* edge : node.in_edges()) {
+    auto it = predicate_map_.find(InputEdgeToTensorId(edge));
+    CHECK(it != predicate_map_.end());
+    if (vlog_) {
+      VLOG(2) << "  " << InputEdgeToTensorId(edge).ToString() << ": "
+              << it->second->ToString();
+    }
+
+    // Today we just compare the predicates for equality (with some
+    // canonicalization/simplification happening before) but we could be more
+    // sophisticated here if need be.  Comparing pointers is sufficient because
+    // we intern Predicate instances by their content.
+    if (pred != nullptr && pred != it->second) {
+      if (vlog_) {
+        VLOG(2) << "HasInputsWithMismatchingDeadness(" << node.name()
+                << ") -> true";
+      }
+      return true;
+    }
+    pred = it->second;
+  }
+
+  if (vlog_) {
+    VLOG(2) << "HasInputsWithMismatchingDeadness(" << node.name()
+            << ") -> false";
+  }
+
+  return false;
+}
+
+void DeadnessAnalysisImpl::Print() const {
+  std::vector<TensorId> tensor_ids;
+  for (const auto& kv_pair : predicate_map_) {
+    tensor_ids.push_back(kv_pair.first);
+  }
+
+  std::sort(tensor_ids.begin(), tensor_ids.end());
+
+  for (TensorId tensor_id : tensor_ids) {
+    auto it = predicate_map_.find(tensor_id);
+    CHECK(it != predicate_map_.end()) << tensor_id.ToString();
+    VLOG(2) << tensor_id.ToString() << " -> " << it->second->ToString();
+  }
+}
+
+}  // namespace
+
+DeadnessAnalysis::~DeadnessAnalysis() {}
+
+/*static*/ Status DeadnessAnalysis::Run(
+    const Graph& graph, std::unique_ptr<DeadnessAnalysis>* result) {
+  std::unique_ptr<DeadnessAnalysisImpl> analysis(
+      new DeadnessAnalysisImpl(&graph));
+  TF_RETURN_IF_ERROR(analysis->Populate());
+
+  if (VLOG_IS_ON(2)) {
+    analysis->Print();
+  }
+
+  *result = std::move(analysis);
+  return Status::OK();
+}
+
+gtl::FlatMap<TensorId, string, TensorId::Hasher>
+DeadnessAnalysisImpl::PredicateMapAsString() const {
+  gtl::FlatMap<TensorId, string, TensorId::Hasher> result;
+  std::vector<TensorId> tensor_ids;
+  for (const auto& kv_pair : predicate_map_) {
+    CHECK(result.insert({kv_pair.first, kv_pair.second->ToString()}).second);
+  }
+  return result;
+}
+
+namespace deadness_analysis_internal {
+Status ComputePredicates(const Graph& graph,
+                         PredicateMapTy* out_predicate_map) {
+  DeadnessAnalysisImpl impl(&graph);
+  TF_RETURN_IF_ERROR(impl.Populate());
+  *out_predicate_map = impl.PredicateMapAsString();
+  return Status::OK();
+}
+}  // namespace deadness_analysis_internal
+
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/deadness_analysis.h b/tensorflow/compiler/jit/deadness_analysis.h
new file mode 100644
index 0000000000..6e7ab41161
--- /dev/null
+++ b/tensorflow/compiler/jit/deadness_analysis.h
@@ -0,0 +1,68 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_JIT_DEADNESS_ANALYSIS_H_
+#define TENSORFLOW_COMPILER_JIT_DEADNESS_ANALYSIS_H_
+
+#include "tensorflow/core/graph/graph.h"
+
+namespace tensorflow {
+
+// This analyzes a TensorFlow graph to identify nodes which may have partially
+// dead inputs (i.e. these nodes may have some dead inputs and some alive
+// inputs).
+//
+// For example, the ADD node in the following graph
+//
+//      V0  PRED0    V1  PRED1
+//       |    |       |    |
+//       v    v       v    v
+//       SWITCH       SWITCH
+//          |            |
+//          +---+   + ---+
+//              |   |
+//              v   v
+//               ADD
+//
+// can have its inputs independently dead or alive based on the runtime values
+// of PRED0 and PRED1.
+//
+// It is tempting to call this a liveness analysis but I avoided that because
+// "liveness" already has other connotations.
+class DeadnessAnalysis {
+ public:
+  // Returns true if `node` may have some live inputs and some dead inputs.
+  //
+  // This is a conservatively correct routine -- if it returns false then `node`
+  // is guaranteed to not have inputs with mismatching liveness, but not the
+  // converse.
+  //
+  // REQUIRES: node is not a Merge operation.
+  virtual bool HasInputsWithMismatchingDeadness(const Node& node) = 0;
+
+  // Prints out the internal state of this instance.  For debugging purposes
+  // only.
+  virtual void Print() const = 0;
+  virtual ~DeadnessAnalysis();
+
+  // Run the deadness analysis over `graph` and returns an error or a populated
+  // instance of DeadnessAnalysis in `result`.
+  static Status Run(const Graph& graph,
+                    std::unique_ptr<DeadnessAnalysis>* result);
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_JIT_DEADNESS_ANALYSIS_H_
diff --git a/tensorflow/compiler/jit/deadness_analysis_internal.h b/tensorflow/compiler/jit/deadness_analysis_internal.h
new file mode 100644
index 0000000000..cdef405110
--- /dev/null
+++ b/tensorflow/compiler/jit/deadness_analysis_internal.h
@@ -0,0 +1,32 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_JIT_DEADNESS_ANALYSIS_INTERNAL_H_
+#define TENSORFLOW_COMPILER_JIT_DEADNESS_ANALYSIS_INTERNAL_H_
+
+#include "tensorflow/core/graph/tensor_id.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
+
+namespace tensorflow {
+namespace deadness_analysis_internal {
+
+// Returns a map describing the predicate each Tensor was mapped to.  For
+// testing purposes only.
+using PredicateMapTy = gtl::FlatMap<TensorId, string, TensorId::Hasher>;
+Status ComputePredicates(const Graph& graph, PredicateMapTy* out_predicate_map);
+}  // namespace deadness_analysis_internal
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_JIT_DEADNESS_ANALYSIS_INTERNAL_H_
diff --git a/tensorflow/compiler/jit/deadness_analysis_test.cc b/tensorflow/compiler/jit/deadness_analysis_test.cc
new file mode 100644
index 0000000000..6881095b51
--- /dev/null
+++ b/tensorflow/compiler/jit/deadness_analysis_test.cc
@@ -0,0 +1,467 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/jit/deadness_analysis.h"
+
+#include "tensorflow/cc/framework/ops.h"
+#include "tensorflow/cc/ops/array_ops.h"
+#include "tensorflow/cc/ops/control_flow_ops_internal.h"
+#include "tensorflow/cc/ops/function_ops.h"
+#include "tensorflow/cc/ops/sendrecv_ops.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/compiler/jit/deadness_analysis_internal.h"
+#include "tensorflow/compiler/jit/defs.h"
+#include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
+#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
+#include "tensorflow/core/framework/node_def_util.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/graph/algorithm.h"
+#include "tensorflow/core/graph/graph_constructor.h"
+#include "tensorflow/core/graph/graph_def_builder.h"
+#include "tensorflow/core/graph/graph_def_builder_util.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace {
+
+Status AnalyzeDeadness(Graph* graph,
+                       std::unique_ptr<DeadnessAnalysis>* result) {
+  FixupSourceAndSinkEdges(graph);
+  return DeadnessAnalysis::Run(*graph, result);
+}
+
+ops::Switch CreateSwitch(const Scope& root, const string& prefix) {
+  Output value = ops::Placeholder(root.WithOpName(prefix + "/value"), DT_FLOAT);
+  Output predicate =
+      ops::Placeholder(root.WithOpName(prefix + "/pred"), DT_BOOL);
+  return ops::Switch(root.WithOpName(prefix + "/switch"), value, predicate);
+}
+
+Output CreateInductionVariable(const Scope& root, const string& prefix,
+                               const string& frame_name, int32 init) {
+  Output initial_value = ops::Const(root.WithOpName(prefix + "/init"), init);
+  Output enter_initial_value = ops::internal::Enter(
+      root.WithOpName(prefix + "/enter"), initial_value, frame_name);
+
+  ops::Merge iv(root.WithOpName(prefix + "/iv"), {enter_initial_value});
+  Output increment_by = ops::Const(root.WithOpName(prefix + "/incr"), 1);
+  Output final_value = ops::Const(root.WithOpName(prefix + "/final"), 10);
+  Output loop_cond_expr =
+      ops::Less(root.WithOpName(prefix + "/less"), iv.output, final_value);
+  Output loop_cond =
+      ops::LoopCond(root.WithOpName(prefix + "/cond"), loop_cond_expr);
+  ops::Switch latch(root.WithOpName(prefix + "/latch"), iv.output, loop_cond);
+  ops::internal::Exit exit(root.WithOpName(prefix + "/exit"), iv.output);
+  Output iv_next =
+      ops::Add(root.WithOpName(prefix + "/ivnext"), iv.output, increment_by);
+  Output next_iteration =
+      ops::NextIteration(root.WithOpName(prefix + "next_iteration"), iv_next);
+
+  root.graph()->AddEdge(next_iteration.node(), 0, iv.output.node(), 1);
+  root.graph()->AddControlEdge(iv.output.node(), increment_by.node());
+  root.graph()->AddControlEdge(iv.output.node(), final_value.node());
+
+  return iv.output;
+}
+
+TEST(DeadnessAnalysisTest, BasicPositive) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw = CreateSwitch(root, "0");
+  Output add =
+      ops::Add(root.WithOpName("add"), sw.output_true, sw.output_false);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, BasicNegative) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  Output a = ops::Placeholder(root.WithOpName("a"), DT_FLOAT);
+  Output b = ops::Placeholder(root.WithOpName("b"), DT_FLOAT);
+  Output add = ops::Add(root.WithOpName("add"), a, b);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, AndIsCommutative) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw_0 = CreateSwitch(root, "0");
+  ops::Switch sw_1 = CreateSwitch(root, "1");
+
+  Output a0 =
+      ops::Add(root.WithOpName("a0"), sw_0.output_false, sw_1.output_false);
+  Output a1 =
+      ops::Add(root.WithOpName("a1"), sw_1.output_false, sw_0.output_false);
+
+  Output b0 =
+      ops::Add(root.WithOpName("b0"), sw_0.output_false, sw_1.output_true);
+  Output b1 =
+      ops::Add(root.WithOpName("b1"), sw_1.output_true, sw_0.output_false);
+
+  Output live0 = ops::Add(root.WithOpName("live0"), a0, a1);
+  Output live1 = ops::Add(root.WithOpName("live1"), b0, b1);
+
+  Output halfdead0 = ops::Add(root.WithOpName("halfdead0"), a0, b0);
+  Output halfdead1 = ops::Add(root.WithOpName("halfdead1"), a1, b1);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*live0.node()));
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*live1.node()));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*halfdead0.node()));
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*halfdead1.node()));
+}
+
+TEST(DeadnessAnalysisTest, AndIsAssociative) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw_0 = CreateSwitch(root, "0");
+  ops::Switch sw_1 = CreateSwitch(root, "1");
+  ops::Switch sw_2 = CreateSwitch(root, "2");
+
+  Output a0 =
+      ops::Add(root.WithOpName("a0"), sw_0.output_false, sw_1.output_false);
+  Output a1 = ops::Add(root.WithOpName("a1"), a0, sw_2.output_false);
+
+  Output b0 =
+      ops::Add(root.WithOpName("b0"), sw_1.output_false, sw_2.output_false);
+  Output b1 = ops::Add(root.WithOpName("b1"), sw_0.output_false, b0);
+
+  Output add = ops::Add(root.WithOpName("add"), a1, b1);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, OrIsCommutative) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw_0 = CreateSwitch(root, "0");
+  ops::Switch sw_1 = CreateSwitch(root, "1");
+
+  ops::Merge m0(root.WithOpName("m0"), {sw_0.output_false, sw_1.output_false});
+  ops::Merge m1(root.WithOpName("m1"), {sw_1.output_false, sw_0.output_false});
+  ops::Merge m2(root.WithOpName("m2"), {sw_0.output_false, sw_1.output_true});
+  ops::Merge m3(root.WithOpName("m3"), {sw_1.output_true, sw_0.output_false});
+
+  Output live0 = ops::Add(root.WithOpName("live0"), m0.output, m1.output);
+  Output live1 = ops::Add(root.WithOpName("live1"), m2.output, m3.output);
+
+  Output halfdead0 =
+      ops::Add(root.WithOpName("halfdead0"), m0.output, m2.output);
+  Output halfdead1 =
+      ops::Add(root.WithOpName("halfdead1"), m1.output, m3.output);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*live0.node()));
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*live1.node()));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*halfdead0.node()));
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*halfdead1.node()));
+}
+
+TEST(DeadnessAnalysisTest, OrIsAssociative) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw_0 = CreateSwitch(root, "0");
+  ops::Switch sw_1 = CreateSwitch(root, "1");
+  ops::Switch sw_2 = CreateSwitch(root, "2");
+
+  ops::Merge m0(root.WithOpName("m0"), {sw_0.output_false, sw_1.output_false});
+  ops::Merge m1(root.WithOpName("m1"), {m0.output, sw_2.output_false});
+  ops::Merge m2(root.WithOpName("m2"), {sw_1.output_false, sw_2.output_false});
+  ops::Merge m3(root.WithOpName("m3"), {sw_0.output_false, m2.output});
+
+  Output add = ops::Add(root.WithOpName("add"), m1.output, m3.output);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, AndOfOr) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw_0 = CreateSwitch(root, "0");
+  ops::Switch sw_1 = CreateSwitch(root, "1");
+  ops::Switch sw_2 = CreateSwitch(root, "2");
+  ops::Switch sw_3 = CreateSwitch(root, "3");
+
+  ops::Merge m0(root.WithOpName("m0"), {sw_0.output_false, sw_1.output_false});
+  ops::Merge m1(root.WithOpName("m1"), {sw_2.output_false, sw_3.output_false});
+
+  Output add0 = ops::Add(root.WithOpName("add0"), m0.output, m1.output);
+  Output add1 = ops::Add(root.WithOpName("add1"), m0.output, m1.output);
+
+  Output add2 = ops::Add(root.WithOpName("add2"), add0, add1);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add2.node()));
+}
+
+TEST(DeadnessAnalysisTest, OrOfAnd) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw_0 = CreateSwitch(root, "0");
+  ops::Switch sw_1 = CreateSwitch(root, "1");
+  ops::Switch sw_2 = CreateSwitch(root, "2");
+  ops::Switch sw_3 = CreateSwitch(root, "3");
+
+  Output add0 =
+      ops::Add(root.WithOpName("add0"), sw_0.output_false, sw_1.output_false);
+  Output add1 =
+      ops::Add(root.WithOpName("add1"), sw_2.output_false, sw_3.output_false);
+
+  ops::Merge m0(root.WithOpName("m0"), {add0, add1});
+  ops::Merge m1(root.WithOpName("m1"), {add0, add1});
+
+  Output add2 = ops::Add(root.WithOpName("add2"), m0.output, m1.output);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add2.node()));
+}
+
+TEST(DeadnessAnalysisTest, NEGATIVE_AndOrDistributive) {
+  // This demonstrates one of the weaknesses in the current approach -- since we
+  // only do some basic simplifications we can't see that "(A|B)&C" ==
+  // "(A&C)|(B&C)".
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  ops::Switch sw_0 = CreateSwitch(root, "0");
+  ops::Switch sw_1 = CreateSwitch(root, "1");
+  ops::Switch sw_2 = CreateSwitch(root, "2");
+
+  ops::Merge m0(root.WithOpName("m0"), {sw_0.output_false, sw_1.output_false});
+  Output add0 = ops::Add(root.WithOpName("add0"), m0.output, sw_2.output_false);
+
+  Output add1 =
+      ops::Add(root.WithOpName("add1"), sw_0.output_false, sw_2.output_false);
+  Output add2 =
+      ops::Add(root.WithOpName("add2"), sw_1.output_false, sw_2.output_false);
+  ops::Merge m1(root.WithOpName("m1"), {add1, add2});
+
+  Output add3 = ops::Add(root.WithOpName("add3"), add0, m1.output);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*add2.node()));
+}
+
+TEST(DeadnessAnalysisTest, Ternary) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  Output predicate = ops::Placeholder(root.WithOpName("predicate"), DT_BOOL);
+  Output true_value = ops::Placeholder(root.WithOpName("true_value"), DT_FLOAT);
+  Output false_value =
+      ops::Placeholder(root.WithOpName("false_value"), DT_FLOAT);
+
+  ops::Switch predicated_true(root.WithOpName("predicated_true"), true_value,
+                              predicate);
+
+  ops::Switch predicated_false(root.WithOpName("predicated_false"), true_value,
+                               predicate);
+  ops::Merge merge(root.WithOpName("ternary"), {predicated_true.output_true,
+                                                predicated_false.output_false});
+  Output addend = ops::Placeholder(root.WithOpName("addend"), DT_FLOAT);
+  Output add = ops::Add(root.WithOpName("add"), merge.output, addend);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, Recv) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  Output recv_a = ops::_Recv(root.WithOpName("recv_a"), DT_FLOAT, "tensor_a",
+                             "sender", 0, "receiver");
+  Output recv_b = ops::_Recv(root.WithOpName("recv_b"), DT_FLOAT, "tensor_b",
+                             "sender", 0, "receiver");
+  Output add = ops::Add(root.WithOpName("add"), recv_a, recv_b);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, HostRecv) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  Output recv_a = ops::_HostRecv(root.WithOpName("recv_a"), DT_FLOAT,
+                                 "tensor_a", "sender", 0, "receiver");
+  Output recv_b = ops::_HostRecv(root.WithOpName("recv_b"), DT_FLOAT,
+                                 "tensor_b", "sender", 0, "receiver");
+  Output add = ops::Add(root.WithOpName("add"), recv_a, recv_b);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, Loop) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+  Output iv0 = CreateInductionVariable(root, "iv0", "fr0", 0);
+  Output iv1 = CreateInductionVariable(root, "iv1", "fr0", 0);
+  Output iv2 = CreateInductionVariable(root, "iv2", "fr0", 1);
+  Output add0 = ops::Add(root.WithOpName("add0"), iv0, iv1);
+  Output add1 = ops::Add(root.WithOpName("add1"), iv1, iv2);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  // NB!  iv0 and iv1 are equivalent and a smarter deadness analysis would have
+  // noticed that.  Today we are pessimistic here because we assign an
+  // uninterpreted symbol to merges with backedges.
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*add0.node()));
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*add1.node()));
+}
+
+TEST(DeadnessAnalysisTest, ControlInputs) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+  ops::Switch sw = CreateSwitch(root, "0");
+
+  Output id0 = ops::Identity(root.WithOpName("id0"), sw.output_false);
+  Output id1 = ops::Identity(root.WithOpName("id1"), sw.output_true);
+
+  Output const0 = ops::Const(root.WithOpName("const0"), 1);
+  Output const1 = ops::Const(root.WithOpName("const1"), 2);
+
+  Output add = ops::Add(root.WithOpName("add"), const0, const1);
+
+  root.graph()->AddControlEdge(id0.node(), const0.node());
+  root.graph()->AddControlEdge(id1.node(), const1.node());
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, ControlTrigger) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+  ops::Switch sw = CreateSwitch(root, "0");
+
+  Output id0 = ops::Identity(root.WithOpName("id0"), sw.output_false);
+  Output id1 = ops::Identity(root.WithOpName("id1"), sw.output_true);
+
+  ops::ControlTrigger ctrl_trigger0(root.WithOpName("ctrl_trigger0"));
+  ops::ControlTrigger ctrl_trigger1(root.WithOpName("ctrl_trigger1"));
+
+  Output const0 = ops::Const(root.WithOpName("const0"), 1);
+  Output const1 = ops::Const(root.WithOpName("const1"), 2);
+
+  Output add = ops::Add(root.WithOpName("add"), const0, const1);
+
+  root.graph()->AddControlEdge(id0.node(), ctrl_trigger0.operation.node());
+  root.graph()->AddControlEdge(ctrl_trigger0.operation.node(), const0.node());
+
+  root.graph()->AddControlEdge(id1.node(), ctrl_trigger1.operation.node());
+  root.graph()->AddControlEdge(ctrl_trigger1.operation.node(), const1.node());
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, ControlInputsToMerge) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+  ops::Switch sw = CreateSwitch(root, "0");
+
+  Output id0 = ops::Identity(root.WithOpName("id0"), sw.output_false);
+  Output id1 = ops::Identity(root.WithOpName("id1"), sw.output_true);
+
+  Output constant = ops::Const(root.WithOpName("constant"), 5);
+  ops::Merge m0(root.WithOpName("m0"), {constant});
+  ops::Merge m1(root.WithOpName("m0"), {constant});
+  Output add = ops::Add(root.WithOpName("add"), m0.output, m1.output);
+
+  root.graph()->AddControlEdge(id0.node(), m0.output.node());
+  root.graph()->AddControlEdge(id1.node(), m1.output.node());
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_FALSE(result->HasInputsWithMismatchingDeadness(*add.node()));
+}
+
+TEST(DeadnessAnalysisTest, RecvVsSwitch) {
+  // Demonstrates why we need the must_be_true bit on SymbolP.
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  Output recv = ops::_Recv(root.WithOpName("recv"), DT_BOOL, "tensor", "sender",
+                           0, "receiver");
+  Output value = ops::Placeholder(root.WithOpName("value"), DT_BOOL);
+  ops::Switch sw(root.WithOpName("switch"), value, recv);
+  Output logical_and =
+      ops::LogicalAnd(root.WithOpName("and"), recv, sw.output_true);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  EXPECT_TRUE(result->HasInputsWithMismatchingDeadness(*logical_and.node()));
+}
+
+TEST(DeadnessAnalysisTest, RecvVsSwitchText) {
+  // Demonstrates why we need the must_be_true bit on SymbolP.
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  Output recv = ops::_Recv(root.WithOpName("recv"), DT_BOOL, "tensor", "sender",
+                           0, "receiver");
+  Output value = ops::Placeholder(root.WithOpName("value"), DT_BOOL);
+  ops::Switch sw(root.WithOpName("switch"), value, recv);
+  Output logical_and =
+      ops::LogicalAnd(root.WithOpName("and"), recv, sw.output_true);
+
+  std::unique_ptr<DeadnessAnalysis> result;
+  TF_ASSERT_OK(AnalyzeDeadness(root.graph(), &result));
+
+  deadness_analysis_internal::PredicateMapTy predicate_map;
+  TF_ASSERT_OK(deadness_analysis_internal::ComputePredicates(*root.graph(),
+                                                             &predicate_map));
+
+  TensorId logical_and_output_0 = {logical_and.node()->name(),
+                                   Graph::kControlSlot};
+  EXPECT_EQ(predicate_map[logical_and_output_0], "(recv:0 & *recv:0)");
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
index e786d41887..fdd71c6a58 100644
--- a/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
+++ b/tensorflow/compiler/jit/encapsulate_subgraphs_pass.cc
@@ -60,9 +60,9 @@ const char* const kXlaHostTransferSequencerAttr =
 
 namespace {
 
-bool AreAllParentsConst(const Node& n,
-                        const gtl::FlatSet<const Node*>& runtime_const_nodes) {
-  if (n.type_string() == "GuaranteeConst" || n.type_string() == "Const") {
+bool AreAllParentsGuaranteedConst(
+    const Node& n, const gtl::FlatSet<const Node*>& runtime_const_nodes) {
+  if (n.type_string() == "GuaranteeConst") {
     // If the current node is itself a cast-to-const, no need
     // to look at the incoming edges.
     return true;
@@ -93,7 +93,8 @@ void MarkGuaranteedConstants(
   ReverseDFSFrom(graph, srcs, /*enter=*/nullptr,
                  /*leave=*/[&guaranteed_const_nodes](const Node* n) {
                    // TODO(vinuraja): Doesn't work in the presence of loops.
-                   if (AreAllParentsConst(*n, guaranteed_const_nodes)) {
+                   if (AreAllParentsGuaranteedConst(*n,
+                                                    guaranteed_const_nodes)) {
                      guaranteed_const_nodes.insert(n);
                    }
                  });
@@ -137,7 +138,7 @@ class Encapsulator {
 
   // Find subgraphs marked with 'group_attribute', and build a new
   // subgraph, one for each value of 'group_attribute'.
-  Status SplitIntoSubgraphs();
+  Status SplitIntoSubgraphs(FunctionLibraryDefinition* library);
 
   // Build a FunctionDef for each subgraph, and add it 'library'. The values of
   // the 'group_attribute' annotations become the function names.
@@ -1136,7 +1137,10 @@ Status Encapsulator::Subgraph::AddShapeInferenceInfo(
         GraphToFunctionDef(*inference_graph, inference_graph_name, &fdef));
     host_compute->AddAttr("shape_inference_graph", inference_graph_name);
     host_compute->AddAttr("shapes", std::vector<TensorShapeProto>());
-    TF_RETURN_IF_ERROR(library->AddFunctionDef(fdef));
+    // TODO(sibyl-Aix6ihai): Understand why there are multiple calls to Encapsulator.
+    if (library->Find(inference_graph_name) == nullptr) {
+      TF_RETURN_IF_ERROR(library->AddFunctionDef(fdef));
+    }
   }
   return Status::OK();
 }
@@ -1474,7 +1478,7 @@ Status Encapsulator::CopySubgraphEdges(
   return Status::OK();
 }
 
-Status Encapsulator::SplitIntoSubgraphs() {
+Status Encapsulator::SplitIntoSubgraphs(FunctionLibraryDefinition* library) {
   Status s;
 
   // Map from input graph nodes to subgraph nodes.
@@ -1509,6 +1513,15 @@ Status Encapsulator::SplitIntoSubgraphs() {
     TF_RETURN_IF_ERROR(BuildControlFlowInfo(subgraph.GetGraph(), &dummy));
   }
 
+  if (VLOG_IS_ON(1)) {
+    // Dump subgraphs.
+    for (auto& entry : subgraphs_) {
+      dump_graph::DumpGraphToFile(
+          strings::StrCat("encapsulate_subgraphs_subgraph_", entry.first),
+          *entry.second.GetGraph(), library);
+    }
+  }
+
   return s;
 }
 
@@ -1932,6 +1945,8 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
             // continue.
             TensorShapeProto proto;
             context->ShapeHandleToProto(shape, &proto);
+            VLOG(2) << "Node " << src_node->name()
+                    << " has known shape: " << proto.DebugString();
             if (dummy_node_images.find(src_node) == dummy_node_images.end()) {
               dummy_node_images[src_node] =
                   AddDummyShapedNode(src_node, src_port, control_flow_info,
@@ -1949,6 +1964,8 @@ Status Encapsulator::DoStaticShapeInferenceForOutsideCompilationSend(
               if (VLOG_IS_ON(2)) {
                 TensorShapeProto proto;
                 context->ShapeHandleToProto(shape, &proto);
+                VLOG(2) << "Node " << src_node->name()
+                        << " has unknown shape: " << proto.DebugString();
               }
               stack.push_back({src_node, false});
             }
@@ -2191,6 +2208,23 @@ Status Encapsulator::FindClusterDependencies() {
       }
     }
   }
+  if (VLOG_IS_ON(2)) {
+    // Print debug information.
+    VLOG(2) << "node_ancestors_map:";
+    for (const auto& node_iter : node_ancestors_map) {
+      VLOG(2) << "\t" << node_iter.first->name() << ": subgraph = '"
+              << node_iter.second.subgraph
+              << "', outside_compilation_cluster = '"
+              << node_iter.second.outside_compilation_cluster
+              << "', ancestor_clusters: "
+              << (node_iter.second.ancestor_clusters.empty() ? "(empty)" : "");
+      for (const auto& cluster_iter : node_iter.second.ancestor_clusters) {
+        VLOG(2) << "\t\tsubgraph = '" << cluster_iter.subgraph
+                << "', outside_compilation_cluster = '"
+                << cluster_iter.outside_compilation_cluster << "'";
+      }
+    }
+  }
   return Status::OK();
 }
 
@@ -2398,7 +2432,7 @@ Status EncapsulateSubgraphsInFunctions(
                             std::move(outside_compilation_attribute),
                             &graph_in);
   TF_RETURN_IF_ERROR(encapsulator.FindClusterDependencies());
-  TF_RETURN_IF_ERROR(encapsulator.SplitIntoSubgraphs());
+  TF_RETURN_IF_ERROR(encapsulator.SplitIntoSubgraphs(library));
 
   TF_RETURN_IF_ERROR(encapsulator.BuildFunctionDefs(
       rewrite_subgraph_fn, reuse_existing_functions, library));
@@ -2447,7 +2481,7 @@ Status EncapsulateSubgraphsPass::Run(
     const GraphOptimizationPassOptions& options) {
   VLOG(1) << "EncapsulateSubgraphsPass::Run";
   if (VLOG_IS_ON(1)) {
-    dump_graph::DumpGraphToFile("before_encapsulate_subgraphs", **options.graph,
+    dump_graph::DumpGraphToFile("encapsulate_subgraphs_before", **options.graph,
                                 options.flib_def);
   }
 
@@ -2530,7 +2564,7 @@ Status EncapsulateSubgraphsPass::Run(
       "EncapsulateSubgraphsPass failed");
 
   if (VLOG_IS_ON(1)) {
-    dump_graph::DumpGraphToFile("after_encapsulate_subgraphs", *graph_out,
+    dump_graph::DumpGraphToFile("encapsulate_subgraphs_after", *graph_out,
                                 options.flib_def);
   }
 
diff --git a/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc b/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
index 4eb389e0c6..c0543a0079 100644
--- a/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
+++ b/tensorflow/compiler/jit/encapsulate_subgraphs_pass_test.cc
@@ -742,10 +742,13 @@ TEST(EncapsulateSubgraphsWithGuaranteeConstOpTest, Simple) {
   Scope root = Scope::NewRootScope().ExitOnError().WithDevice(
       "/job:localhost/replica:0/task:0/cpu:0");
   auto x1 = ops::Placeholder(root.WithOpName("x1"), DT_FLOAT);
-  auto const_x2 = ops::Const(root.WithOpName("const_x2"), 10.0f);
+  auto x2 = ops::Placeholder(root.WithOpName("x2"), DT_FLOAT);
+  auto const_guarantee_x2 =
+      ops::GuaranteeConst(root.WithOpName("const_guarantee_x2"), x2);
   auto const_guarantee_x1 =
       ops::GuaranteeConst(root.WithOpName("const_guarantee_x1"), x1);
-  auto add1 = ops::Add(root.WithOpName("add1"), const_guarantee_x1, const_x2);
+  auto add1 =
+      ops::Add(root.WithOpName("add1"), const_guarantee_x1, const_guarantee_x2);
   add1.node()->AddAttr("_encapsulate", "encapsulate1");
 
   Graph graph_before(OpRegistry::Global());
diff --git a/tensorflow/compiler/jit/kernels/xla_launch_op.cc b/tensorflow/compiler/jit/kernels/xla_launch_op.cc
index 251a07304e..b313d48011 100644
--- a/tensorflow/compiler/jit/kernels/xla_launch_op.cc
+++ b/tensorflow/compiler/jit/kernels/xla_launch_op.cc
@@ -51,7 +51,11 @@ XlaLocalLaunchBase::XlaLocalLaunchBase(OpKernelConstruction* ctx,
   if (device_type_ == DeviceType(DEVICE_CPU)) {
     platform_id_ = se::host::kHostPlatformId;
   } else if (device_type_ == DeviceType(DEVICE_GPU)) {
-    platform_id_ = se::cuda::kCudaPlatformId;
+    platform_id_ = ctx->device()
+                       ->tensorflow_gpu_device_info()
+                       ->stream->parent()
+                       ->platform()
+                       ->id();
   } else {
     platform_id_ = nullptr;
   }
@@ -115,6 +119,7 @@ void XlaLocalLaunchBase::Compute(OpKernelContext* ctx) {
   const XlaDevice::Metadata* metadata = nullptr;
   Status s = XlaDevice::GetMetadata(ctx, &metadata);
   bool allocate_xla_tensors = s.ok();
+  bool use_multiple_streams = s.ok() && metadata->UseMultipleStreams();
 
   // Get the platform_id_ for XLA_* devices.
   if (platform_id_ == nullptr) {
@@ -148,6 +153,10 @@ void XlaLocalLaunchBase::Compute(OpKernelContext* ctx) {
 
   XlaCompiler::Options options;
   options.client = client;
+  if (ctx->op_device_context() != nullptr) {
+    options.device_ordinal =
+        ctx->op_device_context()->stream()->parent()->device_ordinal();
+  }
   options.device_type = cache->device_type();
   options.flib_def = ctx->function_library()->GetFunctionLibraryDefinition();
   options.graph_def_version = ctx->function_library()->graph_def_version();
@@ -180,8 +189,8 @@ void XlaLocalLaunchBase::Compute(OpKernelContext* ctx) {
 
   VLOG(1) << "Executing XLA Computation...";
 
-  XlaComputationLaunchContext launch_context(client, xla_allocator,
-                                             allocate_xla_tensors);
+  XlaComputationLaunchContext launch_context(
+      client, xla_allocator, allocate_xla_tensors, use_multiple_streams);
   launch_context.PopulateInputs(ctx, kernel, variables);
 
   // Execute the computation.
diff --git a/tensorflow/compiler/jit/mark_for_compilation_pass.cc b/tensorflow/compiler/jit/mark_for_compilation_pass.cc
index 8c3882116d..45d422943c 100644
--- a/tensorflow/compiler/jit/mark_for_compilation_pass.cc
+++ b/tensorflow/compiler/jit/mark_for_compilation_pass.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include <unordered_map>
 #include <unordered_set>
 
+#include "tensorflow/compiler/jit/deadness_analysis.h"
 #include "tensorflow/compiler/jit/defs.h"
 #include "tensorflow/compiler/jit/graphcycles/graphcycles.h"
 #include "tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h"
@@ -28,6 +29,7 @@ limitations under the License.
 #include "tensorflow/compiler/jit/xla_cluster_util.h"
 #include "tensorflow/compiler/tf2xla/dump_graph.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
+#include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/framework/graph_def_util.h"
 #include "tensorflow/core/framework/memory_types.h"
@@ -460,20 +462,22 @@ Status MarkForCompilationPass::Run(
 
   VLOG(1) << "flags->tf_xla_cpu_global_jit = " << flags->tf_xla_cpu_global_jit;
   VLOG(1) << "flags->tf_xla_fusion_only = " << flags->tf_xla_fusion_only;
+  VLOG(1) << "flags->tf_xla_auto_jit = " << flags->tf_xla_auto_jit;
   const FunctionLibraryDefinition* fld = options.flib_def;
 
-  auto is_compilable = [global_jit_level, cpu_global_jit, fusion_only, fld](
-                           const Node* node, const DeviceType& device_type) {
+  std::unique_ptr<DeadnessAnalysis> deadness;
+  {
+    XLA_SCOPED_LOGGING_TIMER_LEVEL("DeadnessAnalysis", 1);
+    TF_RETURN_IF_ERROR(DeadnessAnalysis::Run(**options.graph, &deadness));
+  }
+
+  auto is_compilable = [&](const Node* node, const DeviceType& device_type) {
     const XlaOpRegistry::DeviceRegistration* registration;
     if (!XlaOpRegistry::GetCompilationDevice(device_type.type(),
                                              &registration)) {
       return false;
     }
 
-    // Don't compile control trigger nodes. We won't preserve their deadness
-    // semantics correctly, so it's safest not to compile them.
-    if (node->IsControlTrigger()) return false;
-
     // If this device requires a JIT, we must say yes.
     if (registration->requires_compilation) return true;
 
@@ -485,6 +489,14 @@ Status MarkForCompilationPass::Run(
     status = fld->GetAttr(*node, kXlaCompileAttr, &compile);
     if (status.ok()) return compile;
 
+    // If inputs to `node` can have conflicting deadness (i.e. some are alive
+    // and some are dead) then don't compile it.  XLA cannot represent the
+    // deadness semantics of these nodes correctly and auto-clustering these
+    // nodes can cause deadness to propagate to nodes that should be live.
+    if (node->IsMerge() || deadness->HasInputsWithMismatchingDeadness(*node)) {
+      return false;
+    }
+
     // Check for fusable ops only if requested.
     if (global_jit_level > 0 && fusion_only && !IsXlaFusable(node->def())) {
       return false;
diff --git a/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc b/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc
index 772c92d369..2c5f4fb774 100644
--- a/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc
+++ b/tensorflow/compiler/jit/mark_for_compilation_pass_test.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/cc/ops/array_ops.h"
 #include "tensorflow/cc/ops/control_flow_ops_internal.h"
 #include "tensorflow/cc/ops/function_ops.h"
+#include "tensorflow/cc/ops/sendrecv_ops.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/compiler/jit/defs.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
@@ -680,5 +681,37 @@ TEST(XlaCompilationTest, ClusterIdentityWithNonRefInput) {
   EXPECT_EQ(clusters, expected_clusters);
 }
 
+TEST(XlaCompilationTest, ClusterControlTrigger) {
+  Scope root = Scope::NewRootScope().ExitOnError();
+
+  Output recv_a = ops::_Recv(root.WithOpName("recv_a"), DT_BOOL, "tensor_a",
+                             "sender", 0, "receiver");
+  Output recv_b = ops::_Recv(root.WithOpName("recv_b"), DT_BOOL, "tensor_b",
+                             "sender", 0, "receiver");
+  Output const_a = ops::Const(root.WithOpName("const_a"), 42);
+
+  ops::ControlTrigger ctrl_trigger_a(root.WithOpName("ctrl_trigger_a"));
+  ops::ControlTrigger ctrl_trigger_b(root.WithOpName("ctrl_trigger_b"));
+  root.graph()->AddControlEdge(recv_a.node(), ctrl_trigger_a.operation.node());
+  root.graph()->AddControlEdge(recv_b.node(), ctrl_trigger_a.operation.node());
+  root.graph()->AddControlEdge(ctrl_trigger_b.operation.node(), const_a.node());
+
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+
+  TF_ASSERT_OK(root.ToGraph(graph.get()));
+  TF_ASSERT_OK(MarkForCompilation(&graph));
+
+  std::unordered_map<string, string> clusters = GetClusters(*graph);
+
+  ASSERT_FALSE(clusters.empty());
+  string cluster_name = clusters.begin()->second;
+
+  // ctrl_trigger_a has inputs with mismatching deadness so it won't be
+  // clustered.  ctrl_trigger_b is okay to cluster.
+  std::unordered_map<string, string> expected_clusters(
+      {{"const_a", cluster_name}, {"ctrl_trigger_b", cluster_name}});
+  EXPECT_EQ(clusters, expected_clusters);
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_compilation_cache.cc b/tensorflow/compiler/jit/xla_compilation_cache.cc
index 54a41a4daa..0e2cdcf630 100644
--- a/tensorflow/compiler/jit/xla_compilation_cache.cc
+++ b/tensorflow/compiler/jit/xla_compilation_cache.cc
@@ -209,7 +209,9 @@ Status XlaCompilationCache::BuildExecutable(
     argument_layouts[i] = &result.xla_input_shapes[i];
   }
   xla::ExecutableBuildOptions build_options;
-  build_options.set_device_ordinal(client_->default_device_ordinal());
+  build_options.set_device_ordinal(options.device_ordinal != -1
+                                       ? options.device_ordinal
+                                       : client_->default_device_ordinal());
   build_options.set_result_layout(result.xla_output_shape);
   build_options.set_device_allocator(options.device_allocator);
 
@@ -256,6 +258,7 @@ Status XlaCompilationCache::CompileImpl(
     xla::LocalExecutable** executable,
     const XlaCompiler::CompileOptions* compile_options,
     bool compile_single_op) {
+  CHECK_NE(executable, nullptr);
   VLOG(1) << "XlaCompilationCache::Compile " << DebugString();
 
   if (VLOG_IS_ON(2)) {
@@ -327,18 +330,15 @@ Status XlaCompilationCache::CompileImpl(
           compile_options ? *compile_options : XlaCompiler::CompileOptions(),
           function, args, &entry->compilation_result);
     }
+    TF_RETURN_IF_ERROR(entry->compilation_status);
+    CHECK_EQ(entry->executable.get(), nullptr);
+    entry->compilation_status =
+        BuildExecutable(options, entry->compilation_result, &entry->executable);
   }
+  TF_RETURN_IF_ERROR(entry->compilation_status);
   *compilation_result = &entry->compilation_result;
-  if (entry->compilation_status.ok() && executable) {
-    if (entry->executable == nullptr) {
-      entry->compilation_status = BuildExecutable(
-          options, entry->compilation_result, &entry->executable);
-    }
-    *executable = entry->executable.get();
-  }
-
-  Status status = entry->compilation_status;
-  return status;
+  *executable = entry->executable.get();
+  return Status::OK();
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/jit/xla_compile_on_demand_op.cc b/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
index baccea2d6a..d288d37bc7 100644
--- a/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
+++ b/tensorflow/compiler/jit/xla_compile_on_demand_op.cc
@@ -53,7 +53,9 @@ Status XlaCompileOnDemandOp::Run(OpKernelContext* ctx,
 
   // Builds an XLA allocator for the device.
   XlaComputationLaunchContext launch_context(
-      client, client->backend().memory_allocator(), true);
+      client, client->backend().memory_allocator(),
+      /*allocate_xla_tensors=*/true,
+      /*use_multiple_streams=*/metadata.UseMultipleStreams());
 
   launch_context.PopulateInputs(ctx, result, variables);
 
diff --git a/tensorflow/compiler/jit/xla_cpu_device.cc b/tensorflow/compiler/jit/xla_cpu_device.cc
index 43648402f6..7e159e3171 100644
--- a/tensorflow/compiler/jit/xla_cpu_device.cc
+++ b/tensorflow/compiler/jit/xla_cpu_device.cc
@@ -54,6 +54,7 @@ Status XlaCpuDeviceFactory::CreateDevices(const SessionOptions& options,
                                        DEVICE_CPU_XLA_JIT, options, name_prefix,
                                        registration,
                                        /*transfer_as_literal=*/false,
+                                       /*use_multiple_streams=*/false,
                                        /*shape_representation_fn=*/{},
                                        /*padded_shape_fn=*/{}, &device));
   devices->push_back(device.release());
diff --git a/tensorflow/compiler/jit/xla_device.cc b/tensorflow/compiler/jit/xla_device.cc
index ed007d603e..c55eba2f79 100644
--- a/tensorflow/compiler/jit/xla_device.cc
+++ b/tensorflow/compiler/jit/xla_device.cc
@@ -130,7 +130,7 @@ Status DefaultPaddedShapeFn(const Tensor& tensor, xla::Shape* shape) {
     const string& jit_device_name, const SessionOptions& options,
     const string& name_prefix,
     const XlaOpRegistry::DeviceRegistration& registration,
-    bool transfer_as_literal,
+    bool transfer_as_literal, bool use_multiple_streams,
     const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
     const PaddedShapeFn& padded_shape_fn, std::unique_ptr<XlaDevice>* device) {
   VLOG(1) << "XlaDevice::Create " << platform_name << " " << device_name << ":"
@@ -151,22 +151,24 @@ Status DefaultPaddedShapeFn(const Tensor& tensor, xla::Shape* shape) {
       DeviceType(device_name), Bytes(16ULL << 30), DeviceLocality(),
       strings::StrCat("device: ", device_name, " device"));
 
-  device->reset(new XlaDevice(
-      options, attrs, device_ordinal, DeviceType(jit_device_name),
-      platform.ValueOrDie(), transfer_as_literal, shape_representation_fn,
-      padded_shape_fn ? padded_shape_fn : DefaultPaddedShapeFn));
+  device->reset(
+      new XlaDevice(options, attrs, device_ordinal, DeviceType(jit_device_name),
+                    platform.ValueOrDie(), transfer_as_literal,
+                    use_multiple_streams, shape_representation_fn,
+                    padded_shape_fn ? padded_shape_fn : DefaultPaddedShapeFn));
   return Status::OK();
 }
 
 XlaDevice::Metadata::Metadata(
     int device_ordinal, se::Platform* platform, const DeviceType& device_type,
     XlaCompiler::ShapeRepresentationFn shape_representation_fn,
-    PaddedShapeFn padded_shape_fn)
+    PaddedShapeFn padded_shape_fn, bool use_multiple_streams)
     : device_ordinal_(device_ordinal),
       device_type_(device_type),
       platform_(platform),
       shape_representation_fn_(std::move(shape_representation_fn)),
-      padded_shape_fn_(std::move(padded_shape_fn)) {}
+      padded_shape_fn_(std::move(padded_shape_fn)),
+      use_multiple_streams_(use_multiple_streams) {}
 
 int XlaDevice::Metadata::device_ordinal() const { return device_ordinal_; }
 
@@ -200,16 +202,18 @@ const DeviceType& XlaDevice::Metadata::jit_device_type() const {
 XlaDevice::XlaDevice(
     const SessionOptions& options, const DeviceAttributes& attrs,
     int device_ordinal, const DeviceType& jit_device_name,
-    se::Platform* platform, bool transfer_as_literal,
+    se::Platform* platform, bool transfer_as_literal, bool use_multiple_streams,
     const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
     const PaddedShapeFn& padded_shape_fn)
     : LocalDevice(options, attrs),
       xla_metadata_(device_ordinal, platform, jit_device_name,
-                    shape_representation_fn, padded_shape_fn),
+                    shape_representation_fn, padded_shape_fn,
+                    use_multiple_streams),
       device_ordinal_(device_ordinal),
       jit_device_name_(jit_device_name),
       xla_allocator_(nullptr),
       platform_(platform),
+      use_multiple_streams_(use_multiple_streams),
       transfer_as_literal_(transfer_as_literal),
       shape_representation_fn_(shape_representation_fn) {
   VLOG(1) << "Created XLA device " << jit_device_name;
@@ -253,6 +257,30 @@ xla::StatusOr<se::Stream*> XlaDevice::GetStream() {
   return stream_.get();
 }
 
+xla::StatusOr<se::Stream*> XlaDevice::GetDeviceToHostStream() {
+  if (!use_multiple_streams_) {
+    return GetStream();
+  }
+  if (!device_to_host_stream_) {
+    xla::Backend* backend = client()->mutable_backend();
+    TF_ASSIGN_OR_RETURN(device_to_host_stream_,
+                        backend->BorrowStream(device_ordinal_));
+  }
+  return device_to_host_stream_.get();
+}
+
+xla::StatusOr<se::Stream*> XlaDevice::GetHostToDeviceStream() {
+  if (!use_multiple_streams_) {
+    return GetStream();
+  }
+  if (!host_to_device_stream_) {
+    xla::Backend* backend = client()->mutable_backend();
+    TF_ASSIGN_OR_RETURN(host_to_device_stream_,
+                        backend->BorrowStream(device_ordinal_));
+  }
+  return host_to_device_stream_.get();
+}
+
 Status XlaDevice::CreateAndSetGpuDeviceInfo() {
   if (gpu_device_info_ == nullptr) {
     TF_ASSIGN_OR_RETURN(se::Stream * stream, GetStream());
@@ -263,8 +291,9 @@ Status XlaDevice::CreateAndSetGpuDeviceInfo() {
     // gpu_device_info_->default_context.
     gpu_device_info_ = MakeUnique<GpuDeviceInfo>();
     gpu_device_info_->stream = stream;
-    gpu_device_info_->default_context = new XlaDeviceContext(
-        stream, client(), transfer_as_literal_, shape_representation_fn_);
+    gpu_device_info_->default_context =
+        new XlaDeviceContext(stream, stream, stream, client(),
+                             transfer_as_literal_, shape_representation_fn_);
     set_tensorflow_gpu_device_info(gpu_device_info_.get());
   }
 
@@ -276,10 +305,16 @@ Status XlaDevice::FillContextMap(const Graph* graph,
   VLOG(1) << "XlaDevice::FillContextMap";
   device_context_map->resize(graph->num_node_ids());
   TF_ASSIGN_OR_RETURN(se::Stream * stream, GetStream());
+  TF_ASSIGN_OR_RETURN(se::Stream * device_to_host_stream,
+                      GetDeviceToHostStream());
+  TF_ASSIGN_OR_RETURN(se::Stream * host_to_device_stream,
+                      GetHostToDeviceStream());
+
   // Call GetAllocator for the side-effect of ensuring the allocator is created.
   GetAllocator({});
-  auto ctx = new XlaDeviceContext(stream, client(), transfer_as_literal_,
-                                  shape_representation_fn_);
+  auto ctx = new XlaDeviceContext(
+      stream, host_to_device_stream, device_to_host_stream, client(),
+      transfer_as_literal_, shape_representation_fn_);
   for (Node* n : graph->nodes()) {
     VLOG(2) << n->id() << " : " << n->type_string() << " : " << n->name();
     ctx->Ref();
@@ -326,8 +361,13 @@ Status XlaDevice::MakeTensorFromProto(const TensorProto& tensor_proto,
     Tensor copy(GetAllocator(alloc_attrs), parsed.dtype(), parsed.shape());
     Notification n;
     TF_ASSIGN_OR_RETURN(se::Stream * stream, GetStream());
-    XlaTransferManager manager(stream, client(), transfer_as_literal_,
-                               shape_representation_fn_);
+    TF_ASSIGN_OR_RETURN(se::Stream * device_to_host_stream,
+                        GetDeviceToHostStream());
+    TF_ASSIGN_OR_RETURN(se::Stream * host_to_device_stream,
+                        GetHostToDeviceStream());
+    XlaTransferManager manager(stream, host_to_device_stream,
+                               device_to_host_stream, client(),
+                               transfer_as_literal_, shape_representation_fn_);
     manager.CopyCPUTensorToDevice(&parsed, this, &copy,
                                   [&n, &status](const Status& s) {
                                     status = s;
diff --git a/tensorflow/compiler/jit/xla_device.h b/tensorflow/compiler/jit/xla_device.h
index 02e88ee679..4a5942fbd7 100644
--- a/tensorflow/compiler/jit/xla_device.h
+++ b/tensorflow/compiler/jit/xla_device.h
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_compiler.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/local_device.h"
 #include "tensorflow/core/framework/allocator.h"
@@ -57,7 +58,7 @@ class XlaDevice : public LocalDevice {
     Metadata(int device_ordinal, se::Platform* platform,
              const DeviceType& device_type,
              XlaCompiler::ShapeRepresentationFn shape_representation_fn,
-             PaddedShapeFn padded_shape_fn);
+             PaddedShapeFn padded_shape_fn, bool use_multiple_streams);
 
     // The index of the device on this host.
     int device_ordinal() const;
@@ -70,12 +71,15 @@ class XlaDevice : public LocalDevice {
     }
     const PaddedShapeFn& padded_shape_fn() const { return padded_shape_fn_; }
 
+    bool UseMultipleStreams() const { return use_multiple_streams_; }
+
    private:
     const int device_ordinal_;
     const DeviceType device_type_;
     se::Platform* platform_;  // Not owned.
     XlaCompiler::ShapeRepresentationFn shape_representation_fn_;
     PaddedShapeFn padded_shape_fn_;
+    const bool use_multiple_streams_;
 
     TF_DISALLOW_COPY_AND_ASSIGN(Metadata);
   };
@@ -89,6 +93,8 @@ class XlaDevice : public LocalDevice {
   // 'transfer_as_literal' is true if device<->host transfers must be done using
   // XLA's TransferLiteral{To,From}Device interface. If false, we can use
   // ThenMemcpy instead.
+  // If 'use_multiple_streams' is true, we create separate streams for
+  // host-to-device and device-to-host communication.
   // If padded_shape_fn is empty, a default implementation that returns
   // the on-host shape is used.
   static Status Create(
@@ -96,7 +102,7 @@ class XlaDevice : public LocalDevice {
       int device_ordinal, const string& jit_device_name,
       const SessionOptions& options, const string& name_prefix,
       const XlaOpRegistry::DeviceRegistration& registration,
-      bool transfer_as_literal,
+      bool transfer_as_literal, bool use_multiple_streams,
       const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
       const PaddedShapeFn& padded_shape_fn, std::unique_ptr<XlaDevice>* device);
 
@@ -106,6 +112,7 @@ class XlaDevice : public LocalDevice {
   XlaDevice(const SessionOptions& options, const DeviceAttributes& attrs,
             int device_ordinal, const DeviceType& jit_device_name,
             se::Platform* platform, bool transfer_as_literal,
+            bool use_multiple_streams,
             const XlaCompiler::ShapeRepresentationFn& shape_representation_fn,
             const PaddedShapeFn& padded_shape_fn);
   ~XlaDevice() override;
@@ -126,6 +133,8 @@ class XlaDevice : public LocalDevice {
   xla::LocalClient* client() const;
   const Metadata& metadata() { return xla_metadata_; }
   xla::StatusOr<se::Stream*> GetStream();
+  xla::StatusOr<se::Stream*> GetHostToDeviceStream();
+  xla::StatusOr<se::Stream*> GetDeviceToHostStream();
 
   // If not already set, create and set GpuDeviceInfo.
   // Not thread-safe
@@ -145,7 +154,17 @@ class XlaDevice : public LocalDevice {
   // stream are executed on the device. Operations include data
   // copying back and forth between CPU and the device, and
   // computations enqueued by XLA.
-  xla::Backend::StreamPtr stream_;
+  xla::StreamPool::Ptr stream_;
+  // If true, only stream_ is valid and all computation and transfers use
+  // stream_. If false, computation is performed by stream_ and transfers are
+  // performed by host_to_device/device_to_host_stream.
+  bool use_multiple_streams_;
+  // If use_multiple_streams_, host to device transfers are performed using this
+  // stream.
+  xla::StreamPool::Ptr host_to_device_stream_;
+  // If use_multiple_streams_, device to host transfers are performed using this
+  // stream.
+  xla::StreamPool::Ptr device_to_host_stream_;
   // Must we use XLA's transfer manager for correct host<->device transfers? if
   // false, we can use ThenMemcpy() instead.
   bool transfer_as_literal_;
diff --git a/tensorflow/compiler/jit/xla_device_context.cc b/tensorflow/compiler/jit/xla_device_context.cc
index 3bbf97afad..8cf198239c 100644
--- a/tensorflow/compiler/jit/xla_device_context.cc
+++ b/tensorflow/compiler/jit/xla_device_context.cc
@@ -48,13 +48,20 @@ void XlaDeviceAllocator::DeallocateRaw(void* ptr) {
 void XlaDeviceAllocator::GetStats(AllocatorStats* stats) { stats->Clear(); }
 
 XlaTransferManager::XlaTransferManager(
-    se::Stream* stream, xla::LocalClient* client, bool transfer_as_literal,
+    se::Stream* compute_stream, se::Stream* host_to_device_stream,
+    se::Stream* device_to_host_stream, xla::LocalClient* client,
+    bool transfer_as_literal,
     XlaCompiler::ShapeRepresentationFn shape_representation_fn)
-    : stream_(stream),
+    : stream_(compute_stream),
+      host_to_device_stream_(host_to_device_stream),
+      device_to_host_stream_(device_to_host_stream),
       client_(client),
       transfer_manager_(client->backend().transfer_manager()),
       transfer_as_literal_(transfer_as_literal),
       shape_representation_fn_(std::move(shape_representation_fn)) {
+  CHECK(host_to_device_stream_ != nullptr);
+  CHECK(device_to_host_stream_ != nullptr);
+  CHECK(stream_ != nullptr);
   if (!shape_representation_fn_) {
     shape_representation_fn_ =
         [](const TensorShape& shape,
@@ -74,15 +81,26 @@ Status XlaTransferManager::TransferLiteralToDevice(
   auto literal = std::make_shared<xla::BorrowingLiteral>(
       static_cast<const char*>(DMAHelper::base(&host_tensor)), xla_shape);
 
-  const xla::ShapedBuffer& shaped_buffer =
-      XlaTensor::FromTensor(device_tensor)->shaped_buffer();
+  XlaTensor* xla_tensor = XlaTensor::FromTensor(device_tensor);
+  const xla::ShapedBuffer& shaped_buffer = xla_tensor->shaped_buffer();
   VLOG(1) << "Transfer to device as literal: " << literal->ToString() << " "
           << shaped_buffer.ToString();
+  if (UseMultipleStreams()) {
+    // Initially wait for the compute stream so that memory allocations are
+    // synchronized.
+    host_to_device_stream_->ThenWaitFor(stream_);
+  }
   TF_RETURN_IF_ERROR(transfer_manager_->TransferLiteralToDeviceAsync(
-      stream_, *literal, shaped_buffer));
+      host_to_device_stream_, *literal, shaped_buffer));
+  if (UseMultipleStreams()) {
+    se::Event event(stream_->parent());
+    TF_RET_CHECK(event.Init()) << "Event failed to initialize!";
+    host_to_device_stream_->ThenRecordEvent(&event);
+    xla_tensor->SetDefinedOn(host_to_device_stream_, std::move(event));
+  }
   // Unref the host tensor, and capture the literal shared_ptr too so it goes
   // out of scope when the lambda completes.
-  stream_->ThenDoHostCallback([ref, literal]() { ref.Unref(); });
+  host_to_device_stream_->ThenDoHostCallback([ref, literal]() { ref.Unref(); });
   return Status::OK();
 }
 
@@ -94,7 +112,7 @@ void XlaTransferManager::TransferLiteralFromDevice(
 
   TensorReference ref(device_tensor);
   transfer_manager_->TransferLiteralFromDevice(
-      stream_, shaped_buffer,
+      device_to_host_stream_, shaped_buffer,
       [=, &shaped_buffer](
           xla::StatusOr<std::unique_ptr<xla::Literal> > literal_or) {
         ref.Unref();
@@ -120,67 +138,73 @@ void XlaTransferManager::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
                                                Device* device,
                                                Tensor* device_tensor,
                                                StatusCallback done) const {
-  if (cpu_tensor->NumElements() > 0) {
-    VLOG(2) << "CopyCPUTensorToDevice "
-            << reinterpret_cast<const void*>(cpu_tensor->tensor_data().data())
-            << " "
-            << reinterpret_cast<const void*>(
-                   device_tensor->tensor_data().data())
-            << " " << cpu_tensor->NumElements() << " "
-            << cpu_tensor->shape().DebugString() << " "
-            << device_tensor->shape().DebugString();
-
-    void* src_ptr = const_cast<void*>(DMAHelper::base(cpu_tensor));
-    const int64 total_bytes = cpu_tensor->TotalBytes();
-
-    XlaTensor* xla_tensor = XlaTensor::FromTensor(device_tensor);
-    CHECK(xla_tensor);
-
-    Status status;
-    xla::StatusOr<TensorShape> shape_or_status = shape_representation_fn_(
-        device_tensor->shape(), device_tensor->dtype());
-    if (!shape_or_status.ok()) {
-      done(shape_or_status.status());
+  if (cpu_tensor->NumElements() == 0) {
+    VLOG(2) << "CopyCPUTensorToDevice empty tensor";
+    done(Status::OK());
+    return;
+  }
+
+  VLOG(2) << "CopyCPUTensorToDevice "
+          << reinterpret_cast<const void*>(cpu_tensor->tensor_data().data())
+          << " "
+          << reinterpret_cast<const void*>(device_tensor->tensor_data().data())
+          << " " << cpu_tensor->NumElements() << " "
+          << cpu_tensor->shape().DebugString() << " "
+          << device_tensor->shape().DebugString();
+
+  void* src_ptr = const_cast<void*>(DMAHelper::base(cpu_tensor));
+  const int64 total_bytes = cpu_tensor->TotalBytes();
+
+  XlaTensor* xla_tensor = XlaTensor::FromTensor(device_tensor);
+  CHECK(xla_tensor);
+
+  xla::StatusOr<TensorShape> shape_or_status =
+      shape_representation_fn_(device_tensor->shape(), device_tensor->dtype());
+  if (!shape_or_status.ok()) {
+    done(shape_or_status.status());
+    return;
+  }
+  TensorShape shape = shape_or_status.ValueOrDie();
+  if (!xla_tensor->has_shaped_buffer()) {
+    Status s =
+        xla_tensor->AllocateShapedBuffer(device_tensor->dtype(), shape, client_,
+                                         stream_->parent()->device_ordinal());
+    if (!s.ok()) {
+      done(s);
       return;
     }
-    TensorShape shape = shape_or_status.ValueOrDie();
-    if (!xla_tensor->has_shaped_buffer()) {
-      status = xla_tensor->AllocateShapedBuffer(
-          device_tensor->dtype(), shape, client_,
-          stream_->parent()->device_ordinal());
-      if (!status.ok()) {
-        return done(status);
-      }
-    }
+  }
 
-    if (transfer_as_literal_) {
-      Tensor reshaped_cpu_tensor;
-      if (!reshaped_cpu_tensor.CopyFrom(*cpu_tensor, shape)) {
-        done(errors::Internal(
-            "Tensor::CopyFrom failed when copying from CPU to XLA device"));
-        return;
-      }
-      status = TransferLiteralToDevice(reshaped_cpu_tensor, device_tensor);
-    } else {
-      se::DeviceMemoryBase dev_dst_ptr =
-          XlaTensor::DeviceMemoryFromTensor(*device_tensor);
-      stream_->ThenMemcpy(&dev_dst_ptr, src_ptr, total_bytes);
-      // TODO(hpucha): Make this asynchronous.
-      Status block_status = stream_->BlockHostUntilDone();
-      if (!block_status.ok()) {
-        status = xla::InternalError(
-            "Failed to complete data transfer on stream %p: %s", stream_,
-            block_status.error_message().c_str());
-      }
+  Status status;
+  if (transfer_as_literal_) {
+    Tensor reshaped_cpu_tensor;
+    if (!reshaped_cpu_tensor.CopyFrom(*cpu_tensor, shape)) {
+      done(errors::Internal(
+          "Tensor::CopyFrom failed when copying from CPU to XLA device"));
+      return;
+    }
+    status = TransferLiteralToDevice(reshaped_cpu_tensor, device_tensor);
+    if (status.ok()) {
+      xla_tensor->set_host_tensor(*cpu_tensor);
+      host_to_device_stream_->ThenDoHostCallback(
+          [done]() { done(Status::OK()); });
+      return;
+    }
+  } else {
+    se::DeviceMemoryBase dev_dst_ptr =
+        XlaTensor::DeviceMemoryFromTensor(*device_tensor);
+    host_to_device_stream_->ThenMemcpy(&dev_dst_ptr, src_ptr, total_bytes);
+    // TODO(hpucha): Make this asynchronous.
+    Status block_status = host_to_device_stream_->BlockHostUntilDone();
+    if (!block_status.ok()) {
+      status = xla::InternalError(
+          "Failed to complete data transfer on stream %p: %s",
+          host_to_device_stream_, block_status.error_message().c_str());
     }
-    xla_tensor->set_host_tensor(*cpu_tensor);
-
-    done(status);
-    return;
   }
+  xla_tensor->set_host_tensor(*cpu_tensor);
 
-  VLOG(2) << "CopyCPUTensorToDevice empty tensor";
-  done(Status::OK());
+  done(status);
 }
 
 void XlaTransferManager::CopyDeviceTensorToCPU(const Tensor* device_tensor,
@@ -188,51 +212,65 @@ void XlaTransferManager::CopyDeviceTensorToCPU(const Tensor* device_tensor,
                                                Device* device,
                                                Tensor* cpu_tensor,
                                                StatusCallback done) {
-  if (device_tensor->NumElements() > 0) {
-    VLOG(2) << "CopyDeviceTensorToCPU "
-            << reinterpret_cast<const void*>(
-                   device_tensor->tensor_data().data())
-            << " "
-            << reinterpret_cast<const void*>(cpu_tensor->tensor_data().data())
-            << " " << device_tensor->NumElements() << " "
-            << cpu_tensor->shape().DebugString() << " "
-            << device_tensor->shape().DebugString();
-
-    const int64 total_bytes = cpu_tensor->TotalBytes();
-    se::DeviceMemoryBase dev_src_ptr =
-        XlaTensor::DeviceMemoryFromTensor(*device_tensor);
-    void* dst_ptr = DMAHelper::base(cpu_tensor);
+  if (device_tensor->NumElements() == 0) {
+    VLOG(2) << "CopyDeviceTensorToCPU empty tensor";
+    done(Status::OK());
+    return;
+  }
+  VLOG(2) << "CopyDeviceTensorToCPU "
+          << reinterpret_cast<const void*>(device_tensor->tensor_data().data())
+          << " "
+          << reinterpret_cast<const void*>(cpu_tensor->tensor_data().data())
+          << " " << device_tensor->NumElements() << " "
+          << cpu_tensor->shape().DebugString() << " "
+          << device_tensor->shape().DebugString();
+
+  const int64 total_bytes = cpu_tensor->TotalBytes();
+  se::DeviceMemoryBase dev_src_ptr =
+      XlaTensor::DeviceMemoryFromTensor(*device_tensor);
+  void* dst_ptr = DMAHelper::base(cpu_tensor);
+  XlaTensor* xla_tensor = XlaTensor::FromTensor(device_tensor);
+
+  if (se::Event* event =
+          xla_tensor->GetDefinitionEvent(device_to_host_stream_)) {
+    device_to_host_stream_->ThenWaitFor(event);
+    xla_tensor->SetDefinedOn(device_to_host_stream_);
+  }
 
-    Status status;
-    if (transfer_as_literal_) {
-      TransferLiteralFromDevice(cpu_tensor, *device_tensor, done);
-      return;
-    } else {
-      stream_->ThenMemcpy(dst_ptr, dev_src_ptr, total_bytes);
-      // TODO(hpucha): Make this asynchronous.
-      Status block_status = stream_->BlockHostUntilDone();
-      if (!block_status.ok()) {
-        status = xla::InternalError(
-            "Failed to complete data transfer on stream %p: %s", stream_,
-            block_status.error_message().c_str());
-      }
-      done(status);
-    }
+  Status status;
+  if (transfer_as_literal_) {
+    TransferLiteralFromDevice(cpu_tensor, *device_tensor, done);
     return;
+  } else {
+    device_to_host_stream_->ThenMemcpy(dst_ptr, dev_src_ptr, total_bytes);
+    // TODO(hpucha): Make this asynchronous.
+    Status block_status = device_to_host_stream_->BlockHostUntilDone();
+    if (!block_status.ok()) {
+      status = xla::InternalError(
+          "Failed to complete data transfer on stream %p: %s", stream_,
+          block_status.error_message().c_str());
+    }
   }
 
-  VLOG(2) << "CopyDeviceTensorToCPU empty tensor";
-  done(Status::OK());
+  done(status);
 }
 
 void XlaTransferManager::CopyDeviceTensorToDevice(const Tensor& src_tensor,
                                                   Tensor* dst_tensor,
                                                   const StatusCallback& done) {
+  VLOG(2) << "CopyDeviceTensorToDevice "
+          << reinterpret_cast<const void*>(src_tensor.tensor_data().data())
+          << " "
+          << reinterpret_cast<const void*>(dst_tensor->tensor_data().data());
   // Perform memory allocation now, and enqueue the device-to-device transfer.
   Status status = [&]() -> Status {
     if (src_tensor.NumElements() == 0) {
       return Status::OK();
     }
+    // TODO(jmolloy): We co-opt the device_to_host stream for device to device
+    // transfers; perhaps we should have a dedicated device to device stream? or
+    // one per device?
+    auto device_to_device_stream = stream_;
     XlaTensor* xla_src = XlaTensor::FromTensor(&src_tensor);
     XlaTensor* xla_dst = XlaTensor::FromTensor(dst_tensor);
     CHECK(xla_src && xla_dst)
@@ -244,13 +282,32 @@ void XlaTransferManager::CopyDeviceTensorToDevice(const Tensor& src_tensor,
       TF_RETURN_IF_ERROR(
           xla_dst->AllocateShapedBuffer(src_tensor.dtype(), shape, client_,
                                         stream_->parent()->device_ordinal()));
+      if (stream_ != device_to_device_stream) {
+        // Initially wait for the compute stream so that memory allocations are
+        // synchronized.
+        device_to_device_stream->ThenWaitFor(stream_);
+      }
     }
+
+    if (se::Event* event =
+            xla_src->GetDefinitionEvent(device_to_device_stream)) {
+      device_to_device_stream->ThenWaitFor(event);
+      xla_src->SetDefinedOn(device_to_device_stream);
+    }
+
     auto from_iter = xla_src->shaped_buffer().buffers().begin();
     auto to_iter = xla_dst->shaped_buffer().buffers().begin();
     for (auto end_iter = xla_src->shaped_buffer().buffers().end();
          from_iter != end_iter; ++from_iter, ++to_iter) {
-      stream_->ThenMemcpyD2D(&to_iter->second, from_iter->second,
-                             to_iter->second.size());
+      device_to_device_stream->ThenMemcpyD2D(
+          &to_iter->second, from_iter->second, to_iter->second.size());
+    }
+
+    if (UseMultipleStreams()) {
+      se::Event event(stream_->parent());
+      CHECK(event.Init());
+      device_to_device_stream->ThenRecordEvent(&event);
+      xla_dst->SetDefinedOn(device_to_device_stream, std::move(event));
     }
     return Status::OK();
   }();
@@ -262,9 +319,12 @@ void XlaTransferManager::CopyDeviceTensorToDevice(const Tensor& src_tensor,
 }
 
 XlaDeviceContext::XlaDeviceContext(
-    se::Stream* stream, xla::LocalClient* client, bool transfer_as_literal,
+    se::Stream* compute_stream, se::Stream* host_to_device_stream,
+    se::Stream* device_to_host_stream, xla::LocalClient* client,
+    bool transfer_as_literal,
     XlaCompiler::ShapeRepresentationFn shape_representation_fn)
-    : manager_(stream, client, transfer_as_literal,
+    : manager_(compute_stream, host_to_device_stream, device_to_host_stream,
+               client, transfer_as_literal,
                std::move(shape_representation_fn)) {}
 
 void XlaDeviceContext::CopyCPUTensorToDevice(const Tensor* cpu_tensor,
diff --git a/tensorflow/compiler/jit/xla_device_context.h b/tensorflow/compiler/jit/xla_device_context.h
index c5c81d65fe..912f8d779e 100644
--- a/tensorflow/compiler/jit/xla_device_context.h
+++ b/tensorflow/compiler/jit/xla_device_context.h
@@ -47,7 +47,9 @@ class XlaDeviceAllocator : public Allocator {
 class XlaTransferManager {
  public:
   explicit XlaTransferManager(
-      se::Stream* stream, xla::LocalClient* client, bool transfer_as_literal,
+      se::Stream* compute_stream, se::Stream* host_to_device_stream,
+      se::Stream* device_to_host_stream, xla::LocalClient* client,
+      bool transfer_as_literal,
       XlaCompiler::ShapeRepresentationFn shape_representation_fn);
 
   void CopyCPUTensorToDevice(const Tensor* cpu_tensor, Device* device,
@@ -67,10 +69,17 @@ class XlaTransferManager {
   void TransferLiteralFromDevice(Tensor* host_tensor,
                                  const Tensor& device_tensor,
                                  const StatusCallback& done) const;
+  bool UseMultipleStreams() const { return stream_ != host_to_device_stream_; }
 
-  // Stream obtained from a Device, used to transfer tensors between
-  // CPU and device.
+  // The main compute stream of the device, used to synchronize the transfer
+  // streams if they are set.
   se::Stream* stream_;
+  // The stream to use for transferring data from host to device. Can be
+  // idential to stream_, but must not be nullptr.
+  se::Stream* host_to_device_stream_;
+  // The stream to use for transferring data from device to host. Can be
+  // idential to stream_, but must not be nullptr.
+  se::Stream* device_to_host_stream_;
   // For the underlying memory allocator and XLA's TransferManager.
   xla::LocalClient* client_;
   // Transfer manager, for marshalling data to and from the device.
@@ -86,7 +95,9 @@ class XlaTransferManager {
 class XlaDeviceContext : public DeviceContext {
  public:
   explicit XlaDeviceContext(
-      se::Stream* stream, xla::LocalClient* client, bool transfer_as_literal,
+      se::Stream* compute_stream, se::Stream* host_to_device_stream,
+      se::Stream* device_to_host_stream, xla::LocalClient* client,
+      bool transfer_as_literal,
       XlaCompiler::ShapeRepresentationFn shape_representation_fn);
 
   void CopyCPUTensorToDevice(const Tensor* cpu_tensor, Device* device,
diff --git a/tensorflow/compiler/jit/xla_device_ops.h b/tensorflow/compiler/jit/xla_device_ops.h
index a605335a94..6adda327f1 100644
--- a/tensorflow/compiler/jit/xla_device_ops.h
+++ b/tensorflow/compiler/jit/xla_device_ops.h
@@ -77,9 +77,7 @@ class XlaAssignVariableOp : public AsyncOpKernel {
       ConstantOp);                                                             \
   REGISTER_KERNEL_BUILDER(                                                     \
       Name("Identity").Device(DEVICE).TypeConstraint("T", TYPES), IdentityOp); \
-  REGISTER_KERNEL_BUILDER(                                                     \
-      Name("IdentityN").Device(DEVICE).TypeConstraint("T", TYPES),             \
-      IdentityNOp);                                                            \
+  REGISTER_KERNEL_BUILDER(Name("IdentityN").Device(DEVICE), IdentityNOp);      \
   REGISTER_KERNEL_BUILDER(Name("Placeholder").Device(DEVICE), PlaceholderOp);  \
   REGISTER_KERNEL_BUILDER(Name("PlaceholderV2").Device(DEVICE),                \
                           PlaceholderOp);                                      \
@@ -90,6 +88,9 @@ class XlaAssignVariableOp : public AsyncOpKernel {
   REGISTER_KERNEL_BUILDER(                                                     \
       Name("ReadVariableOp").Device(DEVICE).HostMemory("resource"),            \
       ReadVariableOp);                                                         \
+  REGISTER_KERNEL_BUILDER(                                                     \
+      Name("DestroyResourceOp").Device(DEVICE).HostMemory("resource"),         \
+      DestroyResourceOp);                                                      \
   REGISTER_KERNEL_BUILDER(Name("Shape")                                        \
                               .Device(DEVICE)                                  \
                               .HostMemory("output")                            \
diff --git a/tensorflow/compiler/jit/xla_fusion_optimizer.cc b/tensorflow/compiler/jit/xla_fusion_optimizer.cc
index 74257b09a8..4b499b1613 100644
--- a/tensorflow/compiler/jit/xla_fusion_optimizer.cc
+++ b/tensorflow/compiler/jit/xla_fusion_optimizer.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include <unordered_map>
 #include <unordered_set>
 
+#include "tensorflow/compiler/jit/deadness_analysis.h"
 #include "tensorflow/compiler/jit/defs.h"
 #include "tensorflow/compiler/jit/graphcycles/graphcycles.h"
 #include "tensorflow/compiler/jit/union_find.h"
@@ -146,6 +147,9 @@ Status XlaFusionOptimizer::Optimize(grappler::Cluster* cluster,
   TF_RETURN_IF_ERROR(
       ImportGraphDef(options, item.graph, &graph, &shape_refiner));
 
+  std::unique_ptr<DeadnessAnalysis> deadness;
+  TF_RETURN_IF_ERROR(DeadnessAnalysis::Run(graph, &deadness));
+
   // Collect nodes that can be fused via XLA, while ignoring those that
   // explicitly ask for XLA: (*) nodes that are marked to be compiled
   // explicitly. (*) nodes assigned to XLA device.
@@ -185,6 +189,14 @@ Status XlaFusionOptimizer::Optimize(grappler::Cluster* cluster,
       continue;
     }
 
+    // If inputs to `node` can have conflicting deadness (i.e. some are alive
+    // and some are dead) then don't compile it.  XLA cannot represent the
+    // deadness semantics of these nodes correctly and auto-clustering these
+    // nodes can cause deadness to propagate to nodes that should be live.
+    if (node->IsMerge() || deadness->HasInputsWithMismatchingDeadness(*node)) {
+      continue;
+    }
+
     compilation_candidates.insert(node);
   }
 
diff --git a/tensorflow/compiler/jit/xla_gpu_device.cc b/tensorflow/compiler/jit/xla_gpu_device.cc
index c0d86a28c7..851b118b0c 100644
--- a/tensorflow/compiler/jit/xla_gpu_device.cc
+++ b/tensorflow/compiler/jit/xla_gpu_device.cc
@@ -49,6 +49,7 @@ Status XlaGpuDeviceFactory::CreateDevices(const SessionOptions& options,
       XlaDevice::Create("CUDA", DEVICE_XLA_GPU, 0, DEVICE_GPU_XLA_JIT, options,
                         name_prefix, registration,
                         /*transfer_as_literal=*/false,
+                        /*use_multiple_streams=*/false,
                         /*shape_representation_fn=*/{},
                         /*padded_shape_fn=*/{}, &device);
   if (!status.ok()) {
diff --git a/tensorflow/compiler/jit/xla_interpreter_device.cc b/tensorflow/compiler/jit/xla_interpreter_device.cc
index 661187f4a8..4574559674 100644
--- a/tensorflow/compiler/jit/xla_interpreter_device.cc
+++ b/tensorflow/compiler/jit/xla_interpreter_device.cc
@@ -52,6 +52,7 @@ Status XlaInterpreterDeviceFactory::CreateDevices(
                                        DEVICE_INTERPRETER_XLA_JIT, options,
                                        name_prefix, registration,
                                        /*transfer_as_literal=*/false,
+                                       /*use_multiple_streams=*/false,
                                        /*shape_representation_fn=*/{},
                                        /*padded_shape_fn=*/{}, &device));
   devices->push_back(device.release());
diff --git a/tensorflow/compiler/jit/xla_launch_util.cc b/tensorflow/compiler/jit/xla_launch_util.cc
index 5ceccc769f..6134b8c694 100644
--- a/tensorflow/compiler/jit/xla_launch_util.cc
+++ b/tensorflow/compiler/jit/xla_launch_util.cc
@@ -64,11 +64,13 @@ xla::StatusOr<xla::OwningDeviceMemory> XlaAllocator::Allocate(
     int device_ordinal, uint64 size, bool retry_on_failure) {
   AllocationAttributes attrs;
   attrs.no_retry_on_failure = !retry_on_failure;
-  void* data =
-      wrapped_->AllocateRaw(Allocator::kAllocatorAlignment, size, attrs);
-  if (data == nullptr) {
-    return errors::ResourceExhausted("Out of memory while trying to allocate ",
-                                     size, " bytes.");
+  void* data = nullptr;
+  if (size != 0) {
+    data = wrapped_->AllocateRaw(Allocator::kAllocatorAlignment, size, attrs);
+    if (data == nullptr) {
+      return errors::ResourceExhausted(
+          "Out of memory while trying to allocate ", size, " bytes.");
+    }
   }
   return xla::OwningDeviceMemory(se::DeviceMemoryBase(data, size),
                                  device_ordinal, this);
@@ -115,14 +117,22 @@ using internal::ExtractSubShapedBuffer;
 
 XlaComputationLaunchContext::XlaComputationLaunchContext(
     xla::LocalClient* client, xla::DeviceMemoryAllocator* xla_allocator,
-    bool allocate_xla_tensors)
+    bool allocate_xla_tensors, bool use_multiple_streams)
     : client_(client),
       xla_allocator_(xla_allocator),
-      allocate_xla_tensors_(allocate_xla_tensors) {}
+      allocate_xla_tensors_(allocate_xla_tensors),
+      use_multiple_streams_(use_multiple_streams) {
+  if (use_multiple_streams_) {
+    CHECK(allocate_xla_tensors_) << "To use multiple streams correctly we must "
+                                    "be allocating XLA tensors!";
+  }
+}
 
 void XlaComputationLaunchContext::PopulateInputs(
     OpKernelContext* ctx, const XlaCompiler::CompilationResult* kernel,
     const std::map<int, OptionalTensor>& variables) {
+  se::Stream* stream =
+      ctx->op_device_context() ? ctx->op_device_context()->stream() : nullptr;
   // Build ShapedBuffers that point directly to the Tensor buffers.
   arg_buffers_.reserve(kernel->xla_input_shapes.size() + 1);
   arg_buffers_.resize(kernel->xla_input_shapes.size());
@@ -140,6 +150,16 @@ void XlaComputationLaunchContext::PopulateInputs(
       t = &(ctx->input(arg_num));
     }
 
+    if (use_multiple_streams_) {
+      CHECK(stream) << "Must have a stream available when using XLA tensors!";
+      XlaTensor* xla_tensor = XlaTensor::FromTensor(t);
+      CHECK(xla_tensor);
+      if (se::Event* event = xla_tensor->GetDefinitionEvent(stream)) {
+        stream->ThenWaitFor(event);
+        xla_tensor->SetDefinedOn(stream);
+      }
+    }
+
     const xla::Shape on_device_shape =
         client_->backend().transfer_manager()->HostShapeToDeviceShape(shape);
     if (xla::ShapeUtil::IsTuple(on_device_shape)) {
@@ -248,6 +268,12 @@ void XlaComputationLaunchContext::PopulateOutputs(
         if (xla_tensor) {
           xla_tensor->set_shaped_buffer(ScopedShapedBuffer(
               ExtractSubShapedBuffer(&output, output_num, xla_allocator_)));
+          if (use_multiple_streams_) {
+            se::Event event(stream->parent());
+            CHECK(event.Init());
+            stream->ThenRecordEvent(&event);
+            xla_tensor->SetDefinedOn(stream, std::move(event));
+          }
         } else {
           // xla_tensor wasn't valid, which must mean this is a zero-element
           // tensor.
@@ -302,6 +328,12 @@ void XlaComputationLaunchContext::PopulateOutputs(
       CHECK(xla_tensor);
       xla_tensor->set_shaped_buffer(
           ExtractSubShapedBuffer(&output, output_num, xla_allocator_));
+      if (use_multiple_streams_) {
+        se::Event event(stream->parent());
+        CHECK(event.Init());
+        stream->ThenRecordEvent(&event);
+        xla_tensor->SetDefinedOn(stream, std::move(event));
+      }
       *variable->tensor() = output_tensor;
     } else {
       Tensor output_tensor = XlaTensorBuffer::MakeTensor(
diff --git a/tensorflow/compiler/jit/xla_launch_util.h b/tensorflow/compiler/jit/xla_launch_util.h
index 4390701ccb..1ea3fa4cf2 100644
--- a/tensorflow/compiler/jit/xla_launch_util.h
+++ b/tensorflow/compiler/jit/xla_launch_util.h
@@ -76,9 +76,15 @@ class XlaComputationLaunchContext {
   // Create a new launch context. 'allocate_xla_tensors' is true if allocated
   // output tensors and variables are always XlaTensors. If false they are
   // assumed to be "normal" device pointers.
+  // If 'use_multiple_streams' is true, tensors may be defined and used on
+  // multiple streams and so se::Events must be defined and waited for. If
+  // 'use_multiple_streams' is true, 'allocate_xla_tensors' must also be true
+  // because we track inter-stream dependencies through events inside XlaTensor
+  // objects.
   XlaComputationLaunchContext(xla::LocalClient* client,
                               xla::DeviceMemoryAllocator* xla_allocator,
-                              bool allocate_xla_tensors);
+                              bool allocate_xla_tensors,
+                              bool use_multiple_streams);
 
   // Add all inputs within `ctx` as XLA arguments (returned by arguments()).
   // `variables` is a map from TensorFlow argument number to resource variable.
@@ -99,6 +105,7 @@ class XlaComputationLaunchContext {
   xla::LocalClient* client_;
   xla::DeviceMemoryAllocator* xla_allocator_;
   bool allocate_xla_tensors_;
+  bool use_multiple_streams_;
   std::vector<std::unique_ptr<xla::ShapedBuffer>> arg_buffers_;
   std::vector<xla::ShapedBuffer*> arg_ptrs_;
 };
@@ -115,7 +122,11 @@ class XlaTensorBuffer : public TensorBuffer {
     data_ = const_cast<void*>(ptr);
   }
 
-  ~XlaTensorBuffer() override { allocator_->DeallocateRaw(data_); }
+  ~XlaTensorBuffer() override {
+    if (data_) {
+      allocator_->DeallocateRaw(data_);
+    }
+  }
 
   void* data() const override { return data_; }
   size_t size() const override { return expected_size_; }
diff --git a/tensorflow/compiler/jit/xla_tensor.cc b/tensorflow/compiler/jit/xla_tensor.cc
index 3c44c4ae6d..d777dfa5a3 100644
--- a/tensorflow/compiler/jit/xla_tensor.cc
+++ b/tensorflow/compiler/jit/xla_tensor.cc
@@ -73,6 +73,34 @@ Status XlaTensor::AllocateShapedBuffer(DataType dtype, const TensorShape& shape,
   return Status::OK();
 }
 
+se::Event* XlaTensor::GetDefinitionEvent(se::Stream* stream) {
+  mutex_lock lock(mu_);
+  if (!definition_event_.has_value()) {
+    return nullptr;
+  }
+
+  // The set of defined streams is expected to be very small indeed (usually
+  // 1-2), so a simple linear scan should be fast enough.
+  if (std::find(streams_defined_on_.begin(), streams_defined_on_.end(),
+                stream) != streams_defined_on_.end()) {
+    // stream is in streams_defined_on_; it doesn't need to be waited on.
+    return nullptr;
+  }
+
+  return &*definition_event_;
+}
+
+void XlaTensor::SetDefinedOn(se::Stream* stream, se::Event event) {
+  mutex_lock lock(mu_);
+  definition_event_ = std::move(event);
+  streams_defined_on_ = {stream};
+}
+
+void XlaTensor::SetDefinedOn(se::Stream* stream) {
+  mutex_lock lock(mu_);
+  streams_defined_on_.push_back(stream);
+}
+
 // The pointer tag, OR-ed into the XlaTensor's address to distinguish it from
 // device-side tensors, which are either CPU or GPU memory pointers. This works
 // because we're guaranteed that CPU and GPU pointers are aligned to > 1 bits.
diff --git a/tensorflow/compiler/jit/xla_tensor.h b/tensorflow/compiler/jit/xla_tensor.h
index c54001a999..f7e401c731 100644
--- a/tensorflow/compiler/jit/xla_tensor.h
+++ b/tensorflow/compiler/jit/xla_tensor.h
@@ -85,6 +85,24 @@ class XlaTensor {
     host_tensor_.reset(new Tensor(tensor));
   }
 
+  // If the tensor's content is not yet defined on 'stream', and there exists an
+  // se::Event declaring when the tensor's content is defined, return it.
+  // Otherwise, return nullptr. If this function returns nullptr then the
+  // tensor's content can be read on 'stream' without additional
+  // synchronization.
+  se::Event* GetDefinitionEvent(se::Stream* stream);
+
+  // Assert that the tensor's content is defined on 'stream' by the time 'event'
+  // triggers.
+  void SetDefinedOn(se::Stream* stream, se::Event event);
+
+  // Assert that the tensor's content is defined on 'stream'. This version does
+  // not provide an event, and must be called *after* SetDefinedOn(Stream,
+  // Event). This call can be read as an assertion that the definition event has
+  // been waited on by 'stream', so further calls to GetDefinitionEvent(stream)
+  // do not need to also wait on the event.
+  void SetDefinedOn(se::Stream* stream);
+
   // Convert from a raw pointer to an XlaTensor, removing the pointer tag.
   static XlaTensor* FromOpaquePointer(void* ptr);
   // Convert to a raw pointer from an XlaTensor, adding the pointer tag.
@@ -95,6 +113,14 @@ class XlaTensor {
   std::unique_ptr<xla::ScopedShapedBuffer> shaped_buffer_;
   // An optional host tensor value.
   std::unique_ptr<Tensor> host_tensor_;
+  // An optional event that is triggered when the tensor's content has been
+  // defined. If this event is nullptr, it is assumed that the tensor's content
+  // is always defined.
+  gtl::optional<se::Event> definition_event_;
+  // A list of all streams for which the tensor's content is defined for any
+  // newly enqueued command.
+  gtl::InlinedVector<se::Stream*, 2> streams_defined_on_ GUARDED_BY(mu_);
+  mutex mu_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tests/BUILD b/tensorflow/compiler/tests/BUILD
index 080b1c9c35..b7dc5d4c74 100644
--- a/tensorflow/compiler/tests/BUILD
+++ b/tensorflow/compiler/tests/BUILD
@@ -98,6 +98,19 @@ tf_xla_py_test(
 )
 
 tf_xla_py_test(
+    name = "adagrad_da_test",
+    size = "small",
+    srcs = ["adagrad_da_test.py"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:training",
+    ],
+)
+
+tf_xla_py_test(
     name = "adam_test",
     size = "small",
     srcs = ["adam_test.py"],
@@ -112,6 +125,48 @@ tf_xla_py_test(
 )
 
 tf_xla_py_test(
+    name = "adamax_test",
+    size = "small",
+    srcs = ["adamax_test.py"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/contrib/opt:opt_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:training",
+    ],
+)
+
+tf_xla_py_test(
+    name = "addsign_test",
+    size = "small",
+    srcs = ["addsign_test.py"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/contrib/opt:opt_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:training",
+    ],
+)
+
+tf_xla_py_test(
+    name = "powersign_test",
+    size = "small",
+    srcs = ["powersign_test.py"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/contrib/opt:opt_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:training",
+    ],
+)
+
+tf_xla_py_test(
     name = "argminmax_test",
     size = "small",
     srcs = ["argminmax_test.py"],
@@ -180,7 +235,7 @@ tf_xla_py_test(
 
 tf_xla_py_test(
     name = "cholesky_op_test",
-    size = "small",
+    size = "medium",
     srcs = ["cholesky_op_test.py"],
     tags = ["optonly"],
     deps = [
@@ -363,7 +418,7 @@ tf_xla_py_test(
 
 tf_xla_py_test(
     name = "eager_test",
-    size = "small",
+    size = "large",
     srcs = ["eager_test.py"],
     disabled_backends = [
         # TODO(b/78199195) Support XLA CPU devices in eager runtime
@@ -584,6 +639,54 @@ tf_xla_py_test(
 )
 
 tf_xla_py_test(
+    name = "proximal_adagrad_test",
+    size = "medium",
+    srcs = ["proximal_adagrad_test.py"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:training",
+    ],
+)
+
+tf_xla_py_test(
+    name = "proximal_gradient_descent_test",
+    size = "medium",
+    srcs = ["proximal_gradient_descent_test.py"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:training",
+    ],
+)
+
+tf_xla_py_test(
+    name = "qr_op_test",
+    size = "medium",
+    srcs = ["qr_op_test.py"],
+    disabled_backends = [
+        # Test is very slow on CPU.
+        "cpu",
+        "cpu_ondemand",
+    ],
+    shard_count = 5,
+    tags = ["optonly"],
+    deps = [
+        ":xla_test",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:training",
+        "@absl_py//absl/testing:parameterized",
+    ],
+)
+
+tf_xla_py_test(
     name = "random_ops_test",
     size = "small",
     srcs = ["random_ops_test.py"],
@@ -898,8 +1001,9 @@ tf_xla_py_test(
 
 tf_xla_py_test(
     name = "sort_ops_test",
-    size = "small",
+    size = "medium",
     srcs = ["sort_ops_test.py"],
+    shard_count = 5,
     # Times out in fastbuild mode.
     tags = ["optonly"],
     deps = [
diff --git a/tensorflow/compiler/tests/adagrad_da_test.py b/tensorflow/compiler/tests/adagrad_da_test.py
new file mode 100644
index 0000000000..dc1625793a
--- /dev/null
+++ b/tensorflow/compiler/tests/adagrad_da_test.py
@@ -0,0 +1,165 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for AdagradDA optimizer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.compiler.tests import xla_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import adagrad_da
+
+
+class AdagradDAOptimizerTest(xla_test.XLATestCase):
+
+  def testAdagradDAWithoutRegularizationBasic1(self):
+    for dtype in self.float_types:
+      with self.test_session(), self.test_scope():
+        global_step = resource_variable_ops.ResourceVariable(
+            0, dtype=dtypes.int64)
+        var0 = resource_variable_ops.ResourceVariable([0.0, 0.0], dtype=dtype)
+        var1 = resource_variable_ops.ResourceVariable([0.0, 0.0], dtype=dtype)
+        grads0 = constant_op.constant([0.1, 0.2], dtype=dtype)
+        grads1 = constant_op.constant([0.01, 0.02], dtype=dtype)
+        opt = adagrad_da.AdagradDAOptimizer(
+            3.0,
+            global_step,
+            initial_gradient_squared_accumulator_value=0.1,
+            l1_regularization_strength=0.0,
+            l2_regularization_strength=0.0)
+        update = opt.apply_gradients(
+            zip([grads0, grads1], [var0, var1]), global_step=global_step)
+        variables.global_variables_initializer().run()
+
+        self.assertAllClose([0.0, 0.0], var0.eval())
+        self.assertAllClose([0.0, 0.0], var1.eval())
+
+        # Run a step of AdagradDA
+        update.run()
+
+        # Let g to be gradient accumulator, gg to be gradient squared
+        # accumulator, T be the global step, lr is the learning rate, and k the
+        # initial gradient squared accumulator value.
+        # w = \dfrac{sign(-g)*lr*|g - l1*T|_{+}}{l2*T*lr + \sqrt{k+gg})}
+        # For -0.1*3.0*(0.1 - 0)/(0 + sqrt(0.1 + 0.1*0.1)) = -0.904534
+        # similarly for others.
+        self.assertAllCloseAccordingToType(
+            np.array([-0.904534, -1.603567]), var0.eval())
+        self.assertAllCloseAccordingToType(
+            np.array([-0.094821, -0.189358]), var1.eval())
+
+  def testAdagradDAwithoutRegularizationBasic2(self):
+    for dtype in self.float_types:
+      with self.test_session(), self.test_scope():
+        global_step = resource_variable_ops.ResourceVariable(
+            0, dtype=dtypes.int64)
+        var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+        var1 = resource_variable_ops.ResourceVariable([4.0, 3.0], dtype=dtype)
+        grads0 = constant_op.constant([0.1, 0.2], dtype=dtype)
+        grads1 = constant_op.constant([0.01, 0.02], dtype=dtype)
+
+        opt = adagrad_da.AdagradDAOptimizer(
+            3.0,
+            global_step,
+            initial_gradient_squared_accumulator_value=0.1,
+            l1_regularization_strength=0.0,
+            l2_regularization_strength=0.0)
+        update = opt.apply_gradients(
+            zip([grads0, grads1], [var0, var1]), global_step=global_step)
+        variables.global_variables_initializer().run()
+
+        self.assertAllCloseAccordingToType([1.0, 2.0], var0.eval())
+        self.assertAllCloseAccordingToType([4.0, 3.0], var1.eval())
+
+        # Run a step of AdagradDA
+        update.run()
+
+        self.assertAllCloseAccordingToType(
+            np.array([-0.904534, -1.603567]), var0.eval())
+        self.assertAllCloseAccordingToType(
+            np.array([-0.094821, -0.189358]), var1.eval())
+
+  def testAdagradDAWithL1(self):
+    for dtype in self.float_types:
+      with self.test_session(), self.test_scope():
+        global_step = resource_variable_ops.ResourceVariable(
+            0, dtype=dtypes.int64)
+        var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+        var1 = resource_variable_ops.ResourceVariable([4.0, 3.0], dtype=dtype)
+        grads0 = constant_op.constant([0.1, 0.2], dtype=dtype)
+        grads1 = constant_op.constant([0.01, 0.02], dtype=dtype)
+
+        opt = adagrad_da.AdagradDAOptimizer(
+            3.0,
+            global_step,
+            initial_gradient_squared_accumulator_value=0.1,
+            l1_regularization_strength=0.001,
+            l2_regularization_strength=0.0)
+        update = opt.apply_gradients(
+            zip([grads0, grads1], [var0, var1]), global_step=global_step)
+        variables.global_variables_initializer().run()
+
+        self.assertAllCloseAccordingToType([1.0, 2.0], var0.eval())
+        self.assertAllCloseAccordingToType([4.0, 3.0], var1.eval())
+
+        # Run a step of AdagradDA
+        update.run()
+
+        self.assertAllCloseAccordingToType(
+            np.array([-0.895489, -1.59555]), var0.eval())
+        self.assertAllCloseAccordingToType(
+            np.array([-0.085339, -0.17989]), var1.eval())
+
+  def testAdagradDAWithL1_L2(self):
+    for dtype in self.float_types:
+      with self.test_session(), self.test_scope():
+        global_step = resource_variable_ops.ResourceVariable(
+            0, dtype=dtypes.int64)
+        var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+        var1 = resource_variable_ops.ResourceVariable([4.0, 3.0], dtype=dtype)
+        grads0 = constant_op.constant([0.1, 0.2], dtype=dtype)
+        grads1 = constant_op.constant([0.01, 0.02], dtype=dtype)
+
+        opt = adagrad_da.AdagradDAOptimizer(
+            3.0,
+            global_step,
+            initial_gradient_squared_accumulator_value=0.1,
+            l1_regularization_strength=0.001,
+            l2_regularization_strength=2.0)
+        update = opt.apply_gradients(
+            zip([grads0, grads1], [var0, var1]), global_step=global_step)
+        variables.global_variables_initializer().run()
+
+        self.assertAllCloseAccordingToType([1.0, 2.0], var0.eval())
+        self.assertAllCloseAccordingToType([4.0, 3.0], var1.eval())
+
+        # Run a step of AdagradDA
+        update.run()
+
+        self.assertAllCloseAccordingToType(
+            np.array([-0.046907, -0.093659]), var0.eval())
+        self.assertAllCloseAccordingToType(
+            np.array([-0.004275, -0.009023]), var1.eval())
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/compiler/tests/adamax_test.py b/tensorflow/compiler/tests/adamax_test.py
new file mode 100644
index 0000000000..c4fdbc5974
--- /dev/null
+++ b/tensorflow/compiler/tests/adamax_test.py
@@ -0,0 +1,139 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for AdaMax optimizer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.compiler.tests import xla_test
+from tensorflow.contrib.opt.python.training import adamax
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+
+
+def adamax_update_numpy(param,
+                        g_t,
+                        t,
+                        m,
+                        v,
+                        alpha=0.001,
+                        beta1=0.9,
+                        beta2=0.999,
+                        epsilon=1e-8):
+  m_t = beta1 * m + (1 - beta1) * g_t
+  v_t = np.maximum(beta2 * v, np.abs(g_t))
+  param_t = param - (alpha / (1 - beta1**t)) * (m_t / (v_t + epsilon))
+  return param_t, m_t, v_t
+
+
+class AdaMaxOptimizerTest(xla_test.XLATestCase):
+
+  def testBasic(self):
+    for i, dtype in enumerate(self.float_types):
+      with self.test_session(), self.test_scope():
+        variable_scope.get_variable_scope().set_use_resource(True)
+        # Initialize variables for numpy implementation.
+        m0, v0, m1, v1 = 0.0, 0.0, 0.0, 0.0
+        var0_np = np.array([1.0, 2.0], dtype=dtype)
+        grads0_np = np.array([0.1, 0.1], dtype=dtype)
+        var1_np = np.array([3.0, 4.0], dtype=dtype)
+        grads1_np = np.array([0.01, 0.01], dtype=dtype)
+
+        var0 = resource_variable_ops.ResourceVariable(
+            var0_np, name="var0_%d" % i)
+        var1 = resource_variable_ops.ResourceVariable(
+            var1_np, name="var1_%d" % i)
+        grads0 = constant_op.constant(grads0_np)
+        grads1 = constant_op.constant(grads1_np)
+
+        opt = adamax.AdaMaxOptimizer()
+        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+        opt_variables = opt.variables()
+        beta1_power = opt._get_beta_accumulators()
+        self.assertTrue(beta1_power is not None)
+        self.assertIn(beta1_power, opt_variables)
+
+        with ops.Graph().as_default():
+          # Shouldn't return non-slot variables from other graphs.
+          self.assertEqual(0, len(opt.variables()))
+
+        variables.global_variables_initializer().run()
+        # Fetch params to validate initial values
+        self.assertAllClose([1.0, 2.0], var0.eval())
+        self.assertAllClose([3.0, 4.0], var1.eval())
+
+        beta1_power = opt._get_beta_accumulators()
+
+        # Run 3 steps of AdaMax
+        for t in range(1, 4):
+          update.run()
+
+          self.assertAllCloseAccordingToType(0.9**(t + 1), beta1_power.eval())
+
+          var0_np, m0, v0 = adamax_update_numpy(var0_np, grads0_np, t, m0, v0)
+          var1_np, m1, v1 = adamax_update_numpy(var1_np, grads1_np, t, m1, v1)
+
+          # Validate updated params
+          self.assertAllCloseAccordingToType(var0_np, var0.eval(), rtol=1e-2)
+          self.assertAllCloseAccordingToType(var1_np, var1.eval(), rtol=1e-2)
+          self.assertEqual("var0_%d/AdaMax:0" % (i,),
+                           opt.get_slot(var=var0, name="m").name)
+
+  def testTensorLearningRate(self):
+    for dtype in self.float_types:
+      with self.test_session(), self.test_scope():
+        variable_scope.get_variable_scope().set_use_resource(True)
+        # Initialize variables for numpy implementation.
+        m0, v0, m1, v1 = 0.0, 0.0, 0.0, 0.0
+        var0_np = np.array([1.0, 2.0], dtype=dtype)
+        grads0_np = np.array([0.1, 0.1], dtype=dtype)
+        var1_np = np.array([3.0, 4.0], dtype=dtype)
+        grads1_np = np.array([0.01, 0.01], dtype=dtype)
+
+        var0 = resource_variable_ops.ResourceVariable(var0_np)
+        var1 = resource_variable_ops.ResourceVariable(var1_np)
+        grads0 = constant_op.constant(grads0_np)
+        grads1 = constant_op.constant(grads1_np)
+        opt = adamax.AdaMaxOptimizer(constant_op.constant(0.001))
+        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+        variables.global_variables_initializer().run()
+
+        # Fetch params to validate initial values
+        self.assertAllClose([1.0, 2.0], var0.eval())
+        self.assertAllClose([3.0, 4.0], var1.eval())
+
+        beta1_power = opt._get_beta_accumulators()
+
+        # Run 3 steps of AdaMax
+        for t in range(1, 4):
+          self.assertAllCloseAccordingToType(0.9**t, beta1_power.eval())
+          update.run()
+
+          var0_np, m0, v0 = adamax_update_numpy(var0_np, grads0_np, t, m0, v0)
+          var1_np, m1, v1 = adamax_update_numpy(var1_np, grads1_np, t, m1, v1)
+
+          # Validate updated params
+          self.assertAllCloseAccordingToType(var0_np, var0.eval())
+          self.assertAllCloseAccordingToType(var1_np, var1.eval())
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/compiler/tests/addsign_test.py b/tensorflow/compiler/tests/addsign_test.py
new file mode 100644
index 0000000000..9ec5a964cb
--- /dev/null
+++ b/tensorflow/compiler/tests/addsign_test.py
@@ -0,0 +1,142 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for AddSign."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.compiler.tests import xla_test
+from tensorflow.contrib.opt.python.training import addsign
+from tensorflow.contrib.opt.python.training import sign_decay
+from tensorflow.python.framework import constant_op
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+
+
+def py_linear_decay_fn(decay_steps):
+  def linear_decay(step):
+    step = min(step, decay_steps)
+    return float(decay_steps - step) / decay_steps
+  return linear_decay
+
+
+def addsign_update_numpy(params,
+                         g_t,
+                         m,
+                         lr,
+                         alpha=1.0,
+                         beta=0.9,
+                         py_sign_decay_fn=None,
+                         t=None):
+  m_t = beta * m + (1 - beta) * g_t
+  if py_sign_decay_fn is None:
+    sign_decayed = 1.0
+  else:
+    sign_decayed = py_sign_decay_fn(t-1)
+  multiplier = alpha + sign_decayed * np.sign(g_t) * np.sign(m_t)
+  params_t = params - lr * multiplier * g_t
+  return params_t, m_t
+
+
+class AddSignTest(xla_test.XLATestCase):
+
+  def _testDense(self,
+                 learning_rate=0.1,
+                 sign_decay_fn=None,
+                 py_sign_decay_fn=None,
+                 alpha=1.0,
+                 beta=0.9):
+    for dtype in self.float_types:
+      with self.test_session(), self.test_scope():
+        # Initialize variables for numpy implementation.
+        m0, m1 = 0.0, 0.0
+        var0_np = np.array([1.0, 2.0], dtype=dtype)
+        grads0_np = np.array([0.1, 0.1], dtype=dtype)
+        var1_np = np.array([3.0, 4.0], dtype=dtype)
+        grads1_np = np.array([0.01, 0.01], dtype=dtype)
+
+        var0 = resource_variable_ops.ResourceVariable(var0_np)
+        var1 = resource_variable_ops.ResourceVariable(var1_np)
+        global_step = resource_variable_ops.ResourceVariable(0, trainable=False)
+        grads0 = constant_op.constant(grads0_np)
+        grads1 = constant_op.constant(grads1_np)
+
+        opt = addsign.AddSignOptimizer(
+            learning_rate=learning_rate,
+            alpha=alpha,
+            beta=beta,
+            sign_decay_fn=sign_decay_fn,
+        )
+        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]),
+                                     global_step=global_step)
+        neg_update = opt.apply_gradients(zip([-grads0, -grads1], [var0, var1]),
+                                         global_step=global_step)
+        variables.global_variables_initializer().run()
+
+        # Fetch params to validate initial values
+        self.assertAllClose([1.0, 2.0], var0.eval())
+        self.assertAllClose([3.0, 4.0], var1.eval())
+
+        # Run 7 steps of AddSign
+        # first 4 steps with positive gradient
+        # last 3 steps with negative gradient (sign(gm) should be -1)
+        for t in range(1, 8):
+          if t < 5:
+            update.run()
+          else:
+            neg_update.run()
+
+          var0_np, m0 = addsign_update_numpy(
+              var0_np,
+              grads0_np if t < 5 else -grads0_np,
+              m0,
+              learning_rate,
+              alpha=alpha,
+              beta=beta,
+              py_sign_decay_fn=py_sign_decay_fn,
+              t=t,
+          )
+          var1_np, m1 = addsign_update_numpy(
+              var1_np,
+              grads1_np if t < 5 else -grads1_np,
+              m1,
+              learning_rate,
+              alpha=alpha,
+              beta=beta,
+              py_sign_decay_fn=py_sign_decay_fn,
+              t=t,
+          )
+
+          # Validate updated params
+          self.assertAllCloseAccordingToType(
+              var0_np, var0.eval(), half_rtol=1e-2)
+          self.assertAllCloseAccordingToType(var1_np, var1.eval())
+
+  def testDense(self):
+    decay_steps = 10
+    sign_decay_fn = sign_decay.get_linear_decay_fn(decay_steps)
+    py_sign_decay_fn = py_linear_decay_fn(decay_steps)
+    self._testDense()
+    self._testDense(learning_rate=0.01, alpha=0.1, beta=0.8)
+    self._testDense(
+        sign_decay_fn=sign_decay_fn, py_sign_decay_fn=py_sign_decay_fn)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/compiler/tests/binary_ops_test.py b/tensorflow/compiler/tests/binary_ops_test.py
index 9cb3d04546..0aafda7fb4 100644
--- a/tensorflow/compiler/tests/binary_ops_test.py
+++ b/tensorflow/compiler/tests/binary_ops_test.py
@@ -691,11 +691,13 @@ class BinaryOpsTest(xla_test.XLATestCase):
           np.array([[10], [7], [2]], dtype=np.float32),
           np.float32(7),
           expected=np.array([[False], [False], [True]], dtype=np.bool))
-      self._testBinary(
-          less_op,
-          np.array([[10], [7], [2], [-1]], dtype=np.int64),
-          np.int64(7),
-          expected=np.array([[False], [False], [True], [True]], dtype=np.bool))
+      if np.int64 in self.numeric_types:
+        self._testBinary(
+            less_op,
+            np.array([[10], [7], [2], [-1]], dtype=np.int64),
+            np.int64(7),
+            expected=np.array(
+                [[False], [False], [True], [True]], dtype=np.bool))
 
     for less_equal_op in [math_ops.less_equal, (lambda x, y: x <= y)]:
       self._testBinary(
diff --git a/tensorflow/compiler/tests/cholesky_op_test.py b/tensorflow/compiler/tests/cholesky_op_test.py
index d2867278af..ed532db0ee 100644
--- a/tensorflow/compiler/tests/cholesky_op_test.py
+++ b/tensorflow/compiler/tests/cholesky_op_test.py
@@ -18,8 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import unittest
-
 import numpy as np
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
@@ -103,9 +101,8 @@ class CholeskyOpTest(xla_test.XLATestCase):
       with self.assertRaises(ValueError):
         linalg_ops.cholesky(tensor3)
 
-  @unittest.skip("Test is slow")
-  def testLarge(self):
-    n = 200
+  def testLarge2000x2000(self):
+    n = 2000
     shape = (n, n)
     data = np.ones(shape).astype(np.float32) / (2.0 * n) + np.diag(
         np.ones(n).astype(np.float32))
@@ -128,6 +125,5 @@ class CholeskyOpTest(xla_test.XLATestCase):
       matrix = np.dot(np.dot(w, np.diag(v)), w.T).astype(dtype)
       self._verifyCholesky(matrix, atol=1e-4)
 
-
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/compiler/tests/conv2d_test.py b/tensorflow/compiler/tests/conv2d_test.py
index 98d41ba7ed..f9db103f6d 100644
--- a/tensorflow/compiler/tests/conv2d_test.py
+++ b/tensorflow/compiler/tests/conv2d_test.py
@@ -33,12 +33,9 @@ from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.platform import googletest
 
-
 DATA_FORMATS = (
     ("_data_format_NHWC", "NHWC"),
     ("_data_format_NCHW", "NCHW"),
-    ("_data_format_HWNC", "HWNC"),
-    ("_data_format_HWCN", "HWCN"),
 )
 
 
diff --git a/tensorflow/compiler/tests/eager_test.py b/tensorflow/compiler/tests/eager_test.py
index 3524666499..6ead15da13 100644
--- a/tensorflow/compiler/tests/eager_test.py
+++ b/tensorflow/compiler/tests/eager_test.py
@@ -403,7 +403,7 @@ class EagerFunctionTest(xla_test.XLATestCase):
   def testSliceInDefun(self):
     with self.test_scope():
 
-      @function.defun(compiled=True)
+      @function.defun
       def f(x, y):
         return x[0::2, y:, ...]
 
@@ -418,6 +418,22 @@ class EagerFunctionTest(xla_test.XLATestCase):
       self.assertAllEqual(np.ones([1, 2, 4]), z.numpy())
       self.assertAllEqual((2, 3, 4), dz.shape.as_list())
 
+  def testNestedDefun(self):
+    self.skipTest('Nested defuns do not work on TPU at the moment')
+    with self.test_scope():
+
+      @function.defun
+      def times_two(x):
+        return 2 * x
+
+      @function.defun
+      def two_x_plus_1(x):
+        return times_two(x) + 1
+
+      x = constant_op.constant([2, 3, 4])
+      y = two_x_plus_1(x)
+      self.assertAllEqual([5, 7, 9], y.numpy())
+
 
 class ExcessivePaddingTest(xla_test.XLATestCase):
   """Test that eager execution works with TPU flattened tensors.
@@ -470,6 +486,36 @@ class ExcessivePaddingTest(xla_test.XLATestCase):
       self.assertAllEqual(100 * [[36.0]], reduced)
 
 
+def multiple_tpus():
+  devices = context.context().devices()
+  return len([d for d in devices if 'device:TPU:' in d]) > 1
+
+
+class MultiDeviceTest(xla_test.XLATestCase):
+  """Test running TPU computation on more than one core."""
+
+  def testBasic(self):
+    if not multiple_tpus():
+      self.skipTest('MultiDeviceTest requires multiple TPU devices.')
+
+    # Compute 10 on TPU core 0
+    with ops.device('device:TPU:0'):
+      two = constant_op.constant(2)
+      five = constant_op.constant(5)
+      ten = two * five
+      self.assertAllEqual(10, ten)
+
+    # Compute 6 on TPU core 1
+    with ops.device('device:TPU:1'):
+      two = constant_op.constant(2)
+      three = constant_op.constant(3)
+      six = two * three
+      self.assertAllEqual(6, six)
+
+    # Copy 10 and 6 to CPU and sum them
+    self.assertAllEqual(16, ten + six)
+
+
 if __name__ == '__main__':
   ops.enable_eager_execution(
       config=config_pb2.ConfigProto(log_device_placement=True))
diff --git a/tensorflow/compiler/tests/image_ops_test.py b/tensorflow/compiler/tests/image_ops_test.py
index 8b01ef96db..bf986ade06 100644
--- a/tensorflow/compiler/tests/image_ops_test.py
+++ b/tensorflow/compiler/tests/image_ops_test.py
@@ -26,6 +26,7 @@ import numpy as np
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
 from tensorflow.compiler.tests import xla_test
+from tensorflow.python.compat import compat
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
@@ -579,5 +580,140 @@ class ResizeBilinearTest(xla_test.XLATestCase):
         large_tolerance=True)
 
 
+class NonMaxSuppressionTest(xla_test.XLATestCase):
+
+  def testNMS128From1024(self):
+    # TODO(b/26783907): The Sort HLO is not implemented on CPU or GPU.
+    if self.device in ["XLA_CPU", "XLA_GPU"]:
+      return
+
+    with compat.forward_compatibility_horizon(2018, 8, 8):
+      num_boxes = 1024
+      boxes_np = np.random.normal(50, 10, (num_boxes, 4)).astype("f4")
+      scores_np = np.random.normal(0.5, 0.1, (num_boxes,)).astype("f4")
+
+      max_output_size = 128
+      iou_threshold_np = np.array(0.5, dtype=np.float32)
+      score_threshold_np = np.array(0.0, dtype=np.float32)
+
+      with self.test_session() as sess:
+        boxes = array_ops.placeholder(boxes_np.dtype, shape=boxes_np.shape)
+        scores = array_ops.placeholder(scores_np.dtype, shape=scores_np.shape)
+        iou_threshold = array_ops.placeholder(iou_threshold_np.dtype,
+                                              iou_threshold_np.shape)
+        score_threshold = array_ops.placeholder(score_threshold_np.dtype,
+                                                score_threshold_np.shape)
+        with self.test_scope():
+          selected_indices = image_ops.non_max_suppression_padded(
+              boxes=boxes,
+              scores=scores,
+              max_output_size=max_output_size,
+              iou_threshold=iou_threshold,
+              score_threshold=score_threshold,
+              pad_to_max_output_size=True)
+        inputs_feed = {
+            boxes: boxes_np,
+            scores: scores_np,
+            score_threshold: score_threshold_np,
+            iou_threshold: iou_threshold_np
+        }
+        (indices_tf, _) = sess.run(selected_indices, feed_dict=inputs_feed)
+
+        self.assertEqual(indices_tf.size, max_output_size)
+
+  def testNMS3From6Boxes(self):
+    # TODO(b/26783907): The Sort HLO is not implemented on CPU or GPU.
+    if self.device in ["XLA_CPU", "XLA_GPU"]:
+      return
+
+    with compat.forward_compatibility_horizon(2018, 8, 8):
+      # Three boxes are selected based on IOU.
+      boxes_data = [[0, 0, 1, 1], [0, 0.1, 1, 1.1], [0, -0.1, 1, 0.9],
+                    [0, 10, 1, 11], [0, 10.1, 1, 11.1], [0, 100, 1, 101]]
+      boxes_np = np.array(boxes_data, dtype=np.float32)
+
+      scores_data = [0.9, 0.75, 0.6, 0.95, 0.5, 0.3]
+      scores_np = np.array(scores_data, dtype=np.float32)
+
+      max_output_size = 3
+      iou_threshold_np = np.array(0.5, dtype=np.float32)
+      score_threshold_np = np.array(0.0, dtype=np.float32)
+
+      with self.test_session() as sess:
+        boxes = array_ops.placeholder(boxes_np.dtype, shape=boxes_np.shape)
+        scores = array_ops.placeholder(scores_np.dtype, shape=scores_np.shape)
+        iou_threshold = array_ops.placeholder(iou_threshold_np.dtype,
+                                              iou_threshold_np.shape)
+        score_threshold = array_ops.placeholder(score_threshold_np.dtype,
+                                                score_threshold_np.shape)
+        with self.test_scope():
+          selected_indices = image_ops.non_max_suppression_padded(
+              boxes=boxes,
+              scores=scores,
+              max_output_size=max_output_size,
+              iou_threshold=iou_threshold,
+              score_threshold=score_threshold,
+              pad_to_max_output_size=True)
+        inputs_feed = {
+            boxes: boxes_np,
+            scores: scores_np,
+            score_threshold: score_threshold_np,
+            iou_threshold: iou_threshold_np
+        }
+        (indices_tf, num_valid) = sess.run(
+            selected_indices, feed_dict=inputs_feed)
+
+        self.assertEqual(indices_tf.size, max_output_size)
+        self.assertEqual(num_valid, 3)
+        self.assertAllClose(indices_tf[:num_valid], [3, 0, 5])
+
+  def testNMS3Then2WithScoreThresh(self):
+    # Three boxes are selected based on IOU.
+    # One is filtered out by score threshold.
+
+    # TODO(b/26783907): The Sort HLO is not implemented on CPU or GPU.
+    if self.device in ["XLA_CPU", "XLA_GPU"]:
+      return
+
+    with compat.forward_compatibility_horizon(2018, 8, 8):
+      boxes_data = [[0, 0, 1, 1], [0, 0.1, 1, 1.1], [0, -0.1, 1, 0.9],
+                    [0, 10, 1, 11], [0, 10.1, 1, 11.1], [0, 100, 1, 101]]
+      boxes_np = np.array(boxes_data, dtype=np.float32)
+
+      scores_data = [0.9, 0.75, 0.6, 0.95, 0.5, 0.3]
+      scores_np = np.array(scores_data, dtype=np.float32)
+      max_output_size = 3
+      iou_threshold_np = np.array(0.5, dtype=np.float32)
+      score_threshold_np = np.array(0.4, dtype=np.float32)
+
+      with self.test_session() as sess:
+        boxes = array_ops.placeholder(boxes_np.dtype, shape=boxes_np.shape)
+        scores = array_ops.placeholder(scores_np.dtype, shape=scores_np.shape)
+        iou_threshold = array_ops.placeholder(iou_threshold_np.dtype,
+                                              iou_threshold_np.shape)
+        score_threshold = array_ops.placeholder(score_threshold_np.dtype,
+                                                score_threshold_np.shape)
+        with self.test_scope():
+          selected_indices = image_ops.non_max_suppression_padded(
+              boxes=boxes,
+              scores=scores,
+              max_output_size=max_output_size,
+              iou_threshold=iou_threshold,
+              score_threshold=score_threshold,
+              pad_to_max_output_size=True)
+        inputs_feed = {
+            boxes: boxes_np,
+            scores: scores_np,
+            iou_threshold: iou_threshold_np,
+            score_threshold: score_threshold_np
+        }
+        (indices_tf, num_valid) = sess.run(
+            selected_indices, feed_dict=inputs_feed)
+
+        self.assertEqual(indices_tf.size, max_output_size)
+        self.assertEqual(num_valid, 2)
+        self.assertAllClose(indices_tf[:num_valid], [3, 0])
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/compiler/tests/powersign_test.py b/tensorflow/compiler/tests/powersign_test.py
new file mode 100644
index 0000000000..5fa7706d72
--- /dev/null
+++ b/tensorflow/compiler/tests/powersign_test.py
@@ -0,0 +1,142 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for PowerSign."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import math
+import numpy as np
+
+from tensorflow.compiler.tests import xla_test
+from tensorflow.contrib.opt.python.training import powersign
+from tensorflow.contrib.opt.python.training import sign_decay
+from tensorflow.python.framework import constant_op
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+
+
+def py_linear_decay_fn(decay_steps):
+  def linear_decay(step):
+    step = min(step, decay_steps)
+    return float(decay_steps - step) / decay_steps
+  return linear_decay
+
+
+def powersign_update_numpy(params,
+                           g_t,
+                           m,
+                           lr,
+                           base=math.e,
+                           beta=0.9,
+                           py_sign_decay_fn=None,
+                           t=None):
+  m_t = beta * m + (1 - beta) * g_t
+  if py_sign_decay_fn is None:
+    sign_decayed = 1.0
+  else:
+    sign_decayed = py_sign_decay_fn(t-1)
+  multiplier = base ** (sign_decayed * np.sign(g_t) * np.sign(m_t))
+  params_t = params - lr * multiplier * g_t
+  return params_t, m_t
+
+
+class PowerSignTest(xla_test.XLATestCase):
+
+  def _testDense(self,
+                 learning_rate=0.1,
+                 sign_decay_fn=None,
+                 py_sign_decay_fn=None,
+                 base=math.e,
+                 beta=0.9):
+    for dtype in self.float_types:
+      with self.test_session(), self.test_scope():
+        # Initialize variables for numpy implementation.
+        m0, m1 = 0.0, 0.0
+        var0_np = np.array([1.0, 2.0], dtype=dtype)
+        grads0_np = np.array([0.1, 0.1], dtype=dtype)
+        var1_np = np.array([3.0, 4.0], dtype=dtype)
+        grads1_np = np.array([0.01, 0.01], dtype=dtype)
+
+        var0 = resource_variable_ops.ResourceVariable(var0_np)
+        var1 = resource_variable_ops.ResourceVariable(var1_np)
+        global_step = resource_variable_ops.ResourceVariable(0, trainable=False)
+        grads0 = constant_op.constant(grads0_np)
+        grads1 = constant_op.constant(grads1_np)
+
+        opt = powersign.PowerSignOptimizer(
+            learning_rate=learning_rate,
+            base=base,
+            beta=beta,
+            sign_decay_fn=sign_decay_fn,
+        )
+        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]),
+                                     global_step=global_step)
+        neg_update = opt.apply_gradients(zip([-grads0, -grads1], [var0, var1]),
+                                         global_step=global_step)
+
+        variables.global_variables_initializer().run()
+        # Fetch params to validate initial values
+        self.assertAllClose([1.0, 2.0], var0.eval())
+        self.assertAllClose([3.0, 4.0], var1.eval())
+
+        # Run 7 steps of powersign
+        # first 4 steps with positive gradient
+        # last 3 steps with negative gradient (sign(gm) should be -1)
+        for t in range(1, 8):
+          if t < 5:
+            update.run()
+          else:
+            neg_update.run()
+
+          var0_np, m0 = powersign_update_numpy(
+              var0_np,
+              grads0_np if t < 5 else -grads0_np,
+              m0,
+              learning_rate,
+              base=base,
+              beta=beta,
+              py_sign_decay_fn=py_sign_decay_fn,
+              t=t,
+          )
+          var1_np, m1 = powersign_update_numpy(
+              var1_np,
+              grads1_np if t < 5 else -grads1_np,
+              m1,
+              learning_rate,
+              base=base,
+              beta=beta,
+              py_sign_decay_fn=py_sign_decay_fn,
+              t=t,
+          )
+
+          # Validate updated params
+          self.assertAllCloseAccordingToType(var0_np, var0.eval())
+          self.assertAllCloseAccordingToType(var1_np, var1.eval())
+
+  def testDense(self):
+    decay_steps = 10
+    sign_decay_fn = sign_decay.get_linear_decay_fn(decay_steps)
+    py_sign_decay_fn = py_linear_decay_fn(decay_steps)
+    self._testDense()
+    self._testDense(learning_rate=0.1, base=10.0, beta=0.8)
+    self._testDense(
+        sign_decay_fn=sign_decay_fn, py_sign_decay_fn=py_sign_decay_fn)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/compiler/tests/proximal_adagrad_test.py b/tensorflow/compiler/tests/proximal_adagrad_test.py
new file mode 100644
index 0000000000..cde87db63d
--- /dev/null
+++ b/tensorflow/compiler/tests/proximal_adagrad_test.py
@@ -0,0 +1,172 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Proximal Adagrad optimizer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.compiler.tests import xla_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import adagrad
+from tensorflow.python.training import proximal_adagrad
+
+
+class ProximalAdagradOptimizerTest(xla_test.XLATestCase):
+
+  def testResourceProximalAdagradwithoutRegularization(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([0.0, 0.0])
+      var1 = resource_variable_ops.ResourceVariable([0.0, 0.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+      opt = proximal_adagrad.ProximalAdagradOptimizer(
+          3.0,
+          initial_accumulator_value=0.1,
+          l1_regularization_strength=0.0,
+          l2_regularization_strength=0.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([0.0, 0.0], var0.eval())
+      self.assertAllClose([0.0, 0.0], var1.eval())
+
+      # Run 3 steps Proximal Adagrad.
+      for _ in range(3):
+        update.run()
+
+      self.assertAllClose(np.array([-2.60260963, -4.29698515]), var0.eval())
+      self.assertAllClose(np.array([-0.28432083, -0.56694895]), var1.eval())
+      opt_vars = opt.variables()
+      self.assertStartsWith(opt_vars[0].name, var0._shared_name)
+      self.assertStartsWith(opt_vars[1].name, var1._shared_name)
+      self.assertEqual(2, len(opt_vars))
+
+  def testProximalAdagradwithoutRegularization2(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+      var1 = resource_variable_ops.ResourceVariable([4.0, 3.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+
+      opt = proximal_adagrad.ProximalAdagradOptimizer(
+          3.0,
+          initial_accumulator_value=0.1,
+          l1_regularization_strength=0.0,
+          l2_regularization_strength=0.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([4.0, 3.0], var1.eval())
+
+      # Run 3 steps Proximal Adagrad.
+      for _ in range(3):
+        update.run()
+      self.assertAllClose(np.array([-1.60261, -2.296985]), var0.eval())
+      self.assertAllClose(np.array([3.715679, 2.433051]), var1.eval())
+
+  def testProximalAdagradWithL1(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+      var1 = resource_variable_ops.ResourceVariable([4.0, 3.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+
+      opt = proximal_adagrad.ProximalAdagradOptimizer(
+          3.0,
+          initial_accumulator_value=0.1,
+          l1_regularization_strength=0.001,
+          l2_regularization_strength=0.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([4.0, 3.0], var1.eval())
+
+      # Run 10 steps Proximal Adagrad
+      for _ in range(10):
+        update.run()
+      self.assertAllClose(np.array([-6.663634, -9.190331]), var0.eval())
+      self.assertAllClose(np.array([2.959304, 1.029232]), var1.eval())
+
+  def testProximalAdagradWithL1_L2(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+      var1 = resource_variable_ops.ResourceVariable([4.0, 3.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+
+      opt = proximal_adagrad.ProximalAdagradOptimizer(
+          3.0,
+          initial_accumulator_value=0.1,
+          l1_regularization_strength=0.001,
+          l2_regularization_strength=2.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([4.0, 3.0], var1.eval())
+
+      # Run 10 steps Proximal Adagrad.
+      for _ in range(10):
+        update.run()
+
+      self.assertAllClose(np.array([-0.0495, -0.0995]), var0.eval())
+      self.assertAllClose(np.array([-0.0045, -0.0095]), var1.eval())
+
+  def applyOptimizer(self, opt, steps=5):
+    var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+    var1 = resource_variable_ops.ResourceVariable([3.0, 4.0])
+    grads0 = constant_op.constant([0.1, 0.2])
+    grads1 = constant_op.constant([0.01, 0.02])
+
+    update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+    variables.global_variables_initializer().run()
+
+    self.assertAllClose([1.0, 2.0], var0.eval())
+    self.assertAllClose([3.0, 4.0], var1.eval())
+
+    # Run ProximalAdagrad for a few steps
+    for _ in range(steps):
+      update.run()
+
+    return var0.eval(), var1.eval()
+
+  def testEquivAdagradwithoutRegularization(self):
+    with self.test_session(), self.test_scope():
+      val0, val1 = self.applyOptimizer(
+          proximal_adagrad.ProximalAdagradOptimizer(
+              3.0,
+              initial_accumulator_value=0.1,
+              l1_regularization_strength=0.0,
+              l2_regularization_strength=0.0))
+
+    with self.test_session(), self.test_scope():
+      val2, val3 = self.applyOptimizer(
+          adagrad.AdagradOptimizer(
+              3.0, initial_accumulator_value=0.1))
+
+    self.assertAllClose(val0, val2)
+    self.assertAllClose(val1, val3)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/compiler/tests/proximal_gradient_descent_test.py b/tensorflow/compiler/tests/proximal_gradient_descent_test.py
new file mode 100644
index 0000000000..11eb768711
--- /dev/null
+++ b/tensorflow/compiler/tests/proximal_gradient_descent_test.py
@@ -0,0 +1,156 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Proximal Gradient Descent optimizer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.compiler.tests import xla_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import gradient_descent
+from tensorflow.python.training import proximal_gradient_descent
+
+
+class ProximalGradientDescentOptimizerTest(xla_test.XLATestCase):
+
+  def testResourceProximalGradientDescentwithoutRegularization(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([0.0, 0.0])
+      var1 = resource_variable_ops.ResourceVariable([0.0, 0.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+      opt = proximal_gradient_descent.ProximalGradientDescentOptimizer(
+          3.0, l1_regularization_strength=0.0, l2_regularization_strength=0.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([0.0, 0.0], var0.eval())
+      self.assertAllClose([0.0, 0.0], var1.eval())
+
+      # Run 3 steps Proximal Gradient Descent.
+      for _ in range(3):
+        update.run()
+
+      self.assertAllClose(np.array([-0.9, -1.8]), var0.eval())
+      self.assertAllClose(np.array([-0.09, -0.18]), var1.eval())
+
+  def testProximalGradientDescentwithoutRegularization2(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+      var1 = resource_variable_ops.ResourceVariable([4.0, 3.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+
+      opt = proximal_gradient_descent.ProximalGradientDescentOptimizer(
+          3.0, l1_regularization_strength=0.0, l2_regularization_strength=0.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([4.0, 3.0], var1.eval())
+
+      # Run 3 steps Proximal Gradient Descent
+      for _ in range(3):
+        update.run()
+
+      self.assertAllClose(np.array([0.1, 0.2]), var0.eval())
+      self.assertAllClose(np.array([3.91, 2.82]), var1.eval())
+
+  def testProximalGradientDescentWithL1(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+      var1 = resource_variable_ops.ResourceVariable([4.0, 3.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+
+      opt = proximal_gradient_descent.ProximalGradientDescentOptimizer(
+          3.0, l1_regularization_strength=0.001, l2_regularization_strength=0.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([4.0, 3.0], var1.eval())
+
+      # Run 10 steps proximal gradient descent.
+      for _ in range(10):
+        update.run()
+
+      self.assertAllClose(np.array([-1.988, -3.988001]), var0.eval())
+      self.assertAllClose(np.array([3.67, 2.37]), var1.eval())
+
+  def testProximalGradientDescentWithL1_L2(self):
+    with self.test_session(), self.test_scope():
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+      var1 = resource_variable_ops.ResourceVariable([4.0, 3.0])
+      grads0 = constant_op.constant([0.1, 0.2])
+      grads1 = constant_op.constant([0.01, 0.02])
+
+      opt = proximal_gradient_descent.ProximalGradientDescentOptimizer(
+          3.0, l1_regularization_strength=0.001, l2_regularization_strength=2.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([4.0, 3.0], var1.eval())
+
+      # Run 10 steps Proximal Gradient Descent
+      for _ in range(10):
+        update.run()
+
+      self.assertAllClose(np.array([-0.0495, -0.0995]), var0.eval())
+      self.assertAllClose(np.array([-0.0045, -0.0095]), var1.eval())
+
+  def applyOptimizer(self, opt, steps=5):
+    var0 = resource_variable_ops.ResourceVariable([1.0, 2.0])
+    var1 = resource_variable_ops.ResourceVariable([3.0, 4.0])
+    grads0 = constant_op.constant([0.1, 0.2])
+    grads1 = constant_op.constant([0.01, 0.02])
+
+    update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+    variables.global_variables_initializer().run()
+
+    self.assertAllClose([1.0, 2.0], var0.eval())
+    self.assertAllClose([3.0, 4.0], var1.eval())
+
+    # Run ProximalAdagrad for a few steps
+    for _ in range(steps):
+      update.run()
+
+    return var0.eval(), var1.eval()
+
+  def testEquivGradientDescentwithoutRegularization(self):
+    with self.test_session(), self.test_scope():
+      val0, val1 = self.applyOptimizer(
+          proximal_gradient_descent.ProximalGradientDescentOptimizer(
+              3.0,
+              l1_regularization_strength=0.0,
+              l2_regularization_strength=0.0))
+
+    with self.test_session(), self.test_scope():
+      val2, val3 = self.applyOptimizer(
+          gradient_descent.GradientDescentOptimizer(3.0))
+
+    self.assertAllClose(val0, val2)
+    self.assertAllClose(val1, val3)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/compiler/tests/qr_op_test.py b/tensorflow/compiler/tests/qr_op_test.py
new file mode 100644
index 0000000000..1b969ee2b3
--- /dev/null
+++ b/tensorflow/compiler/tests/qr_op_test.py
@@ -0,0 +1,115 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for tensorflow.ops.math_ops.matrix_inverse."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import itertools
+
+from absl.testing import parameterized
+import numpy as np
+
+from tensorflow.compiler.tests import xla_test
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import linalg_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class QrOpTest(xla_test.XLATestCase, parameterized.TestCase):
+
+  def AdjustedNorm(self, x):
+    """Computes the norm of matrices in 'x', adjusted for dimension and type."""
+    norm = np.linalg.norm(x, axis=(-2, -1))
+    return norm / (max(x.shape[-2:]) * np.finfo(x.dtype).eps)
+
+  def CompareOrthogonal(self, x, y, rank):
+    # We only compare the first 'rank' orthogonal vectors since the
+    # remainder form an arbitrary orthonormal basis for the
+    # (row- or column-) null space, whose exact value depends on
+    # implementation details. Notice that since we check that the
+    # matrices of singular vectors are unitary elsewhere, we do
+    # implicitly test that the trailing vectors of x and y span the
+    # same space.
+    x = x[..., 0:rank]
+    y = y[..., 0:rank]
+    # Q is only unique up to sign (complex phase factor for complex matrices),
+    # so we normalize the sign first.
+    sum_of_ratios = np.sum(np.divide(y, x), -2, keepdims=True)
+    phases = np.divide(sum_of_ratios, np.abs(sum_of_ratios))
+    x *= phases
+    self.assertTrue(np.all(self.AdjustedNorm(x - y) < 30.0))
+
+  def CheckApproximation(self, a, q, r):
+    # Tests that a ~= q*r.
+    precision = self.AdjustedNorm(a - np.matmul(q, r))
+    self.assertTrue(np.all(precision < 10.0))
+
+  def CheckUnitary(self, x):
+    # Tests that x[...,:,:]^H * x[...,:,:] is close to the identity.
+    xx = math_ops.matmul(x, x, adjoint_a=True)
+    identity = array_ops.matrix_band_part(array_ops.ones_like(xx), 0, 0)
+    precision = self.AdjustedNorm(xx.eval() - identity.eval())
+    self.assertTrue(np.all(precision < 5.0))
+
+  def _test(self, dtype, shape, full_matrices):
+    np.random.seed(1)
+    x_np = np.random.uniform(
+        low=-1.0, high=1.0, size=np.prod(shape)).reshape(shape).astype(dtype)
+
+    with self.test_session() as sess:
+      x_tf = array_ops.placeholder(dtype)
+      with self.test_scope():
+        q_tf, r_tf = linalg_ops.qr(x_tf, full_matrices=full_matrices)
+      q_tf_val, r_tf_val = sess.run([q_tf, r_tf], feed_dict={x_tf: x_np})
+
+      q_dims = q_tf_val.shape
+      np_q = np.ndarray(q_dims, dtype)
+      np_q_reshape = np.reshape(np_q, (-1, q_dims[-2], q_dims[-1]))
+      new_first_dim = np_q_reshape.shape[0]
+
+      x_reshape = np.reshape(x_np, (-1, x_np.shape[-2], x_np.shape[-1]))
+      for i in range(new_first_dim):
+        if full_matrices:
+          np_q_reshape[i, :, :], _ = np.linalg.qr(
+              x_reshape[i, :, :], mode="complete")
+        else:
+          np_q_reshape[i, :, :], _ = np.linalg.qr(
+              x_reshape[i, :, :], mode="reduced")
+      np_q = np.reshape(np_q_reshape, q_dims)
+      self.CompareOrthogonal(np_q, q_tf_val, min(shape[-2:]))
+      self.CheckApproximation(x_np, q_tf_val, r_tf_val)
+      self.CheckUnitary(q_tf_val)
+
+  SIZES = [1, 2, 5, 10, 32, 100, 300]
+  DTYPES = [np.float32]
+  PARAMS = itertools.product(SIZES, SIZES, DTYPES)
+
+  @parameterized.parameters(*PARAMS)
+  def testQR(self, rows, cols, dtype):
+    # TODO(b/111317468): implement full_matrices=False, test other types.
+    for full_matrices in [True]:
+      # Only tests the (3, 2) case for small numbers of rows/columns.
+      for batch_dims in [(), (3,)] + [(3, 2)] * (max(rows, cols) < 10):
+        self._test(dtype, batch_dims + (rows, cols), full_matrices)
+
+  def testLarge2000x2000(self):
+    self._test(np.float32, (2000, 2000), full_matrices=True)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/compiler/tests/random_ops_test.py b/tensorflow/compiler/tests/random_ops_test.py
index b880b2a3fe..14c5e7a975 100644
--- a/tensorflow/compiler/tests/random_ops_test.py
+++ b/tensorflow/compiler/tests/random_ops_test.py
@@ -140,10 +140,10 @@ class RandomOpsTest(xla_test.XLATestCase):
   def testShuffle1d(self):
     with self.test_session() as sess:
       with self.test_scope():
-        x = math_ops.range(20)
+        x = math_ops.range(1 << 16)
         shuffle = random_ops.random_shuffle(x)
       result = sess.run(shuffle)
-      expected = range(20)
+      expected = range(1 << 16)
       # Compare sets to avoid randomness behavior changes but make sure still
       # have all the values.
       self.assertAllEqual(set(result), set(expected))
diff --git a/tensorflow/compiler/tests/rmsprop_test.py b/tensorflow/compiler/tests/rmsprop_test.py
index 9489fded32..ff8bbac911 100644
--- a/tensorflow/compiler/tests/rmsprop_test.py
+++ b/tensorflow/compiler/tests/rmsprop_test.py
@@ -30,31 +30,102 @@ from tensorflow.python.training import rmsprop
 
 class RmspropTest(xla_test.XLATestCase):
 
+  def _rmsprop_update_numpy(self,
+                            var,
+                            g,
+                            mg,
+                            rms,
+                            mom,
+                            lr,
+                            decay=0.9,
+                            momentum=0.0,
+                            epsilon=1e-10,
+                            centered=False):
+    rms_t = rms * decay + (1 - decay) * g * g
+    denom_t = rms_t + epsilon
+    if centered:
+      mg_t = mg * decay + (1 - decay) * g
+      denom_t -= mg_t * mg_t
+    else:
+      mg_t = mg
+    mom_t = momentum * mom + lr * g / np.sqrt(denom_t, dtype=denom_t.dtype)
+    var_t = var - mom_t
+    return var_t, mg_t, rms_t, mom_t
+
   def testBasic(self):
     for dtype in self.float_types:
-      with self.test_session(), self.test_scope():
-        var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
-        var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        rms_opt = rmsprop.RMSPropOptimizer(3.0)
-        rms_update = rms_opt.apply_gradients(
-            zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-
-        # Fetch params to validate initial values
-        self.assertAllClose([1.0, 2.0], var0.eval())
-        self.assertAllClose([3.0, 4.0], var1.eval())
-
-        # Run 3 steps of RMSProp
-        for _ in range(3):
-          rms_update.run()
-
-        # Validate updated params
-        self.assertAllCloseAccordingToType(
-            np.array([2.91705132e-04, 1.00029182e+00]), var0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([2.89990854, 3.89990854]), var1.eval())
+      for centered in [False, True]:
+        with self.test_session(), self.test_scope():
+          # Initialize variables for numpy implementation.
+          var0_np = np.array([1.0, 2.0], dtype=dtype)
+          grads0_np = np.array([0.1, 0.1], dtype=dtype)
+          var1_np = np.array([3.0, 4.0], dtype=dtype)
+          grads1_np = np.array([0.01, 0.01], dtype=dtype)
+          mg0_np = np.array([0.0, 0.0], dtype=dtype)
+          mg1_np = np.array([0.0, 0.0], dtype=dtype)
+          rms0_np = np.array([1.0, 1.0], dtype=dtype)
+          rms1_np = np.array([1.0, 1.0], dtype=dtype)
+          mom0_np = np.array([0.0, 0.0], dtype=dtype)
+          mom1_np = np.array([0.0, 0.0], dtype=dtype)
+
+          var0 = resource_variable_ops.ResourceVariable(var0_np)
+          var1 = resource_variable_ops.ResourceVariable(var1_np)
+          grads0 = constant_op.constant(grads0_np)
+          grads1 = constant_op.constant(grads1_np)
+          learning_rate = 3.0
+          rms_opt = rmsprop.RMSPropOptimizer(learning_rate, centered=centered)
+          rms_update = rms_opt.apply_gradients(
+              zip([grads0, grads1], [var0, var1]))
+          variables.global_variables_initializer().run()
+
+          mg0 = rms_opt.get_slot(var0, "mg")
+          self.assertEqual(mg0 is not None, centered)
+          mg1 = rms_opt.get_slot(var1, "mg")
+          self.assertEqual(mg1 is not None, centered)
+          rms0 = rms_opt.get_slot(var0, "rms")
+          self.assertTrue(rms0 is not None)
+          rms1 = rms_opt.get_slot(var1, "rms")
+          self.assertTrue(rms1 is not None)
+          mom0 = rms_opt.get_slot(var0, "momentum")
+          self.assertTrue(mom0 is not None)
+          mom1 = rms_opt.get_slot(var1, "momentum")
+          self.assertTrue(mom1 is not None)
+
+          # Fetch params to validate initial values
+          self.assertAllClose([1.0, 2.0], var0.eval())
+          self.assertAllClose([3.0, 4.0], var1.eval())
+
+          # Run 3 steps of RMSProp
+          for _ in range(3):
+            rms_update.run()
+
+            var0_np, mg0_np, rms0_np, mom0_np = self._rmsprop_update_numpy(
+                var0_np,
+                grads0_np,
+                mg0_np,
+                rms0_np,
+                mom0_np,
+                learning_rate,
+                centered=centered)
+            var1_np, mg1_np, rms1_np, mom1_np = self._rmsprop_update_numpy(
+                var1_np,
+                grads1_np,
+                mg1_np,
+                rms1_np,
+                mom1_np,
+                learning_rate,
+                centered=centered)
+
+            # Validate updated params
+            if centered:
+              self.assertAllCloseAccordingToType(mg0_np, mg0.eval())
+              self.assertAllCloseAccordingToType(mg1_np, mg1.eval())
+            self.assertAllCloseAccordingToType(rms0_np, rms0.eval())
+            self.assertAllCloseAccordingToType(rms1_np, rms1.eval())
+            self.assertAllCloseAccordingToType(mom0_np, mom0.eval())
+            self.assertAllCloseAccordingToType(mom1_np, mom1.eval())
+            self.assertAllCloseAccordingToType(var0_np, var0.eval())
+            self.assertAllCloseAccordingToType(var1_np, var1.eval())
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/compiler/tests/sort_ops_test.py b/tensorflow/compiler/tests/sort_ops_test.py
index 9e2ef964a1..7ff01be3cb 100644
--- a/tensorflow/compiler/tests/sort_ops_test.py
+++ b/tensorflow/compiler/tests/sort_ops_test.py
@@ -88,6 +88,38 @@ class XlaSortOpTest(xla_test.XLATestCase):
               topk, [x.astype(dtype)],
               expected=[x[indices].astype(dtype), indices])
 
+  def testTopK2D(self):
+    # TODO(b/26783907): The Sort HLO is not implemented on CPU or GPU.
+    if self.device in ["XLA_CPU", "XLA_GPU"]:
+      return
+
+    supported_types = set(
+        [dtypes.bfloat16.as_numpy_dtype, np.float32, np.int32, np.uint32])
+    for dtype in supported_types.intersection(self.numeric_types):
+      # Use small input size for bfloat16. Otherwise, we'll get duplicate values
+      # after conversion to bfloat16, so the possible resulting index array is
+      # no longer unique.
+      if dtype == dtypes.bfloat16.as_numpy_dtype:
+        array_size = 10
+        k_options = [0, 1, 2, 10]
+      else:
+        array_size = 200 * 1000
+        k_options = [0, 1, 2, 10, 20, 100, 1000, 200 * 1000]
+      batch = 16
+      for x in [np.arange(batch * array_size)]:
+        np.random.shuffle(x)
+        x = np.reshape(x, [batch, array_size])
+        for k in k_options:
+          indices = x.argsort(axis=1)[::, -1:-k - 1:-1]
+          expected = np.sort(x, axis=1)[::, -1:-k - 1:-1]
+
+          def topk(v, k=k):
+            return nn_ops.top_k(v, k=k, sorted=True)
+
+          self._assertOpOutputMatchesExpected(
+              topk, [x.astype(dtype)],
+              expected=[expected.astype(dtype), indices])
+
   def testTopKZeros(self):
     """Tests that positive and negative zeros sort correctly."""
     # TODO(b/26783907): The Sort HLO is not implemented on CPU or GPU.
diff --git a/tensorflow/compiler/tests/unary_ops_test.py b/tensorflow/compiler/tests/unary_ops_test.py
index 6a7011aea6..5f25ff9002 100644
--- a/tensorflow/compiler/tests/unary_ops_test.py
+++ b/tensorflow/compiler/tests/unary_ops_test.py
@@ -382,6 +382,62 @@ class UnaryOpsTest(xla_test.XLATestCase):
           expected=np.array(
               [[True, False, True], [False, True, True]], dtype=np.bool))
 
+      self._assertOpOutputMatchesExpected(
+          math_ops.lgamma,
+          np.array(
+              [[1, 2, 3], [4, 5, 6], [1 / 2, 3 / 2, 5 / 2],
+               [-3 / 2, -7 / 2, -11 / 2]],
+              dtype=dtype),
+          expected=np.array(
+              [
+                  [0, 0, np.log(2.0)],
+                  [np.log(6.0), np.log(24.0),
+                   np.log(120)],
+                  [
+                      np.log(np.pi) / 2,
+                      np.log(np.pi) / 2 - np.log(2),
+                      np.log(np.pi) / 2 - np.log(4) + np.log(3)
+                  ],
+                  [
+                      np.log(np.pi) / 2 - np.log(3) + np.log(4),
+                      np.log(np.pi) / 2 - np.log(105) + np.log(16),
+                      np.log(np.pi) / 2 - np.log(10395) + np.log(64),
+                  ],
+              ],
+              dtype=dtype))
+
+      self._assertOpOutputMatchesExpected(
+          math_ops.digamma,
+          np.array(
+              [[1.0, 0.5, 1 / 3.0], [0.25, 1 / 6.0, 0.125], [2.0, 3.0, 4.0],
+               [6.0, 8.0, 9.0]],
+              dtype=dtype),
+          expected=np.array(
+              [
+                  [
+                      -np.euler_gamma, -2 * np.log(2) - np.euler_gamma,
+                      -np.pi / 2 / np.sqrt(3) - 3 * np.log(3) / 2 -
+                      np.euler_gamma
+                  ],
+                  [
+                      -np.pi / 2 - 3 * np.log(2) - np.euler_gamma,
+                      -np.pi * np.sqrt(3) / 2 - 2 * np.log(2) -
+                      3 * np.log(3) / 2 - np.euler_gamma,
+                      -np.pi / 2 - 4 * np.log(2) -
+                      (np.pi + np.log(2 + np.sqrt(2)) - np.log(2 - np.sqrt(2)))
+                      / np.sqrt(2) - np.euler_gamma
+                  ],
+                  [
+                      1 - np.euler_gamma, 1.5 - np.euler_gamma,
+                      11 / 6.0 - np.euler_gamma
+                  ],
+                  [
+                      137 / 60.0 - np.euler_gamma, 363 / 140.0 - np.euler_gamma,
+                      761 / 280.0 - np.euler_gamma
+                  ],
+              ],
+              dtype=dtype))
+
       def quantize_and_dequantize_v2(x):
         return array_ops.quantize_and_dequantize_v2(
             x, -127, 127, signed_input=True, num_bits=8)
diff --git a/tensorflow/compiler/tf2xla/BUILD b/tensorflow/compiler/tf2xla/BUILD
index 40e32f2e75..338943201b 100644
--- a/tensorflow/compiler/tf2xla/BUILD
+++ b/tensorflow/compiler/tf2xla/BUILD
@@ -81,7 +81,7 @@ cc_library(
         "//tensorflow/compiler/tf2xla/kernels:xla_cpu_only_ops",
         "//tensorflow/compiler/tf2xla/kernels:xla_ops",
         "//tensorflow/compiler/xla/client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework",
@@ -119,6 +119,7 @@ cc_library(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:cpu_plugin",
         "//tensorflow/compiler/xla/service/cpu:cpu_executable",
         "//tensorflow/core:lib",
@@ -162,7 +163,7 @@ cc_library(
         ":sharding_util",
         ":tf2xla_util",
         "//tensorflow/compiler/tf2xla/lib:util",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -170,11 +171,11 @@ cc_library(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
         "//tensorflow/compiler/xla/client/lib:constants",
         "//tensorflow/compiler/xla/client/lib:numeric",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework",
@@ -202,7 +203,7 @@ cc_library(
     ],
     visibility = [":friends"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/core:core_cpu_internal",
@@ -285,10 +286,12 @@ tf_cc_test(
     deps = [
         ":tf2xla",
         ":tf2xla_proto",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:cpu_plugin",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
@@ -327,7 +330,7 @@ tf_cc_test(
         "//tensorflow/cc:ops",
         "//tensorflow/cc:resource_variable_ops",
         "//tensorflow/compiler/tf2xla/kernels:xla_ops",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla/client:client_library",
@@ -364,6 +367,7 @@ tf_cc_test(
     ],
     deps = [
         ":common",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/core:framework",
         "//tensorflow/core:test",
diff --git a/tensorflow/compiler/tf2xla/dump_graph.cc b/tensorflow/compiler/tf2xla/dump_graph.cc
index 03603ee9ba..24616c01c7 100644
--- a/tensorflow/compiler/tf2xla/dump_graph.cc
+++ b/tensorflow/compiler/tf2xla/dump_graph.cc
@@ -33,7 +33,7 @@ struct NameCounts {
   std::unordered_map<string, int> counts;
 };
 
-string MakeUniquePath(string name) {
+string MakeUniqueFilename(string name) {
   static NameCounts& instance = *new NameCounts;
 
   // Remove illegal characters from `name`.
@@ -50,26 +50,41 @@ string MakeUniquePath(string name) {
     count = instance.counts[name]++;
   }
 
-  legacy_flags::DumpGraphFlags* flags = legacy_flags::GetDumpGraphFlags();
-  string path = strings::StrCat(flags->tf_dump_graph_prefix, "/", name);
+  string filename = name;
   if (count > 0) {
-    strings::StrAppend(&path, "_", count);
+    strings::StrAppend(&filename, "_", count);
   }
-  strings::StrAppend(&path, ".pbtxt");
-  return path;
+  strings::StrAppend(&filename, ".pbtxt");
+  return filename;
+}
+
+string WriteTextProtoToUniqueFile(
+    Env* env, const string& name, const char* proto_type,
+    const ::tensorflow::protobuf::Message& proto) {
+  const string& dirname =
+      legacy_flags::GetDumpGraphFlags()->tf_dump_graph_prefix;
+  Status status = env->RecursivelyCreateDir(dirname);
+  if (!status.ok()) {
+    LOG(WARNING) << "Failed to create " << dirname << " for dumping "
+                 << proto_type << ": " << status;
+    return "(unavailable)";
+  }
+  string filepath = strings::StrCat(dirname, "/", MakeUniqueFilename(name));
+  status = WriteTextProto(Env::Default(), filepath, proto);
+  if (!status.ok()) {
+    LOG(WARNING) << "Failed to dump " << proto_type << " to file: " << filepath
+                 << " : " << status;
+    return "(unavailable)";
+  }
+  LOG(INFO) << "Dumped " << proto_type << " to " << filepath;
+  return filepath;
 }
 
 }  // anonymous namespace
 
 string DumpGraphDefToFile(const string& name, GraphDef const& graph_def) {
-  string path = MakeUniquePath(name);
-  Status status = WriteTextProto(Env::Default(), path, graph_def);
-  if (!status.ok()) {
-    VLOG(1) << "Failed to dump GraphDef to file: " << path << " : " << status;
-    path.clear();
-    path = "(unavailable)";
-  }
-  return path;
+  return WriteTextProtoToUniqueFile(Env::Default(), name, "GraphDef",
+                                    graph_def);
 }
 
 string DumpGraphToFile(const string& name, Graph const& graph,
@@ -83,15 +98,7 @@ string DumpGraphToFile(const string& name, Graph const& graph,
 }
 
 string DumpFunctionDefToFile(const string& name, FunctionDef const& fdef) {
-  string path = MakeUniquePath(name);
-  Status status = WriteTextProto(Env::Default(), path, fdef);
-  if (!status.ok()) {
-    VLOG(1) << "Failed to dump FunctionDef to file: " << path << " : "
-            << status;
-    path.clear();
-    path = "(unavailable)";
-  }
-  return path;
+  return WriteTextProtoToUniqueFile(Env::Default(), name, "FunctionDef", fdef);
 }
 
 }  // namespace dump_graph
diff --git a/tensorflow/compiler/tf2xla/functionalize_control_flow.cc b/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
index 6cc95149a1..0904778f97 100644
--- a/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
+++ b/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
@@ -177,8 +177,8 @@ Status CheckNoCycleContains(const Node* node, const int num_nodes) {
     visited[current_node->id()] = true;
     for (const Edge* out : current_node->out_edges()) {
       if (out->dst() == node) {
-        return errors::Internal("Detect a cycle: Node \"", node->name(), "\"(",
-                                node->def().op(), ") feeds into itself.");
+        return errors::Internal("Detected a cycle: ", FormatNodeForError(*node),
+                                "(", node->def().op(), ") feeds into itself.");
       } else if (!visited[out->dst()->id()]) {
         ready.push_back(out->dst());
       }
@@ -324,7 +324,7 @@ Status AddMissingFunctionDef(const FunctionDef& fdef,
     if (library->Find(node.op())) {
       continue;
     }
-    // The function refered by 'SymbolicGradient' node is specified in its
+    // The function referred by 'SymbolicGradient' node is specified in its
     // attribute 'f'.
     if (node.op() == FunctionLibraryDefinition::kGradientOp) {
       const AttrValue* attr =
@@ -437,22 +437,24 @@ Status FunctionalizeLoop(const FunctionLibraryDefinition* lookup_library,
           continue;
         }
         if (enter_merge != nullptr) {
-          return errors::Internal(
-              "Enter node for loop-varying argument ", arg.enter->name(),
-              " has multiple successors: ", enter_merge->dst()->name(), " and ",
-              e->dst()->name());
+          return errors::Internal("Enter node for loop-varying argument ",
+                                  FormatNodeForError(*arg.enter),
+                                  " has multiple successors: ",
+                                  FormatNodeForError(*enter_merge->dst()),
+                                  " and ", FormatNodeForError(*e->dst()));
         }
         enter_merge = e;
       }
       if (enter_merge == nullptr) {
         return errors::Internal("Enter node for loop-varying argument ",
-                                arg.enter->name(), " has zero successors");
+                                FormatNodeForError(*arg.enter),
+                                " has zero successors");
       }
       arg.merge = enter_merge->dst();
       if (!IsMerge(arg.merge)) {
         return errors::InvalidArgument(
             "Successor of Enter node for loop-varying argument ",
-            arg.merge->name(),
+            FormatNodeForError(*arg.merge),
             " is not a Merge node; got: ", arg.merge->type_string());
       }
 
@@ -462,7 +464,7 @@ Status FunctionalizeLoop(const FunctionLibraryDefinition* lookup_library,
         return errors::InvalidArgument(
             "Unexpected number of inputs to Merge node for loop-varying "
             "argument ",
-            arg.merge->name(), "; expected 2, got ",
+            FormatNodeForError(*arg.merge), "; expected 2, got ",
             arg.merge->input_types().size());
       }
       TF_RETURN_IF_ERROR(arg.merge->input_node(1 - enter_merge->dst_input(),
@@ -470,7 +472,7 @@ Status FunctionalizeLoop(const FunctionLibraryDefinition* lookup_library,
       if (!IsNextIteration(arg.next_iteration)) {
         return errors::InvalidArgument(
             "Expected NextIteration node as input to Merge node; got node ",
-            arg.next_iteration->name(), " with kind ",
+            FormatNodeForError(*arg.next_iteration), " with kind ",
             arg.next_iteration->type_string());
       }
 
@@ -481,14 +483,14 @@ Status FunctionalizeLoop(const FunctionLibraryDefinition* lookup_library,
             switches.find(edge->dst()) != switches.end()) {
           if (arg.switch_node != nullptr) {
             return errors::InvalidArgument("Duplicate Switch successors to ",
-                                           arg.merge->name());
+                                           FormatNodeForError(*arg.merge));
           }
           arg.switch_node = edge->dst();
         }
       }
       if (arg.switch_node == nullptr) {
         return errors::InvalidArgument("Missing Switch successor to ",
-                                       arg.merge->name());
+                                       FormatNodeForError(*arg.merge));
       }
 
       // Update the device on the Identity outputs of the switch to match their
@@ -516,14 +518,15 @@ Status FunctionalizeLoop(const FunctionLibraryDefinition* lookup_library,
         possible_exit.pop_front();
         if (IsExit(edge->dst())) {
           if (arg.exit != nullptr) {
-            return errors::InvalidArgument("Duplicate Exit successors to ",
-                                           arg.switch_node->name());
+            return errors::InvalidArgument(
+                "Duplicate Exit successors to ",
+                FormatNodeForError(*arg.switch_node));
           }
           arg.exit = edge->dst();
         } else {
           if (!IsIdentity(edge->dst())) {
             return errors::Unimplemented("General graph between switch (",
-                                         arg.switch_node->name(),
+                                         FormatNodeForError(*arg.switch_node),
                                          ") and exit node of frame ",
                                          frame->name, " not supported yet.");
           }
@@ -1470,7 +1473,7 @@ Status FunctionalizeControlFlow(const FunctionLibraryDefinition* lookup_library,
   if (!unreachable_nodes.empty()) {
     return errors::InvalidArgument(
         "The following nodes are unreachable from the source in the graph: ",
-        tensorflow::str_util::Join(unreachable_nodes, ", "));
+        errors::FormatNodeNamesForError(unreachable_nodes));
   }
 
   // Builds Frames, indexed by name.
diff --git a/tensorflow/compiler/tf2xla/functionalize_control_flow_test.cc b/tensorflow/compiler/tf2xla/functionalize_control_flow_test.cc
index aae2f8ee5a..ccf249b35d 100644
--- a/tensorflow/compiler/tf2xla/functionalize_control_flow_test.cc
+++ b/tensorflow/compiler/tf2xla/functionalize_control_flow_test.cc
@@ -1064,7 +1064,10 @@ TEST(FunctionalizeControlFlow, Cycle) {
   // less -> XlaIf <--> identity.
   Status status = FunctionalizeControlFlow(graph.get(), &library);
   EXPECT_FALSE(status.ok());
-  EXPECT_TRUE(str_util::StrContains(status.error_message(), "Detect a cycle"))
+  EXPECT_TRUE(str_util::StrContains(status.error_message(), "Detected a cycle"))
+      << status.error_message();
+  EXPECT_TRUE(
+      str_util::StrContains(status.error_message(), "{{node cond/Less_5_If}}"))
       << status.error_message();
 }
 
diff --git a/tensorflow/compiler/tf2xla/graph_compiler.cc b/tensorflow/compiler/tf2xla/graph_compiler.cc
index 4900af6df1..e4fdf0a618 100644
--- a/tensorflow/compiler/tf2xla/graph_compiler.cc
+++ b/tensorflow/compiler/tf2xla/graph_compiler.cc
@@ -29,7 +29,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_context.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/executor.h"
 #include "tensorflow/core/common_runtime/function.h"
@@ -161,9 +161,8 @@ Status GraphCompiler::Compile() {
     outputs.resize(n->num_outputs());
     for (int o = 0; o < n->num_outputs(); ++o) {
       outputs[o] = op_context.release_output(o);
-      if (*op_context.is_output_dead() || outputs[o].tensor == nullptr) {
+      if (outputs[o].tensor == nullptr) {
         return errors::Internal("Missing xla_context ", o, "-th output from ",
-                                (*op_context.is_output_dead() ? "(dead)" : ""),
                                 SummarizeNode(*n));
       }
     }
diff --git a/tensorflow/compiler/tf2xla/kernels/BUILD b/tensorflow/compiler/tf2xla/kernels/BUILD
index a8eb7d942d..0609e22338 100644
--- a/tensorflow/compiler/tf2xla/kernels/BUILD
+++ b/tensorflow/compiler/tf2xla/kernels/BUILD
@@ -58,6 +58,7 @@ tf_kernel_library(
         "pack_op.cc",
         "pad_op.cc",
         "pooling_ops.cc",
+        "qr_op.cc",
         "quantize_and_dequantize_op.cc",
         "random_ops.cc",
         "reduce_window_op.cc",
@@ -107,6 +108,7 @@ tf_kernel_library(
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/tf2xla/lib:batch_dot",
         "//tensorflow/compiler/tf2xla/lib:cholesky",
+        "//tensorflow/compiler/tf2xla/lib:qr",
         "//tensorflow/compiler/tf2xla/lib:random",
         "//tensorflow/compiler/tf2xla/lib:scatter",
         "//tensorflow/compiler/tf2xla/lib:triangular_solve",
@@ -114,17 +116,21 @@ tf_kernel_library(
         "//tensorflow/compiler/tf2xla/lib:while_loop",
         "//tensorflow/compiler/tf2xla/ops:xla_ops",
         "//tensorflow/compiler/xla:array4d",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
         "//tensorflow/compiler/xla/client/lib:constants",
         "//tensorflow/compiler/xla/client/lib:math",
         "//tensorflow/compiler/xla/client/lib:numeric",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client/lib:prng",
+        "//tensorflow/compiler/xla/client/lib:sorting",
         "//tensorflow/core:framework",
         "//tensorflow/core:image_ops_op_lib",
         "//tensorflow/core:lib",
@@ -159,8 +165,9 @@ tf_kernel_library(
         "//tensorflow/compiler/tf2xla:common",
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/tf2xla/ops:xla_ops",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
@@ -175,8 +182,8 @@ tf_kernel_library(
         "//tensorflow/compiler/tf2xla:common",
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/tf2xla/ops:xla_ops",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
@@ -210,10 +217,11 @@ tf_kernel_library(
         ":index_ops_kernel_argmax_float_2d",
         "//tensorflow/compiler/tf2xla:common",
         "//tensorflow/compiler/tf2xla:xla_compiler",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla/client:client_library",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core/kernels:argmax_op",
diff --git a/tensorflow/compiler/tf2xla/kernels/aggregate_ops.cc b/tensorflow/compiler/tf2xla/kernels/aggregate_ops.cc
index e335328280..41a453da80 100644
--- a/tensorflow/compiler/tf2xla/kernels/aggregate_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/aggregate_ops.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 namespace {
diff --git a/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc b/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc
index c4af79281d..b3ad0aea84 100644
--- a/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/batch_norm_op.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/util/tensor_format.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc b/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
index 26130fd9e7..48f2a005ab 100644
--- a/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 namespace {
diff --git a/tensorflow/compiler/tf2xla/kernels/bcast_ops.cc b/tensorflow/compiler/tf2xla/kernels/bcast_ops.cc
index ee2c920453..ba3b1c9dab 100644
--- a/tensorflow/compiler/tf2xla/kernels/bcast_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/bcast_ops.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/util/bcast.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/bias_ops.cc b/tensorflow/compiler/tf2xla/kernels/bias_ops.cc
index e9b2c0b16d..41f540506b 100644
--- a/tensorflow/compiler/tf2xla/kernels/bias_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/bias_ops.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.h"
 #include "tensorflow/core/util/tensor_format.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/binary_ops.cc b/tensorflow/compiler/tf2xla/kernels/binary_ops.cc
index d6d4ae8937..2c328102e0 100644
--- a/tensorflow/compiler/tf2xla/kernels/binary_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/binary_ops.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/bucketize_op.cc b/tensorflow/compiler/tf2xla/kernels/bucketize_op.cc
index efbdb76eaa..5078f8662b 100644
--- a/tensorflow/compiler/tf2xla/kernels/bucketize_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/bucketize_op.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/cast_op.cc b/tensorflow/compiler/tf2xla/kernels/cast_op.cc
index 62eebf762b..8cc2479dd5 100644
--- a/tensorflow/compiler/tf2xla/kernels/cast_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/cast_op.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/categorical_op.cc b/tensorflow/compiler/tf2xla/kernels/categorical_op.cc
index 1784e712b5..e7fef77edc 100644
--- a/tensorflow/compiler/tf2xla/kernels/categorical_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/categorical_op.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/clip_by_value_op.cc b/tensorflow/compiler/tf2xla/kernels/clip_by_value_op.cc
index 4e6d33304c..547fe48046 100644
--- a/tensorflow/compiler/tf2xla/kernels/clip_by_value_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/clip_by_value_op.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/tensor_shape.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/concat_op.cc b/tensorflow/compiler/tf2xla/kernels/concat_op.cc
index e3a32a5c0e..f410605104 100644
--- a/tensorflow/compiler/tf2xla/kernels/concat_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/concat_op.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/const_op.cc b/tensorflow/compiler/tf2xla/kernels/const_op.cc
index f4360d8c3f..da8cf3fc6f 100644
--- a/tensorflow/compiler/tf2xla/kernels/const_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/const_op.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_compiler.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/tensor.pb.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/conv_ops.cc b/tensorflow/compiler/tf2xla/kernels/conv_ops.cc
index 48ac4867ed..5da7972397 100644
--- a/tensorflow/compiler/tf2xla/kernels/conv_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/conv_ops.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/core/framework/numeric_op.h"
 #include "tensorflow/core/framework/op_kernel.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/cross_op.cc b/tensorflow/compiler/tf2xla/kernels/cross_op.cc
index 500a564f3f..db579a5b35 100644
--- a/tensorflow/compiler/tf2xla/kernels/cross_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/cross_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 namespace {
diff --git a/tensorflow/compiler/tf2xla/kernels/cwise_ops.cc b/tensorflow/compiler/tf2xla/kernels/cwise_ops.cc
index 9ff3e02228..ef1015552d 100644
--- a/tensorflow/compiler/tf2xla/kernels/cwise_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/cwise_ops.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/util/bcast.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/cwise_ops.h b/tensorflow/compiler/tf2xla/kernels/cwise_ops.h
index 4f92dbc874..a5b870f8db 100644
--- a/tensorflow/compiler/tf2xla/kernels/cwise_ops.h
+++ b/tensorflow/compiler/tf2xla/kernels/cwise_ops.h
@@ -20,7 +20,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/util/bcast.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/depthtospace_op.cc b/tensorflow/compiler/tf2xla/kernels/depthtospace_op.cc
index f314920025..12b0e38288 100644
--- a/tensorflow/compiler/tf2xla/kernels/depthtospace_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/depthtospace_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/util/tensor_format.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/diag_op.cc b/tensorflow/compiler/tf2xla/kernels/diag_op.cc
index 6dec414c53..ed44ad218b 100644
--- a/tensorflow/compiler/tf2xla/kernels/diag_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/diag_op.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
@@ -123,8 +123,6 @@ class DiagPartOp : public XlaOpKernel {
   explicit DiagPartOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
 
   void Compile(XlaOpKernelContext* ctx) override {
-    xla::XlaBuilder* builder = ctx->builder();
-
     const TensorShape input_shape = ctx->InputShape(0);
     auto dims = input_shape.dim_sizes();
 
@@ -150,37 +148,13 @@ class DiagPartOp : public XlaOpKernel {
       new_dims.push_back(dims[i]);
     }
 
-    xla::XlaOp diag = ctx->Input(0);
-
-    // TODO(b/30878775): use Slice with strides when supported, in place of
-    // the Pad -> Reshape -> Slice.
-
-    // Picture:
-    // [[1, 0, 0, 0]  pad and reshape to [[1, 0, 0, 0, 0],
-    //  [0, 2, 0, 0]  =================>  [2, 0, 0, 0, 0],
-    //  [0, 0, 3, 0]                      [3, 0, 0, 0, 0],
-    //  [0, 0, 0, 4]]                     [4, 0, 0, 0, 0]]
-    // and then slice out the first column.
-
-    // Flattens the input to 1D.
-    int64 size = input_shape.num_elements();
-    diag = xla::Reshape(diag, {size});
-
-    // Adds padding after the last element of 'new_size'.
-    xla::PaddingConfig config;
-    auto* dim = config.add_dimensions();
-    dim->set_edge_padding_high(new_size);
-    auto zero = XlaHelpers::Zero(builder, input_type(0));
-    diag = xla::Pad(diag, zero, config);
-
-    // Reshapes so the diagonal is now in the first column.
-    diag = xla::Reshape(diag, {new_size, new_size + 1});
+    xla::XlaOp input = ctx->Input(0);
 
-    // Slices out the first column and reshapes to the final shape.
-    diag = xla::Slice(diag, {0, 0}, {new_size, 1}, {1, 1});
-    diag = xla::Reshape(diag, new_dims);
+    xla::XlaOp output = xla::Reshape(
+        xla::GetMatrixDiagonal(xla::Reshape(input, {new_size, new_size})),
+        new_dims);
 
-    ctx->SetOutput(0, diag);
+    ctx->SetOutput(0, output);
   }
 };
 
@@ -220,8 +194,6 @@ class MatrixDiagPartOp : public XlaOpKernel {
   explicit MatrixDiagPartOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
 
   void Compile(XlaOpKernelContext* ctx) override {
-    xla::XlaBuilder* builder = ctx->builder();
-
     const TensorShape input_shape = ctx->InputShape(0);
     auto dims = input_shape.dim_sizes();
 
@@ -229,71 +201,8 @@ class MatrixDiagPartOp : public XlaOpKernel {
                 errors::InvalidArgument("Expected 2 <= dims, got shape ",
                                         input_shape.DebugString()));
 
-    xla::XlaOp diag = ctx->Input(0);
-
-    int last_dim = dims.size() - 1;
-    int64 last_dim_size = dims[last_dim];
-
-    // The smaller of the last two dimension sizes.
-    int64 smaller_dim_size = std::min(dims[last_dim - 1], dims[last_dim]);
-
-    // TODO(b/30878775): use Slice with strides when supported, in place of
-    // the Pad -> Reshape -> Slice.
-
-    // Picture: for each 2D matrix in the tensor's last two dimensions:
-    // [[1, 0, 0, 0]  pad and reshape to [[1, 0, 0, 0, 0],
-    //  [0, 2, 0, 0]  =================>  [2, 0, 0, 0, 0],
-    //  [0, 0, 3, 0]]                     [3, 0, 0, 0, 0],
-    // and then slice out the first column.
-    //
-    // Another example, with tall and narrow input.
-    // [[1, 0]  pad and reshape to [[1, 0, 0],
-    //  [0, 2]  =================>  [2, 0, 0]]
-    //  [0, 0]
-    //  [0, 0]]
-
-    // Collapses the last two dimensions.
-    std::vector<int64> flattened_dims(dims.begin(), dims.end() - 1);
-    flattened_dims.back() *= dims.back();
-    diag = xla::Reshape(diag, flattened_dims);
-
-    // Slices or pads the last dimension to 'target_size'.
-    int64 actual_size = flattened_dims.back();
-    int64 target_size = smaller_dim_size * (last_dim_size + 1);
-    if (actual_size < target_size) {
-      xla::PaddingConfig config =
-          xla::MakeNoPaddingConfig(flattened_dims.size());
-      auto* dim = config.mutable_dimensions(flattened_dims.size() - 1);
-      dim->set_edge_padding_high(target_size - actual_size);
-      auto zero = XlaHelpers::Zero(builder, input_type(0));
-      diag = xla::Pad(diag, zero, config);
-    } else if (actual_size > target_size) {
-      std::vector<int64> start(flattened_dims.size(), 0);
-      std::vector<int64> limits(flattened_dims.begin(), flattened_dims.end());
-      std::vector<int64> strides(flattened_dims.size(), 1);
-      limits[flattened_dims.size() - 1] = target_size;
-      diag = xla::Slice(diag, start, limits, strides);
-    }
-
-    // Reshape so the target values are in the first position of the last
-    // dimension.
-    std::vector<int64> unflattened_dims(dims.begin(), dims.end());
-    dims[last_dim - 1] = smaller_dim_size;
-    dims[last_dim] = last_dim_size + 1;
-    diag = xla::Reshape(diag, dims);
-
-    // Slices out the first column and reshapes to the final shape.
-    std::vector<int64> start(dims.size(), 0);
-    std::vector<int64> limits(dims.begin(), dims.end());
-    std::vector<int64> strides(dims.size(), 1);
-    limits[last_dim] = 1;
-    diag = xla::Slice(diag, start, limits, strides);
-
-    // Collapses away the last dimension.
-    dims.pop_back();
-    diag = xla::Reshape(diag, dims);
-
-    ctx->SetOutput(0, diag);
+    xla::XlaOp input = ctx->Input(0);
+    ctx->SetOutput(0, xla::GetMatrixDiagonal(input));
   }
 };
 
diff --git a/tensorflow/compiler/tf2xla/kernels/dynamic_slice_ops.cc b/tensorflow/compiler/tf2xla/kernels/dynamic_slice_ops.cc
index 3b86ea34c9..a3389d5b90 100644
--- a/tensorflow/compiler/tf2xla/kernels/dynamic_slice_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/dynamic_slice_ops.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
 #include "tensorflow/compiler/tf2xla/type_util.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/dynamic_stitch_op.cc b/tensorflow/compiler/tf2xla/kernels/dynamic_stitch_op.cc
index 958231505b..cb73053666 100644
--- a/tensorflow/compiler/tf2xla/kernels/dynamic_stitch_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/dynamic_stitch_op.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/elu_op.cc b/tensorflow/compiler/tf2xla/kernels/elu_op.cc
index 2c76bcee25..5fdb1d972c 100644
--- a/tensorflow/compiler/tf2xla/kernels/elu_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/elu_op.cc
@@ -18,8 +18,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/kernels/cwise_ops.h"
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/kernels/no_op.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/extract_image_patches_op.cc b/tensorflow/compiler/tf2xla/kernels/extract_image_patches_op.cc
index 65d42a302f..c68b0bfd79 100644
--- a/tensorflow/compiler/tf2xla/kernels/extract_image_patches_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/extract_image_patches_op.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/util/tensor_format.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc b/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc
index 2fd1a34741..cdba6680de 100644
--- a/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/fake_quantize_ops.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/platform/macros.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/fft_ops.cc b/tensorflow/compiler/tf2xla/kernels/fft_ops.cc
index b2b00e51e3..80bcef9663 100644
--- a/tensorflow/compiler/tf2xla/kernels/fft_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/fft_ops.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/core/framework/numeric_op.h"
 #include "tensorflow/core/framework/op_kernel.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/fill_op.cc b/tensorflow/compiler/tf2xla/kernels/fill_op.cc
index 95faa1d058..54b21a2782 100644
--- a/tensorflow/compiler/tf2xla/kernels/fill_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/fill_op.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/register_types.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/gather_op.cc b/tensorflow/compiler/tf2xla/kernels/gather_op.cc
index 5f041be5df..35de96e0aa 100644
--- a/tensorflow/compiler/tf2xla/kernels/gather_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/gather_op.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/gather_op_helpers.h b/tensorflow/compiler/tf2xla/kernels/gather_op_helpers.h
index d898e43b85..92346283c3 100644
--- a/tensorflow/compiler/tf2xla/kernels/gather_op_helpers.h
+++ b/tensorflow/compiler/tf2xla/kernels/gather_op_helpers.h
@@ -20,7 +20,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/util/bcast.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/if_op.cc b/tensorflow/compiler/tf2xla/kernels/if_op.cc
index f5fcf3cacd..ceb2af756c 100644
--- a/tensorflow/compiler/tf2xla/kernels/if_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/if_op.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_context.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 
@@ -246,6 +246,7 @@ void XlaIfOp::Compile(XlaOpKernelContext* ctx) {
   VLOG(1) << "Done building If";
 }
 
+REGISTER_XLA_OP(Name("If").AllowResourceTypes(), XlaIfOp);
 REGISTER_XLA_OP(Name("XlaIf").AllowResourceTypes(), XlaIfOp);
 
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/image_ops.cc b/tensorflow/compiler/tf2xla/kernels/image_ops.cc
index cb4caf7bcb..33a73fe5fd 100644
--- a/tensorflow/compiler/tf2xla/kernels/image_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/image_ops.cc
@@ -17,7 +17,12 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/lib/arithmetic.h"
+#include "tensorflow/compiler/xla/client/lib/constants.h"
+#include "tensorflow/compiler/xla/client/lib/sorting.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/core/framework/tensor_shape.h"
 
 namespace tensorflow {
 namespace {
@@ -311,5 +316,150 @@ class AdjustHueOp : public XlaOpKernel {
 };
 REGISTER_XLA_OP(Name("AdjustHue"), AdjustHueOp);
 
+class NonMaxSuppressionOp : public XlaOpKernel {
+ public:
+  explicit NonMaxSuppressionOp(OpKernelConstruction* context)
+      : XlaOpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("pad_to_max_output_size",
+                                             &pad_to_max_output_size_));
+  }
+
+  void Compile(XlaOpKernelContext* context) override {
+    // TODO(b/111646731): Improve scalability of this op, using blocking.
+    int num_boxes_dim = 0;
+    int coords_dim = 1;
+    const TensorShape& boxes_shape = context->InputShape("boxes");
+    OP_REQUIRES(context, TensorShapeUtils::IsMatrix(boxes_shape),
+                errors::InvalidArgument("boxes must be 2-D, currently: ",
+                                        boxes_shape.DebugString()));
+    const int64 num_boxes = boxes_shape.dim_size(num_boxes_dim);
+    OP_REQUIRES(context, boxes_shape.dim_size(coords_dim) == 4,
+                errors::InvalidArgument("boxes must have 4 columns",
+                                        boxes_shape.DebugString()));
+    const TensorShape& scores_shape = context->InputShape("scores");
+    OP_REQUIRES(context, TensorShapeUtils::IsVector(scores_shape),
+                errors::InvalidArgument("scores must be 1-D, currently: ",
+                                        scores_shape.DebugString()));
+    OP_REQUIRES(
+        context, scores_shape.dim_size(0) == num_boxes,
+        errors::InvalidArgument("scores size must equal number of boxes",
+                                scores_shape.DebugString()));
+    OP_REQUIRES(context, pad_to_max_output_size_,
+                errors::InvalidArgument(
+                    "XLA compilation requires pad_to_max_output_size == True"));
+
+    xla::XlaOp boxes = context->Input("boxes");
+    xla::XlaOp scores = context->Input("scores");
+    int64 output_size;
+    OP_REQUIRES_OK(context, context->ConstantInputAsIntScalar(2, &output_size));
+    OP_REQUIRES(
+        context, output_size >= 0,
+        errors::InvalidArgument("Need output_size >= 0, got ", output_size));
+    xla::XlaOp score_thresh = context->Input("score_threshold");
+    xla::XlaOp iou_thresh = context->Input("iou_threshold");
+
+    xla::XlaBuilder* const builder = context->builder();
+
+    // Choose a more convenient layout.
+    xla::XlaOp boxes_t = xla::Transpose(boxes, {1, 0});
+    coords_dim = 0;
+    num_boxes_dim = 1;
+
+    // Shapes are henceforth [1, num_boxes].
+    xla::XlaOp coord_y0 = xla::SliceInDim(boxes_t,
+                                          /*start_index=*/0,
+                                          /*limit_index=*/1,
+                                          /*stride=*/1,
+                                          /*dimno=*/coords_dim);
+    xla::XlaOp coord_x0 = xla::SliceInDim(boxes_t,
+                                          /*start_index=*/1,
+                                          /*limit_index=*/2,
+                                          /*stride=*/1,
+                                          /*dimno=*/coords_dim);
+    xla::XlaOp coord_y1 = xla::SliceInDim(boxes_t,
+                                          /*start_index=*/2,
+                                          /*limit_index=*/3,
+                                          /*stride=*/1,
+                                          /*dimno=*/coords_dim);
+    xla::XlaOp coord_x1 = xla::SliceInDim(boxes_t,
+                                          /*start_index=*/3,
+                                          /*limit_index=*/4,
+                                          /*stride=*/1,
+                                          /*dimno=*/coords_dim);
+    xla::XlaOp y1 =
+        xla::Select(xla::Le(coord_y0, coord_y1), coord_y0, coord_y1);
+    xla::XlaOp y2 =
+        xla::Select(xla::Le(coord_y0, coord_y1), coord_y1, coord_y0);
+    xla::XlaOp x1 =
+        xla::Select(xla::Le(coord_x0, coord_x1), coord_x0, coord_x1);
+    xla::XlaOp x2 =
+        xla::Select(xla::Le(coord_x0, coord_x1), coord_x1, coord_x0);
+    xla::XlaOp area = (y2 - y1) * (x2 - x1);
+
+    // Transpose the 1xN tensors, instead of the NxN tensors.
+    xla::XlaOp y1_t = xla::Transpose(y1, {1, 0});
+    xla::XlaOp y2_t = xla::Transpose(y2, {1, 0});
+    xla::XlaOp x1_t = xla::Transpose(x1, {1, 0});
+    xla::XlaOp x2_t = xla::Transpose(x2, {1, 0});
+    xla::XlaOp area_t = xla::Transpose(area, {1, 0});
+
+    // Shapes are henceforth [num_boxes, num_boxes].
+    xla::XlaOp i_xmin = xla::Max(x1, x1_t);
+    xla::XlaOp i_ymin = xla::Max(y1, y1_t);
+    xla::XlaOp i_xmax = xla::Min(x2, x2_t);
+    xla::XlaOp i_ymax = xla::Min(y2, y2_t);
+    auto square_zero = xla::ZerosLike(i_xmin);
+
+    xla::XlaOp i_area = xla::Max(i_xmax - i_xmin, square_zero) *
+                        xla::Max(i_ymax - i_ymin, square_zero);
+    xla::XlaOp u_area = area + area_t - i_area;
+    xla::XlaOp iou = i_area / u_area;
+
+    xla::XlaOp iou_thresh_mask = xla::Gt(iou, iou_thresh + square_zero);
+    xla::XlaOp scores_2d = xla::Reshape(scores, {num_boxes, 1});
+    xla::XlaOp score_cmp_mask =
+        xla::Gt(scores_2d, xla::Transpose(scores_2d, {1, 0}));
+    xla::XlaOp suppress = xla::And(iou_thresh_mask, score_cmp_mask);
+
+    // Shapes are [num_boxes] after the reduce.
+    xla::XlaOp included_iou = xla::Not(xla::Reduce(
+        suppress,
+        /*init_value=*/xla::ConstantR0<bool>(builder, false),
+        /*computation=*/CreateScalarOrComputation(xla::PRED, builder),
+        /*dimensions_to_reduce=*/{0}));
+    xla::XlaOp included_score =
+        xla::Gt(scores, xla::Broadcast(score_thresh, {num_boxes}));
+    xla::XlaOp included = xla::And(included_iou, included_score);
+    xla::XlaOp neg_inf =
+        xla::Broadcast(xla::MinValue(builder, xla::F32), {num_boxes});
+    xla::XlaOp scores_included = xla::Select(included, scores, neg_inf);
+
+    xla::XlaOp ones_included = xla::Select(
+        included,
+        xla::Broadcast(xla::ConstantR0<int32>(builder, 1), {num_boxes}),
+        xla::Broadcast(xla::ConstantR0<int32>(builder, 0), {num_boxes}));
+
+    // num_valid is scalar.
+    xla::XlaOp num_valid = xla::Reduce(
+        ones_included,
+        /*init_value=*/xla::ConstantR0<int>(builder, 0),
+        /*computation=*/CreateScalarAddComputation(xla::S32, builder),
+        /*dimensions_to_reduce=*/{0});
+
+    xla::XlaOp output_tuple = TopK(scores_included, output_size);
+    xla::XlaOp selected_indices = xla::GetTupleElement(output_tuple, 1);
+
+    context->SetOutput(0, selected_indices);
+    context->SetOutput(1, num_valid);
+  }
+
+ private:
+  bool pad_to_max_output_size_;
+};
+
+REGISTER_XLA_OP(
+    Name("NonMaxSuppressionV4").CompileTimeConstInput("max_output_size"),
+    NonMaxSuppressionOp);
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc b/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc
index d6bf92fb3d..8d75624e74 100644
--- a/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/image_resize_ops.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/lib/math/math_util.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/index_ops_cpu.cc b/tensorflow/compiler/tf2xla/kernels/index_ops_cpu.cc
index a020ebc729..22a45b2a11 100644
--- a/tensorflow/compiler/tf2xla/kernels/index_ops_cpu.cc
+++ b/tensorflow/compiler/tf2xla/kernels/index_ops_cpu.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
@@ -78,14 +78,14 @@ class ArgMaxCustomCallOp : public XlaOpKernel {
     std::vector<xla::XlaOp> args;
     args.push_back(ctx->Input(0));
     args.push_back(xla::ConstantLiteral(
-        &b, *xla::Literal::CreateR1<int64>(input_shape.dim_sizes())));
+        &b, *xla::LiteralUtil::CreateR1<int64>(input_shape.dim_sizes())));
     if (input_shape.dims() > 1) {
       // Don't bother passing the output shape and dim for the 1d case, since
       // the shape is always a scalar and the dim is always 0.
       args.push_back(xla::ConstantLiteral(
-          &b, *xla::Literal::CreateR1<int64>(output_shape.dim_sizes())));
+          &b, *xla::LiteralUtil::CreateR1<int64>(output_shape.dim_sizes())));
       args.push_back(
-          xla::ConstantLiteral(&b, *xla::Literal::CreateR0<int32>(dim)));
+          xla::ConstantLiteral(&b, *xla::LiteralUtil::CreateR0<int32>(dim)));
     }
 
     xla::Shape xla_shape =
diff --git a/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc b/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc
index 9e64711051..f028e361bc 100644
--- a/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/l2loss_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/kernels/no_op.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/listdiff_op.cc b/tensorflow/compiler/tf2xla/kernels/listdiff_op.cc
index 2fb072f827..a11bbe918f 100644
--- a/tensorflow/compiler/tf2xla/kernels/listdiff_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/listdiff_op.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/lib/core/errors.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc b/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc
index dc934543cb..87ee2d3aed 100644
--- a/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/lrn_ops.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/matmul_op.cc b/tensorflow/compiler/tf2xla/kernels/matmul_op.cc
index 844080b8cf..6440770c29 100644
--- a/tensorflow/compiler/tf2xla/kernels/matmul_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/matmul_op.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
 namespace tensorflow {
@@ -54,10 +54,14 @@ class MatMulOp : public XlaOpKernel {
     const TensorShape b_shape = ctx->InputShape(1);
 
     // Check that the dimensions of the two matrices are valid.
-    OP_REQUIRES(ctx, TensorShapeUtils::IsMatrix(a_shape),
-                errors::InvalidArgument("In[0] is not a matrix"));
-    OP_REQUIRES(ctx, TensorShapeUtils::IsMatrix(b_shape),
-                errors::InvalidArgument("In[1] is not a matrix"));
+    OP_REQUIRES(
+        ctx, TensorShapeUtils::IsMatrix(a_shape),
+        errors::InvalidArgument("In[0] is not a matrix. Instead it has shape ",
+                                a_shape.DebugString()));
+    OP_REQUIRES(
+        ctx, TensorShapeUtils::IsMatrix(b_shape),
+        errors::InvalidArgument("In[1] is not a matrix. Instead it has shape ",
+                                b_shape.DebugString()));
     int first_index = transpose_a_ ? 0 : 1;
     int second_index = transpose_b_ ? 1 : 0;
 
diff --git a/tensorflow/compiler/tf2xla/kernels/matrix_band_part_op.cc b/tensorflow/compiler/tf2xla/kernels/matrix_band_part_op.cc
index e06c87db7a..8dfd7de591 100644
--- a/tensorflow/compiler/tf2xla/kernels/matrix_band_part_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/matrix_band_part_op.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/tensor_shape.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/matrix_set_diag_op.cc b/tensorflow/compiler/tf2xla/kernels/matrix_set_diag_op.cc
index e2ab4b83cf..c0ca881ff8 100644
--- a/tensorflow/compiler/tf2xla/kernels/matrix_set_diag_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/matrix_set_diag_op.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/compiler/tf2xla/kernels/mirror_pad_op.cc b/tensorflow/compiler/tf2xla/kernels/mirror_pad_op.cc
index 529959dbd9..eedfc3c914 100644
--- a/tensorflow/compiler/tf2xla/kernels/mirror_pad_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/mirror_pad_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/util/mirror_pad_mode.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/pack_op.cc b/tensorflow/compiler/tf2xla/kernels/pack_op.cc
index 3aed47de26..a9b519d892 100644
--- a/tensorflow/compiler/tf2xla/kernels/pack_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/pack_op.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/pad_op.cc b/tensorflow/compiler/tf2xla/kernels/pad_op.cc
index 89fd610bc6..e5937b56c1 100644
--- a/tensorflow/compiler/tf2xla/kernels/pad_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/pad_op.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/register_types.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc b/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc
index a81f5fddf6..3d506e71e0 100644
--- a/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/pooling_ops.cc
@@ -21,8 +21,9 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/qr_op.cc b/tensorflow/compiler/tf2xla/kernels/qr_op.cc
new file mode 100644
index 0000000000..de9068a640
--- /dev/null
+++ b/tensorflow/compiler/tf2xla/kernels/qr_op.cc
@@ -0,0 +1,47 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/tf2xla/lib/qr.h"
+#include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
+#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
+
+namespace tensorflow {
+namespace {
+
+class QROp : public XlaOpKernel {
+ public:
+  explicit QROp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
+    bool full_matrices;
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("full_matrices", &full_matrices));
+    OP_REQUIRES(
+        ctx, full_matrices,
+        errors::Unimplemented("full_matrices=False case of QR decomposition is "
+                              "not implemented in TF/XLA"));
+  }
+  void Compile(XlaOpKernelContext* ctx) override {
+    auto result = QRDecomposition(ctx->Input(0));
+    if (!result.ok()) {
+      ctx->SetStatus(result.status());
+      return;
+    }
+    ctx->SetOutput(0, result.ValueOrDie().q);
+    ctx->SetOutput(1, result.ValueOrDie().r);
+  }
+};
+
+REGISTER_XLA_OP(Name("Qr").TypeConstraint("T", kFloatTypes), QROp);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/quantize_and_dequantize_op.cc b/tensorflow/compiler/tf2xla/kernels/quantize_and_dequantize_op.cc
index e88221e4f4..6f4ed496a1 100644
--- a/tensorflow/compiler/tf2xla/kernels/quantize_and_dequantize_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/quantize_and_dequantize_op.cc
@@ -19,7 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/core/platform/macros.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/random_ops.cc b/tensorflow/compiler/tf2xla/kernels/random_ops.cc
index 9a0a7f9b90..2da9340625 100644
--- a/tensorflow/compiler/tf2xla/kernels/random_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/random_ops.cc
@@ -27,7 +27,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.h"
@@ -74,56 +74,121 @@ class RandomShuffleOp : public XlaOpKernel {
     for (tensorflow::TensorShapeDim dimension : input_shape) {
       num_elements *= dimension.size;
     }
+
     if (num_elements <= 1 || n <= 1) {
       // No shuffling is required, so copy input directly to output
       ctx->SetOutput(0, input);
-    } else {
-      // Generate the random swaps for the indices.
-      auto swaps_shape = xla::ShapeUtil::MakeShape(xla::S32, {n});
-      auto swaps =
-          xla::RngUniform(xla::ConstantR0<int32>(builder, 0),
-                          xla::ConstantR0<int32>(builder, n), swaps_shape);
-
-      // Generate range(n) as the initial value for the indices to be swapped.
-      xla::XlaOp indices = xla::Iota(builder, xla::S32, n);
-
-      // Swap the indices at i and swaps[i].
-      auto swap_body_fn = [&](xla::XlaOp i,
-                              gtl::ArraySlice<xla::XlaOp> loop_vars,
-                              xla::XlaBuilder* builder)
-          -> xla::StatusOr<std::vector<xla::XlaOp>> {
-        auto swaps = loop_vars[0];
-        auto indices = loop_vars[1];
-        i = xla::Reshape(i, {1});
-        // temp = indices[i]
-        auto temp = xla::DynamicSlice(indices, i, {1});
-        // swap_index = swaps[i]
-        auto swap_index = xla::DynamicSlice(swaps, i, {1});
-        // swap_value = indices[swaps[i]]
-        auto swap_value = xla::DynamicSlice(indices, swap_index, {1});
-        // indices[i] = indices[swaps[i]]
-        indices = xla::DynamicUpdateSlice(indices, swap_value, i);
-        // indices[swaps[i]] = temp
-        indices = xla::DynamicUpdateSlice(indices, temp, swap_index);
-        return std::vector<xla::XlaOp>{swaps, indices};
-      };
-      // for i in range(n):
-      auto swap_loop_result =
-          XlaForEachIndex(n, xla::S32, swap_body_fn, {swaps, indices},
-                          "indices_swap_loop", builder)
-              .ValueOrDie();
-      auto swapped_indices = swap_loop_result[1];
-
-      // Gather the data using the swapped indices as the shuffled order.
-      auto indices_tensor_shape = TensorShape({n});
-      DataType type = ctx->expected_output_dtype(0);
-      xla::XlaOp gather;
-      OP_REQUIRES_OK(ctx, XlaGather(input, input_shape, swapped_indices,
-                                    indices_tensor_shape,
-                                    /*axis=*/0, /*indices_are_nd=*/false, type,
-                                    DT_INT32, builder, &gather));
-      ctx->SetOutput(0, gather);
+      return;
+    }
+
+    if (input_shape.dims() == 1) {
+      // For R1s, shuffle values by sorting instead of the obvious Fisher-Yates
+      // algorithm. Fisher-Yates is simple to implement and correct, but not
+      // easily parallelizable. For a sufficiently parallel architecture, it is
+      // faster to sort many times, than Fisher-Yates shuffle once.
+
+      // Shuffle values by assigning each value a random key and sorting the
+      // keys. Keys can collide causing detectable patterns in the shuffled
+      // output. Collisions translates into more ascending sub-sequences in the
+      // shuffled output than would be expected by chance. To avoid collisions,
+      // the number of possible key values must be sufficiently large.
+
+      // How are more than 2^32 keys created? In each loop iteration, the
+      // algorithm sorts by random keys. Conceptually, the earlier iterations
+      // are sorting on the lower-order bits of larger keys that are never
+      // actually assembled.
+
+      // The expected number of collisions is n - d + d(1 - 1/d)^n, where d is
+      // the number of possible keys and n is the number of values. If d = n^2,
+      // then the limit as n goes to infinity is 1/2. If d = n^3, then the limit
+      // as n goes to infinity is zero.
+
+      // This implementation ensures that the key-space is greater than or equal
+      // to the cube of the number of values. The risk of collisions can be
+      // further reduced by increasing Exponent at the expense of
+      // performance.
+
+      // For Exponent = 2, the expected number of collisions per shuffle is
+      // maximized at n = floor((2^32-1)^(1/2)) = 65535 where the expectation is
+      // about 1/2.
+
+      // For Exponent = 3, the expected number of collisions per shuffle is
+      // maximized at n = floor((2^32-1)^(1/3)) = 1625 where the expectation is
+      // about 1/3255.
+
+      // For Exponent = 4, the expected number of collisions per shuffle is
+      // maximized at n = floor((2^32-1)^(1/4)) = 255 where the expectation is
+      // about 1/132622.
+      constexpr int Exponent = 3;
+      const int rounds = static_cast<int>(
+          std::ceil(Exponent * std::log(num_elements) / std::log(kuint32max)));
+
+      const xla::Shape key_shape =
+          xla::ShapeUtil::MakeShape(xla::U32, {num_elements});
+      xla::XlaOp zero = xla::ConstantR0(builder, 0U);
+
+      // Unfortunately, xla::RngUniform gives values in the half open interval
+      // rather than the closed interval, so instead of 2^32 possible keys there
+      // are only 2^32 - 1 (kuint32max).
+      xla::XlaOp max_value = xla::ConstantR0(builder, kuint32max);
+
+      xla::XlaOp curr = input;
+      for (int i = 0; i < rounds; ++i) {
+        xla::XlaOp keys = xla::RngUniform(zero, max_value, key_shape);
+        xla::XlaOp sorted = xla::Sort(keys, curr);
+        curr = xla::GetTupleElement(sorted, 1);
+      }
+
+      ctx->SetOutput(0, curr);
+      return;
     }
+
+    // The Fisher-Yates algorithm.
+
+    // Generate the random swaps for the indices.
+    auto swaps_shape = xla::ShapeUtil::MakeShape(xla::S32, {n});
+    auto swaps =
+        xla::RngUniform(xla::ConstantR0<int32>(builder, 0),
+                        xla::ConstantR0<int32>(builder, n), swaps_shape);
+
+    // Generate range(n) as the initial value for the indices to be swapped.
+    xla::XlaOp indices = xla::Iota(builder, xla::S32, n);
+
+    // Swap the indices at i and swaps[i].
+    auto swap_body_fn = [&](xla::XlaOp i, gtl::ArraySlice<xla::XlaOp> loop_vars,
+                            xla::XlaBuilder* builder)
+        -> xla::StatusOr<std::vector<xla::XlaOp>> {
+      auto swaps = loop_vars[0];
+      auto indices = loop_vars[1];
+      i = xla::Reshape(i, {1});
+      // temp = indices[i]
+      auto temp = xla::DynamicSlice(indices, i, {1});
+      // swap_index = swaps[i]
+      auto swap_index = xla::DynamicSlice(swaps, i, {1});
+      // swap_value = indices[swaps[i]]
+      auto swap_value = xla::DynamicSlice(indices, swap_index, {1});
+      // indices[i] = indices[swaps[i]]
+      indices = xla::DynamicUpdateSlice(indices, swap_value, i);
+      // indices[swaps[i]] = temp
+      indices = xla::DynamicUpdateSlice(indices, temp, swap_index);
+      return std::vector<xla::XlaOp>{swaps, indices};
+    };
+    // for i in range(n):
+    auto swap_loop_result =
+        XlaForEachIndex(n, xla::S32, swap_body_fn, {swaps, indices},
+                        "indices_swap_loop", builder)
+            .ValueOrDie();
+    auto swapped_indices = swap_loop_result[1];
+
+    // Gather the data using the swapped indices as the shuffled order.
+    auto indices_tensor_shape = TensorShape({n});
+    DataType type = ctx->expected_output_dtype(0);
+    xla::XlaOp gather;
+    OP_REQUIRES_OK(ctx, XlaGather(input, input_shape, swapped_indices,
+                                  indices_tensor_shape,
+                                  /*axis=*/0, /*indices_are_nd=*/false, type,
+                                  DT_INT32, builder, &gather));
+    ctx->SetOutput(0, gather);
   }
 
  private:
@@ -220,5 +285,5 @@ REGISTER_XLA_OP(Name("TruncatedNormal")
                     .TypeConstraint("dtype", DT_FLOAT),
                 TruncatedNormalOp);
 
-}  // anonymous namespace
+}  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/reduce_window_op.cc b/tensorflow/compiler/tf2xla/kernels/reduce_window_op.cc
index 76bd1e62aa..b11a4ce36d 100644
--- a/tensorflow/compiler/tf2xla/kernels/reduce_window_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reduce_window_op.cc
@@ -19,7 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_compiler.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc b/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc
index 46fae59ad4..0d260fa8fc 100644
--- a/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reduction_ops.cc
@@ -20,8 +20,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/reduction_ops.h b/tensorflow/compiler/tf2xla/kernels/reduction_ops.h
index 8333f9b288..466e79828d 100644
--- a/tensorflow/compiler/tf2xla/kernels/reduction_ops.h
+++ b/tensorflow/compiler/tf2xla/kernels/reduction_ops.h
@@ -19,7 +19,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_TF2XLA_KERNELS_REDUCTION_OPS_H_
 
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc b/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc
index 909783ecb3..b52f0a0ab6 100644
--- a/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reduction_ops_common.cc
@@ -19,8 +19,9 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/type_util.h"
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/relu_op.cc b/tensorflow/compiler/tf2xla/kernels/relu_op.cc
index a4ba6c748a..d35777ccb1 100644
--- a/tensorflow/compiler/tf2xla/kernels/relu_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/relu_op.cc
@@ -18,8 +18,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/kernels/cwise_ops.h"
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/kernels/no_op.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/reshape_op.cc b/tensorflow/compiler/tf2xla/kernels/reshape_op.cc
index e0ca8dd8e2..121750a82a 100644
--- a/tensorflow/compiler/tf2xla/kernels/reshape_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reshape_op.cc
@@ -19,8 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/retval_op.cc b/tensorflow/compiler/tf2xla/kernels/retval_op.cc
index 5be70a4ded..1911e6ea36 100644
--- a/tensorflow/compiler/tf2xla/kernels/retval_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/retval_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_context.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/reverse_op.cc b/tensorflow/compiler/tf2xla/kernels/reverse_op.cc
index 037c422258..d962ef4a5f 100644
--- a/tensorflow/compiler/tf2xla/kernels/reverse_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reverse_op.cc
@@ -19,8 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/reverse_sequence_op.cc b/tensorflow/compiler/tf2xla/kernels/reverse_sequence_op.cc
index c810456f94..03a50ef8a0 100644
--- a/tensorflow/compiler/tf2xla/kernels/reverse_sequence_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/reverse_sequence_op.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/tensor_shape.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/scan_ops.cc b/tensorflow/compiler/tf2xla/kernels/scan_ops.cc
index 76924c6a01..ab094d7dd1 100644
--- a/tensorflow/compiler/tf2xla/kernels/scan_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/scan_ops.cc
@@ -20,8 +20,9 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/partial_tensor_shape.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/scatter_nd_op.cc b/tensorflow/compiler/tf2xla/kernels/scatter_nd_op.cc
index 14709bb6cb..f1f32699fe 100644
--- a/tensorflow/compiler/tf2xla/kernels/scatter_nd_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/scatter_nd_op.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc b/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc
index e2ac7da2c2..b22ecb7c6d 100644
--- a/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/segment_reduction_ops.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 namespace {
diff --git a/tensorflow/compiler/tf2xla/kernels/select_op.cc b/tensorflow/compiler/tf2xla/kernels/select_op.cc
index 5c010c9df2..6ce50efb4a 100644
--- a/tensorflow/compiler/tf2xla/kernels/select_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/select_op.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/kernels/bounds_check.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/sendrecv_ops.cc b/tensorflow/compiler/tf2xla/kernels/sendrecv_ops.cc
index 6281d6c653..a7f5a8f169 100644
--- a/tensorflow/compiler/tf2xla/kernels/sendrecv_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/sendrecv_ops.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/sequence_ops.cc b/tensorflow/compiler/tf2xla/kernels/sequence_ops.cc
index bc3d0bf5df..25a5bcbe1d 100644
--- a/tensorflow/compiler/tf2xla/kernels/sequence_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/sequence_ops.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/shape_op.cc b/tensorflow/compiler/tf2xla/kernels/shape_op.cc
index 5798823cd5..4e0cf99d8e 100644
--- a/tensorflow/compiler/tf2xla/kernels/shape_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/shape_op.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/kernels/bounds_check.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/slice_op.cc b/tensorflow/compiler/tf2xla/kernels/slice_op.cc
index 1864584ade..6adc3c58de 100644
--- a/tensorflow/compiler/tf2xla/kernels/slice_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/slice_op.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/softmax_op.cc b/tensorflow/compiler/tf2xla/kernels/softmax_op.cc
index a71fbcd901..1d7a63dc31 100644
--- a/tensorflow/compiler/tf2xla/kernels/softmax_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/softmax_op.cc
@@ -20,7 +20,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/sort_ops.cc b/tensorflow/compiler/tf2xla/kernels/sort_ops.cc
index faaf8964ff..aaeeae01cc 100644
--- a/tensorflow/compiler/tf2xla/kernels/sort_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/sort_ops.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 namespace {
diff --git a/tensorflow/compiler/tf2xla/kernels/spacetobatch_op.cc b/tensorflow/compiler/tf2xla/kernels/spacetobatch_op.cc
index 8a8525efa1..7327258c31 100644
--- a/tensorflow/compiler/tf2xla/kernels/spacetobatch_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/spacetobatch_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 namespace {
diff --git a/tensorflow/compiler/tf2xla/kernels/spacetodepth_op.cc b/tensorflow/compiler/tf2xla/kernels/spacetodepth_op.cc
index 47d282fe9e..4493539fe3 100644
--- a/tensorflow/compiler/tf2xla/kernels/spacetodepth_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/spacetodepth_op.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/util/tensor_format.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/kernels/split_op.cc b/tensorflow/compiler/tf2xla/kernels/split_op.cc
index ca74cf2450..93fc14e9ef 100644
--- a/tensorflow/compiler/tf2xla/kernels/split_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/split_op.cc
@@ -19,8 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/stack_ops.cc b/tensorflow/compiler/tf2xla/kernels/stack_ops.cc
index 591e61b4c8..df91900570 100644
--- a/tensorflow/compiler/tf2xla/kernels/stack_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/stack_ops.cc
@@ -23,7 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/partial_tensor_shape.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc b/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc
index a6f5769e7b..5412e13547 100644
--- a/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc
@@ -23,7 +23,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/lib/constants.h"
 #include "tensorflow/compiler/xla/client/lib/math.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/lib/prng.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.h"
@@ -33,134 +34,6 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
-// Rotates a 32-bit integer 'v' left by 'distance' bits.
-xla::XlaOp RotateLeftS32(xla::XlaBuilder* builder, const xla::XlaOp& v,
-                         int distance) {
-  return xla::Or(
-      xla::ShiftLeft(v, xla::ConstantR0<int>(builder, distance)),
-      xla::ShiftRightLogical(v, xla::ConstantR0<int>(builder, 32 - distance)));
-}
-
-using ThreeFry2x32State = std::array<xla::XlaOp, 2>;
-
-// Implements the ThreeFry counter-based PRNG algorithm.
-// Salmon et al. SC 2011. Parallel random numbers: as easy as 1, 2, 3.
-// http://www.thesalmons.org/john/random123/papers/random123sc11.pdf
-ThreeFry2x32State ThreeFry2x32(xla::XlaBuilder* builder,
-                               ThreeFry2x32State input, ThreeFry2x32State key) {
-  // Rotation distances specified by the Threefry2x32 algorithm.
-  constexpr std::array<int, 8> rotations = {13, 15, 26, 6, 17, 29, 16, 24};
-  ThreeFry2x32State x;
-
-  std::array<xla::XlaOp, 3> ks;
-  // 0x1BD11BDA is a parity constant specified by the ThreeFry2x32 algorithm.
-  ks[2] = xla::ConstantR0<int32>(builder, 0x1BD11BDA);
-  for (int i = 0; i < 2; ++i) {
-    ks[i] = key[i];
-    x[i] = input[i];
-    ks[2] = xla::Xor(ks[2], key[i]);
-  }
-
-  x[0] = xla::Add(x[0], ks[0]);
-  x[1] = xla::Add(x[1], ks[1]);
-
-  // Performs a single round of the Threefry2x32 algorithm, with a rotation
-  // amount 'rotation'.
-  auto round = [builder](ThreeFry2x32State v, int rotation) {
-    v[0] = xla::Add(v[0], v[1]);
-    v[1] = RotateLeftS32(builder, v[1], rotation);
-    v[1] = xla::Xor(v[0], v[1]);
-    return v;
-  };
-
-  // There are no known statistical flaws with 13 rounds of Threefry2x32.
-  // We are conservative and use 20 rounds.
-  x = round(x, rotations[0]);
-  x = round(x, rotations[1]);
-  x = round(x, rotations[2]);
-  x = round(x, rotations[3]);
-  x[0] = xla::Add(x[0], ks[1]);
-  x[1] = xla::Add(xla::Add(x[1], ks[2]), xla::ConstantR0<int32>(builder, 1));
-
-  x = round(x, rotations[4]);
-  x = round(x, rotations[5]);
-  x = round(x, rotations[6]);
-  x = round(x, rotations[7]);
-  x[0] = xla::Add(x[0], ks[2]);
-  x[1] = xla::Add(xla::Add(x[1], ks[0]), xla::ConstantR0<int32>(builder, 2));
-
-  x = round(x, rotations[0]);
-  x = round(x, rotations[1]);
-  x = round(x, rotations[2]);
-  x = round(x, rotations[3]);
-  x[0] = xla::Add(x[0], ks[0]);
-  x[1] = xla::Add(xla::Add(x[1], ks[1]), xla::ConstantR0<int32>(builder, 3));
-
-  x = round(x, rotations[4]);
-  x = round(x, rotations[5]);
-  x = round(x, rotations[6]);
-  x = round(x, rotations[7]);
-  x[0] = xla::Add(x[0], ks[1]);
-  x[1] = xla::Add(xla::Add(x[1], ks[2]), xla::ConstantR0<int32>(builder, 4));
-
-  x = round(x, rotations[0]);
-  x = round(x, rotations[1]);
-  x = round(x, rotations[2]);
-  x = round(x, rotations[3]);
-  x[0] = xla::Add(x[0], ks[2]);
-  x[1] = xla::Add(xla::Add(x[1], ks[0]), xla::ConstantR0<int32>(builder, 5));
-
-  return x;
-}
-
-// Returns a tensor of 'shape' random values uniformly distributed in the range
-// [minval, maxval)
-xla::XlaOp RandomUniform(xla::XlaBuilder* builder, const xla::XlaOp& seed,
-                         const TensorShape& shape, double minval,
-                         double maxval) {
-  // Split the seed into two 32-bit scalars to form a key.
-  auto seed0 = xla::Reshape(xla::Slice(seed, {0}, {1}, {1}), {});
-  auto seed1 = xla::Reshape(xla::Slice(seed, {1}, {2}, {1}), {});
-  ThreeFry2x32State key = {seed0, seed1};
-  const int64 size = shape.num_elements();
-
-  const int64 half_size = MathUtil::CeilOfRatio<int64>(size, 2);
-  const bool size_is_odd = (half_size * 2 != size);
-
-  // Fill the generator inputs with unique counter values.
-  ThreeFry2x32State inputs;
-  inputs[0] = xla::Iota(builder, xla::S32, half_size);
-  inputs[1] = xla::Add(inputs[0], xla::ConstantR0<int32>(builder, half_size));
-  ThreeFry2x32State outputs = ThreeFry2x32(builder, inputs, key);
-
-  if (size_is_odd) {
-    outputs[1] = xla::Slice(outputs[1], {0}, {half_size - 1}, {1});
-  }
-
-  auto bits =
-      xla::Reshape(xla::ConcatInDim(builder, outputs, 0), shape.dim_sizes());
-
-  // Form 22 random mantissa bits, with a leading 1 bit. The leading 1 bit
-  // forces the random bits into the mantissa.
-  constexpr int kFloatBits = 32;
-  constexpr int kMantissaBits = 23;
-  bits = xla::Or(
-      xla::ShiftRightLogical(
-          bits, xla::ConstantR0<int32>(builder, kFloatBits - kMantissaBits)),
-      xla::ConstantR0<int32>(builder, bit_cast<int32>(1.0f)));
-  auto floats = xla::BitcastConvertType(bits, xla::F32);
-
-  // We have a floating point number in the range [1.0, 2.0).
-  // Subtract 1.0f to shift to the range [0.0, 1.0)
-  floats = xla::Sub(floats, xla::ConstantR0<float>(builder, 1.0f));
-  // Multiply and add to shift to the range [minval, maxval).
-  floats = xla::Mul(floats, xla::ConstantR0<float>(builder, maxval - minval));
-  floats = xla::Add(floats, xla::ConstantR0<float>(builder, minval));
-  return floats;
-}
-
-}  // namespace
-
 class StatelessRandomUniformOp : public XlaOpKernel {
  public:
   explicit StatelessRandomUniformOp(OpKernelConstruction* ctx)
@@ -177,7 +50,17 @@ class StatelessRandomUniformOp : public XlaOpKernel {
                 errors::InvalidArgument("seed must have shape [2], not ",
                                         seed_shape.DebugString()));
     xla::XlaOp seed = ctx->Input(1);
-    ctx->SetOutput(0, RandomUniform(builder, seed, shape, 0.0, 1.0));
+
+    xla::Shape xla_shape;
+    OP_REQUIRES_OK(ctx, TensorShapeToXLAShape(DT_FLOAT, shape, &xla_shape));
+
+    auto seed0 = xla::Reshape(xla::Slice(seed, {0}, {1}, {1}), {});
+    auto seed1 = xla::Reshape(xla::Slice(seed, {1}, {2}, {1}), {});
+
+    auto uniform = xla::StatelessRngUniform(
+        {seed0, seed1}, xla_shape, xla::ConstantR0<float>(builder, 0.0),
+        xla::ConstantR0<float>(builder, 1.0));
+    ctx->SetOutput(0, uniform);
   }
 
  private:
@@ -206,8 +89,16 @@ class StatelessRandomNormalOp : public XlaOpKernel {
                                         seed_shape.DebugString()));
     xla::XlaOp seed = ctx->Input(1);
     xla::XlaBuilder* builder = ctx->builder();
-    auto uniform =
-        RandomUniform(builder, seed, shape, std::nextafter(-1.0f, 0.0f), 1.0);
+    xla::Shape xla_shape;
+    OP_REQUIRES_OK(ctx, TensorShapeToXLAShape(DT_FLOAT, shape, &xla_shape));
+
+    auto seed0 = xla::Reshape(xla::Slice(seed, {0}, {1}, {1}), {});
+    auto seed1 = xla::Reshape(xla::Slice(seed, {1}, {2}, {1}), {});
+
+    auto uniform = xla::StatelessRngUniform(
+        {seed0, seed1}, xla_shape,
+        xla::ConstantR0<float>(builder, std::nextafter(-1.0f, 0.0f)),
+        xla::ConstantR0<float>(builder, 1.0));
     // Convert uniform distribution to normal distribution by computing
     // sqrt(2) * erfinv(x)
     auto normal =
@@ -240,10 +131,18 @@ class StatelessTruncatedNormalOp : public XlaOpKernel {
                 errors::InvalidArgument("seed must have shape [2], not ",
                                         seed_shape.DebugString()));
     xla::XlaOp seed = ctx->Input(1);
-    xla::XlaBuilder* b = ctx->builder();
+    xla::XlaBuilder* builder = ctx->builder();
+
+    auto seed0 = xla::Reshape(xla::Slice(seed, {0}, {1}, {1}), {});
+    auto seed1 = xla::Reshape(xla::Slice(seed, {1}, {2}, {1}), {});
+
+    xla::Shape xla_shape;
+    OP_REQUIRES_OK(ctx, TensorShapeToXLAShape(DT_FLOAT, shape, &xla_shape));
+    auto uniform = xla::StatelessRngUniform(
+        {seed0, seed1}, xla_shape,
+        xla::ConstantR0<float>(builder, std::numeric_limits<float>::min()),
+        xla::ConstantR0<float>(builder, 1.0));
 
-    auto uniform =
-        RandomUniform(b, seed, shape, std::numeric_limits<float>::min(), 1.0);
     ctx->SetOutput(0, TruncatedNormal(uniform));
   }
 
@@ -257,4 +156,5 @@ REGISTER_XLA_OP(Name("StatelessTruncatedNormal")
                     .TypeConstraint("Tseed", DT_INT32),
                 StatelessTruncatedNormalOp);
 
+}  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc b/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc
index c2165ccd86..1062399d91 100644
--- a/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/tensor_array_ops.cc b/tensorflow/compiler/tf2xla/kernels/tensor_array_ops.cc
index 2f650ce305..be1814d8e3 100644
--- a/tensorflow/compiler/tf2xla/kernels/tensor_array_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/tensor_array_ops.cc
@@ -25,8 +25,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/tf2xla/xla_resource.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/partial_tensor_shape.h"
 #include "tensorflow/core/framework/register_types.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/tile_ops.cc b/tensorflow/compiler/tf2xla/kernels/tile_ops.cc
index c9e5694262..1233a37565 100644
--- a/tensorflow/compiler/tf2xla/kernels/tile_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/tile_ops.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/numeric_op.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/topk_op.cc b/tensorflow/compiler/tf2xla/kernels/topk_op.cc
index 9962f1207d..183879c760 100644
--- a/tensorflow/compiler/tf2xla/kernels/topk_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/topk_op.cc
@@ -13,12 +13,12 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/lib/sorting.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/kernels/no_op.h"
@@ -41,35 +41,18 @@ class TopKOp : public XlaOpKernel {
     OP_REQUIRES(context, input_shape.dims() >= 1,
                 errors::InvalidArgument("input must be >= 1-D, got shape ",
                                         input_shape.DebugString()));
+    int last_dim = input_shape.dims() - 1;
+    int last_dim_size = input_shape.dim_size(last_dim);
     OP_REQUIRES(
-        context, input_shape.dim_size(input_shape.dims() - 1) >= k,
+        context, last_dim_size >= k,
         errors::InvalidArgument("input must have at least k columns. Had ",
-                                input_shape.dim_size(input_shape.dims() - 1),
-                                ", needed ", k));
-
-    OP_REQUIRES(
-        context, input_shape.dims() == 1,
-        errors::Unimplemented("TopK is implemented for 1-D inputs, got shape ",
-                              input_shape.DebugString()));
-
-    xla::XlaBuilder* const b = context->builder();
-    if (input_shape.dim_size(0) < k) {
-      k = input_shape.dim_size(0);
+                                last_dim_size, ", needed ", k));
+    if (last_dim_size < k) {
+      k = last_dim_size;
     }
-    const xla::XlaOp input = context->Input(0);
-    xla::XlaOp iota_s32 = xla::Iota(b, xla::S32, input_shape.dim_size(0));
-    xla::XlaOp sort_result = xla::Sort(xla::Neg(input), iota_s32);
-    xla::XlaOp values =
-        xla::Neg(xla::Slice(xla::GetTupleElement(sort_result, 0),
-                            /*start_indices=*/{0},
-                            /*limit_indices=*/{k},
-                            /*strides=*/{1}));
-    xla::XlaOp indices = xla::Slice(xla::GetTupleElement(sort_result, 1),
-                                    /*start_indices=*/{0},
-                                    /*limit_indices=*/{k},
-                                    /*strides=*/{1});
-    context->SetOutput(0, values);
-    context->SetOutput(1, indices);
+    xla::XlaOp output_tuple = TopK(context->Input(0), k);
+    context->SetOutput(0, xla::GetTupleElement(output_tuple, 0));
+    context->SetOutput(1, xla::GetTupleElement(output_tuple, 1));
   }
 
  private:
diff --git a/tensorflow/compiler/tf2xla/kernels/training_ops.cc b/tensorflow/compiler/tf2xla/kernels/training_ops.cc
index 68b1fce477..be5e911386 100644
--- a/tensorflow/compiler/tf2xla/kernels/training_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/training_ops.cc
@@ -18,8 +18,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
 #include "tensorflow/compiler/xla/client/lib/math.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/kernels/no_op.h"
@@ -57,6 +57,64 @@ REGISTER_XLA_OP(
     Name("ResourceApplyGradientDescent").TypeConstraint("T", kFloatTypes),
     ResourceApplyGradientDescent);
 
+xla::XlaOp ProximalGradientDescentUpdate(xla::XlaOp var, xla::XlaOp lr,
+                                         xla::XlaOp l1, xla::XlaOp l2,
+                                         xla::XlaOp grad) {
+  xla::XlaOp one = xla::ScalarLike(lr, 1.0);
+  xla::XlaOp zero = xla::ScalarLike(lr, 0.0);
+  xla::XlaOp prox_var = var - grad * lr;
+  xla::XlaOp l1_gt_zero = xla::Sign(prox_var) *
+                          xla::Max(xla::Abs(prox_var) - lr * l1, zero) /
+                          (one + lr * l2);
+  xla::XlaOp l1_le_zero = prox_var / (one + lr * l2);
+  return xla::Select(xla::Gt(l1, zero), l1_gt_zero, l1_le_zero);
+}
+
+class ResourceApplyProximalGradientDescent : public XlaOpKernel {
+ public:
+  explicit ResourceApplyProximalGradientDescent(OpKernelConstruction* ctx)
+      : XlaOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype_));
+  }
+
+  void Compile(XlaOpKernelContext* ctx) override {
+    xla::XlaOp var;
+    TensorShape var_shape;
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(0, dtype_, &var_shape, &var));
+
+    TensorShape alpha_shape = ctx->InputShape(1);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(alpha_shape),
+                errors::InvalidArgument("alpha is not a scalar: ",
+                                        alpha_shape.DebugString()));
+    TensorShape l1_shape = ctx->InputShape(2);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(alpha_shape),
+                errors::InvalidArgument("l1 is not a scalar: ",
+                                        l1_shape.DebugString()));
+    TensorShape l2_shape = ctx->InputShape(3);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(alpha_shape),
+                errors::InvalidArgument("l2 is not a scalar: ",
+                                        l2_shape.DebugString()));
+    TensorShape delta_shape = ctx->InputShape(4);
+    OP_REQUIRES(
+        ctx, var_shape.IsSameSize(delta_shape),
+        errors::InvalidArgument("var and delta do not have the same shape: ",
+                                var_shape.DebugString(), " vs ",
+                                delta_shape.DebugString()));
+    xla::XlaOp alpha = ctx->Input(1);
+    xla::XlaOp l1 = ctx->Input(2);
+    xla::XlaOp l2 = ctx->Input(3);
+    xla::XlaOp delta = ctx->Input(4);
+    var = ProximalGradientDescentUpdate(var, alpha, l1, l2, delta);
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(0, dtype_, var));
+  }
+
+ private:
+  DataType dtype_;
+};
+REGISTER_XLA_OP(Name("ResourceApplyProximalGradientDescent")
+                    .TypeConstraint("T", kFloatTypes),
+                ResourceApplyProximalGradientDescent);
+
 class ResourceApplyMomentum : public XlaOpKernel {
  public:
   explicit ResourceApplyMomentum(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
@@ -154,6 +212,139 @@ class ResourceApplyAdagrad : public XlaOpKernel {
 REGISTER_XLA_OP(Name("ResourceApplyAdagrad").TypeConstraint("T", kFloatTypes),
                 ResourceApplyAdagrad);
 
+class ResourceApplyProximalAdagrad : public XlaOpKernel {
+ public:
+  explicit ResourceApplyProximalAdagrad(OpKernelConstruction* ctx)
+      : XlaOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype_));
+  }
+
+  void Compile(XlaOpKernelContext* ctx) override {
+    TensorShape var_shape, accum_shape;
+    xla::XlaOp var, accum;
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(0, dtype_, &var_shape, &var));
+    OP_REQUIRES_OK(ctx,
+                   ctx->ReadVariableInput(1, dtype_, &accum_shape, &accum));
+
+    OP_REQUIRES(ctx, var_shape.IsSameSize(accum_shape),
+                errors::InvalidArgument(
+                    "var and accum do not have the same shape",
+                    var_shape.DebugString(), " ", accum_shape.DebugString()));
+
+    TensorShape lr_shape = ctx->InputShape(2);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(lr_shape),
+                errors::InvalidArgument("lr is not a scalar: ",
+                                        lr_shape.DebugString()));
+    TensorShape l1_shape = ctx->InputShape(3);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(l1_shape),
+                errors::InvalidArgument("l1 is not a scalar: ",
+                                        l1_shape.DebugString()));
+    TensorShape l2_shape = ctx->InputShape(4);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(l2_shape),
+                errors::InvalidArgument("l2 is not a scalar: ",
+                                        l2_shape.DebugString()));
+    TensorShape grad_shape = ctx->InputShape(5);
+    OP_REQUIRES(ctx, var_shape.IsSameSize(grad_shape),
+                errors::InvalidArgument(
+                    "var and grad do not have the same shape: ",
+                    var_shape.DebugString(), " vs ", grad_shape.DebugString()));
+
+    xla::XlaOp lr = ctx->Input(2);
+    xla::XlaOp l1 = ctx->Input(3);
+    xla::XlaOp l2 = ctx->Input(4);
+    xla::XlaOp grad = ctx->Input(5);
+    accum = accum + xla::Square(grad);
+    // Adagrad learning rate.
+    xla::XlaOp adagrad_lr = lr * xla::Rsqrt(accum);
+    var = ProximalGradientDescentUpdate(var, adagrad_lr, l1, l2, grad);
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(0, dtype_, var));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(1, dtype_, accum));
+  }
+
+ private:
+  DataType dtype_;
+};
+REGISTER_XLA_OP(
+    Name("ResourceApplyProximalAdagrad").TypeConstraint("T", kFloatTypes),
+    ResourceApplyProximalAdagrad);
+
+class ResourceApplyAdagradDA : public XlaOpKernel {
+ public:
+  explicit ResourceApplyAdagradDA(OpKernelConstruction* ctx)
+      : XlaOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype_));
+  }
+
+  void Compile(XlaOpKernelContext* ctx) override {
+    TensorShape var_shape, accum_shape, squared_accum_shape;
+    xla::XlaOp var, accum, squared_accum;
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(0, dtype_, &var_shape, &var));
+    OP_REQUIRES_OK(ctx,
+                   ctx->ReadVariableInput(1, dtype_, &accum_shape, &accum));
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(2, dtype_, &squared_accum_shape,
+                                               &squared_accum));
+    OP_REQUIRES(ctx, var_shape.IsSameSize(accum_shape),
+                errors::InvalidArgument(
+                    "var and accum do not have the same shape",
+                    var_shape.DebugString(), " ", accum_shape.DebugString()));
+    OP_REQUIRES(
+        ctx, var_shape.IsSameSize(squared_accum_shape),
+        errors::InvalidArgument(
+            "var and squared accum do not have the same shape",
+            var_shape.DebugString(), " ", squared_accum_shape.DebugString()));
+
+    TensorShape grad_shape = ctx->InputShape(3);
+    TensorShape lr_shape = ctx->InputShape(4);
+    TensorShape l1_shape = ctx->InputShape(5);
+    TensorShape l2_shape = ctx->InputShape(6);
+    TensorShape global_step_shape = ctx->InputShape(7);
+
+    OP_REQUIRES(ctx, var_shape.IsSameSize(grad_shape),
+                errors::InvalidArgument(
+                    "var and grad do not have the same shape",
+                    var_shape.DebugString(), " ", grad_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(lr_shape),
+                errors::InvalidArgument("lr is not a scalar: ",
+                                        lr_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(l1_shape),
+                errors::InvalidArgument("l1 is not a scalar: ",
+                                        l1_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(l2_shape),
+                errors::InvalidArgument("l2 is not a scalar: ",
+                                        l2_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(global_step_shape),
+                errors::InvalidArgument("global step is not a scalar: ",
+                                        global_step_shape.DebugString()));
+
+    xla::XlaOp grad = ctx->Input(3);
+    xla::XlaOp lr = ctx->Input(4);
+    xla::XlaOp l1 = ctx->Input(5);
+    xla::XlaOp l2 = ctx->Input(6);
+    xla::XlaBuilder* const b = ctx->builder();
+    xla::XlaOp global_step =
+        XlaHelpers::ConvertElementType(b, ctx->Input(7), dtype_);
+
+    accum = accum + grad;
+    squared_accum = squared_accum + xla::Square(grad);
+    xla::XlaOp zero = xla::ScalarLike(lr, 0.0);
+    xla::XlaOp denominator = global_step * lr * l2 + xla::Sqrt(squared_accum);
+    xla::XlaOp l1_le_zero = -lr * accum / denominator;
+    xla::XlaOp l1_gt_zero = -lr * xla::Sign(accum) *
+                            xla::Max(xla::Abs(accum) - global_step * l1, zero) /
+                            denominator;
+
+    var = xla::Select(xla::Gt(l1, zero), l1_gt_zero, l1_le_zero);
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(0, dtype_, var));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(1, dtype_, accum));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(2, dtype_, squared_accum));
+  }
+
+ private:
+  DataType dtype_;
+};
+REGISTER_XLA_OP(Name("ResourceApplyAdagradDA").TypeConstraint("T", kFloatTypes),
+                ResourceApplyAdagradDA);
+
 class ResourceApplyAdam : public XlaOpKernel {
  public:
   explicit ResourceApplyAdam(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
@@ -239,36 +430,112 @@ class ResourceApplyAdam : public XlaOpKernel {
 REGISTER_XLA_OP(Name("ResourceApplyAdam").TypeConstraint("T", kFloatTypes),
                 ResourceApplyAdam);
 
-class ResourceApplyRMSProp : public XlaOpKernel {
+class ResourceApplyAdaMax : public XlaOpKernel {
  public:
-  explicit ResourceApplyRMSProp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
+  explicit ResourceApplyAdaMax(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype_));
+  }
 
   void Compile(XlaOpKernelContext* ctx) override {
-    DataType type = ctx->input_type(3);
+    TensorShape var_shape, m_shape, v_shape;
+    xla::XlaOp var, m, v;
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(0, dtype_, &var_shape, &var));
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(1, dtype_, &m_shape, &m));
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(2, dtype_, &v_shape, &v));
 
-    TensorShape var_shape, ms_shape, mom_shape;
-    xla::XlaOp var, ms, mom;
-    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(0, type, &var_shape, &var));
-    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(1, type, &ms_shape, &ms));
-    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(2, type, &mom_shape, &mom));
+    TensorShape beta1_power_shape = ctx->InputShape(3);
+    TensorShape lr_shape = ctx->InputShape(4);
+    TensorShape beta1_shape = ctx->InputShape(5);
+    TensorShape beta2_shape = ctx->InputShape(6);
+    TensorShape epsilon_shape = ctx->InputShape(7);
+    TensorShape grad_shape = ctx->InputShape(8);
 
-    TensorShape lr_shape = ctx->InputShape(3);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta1_power_shape),
+                errors::InvalidArgument("beta1_power is not a scalar: ",
+                                        beta1_power_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(lr_shape),
+                errors::InvalidArgument("lr is not a scalar : ",
+                                        lr_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta1_shape),
+                errors::InvalidArgument("beta1 is not a scalar: ",
+                                        beta1_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta2_shape),
+                errors::InvalidArgument("beta2 is not a scalar: ",
+                                        beta2_shape.DebugString()));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(epsilon_shape),
+                errors::InvalidArgument("epsilon is not a scalar: ",
+                                        epsilon_shape.DebugString()));
+    OP_REQUIRES(ctx, var_shape.IsSameSize(m_shape),
+                errors::InvalidArgument("var and m do not have the same shape",
+                                        var_shape.DebugString(), " ",
+                                        m_shape.DebugString()));
+    OP_REQUIRES(ctx, var_shape.IsSameSize(v_shape),
+                errors::InvalidArgument("var and v do not have the same shape",
+                                        var_shape.DebugString(), " ",
+                                        v_shape.DebugString()));
+    OP_REQUIRES(ctx, var_shape.IsSameSize(grad_shape),
+                errors::InvalidArgument(
+                    "var and grad do not have the same shape",
+                    var_shape.DebugString(), " ", grad_shape.DebugString()));
+
+    xla::XlaOp beta1_power = ctx->Input(3);
+    xla::XlaOp lr = ctx->Input(4);
+    xla::XlaOp beta1 = ctx->Input(5);
+    xla::XlaOp beta2 = ctx->Input(6);
+    xla::XlaOp epsilon = ctx->Input(7);
+    xla::XlaOp grad = ctx->Input(8);
+
+    xla::XlaOp one = xla::ScalarLike(lr, 1.0);
+    m = beta1 * m + (one - beta1) * grad;
+    v = xla::Max(beta2 * v, xla::Abs(grad));
+    var = var - lr / (one - beta1_power) * (m / (v + epsilon));
+
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(0, dtype_, var));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(1, dtype_, m));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(2, dtype_, v));
+  }
+
+ private:
+  DataType dtype_;
+};
+REGISTER_XLA_OP(Name("ResourceApplyAdaMax").TypeConstraint("T", kFloatTypes),
+                ResourceApplyAdaMax);
+
+class ResourceApplyRMSProp : public XlaOpKernel {
+ public:
+  explicit ResourceApplyRMSProp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype_));
+  }
+
+  void Compile(XlaOpKernelContext* ctx) override {
+    TensorShape var_shape, ms_shape, mom_shape, mg_shape;
+    xla::XlaOp var, ms, mom, mg;
+    OP_REQUIRES_OK(ctx,
+                   ctx->ReadVariableInput("var", dtype_, &var_shape, &var));
+    if (centered_) {
+      OP_REQUIRES_OK(ctx, ctx->ReadVariableInput("mg", dtype_, &mg_shape, &mg));
+    }
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput("ms", dtype_, &ms_shape, &ms));
+    OP_REQUIRES_OK(ctx,
+                   ctx->ReadVariableInput("mom", dtype_, &mom_shape, &mom));
+
+    TensorShape lr_shape = ctx->InputShape("lr");
     OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(lr_shape),
                 errors::InvalidArgument("lr is not a scalar: ",
                                         lr_shape.DebugString()));
-    TensorShape rho_shape = ctx->InputShape(4);
+    TensorShape rho_shape = ctx->InputShape("rho");
     OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(rho_shape),
                 errors::InvalidArgument("rho is not a scalar: ",
                                         rho_shape.DebugString()));
-    TensorShape momentum_shape = ctx->InputShape(5);
+    TensorShape momentum_shape = ctx->InputShape("momentum");
     OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(momentum_shape),
                 errors::InvalidArgument("momentum is not a scalar: ",
                                         momentum_shape.DebugString()));
-    TensorShape epsilon_shape = ctx->InputShape(6);
+    TensorShape epsilon_shape = ctx->InputShape("epsilon");
     OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(epsilon_shape),
                 errors::InvalidArgument("epsilon is not a scalar: ",
                                         epsilon_shape.DebugString()));
-    TensorShape grad_shape = ctx->InputShape(7);
+    TensorShape grad_shape = ctx->InputShape("grad");
 
     // var should be the same shape as mom and ms.
     OP_REQUIRES(ctx, var_shape.IsSameSize(ms_shape),
@@ -284,11 +551,11 @@ class ResourceApplyRMSProp : public XlaOpKernel {
                     "var and grad do not have the same shape",
                     var_shape.DebugString(), " ", grad_shape.DebugString()));
 
-    xla::XlaOp lr = ctx->Input(3);
-    xla::XlaOp rho = ctx->Input(4);
-    xla::XlaOp momentum = ctx->Input(5);
-    xla::XlaOp epsilon = ctx->Input(6);
-    xla::XlaOp grad = ctx->Input(7);
+    xla::XlaOp lr = ctx->Input("lr");
+    xla::XlaOp rho = ctx->Input("rho");
+    xla::XlaOp momentum = ctx->Input("momentum");
+    xla::XlaOp epsilon = ctx->Input("epsilon");
+    xla::XlaOp grad = ctx->Input("grad");
 
     // ms <- rho * ms_{t-1} + (1-rho) * grad * grad
     // mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
@@ -307,20 +574,46 @@ class ResourceApplyRMSProp : public XlaOpKernel {
     //    ms <- grad**2 (1 - rho) + ms * rho
     //
     // Which is the equation listed above.
-    xla::XlaOp new_ms =
-        ms + (xla::Square(grad) - ms) * (xla::ScalarLike(ms, 1.0) - rho);
-    xla::XlaOp new_mom =
-        mom * momentum + grad * lr * xla::Rsqrt(new_ms + epsilon);
+    xla::XlaOp one = xla::ScalarLike(ms, 1.0);
+    xla::XlaOp new_ms = xla::Square(grad) * (one - rho) + ms * rho;
+    xla::XlaOp denominator;
+    if (centered_) {
+      mg = grad * (one - rho) + mg * rho;
+      denominator = new_ms - xla::Square(mg) + epsilon;
+    } else {
+      denominator = new_ms + epsilon;
+    }
+    xla::XlaOp new_mom = mom * momentum + grad * lr * xla::Rsqrt(denominator);
     xla::XlaOp new_var = var - new_mom;
 
-    OP_REQUIRES_OK(ctx, ctx->AssignVariable(0, type, new_var));
-    OP_REQUIRES_OK(ctx, ctx->AssignVariable(1, type, new_ms));
-    OP_REQUIRES_OK(ctx, ctx->AssignVariable(2, type, new_mom));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable("var", dtype_, new_var));
+    if (centered_) {
+      OP_REQUIRES_OK(ctx, ctx->AssignVariable("mg", dtype_, mg));
+    }
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable("ms", dtype_, new_ms));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable("mom", dtype_, new_mom));
   }
+
+ protected:
+  bool centered_ = false;
+
+ private:
+  DataType dtype_;
 };
 REGISTER_XLA_OP(Name("ResourceApplyRMSProp").TypeConstraint("T", kFloatTypes),
                 ResourceApplyRMSProp);
 
+class ResourceApplyCenteredRMSProp : public ResourceApplyRMSProp {
+ public:
+  explicit ResourceApplyCenteredRMSProp(OpKernelConstruction* ctx)
+      : ResourceApplyRMSProp(ctx) {
+    centered_ = true;
+  }
+};
+REGISTER_XLA_OP(
+    Name("ResourceApplyCenteredRMSProp").TypeConstraint("T", kFloatTypes),
+    ResourceApplyCenteredRMSProp);
+
 void CompileFtrl(XlaOpKernelContext* ctx, DataType dtype,
                  bool has_l2_shrinkage) {
   xla::XlaBuilder* b = ctx->builder();
@@ -526,5 +819,107 @@ class ResourceApplyAdadelta : public XlaOpKernel {
 REGISTER_XLA_OP(Name("ResourceApplyAdadelta").TypeConstraint("T", kFloatTypes),
                 ResourceApplyAdadelta);
 
+class ResourceApplySignBase : public XlaOpKernel {
+ public:
+  explicit ResourceApplySignBase(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype_));
+  }
+
+  void Compile(XlaOpKernelContext* ctx) override {
+    TensorShape var_shape, m_shape;
+    xla::XlaOp var, m;
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(0, dtype_, &var_shape, &var));
+    OP_REQUIRES_OK(ctx, ctx->ReadVariableInput(1, dtype_, &m_shape, &m));
+    OP_REQUIRES(ctx, var_shape.IsSameSize(m_shape),
+                errors::InvalidArgument("var and m do not have the same shape",
+                                        var_shape.DebugString(), " ",
+                                        m_shape.DebugString()));
+    TensorShape grad_shape = ctx->InputShape(6);
+    OP_REQUIRES(ctx, var_shape.IsSameSize(grad_shape),
+                errors::InvalidArgument(
+                    "var and grad do not have the same shape",
+                    var_shape.DebugString(), " ", grad_shape.DebugString()));
+    CheckScalarParams(ctx);
+
+    xla::XlaOp lr = ctx->Input(2);
+    xla::XlaOp alpha = ctx->Input(3);
+    xla::XlaOp sign_decay = ctx->Input(4);
+    xla::XlaOp beta = ctx->Input(5);
+    xla::XlaOp grad = ctx->Input(6);
+
+    m = m * beta + grad * (xla::ScalarLike(beta, 1.0) - beta);
+    xla::XlaOp decay = xla::Sign(grad) * xla::Sign(m) * sign_decay;
+
+    xla::XlaOp grad_scale = ComputeGradientScale(alpha, decay);
+    var = var - lr * grad_scale * grad;
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(0, dtype_, var));
+    OP_REQUIRES_OK(ctx, ctx->AssignVariable(1, dtype_, m));
+  }
+
+  virtual void CheckScalarParams(XlaOpKernelContext* ctx) {
+    TensorShape lr_shape = ctx->InputShape(2);
+    TensorShape sign_decay_shape = ctx->InputShape(4);
+    TensorShape beta_shape = ctx->InputShape(5);
+
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(lr_shape),
+                errors::InvalidArgument("lr is not a scalar: ",
+                                        lr_shape.DebugString()));
+
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(sign_decay_shape),
+                errors::InvalidArgument("sign_decay is not a scalar: ",
+                                        sign_decay_shape.DebugString()));
+
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta_shape),
+                errors::InvalidArgument("beta is not a scalar: ",
+                                        beta_shape.DebugString()));
+  }
+
+  virtual xla::XlaOp ComputeGradientScale(xla::XlaOp alpha,
+                                          xla::XlaOp decay) = 0;
+
+ private:
+  DataType dtype_;
+};
+
+class ResourceApplyAddSign : public ResourceApplySignBase {
+ public:
+  explicit ResourceApplyAddSign(OpKernelConstruction* ctx)
+      : ResourceApplySignBase(ctx) {}
+
+  void CheckScalarParams(XlaOpKernelContext* ctx) override {
+    ResourceApplySignBase::CheckScalarParams(ctx);
+    TensorShape alpha_shape = ctx->InputShape(3);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(alpha_shape),
+                errors::InvalidArgument("alpha is not a scalar: ",
+                                        alpha_shape.DebugString()));
+  }
+
+  xla::XlaOp ComputeGradientScale(xla::XlaOp alpha, xla::XlaOp decay) override {
+    return alpha + decay;
+  }
+};
+REGISTER_XLA_OP(Name("ResourceApplyAddSign").TypeConstraint("T", kFloatTypes),
+                ResourceApplyAddSign);
+
+class ResourceApplyPowerSign : public ResourceApplySignBase {
+ public:
+  explicit ResourceApplyPowerSign(OpKernelConstruction* ctx)
+      : ResourceApplySignBase(ctx) {}
+
+  void CheckScalarParams(XlaOpKernelContext* ctx) override {
+    ResourceApplySignBase::CheckScalarParams(ctx);
+    TensorShape logbase_shape = ctx->InputShape(3);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(logbase_shape),
+                errors::InvalidArgument("logbase is not a scalar: ",
+                                        logbase_shape.DebugString()));
+  }
+
+  xla::XlaOp ComputeGradientScale(xla::XlaOp alpha, xla::XlaOp decay) override {
+    return xla::Exp(alpha * decay);
+  }
+};
+REGISTER_XLA_OP(Name("ResourceApplyPowerSign").TypeConstraint("T", kFloatTypes),
+                ResourceApplyPowerSign);
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/transpose_op.cc b/tensorflow/compiler/tf2xla/kernels/transpose_op.cc
index 6c721c48fe..f9148b3942 100644
--- a/tensorflow/compiler/tf2xla/kernels/transpose_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/transpose_op.cc
@@ -23,7 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/kernels/bounds_check.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/unary_ops.cc b/tensorflow/compiler/tf2xla/kernels/unary_ops.cc
index 116a020437..0bdfc05726 100644
--- a/tensorflow/compiler/tf2xla/kernels/unary_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/unary_ops.cc
@@ -23,7 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
 #include "tensorflow/compiler/xla/client/lib/math.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 
 namespace tensorflow {
@@ -51,43 +51,18 @@ XLAJIT_MAKE_UNARY(Conj, xla::Conj(x));
 
 // Return x if x>0, otherwise -x.
 XLAJIT_MAKE_UNARY(Abs, xla::Abs(x));
-
-// acos(x) = 2 * atan(sqrt(1 - x^2) / (1 + x))
-XLAJIT_MAKE_UNARY(Acos,
-                  xla::ScalarLike(x, 2.0) *
-                      xla::Atan2(xla::Sqrt(xla::ScalarLike(x, 1.0) - x * x),
-                                 xla::ScalarLike(x, 1.0) + x));
-
-// acosh(x) = log(x + sqrt(x^2 - 1))
-//          = log(x + sqrt((x+1)*(x-1)))
-XLAJIT_MAKE_UNARY(Acosh,
-                  xla::Log(x + xla::Sqrt((x + xla::ScalarLike(x, 1.0)) *
-                                         (x - xla::ScalarLike(x, 1.0)))));
-
-// asin(x) = 2 * atan(x / (1 + sqrt(1 - x^2)))
-XLAJIT_MAKE_UNARY(
-    Asin, xla::ScalarLike(x, 2.0) *
-              xla::Atan2(x, xla::ScalarLike(x, 1.0) +
-                                xla::Sqrt(xla::ScalarLike(x, 1.0) - x * x)));
-
-// asinh(x) = log(x + sqrt(x^2 + 1))
-XLAJIT_MAKE_UNARY(Asinh,
-                  xla::Log(x + xla::Sqrt(x * x + xla::ScalarLike(x, 1.0))));
-
-XLAJIT_MAKE_UNARY(Atan, xla::Atan2(x, xla::ScalarLike(x, 1.0)));
-
-// atanh(x) = 0.5 * log((1 + x) / (1 - x))
-XLAJIT_MAKE_UNARY(Atanh, xla::Log((xla::ScalarLike(x, 1.0) + x) /
-                                  (xla::ScalarLike(x, 1.0) - x)) *
-                             xla::ScalarLike(x, 0.5));
+XLAJIT_MAKE_UNARY(Acos, xla::Acos(x));
+XLAJIT_MAKE_UNARY(Acosh, xla::Acosh(x));
+XLAJIT_MAKE_UNARY(Asin, xla::Asin(x))
+XLAJIT_MAKE_UNARY(Asinh, xla::Asinh(x));
+XLAJIT_MAKE_UNARY(Atan, xla::Atan(x));
+XLAJIT_MAKE_UNARY(Atanh, xla::Atanh(x));
 XLAJIT_MAKE_UNARY(Ceil, xla::Ceil(x));
 XLAJIT_MAKE_UNARY(Cos, xla::Cos(x));
-XLAJIT_MAKE_UNARY(Cosh, (xla::Exp(x) + xla::Exp(-x)) * xla::ScalarLike(x, 0.5));
+XLAJIT_MAKE_UNARY(Cosh, xla::Cosh(x));
 XLAJIT_MAKE_UNARY(Sin, xla::Sin(x));
 XLAJIT_MAKE_UNARY(Exp, xla::Exp(x));
-
 XLAJIT_MAKE_UNARY(Expm1, xla::Expm1(x));
-
 XLAJIT_MAKE_UNARY(Floor, xla::Floor(x));
 XLAJIT_MAKE_UNARY(IsFinite, xla::IsFinite(x));
 XLAJIT_MAKE_UNARY(
@@ -99,7 +74,6 @@ XLAJIT_MAKE_UNARY(IsNan, xla::Ne(x, x));
 XLAJIT_MAKE_UNARY(Inv, xla::ScalarLike(x, 1.0) / x);
 XLAJIT_MAKE_UNARY(Reciprocal, xla::ScalarLike(x, 1.0) / x);
 XLAJIT_MAKE_UNARY(Log, xla::Log(x));
-
 XLAJIT_MAKE_UNARY(Log1p, xla::Log1p(x));
 
 XLAJIT_MAKE_UNARY(Invert, xla::Not(x));
@@ -136,7 +110,7 @@ XLAJIT_MAKE_UNARY(Sigmoid, Sigmoid(x));
 
 // Returns 0 if x is 0, -1 if x < 0 and 1 if x > 0.
 XLAJIT_MAKE_UNARY(Sign, xla::Sign(x));
-XLAJIT_MAKE_UNARY(Sinh, (xla::Exp(x) - xla::Exp(-x)) * xla::ScalarLike(x, 0.5));
+XLAJIT_MAKE_UNARY(Sinh, xla::Sinh(x));
 
 // softplus(x) = log(1 + exp(x))
 //
@@ -153,7 +127,7 @@ XLAJIT_MAKE_UNARY(Softplus, xla::Max(x, xla::ScalarLike(x, 0.0)) +
 XLAJIT_MAKE_UNARY(Softsign, x / (xla::Abs(x) + xla::ScalarLike(x, 1.0)));
 XLAJIT_MAKE_UNARY(Sqrt, xla::Sqrt(x));
 XLAJIT_MAKE_UNARY(Square, x* x);
-XLAJIT_MAKE_UNARY(Tan, xla::Sin(x) / xla::Cos(x));
+XLAJIT_MAKE_UNARY(Tan, xla::Tan(x));
 XLAJIT_MAKE_UNARY(Tanh, xla::Tanh(x));
 
 XLAJIT_MAKE_UNARY(Real, xla::Real(x));
@@ -189,5 +163,51 @@ class ErfcOp : public XlaOpKernel {
 };
 REGISTER_XLA_OP(Name("Erfc"), ErfcOp);
 
+class LgammaOp : public XlaOpKernel {
+ public:
+  explicit LgammaOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
+  // Calculate lgamma using the Lanczos approximation
+  // (https://en.wikipedia.org/wiki/Lanczos_approximation).
+  void Compile(XlaOpKernelContext* ctx) override {
+    xla::XlaOp input = ctx->Input(0);
+    xla::PrimitiveType input_type = ctx->input_xla_type(0);
+
+    if (input_type == xla::F16 || input_type == xla::BF16) {
+      // The approximation works better with at least 32-bits of accuracy.
+      xla::XlaOp input_f32 = xla::ConvertElementType(input, xla::F32);
+      xla::XlaOp result_f32 = xla::Lgamma(input_f32);
+      xla::XlaOp result_x16 = xla::ConvertElementType(result_f32, input_type);
+      ctx->SetOutput(0, result_x16);
+    } else {
+      xla::XlaOp result = xla::Lgamma(input);
+      ctx->SetOutput(0, result);
+    }
+  }
+};  // namespace
+REGISTER_XLA_OP(Name("Lgamma"), LgammaOp);
+
+class DigammaOp : public XlaOpKernel {
+ public:
+  explicit DigammaOp(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {}
+  // Calculate lgamma using the Lanczos approximation
+  // (https://en.wikipedia.org/wiki/Lanczos_approximation).
+  void Compile(XlaOpKernelContext* ctx) override {
+    xla::XlaOp input = ctx->Input(0);
+    xla::PrimitiveType input_type = ctx->input_xla_type(0);
+
+    if (input_type == xla::F16 || input_type == xla::BF16) {
+      // The approximation works better with at least 32-bits of accuracy.
+      xla::XlaOp input_f32 = xla::ConvertElementType(input, xla::F32);
+      xla::XlaOp result_f32 = xla::Digamma(input_f32);
+      xla::XlaOp result_x16 = xla::ConvertElementType(result_f32, input_type);
+      ctx->SetOutput(0, result_x16);
+    } else {
+      xla::XlaOp result = xla::Digamma(input);
+      ctx->SetOutput(0, result);
+    }
+  }
+};  // namespace
+REGISTER_XLA_OP(Name("Digamma"), DigammaOp);
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/kernels/unpack_op.cc b/tensorflow/compiler/tf2xla/kernels/unpack_op.cc
index 0e5d58ecba..8671632976 100644
--- a/tensorflow/compiler/tf2xla/kernels/unpack_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/unpack_op.cc
@@ -22,8 +22,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor.h"
diff --git a/tensorflow/compiler/tf2xla/kernels/variable_ops.cc b/tensorflow/compiler/tf2xla/kernels/variable_ops.cc
index febac82873..2c92a585f5 100644
--- a/tensorflow/compiler/tf2xla/kernels/variable_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/variable_ops.cc
@@ -19,8 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/kernel_def_builder.h"
 #include "tensorflow/core/framework/types.h"
 
diff --git a/tensorflow/compiler/tf2xla/kernels/while_op.cc b/tensorflow/compiler/tf2xla/kernels/while_op.cc
index 340165bac6..1e8a376765 100644
--- a/tensorflow/compiler/tf2xla/kernels/while_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/while_op.cc
@@ -21,8 +21,9 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
@@ -299,6 +300,7 @@ void XlaWhileOp::Compile(XlaOpKernelContext* ctx) {
   VLOG(1) << "Done building while loop";
 }
 
+REGISTER_XLA_OP(Name("While").AllowResourceTypes(), XlaWhileOp);
 REGISTER_XLA_OP(Name("XlaWhile").AllowResourceTypes(), XlaWhileOp);
 
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/lib/BUILD b/tensorflow/compiler/tf2xla/lib/BUILD
index dfa3c0595a..cb7a40e23d 100644
--- a/tensorflow/compiler/tf2xla/lib/BUILD
+++ b/tensorflow/compiler/tf2xla/lib/BUILD
@@ -25,8 +25,8 @@ cc_library(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:lib",
     ],
 )
@@ -40,13 +40,13 @@ cc_library(
         ":triangular_solve",
         ":util",
         ":while_loop",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:constants",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/core:lib",
     ],
 )
@@ -59,28 +59,50 @@ cc_library(
         "//tensorflow/compiler/tf2xla:xla_compiler",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/client/lib:constants",
         "//tensorflow/compiler/xla/client/lib:math",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
         "//tensorflow/core:protos_all_cc",
     ],
 )
 
 cc_library(
+    name = "qr",
+    srcs = ["qr.cc"],
+    hdrs = ["qr.h"],
+    deps = [
+        ":batch_dot",
+        ":util",
+        ":while_loop",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
+        "//tensorflow/compiler/xla/client/lib:arithmetic",
+        "//tensorflow/compiler/xla/client/lib:constants",
+        "//tensorflow/compiler/xla/client/lib:math",
+        "//tensorflow/compiler/xla/client/lib:numeric",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
     name = "scatter",
     srcs = ["scatter.cc"],
     hdrs = ["scatter.h"],
     deps = [
         ":util",
         ":while_loop",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/core:lib",
     ],
 )
@@ -92,14 +114,15 @@ cc_library(
     deps = [
         ":batch_dot",
         ":util",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:constants",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client/lib:numeric",
         "//tensorflow/core:lib",
     ],
 )
@@ -111,7 +134,7 @@ xla_test(
     deps = [
         ":triangular_solve",
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -119,7 +142,7 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -133,13 +156,14 @@ cc_library(
     srcs = ["util.cc"],
     hdrs = ["util.h"],
     deps = [
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:lib",
     ],
 )
@@ -151,7 +175,7 @@ xla_test(
         ":batch_dot",
         ":util",
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -176,8 +200,8 @@ cc_library(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:lib",
     ],
 )
diff --git a/tensorflow/compiler/tf2xla/lib/batch_dot.cc b/tensorflow/compiler/tf2xla/lib/batch_dot.cc
index f9f3a8c8cf..f666d22ea4 100644
--- a/tensorflow/compiler/tf2xla/lib/batch_dot.cc
+++ b/tensorflow/compiler/tf2xla/lib/batch_dot.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <memory>
 #include <vector>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -84,7 +84,7 @@ xla::XlaOp BatchDot(xla::XlaOp x, xla::XlaOp y, bool transpose_x,
       dimensions.push_back(y_shape.dimensions(y_outer_dim));
       return xla::Broadcast(
           xla::ConstantLiteral(builder,
-                               xla::Literal::Zero(x_shape.element_type())),
+                               xla::LiteralUtil::Zero(x_shape.element_type())),
           dimensions);
     }
 
diff --git a/tensorflow/compiler/tf2xla/lib/batch_dot.h b/tensorflow/compiler/tf2xla/lib/batch_dot.h
index d07a9486f1..8757b16a1c 100644
--- a/tensorflow/compiler/tf2xla/lib/batch_dot.h
+++ b/tensorflow/compiler/tf2xla/lib/batch_dot.h
@@ -16,8 +16,8 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_TF2XLA_LIB_BATCH_DOT_H_
 #define TENSORFLOW_COMPILER_TF2XLA_LIB_BATCH_DOT_H_
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/compiler/tf2xla/lib/cholesky.cc b/tensorflow/compiler/tf2xla/lib/cholesky.cc
index cc840de393..87d73eb3f0 100644
--- a/tensorflow/compiler/tf2xla/lib/cholesky.cc
+++ b/tensorflow/compiler/tf2xla/lib/cholesky.cc
@@ -23,8 +23,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/lib/util.h"
 #include "tensorflow/compiler/tf2xla/lib/while_loop.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
diff --git a/tensorflow/compiler/tf2xla/lib/cholesky.h b/tensorflow/compiler/tf2xla/lib/cholesky.h
index 0f6e0e9d15..1bef9bb166 100644
--- a/tensorflow/compiler/tf2xla/lib/cholesky.h
+++ b/tensorflow/compiler/tf2xla/lib/cholesky.h
@@ -16,8 +16,8 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_TF2XLA_LIB_CHOLESKY_H_
 #define TENSORFLOW_COMPILER_TF2XLA_LIB_CHOLESKY_H_
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/compiler/tf2xla/lib/qr.cc b/tensorflow/compiler/tf2xla/lib/qr.cc
new file mode 100644
index 0000000000..fc0c1ee838
--- /dev/null
+++ b/tensorflow/compiler/tf2xla/lib/qr.cc
@@ -0,0 +1,387 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/tf2xla/lib/qr.h"
+
+#include <memory>
+#include <vector>
+
+#include "tensorflow/compiler/tf2xla/lib/batch_dot.h"
+#include "tensorflow/compiler/tf2xla/lib/util.h"
+#include "tensorflow/compiler/tf2xla/lib/while_loop.h"
+#include "tensorflow/compiler/xla/client/lib/arithmetic.h"
+#include "tensorflow/compiler/xla/client/lib/constants.h"
+#include "tensorflow/compiler/xla/client/lib/math.h"
+#include "tensorflow/compiler/xla/client/lib/numeric.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/lib/core/errors.h"
+
+namespace tensorflow {
+
+namespace {
+
+// Computes a Householder reflection of the form:
+// H = I - tau v v.T.
+// such that
+// H . ( x1  ) = ( x1   )
+//     ( x2  ) = ( x2   )
+//     ( ... ) = ( ...  )
+//     ( xk  ) = ( beta )
+//     ( ... )   ( 0    )
+//     ( ... )   ( 0    )
+// Unlike the usual formulation, we allow the caller to supply 'k' rather than
+// only providing the relevant part of 'x' to maintain XLA's static shape
+// invariant. In addition, the implementation supports batching.
+// Pseudo-code, without batching:
+//   alpha = x[k]
+//   x_copy = np.copy(x)
+//   x_copy[:k+1] = 0
+//   xnorm = norm2(x_copy)
+//   if xnorm == 0:
+//     beta = alpha
+//     tau = 0
+//     v = np.zeros_like(x)
+//   else:
+//     beta = - np.sign(alpha) * dlapy2(alpha, xnorm)
+//     tau = (beta - alpha) / beta
+//     v = x / (alpha - beta)
+//   v[k] = 1
+//   return (v, tau, beta)
+// TODO(phawkins): LAPACK's xLARFG implementation has code for handling
+// overflows in the norm/beta calculations. Perhaps do the same here.
+xla::Status House(xla::XlaOp x, xla::XlaOp k, gtl::ArraySlice<int64> batch_dims,
+                  const int64 m, xla::XlaOp* v, xla::XlaOp* tau,
+                  xla::XlaOp* beta) {
+  xla::XlaBuilder* const builder = x.builder();
+  TF_ASSIGN_OR_RETURN(xla::Shape x_shape, builder->GetShape(x));
+  const xla::PrimitiveType type = x_shape.element_type();
+
+  std::vector<int64> batch_dim_ids(batch_dims.size());
+  std::iota(batch_dim_ids.begin(), batch_dim_ids.end(), 0);
+  const int64 minor_dim = batch_dims.size();
+
+  xla::XlaOp zero = xla::ScalarLike(x, 0.0);
+  xla::XlaOp one = xla::ScalarLike(x, 1.0);
+
+  // alpha = x[k]
+  xla::XlaOp alpha =
+      xla::Reshape(DynamicSliceInMinorDims(x, {k}, {1}), batch_dims);
+
+  // Compute x[k+1:] (padded with zeros in elements 0..k)
+  xla::XlaOp iota = xla::Iota(builder, xla::S32, m);
+  xla::XlaOp x_after_k =
+      xla::Mul(x, xla::ConvertElementType(xla::Gt(iota, k), type),
+               /*broadcast_dimensions=*/{minor_dim});
+
+  // sigma = np.dot(x[k+1:], x[k+1:])
+  auto sigma =
+      xla::Reduce(x_after_k * x_after_k, zero,
+                  xla::CreateScalarAddComputation(type, builder), {minor_dim});
+  // mu = np.sqrt(x[k]*x[k] + sigma)
+  auto mu = xla::Sqrt(xla::Square(alpha) + sigma);
+
+  auto sigma_is_zero = xla::Eq(sigma, zero);
+
+  *beta = xla::Select(sigma_is_zero, alpha, -xla::Sign(alpha) * mu);
+  *tau = xla::Select(sigma_is_zero, xla::Broadcast(zero, batch_dims),
+                     (*beta - alpha) / *beta);
+  auto divisor = xla::Select(sigma_is_zero, xla::Broadcast(one, batch_dims),
+                             alpha - *beta);
+
+  auto e_k = xla::Broadcast(xla::ConvertElementType(xla::Eq(iota, k), type),
+                            std::vector<int64>(batch_dims.size(), 1));
+
+  // Form v as [0, 0, ..., 1] ++ x[k+1:] / divisor
+  // If sigma is zero, x[k+1:] is zero, so use any non-zero divisor.
+  *v = e_k +
+       xla::Div(x_after_k, divisor, /*broadcast_dimensions=*/batch_dim_ids);
+  return Status::OK();
+}
+
+// Householder QR decomposition. Algorithm 5.2.1 from Golub and Van
+// Loan "Matrix Computations", 4th Edition. This is an unblocked implementation
+// used as an inner routine of the blocked implementation.
+// Algorithm is adapted slightly so the shapes inside the loop are static, at
+// the cost of some redundant computation. Since this is used as an inner block
+// kernel, accumulates the Householder transformations (vs, taus) rather than
+// the matrix q.
+// Equivalent Python code, without batching:
+// def qr(a):
+//   m = a.shape[0]
+//   n = a.shape[1]
+//   vs = np.zeros([m, n])
+//   taus = np.zeros([n])
+//   for j in xrange(min(m, n)):
+//     v, tau, beta = house(a[:, j], j)
+//     # Unusually, we apply the Householder transformation to the entirety of
+//     # a, wasting FLOPs to maintain the static shape invariant that XLA
+//     # requires. For columns that precede j this has no effect.
+//     a[:, :] -= tau * np.dot(v[:, np.newaxis],
+//                              np.dot(v[np.newaxis, :], a[:, :]))
+//     # Form column j explicitly rather than relying on the precision of the
+//     # Householder update.
+//     a[j, j] = beta
+//     a[j+1:, j] = np.zeros([m - j - 1], dtype=a.dtype)
+//     vs[:, j] = v
+//     taus[j] = tau
+//   return (q, vs, taus)
+struct QRBlockResult {
+  // The factored R value
+  xla::XlaOp r;
+
+  // Representation of the Householder matrices I - beta v v.T
+  xla::XlaOp taus;  // Shape: [..., n]
+  xla::XlaOp vs;    // Shape: [..., m, n]
+};
+xla::StatusOr<QRBlockResult> QRBlock(xla::XlaOp a) {
+  xla::XlaBuilder* builder = a.builder();
+  TF_ASSIGN_OR_RETURN(xla::Shape a_shape, builder->GetShape(a));
+  const int num_dims = xla::ShapeUtil::Rank(a_shape);
+  if (num_dims < 2) {
+    return errors::InvalidArgument("Arguments to QR must have rank >= 2: ",
+                                   num_dims);
+  }
+  xla::PrimitiveType type = a_shape.element_type();
+
+  const int64 m = xla::ShapeUtil::GetDimension(a_shape, -2);
+  const int64 n = xla::ShapeUtil::GetDimension(a_shape, -1);
+
+  const int64 num_batch_dims = num_dims - 2;
+  std::vector<int64> batch_dims(num_batch_dims);
+  for (int i = 0; i < num_batch_dims; ++i) {
+    batch_dims[i] = xla::ShapeUtil::GetDimension(a_shape, i);
+  }
+
+  std::vector<int64> batch_dim_indices(num_batch_dims);
+  std::iota(batch_dim_indices.begin(), batch_dim_indices.end(), 0);
+
+  auto qr_body_fn =
+      [&](xla::XlaOp j, gtl::ArraySlice<xla::XlaOp> values,
+          xla::XlaBuilder* builder) -> xla::StatusOr<std::vector<xla::XlaOp>> {
+    auto a = values[0];
+    auto vs = values[1];
+    auto taus = values[2];
+
+    // v, beta = house(a[:, j], j)
+    auto x = DynamicSliceInMinorDims(a, {j}, {1});
+    xla::XlaOp v, tau, beta;
+    TF_RETURN_IF_ERROR(House(xla::Collapse(x, {num_dims - 2, num_dims - 1}), j,
+                             batch_dims, m, &v, &tau, &beta));
+
+    std::vector<int64> shape = batch_dims;
+    shape.push_back(1);
+    shape.push_back(m);
+    auto v_broadcast = xla::Reshape(v, shape);
+    // a[:, :] -= tau * np.dot(v[:, np.newaxis],
+    //                          np.dot(v[np.newaxis, :], a[:, :]))
+    auto vva = BatchDot(v_broadcast, a);
+    vva = BatchDot(v_broadcast, vva, /*transpose_x=*/true);
+    a = a - xla::Mul(tau, vva,
+                     /*broadcast_dimensions=*/batch_dim_indices);
+
+    // It is more precise to populate column 'k' explicitly, rather than
+    // computing it implicitly by applying the Householder transformation.
+    // a[k,k] = beta
+    // a[k+1:,k] = np.zeros([m-k-1], dtype=a.dtype)
+    auto iota = xla::Reshape(xla::Iota(a.builder(), xla::S32, m), {m, 1});
+    auto predecessor_mask = xla::ConvertElementType(xla::Lt(iota, j), type);
+    auto mask = xla::Broadcast(xla::ConvertElementType(xla::Eq(iota, j), type),
+                               std::vector<int64>(batch_dims.size(), 1));
+    auto new_x =
+        xla::Mul(x, predecessor_mask,
+                 /*broadcast_dimensions=*/{num_dims - 2, num_dims - 1}) +
+        xla::Mul(beta, mask, /*broadcast_dimensions=*/batch_dim_indices);
+    a = DynamicUpdateSliceInMinorDims(a, new_x, {j});
+
+    // vs[:, j] = v
+    vs = DynamicUpdateSliceInMinorDims(
+        vs, xla::Reshape(v, ConcatVectors(batch_dims, {m, 1})), {j});
+    // taus[j] = tau
+    taus = DynamicUpdateSliceInMinorDims(
+        taus, xla::Reshape(tau, ConcatVectors(batch_dims, {1})), {j});
+    return std::vector<xla::XlaOp>{a, vs, taus};
+  };
+
+  auto vs = xla::Zeros(builder, xla::ShapeUtil::MakeShape(
+                                    type, ConcatVectors(batch_dims, {m, n})));
+  auto taus = xla::Zeros(
+      builder, xla::ShapeUtil::MakeShape(type, ConcatVectors(batch_dims, {n})));
+
+  TF_ASSIGN_OR_RETURN(auto values,
+                      XlaForEachIndex(std::min(m, n), xla::S32, qr_body_fn,
+                                      {a, vs, taus}, "qr", builder));
+
+  QRBlockResult result;
+  result.r = values[0];
+  result.vs = values[1];
+  result.taus = values[2];
+  return result;
+}
+
+// Computes W and Y such that I-WY is equivalent to the sequence of Householder
+// transformations given by vs and taus.
+// Golub and van Loan, "Matrix Computations", algorithm 5.1.2.
+// Y = np.zeros([m, n])
+// W = np.zeros([m, n])
+// Y[:, 0] = vs[:, 0]
+// W[:, 0] = -taus[0] * vs[:, 0]
+// for j in xrange(1, n):
+//   v = vs[:, j]
+//   z = -taus[j] * v - taus[j] * np.dot(W, np.dot(Y.T, v))
+//   W[:, j] = z
+//   Y[:, j] = v
+// return W
+// There is no need to return Y since at termination of the loop it is equal to
+// vs.
+xla::StatusOr<xla::XlaOp> ComputeWYRepresentation(
+    xla::PrimitiveType type, gtl::ArraySlice<int64> batch_dims, xla::XlaOp vs,
+    xla::XlaOp taus, int64 m, int64 n) {
+  std::vector<int64> batch_dim_indices(batch_dims.size());
+  std::iota(batch_dim_indices.begin(), batch_dim_indices.end(), 0);
+  int64 n_index = batch_dims.size() + 1;
+
+  auto body_fn =
+      [&](xla::XlaOp j, gtl::ArraySlice<xla::XlaOp> values,
+          xla::XlaBuilder* builder) -> xla::StatusOr<std::vector<xla::XlaOp>> {
+    auto w = values[0];
+    auto y = values[1];
+    const auto vs = values[2];
+    const auto taus = values[3];
+
+    // Want j values in range [1, ... n).
+    j = j + xla::ConstantR0<int32>(builder, 1);
+    // vs has shape [..., m, 1]
+    auto v = DynamicSliceInMinorDims(vs, {j}, {1});
+    // beta has shape [..., 1]
+    auto beta = DynamicSliceInMinorDims(taus, {j}, {1});
+
+    // yv has shape [..., n, 1]
+    auto yv = BatchDot(y, v, /*transpose_x=*/true);
+    // wyv has shape [..., m, 1]
+    auto wyv = BatchDot(w, yv);
+
+    auto z = xla::Mul(
+        -beta, v + wyv,
+        /*broadcast_dimensions=*/ConcatVectors(batch_dim_indices, {n_index}));
+
+    w = DynamicUpdateSliceInMinorDims(w, z, {j});
+    y = DynamicUpdateSliceInMinorDims(y, v, {j});
+
+    return std::vector<xla::XlaOp>{w, y, vs, taus};
+  };
+
+  xla::XlaBuilder* builder = vs.builder();
+  auto w = xla::Zeros(builder, xla::ShapeUtil::MakeShape(
+                                   type, ConcatVectors(batch_dims, {m, n})));
+  auto y = w;
+  auto v = SliceInMinorDims(vs, {0}, {1});
+  auto beta = SliceInMinorDims(taus, {0}, {1});
+  y = UpdateSliceInMinorDims(y, v, {0});
+  auto bv = xla::Mul(
+      -beta, v,
+      /*broadcast_dimensions=*/ConcatVectors(batch_dim_indices, {n_index}));
+  w = UpdateSliceInMinorDims(w, bv, {0});
+
+  TF_ASSIGN_OR_RETURN(
+      auto values, XlaForEachIndex(n - 1, xla::S32, body_fn, {w, y, vs, taus},
+                                   "wy", builder));
+  return values[0];
+}
+
+}  // namespace
+
+// Block Householder QR Factorization. Algorithm 5.2.2 of Golub and van Loan.
+// def qr_blocked(a, block_size):
+//   m = a.shape[0]
+//   n = a.shape[1]
+//   q = np.eye(m)
+//   for i in xrange(0, min(m, n), block_size):
+//     k = min(block_size, min(m, n) - s)
+//     (a, vs, taus) = qr(a[i:, i:i+k])
+//     y = vs
+//     w = ComputeWYRepresentation(vs, taus, m-i, k)
+//     a[i:, i+r:] += np.dot(y, np.dot(w.T, a[i:, i+k:]))
+//     q[:, i:] += np.dot(q[:, i:], np.dot(w, y.T))
+//   return (q, a)
+// TODO(phawkins): consider using UT transformations (in the form I - V U V')
+// rather than WY transformations.
+xla::StatusOr<QRDecompositionResult> QRDecomposition(xla::XlaOp a,
+                                                     int64 block_size) {
+  xla::XlaBuilder* builder = a.builder();
+  TF_ASSIGN_OR_RETURN(xla::Shape a_shape, builder->GetShape(a));
+  const int num_dims = xla::ShapeUtil::Rank(a_shape);
+  if (num_dims < 2) {
+    return errors::InvalidArgument("Arguments to QR must have rank >= 2: ",
+                                   num_dims);
+  }
+  xla::PrimitiveType type = a_shape.element_type();
+
+  const int64 m = xla::ShapeUtil::GetDimension(a_shape, -2);
+  const int64 n = xla::ShapeUtil::GetDimension(a_shape, -1);
+  const int64 p = std::min(m, n);
+
+  if (block_size < 1) {
+    return errors::InvalidArgument(
+        "block_size argument to QR must be >= 1; got ", block_size);
+  }
+
+  const int64 num_batch_dims = num_dims - 2;
+  std::vector<int64> batch_dims(num_batch_dims);
+  for (int i = 0; i < num_batch_dims; ++i) {
+    batch_dims[i] = xla::ShapeUtil::GetDimension(a_shape, i);
+  }
+
+  auto q = xla::Broadcast(xla::IdentityMatrix(builder, type, m, m), batch_dims);
+  for (int64 i = 0; i < p; i += block_size) {
+    int64 k = std::min(block_size, p - i);
+
+    auto a_block = SliceInMinorDims(a, {i, i}, {m, i + k});
+    TF_ASSIGN_OR_RETURN(auto qr_block, QRBlock(a_block));
+
+    a = UpdateSliceInMinorDims(a, qr_block.r, {i, i});
+
+    // Compute the I-WY block representation of a product of Householder
+    // matrices.
+    TF_ASSIGN_OR_RETURN(auto w,
+                        ComputeWYRepresentation(type, batch_dims, qr_block.vs,
+                                                qr_block.taus, m - i, k));
+    auto y = qr_block.vs;
+
+    // a[i:, i+k:] += np.dot(Y, np.dot(W.T, a[i:, i+k:]))
+    auto a_panel = SliceInMinorDims(a, {i, i + k}, {m, n});
+    auto a_update = BatchDot(w, a_panel, /*transpose_x=*/true);
+    a_update = BatchDot(y, a_update);
+    a_panel = a_panel + a_update;
+    a = UpdateSliceInMinorDims(a, a_panel, {i, i + k});
+
+    // q[:, i:] += np.dot(np.dot(q[:, i:], W), Y.T))
+    auto q_panel = SliceInMinorDims(q, {0, i}, {m, m});
+    auto q_update = BatchDot(q_panel, w);
+    q_update =
+        BatchDot(q_update, y, /*transpose_x=*/false, /*transpose_y=*/true);
+    q_panel = q_panel + q_update;
+    q = UpdateSliceInMinorDims(q, q_panel, {0, i});
+  }
+  QRDecompositionResult result;
+  result.q = q;
+  result.r = a;
+  return result;
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/lib/qr.h b/tensorflow/compiler/tf2xla/lib/qr.h
new file mode 100644
index 0000000000..abd2316ac9
--- /dev/null
+++ b/tensorflow/compiler/tf2xla/lib/qr.h
@@ -0,0 +1,40 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_TF2XLA_LIB_QR_H_
+#define TENSORFLOW_COMPILER_TF2XLA_LIB_QR_H_
+
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+
+namespace tensorflow {
+
+// Computes the QR decompositions of a batch of matrices. That is,
+// given a (batched) matrix a, computes an orthonormal matrix Q and an
+// upper-triangular matrix R such that a = QR.
+// `a` must be a (batched) matrix of size [..., m, n].
+// The algorithm implements a blocked QR decomposition; `block_size` is
+// the block size to use.
+// TODO(phawkins): handle the complex case.
+struct QRDecompositionResult {
+  xla::XlaOp q;
+  xla::XlaOp r;
+};
+
+xla::StatusOr<QRDecompositionResult> QRDecomposition(xla::XlaOp a,
+                                                     int64 block_size = 128);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_COMPILER_TF2XLA_LIB_QR_H_
diff --git a/tensorflow/compiler/tf2xla/lib/random.cc b/tensorflow/compiler/tf2xla/lib/random.cc
index 8ff10fbd3f..5e7cf00ee5 100644
--- a/tensorflow/compiler/tf2xla/lib/random.cc
+++ b/tensorflow/compiler/tf2xla/lib/random.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
 #include "tensorflow/compiler/xla/client/lib/math.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/lib/random.h b/tensorflow/compiler/tf2xla/lib/random.h
index 2c573fd85b..59fc5d0433 100644
--- a/tensorflow/compiler/tf2xla/lib/random.h
+++ b/tensorflow/compiler/tf2xla/lib/random.h
@@ -16,7 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_TF2XLA_LIB_RANDOM_H_
 #define TENSORFLOW_COMPILER_TF2XLA_LIB_RANDOM_H_
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/core/framework/types.pb.h"
 
diff --git a/tensorflow/compiler/tf2xla/lib/scatter.cc b/tensorflow/compiler/tf2xla/lib/scatter.cc
index 85e3d3ab85..ba22eff73a 100644
--- a/tensorflow/compiler/tf2xla/lib/scatter.cc
+++ b/tensorflow/compiler/tf2xla/lib/scatter.cc
@@ -21,8 +21,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/lib/util.h"
 #include "tensorflow/compiler/tf2xla/lib/while_loop.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/util.h"
@@ -114,7 +114,7 @@ xla::StatusOr<xla::XlaOp> XlaScatter(
     auto buffer = loop_vars[2];
 
     auto zero_index = xla::ConstantLiteral(
-        body_builder, xla::Literal::Zero(indices_shape.element_type()));
+        body_builder, xla::LiteralUtil::Zero(indices_shape.element_type()));
 
     // Slice the i-th index from the indices array.
     xla::XlaOp index;
@@ -132,7 +132,7 @@ xla::StatusOr<xla::XlaOp> XlaScatter(
     // Discard updates with negative indices, since some users expect this.
     auto index_in_range = xla::ReduceAll(
         xla::Le(zero_index, index), xla::ConstantR0<bool>(body_builder, true),
-        xla::CreateScalarAndComputation(body_builder));
+        xla::CreateScalarAndComputation(xla::PRED, body_builder));
 
     // Make the index in bounds to prevent implementation defined behavior.
     index = xla::Max(index, zero_index);
diff --git a/tensorflow/compiler/tf2xla/lib/scatter.h b/tensorflow/compiler/tf2xla/lib/scatter.h
index 87309e10ed..13a5f1b850 100644
--- a/tensorflow/compiler/tf2xla/lib/scatter.h
+++ b/tensorflow/compiler/tf2xla/lib/scatter.h
@@ -18,8 +18,8 @@ limitations under the License.
 
 #include <functional>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/statusor.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/lib/triangular_solve.cc b/tensorflow/compiler/tf2xla/lib/triangular_solve.cc
index 4f97d1277c..04fa10108c 100644
--- a/tensorflow/compiler/tf2xla/lib/triangular_solve.cc
+++ b/tensorflow/compiler/tf2xla/lib/triangular_solve.cc
@@ -21,16 +21,316 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/lib/batch_dot.h"
 #include "tensorflow/compiler/tf2xla/lib/util.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/lib/numeric.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/math/math_util.h"
 
 namespace tensorflow {
 
+// Get the diagonal blocks of the coefficient matrix
+xla::XlaOp DiagonalBlocks(xla::XlaOp a, int64 block_size) {
+  xla::XlaBuilder* builder = a.builder();
+  return builder->ReportErrorOrReturn([&]() -> xla::StatusOr<xla::XlaOp> {
+    TF_ASSIGN_OR_RETURN(xla::Shape shape, builder->GetShape(a));
+    int ndims = xla::ShapeUtil::Rank(shape);
+    int64 n = xla::ShapeUtil::GetDimension(shape, -1);
+    int64 num_blocks = n / block_size;
+
+    xla::XlaOp diag_blocks;
+
+    // If the coefficient matrix is exactly the block size, we just add a
+    // singleton dimension i.e. [..., n, n] -> [..., 1, n, n]
+    if (n == block_size) {
+      std::vector<int64> permutation(ndims);
+      std::iota(permutation.begin(), permutation.end(), 1);
+      permutation.insert(permutation.end() - 2, 0);
+      return Transpose(Broadcast(a, /*broadcast_sizes=*/{1}), permutation);
+    }
+
+    // We can grab entire blocks using gather
+    if (n > block_size) {
+      // Construct the starting indices of the diagonal blocks
+      auto gather_indices =
+          Transpose(Broadcast(Mul(Iota(builder, xla::S32, num_blocks),
+                                  xla::ConstantR0<int32>(builder, block_size)),
+                              /*broadcast_sizes=*/{2}),
+                    /*permutation=*/{1, 0});
+
+      // Gather the diagonal blocks
+      xla::GatherDimensionNumbers dim_numbers;
+      dim_numbers.add_output_window_dims(ndims - 1);
+      dim_numbers.add_output_window_dims(ndims);
+      dim_numbers.add_gather_dims_to_operand_dims(ndims - 2);
+      dim_numbers.add_gather_dims_to_operand_dims(ndims - 1);
+      dim_numbers.set_index_vector_dim(1);
+      diag_blocks = Gather(a, gather_indices, dim_numbers,
+                           /*window_bounds=*/{block_size, block_size});
+    }
+
+    // The last block might be smaller than the block size,
+    // so we will need to pad it
+    if (n % block_size != 0) {
+      // Pad with zeros
+      auto last_blocks =
+          SliceInMinorDims(a, {n - n % block_size, n - n % block_size}, {n, n});
+      xla::PaddingConfig config = xla::MakeNoPaddingConfig(ndims);
+      int64 padding = block_size - n % block_size;
+      config.mutable_dimensions(ndims - 1)->set_edge_padding_high(padding);
+      config.mutable_dimensions(ndims - 2)->set_edge_padding_high(padding);
+      last_blocks =
+          Pad(last_blocks, Zero(builder, shape.element_type()), config);
+
+      // Add a singleton dimension
+      // i.e. [..., block_size, block_size] -> [..., 1, block_size, block_size]
+      TF_ASSIGN_OR_RETURN(xla::Shape blocks_shape,
+                          builder->GetShape(last_blocks));
+      auto shape_dims = xla::AsInt64Slice(blocks_shape.dimensions());
+      auto last_blocks_dims = std::vector<int64>(ndims);
+      std::copy(shape_dims.begin(), shape_dims.end(), last_blocks_dims.begin());
+      last_blocks_dims.insert(last_blocks_dims.end() - 2, 1);
+      last_blocks = Reshape(last_blocks, last_blocks_dims);
+
+      // Concatenate with the other blocks if necessary
+      if (n > block_size) {
+        diag_blocks =
+            xla::ConcatInDim(builder, {diag_blocks, last_blocks}, ndims - 2);
+      } else {
+        diag_blocks = last_blocks;
+      }
+    }
+
+    return diag_blocks;
+  });
+}
+
+xla::XlaOp InvertDiagonalBlocks(xla::XlaOp diag_blocks, bool lower,
+                                bool transpose_a, bool conjugate_a) {
+  xla::XlaBuilder* builder = diag_blocks.builder();
+  return builder->ReportErrorOrReturn([&]() -> xla::StatusOr<xla::XlaOp> {
+    // Input is a batch of square lower triangular square matrices. Its shape is
+    // (..., size, size). We resize this to (num_blocks, size, size).
+    TF_ASSIGN_OR_RETURN(xla::Shape shape, builder->GetShape(diag_blocks));
+    int64 block_size = xla::ShapeUtil::GetDimension(shape, -1);
+    int64 num_blocks = xla::ShapeUtil::ElementsIn(shape) /
+                       tensorflow::MathUtil::IPow(block_size, 2);
+    diag_blocks = Reshape(diag_blocks, {num_blocks, block_size, block_size});
+
+    // The input must be triangular because we rely on that when doing
+    // multiplications later on
+    diag_blocks = Triangle(diag_blocks, /*lower=*/lower);
+
+    // Rescale blocks to be unit triangular, but avoid dividing by
+    // zero (which can happen if the last block was padded) otherwise it will
+    // introduce nans which will propagate
+    auto diags = GetMatrixDiagonal(diag_blocks);
+    TF_ASSIGN_OR_RETURN(xla::Shape diags_shape, builder->GetShape(diags));
+    auto one = ScalarLike(diags, 1);
+    auto ones = Broadcast(one, xla::AsInt64Slice(diags_shape.dimensions()));
+    diags = Select(Eq(diags, Zero(builder, shape.element_type())), ones, diags);
+    auto scaled_diag_blocks = Div(diag_blocks, diags, {0, 2});
+
+    // We can now use the fact that for an upper triangular matrix
+    // [[L11, 0], [L21, L22]], given the inverses L11' and L22', we have
+    // L22' = -L22' * L21 * L11'. In our case, L21 is a vector and our blocks
+    // have been rescaled to be unit triangular, so L22 = L22' = 1.
+
+    // Initialize the output matrix with -1s on the diagonal. We use -1 instead
+    // of 1 because we cannot do matrix-vector multiplies with variable shapes
+    // inside of a loop, or do irregularly shaped in-place updates. Hence,
+    // L21 <- -L22 * L21 * L11 cannot be done naively. Instead, we update the
+    // entire row i.e. we calculate
+    // [L21 L22 0] <- -[L21 L22 0] @ diag_blocks([L11', -I, -I])
+    // which means [L21 L22 0] <- [-L21 * L11', L22, 0].
+    auto identity =
+        IdentityMatrix(builder, shape.element_type(), block_size, block_size);
+    auto neg_identity = -identity;
+
+    // The first or last  diagonal element should be set to 1 instead of -1
+    // though, since we never update it
+    auto pos_one = Reshape(One(builder, shape.element_type()), {1, 1});
+    auto start_index = (lower) ? 0 : block_size - 1;
+    auto output_block = DynamicUpdateSlice(
+        neg_identity, pos_one,
+        /*start_indices=*/xla::ConstantR1<int>(builder, 2, start_index));
+
+    // Broadcast diag([1, -1, -1, ...]) to every block
+    xla::XlaOp output = Broadcast(output_block,
+                                  /*broadcast_sizes=*/{num_blocks});
+
+    // Now we construct a loop that performs matrix-vector multiplications
+    // inverting the blocks one row at a time
+    std::vector<xla::Shape> tuple_shapes = {
+        // The loop iteration counter is a scalar, incremented each iteration.
+        xla::ShapeUtil::MakeShape(xla::S32, {}),
+        // The output has the shape of A, with one row updated each iteration.
+        xla::ShapeUtil::MakeShape(shape.element_type(),
+                                  {num_blocks, block_size, block_size}),
+        // The input is a loop invariant.
+        xla::ShapeUtil::MakeShape(shape.element_type(),
+                                  {num_blocks, block_size, block_size})};
+    xla::Shape tuple_shape = xla::ShapeUtil::MakeTupleShape(tuple_shapes);
+
+    auto init_i = One(builder, xla::S32);
+    auto init = xla::Tuple(builder, {init_i, output, scaled_diag_blocks});
+
+    // Construct the loop condition function.
+    std::unique_ptr<xla::XlaBuilder> condb =
+        builder->CreateSubBuilder("InvertDiagCond");
+    {
+      auto i = GetTupleElement(
+          Parameter(condb.get(), 0, tuple_shape, "InvertDiagCondTuple"), 0);
+      Lt(i, xla::ConstantR0<int32>(condb.get(), block_size));
+    }
+    TF_ASSIGN_OR_RETURN(auto cond, condb->Build());
+
+    // Construct the loop body function.
+    std::unique_ptr<xla::XlaBuilder> bodyb =
+        builder->CreateSubBuilder("InvertDiagBody");
+    {
+      auto input_tuple =
+          Parameter(bodyb.get(), 0, tuple_shape, "InvertDiagBodyTuple");
+
+      auto i = GetTupleElement(input_tuple, 0);
+      auto body_out = GetTupleElement(input_tuple, 1);
+      auto body_input = GetTupleElement(input_tuple, 2);
+
+      auto zero = xla::ConstantR1<int32>(bodyb.get(), 1, 0);
+      auto j = (lower) ? i : ScalarLike(i, block_size - 1) - i;
+      auto start_indices =
+          xla::ConcatInDim(bodyb.get(), {zero, Reshape(j, {1}), zero}, 0);
+      auto input_row =
+          DynamicSlice(body_input, start_indices,
+                       /*slice_sizes=*/{num_blocks, 1, block_size});
+
+      // We want -L21 L11^{-1}
+      xla::DotDimensionNumbers dnums;
+      dnums.add_lhs_batch_dimensions(0);
+      dnums.add_rhs_batch_dimensions(0);
+      dnums.add_lhs_contracting_dimensions(2);
+      dnums.add_rhs_contracting_dimensions(1);
+      auto update = -DotGeneral(input_row, body_out, dnums);
+
+      body_out = DynamicUpdateSlice(body_out, update, start_indices);
+
+      auto next_i = i + ScalarLike(i, 1);
+      xla::Tuple(bodyb.get(), {next_i, body_out, body_input});
+    }
+    TF_ASSIGN_OR_RETURN(auto body, bodyb->Build());
+
+    // Construct the While loop and return the result,
+    // return while_loop(cond_fun, body_fun, init)[1]
+    auto invert_while = While(cond, body, init);
+    auto inv_diag_blocks = GetTupleElement(invert_while, 1);
+
+    // Undo the scaling
+    inv_diag_blocks = Div(inv_diag_blocks, diags,
+                          /*broadcast_dimensions=*/{0, 1});
+
+    // Reshape back to original batch major dimensions
+    return Reshape(inv_diag_blocks, xla::AsInt64Slice(shape.dimensions()));
+  });
+}
+
+xla::XlaOp SolveWithInvertedDiagonalBlocks(xla::XlaOp a, xla::XlaOp b,
+                                           xla::XlaOp inv_diag_blocks,
+                                           bool left_side, bool lower,
+                                           bool transpose_a, bool conjugate_a) {
+  xla::XlaBuilder* builder = a.builder();
+  return builder->ReportErrorOrReturn([&]() -> xla::StatusOr<xla::XlaOp> {
+    TF_ASSIGN_OR_RETURN(xla::Shape blocks_shape,
+                        builder->GetShape(inv_diag_blocks));
+    TF_ASSIGN_OR_RETURN(xla::Shape b_shape, builder->GetShape(b));
+    int64 block_size = xla::ShapeUtil::GetDimension(blocks_shape, -1);
+
+    TF_ASSIGN_OR_RETURN(xla::Shape a_shape, builder->GetShape(a));
+    int64 ndims = xla::ShapeUtil::Rank(a_shape);
+    int64 n = xla::ShapeUtil::GetDimension(a_shape, -1);
+    int64 num_blocks = n / block_size + (n % block_size != 0);
+    int64 m_dim = (left_side) ? -1 : -2;
+    int64 m = xla::ShapeUtil::GetDimension(b_shape, m_dim);
+
+    // Initialize the solution
+    auto x = ZerosLike(b);
+
+    // This loop is unrolled for performance reasons, but it could be expressed
+    // rolled as well since the matrices are of the same size each iteration
+    for (int i = 0; i < num_blocks; i++) {
+      // High-level intuition: We have B[i] = L[i] @ X. Since L is upper
+      // triangular this means B[i] = L[i, :i + 1] @ X[:i + 1]. We can split
+      // this into two parts: B[i] = L[i, :i] @ X[:i] + L[i, i] @ X[i] which
+      // can be solved for X[i] as X[i] = inv(L[i, i]) @ B[i] - L[i, :i] @ X[:i]
+
+      // Decide whether we go from first block to last or vice versa
+      auto j = (left_side ^ lower ^ transpose_a) ? num_blocks - 1 - i : i;
+
+      // Get the size of the inverse blocks (the last one might be smaller)
+      int64 block = (n % block_size != 0 && j + 1 == num_blocks)
+                        ? n % block_size
+                        : block_size;
+      auto inv_block =
+          MaybeConjugate(Collapse(SliceInMinorDims(inv_diag_blocks, {j, 0, 0},
+                                                   {j + 1, block, block}),
+                                  /*dimensions=*/{ndims - 2, ndims - 1}),
+                         conjugate_a);
+
+      // Get the corresponding row of B
+      int64 k = std::min((j + 1) * block_size, n);
+      std::vector<int64> start = {j * block_size, 0};
+      std::vector<int64> end = {k, m};
+      if (!left_side) {
+        std::swap(start[0], start[1]);
+        std::swap(end[0], end[1]);
+      }
+      auto b_row = SliceInMinorDims(b, start, end);
+
+      xla::XlaOp remainder;
+      if (i == 0) {
+        remainder = b_row;
+      } else {
+        // This matrix multiply involves a lot of multiplying with zero (namely,
+        // X[i * block_size:] = 0), but this is faster than slicing...
+        end = {k, n};
+        if (!left_side) {
+          std::swap(end[0], end[1]);
+        }
+        if (transpose_a) {
+          std::swap(start[0], start[1]);
+          std::swap(end[0], end[1]);
+        }
+        auto a_row =
+            MaybeConjugate(SliceInMinorDims(a, start, end), conjugate_a);
+        if (left_side) {
+          remainder = b_row - BatchDot(a_row, x, transpose_a, false);
+        } else {
+          remainder = b_row - BatchDot(x, a_row, false, transpose_a);
+        }
+      }
+
+      xla::XlaOp x_update;
+      auto zero = Zero(builder, xla::S32);
+      auto start_index =
+          xla::ConstantR0WithType(builder, xla::S32, j * block_size);
+      std::vector<xla::XlaOp> update_starts = {start_index, zero};
+      if (left_side) {
+        x_update = BatchDot(inv_block, remainder, transpose_a, false);
+      } else {
+        x_update = BatchDot(remainder, inv_block, false, transpose_a);
+        std::swap(update_starts[0], update_starts[1]);
+      }
+      x = DynamicUpdateSliceInMinorDims(x, x_update, /*starts=*/update_starts);
+    }
+
+    return x;
+  });
+}
+
 xla::XlaOp TriangularSolve(xla::XlaOp a, xla::XlaOp b, bool left_side,
                            bool lower, bool transpose_a, bool conjugate_a,
                            int64 block_size) {
@@ -44,7 +344,7 @@ xla::XlaOp TriangularSolve(xla::XlaOp a, xla::XlaOp b, bool left_side,
           xla::ShapeUtil::HumanString(a_shape), " vs. ",
           xla::ShapeUtil::HumanString(b_shape));
     }
-    const int ndims = xla::ShapeUtil::Rank(a_shape);
+    const int64 ndims = xla::ShapeUtil::Rank(a_shape);
     if (ndims < 2) {
       return errors::InvalidArgument(
           "Arguments to TriangularSolve must have rank >= 2: ", ndims);
@@ -84,510 +384,18 @@ xla::XlaOp TriangularSolve(xla::XlaOp a, xla::XlaOp b, bool left_side,
           block_size);
     }
 
-    std::map<int, xla::XlaComputation> base_computations;
-    auto get_base_triangular_solve =
-        [&](int k) -> xla::StatusOr<xla::XlaComputation*> {
-      xla::XlaComputation& computation = base_computations[k];
-      if (computation.IsNull()) {
-        std::unique_ptr<xla::XlaBuilder> sub = builder->CreateSubBuilder(
-            tensorflow::strings::StrCat("trsm_base_", k));
-
-        auto a_param = xla::Parameter(
-            sub.get(), 0,
-            xla::ShapeUtil::MakeShape(b_shape.element_type(),
-                                      ConcatVectors(batch_dimensions, {k, k})),
-            "a");
-
-        std::array<int64, 2> b_lastd;
-        if (left_side) {
-          b_lastd = {k, n};
-        } else {
-          b_lastd = {m, k};
-        }
-        auto b_param = xla::Parameter(
-            sub.get(), 1,
-            xla::ShapeUtil::MakeShape(b_shape.element_type(),
-                                      ConcatVectors(batch_dimensions, b_lastd)),
-            "b");
-
-        // We use a left-looking or right-looking subroutine on the block
-        // diagonal in the lower=true cases, while falling back to a recursive
-        // call in others. The left-looking and right-looking subroutines are
-        // written with a While loop and so yields much faster compile times.
-        // Moreover, they can give higher performance on smaller (sub)problems.
-        if (left_side && lower) {
-          TriangularSolveLeftLooking(a_param, b_param, transpose_a,
-                                     conjugate_a);
-        } else if (!left_side && lower) {
-          TriangularSolveRightLooking(a_param, b_param, transpose_a,
-                                      conjugate_a);
-        } else {
-          TriangularSolve(a_param, b_param, left_side, lower, transpose_a,
-                          conjugate_a,
-                          /*block_size=*/1);
-        }
-
-        TF_ASSIGN_OR_RETURN(computation, sub->Build());
-      }
-      return &computation;
-    };
-
-    xla::XlaOp output = xla::ZerosLike(b);
-
-    // Right-looking blocked triangular solve.
-    // For an explanation of the algorithm, see the TRSM discussion in:
-    // Goto, Kazushige, and Robert Van De Geijn. "High-performance
-    // implementation of the level-3 BLAS." ACM Transactions on Mathematical
-    // Software (TOMS) 35.1 (2008): 4.
-
-    // In the code comments below, T = lambda x: np.swapaxes(x, -1, -2) if
-    // conjugate_a is False, or T = lambda x: np.conj(np.swapaxes(x, -1, -2)) if
-    // conjugate_a is True.
-
-    if (!left_side && lower == transpose_a) {
-      // for i in range(0, a.shape[-1], block_size):
-      for (int64 i = 0; i < n; i += block_size) {
-        int64 k = std::min(block_size, n - i);
-
-        // output[..., :, i:i+k] = triangular_solve(
-        //     a[..., i:i+k, i:i+k], b[..., :, i:i+k], ..., block_size=1)
-        auto a_slice = SliceInMinorDims(a, {i, i}, {i + k, i + k});
-        auto b_slice = SliceInMinorDims(b, {0, i}, {m, i + k});
-        xla::XlaOp update;
-        if (k > 1) {
-          TF_ASSIGN_OR_RETURN(xla::XlaComputation * solve,
-                              get_base_triangular_solve(k));
-          update = xla::Call(builder, *solve, {a_slice, b_slice});
-        } else {
-          auto a_slice_conj = MaybeConjugate(a_slice, conjugate_a);
-          update = b_slice / a_slice_conj;
-        }
-        output = UpdateSliceInMinorDims(output, update, {0, i});
-
-        // if i + k < a.shape[-1]:
-        //   a_slice_2 = a[..., i+k:, i:i+k] if lower else a[..., i:i+k, i+k:]
-        //   a_slice_2 = T(a_slice_2) if transpose_a else a_slice_2
-        //   b[..., :, i+k:] -= np.matmul(output[..., :, i:i+k], a_slice_2)
-        if (i + k < n) {
-          xla::XlaOp a_slice_2;
-          if (lower) {
-            a_slice_2 = SliceInMinorDims(a, {i + k, i}, {n, i + k});
-          } else {
-            a_slice_2 = SliceInMinorDims(a, {i, i + k}, {i + k, n});
-          }
-
-          auto b_update = BatchDot(update, a_slice_2,
-                                   /*transpose_x=*/false,
-                                   /*transpose_y=*/transpose_a,
-                                   /*conjugate_x=*/false,
-                                   /*conjugate_y=*/conjugate_a);
-          auto b_slice_2 = SliceInMinorDims(b, {0, i + k}, {m, n});
-          b = UpdateSliceInMinorDims(b, b_slice_2 - b_update, {0, i + k});
-        }
-      }
-
-    } else if (left_side && lower != transpose_a) {
-      // for i in range(0, a.shape[-1], block_size):
-      for (int64 i = 0; i < m; i += block_size) {
-        int64 k = std::min(block_size, m - i);
-
-        // output[..., i:i+k, :] = triangular_solve(
-        //     a[..., i:i+k, i:i+k], b[..., i:i+k, :], ..., block_size=1)
-        auto a_slice = SliceInMinorDims(a, {i, i}, {i + k, i + k});
-        auto b_slice = SliceInMinorDims(b, {i, 0}, {i + k, n});
-        xla::XlaOp update;
-        if (k > 1) {
-          TF_ASSIGN_OR_RETURN(xla::XlaComputation * solve,
-                              get_base_triangular_solve(k));
-          update = xla::Call(builder, *solve, {a_slice, b_slice});
-        } else {
-          auto a_slice_conj = MaybeConjugate(a_slice, conjugate_a);
-          update = b_slice / a_slice_conj;
-        }
-        output = UpdateSliceInMinorDims(output, update, {i, 0});
-
-        // if i + k < a.shape[-1]:
-        //   a_slice_2 = a[..., i+k:, i:i+k] if lower else a[..., i:i+k, i+k:]
-        //   a_slice_2 = T(a_slice_2) if transpose_a else a_slice_2
-        //   b[..., i+k:, :] -= np.matmul(a_slice_2, output[..., i:i+k, :])
-        if (i + k < m) {
-          xla::XlaOp a_slice_2;
-          if (lower) {
-            a_slice_2 = SliceInMinorDims(a, {i + k, i}, {m, i + k});
-          } else {
-            a_slice_2 = SliceInMinorDims(a, {i, i + k}, {i + k, m});
-          }
-
-          auto b_update = BatchDot(a_slice_2, update,
-                                   /*transpose_x=*/transpose_a,
-                                   /*transpose_y=*/false,
-                                   /*conjugate_x=*/conjugate_a,
-                                   /*conjugate_y=*/false);
-          auto b_slice_2 = SliceInMinorDims(b, {i + k, 0}, {m, n});
-          b = UpdateSliceInMinorDims(b, b_slice_2 - b_update, {i + k, 0});
-        }
-      }
-    } else if (!left_side && lower != transpose_a) {
-      // for i in reversed(range(0, a.shape[-1], block_size)):
-      const int64 last_blk_ix =
-          xla::RoundUpToNearest(n, block_size) - block_size;
-      for (int64 i = last_blk_ix; i >= 0; i -= block_size) {
-        int64 k = std::min(block_size, n - i);
-
-        // output[..., :, i:i+k] triangular_solve(
-        //     a[..., i:i+k, i:i+k], b[..., :, i:i+k], ..., block_size=1)
-        auto a_slice = SliceInMinorDims(a, {i, i}, {i + k, i + k});
-        auto b_slice = SliceInMinorDims(b, {0, i}, {m, i + k});
-        xla::XlaOp update;
-        if (k > 1) {
-          TF_ASSIGN_OR_RETURN(xla::XlaComputation * solve,
-                              get_base_triangular_solve(k));
-          update = xla::Call(builder, *solve, {a_slice, b_slice});
-        } else {
-          auto a_slice_conj = MaybeConjugate(a_slice, conjugate_a);
-          update = b_slice / a_slice_conj;
-        }
-        output = UpdateSliceInMinorDims(output, update, {0, i});
-
-        // if i - k >= 0:
-        //   a_slice_2 = a[..., i:i+k, :i] if lower else a[..., :i, i:i+k]
-        //   a_slice_2 = T(a_slice_2) if transpose_a else a_slice_2
-        //   b[..., :, :i] -= np.matmul(out[..., :, i:i+k], a_slice_2)
-        if (i - k >= 0) {
-          xla::XlaOp a_slice_2;
-          if (lower) {
-            a_slice_2 = SliceInMinorDims(a, {i, 0}, {i + k, i});
-          } else {
-            a_slice_2 = SliceInMinorDims(a, {0, i}, {i, i + k});
-          }
-
-          auto b_update = BatchDot(update, a_slice_2,
-                                   /*transpose_x=*/false,
-                                   /*transpose_y=*/transpose_a,
-                                   /*conjugate_x=*/false,
-                                   /*conjugate_y=*/conjugate_a);
-          auto b_slice_2 = SliceInMinorDims(b, {0, 0}, {m, i});
-          b = UpdateSliceInMinorDims(b, b_slice_2 - b_update, {0, 0});
-        }
-      }
-    } else {  // left_side && lower == transpose_a
-      // for i in reversed(range(0, a.shape[-1], block_size)):
-      const int64 last_blk_ix =
-          xla::RoundUpToNearest(m, block_size) - block_size;
-      for (int64 i = last_blk_ix; i >= 0; i -= block_size) {
-        int64 k = std::min(block_size, m - i);
-
-        // output[..., i:i+k, :] triangular_solve(
-        //     a[..., i:i+k, i:i+k], b[..., i:i+k, :], ..., block_size=1)
-        auto a_slice = SliceInMinorDims(a, {i, i}, {i + k, i + k});
-        auto b_slice = SliceInMinorDims(b, {i, 0}, {i + k, n});
-        xla::XlaOp update;
-        if (k > 1) {
-          TF_ASSIGN_OR_RETURN(xla::XlaComputation * solve,
-                              get_base_triangular_solve(k));
-          update = xla::Call(builder, *solve, {a_slice, b_slice});
-        } else {
-          auto a_slice_conj = MaybeConjugate(a_slice, conjugate_a);
-          update = b_slice / a_slice_conj;
-        }
-        output = UpdateSliceInMinorDims(output, update, {i, 0});
-
-        // if i - k >= 0:
-        //   a_slice_2 = a[..., i:i+k, :i] if lower else a[..., :i, i:i+k]
-        //   a_slice_2 = T(a_slice_2) if transpose_a else a_slice_2
-        //   b[..., :i, :] -= np.matmul(a_slice_2, out[..., i:i+k, :])
-        if (i - k >= 0) {
-          xla::XlaOp a_slice_2;
-          if (lower) {
-            a_slice_2 = SliceInMinorDims(a, {i, 0}, {i + k, i});
-          } else {
-            a_slice_2 = SliceInMinorDims(a, {0, i}, {i, i + k});
-          }
-
-          auto b_update = BatchDot(a_slice_2, update,
-                                   /*transpose_x=*/transpose_a,
-                                   /*transpose_y=*/false,
-                                   /*conjugate_x=*/conjugate_a,
-                                   /*conjugate_y=*/false);
-          auto b_slice_2 = SliceInMinorDims(b, {0, 0}, {i, n});
-          b = UpdateSliceInMinorDims(b, b_slice_2 - b_update, {0, 0});
-        }
-      }
-    }
-
-    return output;
-  });
-}
+    // We find the diagonal blocks of the coefficient matrix
+    auto diag_blocks = DiagonalBlocks(a, block_size);
 
-xla::XlaOp TriangularSolveLeftLooking(xla::XlaOp a, xla::XlaOp b,
-                                      bool transpose_a, bool conjugate_a) {
-  xla::XlaBuilder* builder = a.builder();
-  return builder->ReportErrorOrReturn([&]() -> xla::StatusOr<xla::XlaOp> {
-    TF_ASSIGN_OR_RETURN(xla::Shape a_shape, builder->GetShape(a));
-    TF_ASSIGN_OR_RETURN(xla::Shape b_shape, builder->GetShape(b));
-    const int64 m = xla::ShapeUtil::GetDimension(b_shape, -2);
-    const int64 n = xla::ShapeUtil::GetDimension(b_shape, -1);
-    const int64 ndims = xla::ShapeUtil::Rank(a_shape);
-
-    std::vector<int64> batch_dimensions;
-    for (int i = 0; i < ndims - 2; ++i) {
-      int64 a_size = a_shape.dimensions(i);
-      batch_dimensions.push_back(a_size);
-    }
+    // We invert these blocks in parallel using batched matrix-vector products
+    auto inv_diag_blocks =
+        InvertDiagonalBlocks(diag_blocks, lower, transpose_a, conjugate_a);
 
-    // The main computation is performed in a While loop.
+    // We now find the solution using GEMMs
+    auto x = SolveWithInvertedDiagonalBlocks(a, b, inv_diag_blocks, left_side,
+                                             lower, transpose_a, conjugate_a);
 
-    // Allocate the output and set its first or last row,
-    // output = np.zeros_like(b)
-    // if transpose_a:
-    //   output[..., m-1:, :] = b[..., m-1:, :] / a[..., m-1:, m-1:]
-    // else:
-    //   output[..., :1, :] = b[..., :1, :] / a[..., :1, :1]
-    xla::XlaOp output = xla::ZerosLike(b);
-    {
-      auto i = transpose_a ? m - 1 : 0;
-      auto a_slice = SliceInMinorDims(a, {i, i}, {i + 1, i + 1});
-      auto b_slice = SliceInMinorDims(b, {i, 0}, {i + 1, n});
-      auto a_slice_conj = MaybeConjugate(a_slice, conjugate_a);
-      auto update = b_slice / a_slice_conj;
-      output = UpdateSliceInMinorDims(output, update, {i, 0});
-    }
-
-    // Construct the initial loop carry tuple,
-    // if transpose_a:
-    //   init = (m-2, output, a, b)
-    // else:
-    //   init = (1, output, a, b)
-    std::vector<xla::Shape> tuple_shapes = {
-        // The loop iteration counter is a scalar, incremented each iteration.
-        xla::ShapeUtil::MakeShape(xla::S32, {}),
-        // The output has the shape of b, with one row updated each iteration.
-        b_shape,
-        // The coefficient matrix a is a loop invariant.
-        a_shape,
-        // The right-hand-side matrix b is a loop invariant.
-        b_shape};
-    xla::Shape tuple_shape = xla::ShapeUtil::MakeTupleShape(tuple_shapes);
-    auto init_i = xla::ConstantR0<int32>(builder, transpose_a ? m - 2 : 1);
-    auto init = xla::Tuple(builder, {init_i, output, a, b});
-
-    // Construct the loop condition function,
-    // def cond_fun(loop_carry):
-    //   i, output, a, b = loop_carry
-    //   return i >= 0 if transpose_a else i < m
-    std::unique_ptr<xla::XlaBuilder> condb =
-        builder->CreateSubBuilder("TriangularSolveLeftLookingWhileCond");
-    {
-      auto i = xla::GetTupleElement(
-          xla::Parameter(condb.get(), 0, tuple_shape,
-                         "TriangularSolveLeftLookingWhileTuple"),
-          0);
-      if (transpose_a) {
-        xla::Ge(i, xla::ConstantR0<int32>(condb.get(), 0));
-      } else {
-        xla::Lt(i, xla::ConstantR0<int32>(condb.get(), m));
-      }
-    }
-    TF_ASSIGN_OR_RETURN(auto cond, condb->Build());
-
-    // Construct the loop body function,
-    // def body_fun(loop_carry):
-    //   i, output, a, b = loop_carry
-    //   if transpose_a:
-    //     a_row = np.swapaxes(a[..., i+1:, i:i+1], -1 -2)
-    //   else:
-    //     a_row = a[..., i:i+1, :i]
-    //   result_row = b[..., i:i+1, :] - np.matmul(a_row, output[..., :, :])
-    //   output[..., i:i+1, :] = result_row / a[..., i:i+1, i:i+1]
-    //   if transpose_a:
-    //     return (i - 1, output, a, b)
-    //   else:
-    //     return (i + 1, output, a, b)
-    // We have to do some extra FLOPs propagating zeros in the matrix multiply
-    // because we can't have the size of its arguments depend on the loop
-    // counter.
-    std::unique_ptr<xla::XlaBuilder> bodyb =
-        builder->CreateSubBuilder("TriangularSolveLeftLookingWhileBody");
-    {
-      auto input_tuple = xla::Parameter(bodyb.get(), 0, tuple_shape,
-                                        "TriangularSolveLeftLookingWhileTuple");
-
-      // i, output, a, b = loop_carry
-      auto i = xla::GetTupleElement(input_tuple, 0);
-      auto body_out = xla::GetTupleElement(input_tuple, 1);
-      auto body_a = xla::GetTupleElement(input_tuple, 2);
-      auto body_b = xla::GetTupleElement(input_tuple, 3);
-      auto zero = xla::ConstantR0<int32>(bodyb.get(), 0);
-
-      // We'd like to implement this:
-      //   if transpose_a:
-      //     a_row = T(a[..., i+1:, i:i+1])
-      //     result_row = (b[..., i:i+1, :]
-      //                   - np.matmul(a_row, body_out[..., i+1:, :]))
-      //   else:
-      //     result_row = (b[..., i:i+1, :]
-      //                   - np.matmul(a[..., i:i+1, :i], body_out[..., :i, :]))
-      // But since we can't have intermediate array sizes depend on the loop
-      // counter, we instead exploit the fact that we initialized the output to
-      // all zeros and use that as zero-padding (doing unnecessary FLOPs).
-      xla::XlaOp a_row;
-      if (transpose_a) {
-        a_row = DynamicSliceInMinorDims(body_a, {zero, i}, {m, 1});
-      } else {
-        a_row = DynamicSliceInMinorDims(body_a, {i, zero}, {1, m});
-      }
-      auto b_update = BatchDot(a_row, body_out,
-                               /*transpose_x=*/transpose_a,
-                               /*transpose_y=*/false,
-                               /*conjugate_x=*/conjugate_a,
-                               /*conjugate_y=*/false);
-      auto result_row_slice =
-          DynamicSliceInMinorDims(body_b, {i, zero}, {1, n});
-      auto result_row = result_row_slice - b_update;
-
-      // body_out[..., i:i+1, :] = result_row / a[..., i:i+1, i:i+1]
-      auto a_elt = DynamicSliceInMinorDims(body_a, {i, i}, {1, 1});
-      auto a_elt_conj = MaybeConjugate(a_elt, conjugate_a);
-      auto div_result = xla::Div(result_row, a_elt_conj);
-      body_out = DynamicUpdateSliceInMinorDims(body_out, div_result, {i, zero});
-
-      // if transpose_a:
-      //   return (i - 1, body_out, a, b)
-      // else:
-      //   return (i + 1, body_out, a, b)
-      auto next_i = xla::Add(
-          i, xla::ConstantR0<int32>(bodyb.get(), transpose_a ? -1 : 1));
-      xla::Tuple(bodyb.get(), {next_i, body_out, body_a, body_b});
-    }
-    TF_ASSIGN_OR_RETURN(auto body, bodyb->Build());
-
-    // Construct the While loop and return the result,
-    // return while_loop(cond_fun, body_fun, init)[1]
-    auto triangular_solve_left_looking_while = xla::While(cond, body, init);
-    return xla::GetTupleElement(triangular_solve_left_looking_while, 1);
-  });
-}
-
-xla::XlaOp TriangularSolveRightLooking(xla::XlaOp a, xla::XlaOp b,
-                                       bool transpose_a, bool conjugate_a) {
-  xla::XlaBuilder* builder = a.builder();
-  return builder->ReportErrorOrReturn([&]() -> xla::StatusOr<xla::XlaOp> {
-    TF_ASSIGN_OR_RETURN(xla::Shape a_shape, builder->GetShape(a));
-    TF_ASSIGN_OR_RETURN(xla::Shape b_shape, builder->GetShape(b));
-    const int64 m = xla::ShapeUtil::GetDimension(b_shape, -2);
-    const int64 n = xla::ShapeUtil::GetDimension(b_shape, -1);
-    const int64 ndims = xla::ShapeUtil::Rank(a_shape);
-
-    std::vector<int64> batch_dimensions;
-    for (int i = 0; i < ndims - 2; ++i) {
-      int64 a_size = a_shape.dimensions(i);
-      batch_dimensions.push_back(a_size);
-    }
-
-    // The main computation is performed in a While loop.
-    xla::XlaOp output = xla::ZerosLike(b);
-
-    // Construct the initial loop carry tuple,
-    // if transpose_a:
-    //   init = (0, output, a, b)
-    // else:
-    //   init = (n-1, output, a, b)
-    std::vector<xla::Shape> tuple_shapes = {
-        // The loop iteration counter is a scalar, incremented each iteration.
-        xla::ShapeUtil::MakeShape(xla::S32, {}),
-        // The output has the shape of b, with one row updated each iteration.
-        b_shape,
-        // The coefficient matrix a is a loop invariant.
-        a_shape,
-        // The right-hand-side matrix b is a loop invariant.
-        b_shape};
-    xla::Shape tuple_shape = xla::ShapeUtil::MakeTupleShape(tuple_shapes);
-    auto init_i = xla::ConstantR0<int32>(builder, transpose_a ? 0 : n - 1);
-    auto init = xla::Tuple(builder, {init_i, output, a, b});
-
-    // Construct the loop condition function,
-    // def cond_fun(loop_carry):
-    //   i, output, a, b = loop_carry
-    //   return i < n if transpose_a else i >= 0
-    std::unique_ptr<xla::XlaBuilder> condb =
-        builder->CreateSubBuilder("TriangularSolveRightLookingWhileCond");
-    {
-      auto i = xla::GetTupleElement(
-          xla::Parameter(condb.get(), 0, tuple_shape,
-                         "TriangularSolveRightLookingWhileTuple"),
-          0);
-      if (transpose_a) {
-        xla::Lt(i, xla::ConstantR0<int32>(condb.get(), n));
-      } else {
-        xla::Ge(i, xla::ConstantR0<int32>(condb.get(), 0));
-      }
-    }
-    TF_ASSIGN_OR_RETURN(auto cond, condb->Build());
-
-    // Construct the loop body function,
-    // def body_fun(loop_carry):
-    //   i, output, a, b = loop_carry
-    //   if transpose_a:
-    //     a_row = np.swapaxes(a[..., :, i:i+1], -1 -2)
-    //   else:
-    //     a_row = a[..., :, i:i+1]
-    //   result_row = b[..., :, i:i+1] - np.matmul(output, a_row)
-    //   output[..., :, i:i+1] = result_row / a[..., i:i+1, i:i+1]
-    //   if transpose_a:
-    //     return (i - 1, output, a, b)
-    //   else:
-    //     return (i + 1, output, a, b)
-    // We have to do some extra FLOPs propagating zeros in the matrix multiply
-    // because we can't have the size of its arguments depend on the loop
-    // counter.
-    std::unique_ptr<xla::XlaBuilder> bodyb =
-        builder->CreateSubBuilder("TriangularSolveRightLookingWhileBody");
-    {
-      auto input_tuple = xla::Parameter(
-          bodyb.get(), 0, tuple_shape, "TriangularSolveRightLookingWhileTuple");
-
-      // i, output, a, b = loop_carry
-      auto i = xla::GetTupleElement(input_tuple, 0);
-      auto body_out = xla::GetTupleElement(input_tuple, 1);
-      auto body_a = xla::GetTupleElement(input_tuple, 2);
-      auto body_b = xla::GetTupleElement(input_tuple, 3);
-      auto zero = xla::ConstantR0<int32>(bodyb.get(), 0);
-
-      // We'd like to implement b[..., :, i:i+1] - np.matmul(output, a[..., :,
-      // i:i+1]) But since we can't have intermediate array sizes depend on the
-      // loop counter, we instead exploit the fact that we initialized the
-      // output to all zeros and use that as zero-padding (doing unnecessary
-      // FLOPs).
-      auto b_update = BatchDot(body_out, body_a,
-                               /*transpose_x=*/false,
-                               /*transpose_y=*/transpose_a,
-                               /*conjugate_x=*/false,
-                               /*conjugate_y=*/conjugate_a);
-      // result = b - np.matmul(output, a)
-      auto result = body_b - b_update;
-      // result_row = result[..., :, i:i+1]
-      auto result_row = DynamicSliceInMinorDims(result, {zero, i}, {m, 1});
-
-      // body_out[..., :, i:i+1] = result_row / a[..., i:i+1, i:i+1]
-      auto a_ii = DynamicSliceInMinorDims(body_a, {i, i}, {1, 1});
-      auto a_ii_conj = MaybeConjugate(a_ii, conjugate_a);
-      auto div_result = xla::Div(result_row, a_ii_conj);
-      body_out = DynamicUpdateSliceInMinorDims(body_out, div_result, {zero, i});
-
-      // if transpose_a:
-      //   return (i + 1, body_out, a, b)
-      // else:
-      //   return (i - 1, body_out, a, b)
-      auto next_i = xla::Add(
-          i, xla::ConstantR0<int32>(bodyb.get(), transpose_a ? 1 : -1));
-      xla::Tuple(bodyb.get(), {next_i, body_out, body_a, body_b});
-    }
-    TF_ASSIGN_OR_RETURN(auto body, bodyb->Build());
-
-    // Construct the While loop and return the result,
-    // return while_loop(cond_fun, body_fun, init)[1]
-    auto triangular_solve_left_looking_while = xla::While(cond, body, init);
-    return xla::GetTupleElement(triangular_solve_left_looking_while, 1);
+    return x;
   });
 }
 
diff --git a/tensorflow/compiler/tf2xla/lib/triangular_solve.h b/tensorflow/compiler/tf2xla/lib/triangular_solve.h
index 80c2bc4c9c..555760b7ef 100644
--- a/tensorflow/compiler/tf2xla/lib/triangular_solve.h
+++ b/tensorflow/compiler/tf2xla/lib/triangular_solve.h
@@ -16,8 +16,8 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_TF2XLA_LIB_TRIANGULAR_SOLVE_H_
 #define TENSORFLOW_COMPILER_TF2XLA_LIB_TRIANGULAR_SOLVE_H_
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 
 namespace tensorflow {
 
@@ -59,13 +59,7 @@ namespace tensorflow {
 // blocking is used.
 xla::XlaOp TriangularSolve(xla::XlaOp a, xla::XlaOp b, bool left_side,
                            bool lower, bool transpose_a, bool conjugate_a,
-                           int64 block_size = 256);
-
-xla::XlaOp TriangularSolveLeftLooking(xla::XlaOp a, xla::XlaOp b,
-                                      bool transpose_a, bool conjugate_a);
-
-xla::XlaOp TriangularSolveRightLooking(xla::XlaOp a, xla::XlaOp b,
-                                       bool transpose_a, bool conjugate_a);
+                           int64 block_size = 128);
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/compiler/tf2xla/lib/triangular_solve_test.cc b/tensorflow/compiler/tf2xla/lib/triangular_solve_test.cc
index d5ffc1498e..aeebf16028 100644
--- a/tensorflow/compiler/tf2xla/lib/triangular_solve_test.cc
+++ b/tensorflow/compiler/tf2xla/lib/triangular_solve_test.cc
@@ -20,8 +20,8 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/array2d.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@@ -207,6 +207,28 @@ XLA_TEST_F(TriangularSolveTest, SimpleLeftLowerNotranspose) {
                              xla::ErrorSpec(1e-2, 1e-2));
 }
 
+XLA_TEST_F(TriangularSolveTest, SimpleLeftLowerNotransposeIrregularblock) {
+  xla::XlaBuilder builder(TestName());
+
+  xla::XlaOp a, b;
+  auto a_data = CreateR2Parameter<float>(AValsLower(), 0, "a", &builder, &a);
+  auto b_data = CreateR2Parameter<float>(BValsLeft(), 1, "b", &builder, &b);
+  TriangularSolve(a, b,
+                  /*left_side=*/true, /*lower=*/true,
+                  /*transpose_a=*/false, /*conjugate_a=*/false,
+                  /*block_size=*/3);
+
+  xla::Array2D<float> expected({
+      {0.5, 1.0, 1.5},
+      {0.41666667, 0.33333333, 0.25},
+      {0.23148148, 0.18518519, 0.13888889},
+      {0.16835017, 0.13468013, 0.1010101},
+  });
+
+  ComputeAndCompareR2<float>(&builder, expected, {a_data.get(), b_data.get()},
+                             xla::ErrorSpec(1e-2, 1e-2));
+}
+
 XLA_TEST_F(TriangularSolveTest, SimpleLeftUpperTranspose) {
   xla::XlaBuilder builder(TestName());
 
@@ -307,47 +329,5 @@ XLA_TEST_F(TriangularSolveTest, SimpleLeftUpperTransposeNoconjugate) {
                                  xla::ErrorSpec(1e-2, 1e-2));
 }
 
-XLA_TEST_F(TriangularSolveLeftLookingTest, Simple) {
-  xla::XlaBuilder builder(TestName());
-
-  xla::XlaOp a, b;
-  auto a_data = CreateR2Parameter<float>(AValsLower(), 0, "a", &builder, &a);
-  auto b_data = CreateR2Parameter<float>(BValsLeft(), 1, "b", &builder, &b);
-  TriangularSolveLeftLooking(a, b,
-                             /*transpose_a=*/false,
-                             /*conjugate_a=*/false);
-
-  xla::Array2D<float> expected({
-      {0.5, 1.0, 1.5},
-      {0.41666667, 0.33333333, 0.25},
-      {0.23148148, 0.18518519, 0.13888889},
-      {0.16835017, 0.13468013, 0.1010101},
-  });
-
-  ComputeAndCompareR2<float>(&builder, expected, {a_data.get(), b_data.get()},
-                             xla::ErrorSpec(1e-2, 1e-2));
-}
-
-XLA_TEST_F(TriangularSolveLeftLookingTest, NonzeroUpperTriangle) {
-  xla::XlaBuilder builder(TestName());
-
-  xla::XlaOp a, b;
-  auto a_data = CreateR2Parameter<float>(AValsFull(), 0, "a", &builder, &a);
-  auto b_data = CreateR2Parameter<float>(BValsLeft(), 1, "b", &builder, &b);
-  TriangularSolveLeftLooking(a, b,
-                             /*transpose_a=*/false,
-                             /*conjugate_a=*/false);
-
-  xla::Array2D<float> expected({
-      {0.5, 1.0, 1.5},
-      {0.41666667, 0.33333333, 0.25},
-      {0.23148148, 0.18518519, 0.13888889},
-      {0.16835017, 0.13468013, 0.1010101},
-  });
-
-  ComputeAndCompareR2<float>(&builder, expected, {a_data.get(), b_data.get()},
-                             xla::ErrorSpec(1e-2, 1e-2));
-}
-
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/compiler/tf2xla/lib/util.cc b/tensorflow/compiler/tf2xla/lib/util.cc
index fdc8bfca49..8b5beba383 100644
--- a/tensorflow/compiler/tf2xla/lib/util.cc
+++ b/tensorflow/compiler/tf2xla/lib/util.cc
@@ -18,7 +18,8 @@ limitations under the License.
 #include <memory>
 #include <vector>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -28,6 +29,13 @@ limitations under the License.
 
 namespace tensorflow {
 
+xla::XlaOp Zeros(xla::XlaBuilder* builder, const xla::Shape& shape) {
+  return xla::Broadcast(
+      xla::ConstantLiteral(builder,
+                           xla::LiteralUtil::Zero(shape.element_type())),
+      xla::AsInt64Slice(shape.dimensions()));
+}
+
 xla::XlaOp FloatLiteral(xla::XlaBuilder* builder, xla::PrimitiveType type,
                         double value) {
   switch (type) {
@@ -56,31 +64,31 @@ xla::XlaOp IntegerLiteral(xla::XlaBuilder* builder, xla::PrimitiveType type,
   xla::Literal literal;
   switch (type) {
     case xla::U8:
-      literal = std::move(*xla::Literal::CreateR0<uint8>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<uint8>(value));
       break;
     case xla::U32:
-      literal = std::move(*xla::Literal::CreateR0<uint32>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<uint32>(value));
       break;
     case xla::U64:
-      literal = std::move(*xla::Literal::CreateR0<uint64>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<uint64>(value));
       break;
     case xla::S8:
-      literal = std::move(*xla::Literal::CreateR0<int8>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<int8>(value));
       break;
     case xla::S32:
-      literal = std::move(*xla::Literal::CreateR0<int32>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<int32>(value));
       break;
     case xla::S64:
-      literal = std::move(*xla::Literal::CreateR0<int64>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<int64>(value));
       break;
     case xla::F32:
-      literal = std::move(*xla::Literal::CreateR0<float>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<float>(value));
       break;
     case xla::F64:
-      literal = std::move(*xla::Literal::CreateR0<double>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<double>(value));
       break;
     case xla::C64:
-      literal = std::move(*xla::Literal::CreateR0<complex64>(value));
+      literal = std::move(*xla::LiteralUtil::CreateR0<complex64>(value));
       break;
     case xla::PRED:
       LOG(FATAL) << "pred element type is not integral";
@@ -89,11 +97,11 @@ xla::XlaOp IntegerLiteral(xla::XlaBuilder* builder, xla::PrimitiveType type,
       LOG(FATAL) << "u16/s16 literals not yet implemented";
     case xla::BF16:
       literal = std::move(
-          *xla::Literal::CreateR0<bfloat16>(static_cast<bfloat16>(value)));
+          *xla::LiteralUtil::CreateR0<bfloat16>(static_cast<bfloat16>(value)));
       break;
     case xla::F16:
-      literal = std::move(
-          *xla::Literal::CreateR0<xla::half>(static_cast<xla::half>(value)));
+      literal = std::move(*xla::LiteralUtil::CreateR0<xla::half>(
+          static_cast<xla::half>(value)));
       break;
     case xla::TUPLE:
       LOG(FATAL) << "tuple element type is not integral";
diff --git a/tensorflow/compiler/tf2xla/lib/util.h b/tensorflow/compiler/tf2xla/lib/util.h
index 6cb6c088e9..b4905c9528 100644
--- a/tensorflow/compiler/tf2xla/lib/util.h
+++ b/tensorflow/compiler/tf2xla/lib/util.h
@@ -16,8 +16,8 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_TF2XLA_LIB_UTIL_H_
 #define TENSORFLOW_COMPILER_TF2XLA_LIB_UTIL_H_
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 
diff --git a/tensorflow/compiler/tf2xla/lib/util_test.cc b/tensorflow/compiler/tf2xla/lib/util_test.cc
index 7d0f2222a9..442fe92c34 100644
--- a/tensorflow/compiler/tf2xla/lib/util_test.cc
+++ b/tensorflow/compiler/tf2xla/lib/util_test.cc
@@ -21,7 +21,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/lib/batch_dot.h"
 #include "tensorflow/compiler/xla/array2d.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
diff --git a/tensorflow/compiler/tf2xla/lib/while_loop.cc b/tensorflow/compiler/tf2xla/lib/while_loop.cc
index 7cc88f34d2..d64394f140 100644
--- a/tensorflow/compiler/tf2xla/lib/while_loop.cc
+++ b/tensorflow/compiler/tf2xla/lib/while_loop.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/lib/while_loop.h"
 #include "tensorflow/compiler/tf2xla/lib/util.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 
@@ -100,8 +100,9 @@ xla::StatusOr<std::vector<xla::XlaOp>> XlaForEachIndex(
     std::vector<xla::XlaOp> updated_values;
     updated_values.reserve(values.size());
     updated_values.push_back(xla::Add(
-        iteration, xla::ConstantLiteral(
-                       body_builder, xla::Literal::One(num_iterations_type))));
+        iteration,
+        xla::ConstantLiteral(body_builder,
+                             xla::LiteralUtil::One(num_iterations_type))));
 
     values.remove_prefix(1);
     TF_ASSIGN_OR_RETURN(std::vector<xla::XlaOp> body_outputs,
@@ -113,8 +114,8 @@ xla::StatusOr<std::vector<xla::XlaOp>> XlaForEachIndex(
 
   std::vector<xla::XlaOp> values;
   values.reserve(initial_values.size() + 1);
-  values.push_back(
-      xla::ConstantLiteral(builder, xla::Literal::Zero(num_iterations_type)));
+  values.push_back(xla::ConstantLiteral(
+      builder, xla::LiteralUtil::Zero(num_iterations_type)));
   values.insert(values.end(), initial_values.begin(), initial_values.end());
 
   TF_ASSIGN_OR_RETURN(values, XlaWhileLoop(while_cond_fn, while_body_fn, values,
diff --git a/tensorflow/compiler/tf2xla/lib/while_loop.h b/tensorflow/compiler/tf2xla/lib/while_loop.h
index 5b6684c995..9493b1f109 100644
--- a/tensorflow/compiler/tf2xla/lib/while_loop.h
+++ b/tensorflow/compiler/tf2xla/lib/while_loop.h
@@ -19,8 +19,8 @@ limitations under the License.
 #include <functional>
 #include <vector>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
diff --git a/tensorflow/compiler/tf2xla/literal_util.cc b/tensorflow/compiler/tf2xla/literal_util.cc
index b43405a1a4..2fb66913ad 100644
--- a/tensorflow/compiler/tf2xla/literal_util.cc
+++ b/tensorflow/compiler/tf2xla/literal_util.cc
@@ -17,7 +17,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/type_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/common_runtime/dma_helper.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/literal_util.h b/tensorflow/compiler/tf2xla/literal_util.h
index ab7e861f33..0610a57029 100644
--- a/tensorflow/compiler/tf2xla/literal_util.h
+++ b/tensorflow/compiler/tf2xla/literal_util.h
@@ -18,7 +18,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_TF2XLA_LITERAL_UTIL_H_
 #define TENSORFLOW_COMPILER_TF2XLA_LITERAL_UTIL_H_
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/lib/core/status.h"
diff --git a/tensorflow/compiler/tf2xla/literal_util_test.cc b/tensorflow/compiler/tf2xla/literal_util_test.cc
index f3d6787daa..a3404c2b3d 100644
--- a/tensorflow/compiler/tf2xla/literal_util_test.cc
+++ b/tensorflow/compiler/tf2xla/literal_util_test.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/literal_util.h"
 
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/core/framework/numeric_types.h"
 #include "tensorflow/core/framework/tensor_testutil.h"
@@ -27,7 +28,7 @@ TEST(LiteralUtil, LiteralToHostTensor) {
   {
     std::vector<int64> int64_values = {1, 2, 3};
     std::unique_ptr<xla::Literal> int64_values_literal =
-        xla::Literal::CreateR1(gtl::ArraySlice<int64>(int64_values));
+        xla::LiteralUtil::CreateR1(gtl::ArraySlice<int64>(int64_values));
     Tensor host_tensor;
     EXPECT_EQ("Cannot convert literal of type S64 to tensor of type int32",
               LiteralToHostTensor(*int64_values_literal, DT_INT32, &host_tensor)
@@ -48,7 +49,7 @@ TEST(LiteralUtil, LiteralToHostTensor) {
     Tensor host_tensor;
     std::vector<int32> int32_values = {10, 11};
     std::unique_ptr<xla::Literal> int32_values_literal =
-        xla::Literal::CreateR1(gtl::ArraySlice<int32>(int32_values));
+        xla::LiteralUtil::CreateR1(gtl::ArraySlice<int32>(int32_values));
     EXPECT_TRUE(
         LiteralToHostTensor(*int32_values_literal, DT_INT32, &host_tensor)
             .ok());
diff --git a/tensorflow/compiler/tf2xla/tf2xla.cc b/tensorflow/compiler/tf2xla/tf2xla.cc
index ac768b206e..48568c825b 100644
--- a/tensorflow/compiler/tf2xla/tf2xla.cc
+++ b/tensorflow/compiler/tf2xla/tf2xla.cc
@@ -27,6 +27,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/tf2xla_util.h"
 #include "tensorflow/compiler/tf2xla/xla_compiler.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/graph.pb.h"
diff --git a/tensorflow/compiler/tf2xla/tf2xla.h b/tensorflow/compiler/tf2xla/tf2xla.h
index d02fc56c5b..432a12a516 100644
--- a/tensorflow/compiler/tf2xla/tf2xla.h
+++ b/tensorflow/compiler/tf2xla/tf2xla.h
@@ -18,7 +18,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/tf2xla.pb.h"
 #include "tensorflow/compiler/xla/client/client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/core/framework/graph.pb.h"
 
 namespace tensorflow {
diff --git a/tensorflow/compiler/tf2xla/tf2xla_test.cc b/tensorflow/compiler/tf2xla/tf2xla_test.cc
index 84c133ffab..56f7045a98 100644
--- a/tensorflow/compiler/tf2xla/tf2xla_test.cc
+++ b/tensorflow/compiler/tf2xla/tf2xla_test.cc
@@ -18,6 +18,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/tf2xla.pb.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/core/framework/attr_value.pb.h"
@@ -73,8 +75,8 @@ TEST(ConvertGraphDefToXla, Sum) {
   TF_EXPECT_OK(ConvertGraphDefToXla(graph_def, config, client, &computation));
 
   // Set up arguments.
-  auto x_literal = xla::Literal::CreateR0<int32>(10);
-  auto y_literal = xla::Literal::CreateR0<int32>(32);
+  auto x_literal = xla::LiteralUtil::CreateR0<int32>(10);
+  auto y_literal = xla::LiteralUtil::CreateR0<int32>(32);
   auto x_global_or = client->TransferToServer(*x_literal);
   auto y_global_or = client->TransferToServer(*y_literal);
   TF_EXPECT_OK(x_global_or.status());
diff --git a/tensorflow/compiler/tf2xla/xla_compilation_device.cc b/tensorflow/compiler/tf2xla/xla_compilation_device.cc
index fe7ec633ec..e89f473328 100644
--- a/tensorflow/compiler/tf2xla/xla_compilation_device.cc
+++ b/tensorflow/compiler/tf2xla/xla_compilation_device.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/sharding_util.h"
 #include "tensorflow/compiler/tf2xla/xla_context.h"
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/common_runtime/local_device.h"
 #include "tensorflow/core/framework/device_base.h"
 #include "tensorflow/core/platform/mem.h"
diff --git a/tensorflow/compiler/tf2xla/xla_compilation_device.h b/tensorflow/compiler/tf2xla/xla_compilation_device.h
index d0b9e34e16..a6e7882533 100644
--- a/tensorflow/compiler/tf2xla/xla_compilation_device.h
+++ b/tensorflow/compiler/tf2xla/xla_compilation_device.h
@@ -19,7 +19,7 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/compiler/tf2xla/xla_resource.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/common_runtime/local_device.h"
 #include "tensorflow/core/framework/device_base.h"
diff --git a/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.cc b/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.cc
index 672e19bd93..ed5aa08c6f 100644
--- a/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.cc
+++ b/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.cc
@@ -26,6 +26,8 @@ XlaCompiledCpuFunction::XlaCompiledCpuFunction(const StaticData& static_data,
       result_index_(static_data.result_index),
       args_(new void*[static_data.num_args]),
       temps_(new void*[static_data.num_temps]),
+      arg_index_to_temp_index_(new int32[static_data.num_args]),
+      num_args_(static_data.num_args),
       arg_names_(static_data.arg_names),
       result_names_(static_data.result_names),
       program_shape_(static_data.program_shape),
@@ -40,6 +42,13 @@ XlaCompiledCpuFunction::XlaCompiledCpuFunction(const StaticData& static_data,
       static_data.temp_sizes, static_data.num_temps, temps_,
       /*annotate_initialized=*/true);
 
+  for (int i = 0; i < static_data.num_temps; i++) {
+    if (static_data.temp_sizes[i] < -1) {
+      int32 param_number = -(static_data.temp_sizes[i] + 2);
+      arg_index_to_temp_index_[param_number] = i;
+    }
+  }
+
   // If Hlo profiling is enabled the generated code expects an appropriately
   // sized buffer to be passed in as the last argument.  If Hlo profiling is
   // disabled the last function argument is still present in the function
@@ -50,11 +59,24 @@ XlaCompiledCpuFunction::XlaCompiledCpuFunction(const StaticData& static_data,
   }
 }
 
+bool XlaCompiledCpuFunction::Run() {
+  // Propagate pointers to the argument buffers into the temps array.  Code
+  // generated by XLA discovers the incoming argument pointers from the temps
+  // array.
+  for (int32 i = 0; i < num_args_; i++) {
+    temps_[arg_index_to_temp_index_[i]] = args_[i];
+  }
+  raw_function_(temps_[result_index_], &run_options_, nullptr, temps_,
+                profile_counters_);
+  return true;
+}
+
 XlaCompiledCpuFunction::~XlaCompiledCpuFunction() {
   tensorflow::tfcompile::runtime::FreeContiguous(alloc_args_);
   tensorflow::tfcompile::runtime::FreeContiguous(alloc_temps_);
   delete[] args_;
   delete[] temps_;
+  delete[] arg_index_to_temp_index_;
   delete[] profile_counters_;
 }
 
diff --git a/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h b/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h
index 48a8c083ca..27cfb354bf 100644
--- a/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h
+++ b/tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h
@@ -60,9 +60,19 @@ class XlaCompiledCpuFunction {
     // The raw function to call.
     RawFunction raw_function;
 
-    // Cardinality and sizes of arg and temp buffers.
+    // Cardinality and size of arg buffers.
     const intptr_t* arg_sizes = nullptr;
     size_t num_args = 0;
+
+    // Cardinality and size of temp buffers.
+    //
+    // If temp_sizes[i] >= 0 then the i'th temp is a regular temporary buffer.
+    //
+    // If temp_sizes[i] == -1 then the i'th temp is a constant buffer.  The
+    // corresponding entry in the temp buffer array needs to be set to null.
+    //
+    // If temp_sizes[i] < -1 then the i'th temp is the entry parameter
+    // -(temp_sizes[i] + 2).
     const intptr_t* temp_sizes = nullptr;
     size_t num_temps = 0;
 
@@ -113,11 +123,7 @@ class XlaCompiledCpuFunction {
 
   // Runs the computation, with inputs read from arg buffers, and outputs
   // written to result buffers. Returns true on success and false on failure.
-  bool Run() {
-    raw_function_(temps_[result_index_], &run_options_,
-                  const_cast<const void**>(args_), temps_, profile_counters_);
-    return true;
-  }
+  bool Run();
 
   // Returns the error message from the previous failed Run call.
   //
@@ -224,6 +230,17 @@ class XlaCompiledCpuFunction {
   void** args_ = nullptr;
   void** temps_ = nullptr;
 
+  // Argument i needs to be placed in temps_[arg_index_to_temp_index_[i]] for
+  // XLA generated code to be able to find it.
+  //
+  // For now we need to keep around the args_ array because there is code that
+  // depends on args() returning a void**.  However, in the future we may remove
+  // args_ in favor of using temps_ as the sole storage for the arguments.
+  int32* arg_index_to_temp_index_;
+
+  // The number of incoming arguments.
+  int32 num_args_;
+
   // Backing memory for individual arg and temp buffers.
   void* alloc_args_ = nullptr;
   void* alloc_temps_ = nullptr;
diff --git a/tensorflow/compiler/tf2xla/xla_compiler.cc b/tensorflow/compiler/tf2xla/xla_compiler.cc
index 319cbc74e9..226c89bcf1 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler.cc
+++ b/tensorflow/compiler/tf2xla/xla_compiler.cc
@@ -28,12 +28,14 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_compilation_device.h"
 #include "tensorflow/compiler/tf2xla/xla_context.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/executor.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/common_runtime/graph_optimizer.h"
 #include "tensorflow/core/framework/attr_value_util.h"
+#include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/graph/algorithm.h"
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/graph/node_builder.h"
@@ -422,16 +424,18 @@ Status BuildComputation(
       // assignment will be placed on this value, which will cause the resource
       // update to be returned from the same device that provided the resource.
       handle = xla::GetTupleElement(xla::Tuple(builder, {handle}), 0);
-
       elems.push_back(handle);
     }
   }
 
   *num_computation_outputs = elems.size();
 
-  // Builds the XLA computation.
-  if (always_return_tuple || elems.size() != 1) {
-    xla::Tuple(builder, elems);
+  // Builds the XLA computation. We *always* form a tuple here to ensure that
+  // the output value is the last thing added into the XLA computation, even
+  // if there is only one output value.
+  auto tuple = xla::Tuple(builder, elems);
+  if (!always_return_tuple && elems.size() == 1) {
+    xla::GetTupleElement(tuple, 0);
   }
   builder->ClearOpMetadata();
 
@@ -686,12 +690,12 @@ Status ValidateFunctionDef(const FunctionDef* fdef,
 Status ValidateGraph(const Graph* graph,
                      const FunctionLibraryDefinition& flib_def,
                      const DeviceType& device_type, const string& name) {
-  auto maybe_error = [&](const string& op, const Status& s) -> Status {
+  auto maybe_error = [&](const Node* node, const Status& s) -> Status {
     if (!s.ok()) {
       return errors::InvalidArgument(strings::StrCat(
           "Detected unsupported operations when trying to compile graph ", name,
-          " on ", device_type.type_string(), ": ", op, " (", s.error_message(),
-          ")"));
+          " on ", device_type.type_string(), ": ", node->def().op(), " (",
+          s.error_message(), ")", FormatNodeForError(*node)));
     }
     return Status::OK();
   };
@@ -704,15 +708,15 @@ Status ValidateGraph(const Graph* graph,
     Status s;
     if (fdef) {
       s = ValidateFunctionDef(fdef, flib_def);
-      TF_RETURN_IF_ERROR(maybe_error(node->def().op(), s));
+      TF_RETURN_IF_ERROR(maybe_error(node, s));
       continue;
     }
     const OpDef* op_def;
     s = OpRegistry::Global()->LookUpOpDef(node->def().op(), &op_def);
-    TF_RETURN_IF_ERROR(maybe_error(node->def().op(), s));
+    TF_RETURN_IF_ERROR(maybe_error(node, s));
     TF_RETURN_IF_ERROR(ValidateNodeDef(node->def(), *op_def));
     s = FindKernelDef(device_type, node->def(), nullptr, nullptr);
-    TF_RETURN_IF_ERROR(maybe_error(node->def().op(), s));
+    TF_RETURN_IF_ERROR(maybe_error(node, s));
   }
   return Status::OK();
 }
diff --git a/tensorflow/compiler/tf2xla/xla_compiler.h b/tensorflow/compiler/tf2xla/xla_compiler.h
index 079c99797e..25332c8d8e 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler.h
+++ b/tensorflow/compiler/tf2xla/xla_compiler.h
@@ -20,6 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_compilation_device.h"
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
@@ -251,6 +252,12 @@ class XlaCompiler {
     // The default empty value is invalid.
     DeviceType device_type = DeviceType("");
 
+    // The device to use during compilation to execute instructions on, for
+    // example for auto-tuning.
+    // Valid values are defined by `xla::Backend::devices_ordinal_supported()`.
+    // -1 indicates the default device should be used.
+    int device_ordinal = -1;
+
     xla::Client* client = nullptr;
 
     // Function library in which to find function definitions. Must be non-null.
diff --git a/tensorflow/compiler/tf2xla/xla_compiler_test.cc b/tensorflow/compiler/tf2xla/xla_compiler_test.cc
index 07af8ef54b..be00ed8813 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler_test.cc
+++ b/tensorflow/compiler/tf2xla/xla_compiler_test.cc
@@ -23,7 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_op_registry.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -206,9 +206,9 @@ TEST_F(XlaCompilerTest, Simple) {
 
   // Tests that the generated computation works.
   std::unique_ptr<xla::Literal> param0_literal =
-      xla::Literal::CreateR1<int32>({7, 42});
+      xla::LiteralUtil::CreateR1<int32>({7, 42});
   std::unique_ptr<xla::Literal> param1_literal =
-      xla::Literal::CreateR1<int32>({-3, 101});
+      xla::LiteralUtil::CreateR1<int32>({-3, 101});
   std::unique_ptr<xla::GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<xla::GlobalData> param1_data =
@@ -222,12 +222,64 @@ TEST_F(XlaCompilerTest, Simple) {
       client_->Transfer(*actual).ConsumeValueOrDie();
 
   std::unique_ptr<xla::Literal> expected0 =
-      xla::Literal::CreateR1<int32>({4, 143});
+      xla::LiteralUtil::CreateR1<int32>({4, 143});
   std::unique_ptr<xla::Literal> expected_literal =
-      xla::Literal::MakeTuple({expected0.get()});
+      xla::LiteralUtil::MakeTuple({expected0.get()});
   EXPECT_TRUE(xla::LiteralTestUtil::Equal(*expected_literal, *actual_literal));
 }
 
+// Tests compilation of a graph where the _Retval node is not necessarily last
+// amongst the graph nodes in construction order, and always_return_tuple is
+// false. Regression test for bug where the wrong value was returned.
+TEST_F(XlaCompilerTest, OutOfOrderGraph) {
+  Scope scope = Scope::NewRootScope().ExitOnError();
+  auto a = ops::_Arg(scope.WithOpName("A"), DT_INT32, 0);
+  auto b = ops::_Arg(scope.WithOpName("B"), DT_INT32, 1);
+  // The _Retval node is not last in construction order.
+  auto d = ops::_Retval(scope.WithOpName("D"), a, 0);
+  auto c = ops::Add(scope.WithOpName("C"), a, b);
+
+  std::unique_ptr<Graph> graph(new Graph(OpRegistry::Global()));
+  TF_ASSERT_OK(scope.ToGraph(graph.get()));
+
+  // Builds a description of the arguments.
+  std::vector<XlaCompiler::Argument> args(2);
+  args[0].kind = XlaCompiler::Argument::kParameter;
+  args[0].type = DT_INT32;
+  args[0].shape = TensorShape({2});
+  args[1].kind = XlaCompiler::Argument::kParameter;
+  args[1].type = DT_INT32;
+  args[1].shape = TensorShape({2});
+
+  // Compiles the graph.
+  XlaCompiler compiler(DefaultOptions());
+
+  XlaCompiler::CompileOptions compile_options;
+  compile_options.always_return_tuple = false;
+  XlaCompiler::CompilationResult result;
+  TF_ASSERT_OK(compiler.CompileGraph(compile_options, "add", std::move(graph),
+                                     args, &result));
+
+  // Tests that the generated computation works.
+  std::unique_ptr<xla::Literal> param0_literal =
+      xla::LiteralUtil::CreateR1<int32>({7, 42});
+  std::unique_ptr<xla::Literal> param1_literal =
+      xla::LiteralUtil::CreateR1<int32>({-3, 101});
+  std::unique_ptr<xla::GlobalData> param0_data =
+      client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
+  std::unique_ptr<xla::GlobalData> param1_data =
+      client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
+
+  std::unique_ptr<xla::GlobalData> actual =
+      client_
+          ->Execute(*result.computation, {param0_data.get(), param1_data.get()})
+          .ConsumeValueOrDie();
+  std::unique_ptr<xla::Literal> actual_literal =
+      client_->Transfer(*actual).ConsumeValueOrDie();
+
+  EXPECT_TRUE(xla::LiteralTestUtil::Equal(*param0_literal, *actual_literal));
+}
+
 TEST_F(XlaCompilerTest, HasSaneErrorOnNonCompileTimeConstantInputToReshape) {
   // Builds a graph that adds reshapes a tensor, but with the shape not
   // statically known.
@@ -260,7 +312,7 @@ TEST_F(XlaCompilerTest, HasSaneErrorOnNonCompileTimeConstantInputToReshape) {
       str_util::StrContains(status.error_message(), "depends on a parameter"))
       << status.error_message();
   EXPECT_TRUE(
-      str_util::StrContains(status.error_message(), "[[Node: C = Reshape"))
+      str_util::StrContains(status.error_message(), "[[{{node C}} = Reshape"))
       << status.error_message();
 }
 
@@ -306,7 +358,7 @@ TEST_F(XlaCompilerTest, ConstantOutputs) {
 
     // Tests that the generated computation works.
     std::unique_ptr<xla::Literal> param0_literal =
-        xla::Literal::CreateR1<int32>({7, 42});
+        xla::LiteralUtil::CreateR1<int32>({7, 42});
     std::unique_ptr<xla::GlobalData> param0_data =
         client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -317,9 +369,9 @@ TEST_F(XlaCompilerTest, ConstantOutputs) {
         client_->Transfer(*actual).ConsumeValueOrDie();
 
     std::unique_ptr<xla::Literal> expected0 =
-        xla::Literal::CreateR1<int32>({-7, -42});
+        xla::LiteralUtil::CreateR1<int32>({-7, -42});
     std::unique_ptr<xla::Literal> expected_literal =
-        xla::Literal::MakeTuple({expected0.get()});
+        xla::LiteralUtil::MakeTuple({expected0.get()});
     EXPECT_TRUE(
         xla::LiteralTestUtil::Equal(*expected_literal, *actual_literal));
   }
@@ -341,7 +393,7 @@ TEST_F(XlaCompilerTest, ConstantOutputs) {
 
     // Tests that the generated computation works.
     std::unique_ptr<xla::Literal> param0_literal =
-        xla::Literal::CreateR1<int32>({7, 42});
+        xla::LiteralUtil::CreateR1<int32>({7, 42});
     std::unique_ptr<xla::GlobalData> param0_data =
         client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -351,11 +403,12 @@ TEST_F(XlaCompilerTest, ConstantOutputs) {
     std::unique_ptr<xla::Literal> actual_literal =
         client_->Transfer(*actual).ConsumeValueOrDie();
 
-    std::unique_ptr<xla::Literal> expected0 = xla::Literal::CreateR0<int32>(7);
+    std::unique_ptr<xla::Literal> expected0 =
+        xla::LiteralUtil::CreateR0<int32>(7);
     std::unique_ptr<xla::Literal> expected1 =
-        xla::Literal::CreateR1<int32>({-7, -42});
+        xla::LiteralUtil::CreateR1<int32>({-7, -42});
     std::unique_ptr<xla::Literal> expected =
-        xla::Literal::MakeTuple({expected0.get(), expected1.get()});
+        xla::LiteralUtil::MakeTuple({expected0.get(), expected1.get()});
     EXPECT_TRUE(xla::LiteralTestUtil::Equal(*expected, *actual_literal));
   }
 }
@@ -569,11 +622,11 @@ TEST_F(XlaCompilerTest, CanPassTensorArraysToAndFromComputation) {
 
   // Tests that the generated computation works.
   std::unique_ptr<xla::Literal> input_base =
-      xla::Literal::CreateR1<int32>({7, 42});
+      xla::LiteralUtil::CreateR1<int32>({7, 42});
   std::unique_ptr<xla::Literal> input_grad2 =
-      xla::Literal::CreateR1<int32>({-3, 101});
+      xla::LiteralUtil::CreateR1<int32>({-3, 101});
   std::unique_ptr<xla::Literal> input =
-      xla::Literal::MakeTuple({input_base.get(), input_grad2.get()});
+      xla::LiteralUtil::MakeTuple({input_base.get(), input_grad2.get()});
   std::unique_ptr<xla::GlobalData> param0_data =
       client_->TransferToServer(*input).ConsumeValueOrDie();
 
@@ -583,17 +636,18 @@ TEST_F(XlaCompilerTest, CanPassTensorArraysToAndFromComputation) {
   std::unique_ptr<xla::Literal> actual_literal =
       client_->Transfer(*actual).ConsumeValueOrDie();
 
-  std::unique_ptr<xla::Literal> output_read = xla::Literal::CreateR0<int32>(42);
+  std::unique_ptr<xla::Literal> output_read =
+      xla::LiteralUtil::CreateR0<int32>(42);
   std::unique_ptr<xla::Literal> output_base =
-      xla::Literal::CreateR1<int32>({7, 42});
+      xla::LiteralUtil::CreateR1<int32>({7, 42});
   std::unique_ptr<xla::Literal> output_grad1 =
-      xla::Literal::CreateR1<int32>({0, 1});
+      xla::LiteralUtil::CreateR1<int32>({0, 1});
   std::unique_ptr<xla::Literal> output_grad2 =
-      xla::Literal::CreateR1<int32>({-3, 101});
-  std::unique_ptr<xla::Literal> output_resource = xla::Literal::MakeTuple(
+      xla::LiteralUtil::CreateR1<int32>({-3, 101});
+  std::unique_ptr<xla::Literal> output_resource = xla::LiteralUtil::MakeTuple(
       {output_base.get(), output_grad1.get(), output_grad2.get()});
   std::unique_ptr<xla::Literal> expected_literal =
-      xla::Literal::MakeTuple({output_read.get(), output_resource.get()});
+      xla::LiteralUtil::MakeTuple({output_read.get(), output_resource.get()});
   EXPECT_TRUE(xla::LiteralTestUtil::Equal(*expected_literal, *actual_literal));
 }
 
@@ -796,9 +850,9 @@ TEST_F(XlaCompilerTest, Variables) {
 
   // Tests that the generated computation works.
   std::unique_ptr<xla::Literal> param0_literal =
-      xla::Literal::CreateR1<int32>({7, 42});
+      xla::LiteralUtil::CreateR1<int32>({7, 42});
   std::unique_ptr<xla::Literal> param1_literal =
-      xla::Literal::CreateR1<int32>({-3, 101});
+      xla::LiteralUtil::CreateR1<int32>({-3, 101});
   std::unique_ptr<xla::GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<xla::GlobalData> param1_data =
@@ -812,11 +866,11 @@ TEST_F(XlaCompilerTest, Variables) {
       client_->Transfer(*actual).ConsumeValueOrDie();
 
   std::unique_ptr<xla::Literal> expected0 =
-      xla::Literal::CreateR1<int32>({5, 144});
+      xla::LiteralUtil::CreateR1<int32>({5, 144});
   std::unique_ptr<xla::Literal> expected1 =
-      xla::Literal::CreateR1<int32>({4, 143});
+      xla::LiteralUtil::CreateR1<int32>({4, 143});
   std::unique_ptr<xla::Literal> expected_literal =
-      xla::Literal::MakeTuple({expected0.get(), expected1.get()});
+      xla::LiteralUtil::MakeTuple({expected0.get(), expected1.get()});
   EXPECT_TRUE(xla::LiteralTestUtil::Equal(*expected_literal, *actual_literal));
 }
 
@@ -884,9 +938,9 @@ TEST_F(XlaCompilerTest, VariableRepresentationShapeFunction) {
 
   // Tests that the generated computation works.
   std::unique_ptr<xla::Literal> param0_literal =
-      xla::Literal::CreateR2<int32>({{4, 55}, {1, -3}});
+      xla::LiteralUtil::CreateR2<int32>({{4, 55}, {1, -3}});
   std::unique_ptr<xla::Literal> param1_literal =
-      xla::Literal::CreateR1<int32>({22, 11, 33, 404});
+      xla::LiteralUtil::CreateR1<int32>({22, 11, 33, 404});
   std::unique_ptr<xla::GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<xla::GlobalData> param1_data =
@@ -900,11 +954,11 @@ TEST_F(XlaCompilerTest, VariableRepresentationShapeFunction) {
       client_->Transfer(*actual).ConsumeValueOrDie();
 
   std::unique_ptr<xla::Literal> expected0 =
-      xla::Literal::CreateR2<int32>({{27, 67}, {35, 402}});
+      xla::LiteralUtil::CreateR2<int32>({{27, 67}, {35, 402}});
   std::unique_ptr<xla::Literal> expected1 =
-      xla::Literal::CreateR1<int32>({26, 66, 34, 401});
+      xla::LiteralUtil::CreateR1<int32>({26, 66, 34, 401});
   std::unique_ptr<xla::Literal> expected_literal =
-      xla::Literal::MakeTuple({expected0.get(), expected1.get()});
+      xla::LiteralUtil::MakeTuple({expected0.get(), expected1.get()});
   EXPECT_TRUE(xla::LiteralTestUtil::Equal(*expected_literal, *actual_literal));
 }
 
@@ -953,9 +1007,9 @@ TEST_F(XlaCompilerTest, ArgRetvalShapeRepresentationFunction) {
 
   // Tests that the generated computation works.
   std::unique_ptr<xla::Literal> param0_literal =
-      xla::Literal::CreateR1<int32>({4, 55, 1, -3});
+      xla::LiteralUtil::CreateR1<int32>({4, 55, 1, -3});
   std::unique_ptr<xla::Literal> param1_literal =
-      xla::Literal::CreateR1<int32>({22, 11, 33, 404});
+      xla::LiteralUtil::CreateR1<int32>({22, 11, 33, 404});
   std::unique_ptr<xla::GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<xla::GlobalData> param1_data =
@@ -969,11 +1023,11 @@ TEST_F(XlaCompilerTest, ArgRetvalShapeRepresentationFunction) {
       client_->Transfer(*actual).ConsumeValueOrDie();
 
   std::unique_ptr<xla::Literal> expected0 =
-      xla::Literal::CreateR1<int32>({27, 67, 35, 402});
+      xla::LiteralUtil::CreateR1<int32>({27, 67, 35, 402});
   std::unique_ptr<xla::Literal> expected1 =
-      xla::Literal::CreateR1<int32>({26, 66, 34, 401});
+      xla::LiteralUtil::CreateR1<int32>({26, 66, 34, 401});
   std::unique_ptr<xla::Literal> expected_literal =
-      xla::Literal::MakeTuple({expected0.get(), expected1.get()});
+      xla::LiteralUtil::MakeTuple({expected0.get(), expected1.get()});
   EXPECT_TRUE(xla::LiteralTestUtil::Equal(*expected_literal, *actual_literal));
 }
 
@@ -1023,6 +1077,8 @@ TEST_F(XlaCompilerTest, FunctionWithInvalidOp) {
   ASSERT_FALSE(status.ok());
   EXPECT_TRUE(str_util::StrContains(status.error_message(), "InvalidOp"))
       << status.error_message();
+  EXPECT_TRUE(str_util::StrContains(status.error_message(), "{{node fill_fn}}"))
+      << status.error_message();
 }
 
 // Tests a graph which has a node with invalid data type.
@@ -1047,6 +1103,8 @@ TEST_F(XlaCompilerTest, NodeWithInvalidDataType) {
   EXPECT_TRUE(str_util::StrContains(status.error_message(),
                                     "is not in the list of allowed values"))
       << status.error_message();
+  EXPECT_TRUE(str_util::StrContains(status.error_message(), "{{node Shape}}"))
+      << status.error_message();
 }
 
 TEST_F(XlaCompilerTest, SingleOpWithoutInputs) {
@@ -1068,9 +1126,10 @@ TEST_F(XlaCompilerTest, SingleOpWithoutInputs) {
     status = compiler.CompileGraph(XlaCompiler::CompileOptions(), "NoOp",
                                    std::move(graph_copy), args, &result);
     ASSERT_FALSE(status.ok());
-    EXPECT_TRUE(str_util::StrContains(status.error_message(),
-                                      "The following nodes are unreachable "
-                                      "from the source in the graph: NoOp"))
+    EXPECT_TRUE(
+        str_util::StrContains(status.error_message(),
+                              "The following nodes are unreachable "
+                              "from the source in the graph: {{node NoOp}}"))
         << status.error_message();
   }
 
diff --git a/tensorflow/compiler/tf2xla/xla_context.cc b/tensorflow/compiler/tf2xla/xla_context.cc
index fd39a58ce6..b24e3aabbe 100644
--- a/tensorflow/compiler/tf2xla/xla_context.cc
+++ b/tensorflow/compiler/tf2xla/xla_context.cc
@@ -25,9 +25,10 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
 #include "tensorflow/compiler/tf2xla/xla_op_kernel.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/core/common_runtime/dma_helper.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
diff --git a/tensorflow/compiler/tf2xla/xla_context.h b/tensorflow/compiler/tf2xla/xla_context.h
index 38d8cd653c..3db37afdba 100644
--- a/tensorflow/compiler/tf2xla/xla_context.h
+++ b/tensorflow/compiler/tf2xla/xla_context.h
@@ -22,8 +22,8 @@ limitations under the License.
 
 #include "tensorflow/compiler/tf2xla/xla_compilation_device.h"
 #include "tensorflow/compiler/tf2xla/xla_compiler.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/framework/op_kernel.h"
diff --git a/tensorflow/compiler/tf2xla/xla_cpu_backend.cc b/tensorflow/compiler/tf2xla/xla_cpu_backend.cc
index ead229aacc..23d04d43b3 100644
--- a/tensorflow/compiler/tf2xla/xla_cpu_backend.cc
+++ b/tensorflow/compiler/tf2xla/xla_cpu_backend.cc
@@ -31,6 +31,10 @@ bool CpuOpFilter(KernelDef* kdef) {
         DT_FLOAT);
     return true;
   }
+  // TODO(b/26783907): The CPU backend currently does not implement sort.
+  if (kdef->op() == "XlaSort" || kdef->op() == "TopKV2") {
+    return false;
+  }
   if (kdef->op() == "Const") {
     AddDtypeToKernalDefConstraint("dtype", DT_STRING, kdef);
   }
diff --git a/tensorflow/compiler/tf2xla/xla_gpu_backend.cc b/tensorflow/compiler/tf2xla/xla_gpu_backend.cc
index 62168b6483..1398e9ee53 100644
--- a/tensorflow/compiler/tf2xla/xla_gpu_backend.cc
+++ b/tensorflow/compiler/tf2xla/xla_gpu_backend.cc
@@ -20,12 +20,6 @@ limitations under the License.
 namespace tensorflow {
 
 bool GpuOpFilter(KernelDef* kdef) {
-  // TODO(b/31361304): The GPU backend does not parallelize PRNG ops, leading to
-  // slow code.
-  if (kdef->op() == "RandomStandardNormal" || kdef->op() == "RandomUniform" ||
-      kdef->op() == "RandomUniformInt" || kdef->op() == "TruncatedNormal") {
-    return false;
-  }
   if (kdef->op() == "Const") {
     AddDtypeToKernalDefConstraint("dtype", DT_STRING, kdef);
   }
diff --git a/tensorflow/compiler/tf2xla/xla_helpers.cc b/tensorflow/compiler/tf2xla/xla_helpers.cc
index edbc5e95a8..8efb3d55c8 100644
--- a/tensorflow/compiler/tf2xla/xla_helpers.cc
+++ b/tensorflow/compiler/tf2xla/xla_helpers.cc
@@ -26,7 +26,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/lib/constants.h"
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/lib/core/status.h"
@@ -94,13 +95,13 @@ xla::XlaOp ArgMinMax(xla::XlaOp input, xla::PrimitiveType output_type, int axis,
 xla::XlaOp XlaHelpers::Zero(xla::XlaBuilder* b, DataType data_type) {
   xla::PrimitiveType type;
   TF_CHECK_OK(DataTypeToPrimitiveType(data_type, &type));
-  return xla::ConstantLiteral(b, xla::Literal::Zero(type));
+  return xla::ConstantLiteral(b, xla::LiteralUtil::Zero(type));
 }
 
 xla::XlaOp XlaHelpers::One(xla::XlaBuilder* b, DataType data_type) {
   xla::PrimitiveType type;
   TF_CHECK_OK(DataTypeToPrimitiveType(data_type, &type));
-  return xla::ConstantLiteral(b, xla::Literal::One(type));
+  return xla::ConstantLiteral(b, xla::LiteralUtil::One(type));
 }
 
 xla::XlaOp XlaHelpers::IntegerLiteral(xla::XlaBuilder* b, DataType data_type,
diff --git a/tensorflow/compiler/tf2xla/xla_helpers.h b/tensorflow/compiler/tf2xla/xla_helpers.h
index d6ca4ab934..e6522157a5 100644
--- a/tensorflow/compiler/tf2xla/xla_helpers.h
+++ b/tensorflow/compiler/tf2xla/xla_helpers.h
@@ -19,7 +19,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_TF2XLA_XLA_HELPERS_H_
 
 #include "tensorflow/compiler/tf2xla/xla_context.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 
diff --git a/tensorflow/compiler/tf2xla/xla_jit_compiled_cpu_function.cc b/tensorflow/compiler/tf2xla/xla_jit_compiled_cpu_function.cc
index 9e17756b27..114a9241bd 100644
--- a/tensorflow/compiler/tf2xla/xla_jit_compiled_cpu_function.cc
+++ b/tensorflow/compiler/tf2xla/xla_jit_compiled_cpu_function.cc
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/xla_compiled_cpu_function.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_executable.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -57,11 +58,15 @@ xla::StatusOr<std::vector<intptr_t>> ComputeTempSizes(
   std::vector<intptr_t> temp_sizes;
   temp_sizes.reserve(allocations.size());
   for (const xla::BufferAllocation& allocation : allocations) {
-    // Callers don't allocate temporary buffers for parameters. Nor for
-    // thread-local buffers, which are lowered to alloca.
-    if (allocation.is_entry_computation_parameter() ||
-        allocation.is_thread_local()) {
+    if (allocation.is_constant() || allocation.is_thread_local()) {
+      // Constants are lowered to globals.  Thread locals are lowered to
+      // allocas.
       temp_sizes.push_back(-1);
+    } else if (allocation.is_entry_computation_parameter()) {
+      // Entry computation parameters need some preprocessing in
+      // XlaCompiledCpuFunction::Run.  See the comment on
+      // XlaCompiledCpuFunction::StaticData::temp_sizes.
+      temp_sizes.push_back(-allocation.parameter_number() - 2);
     } else {
       temp_sizes.push_back(allocation.size());
     }
diff --git a/tensorflow/compiler/tf2xla/xla_op_kernel.cc b/tensorflow/compiler/tf2xla/xla_op_kernel.cc
index 359cb4c467..82028c8b9c 100644
--- a/tensorflow/compiler/tf2xla/xla_op_kernel.cc
+++ b/tensorflow/compiler/tf2xla/xla_op_kernel.cc
@@ -21,7 +21,8 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/shape_util.h"
 #include "tensorflow/compiler/tf2xla/type_util.h"
 #include "tensorflow/compiler/tf2xla/xla_context.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/core/common_runtime/dma_helper.h"
 
@@ -66,10 +67,18 @@ const xla::XlaOp& XlaOpKernelContext::Input(int index) {
   return GetComputationFromTensor(context_->input(index));
 }
 
+const xla::XlaOp& XlaOpKernelContext::Input(StringPiece name) {
+  return GetComputationFromTensor(GetInputTensorByName(name));
+}
+
 TensorShape XlaOpKernelContext::InputShape(int index) {
   return context_->input(index).shape();
 }
 
+TensorShape XlaOpKernelContext::InputShape(StringPiece name) {
+  return GetInputTensorByName(name).shape();
+}
+
 DataType XlaOpKernelContext::input_type(int index) const {
   return context_->input(index).dtype();
 }
@@ -332,10 +341,11 @@ Status XlaOpKernelContext::ConstantInputList(
   return Status::OK();
 }
 
-Status XlaOpKernelContext::ReadVariableInput(int index, DataType type,
-                                             TensorShape* shape,
-                                             xla::XlaOp* value) {
-  const Tensor& tensor = context_->input(index);
+namespace {
+
+Status ReadVariableInputTensor(const Tensor& tensor, DataType type,
+                               const OpKernelContext* ctx, TensorShape* shape,
+                               xla::XlaOp* value) {
   const XlaExpression* expression = CastExpressionFromTensor(tensor);
   XlaResource* variable = expression->resource();
   TF_RET_CHECK(variable != nullptr);
@@ -353,7 +363,7 @@ Status XlaOpKernelContext::ReadVariableInput(int index, DataType type,
     *shape = variable->shape();
   }
 
-  XlaContext& xla_context = XlaContext::Get(context_);
+  XlaContext& xla_context = XlaContext::Get(ctx);
   TF_ASSIGN_OR_RETURN(
       TensorShape representation_shape,
       xla_context.RepresentationShape(variable->shape(), variable->type()));
@@ -365,6 +375,22 @@ Status XlaOpKernelContext::ReadVariableInput(int index, DataType type,
   return Status::OK();
 }
 
+}  // namespace
+
+Status XlaOpKernelContext::ReadVariableInput(int index, DataType type,
+                                             TensorShape* shape,
+                                             xla::XlaOp* value) {
+  return ReadVariableInputTensor(context_->input(index), type, context_, shape,
+                                 value);
+}
+
+Status XlaOpKernelContext::ReadVariableInput(StringPiece name, DataType type,
+                                             TensorShape* shape,
+                                             xla::XlaOp* value) {
+  return ReadVariableInputTensor(GetInputTensorByName(name), type, context_,
+                                 shape, value);
+}
+
 Status XlaOpKernelContext::GetVariableTypeAndShape(int index, DataType* type,
                                                    TensorShape* shape) const {
   const Tensor& tensor = context_->input(index);
@@ -455,17 +481,17 @@ Status XlaOpKernelContext::GetResourceInput(int index, XlaResource** resource) {
   return Status::OK();
 }
 
-Status XlaOpKernelContext::AssignVariable(int input_index, DataType type,
-                                          xla::XlaOp handle) {
-  TF_RET_CHECK(handle.valid());
+namespace {
 
-  const XlaExpression* expression =
-      CastExpressionFromTensor(context_->input(input_index));
+Status AssignVariableTensor(const Tensor& tensor, DataType type,
+                            const OpKernelContext* ctx, xla::XlaOp handle,
+                            xla::XlaBuilder* builder) {
+  const XlaExpression* expression = CastExpressionFromTensor(tensor);
   XlaResource* variable = expression->resource();
   TF_RET_CHECK(variable != nullptr);
   TF_RET_CHECK(variable->kind() == XlaResource::kVariable);
 
-  auto shape_or_status = builder()->GetShape(handle);
+  auto shape_or_status = builder->GetShape(handle);
   if (!shape_or_status.ok()) {
     return shape_or_status.status();
   }
@@ -475,7 +501,7 @@ Status XlaOpKernelContext::AssignVariable(int input_index, DataType type,
 
   TF_RETURN_IF_ERROR(variable->SetTypeAndShape(type, shape));
 
-  XlaContext& xla_context = XlaContext::Get(context_);
+  XlaContext& xla_context = XlaContext::Get(ctx);
   TF_ASSIGN_OR_RETURN(TensorShape representation_shape,
                       xla_context.RepresentationShape(shape, type));
   if (shape != representation_shape) {
@@ -484,6 +510,22 @@ Status XlaOpKernelContext::AssignVariable(int input_index, DataType type,
   return variable->SetValue(handle);
 }
 
+}  // namespace
+
+Status XlaOpKernelContext::AssignVariable(int input_index, DataType type,
+                                          xla::XlaOp handle) {
+  TF_RET_CHECK(handle.valid());
+  return AssignVariableTensor(context_->input(input_index), type, context_,
+                              handle, builder());
+}
+
+Status XlaOpKernelContext::AssignVariable(StringPiece name, DataType type,
+                                          xla::XlaOp handle) {
+  TF_RET_CHECK(handle.valid());
+  return AssignVariableTensor(GetInputTensorByName(name), type, context_,
+                              handle, builder());
+}
+
 XlaCompiler* XlaOpKernelContext::compiler() const {
   return XlaContext::Get(context_).compiler();
 }
@@ -523,6 +565,12 @@ const xla::XlaComputation* XlaOpKernelContext::GetOrCreateMul(
   return XlaContext::Get(context_).GetOrCreateMul(type);
 }
 
+const Tensor& XlaOpKernelContext::GetInputTensorByName(StringPiece name) {
+  const Tensor* tensor;
+  CHECK(context_->input(name, &tensor).ok());
+  return *tensor;
+}
+
 XlaOpKernel::XlaOpKernel(OpKernelConstruction* context) : OpKernel(context) {}
 
 void XlaOpKernel::Compute(OpKernelContext* context) {
diff --git a/tensorflow/compiler/tf2xla/xla_op_kernel.h b/tensorflow/compiler/tf2xla/xla_op_kernel.h
index 2bde2c983d..ac9dfe3369 100644
--- a/tensorflow/compiler/tf2xla/xla_op_kernel.h
+++ b/tensorflow/compiler/tf2xla/xla_op_kernel.h
@@ -17,7 +17,8 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_TF2XLA_XLA_OP_KERNEL_H_
 
 #include "tensorflow/compiler/tf2xla/xla_compiler.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/platform/macros.h"
@@ -67,21 +68,26 @@ class XlaOpKernelContext {
   // Returns the number of inputs to the operator.
   int num_inputs() const { return context_->num_inputs(); }
 
-  // Returns the type of input 'index'.
+  // Returns the type of input `index`.
   DataType input_type(int index) const;
 
-  // Returns the type of input 'index' as an xla::PrimitiveType. If the type
+  // Returns the type of input `index` as an xla::PrimitiveType. If the type
   // is not representable as an XLA type, sets an error status and returns
   // xla::PRIMITIVE_TYPE_INVALID.
   xla::PrimitiveType input_xla_type(int index);
 
-  // Returns the shape of input 'index'.
+  // Returns the shape of input `index`.
   TensorShape InputShape(int index);
 
-  // Returns input 'index' as a XlaOp. Unlike
+  // Returns the shape of input `name`.
+  TensorShape InputShape(StringPiece name);
+
+  // Returns input `index` as a XlaOp. Unlike
   // OpKernelContext::Input returns a symbolic value rather than a concrete
   // Tensor.
   const xla::XlaOp& Input(int index);
+  // Returns input `name` as a XlaOp.
+  const xla::XlaOp& Input(StringPiece name);
 
   // Returns true if all inputs are the same shape, otherwise sets the
   // status to a non-OK value and returns false.
@@ -96,13 +102,13 @@ class XlaOpKernelContext {
 
   // Helper methods for constant inputs.
 
-  // Evaluates input 'index' and stores it in '*constant_literal'. If the
+  // Evaluates input `index` and stores it in `*constant_literal`. If the
   // expression cannot be evaluated, e.g., because it depends on unbound
   // parameters, returns a non-OK status.
   Status ConstantInput(int index, xla::Literal* constant_literal);
 
-  // Evaluates input 'index', reshapes it to 'new_shape' if new_shape !=
-  // InputShape(index), and stores it in '*constant_literal'. If the input
+  // Evaluates input `index`, reshapes it to `new_shape` if new_shape !=
+  // InputShape(index), and stores it in `*constant_literal`. If the input
   // cannot be evaluated, e.g., because it depends on unbound parameters,
   // returns a non-Ok status. If InputShape(index).num_elements() !=
   // new_shape.num_elements(), returns an error status.
@@ -137,17 +143,17 @@ class XlaOpKernelContext {
     return context_->expected_output_dtype(index);
   }
 
-  // Sets output 'index' to the XlaOp 'handle'.
+  // Sets output `index` to the XlaOp `handle`.
   // All outputs should be set using SetOutput and SetConstantOutput, not
   // via the underlying OpKernelContext.
   void SetOutput(int index, const xla::XlaOp& handle);
 
-  // Sets output 'index' to compile-time constant 'host_tensor', where
-  // 'host_tensor' is a tensor in host memory. It is preferable to use
+  // Sets output `index` to compile-time constant `host_tensor`, where
+  // `host_tensor` is a tensor in host memory. It is preferable to use
   // SetConstantOutput where possible.
   void SetConstantOutput(int index, const Tensor& host_tensor);
 
-  // Sets output 'index' to an invalid value.
+  // Sets output `index` to an invalid value.
   // Any subsequent attempt to consume this output will cause an error.
   void SetInvalidOutput(int index);
 
@@ -157,10 +163,10 @@ class XlaOpKernelContext {
 
   // Variables
 
-  // Sets '*resource' to the resource associated with input `index`.
+  // Sets `*resource` to the resource associated with input `index`.
   Status GetResourceInput(int index, XlaResource** resource);
 
-  // Sets output 'index' to be a reference to resource 'resource'.
+  // Sets output `index` to be a reference to resource `resource`.
   void SetResourceOutput(int index, XlaResource* resource);
 
   // Sets `*type` and `*shape` to the current type and shape of a variable's
@@ -169,17 +175,23 @@ class XlaOpKernelContext {
                                  TensorShape* shape) const;
 
   // Reads the current value of the resouce variable referred to by input
-  // 'index'. If `shape` is not nullptr, sets `*shape` to the shape of the
+  // `index`. If `shape` is not nullptr, sets `*shape` to the shape of the
   // variable. Returns an error if the variable has not been initialized, or if
   // its type does not match `type`.
   Status ReadVariableInput(int index, DataType type, TensorShape* shape,
                            xla::XlaOp* value);
+  // Reads the current value of the resouce variable referred to by input
+  // `name`.
+  Status ReadVariableInput(StringPiece name, DataType type, TensorShape* shape,
+                           xla::XlaOp* value);
 
   // Assigns the value `handle` to the variable referenced by input
   // `input_index`. The variable must be of `type`. Returns an error if the
   // variable has been initialized with a different type or with a
   // different shape.
   Status AssignVariable(int input_index, DataType type, xla::XlaOp handle);
+  // Assigns the value `handle` to the variable referenced by input `name`.
+  Status AssignVariable(StringPiece name, DataType type, xla::XlaOp handle);
 
   // Helper routines for the OP_REQUIRES macros
   void CtxFailure(const Status& s);
@@ -227,6 +239,9 @@ class XlaOpKernelContext {
   const xla::XlaComputation* GetOrCreateMul(const DataType type);
 
  private:
+  // Returns the tensor of input `name`.
+  const Tensor& GetInputTensorByName(StringPiece name);
+
   OpKernelContext* const context_;
 };
 
diff --git a/tensorflow/compiler/tf2xla/xla_op_registry.h b/tensorflow/compiler/tf2xla/xla_op_registry.h
index 2d4593ea49..fc14834ca6 100644
--- a/tensorflow/compiler/tf2xla/xla_op_registry.h
+++ b/tensorflow/compiler/tf2xla/xla_op_registry.h
@@ -279,7 +279,7 @@ class XlaOpRegistrar {
 
 #define REGISTER_XLA_OP_UNIQ(CTR, BUILDER, OP)                                 \
   static ::tensorflow::XlaOpRegistrar xla_op_registrar__body__##CTR##__object( \
-      XlaOpRegistrationBuilder::BUILDER.Build(                                 \
+      ::tensorflow::XlaOpRegistrationBuilder::BUILDER.Build(                   \
           [](::tensorflow::OpKernelConstruction* context)                      \
               -> ::tensorflow::OpKernel* { return new OP(context); }));
 
diff --git a/tensorflow/compiler/tf2xla/xla_resource.cc b/tensorflow/compiler/tf2xla/xla_resource.cc
index baea814965..7928fa0347 100644
--- a/tensorflow/compiler/tf2xla/xla_resource.cc
+++ b/tensorflow/compiler/tf2xla/xla_resource.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/tf2xla/sharding_util.h"
 #include "tensorflow/compiler/tf2xla/xla_context.h"
 #include "tensorflow/compiler/tf2xla/xla_helpers.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/compiler/tf2xla/xla_resource.h b/tensorflow/compiler/tf2xla/xla_resource.h
index 4de18a7788..2438490be1 100644
--- a/tensorflow/compiler/tf2xla/xla_resource.h
+++ b/tensorflow/compiler/tf2xla/xla_resource.h
@@ -18,7 +18,7 @@ limitations under the License.
 
 #include <memory>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/framework/tensor_shape.h"
 #include "tensorflow/core/framework/types.pb.h"
diff --git a/tensorflow/compiler/xla/BUILD b/tensorflow/compiler/xla/BUILD
index 03e542855b..fdf13bb18c 100644
--- a/tensorflow/compiler/xla/BUILD
+++ b/tensorflow/compiler/xla/BUILD
@@ -254,6 +254,7 @@ tf_cc_test(
         ":types",
         ":util",
         ":xla_data_proto",
+        "//tensorflow/core:lib",
         "//tensorflow/core:test_main",
     ],
 )
@@ -281,9 +282,9 @@ tf_cc_test(
 )
 
 cc_library(
-    name = "literal_util",
-    srcs = ["literal_util.cc"],
-    hdrs = ["literal_util.h"],
+    name = "literal",
+    srcs = ["literal.cc"],
+    hdrs = ["literal.h"],
     visibility = ["//visibility:public"],
     deps = [
         ":array2d",
@@ -300,11 +301,12 @@ cc_library(
 )
 
 tf_cc_test(
-    name = "literal_util_test",
-    srcs = ["literal_util_test.cc"],
+    name = "literal_test",
+    srcs = ["literal_test.cc"],
     deps = [
         ":array3d",
         ":array4d",
+        ":literal",
         ":literal_util",
         ":shape_util",
         ":test",
@@ -317,6 +319,26 @@ tf_cc_test(
 )
 
 cc_library(
+    name = "literal_util",
+    srcs = ["literal_util.cc"],
+    hdrs = ["literal_util.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":array2d",
+        ":array3d",
+        ":array4d",
+        ":literal",
+        ":shape_util",
+        ":sparse_index_array",
+        ":status_macros",
+        ":types",
+        ":util",
+        ":xla_data_proto",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
     name = "error_spec",
     hdrs = ["error_spec.h"],
 )
@@ -327,6 +349,7 @@ cc_library(
     hdrs = ["literal_comparison.h"],
     deps = [
         ":error_spec",
+        ":literal",
         ":literal_util",
         ":util",
         "//tensorflow/core:lib",
@@ -458,7 +481,7 @@ cc_library(
     hdrs = ["packed_literal_reader.h"],
     visibility = [":internal"],
     deps = [
-        ":literal_util",
+        ":literal",
         ":shape_util",
         ":status_macros",
         ":statusor",
@@ -489,7 +512,7 @@ cc_library(
     hdrs = ["text_literal_reader.h"],
     visibility = [":internal"],
     deps = [
-        ":literal_util",
+        ":literal",
         ":shape_util",
         ":status_macros",
         ":statusor",
@@ -505,7 +528,7 @@ tf_cc_test(
     name = "text_literal_reader_test",
     srcs = ["text_literal_reader_test.cc"],
     deps = [
-        ":literal_util",
+        ":literal",
         ":shape_util",
         ":test",
         ":text_literal_reader",
@@ -522,7 +545,7 @@ cc_library(
     hdrs = ["text_literal_writer.h"],
     visibility = [":internal"],
     deps = [
-        ":literal_util",
+        ":literal",
         ":shape_util",
         ":status_macros",
         ":types",
@@ -535,6 +558,7 @@ tf_cc_test(
     name = "text_literal_writer_test",
     srcs = ["text_literal_writer_test.cc"],
     deps = [
+        ":literal",
         ":literal_util",
         ":test",
         ":test_helpers",
@@ -607,11 +631,12 @@ cc_library(
         ":array2d",
         ":array3d",
         ":array4d",
+        ":literal_util",
         ":util",
         ":window_util",
         ":xla_data_proto",
         "//tensorflow/compiler/xla/client:padding",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_evaluator",
         "//tensorflow/compiler/xla/service:shape_inference",
@@ -627,7 +652,7 @@ tf_cc_test(
         ":array2d",
         ":array3d",
         ":array4d",
-        ":literal_util",
+        ":literal",
         ":reference_util",
         ":test",
         ":util",
diff --git a/tensorflow/compiler/xla/client/BUILD b/tensorflow/compiler/xla/client/BUILD
index 8f08d3b2e0..ad3fcee05b 100644
--- a/tensorflow/compiler/xla/client/BUILD
+++ b/tensorflow/compiler/xla/client/BUILD
@@ -64,8 +64,9 @@ cc_library(
     hdrs = ["client.h"],
     deps = [
         ":global_data",
+        ":xla_computation",
         "//tensorflow/compiler/xla:execution_options_util",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:service_interface",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -73,7 +74,6 @@ cc_library(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla:xla_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
         "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/core:lib",
@@ -100,12 +100,12 @@ cc_library(
     deps = [
         ":client",
         ":executable_build_options",
+        ":xla_computation",
         "//tensorflow/compiler/xla:executable_run_options",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/service:backend",
         "//tensorflow/compiler/xla/service:compiler",
         "//tensorflow/compiler/xla/service:device_memory_allocator",
@@ -114,6 +114,7 @@ cc_library(
         "//tensorflow/compiler/xla/service:local_service",
         "//tensorflow/compiler/xla/service:shaped_buffer",
         "//tensorflow/compiler/xla/service:source_map_util",
+        "//tensorflow/compiler/xla/service:stream_pool",
         "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
         "@llvm//:support",
@@ -126,11 +127,11 @@ cc_library(
     hdrs = ["compile_only_client.h"],
     deps = [
         ":client",
+        ":xla_computation",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/service:compile_only_service",
         "//tensorflow/compiler/xla/service:compiler",
         "//tensorflow/core:stream_executor_no_cuda",
@@ -174,3 +175,60 @@ cc_library(
         "//tensorflow/compiler/xla:xla_data_proto",
     ],
 )
+
+cc_library(
+    name = "xla_computation",
+    srcs = ["xla_computation.cc"],
+    hdrs = ["xla_computation.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_proto",
+    ],
+)
+
+cc_library(
+    name = "xla_builder",
+    srcs = ["xla_builder.cc"],
+    hdrs = ["xla_builder.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":padding",
+        ":sharding_builder",
+        ":xla_computation",
+        "//tensorflow/compiler/xla:execution_options_util",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_proto",
+        "//tensorflow/compiler/xla/service:shape_inference",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "xla_builder_test",
+    srcs = ["xla_builder_test.cc"],
+    deps = [
+        ":xla_builder",
+        ":xla_computation",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:test_helpers",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/core:test",
+    ],
+)
diff --git a/tensorflow/compiler/xla/client/client.cc b/tensorflow/compiler/xla/client/client.cc
index 3d596a6e65..d0ce5e8a6a 100644
--- a/tensorflow/compiler/xla/client/client.cc
+++ b/tensorflow/compiler/xla/client/client.cc
@@ -18,9 +18,10 @@ limitations under the License.
 #include <string>
 #include <utility>
 
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -409,8 +410,10 @@ StatusOr<string> Client::ExecutionStatsAsString(
   return string("[Execution Statistics] not available.");
 }
 
-StatusOr<ChannelHandle> Client::CreateChannelHandle() {
+StatusOr<ChannelHandle> Client::CreateChannelHandleByType(
+    ChannelHandle::ChannelType type) {
   CreateChannelHandleRequest request;
+  request.set_channel_type(type);
   CreateChannelHandleResponse response;
 
   VLOG(1) << "making create channel handle request";
@@ -424,4 +427,16 @@ StatusOr<ChannelHandle> Client::CreateChannelHandle() {
   return response.channel();
 }
 
+StatusOr<ChannelHandle> Client::CreateChannelHandle() {
+  return CreateChannelHandleByType(ChannelHandle::DEVICE_TO_DEVICE);
+}
+
+StatusOr<ChannelHandle> Client::CreateHostToDeviceChannelHandle() {
+  return CreateChannelHandleByType(ChannelHandle::HOST_TO_DEVICE);
+}
+
+StatusOr<ChannelHandle> Client::CreateDeviceToHostChannelHandle() {
+  return CreateChannelHandleByType(ChannelHandle::DEVICE_TO_HOST);
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/client.h b/tensorflow/compiler/xla/client/client.h
index 68f0d0ac78..be50cebfcc 100644
--- a/tensorflow/compiler/xla/client/client.h
+++ b/tensorflow/compiler/xla/client/client.h
@@ -20,8 +20,8 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/client/global_data.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service_interface.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -178,10 +178,15 @@ class Client {
   StatusOr<std::unique_ptr<ProgramShape>> GetComputationShape(
       const XlaComputation& computation);
 
-  // Creates a channel handle that can be used to transfer data between
-  // two computations via a pair of Send and Recv instructions.
+  // Creates a channel handle that can be used to transfer data between two
+  // computations on different devices via a pair of Send and Recv instructions.
   StatusOr<ChannelHandle> CreateChannelHandle();
 
+  // Create a channel for communicating with the host via a SendtoHost or
+  // RecvFromHost operation.
+  StatusOr<ChannelHandle> CreateHostToDeviceChannelHandle();
+  StatusOr<ChannelHandle> CreateDeviceToHostChannelHandle();
+
   StatusOr<XlaComputation> LoadSnapshot(const HloSnapshot& module);
 
   ServiceInterface* stub() { return stub_; }
@@ -192,6 +197,9 @@ class Client {
   StatusOr<string> ExecutionStatsAsString(const XlaComputation& computation,
                                           const ExecutionProfile& profile);
 
+  StatusOr<ChannelHandle> CreateChannelHandleByType(
+      ChannelHandle::ChannelType type);
+
   ServiceInterface* stub_;  // Stub that this client is connected on.
 
   TF_DISALLOW_COPY_AND_ASSIGN(Client);
diff --git a/tensorflow/compiler/xla/client/compile_only_client.h b/tensorflow/compiler/xla/client/compile_only_client.h
index 332c965036..a551edeab0 100644
--- a/tensorflow/compiler/xla/client/compile_only_client.h
+++ b/tensorflow/compiler/xla/client/compile_only_client.h
@@ -17,7 +17,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_XLA_CLIENT_COMPILE_ONLY_CLIENT_H_
 
 #include "tensorflow/compiler/xla/client/client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/compile_only_service.h"
 #include "tensorflow/compiler/xla/service/compiler.h"
 #include "tensorflow/compiler/xla/statusor.h"
diff --git a/tensorflow/compiler/xla/client/lib/BUILD b/tensorflow/compiler/xla/client/lib/BUILD
index a6b9b47253..39d5582d19 100644
--- a/tensorflow/compiler/xla/client/lib/BUILD
+++ b/tensorflow/compiler/xla/client/lib/BUILD
@@ -29,8 +29,8 @@ cc_library(
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:lib",
     ],
 )
@@ -45,7 +45,7 @@ cc_library(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
     ],
 )
 
@@ -58,7 +58,7 @@ xla_test(
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
     ],
@@ -72,7 +72,7 @@ cc_library(
         ":constants",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
     ],
 )
 
@@ -82,10 +82,11 @@ xla_test(
     tags = ["enable_for_xla_interpreter"],
     deps = [
         ":math",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
     ],
@@ -96,9 +97,12 @@ cc_library(
     srcs = ["numeric.cc"],
     hdrs = ["numeric.h"],
     deps = [
+        ":arithmetic",
+        ":constants",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/core:lib",
     ],
 )
 
@@ -111,6 +115,52 @@ xla_test(
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/tests:client_library_test_base",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+    ],
+)
+
+cc_library(
+    name = "prng",
+    srcs = ["prng.cc"],
+    hdrs = ["prng.h"],
+    deps = [
+        ":constants",
+        ":math",
+        ":numeric",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
+    name = "sorting",
+    srcs = ["sorting.cc"],
+    hdrs = ["sorting.h"],
+    deps = [
+        ":numeric",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+    ],
+)
+
+xla_test(
+    name = "sorting_test",
+    srcs = ["sorting_test.cc"],
+    blacklisted_backends = [
+        "cpu",
+        "gpu",
+    ],
+    tags = ["enable_for_xla_interpreter"],
+    deps = [
+        ":sorting",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client/xla_client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -123,7 +173,7 @@ cc_library(
     hdrs = ["testing.h"],
     deps = [
         "//tensorflow/compiler/xla:execution_options_util",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:types",
@@ -131,8 +181,8 @@ cc_library(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client",
         "//tensorflow/compiler/xla/client:global_data",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:test_utils",
         "//tensorflow/core:lib",
     ],
diff --git a/tensorflow/compiler/xla/client/lib/arithmetic.cc b/tensorflow/compiler/xla/client/lib/arithmetic.cc
index 978fc40f34..9225b1acd6 100644
--- a/tensorflow/compiler/xla/client/lib/arithmetic.cc
+++ b/tensorflow/compiler/xla/client/lib/arithmetic.cc
@@ -18,8 +18,8 @@ limitations under the License.
 #include <string>
 
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -94,16 +94,18 @@ XlaComputation CreateScalarMinComputation(PrimitiveType type,
       });
 }
 
-XlaComputation CreateScalarAndComputation(XlaBuilder* builder) {
+XlaComputation CreateScalarAndComputation(PrimitiveType type,
+                                          XlaBuilder* builder) {
   return CreateScalarComputation(
-      "and", PRED, builder,
+      "and", type, builder,
       [](XlaBuilder* b, const XlaOp& lhs, const XlaOp& rhs) {
         return And(lhs, rhs);
       });
 }
 
-XlaComputation CreateScalarOrComputation(XlaBuilder* builder) {
-  return CreateScalarComputation("or", PRED, builder,
+XlaComputation CreateScalarOrComputation(PrimitiveType type,
+                                         XlaBuilder* builder) {
+  return CreateScalarComputation("or", type, builder,
                                  [](XlaBuilder* b, const XlaOp& lhs,
                                     const XlaOp& rhs) { return Or(lhs, rhs); });
 }
@@ -112,7 +114,7 @@ XlaOp Any(XlaOp predicates) {
   XlaBuilder* builder = predicates.builder();
   return builder->ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
     auto f = ConstantR0<bool>(builder, false);
-    XlaComputation logical_or = CreateScalarOrComputation(builder);
+    XlaComputation logical_or = CreateScalarOrComputation(PRED, builder);
     TF_ASSIGN_OR_RETURN(const Shape& predicates_shape,
                         builder->GetShape(predicates));
     std::vector<int64> all_dimensions(ShapeUtil::Rank(predicates_shape));
diff --git a/tensorflow/compiler/xla/client/lib/arithmetic.h b/tensorflow/compiler/xla/client/lib/arithmetic.h
index d0b916e8c8..632e8cc8bc 100644
--- a/tensorflow/compiler/xla/client/lib/arithmetic.h
+++ b/tensorflow/compiler/xla/client/lib/arithmetic.h
@@ -18,8 +18,8 @@ limitations under the License.
 
 #include <memory>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
 namespace xla {
@@ -45,10 +45,12 @@ XlaComputation CreateScalarMinComputation(PrimitiveType type,
                                           XlaBuilder* builder);
 
 // Creates a scalar logical AND computation and returns it.
-XlaComputation CreateScalarAndComputation(XlaBuilder* builder);
+XlaComputation CreateScalarAndComputation(PrimitiveType type,
+                                          XlaBuilder* builder);
 
 // Creates a scalar logical OR computation and returns it.
-XlaComputation CreateScalarOrComputation(XlaBuilder* builder);
+XlaComputation CreateScalarOrComputation(PrimitiveType type,
+                                         XlaBuilder* builder);
 
 // Returns whether any predicate in "predicates" is set.
 //
diff --git a/tensorflow/compiler/xla/client/lib/constants.cc b/tensorflow/compiler/xla/client/lib/constants.cc
index 1686389a23..031d62e4ff 100644
--- a/tensorflow/compiler/xla/client/lib/constants.cc
+++ b/tensorflow/compiler/xla/client/lib/constants.cc
@@ -21,7 +21,7 @@ limitations under the License.
 namespace xla {
 
 XlaOp Zero(XlaBuilder* builder, PrimitiveType type) {
-  return ConstantLiteral(builder, Literal::Zero(type));
+  return ConstantLiteral(builder, LiteralUtil::Zero(type));
 }
 
 XlaOp Zeros(XlaBuilder* builder, const Shape& shape) {
@@ -38,7 +38,7 @@ XlaOp ZerosLike(XlaOp prototype) {
 }
 
 XlaOp One(XlaBuilder* builder, PrimitiveType type) {
-  return ConstantLiteral(builder, Literal::One(type));
+  return ConstantLiteral(builder, LiteralUtil::One(type));
 }
 
 XlaOp Epsilon(XlaBuilder* builder, PrimitiveType type) {
@@ -61,7 +61,7 @@ XlaOp Epsilon(XlaBuilder* builder, PrimitiveType type) {
 }
 
 XlaOp MinValue(XlaBuilder* builder, PrimitiveType type) {
-  return ConstantLiteral(builder, Literal::MinValue(type));
+  return ConstantLiteral(builder, LiteralUtil::MinValue(type));
 }
 
 XlaOp MinFiniteValue(XlaBuilder* builder, PrimitiveType type) {
@@ -81,7 +81,7 @@ XlaOp MinFiniteValue(XlaBuilder* builder, PrimitiveType type) {
 }
 
 XlaOp MaxValue(XlaBuilder* builder, PrimitiveType type) {
-  return ConstantLiteral(builder, Literal::MaxValue(type));
+  return ConstantLiteral(builder, LiteralUtil::MaxValue(type));
 }
 
 XlaOp MaxFiniteValue(XlaBuilder* builder, PrimitiveType type) {
diff --git a/tensorflow/compiler/xla/client/lib/constants.h b/tensorflow/compiler/xla/client/lib/constants.h
index b47f5243f0..0c8a9b8cc0 100644
--- a/tensorflow/compiler/xla/client/lib/constants.h
+++ b/tensorflow/compiler/xla/client/lib/constants.h
@@ -18,7 +18,7 @@ limitations under the License.
 
 #include <type_traits>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
diff --git a/tensorflow/compiler/xla/client/lib/constants_test.cc b/tensorflow/compiler/xla/client/lib/constants_test.cc
index f1e3439862..f4320f65c1 100644
--- a/tensorflow/compiler/xla/client/lib/constants_test.cc
+++ b/tensorflow/compiler/xla/client/lib/constants_test.cc
@@ -14,7 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/client/lib/constants.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
diff --git a/tensorflow/compiler/xla/client/lib/math.cc b/tensorflow/compiler/xla/client/lib/math.cc
index 5587559040..0221de7672 100644
--- a/tensorflow/compiler/xla/client/lib/math.cc
+++ b/tensorflow/compiler/xla/client/lib/math.cc
@@ -25,11 +25,9 @@ XlaOp Sqrt(XlaOp operand) { return Pow(operand, ScalarLike(operand, 0.5)); }
 
 XlaOp Rsqrt(XlaOp operand) { return Pow(operand, ScalarLike(operand, -0.5)); }
 
-XlaOp Square(XlaOp operand) { return Pow(operand, ScalarLike(operand, 2.0)); }
+XlaOp Square(XlaOp operand) { return operand * operand; }
 
-XlaOp Reciprocal(XlaOp operand) {
-  return Pow(operand, ScalarLike(operand, -1.0));
-}
+XlaOp Reciprocal(XlaOp operand) { return ScalarLike(operand, 1.0) / operand; }
 
 namespace {
 
@@ -149,4 +147,158 @@ XlaOp ErfInv(XlaOp x) {
   });
 }
 
+namespace {
+// Coefficients for the Lanczos approximation of the gamma function. The
+// coefficients are uniquely determined by the choice of g and n (kLanczosGamma
+// and kLanczosCoefficients.size() + 1). The coefficients below correspond to
+// [7, 9]. [5, 7], [7, 9], [9, 10], and [607/128.0, 15] were evaluated and [7,
+// 9] seemed to be the least sensitive to the quality of the log function. In
+// particular, [5, 7] is the only choice where -1.5e-5 <= lgamma(2) <= 1.5e-5
+// for a particularly inaccurate log function.
+static constexpr double kLanczosGamma = 7;  // aka g
+static constexpr double kBaseLanczosCoeff = 0.99999999999980993227684700473478;
+static constexpr std::array<double, 8> kLanczosCoefficients = {
+    676.520368121885098567009190444019, -1259.13921672240287047156078755283,
+    771.3234287776530788486528258894,   -176.61502916214059906584551354,
+    12.507343278686904814458936853,     -0.13857109526572011689554707,
+    9.984369578019570859563e-6,         1.50563273514931155834e-7};
+}  // namespace
+
+// Compute the Lgamma function using Lanczos' approximation from "A Precision
+// Approximation of the Gamma Function". SIAM Journal on Numerical Analysis
+// series B. Vol. 1:
+// lgamma(z + 1) = (log(2) + log(pi)) / 2 + (z + 1/2) * log(t(z)) - t(z) + A(z)
+// t(z) = z + kLanczosGamma + 1/2
+// A(z) = kBaseLanczosCoeff + sigma(k = 1, n, kLanczosCoefficients[i] / (z + k))
+XlaOp Lgamma(XlaOp input) {
+  XlaOp one_half = ScalarLike(input, 0.5);
+  XlaOp one = ScalarLike(input, 1);
+
+  XlaOp pi = ScalarLike(input, M_PI);
+  XlaOp log_pi = ScalarLike(input, std::log(M_PI));
+  XlaOp log_sqrt_two_pi = ScalarLike(input, (std::log(2) + std::log(M_PI)) / 2);
+
+  XlaOp lanczos_gamma_plus_one_half = ScalarLike(input, kLanczosGamma + 0.5);
+  XlaOp log_lanczos_gamma_plus_one_half =
+      ScalarLike(input, std::log(kLanczosGamma + 0.5));
+
+  XlaOp base_lanczos_coeff = ScalarLike(input, kBaseLanczosCoeff);
+
+  // If the input is less than 0.5 use Gauss's reflection formula:
+  // gamma(x) = pi / sin(pi * x) * gamma(1 - x)
+  XlaOp need_to_reflect = Lt(Real(input), one_half);
+  XlaOp z = Select(need_to_reflect, -input, input - one);
+
+  XlaOp x = base_lanczos_coeff;
+  for (int i = 0; i < kLanczosCoefficients.size(); ++i) {
+    XlaOp lanczos_coefficient = ScalarLike(input, kLanczosCoefficients[i]);
+    XlaOp index = ScalarLike(input, i);
+    x = x + lanczos_coefficient / (z + index + one);
+  }
+
+  // To improve accuracy on platforms with less-precise log implementations,
+  // compute log(lanczos_gamma_plus_one_half) at compile time and use log1p on
+  // the device.
+  // log(t) = log(kLanczosGamma + 0.5 + z)
+  //        = log(kLanczosGamma + 0.5) + log1p(z / (kLanczosGamma + 0.5))
+  XlaOp t = lanczos_gamma_plus_one_half + z;
+  XlaOp log_t =
+      log_lanczos_gamma_plus_one_half + Log1p(z / lanczos_gamma_plus_one_half);
+
+  XlaOp log_y = log_sqrt_two_pi + (z + one_half) * log_t - t + Log(x);
+
+  XlaOp reflection = log_pi - Log(Sin(pi * input)) - log_y;
+  XlaOp result = Select(need_to_reflect, reflection, log_y);
+  return result;
+}
+
+// Compute the Digamma function using Lanczos' approximation from "A Precision
+// Approximation of the Gamma Function". SIAM Journal on Numerical Analysis
+// series B. Vol. 1:
+// digamma(z + 1) = log(t(z)) + A'(z) / A(z) - kLanczosGamma / t(z)
+// t(z) = z + kLanczosGamma + 1/2
+// A(z) = kBaseLanczosCoeff + sigma(k = 1, n, kLanczosCoefficients[i] / (z + k))
+// A'(z) = sigma(k = 1, n, kLanczosCoefficients[i] / (z + k) / (z + k))
+XlaOp Digamma(XlaOp input) {
+  XlaOp zero = ScalarLike(input, 0);
+  XlaOp one_half = ScalarLike(input, 0.5);
+  XlaOp one = ScalarLike(input, 1);
+
+  XlaOp pi = ScalarLike(input, M_PI);
+
+  XlaOp lanczos_gamma = ScalarLike(input, kLanczosGamma);
+  XlaOp lanczos_gamma_plus_one_half = ScalarLike(input, kLanczosGamma + 0.5);
+  XlaOp log_lanczos_gamma_plus_one_half =
+      ScalarLike(input, std::log(kLanczosGamma + 0.5));
+
+  XlaOp base_lanczos_coeff = ScalarLike(input, kBaseLanczosCoeff);
+
+  // If the input is less than 0.5 use Gauss's reflection formula:
+  // digamma(x) = digamma(1 - x) - pi * cot(pi * x)
+  XlaOp need_to_reflect = Lt(Real(input), one_half);
+  XlaOp z = Select(need_to_reflect, -input, input - one);
+
+  XlaOp num = zero;
+  XlaOp denom = base_lanczos_coeff;
+  for (int i = 0; i < kLanczosCoefficients.size(); ++i) {
+    XlaOp lanczos_coefficient = ScalarLike(input, kLanczosCoefficients[i]);
+    XlaOp index = ScalarLike(input, i);
+    num = num - lanczos_coefficient / ((z + index + one) * (z + index + one));
+    denom = denom + lanczos_coefficient / (z + index + one);
+  }
+
+  // To improve accuracy on platforms with less-precise log implementations,
+  // compute log(lanczos_gamma_plus_one_half) at compile time and use log1p on
+  // the device.
+  // log(t) = log(kLanczosGamma + 0.5 + z)
+  //        = log(kLanczosGamma + 0.5) + log1p(z / (kLanczosGamma + 0.5))
+  XlaOp t = lanczos_gamma_plus_one_half + z;
+  XlaOp log_t =
+      log_lanczos_gamma_plus_one_half + Log1p(z / lanczos_gamma_plus_one_half);
+
+  XlaOp y = log_t + num / denom - lanczos_gamma / t;
+  XlaOp reflection = y - pi * Cos(pi * input) / Sin(pi * input);
+  XlaOp result = Select(need_to_reflect, reflection, y);
+  return result;
+}
+
+// Trigonometric functions.
+
+// acos(x) = 2 * atan(sqrt(1 - x^2) / (1 + x))
+XlaOp Acos(XlaOp x) {
+  return ScalarLike(x, 2.0) *
+         Atan2(Sqrt(ScalarLike(x, 1.0) - x * x), ScalarLike(x, 1.0) + x);
+}
+
+// asin(x) = 2 * atan(x / (1 + sqrt(1 - x^2)))
+XlaOp Asin(XlaOp x) {
+  return ScalarLike(x, 2.0) *
+         Atan2(x, ScalarLike(x, 1.0) + Sqrt(ScalarLike(x, 1.0) - x * x));
+}
+
+XlaOp Atan(XlaOp x) { return Atan2(x, ScalarLike(x, 1.0)); }
+
+XlaOp Tan(XlaOp x) { return Sin(x) / Cos(x); }
+
+// Hyperbolic trigonometric functions.
+
+// acosh(x) = log(x + sqrt(x^2 - 1))
+//          = log(x + sqrt((x+1)*(x-1)))
+XlaOp Acosh(XlaOp x) {
+  return Log(x + Sqrt((x + ScalarLike(x, 1.0)) * (x - ScalarLike(x, 1.0))));
+}
+
+// asinh(x) = log(x + sqrt(x^2 + 1))
+XlaOp Asinh(XlaOp x) { return Log(x + Sqrt(x * x + ScalarLike(x, 1.0))); }
+
+// atanh(x) = 0.5 * log((1 + x) / (1 - x))
+XlaOp Atanh(XlaOp x) {
+  return Log((ScalarLike(x, 1.0) + x) / (ScalarLike(x, 1.0) - x)) *
+         ScalarLike(x, 0.5);
+}
+
+XlaOp Cosh(XlaOp x) { return (Exp(x) + Exp(-x)) * ScalarLike(x, 0.5); }
+
+XlaOp Sinh(XlaOp x) { return (Exp(x) - Exp(-x)) * ScalarLike(x, 0.5); }
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/lib/math.h b/tensorflow/compiler/xla/client/lib/math.h
index e7c8b50273..13db232556 100644
--- a/tensorflow/compiler/xla/client/lib/math.h
+++ b/tensorflow/compiler/xla/client/lib/math.h
@@ -16,7 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_CLIENT_LIB_MATH_H_
 #define TENSORFLOW_COMPILER_XLA_CLIENT_LIB_MATH_H_
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 namespace xla {
 
@@ -46,6 +46,43 @@ XlaOp Erf(XlaOp x);
 // Computes an approximation of the inverse of the error function.
 XlaOp ErfInv(XlaOp x);
 
+// Computes an approximation of the lgamma function.
+XlaOp Lgamma(XlaOp input);
+
+// Computes an approximation of the digamma function.
+XlaOp Digamma(XlaOp input);
+
+// Trigonometric functions
+
+// Computes the arc cosine of 'x'.
+XlaOp Acos(XlaOp x);
+
+// Computes the arc sine of 'x'.
+XlaOp Asin(XlaOp x);
+
+// Computes the arc tangent of 'x'.
+XlaOp Atan(XlaOp x);
+
+// Computes the tangent of 'x'.
+XlaOp Tan(XlaOp x);
+
+// Hyperbolic trigonometric functions
+
+// Computes the inverse hyperbolic cosine of 'x'.
+XlaOp Acosh(XlaOp x);
+
+// Computes the inverse hyperbolic sine of 'x'.
+XlaOp Asinh(XlaOp x);
+
+// Computes the inverse hyperbolic tangent of 'x'.
+XlaOp Atanh(XlaOp x);
+
+// Computes the hyperbolic cosine of 'x'.
+XlaOp Cosh(XlaOp x);
+
+// Computes the hyperbolic sine of 'x'.
+XlaOp Sinh(XlaOp x);
+
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_CLIENT_LIB_MATH_H_
diff --git a/tensorflow/compiler/xla/client/lib/math_test.cc b/tensorflow/compiler/xla/client/lib/math_test.cc
index 1df4e6ea42..14c259a7fa 100644
--- a/tensorflow/compiler/xla/client/lib/math_test.cc
+++ b/tensorflow/compiler/xla/client/lib/math_test.cc
@@ -14,7 +14,8 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/client/lib/math.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
@@ -31,7 +32,7 @@ class MathTest : public ClientLibraryTestBase {
 
 XLA_TEST_F(MathTest, SqrtF32) {
   XlaBuilder builder(TestName());
-  Literal zero_literal = Literal::Zero(PrimitiveType::F32);
+  Literal zero_literal = LiteralUtil::Zero(PrimitiveType::F32);
 
   std::unique_ptr<GlobalData> zero_data =
       client_->TransferToServer(zero_literal).ConsumeValueOrDie();
@@ -81,5 +82,59 @@ XLA_TEST_F(MathTest, SqrtSixValues) {
   std::vector<float> expected = {4, 1, 32, 0.4, 0.4472, 111.1080};
   ComputeAndCompareR1<float>(&builder, expected, {}, error_spec_);
 }
+
+XLA_TEST_F(MathTest, Lgamma) {
+  XlaBuilder builder(TestName());
+  auto x = ConstantR1<float>(&builder, {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.5, 1.5,
+                                        2.5, -1.5, -3.5, -5.5});
+  Lgamma(x);
+
+  std::vector<float> expected = {
+      0,
+      0,
+      static_cast<float>(std::log(2)),
+      static_cast<float>(std::log(6)),
+      static_cast<float>(std::log(24)),
+      static_cast<float>(std::log(120)),
+      static_cast<float>(std::log(M_PI) / 2),
+      static_cast<float>(std::log(M_PI) / 2 - std::log(2)),
+      static_cast<float>(std::log(M_PI) / 2 - std::log(4) + std::log(3)),
+      static_cast<float>(std::log(M_PI) / 2 - std::log(3) + std::log(4)),
+      static_cast<float>(std::log(M_PI) / 2 - std::log(105) + std::log(16)),
+      static_cast<float>(std::log(M_PI) / 2 - std::log(10395) + std::log(64))};
+  error_spec_ = ErrorSpec{0.001};
+  ComputeAndCompareR1<float>(&builder, expected, {}, error_spec_);
+}
+
+XLA_TEST_F(MathTest, Digamma) {
+  XlaBuilder builder(TestName());
+  auto x = ConstantR1<float>(&builder, {1.0, 0.5, 1 / 3.0, 0.25, 1 / 6.0, 0.125,
+                                        2.0, 3.0, 4.0, 6.0, 8.0, 9.0});
+  Digamma(x);
+
+  constexpr double euler_mascheroni =
+      0.57721566490153286060651209008240243104215933593992;
+  std::vector<float> expected = {
+      static_cast<float>(-euler_mascheroni),
+      static_cast<float>(-2 * std::log(2) - euler_mascheroni),
+      static_cast<float>(-M_PI / 2 / std::sqrt(3) - 3 * std::log(3) / 2 -
+                         euler_mascheroni),
+      static_cast<float>(-M_PI / 2 - 3 * std::log(2) - euler_mascheroni),
+      static_cast<float>(-M_PI * std::sqrt(3) / 2 - 2 * std::log(2) -
+                         3 * std::log(3) / 2 - euler_mascheroni),
+      static_cast<float>(
+          -M_PI / 2 - 4 * std::log(2) -
+          (M_PI + std::log(2 + std::sqrt(2)) - std::log(2 - std::sqrt(2))) /
+              std::sqrt(2) -
+          euler_mascheroni),
+      static_cast<float>(1 - euler_mascheroni),
+      static_cast<float>(1.5 - euler_mascheroni),
+      static_cast<float>(11 / 6.0 - euler_mascheroni),
+      static_cast<float>(137 / 60.0 - euler_mascheroni),
+      static_cast<float>(363 / 140.0 - euler_mascheroni),
+      static_cast<float>(761 / 280.0 - euler_mascheroni)};
+  ComputeAndCompareR1<float>(&builder, expected, {}, error_spec_);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/lib/numeric.cc b/tensorflow/compiler/xla/client/lib/numeric.cc
index cbe9e7fdd1..1c91237ae1 100644
--- a/tensorflow/compiler/xla/client/lib/numeric.cc
+++ b/tensorflow/compiler/xla/client/lib/numeric.cc
@@ -13,11 +13,14 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/lib/numeric.h"
-
 #include <numeric>
 #include <vector>
 
+#include "tensorflow/compiler/xla/client/lib/arithmetic.h"
+#include "tensorflow/compiler/xla/client/lib/constants.h"
+#include "tensorflow/compiler/xla/client/lib/numeric.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
+
 namespace xla {
 
 namespace {
@@ -28,7 +31,7 @@ XlaOp MakeIota(XlaBuilder* builder, int64 size) {
   for (int64 i = 0; i < size; ++i) {
     values[i] = static_cast<T>(i);
   }
-  return xla::ConstantR1<T>(builder, values);
+  return ConstantR1<T>(builder, values);
 }
 
 }  // namespace
@@ -68,4 +71,67 @@ XlaOp Iota(XlaBuilder* builder, PrimitiveType type, int64 size) {
   }
 }
 
+XlaOp IdentityMatrix(XlaBuilder* builder, PrimitiveType type, int64 m,
+                     int64 n) {
+  auto a = Iota(builder, type, m);
+  auto b = Iota(builder, type, n);
+  auto indicator = Eq(a, Broadcast(b, {m}), /*broadcast_dimensions=*/{0});
+  return ConvertElementType(indicator, type);
+}
+
+XlaOp GetMatrixDiagonal(XlaOp x) {
+  XlaBuilder* builder = x.builder();
+  return builder->ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    TF_ASSIGN_OR_RETURN(Shape shape, builder->GetShape(x));
+    const int64 n_dims = ShapeUtil::Rank(shape);
+    TF_RET_CHECK(n_dims >= 2);
+    const int64 m = shape.dimensions(n_dims - 2);
+    const int64 n = shape.dimensions(n_dims - 1);
+    tensorflow::gtl::ArraySlice<int64> major_dims(
+        AsInt64Slice(shape.dimensions()), /*pos=*/0, /*len=*/n_dims - 2);
+    auto a = Iota(builder, U32, n);
+    auto b = Iota(builder, U32, m);
+    auto indicator = Eq(b, Broadcast(a, {m}), /*broadcast_dimensions=*/{0});
+    auto mask = Broadcast(indicator, major_dims);
+
+    // TPUs don't support S64 add reduction at the moment. But fortunately
+    // OR-reductions work just as well for integers.
+    XlaComputation reducer =
+        primitive_util::IsIntegralType(shape.element_type())
+            ? CreateScalarOrComputation(shape.element_type(), builder)
+            : CreateScalarAddComputation(shape.element_type(), builder);
+
+    return Reduce(Select(mask, x, Zeros(builder, shape)), ScalarLike(x, 0),
+                  reducer, {m >= n ? n_dims - 2 : n_dims - 1});
+  });
+}
+
+XlaOp Triangle(XlaOp x, bool lower) {
+  XlaBuilder* builder = x.builder();
+  return builder->ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    TF_ASSIGN_OR_RETURN(Shape shape, builder->GetShape(x));
+    const int64 n_dims = ShapeUtil::Rank(shape);
+    TF_RET_CHECK(n_dims >= 2);
+    const int64 m = shape.dimensions(n_dims - 2);
+    const int64 n = shape.dimensions(n_dims - 1);
+    tensorflow::gtl::ArraySlice<int64> major_dims(
+        AsInt64Slice(shape.dimensions()), /*pos=*/0, /*len=*/n_dims - 2);
+    auto a = Iota(builder, U32, n);
+    auto b = Iota(builder, U32, m);
+    xla::XlaOp indicator;
+    if (lower) {
+      indicator = Ge(b, Broadcast(a, {m}), /*broadcast_dimensions=*/{0});
+    } else {
+      indicator = Le(b, Broadcast(a, {m}), /*broadcast_dimensions=*/{0});
+    }
+    auto mask = Broadcast(indicator, major_dims);
+
+    return Select(mask, x, Zeros(builder, shape));
+  });
+}
+
+XlaOp UpperTriangle(XlaOp x) { return Triangle(x, false); }
+
+XlaOp LowerTriangle(XlaOp x) { return Triangle(x, true); }
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/lib/numeric.h b/tensorflow/compiler/xla/client/lib/numeric.h
index 2a409ae311..efd8cdc257 100644
--- a/tensorflow/compiler/xla/client/lib/numeric.h
+++ b/tensorflow/compiler/xla/client/lib/numeric.h
@@ -16,7 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_CLIENT_LIB_NUMERIC_H_
 #define TENSORFLOW_COMPILER_XLA_CLIENT_LIB_NUMERIC_H_
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
@@ -25,6 +25,24 @@ namespace xla {
 // Returns a rank 1 tensor of `type` containing values [0, 1, 2, ...].
 XlaOp Iota(XlaBuilder* builder, PrimitiveType type, int64 size);
 
+// Returns an m x n matrix with 1s on the diagonal elements, zeros everywhere
+// else.
+XlaOp IdentityMatrix(XlaBuilder* builder, PrimitiveType type, int64 m, int64 n);
+
+// Get the diagonals of the last two dimensions. If 'x' has shape
+// [..., M, N], then the output has shape [..., min(M, N)], containing the
+// diagonal elements (i.e., with indices [..., i, i]).
+XlaOp GetMatrixDiagonal(XlaOp x);
+
+// Get the upper or lower triangle part of the last two dimensions
+XlaOp Triangle(XlaOp x, bool lower);
+
+// Get the upper triangle part of the last two dimensions
+XlaOp UpperTriangle(XlaOp x);
+
+// Get the lower triangle part of the last two dimensions
+XlaOp LowerTriangle(XlaOp x);
+
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_CLIENT_LIB_NUMERIC_H_
diff --git a/tensorflow/compiler/xla/client/lib/numeric_test.cc b/tensorflow/compiler/xla/client/lib/numeric_test.cc
index bc8a73e9d7..8a96ec68d2 100644
--- a/tensorflow/compiler/xla/client/lib/numeric_test.cc
+++ b/tensorflow/compiler/xla/client/lib/numeric_test.cc
@@ -14,7 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/client/lib/numeric.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
@@ -24,8 +24,15 @@ limitations under the License.
 namespace xla {
 namespace {
 
-using NumericTest = ClientLibraryTestBase;
+class NumericTest : public ClientLibraryTestBase {
+ protected:
+  template <typename T>
+  void TestMatrixDiagonal();
+};
 
+// TODO(b/64798317): Delete this test case once xla::IotaGen is converted to
+// xla::Iota. This test is already implemented for xla::IotaGen in
+// xla/tests/iota_test.cc.
 XLA_TEST_F(NumericTest, Iota) {
   XlaBuilder builder(TestName());
   Iota(&builder, S32, 10);
@@ -33,5 +40,39 @@ XLA_TEST_F(NumericTest, Iota) {
   ComputeAndCompareR1<int32>(&builder, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {});
 }
 
+XLA_TEST_F(NumericTest, Triangle) {
+  XlaBuilder builder(TestName());
+  Array3D<int32> input(2, 3, 4);
+  input.FillIota(0);
+
+  XlaOp a;
+  auto a_data = CreateR3Parameter<int32>(input, 0, "a", &builder, &a);
+  LowerTriangle(a);
+  Array3D<int32> expected({{{0, 0, 0, 0}, {4, 5, 0, 0}, {8, 9, 10, 0}},
+                           {{12, 0, 0, 0}, {16, 17, 0, 0}, {20, 21, 22, 0}}});
+
+  ComputeAndCompareR3<int32>(&builder, expected, {a_data.get()});
+}
+
+template <typename T>
+void NumericTest::TestMatrixDiagonal() {
+  XlaBuilder builder("GetMatrixDiagonal");
+  Array3D<T> input(2, 3, 4);
+  input.FillIota(0);
+
+  XlaOp a;
+  auto a_data = CreateR3Parameter<T>(input, 0, "a", &builder, &a);
+  GetMatrixDiagonal(a);
+  Array2D<T> expected({{0, 5, 10}, {12, 17, 22}});
+
+  ComputeAndCompareR2<T>(&builder, expected, {a_data.get()});
+}
+
+XLA_TEST_F(NumericTest, GetMatrixDiagonal_S32) { TestMatrixDiagonal<int32>(); }
+
+XLA_TEST_F(NumericTest, GetMatrixDiagonal_S64) { TestMatrixDiagonal<int64>(); }
+
+XLA_TEST_F(NumericTest, GetMatrixDiagonal_F32) { TestMatrixDiagonal<float>(); }
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/lib/prng.cc b/tensorflow/compiler/xla/client/lib/prng.cc
new file mode 100644
index 0000000000..3a744148fb
--- /dev/null
+++ b/tensorflow/compiler/xla/client/lib/prng.cc
@@ -0,0 +1,150 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <cmath>
+
+#include "tensorflow/compiler/xla/client/lib/constants.h"
+#include "tensorflow/compiler/xla/client/lib/math.h"
+#include "tensorflow/compiler/xla/client/lib/numeric.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/core/casts.h"
+
+namespace xla {
+namespace {
+
+// Rotates a 32-bit integer 'v' left by 'distance' bits.
+XlaOp RotateLeftS32(XlaOp v, int distance) {
+  return (v << ConstantR0<int32>(v.builder(), distance)) |
+         ShiftRightLogical(v, ConstantR0<int32>(v.builder(), 32 - distance));
+}
+
+using ThreeFry2x32State = std::array<XlaOp, 2>;
+
+// Implements the ThreeFry counter-based PRNG algorithm.
+// Salmon et al. SC 2011. Parallel random numbers: as easy as 1, 2, 3.
+// http://www.thesalmons.org/john/random123/papers/random123sc11.pdf
+ThreeFry2x32State ThreeFry2x32(ThreeFry2x32State input, ThreeFry2x32State key) {
+  XlaBuilder* builder = input[0].builder();
+  // Rotation distances specified by the Threefry2x32 algorithm.
+  constexpr std::array<int, 8> rotations = {13, 15, 26, 6, 17, 29, 16, 24};
+  ThreeFry2x32State x;
+
+  std::array<XlaOp, 3> ks;
+  // 0x1BD11BDA is a parity constant specified by the ThreeFry2x32 algorithm.
+  ks[2] = ConstantR0<int32>(builder, 0x1BD11BDA);
+  for (int i = 0; i < 2; ++i) {
+    ks[i] = key[i];
+    x[i] = input[i];
+    ks[2] = ks[2] ^ key[i];
+  }
+
+  x[0] = x[0] + ks[0];
+  x[1] = x[1] + ks[1];
+
+  // Performs a single round of the Threefry2x32 algorithm, with a rotation
+  // amount 'rotation'.
+  auto round = [builder](ThreeFry2x32State v, int rotation) {
+    v[0] = v[0] + v[1];
+    v[1] = RotateLeftS32(v[1], rotation);
+    v[1] = v[0] ^ v[1];
+    return v;
+  };
+
+  // There are no known statistical flaws with 13 rounds of Threefry2x32.
+  // We are conservative and use 20 rounds.
+  x = round(x, rotations[0]);
+  x = round(x, rotations[1]);
+  x = round(x, rotations[2]);
+  x = round(x, rotations[3]);
+  x[0] = x[0] + ks[1];
+  x[1] = x[1] + ks[2] + ConstantR0<int32>(builder, 1);
+
+  x = round(x, rotations[4]);
+  x = round(x, rotations[5]);
+  x = round(x, rotations[6]);
+  x = round(x, rotations[7]);
+  x[0] = x[0] + ks[2];
+  x[1] = x[1] + ks[0] + ConstantR0<int32>(builder, 2);
+
+  x = round(x, rotations[0]);
+  x = round(x, rotations[1]);
+  x = round(x, rotations[2]);
+  x = round(x, rotations[3]);
+  x[0] = x[0] + ks[0];
+  x[1] = x[1] + ks[1] + ConstantR0<int32>(builder, 3);
+
+  x = round(x, rotations[4]);
+  x = round(x, rotations[5]);
+  x = round(x, rotations[6]);
+  x = round(x, rotations[7]);
+  x[0] = x[0] + ks[1];
+  x[1] = x[1] + ks[2] + ConstantR0<int32>(builder, 4);
+
+  x = round(x, rotations[0]);
+  x = round(x, rotations[1]);
+  x = round(x, rotations[2]);
+  x = round(x, rotations[3]);
+  x[0] = x[0] + ks[2];
+  x[1] = x[1] + ks[0] + ConstantR0<int32>(builder, 5);
+
+  return x;
+}
+
+}  // namespace
+
+XlaOp StatelessRngUniform(std::array<XlaOp, 2> seeds, const Shape& shape,
+                          XlaOp minval, XlaOp maxval) {
+  XlaBuilder* builder = seeds[0].builder();
+  if (shape.element_type() != F32) {
+    return builder->ReportError(Unimplemented(
+        "Types other than F32 are not implemented by StatelessRngUniform."));
+  }
+  ThreeFry2x32State key = seeds;
+  const int64 size = ShapeUtil::ElementsIn(shape);
+
+  const int64 half_size = CeilOfRatio<int64>(size, 2);
+  const bool size_is_odd = (half_size * 2 != size);
+
+  // Fill the generator inputs with unique counter values.
+  ThreeFry2x32State inputs;
+  inputs[0] = Iota(builder, S32, half_size);
+  inputs[1] = inputs[0] + ConstantR0<int32>(builder, half_size);
+  ThreeFry2x32State outputs = ThreeFry2x32(inputs, key);
+
+  if (size_is_odd) {
+    outputs[1] = Slice(outputs[1], {0}, {half_size - 1}, {1});
+  }
+
+  auto bits = Reshape(ConcatInDim(builder, outputs, 0),
+                      AsInt64Slice(shape.dimensions()));
+
+  // Form 23 random mantissa bits, with a leading 1 bit. The leading 1 bit
+  // forces the random bits into the mantissa.
+  constexpr int kFloatBits = 32;
+  constexpr int kMantissaBits = 23;
+  bits = ShiftRightLogical(
+             bits, ConstantR0<int32>(builder, kFloatBits - kMantissaBits)) |
+         ConstantR0<int32>(builder, tensorflow::bit_cast<int32>(1.0f));
+  auto floats = BitcastConvertType(bits, F32);
+
+  // We have a floating point number in the range [1.0, 2.0).
+  // Subtract 1.0f to shift to the range [0.0, 1.0)
+  floats = floats - ConstantR0<float>(builder, 1.0f);
+  // Multiply and add to shift to the range [minval, maxval).
+  return floats * (maxval - minval) + minval;
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/client/lib/prng.h b/tensorflow/compiler/xla/client/lib/prng.h
new file mode 100644
index 0000000000..ad000b1fa1
--- /dev/null
+++ b/tensorflow/compiler/xla/client/lib/prng.h
@@ -0,0 +1,34 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_CLIENT_LIB_PRNG_H_
+#define TENSORFLOW_COMPILER_XLA_CLIENT_LIB_PRNG_H_
+
+#include <array>
+
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+
+namespace xla {
+
+// Returns a tensor containing 'shape' random values uniformly distributed in
+// the range [minval, maxval). Requires 2 32-bit integer seeds.
+// Currently only 'shape's of type F32 are implemented.
+XlaOp StatelessRngUniform(std::array<XlaOp, 2> seeds, const Shape& shape,
+                          XlaOp minval, XlaOp maxval);
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_CLIENT_LIB_PRNG_H_
diff --git a/tensorflow/compiler/xla/client/lib/sorting.cc b/tensorflow/compiler/xla/client/lib/sorting.cc
new file mode 100644
index 0000000000..a904be259a
--- /dev/null
+++ b/tensorflow/compiler/xla/client/lib/sorting.cc
@@ -0,0 +1,46 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/client/lib/sorting.h"
+#include "tensorflow/compiler/xla/client/lib/numeric.h"
+
+namespace xla {
+
+XlaOp TopK(XlaOp input, int64 k) {
+  XlaBuilder* const builder = input.builder();
+  return builder->ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    TF_ASSIGN_OR_RETURN(Shape input_shape, builder->GetShape(input));
+    int last_dim = input_shape.dimensions_size() - 1;
+    int last_dim_size = input_shape.dimensions(last_dim);
+
+    XlaOp iota_s32 = Iota(builder, S32, last_dim_size);
+    auto input_dims = input_shape.dimensions();
+    std::vector<int64> broadcast_dims(input_dims.begin(), input_dims.end() - 1);
+    XlaOp broadcast_s32 = Broadcast(iota_s32, broadcast_dims);
+    XlaOp sort_result = Sort(Neg(input), broadcast_s32);
+    std::vector<int64> start_indices(input_shape.dimensions_size(), 0);
+    std::vector<int64> limit_indices(input_dims.begin(), input_dims.end());
+    limit_indices[last_dim] = k;
+    std::vector<int64> strides(input_shape.dimensions_size(), 1);
+
+    XlaOp values = Neg(Slice(GetTupleElement(sort_result, 0), start_indices,
+                             limit_indices, strides));
+    XlaOp indices = Slice(GetTupleElement(sort_result, 1), start_indices,
+                          limit_indices, strides);
+    return Tuple(builder, {values, indices});
+  });
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/client/lib/sorting.h b/tensorflow/compiler/xla/client/lib/sorting.h
new file mode 100644
index 0000000000..404b4783c3
--- /dev/null
+++ b/tensorflow/compiler/xla/client/lib/sorting.h
@@ -0,0 +1,31 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_CLIENT_LIB_SORTING_H_
+#define TENSORFLOW_COMPILER_XLA_CLIENT_LIB_SORTING_H_
+
+#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+
+namespace xla {
+
+// Returns a tuple composed of the top `k` values and corresponding indices in
+// `input`.  Output values are in descending order, from largest to smallest.
+XlaOp TopK(XlaOp input, int64 k);
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_CLIENT_LIB_SORTING_H_
diff --git a/tensorflow/compiler/xla/client/lib/sorting_test.cc b/tensorflow/compiler/xla/client/lib/sorting_test.cc
new file mode 100644
index 0000000000..b6eee762a5
--- /dev/null
+++ b/tensorflow/compiler/xla/client/lib/sorting_test.cc
@@ -0,0 +1,60 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/client/lib/sorting.h"
+#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/client_library_test_base.h"
+#include "tensorflow/compiler/xla/tests/test_macros.h"
+#include "tensorflow/compiler/xla/types.h"
+
+namespace xla {
+namespace {
+
+using SortingTest = ClientLibraryTestBase;
+
+XLA_TEST_F(SortingTest, TopK3From8Values) {
+  XlaBuilder builder(TestName());
+  auto x =
+      ConstantR1<float>(&builder, {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0});
+  xla::GetTupleElement(xla::TopK(x, 3), 0);
+  ComputeAndCompareR1<float>(&builder, {7.0, 6.0, 5.0}, {});
+}
+
+XLA_TEST_F(SortingTest, TopK3From8Indices) {
+  XlaBuilder builder(TestName());
+  auto x_rev =
+      ConstantR1<float>(&builder, {7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0});
+  xla::GetTupleElement(xla::TopK(x_rev, 3), 1);
+  ComputeAndCompareR1<int>(&builder, {0, 1, 2}, {});
+}
+
+XLA_TEST_F(SortingTest, TopKFullSort) {
+  XlaBuilder builder(TestName());
+  const int kSize = 16;
+  std::mt19937 eng;
+  std::uniform_real_distribution<float> u_dist(0.0, 100.0);
+  auto gen = std::bind(u_dist, eng);
+  std::vector<float> inputs(kSize);
+  std::generate(inputs.begin(), inputs.end(), gen);
+  auto x = ConstantR1<float>(&builder, inputs);
+  xla::GetTupleElement(xla::TopK(x, kSize), 0);
+
+  std::sort(inputs.begin(), inputs.end(), std::greater<float>());
+  ComputeAndCompareR1<float>(&builder, inputs, {});
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/client/lib/testing.cc b/tensorflow/compiler/xla/client/lib/testing.cc
index 731ad13b8d..b1a776b8b8 100644
--- a/tensorflow/compiler/xla/client/lib/testing.cc
+++ b/tensorflow/compiler/xla/client/lib/testing.cc
@@ -15,9 +15,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/lib/testing.h"
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/tests/test_utils.h"
@@ -49,7 +49,7 @@ int64 DataSizeOfShape(const Shape& shape) {
 XlaOp BuildFakeDataOpOnDevice(const Shape& shape, XlaBuilder* builder) {
   if (ShapeUtil::IsArray(shape)) {
     return Broadcast(
-        ConstantLiteral(builder, Literal::One(shape.element_type())),
+        ConstantLiteral(builder, LiteralUtil::One(shape.element_type())),
         AsInt64Slice(shape.dimensions()));
   }
   std::vector<XlaOp> parts;
diff --git a/tensorflow/compiler/xla/client/lib/testing.h b/tensorflow/compiler/xla/client/lib/testing.h
index dc613099e2..03695ce2a3 100644
--- a/tensorflow/compiler/xla/client/lib/testing.h
+++ b/tensorflow/compiler/xla/client/lib/testing.h
@@ -21,7 +21,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 
 namespace xla {
diff --git a/tensorflow/compiler/xla/client/local_client.cc b/tensorflow/compiler/xla/client/local_client.cc
index 5f9710914b..8a6c5fb9a7 100644
--- a/tensorflow/compiler/xla/client/local_client.cc
+++ b/tensorflow/compiler/xla/client/local_client.cc
@@ -18,10 +18,12 @@ limitations under the License.
 #include <utility>
 
 #include "llvm/ADT/Triple.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/backend.h"
 #include "tensorflow/compiler/xla/service/service_executable_run_options.h"
 #include "tensorflow/compiler/xla/service/source_map_util.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 
 using xla::source_map_util::InvalidParameterArgument;
@@ -29,8 +31,8 @@ using xla::source_map_util::InvalidParameterArgument;
 namespace xla {
 
 namespace {
-StatusOr<Backend::StreamPtr> BorrowStreamForDevice(int device_ordinal,
-                                                   Backend* backend) {
+StatusOr<StreamPool::Ptr> BorrowStreamForDevice(int device_ordinal,
+                                                Backend* backend) {
   if (device_ordinal < 0) {
     device_ordinal = backend->default_device_ordinal();
   }
@@ -99,11 +101,14 @@ Status LocalExecutable::ValidateExecutionOptions(
     }
   }
 
-  // Verify that the device the executable was built for is equivalent to the
-  // device it will run on.
-  int run_device_ordinal = run_options.device_ordinal() == -1
-                               ? backend_->default_device_ordinal()
-                               : run_options.device_ordinal();
+  // Verify that the device the executable was built for is equivalent
+  // to the device it will run on.
+  int run_device_ordinal = run_options.device_ordinal();
+  if (run_device_ordinal == -1) {
+    run_device_ordinal = run_options.stream() != nullptr
+                             ? run_options.stream()->parent()->device_ordinal()
+                             : backend_->default_device_ordinal();
+  }
   TF_ASSIGN_OR_RETURN(bool devices_equivalent,
                       backend_->devices_equivalent(
                           run_device_ordinal, build_options_.device_ordinal()));
@@ -141,7 +146,7 @@ StatusOr<ScopedShapedBuffer> LocalExecutable::Run(
   TF_RETURN_IF_ERROR(
       ValidateExecutionOptions(arguments, run_options, *backend_));
 
-  Backend::StreamPtr stream;
+  StreamPool::Ptr stream;
   if (run_options.stream() == nullptr) {
     // NB!  The lifetime of `stream` needs to match the lifetime of
     // `actual_options` (otherwise we will end up using a returned stream in
diff --git a/tensorflow/compiler/xla/client/local_client.h b/tensorflow/compiler/xla/client/local_client.h
index 4d9e0d7cd9..ae23809261 100644
--- a/tensorflow/compiler/xla/client/local_client.h
+++ b/tensorflow/compiler/xla/client/local_client.h
@@ -20,7 +20,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/executable_build_options.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/executable_run_options.h"
 #include "tensorflow/compiler/xla/service/compiler.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder.cc b/tensorflow/compiler/xla/client/xla_builder.cc
index 09e7e87918..1cb61f77fb 100644
--- a/tensorflow/compiler/xla/client/xla_client/xla_builder.cc
+++ b/tensorflow/compiler/xla/client/xla_builder.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 #include <functional>
 #include <numeric>
@@ -22,6 +22,7 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/client/sharding_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/shape_inference.h"
@@ -736,14 +737,22 @@ void XlaBuilder::Trace(const string& tag, const XlaOp& operand) {
   ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
     HloInstructionProto instr;
     *instr.mutable_shape() = ShapeUtil::MakeNil();
-    *instr.mutable_literal() = Literal::CreateR1U8(tag)->ToProto();
+    *instr.mutable_literal() = LiteralUtil::CreateR1U8(tag)->ToProto();
     return AddInstruction(std::move(instr), HloOpcode::kTrace, {operand});
   });
 }
 
 XlaOp XlaBuilder::Select(const XlaOp& pred, const XlaOp& on_true,
                          const XlaOp& on_false) {
-  return TernaryOp(HloOpcode::kSelect, pred, on_true, on_false);
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    TF_ASSIGN_OR_RETURN(const Shape& true_shape, GetShape(on_true));
+    TF_ASSIGN_OR_RETURN(const Shape& false_shape, GetShape(on_false));
+    TF_RET_CHECK(ShapeUtil::IsTuple(true_shape) ==
+                 ShapeUtil::IsTuple(false_shape));
+    HloOpcode opcode = ShapeUtil::IsTuple(true_shape) ? HloOpcode::kTupleSelect
+                                                      : HloOpcode::kSelect;
+    return TernaryOp(opcode, pred, on_true, on_false);
+  });
 }
 
 XlaOp XlaBuilder::Tuple(tensorflow::gtl::ArraySlice<XlaOp> elements) {
@@ -1109,6 +1118,35 @@ XlaOp XlaBuilder::Infeed(const Shape& shape, const string& config) {
   });
 }
 
+XlaOp XlaBuilder::InfeedWithToken(const XlaOp& token, const Shape& shape,
+                                  const string& config) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    HloInstructionProto instr;
+    if (!LayoutUtil::HasLayout(shape)) {
+      return InvalidArgument("Given shape to Infeed must have a layout");
+    }
+    const Shape infeed_instruction_shape =
+        ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeTokenShape()});
+    *instr.mutable_shape() = infeed_instruction_shape;
+    instr.set_infeed_config(config);
+
+    if (ShapeUtil::IsArray(shape) && sharding() &&
+        sharding()->type() == OpSharding::Type::OpSharding_Type_OTHER) {
+      // TODO(b/110793772): Support tiled array-shaped infeeds.
+      return InvalidArgument(
+          "Tiled sharding is not yet supported for array-shaped infeeds");
+    }
+
+    if (sharding() &&
+        sharding()->type() == OpSharding::Type::OpSharding_Type_REPLICATED) {
+      return InvalidArgument(
+          "Replicated sharding is not yet supported for infeeds");
+    }
+
+    return AddInstruction(std::move(instr), HloOpcode::kInfeed, {token});
+  });
+}
+
 void XlaBuilder::Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
                          const string& outfeed_config) {
   ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
@@ -1154,6 +1192,53 @@ void XlaBuilder::Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
   });
 }
 
+XlaOp XlaBuilder::OutfeedWithToken(const XlaOp& operand, const XlaOp& token,
+                                   const Shape& shape_with_layout,
+                                   const string& outfeed_config) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    HloInstructionProto instr;
+
+    *instr.mutable_shape() = ShapeUtil::MakeTokenShape();
+
+    // Check and set outfeed shape.
+    if (!LayoutUtil::HasLayout(shape_with_layout)) {
+      return InvalidArgument("Given shape to Outfeed must have a layout");
+    }
+    TF_ASSIGN_OR_RETURN(const Shape& operand_shape, GetShape(operand));
+    if (!ShapeUtil::Compatible(operand_shape, shape_with_layout)) {
+      return InvalidArgument(
+          "Outfeed shape %s must be compatible with operand shape %s",
+          ShapeUtil::HumanStringWithLayout(shape_with_layout).c_str(),
+          ShapeUtil::HumanStringWithLayout(operand_shape).c_str());
+    }
+    *instr.mutable_outfeed_shape() = shape_with_layout;
+
+    instr.set_outfeed_config(outfeed_config);
+
+    return AddInstruction(std::move(instr), HloOpcode::kOutfeed,
+                          {operand, token});
+  });
+}
+
+XlaOp XlaBuilder::CreateToken() {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    HloInstructionProto instr;
+    *instr.mutable_shape() = ShapeUtil::MakeTokenShape();
+    return AddInstruction(std::move(instr), HloOpcode::kAfterAll);
+  });
+}
+
+XlaOp XlaBuilder::AfterAll(tensorflow::gtl::ArraySlice<XlaOp> tokens) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    if (tokens.empty()) {
+      return InvalidArgument("AfterAll requires at least one operand");
+    }
+    HloInstructionProto instr;
+    *instr.mutable_shape() = ShapeUtil::MakeTokenShape();
+    return AddInstruction(std::move(instr), HloOpcode::kAfterAll, tokens);
+  });
+}
+
 XlaOp XlaBuilder::CustomCall(const string& call_target_name,
                              tensorflow::gtl::ArraySlice<XlaOp> operands,
                              const Shape& shape) {
@@ -1357,7 +1442,8 @@ XlaOp XlaBuilder::Rev(const XlaOp& operand,
   });
 }
 
-XlaOp XlaBuilder::Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values) {
+XlaOp XlaBuilder::Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values,
+                       int64 dimension) {
   return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
     HloInstructionProto instr;
     std::vector<const Shape*> operand_shape_ptrs;
@@ -1371,6 +1457,11 @@ XlaOp XlaBuilder::Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values) {
     TF_ASSIGN_OR_RETURN(*instr.mutable_shape(),
                         ShapeInference::InferVariadicOpShape(
                             HloOpcode::kSort, operand_shape_ptrs));
+    if (dimension == -1) {
+      TF_ASSIGN_OR_RETURN(const Shape& keys_shape, GetShape(keys));
+      dimension = ShapeUtil::Rank(keys_shape) - 1;
+    }
+    instr.add_dimensions(dimension);
     return values.has_value()
                ? AddInstruction(std::move(instr), HloOpcode::kSort,
                                 {keys, *values})
@@ -1544,6 +1635,32 @@ XlaOp XlaBuilder::Gather(const XlaOp& input, const XlaOp& gather_indices,
   });
 }
 
+XlaOp XlaBuilder::Scatter(const XlaOp& input, const XlaOp& scatter_indices,
+                          const XlaOp& updates,
+                          const XlaComputation& update_computation,
+                          const ScatterDimensionNumbers& dimension_numbers) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    HloInstructionProto instr;
+
+    TF_ASSIGN_OR_RETURN(const Shape& input_shape, GetShape(input));
+    TF_ASSIGN_OR_RETURN(const Shape& scatter_indices_shape,
+                        GetShape(scatter_indices));
+    TF_ASSIGN_OR_RETURN(const Shape& updates_shape, GetShape(updates));
+    TF_ASSIGN_OR_RETURN(const ProgramShape& to_apply_shape,
+                        update_computation.GetProgramShape());
+    TF_ASSIGN_OR_RETURN(*instr.mutable_shape(),
+                        ShapeInference::InferScatterShape(
+                            input_shape, scatter_indices_shape, updates_shape,
+                            to_apply_shape, dimension_numbers));
+
+    *instr.mutable_scatter_dimension_numbers() = dimension_numbers;
+
+    AddCalledComputation(update_computation, &instr);
+    return AddInstruction(std::move(instr), HloOpcode::kScatter,
+                          {input, scatter_indices, updates});
+  });
+}
+
 XlaOp XlaBuilder::Conditional(const XlaOp& predicate, const XlaOp& true_operand,
                               const XlaComputation& true_computation,
                               const XlaOp& false_operand,
@@ -1590,7 +1707,7 @@ XlaOp XlaBuilder::Reduce(
 
     TF_ASSIGN_OR_RETURN(*instr.mutable_shape(),
                         ShapeInference::InferReduceShape(
-                            operand_shape, init_shape, dimensions_to_reduce,
+                            {&operand_shape, &init_shape}, dimensions_to_reduce,
                             called_program_shape));
 
     for (int64 dim : dimensions_to_reduce) {
@@ -1755,10 +1872,6 @@ XlaOp XlaBuilder::CrossReplicaSum(
     tensorflow::gtl::ArraySlice<int64> replica_group_ids,
     const tensorflow::gtl::optional<ChannelHandle>& channel_id) {
   return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
-    if (channel_id.has_value()) {
-      return Unimplemented("channel_id is not supported in AllReduce");
-    }
-
     HloInstructionProto instr;
     TF_ASSIGN_OR_RETURN(const Shape& operand_shape, GetShape(operand));
     TF_ASSIGN_OR_RETURN(
@@ -1768,6 +1881,10 @@ XlaOp XlaBuilder::CrossReplicaSum(
       instr.add_replica_group_ids(replica_group_id);
     }
 
+    if (channel_id.has_value()) {
+      instr.set_all_reduce_id(channel_id->handle());
+    }
+
     AddCalledComputation(computation, &instr);
 
     return AddInstruction(std::move(instr), HloOpcode::kCrossReplicaSum,
@@ -1850,18 +1967,30 @@ void XlaBuilder::Send(const XlaOp& operand, const ChannelHandle& handle) {
     TF_ASSIGN_OR_RETURN(XlaOp token, AddInstruction(std::move(token_instr),
                                                     HloOpcode::kAfterAll, {}));
 
-    // Send instruction produces a tuple of {aliased operand, U32 context}.
+    return SendWithToken(operand, token, handle);
+  });
+}
+
+XlaOp XlaBuilder::SendWithToken(const XlaOp& operand, const XlaOp& token,
+                                const ChannelHandle& handle) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    if (handle.type() != ChannelHandle::DEVICE_TO_DEVICE) {
+      return InvalidArgument("Send must use a device-to-device channel");
+    }
+
+    // Send instruction produces a tuple of {aliased operand, U32 context,
+    // token}.
     HloInstructionProto send_instr;
     TF_ASSIGN_OR_RETURN(const Shape& shape, GetShape(operand));
-    *send_instr.mutable_shape() =
-        ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeShape(U32, {})});
+    *send_instr.mutable_shape() = ShapeUtil::MakeTupleShape(
+        {shape, ShapeUtil::MakeShape(U32, {}), ShapeUtil::MakeTokenShape()});
     send_instr.set_channel_id(handle.handle());
     TF_ASSIGN_OR_RETURN(XlaOp send,
                         AddInstruction(std::move(send_instr), HloOpcode::kSend,
                                        {operand, token}));
 
     HloInstructionProto send_done_instr;
-    *send_done_instr.mutable_shape() = ShapeUtil::MakeNil();
+    *send_done_instr.mutable_shape() = ShapeUtil::MakeTokenShape();
     send_done_instr.set_channel_id(handle.handle());
     return AddInstruction(std::move(send_done_instr), HloOpcode::kSendDone,
                           {send});
@@ -1879,22 +2008,128 @@ XlaOp XlaBuilder::Recv(const Shape& shape, const ChannelHandle& handle) {
     TF_ASSIGN_OR_RETURN(XlaOp token, AddInstruction(std::move(token_instr),
                                                     HloOpcode::kAfterAll, {}));
 
-    // Recv instruction produces a tuple of {receive buffer, U32 context}.
+    XlaOp recv = RecvWithToken(token, shape, handle);
+
+    // The RecvDone instruction produces a tuple of the data and a token
+    // type. Return XLA op containing the data.
+    // TODO(b/80000000): Remove this when clients have been updated to handle
+    // tokens.
+    HloInstructionProto recv_data;
+    *recv_data.mutable_shape() = shape;
+    recv_data.set_tuple_index(0);
+    return AddInstruction(std::move(recv_data), HloOpcode::kGetTupleElement,
+                          {recv});
+  });
+}
+
+XlaOp XlaBuilder::RecvWithToken(const XlaOp& token, const Shape& shape,
+                                const ChannelHandle& handle) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    if (handle.type() != ChannelHandle::DEVICE_TO_DEVICE) {
+      return InvalidArgument("Recv must use a device-to-device channel");
+    }
+
+    // Recv instruction produces a tuple of {receive buffer, U32 context,
+    // token}.
     HloInstructionProto recv_instr;
-    *recv_instr.mutable_shape() =
-        ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeShape(U32, {})});
+    *recv_instr.mutable_shape() = ShapeUtil::MakeTupleShape(
+        {shape, ShapeUtil::MakeShape(U32, {}), ShapeUtil::MakeTokenShape()});
     recv_instr.set_channel_id(handle.handle());
     TF_ASSIGN_OR_RETURN(XlaOp recv, AddInstruction(std::move(recv_instr),
                                                    HloOpcode::kRecv, {token}));
 
     HloInstructionProto recv_done_instr;
-    *recv_done_instr.mutable_shape() = shape;
+    *recv_done_instr.mutable_shape() =
+        ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeTokenShape()});
     recv_done_instr.set_channel_id(handle.handle());
     return AddInstruction(std::move(recv_done_instr), HloOpcode::kRecvDone,
                           {recv});
   });
 }
 
+XlaOp XlaBuilder::SendToHost(const XlaOp& operand, const XlaOp& token,
+                             const Shape& shape_with_layout,
+                             const ChannelHandle& handle) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    if (!LayoutUtil::HasLayout(shape_with_layout)) {
+      return InvalidArgument("Shape passed to SendToHost must have a layout");
+    }
+    TF_ASSIGN_OR_RETURN(const Shape& operand_shape, GetShape(operand));
+    if (!ShapeUtil::Compatible(operand_shape, shape_with_layout)) {
+      return InvalidArgument(
+          "SendToHost shape %s must be compatible with operand shape %s",
+          ShapeUtil::HumanStringWithLayout(shape_with_layout).c_str(),
+          ShapeUtil::HumanStringWithLayout(operand_shape).c_str());
+    }
+    // TODO(b/111544877): Support tuple shapes.
+    if (!ShapeUtil::IsArray(operand_shape)) {
+      return InvalidArgument("SendToHost only supports array shapes, shape: %s",
+                             ShapeUtil::HumanString(operand_shape).c_str());
+    }
+
+    if (handle.type() != ChannelHandle::DEVICE_TO_HOST) {
+      return InvalidArgument("SendToHost must use a device-to-host channel");
+    }
+
+    // Send instruction produces a tuple of {aliased operand, U32 context,
+    // token}.
+    HloInstructionProto send_instr;
+    *send_instr.mutable_shape() = ShapeUtil::MakeTupleShape(
+        {shape_with_layout, ShapeUtil::MakeShape(U32, {}),
+         ShapeUtil::MakeTokenShape()});
+    send_instr.set_channel_id(handle.handle());
+    send_instr.set_is_host_transfer(true);
+    TF_ASSIGN_OR_RETURN(XlaOp send,
+                        AddInstruction(std::move(send_instr), HloOpcode::kSend,
+                                       {operand, token}));
+
+    HloInstructionProto send_done_instr;
+    *send_done_instr.mutable_shape() = ShapeUtil::MakeTokenShape();
+    send_done_instr.set_channel_id(handle.handle());
+    send_done_instr.set_is_host_transfer(true);
+    return AddInstruction(std::move(send_done_instr), HloOpcode::kSendDone,
+                          {send});
+  });
+}
+
+XlaOp XlaBuilder::RecvFromHost(const XlaOp& token, const Shape& shape,
+                               const ChannelHandle& handle) {
+  return ReportErrorOrReturn([&]() -> StatusOr<XlaOp> {
+    if (!LayoutUtil::HasLayout(shape)) {
+      return InvalidArgument("Shape passed to RecvFromHost must have a layout");
+    }
+
+    // TODO(b/111544877): Support tuple shapes.
+    if (!ShapeUtil::IsArray(shape)) {
+      return InvalidArgument(
+          "RecvFromHost only supports array shapes, shape: %s",
+          ShapeUtil::HumanString(shape).c_str());
+    }
+
+    if (handle.type() != ChannelHandle::HOST_TO_DEVICE) {
+      return InvalidArgument("RecvFromHost must use a host-to-device channel");
+    }
+
+    // Recv instruction produces a tuple of {receive buffer, U32 context,
+    // token}.
+    HloInstructionProto recv_instr;
+    *recv_instr.mutable_shape() = ShapeUtil::MakeTupleShape(
+        {shape, ShapeUtil::MakeShape(U32, {}), ShapeUtil::MakeTokenShape()});
+    recv_instr.set_channel_id(handle.handle());
+    recv_instr.set_is_host_transfer(true);
+    TF_ASSIGN_OR_RETURN(XlaOp recv, AddInstruction(std::move(recv_instr),
+                                                   HloOpcode::kRecv, {token}));
+
+    HloInstructionProto recv_done_instr;
+    *recv_done_instr.mutable_shape() =
+        ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeTokenShape()});
+    recv_done_instr.set_channel_id(handle.handle());
+    recv_done_instr.set_is_host_transfer(true);
+    return AddInstruction(std::move(recv_done_instr), HloOpcode::kRecvDone,
+                          {recv});
+  });
+}
+
 StatusOr<bool> XlaBuilder::IsConstant(const XlaOp& operand) const {
   TF_RETURN_IF_ERROR(first_error_);
 
@@ -2543,8 +2778,9 @@ XlaOp Rev(const XlaOp& operand, tensorflow::gtl::ArraySlice<int64> dimensions) {
   return operand.builder()->Rev(operand, dimensions);
 }
 
-XlaOp Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values) {
-  return keys.builder()->Sort(keys, std::move(values));
+XlaOp Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values,
+           int64 dimension) {
+  return keys.builder()->Sort(keys, std::move(values), dimension);
 }
 
 XlaOp Clamp(const XlaOp& min, const XlaOp& operand, const XlaOp& max) {
@@ -2593,6 +2829,13 @@ XlaOp Gather(const XlaOp& input, const XlaOp& gather_indices,
                                  window_bounds);
 }
 
+XlaOp Scatter(const XlaOp& input, const XlaOp& scatter_indices,
+              const XlaOp& updates, const XlaComputation& update_computation,
+              const ScatterDimensionNumbers& dimension_numbers) {
+  return input.builder()->Scatter(input, scatter_indices, updates,
+                                  update_computation, dimension_numbers);
+}
+
 void Send(const XlaOp& operand, const ChannelHandle& handle) {
   return operand.builder()->Send(operand, handle);
 }
@@ -2602,6 +2845,45 @@ XlaOp Recv(XlaBuilder* builder, const Shape& shape,
   return builder->Recv(shape, handle);
 }
 
+XlaOp SendWithToken(const XlaOp& operand, const XlaOp& token,
+                    const ChannelHandle& handle) {
+  return operand.builder()->SendWithToken(operand, token, handle);
+}
+
+XlaOp RecvWithToken(const XlaOp& token, const Shape& shape,
+                    const ChannelHandle& handle) {
+  return token.builder()->RecvWithToken(token, shape, handle);
+}
+
+XlaOp SendToHost(const XlaOp& operand, const XlaOp& token,
+                 const Shape& shape_with_layout, const ChannelHandle& handle) {
+  return operand.builder()->SendToHost(operand, token, shape_with_layout,
+                                       handle);
+}
+
+XlaOp RecvFromHost(const XlaOp& token, const Shape& shape,
+                   const ChannelHandle& handle) {
+  return token.builder()->RecvFromHost(token, shape, handle);
+}
+
+XlaOp InfeedWithToken(const XlaOp& token, const Shape& shape,
+                      const string& config) {
+  return token.builder()->InfeedWithToken(token, shape, config);
+}
+
+XlaOp OutfeedWithToken(const XlaOp& operand, const XlaOp& token,
+                       const Shape& shape_with_layout,
+                       const string& outfeed_config) {
+  return operand.builder()->OutfeedWithToken(operand, token, shape_with_layout,
+                                             outfeed_config);
+}
+
+XlaOp CreateToken(XlaBuilder* builder) { return builder->CreateToken(); }
+
+XlaOp AfterAll(XlaBuilder* builder, tensorflow::gtl::ArraySlice<XlaOp> tokens) {
+  return builder->AfterAll(tokens);
+}
+
 XlaOp BatchNormTraining(const XlaOp& operand, const XlaOp& scale,
                         const XlaOp& offset, float epsilon,
                         int64 feature_index) {
@@ -2625,4 +2907,11 @@ XlaOp BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
                                           grad_output, epsilon, feature_index);
 }
 
+XlaOp IotaGen(XlaBuilder* builder, PrimitiveType type, int64 size) {
+  HloInstructionProto instr;
+  *instr.mutable_shape() = ShapeUtil::MakeShape(type, {size});
+  return builder->ReportErrorOrReturn(
+      builder->AddInstruction(std::move(instr), HloOpcode::kIota));
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/client/xla_builder.h b/tensorflow/compiler/xla/client/xla_builder.h
new file mode 100644
index 0000000000..8726cc6f93
--- /dev/null
+++ b/tensorflow/compiler/xla/client/xla_builder.h
@@ -0,0 +1,2255 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_CLIENT_XLA_BUILDER_H_
+#define TENSORFLOW_COMPILER_XLA_CLIENT_XLA_BUILDER_H_
+
+#include <map>
+#include <string>
+#include <type_traits>
+#include <utility>
+
+#include "tensorflow/compiler/xla/client/padding.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo.pb.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
+#include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/stacktrace.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+
+class XlaBuilder;
+
+// This represents an instruction that has been enqueued using the XlaBuilder.
+// This is used to pass to subsequent computations that depends upon the
+// instruction as an operand.
+class XlaOp {
+ public:
+  XlaOp() : handle_(-1), builder_(nullptr) {
+    static_assert(std::is_trivially_destructible<XlaOp>::value,
+                  "XlaOp should be trivially destructible");
+  }
+  ~XlaOp() = default;
+
+  // Precondition: !IsUninitialized().
+  //
+  // It's very common to do foo.builder()->bar().  Without this precondition, if
+  // foo.builder() is null, the call to bar will segfault at some point possibly
+  // deep in the callstack when we finally dereference `this`.  The precondition
+  // lets us avoid this tricky-to-debug problem.
+  XlaBuilder* builder() const {
+    CHECK(builder_ != nullptr);
+    return builder_;
+  }
+
+  // Returns true if the XlaOp represents valid, non-erroneous value.
+  bool valid() const { return handle_ >= 0; }
+
+  // Returns true if the XlaOp was created by the XlaOp() constructor and
+  // not returned by a builder.
+  bool IsUninitialized() const { return builder_ == nullptr; }
+
+  bool IsIdenticalTo(const XlaOp& rhs) const {
+    return handle_ == rhs.handle_ && builder_ == rhs.builder_;
+  }
+
+  friend std::ostream& operator<<(std::ostream& out, const XlaOp& op) {
+    out << op.handle();
+    return out;
+  }
+
+ private:
+  explicit XlaOp(XlaBuilder* builder) : handle_(-1), builder_(builder) {}
+  XlaOp(int64 handle, XlaBuilder* builder)
+      : handle_(handle), builder_(builder) {}
+
+  int64 handle() const { return handle_; }
+
+  friend class XlaBuilder;
+
+  // < 0 means "invalid handle".
+  int64 handle_;
+
+  // Not owned. Non-null for any handle returned by XlaBuilder, even if the
+  // handle is invalid.
+  XlaBuilder* builder_;
+};
+
+// Arithmetic operator overloads for the XlaOp type.
+XlaOp operator-(const XlaOp& x);
+XlaOp operator+(const XlaOp& x, const XlaOp& y);
+XlaOp operator-(const XlaOp& x, const XlaOp& y);
+XlaOp operator*(const XlaOp& x, const XlaOp& y);
+XlaOp operator/(const XlaOp& x, const XlaOp& y);
+XlaOp operator%(const XlaOp& x, const XlaOp& y);
+
+// Bitwise operator overloads for the XlaOp type.
+XlaOp operator~(const XlaOp& x);
+XlaOp operator&(const XlaOp& x, const XlaOp& y);
+XlaOp operator|(const XlaOp& x, const XlaOp& y);
+XlaOp operator^(const XlaOp& x, const XlaOp& y);
+XlaOp operator<<(const XlaOp& x, const XlaOp& y);
+// Performs a right arithmetic shift if 'x' is a signed type, otherwise performs
+// a right logical shift.
+XlaOp operator>>(const XlaOp& x, const XlaOp& y);
+
+// We don't overload the relational operators (==, !=, <, <=, >, >=) because the
+// semantics might be surprising since their result types are usually 'bool'.
+// Further programmers may expect == to be a structural equality.
+// We also choose not to overload any of the mutating operators (e.g., +=, -=)
+// because the semantics might be misleading — XLA computations are immutable.
+
+// A convenient interface for building up computations.
+//
+// Thread-compatible.
+class XlaBuilder {
+ public:
+  // computation_name: name to use for the built computation.
+  XlaBuilder(const string& computation_name);
+
+  XlaBuilder(const XlaBuilder&) = delete;
+  XlaBuilder& operator=(const XlaBuilder&) = delete;
+
+  ~XlaBuilder();
+
+  // Returns the computation name.
+  const string& name() const { return name_; }
+
+  // Sets OpMetadata that will be added to all instructions until cleared.
+  //
+  // OpMetadata is often applied to a series of XLA HLO instructions. As a
+  // result, OpMetadata is set on the Computation Builder. All subsequent
+  // instructions generated via this Computation Builder will have the same
+  // OpMetadata attached until a call to ClearOpMetadata.
+  void SetOpMetadata(const OpMetadata& metadata) { metadata_ = metadata; }
+
+  // Clears the HloMetadata state.
+  void ClearOpMetadata() { metadata_.Clear(); }
+
+  // Sets an OpSharding that will be attached to all instructions until cleared.
+  void SetSharding(const OpSharding& sharding) { sharding_ = sharding; }
+
+  // Clears the sharding. Ops will be sharded according to the default placement
+  // policy.
+  void ClearSharding() { sharding_ = tensorflow::gtl::nullopt; }
+
+  // Returns the OpSharding that will be attached to all instructions.
+  const tensorflow::gtl::optional<OpSharding>& sharding() const {
+    return sharding_;
+  }
+
+  // Sets the builder to a mode where it will die immediately when an error is
+  // encountered, rather than producing it in a deferred fashion when Build() is
+  // called (which is the default).
+  void set_die_immediately_on_error(bool enabled) {
+    die_immediately_on_error_ = enabled;
+  }
+
+  // Default dimension numbers used for a 2D convolution.
+  static constexpr int64 kConvBatchDimension = 0;
+  static constexpr int64 kConvFeatureDimension = 1;
+  static constexpr int64 kConvFirstSpatialDimension = 2;
+  static constexpr int64 kConvSecondSpatialDimension = 3;
+  static constexpr int64 kConvKernelOutputDimension = 0;
+  static constexpr int64 kConvKernelInputDimension = 1;
+  static constexpr int64 kConvKernelFirstSpatialDimension = 2;
+  static constexpr int64 kConvKernelSecondSpatialDimension = 3;
+
+  // Creates a default ConvolutionDimensionNumbers. For a 2D convolution, for
+  // the input operand {batch, feature, height, width} = {0, 1, 2, 3} and for
+  // the kernel operand
+  // {output_feature, input_feature, height, width} = {0, 1, 2, 3}.
+  static ConvolutionDimensionNumbers CreateDefaultConvDimensionNumbers(
+      int num_spatial_dims = 2);
+
+  // Returns an error if the convolution dimension numbers have conflicts.
+  static Status Validate(const ConvolutionDimensionNumbers& dnum);
+
+  // Returns a new XlaBuilder whose resultant Computation is used only by this
+  // XlaBuilder. The sub-XlaBuilder has the same die_immediately_on_error
+  // behavior as the parent.
+  std::unique_ptr<XlaBuilder> CreateSubBuilder(const string& computation_name);
+
+  // Builds the computation with the requested operations, or returns a non-ok
+  // status. Note that all ops that have been enqueued will be moved to the
+  // computation being returned.
+  StatusOr<XlaComputation> Build();
+
+  // Builds the computation with the requested operations, or notes an error in
+  // the parent XlaBuilder and returns an empty computation if building failed.
+  // This function is intended to be used where the returned XlaComputation is
+  // only used by the parent XlaBuilder and hence further operation on the
+  // returned XlaComputation will simply be error'ed out if an error occurred
+  // while building this computation. If the built computation is to be used by
+  // a XlaBuilder other than the parent XlaBuilder then Build() should be used
+  // instead.
+  XlaComputation BuildAndNoteError();
+
+  // Returns a subgraph that roots on the given root. If the root is not a
+  // compile-time constant (see `IsConstant`), returns an error.
+  //
+  // This will copy the needed ops/computations to the subgraph.
+  StatusOr<XlaComputation> BuildConstantSubGraph(const XlaOp& root_op) const;
+
+  // Returns the first error that was encountered while building the
+  // computation. When an error is encountered, by default we return a vacuous
+  // XlaOp and inform the user of the error that occurred while
+  // building the computation when they make a final call to Build().
+  //
+  // See also set_die_immediately_on_error().
+  Status first_error() const { return first_error_; }
+
+  // Returns the shape of the given op.
+  StatusOr<Shape> GetShape(const XlaOp& op) const;
+
+  // Returns the (inferred) result for the current computation's shape.
+  StatusOr<ProgramShape> GetProgramShape() const;
+
+  // Reports an error to the builder, by
+  // * storing it internally and capturing a backtrace if it's the first error
+  //   (this deferred value will be produced on the call to
+  //    Build()/GetShape()/...)
+  // * dying if die_immediately_on_error_ is true.
+  // Returns an XlaOp with an invalid handle but a valid builder. This value can
+  // be returned in place of a value in APIs that return an XlaOp.
+  XlaOp ReportError(const Status& error);
+
+  // A helper function that converts a StatusOr<XlaOp> into an XlaOp.
+  // If the Status was an error, reports the error to builder and returns an
+  // invalid XlaOp handle.
+  XlaOp ReportErrorOrReturn(const StatusOr<XlaOp>& op);
+
+  // A helper function that runs a function that returns a StatusOr<XlaOp> and
+  // returns an XlaOp.
+  XlaOp ReportErrorOrReturn(const std::function<StatusOr<XlaOp>()>& op_creator);
+
+  // Returns true if 'operand' is a compile-time constant. A compile-time
+  // constant does not depend on any parameters, or on stateful operators such
+  // as `RngNormal` or `Infeed`.
+  //
+  // This tests whether a computation is a compile-time constant without
+  // evaluating the computation.
+  StatusOr<bool> IsConstant(const XlaOp& operand) const;
+
+ private:
+  // Enqueues a "retrieve parameter value" instruction for a parameter that was
+  // passed to the computation.
+  XlaOp Parameter(int64 parameter_number, const Shape& shape,
+                  const string& name);
+
+  // Enqueues a constant with the value of the given literal onto the
+  // computation.
+  XlaOp ConstantLiteral(const LiteralSlice& literal);
+
+  // Enqueues a constant onto the computation. Methods are templated on the
+  // native host type (NativeT) which corresponds to a specific XLA
+  // PrimitiveType as given in the following table:
+  //
+  //  Native Type   PrimitiveType
+  // -----------------------------
+  //   bool           PRED
+  //   int32          S32
+  //   int64          S64
+  //   uint32         U32
+  //   uint64         U64
+  //   float          F32
+  //   double         F64
+  //
+  // Note: not all primitive types defined in xla_data.proto have a
+  // corresponding native type yet.
+  template <typename NativeT>
+  XlaOp ConstantR0(NativeT value);
+  template <typename NativeT>
+  XlaOp ConstantR1(tensorflow::gtl::ArraySlice<NativeT> values);
+  XlaOp ConstantR1(const tensorflow::core::Bitmap& values);
+  template <typename NativeT>
+  XlaOp ConstantR2(
+      std::initializer_list<std::initializer_list<NativeT>> values);
+  template <typename NativeT>
+  XlaOp ConstantFromArrayWithLayout(const Array<NativeT>& values,
+                                    const Layout& layout);
+  template <typename NativeT>
+  XlaOp ConstantFromArray(const Array<NativeT>& values);
+  template <typename NativeT>
+  XlaOp ConstantR2FromArray2DWithLayout(const Array2D<NativeT>& values,
+                                        const Layout& layout);
+  template <typename NativeT>
+  XlaOp ConstantR2FromArray2D(const Array2D<NativeT>& values);
+  template <typename NativeT>
+  XlaOp ConstantR3FromArray3DWithLayout(const Array3D<NativeT>& values,
+                                        const Layout& layout);
+  template <typename NativeT>
+  XlaOp ConstantR3FromArray3D(const Array3D<NativeT>& values);
+  template <typename NativeT>
+  XlaOp ConstantR4FromArray4DWithLayout(const Array4D<NativeT>& values,
+                                        const Layout& layout);
+  template <typename NativeT>
+  XlaOp ConstantR4FromArray4D(const Array4D<NativeT>& values);
+
+  // Enqueues a rank one constant (vector) onto the computation. The vector has
+  // size 'length' and every element has the value 'value'.
+  template <typename NativeT>
+  XlaOp ConstantR1(int64 length, NativeT value);
+
+  // Adds dimensions to an array by duplicating the data in the array.
+  //
+  // The new dimensions are inserted on the left, i.e. if
+  // broadcast_sizes has values {a0, ..., aN} and the operand shape
+  // has dimensions {b0, ..., bM} then the shape of the output has
+  // dimensions {a0, ..., aN, b0, ..., bM}.
+  //
+  // The new dimensions index into copies of the operand, i.e.
+  //
+  //   output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM]
+  XlaOp Broadcast(const XlaOp& operand,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_sizes);
+
+  // Performs in-dimension-style broadcast.
+  //
+  // Operand specifies the input to be broadcast. "shape" is expected output
+  // shape. "broadcast_dimensions" are the dimensions to be broadcasting into.
+  // Dimension numbers in broadcast_dimensions map to individual dimensions
+  // of the operand, and specify what dimension of the output shape they
+  // should be broadcast.
+  // e.g.
+  // Say operand = [1, 2], i.e., a 1D tensor with 2 elements.
+  // and dimension of shape is [2,2].
+  // Specifying {1} as brodcast_dimension will generate output
+  // [1 , 2]
+  // [1 , 2]
+  // On the other hand, specifying {0} as broadcast_dimension
+  // will generate output
+  // [1 , 1]
+  // [2 , 2]
+  XlaOp BroadcastInDim(
+      const XlaOp& operand, const Shape& shape,
+      const tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+
+  // Enqueues a pad operation onto the computation that pads the given value on
+  // the edges as well as between the elements of the input. padding_config
+  // specifies the padding amount for each dimension.
+  XlaOp Pad(const XlaOp& operand, const XlaOp& padding_value,
+            const PaddingConfig& padding_config);
+
+  // Enqueues an operation onto the computation that flattens the operand based
+  // on the dimension order (major/slowest-varying to minor/fastest-varying)
+  // given, followed by reshaping it into the shape with the given dimension
+  // sizes (also major to minor). Conceptually, this is a limited form of
+  // "shape casting".
+  XlaOp Reshape(const XlaOp& operand,
+                tensorflow::gtl::ArraySlice<int64> dimensions,
+                tensorflow::gtl::ArraySlice<int64> new_sizes);
+
+  // Enqueues an operation onto the computation that collapses the operand, from
+  // first to last dimension (C order), then reshapes it to the given dimension
+  // sizes. Conceptually, this is a limited form of "shape casting".
+  XlaOp Reshape(const XlaOp& operand,
+                tensorflow::gtl::ArraySlice<int64> new_sizes);
+
+  // Wrapper for Reshape.
+  // Enqueues an operation to collapse the provided dimensions; e.g. an
+  // operand with dimensions {x=256, y=2, z=2, p=32} can be collapsed to
+  // {x=1024, y=32} by collapsing dims {0, 1, 2}. Collapsing dimensions must
+  // be a consecutive, in-order subsequence of the operand dimensions.
+  //
+  // Note that collapsing a single dimension does nothing:
+  //
+  //    {256} collapsing {0} => {256}
+  //    {1} collapsing {0} => {1}
+  //
+  // Collapsing multiple dimensions produces a single result dimension:
+  //
+  //    {256, 2} collapsing {0,1} => {512}
+  //    {256, 2, 3} collapsing {0,1} => {512, 3}
+  //
+  // This could potentially cause data to be moved -- it provides a more
+  // structured form of reshaping than an arbitrary Reshape operation.
+  XlaOp Collapse(const XlaOp& operand,
+                 tensorflow::gtl::ArraySlice<int64> dimensions);
+
+  // Enqueues a slice operation onto the computation that slices the operand
+  // from the start indices to the limit indices; e.g.
+  //
+  //        x
+  //   [ 0 1 2 3 ]
+  // y [ 4 5 6 7 ] => slice(start={1, 1}, limit={2, 3}) => [ 5 6 ]
+  //   [ 8 9 a b ]
+  //
+  // Note that "limit" means up-to-but-not-including; i.e. [start, limit) in 1D
+  // range notation.
+  // The strides parameter determines the stride over the slice
+  XlaOp Slice(const XlaOp& operand,
+              tensorflow::gtl::ArraySlice<int64> start_indices,
+              tensorflow::gtl::ArraySlice<int64> limit_indices,
+              tensorflow::gtl::ArraySlice<int64> strides);
+
+  // Enqueues a slice operation in a given dimension, taking all other
+  // dimensions as they are; e.g. if dimno is 1 from start_index 2 to
+  // limit_index 4 by 1, and the shape is f32[7,8,9], this call is short-hand
+  // for:
+  //
+  //  array[:, 2:4:1, :]
+  XlaOp SliceInDim(const XlaOp& operand, int64 start_index, int64 limit_index,
+                   int64 stride, int64 dimno);
+
+  // Enqueues a slice operation onto the computation that slices the 'operand'
+  // from dynamic start indices which are passed in 'start_indices'.
+  // The size of the slice in each dimension is passed in 'slice_sizes',
+  // which specify the end point of exclusive slice intervals in each
+  // dimension [start, start + size).
+  // The shape of 'start_indices' must be rank == 1, with dimension size
+  // equal to the rank of the 'operand'.
+  // Slice index calculations are computed modulo input dimension sizes to
+  // prevent dynamic start indices from generating out-of-bound array accesses.
+  XlaOp DynamicSlice(const XlaOp& operand, const XlaOp& start_indices,
+                     tensorflow::gtl::ArraySlice<int64> slice_sizes);
+
+  // Enqueues a dynamic update slice operation onto the computation, which
+  // updates a slice of 'operand' with 'update' at dynamic 'start_indices'.
+  // The shape of 'update' determines the shape of the slice of 'operand'
+  // which is updated.
+  // The indices specified in 'start_indices' specify the offset of the slice
+  // of 'operand' which is updated.
+  //
+  //               update = {10, 11} // calculated at runtime.
+  //   [1 2 3]     start  = {1, 1}   // calculated at runtime.  [1 2  3 ]
+  //   [4 5 6]  => DynamicUpdateslice(data, update, start)   => [4 10 11]
+  //   [7 8 9]                                                  [7 8  9 ]
+  //
+  // The shape of 'start_indices' must be rank == 1, with dimension size
+  // equal to the rank of the 'operand'.
+  // Slice index calculations are computed modulo update dimension sizes to
+  // prevent dynamic start indices from generating out-of-bound array accesses.
+  XlaOp DynamicUpdateSlice(const XlaOp& operand, const XlaOp& update,
+                           const XlaOp& start_indices);
+
+  // Enqueues a concatenate instruction onto the computation. 'operands' must
+  // have >= 1 entry.
+  XlaOp ConcatInDim(tensorflow::gtl::ArraySlice<XlaOp> operands,
+                    int64 dimension);
+
+  // Enqueue a tracing operation onto the computation; the computation will emit
+  // a logging message with the operand.
+  void Trace(const string& tag, const XlaOp& operand);
+
+  // Enqueues a conditional-move-like select operation onto the computation;
+  // predicated on pred, selects between on_true and on_false.
+  XlaOp Select(const XlaOp& pred, const XlaOp& on_true, const XlaOp& on_false);
+
+  // Enqueues a tuple-creation instruction onto the computation.
+  XlaOp Tuple(tensorflow::gtl::ArraySlice<XlaOp> elements);
+
+  // Enqueues a tuple-element-get instruction onto the computation.
+  XlaOp GetTupleElement(const XlaOp& tuple_data, int64 index);
+
+  // Enqueues an equal-to comparison instruction onto the computation.
+  XlaOp Eq(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a not-equal comparison instruction onto the computation.
+  XlaOp Ne(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a greater-or-equal comparison instruction onto the computation.
+  XlaOp Ge(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a greater-than comparison instruction onto the computation.
+  XlaOp Gt(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a less-than comparison instruction onto the computation.
+  XlaOp Lt(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a less-or-equal comparison instruction onto the computation.
+  XlaOp Le(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a dot instruction onto the computation.
+  XlaOp Dot(const XlaOp& lhs, const XlaOp& rhs);
+
+  // Enqueues a general dot instruction onto the computation.
+  XlaOp DotGeneral(const XlaOp& lhs, const XlaOp& rhs,
+                   const DotDimensionNumbers& dimension_numbers);
+
+  // Enqueues a convolution instruction onto the computation, which uses the
+  // default convolution dimension numbers.
+  XlaOp Conv(const XlaOp& lhs, const XlaOp& rhs,
+             tensorflow::gtl::ArraySlice<int64> window_strides,
+             Padding padding);
+
+  // Enqueues a convolution instruction onto the computation, with the caller
+  // provided padding configuration in the format returned by MakePadding().
+  XlaOp ConvWithGeneralPadding(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
+
+  // Enqueues a convolution instruction onto the computation, with the caller
+  // provided dimension numbers configuration.
+  XlaOp ConvWithGeneralDimensions(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
+      const ConvolutionDimensionNumbers& dimension_numbers);
+
+  // Enqueues a convolution instruction onto the computation, with the caller
+  // provided padding configuration as well as the dimension numbers.
+  XlaOp ConvGeneral(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+      const ConvolutionDimensionNumbers& dimension_numbers);
+
+  // Enqueues a convolution instruction onto the computation, with the caller
+  // provided padding configuration, dilation factors and dimension numbers.
+  XlaOp ConvGeneralDilated(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+      tensorflow::gtl::ArraySlice<int64> lhs_dilation,
+      tensorflow::gtl::ArraySlice<int64> rhs_dilation,
+      const ConvolutionDimensionNumbers& dimension_numbers);
+
+  // Enqueues an FFT instruction onto the computation, of the given type and
+  // with the given FFT length.
+  XlaOp Fft(const XlaOp& operand, FftType fft_type,
+            tensorflow::gtl::ArraySlice<int64> fft_length);
+
+  // Enqueues an infeed instruction onto the computation, which writes data of
+  // the given shape to the infeed buffer of the device.
+  XlaOp Infeed(const Shape& shape, const string& config = "");
+  XlaOp InfeedWithToken(const XlaOp& token, const Shape& shape,
+                        const string& config = "");
+
+  // Enqueues an outfeed instruction onto the computation. This instruction
+  // generates outgoing data transfers for the given data.
+  //
+  // shape_with_layout communicates the laid out shape that we want to outfeed
+  // -- if !ShapeUtil::Compatible(GetShape(operand), shape_with_layout) an error
+  // will occur.
+  void Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
+               const string& outfeed_config);
+  XlaOp OutfeedWithToken(const XlaOp& operand, const XlaOp& token,
+                         const Shape& shape_with_layout,
+                         const string& outfeed_config);
+
+  // Enqueues a call instruction onto the computation.
+  XlaOp Call(const XlaComputation& computation,
+             tensorflow::gtl::ArraySlice<XlaOp> operands);
+
+  // Enqueues a custom call instruction onto the computation.
+  // During code generation, a call instruction is emitted which targets a
+  // symbol with the name |call_target_name|.  The |operands| are passed to the
+  // call instruction.  |shape| is the resultant shape.
+  XlaOp CustomCall(const string& call_target_name,
+                   tensorflow::gtl::ArraySlice<XlaOp> operands,
+                   const Shape& shape);
+
+  // Enqueues a pseudo-op to represent host-side computation data-dependencies.
+  // During code generation, host send and receive operations will be generated
+  // to transfer |operands| to the host and a single result of |shape| back to
+  // the device.  Host send/recv operations are emitted using |channel_name|.
+  // Dataflow dependencies and the |cost_estimate_ns| field may be used in HLO
+  // instruction scheduling.
+  XlaOp HostCompute(tensorflow::gtl::ArraySlice<XlaOp> operands,
+                    const string& channel_name, int64 cost_estimate_ns,
+                    const Shape& shape);
+
+  // The following methods enqueue element-wise binary arithmetic operations
+  // onto the computation. The shapes of the operands have to match unless one
+  // of the operands is a scalar, or an explicit broadcast dimension is given
+  // (see g3doc for more details).
+
+  // Enqueues a complex compose instruction onto the computation.
+  XlaOp Complex(const XlaOp& real, const XlaOp& imag,
+                tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a complex conjugate instruction onto the computation.
+  XlaOp Conj(const XlaOp& operand);
+
+  // Enqueues an add instruction onto the computation.
+  XlaOp Add(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a subtract instruction onto the computation.
+  XlaOp Sub(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a multiply instruction onto the computation.
+  XlaOp Mul(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a divide instruction onto the computation.
+  XlaOp Div(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a remainder instruction onto the computation.
+  XlaOp Rem(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a max instruction onto the computation.
+  XlaOp Max(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues a min instruction onto the computation.
+  XlaOp Min(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Element-wise logical operators
+  XlaOp And(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  XlaOp Or(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  XlaOp Xor(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  XlaOp Not(const XlaOp& operand);
+
+  XlaOp ShiftLeft(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+  XlaOp ShiftRightArithmetic(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+  XlaOp ShiftRightLogical(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Reduces an array among the provided dimensions, given "computation" as a
+  // reduction operator.
+  XlaOp Reduce(const XlaOp& operand, const XlaOp& init_value,
+               const XlaComputation& computation,
+               tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce);
+
+  // Convenience wrapper around the above that reduces all the dimensions in the
+  // operand shape.
+  XlaOp ReduceAll(const XlaOp& operand, const XlaOp& init_value,
+                  const XlaComputation& computation);
+
+  // Enqueues a windowed reduce instruction onto the computation.
+  XlaOp ReduceWindow(const XlaOp& operand, const XlaOp& init_value,
+                     const XlaComputation& computation,
+                     tensorflow::gtl::ArraySlice<int64> window_dimensions,
+                     tensorflow::gtl::ArraySlice<int64> window_strides,
+                     Padding padding);
+
+  // As ReduceWindow(), but the padding is given in the format
+  // returned by MakePadding().
+  XlaOp ReduceWindowWithGeneralPadding(
+      const XlaOp& operand, const XlaOp& init_value,
+      const XlaComputation& computation,
+      tensorflow::gtl::ArraySlice<int64> window_dimensions,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
+
+  // Returns the sum of the operand value within each subgroup of replicas. All
+  // replicas supply one input to the sum and all replicas receive the resulting
+  // sum for each subgroup.
+  XlaOp CrossReplicaSum(
+      const XlaOp& operand,
+      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {});
+
+  // Enqueues an operation that do an AllReduce of the operand cross cores. Here
+  // AllReduce means doing a reduction on the input operand cross cores and then
+  // broadcasting the reduction result to those cores. The reduction function is
+  // defined by `computation`, which should be a commutative computation on
+  // scalars, e.g., add, min, or max. The way that AllReduce is applied is
+  // configured by:
+  //
+  // - `replica_group_ids`: maps replica ids to subgroup ids. If empty, all
+  // replicas belong to one group. Allreduce will be applied within subgroups.
+  // For example, we have 4 replicas, then replica_group_ids={0,1,0,1} means,
+  // replica 0 and 2 are in subgroup 0, replica 1 and 3 are in subgroup 1.
+  //
+  // - `channel_id`: for Allreduce nodes from different models, if they have the
+  // same channel_id, they will be 'Allreduce'd. If empty, Allreduce will not be
+  // applied cross models.
+  //
+  // TODO(b/79737069): Rename this to AllReduce when it's ready to use.
+  XlaOp CrossReplicaSum(
+      const XlaOp& operand, const XlaComputation& computation,
+      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {},
+      const tensorflow::gtl::optional<ChannelHandle>& channel_id =
+          tensorflow::gtl::nullopt);
+
+  // Enqueues an operation that scatters the `source` array to the selected
+  // indices of each window.
+  XlaOp SelectAndScatter(const XlaOp& operand, const XlaComputation& select,
+                         tensorflow::gtl::ArraySlice<int64> window_dimensions,
+                         tensorflow::gtl::ArraySlice<int64> window_strides,
+                         Padding padding, const XlaOp& source,
+                         const XlaOp& init_value,
+                         const XlaComputation& scatter);
+
+  // As SelectAndScatter(), but the padding is given in the format
+  // returned by MakePadding().
+  XlaOp SelectAndScatterWithGeneralPadding(
+      const XlaOp& operand, const XlaComputation& select,
+      tensorflow::gtl::ArraySlice<int64> window_dimensions,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+      const XlaOp& source, const XlaOp& init_value,
+      const XlaComputation& scatter);
+
+  // Enqueues an abs instruction onto the computation.
+  XlaOp Abs(const XlaOp& operand);
+
+  // Enqueues a atan2 instruction onto the computation.
+  XlaOp Atan2(const XlaOp& y, const XlaOp& x,
+              tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues an exp instruction onto the computation.
+  XlaOp Exp(const XlaOp& operand);
+
+  // Enqueues an expm1 instruction onto the computation.
+  XlaOp Expm1(const XlaOp& operand);
+
+  // Enqueues a floor instruction onto the computation.
+  XlaOp Floor(const XlaOp& operand);
+
+  // Enqueues a ceil instruction onto the computation.
+  XlaOp Ceil(const XlaOp& operand);
+
+  // Enqueues a round instruction onto the computation, rounding to nearest even
+  // with half-way cases rounding away from zero.
+  XlaOp Round(const XlaOp& operand);
+
+  // Enqueues an log instruction (natural logarithm) onto the computation.
+  XlaOp Log(const XlaOp& operand);
+
+  // Enqueues an log1p instruction (log(x+1)) onto the computation.
+  XlaOp Log1p(const XlaOp& operand);
+
+  // Enqueues a sign instruction onto the computation.
+  XlaOp Sign(const XlaOp& operand);
+
+  // Enqueues a count leading zeros instruction onto the computation.
+  XlaOp Clz(const XlaOp& operand);
+
+  // Enqueues a cosine instruction onto the computation.
+  XlaOp Cos(const XlaOp& operand);
+
+  // Enqueues a sine instruction onto the computation.
+  XlaOp Sin(const XlaOp& operand);
+
+  // Enqueues a tanh instruction onto the computation.
+  XlaOp Tanh(const XlaOp& operand);
+
+  // Enqueues a real-part instruction onto the computation.
+  XlaOp Real(const XlaOp& operand);
+
+  // Enqueues an imaginary-part instruction onto the computation.
+  XlaOp Imag(const XlaOp& operand);
+
+  // Enqueues a lhs^rhs computation onto the computation.
+  XlaOp Pow(const XlaOp& lhs, const XlaOp& rhs,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+  // Enqueues an operator that tests if the operand's values are finite, i.e.,
+  // not Inf or NaN. Defined only for floating-point types. Returns an array of
+  // booleans with the same shape where entries are true iff the corresponding
+  // entry was NaN.
+  XlaOp IsFinite(const XlaOp& operand);
+
+  // Enqueues a convert instruction onto the computation that changes the
+  // element type of the operand array to primitive_type.
+  XlaOp ConvertElementType(const XlaOp& operand,
+                           PrimitiveType new_element_type);
+
+  // Enqueues a no-op instruction onto the computation that changes
+  // the element type of the operand array to primitive_type. The
+  // bit-widths of the source and destination element types must be
+  // identical.
+  XlaOp BitcastConvertType(const XlaOp& operand,
+                           PrimitiveType new_element_type);
+
+  // Enqueues a negate instruction onto the computation.
+  XlaOp Neg(const XlaOp& operand);
+
+  // Enqueues a transpose instruction onto the computation.
+  XlaOp Transpose(const XlaOp& operand,
+                  tensorflow::gtl::ArraySlice<int64> permutation);
+
+  // Enqueues a reverse instruction onto the computation. The order of the
+  // elements in the given dimensions is reversed (i.e., the element at index i
+  // is moved to index dimension_size - 1 - i).
+  XlaOp Rev(const XlaOp& operand,
+            tensorflow::gtl::ArraySlice<int64> dimensions);
+
+  // Enqueues a sort (as increasing order) instruction onto the computation.
+  // If only keys are provided:
+  // * If the keys are an rank-1 tensor (an array), the result is a sorted array
+  // of keys, in ascending order.
+  // * If the keys have higher rank, the keys are sorted along the provided
+  // dimension. For example, for a rank-2 tensor (a matrix) of keys, a dimension
+  // value of 0 will indepenently sort every column, and a dimension value of 1
+  // will independently sort each row. If no dimension number is provided, then
+  // the last dimension is chosen by default.
+  //
+  // If both keys and values are provided:
+  // * The keys and the values must tensors with the same dimensions. The
+  // element types of the tensors may be different.
+  // * The result is a tuple that consists of a sorted tensor of keys (along the
+  // provided dimension, as above) as the first element, and a tensor with their
+  // corresponding values as the second element.
+  XlaOp Sort(XlaOp keys,
+             tensorflow::gtl::optional<XlaOp> values = tensorflow::gtl::nullopt,
+             int64 dimension = -1);
+
+  // Enqueues a clamp instruction onto the computation.
+  XlaOp Clamp(const XlaOp& min, const XlaOp& operand, const XlaOp& max);
+
+  // Enqueues a map instruction onto the computation.
+  XlaOp Map(tensorflow::gtl::ArraySlice<XlaOp> operands,
+            const XlaComputation& computation,
+            tensorflow::gtl::ArraySlice<int64> dimensions,
+            tensorflow::gtl::ArraySlice<XlaOp> static_operands = {});
+
+  // Enqueues a N(mu, sigma) random number generation instruction onto the
+  // computation.
+  XlaOp RngNormal(const XlaOp& mu, const XlaOp& sigma, const Shape& shape);
+
+  // Enqueues a U(a, b) random number generation instruction onto the
+  // computation. Returns values in the semi-open interval [a, b).
+  XlaOp RngUniform(const XlaOp& a, const XlaOp& b, const Shape& shape);
+
+  // Enqueues a while node onto the computation.
+  XlaOp While(const XlaComputation& condition, const XlaComputation& body,
+              const XlaOp& init);
+
+  // Enqueues a conditional node onto the computation.
+  XlaOp Conditional(const XlaOp& predicate, const XlaOp& true_operand,
+                    const XlaComputation& true_computation,
+                    const XlaOp& false_operand,
+                    const XlaComputation& false_computation);
+
+  // Enqueues a ReducePrecision node onto the computation.
+  XlaOp ReducePrecision(const XlaOp& operand, const int exponent_bits,
+                        const int mantissa_bits);
+
+  // Enqueues a Gather node onto the computation.
+  XlaOp Gather(const XlaOp& input, const XlaOp& gather_indices,
+               const GatherDimensionNumbers& dimension_numbers,
+               tensorflow::gtl::ArraySlice<int64> window_bounds);
+
+  // Enqueues a Scatter node onto the computation.
+  XlaOp Scatter(const XlaOp& input, const XlaOp& scatter_indices,
+                const XlaOp& updates, const XlaComputation& update_computation,
+                const ScatterDimensionNumbers& dimension_numbers);
+
+  // Enqueues a Send node onto the computation for device-to-device
+  // communication, to send the given operand to a Recv instruction that shares
+  // the same channel handle.
+  void Send(const XlaOp& operand, const ChannelHandle& handle);
+  XlaOp SendWithToken(const XlaOp& operand, const XlaOp& token,
+                      const ChannelHandle& handle);
+
+  // Enqueues a Send node which sends data to the host.
+  XlaOp SendToHost(const XlaOp& operand, const XlaOp& token,
+                   const Shape& shape_with_layout, const ChannelHandle& handle);
+
+  // Enqueues a Recv node which receives data from the host.
+  XlaOp RecvFromHost(const XlaOp& token, const Shape& shape,
+                     const ChannelHandle& handle);
+
+  // Enqueues an AfterAll operation with no operands producing a token-shaped
+  // value.
+  XlaOp CreateToken();
+
+  // Enqueues an AfterAll operation with no operands producing a token-shaped
+  // value.
+  XlaOp AfterAll(tensorflow::gtl::ArraySlice<XlaOp> tokens);
+
+  // Enqueues a Recv node onto the computation. The data comes from a Send
+  // instruction that shares the same channel handle and its shape must
+  // be the same as the given shape.
+  XlaOp Recv(const Shape& shape, const ChannelHandle& handle);
+  XlaOp RecvWithToken(const XlaOp& token, const Shape& shape,
+                      const ChannelHandle& handle);
+
+  // Normalizes operand across spatial and batch dimensions for each feature.
+  //
+  // Returns a tuple (normalized, batch_mean, batch_var) where `normalized`
+  // is the normalized result and batch_mean and batch_var are the mean and
+  // variance, respectively, across batch for the operand.
+  XlaOp BatchNormTraining(const XlaOp& operand, const XlaOp& scale,
+                          const XlaOp& offset, float epsilon,
+                          int64 feature_index);
+
+  // Normalizes operand across spatial and batch dimensions for each feature.
+  //
+  // `BatchNormInference` is equivalent to calling `BatchNormTraining` without
+  // computing `mean` and `variance` for each batch inside the operation. It
+  // uses the input `mean` and `variance` instead as estimated values. The
+  // purpose of this op is to reduce latency in inference, hence the name
+  // `BatchNormInference`.
+  //
+  // The output has the same shape as `operand`, and contains the normalized
+  // values for each batch.
+  XlaOp BatchNormInference(const XlaOp& operand, const XlaOp& scale,
+                           const XlaOp& offset, const XlaOp& mean,
+                           const XlaOp& variance, float epsilon,
+                           int64 feature_index);
+
+  // Calculates the gradients of a batch norm op.
+  //
+  // The inputs `batch_mean` and `batch_var` represent the mean and variance
+  // across the batch.
+  //
+  // Returns a tuple of three elements:
+  //   - grad_operand: Gradient with respect to input `operand`
+  //   - grad_offset: Gradient with respect to input `offset`
+  //   - grad_scale: Gradient with respect to input `scale`
+  XlaOp BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
+                      const XlaOp& batch_mean, const XlaOp& batch_var,
+                      const XlaOp& grad_output, float epsilon,
+                      int64 feature_index);
+
+  StatusOr<XlaOp> AddInstruction(
+      HloInstructionProto&& instr, HloOpcode opcode,
+      tensorflow::gtl::ArraySlice<XlaOp> operands = {});
+
+  void AddCalledComputation(const XlaComputation& computation,
+                            HloInstructionProto* instr);
+
+  StatusOr<const HloInstructionProto*> LookUpInstruction(const XlaOp& op) const;
+
+  // Internal helper method that does the building for an arbitrary unary op.
+  XlaOp UnaryOp(HloOpcode unop, const XlaOp& operand);
+
+  // Internal helper method that does the building for an arbitrary binary op.
+  // broadcast_dimensions specifies which dimensions to use for broadcasting
+  // when the operation is between tensors of different ranks.
+  XlaOp BinaryOp(HloOpcode binop, const XlaOp& lhs, const XlaOp& rhs,
+                 tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+
+  // Internal helper method that does the building for an arbitrary ternary op.
+  XlaOp TernaryOp(HloOpcode triop, const XlaOp& lhs, const XlaOp& rhs,
+                  const XlaOp& ehs);
+
+  XlaOp RngOp(RandomDistribution distribution,
+              tensorflow::gtl::ArraySlice<XlaOp> parameters,
+              const Shape& shape);
+
+  StatusOr<XlaOp> InDimBroadcast(
+      const Shape& shape, const XlaOp& operand,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+
+  // Internal helper method that creates a sequence of instructions that
+  // performs an explicit broadcast of the operand to the target shape.
+  StatusOr<XlaOp> AddBroadcastSequence(const Shape& output_shape,
+                                       const XlaOp& operand);
+
+  // Internal helper method for creating a Reshape op with the already inferred
+  // shape.
+  StatusOr<XlaOp> Reshape(const Shape& shape, const XlaOp& operand);
+
+  // Returns the (inferred) result for the program shape for the current
+  // computation and fills the root_id in the pointer.
+  StatusOr<ProgramShape> GetProgramShape(int64* root_id) const;
+
+  // Returns shapes for the operands.
+  StatusOr<std::vector<Shape>> GetOperandShapes(
+      tensorflow::gtl::ArraySlice<XlaOp> operands) const;
+
+  // A visitor which checks whether an operation is a compile-time constant,
+  // meaning that it doesn't depend on any parameters, or on any stateful
+  // operation such as `RngNormal` or `Infeed`. The visitor walks the
+  // computation starting at a given operation and sets is_constant to false iff
+  // a parameter or stateful operation is encountered.
+  void IsConstantVisitor(const int64 op_handle, std::set<int64>* visited,
+                         bool* is_constant) const;
+
+  // Checks bounds for convolution parameters.
+  Status VerifyConvolution(
+      const Shape& lhs_shape, const Shape& rhs_shape,
+      const ConvolutionDimensionNumbers& dimension_numbers) const;
+
+  // Helper function for creating a Window proto from user-supplied data.
+  // Returns error if the user-supplied data was invalid.
+  StatusOr<Window> MakeWindow(
+      tensorflow::gtl::ArraySlice<int64> window_dimensions,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+      tensorflow::gtl::ArraySlice<int64> lhs_dilation,
+      tensorflow::gtl::ArraySlice<int64> rhs_dilation) const;
+
+  string name_;  // Name to use for the built computation.
+
+  // The first error encountered while building the computation.
+  // This is OK until the first error is encountered.
+  Status first_error_;
+
+  // The saved stack trace from the point at which the first error occurred.
+  tensorflow::SavedStackTrace first_error_backtrace_;
+
+  // The instructions of this computation.
+  std::vector<HloInstructionProto> instructions_;
+
+  // The embedded computations used by this computation. Each computation was
+  // the entry computation of some XlaComputation, the key is the unique id of
+  // that XlaComputation.
+  std::map<int64, HloComputationProto> embedded_;
+
+  // The unique parameter numbers.
+  tensorflow::gtl::FlatSet<int64> parameter_numbers_;
+
+  // The metadata to attach to each op. This is structured as a "modal"-like
+  // operation, in order to simplify client code (and not sprinkle this metadata
+  // throughout the TensorFlow op kernel implementations).
+  OpMetadata metadata_;
+
+  // Sharding for this operator. This is structured as a "model"-like operation,
+  // in order to simplify client code, similar to metadata_.
+  tensorflow::gtl::optional<OpSharding> sharding_;
+
+  // Mode bit that indicates whether to die when a first error is encountered.
+  bool die_immediately_on_error_ = false;
+
+  XlaBuilder* parent_builder_{nullptr};
+
+  friend XlaOp Parameter(XlaBuilder* builder, int64 parameter_number,
+                         const Shape& shape, const string& name);
+  friend XlaOp ConstantLiteral(XlaBuilder* builder,
+                               const LiteralSlice& literal);
+  template <typename NativeT>
+  friend XlaOp ConstantR0(XlaBuilder* builder, NativeT value);
+  template <typename NativeT>
+  friend XlaOp ConstantR1(XlaBuilder* builder,
+                          tensorflow::gtl::ArraySlice<NativeT> values);
+  friend XlaOp ConstantR1(XlaBuilder* builder,
+                          const tensorflow::core::Bitmap& values);
+  template <typename NativeT>
+  friend XlaOp ConstantR2(
+      XlaBuilder* builder,
+      std::initializer_list<std::initializer_list<NativeT>> values);
+  template <typename NativeT>
+  friend XlaOp ConstantFromArrayWithLayout(XlaBuilder* builder,
+                                           const Array<NativeT>& values,
+                                           const Layout& layout);
+  template <typename NativeT>
+  friend XlaOp ConstantFromArray(XlaBuilder* builder,
+                                 const Array<NativeT>& values);
+  template <typename NativeT>
+  friend XlaOp ConstantR2FromArray2DWithLayout(XlaBuilder* builder,
+                                               const Array2D<NativeT>& values,
+                                               const Layout& layout);
+  template <typename NativeT>
+  friend XlaOp ConstantR2FromArray2D(XlaBuilder* builder,
+                                     const Array2D<NativeT>& values);
+  template <typename NativeT>
+  friend XlaOp ConstantR3FromArray3DWithLayout(XlaBuilder* builder,
+                                               const Array3D<NativeT>& values,
+                                               const Layout& layout);
+  template <typename NativeT>
+  friend XlaOp ConstantR3FromArray3D(XlaBuilder* builder,
+                                     const Array3D<NativeT>& values);
+  template <typename NativeT>
+  friend XlaOp ConstantR4FromArray4DWithLayout(XlaBuilder* builder,
+                                               const Array4D<NativeT>& values,
+                                               const Layout& layout);
+  template <typename NativeT>
+  friend XlaOp ConstantR4FromArray4D(XlaBuilder* builder,
+                                     const Array4D<NativeT>& values);
+
+  template <typename NativeT>
+  friend XlaOp ConstantR1(XlaBuilder* builder, int64 length, NativeT value);
+
+  friend XlaOp Broadcast(const XlaOp& operand,
+                         tensorflow::gtl::ArraySlice<int64> broadcast_sizes);
+
+  friend XlaOp BroadcastInDim(
+      const XlaOp& operand, const Shape& shape,
+      const tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+
+  friend XlaOp Pad(const XlaOp& operand, const XlaOp& padding_value,
+                   const PaddingConfig& padding_config);
+
+  friend XlaOp Reshape(const XlaOp& operand,
+                       tensorflow::gtl::ArraySlice<int64> dimensions,
+                       tensorflow::gtl::ArraySlice<int64> new_sizes);
+
+  friend XlaOp Reshape(const XlaOp& operand,
+                       tensorflow::gtl::ArraySlice<int64> new_sizes);
+
+  friend XlaOp Collapse(const XlaOp& operand,
+                        tensorflow::gtl::ArraySlice<int64> dimensions);
+
+  friend XlaOp Slice(const XlaOp& operand,
+                     tensorflow::gtl::ArraySlice<int64> start_indices,
+                     tensorflow::gtl::ArraySlice<int64> limit_indices,
+                     tensorflow::gtl::ArraySlice<int64> strides);
+
+  friend XlaOp SliceInDim(const XlaOp& operand, int64 start_index,
+                          int64 limit_index, int64 stride, int64 dimno);
+
+  friend XlaOp DynamicSlice(const XlaOp& operand, const XlaOp& start_indices,
+                            tensorflow::gtl::ArraySlice<int64> slice_sizes);
+
+  friend XlaOp DynamicUpdateSlice(const XlaOp& operand, const XlaOp& update,
+                                  const XlaOp& start_indices);
+
+  friend XlaOp ConcatInDim(XlaBuilder* builder,
+                           tensorflow::gtl::ArraySlice<XlaOp> operands,
+                           int64 dimension);
+
+  friend void Trace(const string& tag, const XlaOp& operand);
+
+  friend XlaOp Select(const XlaOp& pred, const XlaOp& on_true,
+                      const XlaOp& on_false);
+  friend XlaOp Tuple(XlaBuilder* builder,
+                     tensorflow::gtl::ArraySlice<XlaOp> elements);
+  friend XlaOp GetTupleElement(const XlaOp& tuple_data, int64 index);
+  friend XlaOp Eq(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Ne(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Ge(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Gt(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Lt(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Le(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Dot(const XlaOp& lhs, const XlaOp& rhs);
+  friend XlaOp DotGeneral(const XlaOp& lhs, const XlaOp& rhs,
+                          const DotDimensionNumbers& dimension_numbers);
+  friend XlaOp Conv(const XlaOp& lhs, const XlaOp& rhs,
+                    tensorflow::gtl::ArraySlice<int64> window_strides,
+                    Padding padding);
+  friend XlaOp ConvWithGeneralPadding(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
+  friend XlaOp ConvWithGeneralDimensions(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
+      const ConvolutionDimensionNumbers& dimension_numbers);
+  friend XlaOp ConvGeneral(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+      const ConvolutionDimensionNumbers& dimension_numbers);
+  friend XlaOp ConvGeneralDilated(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+      tensorflow::gtl::ArraySlice<int64> lhs_dilation,
+      tensorflow::gtl::ArraySlice<int64> rhs_dilation,
+      const ConvolutionDimensionNumbers& dimension_numbers);
+  friend XlaOp Fft(const XlaOp& operand, FftType fft_type,
+                   tensorflow::gtl::ArraySlice<int64> fft_length);
+  friend XlaOp Infeed(XlaBuilder* builder, const Shape& shape,
+                      const string& config);
+  friend void Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
+                      const string& outfeed_config);
+  friend XlaOp Call(XlaBuilder* builder, const XlaComputation& computation,
+                    tensorflow::gtl::ArraySlice<XlaOp> operands);
+  friend XlaOp CustomCall(XlaBuilder* builder, const string& call_target_name,
+                          tensorflow::gtl::ArraySlice<XlaOp> operands,
+                          const Shape& shape);
+  friend XlaOp HostCompute(XlaBuilder* builder,
+                           tensorflow::gtl::ArraySlice<XlaOp> operands,
+                           const string& channel_name, int64 cost_estimate_ns,
+                           const Shape& shape);
+  friend XlaOp Complex(const XlaOp& real, const XlaOp& imag,
+                       tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Conj(const XlaOp& operand);
+  friend XlaOp Add(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Sub(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Mul(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Div(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Rem(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Max(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Min(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp And(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Or(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Xor(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Not(const XlaOp& operand);
+  friend XlaOp ShiftLeft(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp ShiftRightArithmetic(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp ShiftRightLogical(
+      const XlaOp& lhs, const XlaOp& rhs,
+      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Reduce(const XlaOp& operand, const XlaOp& init_value,
+                      const XlaComputation& computation,
+                      tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce);
+  friend XlaOp ReduceAll(const XlaOp& operand, const XlaOp& init_value,
+                         const XlaComputation& computation);
+  friend XlaOp ReduceWindow(
+      const XlaOp& operand, const XlaOp& init_value,
+      const XlaComputation& computation,
+      tensorflow::gtl::ArraySlice<int64> window_dimensions,
+      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding);
+  friend XlaOp ReduceWindowWithGeneralPadding(
+      const XlaOp& operand, const XlaOp& init_value,
+      const XlaComputation& computation,
+      tensorflow::gtl::ArraySlice<int64> window_dimensions,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
+  friend XlaOp CrossReplicaSum(
+      const XlaOp& operand,
+      tensorflow::gtl::ArraySlice<int64> replica_group_ids);
+  friend XlaOp CrossReplicaSum(
+      const XlaOp& operand, const XlaComputation& computation,
+      tensorflow::gtl::ArraySlice<int64> replica_group_ids,
+      const tensorflow::gtl::optional<ChannelHandle>& channel_id);
+  friend XlaOp SelectAndScatter(
+      const XlaOp& operand, const XlaComputation& select,
+      tensorflow::gtl::ArraySlice<int64> window_dimensions,
+      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
+      const XlaOp& source, const XlaOp& init_value,
+      const XlaComputation& scatter);
+  friend XlaOp SelectAndScatterWithGeneralPadding(
+      const XlaOp& operand, const XlaComputation& select,
+      tensorflow::gtl::ArraySlice<int64> window_dimensions,
+      tensorflow::gtl::ArraySlice<int64> window_strides,
+      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+      const XlaOp& source, const XlaOp& init_value,
+      const XlaComputation& scatter);
+  friend XlaOp Abs(const XlaOp& operand);
+  friend XlaOp Atan2(const XlaOp& y, const XlaOp& x,
+                     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp Exp(const XlaOp& operand);
+  friend XlaOp Expm1(const XlaOp& operand);
+  friend XlaOp Floor(const XlaOp& operand);
+  friend XlaOp Ceil(const XlaOp& operand);
+  friend XlaOp Round(const XlaOp& operand);
+  friend XlaOp Log(const XlaOp& operand);
+  friend XlaOp Log1p(const XlaOp& operand);
+  friend XlaOp Sign(const XlaOp& operand);
+  friend XlaOp Clz(const XlaOp& operand);
+  friend XlaOp Cos(const XlaOp& operand);
+  friend XlaOp Sin(const XlaOp& operand);
+  friend XlaOp Tanh(const XlaOp& operand);
+  friend XlaOp Real(const XlaOp& operand);
+  friend XlaOp Imag(const XlaOp& operand);
+  friend XlaOp Pow(const XlaOp& lhs, const XlaOp& rhs,
+                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+  friend XlaOp IsFinite(const XlaOp& operand);
+  // TODO(b/64798317): Finish CPU & GPU implementation, then replace xla::Iota
+  // in xla/client/lib/numeric.h with this (renamed to xla::Iota).
+  friend XlaOp IotaGen(XlaBuilder* builder, PrimitiveType type, int64 size);
+  friend XlaOp ConvertElementType(const XlaOp& operand,
+                                  PrimitiveType new_element_type);
+  friend XlaOp BitcastConvertType(const XlaOp& operand,
+                                  PrimitiveType new_element_type);
+  friend XlaOp Neg(const XlaOp& operand);
+  friend XlaOp Transpose(const XlaOp& operand,
+                         tensorflow::gtl::ArraySlice<int64> permutation);
+  friend XlaOp Rev(const XlaOp& operand,
+                   tensorflow::gtl::ArraySlice<int64> dimensions);
+  friend XlaOp Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values,
+                    int64 dimension);
+  friend XlaOp Clamp(const XlaOp& min, const XlaOp& operand, const XlaOp& max);
+  friend XlaOp Map(XlaBuilder* builder,
+                   tensorflow::gtl::ArraySlice<XlaOp> operands,
+                   const XlaComputation& computation,
+                   tensorflow::gtl::ArraySlice<int64> dimensions,
+                   tensorflow::gtl::ArraySlice<XlaOp> static_operands);
+  friend XlaOp RngNormal(const XlaOp& mu, const XlaOp& sigma,
+                         const Shape& shape);
+  friend XlaOp RngUniform(const XlaOp& a, const XlaOp& b, const Shape& shape);
+  friend XlaOp While(const XlaComputation& condition,
+                     const XlaComputation& body, const XlaOp& init);
+  friend XlaOp Conditional(const XlaOp& predicate, const XlaOp& true_operand,
+                           const XlaComputation& true_computation,
+                           const XlaOp& false_operand,
+                           const XlaComputation& false_computation);
+  friend XlaOp ReducePrecision(const XlaOp& operand, const int exponent_bits,
+                               const int mantissa_bits);
+  friend XlaOp Gather(const XlaOp& input, const XlaOp& gather_indices,
+                      const GatherDimensionNumbers& dimension_numbers,
+                      tensorflow::gtl::ArraySlice<int64> window_bounds);
+  friend XlaOp Scatter(const XlaOp& input, const XlaOp& scatter_indices,
+                       const XlaOp& updates,
+                       const XlaComputation& update_computation,
+                       const ScatterDimensionNumbers& dimension_numbers);
+  friend void Send(const XlaOp& operand, const ChannelHandle& handle);
+  friend XlaOp Recv(XlaBuilder* builder, const Shape& shape,
+                    const ChannelHandle& handle);
+  friend XlaOp BatchNormTraining(const XlaOp& operand, const XlaOp& scale,
+                                 const XlaOp& offset, float epsilon,
+                                 int64 feature_index);
+  friend XlaOp BatchNormInference(const XlaOp& operand, const XlaOp& scale,
+                                  const XlaOp& offset, const XlaOp& mean,
+                                  const XlaOp& variance, float epsilon,
+                                  int64 feature_index);
+  friend XlaOp BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
+                             const XlaOp& batch_mean, const XlaOp& batch_var,
+                             const XlaOp& grad_output, float epsilon,
+                             int64 feature_index);
+  friend XlaOp SendWithToken(const XlaOp& operand, const XlaOp& token,
+                             const ChannelHandle& handle);
+  friend XlaOp RecvWithToken(const XlaOp& token, const Shape& shape,
+                             const ChannelHandle& handle);
+  friend XlaOp SendToHost(const XlaOp& operand, const XlaOp& token,
+                          const Shape& shape_with_layout,
+                          const ChannelHandle& handle);
+  friend XlaOp RecvFromHost(const XlaOp& token, const Shape& shape,
+                            const ChannelHandle& handle);
+  friend XlaOp InfeedWithToken(const XlaOp& token, const Shape& shape,
+                               const string& config);
+  friend XlaOp OutfeedWithToken(const XlaOp& operand, const XlaOp& token,
+                                const Shape& shape_with_layout,
+                                const string& outfeed_config);
+  friend XlaOp CreateToken(XlaBuilder* builder);
+  friend XlaOp AfterAll(XlaBuilder* builder,
+                        tensorflow::gtl::ArraySlice<XlaOp> tokens);
+};
+
+// RAII-style object: sets the current sharding assignment in builder on
+// construction, and sets back to the previous assignment on destruction.
+class XlaScopedShardingAssignment {
+ public:
+  XlaScopedShardingAssignment(xla::XlaBuilder* builder,
+                              tensorflow::gtl::optional<OpSharding> sharding)
+      : builder_(builder), prev_sharding_(builder->sharding()) {
+    SetSharding(sharding);
+  }
+
+  XlaScopedShardingAssignment(const XlaScopedShardingAssignment&) = delete;
+  XlaScopedShardingAssignment& operator=(const XlaScopedShardingAssignment&) =
+      delete;
+
+  ~XlaScopedShardingAssignment() { SetSharding(prev_sharding_); }
+
+ private:
+  void SetSharding(const tensorflow::gtl::optional<OpSharding>& sharding) {
+    if (sharding.has_value()) {
+      builder_->SetSharding(sharding.value());
+    } else {
+      builder_->ClearSharding();
+    }
+  }
+
+  xla::XlaBuilder* const builder_;
+  tensorflow::gtl::optional<OpSharding> prev_sharding_;
+};
+
+// Free functions for building XlaOps. The intention is that these will
+// become the public API for building XlaOps rather than calling methods on
+// XlaBuilder directly.
+
+// Enqueues a "retrieve parameter value" instruction for a parameter that was
+// passed to the computation.
+XlaOp Parameter(XlaBuilder* builder, int64 parameter_number, const Shape& shape,
+                const string& name);
+
+// Enqueues a constant with the value of the given literal onto the
+// computation.
+XlaOp ConstantLiteral(XlaBuilder* builder, const LiteralSlice& literal);
+
+// Enqueues a constant onto the computation. Methods are templated on the
+// native host type (NativeT) which corresponds to a specific XLA
+// PrimitiveType as given in the following table:
+//
+//  Native Type   PrimitiveType
+// -----------------------------
+//   bool           PRED
+//   int32          S32
+//   int64          S64
+//   uint32         U32
+//   uint64         U64
+//   float          F32
+//   double         F64
+//
+// Note: not all primitive types defined in xla_data.proto have a
+// corresponding native type yet.
+template <typename NativeT>
+XlaOp ConstantR0(XlaBuilder* builder, NativeT value);
+template <typename NativeT>
+XlaOp ConstantR1(XlaBuilder* builder,
+                 tensorflow::gtl::ArraySlice<NativeT> values);
+XlaOp ConstantR1(XlaBuilder* builder, const tensorflow::core::Bitmap& values);
+template <typename NativeT>
+XlaOp ConstantR2(XlaBuilder* builder,
+                 std::initializer_list<std::initializer_list<NativeT>> values);
+template <typename NativeT>
+XlaOp ConstantFromArrayWithLayout(XlaBuilder* builder,
+                                  const Array<NativeT>& values,
+                                  const Layout& layout);
+template <typename NativeT>
+XlaOp ConstantFromArray(XlaBuilder* builder, const Array<NativeT>& values);
+template <typename NativeT>
+XlaOp ConstantR2FromArray2DWithLayout(XlaBuilder* builder,
+                                      const Array2D<NativeT>& values,
+                                      const Layout& layout);
+template <typename NativeT>
+XlaOp ConstantR2FromArray2D(XlaBuilder* builder,
+                            const Array2D<NativeT>& values);
+template <typename NativeT>
+XlaOp ConstantR3FromArray3DWithLayout(XlaBuilder* builder,
+                                      const Array3D<NativeT>& values,
+                                      const Layout& layout);
+template <typename NativeT>
+XlaOp ConstantR3FromArray3D(XlaBuilder* builder,
+                            const Array3D<NativeT>& values);
+template <typename NativeT>
+XlaOp ConstantR4FromArray4DWithLayout(XlaBuilder* builder,
+                                      const Array4D<NativeT>& values,
+                                      const Layout& layout);
+template <typename NativeT>
+XlaOp ConstantR4FromArray4D(XlaBuilder* builder,
+                            const Array4D<NativeT>& values);
+
+// Enqueues a rank one constant (XlaBuilder* builder, vector) onto the
+// computation. The vector has size 'length' and every element has the value
+// 'value'.
+template <typename NativeT>
+XlaOp ConstantR1(XlaBuilder* builder, int64 length, NativeT value);
+
+// Adds dimensions to an array by duplicating the data in the array.
+//
+// The new dimensions are inserted on the left, i.e. if
+// broadcast_sizes has values {a0, ..., aN} and the operand shape
+// has dimensions {b0, ..., bM} then the shape of the output has
+// dimensions {a0, ..., aN, b0, ..., bM}.
+//
+// The new dimensions index into copies of the operand, i.e.
+//
+//   output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM]
+XlaOp Broadcast(const XlaOp& operand,
+                tensorflow::gtl::ArraySlice<int64> broadcast_sizes);
+
+// Performs in-dimension-style broadcast.
+//
+// Operand specifies the input to be broadcast. "shape" is expected output
+// shape. "broadcast_dimensions" are the dimensions to be broadcasting into.
+// Dimension numbers in broadcast_dimensions map to individual dimensions
+// of the operand, and specify what dimension of the output shape they
+// should be broadcast.
+// e.g.
+// Say operand = [1, 2], i.e., a 1D tensor with 2 elements.
+// and dimension of shape is [2,2].
+// Specifying {1} as brodcast_dimension will generate output
+// [1 , 2]
+// [1 , 2]
+// On the other hand, specifying {0} as broadcast_dimension
+// will generate output
+// [1 , 1]
+// [2 , 2]
+XlaOp BroadcastInDim(
+    const XlaOp& operand, const Shape& shape,
+    const tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
+
+// Enqueues a pad operation onto the computation that pads the given value on
+// the edges as well as between the elements of the input. padding_config
+// specifies the padding amount for each dimension.
+XlaOp Pad(const XlaOp& operand, const XlaOp& padding_value,
+          const PaddingConfig& padding_config);
+
+// Enqueues an operation onto the computation that flattens the operand based
+// on the dimension order (major/slowest-varying to minor/fastest-varying)
+// given, followed by reshaping it into the shape with the given dimension
+// sizes (also major to minor). Conceptually, this is a limited form of
+// "shape casting".
+XlaOp Reshape(const XlaOp& operand,
+              tensorflow::gtl::ArraySlice<int64> dimensions,
+              tensorflow::gtl::ArraySlice<int64> new_sizes);
+
+// Enqueues an operation onto the computation that collapses the operand, from
+// first to last dimension (C order), then reshapes it to the given dimension
+// sizes. Conceptually, this is a limited form of "shape casting".
+XlaOp Reshape(const XlaOp& operand,
+              tensorflow::gtl::ArraySlice<int64> new_sizes);
+
+// Wrapper for Reshape.
+// Enqueues an operation to collapse the provided dimensions; e.g. an
+// operand with dimensions {x=256, y=2, z=2, p=32} can be collapsed to
+// {x=1024, y=32} by collapsing dims {0, 1, 2}. Collapsing dimensions must
+// be a consecutive, in-order subsequence of the operand dimensions.
+//
+// Note that collapsing a single dimension does nothing:
+//
+//    {256} collapsing {0} => {256}
+//    {1} collapsing {0} => {1}
+//
+// Collapsing multiple dimensions produces a single result dimension:
+//
+//    {256, 2} collapsing {0,1} => {512}
+//    {256, 2, 3} collapsing {0,1} => {512, 3}
+//
+// This could potentially cause data to be moved -- it provides a more
+// structured form of reshaping than an arbitrary Reshape operation.
+XlaOp Collapse(const XlaOp& operand,
+               tensorflow::gtl::ArraySlice<int64> dimensions);
+
+// Enqueues a slice operation onto the computation that slices the operand
+// from the start indices to the limit indices; e.g.
+//
+//        x
+//   [ 0 1 2 3 ]
+// y [ 4 5 6 7 ] => slice(start={1, 1}, limit={2, 3}) => [ 5 6 ]
+//   [ 8 9 a b ]
+//
+// Note that "limit" means up-to-but-not-including; i.e. [start, limit) in 1D
+// range notation.
+// The strides parameter determines the stride over the slice
+XlaOp Slice(const XlaOp& operand,
+            tensorflow::gtl::ArraySlice<int64> start_indices,
+            tensorflow::gtl::ArraySlice<int64> limit_indices,
+            tensorflow::gtl::ArraySlice<int64> strides);
+
+// Enqueues a slice operation in a given dimension, taking all other
+// dimensions as they are; e.g. if dimno is 1 from start_index 2 to
+// limit_index 4 by 1, and the shape is f32[7,8,9], this call is short-hand
+// for:
+//
+//  array[:, 2:4:1, :]
+XlaOp SliceInDim(const XlaOp& operand, int64 start_index, int64 limit_index,
+                 int64 stride, int64 dimno);
+
+// Enqueues a slice operation onto the computation that slices the 'operand'
+// from dynamic start indices which are passed in 'start_indices'.
+// The size of the slice in each dimension is passed in 'slice_sizes',
+// which specify the end point of exclusive slice intervals in each
+// dimension [start, start + size).
+// The shape of 'start_indices' must be rank == 1, with dimension size
+// equal to the rank of the 'operand'.
+// Slice index calculations are computed modulo input dimension sizes to
+// prevent dynamic start indices from generating out-of-bound array accesses.
+XlaOp DynamicSlice(const XlaOp& operand, const XlaOp& start_indices,
+                   tensorflow::gtl::ArraySlice<int64> slice_sizes);
+
+// Enqueues a dynamic update slice operation onto the computation, which
+// updates a slice of 'operand' with 'update' at dynamic 'start_indices'.
+// The shape of 'update' determines the shape of the slice of 'operand'
+// which is updated.
+// The indices specified in 'start_indices' specify the offset of the slice
+// of 'operand' which is updated.
+//
+//               update = {10, 11} // calculated at runtime.
+//   [1 2 3]     start  = {1, 1}   // calculated at runtime.  [1 2  3 ]
+//   [4 5 6]  => DynamicUpdateslice(data, update, start)   => [4 10 11]
+//   [7 8 9]                                                  [7 8  9 ]
+//
+// The shape of 'start_indices' must be rank == 1, with dimension size
+// equal to the rank of the 'operand'.
+// Slice index calculations are computed modulo update dimension sizes to
+// prevent dynamic start indices from generating out-of-bound array accesses.
+XlaOp DynamicUpdateSlice(const XlaOp& operand, const XlaOp& update,
+                         const XlaOp& start_indices);
+
+// Enqueues a concatenate instruction onto the computation. 'operands' must
+// have >= 1 entry.
+XlaOp ConcatInDim(XlaBuilder* builder,
+                  tensorflow::gtl::ArraySlice<XlaOp> operands, int64 dimension);
+
+// Enqueue a tracing operation onto the computation; the computation will emit
+// a logging message with the operand.
+void Trace(const string& tag, const XlaOp& operand);
+
+// Enqueues a conditional-move-like select operation onto the computation;
+// predicated on pred, selects between on_true and on_false.
+XlaOp Select(const XlaOp& pred, const XlaOp& on_true, const XlaOp& on_false);
+
+// Enqueues a tuple-creation instruction onto the computation.
+XlaOp Tuple(XlaBuilder* builder, tensorflow::gtl::ArraySlice<XlaOp> elements);
+
+// Enqueues a tuple-element-get instruction onto the computation.
+XlaOp GetTupleElement(const XlaOp& tuple_data, int64 index);
+
+// Enqueues an equal-to comparison instruction onto the computation.
+XlaOp Eq(const XlaOp& lhs, const XlaOp& rhs,
+         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a not-equal comparison instruction onto the computation.
+XlaOp Ne(const XlaOp& lhs, const XlaOp& rhs,
+         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a greater-or-equal comparison instruction onto the computation.
+XlaOp Ge(const XlaOp& lhs, const XlaOp& rhs,
+         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a greater-than comparison instruction onto the computation.
+XlaOp Gt(const XlaOp& lhs, const XlaOp& rhs,
+         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a less-than comparison instruction onto the computation.
+XlaOp Lt(const XlaOp& lhs, const XlaOp& rhs,
+         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a less-or-equal comparison instruction onto the computation.
+XlaOp Le(const XlaOp& lhs, const XlaOp& rhs,
+         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a dot instruction onto the computation.
+XlaOp Dot(const XlaOp& lhs, const XlaOp& rhs);
+
+// Enqueues a general dot instruction onto the computation.
+XlaOp DotGeneral(const XlaOp& lhs, const XlaOp& rhs,
+                 const DotDimensionNumbers& dimension_numbers);
+
+// Enqueues a convolution instruction onto the computation, which uses the
+// default convolution dimension numbers.
+XlaOp Conv(const XlaOp& lhs, const XlaOp& rhs,
+           tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding);
+
+// Enqueues a convolution instruction onto the computation, with the caller
+// provided padding configuration in the format returned by MakePadding().
+XlaOp ConvWithGeneralPadding(
+    const XlaOp& lhs, const XlaOp& rhs,
+    tensorflow::gtl::ArraySlice<int64> window_strides,
+    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
+
+// Enqueues a convolution instruction onto the computation, with the caller
+// provided dimension numbers configuration.
+XlaOp ConvWithGeneralDimensions(
+    const XlaOp& lhs, const XlaOp& rhs,
+    tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
+    const ConvolutionDimensionNumbers& dimension_numbers);
+
+// Enqueues a convolution instruction onto the computation, with the caller
+// provided padding configuration as well as the dimension numbers.
+XlaOp ConvGeneral(const XlaOp& lhs, const XlaOp& rhs,
+                  tensorflow::gtl::ArraySlice<int64> window_strides,
+                  tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+                  const ConvolutionDimensionNumbers& dimension_numbers);
+
+// Enqueues a convolution instruction onto the computation, with the caller
+// provided padding configuration, dilation factors and dimension numbers.
+XlaOp ConvGeneralDilated(
+    const XlaOp& lhs, const XlaOp& rhs,
+    tensorflow::gtl::ArraySlice<int64> window_strides,
+    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+    tensorflow::gtl::ArraySlice<int64> lhs_dilation,
+    tensorflow::gtl::ArraySlice<int64> rhs_dilation,
+    const ConvolutionDimensionNumbers& dimension_numbers);
+
+// Enqueues an FFT instruction onto the computation, of the given type and
+// with the given FFT length.
+XlaOp Fft(const XlaOp& operand, FftType fft_type,
+          tensorflow::gtl::ArraySlice<int64> fft_length);
+
+// Enqueues an infeed instruction onto the computation, which writes data of
+// the given shape to the infeed buffer of the device.
+XlaOp Infeed(XlaBuilder* builder, const Shape& shape,
+             const string& config = "");
+
+// Variant of Infeed which takes a token-shaped operand and produces a
+// two-element tuple containing the data value and a token-shaped value.
+// Tokens are used for ordering side-effecting operations.
+// TODO(b/110532604): Replace all uses of the non-token form with this variant.
+XlaOp InfeedWithToken(const XlaOp& token, const Shape& shape,
+                      const string& config = "");
+
+// Enqueues an outfeed instruction onto the computation. This instruction
+// generates outgoing data transfers for the given data.
+//
+// shape_with_layout communicates the laid out shape that we want to outfeed
+// -- if !ShapeUtil::Compatible(GetShape(operand), shape_with_layout) an error
+// will occur.
+void Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
+             const string& outfeed_config);
+
+// Variant of Outfeed which takes a token-shaped operand and produces a
+// token-shaped value. Tokens are used for ordering side-effecting operations.
+// TODO(b/110532604): Replace all uses of the non-token form with this variant.
+XlaOp OutfeedWithToken(const XlaOp& operand, const XlaOp& token,
+                       const Shape& shape_with_layout,
+                       const string& outfeed_config);
+
+// Enqueues a call instruction onto the computation.
+XlaOp Call(XlaBuilder* builder, const XlaComputation& computation,
+           tensorflow::gtl::ArraySlice<XlaOp> operands);
+
+// Enqueues a custom call instruction onto the computation.
+// During code generation, a call instruction is emitted which targets a
+// symbol with the name |call_target_name|.  The |operands| are passed to the
+// call instruction.  |shape| is the resultant shape.
+XlaOp CustomCall(XlaBuilder* builder, const string& call_target_name,
+                 tensorflow::gtl::ArraySlice<XlaOp> operands,
+                 const Shape& shape);
+
+// Enqueues a pseudo-op to represent host-side computation data-dependencies.
+// During code generation, host send and receive operations will be generated
+// to transfer |operands| to the host and a single result of |shape| back to
+// the device.  Host send/recv operations are emitted using |channel_name|.
+// Dataflow dependencies and the |cost_estimate_ns| field may be used in HLO
+// instruction scheduling.
+XlaOp HostCompute(XlaBuilder* builder,
+                  tensorflow::gtl::ArraySlice<XlaOp> operands,
+                  const string& channel_name, int64 cost_estimate_ns,
+                  const Shape& shape);
+
+// The following methods enqueue element-wise binary arithmetic operations
+// onto the computation. The shapes of the operands have to match unless one
+// of the operands is a scalar, or an explicit broadcast dimension is given
+// (see g3doc for more details).
+
+// Enqueues a complex compose instruction onto the computation.
+XlaOp Complex(const XlaOp& real, const XlaOp& imag,
+              tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a complex conjugate instruction onto the computation.
+XlaOp Conj(const XlaOp& operand);
+
+// Enqueues an add instruction onto the computation.
+XlaOp Add(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a subtract instruction onto the computation.
+XlaOp Sub(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a multiply instruction onto the computation.
+XlaOp Mul(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a divide instruction onto the computation.
+XlaOp Div(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a remainder instruction onto the computation.
+XlaOp Rem(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a max instruction onto the computation.
+XlaOp Max(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues a min instruction onto the computation.
+XlaOp Min(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Element-wise logical operators
+XlaOp And(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+XlaOp Or(const XlaOp& lhs, const XlaOp& rhs,
+         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+XlaOp Xor(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+XlaOp Not(const XlaOp& operand);
+
+XlaOp ShiftLeft(const XlaOp& lhs, const XlaOp& rhs,
+                tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+XlaOp ShiftRightArithmetic(
+    const XlaOp& lhs, const XlaOp& rhs,
+    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+XlaOp ShiftRightLogical(
+    const XlaOp& lhs, const XlaOp& rhs,
+    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Reduces an array among the provided dimensions, given "computation" as a
+// reduction operator.
+XlaOp Reduce(const XlaOp& operand, const XlaOp& init_value,
+             const XlaComputation& computation,
+             tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce);
+
+// Convenience wrapper around the above that reduces all the dimensions in the
+// operand shape.
+XlaOp ReduceAll(const XlaOp& operand, const XlaOp& init_value,
+                const XlaComputation& computation);
+
+// Enqueues a windowed reduce instruction onto the computation.
+XlaOp ReduceWindow(const XlaOp& operand, const XlaOp& init_value,
+                   const XlaComputation& computation,
+                   tensorflow::gtl::ArraySlice<int64> window_dimensions,
+                   tensorflow::gtl::ArraySlice<int64> window_strides,
+                   Padding padding);
+
+// As ReduceWindow(), but the padding is given in the format
+// returned by MakePadding().
+XlaOp ReduceWindowWithGeneralPadding(
+    const XlaOp& operand, const XlaOp& init_value,
+    const XlaComputation& computation,
+    tensorflow::gtl::ArraySlice<int64> window_dimensions,
+    tensorflow::gtl::ArraySlice<int64> window_strides,
+    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
+
+// Returns the sum of the operand value within each subgroup of replicas. All
+// replicas supply one input to the sum and all replicas receive the resulting
+// sum for each subgroup.
+XlaOp CrossReplicaSum(
+    const XlaOp& operand,
+    tensorflow::gtl::ArraySlice<int64> replica_group_ids = {});
+
+// Enqueues an operation that do an AllReduce of the operand cross cores. Here
+// AllReduce means doing a reduction on the input operand cross cores and then
+// broadcasting the reduction result to those cores. The reduction function is
+// defined by `computation`, which should be a commutative computation on
+// scalars, e.g., add, min, or max. The way that AllReduce is applied is
+// configured by:
+//
+// - `replica_group_ids`: maps replica ids to subgroup ids. If empty, all
+// replicas belong to one group. Allreduce will be applied within subgroups.
+// For example, we have 4 replicas, then replica_group_ids={0,1,0,1} means,
+// replica 0 and 2 are in subgroup 0, replica 1 and 3 are in subgroup 1.
+//
+// - `channel_id`: for Allreduce nodes from different models, if they have the
+// same channel_id, they will be 'Allreduce'd. If empty, Allreduce will not be
+// applied cross models.
+//
+// TODO(b/79737069): Rename this to AllReduce when it's ready to use.
+XlaOp CrossReplicaSum(const XlaOp& operand, const XlaComputation& computation,
+                      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {},
+                      const tensorflow::gtl::optional<ChannelHandle>&
+                          channel_id = tensorflow::gtl::nullopt);
+
+// Enqueues an operation that scatters the `source` array to the selected
+// indices of each window.
+XlaOp SelectAndScatter(const XlaOp& operand, const XlaComputation& select,
+                       tensorflow::gtl::ArraySlice<int64> window_dimensions,
+                       tensorflow::gtl::ArraySlice<int64> window_strides,
+                       Padding padding, const XlaOp& source,
+                       const XlaOp& init_value, const XlaComputation& scatter);
+
+// As SelectAndScatter(), but the padding is given in the format
+// returned by MakePadding().
+XlaOp SelectAndScatterWithGeneralPadding(
+    const XlaOp& operand, const XlaComputation& select,
+    tensorflow::gtl::ArraySlice<int64> window_dimensions,
+    tensorflow::gtl::ArraySlice<int64> window_strides,
+    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
+    const XlaOp& source, const XlaOp& init_value,
+    const XlaComputation& scatter);
+
+// Enqueues an abs instruction onto the computation.
+XlaOp Abs(const XlaOp& operand);
+
+// Enqueues a atan2 instruction onto the computation.
+XlaOp Atan2(const XlaOp& y, const XlaOp& x,
+            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues an exp instruction onto the computation.
+XlaOp Exp(const XlaOp& operand);
+
+// Enqueues an expm1 instruction onto the computation.
+XlaOp Expm1(const XlaOp& operand);
+
+// Enqueues a floor instruction onto the computation.
+XlaOp Floor(const XlaOp& operand);
+
+// Enqueues a ceil instruction onto the computation.
+XlaOp Ceil(const XlaOp& operand);
+
+// Enqueues a round instruction onto the computation, rounding to nearest even
+// with half-way cases rounding away from zero.
+XlaOp Round(const XlaOp& operand);
+
+// Enqueues an log instruction (natural logarithm) onto the computation.
+XlaOp Log(const XlaOp& operand);
+
+// Enqueues an log1p instruction (log(x+1)) onto the computation.
+XlaOp Log1p(const XlaOp& operand);
+
+// Enqueues a sign instruction onto the computation.
+XlaOp Sign(const XlaOp& operand);
+
+// Enqueues a count leading zeros instruction onto the computation.
+XlaOp Clz(const XlaOp& operand);
+
+// Enqueues a cosine instruction onto the computation.
+XlaOp Cos(const XlaOp& operand);
+
+// Enqueues a sine instruction onto the computation.
+XlaOp Sin(const XlaOp& operand);
+
+// Enqueues a tanh instruction onto the computation.
+XlaOp Tanh(const XlaOp& operand);
+
+// Enqueues a real-part instruction onto the computation.
+XlaOp Real(const XlaOp& operand);
+
+// Enqueues an imaginary-part instruction onto the computation.
+XlaOp Imag(const XlaOp& operand);
+
+// Enqueues a lhs^rhs computation onto the computation.
+XlaOp Pow(const XlaOp& lhs, const XlaOp& rhs,
+          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
+
+// Enqueues an operator that tests if the operand's values are finite, i.e.,
+// not Inf or NaN. Defined only for floating-point types. Returns an array of
+// booleans with the same shape where entries are true iff the corresponding
+// entry was NaN.
+XlaOp IsFinite(const XlaOp& operand);
+
+// Enqueues a convert instruction onto the computation that changes the
+// element type of the operand array to primitive_type.
+XlaOp ConvertElementType(const XlaOp& operand, PrimitiveType new_element_type);
+
+// Enqueues a no-op instruction onto the computation that changes
+// the element type of the operand array to primitive_type. The
+// bit-widths of the source and destination element types must be
+// identical.
+XlaOp BitcastConvertType(const XlaOp& operand, PrimitiveType new_element_type);
+
+// Enqueues a negate instruction onto the computation.
+XlaOp Neg(const XlaOp& operand);
+
+// Enqueues a transpose instruction onto the computation.
+XlaOp Transpose(const XlaOp& operand,
+                tensorflow::gtl::ArraySlice<int64> permutation);
+
+// Enqueues a reverse instruction onto the computation. The order of the
+// elements in the given dimensions is reversed (i.e., the element at index i
+// is moved to index dimension_size - 1 - i).
+XlaOp Rev(const XlaOp& operand, tensorflow::gtl::ArraySlice<int64> dimensions);
+
+// Enqueues a sort (as increasing order) instruction onto the computation.
+// If only keys are provided:
+// * If the keys are an rank-1 tensor (an array), the result is a sorted array
+// of keys, in ascending order.
+// * If the keys have higher rank, the keys are sorted along the provided
+// dimension. For example, for a rank-2 tensor (a matrix) of keys, a dimension
+// value of 0 will indepenently sort every column, and a dimension value of 1
+// will independently sort each row. If no dimension number is provided, then
+// the last dimension is chosen by default.
+//
+// If both keys and values are provided:
+// * The keys and the values must tensors with the same dimensions. The
+// element types of the tensors may be different.
+// * The result is a tuple that consists of a sorted tensor of keys (along the
+// provided dimension, as above) as the first element, and a tensor with their
+// corresponding values as the second element.
+XlaOp Sort(XlaOp keys,
+           tensorflow::gtl::optional<XlaOp> values = tensorflow::gtl::nullopt,
+           int64 dimension = -1);
+
+// Enqueues a clamp instruction onto the computation.
+XlaOp Clamp(const XlaOp& min, const XlaOp& operand, const XlaOp& max);
+
+// Enqueues a map instruction onto the computation.
+XlaOp Map(XlaBuilder* builder, tensorflow::gtl::ArraySlice<XlaOp> operands,
+          const XlaComputation& computation,
+          tensorflow::gtl::ArraySlice<int64> dimensions,
+          tensorflow::gtl::ArraySlice<XlaOp> static_operands = {});
+
+// Enqueues a N(mu, sigma) random number generation instruction onto the
+// computation.
+XlaOp RngNormal(const XlaOp& mu, const XlaOp& sigma, const Shape& shape);
+
+// Enqueues a U(a, b) random number generation instruction onto the
+// computation. Returns values in the semi-open interval [a, b).
+XlaOp RngUniform(const XlaOp& a, const XlaOp& b, const Shape& shape);
+
+// Enqueues a while node onto the computation.
+XlaOp While(const XlaComputation& condition, const XlaComputation& body,
+            const XlaOp& init);
+
+// Enqueues a conditional node onto the computation.
+XlaOp Conditional(const XlaOp& predicate, const XlaOp& true_operand,
+                  const XlaComputation& true_computation,
+                  const XlaOp& false_operand,
+                  const XlaComputation& false_computation);
+
+// Enqueues a ReducePrecision node onto the computation.
+XlaOp ReducePrecision(const XlaOp& operand, const int exponent_bits,
+                      const int mantissa_bits);
+
+// Enqueues a Gather node onto the computation.
+XlaOp Gather(const XlaOp& input, const XlaOp& gather_indices,
+             const GatherDimensionNumbers& dimension_numbers,
+             tensorflow::gtl::ArraySlice<int64> window_bounds);
+
+// Enqueues a Scatter node onto the computation.
+XlaOp Scatter(const XlaOp& input, const XlaOp& scatter_indices,
+              const XlaOp& updates, const XlaComputation& update_computation,
+              const ScatterDimensionNumbers& dimension_numbers);
+
+// Enqueues a Send node onto the computation for device-to-device
+// communication. This operation sends the given operand to
+// a Recv instruction in a different computation that shares the same channel
+// handle.
+void Send(const XlaOp& operand, const ChannelHandle& handle);
+
+// Variant of Send which takes a token-shaped operand and produces a
+// token-shaped value.  Tokens are used for ordering side-effecting operations.
+// TODO(b/110532604): Replace all uses of the non-token form with this variant.
+XlaOp SendWithToken(const XlaOp& operand, const XlaOp& token,
+                    const ChannelHandle& handle);
+
+// Enqueues a Recv node onto the computation for device-to-device
+// communication. The data comes from a Send instruction in a different
+// computation that shares the same channel handle and its shape must be the
+// same as the given shape.
+XlaOp Recv(XlaBuilder* builder, const Shape& shape,
+           const ChannelHandle& handle);
+
+// Variant of Recv which takes a token-shaped operand and produces a two-element
+// tuple containing the data value and a token-shaped value. Tokens are used
+// for ordering side-effecting operations.
+// TODO(b/110532604): Replace all uses of the non-token form with this variant.
+XlaOp RecvWithToken(const XlaOp& token, const Shape& shape,
+                    const ChannelHandle& handle);
+
+// Enqueues a Send node which transfers data from the device to the host. The
+// 'shape_with_layout' argument defines the layout of the data transferred; its
+// shape must be compatible with the shape of the operand. The operand must be
+// array-shaped.
+// TODO(b/111544877): Support tuple shapes.
+XlaOp SendToHost(const XlaOp& operand, const XlaOp& token,
+                 const Shape& shape_with_layout, const ChannelHandle& handle);
+
+// Enqueues a Recv node which transfers data from the host to the device. The
+// given shape must contain a layout and must be an array.
+// TODO(b/111544877): Support tuple shapes.
+XlaOp RecvFromHost(const XlaOp& token, const Shape& shape,
+                   const ChannelHandle& handle);
+
+// Enqueues an operation (AfterAll) with no operands that produces a
+// token-shaped value.  Tokens are used for ordering side-effecting operations.
+// This is a separate method from AfterAll to facility the removal of
+// operand-less AfterAll instructions.
+// TODO(b/110532604): Remove this function when all tokens are derived from a
+// single token generated or passed into the entry computation.
+XlaOp CreateToken(XlaBuilder* builder);
+
+// Enqueues an AfterAll instruction which produces a token-shaped value and
+// takes a variadic number of token-shaped operands. The number of operands must
+// be greater than zero. Used for joining tokens.
+XlaOp AfterAll(XlaBuilder* builder, tensorflow::gtl::ArraySlice<XlaOp> tokens);
+
+// Normalizes operand across spatial and batch dimensions for each feature.
+//
+// Returns a tuple (normalized, batch_mean, batch_var) where `normalized`
+// is the normalized result and batch_mean and batch_var are the mean and
+// variance, respectively, across batch for the operand.
+XlaOp BatchNormTraining(const XlaOp& operand, const XlaOp& scale,
+                        const XlaOp& offset, float epsilon,
+                        int64 feature_index);
+
+// Normalizes operand across spatial and batch dimensions for each feature.
+//
+// `BatchNormInference` is equivalent to calling `BatchNormTraining` without
+// computing `mean` and `variance` for each batch inside the operation. It
+// uses the input `mean` and `variance` instead as estimated values. The
+// purpose of this op is to reduce latency in inference, hence the name
+// `BatchNormInference`.
+//
+// The output has the same shape as `operand`, and contains the normalized
+// values for each batch.
+XlaOp BatchNormInference(const XlaOp& operand, const XlaOp& scale,
+                         const XlaOp& offset, const XlaOp& mean,
+                         const XlaOp& variance, float epsilon,
+                         int64 feature_index);
+
+// Calculates the gradients of a batch norm op.
+//
+// The inputs `batch_mean` and `batch_var` represent the mean and variance
+// across the batch.
+//
+// Returns a tuple of three elements:
+//   - grad_operand: Gradient with respect to input `operand`
+//   - grad_offset: Gradient with respect to input `offset`
+//   - grad_scale: Gradient with respect to input `scale`
+XlaOp BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
+                    const XlaOp& batch_mean, const XlaOp& batch_var,
+                    const XlaOp& grad_output, float epsilon,
+                    int64 feature_index);
+
+// Implementation details below this point.
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR0(NativeT value) {
+  return ConstantLiteral(*LiteralUtil::CreateR0<NativeT>(value));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR1(tensorflow::gtl::ArraySlice<NativeT> values) {
+  return ConstantLiteral(*LiteralUtil::CreateR1<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR1(int64 length, NativeT value) {
+  Literal literal(ShapeUtil::MakeShape(
+      primitive_util::NativeToPrimitiveType<NativeT>(), {length}));
+  literal.PopulateWithValue(value);
+  return ConstantLiteral(literal);
+}
+
+inline XlaOp XlaBuilder::ConstantR1(const tensorflow::core::Bitmap& values) {
+  return ConstantLiteral(*LiteralUtil::CreateR1(values));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR2(
+    std::initializer_list<std::initializer_list<NativeT>> values) {
+  return ConstantLiteral(*LiteralUtil::CreateR2<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantFromArrayWithLayout(const Array<NativeT>& values,
+                                              const Layout& layout) {
+  return ConstantLiteral(
+      *LiteralUtil::CreateFromArrayWithLayout<NativeT>(values, layout));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantFromArray(const Array<NativeT>& values) {
+  return ConstantLiteral(*LiteralUtil::CreateFromArray<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR2FromArray2DWithLayout(
+    const Array2D<NativeT>& values, const Layout& layout) {
+  return ConstantLiteral(
+      *LiteralUtil::CreateFromArrayWithLayout<NativeT>(values, layout));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR2FromArray2D(const Array2D<NativeT>& values) {
+  return ConstantLiteral(*LiteralUtil::CreateR2FromArray2D<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR3FromArray3DWithLayout(
+    const Array3D<NativeT>& values, const Layout& layout) {
+  return ConstantLiteral(
+      *LiteralUtil::CreateR3FromArray3DWithLayout<NativeT>(values, layout));
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR3FromArray3D(const Array3D<NativeT>& values) {
+  return ConstantFromArray(values);
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR4FromArray4DWithLayout(
+    const Array4D<NativeT>& values, const Layout& layout) {
+  return ConstantFromArrayWithLayout(values, layout);
+}
+
+template <typename NativeT>
+XlaOp XlaBuilder::ConstantR4FromArray4D(const Array4D<NativeT>& values) {
+  return ConstantFromArray(values);
+}
+
+// Free function template implementations.
+
+template <typename NativeT>
+XlaOp ConstantR0(XlaBuilder* builder, NativeT value) {
+  return ConstantLiteral(builder, *LiteralUtil::CreateR0<NativeT>(value));
+}
+
+template <typename NativeT>
+XlaOp ConstantR1(XlaBuilder* builder,
+                 tensorflow::gtl::ArraySlice<NativeT> values) {
+  return ConstantLiteral(builder, *LiteralUtil::CreateR1<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp ConstantR1(XlaBuilder* builder, int64 length, NativeT value) {
+  Literal literal(ShapeUtil::MakeShape(
+      primitive_util::NativeToPrimitiveType<NativeT>(), {length}));
+  literal.PopulateWithValue(value);
+  return ConstantLiteral(builder, literal);
+}
+
+inline XlaOp ConstantR1(XlaBuilder* builder,
+                        const tensorflow::core::Bitmap& values) {
+  return ConstantLiteral(builder, *LiteralUtil::CreateR1(values));
+}
+
+template <typename NativeT>
+XlaOp ConstantR2(XlaBuilder* builder,
+                 std::initializer_list<std::initializer_list<NativeT>> values) {
+  return ConstantLiteral(builder, *LiteralUtil::CreateR2<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp ConstantFromArrayWithLayout(XlaBuilder* builder,
+                                  const Array<NativeT>& values,
+                                  const Layout& layout) {
+  return ConstantLiteral(
+      builder,
+      *LiteralUtil::CreateFromArrayWithLayout<NativeT>(values, layout));
+}
+
+template <typename NativeT>
+XlaOp ConstantFromArray(XlaBuilder* builder, const Array<NativeT>& values) {
+  return ConstantLiteral(builder,
+                         *LiteralUtil::CreateFromArray<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp ConstantR2FromArray2DWithLayout(XlaBuilder* builder,
+                                      const Array2D<NativeT>& values,
+                                      const Layout& layout) {
+  return ConstantLiteral(
+      builder,
+      *LiteralUtil::CreateFromArrayWithLayout<NativeT>(values, layout));
+}
+
+template <typename NativeT>
+XlaOp ConstantR2FromArray2D(XlaBuilder* builder,
+                            const Array2D<NativeT>& values) {
+  return ConstantLiteral(builder,
+                         *LiteralUtil::CreateR2FromArray2D<NativeT>(values));
+}
+
+template <typename NativeT>
+XlaOp ConstantR3FromArray3DWithLayout(XlaBuilder* builder,
+                                      const Array3D<NativeT>& values,
+                                      const Layout& layout) {
+  return ConstantLiteral(
+      builder,
+      *LiteralUtil::CreateR3FromArray3DWithLayout<NativeT>(values, layout));
+}
+
+template <typename NativeT>
+XlaOp ConstantR3FromArray3D(XlaBuilder* builder,
+                            const Array3D<NativeT>& values) {
+  return ConstantFromArray(builder, values);
+}
+
+template <typename NativeT>
+XlaOp ConstantR4FromArray4DWithLayout(XlaBuilder* builder,
+                                      const Array4D<NativeT>& values,
+                                      const Layout& layout) {
+  return ConstantFromArrayWithLayout(builder, values, layout);
+}
+
+template <typename NativeT>
+XlaOp ConstantR4FromArray4D(XlaBuilder* builder,
+                            const Array4D<NativeT>& values) {
+  return ConstantFromArray(builder, values);
+}
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_CLIENT_XLA_BUILDER_H_
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder_test.cc b/tensorflow/compiler/xla/client/xla_builder_test.cc
index 3b8beb2c78..28a207b137 100644
--- a/tensorflow/compiler/xla/client/xla_client/xla_builder_test.cc
+++ b/tensorflow/compiler/xla/client/xla_builder_test.cc
@@ -13,10 +13,11 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 #include <string>
 
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
diff --git a/tensorflow/compiler/xla/client/xla_client/BUILD b/tensorflow/compiler/xla/client/xla_client/BUILD
index ee00a9eada..2e131dbad2 100644
--- a/tensorflow/compiler/xla/client/xla_client/BUILD
+++ b/tensorflow/compiler/xla/client/xla_client/BUILD
@@ -24,55 +24,10 @@ filegroup(
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 
 cc_library(
-    name = "xla_computation",
-    srcs = ["xla_computation.cc"],
-    hdrs = ["xla_computation.h"],
-    deps = [
-        "//tensorflow/compiler/xla:status_macros",
-        "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/service:hlo_proto",
-    ],
-)
-
-cc_library(
     name = "xla_builder",
-    srcs = ["xla_builder.cc"],
     hdrs = ["xla_builder.h"],
     visibility = ["//visibility:public"],
     deps = [
-        ":xla_computation",
-        "//tensorflow/compiler/xla:execution_options_util",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla:status_macros",
-        "//tensorflow/compiler/xla:statusor",
-        "//tensorflow/compiler/xla:types",
-        "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client:padding",
-        "//tensorflow/compiler/xla/client:sharding_builder",
-        "//tensorflow/compiler/xla/service:hlo",
-        "//tensorflow/compiler/xla/service:hlo_proto",
-        "//tensorflow/compiler/xla/service:shape_inference",
-        "//tensorflow/core:lib",
-    ],
-)
-
-tf_cc_test(
-    name = "xla_builder_test",
-    srcs = ["xla_builder_test.cc"],
-    deps = [
-        ":xla_builder",
-        "//tensorflow/compiler/xla:literal_util",
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla:status_macros",
-        "//tensorflow/compiler/xla:test",
-        "//tensorflow/compiler/xla:test_helpers",
-        "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
-        "//tensorflow/compiler/xla/service:hlo",
-        "//tensorflow/compiler/xla/service:hlo_matchers",
-        "//tensorflow/core:test",
+        "//tensorflow/compiler/xla/client:xla_builder",
     ],
 )
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_builder.h b/tensorflow/compiler/xla/client/xla_client/xla_builder.h
index 274aba8a31..ce2a8afd4c 100644
--- a/tensorflow/compiler/xla/client/xla_client/xla_builder.h
+++ b/tensorflow/compiler/xla/client/xla_client/xla_builder.h
@@ -16,2094 +16,6 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_BUILDER_H_
 #define TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_BUILDER_H_
 
-#include <map>
-#include <string>
-#include <type_traits>
-#include <utility>
-
-#include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
-#include "tensorflow/compiler/xla/service/hlo.pb.h"
-#include "tensorflow/compiler/xla/service/hlo_opcode.h"
-#include "tensorflow/compiler/xla/shape_util.h"
-#include "tensorflow/compiler/xla/status_macros.h"
-#include "tensorflow/compiler/xla/statusor.h"
-#include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
-#include "tensorflow/core/lib/core/stringpiece.h"
-#include "tensorflow/core/lib/gtl/array_slice.h"
-#include "tensorflow/core/lib/gtl/flatset.h"
-#include "tensorflow/core/platform/macros.h"
-#include "tensorflow/core/platform/stacktrace.h"
-#include "tensorflow/core/platform/types.h"
-
-namespace xla {
-
-class XlaBuilder;
-
-// This represents an instruction that has been enqueued using the XlaBuilder.
-// This is used to pass to subsequent computations that depends upon the
-// instruction as an operand.
-class XlaOp {
- public:
-  XlaOp() : handle_(-1), builder_(nullptr) {
-    static_assert(std::is_trivially_destructible<XlaOp>::value,
-                  "XlaOp should be trivially destructible");
-  }
-  ~XlaOp() = default;
-
-  XlaBuilder* builder() const { return builder_; }
-
-  // Returns true if the XlaOp represents valid, non-erroneous value.
-  bool valid() const { return handle_ >= 0; }
-
-  // Returns true if the XlaOp was created by the XlaOp() constructor and
-  // not returned by a builder.
-  bool IsUninitialized() const { return builder_ == nullptr; }
-
-  bool IsIdenticalTo(const XlaOp& rhs) const {
-    return handle_ == rhs.handle_ && builder_ == rhs.builder_;
-  }
-
-  friend std::ostream& operator<<(std::ostream& out, const XlaOp& op) {
-    out << op.handle();
-    return out;
-  }
-
- private:
-  explicit XlaOp(XlaBuilder* builder) : handle_(-1), builder_(builder) {}
-  XlaOp(int64 handle, XlaBuilder* builder)
-      : handle_(handle), builder_(builder) {}
-
-  int64 handle() const { return handle_; }
-
-  friend class XlaBuilder;
-
-  // < 0 means "invalid handle".
-  int64 handle_;
-
-  // Not owned. Non-null for any handle returned by XlaBuilder, even if the
-  // handle is invalid.
-  XlaBuilder* builder_;
-};
-
-// Arithmetic operator overloads for the XlaOp type.
-XlaOp operator-(const XlaOp& x);
-XlaOp operator+(const XlaOp& x, const XlaOp& y);
-XlaOp operator-(const XlaOp& x, const XlaOp& y);
-XlaOp operator*(const XlaOp& x, const XlaOp& y);
-XlaOp operator/(const XlaOp& x, const XlaOp& y);
-XlaOp operator%(const XlaOp& x, const XlaOp& y);
-
-// Bitwise operator overloads for the XlaOp type.
-XlaOp operator~(const XlaOp& x);
-XlaOp operator&(const XlaOp& x, const XlaOp& y);
-XlaOp operator|(const XlaOp& x, const XlaOp& y);
-XlaOp operator^(const XlaOp& x, const XlaOp& y);
-XlaOp operator<<(const XlaOp& x, const XlaOp& y);
-// Performs a right arithmetic shift if 'x' is a signed type, otherwise performs
-// a right logical shift.
-XlaOp operator>>(const XlaOp& x, const XlaOp& y);
-
-// We don't overload the relational operators (==, !=, <, <=, >, >=) because the
-// semantics might be surprising since their result types are usually 'bool'.
-// Further programmers may expect == to be a structural equality.
-// We also choose not to overload any of the mutating operators (e.g., +=, -=)
-// because the semantics might be misleading — XLA computations are immutable.
-
-// A convenient interface for building up computations.
-//
-// Thread-compatible.
-class XlaBuilder {
- public:
-  // computation_name: name to use for the built computation.
-  XlaBuilder(const string& computation_name);
-
-  XlaBuilder(const XlaBuilder&) = delete;
-  XlaBuilder& operator=(const XlaBuilder&) = delete;
-
-  ~XlaBuilder();
-
-  // Returns the computation name.
-  const string& name() const { return name_; }
-
-  // Sets OpMetadata that will be added to all instructions until cleared.
-  //
-  // OpMetadata is often applied to a series of XLA HLO instructions. As a
-  // result, OpMetadata is set on the Computation Builder. All subsequent
-  // instructions generated via this Computation Builder will have the same
-  // OpMetadata attached until a call to ClearOpMetadata.
-  void SetOpMetadata(const OpMetadata& metadata) { metadata_ = metadata; }
-
-  // Clears the HloMetadata state.
-  void ClearOpMetadata() { metadata_.Clear(); }
-
-  // Sets an OpSharding that will be attached to all instructions until cleared.
-  void SetSharding(const OpSharding& sharding) { sharding_ = sharding; }
-
-  // Clears the sharding. Ops will be sharded according to the default placement
-  // policy.
-  void ClearSharding() { sharding_ = tensorflow::gtl::nullopt; }
-
-  // Returns the OpSharding that will be attached to all instructions.
-  const tensorflow::gtl::optional<OpSharding>& sharding() const {
-    return sharding_;
-  }
-
-  // Sets the builder to a mode where it will die immediately when an error is
-  // encountered, rather than producing it in a deferred fashion when Build() is
-  // called (which is the default).
-  void set_die_immediately_on_error(bool enabled) {
-    die_immediately_on_error_ = enabled;
-  }
-
-  // Default dimension numbers used for a 2D convolution.
-  static constexpr int64 kConvBatchDimension = 0;
-  static constexpr int64 kConvFeatureDimension = 1;
-  static constexpr int64 kConvFirstSpatialDimension = 2;
-  static constexpr int64 kConvSecondSpatialDimension = 3;
-  static constexpr int64 kConvKernelOutputDimension = 0;
-  static constexpr int64 kConvKernelInputDimension = 1;
-  static constexpr int64 kConvKernelFirstSpatialDimension = 2;
-  static constexpr int64 kConvKernelSecondSpatialDimension = 3;
-
-  // Creates a default ConvolutionDimensionNumbers. For a 2D convolution, for
-  // the input operand {batch, feature, height, width} = {0, 1, 2, 3} and for
-  // the kernel operand
-  // {output_feature, input_feature, height, width} = {0, 1, 2, 3}.
-  static ConvolutionDimensionNumbers CreateDefaultConvDimensionNumbers(
-      int num_spatial_dims = 2);
-
-  // Returns an error if the convolution dimension numbers have conflicts.
-  static Status Validate(const ConvolutionDimensionNumbers& dnum);
-
-  // Returns a new XlaBuilder whose resultant Computation is used only by this
-  // XlaBuilder. The sub-XlaBuilder has the same die_immediately_on_error
-  // behavior as the parent.
-  std::unique_ptr<XlaBuilder> CreateSubBuilder(const string& computation_name);
-
-  // Builds the computation with the requested operations, or returns a non-ok
-  // status. Note that all ops that have been enqueued will be moved to the
-  // computation being returned.
-  StatusOr<XlaComputation> Build();
-
-  // Builds the computation with the requested operations, or notes an error in
-  // the parent XlaBuilder and returns an empty computation if building failed.
-  // This function is intended to be used where the returned XlaComputation is
-  // only used by the parent XlaBuilder and hence further operation on the
-  // returned XlaComputation will simply be error'ed out if an error occurred
-  // while building this computation. If the built computation is to be used by
-  // a XlaBuilder other than the parent XlaBuilder then Build() should be used
-  // instead.
-  XlaComputation BuildAndNoteError();
-
-  // Returns a subgraph that roots on the given root. If the root is not a
-  // compile-time constant (see `IsConstant`), returns an error.
-  //
-  // This will copy the needed ops/computations to the subgraph.
-  StatusOr<XlaComputation> BuildConstantSubGraph(const XlaOp& root_op) const;
-
-  // Returns the first error that was encountered while building the
-  // computation. When an error is encountered, by default we return a vacuous
-  // XlaOp and inform the user of the error that occurred while
-  // building the computation when they make a final call to Build().
-  //
-  // See also set_die_immediately_on_error().
-  Status first_error() const { return first_error_; }
-
-  // Returns the shape of the given op.
-  StatusOr<Shape> GetShape(const XlaOp& op) const;
-
-  // Returns the (inferred) result for the current computation's shape.
-  StatusOr<ProgramShape> GetProgramShape() const;
-
-  // Reports an error to the builder, by
-  // * storing it internally and capturing a backtrace if it's the first error
-  //   (this deferred value will be produced on the call to
-  //    Build()/GetShape()/...)
-  // * dying if die_immediately_on_error_ is true.
-  // Returns an XlaOp with an invalid handle but a valid builder. This value can
-  // be returned in place of a value in APIs that return an XlaOp.
-  XlaOp ReportError(const Status& error);
-
-  // A helper function that converts a StatusOr<XlaOp> into an XlaOp.
-  // If the Status was an error, reports the error to builder and returns an
-  // invalid XlaOp handle.
-  XlaOp ReportErrorOrReturn(const StatusOr<XlaOp>& op);
-
-  // A helper function that runs a function that returns a StatusOr<XlaOp> and
-  // returns an XlaOp.
-  XlaOp ReportErrorOrReturn(const std::function<StatusOr<XlaOp>()>& op_creator);
-
-  // Returns true if 'operand' is a compile-time constant. A compile-time
-  // constant does not depend on any parameters, or on stateful operators such
-  // as `RngNormal` or `Infeed`.
-  //
-  // This tests whether a computation is a compile-time constant without
-  // evaluating the computation.
-  StatusOr<bool> IsConstant(const XlaOp& operand) const;
-
- private:
-  // Enqueues a "retrieve parameter value" instruction for a parameter that was
-  // passed to the computation.
-  XlaOp Parameter(int64 parameter_number, const Shape& shape,
-                  const string& name);
-
-  // Enqueues a constant with the value of the given literal onto the
-  // computation.
-  XlaOp ConstantLiteral(const LiteralSlice& literal);
-
-  // Enqueues a constant onto the computation. Methods are templated on the
-  // native host type (NativeT) which corresponds to a specific XLA
-  // PrimitiveType as given in the following table:
-  //
-  //  Native Type   PrimitiveType
-  // -----------------------------
-  //   bool           PRED
-  //   int32          S32
-  //   int64          S64
-  //   uint32         U32
-  //   uint64         U64
-  //   float          F32
-  //   double         F64
-  //
-  // Note: not all primitive types defined in xla_data.proto have a
-  // corresponding native type yet.
-  template <typename NativeT>
-  XlaOp ConstantR0(NativeT value);
-  template <typename NativeT>
-  XlaOp ConstantR1(tensorflow::gtl::ArraySlice<NativeT> values);
-  XlaOp ConstantR1(const tensorflow::core::Bitmap& values);
-  template <typename NativeT>
-  XlaOp ConstantR2(
-      std::initializer_list<std::initializer_list<NativeT>> values);
-  template <typename NativeT>
-  XlaOp ConstantFromArrayWithLayout(const Array<NativeT>& values,
-                                    const Layout& layout);
-  template <typename NativeT>
-  XlaOp ConstantFromArray(const Array<NativeT>& values);
-  template <typename NativeT>
-  XlaOp ConstantR2FromArray2DWithLayout(const Array2D<NativeT>& values,
-                                        const Layout& layout);
-  template <typename NativeT>
-  XlaOp ConstantR2FromArray2D(const Array2D<NativeT>& values);
-  template <typename NativeT>
-  XlaOp ConstantR3FromArray3DWithLayout(const Array3D<NativeT>& values,
-                                        const Layout& layout);
-  template <typename NativeT>
-  XlaOp ConstantR3FromArray3D(const Array3D<NativeT>& values);
-  template <typename NativeT>
-  XlaOp ConstantR4FromArray4DWithLayout(const Array4D<NativeT>& values,
-                                        const Layout& layout);
-  template <typename NativeT>
-  XlaOp ConstantR4FromArray4D(const Array4D<NativeT>& values);
-
-  // Enqueues a rank one constant (vector) onto the computation. The vector has
-  // size 'length' and every element has the value 'value'.
-  template <typename NativeT>
-  XlaOp ConstantR1(int64 length, NativeT value);
-
-  // Adds dimensions to an array by duplicating the data in the array.
-  //
-  // The new dimensions are inserted on the left, i.e. if
-  // broadcast_sizes has values {a0, ..., aN} and the operand shape
-  // has dimensions {b0, ..., bM} then the shape of the output has
-  // dimensions {a0, ..., aN, b0, ..., bM}.
-  //
-  // The new dimensions index into copies of the operand, i.e.
-  //
-  //   output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM]
-  XlaOp Broadcast(const XlaOp& operand,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_sizes);
-
-  // Performs in-dimension-style broadcast.
-  //
-  // Operand specifies the input to be broadcast. "shape" is expected output
-  // shape. "broadcast_dimensions" are the dimensions to be broadcasting into.
-  // Dimension numbers in broadcast_dimensions map to individual dimensions
-  // of the operand, and specify what dimension of the output shape they
-  // should be broadcast.
-  // e.g.
-  // Say operand = [1, 2], i.e., a 1D tensor with 2 elements.
-  // and dimension of shape is [2,2].
-  // Specifying {1} as brodcast_dimension will generate output
-  // [1 , 2]
-  // [1 , 2]
-  // On the other hand, specifying {0} as broadcast_dimension
-  // will generate output
-  // [1 , 1]
-  // [2 , 2]
-  XlaOp BroadcastInDim(
-      const XlaOp& operand, const Shape& shape,
-      const tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-
-  // Enqueues a pad operation onto the computation that pads the given value on
-  // the edges as well as between the elements of the input. padding_config
-  // specifies the padding amount for each dimension.
-  XlaOp Pad(const XlaOp& operand, const XlaOp& padding_value,
-            const PaddingConfig& padding_config);
-
-  // Enqueues an operation onto the computation that flattens the operand based
-  // on the dimension order (major/slowest-varying to minor/fastest-varying)
-  // given, followed by reshaping it into the shape with the given dimension
-  // sizes (also major to minor). Conceptually, this is a limited form of
-  // "shape casting".
-  XlaOp Reshape(const XlaOp& operand,
-                tensorflow::gtl::ArraySlice<int64> dimensions,
-                tensorflow::gtl::ArraySlice<int64> new_sizes);
-
-  // Enqueues an operation onto the computation that collapses the operand, from
-  // first to last dimension (C order), then reshapes it to the given dimension
-  // sizes. Conceptually, this is a limited form of "shape casting".
-  XlaOp Reshape(const XlaOp& operand,
-                tensorflow::gtl::ArraySlice<int64> new_sizes);
-
-  // Wrapper for Reshape.
-  // Enqueues an operation to collapse the provided dimensions; e.g. an
-  // operand with dimensions {x=256, y=2, z=2, p=32} can be collapsed to
-  // {x=1024, y=32} by collapsing dims {0, 1, 2}. Collapsing dimensions must
-  // be a consecutive, in-order subsequence of the operand dimensions.
-  //
-  // Note that collapsing a single dimension does nothing:
-  //
-  //    {256} collapsing {0} => {256}
-  //    {1} collapsing {0} => {1}
-  //
-  // Collapsing multiple dimensions produces a single result dimension:
-  //
-  //    {256, 2} collapsing {0,1} => {512}
-  //    {256, 2, 3} collapsing {0,1} => {512, 3}
-  //
-  // This could potentially cause data to be moved -- it provides a more
-  // structured form of reshaping than an arbitrary Reshape operation.
-  XlaOp Collapse(const XlaOp& operand,
-                 tensorflow::gtl::ArraySlice<int64> dimensions);
-
-  // Enqueues a slice operation onto the computation that slices the operand
-  // from the start indices to the limit indices; e.g.
-  //
-  //        x
-  //   [ 0 1 2 3 ]
-  // y [ 4 5 6 7 ] => slice(start={1, 1}, limit={2, 3}) => [ 5 6 ]
-  //   [ 8 9 a b ]
-  //
-  // Note that "limit" means up-to-but-not-including; i.e. [start, limit) in 1D
-  // range notation.
-  // The strides parameter determines the stride over the slice
-  XlaOp Slice(const XlaOp& operand,
-              tensorflow::gtl::ArraySlice<int64> start_indices,
-              tensorflow::gtl::ArraySlice<int64> limit_indices,
-              tensorflow::gtl::ArraySlice<int64> strides);
-
-  // Enqueues a slice operation in a given dimension, taking all other
-  // dimensions as they are; e.g. if dimno is 1 from start_index 2 to
-  // limit_index 4 by 1, and the shape is f32[7,8,9], this call is short-hand
-  // for:
-  //
-  //  array[:, 2:4:1, :]
-  XlaOp SliceInDim(const XlaOp& operand, int64 start_index, int64 limit_index,
-                   int64 stride, int64 dimno);
-
-  // Enqueues a slice operation onto the computation that slices the 'operand'
-  // from dynamic start indices which are passed in 'start_indices'.
-  // The size of the slice in each dimension is passed in 'slice_sizes',
-  // which specify the end point of exclusive slice intervals in each
-  // dimension [start, start + size).
-  // The shape of 'start_indices' must be rank == 1, with dimension size
-  // equal to the rank of the 'operand'.
-  // Slice index calculations are computed modulo input dimension sizes to
-  // prevent dynamic start indices from generating out-of-bound array accesses.
-  XlaOp DynamicSlice(const XlaOp& operand, const XlaOp& start_indices,
-                     tensorflow::gtl::ArraySlice<int64> slice_sizes);
-
-  // Enqueues a dynamic update slice operation onto the computation, which
-  // updates a slice of 'operand' with 'update' at dynamic 'start_indices'.
-  // The shape of 'update' determines the shape of the slice of 'operand'
-  // which is updated.
-  // The indices specified in 'start_indices' specify the offset of the slice
-  // of 'operand' which is updated.
-  //
-  //               update = {10, 11} // calculated at runtime.
-  //   [1 2 3]     start  = {1, 1}   // calculated at runtime.  [1 2  3 ]
-  //   [4 5 6]  => DynamicUpdateslice(data, update, start)   => [4 10 11]
-  //   [7 8 9]                                                  [7 8  9 ]
-  //
-  // The shape of 'start_indices' must be rank == 1, with dimension size
-  // equal to the rank of the 'operand'.
-  // Slice index calculations are computed modulo update dimension sizes to
-  // prevent dynamic start indices from generating out-of-bound array accesses.
-  XlaOp DynamicUpdateSlice(const XlaOp& operand, const XlaOp& update,
-                           const XlaOp& start_indices);
-
-  // Enqueues a concatenate instruction onto the computation. 'operands' must
-  // have >= 1 entry.
-  XlaOp ConcatInDim(tensorflow::gtl::ArraySlice<XlaOp> operands,
-                    int64 dimension);
-
-  // Enqueue a tracing operation onto the computation; the computation will emit
-  // a logging message with the operand.
-  void Trace(const string& tag, const XlaOp& operand);
-
-  // Enqueues a conditional-move-like select operation onto the computation;
-  // predicated on pred, selects between on_true and on_false.
-  XlaOp Select(const XlaOp& pred, const XlaOp& on_true, const XlaOp& on_false);
-
-  // Enqueues a tuple-creation instruction onto the computation.
-  XlaOp Tuple(tensorflow::gtl::ArraySlice<XlaOp> elements);
-
-  // Enqueues a tuple-element-get instruction onto the computation.
-  XlaOp GetTupleElement(const XlaOp& tuple_data, int64 index);
-
-  // Enqueues an equal-to comparison instruction onto the computation.
-  XlaOp Eq(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a not-equal comparison instruction onto the computation.
-  XlaOp Ne(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a greater-or-equal comparison instruction onto the computation.
-  XlaOp Ge(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a greater-than comparison instruction onto the computation.
-  XlaOp Gt(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a less-than comparison instruction onto the computation.
-  XlaOp Lt(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a less-or-equal comparison instruction onto the computation.
-  XlaOp Le(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a dot instruction onto the computation.
-  XlaOp Dot(const XlaOp& lhs, const XlaOp& rhs);
-
-  // Enqueues a general dot instruction onto the computation.
-  XlaOp DotGeneral(const XlaOp& lhs, const XlaOp& rhs,
-                   const DotDimensionNumbers& dimension_numbers);
-
-  // Enqueues a convolution instruction onto the computation, which uses the
-  // default convolution dimension numbers.
-  XlaOp Conv(const XlaOp& lhs, const XlaOp& rhs,
-             tensorflow::gtl::ArraySlice<int64> window_strides,
-             Padding padding);
-
-  // Enqueues a convolution instruction onto the computation, with the caller
-  // provided padding configuration in the format returned by MakePadding().
-  XlaOp ConvWithGeneralPadding(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
-
-  // Enqueues a convolution instruction onto the computation, with the caller
-  // provided dimension numbers configuration.
-  XlaOp ConvWithGeneralDimensions(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
-      const ConvolutionDimensionNumbers& dimension_numbers);
-
-  // Enqueues a convolution instruction onto the computation, with the caller
-  // provided padding configuration as well as the dimension numbers.
-  XlaOp ConvGeneral(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-      const ConvolutionDimensionNumbers& dimension_numbers);
-
-  // Enqueues a convolution instruction onto the computation, with the caller
-  // provided padding configuration, dilation factors and dimension numbers.
-  XlaOp ConvGeneralDilated(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-      tensorflow::gtl::ArraySlice<int64> lhs_dilation,
-      tensorflow::gtl::ArraySlice<int64> rhs_dilation,
-      const ConvolutionDimensionNumbers& dimension_numbers);
-
-  // Enqueues an FFT instruction onto the computation, of the given type and
-  // with the given FFT length.
-  XlaOp Fft(const XlaOp& operand, FftType fft_type,
-            tensorflow::gtl::ArraySlice<int64> fft_length);
-
-  // Enqueues an infeed instruction onto the computation, which writes data of
-  // the given shape to the infeed buffer of the device.
-  XlaOp Infeed(const Shape& shape, const string& config = "");
-
-  // Enqueues an outfeed instruction onto the computation. This instruction
-  // generates outgoing data transfers for the given data.
-  //
-  // shape_with_layout communicates the laid out shape that we want to outfeed
-  // -- if !ShapeUtil::Compatible(GetShape(operand), shape_with_layout) an error
-  // will occur.
-  void Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
-               const string& outfeed_config);
-
-  // Enqueues a call instruction onto the computation.
-  XlaOp Call(const XlaComputation& computation,
-             tensorflow::gtl::ArraySlice<XlaOp> operands);
-
-  // Enqueues a custom call instruction onto the computation.
-  // During code generation, a call instruction is emitted which targets a
-  // symbol with the name |call_target_name|.  The |operands| are passed to the
-  // call instruction.  |shape| is the resultant shape.
-  XlaOp CustomCall(const string& call_target_name,
-                   tensorflow::gtl::ArraySlice<XlaOp> operands,
-                   const Shape& shape);
-
-  // Enqueues a pseudo-op to represent host-side computation data-dependencies.
-  // During code generation, host send and receive operations will be generated
-  // to transfer |operands| to the host and a single result of |shape| back to
-  // the device.  Host send/recv operations are emitted using |channel_name|.
-  // Dataflow dependencies and the |cost_estimate_ns| field may be used in HLO
-  // instruction scheduling.
-  XlaOp HostCompute(tensorflow::gtl::ArraySlice<XlaOp> operands,
-                    const string& channel_name, int64 cost_estimate_ns,
-                    const Shape& shape);
-
-  // The following methods enqueue element-wise binary arithmetic operations
-  // onto the computation. The shapes of the operands have to match unless one
-  // of the operands is a scalar, or an explicit broadcast dimension is given
-  // (see g3doc for more details).
-
-  // Enqueues a complex compose instruction onto the computation.
-  XlaOp Complex(const XlaOp& real, const XlaOp& imag,
-                tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a complex conjugate instruction onto the computation.
-  XlaOp Conj(const XlaOp& operand);
-
-  // Enqueues an add instruction onto the computation.
-  XlaOp Add(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a subtract instruction onto the computation.
-  XlaOp Sub(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a multiply instruction onto the computation.
-  XlaOp Mul(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a divide instruction onto the computation.
-  XlaOp Div(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a remainder instruction onto the computation.
-  XlaOp Rem(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a max instruction onto the computation.
-  XlaOp Max(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues a min instruction onto the computation.
-  XlaOp Min(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Element-wise logical operators
-  XlaOp And(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  XlaOp Or(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  XlaOp Xor(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  XlaOp Not(const XlaOp& operand);
-
-  XlaOp ShiftLeft(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-  XlaOp ShiftRightArithmetic(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-  XlaOp ShiftRightLogical(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Reduces an array among the provided dimensions, given "computation" as a
-  // reduction operator.
-  XlaOp Reduce(const XlaOp& operand, const XlaOp& init_value,
-               const XlaComputation& computation,
-               tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce);
-
-  // Convenience wrapper around the above that reduces all the dimensions in the
-  // operand shape.
-  XlaOp ReduceAll(const XlaOp& operand, const XlaOp& init_value,
-                  const XlaComputation& computation);
-
-  // Enqueues a windowed reduce instruction onto the computation.
-  XlaOp ReduceWindow(const XlaOp& operand, const XlaOp& init_value,
-                     const XlaComputation& computation,
-                     tensorflow::gtl::ArraySlice<int64> window_dimensions,
-                     tensorflow::gtl::ArraySlice<int64> window_strides,
-                     Padding padding);
-
-  // As ReduceWindow(), but the padding is given in the format
-  // returned by MakePadding().
-  XlaOp ReduceWindowWithGeneralPadding(
-      const XlaOp& operand, const XlaOp& init_value,
-      const XlaComputation& computation,
-      tensorflow::gtl::ArraySlice<int64> window_dimensions,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
-
-  // Returns the sum of the operand value within each subgroup of replicas. All
-  // replicas supply one input to the sum and all replicas receive the resulting
-  // sum for each subgroup.
-  XlaOp CrossReplicaSum(
-      const XlaOp& operand,
-      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {});
-
-  // Enqueues an operation that do an AllReduce of the operand cross cores. Here
-  // AllReduce means doing a reduction on the input operand cross cores and then
-  // broadcasting the reduction result to those cores. The reduction function is
-  // defined by `computation`, which should be a commutative computation on
-  // scalars, e.g., add, min, or max. The way that AllReduce is applied is
-  // configured by:
-  //
-  // - `replica_group_ids`: maps replica ids to subgroup ids. If empty, all
-  // replicas belong to one group. Allreduce will be applied within subgroups.
-  // For example, we have 4 replicas, then replica_group_ids={0,1,0,1} means,
-  // replica 0 and 2 are in subgroup 0, replica 1 and 3 are in subgroup 1.
-  //
-  // - `channel_id`: for Allreduce nodes from different models, if they have the
-  // same channel_id, they will be 'Allreduce'd. If empty, Allreduce will not be
-  // applied cross models.
-  //
-  // TODO(b/79737069): Rename this to AllReduce when it's ready to use.
-  XlaOp CrossReplicaSum(
-      const XlaOp& operand, const XlaComputation& computation,
-      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {},
-      const tensorflow::gtl::optional<ChannelHandle>& channel_id =
-          tensorflow::gtl::nullopt);
-
-  // Enqueues an operation that scatters the `source` array to the selected
-  // indices of each window.
-  XlaOp SelectAndScatter(const XlaOp& operand, const XlaComputation& select,
-                         tensorflow::gtl::ArraySlice<int64> window_dimensions,
-                         tensorflow::gtl::ArraySlice<int64> window_strides,
-                         Padding padding, const XlaOp& source,
-                         const XlaOp& init_value,
-                         const XlaComputation& scatter);
-
-  // As SelectAndScatter(), but the padding is given in the format
-  // returned by MakePadding().
-  XlaOp SelectAndScatterWithGeneralPadding(
-      const XlaOp& operand, const XlaComputation& select,
-      tensorflow::gtl::ArraySlice<int64> window_dimensions,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-      const XlaOp& source, const XlaOp& init_value,
-      const XlaComputation& scatter);
-
-  // Enqueues an abs instruction onto the computation.
-  XlaOp Abs(const XlaOp& operand);
-
-  // Enqueues a atan2 instruction onto the computation.
-  XlaOp Atan2(const XlaOp& y, const XlaOp& x,
-              tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues an exp instruction onto the computation.
-  XlaOp Exp(const XlaOp& operand);
-
-  // Enqueues an expm1 instruction onto the computation.
-  XlaOp Expm1(const XlaOp& operand);
-
-  // Enqueues a floor instruction onto the computation.
-  XlaOp Floor(const XlaOp& operand);
-
-  // Enqueues a ceil instruction onto the computation.
-  XlaOp Ceil(const XlaOp& operand);
-
-  // Enqueues a round instruction onto the computation, rounding to nearest even
-  // with half-way cases rounding away from zero.
-  XlaOp Round(const XlaOp& operand);
-
-  // Enqueues an log instruction (natural logarithm) onto the computation.
-  XlaOp Log(const XlaOp& operand);
-
-  // Enqueues an log1p instruction (log(x+1)) onto the computation.
-  XlaOp Log1p(const XlaOp& operand);
-
-  // Enqueues a sign instruction onto the computation.
-  XlaOp Sign(const XlaOp& operand);
-
-  // Enqueues a count leading zeros instruction onto the computation.
-  XlaOp Clz(const XlaOp& operand);
-
-  // Enqueues a cosine instruction onto the computation.
-  XlaOp Cos(const XlaOp& operand);
-
-  // Enqueues a sine instruction onto the computation.
-  XlaOp Sin(const XlaOp& operand);
-
-  // Enqueues a tanh instruction onto the computation.
-  XlaOp Tanh(const XlaOp& operand);
-
-  // Enqueues a real-part instruction onto the computation.
-  XlaOp Real(const XlaOp& operand);
-
-  // Enqueues an imaginary-part instruction onto the computation.
-  XlaOp Imag(const XlaOp& operand);
-
-  // Enqueues a lhs^rhs computation onto the computation.
-  XlaOp Pow(const XlaOp& lhs, const XlaOp& rhs,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-  // Enqueues an operator that tests if the operand's values are finite, i.e.,
-  // not Inf or NaN. Defined only for floating-point types. Returns an array of
-  // booleans with the same shape where entries are true iff the corresponding
-  // entry was NaN.
-  XlaOp IsFinite(const XlaOp& operand);
-
-  // Enqueues a convert instruction onto the computation that changes the
-  // element type of the operand array to primitive_type.
-  XlaOp ConvertElementType(const XlaOp& operand,
-                           PrimitiveType new_element_type);
-
-  // Enqueues a no-op instruction onto the computation that changes
-  // the element type of the operand array to primitive_type. The
-  // bit-widths of the source and destination element types must be
-  // identical.
-  XlaOp BitcastConvertType(const XlaOp& operand,
-                           PrimitiveType new_element_type);
-
-  // Enqueues a negate instruction onto the computation.
-  XlaOp Neg(const XlaOp& operand);
-
-  // Enqueues a transpose instruction onto the computation.
-  XlaOp Transpose(const XlaOp& operand,
-                  tensorflow::gtl::ArraySlice<int64> permutation);
-
-  // Enqueues a reverse instruction onto the computation. The order of the
-  // elements in the given dimensions is reversed (i.e., the element at index i
-  // is moved to index dimension_size - 1 - i).
-  XlaOp Rev(const XlaOp& operand,
-            tensorflow::gtl::ArraySlice<int64> dimensions);
-
-  // Enqueues a sort (as increasing order) instruction onto the computation.
-  // If only keys are provided:
-  // * The keys must be a rank-1 tensor (i.e. an array).
-  // * The result is a sorted array of keys.
-  //
-  // If both keys and values are provided:
-  // * The keys and the values must be rank-1 tensors with the same dimensions.
-  // The element types of the tensors may be different.
-  // * The result is a tuple that consists of a sorted array of keys as the
-  // first element, and an array with their corresponding values as the second
-  // element.
-  XlaOp Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values =
-                             tensorflow::gtl::nullopt);
-
-  // Enqueues a clamp instruction onto the computation.
-  XlaOp Clamp(const XlaOp& min, const XlaOp& operand, const XlaOp& max);
-
-  // Enqueues a map instruction onto the computation.
-  XlaOp Map(tensorflow::gtl::ArraySlice<XlaOp> operands,
-            const XlaComputation& computation,
-            tensorflow::gtl::ArraySlice<int64> dimensions,
-            tensorflow::gtl::ArraySlice<XlaOp> static_operands = {});
-
-  // Enqueues a N(mu, sigma) random number generation instruction onto the
-  // computation.
-  XlaOp RngNormal(const XlaOp& mu, const XlaOp& sigma, const Shape& shape);
-
-  // Enqueues a U(a, b) random number generation instruction onto the
-  // computation. Returns values in the semi-open interval [a, b).
-  XlaOp RngUniform(const XlaOp& a, const XlaOp& b, const Shape& shape);
-
-  // Enqueues a while node onto the computation.
-  XlaOp While(const XlaComputation& condition, const XlaComputation& body,
-              const XlaOp& init);
-
-  // Enqueues a conditional node onto the computation.
-  XlaOp Conditional(const XlaOp& predicate, const XlaOp& true_operand,
-                    const XlaComputation& true_computation,
-                    const XlaOp& false_operand,
-                    const XlaComputation& false_computation);
-
-  // Enqueues a ReducePrecision node onto the computation.
-  XlaOp ReducePrecision(const XlaOp& operand, const int exponent_bits,
-                        const int mantissa_bits);
-
-  // Enqueues a Gather node onto the computation.
-  XlaOp Gather(const XlaOp& input, const XlaOp& gather_indices,
-               const GatherDimensionNumbers& dimension_numbers,
-               tensorflow::gtl::ArraySlice<int64> window_bounds);
-
-  // Enqueues a Send node onto the computation, to send the given operand to
-  // a Recv instruction that shares the same channel handle.
-  void Send(const XlaOp& operand, const ChannelHandle& handle);
-
-  // Enqueues a Recv node onto the computation. The data comes from a Send
-  // instruction that shares the same channel handle and its shape must
-  // be the same as the given shape.
-  XlaOp Recv(const Shape& shape, const ChannelHandle& handle);
-
-  // Normalizes operand across spatial and batch dimensions for each feature.
-  //
-  // Returns a tuple (normalized, batch_mean, batch_var) where `normalized`
-  // is the normalized result and batch_mean and batch_var are the mean and
-  // variance, respectively, across batch for the operand.
-  XlaOp BatchNormTraining(const XlaOp& operand, const XlaOp& scale,
-                          const XlaOp& offset, float epsilon,
-                          int64 feature_index);
-
-  // Normalizes operand across spatial and batch dimensions for each feature.
-  //
-  // `BatchNormInference` is equivalent to calling `BatchNormTraining` without
-  // computing `mean` and `variance` for each batch inside the operation. It
-  // uses the input `mean` and `variance` instead as estimated values. The
-  // purpose of this op is to reduce latency in inference, hence the name
-  // `BatchNormInference`.
-  //
-  // The output has the same shape as `operand`, and contains the normalized
-  // values for each batch.
-  XlaOp BatchNormInference(const XlaOp& operand, const XlaOp& scale,
-                           const XlaOp& offset, const XlaOp& mean,
-                           const XlaOp& variance, float epsilon,
-                           int64 feature_index);
-
-  // Calculates the gradients of a batch norm op.
-  //
-  // The inputs `batch_mean` and `batch_var` represent the mean and variance
-  // across the batch.
-  //
-  // Returns a tuple of three elements:
-  //   - grad_operand: Gradient with respect to input `operand`
-  //   - grad_offset: Gradient with respect to input `offset`
-  //   - grad_scale: Gradient with respect to input `scale`
-  XlaOp BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
-                      const XlaOp& batch_mean, const XlaOp& batch_var,
-                      const XlaOp& grad_output, float epsilon,
-                      int64 feature_index);
-
-  StatusOr<XlaOp> AddInstruction(
-      HloInstructionProto&& instr, HloOpcode opcode,
-      tensorflow::gtl::ArraySlice<XlaOp> operands = {});
-
-  void AddCalledComputation(const XlaComputation& computation,
-                            HloInstructionProto* instr);
-
-  StatusOr<const HloInstructionProto*> LookUpInstruction(const XlaOp& op) const;
-
-  // Internal helper method that does the building for an arbitrary unary op.
-  XlaOp UnaryOp(HloOpcode unop, const XlaOp& operand);
-
-  // Internal helper method that does the building for an arbitrary binary op.
-  // broadcast_dimensions specifies which dimensions to use for broadcasting
-  // when the operation is between tensors of different ranks.
-  XlaOp BinaryOp(HloOpcode binop, const XlaOp& lhs, const XlaOp& rhs,
-                 tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-
-  // Internal helper method that does the building for an arbitrary ternary op.
-  XlaOp TernaryOp(HloOpcode triop, const XlaOp& lhs, const XlaOp& rhs,
-                  const XlaOp& ehs);
-
-  XlaOp RngOp(RandomDistribution distribution,
-              tensorflow::gtl::ArraySlice<XlaOp> parameters,
-              const Shape& shape);
-
-  StatusOr<XlaOp> InDimBroadcast(
-      const Shape& shape, const XlaOp& operand,
-      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-
-  // Internal helper method that creates a sequence of instructions that
-  // performs an explicit broadcast of the operand to the target shape.
-  StatusOr<XlaOp> AddBroadcastSequence(const Shape& output_shape,
-                                       const XlaOp& operand);
-
-  // Internal helper method for creating a Reshape op with the already inferred
-  // shape.
-  StatusOr<XlaOp> Reshape(const Shape& shape, const XlaOp& operand);
-
-  // Returns the (inferred) result for the program shape for the current
-  // computation and fills the root_id in the pointer.
-  StatusOr<ProgramShape> GetProgramShape(int64* root_id) const;
-
-  // Returns shapes for the operands.
-  StatusOr<std::vector<Shape>> GetOperandShapes(
-      tensorflow::gtl::ArraySlice<XlaOp> operands) const;
-
-  // A visitor which checks whether an operation is a compile-time constant,
-  // meaning that it doesn't depend on any parameters, or on any stateful
-  // operation such as `RngNormal` or `Infeed`. The visitor walks the
-  // computation starting at a given operation and sets is_constant to false iff
-  // a parameter or stateful operation is encountered.
-  void IsConstantVisitor(const int64 op_handle, std::set<int64>* visited,
-                         bool* is_constant) const;
-
-  // Checks bounds for convolution parameters.
-  Status VerifyConvolution(
-      const Shape& lhs_shape, const Shape& rhs_shape,
-      const ConvolutionDimensionNumbers& dimension_numbers) const;
-
-  // Helper function for creating a Window proto from user-supplied data.
-  // Returns error if the user-supplied data was invalid.
-  StatusOr<Window> MakeWindow(
-      tensorflow::gtl::ArraySlice<int64> window_dimensions,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-      tensorflow::gtl::ArraySlice<int64> lhs_dilation,
-      tensorflow::gtl::ArraySlice<int64> rhs_dilation) const;
-
-  string name_;  // Name to use for the built computation.
-
-  // The first error encountered while building the computation.
-  // This is OK until the first error is encountered.
-  Status first_error_;
-
-  // The saved stack trace from the point at which the first error occurred.
-  tensorflow::SavedStackTrace first_error_backtrace_;
-
-  // The instructions of this computation.
-  std::vector<HloInstructionProto> instructions_;
-
-  // The embedded computations used by this computation. Each computation was
-  // the entry computation of some XlaComputation, the key is the unique id of
-  // that XlaComputation.
-  std::map<int64, HloComputationProto> embedded_;
-
-  // The unique parameter numbers.
-  tensorflow::gtl::FlatSet<int64> parameter_numbers_;
-
-  // The metadata to attach to each op. This is structured as a "modal"-like
-  // operation, in order to simplify client code (and not sprinkle this metadata
-  // throughout the TensorFlow op kernel implementations).
-  OpMetadata metadata_;
-
-  // Sharding for this operator. This is structured as a "model"-like operation,
-  // in order to simplify client code, similar to metadata_.
-  tensorflow::gtl::optional<OpSharding> sharding_;
-
-  // Mode bit that indicates whether to die when a first error is encountered.
-  bool die_immediately_on_error_ = false;
-
-  XlaBuilder* parent_builder_{nullptr};
-
-  friend XlaOp Parameter(XlaBuilder* builder, int64 parameter_number,
-                         const Shape& shape, const string& name);
-  friend XlaOp ConstantLiteral(XlaBuilder* builder,
-                               const LiteralSlice& literal);
-  template <typename NativeT>
-  friend XlaOp ConstantR0(XlaBuilder* builder, NativeT value);
-  template <typename NativeT>
-  friend XlaOp ConstantR1(XlaBuilder* builder,
-                          tensorflow::gtl::ArraySlice<NativeT> values);
-  friend XlaOp ConstantR1(XlaBuilder* builder,
-                          const tensorflow::core::Bitmap& values);
-  template <typename NativeT>
-  friend XlaOp ConstantR2(
-      XlaBuilder* builder,
-      std::initializer_list<std::initializer_list<NativeT>> values);
-  template <typename NativeT>
-  friend XlaOp ConstantFromArrayWithLayout(XlaBuilder* builder,
-                                           const Array<NativeT>& values,
-                                           const Layout& layout);
-  template <typename NativeT>
-  friend XlaOp ConstantFromArray(XlaBuilder* builder,
-                                 const Array<NativeT>& values);
-  template <typename NativeT>
-  friend XlaOp ConstantR2FromArray2DWithLayout(XlaBuilder* builder,
-                                               const Array2D<NativeT>& values,
-                                               const Layout& layout);
-  template <typename NativeT>
-  friend XlaOp ConstantR2FromArray2D(XlaBuilder* builder,
-                                     const Array2D<NativeT>& values);
-  template <typename NativeT>
-  friend XlaOp ConstantR3FromArray3DWithLayout(XlaBuilder* builder,
-                                               const Array3D<NativeT>& values,
-                                               const Layout& layout);
-  template <typename NativeT>
-  friend XlaOp ConstantR3FromArray3D(XlaBuilder* builder,
-                                     const Array3D<NativeT>& values);
-  template <typename NativeT>
-  friend XlaOp ConstantR4FromArray4DWithLayout(XlaBuilder* builder,
-                                               const Array4D<NativeT>& values,
-                                               const Layout& layout);
-  template <typename NativeT>
-  friend XlaOp ConstantR4FromArray4D(XlaBuilder* builder,
-                                     const Array4D<NativeT>& values);
-
-  template <typename NativeT>
-  friend XlaOp ConstantR1(XlaBuilder* builder, int64 length, NativeT value);
-
-  friend XlaOp Broadcast(const XlaOp& operand,
-                         tensorflow::gtl::ArraySlice<int64> broadcast_sizes);
-
-  friend XlaOp BroadcastInDim(
-      const XlaOp& operand, const Shape& shape,
-      const tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-
-  friend XlaOp Pad(const XlaOp& operand, const XlaOp& padding_value,
-                   const PaddingConfig& padding_config);
-
-  friend XlaOp Reshape(const XlaOp& operand,
-                       tensorflow::gtl::ArraySlice<int64> dimensions,
-                       tensorflow::gtl::ArraySlice<int64> new_sizes);
-
-  friend XlaOp Reshape(const XlaOp& operand,
-                       tensorflow::gtl::ArraySlice<int64> new_sizes);
-
-  friend XlaOp Collapse(const XlaOp& operand,
-                        tensorflow::gtl::ArraySlice<int64> dimensions);
-
-  friend XlaOp Slice(const XlaOp& operand,
-                     tensorflow::gtl::ArraySlice<int64> start_indices,
-                     tensorflow::gtl::ArraySlice<int64> limit_indices,
-                     tensorflow::gtl::ArraySlice<int64> strides);
-
-  friend XlaOp SliceInDim(const XlaOp& operand, int64 start_index,
-                          int64 limit_index, int64 stride, int64 dimno);
-
-  friend XlaOp DynamicSlice(const XlaOp& operand, const XlaOp& start_indices,
-                            tensorflow::gtl::ArraySlice<int64> slice_sizes);
-
-  friend XlaOp DynamicUpdateSlice(const XlaOp& operand, const XlaOp& update,
-                                  const XlaOp& start_indices);
-
-  friend XlaOp ConcatInDim(XlaBuilder* builder,
-                           tensorflow::gtl::ArraySlice<XlaOp> operands,
-                           int64 dimension);
-
-  friend void Trace(const string& tag, const XlaOp& operand);
-
-  friend XlaOp Select(const XlaOp& pred, const XlaOp& on_true,
-                      const XlaOp& on_false);
-  friend XlaOp Tuple(XlaBuilder* builder,
-                     tensorflow::gtl::ArraySlice<XlaOp> elements);
-  friend XlaOp GetTupleElement(const XlaOp& tuple_data, int64 index);
-  friend XlaOp Eq(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Ne(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Ge(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Gt(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Lt(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Le(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Dot(const XlaOp& lhs, const XlaOp& rhs);
-  friend XlaOp DotGeneral(const XlaOp& lhs, const XlaOp& rhs,
-                          const DotDimensionNumbers& dimension_numbers);
-  friend XlaOp Conv(const XlaOp& lhs, const XlaOp& rhs,
-                    tensorflow::gtl::ArraySlice<int64> window_strides,
-                    Padding padding);
-  friend XlaOp ConvWithGeneralPadding(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
-  friend XlaOp ConvWithGeneralDimensions(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
-      const ConvolutionDimensionNumbers& dimension_numbers);
-  friend XlaOp ConvGeneral(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-      const ConvolutionDimensionNumbers& dimension_numbers);
-  friend XlaOp ConvGeneralDilated(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-      tensorflow::gtl::ArraySlice<int64> lhs_dilation,
-      tensorflow::gtl::ArraySlice<int64> rhs_dilation,
-      const ConvolutionDimensionNumbers& dimension_numbers);
-  friend XlaOp Fft(const XlaOp& operand, FftType fft_type,
-                   tensorflow::gtl::ArraySlice<int64> fft_length);
-  friend XlaOp Infeed(XlaBuilder* builder, const Shape& shape,
-                      const string& config);
-  friend void Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
-                      const string& outfeed_config);
-  friend XlaOp Call(XlaBuilder* builder, const XlaComputation& computation,
-                    tensorflow::gtl::ArraySlice<XlaOp> operands);
-  friend XlaOp CustomCall(XlaBuilder* builder, const string& call_target_name,
-                          tensorflow::gtl::ArraySlice<XlaOp> operands,
-                          const Shape& shape);
-  friend XlaOp HostCompute(XlaBuilder* builder,
-                           tensorflow::gtl::ArraySlice<XlaOp> operands,
-                           const string& channel_name, int64 cost_estimate_ns,
-                           const Shape& shape);
-  friend XlaOp Complex(const XlaOp& real, const XlaOp& imag,
-                       tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Conj(const XlaOp& operand);
-  friend XlaOp Add(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Sub(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Mul(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Div(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Rem(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Max(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Min(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp And(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Or(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Xor(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Not(const XlaOp& operand);
-  friend XlaOp ShiftLeft(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp ShiftRightArithmetic(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp ShiftRightLogical(
-      const XlaOp& lhs, const XlaOp& rhs,
-      tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Reduce(const XlaOp& operand, const XlaOp& init_value,
-                      const XlaComputation& computation,
-                      tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce);
-  friend XlaOp ReduceAll(const XlaOp& operand, const XlaOp& init_value,
-                         const XlaComputation& computation);
-  friend XlaOp ReduceWindow(
-      const XlaOp& operand, const XlaOp& init_value,
-      const XlaComputation& computation,
-      tensorflow::gtl::ArraySlice<int64> window_dimensions,
-      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding);
-  friend XlaOp ReduceWindowWithGeneralPadding(
-      const XlaOp& operand, const XlaOp& init_value,
-      const XlaComputation& computation,
-      tensorflow::gtl::ArraySlice<int64> window_dimensions,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
-  friend XlaOp CrossReplicaSum(
-      const XlaOp& operand,
-      tensorflow::gtl::ArraySlice<int64> replica_group_ids);
-  friend XlaOp CrossReplicaSum(
-      const XlaOp& operand, const XlaComputation& computation,
-      tensorflow::gtl::ArraySlice<int64> replica_group_ids,
-      const tensorflow::gtl::optional<ChannelHandle>& channel_id);
-  friend XlaOp SelectAndScatter(
-      const XlaOp& operand, const XlaComputation& select,
-      tensorflow::gtl::ArraySlice<int64> window_dimensions,
-      tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
-      const XlaOp& source, const XlaOp& init_value,
-      const XlaComputation& scatter);
-  friend XlaOp SelectAndScatterWithGeneralPadding(
-      const XlaOp& operand, const XlaComputation& select,
-      tensorflow::gtl::ArraySlice<int64> window_dimensions,
-      tensorflow::gtl::ArraySlice<int64> window_strides,
-      tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-      const XlaOp& source, const XlaOp& init_value,
-      const XlaComputation& scatter);
-  friend XlaOp Abs(const XlaOp& operand);
-  friend XlaOp Atan2(const XlaOp& y, const XlaOp& x,
-                     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp Exp(const XlaOp& operand);
-  friend XlaOp Expm1(const XlaOp& operand);
-  friend XlaOp Floor(const XlaOp& operand);
-  friend XlaOp Ceil(const XlaOp& operand);
-  friend XlaOp Round(const XlaOp& operand);
-  friend XlaOp Log(const XlaOp& operand);
-  friend XlaOp Log1p(const XlaOp& operand);
-  friend XlaOp Sign(const XlaOp& operand);
-  friend XlaOp Clz(const XlaOp& operand);
-  friend XlaOp Cos(const XlaOp& operand);
-  friend XlaOp Sin(const XlaOp& operand);
-  friend XlaOp Tanh(const XlaOp& operand);
-  friend XlaOp Real(const XlaOp& operand);
-  friend XlaOp Imag(const XlaOp& operand);
-  friend XlaOp Pow(const XlaOp& lhs, const XlaOp& rhs,
-                   tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-  friend XlaOp IsFinite(const XlaOp& operand);
-  friend XlaOp ConvertElementType(const XlaOp& operand,
-                                  PrimitiveType new_element_type);
-  friend XlaOp BitcastConvertType(const XlaOp& operand,
-                                  PrimitiveType new_element_type);
-  friend XlaOp Neg(const XlaOp& operand);
-  friend XlaOp Transpose(const XlaOp& operand,
-                         tensorflow::gtl::ArraySlice<int64> permutation);
-  friend XlaOp Rev(const XlaOp& operand,
-                   tensorflow::gtl::ArraySlice<int64> dimensions);
-  friend XlaOp Sort(XlaOp keys, tensorflow::gtl::optional<XlaOp> values);
-  friend XlaOp Clamp(const XlaOp& min, const XlaOp& operand, const XlaOp& max);
-  friend XlaOp Map(XlaBuilder* builder,
-                   tensorflow::gtl::ArraySlice<XlaOp> operands,
-                   const XlaComputation& computation,
-                   tensorflow::gtl::ArraySlice<int64> dimensions,
-                   tensorflow::gtl::ArraySlice<XlaOp> static_operands);
-  friend XlaOp RngNormal(const XlaOp& mu, const XlaOp& sigma,
-                         const Shape& shape);
-  friend XlaOp RngUniform(const XlaOp& a, const XlaOp& b, const Shape& shape);
-  friend XlaOp While(const XlaComputation& condition,
-                     const XlaComputation& body, const XlaOp& init);
-  friend XlaOp Conditional(const XlaOp& predicate, const XlaOp& true_operand,
-                           const XlaComputation& true_computation,
-                           const XlaOp& false_operand,
-                           const XlaComputation& false_computation);
-  friend XlaOp ReducePrecision(const XlaOp& operand, const int exponent_bits,
-                               const int mantissa_bits);
-  friend XlaOp Gather(const XlaOp& input, const XlaOp& gather_indices,
-                      const GatherDimensionNumbers& dimension_numbers,
-                      tensorflow::gtl::ArraySlice<int64> window_bounds);
-  friend void Send(const XlaOp& operand, const ChannelHandle& handle);
-  friend XlaOp Recv(XlaBuilder* builder, const Shape& shape,
-                    const ChannelHandle& handle);
-  friend XlaOp BatchNormTraining(const XlaOp& operand, const XlaOp& scale,
-                                 const XlaOp& offset, float epsilon,
-                                 int64 feature_index);
-  friend XlaOp BatchNormInference(const XlaOp& operand, const XlaOp& scale,
-                                  const XlaOp& offset, const XlaOp& mean,
-                                  const XlaOp& variance, float epsilon,
-                                  int64 feature_index);
-  friend XlaOp BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
-                             const XlaOp& batch_mean, const XlaOp& batch_var,
-                             const XlaOp& grad_output, float epsilon,
-                             int64 feature_index);
-};
-
-// RAII-style object: sets the current sharding assignment in builder on
-// construction, and sets back to the previous assignment on destruction.
-class XlaScopedShardingAssignment {
- public:
-  XlaScopedShardingAssignment(xla::XlaBuilder* builder,
-                              tensorflow::gtl::optional<OpSharding> sharding)
-      : builder_(builder), prev_sharding_(builder->sharding()) {
-    SetSharding(sharding);
-  }
-
-  XlaScopedShardingAssignment(const XlaScopedShardingAssignment&) = delete;
-  XlaScopedShardingAssignment& operator=(const XlaScopedShardingAssignment&) =
-      delete;
-
-  ~XlaScopedShardingAssignment() { SetSharding(prev_sharding_); }
-
- private:
-  void SetSharding(const tensorflow::gtl::optional<OpSharding>& sharding) {
-    if (sharding.has_value()) {
-      builder_->SetSharding(sharding.value());
-    } else {
-      builder_->ClearSharding();
-    }
-  }
-
-  xla::XlaBuilder* const builder_;
-  tensorflow::gtl::optional<OpSharding> prev_sharding_;
-};
-
-// Free functions for building XlaOps. The intention is that these will
-// become the public API for building XlaOps rather than calling methods on
-// XlaBuilder directly.
-
-// Enqueues a "retrieve parameter value" instruction for a parameter that was
-// passed to the computation.
-XlaOp Parameter(XlaBuilder* builder, int64 parameter_number, const Shape& shape,
-                const string& name);
-
-// Enqueues a constant with the value of the given literal onto the
-// computation.
-XlaOp ConstantLiteral(XlaBuilder* builder, const LiteralSlice& literal);
-
-// Enqueues a constant onto the computation. Methods are templated on the
-// native host type (NativeT) which corresponds to a specific XLA
-// PrimitiveType as given in the following table:
-//
-//  Native Type   PrimitiveType
-// -----------------------------
-//   bool           PRED
-//   int32          S32
-//   int64          S64
-//   uint32         U32
-//   uint64         U64
-//   float          F32
-//   double         F64
-//
-// Note: not all primitive types defined in xla_data.proto have a
-// corresponding native type yet.
-template <typename NativeT>
-XlaOp ConstantR0(XlaBuilder* builder, NativeT value);
-template <typename NativeT>
-XlaOp ConstantR1(XlaBuilder* builder,
-                 tensorflow::gtl::ArraySlice<NativeT> values);
-XlaOp ConstantR1(XlaBuilder* builder, const tensorflow::core::Bitmap& values);
-template <typename NativeT>
-XlaOp ConstantR2(XlaBuilder* builder,
-                 std::initializer_list<std::initializer_list<NativeT>> values);
-template <typename NativeT>
-XlaOp ConstantFromArrayWithLayout(XlaBuilder* builder,
-                                  const Array<NativeT>& values,
-                                  const Layout& layout);
-template <typename NativeT>
-XlaOp ConstantFromArray(XlaBuilder* builder, const Array<NativeT>& values);
-template <typename NativeT>
-XlaOp ConstantR2FromArray2DWithLayout(XlaBuilder* builder,
-                                      const Array2D<NativeT>& values,
-                                      const Layout& layout);
-template <typename NativeT>
-XlaOp ConstantR2FromArray2D(XlaBuilder* builder,
-                            const Array2D<NativeT>& values);
-template <typename NativeT>
-XlaOp ConstantR3FromArray3DWithLayout(XlaBuilder* builder,
-                                      const Array3D<NativeT>& values,
-                                      const Layout& layout);
-template <typename NativeT>
-XlaOp ConstantR3FromArray3D(XlaBuilder* builder,
-                            const Array3D<NativeT>& values);
-template <typename NativeT>
-XlaOp ConstantR4FromArray4DWithLayout(XlaBuilder* builder,
-                                      const Array4D<NativeT>& values,
-                                      const Layout& layout);
-template <typename NativeT>
-XlaOp ConstantR4FromArray4D(XlaBuilder* builder,
-                            const Array4D<NativeT>& values);
-
-// Enqueues a rank one constant (XlaBuilder* builder, vector) onto the
-// computation. The vector has size 'length' and every element has the value
-// 'value'.
-template <typename NativeT>
-XlaOp ConstantR1(XlaBuilder* builder, int64 length, NativeT value);
-
-// Adds dimensions to an array by duplicating the data in the array.
-//
-// The new dimensions are inserted on the left, i.e. if
-// broadcast_sizes has values {a0, ..., aN} and the operand shape
-// has dimensions {b0, ..., bM} then the shape of the output has
-// dimensions {a0, ..., aN, b0, ..., bM}.
-//
-// The new dimensions index into copies of the operand, i.e.
-//
-//   output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM]
-XlaOp Broadcast(const XlaOp& operand,
-                tensorflow::gtl::ArraySlice<int64> broadcast_sizes);
-
-// Performs in-dimension-style broadcast.
-//
-// Operand specifies the input to be broadcast. "shape" is expected output
-// shape. "broadcast_dimensions" are the dimensions to be broadcasting into.
-// Dimension numbers in broadcast_dimensions map to individual dimensions
-// of the operand, and specify what dimension of the output shape they
-// should be broadcast.
-// e.g.
-// Say operand = [1, 2], i.e., a 1D tensor with 2 elements.
-// and dimension of shape is [2,2].
-// Specifying {1} as brodcast_dimension will generate output
-// [1 , 2]
-// [1 , 2]
-// On the other hand, specifying {0} as broadcast_dimension
-// will generate output
-// [1 , 1]
-// [2 , 2]
-XlaOp BroadcastInDim(
-    const XlaOp& operand, const Shape& shape,
-    const tensorflow::gtl::ArraySlice<int64> broadcast_dimensions);
-
-// Enqueues a pad operation onto the computation that pads the given value on
-// the edges as well as between the elements of the input. padding_config
-// specifies the padding amount for each dimension.
-XlaOp Pad(const XlaOp& operand, const XlaOp& padding_value,
-          const PaddingConfig& padding_config);
-
-// Enqueues an operation onto the computation that flattens the operand based
-// on the dimension order (major/slowest-varying to minor/fastest-varying)
-// given, followed by reshaping it into the shape with the given dimension
-// sizes (also major to minor). Conceptually, this is a limited form of
-// "shape casting".
-XlaOp Reshape(const XlaOp& operand,
-              tensorflow::gtl::ArraySlice<int64> dimensions,
-              tensorflow::gtl::ArraySlice<int64> new_sizes);
-
-// Enqueues an operation onto the computation that collapses the operand, from
-// first to last dimension (C order), then reshapes it to the given dimension
-// sizes. Conceptually, this is a limited form of "shape casting".
-XlaOp Reshape(const XlaOp& operand,
-              tensorflow::gtl::ArraySlice<int64> new_sizes);
-
-// Wrapper for Reshape.
-// Enqueues an operation to collapse the provided dimensions; e.g. an
-// operand with dimensions {x=256, y=2, z=2, p=32} can be collapsed to
-// {x=1024, y=32} by collapsing dims {0, 1, 2}. Collapsing dimensions must
-// be a consecutive, in-order subsequence of the operand dimensions.
-//
-// Note that collapsing a single dimension does nothing:
-//
-//    {256} collapsing {0} => {256}
-//    {1} collapsing {0} => {1}
-//
-// Collapsing multiple dimensions produces a single result dimension:
-//
-//    {256, 2} collapsing {0,1} => {512}
-//    {256, 2, 3} collapsing {0,1} => {512, 3}
-//
-// This could potentially cause data to be moved -- it provides a more
-// structured form of reshaping than an arbitrary Reshape operation.
-XlaOp Collapse(const XlaOp& operand,
-               tensorflow::gtl::ArraySlice<int64> dimensions);
-
-// Enqueues a slice operation onto the computation that slices the operand
-// from the start indices to the limit indices; e.g.
-//
-//        x
-//   [ 0 1 2 3 ]
-// y [ 4 5 6 7 ] => slice(start={1, 1}, limit={2, 3}) => [ 5 6 ]
-//   [ 8 9 a b ]
-//
-// Note that "limit" means up-to-but-not-including; i.e. [start, limit) in 1D
-// range notation.
-// The strides parameter determines the stride over the slice
-XlaOp Slice(const XlaOp& operand,
-            tensorflow::gtl::ArraySlice<int64> start_indices,
-            tensorflow::gtl::ArraySlice<int64> limit_indices,
-            tensorflow::gtl::ArraySlice<int64> strides);
-
-// Enqueues a slice operation in a given dimension, taking all other
-// dimensions as they are; e.g. if dimno is 1 from start_index 2 to
-// limit_index 4 by 1, and the shape is f32[7,8,9], this call is short-hand
-// for:
-//
-//  array[:, 2:4:1, :]
-XlaOp SliceInDim(const XlaOp& operand, int64 start_index, int64 limit_index,
-                 int64 stride, int64 dimno);
-
-// Enqueues a slice operation onto the computation that slices the 'operand'
-// from dynamic start indices which are passed in 'start_indices'.
-// The size of the slice in each dimension is passed in 'slice_sizes',
-// which specify the end point of exclusive slice intervals in each
-// dimension [start, start + size).
-// The shape of 'start_indices' must be rank == 1, with dimension size
-// equal to the rank of the 'operand'.
-// Slice index calculations are computed modulo input dimension sizes to
-// prevent dynamic start indices from generating out-of-bound array accesses.
-XlaOp DynamicSlice(const XlaOp& operand, const XlaOp& start_indices,
-                   tensorflow::gtl::ArraySlice<int64> slice_sizes);
-
-// Enqueues a dynamic update slice operation onto the computation, which
-// updates a slice of 'operand' with 'update' at dynamic 'start_indices'.
-// The shape of 'update' determines the shape of the slice of 'operand'
-// which is updated.
-// The indices specified in 'start_indices' specify the offset of the slice
-// of 'operand' which is updated.
-//
-//               update = {10, 11} // calculated at runtime.
-//   [1 2 3]     start  = {1, 1}   // calculated at runtime.  [1 2  3 ]
-//   [4 5 6]  => DynamicUpdateslice(data, update, start)   => [4 10 11]
-//   [7 8 9]                                                  [7 8  9 ]
-//
-// The shape of 'start_indices' must be rank == 1, with dimension size
-// equal to the rank of the 'operand'.
-// Slice index calculations are computed modulo update dimension sizes to
-// prevent dynamic start indices from generating out-of-bound array accesses.
-XlaOp DynamicUpdateSlice(const XlaOp& operand, const XlaOp& update,
-                         const XlaOp& start_indices);
-
-// Enqueues a concatenate instruction onto the computation. 'operands' must
-// have >= 1 entry.
-XlaOp ConcatInDim(XlaBuilder* builder,
-                  tensorflow::gtl::ArraySlice<XlaOp> operands, int64 dimension);
-
-// Enqueue a tracing operation onto the computation; the computation will emit
-// a logging message with the operand.
-void Trace(const string& tag, const XlaOp& operand);
-
-// Enqueues a conditional-move-like select operation onto the computation;
-// predicated on pred, selects between on_true and on_false.
-XlaOp Select(const XlaOp& pred, const XlaOp& on_true, const XlaOp& on_false);
-
-// Enqueues a tuple-creation instruction onto the computation.
-XlaOp Tuple(XlaBuilder* builder, tensorflow::gtl::ArraySlice<XlaOp> elements);
-
-// Enqueues a tuple-element-get instruction onto the computation.
-XlaOp GetTupleElement(const XlaOp& tuple_data, int64 index);
-
-// Enqueues an equal-to comparison instruction onto the computation.
-XlaOp Eq(const XlaOp& lhs, const XlaOp& rhs,
-         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a not-equal comparison instruction onto the computation.
-XlaOp Ne(const XlaOp& lhs, const XlaOp& rhs,
-         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a greater-or-equal comparison instruction onto the computation.
-XlaOp Ge(const XlaOp& lhs, const XlaOp& rhs,
-         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a greater-than comparison instruction onto the computation.
-XlaOp Gt(const XlaOp& lhs, const XlaOp& rhs,
-         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a less-than comparison instruction onto the computation.
-XlaOp Lt(const XlaOp& lhs, const XlaOp& rhs,
-         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a less-or-equal comparison instruction onto the computation.
-XlaOp Le(const XlaOp& lhs, const XlaOp& rhs,
-         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a dot instruction onto the computation.
-XlaOp Dot(const XlaOp& lhs, const XlaOp& rhs);
-
-// Enqueues a general dot instruction onto the computation.
-XlaOp DotGeneral(const XlaOp& lhs, const XlaOp& rhs,
-                 const DotDimensionNumbers& dimension_numbers);
-
-// Enqueues a convolution instruction onto the computation, which uses the
-// default convolution dimension numbers.
-XlaOp Conv(const XlaOp& lhs, const XlaOp& rhs,
-           tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding);
-
-// Enqueues a convolution instruction onto the computation, with the caller
-// provided padding configuration in the format returned by MakePadding().
-XlaOp ConvWithGeneralPadding(
-    const XlaOp& lhs, const XlaOp& rhs,
-    tensorflow::gtl::ArraySlice<int64> window_strides,
-    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
-
-// Enqueues a convolution instruction onto the computation, with the caller
-// provided dimension numbers configuration.
-XlaOp ConvWithGeneralDimensions(
-    const XlaOp& lhs, const XlaOp& rhs,
-    tensorflow::gtl::ArraySlice<int64> window_strides, Padding padding,
-    const ConvolutionDimensionNumbers& dimension_numbers);
-
-// Enqueues a convolution instruction onto the computation, with the caller
-// provided padding configuration as well as the dimension numbers.
-XlaOp ConvGeneral(const XlaOp& lhs, const XlaOp& rhs,
-                  tensorflow::gtl::ArraySlice<int64> window_strides,
-                  tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-                  const ConvolutionDimensionNumbers& dimension_numbers);
-
-// Enqueues a convolution instruction onto the computation, with the caller
-// provided padding configuration, dilation factors and dimension numbers.
-XlaOp ConvGeneralDilated(
-    const XlaOp& lhs, const XlaOp& rhs,
-    tensorflow::gtl::ArraySlice<int64> window_strides,
-    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-    tensorflow::gtl::ArraySlice<int64> lhs_dilation,
-    tensorflow::gtl::ArraySlice<int64> rhs_dilation,
-    const ConvolutionDimensionNumbers& dimension_numbers);
-
-// Enqueues an FFT instruction onto the computation, of the given type and
-// with the given FFT length.
-XlaOp Fft(const XlaOp& operand, FftType fft_type,
-          tensorflow::gtl::ArraySlice<int64> fft_length);
-
-// Enqueues an infeed instruction onto the computation, which writes data of
-// the given shape to the infeed buffer of the device.
-XlaOp Infeed(XlaBuilder* builder, const Shape& shape,
-             const string& config = "");
-
-// Enqueues an outfeed instruction onto the computation. This instruction
-// generates outgoing data transfers for the given data.
-//
-// shape_with_layout communicates the laid out shape that we want to outfeed
-// -- if !ShapeUtil::Compatible(GetShape(operand), shape_with_layout) an error
-// will occur.
-void Outfeed(const XlaOp& operand, const Shape& shape_with_layout,
-             const string& outfeed_config);
-
-// Enqueues a call instruction onto the computation.
-XlaOp Call(XlaBuilder* builder, const XlaComputation& computation,
-           tensorflow::gtl::ArraySlice<XlaOp> operands);
-
-// Enqueues a custom call instruction onto the computation.
-// During code generation, a call instruction is emitted which targets a
-// symbol with the name |call_target_name|.  The |operands| are passed to the
-// call instruction.  |shape| is the resultant shape.
-XlaOp CustomCall(XlaBuilder* builder, const string& call_target_name,
-                 tensorflow::gtl::ArraySlice<XlaOp> operands,
-                 const Shape& shape);
-
-// Enqueues a pseudo-op to represent host-side computation data-dependencies.
-// During code generation, host send and receive operations will be generated
-// to transfer |operands| to the host and a single result of |shape| back to
-// the device.  Host send/recv operations are emitted using |channel_name|.
-// Dataflow dependencies and the |cost_estimate_ns| field may be used in HLO
-// instruction scheduling.
-XlaOp HostCompute(XlaBuilder* builder,
-                  tensorflow::gtl::ArraySlice<XlaOp> operands,
-                  const string& channel_name, int64 cost_estimate_ns,
-                  const Shape& shape);
-
-// The following methods enqueue element-wise binary arithmetic operations
-// onto the computation. The shapes of the operands have to match unless one
-// of the operands is a scalar, or an explicit broadcast dimension is given
-// (see g3doc for more details).
-
-// Enqueues a complex compose instruction onto the computation.
-XlaOp Complex(const XlaOp& real, const XlaOp& imag,
-              tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a complex conjugate instruction onto the computation.
-XlaOp Conj(const XlaOp& operand);
-
-// Enqueues an add instruction onto the computation.
-XlaOp Add(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a subtract instruction onto the computation.
-XlaOp Sub(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a multiply instruction onto the computation.
-XlaOp Mul(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a divide instruction onto the computation.
-XlaOp Div(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a remainder instruction onto the computation.
-XlaOp Rem(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a max instruction onto the computation.
-XlaOp Max(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues a min instruction onto the computation.
-XlaOp Min(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Element-wise logical operators
-XlaOp And(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-XlaOp Or(const XlaOp& lhs, const XlaOp& rhs,
-         tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-XlaOp Xor(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-XlaOp Not(const XlaOp& operand);
-
-XlaOp ShiftLeft(const XlaOp& lhs, const XlaOp& rhs,
-                tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-XlaOp ShiftRightArithmetic(
-    const XlaOp& lhs, const XlaOp& rhs,
-    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-XlaOp ShiftRightLogical(
-    const XlaOp& lhs, const XlaOp& rhs,
-    tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Reduces an array among the provided dimensions, given "computation" as a
-// reduction operator.
-XlaOp Reduce(const XlaOp& operand, const XlaOp& init_value,
-             const XlaComputation& computation,
-             tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce);
-
-// Convenience wrapper around the above that reduces all the dimensions in the
-// operand shape.
-XlaOp ReduceAll(const XlaOp& operand, const XlaOp& init_value,
-                const XlaComputation& computation);
-
-// Enqueues a windowed reduce instruction onto the computation.
-XlaOp ReduceWindow(const XlaOp& operand, const XlaOp& init_value,
-                   const XlaComputation& computation,
-                   tensorflow::gtl::ArraySlice<int64> window_dimensions,
-                   tensorflow::gtl::ArraySlice<int64> window_strides,
-                   Padding padding);
-
-// As ReduceWindow(), but the padding is given in the format
-// returned by MakePadding().
-XlaOp ReduceWindowWithGeneralPadding(
-    const XlaOp& operand, const XlaOp& init_value,
-    const XlaComputation& computation,
-    tensorflow::gtl::ArraySlice<int64> window_dimensions,
-    tensorflow::gtl::ArraySlice<int64> window_strides,
-    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding);
-
-// Returns the sum of the operand value within each subgroup of replicas. All
-// replicas supply one input to the sum and all replicas receive the resulting
-// sum for each subgroup.
-XlaOp CrossReplicaSum(
-    const XlaOp& operand,
-    tensorflow::gtl::ArraySlice<int64> replica_group_ids = {});
-
-// Enqueues an operation that do an AllReduce of the operand cross cores. Here
-// AllReduce means doing a reduction on the input operand cross cores and then
-// broadcasting the reduction result to those cores. The reduction function is
-// defined by `computation`, which should be a commutative computation on
-// scalars, e.g., add, min, or max. The way that AllReduce is applied is
-// configured by:
-//
-// - `replica_group_ids`: maps replica ids to subgroup ids. If empty, all
-// replicas belong to one group. Allreduce will be applied within subgroups.
-// For example, we have 4 replicas, then replica_group_ids={0,1,0,1} means,
-// replica 0 and 2 are in subgroup 0, replica 1 and 3 are in subgroup 1.
-//
-// - `channel_id`: for Allreduce nodes from different models, if they have the
-// same channel_id, they will be 'Allreduce'd. If empty, Allreduce will not be
-// applied cross models.
-//
-// TODO(b/79737069): Rename this to AllReduce when it's ready to use.
-XlaOp CrossReplicaSum(const XlaOp& operand, const XlaComputation& computation,
-                      tensorflow::gtl::ArraySlice<int64> replica_group_ids = {},
-                      const tensorflow::gtl::optional<ChannelHandle>&
-                          channel_id = tensorflow::gtl::nullopt);
-
-// Enqueues an operation that scatters the `source` array to the selected
-// indices of each window.
-XlaOp SelectAndScatter(const XlaOp& operand, const XlaComputation& select,
-                       tensorflow::gtl::ArraySlice<int64> window_dimensions,
-                       tensorflow::gtl::ArraySlice<int64> window_strides,
-                       Padding padding, const XlaOp& source,
-                       const XlaOp& init_value, const XlaComputation& scatter);
-
-// As SelectAndScatter(), but the padding is given in the format
-// returned by MakePadding().
-XlaOp SelectAndScatterWithGeneralPadding(
-    const XlaOp& operand, const XlaComputation& select,
-    tensorflow::gtl::ArraySlice<int64> window_dimensions,
-    tensorflow::gtl::ArraySlice<int64> window_strides,
-    tensorflow::gtl::ArraySlice<std::pair<int64, int64>> padding,
-    const XlaOp& source, const XlaOp& init_value,
-    const XlaComputation& scatter);
-
-// Enqueues an abs instruction onto the computation.
-XlaOp Abs(const XlaOp& operand);
-
-// Enqueues a atan2 instruction onto the computation.
-XlaOp Atan2(const XlaOp& y, const XlaOp& x,
-            tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues an exp instruction onto the computation.
-XlaOp Exp(const XlaOp& operand);
-
-// Enqueues an expm1 instruction onto the computation.
-XlaOp Expm1(const XlaOp& operand);
-
-// Enqueues a floor instruction onto the computation.
-XlaOp Floor(const XlaOp& operand);
-
-// Enqueues a ceil instruction onto the computation.
-XlaOp Ceil(const XlaOp& operand);
-
-// Enqueues a round instruction onto the computation, rounding to nearest even
-// with half-way cases rounding away from zero.
-XlaOp Round(const XlaOp& operand);
-
-// Enqueues an log instruction (natural logarithm) onto the computation.
-XlaOp Log(const XlaOp& operand);
-
-// Enqueues an log1p instruction (log(x+1)) onto the computation.
-XlaOp Log1p(const XlaOp& operand);
-
-// Enqueues a sign instruction onto the computation.
-XlaOp Sign(const XlaOp& operand);
-
-// Enqueues a count leading zeros instruction onto the computation.
-XlaOp Clz(const XlaOp& operand);
-
-// Enqueues a cosine instruction onto the computation.
-XlaOp Cos(const XlaOp& operand);
-
-// Enqueues a sine instruction onto the computation.
-XlaOp Sin(const XlaOp& operand);
-
-// Enqueues a tanh instruction onto the computation.
-XlaOp Tanh(const XlaOp& operand);
-
-// Enqueues a real-part instruction onto the computation.
-XlaOp Real(const XlaOp& operand);
-
-// Enqueues an imaginary-part instruction onto the computation.
-XlaOp Imag(const XlaOp& operand);
-
-// Enqueues a lhs^rhs computation onto the computation.
-XlaOp Pow(const XlaOp& lhs, const XlaOp& rhs,
-          tensorflow::gtl::ArraySlice<int64> broadcast_dimensions = {});
-
-// Enqueues an operator that tests if the operand's values are finite, i.e.,
-// not Inf or NaN. Defined only for floating-point types. Returns an array of
-// booleans with the same shape where entries are true iff the corresponding
-// entry was NaN.
-XlaOp IsFinite(const XlaOp& operand);
-
-// Enqueues a convert instruction onto the computation that changes the
-// element type of the operand array to primitive_type.
-XlaOp ConvertElementType(const XlaOp& operand, PrimitiveType new_element_type);
-
-// Enqueues a no-op instruction onto the computation that changes
-// the element type of the operand array to primitive_type. The
-// bit-widths of the source and destination element types must be
-// identical.
-XlaOp BitcastConvertType(const XlaOp& operand, PrimitiveType new_element_type);
-
-// Enqueues a negate instruction onto the computation.
-XlaOp Neg(const XlaOp& operand);
-
-// Enqueues a transpose instruction onto the computation.
-XlaOp Transpose(const XlaOp& operand,
-                tensorflow::gtl::ArraySlice<int64> permutation);
-
-// Enqueues a reverse instruction onto the computation. The order of the
-// elements in the given dimensions is reversed (i.e., the element at index i
-// is moved to index dimension_size - 1 - i).
-XlaOp Rev(const XlaOp& operand, tensorflow::gtl::ArraySlice<int64> dimensions);
-
-// * The result is a sorted array of keys.
-//
-// If both keys and values are provided:
-// * The keys and the values must be rank-1 tensors with the same dimensions.
-// The element types of the tensors may be different.
-// * The result is a tuple that consists of a sorted array of keys as the
-// first element, and an array with their corresponding values as the second
-// element.
-XlaOp Sort(XlaOp keys,
-           tensorflow::gtl::optional<XlaOp> values = tensorflow::gtl::nullopt);
-
-// Enqueues a clamp instruction onto the computation.
-XlaOp Clamp(const XlaOp& min, const XlaOp& operand, const XlaOp& max);
-
-// Enqueues a map instruction onto the computation.
-XlaOp Map(XlaBuilder* builder, tensorflow::gtl::ArraySlice<XlaOp> operands,
-          const XlaComputation& computation,
-          tensorflow::gtl::ArraySlice<int64> dimensions,
-          tensorflow::gtl::ArraySlice<XlaOp> static_operands = {});
-
-// Enqueues a N(mu, sigma) random number generation instruction onto the
-// computation.
-XlaOp RngNormal(const XlaOp& mu, const XlaOp& sigma, const Shape& shape);
-
-// Enqueues a U(a, b) random number generation instruction onto the
-// computation. Returns values in the semi-open interval [a, b).
-XlaOp RngUniform(const XlaOp& a, const XlaOp& b, const Shape& shape);
-
-// Enqueues a while node onto the computation.
-XlaOp While(const XlaComputation& condition, const XlaComputation& body,
-            const XlaOp& init);
-
-// Enqueues a conditional node onto the computation.
-XlaOp Conditional(const XlaOp& predicate, const XlaOp& true_operand,
-                  const XlaComputation& true_computation,
-                  const XlaOp& false_operand,
-                  const XlaComputation& false_computation);
-
-// Enqueues a ReducePrecision node onto the computation.
-XlaOp ReducePrecision(const XlaOp& operand, const int exponent_bits,
-                      const int mantissa_bits);
-
-// Enqueues a Gather node onto the computation.
-XlaOp Gather(const XlaOp& input, const XlaOp& gather_indices,
-             const GatherDimensionNumbers& dimension_numbers,
-             tensorflow::gtl::ArraySlice<int64> window_bounds);
-
-// Enqueues a Send node onto the computation, to send the given operand to
-// a Recv instruction that shares the same channel handle.
-void Send(const XlaOp& operand, const ChannelHandle& handle);
-
-// Enqueues a Recv node onto the computation. The data comes from a Send
-// instruction that shares the same channel handle and its shape must
-// be the same as the given shape.
-XlaOp Recv(XlaBuilder* builder, const Shape& shape,
-           const ChannelHandle& handle);
-
-// Normalizes operand across spatial and batch dimensions for each feature.
-//
-// Returns a tuple (normalized, batch_mean, batch_var) where `normalized`
-// is the normalized result and batch_mean and batch_var are the mean and
-// variance, respectively, across batch for the operand.
-XlaOp BatchNormTraining(const XlaOp& operand, const XlaOp& scale,
-                        const XlaOp& offset, float epsilon,
-                        int64 feature_index);
-
-// Normalizes operand across spatial and batch dimensions for each feature.
-//
-// `BatchNormInference` is equivalent to calling `BatchNormTraining` without
-// computing `mean` and `variance` for each batch inside the operation. It
-// uses the input `mean` and `variance` instead as estimated values. The
-// purpose of this op is to reduce latency in inference, hence the name
-// `BatchNormInference`.
-//
-// The output has the same shape as `operand`, and contains the normalized
-// values for each batch.
-XlaOp BatchNormInference(const XlaOp& operand, const XlaOp& scale,
-                         const XlaOp& offset, const XlaOp& mean,
-                         const XlaOp& variance, float epsilon,
-                         int64 feature_index);
-
-// Calculates the gradients of a batch norm op.
-//
-// The inputs `batch_mean` and `batch_var` represent the mean and variance
-// across the batch.
-//
-// Returns a tuple of three elements:
-//   - grad_operand: Gradient with respect to input `operand`
-//   - grad_offset: Gradient with respect to input `offset`
-//   - grad_scale: Gradient with respect to input `scale`
-XlaOp BatchNormGrad(const XlaOp& operand, const XlaOp& scale,
-                    const XlaOp& batch_mean, const XlaOp& batch_var,
-                    const XlaOp& grad_output, float epsilon,
-                    int64 feature_index);
-
-// Implementation details below this point.
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR0(NativeT value) {
-  return ConstantLiteral(*Literal::CreateR0<NativeT>(value));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR1(tensorflow::gtl::ArraySlice<NativeT> values) {
-  return ConstantLiteral(*Literal::CreateR1<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR1(int64 length, NativeT value) {
-  Literal literal(ShapeUtil::MakeShape(
-      primitive_util::NativeToPrimitiveType<NativeT>(), {length}));
-  literal.PopulateWithValue(value);
-  return ConstantLiteral(literal);
-}
-
-inline XlaOp XlaBuilder::ConstantR1(const tensorflow::core::Bitmap& values) {
-  return ConstantLiteral(*Literal::CreateR1(values));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR2(
-    std::initializer_list<std::initializer_list<NativeT>> values) {
-  return ConstantLiteral(*Literal::CreateR2<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantFromArrayWithLayout(const Array<NativeT>& values,
-                                              const Layout& layout) {
-  return ConstantLiteral(
-      *Literal::CreateFromArrayWithLayout<NativeT>(values, layout));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantFromArray(const Array<NativeT>& values) {
-  return ConstantLiteral(*Literal::CreateFromArray<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR2FromArray2DWithLayout(
-    const Array2D<NativeT>& values, const Layout& layout) {
-  return ConstantLiteral(
-      *Literal::CreateFromArrayWithLayout<NativeT>(values, layout));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR2FromArray2D(const Array2D<NativeT>& values) {
-  return ConstantLiteral(*Literal::CreateR2FromArray2D<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR3FromArray3DWithLayout(
-    const Array3D<NativeT>& values, const Layout& layout) {
-  return ConstantLiteral(
-      *Literal::CreateR3FromArray3DWithLayout<NativeT>(values, layout));
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR3FromArray3D(const Array3D<NativeT>& values) {
-  return ConstantFromArray(values);
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR4FromArray4DWithLayout(
-    const Array4D<NativeT>& values, const Layout& layout) {
-  return ConstantFromArrayWithLayout(values, layout);
-}
-
-template <typename NativeT>
-XlaOp XlaBuilder::ConstantR4FromArray4D(const Array4D<NativeT>& values) {
-  return ConstantFromArray(values);
-}
-
-// Free function template implementations.
-
-template <typename NativeT>
-XlaOp ConstantR0(XlaBuilder* builder, NativeT value) {
-  return ConstantLiteral(builder, *Literal::CreateR0<NativeT>(value));
-}
-
-template <typename NativeT>
-XlaOp ConstantR1(XlaBuilder* builder,
-                 tensorflow::gtl::ArraySlice<NativeT> values) {
-  return ConstantLiteral(builder, *Literal::CreateR1<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp ConstantR1(XlaBuilder* builder, int64 length, NativeT value) {
-  Literal literal(ShapeUtil::MakeShape(
-      primitive_util::NativeToPrimitiveType<NativeT>(), {length}));
-  literal.PopulateWithValue(value);
-  return ConstantLiteral(builder, literal);
-}
-
-inline XlaOp ConstantR1(XlaBuilder* builder,
-                        const tensorflow::core::Bitmap& values) {
-  return ConstantLiteral(builder, *Literal::CreateR1(values));
-}
-
-template <typename NativeT>
-XlaOp ConstantR2(XlaBuilder* builder,
-                 std::initializer_list<std::initializer_list<NativeT>> values) {
-  return ConstantLiteral(builder, *Literal::CreateR2<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp ConstantFromArrayWithLayout(XlaBuilder* builder,
-                                  const Array<NativeT>& values,
-                                  const Layout& layout) {
-  return ConstantLiteral(
-      builder, *Literal::CreateFromArrayWithLayout<NativeT>(values, layout));
-}
-
-template <typename NativeT>
-XlaOp ConstantFromArray(XlaBuilder* builder, const Array<NativeT>& values) {
-  return ConstantLiteral(builder, *Literal::CreateFromArray<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp ConstantR2FromArray2DWithLayout(XlaBuilder* builder,
-                                      const Array2D<NativeT>& values,
-                                      const Layout& layout) {
-  return ConstantLiteral(
-      builder, *Literal::CreateFromArrayWithLayout<NativeT>(values, layout));
-}
-
-template <typename NativeT>
-XlaOp ConstantR2FromArray2D(XlaBuilder* builder,
-                            const Array2D<NativeT>& values) {
-  return ConstantLiteral(builder,
-                         *Literal::CreateR2FromArray2D<NativeT>(values));
-}
-
-template <typename NativeT>
-XlaOp ConstantR3FromArray3DWithLayout(XlaBuilder* builder,
-                                      const Array3D<NativeT>& values,
-                                      const Layout& layout) {
-  return ConstantLiteral(
-      builder,
-      *Literal::CreateR3FromArray3DWithLayout<NativeT>(values, layout));
-}
-
-template <typename NativeT>
-XlaOp ConstantR3FromArray3D(XlaBuilder* builder,
-                            const Array3D<NativeT>& values) {
-  return ConstantFromArray(builder, values);
-}
-
-template <typename NativeT>
-XlaOp ConstantR4FromArray4DWithLayout(XlaBuilder* builder,
-                                      const Array4D<NativeT>& values,
-                                      const Layout& layout) {
-  return ConstantFromArrayWithLayout(builder, values, layout);
-}
-
-template <typename NativeT>
-XlaOp ConstantR4FromArray4D(XlaBuilder* builder,
-                            const Array4D<NativeT>& values) {
-  return ConstantFromArray(builder, values);
-}
-
-}  // namespace xla
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 
 #endif  // TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_BUILDER_H_
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_computation.cc b/tensorflow/compiler/xla/client/xla_computation.cc
index 72e3935696..3543d41fc2 100644
--- a/tensorflow/compiler/xla/client/xla_client/xla_computation.cc
+++ b/tensorflow/compiler/xla/client/xla_computation.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 
 #include <utility>
 
diff --git a/tensorflow/compiler/xla/client/xla_client/xla_computation.h b/tensorflow/compiler/xla/client/xla_computation.h
index 0ffba208b1..71598ef8b2 100644
--- a/tensorflow/compiler/xla/client/xla_client/xla_computation.h
+++ b/tensorflow/compiler/xla/client/xla_computation.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_COMPUTATION_H_
-#define TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_COMPUTATION_H_
+#ifndef TENSORFLOW_COMPILER_XLA_CLIENT_XLA_COMPUTATION_H_
+#define TENSORFLOW_COMPILER_XLA_CLIENT_XLA_COMPUTATION_H_
 
 #include <utility>
 
@@ -64,4 +64,4 @@ class XlaComputation {
 
 }  // namespace xla
 
-#endif  // TENSORFLOW_COMPILER_XLA_CLIENT_XLA_CLIENT_XLA_COMPUTATION_H_
+#endif  // TENSORFLOW_COMPILER_XLA_CLIENT_XLA_COMPUTATION_H_
diff --git a/tensorflow/compiler/xla/experimental/xla_sharding/xla_sharding.py b/tensorflow/compiler/xla/experimental/xla_sharding/xla_sharding.py
index abd10b164e..fb135f5ced 100644
--- a/tensorflow/compiler/xla/experimental/xla_sharding/xla_sharding.py
+++ b/tensorflow/compiler/xla/experimental/xla_sharding/xla_sharding.py
@@ -20,7 +20,7 @@ from __future__ import print_function
 
 import math
 
-import numpy as np
+import numpy as _np  # Avoids becoming a part of public Tensorflow API.
 
 from tensorflow.compiler.xla import xla_data_pb2
 from tensorflow.compiler.xla.python_api import xla_shape
@@ -85,7 +85,7 @@ class Sharding(object):
     something we really want to expose to users (especially as the
     contract for tile_assignment is very strict).
     """
-    if not isinstance(tile_assignment, np.ndarray):
+    if not isinstance(tile_assignment, _np.ndarray):
       raise TypeError('Tile assignment must be of type np.ndarray')
     if not isinstance(tile_shape, xla_shape.Shape):
       raise TypeError('Tile shape must be of type xla_shape.Shape')
diff --git a/tensorflow/compiler/xla/layout_util.cc b/tensorflow/compiler/xla/layout_util.cc
index 15eeb2ea13..b72d190d54 100644
--- a/tensorflow/compiler/xla/layout_util.cc
+++ b/tensorflow/compiler/xla/layout_util.cc
@@ -297,7 +297,7 @@ Layout CreateDefaultLayoutForRank(int64 rank) {
       shape.layout().padded_dimensions_size() == 0) {
     return false;
   }
-  CHECK(IsDenseArray(shape));
+  CHECK(IsDenseArray(shape)) << shape.ShortDebugString();
   CHECK_EQ(shape.dimensions_size(), shape.layout().padded_dimensions_size());
   for (int64 i = 0; i < shape.dimensions_size(); ++i) {
     if (shape.layout().padded_dimensions(i) > shape.dimensions(i)) {
diff --git a/tensorflow/compiler/xla/literal.cc b/tensorflow/compiler/xla/literal.cc
new file mode 100644
index 0000000000..0545deb096
--- /dev/null
+++ b/tensorflow/compiler/xla/literal.cc
@@ -0,0 +1,1969 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/literal.h"
+
+#include <algorithm>
+#include <cstring>
+#include <functional>
+#include <limits>
+#include <numeric>
+#include <vector>
+
+#include "tensorflow/compiler/xla/index_util.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/core/casts.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/hash/hash.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/lib/strings/stringprintf.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/types.h"
+
+using tensorflow::strings::Printf;
+using tensorflow::strings::StrCat;
+
+namespace xla {
+
+namespace {
+
+constexpr bool kLittleEndian = __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
+
+// Converts between little and big endian.
+//
+// Precondition: size % 2 == 0 (elements in the array are 16 bits long)
+void ConvertEndianShort(string* bytes) {
+  CHECK_EQ(bytes->size() / 2, 0);
+  for (int64 i = 0; i < bytes->size(); i += 2) {
+    std::swap((*bytes)[i], (*bytes)[i + 1]);
+  }
+}
+
+void ConvertEndianShort(char* bytes, int64 size) {
+  CHECK_EQ(size / 2, 0);
+  for (int64 i = 0; i < size; i += 2) {
+    std::swap(bytes[i], bytes[i + 1]);
+  }
+}
+
+}  // namespace
+
+LiteralBase::~LiteralBase() {}
+
+std::ostream& operator<<(std::ostream& out, const Literal& literal) {
+  out << literal.ToString();
+  return out;
+}
+
+Literal::StrideConfig::StrideConfig(
+    const Shape& source_shape, const Shape& dest_shape,
+    tensorflow::gtl::ArraySlice<int64> dimensions)
+    : dimensions(dimensions),
+      base(dimensions.size(), 0),
+      step(dimensions.size(), 1) {
+  if (!dimensions.empty()) {
+    // Selects the shape with the largest minor dimension as the one upon
+    // which to run the tight stride loop.
+    if (dimensions[LayoutUtil::Minor(source_shape.layout(), 0)] >=
+        dimensions[LayoutUtil::Minor(dest_shape.layout(), 0)]) {
+      minor_dimension = LayoutUtil::Minor(source_shape.layout(), 0);
+      dest_stride = IndexUtil::GetDimensionStride(dest_shape, minor_dimension);
+    } else {
+      minor_dimension = LayoutUtil::Minor(dest_shape.layout(), 0);
+      source_stride =
+          IndexUtil::GetDimensionStride(source_shape, minor_dimension);
+    }
+    minor_loop_size = dimensions[minor_dimension];
+    step[minor_dimension] = minor_loop_size;
+  }
+}
+
+Literal::Literal(const Shape& shape)
+    : Literal(shape, /*allocate_arrays=*/true) {}
+
+void Literal::SetPiece(const Shape& shape, Piece* piece, bool allocate_arrays) {
+  if (ShapeUtil::IsTuple(shape)) {
+    for (int i = 0; i < ShapeUtil::TupleElementCount(shape); ++i) {
+      const Shape& subshape = shape.tuple_shapes(i);
+
+      auto child_piece = Piece();
+      child_piece.set_subshape(&subshape);
+
+      SetPiece(subshape, &child_piece, allocate_arrays);
+
+      piece->emplace_back(std::move(child_piece));
+    }
+  } else if (ShapeUtil::IsArray(shape)) {
+    if (allocate_arrays) {
+      if (LayoutUtil::IsSparseArray(shape)) {
+        // For sparse arrays, the buffer must be of the size of the maximum
+        // number of sparse elements possible.
+        const int64 max_sparse_elements =
+            LayoutUtil::MaxSparseElements(shape.layout());
+        piece->set_buffer(
+            new char[max_sparse_elements *
+                     ShapeUtil::ByteSizeOfPrimitiveType(shape.element_type())]);
+        piece->set_sparse_indices(
+            new SparseIndexArray(max_sparse_elements, ShapeUtil::Rank(shape)));
+      } else {
+        piece->set_buffer(new char[piece->size_bytes()]);
+      }
+    }
+  } else {
+    // If the shape is neither an array nor tuple, then it must be
+    // zero-sized. Otherwise, some memory needs to be allocated for it.
+    CHECK_EQ(piece->size_bytes(), 0);
+  }
+}
+
+Literal::Literal(const Shape& shape, bool allocate_arrays)
+    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
+  CHECK(LayoutUtil::HasLayout(*shape_));
+  root_piece_ = new Piece();
+  root_piece_->set_subshape(shape_.get());
+  CHECK(&root_piece_->subshape() == shape_.get());
+
+  SetPiece(*shape_, root_piece_, allocate_arrays);
+}
+
+Literal::~Literal() {
+  if (root_piece_ != nullptr) {
+    DeallocateBuffers();
+    delete root_piece_;
+  }
+}
+
+void Literal::DeallocateBuffers() {
+  root_piece_->ForEachMutableSubpiece(
+      [&](const ShapeIndex& index, Piece* piece) {
+        if (piece->buffer() != nullptr) {
+          delete[] piece->buffer();
+          delete piece->sparse_indices();
+        }
+      });
+}
+
+Literal::Literal(Literal&& other) : LiteralBase() { *this = std::move(other); }
+
+Literal& Literal::operator=(Literal&& other) {
+  DCHECK(&other.root_piece_->subshape() == other.shape_.get());
+  using std::swap;
+  swap(shape_, other.shape_);
+  swap(root_piece_, other.root_piece_);
+  DCHECK(&root_piece_->subshape() == shape_.get());
+
+  return *this;
+}
+
+std::unique_ptr<Literal> LiteralBase::CreateFromShape(const Shape& shape) {
+  auto literal = MakeUnique<Literal>(shape);
+  literal->root_piece_->ForEachMutableSubpiece(
+      [&](const ShapeIndex& index, Piece* piece) {
+        if (ShapeUtil::IsArray(piece->subshape())) {
+          memset(piece->untyped_data(), 0, piece->size_bytes());
+        }
+      });
+  return literal;
+}
+
+const SparseIndexArray* LiteralBase::sparse_indices(
+    const ShapeIndex& shape_index) const {
+  return piece(shape_index).sparse_indices();
+}
+
+SparseIndexArray* Literal::sparse_indices(const ShapeIndex& shape_index) {
+  return piece(shape_index).sparse_indices();
+}
+
+template <typename NativeT>
+Status Literal::CopySliceFromInternal(
+    const LiteralBase& src_literal, tensorflow::gtl::ArraySlice<int64> src_base,
+    tensorflow::gtl::ArraySlice<int64> dest_base,
+    tensorflow::gtl::ArraySlice<int64> copy_size) {
+  TF_RET_CHECK(ShapeUtil::Rank(src_literal.shape()) == src_base.size());
+  TF_RET_CHECK(ShapeUtil::Rank(shape()) == dest_base.size());
+
+  auto linear_index = [](const Shape& shape,
+                         tensorflow::gtl::ArraySlice<int64> multi_index) {
+    return IndexUtil::MultidimensionalIndexToLinearIndex(shape, multi_index);
+  };
+
+  if (ShapeUtil::Rank(src_literal.shape()) == 0 ||
+      ShapeUtil::Rank(shape()) == 0) {
+    // If any of the two shapes are scalars, we can just call the StridedCopy()
+    // directly, and we know we will be copying only one value.
+    TF_RET_CHECK(copy_size.empty());
+    StridedCopy(data<NativeT>(), linear_index(shape(), dest_base), 0,
+                src_literal.data<NativeT>(),
+                linear_index(src_literal.shape(), src_base), 0, 1);
+  } else if (!ShapeUtil::IsZeroElementArray(shape()) &&
+             !ShapeUtil::IsZeroElementArray(src_literal.shape())) {
+    // Perform copy if neither src nor dest has dimensions with zero element,
+    // otherwise it's a no-op.
+    TF_RET_CHECK(src_base.size() == dest_base.size());
+    TF_RET_CHECK(src_base.size() == copy_size.size());
+
+    // Scan the source from minor, stepping in copy size blocks, then within
+    // the index enumaration functor, do a strided copy advancing source index
+    // by one (walking through the minor dimension), and destination index by
+    // proper stride size at the matching dimension.
+    DimensionVector src_indexes(src_base.size(), 0);
+    DimensionVector dest_indexes(dest_base.size(), 0);
+    Literal::StrideConfig stride_config(src_literal.shape(), shape(),
+                                        copy_size);
+
+    auto copy_proc = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
+      // Map from multi-dimensional index, to source index.
+      std::transform(indexes.begin(), indexes.end(), src_base.begin(),
+                     src_indexes.begin(), std::plus<int64>());
+      // Map from multi-dimensional index, to destination index.
+      std::transform(indexes.begin(), indexes.end(), dest_base.begin(),
+                     dest_indexes.begin(), std::plus<int64>());
+
+      int64 src_index = linear_index(src_literal.shape(), src_indexes);
+      int64 dest_index = linear_index(shape(), dest_indexes);
+
+      // `this->` is needed to workaround MSVC bug: #16882
+      StridedCopy(this->data<NativeT>(), dest_index, stride_config.dest_stride,
+                  src_literal.data<NativeT>(), src_index,
+                  stride_config.source_stride, stride_config.minor_loop_size);
+      return true;
+    };
+
+    ShapeUtil::ForEachIndex(src_literal.shape(), stride_config.base,
+                            stride_config.dimensions, stride_config.step,
+                            copy_proc);
+  }
+  return Status::OK();
+}
+
+Status Literal::CopyElementFrom(const LiteralSlice& src_literal,
+                                tensorflow::gtl::ArraySlice<int64> src_index,
+                                tensorflow::gtl::ArraySlice<int64> dest_index) {
+  DCHECK_EQ(shape().element_type(), src_literal.shape().element_type());
+  const int64 src_linear_index = IndexUtil::MultidimensionalIndexToLinearIndex(
+      src_literal.shape(), src_index);
+  const int64 dest_linear_index =
+      IndexUtil::MultidimensionalIndexToLinearIndex(shape(), dest_index);
+  const int64 primitive_size =
+      ShapeUtil::ByteSizeOfPrimitiveType(shape().element_type());
+
+  char* dest_address =
+      static_cast<char*>(untyped_data()) + dest_linear_index * primitive_size;
+  const char* source_address =
+      static_cast<const char*>(src_literal.untyped_data()) +
+      src_linear_index * primitive_size;
+  if (dest_address != source_address) {
+    memcpy(dest_address, source_address, primitive_size);
+  }
+  return Status::OK();
+}
+
+/* static */ StatusOr<std::unique_ptr<Literal>> Literal::CreateFromProto(
+    const LiteralProto& proto) {
+  if (!proto.has_shape()) {
+    return InvalidArgument("LiteralProto has no shape");
+  }
+  if (!LayoutUtil::HasLayout(proto.shape())) {
+    return InvalidArgument("LiteralProto has no layout");
+  }
+
+  auto literal = MakeUnique<Literal>(proto.shape());
+
+  TF_RETURN_IF_ERROR(literal->root_piece_->ForEachMutableSubpieceWithStatus(
+      [&](const ShapeIndex& index, Piece* piece) {
+        const LiteralProto* proto_element = &proto;
+        for (int64 i : index) {
+          CHECK(i < proto_element->tuple_literals_size());
+          proto_element = &proto_element->tuple_literals(i);
+        }
+
+        if (ShapeUtil::IsTuple(piece->subshape())) {
+          if (proto_element->tuple_literals_size() !=
+              ShapeUtil::TupleElementCount(piece->subshape())) {
+            return InvalidArgument(
+                "Expected %lld tuple elements in LiteralProto, has %d",
+                ShapeUtil::TupleElementCount(piece->subshape()),
+                proto_element->tuple_literals_size());
+          }
+          return Status::OK();
+        }
+        if (piece->subshape().element_type() == TOKEN) {
+          return Status::OK();
+        }
+
+        CHECK(ShapeUtil::IsArray(piece->subshape()));
+        TF_RETURN_IF_ERROR(piece->CopyFromProto(*proto_element));
+
+        return Status::OK();
+      }));
+
+  return std::move(literal);
+}
+
+std::vector<Literal> Literal::DecomposeTuple() {
+  CHECK(ShapeUtil::IsTuple(shape()));
+  std::vector<Literal> elements;
+  for (int i = 0; i < ShapeUtil::TupleElementCount(shape()); ++i) {
+    elements.push_back(Literal(ShapeUtil::GetSubshape(shape(), {i}),
+                               /*allocate_arrays=*/false));
+    Literal& element = elements.back();
+    element.root_piece_->ForEachMutableSubpiece(
+        [&](const ShapeIndex& index, Piece* dest_piece) {
+          ShapeIndex src_index = {i};
+          for (int64 j : index) {
+            src_index.push_back(j);
+          }
+          Piece& src_piece = piece(src_index);
+
+          // Move the respective buffer and sparse indices over to the element
+          // Literal.
+          dest_piece->set_buffer(src_piece.buffer());
+          src_piece.set_buffer(nullptr);
+          dest_piece->set_sparse_indices(src_piece.sparse_indices());
+          src_piece.set_sparse_indices(nullptr);
+        });
+  }
+  // Set this literal to be nil-shaped.
+  *this = Literal();
+  return elements;
+}
+
+namespace {
+
+// Copies the elements in 'src' to 'dest'. The shape and layout of the data in
+// the array slices are indicated by dest_shape and src_shape respectively.
+template <typename NativeT>
+void CopyElementsBetween(tensorflow::gtl::MutableArraySlice<NativeT> dest,
+                         tensorflow::gtl::ArraySlice<NativeT> src,
+                         const Shape& dest_shape, const Shape& src_shape) {
+  CHECK(ShapeUtil::Compatible(dest_shape, src_shape));
+  if (ShapeUtil::IsZeroElementArray(dest_shape)) {
+    return;
+  }
+  std::vector<int64> index(ShapeUtil::Rank(dest_shape));
+  do {
+    dest[IndexUtil::MultidimensionalIndexToLinearIndex(dest_shape, index)] =
+        src[IndexUtil::MultidimensionalIndexToLinearIndex(src_shape, index)];
+  } while (IndexUtil::BumpIndices(dest_shape, &index));
+}
+
+}  // namespace
+
+Status LiteralBase::Piece::CopyFrom(const LiteralBase::Piece& src) {
+  CHECK(subshape_ != nullptr);
+  CHECK(src.subshape_ != nullptr);
+  if (ShapeUtil::Equal(subshape(), src.subshape())) {
+    // If the layouts are equal it's faster just to memcpy.
+    memcpy(buffer(), src.buffer(), src.size_bytes());
+  } else {
+    TF_RET_CHECK(ShapeUtil::Compatible(src.subshape(), subshape()));
+    std::vector<int64> origin(ShapeUtil::Rank(subshape()), 0);
+    switch (subshape().element_type()) {
+#define COPY_ELEMENTS(XLA_T, NATIVE_T)                                    \
+  case (XLA_T):                                                           \
+    CopyElementsBetween<NATIVE_T>(data<NATIVE_T>(), src.data<NATIVE_T>(), \
+                                  subshape(), src.subshape());            \
+    break;
+      COPY_ELEMENTS(U8, uint8);
+      COPY_ELEMENTS(U16, uint16);
+      COPY_ELEMENTS(U32, uint32);
+      COPY_ELEMENTS(U64, uint64);
+      COPY_ELEMENTS(S8, int8);
+      COPY_ELEMENTS(S16, int16);
+      COPY_ELEMENTS(S32, int32);
+      COPY_ELEMENTS(S64, int64);
+      COPY_ELEMENTS(F16, half);
+      COPY_ELEMENTS(BF16, bfloat16);
+      COPY_ELEMENTS(F32, float);
+      COPY_ELEMENTS(F64, double);
+      COPY_ELEMENTS(C64, complex64);
+      COPY_ELEMENTS(PRED, bool);
+#undef COPY_ELEMENTS
+      default:
+        return Unimplemented(
+            "Copying a Literal object with element type %s is not implemented.",
+            PrimitiveType_Name(subshape().element_type()).c_str());
+    }
+  }
+  return Status::OK();
+}
+
+Status Literal::CopyFrom(const LiteralSlice& src_literal,
+                         const ShapeIndex& dest_shape_index,
+                         const ShapeIndex& src_shape_index) {
+  const Shape& dest_subshape =
+      ShapeUtil::GetSubshape(shape(), dest_shape_index);
+  const Shape& src_subshape =
+      ShapeUtil::GetSubshape(src_literal.shape(), src_shape_index);
+  if (!ShapeUtil::Compatible(dest_subshape, src_subshape)) {
+    return InvalidArgument(
+        "Destination subshape incompatible with source subshape: %s vs %s",
+        ShapeUtil::HumanString(dest_subshape).c_str(),
+        ShapeUtil::HumanString(src_subshape).c_str());
+  }
+  return root_piece_->ForEachMutableSubpieceWithStatus(
+      [&](const ShapeIndex& index, Piece* piece) {
+        if (!ShapeUtil::IsArray(piece->subshape())) {
+          return Status::OK();
+        }
+
+        // Determine if this index is in the part of this literal that we want
+        // to copy over from src_literal.
+        bool in_subtree_to_copy = true;
+        for (int i = 0; i < dest_shape_index.size(); ++i) {
+          if (index[i] != dest_shape_index[i]) {
+            in_subtree_to_copy = false;
+            break;
+          }
+        }
+        if (!in_subtree_to_copy) {
+          return Status::OK();
+        }
+        // Construct the index of the corresponding piece in the source literal.
+        ShapeIndex src_piece_index = src_shape_index;
+        for (int64 i = dest_shape_index.size(); i < index.size(); ++i) {
+          src_piece_index.push_back(index[i]);
+        }
+        TF_RETURN_IF_ERROR(piece->CopyFrom(src_literal.piece(src_piece_index)));
+        return Status::OK();
+      });
+}
+
+Status Literal::MoveFrom(Literal&& src_literal,
+                         const ShapeIndex& dest_shape_index) {
+  const Shape& dest_subshape =
+      ShapeUtil::GetSubshape(shape(), dest_shape_index);
+  if (!ShapeUtil::Equal(dest_subshape, src_literal.shape())) {
+    return InvalidArgument(
+        "Destination subshape not equal to source shape: %s vs %s",
+        ShapeUtil::HumanString(dest_subshape).c_str(),
+        ShapeUtil::HumanString(src_literal.shape()).c_str());
+  }
+
+  src_literal.root_piece_->ForEachSubpiece(
+      [&](const ShapeIndex& src_index, const Piece& src_piece) {
+        if (!ShapeUtil::IsArray(src_piece.subshape())) {
+          return;
+        }
+
+        ShapeIndex dest_index = dest_shape_index;
+        for (int64 i : src_index) {
+          dest_index.push_back(i);
+        }
+        Piece& dest_piece = piece(dest_index);
+        delete[] dest_piece.buffer();
+        dest_piece.set_buffer(src_piece.buffer());
+        delete dest_piece.sparse_indices();
+        dest_piece.set_sparse_indices(src_piece.sparse_indices());
+      });
+
+  src_literal.shape_ = MakeUnique<Shape>(ShapeUtil::MakeNil());
+  delete src_literal.root_piece_;
+  src_literal.root_piece_ = new LiteralBase::Piece();
+  src_literal.root_piece_->set_subshape(src_literal.shape_.get());
+
+  return Status::OK();
+}
+
+Status Literal::CopySliceFrom(const LiteralSlice& src_literal,
+                              tensorflow::gtl::ArraySlice<int64> src_base,
+                              tensorflow::gtl::ArraySlice<int64> dest_base,
+                              tensorflow::gtl::ArraySlice<int64> copy_size) {
+  TF_RET_CHECK(ShapeUtil::IsArray(shape())) << ShapeUtil::HumanString(shape());
+  TF_RET_CHECK(ShapeUtil::IsArray(src_literal.shape()))
+      << ShapeUtil::HumanString(src_literal.shape());
+  TF_RET_CHECK(ShapeUtil::SameElementType(src_literal.shape(), shape()));
+
+  switch (shape().element_type()) {
+    case U8:
+      return CopySliceFromInternal<uint8>(src_literal, src_base, dest_base,
+                                          copy_size);
+    case U16:
+      return CopySliceFromInternal<uint16>(src_literal, src_base, dest_base,
+                                           copy_size);
+    case U32:
+      return CopySliceFromInternal<uint32>(src_literal, src_base, dest_base,
+                                           copy_size);
+    case U64:
+      return CopySliceFromInternal<uint64>(src_literal, src_base, dest_base,
+                                           copy_size);
+    case S8:
+      return CopySliceFromInternal<int8>(src_literal, src_base, dest_base,
+                                         copy_size);
+    case S16:
+      return CopySliceFromInternal<int16>(src_literal, src_base, dest_base,
+                                          copy_size);
+    case S32:
+      return CopySliceFromInternal<int32>(src_literal, src_base, dest_base,
+                                          copy_size);
+    case S64:
+      return CopySliceFromInternal<int64>(src_literal, src_base, dest_base,
+                                          copy_size);
+    case F16:
+      return CopySliceFromInternal<half>(src_literal, src_base, dest_base,
+                                         copy_size);
+    case BF16:
+      return CopySliceFromInternal<bfloat16>(src_literal, src_base, dest_base,
+                                             copy_size);
+    case F32:
+      return CopySliceFromInternal<float>(src_literal, src_base, dest_base,
+                                          copy_size);
+    case F64:
+      return CopySliceFromInternal<double>(src_literal, src_base, dest_base,
+                                           copy_size);
+    case C64:
+      return CopySliceFromInternal<complex64>(src_literal, src_base, dest_base,
+                                              copy_size);
+    case PRED:
+      return CopySliceFromInternal<bool>(src_literal, src_base, dest_base,
+                                         copy_size);
+    default:
+      break;
+  }
+  return Unimplemented(
+      "Copying a slice from a Literal object with element type %d is not "
+      "implemented.",
+      shape().element_type());
+}
+
+void Literal::PopulateR1(const tensorflow::core::Bitmap& values) {
+  CHECK(ShapeUtil::IsArray(shape()));
+  CHECK_EQ(ShapeUtil::Rank(shape()), 1);
+  CHECK_EQ(element_count(), values.bits());
+  CHECK_EQ(shape().element_type(), PRED);
+  for (int64 i = 0; i < static_cast<int64>(values.bits()); ++i) {
+    Set({i}, values.get(i));
+  }
+}
+
+std::unique_ptr<Literal> LiteralBase::Relayout(
+    const Layout& new_layout, const ShapeIndex& shape_index) const {
+  // Create new shape with 'new_layout' set at the given shape index.
+  Shape new_shape = shape();
+  Shape* subshape = ShapeUtil::GetMutableSubshape(&new_shape, shape_index);
+  TF_CHECK_OK(LayoutUtil::ValidateLayoutForShape(new_layout, *subshape));
+  *subshape->mutable_layout() = new_layout;
+  auto result = MakeUnique<Literal>(new_shape);
+  TF_CHECK_OK(result->CopyFrom(*this));
+  return result;
+}
+
+std::unique_ptr<Literal> LiteralBase::Relayout(
+    const Shape& shape_with_layout) const {
+  CHECK(ShapeUtil::Compatible(shape_with_layout, shape()))
+      << "Given shape_with_layout " << ShapeUtil::HumanString(shape_with_layout)
+      << " not compatible with literal shape "
+      << ShapeUtil::HumanString(shape());
+  std::unique_ptr<Literal> result = CreateFromShape(shape_with_layout);
+  ShapeUtil::ForEachSubshape(
+      result->shape(),
+      [this, &result](const Shape& subshape, const ShapeIndex& index) {
+        if (ShapeUtil::IsArray(subshape)) {
+          TF_CHECK_OK(result->CopyFrom(*this,
+                                       /*dest_shape_index=*/index,
+                                       /*src_shape_index=*/index));
+        }
+      });
+  return result;
+}
+
+StatusOr<std::unique_ptr<Literal>> LiteralBase::Broadcast(
+    const Shape& result_shape,
+    tensorflow::gtl::ArraySlice<int64> dimensions) const {
+  if (!ShapeUtil::IsArray(shape())) {
+    return InvalidArgument("Broadcast only supports arrays.");
+  }
+
+  for (int64 i = 0; i < dimensions.size(); i++) {
+    TF_RET_CHECK(shape().dimensions(i) ==
+                 result_shape.dimensions(dimensions[i]));
+  }
+
+  std::unique_ptr<Literal> result = MakeUnique<Literal>(result_shape);
+
+  // scratch_source_index is temporary storage space for the computed index into
+  // the input literal.  We put it here to avoid allocating an std::vector in
+  // every iteration of ShapeUtil::ForEachIndex.
+  std::vector<int64> scratch_source_index(shape().dimensions_size());
+
+  char* dest_data = static_cast<char*>(result->untyped_data());
+  const char* source_data = static_cast<const char*>(untyped_data());
+  const int64 primitive_size =
+      ShapeUtil::ByteSizeOfPrimitiveType(shape().element_type());
+
+  ShapeUtil::ForEachIndex(
+      result_shape, [&](tensorflow::gtl::ArraySlice<int64> output_index) {
+        for (int64 i = 0; i < dimensions.size(); ++i) {
+          scratch_source_index[i] = output_index[dimensions[i]];
+        }
+        int64 dest_index = IndexUtil::MultidimensionalIndexToLinearIndex(
+            result_shape, output_index);
+        int64 source_index = IndexUtil::MultidimensionalIndexToLinearIndex(
+            shape(), scratch_source_index);
+        memcpy(dest_data + primitive_size * dest_index,
+               source_data + primitive_size * source_index, primitive_size);
+        return true;
+      });
+
+  return std::move(result);
+}
+
+StatusOr<std::unique_ptr<Literal>> LiteralBase::Reshape(
+    tensorflow::gtl::ArraySlice<int64> dimensions) const {
+  if (!ShapeUtil::IsArray(shape())) {
+    return InvalidArgument("Reshape does not support tuples.");
+  }
+  std::unique_ptr<Literal> output;
+  if (!LayoutUtil::IsMonotonicWithDim0Major(shape().layout())) {
+    output =
+        Relayout(LayoutUtil::GetDefaultLayoutForRank(ShapeUtil::Rank(shape())));
+  } else {
+    output = CloneToUnique();
+  }
+  // Because the layout is monotonic, we can simply reuse the same sequence of
+  // values without changing their order.
+  *output->mutable_shape_do_not_use() =
+      ShapeUtil::MakeShape(shape().element_type(), dimensions);
+
+  int64 elements_before = ShapeUtil::ElementsIn(shape());
+  int64 elements_after = ShapeUtil::ElementsIn(output->shape());
+  if (elements_before != elements_after) {
+    return InvalidArgument(
+        "Shapes before and after Literal::Reshape have different numbers "
+        "of elements: %s vs %s.",
+        ShapeUtil::HumanString(shape()).c_str(),
+        ShapeUtil::HumanString(output->shape()).c_str());
+  }
+  return std::move(output);
+}
+
+std::unique_ptr<Literal> LiteralBase::Transpose(
+    tensorflow::gtl::ArraySlice<int64> permutation) const {
+  CHECK(ShapeUtil::IsArray(shape())) << "Tuple is not supported for transpose";
+  CHECK(IsPermutation(permutation, ShapeUtil::Rank(shape())))
+      << "Given permutation is not a permutation of dimension numbers";
+  // To transpose the array, we just permute the dimensions and layout, and
+  // do a straight memory copy of the raw data set.
+  // This is considerably faster than iterating over every array element using
+  // the EachCell<>() and Set<>() APIs.
+  std::vector<int64> inverse_permutation = InversePermutation(permutation);
+  Shape permuted_shape =
+      ShapeUtil::PermuteDimensions(inverse_permutation, shape());
+  // Replace the layout with one affine to this shape, such that a
+  // transpose operation can be performed by leaving the flat values
+  // representation intact.
+  // For example, consider the shape F32[11,8]{1,0} under a {1,0} permutation.
+  // The shape with affine layout resulting from that operation will be
+  // F32[8,11]{0,1}, since it leaves the original most minor (the 8 sized), the
+  // most minor.
+  //
+  // Essentially, given MinMaj(Di) the position of the Di dimension within the
+  // minor to major vector, and given T(Di) the index that the original Di
+  // dimension has within the transposed array, a layout is affine if
+  // MinMaj(Di) == TMinMaj(T(Di)), with TMinMaj() being the minor to major
+  // vector of the affine layout.
+  CHECK(LayoutUtil::IsDenseArray(permuted_shape));
+  Layout* layout = permuted_shape.mutable_layout();
+  layout->clear_minor_to_major();
+  for (auto index : LayoutUtil::MinorToMajor(shape())) {
+    layout->add_minor_to_major(inverse_permutation[index]);
+  }
+  auto new_literal = MakeUnique<Literal>(permuted_shape);
+  DCHECK_EQ(ShapeUtil::ByteSizeOf(new_literal->shape()),
+            ShapeUtil::ByteSizeOf(shape()));
+  std::memcpy(new_literal->untyped_data(), untyped_data(), size_bytes());
+  return new_literal;
+}
+
+template <typename NativeT>
+std::unique_ptr<Literal> LiteralBase::SliceInternal(
+    const Shape& result_shape,
+    tensorflow::gtl::ArraySlice<int64> start_indices) const {
+  auto result_literal = MakeUnique<Literal>(result_shape);
+  DimensionVector new_indices(ShapeUtil::Rank(result_shape));
+  result_literal->EachCell<NativeT>(
+      [&](tensorflow::gtl::ArraySlice<int64> indices, NativeT /*value*/) {
+        for (int64 i = 0; i < ShapeUtil::Rank(result_shape); ++i) {
+          new_indices[i] = indices[i] + start_indices[i];
+        }
+        NativeT value = Get<NativeT>(new_indices);
+        result_literal->Set<NativeT>(indices, value);
+      });
+  return result_literal;
+}
+
+std::unique_ptr<Literal> LiteralBase::Slice(
+    tensorflow::gtl::ArraySlice<int64> start_indices,
+    tensorflow::gtl::ArraySlice<int64> limit_indices) const {
+  CHECK(ShapeUtil::IsArray(shape())) << "tuple is not supported for slice";
+
+  DimensionVector result_dimensions;
+  for (int64 dnum = 0; dnum < ShapeUtil::Rank(shape()); ++dnum) {
+    CHECK_GE(start_indices[dnum], 0);
+    CHECK_LE(limit_indices[dnum], shape().dimensions(dnum))
+        << "dnum = " << dnum;
+    int64 dimension = limit_indices[dnum] - start_indices[dnum];
+    CHECK_GE(dimension, 0) << "dnum = " << dnum;
+    result_dimensions.push_back(dimension);
+  }
+  const auto result_shape =
+      ShapeUtil::MakeShapeWithLayout(shape().element_type(), result_dimensions,
+                                     LayoutUtil::MinorToMajor(shape()));
+  switch (result_shape.element_type()) {
+    case F32:
+      return SliceInternal<float>(result_shape, start_indices);
+    case BF16:
+      return SliceInternal<bfloat16>(result_shape, start_indices);
+    case C64:
+      return SliceInternal<complex64>(result_shape, start_indices);
+    case S32:
+      return SliceInternal<int32>(result_shape, start_indices);
+    case U32:
+      return SliceInternal<uint32>(result_shape, start_indices);
+    default:
+      LOG(FATAL) << "not yet implemented: "
+                 << PrimitiveType_Name(result_shape.element_type());
+  }
+}
+
+Literal LiteralBase::Clone() const {
+  Literal result(shape());
+  TF_CHECK_OK(result.CopyFrom(*this));
+  return result;
+}
+
+std::unique_ptr<Literal> LiteralBase::CloneToUnique() const {
+  auto result = MakeUnique<Literal>(shape());
+  TF_CHECK_OK(result->CopyFrom(*this));
+  return result;
+}
+
+string LiteralBase::GetAsString(tensorflow::gtl::ArraySlice<int64> multi_index,
+                                const ShapeIndex& shape_index) const {
+  const Shape& subshape = ShapeUtil::GetSubshape(shape(), shape_index);
+  CHECK(LayoutUtil::IsDenseArray(subshape));
+  switch (subshape.element_type()) {
+    case PRED:
+      return Get<bool>(multi_index, shape_index) ? "true" : "false";
+    case S8:
+      return StrCat(Get<int8>(multi_index, shape_index));
+    case S16:
+      return StrCat(Get<int16>(multi_index, shape_index));
+    case S32:
+      return StrCat(Get<int32>(multi_index, shape_index));
+    case S64:
+      return StrCat(Get<int64>(multi_index, shape_index));
+    case U8:
+      return StrCat(Get<uint8>(multi_index, shape_index));
+    case U16:
+      return StrCat(Get<uint16>(multi_index, shape_index));
+    case U32:
+      return StrCat(Get<uint32>(multi_index, shape_index));
+    case U64:
+      return StrCat(Get<uint64>(multi_index, shape_index));
+    case F16:
+      return StrCat(static_cast<float>(Get<half>(multi_index, shape_index)));
+    case F32:
+      return StrCat(Get<float>(multi_index, shape_index));
+    case BF16:
+      return StrCat(
+          static_cast<float>(Get<bfloat16>(multi_index, shape_index)));
+    case F64:
+      return StrCat(Get<double>(multi_index, shape_index));
+    case C64: {
+      complex64 c = Get<complex64>(multi_index, shape_index);
+      return StrCat("(", c.real(), ", ", c.imag(), ")");
+    }
+    default:
+      LOG(FATAL) << PrimitiveType_Name(subshape.element_type());
+  }
+}
+
+string LiteralBase::GetSparseElementAsString(
+    int64 sparse_element_number, const ShapeIndex& shape_index) const {
+  const Shape& subshape = ShapeUtil::GetSubshape(shape(), shape_index);
+  CHECK(LayoutUtil::IsSparseArray(subshape));
+  switch (subshape.element_type()) {
+    case PRED:
+      return GetSparseElement<bool>(sparse_element_number, shape_index)
+                 ? "true"
+                 : "false";
+    case S8:
+      return StrCat(GetSparseElement<int8>(sparse_element_number, shape_index));
+    case S16:
+      return StrCat(
+          GetSparseElement<int16>(sparse_element_number, shape_index));
+    case S32:
+      return StrCat(
+          GetSparseElement<int32>(sparse_element_number, shape_index));
+    case S64:
+      return StrCat(
+          GetSparseElement<int64>(sparse_element_number, shape_index));
+    case U8:
+      return StrCat(
+          GetSparseElement<uint8>(sparse_element_number, shape_index));
+    case U16:
+      return StrCat(
+          GetSparseElement<uint16>(sparse_element_number, shape_index));
+    case U32:
+      return StrCat(
+          GetSparseElement<uint32>(sparse_element_number, shape_index));
+    case U64:
+      return StrCat(
+          GetSparseElement<uint64>(sparse_element_number, shape_index));
+    case F16:
+      return StrCat(static_cast<float>(
+          GetSparseElement<half>(sparse_element_number, shape_index)));
+    case F32:
+      return StrCat(
+          GetSparseElement<float>(sparse_element_number, shape_index));
+    case BF16:
+      return StrCat(static_cast<float>(
+          GetSparseElement<bfloat16>(sparse_element_number, shape_index)));
+    case F64:
+      return StrCat(
+          GetSparseElement<double>(sparse_element_number, shape_index));
+    case C64: {
+      complex64 c =
+          GetSparseElement<complex64>(sparse_element_number, shape_index);
+      return StrCat("(", c.real(), ", ", c.imag(), ")");
+    }
+    default:
+      LOG(FATAL) << "Invalid element type for sparse arrays: "
+                 << PrimitiveType_Name(subshape.element_type());
+  }
+}
+
+StatusOr<int64> LiteralBase::GetIntegralAsS64(
+    tensorflow::gtl::ArraySlice<int64> multi_index) const {
+  CHECK(LayoutUtil::IsDenseArray(shape()));
+  switch (shape().element_type()) {
+    case PRED:
+      return Get<bool>(multi_index);
+    case U8:
+      return Get<uint8>(multi_index);
+    case S32:
+      return Get<int32>(multi_index);
+    case S64:
+      return Get<int64>(multi_index);
+    case U32:
+      return Get<uint32>(multi_index);
+    case U64:
+      return Get<uint64>(multi_index);
+    default:
+      return FailedPrecondition(
+          "Array element type is not integral: %s",
+          PrimitiveType_Name(shape().element_type()).c_str());
+  }
+}
+
+size_t LiteralBase::Hash() const {
+  using tensorflow::Hash64;
+  using tensorflow::Hash64Combine;
+
+  size_t hash_value = ShapeUtil::Hash(shape());
+
+  ShapeUtil::ForEachSubshape(
+      shape(), [&](const Shape& subshape, const ShapeIndex& index) {
+        if (!ShapeUtil::IsArray(subshape)) {
+          return;
+        }
+
+        CHECK(LayoutUtil::IsDense(subshape.layout()));
+        hash_value = Hash64Combine(
+            hash_value, Hash64(static_cast<const char*>(untyped_data(index)),
+                               size_bytes(index)));
+      });
+
+  return hash_value;
+}
+
+Status Literal::SetIntegralAsS64(tensorflow::gtl::ArraySlice<int64> multi_index,
+                                 int64 value) {
+  CHECK(LayoutUtil::IsDenseArray(shape()));
+  switch (shape().element_type()) {
+    case PRED:
+      Set<bool>(multi_index, value);
+      break;
+    case U8:
+      Set<uint8>(multi_index, value);
+      break;
+    case S32:
+      Set<int32>(multi_index, value);
+      break;
+    case S64:
+      Set<int64>(multi_index, value);
+      break;
+    case U32:
+      Set<uint32>(multi_index, value);
+      break;
+    case U64:
+      Set<uint64>(multi_index, value);
+      break;
+    default:
+      return FailedPrecondition(
+          "Array element type is not integral: %s",
+          PrimitiveType_Name(shape().element_type()).c_str());
+  }
+  return Status::OK();
+}
+
+tensorflow::gtl::ArraySlice<int64> LiteralBase::GetSparseIndex(
+    int64 sparse_element_number, const ShapeIndex& shape_index) const {
+  const Piece& p = piece(shape_index);
+  CHECK_GE(sparse_element_number, 0);
+  CHECK_LT(sparse_element_number, p.sparse_indices()->index_count());
+  return p.sparse_indices()->At(sparse_element_number);
+}
+
+void Literal::SortSparseElements(const ShapeIndex& shape_index) {
+  piece(shape_index).SortSparseElements();
+}
+
+void LiteralBase::Piece::SortSparseElements() {
+  switch (subshape().element_type()) {
+    case PRED:
+      SortSparseElementsInternal<bool>();
+      break;
+    case S8:
+      SortSparseElementsInternal<int8>();
+      break;
+    case U8:
+      SortSparseElementsInternal<uint8>();
+      break;
+    case S16:
+      SortSparseElementsInternal<int16>();
+      break;
+    case U16:
+      SortSparseElementsInternal<uint16>();
+      break;
+    case S32:
+      SortSparseElementsInternal<int32>();
+      break;
+    case U32:
+      SortSparseElementsInternal<uint32>();
+      break;
+    case S64:
+      SortSparseElementsInternal<int64>();
+      break;
+    case U64:
+      SortSparseElementsInternal<uint64>();
+      break;
+    case F32:
+      SortSparseElementsInternal<float>();
+      break;
+    case F64:
+      SortSparseElementsInternal<double>();
+      break;
+    case C64:
+      SortSparseElementsInternal<complex64>();
+      break;
+    case F16:
+      SortSparseElementsInternal<half>();
+      break;
+    case BF16:
+      SortSparseElementsInternal<bfloat16>();
+      break;
+    default:
+      LOG(FATAL) << "Element type not valid for sparse array: "
+                 << PrimitiveType_Name(subshape().element_type());
+  }
+}
+
+template <typename NativeT>
+void LiteralBase::Piece::SortSparseElementsInternal() {
+  CHECK(LayoutUtil::IsSparseArray(subshape()));
+  int64 num_elements = sparse_indices()->index_count();
+  auto values = data<NativeT>();
+  CHECK_LE(num_elements, values.size());
+  sparse_indices()->SortWithValues(
+      tensorflow::gtl::MutableArraySlice<NativeT>(values.data(), num_elements));
+}
+
+namespace {
+
+void ToStringHelper(const LiteralBase& literal, const ShapeIndex& shape_index,
+                    bool print_layout, std::vector<string>* pieces) {
+  const Shape& subshape = ShapeUtil::GetSubshape(literal.shape(), shape_index);
+  CHECK(LayoutUtil::HasLayout(literal.shape()));
+  CHECK(LayoutUtil::HasLayout(subshape));
+
+  auto shape_to_string = [print_layout](const Shape& shape) {
+    if (print_layout) {
+      return ShapeUtil::HumanStringWithLayout(shape);
+    } else {
+      return ShapeUtil::HumanString(shape);
+    }
+  };
+
+  // TODO(b/32894291): refactor this code to reduce code duplication.
+  if (ShapeUtil::IsTuple(subshape)) {
+    pieces->push_back(shape_to_string(subshape));
+    pieces->push_back(" (\n");
+    std::vector<string> tuple_pieces;
+    for (int i = 0; i < ShapeUtil::TupleElementCount(subshape); ++i) {
+      ShapeIndex element_index = shape_index;
+      element_index.push_back(i);
+      std::vector<string> element_pieces;
+      ToStringHelper(literal, element_index, print_layout, &element_pieces);
+      tuple_pieces.push_back(tensorflow::str_util::Join(element_pieces, ""));
+    }
+    pieces->push_back(tensorflow::str_util::Join(tuple_pieces, ",\n"));
+    pieces->push_back("\n)");
+    return;
+  }
+
+  if (ShapeUtil::IsToken(subshape)) {
+    pieces->push_back("token");
+    return;
+  }
+
+  if (LayoutUtil::IsSparseArray(subshape)) {
+    pieces->push_back(shape_to_string(subshape));
+    pieces->push_back("{");
+    int64 rank = ShapeUtil::Rank(subshape);
+    int64 num_elements = literal.sparse_element_count();
+    for (int64 i = 0; i < num_elements; ++i) {
+      if (i > 0) {
+        pieces->push_back(", ");
+      }
+      if (rank == 1) {
+        pieces->push_back(StrCat(literal.GetSparseIndex(i)[0]));
+        pieces->push_back(": ");
+      } else {
+        pieces->push_back("[");
+        pieces->push_back(
+            tensorflow::str_util::Join(literal.GetSparseIndex(i), ", "));
+        pieces->push_back("]: ");
+      }
+      pieces->push_back(literal.GetSparseElementAsString(i));
+    }
+    pieces->push_back("}");
+    return;
+  }
+
+  CHECK(LayoutUtil::IsDenseArray(subshape));
+
+  auto element_to_string =
+      [&](tensorflow::gtl::ArraySlice<int64> indices) -> string {
+    PrimitiveType element_type = subshape.element_type();
+    if (element_type == PRED) {
+      // We display predicates in a densely packed form.
+      return literal.Get<bool>(indices, shape_index) ? "1" : "0";
+    }
+    return ((!indices.empty() && indices.back() > 0) ? ", " : "") +
+           literal.GetAsString(indices, shape_index);
+  };
+
+  if (ShapeUtil::Rank(subshape) == 0) {
+    pieces->push_back(literal.GetAsString({}, shape_index));
+  } else if (ShapeUtil::Rank(subshape) == 1) {
+    pieces->push_back("{");
+    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
+      pieces->push_back(element_to_string({i0}));
+    }
+    pieces->push_back("}");
+  } else if (ShapeUtil::Rank(subshape) == 2) {
+    pieces->push_back(shape_to_string(subshape));
+    pieces->push_back(" {\n");
+    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
+      pieces->push_back("  { ");
+      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
+        pieces->push_back(element_to_string({i0, i1}));
+      }
+      pieces->push_back(" ");
+      pieces->push_back(i0 == subshape.dimensions(0) - 1 ? "}\n" : "},\n");
+    }
+    pieces->push_back("}");
+  } else if (ShapeUtil::Rank(subshape) == 3) {
+    pieces->push_back(shape_to_string(subshape));
+    pieces->push_back(" {\n");
+    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
+      pieces->push_back(i0 > 0 ? ",\n{" : "{");
+      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
+        pieces->push_back(i1 > 0 ? ",\n  { " : " { ");
+        for (int64 i2 = 0; i2 < subshape.dimensions(2); ++i2) {
+          pieces->push_back(element_to_string({i0, i1, i2}));
+        }
+        pieces->push_back(" }");
+      }
+      pieces->push_back(" }");
+    }
+    pieces->push_back("\n}");
+  } else if (ShapeUtil::Rank(subshape) == 4) {
+    pieces->push_back(shape_to_string(subshape));
+    pieces->push_back(" {\n");
+    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
+      pieces->push_back(Printf("  {  /*i0=%lld*/\n", i0));
+      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
+        pieces->push_back(Printf("    {  /*i1=%lld*/\n", i1));
+        for (int64 i2 = 0; i2 < subshape.dimensions(2); ++i2) {
+          pieces->push_back("      {");
+          for (int64 i3 = 0; i3 < subshape.dimensions(3); ++i3) {
+            pieces->push_back(element_to_string({i0, i1, i2, i3}));
+          }
+          pieces->push_back(i2 == subshape.dimensions(2) - 1 ? "}\n" : "},\n");
+        }
+        pieces->push_back(i1 == subshape.dimensions(1) - 1 ? "    }\n"
+                                                           : "    },\n");
+      }
+      pieces->push_back(i0 == subshape.dimensions(0) - 1 ? "  }\n" : "  },\n");
+    }
+    pieces->push_back("}");
+  } else if (ShapeUtil::Rank(subshape) == 5) {
+    pieces->push_back(shape_to_string(subshape));
+    pieces->push_back(" {\n");
+    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
+      pieces->push_back(Printf("  {  /*i0=%lld*/\n", i0));
+      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
+        pieces->push_back(Printf("    {  /*i1=%lld*/\n", i1));
+        for (int64 i2 = 0; i2 < subshape.dimensions(2); ++i2) {
+          pieces->push_back(Printf("      {  /*i2=%lld*/\n", i2));
+          for (int64 i3 = 0; i3 < subshape.dimensions(3); ++i3) {
+            pieces->push_back("        {");
+            for (int64 i4 = 0; i4 < subshape.dimensions(4); ++i4) {
+              pieces->push_back(element_to_string({i0, i1, i2, i3, i4}));
+            }
+            pieces->push_back(i3 == subshape.dimensions(3) - 1 ? "}\n"
+                                                               : "},\n");
+          }
+          pieces->push_back(i2 == subshape.dimensions(2) - 1 ? "      }\n"
+                                                             : "      },\n");
+        }
+        pieces->push_back(i1 == subshape.dimensions(1) - 1 ? "    }\n"
+                                                           : "    },\n");
+      }
+      pieces->push_back(i0 == subshape.dimensions(0) - 1 ? "  }\n" : "  },\n");
+    }
+    pieces->push_back("}");
+  } else {
+    pieces->push_back(shape_to_string(subshape));
+    pieces->push_back(" {");
+    literal.EachCellAsString(
+        [&](tensorflow::gtl::ArraySlice<int64> indices, const string& value) {
+          pieces->push_back(" ");
+          pieces->push_back(value);
+        });
+    pieces->push_back("}");
+  }
+}
+
+}  // namespace
+
+int64 LiteralBase::sparse_element_count() const {
+  CHECK(LayoutUtil::IsSparseArray(shape()));
+  return sparse_indices()->index_count();
+}
+
+string LiteralBase::ToString(bool print_layout) const {
+  std::vector<string> pieces;
+  CHECK(LayoutUtil::HasLayout(this->shape()));
+  ToStringHelper(*this, {}, print_layout, &pieces);
+  return tensorflow::str_util::Join(pieces, "");
+}
+
+void LiteralBase::EachCellAsString(
+    const std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
+                             const string& value)>& per_cell) const {
+  if (ShapeUtil::IsZeroElementArray(shape())) {
+    return;
+  }
+  std::vector<int64> indices = IndexUtil::LinearIndexToMultidimensionalIndex(
+      shape(), /*linear_index=*/0);
+  do {
+    per_cell(indices, GetAsString(indices));
+  } while (IndexUtil::BumpIndices(shape(), &indices));
+}
+
+namespace {
+template <typename NativeSrcT, typename NativeDestT, typename ConverterType>
+std::unique_ptr<Literal> ConvertBetweenNativeTypesWithConverter(
+    const LiteralBase& src_literal, const ConverterType& converter) {
+  CHECK(ShapeUtil::IsArray(src_literal.shape()));
+  auto result_literal = MakeUnique<Literal>(ShapeUtil::ChangeElementType(
+      src_literal.shape(),
+      primitive_util::NativeToPrimitiveType<NativeDestT>()));
+  auto src_data = src_literal.data<NativeSrcT>();
+  auto dest_data = result_literal->template data<NativeDestT>();
+  int64 num_elements = src_literal.element_count();
+
+  for (int64 i = 0; i < num_elements; ++i) {
+    dest_data[i] = converter(src_data[i]);
+  }
+  return result_literal;
+}
+
+template <typename NativeSrcT, typename NativeDestT>
+std::unique_ptr<Literal> ConvertBetweenNativeTypes(
+    const LiteralBase& src_literal) {
+  auto converter = [](NativeSrcT src) { return static_cast<NativeDestT>(src); };
+  return ConvertBetweenNativeTypesWithConverter<NativeSrcT, NativeDestT>(
+      src_literal, converter);
+}
+
+template <typename NativeSrcT, typename NativeDestT>
+typename std::enable_if<(sizeof(NativeSrcT) == sizeof(NativeDestT)),
+                        std::unique_ptr<Literal>>::type
+BitcastBetweenNativeTypes(const LiteralBase& src_literal) {
+  auto converter = [](NativeSrcT src) {
+    return tensorflow::bit_cast<NativeDestT>(src);
+  };
+  return ConvertBetweenNativeTypesWithConverter<NativeSrcT, NativeDestT>(
+      src_literal, converter);
+}
+
+// This template specialization is here to make the compiler happy. bit_cast has
+// a static check that the types are the same size. This specialization should
+// never be used because the source and destination types are checked for
+// identical sizes higher up.
+template <typename NativeSrcT, typename NativeDestT>
+typename std::enable_if<(sizeof(NativeSrcT) != sizeof(NativeDestT)),
+                        std::unique_ptr<Literal>>::type
+BitcastBetweenNativeTypes(const LiteralBase& src_literal) {
+  LOG(FATAL) << "Invalid bitcast between types of different sizes.";
+}
+
+template <PrimitiveType primitive_src_type>
+std::unique_ptr<Literal> ConvertToC64(const LiteralBase& src_literal) {
+  CHECK(ShapeUtil::IsArray(src_literal.shape()));
+  auto result_literal = MakeUnique<Literal>(
+      ShapeUtil::ChangeElementType(src_literal.shape(), C64));
+  using NativeSrcT =
+      typename primitive_util::PrimitiveTypeToNative<primitive_src_type>::type;
+  tensorflow::gtl::ArraySlice<NativeSrcT> src_data =
+      src_literal.data<NativeSrcT>();
+  tensorflow::gtl::MutableArraySlice<complex64> dest_data =
+      result_literal->data<complex64>();
+  int64 num_elements = src_literal.element_count();
+  for (int64 i = 0; i < num_elements; ++i) {
+    dest_data[i] = complex64(static_cast<float>(src_data[i]), 0);
+  }
+  return result_literal;
+}
+
+template <PrimitiveType primitive_src_type, PrimitiveType primitive_dest_type>
+std::unique_ptr<Literal> ConvertIfTypesMatch(const LiteralBase& src_literal,
+                                             bool bitcast) {
+  CHECK_EQ(primitive_src_type, src_literal.shape().element_type());
+  if (bitcast) {
+    return BitcastBetweenNativeTypes<
+        typename primitive_util::PrimitiveTypeToNative<
+            primitive_src_type>::type,
+        typename primitive_util::PrimitiveTypeToNative<
+            primitive_dest_type>::type>(src_literal);
+  } else {
+    return ConvertBetweenNativeTypes<
+        typename primitive_util::PrimitiveTypeToNative<
+            primitive_src_type>::type,
+        typename primitive_util::PrimitiveTypeToNative<
+            primitive_dest_type>::type>(src_literal);
+  }
+}
+
+template <PrimitiveType primitive_src_type>
+StatusOr<std::unique_ptr<Literal>> ConvertIfDestTypeMatches(
+    const LiteralBase& src_literal, PrimitiveType primitive_dest_type,
+    bool bitcast) {
+  switch (primitive_dest_type) {
+#define CONVERT_IF_TYPES_MATCH(type)                                    \
+  case (type):                                                          \
+    return ConvertIfTypesMatch<primitive_src_type, (type)>(src_literal, \
+                                                           bitcast);
+    CONVERT_IF_TYPES_MATCH(PRED)
+    CONVERT_IF_TYPES_MATCH(S8)
+    CONVERT_IF_TYPES_MATCH(S32)
+    CONVERT_IF_TYPES_MATCH(S64)
+    CONVERT_IF_TYPES_MATCH(U8)
+    CONVERT_IF_TYPES_MATCH(U32)
+    CONVERT_IF_TYPES_MATCH(U64)
+    CONVERT_IF_TYPES_MATCH(F16)
+    CONVERT_IF_TYPES_MATCH(F32)
+    CONVERT_IF_TYPES_MATCH(F64)
+    CONVERT_IF_TYPES_MATCH(BF16)
+#undef CONVERT_IF_TYPES_MATCH
+    case C64:
+      if (!bitcast) {
+        return ConvertToC64<primitive_src_type>(src_literal);
+      }
+      break;
+    // Other types are not yet supported.
+    default:
+      break;
+  }
+  return Unimplemented(
+      "Converting from type %s to type %s is not implemented.",
+      PrimitiveType_Name(src_literal.shape().element_type()).c_str(),
+      PrimitiveType_Name(primitive_dest_type).c_str());
+}
+
+StatusOr<std::unique_ptr<Literal>> ConvertSwitch(
+    const LiteralBase& literal, PrimitiveType primitive_dest_type,
+    bool bitcast) {
+  TF_RET_CHECK(ShapeUtil::IsArray(literal.shape()));
+  if (literal.shape().element_type() == primitive_dest_type) {
+    return literal.CloneToUnique();
+  }
+  switch (literal.shape().element_type()) {
+#define CONVERT_IF_DEST_TYPE_MATCHES(type)                                \
+  case (type):                                                            \
+    return ConvertIfDestTypeMatches<(type)>(literal, primitive_dest_type, \
+                                            bitcast);
+    CONVERT_IF_DEST_TYPE_MATCHES(PRED)
+    CONVERT_IF_DEST_TYPE_MATCHES(S8)
+    CONVERT_IF_DEST_TYPE_MATCHES(S32)
+    CONVERT_IF_DEST_TYPE_MATCHES(S64)
+    CONVERT_IF_DEST_TYPE_MATCHES(U8)
+    CONVERT_IF_DEST_TYPE_MATCHES(U32)
+    CONVERT_IF_DEST_TYPE_MATCHES(U64)
+    CONVERT_IF_DEST_TYPE_MATCHES(F16)
+    CONVERT_IF_DEST_TYPE_MATCHES(F32)
+    CONVERT_IF_DEST_TYPE_MATCHES(F64)
+    CONVERT_IF_DEST_TYPE_MATCHES(BF16)
+#undef CONVERT_IF_DEST_TYPE_MATCHES
+      // Other types are not yet supported.
+    default:
+      return Unimplemented(
+          "%s from type %s to type %s is not implemented.",
+          (bitcast ? "Bitcast converting" : "Converting"),
+          PrimitiveType_Name(literal.shape().element_type()).c_str(),
+          PrimitiveType_Name(primitive_dest_type).c_str());
+  }
+}
+
+}  // namespace
+
+StatusOr<std::unique_ptr<Literal>> LiteralBase::Convert(
+    PrimitiveType primitive_dest_type) const {
+  return ConvertSwitch(*this, primitive_dest_type, /*bitcast=*/false);
+}
+
+StatusOr<std::unique_ptr<Literal>> LiteralBase::BitcastConvert(
+    PrimitiveType primitive_dest_type) const {
+  if (primitive_util::BitWidth(shape().element_type()) !=
+      primitive_util::BitWidth(primitive_dest_type)) {
+    return InvalidArgument(
+        "Cannot bitcast convert from %s to %s, bit widths are different: %d != "
+        "%d",
+        PrimitiveType_Name(shape().element_type()).c_str(),
+        PrimitiveType_Name(primitive_dest_type).c_str(),
+        primitive_util::BitWidth(shape().element_type()),
+        primitive_util::BitWidth(primitive_dest_type));
+  }
+  return ConvertSwitch(*this, primitive_dest_type, /*bitcast=*/true);
+}
+
+StatusOr<std::unique_ptr<Literal>> LiteralBase::ConvertToShape(
+    const Shape& dest_shape, bool round_f32_to_bf16) const {
+  if (!ShapeUtil::IsTuple(dest_shape)) {
+    if (round_f32_to_bf16 && shape().element_type() == F32 &&
+        dest_shape.element_type() == BF16) {
+      auto converter = [](float src) {
+        return tensorflow::bfloat16::round_to_bfloat16(src);
+      };
+      return ConvertBetweenNativeTypesWithConverter<float, bfloat16>(*this,
+                                                                     converter);
+    }
+    return Convert(dest_shape.element_type());
+  }
+  std::vector<Literal> elements;
+  for (int i = 0; i < ShapeUtil::TupleElementCount(shape()); ++i) {
+    auto element = LiteralSlice(*this, {i});
+    TF_ASSIGN_OR_RETURN(
+        auto new_element,
+        element.ConvertToShape(ShapeUtil::GetSubshape(dest_shape, {i})));
+    elements.push_back(std::move(*new_element));
+  }
+  auto converted = MakeUnique<Literal>();
+  *converted = Literal::MoveIntoTuple(&elements);
+  return std::move(converted);
+}
+
+/* static */ Literal Literal::MoveIntoTuple(
+    tensorflow::gtl::MutableArraySlice<Literal> elements) {
+  std::vector<Shape> element_shapes;
+  for (const Literal& element : elements) {
+    element_shapes.push_back(element.shape());
+  }
+  Literal literal(ShapeUtil::MakeTupleShape(element_shapes),
+                  /*allocate_arrays=*/false);
+  for (int i = 0; i < elements.size(); ++i) {
+    TF_CHECK_OK(
+        literal.MoveFrom(std::move(elements[i]), /*dest_shape_index=*/{i}));
+  }
+  return literal;
+}
+
+template <typename NativeT>
+bool LiteralBase::Piece::EqualElementsInternal(
+    const LiteralBase::Piece& other, std::vector<int64>* multi_index) const {
+  if (multi_index->size() == ShapeUtil::Rank(subshape())) {
+    return (Get<NativeT>(*multi_index) == other.Get<NativeT>(*multi_index));
+  }
+  for (int64 i = 0; i < subshape().dimensions(multi_index->size()); ++i) {
+    multi_index->push_back(i);
+    if (!EqualElementsInternal<NativeT>(other, multi_index)) {
+      return false;
+    }
+    multi_index->pop_back();
+  }
+  return true;
+}
+
+bool LiteralBase::Piece::EqualElements(const LiteralBase::Piece& other) const {
+  DCHECK(ShapeUtil::Compatible(subshape(), other.subshape()));
+
+  std::vector<int64> multi_index;
+  switch (subshape().element_type()) {
+    case PRED:
+      return EqualElementsInternal<bool>(other, &multi_index);
+    case U8:
+      return EqualElementsInternal<uint8>(other, &multi_index);
+    case S32:
+      return EqualElementsInternal<int32>(other, &multi_index);
+    case S64:
+      return EqualElementsInternal<int64>(other, &multi_index);
+    case U32:
+      return EqualElementsInternal<uint32>(other, &multi_index);
+    case U64:
+      return EqualElementsInternal<uint64>(other, &multi_index);
+    case F32:
+      return EqualElementsInternal<float>(other, &multi_index);
+    case F64:
+      return EqualElementsInternal<double>(other, &multi_index);
+    case F16:
+      return EqualElementsInternal<half>(other, &multi_index);
+    case BF16:
+      return EqualElementsInternal<bfloat16>(other, &multi_index);
+    case C64:
+      return EqualElementsInternal<complex64>(other, &multi_index);
+    default:
+      LOG(FATAL) << "Unimplemented: LiteralBase::Piece::EqualElements for type "
+                 << PrimitiveType_Name(subshape().element_type());
+  }
+}
+
+bool LiteralBase::operator==(const LiteralBase& other) const {
+  if (!ShapeUtil::Compatible(shape(), other.shape())) {
+    return false;
+  }
+
+  return root_piece().ForEachSubpieceWithBool(
+      [&](const ShapeIndex& index, const Piece& piece) {
+        if (!ShapeUtil::IsArray(piece.subshape())) {
+          return true;
+        }
+
+        const Piece& other_piece = other.piece(index);
+        if (!piece.EqualElements(other_piece)) {
+          return false;
+        }
+        return true;
+      });
+}
+
+namespace {
+
+template <typename NativeT>
+static bool AllElementsEqualValue(tensorflow::gtl::ArraySlice<NativeT> data,
+                                  NativeT value) {
+  for (int64 i = 0; i < data.size(); ++i) {
+    if (data[i] != value) {
+      return false;
+    }
+  }
+  return true;
+}
+
+}  // namespace
+
+bool LiteralBase::IsAll(int8 value) const {
+  return root_piece().ForEachSubpieceWithBool([&](const ShapeIndex& index,
+                                                  const Piece& piece) {
+    if (!ShapeUtil::IsArray(piece.subshape())) {
+      return true;
+    }
+
+    auto piece_is_all = [&]() {
+      switch (shape().element_type()) {
+        case U8:
+          if (value >= 0) {
+            return AllElementsEqualValue<uint8>(piece.data<uint8>(), value);
+          }
+          return false;
+        case U32:
+          if (value >= 0) {
+            return AllElementsEqualValue<uint32>(piece.data<uint32>(), value);
+          }
+          return false;
+        case U64:
+          if (value >= 0) {
+            return AllElementsEqualValue<uint64>(piece.data<uint64>(), value);
+          }
+          return false;
+        case S8:
+          return AllElementsEqualValue<int8>(piece.data<int8>(), value);
+        case S32:
+          return AllElementsEqualValue<int32>(piece.data<int32>(), value);
+        case S64:
+          return AllElementsEqualValue<int64>(piece.data<int64>(), value);
+        case F32:
+          return AllElementsEqualValue<float>(piece.data<float>(), value);
+        case F64:
+          return AllElementsEqualValue<double>(piece.data<double>(), value);
+        case F16:
+          return AllElementsEqualValue<half>(piece.data<half>(),
+                                             static_cast<half>(value));
+        case BF16:
+          return AllElementsEqualValue<bfloat16>(piece.data<bfloat16>(),
+                                                 static_cast<bfloat16>(value));
+        case PRED:
+          if (value == 0) {
+            return AllElementsEqualValue<bool>(piece.data<bool>(), false);
+          }
+          if (value == 1) {
+            return AllElementsEqualValue<bool>(piece.data<bool>(), true);
+          }
+          return false;
+        default:
+          return false;
+      }
+      return false;
+    };
+
+    if (!piece_is_all()) {
+      return false;
+    }
+    return true;
+  });
+}
+
+bool LiteralBase::IsAllFloat(float value) const {
+  return root_piece().ForEachSubpieceWithBool(
+      [&](const ShapeIndex& index, const Piece& piece) {
+        if (!ShapeUtil::IsArray(piece.subshape())) {
+          return true;
+        }
+
+        auto piece_is_all = [&]() {
+          switch (shape().element_type()) {
+            case F32:
+              return AllElementsEqualValue<float>(piece.data<float>(), value);
+            case F64:
+              return AllElementsEqualValue<double>(piece.data<double>(), value);
+            case F16:
+              return AllElementsEqualValue<half>(piece.data<half>(),
+                                                 static_cast<half>(value));
+            case BF16:
+              return AllElementsEqualValue<bfloat16>(
+                  piece.data<bfloat16>(), static_cast<bfloat16>(value));
+            default:
+              return false;
+          }
+        };
+        if (!piece_is_all()) {
+          return false;
+        }
+        return true;
+      });
+}
+
+bool LiteralBase::IsAllComplex(complex64 value) const {
+  switch (shape().element_type()) {
+    case C64:
+      return AllElementsEqualValue<complex64>(root_piece().data<complex64>(),
+                                              value);
+    default:
+      return false;
+  }
+}
+
+bool LiteralBase::IsAllFirst() const {
+  return root_piece().ForEachSubpieceWithBool(
+      [&](const ShapeIndex& index, const Piece& piece) {
+        if (!ShapeUtil::IsArray(piece.subshape())) {
+          return true;
+        }
+
+        // Empty shapes are not all the first element since there is no first
+        // element.
+        if (ShapeUtil::IsZeroElementArray(piece.subshape())) {
+          return false;
+        }
+        auto piece_is_all = [&]() {
+          switch (piece.subshape().element_type()) {
+            case PRED: {
+              auto data = piece.data<bool>();
+              return AllElementsEqualValue<bool>(data, data[0]);
+            }
+            // 8 bit types
+            case S8: {
+              auto data = piece.data<int8>();
+              return AllElementsEqualValue<int8>(data, data[0]);
+            }
+            case U8: {
+              auto data = piece.data<uint8>();
+              return AllElementsEqualValue<uint8>(data, data[0]);
+            }
+            // 16 bit types
+            case BF16: {
+              auto data = piece.data<bfloat16>();
+              return AllElementsEqualValue<bfloat16>(data, data[0]);
+            }
+            case F16: {
+              auto data = piece.data<half>();
+              return AllElementsEqualValue<half>(data, data[0]);
+            }
+            case S16: {
+              auto data = piece.data<int16>();
+              return AllElementsEqualValue<int16>(data, data[0]);
+            }
+            case U16: {
+              auto data = piece.data<uint16>();
+              return AllElementsEqualValue<uint16>(data, data[0]);
+            }
+            // 32 bit types
+            case F32: {
+              auto data = piece.data<float>();
+              return AllElementsEqualValue<float>(data, data[0]);
+            }
+            case U32: {
+              auto data = piece.data<uint32>();
+              return AllElementsEqualValue<uint32>(data, data[0]);
+            }
+            case S32: {
+              auto data = piece.data<int32>();
+              return AllElementsEqualValue<int32>(data, data[0]);
+            }
+            // 64 bit types
+            case C64: {
+              auto data = piece.data<complex64>();
+              return AllElementsEqualValue<complex64>(data, data[0]);
+            }
+            case F64: {
+              auto data = piece.data<double>();
+              return AllElementsEqualValue<double>(data, data[0]);
+            }
+            case S64: {
+              auto data = piece.data<int64>();
+              return AllElementsEqualValue<int64>(data, data[0]);
+            }
+            case U64: {
+              auto data = piece.data<uint64>();
+              return AllElementsEqualValue<uint64>(data, data[0]);
+            }
+            default:
+              return false;
+          }
+        };
+
+        if (!piece_is_all()) {
+          return false;
+        }
+        return true;
+      });
+}
+
+bool LiteralBase::IsZero(tensorflow::gtl::ArraySlice<int64> indices) const {
+  CHECK(ShapeUtil::IsArray(shape()));
+  switch (shape().element_type()) {
+    case U8:
+      return Get<uint8>(indices) == 0;
+    case U32:
+      return Get<uint32>(indices) == 0;
+    case U64:
+      return Get<uint64>(indices) == 0;
+    case S8:
+      return Get<int8>(indices) == 0;
+    case S32:
+      return Get<int32>(indices) == 0;
+    case S64:
+      return Get<int64>(indices) == 0;
+    case F32:
+      return Get<float>(indices) == 0.0f;
+    case F64:
+      return Get<double>(indices) == 0.0;
+    case C64:
+      return Get<complex64>(indices) == complex64(0.0f, 0.0f);
+    case F16:
+      return Get<half>(indices) == static_cast<half>(0.0f);
+    case BF16:
+      return Get<bfloat16>(indices) == static_cast<bfloat16>(0.0f);
+    case PRED:
+      return Get<bool>(indices) == false;
+    default:
+      LOG(FATAL) << "Input literal must be an array.";
+  }
+}
+
+namespace {
+
+template <typename RepeatedFieldT, typename NativeT>
+void CopyToRepeatedField(RepeatedFieldT* dest,
+                         const tensorflow::gtl::ArraySlice<NativeT> src) {
+  *dest = RepeatedFieldT(src.begin(), src.end());
+}
+
+}  // namespace
+
+void LiteralBase::Piece::WriteToProto(LiteralProto* proto) const {
+  *proto->mutable_shape() = subshape();
+  switch (subshape().element_type()) {
+    case PRED:
+      CopyToRepeatedField(proto->mutable_preds(), data<bool>());
+      break;
+    case U8:
+      proto->set_u8s(static_cast<const unsigned char*>(data<uint8>().data()),
+                     element_count());
+      break;
+    case U32:
+      CopyToRepeatedField(proto->mutable_u32s(), data<uint32>());
+      break;
+    case U64:
+      CopyToRepeatedField(proto->mutable_u64s(), data<uint64>());
+      break;
+    case S32:
+      CopyToRepeatedField(proto->mutable_s32s(), data<int32>());
+      break;
+    case S64:
+      CopyToRepeatedField(proto->mutable_s64s(), data<int64>());
+      break;
+    case F16:
+      *proto->mutable_f16s() = string(
+          reinterpret_cast<const char*>(data<half>().data()), size_bytes());
+      if (!kLittleEndian) {
+        ConvertEndianShort(proto->mutable_f16s());
+      }
+      break;
+    case BF16:
+      *proto->mutable_bf16s() = string(
+          reinterpret_cast<const char*>(data<bfloat16>().data()), size_bytes());
+      if (!kLittleEndian) {
+        ConvertEndianShort(proto->mutable_bf16s());
+      }
+      break;
+    case F32:
+      CopyToRepeatedField(proto->mutable_f32s(), data<float>());
+      break;
+    case F64:
+      CopyToRepeatedField(proto->mutable_f64s(), data<double>());
+      break;
+    case C64:
+      for (complex64 value : data<complex64>()) {
+        proto->add_c64s(value.real());
+        proto->add_c64s(value.imag());
+      }
+      break;
+    case TUPLE:
+    case TOKEN:
+      // Nothing to do but assign the shape which is done above.
+      return;
+    default:
+      // TODO(b/111551621): Support serializing more PrimitiveTypes.
+      LOG(FATAL) << "Unhandled primitive type "
+                 << PrimitiveType_Name(subshape().element_type());
+  }
+}
+
+const void* LiteralBase::Piece::untyped_data() const {
+  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
+  return buffer();
+}
+
+void* LiteralBase::Piece::untyped_data() {
+  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
+  return buffer();
+}
+
+namespace {
+
+template <typename RepeatedFieldT, typename NativeT>
+Status CopyFromRepeatedField(tensorflow::gtl::MutableArraySlice<NativeT> dest,
+                             const RepeatedFieldT& src) {
+  if (dest.size() != src.size()) {
+    return InvalidArgument(
+        "Expected %lu elements in LiteralProto repeated field, has %d",
+        dest.size(), src.size());
+  }
+  std::copy(src.begin(), src.end(), dest.begin());
+  return Status::OK();
+}
+
+}  // namespace
+
+Status LiteralBase::Piece::CopyFromProto(const LiteralProto& proto) {
+  // These conditions should have been checked in Literal::CreateFromProto.
+  TF_RET_CHECK(proto.has_shape());
+  TF_RET_CHECK(LayoutUtil::HasLayout(proto.shape()));
+  TF_RET_CHECK(ShapeUtil::Equal(proto.shape(), subshape()));
+
+  switch (subshape().element_type()) {
+    case PRED:
+      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<bool>(), proto.preds()));
+      break;
+    case U8: {
+      auto u8_data = data<uint8>();
+      TF_RET_CHECK(proto.u8s().size() == u8_data.size());
+      std::copy(proto.u8s().begin(), proto.u8s().end(), u8_data.begin());
+    } break;
+    case S32:
+      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<int32>(), proto.s32s()));
+      break;
+    case S64:
+      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<int64>(), proto.s64s()));
+      break;
+    case U32:
+      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<uint32>(), proto.u32s()));
+      break;
+    case U64:
+      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<uint64>(), proto.u64s()));
+      break;
+    case F16: {
+      const string& s(proto.f16s());
+      TF_RET_CHECK(data<half>().size() * sizeof(half) == s.size());
+      memcpy(untyped_data(), s.data(), s.size());
+      if (!kLittleEndian) {
+        ConvertEndianShort(reinterpret_cast<char*>(untyped_data()), s.size());
+      }
+    } break;
+
+    case BF16: {
+      const string& s(proto.bf16s());
+      TF_RET_CHECK(data<bfloat16>().size() * sizeof(bfloat16) == s.size());
+      memcpy(untyped_data(), s.data(), s.size());
+      if (!kLittleEndian) {
+        ConvertEndianShort(reinterpret_cast<char*>(untyped_data()), s.size());
+      }
+    } break;
+    case F32:
+      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<float>(), proto.f32s()));
+      break;
+    case F64:
+      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<double>(), proto.f64s()));
+      break;
+    case C64: {
+      auto complex_data = data<complex64>();
+      TF_RET_CHECK(proto.c64s_size() == complex_data.size() * 2);
+      for (int64 i = 0; i < complex_data.size(); ++i) {
+        complex_data[i] = complex64{proto.c64s(i * 2), proto.c64s(i * 2 + 1)};
+      }
+    } break;
+    case TUPLE:
+      LOG(FATAL) << "Should not be called on tuple shapes: "
+                 << ShapeUtil::HumanString(subshape());
+      break;
+    default:
+      LOG(FATAL) << "Unhandled primitive type " << subshape().element_type();
+  }
+  return Status::OK();
+}
+
+LiteralProto LiteralBase::ToProto() const {
+  LiteralProto proto;
+  root_piece().ForEachSubpiece(
+      [&](const ShapeIndex& index, const Piece& piece) {
+        LiteralProto* proto_piece = &proto;
+        for (int64 i : index) {
+          while (proto_piece->tuple_literals_size() <= i) {
+            proto_piece->add_tuple_literals();
+          }
+          proto_piece = proto_piece->mutable_tuple_literals(i);
+        }
+        piece.WriteToProto(proto_piece);
+      });
+
+  if (LayoutUtil::IsSparseArray(shape())) {
+    CopyToRepeatedField(proto.mutable_sparse_indices(),
+                        sparse_indices()->data());
+  }
+
+  return proto;
+}
+
+const void* LiteralBase::untyped_data(const ShapeIndex& shape_index) const {
+  return piece(shape_index).untyped_data();
+}
+
+void* Literal::untyped_data(const ShapeIndex& shape_index) {
+  return piece(shape_index).untyped_data();
+}
+
+int64 LiteralBase::size_bytes(const ShapeIndex& shape_index) const {
+  return piece(shape_index).size_bytes();
+}
+
+string LiteralBase::GetR1U8AsString() const {
+  CHECK(ShapeUtil::IsArray(shape()));
+  CHECK_EQ(ShapeUtil::Rank(shape()), 1);
+  CHECK_EQ(shape().element_type(), U8);
+  return string(tensorflow::bit_cast<const char*>(data<uint8>().data()),
+                ShapeUtil::ElementsIn(shape()));
+}
+
+void BorrowingLiteral::BuildPieceSubtree(const Shape& shape, Piece* piece) {
+  CHECK(ShapeUtil::IsTuple(shape));
+  for (int i = 0; i < ShapeUtil::TupleElementCount(shape); ++i) {
+    const Shape& subshape = shape.tuple_shapes(i);
+
+    auto child_piece = Piece();
+    child_piece.set_subshape(&subshape);
+
+    if (ShapeUtil::IsTuple(subshape)) {
+      BuildPieceSubtree(subshape, &child_piece);
+    }
+
+    piece->emplace_back(std::move(child_piece));
+  }
+}
+
+LiteralSlice::LiteralSlice(const LiteralBase& literal)
+    : LiteralBase(), root_piece_(&literal.root_piece()) {}
+
+LiteralSlice::LiteralSlice(const LiteralBase& literal,
+                           const ShapeIndex& view_root)
+    : LiteralBase(), root_piece_(&literal.piece(view_root)) {}
+
+BorrowingLiteral::BorrowingLiteral(const char* src_buf_ptr, const Shape& shape)
+    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
+  CHECK(ShapeUtil::IsArray(*shape_));
+  CHECK(LayoutUtil::HasLayout(*shape_));
+
+  root_piece_ = Piece();
+  root_piece_.set_buffer(const_cast<char*>(src_buf_ptr));
+  root_piece_.set_subshape(shape_.get());
+}
+
+BorrowingLiteral::BorrowingLiteral(
+    tensorflow::gtl::ArraySlice<const char*> src_buf_ptrs, const Shape& shape)
+    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
+  CHECK(ShapeUtil::IsTuple(*shape_));
+  CHECK(!ShapeUtil::IsNestedTuple(*shape_));
+  CHECK_EQ(src_buf_ptrs.size(), ShapeUtil::TupleElementCount(*shape_));
+  root_piece_ = Piece();
+  root_piece_.set_subshape(shape_.get());
+  BuildPieceSubtree(*shape_, &root_piece_);
+
+  for (int i = 0; i < src_buf_ptrs.size(); ++i) {
+    const auto& src_shape = shape_->tuple_shapes(i);
+    CHECK(ShapeUtil::IsArray(src_shape));
+    root_piece_.child(i).set_buffer(const_cast<char*>(src_buf_ptrs[i]));
+  }
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/literal.h b/tensorflow/compiler/xla/literal.h
new file mode 100644
index 0000000000..dd67dfa8d4
--- /dev/null
+++ b/tensorflow/compiler/xla/literal.h
@@ -0,0 +1,1152 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_LITERAL_H_
+#define TENSORFLOW_COMPILER_XLA_LITERAL_H_
+
+#include <functional>
+#include <initializer_list>
+#include <iterator>
+#include <memory>
+#include <ostream>
+#include <string>
+#include <type_traits>
+#include <vector>
+
+#include "tensorflow/compiler/xla/array2d.h"
+#include "tensorflow/compiler/xla/array3d.h"
+#include "tensorflow/compiler/xla/array4d.h"
+#include "tensorflow/compiler/xla/index_util.h"
+#include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/primitive_util.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/sparse_index_array.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/bitmap.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/protobuf.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+
+// Forward declare Literal and LiteralSlice class to be used by the creation
+// methods in the base class.
+class Literal;
+class LiteralSlice;
+
+// Abstract base class for literals.
+class LiteralBase {
+ public:
+  virtual ~LiteralBase() = 0;
+
+  // Literals are equal if they have compatible shapes and the same data
+  // values. Layout is not compared.
+  bool operator==(const LiteralBase& other) const;
+  bool operator!=(const LiteralBase& other) const { return !(*this == other); }
+
+  // Returns the shape of the literal.
+  const Shape& shape() const { return root_piece().subshape(); }
+
+  // Serialize to proto.
+  LiteralProto ToProto() const;
+
+  // Returns an ArraySlice of the array for this literal for the given NativeT
+  // (e.g., float). CHECKs if the subshape of the literal at the given
+  // ShapeIndex is not array. See primitive_util.h for the mapping from XLA type
+  // to native type.
+  template <typename NativeT>
+  tensorflow::gtl::ArraySlice<NativeT> data(
+      const ShapeIndex& shape_index = {}) const;
+
+  // Returns a const pointer to the sparse index array. Returns nullptr if the
+  // literal is not a sparse array.
+  const SparseIndexArray* sparse_indices(
+      const ShapeIndex& shape_index = {}) const;
+
+  // Returns a const pointer to (or size of) the underlying buffer holding the
+  // array at the given shape index. CHECKs if the subshape of the literal at
+  // the given ShapeIndex is not array.
+  const void* untyped_data(const ShapeIndex& shape_index = {}) const;
+  int64 size_bytes(const ShapeIndex& shape_index = {}) const;
+
+  // Returns this literal's data as a string. This literal must be a rank-1 U8
+  // array.
+  string GetR1U8AsString() const;
+
+  // Returns a string representation of the literal value.
+  // Warning: this function can take minutes for multi-million element Literals.
+  string ToString(bool print_layout = false) const;
+
+  // Gets an element in the literal at the given index. The multi_index is
+  // CHECKed against the dimension sizes.
+  template <typename NativeT>
+  NativeT Get(tensorflow::gtl::ArraySlice<int64> multi_index,
+              const ShapeIndex& shape_index) const;
+  // Overloads of Get for array literals. CHECKs if the literal is not
+  // array-shaped and dense.
+  template <typename NativeT>
+  NativeT Get(tensorflow::gtl::ArraySlice<int64> multi_index) const;
+
+  // Returns the element value at index (0, ..., 0), however many zeroes are
+  // required for that index.
+  template <typename NativeT>
+  NativeT GetFirstElement() const;
+
+  // As Get(), but determines the correct type and converts the value
+  // into text.
+  string GetAsString(tensorflow::gtl::ArraySlice<int64> multi_index,
+                     const ShapeIndex& shape_index = {}) const;
+  // As GetSparseElement(), but determines the correct type and converts the
+  // value into text.
+  string GetSparseElementAsString(int64 sparse_element_number,
+                                  const ShapeIndex& shape_index = {}) const;
+  // As Get(), but determines the correct type and converts the value into
+  // int64.  This literal must be an array.
+  StatusOr<int64> GetIntegralAsS64(
+      tensorflow::gtl::ArraySlice<int64> multi_index) const;
+
+  // Returns the multi-index of the element in a sparse literal at the given
+  // sparse element number.  The sparse element number is the position with in
+  // the sparse array's list of (index, value) pairs, and is checked against the
+  // total number of (index, value) pairs in the sparse array.
+  tensorflow::gtl::ArraySlice<int64> GetSparseIndex(
+      int64 sparse_element_number, const ShapeIndex& shape_index = {}) const;
+
+  // Returns the value of the element in a sparse literal at the given sparse
+  // element number.  The sparse element number is the position with in the
+  // sparse array's list of (index, value) pairs, and is checked against the
+  // total number of (index, value) pairs in the sparse array.
+  template <typename NativeT>
+  NativeT GetSparseElement(int64 sparse_element_number,
+                           const ShapeIndex& shape_index = {}) const;
+
+  // Invokes the "per cell" callback for each element in the provided
+  // literal with the element's indices and a string representation of
+  // the element's value.
+  //
+  // This function is useful if you want a polymorphic representation
+  // of the tensor's elements (turning it to a string for something
+  // like representation in a protobuf).
+  //
+  // This literal must have a dense layout.
+  void EachCellAsString(
+      const std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
+                               const string& value)>& per_cell) const;
+  template <typename NativeT>
+  void EachCell(std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
+                                   NativeT value)>
+                    per_cell) const;
+
+  // Returns whether every element in this literal is equal to value.
+  //
+  // value is an int8 because we expect this to be called with small
+  // compile-time constants (0, -1, etc.) and so that whatever value you pass
+  // can be represented exactly by floating-point types as small as 16 bits.
+  //
+  // If value doesn't fit in this literal's type, returns false.  Values of 1/0
+  // are considered equal to true/false; other values are not considered equal
+  // to true. Also if this literal is not array-shaped false is returned.
+  bool IsAll(int8 value) const;
+
+  // Like IsAll(const Literal&, int8), except we check whether the literal is
+  // equal to a particular floating-point number.
+  //
+  // If the literal is not a floating-point value, this always returns false.
+  //
+  // This casts value to the type of literal, then compares using ==.  The usual
+  // admonishments about floating-point equality checks apply.  We expect you to
+  // use this to check for values that can be expressed precisely as a float,
+  // e.g. -0.5.  Also if this literal is not array-shaped false is returned.
+  bool IsAllFloat(float value) const;
+
+  // Like IsAll(const Literal&, int8), except we check whether the literal is
+  // equal to a particular complex number.
+  //
+  // If the literal is not a complex value, this always returns false.
+  //
+  // This casts value to the type of literal, then compares using ==.  The usual
+  // admonishments about floating-point equality checks apply.  We expect you to
+  // use this to check for complex values that can be expressed precisely as
+  // float pairs e.g. (-0.5, 1.0).
+  //
+  // This literal must have a dense layout.
+  bool IsAllComplex(complex64 value) const;
+
+  // Literal consists entirely of the first element of the literal.
+  bool IsAllFirst() const;
+
+  // Returns whether this literal is zero at the specified index. This literal
+  // must be an array with a dense layout.
+  bool IsZero(tensorflow::gtl::ArraySlice<int64> indices) const;
+
+  // Returns the count of the elements in the array at the given shape index in
+  // this literal.
+  int64 element_count(const ShapeIndex& index = {}) const {
+    return ShapeUtil::ElementsIn(ShapeUtil::GetSubshape(shape(), index));
+  }
+
+  // Returns the count of the elements in the sparse array at the given shape
+  // index in this literal, which will be no larger than
+  // LayoutUtil::MaxSparseElements(SetSubshape(shape(), index).layout()).
+  int64 sparse_element_count() const;
+
+  // Compute a hash for this literal.  This literal must not be a sparse tensor
+  // or a tuple containing a sparse tensor.
+  size_t Hash() const;
+
+  // Converts this literal to the given shape. Returns an error is the
+  // conversion is not possible.
+  //
+  // round_f32_to_bf16: if true, converting F32 elements to BF16 uses rounding
+  // instead of truncation; otherwise, truncation is used.
+  //
+  // TODO(b/69266521): remove the round_to_bfloat16 flag when rounding becomes
+  // the default behavior.
+  StatusOr<std::unique_ptr<Literal>> ConvertToShape(
+      const Shape& dest_shape, bool round_f32_to_bf16 = false) const;
+
+  // Converts this literal to another primitive type using a bitcast
+  // conversion. The to and from primitive types must have the same bit
+  // width. Returns an error if the conversion is not possible. This literal
+  // must be array-shaped.
+  StatusOr<std::unique_ptr<Literal>> BitcastConvert(
+      PrimitiveType primitive_dest_type) const;
+
+  // Converts this literal to another primitive type. Returns an error if the
+  // conversion is not possible. This literal must be array-shaped.
+  StatusOr<std::unique_ptr<Literal>> Convert(
+      PrimitiveType primitive_dest_type) const;
+
+  // Clones the underlying buffers into a new Literal, or new
+  // std::unique_ptr<Literal>.
+  Literal Clone() const;
+  std::unique_ptr<Literal> CloneToUnique() const;
+
+  // TODO(b/67651157): The methods below which perform computation on Literals
+  // (Reshape, Slice, etc) should be moved elsewhere, and perhaps combined with
+  // evaluator code which operates on Literals.
+  //
+  // Creates a new value that has the equivalent value as this
+  // literal, but conforms to new_layout; e.g. a literal matrix that was in {0,
+  // 1} minor-to-major dimension layout can be re-layed-out as {1, 0}
+  // minor-to-major dimension layout and the value in the cell at any given
+  // logical index (i0, i1) will be the same.
+  //
+  // For tuple shaped literals, shape_index should be used to select the inner
+  // array that the new layout applies to.
+  //
+  // Note: this is useful when the client wants to ensure that a value placed in
+  // the XLA allocation tracker has a particular layout; for efficiency
+  // purposes or avoiding unimplemented operation/layout combinations.
+  std::unique_ptr<Literal> Relayout(const Layout& new_layout,
+                                    const ShapeIndex& shape_index = {}) const;
+
+  // An overload of Relayout which changes the layout of the entire shape rather
+  // than being limited to a single array within the shape.
+  std::unique_ptr<Literal> Relayout(const Shape& shape_with_layout) const;
+
+  // Creates a new literal by reshaping this literal to have the given
+  // dimensions. The total number of elements must not change; The
+  // implementation currently only supports monotonic dim0-major layouts.
+  // This literal must be an array.
+  StatusOr<std::unique_ptr<Literal>> Reshape(
+      tensorflow::gtl::ArraySlice<int64> dimensions) const;
+
+  // Creates a new literal by broadcasting this literal with `dimensions` to
+  // yield a literal of shape `result_shape`.
+  StatusOr<std::unique_ptr<Literal>> Broadcast(
+      const Shape& result_shape,
+      tensorflow::gtl::ArraySlice<int64> dimensions) const;
+
+  // Creates a new literal by reordering the dimensions of this literal.
+  // The given `permutation` must be a permutation of the dimension numbers
+  // in the original literal, and it specifies the order of the new dimensions
+  // in the result literal (i.e., new_order[i] = old_order[permutation[i]]).
+  // For example, a transpose call on a literal of shape [3 x 8 x 4] and
+  // `permutation` = {2, 0, 1} returns a new literal of shape [4 x 3 x 8].
+  // This literal must be an array.
+  std::unique_ptr<Literal> Transpose(
+      tensorflow::gtl::ArraySlice<int64> permutation) const;
+
+  // Creates a sub-array from this literal by extracting the indices
+  // [start_index, limit_index) of each dimension. The result literal has the
+  // same rank and layout as for the given literal. The number of indices in
+  // start_indices and limit_indices must be the rank of the literal, and the
+  // indices follow the order of the dimensions.
+  // This literal must be an array.
+  std::unique_ptr<Literal> Slice(
+      tensorflow::gtl::ArraySlice<int64> start_indices,
+      tensorflow::gtl::ArraySlice<int64> limit_indices) const;
+
+  // Creates a literal with a prepended dimension with bound "times"; e.g. a
+  // f32[3x2] with times=4 will produce a f32[4x3x2] with the 3x2 from this
+  // literal replicated four times.
+  // This literal must be an array.
+  template <typename NativeT>
+  std::unique_ptr<Literal> Replicate(int64 times) const;
+
+  // Creates a new Literal object with the shape specified as parameter.
+  // The content of the literal values is the default value of the primitive
+  // type of literal itself (0 for numeric types, and false for predicates).
+  //
+  // Note: It's an antipattern to use this method then immediately call
+  // Literal::Populate on the result (since that results in zero initialization,
+  // then reinitialization. Conside if a call to MakeUnique<Literal>(shape),
+  // followed by the call to Literal::Populate can be used instead.
+  static std::unique_ptr<Literal> CreateFromShape(const Shape& shape);
+
+ protected:
+  // A data structure representing a subshape at a particular ShapeIndex within
+  // the literal. For array-shaped ShapeIndexes, this data structure holds the
+  // pointer to the memory allocated for the array data.
+  class Piece {
+   public:
+    // Returns the buffer holding the array data for this piece as an array
+    // slice. This piece must be array-shaped.
+    template <typename NativeT>
+    tensorflow::gtl::ArraySlice<NativeT> data() const;
+    template <typename NativeT>
+    tensorflow::gtl::MutableArraySlice<NativeT> data();
+
+    // Returns the buffer holding the array data for this piece as a void*. This
+    // piece must be array-shaped.
+    void* untyped_data();
+    const void* untyped_data() const;
+
+    // Gets or sets an element in the array at the given index. The multi_index
+    // is CHECKed against the dimension sizes of the array.  This piece must be
+    // array-shaped.
+    template <typename NativeT>
+    NativeT Get(tensorflow::gtl::ArraySlice<int64> index) const;
+    template <typename NativeT>
+    void Set(tensorflow::gtl::ArraySlice<int64> index, NativeT value);
+
+    // Gets/sets the buffer holding the array data.
+    char* buffer() const { return buffer_; }
+    void set_buffer(char* buffer) { buffer_ = buffer; }
+
+    // The array of multi-indices that provide the locations of non-zero
+    // elements in a sparse array.  Only used if
+    // LayoutUtil::IsSparseArray(shape()) is true.
+    SparseIndexArray* sparse_indices() const { return sparse_indices_; }
+    void set_sparse_indices(SparseIndexArray* sparse_indices) {
+      sparse_indices_ = sparse_indices;
+    }
+
+    // Gets or sets the subshape of this piece. This reference points to a
+    // subshape within the shape in the containing Literal (Literal::shape_).
+    const Shape& subshape() const { return *subshape_; }
+    void set_subshape(const Shape* subshape) { subshape_ = subshape; }
+
+    // Returns the size in bytes of the buffer holding the array data.
+    int64 size_bytes() const { return ShapeUtil::ByteSizeOf(subshape()); }
+
+    // Returns the number of elements in this piece's array.
+    int64 element_count() const {
+      // If this is a sparse array, use the number of elements represented by
+      // the indices in the associated SparseIndexArray.
+      return LayoutUtil::IsSparseArray(subshape())
+                 ? sparse_indices()->index_count()
+                 : ShapeUtil::ElementsIn(subshape());
+    }
+
+    // Returns the child piece at 'index' of this piece.
+    Piece& child(int64 index) { return children_[index]; }
+
+    // Adds a child piece to this piece's children.
+    void emplace_back(Piece child_piece) {
+      children_.emplace_back(std::move(child_piece));
+    }
+
+    // Returns the size of children pieces of this piece.
+    int64 children_size() { return children_.size(); }
+
+    // Visitor functions that recursively traverses the piece and calls the
+    // given function at each child piece. The function has the type:
+    //    void (const ShapeIndex& index, const Piece& piece)
+    template <typename Fn>
+    void ForEachSubpiece(const Fn& func) const {
+      ShapeIndex index;
+      return ForEachHelper(
+                 [&func](const ShapeIndex& index, const Piece& piece) {
+                   func(index, piece);
+                   return Status::OK();
+                 },
+                 *this, &index)
+          .IgnoreError();
+    }
+    // Same as above, but the function has the type:
+    //    Status (const ShapeIndex& index, const Piece& piece)
+    // The first non-OK return value is returned by the function.
+    template <typename Fn>
+    Status ForEachSubpieceWithStatus(const Fn& func) const {
+      ShapeIndex index;
+      return ForEachHelper(func, *this, &index);
+    }
+    // Same as above, but the function has the type:
+    //    Bool (const ShapeIndex& index, const Piece& piece)
+    // The first non-true return value is returned by the function.
+    template <typename Fn>
+    bool ForEachSubpieceWithBool(const Fn& func) const {
+      ShapeIndex index;
+      return ForEachHelperBool(func, *this, &index);
+    }
+    // Same as above, but the function has the type:
+    //    Void (const ShapeIndex& index, Piece& piece)
+    template <typename Fn>
+    void ForEachMutableSubpiece(const Fn& func) {
+      ShapeIndex index;
+      return ForEachMutableHelper(
+                 [&func](const ShapeIndex& index, Piece* piece) {
+                   func(index, piece);
+                   return Status::OK();
+                 },
+                 const_cast<xla::LiteralBase::Piece*>(this), &index)
+          .IgnoreError();
+    }
+    // Same as above, but the function has the type:
+    //    Status (const ShapeIndex& index, Piece& piece)
+    // The first non-OK return value is returned by the function.
+    template <typename Fn>
+    Status ForEachMutableSubpieceWithStatus(const Fn& func) {
+      ShapeIndex index;
+      return ForEachMutableHelper(
+          func, const_cast<xla::LiteralBase::Piece*>(this), &index);
+    }
+
+    // Returns true if this piece and 'other' contain the same data. This piece
+    // and 'other' must be array-shaped and compatible.
+    bool EqualElements(const Piece& other) const;
+
+    // Writes the shape and data (if array-shaped) into the given proto.
+    void WriteToProto(LiteralProto* proto) const;
+
+    // Copy the data from 'src' into this piece's buffer. Shapes of this piece
+    // and src must be compatible.
+    Status CopyFrom(const Piece& src);
+
+    // Copies the data from the given proto into this piece. The shape of this
+    // piece must be equal (not just compatible) to the shape of the proto.
+    Status CopyFromProto(const LiteralProto& proto);
+
+    // Sorts the elements in a sparse array.
+    void SortSparseElements();
+
+   private:
+    // Helpers for traversing the piece via ForEachSubpiece rooted at 'index'.
+    // The first non-OK (or non-true) value is returned by the function.
+    // The callable 'func' has the same signature as described above in
+    // ForEachSubpiece*.
+    template <typename Fn>
+    Status ForEachHelper(const Fn& func, const Piece& piece,
+                         ShapeIndex* index) const {
+      TF_RETURN_IF_ERROR(func(*index, piece));
+      for (int64 i = 0; i < piece.children_.size(); ++i) {
+        index->push_back(i);
+        TF_RETURN_IF_ERROR(ForEachHelper(func, piece.children_[i], index));
+        index->pop_back();
+      }
+      return Status::OK();
+    }
+    template <typename Fn>
+    bool ForEachHelperBool(const Fn& func, const Piece& piece,
+                           ShapeIndex* index) const {
+      if (!func(*index, piece)) {
+        return false;
+      }
+      for (int64 i = 0; i < piece.children_.size(); ++i) {
+        index->push_back(i);
+        if (!ForEachHelperBool(func, piece.children_[i], index)) {
+          return false;
+        }
+        index->pop_back();
+      }
+      return true;
+    }
+    template <typename Fn>
+    Status ForEachMutableHelper(const Fn& func, Piece* piece,
+                                ShapeIndex* index) {
+      TF_RETURN_IF_ERROR(func(*index, piece));
+      for (int64 i = 0; i < piece->children_.size(); ++i) {
+        index->push_back(i);
+        TF_RETURN_IF_ERROR(
+            ForEachMutableHelper(func, &piece->children_[i], index));
+        index->pop_back();
+      }
+      return Status::OK();
+    }
+
+    // Recursive helper for EqualElements.
+    template <typename NativeT>
+    bool EqualElementsInternal(const Piece& other,
+                               std::vector<int64>* multi_index) const;
+
+    // Helper for SortSparseElements that has the element type as a template
+    // parameter.
+    template <typename NativeT>
+    void SortSparseElementsInternal();
+
+    // For array-shaped pieces, this is the buffer holding the literal data.
+    char* buffer_ = nullptr;
+
+    // For sparse arrays, this is the array of indices.
+    SparseIndexArray* sparse_indices_ = nullptr;
+
+    // The shape of piece. This points into the shape of the containing Literal
+    // (Literal::shape_).
+    const Shape* subshape_ = nullptr;
+
+    // Children pieces for tuple shaped pieces.
+    std::vector<Piece> children_ = {};
+  };  // class Piece
+
+  const Piece& piece(const ShapeIndex& shape_index) const {
+    Piece* piece = &const_cast<Piece&>(root_piece());
+    for (const auto i : shape_index) {
+      DCHECK_GE(i, 0);
+      DCHECK_LT(i, piece->children_size());
+      piece = &piece->child(i);
+    }
+    return *piece;
+  }
+
+  // Returns the piece at the root of the shape.
+  virtual const Piece& root_piece() const = 0;
+
+  // LiteralSlice and Literal must access Pieces of other Literals.
+  friend class Literal;
+  friend class LiteralSlice;
+  friend class BorrowingLiteral;
+
+ private:
+  template <typename NativeT>
+  std::unique_ptr<Literal> SliceInternal(
+      const Shape& result_shape,
+      tensorflow::gtl::ArraySlice<int64> start_indices) const;
+};
+
+// Class representing literal values in XLA.
+//
+// The underlying buffer and shape is always owned by this class.
+class Literal : public LiteralBase {
+ public:
+  Literal() : Literal(ShapeUtil::MakeNil()) {}
+
+  // Create a literal of the given shape. The literal is allocated sufficient
+  // memory to hold the shape. Memory is uninitialized.
+  explicit Literal(const Shape& shape);
+  virtual ~Literal();
+
+  // Literals are moveable, but not copyable. To copy a literal use
+  // Literal::Clone or Literal::CloneToUnique. This prevents inadvertent copies
+  // of literals which can be expensive.
+  Literal(const Literal& other) = delete;
+  Literal& operator=(const Literal& other) = delete;
+  Literal(Literal&& other);
+  // 'allocate_arrays' indicates whether to allocate memory for the arrays in
+  // the shape. If false, buffer pointers inside of the Literal::Pieces are set
+  // to nullptr.
+  Literal(const Shape& shape, bool allocate_arrays);
+  Literal& operator=(Literal&& other);
+
+  // TODO(b/67651157): Remove this accessor. Literal users should not be able to
+  // mutate the shape as this can produce malformed Literals.
+  Shape* mutable_shape_do_not_use() { return shape_.get(); }
+
+  // Returns a MutableArraySlice view of the array for this literal for the
+  // given NativeT (e.g., float). CHECKs if the subshape of the literal at the
+  // given ShapeIndex is not array. See primitive_util.h for the mapping from
+  // XLA type to native type.
+  template <typename NativeT>
+  tensorflow::gtl::MutableArraySlice<NativeT> data(
+      const ShapeIndex& shape_index = {});
+  // Unhide const method from parent class.
+  using LiteralBase::data;
+
+  // Returns a pointer to the sparse index array. Returns nullptr if the literal
+  // is not a sparse array.
+  SparseIndexArray* sparse_indices(const ShapeIndex& shape_index = {});
+
+  // Returns a pointer to the underlying buffer holding the array at the given
+  // shape index. CHECKs if the subshape of the literal at the given ShapeIndex
+  // is not array.
+  void* untyped_data(const ShapeIndex& shape_index = {});
+  // Unhide const method from parent class.
+  using LiteralBase::untyped_data;
+
+  // Populates a literal with a sparse layout with the given indices and values.
+  // Each index in the indices array is CHECKed against the dimensions in the
+  // literal's shape.  If sort is true, then the indices and values will be
+  // sorted.  If sort is false, then the indices and values are assumed to
+  // already be in sorted order.  See CreateSparse for an example of how data
+  // are populated.
+  template <typename NativeT>
+  void PopulateSparse(SparseIndexArray indices,
+                      tensorflow::gtl::ArraySlice<NativeT> values,
+                      bool sort = true);
+
+  // Copy values from 'src_literal' rooted at 'src_shape_index' into this
+  // literal rooted at 'dest_shape_index'. The subshape of this literal rooted
+  // at 'dest_shape_index' must be compatible with the subshape of 'src_literal'
+  // rooted at 'src_shape_index', but need not be arrays.
+  Status CopyFrom(const LiteralSlice& src_literal,
+                  const ShapeIndex& dest_shape_index = {},
+                  const ShapeIndex& src_shape_index = {});
+
+  // Returns a vector containing the tuple elements of this Literal as separate
+  // Literals. This Literal must be tuple-shaped and can be a nested tuple. The
+  // elements are moved into the new Literals; no data is copied. Upon return
+  // this Literal is set to a nil shape (empty tuple)
+  std::vector<Literal> DecomposeTuple();
+
+  // Similar to CopyFrom, but with move semantincs. The subshape of this literal
+  // rooted at 'dest_shape_index' must be *equal* to the shape 'src_literal'
+  // (layouts and shapes must match), but need not be arrays. The memory
+  // allocated in this literal for the subshape at dest_shape_index is
+  // deallocated, and the respective buffers are replaced with those in
+  // src_literal. Upon return, src_literal is set to a nil shape (empty tuple).
+  Status MoveFrom(Literal&& src_literal,
+                  const ShapeIndex& dest_shape_index = {});
+
+  // Copies the values from src_literal, starting at src_base shape indexes,
+  // to this literal, starting at dest_base, where the copy size in each
+  // dimension is specified by copy_size.
+  // The src_literal and this literal must have the same primitive type,
+  // src_base+copy_size must fit the source literal dimensions, as well as
+  // dest_base+copy_size must fit the destination literal dimensions.
+  // Note: if either src_literal or this literal contains dimensions with zero
+  // element, then copy_size must be 0 in these dimensions while the
+  // corresponding base indices being 0.
+  // This literal and 'src_literal' must be arrays.
+  Status CopySliceFrom(const LiteralSlice& src_literal,
+                       tensorflow::gtl::ArraySlice<int64> src_base,
+                       tensorflow::gtl::ArraySlice<int64> dest_base,
+                       tensorflow::gtl::ArraySlice<int64> copy_size);
+
+  // Copies one element from src_literal[src_index] to (*this)[dest_index].
+  Status CopyElementFrom(const LiteralSlice& src_literal,
+                         tensorflow::gtl::ArraySlice<int64> src_index,
+                         tensorflow::gtl::ArraySlice<int64> dest_index);
+
+  // Sets an element in the literal at the given index. The multi_index is
+  // CHECKed against the dimension sizes.
+  template <typename NativeT>
+  void Set(tensorflow::gtl::ArraySlice<int64> multi_index,
+           const ShapeIndex& shape_index, NativeT value);
+  // Overloads of Set for array literals. CHECKs if the literal is not
+  // array-shaped and dense.
+  template <typename NativeT>
+  void Set(tensorflow::gtl::ArraySlice<int64> multi_index, NativeT value);
+
+  // Appends the given element to the literal.  If the elements are not appended
+  // in sorted order, then SortSparseElements should be called before calling
+  // other methods.  This literal must have a sparse layout.
+  template <typename NativeT>
+  void AppendSparseElement(tensorflow::gtl::ArraySlice<int64> multi_index,
+                           NativeT value, const ShapeIndex& shape_index = {});
+
+  // Sorts the elements in a sparse array.
+  void SortSparseElements(const ShapeIndex& shape_index = {});
+
+  // As Set(), but truncates `value` to the literal element type before storing.
+  // This literal must be an array.
+  Status SetIntegralAsS64(tensorflow::gtl::ArraySlice<int64> multi_index,
+                          int64 value);
+
+  // Populate this literal with the given values. Examples:
+  //
+  //   // Populate with floats.
+  //   Array2D<float> float_values = ...
+  //   literal.PopulateR2FromArray2D(values);
+  //
+  //   // Populate with int32s.
+  //   literal.PopulateR2<int32>({{1, 2}, {3, 4}});
+  //
+  // The shape and element type of this literal must match given values. For
+  // example, in the call above to literal.PopulateR2(), 'literal' must be a 2x2
+  // array of S32.
+  template <typename NativeT>
+  void PopulateR1(tensorflow::gtl::ArraySlice<NativeT> values);
+  void PopulateR1(const tensorflow::core::Bitmap& values);
+  template <typename NativeT>
+  void PopulateR2(std::initializer_list<std::initializer_list<NativeT>> values);
+  template <typename NativeT>
+  void PopulateFromArray(const Array<NativeT>& values);
+  template <typename NativeT>
+  void PopulateR2FromArray2D(const Array2D<NativeT>& values);
+  template <typename NativeT>
+  void PopulateR3FromArray3D(const Array3D<NativeT>& values);
+  template <typename NativeT>
+  void PopulateR4FromArray4D(const Array4D<NativeT>& values);
+
+  // Populates literal values by calling the generator function for every cell
+  // in this literal object.
+  //
+  // generator must be a callable of the type
+  // NativeT(tensorflow::gtl::ArraySlice<int64> indexes) or compatible.
+  //
+  // This literal must have a dense layout.
+  template <typename NativeT, typename FnType>
+  Status Populate(const FnType& generator);
+
+  // A parallel version of Populate(). This can be used if the generator is
+  // thread-safe and the values for the shape's different elements are
+  // independent.
+  template <typename NativeT, typename FnType>
+  Status PopulateParallel(const FnType& generator);
+
+  // Fills this literal with the given value.
+  template <typename NativeT>
+  void PopulateWithValue(NativeT value);
+
+  // This operation is the inverse of DecomposeTuple. The given elements are
+  // moved into the tuple elements of a new tuple-shaped Literal which is
+  // returned. Upon return, each of the Literals in 'elements' is set to a nil
+  // shape (empty tuple).
+  static Literal MoveIntoTuple(
+      tensorflow::gtl::MutableArraySlice<Literal> elements);
+
+  // Serialize from a proto.
+  static StatusOr<std::unique_ptr<Literal>> CreateFromProto(
+      const LiteralProto& proto);
+
+ private:
+  // Recursively sets the subshapes and buffers of all subpieces rooted at
+  // 'piece'. If 'allocate_array' is true, memory is allocated for the arrays in
+  // the shape.
+  void SetPiece(const Shape& shape, Piece* piece, bool allocate_arrays);
+
+  // Returns the piece at the given ShapeIndex.
+  Piece& piece(const ShapeIndex& shape_index) {
+    return const_cast<Piece&>(LiteralBase::piece(shape_index));
+  }
+
+  Piece& root_piece() const override { return *root_piece_; };
+
+  // Internal template helper for the Literal::CopySliceFrom(), matching its
+  // arguments one by one.
+  template <typename NativeT>
+  Status CopySliceFromInternal(const LiteralBase& src_literal,
+                               tensorflow::gtl::ArraySlice<int64> src_base,
+                               tensorflow::gtl::ArraySlice<int64> dest_base,
+                               tensorflow::gtl::ArraySlice<int64> copy_size);
+
+  // Utility structure which is used to create the optimal configuration for
+  // a ShapeUtil::ForEachIndex() scan across two literals.
+  struct StrideConfig {
+    StrideConfig(const Shape& source_shape, const Shape& dest_shape,
+                 tensorflow::gtl::ArraySlice<int64> dimensions);
+
+    // The dimensions of the stride operation. Essentially every dimension
+    // will be iterated from base[i] to base[i]+dimensions[i], in step[i]
+    // steps.
+    tensorflow::gtl::ArraySlice<int64> dimensions;
+    DimensionVector base;
+    DimensionVector step;
+    int64 minor_dimension = 0;
+    // The size of the strides for source and destination. One of the two
+    // (the one looping through its most minor dimension) will be 1, while
+    // the other will be the stride size at the dimension matching the other
+    // shape most minor dimension being scanned.
+    int64 dest_stride = 1;
+    int64 source_stride = 1;
+    // The size of the inner loop on the most minor dimension.
+    int64 minor_loop_size = 1;
+  };
+
+  // Literal class always owns the shape. The parent class borrows this shape.
+  std::unique_ptr<Shape> shape_;
+
+  Piece* root_piece_ = nullptr;
+
+  // Implementation details shared between Populate() and PopulateParallel()
+  template <typename NativeT, typename FnType>
+  Status PopulateInternal(const FnType& generator, bool parallel);
+
+  // Deallocate the buffers held by this literal.
+  void DeallocateBuffers();
+
+  friend class LiteralBase;
+};
+std::ostream& operator<<(std::ostream& out, const Literal& literal);
+
+// A read-only view of a Literal. A LiteralSlice contains pointers to shape and
+// literal buffers always owned by others.
+class LiteralSlice : public LiteralBase {
+ public:
+  LiteralSlice() : LiteralBase() {}
+
+  // Implicit conversion constructors.
+  LiteralSlice(const LiteralBase& literal);
+  LiteralSlice(const LiteralBase& literal, const ShapeIndex& view_root);
+
+ private:
+  const Piece& root_piece() const override { return *root_piece_; };
+
+  const Piece* root_piece_;  // Not owned.
+};
+
+// A read-only Literal where the underlying buffers are never owned by this
+// class.
+class BorrowingLiteral : public LiteralBase {
+ public:
+  BorrowingLiteral() : LiteralBase() {}
+
+  // 'src_buf_ptr' is not owned by this class and must outlive the
+  // lifetime of this class. It points to an appropirately sized buffer with
+  // data interpretered as indicated by 'shape'.
+  // This constructor is only used for array shapes.
+  BorrowingLiteral(const char* src_buf_ptr, const Shape& shape);
+  // Similar as above, except to be used for constructing non-nested tuples.
+  BorrowingLiteral(tensorflow::gtl::ArraySlice<const char*> src_buf_ptrs,
+                   const Shape& shape);
+  // TODO(b/79707221): adding constructors for nested tuples as well.
+
+ private:
+  // Recursively builds the subtree for the given piece and sets the subshapes
+  // of the given piece with the given shape.
+  void BuildPieceSubtree(const Shape& shape, Piece* piece);
+
+  // Accessor for the root piece of this literal.
+  const Piece& root_piece() const override { return root_piece_; };
+  Piece root_piece_;
+
+  // Shape of this literal. Stored as unique_ptr so such that the (default)
+  // move construction of this class would be trivially correct: the pointer to
+  // Shape root_piece_ stores will still point to the correct address.
+  std::unique_ptr<Shape> shape_;
+};
+
+template <typename NativeT>
+tensorflow::gtl::ArraySlice<NativeT> LiteralBase::Piece::data() const {
+  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
+  CHECK_EQ(subshape().element_type(),
+           primitive_util::NativeToPrimitiveType<NativeT>())
+      << "Attempting to access "
+      << PrimitiveType_Name(primitive_util::NativeToPrimitiveType<NativeT>())
+      << " type, but literal element type is "
+      << PrimitiveType_Name(subshape().element_type());
+  return tensorflow::gtl::ArraySlice<NativeT>(
+      reinterpret_cast<const NativeT*>(buffer()), element_count());
+}
+
+template <typename NativeT>
+tensorflow::gtl::MutableArraySlice<NativeT> LiteralBase::Piece::data() {
+  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
+  CHECK_EQ(subshape().element_type(),
+           primitive_util::NativeToPrimitiveType<NativeT>())
+      << "Attempting to access "
+      << PrimitiveType_Name(primitive_util::NativeToPrimitiveType<NativeT>())
+      << " type, but literal element type is "
+      << PrimitiveType_Name(subshape().element_type());
+  return tensorflow::gtl::MutableArraySlice<NativeT>(
+      reinterpret_cast<NativeT*>(buffer()), element_count());
+}
+
+template <typename NativeT>
+NativeT LiteralBase::Piece::Get(
+    tensorflow::gtl::ArraySlice<int64> multi_index) const {
+  CHECK(LayoutUtil::IsDenseArray(subshape()));
+  return data<NativeT>()[IndexUtil::MultidimensionalIndexToLinearIndex(
+      subshape(), multi_index)];
+}
+
+template <typename NativeT>
+void LiteralBase::Piece::Set(tensorflow::gtl::ArraySlice<int64> multi_index,
+                             NativeT value) {
+  CHECK(LayoutUtil::IsDenseArray(subshape()));
+  data<NativeT>()[IndexUtil::MultidimensionalIndexToLinearIndex(
+      subshape(), multi_index)] = value;
+}
+
+template <typename NativeT>
+tensorflow::gtl::ArraySlice<NativeT> LiteralBase::data(
+    const ShapeIndex& shape_index) const {
+  return piece(shape_index).data<NativeT>();
+}
+
+template <typename NativeT>
+tensorflow::gtl::MutableArraySlice<NativeT> Literal::data(
+    const ShapeIndex& shape_index) {
+  return piece(shape_index).data<NativeT>();
+}
+
+template <typename NativeT>
+inline NativeT LiteralBase::Get(tensorflow::gtl::ArraySlice<int64> multi_index,
+                                const ShapeIndex& shape_index) const {
+  return piece(shape_index).Get<NativeT>(multi_index);
+}
+
+template <typename NativeT>
+inline NativeT LiteralBase::Get(
+    tensorflow::gtl::ArraySlice<int64> multi_index) const {
+  return root_piece().Get<NativeT>(multi_index);
+}
+
+template <typename NativeT>
+inline void Literal::Set(tensorflow::gtl::ArraySlice<int64> multi_index,
+                         const ShapeIndex& shape_index, NativeT value) {
+  return piece(shape_index).Set<NativeT>(multi_index, value);
+}
+
+template <typename NativeT>
+inline void Literal::Set(tensorflow::gtl::ArraySlice<int64> multi_index,
+                         NativeT value) {
+  return root_piece().Set<NativeT>(multi_index, value);
+}
+
+template <typename NativeT>
+NativeT LiteralBase::GetFirstElement() const {
+  return data<NativeT>().at(0);
+}
+
+template <typename NativeT>
+NativeT LiteralBase::GetSparseElement(int64 sparse_element_number,
+                                      const ShapeIndex& shape_index) const {
+  CHECK(
+      LayoutUtil::IsSparseArray(ShapeUtil::GetSubshape(shape(), shape_index)));
+  return data<NativeT>(shape_index)[sparse_element_number];
+}
+
+template <typename NativeT>
+void Literal::AppendSparseElement(
+    tensorflow::gtl::ArraySlice<int64> multi_index, NativeT value,
+    const ShapeIndex& shape_index) {
+  Piece& p = piece(shape_index);
+  const Shape& subshape = p.subshape();
+  CHECK(LayoutUtil::IsSparseArray(subshape));
+  int64 rank = ShapeUtil::Rank(subshape);
+  CHECK_EQ(multi_index.size(), rank);
+  int64 last_element = p.sparse_indices()->index_count();
+  CHECK_LT(last_element, LayoutUtil::MaxSparseElements(subshape.layout()));
+  p.sparse_indices()->Append(multi_index);
+  CHECK_LT(last_element, p.data<NativeT>().size());
+  p.data<NativeT>()[last_element] = value;
+}
+
+template <typename NativeT>
+void LiteralBase::EachCell(
+    std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
+                       NativeT value)>
+        per_cell) const {
+  if (ShapeUtil::IsZeroElementArray(shape())) {
+    return;
+  }
+  std::vector<int64> indices(ShapeUtil::Rank(shape()), 0);
+  do {
+    per_cell(indices, Get<NativeT>(indices));
+  } while (IndexUtil::BumpIndices(shape(), &indices));
+}
+
+template <typename NativeT>
+inline void Literal::PopulateR1(tensorflow::gtl::ArraySlice<NativeT> values) {
+  CHECK(ShapeUtil::IsArray(shape()));
+  CHECK_EQ(ShapeUtil::Rank(shape()), 1);
+  CHECK_EQ(ShapeUtil::ElementsIn(shape()), values.size());
+  CHECK_EQ(shape().element_type(),
+           primitive_util::NativeToPrimitiveType<NativeT>());
+  for (int64 i = 0; i < values.size(); ++i) {
+    Set({i}, values[i]);
+  }
+}
+
+template <typename NativeT>
+void Literal::PopulateR2(
+    std::initializer_list<std::initializer_list<NativeT>> values) {
+  CHECK(ShapeUtil::IsArray(shape()));
+  CHECK_EQ(ShapeUtil::Rank(shape()), 2);
+  CHECK_EQ(shape().element_type(),
+           primitive_util::NativeToPrimitiveType<NativeT>());
+
+  const int64 dim0_size = values.size();
+  const int64 dim1_size = values.begin()->size();
+  CHECK_EQ(dim0_size, shape().dimensions(0));
+  CHECK_EQ(dim1_size, shape().dimensions(1));
+
+  int64 dim0 = 0;
+  for (auto inner_list : values) {
+    int64 dim1 = 0;
+    for (auto value : inner_list) {
+      Set({dim0, dim1}, value);
+      ++dim1;
+    }
+    CHECK_EQ(dim1_size, dim1);
+    ++dim0;
+  }
+}
+
+template <typename NativeT>
+void Literal::PopulateFromArray(const Array<NativeT>& values) {
+  CHECK(ShapeUtil::IsArray(shape()));
+  CHECK_EQ(shape().element_type(),
+           primitive_util::NativeToPrimitiveType<NativeT>());
+  CHECK_EQ(ShapeUtil::Rank(shape()), values.num_dimensions());
+  for (int dim = 0; dim < values.num_dimensions(); ++dim) {
+    CHECK_EQ(values.dim(dim), shape().dimensions(dim));
+  }
+  values.Each([this](tensorflow::gtl::ArraySlice<int64> indices,
+                     NativeT value) { this->Set(indices, value); });
+}
+
+template <typename NativeT>
+void Literal::PopulateR2FromArray2D(const Array2D<NativeT>& values) {
+  PopulateFromArray(values);
+}
+
+template <typename NativeT>
+void Literal::PopulateR3FromArray3D(const Array3D<NativeT>& values) {
+  PopulateFromArray(values);
+}
+
+template <typename NativeT>
+void Literal::PopulateR4FromArray4D(const Array4D<NativeT>& values) {
+  PopulateFromArray(values);
+}
+
+template <typename NativeT>
+void Literal::PopulateSparse(SparseIndexArray indices,
+                             tensorflow::gtl::ArraySlice<NativeT> values,
+                             bool sort) {
+  CHECK(LayoutUtil::IsSparseArray(shape()));
+  int rank = ShapeUtil::Rank(shape());
+  CHECK_EQ(indices.rank(), rank);
+  int64 max_elements = LayoutUtil::MaxSparseElements(shape().layout());
+  CHECK_LE(indices.max_indices(), max_elements);
+  int64 num_elements = values.size();
+  CHECK_LE(num_elements, max_elements);
+  CHECK_EQ(num_elements, indices.index_count());
+  auto root_data = root_piece().data<NativeT>();
+  // Piece::data() returns an ArraySlice of size equal to the number of indices
+  // in the SparseIndexArray. So there is no need to adjust the size of the data
+  // here. It is enough to just copy the incoming values into the data buffer.
+  std::copy(values.begin(), values.end(), root_data.begin());
+  *this->root_piece().sparse_indices() = std::move(indices);
+  if (sort) {
+    auto root_data = this->root_piece().data<NativeT>();
+    this->root_piece().sparse_indices()->SortWithValues(root_data);
+  }
+  DCHECK(this->root_piece().sparse_indices()->Validate(shape()));
+}
+
+template <typename NativeT, typename FnType>
+Status Literal::PopulateInternal(const FnType& generator, bool parallel) {
+  const Shape& this_shape = shape();
+  const int64 rank = ShapeUtil::Rank(this_shape);
+  TF_RET_CHECK(LayoutUtil::IsDenseArray(this_shape));
+  TF_RET_CHECK(this_shape.element_type() ==
+               primitive_util::NativeToPrimitiveType<NativeT>());
+  tensorflow::gtl::MutableArraySlice<NativeT> literal_data = data<NativeT>();
+  if (rank > 0) {
+    StrideConfig stride_config(this_shape, this_shape,
+                               AsInt64Slice(this_shape.dimensions()));
+    int64 minor_dimension_size =
+        ShapeUtil::GetDimension(this_shape, stride_config.minor_dimension);
+
+    auto init_function = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
+      DimensionVector minor_scan_indexes(rank, 0);
+      const int64 index =
+          IndexUtil::MultidimensionalIndexToLinearIndex(shape(), indexes);
+      std::copy(indexes.begin(), indexes.end(), minor_scan_indexes.begin());
+      for (int64 i = 0; i < minor_dimension_size; ++i) {
+        minor_scan_indexes[stride_config.minor_dimension] = i;
+        literal_data.at(index + i) = generator(minor_scan_indexes);
+      }
+    };
+    if (parallel) {
+      ShapeUtil::ForEachIndexParallel(this_shape, stride_config.base,
+                                      stride_config.dimensions,
+                                      stride_config.step, init_function);
+    } else {
+      ShapeUtil::ForEachIndex(
+          this_shape, stride_config.base, stride_config.dimensions,
+          stride_config.step,
+          [&init_function](tensorflow::gtl::ArraySlice<int64> indexes) {
+            init_function(indexes);
+            return true;
+          });
+    }
+  } else {
+    // For scalars.
+    literal_data.at(0) = generator({});
+  }
+  return Status::OK();
+}
+template <typename NativeT, typename FnType>
+Status Literal::Populate(const FnType& generator) {
+  return PopulateInternal<NativeT>(generator, /*parallel=*/false);
+}
+
+template <typename NativeT, typename FnType>
+Status Literal::PopulateParallel(const FnType& generator) {
+  return PopulateInternal<NativeT>(generator, /*parallel=*/true);
+}
+
+template <typename NativeT>
+void Literal::PopulateWithValue(NativeT value) {
+  CHECK(ShapeUtil::IsArray(shape()));
+  CHECK_EQ(shape().element_type(),
+           primitive_util::NativeToPrimitiveType<NativeT>());
+  for (NativeT& element : data<NativeT>()) {
+    element = value;
+  }
+}
+
+template <typename NativeT>
+std::unique_ptr<Literal> LiteralBase::Replicate(int64 times) const {
+  DimensionVector bounds = {times};
+  bounds.reserve(shape().dimensions_size() + 1);
+  for (int64 bound : shape().dimensions()) {
+    bounds.push_back(bound);
+  }
+  auto literal =
+      MakeUnique<Literal>(ShapeUtil::MakeShape(shape().element_type(), bounds));
+  int64 elements = ShapeUtil::ElementsIn(literal->shape());
+  if (elements == 0) {
+    return literal;
+  }
+
+  DimensionVector output_indices(bounds.size(), 0);
+  tensorflow::gtl::ArraySlice<int64> input_indices = output_indices;
+  input_indices.remove_prefix(1);
+
+  bool done = false;
+  while (!done) {
+    const auto element = Get<NativeT>(input_indices);
+    literal->Set<NativeT>(output_indices, element);
+
+    done = true;
+    for (int n = 0; n < output_indices.size(); ++n) {
+      ++output_indices[n];
+      if (output_indices[n] < bounds[n]) {
+        done = false;
+        break;
+      }
+      output_indices[n] = 0;
+    }
+  }
+  return literal;
+}
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_LITERAL_H_
diff --git a/tensorflow/compiler/xla/literal_comparison.cc b/tensorflow/compiler/xla/literal_comparison.cc
index 2125ab7c61..94993cc874 100644
--- a/tensorflow/compiler/xla/literal_comparison.cc
+++ b/tensorflow/compiler/xla/literal_comparison.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include <cmath>
 #include <vector>
 
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/lib/core/casts.h"
 #include "tensorflow/core/lib/strings/strcat.h"
@@ -217,7 +218,7 @@ class NearComparator {
       return Printf(
           "actual %s, expected %s, index %s, rel error %8.3g, abs error %8.3g",
           FpValueToString(actual).c_str(), FpValueToString(expected).c_str(),
-          Literal::MultiIndexAsString(
+          LiteralUtil::MultiIndexAsString(
               IndexUtil::LinearIndexToMultidimensionalIndex(shape,
                                                             linear_index))
               .c_str(),
@@ -722,7 +723,7 @@ Status Equal(const LiteralSlice& expected, const LiteralSlice& actual) {
   return AppendStatus(result,
                       tensorflow::strings::Printf(
                           "\nat index: %s\nexpected: %s\nactual:   %s",
-                          Literal::MultiIndexAsString(multi_index).c_str(),
+                          LiteralUtil::MultiIndexAsString(multi_index).c_str(),
                           ToStringTruncated(expected).c_str(),
                           ToStringTruncated(actual).c_str()));
 }
diff --git a/tensorflow/compiler/xla/literal_comparison.h b/tensorflow/compiler/xla/literal_comparison.h
index 00a13e3619..9e5bf7c1d0 100644
--- a/tensorflow/compiler/xla/literal_comparison.h
+++ b/tensorflow/compiler/xla/literal_comparison.h
@@ -20,7 +20,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_XLA_LITERAL_COMPARISON_H_
 
 #include "tensorflow/compiler/xla/error_spec.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/core/lib/core/status.h"
 
 namespace xla {
diff --git a/tensorflow/compiler/xla/literal_util_test.cc b/tensorflow/compiler/xla/literal_test.cc
index 493d807591..e8f919950f 100644
--- a/tensorflow/compiler/xla/literal_util_test.cc
+++ b/tensorflow/compiler/xla/literal_test.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 
 #include <vector>
 
@@ -21,6 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -76,11 +77,11 @@ class LiteralUtilTest : public ::testing::Test {
     layout_r4_dim0minor_ = LayoutUtil::MakeLayout({0, 1, 2, 3});
 
     literal_r4_2x2x3x3_dim0major_ =
-        Literal::CreateR4FromArray4DWithLayout<float>(arr4d,
-                                                      layout_r4_dim0major_);
+        LiteralUtil::CreateR4FromArray4DWithLayout<float>(arr4d,
+                                                          layout_r4_dim0major_);
     literal_r4_2x2x3x3_dim0minor_ =
-        Literal::CreateR4FromArray4DWithLayout<float>(arr4d,
-                                                      layout_r4_dim0minor_);
+        LiteralUtil::CreateR4FromArray4DWithLayout<float>(arr4d,
+                                                          layout_r4_dim0minor_);
   }
 
   Layout layout_r2_dim0major_;
@@ -94,47 +95,47 @@ class LiteralUtilTest : public ::testing::Test {
 };
 
 TEST_F(LiteralUtilTest, LiteralScalarToString) {
-  auto true_lit = Literal::CreateR0<bool>(true);
+  auto true_lit = LiteralUtil::CreateR0<bool>(true);
   ASSERT_EQ("true", true_lit->ToString());
 
-  auto false_lit = Literal::CreateR0<bool>(false);
+  auto false_lit = LiteralUtil::CreateR0<bool>(false);
   ASSERT_EQ("false", false_lit->ToString());
 
-  auto u32_lit = Literal::CreateR0<uint32>(42);
+  auto u32_lit = LiteralUtil::CreateR0<uint32>(42);
   ASSERT_EQ("42", u32_lit->ToString());
 
-  auto s32_lit = Literal::CreateR0<int32>(-999);
+  auto s32_lit = LiteralUtil::CreateR0<int32>(-999);
   ASSERT_EQ("-999", s32_lit->ToString());
 
-  auto f32_lit = Literal::CreateR0<float>(3.14f);
+  auto f32_lit = LiteralUtil::CreateR0<float>(3.14f);
   ASSERT_EQ("3.14", f32_lit->ToString());
 
-  auto f16_lit = Literal::CreateR0<half>(static_cast<half>(0.5f));
+  auto f16_lit = LiteralUtil::CreateR0<half>(static_cast<half>(0.5f));
   ASSERT_EQ("0.5", f16_lit->ToString());
 
-  auto c64_lit = Literal::CreateR0<complex64>({3.14f, 2.78f});
+  auto c64_lit = LiteralUtil::CreateR0<complex64>({3.14f, 2.78f});
   ASSERT_EQ("(3.14, 2.78)", c64_lit->ToString());
 
-  auto bf16_lit = Literal::CreateR0<bfloat16>(static_cast<bfloat16>(0.5f));
+  auto bf16_lit = LiteralUtil::CreateR0<bfloat16>(static_cast<bfloat16>(0.5f));
   ASSERT_EQ("0.5", bf16_lit->ToString());
 
   // 3.14 will be truncated to 3.125 in bfloat16 format.
   auto bf16_lit_truncated =
-      Literal::CreateR0<bfloat16>(static_cast<bfloat16>(3.14f));
+      LiteralUtil::CreateR0<bfloat16>(static_cast<bfloat16>(3.14f));
   ASSERT_EQ("3.125", bf16_lit_truncated->ToString());
 
   auto bf16_lit_truncated2 =
-      Literal::CreateR0<bfloat16>(static_cast<bfloat16>(9.001f));
+      LiteralUtil::CreateR0<bfloat16>(static_cast<bfloat16>(9.001f));
   ASSERT_EQ("9", bf16_lit_truncated2->ToString());
 }
 
 TEST_F(LiteralUtilTest, LiteralVectorToString) {
-  auto pred_vec = Literal::CreateR1<bool>({true, false, true});
+  auto pred_vec = LiteralUtil::CreateR1<bool>({true, false, true});
   ASSERT_EQ("{101}", pred_vec->ToString());
 }
 
 TEST_F(LiteralUtilTest, R2ToString) {
-  const auto literal = Literal::CreateR2({{1, 2}, {3, 4}, {5, 6}});
+  const auto literal = LiteralUtil::CreateR2({{1, 2}, {3, 4}, {5, 6}});
   const string expected = R"(s32[3,2] {
   { 1, 2 },
   { 3, 4 },
@@ -144,7 +145,8 @@ TEST_F(LiteralUtilTest, R2ToString) {
 }
 
 TEST_F(LiteralUtilTest, R3ToString) {
-  const auto literal = Literal::CreateR3({{{1}, {2}}, {{3}, {4}}, {{5}, {6}}});
+  const auto literal =
+      LiteralUtil::CreateR3({{{1}, {2}}, {{3}, {4}}, {{5}, {6}}});
   const string expected = R"(s32[3,2,1] {
 { { 1 },
   { 2 } },
@@ -157,9 +159,9 @@ TEST_F(LiteralUtilTest, R3ToString) {
 }
 
 TEST_F(LiteralUtilTest, TupleToString) {
-  auto scalar = Literal::CreateR0<float>(1.0);
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto tuple = Literal::MakeTuple({scalar.get(), matrix.get()});
+  auto scalar = LiteralUtil::CreateR0<float>(1.0);
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto tuple = LiteralUtil::MakeTuple({scalar.get(), matrix.get()});
   const string expected = R"((f32[], f32[2,2]) (
 1,
 f32[2,2] {
@@ -182,7 +184,7 @@ TEST_F(LiteralUtilTest, CreateR3FromArray3d) {
   });
   // clang-format on
 
-  auto literal = Literal::CreateR3FromArray3D(array_3d);
+  auto literal = LiteralUtil::CreateR3FromArray3D(array_3d);
   EXPECT_THAT(literal->shape().dimensions(), ElementsAre(2, 3, 2));
   string result = literal->ToString();
   const string expected = R"(f32[2,3,2] {
@@ -205,7 +207,7 @@ TEST_F(LiteralUtilTest, CreateSparse) {
       {3, 5, 6},
   };
   std::vector<int64> values = {7, 8, 9, 10};
-  auto literal = Literal::CreateSparse<int64>(
+  auto literal = LiteralUtil::CreateSparse<int64>(
       dimensions, SparseIndexArray(indices.n1() + 3, indices), values);
 
   Array2D<int64> expected_indices = {
@@ -224,7 +226,7 @@ TEST_F(LiteralUtilTest, CreateSparse) {
 
 TEST_F(LiteralUtilTest, LiteralR4F32ProjectedStringifies) {
   // clang-format off
-  auto literal = Literal::CreateR4Projected<float>({
+  auto literal = LiteralUtil::CreateR4Projected<float>({
     {1, 2},
     {1001, 1002},
     {2001, 2002},
@@ -284,7 +286,7 @@ TEST_F(LiteralUtilTest, LiteralR4F32Stringifies) {
 
 TEST_F(LiteralUtilTest, EachCellR2F32) {
   // clang-format off
-  auto literal = Literal::CreateR2<float>({
+  auto literal = LiteralUtil::CreateR2<float>({
     {3.1f, 4.2f},
     {9.3f, 12.4f},
   });
@@ -303,26 +305,27 @@ TEST_F(LiteralUtilTest, EachCellR2F32) {
 
 TEST_F(LiteralUtilTest, ScalarEquality) {
   // Test equality with scalars.
-  auto f32_42 = Literal::CreateR0<float>(42.0);
-  auto f32_42_clone = Literal::CreateR0<float>(42.0);
+  auto f32_42 = LiteralUtil::CreateR0<float>(42.0);
+  auto f32_42_clone = LiteralUtil::CreateR0<float>(42.0);
 
   EXPECT_EQ(*f32_42, *f32_42);
   EXPECT_EQ(*f32_42, *f32_42_clone);
 
-  auto f32_123 = Literal::CreateR0<float>(123.0);
+  auto f32_123 = LiteralUtil::CreateR0<float>(123.0);
   EXPECT_NE(*f32_42, *f32_123);
 
-  auto f64_42 = Literal::CreateR0<double>(42.0);
+  auto f64_42 = LiteralUtil::CreateR0<double>(42.0);
   EXPECT_NE(*f32_42, *f64_42);
 }
 
 TEST_F(LiteralUtilTest, NonScalarEquality) {
   // Test equality with nonscalars.
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto matrix_clone = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto matrix_different = Literal::CreateR2<float>({{4.0, 3.0}, {1.0, 2.0}});
-  auto vector_literal = Literal::CreateR1<float>({1.0, 2.0, 3.0, 4.0});
-  auto scalar = Literal::CreateR0<float>(1.0);
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto matrix_clone = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto matrix_different =
+      LiteralUtil::CreateR2<float>({{4.0, 3.0}, {1.0, 2.0}});
+  auto vector_literal = LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0, 4.0});
+  auto scalar = LiteralUtil::CreateR0<float>(1.0);
   Literal nil(ShapeUtil::MakeNil());
 
   EXPECT_EQ(*matrix, *matrix);
@@ -335,19 +338,19 @@ TEST_F(LiteralUtilTest, NonScalarEquality) {
 }
 
 TEST_F(LiteralUtilTest, TokenEquality) {
-  auto token0 = Literal::CreateToken();
-  auto token1 = Literal::CreateToken();
-  auto scalar = Literal::CreateR0<float>(1.0);
+  auto token0 = LiteralUtil::CreateToken();
+  auto token1 = LiteralUtil::CreateToken();
+  auto scalar = LiteralUtil::CreateR0<float>(1.0);
 
   EXPECT_EQ(*token0, *token1);
   EXPECT_NE(*token0, *scalar);
 
-  EXPECT_EQ(*Literal::MakeTuple({token0.get()}),
-            *Literal::MakeTuple({token0.get()}));
-  EXPECT_EQ(*Literal::MakeTuple({token0.get(), scalar.get()}),
-            *Literal::MakeTuple({token1.get(), scalar.get()}));
-  EXPECT_NE(*Literal::MakeTuple({token0.get(), scalar.get()}),
-            *Literal::MakeTuple({scalar.get(), token1.get()}));
+  EXPECT_EQ(*LiteralUtil::MakeTuple({token0.get()}),
+            *LiteralUtil::MakeTuple({token0.get()}));
+  EXPECT_EQ(*LiteralUtil::MakeTuple({token0.get(), scalar.get()}),
+            *LiteralUtil::MakeTuple({token1.get(), scalar.get()}));
+  EXPECT_NE(*LiteralUtil::MakeTuple({token0.get(), scalar.get()}),
+            *LiteralUtil::MakeTuple({scalar.get(), token1.get()}));
 }
 
 TEST_F(LiteralUtilTest, DifferentLayoutEquality) {
@@ -371,43 +374,46 @@ TEST_F(LiteralUtilTest, DifferentLayoutEquality) {
 
 TEST_F(LiteralUtilTest, TupleEquality) {
   // Test equality with tuples.
-  auto scalar = Literal::CreateR0<float>(1.0);
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto tuple1 = Literal::MakeTuple({scalar.get(), matrix.get()});
+  auto scalar = LiteralUtil::CreateR0<float>(1.0);
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto tuple1 = LiteralUtil::MakeTuple({scalar.get(), matrix.get()});
 
   // Tuple with the same elements. One element is shared with the original
   // tuple, the other is a clone of the element in the original tuple.
-  auto scalar_clone = Literal::CreateR0<float>(1.0);
-  auto tuple2 = Literal::MakeTuple({scalar_clone.get(), matrix.get()});
+  auto scalar_clone = LiteralUtil::CreateR0<float>(1.0);
+  auto tuple2 = LiteralUtil::MakeTuple({scalar_clone.get(), matrix.get()});
   EXPECT_EQ(*tuple1, *tuple2);
 
   // Tuple with elements reversed.
-  auto reversed_tuple = Literal::MakeTuple({matrix.get(), scalar.get()});
+  auto reversed_tuple = LiteralUtil::MakeTuple({matrix.get(), scalar.get()});
   EXPECT_NE(*tuple1, *reversed_tuple);
 
   // Tuple with different value.
-  auto scalar_42 = Literal::CreateR0<float>(42.0);
-  auto different_tuple = Literal::MakeTuple({scalar_42.get(), matrix.get()});
+  auto scalar_42 = LiteralUtil::CreateR0<float>(42.0);
+  auto different_tuple =
+      LiteralUtil::MakeTuple({scalar_42.get(), matrix.get()});
   EXPECT_NE(*tuple1, *different_tuple);
 }
 
 TEST_F(LiteralUtilTest, C64Equality) {
   // Test equality with tuples.
-  auto vector = Literal::CreateR1<complex64>({{1.0, 2.0}, {3.0, 4.0}});
+  auto vector = LiteralUtil::CreateR1<complex64>({{1.0, 2.0}, {3.0, 4.0}});
 
   // Tuple with the same elements. One element is shared with the original
   // tuple, the other is a clone of the element in the original tuple.
-  auto vector_clone = Literal::CreateR1<complex64>({{1.0, 2.0}, {3.0, 4.0}});
+  auto vector_clone =
+      LiteralUtil::CreateR1<complex64>({{1.0, 2.0}, {3.0, 4.0}});
   EXPECT_EQ(*vector, *vector_clone);
 
-  auto vector_reversed = Literal::CreateR1<complex64>({{3.0, 4.0}, {1.0, 2.0}});
+  auto vector_reversed =
+      LiteralUtil::CreateR1<complex64>({{3.0, 4.0}, {1.0, 2.0}});
   EXPECT_NE(*vector, *vector_reversed);
 }
 
 TEST_F(LiteralUtilTest, IsAllTuple) {
-  auto element1 = Literal::CreateR0<float>(0.0);
-  auto element2 = Literal::CreateR2<float>({{0.0, 0.0}, {0.0, 0.0}});
-  auto tuple = Literal::MakeTuple({element1.get(), element1.get()});
+  auto element1 = LiteralUtil::CreateR0<float>(0.0);
+  auto element2 = LiteralUtil::CreateR2<float>({{0.0, 0.0}, {0.0, 0.0}});
+  auto tuple = LiteralUtil::MakeTuple({element1.get(), element1.get()});
 
   // Tuples should always return false for IsAll.
   EXPECT_FALSE(tuple->IsAll(0));
@@ -416,140 +422,141 @@ TEST_F(LiteralUtilTest, IsAllTuple) {
 
 // Verifies that CreateFromShape works for tuples.
 TEST_F(LiteralUtilTest, CreateFromShapeTuple) {
-  auto scalar = Literal::CreateR0<float>(0.0);
-  auto matrix = Literal::CreateR2<int32>({{0, 0}, {0, 0}});
-  auto tuple = Literal::MakeTuple({scalar.get(), matrix.get()});
+  auto scalar = LiteralUtil::CreateR0<float>(0.0);
+  auto matrix = LiteralUtil::CreateR2<int32>({{0, 0}, {0, 0}});
+  auto tuple = LiteralUtil::MakeTuple({scalar.get(), matrix.get()});
 
   auto x = Literal::CreateFromShape(tuple->shape());
   EXPECT_EQ(*tuple, *x);
 }
 
 TEST_F(LiteralUtilTest, IsAll) {
-  EXPECT_TRUE(Literal::CreateR0<bool>(false)->IsAll(0));
-  EXPECT_TRUE(Literal::CreateR0<bool>(true)->IsAll(1));
-  EXPECT_FALSE(Literal::CreateR0<bool>(false)->IsAll(1));
-  EXPECT_FALSE(Literal::CreateR0<bool>(false)->IsAll(2));
-  EXPECT_FALSE(Literal::CreateR0<bool>(true)->IsAll(0));
-  EXPECT_FALSE(Literal::CreateR0<bool>(true)->IsAll(2));
-  EXPECT_FALSE(Literal::CreateR0<bool>(true)->IsAll(-1));
+  EXPECT_TRUE(LiteralUtil::CreateR0<bool>(false)->IsAll(0));
+  EXPECT_TRUE(LiteralUtil::CreateR0<bool>(true)->IsAll(1));
+  EXPECT_FALSE(LiteralUtil::CreateR0<bool>(false)->IsAll(1));
+  EXPECT_FALSE(LiteralUtil::CreateR0<bool>(false)->IsAll(2));
+  EXPECT_FALSE(LiteralUtil::CreateR0<bool>(true)->IsAll(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<bool>(true)->IsAll(2));
+  EXPECT_FALSE(LiteralUtil::CreateR0<bool>(true)->IsAll(-1));
 
   // We shouldn't reinterpret int8_min as an unsigned type and then decide that
   // it is equal to 255.
   auto int8_min = std::numeric_limits<int8>::min();
-  EXPECT_FALSE(Literal::CreateR0<uint8>(255)->IsAll(int8_min));
+  EXPECT_FALSE(LiteralUtil::CreateR0<uint8>(255)->IsAll(int8_min));
 
-  EXPECT_TRUE(Literal::CreateR0<float>(42.0)->IsAll(42));
-  EXPECT_FALSE(Literal::CreateR0<float>(42.0001)->IsAll(42));
+  EXPECT_TRUE(LiteralUtil::CreateR0<float>(42.0)->IsAll(42));
+  EXPECT_FALSE(LiteralUtil::CreateR0<float>(42.0001)->IsAll(42));
 
-  EXPECT_TRUE(Literal::CreateR1<int>({100, 100, 100})->IsAll(100));
-  EXPECT_FALSE(Literal::CreateR1<double>({100, 100, 100.001})->IsAll(100));
+  EXPECT_TRUE(LiteralUtil::CreateR1<int>({100, 100, 100})->IsAll(100));
+  EXPECT_FALSE(LiteralUtil::CreateR1<double>({100, 100, 100.001})->IsAll(100));
 
-  EXPECT_TRUE(Literal::CreateR2<uint64>({{8, 8}, {8, 8}})->IsAll(8));
-  EXPECT_FALSE(Literal::CreateR2<uint64>({{8, 8}, {8, 9}})->IsAll(8));
-  EXPECT_FALSE(Literal::CreateR2<uint64>({{9, 8}, {8, 8}})->IsAll(8));
+  EXPECT_TRUE(LiteralUtil::CreateR2<uint64>({{8, 8}, {8, 8}})->IsAll(8));
+  EXPECT_FALSE(LiteralUtil::CreateR2<uint64>({{8, 8}, {8, 9}})->IsAll(8));
+  EXPECT_FALSE(LiteralUtil::CreateR2<uint64>({{9, 8}, {8, 8}})->IsAll(8));
 
   half h8(8.0f);
   half h9(9.0f);
-  EXPECT_TRUE(Literal::CreateR2<half>({{h8}, {h8}})->IsAll(8));
-  EXPECT_FALSE(Literal::CreateR2<half>({{h8}, {h9}})->IsAll(8));
-  EXPECT_FALSE(Literal::CreateR2<half>({{h9}, {h8}})->IsAll(8));
+  EXPECT_TRUE(LiteralUtil::CreateR2<half>({{h8}, {h8}})->IsAll(8));
+  EXPECT_FALSE(LiteralUtil::CreateR2<half>({{h8}, {h9}})->IsAll(8));
+  EXPECT_FALSE(LiteralUtil::CreateR2<half>({{h9}, {h8}})->IsAll(8));
 
   bfloat16 b8(8.0f);
   bfloat16 b9(9.0f);
 
-  EXPECT_TRUE(Literal::CreateR2<bfloat16>({{b8}, {b8}})->IsAll(8));
-  EXPECT_FALSE(Literal::CreateR2<bfloat16>({{b8}, {b9}})->IsAll(8));
-  EXPECT_FALSE(Literal::CreateR2<bfloat16>({{b9}, {b8}})->IsAll(8));
+  EXPECT_TRUE(LiteralUtil::CreateR2<bfloat16>({{b8}, {b8}})->IsAll(8));
+  EXPECT_FALSE(LiteralUtil::CreateR2<bfloat16>({{b8}, {b9}})->IsAll(8));
+  EXPECT_FALSE(LiteralUtil::CreateR2<bfloat16>({{b9}, {b8}})->IsAll(8));
 
   // 9.001 will be truncated to 9.0
   bfloat16 b91(9.001f);
   bfloat16 b90(9.00f);
-  EXPECT_TRUE(Literal::CreateR2<bfloat16>({{b91}, {b90}})->IsAll(9.0));
+  EXPECT_TRUE(LiteralUtil::CreateR2<bfloat16>({{b91}, {b90}})->IsAll(9.0));
 
   complex64 c8_9 = {8, 9};
-  EXPECT_FALSE(Literal::CreateR2<complex64>({{c8_9}, {c8_9}})->IsAll(8));
+  EXPECT_FALSE(LiteralUtil::CreateR2<complex64>({{c8_9}, {c8_9}})->IsAll(8));
 
   auto uint64_max = std::numeric_limits<uint64>::max();
-  EXPECT_FALSE(Literal::CreateR2<uint64>(
+  EXPECT_FALSE(LiteralUtil::CreateR2<uint64>(
                    {{uint64_max, uint64_max}, {uint64_max, uint64_max}})
                    ->IsAll(-1));
 }
 
 TEST_F(LiteralUtilTest, IsAllFloat) {
   // IsAllFloat always returns false when the literal is not floating-point.
-  EXPECT_FALSE(Literal::CreateR0<bool>(false)->IsAllFloat(0));
-  EXPECT_FALSE(Literal::CreateR0<int8>(0)->IsAllFloat(0));
-  EXPECT_FALSE(Literal::CreateR0<uint8>(0)->IsAllFloat(0));
-  EXPECT_FALSE(Literal::CreateR0<int>(0)->IsAllFloat(0));
-
-  EXPECT_TRUE(Literal::CreateR0<float>(0)->IsAllFloat(0));
-  EXPECT_TRUE(Literal::CreateR0<float>(.5)->IsAllFloat(.5));
-  EXPECT_TRUE(Literal::CreateR0<float>(-.5)->IsAllFloat(-.5));
-  EXPECT_FALSE(Literal::CreateR0<float>(-.5)->IsAllFloat(-.49));
+  EXPECT_FALSE(LiteralUtil::CreateR0<bool>(false)->IsAllFloat(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<int8>(0)->IsAllFloat(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<uint8>(0)->IsAllFloat(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<int>(0)->IsAllFloat(0));
+
+  EXPECT_TRUE(LiteralUtil::CreateR0<float>(0)->IsAllFloat(0));
+  EXPECT_TRUE(LiteralUtil::CreateR0<float>(.5)->IsAllFloat(.5));
+  EXPECT_TRUE(LiteralUtil::CreateR0<float>(-.5)->IsAllFloat(-.5));
+  EXPECT_FALSE(LiteralUtil::CreateR0<float>(-.5)->IsAllFloat(-.49));
   EXPECT_FALSE(
-      Literal::CreateR2<float>({{0, 0, 0}, {0, .1, 0}})->IsAllFloat(0));
-  EXPECT_TRUE(
-      Literal::CreateR2<float>({{.5, .5, .5}, {.5, .5, .5}})->IsAllFloat(.5));
-
-  EXPECT_TRUE(Literal::CreateR0<double>(0)->IsAllFloat(0));
-  EXPECT_TRUE(Literal::CreateR0<double>(.5)->IsAllFloat(.5));
-  EXPECT_TRUE(Literal::CreateR0<double>(-.5)->IsAllFloat(-.5));
-  EXPECT_FALSE(Literal::CreateR0<double>(-.5)->IsAllFloat(-.49));
+      LiteralUtil::CreateR2<float>({{0, 0, 0}, {0, .1, 0}})->IsAllFloat(0));
+  EXPECT_TRUE(LiteralUtil::CreateR2<float>({{.5, .5, .5}, {.5, .5, .5}})
+                  ->IsAllFloat(.5));
+
+  EXPECT_TRUE(LiteralUtil::CreateR0<double>(0)->IsAllFloat(0));
+  EXPECT_TRUE(LiteralUtil::CreateR0<double>(.5)->IsAllFloat(.5));
+  EXPECT_TRUE(LiteralUtil::CreateR0<double>(-.5)->IsAllFloat(-.5));
+  EXPECT_FALSE(LiteralUtil::CreateR0<double>(-.5)->IsAllFloat(-.49));
   EXPECT_FALSE(
-      Literal::CreateR2<double>({{0, 0, 0}, {0, .1, 0}})->IsAllFloat(0));
+      LiteralUtil::CreateR2<double>({{0, 0, 0}, {0, .1, 0}})->IsAllFloat(0));
 }
 
 TEST_F(LiteralUtilTest, IsAllComplex) {
   // IsAllComplex always returns false when the literal is not complex.
-  EXPECT_FALSE(Literal::CreateR0<bool>(false)->IsAllComplex(0));
-  EXPECT_FALSE(Literal::CreateR0<int8>(0)->IsAllComplex(0));
-  EXPECT_FALSE(Literal::CreateR0<uint8>(0)->IsAllComplex(0));
-  EXPECT_FALSE(Literal::CreateR0<int>(0)->IsAllComplex(0));
-  EXPECT_FALSE(Literal::CreateR0<float>(0)->IsAllComplex(0));
-  EXPECT_FALSE(Literal::CreateR0<double>(0)->IsAllComplex(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<bool>(false)->IsAllComplex(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<int8>(0)->IsAllComplex(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<uint8>(0)->IsAllComplex(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<int>(0)->IsAllComplex(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<float>(0)->IsAllComplex(0));
+  EXPECT_FALSE(LiteralUtil::CreateR0<double>(0)->IsAllComplex(0));
 
   complex64 c8_9 = {8, 9};
   complex64 c7_9 = {7, 9};
-  EXPECT_TRUE(Literal::CreateR2<complex64>({{c8_9}, {c8_9}})
+  EXPECT_TRUE(LiteralUtil::CreateR2<complex64>({{c8_9}, {c8_9}})
                   ->IsAllComplex({8.0f, 9.0f}));
-  EXPECT_FALSE(Literal::CreateR2<complex64>({{c7_9}, {c8_9}})
+  EXPECT_FALSE(LiteralUtil::CreateR2<complex64>({{c7_9}, {c8_9}})
                    ->IsAllComplex({8.0f, 9.0f}));
-  EXPECT_FALSE(Literal::CreateR2<complex64>({{c8_9}, {c7_9}})
+  EXPECT_FALSE(LiteralUtil::CreateR2<complex64>({{c8_9}, {c7_9}})
                    ->IsAllComplex({8.0f, 9.0f}));
 }
 
 TEST_F(LiteralUtilTest, IsAllFirst) {
   // IsAllComplex always returns false when the literal is not complex.
-  EXPECT_FALSE(Literal::CreateR1<bool>({false, true})->IsAllFirst());
-  EXPECT_TRUE(Literal::CreateR1<bool>({false, false})->IsAllFirst());
-  EXPECT_FALSE(Literal::CreateR1<int8>({1, 1, 2})->IsAllFirst());
-  EXPECT_TRUE(Literal::CreateR1<int8>({5, 5, 5, 5})->IsAllFirst());
-  EXPECT_FALSE(Literal::CreateR1<uint8>({1, 1, 2})->IsAllFirst());
-  EXPECT_TRUE(Literal::CreateR1<int32>({5, 5, 5, 5})->IsAllFirst());
-  EXPECT_FALSE(Literal::CreateR1<int32>({1, 1, 2})->IsAllFirst());
-  EXPECT_TRUE(Literal::CreateR1<uint32>({5, 5, 5, 5})->IsAllFirst());
-  EXPECT_FALSE(Literal::CreateR1<uint32>({1, 1, 2})->IsAllFirst());
+  EXPECT_FALSE(LiteralUtil::CreateR1<bool>({false, true})->IsAllFirst());
+  EXPECT_TRUE(LiteralUtil::CreateR1<bool>({false, false})->IsAllFirst());
+  EXPECT_FALSE(LiteralUtil::CreateR1<int8>({1, 1, 2})->IsAllFirst());
+  EXPECT_TRUE(LiteralUtil::CreateR1<int8>({5, 5, 5, 5})->IsAllFirst());
+  EXPECT_FALSE(LiteralUtil::CreateR1<uint8>({1, 1, 2})->IsAllFirst());
+  EXPECT_TRUE(LiteralUtil::CreateR1<int32>({5, 5, 5, 5})->IsAllFirst());
+  EXPECT_FALSE(LiteralUtil::CreateR1<int32>({1, 1, 2})->IsAllFirst());
+  EXPECT_TRUE(LiteralUtil::CreateR1<uint32>({5, 5, 5, 5})->IsAllFirst());
+  EXPECT_FALSE(LiteralUtil::CreateR1<uint32>({1, 1, 2})->IsAllFirst());
 
   complex64 c8_9 = {8, 9};
   complex64 c7_9 = {7, 9};
-  EXPECT_TRUE(Literal::CreateR2<complex64>({{c8_9}, {c8_9}})->IsAllFirst());
-  EXPECT_FALSE(Literal::CreateR2<complex64>({{c7_9}, {c8_9}})->IsAllFirst());
+  EXPECT_TRUE(LiteralUtil::CreateR2<complex64>({{c8_9}, {c8_9}})->IsAllFirst());
+  EXPECT_FALSE(
+      LiteralUtil::CreateR2<complex64>({{c7_9}, {c8_9}})->IsAllFirst());
 }
 
 TEST_F(LiteralUtilTest, IsZero) {
-  auto scalar_zero = Literal::CreateR0<float>(0.0f);
-  auto scalar_one = Literal::CreateR0<float>(1.0f);
+  auto scalar_zero = LiteralUtil::CreateR0<float>(0.0f);
+  auto scalar_one = LiteralUtil::CreateR0<float>(1.0f);
   EXPECT_TRUE(scalar_zero->IsZero({}));
   EXPECT_FALSE(scalar_one->IsZero({}));
 
-  auto array = Literal::CreateR2<uint32>({{1, 2, 0, 3}, {1, 0, 1, 2}});
+  auto array = LiteralUtil::CreateR2<uint32>({{1, 2, 0, 3}, {1, 0, 1, 2}});
   EXPECT_FALSE(array->IsZero({0, 1}));
   EXPECT_TRUE(array->IsZero({0, 2}));
   EXPECT_TRUE(array->IsZero({1, 1}));
   EXPECT_FALSE(array->IsZero({1, 2}));
 
-  auto complex_zero = Literal::CreateR0<complex64>(0.0f);
-  auto complex_nonzero = Literal::CreateR0<complex64>(0.5f);
+  auto complex_zero = LiteralUtil::CreateR0<complex64>(0.0f);
+  auto complex_nonzero = LiteralUtil::CreateR0<complex64>(0.5f);
   EXPECT_TRUE(complex_zero->IsZero({}));
   EXPECT_FALSE(complex_nonzero->IsZero({}));
 }
@@ -563,7 +570,7 @@ TYPED_TEST_CASE(LiteralUtilTestTemplated, TestedTypes);
 TYPED_TEST(LiteralUtilTestTemplated, Relayout2x2) {
   // Make a non-integer for floating point types.
   TypeParam half = TypeParam(1) / TypeParam(2);
-  auto data = Literal::CreateR2<TypeParam>({{half, 2}, {3, 4}});
+  auto data = LiteralUtil::CreateR2<TypeParam>({{half, 2}, {3, 4}});
   const Layout layout01 = LayoutUtil::MakeLayout({0, 1});
   const Layout layout10 = LayoutUtil::MakeLayout({1, 0});
 
@@ -577,7 +584,7 @@ TYPED_TEST(LiteralUtilTestTemplated, Relayout2x2) {
 }
 
 TEST_F(LiteralUtilTest, ReshapeR0) {
-  auto original = Literal::CreateR0<float>(1.7f);
+  auto original = LiteralUtil::CreateR0<float>(1.7f);
   auto reshape = original->Reshape(/*dimensions=*/{}).ConsumeValueOrDie();
   EXPECT_EQ(*original, *reshape);
 }
@@ -585,13 +592,13 @@ TEST_F(LiteralUtilTest, ReshapeR0) {
 TEST_F(LiteralUtilTest, ReshapeR4) {
   // clang-format off
   // F32[1x3x2x4]
-  auto original = Literal::CreateR4WithLayout<float>({{
+  auto original = LiteralUtil::CreateR4WithLayout<float>({{
      {{10, 11, 12, 13}, {14, 15, 16, 17}},
      {{18, 19, 20, 21}, {22, 23, 24, 25}},
      {{26, 27, 28, 29}, {30, 31, 32, 33}},
   }}, layout_r4_dim0major_);
   // F32[1x3x4x2]
-  auto expected = Literal::CreateR3WithLayout<float>({
+  auto expected = LiteralUtil::CreateR3WithLayout<float>({
     {{10, 11}, {12, 13}, {14, 15}, {16, 17}},
     {{18, 19}, {20, 21}, {22, 23}, {24, 25}},
     {{26, 27}, {28, 29}, {30, 31}, {32, 33}},
@@ -605,13 +612,13 @@ TEST_F(LiteralUtilTest, ReshapeR4) {
 TEST_F(LiteralUtilTest, ReshapeR4Dim0Minor) {
   // clang-format off
   // F32[1x3x2x4]
-  auto original = Literal::CreateR4WithLayout<float>({{
+  auto original = LiteralUtil::CreateR4WithLayout<float>({{
      {{10, 11, 12, 13}, {14, 15, 16, 17}},
      {{18, 19, 20, 21}, {22, 23, 24, 25}},
      {{26, 27, 28, 29}, {30, 31, 32, 33}},
   }}, layout_r4_dim0minor_);
   // F32[1x3x4x2]
-  auto expected = Literal::CreateR3WithLayout<float>({
+  auto expected = LiteralUtil::CreateR3WithLayout<float>({
     {{10, 11}, {12, 13}, {14, 15}, {16, 17}},
     {{18, 19}, {20, 21}, {22, 23}, {24, 25}},
     {{26, 27}, {28, 29}, {30, 31}, {32, 33}},
@@ -623,7 +630,7 @@ TEST_F(LiteralUtilTest, ReshapeR4Dim0Minor) {
 }
 
 TEST_F(LiteralUtilTest, TransposeR0) {
-  auto original = Literal::CreateR0<float>(1.7f);
+  auto original = LiteralUtil::CreateR0<float>(1.7f);
   auto reshape = original->Transpose(/*permutation=*/{});
   EXPECT_EQ(*original, *reshape);
 }
@@ -631,7 +638,7 @@ TEST_F(LiteralUtilTest, TransposeR0) {
 TEST_F(LiteralUtilTest, TransposeR4) {
   // clang-format off
   // F32[1x3x2x4]
-  auto original = Literal::CreateR4<float>({{
+  auto original = LiteralUtil::CreateR4<float>({{
      {{10, 11, 12, 13}, {14, 15, 16, 17}},
      {{18, 19, 20, 21}, {22, 23, 24, 25}},
      {{26, 27, 28, 29}, {30, 31, 32, 33}},
@@ -659,7 +666,7 @@ TEST_F(LiteralUtilTest, TestR4RelayoutEquivalence) {
 
 TEST_F(LiteralUtilTest, TestR2LinearLayout) {
   // Test expected memory layout of R2 dim0-minor (column-major) literal.
-  auto mat_dim0minor = Literal::CreateR2WithLayout<int32>(
+  auto mat_dim0minor = LiteralUtil::CreateR2WithLayout<int32>(
       {{1, 2, 3}, {4, 5, 6}}, layout_r2_dim0minor_);
   EXPECT_EQ(mat_dim0minor->element_count(), 6);
   EXPECT_THAT(mat_dim0minor->data<int32>(), ElementsAre(1, 4, 2, 5, 3, 6));
@@ -670,7 +677,7 @@ TEST_F(LiteralUtilTest, TestR2LinearLayout) {
               ElementsAre(1, 2, 3, 4, 5, 6));
 
   // Test expected memory layout of R2 created with dim0-major (row-major).
-  auto mat_dim0major = Literal::CreateR2WithLayout<int32>(
+  auto mat_dim0major = LiteralUtil::CreateR2WithLayout<int32>(
       {{1, 2, 3}, {4, 5, 6}}, layout_r2_dim0major_);
   EXPECT_EQ(mat_dim0major->element_count(), 6);
   EXPECT_THAT(mat_dim0major->data<int32>(), ElementsAre(1, 2, 3, 4, 5, 6));
@@ -695,8 +702,8 @@ TEST_F(LiteralUtilTest, TestR3LinearLayout) {
             {10, 11, 12},
           },
       });  // clang-format on
-  auto lit_dim0minor =
-      Literal::CreateR3FromArray3DWithLayout<int>(arr3d, layout_r3_dim0minor_);
+  auto lit_dim0minor = LiteralUtil::CreateR3FromArray3DWithLayout<int>(
+      arr3d, layout_r3_dim0minor_);
 
   EXPECT_EQ(lit_dim0minor->element_count(), 12);
   std::vector<int> expected_dim0minor{1, 7, 4, 10, 2, 8, 5, 11, 3, 9, 6, 12};
@@ -710,8 +717,8 @@ TEST_F(LiteralUtilTest, TestR3LinearLayout) {
               testing::ElementsAreArray(expected_dim0major));
 
   // Test expected memory layout of R3 created with dim0-major (row-major).
-  auto lit_dim0major =
-      Literal::CreateR3FromArray3DWithLayout<int>(arr3d, layout_r3_dim0major_);
+  auto lit_dim0major = LiteralUtil::CreateR3FromArray3DWithLayout<int>(
+      arr3d, layout_r3_dim0major_);
   EXPECT_EQ(lit_dim0major->element_count(), 12);
   EXPECT_THAT(lit_dim0major->data<int32>(),
               testing::ElementsAreArray(expected_dim0major));
@@ -723,28 +730,28 @@ TEST_F(LiteralUtilTest, TestR3LinearLayout) {
 }
 
 TEST_F(LiteralUtilTest, SliceR0S32) {
-  auto input = Literal::CreateR0<int32>(1);
+  auto input = LiteralUtil::CreateR0<int32>(1);
   auto result = input->Slice({}, {});
   EXPECT_EQ(*input, *result);
 }
 
 TEST_F(LiteralUtilTest, SliceR1F32) {
-  auto input = Literal::CreateR1<float>({1.0, 2.0, 3.0, 4.0, 5.0});
+  auto input = LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0, 4.0, 5.0});
   auto result = input->Slice({3}, {4});
-  auto expected = Literal::CreateR1<float>({4.0});
+  auto expected = LiteralUtil::CreateR1<float>({4.0});
   EXPECT_EQ(*expected, *result);
 }
 
 TEST_F(LiteralUtilTest, SliceR2U32) {
-  auto input_3x4 =
-      Literal::CreateR2<uint32>({{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}});
+  auto input_3x4 = LiteralUtil::CreateR2<uint32>(
+      {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}});
   auto result = input_3x4->Slice({0, 2}, {2, 4});
-  auto expected = Literal::CreateR2<uint32>({{3, 4}, {7, 8}});
+  auto expected = LiteralUtil::CreateR2<uint32>({{3, 4}, {7, 8}});
   EXPECT_EQ(*expected, *result);
 }
 
 TEST_F(LiteralUtilTest, SliceR3U32Full) {
-  auto input_2x3x2 = Literal::CreateR3<uint32>(
+  auto input_2x3x2 = LiteralUtil::CreateR3<uint32>(
       {{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 10}, {11, 12}}});
   auto result = input_2x3x2->Slice({0, 0, 0}, {2, 3, 2});
   EXPECT_EQ(*input_2x3x2, *result);
@@ -753,21 +760,21 @@ TEST_F(LiteralUtilTest, SliceR3U32Full) {
 TEST_F(LiteralUtilTest, PopulateR1S64) {
   Literal output(ShapeUtil::MakeShape(S64, {1}));
   output.PopulateR1<int64>({77});
-  auto expected = Literal::CreateR1<int64>({77});
+  auto expected = LiteralUtil::CreateR1<int64>({77});
   EXPECT_EQ(output, *expected);
 }
 
 TEST_F(LiteralUtilTest, PopulateR1U64) {
   Literal output(ShapeUtil::MakeShape(U64, {2}));
   output.PopulateR1<uint64>({{77, 88}});
-  auto expected = Literal::CreateR1<uint64>({{77, 88}});
+  auto expected = LiteralUtil::CreateR1<uint64>({{77, 88}});
   EXPECT_EQ(output, *expected);
 }
 
 TEST_F(LiteralUtilTest, PopulateR1C64) {
   Literal output(ShapeUtil::MakeShape(C64, {1}));
   output.PopulateR1<complex64>({{77, 88}});
-  auto expected = Literal::CreateR1<complex64>({{77, 88}});
+  auto expected = LiteralUtil::CreateR1<complex64>({{77, 88}});
   EXPECT_EQ(output, *expected);
 }
 
@@ -775,7 +782,7 @@ TEST_F(LiteralUtilTest, PopulateR2C64) {
   Literal output(ShapeUtil::MakeShape(C64, {2, 2}));
   output.PopulateR2<complex64>({{{7, 8}, {9, 10}}, {{1, 2}, {3, 4}}});
   auto expected =
-      Literal::CreateR2<complex64>({{{7, 8}, {9, 10}}, {{1, 2}, {3, 4}}});
+      LiteralUtil::CreateR2<complex64>({{{7, 8}, {9, 10}}, {{1, 2}, {3, 4}}});
   EXPECT_EQ(output, *expected);
 }
 
@@ -783,7 +790,7 @@ TEST_F(LiteralUtilTest, PopulateWithValueR0BF16) {
   Literal output(ShapeUtil::MakeShape(BF16, {}));
   bfloat16 h(0.25f);
   output.PopulateWithValue<bfloat16>(h);
-  auto expected = Literal::CreateR0<bfloat16>(h);
+  auto expected = LiteralUtil::CreateR0<bfloat16>(h);
   EXPECT_EQ(output, *expected);
 }
 
@@ -791,7 +798,7 @@ TEST_F(LiteralUtilTest, PopulateWithValueR1BF16) {
   Literal output(ShapeUtil::MakeShape(BF16, {3}));
   bfloat16 h(0.5f);
   output.PopulateWithValue<bfloat16>(h);
-  auto expected = Literal::CreateR1<bfloat16>({h, h, h});
+  auto expected = LiteralUtil::CreateR1<bfloat16>({h, h, h});
   EXPECT_EQ(output, *expected);
 }
 
@@ -799,28 +806,28 @@ TEST_F(LiteralUtilTest, PopulateWithValueR2BF16) {
   Literal output(ShapeUtil::MakeShape(BF16, {2, 2}));
   bfloat16 h(2.0f);
   output.PopulateWithValue<bfloat16>(h);
-  auto expected = Literal::CreateR2<bfloat16>({{h, h}, {h, h}});
+  auto expected = LiteralUtil::CreateR2<bfloat16>({{h, h}, {h, h}});
   EXPECT_EQ(output, *expected);
 }
 
 TEST_F(LiteralUtilTest, PopulateWithValueR0F32) {
   Literal output(ShapeUtil::MakeShape(F32, {}));
   output.PopulateWithValue<float>(2.5f);
-  auto expected = Literal::CreateR0<float>(2.5f);
+  auto expected = LiteralUtil::CreateR0<float>(2.5f);
   EXPECT_EQ(output, *expected);
 }
 
 TEST_F(LiteralUtilTest, PopulateWithValueR1S64) {
   Literal output(ShapeUtil::MakeShape(S64, {3}));
   output.PopulateWithValue<int64>(-7);
-  auto expected = Literal::CreateR1<int64>({-7, -7, -7});
+  auto expected = LiteralUtil::CreateR1<int64>({-7, -7, -7});
   EXPECT_EQ(output, *expected);
 }
 
 TEST_F(LiteralUtilTest, PopulateWithValueR2U64) {
   Literal output(ShapeUtil::MakeShape(U64, {2, 2}));
   output.PopulateWithValue<uint64>(42);
-  auto expected = Literal::CreateR2<uint64>({{42, 42}, {42, 42}});
+  auto expected = LiteralUtil::CreateR2<uint64>({{42, 42}, {42, 42}});
   EXPECT_EQ(output, *expected);
 }
 
@@ -828,7 +835,7 @@ TEST_F(LiteralUtilTest, PopulateWithValueR2C64) {
   Literal output(ShapeUtil::MakeShape(C64, {2, 2}));
   output.PopulateWithValue<complex64>({4, 2});
   auto expected =
-      Literal::CreateR2<complex64>({{{4, 2}, {4, 2}}, {{4, 2}, {4, 2}}});
+      LiteralUtil::CreateR2<complex64>({{{4, 2}, {4, 2}}, {{4, 2}, {4, 2}}});
   EXPECT_EQ(output, *expected);
 }
 
@@ -836,7 +843,7 @@ TEST_F(LiteralUtilTest, PopulateWithValueR0F16) {
   Literal output(ShapeUtil::MakeShape(F16, {}));
   half h(0.25f);
   output.PopulateWithValue<half>(h);
-  auto expected = Literal::CreateR0<half>(h);
+  auto expected = LiteralUtil::CreateR0<half>(h);
   EXPECT_EQ(output, *expected);
 }
 
@@ -844,7 +851,7 @@ TEST_F(LiteralUtilTest, PopulateWithValueR1F16) {
   Literal output(ShapeUtil::MakeShape(F16, {3}));
   half h(0.5f);
   output.PopulateWithValue<half>(h);
-  auto expected = Literal::CreateR1<half>({h, h, h});
+  auto expected = LiteralUtil::CreateR1<half>({h, h, h});
   EXPECT_EQ(output, *expected);
 }
 
@@ -852,15 +859,15 @@ TEST_F(LiteralUtilTest, PopulateWithValueR2F16) {
   Literal output(ShapeUtil::MakeShape(F16, {2, 2}));
   half h(2.0f);
   output.PopulateWithValue<half>(h);
-  auto expected = Literal::CreateR2<half>({{h, h}, {h, h}});
+  auto expected = LiteralUtil::CreateR2<half>({{h, h}, {h, h}});
   EXPECT_EQ(output, *expected);
 }
 
 TEST_F(LiteralUtilTest, ReplicateR2U32) {
-  auto input =
-      Literal::CreateR2<uint32>({{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}});
+  auto input = LiteralUtil::CreateR2<uint32>(
+      {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}});
   auto output = input->Replicate<uint32>(3);
-  auto expected = Literal::CreateR3<uint32>(
+  auto expected = LiteralUtil::CreateR3<uint32>(
       {{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}},
        {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}},
        {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}}});
@@ -914,12 +921,12 @@ TEST_F(LiteralUtilTest, CopySliceFrom) {
 }
 
 TEST_F(LiteralUtilTest, CopyFromScalars) {
-  auto zero = Literal::CreateR0<uint32>(0);
-  auto nine = Literal::CreateR0<uint32>(9);
+  auto zero = LiteralUtil::CreateR0<uint32>(0);
+  auto nine = LiteralUtil::CreateR0<uint32>(9);
   TF_EXPECT_OK(zero->CopyFrom(*nine));
   EXPECT_EQ(*zero, *nine);
 
-  auto vect = Literal::CreateR1<uint32>({3, 4, 9, 12, 5, 17, 21});
+  auto vect = LiteralUtil::CreateR1<uint32>({3, 4, 9, 12, 5, 17, 21});
   TF_EXPECT_OK(zero->CopySliceFrom(*vect, {5}, {}, {}));
   EXPECT_EQ(zero->Get<uint32>({}), 17);
   TF_EXPECT_OK(vect->CopySliceFrom(*zero, {}, {4}, {}));
@@ -928,13 +935,13 @@ TEST_F(LiteralUtilTest, CopyFromScalars) {
 
 TEST_F(LiteralUtilTest, CopyFromAndToZeroElement) {
   const Shape empty_r1_shape = ShapeUtil::MakeShape(F32, {0});
-  const auto const_nine = Literal::CreateR1<float>({9});
+  const auto const_nine = LiteralUtil::CreateR1<float>({9});
   const auto const_empty = Literal::CreateFromShape(empty_r1_shape);
 
   {
     // Source contains dimension with zero elements.
     const auto empty = Literal::CreateFromShape(empty_r1_shape);
-    auto nine = Literal::CreateR1<float>({9});
+    auto nine = LiteralUtil::CreateR1<float>({9});
 
     TF_EXPECT_OK(nine->CopySliceFrom(*empty, {0}, {0}, {0}));
     EXPECT_EQ(*nine, *const_nine);
@@ -943,7 +950,7 @@ TEST_F(LiteralUtilTest, CopyFromAndToZeroElement) {
   {
     // Copy 0 element to destination with zero elements.
     const auto empty = Literal::CreateFromShape(empty_r1_shape);
-    auto nine = Literal::CreateR1<float>({9});
+    auto nine = LiteralUtil::CreateR1<float>({9});
 
     TF_EXPECT_OK(empty->CopySliceFrom(*nine, {0}, {0}, {0}));
     EXPECT_EQ(*empty, *const_empty);
@@ -958,16 +965,16 @@ TEST_F(LiteralUtilTest, CopyFromNilShape) {
 }
 
 TEST_F(LiteralUtilTest, CopyFromArrays) {
-  auto scalar_42 = Literal::CreateR0<float>(42.0);
-  auto scalar_123 = Literal::CreateR0<float>(123.0);
+  auto scalar_42 = LiteralUtil::CreateR0<float>(42.0);
+  auto scalar_123 = LiteralUtil::CreateR0<float>(123.0);
   EXPECT_NE(*scalar_42, *scalar_123);
   TF_ASSERT_OK(scalar_42->CopyFrom(*scalar_123, /*dest_shape_index=*/{},
                                    /*src_shape_index=*/{}));
   EXPECT_EQ(*scalar_42, *scalar_123);
   EXPECT_EQ(scalar_42->Get<float>({}), 123.0f);
 
-  auto matrix_1234 = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto matrix_5678 = Literal::CreateR2<float>({{5.0, 6.0}, {7.0, 8.0}});
+  auto matrix_1234 = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto matrix_5678 = LiteralUtil::CreateR2<float>({{5.0, 6.0}, {7.0, 8.0}});
   EXPECT_NE(*matrix_1234, *matrix_5678);
   EXPECT_EQ(matrix_1234->Get<float>({0, 0}), 1.0f);
   TF_ASSERT_OK(matrix_1234->CopyFrom(*matrix_5678, /*dest_shape_index=*/{},
@@ -977,19 +984,19 @@ TEST_F(LiteralUtilTest, CopyFromArrays) {
 }
 
 TEST_F(LiteralUtilTest, CopyFromTuples) {
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
   Literal nil_literal(ShapeUtil::MakeNil());
-  auto nested_tuple = Literal::MakeTuple(
+  auto nested_tuple = LiteralUtil::MakeTuple(
       {matrix.get(),
-       Literal::MakeTuple({Literal::CreateR0<int32>(42).get(),
-                           Literal::CreateR1<double>({23.0, 44.0}).get(),
-                           &nil_literal})
+       LiteralUtil::MakeTuple(
+           {LiteralUtil::CreateR0<int32>(42).get(),
+            LiteralUtil::CreateR1<double>({23.0, 44.0}).get(), &nil_literal})
            .get()});
   // Create a tuple the same shape as the inner tuple of nested_tuple but with
   // different values..
-  auto tuple = Literal::MakeTuple({Literal::CreateR0<int32>(-5).get(),
-                                   Literal::CreateR1<double>({2.0, 4.0}).get(),
-                                   &nil_literal});
+  auto tuple = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<int32>(-5).get(),
+       LiteralUtil::CreateR1<double>({2.0, 4.0}).get(), &nil_literal});
 
   EXPECT_EQ(*matrix, LiteralSlice(*nested_tuple, {0}));
   EXPECT_EQ(nested_tuple->Get<int32>({}, {1, 0}), 42);
@@ -1010,8 +1017,8 @@ TEST_F(LiteralUtilTest, CopyFromTuples) {
   EXPECT_EQ(nested_tuple->Get<double>({1}, {1, 1}), 4.0);
 }
 TEST_F(LiteralUtilTest, CopyBetweenSameTuple) {
-  auto tuple = Literal::MakeTuple(
-      {Literal::CreateR0<int32>(-2).get(), Literal::CreateR0<int32>(4).get()});
+  auto tuple = LiteralUtil::MakeTuple({LiteralUtil::CreateR0<int32>(-2).get(),
+                                       LiteralUtil::CreateR0<int32>(4).get()});
 
   EXPECT_EQ(tuple->Get<int32>({}, {0}), -2);
   EXPECT_EQ(tuple->Get<int32>({}, {1}), 4);
@@ -1025,8 +1032,8 @@ TEST_F(LiteralUtilTest, CopyBetweenSameTuple) {
 }
 
 TEST_F(LiteralUtilTest, CopyFromDifferentShapes) {
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto vector = Literal::CreateR1<float>({5.0, 7.0});
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto vector = LiteralUtil::CreateR1<float>({5.0, 7.0});
   Status status = matrix->CopyFrom(*vector);
   ASSERT_FALSE(status.ok());
   ASSERT_THAT(status.error_message(),
@@ -1051,7 +1058,7 @@ TEST_F(LiteralUtilTest, F16) {
 
   half h1(1.0f);
   half h2(2.0f);
-  auto m2 = Literal::CreateR2<half>({{h1, h2}, {h2, h1}});
+  auto m2 = LiteralUtil::CreateR2<half>({{h1, h2}, {h2, h1}});
   Literal* l2 = m2.get();
   const char* d2 = reinterpret_cast<const char*>(l2->data<half>().data());
   EXPECT_EQ(d2[0], 0);
@@ -1150,12 +1157,12 @@ TEST_F(LiteralUtilTest, PopulateParallel) {
 
 TEST_F(LiteralUtilTest, ConvertR4) {
   // clang-format off
-  auto original = Literal::CreateR4WithLayout<int8>({{
+  auto original = LiteralUtil::CreateR4WithLayout<int8>({{
      {{10, 11, 12, 13}, {14, 15, 16, 17}},
      {{18, 19, 20, 21}, {22, 23, 24, 25}},
      {{26, 27, 28, 29}, {30, 31, 32, 33}},
   }}, layout_r4_dim0major_);
-  auto expected = Literal::CreateR4WithLayout<uint32>({{
+  auto expected = LiteralUtil::CreateR4WithLayout<uint32>({{
      {{10, 11, 12, 13}, {14, 15, 16, 17}},
      {{18, 19, 20, 21}, {22, 23, 24, 25}},
      {{26, 27, 28, 29}, {30, 31, 32, 33}},
@@ -1169,42 +1176,42 @@ TEST_F(LiteralUtilTest, ConvertR4) {
 
 TEST_F(LiteralUtilTest, ConvertIfTypesMatch) {
   // clang-format off
-  auto s8 = Literal::CreateR4WithLayout<int8>({{
+  auto s8 = LiteralUtil::CreateR4WithLayout<int8>({{
     {{10, 0, 12, 0}, {0, 15, 0, 17}},
     {{0, 19, 0, 21}, {22, 0, 24, 0}},
     {{26, 0, 28, 0}, {0, 31, 0, 33}},
   }}, layout_r4_dim0major_);
-  auto s32 = Literal::CreateR4WithLayout<int32>({{
+  auto s32 = LiteralUtil::CreateR4WithLayout<int32>({{
     {{10, 0, 12, 0}, {0, 15, 0, 17}},
     {{0, 19, 0, 21}, {22, 0, 24, 0}},
     {{26, 0, 28, 0}, {0, 31, 0, 33}},
   }}, layout_r4_dim0major_);
-  auto u32 = Literal::CreateR4WithLayout<uint32>({{
+  auto u32 = LiteralUtil::CreateR4WithLayout<uint32>({{
     {{10, 0, 12, 0}, {0, 15, 0, 17}},
     {{0, 19, 0, 21}, {22, 0, 24, 0}},
     {{26, 0, 28, 0}, {0, 31, 0, 33}},
   }}, layout_r4_dim0major_);
-  auto s64 = Literal::CreateR4WithLayout<int64>({{
+  auto s64 = LiteralUtil::CreateR4WithLayout<int64>({{
     {{10, 0, 12, 0}, {0, 15, 0, 17}},
     {{0, 19, 0, 21}, {22, 0, 24, 0}},
     {{26, 0, 28, 0}, {0, 31, 0, 33}},
   }}, layout_r4_dim0major_);
-  auto u64 = Literal::CreateR4WithLayout<uint64>({{
+  auto u64 = LiteralUtil::CreateR4WithLayout<uint64>({{
     {{10, 0, 12, 0}, {0, 15, 0, 17}},
     {{0, 19, 0, 21}, {22, 0, 24, 0}},
     {{26, 0, 28, 0}, {0, 31, 0, 33}},
   }}, layout_r4_dim0major_);
-  auto pred = Literal::CreateR4WithLayout<bool>({{
+  auto pred = LiteralUtil::CreateR4WithLayout<bool>({{
     {{true, false, true, false}, {false, true, false, true}},
     {{false, true, false, true}, {true, false, true, false}},
     {{true, false, true, false}, {false, true, false, true}},
   }}, layout_r4_dim0major_);
-  auto int32_pred = Literal::CreateR4WithLayout<int32>({{
+  auto int32_pred = LiteralUtil::CreateR4WithLayout<int32>({{
     {{1, 0, 1, 0}, {0, 1, 0, 1}},
     {{0, 1, 0, 1}, {1, 0, 1, 0}},
     {{1, 0, 1, 0}, {0, 1, 0, 1}},
   }}, layout_r4_dim0major_);
-  auto f16 = Literal::CreateR4WithLayout<half>({{
+  auto f16 = LiteralUtil::CreateR4WithLayout<half>({{
     {{half(10.0), half(0.0), half(12.0), half(0.0)},
      {half(0.0), half(15.0), half(0.0), half(17.0)}},
     {{half(0.0), half(19.0), half(0.0), half(21.0)},
@@ -1212,7 +1219,7 @@ TEST_F(LiteralUtilTest, ConvertIfTypesMatch) {
     {{half(26.0), half(0.0), half(28.0), half(0.0)},
      {half(0.0), half(31.0), half(0.0), half(33.0)}},
   }}, layout_r4_dim0major_);
-  auto bf16 = Literal::CreateR4WithLayout<bfloat16>({{
+  auto bf16 = LiteralUtil::CreateR4WithLayout<bfloat16>({{
     {{bfloat16(10.0), bfloat16(0.0), bfloat16(12.0), bfloat16(0.0)},
      {bfloat16(0.0), bfloat16(15.0), bfloat16(0.0), bfloat16(17.0)}},
     {{bfloat16(0.0), bfloat16(19.0), bfloat16(0.0), bfloat16(21.0)},
@@ -1220,17 +1227,17 @@ TEST_F(LiteralUtilTest, ConvertIfTypesMatch) {
     {{bfloat16(26.0), bfloat16(0.0), bfloat16(28.0), bfloat16(0.0)},
      {bfloat16(0.0), bfloat16(31.0), bfloat16(0.0), bfloat16(33.0)}},
   }}, layout_r4_dim0major_);
-  auto f32 = Literal::CreateR4WithLayout<float>({{
+  auto f32 = LiteralUtil::CreateR4WithLayout<float>({{
     {{10.0f, 0.0f, 12.0f, 0.0f}, {0.0f, 15.0f, 0.0f, 17.0f}},
     {{0.0f, 19.0f, 0.0f, 21.0f}, {22.0f, 0.0f, 24.0f, 0.0f}},
     {{26.0f, 0.0f, 28.0f, 0.0f}, {0.0f, 31.0f, 0.0f, 33.0f}},
   }}, layout_r4_dim0major_);
-  auto f64 = Literal::CreateR4WithLayout<double>({{
+  auto f64 = LiteralUtil::CreateR4WithLayout<double>({{
     {{10.0, 0.0, 12.0, 0.0}, {0.0, 15.0, 0.0, 17.0}},
     {{0.0, 19.0, 0.0, 21.0}, {22.0, 0.0, 24.0, 0.0}},
     {{26.0, 0.0, 28.0, 0.0}, {0.0, 31.0, 0.0, 33.0}},
   }}, layout_r4_dim0major_);
-  auto c64 = Literal::CreateR4WithLayout<complex64>({{
+  auto c64 = LiteralUtil::CreateR4WithLayout<complex64>({{
     {{10.0f, 0.0f, 12.0f, 0.0f}, {0.0f, 15.0f, 0.0f, 17.0f}},
     {{0.0f, 19.0f, 0.0f, 21.0f}, {22.0f, 0.0f, 24.0f, 0.0f}},
     {{26.0f, 0.0f, 28.0f, 0.0f}, {0.0f, 31.0f, 0.0f, 33.0f}},
@@ -1302,18 +1309,18 @@ TEST_F(LiteralUtilTest, ConvertIfTypesMatch) {
 }
 
 TEST_F(LiteralUtilTest, BitcastConvert) {
-  auto original =
-      Literal::CreateR1<uint32>({tensorflow::bit_cast<uint32>(2.5f),
-                                 tensorflow::bit_cast<uint32>(-42.25f),
-                                 tensorflow::bit_cast<uint32>(100.f), 0xbeef});
-  auto expected = Literal::CreateR1<float>(
+  auto original = LiteralUtil::CreateR1<uint32>(
+      {tensorflow::bit_cast<uint32>(2.5f),
+       tensorflow::bit_cast<uint32>(-42.25f),
+       tensorflow::bit_cast<uint32>(100.f), 0xbeef});
+  auto expected = LiteralUtil::CreateR1<float>(
       {2.5f, -42.25f, 100.0f, tensorflow::bit_cast<float>(0xbeef)});
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> converted,
                           original->BitcastConvert(F32));
 }
 
 TEST_F(LiteralUtilTest, BitcastConvertBetweenInvalidTypes) {
-  auto literal = Literal::CreateR0<uint32>(1234);
+  auto literal = LiteralUtil::CreateR0<uint32>(1234);
   Status status = literal->BitcastConvert(F64).status();
   EXPECT_NE(Status::OK(), status);
   EXPECT_TRUE(tensorflow::str_util::StrContains(status.error_message(),
@@ -1348,7 +1355,7 @@ TEST_F(LiteralUtilTest, ToProto_f16) {
   half h1(1.0f);
   half h2(2.0f);
 
-  auto m = Literal::CreateR2<half>({{h1, h2}, {h2, h1}});
+  auto m = LiteralUtil::CreateR2<half>({{h1, h2}, {h2, h1}});
   Literal* l = m.get();
   EXPECT_EQ(4, ShapeUtil::ElementsIn(l->shape()));
   EXPECT_EQ(4, l->data<half>().size());
@@ -1391,10 +1398,10 @@ TEST_F(LiteralUtilTest, CopyFromProto_f16) {
 }
 
 TEST_F(LiteralUtilTest, LiteralSliceTest) {
-  auto scalar = Literal::CreateR0<float>(1.0);
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto tuple = Literal::MakeTuple({scalar.get(), matrix.get()});
-  auto nested_tuple = Literal::MakeTuple({tuple.get(), scalar.get()});
+  auto scalar = LiteralUtil::CreateR0<float>(1.0);
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto tuple = LiteralUtil::MakeTuple({scalar.get(), matrix.get()});
+  auto nested_tuple = LiteralUtil::MakeTuple({tuple.get(), scalar.get()});
   Literal nil(ShapeUtil::MakeNil());
 
   EXPECT_EQ(LiteralSlice(*scalar, {}), *scalar);
@@ -1413,10 +1420,10 @@ TEST_F(LiteralUtilTest, LiteralSliceTest) {
 }
 
 TEST_F(LiteralUtilTest, MutatingLiteralSlice) {
-  auto scalar = Literal::CreateR0<float>(1.0);
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto tuple = Literal::MakeTuple({scalar.get(), matrix.get()});
-  auto nested_tuple = Literal::MakeTuple({tuple.get(), scalar.get()});
+  auto scalar = LiteralUtil::CreateR0<float>(1.0);
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto tuple = LiteralUtil::MakeTuple({scalar.get(), matrix.get()});
+  auto nested_tuple = LiteralUtil::MakeTuple({tuple.get(), scalar.get()});
   // Verify that changing the underlying data beneath the view changes the
   // data of the view itself.
   const auto nested_tuple_view = LiteralSlice(*nested_tuple);
@@ -1436,15 +1443,16 @@ TEST_F(LiteralUtilTest, MutatingLiteralSlice) {
 }
 
 TEST_F(LiteralUtilTest, LiteralSliceOfALiteralSlice) {
-  auto scalar = Literal::CreateR0<float>(1.0);
-  auto matrix = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto tuple = Literal::MakeTuple({scalar.get(), matrix.get()});
-  auto nested_tuple = Literal::MakeTuple({tuple.get(), scalar.get()});
+  auto scalar = LiteralUtil::CreateR0<float>(1.0);
+  auto matrix = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto tuple = LiteralUtil::MakeTuple({scalar.get(), matrix.get()});
+  auto nested_tuple = LiteralUtil::MakeTuple({tuple.get(), scalar.get()});
 
   const auto nested_tuple_view = LiteralSlice(*nested_tuple);
   const auto tuple_view = LiteralSlice(nested_tuple_view, /*view_root=*/{0});
   const auto matrix_view = LiteralSlice(tuple_view, /*view_root=*/{1});
-  EXPECT_EQ(matrix_view, *Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}));
+  EXPECT_EQ(matrix_view,
+            *LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}));
 }
 
 TEST_F(LiteralUtilTest, BorrowingLiteralFromOneBufferPtr) {
@@ -1488,7 +1496,7 @@ TEST_F(LiteralUtilTest, BorrowingLiteralFromMultipleBufferPtrs) {
 
 TEST_F(LiteralUtilTest, LiteralMove) {
   std::unique_ptr<Literal> matrix =
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
   Literal literal(std::move(*matrix));
 
   EXPECT_TRUE(
@@ -1501,11 +1509,11 @@ TEST_F(LiteralUtilTest, LiteralMove) {
 
 TEST_F(LiteralUtilTest, DecomposeTuple) {
   Literal nil_literal(ShapeUtil::MakeNil());
-  auto nested_tuple = Literal::MakeTuple(
-      {Literal::CreateR2<int32>({{1, 2}, {3, 4}}).get(),
-       Literal::MakeTuple({Literal::CreateR0<int32>(42).get(),
-                           Literal::CreateR1<double>({23.0, 44.0}).get(),
-                           &nil_literal})
+  auto nested_tuple = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR2<int32>({{1, 2}, {3, 4}}).get(),
+       LiteralUtil::MakeTuple(
+           {LiteralUtil::CreateR0<int32>(42).get(),
+            LiteralUtil::CreateR1<double>({23.0, 44.0}).get(), &nil_literal})
            .get(),
        &nil_literal});
 
@@ -1542,13 +1550,13 @@ TEST_F(LiteralUtilTest, DecomposeEmptyTuple) {
 
 TEST_F(LiteralUtilTest, MoveIntoTuple) {
   std::vector<Literal> elements;
-  elements.push_back(std::move(*Literal::CreateR0<float>(1.0)));
-  elements.push_back(std::move(*Literal::CreateR1<int32>({4, 8})));
-  elements.push_back(std::move(
-      *Literal::MakeTuple({Literal::CreateR0<int32>(42).get(),
-                           Literal::CreateR1<double>({23.0, 44.0}).get()})
+  elements.push_back(std::move(*LiteralUtil::CreateR0<float>(1.0)));
+  elements.push_back(std::move(*LiteralUtil::CreateR1<int32>({4, 8})));
+  elements.push_back(std::move(*LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<int32>(42).get(),
+       LiteralUtil::CreateR1<double>({23.0, 44.0}).get()})
 
-          ));
+                                   ));
 
   Literal literal = Literal::MoveIntoTuple(&elements);
   ASSERT_TRUE(ShapeUtil::IsTuple(literal.shape()));
@@ -1577,7 +1585,7 @@ TEST_F(LiteralUtilTest, LiteralMoveAssignment) {
   EXPECT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeNil(), literal.shape()));
 
   std::unique_ptr<Literal> matrix =
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
   literal = std::move(*matrix);
 
   EXPECT_TRUE(
@@ -1590,7 +1598,7 @@ TEST_F(LiteralUtilTest, LiteralMoveAssignment) {
 
 TEST_F(LiteralUtilTest, LiteralSliceCopy) {
   std::unique_ptr<Literal> matrix =
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
   const auto matrix_view = LiteralSlice(*matrix);
   LiteralSlice matrix_view_copy(matrix_view);
 
@@ -1601,9 +1609,9 @@ TEST_F(LiteralUtilTest, LiteralSliceCopy) {
 }
 
 TEST_F(LiteralUtilTest, GetSetTuple) {
-  auto tuple = Literal::MakeTuple(
-      {Literal::CreateR0<float>(42.0).get(),
-       Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get()});
+  auto tuple = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(42.0).get(),
+       LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get()});
   EXPECT_EQ(tuple->Get<float>(/*multi_index=*/{}, /*shape_index=*/{0}), 42.0);
   tuple->Set<float>(/*multi_index=*/{}, /*shape_index=*/{0}, -5.0);
   EXPECT_EQ(tuple->Get<float>(/*multi_index=*/{}, /*shape_index=*/{0}), -5.0);
@@ -1644,20 +1652,20 @@ TEST_F(LiteralUtilTest, CreateFromShapeZeroInitialized) {
 
 TEST_F(LiteralUtilTest, ProtoRoundTrip) {
   // Test serializing then deserializing a Literal through a proto.
-  auto one_f32 = Literal::CreateR0<float>(1.0);
-  auto two_f32 = Literal::CreateR0<float>(2.0);
-  auto vector_int8 = Literal::CreateR1<int8>({-128, 0, 2, 4, 7, 56, 127});
-  auto vector_c64 = Literal::CreateR1<complex64>({{1.0, 2.0}, {3.0, 4.0}});
-  auto vector_bfloat16 = Literal::CreateR1<bfloat16>(
+  auto one_f32 = LiteralUtil::CreateR0<float>(1.0);
+  auto two_f32 = LiteralUtil::CreateR0<float>(2.0);
+  auto vector_int8 = LiteralUtil::CreateR1<int8>({-128, 0, 2, 4, 7, 56, 127});
+  auto vector_c64 = LiteralUtil::CreateR1<complex64>({{1.0, 2.0}, {3.0, 4.0}});
+  auto vector_bfloat16 = LiteralUtil::CreateR1<bfloat16>(
       {bfloat16{-1.0}, bfloat16{2.0}, bfloat16{-3.0}});
   auto vector_half =
-      Literal::CreateR1<half>({half{10.0}, half{20.0}, half{-30.0}});
+      LiteralUtil::CreateR1<half>({half{10.0}, half{20.0}, half{-30.0}});
   auto matrix_pred =
-      Literal::CreateR2<bool>({{true, false, true}, {false, false, true}});
-  auto tuple = Literal::MakeTuple(
+      LiteralUtil::CreateR2<bool>({{true, false, true}, {false, false, true}});
+  auto tuple = LiteralUtil::MakeTuple(
       {one_f32.get(), vector_half.get(), matrix_pred.get(), matrix_pred.get()});
   Literal nil_literal(ShapeUtil::MakeNil());
-  auto nested_tuple = Literal::MakeTuple(
+  auto nested_tuple = LiteralUtil::MakeTuple(
       {tuple.get(), vector_bfloat16.get(), tuple.get(), &nil_literal});
 
   auto to_from_proto = [](const Literal& literal) -> Literal {
@@ -1790,8 +1798,8 @@ TEST_F(LiteralUtilTest, InvalidProtoTooManyTupleElements) {
 }
 
 TEST_F(LiteralUtilTest, SortSparseElements) {
-  auto literal =
-      Literal::CreateSparse<float>({10, 10, 10}, SparseIndexArray(10, 3), {});
+  auto literal = LiteralUtil::CreateSparse<float>({10, 10, 10},
+                                                  SparseIndexArray(10, 3), {});
   literal->AppendSparseElement<float>({2, 3, 4}, 2.0);
   literal->AppendSparseElement<float>({3, 4, 5}, 3.0);
   literal->AppendSparseElement<float>({1, 2, 3}, 1.0);
@@ -1805,21 +1813,22 @@ TEST_F(LiteralUtilTest, GetSparseElementAsString) {
   SparseIndexArray indices(10, {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}});
 
   ASSERT_EQ(
-      Literal::CreateSparse<bool>(dimensions, indices, {true, false, true})
+      LiteralUtil::CreateSparse<bool>(dimensions, indices, {true, false, true})
           ->GetSparseElementAsString(1),
       "false");
-  ASSERT_EQ(Literal::CreateSparse<int64>(dimensions, indices, {1, 2, 3})
+  ASSERT_EQ(LiteralUtil::CreateSparse<int64>(dimensions, indices, {1, 2, 3})
                 ->GetSparseElementAsString(1),
             tensorflow::strings::StrCat(int64{2}));
-  ASSERT_EQ(Literal::CreateSparse<double>(dimensions, indices, {1.0, 2.0, 3.0})
-                ->GetSparseElementAsString(1),
-            tensorflow::strings::StrCat(double{2.0}));
-  ASSERT_EQ(Literal::CreateSparse<half>(dimensions, indices,
-                                        {half{1.0}, half{2.0}, half{3.0}})
+  ASSERT_EQ(
+      LiteralUtil::CreateSparse<double>(dimensions, indices, {1.0, 2.0, 3.0})
+          ->GetSparseElementAsString(1),
+      tensorflow::strings::StrCat(double{2.0}));
+  ASSERT_EQ(LiteralUtil::CreateSparse<half>(dimensions, indices,
+                                            {half{1.0}, half{2.0}, half{3.0}})
                 ->GetSparseElementAsString(1),
             tensorflow::strings::StrCat(static_cast<float>(half{2.0})));
   ASSERT_EQ(
-      Literal::CreateSparse<complex64>(
+      LiteralUtil::CreateSparse<complex64>(
           dimensions, indices,
           std::vector<complex64>{{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}})
           ->GetSparseElementAsString(1),
@@ -1827,33 +1836,36 @@ TEST_F(LiteralUtilTest, GetSparseElementAsString) {
 }
 
 TEST_F(LiteralUtilTest, BroadcastVectorToMatrix0) {
-  std::unique_ptr<Literal> literal = Literal::CreateR1<int64>({1, 2});
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR1<int64>({1, 2});
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<Literal> broadcasted_literal,
       literal->Broadcast(
           /*result_shape=*/ShapeUtil::MakeShape(S64, {2, 2}),
           /*dimensions=*/{0}));
-  EXPECT_EQ(*broadcasted_literal, *Literal::CreateR2<int64>({{1, 1}, {2, 2}}));
+  EXPECT_EQ(*broadcasted_literal,
+            *LiteralUtil::CreateR2<int64>({{1, 1}, {2, 2}}));
 }
 
 TEST_F(LiteralUtilTest, BroadcastVectorToMatrix1) {
-  std::unique_ptr<Literal> literal = Literal::CreateR1<int64>({1, 2});
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR1<int64>({1, 2});
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<Literal> broadcasted_literal,
       literal->Broadcast(
           /*result_shape=*/ShapeUtil::MakeShape(S64, {2, 2}),
           /*dimensions=*/{1}));
-  EXPECT_EQ(*broadcasted_literal, *Literal::CreateR2<int64>({{1, 2}, {1, 2}}));
+  EXPECT_EQ(*broadcasted_literal,
+            *LiteralUtil::CreateR2<int64>({{1, 2}, {1, 2}}));
 }
 
 TEST_F(LiteralUtilTest, BroadcastScalarToMatrix) {
-  std::unique_ptr<Literal> literal = Literal::CreateR0<int32>(9);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR0<int32>(9);
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<Literal> broadcasted_literal,
       literal->Broadcast(
           /*result_shape=*/ShapeUtil::MakeShape(S32, {2, 2}),
           /*dimensions=*/{}));
-  EXPECT_EQ(*broadcasted_literal, *Literal::CreateR2<int32>({{9, 9}, {9, 9}}));
+  EXPECT_EQ(*broadcasted_literal,
+            *LiteralUtil::CreateR2<int32>({{9, 9}, {9, 9}}));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/literal_util.cc b/tensorflow/compiler/xla/literal_util.cc
index eeabf835ac..548fbe8a83 100644
--- a/tensorflow/compiler/xla/literal_util.cc
+++ b/tensorflow/compiler/xla/literal_util.cc
@@ -43,25 +43,6 @@ namespace xla {
 
 namespace {
 
-constexpr bool kLittleEndian = __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
-
-// Converts between little and big endian.
-//
-// Precondition: size % 2 == 0 (elements in the array are 16 bits long)
-void ConvertEndianShort(string* bytes) {
-  CHECK_EQ(bytes->size() / 2, 0);
-  for (int64 i = 0; i < bytes->size(); i += 2) {
-    std::swap((*bytes)[i], (*bytes)[i + 1]);
-  }
-}
-
-void ConvertEndianShort(char* bytes, int64 size) {
-  CHECK_EQ(size / 2, 0);
-  for (int64 i = 0; i < size; i += 2) {
-    std::swap(bytes[i], bytes[i + 1]);
-  }
-}
-
 // Return a literal with all arrays of type FromNativeT converted to type
 // ToNativeT in the given literal.
 template <typename FromNativeT, typename ToNativeT>
@@ -103,505 +84,54 @@ std::unique_ptr<Literal> ConvertType(LiteralSlice literal) {
 
 }  // namespace
 
-LiteralBase::~LiteralBase() {}
-
-std::ostream& operator<<(std::ostream& out, const Literal& literal) {
-  out << literal.ToString();
-  return out;
-}
-
-Literal::StrideConfig::StrideConfig(
-    const Shape& source_shape, const Shape& dest_shape,
-    tensorflow::gtl::ArraySlice<int64> dimensions)
-    : dimensions(dimensions),
-      base(dimensions.size(), 0),
-      step(dimensions.size(), 1) {
-  if (!dimensions.empty()) {
-    // Selects the shape with the largest minor dimension as the one upon
-    // which to run the tight stride loop.
-    if (dimensions[LayoutUtil::Minor(source_shape.layout(), 0)] >=
-        dimensions[LayoutUtil::Minor(dest_shape.layout(), 0)]) {
-      minor_dimension = LayoutUtil::Minor(source_shape.layout(), 0);
-      dest_stride = IndexUtil::GetDimensionStride(dest_shape, minor_dimension);
-    } else {
-      minor_dimension = LayoutUtil::Minor(dest_shape.layout(), 0);
-      source_stride =
-          IndexUtil::GetDimensionStride(source_shape, minor_dimension);
-    }
-    minor_loop_size = dimensions[minor_dimension];
-    step[minor_dimension] = minor_loop_size;
-  }
-}
-
-Literal::Literal(const Shape& shape)
-    : Literal(shape, /*allocate_arrays=*/true) {}
-
-void Literal::SetPiece(const Shape& shape, Piece* piece, bool allocate_arrays) {
-  if (ShapeUtil::IsTuple(shape)) {
-    for (int i = 0; i < ShapeUtil::TupleElementCount(shape); ++i) {
-      const Shape& subshape = shape.tuple_shapes(i);
-
-      auto child_piece = Piece();
-      child_piece.set_subshape(&subshape);
-
-      SetPiece(subshape, &child_piece, allocate_arrays);
-
-      piece->emplace_back(std::move(child_piece));
-    }
-  } else if (ShapeUtil::IsArray(shape)) {
-    if (allocate_arrays) {
-      if (LayoutUtil::IsSparseArray(shape)) {
-        // For sparse arrays, the buffer must be of the size of the maximum
-        // number of sparse elements possible.
-        const int64 max_sparse_elements =
-            LayoutUtil::MaxSparseElements(shape.layout());
-        piece->set_buffer(
-            new char[max_sparse_elements *
-                     ShapeUtil::ByteSizeOfPrimitiveType(shape.element_type())]);
-        piece->set_sparse_indices(
-            new SparseIndexArray(max_sparse_elements, ShapeUtil::Rank(shape)));
-      } else {
-        piece->set_buffer(new char[piece->size_bytes()]);
-      }
-    }
-  } else {
-    // If the shape is neither an array nor tuple, then it must be
-    // zero-sized. Otherwise, some memory needs to be allocated for it.
-    CHECK_EQ(piece->size_bytes(), 0);
-  }
-}
-
-Literal::Literal(const Shape& shape, bool allocate_arrays)
-    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
-  CHECK(LayoutUtil::HasLayout(*shape_));
-  root_piece_ = new Piece();
-  root_piece_->set_subshape(shape_.get());
-  CHECK(&root_piece_->subshape() == shape_.get());
-
-  SetPiece(*shape_, root_piece_, allocate_arrays);
-}
-
-Literal::~Literal() {
-  if (root_piece_ != nullptr) {
-    DeallocateBuffers();
-    delete root_piece_;
-  }
-}
-
-void Literal::DeallocateBuffers() {
-  root_piece_->ForEachMutableSubpiece(
-      [&](const ShapeIndex& index, Piece* piece) {
-        if (piece->buffer() != nullptr) {
-          delete[] piece->buffer();
-          delete piece->sparse_indices();
-        }
-      });
-}
-
-Literal::Literal(Literal&& other) : LiteralBase() { *this = std::move(other); }
-
-Literal& Literal::operator=(Literal&& other) {
-  DCHECK(&other.root_piece_->subshape() == other.shape_.get());
-  using std::swap;
-  swap(shape_, other.shape_);
-  swap(root_piece_, other.root_piece_);
-  DCHECK(&root_piece_->subshape() == shape_.get());
-
-  return *this;
-}
-
-std::unique_ptr<Literal> LiteralBase::CreateFromShape(const Shape& shape) {
-  auto literal = MakeUnique<Literal>(shape);
-  literal->root_piece_->ForEachMutableSubpiece(
-      [&](const ShapeIndex& index, Piece* piece) {
-        if (ShapeUtil::IsArray(piece->subshape())) {
-          memset(piece->untyped_data(), 0, piece->size_bytes());
-        }
-      });
-  return literal;
-}
-
-const SparseIndexArray* LiteralBase::sparse_indices(
-    const ShapeIndex& shape_index) const {
-  return piece(shape_index).sparse_indices();
-}
-
-SparseIndexArray* Literal::sparse_indices(const ShapeIndex& shape_index) {
-  return piece(shape_index).sparse_indices();
-}
-
-/* static */ std::unique_ptr<Literal> Literal::CreateFromDimensions(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateFromDimensions(
     PrimitiveType primitive_type,
     tensorflow::gtl::ArraySlice<int64> dimensions) {
-  return CreateFromShape(ShapeUtil::MakeShape(primitive_type, dimensions));
+  return Literal::CreateFromShape(
+      ShapeUtil::MakeShape(primitive_type, dimensions));
 }
 
-/* static */ std::unique_ptr<Literal> Literal::ConvertBF16ToF32(
+/* static */ std::unique_ptr<Literal> LiteralUtil::ConvertBF16ToF32(
     const LiteralSlice& bf16_literal) {
   return ConvertType<bfloat16, float>(bf16_literal);
 }
 
-/* static */ std::unique_ptr<Literal> Literal::ConvertF32ToBF16(
+/* static */ std::unique_ptr<Literal> LiteralUtil::ConvertF32ToBF16(
     const LiteralSlice& f32_literal) {
   return ConvertType<float, bfloat16>(f32_literal);
 }
 
-template <typename NativeT>
-Status Literal::CopySliceFromInternal(
-    const LiteralBase& src_literal, tensorflow::gtl::ArraySlice<int64> src_base,
-    tensorflow::gtl::ArraySlice<int64> dest_base,
-    tensorflow::gtl::ArraySlice<int64> copy_size) {
-  TF_RET_CHECK(ShapeUtil::Rank(src_literal.shape()) == src_base.size());
-  TF_RET_CHECK(ShapeUtil::Rank(shape()) == dest_base.size());
-
-  auto linear_index = [](const Shape& shape,
-                         tensorflow::gtl::ArraySlice<int64> multi_index) {
-    return IndexUtil::MultidimensionalIndexToLinearIndex(shape, multi_index);
-  };
-
-  if (ShapeUtil::Rank(src_literal.shape()) == 0 ||
-      ShapeUtil::Rank(shape()) == 0) {
-    // If any of the two shapes are scalars, we can just call the StridedCopy()
-    // directly, and we know we will be copying only one value.
-    TF_RET_CHECK(copy_size.empty());
-    StridedCopy(data<NativeT>(), linear_index(shape(), dest_base), 0,
-                src_literal.data<NativeT>(),
-                linear_index(src_literal.shape(), src_base), 0, 1);
-  } else if (!ShapeUtil::IsZeroElementArray(shape()) &&
-             !ShapeUtil::IsZeroElementArray(src_literal.shape())) {
-    // Perform copy if neither src nor dest has dimensions with zero element,
-    // otherwise it's a no-op.
-    TF_RET_CHECK(src_base.size() == dest_base.size());
-    TF_RET_CHECK(src_base.size() == copy_size.size());
-
-    // Scan the source from minor, stepping in copy size blocks, then within
-    // the index enumaration functor, do a strided copy advancing source index
-    // by one (walking through the minor dimension), and destination index by
-    // proper stride size at the matching dimension.
-    DimensionVector src_indexes(src_base.size(), 0);
-    DimensionVector dest_indexes(dest_base.size(), 0);
-    Literal::StrideConfig stride_config(src_literal.shape(), shape(),
-                                        copy_size);
-
-    auto copy_proc = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
-      // Map from multi-dimensional index, to source index.
-      std::transform(indexes.begin(), indexes.end(), src_base.begin(),
-                     src_indexes.begin(), std::plus<int64>());
-      // Map from multi-dimensional index, to destination index.
-      std::transform(indexes.begin(), indexes.end(), dest_base.begin(),
-                     dest_indexes.begin(), std::plus<int64>());
-
-      int64 src_index = linear_index(src_literal.shape(), src_indexes);
-      int64 dest_index = linear_index(shape(), dest_indexes);
-
-      // `this->` is needed to workaround MSVC bug: #16882
-      StridedCopy(this->data<NativeT>(), dest_index, stride_config.dest_stride,
-                  src_literal.data<NativeT>(), src_index,
-                  stride_config.source_stride, stride_config.minor_loop_size);
-      return true;
-    };
-
-    ShapeUtil::ForEachIndex(src_literal.shape(), stride_config.base,
-                            stride_config.dimensions, stride_config.step,
-                            copy_proc);
-  }
-  return Status::OK();
-}
-
-Status Literal::CopyElementFrom(const LiteralSlice& src_literal,
-                                tensorflow::gtl::ArraySlice<int64> src_index,
-                                tensorflow::gtl::ArraySlice<int64> dest_index) {
-  DCHECK_EQ(shape().element_type(), src_literal.shape().element_type());
-  const int64 src_linear_index = IndexUtil::MultidimensionalIndexToLinearIndex(
-      src_literal.shape(), src_index);
-  const int64 dest_linear_index =
-      IndexUtil::MultidimensionalIndexToLinearIndex(shape(), dest_index);
-  const int64 primitive_size =
-      ShapeUtil::ByteSizeOfPrimitiveType(shape().element_type());
-
-  char* dest_address =
-      static_cast<char*>(untyped_data()) + dest_linear_index * primitive_size;
-  const char* source_address =
-      static_cast<const char*>(src_literal.untyped_data()) +
-      src_linear_index * primitive_size;
-  if (dest_address != source_address) {
-    memcpy(dest_address, source_address, primitive_size);
-  }
-  return Status::OK();
-}
-
-/* static */ std::unique_ptr<Literal> Literal::CreateToken() {
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateToken() {
   return MakeUnique<Literal>(ShapeUtil::MakeTokenShape());
 }
 
-std::vector<Literal> Literal::DecomposeTuple() {
-  CHECK(ShapeUtil::IsTuple(shape()));
-  std::vector<Literal> elements;
-  for (int i = 0; i < ShapeUtil::TupleElementCount(shape()); ++i) {
-    elements.push_back(Literal(ShapeUtil::GetSubshape(shape(), {i}),
-                               /*allocate_arrays=*/false));
-    Literal& element = elements.back();
-    element.root_piece_->ForEachMutableSubpiece(
-        [&](const ShapeIndex& index, Piece* dest_piece) {
-          ShapeIndex src_index = {i};
-          for (int64 j : index) {
-            src_index.push_back(j);
-          }
-          Piece& src_piece = piece(src_index);
-
-          // Move the respective buffer and sparse indices over to the element
-          // Literal.
-          dest_piece->set_buffer(src_piece.buffer());
-          src_piece.set_buffer(nullptr);
-          dest_piece->set_sparse_indices(src_piece.sparse_indices());
-          src_piece.set_sparse_indices(nullptr);
-        });
-  }
-  // Set this literal to be nil-shaped.
-  *this = Literal();
-  return elements;
-}
-
-/* static */ Literal Literal::MoveIntoTuple(
-    tensorflow::gtl::MutableArraySlice<Literal> elements) {
-  std::vector<Shape> element_shapes;
-  for (const Literal& element : elements) {
-    element_shapes.push_back(element.shape());
-  }
-  Literal literal(ShapeUtil::MakeTupleShape(element_shapes),
-                  /*allocate_arrays=*/false);
-  for (int i = 0; i < elements.size(); ++i) {
-    TF_CHECK_OK(
-        literal.MoveFrom(std::move(elements[i]), /*dest_shape_index=*/{i}));
-  }
-  return literal;
-}
-
-namespace {
-
-// Copies the elements in 'src' to 'dest'. The shape and layout of the data in
-// the array slices are indicated by dest_shape and src_shape respectively.
-template <typename NativeT>
-void CopyElementsBetween(tensorflow::gtl::MutableArraySlice<NativeT> dest,
-                         tensorflow::gtl::ArraySlice<NativeT> src,
-                         const Shape& dest_shape, const Shape& src_shape) {
-  CHECK(ShapeUtil::Compatible(dest_shape, src_shape));
-  if (ShapeUtil::IsZeroElementArray(dest_shape)) {
-    return;
-  }
-  std::vector<int64> index(ShapeUtil::Rank(dest_shape));
-  do {
-    dest[IndexUtil::MultidimensionalIndexToLinearIndex(dest_shape, index)] =
-        src[IndexUtil::MultidimensionalIndexToLinearIndex(src_shape, index)];
-  } while (IndexUtil::BumpIndices(dest_shape, &index));
-}
-
-}  // namespace
-
-Status LiteralBase::Piece::CopyFrom(const LiteralBase::Piece& src) {
-  CHECK(subshape_ != nullptr);
-  CHECK(src.subshape_ != nullptr);
-  if (ShapeUtil::Equal(subshape(), src.subshape())) {
-    // If the layouts are equal it's faster just to memcpy.
-    memcpy(buffer(), src.buffer(), src.size_bytes());
-  } else {
-    TF_RET_CHECK(ShapeUtil::Compatible(src.subshape(), subshape()));
-    std::vector<int64> origin(ShapeUtil::Rank(subshape()), 0);
-    switch (subshape().element_type()) {
-#define COPY_ELEMENTS(XLA_T, NATIVE_T)                                    \
-  case (XLA_T):                                                           \
-    CopyElementsBetween<NATIVE_T>(data<NATIVE_T>(), src.data<NATIVE_T>(), \
-                                  subshape(), src.subshape());            \
-    break;
-      COPY_ELEMENTS(U8, uint8);
-      COPY_ELEMENTS(U16, uint16);
-      COPY_ELEMENTS(U32, uint32);
-      COPY_ELEMENTS(U64, uint64);
-      COPY_ELEMENTS(S8, int8);
-      COPY_ELEMENTS(S16, int16);
-      COPY_ELEMENTS(S32, int32);
-      COPY_ELEMENTS(S64, int64);
-      COPY_ELEMENTS(F16, half);
-      COPY_ELEMENTS(BF16, bfloat16);
-      COPY_ELEMENTS(F32, float);
-      COPY_ELEMENTS(F64, double);
-      COPY_ELEMENTS(C64, complex64);
-      COPY_ELEMENTS(PRED, bool);
-#undef COPY_ELEMENTS
-      default:
-        return Unimplemented(
-            "Copying a Literal object with element type %s is not implemented.",
-            PrimitiveType_Name(subshape().element_type()).c_str());
-    }
-  }
-  return Status::OK();
-}
-
-Status Literal::CopyFrom(const LiteralSlice& src_literal,
-                         const ShapeIndex& dest_shape_index,
-                         const ShapeIndex& src_shape_index) {
-  const Shape& dest_subshape =
-      ShapeUtil::GetSubshape(shape(), dest_shape_index);
-  const Shape& src_subshape =
-      ShapeUtil::GetSubshape(src_literal.shape(), src_shape_index);
-  if (!ShapeUtil::Compatible(dest_subshape, src_subshape)) {
-    return InvalidArgument(
-        "Destination subshape incompatible with source subshape: %s vs %s",
-        ShapeUtil::HumanString(dest_subshape).c_str(),
-        ShapeUtil::HumanString(src_subshape).c_str());
-  }
-  return root_piece_->ForEachMutableSubpieceWithStatus(
-      [&](const ShapeIndex& index, Piece* piece) {
-        if (!ShapeUtil::IsArray(piece->subshape())) {
-          return Status::OK();
-        }
-
-        // Determine if this index is in the part of this literal that we want
-        // to copy over from src_literal.
-        bool in_subtree_to_copy = true;
-        for (int i = 0; i < dest_shape_index.size(); ++i) {
-          if (index[i] != dest_shape_index[i]) {
-            in_subtree_to_copy = false;
-            break;
-          }
-        }
-        if (!in_subtree_to_copy) {
-          return Status::OK();
-        }
-        // Construct the index of the corresponding piece in the source literal.
-        ShapeIndex src_piece_index = src_shape_index;
-        for (int64 i = dest_shape_index.size(); i < index.size(); ++i) {
-          src_piece_index.push_back(index[i]);
-        }
-        TF_RETURN_IF_ERROR(piece->CopyFrom(src_literal.piece(src_piece_index)));
-        return Status::OK();
-      });
-}
-
-Status Literal::MoveFrom(Literal&& src_literal,
-                         const ShapeIndex& dest_shape_index) {
-  const Shape& dest_subshape =
-      ShapeUtil::GetSubshape(shape(), dest_shape_index);
-  if (!ShapeUtil::Equal(dest_subshape, src_literal.shape())) {
-    return InvalidArgument(
-        "Destination subshape not equal to source shape: %s vs %s",
-        ShapeUtil::HumanString(dest_subshape).c_str(),
-        ShapeUtil::HumanString(src_literal.shape()).c_str());
-  }
-
-  src_literal.root_piece_->ForEachSubpiece(
-      [&](const ShapeIndex& src_index, const Piece& src_piece) {
-        if (!ShapeUtil::IsArray(src_piece.subshape())) {
-          return;
-        }
-
-        ShapeIndex dest_index = dest_shape_index;
-        for (int64 i : src_index) {
-          dest_index.push_back(i);
-        }
-        Piece& dest_piece = piece(dest_index);
-        delete[] dest_piece.buffer();
-        dest_piece.set_buffer(src_piece.buffer());
-        delete dest_piece.sparse_indices();
-        dest_piece.set_sparse_indices(src_piece.sparse_indices());
-      });
-
-  src_literal.shape_ = MakeUnique<Shape>(ShapeUtil::MakeNil());
-  delete src_literal.root_piece_;
-  src_literal.root_piece_ = new LiteralBase::Piece();
-  src_literal.root_piece_->set_subshape(src_literal.shape_.get());
-
-  return Status::OK();
-}
-
-Status Literal::CopySliceFrom(const LiteralSlice& src_literal,
-                              tensorflow::gtl::ArraySlice<int64> src_base,
-                              tensorflow::gtl::ArraySlice<int64> dest_base,
-                              tensorflow::gtl::ArraySlice<int64> copy_size) {
-  TF_RET_CHECK(ShapeUtil::IsArray(shape())) << ShapeUtil::HumanString(shape());
-  TF_RET_CHECK(ShapeUtil::IsArray(src_literal.shape()))
-      << ShapeUtil::HumanString(src_literal.shape());
-  TF_RET_CHECK(ShapeUtil::SameElementType(src_literal.shape(), shape()));
-
-  switch (shape().element_type()) {
-    case U8:
-      return CopySliceFromInternal<uint8>(src_literal, src_base, dest_base,
-                                          copy_size);
-    case U16:
-      return CopySliceFromInternal<uint16>(src_literal, src_base, dest_base,
-                                           copy_size);
-    case U32:
-      return CopySliceFromInternal<uint32>(src_literal, src_base, dest_base,
-                                           copy_size);
-    case U64:
-      return CopySliceFromInternal<uint64>(src_literal, src_base, dest_base,
-                                           copy_size);
-    case S8:
-      return CopySliceFromInternal<int8>(src_literal, src_base, dest_base,
-                                         copy_size);
-    case S16:
-      return CopySliceFromInternal<int16>(src_literal, src_base, dest_base,
-                                          copy_size);
-    case S32:
-      return CopySliceFromInternal<int32>(src_literal, src_base, dest_base,
-                                          copy_size);
-    case S64:
-      return CopySliceFromInternal<int64>(src_literal, src_base, dest_base,
-                                          copy_size);
-    case F16:
-      return CopySliceFromInternal<half>(src_literal, src_base, dest_base,
-                                         copy_size);
-    case BF16:
-      return CopySliceFromInternal<bfloat16>(src_literal, src_base, dest_base,
-                                             copy_size);
-    case F32:
-      return CopySliceFromInternal<float>(src_literal, src_base, dest_base,
-                                          copy_size);
-    case F64:
-      return CopySliceFromInternal<double>(src_literal, src_base, dest_base,
-                                           copy_size);
-    case C64:
-      return CopySliceFromInternal<complex64>(src_literal, src_base, dest_base,
-                                              copy_size);
-    case PRED:
-      return CopySliceFromInternal<bool>(src_literal, src_base, dest_base,
-                                         copy_size);
-    default:
-      break;
-  }
-  return Unimplemented(
-      "Copying a slice from a Literal object with element type %d is not "
-      "implemented.",
-      shape().element_type());
-}
-
-/* static */ Literal Literal::Zero(PrimitiveType primitive_type) {
+/* static */ Literal LiteralUtil::Zero(PrimitiveType primitive_type) {
   switch (primitive_type) {
     case U8:
-      return std::move(*Literal::CreateR0<uint8>(0));
+      return std::move(*LiteralUtil::CreateR0<uint8>(0));
     case U32:
-      return std::move(*Literal::CreateR0<uint32>(0));
+      return std::move(*LiteralUtil::CreateR0<uint32>(0));
     case U64:
-      return std::move(*Literal::CreateR0<uint64>(0));
+      return std::move(*LiteralUtil::CreateR0<uint64>(0));
     case S8:
-      return std::move(*Literal::CreateR0<int8>(0));
+      return std::move(*LiteralUtil::CreateR0<int8>(0));
     case S32:
-      return std::move(*Literal::CreateR0<int32>(0));
+      return std::move(*LiteralUtil::CreateR0<int32>(0));
     case S64:
-      return std::move(*Literal::CreateR0<int64>(0));
+      return std::move(*LiteralUtil::CreateR0<int64>(0));
     case F16:
-      return std::move(*Literal::CreateR0<half>(static_cast<half>(0.0f)));
+      return std::move(*LiteralUtil::CreateR0<half>(static_cast<half>(0.0f)));
     case BF16:
       return std::move(
-          *Literal::CreateR0<bfloat16>(static_cast<bfloat16>(0.0f)));
+          *LiteralUtil::CreateR0<bfloat16>(static_cast<bfloat16>(0.0f)));
     case F32:
-      return std::move(*Literal::CreateR0<float>(0));
+      return std::move(*LiteralUtil::CreateR0<float>(0));
     case F64:
-      return std::move(*Literal::CreateR0<double>(0));
+      return std::move(*LiteralUtil::CreateR0<double>(0));
     case C64:
-      return std::move(*Literal::CreateR0<complex64>(0));
+      return std::move(*LiteralUtil::CreateR0<complex64>(0));
     case PRED:
-      return std::move(*Literal::CreateR0<bool>(false));
+      return std::move(*LiteralUtil::CreateR0<bool>(false));
     case S16:
     case U16:
       LOG(FATAL) << "u16/s16 literals not yet implemented";
@@ -614,33 +144,33 @@ Status Literal::CopySliceFrom(const LiteralSlice& src_literal,
   }
 }
 
-/* static */ Literal Literal::One(PrimitiveType primitive_type) {
+/* static */ Literal LiteralUtil::One(PrimitiveType primitive_type) {
   switch (primitive_type) {
     case U8:
-      return std::move(*Literal::CreateR0<uint8>(1));
+      return std::move(*LiteralUtil::CreateR0<uint8>(1));
     case U32:
-      return std::move(*Literal::CreateR0<uint32>(1));
+      return std::move(*LiteralUtil::CreateR0<uint32>(1));
     case U64:
-      return std::move(*Literal::CreateR0<uint64>(1));
+      return std::move(*LiteralUtil::CreateR0<uint64>(1));
     case S8:
-      return std::move(*Literal::CreateR0<int8>(1));
+      return std::move(*LiteralUtil::CreateR0<int8>(1));
     case S32:
-      return std::move(*Literal::CreateR0<int32>(1));
+      return std::move(*LiteralUtil::CreateR0<int32>(1));
     case S64:
-      return std::move(*Literal::CreateR0<int64>(1));
+      return std::move(*LiteralUtil::CreateR0<int64>(1));
     case F16:
-      return std::move(*Literal::CreateR0<half>(static_cast<half>(1.0f)));
+      return std::move(*LiteralUtil::CreateR0<half>(static_cast<half>(1.0f)));
     case BF16:
       return std::move(
-          *Literal::CreateR0<bfloat16>(static_cast<bfloat16>(1.0f)));
+          *LiteralUtil::CreateR0<bfloat16>(static_cast<bfloat16>(1.0f)));
     case F32:
-      return std::move(*Literal::CreateR0<float>(1));
+      return std::move(*LiteralUtil::CreateR0<float>(1));
     case F64:
-      return std::move(*Literal::CreateR0<double>(1));
+      return std::move(*LiteralUtil::CreateR0<double>(1));
     case C64:
-      return std::move(*Literal::CreateR0<complex64>(1));
+      return std::move(*LiteralUtil::CreateR0<complex64>(1));
     case PRED:
-      return std::move(*Literal::CreateR0<bool>(true));
+      return std::move(*LiteralUtil::CreateR0<bool>(true));
     case S16:
     case U16:
       LOG(FATAL) << "u16/s16 literals not yet implemented";
@@ -653,44 +183,44 @@ Status Literal::CopySliceFrom(const LiteralSlice& src_literal,
   }
 }
 
-/* static */ Literal Literal::MinValue(PrimitiveType primitive_type) {
+/* static */ Literal LiteralUtil::MinValue(PrimitiveType primitive_type) {
   switch (primitive_type) {
     case U8:
       return std::move(
-          *Literal::CreateR0<uint8>(std::numeric_limits<uint8>::min()));
+          *LiteralUtil::CreateR0<uint8>(std::numeric_limits<uint8>::min()));
     case U32:
       return std::move(
-          *Literal::CreateR0<uint32>(std::numeric_limits<uint32>::min()));
+          *LiteralUtil::CreateR0<uint32>(std::numeric_limits<uint32>::min()));
     case U64:
       return std::move(
-          *Literal::CreateR0<uint64>(std::numeric_limits<uint64>::min()));
+          *LiteralUtil::CreateR0<uint64>(std::numeric_limits<uint64>::min()));
     case S8:
       return std::move(
-          *Literal::CreateR0<int8>(std::numeric_limits<int8>::min()));
+          *LiteralUtil::CreateR0<int8>(std::numeric_limits<int8>::min()));
     case S32:
       return std::move(
-          *Literal::CreateR0<int32>(std::numeric_limits<int32>::min()));
+          *LiteralUtil::CreateR0<int32>(std::numeric_limits<int32>::min()));
     case S64:
       return std::move(
-          *Literal::CreateR0<int64>(std::numeric_limits<int64>::min()));
+          *LiteralUtil::CreateR0<int64>(std::numeric_limits<int64>::min()));
     case F32:
-      return std::move(
-          *Literal::CreateR0<float>(-std::numeric_limits<float>::infinity()));
+      return std::move(*LiteralUtil::CreateR0<float>(
+          -std::numeric_limits<float>::infinity()));
     case F64:
-      return std::move(
-          *Literal::CreateR0<double>(-std::numeric_limits<double>::infinity()));
+      return std::move(*LiteralUtil::CreateR0<double>(
+          -std::numeric_limits<double>::infinity()));
     case C64:
       LOG(FATAL) << "C64 element type has no minimum value";
     case PRED:
-      return std::move(*Literal::CreateR0<bool>(false));
+      return std::move(*LiteralUtil::CreateR0<bool>(false));
     case S16:
     case U16:
       LOG(FATAL) << "u16/s16 literals not yet implemented";
     case F16:
-      return std::move(*Literal::CreateR0<half>(
+      return std::move(*LiteralUtil::CreateR0<half>(
           static_cast<half>(-std::numeric_limits<float>::infinity())));
     case BF16:
-      return std::move(*Literal::CreateR0<bfloat16>(
+      return std::move(*LiteralUtil::CreateR0<bfloat16>(
           static_cast<bfloat16>(-std::numeric_limits<float>::infinity())));
     case TUPLE:
       LOG(FATAL) << "tuple element type has no minimum value";
@@ -701,42 +231,42 @@ Status Literal::CopySliceFrom(const LiteralSlice& src_literal,
   }
 }
 
-/* static */ Literal Literal::MaxValue(PrimitiveType primitive_type) {
+/* static */ Literal LiteralUtil::MaxValue(PrimitiveType primitive_type) {
   switch (primitive_type) {
     case U8:
       return std::move(
-          *Literal::CreateR0<uint8>(std::numeric_limits<uint8>::max()));
+          *LiteralUtil::CreateR0<uint8>(std::numeric_limits<uint8>::max()));
     case U32:
       return std::move(
-          *Literal::CreateR0<uint32>(std::numeric_limits<uint32>::max()));
+          *LiteralUtil::CreateR0<uint32>(std::numeric_limits<uint32>::max()));
     case U64:
       return std::move(
-          *Literal::CreateR0<uint64>(std::numeric_limits<uint64>::max()));
+          *LiteralUtil::CreateR0<uint64>(std::numeric_limits<uint64>::max()));
     case S8:
       return std::move(
-          *Literal::CreateR0<int8>(std::numeric_limits<int8>::max()));
+          *LiteralUtil::CreateR0<int8>(std::numeric_limits<int8>::max()));
     case S32:
       return std::move(
-          *Literal::CreateR0<int32>(std::numeric_limits<int32>::max()));
+          *LiteralUtil::CreateR0<int32>(std::numeric_limits<int32>::max()));
     case S64:
       return std::move(
-          *Literal::CreateR0<int64>(std::numeric_limits<int64>::max()));
+          *LiteralUtil::CreateR0<int64>(std::numeric_limits<int64>::max()));
     case F32:
-      return std::move(
-          *Literal::CreateR0<float>(std::numeric_limits<float>::infinity()));
+      return std::move(*LiteralUtil::CreateR0<float>(
+          std::numeric_limits<float>::infinity()));
     case F64:
-      return std::move(
-          *Literal::CreateR0<double>(std::numeric_limits<double>::infinity()));
+      return std::move(*LiteralUtil::CreateR0<double>(
+          std::numeric_limits<double>::infinity()));
     case PRED:
-      return std::move(*Literal::CreateR0<bool>(true));
+      return std::move(*LiteralUtil::CreateR0<bool>(true));
     case S16:
     case U16:
       LOG(FATAL) << "u16/s16 literals not yet implemented";
     case F16:
-      return std::move(*Literal::CreateR0<half>(
+      return std::move(*LiteralUtil::CreateR0<half>(
           static_cast<half>(std::numeric_limits<float>::infinity())));
     case BF16:
-      return std::move(*Literal::CreateR0<bfloat16>(
+      return std::move(*LiteralUtil::CreateR0<bfloat16>(
           static_cast<bfloat16>(std::numeric_limits<float>::infinity())));
     case TUPLE:
       LOG(FATAL) << "tuple element type has no maximum value";
@@ -747,7 +277,7 @@ Status Literal::CopySliceFrom(const LiteralSlice& src_literal,
   }
 }
 
-/* static */ std::unique_ptr<Literal> Literal::CreateR1(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR1(
     const tensorflow::core::Bitmap& values) {
   auto literal = MakeUnique<Literal>(
       ShapeUtil::MakeShape(PRED, {static_cast<int64>(values.bits())}));
@@ -755,17 +285,7 @@ Status Literal::CopySliceFrom(const LiteralSlice& src_literal,
   return literal;
 }
 
-void Literal::PopulateR1(const tensorflow::core::Bitmap& values) {
-  CHECK(ShapeUtil::IsArray(shape()));
-  CHECK_EQ(ShapeUtil::Rank(shape()), 1);
-  CHECK_EQ(element_count(), values.bits());
-  CHECK_EQ(shape().element_type(), PRED);
-  for (int64 i = 0; i < static_cast<int64>(values.bits()); ++i) {
-    Set({i}, values.get(i));
-  }
-}
-
-/* static */ std::unique_ptr<Literal> Literal::CreateR1U8(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR1U8(
     tensorflow::StringPiece value) {
   auto literal = MakeUnique<Literal>(
       ShapeUtil::MakeShape(U8, {static_cast<int64>(value.size())}));
@@ -775,116 +295,13 @@ void Literal::PopulateR1(const tensorflow::core::Bitmap& values) {
   return literal;
 }
 
-/* static */ std::unique_ptr<Literal> Literal::CreateR2F32Linspace(float from,
-                                                                   float to,
-                                                                   int64 rows,
-                                                                   int64 cols) {
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR2F32Linspace(
+    float from, float to, int64 rows, int64 cols) {
   auto value = MakeLinspaceArray2D(from, to, rows, cols);
   return CreateR2FromArray2D(*value);
 }
 
-std::unique_ptr<Literal> LiteralBase::Relayout(
-    const Layout& new_layout, const ShapeIndex& shape_index) const {
-  // Create new shape with 'new_layout' set at the given shape index.
-  Shape new_shape = shape();
-  Shape* subshape = ShapeUtil::GetMutableSubshape(&new_shape, shape_index);
-  TF_CHECK_OK(LayoutUtil::ValidateLayoutForShape(new_layout, *subshape));
-  *subshape->mutable_layout() = new_layout;
-  auto result = MakeUnique<Literal>(new_shape);
-  TF_CHECK_OK(result->CopyFrom(*this));
-  return result;
-}
-
-std::unique_ptr<Literal> LiteralBase::Relayout(
-    const Shape& shape_with_layout) const {
-  CHECK(ShapeUtil::Compatible(shape_with_layout, shape()))
-      << "Given shape_with_layout " << ShapeUtil::HumanString(shape_with_layout)
-      << " not compatible with literal shape "
-      << ShapeUtil::HumanString(shape());
-  std::unique_ptr<Literal> result = CreateFromShape(shape_with_layout);
-  ShapeUtil::ForEachSubshape(
-      result->shape(),
-      [this, &result](const Shape& subshape, const ShapeIndex& index) {
-        if (ShapeUtil::IsArray(subshape)) {
-          TF_CHECK_OK(result->CopyFrom(*this,
-                                       /*dest_shape_index=*/index,
-                                       /*src_shape_index=*/index));
-        }
-      });
-  return result;
-}
-
-StatusOr<std::unique_ptr<Literal>> LiteralBase::Broadcast(
-    const Shape& result_shape,
-    tensorflow::gtl::ArraySlice<int64> dimensions) const {
-  if (!ShapeUtil::IsArray(shape())) {
-    return InvalidArgument("Broadcast only supports arrays.");
-  }
-
-  for (int64 i = 0; i < dimensions.size(); i++) {
-    TF_RET_CHECK(shape().dimensions(i) ==
-                 result_shape.dimensions(dimensions[i]));
-  }
-
-  std::unique_ptr<Literal> result = MakeUnique<Literal>(result_shape);
-
-  // scratch_source_index is temporary storage space for the computed index into
-  // the input literal.  We put it here to avoid allocating an std::vector in
-  // every iteration of ShapeUtil::ForEachIndex.
-  std::vector<int64> scratch_source_index(shape().dimensions_size());
-
-  char* dest_data = static_cast<char*>(result->untyped_data());
-  const char* source_data = static_cast<const char*>(untyped_data());
-  const int64 primitive_size =
-      ShapeUtil::ByteSizeOfPrimitiveType(shape().element_type());
-
-  ShapeUtil::ForEachIndex(
-      result_shape, [&](tensorflow::gtl::ArraySlice<int64> output_index) {
-        for (int64 i = 0; i < dimensions.size(); ++i) {
-          scratch_source_index[i] = output_index[dimensions[i]];
-        }
-        int64 dest_index = IndexUtil::MultidimensionalIndexToLinearIndex(
-            result_shape, output_index);
-        int64 source_index = IndexUtil::MultidimensionalIndexToLinearIndex(
-            shape(), scratch_source_index);
-        memcpy(dest_data + primitive_size * dest_index,
-               source_data + primitive_size * source_index, primitive_size);
-        return true;
-      });
-
-  return std::move(result);
-}
-
-StatusOr<std::unique_ptr<Literal>> LiteralBase::Reshape(
-    tensorflow::gtl::ArraySlice<int64> dimensions) const {
-  if (!ShapeUtil::IsArray(shape())) {
-    return InvalidArgument("Reshape does not support tuples.");
-  }
-  std::unique_ptr<Literal> output;
-  if (!LayoutUtil::IsMonotonicWithDim0Major(shape().layout())) {
-    output =
-        Relayout(LayoutUtil::GetDefaultLayoutForRank(ShapeUtil::Rank(shape())));
-  } else {
-    output = CloneToUnique();
-  }
-  // Because the layout is monotonic, we can simply reuse the same sequence of
-  // values without changing their order.
-  *output->mutable_shape_do_not_use() =
-      ShapeUtil::MakeShape(shape().element_type(), dimensions);
-
-  int64 elements_before = ShapeUtil::ElementsIn(shape());
-  int64 elements_after = ShapeUtil::ElementsIn(output->shape());
-  if (elements_before != elements_after) {
-    return InvalidArgument(
-        "Shapes before and after Literal::Reshape have different numbers "
-        "of elements: %s vs %s.",
-        ShapeUtil::HumanString(shape()).c_str(),
-        ShapeUtil::HumanString(output->shape()).c_str());
-  }
-  return std::move(output);
-}
-
-/* static */ std::unique_ptr<Literal> Literal::ReshapeSlice(
+/* static */ std::unique_ptr<Literal> LiteralUtil::ReshapeSlice(
     tensorflow::gtl::ArraySlice<int64> new_dimensions,
     tensorflow::gtl::ArraySlice<int64> minor_to_major,
     const LiteralSlice& literal) {
@@ -956,575 +373,64 @@ StatusOr<std::unique_ptr<Literal>> LiteralBase::Reshape(
   return new_literal;
 }
 
-std::unique_ptr<Literal> LiteralBase::Transpose(
-    tensorflow::gtl::ArraySlice<int64> permutation) const {
-  CHECK(ShapeUtil::IsArray(shape())) << "Tuple is not supported for transpose";
-  CHECK(IsPermutation(permutation, ShapeUtil::Rank(shape())))
-      << "Given permutation is not a permutation of dimension numbers";
-  // To transpose the array, we just permute the dimensions and layout, and
-  // do a straight memory copy of the raw data set.
-  // This is considerably faster than iterating over every array element using
-  // the EachCell<>() and Set<>() APIs.
-  std::vector<int64> inverse_permutation = InversePermutation(permutation);
-  Shape permuted_shape =
-      ShapeUtil::PermuteDimensions(inverse_permutation, shape());
-  // Replace the layout with one affine to this shape, such that a
-  // transpose operation can be performed by leaving the flat values
-  // representation intact.
-  // For example, consider the shape F32[11,8]{1,0} under a {1,0} permutation.
-  // The shape with affine layout resulting from that operation will be
-  // F32[8,11]{0,1}, since it leaves the original most minor (the 8 sized), the
-  // most minor.
-  //
-  // Essentially, given MinMaj(Di) the position of the Di dimension within the
-  // minor to major vector, and given T(Di) the index that the original Di
-  // dimension has within the transposed array, a layout is affine if
-  // MinMaj(Di) == TMinMaj(T(Di)), with TMinMaj() being the minor to major
-  // vector of the affine layout.
-  CHECK(LayoutUtil::IsDenseArray(permuted_shape));
-  Layout* layout = permuted_shape.mutable_layout();
-  layout->clear_minor_to_major();
-  for (auto index : LayoutUtil::MinorToMajor(shape())) {
-    layout->add_minor_to_major(inverse_permutation[index]);
-  }
-  auto new_literal = MakeUnique<Literal>(permuted_shape);
-  DCHECK_EQ(ShapeUtil::ByteSizeOf(new_literal->shape()),
-            ShapeUtil::ByteSizeOf(shape()));
-  std::memcpy(new_literal->untyped_data(), untyped_data(), size_bytes());
-  return new_literal;
-}
-
-template <typename NativeT>
-std::unique_ptr<Literal> LiteralBase::SliceInternal(
-    const Shape& result_shape,
-    tensorflow::gtl::ArraySlice<int64> start_indices) const {
-  auto result_literal = MakeUnique<Literal>(result_shape);
-  DimensionVector new_indices(ShapeUtil::Rank(result_shape));
-  result_literal->EachCell<NativeT>(
-      [&](tensorflow::gtl::ArraySlice<int64> indices, NativeT /*value*/) {
-        for (int64 i = 0; i < ShapeUtil::Rank(result_shape); ++i) {
-          new_indices[i] = indices[i] + start_indices[i];
-        }
-        NativeT value = Get<NativeT>(new_indices);
-        result_literal->Set<NativeT>(indices, value);
-      });
-  return result_literal;
-}
-
-std::unique_ptr<Literal> LiteralBase::Slice(
-    tensorflow::gtl::ArraySlice<int64> start_indices,
-    tensorflow::gtl::ArraySlice<int64> limit_indices) const {
-  CHECK(ShapeUtil::IsArray(shape())) << "tuple is not supported for slice";
-
-  DimensionVector result_dimensions;
-  for (int64 dnum = 0; dnum < ShapeUtil::Rank(shape()); ++dnum) {
-    CHECK_GE(start_indices[dnum], 0);
-    CHECK_LE(limit_indices[dnum], shape().dimensions(dnum))
-        << "dnum = " << dnum;
-    int64 dimension = limit_indices[dnum] - start_indices[dnum];
-    CHECK_GE(dimension, 0) << "dnum = " << dnum;
-    result_dimensions.push_back(dimension);
-  }
-  const auto result_shape =
-      ShapeUtil::MakeShapeWithLayout(shape().element_type(), result_dimensions,
-                                     LayoutUtil::MinorToMajor(shape()));
-  switch (result_shape.element_type()) {
-    case F32:
-      return SliceInternal<float>(result_shape, start_indices);
-    case BF16:
-      return SliceInternal<bfloat16>(result_shape, start_indices);
-    case C64:
-      return SliceInternal<complex64>(result_shape, start_indices);
-    case S32:
-      return SliceInternal<int32>(result_shape, start_indices);
-    case U32:
-      return SliceInternal<uint32>(result_shape, start_indices);
-    default:
-      LOG(FATAL) << "not yet implemented: "
-                 << PrimitiveType_Name(result_shape.element_type());
-  }
-}
-
-Literal LiteralBase::Clone() const {
-  Literal result(shape());
-  TF_CHECK_OK(result.CopyFrom(*this));
-  return result;
-}
-
-std::unique_ptr<Literal> LiteralBase::CloneToUnique() const {
-  auto result = MakeUnique<Literal>(shape());
-  TF_CHECK_OK(result->CopyFrom(*this));
-  return result;
-}
-
-string LiteralBase::GetAsString(tensorflow::gtl::ArraySlice<int64> multi_index,
-                                const ShapeIndex& shape_index) const {
-  const Shape& subshape = ShapeUtil::GetSubshape(shape(), shape_index);
-  CHECK(LayoutUtil::IsDenseArray(subshape));
-  switch (subshape.element_type()) {
-    case PRED:
-      return Get<bool>(multi_index, shape_index) ? "true" : "false";
-    case S8:
-      return StrCat(Get<int8>(multi_index, shape_index));
-    case S16:
-      return StrCat(Get<int16>(multi_index, shape_index));
-    case S32:
-      return StrCat(Get<int32>(multi_index, shape_index));
-    case S64:
-      return StrCat(Get<int64>(multi_index, shape_index));
-    case U8:
-      return StrCat(Get<uint8>(multi_index, shape_index));
-    case U16:
-      return StrCat(Get<uint16>(multi_index, shape_index));
-    case U32:
-      return StrCat(Get<uint32>(multi_index, shape_index));
-    case U64:
-      return StrCat(Get<uint64>(multi_index, shape_index));
-    case F16:
-      return StrCat(static_cast<float>(Get<half>(multi_index, shape_index)));
-    case F32:
-      return StrCat(Get<float>(multi_index, shape_index));
-    case BF16:
-      return StrCat(
-          static_cast<float>(Get<bfloat16>(multi_index, shape_index)));
-    case F64:
-      return StrCat(Get<double>(multi_index, shape_index));
-    case C64: {
-      complex64 c = Get<complex64>(multi_index, shape_index);
-      return StrCat("(", c.real(), ", ", c.imag(), ")");
-    }
-    default:
-      LOG(FATAL) << PrimitiveType_Name(subshape.element_type());
-  }
-}
-
-string LiteralBase::GetSparseElementAsString(
-    int64 sparse_element_number, const ShapeIndex& shape_index) const {
-  const Shape& subshape = ShapeUtil::GetSubshape(shape(), shape_index);
-  CHECK(LayoutUtil::IsSparseArray(subshape));
-  switch (subshape.element_type()) {
-    case PRED:
-      return GetSparseElement<bool>(sparse_element_number, shape_index)
-                 ? "true"
-                 : "false";
-    case S8:
-      return StrCat(GetSparseElement<int8>(sparse_element_number, shape_index));
-    case S16:
-      return StrCat(
-          GetSparseElement<int16>(sparse_element_number, shape_index));
-    case S32:
-      return StrCat(
-          GetSparseElement<int32>(sparse_element_number, shape_index));
-    case S64:
-      return StrCat(
-          GetSparseElement<int64>(sparse_element_number, shape_index));
-    case U8:
-      return StrCat(
-          GetSparseElement<uint8>(sparse_element_number, shape_index));
-    case U16:
-      return StrCat(
-          GetSparseElement<uint16>(sparse_element_number, shape_index));
-    case U32:
-      return StrCat(
-          GetSparseElement<uint32>(sparse_element_number, shape_index));
-    case U64:
-      return StrCat(
-          GetSparseElement<uint64>(sparse_element_number, shape_index));
-    case F16:
-      return StrCat(static_cast<float>(
-          GetSparseElement<half>(sparse_element_number, shape_index)));
-    case F32:
-      return StrCat(
-          GetSparseElement<float>(sparse_element_number, shape_index));
-    case BF16:
-      return StrCat(static_cast<float>(
-          GetSparseElement<bfloat16>(sparse_element_number, shape_index)));
-    case F64:
-      return StrCat(
-          GetSparseElement<double>(sparse_element_number, shape_index));
-    case C64: {
-      complex64 c =
-          GetSparseElement<complex64>(sparse_element_number, shape_index);
-      return StrCat("(", c.real(), ", ", c.imag(), ")");
-    }
-    default:
-      LOG(FATAL) << "Invalid element type for sparse arrays: "
-                 << PrimitiveType_Name(subshape.element_type());
-  }
-}
-
-StatusOr<int64> LiteralBase::GetIntegralAsS64(
-    tensorflow::gtl::ArraySlice<int64> multi_index) const {
-  CHECK(LayoutUtil::IsDenseArray(shape()));
-  switch (shape().element_type()) {
-    case PRED:
-      return Get<bool>(multi_index);
-    case U8:
-      return Get<uint8>(multi_index);
-    case S32:
-      return Get<int32>(multi_index);
-    case S64:
-      return Get<int64>(multi_index);
-    case U32:
-      return Get<uint32>(multi_index);
-    case U64:
-      return Get<uint64>(multi_index);
-    default:
-      return FailedPrecondition(
-          "Array element type is not integral: %s",
-          PrimitiveType_Name(shape().element_type()).c_str());
-  }
-}
-
-size_t LiteralBase::Hash() const {
-  using tensorflow::Hash64;
-  using tensorflow::Hash64Combine;
-
-  size_t hash_value = ShapeUtil::Hash(shape());
-
-  ShapeUtil::ForEachSubshape(
-      shape(), [&](const Shape& subshape, const ShapeIndex& index) {
-        if (!ShapeUtil::IsArray(subshape)) {
-          return;
-        }
-
-        CHECK(LayoutUtil::IsDense(subshape.layout()));
-        hash_value = Hash64Combine(
-            hash_value, Hash64(static_cast<const char*>(untyped_data(index)),
-                               size_bytes(index)));
-      });
-
-  return hash_value;
-}
-
-Status Literal::SetIntegralAsS64(tensorflow::gtl::ArraySlice<int64> multi_index,
-                                 int64 value) {
-  CHECK(LayoutUtil::IsDenseArray(shape()));
-  switch (shape().element_type()) {
-    case PRED:
-      Set<bool>(multi_index, value);
-      break;
-    case U8:
-      Set<uint8>(multi_index, value);
-      break;
-    case S32:
-      Set<int32>(multi_index, value);
-      break;
-    case S64:
-      Set<int64>(multi_index, value);
-      break;
-    case U32:
-      Set<uint32>(multi_index, value);
-      break;
-    case U64:
-      Set<uint64>(multi_index, value);
-      break;
-    default:
-      return FailedPrecondition(
-          "Array element type is not integral: %s",
-          PrimitiveType_Name(shape().element_type()).c_str());
-  }
-  return Status::OK();
-}
-
-tensorflow::gtl::ArraySlice<int64> LiteralBase::GetSparseIndex(
-    int64 sparse_element_number, const ShapeIndex& shape_index) const {
-  const Piece& p = piece(shape_index);
-  CHECK_GE(sparse_element_number, 0);
-  CHECK_LT(sparse_element_number, p.sparse_indices()->index_count());
-  return p.sparse_indices()->At(sparse_element_number);
-}
-
-void Literal::SortSparseElements(const ShapeIndex& shape_index) {
-  piece(shape_index).SortSparseElements();
-}
-
-Literal LiteralBase::GetFirstScalarLiteral() const {
-  CHECK(ShapeUtil::IsArray(shape()));
-  CHECK_GT(ShapeUtil::ElementsIn(shape()), 0);
-  switch (shape().element_type()) {
+/* static */ Literal LiteralUtil::GetFirstScalarLiteral(
+    const LiteralSlice& literal) {
+  CHECK(ShapeUtil::IsArray(literal.shape()));
+  CHECK_GT(ShapeUtil::ElementsIn(literal.shape()), 0);
+  switch (literal.shape().element_type()) {
     case PRED:
-      return std::move(*Literal::CreateR0<bool>(GetFirstElement<bool>()));
+      return std::move(
+          *LiteralUtil::CreateR0<bool>(literal.GetFirstElement<bool>()));
     // 8 bit types.
     case S8:
-      return std::move(*Literal::CreateR0<int8>(GetFirstElement<int8>()));
+      return std::move(
+          *LiteralUtil::CreateR0<int8>(literal.GetFirstElement<int8>()));
     case U8:
-      return std::move(*Literal::CreateR0<uint8>(GetFirstElement<uint8>()));
+      return std::move(
+          *LiteralUtil::CreateR0<uint8>(literal.GetFirstElement<uint8>()));
     // 16 bit types.
     case BF16:
-      return std::move(
-          *Literal::CreateR0<bfloat16>(GetFirstElement<bfloat16>()));
+      return std::move(*LiteralUtil::CreateR0<bfloat16>(
+          literal.GetFirstElement<bfloat16>()));
     case F16:
-      return std::move(*Literal::CreateR0<half>(GetFirstElement<half>()));
+      return std::move(
+          *LiteralUtil::CreateR0<half>(literal.GetFirstElement<half>()));
     case S16:
-      return std::move(*Literal::CreateR0<int16>(GetFirstElement<int16>()));
+      return std::move(
+          *LiteralUtil::CreateR0<int16>(literal.GetFirstElement<int16>()));
     case U16:
-      return std::move(*Literal::CreateR0<uint16>(GetFirstElement<uint16>()));
+      return std::move(
+          *LiteralUtil::CreateR0<uint16>(literal.GetFirstElement<uint16>()));
     // 32 bit types.
     case F32:
-      return std::move(*Literal::CreateR0<float>(GetFirstElement<float>()));
+      return std::move(
+          *LiteralUtil::CreateR0<float>(literal.GetFirstElement<float>()));
     case S32:
-      return std::move(*Literal::CreateR0<int32>(GetFirstElement<int32>()));
+      return std::move(
+          *LiteralUtil::CreateR0<int32>(literal.GetFirstElement<int32>()));
     case U32:
-      return std::move(*Literal::CreateR0<uint32>(GetFirstElement<uint32>()));
+      return std::move(
+          *LiteralUtil::CreateR0<uint32>(literal.GetFirstElement<uint32>()));
     // 64 bit types.
     case C64:
-      return std::move(
-          *Literal::CreateR0<complex64>(GetFirstElement<complex64>()));
+      return std::move(*LiteralUtil::CreateR0<complex64>(
+          literal.GetFirstElement<complex64>()));
     case F64:
-      return std::move(*Literal::CreateR0<double>(GetFirstElement<double>()));
-    case S64:
-      return std::move(*Literal::CreateR0<int64>(GetFirstElement<int64>()));
-    case U64:
-      return std::move(*Literal::CreateR0<uint64>(GetFirstElement<uint64>()));
-    default:
-      LOG(FATAL) << "Unhandled primitive type " << shape().element_type();
-  }
-}
-
-void LiteralBase::Piece::SortSparseElements() {
-  switch (subshape().element_type()) {
-    case PRED:
-      SortSparseElementsInternal<bool>();
-      break;
-    case S8:
-      SortSparseElementsInternal<int8>();
-      break;
-    case U8:
-      SortSparseElementsInternal<uint8>();
-      break;
-    case S16:
-      SortSparseElementsInternal<int16>();
-      break;
-    case U16:
-      SortSparseElementsInternal<uint16>();
-      break;
-    case S32:
-      SortSparseElementsInternal<int32>();
-      break;
-    case U32:
-      SortSparseElementsInternal<uint32>();
-      break;
+      return std::move(
+          *LiteralUtil::CreateR0<double>(literal.GetFirstElement<double>()));
     case S64:
-      SortSparseElementsInternal<int64>();
-      break;
+      return std::move(
+          *LiteralUtil::CreateR0<int64>(literal.GetFirstElement<int64>()));
     case U64:
-      SortSparseElementsInternal<uint64>();
-      break;
-    case F32:
-      SortSparseElementsInternal<float>();
-      break;
-    case F64:
-      SortSparseElementsInternal<double>();
-      break;
-    case C64:
-      SortSparseElementsInternal<complex64>();
-      break;
-    case F16:
-      SortSparseElementsInternal<half>();
-      break;
-    case BF16:
-      SortSparseElementsInternal<bfloat16>();
-      break;
+      return std::move(
+          *LiteralUtil::CreateR0<uint64>(literal.GetFirstElement<uint64>()));
     default:
-      LOG(FATAL) << "Element type not valid for sparse array: "
-                 << PrimitiveType_Name(subshape().element_type());
-  }
-}
-
-template <typename NativeT>
-void LiteralBase::Piece::SortSparseElementsInternal() {
-  CHECK(LayoutUtil::IsSparseArray(subshape()));
-  int64 num_elements = sparse_indices()->index_count();
-  auto values = data<NativeT>();
-  CHECK_LE(num_elements, values.size());
-  sparse_indices()->SortWithValues(
-      tensorflow::gtl::MutableArraySlice<NativeT>(values.data(), num_elements));
-}
-
-namespace {
-
-void ToStringHelper(const LiteralBase& literal, const ShapeIndex& shape_index,
-                    bool print_layout, std::vector<string>* pieces) {
-  const Shape& subshape = ShapeUtil::GetSubshape(literal.shape(), shape_index);
-  CHECK(LayoutUtil::HasLayout(literal.shape()));
-  CHECK(LayoutUtil::HasLayout(subshape));
-
-  auto shape_to_string = [print_layout](const Shape& shape) {
-    if (print_layout) {
-      return ShapeUtil::HumanStringWithLayout(shape);
-    } else {
-      return ShapeUtil::HumanString(shape);
-    }
-  };
-
-  // TODO(b/32894291): refactor this code to reduce code duplication.
-  if (ShapeUtil::IsTuple(subshape)) {
-    pieces->push_back(shape_to_string(subshape));
-    pieces->push_back(" (\n");
-    std::vector<string> tuple_pieces;
-    for (int i = 0; i < ShapeUtil::TupleElementCount(subshape); ++i) {
-      ShapeIndex element_index = shape_index;
-      element_index.push_back(i);
-      std::vector<string> element_pieces;
-      ToStringHelper(literal, element_index, print_layout, &element_pieces);
-      tuple_pieces.push_back(tensorflow::str_util::Join(element_pieces, ""));
-    }
-    pieces->push_back(tensorflow::str_util::Join(tuple_pieces, ",\n"));
-    pieces->push_back("\n)");
-    return;
-  }
-
-  if (ShapeUtil::IsToken(subshape)) {
-    pieces->push_back("token");
-    return;
-  }
-
-  if (LayoutUtil::IsSparseArray(subshape)) {
-    pieces->push_back(shape_to_string(subshape));
-    pieces->push_back("{");
-    int64 rank = ShapeUtil::Rank(subshape);
-    int64 num_elements = literal.sparse_element_count();
-    for (int64 i = 0; i < num_elements; ++i) {
-      if (i > 0) {
-        pieces->push_back(", ");
-      }
-      if (rank == 1) {
-        pieces->push_back(StrCat(literal.GetSparseIndex(i)[0]));
-        pieces->push_back(": ");
-      } else {
-        pieces->push_back("[");
-        pieces->push_back(
-            tensorflow::str_util::Join(literal.GetSparseIndex(i), ", "));
-        pieces->push_back("]: ");
-      }
-      pieces->push_back(literal.GetSparseElementAsString(i));
-    }
-    pieces->push_back("}");
-    return;
-  }
-
-  CHECK(LayoutUtil::IsDenseArray(subshape));
-
-  auto element_to_string =
-      [&](tensorflow::gtl::ArraySlice<int64> indices) -> string {
-    PrimitiveType element_type = subshape.element_type();
-    if (element_type == PRED) {
-      // We display predicates in a densely packed form.
-      return literal.Get<bool>(indices, shape_index) ? "1" : "0";
-    }
-    return ((!indices.empty() && indices.back() > 0) ? ", " : "") +
-           literal.GetAsString(indices, shape_index);
-  };
-
-  if (ShapeUtil::Rank(subshape) == 0) {
-    pieces->push_back(literal.GetAsString({}, shape_index));
-  } else if (ShapeUtil::Rank(subshape) == 1) {
-    pieces->push_back("{");
-    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
-      pieces->push_back(element_to_string({i0}));
-    }
-    pieces->push_back("}");
-  } else if (ShapeUtil::Rank(subshape) == 2) {
-    pieces->push_back(shape_to_string(subshape));
-    pieces->push_back(" {\n");
-    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
-      pieces->push_back("  { ");
-      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
-        pieces->push_back(element_to_string({i0, i1}));
-      }
-      pieces->push_back(" ");
-      pieces->push_back(i0 == subshape.dimensions(0) - 1 ? "}\n" : "},\n");
-    }
-    pieces->push_back("}");
-  } else if (ShapeUtil::Rank(subshape) == 3) {
-    pieces->push_back(shape_to_string(subshape));
-    pieces->push_back(" {\n");
-    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
-      pieces->push_back(i0 > 0 ? ",\n{" : "{");
-      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
-        pieces->push_back(i1 > 0 ? ",\n  { " : " { ");
-        for (int64 i2 = 0; i2 < subshape.dimensions(2); ++i2) {
-          pieces->push_back(element_to_string({i0, i1, i2}));
-        }
-        pieces->push_back(" }");
-      }
-      pieces->push_back(" }");
-    }
-    pieces->push_back("\n}");
-  } else if (ShapeUtil::Rank(subshape) == 4) {
-    pieces->push_back(shape_to_string(subshape));
-    pieces->push_back(" {\n");
-    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
-      pieces->push_back(Printf("  {  /*i0=%lld*/\n", i0));
-      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
-        pieces->push_back(Printf("    {  /*i1=%lld*/\n", i1));
-        for (int64 i2 = 0; i2 < subshape.dimensions(2); ++i2) {
-          pieces->push_back("      {");
-          for (int64 i3 = 0; i3 < subshape.dimensions(3); ++i3) {
-            pieces->push_back(element_to_string({i0, i1, i2, i3}));
-          }
-          pieces->push_back(i2 == subshape.dimensions(2) - 1 ? "}\n" : "},\n");
-        }
-        pieces->push_back(i1 == subshape.dimensions(1) - 1 ? "    }\n"
-                                                           : "    },\n");
-      }
-      pieces->push_back(i0 == subshape.dimensions(0) - 1 ? "  }\n" : "  },\n");
-    }
-    pieces->push_back("}");
-  } else if (ShapeUtil::Rank(subshape) == 5) {
-    pieces->push_back(shape_to_string(subshape));
-    pieces->push_back(" {\n");
-    for (int64 i0 = 0; i0 < subshape.dimensions(0); ++i0) {
-      pieces->push_back(Printf("  {  /*i0=%lld*/\n", i0));
-      for (int64 i1 = 0; i1 < subshape.dimensions(1); ++i1) {
-        pieces->push_back(Printf("    {  /*i1=%lld*/\n", i1));
-        for (int64 i2 = 0; i2 < subshape.dimensions(2); ++i2) {
-          pieces->push_back(Printf("      {  /*i2=%lld*/\n", i2));
-          for (int64 i3 = 0; i3 < subshape.dimensions(3); ++i3) {
-            pieces->push_back("        {");
-            for (int64 i4 = 0; i4 < subshape.dimensions(4); ++i4) {
-              pieces->push_back(element_to_string({i0, i1, i2, i3, i4}));
-            }
-            pieces->push_back(i3 == subshape.dimensions(3) - 1 ? "}\n"
-                                                               : "},\n");
-          }
-          pieces->push_back(i2 == subshape.dimensions(2) - 1 ? "      }\n"
-                                                             : "      },\n");
-        }
-        pieces->push_back(i1 == subshape.dimensions(1) - 1 ? "    }\n"
-                                                           : "    },\n");
-      }
-      pieces->push_back(i0 == subshape.dimensions(0) - 1 ? "  }\n" : "  },\n");
-    }
-    pieces->push_back("}");
-  } else {
-    pieces->push_back(shape_to_string(subshape));
-    pieces->push_back(" {");
-    literal.EachCellAsString(
-        [&](tensorflow::gtl::ArraySlice<int64> indices, const string& value) {
-          pieces->push_back(" ");
-          pieces->push_back(value);
-        });
-    pieces->push_back("}");
+      LOG(FATAL) << "Unhandled primitive type "
+                 << literal.shape().element_type();
   }
 }
 
-}  // namespace
-
-int64 LiteralBase::sparse_element_count() const {
-  CHECK(LayoutUtil::IsSparseArray(shape()));
-  return sparse_indices()->index_count();
-}
-
-string LiteralBase::ToString(bool print_layout) const {
-  std::vector<string> pieces;
-  CHECK(LayoutUtil::HasLayout(this->shape()));
-  ToStringHelper(*this, {}, print_layout, &pieces);
-  return tensorflow::str_util::Join(pieces, "");
-}
-
-/* static */ std::unique_ptr<Literal> Literal::MakeTuple(
+/* static */ std::unique_ptr<Literal> LiteralUtil::MakeTuple(
     tensorflow::gtl::ArraySlice<const Literal*> elements) {
   std::vector<Shape> element_shapes;
   for (const auto* element : elements) {
@@ -1537,7 +443,7 @@ string LiteralBase::ToString(bool print_layout) const {
   return literal;
 }
 
-/* static */ std::unique_ptr<Literal> Literal::MakeTupleFromSlices(
+/* static */ std::unique_ptr<Literal> LiteralUtil::MakeTupleFromSlices(
     tensorflow::gtl::ArraySlice<LiteralSlice> elements) {
   std::vector<Shape> element_shapes;
   for (const auto& element : elements) {
@@ -1550,7 +456,7 @@ string LiteralBase::ToString(bool print_layout) const {
   return literal;
 }
 
-/* static */ std::unique_ptr<Literal> Literal::MakeTupleOwned(
+/* static */ std::unique_ptr<Literal> LiteralUtil::MakeTupleOwned(
     std::vector<std::unique_ptr<Literal>> elements) {
   std::vector<Shape> element_shapes;
   element_shapes.reserve(elements.size());
@@ -1565,822 +471,9 @@ string LiteralBase::ToString(bool print_layout) const {
   return literal;
 }
 
-void LiteralBase::EachCellAsString(
-    const std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
-                             const string& value)>& per_cell) const {
-  if (ShapeUtil::IsZeroElementArray(shape())) {
-    return;
-  }
-  std::vector<int64> indices = IndexUtil::LinearIndexToMultidimensionalIndex(
-      shape(), /*linear_index=*/0);
-  do {
-    per_cell(indices, GetAsString(indices));
-  } while (IndexUtil::BumpIndices(shape(), &indices));
-}
-
-namespace {
-template <typename NativeSrcT, typename NativeDestT, typename ConverterType>
-std::unique_ptr<Literal> ConvertBetweenNativeTypesWithConverter(
-    const LiteralBase& src_literal, const ConverterType& converter) {
-  CHECK(ShapeUtil::IsArray(src_literal.shape()));
-  auto result_literal = MakeUnique<Literal>(ShapeUtil::ChangeElementType(
-      src_literal.shape(),
-      primitive_util::NativeToPrimitiveType<NativeDestT>()));
-  auto src_data = src_literal.data<NativeSrcT>();
-  auto dest_data = result_literal->template data<NativeDestT>();
-  int64 num_elements = src_literal.element_count();
-
-  for (int64 i = 0; i < num_elements; ++i) {
-    dest_data[i] = converter(src_data[i]);
-  }
-  return result_literal;
-}
-
-template <typename NativeSrcT, typename NativeDestT>
-std::unique_ptr<Literal> ConvertBetweenNativeTypes(
-    const LiteralBase& src_literal) {
-  auto converter = [](NativeSrcT src) { return static_cast<NativeDestT>(src); };
-  return ConvertBetweenNativeTypesWithConverter<NativeSrcT, NativeDestT>(
-      src_literal, converter);
-}
-
-template <typename NativeSrcT, typename NativeDestT>
-typename std::enable_if<(sizeof(NativeSrcT) == sizeof(NativeDestT)),
-                        std::unique_ptr<Literal>>::type
-BitcastBetweenNativeTypes(const LiteralBase& src_literal) {
-  auto converter = [](NativeSrcT src) {
-    return tensorflow::bit_cast<NativeDestT>(src);
-  };
-  return ConvertBetweenNativeTypesWithConverter<NativeSrcT, NativeDestT>(
-      src_literal, converter);
-}
-
-// This template specialization is here to make the compiler happy. bit_cast has
-// a static check that the types are the same size. This specialization should
-// never be used because the source and destination types are checked for
-// identical sizes higher up.
-template <typename NativeSrcT, typename NativeDestT>
-typename std::enable_if<(sizeof(NativeSrcT) != sizeof(NativeDestT)),
-                        std::unique_ptr<Literal>>::type
-BitcastBetweenNativeTypes(const LiteralBase& src_literal) {
-  LOG(FATAL) << "Invalid bitcast between types of different sizes.";
-}
-
-template <PrimitiveType primitive_src_type>
-std::unique_ptr<Literal> ConvertToC64(const LiteralBase& src_literal) {
-  CHECK(ShapeUtil::IsArray(src_literal.shape()));
-  auto result_literal = MakeUnique<Literal>(
-      ShapeUtil::ChangeElementType(src_literal.shape(), C64));
-  using NativeSrcT =
-      typename primitive_util::PrimitiveTypeToNative<primitive_src_type>::type;
-  tensorflow::gtl::ArraySlice<NativeSrcT> src_data =
-      src_literal.data<NativeSrcT>();
-  tensorflow::gtl::MutableArraySlice<complex64> dest_data =
-      result_literal->data<complex64>();
-  int64 num_elements = src_literal.element_count();
-  for (int64 i = 0; i < num_elements; ++i) {
-    dest_data[i] = complex64(static_cast<float>(src_data[i]), 0);
-  }
-  return result_literal;
-}
-
-template <PrimitiveType primitive_src_type, PrimitiveType primitive_dest_type>
-std::unique_ptr<Literal> ConvertIfTypesMatch(const LiteralBase& src_literal,
-                                             bool bitcast) {
-  CHECK_EQ(primitive_src_type, src_literal.shape().element_type());
-  if (bitcast) {
-    return BitcastBetweenNativeTypes<
-        typename primitive_util::PrimitiveTypeToNative<
-            primitive_src_type>::type,
-        typename primitive_util::PrimitiveTypeToNative<
-            primitive_dest_type>::type>(src_literal);
-  } else {
-    return ConvertBetweenNativeTypes<
-        typename primitive_util::PrimitiveTypeToNative<
-            primitive_src_type>::type,
-        typename primitive_util::PrimitiveTypeToNative<
-            primitive_dest_type>::type>(src_literal);
-  }
-}
-
-template <PrimitiveType primitive_src_type>
-StatusOr<std::unique_ptr<Literal>> ConvertIfDestTypeMatches(
-    const LiteralBase& src_literal, PrimitiveType primitive_dest_type,
-    bool bitcast) {
-  switch (primitive_dest_type) {
-#define CONVERT_IF_TYPES_MATCH(type)                                    \
-  case (type):                                                          \
-    return ConvertIfTypesMatch<primitive_src_type, (type)>(src_literal, \
-                                                           bitcast);
-    CONVERT_IF_TYPES_MATCH(PRED)
-    CONVERT_IF_TYPES_MATCH(S8)
-    CONVERT_IF_TYPES_MATCH(S32)
-    CONVERT_IF_TYPES_MATCH(S64)
-    CONVERT_IF_TYPES_MATCH(U8)
-    CONVERT_IF_TYPES_MATCH(U32)
-    CONVERT_IF_TYPES_MATCH(U64)
-    CONVERT_IF_TYPES_MATCH(F16)
-    CONVERT_IF_TYPES_MATCH(F32)
-    CONVERT_IF_TYPES_MATCH(F64)
-    CONVERT_IF_TYPES_MATCH(BF16)
-#undef CONVERT_IF_TYPES_MATCH
-    case C64:
-      if (!bitcast) {
-        return ConvertToC64<primitive_src_type>(src_literal);
-      }
-      break;
-    // Other types are not yet supported.
-    default:
-      break;
-  }
-  return Unimplemented(
-      "Converting from type %s to type %s is not implemented.",
-      PrimitiveType_Name(src_literal.shape().element_type()).c_str(),
-      PrimitiveType_Name(primitive_dest_type).c_str());
-}
-
-StatusOr<std::unique_ptr<Literal>> ConvertSwitch(
-    const LiteralBase& literal, PrimitiveType primitive_dest_type,
-    bool bitcast) {
-  TF_RET_CHECK(ShapeUtil::IsArray(literal.shape()));
-  if (literal.shape().element_type() == primitive_dest_type) {
-    return literal.CloneToUnique();
-  }
-  switch (literal.shape().element_type()) {
-#define CONVERT_IF_DEST_TYPE_MATCHES(type)                                \
-  case (type):                                                            \
-    return ConvertIfDestTypeMatches<(type)>(literal, primitive_dest_type, \
-                                            bitcast);
-    CONVERT_IF_DEST_TYPE_MATCHES(PRED)
-    CONVERT_IF_DEST_TYPE_MATCHES(S8)
-    CONVERT_IF_DEST_TYPE_MATCHES(S32)
-    CONVERT_IF_DEST_TYPE_MATCHES(S64)
-    CONVERT_IF_DEST_TYPE_MATCHES(U8)
-    CONVERT_IF_DEST_TYPE_MATCHES(U32)
-    CONVERT_IF_DEST_TYPE_MATCHES(U64)
-    CONVERT_IF_DEST_TYPE_MATCHES(F16)
-    CONVERT_IF_DEST_TYPE_MATCHES(F32)
-    CONVERT_IF_DEST_TYPE_MATCHES(F64)
-    CONVERT_IF_DEST_TYPE_MATCHES(BF16)
-#undef CONVERT_IF_DEST_TYPE_MATCHES
-      // Other types are not yet supported.
-    default:
-      return Unimplemented(
-          "%s from type %s to type %s is not implemented.",
-          (bitcast ? "Bitcast converting" : "Converting"),
-          PrimitiveType_Name(literal.shape().element_type()).c_str(),
-          PrimitiveType_Name(primitive_dest_type).c_str());
-  }
-}
-
-}  // namespace
-
-StatusOr<std::unique_ptr<Literal>> LiteralBase::Convert(
-    PrimitiveType primitive_dest_type) const {
-  return ConvertSwitch(*this, primitive_dest_type, /*bitcast=*/false);
-}
-
-StatusOr<std::unique_ptr<Literal>> LiteralBase::BitcastConvert(
-    PrimitiveType primitive_dest_type) const {
-  if (primitive_util::BitWidth(shape().element_type()) !=
-      primitive_util::BitWidth(primitive_dest_type)) {
-    return InvalidArgument(
-        "Cannot bitcast convert from %s to %s, bit widths are different: %d != "
-        "%d",
-        PrimitiveType_Name(shape().element_type()).c_str(),
-        PrimitiveType_Name(primitive_dest_type).c_str(),
-        primitive_util::BitWidth(shape().element_type()),
-        primitive_util::BitWidth(primitive_dest_type));
-  }
-  return ConvertSwitch(*this, primitive_dest_type, /*bitcast=*/true);
-}
-
-StatusOr<std::unique_ptr<Literal>> LiteralBase::ConvertToShape(
-    const Shape& dest_shape, bool round_f32_to_bf16) const {
-  if (!ShapeUtil::IsTuple(dest_shape)) {
-    if (round_f32_to_bf16 && shape().element_type() == F32 &&
-        dest_shape.element_type() == BF16) {
-      auto converter = [](float src) {
-        return tensorflow::bfloat16::round_to_bfloat16(src);
-      };
-      return ConvertBetweenNativeTypesWithConverter<float, bfloat16>(*this,
-                                                                     converter);
-    }
-    return Convert(dest_shape.element_type());
-  }
-  std::vector<Literal> elements;
-  for (int i = 0; i < ShapeUtil::TupleElementCount(shape()); ++i) {
-    auto element = LiteralSlice(*this, {i});
-    TF_ASSIGN_OR_RETURN(
-        auto new_element,
-        element.ConvertToShape(ShapeUtil::GetSubshape(dest_shape, {i})));
-    elements.push_back(std::move(*new_element));
-  }
-  auto converted = MakeUnique<Literal>();
-  *converted = Literal::MoveIntoTuple(&elements);
-  return std::move(converted);
-}
-
-template <typename NativeT>
-bool LiteralBase::Piece::EqualElementsInternal(
-    const LiteralBase::Piece& other, std::vector<int64>* multi_index) const {
-  if (multi_index->size() == ShapeUtil::Rank(subshape())) {
-    return (Get<NativeT>(*multi_index) == other.Get<NativeT>(*multi_index));
-  }
-  for (int64 i = 0; i < subshape().dimensions(multi_index->size()); ++i) {
-    multi_index->push_back(i);
-    if (!EqualElementsInternal<NativeT>(other, multi_index)) {
-      return false;
-    }
-    multi_index->pop_back();
-  }
-  return true;
-}
-
-bool LiteralBase::Piece::EqualElements(const LiteralBase::Piece& other) const {
-  DCHECK(ShapeUtil::Compatible(subshape(), other.subshape()));
-
-  std::vector<int64> multi_index;
-  switch (subshape().element_type()) {
-    case PRED:
-      return EqualElementsInternal<bool>(other, &multi_index);
-    case U8:
-      return EqualElementsInternal<uint8>(other, &multi_index);
-    case S32:
-      return EqualElementsInternal<int32>(other, &multi_index);
-    case S64:
-      return EqualElementsInternal<int64>(other, &multi_index);
-    case U32:
-      return EqualElementsInternal<uint32>(other, &multi_index);
-    case U64:
-      return EqualElementsInternal<uint64>(other, &multi_index);
-    case F32:
-      return EqualElementsInternal<float>(other, &multi_index);
-    case F64:
-      return EqualElementsInternal<double>(other, &multi_index);
-    case F16:
-      return EqualElementsInternal<half>(other, &multi_index);
-    case BF16:
-      return EqualElementsInternal<bfloat16>(other, &multi_index);
-    case C64:
-      return EqualElementsInternal<complex64>(other, &multi_index);
-    default:
-      LOG(FATAL) << "Unimplemented: LiteralBase::Piece::EqualElements for type "
-                 << PrimitiveType_Name(subshape().element_type());
-  }
-}
-
-bool LiteralBase::operator==(const LiteralBase& other) const {
-  if (!ShapeUtil::Compatible(shape(), other.shape())) {
-    return false;
-  }
-
-  return root_piece().ForEachSubpieceWithBool(
-      [&](const ShapeIndex& index, const Piece& piece) {
-        if (!ShapeUtil::IsArray(piece.subshape())) {
-          return true;
-        }
-
-        const Piece& other_piece = other.piece(index);
-        if (!piece.EqualElements(other_piece)) {
-          return false;
-        }
-        return true;
-      });
-}
-
-namespace {
-
-template <typename NativeT>
-static bool AllElementsEqualValue(tensorflow::gtl::ArraySlice<NativeT> data,
-                                  NativeT value) {
-  for (int64 i = 0; i < data.size(); ++i) {
-    if (data[i] != value) {
-      return false;
-    }
-  }
-  return true;
-}
-
-}  // namespace
-
-bool LiteralBase::IsAll(int8 value) const {
-  return root_piece().ForEachSubpieceWithBool([&](const ShapeIndex& index,
-                                                  const Piece& piece) {
-    if (!ShapeUtil::IsArray(piece.subshape())) {
-      return true;
-    }
-
-    auto piece_is_all = [&]() {
-      switch (shape().element_type()) {
-        case U8:
-          if (value >= 0) {
-            return AllElementsEqualValue<uint8>(piece.data<uint8>(), value);
-          }
-          return false;
-        case U32:
-          if (value >= 0) {
-            return AllElementsEqualValue<uint32>(piece.data<uint32>(), value);
-          }
-          return false;
-        case U64:
-          if (value >= 0) {
-            return AllElementsEqualValue<uint64>(piece.data<uint64>(), value);
-          }
-          return false;
-        case S8:
-          return AllElementsEqualValue<int8>(piece.data<int8>(), value);
-        case S32:
-          return AllElementsEqualValue<int32>(piece.data<int32>(), value);
-        case S64:
-          return AllElementsEqualValue<int64>(piece.data<int64>(), value);
-        case F32:
-          return AllElementsEqualValue<float>(piece.data<float>(), value);
-        case F64:
-          return AllElementsEqualValue<double>(piece.data<double>(), value);
-        case F16:
-          return AllElementsEqualValue<half>(piece.data<half>(),
-                                             static_cast<half>(value));
-        case BF16:
-          return AllElementsEqualValue<bfloat16>(piece.data<bfloat16>(),
-                                                 static_cast<bfloat16>(value));
-        case PRED:
-          if (value == 0) {
-            return AllElementsEqualValue<bool>(piece.data<bool>(), false);
-          }
-          if (value == 1) {
-            return AllElementsEqualValue<bool>(piece.data<bool>(), true);
-          }
-          return false;
-        default:
-          return false;
-      }
-      return false;
-    };
-
-    if (!piece_is_all()) {
-      return false;
-    }
-    return true;
-  });
-}
-
-bool LiteralBase::IsAllFloat(float value) const {
-  return root_piece().ForEachSubpieceWithBool(
-      [&](const ShapeIndex& index, const Piece& piece) {
-        if (!ShapeUtil::IsArray(piece.subshape())) {
-          return true;
-        }
-
-        auto piece_is_all = [&]() {
-          switch (shape().element_type()) {
-            case F32:
-              return AllElementsEqualValue<float>(piece.data<float>(), value);
-            case F64:
-              return AllElementsEqualValue<double>(piece.data<double>(), value);
-            case F16:
-              return AllElementsEqualValue<half>(piece.data<half>(),
-                                                 static_cast<half>(value));
-            case BF16:
-              return AllElementsEqualValue<bfloat16>(
-                  piece.data<bfloat16>(), static_cast<bfloat16>(value));
-            default:
-              return false;
-          }
-        };
-        if (!piece_is_all()) {
-          return false;
-        }
-        return true;
-      });
-}
-
-bool LiteralBase::IsAllComplex(complex64 value) const {
-  switch (shape().element_type()) {
-    case C64:
-      return AllElementsEqualValue<complex64>(root_piece().data<complex64>(),
-                                              value);
-    default:
-      return false;
-  }
-}
-
-bool LiteralBase::IsAllFirst() const {
-  return root_piece().ForEachSubpieceWithBool(
-      [&](const ShapeIndex& index, const Piece& piece) {
-        if (!ShapeUtil::IsArray(piece.subshape())) {
-          return true;
-        }
-
-        // Empty shapes are not all the first element since there is no first
-        // element.
-        if (ShapeUtil::IsZeroElementArray(piece.subshape())) {
-          return false;
-        }
-        auto piece_is_all = [&]() {
-          switch (piece.subshape().element_type()) {
-            case PRED: {
-              auto data = piece.data<bool>();
-              return AllElementsEqualValue<bool>(data, data[0]);
-            }
-            // 8 bit types
-            case S8: {
-              auto data = piece.data<int8>();
-              return AllElementsEqualValue<int8>(data, data[0]);
-            }
-            case U8: {
-              auto data = piece.data<uint8>();
-              return AllElementsEqualValue<uint8>(data, data[0]);
-            }
-            // 16 bit types
-            case BF16: {
-              auto data = piece.data<bfloat16>();
-              return AllElementsEqualValue<bfloat16>(data, data[0]);
-            }
-            case F16: {
-              auto data = piece.data<half>();
-              return AllElementsEqualValue<half>(data, data[0]);
-            }
-            case S16: {
-              auto data = piece.data<int16>();
-              return AllElementsEqualValue<int16>(data, data[0]);
-            }
-            case U16: {
-              auto data = piece.data<uint16>();
-              return AllElementsEqualValue<uint16>(data, data[0]);
-            }
-            // 32 bit types
-            case F32: {
-              auto data = piece.data<float>();
-              return AllElementsEqualValue<float>(data, data[0]);
-            }
-            case U32: {
-              auto data = piece.data<uint32>();
-              return AllElementsEqualValue<uint32>(data, data[0]);
-            }
-            case S32: {
-              auto data = piece.data<int32>();
-              return AllElementsEqualValue<int32>(data, data[0]);
-            }
-            // 64 bit types
-            case C64: {
-              auto data = piece.data<complex64>();
-              return AllElementsEqualValue<complex64>(data, data[0]);
-            }
-            case F64: {
-              auto data = piece.data<double>();
-              return AllElementsEqualValue<double>(data, data[0]);
-            }
-            case S64: {
-              auto data = piece.data<int64>();
-              return AllElementsEqualValue<int64>(data, data[0]);
-            }
-            case U64: {
-              auto data = piece.data<uint64>();
-              return AllElementsEqualValue<uint64>(data, data[0]);
-            }
-            default:
-              return false;
-          }
-        };
-
-        if (!piece_is_all()) {
-          return false;
-        }
-        return true;
-      });
-}
-
-bool LiteralBase::IsZero(tensorflow::gtl::ArraySlice<int64> indices) const {
-  CHECK(ShapeUtil::IsArray(shape()));
-  switch (shape().element_type()) {
-    case U8:
-      return Get<uint8>(indices) == 0;
-    case U32:
-      return Get<uint32>(indices) == 0;
-    case U64:
-      return Get<uint64>(indices) == 0;
-    case S8:
-      return Get<int8>(indices) == 0;
-    case S32:
-      return Get<int32>(indices) == 0;
-    case S64:
-      return Get<int64>(indices) == 0;
-    case F32:
-      return Get<float>(indices) == 0.0f;
-    case F64:
-      return Get<double>(indices) == 0.0;
-    case C64:
-      return Get<complex64>(indices) == complex64(0.0f, 0.0f);
-    case F16:
-      return Get<half>(indices) == static_cast<half>(0.0f);
-    case BF16:
-      return Get<bfloat16>(indices) == static_cast<bfloat16>(0.0f);
-    case PRED:
-      return Get<bool>(indices) == false;
-    default:
-      LOG(FATAL) << "Input literal must be an array.";
-  }
-}
-
-namespace {
-
-template <typename RepeatedFieldT, typename NativeT>
-void CopyToRepeatedField(RepeatedFieldT* dest,
-                         const tensorflow::gtl::ArraySlice<NativeT> src) {
-  *dest = RepeatedFieldT(src.begin(), src.end());
-}
-
-}  // namespace
-
-void LiteralBase::Piece::WriteToProto(LiteralProto* proto) const {
-  *proto->mutable_shape() = subshape();
-  switch (subshape().element_type()) {
-    case PRED:
-      CopyToRepeatedField(proto->mutable_preds(), data<bool>());
-      break;
-    case U8:
-      proto->set_u8s(static_cast<const unsigned char*>(data<uint8>().data()),
-                     element_count());
-      break;
-    case U32:
-      CopyToRepeatedField(proto->mutable_u32s(), data<uint32>());
-      break;
-    case U64:
-      CopyToRepeatedField(proto->mutable_u64s(), data<uint64>());
-      break;
-    case S32:
-      CopyToRepeatedField(proto->mutable_s32s(), data<int32>());
-      break;
-    case S64:
-      CopyToRepeatedField(proto->mutable_s64s(), data<int64>());
-      break;
-    case F16:
-      *proto->mutable_f16s() = string(
-          reinterpret_cast<const char*>(data<half>().data()), size_bytes());
-      if (!kLittleEndian) {
-        ConvertEndianShort(proto->mutable_f16s());
-      }
-      break;
-    case BF16:
-      *proto->mutable_bf16s() = string(
-          reinterpret_cast<const char*>(data<bfloat16>().data()), size_bytes());
-      if (!kLittleEndian) {
-        ConvertEndianShort(proto->mutable_bf16s());
-      }
-      break;
-    case F32:
-      CopyToRepeatedField(proto->mutable_f32s(), data<float>());
-      break;
-    case F64:
-      CopyToRepeatedField(proto->mutable_f64s(), data<double>());
-      break;
-    case C64:
-      for (complex64 value : data<complex64>()) {
-        proto->add_c64s(value.real());
-        proto->add_c64s(value.imag());
-      }
-      break;
-    case TUPLE:
-    case TOKEN:
-      // Nothing to do but assign the shape which is done above.
-      return;
-    default:
-      LOG(FATAL) << "Unhandled primitive type " << subshape().element_type();
-  }
-}
-
-const void* LiteralBase::Piece::untyped_data() const {
-  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
-  return buffer();
-}
-
-void* LiteralBase::Piece::untyped_data() {
-  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
-  return buffer();
-}
-
-namespace {
-
-template <typename RepeatedFieldT, typename NativeT>
-Status CopyFromRepeatedField(tensorflow::gtl::MutableArraySlice<NativeT> dest,
-                             const RepeatedFieldT& src) {
-  if (dest.size() != src.size()) {
-    return InvalidArgument(
-        "Expected %lu elements in LiteralProto repeated field, has %d",
-        dest.size(), src.size());
-  }
-  std::copy(src.begin(), src.end(), dest.begin());
-  return Status::OK();
-}
-
-}  // namespace
-
-Status LiteralBase::Piece::CopyFromProto(const LiteralProto& proto) {
-  // These conditions should have been checked in Literal::CreateFromProto.
-  TF_RET_CHECK(proto.has_shape());
-  TF_RET_CHECK(LayoutUtil::HasLayout(proto.shape()));
-  TF_RET_CHECK(ShapeUtil::Equal(proto.shape(), subshape()));
-
-  switch (subshape().element_type()) {
-    case PRED:
-      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<bool>(), proto.preds()));
-      break;
-    case U8: {
-      auto u8_data = data<uint8>();
-      TF_RET_CHECK(proto.u8s().size() == u8_data.size());
-      std::copy(proto.u8s().begin(), proto.u8s().end(), u8_data.begin());
-    } break;
-    case S32:
-      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<int32>(), proto.s32s()));
-      break;
-    case S64:
-      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<int64>(), proto.s64s()));
-      break;
-    case U32:
-      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<uint32>(), proto.u32s()));
-      break;
-    case U64:
-      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<uint64>(), proto.u64s()));
-      break;
-    case F16: {
-      const string& s(proto.f16s());
-      TF_RET_CHECK(data<half>().size() * sizeof(half) == s.size());
-      memcpy(untyped_data(), s.data(), s.size());
-      if (!kLittleEndian) {
-        ConvertEndianShort(reinterpret_cast<char*>(untyped_data()), s.size());
-      }
-    } break;
-
-    case BF16: {
-      const string& s(proto.bf16s());
-      TF_RET_CHECK(data<bfloat16>().size() * sizeof(bfloat16) == s.size());
-      memcpy(untyped_data(), s.data(), s.size());
-      if (!kLittleEndian) {
-        ConvertEndianShort(reinterpret_cast<char*>(untyped_data()), s.size());
-      }
-    } break;
-    case F32:
-      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<float>(), proto.f32s()));
-      break;
-    case F64:
-      TF_RETURN_IF_ERROR(CopyFromRepeatedField(data<double>(), proto.f64s()));
-      break;
-    case C64: {
-      auto complex_data = data<complex64>();
-      TF_RET_CHECK(proto.c64s_size() == complex_data.size() * 2);
-      for (int64 i = 0; i < complex_data.size(); ++i) {
-        complex_data[i] = complex64{proto.c64s(i * 2), proto.c64s(i * 2 + 1)};
-      }
-    } break;
-    case TUPLE:
-      LOG(FATAL) << "Should not be called on tuple shapes: "
-                 << ShapeUtil::HumanString(subshape());
-      break;
-    default:
-      LOG(FATAL) << "Unhandled primitive type " << subshape().element_type();
-  }
-  return Status::OK();
-}
-
-LiteralProto LiteralBase::ToProto() const {
-  LiteralProto proto;
-  root_piece().ForEachSubpiece(
-      [&](const ShapeIndex& index, const Piece& piece) {
-        LiteralProto* proto_piece = &proto;
-        for (int64 i : index) {
-          while (proto_piece->tuple_literals_size() <= i) {
-            proto_piece->add_tuple_literals();
-          }
-          proto_piece = proto_piece->mutable_tuple_literals(i);
-        }
-        piece.WriteToProto(proto_piece);
-      });
-
-  if (LayoutUtil::IsSparseArray(shape())) {
-    CopyToRepeatedField(proto.mutable_sparse_indices(),
-                        sparse_indices()->data());
-  }
-
-  return proto;
-}
-
-/* static */
-StatusOr<std::unique_ptr<Literal>> Literal::CreateFromProto(
-    const LiteralProto& proto) {
-  if (!proto.has_shape()) {
-    return InvalidArgument("LiteralProto has no shape");
-  }
-  if (!LayoutUtil::HasLayout(proto.shape())) {
-    return InvalidArgument("LiteralProto has no layout");
-  }
-
-  auto literal = MakeUnique<Literal>(proto.shape());
-
-  TF_RETURN_IF_ERROR(literal->root_piece_->ForEachMutableSubpieceWithStatus(
-      [&](const ShapeIndex& index, Piece* piece) {
-        const LiteralProto* proto_element = &proto;
-        for (int64 i : index) {
-          CHECK(i < proto_element->tuple_literals_size());
-          proto_element = &proto_element->tuple_literals(i);
-        }
-
-        if (ShapeUtil::IsTuple(piece->subshape())) {
-          if (proto_element->tuple_literals_size() !=
-              ShapeUtil::TupleElementCount(piece->subshape())) {
-            return InvalidArgument(
-                "Expected %lld tuple elements in LiteralProto, has %d",
-                ShapeUtil::TupleElementCount(piece->subshape()),
-                proto_element->tuple_literals_size());
-          }
-          return Status::OK();
-        }
-        if (piece->subshape().element_type() == TOKEN) {
-          return Status::OK();
-        }
-
-        CHECK(ShapeUtil::IsArray(piece->subshape()));
-        TF_RETURN_IF_ERROR(piece->CopyFromProto(*proto_element));
-
-        return Status::OK();
-      }));
-
-  return std::move(literal);
-}
-
-/* static */ string Literal::MultiIndexAsString(
+/* static */ string LiteralUtil::MultiIndexAsString(
     tensorflow::gtl::ArraySlice<int64> multi_index) {
   return StrCat("{", tensorflow::str_util::Join(multi_index, ","), "}");
 }
 
-const void* LiteralBase::untyped_data(const ShapeIndex& shape_index) const {
-  return piece(shape_index).untyped_data();
-}
-
-void* Literal::untyped_data(const ShapeIndex& shape_index) {
-  return piece(shape_index).untyped_data();
-}
-
-int64 LiteralBase::size_bytes(const ShapeIndex& shape_index) const {
-  return piece(shape_index).size_bytes();
-}
-
-string LiteralBase::GetR1U8AsString() const {
-  CHECK(ShapeUtil::IsArray(shape()));
-  CHECK_EQ(ShapeUtil::Rank(shape()), 1);
-  CHECK_EQ(shape().element_type(), U8);
-  return string(tensorflow::bit_cast<const char*>(data<uint8>().data()),
-                ShapeUtil::ElementsIn(shape()));
-}
-
-void BorrowingLiteral::BuildPieceSubtree(const Shape& shape, Piece* piece) {
-  CHECK(ShapeUtil::IsTuple(shape));
-  for (int i = 0; i < ShapeUtil::TupleElementCount(shape); ++i) {
-    const Shape& subshape = shape.tuple_shapes(i);
-
-    auto child_piece = Piece();
-    child_piece.set_subshape(&subshape);
-
-    if (ShapeUtil::IsTuple(subshape)) {
-      BuildPieceSubtree(subshape, &child_piece);
-    }
-
-    piece->emplace_back(std::move(child_piece));
-  }
-}
-
-LiteralSlice::LiteralSlice(const LiteralBase& literal)
-    : LiteralBase(), root_piece_(&literal.root_piece()) {}
-
-LiteralSlice::LiteralSlice(const LiteralBase& literal,
-                           const ShapeIndex& view_root)
-    : LiteralBase(), root_piece_(&literal.piece(view_root)) {}
-
-BorrowingLiteral::BorrowingLiteral(const char* src_buf_ptr, const Shape& shape)
-    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
-  CHECK(ShapeUtil::IsArray(*shape_));
-  CHECK(LayoutUtil::HasLayout(*shape_));
-
-  root_piece_ = Piece();
-  root_piece_.set_buffer(const_cast<char*>(src_buf_ptr));
-  root_piece_.set_subshape(shape_.get());
-}
-
-BorrowingLiteral::BorrowingLiteral(
-    tensorflow::gtl::ArraySlice<const char*> src_buf_ptrs, const Shape& shape)
-    : LiteralBase(), shape_(MakeUnique<Shape>(shape)) {
-  CHECK(ShapeUtil::IsTuple(*shape_));
-  CHECK(!ShapeUtil::IsNestedTuple(*shape_));
-  CHECK_EQ(src_buf_ptrs.size(), ShapeUtil::TupleElementCount(*shape_));
-  root_piece_ = Piece();
-  root_piece_.set_subshape(shape_.get());
-  BuildPieceSubtree(*shape_, &root_piece_);
-
-  for (int i = 0; i < src_buf_ptrs.size(); ++i) {
-    const auto& src_shape = shape_->tuple_shapes(i);
-    CHECK(ShapeUtil::IsArray(src_shape));
-    root_piece_.child(i).set_buffer(const_cast<char*>(src_buf_ptrs[i]));
-  }
-}
-
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/literal_util.h b/tensorflow/compiler/xla/literal_util.h
index 37ca8ea9f1..e3737a9d00 100644
--- a/tensorflow/compiler/xla/literal_util.h
+++ b/tensorflow/compiler/xla/literal_util.h
@@ -32,6 +32,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/index_util.h"
 #include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -51,679 +52,12 @@ limitations under the License.
 
 namespace xla {
 
-// Forward declare Literal and LiteralSlice class to be used by the creation
-// methods in the base class.
-class Literal;
-class LiteralSlice;
-
-// Abstract base class for literals.
-class LiteralBase {
+class LiteralUtil {
  public:
-  virtual ~LiteralBase() = 0;
-
-  // Literals are equal if they have compatible shapes and the same data
-  // values. Layout is not compared.
-  bool operator==(const LiteralBase& other) const;
-  bool operator!=(const LiteralBase& other) const { return !(*this == other); }
-
-  // Returns the shape of the literal.
-  const Shape& shape() const { return root_piece().subshape(); }
-
-  // Serialize to proto.
-  LiteralProto ToProto() const;
-
-  // Returns an ArraySlice of the array for this literal for the given NativeT
-  // (e.g., float). CHECKs if the subshape of the literal at the given
-  // ShapeIndex is not array. See primitive_util.h for the mapping from XLA type
-  // to native type.
-  template <typename NativeT>
-  tensorflow::gtl::ArraySlice<NativeT> data(
-      const ShapeIndex& shape_index = {}) const;
-
-  // Returns a const pointer to the sparse index array. Returns nullptr if the
-  // literal is not a sparse array.
-  const SparseIndexArray* sparse_indices(
-      const ShapeIndex& shape_index = {}) const;
-
-  // Returns a const pointer to (or size of) the underlying buffer holding the
-  // array at the given shape index. CHECKs if the subshape of the literal at
-  // the given ShapeIndex is not array.
-  const void* untyped_data(const ShapeIndex& shape_index = {}) const;
-  int64 size_bytes(const ShapeIndex& shape_index = {}) const;
-
-  // Returns this literal's data as a string. This literal must be a rank-1 U8
-  // array.
-  string GetR1U8AsString() const;
-
-  // Returns a string representation of the literal value.
-  // Warning: this function can take minutes for multi-million element Literals.
-  string ToString(bool print_layout = false) const;
-
-  // Gets an element in the literal at the given index. The multi_index is
-  // CHECKed against the dimension sizes.
-  template <typename NativeT>
-  NativeT Get(tensorflow::gtl::ArraySlice<int64> multi_index,
-              const ShapeIndex& shape_index) const;
-  // Overloads of Get for array literals. CHECKs if the literal is not
-  // array-shaped and dense.
-  template <typename NativeT>
-  NativeT Get(tensorflow::gtl::ArraySlice<int64> multi_index) const;
-
-  // Returns the element value at index (0, ..., 0), however many zeroes are
-  // required for that index.
-  template <typename NativeT>
-  NativeT GetFirstElement() const;
-
-  // As Get(), but determines the correct type and converts the value
-  // into text.
-  string GetAsString(tensorflow::gtl::ArraySlice<int64> multi_index,
-                     const ShapeIndex& shape_index = {}) const;
-  // As GetSparseElement(), but determines the correct type and converts the
-  // value into text.
-  string GetSparseElementAsString(int64 sparse_element_number,
-                                  const ShapeIndex& shape_index = {}) const;
-  // As Get(), but determines the correct type and converts the value into
-  // int64.  This literal must be an array.
-  StatusOr<int64> GetIntegralAsS64(
-      tensorflow::gtl::ArraySlice<int64> multi_index) const;
-
-  // Returns the multi-index of the element in a sparse literal at the given
-  // sparse element number.  The sparse element number is the position with in
-  // the sparse array's list of (index, value) pairs, and is checked against the
-  // total number of (index, value) pairs in the sparse array.
-  tensorflow::gtl::ArraySlice<int64> GetSparseIndex(
-      int64 sparse_element_number, const ShapeIndex& shape_index = {}) const;
-
-  // Returns the value of the element in a sparse literal at the given sparse
-  // element number.  The sparse element number is the position with in the
-  // sparse array's list of (index, value) pairs, and is checked against the
-  // total number of (index, value) pairs in the sparse array.
-  template <typename NativeT>
-  NativeT GetSparseElement(int64 sparse_element_number,
-                           const ShapeIndex& shape_index = {}) const;
-
-  // Invokes the "per cell" callback for each element in the provided
-  // literal with the element's indices and a string representation of
-  // the element's value.
-  //
-  // This function is useful if you want a polymorphic representation
-  // of the tensor's elements (turning it to a string for something
-  // like representation in a protobuf).
-  //
-  // This literal must have a dense layout.
-  void EachCellAsString(
-      const std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
-                               const string& value)>& per_cell) const;
-  template <typename NativeT>
-  void EachCell(std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
-                                   NativeT value)>
-                    per_cell) const;
-
-  // Returns whether every element in this literal is equal to value.
-  //
-  // value is an int8 because we expect this to be called with small
-  // compile-time constants (0, -1, etc.) and so that whatever value you pass
-  // can be represented exactly by floating-point types as small as 16 bits.
-  //
-  // If value doesn't fit in this literal's type, returns false.  Values of 1/0
-  // are considered equal to true/false; other values are not considered equal
-  // to true. Also if this literal is not array-shaped false is returned.
-  bool IsAll(int8 value) const;
-
-  // Like IsAll(const Literal&, int8), except we check whether the literal is
-  // equal to a particular floating-point number.
-  //
-  // If the literal is not a floating-point value, this always returns false.
-  //
-  // This casts value to the type of literal, then compares using ==.  The usual
-  // admonishments about floating-point equality checks apply.  We expect you to
-  // use this to check for values that can be expressed precisely as a float,
-  // e.g. -0.5.  Also if this literal is not array-shaped false is returned.
-  bool IsAllFloat(float value) const;
-
-  // Like IsAll(const Literal&, int8), except we check whether the literal is
-  // equal to a particular complex number.
-  //
-  // If the literal is not a complex value, this always returns false.
-  //
-  // This casts value to the type of literal, then compares using ==.  The usual
-  // admonishments about floating-point equality checks apply.  We expect you to
-  // use this to check for complex values that can be expressed precisely as
-  // float pairs e.g. (-0.5, 1.0).
-  //
-  // This literal must have a dense layout.
-  bool IsAllComplex(complex64 value) const;
-
-  // Literal consists entirely of the first element of the literal.
-  bool IsAllFirst() const;
-
-  // Returns whether this literal is zero at the specified index. This literal
-  // must be an array with a dense layout.
-  bool IsZero(tensorflow::gtl::ArraySlice<int64> indices) const;
-
-  // Returns the count of the elements in the array at the given shape index in
-  // this literal.
-  int64 element_count(const ShapeIndex& index = {}) const {
-    return ShapeUtil::ElementsIn(ShapeUtil::GetSubshape(shape(), index));
-  }
-
-  // Returns the count of the elements in the sparse array at the given shape
-  // index in this literal, which will be no larger than
-  // LayoutUtil::MaxSparseElements(SetSubshape(shape(), index).layout()).
-  int64 sparse_element_count() const;
-
-  // Compute a hash for this literal.  This literal must not be a sparse tensor
-  // or a tuple containing a sparse tensor.
-  size_t Hash() const;
-
-  // Converts this literal to the given shape. Returns an error is the
-  // conversion is not possible.
-  //
-  // round_f32_to_bf16: if true, converting F32 elements to BF16 uses rounding
-  // instead of truncation; otherwise, truncation is used.
-  //
-  // TODO(b/69266521): remove the round_to_bfloat16 flag when rounding becomes
-  // the default behavior.
-  StatusOr<std::unique_ptr<Literal>> ConvertToShape(
-      const Shape& dest_shape, bool round_f32_to_bf16 = false) const;
-
-  // Converts this literal to another primitive type using a bitcast
-  // conversion. The to and from primitive types must have the same bit
-  // width. Returns an error if the conversion is not possible. This literal
-  // must be array-shaped.
-  StatusOr<std::unique_ptr<Literal>> BitcastConvert(
-      PrimitiveType primitive_dest_type) const;
-
-  // Converts this literal to another primitive type. Returns an error if the
-  // conversion is not possible. This literal must be array-shaped.
-  StatusOr<std::unique_ptr<Literal>> Convert(
-      PrimitiveType primitive_dest_type) const;
+  LiteralUtil() = delete;
 
   // Returns a literal scalar representing the first element.
-  Literal GetFirstScalarLiteral() const;
-
-  // Clones the underlying buffers into a new Literal, or new
-  // std::unique_ptr<Literal>.
-  Literal Clone() const;
-  std::unique_ptr<Literal> CloneToUnique() const;
-
-  // TODO(b/67651157): The methods below which perform computation on Literals
-  // (Reshape, Slice, etc) should be moved elsewhere, and perhaps combined with
-  // evaluator code which operates on Literals.
-  //
-  // Creates a new value that has the equivalent value as this
-  // literal, but conforms to new_layout; e.g. a literal matrix that was in {0,
-  // 1} minor-to-major dimension layout can be re-layed-out as {1, 0}
-  // minor-to-major dimension layout and the value in the cell at any given
-  // logical index (i0, i1) will be the same.
-  //
-  // For tuple shaped literals, shape_index should be used to select the inner
-  // array that the new layout applies to.
-  //
-  // Note: this is useful when the client wants to ensure that a value placed in
-  // the XLA allocation tracker has a particular layout; for efficiency
-  // purposes or avoiding unimplemented operation/layout combinations.
-  std::unique_ptr<Literal> Relayout(const Layout& new_layout,
-                                    const ShapeIndex& shape_index = {}) const;
-
-  // An overload of Relayout which changes the layout of the entire shape rather
-  // than being limited to a single array within the shape.
-  std::unique_ptr<Literal> Relayout(const Shape& shape_with_layout) const;
-
-  // Creates a new literal by reshaping this literal to have the given
-  // dimensions. The total number of elements must not change; The
-  // implementation currently only supports monotonic dim0-major layouts.
-  // This literal must be an array.
-  StatusOr<std::unique_ptr<Literal>> Reshape(
-      tensorflow::gtl::ArraySlice<int64> dimensions) const;
-
-  // Creates a new literal by broadcasting this literal with `dimensions` to
-  // yield a literal of shape `result_shape`.
-  StatusOr<std::unique_ptr<Literal>> Broadcast(
-      const Shape& result_shape,
-      tensorflow::gtl::ArraySlice<int64> dimensions) const;
-
-  // Creates a new literal by reordering the dimensions of this literal.
-  // The given `permutation` must be a permutation of the dimension numbers
-  // in the original literal, and it specifies the order of the new dimensions
-  // in the result literal (i.e., new_order[i] = old_order[permutation[i]]).
-  // For example, a transpose call on a literal of shape [3 x 8 x 4] and
-  // `permutation` = {2, 0, 1} returns a new literal of shape [4 x 3 x 8].
-  // This literal must be an array.
-  std::unique_ptr<Literal> Transpose(
-      tensorflow::gtl::ArraySlice<int64> permutation) const;
-
-  // Creates a sub-array from this literal by extracting the indices
-  // [start_index, limit_index) of each dimension. The result literal has the
-  // same rank and layout as for the given literal. The number of indices in
-  // start_indices and limit_indices must be the rank of the literal, and the
-  // indices follow the order of the dimensions.
-  // This literal must be an array.
-  std::unique_ptr<Literal> Slice(
-      tensorflow::gtl::ArraySlice<int64> start_indices,
-      tensorflow::gtl::ArraySlice<int64> limit_indices) const;
-
-  // Creates a literal with a prepended dimension with bound "times"; e.g. a
-  // f32[3x2] with times=4 will produce a f32[4x3x2] with the 3x2 from this
-  // literal replicated four times.
-  // This literal must be an array.
-  template <typename NativeT>
-  std::unique_ptr<Literal> Replicate(int64 times) const;
-
-  // Creates a new Literal object with the shape specified as parameter.
-  // The content of the literal values is the default value of the primitive
-  // type of literal itself (0 for numeric types, and false for predicates).
-  //
-  // Note: It's an antipattern to use this method then immediately call
-  // Literal::Populate on the result (since that results in zero initialization,
-  // then reinitialization. Conside if a call to MakeUnique<Literal>(shape),
-  // followed by the call to Literal::Populate can be used instead.
-  static std::unique_ptr<Literal> CreateFromShape(const Shape& shape);
-
- protected:
-  // A data structure representing a subshape at a particular ShapeIndex within
-  // the literal. For array-shaped ShapeIndexes, this data structure holds the
-  // pointer to the memory allocated for the array data.
-  class Piece {
-   public:
-    // Returns the buffer holding the array data for this piece as an array
-    // slice. This piece must be array-shaped.
-    template <typename NativeT>
-    tensorflow::gtl::ArraySlice<NativeT> data() const;
-    template <typename NativeT>
-    tensorflow::gtl::MutableArraySlice<NativeT> data();
-
-    // Returns the buffer holding the array data for this piece as a void*. This
-    // piece must be array-shaped.
-    void* untyped_data();
-    const void* untyped_data() const;
-
-    // Gets or sets an element in the array at the given index. The multi_index
-    // is CHECKed against the dimension sizes of the array.  This piece must be
-    // array-shaped.
-    template <typename NativeT>
-    NativeT Get(tensorflow::gtl::ArraySlice<int64> index) const;
-    template <typename NativeT>
-    void Set(tensorflow::gtl::ArraySlice<int64> index, NativeT value);
-
-    // Gets/sets the buffer holding the array data.
-    char* buffer() const { return buffer_; }
-    void set_buffer(char* buffer) { buffer_ = buffer; }
-
-    // The array of multi-indices that provide the locations of non-zero
-    // elements in a sparse array.  Only used if
-    // LayoutUtil::IsSparseArray(shape()) is true.
-    SparseIndexArray* sparse_indices() const { return sparse_indices_; }
-    void set_sparse_indices(SparseIndexArray* sparse_indices) {
-      sparse_indices_ = sparse_indices;
-    }
-
-    // Gets or sets the subshape of this piece. This reference points to a
-    // subshape within the shape in the containing Literal (Literal::shape_).
-    const Shape& subshape() const { return *subshape_; }
-    void set_subshape(const Shape* subshape) { subshape_ = subshape; }
-
-    // Returns the size in bytes of the buffer holding the array data.
-    int64 size_bytes() const { return ShapeUtil::ByteSizeOf(subshape()); }
-
-    // Returns the number of elements in this piece's array.
-    int64 element_count() const {
-      // If this is a sparse array, use the number of elements represented by
-      // the indices in the associated SparseIndexArray.
-      return LayoutUtil::IsSparseArray(subshape())
-                 ? sparse_indices()->index_count()
-                 : ShapeUtil::ElementsIn(subshape());
-    }
-
-    // Returns the child piece at 'index' of this piece.
-    Piece& child(int64 index) { return children_[index]; }
-
-    // Adds a child piece to this piece's children.
-    void emplace_back(Piece child_piece) {
-      children_.emplace_back(std::move(child_piece));
-    }
-
-    // Returns the size of children pieces of this piece.
-    int64 children_size() { return children_.size(); }
-
-    // Visitor functions that recursively traverses the piece and calls the
-    // given function at each child piece. The function has the type:
-    //    void (const ShapeIndex& index, const Piece& piece)
-    template <typename Fn>
-    void ForEachSubpiece(const Fn& func) const {
-      ShapeIndex index;
-      return ForEachHelper(
-                 [&func](const ShapeIndex& index, const Piece& piece) {
-                   func(index, piece);
-                   return Status::OK();
-                 },
-                 *this, &index)
-          .IgnoreError();
-    }
-    // Same as above, but the function has the type:
-    //    Status (const ShapeIndex& index, const Piece& piece)
-    // The first non-OK return value is returned by the function.
-    template <typename Fn>
-    Status ForEachSubpieceWithStatus(const Fn& func) const {
-      ShapeIndex index;
-      return ForEachHelper(func, *this, &index);
-    }
-    // Same as above, but the function has the type:
-    //    Bool (const ShapeIndex& index, const Piece& piece)
-    // The first non-true return value is returned by the function.
-    template <typename Fn>
-    bool ForEachSubpieceWithBool(const Fn& func) const {
-      ShapeIndex index;
-      return ForEachHelperBool(func, *this, &index);
-    }
-    // Same as above, but the function has the type:
-    //    Void (const ShapeIndex& index, Piece& piece)
-    template <typename Fn>
-    void ForEachMutableSubpiece(const Fn& func) {
-      ShapeIndex index;
-      return ForEachMutableHelper(
-                 [&func](const ShapeIndex& index, Piece* piece) {
-                   func(index, piece);
-                   return Status::OK();
-                 },
-                 const_cast<xla::LiteralBase::Piece*>(this), &index)
-          .IgnoreError();
-    }
-    // Same as above, but the function has the type:
-    //    Status (const ShapeIndex& index, Piece& piece)
-    // The first non-OK return value is returned by the function.
-    template <typename Fn>
-    Status ForEachMutableSubpieceWithStatus(const Fn& func) {
-      ShapeIndex index;
-      return ForEachMutableHelper(
-          func, const_cast<xla::LiteralBase::Piece*>(this), &index);
-    }
-
-    // Returns true if this piece and 'other' contain the same data. This piece
-    // and 'other' must be array-shaped and compatible.
-    bool EqualElements(const Piece& other) const;
-
-    // Writes the shape and data (if array-shaped) into the given proto.
-    void WriteToProto(LiteralProto* proto) const;
-
-    // Copy the data from 'src' into this piece's buffer. Shapes of this piece
-    // and src must be compatible.
-    Status CopyFrom(const Piece& src);
-
-    // Copies the data from the given proto into this piece. The shape of this
-    // piece must be equal (not just compatible) to the shape of the proto.
-    Status CopyFromProto(const LiteralProto& proto);
-
-    // Sorts the elements in a sparse array.
-    void SortSparseElements();
-
-   private:
-    // Helpers for traversing the piece via ForEachSubpiece rooted at 'index'.
-    // The first non-OK (or non-true) value is returned by the function.
-    // The callable 'func' has the same signature as described above in
-    // ForEachSubpiece*.
-    template <typename Fn>
-    Status ForEachHelper(const Fn& func, const Piece& piece,
-                         ShapeIndex* index) const {
-      TF_RETURN_IF_ERROR(func(*index, piece));
-      for (int64 i = 0; i < piece.children_.size(); ++i) {
-        index->push_back(i);
-        TF_RETURN_IF_ERROR(ForEachHelper(func, piece.children_[i], index));
-        index->pop_back();
-      }
-      return Status::OK();
-    }
-    template <typename Fn>
-    bool ForEachHelperBool(const Fn& func, const Piece& piece,
-                           ShapeIndex* index) const {
-      if (!func(*index, piece)) {
-        return false;
-      }
-      for (int64 i = 0; i < piece.children_.size(); ++i) {
-        index->push_back(i);
-        if (!ForEachHelperBool(func, piece.children_[i], index)) {
-          return false;
-        }
-        index->pop_back();
-      }
-      return true;
-    }
-    template <typename Fn>
-    Status ForEachMutableHelper(const Fn& func, Piece* piece,
-                                ShapeIndex* index) {
-      TF_RETURN_IF_ERROR(func(*index, piece));
-      for (int64 i = 0; i < piece->children_.size(); ++i) {
-        index->push_back(i);
-        TF_RETURN_IF_ERROR(
-            ForEachMutableHelper(func, &piece->children_[i], index));
-        index->pop_back();
-      }
-      return Status::OK();
-    }
-
-    // Recursive helper for EqualElements.
-    template <typename NativeT>
-    bool EqualElementsInternal(const Piece& other,
-                               std::vector<int64>* multi_index) const;
-
-    // Helper for SortSparseElements that has the element type as a template
-    // parameter.
-    template <typename NativeT>
-    void SortSparseElementsInternal();
-
-    // For array-shaped pieces, this is the buffer holding the literal data.
-    char* buffer_ = nullptr;
-
-    // For sparse arrays, this is the array of indices.
-    SparseIndexArray* sparse_indices_ = nullptr;
-
-    // The shape of piece. This points into the shape of the containing Literal
-    // (Literal::shape_).
-    const Shape* subshape_ = nullptr;
-
-    // Children pieces for tuple shaped pieces.
-    std::vector<Piece> children_ = {};
-  };  // class Piece
-
-  const Piece& piece(const ShapeIndex& shape_index) const {
-    Piece* piece = &const_cast<Piece&>(root_piece());
-    for (const auto i : shape_index) {
-      DCHECK_GE(i, 0);
-      DCHECK_LT(i, piece->children_size());
-      piece = &piece->child(i);
-    }
-    return *piece;
-  }
-
-  // Returns the piece at the root of the shape.
-  virtual const Piece& root_piece() const = 0;
-
-  // LiteralSlice and Literal must access Pieces of other Literals.
-  friend class Literal;
-  friend class LiteralSlice;
-  friend class BorrowingLiteral;
-
- private:
-  template <typename NativeT>
-  std::unique_ptr<Literal> SliceInternal(
-      const Shape& result_shape,
-      tensorflow::gtl::ArraySlice<int64> start_indices) const;
-};
-
-// Class representing literal values in XLA.
-//
-// The underlying buffer and shape is always owned by this class.
-class Literal : public LiteralBase {
- public:
-  Literal() : Literal(ShapeUtil::MakeNil()) {}
-
-  // Create a literal of the given shape. The literal is allocated sufficient
-  // memory to hold the shape. Memory is uninitialized.
-  explicit Literal(const Shape& shape);
-  virtual ~Literal();
-
-  // Literals are moveable, but not copyable. To copy a literal use
-  // Literal::Clone or Literal::CloneToUnique. This prevents inadvertent copies
-  // of literals which can be expensive.
-  Literal(const Literal& other) = delete;
-  Literal& operator=(const Literal& other) = delete;
-  Literal(Literal&& other);
-  // 'allocate_arrays' indicates whether to allocate memory for the arrays in
-  // the shape. If false, buffer pointers inside of the Literal::Pieces are set
-  // to nullptr.
-  Literal(const Shape& shape, bool allocate_arrays);
-  Literal& operator=(Literal&& other);
-
-  // TODO(b/67651157): Remove this accessor. Literal users should not be able to
-  // mutate the shape as this can produce malformed Literals.
-  Shape* mutable_shape_do_not_use() { return shape_.get(); }
-
-  // Returns a MutableArraySlice view of the array for this literal for the
-  // given NativeT (e.g., float). CHECKs if the subshape of the literal at the
-  // given ShapeIndex is not array. See primitive_util.h for the mapping from
-  // XLA type to native type.
-  template <typename NativeT>
-  tensorflow::gtl::MutableArraySlice<NativeT> data(
-      const ShapeIndex& shape_index = {});
-  // Unhide const method from parent class.
-  using LiteralBase::data;
-
-  // Returns a pointer to the sparse index array. Returns nullptr if the literal
-  // is not a sparse array.
-  SparseIndexArray* sparse_indices(const ShapeIndex& shape_index = {});
-
-  // Returns a pointer to the underlying buffer holding the array at the given
-  // shape index. CHECKs if the subshape of the literal at the given ShapeIndex
-  // is not array.
-  void* untyped_data(const ShapeIndex& shape_index = {});
-  // Unhide const method from parent class.
-  using LiteralBase::untyped_data;
-
-  // Populates a literal with a sparse layout with the given indices and values.
-  // Each index in the indices array is CHECKed against the dimensions in the
-  // literal's shape.  If sort is true, then the indices and values will be
-  // sorted.  If sort is false, then the indices and values are assumed to
-  // already be in sorted order.  See CreateSparse for an example of how data
-  // are populated.
-  template <typename NativeT>
-  void PopulateSparse(SparseIndexArray indices,
-                      tensorflow::gtl::ArraySlice<NativeT> values,
-                      bool sort = true);
-
-  // Copy values from 'src_literal' rooted at 'src_shape_index' into this
-  // literal rooted at 'dest_shape_index'. The subshape of this literal rooted
-  // at 'dest_shape_index' must be compatible with the subshape of 'src_literal'
-  // rooted at 'src_shape_index', but need not be arrays.
-  Status CopyFrom(const LiteralSlice& src_literal,
-                  const ShapeIndex& dest_shape_index = {},
-                  const ShapeIndex& src_shape_index = {});
-
-  // Similar to CopyFrom, but with move semantincs. The subshape of this literal
-  // rooted at 'dest_shape_index' must be *equal* to the shape 'src_literal'
-  // (layouts and shapes must match), but need not be arrays. The memory
-  // allocated in this literal for the subshape at dest_shape_index is
-  // deallocated, and the respective buffers are replaced with those in
-  // src_literal. Upon return, src_literal is set to a nil shape (empty tuple).
-  Status MoveFrom(Literal&& src_literal,
-                  const ShapeIndex& dest_shape_index = {});
-
-  // Copies the values from src_literal, starting at src_base shape indexes,
-  // to this literal, starting at dest_base, where the copy size in each
-  // dimension is specified by copy_size.
-  // The src_literal and this literal must have the same primitive type,
-  // src_base+copy_size must fit the source literal dimensions, as well as
-  // dest_base+copy_size must fit the destination literal dimensions.
-  // Note: if either src_literal or this literal contains dimensions with zero
-  // element, then copy_size must be 0 in these dimensions while the
-  // corresponding base indices being 0.
-  // This literal and 'src_literal' must be arrays.
-  Status CopySliceFrom(const LiteralSlice& src_literal,
-                       tensorflow::gtl::ArraySlice<int64> src_base,
-                       tensorflow::gtl::ArraySlice<int64> dest_base,
-                       tensorflow::gtl::ArraySlice<int64> copy_size);
-
-  // Copies one element from src_literal[src_index] to (*this)[dest_index].
-  Status CopyElementFrom(const LiteralSlice& src_literal,
-                         tensorflow::gtl::ArraySlice<int64> src_index,
-                         tensorflow::gtl::ArraySlice<int64> dest_index);
-
-  // Sets an element in the literal at the given index. The multi_index is
-  // CHECKed against the dimension sizes.
-  template <typename NativeT>
-  void Set(tensorflow::gtl::ArraySlice<int64> multi_index,
-           const ShapeIndex& shape_index, NativeT value);
-  // Overloads of Set for array literals. CHECKs if the literal is not
-  // array-shaped and dense.
-  template <typename NativeT>
-  void Set(tensorflow::gtl::ArraySlice<int64> multi_index, NativeT value);
-
-  // Appends the given element to the literal.  If the elements are not appended
-  // in sorted order, then SortSparseElements should be called before calling
-  // other methods.  This literal must have a sparse layout.
-  template <typename NativeT>
-  void AppendSparseElement(tensorflow::gtl::ArraySlice<int64> multi_index,
-                           NativeT value, const ShapeIndex& shape_index = {});
-
-  // Sorts the elements in a sparse array.
-  void SortSparseElements(const ShapeIndex& shape_index = {});
-
-  // As Set(), but truncates `value` to the literal element type before storing.
-  // This literal must be an array.
-  Status SetIntegralAsS64(tensorflow::gtl::ArraySlice<int64> multi_index,
-                          int64 value);
-
-  // Populate this literal with the given values. Examples:
-  //
-  //   // Populate with floats.
-  //   Array2D<float> float_values = ...
-  //   literal.PopulateR2FromArray2D(values);
-  //
-  //   // Populate with int32s.
-  //   literal.PopulateR2<int32>({{1, 2}, {3, 4}});
-  //
-  // The shape and element type of this literal must match given values. For
-  // example, in the call above to literal.PopulateR2(), 'literal' must be a 2x2
-  // array of S32.
-  template <typename NativeT>
-  void PopulateR1(tensorflow::gtl::ArraySlice<NativeT> values);
-  void PopulateR1(const tensorflow::core::Bitmap& values);
-  template <typename NativeT>
-  void PopulateR2(std::initializer_list<std::initializer_list<NativeT>> values);
-  template <typename NativeT>
-  void PopulateFromArray(const Array<NativeT>& values);
-  template <typename NativeT>
-  void PopulateR2FromArray2D(const Array2D<NativeT>& values);
-  template <typename NativeT>
-  void PopulateR3FromArray3D(const Array3D<NativeT>& values);
-  template <typename NativeT>
-  void PopulateR4FromArray4D(const Array4D<NativeT>& values);
-
-  // Populates literal values by calling the generator function for every cell
-  // in this literal object.
-  //
-  // generator must be a callable of the type
-  // NativeT(tensorflow::gtl::ArraySlice<int64> indexes) or compatible.
-  //
-  // This literal must have a dense layout.
-  template <typename NativeT, typename FnType>
-  Status Populate(const FnType& generator);
-
-  // A parallel version of Populate(). This can be used if the generator is
-  // thread-safe and the values for the shape's different elements are
-  // independent.
-  template <typename NativeT, typename FnType>
-  Status PopulateParallel(const FnType& generator);
-
-  // Fills this literal with the given value.
-  template <typename NativeT>
-  void PopulateWithValue(NativeT value);
-
-  // Factory methods below.
-  //
-
-  // Serialize from a proto.
-  static StatusOr<std::unique_ptr<Literal>> CreateFromProto(
-      const LiteralProto& proto);
+  static Literal GetFirstScalarLiteral(const LiteralSlice& literal);
 
   // Creates a new literal of a given rank. To minimize ambiguity (for users
   // and the compiler) these CreateR[0-2] methods should explicitly specify the
@@ -889,7 +223,7 @@ class Literal : public LiteralBase {
   // As above, but intended to be invoked with move semantics; i.e.
   //
   //  std::vector<std::unique_ptr<Literal>> elements = ...;
-  //  auto result = Literal::MakeTupleOwned(std::move(elements));
+  //  auto result = LiteralUtil::MakeTupleOwned(std::move(elements));
   //
   // This would have been declared as an overload, but there is ambiguity
   // in invocation between the above signature and this one.
@@ -899,7 +233,7 @@ class Literal : public LiteralBase {
   // This overload lets you pass a braced list of unique_ptr<Literal>s to
   // MakeTupleOwned:
   //
-  //   Literal::MakeTupleOwned(Literal::CreateR1(...), ...).
+  //   LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR1(...), ...).
   //
   // Simply relying on the MakeTupleOwned(std::vector<unique_ptr<Literal>>)
   // overload doesn't work because std::initializer_list's elements are always
@@ -920,19 +254,6 @@ class Literal : public LiteralBase {
   // Create a constant token literal. Token types have no value.
   static std::unique_ptr<Literal> CreateToken();
 
-  // Returns a vector containing the tuple elements of this Literal as separate
-  // Literals. This Literal must be tuple-shaped and can be a nested tuple. The
-  // elements are moved into the new Literals; no data is copied. Upon return
-  // this Literal is set to a nil shape (empty tuple)
-  std::vector<Literal> DecomposeTuple();
-
-  // This operation is the inverse of DecomposeTuple. The given elements are
-  // moved into the tuple elements of a new tuple-shaped Literal which is
-  // returned. Upon return, each of the Literals in 'elements' is set to a nil
-  // shape (empty tuple).
-  static Literal MoveIntoTuple(
-      tensorflow::gtl::MutableArraySlice<Literal> elements);
-
   // Creates a new Literal object with its values havings the primitive_type
   // type, and with dimensions defined by the dimensions parameter.
   // The content of the literal values is the default value of the primitive
@@ -1000,194 +321,12 @@ class Literal : public LiteralBase {
   // dimension 1 equal to 8.
   static string MultiIndexAsString(
       tensorflow::gtl::ArraySlice<int64> multi_index);
-
- private:
-  // Recursively sets the subshapes and buffers of all subpieces rooted at
-  // 'piece'. If 'allocate_array' is true, memory is allocated for the arrays in
-  // the shape.
-  void SetPiece(const Shape& shape, Piece* piece, bool allocate_arrays);
-
-  // Returns the piece at the given ShapeIndex.
-  Piece& piece(const ShapeIndex& shape_index) {
-    return const_cast<Piece&>(LiteralBase::piece(shape_index));
-  }
-
-  Piece& root_piece() const override { return *root_piece_; };
-
-  // Internal template helper for the Literal::CopySliceFrom(), matching its
-  // arguments one by one.
-  template <typename NativeT>
-  Status CopySliceFromInternal(const LiteralBase& src_literal,
-                               tensorflow::gtl::ArraySlice<int64> src_base,
-                               tensorflow::gtl::ArraySlice<int64> dest_base,
-                               tensorflow::gtl::ArraySlice<int64> copy_size);
-
-  // Utility structure which is used to create the optimal configuration for
-  // a ShapeUtil::ForEachIndex() scan across two literals.
-  struct StrideConfig {
-    StrideConfig(const Shape& source_shape, const Shape& dest_shape,
-                 tensorflow::gtl::ArraySlice<int64> dimensions);
-
-    // The dimensions of the stride operation. Essentially every dimension
-    // will be iterated from base[i] to base[i]+dimensions[i], in step[i]
-    // steps.
-    tensorflow::gtl::ArraySlice<int64> dimensions;
-    DimensionVector base;
-    DimensionVector step;
-    int64 minor_dimension = 0;
-    // The size of the strides for source and destination. One of the two
-    // (the one looping through its most minor dimension) will be 1, while
-    // the other will be the stride size at the dimension matching the other
-    // shape most minor dimension being scanned.
-    int64 dest_stride = 1;
-    int64 source_stride = 1;
-    // The size of the inner loop on the most minor dimension.
-    int64 minor_loop_size = 1;
-  };
-
-  // Literal class always owns the shape. The parent class borrows this shape.
-  std::unique_ptr<Shape> shape_;
-
-  Piece* root_piece_ = nullptr;
-
-  // Implementation details shared between Populate() and PopulateParallel()
-  template <typename NativeT, typename FnType>
-  Status PopulateInternal(const FnType& generator, bool parallel);
-
-  // Deallocate the buffers held by this literal.
-  void DeallocateBuffers();
-
-  friend class LiteralBase;
-};
-std::ostream& operator<<(std::ostream& out, const Literal& literal);
-
-// A read-only view of a Literal. A LiteralSlice contains pointers to shape and
-// literal buffers always owned by others.
-class LiteralSlice : public LiteralBase {
- public:
-  LiteralSlice() : LiteralBase() {}
-
-  // Implicit conversion constructors.
-  LiteralSlice(const LiteralBase& literal);
-  LiteralSlice(const LiteralBase& literal, const ShapeIndex& view_root);
-
- private:
-  const Piece& root_piece() const override { return *root_piece_; };
-
-  const Piece* root_piece_;  // Not owned.
-};
-
-// A read-only Literal where the underlying buffers are never owned by this
-// class.
-class BorrowingLiteral : public LiteralBase {
- public:
-  BorrowingLiteral() : LiteralBase() {}
-
-  // 'src_buf_ptr' is not owned by this class and must outlive the
-  // lifetime of this class. It points to an appropirately sized buffer with
-  // data interpretered as indicated by 'shape'.
-  // This constructor is only used for array shapes.
-  BorrowingLiteral(const char* src_buf_ptr, const Shape& shape);
-  // Similar as above, except to be used for constructing non-nested tuples.
-  BorrowingLiteral(tensorflow::gtl::ArraySlice<const char*> src_buf_ptrs,
-                   const Shape& shape);
-  // TODO(b/79707221): adding constructors for nested tuples as well.
-
- private:
-  // Recursively builds the subtree for the given piece and sets the subshapes
-  // of the given piece with the given shape.
-  void BuildPieceSubtree(const Shape& shape, Piece* piece);
-
-  // Accessor for the root piece of this literal.
-  const Piece& root_piece() const override { return root_piece_; };
-  Piece root_piece_;
-
-  // Shape of this literal. Stored as unique_ptr so such that the (default)
-  // move construction of this class would be trivially correct: the pointer to
-  // Shape root_piece_ stores will still point to the correct address.
-  std::unique_ptr<Shape> shape_;
 };
 
-template <typename NativeT>
-tensorflow::gtl::ArraySlice<NativeT> LiteralBase::Piece::data() const {
-  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
-  CHECK_EQ(subshape().element_type(),
-           primitive_util::NativeToPrimitiveType<NativeT>())
-      << "Attempting to access "
-      << PrimitiveType_Name(primitive_util::NativeToPrimitiveType<NativeT>())
-      << " type, but literal element type is "
-      << PrimitiveType_Name(subshape().element_type());
-  return tensorflow::gtl::ArraySlice<NativeT>(
-      reinterpret_cast<const NativeT*>(buffer()), element_count());
-}
-
-template <typename NativeT>
-tensorflow::gtl::MutableArraySlice<NativeT> LiteralBase::Piece::data() {
-  CHECK(ShapeUtil::IsArray(subshape())) << ShapeUtil::HumanString(subshape());
-  CHECK_EQ(subshape().element_type(),
-           primitive_util::NativeToPrimitiveType<NativeT>())
-      << "Attempting to access "
-      << PrimitiveType_Name(primitive_util::NativeToPrimitiveType<NativeT>())
-      << " type, but literal element type is "
-      << PrimitiveType_Name(subshape().element_type());
-  return tensorflow::gtl::MutableArraySlice<NativeT>(
-      reinterpret_cast<NativeT*>(buffer()), element_count());
-}
-
-template <typename NativeT>
-NativeT LiteralBase::Piece::Get(
-    tensorflow::gtl::ArraySlice<int64> multi_index) const {
-  CHECK(LayoutUtil::IsDenseArray(subshape()));
-  return data<NativeT>()[IndexUtil::MultidimensionalIndexToLinearIndex(
-      subshape(), multi_index)];
-}
-
-template <typename NativeT>
-void LiteralBase::Piece::Set(tensorflow::gtl::ArraySlice<int64> multi_index,
-                             NativeT value) {
-  CHECK(LayoutUtil::IsDenseArray(subshape()));
-  data<NativeT>()[IndexUtil::MultidimensionalIndexToLinearIndex(
-      subshape(), multi_index)] = value;
-}
-
-template <typename NativeT>
-tensorflow::gtl::ArraySlice<NativeT> LiteralBase::data(
-    const ShapeIndex& shape_index) const {
-  return piece(shape_index).data<NativeT>();
-}
-
-template <typename NativeT>
-tensorflow::gtl::MutableArraySlice<NativeT> Literal::data(
-    const ShapeIndex& shape_index) {
-  return piece(shape_index).data<NativeT>();
-}
-
-template <typename NativeT>
-inline NativeT LiteralBase::Get(tensorflow::gtl::ArraySlice<int64> multi_index,
-                                const ShapeIndex& shape_index) const {
-  return piece(shape_index).Get<NativeT>(multi_index);
-}
-
-template <typename NativeT>
-inline NativeT LiteralBase::Get(
-    tensorflow::gtl::ArraySlice<int64> multi_index) const {
-  return root_piece().Get<NativeT>(multi_index);
-}
-
-template <typename NativeT>
-inline void Literal::Set(tensorflow::gtl::ArraySlice<int64> multi_index,
-                         const ShapeIndex& shape_index, NativeT value) {
-  return piece(shape_index).Set<NativeT>(multi_index, value);
-}
-
-template <typename NativeT>
-inline void Literal::Set(tensorflow::gtl::ArraySlice<int64> multi_index,
-                         NativeT value) {
-  return root_piece().Set<NativeT>(multi_index, value);
-}
+std::ostream& operator<<(std::ostream& out, const Literal& literal);
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR0(NativeT value) {
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR0(NativeT value) {
   auto literal = MakeUnique<Literal>(ShapeUtil::MakeShape(
       primitive_util::NativeToPrimitiveType<NativeT>(), {}));
   literal->Set({}, value);
@@ -1195,7 +334,7 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR1(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR1(
     tensorflow::gtl::ArraySlice<NativeT> values) {
   auto literal = MakeUnique<Literal>(
       ShapeUtil::MakeShape(primitive_util::NativeToPrimitiveType<NativeT>(),
@@ -1205,7 +344,7 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR2WithLayout(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR2WithLayout(
     std::initializer_list<std::initializer_list<NativeT>> values,
     const Layout& layout) {
   auto literal = MakeUnique<Literal>(ShapeUtil::MakeShapeWithLayout(
@@ -1218,13 +357,13 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR2(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR2(
     std::initializer_list<std::initializer_list<NativeT>> values) {
   return CreateR2WithLayout(values, LayoutUtil::GetDefaultLayoutForR2());
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR3WithLayout(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR3WithLayout(
     std::initializer_list<std::initializer_list<std::initializer_list<NativeT>>>
         values,
     const Layout& layout) {
@@ -1249,14 +388,14 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR3(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR3(
     std::initializer_list<std::initializer_list<std::initializer_list<NativeT>>>
         values) {
   return CreateR3WithLayout(values, LayoutUtil::GetDefaultLayoutForR3());
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR4WithLayout(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR4WithLayout(
     std::initializer_list<std::initializer_list<
         std::initializer_list<std::initializer_list<NativeT>>>>
         values,
@@ -1287,7 +426,7 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateSparse(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateSparse(
     tensorflow::gtl::ArraySlice<int64> dimensions, SparseIndexArray indices,
     tensorflow::gtl::ArraySlice<NativeT> values, bool sort) {
   int64 num_elements = values.size();
@@ -1302,7 +441,7 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR4(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR4(
     std::initializer_list<std::initializer_list<
         std::initializer_list<std::initializer_list<NativeT>>>>
         values) {
@@ -1310,7 +449,7 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateFromArrayWithLayout(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateFromArrayWithLayout(
     const Array<NativeT>& values, const Layout& layout) {
   auto literal = MakeUnique<Literal>(ShapeUtil::MakeShapeWithLayout(
       primitive_util::NativeToPrimitiveType<NativeT>(), values.dimensions(),
@@ -1320,38 +459,40 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateFromArray(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateFromArray(
     const Array<NativeT>& values) {
   return CreateFromArrayWithLayout(
       values, LayoutUtil::GetDefaultLayoutForRank(values.num_dimensions()));
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR2FromArray2DWithLayout(
-    const Array2D<NativeT>& values, const Layout& layout) {
+/* static */ std::unique_ptr<Literal>
+LiteralUtil::CreateR2FromArray2DWithLayout(const Array2D<NativeT>& values,
+                                           const Layout& layout) {
   return CreateFromArrayWithLayout(values, layout);
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR2FromArray2D(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR2FromArray2D(
     const Array2D<NativeT>& values) {
   return CreateFromArray(values);
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR3FromArray3DWithLayout(
-    const Array3D<NativeT>& values, const Layout& layout) {
+/* static */ std::unique_ptr<Literal>
+LiteralUtil::CreateR3FromArray3DWithLayout(const Array3D<NativeT>& values,
+                                           const Layout& layout) {
   return CreateFromArrayWithLayout(values, layout);
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR3FromArray3D(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR3FromArray3D(
     const Array3D<NativeT>& values) {
   return CreateFromArray(values);
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR3Projected(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR3Projected(
     std::initializer_list<std::initializer_list<NativeT>> values,
     int64 projection) {
   int64 dim0_size = projection;
@@ -1376,7 +517,7 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR4Projected(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR4Projected(
     std::initializer_list<std::initializer_list<NativeT>> values,
     int64 projection_p, int64 projection_z) {
   int64 dim0_size = projection_p;
@@ -1404,49 +545,21 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR4FromArray4D(
+/* static */ std::unique_ptr<Literal> LiteralUtil::CreateR4FromArray4D(
     const Array4D<NativeT>& values) {
   return CreateFromArray(values);
 }
 
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateR4FromArray4DWithLayout(
-    const Array4D<NativeT>& values, const Layout& layout) {
+/* static */ std::unique_ptr<Literal>
+LiteralUtil::CreateR4FromArray4DWithLayout(const Array4D<NativeT>& values,
+                                           const Layout& layout) {
   return CreateFromArrayWithLayout(values, layout);
 }
 
-template <typename NativeT>
-NativeT LiteralBase::GetFirstElement() const {
-  return data<NativeT>().at(0);
-}
-
-template <typename NativeT>
-NativeT LiteralBase::GetSparseElement(int64 sparse_element_number,
-                                      const ShapeIndex& shape_index) const {
-  CHECK(
-      LayoutUtil::IsSparseArray(ShapeUtil::GetSubshape(shape(), shape_index)));
-  return data<NativeT>(shape_index)[sparse_element_number];
-}
-
-template <typename NativeT>
-void Literal::AppendSparseElement(
-    tensorflow::gtl::ArraySlice<int64> multi_index, NativeT value,
-    const ShapeIndex& shape_index) {
-  Piece& p = piece(shape_index);
-  const Shape& subshape = p.subshape();
-  CHECK(LayoutUtil::IsSparseArray(subshape));
-  int64 rank = ShapeUtil::Rank(subshape);
-  CHECK_EQ(multi_index.size(), rank);
-  int64 last_element = p.sparse_indices()->index_count();
-  CHECK_LT(last_element, LayoutUtil::MaxSparseElements(subshape.layout()));
-  p.sparse_indices()->Append(multi_index);
-  CHECK_LT(last_element, p.data<NativeT>().size());
-  p.data<NativeT>()[last_element] = value;
-}
-
 // Returns an identity matrix (rank 2) with the given row and column count.
 template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::MakeIdentityR2(int64 size) {
+/* static */ std::unique_ptr<Literal> LiteralUtil::MakeIdentityR2(int64 size) {
   Array2D<NativeT> array(size, size, 0);
   for (int64 i = 0; i < size; ++i) {
     array(i, i) = 1;
@@ -1455,174 +568,8 @@ template <typename NativeT>
 }
 
 template <typename NativeT>
-void LiteralBase::EachCell(
-    std::function<void(tensorflow::gtl::ArraySlice<int64> indices,
-                       NativeT value)>
-        per_cell) const {
-  if (ShapeUtil::IsZeroElementArray(shape())) {
-    return;
-  }
-  std::vector<int64> indices(ShapeUtil::Rank(shape()), 0);
-  do {
-    per_cell(indices, Get<NativeT>(indices));
-  } while (IndexUtil::BumpIndices(shape(), &indices));
-}
-
-template <typename NativeT>
-inline void Literal::PopulateR1(tensorflow::gtl::ArraySlice<NativeT> values) {
-  CHECK(ShapeUtil::IsArray(shape()));
-  CHECK_EQ(ShapeUtil::Rank(shape()), 1);
-  CHECK_EQ(ShapeUtil::ElementsIn(shape()), values.size());
-  CHECK_EQ(shape().element_type(),
-           primitive_util::NativeToPrimitiveType<NativeT>());
-  for (int64 i = 0; i < values.size(); ++i) {
-    Set({i}, values[i]);
-  }
-}
-
-template <typename NativeT>
-void Literal::PopulateR2(
-    std::initializer_list<std::initializer_list<NativeT>> values) {
-  CHECK(ShapeUtil::IsArray(shape()));
-  CHECK_EQ(ShapeUtil::Rank(shape()), 2);
-  CHECK_EQ(shape().element_type(),
-           primitive_util::NativeToPrimitiveType<NativeT>());
-
-  const int64 dim0_size = values.size();
-  const int64 dim1_size = values.begin()->size();
-  CHECK_EQ(dim0_size, shape().dimensions(0));
-  CHECK_EQ(dim1_size, shape().dimensions(1));
-
-  int64 dim0 = 0;
-  for (auto inner_list : values) {
-    int64 dim1 = 0;
-    for (auto value : inner_list) {
-      Set({dim0, dim1}, value);
-      ++dim1;
-    }
-    CHECK_EQ(dim1_size, dim1);
-    ++dim0;
-  }
-}
-
-template <typename NativeT>
-void Literal::PopulateFromArray(const Array<NativeT>& values) {
-  CHECK(ShapeUtil::IsArray(shape()));
-  CHECK_EQ(shape().element_type(),
-           primitive_util::NativeToPrimitiveType<NativeT>());
-  CHECK_EQ(ShapeUtil::Rank(shape()), values.num_dimensions());
-  for (int dim = 0; dim < values.num_dimensions(); ++dim) {
-    CHECK_EQ(values.dim(dim), shape().dimensions(dim));
-  }
-  values.Each([this](tensorflow::gtl::ArraySlice<int64> indices,
-                     NativeT value) { this->Set(indices, value); });
-}
-
-template <typename NativeT>
-void Literal::PopulateR2FromArray2D(const Array2D<NativeT>& values) {
-  PopulateFromArray(values);
-}
-
-template <typename NativeT>
-void Literal::PopulateR3FromArray3D(const Array3D<NativeT>& values) {
-  PopulateFromArray(values);
-}
-
-template <typename NativeT>
-void Literal::PopulateR4FromArray4D(const Array4D<NativeT>& values) {
-  PopulateFromArray(values);
-}
-
-template <typename NativeT>
-void Literal::PopulateSparse(SparseIndexArray indices,
-                             tensorflow::gtl::ArraySlice<NativeT> values,
-                             bool sort) {
-  CHECK(LayoutUtil::IsSparseArray(shape()));
-  int rank = ShapeUtil::Rank(shape());
-  CHECK_EQ(indices.rank(), rank);
-  int64 max_elements = LayoutUtil::MaxSparseElements(shape().layout());
-  CHECK_LE(indices.max_indices(), max_elements);
-  int64 num_elements = values.size();
-  CHECK_LE(num_elements, max_elements);
-  CHECK_EQ(num_elements, indices.index_count());
-  auto root_data = root_piece().data<NativeT>();
-  // Piece::data() returns an ArraySlice of size equal to the number of indices
-  // in the SparseIndexArray. So there is no need to adjust the size of the data
-  // here. It is enough to just copy the incoming values into the data buffer.
-  std::copy(values.begin(), values.end(), root_data.begin());
-  *this->root_piece().sparse_indices() = std::move(indices);
-  if (sort) {
-    auto root_data = this->root_piece().data<NativeT>();
-    this->root_piece().sparse_indices()->SortWithValues(root_data);
-  }
-  DCHECK(this->root_piece().sparse_indices()->Validate(shape()));
-}
-
-template <typename NativeT, typename FnType>
-Status Literal::PopulateInternal(const FnType& generator, bool parallel) {
-  const Shape& this_shape = shape();
-  const int64 rank = ShapeUtil::Rank(this_shape);
-  TF_RET_CHECK(LayoutUtil::IsDenseArray(this_shape));
-  TF_RET_CHECK(this_shape.element_type() ==
-               primitive_util::NativeToPrimitiveType<NativeT>());
-  tensorflow::gtl::MutableArraySlice<NativeT> literal_data = data<NativeT>();
-  if (rank > 0) {
-    StrideConfig stride_config(this_shape, this_shape,
-                               AsInt64Slice(this_shape.dimensions()));
-    int64 minor_dimension_size =
-        ShapeUtil::GetDimension(this_shape, stride_config.minor_dimension);
-
-    auto init_function = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
-      DimensionVector minor_scan_indexes(rank, 0);
-      const int64 index =
-          IndexUtil::MultidimensionalIndexToLinearIndex(shape(), indexes);
-      std::copy(indexes.begin(), indexes.end(), minor_scan_indexes.begin());
-      for (int64 i = 0; i < minor_dimension_size; ++i) {
-        minor_scan_indexes[stride_config.minor_dimension] = i;
-        literal_data.at(index + i) = generator(minor_scan_indexes);
-      }
-    };
-    if (parallel) {
-      ShapeUtil::ForEachIndexParallel(this_shape, stride_config.base,
-                                      stride_config.dimensions,
-                                      stride_config.step, init_function);
-    } else {
-      ShapeUtil::ForEachIndex(
-          this_shape, stride_config.base, stride_config.dimensions,
-          stride_config.step,
-          [&init_function](tensorflow::gtl::ArraySlice<int64> indexes) {
-            init_function(indexes);
-            return true;
-          });
-    }
-  } else {
-    // For scalars.
-    literal_data.at(0) = generator({});
-  }
-  return Status::OK();
-}
-template <typename NativeT, typename FnType>
-Status Literal::Populate(const FnType& generator) {
-  return PopulateInternal<NativeT>(generator, /*parallel=*/false);
-}
-
-template <typename NativeT, typename FnType>
-Status Literal::PopulateParallel(const FnType& generator) {
-  return PopulateInternal<NativeT>(generator, /*parallel=*/true);
-}
-
-template <typename NativeT>
-void Literal::PopulateWithValue(NativeT value) {
-  CHECK(ShapeUtil::IsArray(shape()));
-  CHECK_EQ(shape().element_type(),
-           primitive_util::NativeToPrimitiveType<NativeT>());
-  for (NativeT& element : data<NativeT>()) {
-    element = value;
-  }
-}
-
-template <typename NativeT>
-/* static */ std::unique_ptr<Literal> Literal::CreateFullWithDescendingLayout(
+/* static */ std::unique_ptr<Literal>
+LiteralUtil::CreateFullWithDescendingLayout(
     tensorflow::gtl::ArraySlice<int64> dimensions, NativeT value) {
   auto literal = MakeUnique<Literal>(ShapeUtil::MakeShapeWithDescendingLayout(
       primitive_util::NativeToPrimitiveType<NativeT>(), dimensions));
@@ -1630,44 +577,9 @@ template <typename NativeT>
   return literal;
 }
 
-template <typename NativeT>
-std::unique_ptr<Literal> LiteralBase::Replicate(int64 times) const {
-  DimensionVector bounds = {times};
-  bounds.reserve(shape().dimensions_size() + 1);
-  for (int64 bound : shape().dimensions()) {
-    bounds.push_back(bound);
-  }
-  auto literal =
-      MakeUnique<Literal>(ShapeUtil::MakeShape(shape().element_type(), bounds));
-  int64 elements = ShapeUtil::ElementsIn(literal->shape());
-  if (elements == 0) {
-    return literal;
-  }
-
-  DimensionVector output_indices(bounds.size(), 0);
-  tensorflow::gtl::ArraySlice<int64> input_indices = output_indices;
-  input_indices.remove_prefix(1);
-
-  bool done = false;
-  while (!done) {
-    const auto element = Get<NativeT>(input_indices);
-    literal->Set<NativeT>(output_indices, element);
-
-    done = true;
-    for (int n = 0; n < output_indices.size(); ++n) {
-      ++output_indices[n];
-      if (output_indices[n] < bounds[n]) {
-        done = false;
-        break;
-      }
-      output_indices[n] = 0;
-    }
-  }
-  return literal;
-}
-
 template <PrimitiveType type, typename T>
-/* static */ StatusOr<std::unique_ptr<Literal>> Literal::CreateRandomLiteral(
+/* static */ StatusOr<std::unique_ptr<Literal>>
+LiteralUtil::CreateRandomLiteral(
     const Shape& shape,
     const std::function<T(tensorflow::gtl::ArraySlice<int64>)>& generator) {
   using NativeT = typename primitive_util::PrimitiveTypeToNative<type>::type;
@@ -1681,8 +593,9 @@ template <PrimitiveType type, typename T>
 }
 
 template <PrimitiveType type, typename E, typename T>
-/* static */ StatusOr<std::unique_ptr<Literal>> Literal::CreateRandomLiteral(
-    const Shape& shape, E* engine, T mean, T stddev) {
+/* static */ StatusOr<std::unique_ptr<Literal>>
+LiteralUtil::CreateRandomLiteral(const Shape& shape, E* engine, T mean,
+                                 T stddev) {
   using NativeT = typename primitive_util::PrimitiveTypeToNative<type>::type;
   std::normal_distribution<NativeT> generator(mean, stddev);
   return CreateRandomLiteral<type, NativeT>(
@@ -1692,8 +605,8 @@ template <PrimitiveType type, typename E, typename T>
 }
 
 template <PrimitiveType type, typename T>
-/* static */ StatusOr<std::unique_ptr<Literal>> Literal::CreateRandomLiteral(
-    const Shape& shape, T mean, T stddev) {
+/* static */ StatusOr<std::unique_ptr<Literal>>
+LiteralUtil::CreateRandomLiteral(const Shape& shape, T mean, T stddev) {
   std::minstd_rand0 engine;
   return CreateRandomLiteral<type>(shape, &engine, mean, stddev);
 }
diff --git a/tensorflow/compiler/xla/packed_literal_reader.cc b/tensorflow/compiler/xla/packed_literal_reader.cc
index 857aae0a79..6b7fd10d63 100644
--- a/tensorflow/compiler/xla/packed_literal_reader.cc
+++ b/tensorflow/compiler/xla/packed_literal_reader.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
diff --git a/tensorflow/compiler/xla/packed_literal_reader.h b/tensorflow/compiler/xla/packed_literal_reader.h
index 45a9fe0127..98dccaa9a2 100644
--- a/tensorflow/compiler/xla/packed_literal_reader.h
+++ b/tensorflow/compiler/xla/packed_literal_reader.h
@@ -18,7 +18,7 @@ limitations under the License.
 
 #include <memory>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
diff --git a/tensorflow/compiler/xla/python/BUILD b/tensorflow/compiler/xla/python/BUILD
index 22cc4e2436..c8f2d65c22 100644
--- a/tensorflow/compiler/xla/python/BUILD
+++ b/tensorflow/compiler/xla/python/BUILD
@@ -33,6 +33,7 @@ cc_library(
     srcs = ["numpy_bridge.cc"],
     hdrs = ["numpy_bridge.h"],
     deps = [
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -52,9 +53,9 @@ cc_library(
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:executable_build_options",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:math",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/service:shaped_buffer",
         "//tensorflow/core:framework_lite",
         "//tensorflow/core:lib",
@@ -70,7 +71,7 @@ tf_py_wrap_cc(
     deps = [
         ":local_computation_builder",
         ":numpy_bridge",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/service:cpu_plugin",
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.cc b/tensorflow/compiler/xla/python/local_computation_builder.cc
index be55d50b23..8246f76d34 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.cc
+++ b/tensorflow/compiler/xla/python/local_computation_builder.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/python/local_computation_builder.h"
 #include "tensorflow/compiler/xla/client/lib/math.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/executable_run_options.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/util.h"
@@ -486,6 +486,11 @@ LocalOp LocalComputationBuilder::ConvertElementType(
   return xla::ConvertElementType(operand.op(), new_element_type);
 }
 
+LocalOp LocalComputationBuilder::BitcastConvertType(
+    const LocalOp& operand, PrimitiveType new_element_type) {
+  return xla::BitcastConvertType(operand.op(), new_element_type);
+}
+
 LocalOp LocalComputationBuilder::Call(
     const LocalComputation& local_computation,
     tensorflow::gtl::ArraySlice<LocalOp> operands) {
@@ -614,6 +619,12 @@ _FORWARD_BINOP(Min)
 _FORWARD_BINOP(And)
 _FORWARD_BINOP(Or)
 _FORWARD_BINOP(Xor)
+_FORWARD_BINOP(ShiftLeft)
+_FORWARD_BINOP(ShiftRightArithmetic)
+_FORWARD_BINOP(ShiftRightLogical)
+_FORWARD_BINOP(Atan2)
+_FORWARD_BINOP(Pow)
+_FORWARD_BINOP(Complex)
 _FORWARD_UNOP(Not)
 _FORWARD_UNOP(Abs)
 _FORWARD_UNOP(Exp)
@@ -627,13 +638,30 @@ _FORWARD_UNOP(Sign)
 _FORWARD_UNOP(Cos)
 _FORWARD_UNOP(Sin)
 _FORWARD_UNOP(Tanh)
-_FORWARD_UNOP(Sqrt)
-_FORWARD_UNOP(Square)
-_FORWARD_BINOP(Pow)
 _FORWARD_UNOP(IsFinite)
-_FORWARD_UNOP(Reciprocal)
 _FORWARD_UNOP(Neg)
 _FORWARD_UNOP(Sort)
+_FORWARD_UNOP(Sqrt)
+_FORWARD_UNOP(Rsqrt)
+_FORWARD_UNOP(Square)
+_FORWARD_UNOP(Reciprocal)
+_FORWARD_UNOP(Erfc)
+_FORWARD_UNOP(Erf)
+_FORWARD_UNOP(ErfInv)
+_FORWARD_UNOP(Lgamma)
+_FORWARD_UNOP(Digamma)
+_FORWARD_UNOP(Acos)
+_FORWARD_UNOP(Asin)
+_FORWARD_UNOP(Atan)
+_FORWARD_UNOP(Tan)
+_FORWARD_UNOP(Acosh)
+_FORWARD_UNOP(Asinh)
+_FORWARD_UNOP(Atanh)
+_FORWARD_UNOP(Cosh)
+_FORWARD_UNOP(Sinh)
+_FORWARD_UNOP(Real)
+_FORWARD_UNOP(Imag)
+_FORWARD_UNOP(Conj)
 
 #undef _FORWARD
 #undef _FORWARD_UNOP
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.h b/tensorflow/compiler/xla/python/local_computation_builder.h
index 690ff277e8..a568c24c63 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.h
+++ b/tensorflow/compiler/xla/python/local_computation_builder.h
@@ -19,8 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/executable_build_options.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/shaped_buffer.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
@@ -259,6 +259,9 @@ class LocalComputationBuilder {
   LocalOp ConvertElementType(const LocalOp& operand,
                              PrimitiveType new_element_type);
 
+  LocalOp BitcastConvertType(const LocalOp& operand,
+                             PrimitiveType new_element_type);
+
   LocalOp Call(const LocalComputation& local_computation,
                tensorflow::gtl::ArraySlice<LocalOp> operands);
 
@@ -333,6 +336,12 @@ class LocalComputationBuilder {
   _FORWARD_BINOP(And)
   _FORWARD_BINOP(Or)
   _FORWARD_BINOP(Xor)
+  _FORWARD_BINOP(ShiftLeft)
+  _FORWARD_BINOP(ShiftRightArithmetic)
+  _FORWARD_BINOP(ShiftRightLogical)
+  _FORWARD_BINOP(Atan2)
+  _FORWARD_BINOP(Pow)
+  _FORWARD_BINOP(Complex)
   _FORWARD_UNOP(Not)
   _FORWARD_UNOP(Abs)
   _FORWARD_UNOP(Exp)
@@ -346,13 +355,30 @@ class LocalComputationBuilder {
   _FORWARD_UNOP(Cos)
   _FORWARD_UNOP(Sin)
   _FORWARD_UNOP(Tanh)
-  _FORWARD_UNOP(Sqrt)
-  _FORWARD_UNOP(Square)
-  _FORWARD_BINOP(Pow)
   _FORWARD_UNOP(IsFinite)
-  _FORWARD_UNOP(Reciprocal)
   _FORWARD_UNOP(Neg)
   _FORWARD_UNOP(Sort)
+  _FORWARD_UNOP(Sqrt)
+  _FORWARD_UNOP(Rsqrt)
+  _FORWARD_UNOP(Square)
+  _FORWARD_UNOP(Reciprocal)
+  _FORWARD_UNOP(Erfc)
+  _FORWARD_UNOP(Erf)
+  _FORWARD_UNOP(ErfInv)
+  _FORWARD_UNOP(Lgamma)
+  _FORWARD_UNOP(Digamma)
+  _FORWARD_UNOP(Acos)
+  _FORWARD_UNOP(Asin)
+  _FORWARD_UNOP(Atan)
+  _FORWARD_UNOP(Tan)
+  _FORWARD_UNOP(Acosh)
+  _FORWARD_UNOP(Asinh)
+  _FORWARD_UNOP(Atanh)
+  _FORWARD_UNOP(Cosh)
+  _FORWARD_UNOP(Sinh)
+  _FORWARD_UNOP(Real)
+  _FORWARD_UNOP(Imag)
+  _FORWARD_UNOP(Conj)
 
 #undef _FORWARD
 #undef _FORWARD_UNOP
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.i b/tensorflow/compiler/xla/python/local_computation_builder.i
index c44e69e615..5d5a955bfe 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.i
+++ b/tensorflow/compiler/xla/python/local_computation_builder.i
@@ -109,7 +109,7 @@ limitations under the License.
 // Must be included first
 #include "tensorflow/python/lib/core/numpy.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
@@ -957,6 +957,7 @@ tensorflow::ImportNumpy();
 %unignore xla::swig::LocalComputationBuilder::Tuple;
 %unignore xla::swig::LocalComputationBuilder::GetTupleElement;
 %unignore xla::swig::LocalComputationBuilder::ConvertElementType;
+%unignore xla::swig::LocalComputationBuilder::BitcastConvertType;
 %unignore xla::swig::LocalComputationBuilder::Call;
 %unignore xla::swig::LocalComputationBuilder::Transpose;
 %unignore xla::swig::LocalComputationBuilder::Rev;
@@ -989,6 +990,9 @@ tensorflow::ImportNumpy();
 %unignore xla::swig::LocalComputationBuilder::And;
 %unignore xla::swig::LocalComputationBuilder::Or;
 %unignore xla::swig::LocalComputationBuilder::Xor;
+%unignore xla::swig::LocalComputationBuilder::ShiftLeft;
+%unignore xla::swig::LocalComputationBuilder::ShiftRightArithmetic;
+%unignore xla::swig::LocalComputationBuilder::ShiftRightLogical;
 %unignore xla::swig::LocalComputationBuilder::Not;
 %unignore xla::swig::LocalComputationBuilder::Abs;
 %unignore xla::swig::LocalComputationBuilder::Exp;
@@ -1002,13 +1006,33 @@ tensorflow::ImportNumpy();
 %unignore xla::swig::LocalComputationBuilder::Cos;
 %unignore xla::swig::LocalComputationBuilder::Sin;
 %unignore xla::swig::LocalComputationBuilder::Tanh;
-%unignore xla::swig::LocalComputationBuilder::Sqrt;
-%unignore xla::swig::LocalComputationBuilder::Square;
-%unignore xla::swig::LocalComputationBuilder::Pow;
+%unignore xla::swig::LocalComputationBuilder::Atan2;
 %unignore xla::swig::LocalComputationBuilder::IsFinite;
-%unignore xla::swig::LocalComputationBuilder::Reciprocal;
+%unignore xla::swig::LocalComputationBuilder::Pow;
 %unignore xla::swig::LocalComputationBuilder::Neg;
 %unignore xla::swig::LocalComputationBuilder::Sort;
+%unignore xla::swig::LocalComputationBuilder::Sqrt;
+%unignore xla::swig::LocalComputationBuilder::Rsqrt;
+%unignore xla::swig::LocalComputationBuilder::Square;
+%unignore xla::swig::LocalComputationBuilder::Reciprocal;
+%unignore xla::swig::LocalComputationBuilder::Erfc;
+%unignore xla::swig::LocalComputationBuilder::Erf;
+%unignore xla::swig::LocalComputationBuilder::ErfInv;
+%unignore xla::swig::LocalComputationBuilder::Lgamma;
+%unignore xla::swig::LocalComputationBuilder::Digamma;
+%unignore xla::swig::LocalComputationBuilder::Acos;
+%unignore xla::swig::LocalComputationBuilder::Asin;
+%unignore xla::swig::LocalComputationBuilder::Atan;
+%unignore xla::swig::LocalComputationBuilder::Tan;
+%unignore xla::swig::LocalComputationBuilder::Acosh;
+%unignore xla::swig::LocalComputationBuilder::Asinh;
+%unignore xla::swig::LocalComputationBuilder::Atanh;
+%unignore xla::swig::LocalComputationBuilder::Cosh;
+%unignore xla::swig::LocalComputationBuilder::Sinh;
+%unignore xla::swig::LocalComputationBuilder::Real;
+%unignore xla::swig::LocalComputationBuilder::Imag;
+%unignore xla::swig::LocalComputationBuilder::Conj;
+%unignore xla::swig::LocalComputationBuilder::Complex;
 %unignore xla::swig::DestructureLocalShapedBufferTuple;
 %unignore xla::swig::DeleteLocalShapedBuffer;
 %unignore xla::swig::DeleteLocalComputation;
diff --git a/tensorflow/compiler/xla/python/numpy_bridge.cc b/tensorflow/compiler/xla/python/numpy_bridge.cc
index 68648a3a17..6f665faf61 100644
--- a/tensorflow/compiler/xla/python/numpy_bridge.cc
+++ b/tensorflow/compiler/xla/python/numpy_bridge.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/python/numpy_bridge.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/core/platform/logging.h"
 
@@ -49,6 +50,8 @@ int PrimitiveTypeToNumpyType(PrimitiveType primitive_type) {
       return NPY_FLOAT32;
     case F64:
       return NPY_FLOAT64;
+    case C64:
+      return NPY_COMPLEX64;
     case TUPLE:
       return NPY_OBJECT;
     default:
@@ -82,6 +85,8 @@ PrimitiveType NumpyTypeToPrimitiveType(int np_type) {
       return F32;
     case NPY_FLOAT64:
       return F64;
+    case NPY_COMPLEX64:
+      return C64;
     case NPY_OBJECT:
       return TUPLE;
     default:
@@ -103,6 +108,7 @@ bool NumpyTypeIsValid(int np_type) {
     case NPY_FLOAT16:
     case NPY_FLOAT32:
     case NPY_FLOAT64:
+    case NPY_COMPLEX64:
     case NPY_OBJECT:
       return true;
     default:
@@ -374,7 +380,7 @@ StatusOr<std::unique_ptr<Literal>> XlaLiteralFromPyObject(PyObject* o) {
       TF_ASSIGN_OR_RETURN(auto literal, XlaLiteralFromPyObject(element));
       elements.push_back(std::move(literal));
     }
-    return Literal::MakeTupleOwned(std::move(elements));
+    return LiteralUtil::MakeTupleOwned(std::move(elements));
   } else if (PyArray_Check(o)) {
     PyArrayObject* py_array = reinterpret_cast<PyArrayObject*>(o);
     int rank = PyArray_NDIM(py_array);
@@ -383,7 +389,7 @@ StatusOr<std::unique_ptr<Literal>> XlaLiteralFromPyObject(PyObject* o) {
       dimensions[i] = PyArray_DIM(py_array, i);
     }
     int np_type = PyArray_TYPE(py_array);
-    auto literal = Literal::CreateFromDimensions(
+    auto literal = LiteralUtil::CreateFromDimensions(
         NumpyTypeToPrimitiveType(np_type), dimensions);
     TF_RETURN_IF_ERROR(
         CopyNumpyArrayToLiteral(np_type, py_array, literal.get()));
@@ -424,6 +430,9 @@ Status CopyNumpyArrayToLiteral(int np_type, PyArrayObject* py_array,
     case NPY_FLOAT64:
       CopyNumpyArrayToLiteral<double>(py_array, literal);
       break;
+    case NPY_COMPLEX64:
+      CopyNumpyArrayToLiteral<complex64>(py_array, literal);
+      break;
     default:
       return InvalidArgument(
           "No XLA literal container for Numpy type number: %d", np_type);
@@ -461,6 +470,9 @@ void CopyLiteralToNumpyArray(int np_type, const LiteralSlice& literal,
     case NPY_FLOAT64:
       CopyLiteralToNumpyArray<double>(literal, py_array);
       break;
+    case NPY_COMPLEX64:
+      CopyLiteralToNumpyArray<complex64>(literal, py_array);
+      break;
     default:
       LOG(FATAL) << "No XLA literal container for Numpy type" << np_type;
   }
diff --git a/tensorflow/compiler/xla/python/numpy_bridge.h b/tensorflow/compiler/xla/python/numpy_bridge.h
index 64f0aae0f9..a67c93a4fb 100644
--- a/tensorflow/compiler/xla/python/numpy_bridge.h
+++ b/tensorflow/compiler/xla/python/numpy_bridge.h
@@ -25,7 +25,7 @@ limitations under the License.
 #include <algorithm>
 #include <memory>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/python/lib/core/numpy.h"
diff --git a/tensorflow/compiler/xla/python/xla_client.py b/tensorflow/compiler/xla/python/xla_client.py
index 27aee634ba..a2c6fc344d 100644
--- a/tensorflow/compiler/xla/python/xla_client.py
+++ b/tensorflow/compiler/xla/python/xla_client.py
@@ -99,12 +99,30 @@ _UNARY_OPS = [
     'Cos',
     'Sin',
     'Tanh',
+    'IsFinite',
     'Sqrt',
+    'Rsqrt',
     'Square',
-    'IsFinite',
     'Reciprocal',
     'Neg',
     'Sort',
+    'Erf',
+    'Erfc',
+    'ErfInv',
+    'Lgamma',
+    'Digamma',
+    'Acos',
+    'Asin',
+    'Atan',
+    'Tan',
+    'Acosh',
+    'Asinh',
+    'Atanh',
+    'Cosh',
+    'Sinh',
+    'Real',
+    'Imag',
+    'Conj',
 ]
 
 _BINARY_OPS = [
@@ -125,6 +143,11 @@ _BINARY_OPS = [
     'Or',
     'Xor',
     'Pow',
+    'ShiftLeft',
+    'ShiftRightArithmetic',
+    'ShiftRightLogical',
+    'Atan2',
+    'Complex',
 ]
 
 
@@ -461,14 +484,16 @@ class LocalComputation(object):
     if self.is_compiled:
       raise ValueError('Attempt to compile a compiled local XLA computation.')
 
+    result_shape = _wrap_shape(self.c_local_computation.GetReturnValueShape())
+
     if layout_fn:
       argument_shapes = [
           shape.map_leaves(layout_fn) for shape in argument_shapes
       ]
-      result_shape = _wrap_shape(self.c_local_computation.GetReturnValueShape())
       result_shape = result_shape.map_leaves(layout_fn)
-      compile_options = compile_options or CompileOptions()
-      compile_options.result_shape = result_shape
+
+    compile_options = compile_options or CompileOptions()
+    compile_options.result_shape = result_shape
     return LocalComputation(
         self.c_local_computation.Compile(argument_shapes, compile_options),
         is_compiled=True)
@@ -700,6 +725,18 @@ class ComputationBuilder(object):
     """
     return self._client.ConvertElementType(operand, new_element_type)
 
+  def BitcastConvertType(self, operand, new_element_type):
+    """Enqueues a bitcast type conversion operation onto the computation.
+
+    Args:
+      operand: the operand to convert.
+      new_element_type: the target primitive type.
+
+    Returns:
+      A LocalOp representing the added conversion op.
+    """
+    return self._client.BitcastConvertType(operand, new_element_type)
+
   def GetShape(self, operand):
     return _wrap_shape(self._client.GetShape(operand))
 
diff --git a/tensorflow/compiler/xla/python/xla_client_test.py b/tensorflow/compiler/xla/python/xla_client_test.py
index 0564ddcb85..fd98e19457 100644
--- a/tensorflow/compiler/xla/python/xla_client_test.py
+++ b/tensorflow/compiler/xla/python/xla_client_test.py
@@ -171,6 +171,24 @@ class ComputationsWithConstantsTest(LocalComputationTest):
         c.Constant(NumpyArrayF32([[1, -1, 1], [-1, 1, -1]])))
     self._ExecuteAndCompareClose(c, expected=[[2, 1, 4], [3, 6, 5]])
 
+  def testShiftLeft(self):
+    c = self._NewComputation()
+    c.ShiftLeft(c.Constant(NumpyArrayS32([3])),
+                c.Constant(NumpyArrayS32([2])))
+    self._ExecuteAndCompareClose(c, expected=[12])
+
+  def testShiftRightArithmetic(self):
+    c = self._NewComputation()
+    c.ShiftRightArithmetic(c.Constant(NumpyArrayS32([-2])),
+                           c.Constant(NumpyArrayS32([1])))
+    self._ExecuteAndCompareClose(c, expected=[-1])
+
+  def testShiftRightLogical(self):
+    c = self._NewComputation()
+    c.ShiftRightLogical(c.Constant(NumpyArrayS32([-1])),
+                        c.Constant(NumpyArrayS32([1])))
+    self._ExecuteAndCompareClose(c, expected=[2**31 - 1])
+
   def testGetProto(self):
     c = self._NewComputation()
     c.Add(
@@ -471,6 +489,34 @@ class SingleOpTest(LocalComputationTest):
     for src_dtype, dst_dtype in itertools.product(xla_types, xla_types):
       _ConvertAndTest(x, src_dtype, dst_dtype)
 
+  def testBitcastConvertType(self):
+    xla_x32_types = {
+        np.int32: xla_client.xla_data_pb2.S32,
+        np.float32: xla_client.xla_data_pb2.F32,
+    }
+
+    xla_x64_types = {
+        np.int64: xla_client.xla_data_pb2.S64,
+        np.float64: xla_client.xla_data_pb2.F64,
+    }
+
+    def _ConvertAndTest(template, src_dtype, dst_dtype, dst_etype):
+      c = self._NewComputation()
+      x = c.Constant(np.array(template, dtype=src_dtype))
+      c.BitcastConvertType(x, dst_etype)
+
+      result = c.Build().Compile().Execute()
+      expected = np.array(template, src_dtype).view(dst_dtype)
+
+      self.assertEqual(result.shape, expected.shape)
+      self.assertEqual(result.dtype, expected.dtype)
+      np.testing.assert_equal(result, expected)
+
+    x = [0, 1, 0, 0, 1]
+    for xla_types in [xla_x32_types, xla_x64_types]:
+      for src_dtype, dst_dtype in itertools.product(xla_types, xla_types):
+        _ConvertAndTest(x, src_dtype, dst_dtype, xla_types[dst_dtype])
+
   def testCrossReplicaSumOneReplica(self):
     samples = [
         NumpyArrayF32(42.0),
diff --git a/tensorflow/compiler/xla/python_api/types.py b/tensorflow/compiler/xla/python_api/types.py
index b60f8dce92..e036c894f8 100644
--- a/tensorflow/compiler/xla/python_api/types.py
+++ b/tensorflow/compiler/xla/python_api/types.py
@@ -20,7 +20,7 @@ from __future__ import print_function
 
 import collections
 
-import numpy as np
+import numpy as _np  # Avoids becoming a part of public Tensorflow API.
 
 from tensorflow.compiler.xla import xla_data_pb2
 
@@ -43,73 +43,73 @@ MAP_XLA_TYPE_TO_RECORD = {
     xla_data_pb2.F16:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.F16,
-            numpy_dtype=np.float16,
+            numpy_dtype=_np.float16,
             literal_field_name='f16s',
             literal_field_type=float),
     xla_data_pb2.F32:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.F32,
-            numpy_dtype=np.float32,
+            numpy_dtype=_np.float32,
             literal_field_name='f32s',
             literal_field_type=float),
     xla_data_pb2.F64:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.F64,
-            numpy_dtype=np.float64,
+            numpy_dtype=_np.float64,
             literal_field_name='f64s',
             literal_field_type=float),
     xla_data_pb2.S8:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.S8,
-            numpy_dtype=np.int8,
+            numpy_dtype=_np.int8,
             literal_field_name='s8s',
             literal_field_type=int),
     xla_data_pb2.S16:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.S16,
-            numpy_dtype=np.int16,
+            numpy_dtype=_np.int16,
             literal_field_name='s16s',
             literal_field_type=int),
     xla_data_pb2.S32:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.S32,
-            numpy_dtype=np.int32,
+            numpy_dtype=_np.int32,
             literal_field_name='s32s',
             literal_field_type=int),
     xla_data_pb2.S64:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.S64,
-            numpy_dtype=np.int64,
+            numpy_dtype=_np.int64,
             literal_field_name='s64s',
             literal_field_type=int),
     xla_data_pb2.U8:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.U8,
-            numpy_dtype=np.uint8,
+            numpy_dtype=_np.uint8,
             literal_field_name='s8s',
             literal_field_type=int),
     xla_data_pb2.U16:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.U16,
-            numpy_dtype=np.uint16,
+            numpy_dtype=_np.uint16,
             literal_field_name='s16s',
             literal_field_type=int),
     xla_data_pb2.U32:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.U32,
-            numpy_dtype=np.uint32,
+            numpy_dtype=_np.uint32,
             literal_field_name='s32s',
             literal_field_type=int),
     xla_data_pb2.U64:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.U64,
-            numpy_dtype=np.uint64,
+            numpy_dtype=_np.uint64,
             literal_field_name='s64s',
             literal_field_type=int),
     xla_data_pb2.PRED:
         TypeConversionRecord(
             primitive_type=xla_data_pb2.PRED,
-            numpy_dtype=np.bool,
+            numpy_dtype=_np.bool,
             literal_field_name='preds',
             literal_field_type=bool)
 }
@@ -119,6 +119,6 @@ MAP_XLA_TYPE_TO_RECORD = {
 # doesn't work as expected (https://github.com/numpy/numpy/issues/7242). Thus,
 # when keying by dtype in this dict, we use the string form of dtypes.
 MAP_DTYPE_TO_RECORD = {
-    str(np.dtype(record.numpy_dtype)): record
+    str(_np.dtype(record.numpy_dtype)): record
     for record in MAP_XLA_TYPE_TO_RECORD.values()
 }
diff --git a/tensorflow/compiler/xla/python_api/xla_literal.py b/tensorflow/compiler/xla/python_api/xla_literal.py
index b040098c29..757e41a78a 100644
--- a/tensorflow/compiler/xla/python_api/xla_literal.py
+++ b/tensorflow/compiler/xla/python_api/xla_literal.py
@@ -18,7 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import numpy as np
+import numpy as _np  # Avoids becoming a part of public Tensorflow API.
 
 from tensorflow.compiler.xla import xla_data_pb2
 from tensorflow.compiler.xla.python_api import types
@@ -35,7 +35,7 @@ def ConvertLiteralToNumpyArray(literal):
 
   type_record = types.MAP_XLA_TYPE_TO_RECORD[element_type]
   if not literal.shape.dimensions:
-    return np.array(
+    return _np.array(
         getattr(literal, type_record.literal_field_name)[0],
         type_record.numpy_dtype)
   else:
@@ -54,7 +54,7 @@ def ConvertLiteralToNumpyArray(literal):
       numpy_reshaper = lambda arr: arr.reshape(numpy_shape, order='C')
     else:
       raise NotImplementedError('Unsupported layout: {0}'.format(layout_order))
-    ndarray = np.array(
+    ndarray = _np.array(
         getattr(literal, type_record.literal_field_name),
         copy=False,
         dtype=type_record.numpy_dtype)
@@ -69,11 +69,11 @@ def _ConvertNumpyArrayToLiteral(ndarray):
 
   if ndarray.ndim == 0:
     getattr(literal, type_record.literal_field_name).append(
-        np.asscalar(ndarray.astype(type_record.literal_field_type)))
+        _np.asscalar(ndarray.astype(type_record.literal_field_type)))
   else:
     # Ndarrays with boolean dtypes need special type conversion with protobufs
-    if ndarray.dtype in {np.bool_, np.dtype('bool')}:
-      for element in np.nditer(ndarray):
+    if ndarray.dtype in {_np.bool_, _np.dtype('bool')}:
+      for element in _np.nditer(ndarray):
         getattr(literal, type_record.literal_field_name).append(
             type_record.literal_field_type(element))
     else:
diff --git a/tensorflow/compiler/xla/python_api/xla_shape.py b/tensorflow/compiler/xla/python_api/xla_shape.py
index 6af2895803..f158f6b241 100644
--- a/tensorflow/compiler/xla/python_api/xla_shape.py
+++ b/tensorflow/compiler/xla/python_api/xla_shape.py
@@ -18,7 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import numpy as np
+import numpy as _np  # Avoids becoming a part of public Tensorflow API.
 
 from tensorflow.compiler.xla import xla_data_pb2
 from tensorflow.compiler.xla.python_api import types
@@ -111,7 +111,7 @@ def _CreateShapeFromNumpy(ndarray):  # pylint: disable=invalid-name
 
   # Set the shape's layout based on the ordering of ndarray.
   # Numpy arrays come in two orders: Fortran (column-major) and C (row-major).
-  if np.isfortran(ndarray):
+  if _np.isfortran(ndarray):
     # Column-major layout. This corresponds to a "dimension order is
     # minor-to-major" layout in XLA.
     layout = range(ndarray.ndim)
diff --git a/tensorflow/compiler/xla/reference_util.cc b/tensorflow/compiler/xla/reference_util.cc
index c289c84cff..a803520876 100644
--- a/tensorflow/compiler/xla/reference_util.cc
+++ b/tensorflow/compiler/xla/reference_util.cc
@@ -18,7 +18,8 @@ limitations under the License.
 #include <array>
 #include <utility>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h"
 #include "tensorflow/compiler/xla/service/hlo_evaluator.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -510,8 +511,8 @@ ReferenceUtil::ConvArray4DGeneralDimensionsDilated(
     std::pair<int64, int64> lhs_dilation, std::pair<int64, int64> rhs_dilation,
     ConvolutionDimensionNumbers dnums) {
   HloComputation::Builder b("ConvArray4DGeneralDimensionDilated");
-  auto lhs_literal = Literal::CreateR4FromArray4D<float>(lhs);
-  auto rhs_literal = Literal::CreateR4FromArray4D<float>(rhs);
+  auto lhs_literal = LiteralUtil::CreateR4FromArray4D<float>(lhs);
+  auto rhs_literal = LiteralUtil::CreateR4FromArray4D<float>(rhs);
 
   std::array<int64, 2> ordered_kernel_strides;
   std::array<int64, 2> ordered_input_dimensions;
diff --git a/tensorflow/compiler/xla/reference_util_test.cc b/tensorflow/compiler/xla/reference_util_test.cc
index 9da9bc60a2..8091bed499 100644
--- a/tensorflow/compiler/xla/reference_util_test.cc
+++ b/tensorflow/compiler/xla/reference_util_test.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -53,7 +53,7 @@ class ReferenceUtilTest : public ::testing::Test {
 
 TEST_F(ReferenceUtilTest, TransposeArray2D) {
   auto result = ReferenceUtil::TransposeArray2D(*matrix_);
-  auto actual_literal = Literal::CreateR2FromArray2D(*result);
+  auto actual_literal = LiteralUtil::CreateR2FromArray2D(*result);
   LiteralTestUtil::ExpectR2Near<float>({{1.f, 4.f}, {2.f, 5.f}, {3.f, 6.f}},
                                        *actual_literal, ErrorSpec(0.0001));
 }
@@ -65,7 +65,7 @@ TEST_F(ReferenceUtilTest, MatmulArray2D) {
       {11.f, 12.f},
   });
   auto result = ReferenceUtil::MatmulArray2D(*matrix_, rhs);
-  auto actual_literal = Literal::CreateR2FromArray2D(*result);
+  auto actual_literal = LiteralUtil::CreateR2FromArray2D(*result);
   LiteralTestUtil::ExpectR2Near<float>({{58.f, 64.f}, {139.f, 154.f}},
                                        *actual_literal, ErrorSpec(0.0001));
 }
@@ -73,7 +73,7 @@ TEST_F(ReferenceUtilTest, MatmulArray2D) {
 TEST_F(ReferenceUtilTest, ReduceToColArray2D) {
   auto add = [](float lhs, float rhs) { return lhs + rhs; };
   auto result = ReferenceUtil::ReduceToColArray2D(*matrix_, 0.0f, add);
-  auto actual_literal = Literal::CreateR1<float>(*result);
+  auto actual_literal = LiteralUtil::CreateR1<float>(*result);
   LiteralTestUtil::ExpectR1Near<float>({6.f, 15.f}, *actual_literal,
                                        ErrorSpec(0.0001));
 }
@@ -81,13 +81,13 @@ TEST_F(ReferenceUtilTest, ReduceToColArray2D) {
 TEST_F(ReferenceUtilTest, ReduceToRowArray2D) {
   auto add = [](float lhs, float rhs) { return lhs + rhs; };
   auto result = ReferenceUtil::ReduceToRowArray2D(*matrix_, 0.0f, add);
-  auto actual_literal = Literal::CreateR1<float>(*result);
+  auto actual_literal = LiteralUtil::CreateR1<float>(*result);
   LiteralTestUtil::ExpectR1Near<float>({5.f, 7.f, 9.f}, *actual_literal,
                                        ErrorSpec(0.0001));
 }
 
 TEST_F(ReferenceUtilTest, Reduce4Dto1DZeroSizedArray) {
-  auto result = Literal::CreateR1<float>(ReferenceUtil::Reduce4DTo1D(
+  auto result = LiteralUtil::CreateR1<float>(ReferenceUtil::Reduce4DTo1D(
       Array4D<float>(1, 0, 1, 1), /*init=*/0, /*dims=*/{0, 1, 2},
       [](float a, float b) { return a + b; }));
   LiteralTestUtil::ExpectR1Equal<float>({0}, *result);
@@ -96,7 +96,7 @@ TEST_F(ReferenceUtilTest, Reduce4Dto1DZeroSizedArray) {
 TEST_F(ReferenceUtilTest, MapArray2D) {
   auto identity = [](float value) { return log(exp(value)); };
   auto result = ReferenceUtil::MapArray2D(*matrix_, identity);
-  auto actual_literal = Literal::CreateR2FromArray2D(*result);
+  auto actual_literal = LiteralUtil::CreateR2FromArray2D(*result);
   LiteralTestUtil::ExpectR2NearArray2D(*matrix_, *actual_literal,
                                        ErrorSpec(0.0001));
 }
@@ -106,7 +106,7 @@ TEST_F(ReferenceUtilTest, MapWithIndexArray2D) {
     return value + row + col;
   };
   auto result = ReferenceUtil::MapWithIndexArray2D(*matrix_, add_index);
-  auto actual_literal = Literal::CreateR2FromArray2D(*result);
+  auto actual_literal = LiteralUtil::CreateR2FromArray2D(*result);
   LiteralTestUtil::ExpectR2Near<float>({{1.f, 3.f, 5.f}, {5.f, 7.f, 9.f}},
                                        *actual_literal, ErrorSpec(0.0001));
 }
@@ -117,7 +117,7 @@ TEST_F(ReferenceUtilTest, MapArray4D) {
   input->FillWithMultiples(1.0f);
   auto multiply_by_two = [](float value) { return 2 * value; };
   auto result = ReferenceUtil::MapArray4D(*input, multiply_by_two);
-  auto actual_literal = Literal::CreateR4FromArray4D(*result);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*result);
 
   Array4D<float> expected(/*planes=*/2, /*depth=*/3, /*height=*/4, /*width=*/5);
   expected.FillWithMultiples(2.0f);
@@ -134,7 +134,7 @@ TEST_F(ReferenceUtilTest, MapWithIndexArray4D) {
     return value - (3 * 4 * 5 * plane + 4 * 5 * depth + 5 * height + width);
   };
   auto result = ReferenceUtil::MapWithIndexArray4D(*input, subtract_index);
-  auto actual_literal = Literal::CreateR4FromArray4D(*result);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*result);
 
   Array4D<float> expected(/*planes=*/2, /*depth=*/3, /*height=*/4, /*width=*/5);
   expected.Fill(0.0f);
@@ -144,7 +144,7 @@ TEST_F(ReferenceUtilTest, MapWithIndexArray4D) {
 
 TEST_F(ReferenceUtilTest, SliceArray2D) {
   auto result = ReferenceUtil::Slice2D(*matrix_, {{0, 0}}, {{2, 2}}, {{1, 1}});
-  auto actual_literal = Literal::CreateR2FromArray2D(*result);
+  auto actual_literal = LiteralUtil::CreateR2FromArray2D(*result);
 
   LiteralTestUtil::ExpectR2Near<float>({{1.f, 2.f}, {4.f, 5.f}},
                                        *actual_literal, ErrorSpec(0.0001));
@@ -152,7 +152,7 @@ TEST_F(ReferenceUtilTest, SliceArray2D) {
 
 TEST_F(ReferenceUtilTest, SliceStridedArray2D) {
   auto result = ReferenceUtil::Slice2D(*matrix_, {{0, 0}}, {{2, 3}}, {{1, 2}});
-  auto actual_literal = Literal::CreateR2FromArray2D(*result);
+  auto actual_literal = LiteralUtil::CreateR2FromArray2D(*result);
 
   LiteralTestUtil::ExpectR2Near<float>({{1.f, 3.f}, {4.f, 6.f}},
                                        *actual_literal, ErrorSpec(0.0001));
@@ -164,7 +164,7 @@ TEST_F(ReferenceUtilTest, SliceArray3D) {
 
   auto result =
       ReferenceUtil::Slice3D(input, {{0, 0, 0}}, {{2, 2, 2}}, {{1, 1, 1}});
-  auto actual_literal = Literal::CreateR3FromArray3D(*result);
+  auto actual_literal = LiteralUtil::CreateR3FromArray3D(*result);
 
   LiteralTestUtil::ExpectR3Near<float>(
       {{{0.f, 1.f}, {4.f, 5.f}}, {{12.f, 13.f}, {16.f, 17.f}}}, *actual_literal,
@@ -177,7 +177,7 @@ TEST_F(ReferenceUtilTest, SliceStridedArray3D) {
 
   auto result =
       ReferenceUtil::Slice3D(input, {{0, 0, 0}}, {{2, 3, 4}}, {{1, 2, 2}});
-  auto actual_literal = Literal::CreateR3FromArray3D(*result);
+  auto actual_literal = LiteralUtil::CreateR3FromArray3D(*result);
 
   LiteralTestUtil::ExpectR3Near<float>(
       {{{0.f, 2.f}, {8.f, 10.f}}, {{12.f, 14.f}, {20.f, 22.f}}},
@@ -190,7 +190,7 @@ TEST_F(ReferenceUtilTest, SliceArray4D) {
 
   auto result = ReferenceUtil::Slice4D(input, {{1, 0, 0, 0}}, {{2, 2, 2, 2}},
                                        {{1, 1, 1, 1}});
-  auto actual_literal = Literal::CreateR4FromArray4D(*result);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*result);
 
   LiteralTestUtil::ExpectR4Near<float>(
       {{{{60.f, 61.f}, {65.f, 66.f}}, {{80.f, 81.f}, {85.f, 86.f}}}},
@@ -203,7 +203,7 @@ TEST_F(ReferenceUtilTest, SliceStridedArray4D) {
 
   auto result = ReferenceUtil::Slice4D(input, {{1, 0, 0, 0}}, {{2, 3, 4, 5}},
                                        {{1, 2, 2, 2}});
-  auto actual_literal = Literal::CreateR4FromArray4D(*result);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*result);
 
   LiteralTestUtil::ExpectR4Near<float>(
       {{{{60.f, 62.f, 64.f}, {70.f, 72.f, 74.f}},
@@ -218,7 +218,7 @@ TEST_F(ReferenceUtilTest, ConvArray3DWithSamePadding) {
       ReferenceUtil::ConvArray3D(input, weights, 1, Padding::kSame);
   Array3D<float> expected = {{{17, 28, 39, 20}}};
 
-  auto actual_literal = Literal::CreateR3FromArray3D(*actual);
+  auto actual_literal = LiteralUtil::CreateR3FromArray3D(*actual);
 
   LiteralTestUtil::ExpectR3NearArray3D<float>(expected, *actual_literal,
                                               ErrorSpec(0.0001));
@@ -231,7 +231,7 @@ TEST_F(ReferenceUtilTest, ConvArray3DWithValidPadding) {
       ReferenceUtil::ConvArray3D(input, weights, 1, Padding::kValid);
   Array3D<float> expected = {{{17, 28, 39}}};
 
-  auto actual_literal = Literal::CreateR3FromArray3D(*actual);
+  auto actual_literal = LiteralUtil::CreateR3FromArray3D(*actual);
 
   LiteralTestUtil::ExpectR3NearArray3D<float>(expected, *actual_literal,
                                               ErrorSpec(0.0001));
@@ -266,7 +266,7 @@ TEST_F(ReferenceUtilTest, ConvWithSamePadding) {
   }));
   // clang-format on
 
-  auto actual_literal = Literal::CreateR4FromArray4D(*actual);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*actual);
 
   LiteralTestUtil::ExpectR4NearArray4D<float>(expected, *actual_literal,
                                               ErrorSpec(0.0001));
@@ -300,7 +300,7 @@ TEST_F(ReferenceUtilTest, ConvWithValidPadding) {
   }));
   // clang-format on
 
-  auto actual_literal = Literal::CreateR4FromArray4D(*actual);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*actual);
 
   LiteralTestUtil::ExpectR4NearArray4D<float>(expected, *actual_literal,
                                               ErrorSpec(0.0001));
@@ -356,7 +356,7 @@ TEST_F(ReferenceUtilTest, ConvGeneralDimensionsWithSamePadding) {
   }});
   // clang-format on
 
-  auto actual_literal = Literal::CreateR4FromArray4D(*actual);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*actual);
 
   LiteralTestUtil::ExpectR4NearArray4D<float>(expected, *actual_literal,
                                               ErrorSpec(0.0001));
@@ -409,7 +409,7 @@ TEST_F(ReferenceUtilTest, ConvGeneralDimensionsWithValidPadding) {
   Array4D<float> expected({{{{2514, 2685}}}});
   // clang-format on
 
-  auto actual_literal = Literal::CreateR4FromArray4D(*actual);
+  auto actual_literal = LiteralUtil::CreateR4FromArray4D(*actual);
 
   LiteralTestUtil::ExpectR4NearArray4D<float>(expected, *actual_literal,
                                               ErrorSpec(0.0001));
@@ -422,7 +422,7 @@ TEST_F(ReferenceUtilTest, ApplyElementwise2D) {
 
   auto actual = ReferenceUtil::ApplyElementwise2D(
       [](float x, float y, float z) { return 100 * x + 10 * y + z; }, a, b, c);
-  auto actual_literal = Literal::CreateR2FromArray2D(*actual);
+  auto actual_literal = LiteralUtil::CreateR2FromArray2D(*actual);
   LiteralTestUtil::ExpectR2Near({{300.f, 600.f}, {900.f, 1200.f}},
                                 *actual_literal, ErrorSpec(0.0001));
 }
diff --git a/tensorflow/compiler/xla/rpc/BUILD b/tensorflow/compiler/xla/rpc/BUILD
index 0b1cec1925..44b22a5586 100644
--- a/tensorflow/compiler/xla/rpc/BUILD
+++ b/tensorflow/compiler/xla/rpc/BUILD
@@ -56,7 +56,7 @@ tf_cc_test(
         ":grpc_stub",
         "//tensorflow:grpc++",
         "//tensorflow/compiler/xla/client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:lib",
diff --git a/tensorflow/compiler/xla/rpc/grpc_client_test.cc b/tensorflow/compiler/xla/rpc/grpc_client_test.cc
index f8414468bd..6788676181 100644
--- a/tensorflow/compiler/xla/rpc/grpc_client_test.cc
+++ b/tensorflow/compiler/xla/rpc/grpc_client_test.cc
@@ -24,7 +24,7 @@ limitations under the License.
 #include "grpcpp/security/credentials.h"
 
 #include "tensorflow/compiler/xla/client/client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/rpc/grpc_stub.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/core/lib/io/path.h"
@@ -97,7 +97,7 @@ TEST_F(GRPCClientTestBase, AxpyTenValues) {
       1.85840735, -1.85840735, 2.28318531,   -2.28318531,  -6.42477796,
       6.42477796, 10.56637061, -10.56637061, -14.70796327, 14.70796327};
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR1<float>(expected);
+      LiteralUtil::CreateR1<float>(expected);
   TF_ASSERT_OK_AND_ASSIGN(auto computation, builder.Build());
   TF_ASSERT_OK_AND_ASSIGN(auto result_literal, client_->ExecuteAndTransfer(
                                                    computation, {}, nullptr));
diff --git a/tensorflow/compiler/xla/service/BUILD b/tensorflow/compiler/xla/service/BUILD
index fe99f700d2..528b7fdfd3 100644
--- a/tensorflow/compiler/xla/service/BUILD
+++ b/tensorflow/compiler/xla/service/BUILD
@@ -136,7 +136,7 @@ cc_library(
         ":hlo_dce",
         ":hlo_pass",
         ":tuple_simplifier",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_tree",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
@@ -182,6 +182,7 @@ tf_cc_test(
     name = "shape_inference_test",
     srcs = ["shape_inference_test.cc"],
     deps = [
+        ":hlo",
         ":shape_inference",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -227,6 +228,7 @@ cc_library(
         ":hlo",
         ":hlo_query",
         ":shape_inference",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
@@ -244,7 +246,7 @@ tf_cc_test(
     deps = [
         ":hlo",
         ":hlo_evaluator",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status",
@@ -254,7 +256,7 @@ tf_cc_test(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:hlo_element_type_converter",
         "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -294,6 +296,7 @@ cc_library(
         ":hlo_reachability",
         ":name_uniquer",
         "//tensorflow/compiler/xla:array",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:protobuf_util",
         "//tensorflow/compiler/xla:shape_tree",
@@ -396,6 +399,7 @@ tf_cc_test(
     deps = [
         ":hlo_matchers",
         ":hlo_parser",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
     ],
@@ -407,7 +411,7 @@ tf_cc_test(
     deps = [
         ":hlo",
         ":hlo_parser",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:protobuf_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -424,7 +428,7 @@ tf_cc_test(
     srcs = ["hlo_sharding_test.cc"],
     deps = [
         ":hlo",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:protobuf_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -453,7 +457,7 @@ tf_cc_test(
     srcs = ["call_graph_test.cc"],
     deps = [
         ":call_graph",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
@@ -487,6 +491,7 @@ cc_library(
     hdrs = ["call_inliner.h"],
     deps = [
         ":call_graph",
+        ":hlo_dce",
         ":hlo_pass",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/core:lib",
@@ -502,7 +507,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_matchers",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
@@ -521,7 +526,7 @@ tf_cc_test(
     deps = [
         ":call_graph",
         ":flatten_call_graph",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
@@ -559,7 +564,7 @@ cc_library(
         ":computation_placer",
         ":device_memory_allocator",
         ":platform_util",
-        ":pool",
+        ":stream_pool",
         ":transfer_manager",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -593,6 +598,7 @@ cc_library(
         ":hlo_proto_util",
         ":platform_util",
         ":source_map_util",
+        ":stream_pool",
         ":transfer_manager",
         "//tensorflow/compiler/xla:executable_run_options",
         "//tensorflow/compiler/xla:execution_options_util",
@@ -637,7 +643,7 @@ cc_library(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:executable_build_options",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
     ],
@@ -746,8 +752,8 @@ cc_library(
         ":hlo_execution_profile",
         ":hlo_graph_dumper",
         ":hlo_proto",
-        ":pool",
         ":shaped_buffer",
+        ":stream_pool",
         "//tensorflow/compiler/xla:executable_run_options",
         "//tensorflow/compiler/xla:status",
         "//tensorflow/compiler/xla:status_macros",
@@ -797,7 +803,7 @@ cc_library(
     hdrs = ["transfer_manager.h"],
     deps = [
         ":shaped_buffer",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -833,7 +839,7 @@ cc_library(
     hdrs = ["execution_tracker.h"],
     deps = [
         ":backend",
-        ":pool",
+        ":stream_pool",
         "//tensorflow/compiler/xla:executable_run_options",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
@@ -941,7 +947,6 @@ cc_library(
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
-        "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
     ],
@@ -960,7 +965,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_ordering",
         ":hlo_scheduling",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
@@ -1038,7 +1043,7 @@ tf_cc_test(
         ":hlo_ordering",
         ":hlo_value",
         ":tuple_points_to_analysis",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1052,6 +1057,7 @@ cc_library(
     hdrs = ["hlo_module_group_metadata.h"],
     deps = [
         ":hlo",
+        ":hlo_casting_utils",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status",
         "//tensorflow/compiler/xla:status_macros",
@@ -1121,7 +1127,7 @@ cc_library(
     hdrs = ["hlo_query.h"],
     deps = [
         ":hlo",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
     ],
 )
@@ -1170,6 +1176,7 @@ cc_library(
     deps = [
         ":hlo",
         ":shape_inference",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
@@ -1200,6 +1207,7 @@ cc_library(
     deps = [
         ":hlo",
         ":hlo_pass",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -1219,6 +1227,7 @@ cc_library(
         ":hlo_creation_utils",
         ":hlo_pass",
         ":while_util",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:util",
     ],
@@ -1232,8 +1241,9 @@ tf_cc_test(
         ":batchnorm_expander",
         ":hlo",
         ":hlo_matchers",
+        ":hlo_parser",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
@@ -1255,6 +1265,7 @@ cc_library(
         ":hlo_pass",
         ":hlo_query",
         ":pattern_matcher",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -1274,7 +1285,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_matchers",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
@@ -1310,7 +1321,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_matchers",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
@@ -1345,7 +1356,7 @@ cc_library(
         ":call_inliner",
         ":hlo",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:types",
@@ -1361,6 +1372,7 @@ tf_cc_test(
         ":conditional_simplifier",
         ":hlo",
         ":hlo_matchers",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -1420,7 +1432,7 @@ tf_cc_test(
     deps = [
         ":defuser",
         ":hlo_matchers",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
     ],
@@ -1448,7 +1460,7 @@ tf_cc_test(
     deps = [
         ":hlo_matchers",
         ":implicit_broadcast_remover",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
     ],
@@ -1490,7 +1502,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_matchers",
         ":tuple_simplifier",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
@@ -1505,7 +1517,7 @@ cc_library(
     hdrs = ["reshape_mover.h"],
     deps = [
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:util",
@@ -1520,7 +1532,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_matchers",
         ":reshape_mover",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
@@ -1555,7 +1567,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_matchers",
         ":inliner",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:util",
@@ -1572,7 +1584,7 @@ cc_library(
     hdrs = ["computation_placer.h"],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status",
         "//tensorflow/compiler/xla:status_macros",
@@ -1604,7 +1616,7 @@ cc_library(
     hdrs = ["generic_transfer_manager.h"],
     deps = [
         ":transfer_manager",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -1651,8 +1663,8 @@ tf_cc_test(
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client:padding",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
@@ -1695,7 +1707,7 @@ tf_cc_test(
     deps = [
         ":hlo",
         ":hlo_matchers",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
@@ -1710,6 +1722,7 @@ tf_cc_binary(
     deps = [
         ":hlo",
         ":hlo_graph_dumper",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
@@ -1724,7 +1737,7 @@ tf_cc_test(
     srcs = ["hlo_module_test.cc"],
     deps = [
         ":hlo",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:util",
@@ -1822,7 +1835,7 @@ tf_cc_test(
         ":hlo_matchers",
         ":hlo_ordering",
         ":instruction_fusion",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
@@ -1859,7 +1872,7 @@ tf_cc_test(
     deps = [
         ":hlo",
         ":hlo_liveness_analysis",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
@@ -1920,7 +1933,7 @@ tf_cc_test(
         ":hlo_matchers",
         ":hlo_ordering",
         ":instruction_fusion",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
@@ -1955,6 +1968,7 @@ cc_library(
         ":hlo_dataflow_analysis",
         ":logical_buffer",
         ":logical_buffer_analysis",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_tree",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
@@ -1973,6 +1987,7 @@ tf_cc_test(
         ":hlo_matchers",
         ":instruction_fusion",
         ":tuple_points_to_analysis",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -1996,6 +2011,7 @@ cc_library(
     deps = [
         ":computation_layout",
         ":hlo",
+        ":hlo_casting_utils",
         ":hlo_dce",
         ":hlo_graph_dumper",
         ":hlo_pass",
@@ -2044,7 +2060,7 @@ tf_cc_test(
         ":hlo_graph_dumper",
         ":hlo_matchers",
         ":hlo_runner",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
@@ -2108,6 +2124,7 @@ tf_cc_test(
     srcs = ["hlo_verifier_test.cc"],
     deps = [
         ":hlo",
+        ":hlo_parser",
         ":hlo_verifier",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -2169,6 +2186,7 @@ tf_cc_test(
     deps = [
         ":hlo",
         ":hlo_dce",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
@@ -2189,7 +2207,7 @@ tf_cc_test(
     deps = [
         ":hlo",
         ":hlo_module_dce",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
@@ -2213,7 +2231,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_matchers",
         ":layout_assignment",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_layout",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -2272,7 +2290,7 @@ cc_library(
         ":hlo",
         ":hlo_domain_map",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -2288,7 +2306,7 @@ tf_cc_test(
         ":hlo",
         ":hlo_cse",
         ":hlo_matchers",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
@@ -2310,7 +2328,7 @@ cc_library(
         ":hlo_evaluator",
         ":hlo_pass",
         ":hlo_query",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/core:lib",
@@ -2325,7 +2343,7 @@ tf_cc_test(
         ":hlo_constant_folding",
         ":hlo_matchers",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:types",
@@ -2363,6 +2381,20 @@ cc_library(
 )
 
 cc_library(
+    name = "hlo_domain_verifier",
+    srcs = ["hlo_domain_verifier.cc"],
+    hdrs = ["hlo_domain_verifier.h"],
+    deps = [
+        ":hlo",
+        ":hlo_domain_map",
+        ":hlo_graph_dumper",
+        ":hlo_pass",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
     name = "hlo_domain_isolator",
     srcs = ["hlo_domain_isolator.cc"],
     hdrs = ["hlo_domain_isolator.h"],
@@ -2381,8 +2413,8 @@ cc_library(
     hdrs = ["hlo_domain_remover.h"],
     deps = [
         ":hlo",
-        ":hlo_domain_isolator",
         ":hlo_domain_map",
+        ":hlo_domain_verifier",
         ":hlo_graph_dumper",
         ":hlo_pass",
         "//tensorflow/compiler/xla:types",
@@ -2417,7 +2449,7 @@ cc_library(
         ":hlo_evaluator",
         ":hlo_pass",
         ":hlo_query",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/core:lib",
@@ -2552,7 +2584,7 @@ cc_library(
     hdrs = ["hlo_tfgraph_builder.h"],
     deps = [
         ":hlo",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:xla_proto",
         "//tensorflow/core:framework",
@@ -2583,7 +2615,7 @@ cc_library(
         ":hlo_casting_utils",
         ":hlo_execution_profile",
         ":hlo_tfgraph_builder",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:window_util",
@@ -2601,6 +2633,7 @@ tf_cc_test(
     deps = [
         ":hlo",
         ":hlo_graph_dumper",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:xla_proto",
         "//tensorflow/compiler/xla/tests:test_utils",
@@ -2632,12 +2665,12 @@ tf_cc_test(
         ":hlo_matchers",
         ":shape_inference",
         ":transpose_folding",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service/gpu:ir_emission_utils",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
@@ -2653,7 +2686,7 @@ cc_library(
     deps = [
         ":hlo",
         ":hlo_pass",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:util",
@@ -2668,13 +2701,13 @@ tf_cc_test(
         ":hlo",
         ":shape_inference",
         ":zero_sized_hlo_elimination",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
@@ -2682,21 +2715,25 @@ tf_cc_test(
 )
 
 cc_library(
-    name = "pool",
-    hdrs = ["pool.h"],
+    name = "stream_pool",
+    srcs = ["stream_pool.cc"],
+    hdrs = ["stream_pool.h"],
     deps = [
+        "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/core:lib",
+        "//tensorflow/core:stream_executor_no_cuda",
     ],
 )
 
 tf_cc_test(
-    name = "pool_test",
-    srcs = ["pool_test.cc"],
+    name = "stream_pool_test",
+    srcs = ["stream_pool_test.cc"],
     deps = [
-        ":pool",
+        ":stream_pool",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/core:stream_executor_no_cuda",
     ],
 )
 
@@ -2828,6 +2865,7 @@ cc_library(
         ":hlo",
         ":hlo_creation_utils",
         ":tuple_util",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/core:lib",
     ],
 )
@@ -2963,6 +3001,7 @@ cc_library(
         ":hlo",
         ":hlo_lexer",
         ":hlo_sharding_metadata",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier.cc b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
index 1ddeb27e40..946ef6f0d6 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
@@ -23,6 +23,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
@@ -149,6 +150,8 @@ class AlgebraicSimplifierVisitor : public DfsHloVisitorWithDefault {
   Status HandleDynamicUpdateSlice(
       HloInstruction* dynamic_update_slice) override;
 
+  Status HandleSort(HloInstruction* sort) override;
+
   Status HandleTranspose(HloInstruction* transpose) override;
 
   Status HandleSubtract(HloInstruction* sub) override;
@@ -195,7 +198,7 @@ class AlgebraicSimplifierVisitor : public DfsHloVisitorWithDefault {
   HloInstruction* AddReduce(HloInstruction* hlo, int64 dim) {
     HloInstruction* zero =
         computation_->AddInstruction(HloInstruction::CreateConstant(
-            Literal::Zero(hlo->shape().element_type()).CloneToUnique()));
+            LiteralUtil::Zero(hlo->shape().element_type()).CloneToUnique()));
     HloComputation* AddReduce_computation = GetOrCreateScalarAddComputation();
     Shape shape = ShapeUtil::DeleteDimension(dim, hlo->shape());
     return computation_->AddInstruction(HloInstruction::CreateReduce(
@@ -537,8 +540,8 @@ Status AlgebraicSimplifierVisitor::HandleConstant(HloInstruction* constant) {
   // If a literal is all the same element replace it with a scalar broadcast.
   if (ShapeUtil::ElementsIn(constant->shape()) > 1 &&
       constant->literal().IsAllFirst()) {
-    std::unique_ptr<Literal> unique_scalar =
-        MakeUnique<Literal>(constant->literal().GetFirstScalarLiteral());
+    std::unique_ptr<Literal> unique_scalar = MakeUnique<Literal>(
+        LiteralUtil::GetFirstScalarLiteral(constant->literal()));
     HloInstruction* scalar = computation_->AddInstruction(
         HloInstruction::CreateConstant(std::move(unique_scalar)));
     return ReplaceWithNewInstruction(
@@ -1093,7 +1096,7 @@ Status AlgebraicSimplifierVisitor::HandleDot(HloInstruction* dot) {
       ShapeUtil::IsZeroElementArray(lhs->shape()) ||
       ShapeUtil::IsZeroElementArray(rhs->shape())) {
     auto zero = computation_->AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0(0.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0(0.0f)));
     return ReplaceWithNewInstruction(
         dot, HloInstruction::CreateBroadcast(dot->shape(), zero, {}));
   }
@@ -1155,6 +1158,19 @@ Status AlgebraicSimplifierVisitor::HandleMultiply(HloInstruction* multiply) {
     return Status::OK();
   }
 
+  // 0*A => 0. Only applies for integral types for correct NaN-handling.
+  if (IsAll(lhs, 0) &&
+      primitive_util::IsIntegralType(multiply->shape().element_type()) &&
+      ReplaceInstructionIfSameShape(multiply, lhs)) {
+    return Status::OK();
+  }
+  // A*0 => 0
+  if (IsAll(rhs, 0) &&
+      primitive_util::IsIntegralType(multiply->shape().element_type()) &&
+      ReplaceInstructionIfSameShape(multiply, rhs)) {
+    return Status::OK();
+  }
+
   // exp(A) * exp(B) => exp(A+B)
   if (Match(multiply, m::Multiply(m::Exp(m::Op(&lhs)), m::Exp(m::Op(&rhs))))) {
     auto add = computation_->AddInstruction(HloInstruction::CreateBinary(
@@ -1519,7 +1535,7 @@ Status AlgebraicSimplifierVisitor::HandlePower(HloInstruction* power) {
   CHECK(Match(power, m::Power(m::Op(&lhs), m::Op(&rhs))));
   if (IsAll(rhs, 0)) {
     auto one = HloInstruction::CreateConstant(
-        Literal::One(power->shape().element_type()).CloneToUnique());
+        LiteralUtil::One(power->shape().element_type()).CloneToUnique());
     std::unique_ptr<HloInstruction> ones;
     if (ShapeUtil::IsScalar(power->shape())) {
       ones = std::move(one);
@@ -1554,7 +1570,7 @@ Status AlgebraicSimplifierVisitor::HandlePower(HloInstruction* power) {
   VLOG(10) << "trying transform [pow(A, -1) => 1/A]: " << power->ToString();
   if (IsAll(rhs, -1)) {
     auto* one = computation_->AddInstruction(HloInstruction::CreateConstant(
-        Literal::One(rhs->shape().element_type()).CloneToUnique()));
+        LiteralUtil::One(rhs->shape().element_type()).CloneToUnique()));
 
     // Explicitly broadcast scalar 1 to the output shape, to avoid implicit
     // broadcast in divide HLO as we are trying to eliminate implicit
@@ -1730,19 +1746,37 @@ Status AlgebraicSimplifierVisitor::HandleSlice(HloInstruction* slice) {
   if (ReplaceInstructionIfSameShape(slice, slice->mutable_operand(0))) {
     return Status::OK();
   }
+
+  auto is_unstrided_slice = [](const HloInstruction* hlo) {
+    return c_all_of(hlo->slice_strides(),
+                    [](int64 stride) { return stride == 1; });
+  };
+  if (slice->operand(0)->opcode() == HloOpcode::kSlice &&
+      is_unstrided_slice(slice) && is_unstrided_slice(slice->operand(0))) {
+    HloInstruction* operand_slice = slice->mutable_operand(0);
+    std::vector<int64> new_slice_starts = slice->slice_starts();
+    std::vector<int64> new_slice_limits = slice->slice_limits();
+    for (int64 i = 0; i < new_slice_starts.size(); ++i) {
+      new_slice_starts[i] += operand_slice->slice_starts(i);
+      new_slice_limits[i] += operand_slice->slice_starts(i);
+    }
+    return ReplaceWithNewInstruction(
+        slice, HloInstruction::CreateSlice(
+                   slice->shape(), operand_slice->mutable_operand(0),
+                   new_slice_starts, new_slice_limits, slice->slice_strides()));
+  }
   return Status::OK();
 }
 
 Status AlgebraicSimplifierVisitor::HandleDynamicSlice(
     HloInstruction* dynamic_slice) {
   auto operand = dynamic_slice->mutable_operand(0);
-  auto start_indices = dynamic_slice->operand(1);
   if (ShapeUtil::IsScalar(dynamic_slice->shape())) {
     return ReplaceInstruction(dynamic_slice, operand);
   }
-  // DynamicSlice where operand has the same size as the output and
-  // start_indices are all zero is simply equal to operand.
-  if (IsAll(start_indices, 0) && SameShape(operand, dynamic_slice)) {
+  // DynamicSlice where operand has the same size as the output is simply equal
+  // to operand.
+  if (SameShape(operand, dynamic_slice)) {
     return ReplaceInstruction(dynamic_slice, operand);
   }
   return Status::OK();
@@ -1751,20 +1785,10 @@ Status AlgebraicSimplifierVisitor::HandleDynamicSlice(
 Status AlgebraicSimplifierVisitor::HandleDynamicUpdateSlice(
     HloInstruction* dynamic_update_slice) {
   auto update = dynamic_update_slice->mutable_operand(1);
-  auto start_indices = dynamic_update_slice->operand(2);
-  // DynamicUpdateSlice on a scalar just passes through the update argument.
-  if (ShapeUtil::IsScalar(dynamic_update_slice->shape())) {
-    return ReplaceInstruction(dynamic_update_slice, update);
-  }
 
-  // DynamicUpdateSlice where operand and update have the same size and
-  // start_indices are all zero is simply equal to update.
-  //
-  // (We require start_indices to be all zero because we want this optimization
-  // not to affect the visible behavior of this op even when the indices are out
-  // of range.  Currently dynamic-update-slice wraps out-of-range indices, so
-  // we can only remove the op if its indices never wrap.)
-  if (IsAll(start_indices, 0) && SameShape(dynamic_update_slice, update)) {
+  // DynamicUpdateSlice where operand and update have the same size is simply
+  // equal to update.
+  if (SameShape(dynamic_update_slice, update)) {
     return ReplaceInstruction(dynamic_update_slice, update);
   }
 
@@ -1890,6 +1914,26 @@ Status AlgebraicSimplifierVisitor::HandleReduce(HloInstruction* reduce) {
                       new_reduce_dimensions, function));
     }
   }
+  // Convert Reduce(concat({a,b,...})) to
+  //  map(reduce(a),map(reduce(b),...,))
+  //
+  // This should make fusion easier or use less memory bandwidth in the unfused
+  // case.
+  if (arg->opcode() == HloOpcode::kConcatenate &&
+      c_linear_search(reduce->dimensions(), arg->concatenate_dimension())) {
+    HloInstruction* old_reduce = nullptr;
+    for (HloInstruction* operand : arg->operands()) {
+      HloInstruction* new_reduce = computation_->AddInstruction(
+          HloInstruction::CreateReduce(reduce->shape(), operand, init_value,
+                                       reduce->dimensions(), function));
+      if (old_reduce != nullptr) {
+        new_reduce = computation_->AddInstruction(HloInstruction::CreateMap(
+            reduce->shape(), {old_reduce, new_reduce}, function));
+      }
+      old_reduce = new_reduce;
+    }
+    return ReplaceInstruction(reduce, old_reduce);
+  }
   return Status::OK();
 }
 
@@ -2063,6 +2107,21 @@ Status AlgebraicSimplifierVisitor::HandleReduceWindow(
                          /*reduce_computation=*/function));
 }
 
+Status AlgebraicSimplifierVisitor::HandleSort(HloInstruction* sort) {
+  auto operand = sort->mutable_operand(0);
+  int64 dimension_to_sort = sort->dimensions(0);
+  if (ShapeUtil::IsZeroElementArray(operand->shape()) ||
+      operand->shape().dimensions(dimension_to_sort) <= 1) {
+    if (sort->operand_count() == 1) {
+      return ReplaceInstruction(sort, operand);
+    }
+    // If it is key/value sort, the output of sort is a tuple.
+    return ReplaceWithNewInstruction(
+        sort, HloInstruction::CreateTuple({operand, sort->mutable_operand(1)}));
+  }
+  return Status::OK();
+}
+
 Status AlgebraicSimplifierVisitor::HandleTranspose(HloInstruction* transpose) {
   auto operand = transpose->mutable_operand(0);
   if (std::is_sorted(transpose->dimensions().begin(),
@@ -2098,7 +2157,7 @@ Status AlgebraicSimplifierVisitor::HandleConvolution(
         HloInstruction::CreateBroadcast(
             convolution->shape(),
             computation_->AddInstruction(HloInstruction::CreateConstant(
-                Literal::Zero(convolution->shape().element_type())
+                LiteralUtil::Zero(convolution->shape().element_type())
                     .CloneToUnique())),
             {}));
   }
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
index b733f6f59e..ad14fe6f2c 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -60,7 +60,7 @@ TEST_F(AlgebraicSimplifierTest, AddZero) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kAdd, param0, zero));
 
@@ -74,12 +74,32 @@ TEST_F(AlgebraicSimplifierTest, AddZero) {
   EXPECT_EQ(root, param0);
 }
 
+// Test that A * 0 is simplified to 0
+TEST_F(AlgebraicSimplifierTest, MulZero) {
+  Shape r0s32 = ShapeUtil::MakeShape(S32, {});
+  HloComputation::Builder builder(TestName());
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, r0s32, "param0"));
+  HloInstruction* zero = builder.AddInstruction(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(0)));
+  builder.AddInstruction(
+      HloInstruction::CreateBinary(r0s32, HloOpcode::kMultiply, param0, zero));
+
+  auto computation = module().AddEntryComputation(builder.Build());
+  HloInstruction* root = computation->root_instruction();
+  EXPECT_EQ(root->opcode(), HloOpcode::kMultiply);
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(&module()).ValueOrDie());
+  EXPECT_EQ(computation->root_instruction(), zero);
+}
+
 // Test that Reduce(Reduce(A)) -> Reduce(A)
 TEST_F(AlgebraicSimplifierTest, TwoReducesToOne) {
   HloComputation::Builder builder(TestName());
   // Create add computation.
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   HloComputation* add_computation = nullptr;
   {
     HloComputation::Builder builder(TestName() + ".add");
@@ -119,7 +139,7 @@ TEST_F(AlgebraicSimplifierTest, AddConstOnLHS) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(42.0f)));
   builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kAdd, constant, param0));
 
@@ -140,9 +160,9 @@ TEST_F(AlgebraicSimplifierTest, AddReassociateMergeConstants) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(42.0f)));
   HloInstruction* constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(3.14159f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(3.14159f)));
 
   HloInstruction* add1 = builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kAdd, param0, constant1));
@@ -165,7 +185,7 @@ TEST_F(AlgebraicSimplifierTest, AddBroadcastZeroR0Operand) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r2f32, "param0"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   HloInstruction* bcast = builder.AddInstruction(
       HloInstruction::CreateBroadcast(r2f32, zero, {0, 1}));
   builder.AddInstruction(
@@ -200,7 +220,7 @@ TEST_F(AlgebraicSimplifierTest, InlineTrivialMap) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r2f32, "param0"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   builder.AddInstruction(HloInstruction::CreateMap(
       r2f32,
       {param0, builder.AddInstruction(
@@ -223,7 +243,7 @@ TEST_F(AlgebraicSimplifierTest, AddBroadcastZeroR1Operand) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r2f32, "param0"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({0, 0, 0})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({0, 0, 0})));
   HloInstruction* bcast =
       builder.AddInstruction(HloInstruction::CreateBroadcast(r2f32, zero, {1}));
   builder.AddInstruction(
@@ -242,7 +262,7 @@ TEST_F(AlgebraicSimplifierTest, AddBroadcastZeroR1Operand) {
 TEST_F(AlgebraicSimplifierTest, ConstantToBroadcast) {
   HloComputation::Builder builder(TestName());
   builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({3.14f, 3.14f, 3.14f})));
+      LiteralUtil::CreateR1<float>({3.14f, 3.14f, 3.14f})));
 
   auto computation = module().AddEntryComputation(builder.Build());
   HloInstruction* root = computation->root_instruction();
@@ -258,7 +278,7 @@ TEST_F(AlgebraicSimplifierTest, ConstantToBroadcast) {
 TEST_F(AlgebraicSimplifierTest, ConstantNotToBroadcast) {
   HloComputation::Builder builder(TestName());
   builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({3.14, 3.14, 4})));
+      LiteralUtil::CreateR1<float>({3.14, 3.14, 4})));
 
   auto computation = module().AddEntryComputation(builder.Build());
   HloInstruction* root = computation->root_instruction();
@@ -277,7 +297,7 @@ TEST_F(AlgebraicSimplifierTest, SubZero) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kSubtract, param0, zero));
 
@@ -298,7 +318,7 @@ TEST_F(AlgebraicSimplifierTest, SubConstCanonicalization) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   builder.AddInstruction(HloInstruction::CreateBinary(
       r0f32, HloOpcode::kSubtract, param0, constant));
 
@@ -493,7 +513,7 @@ TEST_F(AlgebraicSimplifierTest, DivideByConstant) {
       HloInstruction::CreateParameter(0, r1f32, "param0"));
   HloInstruction* constant =
       builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR1<float>({0.f, 1.f, 2.f})));
+          LiteralUtil::CreateR1<float>({0.f, 1.f, 2.f})));
   builder.AddInstruction(HloInstruction::CreateBinary(r1f32, HloOpcode::kDivide,
                                                       param0, constant));
 
@@ -559,7 +579,7 @@ TEST_F(AlgebraicSimplifierTest, DivOneScalar) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0f)));
   HloInstruction* div = builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kDivide, param0, one));
 
@@ -580,7 +600,7 @@ TEST_F(AlgebraicSimplifierTest, DivOneArray) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r2f32, "param0"));
   HloInstruction* one = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 1.0}, {1.0, 1.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 1.0}, {1.0, 1.0}})));
   HloInstruction* div = builder.AddInstruction(
       HloInstruction::CreateBinary(r2f32, HloOpcode::kDivide, param0, one));
 
@@ -860,7 +880,7 @@ TEST_F(AlgebraicSimplifierTest, Pow0Scalar) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0)));
   builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kPower, param0, zero));
 
@@ -884,7 +904,7 @@ TEST_F(AlgebraicSimplifierTest, Pow0Vector) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r1f32, "param0"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0)));
   builder.AddInstruction(
       HloInstruction::CreateBinary(r1f32, HloOpcode::kPower, param0, zero));
 
@@ -912,7 +932,7 @@ TEST_F(AlgebraicSimplifierTest, Pow1) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1)));
   builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kPower, param0, one));
 
@@ -934,7 +954,7 @@ TEST_F(AlgebraicSimplifierTest, Pow2) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* two = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2)));
   builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kPower, param0, two));
 
@@ -956,7 +976,7 @@ TEST_F(AlgebraicSimplifierTest, PowNegative1) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "param0"));
   HloInstruction* negative_one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(-1)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(-1)));
   builder.AddInstruction(HloInstruction::CreateBinary(r0f32, HloOpcode::kPower,
                                                       param0, negative_one));
 
@@ -1047,7 +1067,7 @@ TEST_F(AlgebraicSimplifierTest, ZeroSizedReduceWindow) {
   builder.AddInstruction(HloInstruction::CreateReduceWindow(
       ShapeUtil::MakeShape(F32, {5, 2}), param,
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f))),
       window, add_computation));
   module().AddEntryComputation(builder.Build());
   HloPassFix<AlgebraicSimplifier> simplifier(/*is_layout_sensitive=*/false,
@@ -1074,7 +1094,7 @@ TEST_F(AlgebraicSimplifierTest, ZeroSizedPad) {
   builder.AddInstruction(HloInstruction::CreatePad(
       ShapeUtil::MakeShape(F32, {5, 2}), param,
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0(0.0f))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0(0.0f))),
       padding));
   module().AddEntryComputation(builder.Build());
   EXPECT_THAT(module().entry_computation()->root_instruction(),
@@ -1116,7 +1136,7 @@ TEST_F(AlgebraicSimplifierTest, ReshapeBroadcast) {
 TEST_F(AlgebraicSimplifierTest, ConvertBetweenSameType) {
   HloComputation::Builder builder(TestName());
   HloInstruction* input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   builder.AddInstruction(
       HloInstruction::CreateConvert(ShapeUtil::MakeShape(F32, {}), input));
 
@@ -1208,7 +1228,7 @@ TEST_F(AlgebraicSimplifierTest, RemoveEmptyConcatenateOperands) {
   HloInstruction* param1 = builder.AddInstruction(
       HloInstruction::CreateParameter(1, r1f32, "param1"));
   HloInstruction* empty_literal = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({})));
   HloInstruction* empty_slice =
       builder.AddInstruction(HloInstruction::CreateSlice(
           ShapeUtil::MakeShape(F32, {0}), param1, {42}, {42}, {1}));
@@ -1230,6 +1250,55 @@ TEST_F(AlgebraicSimplifierTest, RemoveEmptyConcatenateOperands) {
               op::Concatenate(param0, param0, param1));
 }
 
+// Test that reduce of concat is simplified.
+TEST_F(AlgebraicSimplifierTest, SimplifyReduceOfConcat) {
+  const int kParamLength = 100;
+  Shape r3f32 =
+      ShapeUtil::MakeShape(F32, {kParamLength, kParamLength, kParamLength});
+  HloComputation::Builder builder(TestName());
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, r3f32, "param0"));
+  HloInstruction* param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, r3f32, "param1"));
+  HloInstruction* param2 = builder.AddInstruction(
+      HloInstruction::CreateParameter(2, r3f32, "param2"));
+  Shape concat_shape =
+      ShapeUtil::MakeShape(F32, {kParamLength, 3 * kParamLength, kParamLength});
+  HloInstruction* Concatenate =
+      builder.AddInstruction(HloInstruction::CreateConcatenate(
+          concat_shape, {param0, param1, param2}, 1));
+  HloComputation* add_computation = nullptr;
+  {
+    HloComputation::Builder builder(TestName() + ".add");
+    const Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
+    HloInstruction* p0 = builder.AddInstruction(
+        HloInstruction::CreateParameter(0, scalar_shape, "p0"));
+    HloInstruction* p1 = builder.AddInstruction(
+        HloInstruction::CreateParameter(1, scalar_shape, "p1"));
+    builder.AddInstruction(
+        HloInstruction::CreateBinary(scalar_shape, HloOpcode::kAdd, p0, p1));
+    add_computation = module().AddEmbeddedComputation(builder.Build());
+  }
+  Shape r4f32 = ShapeUtil::MakeShape(F32, {4, 5, 6, 7});
+  Shape reduce_shape = ShapeUtil::MakeShape(F32, {kParamLength});
+
+  HloInstruction* zero = builder.AddInstruction(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0)));
+  builder.AddInstruction(HloInstruction::CreateReduce(
+      reduce_shape, Concatenate, zero, {1, 2}, add_computation));
+
+  auto computation = module().AddEntryComputation(builder.Build());
+
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(&module()).ValueOrDie());
+
+  EXPECT_THAT(
+      computation->root_instruction(),
+      op::Map(op::Map(op::Reduce(param0, zero), op::Reduce(param1, zero)),
+              op::Reduce(param2, zero)));
+}
+
 // Test a concatenate with only empty operands is removed.
 TEST_F(AlgebraicSimplifierTest, OnlyEmptyConcatenateOperands) {
   const int kParamLength = 100;
@@ -1238,7 +1307,7 @@ TEST_F(AlgebraicSimplifierTest, OnlyEmptyConcatenateOperands) {
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, r1f32, "param0"));
   HloInstruction* empty_literal = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({})));
   HloInstruction* empty_slice =
       builder.AddInstruction(HloInstruction::CreateSlice(
           ShapeUtil::MakeShape(F32, {0}), param0, {42}, {42}, {1}));
@@ -1420,7 +1489,7 @@ TEST_F(AlgebraicSimplifierTest, FailureToSinkReshapeDoesntAffectChangedBit) {
       builder.AddInstruction(
           HloInstruction::CreateParameter(0, shape, "param0")),
       builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR2<float>({{0, 0}, {0, 0}})))));
+          LiteralUtil::CreateR2<float>({{0, 0}, {0, 0}})))));
 
   builder.AddInstruction(
       HloInstruction::CreateReshape(ShapeUtil::MakeShape(F32, {4}), add));
@@ -1443,7 +1512,7 @@ TEST_F(AlgebraicSimplifierTest, FailureToSinkBroadcastDoesntAffectChangedBit) {
       builder.AddInstruction(
           HloInstruction::CreateParameter(0, shape, "param0")),
       builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR2<float>({{0, 0}, {0, 0}})))));
+          LiteralUtil::CreateR2<float>({{0, 0}, {0, 0}})))));
 
   builder.AddInstruction(
       HloInstruction::CreateBroadcast(ShapeUtil::MakeShape(F32, {2, 2, 2}), add,
@@ -1726,7 +1795,7 @@ TEST_F(AlgebraicSimplifierTest, RemoveNoopPad) {
       builder.AddInstruction(HloInstruction::CreateParameter(
           0, ShapeUtil::MakeShape(F32, {2, 2}), "param"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   PaddingConfig no_padding;
   for (int i = 0; i < 2; ++i) {
     auto dimension = no_padding.add_dimensions();
@@ -1757,7 +1826,7 @@ TEST_F(AlgebraicSimplifierTest, NegativePadding) {
       builder.AddInstruction(HloInstruction::CreateParameter(
           0, ShapeUtil::MakeShape(F32, {10, 10}), "param"));
   HloInstruction* zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   PaddingConfig padding;
   int64 low_padding[2] = {-1, -2};
   int64 high_padding[2] = {2, -3};
@@ -1839,6 +1908,73 @@ TEST_F(AlgebraicSimplifierTest, RemoveNoopSlice) {
   EXPECT_THAT(computation->root_instruction(), param);
 }
 
+TEST_F(AlgebraicSimplifierTest, SliceOfSliceToSlice) {
+  HloComputation::Builder builder(TestName());
+  const int64 dim0 = 11;
+  const int64 dim1 = 12;
+  HloInstruction* param =
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          0, ShapeUtil::MakeShape(F32, {dim0, dim1}), "param"));
+  HloInstruction* original_slice =
+      builder.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {dim0 - 2, dim1 - 4}), param,
+          /*start_indices=*/{1, 2},
+          /*limit_indices=*/{dim0 - 1, dim1 - 2}, /*strides=*/{1, 1}));
+
+  builder.AddInstruction(HloInstruction::CreateSlice(
+      ShapeUtil::MakeShape(F32, {dim0 - 5, dim1 - 9}), original_slice,
+      /*start_indices=*/{2, 3},
+      /*limit_indices=*/{dim0 - 3, dim1 - 6}, /*strides=*/{1, 1}));
+  auto module = CreateNewModule();
+  HloComputation* computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_THAT(computation->root_instruction(), op::Slice(op::Slice(param)));
+
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
+
+  EXPECT_THAT(computation->root_instruction(), op::Slice(param));
+  EXPECT_EQ(computation->root_instruction()->slice_starts(0), 3);
+  EXPECT_EQ(computation->root_instruction()->slice_starts(1), 5);
+  EXPECT_EQ(computation->root_instruction()->slice_limits(0), dim0 - 2);
+  EXPECT_EQ(computation->root_instruction()->slice_limits(1), dim1 - 4);
+}
+
+TEST_F(AlgebraicSimplifierTest, RemoveNoopSort) {
+  auto builder = HloComputation::Builder(TestName());
+
+  Shape keys_shape = ShapeUtil::MakeShape(F32, {1});
+  auto keys = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, keys_shape, "keys"));
+  builder.AddInstruction(HloInstruction::CreateSort(keys_shape, 0, keys));
+  auto module = CreateNewModule();
+  HloComputation* computation = module->AddEntryComputation(builder.Build());
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
+  EXPECT_THAT(computation->root_instruction(), keys);
+}
+
+TEST_F(AlgebraicSimplifierTest, ReplaceEffectiveScalarKeyValueSortWithTuple) {
+  auto builder = HloComputation::Builder(TestName());
+
+  Shape keys_shape = ShapeUtil::MakeShape(F32, {5, 0});
+  Shape values_shape = ShapeUtil::MakeShape(S32, {5, 0});
+  auto keys = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, keys_shape, "keys"));
+  auto values = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, values_shape, "values"));
+  builder.AddInstruction(HloInstruction::CreateSort(
+      ShapeUtil::MakeTupleShape({keys_shape, values_shape}), 0, keys, values));
+  auto module = CreateNewModule();
+  HloComputation* computation = module->AddEntryComputation(builder.Build());
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(module).ValueOrDie());
+  EXPECT_THAT(computation->root_instruction(), op::Tuple(keys, values));
+}
+
 TEST_F(AlgebraicSimplifierTest, ConvertConvToMatmul) {
   struct ConvTestOptions {
     int in_batch = 10;
@@ -2109,7 +2245,7 @@ TEST_F(AlgebraicSimplifierTest, ScalarBroadcastToSlice) {
 TEST_F(AlgebraicSimplifierTest, ScalarBroadcastToTransposeReshape) {
   HloComputation::Builder builder(TestName());
   HloInstruction* forty_two = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
 
   Shape broadcast_shape = ShapeUtil::MakeShape(F32, {4, 5, 6});
   HloInstruction* broadcast = builder.AddInstruction(
@@ -2156,7 +2292,7 @@ TEST_F(AlgebraicSimplifierTest, FoldPadIntoReduceWindow) {
   padding.mutable_dimensions(3)->set_edge_padding_high(2);
 
   HloInstruction* pad_value = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(5.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(5.0f)));
   HloInstruction* pad = builder.AddInstruction(HloInstruction::CreatePad(
       ShapeUtil::MakeShape(F32, {1, 3, 3, 5}), operand, pad_value, padding));
 
@@ -2187,7 +2323,7 @@ TEST_F(AlgebraicSimplifierTest, FoldPadIntoReduceWindow) {
   const Shape reduce_window_shape =
       ShapeUtil::MakeShape(F32, {111, 113, 113, 115});
   HloInstruction* reduce_init_value = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(5.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(5.0f)));
   HloInstruction* reduce_window =
       builder.AddInstruction(HloInstruction::CreateReduceWindow(
           reduce_window_shape, pad, reduce_init_value, window,
@@ -2238,7 +2374,7 @@ TEST_F(AlgebraicSimplifierTest, FoldConvertedPadIntoReduceWindow) {
   padding.mutable_dimensions(3)->set_edge_padding_high(2);
 
   HloInstruction* pad_value = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(5.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(5.0f)));
   HloInstruction* pad = builder.AddInstruction(HloInstruction::CreatePad(
       ShapeUtil::MakeShape(BF16, {1, 3, 3, 5}), parameter, pad_value, padding));
 
@@ -2273,7 +2409,7 @@ TEST_F(AlgebraicSimplifierTest, FoldConvertedPadIntoReduceWindow) {
   const Shape reduce_window_shape =
       ShapeUtil::MakeShape(F32, {111, 113, 113, 115});
   HloInstruction* reduce_init_value = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(5.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(5.0f)));
   HloInstruction* reduce_window =
       builder.AddInstruction(HloInstruction::CreateReduceWindow(
           reduce_window_shape, convert, reduce_init_value, window,
@@ -2344,9 +2480,9 @@ TEST_F(AlgebraicSimplifierTest, IteratorInvalidation) {
 
   HloComputation::Builder call_builder(TestName() + ".Call");
   HloInstruction* zero = call_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({0.0f})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({0.0f})));
   HloInstruction* one = call_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({1.0f})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({1.0f})));
   call_builder.AddInstruction(
       HloInstruction::CreateCall(r1f32, {zero, one}, dot_computation.get()));
 
@@ -2362,9 +2498,9 @@ TEST_F(AlgebraicSimplifierTest, ConstantTupleBecomesTupleOfConstants) {
   HloComputation::Builder builder(TestName());
   const float constant_scalar = 7.3f;
   std::initializer_list<float> constant_vector = {1.1f, 2.0f, 3.3f};
-  std::unique_ptr<Literal> value =
-      Literal::MakeTuple({Literal::CreateR0<float>(constant_scalar).get(),
-                          Literal::CreateR1<float>(constant_vector).get()});
+  std::unique_ptr<Literal> value = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(constant_scalar).get(),
+       LiteralUtil::CreateR1<float>(constant_vector).get()});
   builder.AddInstruction(HloInstruction::CreateConstant(std::move(value)));
 
   auto computation = module().AddEntryComputation(builder.Build());
@@ -2387,8 +2523,8 @@ TEST_F(AlgebraicSimplifierTest, TrivialDynamicSlice) {
       shape,
       builder.AddInstruction(
           HloInstruction::CreateParameter(0, shape, "slice_from")),
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR1<int>({0, 0, 0}))),
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          1, ShapeUtil::MakeShape(U32, {3}), "slice_indices")),
       /*slice_sizes=*/{10, 100, 1000}));
 
   auto computation = module().AddEntryComputation(builder.Build());
@@ -2421,8 +2557,8 @@ TEST_F(AlgebraicSimplifierTest, TrivialDynamicUpdateSlice) {
       builder.AddInstruction(
           HloInstruction::CreateParameter(2, slice_shape, "to_update")),
       slice,
-      builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR1<int>({0, 0, 0})))));
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          3, ShapeUtil::MakeShape(U32, {3}), "update_indices"))));
 
   auto computation = module().AddEntryComputation(builder.Build());
   AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
@@ -2437,7 +2573,7 @@ TEST_F(AlgebraicSimplifierTest, MergeBroadcasts) {
   HloComputation::Builder builder(TestName());
   Shape r2f32 = ShapeUtil::MakeShape(F32, {2, 2});
   HloInstruction* input_array = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({3, 4})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({3, 4})));
   HloInstruction* inner_bcast = builder.AddInstruction(
       HloInstruction::CreateBroadcast(r2f32, input_array, {1}));
   Shape r3f32 = ShapeUtil::MakeShape(F32, {2, 2, 2});
@@ -2546,7 +2682,7 @@ TEST_P(PadReduceWindowEffectiveBroadcastTest, DoIt) {
   HloInstruction* pad = builder.AddInstruction(HloInstruction::CreatePad(
       pad_shape, input,
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0(0.0f))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0(0.0f))),
       padding));
 
   HloComputation* add_computation = nullptr;
@@ -2565,7 +2701,7 @@ TEST_P(PadReduceWindowEffectiveBroadcastTest, DoIt) {
   Window window = window_util::MakeWindow(
       decorate_spatials(param.reduce_window_spatials, 1, 1));
   auto zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   TF_ASSERT_OK_AND_ASSIGN(const Shape output_shape,
                           ShapeInference::InferReduceWindowShape(
                               pad->shape(), zero->shape(), window,
@@ -2704,7 +2840,7 @@ TEST_P(DotOfConcatSimplificationTest, ConstantLHS) {
 
   Shape lhs_shape = ShapeUtil::MakeShape(F32, {spec.m, spec.k});
   auto* lhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
           /*from=*/10.0, /*to=*/10000.0, /*rows=*/spec.m, /*cols=*/spec.k)));
 
   Shape rhs0_shape = ShapeUtil::MakeShape(F32, {k0, spec.n});
@@ -2783,7 +2919,7 @@ TEST_P(DotOfConcatSimplificationTest, ConstantRHS) {
 
   Shape rhs_shape = ShapeUtil::MakeShape(F32, {spec.k, spec.n});
   auto* rhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
           /*from=*/10.0, /*to=*/10000.0, /*rows=*/spec.k, /*cols=*/spec.n)));
 
   DotDimensionNumbers dot_dnums;
@@ -2830,7 +2966,7 @@ TEST_F(AlgebraicSimplifierTest, DynamicUpdateSliceZeroUpdate) {
   HloInstruction* const update = builder.AddInstruction(
       HloInstruction::CreateParameter(1, update_shape, "update"));
   HloInstruction* const start_indices = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int>({0})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int>({0})));
   builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
       dslice_shape, operand, update, start_indices));
   const HloComputation* const computation =
@@ -2879,7 +3015,7 @@ TEST_P(DotOfGatherSimplificationTest, ConstantRHS) {
   int64 lhs_cols = (spec.lcd == 0) ? spec.m : (spec.k + k_increase);
   Shape lhs_shape = ShapeUtil::MakeShape(F32, {lhs_rows, lhs_cols});
   auto* lhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
           /*from=*/10.0, /*to=*/10000.0, /*rows=*/lhs_rows,
           /*cols=*/lhs_cols)));
 
@@ -2887,7 +3023,7 @@ TEST_P(DotOfGatherSimplificationTest, ConstantRHS) {
   int32 start_col = (spec.lcd == 0) ? spec.s : 0;
   const auto start_indices =
       builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR1<int32>({start_row, start_col})));
+          LiteralUtil::CreateR1<int32>({start_row, start_col})));
   int64 slice_row_size = (spec.lcd == 0) ? spec.k : 1;
   int64 slice_col_size = (spec.lcd == 0) ? 1 : spec.k;
   Shape ds_shape = ShapeUtil::MakeShape(F32, {slice_row_size, slice_col_size});
@@ -2898,7 +3034,7 @@ TEST_P(DotOfGatherSimplificationTest, ConstantRHS) {
   int64 rhs_cols = (spec.rcd == 0) ? spec.n : spec.k;
   Shape rhs_shape = ShapeUtil::MakeShape(F32, {rhs_rows, rhs_cols});
   auto* rhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
           /*from=*/10.0, /*to=*/10000.0, /*rows=*/rhs_rows,
           /*cols=*/rhs_cols)));
 
@@ -2946,7 +3082,7 @@ TEST_P(DotOfGatherSimplificationTest, ConstantLHS) {
   int64 lhs_cols = (spec.lcd == 0) ? spec.m : spec.k;
   Shape lhs_shape = ShapeUtil::MakeShape(F32, {lhs_rows, lhs_cols});
   auto* lhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
           /*from=*/10.0, /*to=*/10000.0, /*rows=*/lhs_rows,
           /*cols=*/lhs_cols)));
 
@@ -2957,7 +3093,7 @@ TEST_P(DotOfGatherSimplificationTest, ConstantLHS) {
   int64 rhs_cols = (spec.rcd == 0) ? spec.n : (spec.k + k_increase);
   Shape rhs_shape = ShapeUtil::MakeShape(F32, {rhs_rows, rhs_cols});
   auto* rhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
           /*from=*/10.0, /*to=*/10000.0, /*rows=*/rhs_rows,
           /*cols=*/rhs_cols)));
 
@@ -2965,7 +3101,7 @@ TEST_P(DotOfGatherSimplificationTest, ConstantLHS) {
   int32 start_col = (spec.rcd == 0) ? spec.s : 0;
   const auto start_indices =
       builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR1<int32>({start_row, start_col})));
+          LiteralUtil::CreateR1<int32>({start_row, start_col})));
   int64 slice_row_size = (spec.rcd == 0) ? spec.k : 1;
   int64 slice_col_size = (spec.rcd == 0) ? 1 : spec.k;
   Shape ds_shape = ShapeUtil::MakeShape(F32, {slice_row_size, slice_col_size});
diff --git a/tensorflow/compiler/xla/service/backend.cc b/tensorflow/compiler/xla/service/backend.cc
index 349b32451a..d12be3e007 100644
--- a/tensorflow/compiler/xla/service/backend.cc
+++ b/tensorflow/compiler/xla/service/backend.cc
@@ -96,24 +96,19 @@ Backend::CreateDefaultBackend() {
   return CreateBackend(backend_options);
 }
 
-StatusOr<Backend::StreamPtr> Backend::BorrowStream(int device_ordinal) {
-  TF_ASSIGN_OR_RETURN(auto exec, stream_executor(device_ordinal));
-  return BorrowStream(exec);
+StatusOr<StreamPool::Ptr> Backend::BorrowStream(int device_ordinal) {
+  TF_ASSIGN_OR_RETURN(auto executor, stream_executor(device_ordinal));
+  return BorrowStream(executor);
 }
 
-StatusOr<Backend::StreamPtr> Backend::BorrowStream(
-    se::StreamExecutor* executor) {
+StatusOr<StreamPool::Ptr> Backend::BorrowStream(se::StreamExecutor* executor) {
   tensorflow::mutex_lock l(mu_);
   if (0 == stream_pools_.count(executor)) {
     stream_pools_.emplace(std::piecewise_construct,
                           std::forward_as_tuple(executor),
-                          std::forward_as_tuple([executor]() {
-                            auto stream = MakeUnique<se::Stream>(executor);
-                            stream->Init();
-                            return stream;
-                          }));
+                          std::forward_as_tuple());
   }
-  return stream_pools_.at(executor).Allocate();
+  return stream_pools_.at(executor).BorrowStream(executor);
 }
 
 Backend::Backend(
diff --git a/tensorflow/compiler/xla/service/backend.h b/tensorflow/compiler/xla/service/backend.h
index 6546602473..1bc3796fa4 100644
--- a/tensorflow/compiler/xla/service/backend.h
+++ b/tensorflow/compiler/xla/service/backend.h
@@ -24,7 +24,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/compiler.h"
 #include "tensorflow/compiler/xla/service/computation_placer.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
-#include "tensorflow/compiler/xla/service/pool.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/compiler/xla/service/transfer_manager.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -63,11 +63,9 @@ class BackendOptions {
 //
 // It also offers a pooling API for creation/use of initialized streams:
 //
-//    StreamPtr stream = backend->BorrowStream().ConsumeValueOrDie();
+//    StreamPool::Ptr stream = backend->BorrowStream().ConsumeValueOrDie();
 class Backend {
  public:
-  using StreamPtr = Pool<se::Stream>::SmartPtr;
-
   // Creates a new backend.
   static StatusOr<std::unique_ptr<Backend>> CreateBackend(
       const BackendOptions& options);
@@ -114,13 +112,13 @@ class Backend {
   // Borrows a stream for use by the caller, either by grabbing it from an
   // internal pool, or by constructing/initializating it, and returns the result
   // to the caller.
-  StatusOr<StreamPtr> BorrowStream(int device_ordinal);
-  StatusOr<StreamPtr> BorrowStream(se::StreamExecutor* executor);
+  StatusOr<StreamPool::Ptr> BorrowStream(int device_ordinal);
+  StatusOr<StreamPool::Ptr> BorrowStream(se::StreamExecutor* executor);
 
   // Returns a function to borrow a stream, as `BorrowStream` above does.
   // Purely for convenience, the caller could rather make this anonymous
   // function itself.
-  std::function<StatusOr<StreamPtr>(int)> StreamBorrower() {
+  std::function<StatusOr<StreamPool::Ptr>(int)> StreamBorrower() {
     return [this](int device_ordinal) { return BorrowStream(device_ordinal); };
   }
 
@@ -169,7 +167,7 @@ class Backend {
   tensorflow::mutex mu_;
 
   // Mapping from stream executor to stream pools, used by `BorrowStream` above.
-  std::map<se::StreamExecutor*, Pool<se::Stream>> stream_pools_ GUARDED_BY(mu_);
+  std::map<se::StreamExecutor*, StreamPool> stream_pools_ GUARDED_BY(mu_);
 
   // The default memory allocator to use.
   std::unique_ptr<StreamExecutorMemoryAllocator> memory_allocator_;
diff --git a/tensorflow/compiler/xla/service/batchnorm_expander.cc b/tensorflow/compiler/xla/service/batchnorm_expander.cc
index ec13fadbc7..c4cd60c120 100644
--- a/tensorflow/compiler/xla/service/batchnorm_expander.cc
+++ b/tensorflow/compiler/xla/service/batchnorm_expander.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
@@ -34,6 +35,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -41,6 +43,8 @@ namespace xla {
 
 namespace {
 
+using tensorflow::gtl::optional;
+
 // BatchNormExpanderVisitor traverses the HLO computation and rewrites BatchNorm
 // operations into smaller operations.
 class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
@@ -97,7 +101,7 @@ class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
         add_instruction(HloInstruction::CreateConvert(
             ShapeUtil::MakeShape(operand->shape().element_type(), {}),
             add_instruction(HloInstruction::CreateConstant(
-                Literal::CreateR0<float>(-0.5f))))),
+                LiteralUtil::CreateR0<float>(-0.5f))))),
         {}));
     return HloInstruction::CreateBinary(operand->shape(), HloOpcode::kPower,
                                         operand, exponent);
@@ -113,7 +117,7 @@ class BatchNormExpanderVisitor : public DfsHloVisitorWithDefault {
             add_instruction(HloInstruction::CreateConvert(
                 ShapeUtil::MakeShape(operand->shape().element_type(), {}),
                 add_instruction(HloInstruction::CreateConstant(
-                    Literal::CreateR0<float>(1.0 / element_count))))),
+                    LiteralUtil::CreateR0<float>(1.0 / element_count))))),
             {}));
     return HloInstruction::CreateBinary(operand->shape(), HloOpcode::kMultiply,
                                         operand, elem_count_recip);
@@ -200,11 +204,11 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
   HloInstruction* offset = batch_norm->mutable_operand(2);
   const Shape feature_shape = scale->shape();
 
-  auto zero_literal = Literal::CreateR0(0.0f);
+  auto zero_literal = LiteralUtil::CreateR0(0.0f);
   TF_ASSIGN_OR_RETURN(zero_literal, zero_literal->Convert(ptype));
   auto zero = add(HloInstruction::CreateConstant(std::move(zero_literal)));
 
-  auto epsilon_literal = Literal::CreateR0(batch_norm->epsilon());
+  auto epsilon_literal = LiteralUtil::CreateR0(batch_norm->epsilon());
   TF_ASSIGN_OR_RETURN(epsilon_literal, epsilon_literal->Convert(ptype));
   auto epsilon = add(HloInstruction::CreateBroadcast(
       operand_shape,
@@ -288,16 +292,22 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
     int64 instruction_count_after = computation_->instruction_count();
     CHECK_EQ(instruction_count_after,
              instruction_count_before + added_instructions.size());
+    const HloSharding& sharding = batch_norm->sharding();
     HloSharding operand_sharding =
-        batch_norm->sharding().GetAsShapeTree(batch_norm->shape()).element({0});
+        sharding.GetAsShapeTree(batch_norm->shape()).element({0});
+    optional<int64> unique_device = batch_norm->sharding_unique_device();
+    HloSharding default_sharding =
+        unique_device.has_value()
+            ? HloSharding::AssignDevice(unique_device.value())
+            : HloSharding::Replicate();
     for (HloInstruction* inst : added_instructions) {
       if (ShapeUtil::Equal(inst->shape(), operand_shape)) {
         inst->set_sharding(operand_sharding);
       } else {
-        inst->set_sharding(HloSharding::Replicate());
+        inst->set_sharding(default_sharding);
       }
     }
-    tuple->set_sharding(batch_norm->sharding());
+    tuple->set_sharding(sharding);
   }
   TF_CHECK_OK(ReplaceWithNewInstruction(batch_norm, std::move(tuple)));
   return Status::OK();
@@ -320,7 +330,7 @@ Status BatchNormExpanderVisitor::HandleBatchNormInference(
   HloInstruction* var = batch_norm->mutable_operand(4);
   const Shape feature_shape = scale->shape();
 
-  auto epsilon_literal = Literal::CreateR0(batch_norm->epsilon());
+  auto epsilon_literal = LiteralUtil::CreateR0(batch_norm->epsilon());
   TF_ASSIGN_OR_RETURN(epsilon_literal, epsilon_literal->Convert(ptype));
   auto epsilon = computation_->AddInstruction(HloInstruction::CreateBroadcast(
       operand_shape,
@@ -388,14 +398,20 @@ Status BatchNormExpanderVisitor::HandleBatchNormInference(
   CHECK_EQ(instruction_count_after,
            instruction_count_before + added_instructions.size());
   if (batch_norm->has_sharding()) {
+    const HloSharding& sharding = batch_norm->sharding();
+    optional<int64> unique_device = batch_norm->sharding_unique_device();
+    HloSharding default_sharding =
+        unique_device.has_value()
+            ? HloSharding::AssignDevice(unique_device.value())
+            : HloSharding::Replicate();
     for (HloInstruction* inst : added_instructions) {
       if (ShapeUtil::Equal(inst->shape(), operand_shape)) {
-        inst->set_sharding(batch_norm->sharding());
+        inst->set_sharding(sharding);
       } else {
-        inst->set_sharding(HloSharding::Replicate());
+        inst->set_sharding(default_sharding);
       }
     }
-    shifted_normalized->set_sharding(batch_norm->sharding());
+    shifted_normalized->set_sharding(sharding);
   }
   TF_CHECK_OK(
       ReplaceWithNewInstruction(batch_norm, std::move(shifted_normalized)));
@@ -447,11 +463,11 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
   const int64 feature_count = activation_shape.dimensions(feature_index);
   const int64 elements_per_feature_int64 = size_in_elements / feature_count;
 
-  auto zero_literal = Literal::CreateR0(0.0f);
+  auto zero_literal = LiteralUtil::CreateR0(0.0f);
   TF_ASSIGN_OR_RETURN(zero_literal, zero_literal->Convert(ptype));
   auto zero = add(HloInstruction::CreateConstant(std::move(zero_literal)));
 
-  auto epsilon_literal = Literal::CreateR0(batch_norm->epsilon());
+  auto epsilon_literal = LiteralUtil::CreateR0(batch_norm->epsilon());
   TF_ASSIGN_OR_RETURN(epsilon_literal, epsilon_literal->Convert(ptype));
   auto epsilon_scalar =
       add(HloInstruction::CreateConstant(std::move(epsilon_literal)));
@@ -542,7 +558,7 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
       Mean(elements_per_feature_int64, scale_times_rsqrt_var_add_epsilon, add));
 
   auto elements_per_feature_literal =
-      Literal::CreateR0<float>(elements_per_feature_int64);
+      LiteralUtil::CreateR0<float>(elements_per_feature_int64);
   TF_ASSIGN_OR_RETURN(elements_per_feature_literal,
                       elements_per_feature_literal->Convert(ptype));
   auto elements_per_feature = add(
@@ -562,19 +578,25 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
   auto tuple =
       HloInstruction::CreateTuple({grad_activation, grad_scale, grad_beta});
   if (batch_norm->has_sharding()) {
+    const HloSharding& sharding = batch_norm->sharding();
     int64 instruction_count_after = computation_->instruction_count();
     CHECK_EQ(instruction_count_after,
              instruction_count_before + added_instructions.size());
     HloSharding activation_sharding =
-        batch_norm->sharding().GetAsShapeTree(batch_norm->shape()).element({0});
+        sharding.GetAsShapeTree(batch_norm->shape()).element({0});
+    auto unique_device = batch_norm->sharding_unique_device();
+    HloSharding default_sharding =
+        unique_device.has_value()
+            ? HloSharding::AssignDevice(unique_device.value())
+            : HloSharding::Replicate();
     for (HloInstruction* inst : added_instructions) {
       if (ShapeUtil::Equal(inst->shape(), activation_shape)) {
         inst->set_sharding(activation_sharding);
       } else {
-        inst->set_sharding(HloSharding::Replicate());
+        inst->set_sharding(default_sharding);
       }
     }
-    tuple->set_sharding(batch_norm->sharding());
+    tuple->set_sharding(sharding);
   }
 
   TF_CHECK_OK(ReplaceWithNewInstruction(batch_norm, std::move(tuple)));
diff --git a/tensorflow/compiler/xla/service/batchnorm_expander_test.cc b/tensorflow/compiler/xla/service/batchnorm_expander_test.cc
index aa36e64b07..a725351462 100644
--- a/tensorflow/compiler/xla/service/batchnorm_expander_test.cc
+++ b/tensorflow/compiler/xla/service/batchnorm_expander_test.cc
@@ -19,12 +19,13 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/hlo_pass_fix.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -114,5 +115,33 @@ TEST_F(BatchNormExpanderTest, BatchNormGrad) {
   EXPECT_EQ(root->opcode(), HloOpcode::kTuple);
 }
 
+TEST_F(BatchNormExpanderTest, BatchNormTrainingSharding) {
+  const char* module_str = R"(
+HloModule module
+ENTRY entry {
+  %param.0 = f32[8,4] parameter(0)
+  %param.1 = f32[4] parameter(1)
+  %param.2 = f32[4] parameter(2)
+  ROOT %batch-norm-training = (f32[8,4], f32[4], f32[4])
+    batch-norm-training(f32[8,4] %param.0, f32[4] %param.1, f32[4] %param.2),
+    epsilon=0.001, feature_index=1, sharding={maximal device=1}
+})";
+
+  TF_ASSERT_OK_AND_ASSIGN(auto module, ParseHloString(module_str));
+  BatchNormExpander rewriter(/*rewrite_training_op=*/true,
+                             /*rewrite_inference_op=*/true,
+                             /*rewrite_grad_op=*/true);
+  ASSERT_TRUE(rewriter.Run(module.get()).ValueOrDie());
+
+  for (auto* instruction : module->entry_computation()->instructions()) {
+    if (instruction->opcode() == HloOpcode::kParameter) {
+      continue;
+    }
+    auto device = instruction->sharding_unique_device();
+    ASSERT_TRUE(device);
+    EXPECT_EQ(*device, 1);
+  }
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc b/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
index f7b4c1405d..7cf05ca443 100644
--- a/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_conversion_folding_test.cc
@@ -235,7 +235,8 @@ TEST_F(BFloat16ConversionFoldingTest, FoldCrossReplicaSumTupleOutput) {
   HloInstruction* crs =
       builder.AddInstruction(HloInstruction::CreateCrossReplicaSum(
           ShapeUtil::MakeTupleShape({f32_shape, f32_shape}), {convert_a, b},
-          sum, /*replica_group_ids=*/{}, /*barrier=*/""));
+          sum, /*replica_group_ids=*/{}, /*barrier=*/"",
+          /*all_reduce_id=*/tensorflow::gtl::nullopt));
   HloInstruction* gte_a = builder.AddInstruction(
       HloInstruction::CreateGetTupleElement(f32_shape, crs, 0));
   HloInstruction* gte_b = builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/bfloat16_normalization.cc b/tensorflow/compiler/xla/service/bfloat16_normalization.cc
index 14c54ddd13..16e99b5722 100644
--- a/tensorflow/compiler/xla/service/bfloat16_normalization.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_normalization.cc
@@ -34,8 +34,10 @@ class BFloat16NormalizationVisitor : public DfsHloVisitorWithDefault {
 
   Status DefaultAction(HloInstruction* hlo) override;
 
-  // Special handling for cross-replica-sum which can have a tuple output.
+  // Special handling for cross-replica-sum and sort which can have a tuple
+  // output.
   Status HandleCrossReplicaSum(HloInstruction* crs) override;
+  Status HandleSort(HloInstruction* sort) override;
 
   static bool Run(HloComputation* computation,
                   const BFloat16Support* bfloat16_support) {
@@ -49,6 +51,10 @@ class BFloat16NormalizationVisitor : public DfsHloVisitorWithDefault {
   // conversions between F32 and BF16 to make it supported.
   Status HandleInstruction(HloInstruction* hlo);
 
+  // Handle instructions with tuple outputs by examining each output
+  // independently.
+  Status HandleMultipleOutputs(HloInstruction* hlo);
+
   // Inserts a conversion HLO that changes the given HLO's output type.
   Status InsertConvertAfterOutput(HloInstruction* hlo, PrimitiveType to,
                                   HloComputation* computation);
@@ -148,22 +154,35 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
     HloInstruction* crs) {
   if (!ShapeUtil::IsTuple(crs->shape())) {
     return HandleInstruction(crs);
+  } else {
+    return HandleMultipleOutputs(crs);
   }
+}
+
+Status BFloat16NormalizationVisitor::HandleSort(HloInstruction* sort) {
+  if (!ShapeUtil::IsTuple(sort->shape())) {
+    return HandleInstruction(sort);
+  } else {
+    return HandleMultipleOutputs(sort);
+  }
+}
 
-  std::vector<PrimitiveType> operand_types(crs->operand_count());
-  std::vector<PrimitiveType> output_types(crs->operand_count());
+Status BFloat16NormalizationVisitor::HandleMultipleOutputs(
+    HloInstruction* hlo) {
+  std::vector<PrimitiveType> operand_types(hlo->operand_count());
+  std::vector<PrimitiveType> output_types(hlo->operand_count());
   int64 f32_count = 0;
   int64 bf16_count = 0;
   bool has_unsupported_bf16_operand = false;
   bool has_unsupported_bf16_output = false;
-  for (int64 i = 0; i < crs->operand_count(); ++i) {
-    operand_types[i] = crs->operand(i)->shape().element_type();
-    output_types[i] = ShapeUtil::GetSubshape(crs->shape(), {i}).element_type();
+  for (int64 i = 0; i < hlo->operand_count(); ++i) {
+    operand_types[i] = hlo->operand(i)->shape().element_type();
+    output_types[i] = ShapeUtil::GetSubshape(hlo->shape(), {i}).element_type();
     if (operand_types[i] == F32) {
       f32_count += 1;
     } else if (operand_types[i] == BF16) {
       bf16_count += 1;
-      if (!bfloat16_support_->SupportsBF16Operand(*crs, i)) {
+      if (!bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
         has_unsupported_bf16_operand = true;
       }
     }
@@ -171,7 +190,7 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
       f32_count += 1;
     } else if (output_types[i] == BF16) {
       bf16_count += 1;
-      if (!bfloat16_support_->SupportsBF16Output(*crs)) {
+      if (!bfloat16_support_->SupportsBF16Output(*hlo)) {
         has_unsupported_bf16_output = true;
       }
     }
@@ -185,43 +204,43 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
     if (operand_types[i] != BF16) {
       return false;
     }
-    if (!bfloat16_support_->SupportsBF16Operand(*crs, i)) {
+    if (!bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
       return true;
     }
-    if (bfloat16_support_->SupportsMixedPrecisions(*crs)) {
+    if (bfloat16_support_->SupportsMixedPrecisions(*hlo)) {
       return false;
     }
     return has_unsupported_bf16_operand || has_unsupported_bf16_output ||
            f32_count > 0;
   };
 
-  for (int64 i = 0; i < crs->operand_count(); ++i) {
+  for (int64 i = 0; i < hlo->operand_count(); ++i) {
     if (should_convert_operand(i)) {
-      TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(crs, i, F32, computation_));
+      TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(hlo, i, F32, computation_));
       f32_count += 1;
       bf16_count -= 1;
     }
   }
 
   if (!has_unsupported_bf16_output &&
-      (bfloat16_support_->SupportsMixedPrecisions(*crs) || f32_count == 0 ||
+      (bfloat16_support_->SupportsMixedPrecisions(*hlo) || f32_count == 0 ||
        bf16_count == 0)) {
     return Status::OK();
   }
 
-  std::vector<HloInstruction*> materialized_users = crs->users();
-  std::vector<HloInstruction*> output_elements(crs->operand_count());
-  auto original_shape = crs->shape();
-  for (int64 i = 0; i < crs->operand_count(); ++i) {
-    auto subshape = ShapeUtil::GetMutableSubshape(crs->mutable_shape(), {i});
+  std::vector<HloInstruction*> materialized_users = hlo->users();
+  std::vector<HloInstruction*> output_elements(hlo->operand_count());
+  auto original_shape = hlo->shape();
+  for (int64 i = 0; i < hlo->operand_count(); ++i) {
+    auto subshape = ShapeUtil::GetMutableSubshape(hlo->mutable_shape(), {i});
     if (output_types[i] != BF16) {
       output_elements[i] = computation_->AddInstruction(
-          HloInstruction::CreateGetTupleElement(*subshape, crs, i));
+          HloInstruction::CreateGetTupleElement(*subshape, hlo, i));
       continue;
     }
     subshape->set_element_type(F32);
     auto gte = computation_->AddInstruction(
-        HloInstruction::CreateGetTupleElement(*subshape, crs, i));
+        HloInstruction::CreateGetTupleElement(*subshape, hlo, i));
     output_elements[i] =
         computation_->AddInstruction(HloInstruction::CreateConvert(
             ShapeUtil::ChangeElementType(*subshape, BF16), gte));
@@ -229,11 +248,11 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
   auto tuple = computation_->AddInstruction(
       HloInstruction::CreateTuple(output_elements));
 
-  // Use the crs' shape temporarily, in order to pass checks in
+  // Use the hlo' shape temporarily, in order to pass checks in
   // ReplaceUseWith.
-  *tuple->mutable_shape() = crs->shape();
+  *tuple->mutable_shape() = hlo->shape();
   for (auto* user : materialized_users) {
-    TF_RETURN_IF_ERROR(crs->ReplaceUseWith(user, tuple));
+    TF_RETURN_IF_ERROR(hlo->ReplaceUseWith(user, tuple));
   }
   *tuple->mutable_shape() = original_shape;
   return Status::OK();
diff --git a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
index 830f26422b..f9f1f64998 100644
--- a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
@@ -251,7 +251,8 @@ TEST_F(BFloat16NormalizationTest, ResolveMixedPrecisionTupleCrossReplicaSum) {
   HloInstruction* crs =
       builder.AddInstruction(HloInstruction::CreateCrossReplicaSum(
           ShapeUtil::MakeTupleShape({f32_shape, bf16_shape}), {a, b}, reduction,
-          /*replica_group_ids=*/{}, /*barrier=*/""));
+          /*replica_group_ids=*/{}, /*barrier=*/"",
+          /*all_reduce_id=*/tensorflow::gtl::nullopt));
   HloInstruction* gte = builder.AddInstruction(
       HloInstruction::CreateGetTupleElement(bf16_shape, crs, 1));
 
@@ -265,6 +266,33 @@ TEST_F(BFloat16NormalizationTest, ResolveMixedPrecisionTupleCrossReplicaSum) {
   EXPECT_EQ(ShapeUtil::GetSubshape(crs->shape(), {1}).element_type(), F32);
 }
 
+TEST_F(BFloat16NormalizationTest, ResolveMixedPrecisionTupleSort) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape f32_shape = ShapeUtil::MakeShape(F32, {1024});
+  Shape bf16_shape = ShapeUtil::MakeShape(BF16, {1024});
+  Shape s32_shape = ShapeUtil::MakeShape(BF16, {1024});
+
+  HloInstruction* key = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, f32_shape, "key"));
+  HloInstruction* value = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, s32_shape, "value"));
+
+  HloInstruction* sort = builder.AddInstruction(HloInstruction::CreateSort(
+      ShapeUtil::MakeTupleShape({bf16_shape, s32_shape}), 0, key, value));
+  HloInstruction* gte = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(bf16_shape, sort, 0));
+
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(Normalize(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), gte);
+  EXPECT_EQ(gte->shape().element_type(), BF16);
+  EXPECT_EQ(sort->operand(0)->shape().element_type(), F32);
+  EXPECT_EQ(ShapeUtil::GetSubshape(sort->shape(), {0}).element_type(), F32);
+}
+
 // Tests that the normalization should not cause unsupported mixed precision due
 // to resolving unsupported BF16 operand.
 TEST_F(BFloat16NormalizationTest, DoNotAddUnsupportedMixedPrecision) {
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation.cc b/tensorflow/compiler/xla/service/bfloat16_propagation.cc
index ff6d5027ef..2fb401c428 100644
--- a/tensorflow/compiler/xla/service/bfloat16_propagation.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/bfloat16_propagation.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_dce.h"
@@ -215,7 +215,12 @@ bool BFloat16Propagation::AllUsersConsumeBF16(const HloInstruction& hlo,
     if (ContainsKey(values_that_must_be_kept_as_f32_, value)) {
       return false;
     }
-    if (ValueTypeAfterChange(value) == BF16) {
+    // We use the original type for the value because we are going to examine
+    // the uses of it, instead of the value itself. If ValueTypeAfterChange()
+    // were used, it would cause problems when there are aliasing buffers, i.e.,
+    // ResolveInconsistencyOfAliasingBuffers() would fail to revert the
+    // tentative change to BF16 even if the uses require F32.
+    if (value->shape().element_type() == BF16) {
       continue;
     }
     for (const HloUse& use : value->uses()) {
@@ -566,6 +571,9 @@ bool BFloat16Propagation::ResolveInconsistencyOfAliasingBuffersHelper(
       }
       visited_computations->insert(visited_in_while.begin(),
                                    visited_in_while.end());
+    } else if (hlo->opcode() == HloOpcode::kFusion) {
+      ResolveInconsistencyOfAliasingBuffersHelper(
+          hlo->fused_instructions_computation(), visited_computations);
     }
   }
   // Now adjust parameters of called computations.
@@ -615,7 +623,6 @@ Status BFloat16Propagation::ResolveInconsistentFusions(HloModule* module) {
   // (1) a is F32 but tuple is BF16
   // (2) after adding conversion
   // (3) after tuple simplifier and DCE.
-  bool needs_tuple_simplifier = false;
   for (auto computation : module->MakeComputationPostOrder()) {
     auto insts = computation->MakeInstructionPostOrder();
     for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
@@ -629,67 +636,25 @@ Status BFloat16Propagation::ResolveInconsistentFusions(HloModule* module) {
         continue;
       }
       ShapeTree<HloInstruction*> converted_outputs(hlo->shape());
-      // Iterate through nodes in the shape tree in pre-order and initialize
-      // each non-root node with a corresponding get-tuple-element. For a leaf
-      // node, if its shape does not match the fusion output, create a
-      // conversion node to overwrite the node value.
-      for (auto it = converted_outputs.begin(); it != converted_outputs.end();
-           ++it) {
-        ShapeIndex output_index = it->first;
-        HloInstruction*& output = it->second;
-        const Shape subshape =
-            ShapeUtil::GetSubshape(hlo->shape(), output_index);
-        if (output_index.empty()) {
-          output = fusion_root;
-        } else {
-          ShapeIndex parent_index = output_index;
-          parent_index.pop_back();
-          output = fusion_computation->AddInstruction(
-              HloInstruction::CreateGetTupleElement(
-                  subshape, converted_outputs.element(parent_index),
-                  output_index.back()));
-        }
-        if (!ShapeUtil::IsArray(subshape)) {
-          continue;
-        }
-        if (!ShapeUtil::Compatible(
-                subshape,
-                ShapeUtil::GetSubshape(fusion_root->shape(), output_index))) {
-          output = fusion_computation->AddInstruction(
-              HloInstruction::CreateConvert(subshape, output));
-        }
-      }
-      // Iterate through nodes in the shape tree in reverse pre-order and create
-      // a tuple instruction for each non-leaf node where the elements are the
-      // values of its child nodes.
-      for (auto it = converted_outputs.rbegin(); it != converted_outputs.rend();
-           ++it) {
-        ShapeIndex output_index = it->first;
-        HloInstruction*& output = it->second;
-        const Shape& subshape =
-            ShapeUtil::GetSubshape(hlo->shape(), output_index);
-        if (!ShapeUtil::IsTuple(subshape)) {
-          continue;
-        }
-        std::vector<HloInstruction*> elements(
-            ShapeUtil::TupleElementCount(subshape));
-        ShapeIndex child_index = output_index;
-        for (int64 i = 0; i < elements.size(); ++i) {
-          child_index.push_back(i);
-          elements[i] = converted_outputs.element(child_index);
-          child_index.pop_back();
-        }
-        output = fusion_computation->AddInstruction(
-            HloInstruction::CreateTuple(elements));
-      }
-      fusion_computation->set_root_instruction(converted_outputs.element({}));
-      needs_tuple_simplifier |= ShapeUtil::IsTuple(hlo->shape());
+      // Deep copy the fusion root, and convert a leaf node only if its shape
+      // does not match the fusion output.
+      TF_ASSIGN_OR_RETURN(
+          HloInstruction * copy,
+          fusion_computation->DeepCopyInstructionWithCustomCopier(
+              fusion_root,
+              [hlo](HloInstruction* leaf, const ShapeIndex& leaf_index,
+                    HloComputation* comp) {
+                const Shape& hlo_subshape =
+                    ShapeUtil::GetSubshape(hlo->shape(), leaf_index);
+                if (ShapeUtil::Compatible(leaf->shape(), hlo_subshape)) {
+                  return leaf;
+                }
+                return comp->AddInstruction(
+                    HloInstruction::CreateConvert(hlo_subshape, leaf));
+              }));
+      fusion_computation->set_root_instruction(copy);
     }
   }
-  if (needs_tuple_simplifier) {
-    TupleSimplifier tuple_simplifier;
-    TF_RETURN_IF_ERROR(tuple_simplifier.Run(module).status());
-  }
   return Status::OK();
 }
 
@@ -758,10 +723,38 @@ StatusOr<bool> BFloat16Propagation::Run(HloModule* module) {
   changes_to_bf16_.clear();
   changed_ = false;
 
+  auto computations_topological_order = module->MakeComputationPostOrder();
+
+  // Before running the propagation pass, we insert copies (kConvert to the same
+  // type) of F32 inputs to while loops. This prevents other uses of the same
+  // input from aliasing the while loop input/output, so that there's greater
+  // chance to use BF16 inside the loop. If some of these added copies do not
+  // help, they will remain F32 after BF16 propagation and will be removed since
+  // they are no-ops.
+  for (auto computation : computations_topological_order) {
+    for (auto inst : computation->MakeInstructionPostOrder()) {
+      if (inst->opcode() != HloOpcode::kWhile) {
+        continue;
+      }
+
+      auto operand = inst->mutable_operand(0);
+      TF_ASSIGN_OR_RETURN(
+          HloInstruction * copy,
+          computation->DeepCopyInstructionWithCustomCopier(
+              operand, [](HloInstruction* leaf, const ShapeIndex& leaf_index,
+                          HloComputation* comp) {
+                if (leaf->shape().element_type() != F32) {
+                  return leaf;
+                }
+                return comp->AddInstruction(
+                    HloInstruction::CreateConvert(leaf->shape(), leaf));
+              }));
+      TF_RETURN_IF_ERROR(operand->ReplaceUseWith(inst, copy));
+    }
+  }
+
   TF_ASSIGN_OR_RETURN(dataflow_, HloDataflowAnalysis::Run(*module));
 
-  const auto& computations_topological_order =
-      module->MakeComputationPostOrder();
   // The first step is a forward pass (parameters to root), where we determine
   // the potential candidate instructions to use bfloat16 in the outputs that
   // are not likely to cause overhead from extra explicit conversions. This is
@@ -784,8 +777,7 @@ StatusOr<bool> BFloat16Propagation::Run(HloModule* module) {
   // propagation in reverse topological order.
   for (auto comp_it = computations_topological_order.rbegin();
        comp_it != computations_topological_order.rend(); ++comp_it) {
-    if ((*comp_it)->IsFusionComputation()) {
-      // Fusion computations are handled when visiting the fusion instruction.
+    if (ContainsKey(computations_visited_in_backward_pass_, *comp_it)) {
       continue;
     }
     auto insts = (*comp_it)->MakeInstructionPostOrder();
@@ -793,6 +785,7 @@ StatusOr<bool> BFloat16Propagation::Run(HloModule* module) {
       DetermineInstructionPrecision(*inst_it,
                                     /*skip_parameters=*/true);
     }
+    computations_visited_in_backward_pass_.insert(*comp_it);
   }
 
   // It's possible that an instruction does not define a buffer, but the
@@ -810,23 +803,27 @@ StatusOr<bool> BFloat16Propagation::Run(HloModule* module) {
     }
   }
 
+  // Removes redundant HLOs added by this pass, either when inserting
+  // de-aliasing copies to while loop inputs, or later when converting output
+  // types.
+  auto clean_up = [this, module]() {
+    TF_RETURN_IF_ERROR(SkipNoopConversions(module));
+    TupleSimplifier tuple_simplifier;
+    TF_RETURN_IF_ERROR(tuple_simplifier.Run(module).status());
+    HloDCE dce;
+    TF_RETURN_IF_ERROR(dce.Run(module).status());
+    return Status::OK();
+  };
+
   if (!changed_) {
+    TF_RETURN_IF_ERROR(clean_up());
     return false;
   }
 
   TF_RETURN_IF_ERROR(ResolveInconsistentFusions(module));
   TF_RETURN_IF_ERROR(ResolveConvertedConstants(module));
 
-  // This pass could have turned an F32 -> BF16 conversion to a no-op (BF16 ->
-  // BF16), so we skip them now.
-  TF_RETURN_IF_ERROR(SkipNoopConversions(module));
-
-  {
-    // We may have dead HLOs after ResolveInconsistentFusions,
-    // ResolveConvertedConstants and SkipNoopConversions.
-    HloDCE dce;
-    TF_RETURN_IF_ERROR(dce.Run(module).status());
-  }
+  TF_RETURN_IF_ERROR(clean_up());
   return true;
 }
 
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc b/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
index 560910cc5f..69b654d30e 100644
--- a/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
@@ -133,9 +133,9 @@ TEST_F(BFloat16PropagationTest, ConvertConstantLiteral) {
   array_b.FillUnique(10.0f);
 
   HloInstruction* a = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateFromArray(array_a)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateFromArray(array_a)));
   HloInstruction* b = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateFromArray(array_b)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateFromArray(array_b)));
   HloInstruction* dot = builder.AddInstruction(
       HloInstruction::CreateBinary(shape, HloOpcode::kDot, a, b));
 
@@ -150,10 +150,10 @@ TEST_F(BFloat16PropagationTest, ConvertConstantLiteral) {
   EXPECT_EQ(dot->operand(0)->opcode(), HloOpcode::kConstant);
   EXPECT_EQ(dot->operand(1)->opcode(), HloOpcode::kConstant);
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::ConvertF32ToBF16(*Literal::CreateFromArray(array_a)),
+      *LiteralUtil::ConvertF32ToBF16(*LiteralUtil::CreateFromArray(array_a)),
       dot->operand(0)->literal()));
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::ConvertF32ToBF16(*Literal::CreateFromArray(array_b)),
+      *LiteralUtil::ConvertF32ToBF16(*LiteralUtil::CreateFromArray(array_b)),
       dot->operand(1)->literal()));
 }
 
@@ -240,12 +240,10 @@ TEST_F(BFloat16PropagationTest, SameValueReferencedTwice) {
   EXPECT_TRUE(PropagatePrecision(module.get()));
 
   EXPECT_EQ(computation->root_instruction(), dot);
-  EXPECT_TRUE(OutputsBF16(add0));
   EXPECT_TRUE(OutputsBF16(add1));
   EXPECT_TRUE(OutputsBF16(lhs));
-  // rhs is a get-tuple-element, which does not define a buffer, but its shape
-  // should also be adjusted accordingly.
-  EXPECT_TRUE(OutputsBF16(rhs));
+
+  // add0 and rhs have been eliminated by simplification and DCE.
 }
 
 // Tests that a non-fusion computation's root should not be changed.
@@ -434,7 +432,7 @@ TEST_F(BFloat16PropagationTest, SelectOverTuples) {
   HloInstruction* tuple1 =
       builder.AddInstruction(HloInstruction::CreateTuple({param, add1}));
   HloInstruction* sel = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple0->shape(), HloOpcode::kSelect, pred, tuple0, tuple1));
+      tuple0->shape(), HloOpcode::kTupleSelect, pred, tuple0, tuple1));
   HloInstruction* gte0 = builder.AddInstruction(
       HloInstruction::CreateGetTupleElement(shape, sel, 0));
   HloInstruction* gte1 = builder.AddInstruction(
@@ -510,6 +508,63 @@ TEST_F(BFloat16PropagationTest, PropagateThroughSimpleWhile) {
   EXPECT_FALSE(OutputsBF16(dot));
 }
 
+// Tests that if the while condition prevents using BF16, no changes should be
+// made to the while body and thus the fusion node inside it.
+TEST_F(BFloat16PropagationTest,
+       ConditionPreventsPropagationForFusionInsideWhile) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param0"));
+  HloInstruction* param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, shape, "param1"));
+  HloInstruction* add = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+
+  auto builder_cond = HloComputation::Builder("cond");
+  auto cond_param = builder_cond.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "cond_param"));
+  builder_cond.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt,
+      builder_cond.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond_param, {0, 0}, {1, 1}, {1, 1})),
+      builder_cond.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond_param, {1, 1}, {2, 2}, {1, 1}))));
+  auto cond = module->AddEmbeddedComputation(builder_cond.Build());
+
+  auto builder_body = HloComputation::Builder("body");
+  auto body_param = builder_body.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "body_param"));
+  auto body_transpose = builder_body.AddInstruction(
+      HloInstruction::CreateTranspose(shape, body_param, {0, 1}));
+
+  auto builder_f = HloComputation::Builder("fusion");
+  HloInstruction* a_f =
+      builder_f.AddInstruction(HloInstruction::CreateParameter(0, shape, "a"));
+  builder_f.AddInstruction(HloInstruction::CreateTranspose(shape, a_f, {0, 1}));
+  auto comp_f = module->AddEmbeddedComputation(builder_f.Build());
+  auto body_fusion = builder_body.AddInstruction(HloInstruction::CreateFusion(
+      shape, HloInstruction::FusionKind::kCustom, {body_transpose}, comp_f));
+  auto body = module->AddEmbeddedComputation(builder_body.Build());
+
+  auto while_hlo = builder.AddInstruction(
+      HloInstruction::CreateWhile(shape, cond, body, add));
+
+  auto dot = builder.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot, while_hlo, while_hlo));
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_FALSE(PropagatePrecision(module.get()));
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_FALSE(OutputsBF16(add));
+  EXPECT_FALSE(OutputsBF16(body_fusion));
+  EXPECT_FALSE(OutputsBF16(body_param));
+  EXPECT_FALSE(OutputsBF16(body_transpose));
+  EXPECT_FALSE(OutputsBF16(a_f));
+}
+
 // Tests that BF16 is propagated properly through while computations with
 // tuple-shaped input/output.
 TEST_F(BFloat16PropagationTest, PropagateThroughTupleWhile) {
@@ -555,10 +610,14 @@ TEST_F(BFloat16PropagationTest, PropagateThroughTupleWhile) {
       HloInstruction::CreateGetTupleElement(shape, body_param, 0));
   auto body_rhs = builder_body.AddInstruction(
       HloInstruction::CreateGetTupleElement(shape, body_param, 1));
-  auto body_dot = builder_body.AddInstruction(
+  auto body_dot1 = builder_body.AddInstruction(
       HloInstruction::CreateBinary(shape, HloOpcode::kDot, body_lhs, body_rhs));
+  auto body_dot2 = builder_body.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, body_rhs, body_lhs));
+  auto body_transpose = builder_body.AddInstruction(
+      HloInstruction::CreateTranspose(shape, body_dot2, {0, 1}));
   builder_body.AddInstruction(
-      HloInstruction::CreateTuple({body_dot, body_rhs}));
+      HloInstruction::CreateTuple({body_dot1, body_transpose}));
   auto body = module->AddEmbeddedComputation(builder_body.Build());
 
   auto while_hlo = builder.AddInstruction(
@@ -577,9 +636,11 @@ TEST_F(BFloat16PropagationTest, PropagateThroughTupleWhile) {
   EXPECT_EQ(computation->root_instruction(), dot);
   EXPECT_TRUE(OutputsBF16(lhs));
   EXPECT_FALSE(OutputsBF16(rhs));
-  EXPECT_TRUE(OutputsBF16(body_dot));
+  EXPECT_TRUE(OutputsBF16(body_dot1));
   EXPECT_TRUE(OutputsBF16(body_lhs));
   EXPECT_FALSE(OutputsBF16(body_rhs));
+  EXPECT_FALSE(OutputsBF16(body_dot2));
+  EXPECT_FALSE(OutputsBF16(body_transpose));
   EXPECT_TRUE(OutputsBF16(cond_lhs));
   EXPECT_FALSE(OutputsBF16(cond_rhs));
   EXPECT_TRUE(OutputsBF16(add0));
@@ -734,10 +795,8 @@ TEST_F(BFloat16PropagationTest, NoopConversionRemoved) {
   EXPECT_TRUE(PropagatePrecision(module.get()));
 
   EXPECT_EQ(computation->root_instruction(), add2);
-  EXPECT_EQ(add2->operand(0), gte0);
-  EXPECT_EQ(add2->operand(1), gte1);
-  EXPECT_EQ(gte0->shape().element_type(), BF16);
-  EXPECT_EQ(gte1->shape().element_type(), BF16);
+  EXPECT_EQ(add2->operand(0), add0);
+  EXPECT_EQ(add2->operand(1), add1);
   EXPECT_EQ(add0->shape().element_type(), BF16);
   EXPECT_EQ(add1->shape().element_type(), BF16);
 }
diff --git a/tensorflow/compiler/xla/service/bfloat16_support.cc b/tensorflow/compiler/xla/service/bfloat16_support.cc
index 8595afca7e..23645346e6 100644
--- a/tensorflow/compiler/xla/service/bfloat16_support.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_support.cc
@@ -103,6 +103,7 @@ bool BFloat16Support::EffectiveOperandPrecisionIsOutputPrecision(
     case HloOpcode::kDynamicUpdateSlice:
       return operand_index == 0 || operand_index == 1;
     case HloOpcode::kSelect:
+    case HloOpcode::kTupleSelect:
       return operand_index == 1 || operand_index == 2;
     default:
       break;
diff --git a/tensorflow/compiler/xla/service/buffer_assignment.cc b/tensorflow/compiler/xla/service/buffer_assignment.cc
index afe4b2e142..e4d2e73b99 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment.cc
+++ b/tensorflow/compiler/xla/service/buffer_assignment.cc
@@ -270,7 +270,7 @@ BufferAllocationProto BufferAllocation::ToProto() const {
   proto.set_index(index_);
   proto.set_size(size_);
   proto.set_is_thread_local(is_thread_local_);
-  proto.set_is_reusable(is_reusable_);
+  proto.set_is_tuple(is_tuple_);
   proto.set_color(color_.value());
   if (is_entry_computation_parameter_) {
     proto.set_is_entry_computation_parameter(true);
@@ -279,6 +279,7 @@ BufferAllocationProto BufferAllocation::ToProto() const {
     }
     proto.set_parameter_number(parameter_number_);
   }
+  proto.set_is_constant(is_constant_);
   proto.set_maybe_live_out(maybe_live_out_);
   for (const auto& buffer_offset_size : assigned_buffers_) {
     BufferAllocationProto::Assigned* proto_assigned = proto.add_assigned();
@@ -304,6 +305,9 @@ string BufferAllocation::ToString() const {
     StrAppend(&output, ", parameter ", parameter_number(), " at ShapeIndex ",
               param_shape_index().ToString());
   }
+  if (is_constant()) {
+    StrAppend(&output, ", constant");
+  }
   if (is_thread_local()) {
     StrAppend(&output, ", thread-local");
   }
@@ -491,20 +495,16 @@ BufferAssignment::GetUniqueTopLevelOutputSlice() const {
 }
 
 BufferAllocation* BufferAssignment::NewEmptyAllocation(
-    int64 size, bool is_thread_local, bool is_reusable,
-    LogicalBuffer::Color color) {
+    int64 size, LogicalBuffer::Color color) {
   BufferAllocation::Index index = allocations_.size();
-  allocations_.emplace_back(index, size, is_thread_local, is_reusable, color);
+  allocations_.emplace_back(index, size, color);
   BufferAllocation* allocation = &allocations_.back();
   return allocation;
 }
 
 BufferAllocation* BufferAssignment::NewAllocation(const LogicalBuffer& buffer,
-                                                  int64 size,
-                                                  bool is_thread_local,
-                                                  bool is_reusable) {
-  BufferAllocation* allocation =
-      NewEmptyAllocation(size, is_thread_local, is_reusable, buffer.color());
+                                                  int64 size) {
+  BufferAllocation* allocation = NewEmptyAllocation(size, buffer.color());
   AddAssignment(allocation, buffer, /*offset=*/0, size);
   allocation->peak_buffers_.push_back(&buffer);
   return allocation;
@@ -517,7 +517,8 @@ void BufferAssignment::AddAssignment(BufferAllocation* allocation,
   CHECK_EQ(0, allocation_index_for_buffer_.count(&buffer))
       << "LogicalBuffer " << buffer << " already has an allocation.";
   CHECK(allocation->is_reusable() || allocation->assigned_buffers().empty())
-      << "Non-reusable allocation already assigned a buffer";
+      << "Non-reusable allocation already assigned a buffer: "
+      << allocation->ToString();
 
   TF_CHECK_OK(points_to_analysis().VerifyBuffer(buffer));
 
@@ -609,6 +610,10 @@ Status BufferAssignment::ComputeSummaryStats() {
       stats_.parameter_allocation_count++;
       stats_.parameter_allocation_bytes += allocation.size();
     }
+    if (allocation.is_constant()) {
+      stats_.constant_allocation_count++;
+      stats_.constant_allocation_bytes += allocation.size();
+    }
     if (allocation.maybe_live_out()) {
       stats_.maybe_live_out_allocation_count++;
       stats_.maybe_live_out_allocation_bytes += allocation.size();
@@ -645,6 +650,8 @@ string BufferAssignment::Stats::ToString() const {
   Appendf(&s, "BufferAssignment stats:\n");
   Appendf(&s, "             parameter allocation: %10s\n",
           HumanReadableNumBytes(parameter_allocation_bytes).c_str());
+  Appendf(&s, "              constant allocation: %10s\n",
+          HumanReadableNumBytes(constant_allocation_bytes).c_str());
   Appendf(&s, "        maybe_live_out allocation: %10s\n",
           HumanReadableNumBytes(maybe_live_out_allocation_bytes).c_str());
   Appendf(&s, "     preallocated temp allocation: %10s\n",
@@ -722,8 +729,10 @@ StatusOr<std::unique_ptr<BufferAssignment>> BufferAssigner::Run(
     const HloModule* module, std::unique_ptr<HloOrdering> hlo_ordering,
     LogicalBuffer::SizeFunction buffer_size,
     LogicalBuffer::AlignmentFunction color_alignment,
-    bool allow_input_output_aliasing, BufferLiveness::Colorer colorer) {
-  BufferAssigner assigner(allow_input_output_aliasing, std::move(colorer));
+    bool allow_input_output_aliasing, bool allocate_buffers_for_constants,
+    BufferLiveness::Colorer colorer) {
+  BufferAssigner assigner(allow_input_output_aliasing,
+                          allocate_buffers_for_constants, std::move(colorer));
   return assigner.CreateAssignment(module, std::move(hlo_ordering),
                                    std::move(buffer_size),
                                    std::move(color_alignment));
@@ -751,8 +760,8 @@ bool BufferAssigner::MaybeAssignBuffer(BufferAllocation* allocation,
     return false;
   }
 
-  if (allocation->is_entry_computation_parameter()) {
-    VLOG(4) << "Can't assign: allocation holds parameter";
+  if (allocation->is_readonly()) {
+    VLOG(4) << "Can't assign: allocation is readonly";
     return false;
   }
 
@@ -808,8 +817,7 @@ bool BufferAssigner::MaybeAssignBuffer(BufferAllocation* allocation,
 }
 
 Status BufferAssigner::AssignBuffersForComputation(
-    const HloComputation* computation, const DebugOptions& debug_options,
-    bool is_thread_local,
+    const HloComputation* computation, bool is_thread_local,
     const FlatSet<const LogicalBuffer*>& colocated_buffers,
     const FlatSet<BufferAllocation::Index>& colocated_allocations,
     FlatMap<const HloComputation*, FlatSet<const LogicalBuffer*>>*
@@ -905,15 +913,19 @@ Status BufferAssigner::AssignBuffersForComputation(
     TF_RET_CHECK(!assignment->HasAllocation(*buffer));
 
     const HloInstruction* instruction = buffer->instruction();
+    const int64 buffer_size = assignment->buffer_size_(*buffer);
+
     if (instruction->opcode() == HloOpcode::kConstant) {
-      // No BufferAllocations for constants.
-      // TODO(b/32248867): For consistency, constants should get allocations.
-      VLOG(3) << "Skipping constant: " << *buffer;
+      if (allocate_buffers_for_constants_) {
+        BufferAllocation* allocation =
+            assignment->NewAllocation(*buffer, buffer_size);
+        allocation->set_constant(true);
+        VLOG(3) << "New allocation #" << allocation->index() << " for constant "
+                << *buffer;
+      }
       continue;
     }
 
-    const int64 buffer_size = assignment->buffer_size_(*buffer);
-
     const bool is_entry_parameter =
         instruction->opcode() == HloOpcode::kParameter &&
         computation == computation->parent()->entry_computation();
@@ -923,9 +935,7 @@ Status BufferAssigner::AssignBuffersForComputation(
       // computations do not need special allocations because they live inside
       // callers.
       BufferAllocation* allocation =
-          assignment->NewAllocation(*buffer, buffer_size,
-                                    /*is_thread_local=*/false,
-                                    /*is_reusable=*/false);
+          assignment->NewAllocation(*buffer, buffer_size);
       allocation->set_entry_computation_parameter(
           instruction->parameter_number(), buffer->index());
       VLOG(3) << "New allocation #" << allocation->index()
@@ -934,20 +944,18 @@ Status BufferAssigner::AssignBuffersForComputation(
     }
 
     if (is_thread_local) {
-      // We do not reuse thread-local buffers for now, because they are
-      // dynamically allocated and their lifetimes are hard to compute.
-      BufferAllocation* allocation = assignment->NewAllocation(
-          *buffer, buffer_size, is_thread_local, /*is_reusable=*/false);
+      BufferAllocation* allocation =
+          assignment->NewAllocation(*buffer, buffer_size);
+      allocation->set_is_thread_local(true);
       VLOG(3) << "New allocation #" << allocation->index()
               << " for thread-local: " << *buffer;
       continue;
     }
 
     if (ShapeUtil::IsTuple(buffer->shape())) {
-      // TODO(b/34669761): Don't reuse tuple buffers because the GPU backend
-      // assumes longer buffer liveness than indicated by the analysis.
-      BufferAllocation* allocation = assignment->NewAllocation(
-          *buffer, buffer_size, is_thread_local, /*is_reusable=*/false);
+      BufferAllocation* allocation =
+          assignment->NewAllocation(*buffer, buffer_size);
+      allocation->set_is_tuple(true);
       VLOG(3) << "New allocation #" << allocation->index()
               << " for tuple-shaped buffer: " << *buffer;
       continue;
@@ -1030,8 +1038,8 @@ Status BufferAssigner::AssignBuffersForComputation(
     }
 
     if (!assignment->HasAllocation(*buffer)) {
-      BufferAllocation* allocation = assignment->NewAllocation(
-          *buffer, buffer_size, is_thread_local, /*is_reusable=*/true);
+      BufferAllocation* allocation =
+          assignment->NewAllocation(*buffer, buffer_size);
       allocation_indices.push_back(allocation->index());
       VLOG(3) << "New allocation #" << allocation->index()
               << " for: " << *buffer;
@@ -1085,6 +1093,7 @@ Status BufferAssigner::AssignBuffersWithSequentialOrdering(
       VLOG(2) << "Simulating heap for color " << color;
       int64 alignment = assignment->color_alignment_(color);
       HeapSimulator::Options options;
+      options.alloc_constants = allocate_buffers_for_constants_;
       BufferValueFlatSet buffer_value_set =
           ToBufferValueFlatSet(single_colored_set.second);
       options.buffers_to_assign = &buffer_value_set;
@@ -1227,8 +1236,8 @@ void BufferAssigner::AssignBuffersFromHeapSimulator(
         result.fragmentation_size;
   }
 
-  BufferAllocation* allocation = assignment->NewEmptyAllocation(
-      result.heap_size, /*is_thread_local=*/false, /*is_reusable=*/true, color);
+  BufferAllocation* allocation =
+      assignment->NewEmptyAllocation(result.heap_size, color);
   for (const auto& buffer_chunk : result.chunk_map) {
     // TODO(lauj) Remove this down_cast after downstream users of
     // BufferAllocation::assigned_buffers() are updated to use BufferValue.
@@ -1332,11 +1341,25 @@ BufferAssigner::MergeColocatedBufferSets(
   auto cannot_merge_buffer_sets = [&colocated_buffer_sets, &buffer_liveness,
                                    &buffer_size,
                                    &is_entry_parameter](int64 i, int64 j) {
-    // Do not merge if one of the sets includes live outs or entry parameters.
+    // Do not merge if one of the sets includes live outs, entry parameters or
+    // constants.
+    //
+    // Buffer liveness does not report the correct live range for entry
+    // parameter and live out buffers so we have to special case them here.  On
+    // backends that support constant buffer allocations, constant buffers are
+    // assigned globals in readonly storage so we can't merge colocated buffer
+    // sets containing constants with colocated buffer sets containing writing
+    // instructions or other constants.
+    //
+    // Moreover (on the CPU/GPU backends) the entry parameter buffers belong to
+    // the caller of the executable so we can't write to entry parameters
+    // either, and the argument for not merging constants also applies to entry
+    // parameters.
     for (int64 key : {i, j}) {
       for (auto& buffer : colocated_buffer_sets[key]) {
         if (buffer_liveness.MaybeLiveOut(*buffer) ||
-            is_entry_parameter(*buffer)) {
+            is_entry_parameter(*buffer) ||
+            buffer->instruction()->opcode() == HloOpcode::kConstant) {
           return true;
         }
       }
@@ -1444,8 +1467,23 @@ void BufferAssigner::BuildColocatedBufferSets(
             });
       } else if (opcode == HloOpcode::kCall) {
         const HloInstruction* call_hlo = instruction;
-        const HloInstruction* root_hlo =
-            call_hlo->to_apply()->root_instruction();
+        const HloComputation* callee = call_hlo->to_apply();
+        const HloInstruction* root_hlo = callee->root_instruction();
+        for (int64 i = 0; i < call_hlo->operand_count(); i++) {
+          const HloInstruction* call_param = callee->parameter_instruction(i);
+          const HloInstruction* call_operand = call_hlo->operand(i);
+          ShapeUtil::ForEachSubshape(
+              call_operand->shape(),
+              [&](const Shape& /*subshape*/, const ShapeIndex& index) {
+                std::vector<const LogicalBuffer*> colocated_set;
+                AddBufferToColocatedSet(call_param, index, points_to_analysis,
+                                        &colocated_set);
+                AddBufferToColocatedSet(call_operand, index, points_to_analysis,
+                                        &colocated_set);
+                AddSetToColocatedBufferSets(colocated_set,
+                                            colocated_buffer_sets);
+              });
+        }
         ShapeUtil::ForEachSubshape(
             call_hlo->shape(),
             [this, call_hlo, root_hlo, &points_to_analysis,
@@ -1551,6 +1589,7 @@ void BufferAssigner::AssignColocatedBufferSets(
     // param in 'colocated_buffer_set'.
     int64 entry_parameter_number = -1;
     const ShapeIndex* entry_parameter_shape_idx = nullptr;
+    bool is_constant = false;
     for (const LogicalBuffer* buffer : colocated_buffer_set) {
       const HloInstruction* instruction = buffer->instruction();
       const HloComputation* computation = instruction->parent();
@@ -1558,10 +1597,14 @@ void BufferAssigner::AssignColocatedBufferSets(
           computation == computation->parent()->entry_computation()) {
         entry_parameter_number = instruction->parameter_number();
         entry_parameter_shape_idx = &buffer->index();
-        break;
+      } else if (instruction->opcode() == HloOpcode::kConstant) {
+        is_constant = true;
       }
     }
 
+    CHECK(!is_constant || entry_parameter_number == -1)
+        << "Copy insertion should have inserted copies to prevent this.";
+
     for (const LogicalBuffer* buffer : colocated_buffer_set) {
       const int64 buffer_size = assignment->buffer_size_(*buffer);
       if (allocation == nullptr) {
@@ -1569,18 +1612,14 @@ void BufferAssigner::AssignColocatedBufferSets(
         // allocations for each colocated buffer set. When liveness has
         // module-level scope, we can allow buffers to be shared across
         // computations (in some cases).
-        allocation = assignment->NewAllocation(*buffer, buffer_size,
-                                               /*is_thread_local=*/false,
-                                               /*is_reusable=*/true);
+        allocation = assignment->NewAllocation(*buffer, buffer_size);
         if (entry_parameter_number >= 0) {
-          // This colocated buffer set contains an entry parameter and other
-          // logical buffers which use the parameter as read-only in a while
-          // body computation (which updates in place).
-          // Set 'entry_computation_parameter' to indicate that it contains
-          // an entry parameter, and to prevent reuse in MaybeAssignBuffer.
           allocation->set_entry_computation_parameter(
               entry_parameter_number, *entry_parameter_shape_idx);
         }
+        if (is_constant) {
+          allocation->set_constant(true);
+        }
         colocated_allocations->insert(allocation->index());
       } else {
         CHECK_EQ(buffer_size, allocation->size())
@@ -1638,7 +1677,7 @@ StatusOr<std::unique_ptr<BufferAssignment>> BufferAssigner::CreateAssignment(
       buffers_to_assign_sequentially;
   for (auto* computation : global_computations) {
     TF_RETURN_IF_ERROR(AssignBuffersForComputation(
-        computation, module->config().debug_options(),
+        computation,
         /*is_thread_local=*/false, colocated_buffers, colocated_allocations,
         &buffers_to_assign_sequentially, assignment.get()));
   }
@@ -1659,7 +1698,7 @@ StatusOr<std::unique_ptr<BufferAssignment>> BufferAssigner::CreateAssignment(
       continue;
     }
     TF_RETURN_IF_ERROR(AssignBuffersForComputation(
-        computation, module->config().debug_options(),
+        computation,
         /*is_thread_local=*/true, colocated_buffers, colocated_allocations,
         /*buffers_to_assign_sequentially=*/nullptr, assignment.get()));
   }
diff --git a/tensorflow/compiler/xla/service/buffer_assignment.h b/tensorflow/compiler/xla/service/buffer_assignment.h
index ad0b0bf7c2..94495290c1 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment.h
+++ b/tensorflow/compiler/xla/service/buffer_assignment.h
@@ -32,7 +32,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/tuple_points_to_analysis.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/gtl/flatmap.h"
 #include "tensorflow/core/lib/gtl/flatset.h"
@@ -58,13 +57,8 @@ class BufferAllocation {
   // contiguously and can be used as array indexes.
   using Index = int64;
 
-  BufferAllocation(Index index, int64 size, bool is_thread_local,
-                   bool is_reusable, LogicalBuffer::Color color)
-      : index_(index),
-        size_(size),
-        is_thread_local_(is_thread_local),
-        is_reusable_(is_reusable),
-        color_(color) {}
+  BufferAllocation(Index index, int64 size, LogicalBuffer::Color color)
+      : index_(index), size_(size), color_(color) {}
   ~BufferAllocation() {}
 
   // Returns the index of this allocation.
@@ -74,9 +68,28 @@ class BufferAllocation {
   // inside of a map or reduce computation. Such allocations need to be thread
   // local.
   bool is_thread_local() const { return is_thread_local_; }
+  void set_is_thread_local(bool is_thread_local) {
+    is_thread_local_ = is_thread_local;
+  }
 
   // Whether this allocation can be used by more than one logical buffer.
-  bool is_reusable() const { return is_reusable_; }
+  bool is_reusable() const {
+    // We do not reuse thread-local buffers for now, because they are
+    // dynamically allocated and their lifetimes are hard to compute.
+    //
+    // TODO(b/34669761): Don't reuse tuple buffers because the GPU backend
+    // assumes longer buffer liveness than indicated by the analysis.
+    return !is_thread_local() && !is_tuple();
+  }
+
+  // Whether this allocation is readonly i.e. backed by memory we cannot write
+  // to.
+  bool is_readonly() const {
+    return is_entry_computation_parameter() || is_constant();
+  }
+
+  bool is_tuple() const { return is_tuple_; }
+  void set_is_tuple(bool is_tuple) { is_tuple_ = is_tuple; }
 
   // Whether this allocation holds a LogicalBuffer from a parameter of the entry
   // computation. These buffers have lifetimes which may be longer than the
@@ -84,6 +97,13 @@ class BufferAllocation {
   bool is_entry_computation_parameter() const {
     return is_entry_computation_parameter_;
   }
+
+  // Whether this allocation holds a constant.  On the CPU and GPU backends
+  // constant allocations are not allocated dynamically, instead we resolve
+  // references to these buffer allocations to a global in the readonly section
+  // of the binary.
+  bool is_constant() const { return is_constant_; }
+
   // If this allocation holds a Buffer from a parameter of the entry
   // computation, this methods returns the parameter number. CHECKs otherwise.
   int64 parameter_number() const {
@@ -189,7 +209,9 @@ class BufferAllocation {
            // of the computation.
            !maybe_live_out() &&
            // Thread-local buffers are allocated using `alloca`s.
-           !is_thread_local();
+           !is_thread_local() &&
+           // Constant buffers are allocated as global values.
+           !is_constant();
   }
 
   // Add a heap trace which was used to assign slices to logical buffers in this
@@ -245,6 +267,8 @@ class BufferAllocation {
     parameter_number_ = parameter_number;
     param_shape_index_ = std::move(param_shape_index);
   }
+
+  void set_constant(bool is_constant) { is_constant_ = is_constant; }
   void set_maybe_live_out(bool value) { maybe_live_out_ = value; }
   void set_index(Index index) { index_ = index; }
   void set_size(int64 size) { size_ = size; }
@@ -256,10 +280,10 @@ class BufferAllocation {
   int64 size_;
 
   // Whether this buffer needs to be thread-local.
-  bool is_thread_local_;
+  bool is_thread_local_ = false;
 
-  // Whether this buffer is usable by more than one logical buffer.
-  bool is_reusable_;
+  // Whether this buffer holds a tuple.
+  bool is_tuple_ = false;
 
   // Color of the allocation.
   LogicalBuffer::Color color_;
@@ -283,6 +307,9 @@ class BufferAllocation {
   // might not actually escape.
   bool maybe_live_out_ = false;
 
+  // See comment on the is_constant() accessor.
+  bool is_constant_ = false;
+
   // Mapping from the set of buffers assigned to this allocation to their
   // logical offsets and sizes.
   tensorflow::gtl::FlatMap<const LogicalBuffer*, OffsetSize> assigned_buffers_;
@@ -398,6 +425,8 @@ class BufferAssignment {
   struct Stats {
     int64 parameter_allocation_count = 0;
     int64 parameter_allocation_bytes = 0;
+    int64 constant_allocation_count = 0;
+    int64 constant_allocation_bytes = 0;
     int64 maybe_live_out_allocation_count = 0;
     int64 maybe_live_out_allocation_bytes = 0;
     int64 preallocated_temp_allocation_count = 0;
@@ -426,14 +455,11 @@ class BufferAssignment {
 
   // Creates and returns a new BufferAllocation, with no assigned
   // LogicalBuffers. Ownership is maintained internally.
-  BufferAllocation* NewEmptyAllocation(int64 size, bool is_thread_local,
-                                       bool is_reusable,
-                                       LogicalBuffer::Color color);
+  BufferAllocation* NewEmptyAllocation(int64 size, LogicalBuffer::Color color);
 
   // Helper that calls NewEmptyAllocation and AddAssignment in one call,
   // creating an allocation containing a single LogicalBuffer.
-  BufferAllocation* NewAllocation(const LogicalBuffer& buffer, int64 size,
-                                  bool is_thread_local, bool is_reusable);
+  BufferAllocation* NewAllocation(const LogicalBuffer& buffer, int64 size);
 
   // Adds a LogicalBuffer to the set assigned to the given allocation.
   void AddAssignment(BufferAllocation* allocation, const LogicalBuffer& buffer,
@@ -493,12 +519,15 @@ class BufferAssigner {
       LogicalBuffer::SizeFunction buffer_size,
       LogicalBuffer::AlignmentFunction color_alignment,
       bool allow_input_output_aliasing = false,
+      bool allocate_buffers_for_constants = false,
       BufferLiveness::Colorer colorer = BufferLiveness::DefaultColorer());
 
  private:
   BufferAssigner(bool allow_input_output_aliasing,
+                 bool allocate_buffers_for_constants,
                  BufferLiveness::Colorer colorer)
       : allow_input_output_aliasing_(allow_input_output_aliasing),
+        allocate_buffers_for_constants_(allocate_buffers_for_constants),
         colorer_(colorer) {}
   virtual ~BufferAssigner() = default;
 
@@ -513,8 +542,7 @@ class BufferAssigner {
   // true, then all assigned buffers have the is_thread_local flag set to
   // true.
   Status AssignBuffersForComputation(
-      const HloComputation* computation, const DebugOptions& debug_options,
-      bool is_thread_local,
+      const HloComputation* computation, bool is_thread_local,
       const tensorflow::gtl::FlatSet<const LogicalBuffer*>& colocated_buffers,
       const tensorflow::gtl::FlatSet<BufferAllocation::Index>&
           colocated_allocations,
@@ -595,6 +623,9 @@ class BufferAssigner {
   // buffers can be shared if their sizes match.
   bool allow_input_output_aliasing_;
 
+  // If true, allocate buffers for constant instructions.
+  bool allocate_buffers_for_constants_;
+
   // Functor used to assign colors to newly allocated logical buffers.
   BufferLiveness::Colorer colorer_;
 
diff --git a/tensorflow/compiler/xla/service/buffer_assignment_test.cc b/tensorflow/compiler/xla/service/buffer_assignment_test.cc
index 28b5a5784f..eccb146a0d 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/buffer_assignment_test.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/buffer_value.h"
 #include "tensorflow/compiler/xla/service/call_graph.h"
@@ -89,7 +89,20 @@ class BufferAssignmentTest : public HloTestBase {
     return BufferAssigner::Run(
                module, xla::MakeUnique<DependencyHloOrdering>(module),
                backend().compiler()->BufferSizeBytesFunction(),
-               [alignment](LogicalBuffer::Color) { return alignment; })
+               [alignment](LogicalBuffer::Color) { return alignment; },
+               /*allow_input_output_aliasing=*/false,
+               /*allocate_buffers_for_constants=*/true)
+        .ConsumeValueOrDie();
+  }
+
+  std::unique_ptr<BufferAssignment> RunBufferAssignmentNoBuffersForConstants(
+      HloModule* module, int64 alignment = 1) {
+    return BufferAssigner::Run(
+               module, xla::MakeUnique<DependencyHloOrdering>(module),
+               backend().compiler()->BufferSizeBytesFunction(),
+               [alignment](LogicalBuffer::Color) { return alignment; },
+               /*allow_input_output_aliasing=*/false,
+               /*allocate_buffers_for_constants=*/false)
         .ConsumeValueOrDie();
   }
 
@@ -98,8 +111,9 @@ class BufferAssignmentTest : public HloTestBase {
     return BufferAssigner::Run(
                module, xla::MakeUnique<DependencyHloOrdering>(module),
                backend().compiler()->BufferSizeBytesFunction(),
-               [alignment](LogicalBuffer::Color) { return alignment; }, false,
-               std::move(colorer))
+               [alignment](LogicalBuffer::Color) { return alignment; },
+               /*allow_input_output_aliasing=*/false,
+               /*allocate_buffers_for_constants=*/true, std::move(colorer))
         .ConsumeValueOrDie();
   }
 
@@ -115,7 +129,9 @@ class BufferAssignmentTest : public HloTestBase {
                module,
                xla::MakeUnique<SequentialHloOrdering>(module, module_sequence),
                backend().compiler()->BufferSizeBytesFunction(),
-               [alignment](LogicalBuffer::Color) { return alignment; })
+               [alignment](LogicalBuffer::Color) { return alignment; },
+               /*allow_input_output_aliasing=*/false,
+               /*allocate_buffers_for_constants=*/true)
         .ConsumeValueOrDie();
   }
 
@@ -125,7 +141,7 @@ class BufferAssignmentTest : public HloTestBase {
     auto param =
         builder.AddInstruction(HloInstruction::CreateParameter(0, r0f32_, "x"));
     auto value = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     builder.AddInstruction(
         HloInstruction::CreateBinary(r0f32_, HloOpcode::kAdd, param, value));
     return builder.Build();
@@ -142,7 +158,7 @@ class BufferAssignmentTest : public HloTestBase {
       const string& name) {
     auto builder = HloComputation::Builder(name);
     auto const4 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int>(4)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(4)));
     auto param = builder.AddInstruction(
         HloInstruction::CreateParameter(0, t_s32_f32v4_, "x"));
     auto index = builder.AddInstruction(
@@ -167,9 +183,9 @@ class BufferAssignmentTest : public HloTestBase {
       const string& name) {
     auto builder = HloComputation::Builder(name);
     auto const1 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int>(1)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(1)));
     auto constv = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
+        LiteralUtil::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
     auto param = builder.AddInstruction(
         HloInstruction::CreateParameter(0, t_s32_f32v4_, "x"));
     auto indexc = builder.AddInstruction(
@@ -290,13 +306,19 @@ static bool BuffersDistinct(const std::vector<const HloInstruction*>& a,
 TEST_F(BufferAssignmentTest, ScalarConstant) {
   auto builder = HloComputation::Builder(TestName());
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto module = CreateNewModule();
   module->AddEntryComputation(builder.Build());
 
-  auto buffers = RunBufferAssignment(module.get());
-  // Check that the constant does not have a buffer assigned.
-  EXPECT_FALSE(buffers->HasTopLevelAllocation(const0));
+  {
+    auto buffers = RunBufferAssignment(module.get());
+    EXPECT_TRUE(buffers->HasTopLevelAllocation(const0));
+  }
+
+  {
+    auto buffers = RunBufferAssignmentNoBuffersForConstants(module.get());
+    EXPECT_FALSE(buffers->HasTopLevelAllocation(const0));
+  }
 }
 
 TEST_F(BufferAssignmentTest, BufferForConst) {
@@ -304,20 +326,26 @@ TEST_F(BufferAssignmentTest, BufferForConst) {
   // no buffers assigned, and their consumer has a buffer.
   auto builder = HloComputation::Builder(TestName());
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
+      LiteralUtil::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
   auto const1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({4.1f, 4.2f, 4.3f, 4.4f})));
+      LiteralUtil::CreateR1<float>({4.1f, 4.2f, 4.3f, 4.4f})));
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(f32vec4_, HloOpcode::kAdd, const0, const1));
   auto module = CreateNewModule();
   module->AddEntryComputation(builder.Build());
 
-  auto buffers = RunBufferAssignment(module.get());
-  // The two constant nodes have no buffers assigned.
-  EXPECT_FALSE(buffers->HasTopLevelAllocation(const0));
-  EXPECT_FALSE(buffers->HasTopLevelAllocation(const1));
-  // The add node has an output buffer.
-  GetAssignedOutputAllocation(*buffers, add);
+  {
+    auto buffers = RunBufferAssignment(module.get());
+    EXPECT_TRUE(buffers->HasTopLevelAllocation(const0));
+    EXPECT_TRUE(buffers->HasTopLevelAllocation(const1));
+    GetAssignedOutputAllocation(*buffers, add);
+  }
+  {
+    auto buffers = RunBufferAssignmentNoBuffersForConstants(module.get());
+    EXPECT_FALSE(buffers->HasTopLevelAllocation(const0));
+    EXPECT_FALSE(buffers->HasTopLevelAllocation(const1));
+    GetAssignedOutputAllocation(*buffers, add);
+  }
 }
 
 TEST_F(BufferAssignmentTest, HasAllocationAt) {
@@ -327,7 +355,7 @@ TEST_F(BufferAssignmentTest, HasAllocationAt) {
   auto param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, f32vec100_, "param0"));
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int>(1)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(1)));
   auto negate = builder.AddInstruction(
       HloInstruction::CreateUnary(f32vec100_, HloOpcode::kNegate, param0));
   auto tuple = builder.AddInstruction(
@@ -352,7 +380,7 @@ TEST_F(BufferAssignmentTest, BufferForOutputConst) {
   // This computation copies a constant to output.
   auto builder = HloComputation::Builder(TestName());
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
+      LiteralUtil::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
   auto copy = builder.AddInstruction(
       HloInstruction::CreateUnary(const0->shape(), HloOpcode::kCopy, const0));
   auto module = CreateNewModule();
@@ -660,7 +688,7 @@ TEST_F(BufferAssignmentTest, CannotReuseInputBufferOfReduce) {
   auto exp2 = builder.AddInstruction(
       HloInstruction::CreateUnary(f32a100x10_, HloOpcode::kExp, exp1));
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   auto reduce = builder.AddInstruction(HloInstruction::CreateReduce(
       /*shape=*/f32vec10_,
       /*operand=*/exp2,
@@ -708,9 +736,9 @@ TEST_F(BufferAssignmentTest, ExampleWhile) {
   // Creates the main kernel and verifies instruction counts.
   auto builder = HloComputation::Builder(TestName());
   auto const3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(0)));
   auto const4 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
+      LiteralUtil::CreateR1<float>({1.1f, 2.2f, 3.3f, 4.4f})));
   auto tuple =
       builder.AddInstruction(HloInstruction::CreateTuple({const3, const4}));
   auto while_op = builder.AddInstruction(HloInstruction::CreateWhile(
@@ -773,11 +801,11 @@ TEST_F(BufferAssignmentTest, ExampleConditional) {
 
   auto builder = HloComputation::Builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(56.4f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(56.4f)));
   auto const2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(12.4f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(12.4f)));
   auto conditional = builder.AddInstruction(HloInstruction::CreateConditional(
       r0f32_, pred, const1, true_computation, const2, false_computation));
   module->AddEntryComputation(builder.Build());
@@ -1094,7 +1122,7 @@ TEST_F(BufferAssignmentTest, EmbeddedComputationBuffers) {
 
   // Allocations for the call computation should not be thread-local.
   auto& call_param_alloc = GetTopLevelAllocation(*assignment, call_param);
-  EXPECT_FALSE(call_param_alloc.is_entry_computation_parameter());
+  EXPECT_TRUE(call_param_alloc.is_entry_computation_parameter());
   EXPECT_FALSE(call_param_alloc.maybe_live_out());
   EXPECT_FALSE(call_param_alloc.is_thread_local());
 
@@ -1196,12 +1224,13 @@ TEST_F(BufferAssignmentTest, ElementOfNestedTupleParameterAsOutput) {
 
 // TODO(b/32248867): Enable when buffer assignment gives allocations to
 // constants.
-TEST_F(BufferAssignmentTest, DISABLED_TupleConstantAsOutput) {
+TEST_F(BufferAssignmentTest, TupleConstantAsOutput) {
   // Test that a tuple constant which is forwarded to the computation output
   // is properly handled.
   auto builder = HloComputation::Builder(TestName());
-  builder.AddInstruction(HloInstruction::CreateConstant(Literal::MakeTuple(
-      {Literal::CreateR0<int64>(0).get(), Literal::CreateR0<int64>(1).get()})));
+  builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::MakeTuple({LiteralUtil::CreateR0<int64>(0).get(),
+                              LiteralUtil::CreateR0<int64>(1).get()})));
 
   auto module = CreateNewModule();
   module->AddEntryComputation(builder.Build());
@@ -1252,16 +1281,18 @@ TEST_F(BufferAssignmentTest, TupleCallAsOutput) {
 
   auto assignment = RunBufferAssignment(module.get());
 
-  EXPECT_EQ(3, assignment->Allocations().size());
+  EXPECT_EQ(2, assignment->Allocations().size());
   // Buffers for call are colocated with the sub-computation.
   EXPECT_EQ(GetAllocation(*assignment, call, /*index=*/{}),
             GetAllocation(*assignment, sub_tuple, /*index=*/{}));
   EXPECT_EQ(GetAllocation(*assignment, call, /*index=*/{0}),
             GetAllocation(*assignment, sub_param, /*index=*/{}));
-  // The parameter isn't aliased with anything.
+
+  // The parameter isn't aliased with the result tuple, but it is aliased with
+  // the call operand.
   EXPECT_NE(GetTopLevelAllocation(*assignment, param),
             GetTopLevelAllocation(*assignment, sub_tuple));
-  EXPECT_NE(GetTopLevelAllocation(*assignment, param),
+  EXPECT_EQ(GetTopLevelAllocation(*assignment, param),
             GetTopLevelAllocation(*assignment, sub_param));
 }
 
@@ -1325,13 +1356,15 @@ TEST_F(BufferAssignmentTest, TupleChainedCallAsOutput) {
             GetAllocation(*assignment, c_call, /*index=*/{0}));
   EXPECT_EQ(GetAllocation(*assignment, c_call, /*index=*/{0}),
             GetAllocation(*assignment, d_param, /*index=*/{0}));
-  // The parameters aren't aliased with anything.
+
   EXPECT_TRUE(BuffersDistinct({a_param}, {b_param}, *assignment));
   EXPECT_TRUE(BuffersDistinct({a_param}, {c_param}, *assignment));
   EXPECT_TRUE(BuffersDistinct({a_param}, {d_param}, *assignment));
-  EXPECT_TRUE(BuffersDistinct({b_param}, {c_param}, *assignment));
-  EXPECT_TRUE(BuffersDistinct({b_param}, {d_param}, *assignment));
-  EXPECT_TRUE(BuffersDistinct({c_param}, {d_param}, *assignment));
+
+  EXPECT_EQ(GetAllocation(*assignment, b_param, /*index=*/{0}),
+            GetAllocation(*assignment, c_param, /*index=*/{0}));
+  EXPECT_EQ(GetAllocation(*assignment, c_param, /*index=*/{0}),
+            GetAllocation(*assignment, d_param, /*index=*/{0}));
 }
 
 TEST_F(BufferAssignmentTest, BitcastAsOutput) {
@@ -1365,8 +1398,9 @@ TEST_F(BufferAssignmentTest, AmbiguousBufferAsOutput) {
       HloInstruction::CreateParameter(1, tuple_shape, "param1"));
   auto pred_param = builder.AddInstruction(HloInstruction::CreateParameter(
       2, ShapeUtil::MakeShape(PRED, {}), "param1"));
-  auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred_param, tuple_param0, tuple_param1));
+  auto select = builder.AddInstruction(
+      HloInstruction::CreateTernary(tuple_shape, HloOpcode::kTupleSelect,
+                                    pred_param, tuple_param0, tuple_param1));
 
   auto module = CreateNewModule();
   module->AddEntryComputation(builder.Build());
@@ -1583,7 +1617,7 @@ TEST_F(BufferAssignmentTest, PeakBuffersWhile) {
     auto b = HloComputation::Builder(TestName() + ".cond");
     b.AddInstruction(HloInstruction::CreateParameter(0, shape, "x"));
     b.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
     condition = module->AddEmbeddedComputation(b.Build());
   }
   HloComputation* body;
@@ -1638,6 +1672,66 @@ TEST_F(BufferAssignmentTest, PeakBuffersWhile) {
       nonbcast_buffer->instruction() == condition->parameter_instruction(0));
 }
 
+TEST_F(BufferAssignmentTest, ConstantBuffersAreNotReused) {
+  const char* hlo_text = R"(
+HloModule Module
+
+True {
+  ROOT x.0.1 = f32[] parameter(0)
+}
+
+False {
+  x.0.0 = f32[] parameter(0)
+  ROOT copy.1 = f32[] copy(x.0.0)
+}
+
+ENTRY main {
+  pred.1.0 = pred[] parameter(0)
+  constant.1.1 = f32[] constant(56)
+  copy.2 = f32[] copy(constant.1.1)
+  constant.1.2 = f32[] constant(12)
+  ROOT conditional.1.3 = f32[] conditional(pred.1.0, copy.2, constant.1.2),
+      true_computation=True, false_computation=False
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(hlo_text));
+
+  HloInstruction* constant_1 =
+      module->entry_computation()->GetInstructionWithName("constant.1.1");
+  HloInstruction* constant_2 =
+      module->entry_computation()->GetInstructionWithName("constant.1.2");
+
+  auto buffers = RunBufferAssignment(module.get());
+
+  {
+    const BufferAllocation& allocation_for_const_1 =
+        GetTopLevelAllocation(*buffers, constant_1);
+    EXPECT_TRUE(allocation_for_const_1.is_constant());
+    for (const auto& buffer_offset_pair :
+         allocation_for_const_1.assigned_buffers()) {
+      EXPECT_NE(buffer_offset_pair.first->instruction()->opcode(),
+                HloOpcode::kCopy);
+      EXPECT_NE(buffer_offset_pair.first->instruction()->opcode(),
+                HloOpcode::kConditional);
+    }
+  }
+
+  {
+    const BufferAllocation& allocation_for_const_2 =
+        GetTopLevelAllocation(*buffers, constant_2);
+    EXPECT_TRUE(allocation_for_const_2.is_constant());
+    for (const auto& buffer_offset_pair :
+         allocation_for_const_2.assigned_buffers()) {
+      EXPECT_NE(buffer_offset_pair.first->instruction()->opcode(),
+                HloOpcode::kCopy);
+      EXPECT_NE(buffer_offset_pair.first->instruction()->opcode(),
+                HloOpcode::kConditional);
+    }
+  }
+}
+
 class WhileBufferAssignmentTest : public HloTestBase {
  protected:
   std::unique_ptr<HloComputation> BuildWhileConditionComputation(
@@ -1646,9 +1740,9 @@ class WhileBufferAssignmentTest : public HloTestBase {
     builder.AddInstruction(
         HloInstruction::CreateParameter(0, loop_state_shape_, "loop_state"));
     auto zero = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int>(0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(0)));
     auto ten = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int>(10)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(10)));
     builder.AddInstruction(HloInstruction::CreateBinary(
         ShapeUtil::MakeShape(PRED, {}), HloOpcode::kLt, zero, ten));
     return builder.Build();
@@ -1677,7 +1771,9 @@ class WhileBufferAssignmentTest : public HloTestBase {
     return BufferAssigner::Run(
                module, xla::MakeUnique<SequentialHloOrdering>(module, sequence),
                ByteSizeOf,
-               [alignment](LogicalBuffer::Color) { return alignment; })
+               [alignment](LogicalBuffer::Color) { return alignment; },
+               /*allow_input_output_aliasing=*/false,
+               /*allocate_buffers_for_constants=*/true)
         .ConsumeValueOrDie();
   }
 
@@ -1707,7 +1803,7 @@ TEST_F(WhileBufferAssignmentTest, TwoForwardWhileLoops) {
       HloInstruction::CreateParameter(2, data_shape_, "weights1"));
 
   auto zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0)));
   auto output0 = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape_, zero, {1}));
   auto output1 = builder.AddInstruction(
@@ -1827,6 +1923,74 @@ ENTRY %test_module {
   EXPECT_NE(slice_param, slice_while1);
 }
 
+TEST_F(WhileBufferAssignmentTest, ColocatedBufferWithConstant) {
+  const Shape r0s32 = ShapeUtil::MakeShape(S32, {});
+
+  const char* module_str = R"(
+HloModule test_module
+
+%cond.v0 {
+  %param = s32[] parameter(0)
+  ROOT %constant = pred[] constant(true)
+}
+
+%cond.v1 {
+  %param.0 = s32[] parameter(0)
+  ROOT %constant.0 = pred[] constant(true)
+}
+
+%body.v0 {
+  ROOT %param.1 = s32[] parameter(0)
+}
+
+%body.v1 {
+  %param.2 = s32[] parameter(0)
+  ROOT add = s32[] add(%param.2, %param.2)
+}
+
+ENTRY %test_module {
+  %constant.42 = s32[] constant(42)
+  %while.0 = s32[] while(%constant.42), condition=%cond.v0, body=%body.v0
+  %mul = s32[] multiply(%while.0, %while.0)
+  %while.1 = s32[] while(%mul), condition=%cond.v1, body=%body.v1
+  ROOT %bcast = s32[1024,1024]{1,0} broadcast(s32[] %while.1), dimensions={}
+})";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(module_str));
+
+  // Run CopyInsertion and check if the graph constructed above doesn't need
+  // any copies inserted for BufferAssignment to run.
+  int64 instruction_count = module->instruction_count();
+  CopyInsertion copy_insertion;
+  ASSERT_IS_OK(copy_insertion.Run(module.get()).status());
+  ASSERT_EQ(instruction_count, module->instruction_count());
+
+  // Get the instructions in the module.
+  const HloInstruction* bcast = module->entry_computation()->root_instruction();
+  const HloInstruction* constant =
+      module->entry_computation()->GetInstructionWithName("constant.42");
+  ASSERT_EQ(bcast->opcode(), HloOpcode::kBroadcast);
+  const HloInstruction* while1 = bcast->operand(0);
+  ASSERT_EQ(while1->opcode(), HloOpcode::kWhile);
+  const HloInstruction* while0 = while1->operand(0)->operand(0);
+  ASSERT_EQ(while0->opcode(), HloOpcode::kWhile);
+
+  // Run buffer assignment.
+  auto assignment = RunBufferAssignment(module.get());
+  TF_ASSERT_OK_AND_ASSIGN(auto slice_constant,
+                          assignment->GetUniqueSlice(constant, {}));
+  TF_ASSERT_OK_AND_ASSIGN(auto slice_while0,
+                          assignment->GetUniqueSlice(while0, {}));
+  TF_ASSERT_OK_AND_ASSIGN(auto slice_while1,
+                          assignment->GetUniqueSlice(while1, {}));
+
+  // The constant slice is part of the while0's colocation set (init value), but
+  // not merged into the while1's colocation set.
+  EXPECT_EQ(slice_constant, slice_while0);
+  EXPECT_NE(slice_constant, slice_while1);
+}
+
 // Tests that the colocated buffers for while instructions are properly assigned
 // during buffer assignment such that the result tuple elements are not assigned
 // to the same buffer.
@@ -1850,7 +2014,7 @@ TEST_F(WhileBufferAssignmentTest, ColocatedBuffers) {
   auto build_cond = [&]() {
     auto builder = HloComputation::Builder("cond");
     auto const4 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int>(4)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(4)));
     auto param =
         builder.AddInstruction(HloInstruction::CreateParameter(0, r0s32, "x"));
     builder.AddInstruction(HloInstruction::CreateBinary(
@@ -1862,7 +2026,7 @@ TEST_F(WhileBufferAssignmentTest, ColocatedBuffers) {
   auto build_body = [&]() {
     auto builder = HloComputation::Builder("body");
     auto const9 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int>(9)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int>(9)));
     auto param =
         builder.AddInstruction(HloInstruction::CreateParameter(0, r0s32, "x"));
     builder.AddInstruction(
@@ -1874,7 +2038,7 @@ TEST_F(WhileBufferAssignmentTest, ColocatedBuffers) {
   auto module = CreateNewModule();
   auto builder = HloComputation::Builder("entry");
 
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto infeed =
       builder.AddInstruction(HloInstruction::CreateInfeed(r0s32, token, ""));
   auto infeed_data = builder.AddInstruction(
@@ -1890,7 +2054,7 @@ TEST_F(WhileBufferAssignmentTest, ColocatedBuffers) {
       HloInstruction::CreateWhile(r0s32, cond1, body1, while0));
 
   auto zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(0)));
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(r0s32, HloOpcode::kAdd, zero, zero));
   auto cond2 = module->AddEmbeddedComputation(build_cond());
@@ -1921,7 +2085,9 @@ TEST_F(WhileBufferAssignmentTest, ColocatedBuffers) {
           module.get(),
           xla::MakeUnique<SequentialHloOrdering>(module.get(), sequence),
           backend().compiler()->BufferSizeBytesFunction(),
-          [](LogicalBuffer::Color) { return 1; }));
+          [](LogicalBuffer::Color) { return 1; },
+          /*allow_input_output_aliasing=*/false,
+          /*allocate_buffers_for_constants=*/true));
 
   // The result tuple elements must be assigned with different buffers.
   TF_ASSERT_OK_AND_ASSIGN(auto slice0, assignment->GetUniqueSlice(tuple, {0}));
@@ -1952,7 +2118,7 @@ TEST_F(WhileBufferAssignmentTest, OneForwardBackwardWhileLoopSet) {
       HloInstruction::CreateParameter(1, data_shape_, "weights0"));
 
   auto zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0)));
   auto output0 = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape_, zero, {1}));
 
@@ -1996,16 +2162,16 @@ TEST_F(BufferAssignmentTest, TwoCalls) {
     auto param = builder.AddInstruction(
         HloInstruction::CreateParameter(0, r0f32, "param"));
     auto constant1 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     auto add = builder.AddInstruction(
         HloInstruction::CreateBinary(r0f32, HloOpcode::kAdd, param, constant1));
     sub_computation = module->AddEmbeddedComputation(builder.Build(add));
   }
   auto builder = HloComputation::Builder(TestName());
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto call1 = builder.AddInstruction(
       HloInstruction::CreateCall(r0f32, {constant2}, sub_computation));
   auto call2 = builder.AddInstruction(
@@ -2029,6 +2195,56 @@ TEST_F(BufferAssignmentTest, TwoCalls) {
   EXPECT_TRUE(BuffersDistinct({call1}, {call2}, *assignment));
 }
 
+TEST_F(BufferAssignmentTest, CallParamCoAllocation) {
+  const char* hlo_text = R"(
+HloModule CallParamCoAllocation
+
+Callee {
+  param0 = (f32[100],(f32[200],f32[300])) parameter(0)
+  param1 = s32[20] parameter(1)
+  ROOT constant = f32[] constant(1)
+}
+
+ENTRY Main {
+  entry_param0 = f32[100] parameter(0)
+  entry_param1 = s32[20]  parameter(1)
+  custom_call = (f32[200],f32[300]) custom-call(), custom_call_target="call-target"
+  call_op0 = (f32[100],(f32[200],f32[300])) tuple(entry_param0, custom_call)
+  ROOT call_result = f32[] call(call_op0, entry_param1), to_apply=Callee
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<HloModule> module,
+      HloRunner::CreateModuleFromString(
+          hlo_text, legacy_flags::GetDebugOptionsFromFlags()));
+
+  auto buffers = RunBufferAssignment(module.get());
+
+  HloComputation* main = module->entry_computation();
+  HloComputation* callee = module->GetComputationWithName("Callee");
+  EXPECT_NE(callee, nullptr);
+
+  HloInstruction* param0 = callee->parameter_instruction(0);
+  HloInstruction* param1 = callee->parameter_instruction(1);
+
+  HloInstruction* entry_param0 = main->parameter_instruction(0);
+  HloInstruction* entry_param1 = main->parameter_instruction(1);
+  HloInstruction* custom_call = main->GetInstructionWithName("custom_call");
+
+  EXPECT_EQ(GetAllocation(*buffers, entry_param0, {}),
+            GetAllocation(*buffers, param0, {0}));
+  EXPECT_EQ(GetAllocation(*buffers, entry_param1, {}),
+            GetAllocation(*buffers, param1, {}));
+
+  EXPECT_EQ(GetAllocation(*buffers, custom_call, {}),
+            GetAllocation(*buffers, param0, {1}));
+  EXPECT_EQ(GetAllocation(*buffers, custom_call, {0}),
+            GetAllocation(*buffers, param0, {1, 0}));
+  EXPECT_EQ(GetAllocation(*buffers, custom_call, {1}),
+            GetAllocation(*buffers, param0, {1, 1}));
+}
+
 static bool IsPostOrderTraversal(
     const std::vector<const HloInstruction*>& sequence) {
   tensorflow::gtl::FlatSet<const HloInstruction*> seen_so_far;
@@ -2057,9 +2273,9 @@ TEST_F(WhileBufferAssignmentTest, WhileLoopsInterferingResultRange) {
   auto builder = HloComputation::Builder(TestName());
 
   auto zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0)));
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
 
   auto input0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, data_shape_, "input0"));
@@ -2125,7 +2341,9 @@ TEST_F(WhileBufferAssignmentTest, WhileLoopsInterferingResultRange) {
       BufferAssigner::Run(
           module.get(),
           xla::MakeUnique<SequentialHloOrdering>(module.get(), sequence),
-          ByteSizeOf, [](LogicalBuffer::Color) { return 1; })
+          ByteSizeOf, [](LogicalBuffer::Color) { return 1; },
+          /*allow_input_output_aliasing=*/false,
+          /*allocate_buffers_for_constants=*/true)
           .ConsumeValueOrDie();
 
   EXPECT_TRUE(BuffersDistinct({while0}, {while1}, *assignment));
@@ -2141,7 +2359,7 @@ TEST_F(WhileBufferAssignmentTest, WhilesDontShareEntryParamIfLiveOut) {
       HloInstruction::CreateParameter(1, data_shape_, "weights0"));
 
   auto zero = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0)));
   auto output0 = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape_, zero, {1}));
   auto output1 = builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/buffer_liveness_test.cc b/tensorflow/compiler/xla/service/buffer_liveness_test.cc
index 7833ebe73b..4a927b5767 100644
--- a/tensorflow/compiler/xla/service/buffer_liveness_test.cc
+++ b/tensorflow/compiler/xla/service/buffer_liveness_test.cc
@@ -327,7 +327,7 @@ TEST_F(BufferLivenessTest, RootInstructionIsNotLastInSequentialOrder) {
       builder.AddInstruction(HloInstruction::CreateParameter(0, vec_, "param"));
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(vec_, HloOpcode::kAdd, param, param));
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto recv = builder.AddInstruction(
       HloInstruction::CreateRecv(vec_, token, /*channel_id=*/0));
   auto recv_done = builder.AddInstruction(HloInstruction::CreateRecvDone(recv));
@@ -439,11 +439,13 @@ TEST_F(BufferLivenessTest, TupleConstantLiveOut) {
   // computation. The buffer containing {0, 1} is copied by GetTupleElement, and
   // the buffers containing {3} and 3 are dead.
   auto builder = HloComputation::Builder(TestName());
-  auto inner_tuple0 = Literal::MakeTuple(
-      {Literal::CreateR0<int64>(0).get(), Literal::CreateR0<int64>(1).get()});
-  auto inner_tuple1 = Literal::MakeTuple({Literal::CreateR0<int64>(3).get()});
+  auto inner_tuple0 =
+      LiteralUtil::MakeTuple({LiteralUtil::CreateR0<int64>(0).get(),
+                              LiteralUtil::CreateR0<int64>(1).get()});
+  auto inner_tuple1 =
+      LiteralUtil::MakeTuple({LiteralUtil::CreateR0<int64>(3).get()});
   auto tuple_constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::MakeTuple({inner_tuple0.get(), inner_tuple1.get()})));
+      LiteralUtil::MakeTuple({inner_tuple0.get(), inner_tuple1.get()})));
   builder.AddInstruction(HloInstruction::CreateGetTupleElement(
       inner_tuple0->shape(), tuple_constant, 0));
 
@@ -491,7 +493,7 @@ TEST_F(BufferLivenessTest, IndependentTupleElements) {
       builder.AddInstruction(HloInstruction::CreateGetTupleElement(
           tuple_element0_shape, tuple_param0, 0));
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
+      LiteralUtil::CreateR1<float>({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
   auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
       tuple_element0_shape, HloOpcode::kAdd, tuple_element0, const0));
 
@@ -503,7 +505,7 @@ TEST_F(BufferLivenessTest, IndependentTupleElements) {
       builder.AddInstruction(HloInstruction::CreateGetTupleElement(
           tuple_element1_shape, tuple_param0, 1));
   auto const1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f})));
+      LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f})));
   auto add1 = builder.AddInstruction(HloInstruction::CreateBinary(
       tuple_element1_shape, HloOpcode::kAdd, tuple_element1, const1));
 
@@ -555,7 +557,7 @@ TEST_F(BufferLivenessTest, DependentTupleElements) {
       builder.AddInstruction(HloInstruction::CreateGetTupleElement(
           tuple_element0_shape, tuple_param0, 0));
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
+      LiteralUtil::CreateR1<float>({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
   auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
       tuple_element0_shape, HloOpcode::kAdd, tuple_element0, const0));
 
@@ -627,7 +629,7 @@ class FusedDynamicUpdateSliceLivenessTest : public BufferLivenessTest {
         HloInstruction::CreateGetTupleElement(data_shape, tuple_param0, 1));
 
     auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+        LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f})));
     HloInstruction* slice = nullptr;
     if (update_uses_tuple_element1) {
       // Create a slice instruction as an additional user of 'gte1'.
@@ -638,7 +640,7 @@ class FusedDynamicUpdateSliceLivenessTest : public BufferLivenessTest {
     }
     // Create a DynamicUpdateSlice instruction of tuple element 1 with 'update'.
     auto starts = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
     auto dynamic_update_slice =
         builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
             data_shape, gte1, update, starts));
@@ -757,7 +759,7 @@ class DynamicUpdateSliceLivenessTest : public BufferLivenessTest {
         HloInstruction::CreateGetTupleElement(data_shape, tuple_param0, 1));
 
     auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+        LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f})));
 
     if (tuple_element1_has_two_uses) {
       // Add 'gte0' and 'gte1' to create another user of 'gte1'.
@@ -766,7 +768,7 @@ class DynamicUpdateSliceLivenessTest : public BufferLivenessTest {
     }
     // Create a DynamicUpdateSlice instruction of tuple element 1 with 'update'.
     auto starts = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
     auto dynamic_update_slice =
         builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
             data_shape, gte1, update, starts));
diff --git a/tensorflow/compiler/xla/service/call_graph_test.cc b/tensorflow/compiler/xla/service/call_graph_test.cc
index 1ea7d538cd..cc80b74843 100644
--- a/tensorflow/compiler/xla/service/call_graph_test.cc
+++ b/tensorflow/compiler/xla/service/call_graph_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/call_graph.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -82,7 +82,7 @@ class CallGraphTest : public HloTestBase {
     HloInstruction* param0 = builder.AddInstruction(
         HloInstruction::CreateParameter(0, kScalarShape, "param0"));
     HloInstruction* zero = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
     builder.AddInstruction(HloInstruction::CreateBinary(
         ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt, param0, zero));
     return builder.Build();
@@ -247,11 +247,11 @@ TEST_F(CallGraphTest, ComputationWithConditional) {
 
   HloComputation::Builder builder(TestName());
   HloInstruction* pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloInstruction* const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(56.4f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(56.4f)));
   HloInstruction* const2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(12.6f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(12.6f)));
   HloInstruction* conditional =
       builder.AddInstruction(HloInstruction::CreateConditional(
           kScalarShape, pred, const1, true_computation, const2,
diff --git a/tensorflow/compiler/xla/service/call_inliner.cc b/tensorflow/compiler/xla/service/call_inliner.cc
index 482ccc5b67..256d05a73e 100644
--- a/tensorflow/compiler/xla/service/call_inliner.cc
+++ b/tensorflow/compiler/xla/service/call_inliner.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <deque>
 
 #include "tensorflow/compiler/xla/service/call_graph.h"
+#include "tensorflow/compiler/xla/service/hlo_dce.h"
 #include "tensorflow/core/lib/core/errors.h"
 
 namespace xla {
@@ -151,6 +152,14 @@ StatusOr<bool> CallInliner::Run(HloModule* module) {
         }
         return Status::OK();
       }));
+  if (did_mutate) {
+    // Run DCE to remove called computations which are now becoming unused.
+    // This can result then in problems if within the called computation, there
+    // were send/recv instructions, which the module group verifier will flag as
+    // error findingthe same channel ID used for multiple send/recv
+    // instructions.
+    TF_RETURN_IF_ERROR(HloDCE().Run(module).status());
+  }
   return did_mutate;
 }
 
diff --git a/tensorflow/compiler/xla/service/call_inliner_test.cc b/tensorflow/compiler/xla/service/call_inliner_test.cc
index 924348c870..ff968bca29 100644
--- a/tensorflow/compiler/xla/service/call_inliner_test.cc
+++ b/tensorflow/compiler/xla/service/call_inliner_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -48,9 +48,9 @@ TEST_F(CallInlinerTest, ControlDependenciesAreCarriedToCaller) {
   // the "one" value.
   HloComputation::Builder inner(TestName() + ".inner");
   HloInstruction* zero = inner.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(24.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(24.0f)));
   HloInstruction* one = inner.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   TF_ASSERT_OK(zero->AddControlDependencyTo(one));
   auto module = CreateNewModule();
   HloComputation* inner_computation =
@@ -87,7 +87,7 @@ TEST_F(CallInlinerTest, CallsWithinWhileBodiesAreInlined) {
   // little trickier.
   HloComputation::Builder just_false(TestName() + ".false");
   just_false.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* false_computation =
       module->AddEmbeddedComputation(just_false.Build());
 
@@ -99,7 +99,7 @@ TEST_F(CallInlinerTest, CallsWithinWhileBodiesAreInlined) {
 
   HloComputation::Builder outer(TestName() + ".outer");
   HloInstruction* init_value = outer.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   outer.AddInstruction(
       HloInstruction::CreateWhile(pred, call_false, call_false, init_value));
 
@@ -123,9 +123,9 @@ TEST_F(CallInlinerTest, InlineWithoutRunningPass) {
 
   HloComputation::Builder just_false(TestName() + ".false");
   auto* true_constant = just_false.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<bool>({true})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<bool>({true})));
   auto* false_constant = just_false.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   TF_ASSERT_OK(false_constant->AddControlDependencyTo(true_constant));
   HloComputation* false_computation =
       module->AddEmbeddedComputation(just_false.Build());
@@ -147,8 +147,8 @@ TEST_F(CallInlinerTest, CallToOutfeedComputationIsInlined) {
 
   HloComputation::Builder outfeeder(TestName() + ".outfeeder");
   auto value = outfeeder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
-  auto token = outfeeder.AddInstruction(HloInstruction::CreateAfterAll({}));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
+  auto token = outfeeder.AddInstruction(HloInstruction::CreateToken());
   outfeeder.AddInstruction(
       HloInstruction::CreateOutfeed(f32, value, token, /*outfeed_config=*/""));
 
diff --git a/tensorflow/compiler/xla/service/channel_tracker.cc b/tensorflow/compiler/xla/service/channel_tracker.cc
index a5b392cbc3..13008efed1 100644
--- a/tensorflow/compiler/xla/service/channel_tracker.cc
+++ b/tensorflow/compiler/xla/service/channel_tracker.cc
@@ -31,16 +31,23 @@ namespace xla {
 
 ChannelTracker::ChannelTracker() : next_channel_(1) {}
 
-ChannelHandle ChannelTracker::NewChannel() {
+StatusOr<ChannelHandle> ChannelTracker::NewChannel(
+    ChannelHandle::ChannelType type) {
+  if (type != ChannelHandle::DEVICE_TO_DEVICE &&
+      type != ChannelHandle::HOST_TO_DEVICE &&
+      type != ChannelHandle::DEVICE_TO_HOST) {
+    return InvalidArgument("Invalid channel type: %d", type);
+  }
   tensorflow::mutex_lock lock(channel_mutex_);
 
   // Create a new channel handle with a unique value.
-  const ChannelHandle new_handle = AllocateHandle();
+  ChannelHandle new_handle = AllocateHandle(type);
 
   // Register a channel object associated with the handle.
   Channel channel;
   channel.has_sender = false;
   channel.receiver_count = 0;
+  channel.type = type;
   opaque_to_channel_[new_handle.handle()] = channel;
 
   return new_handle;
@@ -56,10 +63,11 @@ Status ChannelTracker::RegisterRecv(const ChannelHandle& handle) {
   return RegisterRecvInternal(handle);
 }
 
-ChannelHandle ChannelTracker::AllocateHandle() {
+ChannelHandle ChannelTracker::AllocateHandle(ChannelHandle::ChannelType type) {
   int64 handle_value = next_channel_++;
   ChannelHandle result;
   result.set_handle(handle_value);
+  result.set_type(type);
   return result;
 }
 
@@ -68,6 +76,13 @@ Status ChannelTracker::RegisterSendInternal(const ChannelHandle& handle) {
     return NotFound("channel handle not found: %lld", handle.handle());
   }
   Channel& channel = opaque_to_channel_[handle.handle()];
+  if (channel.type == ChannelHandle::HOST_TO_DEVICE) {
+    return FailedPrecondition(
+        "host-to-device channels cannot be used with a Send operation; "
+        "channel handle: %lld",
+        handle.handle());
+  }
+
   if (channel.has_sender) {
     return FailedPrecondition(
         "when registering send, passed a channel handle that is already used "
@@ -83,6 +98,13 @@ Status ChannelTracker::RegisterRecvInternal(const ChannelHandle& handle) {
     return NotFound("channel handle not found: %lld", handle.handle());
   }
   Channel& channel = opaque_to_channel_[handle.handle()];
+  if (channel.type == ChannelHandle::DEVICE_TO_HOST) {
+    return FailedPrecondition(
+        "device-to-host channels cannot be used with a Recv operation; "
+        "channel handle: %lld",
+        handle.handle());
+  }
+
   // TODO(b/33942691): Allow more than 1 receivers for broadcast.
   if (channel.receiver_count >= 1) {
     return FailedPrecondition(
diff --git a/tensorflow/compiler/xla/service/channel_tracker.h b/tensorflow/compiler/xla/service/channel_tracker.h
index fac0afd672..d773558c28 100644
--- a/tensorflow/compiler/xla/service/channel_tracker.h
+++ b/tensorflow/compiler/xla/service/channel_tracker.h
@@ -48,11 +48,12 @@ class ChannelTracker {
   struct Channel {
     bool has_sender;
     int64 receiver_count;
+    ChannelHandle::ChannelType type;
   };
 
   // Creates a new Channel object and returns the corresponding
   // ChannelHandle for it.
-  ChannelHandle NewChannel();
+  StatusOr<ChannelHandle> NewChannel(ChannelHandle::ChannelType type);
 
   // Informs that the given channel handle is used for a Send operation.
   // Returns an error status if the handle is already used by another Send.
@@ -65,7 +66,8 @@ class ChannelTracker {
  private:
   // Bumps the next_channel_ number and returns the allocated number
   // wrapped in a ChannelHandle.
-  ChannelHandle AllocateHandle() EXCLUSIVE_LOCKS_REQUIRED(channel_mutex_);
+  ChannelHandle AllocateHandle(ChannelHandle::ChannelType type)
+      EXCLUSIVE_LOCKS_REQUIRED(channel_mutex_);
 
   Status RegisterSendInternal(const ChannelHandle& handle)
       EXCLUSIVE_LOCKS_REQUIRED(channel_mutex_);
diff --git a/tensorflow/compiler/xla/service/computation_placer.cc b/tensorflow/compiler/xla/service/computation_placer.cc
index 7c1bacff92..187ce568cb 100644
--- a/tensorflow/compiler/xla/service/computation_placer.cc
+++ b/tensorflow/compiler/xla/service/computation_placer.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status.h"
@@ -29,9 +29,13 @@ limitations under the License.
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 
+using tensorflow::strings::StrAppend;
+using tensorflow::strings::StrCat;
+
 namespace xla {
 
 Status DeviceAssignment::Serialize(DeviceAssignmentProto* proto) const {
@@ -71,6 +75,19 @@ DeviceAssignment::Deserialize(const DeviceAssignmentProto& proto) {
   return std::move(assignment);
 }
 
+string DeviceAssignment::ToString() const {
+  string output = StrCat("Computations: ", computation_count(),
+                         " Replicas: ", replica_count(), "\n");
+  for (int computation = 0; computation < computation_count(); ++computation) {
+    StrAppend(&output, "Computation ", computation, ": ");
+    for (int replica = 0; replica < replica_count(); ++replica) {
+      StrAppend(&output, operator()(replica, computation), " ");
+    }
+    StrAppend(&output, "\n");
+  }
+  return output;
+}
+
 StatusOr<int> ComputationPlacer::DeviceId(int replica, int computation,
                                           int replica_count,
                                           int computation_count) {
diff --git a/tensorflow/compiler/xla/service/computation_placer.h b/tensorflow/compiler/xla/service/computation_placer.h
index 737d00e93e..c899ffb9dc 100644
--- a/tensorflow/compiler/xla/service/computation_placer.h
+++ b/tensorflow/compiler/xla/service/computation_placer.h
@@ -55,6 +55,8 @@ class DeviceAssignment : public Array2D<int> {
   // due to a StatusOr of an incomplete type (DeviceAssignment).
   static StatusOr<std::unique_ptr<DeviceAssignment>> Deserialize(
       const DeviceAssignmentProto& proto);
+
+  string ToString() const;
 };
 
 // A generic implementation of the XLA computation placer, which assigns device
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier.cc b/tensorflow/compiler/xla/service/conditional_simplifier.cc
index e9ec796121..b7be3ba605 100644
--- a/tensorflow/compiler/xla/service/conditional_simplifier.cc
+++ b/tensorflow/compiler/xla/service/conditional_simplifier.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/call_inliner.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier_test.cc b/tensorflow/compiler/xla/service/conditional_simplifier_test.cc
index 68f6ffc6b7..c43a31b167 100644
--- a/tensorflow/compiler/xla/service/conditional_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/conditional_simplifier_test.cc
@@ -55,7 +55,7 @@ HloComputation* ConditionalSimplifierTest::MakeConditional(HloModule* module) {
         true_computation_builder.AddInstruction(HloInstruction::CreateParameter(
             0, ShapeUtil::MakeShape(S32, {}), "param"));
     auto one = true_computation_builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
 
     true_computation_builder.AddInstruction(HloInstruction::CreateBinary(
         ShapeUtil::MakeShape(S32, {}), HloOpcode::kAdd, param, one));
@@ -73,7 +73,7 @@ HloComputation* ConditionalSimplifierTest::MakeConditional(HloModule* module) {
         HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(S32, {}),
                                         "param"));
     auto forty_two = false_computation_builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(42)));
 
     false_computation_builder.AddInstruction(HloInstruction::CreateBinary(
         ShapeUtil::MakeShape(S32, {}), HloOpcode::kAdd, param, forty_two));
@@ -82,11 +82,11 @@ HloComputation* ConditionalSimplifierTest::MakeConditional(HloModule* module) {
   }
 
   auto false_instrn = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto false_param = builder.AddInstruction(HloInstruction::CreateParameter(
       0, ShapeUtil::MakeShape(S32, {}), "false_param"));
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
 
   builder.AddInstruction(HloInstruction::CreateConditional(
       ShapeUtil::MakeShape(S32, {}), false_instrn, one, true_computation,
@@ -106,7 +106,7 @@ TEST_F(ConditionalSimplifierTest, ConditionalWithControlDependency) {
   HloComputation* computation = MakeConditional(&module());
 
   auto* true_op = computation->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   TF_ASSERT_OK(
       true_op->AddControlDependencyTo(computation->root_instruction()));
 
@@ -119,11 +119,10 @@ TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsSend) {
   ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
 
   auto* true_computation = conditional->true_computation();
-  auto* token =
-      true_computation->AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto* token = true_computation->AddInstruction(HloInstruction::CreateToken());
   auto* send = true_computation->AddInstruction(HloInstruction::CreateSend(
       true_computation->AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<bool>(true))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true))),
       token, /*channel_id=*/0));
   true_computation->AddInstruction(HloInstruction::CreateSendDone(send));
   EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
@@ -135,8 +134,7 @@ TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsRecv) {
   ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
 
   auto* true_computation = conditional->true_computation();
-  auto* token =
-      true_computation->AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto* token = true_computation->AddInstruction(HloInstruction::CreateToken());
   auto* recv = true_computation->AddInstruction(HloInstruction::CreateRecv(
       ShapeUtil::MakeShape(F32, {1}), token, /*channel_id=*/0));
   true_computation->AddInstruction(HloInstruction::CreateRecvDone(recv));
@@ -148,8 +146,7 @@ TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsNonRemovableInstruction) {
   auto* conditional = computation->root_instruction();
   ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
   auto* false_computation = conditional->false_computation();
-  auto token =
-      false_computation->AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = false_computation->AddInstruction(HloInstruction::CreateToken());
   false_computation->AddInstruction(HloInstruction::CreateInfeed(
       ShapeUtil::MakeShape(F32, {1}), token, "config"));
   EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
diff --git a/tensorflow/compiler/xla/service/copy_insertion.cc b/tensorflow/compiler/xla/service/copy_insertion.cc
index ab3d846403..36fb9b43aa 100644
--- a/tensorflow/compiler/xla/service/copy_insertion.cc
+++ b/tensorflow/compiler/xla/service/copy_insertion.cc
@@ -76,15 +76,6 @@ SpecialCaseCopyPolicy GetSpecialCaseCopyPolicy(const CallGraphNode& node,
     policy.copy_parameters_and_constants = true;
     policy.copy_root_replicated_buffers = true;
   }
-  for (const CallSite& site : node.caller_callsites()) {
-    // The AddCopiesForConditional() already adds copies, but the copy remover
-    // removes them, so we re-add them by returning the policy here. But really
-    // the copy remover should not be removing them.
-    if (site.instruction()->opcode() == HloOpcode::kConditional) {
-      policy.copy_parameters_and_constants = true;
-      policy.copy_root_replicated_buffers = true;
-    }
-  }
   return policy;
 }
 
@@ -360,26 +351,6 @@ Status StripControlDependenciesFrom(HloInstruction* instruction) {
   return Status::OK();
 }
 
-// Add kCopy instructions to the given module to guarantee there is no
-// live-range interference. Generally interference can only occur around kWhile
-// instructions which have update-in-place semantics.
-Status AddCopiesToResolveInterference(HloModule* module) {
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloAliasAnalysis> alias_analysis,
-                      HloAliasAnalysis::Run(module));
-
-  for (HloComputation* computation : module->computations()) {
-    for (HloInstruction* instruction : computation->instructions()) {
-      if (instruction->opcode() == HloOpcode::kWhile) {
-        TF_RETURN_IF_ERROR(AddCopiesForWhile(*alias_analysis, instruction));
-      } else if (instruction->opcode() == HloOpcode::kConditional) {
-        TF_RETURN_IF_ERROR(
-            AddCopiesForConditional(*alias_analysis, instruction));
-      }
-    }
-  }
-  return Status::OK();
-}
-
 // Class for removing unnecessary copies from the module.
 //
 // kCopy instructions are added conservatively to guarantee no live range
@@ -954,6 +925,36 @@ class CopyRemover {
   BufferValueTracker buffer_value_tracker_;
 };
 
+void MaybeDumpModule(const string& message, const HloModule& module) {
+  if (VLOG_IS_ON(3)) {
+    VLOG(3) << message;
+    XLA_VLOG_LINES(3, module.ToString());
+    hlo_graph_dumper::MaybeDumpHloModule(module, message);
+  }
+}
+
+}  // namespace
+
+// Add kCopy instructions to the given module to guarantee there is no
+// live-range interference. Generally interference can only occur around kWhile
+// instructions which have update-in-place semantics.
+Status CopyInsertion::AddCopiesToResolveInterference(HloModule* module) {
+  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloAliasAnalysis> alias_analysis,
+                      HloAliasAnalysis::Run(module, fusion_can_share_buffer_));
+
+  for (HloComputation* computation : module->computations()) {
+    for (HloInstruction* instruction : computation->instructions()) {
+      if (instruction->opcode() == HloOpcode::kWhile) {
+        TF_RETURN_IF_ERROR(AddCopiesForWhile(*alias_analysis, instruction));
+      } else if (instruction->opcode() == HloOpcode::kConditional) {
+        TF_RETURN_IF_ERROR(
+            AddCopiesForConditional(*alias_analysis, instruction));
+      }
+    }
+  }
+  return Status::OK();
+}
+
 // Add copies to address special constraints on the roots of computations not
 // related to live range interference:
 //
@@ -964,9 +965,10 @@ class CopyRemover {
 //
 //    (3) Constants and parameters cannot be live out of the entry computation
 //
-Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
+Status CopyInsertion::AddSpecialCaseCopies(const CallGraph& call_graph,
+                                           HloModule* module) {
   TF_ASSIGN_OR_RETURN(std::unique_ptr<HloAliasAnalysis> alias_analysis,
-                      HloAliasAnalysis::Run(module));
+                      HloAliasAnalysis::Run(module, fusion_can_share_buffer_));
 
   // Identify which shape indices of which instructions need to be copied. Store
   // these results in 'instructions_to_copy'.
@@ -1074,32 +1076,20 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
   return Status::OK();
 }
 
-Status VerifyNoLiveRangeInterference(HloModule* module) {
+Status CopyInsertion::VerifyNoLiveRangeInterference(HloModule* module) {
   TF_ASSIGN_OR_RETURN(std::unique_ptr<HloAliasAnalysis> alias_analysis,
-                      HloAliasAnalysis::Run(module));
+                      HloAliasAnalysis::Run(module, fusion_can_share_buffer_));
   DependencyHloOrdering ordering(module);
   TF_RET_CHECK(!alias_analysis->HasLiveRangeInterference(ordering));
   return Status::OK();
 }
 
-void MaybeDumpModule(const string& message, const HloModule& module) {
-  if (VLOG_IS_ON(3)) {
-    VLOG(3) << message;
-    XLA_VLOG_LINES(3, module.ToString());
-    hlo_graph_dumper::MaybeDumpHloModule(module, message);
-  }
-}
-
-}  // namespace
-
-Status RemoveUnnecessaryCopies(
-    const HloOrdering& ordering, HloModule* module,
-    const HloDataflowAnalysis::FusionCanShareBufferFunction&
-        fusion_can_share_buffer) {
+Status CopyInsertion::RemoveUnnecessaryCopies(const HloOrdering& ordering,
+                                              HloModule* module) {
   MaybeDumpModule("after adding copies to resolve interference", *module);
 
   TF_ASSIGN_OR_RETURN(std::unique_ptr<HloAliasAnalysis> alias_analysis,
-                      HloAliasAnalysis::Run(module, fusion_can_share_buffer));
+                      HloAliasAnalysis::Run(module, fusion_can_share_buffer_));
   CopyRemover copy_remover(*alias_analysis, ordering, module);
   XLA_VLOG_LINES(3, copy_remover.ToString());
 
diff --git a/tensorflow/compiler/xla/service/copy_insertion.h b/tensorflow/compiler/xla/service/copy_insertion.h
index e1973db928..5ba64b78a3 100644
--- a/tensorflow/compiler/xla/service/copy_insertion.h
+++ b/tensorflow/compiler/xla/service/copy_insertion.h
@@ -71,20 +71,26 @@ class CopyInsertion : public HloPassInterface {
   // TODO(b/62548313): Remove this when buffer assignment is module-scoped.
   static StatusOr<bool> AddCopiesForBufferAssignment(HloModule* module);
 
+  // Try to remove as many copies from the module as possible without
+  // introducing live range interference. Only copy instructions that are
+  // eligible for copy elision are considered for removal.
+  Status RemoveUnnecessaryCopies(const HloOrdering& ordering,
+                                 HloModule* module);
+
  private:
+  // Verifies that no HLO values have interfering live ranged assuming the
+  // ordering used by copy insertion.
+  Status VerifyNoLiveRangeInterference(HloModule* module);
+
+  Status AddCopiesToResolveInterference(HloModule* module);
+
+  Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module);
+
   // Backend specific function that decides whether a fusion can share buffer
   // with its operand.
   HloDataflowAnalysis::FusionCanShareBufferFunction fusion_can_share_buffer_;
 };
 
-// Try to remove as many copies from the module as possible without introducing
-// live range interference. Only copy instructions that are eligible for
-// copy elision are considered for removal.
-Status RemoveUnnecessaryCopies(
-    const HloOrdering& ordering, HloModule* module,
-    const HloDataflowAnalysis::FusionCanShareBufferFunction&
-        fusion_can_share_buffer = nullptr);
-
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_SERVICE_COPY_INSERTION_H_
diff --git a/tensorflow/compiler/xla/service/copy_insertion_test.cc b/tensorflow/compiler/xla/service/copy_insertion_test.cc
index 7ae8799b61..cd735256b8 100644
--- a/tensorflow/compiler/xla/service/copy_insertion_test.cc
+++ b/tensorflow/compiler/xla/service/copy_insertion_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <set>
 
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
@@ -108,7 +108,7 @@ TEST_F(CopyInsertionTest, SingleConstant) {
   // be copied before entering the tuple.
   auto builder = HloComputation::Builder(TestName());
   HloInstruction* constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   HloInstruction* tuple =
       builder.AddInstruction(HloInstruction::CreateTuple({constant}));
 
@@ -132,7 +132,7 @@ TEST_F(CopyInsertionTest, ExistingCopiesNotRemoved) {
   auto builder = HloComputation::Builder(TestName());
   HloInstruction* constant =
       builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR2<float>({{0.f, 2.f}, {2.f, 4.f}})));
+          LiteralUtil::CreateR2<float>({{0.f, 2.f}, {2.f, 4.f}})));
   auto minor_to_major = LayoutUtil::MinorToMajor(constant->shape());
   Layout reversed_layout =
       LayoutUtil::MakeLayoutFromMajorToMinor(minor_to_major);
@@ -167,9 +167,9 @@ TEST_F(CopyInsertionTest, MultipleConstantsAndParameters) {
   auto builder = HloComputation::Builder(TestName());
 
   HloInstruction* constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   HloInstruction* constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
 
   HloInstruction* x = builder.AddInstruction(
       HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(F32, {}), "x"));
@@ -197,11 +197,11 @@ TEST_F(CopyInsertionTest, AmbiguousPointsToSet) {
   // the computation result. Verify that copies are added properly.
   auto builder = HloComputation::Builder(TestName());
   HloInstruction* constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   HloInstruction* constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   HloInstruction* constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
 
   HloInstruction* tuple1 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
@@ -209,9 +209,9 @@ TEST_F(CopyInsertionTest, AmbiguousPointsToSet) {
       HloInstruction::CreateTuple({constant3, constant2}));
 
   HloInstruction* pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple1->shape(), HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple1->shape(), HloOpcode::kTupleSelect, pred, tuple1, tuple2));
 
   EXPECT_THAT(constant1->users(), UnorderedElementsAre(tuple1));
   EXPECT_THAT(constant2->users(), UnorderedElementsAre(tuple1, tuple2));
@@ -255,8 +255,9 @@ TEST_F(CopyInsertionTest, BitcastConstant) {
   // The output of a bitcast is its operand (same buffer), so a bitcast
   // constant feeding the result must have a copy added.
   auto builder = HloComputation::Builder(TestName());
-  HloInstruction* constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({1.0, 42.0})));
+  HloInstruction* constant =
+      builder.AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR1<float>({1.0, 42.0})));
   HloInstruction* bitcast = builder.AddInstruction(HloInstruction::CreateUnary(
       ShapeUtil::MakeShape(F32, {2, 2}), HloOpcode::kBitcast, constant));
 
@@ -370,9 +371,9 @@ TEST_F(CopyInsertionTest, AmbiguousTopLevelRoot) {
   // copy is added.
   auto builder = HloComputation::Builder(TestName());
   HloInstruction* constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   HloInstruction* constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
 
   HloInstruction* tuple1 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
@@ -380,9 +381,9 @@ TEST_F(CopyInsertionTest, AmbiguousTopLevelRoot) {
       HloInstruction::CreateTuple({constant2, constant1}));
 
   HloInstruction* pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloInstruction* select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple1->shape(), HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple1->shape(), HloOpcode::kTupleSelect, pred, tuple1, tuple2));
   HloInstruction* gte =
       builder.AddInstruction(HloInstruction::CreateGetTupleElement(
           ShapeUtil::GetSubshape(select->shape(), {0}), select, 0));
@@ -413,7 +414,7 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
       const Shape& loop_state_shape) {
     auto builder = HloComputation::Builder(TestName() + ".Condition");
     auto limit_const = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(10)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(10)));
     auto loop_state = builder.AddInstruction(
         HloInstruction::CreateParameter(0, loop_state_shape, "loop_state"));
     auto induction_variable =
@@ -442,7 +443,7 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
         builder.AddInstruction(HloInstruction::CreateGetTupleElement(
             induction_variable_shape_, loop_state, 0));
     auto inc = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
     auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
         induction_variable->shape(), HloOpcode::kAdd, induction_variable, inc));
     // Update data GTE(1).
@@ -480,7 +481,7 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
         builder.AddInstruction(HloInstruction::CreateGetTupleElement(
             induction_variable_shape_, loop_state, 0));
     auto inc = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
 
     // add0 = Add(in0, 1)
     auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
@@ -549,7 +550,7 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
         builder.AddInstruction(HloInstruction::CreateGetTupleElement(
             induction_variable_shape_, loop_state, 0));
     auto inc = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
     // add0 = Add(in0, 1)
     auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
         induction_variable->shape(), HloOpcode::kAdd, induction_variable, inc));
@@ -564,8 +565,9 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
       data = builder.AddInstruction(
           HloInstruction::CreateGetTupleElement(data_shape_, loop_state, 1));
     }
-    auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
+    auto update = builder.AddInstruction(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(
+            {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
     // add1 = Add(in1, {1, 1, 1, 1, 1, 1, 1, 1})
     auto add1 = builder.AddInstruction(HloInstruction::CreateBinary(
         data_shape_, HloOpcode::kAdd, data, update));
@@ -598,7 +600,7 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
     auto gte0 = builder.AddInstruction(HloInstruction::CreateGetTupleElement(
         induction_variable_shape_, loop_state, 0));
     auto inc = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
     auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
         gte0->shape(), HloOpcode::kAdd, gte0, inc));
 
@@ -608,8 +610,9 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
     // GTE(GTE(loop_state, 1), 0) -> Add
     auto gte10 = builder.AddInstruction(
         HloInstruction::CreateGetTupleElement(data_shape_, gte1, 0));
-    auto update10 = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
+    auto update10 = builder.AddInstruction(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(
+            {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
     auto add10 = builder.AddInstruction(HloInstruction::CreateBinary(
         data_shape_, HloOpcode::kAdd, gte10, update10));
 
@@ -633,10 +636,11 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
                                         bool nested = false) {
     auto builder = HloComputation::Builder(TestName() + ".While");
     auto induction_var_init = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(0)));
 
-    auto data_init = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f})));
+    auto data_init = builder.AddInstruction(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(
+            {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f})));
 
     if (nested) {
       auto inner_init = builder.AddInstruction(
@@ -659,8 +663,9 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
 
   HloInstruction* BuildWhileInstruction_InitPointsToConstant() {
     auto builder = HloComputation::Builder(TestName() + ".While");
-    auto data_init = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f})));
+    auto data_init = builder.AddInstruction(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(
+            {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f})));
     return BuildWhileInstructionWithCustomInit(loop_state_shape_, data_init,
                                                &builder);
   }
@@ -677,11 +682,11 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
     auto builder = HloComputation::Builder(TestName() + ".While");
 
     auto one = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     auto v1 = builder.AddInstruction(
         HloInstruction::CreateBroadcast(data_shape_, one, {1}));
     auto zero = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     auto v2 = builder.AddInstruction(
         HloInstruction::CreateBroadcast(data_shape_, zero, {1}));
 
@@ -689,9 +694,9 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
     auto tuple2 = builder.AddInstruction(HloInstruction::CreateTuple({v2, v1}));
 
     auto pred = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
     auto data_init = builder.AddInstruction(HloInstruction::CreateTernary(
-        nested_tuple_shape_, HloOpcode::kSelect, pred, tuple1, tuple2));
+        nested_tuple_shape_, HloOpcode::kTupleSelect, pred, tuple1, tuple2));
 
     return BuildWhileInstructionWithCustomInit(nested_loop_state_shape_,
                                                data_init, &builder);
@@ -701,7 +706,7 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
     auto builder = HloComputation::Builder(TestName() + ".While");
 
     auto one = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     auto one_vec = builder.AddInstruction(
         HloInstruction::CreateBroadcast(data_shape_, one, {1}));
     auto data_init =
@@ -714,11 +719,12 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
   HloInstruction* BuildWhileInstruction_InitPointsToInterfering() {
     auto builder = HloComputation::Builder(TestName() + ".While");
     auto one = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     auto data_init = builder.AddInstruction(
         HloInstruction::CreateBroadcast(data_shape_, one, {1}));
-    auto one_vec = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
+    auto one_vec = builder.AddInstruction(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(
+            {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f})));
     // Take a reference to 'data_init' to make it interfere with while result.
     auto add = builder.AddInstruction(HloInstruction::CreateBinary(
         data_shape_, HloOpcode::kAdd, data_init, one_vec));
@@ -750,7 +756,7 @@ class WhileCopyInsertionTest : public CopyInsertionTest {
     const bool nested =
         ShapeUtil::Equal(loop_state_shape, nested_loop_state_shape_);
     auto induction_var_init = builder->AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(0)));
     auto condition = module_->AddEmbeddedComputation(
         BuildConditionComputation(loop_state_shape));
     auto body = module_->AddEmbeddedComputation(
@@ -1252,7 +1258,6 @@ TEST_F(WhileCopyInsertionTest, InitPointsToNonDistinctUsedByTwoWhileLoops) {
   auto loop_init = builder.AddInstruction(
       HloInstruction::CreateTuple({iter_param, data_param, data_param}));
 
-
   // Two while loops shares the same loop init tuple.
   auto while_hlo1 = builder.AddInstruction(HloInstruction::CreateWhile(
       loop_state_shape, condition1, body1, loop_init));
@@ -1310,7 +1315,7 @@ TEST_F(CopyInsertionTest, SwizzlingWhile) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, loop_state_shape, "param"));
   auto cond_constant = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   cond_builder.AddInstruction(HloInstruction::CreateUnary(
       cond_constant->shape(), HloOpcode::kNot, cond_constant));
   HloComputation* condition =
@@ -1318,9 +1323,9 @@ TEST_F(CopyInsertionTest, SwizzlingWhile) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto xla_while = builder.AddInstruction(
@@ -1375,7 +1380,7 @@ TEST_F(CopyInsertionTest, SwizzlingWhileWithOneOp) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, loop_state_shape, "param"));
   auto cond_constant = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   cond_builder.AddInstruction(HloInstruction::CreateUnary(
       cond_constant->shape(), HloOpcode::kNot, cond_constant));
   HloComputation* condition =
@@ -1383,9 +1388,9 @@ TEST_F(CopyInsertionTest, SwizzlingWhileWithOneOp) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto xla_while = builder.AddInstruction(
@@ -1435,7 +1440,7 @@ TEST_F(CopyInsertionTest, SwizzlingWhileSharedInput) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, loop_state_shape, "param"));
   auto cond_constant = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   cond_builder.AddInstruction(HloInstruction::CreateUnary(
       cond_constant->shape(), HloOpcode::kNot, cond_constant));
   HloComputation* condition =
@@ -1443,7 +1448,7 @@ TEST_F(CopyInsertionTest, SwizzlingWhileSharedInput) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto tuple =
       builder.AddInstruction(HloInstruction::CreateTuple({constant, constant}));
   builder.AddInstruction(
@@ -1520,7 +1525,7 @@ TEST_F(CopyInsertionTest, SequentialWhiles) {
     cond_builder.AddInstruction(
         HloInstruction::CreateParameter(0, loop_state_shape, "param"));
     auto cond_constant = cond_builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
     cond_builder.AddInstruction(HloInstruction::CreateUnary(
         cond_constant->shape(), HloOpcode::kNot, cond_constant));
     HloComputation* condition =
@@ -1575,14 +1580,14 @@ TEST_F(CopyInsertionTest, WhileBodyWithConstantRoot) {
   body_builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "param"));
   body_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(123.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(123.0)));
   HloComputation* body = module->AddEmbeddedComputation(body_builder.Build());
 
   auto cond_builder = HloComputation::Builder("condition");
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module->AddEmbeddedComputation(cond_builder.Build());
 
@@ -1644,7 +1649,7 @@ std::unique_ptr<HloComputation> MakeTrivialCondition(const Shape& shape) {
   builder.AddInstruction(
       HloInstruction::CreateParameter(0, shape, "loop_state"));
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   builder.AddInstruction(HloInstruction::CreateUnary(
       constant->shape(), HloOpcode::kNot, constant));
   return builder.Build();
diff --git a/tensorflow/compiler/xla/service/cpu/BUILD b/tensorflow/compiler/xla/service/cpu/BUILD
index 3479240610..504b61d134 100644
--- a/tensorflow/compiler/xla/service/cpu/BUILD
+++ b/tensorflow/compiler/xla/service/cpu/BUILD
@@ -37,6 +37,7 @@ cc_library(
     srcs = ["cpu_transfer_manager.cc"],
     hdrs = ["cpu_transfer_manager.h"],
     deps = [
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -72,7 +73,7 @@ cc_library(
         ":ir_emitter",
         ":parallel_task_assignment",
         ":simple_orc_jit",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:protobuf_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -89,7 +90,6 @@ cc_library(
         "//tensorflow/compiler/xla/service:dot_decomposer",
         "//tensorflow/compiler/xla/service:executable",
         "//tensorflow/compiler/xla/service:flatten_call_graph",
-        "//tensorflow/compiler/xla/service:gather_expander",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_constant_folding",
         "//tensorflow/compiler/xla/service:hlo_cse",
@@ -252,12 +252,13 @@ cc_library(
         "//tensorflow/compiler/xla/service:hlo_module_config",
         "//tensorflow/compiler/xla/service:name_uniquer",
         "//tensorflow/compiler/xla/service/llvm_ir:alias_analysis",
+        "//tensorflow/compiler/xla/service/llvm_ir:buffer_assignment_util",
+        "//tensorflow/compiler/xla/service/llvm_ir:dynamic_update_slice_util",
         "//tensorflow/compiler/xla/service/llvm_ir:fused_ir_emitter",
         "//tensorflow/compiler/xla/service/llvm_ir:ir_array",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_loop",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/compiler/xla/service/llvm_ir:loop_emitter",
-        "//tensorflow/compiler/xla/service/llvm_ir:ops",
         "//tensorflow/compiler/xla/service/llvm_ir:tuple_ops",
         "//tensorflow/core:lib",
         "@llvm//:code_gen",
@@ -355,7 +356,7 @@ tf_cc_binary(
     srcs = ["sample_harness.cc"],
     deps = [
         "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -363,8 +364,8 @@ tf_cc_binary(
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/core:lib",
     ],
 )
@@ -444,6 +445,7 @@ cc_library(
     deps = [
         ":vector_support_library",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
+        "//tensorflow/compiler/xla/service/llvm_ir:math_ops",
         "//tensorflow/core:lib",
         "@llvm//:core",
         "@llvm//:transform_utils",
@@ -717,7 +719,7 @@ tf_cc_test(
     deps = [
         ":cpu_layout_assignment",
         ":target_machine_features_fake",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_layout",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -809,7 +811,7 @@ tf_cc_test(
         ":cpu_executable",
         ":parallel_task_assignment",
         ":target_machine_features_fake",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_layout",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
@@ -892,7 +894,7 @@ tf_cc_test(
     srcs = ["cpu_copy_insertion_test.cc"],
     deps = [
         ":cpu_copy_insertion",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
diff --git a/tensorflow/compiler/xla/service/cpu/compiler_functor.cc b/tensorflow/compiler/xla/service/cpu/compiler_functor.cc
index 6a7eb85e3b..128eea4828 100644
--- a/tensorflow/compiler/xla/service/cpu/compiler_functor.cc
+++ b/tensorflow/compiler/xla/service/cpu/compiler_functor.cc
@@ -156,9 +156,26 @@ std::unique_ptr<llvm::MemoryBuffer> CompilerFunctor::operator()(
   target_machine_->addPassesToEmitMC(codegen_passes, mc_context, ostream);
   codegen_passes.run(module);
 
-  // Construct ObjectFile from machine code buffer.
-  return std::unique_ptr<llvm::MemoryBuffer>(
+  std::unique_ptr<llvm::MemoryBuffer> memory_buffer(
       new llvm::SmallVectorMemoryBuffer(std::move(stream_buffer)));
+
+  if (VLOG_IS_ON(2)) {
+    llvm::Expected<std::unique_ptr<llvm::object::ObjectFile>> obj_file =
+        llvm::object::ObjectFile::createObjectFile(*memory_buffer);
+    if (obj_file) {
+      StatusOr<DisassemblerResult> disasm_result =
+          disassembler_->DisassembleObjectFile(*obj_file.get());
+      if (disasm_result.ok()) {
+        XLA_VLOG_LINES(2, disasm_result.ValueOrDie().text);
+      } else {
+        LOG(WARNING) << "Could not disassemble object file!";
+      }
+    } else {
+      LOG(WARNING) << "Could convert memory buffer to object file!";
+    }
+  }
+
+  return memory_buffer;
 }
 
 static std::vector<llvm::VecDesc> VectorFunctionsForTargetLibraryInfoImpl() {
diff --git a/tensorflow/compiler/xla/service/cpu/conv_canonicalization_test.cc b/tensorflow/compiler/xla/service/cpu/conv_canonicalization_test.cc
index 375b017b09..547d4c696d 100644
--- a/tensorflow/compiler/xla/service/cpu/conv_canonicalization_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/conv_canonicalization_test.cc
@@ -60,11 +60,11 @@ TEST_F(ConvCanonicalizationTest, NonCanonicalToCanonical) {
   auto builder = HloComputation::Builder(TestName());
   // The input dimensions are in CNHW order.
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR4FromArray4D(Array4D<float>(
+      LiteralUtil::CreateR4FromArray4D(Array4D<float>(
           kInputFeatureCount, kBatchSize, kInputSize, kInputSize))));
   // The kernel dimensions are in OIHW order.
   auto kernel = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR4FromArray4D(Array4D<float>(
+      LiteralUtil::CreateR4FromArray4D(Array4D<float>(
           kOutputFeatureCount, kInputFeatureCount, kWindowSize, kWindowSize))));
 
   ConvolutionDimensionNumbers dnums;
@@ -122,11 +122,11 @@ TEST_F(ConvCanonicalizationTest, CanonicalStaysTheSame) {
   auto builder = HloComputation::Builder(TestName());
   // The input dimensions are in NHWC order.
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR4FromArray4D(Array4D<float>(
+      LiteralUtil::CreateR4FromArray4D(Array4D<float>(
           kBatchSize, kInputSize, kInputSize, kInputFeatureCount))));
   // The kernel dimensions are in HWIO order.
   auto kernel = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR4FromArray4D(Array4D<float>(
+      LiteralUtil::CreateR4FromArray4D(Array4D<float>(
           kWindowSize, kWindowSize, kInputFeatureCount, kOutputFeatureCount))));
 
   ConvolutionDimensionNumbers dnums;
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
index 55962ba70d..8cbe9a1b0d 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
@@ -30,6 +30,7 @@ limitations under the License.
 #include "llvm/ADT/Triple.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/LLVMContext.h"
+#include "llvm/IR/Mangler.h"
 #include "llvm/IR/Module.h"
 #include "llvm/IR/Verifier.h"
 #include "llvm/Object/ObjectFile.h"
@@ -38,7 +39,7 @@ limitations under the License.
 #include "llvm/Support/TargetSelect.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Target/TargetOptions.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/protobuf_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
@@ -66,7 +67,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/dot_decomposer.h"
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
-#include "tensorflow/compiler/xla/service/gather_expander.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_constant_folding.h"
@@ -297,8 +297,6 @@ Status CpuCompiler::RunHloPasses(HloModule* module, bool is_aot_compile,
   pipeline.AddPass<HloCSE>(/*is_layout_sensitive=*/false);
   pipeline.AddPass<CpuInstructionFusion>();
 
-  pipeline.AddPass<GatherExpander>();
-
   ReducePrecisionInsertion::AddPasses(
       &pipeline, module->config().debug_options(),
       ReducePrecisionInsertion::PassTiming::AFTER_FUSION);
@@ -564,7 +562,9 @@ StatusOr<std::unique_ptr<Executable>> CpuCompiler::RunBackend(
       BufferAssigner::Run(
           module.get(),
           xla::MakeUnique<SequentialHloOrdering>(module.get(), module_sequence),
-          BufferSizeBytesFunction(), memory_alignment));
+          BufferSizeBytesFunction(), memory_alignment,
+          /*allow_input_output_aliasing=*/false,
+          /*allocate_buffers_for_constants=*/true));
   // BufferAssignment::ToString() includes a header, so no need for us to
   // print one ourselves.
   XLA_VLOG_LINES(2, assignment->ToString());
@@ -586,6 +586,8 @@ StatusOr<std::unique_ptr<Executable>> CpuCompiler::RunBackend(
                        std::move(computation_to_profile_idx),
                        &target_machine_features);
 
+  TF_RETURN_IF_ERROR(ir_emitter.EmitConstantGlobals());
+
   for (auto embedded_computation :
        entry_computation->MakeEmbeddedComputationsList()) {
     if (embedded_computation->IsFusionComputation()) {
@@ -607,7 +609,13 @@ StatusOr<std::unique_ptr<Executable>> CpuCompiler::RunBackend(
                                  /*is_top_level_computation=*/true,
                                  &module_sequence.at(entry_computation)));
 
-  string function_name = llvm_ir::AsString(entry_function->getName());
+  string function_name = [&]() {
+    llvm::SmallVector<char, 40> function_name_vector;
+    llvm::Mangler::getNameWithPrefix(
+        function_name_vector, entry_function->getName(), jit->data_layout());
+    return string(function_name_vector.begin(), function_name_vector.end());
+  }();
+
   string ir_module_string;
   if (embed_ir_in_executable) {
     ir_module_string = llvm_ir::DumpModuleToString(*llvm_module);
@@ -743,7 +751,9 @@ CpuCompiler::CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> modules,
         BufferAssigner::Run(
             module,
             xla::MakeUnique<SequentialHloOrdering>(module, module_sequence),
-            BufferSizeBytesFunction(), memory_alignment));
+            BufferSizeBytesFunction(), memory_alignment,
+            /*allow_input_output_aliasing=*/false,
+            /*allocate_buffers_for_constants=*/true));
     // BufferAssignment::ToString() includes a header, so no need for us to
     // print one ourselves.
     XLA_VLOG_LINES(2, assignment->ToString());
@@ -772,6 +782,9 @@ CpuCompiler::CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> modules,
                          std::move(instruction_to_profile_idx),
                          std::move(computation_to_profile_idx),
                          &target_machine_features);
+
+    TF_RETURN_IF_ERROR(ir_emitter.EmitConstantGlobals());
+
     HloComputation* computation = module->entry_computation();
     for (auto embedded_computation :
          computation->MakeEmbeddedComputationsList()) {
@@ -827,17 +840,29 @@ CpuCompiler::CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> modules,
 
     BufferSizes buffer_sizes;
     for (const BufferAllocation& allocation : assignment->Allocations()) {
-      // Callers don't need to allocate temporary buffers for parameters.
-      if (allocation.is_entry_computation_parameter()) {
-        buffer_sizes.push_back(-1);
-        continue;
-      }
       // Callers don't need to allocate anything for thread-local temporary
       // buffers.  They are lowered to allocas.
       if (allocation.is_thread_local()) {
         buffer_sizes.push_back(-1);
         continue;
       }
+
+      // Callers don't need to allocate anything for constant buffers.  They are
+      // lowered to globals.
+      if (allocation.is_constant()) {
+        buffer_sizes.push_back(-1);
+        continue;
+      }
+
+      // Callers don't need to allocate anything for entry computation buffers,
+      // but they do need to stash the pointer to the entry computation buffer
+      // in the temp buffer table.  See the comment on
+      // XlaCompiledCpuFunction::StaticData::temp_sizes.
+      if (allocation.is_entry_computation_parameter()) {
+        buffer_sizes.push_back(-allocation.parameter_number() - 2);
+        continue;
+      }
+
       buffer_sizes.push_back(allocation.size());
     }
 
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_copy_insertion_test.cc b/tensorflow/compiler/xla/service/cpu/cpu_copy_insertion_test.cc
index a05a269417..4db7fa446e 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_copy_insertion_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_copy_insertion_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/cpu/cpu_copy_insertion.h"
 
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
@@ -74,14 +74,14 @@ TEST_F(CpuCopyInsertionTest, WhileBodyWithConstantRoot) {
   body_builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "param"));
   body_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(123.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(123.0)));
   HloComputation* body = module->AddEmbeddedComputation(body_builder.Build());
 
   auto cond_builder = HloComputation::Builder("condition");
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module->AddEmbeddedComputation(cond_builder.Build());
 
@@ -114,7 +114,7 @@ TEST_F(CpuCopyInsertionTest, TupleCall) {
   auto sub_param = sub_builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "param"));
   auto constant = sub_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(123.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(123.0)));
   auto add = sub_builder.AddInstruction(HloInstruction::CreateBinary(
       scalar_shape_, HloOpcode::kAdd, sub_param, constant));
   sub_builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_executable.cc b/tensorflow/compiler/xla/service/cpu/cpu_executable.cc
index 1093559892..946f5124b8 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_executable.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_executable.cc
@@ -69,12 +69,19 @@ CpuExecutable::CpuExecutable(
   // guarded by the mutex.
   compute_function_ =
       reinterpret_cast<ComputeFunctionType>(cantFail(sym.getAddress()));
+  VLOG(1) << "compute_function_ at address "
+          << reinterpret_cast<void*>(compute_function_);
 }
 
-Status CpuExecutable::AllocateBuffers(
+StatusOr<std::pair<std::vector<se::DeviceMemoryBase>,
+                   std::vector<OwningDeviceMemory>>>
+CpuExecutable::CreateTempArray(
     DeviceMemoryAllocator* memory_allocator, int device_ordinal,
-    std::vector<OwningDeviceMemory>* buffers) {
-  CHECK_EQ(buffers->size(), assignment_->Allocations().size());
+    tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments) {
+  std::vector<se::DeviceMemoryBase> unowning_buffers(
+      assignment_->Allocations().size());
+  std::vector<OwningDeviceMemory> owning_buffers(
+      assignment_->Allocations().size());
   VLOG(3) << "Allocating " << assignment_->Allocations().size()
           << " allocations for module " << module().name();
   for (BufferAllocation::Index i = 0; i < assignment_->Allocations().size();
@@ -84,44 +91,51 @@ Status CpuExecutable::AllocateBuffers(
     VLOG(3) << allocation.ToString();
 
     if (allocation.is_entry_computation_parameter()) {
+      unowning_buffers[i] = arguments[allocation.parameter_number()]->buffer(
+          allocation.param_shape_index());
       VLOG(3) << "allocation #" << i << " is a parameter";
       continue;
     }
 
+    if (allocation.is_constant()) {
+      VLOG(3) << "allocation #" << i << " is a constant";
+      continue;
+    }
+
     if (allocation.is_thread_local()) {
       VLOG(3) << "buffer #" << i << " is thread-local";
       continue;
     }
 
     int64 buffer_size = allocation.size();
-    if (!(*buffers)[i].is_null()) {
+    if (!owning_buffers[i].is_null()) {
       VLOG(3) << "buffer #" << i
               << " is in the preallocated result ShapedBuffer";
     } else {
-      TF_ASSIGN_OR_RETURN((*buffers)[i], memory_allocator->Allocate(
-                                             device_ordinal, buffer_size));
+      TF_ASSIGN_OR_RETURN(owning_buffers[i], memory_allocator->Allocate(
+                                                 device_ordinal, buffer_size));
+      unowning_buffers[i] = owning_buffers[i].AsDeviceMemoryBase();
 
       VLOG(3) << "buffer #" << i << " allocated " << buffer_size << " bytes ["
-              << (*buffers)[i].opaque() << "]";
+              << owning_buffers[i].opaque() << "]";
     }
 
     // Since the output buffer and all the temporary buffers were written into
     // by the JITed code, msan has no way of knowing their memory was
     // initialized. Mark them initialized so that msan doesn't flag loads from
     // these buffers.
-    TF_ANNOTATE_MEMORY_IS_INITIALIZED((*buffers)[i].opaque(), buffer_size);
+    TF_ANNOTATE_MEMORY_IS_INITIALIZED(owning_buffers[i].opaque(), buffer_size);
   }
 
   TF_ASSIGN_OR_RETURN(const BufferAllocation::Slice result_slice,
                       assignment_->GetUniqueTopLevelOutputSlice());
   VLOG(3) << "result index: " << result_slice.index();
 
-  return Status::OK();
+  return {{std::move(unowning_buffers), std::move(owning_buffers)}};
 }
 
 Status CpuExecutable::ExecuteComputeFunction(
     const ExecutableRunOptions* run_options,
-    tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
     tensorflow::gtl::ArraySlice<se::DeviceMemoryBase> buffers,
     HloExecutionProfile* hlo_execution_profile) {
   // The calling convention for JITed functions is:
@@ -131,17 +145,11 @@ Status CpuExecutable::ExecuteComputeFunction(
   //
   // result: Points at the result.
   // run_options: the ExecutableRunOptions object.
-  // args_array: An array of pointers, each of which points to a parameter.
-  //               The size of this array is determined by the function's arity
-  //               (ProgramShape).
-  // temps_array:  An array of pointers, each of which points to a temporary
-  //               buffer the computation needs. The size of this array is
-  //               determined by buffer analysis.
+  // args_array: null
+  // temps_array: An array of pointers, containing pointers to temporary buffers
+  //              required by the executable adn pointers to entry computation
+  //              parameters.
   //
-  std::vector<const void*> args_array;
-  for (const ShapedBuffer* argument : arguments) {
-    args_array.push_back(argument->root_buffer().opaque());
-  }
 
   uint64 start_micros = tensorflow::Env::Default()->NowMicros();
 
@@ -164,16 +172,14 @@ Status CpuExecutable::ExecuteComputeFunction(
   if (VLOG_IS_ON(3)) {
     VLOG(3) << "Executing compute function:";
     VLOG(3) << tensorflow::strings::Printf(
-        "  func(void* result, void* params[%zu], void* temps[%zu], "
+        "  func(void* result, void* params[null], void* temps[%zu], "
         "uint64 profile_counters[%zu])",
-        args_array.size(), buffer_pointers.size(), profile_counters_size);
+        buffer_pointers.size(), profile_counters_size);
     VLOG(3) << tensorflow::strings::Printf("    result = %p", result_buffer);
     auto ptr_printer = [](string* out, const void* p) {
       tensorflow::strings::StrAppend(out, tensorflow::strings::Printf("%p", p));
     };
-    VLOG(3) << tensorflow::strings::Printf(
-        "    params = [%s]",
-        tensorflow::str_util::Join(args_array, ", ", ptr_printer).c_str());
+    VLOG(3) << "    params = nullptr";
     VLOG(3) << tensorflow::strings::Printf(
         "    temps = [%s]",
         tensorflow::str_util::Join(buffer_pointers, ", ", ptr_printer).c_str());
@@ -181,8 +187,8 @@ Status CpuExecutable::ExecuteComputeFunction(
                                            profile_counters);
   }
 
-  compute_function_(result_buffer, run_options, args_array.data(),
-                    buffer_pointers.data(), profile_counters);
+  compute_function_(result_buffer, run_options, nullptr, buffer_pointers.data(),
+                    profile_counters);
 
   uint64 end_micros = tensorflow::Env::Default()->NowMicros();
 
@@ -249,21 +255,18 @@ StatusOr<ScopedShapedBuffer> CpuExecutable::ExecuteOnStream(
 
   se::Stream* stream = run_options->stream();
   DeviceMemoryAllocator* memory_allocator = run_options->allocator();
-  std::vector<OwningDeviceMemory> buffers(assignment_->Allocations().size());
-
-  TF_RETURN_IF_ERROR(AllocateBuffers(
-      memory_allocator, stream->parent()->device_ordinal(), &buffers));
 
+  std::vector<OwningDeviceMemory> owning_buffers;
   std::vector<se::DeviceMemoryBase> unowning_buffers;
-  unowning_buffers.reserve(buffers.size());
-  for (auto& buffer : buffers) {
-    unowning_buffers.push_back(buffer.AsDeviceMemoryBase());
-  }
-  TF_RETURN_IF_ERROR(ExecuteComputeFunction(&run_options->run_options(),
-                                            arguments, unowning_buffers,
-                                            hlo_execution_profile));
+  TF_ASSIGN_OR_RETURN(
+      std::tie(unowning_buffers, owning_buffers),
+      CreateTempArray(memory_allocator, stream->parent()->device_ordinal(),
+                      arguments));
+
+  TF_RETURN_IF_ERROR(ExecuteComputeFunction(
+      &run_options->run_options(), unowning_buffers, hlo_execution_profile));
 
-  return CreateResultShapedBuffer(run_options, &buffers);
+  return CreateResultShapedBuffer(run_options, &owning_buffers);
 }
 
 StatusOr<ScopedShapedBuffer> CpuExecutable::ExecuteAsyncOnStream(
@@ -279,17 +282,15 @@ StatusOr<ScopedShapedBuffer> CpuExecutable::ExecuteAsyncOnStream(
       run_options->stream()->implementation());
   se::Stream* stream = run_options->stream();
   DeviceMemoryAllocator* memory_allocator = run_options->allocator();
-  std::vector<OwningDeviceMemory> buffers(assignment_->Allocations().size());
-  TF_RETURN_IF_ERROR(AllocateBuffers(
-      memory_allocator, stream->parent()->device_ordinal(), &buffers));
-
+  std::vector<OwningDeviceMemory> owning_buffers;
   std::vector<se::DeviceMemoryBase> unowning_buffers;
-  unowning_buffers.reserve(buffers.size());
-  for (auto& buffer : buffers) {
-    unowning_buffers.push_back(buffer.AsDeviceMemoryBase());
-  }
+  TF_ASSIGN_OR_RETURN(
+      std::tie(unowning_buffers, owning_buffers),
+      CreateTempArray(memory_allocator, stream->parent()->device_ordinal(),
+                      arguments));
+
   TF_ASSIGN_OR_RETURN(ScopedShapedBuffer result,
-                      CreateResultShapedBuffer(run_options, &buffers));
+                      CreateResultShapedBuffer(run_options, &owning_buffers));
 
   // At this point, `unowning_buffers` contains unowning pointers to all of our
   // buffers, and `buffers` contains owning pointers to the non-live-out
@@ -307,7 +308,6 @@ StatusOr<ScopedShapedBuffer> CpuExecutable::ExecuteAsyncOnStream(
   struct AsyncRunTask {
     CpuExecutable* executable;
     ServiceExecutableRunOptions run_options;
-    std::vector<const ShapedBuffer*> arguments;
     std::vector<se::DeviceMemoryBase> unowning_buffers;
     std::shared_ptr<std::vector<OwningDeviceMemory>> buffers;
 
@@ -315,15 +315,14 @@ StatusOr<ScopedShapedBuffer> CpuExecutable::ExecuteAsyncOnStream(
       // Failing a CHECK here is not great, but I don't see an obvious way to
       // return a failed Status asynchronously.
       TF_CHECK_OK(executable->ExecuteComputeFunction(
-          &run_options.run_options(), arguments, unowning_buffers,
+          &run_options.run_options(), unowning_buffers,
           /*hlo_execution_profile=*/nullptr));
     }
   };
-  host_stream->EnqueueTask(AsyncRunTask{
-      this, *run_options,
-      std::vector<const ShapedBuffer*>(arguments.begin(), arguments.end()),
-      unowning_buffers,
-      std::make_shared<std::vector<OwningDeviceMemory>>(std::move(buffers))});
+  host_stream->EnqueueTask(
+      AsyncRunTask{this, *run_options, std::move(unowning_buffers),
+                   std::make_shared<std::vector<OwningDeviceMemory>>(
+                       std::move(owning_buffers))});
 
   return std::move(result);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_executable.h b/tensorflow/compiler/xla/service/cpu/cpu_executable.h
index 8dd47bfb86..8af8a5dfec 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_executable.h
+++ b/tensorflow/compiler/xla/service/cpu/cpu_executable.h
@@ -85,20 +85,29 @@ class CpuExecutable : public Executable {
   const BufferAssignment& buffer_assignment() const { return *assignment_; }
 
  private:
-  // Allocate buffers required for execution and assign them to the elements of
-  // "buffers". "buffers" should be sized to the number of buffers in buffer
-  // assignment. Each vector element corresponds to a particular Index. If
-  // a vector element already contains a non-null DeviceMemoryBase, then no
-  // buffer is assigned for this element.
-  Status AllocateBuffers(DeviceMemoryAllocator* memory_allocator,
-                         int device_ordinal,
-                         std::vector<OwningDeviceMemory>* buffers);
+  // Creates an array suitable for passing as the "temps" argument to the JIT
+  // compiled function pointer.
+  //
+  // Returns (unowning_buffers, owning_buffers) where:
+  //
+  //  - unowning_buffers.data() can be passed as the temps argument as-is and
+  //    includes pointers to the scratch storage required by the computation,
+  //    the live-out buffer into which the result will be written and entry
+  //    computation parameters.
+  //
+  //  - owning_buffers contains owning pointers to the buffers that were
+  //    allocated by this routine.  This routine allocates buffers for temporary
+  //    storage and the live-out buffer into which the computation writes it
+  //    result.
+  StatusOr<std::pair<std::vector<se::DeviceMemoryBase>,
+                     std::vector<OwningDeviceMemory>>>
+  CreateTempArray(DeviceMemoryAllocator* memory_allocator, int device_ordinal,
+                  tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments);
 
   // Calls the generated function performing the computation with the given
   // arguments using the supplied buffers.
   Status ExecuteComputeFunction(
       const ExecutableRunOptions* run_options,
-      tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
       tensorflow::gtl::ArraySlice<se::DeviceMemoryBase> buffers,
       HloExecutionProfile* hlo_execution_profile);
 
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
index 750310c633..991b14f17d 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
@@ -282,7 +282,7 @@ class OpcodeFusionTest : public InstructionFusionTest {
         builder.AddInstruction(HloInstruction::CreateParameter(
             0, ShapeUtil::MakeShape(F32, {}), "arg0"));
     HloInstruction* one = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     builder.AddInstruction(HloInstruction::CreateBinary(
         ShapeUtil::MakeShape(F32, {}), HloOpcode::kAdd, arg0, one));
     return module->AddEmbeddedComputation(builder.Build());
@@ -595,7 +595,7 @@ TEST_F(OpcodeFusionTest, MessOfFusileNodes) {
   auto pad = builder.AddInstruction(HloInstruction::CreatePad(
       ShapeUtil::MakeShape(S32, {5}), idx_choice,
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0(0))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0(0))),
       padding_config));
 
   auto slice = builder.AddInstruction(HloInstruction::CreateDynamicSlice(
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_layout_assignment_test.cc b/tensorflow/compiler/xla/service/cpu/cpu_layout_assignment_test.cc
index 429fc7b786..3681d12d8d 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_layout_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_layout_assignment_test.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/algebraic_simplifier.h"
 #include "tensorflow/compiler/xla/service/computation_layout.h"
 #include "tensorflow/compiler/xla/service/cpu/target_machine_features_fake.h"
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc b/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
index 54c52bc08f..639064040f 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
@@ -92,9 +92,10 @@ tensorflow::string ShapeString(const void* shape_ptr, xla::int32 shape_length) {
 
 }  // namespace
 
-void* __xla_cpu_runtime_AcquireInfeedBufferForDequeue(xla::int32 buffer_length,
-                                                      const void* shape,
-                                                      xla::int32 shape_length) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void*
+__xla_cpu_runtime_AcquireInfeedBufferForDequeue(xla::int32 buffer_length,
+                                                const void* shape,
+                                                xla::int32 shape_length) {
   if (VLOG_IS_ON(2)) {
     LOG(INFO) << "AcquireInfeedBufferForDequeue: "
               << ShapeString(shape, shape_length);
@@ -111,9 +112,11 @@ void* __xla_cpu_runtime_AcquireInfeedBufferForDequeue(xla::int32 buffer_length,
   return buffer->data();
 }
 
-void __xla_cpu_runtime_ReleaseInfeedBufferAfterDequeue(
-    xla::int32 buffer_length, void* buffer_ptr, const void* shape_ptr,
-    xla::int32 shape_length) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void
+__xla_cpu_runtime_ReleaseInfeedBufferAfterDequeue(xla::int32 buffer_length,
+                                                  void* buffer_ptr,
+                                                  const void* shape_ptr,
+                                                  xla::int32 shape_length) {
   if (VLOG_IS_ON(2)) {
     LOG(INFO) << "ReleaseInfeedBufferAfterDeque: "
               << ShapeString(shape_ptr, shape_length);
@@ -125,8 +128,10 @@ void __xla_cpu_runtime_ReleaseInfeedBufferAfterDequeue(
                                         std::move(shape));
 }
 
-void* __xla_cpu_runtime_AcquireOutfeedBufferForPopulation(
-    xla::int32 buffer_length, const void* shape_ptr, xla::int32 shape_length) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void*
+__xla_cpu_runtime_AcquireOutfeedBufferForPopulation(xla::int32 buffer_length,
+                                                    const void* shape_ptr,
+                                                    xla::int32 shape_length) {
   if (VLOG_IS_ON(2)) {
     LOG(INFO) << "AcquireOutfeedBufferForPopulation: "
               << ShapeString(shape_ptr, shape_length);
@@ -143,9 +148,11 @@ void* __xla_cpu_runtime_AcquireOutfeedBufferForPopulation(
   return buffer->data();
 }
 
-void __xla_cpu_runtime_ReleaseOutfeedBufferAfterPopulation(
-    xla::int32 buffer_length, void* buffer_ptr, const void* shape_ptr,
-    xla::int32 shape_length) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void
+__xla_cpu_runtime_ReleaseOutfeedBufferAfterPopulation(xla::int32 buffer_length,
+                                                      void* buffer_ptr,
+                                                      const void* shape_ptr,
+                                                      xla::int32 shape_length) {
   if (VLOG_IS_ON(2)) {
     LOG(INFO) << "ReleaseOutfeedBufferAfterPopulation: "
               << ShapeString(shape_ptr, shape_length);
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.cc b/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.cc
index b877b29581..156166bf2b 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_runtime.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -180,7 +181,7 @@ Status CpuTransferManager::TransferLiteralFromOutfeed(
     tensorflow::gtl::ArraySlice<int64> dimensions(
         tensorflow::bit_cast<const int64*>(literal_shape.dimensions().data()),
         literal_shape.dimensions().size());
-    *literal = std::move(*Literal::CreateFromDimensions(
+    *literal = std::move(*LiteralUtil::CreateFromDimensions(
         literal_shape.element_type(), dimensions));
     TF_ASSIGN_OR_RETURN(Shape received_shape,
                         TransferArrayBufferFromOutfeed(
@@ -211,7 +212,7 @@ Status CpuTransferManager::TransferLiteralFromOutfeed(
         tensorflow::bit_cast<const int64*>(
             tuple_element_shape.dimensions().data()),
         tuple_element_shape.dimensions().size());
-    auto empty = Literal::CreateFromDimensions(
+    auto empty = LiteralUtil::CreateFromDimensions(
         tuple_element_shape.element_type(), dimensions);
     int64 size = GetByteSizeRequirement(tuple_element_shape);
     buffer_data.push_back({empty->untyped_data(), size});
@@ -232,7 +233,7 @@ Status CpuTransferManager::TransferLiteralFromOutfeed(
   for (int64 i = 0; i < literal_shape.tuple_shapes_size(); ++i) {
     *elements[i]->mutable_shape_do_not_use() = received_shape.tuple_shapes(i);
   }
-  *literal = std::move(*Literal::MakeTupleOwned(std::move(elements)));
+  *literal = std::move(*LiteralUtil::MakeTupleOwned(std::move(elements)));
   TF_RET_CHECK(ShapeUtil::Equal(literal->shape(), literal_shape));
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.h b/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.h
index 6dfc666f09..593575c0fd 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.h
+++ b/tensorflow/compiler/xla/service/cpu/cpu_transfer_manager.h
@@ -39,13 +39,14 @@ class CpuTransferManager : public GenericTransferManager {
 
   Status TransferLiteralToInfeed(se::StreamExecutor* executor,
                                  const LiteralSlice& literal) override;
-  Status TransferBufferToInfeed(se::StreamExecutor* executor, int64 size,
-                                const void* source) override;
   Status TransferLiteralFromOutfeed(se::StreamExecutor* executor,
                                     const Shape& literal_shape,
                                     Literal* literal) override;
 
  private:
+  Status TransferBufferToInfeed(se::StreamExecutor* executor, int64 size,
+                                const void* source);
+
   // Transfers infeed data to device. InfeedBuffer->Done() must be
   // called to clean up the memory allocated for InfeedBuffer.
   StatusOr<cpu::runtime::XfeedBuffer*> TransferBufferToInfeedInternal(
diff --git a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
index 58228180ca..645888de78 100644
--- a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
@@ -49,15 +49,15 @@ class MemoryTile {
   // `tile_size_along_major_dim` vectors from the matrix `matrix`, starting at
   // `major_dim_offset` in the major dimension.  The tile size along the minor
   // dimension is the vector size, and that is implicitly determined by `vsl`.
-  MemoryTile(VectorSupportLibrary* vsl, llvm::IRBuilder<>* ir_builder,
+  MemoryTile(VectorSupportLibrary* vsl, llvm::IRBuilder<>* b,
              llvm::Value* matrix, int64 matrix_size_along_minor_dim,
              llvm::Value* major_dim_offset, int64 tile_size_along_major_dim)
-      : vsl_(vsl), ir_builder_(ir_builder) {
+      : vsl_(vsl), b_(b) {
     pointers_.reserve(tile_size_along_major_dim);
     for (int64 i = 0; i < tile_size_along_major_dim; i++) {
-      llvm::Value* total_offset = ir_builder->CreateMul(
-          ir_builder->getInt64(matrix_size_along_minor_dim),
-          ir_builder->CreateAdd(ir_builder->getInt64(i), major_dim_offset));
+      llvm::Value* total_offset =
+          b->CreateMul(b->getInt64(matrix_size_along_minor_dim),
+                       b->CreateAdd(b->getInt64(i), major_dim_offset));
       pointers_.push_back(vsl_->ComputeOffsetPointer(matrix, total_offset));
     }
   }
@@ -101,8 +101,7 @@ class MemoryTile {
     for (int64 i = 0; i < pointers_.size(); i++) {
       for (int64 j = 0; j < tile_size_along_middle_dim; j++) {
         result[i].push_back(vsl_->LoadBroadcast(
-            pointers_[i], ir_builder_->CreateAdd(minor_dim_offset,
-                                                 ir_builder_->getInt64(j))));
+            pointers_[i], b_->CreateAdd(minor_dim_offset, b_->getInt64(j))));
       }
     }
     return result;
@@ -110,7 +109,7 @@ class MemoryTile {
 
  private:
   VectorSupportLibrary* vsl_;
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   std::vector<llvm::Value*> pointers_;
 };
 
@@ -249,16 +248,15 @@ class ColumnMajorMatrixVectorProductEmitter
   ColumnMajorMatrixVectorProductEmitter(const Config& config, llvm::Value* lhs,
                                         llvm::Value* rhs, llvm::Value* addend,
                                         llvm::Value* result,
-                                        llvm::IRBuilder<>* ir_builder)
+                                        llvm::IRBuilder<>* b)
       : config_(config),
         lhs_(lhs),
         rhs_(rhs),
         addend_(addend),
         result_(result),
-        ir_builder_(ir_builder),
-        ksl_(ir_builder_),
-        vsl_(config.scalar_type(), /*vector_size=*/config.tile_rows(),
-             ir_builder_, "") {
+        b_(b),
+        ksl_(b_),
+        vsl_(config.scalar_type(), /*vector_size=*/config.tile_rows(), b_, "") {
     CHECK(tile_rows() > 0 && IsPowerOfTwo(static_cast<uint64>(tile_rows())));
     CHECK(!has_addend() || addend != nullptr);
   }
@@ -272,7 +270,7 @@ class ColumnMajorMatrixVectorProductEmitter
                          bool is_first_column);
 
   MemoryTile GetLhsMemoryTile(llvm::Value* column_start, int64 column_count) {
-    return MemoryTile(&vsl_, ir_builder_, /*matrix=*/lhs_,
+    return MemoryTile(&vsl_, b_, /*matrix=*/lhs_,
                       /*matrix_size_along_minor_dim=*/m(),
                       /*major_dim_offset=*/column_start,
                       /*tile_size_along_major_dim=*/column_count);
@@ -302,7 +300,7 @@ class ColumnMajorMatrixVectorProductEmitter
   llvm::Value* rhs_;
   llvm::Value* addend_;
   llvm::Value* result_;
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   KernelSupportLibrary ksl_;
   VectorSupportLibrary vsl_;
 };
@@ -331,7 +329,7 @@ void ColumnMajorMatrixVectorProductEmitter::Emit() {
                      });
 
   if (column_remainder != 0) {
-    EmitOuterLoopBody(ir_builder_->getInt64(column_limit), column_remainder,
+    EmitOuterLoopBody(b_->getInt64(column_limit), column_remainder,
                       column_limit == 0);
   }
 }
@@ -364,7 +362,7 @@ void ColumnMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
     return;
   }
 
-  llvm::Value* columns_llvm = ir_builder_->getInt64(columns);
+  llvm::Value* columns_llvm = b_->getInt64(columns);
 
   // for (col = current_tile_col; col < (columns + current_tile_col); col++)
   //   for (row = row_start, row < m_; row++) {
@@ -375,12 +373,11 @@ void ColumnMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
 
   ksl_.ForReturnVoid(
       "dot.inner.epilg.outer", /*start=*/current_tile_col,
-      /*end=*/ir_builder_->CreateAdd(columns_llvm, current_tile_col),
+      /*end=*/b_->CreateAdd(columns_llvm, current_tile_col),
       /*step=*/1, /*peel_first_iteration=*/false,
       [&](llvm::Value* col, llvm::Value* is_first_scalar_col) {
         llvm::Value* rhs_element = vsl_.LoadScalar(rhs_, col);
-        llvm::Value* total_offset =
-            ir_builder_->CreateMul(col, ir_builder_->getInt64(m()));
+        llvm::Value* total_offset = b_->CreateMul(col, b_->getInt64(m()));
         llvm::Value* lhs_base_pointer =
             vsl_.ComputeOffsetPointer(lhs_, total_offset);
         ksl_.ForReturnVoid(
@@ -388,9 +385,8 @@ void ColumnMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
             /*step=*/1, [&](llvm::Value* scalar_row) {
               llvm::Value* product = vsl_.Mul(
                   vsl_.LoadScalar(lhs_base_pointer, scalar_row), rhs_element);
-              llvm::Value* setting_result_first_time = ir_builder_->CreateAnd(
-                  is_first_scalar_col,
-                  ir_builder_->getInt1(is_first_tiled_column));
+              llvm::Value* setting_result_first_time = b_->CreateAnd(
+                  is_first_scalar_col, b_->getInt1(is_first_tiled_column));
               ksl_.IfReturnVoid(
                   setting_result_first_time,
                   /*true_block_generator=*/
@@ -478,16 +474,15 @@ class RowMajorMatrixVectorProductEmitter
 
   RowMajorMatrixVectorProductEmitter(const Config& config, llvm::Value* lhs,
                                      llvm::Value* rhs, llvm::Value* addend,
-                                     llvm::Value* result,
-                                     llvm::IRBuilder<>* ir_builder)
+                                     llvm::Value* result, llvm::IRBuilder<>* b)
       : config_(config),
         lhs_(lhs),
         rhs_(rhs),
         addend_(addend),
         result_(result),
-        ir_builder_(ir_builder),
-        ksl_(ir_builder_),
-        vsl_(scalar_type(), /*vector_size=*/tile_cols(), ir_builder_, "") {
+        b_(b),
+        ksl_(b_),
+        vsl_(scalar_type(), /*vector_size=*/tile_cols(), b_, "") {
     CHECK(tile_cols() > 0 && IsPowerOfTwo(static_cast<uint64>(tile_cols())));
     CHECK(!has_addend() || addend != nullptr);
   }
@@ -498,7 +493,7 @@ class RowMajorMatrixVectorProductEmitter
 
  private:
   MemoryTile GetLhsMemoryTile(llvm::Value* row_start, int64 row_count) {
-    return MemoryTile(&vsl_, ir_builder_, /*matrix=*/lhs_,
+    return MemoryTile(&vsl_, b_, /*matrix=*/lhs_,
                       /*matrix_size_along_minor_dim=*/k(),
                       /*major_dim_offset=*/row_start,
                       /*tile_size_along_major_dim=*/row_count);
@@ -517,7 +512,7 @@ class RowMajorMatrixVectorProductEmitter
   llvm::Value* rhs_;
   llvm::Value* addend_;
   llvm::Value* result_;
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   KernelSupportLibrary ksl_;
   VectorSupportLibrary vsl_;
 };
@@ -559,7 +554,7 @@ void RowMajorMatrixVectorProductEmitter::EmitOuterLoopBody(llvm::Value* row,
   for (int i = 0; i < row_count; i++) {
     llvm::Value* result_value =
         vsl_.Add(horizontal_sums[i], scalar_accumulators[i].Get());
-    llvm::Value* offset = ir_builder_->CreateAdd(ir_builder_->getInt64(i), row);
+    llvm::Value* offset = b_->CreateAdd(b_->getInt64(i), row);
     if (addend_ && row_count != vsl_.vector_size()) {
       result_value = vsl_.Add(vsl_.LoadScalar(addend_, offset), result_value);
     }
@@ -578,7 +573,7 @@ void RowMajorMatrixVectorProductEmitter::Emit() {
       [&](llvm::Value* row) { EmitOuterLoopBody(row, tile_rows()); });
 
   if (row_remainder != 0) {
-    EmitOuterLoopBody(ir_builder_->getInt64(row_limit), row_remainder);
+    EmitOuterLoopBody(b_->getInt64(row_limit), row_remainder);
   }
 }
 
@@ -609,9 +604,8 @@ void RowMajorMatrixVectorProductEmitter::EmitInnerLoopEpilogue(
   }
 
   for (int r = 0; r < rows; r++) {
-    llvm::Value* total_offset = ir_builder_->CreateMul(
-        ir_builder_->CreateAdd(ir_builder_->getInt64(r), current_tile_row),
-        ir_builder_->getInt64(k()));
+    llvm::Value* total_offset = b_->CreateMul(
+        b_->CreateAdd(b_->getInt64(r), current_tile_row), b_->getInt64(k()));
     llvm::Value* lhs_base_pointer =
         vsl_.ComputeOffsetPointer(lhs_, total_offset);
     ksl_.ForReturnVoid(
@@ -722,13 +716,13 @@ class MatrixMatrixBlockPanelEmitter {
   // `lhs` with `rhs` and stores the result in `result`.
   explicit MatrixMatrixBlockPanelEmitter(Config config, llvm::Value* lhs,
                                          llvm::Value* rhs, llvm::Value* result,
-                                         llvm::IRBuilder<>* ir_builder)
+                                         llvm::IRBuilder<>* b)
       : lhs_(lhs),
         rhs_(rhs),
         result_(result),
         config_(config),
-        ir_builder_(ir_builder),
-        ksl_(ir_builder_) {
+        b_(b),
+        ksl_(b_) {
     CHECK(max_vectorization_width() > 0 &&
           IsPowerOfTwo(static_cast<uint64>(max_vectorization_width())));
     CHECK_GT(max_vector_count(), 0);
@@ -761,7 +755,7 @@ class MatrixMatrixBlockPanelEmitter {
                      int64 tile_size_m, llvm::Value* m_start,
                      llvm::Value* m_end);
 
-  llvm::Value* GetInt64(int64 value) { return ir_builder_->getInt64(value); }
+  llvm::Value* GetInt64(int64 value) { return b_->getInt64(value); }
 
   Config config() const { return config_; }
   Dimensions dims() const { return config().dims(); }
@@ -782,7 +776,7 @@ class MatrixMatrixBlockPanelEmitter {
   llvm::Value* result_;
   Config config_;
 
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   KernelSupportLibrary ksl_;
 };
 
@@ -804,8 +798,8 @@ void MatrixMatrixBlockPanelEmitter::HandleResiduesOnN() {
          current_vectorization_width >= min_vectorization_width()) {
     int64 n_end = dims().n() - (dims().n() % current_vectorization_width);
     if (n_start != n_end) {
-      VectorSupportLibrary vsl(scalar_type(), current_vectorization_width,
-                               ir_builder_, "gebp");
+      VectorSupportLibrary vsl(scalar_type(), current_vectorization_width, b_,
+                               "gebp");
       HandleResiduesOnK(&vsl, GetInt64(n_start), GetInt64(n_end));
       n_start = n_end;
     }
@@ -819,10 +813,9 @@ void MatrixMatrixBlockPanelEmitter::HandleResiduesOnN() {
   }
 
   if (n_start != dims().n()) {
-    VectorSupportLibrary vsl(scalar_type(), 1, ir_builder_, "gebp");
+    VectorSupportLibrary vsl(scalar_type(), 1, b_, "gebp");
     ksl_.ForReturnVoid("epi.n", n_start, dims().n(), 1, [&](llvm::Value* n_i) {
-      llvm::Value* n_i_next =
-          ir_builder_->CreateAdd(n_i, ir_builder_->getInt64(1));
+      llvm::Value* n_i_next = b_->CreateAdd(n_i, b_->getInt64(1));
       HandleResiduesOnK(&vsl, n_i, n_i_next);
     });
   }
@@ -935,11 +928,11 @@ void MatrixMatrixBlockPanelEmitter::EmitTiledGemm(
   ksl_.ForReturnVoid(
       "dot.m", m_start, m_end, tile_size_m, [&](llvm::Value* m_i) {
         MemoryTile result_memory_tile(
-            vsl, ir_builder_, /*matrix=*/result_,
+            vsl, b_, /*matrix=*/result_,
             /*matrix_size_along_minor_dim=*/dims().n(),
             /*major_dim_offset=*/m_i,
             /*tile_size_along_major_dim=*/tile_size_m);
-        MemoryTile lhs_memory_tile(vsl, ir_builder_, /*matrix=*/lhs_,
+        MemoryTile lhs_memory_tile(vsl, b_, /*matrix=*/lhs_,
                                    /*matrix_size_along_minor_dim=*/dims().k(),
                                    /*major_dim_offset=*/m_i,
                                    /*tile_size_along_major_dim=*/tile_size_m);
@@ -949,8 +942,8 @@ void MatrixMatrixBlockPanelEmitter::EmitTiledGemm(
                                            result_memory_tile.LoadTile(n_i));
               ksl_.ForReturnVoid(
                   "dot.k", k_start, k_end, tile_size_k, [&](llvm::Value* k_i) {
-                    MemoryTile rhs_memory_tile(vsl, ir_builder_, rhs_,
-                                               dims().n(), k_i, tile_size_k);
+                    MemoryTile rhs_memory_tile(vsl, b_, rhs_, dims().n(), k_i,
+                                               tile_size_k);
                     std::vector<std::vector<llvm::Value*>> lhs_tile =
                         lhs_memory_tile.LoadBroadcastTile(k_i, tile_size_k);
                     std::vector<llvm::Value*> rhs_tile =
@@ -980,7 +973,7 @@ DotOpEmitter::DotOpEmitter(const HloInstruction& dot,
                            const llvm_ir::IrArray& rhs_array,
                            const llvm_ir::IrArray* addend_array,
                            llvm::Value* executable_run_options_value,
-                           llvm::IRBuilder<>* ir_builder,
+                           llvm::IRBuilder<>* b,
                            const HloModuleConfig& hlo_module_config,
                            const TargetMachineFeatures& target_machine_features)
     : dot_(dot),
@@ -989,7 +982,7 @@ DotOpEmitter::DotOpEmitter(const HloInstruction& dot,
       rhs_array_(rhs_array),
       addend_array_(addend_array),
       executable_run_options_value_(executable_run_options_value),
-      ir_builder_(ir_builder),
+      b_(b),
       hlo_module_config_(hlo_module_config),
       target_machine_features_(target_machine_features) {}
 
@@ -997,15 +990,14 @@ DotOpEmitter::DotOpEmitter(const HloInstruction& dot,
     const HloInstruction& dot, const llvm_ir::IrArray& target_array,
     const llvm_ir::IrArray& lhs_array, const llvm_ir::IrArray& rhs_array,
     const llvm_ir::IrArray* addend_array,
-    llvm::Value* executable_run_options_value, llvm::IRBuilder<>* ir_builder,
+    llvm::Value* executable_run_options_value, llvm::IRBuilder<>* b,
     const HloModuleConfig& hlo_module_config,
     const TargetMachineFeatures& target_machine_features) {
   PrimitiveType type = target_array.GetShape().element_type();
   TF_RET_CHECK(F16 == type || F32 == type || F64 == type || C64 == type);
   DotOpEmitter dot_emitter(dot, target_array, lhs_array, rhs_array,
-                           addend_array, executable_run_options_value,
-                           ir_builder, hlo_module_config,
-                           target_machine_features);
+                           addend_array, executable_run_options_value, b,
+                           hlo_module_config, target_machine_features);
   return dot_emitter.Emit();
 }
 
@@ -1050,13 +1042,13 @@ bool DotOpEmitter::EmitExperimentalGebpDotIfEnabled(
   }
 
   int64 size_bytes = m * n * ShapeUtil::ByteSizeOfPrimitiveType(primitive_type);
-  ir_builder_->CreateMemSet(
-      target, ir_builder_->getInt8(0), size_bytes,
+  b_->CreateMemSet(
+      target, b_->getInt8(0), size_bytes,
       target_machine_features_.minimum_alignment_for_allocation(size_bytes));
 
   int64 max_target_vector_width =
       target_machine_features_.vector_register_num_elements(
-          *ir_builder_->GetInsertBlock()->getParent(), primitive_type);
+          *b_->GetInsertBlock()->getParent(), primitive_type);
 
   int64 tile_size_m, tile_size_k, tile_size_n_in_vector_width;
   std::tie(tile_size_m, tile_size_k, tile_size_n_in_vector_width) =
@@ -1080,12 +1072,12 @@ bool DotOpEmitter::EmitExperimentalGebpDotIfEnabled(
 
   KernelSupportLibrary::EmitAndCallOutlinedKernel(
       /*enable_fast_math=*/enable_fast_math,
-      /*optimize_for_size=*/optimize_for_size, ir_builder_,
-      config.GetCacheKey(), lhs, rhs, target,
+      /*optimize_for_size=*/optimize_for_size, b_, config.GetCacheKey(), lhs,
+      rhs, target,
       [this, config](llvm::Value* lhs, llvm::Value* rhs, llvm::Value* target) {
-        MatrixMatrixBlockPanelEmitter gebp_emitter(
-            config, /*lhs=*/lhs, /*rhs=*/rhs,
-            /*result=*/target, ir_builder_);
+        MatrixMatrixBlockPanelEmitter gebp_emitter(config, /*lhs=*/lhs,
+                                                   /*rhs=*/rhs,
+                                                   /*result=*/target, b_);
         gebp_emitter.Emit();
       });
 
@@ -1163,7 +1155,7 @@ bool DotOpEmitter::EmitLlvmIrDotIfProfitable() {
 
   const int target_vector_register_element_size =
       target_machine_features_.vector_register_num_elements(
-          *ir_builder_->GetInsertBlock()->getParent(), primitive_type);
+          *b_->GetInsertBlock()->getParent(), primitive_type);
 
   // We may not always know the vector register size for the target we're
   // compiling against, in which case target_vector_register_element_size is 0.
@@ -1184,13 +1176,13 @@ bool DotOpEmitter::EmitLlvmIrDotIfProfitable() {
 
     KernelSupportLibrary::EmitAndCallOutlinedKernel(
         /*enable_fast_math=*/enable_fast_math,
-        /*optimize_for_size=*/optimize_for_size, ir_builder_,
-        config.GetCacheKey(), lhs_op, rhs_op,
+        /*optimize_for_size=*/optimize_for_size, b_, config.GetCacheKey(),
+        lhs_op, rhs_op,
         addend_array_ ? addend_array_->GetBasePointer() : nullptr, result_op,
         [this, config](llvm::Value* lhs_op, llvm::Value* rhs_op,
                        llvm::Value* addend_op, llvm::Value* result_op) {
           ColumnMajorMatrixVectorProductEmitter emitter(
-              config, lhs_op, rhs_op, addend_op, result_op, ir_builder_);
+              config, lhs_op, rhs_op, addend_op, result_op, b_);
           emitter.Emit();
         });
   } else {
@@ -1203,13 +1195,13 @@ bool DotOpEmitter::EmitLlvmIrDotIfProfitable() {
 
     KernelSupportLibrary::EmitAndCallOutlinedKernel(
         /*enable_fast_math=*/enable_fast_math,
-        /*optimize_for_size=*/optimize_for_size, ir_builder_,
-        config.GetCacheKey(), lhs_op, rhs_op,
+        /*optimize_for_size=*/optimize_for_size, b_, config.GetCacheKey(),
+        lhs_op, rhs_op,
         addend_array_ ? addend_array_->GetBasePointer() : nullptr, result_op,
         [this, config](llvm::Value* lhs_op, llvm::Value* rhs_op,
                        llvm::Value* addend_op, llvm::Value* result_op) {
-          RowMajorMatrixVectorProductEmitter emitter(
-              config, lhs_op, rhs_op, addend_op, result_op, ir_builder_);
+          RowMajorMatrixVectorProductEmitter emitter(config, lhs_op, rhs_op,
+                                                     addend_op, result_op, b_);
           emitter.Emit();
         });
   }
@@ -1285,11 +1277,11 @@ Status DotOpEmitter::Emit() {
   // Create loop nests which loop through the LHS operand dimensions and the RHS
   // operand dimensions. The reduction dimension of the LHS and RHS are handled
   // in a separate innermost loop which performs the sum of products.
-  llvm_ir::ForLoopNest loop_nest(llvm_ir::IrName(&dot_), ir_builder_);
-  llvm_ir::IrArray::Index lhs_index = EmitOperandArrayLoopNest(
-      &loop_nest, lhs_array_, lhs_reduction_dimension, "lhs");
-  llvm_ir::IrArray::Index rhs_index = EmitOperandArrayLoopNest(
-      &loop_nest, rhs_array_, rhs_reduction_dimension, "rhs");
+  llvm_ir::ForLoopNest loop_nest(llvm_ir::IrName(&dot_), b_);
+  llvm_ir::IrArray::Index lhs_index = loop_nest.EmitOperandArrayLoopNest(
+      lhs_array_, /*dimension_to_skip=*/lhs_reduction_dimension, "lhs");
+  llvm_ir::IrArray::Index rhs_index = loop_nest.EmitOperandArrayLoopNest(
+      rhs_array_, /*dimension_to_skip=*/rhs_reduction_dimension, "rhs");
 
   // Create the loop which does the sum of products reduction.
   //
@@ -1319,62 +1311,55 @@ Status DotOpEmitter::Emit() {
   // Function entry basic block.
   // - Emit alloca for accumulator
   llvm::Function* func = reduction_loop->GetPreheaderBasicBlock()->getParent();
-  SetToFirstInsertPoint(&func->getEntryBlock(), ir_builder_);
+  SetToFirstInsertPoint(&func->getEntryBlock(), b_);
   llvm::Type* accum_type = target_array_.GetElementLlvmType();
-  llvm::Value* accum_address = ir_builder_->CreateAlloca(
-      accum_type, /*ArraySize=*/nullptr, "accum_address");
+  llvm::Value* accum_address =
+      b_->CreateAlloca(accum_type, /*ArraySize=*/nullptr, "accum_address");
 
   // Preheader basic block of reduction loop:
   // - Initialize accumulator to zero.
   llvm::BasicBlock* preheader_bb = reduction_loop->GetPreheaderBasicBlock();
-  ir_builder_->SetInsertPoint(preheader_bb->getTerminator());
+  b_->SetInsertPoint(preheader_bb->getTerminator());
 
-  ir_builder_->CreateStore(llvm::Constant::getNullValue(accum_type),
-                           accum_address);
+  b_->CreateStore(llvm::Constant::getNullValue(accum_type), accum_address);
 
   // Body basic block of reduction loop:
   // - Load elements from lhs and rhs array.
   // - Multiply lhs-element and rhs-element.
   // - Load accumulator and add to product.
   // - Store sum back into accumulator.
-  SetToFirstInsertPoint(reduction_loop->GetBodyBasicBlock(), ir_builder_);
+  SetToFirstInsertPoint(reduction_loop->GetBodyBasicBlock(), b_);
 
-  llvm::Value* lhs_element =
-      lhs_array_.EmitReadArrayElement(lhs_index, ir_builder_);
-  llvm::Value* rhs_element =
-      rhs_array_.EmitReadArrayElement(rhs_index, ir_builder_);
+  llvm::Value* lhs_element = lhs_array_.EmitReadArrayElement(lhs_index, b_);
+  llvm::Value* rhs_element = rhs_array_.EmitReadArrayElement(rhs_index, b_);
 
-  llvm::Value* accum = ir_builder_->CreateLoad(accum_address);
+  llvm::Value* accum = b_->CreateLoad(accum_address);
   llvm::Value* updated_accum;
   if (ShapeUtil::ElementIsComplex(lhs_shape)) {
-    auto real = [&](llvm::Value* x) {
-      return ir_builder_->CreateExtractValue(x, {0});
-    };
-    auto imag = [&](llvm::Value* x) {
-      return ir_builder_->CreateExtractValue(x, {1});
-    };
-    llvm::Value* product_real = ir_builder_->CreateFSub(
-        ir_builder_->CreateFMul(real(lhs_element), real(rhs_element)),
-        ir_builder_->CreateFMul(imag(lhs_element), imag(rhs_element)));
-    llvm::Value* product_imag = ir_builder_->CreateFAdd(
-        ir_builder_->CreateFMul(real(lhs_element), imag(rhs_element)),
-        ir_builder_->CreateFMul(imag(lhs_element), real(rhs_element)));
-    updated_accum = ir_builder_->CreateInsertValue(
-        accum, ir_builder_->CreateFAdd(real(accum), product_real), {0});
-    updated_accum = ir_builder_->CreateInsertValue(
-        updated_accum, ir_builder_->CreateFAdd(imag(accum), product_imag), {1});
+    auto real = [&](llvm::Value* x) { return b_->CreateExtractValue(x, {0}); };
+    auto imag = [&](llvm::Value* x) { return b_->CreateExtractValue(x, {1}); };
+    llvm::Value* product_real =
+        b_->CreateFSub(b_->CreateFMul(real(lhs_element), real(rhs_element)),
+                       b_->CreateFMul(imag(lhs_element), imag(rhs_element)));
+    llvm::Value* product_imag =
+        b_->CreateFAdd(b_->CreateFMul(real(lhs_element), imag(rhs_element)),
+                       b_->CreateFMul(imag(lhs_element), real(rhs_element)));
+    updated_accum = b_->CreateInsertValue(
+        accum, b_->CreateFAdd(real(accum), product_real), {0});
+    updated_accum = b_->CreateInsertValue(
+        updated_accum, b_->CreateFAdd(imag(accum), product_imag), {1});
   } else {
-    llvm::Value* product = ir_builder_->CreateFMul(lhs_element, rhs_element);
-    updated_accum = ir_builder_->CreateFAdd(accum, product);
+    llvm::Value* product = b_->CreateFMul(lhs_element, rhs_element);
+    updated_accum = b_->CreateFAdd(accum, product);
   }
-  ir_builder_->CreateStore(updated_accum, accum_address);
+  b_->CreateStore(updated_accum, accum_address);
 
   // Exit basic block of reduction loop.
   // - Load accumulator value (the result).
   // - Store into output array.
-  SetToFirstInsertPoint(reduction_loop->GetExitBasicBlock(), ir_builder_);
+  SetToFirstInsertPoint(reduction_loop->GetExitBasicBlock(), b_);
 
-  llvm::Value* result = ir_builder_->CreateLoad(accum_address);
+  llvm::Value* result = b_->CreateLoad(accum_address);
 
   // Create index into target address. The target index is the concatenation of
   // the rhs and lhs indexes with the reduction dimensions removed. The terms
@@ -1392,11 +1377,11 @@ Status DotOpEmitter::Emit() {
     }
   }
 
-  target_array_.EmitWriteArrayElement(target_index, result, ir_builder_);
+  target_array_.EmitWriteArrayElement(target_index, result, b_);
 
   // Set the IR builder insert point to the exit basic block of the outer most
   // loop.
-  ir_builder_->SetInsertPoint(loop_nest.GetOuterLoopExitBasicBlock());
+  b_->SetInsertPoint(loop_nest.GetOuterLoopExitBasicBlock());
 
   return Status::OK();
 }
@@ -1405,31 +1390,30 @@ Status DotOpEmitter::EmitScalarDot() {
   // A scalar dot is just a scalar multiply.
   llvm::Value* result;
   // Use the same index_type for all tensor accesses in the same kernel.
-  llvm::Type* index_type = ir_builder_->getInt64Ty();
+  llvm::Type* index_type = b_->getInt64Ty();
   llvm_ir::IrArray::Index element_index(index_type);
   llvm::Value* lhs_value =
-      lhs_array_.EmitReadArrayElement(/*index=*/element_index, ir_builder_);
+      lhs_array_.EmitReadArrayElement(/*index=*/element_index, b_);
   llvm::Value* rhs_value =
-      rhs_array_.EmitReadArrayElement(/*index=*/element_index, ir_builder_);
+      rhs_array_.EmitReadArrayElement(/*index=*/element_index, b_);
   if (ShapeUtil::ElementIsComplex(lhs_array_.GetShape())) {
-#define REAL(x) ir_builder_->CreateExtractValue(x, {0})
-#define IMAG(x) ir_builder_->CreateExtractValue(x, {1})
-    llvm::Value* real = ir_builder_->CreateFSub(
-        ir_builder_->CreateFMul(REAL(lhs_value), REAL(rhs_value)),
-        ir_builder_->CreateFMul(IMAG(lhs_value), IMAG(rhs_value)));
-    llvm::Value* imag = ir_builder_->CreateFAdd(
-        ir_builder_->CreateFMul(REAL(lhs_value), IMAG(rhs_value)),
-        ir_builder_->CreateFMul(IMAG(lhs_value), REAL(rhs_value)));
+#define REAL(x) b_->CreateExtractValue(x, {0})
+#define IMAG(x) b_->CreateExtractValue(x, {1})
+    llvm::Value* real =
+        b_->CreateFSub(b_->CreateFMul(REAL(lhs_value), REAL(rhs_value)),
+                       b_->CreateFMul(IMAG(lhs_value), IMAG(rhs_value)));
+    llvm::Value* imag =
+        b_->CreateFAdd(b_->CreateFMul(REAL(lhs_value), IMAG(rhs_value)),
+                       b_->CreateFMul(IMAG(lhs_value), REAL(rhs_value)));
 #undef IMAG
 #undef REAL
     result = llvm::ConstantAggregateZero::get(lhs_array_.GetElementLlvmType());
-    result = ir_builder_->CreateInsertValue(result, real, {0});
-    result = ir_builder_->CreateInsertValue(result, imag, {1});
+    result = b_->CreateInsertValue(result, real, {0});
+    result = b_->CreateInsertValue(result, imag, {1});
   } else {
-    result = ir_builder_->CreateFMul(lhs_value, rhs_value);
+    result = b_->CreateFMul(lhs_value, rhs_value);
   }
-  target_array_.EmitWriteArrayElement(/*index=*/element_index, result,
-                                      ir_builder_);
+  target_array_.EmitWriteArrayElement(/*index=*/element_index, result, b_);
   return Status::OK();
 }
 
@@ -1452,7 +1436,7 @@ Status DotOpEmitter::EmitCallToRuntime() {
       fn_name = multi_threaded
                     ? runtime::kEigenMatMulF16SymbolName
                     : runtime::kEigenSingleThreadedMatMulF16SymbolName;
-      float_type = ir_builder_->getHalfTy();
+      float_type = b_->getHalfTy();
       break;
     case F32:
       fn_name = multi_threaded
@@ -1461,7 +1445,7 @@ Status DotOpEmitter::EmitCallToRuntime() {
                     : (use_mkl_dnn
                            ? runtime::kMKLSingleThreadedMatMulF32SymbolName
                            : runtime::kEigenSingleThreadedMatMulF32SymbolName);
-      float_type = ir_builder_->getFloatTy();
+      float_type = b_->getFloatTy();
       break;
     case F64:
       fn_name = multi_threaded
@@ -1470,7 +1454,7 @@ Status DotOpEmitter::EmitCallToRuntime() {
                     : (use_mkl_dnn
                            ? runtime::kMKLSingleThreadedMatMulF64SymbolName
                            : runtime::kEigenSingleThreadedMatMulF64SymbolName);
-      float_type = ir_builder_->getDoubleTy();
+      float_type = b_->getDoubleTy();
       break;
     default:
       return Unimplemented("Invalid type %s for dot operation",
@@ -1478,16 +1462,16 @@ Status DotOpEmitter::EmitCallToRuntime() {
   }
 
   llvm::Type* float_ptr_type = float_type->getPointerTo();
-  llvm::Type* int64_type = ir_builder_->getInt64Ty();
-  llvm::Type* int32_type = ir_builder_->getInt32Ty();
-  llvm::Type* int8_ptr_type = ir_builder_->getInt8Ty()->getPointerTo();
+  llvm::Type* int64_type = b_->getInt64Ty();
+  llvm::Type* int32_type = b_->getInt32Ty();
+  llvm::Type* int8_ptr_type = b_->getInt8Ty()->getPointerTo();
   llvm::FunctionType* matmul_type = llvm::FunctionType::get(
-      ir_builder_->getVoidTy(),
+      b_->getVoidTy(),
       {int8_ptr_type, float_ptr_type, float_ptr_type, float_ptr_type,
        int64_type, int64_type, int64_type, int32_type, int32_type},
       /*isVarArg=*/false);
 
-  llvm::Function* function = ir_builder_->GetInsertBlock()->getParent();
+  llvm::Function* function = b_->GetInsertBlock()->getParent();
   llvm::Module* module = function->getParent();
 
   llvm::Function* matmul_func = llvm::cast<llvm::Function>(
@@ -1522,18 +1506,15 @@ Status DotOpEmitter::EmitCallToRuntime() {
     std::swap(transpose_lhs, transpose_rhs);
   }
 
-  ir_builder_->CreateCall(
+  b_->CreateCall(
       matmul_func,
-      {ir_builder_->CreateBitCast(executable_run_options_value_, int8_ptr_type),
-       ir_builder_->CreateBitCast(target_array_.GetBasePointer(),
-                                  float_ptr_type),
-       ir_builder_->CreateBitCast(lhs->GetBasePointer(), float_ptr_type),
-       ir_builder_->CreateBitCast(rhs->GetBasePointer(), float_ptr_type),
-       ir_builder_->getInt64(mat_mult_dims.m),
-       ir_builder_->getInt64(mat_mult_dims.n),
-       ir_builder_->getInt64(mat_mult_dims.k),
-       ir_builder_->getInt32(transpose_lhs),
-       ir_builder_->getInt32(transpose_rhs)});
+      {b_->CreateBitCast(executable_run_options_value_, int8_ptr_type),
+       b_->CreateBitCast(target_array_.GetBasePointer(), float_ptr_type),
+       b_->CreateBitCast(lhs->GetBasePointer(), float_ptr_type),
+       b_->CreateBitCast(rhs->GetBasePointer(), float_ptr_type),
+       b_->getInt64(mat_mult_dims.m), b_->getInt64(mat_mult_dims.n),
+       b_->getInt64(mat_mult_dims.k), b_->getInt32(transpose_lhs),
+       b_->getInt32(transpose_rhs)});
   return Status::OK();
 }
 
@@ -1556,36 +1537,6 @@ DotOpEmitter::MatMultDims DotOpEmitter::GetMatMultDims() const {
       LayoutUtil::Minor(target_array_.GetShape().layout(), 0) == 0};
 }
 
-llvm_ir::IrArray::Index DotOpEmitter::EmitOperandArrayLoopNest(
-    llvm_ir::ForLoopNest* loop_nest, const llvm_ir::IrArray& operand_array,
-    int64 reduction_dimension, tensorflow::StringPiece name_suffix) {
-  // Prepares the dimension list we will use to emit the loop nest. Outermost
-  // loops are added first. Add loops in major-to-minor order, and skip the
-  // reduction dimension.
-  std::vector<int64> dimensions;
-  const Shape& shape = operand_array.GetShape();
-  for (int i = LayoutUtil::MinorToMajor(shape).size() - 1; i >= 0; --i) {
-    int64 dimension = LayoutUtil::Minor(shape.layout(), i);
-    if (dimension != reduction_dimension) {
-      dimensions.push_back(dimension);
-    }
-  }
-
-  // Create loop nest with one for-loop for each dimension of the
-  // output.
-  llvm_ir::IrArray::Index index =
-      loop_nest->AddLoopsForShapeOnDimensions(shape, dimensions, name_suffix);
-  // Verify every dimension except the reduction dimension was set in the index.
-  for (int dimension = 0; dimension < index.size(); ++dimension) {
-    if (dimension == reduction_dimension) {
-      DCHECK_EQ(nullptr, index[dimension]);
-    } else {
-      DCHECK_NE(nullptr, index[dimension]);
-    }
-  }
-  return index;
-}
-
 // Return whether the given shape is a matrix with no padding.
 static bool IsRank2WithNoPadding(const Shape& shape) {
   return ShapeUtil::Rank(shape) == 2 && !LayoutUtil::IsPadded(shape);
diff --git a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h
index ed2a18976a..590032fbe9 100644
--- a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h
+++ b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.h
@@ -61,7 +61,7 @@ class DotOpEmitter {
       const HloInstruction& dot, const llvm_ir::IrArray& target_array,
       const llvm_ir::IrArray& lhs_array, const llvm_ir::IrArray& rhs_array,
       const llvm_ir::IrArray* addend_array,
-      llvm::Value* executable_run_options_value, llvm::IRBuilder<>* ir_builder,
+      llvm::Value* executable_run_options_value, llvm::IRBuilder<>* b,
       const HloModuleConfig& hlo_module_config,
       const TargetMachineFeatures& target_machine_features);
 
@@ -70,8 +70,7 @@ class DotOpEmitter {
                const llvm_ir::IrArray& lhs_array,
                const llvm_ir::IrArray& rhs_array,
                const llvm_ir::IrArray* addend_array,
-               llvm::Value* executable_run_options_value,
-               llvm::IRBuilder<>* ir_builder,
+               llvm::Value* executable_run_options_value, llvm::IRBuilder<>* b,
                const HloModuleConfig& hlo_module_config,
                const TargetMachineFeatures& target_machine_features);
 
@@ -89,17 +88,6 @@ class DotOpEmitter {
   // Emits a call to the CPU runtime to perform the matrix multiply.
   Status EmitCallToRuntime();
 
-  // Emits a series of nested loops for iterating over an operand array in the
-  // dot operation. Loops are constructed in major to minor dimension layout
-  // order. No loop is emitted for the given reduction_dimension. The function
-  // returns an IrArray index for the given operand_array containing the indvars
-  // of the loops. All dimensions of the index are filled except for the
-  // reduction dimension. name_suffix is the string to append to the names of
-  // LLVM constructs (eg, basic blocks) constructed by this method.
-  llvm_ir::IrArray::Index EmitOperandArrayLoopNest(
-      llvm_ir::ForLoopNest* loop_nest, const llvm_ir::IrArray& operand_array,
-      int64 reduction_dimension, tensorflow::StringPiece name_suffix);
-
   // Represents the dimensions of a matrix-matrix multiply operation.
   struct MatMultDims {
     // The number of rows in the LHS.
@@ -171,7 +159,7 @@ class DotOpEmitter {
   const llvm_ir::IrArray& rhs_array_;
   const llvm_ir::IrArray* addend_array_;
   llvm::Value* executable_run_options_value_;
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   const HloModuleConfig& hlo_module_config_;
   const TargetMachineFeatures& target_machine_features_;
 };
diff --git a/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.cc b/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.cc
index e97113dfa0..c13d36776f 100644
--- a/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.cc
@@ -19,6 +19,8 @@ limitations under the License.
 
 #include "llvm/IR/Instructions.h"
 #include "llvm/IR/Module.h"
+#include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -38,8 +40,7 @@ StatusOr<llvm::Value*> CpuElementalIrEmitter::EmitFloatUnaryOp(
       switch (element_type) {
         case F16:
           cast_result_to_fp16 = true;
-          operand_value = ir_builder_->CreateFPCast(operand_value,
-                                                    ir_builder_->getFloatTy());
+          operand_value = b_->CreateFPCast(operand_value, b_->getFloatTy());
           TF_FALLTHROUGH_INTENDED;
         case F32:
           function_name = "tanhf";
@@ -59,9 +60,9 @@ StatusOr<llvm::Value*> CpuElementalIrEmitter::EmitFloatUnaryOp(
       function->setDoesNotThrow();
       function->setDoesNotAccessMemory();
       // Create an instruction to call the function.
-      llvm::Value* result = ir_builder_->CreateCall(function, operand_value);
+      llvm::Value* result = b_->CreateCall(function, operand_value);
       if (cast_result_to_fp16) {
-        result = ir_builder_->CreateFPCast(result, ir_builder_->getHalfTy());
+        result = b_->CreateFPCast(result, b_->getHalfTy());
       }
       return result;
     }
@@ -77,8 +78,8 @@ StatusOr<llvm::Value*> CpuElementalIrEmitter::EmitAtan2(
   switch (prim_type) {
     case F16:
       cast_result_to_fp16 = true;
-      lhs = ir_builder_->CreateFPCast(lhs, ir_builder_->getFloatTy());
-      rhs = ir_builder_->CreateFPCast(rhs, ir_builder_->getFloatTy());
+      lhs = b_->CreateFPCast(lhs, b_->getFloatTy());
+      rhs = b_->CreateFPCast(rhs, b_->getFloatTy());
       TF_FALLTHROUGH_INTENDED;
     case F32:
       function_name = "atan2f";
@@ -98,9 +99,9 @@ StatusOr<llvm::Value*> CpuElementalIrEmitter::EmitAtan2(
   function->setDoesNotThrow();
   function->setDoesNotAccessMemory();
   // Create an instruction to call the function.
-  llvm::Value* result = ir_builder_->CreateCall(function, {lhs, rhs});
+  llvm::Value* result = b_->CreateCall(function, {lhs, rhs});
   if (cast_result_to_fp16) {
-    result = ir_builder_->CreateFPCast(result, ir_builder_->getHalfTy());
+    result = b_->CreateFPCast(result, b_->getHalfTy());
   }
   return result;
 }
@@ -118,9 +119,8 @@ llvm_ir::ElementGenerator CpuElementalIrEmitter::MakeElementGenerator(
                                 ElementwiseSourceIndex(index, *hlo, i)));
         operands.push_back(operand_value);
       }
-      return ir_emitter_->EmitScalarCall(hlo->shape().element_type(),
-                                         hlo->to_apply(), operands,
-                                         llvm_ir::IrName(hlo));
+      return ir_emitter_->EmitElementalMap(*Cast<HloMapInstruction>(hlo),
+                                           operands, llvm_ir::IrName(hlo));
     };
   }
   return ElementalIrEmitter::MakeElementGenerator(hlo, operand_to_generator);
diff --git a/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.h b/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.h
index 4446dfd282..9598a886ab 100644
--- a/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.h
+++ b/tensorflow/compiler/xla/service/cpu/elemental_ir_emitter.h
@@ -31,7 +31,7 @@ class CpuElementalIrEmitter : public ElementalIrEmitter {
  public:
   CpuElementalIrEmitter(const HloModuleConfig& module_config,
                         IrEmitter* ir_emitter, llvm::Module* module)
-      : ElementalIrEmitter(module_config, module, ir_emitter->ir_builder()),
+      : ElementalIrEmitter(module_config, module, ir_emitter->b()),
         ir_emitter_(ir_emitter) {}
 
   llvm_ir::ElementGenerator MakeElementGenerator(
diff --git a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
index 6b66a4b0b7..ca645d3f1d 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
@@ -51,10 +51,11 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/dynamic_update_slice_util.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
-#include "tensorflow/compiler/xla/service/llvm_ir/ops.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -89,14 +90,14 @@ IrEmitter::IrEmitter(
     : assignment_(assignment),
       module_(llvm_module),
       arch_type_(llvm::Triple(llvm_module->getTargetTriple()).getArch()),
-      ir_builder_(llvm_module->getContext()),
+      b_(llvm_module->getContext()),
       instruction_to_profile_idx_(std::move(instruction_to_profile_idx)),
       computation_to_profile_idx_(std::move(computation_to_profile_idx)),
       alias_analysis_(hlo_module, assignment, &llvm_module->getContext()),
       hlo_module_config_(hlo_module.config()),
       is_top_level_computation_(false),
       target_machine_features_(*target_machine_features) {
-  ir_builder_.setFastMathFlags(llvm_ir::GetFastMathFlags(
+  b_.setFastMathFlags(llvm_ir::GetFastMathFlags(
       /*fast_math_enabled=*/hlo_module_config_.debug_options()
           .xla_enable_fast_math()));
 }
@@ -115,6 +116,19 @@ StatusOr<llvm::Function*> IrEmitter::EmitComputation(
         computation->root_instruction()->outer_dimension_partitions().size();
   }
 
+  if (computation->root_instruction()->opcode() != HloOpcode::kOutfeed) {
+    TF_ASSIGN_OR_RETURN(
+        computation_root_allocation_,
+        assignment_.GetUniqueTopLevelSlice(computation->root_instruction()));
+  }
+
+  for (const HloInstruction* param : computation->parameter_instructions()) {
+    TF_ASSIGN_OR_RETURN(BufferAllocation::Slice param_slice,
+                        assignment_.GetUniqueTopLevelSlice(param));
+    computation_parameter_allocations_[param_slice.allocation()->index()] =
+        param->parameter_number();
+  }
+
   InitializeIrFunction(function_name);
   // The rdtscp instruction is x86 specific.  We will fallback to LLVM's generic
   // readcyclecounter if it is unavailable.
@@ -131,6 +145,8 @@ StatusOr<llvm::Function*> IrEmitter::EmitComputation(
   // Delete 'compute_function', finalizing 'ir_function' and restoring caller
   // IR insert point.
   compute_function_.reset();
+  computation_root_allocation_ = BufferAllocation::Slice();
+  computation_parameter_allocations_.clear();
   return ir_function;
 }
 
@@ -146,7 +162,7 @@ void IrEmitter::InitializeIrFunction(const string& function_name) {
       new IrFunction(function_name, linkage,
                      options::OptimizeForSizeRequested(hlo_module_config_),
                      hlo_module_config_.debug_options().xla_enable_fast_math(),
-                     module_, &ir_builder_, num_dynamic_loop_bounds_));
+                     module_, &b_, num_dynamic_loop_bounds_));
 }
 
 IrEmitter::~IrEmitter() {}
@@ -154,9 +170,9 @@ IrEmitter::~IrEmitter() {}
 Status IrEmitter::HandleBitcast(HloInstruction* bitcast) {
   VLOG(2) << "HandleBitcast: " << bitcast->ToString();
   emitted_value_[bitcast] =
-      ir_builder_.CreateBitCast(GetEmittedValueFor(bitcast->operand(0)),
-                                IrShapeType(bitcast->shape())->getPointerTo(),
-                                AsStringRef(IrName(bitcast)));
+      b_.CreateBitCast(GetEmittedValueFor(bitcast->operand(0)),
+                       IrShapeType(bitcast->shape())->getPointerTo(),
+                       AsStringRef(IrName(bitcast)));
   return Status::OK();
 }
 
@@ -175,25 +191,36 @@ llvm::Constant* IrEmitter::EmitGlobalForLiteral(const Literal& literal) {
       result_global, IrShapeType(literal.shape())->getPointerTo());
 }
 
-Status IrEmitter::HandleConstant(HloInstruction* constant) {
-  VLOG(2) << "HandleConstant: " << constant->ToString();
-  const Literal& literal = constant->literal();
-  llvm::Constant* global_for_const;
+Status IrEmitter::EmitConstantGlobals() {
+  for (const BufferAllocation& allocation : assignment_.Allocations()) {
+    if (!allocation.is_constant()) {
+      continue;
+    }
 
-  auto it = emitted_literals_.find(&literal);
-  if (it != emitted_literals_.end()) {
-    global_for_const = it->second;
-  } else {
-    global_for_const = EmitGlobalForLiteral(literal);
-    emitted_literals_[&literal] = global_for_const;
+    const Literal& literal = llvm_ir::LiteralForConstantAllocation(allocation);
+    llvm::Constant* global_for_const;
+    auto it = emitted_literals_.find(&literal);
+    if (it != emitted_literals_.end()) {
+      global_for_const = it->second;
+    } else {
+      global_for_const = EmitGlobalForLiteral(literal);
+      InsertOrDie(&emitted_literals_, &literal, global_for_const);
+    }
+
+    InsertOrDie(&constant_buffer_to_global_, allocation.index(),
+                global_for_const);
   }
-  emitted_value_[constant] = global_for_const;
-  VLOG(2) << "  emitted value: " << llvm_ir::DumpToString(*global_for_const);
-  VLOG(2) << "  its type: "
-          << llvm_ir::DumpToString(*global_for_const->getType());
+
   return Status::OK();
 }
 
+Status IrEmitter::HandleConstant(HloInstruction* constant) {
+  VLOG(2) << "HandleConstant: " << constant->ToString();
+  // IrEmitter::EmitConstantGlobals has already taken care of emitting the body
+  // of the constant.
+  return EmitTargetAddressForOp(constant);
+}
+
 Status IrEmitter::HandleCopy(HloInstruction* copy) {
   if (ShapeUtil::IsTuple(copy->shape())) {
     // kCopy shallow copies a tuple so just memcpy the top-level buffer.
@@ -273,27 +300,30 @@ Status IrEmitter::HandleGetTupleElement(HloInstruction* get_tuple_element) {
   const Shape& shape = get_tuple_element->shape();
   emitted_value_[get_tuple_element] = llvm_ir::EmitGetTupleElement(
       shape, get_tuple_element->tuple_index(), MinimumAlignmentForShape(shape),
-      GetEmittedValueFor(operand), &ir_builder_, module_);
+      GetEmittedValueFor(operand), &b_, module_);
   return Status::OK();
 }
 
 Status IrEmitter::HandleSelect(HloInstruction* select) {
   auto pred = select->operand(0);
-  auto on_true = select->operand(1);
-  auto on_false = select->operand(2);
   TF_RET_CHECK(pred->shape().element_type() == PRED);
-
-  if (ShapeUtil::IsTuple(select->shape())) {
-    TF_RETURN_IF_ERROR(EmitTargetAddressForOp(select));
-    llvm_ir::EmitTupleSelect(
-        GetIrArrayFor(select), GetIrArrayFor(pred), GetEmittedValueFor(on_true),
-        GetEmittedValueFor(on_false), &ir_builder_, module_);
-    return Status::OK();
-  }
-
   return DefaultAction(select);
 }
 
+Status IrEmitter::HandleTupleSelect(HloInstruction* tuple_select) {
+  auto pred = tuple_select->operand(0);
+  auto on_true = tuple_select->operand(1);
+  auto on_false = tuple_select->operand(2);
+  TF_RET_CHECK(pred->shape().element_type() == PRED);
+  TF_RET_CHECK(ShapeUtil::IsScalar(pred->shape()));
+  TF_RET_CHECK(ShapeUtil::IsTuple(tuple_select->shape()));
+  TF_RETURN_IF_ERROR(EmitTargetAddressForOp(tuple_select));
+  llvm_ir::EmitTupleSelect(GetIrArrayFor(tuple_select), GetIrArrayFor(pred),
+                           GetEmittedValueFor(on_true),
+                           GetEmittedValueFor(on_false), &b_, module_);
+  return Status::OK();
+}
+
 Status IrEmitter::HandleInfeed(HloInstruction* instruction) {
   HloInfeedInstruction* infeed = Cast<HloInfeedInstruction>(instruction);
   VLOG(2) << "HandleInfeed: " << infeed->ToString();
@@ -313,8 +343,8 @@ Status IrEmitter::HandleInfeed(HloInstruction* instruction) {
                       assignment_.GetUniqueSlice(infeed, {1}));
   llvm::Value* token_address = EmitTempBufferPointer(
       token_slice, ShapeUtil::GetTupleElementShape(infeed->shape(), 1));
-  llvm_ir::EmitTuple(GetIrArrayFor(infeed), {data_address, token_address},
-                     &ir_builder_, module_);
+  llvm_ir::EmitTuple(GetIrArrayFor(infeed), {data_address, token_address}, &b_,
+                     module_);
 
   if (ShapeUtil::IsTuple(data_shape)) {
     TF_RET_CHECK(!ShapeUtil::IsNestedTuple(data_shape));
@@ -345,7 +375,7 @@ Status IrEmitter::HandleInfeed(HloInstruction* instruction) {
     }
 
     llvm_ir::EmitTuple(llvm_ir::IrArray(data_address, data_shape),
-                       tuple_element_addresses, &ir_builder_, module_);
+                       tuple_element_addresses, &b_, module_);
   } else {
     TF_RETURN_IF_ERROR(
         EmitXfeedTransfer(XfeedKind::kInfeed, data_shape, data_address));
@@ -366,14 +396,14 @@ Status IrEmitter::EmitXfeedTransfer(XfeedKind kind, const Shape& shape,
   int32 length_32 = static_cast<int32>(length);
 
   int32 shape_length;
-  TF_ASSIGN_OR_RETURN(llvm::Value * shape_ptr,
-                      llvm_ir::EncodeSelfDescribingShapeConstant(
-                          shape, &shape_length, &ir_builder_));
+  TF_ASSIGN_OR_RETURN(
+      llvm::Value * shape_ptr,
+      llvm_ir::EncodeSelfDescribingShapeConstant(shape, &shape_length, &b_));
 
   // The signature of the acquire infeed buffer function is:
   //
   //   (void*)(int32 length);
-  llvm::Type* int32_type = ir_builder_.getInt32Ty();
+  llvm::Type* int32_type = b_.getInt32Ty();
   llvm::Type* i8_ptr_type = llvm::Type::getInt8PtrTy(module_->getContext());
   llvm::FunctionType* acquire_type = llvm::FunctionType::get(
       i8_ptr_type, {int32_type, i8_ptr_type, int32_type},
@@ -393,8 +423,7 @@ Status IrEmitter::EmitXfeedTransfer(XfeedKind kind, const Shape& shape,
   //
   //   (void)(int32 length, void* buffer);
   llvm::FunctionType* release_type = llvm::FunctionType::get(
-      ir_builder_.getVoidTy(),
-      {int32_type, i8_ptr_type, i8_ptr_type, int32_type},
+      b_.getVoidTy(), {int32_type, i8_ptr_type, i8_ptr_type, int32_type},
       /*isVarArg=*/false);
 
   llvm::Function* release_func;
@@ -411,25 +440,22 @@ Status IrEmitter::EmitXfeedTransfer(XfeedKind kind, const Shape& shape,
   // of size exactly 'length_32', and the runtime is responsible for
   // check-failing the process if there is a mismatch, versus passing us back a
   // buffer that we might overrun.
-  llvm::Value* acquired_pointer = ir_builder_.CreateCall(
-      acquire_func, {ir_builder_.getInt32(length_32), shape_ptr,
-                     ir_builder_.getInt32(shape_length)});
+  llvm::Value* acquired_pointer = b_.CreateCall(
+      acquire_func,
+      {b_.getInt32(length_32), shape_ptr, b_.getInt32(shape_length)});
 
   if (kind == XfeedKind::kInfeed) {
     // Copy to the program buffer address from the acquired buffer.
-    ir_builder_.CreateMemCpy(program_buffer_address, /*DstAlign=*/1,
-                             acquired_pointer,
-                             /*SrcAlign=*/1, length_32);
+    b_.CreateMemCpy(program_buffer_address, /*DstAlign=*/1, acquired_pointer,
+                    /*SrcAlign=*/1, length_32);
   } else {
     // Outfeed -- copy from the in-program address to the acquired buffer.
-    ir_builder_.CreateMemCpy(acquired_pointer, /*DstAlign=*/1,
-                             program_buffer_address,
-                             /*SrcAlign=*/1, length_32);
+    b_.CreateMemCpy(acquired_pointer, /*DstAlign=*/1, program_buffer_address,
+                    /*SrcAlign=*/1, length_32);
   }
 
-  ir_builder_.CreateCall(release_func,
-                         {ir_builder_.getInt32(length_32), acquired_pointer,
-                          shape_ptr, ir_builder_.getInt32(shape_length)});
+  b_.CreateCall(release_func, {b_.getInt32(length_32), acquired_pointer,
+                               shape_ptr, b_.getInt32(shape_length)});
 
   return Status::OK();
 }
@@ -450,7 +476,7 @@ Status IrEmitter::HandleOutfeed(HloInstruction* outfeed) {
         ShapeUtil::GetTupleElementShape(operand_shape, i);
     llvm::Value* tuple_element = llvm_ir::EmitGetTupleElement(
         tuple_element_shape, i, MinimumAlignmentForShape(tuple_element_shape),
-        value, &ir_builder_, module_);
+        value, &b_, module_);
     TF_RETURN_IF_ERROR(EmitXfeedTransfer(XfeedKind::kOutfeed,
                                          tuple_element_shape, tuple_element));
   }
@@ -469,46 +495,96 @@ Status IrEmitter::HandleTuple(HloInstruction* tuple) {
   for (auto operand : tuple->operands()) {
     base_ptrs.push_back(GetEmittedValueFor(operand));
   }
-  llvm_ir::EmitTuple(GetIrArrayFor(tuple), base_ptrs, &ir_builder_, module_);
+  llvm_ir::EmitTuple(GetIrArrayFor(tuple), base_ptrs, &b_, module_);
   return Status::OK();
 }
 
-Status IrEmitter::HandleMap(HloInstruction* map) {
-  gtl::ArraySlice<HloInstruction*> operands(map->operands());
-  HloComputation* function = map->to_apply();
-  // The called computation should have been emitted previously.
-  llvm::Function* mapped_ir_function = FindOrDie(emitted_functions_, function);
+llvm::Value* IrEmitter::EmitElementalMap(
+    const HloMapInstruction& map_instr,
+    tensorflow::gtl::ArraySlice<llvm::Value*> elemental_operands,
+    tensorflow::StringPiece name) {
+  return EmitThreadLocalCall(*map_instr.to_apply(), elemental_operands, name);
+}
+
+StatusOr<llvm::Value*> IrEmitter::EmitTargetElementLoopBodyForReduceWindow(
+    HloReduceWindowInstruction* reduce_window,
+    const llvm_ir::IrArray::Index& index) {
+  const HloInstruction* operand = reduce_window->operand(0);
+  const Window& window = reduce_window->window();
+
+  // We fold inputs into the accumulator and initialize it to
+  // the initial value on the reduce_window.
+  PrimitiveType operand_element_type = operand->shape().element_type();
+  llvm::Value* accumulator_address = llvm_ir::EmitAllocaAtFunctionEntry(
+      llvm_ir::PrimitiveTypeToIrType(operand_element_type, module_),
+      "reduce_window_accumulator_address", &b_,
+      MinimumAlignmentForPrimitiveType(operand_element_type));
+  b_.CreateStore(b_.CreateLoad(GetEmittedValueFor(reduce_window->operand(1))),
+                 accumulator_address);
 
-  return EmitTargetElementLoop(map, [this, map, operands, mapped_ir_function](
-                                        const llvm_ir::IrArray::Index& index) {
-    std::vector<llvm::Value*> parameter_addresses;
-    for (const HloInstruction* operand : operands) {
-      const llvm_ir::IrArray& array = GetIrArrayFor(operand);
-      parameter_addresses.push_back(
-          array.EmitArrayElementAddress(index, &ir_builder_));
+  llvm_ir::ForLoopNest loops(IrName(reduce_window, "inner"), &b_);
+  std::vector<int64> window_size;
+  for (const auto& dim : window.dimensions()) {
+    window_size.push_back(dim.size());
+  }
+  const llvm_ir::IrArray::Index window_index = loops.AddLoopsForShape(
+      ShapeUtil::MakeShape(operand_element_type, window_size), "window");
+  CHECK_EQ(window_index.size(), index.size());
+
+  SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &b_);
+
+  llvm_ir::IrArray::Index input_index(b_.getInt64Ty(), index.size());
+  llvm::Value* in_bounds_condition = nullptr;
+  for (size_t i = 0; i < index.size(); ++i) {
+    llvm::Value* strided_index =
+        b_.CreateNSWMul(index[i], b_.getInt64(window.dimensions(i).stride()));
+    input_index[i] =
+        b_.CreateNSWSub(b_.CreateNSWAdd(strided_index, window_index[i]),
+                        b_.getInt64(window.dimensions(i).padding_low()));
+
+    // We need to check if 0 <= input_index[i] < bound, as otherwise we are in
+    // the padding so that we can skip the computation. That is equivalent to
+    // input_index[i] < bound as an *unsigned* comparison, since a negative
+    // value will wrap to a large positive value.
+    llvm::Value* index_condition = b_.CreateICmpULT(
+        input_index[i],
+        b_.getInt64(ShapeUtil::GetDimension(operand->shape(), i)));
+    if (in_bounds_condition == nullptr) {
+      in_bounds_condition = index_condition;
+    } else {
+      in_bounds_condition = b_.CreateAnd(in_bounds_condition, index_condition);
     }
-    return EmitElementFunctionCall(mapped_ir_function, map->shape(),
-                                   parameter_addresses, "map_function");
-  });
+  }
+  CHECK(in_bounds_condition != nullptr);
+
+  llvm_ir::LlvmIfData if_data =
+      llvm_ir::EmitIfThenElse(in_bounds_condition, "in-bounds", &b_);
+  SetToFirstInsertPoint(if_data.true_block, &b_);
+
+  // We are not in the padding, so carry out the computation.
+  llvm_ir::IrArray input_array(GetIrArrayFor(operand));
+  llvm::Value* input_value = input_array.EmitReadArrayElement(input_index, &b_);
+  llvm::Value* result = EmitThreadLocalCall(
+      *reduce_window->to_apply(),
+      {b_.CreateLoad(accumulator_address), input_value}, "reducer_function");
+  b_.CreateStore(result, accumulator_address);
+
+  SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &b_);
+  return b_.CreateLoad(accumulator_address);
 }
 
 Status IrEmitter::HandleReduceWindow(HloInstruction* reduce_window) {
-  auto operand = reduce_window->operand(0);
-  const Window& window = reduce_window->window();
-  HloComputation* function = reduce_window->to_apply();
   TF_RETURN_IF_ERROR(ElementTypesSameAndSupported(
-      /*instruction=*/*reduce_window, /*operands=*/{operand},
+      /*instruction=*/*reduce_window,
+      /*operands=*/{reduce_window->operand(0)},
       /*supported_types=*/{F32, BF16, S32}));
 
   // TODO(b/31410564): Implement dilation for reduce-window.
-  if (window_util::HasDilation(window)) {
+  if (window_util::HasDilation(reduce_window->window())) {
     return Unimplemented(
         "Dilation for ReduceWindow is not implemented on CPU.");
   }
 
-  // The called computation should have been emitted previously.
-  llvm::Function* reducer_function = FindOrDie(emitted_functions_, function);
-
   // Pseudo code for reduce window:
   //
   //   for (coordinates O in the output)
@@ -523,73 +599,9 @@ Status IrEmitter::HandleReduceWindow(HloInstruction* reduce_window) {
   // This is completely un-optimized and just here to have something
   // that works.
   return EmitTargetElementLoop(
-      reduce_window, [this, reduce_window, operand, window,
-                      reducer_function](const llvm_ir::IrArray::Index& index) {
-        // We fold inputs into the accumulator and initialize it to
-        // the initial value on the reduce_window.
-        PrimitiveType operand_element_type = operand->shape().element_type();
-        llvm::Value* accumulator_address = llvm_ir::EmitAllocaAtFunctionEntry(
-            llvm_ir::PrimitiveTypeToIrType(operand_element_type, module_),
-            "reduce_window_accumulator_address", &ir_builder_,
-            MinimumAlignmentForPrimitiveType(operand_element_type));
-        ir_builder_.CreateStore(ir_builder_.CreateLoad(GetEmittedValueFor(
-                                    reduce_window->operand(1))),
-                                accumulator_address);
-
-        llvm_ir::ForLoopNest loops(IrName(reduce_window, "inner"),
-                                   &ir_builder_);
-        std::vector<int64> window_size;
-        for (const auto& dim : window.dimensions()) {
-          window_size.push_back(dim.size());
-        }
-        const llvm_ir::IrArray::Index window_index = loops.AddLoopsForShape(
-            ShapeUtil::MakeShape(operand_element_type, window_size), "window");
-        CHECK_EQ(window_index.size(), index.size());
-
-        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &ir_builder_);
-
-        llvm_ir::IrArray::Index input_index(ir_builder_.getInt64Ty(),
-                                            index.size());
-        llvm::Value* in_bounds_condition = nullptr;
-        for (size_t i = 0; i < index.size(); ++i) {
-          llvm::Value* strided_index = ir_builder_.CreateNSWMul(
-              index[i], ir_builder_.getInt64(window.dimensions(i).stride()));
-          input_index[i] = ir_builder_.CreateNSWSub(
-              ir_builder_.CreateNSWAdd(strided_index, window_index[i]),
-              ir_builder_.getInt64(window.dimensions(i).padding_low()));
-
-          // We need to check if 0 <= input_index[i] < bound, as
-          // otherwise we are in the padding so that we can skip the
-          // computation. That is equivalent to input_index[i] < bound
-          // as an *unsigned* comparison, since a negative value will
-          // wrap to a large positive value.
-          llvm::Value* index_condition = ir_builder_.CreateICmpULT(
-              input_index[i], ir_builder_.getInt64(ShapeUtil::GetDimension(
-                                  operand->shape(), i)));
-          if (in_bounds_condition == nullptr) {
-            in_bounds_condition = index_condition;
-          } else {
-            in_bounds_condition =
-                ir_builder_.CreateAnd(in_bounds_condition, index_condition);
-          }
-        }
-        CHECK(in_bounds_condition != nullptr);
-
-        llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
-            in_bounds_condition, "in-bounds", &ir_builder_);
-        SetToFirstInsertPoint(if_data.true_block, &ir_builder_);
-
-        // We are not in the padding, so carry out the computation.
-        llvm_ir::IrArray input_array(GetIrArrayFor(operand));
-        llvm::Value* input_value_address =
-            input_array.EmitArrayElementAddress(input_index, &ir_builder_);
-        llvm::Value* result = EmitElementFunctionCall(
-            reducer_function, reduce_window->shape(),
-            {accumulator_address, input_value_address}, "reducer_function");
-        ir_builder_.CreateStore(result, accumulator_address);
-
-        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &ir_builder_);
-        return ir_builder_.CreateLoad(accumulator_address);
+      reduce_window, [&](const llvm_ir::IrArray::Index& index) {
+        return EmitTargetElementLoopBodyForReduceWindow(
+            Cast<HloReduceWindowInstruction>(reduce_window), index);
       });
 }
 
@@ -610,12 +622,6 @@ Status IrEmitter::HandleSelectAndScatter(HloInstruction* select_and_scatter) {
         "Dilation for SelectAndScatter is not implemented on CPU. ");
   }
 
-  // The select and scatter computations should have been emitted previously.
-  llvm::Function* select_function =
-      FindOrDie(emitted_functions_, select_and_scatter->select());
-  llvm::Function* scatter_function =
-      FindOrDie(emitted_functions_, select_and_scatter->scatter());
-
   // Pseudo code for select-and-scatter:
   //
   // initialized_flag is initially off for every window, and is turned on after
@@ -641,141 +647,128 @@ Status IrEmitter::HandleSelectAndScatter(HloInstruction* select_and_scatter) {
       select_and_scatter, /*desc=*/IrName(select_and_scatter, "init"),
       [this, init_value](const llvm_ir::IrArray::Index& target_index) {
         llvm::Value* init_value_addr = GetEmittedValueFor(init_value);
-        return ir_builder_.CreateLoad(init_value_addr);
+        return b_.CreateLoad(init_value_addr);
       }));
 
   // Create a loop to iterate over the source array to scatter to the output.
-  llvm_ir::ForLoopNest source_loops(IrName(select_and_scatter), &ir_builder_);
+  llvm_ir::ForLoopNest source_loops(IrName(select_and_scatter), &b_);
   const llvm_ir::IrArray::Index source_index =
       source_loops.AddLoopsForShape(source->shape(), "source");
-  SetToFirstInsertPoint(source_loops.GetInnerLoopBodyBasicBlock(),
-                        &ir_builder_);
+  SetToFirstInsertPoint(source_loops.GetInnerLoopBodyBasicBlock(), &b_);
 
   // Allocate space to keep the currently selected value, its index, and
   // the boolean initialized_flag, which is initially set to false.
   llvm::Value* selected_value_address = llvm_ir::EmitAllocaAtFunctionEntry(
       llvm_ir::PrimitiveTypeToIrType(operand_element_type, module_),
-      "selected_value_address", &ir_builder_,
+      "selected_value_address", &b_,
       MinimumAlignmentForPrimitiveType(operand_element_type));
   llvm::Value* selected_index_address =
       llvm_ir::EmitAllocaAtFunctionEntryWithCount(
-          ir_builder_.getInt64Ty(), ir_builder_.getInt32(rank),
-          "selected_index_address", &ir_builder_);
+          b_.getInt64Ty(), b_.getInt32(rank), "selected_index_address", &b_);
   llvm::Value* initialized_flag_address = llvm_ir::EmitAllocaAtFunctionEntry(
-      ir_builder_.getInt1Ty(), "initialized_flag_address", &ir_builder_);
-  ir_builder_.CreateStore(ir_builder_.getInt1(false), initialized_flag_address);
+      b_.getInt1Ty(), "initialized_flag_address", &b_);
+  b_.CreateStore(b_.getInt1(false), initialized_flag_address);
 
   // Create the inner loop to iterate over the window.
-  llvm_ir::ForLoopNest window_loops(IrName(select_and_scatter, "window"),
-                                    &ir_builder_);
+  llvm_ir::ForLoopNest window_loops(IrName(select_and_scatter, "window"), &b_);
   std::vector<int64> window_size;
   for (const auto& dim : window.dimensions()) {
     window_size.push_back(dim.size());
   }
   const llvm_ir::IrArray::Index window_index = window_loops.AddLoopsForShape(
       ShapeUtil::MakeShape(operand_element_type, window_size), "window");
-  SetToFirstInsertPoint(window_loops.GetInnerLoopBodyBasicBlock(),
-                        &ir_builder_);
+  SetToFirstInsertPoint(window_loops.GetInnerLoopBodyBasicBlock(), &b_);
 
   // Compute the operand index to visit and evaluate the condition whether the
   // operand index is within the bounds. The unsigned comparison includes
   // checking whether the operand index >= 0.
-  llvm_ir::IrArray::Index operand_index(ir_builder_.getInt64Ty(),
-                                        source_index.size());
-  llvm::Value* in_bounds_condition = ir_builder_.getTrue();
+  llvm_ir::IrArray::Index operand_index(b_.getInt64Ty(), source_index.size());
+  llvm::Value* in_bounds_condition = b_.getTrue();
   for (int64 i = 0; i < rank; ++i) {
-    llvm::Value* strided_index = ir_builder_.CreateNSWMul(
-        source_index[i], ir_builder_.getInt64(window.dimensions(i).stride()));
-    operand_index[i] = ir_builder_.CreateNSWSub(
-        ir_builder_.CreateNSWAdd(strided_index, window_index[i]),
-        ir_builder_.getInt64(window.dimensions(i).padding_low()));
-    llvm::Value* index_condition = ir_builder_.CreateICmpULT(
+    llvm::Value* strided_index = b_.CreateNSWMul(
+        source_index[i], b_.getInt64(window.dimensions(i).stride()));
+    operand_index[i] =
+        b_.CreateNSWSub(b_.CreateNSWAdd(strided_index, window_index[i]),
+                        b_.getInt64(window.dimensions(i).padding_low()));
+    llvm::Value* index_condition = b_.CreateICmpULT(
         operand_index[i],
-        ir_builder_.getInt64(ShapeUtil::GetDimension(operand->shape(), i)));
-    in_bounds_condition =
-        ir_builder_.CreateAnd(in_bounds_condition, index_condition);
+        b_.getInt64(ShapeUtil::GetDimension(operand->shape(), i)));
+    in_bounds_condition = b_.CreateAnd(in_bounds_condition, index_condition);
   }
   CHECK(in_bounds_condition != nullptr);
 
   // Only need to do something if the operand index is within the bounds. First
   // check if the initialized_flag is set.
   llvm_ir::LlvmIfData if_in_bounds =
-      llvm_ir::EmitIfThenElse(in_bounds_condition, "in-bounds", &ir_builder_);
-  SetToFirstInsertPoint(if_in_bounds.true_block, &ir_builder_);
-  llvm_ir::LlvmIfData if_initialized =
-      llvm_ir::EmitIfThenElse(ir_builder_.CreateLoad(initialized_flag_address),
-                              "initialized", &ir_builder_);
+      llvm_ir::EmitIfThenElse(in_bounds_condition, "in-bounds", &b_);
+  SetToFirstInsertPoint(if_in_bounds.true_block, &b_);
+  llvm_ir::LlvmIfData if_initialized = llvm_ir::EmitIfThenElse(
+      b_.CreateLoad(initialized_flag_address), "initialized", &b_);
 
   // If the initialized_flag is false, initialize the selected value and index
   // with the currently visiting operand.
-  SetToFirstInsertPoint(if_initialized.false_block, &ir_builder_);
+  SetToFirstInsertPoint(if_initialized.false_block, &b_);
   const auto save_operand_index =
       [&](const llvm_ir::IrArray::Index& operand_index) {
         for (int64 i = 0; i < rank; ++i) {
           llvm::Value* selected_index_address_slot =
-              ir_builder_.CreateInBoundsGEP(selected_index_address,
-                                            {ir_builder_.getInt32(i)});
-          ir_builder_.CreateStore(operand_index[i],
-                                  selected_index_address_slot);
+              b_.CreateInBoundsGEP(selected_index_address, {b_.getInt32(i)});
+          b_.CreateStore(operand_index[i], selected_index_address_slot);
         }
       };
   llvm_ir::IrArray operand_array(GetIrArrayFor(operand));
   llvm::Value* operand_data =
-      operand_array.EmitReadArrayElement(operand_index, &ir_builder_);
-  ir_builder_.CreateStore(operand_data, selected_value_address);
+      operand_array.EmitReadArrayElement(operand_index, &b_);
+  b_.CreateStore(operand_data, selected_value_address);
   save_operand_index(operand_index);
-  ir_builder_.CreateStore(ir_builder_.getInt1(true), initialized_flag_address);
+  b_.CreateStore(b_.getInt1(true), initialized_flag_address);
 
   // If the initialized_flag is true, call the `select` function to potentially
   // update the selected value and index with the currently visiting operand.
-  SetToFirstInsertPoint(if_initialized.true_block, &ir_builder_);
-  const Shape output_shape = ShapeUtil::MakeShape(PRED, {});
+  SetToFirstInsertPoint(if_initialized.true_block, &b_);
   llvm::Value* operand_address =
-      operand_array.EmitArrayElementAddress(operand_index, &ir_builder_);
-  llvm::Value* result = EmitElementFunctionCall(
-      select_function, output_shape, {selected_value_address, operand_address},
+      operand_array.EmitArrayElementAddress(operand_index, &b_);
+  llvm::Value* operand_element = b_.CreateLoad(operand_address);
+  llvm::Value* result = EmitThreadLocalCall(
+      *select_and_scatter->select(),
+      {b_.CreateLoad(selected_value_address), operand_element},
       "select_function");
 
   // If the 'select' function returns false, update the selected value and the
   // index to the currently visiting operand.
-  llvm::Value* cond = ir_builder_.CreateICmpNE(
+  llvm::Value* cond = b_.CreateICmpNE(
       result,
       llvm::ConstantInt::get(llvm_ir::PrimitiveTypeToIrType(PRED, module_), 0),
       "boolean_predicate");
   llvm_ir::LlvmIfData if_select_lhs =
-      llvm_ir::EmitIfThenElse(cond, "if-select-lhs", &ir_builder_);
-  SetToFirstInsertPoint(if_select_lhs.false_block, &ir_builder_);
-  ir_builder_.CreateStore(ir_builder_.CreateLoad(operand_address),
-                          selected_value_address);
+      llvm_ir::EmitIfThenElse(cond, "if-select-lhs", &b_);
+  SetToFirstInsertPoint(if_select_lhs.false_block, &b_);
+  b_.CreateStore(b_.CreateLoad(operand_address), selected_value_address);
   save_operand_index(operand_index);
 
   // After iterating over the window elements, scatter the source element to
   // the selected index of the output. The value we store at the output
   // location is computed by calling the `scatter` function with the source
   // value and the current output value.
-  SetToFirstInsertPoint(window_loops.GetOuterLoopExitBasicBlock(),
-                        &ir_builder_);
+  SetToFirstInsertPoint(window_loops.GetOuterLoopExitBasicBlock(), &b_);
   llvm_ir::IrArray::Index selected_index(source_index.GetType());
   for (int64 i = 0; i < rank; ++i) {
-    llvm::Value* selected_index_address_slot = ir_builder_.CreateInBoundsGEP(
-        selected_index_address, {ir_builder_.getInt32(i)});
-    selected_index.push_back(
-        ir_builder_.CreateLoad(selected_index_address_slot));
+    llvm::Value* selected_index_address_slot =
+        b_.CreateInBoundsGEP(selected_index_address, {b_.getInt32(i)});
+    selected_index.push_back(b_.CreateLoad(selected_index_address_slot));
   }
   llvm_ir::IrArray source_array(GetIrArrayFor(source));
-  llvm::Value* source_value_address =
-      source_array.EmitArrayElementAddress(source_index, &ir_builder_);
+  llvm::Value* source_value =
+      source_array.EmitReadArrayElement(source_index, &b_);
   llvm_ir::IrArray output_array(GetIrArrayFor(select_and_scatter));
-  llvm::Value* output_value_address =
-      output_array.EmitArrayElementAddress(selected_index, &ir_builder_);
-  llvm::Value* scatter_value = EmitElementFunctionCall(
-      scatter_function, source->shape(),
-      {output_value_address, source_value_address}, "scatter_function");
-  output_array.EmitWriteArrayElement(selected_index, scatter_value,
-                                     &ir_builder_);
-
-  SetToFirstInsertPoint(source_loops.GetOuterLoopExitBasicBlock(),
-                        &ir_builder_);
+  llvm::Value* output_value =
+      output_array.EmitReadArrayElement(selected_index, &b_);
+  llvm::Value* scatter_value =
+      EmitThreadLocalCall(*select_and_scatter->scatter(),
+                          {output_value, source_value}, "scatter_function");
+  output_array.EmitWriteArrayElement(selected_index, scatter_value, &b_);
+
+  SetToFirstInsertPoint(source_loops.GetOuterLoopExitBasicBlock(), &b_);
   return Status::OK();
 }
 
@@ -814,21 +807,155 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
   // Dot operation is complicated so we delegate to a helper class.
   return DotOpEmitter::EmitDotOperation(
       *dot, target_array, lhs_array, rhs_array, /*addend_array=*/nullptr,
-      GetExecutableRunOptionsArgument(), &ir_builder_, hlo_module_config_,
+      GetExecutableRunOptionsArgument(), &b_, hlo_module_config_,
       target_machine_features_);
 }
 
+StatusOr<llvm::Value*> IrEmitter::EmitTargetElementLoopBodyForConvolution(
+    HloConvolutionInstruction* convolution,
+    const llvm_ir::IrArray::Index& index) {
+  const HloInstruction* lhs = convolution->operand(0);
+  const HloInstruction* rhs = convolution->operand(1);
+  const Window& window = convolution->window();
+
+  const ConvolutionDimensionNumbers& dnums =
+      convolution->convolution_dimension_numbers();
+  int num_spatial_dims = dnums.output_spatial_dimensions_size();
+  std::vector<llvm::Value*> output_spatial(num_spatial_dims);
+  for (int i = 0; i < num_spatial_dims; ++i) {
+    output_spatial[i] = index[dnums.output_spatial_dimensions(i)];
+  }
+  llvm::Value* output_feature = index[dnums.output_feature_dimension()];
+  llvm::Value* batch = index[dnums.output_batch_dimension()];
+
+  // We will accumulate the products into this sum to calculate the output entry
+  // at the given index.
+  PrimitiveType lhs_element_type = lhs->shape().element_type();
+  llvm::Type* lhs_llvm_type =
+      llvm_ir::PrimitiveTypeToIrType(lhs_element_type, module_);
+  llvm::Value* sum_address = llvm_ir::EmitAllocaAtFunctionEntry(
+      lhs_llvm_type, "convolution_sum_address", &b_,
+      MinimumAlignmentForPrimitiveType(lhs_element_type));
+  llvm::Value* constant_zero = llvm::Constant::getNullValue(lhs_llvm_type);
+  b_.CreateStore(constant_zero, sum_address);
+
+  llvm_ir::ForLoopNest loops(IrName(convolution, "inner"), &b_);
+  std::vector<llvm::Value*> kernel_spatial(num_spatial_dims);
+  for (int i = 0; i < num_spatial_dims; ++i) {
+    kernel_spatial[i] =
+        loops
+            .AddLoop(
+                0, rhs->shape().dimensions(dnums.kernel_spatial_dimensions(i)),
+                tensorflow::strings::StrCat("k", i))
+            ->GetIndVarValue();
+  }
+  llvm::Value* input_feature =
+      loops
+          .AddLoop(0, lhs->shape().dimensions(dnums.input_feature_dimension()),
+                   "iz")
+          ->GetIndVarValue();
+
+  SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &b_);
+
+  // Calculate the spatial index in the input array, taking striding, dilation
+  // and padding into account. An index in the padding will be out of the bounds
+  // of the array.
+  const auto calculate_input_index = [this](llvm::Value* output_index,
+                                            llvm::Value* kernel_index,
+                                            const WindowDimension& window_dim) {
+    llvm::Value* strided_index =
+        b_.CreateNSWMul(output_index, b_.getInt64(window_dim.stride()));
+    llvm::Value* dilated_kernel_index = b_.CreateNSWMul(
+        kernel_index, b_.getInt64(window_dim.window_dilation()));
+    return b_.CreateNSWSub(b_.CreateNSWAdd(strided_index, dilated_kernel_index),
+                           b_.getInt64(window_dim.padding_low()));
+  };
+  std::vector<llvm::Value*> input_spatial(num_spatial_dims);
+  for (int i = 0; i < num_spatial_dims; ++i) {
+    input_spatial[i] = calculate_input_index(
+        output_spatial[i], kernel_spatial[i], window.dimensions(i));
+  }
+
+  // We need to check if 0 <= input dim < bound, as otherwise we are in the
+  // padding so that we can skip the computation. That is equivalent to input
+  // dim < bound as an *unsigned* comparison, since a negative value will wrap
+  // to a large positive value. The input dim is dilated, so we need to dilate
+  // the bound as well to match.
+
+  // Also need to check that the input coordinates are not in one of the
+  // holes created by base dilation.
+  const auto not_in_hole = [&](llvm::Value* input_index, int64 base_dilation) {
+    llvm::Value* remainder =
+        b_.CreateSRem(input_index, b_.getInt64(base_dilation));
+    return b_.CreateICmpEQ(remainder, b_.getInt64(0));
+  };
+
+  llvm::Value* in_bounds_condition = b_.getInt1(true);
+  for (int i = 0; i < num_spatial_dims; ++i) {
+    llvm::ConstantInt* input_bound = b_.getInt64(window_util::DilatedBound(
+        lhs->shape().dimensions(dnums.input_spatial_dimensions(i)),
+        window.dimensions(i).base_dilation()));
+    llvm::Value* dim_in_bound = b_.CreateICmpULT(input_spatial[i], input_bound);
+    llvm::Value* dim_not_in_hole =
+        not_in_hole(input_spatial[i], window.dimensions(i).base_dilation());
+    llvm::Value* dim_ok = b_.CreateAnd(dim_in_bound, dim_not_in_hole);
+    in_bounds_condition = b_.CreateAnd(in_bounds_condition, dim_ok);
+  }
+
+  // Now we need to map the dilated base coordinates back to the actual
+  // data indices on the lhs.
+  const auto undilate = [&](llvm::Value* input_index, int64 base_dilation) {
+    return b_.CreateSDiv(input_index, b_.getInt64(base_dilation));
+  };
+  for (int i = 0; i < num_spatial_dims; ++i) {
+    input_spatial[i] =
+        undilate(input_spatial[i], window.dimensions(i).base_dilation());
+  }
+
+  llvm_ir::LlvmIfData if_data =
+      llvm_ir::EmitIfThenElse(in_bounds_condition, "in-bounds", &b_);
+  SetToFirstInsertPoint(if_data.true_block, &b_);
+
+  // We are not in the padding, so carry out the computation.
+  int num_dims = num_spatial_dims + 2;
+  llvm_ir::IrArray::Index input_index(b_.getInt64Ty(), num_dims);
+  for (int i = 0; i < num_spatial_dims; ++i) {
+    input_index[dnums.input_spatial_dimensions(i)] = input_spatial[i];
+  }
+  input_index[dnums.input_feature_dimension()] = input_feature;
+  input_index[dnums.input_batch_dimension()] = batch;
+
+  llvm_ir::IrArray kernel_array(GetIrArrayFor(rhs));
+  llvm_ir::IrArray::Index kernel_index(b_.getInt64Ty(), num_dims);
+  for (int i = 0; i < num_spatial_dims; ++i) {
+    kernel_index[dnums.kernel_spatial_dimensions(i)] =
+        window.dimensions(i).window_reversal()
+            ? b_.CreateNSWSub(b_.getInt64(window.dimensions(i).size() - 1),
+                              kernel_spatial[i])
+            : kernel_spatial[i];
+  }
+
+  kernel_index[dnums.kernel_input_feature_dimension()] = input_feature;
+  kernel_index[dnums.kernel_output_feature_dimension()] = output_feature;
+
+  llvm_ir::IrArray input_array(GetIrArrayFor(lhs));
+  llvm::Value* product =
+      b_.CreateFMul(input_array.EmitReadArrayElement(input_index, &b_),
+                    kernel_array.EmitReadArrayElement(kernel_index, &b_));
+  llvm::Value* sum = b_.CreateFAdd(b_.CreateLoad(sum_address), product);
+  b_.CreateStore(sum, sum_address);
+
+  SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &b_);
+  return b_.CreateLoad(sum_address);
+}
+
 Status IrEmitter::HandleConvolution(HloInstruction* convolution) {
   auto lhs = convolution->operand(0);
   auto rhs = convolution->operand(1);
-  const auto& window = convolution->window();
   TF_RETURN_IF_ERROR(ElementTypesSameAndSupported(
       /*instruction=*/*convolution, /*operands=*/{lhs, rhs},
       /*supported_types=*/{F16, F32, C64}));
 
-  const ConvolutionDimensionNumbers& dnums =
-      convolution->convolution_dimension_numbers();
-
   // TODO(tonywy): Add PotentiallyImplementedAsMKLCovolution to support
   // different data layouts.
   if (PotentiallyImplementedAsEigenConvolution(*convolution,
@@ -908,12 +1035,12 @@ Status IrEmitter::HandleConvolution(HloInstruction* convolution) {
 
       PrimitiveType primitive_type = lhs->shape().element_type();
       llvm::Type* ir_ptr_type = primitive_type == F16
-                                    ? ir_builder_.getHalfTy()->getPointerTo()
-                                    : ir_builder_.getFloatTy()->getPointerTo();
-      llvm::Type* int64_type = ir_builder_.getInt64Ty();
-      llvm::Type* int8_ptr_type = ir_builder_.getInt8Ty()->getPointerTo();
+                                    ? b_.getHalfTy()->getPointerTo()
+                                    : b_.getFloatTy()->getPointerTo();
+      llvm::Type* int64_type = b_.getInt64Ty();
+      llvm::Type* int8_ptr_type = b_.getInt8Ty()->getPointerTo();
       llvm::FunctionType* conv_type = llvm::FunctionType::get(
-          ir_builder_.getVoidTy(),
+          b_.getVoidTy(),
           {int8_ptr_type, ir_ptr_type, ir_ptr_type, ir_ptr_type, int64_type,
            int64_type,    int64_type,  int64_type,  int64_type,  int64_type,
            int64_type,    int64_type,  int64_type,  int64_type,  int64_type,
@@ -945,34 +1072,34 @@ Status IrEmitter::HandleConvolution(HloInstruction* convolution) {
       conv_func->setCallingConv(llvm::CallingConv::C);
       conv_func->setDoesNotThrow();
       conv_func->setOnlyAccessesArgMemory();
-      ir_builder_.CreateCall(
-          conv_func, {
-                         GetExecutableRunOptionsArgument(),
-                         ir_builder_.CreateBitCast(
-                             GetEmittedValueFor(convolution), ir_ptr_type),
-                         ir_builder_.CreateBitCast(lhs_address, ir_ptr_type),
-                         ir_builder_.CreateBitCast(rhs_address, ir_ptr_type),
-                         ir_builder_.getInt64(input_batch),
-                         ir_builder_.getInt64(input_rows),
-                         ir_builder_.getInt64(input_cols),
-                         ir_builder_.getInt64(input_channels),
-                         ir_builder_.getInt64(kernel_rows),
-                         ir_builder_.getInt64(kernel_cols),
-                         ir_builder_.getInt64(kernel_channels),
-                         ir_builder_.getInt64(kernel_filters),
-                         ir_builder_.getInt64(output_rows),
-                         ir_builder_.getInt64(output_cols),
-                         ir_builder_.getInt64(row_stride),
-                         ir_builder_.getInt64(col_stride),
-                         ir_builder_.getInt64(padding_top),
-                         ir_builder_.getInt64(padding_bottom),
-                         ir_builder_.getInt64(padding_left),
-                         ir_builder_.getInt64(padding_right),
-                         ir_builder_.getInt64(lhs_row_dilation),
-                         ir_builder_.getInt64(lhs_col_dilation),
-                         ir_builder_.getInt64(rhs_row_dilation),
-                         ir_builder_.getInt64(rhs_col_dilation),
-                     });
+      b_.CreateCall(
+          conv_func,
+          {
+              GetExecutableRunOptionsArgument(),
+              b_.CreateBitCast(GetEmittedValueFor(convolution), ir_ptr_type),
+              b_.CreateBitCast(lhs_address, ir_ptr_type),
+              b_.CreateBitCast(rhs_address, ir_ptr_type),
+              b_.getInt64(input_batch),
+              b_.getInt64(input_rows),
+              b_.getInt64(input_cols),
+              b_.getInt64(input_channels),
+              b_.getInt64(kernel_rows),
+              b_.getInt64(kernel_cols),
+              b_.getInt64(kernel_channels),
+              b_.getInt64(kernel_filters),
+              b_.getInt64(output_rows),
+              b_.getInt64(output_cols),
+              b_.getInt64(row_stride),
+              b_.getInt64(col_stride),
+              b_.getInt64(padding_top),
+              b_.getInt64(padding_bottom),
+              b_.getInt64(padding_left),
+              b_.getInt64(padding_right),
+              b_.getInt64(lhs_row_dilation),
+              b_.getInt64(lhs_col_dilation),
+              b_.getInt64(rhs_row_dilation),
+              b_.getInt64(rhs_col_dilation),
+          });
 
       return Status::OK();
     }
@@ -985,150 +1112,9 @@ Status IrEmitter::HandleConvolution(HloInstruction* convolution) {
   // See the description of convolution in the XLA documentation for the pseudo
   // code for convolution.
   return EmitTargetElementLoop(
-      convolution, [this, convolution, lhs, rhs, window,
-                    dnums](const llvm_ir::IrArray::Index& index) {
-        int num_spatial_dims = dnums.output_spatial_dimensions_size();
-        std::vector<llvm::Value*> output_spatial(num_spatial_dims);
-        for (int i = 0; i < num_spatial_dims; ++i) {
-          output_spatial[i] = index[dnums.output_spatial_dimensions(i)];
-        }
-        llvm::Value* output_feature = index[dnums.output_feature_dimension()];
-        llvm::Value* batch = index[dnums.output_batch_dimension()];
-
-        // We will accumulate the products into this sum to calculate
-        // the output entry at the given index.
-        PrimitiveType lhs_element_type = lhs->shape().element_type();
-        llvm::Type* lhs_llvm_type =
-            llvm_ir::PrimitiveTypeToIrType(lhs_element_type, module_);
-        llvm::Value* sum_address = llvm_ir::EmitAllocaAtFunctionEntry(
-            lhs_llvm_type, "convolution_sum_address", &ir_builder_,
-            MinimumAlignmentForPrimitiveType(lhs_element_type));
-        llvm::Value* constant_zero =
-            llvm::Constant::getNullValue(lhs_llvm_type);
-        ir_builder_.CreateStore(constant_zero, sum_address);
-
-        llvm_ir::ForLoopNest loops(IrName(convolution, "inner"), &ir_builder_);
-        std::vector<llvm::Value*> kernel_spatial(num_spatial_dims);
-        for (int i = 0; i < num_spatial_dims; ++i) {
-          kernel_spatial[i] =
-              loops
-                  .AddLoop(0,
-                           rhs->shape().dimensions(
-                               dnums.kernel_spatial_dimensions(i)),
-                           tensorflow::strings::StrCat("k", i))
-                  ->GetIndVarValue();
-        }
-        llvm::Value* input_feature =
-            loops
-                .AddLoop(
-                    0, lhs->shape().dimensions(dnums.input_feature_dimension()),
-                    "iz")
-                ->GetIndVarValue();
-
-        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &ir_builder_);
-
-        // Calculate the spatial index in the input array, taking striding,
-        // dilation and padding into account. An index in the padding will be
-        // out of the bounds of the array.
-        const auto calculate_input_index =
-            [this](llvm::Value* output_index, llvm::Value* kernel_index,
-                   const WindowDimension& window_dim) {
-              llvm::Value* strided_index = ir_builder_.CreateNSWMul(
-                  output_index, ir_builder_.getInt64(window_dim.stride()));
-              llvm::Value* dilated_kernel_index = ir_builder_.CreateNSWMul(
-                  kernel_index,
-                  ir_builder_.getInt64(window_dim.window_dilation()));
-              return ir_builder_.CreateNSWSub(
-                  ir_builder_.CreateNSWAdd(strided_index, dilated_kernel_index),
-                  ir_builder_.getInt64(window_dim.padding_low()));
-            };
-        std::vector<llvm::Value*> input_spatial(num_spatial_dims);
-        for (int i = 0; i < num_spatial_dims; ++i) {
-          input_spatial[i] = calculate_input_index(
-              output_spatial[i], kernel_spatial[i], window.dimensions(i));
-        }
-
-        // We need to check if 0 <= input dim < bound, as otherwise we are in
-        // the padding so that we can skip the computation. That is equivalent
-        // to input dim < bound as an *unsigned* comparison, since a negative
-        // value will wrap to a large positive value. The input dim is dilated,
-        // so we need to dilate the bound as well to match.
-
-        // Also need to check that the input coordinates are not in one of the
-        // holes created by base dilation.
-        const auto not_in_hole = [&](llvm::Value* input_index,
-                                     int64 base_dilation) {
-          llvm::Value* remainder = ir_builder_.CreateSRem(
-              input_index, ir_builder_.getInt64(base_dilation));
-          return ir_builder_.CreateICmpEQ(remainder, ir_builder_.getInt64(0));
-        };
-
-        llvm::Value* in_bounds_condition = ir_builder_.getInt1(true);
-        for (int i = 0; i < num_spatial_dims; ++i) {
-          llvm::ConstantInt* input_bound =
-              ir_builder_.getInt64(window_util::DilatedBound(
-                  lhs->shape().dimensions(dnums.input_spatial_dimensions(i)),
-                  window.dimensions(i).base_dilation()));
-          llvm::Value* dim_in_bound =
-              ir_builder_.CreateICmpULT(input_spatial[i], input_bound);
-          llvm::Value* dim_not_in_hole = not_in_hole(
-              input_spatial[i], window.dimensions(i).base_dilation());
-          llvm::Value* dim_ok =
-              ir_builder_.CreateAnd(dim_in_bound, dim_not_in_hole);
-          in_bounds_condition =
-              ir_builder_.CreateAnd(in_bounds_condition, dim_ok);
-        }
-
-        // Now we need to map the dilated base coordinates back to the actual
-        // data indices on the lhs.
-        const auto undilate = [&](llvm::Value* input_index,
-                                  int64 base_dilation) {
-          return ir_builder_.CreateSDiv(input_index,
-                                        ir_builder_.getInt64(base_dilation));
-        };
-        for (int i = 0; i < num_spatial_dims; ++i) {
-          input_spatial[i] =
-              undilate(input_spatial[i], window.dimensions(i).base_dilation());
-        }
-
-        llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
-            in_bounds_condition, "in-bounds", &ir_builder_);
-        SetToFirstInsertPoint(if_data.true_block, &ir_builder_);
-
-        // We are not in the padding, so carry out the computation.
-        int num_dims = num_spatial_dims + 2;
-        llvm_ir::IrArray::Index input_index(ir_builder_.getInt64Ty(), num_dims);
-        for (int i = 0; i < num_spatial_dims; ++i) {
-          input_index[dnums.input_spatial_dimensions(i)] = input_spatial[i];
-        }
-        input_index[dnums.input_feature_dimension()] = input_feature;
-        input_index[dnums.input_batch_dimension()] = batch;
-
-        llvm_ir::IrArray kernel_array(GetIrArrayFor(rhs));
-        llvm_ir::IrArray::Index kernel_index(ir_builder_.getInt64Ty(),
-                                             num_dims);
-        for (int i = 0; i < num_spatial_dims; ++i) {
-          kernel_index[dnums.kernel_spatial_dimensions(i)] =
-              window.dimensions(i).window_reversal()
-                  ? ir_builder_.CreateNSWSub(
-                        ir_builder_.getInt64(window.dimensions(i).size() - 1),
-                        kernel_spatial[i])
-                  : kernel_spatial[i];
-        }
-
-        kernel_index[dnums.kernel_input_feature_dimension()] = input_feature;
-        kernel_index[dnums.kernel_output_feature_dimension()] = output_feature;
-
-        llvm_ir::IrArray input_array(GetIrArrayFor(lhs));
-        llvm::Value* product = ir_builder_.CreateFMul(
-            input_array.EmitReadArrayElement(input_index, &ir_builder_),
-            kernel_array.EmitReadArrayElement(kernel_index, &ir_builder_));
-        llvm::Value* sum = ir_builder_.CreateFAdd(
-            ir_builder_.CreateLoad(sum_address), product);
-        ir_builder_.CreateStore(sum, sum_address);
-
-        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &ir_builder_);
-        return ir_builder_.CreateLoad(sum_address);
+      convolution, [&](const llvm_ir::IrArray::Index& index) {
+        return EmitTargetElementLoopBodyForConvolution(
+            Cast<HloConvolutionInstruction>(convolution), index);
       });
 }
 
@@ -1152,11 +1138,11 @@ Status IrEmitter::HandleFft(HloInstruction* fft) {
   }
 
   // Args have been computed, make the call.
-  llvm::Type* int8_ptr_type = ir_builder_.getInt8Ty()->getPointerTo();
-  llvm::Type* int32_type = ir_builder_.getInt32Ty();
-  llvm::Type* int64_type = ir_builder_.getInt64Ty();
+  llvm::Type* int8_ptr_type = b_.getInt8Ty()->getPointerTo();
+  llvm::Type* int32_type = b_.getInt32Ty();
+  llvm::Type* int64_type = b_.getInt64Ty();
   llvm::FunctionType* fft_type = llvm::FunctionType::get(
-      ir_builder_.getVoidTy(),
+      b_.getVoidTy(),
       {int8_ptr_type, int8_ptr_type, int8_ptr_type, int32_type, int32_type,
        int64_type, int64_type, int64_type, int64_type},
       /*isVarArg=*/false);
@@ -1173,16 +1159,15 @@ Status IrEmitter::HandleFft(HloInstruction* fft) {
   fft_func->setDoesNotThrow();
   fft_func->setOnlyAccessesInaccessibleMemOrArgMem();
   const int fft_rank = fft_length.size();
-  ir_builder_.CreateCall(
+  b_.CreateCall(
       fft_func,
       {GetExecutableRunOptionsArgument(),
-       ir_builder_.CreateBitCast(GetEmittedValueFor(fft), int8_ptr_type),
-       ir_builder_.CreateBitCast(operand_address, int8_ptr_type),
-       ir_builder_.getInt32(fft->fft_type()), ir_builder_.getInt32(fft_rank),
-       ir_builder_.getInt64(input_batch),
-       ir_builder_.getInt64(fft_rank > 0 ? fft_length[0] : 0),
-       ir_builder_.getInt64(fft_rank > 1 ? fft_length[1] : 0),
-       ir_builder_.getInt64(fft_rank > 2 ? fft_length[2] : 0)});
+       b_.CreateBitCast(GetEmittedValueFor(fft), int8_ptr_type),
+       b_.CreateBitCast(operand_address, int8_ptr_type),
+       b_.getInt32(fft->fft_type()), b_.getInt32(fft_rank),
+       b_.getInt64(input_batch), b_.getInt64(fft_rank > 0 ? fft_length[0] : 0),
+       b_.getInt64(fft_rank > 1 ? fft_length[1] : 0),
+       b_.getInt64(fft_rank > 2 ? fft_length[2] : 0)});
 
   return Status::OK();
 }
@@ -1221,11 +1206,10 @@ Status IrEmitter::HandleCrossReplicaSum(HloInstruction* crs) {
     operand_ptrs.push_back(EmitTempBufferPointer(out_slice, operand_shape));
 
     // TODO(b/63762267): Be more aggressive about specifying alignment.
-    ir_builder_.CreateMemCpy(operand_ptrs.back(), /*DstAlign=*/1, in_ptr,
-                             /*SrcAlign=*/1,
-                             ShapeUtil::ByteSizeOf(operand_shape));
+    b_.CreateMemCpy(operand_ptrs.back(), /*DstAlign=*/1, in_ptr,
+                    /*SrcAlign=*/1, ShapeUtil::ByteSizeOf(operand_shape));
   }
-  llvm_ir::EmitTuple(GetIrArrayFor(crs), operand_ptrs, &ir_builder_, module_);
+  llvm_ir::EmitTuple(GetIrArrayFor(crs), operand_ptrs, &b_, module_);
   return Status::OK();
 }
 
@@ -1258,47 +1242,7 @@ static llvm_ir::IrArray::Index FillReducedDimensionIndex(
 
 Status IrEmitter::HandleParameter(HloInstruction* parameter) {
   VLOG(2) << "HandleParameter: " << parameter->ToString();
-  auto param_number = parameter->parameter_number();
-  auto param_shape = parameter->shape();
-
-  // We have to access the parameter at offset param_number in the params
-  // array. The code generated here is equivalent to this C code:
-  //
-  //   i8* param_address_untyped = params[param_number];
-  //   Param* param_address_typed = (Param*)param_address_untyped;
-  //
-  // Where Param is the actual element type of the underlying buffer (for
-  // example, float for an XLA F32 element type).
-  llvm::Value* params = compute_function_->parameters_arg();
-  llvm::Value* param_address_offset =
-      llvm_ir::EmitBufferIndexingGEP(params, param_number, &ir_builder_);
-  llvm::LoadInst* param_address_untyped =
-      ir_builder_.CreateLoad(param_address_offset);
-  param_address_untyped->setName(AsStringRef(IrName(parameter, "untyped")));
-  if (is_top_level_computation_ &&
-      hlo_module_config_.debug_options()
-          .xla_llvm_enable_invariant_load_metadata()) {
-    // In the entry computation the parameter slots in the %params argument are
-    // invariant through program execution.  In computations that are called
-    // from the entry computation (via kWhile, kCall and kConditional) the
-    // parameter slots are *not* invariant since they're written to by their
-    // callers.
-    param_address_untyped->setMetadata(
-        llvm::LLVMContext::MD_invariant_load,
-        llvm::MDNode::get(param_address_untyped->getContext(), /*MDs=*/{}));
-  }
-
-  llvm::Value* param_address_typed = ir_builder_.CreateBitCast(
-      param_address_untyped, IrShapeType(param_shape)->getPointerTo());
-  emitted_value_[parameter] = param_address_typed;
-
-  if (!ShapeUtil::IsOpaque(param_shape)) {
-    AttachAlignmentMetadataForLoad(param_address_untyped, param_shape);
-    AttachDereferenceableMetadataForLoad(param_address_untyped, param_shape);
-  }
-
-  VLOG(2) << "  emitted value: " << llvm_ir::DumpToString(*param_address_typed);
-  return Status::OK();
+  return EmitTargetAddressForOp(parameter);
 }
 
 // Returns true if the relative order of the unreduced dimensions stays the same
@@ -1396,62 +1340,61 @@ IrEmitter::ReductionGenerator IrEmitter::MatchReductionGenerator(
       return nullptr;
 
     case HloOpcode::kAdd:
-      return [root_is_integral](llvm::IRBuilder<>* ir_builder, llvm::Value* lhs,
+      return [root_is_integral](llvm::IRBuilder<>* b, llvm::Value* lhs,
                                 llvm::Value* rhs) {
-        return root_is_integral ? ir_builder->CreateAdd(lhs, rhs)
-                                : ir_builder->CreateFAdd(lhs, rhs);
+        return root_is_integral ? b->CreateAdd(lhs, rhs)
+                                : b->CreateFAdd(lhs, rhs);
       };
 
     case HloOpcode::kMultiply:
-      return [root_is_integral](llvm::IRBuilder<>* ir_builder, llvm::Value* lhs,
+      return [root_is_integral](llvm::IRBuilder<>* b, llvm::Value* lhs,
                                 llvm::Value* rhs) {
-        return root_is_integral ? ir_builder->CreateMul(lhs, rhs)
-                                : ir_builder->CreateFMul(lhs, rhs);
+        return root_is_integral ? b->CreateMul(lhs, rhs)
+                                : b->CreateFMul(lhs, rhs);
       };
 
     case HloOpcode::kAnd:
-      return [](llvm::IRBuilder<>* ir_builder, llvm::Value* lhs,
-                llvm::Value* rhs) { return ir_builder->CreateAnd(lhs, rhs); };
+      return [](llvm::IRBuilder<>* b, llvm::Value* lhs, llvm::Value* rhs) {
+        return b->CreateAnd(lhs, rhs);
+      };
 
     case HloOpcode::kOr:
-      return [](llvm::IRBuilder<>* ir_builder, llvm::Value* lhs,
-                llvm::Value* rhs) { return ir_builder->CreateOr(lhs, rhs); };
+      return [](llvm::IRBuilder<>* b, llvm::Value* lhs, llvm::Value* rhs) {
+        return b->CreateOr(lhs, rhs);
+      };
 
     case HloOpcode::kXor:
-      return [](llvm::IRBuilder<>* ir_builder, llvm::Value* lhs,
-                llvm::Value* rhs) { return ir_builder->CreateXor(lhs, rhs); };
+      return [](llvm::IRBuilder<>* b, llvm::Value* lhs, llvm::Value* rhs) {
+        return b->CreateXor(lhs, rhs);
+      };
 
     case HloOpcode::kMaximum:
       return [root_is_floating_point, root_is_signed](
-                 llvm::IRBuilder<>* ir_builder, llvm::Value* lhs,
-                 llvm::Value* rhs) {
+                 llvm::IRBuilder<>* b, llvm::Value* lhs, llvm::Value* rhs) {
         if (root_is_floating_point) {
           return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::maxnum,
-                                              {lhs, rhs}, {lhs->getType()},
-                                              ir_builder);
+                                              {lhs, rhs}, {lhs->getType()}, b);
         }
 
-        return ir_builder->CreateSelect(
-            ir_builder->CreateICmp(root_is_signed ? llvm::ICmpInst::ICMP_SGE
-                                                  : llvm::ICmpInst::ICMP_UGE,
-                                   lhs, rhs),
+        return b->CreateSelect(
+            b->CreateICmp(root_is_signed ? llvm::ICmpInst::ICMP_SGE
+                                         : llvm::ICmpInst::ICMP_UGE,
+                          lhs, rhs),
             lhs, rhs);
       };
 
     case HloOpcode::kMinimum:
       return [root_is_floating_point, root_is_signed](
-                 llvm::IRBuilder<>* ir_builder, llvm::Value* lhs,
-                 llvm::Value* rhs) {
+                 llvm::IRBuilder<>* b, llvm::Value* lhs, llvm::Value* rhs) {
         if (root_is_floating_point) {
           return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::minnum,
-                                              {lhs, rhs}, {lhs->getType()},
-                                              ir_builder);
+                                              {lhs, rhs}, {lhs->getType()}, b);
         }
 
-        return ir_builder->CreateSelect(
-            ir_builder->CreateICmp(root_is_signed ? llvm::ICmpInst::ICMP_SLE
-                                                  : llvm::ICmpInst::ICMP_ULE,
-                                   lhs, rhs),
+        return b->CreateSelect(
+            b->CreateICmp(root_is_signed ? llvm::ICmpInst::ICMP_SLE
+                                         : llvm::ICmpInst::ICMP_ULE,
+                          lhs, rhs),
             lhs, rhs);
       };
   }
@@ -1520,34 +1463,31 @@ IrEmitter::EmitInnerLoopForVectorizedReduction(
   accumulator.reserve(accumulator_type.size());
   for (auto accumulator_shard_type : accumulator_type) {
     accumulator.push_back(llvm_ir::EmitAllocaAtFunctionEntry(
-        accumulator_shard_type, "accumulator", &ir_builder_, 0));
+        accumulator_shard_type, "accumulator", &b_, 0));
   }
 
-  llvm::Value* init_value_ssa =
-      ir_builder_.CreateLoad(GetEmittedValueFor(init_value));
+  llvm::Value* init_value_ssa = b_.CreateLoad(GetEmittedValueFor(init_value));
 
   for (llvm::Value* accumulator_shard : accumulator) {
     llvm::Value* initial_value;
     auto shard_type = accumulator_shard->getType()->getPointerElementType();
     if (auto vector_type = llvm::dyn_cast<llvm::VectorType>(shard_type)) {
-      initial_value = ir_builder_.CreateVectorSplat(
-          vector_type->getNumElements(), init_value_ssa);
+      initial_value =
+          b_.CreateVectorSplat(vector_type->getNumElements(), init_value_ssa);
     } else {
       initial_value = init_value_ssa;
     }
 
-    ir_builder_.CreateAlignedStore(initial_value, accumulator_shard,
-                                   element_alignment);
+    b_.CreateAlignedStore(initial_value, accumulator_shard, element_alignment);
   }
 
   llvm_ir::ForLoopNest reduction_loop_nest(IrName(arg, "vectorized_inner"),
-                                           &ir_builder_);
+                                           &b_);
   llvm_ir::IrArray::Index reduced_dims_index =
       reduction_loop_nest.AddLoopsForShapeOnDimensions(arg->shape(), dimensions,
                                                        "reduction_dim");
 
-  SetToFirstInsertPoint(reduction_loop_nest.GetInnerLoopBodyBasicBlock(),
-                        &ir_builder_);
+  SetToFirstInsertPoint(reduction_loop_nest.GetInnerLoopBodyBasicBlock(), &b_);
 
   llvm_ir::IrArray arg_array(GetIrArrayFor(arg));
   llvm_ir::IrArray::Index input_index = reduced_dims_index;
@@ -1560,38 +1500,34 @@ IrEmitter::EmitInnerLoopForVectorizedReduction(
   }
   CHECK(output_index.end() == it);
 
-  llvm::Value* input_address = ir_builder_.CreateBitCast(
-      arg_array.EmitArrayElementAddress(input_index, &ir_builder_),
-      ir_builder_.getInt8PtrTy());
+  llvm::Value* input_address = b_.CreateBitCast(
+      arg_array.EmitArrayElementAddress(input_index, &b_), b_.getInt8PtrTy());
 
   for (int i = 0; i < accumulator.size(); i++) {
     auto input_address_typed =
-        ir_builder_.CreateBitCast(input_address, accumulator[i]->getType());
+        b_.CreateBitCast(input_address, accumulator[i]->getType());
     auto current_accumulator_value =
-        ir_builder_.CreateAlignedLoad(accumulator[i], element_alignment);
-    auto addend =
-        ir_builder_.CreateAlignedLoad(input_address_typed, element_alignment);
+        b_.CreateAlignedLoad(accumulator[i], element_alignment);
+    auto addend = b_.CreateAlignedLoad(input_address_typed, element_alignment);
     arg_array.AnnotateLoadStoreInstructionWithMetadata(addend);
 
     auto reduced_result =
-        reduction_generator(&ir_builder_, current_accumulator_value, addend);
-    ir_builder_.CreateAlignedStore(reduced_result, accumulator[i],
-                                   element_alignment);
+        reduction_generator(&b_, current_accumulator_value, addend);
+    b_.CreateAlignedStore(reduced_result, accumulator[i], element_alignment);
 
     if (i != (accumulator.size() - 1)) {
-      input_address = ir_builder_.CreateConstInBoundsGEP1_32(
-          reduced_result->getType(), input_address_typed, 1);
+      input_address = b_.CreateConstInBoundsGEP1_32(reduced_result->getType(),
+                                                    input_address_typed, 1);
     }
   }
 
-  SetToFirstInsertPoint(reduction_loop_nest.GetOuterLoopExitBasicBlock(),
-                        &ir_builder_);
+  SetToFirstInsertPoint(reduction_loop_nest.GetOuterLoopExitBasicBlock(), &b_);
 
   ShardedVector result_ssa;
   result_ssa.reserve(accumulator.size());
   for (auto accumulator_shard : accumulator) {
     result_ssa.push_back(
-        ir_builder_.CreateAlignedLoad(accumulator_shard, element_alignment));
+        b_.CreateAlignedLoad(accumulator_shard, element_alignment));
   }
   return result_ssa;
 }
@@ -1600,17 +1536,17 @@ void IrEmitter::EmitShardedVectorStore(
     llvm::Value* store_address, const std::vector<llvm::Value*>& value_to_store,
     const int alignment, const llvm_ir::IrArray& containing_array) {
   for (int i = 0; i < value_to_store.size(); i++) {
-    auto store_address_typed = ir_builder_.CreateBitCast(
+    auto store_address_typed = b_.CreateBitCast(
         store_address,
         llvm::PointerType::getUnqual(value_to_store[i]->getType()));
 
-    auto store_instruction = ir_builder_.CreateAlignedStore(
+    auto store_instruction = b_.CreateAlignedStore(
         value_to_store[i], store_address_typed, alignment);
     containing_array.AnnotateLoadStoreInstructionWithMetadata(
         store_instruction);
 
     if (i != (value_to_store.size() - 1)) {
-      store_address = ir_builder_.CreateConstInBoundsGEP1_32(
+      store_address = b_.CreateConstInBoundsGEP1_32(
           value_to_store[i]->getType(), store_address_typed, 1);
     }
   }
@@ -1676,8 +1612,8 @@ StatusOr<bool> IrEmitter::EmitVectorizedReduce(
   //    }
   //  }
 
-  llvm_ir::ForLoopNest loop_nest(IrName(reduce), &ir_builder_);
-  llvm_ir::IrArray::Index array_index(ir_builder_.getInt64Ty(),
+  llvm_ir::ForLoopNest loop_nest(IrName(reduce), &b_);
+  llvm_ir::IrArray::Index array_index(b_.getInt64Ty(),
                                       reduce->shape().dimensions_size());
   for (int i = LayoutUtil::MinorToMajor(reduce->shape()).size() - 1; i > 0;
        --i) {
@@ -1696,7 +1632,7 @@ StatusOr<bool> IrEmitter::EmitVectorizedReduce(
 
   if (llvm::BasicBlock* innermost_body_bb =
           loop_nest.GetInnerLoopBodyBasicBlock()) {
-    SetToFirstInsertPoint(innermost_body_bb, &ir_builder_);
+    SetToFirstInsertPoint(innermost_body_bb, &b_);
   }
 
   auto outermost_loop_exit_block = loop_nest.GetOuterLoopExitBasicBlock();
@@ -1710,7 +1646,7 @@ StatusOr<bool> IrEmitter::EmitVectorizedReduce(
         tensorflow::strings::Printf("dim.%lld", innermost_dimension));
     array_index[innermost_dimension] = loop->GetIndVarValue();
 
-    SetToFirstInsertPoint(loop->GetBodyBasicBlock(), &ir_builder_);
+    SetToFirstInsertPoint(loop->GetBodyBasicBlock(), &b_);
 
     ShardedVectorType vector_type = CreateShardedVectorType(
         reduce->shape().element_type(), vectorization_factor);
@@ -1721,16 +1657,16 @@ StatusOr<bool> IrEmitter::EmitVectorizedReduce(
 
     llvm_ir::IrArray target_array = GetIrArrayFor(reduce);
     llvm::Value* output_address =
-        target_array.EmitArrayElementAddress(array_index, &ir_builder_);
+        target_array.EmitArrayElementAddress(array_index, &b_);
     EmitShardedVectorStore(output_address, accumulator, element_alignment,
                            target_array);
 
     if (auto exit_terminator = loop->GetExitBasicBlock()->getTerminator()) {
       CHECK_GT(LayoutUtil::MinorToMajor(reduce->shape()).size(), 1);
-      ir_builder_.SetInsertPoint(exit_terminator);
+      b_.SetInsertPoint(exit_terminator);
     } else {
       CHECK_EQ(LayoutUtil::MinorToMajor(reduce->shape()).size(), 1);
-      ir_builder_.SetInsertPoint(loop->GetExitBasicBlock());
+      b_.SetInsertPoint(loop->GetExitBasicBlock());
     }
   }
 
@@ -1740,8 +1676,8 @@ StatusOr<bool> IrEmitter::EmitVectorizedReduce(
   if (innermost_dimension_size % vectorization_factor) {
     // TODO(b/63775531): Consider using a scalar loop here to save on code size.
     array_index[innermost_dimension] =
-        ir_builder_.getInt64(innermost_dimension_size -
-                             (innermost_dimension_size % vectorization_factor));
+        b_.getInt64(innermost_dimension_size -
+                    (innermost_dimension_size % vectorization_factor));
 
     ShardedVectorType vector_type = CreateShardedVectorType(
         reduce->shape().element_type(),
@@ -1753,18 +1689,72 @@ StatusOr<bool> IrEmitter::EmitVectorizedReduce(
 
     llvm_ir::IrArray target_array = GetIrArrayFor(reduce);
     llvm::Value* output_address =
-        target_array.EmitArrayElementAddress(array_index, &ir_builder_);
+        target_array.EmitArrayElementAddress(array_index, &b_);
     EmitShardedVectorStore(output_address, accumulator, element_alignment,
                            target_array);
   }
 
   if (outermost_loop_exit_block) {
-    ir_builder_.SetInsertPoint(outermost_loop_exit_block);
+    b_.SetInsertPoint(outermost_loop_exit_block);
   }
 
   return true;
 }
 
+StatusOr<llvm::Value*> IrEmitter::EmitTargetElementLoopBodyForReduce(
+    HloReduceInstruction* reduce, const llvm_ir::IrArray::Index& index) {
+  const HloInstruction* arg = reduce->mutable_operand(0);
+  const HloInstruction* init_value = reduce->mutable_operand(1);
+  gtl::ArraySlice<int64> dimensions(reduce->dimensions());
+
+  // Initialize an accumulator with init_value.
+  PrimitiveType accumulator_type = reduce->shape().element_type();
+  llvm::AllocaInst* accumulator_addr = llvm_ir::EmitAllocaAtFunctionEntry(
+      llvm_ir::PrimitiveTypeToIrType(accumulator_type, module_), "accumulator",
+      &b_, MinimumAlignmentForPrimitiveType(accumulator_type));
+  llvm::Value* init_value_addr = GetEmittedValueFor(init_value);
+  llvm::Value* load_init_value = b_.CreateLoad(init_value_addr);
+  b_.CreateStore(load_init_value, accumulator_addr);
+
+  // The enclosing loops go over all the target elements. Now we have to compute
+  // the actual target element. For this, we build a new loop nest to iterate
+  // over all the reduction dimensions in the argument.
+  // AddLoopsForShapeOnDimensions will return an Index where induction Value*s
+  // are placed for each dimension in dimensions, and all the rest are nullptrs.
+  llvm_ir::ForLoopNest loops(IrName(reduce, "inner"), &b_);
+  const llvm_ir::IrArray::Index reduced_dims_index =
+      loops.AddLoopsForShapeOnDimensions(arg->shape(), dimensions,
+                                         "reduction_dim");
+
+  SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &b_);
+
+  // Build a full index for the input argument, using reduced_dims_index as the
+  // base. In reduced_dims_index only the reduction dimensions are filled in. We
+  // fill in the rest of the dimensions with induction Value*s taken from
+  // 'index' which iterates over the target array.  See the high-level
+  // description in the XLA documentation for details.
+  llvm_ir::IrArray arg_array(GetIrArrayFor(arg));
+  llvm_ir::IrArray::Index input_index = reduced_dims_index;
+  llvm_ir::IrArray::Index::const_iterator it = index.begin();
+
+  for (size_t i = 0; i < input_index.size(); ++i) {
+    if (input_index[i] == nullptr) {
+      input_index[i] = *it++;
+    }
+  }
+  CHECK(index.end() == it);
+
+  // Apply the reduction function to the loaded value.
+  llvm::Value* input_element = arg_array.EmitReadArrayElement(input_index, &b_);
+  llvm::Value* result = EmitThreadLocalCall(
+      *reduce->to_apply(), {b_.CreateLoad(accumulator_addr), input_element},
+      "reduce_function");
+  b_.CreateStore(result, accumulator_addr);
+
+  SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &b_);
+  return b_.CreateLoad(accumulator_addr);
+}
+
 Status IrEmitter::HandleReduce(HloInstruction* reduce) {
   auto arg = reduce->mutable_operand(0);
   auto init_value = reduce->mutable_operand(1);
@@ -1786,61 +1776,11 @@ Status IrEmitter::HandleReduce(HloInstruction* reduce) {
     }
   }
 
-  // The called computation should have been emitted previously.
-  llvm::Function* reducer_function = FindOrDie(emitted_functions_, function);
-  return EmitTargetElementLoop(
-      reduce, [this, reduce, arg, init_value, dimensions,
-               reducer_function](const llvm_ir::IrArray::Index& index) {
-        // Initialize an accumulator with init_value.
-        PrimitiveType accumulator_type = reduce->shape().element_type();
-        llvm::AllocaInst* accumulator_addr = llvm_ir::EmitAllocaAtFunctionEntry(
-            llvm_ir::PrimitiveTypeToIrType(accumulator_type, module_),
-            "accumulator", &ir_builder_,
-            MinimumAlignmentForPrimitiveType(accumulator_type));
-        llvm::Value* init_value_addr = GetEmittedValueFor(init_value);
-        llvm::Value* load_init_value = ir_builder_.CreateLoad(init_value_addr);
-        ir_builder_.CreateStore(load_init_value, accumulator_addr);
-
-        // The enclosing loops go over all the target elements. Now we have to
-        // compute the actual target element. For this, we build a new loop nest
-        // to iterate over all the reduction dimensions in the argument.
-        // AddLoopsForShapeOnDimensions will return an Index where induction
-        // Value*s are placed for each dimension in dimensions, and all the rest
-        // are nullptrs.
-        llvm_ir::ForLoopNest loops(IrName(reduce, "inner"), &ir_builder_);
-        const llvm_ir::IrArray::Index reduced_dims_index =
-            loops.AddLoopsForShapeOnDimensions(arg->shape(), dimensions,
-                                               "reduction_dim");
-
-        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &ir_builder_);
-
-        // Build a full index for the input argument, using reduced_dims_index
-        // as the base. In reduced_dims_index only the reduction dimensions are
-        // filled in. We fill in the rest of the dimensions with induction
-        // Value*s taken from 'index' which iterates over the target array.
-        // See the high-level description in the XLA documentation for details.
-        llvm_ir::IrArray arg_array(GetIrArrayFor(arg));
-        llvm_ir::IrArray::Index input_index = reduced_dims_index;
-        llvm_ir::IrArray::Index::const_iterator it = index.begin();
-
-        for (size_t i = 0; i < input_index.size(); ++i) {
-          if (input_index[i] == nullptr) {
-            input_index[i] = *it++;
-          }
-        }
-        CHECK(index.end() == it);
-
-        // Apply the reduction function to the loaded value.
-        llvm::Value* input_address =
-            arg_array.EmitArrayElementAddress(input_index, &ir_builder_);
-        llvm::Value* result = EmitElementFunctionCall(
-            reducer_function, reduce->shape(),
-            {accumulator_addr, input_address}, "reduce_function");
-        ir_builder_.CreateStore(result, accumulator_addr);
-
-        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &ir_builder_);
-        return ir_builder_.CreateLoad(accumulator_addr);
-      });
+  return EmitTargetElementLoop(reduce,
+                               [&](const llvm_ir::IrArray::Index& index) {
+                                 return EmitTargetElementLoopBodyForReduce(
+                                     Cast<HloReduceInstruction>(reduce), index);
+                               });
 }
 
 Status IrEmitter::HandleSend(HloInstruction* send) {
@@ -1853,6 +1793,10 @@ Status IrEmitter::HandleSendDone(HloInstruction* send_done) {
   return Unimplemented("Send-done is not implemented on CPU.");
 }
 
+Status IrEmitter::HandleScatter(HloInstruction*) {
+  return Unimplemented("Scatter is not implemented on CPUs.");
+}
+
 Status IrEmitter::HandleSlice(HloInstruction* slice) {
   VLOG(2) << "HandleSlice: " << slice->ToString();
   auto operand = slice->operand(0);
@@ -1942,7 +1886,7 @@ Status IrEmitter::HandleSlice(HloInstruction* slice) {
   llvm_ir::IrArray target_array = GetIrArrayFor(slice);
 
   const int64 num_outer_loops = outer_dims.size();
-  llvm_ir::ForLoopNest loops(IrName(slice), &ir_builder_);
+  llvm_ir::ForLoopNest loops(IrName(slice), &b_);
   llvm_ir::IrArray::Index target_index =
       loops.AddLoopsForShapeOnDimensions(slice->shape(), outer_dims, "slice");
 
@@ -1951,21 +1895,21 @@ Status IrEmitter::HandleSlice(HloInstruction* slice) {
   // for the rest of the dimensions the copy writes to the full dimension.
   std::replace(target_index.begin(), target_index.end(),
                static_cast<llvm::Value*>(nullptr),
-               static_cast<llvm::Value*>(ir_builder_.getInt64(0)));
+               static_cast<llvm::Value*>(b_.getInt64(0)));
 
   if (num_outer_loops > 0) {
-    SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &ir_builder_);
+    SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &b_);
   }
 
   llvm_ir::IrArray source_array = GetIrArrayFor(operand);
   const llvm_ir::IrArray::Index source_index = target_index.SourceIndexOfSlice(
       /*shape=*/slice->shape(), /*starts=*/slice->slice_starts(),
-      /*strides=*/slice->slice_strides(), /*builder=*/&ir_builder_);
+      /*strides=*/slice->slice_strides(), /*builder=*/&b_);
 
-  llvm::Value* memcpy_dest = target_array.EmitArrayElementAddress(
-      target_index, &ir_builder_, "slice.dest");
-  llvm::Value* memcpy_source = source_array.EmitArrayElementAddress(
-      source_index, &ir_builder_, "slice.source");
+  llvm::Value* memcpy_dest =
+      target_array.EmitArrayElementAddress(target_index, &b_, "slice.dest");
+  llvm::Value* memcpy_source =
+      source_array.EmitArrayElementAddress(source_index, &b_, "slice.source");
 
   const int64 memcpy_elements =
       primitive_elements_per_logical_element * memcpy_logical_elements;
@@ -1982,7 +1926,7 @@ Status IrEmitter::HandleSlice(HloInstruction* slice) {
   }
 
   if (num_outer_loops > 0) {
-    SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &ir_builder_);
+    SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &b_);
   }
 
   return Status::OK();
@@ -2008,7 +1952,7 @@ Status IrEmitter::HandleDynamicUpdateSlice(
     auto operands = GetIrArraysForOperandsOf(dynamic_update_slice);
     return llvm_ir::EmitDynamicUpdateSliceInPlace(
         operands, GetIrArrayFor(dynamic_update_slice),
-        IrName(dynamic_update_slice, "in_place"), &ir_builder_);
+        IrName(dynamic_update_slice, "in_place"), &b_);
   }
   return DefaultAction(dynamic_update_slice);
 }
@@ -2042,43 +1986,41 @@ Status IrEmitter::HandlePad(HloInstruction* pad) {
       [this, pad](const llvm_ir::IrArray::Index& target_index) {
         const HloInstruction* padding_value = pad->operand(1);
         llvm::Value* padding_value_addr = GetEmittedValueFor(padding_value);
-        return ir_builder_.CreateLoad(padding_value_addr);
+        return b_.CreateLoad(padding_value_addr);
       }));
 
   // Create a loop to iterate over the operand elements and update the output
   // locations where the operand elements should be stored.
-  llvm_ir::ForLoopNest loops(IrName(pad, "assign"), &ir_builder_);
+  llvm_ir::ForLoopNest loops(IrName(pad, "assign"), &b_);
   const HloInstruction* operand = pad->operand(0);
   const llvm_ir::IrArray::Index operand_index =
       loops.AddLoopsForShape(operand->shape(), "operand");
 
-  SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &ir_builder_);
+  SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &b_);
 
   // Load an element from the operand.
   llvm_ir::IrArray operand_array(GetIrArrayFor(operand));
   llvm::Value* operand_data =
-      operand_array.EmitReadArrayElement(operand_index, &ir_builder_);
+      operand_array.EmitReadArrayElement(operand_index, &b_);
 
   // Compute the output index the operand element should be assigned to.
   // output_index := edge_padding_low + operand_index * (interior_padding + 1)
   const PaddingConfig& padding_config = pad->padding_config();
   llvm_ir::IrArray::Index output_index(operand_index.GetType());
   for (size_t i = 0; i < operand_index.size(); ++i) {
-    llvm::Value* offset = ir_builder_.CreateMul(
+    llvm::Value* offset = b_.CreateMul(
         operand_index[i],
-        ir_builder_.getInt64(padding_config.dimensions(i).interior_padding() +
-                             1));
-    llvm::Value* index = ir_builder_.CreateAdd(
-        offset,
-        ir_builder_.getInt64(padding_config.dimensions(i).edge_padding_low()));
+        b_.getInt64(padding_config.dimensions(i).interior_padding() + 1));
+    llvm::Value* index = b_.CreateAdd(
+        offset, b_.getInt64(padding_config.dimensions(i).edge_padding_low()));
     output_index.push_back(index);
   }
 
   // Store the operand element to the computed output location.
   llvm_ir::IrArray output_array(GetIrArrayFor(pad));
-  output_array.EmitWriteArrayElement(output_index, operand_data, &ir_builder_);
+  output_array.EmitWriteArrayElement(output_index, operand_data, &b_);
 
-  SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &ir_builder_);
+  SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &b_);
   return Status::OK();
 }
 
@@ -2100,8 +2042,7 @@ Status IrEmitter::HandleFusion(HloInstruction* fusion) {
     // Delegate to common implementation of fused in-place dynamic-update-slice.
     auto operands = GetIrArraysForOperandsOf(fusion);
     return llvm_ir::EmitFusedDynamicUpdateSliceInPlace(
-        fusion, operands, GetIrArrayFor(fusion), &elemental_emitter,
-        &ir_builder_);
+        fusion, operands, GetIrArrayFor(fusion), &elemental_emitter, &b_);
   } else if (fusion->fusion_kind() == HloInstruction::FusionKind::kLoop) {
     VLOG(3) << "HandleFusion kLoop";
     CpuElementalIrEmitter elemental_emitter(hlo_module_config_, this, module_);
@@ -2136,7 +2077,7 @@ Status IrEmitter::HandleFusion(HloInstruction* fusion) {
 
     TF_RETURN_IF_ERROR(DotOpEmitter::EmitDotOperation(
         *dot, target_array, lhs_array, rhs_array, &addend_array,
-        GetExecutableRunOptionsArgument(), &ir_builder_, hlo_module_config_,
+        GetExecutableRunOptionsArgument(), &b_, hlo_module_config_,
         target_machine_features_));
     return Status::OK();
   } else {
@@ -2148,18 +2089,13 @@ Status IrEmitter::HandleCall(HloInstruction* call) {
   HloComputation* computation = call->to_apply();
   llvm::Function* call_ir_function = FindOrDie(emitted_functions_, computation);
 
-  std::vector<llvm::Value*> parameter_addresses;
-  for (const HloInstruction* operand : call->operands()) {
-    parameter_addresses.push_back(GetEmittedValueFor(operand));
-  }
-
   TF_RETURN_IF_ERROR(EmitTargetAddressForOp(call));
 
   if (!computation->root_instruction()->outer_dimension_partitions().empty()) {
     // ParallelTaskAssignment assigned partitions, emit call to
     // ParallelForkJoin.
     std::vector<llvm::Value*> call_args = GetArrayFunctionCallArguments(
-        parameter_addresses, &ir_builder_, computation->name(),
+        {}, &b_, computation->name(),
         /*return_value_buffer=*/emitted_value_[call],
         /*exec_run_options_arg=*/GetExecutableRunOptionsArgument(),
         /*temp_buffers_arg=*/GetTempBuffersArgument(),
@@ -2167,11 +2103,10 @@ Status IrEmitter::HandleCall(HloInstruction* call) {
 
     HloInstruction* root = computation->root_instruction();
     TF_RETURN_IF_ERROR(EmitCallToParallelForkJoin(
-        call_args, root->shape(), root->outer_dimension_partitions(),
-        &ir_builder_, call_ir_function, computation->name()));
+        call_args, root->shape(), root->outer_dimension_partitions(), &b_,
+        call_ir_function, computation->name()));
   } else {
-    EmitArrayFunctionCallInto(call_ir_function, parameter_addresses,
-                              emitted_value_[call], computation->name());
+    EmitGlobalCall(*computation, computation->name());
   }
 
   return Status::OK();
@@ -2180,33 +2115,31 @@ Status IrEmitter::HandleCall(HloInstruction* call) {
 Status IrEmitter::HandleCustomCall(HloInstruction* custom_call) {
   gtl::ArraySlice<HloInstruction*> operands(custom_call->operands());
   tensorflow::StringPiece custom_call_target(custom_call->custom_call_target());
-  llvm::Type* i8_ptr_type = ir_builder_.getInt8PtrTy();
+  llvm::Type* i8_ptr_type = b_.getInt8PtrTy();
   llvm::AllocaInst* operands_alloca =
       llvm_ir::EmitAllocaAtFunctionEntryWithCount(
-          i8_ptr_type, ir_builder_.getInt32(operands.size()),
-          "cc_operands_alloca", &ir_builder_);
+          i8_ptr_type, b_.getInt32(operands.size()), "cc_operands_alloca", &b_);
   for (size_t i = 0; i < operands.size(); ++i) {
     const HloInstruction* operand = operands[i];
     llvm::Value* operand_as_i8ptr =
-        ir_builder_.CreatePointerCast(GetEmittedValueFor(operand), i8_ptr_type);
-    llvm::Value* slot_in_operands_alloca = ir_builder_.CreateInBoundsGEP(
-        operands_alloca, {ir_builder_.getInt64(i)});
-    ir_builder_.CreateStore(operand_as_i8ptr, slot_in_operands_alloca);
+        b_.CreatePointerCast(GetEmittedValueFor(operand), i8_ptr_type);
+    llvm::Value* slot_in_operands_alloca =
+        b_.CreateInBoundsGEP(operands_alloca, {b_.getInt64(i)});
+    b_.CreateStore(operand_as_i8ptr, slot_in_operands_alloca);
   }
   auto* custom_call_ir_function =
       llvm::cast<llvm::Function>(module_->getOrInsertFunction(
           AsStringRef(custom_call_target),
           llvm::FunctionType::get(
-              /*Result=*/ir_builder_.getVoidTy(),
+              /*Result=*/b_.getVoidTy(),
               /*Params=*/{i8_ptr_type, operands_alloca->getType()},
               /*isVarArg=*/false)));
 
   TF_RETURN_IF_ERROR(EmitTargetAddressForOp(custom_call));
-  auto* output_address_arg = ir_builder_.CreatePointerCast(
-      GetEmittedValueFor(custom_call), i8_ptr_type);
+  auto* output_address_arg =
+      b_.CreatePointerCast(GetEmittedValueFor(custom_call), i8_ptr_type);
 
-  ir_builder_.CreateCall(custom_call_ir_function,
-                         {output_address_arg, operands_alloca});
+  b_.CreateCall(custom_call_ir_function, {output_address_arg, operands_alloca});
 
   return Status::OK();
 }
@@ -2254,12 +2187,6 @@ Status IrEmitter::HandleWhile(HloInstruction* xla_while) {
   const HloInstruction* init = xla_while->operand(0);
   emitted_value_[xla_while] = GetEmittedValueFor(init);
 
-  // The called computation should have been emitted previously.
-  llvm::Function* condition_ir_function =
-      FindOrDie(emitted_functions_, condition);
-  llvm::Function* body_ir_function =
-      FindOrDie(emitted_functions_, xla_while->while_body());
-
   // Generating:
   //   while (Condition(while_result)) {
   //     // CopyInsertion pass inserts copies which enable 'while_result' to
@@ -2271,17 +2198,15 @@ Status IrEmitter::HandleWhile(HloInstruction* xla_while) {
   llvm::BasicBlock* header_bb = llvm::BasicBlock::Create(
       module_->getContext(), AsStringRef(IrName(xla_while, "header")),
       compute_function_->function());
-  ir_builder_.CreateBr(header_bb);
-  ir_builder_.SetInsertPoint(header_bb);
+  b_.CreateBr(header_bb);
+  b_.SetInsertPoint(header_bb);
 
   // Calls the condition function to determine whether to proceed with the
   // body.  It must return a bool, so use the scalar call form.
-  llvm::Value* while_result = GetEmittedValueFor(xla_while);
-  llvm::Value* while_condition = EmitElementFunctionCall(
-      condition_ir_function, condition->root_instruction()->shape(),
-      {while_result}, IrName(xla_while, "cond"));
-  llvm::Value* while_predicate = ir_builder_.CreateICmpNE(
-      while_condition,
+  EmitGlobalCall(*xla_while->while_condition(), IrName(xla_while, "cond"));
+  llvm::Value* while_predicate = b_.CreateICmpNE(
+      b_.CreateLoad(
+          GetBufferForGlobalCallReturnValue(*xla_while->while_condition())),
       llvm::ConstantInt::get(llvm_ir::PrimitiveTypeToIrType(PRED, module_), 0));
 
   // Branches to the body or to the while exit depending on the condition.
@@ -2290,20 +2215,20 @@ Status IrEmitter::HandleWhile(HloInstruction* xla_while) {
       compute_function_->function());
   llvm::BasicBlock* exit_bb = llvm::BasicBlock::Create(
       module_->getContext(), AsStringRef(IrName(xla_while, "exit")));
-  ir_builder_.CreateCondBr(while_predicate, body_bb, exit_bb);
+  b_.CreateCondBr(while_predicate, body_bb, exit_bb);
 
   // Calls the body function from the body block.
-  ir_builder_.SetInsertPoint(body_bb);
+  b_.SetInsertPoint(body_bb);
 
   // Calls the body function.
-  EmitArrayFunctionCallInto(body_ir_function, {while_result}, while_result,
-                            IrName(xla_while, "body"));
+  EmitGlobalCall(*xla_while->while_body(), IrName(xla_while, "body"));
+
   // Finishes with a branch back to the header.
-  ir_builder_.CreateBr(header_bb);
+  b_.CreateBr(header_bb);
 
   // Adds the exit block to the function and sets the insert point there.
   compute_function_->function()->getBasicBlockList().push_back(exit_bb);
-  ir_builder_.SetInsertPoint(exit_bb);
+  b_.SetInsertPoint(exit_bb);
 
   return Status::OK();
 }
@@ -2345,21 +2270,21 @@ StatusOr<bool> IrEmitter::EmitFastConcatenate(
   std::vector<int64> outer_dims(std::next(concat_dim_layout_itr),
                                 output_min2maj.end());
 
-  llvm::Type* i8_ptr_type = ir_builder_.getInt8PtrTy();
-  llvm::Type* i8_type = ir_builder_.getInt8Ty();
+  llvm::Type* i8_ptr_type = b_.getInt8PtrTy();
+  llvm::Type* i8_type = b_.getInt8Ty();
 
   TF_RETURN_IF_ERROR(EmitTargetAddressForOp(concatenate));
   llvm_ir::IrArray target_array = GetIrArrayFor(concatenate);
 
-  llvm_ir::ForLoopNest loops(IrName(concatenate), &ir_builder_);
+  llvm_ir::ForLoopNest loops(IrName(concatenate), &b_);
   llvm_ir::IrArray::Index outer_dims_index =
       loops.AddLoopsForShapeOnDimensions(output_shape, outer_dims, "concat");
   std::replace(outer_dims_index.begin(), outer_dims_index.end(),
                static_cast<llvm::Value*>(nullptr),
-               static_cast<llvm::Value*>(ir_builder_.getInt64(0)));
+               static_cast<llvm::Value*>(b_.getInt64(0)));
 
   if (!outer_dims.empty()) {
-    SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &ir_builder_);
+    SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &b_);
   }
 
   PrimitiveType primitive_type = output_shape.element_type();
@@ -2368,10 +2293,10 @@ StatusOr<bool> IrEmitter::EmitFastConcatenate(
 
   // Contiguous subregions from each operand to the concatenate contribute to a
   // contiguous subregion in the target buffer starting at target_region_begin.
-  llvm::Value* target_region_begin = ir_builder_.CreateBitCast(
-      target_array.EmitArrayElementAddress(outer_dims_index, &ir_builder_,
-                                           "target_region"),
-      i8_ptr_type);
+  llvm::Value* target_region_begin =
+      b_.CreateBitCast(target_array.EmitArrayElementAddress(
+                           outer_dims_index, &b_, "target_region"),
+                       i8_ptr_type);
   int64 byte_offset_into_target_region = 0;
 
   int64 inner_dims_product =
@@ -2385,14 +2310,13 @@ StatusOr<bool> IrEmitter::EmitFastConcatenate(
   for (HloInstruction* operand : operands) {
     const Shape& input_shape = operand->shape();
     llvm_ir::IrArray source_array = GetIrArrayFor(operand);
-    llvm::Value* copy_source_address = ir_builder_.CreateBitCast(
-        source_array.EmitArrayElementAddress(outer_dims_index, &ir_builder_,
-                                             "src_addr"),
+    llvm::Value* copy_source_address = b_.CreateBitCast(
+        source_array.EmitArrayElementAddress(outer_dims_index, &b_, "src_addr"),
         i8_ptr_type);
 
-    llvm::Value* copy_target_address = ir_builder_.CreateGEP(
-        i8_type, target_region_begin,
-        ir_builder_.getInt64(byte_offset_into_target_region));
+    llvm::Value* copy_target_address =
+        b_.CreateGEP(i8_type, target_region_begin,
+                     b_.getInt64(byte_offset_into_target_region));
 
     EmitTransferElements(
         copy_target_address, copy_source_address,
@@ -2405,7 +2329,7 @@ StatusOr<bool> IrEmitter::EmitFastConcatenate(
   }
 
   if (!outer_dims.empty()) {
-    SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &ir_builder_);
+    SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &b_);
   }
 
   return true;
@@ -2424,16 +2348,15 @@ void IrEmitter::EmitTransferElements(llvm::Value* target, llvm::Value* source,
       llvm_ir::PrimitiveTypeToIrType(primitive_type, module_));
 
   if (element_count == 1) {
-    auto* load_instruction = ir_builder_.CreateAlignedLoad(
-        ir_builder_.CreateBitCast(source, primitive_ptr_type),
-        element_alignment);
+    auto* load_instruction = b_.CreateAlignedLoad(
+        b_.CreateBitCast(source, primitive_ptr_type), element_alignment);
     source_array.AnnotateLoadStoreInstructionWithMetadata(load_instruction);
-    auto* store_instruction = ir_builder_.CreateAlignedStore(
-        load_instruction, ir_builder_.CreateBitCast(target, primitive_ptr_type),
+    auto* store_instruction = b_.CreateAlignedStore(
+        load_instruction, b_.CreateBitCast(target, primitive_ptr_type),
         element_alignment);
     target_array.AnnotateLoadStoreInstructionWithMetadata(store_instruction);
   } else {
-    auto* memcpy_instruction = ir_builder_.CreateMemCpy(
+    auto* memcpy_instruction = b_.CreateMemCpy(
         target, /*DstAlign=*/element_alignment, source,
         /*SrcAlign=*/element_alignment, element_count * primitive_type_size);
 
@@ -2467,8 +2390,6 @@ Status IrEmitter::HandleConcatenate(HloInstruction* concatenate) {
 
 Status IrEmitter::HandleConditional(HloInstruction* conditional) {
   auto pred = conditional->operand(0);
-  auto true_arg = conditional->operand(1);
-  auto false_arg = conditional->operand(2);
   TF_RET_CHECK(ShapeUtil::IsScalar(pred->shape()) &&
                pred->shape().element_type() == PRED)
       << "Predicate on a Conditional must be bool; got: "
@@ -2490,37 +2411,31 @@ Status IrEmitter::HandleConditional(HloInstruction* conditional) {
       << " and "
       << ShapeUtil::HumanString(false_computation->root_instruction()->shape());
 
-  llvm::Function* true_function =
-      FindOrDie(emitted_functions_, true_computation);
-  llvm::Function* false_function =
-      FindOrDie(emitted_functions_, false_computation);
-
   TF_RETURN_IF_ERROR(EmitTargetAddressForOp(conditional));
-  llvm::Value* conditional_result = GetEmittedValueFor(conditional);
 
   // Generating:
   //   if (pred)
   //     cond_result = true_computation(true_operand)
   //   else
   //     cond_result = false_computation(false_operand)
-  llvm::LoadInst* pred_value = ir_builder_.CreateLoad(
+  llvm::LoadInst* pred_value = b_.CreateLoad(
       GetIrArrayFor(pred).GetBasePointer(), "load_predicate_value");
-  llvm::Value* pred_cond = ir_builder_.CreateICmpNE(
+  llvm::Value* pred_cond = b_.CreateICmpNE(
       pred_value,
       llvm::ConstantInt::get(llvm_ir::PrimitiveTypeToIrType(PRED, module_), 0),
       "boolean_predicate");
   llvm_ir::LlvmIfData if_data =
-      llvm_ir::EmitIfThenElse(pred_cond, "conditional", &ir_builder_);
+      llvm_ir::EmitIfThenElse(pred_cond, "conditional", &b_);
 
-  SetToFirstInsertPoint(if_data.true_block, &ir_builder_);
-  EmitArrayFunctionCallInto(true_function, {GetEmittedValueFor(true_arg)},
-                            conditional_result, IrName(conditional, "_true"));
+  SetToFirstInsertPoint(if_data.true_block, &b_);
+  EmitGlobalCall(*conditional->true_computation(),
+                 IrName(conditional, "_true"));
 
-  SetToFirstInsertPoint(if_data.false_block, &ir_builder_);
-  EmitArrayFunctionCallInto(false_function, {GetEmittedValueFor(false_arg)},
-                            conditional_result, IrName(conditional, "_false"));
+  SetToFirstInsertPoint(if_data.false_block, &b_);
+  EmitGlobalCall(*conditional->false_computation(),
+                 IrName(conditional, "_false"));
 
-  SetToFirstInsertPoint(if_data.after_block, &ir_builder_);
+  SetToFirstInsertPoint(if_data.after_block, &b_);
   return Status::OK();
 }
 
@@ -2531,6 +2446,28 @@ Status IrEmitter::HandleAfterAll(HloInstruction* gen_token) {
   return Status::OK();
 }
 
+Status IrEmitter::HandleIota(HloInstruction* iota) {
+  // TODO(b/64798317): implement iota on CPU.
+  return Unimplemented("Iota is not implemented on CPU.");
+}
+
+Status IrEmitter::HandleRng(HloInstruction* rng) {
+  ElementalIrEmitter::HloToElementGeneratorMap operand_to_generator;
+  for (const HloInstruction* operand : rng->operands()) {
+    operand_to_generator[operand] = [=](const llvm_ir::IrArray::Index& index) {
+      return GetIrArrayFor(operand).EmitReadArrayElement(index, &b_);
+    };
+  }
+
+  CpuElementalIrEmitter elemental_emitter(hlo_module_config_, this, module_);
+  TF_RETURN_IF_ERROR(EmitTargetElementLoop(
+      rng, elemental_emitter.MakeElementGenerator(rng, operand_to_generator)));
+
+  llvm_ir::IncrementVariableForPhiloxRngState(1, module_, &b_);
+
+  return Status::OK();
+}
+
 Status IrEmitter::FinishVisit(HloInstruction* root) {
   // When this method is called, we should have already emitted an IR value for
   // the root (return) op. The IR value holds the address of the buffer holding
@@ -2548,7 +2485,7 @@ Status IrEmitter::FinishVisit(HloInstruction* root) {
 
   auto record_complete_computation = [&](llvm::Value* prof_counter) {
     if (prof_counter) {
-      profiling_state_.RecordCompleteComputation(&ir_builder_, prof_counter);
+      profiling_state_.RecordCompleteComputation(&b_, prof_counter);
     }
   };
 
@@ -2570,54 +2507,51 @@ llvm::Value* IrEmitter::GetProfileCounterCommon(
 
   int64 prof_counter_idx = it->second;
   string counter_name = IrName("prof_counter", hlo.name());
-  return ir_builder_.CreateGEP(GetProfileCountersArgument(),
-                               ir_builder_.getInt64(prof_counter_idx),
-                               AsStringRef(counter_name));
+  return b_.CreateGEP(GetProfileCountersArgument(),
+                      b_.getInt64(prof_counter_idx), AsStringRef(counter_name));
 }
 
-void IrEmitter::ProfilingState::UpdateProfileCounter(
-    llvm::IRBuilder<>* ir_builder, llvm::Value* prof_counter,
-    llvm::Value* cycle_end, llvm::Value* cycle_start) {
-  auto* cycle_diff = ir_builder->CreateSub(cycle_end, cycle_start);
+void IrEmitter::ProfilingState::UpdateProfileCounter(llvm::IRBuilder<>* b,
+                                                     llvm::Value* prof_counter,
+                                                     llvm::Value* cycle_end,
+                                                     llvm::Value* cycle_start) {
+  auto* cycle_diff = b->CreateSub(cycle_end, cycle_start);
   llvm::LoadInst* old_cycle_count =
-      ir_builder->CreateLoad(prof_counter, "old_cycle_count");
+      b->CreateLoad(prof_counter, "old_cycle_count");
   auto* new_cycle_count =
-      ir_builder->CreateAdd(cycle_diff, old_cycle_count, "new_cycle_count");
-  ir_builder->CreateStore(new_cycle_count, prof_counter);
+      b->CreateAdd(cycle_diff, old_cycle_count, "new_cycle_count");
+  b->CreateStore(new_cycle_count, prof_counter);
 }
 
-llvm::Value* IrEmitter::ProfilingState::ReadCycleCounter(
-    llvm::IRBuilder<>* ir_builder) {
-  llvm::Module* module = ir_builder->GetInsertBlock()->getModule();
+llvm::Value* IrEmitter::ProfilingState::ReadCycleCounter(llvm::IRBuilder<>* b) {
+  llvm::Module* module = b->GetInsertBlock()->getModule();
   if (use_rdtscp_) {
     llvm::Function* func_llvm_readcyclecounter =
         llvm::Intrinsic::getDeclaration(module,
                                         llvm::Intrinsic::readcyclecounter);
-    return ir_builder->CreateCall(func_llvm_readcyclecounter);
+    return b->CreateCall(func_llvm_readcyclecounter);
   }
   llvm::Function* func_llvm_x86_rdtscp =
       llvm::Intrinsic::getDeclaration(module, llvm::Intrinsic::x86_rdtscp);
   if (!aux_i8ptr_) {
-    llvm::AllocaInst* rdtscp_aux = llvm_ir::EmitAllocaAtFunctionEntry(
-        ir_builder->getInt32Ty(), "rdtscp_aux", ir_builder);
-    aux_i8ptr_ =
-        ir_builder->CreateBitCast(rdtscp_aux, ir_builder->getInt8PtrTy());
+    llvm::AllocaInst* rdtscp_aux =
+        llvm_ir::EmitAllocaAtFunctionEntry(b->getInt32Ty(), "rdtscp_aux", b);
+    aux_i8ptr_ = b->CreateBitCast(rdtscp_aux, b->getInt8PtrTy());
   }
-  llvm::ConstantInt* alloca_size = ir_builder->getInt64(4);
+  llvm::ConstantInt* alloca_size = b->getInt64(4);
   llvm::Function* func_llvm_lifetime_start =
       llvm::Intrinsic::getDeclaration(module, llvm::Intrinsic::lifetime_start);
-  ir_builder->CreateCall(func_llvm_lifetime_start, {alloca_size, aux_i8ptr_});
-  llvm::Value* rdtscp_call =
-      ir_builder->CreateCall(func_llvm_x86_rdtscp, aux_i8ptr_);
+  b->CreateCall(func_llvm_lifetime_start, {alloca_size, aux_i8ptr_});
+  llvm::Value* rdtscp_call = b->CreateCall(func_llvm_x86_rdtscp, aux_i8ptr_);
   llvm::Function* func_llvm_lifetime_end =
       llvm::Intrinsic::getDeclaration(module, llvm::Intrinsic::lifetime_end);
-  ir_builder->CreateCall(func_llvm_lifetime_end, {alloca_size, aux_i8ptr_});
+  b->CreateCall(func_llvm_lifetime_end, {alloca_size, aux_i8ptr_});
   return rdtscp_call;
 }
 
-void IrEmitter::ProfilingState::RecordCycleStart(llvm::IRBuilder<>* ir_builder,
+void IrEmitter::ProfilingState::RecordCycleStart(llvm::IRBuilder<>* b,
                                                  HloInstruction* hlo) {
-  auto* cycle_start = ReadCycleCounter(ir_builder);
+  auto* cycle_start = ReadCycleCounter(b);
   cycle_start->setName(AsStringRef(IrName(hlo, "cycle_start")));
   cycle_starts_[hlo] = cycle_start;
   if (first_read_cycle_start_ == nullptr) {
@@ -2625,20 +2559,20 @@ void IrEmitter::ProfilingState::RecordCycleStart(llvm::IRBuilder<>* ir_builder,
   }
 }
 
-void IrEmitter::ProfilingState::RecordCycleDelta(llvm::IRBuilder<>* ir_builder,
+void IrEmitter::ProfilingState::RecordCycleDelta(llvm::IRBuilder<>* b,
                                                  HloInstruction* hlo,
                                                  llvm::Value* prof_counter) {
-  auto* cycle_end = ReadCycleCounter(ir_builder);
+  auto* cycle_end = ReadCycleCounter(b);
   cycle_end->setName(AsStringRef(IrName(hlo, "cycle_end")));
   auto* cycle_start = cycle_starts_[hlo];
-  UpdateProfileCounter(ir_builder, prof_counter, cycle_end, cycle_start);
+  UpdateProfileCounter(b, prof_counter, cycle_end, cycle_start);
   last_read_cycle_end_ = cycle_end;
 }
 
 void IrEmitter::ProfilingState::RecordCompleteComputation(
-    llvm::IRBuilder<>* ir_builder, llvm::Value* prof_counter) {
+    llvm::IRBuilder<>* b, llvm::Value* prof_counter) {
   if (last_read_cycle_end_ && first_read_cycle_start_) {
-    UpdateProfileCounter(ir_builder, prof_counter, last_read_cycle_end_,
+    UpdateProfileCounter(b, prof_counter, last_read_cycle_end_,
                          first_read_cycle_start_);
   }
 }
@@ -2646,14 +2580,14 @@ void IrEmitter::ProfilingState::RecordCompleteComputation(
 Status IrEmitter::Preprocess(HloInstruction* hlo) {
   VLOG(3) << "Visiting: " << hlo->ToString();
   if (instruction_to_profile_idx_.count(hlo)) {
-    profiling_state_.RecordCycleStart(&ir_builder_, hlo);
+    profiling_state_.RecordCycleStart(&b_, hlo);
   }
   return Status::OK();
 }
 
 Status IrEmitter::Postprocess(HloInstruction* hlo) {
   if (auto* prof_counter = GetProfileCounterFor(*hlo)) {
-    profiling_state_.RecordCycleDelta(&ir_builder_, hlo, prof_counter);
+    profiling_state_.RecordCycleDelta(&b_, hlo, prof_counter);
   }
   return Status::OK();
 }
@@ -2700,42 +2634,76 @@ llvm::Value* IrEmitter::GetExecutableRunOptionsArgument() {
   return compute_function_->exec_run_options_arg();
 }
 
-llvm::Value* IrEmitter::EmitTempBufferPointer(
+llvm::Value* IrEmitter::EmitThreadLocalTempBufferPointer(
     const BufferAllocation::Slice& slice, const Shape& target_shape) {
-  llvm::Type* element_type = IrShapeType(target_shape);
-  // The alignment and number of bytes within the temporary buffer is determined
-  // by the maximal shape as determined by buffer assignment.
-  const BufferAllocation& allocation = assignment_.GetAllocation(slice.index());
-  if (allocation.is_thread_local()) {
+  const BufferAllocation& allocation = *slice.allocation();
+  llvm::Value* tempbuf_address = [&]() -> llvm::Value* {
+    if (slice == computation_root_allocation_) {
+      llvm::Argument* retval = compute_function_->result_arg();
+      llvm::AttrBuilder attr_builder;
+      attr_builder.addAlignmentAttr(MinimumAlignmentForShape(target_shape));
+      attr_builder.addDereferenceableAttr(ByteSizeOf(target_shape));
+      retval->addAttrs(attr_builder);
+      return retval;
+    }
+
+    auto param_it =
+        computation_parameter_allocations_.find(slice.allocation()->index());
+    if (param_it != computation_parameter_allocations_.end()) {
+      int64 param_number = param_it->second;
+      // We have to access the parameter at offset param_number in the params
+      // array. The code generated here is equivalent to this C code:
+      //
+      //   i8* param_address_untyped = params[param_number];
+      //   Param* param_address_typed = (Param*)param_address_untyped;
+      //
+      // Where Param is the actual element type of the underlying buffer (for
+      // example, float for an XLA F32 element type).
+      llvm::Value* params = compute_function_->parameters_arg();
+      llvm::Value* param_address_offset =
+          llvm_ir::EmitBufferIndexingGEP(params, param_number, &b_);
+      llvm::LoadInst* param_address_untyped =
+          b_.CreateLoad(param_address_offset);
+
+      if (!ShapeUtil::IsOpaque(target_shape)) {
+        AttachAlignmentMetadataForLoad(param_address_untyped, target_shape);
+        AttachDereferenceableMetadataForLoad(param_address_untyped,
+                                             target_shape);
+      }
+      return param_address_untyped;
+    }
+
     // Thread-local allocations should only be assigned a single buffer.
     const auto& assigned_buffers = allocation.assigned_buffers();
     CHECK_EQ(1, assigned_buffers.size());
     const Shape& shape = assigned_buffers.begin()->first->shape();
 
-    llvm::AllocaInst*& tempbuf_address = thread_local_buffers_[{
-        ir_builder_.GetInsertBlock()->getParent(), slice}];
-    if (tempbuf_address == nullptr) {
-      tempbuf_address = llvm_ir::EmitAllocaAtFunctionEntry(
+    std::pair<llvm::Function*, BufferAllocation::Slice> key = {
+        compute_function_->function(), slice};
+    auto buf_it = thread_local_buffers_.find(key);
+    if (buf_it == thread_local_buffers_.end()) {
+      llvm::Value* buffer = llvm_ir::EmitAllocaAtFunctionEntry(
           IrShapeType(shape),
-          tensorflow::strings::StrCat("thread_local", slice.ToString()),
-          &ir_builder_, MinimumAlignmentForShape(target_shape));
+          tensorflow::strings::StrCat("thread_local", slice.ToString()), &b_,
+          MinimumAlignmentForShape(target_shape));
+      auto it_inserted_pair = thread_local_buffers_.insert({key, buffer});
+      CHECK(it_inserted_pair.second);
+      buf_it = it_inserted_pair.first;
     }
-    return ir_builder_.CreateBitCast(tempbuf_address,
-                                     element_type->getPointerTo());
-  }
+    return buf_it->second;
+  }();
+  return b_.CreateBitCast(tempbuf_address,
+                          IrShapeType(target_shape)->getPointerTo());
+}
 
+llvm::Value* IrEmitter::EmitGlobalTempBufferPointer(
+    const BufferAllocation::Slice& slice, const Shape& target_shape) {
+  const BufferAllocation& allocation = *slice.allocation();
   llvm::Value* tempbuf_address_ptr = llvm_ir::EmitBufferIndexingGEP(
-      GetTempBuffersArgument(), slice.index(), &ir_builder_);
-  llvm::LoadInst* tempbuf_address_base =
-      ir_builder_.CreateLoad(tempbuf_address_ptr);
-  if (is_top_level_computation_ &&
-      hlo_module_config_.debug_options()
+      GetTempBuffersArgument(), slice.index(), &b_);
+  llvm::LoadInst* tempbuf_address_base = b_.CreateLoad(tempbuf_address_ptr);
+  if (hlo_module_config_.debug_options()
           .xla_llvm_enable_invariant_load_metadata()) {
-    // In the entry computation the parameter slots in the %params argument are
-    // invariant through program execution.  In computations that are called
-    // from the entry computation (via kWhile, kCall and kConditional) the
-    // parameter slots are *not* invariant since they're written to by their
-    // callers.
     tempbuf_address_base->setMetadata(
         llvm::LLVMContext::MD_invariant_load,
         llvm::MDNode::get(tempbuf_address_base->getContext(), /*MDs=*/{}));
@@ -2746,90 +2714,29 @@ llvm::Value* IrEmitter::EmitTempBufferPointer(
   llvm::Value* tempbuf_address_untyped = tempbuf_address_base;
   if (slice.offset() > 0) {
     // Adjust the address to account for the slice offset.
-    tempbuf_address_untyped = ir_builder_.CreateInBoundsGEP(
-        tempbuf_address_base, ir_builder_.getInt64(slice.offset()));
+    tempbuf_address_untyped =
+        b_.CreateInBoundsGEP(tempbuf_address_base, b_.getInt64(slice.offset()));
   }
-  return ir_builder_.CreateBitCast(tempbuf_address_untyped,
-                                   element_type->getPointerTo());
-}
-
-// Emits a function call returning a single array element.  Allocates space
-// for a single element_type value, and loads it after call.
-llvm::Value* IrEmitter::EmitElementFunctionCall(
-    llvm::Function* function, const Shape& return_shape,
-    gtl::ArraySlice<llvm::Value*> parameter_addresses,
-    tensorflow::StringPiece name) {
-  llvm::Value* return_value_buffer = EmitArrayFunctionCall(
-      function, return_shape, 1, parameter_addresses, name);
-  return ir_builder_.CreateLoad(
-      return_value_buffer,
-      AsStringRef(tensorflow::strings::StrCat(name, "_return_value")));
-}
-
-// Emits a core function call based on the following pseudo-code.
-//
-//   char** parameter_addresses_buffer =
-//       allocate buffer with a pointer for each parameter to the function
-//   for each parameter index, i.e. for i = 0, ..., #parameters:
-//     parameter_addresses_buffer[i] = parameter_addresses[i]
-//   call function(return_value_buffer,
-//                 parameter_addresses_buffer,
-//                 temps)
-//   return return_value_buffer  -- address of the return value.
-void IrEmitter::EmitArrayFunctionCallInto(
-    llvm::Function* function, gtl::ArraySlice<llvm::Value*> parameter_addresses,
-    llvm::Value* return_value_buffer, tensorflow::StringPiece name) {
-  ir_builder_.CreateCall(
-      function, GetArrayFunctionCallArguments(
-                    parameter_addresses, &ir_builder_, name,
-                    /*return_value_buffer=*/return_value_buffer,
-                    /*exec_run_options_arg=*/GetExecutableRunOptionsArgument(),
-                    /*temp_buffers_arg=*/GetTempBuffersArgument(),
-                    /*profile_counters_arg=*/GetProfileCountersArgument()));
+  return b_.CreateBitCast(tempbuf_address_untyped,
+                          IrShapeType(target_shape)->getPointerTo());
 }
 
-llvm::Value* IrEmitter::EmitArrayFunctionCall(
-    llvm::Function* function, const Shape& return_shape, int64 element_count,
-    gtl::ArraySlice<llvm::Value*> parameter_addresses,
-    tensorflow::StringPiece name) {
-  llvm::Value* elements =
-      llvm::ConstantInt::get(ir_builder_.getInt64Ty(), element_count);
-  PrimitiveType return_type = return_shape.element_type();
-  llvm::Value* return_value_buffer =
-      llvm_ir::EmitAllocaAtFunctionEntryWithCount(
-          llvm_ir::PrimitiveTypeToIrType(return_type, module_), elements,
-          tensorflow::strings::StrCat(name, "_return_value_address"),
-          &ir_builder_, MinimumAlignmentForPrimitiveType(return_type));
-  EmitArrayFunctionCallInto(function, parameter_addresses, return_value_buffer,
-                            name);
-  return return_value_buffer;
+llvm::Value* IrEmitter::EmitTempBufferPointer(
+    const BufferAllocation::Slice& slice, const Shape& target_shape) {
+  if (slice.allocation()->is_thread_local()) {
+    return EmitThreadLocalTempBufferPointer(slice, target_shape);
+  } else if (slice.allocation()->is_constant()) {
+    return FindOrDie(constant_buffer_to_global_, slice.allocation()->index());
+  } else {
+    return EmitGlobalTempBufferPointer(slice, target_shape);
+  }
 }
 
 Status IrEmitter::EmitTargetAddressForOp(const HloInstruction* op) {
-  llvm::Value* addr;
   const Shape& target_shape = op->shape();
-  if (op == op->parent()->root_instruction()) {
-    // For the root node, we write directly to the output buffer of the
-    // function.
-    llvm::Argument* retval = compute_function_->result_arg();
-    if ((ShapeUtil::IsArray(target_shape) &&
-         !ShapeUtil::IsZeroElementArray(target_shape)) ||
-        (ShapeUtil::IsTuple(target_shape) &&
-         !ShapeUtil::IsEmptyTuple(target_shape))) {
-      llvm::AttrBuilder attr_builder;
-      attr_builder.addAlignmentAttr(MinimumAlignmentForShape(target_shape));
-      attr_builder.addDereferenceableAttr(ByteSizeOf(target_shape));
-      retval->addAttrs(attr_builder);
-    }
-    addr = ir_builder_.CreateBitCast(retval,
-                                     IrShapeType(target_shape)->getPointerTo());
-  } else {
-    // For other nodes, we need the temporary buffer allocated for this node to
-    // write the result into.
-    TF_ASSIGN_OR_RETURN(const BufferAllocation::Slice slice,
-                        assignment_.GetUniqueTopLevelSlice(op));
-    addr = EmitTempBufferPointer(slice, target_shape);
-  }
+  TF_ASSIGN_OR_RETURN(const BufferAllocation::Slice slice,
+                      assignment_.GetUniqueTopLevelSlice(op));
+  llvm::Value* addr = EmitTempBufferPointer(slice, target_shape);
   addr->setName(AsStringRef(IrName(op)));
   emitted_value_[op] = addr;
   return Status::OK();
@@ -2864,14 +2771,14 @@ Status IrEmitter::EmitTargetElementLoop(
           llvm_ir::IrArray(op_target_address, element_shape));
     }
     TF_RETURN_IF_ERROR(
-        llvm_ir::LoopEmitter(element_generator, output_arrays, &ir_builder_)
+        llvm_ir::LoopEmitter(element_generator, output_arrays, &b_)
             .EmitLoop(IrName(target_op)));
 
     std::vector<llvm::Value*> tuple_operand_ptrs;
     for (int64 i = 0; i < output_arrays.size(); ++i) {
       tuple_operand_ptrs.push_back(output_arrays[i].GetBasePointer());
     }
-    llvm_ir::EmitTuple(target_array, tuple_operand_ptrs, &ir_builder_, module_);
+    llvm_ir::EmitTuple(target_array, tuple_operand_ptrs, &b_, module_);
 
   } else {
     if (ShouldEmitParallelLoopFor(*target_op)) {
@@ -2880,11 +2787,11 @@ Status IrEmitter::EmitTargetElementLoop(
           compute_function_->GetDynamicLoopBounds();
       // Emit parallel loop with dynamic loop bounds for most-major dimensions.
       TF_RETURN_IF_ERROR(ParallelLoopEmitter(element_generator, target_array,
-                                             &dynamic_loop_bounds, &ir_builder_)
+                                             &dynamic_loop_bounds, &b_)
                              .EmitLoop(IrName(target_op)));
     } else {
       TF_RETURN_IF_ERROR(
-          llvm_ir::LoopEmitter(element_generator, target_array, &ir_builder_)
+          llvm_ir::LoopEmitter(element_generator, target_array, &b_)
               .EmitLoop(IrName(target_op)));
     }
   }
@@ -2897,8 +2804,8 @@ Status IrEmitter::EmitMemcpy(const HloInstruction& source,
   llvm::Value* destination_value = GetEmittedValueFor(&destination);
   int64 source_size = ByteSizeOf(source.shape());
   // TODO(b/63762267): Be more aggressive about specifying alignment.
-  ir_builder_.CreateMemCpy(destination_value, /*DstAlign=*/1, source_value,
-                           /*SrcAlign=*/1, source_size);
+  b_.CreateMemCpy(destination_value, /*DstAlign=*/1, source_value,
+                  /*SrcAlign=*/1, source_size);
   return Status::OK();
 }
 
@@ -2926,7 +2833,7 @@ Status IrEmitter::DefaultAction(HloInstruction* hlo) {
   ElementalIrEmitter::HloToElementGeneratorMap operand_to_generator;
   for (const HloInstruction* operand : hlo->operands()) {
     operand_to_generator[operand] = [=](const llvm_ir::IrArray::Index& index) {
-      return GetIrArrayFor(operand).EmitReadArrayElement(index, &ir_builder_);
+      return GetIrArrayFor(operand).EmitReadArrayElement(index, &b_);
     };
   }
   CpuElementalIrEmitter elemental_emitter(hlo_module_config_, this, module_);
@@ -2934,20 +2841,69 @@ Status IrEmitter::DefaultAction(HloInstruction* hlo) {
       hlo, elemental_emitter.MakeElementGenerator(hlo, operand_to_generator));
 }
 
-StatusOr<llvm::Value*> IrEmitter::EmitScalarCall(
-    PrimitiveType return_type, HloComputation* computation,
-    const std::vector<llvm::Value*>& arguments, tensorflow::StringPiece name) {
-  llvm::Function* llvm_function = FindOrDie(emitted_functions_, computation);
-  std::vector<llvm::Value*> argument_addrs;
-  for (auto argument : arguments) {
-    llvm::Value* argument_addr = llvm_ir::EmitAllocaAtFunctionEntry(
-        argument->getType(), "arg_addr", &ir_builder_);
-    ir_builder_.CreateStore(argument, argument_addr);
-    argument_addrs.push_back(argument_addr);
+llvm::Value* IrEmitter::EmitThreadLocalCall(
+    const HloComputation& callee,
+    tensorflow::gtl::ArraySlice<llvm::Value*> parameters,
+    tensorflow::StringPiece name) {
+  const Shape& return_shape = callee.root_instruction()->shape();
+
+  // Lifting this restriction to allow "small" arrays should be easy.  Allowing
+  // larger arrays is difficult because we allocate the buffer for this return
+  // value on the stack.
+  CHECK(ShapeUtil::IsScalar(return_shape));
+
+  PrimitiveType return_type = return_shape.element_type();
+
+  std::vector<llvm::Value*> parameter_addrs;
+  for (llvm::Value* parameter : parameters) {
+    CHECK(!parameter->getType()->isPointerTy());
+    llvm::Value* parameter_addr = llvm_ir::EmitAllocaAtFunctionEntry(
+        parameter->getType(), "arg_addr", &b_);
+    b_.CreateStore(parameter, parameter_addr);
+    parameter_addrs.push_back(parameter_addr);
+  }
+
+  llvm::Value* return_value_buffer = llvm_ir::EmitAllocaAtFunctionEntry(
+      llvm_ir::PrimitiveTypeToIrType(return_type, module_),
+      tensorflow::strings::StrCat(name, "_retval_addr"), &b_,
+      MinimumAlignmentForPrimitiveType(return_type));
+
+  b_.CreateCall(
+      FindOrDie(emitted_functions_, &callee),
+      GetArrayFunctionCallArguments(
+          parameter_addrs, &b_, name,
+          /*return_value_buffer=*/return_value_buffer,
+          /*exec_run_options_arg=*/GetExecutableRunOptionsArgument(),
+          /*temp_buffers_arg=*/
+          llvm::Constant::getNullValue(b_.getInt8PtrTy()->getPointerTo()),
+          /*profile_counters_arg=*/GetProfileCountersArgument()));
+
+  return b_.CreateLoad(return_value_buffer);
+}
+
+void IrEmitter::EmitGlobalCall(const HloComputation& callee,
+                               tensorflow::StringPiece name) {
+  b_.CreateCall(FindOrDie(emitted_functions_, &callee),
+                GetArrayFunctionCallArguments(
+                    /*parameter_addresses=*/{}, &b_, name,
+                    /*return_value_buffer=*/
+                    llvm::Constant::getNullValue(b_.getInt8PtrTy()),
+                    /*exec_run_options_arg=*/GetExecutableRunOptionsArgument(),
+                    /*temp_buffers_arg=*/GetTempBuffersArgument(),
+                    /*profile_counters_arg=*/GetProfileCountersArgument()));
+}
+
+llvm::Value* IrEmitter::GetBufferForGlobalCallReturnValue(
+    const HloComputation& callee) {
+  const HloInstruction* root_inst = callee.root_instruction();
+  if (root_inst->opcode() == HloOpcode::kOutfeed) {
+    return llvm::Constant::getNullValue(b_.getInt8PtrTy());
   }
-  return EmitElementFunctionCall(llvm_function,
-                                 ShapeUtil::MakeShape(return_type, {}),
-                                 argument_addrs, name);
+
+  const BufferAllocation::Slice root_buffer =
+      assignment_.GetUniqueTopLevelSlice(root_inst).ValueOrDie();
+  return EmitTempBufferPointer(root_buffer, root_inst->shape());
 }
+
 }  // namespace cpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/ir_emitter.h b/tensorflow/compiler/xla/service/cpu/ir_emitter.h
index 3c110a320f..c9a1dab62d 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_emitter.h
+++ b/tensorflow/compiler/xla/service/cpu/ir_emitter.h
@@ -35,6 +35,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_module_config.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/alias_analysis.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/ir_array.h"
@@ -97,12 +98,16 @@ class IrEmitter : public DfsHloVisitorWithDefault {
       bool is_top_level_computation,
       std::vector<const HloInstruction*>* instruction_order);
 
-  llvm::IRBuilder<>* ir_builder() { return &ir_builder_; }
+  llvm::IRBuilder<>* b() { return &b_; }
 
-  // Emits a call to `computation` with scalar arguments `arguments`.
-  StatusOr<llvm::Value*> EmitScalarCall(
-      PrimitiveType return_type, HloComputation* computation,
-      const std::vector<llvm::Value*>& arguments, tensorflow::StringPiece name);
+  // Emit an LLVM global variable for every constant buffer allocation.
+  Status EmitConstantGlobals();
+
+  // Emit code to map one element according to `map_instr`.
+  llvm::Value* EmitElementalMap(
+      const HloMapInstruction& map_instr,
+      tensorflow::gtl::ArraySlice<llvm::Value*> elemental_operands,
+      tensorflow::StringPiece name);
 
  protected:
   //
@@ -117,6 +122,7 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   Status HandleCopy(HloInstruction* copy) override;
   Status HandleGetTupleElement(HloInstruction* get_tuple_element) override;
   Status HandleSelect(HloInstruction* select) override;
+  Status HandleTupleSelect(HloInstruction* tuple_select) override;
   Status HandleDot(HloInstruction* dot) override;
   Status HandleConvolution(HloInstruction* convolution) override;
   Status HandleFft(HloInstruction* fft) override;
@@ -138,14 +144,16 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   Status HandleRecvDone(HloInstruction* recv_done) override;
   Status HandlePad(HloInstruction* pad) override;
   Status HandleTuple(HloInstruction* tuple) override;
-  Status HandleMap(HloInstruction* map) override;
   Status HandleFusion(HloInstruction* fusion) override;
   Status HandleCall(HloInstruction* call) override;
   Status HandleCustomCall(HloInstruction* custom_call) override;
   Status HandleWhile(HloInstruction* xla_while) override;
   Status HandleConcatenate(HloInstruction* concatenate) override;
   Status HandleConditional(HloInstruction* conditional) override;
+  Status HandleScatter(HloInstruction* scatter) override;
   Status HandleAfterAll(HloInstruction* gen_token) override;
+  Status HandleIota(HloInstruction* iota) override;
+  Status HandleRng(HloInstruction* rng) override;
   Status FinishVisit(HloInstruction* root) override;
 
   Status Preprocess(HloInstruction* hlo) override;
@@ -211,9 +219,18 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   // computation function being emitted by this emitter.
   llvm::Value* GetTempBuffersArgument();
 
-  // Emits code that computes the address of the given temporary buffer to the
-  // function. target_shape is the shape of this temporary buffer.
-  // The returned Value's type is a pointer to element_type.
+  // Helper for EmitTempBufferPointer.
+  llvm::Value* EmitGlobalTempBufferPointer(const BufferAllocation::Slice& slice,
+                                           const Shape& target_shape);
+
+  // Helper for EmitTempBufferPointer.
+  llvm::Value* EmitThreadLocalTempBufferPointer(
+      const BufferAllocation::Slice& slice, const Shape& target_shape);
+
+  // Emits code that computes the address of the given buffer allocation slice.
+  //
+  // TODO(sanjoy): This should be renamed to reflect that it no longer provides
+  // access to just temporaries.
   llvm::Value* EmitTempBufferPointer(const BufferAllocation::Slice& slice,
                                      const Shape& target_shape);
 
@@ -225,44 +242,27 @@ class IrEmitter : public DfsHloVisitorWithDefault {
       tensorflow::StringPiece
           function_name_suffix);  // Used for LLVM IR register names.
 
-  // Methods that emit a function call.
-  // Parameters:
-  //   function - The LLVM function to call.
-  //   return_shape - The return shape of the HLO computation that was used to
-  //     make the function.  Not the same as the return type of the function
-  //     in LLVM, since we use output parameters for the return type.
-  //   element_count - number of elements to return (array form only).
-  //   parameter_addresses - pointers to be passed to the function as
-  //     parameters.
-  //   name - used for LLVM IR register names.
-
-  // Emits a function call, returning a scalar, often an element of a larger
-  // array.  Returns a Value for the scalar element returned by the function.
-  llvm::Value* EmitElementFunctionCall(
-      llvm::Function* function, const Shape& return_shape,
-      tensorflow::gtl::ArraySlice<llvm::Value*> parameter_addresses,
+  // Emits a call to a thread local function (e.g. to the computation nested
+  // within a reduce or a map).  Thread local callees (by definition) only write
+  // to and read from thread local allocations.
+  //
+  // `parameters` holds the *scalar values* that need to be passed to the
+  // callee.  The return value is the scalar returned by the callee.
+  llvm::Value* EmitThreadLocalCall(
+      const HloComputation& callee,
+      tensorflow::gtl::ArraySlice<llvm::Value*> parameters,
       tensorflow::StringPiece name);
 
-  // Array function call emitter.  Stores the function's result into a supplied
-  // buffer.
-  // Parameters:
-  //   function - The LLVM function to call.
-  //   parameter_addresses - pointers to be passed to the function as
-  //     parameters.
-  //   return_value - pointer to a buffer where the call result is stored.
-
-  void EmitArrayFunctionCallInto(
-      llvm::Function* function,
-      tensorflow::gtl::ArraySlice<llvm::Value*> parameter_addresses,
-      llvm::Value* return_value_buffer, tensorflow::StringPiece name);
-
-  // Array function call emitter.  Returns a Value for the function's return
-  // value buffer address. The return value buffer is alloca'ed by this
-  // function.
-  llvm::Value* EmitArrayFunctionCall(
-      llvm::Function* function, const Shape& return_shape, int64 element_count,
-      tensorflow::gtl::ArraySlice<llvm::Value*> parameter_addresses,
-      tensorflow::StringPiece name);
+  // Emits a call to a "global" function (e.g. to the computation nested within
+  // a kWhile or a kCall).  Buffer assignment unabiguously assignes buffers to
+  // the parameters and return values for these computations so there is no need
+  // to explicitly pass parameters or return results.
+  void EmitGlobalCall(const HloComputation& callee,
+                      tensorflow::StringPiece name);
+
+  // Returns the buffer to which a global call to `callee` would have written
+  // its result.
+  llvm::Value* GetBufferForGlobalCallReturnValue(const HloComputation& callee);
 
   // Verifies that the element types of all of the given operand instructions
   // match and are of one of the given supported types.
@@ -401,11 +401,10 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   NameUniquer name_uniquer_;
 
   // Map containing all previously emitted computations.
-  std::map<HloComputation*, llvm::Function*> emitted_functions_;
+  std::map<const HloComputation*, llvm::Function*> emitted_functions_;
 
   // Map containing all previously emitted thread-local temporary buffers.
-  std::map<std::pair<llvm::Function*, BufferAllocation::Slice>,
-           llvm::AllocaInst*>
+  std::map<std::pair<llvm::Function*, BufferAllocation::Slice>, llvm::Value*>
       thread_local_buffers_;
 
   // The following fields track the IR emission state. According to LLVM memory
@@ -413,7 +412,17 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   // creates the encapsulated llvm::Function s.t. it is added to the llvm
   // module's function list).
   std::unique_ptr<IrFunction> compute_function_;
-  llvm::IRBuilder<> ir_builder_;
+  llvm::IRBuilder<> b_;
+
+  // The buffer allocation slice for the root of the computation being compiled.
+  // Only relevant for thread local computations.
+  BufferAllocation::Slice computation_root_allocation_;
+
+  // Maps the buffer allocation slices for the parameters to the computation
+  // being compiled to their parameter numbers.  Only relevant for thread local
+  // computations.
+  tensorflow::gtl::FlatMap<BufferAllocation::Index, int64>
+      computation_parameter_allocations_;
 
   // Maps HLO instructions to their index into the profile counter array.
   const std::unordered_map<const HloInstruction*, int64>
@@ -449,23 +458,22 @@ class IrEmitter : public DfsHloVisitorWithDefault {
         : use_rdtscp_(use_rdtscp), prof_counters_(prof_counters) {}
 
     // Record the cycle counter before an HLO executes.
-    void RecordCycleStart(llvm::IRBuilder<>* ir_builder, HloInstruction* hlo);
+    void RecordCycleStart(llvm::IRBuilder<>* b, HloInstruction* hlo);
     // Record the number of cycles it took for an HLO to execute.
-    void RecordCycleDelta(llvm::IRBuilder<>* ir_builder, HloInstruction* hlo,
+    void RecordCycleDelta(llvm::IRBuilder<>* b, HloInstruction* hlo,
                           llvm::Value* prof_counter);
     // Record the number of cycles it took for the entire computation to
     // execute.
-    void RecordCompleteComputation(llvm::IRBuilder<>* ir_builder,
+    void RecordCompleteComputation(llvm::IRBuilder<>* b,
                                    llvm::Value* prof_counter);
 
     // Convenience function to generate a call to an intrinsic which reads the
     // CPU cycle counter.
-    llvm::Value* ReadCycleCounter(llvm::IRBuilder<>* ir_builder);
+    llvm::Value* ReadCycleCounter(llvm::IRBuilder<>* b);
 
     // Store the cycle counter delta to the per-HLO profile counter.
-    void UpdateProfileCounter(llvm::IRBuilder<>* ir_builder,
-                              llvm::Value* prof_counter, llvm::Value* cycle_end,
-                              llvm::Value* cycle_start);
+    void UpdateProfileCounter(llvm::IRBuilder<>* b, llvm::Value* prof_counter,
+                              llvm::Value* cycle_end, llvm::Value* cycle_start);
 
    private:
     // Should we use the x86-specific rdtscp or the generic readcyclecounter
@@ -513,6 +521,17 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   // Returns the number of bytes within the shape.
   int64 ByteSizeOf(const Shape& shape) const;
 
+  StatusOr<llvm::Value*> EmitTargetElementLoopBodyForMap(
+      HloMapInstruction* map, const llvm_ir::IrArray::Index& index);
+  StatusOr<llvm::Value*> EmitTargetElementLoopBodyForReduceWindow(
+      HloReduceWindowInstruction* reduce_window,
+      const llvm_ir::IrArray::Index& index);
+  StatusOr<llvm::Value*> EmitTargetElementLoopBodyForConvolution(
+      HloConvolutionInstruction* convolution,
+      const llvm_ir::IrArray::Index& index);
+  StatusOr<llvm::Value*> EmitTargetElementLoopBodyForReduce(
+      HloReduceInstruction* reduce, const llvm_ir::IrArray::Index& index);
+
   enum class XfeedKind {
     kInfeed,
     kOutfeed,
@@ -546,6 +565,9 @@ class IrEmitter : public DfsHloVisitorWithDefault {
                            LiteralPtrHashFunctor, LiteralPtrEqualityFunctor>
       emitted_literals_;
 
+  tensorflow::gtl::FlatMap<BufferAllocation::Index, llvm::Constant*>
+      constant_buffer_to_global_;
+
   TF_DISALLOW_COPY_AND_ASSIGN(IrEmitter);
 };
 
diff --git a/tensorflow/compiler/xla/service/cpu/ir_function.cc b/tensorflow/compiler/xla/service/cpu/ir_function.cc
index 2d6f2f3818..2db4d000f5 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_function.cc
+++ b/tensorflow/compiler/xla/service/cpu/ir_function.cc
@@ -49,11 +49,10 @@ IrFunction::IrFunction(const string& function_name,
                        llvm::Function::LinkageTypes linkage,
                        const bool optimize_for_size_requested,
                        const bool enable_fast_math, llvm::Module* llvm_module,
-                       llvm::IRBuilder<>* ir_builder,
-                       int64 num_dynamic_loop_bounds)
-    : ir_builder_(ir_builder),
+                       llvm::IRBuilder<>* b, int64 num_dynamic_loop_bounds)
+    : b_(b),
       llvm_module_(llvm_module),
-      caller_insert_point_guard_(*ir_builder),
+      caller_insert_point_guard_(*b),
       num_dynamic_loop_bounds_(num_dynamic_loop_bounds) {
   Initialize(function_name, linkage, optimize_for_size_requested,
              enable_fast_math);
@@ -61,7 +60,7 @@ IrFunction::IrFunction(const string& function_name,
 
 IrFunction::~IrFunction() {
   // Emit function return value.
-  ir_builder_->CreateRetVoid();
+  b_->CreateRetVoid();
 }
 
 DynamicLoopBounds IrFunction::GetDynamicLoopBounds() {
@@ -81,9 +80,16 @@ void IrFunction::Initialize(const string& function_name,
   //   void function(i8* retval, i8* run_options, i8** params, i8** temps,
   //                 i64* dynamic_loop_bounds, i64* prof_counters)
   //
-  // retval: points to the returned value.
-  // params: address of an array with pointers to parameters.
-  // temps: address of an array with pointers to temporary buffers.
+  // For thread local functions:
+  //   retval: points to the returned value.
+  //   params: address of an array with pointers to parameters.
+  //   temps: is null
+  //
+  // For global functions:
+  //   retval: is null
+  //   params: is null
+  //   temps: address of an array with pointers to temporary buffers and entry
+  //          computation parameters.
   //
   // Therefore, the generated function's signature (FunctionType) is statically
   // determined - parameter unpacking is done in code generated into the
@@ -174,7 +180,7 @@ void IrFunction::Initialize(const string& function_name,
     function_->addAttribute(argument.getArgNo() + 1, llvm::Attribute::NoAlias);
   }
 
-  ir_builder_->SetInsertPoint(llvm::BasicBlock::Create(
+  b_->SetInsertPoint(llvm::BasicBlock::Create(
       /*Context=*/llvm_module_->getContext(),
       /*Name=*/"entry",
       /*Parent=*/function_));
@@ -184,9 +190,8 @@ llvm::Value* IrFunction::GetDynamicLoopBound(const int64 offset) {
   CHECK_GT(num_dynamic_loop_bounds_, 0);
   CHECK_LT(offset, num_dynamic_loop_bounds_ * 2);
   string name = tensorflow::strings::StrCat("dynamic_loop_bound_", offset);
-  return ir_builder_->CreateLoad(
-      ir_builder_->CreateGEP(CHECK_NOTNULL(dynamic_loop_bounds_arg_),
-                             ir_builder_->getInt64(offset), AsStringRef(name)));
+  return b_->CreateLoad(b_->CreateGEP(CHECK_NOTNULL(dynamic_loop_bounds_arg_),
+                                      b_->getInt64(offset), AsStringRef(name)));
 }
 
 // Emits code to allocate an array of parameter address pointers, and store
@@ -195,27 +200,32 @@ llvm::Value* IrFunction::GetDynamicLoopBound(const int64 offset) {
 // address buffer).
 std::vector<llvm::Value*> GetArrayFunctionCallArguments(
     tensorflow::gtl::ArraySlice<llvm::Value*> parameter_addresses,
-    llvm::IRBuilder<>* ir_builder, tensorflow::StringPiece name,
+    llvm::IRBuilder<>* b, tensorflow::StringPiece name,
     llvm::Value* return_value_buffer, llvm::Value* exec_run_options_arg,
     llvm::Value* temp_buffers_arg, llvm::Value* profile_counters_arg) {
-  llvm::Value* parameter_addresses_buffer =
-      llvm_ir::EmitAllocaAtFunctionEntryWithCount(
-          ir_builder->getInt8PtrTy(),
-          ir_builder->getInt32(parameter_addresses.size()),
-          tensorflow::strings::StrCat(name, "_parameter_addresses"),
-          ir_builder);
-  for (size_t i = 0; i < parameter_addresses.size(); ++i) {
-    llvm::Value* parameter_as_i8ptr = ir_builder->CreateBitCast(
-        parameter_addresses[i], ir_builder->getInt8PtrTy(),
-        AsStringRef(tensorflow::strings::StrCat(name, "_parameter_", i,
-                                                "_address_as_i8ptr")));
-    llvm::Value* slot_in_param_addresses = ir_builder->CreateInBoundsGEP(
-        parameter_addresses_buffer, {ir_builder->getInt64(i)});
-    ir_builder->CreateStore(parameter_as_i8ptr, slot_in_param_addresses);
+  llvm::Value* parameter_addresses_buffer;
+
+  if (parameter_addresses.empty()) {
+    parameter_addresses_buffer =
+        llvm::Constant::getNullValue(b->getInt8PtrTy()->getPointerTo());
+  } else {
+    parameter_addresses_buffer = llvm_ir::EmitAllocaAtFunctionEntryWithCount(
+        b->getInt8PtrTy(), b->getInt32(parameter_addresses.size()),
+        tensorflow::strings::StrCat(name, "_parameter_addresses"), b);
+
+    for (size_t i = 0; i < parameter_addresses.size(); ++i) {
+      llvm::Value* parameter_as_i8ptr =
+          b->CreateBitCast(parameter_addresses[i], b->getInt8PtrTy(),
+                           AsStringRef(tensorflow::strings::StrCat(
+                               name, "_parameter_", i, "_address_as_i8ptr")));
+      llvm::Value* slot_in_param_addresses =
+          b->CreateInBoundsGEP(parameter_addresses_buffer, {b->getInt64(i)});
+      b->CreateStore(parameter_as_i8ptr, slot_in_param_addresses);
+    }
   }
 
   const auto to_int8_ptr = [=](llvm::Value* ptr) {
-    return ir_builder->CreatePointerCast(ptr, ir_builder->getInt8PtrTy());
+    return b->CreatePointerCast(ptr, b->getInt8PtrTy());
   };
   std::vector<llvm::Value*> arguments{
       to_int8_ptr(return_value_buffer), to_int8_ptr(exec_run_options_arg),
@@ -230,22 +240,21 @@ std::vector<llvm::Value*> GetArrayFunctionCallArguments(
 // calls to 'parallel_function' (and joins threads before returning).
 Status EmitCallToParallelForkJoin(
     const std::vector<llvm::Value*>& arguments, const Shape& shape,
-    const std::vector<int64>& dimension_partition_counts,
-    llvm::IRBuilder<>* ir_builder, llvm::Function* parallel_function,
-    const string& name) {
-  llvm::Module* module = ir_builder->GetInsertBlock()->getModule();
+    const std::vector<int64>& dimension_partition_counts, llvm::IRBuilder<>* b,
+    llvm::Function* parallel_function, const string& name) {
+  llvm::Module* module = b->GetInsertBlock()->getModule();
 
   // Build ParallelForkJoin function type.
   std::vector<llvm::Type*> compute_function_params =
       GetComputeFunctionParams(module, /*num_dynamic_loop_bounds=*/0);
   // Number of parallel compute functions.
-  compute_function_params.push_back(ir_builder->getInt32Ty());
+  compute_function_params.push_back(b->getInt32Ty());
   // Array of partitions. There is an array element for each
   // partition x partition_dim x 2 (for dimension start and limit).
   compute_function_params.push_back(
       llvm::Type::getInt64PtrTy(module->getContext()));
   // Number of partitioned most-major dimensions in 'shape'.
-  compute_function_params.push_back(ir_builder->getInt32Ty());
+  compute_function_params.push_back(b->getInt32Ty());
   // Function pointer for compute function to be dispatched in parallel.
   compute_function_params.push_back(
       llvm::Type::getInt8PtrTy(module->getContext()));
@@ -268,7 +277,7 @@ Status EmitCallToParallelForkJoin(
   ShapePartitionIterator partition_iterator(shape, dimension_partition_counts);
   const int64 num_partitions = partition_iterator.GetTotalPartitionCount();
   // Add argument specifying the number of parallel partitions.
-  fork_join_arguments.push_back(ir_builder->getInt32(num_partitions));
+  fork_join_arguments.push_back(b->getInt32(num_partitions));
 
   // The number of partitioned most-major dimensions in 'shape'.
   const int32 num_partitioned_dims = dimension_partition_counts.size();
@@ -293,15 +302,15 @@ Status EmitCallToParallelForkJoin(
       const std::pair<int64, int64>& dim_partition = dim_partitions[j];
       const int32 index = partition_index + j * dim_partition_size;
       // Store partition [dim_start, dim_limit) intervals for each dimension.
-      partitions[index] = ir_builder->getInt64(dim_partition.first);
+      partitions[index] = b->getInt64(dim_partition.first);
       partitions[index + 1] =
-          ir_builder->getInt64(dim_partition.first + dim_partition.second);
+          b->getInt64(dim_partition.first + dim_partition.second);
     }
   }
 
   // Create global variable out of dimension partitions in 'partitions'.
   llvm::ArrayType* partitions_array_type =
-      llvm::ArrayType::get(ir_builder->getInt64Ty(), partition_array_size);
+      llvm::ArrayType::get(b->getInt64Ty(), partition_array_size);
   llvm::Constant* partitions_array =
       llvm::ConstantArray::get(partitions_array_type, partitions);
   llvm::GlobalVariable* global_partitions_array = new llvm::GlobalVariable(
@@ -315,16 +324,16 @@ Status EmitCallToParallelForkJoin(
           tensorflow::strings::StrCat(name, "_parallel_dimension_partitions")));
 
   // Add argument specifying parallel dimension partitions.
-  fork_join_arguments.push_back(ir_builder->CreateBitCast(
-      global_partitions_array,
-      llvm::Type::getInt64PtrTy(module->getContext())));
+  fork_join_arguments.push_back(
+      b->CreateBitCast(global_partitions_array,
+                       llvm::Type::getInt64PtrTy(module->getContext())));
   // Add argument specifying the number of partitioned most-major dimensions.
-  fork_join_arguments.push_back(ir_builder->getInt32(num_partitioned_dims));
+  fork_join_arguments.push_back(b->getInt32(num_partitioned_dims));
   // Add argument for parallel compute function pointer.
   fork_join_arguments.push_back(
-      ir_builder->CreateBitCast(parallel_function, ir_builder->getInt8PtrTy()));
+      b->CreateBitCast(parallel_function, b->getInt8PtrTy()));
   // Emit call to parallel fork/join.
-  ir_builder->CreateCall(fork_join_func, fork_join_arguments);
+  b->CreateCall(fork_join_func, fork_join_arguments);
 
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/cpu/ir_function.h b/tensorflow/compiler/xla/service/cpu/ir_function.h
index 2e55181eed..a41cbb64cd 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_function.h
+++ b/tensorflow/compiler/xla/service/cpu/ir_function.h
@@ -54,7 +54,7 @@ class IrFunction {
   IrFunction(const string& function_name, llvm::Function::LinkageTypes linkage,
              const bool optimize_for_size_requested,
              const bool enable_fast_math, llvm::Module* llvm_module,
-             llvm::IRBuilder<>* ir_builder, int64 num_dynamic_loop_bounds);
+             llvm::IRBuilder<>* b, int64 num_dynamic_loop_bounds);
   ~IrFunction();
 
   // Emit ir to read and return the set of ir values representing the dynamic
@@ -97,7 +97,7 @@ class IrFunction {
   // 'offset' from the "dynamic_loop_bounds" argument of this function.
   llvm::Value* GetDynamicLoopBound(int64 offset);
 
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   llvm::Module* llvm_module_;
   llvm::IRBuilder<>::InsertPointGuard caller_insert_point_guard_;
 
@@ -116,7 +116,7 @@ class IrFunction {
 // Returns an array of compute function call argument ir values.
 std::vector<llvm::Value*> GetArrayFunctionCallArguments(
     tensorflow::gtl::ArraySlice<llvm::Value*> parameter_addresses,
-    llvm::IRBuilder<>* ir_builder, tensorflow::StringPiece name,
+    llvm::IRBuilder<>* b, tensorflow::StringPiece name,
     llvm::Value* return_value_buffer, llvm::Value* exec_run_options_arg,
     llvm::Value* temp_buffers_arg, llvm::Value* profile_counters_arg);
 
@@ -124,9 +124,8 @@ std::vector<llvm::Value*> GetArrayFunctionCallArguments(
 // calls to 'parallel_function' (and joins threads before returning).
 Status EmitCallToParallelForkJoin(
     const std::vector<llvm::Value*>& arguments, const Shape& shape,
-    const std::vector<int64>& dimension_partition_counts,
-    llvm::IRBuilder<>* ir_builder, llvm::Function* parallel_function,
-    const string& name);
+    const std::vector<int64>& dimension_partition_counts, llvm::IRBuilder<>* b,
+    llvm::Function* parallel_function, const string& name);
 
 }  // namespace cpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/llvm_ir_runtime.cc b/tensorflow/compiler/xla/service/cpu/llvm_ir_runtime.cc
index 2e5cc96098..cef5e57b0b 100644
--- a/tensorflow/compiler/xla/service/cpu/llvm_ir_runtime.cc
+++ b/tensorflow/compiler/xla/service/cpu/llvm_ir_runtime.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include "llvm/IR/Verifier.h"
 #include "llvm/Transforms/Utils/Cloning.h"
 #include "tensorflow/compiler/xla/service/cpu/vector_support_library.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/math_ops.h"
 #include "tensorflow/core/lib/core/casts.h"
 #include "tensorflow/core/platform/logging.h"
 
@@ -52,46 +53,14 @@ llvm::Function* EmitVectorF32TanhIfNeeded(llvm::Module* module,
   llvm::BasicBlock* vector_tanh_body =
       llvm::BasicBlock::Create(*context, "body", vector_tanh_function);
 
-  llvm::IRBuilder<> ir_builder(vector_tanh_body);
+  llvm::IRBuilder<> b(vector_tanh_body);
   llvm::FastMathFlags fast_math_flags;
-  fast_math_flags.setFast();
-  ir_builder.setFastMathFlags(fast_math_flags);
-
-  VectorSupportLibrary vsl(F32, vector_width, &ir_builder, "tanh_f32");
+  fast_math_flags.setFast(enable_fast_math);
+  b.setFastMathFlags(fast_math_flags);
 
   llvm::Value* input = &*vector_tanh_function->arg_begin();
-  CHECK_EQ(input->getType(), vsl.vector_type());
-
-  // This implements the same rational interpolant as implemented in Eigen3.
-  llvm::Value* input_clamped =
-      vsl.Clamp(input, /*low=*/GetIeeeF32(-9.0), /*high=*/GetIeeeF32(9.0));
-
-  std::array<float, 7> numerator_coeffs{
-      -2.76076847742355e-16f, 2.00018790482477e-13f, -8.60467152213735e-11f,
-      5.12229709037114e-08f,  1.48572235717979e-05f, 6.37261928875436e-04f,
-      4.89352455891786e-03f};
-
-  std::array<float, 4> denominator_coeffs{
-      1.19825839466702e-06f, 1.18534705686654e-04f, 2.26843463243900e-03f,
-      4.89352518554385e-03f};
-
-  llvm::Value* input_squared = vsl.Mul(input_clamped, input_clamped);
-  llvm::Value* numerator = vsl.SplatFloat(GetIeeeF32(numerator_coeffs[0]));
-  for (int i = 1; i < numerator_coeffs.size(); i++) {
-    numerator =
-        vsl.MulAdd(input_squared, numerator, GetIeeeF32(numerator_coeffs[i]));
-  }
-
-  numerator = vsl.Mul(input_clamped, numerator);
-
-  llvm::Value* denominator = vsl.SplatFloat(GetIeeeF32(denominator_coeffs[0]));
-  for (int i = 1; i < denominator_coeffs.size(); i++) {
-    denominator = vsl.MulAdd(input_squared, denominator,
-                             GetIeeeF32(denominator_coeffs[i]));
-  }
-
-  llvm::Value* result = vsl.Div(numerator, denominator);
-  ir_builder.CreateRet(result);
+  CHECK_EQ(vector_width, input->getType()->getVectorNumElements());
+  b.CreateRet(llvm_ir::EmitFastTanh(&b, input));
 
   DCHECK(!llvm::verifyFunction(*vector_tanh_function));
   return vector_tanh_function;
@@ -113,12 +82,12 @@ llvm::Function* EmitVectorF32ExpIfNeeded(llvm::Module* module,
   llvm::BasicBlock* vector_exp_body =
       llvm::BasicBlock::Create(*context, "body", vector_exp_function);
 
-  llvm::IRBuilder<> ir_builder(vector_exp_body);
+  llvm::IRBuilder<> b(vector_exp_body);
   llvm::FastMathFlags fast_math_flags;
   fast_math_flags.setFast();
-  ir_builder.setFastMathFlags(fast_math_flags);
+  b.setFastMathFlags(fast_math_flags);
 
-  VectorSupportLibrary vsl(F32, vector_width, &ir_builder, "exp_f32");
+  VectorSupportLibrary vsl(F32, vector_width, &b, "exp_f32");
 
   // This implements the same polynomial approximation as implemented in Eigen3.
 
@@ -160,21 +129,21 @@ llvm::Function* EmitVectorF32ExpIfNeeded(llvm::Module* module,
   // VectorSupportLibrary (intentionally) can't juggle more than one type at a
   // time so drop down to IRBuilder for this bit.
   llvm::Value* vector_constant_0x7f =
-      ir_builder.CreateVectorSplat(vector_width, ir_builder.getInt32(0x7f));
+      b.CreateVectorSplat(vector_width, b.getInt32(0x7f));
   llvm::Value* vector_constant_23 =
-      ir_builder.CreateVectorSplat(vector_width, ir_builder.getInt32(23));
+      b.CreateVectorSplat(vector_width, b.getInt32(23));
   llvm::Type* i32_vector_type =
-      llvm::VectorType::get(ir_builder.getInt32Ty(), vector_width);
+      llvm::VectorType::get(b.getInt32Ty(), vector_width);
   // fx is clamped so we don't have to worry about it being out of range for
   // i32.
-  llvm::Value* emm0 = ir_builder.CreateFPToSI(fx, i32_vector_type);
-  emm0 = ir_builder.CreateAdd(emm0, vector_constant_0x7f);
-  emm0 = ir_builder.CreateShl(emm0, vector_constant_23);
-  llvm::Value* emm0_f32 = ir_builder.CreateBitCast(emm0, vsl.vector_type());
+  llvm::Value* emm0 = b.CreateFPToSI(fx, i32_vector_type);
+  emm0 = b.CreateAdd(emm0, vector_constant_0x7f);
+  emm0 = b.CreateShl(emm0, vector_constant_23);
+  llvm::Value* emm0_f32 = b.CreateBitCast(emm0, vsl.vector_type());
 
   llvm::Value* result = vsl.Max(vsl.Mul(y, emm0_f32), input);
 
-  ir_builder.CreateRet(result);
+  b.CreateRet(result);
 
   DCHECK(!llvm::verifyFunction(*vector_exp_function));
   return vector_exp_function;
@@ -196,13 +165,13 @@ llvm::Function* EmitVectorF32LogIfNeeded(llvm::Module* module,
   llvm::BasicBlock* vector_log_body =
       llvm::BasicBlock::Create(*context, "body", vector_log_function);
 
-  llvm::IRBuilder<> ir_builder(vector_log_body);
+  llvm::IRBuilder<> b(vector_log_body);
   llvm::FastMathFlags fast_math_flags;
   fast_math_flags.setFast();
-  ir_builder.setFastMathFlags(fast_math_flags);
+  b.setFastMathFlags(fast_math_flags);
 
   llvm::Value* input = &*vector_log_function->arg_begin();
-  VectorSupportLibrary vsl(F32, vector_width, &ir_builder, "log_f32");
+  VectorSupportLibrary vsl(F32, vector_width, &b, "log_f32");
 
   const llvm::APFloat half = GetIeeeF32(0.5);
   const llvm::APFloat one = GetIeeeF32(1.0);
@@ -238,22 +207,21 @@ llvm::Function* EmitVectorF32LogIfNeeded(llvm::Module* module,
   // VectorSupportLibrary (intentionally) can't juggle more than one type at a
   // time so drop down to IRBuilder for this bit.
   llvm::Value* vector_constant_0x7f =
-      ir_builder.CreateVectorSplat(vector_width, ir_builder.getInt32(0x7f));
+      b.CreateVectorSplat(vector_width, b.getInt32(0x7f));
   llvm::Value* vector_constant_23 =
-      ir_builder.CreateVectorSplat(vector_width, ir_builder.getInt32(23));
+      b.CreateVectorSplat(vector_width, b.getInt32(23));
   llvm::Type* i32_vector_type =
-      llvm::VectorType::get(ir_builder.getInt32Ty(), vector_width);
+      llvm::VectorType::get(b.getInt32Ty(), vector_width);
 
-  llvm::Value* emm0 = ir_builder.CreateLShr(
-      ir_builder.CreateBitCast(input, i32_vector_type), vector_constant_23);
+  llvm::Value* emm0 =
+      b.CreateLShr(b.CreateBitCast(input, i32_vector_type), vector_constant_23);
 
   // Keep only the fractional part.
   input = vsl.FloatAnd(input, inv_mant_mask);
   input = vsl.FloatOr(input, half);
 
-  emm0 = ir_builder.CreateSub(emm0, vector_constant_0x7f);
-  llvm::Value* e =
-      vsl.Add(one, ir_builder.CreateSIToFP(emm0, vsl.vector_type()));
+  emm0 = b.CreateSub(emm0, vector_constant_0x7f);
+  llvm::Value* e = vsl.Add(one, b.CreateSIToFP(emm0, vsl.vector_type()));
 
   // part2:
   //   if( x < SQRTHF ) {
@@ -294,7 +262,7 @@ llvm::Function* EmitVectorF32LogIfNeeded(llvm::Module* module,
   llvm::Value* or_rhs = vsl.FloatAnd(iszero_mask, minus_inf);
   llvm::Value* result = vsl.FloatOr(or_lhs, or_rhs);
 
-  ir_builder.CreateRet(result);
+  b.CreateRet(result);
 
   DCHECK(!llvm::verifyFunction(*vector_log_function));
   return vector_log_function;
diff --git a/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.cc b/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.cc
index 59ae5acd8b..8560e4296a 100644
--- a/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.cc
@@ -25,8 +25,8 @@ namespace cpu {
 ParallelLoopEmitter::ParallelLoopEmitter(
     const llvm_ir::ElementGenerator& target_element_generator,
     const llvm_ir::IrArray& target_array,
-    const DynamicLoopBounds* dynamic_loop_bounds, llvm::IRBuilder<>* ir_builder)
-    : LoopEmitter(target_element_generator, target_array, ir_builder),
+    const DynamicLoopBounds* dynamic_loop_bounds, llvm::IRBuilder<>* b)
+    : LoopEmitter(target_element_generator, target_array, b),
       dynamic_loop_bounds_(dynamic_loop_bounds) {}
 
 std::vector<llvm_ir::IrArray::Index>
@@ -37,7 +37,7 @@ ParallelLoopEmitter::EmitIndexAndSetExitBasicBlock(
   CHECK(!ShapeUtil::IsTuple(shape_));
   CHECK(!ShapeUtil::IsScalar(shape_));
 
-  llvm_ir::ForLoopNest loop_nest(loop_name, ir_builder_);
+  llvm_ir::ForLoopNest loop_nest(loop_name, b_);
   const int64 num_dims = shape_.dimensions_size();
   llvm_ir::IrArray::Index array_index(index_type, num_dims);
 
@@ -65,8 +65,7 @@ ParallelLoopEmitter::EmitIndexAndSetExitBasicBlock(
     }
   }
   // Point IR builder at inner loop BB.
-  llvm_ir::SetToFirstInsertPoint(loop_nest.GetInnerLoopBodyBasicBlock(),
-                                 ir_builder_);
+  llvm_ir::SetToFirstInsertPoint(loop_nest.GetInnerLoopBodyBasicBlock(), b_);
 
   // Set exit_bb_ to the exit block of the loop nest.
   exit_bb_ = loop_nest.GetOuterLoopExitBasicBlock();
diff --git a/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.h b/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.h
index 25e182a26d..076c683ca5 100644
--- a/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.h
+++ b/tensorflow/compiler/xla/service/cpu/parallel_loop_emitter.h
@@ -54,7 +54,7 @@ class ParallelLoopEmitter : public llvm_ir::LoopEmitter {
   ParallelLoopEmitter(const llvm_ir::ElementGenerator& target_element_generator,
                       const llvm_ir::IrArray& target_array,
                       const DynamicLoopBounds* dynamic_loop_bounds,
-                      llvm::IRBuilder<>* ir_builder);
+                      llvm::IRBuilder<>* b);
 
   ParallelLoopEmitter(const ParallelLoopEmitter&) = delete;
   ParallelLoopEmitter& operator=(const ParallelLoopEmitter&) = delete;
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_fork_join.cc b/tensorflow/compiler/xla/service/cpu/runtime_fork_join.cc
index d03da46575..a5f34908d7 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_fork_join.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_fork_join.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/compiler/xla/executable_run_options.h"
 #include "tensorflow/core/lib/core/blocking_counter.h"
+#include "tensorflow/core/platform/dynamic_annotations.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -58,13 +59,14 @@ using ComputeFunctionType = void (*)(void*, const void*, const void**, void**,
 //   [partition1_dim2_start]
 //   [partition1_dim2_limit]
 //
-void __xla_cpu_runtime_ParallelForkJoin(
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void __xla_cpu_runtime_ParallelForkJoin(
     void* result_ptr, const void* run_options_ptr, const void** params,
     void** temps, uint64* prof_counters, int32 num_partitions,
     int64* partitions, int32 num_partitioned_dims, void* function_ptr) {
   VLOG(2) << "ParallelForkJoin ENTRY"
           << " num_partitions: " << num_partitions
           << " num_partitioned_dims: " << num_partitioned_dims;
+  CHECK_EQ(params, nullptr);
   CHECK_GT(num_partitions, 1);
   CHECK_GT(num_partitioned_dims, 0);
   const xla::ExecutableRunOptions* run_options =
@@ -79,9 +81,9 @@ void __xla_cpu_runtime_ParallelForkJoin(
   for (int32 i = 1; i < num_partitions; ++i) {
     const int64 offset = i * stride;
     run_options->intra_op_thread_pool()->enqueueNoNotification(
-        [i, function, result_ptr, run_options_ptr, params, temps, prof_counters,
+        [i, function, result_ptr, run_options_ptr, temps, prof_counters,
          partitions, offset, &bc]() {
-          function(result_ptr, run_options_ptr, params, temps,
+          function(result_ptr, run_options_ptr, nullptr, temps,
                    &partitions[offset], prof_counters);
           bc.DecrementCount();
           VLOG(3) << "ParallelForkJoin partition " << i << " done.";
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc b/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
index 39b13183ff..a71a85913c 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/compiler/xla/executable_run_options.h"
 #include "tensorflow/compiler/xla/service/cpu/runtime_matvec.h"
+#include "tensorflow/core/platform/dynamic_annotations.h"
 #include "tensorflow/core/platform/types.h"
 
 using tensorflow::int32;
@@ -77,27 +78,24 @@ void MatMulImpl(const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m,
 
 }  // namespace
 
-void __xla_cpu_runtime_EigenMatMulF16(const void* run_options_ptr,
-                                      Eigen::half* out, Eigen::half* lhs,
-                                      Eigen::half* rhs, int64 m, int64 n,
-                                      int64 k, int32 transpose_lhs,
-                                      int32 transpose_rhs) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void __xla_cpu_runtime_EigenMatMulF16(
+    const void* run_options_ptr, Eigen::half* out, Eigen::half* lhs,
+    Eigen::half* rhs, int64 m, int64 n, int64 k, int32 transpose_lhs,
+    int32 transpose_rhs) {
   MatMulImpl<Eigen::half>(run_options_ptr, out, lhs, rhs, m, n, k,
                           transpose_lhs, transpose_rhs);
 }
 
-void __xla_cpu_runtime_EigenMatMulF32(const void* run_options_ptr, float* out,
-                                      float* lhs, float* rhs, int64 m, int64 n,
-                                      int64 k, int32 transpose_lhs,
-                                      int32 transpose_rhs) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void __xla_cpu_runtime_EigenMatMulF32(
+    const void* run_options_ptr, float* out, float* lhs, float* rhs, int64 m,
+    int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
   MatMulImpl<float>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
                     transpose_rhs);
 }
 
-void __xla_cpu_runtime_EigenMatMulF64(const void* run_options_ptr, double* out,
-                                      double* lhs, double* rhs, int64 m,
-                                      int64 n, int64 k, int32 transpose_lhs,
-                                      int32 transpose_rhs) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void __xla_cpu_runtime_EigenMatMulF64(
+    const void* run_options_ptr, double* out, double* lhs, double* rhs, int64 m,
+    int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
   MatMulImpl<double>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
                      transpose_rhs);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc b/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc
index f8c8dd5e93..997fdd2ab3 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matmul_mkl.cc
@@ -23,6 +23,7 @@ limitations under the License.
 
 #define EIGEN_USE_THREADS
 #include "third_party/eigen3/unsupported/Eigen/CXX11/ThreadPool"
+#include "tensorflow/core/platform/dynamic_annotations.h"
 
 using tensorflow::int32;
 using tensorflow::int64;
@@ -74,10 +75,9 @@ void MatMulF64(const void* run_options_ptr, double* out, double* lhs,
 
 }  // namespace
 
-void __xla_cpu_runtime_MKLMatMulF32(const void* run_options_ptr, float* out,
-                                    float* lhs, float* rhs, int64 m, int64 n,
-                                    int64 k, int32 transpose_lhs,
-                                    int32 transpose_rhs) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void __xla_cpu_runtime_MKLMatMulF32(
+    const void* run_options_ptr, float* out, float* lhs, float* rhs, int64 m,
+    int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
   const xla::ExecutableRunOptions* run_options =
       static_cast<const xla::ExecutableRunOptions*>(run_options_ptr);
   // BLAS GEMM MatMul uses OpenMP for parallelization, so we pass the thread
@@ -88,11 +88,11 @@ void __xla_cpu_runtime_MKLMatMulF32(const void* run_options_ptr, float* out,
   // Set thread number back to the previous number.
   mkl_set_num_threads_local(prev_num_threads);
 }
+
 // BLAS GEMM API for 64-bit Matrix Multiplication
-void __xla_cpu_runtime_MKLMatMulF64(const void* run_options_ptr, double* out,
-                                    double* lhs, double* rhs, int64 m, int64 n,
-                                    int64 k, int32 transpose_lhs,
-                                    int32 transpose_rhs) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void __xla_cpu_runtime_MKLMatMulF64(
+    const void* run_options_ptr, double* out, double* lhs, double* rhs, int64 m,
+    int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
   const xla::ExecutableRunOptions* run_options =
       static_cast<const xla::ExecutableRunOptions*>(run_options_ptr);
   // BLAS GEMM MatMul uses OpenMP for parallelization, so we pass the thread
@@ -103,22 +103,26 @@ void __xla_cpu_runtime_MKLMatMulF64(const void* run_options_ptr, double* out,
   // Set thread number back to the previous number.
   mkl_set_num_threads_local(prev_num_threads);
 }
-void __xla_cpu_runtime_MKLSingleThreadedMatMulF32(const void* run_options_ptr,
-                                                  float* out, float* lhs,
-                                                  float* rhs, int64 m, int64 n,
-                                                  int64 k, int32 transpose_lhs,
-                                                  int32 transpose_rhs) {
+
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void
+__xla_cpu_runtime_MKLSingleThreadedMatMulF32(const void* run_options_ptr,
+                                             float* out, float* lhs, float* rhs,
+                                             int64 m, int64 n, int64 k,
+                                             int32 transpose_lhs,
+                                             int32 transpose_rhs) {
   // Set the thread number to 1 for single threaded excution.
   int prev_num_threads = mkl_set_num_threads_local(1);
   MatMulF32(nullptr, out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
   // Set thread number back to the previous number.
   mkl_set_num_threads_local(prev_num_threads);
 }
-void __xla_cpu_runtime_MKLSingleThreadedMatMulF64(const void* run_options_ptr,
-                                                  double* out, double* lhs,
-                                                  double* rhs, int64 m, int64 n,
-                                                  int64 k, int32 transpose_lhs,
-                                                  int32 transpose_rhs) {
+
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void
+__xla_cpu_runtime_MKLSingleThreadedMatMulF64(const void* run_options_ptr,
+                                             double* out, double* lhs,
+                                             double* rhs, int64 m, int64 n,
+                                             int64 k, int32 transpose_lhs,
+                                             int32 transpose_rhs) {
   // Set the thread number to 1 for single threaded excution.
   int prev_num_threads = mkl_set_num_threads_local(1);
   MatMulF64(nullptr, out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
index 17303e2f0d..16692e7f2e 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/compiler/xla/service/cpu/runtime_matvec.h"
+#include "tensorflow/core/platform/dynamic_annotations.h"
 #include "tensorflow/core/platform/types.h"
 
 using tensorflow::int32;
@@ -71,7 +72,8 @@ void SingleThreadedMatMul(const void* run_options_ptr, T* out, T* lhs, T* rhs,
 
 }  // namespace
 
-void __xla_cpu_runtime_EigenSingleThreadedMatMulF16(
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void
+__xla_cpu_runtime_EigenSingleThreadedMatMulF16(
     const void* run_options_ptr, Eigen::half* out, Eigen::half* lhs,
     Eigen::half* rhs, int64 m, int64 n, int64 k, int32 transpose_lhs,
     int32 transpose_rhs) {
@@ -79,16 +81,22 @@ void __xla_cpu_runtime_EigenSingleThreadedMatMulF16(
                                     transpose_lhs, transpose_rhs);
 }
 
-void __xla_cpu_runtime_EigenSingleThreadedMatMulF32(
-    const void* run_options_ptr, float* out, float* lhs, float* rhs, int64 m,
-    int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void
+__xla_cpu_runtime_EigenSingleThreadedMatMulF32(const void* run_options_ptr,
+                                               float* out, float* lhs,
+                                               float* rhs, int64 m, int64 n,
+                                               int64 k, int32 transpose_lhs,
+                                               int32 transpose_rhs) {
   SingleThreadedMatMul<float>(run_options_ptr, out, lhs, rhs, m, n, k,
                               transpose_lhs, transpose_rhs);
 }
 
-void __xla_cpu_runtime_EigenSingleThreadedMatMulF64(
-    const void* run_options_ptr, double* out, double* lhs, double* rhs, int64 m,
-    int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
+TF_ATTRIBUTE_NO_SANITIZE_MEMORY void
+__xla_cpu_runtime_EigenSingleThreadedMatMulF64(const void* run_options_ptr,
+                                               double* out, double* lhs,
+                                               double* rhs, int64 m, int64 n,
+                                               int64 k, int32 transpose_lhs,
+                                               int32 transpose_rhs) {
   SingleThreadedMatMul<double>(run_options_ptr, out, lhs, rhs, m, n, k,
                                transpose_lhs, transpose_rhs);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/sample_harness.cc b/tensorflow/compiler/xla/service/cpu/sample_harness.cc
index 7e792a82b8..f227e4ae13 100644
--- a/tensorflow/compiler/xla/service/cpu/sample_harness.cc
+++ b/tensorflow/compiler/xla/service/cpu/sample_harness.cc
@@ -21,9 +21,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -38,12 +38,13 @@ int main(int argc, char** argv) {
 
   // Transfer parameters.
   std::unique_ptr<xla::Literal> param0_literal =
-      xla::Literal::CreateR1<float>({1.1f, 2.2f, 3.3f, 5.5f});
+      xla::LiteralUtil::CreateR1<float>({1.1f, 2.2f, 3.3f, 5.5f});
   std::unique_ptr<xla::GlobalData> param0_data =
       client->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
-  std::unique_ptr<xla::Literal> param1_literal = xla::Literal::CreateR2<float>(
-      {{3.1f, 4.2f, 7.3f, 9.5f}, {1.1f, 2.2f, 3.3f, 4.4f}});
+  std::unique_ptr<xla::Literal> param1_literal =
+      xla::LiteralUtil::CreateR2<float>(
+          {{3.1f, 4.2f, 7.3f, 9.5f}, {1.1f, 2.2f, 3.3f, 4.4f}});
   std::unique_ptr<xla::GlobalData> param1_data =
       client->TransferToServer(*param1_literal).ConsumeValueOrDie();
 
diff --git a/tensorflow/compiler/xla/service/cpu/tests/BUILD b/tensorflow/compiler/xla/service/cpu/tests/BUILD
index 66ae5ef0f6..181cec3cdd 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/BUILD
+++ b/tensorflow/compiler/xla/service/cpu/tests/BUILD
@@ -40,7 +40,7 @@ tf_cc_test(
     name = "cpu_fusion_test",
     srcs = ["cpu_fusion_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -82,7 +82,7 @@ tf_cc_test(
     name = "cpu_noalias_test",
     srcs = ["cpu_noalias_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -128,16 +128,16 @@ tf_cc_test(
     name = "cpu_infeed_test",
     srcs = ["cpu_infeed_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/service:cpu_plugin",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h b/tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h
index 7c8d07a10b..77b3a0301f 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h
@@ -22,7 +22,7 @@ namespace xla {
 namespace cpu {
 
 // Tests that verify IR emitted by the CPU backend is as expected.
-class CpuCodegenTest : public LLVMIRGenTestBase {};
+class CpuCodegenTest : public LlvmIrGenTestBase {};
 
 }  // namespace cpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc b/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc
index 1d4bf483ae..00a7aa2ad2 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_external_constants_test.cc
@@ -40,7 +40,7 @@ class CpuExternalConstantsTest : public CpuCodegenTest {
 
     HloInstruction* constant =
         builder.AddInstruction(HloInstruction::CreateConstant(
-            Literal::CreateR2FromArray2D(backing_array)));
+            LiteralUtil::CreateR2FromArray2D(backing_array)));
     HloInstruction* param =
         builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "x"));
     builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_fusion_test.cc b/tensorflow/compiler/xla/service/cpu/tests/cpu_fusion_test.cc
index 783b2820e9..d98856fdbf 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_fusion_test.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
@@ -43,8 +43,8 @@ class CpuFusionTest : public HloTestBase {
 
 TEST_F(CpuFusionTest, FuseTwoElementwiseOps) {
   auto builder = HloComputation::Builder(TestName());
-  auto input_literal1 = Literal::CreateR1<float>({1.0, 2.0, 3.0});
-  auto input_literal2 = Literal::CreateR1<float>({-2.0, -42.0, 2.0});
+  auto input_literal1 = LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0});
+  auto input_literal2 = LiteralUtil::CreateR1<float>({-2.0, -42.0, 2.0});
   Shape vshape = input_literal1->shape();
 
   auto input1 = builder.AddInstruction(
@@ -83,7 +83,7 @@ TEST_F(CpuFusionTest, FuseTwoElementwiseOps) {
 
 TEST_F(CpuFusionTest, FuseElementwiseOpChain) {
   auto builder = HloComputation::Builder(TestName());
-  auto input_literal = Literal::CreateR1<float>({-1.5, -2.5, -3.0});
+  auto input_literal = LiteralUtil::CreateR1<float>({-1.5, -2.5, -3.0});
   Shape vshape = input_literal->shape();
 
   auto input = builder.AddInstruction(
@@ -99,7 +99,7 @@ TEST_F(CpuFusionTest, FuseElementwiseOpChain) {
   auto two = builder.AddInstruction(HloInstruction::CreateBroadcast(
       vshape,
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0))),
       {}));
   builder.AddInstruction(
       HloInstruction::CreateBinary(vshape, HloOpcode::kMultiply, two, floor));
@@ -134,7 +134,7 @@ TEST_F(CpuFusionTest, ElementwiseOpChainWithNonfusableInstruction) {
   // middle.
   auto module = CreateNewModule();
   auto builder = HloComputation::Builder(TestName());
-  auto input_literal = Literal::CreateR1<float>({-1.5, -2.5, -3.0});
+  auto input_literal = LiteralUtil::CreateR1<float>({-1.5, -2.5, -3.0});
   Shape vshape = input_literal->shape();
 
   auto input = builder.AddInstruction(
@@ -166,7 +166,7 @@ TEST_F(CpuFusionTest, ElementwiseOpChainWithNonfusableInstruction) {
           ShapeUtil::MakeShape(F32, {6, 1}), concatenate)),
       /*init_value=*/
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(0))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0))),
       /*dimensions_to_reduce=*/{1}, add_f32));
 
   auto exp = builder.AddInstruction(
@@ -176,7 +176,7 @@ TEST_F(CpuFusionTest, ElementwiseOpChainWithNonfusableInstruction) {
   auto two = builder.AddInstruction(HloInstruction::CreateBroadcast(
       cshape,
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0))),
       {}));
   builder.AddInstruction(
       HloInstruction::CreateBinary(cshape, HloOpcode::kMultiply, two, floor));
@@ -231,7 +231,7 @@ TEST_F(CpuFusionTest, TestOperandOrderToAvoidDuplication) {
   // operand vectors. Test for this problem by counting the number of nodes in
   // each fusion instruction to ensure that negate is not duplicated.
   auto builder = HloComputation::Builder(TestName());
-  auto input_literal = Literal::CreateR1<float>({1.0, 2.0, 3.0});
+  auto input_literal = LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0});
   Shape vshape = input_literal->shape();
 
   auto constant = builder.AddInstruction(
@@ -292,10 +292,10 @@ TEST_F(CpuFusionTest, DoNotDuplicateExpensiveOps) {
   // computation. The duplication is caused by the other use of exp2 in the
   // tuple.
   auto builder = HloComputation::Builder(TestName());
-  auto input_literal1 = Literal::CreateR1<float>({1.0, 2.0, 3.0});
-  auto input_literal2 = Literal::CreateR1<float>({-2.0, -42.0, 2.0});
+  auto input_literal1 = LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0});
+  auto input_literal2 = LiteralUtil::CreateR1<float>({-2.0, -42.0, 2.0});
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
   Shape shape = constant->shape();
 
   auto exp1 = builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_infeed_test.cc b/tensorflow/compiler/xla/service/cpu/tests/cpu_infeed_test.cc
index ea7e479d66..c433bddc84 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_infeed_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_infeed_test.cc
@@ -19,9 +19,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -58,52 +58,52 @@ class InfeedTest : public ClientLibraryTestBase {
 };
 
 TEST_F(InfeedTest, SingleInfeedR0Bool) {
-  TestInfeedRoundTrip(*Literal::CreateR0<bool>(true));
+  TestInfeedRoundTrip(*LiteralUtil::CreateR0<bool>(true));
 }
 
 TEST_F(InfeedTest, SingleInfeedR1U32) {
-  TestInfeedRoundTrip(*Literal::CreateR1<uint32>({1, 2, 3}));
+  TestInfeedRoundTrip(*LiteralUtil::CreateR1<uint32>({1, 2, 3}));
 }
 
 TEST_F(InfeedTest, SingleInfeedR2F32) {
-  TestInfeedRoundTrip(*Literal::CreateR2F32Linspace(0.0, 1.0, 128, 64));
+  TestInfeedRoundTrip(*LiteralUtil::CreateR2F32Linspace(0.0, 1.0, 128, 64));
 }
 
 TEST_F(InfeedTest, SingleInfeedR3F32) {
   TestInfeedRoundTrip(
-      *Literal::CreateR3({{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
-                          {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}}));
+      *LiteralUtil::CreateR3({{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
+                              {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}}));
 }
 
 TEST_F(InfeedTest, SingleInfeedR3F32DifferentLayout) {
   const Layout r3_dim0minor = LayoutUtil::MakeLayout({0, 1, 2});
   const Layout r3_dim0major = LayoutUtil::MakeLayout({2, 1, 0});
 
-  TestInfeedRoundTrip(
-      *Literal::CreateR3WithLayout({{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
-                                    {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}},
-                                   r3_dim0minor));
+  TestInfeedRoundTrip(*LiteralUtil::CreateR3WithLayout(
+      {{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
+       {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}},
+      r3_dim0minor));
 
-  TestInfeedRoundTrip(
-      *Literal::CreateR3WithLayout({{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
-                                    {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}},
-                                   r3_dim0major));
+  TestInfeedRoundTrip(*LiteralUtil::CreateR3WithLayout(
+      {{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
+       {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}},
+      r3_dim0major));
 }
 
 TEST_F(InfeedTest, SingleInfeedR4S32) {
-  TestInfeedRoundTrip(*Literal::CreateR4(
+  TestInfeedRoundTrip(*LiteralUtil::CreateR4(
       {{{{1, -2}, {-4, 5}, {6, 7}}, {{8, 9}, {10, 11}, {12, 13}}},
        {{{10, 3}, {7, -2}, {3, 6}}, {{2, 5}, {-11, 5}, {-2, -5}}}}));
 }
 
 TEST_F(InfeedTest, SingleInfeedTuple) {
   TestInfeedRoundTrip(
-      *Literal::MakeTuple({Literal::CreateR1<uint32>({1, 2, 3}).get(),
-                           Literal::CreateR0<bool>(false).get()}));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<uint32>({1, 2, 3}).get(),
+                               LiteralUtil::CreateR0<bool>(false).get()}));
 }
 
 TEST_F(InfeedTest, SingleInfeedEmptyTuple) {
-  TestInfeedRoundTrip(*Literal::MakeTuple({}));
+  TestInfeedRoundTrip(*LiteralUtil::MakeTuple({}));
 }
 
 // Tests Infeed operation used in a while loop, as in the code below. The
@@ -156,13 +156,16 @@ TEST_F(InfeedTest, DISABLED_SingleInfeedInWhile) {
           });
 
   // Send 5 Infeed data of shape F32[3].
-  ASSERT_IS_OK(client_->TransferToInfeed(*Literal::CreateR1<float>({1, 2, 3})));
-  ASSERT_IS_OK(client_->TransferToInfeed(*Literal::CreateR1<float>({4, 5, 6})));
-  ASSERT_IS_OK(client_->TransferToInfeed(*Literal::CreateR1<float>({7, 8, 9})));
   ASSERT_IS_OK(
-      client_->TransferToInfeed(*Literal::CreateR1<float>({10, 11, 12})));
+      client_->TransferToInfeed(*LiteralUtil::CreateR1<float>({1, 2, 3})));
+  ASSERT_IS_OK(
+      client_->TransferToInfeed(*LiteralUtil::CreateR1<float>({4, 5, 6})));
+  ASSERT_IS_OK(
+      client_->TransferToInfeed(*LiteralUtil::CreateR1<float>({7, 8, 9})));
+  ASSERT_IS_OK(
+      client_->TransferToInfeed(*LiteralUtil::CreateR1<float>({10, 11, 12})));
   ASSERT_IS_OK(
-      client_->TransferToInfeed(*Literal::CreateR1<float>({13, 14, 15})));
+      client_->TransferToInfeed(*LiteralUtil::CreateR1<float>({13, 14, 15})));
 
   delete computation_thread;  // Joins the thread.
   auto result_literal = client_->Transfer(*result).ConsumeValueOrDie();
@@ -247,17 +250,17 @@ TEST_F(InfeedTest, DISABLED_TwoInfeedsInTotalOrder) {
 
   // Send the first 4 Infeed data of shape Tuple(F32[2], PRED).
   ASSERT_IS_OK(client_->TransferToInfeed(
-      *Literal::MakeTuple({Literal::CreateR1<float>({1, 2}).get(),
-                           Literal::CreateR0<bool>(true).get()})));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({1, 2}).get(),
+                               LiteralUtil::CreateR0<bool>(true).get()})));
   ASSERT_IS_OK(client_->TransferToInfeed(
-      *Literal::MakeTuple({Literal::CreateR1<float>({3, 4}).get(),
-                           Literal::CreateR0<bool>(true).get()})));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({3, 4}).get(),
+                               LiteralUtil::CreateR0<bool>(true).get()})));
   ASSERT_IS_OK(client_->TransferToInfeed(
-      *Literal::MakeTuple({Literal::CreateR1<float>({5, 6}).get(),
-                           Literal::CreateR0<bool>(true).get()})));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({5, 6}).get(),
+                               LiteralUtil::CreateR0<bool>(true).get()})));
   ASSERT_IS_OK(client_->TransferToInfeed(
-      *Literal::MakeTuple({Literal::CreateR1<float>({7, 8}).get(),
-                           Literal::CreateR0<bool>(false).get()})));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({7, 8}).get(),
+                               LiteralUtil::CreateR0<bool>(false).get()})));
 
   // Asynchronously launch the execution on the device.
   std::unique_ptr<GlobalData> result;
@@ -272,14 +275,14 @@ TEST_F(InfeedTest, DISABLED_TwoInfeedsInTotalOrder) {
   // Infeed data, and send the rest Infeed data of shape Tuple(F32[3], PRED).
   sleep(1);
   ASSERT_IS_OK(client_->TransferToInfeed(
-      *Literal::MakeTuple({Literal::CreateR1<float>({1, 2, 3}).get(),
-                           Literal::CreateR0<bool>(true).get()})));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({1, 2, 3}).get(),
+                               LiteralUtil::CreateR0<bool>(true).get()})));
   ASSERT_IS_OK(client_->TransferToInfeed(
-      *Literal::MakeTuple({Literal::CreateR1<float>({7, 8, 9}).get(),
-                           Literal::CreateR0<bool>(false).get()})));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({7, 8, 9}).get(),
+                               LiteralUtil::CreateR0<bool>(false).get()})));
   ASSERT_IS_OK(client_->TransferToInfeed(
-      *Literal::MakeTuple({Literal::CreateR1<float>({4, 5, 6}).get(),
-                           Literal::CreateR0<bool>(true).get()})));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({4, 5, 6}).get(),
+                               LiteralUtil::CreateR0<bool>(true).get()})));
 
   // Wait for the execution to be done, and transfer the result.
   delete computation_thread;  // Joins the thread.
diff --git a/tensorflow/compiler/xla/service/cpu/tests/cpu_noalias_test.cc b/tensorflow/compiler/xla/service/cpu/tests/cpu_noalias_test.cc
index 3b6b0ed740..01daed4bcd 100644
--- a/tensorflow/compiler/xla/service/cpu/tests/cpu_noalias_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/tests/cpu_noalias_test.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include <utility>
 
 #include "llvm/IR/Module.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h"
@@ -42,7 +42,7 @@ TEST_F(CpuNoAliasTest, Concat) {
   HloComputation::Builder builder(TestName());
 
   std::unique_ptr<Literal> literal =
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
   auto param_shape = ShapeUtil::MakeShape(F32, {2, 2});
   HloInstruction* param_x = builder.AddInstruction(
       HloInstruction::CreateParameter(0, param_shape, "x"));
@@ -78,7 +78,7 @@ TEST_F(CpuNoAliasTest, Concat) {
   llvm::Function* func = llvm::cast<llvm::Function>(
       ir_module.getOrInsertFunction("test_fn", llvm::Type::getVoidTy(context)));
   llvm::BasicBlock* bb = llvm::BasicBlock::Create(context, "body", func);
-  llvm::IRBuilder<> ir_builder(bb);
+  llvm::IRBuilder<> b(bb);
   auto* zero = llvm::ConstantInt::get(llvm::Type::getInt32Ty(context), 0);
   llvm_ir::IrArray::Index zero2D({zero, zero});
 
@@ -90,7 +90,7 @@ TEST_F(CpuNoAliasTest, Concat) {
         ir_module.getOrInsertGlobal("param_x", array2d_type);
     llvm_ir::IrArray param_x_array(param_x_val, param_shape);
     aa.AddAliasingInformationToIrArray(*param_x, &param_x_array);
-    param_x_array.EmitReadArrayElement(zero2D, &ir_builder)
+    param_x_array.EmitReadArrayElement(zero2D, &b)
         ->setName("read_param_x_array");
   }
 
@@ -100,7 +100,7 @@ TEST_F(CpuNoAliasTest, Concat) {
     auto shape = ShapeUtil::MakeShape(F32, {2, 4});
     llvm_ir::IrArray concat1_array(concat1_val, shape);
     aa.AddAliasingInformationToIrArray(*concat1, &concat1_array);
-    concat1_array.EmitReadArrayElement(zero2D, &ir_builder)
+    concat1_array.EmitReadArrayElement(zero2D, &b)
         ->setName("read_concat1_array");
   }
 
@@ -110,7 +110,7 @@ TEST_F(CpuNoAliasTest, Concat) {
     auto shape = ShapeUtil::MakeShape(F32, {2, 6});
     llvm_ir::IrArray concat2_array(concat2_val, shape);
     aa.AddAliasingInformationToIrArray(*concat2, &concat2_array);
-    concat2_array.EmitReadArrayElement(zero2D, &ir_builder)
+    concat2_array.EmitReadArrayElement(zero2D, &b)
         ->setName("read_concat2_array");
   }
 
diff --git a/tensorflow/compiler/xla/service/cpu/vector_support_library.cc b/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
index c444d15185..3274be8d9d 100644
--- a/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
+++ b/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
@@ -23,14 +23,14 @@ namespace xla {
 namespace cpu {
 VectorSupportLibrary::VectorSupportLibrary(PrimitiveType primitive_type,
                                            int64 vector_size,
-                                           llvm::IRBuilder<>* ir_builder,
+                                           llvm::IRBuilder<>* b,
                                            std::string name)
     : vector_size_(vector_size),
       primitive_type_(primitive_type),
-      ir_builder_(ir_builder),
+      b_(b),
       name_(std::move(name)) {
   scalar_type_ = llvm_ir::PrimitiveTypeToIrType(
-      primitive_type, ir_builder_->GetInsertBlock()->getModule());
+      primitive_type, b_->GetInsertBlock()->getModule());
   scalar_pointer_type_ = llvm::PointerType::getUnqual(scalar_type_);
   vector_type_ = llvm::VectorType::get(scalar_type_, vector_size);
   vector_pointer_type_ = llvm::PointerType::getUnqual(vector_type_);
@@ -63,9 +63,9 @@ llvm::Value* VectorSupportLibrary::Mul(llvm::Value* lhs, llvm::Value* rhs) {
 llvm::Value* VectorSupportLibrary::MulInternal(llvm::Value* lhs,
                                                llvm::Value* rhs) {
   if (scalar_type_->isFloatingPointTy()) {
-    return ir_builder()->CreateFMul(lhs, rhs, name());
+    return b()->CreateFMul(lhs, rhs, name());
   } else {
-    return ir_builder()->CreateMul(lhs, rhs, name());
+    return b()->CreateMul(lhs, rhs, name());
   }
 }
 
@@ -76,13 +76,13 @@ llvm::Value* VectorSupportLibrary::Add(llvm::Value* lhs, llvm::Value* rhs) {
 
 llvm::Value* VectorSupportLibrary::Sub(llvm::Value* lhs, llvm::Value* rhs) {
   AssertCorrectTypes({lhs, rhs});
-  return ir_builder()->CreateFSub(lhs, rhs);
+  return b()->CreateFSub(lhs, rhs);
 }
 
 llvm::Value* VectorSupportLibrary::Max(llvm::Value* lhs, llvm::Value* rhs) {
   AssertCorrectTypes({lhs, rhs});
   if (scalar_type_->isFloatingPointTy()) {
-    return llvm_ir::EmitFloatMax(lhs, rhs, ir_builder_);
+    return llvm_ir::EmitFloatMax(lhs, rhs, b_);
   } else {
     LOG(FATAL) << "Max for integers is unimplemented";
   }
@@ -91,13 +91,13 @@ llvm::Value* VectorSupportLibrary::Max(llvm::Value* lhs, llvm::Value* rhs) {
 llvm::Value* VectorSupportLibrary::Floor(llvm::Value* a) {
   AssertCorrectTypes({a});
   return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::floor, {a},
-                                      {a->getType()}, ir_builder());
+                                      {a->getType()}, b());
 }
 
 llvm::Value* VectorSupportLibrary::Div(llvm::Value* lhs, llvm::Value* rhs) {
   AssertCorrectTypes({lhs, rhs});
   if (scalar_type_->isFloatingPointTy()) {
-    return ir_builder()->CreateFDiv(lhs, rhs, name());
+    return b()->CreateFDiv(lhs, rhs, name());
   } else {
     LOG(FATAL) << "Division for integers is unimplemented";
   }
@@ -111,42 +111,41 @@ llvm::Value* VectorSupportLibrary::Clamp(llvm::Value* a,
   CHECK(low.compare(high) == llvm::APFloat::cmpLessThan);
   CHECK(scalar_type_->isFloatingPointTy());
   return llvm_ir::EmitFloatMin(
-      llvm_ir::EmitFloatMax(a, GetConstantFloat(type, low), ir_builder_),
-      GetConstantFloat(type, high), ir_builder_);
+      llvm_ir::EmitFloatMax(a, GetConstantFloat(type, low), b_),
+      GetConstantFloat(type, high), b_);
 }
 
 llvm::Value* VectorSupportLibrary::FCmpEQMask(llvm::Value* lhs,
                                               llvm::Value* rhs) {
   AssertCorrectTypes({lhs, rhs});
-  return I1ToFloat(ir_builder()->CreateFCmpOEQ(lhs, rhs, name()));
+  return I1ToFloat(b()->CreateFCmpOEQ(lhs, rhs, name()));
 }
 
 llvm::Value* VectorSupportLibrary::FCmpOLTMask(llvm::Value* lhs,
                                                llvm::Value* rhs) {
   AssertCorrectTypes({lhs, rhs});
-  return I1ToFloat(ir_builder()->CreateFCmpOLT(lhs, rhs, name()));
+  return I1ToFloat(b()->CreateFCmpOLT(lhs, rhs, name()));
 }
 
 llvm::Value* VectorSupportLibrary::FCmpULEMask(llvm::Value* lhs,
                                                llvm::Value* rhs) {
   AssertCorrectTypes({lhs, rhs});
-  return I1ToFloat(ir_builder()->CreateFCmpULE(lhs, rhs, name()));
+  return I1ToFloat(b()->CreateFCmpULE(lhs, rhs, name()));
 }
 
 llvm::Value* VectorSupportLibrary::I1ToFloat(llvm::Value* i1) {
   bool is_vector = llvm::isa<llvm::VectorType>(i1->getType());
   llvm::Type* integer_type = IntegerTypeForFloatSize(is_vector);
-  return ir_builder()->CreateBitCast(
-      ir_builder()->CreateSExt(i1, integer_type, name()),
-      is_vector ? vector_type() : scalar_type(), name());
+  return b()->CreateBitCast(b()->CreateSExt(i1, integer_type, name()),
+                            is_vector ? vector_type() : scalar_type(), name());
 }
 
 llvm::Type* VectorSupportLibrary::IntegerTypeForFloatSize(bool vector) {
   CHECK(scalar_type()->isFloatingPointTy());
   const llvm::DataLayout& data_layout =
-      ir_builder()->GetInsertBlock()->getModule()->getDataLayout();
+      b()->GetInsertBlock()->getModule()->getDataLayout();
   int64 float_size_bits = data_layout.getTypeSizeInBits(scalar_type());
-  llvm::Type* scalar_int_type = ir_builder()->getIntNTy(float_size_bits);
+  llvm::Type* scalar_int_type = b()->getIntNTy(float_size_bits);
   if (vector) {
     return llvm::VectorType::get(scalar_int_type, vector_size());
   } else {
@@ -156,7 +155,7 @@ llvm::Type* VectorSupportLibrary::IntegerTypeForFloatSize(bool vector) {
 
 llvm::Value* VectorSupportLibrary::BroadcastScalar(llvm::Value* x) {
   CHECK_EQ(x->getType(), scalar_type());
-  return ir_builder()->CreateVectorSplat(vector_size(), x, name());
+  return b()->CreateVectorSplat(vector_size(), x, name());
 }
 
 llvm::Value* VectorSupportLibrary::FloatAnd(llvm::Value* lhs,
@@ -164,10 +163,9 @@ llvm::Value* VectorSupportLibrary::FloatAnd(llvm::Value* lhs,
   AssertCorrectTypes({lhs, rhs});
   llvm::Type* int_type =
       IntegerTypeForFloatSize(lhs->getType() == vector_type());
-  return ir_builder()->CreateBitCast(
-      ir_builder()->CreateAnd(
-          ir_builder()->CreateBitCast(lhs, int_type, name()),
-          ir_builder()->CreateBitCast(rhs, int_type, name()), name()),
+  return b()->CreateBitCast(
+      b()->CreateAnd(b()->CreateBitCast(lhs, int_type, name()),
+                     b()->CreateBitCast(rhs, int_type, name()), name()),
       vector_type());
 }
 
@@ -175,9 +173,8 @@ llvm::Value* VectorSupportLibrary::FloatNot(llvm::Value* lhs) {
   AssertCorrectTypes({lhs});
   llvm::Type* int_type =
       IntegerTypeForFloatSize(lhs->getType() == vector_type());
-  return ir_builder()->CreateBitCast(
-      ir_builder()->CreateNot(
-          ir_builder()->CreateBitCast(lhs, int_type, name()), name()),
+  return b()->CreateBitCast(
+      b()->CreateNot(b()->CreateBitCast(lhs, int_type, name()), name()),
       vector_type());
 }
 
@@ -185,47 +182,43 @@ llvm::Value* VectorSupportLibrary::FloatOr(llvm::Value* lhs, llvm::Value* rhs) {
   AssertCorrectTypes({lhs, rhs});
   llvm::Type* int_type =
       IntegerTypeForFloatSize(lhs->getType() == vector_type());
-  return ir_builder()->CreateBitCast(
-      ir_builder()->CreateOr(ir_builder()->CreateBitCast(lhs, int_type, name()),
-                             ir_builder()->CreateBitCast(rhs, int_type, name()),
-                             name()),
+  return b()->CreateBitCast(
+      b()->CreateOr(b()->CreateBitCast(lhs, int_type, name()),
+                    b()->CreateBitCast(rhs, int_type, name()), name()),
       vector_type(), name());
 }
 
 llvm::Value* VectorSupportLibrary::AddInternal(llvm::Value* lhs,
                                                llvm::Value* rhs) {
   if (scalar_type_->isFloatingPointTy()) {
-    return ir_builder()->CreateFAdd(lhs, rhs, name());
+    return b()->CreateFAdd(lhs, rhs, name());
   } else {
-    return ir_builder()->CreateAdd(lhs, rhs, name());
+    return b()->CreateAdd(lhs, rhs, name());
   }
 }
 
 llvm::Value* VectorSupportLibrary::ComputeOffsetPointer(
     llvm::Value* base_pointer, llvm::Value* offset_elements) {
   if (base_pointer->getType() != scalar_pointer_type()) {
-    base_pointer = ir_builder()->CreateBitCast(base_pointer,
-                                               scalar_pointer_type(), name());
+    base_pointer =
+        b()->CreateBitCast(base_pointer, scalar_pointer_type(), name());
   }
-  return ir_builder()->CreateInBoundsGEP(base_pointer, {offset_elements},
-                                         name());
+  return b()->CreateInBoundsGEP(base_pointer, {offset_elements}, name());
 }
 
 llvm::Value* VectorSupportLibrary::LoadVector(llvm::Value* pointer) {
   if (pointer->getType() != vector_pointer_type()) {
-    pointer =
-        ir_builder()->CreateBitCast(pointer, vector_pointer_type(), name());
+    pointer = b()->CreateBitCast(pointer, vector_pointer_type(), name());
   }
-  return ir_builder()->CreateAlignedLoad(
+  return b()->CreateAlignedLoad(
       pointer, ShapeUtil::ByteSizeOfPrimitiveType(primitive_type_), name());
 }
 
 llvm::Value* VectorSupportLibrary::LoadScalar(llvm::Value* pointer) {
   if (pointer->getType() != scalar_pointer_type()) {
-    pointer =
-        ir_builder()->CreateBitCast(pointer, scalar_pointer_type(), name());
+    pointer = b()->CreateBitCast(pointer, scalar_pointer_type(), name());
   }
-  return ir_builder()->CreateAlignedLoad(
+  return b()->CreateAlignedLoad(
       pointer, ShapeUtil::ByteSizeOfPrimitiveType(primitive_type_), name());
 }
 
@@ -233,30 +226,28 @@ void VectorSupportLibrary::StoreVector(llvm::Value* value,
                                        llvm::Value* pointer) {
   AssertCorrectTypes({value});
   if (pointer->getType() != vector_pointer_type()) {
-    pointer = ir_builder()->CreateBitCast(pointer, vector_pointer_type());
+    pointer = b()->CreateBitCast(pointer, vector_pointer_type());
   }
-  ir_builder()->CreateAlignedStore(
-      value, pointer, ShapeUtil::ByteSizeOfPrimitiveType(primitive_type_));
+  b()->CreateAlignedStore(value, pointer,
+                          ShapeUtil::ByteSizeOfPrimitiveType(primitive_type_));
 }
 
 void VectorSupportLibrary::StoreScalar(llvm::Value* value,
                                        llvm::Value* pointer) {
   AssertCorrectTypes({value});
   if (pointer->getType() != scalar_pointer_type()) {
-    pointer =
-        ir_builder()->CreateBitCast(pointer, scalar_pointer_type(), name());
+    pointer = b()->CreateBitCast(pointer, scalar_pointer_type(), name());
   }
-  ir_builder()->CreateAlignedStore(
-      value, pointer, ShapeUtil::ByteSizeOfPrimitiveType(primitive_type_));
+  b()->CreateAlignedStore(value, pointer,
+                          ShapeUtil::ByteSizeOfPrimitiveType(primitive_type_));
 }
 
 llvm::Value* VectorSupportLibrary::LoadBroadcast(llvm::Value* pointer) {
   if (pointer->getType() != scalar_pointer_type()) {
-    pointer =
-        ir_builder()->CreateBitCast(pointer, scalar_pointer_type(), name());
+    pointer = b()->CreateBitCast(pointer, scalar_pointer_type(), name());
   }
-  return ir_builder()->CreateVectorSplat(
-      vector_size(), ir_builder()->CreateLoad(pointer), name());
+  return b()->CreateVectorSplat(vector_size(), b()->CreateLoad(pointer),
+                                name());
 }
 
 llvm::Value* VectorSupportLibrary::AddReduce(llvm::Value* vector) {
@@ -267,20 +258,19 @@ llvm::Value* VectorSupportLibrary::AddReduce(llvm::Value* vector) {
 
     for (unsigned j = 0; j < vector_size(); ++j) {
       if (j < (i / 2)) {
-        mask[j] = ir_builder()->getInt32(i / 2 + j);
+        mask[j] = b()->getInt32(i / 2 + j);
       } else {
-        mask[j] = llvm::UndefValue::get(ir_builder()->getInt32Ty());
+        mask[j] = llvm::UndefValue::get(b()->getInt32Ty());
       }
     }
 
-    llvm::Value* half_remaining_lanes = ir_builder()->CreateShuffleVector(
-        vector, llvm::UndefValue::get(vector_type()),
-        llvm::ConstantVector::get(mask), "");
+    llvm::Value* half_remaining_lanes =
+        b()->CreateShuffleVector(vector, llvm::UndefValue::get(vector_type()),
+                                 llvm::ConstantVector::get(mask), "");
     vector = Add(vector, half_remaining_lanes);
   }
 
-  return ir_builder()->CreateExtractElement(vector, ir_builder()->getInt32(0),
-                                            name());
+  return b()->CreateExtractElement(vector, b()->getInt32(0), name());
 }
 
 llvm::Value* VectorSupportLibrary::AvxStyleHorizontalAdd(llvm::Value* lhs,
@@ -307,19 +297,19 @@ llvm::Value* VectorSupportLibrary::AvxStyleHorizontalAdd(llvm::Value* lhs,
   // vector, which are the lanes 2 and 3 in the rhs vector.
   for (int i = 0; i < vector_size(); i += 2) {
     int increment = i < vector_size() / 2 ? 0 : (vector_size() / 2);
-    mask_a.push_back(ir_builder()->getInt32(increment + i));
-    mask_b.push_back(ir_builder()->getInt32(increment + i + 1));
+    mask_a.push_back(b()->getInt32(increment + i));
+    mask_b.push_back(b()->getInt32(increment + i + 1));
   }
   for (int i = 0; i < vector_size(); i += 2) {
     int increment = i < vector_size() / 2 ? (vector_size() / 2) : vector_size();
-    mask_a.push_back(ir_builder()->getInt32(increment + i));
-    mask_b.push_back(ir_builder()->getInt32(increment + i + 1));
+    mask_a.push_back(b()->getInt32(increment + i));
+    mask_b.push_back(b()->getInt32(increment + i + 1));
   }
 
-  llvm::Value* shuffle_0 = ir_builder()->CreateShuffleVector(
-      lhs, rhs, llvm::ConstantVector::get(mask_a));
-  llvm::Value* shuffle_1 = ir_builder()->CreateShuffleVector(
-      lhs, rhs, llvm::ConstantVector::get(mask_b));
+  llvm::Value* shuffle_0 =
+      b()->CreateShuffleVector(lhs, rhs, llvm::ConstantVector::get(mask_a));
+  llvm::Value* shuffle_1 =
+      b()->CreateShuffleVector(lhs, rhs, llvm::ConstantVector::get(mask_b));
 
   return Add(shuffle_0, shuffle_1);
 }
@@ -327,23 +317,21 @@ llvm::Value* VectorSupportLibrary::AvxStyleHorizontalAdd(llvm::Value* lhs,
 llvm::Value* VectorSupportLibrary::ExtractLowHalf(llvm::Value* vector) {
   llvm::SmallVector<llvm::Constant*, 32> mask;
   for (int i = 0; i < vector_size() / 2; i++) {
-    mask.push_back(ir_builder()->getInt32(i));
+    mask.push_back(b()->getInt32(i));
   }
 
-  return ir_builder()->CreateShuffleVector(vector,
-                                           llvm::UndefValue::get(vector_type()),
-                                           llvm::ConstantVector::get(mask));
+  return b()->CreateShuffleVector(vector, llvm::UndefValue::get(vector_type()),
+                                  llvm::ConstantVector::get(mask));
 }
 
 llvm::Value* VectorSupportLibrary::ExtractHighHalf(llvm::Value* vector) {
   llvm::SmallVector<llvm::Constant*, 32> mask;
   for (int i = 0; i < vector_size() / 2; i++) {
-    mask.push_back(ir_builder()->getInt32(i + vector_size() / 2));
+    mask.push_back(b()->getInt32(i + vector_size() / 2));
   }
 
-  return ir_builder()->CreateShuffleVector(vector,
-                                           llvm::UndefValue::get(vector_type()),
-                                           llvm::ConstantVector::get(mask));
+  return b()->CreateShuffleVector(vector, llvm::UndefValue::get(vector_type()),
+                                  llvm::ConstantVector::get(mask));
 }
 
 std::vector<llvm::Value*> VectorSupportLibrary::ComputeHorizontalSums(
@@ -360,8 +348,8 @@ std::vector<llvm::Value*> VectorSupportLibrary::ComputeHorizontalSums(
                  [this](llvm::Value* vector) { return AddReduce(vector); });
   if (init_values) {
     for (int64 i = 0, e = result.size(); i < e; i++) {
-      result[i] = Add(result[i], ir_builder()->CreateExtractElement(
-                                     init_values, ir_builder()->getInt32(i)));
+      result[i] = Add(result[i],
+                      b()->CreateExtractElement(init_values, b()->getInt32(i)));
     }
   }
   return result;
@@ -398,9 +386,9 @@ VectorSupportLibrary::ComputeAvxOptimizedHorizontalSums(
 
   std::vector<llvm::Value*> results;
   for (int i = 0; i < lane_width; i++) {
-    llvm::Value* scalar_result = ir_builder()->CreateExtractElement(
-        i < (lane_width / 2) ? low : high,
-        ir_builder()->getInt32(i % (lane_width / 2)), name());
+    llvm::Value* scalar_result =
+        b()->CreateExtractElement(i < (lane_width / 2) ? low : high,
+                                  b()->getInt32(i % (lane_width / 2)), name());
     results.push_back(scalar_result);
   }
 
@@ -415,17 +403,14 @@ llvm::Value* VectorSupportLibrary::GetZeroScalar() {
   return llvm::Constant::getNullValue(scalar_type());
 }
 
-LlvmVariable::LlvmVariable(llvm::Type* type, llvm::IRBuilder<>* ir_builder)
-    : ir_builder_(ir_builder) {
-  alloca_ = llvm_ir::EmitAllocaAtFunctionEntry(type, "", ir_builder_);
+LlvmVariable::LlvmVariable(llvm::Type* type, llvm::IRBuilder<>* b) : b_(b) {
+  alloca_ = llvm_ir::EmitAllocaAtFunctionEntry(type, "", b_);
 }
 
-llvm::Value* LlvmVariable::Get() const {
-  return ir_builder_->CreateLoad(alloca_);
-}
+llvm::Value* LlvmVariable::Get() const { return b_->CreateLoad(alloca_); }
 
 void LlvmVariable::Set(llvm::Value* new_value) {
-  ir_builder_->CreateStore(new_value, alloca_);
+  b_->CreateStore(new_value, alloca_);
 }
 
 TileVariable::TileVariable(VectorSupportLibrary* vector_support,
diff --git a/tensorflow/compiler/xla/service/cpu/vector_support_library.h b/tensorflow/compiler/xla/service/cpu/vector_support_library.h
index 49c2a4e2f4..c728f6df0a 100644
--- a/tensorflow/compiler/xla/service/cpu/vector_support_library.h
+++ b/tensorflow/compiler/xla/service/cpu/vector_support_library.h
@@ -46,11 +46,11 @@ class VectorSupportLibrary {
   // instance (i.e. LoadVector will load a vector of type <`vector_size` x
   // `primitive_type`>).
   VectorSupportLibrary(PrimitiveType primitive_type, int64 vector_size,
-                       llvm::IRBuilder<>* ir_builder, std::string name);
+                       llvm::IRBuilder<>* b, std::string name);
 
   llvm::Value* Mul(llvm::Value* lhs, llvm::Value* rhs);
   llvm::Value* Mul(int64 lhs, llvm::Value* rhs) {
-    return Mul(ir_builder()->getInt64(lhs), rhs);
+    return Mul(b()->getInt64(lhs), rhs);
   }
   llvm::Value* Mul(const llvm::APFloat& lhs, llvm::Value* rhs) {
     return Mul(GetConstantFloat(rhs->getType(), lhs), rhs);
@@ -63,7 +63,7 @@ class VectorSupportLibrary {
 
   llvm::Value* Add(llvm::Value* lhs, llvm::Value* rhs);
   llvm::Value* Add(int64 lhs, llvm::Value* rhs) {
-    return Add(ir_builder()->getInt64(lhs), rhs);
+    return Add(b()->getInt64(lhs), rhs);
   }
   llvm::Value* Add(const llvm::APFloat& lhs, llvm::Value* rhs) {
     return Add(GetConstantFloat(rhs->getType(), lhs), rhs);
@@ -147,13 +147,11 @@ class VectorSupportLibrary {
   llvm::Value* ComputeOffsetPointer(llvm::Value* base_pointer,
                                     llvm::Value* offset_elements, int64 scale) {
     return ComputeOffsetPointer(
-        base_pointer,
-        ir_builder_->CreateMul(ir_builder_->getInt64(scale), offset_elements));
+        base_pointer, b_->CreateMul(b_->getInt64(scale), offset_elements));
   }
   llvm::Value* ComputeOffsetPointer(llvm::Value* base_pointer,
                                     int64 offset_elements) {
-    return ComputeOffsetPointer(base_pointer,
-                                ir_builder()->getInt64(offset_elements));
+    return ComputeOffsetPointer(base_pointer, b()->getInt64(offset_elements));
   }
 
   llvm::Value* LoadVector(llvm::Value* pointer);
@@ -164,7 +162,7 @@ class VectorSupportLibrary {
   }
 
   llvm::Value* LoadVector(llvm::Value* base_pointer, int64 offset_elements) {
-    return LoadVector(base_pointer, ir_builder()->getInt64(offset_elements));
+    return LoadVector(base_pointer, b()->getInt64(offset_elements));
   }
 
   llvm::Value* LoadScalar(llvm::Value* pointer);
@@ -175,7 +173,7 @@ class VectorSupportLibrary {
   }
 
   llvm::Value* LoadScalar(llvm::Value* base_pointer, int64 offset_elements) {
-    return LoadScalar(base_pointer, ir_builder()->getInt64(offset_elements));
+    return LoadScalar(base_pointer, b()->getInt64(offset_elements));
   }
 
   void StoreVector(llvm::Value* value, llvm::Value* pointer);
@@ -187,7 +185,7 @@ class VectorSupportLibrary {
 
   void StoreVector(llvm::Value* value, llvm::Value* base_pointer,
                    int64 offset_elements) {
-    StoreVector(value, base_pointer, ir_builder()->getInt64(offset_elements));
+    StoreVector(value, base_pointer, b()->getInt64(offset_elements));
   }
 
   void StoreScalar(llvm::Value* value, llvm::Value* pointer);
@@ -198,7 +196,7 @@ class VectorSupportLibrary {
 
   void StoreScalar(llvm::Value* value, llvm::Value* base_pointer,
                    int64 offset_elements) {
-    StoreScalar(base_pointer, ir_builder()->getInt64(offset_elements));
+    StoreScalar(base_pointer, b()->getInt64(offset_elements));
   }
 
   llvm::Value* LoadBroadcast(llvm::Value* pointer);
@@ -207,7 +205,7 @@ class VectorSupportLibrary {
     return LoadBroadcast(ComputeOffsetPointer(base_pointer, offset_elements));
   }
   llvm::Value* LoadBroadcast(llvm::Value* base_pointer, int64 offset_elements) {
-    return LoadBroadcast(base_pointer, ir_builder()->getInt64(offset_elements));
+    return LoadBroadcast(base_pointer, b()->getInt64(offset_elements));
   }
 
   // Compute the horizontal sum of each vector in `vectors`.  The i'th element
@@ -220,7 +218,7 @@ class VectorSupportLibrary {
   llvm::Value* GetZeroVector();
   llvm::Value* GetZeroScalar();
 
-  llvm::IRBuilder<>* ir_builder() const { return ir_builder_; }
+  llvm::IRBuilder<>* b() const { return b_; }
   int64 vector_size() const { return vector_size_; }
   llvm::Type* vector_type() const { return vector_type_; }
   llvm::Type* vector_pointer_type() const { return vector_pointer_type_; }
@@ -277,7 +275,7 @@ class VectorSupportLibrary {
 
   int64 vector_size_;
   PrimitiveType primitive_type_;
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   llvm::Type* vector_type_;
   llvm::Type* vector_pointer_type_;
   llvm::Type* scalar_type_;
@@ -289,22 +287,21 @@ class VectorSupportLibrary {
 // can later convert to a SSA value.
 class LlvmVariable {
  public:
-  LlvmVariable(llvm::Type*, llvm::IRBuilder<>* ir_builder);
+  LlvmVariable(llvm::Type*, llvm::IRBuilder<>* b);
 
   llvm::Value* Get() const;
   void Set(llvm::Value* new_value);
 
  private:
   llvm::AllocaInst* alloca_;
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
 };
 
 class VectorVariable : public LlvmVariable {
  public:
   VectorVariable(VectorSupportLibrary* vector_support,
                  llvm::Value* initial_value)
-      : LlvmVariable(vector_support->vector_type(),
-                     vector_support->ir_builder()) {
+      : LlvmVariable(vector_support->vector_type(), vector_support->b()) {
     Set(initial_value);
   }
 };
@@ -313,8 +310,7 @@ class ScalarVariable : public LlvmVariable {
  public:
   ScalarVariable(VectorSupportLibrary* vector_support,
                  llvm::Value* initial_value)
-      : LlvmVariable(vector_support->scalar_type(),
-                     vector_support->ir_builder()) {
+      : LlvmVariable(vector_support->scalar_type(), vector_support->b()) {
     Set(initial_value);
   }
 };
diff --git a/tensorflow/compiler/xla/service/defuser_test.cc b/tensorflow/compiler/xla/service/defuser_test.cc
index 32b5c5d35f..e727ba49cb 100644
--- a/tensorflow/compiler/xla/service/defuser_test.cc
+++ b/tensorflow/compiler/xla/service/defuser_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/defuser.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
@@ -124,7 +124,7 @@ TEST_F(DefuserTest, NonTrivialFusionInstruction) {
   auto div = builder.AddInstruction(
       HloInstruction::CreateBinary(shape_, HloOpcode::kDivide, mul, param3));
   auto constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   auto add2 = builder.AddInstruction(
       HloInstruction::CreateBinary(shape_, HloOpcode::kAdd, constant, div));
 
@@ -162,7 +162,7 @@ TEST_F(DefuserTest, MultipleFusionInstructions) {
   auto div = builder.AddInstruction(
       HloInstruction::CreateBinary(shape_, HloOpcode::kDivide, mul, param3));
   auto constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   auto add2 = builder.AddInstruction(
       HloInstruction::CreateBinary(shape_, HloOpcode::kAdd, constant, div));
 
diff --git a/tensorflow/compiler/xla/service/dfs_hlo_visitor.h b/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
index cb3676c5ba..9f86749125 100644
--- a/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
+++ b/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
@@ -19,7 +19,7 @@ limitations under the License.
 #include <type_traits>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/status.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -76,6 +76,7 @@ class DfsHloVisitorBase {
 
   virtual Status HandleClamp(HloInstructionPtr hlo) = 0;
   virtual Status HandleSelect(HloInstructionPtr hlo) = 0;
+  virtual Status HandleTupleSelect(HloInstructionPtr hlo) = 0;
   virtual Status HandleMaximum(HloInstructionPtr hlo) {
     return HandleElementwiseBinary(hlo);
   }
@@ -211,6 +212,7 @@ class DfsHloVisitorBase {
   virtual Status HandleReverse(HloInstructionPtr hlo) = 0;
   virtual Status HandleSort(HloInstructionPtr hlo) = 0;
   virtual Status HandleConstant(HloInstructionPtr hlo) = 0;
+  virtual Status HandleIota(HloInstructionPtr hlo) = 0;
   virtual Status HandleGetTupleElement(HloInstructionPtr hlo) = 0;
   virtual Status HandleReduce(HloInstructionPtr hlo) = 0;
   virtual Status HandleBitcast(HloInstructionPtr hlo) = 0;
@@ -231,6 +233,7 @@ class DfsHloVisitorBase {
   virtual Status HandleWhile(HloInstructionPtr hlo) = 0;
   virtual Status HandleConditional(HloInstructionPtr hlo) = 0;
   virtual Status HandleGather(HloInstructionPtr hlo) = 0;
+  virtual Status HandleScatter(HloInstructionPtr hlo) = 0;
 
   virtual Status HandlePad(HloInstructionPtr hlo) = 0;
 
diff --git a/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h b/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
index 987c91e5ba..ae8a066d62 100644
--- a/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
+++ b/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
@@ -16,7 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_DFS_HLO_VISITOR_WITH_DEFAULT_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_DFS_HLO_VISITOR_WITH_DEFAULT_H_
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -79,6 +79,9 @@ class DfsHloVisitorWithDefaultBase
   Status HandleSelect(HloInstructionPtr select) override {
     return DefaultAction(select);
   }
+  Status HandleTupleSelect(HloInstructionPtr tuple_select) override {
+    return DefaultAction(tuple_select);
+  }
   Status HandleDot(HloInstructionPtr dot) override {
     return DefaultAction(dot);
   }
@@ -112,6 +115,9 @@ class DfsHloVisitorWithDefaultBase
   Status HandleConstant(HloInstructionPtr constant) override {
     return DefaultAction(constant);
   }
+  Status HandleIota(HloInstructionPtr iota) override {
+    return DefaultAction(iota);
+  }
   Status HandleGetTupleElement(HloInstructionPtr get_tuple_element) override {
     return DefaultAction(get_tuple_element);
   }
@@ -188,6 +194,9 @@ class DfsHloVisitorWithDefaultBase
   Status HandleGather(HloInstructionPtr gather) override {
     return DefaultAction(gather);
   }
+  Status HandleScatter(HloInstructionPtr scatter) override {
+    return DefaultAction(scatter);
+  }
   Status HandleAfterAll(HloInstructionPtr token) override {
     return DefaultAction(token);
   }
diff --git a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
index ce0951bbe1..f883eb828c 100644
--- a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
@@ -61,13 +61,13 @@ int64 GlobalRandomValue() {
 
 llvm::Value* EmitReducePrecisionFloat(llvm::Value* x, int64 exponent_bits,
                                       int64 mantissa_bits,
-                                      llvm::IRBuilder<>* ir_builder) {
+                                      llvm::IRBuilder<>* b) {
   // Integer and float types for casting and constant generation.
   llvm::Type* float_type = x->getType();
-  llvm::IntegerType* int_type = ir_builder->getInt32Ty();
+  llvm::IntegerType* int_type = b->getInt32Ty();
 
   // Cast the input value to an integer for bitwise manipulation.
-  llvm::Value* x_as_int = ir_builder->CreateBitCast(x, int_type);
+  llvm::Value* x_as_int = b->CreateBitCast(x, int_type);
 
   if (mantissa_bits < 23) {
     // Last remaining mantissa bit.
@@ -77,22 +77,22 @@ llvm::Value* EmitReducePrecisionFloat(llvm::Value* x, int64 exponent_bits,
     // equal to a base value of 0111... plus one bit if the last remaining
     // mantissa bit is 1.
     const uint32_t base_rounding_bias = (last_mantissa_bit_mask >> 1) - 1;
-    llvm::Value* x_last_mantissa_bit = ir_builder->CreateLShr(
-        ir_builder->CreateAnd(
-            x_as_int, llvm::ConstantInt::get(int_type, last_mantissa_bit_mask)),
+    llvm::Value* x_last_mantissa_bit = b->CreateLShr(
+        b->CreateAnd(x_as_int,
+                     llvm::ConstantInt::get(int_type, last_mantissa_bit_mask)),
         (23 - mantissa_bits));
-    llvm::Value* x_rounding_bias = ir_builder->CreateAdd(
-        x_last_mantissa_bit,
-        llvm::ConstantInt::get(int_type, base_rounding_bias));
+    llvm::Value* x_rounding_bias =
+        b->CreateAdd(x_last_mantissa_bit,
+                     llvm::ConstantInt::get(int_type, base_rounding_bias));
 
     // Add rounding bias, and mask out truncated bits.  Note that the case
     // where adding the rounding bias overflows into the exponent bits is
     // correct; the non-masked mantissa bits will all be zero, and the
     // exponent will be incremented by one.
     const uint32_t truncation_mask = ~(last_mantissa_bit_mask - 1);
-    x_as_int = ir_builder->CreateAdd(x_as_int, x_rounding_bias);
-    x_as_int = ir_builder->CreateAnd(
-        x_as_int, llvm::ConstantInt::get(int_type, truncation_mask));
+    x_as_int = b->CreateAdd(x_as_int, x_rounding_bias);
+    x_as_int = b->CreateAnd(x_as_int,
+                            llvm::ConstantInt::get(int_type, truncation_mask));
   }
 
   if (exponent_bits < 8) {
@@ -120,29 +120,29 @@ llvm::Value* EmitReducePrecisionFloat(llvm::Value* x, int64 exponent_bits,
         f32_exponent_bias - reduced_exponent_bias;
 
     // Do we overflow or underflow?
-    llvm::Value* x_exponent = ir_builder->CreateAnd(
+    llvm::Value* x_exponent = b->CreateAnd(
         x_as_int, llvm::ConstantInt::get(int_type, f32_exp_bits_mask));
-    llvm::Value* x_overflows = ir_builder->CreateICmpUGT(
+    llvm::Value* x_overflows = b->CreateICmpUGT(
         x_exponent,
         llvm::ConstantInt::get(int_type, reduced_max_exponent << 23));
-    llvm::Value* x_underflows = ir_builder->CreateICmpULE(
+    llvm::Value* x_underflows = b->CreateICmpULE(
         x_exponent,
         llvm::ConstantInt::get(int_type, reduced_min_exponent << 23));
 
     // Compute appropriately-signed values of zero and infinity.
-    llvm::Value* x_signed_zero = ir_builder->CreateAnd(
+    llvm::Value* x_signed_zero = b->CreateAnd(
         x_as_int, llvm::ConstantInt::get(int_type, f32_sign_bit_mask));
-    llvm::Value* x_signed_inf = ir_builder->CreateOr(
+    llvm::Value* x_signed_inf = b->CreateOr(
         x_signed_zero, llvm::ConstantInt::get(int_type, f32_exp_bits_mask));
 
     // Force to zero or infinity if overflow or underflow.  (Note that this
     // truncates all denormal values to zero, rather than rounding them.)
-    x_as_int = ir_builder->CreateSelect(x_overflows, x_signed_inf, x_as_int);
-    x_as_int = ir_builder->CreateSelect(x_underflows, x_signed_zero, x_as_int);
+    x_as_int = b->CreateSelect(x_overflows, x_signed_inf, x_as_int);
+    x_as_int = b->CreateSelect(x_underflows, x_signed_zero, x_as_int);
   }
 
   // Cast the result back to a floating-point type.
-  llvm::Value* result = ir_builder->CreateBitCast(x_as_int, float_type);
+  llvm::Value* result = b->CreateBitCast(x_as_int, float_type);
 
   // Correct result for NaN inputs.
   //
@@ -154,53 +154,49 @@ llvm::Value* EmitReducePrecisionFloat(llvm::Value* x, int64 exponent_bits,
   //
   // If the fast-math flags are set to assume no NaNs, the comparison is likely
   // to be optimized away, so there's no point in even emitting it.
-  if (!ir_builder->getFastMathFlags().noNaNs()) {
-    llvm::Value* x_is_nan = ir_builder->CreateFCmpUNO(x, x);
+  if (!b->getFastMathFlags().noNaNs()) {
+    llvm::Value* x_is_nan = b->CreateFCmpUNO(x, x);
 
     if (mantissa_bits > 0) {
-      result = ir_builder->CreateSelect(x_is_nan, x, result);
+      result = b->CreateSelect(x_is_nan, x, result);
     } else {
-      result = ir_builder->CreateSelect(
+      result = b->CreateSelect(
           x_is_nan, llvm::ConstantFP::getInfinity(float_type), result);
     }
   }
   return result;
 }
 
-llvm::Value* EmitF32ToBF16(llvm::Value* f32_value,
-                           llvm::IRBuilder<>* ir_builder) {
+llvm::Value* EmitF32ToBF16(llvm::Value* f32_value, llvm::IRBuilder<>* b) {
   auto reduced_precision = EmitReducePrecisionFloat(
       f32_value,
       /*exponent_bits=*/primitive_util::kBFloat16ExponentBits,
-      /*mantissa_bits=*/primitive_util::kBFloat16MantissaBits, ir_builder);
-  auto as_int32 =
-      ir_builder->CreateBitCast(reduced_precision, ir_builder->getInt32Ty());
-  auto shifted = ir_builder->CreateLShr(as_int32, 16);
-  auto truncated = ir_builder->CreateTrunc(shifted, ir_builder->getInt16Ty());
-  return ir_builder->CreateBitCast(truncated, ir_builder->getInt16Ty());
+      /*mantissa_bits=*/primitive_util::kBFloat16MantissaBits, b);
+  auto as_int32 = b->CreateBitCast(reduced_precision, b->getInt32Ty());
+  auto shifted = b->CreateLShr(as_int32, 16);
+  auto truncated = b->CreateTrunc(shifted, b->getInt16Ty());
+  return b->CreateBitCast(truncated, b->getInt16Ty());
 }
 
-llvm::Value* EmitBF16ToF32(llvm::Value* bf16_value,
-                           llvm::IRBuilder<>* ir_builder) {
-  auto as_int16 =
-      ir_builder->CreateBitCast(bf16_value, ir_builder->getInt16Ty());
-  auto as_int32 = ir_builder->CreateZExt(as_int16, ir_builder->getInt32Ty());
-  auto shifted = ir_builder->CreateShl(as_int32, 16);
-  return ir_builder->CreateBitCast(shifted, ir_builder->getFloatTy());
+llvm::Value* EmitBF16ToF32(llvm::Value* bf16_value, llvm::IRBuilder<>* b) {
+  auto as_int16 = b->CreateBitCast(bf16_value, b->getInt16Ty());
+  auto as_int32 = b->CreateZExt(as_int16, b->getInt32Ty());
+  auto shifted = b->CreateShl(as_int32, 16);
+  return b->CreateBitCast(shifted, b->getFloatTy());
 }
 
 llvm::Value* EmitIntegralToFloating(llvm::Value* integer_value,
                                     PrimitiveType from_type,
                                     PrimitiveType to_type, llvm::Module* module,
-                                    llvm::IRBuilder<>* ir_builder) {
+                                    llvm::IRBuilder<>* b) {
   if (primitive_util::IsSignedIntegralType(from_type)) {
-    return ir_builder->CreateSIToFP(
-        integer_value, llvm_ir::PrimitiveTypeToIrType(to_type, module));
+    return b->CreateSIToFP(integer_value,
+                           llvm_ir::PrimitiveTypeToIrType(to_type, module));
   } else {
     CHECK(primitive_util::IsUnsignedIntegralType(from_type) ||
           from_type == PRED);
-    return ir_builder->CreateUIToFP(
-        integer_value, llvm_ir::PrimitiveTypeToIrType(to_type, module));
+    return b->CreateUIToFP(integer_value,
+                           llvm_ir::PrimitiveTypeToIrType(to_type, module));
   }
 }
 
@@ -226,39 +222,43 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerUnaryOp(
     case HloOpcode::kConvert: {
       PrimitiveType from_type = op->operand(0)->shape().element_type();
       PrimitiveType to_type = op->shape().element_type();
-      CHECK(primitive_util::IsIntegralType(from_type) || from_type == PRED);
+      CHECK(primitive_util::IsIntegralType(from_type) || from_type == PRED)
+          << from_type;
       if (from_type == to_type) {
         return operand_value;
       }
+      if (to_type == PRED) {
+        return b_->CreateZExt(
+            b_->CreateICmpNE(operand_value, llvm::ConstantInt::get(
+                                                operand_value->getType(), 0)),
+            llvm_ir::PrimitiveTypeToIrType(PRED, module_));
+      }
       if (primitive_util::IsIntegralType(to_type)) {
-        return ir_builder_->CreateIntCast(
+        return b_->CreateIntCast(
             operand_value, llvm_ir::PrimitiveTypeToIrType(to_type, module_),
             primitive_util::IsSignedIntegralType(from_type));
       }
       if (primitive_util::IsFloatingPointType(to_type)) {
         if (to_type == BF16) {
-          return EmitF32ToBF16(
-              EmitIntegralToFloating(operand_value, from_type, F32, module_,
-                                     ir_builder_),
-              ir_builder_);
+          return EmitF32ToBF16(EmitIntegralToFloating(operand_value, from_type,
+                                                      F32, module_, b_),
+                               b_);
         }
         return EmitIntegralToFloating(operand_value, from_type, to_type,
-                                      module_, ir_builder_);
+                                      module_, b_);
       }
       if (primitive_util::IsComplexType(to_type)) {
         auto to_ir_component_type = llvm_ir::PrimitiveTypeToIrType(
             primitive_util::ComplexComponentType(to_type), module_);
         if (primitive_util::IsSignedIntegralType(from_type)) {
           return EmitComposeComplex(
-              op,
-              ir_builder_->CreateSIToFP(operand_value, to_ir_component_type),
+              op, b_->CreateSIToFP(operand_value, to_ir_component_type),
               nullptr);
         }
         if (primitive_util::IsUnsignedIntegralType(from_type) ||
             from_type == PRED) {
           return EmitComposeComplex(
-              op,
-              ir_builder_->CreateUIToFP(operand_value, to_ir_component_type),
+              op, b_->CreateUIToFP(operand_value, to_ir_component_type),
               nullptr);
         }
       }
@@ -275,7 +275,7 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerUnaryOp(
       }
       if (primitive_util::BitWidth(from_type) ==
           primitive_util::BitWidth(to_type)) {
-        return ir_builder_->CreateBitCast(
+        return b_->CreateBitCast(
             operand_value, llvm_ir::PrimitiveTypeToIrType(to_type, module_));
       }
       return InvalidArgument(
@@ -293,18 +293,18 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerUnaryOp(
         auto type =
             llvm_ir::PrimitiveTypeToIrType(op->shape().element_type(), module_);
         auto zero = llvm::ConstantInt::get(type, 0);
-        auto cmp = ir_builder_->CreateICmpSGE(operand_value, zero);
-        return ir_builder_->CreateSelect(cmp, operand_value,
-                                         ir_builder_->CreateNeg(operand_value));
+        auto cmp = b_->CreateICmpSGE(operand_value, zero);
+        return b_->CreateSelect(cmp, operand_value,
+                                b_->CreateNeg(operand_value));
       } else {
         return operand_value;
       }
     }
     case HloOpcode::kClz: {
-      auto is_zero_undef = ir_builder_->getFalse();
-      return llvm_ir::EmitCallToIntrinsic(
-          llvm::Intrinsic::ctlz, {operand_value, is_zero_undef},
-          {operand_value->getType()}, ir_builder_);
+      auto is_zero_undef = b_->getFalse();
+      return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::ctlz,
+                                          {operand_value, is_zero_undef},
+                                          {operand_value->getType()}, b_);
     }
     case HloOpcode::kSign: {
       bool is_signed =
@@ -312,31 +312,28 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerUnaryOp(
       auto type =
           llvm_ir::PrimitiveTypeToIrType(op->shape().element_type(), module_);
       auto zero = llvm::ConstantInt::get(type, 0);
-      auto cmp = ir_builder_->CreateICmpEQ(operand_value, zero);
+      auto cmp = b_->CreateICmpEQ(operand_value, zero);
       if (is_signed) {
-        auto ashr = ir_builder_->CreateAShr(operand_value,
-                                            type->getIntegerBitWidth() - 1);
-        return ir_builder_->CreateSelect(cmp, zero,
-                                         ir_builder_->CreateOr(ashr, 1));
+        auto ashr =
+            b_->CreateAShr(operand_value, type->getIntegerBitWidth() - 1);
+        return b_->CreateSelect(cmp, zero, b_->CreateOr(ashr, 1));
       } else {
-        return ir_builder_->CreateSelect(cmp, zero,
-                                         llvm::ConstantInt::get(type, 1));
+        return b_->CreateSelect(cmp, zero, llvm::ConstantInt::get(type, 1));
       }
     }
     case HloOpcode::kNegate:
-      return ir_builder_->CreateNeg(operand_value);
+      return b_->CreateNeg(operand_value);
     case HloOpcode::kNot: {
       auto type = op->shape().element_type();
       if (type == PRED) {
         // It is not sufficient to just call CreateNot() here because a PRED
         // is represented as an i8 and the truth value is stored only in the
         // bottom bit.
-        return ir_builder_->CreateZExt(
-            ir_builder_->CreateNot(ir_builder_->CreateTrunc(
-                operand_value, ir_builder_->getInt1Ty())),
+        return b_->CreateZExt(
+            b_->CreateNot(b_->CreateTrunc(operand_value, b_->getInt1Ty())),
             llvm_ir::PrimitiveTypeToIrType(PRED, module_));
       } else if (primitive_util::IsIntegralType(type)) {
-        return ir_builder_->CreateNot(operand_value);
+        return b_->CreateNot(operand_value);
       }
       return Unimplemented("unary op Not is not defined for type '%d'", type);
     }
@@ -352,7 +349,7 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitFloatUnaryOp(
     case HloOpcode::kConvert: {
       PrimitiveType from_type = op->operand(0)->shape().element_type();
       PrimitiveType to_type = op->shape().element_type();
-      CHECK(primitive_util::IsFloatingPointType(from_type));
+      CHECK(primitive_util::IsFloatingPointType(from_type)) << from_type;
       if (from_type == to_type) {
         return operand_value;
       }
@@ -364,32 +361,38 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitFloatUnaryOp(
         }
         return EmitComposeComplex(
             op,
-            ir_builder_->CreateFPCast(
-                operand_value,
-                llvm_ir::PrimitiveTypeToIrType(to_component_type, module_)),
+            b_->CreateFPCast(operand_value, llvm_ir::PrimitiveTypeToIrType(
+                                                to_component_type, module_)),
             nullptr);
       }
       if (from_type == BF16) {
         TF_RET_CHECK(to_type != BF16);
-        operand_value = EmitBF16ToF32(operand_value, ir_builder_);
+        operand_value = EmitBF16ToF32(operand_value, b_);
         from_type = F32;
         if (from_type == to_type) {
           return operand_value;
         }
       }
       if (from_type == F32 && to_type == BF16) {
-        return EmitF32ToBF16(operand_value, ir_builder_);
+        return EmitF32ToBF16(operand_value, b_);
+      }
+      if (to_type == PRED) {
+        return b_->CreateZExt(
+            b_->CreateFCmpUNE(
+                operand_value,
+                llvm::ConstantFP::get(operand_value->getType(), 0.0)),
+            llvm_ir::PrimitiveTypeToIrType(PRED, module_));
       }
       if (primitive_util::IsFloatingPointType(to_type)) {
-        return ir_builder_->CreateFPCast(
+        return b_->CreateFPCast(
             operand_value, llvm_ir::PrimitiveTypeToIrType(to_type, module_));
       }
       if (primitive_util::IsSignedIntegralType(to_type)) {
-        return ir_builder_->CreateFPToSI(
+        return b_->CreateFPToSI(
             operand_value, llvm_ir::PrimitiveTypeToIrType(to_type, module_));
       }
       if (primitive_util::IsUnsignedIntegralType(to_type)) {
-        return ir_builder_->CreateFPToUI(
+        return b_->CreateFPToUI(
             operand_value, llvm_ir::PrimitiveTypeToIrType(to_type, module_));
       }
       return Unimplemented("unhandled conversion operation: %s => %s",
@@ -405,7 +408,7 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitFloatUnaryOp(
       }
       if (primitive_util::BitWidth(from_type) ==
           primitive_util::BitWidth(to_type)) {
-        return ir_builder_->CreateBitCast(
+        return b_->CreateBitCast(
             operand_value, llvm_ir::PrimitiveTypeToIrType(to_type, module_));
       }
       return InvalidArgument(
@@ -429,45 +432,49 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitFloatUnaryOp(
     case HloOpcode::kSin:
       return EmitSin(op->shape().element_type(), operand_value);
     case HloOpcode::kFloor:
-      return llvm_ir::EmitCallToIntrinsic(
-          llvm::Intrinsic::floor, {operand_value}, {operand_value->getType()},
-          ir_builder_);
+      return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::floor,
+                                          {operand_value},
+                                          {operand_value->getType()}, b_);
     case HloOpcode::kCeil:
-      return llvm_ir::EmitCallToIntrinsic(
-          llvm::Intrinsic::ceil, {operand_value}, {operand_value->getType()},
-          ir_builder_);
+      return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::ceil,
+                                          {operand_value},
+                                          {operand_value->getType()}, b_);
     case HloOpcode::kAbs:
-      return llvm_ir::EmitCallToIntrinsic(
-          llvm::Intrinsic::fabs, {operand_value}, {operand_value->getType()},
-          ir_builder_);
+      return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::fabs,
+                                          {operand_value},
+                                          {operand_value->getType()}, b_);
     case HloOpcode::kRoundNearestAfz:
-      return llvm_ir::EmitCallToIntrinsic(
-          llvm::Intrinsic::round, {operand_value}, {operand_value->getType()},
-          ir_builder_);
+      return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::round,
+                                          {operand_value},
+                                          {operand_value->getType()}, b_);
     case HloOpcode::kSign: {
       // TODO(b/32151903): Ensure consistent sign behavior for -0.0.
       auto type = operand_value->getType();
       auto zero = llvm::ConstantFP::get(type, 0.0);
-      auto oeq = ir_builder_->CreateFCmpOEQ(operand_value, zero);
-      auto olt = ir_builder_->CreateFCmpOLT(operand_value, zero);
-      return ir_builder_->CreateSelect(
+      auto oeq = b_->CreateFCmpOEQ(operand_value, zero);
+      auto olt = b_->CreateFCmpOLT(operand_value, zero);
+      return b_->CreateSelect(
           oeq, zero,
-          ir_builder_->CreateSelect(olt, llvm::ConstantFP::get(type, -1.0),
-                                    llvm::ConstantFP::get(type, 1.0)));
+          b_->CreateSelect(olt, llvm::ConstantFP::get(type, -1.0),
+                           llvm::ConstantFP::get(type, 1.0)));
     }
     case HloOpcode::kIsFinite: {
       // abs(x) o!= inf, this works because the comparison returns false if
       // either operand is NaN.
       auto type = operand_value->getType();
       auto abs_value = llvm_ir::EmitCallToIntrinsic(
-          llvm::Intrinsic::fabs, {operand_value}, {type}, ir_builder_);
+          llvm::Intrinsic::fabs, {operand_value}, {type}, b_);
       auto infinity = llvm::ConstantFP::getInfinity(type);
-      auto not_infinite = ir_builder_->CreateFCmpONE(abs_value, infinity);
-      return ir_builder_->CreateZExt(
-          not_infinite, llvm_ir::PrimitiveTypeToIrType(PRED, module_));
+      auto not_infinite = b_->CreateFCmpONE(abs_value, infinity);
+      return b_->CreateZExt(not_infinite,
+                            llvm_ir::PrimitiveTypeToIrType(PRED, module_));
     }
     case HloOpcode::kNegate:
-      return ir_builder_->CreateFNeg(operand_value);
+      return b_->CreateFNeg(operand_value);
+    case HloOpcode::kReal:
+      return operand_value;
+    case HloOpcode::kImag:
+      return llvm::ConstantFP::get(operand_value->getType(), 0.0);
     default:
       return Unimplemented("unary floating-point op '%s'",
                            HloOpcodeString(op->opcode()).c_str());
@@ -487,13 +494,12 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
       auto a = EmitExtractReal(operand_value);
       auto b = EmitExtractImag(operand_value);
       llvm::Type* llvm_ty = a->getType();
-      auto sum_sq = ir_builder_->CreateFAdd(ir_builder_->CreateFMul(a, a),
-                                            ir_builder_->CreateFMul(b, b));
+      auto sum_sq = b_->CreateFAdd(b_->CreateFMul(a, a), b_->CreateFMul(b, b));
       TF_ASSIGN_OR_RETURN(auto log_sum_sq, EmitLog(component_type, sum_sq));
       TF_ASSIGN_OR_RETURN(auto angle, EmitAtan2(component_type, b, a));
       auto one_half = llvm::ConstantFP::get(llvm_ty, 0.5);
-      return EmitComposeComplex(
-          op, ir_builder_->CreateFMul(one_half, log_sum_sq), angle);
+      return EmitComposeComplex(op, b_->CreateFMul(one_half, log_sum_sq),
+                                angle);
     }
     case HloOpcode::kLog1p: {
       // log1p(a+bi) = .5*log((a+1)^2+b^2) + i*atan2(b, a + 1)
@@ -501,15 +507,14 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
       auto b = EmitExtractImag(operand_value);
       llvm::Type* llvm_ty = a->getType();
       auto one = llvm::ConstantFP::get(llvm_ty, 1.0);
-      auto a_plus_one = ir_builder_->CreateFAdd(a, one);
-      auto sum_sq = ir_builder_->CreateFAdd(
-          ir_builder_->CreateFMul(a_plus_one, a_plus_one),
-          ir_builder_->CreateFMul(b, b));
+      auto a_plus_one = b_->CreateFAdd(a, one);
+      auto sum_sq = b_->CreateFAdd(b_->CreateFMul(a_plus_one, a_plus_one),
+                                   b_->CreateFMul(b, b));
       TF_ASSIGN_OR_RETURN(auto log_sum_sq, EmitLog(component_type, sum_sq));
       TF_ASSIGN_OR_RETURN(auto angle, EmitAtan2(component_type, b, a_plus_one));
       auto one_half = llvm::ConstantFP::get(llvm_ty, 0.5);
-      return EmitComposeComplex(
-          op, ir_builder_->CreateFMul(one_half, log_sum_sq), angle);
+      return EmitComposeComplex(op, b_->CreateFMul(one_half, log_sum_sq),
+                                angle);
     }
     case HloOpcode::kConvert: {
       PrimitiveType from_type = op->operand(0)->shape().element_type();
@@ -523,12 +528,11 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
           primitive_util::ComplexComponentType(to_type);
       auto to_ir_component_type =
           llvm_ir::PrimitiveTypeToIrType(to_component_type, module_);
-      return EmitComposeComplex(
-          op,
-          ir_builder_->CreateFPCast(EmitExtractReal(operand_value),
-                                    to_ir_component_type),
-          ir_builder_->CreateFPCast(EmitExtractImag(operand_value),
-                                    to_ir_component_type));
+      return EmitComposeComplex(op,
+                                b_->CreateFPCast(EmitExtractReal(operand_value),
+                                                 to_ir_component_type),
+                                b_->CreateFPCast(EmitExtractImag(operand_value),
+                                                 to_ir_component_type));
     }
     case HloOpcode::kExp: {
       // e^(a+bi) = e^a*(cos(b)+sin(b)i)
@@ -538,8 +542,8 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
           auto cos_b, EmitCos(component_type, EmitExtractImag(operand_value)));
       TF_ASSIGN_OR_RETURN(
           auto sin_b, EmitSin(component_type, EmitExtractImag(operand_value)));
-      return EmitComposeComplex(op, ir_builder_->CreateFMul(exp_a, cos_b),
-                                ir_builder_->CreateFMul(exp_a, sin_b));
+      return EmitComposeComplex(op, b_->CreateFMul(exp_a, cos_b),
+                                b_->CreateFMul(exp_a, sin_b));
     }
     case HloOpcode::kExpm1: {
       // e^(a+bi)-1 = (e^a*cos(b)-1)+e^a*sin(b)i
@@ -550,9 +554,8 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
       TF_ASSIGN_OR_RETURN(
           auto sin_b, EmitSin(component_type, EmitExtractImag(operand_value)));
       auto one = llvm::ConstantFP::get(exp_a->getType(), 1.0);
-      auto real_result =
-          ir_builder_->CreateFSub(ir_builder_->CreateFMul(exp_a, cos_b), one);
-      auto imag_result = ir_builder_->CreateFMul(exp_a, sin_b);
+      auto real_result = b_->CreateFSub(b_->CreateFMul(exp_a, cos_b), one);
+      auto imag_result = b_->CreateFMul(exp_a, sin_b);
       return EmitComposeComplex(op, real_result, imag_result);
     }
     case HloOpcode::kCos: {
@@ -567,18 +570,14 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
       auto b = EmitExtractImag(operand_value);
       auto type = a->getType();
       TF_ASSIGN_OR_RETURN(auto exp_b, EmitExp(component_type, b));
-      auto half_exp_b =
-          ir_builder_->CreateFMul(llvm::ConstantFP::get(type, 0.5), exp_b);
+      auto half_exp_b = b_->CreateFMul(llvm::ConstantFP::get(type, 0.5), exp_b);
       auto half_exp_neg_b =
-          ir_builder_->CreateFDiv(llvm::ConstantFP::get(type, 0.5), exp_b);
+          b_->CreateFDiv(llvm::ConstantFP::get(type, 0.5), exp_b);
       TF_ASSIGN_OR_RETURN(auto cos_a, EmitCos(component_type, a));
       TF_ASSIGN_OR_RETURN(auto sin_a, EmitSin(component_type, a));
       return EmitComposeComplex(
-          op,
-          ir_builder_->CreateFMul(
-              cos_a, ir_builder_->CreateFAdd(half_exp_neg_b, half_exp_b)),
-          ir_builder_->CreateFMul(
-              sin_a, ir_builder_->CreateFSub(half_exp_neg_b, half_exp_b)));
+          op, b_->CreateFMul(cos_a, b_->CreateFAdd(half_exp_neg_b, half_exp_b)),
+          b_->CreateFMul(sin_a, b_->CreateFSub(half_exp_neg_b, half_exp_b)));
     }
     case HloOpcode::kSin: {
       // sin(z) = .5i(e^(-iz) - e^(iz))
@@ -594,18 +593,14 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
       auto b = EmitExtractImag(operand_value);
       auto type = a->getType();
       TF_ASSIGN_OR_RETURN(auto exp_b, EmitExp(component_type, b));
-      auto half_exp_b =
-          ir_builder_->CreateFMul(llvm::ConstantFP::get(type, 0.5), exp_b);
+      auto half_exp_b = b_->CreateFMul(llvm::ConstantFP::get(type, 0.5), exp_b);
       auto half_exp_neg_b =
-          ir_builder_->CreateFDiv(llvm::ConstantFP::get(type, 0.5), exp_b);
+          b_->CreateFDiv(llvm::ConstantFP::get(type, 0.5), exp_b);
       TF_ASSIGN_OR_RETURN(auto cos_a, EmitCos(component_type, a));
       TF_ASSIGN_OR_RETURN(auto sin_a, EmitSin(component_type, a));
       return EmitComposeComplex(
-          op,
-          ir_builder_->CreateFMul(
-              sin_a, ir_builder_->CreateFAdd(half_exp_b, half_exp_neg_b)),
-          ir_builder_->CreateFMul(
-              cos_a, ir_builder_->CreateFSub(half_exp_b, half_exp_neg_b)));
+          op, b_->CreateFMul(sin_a, b_->CreateFAdd(half_exp_b, half_exp_neg_b)),
+          b_->CreateFMul(cos_a, b_->CreateFSub(half_exp_b, half_exp_neg_b)));
     }
     case HloOpcode::kTanh: {
       /*
@@ -633,64 +628,61 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexUnaryOp(
       TF_ASSIGN_OR_RETURN(auto exp_a, EmitExp(component_type, a));
       TF_ASSIGN_OR_RETURN(auto cos_b, EmitCos(component_type, b));
       TF_ASSIGN_OR_RETURN(auto sin_b, EmitSin(component_type, b));
-      auto exp_neg_a = ir_builder_->CreateFDiv(
-          llvm::ConstantFP::get(exp_a->getType(), 1), exp_a);
-      auto exp_2a_minus_exp_neg_2a = ir_builder_->CreateFSub(
-          ir_builder_->CreateFMul(exp_a, exp_a),
-          ir_builder_->CreateFMul(exp_neg_a, exp_neg_a));
-      auto cos_b_sq = ir_builder_->CreateFMul(cos_b, cos_b);
-      auto sin_b_sq = ir_builder_->CreateFMul(sin_b, sin_b);
-      auto real_num = ir_builder_->CreateFAdd(
-          ir_builder_->CreateFMul(cos_b_sq, exp_2a_minus_exp_neg_2a),
-          ir_builder_->CreateFMul(sin_b_sq, exp_2a_minus_exp_neg_2a));
-      auto cos_b_sin_b = ir_builder_->CreateFMul(cos_b, sin_b);
-      auto exp_a_plus_exp_neg_a = ir_builder_->CreateFAdd(exp_a, exp_neg_a);
+      auto exp_neg_a =
+          b_->CreateFDiv(llvm::ConstantFP::get(exp_a->getType(), 1), exp_a);
+      auto exp_2a_minus_exp_neg_2a = b_->CreateFSub(
+          b_->CreateFMul(exp_a, exp_a), b_->CreateFMul(exp_neg_a, exp_neg_a));
+      auto cos_b_sq = b_->CreateFMul(cos_b, cos_b);
+      auto sin_b_sq = b_->CreateFMul(sin_b, sin_b);
+      auto real_num =
+          b_->CreateFAdd(b_->CreateFMul(cos_b_sq, exp_2a_minus_exp_neg_2a),
+                         b_->CreateFMul(sin_b_sq, exp_2a_minus_exp_neg_2a));
+      auto cos_b_sin_b = b_->CreateFMul(cos_b, sin_b);
+      auto exp_a_plus_exp_neg_a = b_->CreateFAdd(exp_a, exp_neg_a);
       auto exp_a_plus_exp_neg_a_sq =
-          ir_builder_->CreateFMul(exp_a_plus_exp_neg_a, exp_a_plus_exp_neg_a);
-      auto exp_a_minus_exp_neg_a = ir_builder_->CreateFSub(exp_a, exp_neg_a);
+          b_->CreateFMul(exp_a_plus_exp_neg_a, exp_a_plus_exp_neg_a);
+      auto exp_a_minus_exp_neg_a = b_->CreateFSub(exp_a, exp_neg_a);
       auto exp_a_minus_exp_neg_a_sq =
-          ir_builder_->CreateFMul(exp_a_minus_exp_neg_a, exp_a_minus_exp_neg_a);
-      auto imag_num = ir_builder_->CreateFMul(
-          cos_b_sin_b, ir_builder_->CreateFSub(exp_a_plus_exp_neg_a_sq,
-                                               exp_a_minus_exp_neg_a_sq));
-      auto denom = ir_builder_->CreateFAdd(
-          ir_builder_->CreateFMul(cos_b_sq, exp_a_plus_exp_neg_a_sq),
-          ir_builder_->CreateFMul(sin_b_sq, exp_a_minus_exp_neg_a_sq));
-      return EmitComposeComplex(op, ir_builder_->CreateFDiv(real_num, denom),
-                                ir_builder_->CreateFDiv(imag_num, denom));
+          b_->CreateFMul(exp_a_minus_exp_neg_a, exp_a_minus_exp_neg_a);
+      auto imag_num = b_->CreateFMul(
+          cos_b_sin_b,
+          b_->CreateFSub(exp_a_plus_exp_neg_a_sq, exp_a_minus_exp_neg_a_sq));
+      auto denom =
+          b_->CreateFAdd(b_->CreateFMul(cos_b_sq, exp_a_plus_exp_neg_a_sq),
+                         b_->CreateFMul(sin_b_sq, exp_a_minus_exp_neg_a_sq));
+      return EmitComposeComplex(op, b_->CreateFDiv(real_num, denom),
+                                b_->CreateFDiv(imag_num, denom));
     }
     case HloOpcode::kAbs: {
-      auto sum_sq = ir_builder_->CreateFAdd(
-          ir_builder_->CreateFMul(EmitExtractReal(operand_value),
-                                  EmitExtractReal(operand_value)),
-          ir_builder_->CreateFMul(EmitExtractImag(operand_value),
-                                  EmitExtractImag(operand_value)));
+      auto sum_sq =
+          b_->CreateFAdd(b_->CreateFMul(EmitExtractReal(operand_value),
+                                        EmitExtractReal(operand_value)),
+                         b_->CreateFMul(EmitExtractImag(operand_value),
+                                        EmitExtractImag(operand_value)));
       return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::sqrt, {sum_sq},
-                                          {sum_sq->getType()}, ir_builder_);
+                                          {sum_sq->getType()}, b_);
     }
     case HloOpcode::kSign: {  // Sign(c) = c / |c|
-      auto sum_sq = ir_builder_->CreateFAdd(
-          ir_builder_->CreateFMul(EmitExtractReal(operand_value),
-                                  EmitExtractReal(operand_value)),
-          ir_builder_->CreateFMul(EmitExtractImag(operand_value),
-                                  EmitExtractImag(operand_value)));
+      auto sum_sq =
+          b_->CreateFAdd(b_->CreateFMul(EmitExtractReal(operand_value),
+                                        EmitExtractReal(operand_value)),
+                         b_->CreateFMul(EmitExtractImag(operand_value),
+                                        EmitExtractImag(operand_value)));
       auto cplx_abs = llvm_ir::EmitCallToIntrinsic(
-          llvm::Intrinsic::sqrt, {sum_sq}, {sum_sq->getType()}, ir_builder_);
+          llvm::Intrinsic::sqrt, {sum_sq}, {sum_sq->getType()}, b_);
       auto type = cplx_abs->getType();
       auto zero = llvm::ConstantFP::get(type, 0.0);
-      auto oeq = ir_builder_->CreateFCmpOEQ(cplx_abs, zero);
-      return ir_builder_->CreateSelect(
+      auto oeq = b_->CreateFCmpOEQ(cplx_abs, zero);
+      return b_->CreateSelect(
           oeq, EmitComposeComplex(op, zero, zero),
           EmitComposeComplex(
-              op,
-              ir_builder_->CreateFDiv(EmitExtractReal(operand_value), cplx_abs),
-              ir_builder_->CreateFDiv(EmitExtractImag(operand_value),
-                                      cplx_abs)));
+              op, b_->CreateFDiv(EmitExtractReal(operand_value), cplx_abs),
+              b_->CreateFDiv(EmitExtractImag(operand_value), cplx_abs)));
     }
     case HloOpcode::kNegate:
-      return EmitComposeComplex(
-          op, ir_builder_->CreateFNeg(EmitExtractReal(operand_value)),
-          ir_builder_->CreateFNeg(EmitExtractImag(operand_value)));
+      return EmitComposeComplex(op,
+                                b_->CreateFNeg(EmitExtractReal(operand_value)),
+                                b_->CreateFNeg(EmitExtractImag(operand_value)));
     case HloOpcode::kReal:
       return EmitExtractReal(operand_value);
     case HloOpcode::kImag:
@@ -724,15 +716,15 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitFloatBinaryOp(
     case HloOpcode::kComplex:
       return EmitComposeComplex(op, lhs_value, rhs_value);
     case HloOpcode::kAdd:
-      return ir_builder_->CreateFAdd(lhs_value, rhs_value);
+      return b_->CreateFAdd(lhs_value, rhs_value);
     case HloOpcode::kSubtract:
-      return ir_builder_->CreateFSub(lhs_value, rhs_value);
+      return b_->CreateFSub(lhs_value, rhs_value);
     case HloOpcode::kMultiply:
-      return ir_builder_->CreateFMul(lhs_value, rhs_value);
+      return b_->CreateFMul(lhs_value, rhs_value);
     case HloOpcode::kDivide:
-      return ir_builder_->CreateFDiv(lhs_value, rhs_value);
+      return b_->CreateFDiv(lhs_value, rhs_value);
     case HloOpcode::kRemainder:
-      return ir_builder_->CreateFRem(lhs_value, rhs_value);
+      return b_->CreateFRem(lhs_value, rhs_value);
     // LLVM comparisons can be "unordered" (U) or "ordered" (O) -- ordered
     // comparisons always return false when one of the operands is NaN, whereas
     // unordered comparisons return true.
@@ -742,22 +734,22 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitFloatBinaryOp(
     // matches C++'s semantics.
     case HloOpcode::kEq:
       return llvm_ir::EmitComparison(llvm::CmpInst::FCMP_OEQ, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
     case HloOpcode::kNe:
       return llvm_ir::EmitComparison(llvm::CmpInst::FCMP_UNE, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
     case HloOpcode::kLt:
       return llvm_ir::EmitComparison(llvm::CmpInst::FCMP_OLT, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
     case HloOpcode::kGt:
       return llvm_ir::EmitComparison(llvm::CmpInst::FCMP_OGT, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
     case HloOpcode::kLe:
       return llvm_ir::EmitComparison(llvm::CmpInst::FCMP_OLE, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
     case HloOpcode::kGe:
       return llvm_ir::EmitComparison(llvm::CmpInst::FCMP_OGE, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
 
     case HloOpcode::kMaximum:
       return EmitFloatMax(lhs_value, rhs_value);
@@ -778,64 +770,56 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexBinaryOp(
     llvm::Value* rhs_value) const {
   switch (op->opcode()) {
     case HloOpcode::kAdd:
-      return EmitComposeComplex(
-          op,
-          ir_builder_->CreateFAdd(EmitExtractReal(lhs_value),
-                                  EmitExtractReal(rhs_value)),
-          ir_builder_->CreateFAdd(EmitExtractImag(lhs_value),
-                                  EmitExtractImag(rhs_value)));
+      return EmitComposeComplex(op,
+                                b_->CreateFAdd(EmitExtractReal(lhs_value),
+                                               EmitExtractReal(rhs_value)),
+                                b_->CreateFAdd(EmitExtractImag(lhs_value),
+                                               EmitExtractImag(rhs_value)));
     case HloOpcode::kSubtract:
-      return EmitComposeComplex(
-          op,
-          ir_builder_->CreateFSub(EmitExtractReal(lhs_value),
-                                  EmitExtractReal(rhs_value)),
-          ir_builder_->CreateFSub(EmitExtractImag(lhs_value),
-                                  EmitExtractImag(rhs_value)));
+      return EmitComposeComplex(op,
+                                b_->CreateFSub(EmitExtractReal(lhs_value),
+                                               EmitExtractReal(rhs_value)),
+                                b_->CreateFSub(EmitExtractImag(lhs_value),
+                                               EmitExtractImag(rhs_value)));
     case HloOpcode::kMultiply:
       return EmitComposeComplex(
           op,
-          ir_builder_->CreateFSub(
-              ir_builder_->CreateFMul(EmitExtractReal(lhs_value),
-                                      EmitExtractReal(rhs_value)),
-              ir_builder_->CreateFMul(EmitExtractImag(lhs_value),
-                                      EmitExtractImag(rhs_value))),
-          ir_builder_->CreateFAdd(
-              ir_builder_->CreateFMul(EmitExtractReal(lhs_value),
-                                      EmitExtractImag(rhs_value)),
-              ir_builder_->CreateFMul(EmitExtractImag(lhs_value),
-                                      EmitExtractReal(rhs_value))));
+          b_->CreateFSub(b_->CreateFMul(EmitExtractReal(lhs_value),
+                                        EmitExtractReal(rhs_value)),
+                         b_->CreateFMul(EmitExtractImag(lhs_value),
+                                        EmitExtractImag(rhs_value))),
+          b_->CreateFAdd(b_->CreateFMul(EmitExtractReal(lhs_value),
+                                        EmitExtractImag(rhs_value)),
+                         b_->CreateFMul(EmitExtractImag(lhs_value),
+                                        EmitExtractReal(rhs_value))));
     case HloOpcode::kDivide: {
       // (a+bi) / (c+di) = ((a+bi)(c-di)) / ((c+di)(c-di))
       // = ((ac + bd) + (bc - ad)i) / (c^2 + d^2)
-      auto rhs_sum_sq = ir_builder_->CreateFAdd(
-          ir_builder_->CreateFMul(EmitExtractReal(rhs_value),
-                                  EmitExtractReal(rhs_value)),
-          ir_builder_->CreateFMul(EmitExtractImag(rhs_value),
-                                  EmitExtractImag(rhs_value)));
+      auto rhs_sum_sq =
+          b_->CreateFAdd(b_->CreateFMul(EmitExtractReal(rhs_value),
+                                        EmitExtractReal(rhs_value)),
+                         b_->CreateFMul(EmitExtractImag(rhs_value),
+                                        EmitExtractImag(rhs_value)));
       auto type = rhs_sum_sq->getType();
       auto zero = llvm::ConstantFP::get(type, 0.0);
-      auto oeq = ir_builder_->CreateFCmpOEQ(rhs_sum_sq, zero);
-      auto real_inf_or_nan =
-          ir_builder_->CreateFDiv(EmitExtractReal(lhs_value), zero);
-      auto imag_inf_or_nan =
-          ir_builder_->CreateFDiv(EmitExtractImag(lhs_value), zero);
-      return ir_builder_->CreateSelect(
+      auto oeq = b_->CreateFCmpOEQ(rhs_sum_sq, zero);
+      auto real_inf_or_nan = b_->CreateFDiv(EmitExtractReal(lhs_value), zero);
+      auto imag_inf_or_nan = b_->CreateFDiv(EmitExtractImag(lhs_value), zero);
+      return b_->CreateSelect(
           oeq, EmitComposeComplex(op, real_inf_or_nan, imag_inf_or_nan),
           EmitComposeComplex(
               op,
-              ir_builder_->CreateFDiv(
-                  ir_builder_->CreateFAdd(
-                      ir_builder_->CreateFMul(EmitExtractReal(lhs_value),
-                                              EmitExtractReal(rhs_value)),
-                      ir_builder_->CreateFMul(EmitExtractImag(lhs_value),
-                                              EmitExtractImag(rhs_value))),
+              b_->CreateFDiv(
+                  b_->CreateFAdd(b_->CreateFMul(EmitExtractReal(lhs_value),
+                                                EmitExtractReal(rhs_value)),
+                                 b_->CreateFMul(EmitExtractImag(lhs_value),
+                                                EmitExtractImag(rhs_value))),
                   rhs_sum_sq),
-              ir_builder_->CreateFDiv(
-                  ir_builder_->CreateFSub(
-                      ir_builder_->CreateFMul(EmitExtractImag(lhs_value),
-                                              EmitExtractReal(rhs_value)),
-                      ir_builder_->CreateFMul(EmitExtractReal(lhs_value),
-                                              EmitExtractImag(rhs_value))),
+              b_->CreateFDiv(
+                  b_->CreateFSub(b_->CreateFMul(EmitExtractImag(lhs_value),
+                                                EmitExtractReal(rhs_value)),
+                                 b_->CreateFMul(EmitExtractReal(lhs_value),
+                                                EmitExtractImag(rhs_value))),
                   rhs_sum_sq)));
     }
     // LLVM comparisons can be "unordered" (U) or "ordered" (O) -- ordered
@@ -846,21 +830,21 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexBinaryOp(
     // unordered comparison.  This makes x != y equivalent to !(x == y), and
     // matches C++'s semantics.
     case HloOpcode::kEq:
-      return ir_builder_->CreateAnd(
+      return b_->CreateAnd(
           llvm_ir::EmitComparison(llvm::CmpInst::FCMP_OEQ,
                                   EmitExtractReal(lhs_value),
-                                  EmitExtractReal(rhs_value), ir_builder_),
+                                  EmitExtractReal(rhs_value), b_),
           llvm_ir::EmitComparison(llvm::CmpInst::FCMP_OEQ,
                                   EmitExtractImag(lhs_value),
-                                  EmitExtractImag(rhs_value), ir_builder_));
+                                  EmitExtractImag(rhs_value), b_));
     case HloOpcode::kNe:
-      return ir_builder_->CreateOr(
+      return b_->CreateOr(
           llvm_ir::EmitComparison(llvm::CmpInst::FCMP_UNE,
                                   EmitExtractReal(lhs_value),
-                                  EmitExtractReal(rhs_value), ir_builder_),
+                                  EmitExtractReal(rhs_value), b_),
           llvm_ir::EmitComparison(llvm::CmpInst::FCMP_UNE,
                                   EmitExtractImag(lhs_value),
-                                  EmitExtractImag(rhs_value), ir_builder_));
+                                  EmitExtractImag(rhs_value), b_));
 
     case HloOpcode::kPower: {
       // (a+bi)^(c+di) =
@@ -872,29 +856,26 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexBinaryOp(
       auto b = EmitExtractImag(lhs_value);
       auto c = EmitExtractReal(rhs_value);
       auto d = EmitExtractImag(rhs_value);
-      auto aa_p_bb = ir_builder_->CreateFAdd(ir_builder_->CreateFMul(a, a),
-                                             ir_builder_->CreateFMul(b, b));
+      auto aa_p_bb = b_->CreateFAdd(b_->CreateFMul(a, a), b_->CreateFMul(b, b));
       auto one_half = llvm::ConstantFP::get(a->getType(), 0.5);
-      auto half_c = ir_builder_->CreateFMul(one_half, c);
+      auto half_c = b_->CreateFMul(one_half, c);
 
       TF_ASSIGN_OR_RETURN(auto aa_p_bb_to_half_c,
                           EmitPow(component_type, aa_p_bb, half_c));
-      auto neg_d = ir_builder_->CreateFNeg(d);
+      auto neg_d = b_->CreateFNeg(d);
       TF_ASSIGN_OR_RETURN(auto arg_lhs, EmitAtan2(component_type, b, a));
-      auto neg_d_arg_lhs = ir_builder_->CreateFMul(neg_d, arg_lhs);
+      auto neg_d_arg_lhs = b_->CreateFMul(neg_d, arg_lhs);
       TF_ASSIGN_OR_RETURN(auto e_to_neg_d_arg_lhs,
                           EmitExp(component_type, neg_d_arg_lhs));
-      auto coeff =
-          ir_builder_->CreateFMul(aa_p_bb_to_half_c, e_to_neg_d_arg_lhs);
+      auto coeff = b_->CreateFMul(aa_p_bb_to_half_c, e_to_neg_d_arg_lhs);
       TF_ASSIGN_OR_RETURN(auto ln_aa_p_bb, EmitLog(component_type, aa_p_bb));
-      auto half_d = ir_builder_->CreateFMul(one_half, d);
-      auto q =
-          ir_builder_->CreateFAdd(ir_builder_->CreateFMul(c, arg_lhs),
-                                  ir_builder_->CreateFMul(half_d, ln_aa_p_bb));
+      auto half_d = b_->CreateFMul(one_half, d);
+      auto q = b_->CreateFAdd(b_->CreateFMul(c, arg_lhs),
+                              b_->CreateFMul(half_d, ln_aa_p_bb));
       TF_ASSIGN_OR_RETURN(auto cos_q, EmitCos(component_type, q));
       TF_ASSIGN_OR_RETURN(auto sin_q, EmitSin(component_type, q));
-      return EmitComposeComplex(op, ir_builder_->CreateFMul(coeff, cos_q),
-                                ir_builder_->CreateFMul(coeff, sin_q));
+      return EmitComposeComplex(op, b_->CreateFMul(coeff, cos_q),
+                                b_->CreateFMul(coeff, sin_q));
     }
     default:
       return Unimplemented("binary complex op '%s'",
@@ -904,12 +885,12 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitComplexBinaryOp(
 
 llvm::Value* ElementalIrEmitter::EmitFloatMax(llvm::Value* lhs_value,
                                               llvm::Value* rhs_value) const {
-  return llvm_ir::EmitFloatMax(lhs_value, rhs_value, ir_builder_);
+  return llvm_ir::EmitFloatMax(lhs_value, rhs_value, b_);
 }
 
 llvm::Value* ElementalIrEmitter::EmitFloatMin(llvm::Value* lhs_value,
                                               llvm::Value* rhs_value) const {
-  return llvm_ir::EmitFloatMin(lhs_value, rhs_value, ir_builder_);
+  return llvm_ir::EmitFloatMin(lhs_value, rhs_value, b_);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitErfInv(PrimitiveType prim_type,
@@ -921,15 +902,14 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitErfInv(PrimitiveType prim_type,
         "type F32.");
   }
   auto getFloat = [&](const float f) {
-    return llvm::ConstantFP::get(ir_builder_->getFloatTy(), f);
+    return llvm::ConstantFP::get(b_->getFloatTy(), f);
   };
   auto multiply_add = [&](tensorflow::gtl::ArraySlice<float> coefficients,
                           llvm::Value* w) {
     llvm::Value* p = getFloat(coefficients.front());
     coefficients.pop_front();
     for (float coefficient : coefficients) {
-      p = ir_builder_->CreateFAdd(ir_builder_->CreateFMul(p, w),
-                                  getFloat(coefficient));
+      p = b_->CreateFAdd(b_->CreateFMul(p, w), getFloat(coefficient));
     }
     return p;
   };
@@ -947,50 +927,48 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitErfInv(PrimitiveType prim_type,
   //   }
   //   return p*x
   llvm::Function* logf_fn = llvm::Intrinsic::getDeclaration(
-      module_, llvm::Intrinsic::log, {ir_builder_->getFloatTy()});
+      module_, llvm::Intrinsic::log, {b_->getFloatTy()});
 
-  llvm::Value* w = ir_builder_->CreateFNeg(ir_builder_->CreateCall(
-      logf_fn,
-      {ir_builder_->CreateFMul(ir_builder_->CreateFSub(getFloat(1.0f), x),
-                               ir_builder_->CreateFAdd(getFloat(1.0f), x))}));
+  llvm::Value* w = b_->CreateFNeg(b_->CreateCall(
+      logf_fn, {b_->CreateFMul(b_->CreateFSub(getFloat(1.0f), x),
+                               b_->CreateFAdd(getFloat(1.0f), x))}));
 
-  llvm::Value* p_addr = llvm_ir::EmitAllocaAtFunctionEntry(
-      ir_builder_->getFloatTy(), "p.addr", ir_builder_);
+  llvm::Value* p_addr =
+      llvm_ir::EmitAllocaAtFunctionEntry(b_->getFloatTy(), "p.addr", b_);
 
-  llvm_ir::LlvmIfData if_data =
-      llvm_ir::EmitIfThenElse(ir_builder_->CreateFCmpOLT(w, getFloat(5.0f)),
-                              "w_less_than_five", ir_builder_);
+  llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
+      b_->CreateFCmpOLT(w, getFloat(5.0f)), "w_less_than_five", b_);
   // Handle true BB.
-  SetToFirstInsertPoint(if_data.true_block, ir_builder_);
+  SetToFirstInsertPoint(if_data.true_block, b_);
   {
-    llvm::Value* lw = ir_builder_->CreateFSub(w, getFloat(2.5f));
+    llvm::Value* lw = b_->CreateFSub(w, getFloat(2.5f));
     tensorflow::gtl::ArraySlice<float> lq{
         2.81022636e-08f,  3.43273939e-07f, -3.5233877e-06f,
         -4.39150654e-06f, 0.00021858087f,  -0.00125372503f,
         -0.00417768164f,  0.246640727f,    1.50140941f};
     llvm::Value* p = multiply_add(lq, lw);
-    ir_builder_->CreateStore(p, p_addr);
+    b_->CreateStore(p, p_addr);
   }
 
   // Handle false BB.
-  SetToFirstInsertPoint(if_data.false_block, ir_builder_);
+  SetToFirstInsertPoint(if_data.false_block, b_);
   {
     llvm::Function* sqrtf_fn = llvm::Intrinsic::getDeclaration(
-        module_, llvm::Intrinsic::sqrt, {ir_builder_->getFloatTy()});
+        module_, llvm::Intrinsic::sqrt, {b_->getFloatTy()});
 
-    llvm::Value* gw = ir_builder_->CreateFSub(
-        ir_builder_->CreateCall(sqrtf_fn, {w}), getFloat(3.0f));
+    llvm::Value* gw =
+        b_->CreateFSub(b_->CreateCall(sqrtf_fn, {w}), getFloat(3.0f));
     tensorflow::gtl::ArraySlice<float> gq{
         -0.000200214257f, 0.000100950558f, 0.00134934322f,
         -0.00367342844f,  0.00573950773f,  -0.0076224613f,
         0.00943887047f,   1.00167406f,     2.83297682f};
     llvm::Value* p = multiply_add(gq, gw);
-    ir_builder_->CreateStore(p, p_addr);
+    b_->CreateStore(p, p_addr);
   }
 
-  SetToFirstInsertPoint(if_data.after_block, ir_builder_);
-  llvm::Value* p = ir_builder_->CreateLoad(p_addr);
-  return ir_builder_->CreateFMul(p, x);
+  SetToFirstInsertPoint(if_data.after_block, b_);
+  llvm::Value* p = b_->CreateLoad(p_addr);
+  return b_->CreateFMul(p, x);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitErfcInv(
@@ -998,13 +976,13 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitErfcInv(
   // Compute erfcinv(value) by calculating erfinv(1.0 - value).
   auto type = llvm_ir::PrimitiveTypeToIrType(prim_type, module_);
   auto one = llvm::ConstantFP::get(type, 1.0);
-  return EmitErfInv(prim_type, ir_builder_->CreateFSub(one, value));
+  return EmitErfInv(prim_type, b_->CreateFSub(one, value));
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitLog(PrimitiveType prim_type,
                                                    llvm::Value* value) const {
   return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::log, {value},
-                                      {value->getType()}, ir_builder_);
+                                      {value->getType()}, b_);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitLog1p(PrimitiveType prim_type,
@@ -1016,35 +994,34 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitLog1p(PrimitiveType prim_type,
   // When x is large, the naive evaluation of ln(x + 1) is more
   // accurate than the Taylor series.
   TF_ASSIGN_OR_RETURN(auto for_large_x,
-                      EmitLog(prim_type, ir_builder_->CreateFAdd(x, one)));
+                      EmitLog(prim_type, b_->CreateFAdd(x, one)));
   // The Taylor series for ln(x+1) is x - x^2/2 - x^3/3 + ….
-  auto for_small_x = ir_builder_->CreateFMul(
-      ir_builder_->CreateFAdd(ir_builder_->CreateFMul(negative_half, x), one),
-      x);
+  auto for_small_x =
+      b_->CreateFMul(b_->CreateFAdd(b_->CreateFMul(negative_half, x), one), x);
   const auto kAntilogarithmIsSmallThreshold = 1e-4;
-  auto abs_x = llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::fabs, {value},
-                                            {type}, ir_builder_);
-  auto x_is_small = ir_builder_->CreateFCmpOLT(
+  auto abs_x =
+      llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::fabs, {value}, {type}, b_);
+  auto x_is_small = b_->CreateFCmpOLT(
       abs_x, llvm::ConstantFP::get(type, kAntilogarithmIsSmallThreshold));
-  return ir_builder_->CreateSelect(x_is_small, for_small_x, for_large_x);
+  return b_->CreateSelect(x_is_small, for_small_x, for_large_x);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitSin(PrimitiveType prim_type,
                                                    llvm::Value* value) const {
   return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::sin, {value},
-                                      {value->getType()}, ir_builder_);
+                                      {value->getType()}, b_);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitCos(PrimitiveType prim_type,
                                                    llvm::Value* value) const {
   return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::cos, {value},
-                                      {value->getType()}, ir_builder_);
+                                      {value->getType()}, b_);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitExp(PrimitiveType prim_type,
                                                    llvm::Value* value) const {
   return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::exp, {value},
-                                      {value->getType()}, ir_builder_);
+                                      {value->getType()}, b_);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitExpm1(PrimitiveType prim_type,
@@ -1056,25 +1033,25 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitExpm1(PrimitiveType prim_type,
   // When the exponent is large, the naive evaluation of e^(x) - 1 is more
   // accurate than the Taylor series.
   TF_ASSIGN_OR_RETURN(auto exp_x, EmitExp(prim_type, value));
-  auto for_large_x = ir_builder_->CreateFSub(exp_x, one);
+  auto for_large_x = b_->CreateFSub(exp_x, one);
   // The Taylor series for exp(x) is 1 + x + x^2/2 + x^3/6 + ….
   // We want exp(x)-1 which is x + x^2/2 + x^3/6 + ….
-  auto x_squared = ir_builder_->CreateFAdd(x, x);
-  auto x_squared_over_two = ir_builder_->CreateFMul(x_squared, half);
-  auto for_small_x = ir_builder_->CreateFAdd(x, x_squared_over_two);
+  auto x_squared = b_->CreateFAdd(x, x);
+  auto x_squared_over_two = b_->CreateFMul(x_squared, half);
+  auto for_small_x = b_->CreateFAdd(x, x_squared_over_two);
   const auto kExponentIsSmallThreshold = 1e-5;
-  auto abs_x = llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::fabs, {value},
-                                            {type}, ir_builder_);
-  auto x_is_small = ir_builder_->CreateFCmpOLT(
+  auto abs_x =
+      llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::fabs, {value}, {type}, b_);
+  auto x_is_small = b_->CreateFCmpOLT(
       abs_x, llvm::ConstantFP::get(type, kExponentIsSmallThreshold));
-  return ir_builder_->CreateSelect(x_is_small, for_small_x, for_large_x);
+  return b_->CreateSelect(x_is_small, for_small_x, for_large_x);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitPow(PrimitiveType prim_type,
                                                    llvm::Value* lhs,
                                                    llvm::Value* rhs) const {
   return llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::pow, {lhs, rhs},
-                                      {lhs->getType()}, ir_builder_);
+                                      {lhs->getType()}, b_);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitAtan2(PrimitiveType prim_type,
@@ -1089,11 +1066,10 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitReducePrecision(
     return Unimplemented("reduce-precision only implemented for F32");
   }
   return EmitReducePrecisionFloat(x, /*exponent_bits=*/hlo->exponent_bits(),
-                                  /*mantissa_bits=*/hlo->mantissa_bits(),
-                                  ir_builder_);
+                                  /*mantissa_bits=*/hlo->mantissa_bits(), b_);
 }
 
-static llvm::Value* SaturateShiftIfNecessary(llvm::IRBuilder<>* ir_builder,
+static llvm::Value* SaturateShiftIfNecessary(llvm::IRBuilder<>* b,
                                              llvm::Value* lhs, llvm::Value* rhs,
                                              llvm::Value* shift_result,
                                              bool saturate_to_sign_bit) {
@@ -1106,15 +1082,14 @@ static llvm::Value* SaturateShiftIfNecessary(llvm::IRBuilder<>* ir_builder,
   llvm::ConstantInt* minus_one = llvm::ConstantInt::get(integer_type, -1);
   llvm::Value* saturated_value;
   if (saturate_to_sign_bit) {
-    saturated_value = ir_builder->CreateSelect(
-        ir_builder->CreateICmpSLT(lhs, zero), minus_one, zero);
+    saturated_value =
+        b->CreateSelect(b->CreateICmpSLT(lhs, zero), minus_one, zero);
   } else {
     saturated_value = zero;
   }
   llvm::Value* shift_amt_in_range =
-      ir_builder->CreateICmpULT(rhs, integer_bitsize_constant, "shft.chk");
-  return ir_builder->CreateSelect(shift_amt_in_range, shift_result,
-                                  saturated_value);
+      b->CreateICmpULT(rhs, integer_bitsize_constant, "shft.chk");
+  return b->CreateSelect(shift_amt_in_range, shift_result, saturated_value);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerBinaryOp(
@@ -1123,49 +1098,49 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerBinaryOp(
   switch (op->opcode()) {
     // TODO(jingyue): add the "nsw" attribute for signed types.
     case HloOpcode::kAdd:
-      return ir_builder_->CreateAdd(lhs_value, rhs_value);
+      return b_->CreateAdd(lhs_value, rhs_value);
     case HloOpcode::kSubtract:
-      return ir_builder_->CreateSub(lhs_value, rhs_value);
+      return b_->CreateSub(lhs_value, rhs_value);
     case HloOpcode::kMultiply:
-      return ir_builder_->CreateMul(lhs_value, rhs_value);
+      return b_->CreateMul(lhs_value, rhs_value);
     case HloOpcode::kDivide:
-      return is_signed ? ir_builder_->CreateSDiv(lhs_value, rhs_value)
-                       : ir_builder_->CreateUDiv(lhs_value, rhs_value);
+      return is_signed ? b_->CreateSDiv(lhs_value, rhs_value)
+                       : b_->CreateUDiv(lhs_value, rhs_value);
     case HloOpcode::kRemainder:
-      return is_signed ? ir_builder_->CreateSRem(lhs_value, rhs_value)
-                       : ir_builder_->CreateURem(lhs_value, rhs_value);
+      return is_signed ? b_->CreateSRem(lhs_value, rhs_value)
+                       : b_->CreateURem(lhs_value, rhs_value);
     case HloOpcode::kEq:
       return llvm_ir::EmitComparison(llvm::CmpInst::ICMP_EQ, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
     case HloOpcode::kNe:
       return llvm_ir::EmitComparison(llvm::CmpInst::ICMP_NE, lhs_value,
-                                     rhs_value, ir_builder_);
+                                     rhs_value, b_);
     case HloOpcode::kLt:
       return llvm_ir::EmitComparison(
           is_signed ? llvm::CmpInst::ICMP_SLT : llvm::CmpInst::ICMP_ULT,
-          lhs_value, rhs_value, ir_builder_);
+          lhs_value, rhs_value, b_);
     case HloOpcode::kGt:
       return llvm_ir::EmitComparison(
           is_signed ? llvm::CmpInst::ICMP_SGT : llvm::CmpInst::ICMP_UGT,
-          lhs_value, rhs_value, ir_builder_);
+          lhs_value, rhs_value, b_);
     case HloOpcode::kLe:
       return llvm_ir::EmitComparison(
           is_signed ? llvm::CmpInst::ICMP_SLE : llvm::CmpInst::ICMP_ULE,
-          lhs_value, rhs_value, ir_builder_);
+          lhs_value, rhs_value, b_);
     case HloOpcode::kGe:
       return llvm_ir::EmitComparison(
           is_signed ? llvm::CmpInst::ICMP_SGE : llvm::CmpInst::ICMP_UGE,
-          lhs_value, rhs_value, ir_builder_);
+          lhs_value, rhs_value, b_);
     case HloOpcode::kMinimum:
       return EmitIntegralMin(lhs_value, rhs_value, is_signed);
     case HloOpcode::kMaximum:
       return EmitIntegralMax(lhs_value, rhs_value, is_signed);
     case HloOpcode::kAnd:
-      return ir_builder_->CreateAnd(lhs_value, rhs_value);
+      return b_->CreateAnd(lhs_value, rhs_value);
     case HloOpcode::kOr:
-      return ir_builder_->CreateOr(lhs_value, rhs_value);
+      return b_->CreateOr(lhs_value, rhs_value);
     case HloOpcode::kXor:
-      return ir_builder_->CreateXor(lhs_value, rhs_value);
+      return b_->CreateXor(lhs_value, rhs_value);
 
     // Shifting out bits >= the number of bits in the type being shifted
     // produces a poison value in LLVM which is basically "deferred undefined
@@ -1173,20 +1148,17 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerBinaryOp(
     // UB.  We replace the poison value with a constant to avoid this deferred
     // UB.
     case HloOpcode::kShiftRightArithmetic:
-      return SaturateShiftIfNecessary(
-          ir_builder_, lhs_value, rhs_value,
-          ir_builder_->CreateAShr(lhs_value, rhs_value),
-          /*saturate_to_sign_bit=*/true);
+      return SaturateShiftIfNecessary(b_, lhs_value, rhs_value,
+                                      b_->CreateAShr(lhs_value, rhs_value),
+                                      /*saturate_to_sign_bit=*/true);
     case HloOpcode::kShiftLeft:
-      return SaturateShiftIfNecessary(
-          ir_builder_, lhs_value, rhs_value,
-          ir_builder_->CreateShl(lhs_value, rhs_value),
-          /*saturate_to_sign_bit=*/false);
+      return SaturateShiftIfNecessary(b_, lhs_value, rhs_value,
+                                      b_->CreateShl(lhs_value, rhs_value),
+                                      /*saturate_to_sign_bit=*/false);
     case HloOpcode::kShiftRightLogical:
-      return SaturateShiftIfNecessary(
-          ir_builder_, lhs_value, rhs_value,
-          ir_builder_->CreateLShr(lhs_value, rhs_value),
-          /*saturate_to_sign_bit=*/false);
+      return SaturateShiftIfNecessary(b_, lhs_value, rhs_value,
+                                      b_->CreateLShr(lhs_value, rhs_value),
+                                      /*saturate_to_sign_bit=*/false);
     default:
       return Unimplemented("binary integer op '%s'",
                            HloOpcodeString(op->opcode()).c_str());
@@ -1196,21 +1168,19 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerBinaryOp(
 llvm::Value* ElementalIrEmitter::EmitIntegralMax(llvm::Value* lhs_value,
                                                  llvm::Value* rhs_value,
                                                  bool is_signed) const {
-  return ir_builder_->CreateSelect(
-      ir_builder_->CreateICmp(
-          is_signed ? llvm::ICmpInst::ICMP_SGE : llvm::ICmpInst::ICMP_UGE,
-          lhs_value, rhs_value),
-      lhs_value, rhs_value);
+  return b_->CreateSelect(b_->CreateICmp(is_signed ? llvm::ICmpInst::ICMP_SGE
+                                                   : llvm::ICmpInst::ICMP_UGE,
+                                         lhs_value, rhs_value),
+                          lhs_value, rhs_value);
 }
 
 llvm::Value* ElementalIrEmitter::EmitIntegralMin(llvm::Value* lhs_value,
                                                  llvm::Value* rhs_value,
                                                  bool is_signed) const {
-  return ir_builder_->CreateSelect(
-      ir_builder_->CreateICmp(
-          is_signed ? llvm::ICmpInst::ICMP_SLE : llvm::ICmpInst::ICMP_ULE,
-          lhs_value, rhs_value),
-      lhs_value, rhs_value);
+  return b_->CreateSelect(b_->CreateICmp(is_signed ? llvm::ICmpInst::ICMP_SLE
+                                                   : llvm::ICmpInst::ICMP_ULE,
+                                         lhs_value, rhs_value),
+                          lhs_value, rhs_value);
 }
 
 llvm_ir::IrArray::Index ElementalIrEmitter::ElementwiseSourceIndex(
@@ -1227,7 +1197,14 @@ llvm_ir::IrArray::Index ElementalIrEmitter::ElementwiseSourceIndex(
 
   // If no implicit broadcast is needed for this operand, returns the target
   // index as the source index.
-  if (ShapeUtil::CompatibleIgnoringElementType(operand_shape, hlo.shape())) {
+  //
+  // `IrArray::Index` may contain a physical linear which we can propagate to
+  // our operand only if our layouts match.  "only if" is a bit strong since
+  // e.g. we can still forward the linear index if the operand shape is
+  // [5,1,1,5]{3,2,1,0} and the HLO shape is[5,1,1,5]{3,1,2,0}, but those cases
+  // are probably not worth handling here for now.
+  if (ShapeUtil::CompatibleIgnoringElementType(operand_shape, hlo.shape()) &&
+      LayoutUtil::Equal(operand_shape.layout(), hlo.shape().layout())) {
     return target_index;
   }
 
@@ -1246,180 +1223,254 @@ llvm_ir::IrArray::Index ElementalIrEmitter::ElementwiseSourceIndex(
   return source_index;
 }
 
-llvm_ir::ElementGenerator ElementalIrEmitter::MakeRngElementGenerator(
+StatusOr<llvm::Value*> ElementalIrEmitter::ConvertValueForDistribution(
     const HloInstruction* hlo,
-    const ElementalIrEmitter::HloToElementGeneratorMap& operand_to_generator)
-    const {
-  PrimitiveType param_prim_type = hlo->operand(0)->shape().element_type();
-  llvm::Type* param_ir_type =
-      llvm_ir::PrimitiveTypeToIrType(param_prim_type, module_);
-
-  // Same values as PCG library
-  // https://github.com/imneme/pcg-c/blob/master/include/pcg_variants.h
-  llvm::Value* multiplier = ir_builder_->getInt(
-      llvm::APInt(128, {0x4385DF649FCCF645, 0x2360ED051FC65DA4}));
-  llvm::Value* increment = ir_builder_->getInt(
-      llvm::APInt(128, {0x14057B7EF767814F, 0x5851F42D4C957F2D}));
-
-  auto random_value_from_hlo = [hlo]() {
-    const HloModule* module =
-        hlo->IsFused() ? hlo->parent()->FusionInstruction()->parent()->parent()
-                       : hlo->parent()->parent();
-    return module->RandomNew64();
-  };
+    const ElementalIrEmitter::HloToElementGeneratorMap& operand_to_generator,
+    const llvm_ir::IrArray::Index& index, llvm::Value* raw_value) const {
+  TF_ASSIGN_OR_RETURN(llvm::Value * a_or_mean,
+                      operand_to_generator.at(hlo->operand(0))(index));
+  TF_ASSIGN_OR_RETURN(llvm::Value * b_or_sigma,
+                      operand_to_generator.at(hlo->operand(1))(index));
+  PrimitiveType elem_prim_ty = hlo->shape().element_type();
+  llvm::Type* elem_ir_ty =
+      llvm_ir::PrimitiveTypeToIrType(elem_prim_ty, module_);
+  llvm::Type* raw_value_ty = raw_value->getType();
+
+  // Convert raw integer to float in range [0, 1) if the element is a float.
+  llvm::Value* elem_value = raw_value;
+  if (elem_ir_ty->isFloatingPointTy()) {
+    elem_value = b_->CreateUIToFP(elem_value, elem_ir_ty);
+    unsigned raw_value_size_in_bits = raw_value_ty->getPrimitiveSizeInBits();
+    CHECK(raw_value_size_in_bits == 32 || raw_value_size_in_bits == 64);
+    elem_value = b_->CreateFDiv(
+        elem_value,
+        llvm::ConstantFP::get(elem_ir_ty,
+                              raw_value_size_in_bits == 64 ? 0x1p64 : 0x1p32));
+  }
+
+  // Convert the value for the requested distribution.
+  switch (hlo->random_distribution()) {
+    case RNG_UNIFORM: {
+      if (elem_ir_ty->isFloatingPointTy()) {
+        return b_->CreateFAdd(
+            b_->CreateFMul(b_->CreateFSub(b_or_sigma, a_or_mean), elem_value),
+            a_or_mean);
+      } else {
+        // To generate a uniform random value in [a, b) from a raw random sample
+        // in range [0, 2^N), we let range = b - a and return
+        // (a + raw_value % range). If range is not a power of 2, raw values
+        // larger than (2^N - 2^N % range) are biased toward results in
+        // [a, a + (limit % range)). An unbiased algorithm would need to drop
+        // raw values and re-sample, but we don't do this because re-sampling in
+        // an efficient way is complex, and it's not clear that users need it.
+        // In particular, if one thread in a GPU warp needs to re-sample, we pay
+        // the same cost as if the whole warp were to re-sample.  So an
+        // efficient re-sampling implementation on GPU would need to do
+        // nontrivial work to share entropy between threads in the warp.
+        auto range = b_->CreateSub(b_or_sigma, a_or_mean);
+        return b_->CreateAdd(a_or_mean, b_->CreateURem(elem_value, range));
+      }
+    }
+    case RNG_NORMAL: {
+      TF_ASSIGN_OR_RETURN(
+          llvm::Value * r,
+          EmitErfcInv(elem_prim_ty,
+                      b_->CreateFMul(llvm::ConstantFP::get(elem_ir_ty, 2.0),
+                                     elem_value)));
+      return b_->CreateFAdd(b_->CreateFMul(r, b_or_sigma), a_or_mean);
+    }
+    default:
+      return InvalidArgument(
+          "unhandled distribution %s",
+          RandomDistribution_Name(hlo->random_distribution()).c_str());
+  }
+}
+
+namespace {
+
+// Checks that the primitive type is supported by the elemental IR emitter for
+// Philox RNG and returns the number of elements in each 128 bit sample of the
+// Philox RNG algorithm.
+int32 GetNumberOfElementsPerPhiloxRngSample(PrimitiveType elem_prim_ty) {
+  // Calculate the number of elements, that is the number of random numbers, in
+  // a 128 bit sample.
+  switch (elem_prim_ty) {
+    case U32:
+    case S32:
+    case F32:
+    // The algorithm uses 32 bits to generate values for F16.
+    case F16:
+      return 4;
+    case U64:
+    case F64:
+      return 2;
+    default:
+      // BF16 is converted to F16 by the hlo pass HloElementTypeConverter.
+      // Other data types are not supported by XLA random operation.
+      LOG(FATAL) << "Unrecognized primitive type for RNG " << elem_prim_ty;
+  }
+  return 0;
+}
+
+// Calculates the four uint32 values for the 128-bit Philox sample.
+std::array<llvm::Value*, 4> CalculateSampleValues(
+    llvm::Value* sample_idx, llvm::Value* hlo_random_value,
+    llvm::Value* global_random_number, llvm::Value* rng_state,
+    llvm::IRBuilder<>* b) {
+  llvm::Type* index_ty = sample_idx->getType();
+
+  std::array<llvm::Value*, 4> counter_values;
+
+  // Use the sample index to initialize counter[0] and counter[1].
+  unsigned index_ty_size_in_bits = index_ty->getPrimitiveSizeInBits();
+  CHECK(index_ty_size_in_bits == 32 || index_ty_size_in_bits == 64);
+  if (index_ty_size_in_bits == 32) {
+    counter_values[0] = sample_idx;
+    counter_values[1] = b->getInt32(0);
+  } else {
+    std::tie(counter_values[0], counter_values[1]) =
+        llvm_ir::SplitInt64ToInt32s(b, sample_idx);
+  }
+
+  // Xor the global state variable with the global random number seed and use
+  // the result to initialize counter[2] and counter[3].
+  std::tie(counter_values[2], counter_values[3]) = llvm_ir::SplitInt64ToInt32s(
+      b, b->CreateXor(rng_state, global_random_number));
 
-  // Seed each RNG emitter with a new 64-bit seed from the HloModule. If the
-  // compilation order is deterministic (i.e., RandomNew64 invocation order is
-  // deterministic), then the order of RNG is deterministic for a given seed and
-  // hence tests will be deterministic.
-  // If the user provides a global seed instruction then we only use 64-bits of
-  // the host's random number generator to seed the 128 bit value with the other
-  // 64-bits is due to a user specified global seed instruction.
-  // Create a GlobalVariable to maintain state between invocations. There is a
-  // bug in NVPTX with GlobalVariable and 128 bit values, so using 2 64-bit
+  // The algorithm uses a 64 bit key, which is also interpreted as two uint32
   // values.
-  llvm::GlobalVariable* state_ptr0 = new llvm::GlobalVariable(
-      /*M=*/*module_,
-      /*Ty=*/ir_builder_->getInt64Ty(),
-      /*isConstant=*/false,
-      /*Linkage=*/llvm::GlobalValue::PrivateLinkage,
-      /*Initializer=*/ir_builder_->getInt64(random_value_from_hlo()),
-      /*Name=*/"state_ptr0");
-
-  // When the module config seed is 0, the expected result of a prng is a random
-  // value. Instead of using the random_value_from_hlo, we need a global random
-  // value as the graph seed. This is because if we use random_value_from_hlo
-  // here, then for a newly built hlo graph, it always gives the same number.
-  uint64 graph_seed = hlo_module_config_.seed() != 0 ? hlo_module_config_.seed()
-                                                     : GlobalRandomValue();
-  llvm::GlobalVariable* state_ptr1 = new llvm::GlobalVariable(
-      /*M=*/*module_,
-      /*Ty=*/ir_builder_->getInt64Ty(),
-      /*isConstant=*/false,
-      /*Linkage=*/llvm::GlobalValue::PrivateLinkage,
-      /*Initializer=*/ir_builder_->getInt64(graph_seed),
-      /*Name=*/"state_ptr1");
-
-  // We want each thread to use its own stream, so we modify the increment per
-  // thread. We want the increment to remain odd, so we shift the thread id left
-  // 1 and add it to the increment.
-  increment = ir_builder_->CreateAdd(increment,
-                                     ir_builder_->CreateShl(EmitThreadId(), 1));
-
-  // PCG-XSL-RR algorithm
-  // http://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf
-  //   state = multiplier * state + increment
-  //   return uint64_t(state ^ (state >> 64))) >>> (state >> 122)
-  // where ">>>" is bitwise rotation
-  auto get_next_i64 = [=]() {
-    llvm::Value* state0 = ir_builder_->CreateZExtOrTrunc(
-        ir_builder_->CreateLoad(state_ptr0, "state0"),
-        ir_builder_->getInt128Ty());
-    llvm::Value* state1 = ir_builder_->CreateShl(
-        ir_builder_->CreateZExtOrTrunc(
-            ir_builder_->CreateLoad(state_ptr1, "state1"),
-            ir_builder_->getInt128Ty()),
-        64);
-    llvm::Value* state = ir_builder_->CreateOr(state0, state1);
-    llvm::Value* updated = ir_builder_->CreateAdd(
-        ir_builder_->CreateMul(state, multiplier), increment);
-    ir_builder_->CreateStore(
-        ir_builder_->CreateTrunc(updated, ir_builder_->getInt64Ty()),
-        state_ptr0);
-    ir_builder_->CreateStore(
-        ir_builder_->CreateTrunc(ir_builder_->CreateLShr(updated, 64),
-                                 ir_builder_->getInt64Ty()),
-        state_ptr1);
-
-    return llvm_ir::CreateRor(
-        ir_builder_->CreateTrunc(
-            ir_builder_->CreateXor(state, ir_builder_->CreateLShr(state, 64)),
-            ir_builder_->getInt64Ty()),
-        ir_builder_->CreateTrunc(ir_builder_->CreateLShr(state, 122),
-                                 ir_builder_->getInt64Ty()),
-        ir_builder_);
-  };
+  llvm::Value* key_values[2];
+
+  // Use a module random number to initialize the key.
+  std::tie(key_values[0], key_values[1]) =
+      llvm_ir::SplitInt64ToInt32s(b, hlo_random_value);
+
+  // Prepare the constants used in the Philox RNG Algorithm.
+  llvm::Value* philoxW32A = b->getInt32(0x9E3779B9);
+  llvm::Value* philoxW32B = b->getInt32(0xBB67AE85);
+  llvm::Value* philoxM4xW32A = b->getInt32(0xD2511F53);
+  llvm::Value* philoxM4xW32B = b->getInt32(0xCD9E8D57);
+
+  // Compute the 128 bit value for the current sample by repeating the
+  // single round computation and key raising computation for ten times.
+  for (int round = 0; round < 10; ++round) {
+    // A single round of computation of the counter values is as follows:
+    //  MultiplyHighLow(kPhiloxM4x32A, counter[0], &lo0, &hi0);
+    //  MultiplyHighLow(kPhiloxM4x32B, counter[2], &lo1, &hi1);
+    //  counter[0] = hi1 ^ counter[1] ^ key[0];
+    //  counter[1] = lo1;
+    //  counter[2] = hi0 ^ counter[3] ^ key[1];
+    //  counter[3] = lo0;
+    llvm::Value* lo0;
+    llvm::Value* hi0;
+    std::tie(lo0, hi0) =
+        llvm_ir::UMulLowHigh32(b, philoxM4xW32A, counter_values[0]);
+    llvm::Value* lo1;
+    llvm::Value* hi1;
+    std::tie(lo1, hi1) =
+        llvm_ir::UMulLowHigh32(b, philoxM4xW32B, counter_values[2]);
+    counter_values[0] =
+        b->CreateXor(hi1, b->CreateXor(counter_values[1], key_values[0]));
+    counter_values[1] = lo1;
+    counter_values[2] =
+        b->CreateXor(hi0, b->CreateXor(counter_values[3], key_values[1]));
+    counter_values[3] = lo0;
+    key_values[0] = b->CreateAdd(key_values[0], philoxW32A);
+    key_values[1] = b->CreateAdd(key_values[1], philoxW32B);
+  }
 
-  auto get_next_uniform_float = [=]() {
-    return ir_builder_->CreateFDiv(
-        ir_builder_->CreateUIToFP(get_next_i64(), param_ir_type),
-        llvm::ConstantFP::get(param_ir_type, 0x1p64));
-  };
+  return counter_values;
+}
 
+}  // namespace
+
+// Implements the Philox algorithm to generate random numbers in parallel.
+// Salmon et al. SC 2011. Parallel random numbers: as easy as 1, 2, 3.
+//   http://www.thesalmons.org/john/random123/papers/random123sc11.pdf
+//
+// The paper presents a few variants of the Philox algorithm, we picked the
+// 4x32_10 version of the algorithm for the following reasons:
+//   . 4x32 uses 32-bit multiplication which is fast on GPUs.
+//   . The authors recommend the 10-round variant, and TensorFlow also uses it.
+//
+// Precondition: the RNG instruction is not fused.
+llvm_ir::ElementGenerator ElementalIrEmitter::MakePhiloxRngElementGenerator(
+    const HloInstruction* hlo,
+    const ElementalIrEmitter::HloToElementGeneratorMap& operand_to_generator)
+    const {
+  VLOG(3) << "Using philox RNG algorithm";
+  CHECK(!hlo->IsFused());
+  // A random number generated by the per module random number generator.
+  // This ensures that each RNG HLO generates a different random sequence.
+  llvm::Value* hlo_random_value = b_->getInt64(hlo->GetModule()->RandomNew64());
+  // A value specified by the configuration or generated by a global random
+  // number generator.
+  llvm::Value* global_random_number =
+      b_->getInt64(hlo_module_config_.seed() != 0 ? hlo_module_config_.seed()
+                                                  : GlobalRandomValue());
+
+  int elems_per_sample =
+      GetNumberOfElementsPerPhiloxRngSample(hlo->shape().element_type());
+
+  // Allocate stack storage for the 128 bit sample as four int32.
+  llvm::Type* int32_ty = b_->getInt32Ty();
+  llvm::Value* sample_address = llvm_ir::EmitAllocaAtFunctionEntryWithCount(
+      int32_ty, /*element_count=*/b_->getInt32(4), "sample", b_);
+
+  // Load the global state variable for the Philox RNG algorithm.
+  llvm::GlobalVariable* rng_state_ptr =
+      llvm_ir::GetOrCreateVariableForPhiloxRngState(module_, b_);
+  llvm::Value* rng_state = b_->CreateLoad(rng_state_ptr, "rng_state_value");
+
+  // Build and return the elemental IR generator to generate a random value for
+  // the element corresponding to the current thread.
+  //
+  // This elemental IR generator computes one sample with multiple random
+  // numbers but only returns one random number. As a result, neighboring
+  // threads may calculate the same sample unnecessarily. However, if the
+  // kernel containing the RNG hlo is unrolled, LLVM is able to optimize away
+  // the duplicated computation of the same sample. In particular, if the unroll
+  // factor is a multiplier of elems_per_sample, LLVM is able to completely
+  // remove such duplicated computation. If the unroll factor is a non-trivial
+  // factor of elems_per_sample, LLVM can only partially remove such duplicated
+  // computation.
   return [=](const llvm_ir::IrArray::Index& index) -> StatusOr<llvm::Value*> {
-    switch (hlo->random_distribution()) {
-      case RNG_UNIFORM: {
-        TF_ASSIGN_OR_RETURN(llvm::Value * p,
-                            operand_to_generator.at(hlo->operand(0))(index));
-        TF_ASSIGN_OR_RETURN(llvm::Value * q,
-                            operand_to_generator.at(hlo->operand(1))(index));
-        if (primitive_util::IsFloatingPointType(param_prim_type)) {
-          return ir_builder_->CreateFAdd(
-              ir_builder_->CreateFMul(ir_builder_->CreateFSub(q, p),
-                                      get_next_uniform_float()),
-              p);
-        } else {
-          auto r = ir_builder_->CreateSub(q, p);
-          auto leading_zeros = llvm_ir::EmitCallToIntrinsic(
-              llvm::Intrinsic::ctlz, {r, ir_builder_->getInt1(true)},
-              {param_ir_type}, ir_builder_);
-          auto in_block = ir_builder_->GetInsertBlock();
-
-          // A terminator should be present iff we're emitting code
-          // into the middle (as opposed to the end) of a basic block.
-          CHECK_EQ(ir_builder_->GetInsertPoint() == in_block->end(),
-                   in_block->getTerminator() == nullptr);
-
-          llvm::BasicBlock* body_block;
-          llvm::BasicBlock* out_block;
-
-          if (ir_builder_->GetInsertPoint() == in_block->end()) {
-            body_block = llvm_ir::CreateBasicBlock(
-                nullptr, IrName(hlo, "rng_body"), ir_builder_);
-            out_block = llvm_ir::CreateBasicBlock(
-                nullptr, IrName(hlo, "rng_out"), ir_builder_);
-            llvm::BranchInst::Create(body_block, in_block);
-          } else {
-            body_block = in_block->splitBasicBlock(
-                ir_builder_->GetInsertPoint(), "rng_body");
-            out_block = body_block->splitBasicBlock(
-                ir_builder_->GetInsertPoint(), "rng_out");
-            body_block->getTerminator()->eraseFromParent();
-          }
-
-          SetToFirstInsertPoint(body_block, ir_builder_);
-          auto random = ir_builder_->CreateAnd(
-              ir_builder_->CreateZExtOrTrunc(get_next_i64(), param_ir_type),
-              ir_builder_->CreateLShr(llvm::ConstantInt::get(param_ir_type, ~0),
-                                      leading_zeros));
-          llvm::BranchInst::Create(out_block, body_block,
-                                   ir_builder_->CreateICmpULT(random, r),
-                                   body_block);
-          SetToFirstInsertPoint(out_block, ir_builder_);
-          return ir_builder_->CreateAdd(
-              p, ir_builder_->CreateSelect(
-                     ir_builder_->CreateICmpEQ(p, q),
-                     llvm::ConstantInt::get(param_ir_type, 0), random));
-        }
-      }
-      case RNG_NORMAL: {
-        TF_ASSIGN_OR_RETURN(llvm::Value * m,
-                            operand_to_generator.at(hlo->operand(0))(index));
-        TF_ASSIGN_OR_RETURN(llvm::Value * s,
-                            operand_to_generator.at(hlo->operand(1))(index));
-        TF_ASSIGN_OR_RETURN(
-            llvm::Value * r,
-            EmitErfcInv(param_prim_type,
-                        ir_builder_->CreateFMul(
-                            llvm::ConstantFP::get(param_ir_type, 2.0),
-                            get_next_uniform_float())));
-        return ir_builder_->CreateFAdd(ir_builder_->CreateFMul(r, s), m);
-      }
-      default:
-        return InvalidArgument(
-            "unhandled distribution %s",
-            RandomDistribution_Name(hlo->random_distribution()).c_str());
+    llvm::Type* index_ty = index.GetType();
+    // Calculate the linear element index.
+    llvm::Value* elem_idx = index.linear();
+    if (elem_idx == nullptr) {
+      elem_idx = index.Linearize(AsInt64Slice(hlo->shape().dimensions()), b_);
+    }
+
+    // Calculate the index for the 128 bit sample and the offset of the current
+    // element within the sample.
+    llvm::Value* elems_per_sample_value =
+        llvm::ConstantInt::get(index_ty, elems_per_sample);
+    llvm::Value* sample_idx = b_->CreateUDiv(elem_idx, elems_per_sample_value);
+    llvm::Value* elem_offset = b_->CreateURem(elem_idx, elems_per_sample_value);
+
+    std::array<llvm::Value*, 4> counter_values = CalculateSampleValues(
+        sample_idx, hlo_random_value, global_random_number, rng_state, b_);
+
+    // Store the four counter_values into the sample_address alloca so we can
+    // load the elem_offset'th one below.
+    for (int idx = 0; idx < 4; ++idx) {
+      b_->CreateStore(counter_values[idx],
+                      b_->CreateInBoundsGEP(sample_address, b_->getInt32(idx)));
     }
+
+    llvm::Type* int64_ty = b_->getInt64Ty();
+    CHECK(elems_per_sample == 2 || elems_per_sample == 4);
+    llvm::Type* raw_value_ty = elems_per_sample == 2 ? int64_ty : int32_ty;
+    // Retrieve the raw value for the current element from the current sample.
+    llvm::Value* raw_elem_value = b_->CreateLoad(
+        b_->CreateInBoundsGEP(
+            b_->CreatePointerCast(sample_address, raw_value_ty->getPointerTo()),
+            elem_offset),
+        "raw_elem_value");
+
+    return ConvertValueForDistribution(hlo, operand_to_generator, index,
+                                       raw_elem_value);
   };
 }
 
@@ -1436,9 +1487,8 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalSelect(
   TF_ASSIGN_OR_RETURN(llvm::Value * on_false_value,
                       operand_to_generator.at(hlo->operand(2))(
                           ElementwiseSourceIndex(index, *hlo, 2)));
-  return ir_builder_->CreateSelect(
-      ir_builder_->CreateTrunc(pred_value, ir_builder_->getInt1Ty()),
-      on_true_value, on_false_value);
+  return b_->CreateSelect(b_->CreateTrunc(pred_value, b_->getInt1Ty()),
+                          on_true_value, on_false_value);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalClamp(
@@ -1474,64 +1524,62 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalConcatenate(
   const int64 concat_dim = hlo->dimensions(0);
   auto source_index = target_index;
 
-  llvm::BasicBlock* init_block = ir_builder_->GetInsertBlock();
+  llvm::BasicBlock* init_block = b_->GetInsertBlock();
 
   // A terminator should be present iff we're emitting code
   // into the middle (as opposed to the end) of a basic block.
-  CHECK_EQ(ir_builder_->GetInsertPoint() == init_block->end(),
+  CHECK_EQ(b_->GetInsertPoint() == init_block->end(),
            init_block->getTerminator() == nullptr);
 
   llvm::BasicBlock* exit_block;
-  if (ir_builder_->GetInsertPoint() == init_block->end()) {
+  if (b_->GetInsertPoint() == init_block->end()) {
     exit_block = llvm_ir::CreateBasicBlock(
-        /*insert_before=*/nullptr, IrName(hlo, "merge"), ir_builder_);
+        /*insert_before=*/nullptr, IrName(hlo, "merge"), b_);
   } else {
-    exit_block = init_block->splitBasicBlock(ir_builder_->GetInsertPoint(),
+    exit_block = init_block->splitBasicBlock(b_->GetInsertPoint(),
                                              AsStringRef(IrName(hlo, "merge")));
     init_block->getTerminator()->eraseFromParent();
   }
 
-  llvm_ir::SetToFirstInsertPoint(exit_block, ir_builder_);
-  llvm::PHINode* output = ir_builder_->CreatePHI(
+  llvm_ir::SetToFirstInsertPoint(exit_block, b_);
+  llvm::PHINode* output = b_->CreatePHI(
       llvm_ir::PrimitiveTypeToIrType(hlo->shape().element_type(), module_),
       hlo->operands().size());
-  auto prior_insert_point = ir_builder_->GetInsertPoint();
+  auto prior_insert_point = b_->GetInsertPoint();
 
-  ir_builder_->SetInsertPoint(init_block);
+  b_->SetInsertPoint(init_block);
 
   for (int64 operand_idx = 0; operand_idx < hlo->operand_count();
        ++operand_idx) {
     const HloInstruction* operand = hlo->operand(operand_idx);
     auto true_block = llvm_ir::CreateBasicBlock(
-        exit_block, StrCat("concat_index_from_operand", operand_idx),
-        ir_builder_);
+        exit_block, StrCat("concat_index_from_operand", operand_idx), b_);
     auto false_block = llvm_ir::CreateBasicBlock(
-        exit_block, StrCat("concat_index_not_from_operand", operand_idx),
-        ir_builder_);
+        exit_block, StrCat("concat_index_not_from_operand", operand_idx), b_);
     auto concat_dim_size =
         llvm::ConstantInt::get(source_index[concat_dim]->getType(),
                                operand->shape().dimensions(concat_dim));
-    ir_builder_->CreateCondBr(
-        ir_builder_->CreateICmpULT(source_index[concat_dim], concat_dim_size),
+    b_->CreateCondBr(
+        b_->CreateICmpULT(source_index[concat_dim], concat_dim_size),
         true_block, false_block);
 
     // Create the terminator of the true block before calling operand
     // generators, because they require non-degenerate basic blocks.
-    ir_builder_->SetInsertPoint(
+    b_->SetInsertPoint(
         llvm::BranchInst::Create(exit_block, /*InsertAtEnd=*/true_block));
     TF_ASSIGN_OR_RETURN(llvm::Value * value,
                         operand_to_generator.at(operand)(source_index));
-    output->addIncoming(value, ir_builder_->GetInsertBlock());
+    output->addIncoming(value, b_->GetInsertBlock());
 
     // Subtract the size of the concat dimension of the current operand
     // from the source index.
-    ir_builder_->SetInsertPoint(false_block);
+    b_->SetInsertPoint(false_block);
     source_index[concat_dim] =
-        ir_builder_->CreateSub(source_index[concat_dim], concat_dim_size);
+        b_->CreateSub(source_index[concat_dim], concat_dim_size);
   }
 
-  ir_builder_->CreateUnreachable();
-  ir_builder_->SetInsertPoint(exit_block, prior_insert_point);
+  b_->CreateUnreachable();
+  b_->SetInsertPoint(exit_block, prior_insert_point);
   return output;
 }
 
@@ -1555,22 +1603,16 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDynamicSlice(
 
     // Clamp the start index so that the sliced portion fits in the operand:
     // start_index = clamp(start_index, 0, operand_dim_size - output_dim_size)
+    start_index_value = b_->CreateSExtOrTrunc(start_index_value, index_type);
+    int64 largest_valid_start_index =
+        input_hlo->shape().dimensions(i) - hlo->shape().dimensions(i);
+    CHECK_GE(largest_valid_start_index, 0);
 
-    // TODO(b/74360564): This is implementation defined behavior, but is
-    // currently respected by all implementations. Change this if we ever decide
-    // to oficially document different behavior.
-    start_index_value =
-        ir_builder_->CreateSExtOrTrunc(start_index_value, index_type);
-    llvm::Value* operand_dim_size =
-        index_typed_const(input_hlo->shape().dimensions(i));
-    llvm::Value* output_dim_size =
-        index_typed_const(hlo->shape().dimensions(i));
-
+    bool is_signed = ShapeUtil::ElementIsSigned(hlo->operand(1)->shape());
     start_index_value = EmitIntegralMin(
-        ir_builder_->CreateSub(operand_dim_size, output_dim_size),
-        EmitIntegralMax(index_typed_const(0), start_index_value,
-                        /*is_signed=*/true),
-        /*is_signed=*/true);
+        index_typed_const(largest_valid_start_index),
+        EmitIntegralMax(index_typed_const(0), start_index_value, is_signed),
+        is_signed);
 
     start_index_value->setName(
         AsStringRef(IrName(hlo, StrCat("start_idx", i))));
@@ -1581,7 +1623,7 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDynamicSlice(
   for (int64 i = 0; i < rank; ++i) {
     // Emit IR which computes:
     //   input_index = start_index + offset_index
-    input_index[i] = ir_builder_->CreateAdd(slice_start_index[i], index[i]);
+    input_index[i] = b_->CreateAdd(slice_start_index[i], index[i]);
   }
   return operand_to_generator.at(input_hlo)(input_index);
 }
@@ -1603,19 +1645,22 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalGather(
 
   llvm::Type* index_type = index.GetType();
   // This is the index into `operand` that holds the element we want to
-  // generate.  This index "unsafe" as in the components in here may be
-  // out of bounds.
-  IrArray::Index unsafe_operand_index(index_type);
-
-  // First copy in the window indices to unsafe_operand_index.
-  for (int64 i = 0, e = operand_shape.dimensions_size(),
-             unsafe_operand_index_dim = 0;
+  // generate.
+  IrArray::Index operand_index(index_type);
+
+  // First copy in the window indices to operand_index. Also collect a mapping
+  // from operand dimension to output window dimension. Elided window dimensions
+  // map to -1.
+  std::vector<int64> operand_to_output_dim(operand_shape.dimensions_size(), -1);
+  for (int64 i = 0, e = operand_shape.dimensions_size(), operand_index_dim = 0;
        i < e; i++) {
     if (c_binary_search(dim_numbers.elided_window_dims(), i)) {
-      unsafe_operand_index.push_back(index.GetConstantWithIndexType(0));
+      operand_index.push_back(index.GetConstantWithIndexType(0));
     } else {
-      unsafe_operand_index.push_back(
-          index[dim_numbers.output_window_dims(unsafe_operand_index_dim++)]);
+      int64 output_window_dim =
+          dim_numbers.output_window_dims(operand_index_dim++);
+      operand_to_output_dim[i] = output_window_dim;
+      operand_index.push_back(index[output_window_dim]);
     }
   }
 
@@ -1634,20 +1679,40 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalGather(
     }
   }
 
-  auto add_to_unsafe_operand_index = [&](llvm::Value* index_component,
-                                         int64 dim) {
+  auto add_to_operand_index = [&](llvm::Value* index_component, int64 dim) {
     llvm::Value* gather_dim_component_extended =
-        ir_builder_->CreateSExtOrTrunc(index_component, index_type);
-    unsafe_operand_index[dim_numbers.gather_dims_to_operand_dims(dim)] =
-        ir_builder_->CreateAdd(
-            unsafe_operand_index[dim_numbers.gather_dims_to_operand_dims(dim)],
-            gather_dim_component_extended);
+        b_->CreateSExtOrTrunc(index_component, index_type);
+    int64 operand_dim = dim_numbers.gather_dims_to_operand_dims(dim);
+    int64 output_dim = operand_to_output_dim[operand_dim];
+    // If 'output_dim' is -1, it means 'operand_dim' is an elided window dim.
+    // This means we set the iteration index to 0, so for the purpose of the
+    // following calculations we can consider the output dimension size to be 1.
+    int64 output_dim_size =
+        output_dim == -1 ? 1 : output_shape.dimensions(output_dim);
+    int64 largest_valid_start_index =
+        operand_shape.dimensions(operand_dim) - output_dim_size;
+    CHECK_GE(largest_valid_start_index, 0);
+
+    // Clamp the gather index so that the gather region fits in the operand.
+    // gather_dim_component_extended_inbound =
+    //     clamp(gather_dim_component_extended, 0, largest_valid_start_index);
+
+    // TODO(b/111078873): This is implementation defined behavior.
+    bool is_signed = ShapeUtil::ElementIsSigned(indices_shape);
+    auto gather_dim_component_extended_inbound = EmitIntegralMin(
+        index.GetConstantWithIndexType(largest_valid_start_index),
+        EmitIntegralMax(index.GetConstantWithIndexType(0),
+                        gather_dim_component_extended, is_signed),
+        is_signed);
+
+    operand_index[operand_dim] = b_->CreateAdd(
+        operand_index[operand_dim], gather_dim_component_extended_inbound);
   };
 
   if (indices_shape.dimensions_size() == dim_numbers.index_vector_dim()) {
     TF_ASSIGN_OR_RETURN(llvm::Value * gather_dim_component,
                         indices_generator(gather_index_index));
-    add_to_unsafe_operand_index(gather_dim_component, 0);
+    add_to_operand_index(gather_dim_component, 0);
   } else {
     int64 index_vector_size =
         indices_shape.dimensions(dim_numbers.index_vector_dim());
@@ -1656,18 +1721,10 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalGather(
           index.GetConstantWithIndexType(i);
       TF_ASSIGN_OR_RETURN(llvm::Value * gather_dim_component,
                           indices_generator(gather_index_index));
-      add_to_unsafe_operand_index(gather_dim_component, i);
+      add_to_operand_index(gather_dim_component, i);
     }
   }
-
-  IrArray::Index safe_operand_index(index_type);
-  for (int64 i = 0, e = unsafe_operand_index.size(); i < e; i++) {
-    safe_operand_index.push_back(ir_builder_->CreateURem(
-        unsafe_operand_index[i],
-        index.GetConstantWithIndexType(operand_shape.dimensions(i))));
-  }
-
-  return operand_generator(safe_operand_index);
+  return operand_generator(operand_index);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDynamicUpdateSlice(
@@ -1683,7 +1740,7 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDynamicUpdateSlice(
   llvm_ir::IrArray::Index slice_limit_index(index.GetType(), rank);
   // Slice intersection gathers (ANDs) conditions on all ranks for which
   // 'input' is set to 'update'
-  llvm::Value* slice_intersection = ir_builder_->getTrue();
+  llvm::Value* slice_intersection = b_->getTrue();
 
   for (int64 i = 0; i < rank; ++i) {
     llvm::Type* index_type = index[0]->getType();
@@ -1696,36 +1753,29 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDynamicUpdateSlice(
 
     // Clamp the start index so that the update region fits in the operand.
     // start_index = clamp(start_index, 0, input_dim_size - update_dim_size)
-
-    // TODO(b/74360564): This is implementation defined behavior, but is
-    // currently respected by all implementations. Change this if we ever decide
-    // to oficially document different behavior.
-    start_index_value =
-        ir_builder_->CreateSExtOrTrunc(start_index_value, index_type);
-    llvm::Value* input_dim_size =
-        index_typed_const(input_hlo->shape().dimensions(i));
+    start_index_value = b_->CreateSExtOrTrunc(start_index_value, index_type);
     llvm::Value* update_dim_size =
         index_typed_const(update_hlo->shape().dimensions(i));
+    int64 largest_valid_start_index =
+        input_hlo->shape().dimensions(i) - update_hlo->shape().dimensions(i);
+    CHECK_GE(largest_valid_start_index, 0);
 
-    start_index_value =
-        EmitIntegralMin(ir_builder_->CreateSub(input_dim_size, update_dim_size),
-                        EmitIntegralMax(index_typed_const(0), start_index_value,
-                                        /*is_signed=*/true),
-                        /*is_signed=*/true);
+    bool is_signed = ShapeUtil::ElementIsSigned(start_hlo->shape());
+    start_index_value = EmitIntegralMin(
+        index_typed_const(largest_valid_start_index),
+        EmitIntegralMax(index_typed_const(0), start_index_value, is_signed),
+        is_signed);
 
     start_index_value->setName(
         AsStringRef(IrName(hlo, StrCat("start_idx", i))));
     slice_start_index[i] = start_index_value;
-    slice_limit_index[i] =
-        ir_builder_->CreateAdd(slice_start_index[i], update_dim_size);
+    slice_limit_index[i] = b_->CreateAdd(slice_start_index[i], update_dim_size);
 
-    slice_intersection = ir_builder_->CreateAnd(
-        slice_intersection,
-        ir_builder_->CreateICmpSGE(index[i], slice_start_index[i]),
+    slice_intersection = b_->CreateAnd(
+        slice_intersection, b_->CreateICmpSGE(index[i], slice_start_index[i]),
         "slice_intersection");
-    slice_intersection = ir_builder_->CreateAnd(
-        slice_intersection,
-        ir_builder_->CreateICmpSLT(index[i], slice_limit_index[i]),
+    slice_intersection = b_->CreateAnd(
+        slice_intersection, b_->CreateICmpSLT(index[i], slice_limit_index[i]),
         "slice_intersection");
   }
 
@@ -1734,29 +1784,29 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDynamicUpdateSlice(
   // else                    -> return data from 'input'.
   llvm::Value* ret_value_addr = llvm_ir::EmitAllocaAtFunctionEntry(
       llvm_ir::PrimitiveTypeToIrType(hlo->shape().element_type(), module_),
-      "ret_value_addr", ir_builder_);
-  llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
-      slice_intersection, "slice_intersection", ir_builder_);
+      "ret_value_addr", b_);
+  llvm_ir::LlvmIfData if_data =
+      llvm_ir::EmitIfThenElse(slice_intersection, "slice_intersection", b_);
 
   // Handle true BB (return data from 'update')
-  SetToFirstInsertPoint(if_data.true_block, ir_builder_);
+  SetToFirstInsertPoint(if_data.true_block, b_);
   // Compute update index for intersection case.
   llvm_ir::IrArray::Index update_index(index.GetType(), rank);
   for (int64 i = 0; i < rank; ++i) {
-    update_index[i] = ir_builder_->CreateSub(index[i], slice_start_index[i]);
+    update_index[i] = b_->CreateSub(index[i], slice_start_index[i]);
   }
   TF_ASSIGN_OR_RETURN(llvm::Value * true_value,
                       operand_to_generator.at(update_hlo)(update_index));
-  ir_builder_->CreateStore(true_value, ret_value_addr);
+  b_->CreateStore(true_value, ret_value_addr);
 
   // Handle false BB (return data from 'input')
-  SetToFirstInsertPoint(if_data.false_block, ir_builder_);
+  SetToFirstInsertPoint(if_data.false_block, b_);
   TF_ASSIGN_OR_RETURN(llvm::Value * false_value,
                       operand_to_generator.at(input_hlo)(index));
-  ir_builder_->CreateStore(false_value, ret_value_addr);
+  b_->CreateStore(false_value, ret_value_addr);
 
-  SetToFirstInsertPoint(if_data.after_block, ir_builder_);
-  return ir_builder_->CreateLoad(ret_value_addr);
+  SetToFirstInsertPoint(if_data.after_block, b_);
+  return b_->CreateLoad(ret_value_addr);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalPad(
@@ -1764,29 +1814,29 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalPad(
     const ElementalIrEmitter::HloToElementGeneratorMap& operand_to_generator,
     const llvm_ir::IrArray::Index& padded_index) const {
   auto index = padded_index;
-  llvm::Value* in_bounds = ir_builder_->getTrue();
+  llvm::Value* in_bounds = b_->getTrue();
   for (size_t i = 0; i < index.size(); ++i) {
     auto index_typed_const = [=](int64 n) {
       return llvm::ConstantInt::get(index[i]->getType(), n);
     };
     const auto& pad_dim = hlo->padding_config().dimensions(i);
-    index[i] = ir_builder_->CreateSub(
-        index[i], index_typed_const(pad_dim.edge_padding_low()));
-    in_bounds = ir_builder_->CreateAnd(
-        in_bounds, ir_builder_->CreateICmpSGE(index[i], index_typed_const(0)),
-        "in_bounds");
-    in_bounds = ir_builder_->CreateAnd(
+    index[i] =
+        b_->CreateSub(index[i], index_typed_const(pad_dim.edge_padding_low()));
+    in_bounds = b_->CreateAnd(in_bounds,
+                              b_->CreateICmpSGE(index[i], index_typed_const(0)),
+                              "in_bounds");
+    in_bounds = b_->CreateAnd(
         in_bounds,
-        ir_builder_->CreateICmpEQ(
+        b_->CreateICmpEQ(
             index_typed_const(0),
-            ir_builder_->CreateURem(
-                index[i], index_typed_const(pad_dim.interior_padding() + 1))),
+            b_->CreateURem(index[i],
+                           index_typed_const(pad_dim.interior_padding() + 1))),
         "in_bounds");
-    index[i] = ir_builder_->CreateSDiv(
+    index[i] = b_->CreateSDiv(
         index[i], index_typed_const(pad_dim.interior_padding() + 1));
-    in_bounds = ir_builder_->CreateAnd(
+    in_bounds = b_->CreateAnd(
         in_bounds,
-        ir_builder_->CreateICmpSLT(
+        b_->CreateICmpSLT(
             index[i],
             index_typed_const(hlo->operand(0)->shape().dimensions(i))),
         "in_bounds");
@@ -1799,26 +1849,26 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalPad(
   // }
   llvm::Value* ret_value_addr = llvm_ir::EmitAllocaAtFunctionEntry(
       llvm_ir::PrimitiveTypeToIrType(hlo->shape().element_type(), module_),
-      "pad_result_addr", ir_builder_);
+      "pad_result_addr", b_);
   llvm_ir::LlvmIfData if_data =
-      llvm_ir::EmitIfThenElse(in_bounds, "in_bounds", ir_builder_);
-  SetToFirstInsertPoint(if_data.true_block, ir_builder_);
+      llvm_ir::EmitIfThenElse(in_bounds, "in_bounds", b_);
+  SetToFirstInsertPoint(if_data.true_block, b_);
   TF_ASSIGN_OR_RETURN(llvm::Value * operand_value,
                       operand_to_generator.at(hlo->operand(0))(index));
-  ir_builder_->CreateStore(operand_value, ret_value_addr);
+  b_->CreateStore(operand_value, ret_value_addr);
 
-  SetToFirstInsertPoint(if_data.false_block, ir_builder_);
+  SetToFirstInsertPoint(if_data.false_block, b_);
   TF_ASSIGN_OR_RETURN(llvm::Value * padding_value,
                       operand_to_generator.at(hlo->operand(1))(
                           IrArray::Index(index.GetType())));
-  ir_builder_->CreateStore(padding_value, ret_value_addr);
+  b_->CreateStore(padding_value, ret_value_addr);
 
-  SetToFirstInsertPoint(if_data.after_block, ir_builder_);
+  SetToFirstInsertPoint(if_data.after_block, b_);
   // Don't create phi(operand_value, padding_value) here, because invoking
   // operand_to_generator may create new basic blocks, making the parent
   // of operand_value or padding_value no longer a predecessor of
   // if_data.after_block.
-  return ir_builder_->CreateLoad(ret_value_addr);
+  return b_->CreateLoad(ret_value_addr);
 }
 
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDot(
@@ -1842,21 +1892,20 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDot(
     return llvm::ConstantInt::get(index_type, c);
   };
 
-  std::unique_ptr<llvm_ir::ForLoop> inner_loop =
-      llvm_ir::ForLoop::EmitForLoop(IrName(hlo, "inner"), index_typed_const(0),
-                                    index_typed_const(contracted_dim_size),
-                                    index_typed_const(1), ir_builder_);
+  std::unique_ptr<llvm_ir::ForLoop> inner_loop = llvm_ir::ForLoop::EmitForLoop(
+      IrName(hlo, "inner"), index_typed_const(0),
+      index_typed_const(contracted_dim_size), index_typed_const(1), b_);
 
-  SetToFirstInsertPoint(inner_loop->GetPreheaderBasicBlock(), ir_builder_);
+  SetToFirstInsertPoint(inner_loop->GetPreheaderBasicBlock(), b_);
   PrimitiveType primitive_type = hlo->shape().element_type();
   llvm::Type* primitive_type_llvm =
       llvm_ir::PrimitiveTypeToIrType(primitive_type, module_);
-  llvm::Value* accumulator_alloca = llvm_ir::EmitAllocaAtFunctionEntry(
-      primitive_type_llvm, "dot_acc", ir_builder_);
-  ir_builder_->CreateStore(llvm::Constant::getNullValue(primitive_type_llvm),
-                           accumulator_alloca);
+  llvm::Value* accumulator_alloca =
+      llvm_ir::EmitAllocaAtFunctionEntry(primitive_type_llvm, "dot_acc", b_);
+  b_->CreateStore(llvm::Constant::getNullValue(primitive_type_llvm),
+                  accumulator_alloca);
 
-  SetToFirstInsertPoint(inner_loop->GetBodyBasicBlock(), ir_builder_);
+  SetToFirstInsertPoint(inner_loop->GetBodyBasicBlock(), b_);
 
   // This is the inner reduction loop for a dot operation that produces
   // one element in the output.  If the operands to the dot operation have
@@ -1876,43 +1925,36 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitElementalDot(
   }
   rhs_index.InsertAt(rhs_contracting_dim, inner_loop->GetIndVarValue());
 
-  llvm::Value* current_accumulator =
-      ir_builder_->CreateLoad(accumulator_alloca);
+  llvm::Value* current_accumulator = b_->CreateLoad(accumulator_alloca);
   TF_ASSIGN_OR_RETURN(llvm::Value * lhs_value, lhs_generator(lhs_index));
   TF_ASSIGN_OR_RETURN(llvm::Value * rhs_value, rhs_generator(rhs_index));
   llvm::Value* next_accumulator;
   if (primitive_util::IsComplexType(primitive_type)) {
-    llvm::Value* product_real = ir_builder_->CreateFSub(
-        ir_builder_->CreateFMul(EmitExtractReal(lhs_value),
-                                EmitExtractReal(rhs_value)),
-        ir_builder_->CreateFMul(EmitExtractImag(lhs_value),
-                                EmitExtractImag(rhs_value)));
-    llvm::Value* product_imag = ir_builder_->CreateFAdd(
-        ir_builder_->CreateFMul(EmitExtractReal(lhs_value),
-                                EmitExtractImag(rhs_value)),
-        ir_builder_->CreateFMul(EmitExtractImag(lhs_value),
-                                EmitExtractReal(rhs_value)));
-    next_accumulator = ir_builder_->CreateInsertValue(
+    llvm::Value* product_real = b_->CreateFSub(
+        b_->CreateFMul(EmitExtractReal(lhs_value), EmitExtractReal(rhs_value)),
+        b_->CreateFMul(EmitExtractImag(lhs_value), EmitExtractImag(rhs_value)));
+    llvm::Value* product_imag = b_->CreateFAdd(
+        b_->CreateFMul(EmitExtractReal(lhs_value), EmitExtractImag(rhs_value)),
+        b_->CreateFMul(EmitExtractImag(lhs_value), EmitExtractReal(rhs_value)));
+    next_accumulator = b_->CreateInsertValue(
         current_accumulator,
-        ir_builder_->CreateFAdd(EmitExtractReal(current_accumulator),
-                                product_real),
+        b_->CreateFAdd(EmitExtractReal(current_accumulator), product_real),
         {0});
-    next_accumulator = ir_builder_->CreateInsertValue(
+    next_accumulator = b_->CreateInsertValue(
         next_accumulator,
-        ir_builder_->CreateFAdd(EmitExtractImag(current_accumulator),
-                                product_imag),
+        b_->CreateFAdd(EmitExtractImag(current_accumulator), product_imag),
         {1});
   } else if (primitive_util::IsFloatingPointType(primitive_type)) {
-    next_accumulator = ir_builder_->CreateFAdd(
-        current_accumulator, ir_builder_->CreateFMul(lhs_value, rhs_value));
+    next_accumulator = b_->CreateFAdd(current_accumulator,
+                                      b_->CreateFMul(lhs_value, rhs_value));
   } else {
-    next_accumulator = ir_builder_->CreateAdd(
-        current_accumulator, ir_builder_->CreateMul(lhs_value, rhs_value));
+    next_accumulator =
+        b_->CreateAdd(current_accumulator, b_->CreateMul(lhs_value, rhs_value));
   }
-  ir_builder_->CreateStore(next_accumulator, accumulator_alloca);
+  b_->CreateStore(next_accumulator, accumulator_alloca);
 
-  SetToFirstInsertPoint(inner_loop->GetExitBasicBlock(), ir_builder_);
-  return ir_builder_->CreateLoad(accumulator_alloca);
+  SetToFirstInsertPoint(inner_loop->GetExitBasicBlock(), b_);
+  return b_->CreateLoad(accumulator_alloca);
 }
 
 llvm_ir::ElementGenerator ElementalIrEmitter::MakeElementGenerator(
@@ -2012,7 +2054,7 @@ llvm_ir::ElementGenerator ElementalIrEmitter::MakeElementGenerator(
         const HloInstruction* operand = hlo->operand(0);
         auto source_index = target_index;
         for (int64 dim : hlo->dimensions()) {
-          source_index[dim] = ir_builder_->CreateSub(
+          source_index[dim] = b_->CreateSub(
               llvm::ConstantInt::get(target_index[dim]->getType(),
                                      hlo->shape().dimensions(dim) - 1),
               target_index[dim]);
@@ -2025,16 +2067,16 @@ llvm_ir::ElementGenerator ElementalIrEmitter::MakeElementGenerator(
         const HloInstruction* operand = hlo->operand(0);
         // The `dimensions` member of the broadcast instruction maps from
         // input dimensions to output dimensions.
-        return operand_to_generator.at(
-            operand)(target_index.SourceIndexOfBroadcast(
-            hlo->shape(), operand->shape(), hlo->dimensions(), ir_builder_));
+        return operand_to_generator.at(operand)(
+            target_index.SourceIndexOfBroadcast(hlo->shape(), operand->shape(),
+                                                hlo->dimensions(), b_));
       };
     case HloOpcode::kSlice:
       return [this, hlo, &operand_to_generator](
                  const IrArray::Index& index) -> StatusOr<llvm::Value*> {
         IrArray::Index sliced_index = index.SourceIndexOfSlice(
             /*shape=*/hlo->shape(), /*starts=*/hlo->slice_starts(),
-            /*strides=*/hlo->slice_strides(), /*builder=*/ir_builder_);
+            /*strides=*/hlo->slice_strides(), /*builder=*/b_);
         return operand_to_generator.at(hlo->operand(0))(sliced_index);
       };
     case HloOpcode::kDynamicSlice:
@@ -2059,27 +2101,26 @@ llvm_ir::ElementGenerator ElementalIrEmitter::MakeElementGenerator(
                ShapeUtil::ElementsIn(hlo->operand(0)->shape()));
       return [this, hlo, &operand_to_generator](const IrArray::Index& index) {
         const HloInstruction* operand = hlo->operand(0);
-        return operand_to_generator.at(operand)(index.SourceIndexOfBitcast(
-            hlo->shape(), operand->shape(), ir_builder_));
+        return operand_to_generator.at(operand)(
+            index.SourceIndexOfBitcast(hlo->shape(), operand->shape(), b_));
       };
     case HloOpcode::kReshape:
       CHECK_EQ(ShapeUtil::ElementsIn(hlo->shape()),
                ShapeUtil::ElementsIn(hlo->operand(0)->shape()));
       return [this, hlo, &operand_to_generator](const IrArray::Index& index) {
         const HloInstruction* operand = hlo->operand(0);
-        return operand_to_generator.at(operand)(index.SourceIndexOfReshape(
-            hlo->shape(), operand->shape(), ir_builder_));
+        return operand_to_generator.at(operand)(
+            index.SourceIndexOfReshape(hlo->shape(), operand->shape(), b_));
       };
     case HloOpcode::kTranspose:
       return [this, hlo,
               &operand_to_generator](const IrArray::Index& target_index) {
         return operand_to_generator.at(hlo->operand(0))(
             target_index.SourceIndexOfTranspose(
-                hlo->shape(), hlo->operand(0)->shape(), hlo->dimensions(),
-                ir_builder_));
+                hlo->shape(), hlo->operand(0)->shape(), hlo->dimensions(), b_));
       };
     case HloOpcode::kRng:
-      return MakeRngElementGenerator(hlo, operand_to_generator);
+      return MakePhiloxRngElementGenerator(hlo, operand_to_generator);
     case HloOpcode::kPad:
       return [this, hlo, &operand_to_generator](
                  const IrArray::Index& padded_index) -> StatusOr<llvm::Value*> {
@@ -2101,11 +2142,11 @@ llvm_ir::ElementGenerator ElementalIrEmitter::MakeElementGenerator(
 }
 
 llvm::Value* ElementalIrEmitter::EmitExtractReal(llvm::Value* value) const {
-  return ir_builder_->CreateExtractValue(value, {0});
+  return b_->CreateExtractValue(value, {0});
 }
 
 llvm::Value* ElementalIrEmitter::EmitExtractImag(llvm::Value* value) const {
-  return ir_builder_->CreateExtractValue(value, {1});
+  return b_->CreateExtractValue(value, {1});
 }
 
 llvm::Value* ElementalIrEmitter::EmitComposeComplex(const HloInstruction* op,
@@ -2113,10 +2154,10 @@ llvm::Value* ElementalIrEmitter::EmitComposeComplex(const HloInstruction* op,
                                                     llvm::Value* imag) const {
   auto cplx_type =
       llvm_ir::PrimitiveTypeToIrType(op->shape().element_type(), module_);
-  auto complex = ir_builder_->CreateInsertValue(
+  auto complex = b_->CreateInsertValue(
       llvm::ConstantAggregateZero::get(cplx_type), real, {0});
   if (imag != nullptr) {
-    complex = ir_builder_->CreateInsertValue(complex, imag, {1});
+    complex = b_->CreateInsertValue(complex, imag, {1});
   }
   return complex;
 }
diff --git a/tensorflow/compiler/xla/service/elemental_ir_emitter.h b/tensorflow/compiler/xla/service/elemental_ir_emitter.h
index d199473374..fcb34557a5 100644
--- a/tensorflow/compiler/xla/service/elemental_ir_emitter.h
+++ b/tensorflow/compiler/xla/service/elemental_ir_emitter.h
@@ -34,10 +34,8 @@ class ElementalIrEmitter {
       std::unordered_map<const HloInstruction*, llvm_ir::ElementGenerator>;
 
   ElementalIrEmitter(const HloModuleConfig& hlo_module_config,
-                     llvm::Module* module, llvm::IRBuilder<>* ir_builder)
-      : ir_builder_(ir_builder),
-        module_(module),
-        hlo_module_config_(hlo_module_config) {}
+                     llvm::Module* module, llvm::IRBuilder<>* b)
+      : b_(b), module_(module), hlo_module_config_(hlo_module_config) {}
 
   virtual ~ElementalIrEmitter() = default;
 
@@ -54,7 +52,7 @@ class ElementalIrEmitter {
       const HloInstruction* hlo,
       const HloToElementGeneratorMap& operand_to_generator) const;
 
-  llvm::IRBuilder<>* ir_builder() const { return ir_builder_; }
+  llvm::IRBuilder<>* b() const { return b_; }
   llvm::Module* module() const { return module_; }
 
  protected:
@@ -144,9 +142,7 @@ class ElementalIrEmitter {
       int64 operand_no) const;
 
   // Identifier of the thread unique among all threads on the device
-  virtual llvm::Value* EmitThreadId() const {
-    return ir_builder_->getIntN(128, 0);
-  }
+  virtual llvm::Value* EmitThreadId() const { return b_->getIntN(128, 0); }
 
   StatusOr<llvm::Value*> EmitElementalSelect(
       const HloInstruction* hlo,
@@ -188,7 +184,7 @@ class ElementalIrEmitter {
       const HloToElementGeneratorMap& operand_to_generator,
       const llvm_ir::IrArray::Index& dot_result_index) const;
 
-  llvm::IRBuilder<>* const ir_builder_;
+  llvm::IRBuilder<>* const b_;
 
   llvm::Module* module_;
 
@@ -197,10 +193,17 @@ class ElementalIrEmitter {
   const HloModuleConfig& hlo_module_config_;
 
  private:
-  // Returns a ElementGenerator for a RNG HloInstruction.
-  llvm_ir::ElementGenerator MakeRngElementGenerator(
+  // Returns a ElementGenerator for an RNG HloInstruction using the Philox
+  // random number generation algorithm.
+  llvm_ir::ElementGenerator MakePhiloxRngElementGenerator(
       const HloInstruction* hlo,
       const HloToElementGeneratorMap& operand_to_generator) const;
+  // Converts the raw value generated by a random number generation algorithm
+  // to the distribution requested by the RNG HloInstruction.
+  StatusOr<llvm::Value*> ConvertValueForDistribution(
+      const HloInstruction* hlo,
+      const ElementalIrEmitter::HloToElementGeneratorMap& operand_to_generator,
+      const llvm_ir::IrArray::Index& index, llvm::Value* raw_value) const;
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc b/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc
index 8980d43033..addb016b04 100644
--- a/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc
+++ b/tensorflow/compiler/xla/service/elemental_ir_emitter_test.cc
@@ -57,8 +57,8 @@ ENTRY main {
 }
 )";
 
-  std::unique_ptr<Literal> lhs = Literal::CreateR3<int32>({{{1}, {2}}});
-  std::unique_ptr<Literal> rhs = Literal::CreateR3<int32>({{{3}, {4}}});
+  std::unique_ptr<Literal> lhs = LiteralUtil::CreateR3<int32>({{{1}, {2}}});
+  std::unique_ptr<Literal> rhs = LiteralUtil::CreateR3<int32>({{{3}, {4}}});
   RunTest(hlo_text, {lhs.get(), rhs.get()});
 }
 }  // namespace
diff --git a/tensorflow/compiler/xla/service/execution_tracker.cc b/tensorflow/compiler/xla/service/execution_tracker.cc
index 6794cfe297..228c3fac95 100644
--- a/tensorflow/compiler/xla/service/execution_tracker.cc
+++ b/tensorflow/compiler/xla/service/execution_tracker.cc
@@ -25,7 +25,7 @@ limitations under the License.
 namespace xla {
 
 AsyncExecution::AsyncExecution(Backend* backend,
-                               std::vector<Backend::StreamPtr> streams,
+                               std::vector<StreamPool::Ptr> streams,
                                const ExecutionProfile& profile,
                                GlobalDataHandle result)
     : backend_(CHECK_NOTNULL(backend)),
@@ -46,9 +46,10 @@ Status AsyncExecution::BlockUntilDone() const {
 
 ExecutionTracker::ExecutionTracker() : next_handle_(1) {}
 
-ExecutionHandle ExecutionTracker::Register(
-    Backend* backend, std::vector<Backend::StreamPtr> streams,
-    const ExecutionProfile& profile, GlobalDataHandle result) {
+ExecutionHandle ExecutionTracker::Register(Backend* backend,
+                                           std::vector<StreamPool::Ptr> streams,
+                                           const ExecutionProfile& profile,
+                                           GlobalDataHandle result) {
   tensorflow::mutex_lock lock(execution_mutex_);
   int64 handle = next_handle_++;
   auto inserted = handle_to_execution_.emplace(
diff --git a/tensorflow/compiler/xla/service/execution_tracker.h b/tensorflow/compiler/xla/service/execution_tracker.h
index 4458152dd9..4e9b9f883e 100644
--- a/tensorflow/compiler/xla/service/execution_tracker.h
+++ b/tensorflow/compiler/xla/service/execution_tracker.h
@@ -22,7 +22,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/executable_run_options.h"
 #include "tensorflow/compiler/xla/service/backend.h"
-#include "tensorflow/compiler/xla/service/pool.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -40,7 +40,7 @@ namespace xla {
 // the stream when destructed.
 class AsyncExecution {
  public:
-  AsyncExecution(Backend* backend, std::vector<Backend::StreamPtr> streams,
+  AsyncExecution(Backend* backend, std::vector<StreamPool::Ptr> streams,
                  const ExecutionProfile& profile, GlobalDataHandle result);
 
   Status BlockUntilDone() const;
@@ -54,7 +54,7 @@ class AsyncExecution {
   Backend* backend_;
 
   // Stream on which the execution is launched.
-  std::vector<Backend::StreamPtr> streams_;
+  std::vector<StreamPool::Ptr> streams_;
 
   // Profile object of the execution to be returned to the user.
   ExecutionProfile profile_;
@@ -72,7 +72,7 @@ class ExecutionTracker {
   // Registers an execution with its backend, streams, and data handle to the
   // execution result. Returns a handle for the registered execution.
   ExecutionHandle Register(Backend* backend,
-                           std::vector<Backend::StreamPtr> stream,
+                           std::vector<StreamPool::Ptr> stream,
                            const ExecutionProfile& profile,
                            GlobalDataHandle data);
 
diff --git a/tensorflow/compiler/xla/service/flatten_call_graph_test.cc b/tensorflow/compiler/xla/service/flatten_call_graph_test.cc
index d3854b40de..8f6608241e 100644
--- a/tensorflow/compiler/xla/service/flatten_call_graph_test.cc
+++ b/tensorflow/compiler/xla/service/flatten_call_graph_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/call_graph.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -80,7 +80,7 @@ class FlattenCallGraphTest : public HloTestBase {
     HloInstruction* param0 = builder.AddInstruction(
         HloInstruction::CreateParameter(0, kScalarShape, "param0"));
     HloInstruction* zero = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
     builder.AddInstruction(HloInstruction::CreateBinary(
         ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt, param0, zero));
     return builder.Build();
@@ -157,7 +157,7 @@ TEST_F(FlattenCallGraphTest, SharedWhileConditionAndBody) {
         builder.AddInstruction(HloInstruction::CreateParameter(
             0, ShapeUtil::MakeShape(PRED, {}), "param0"));
     HloInstruction* false_constant = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
     builder.AddInstruction(
         HloInstruction::CreateBinary(ShapeUtil::MakeShape(PRED, {}),
                                      HloOpcode::kEq, param0, false_constant));
@@ -168,7 +168,7 @@ TEST_F(FlattenCallGraphTest, SharedWhileConditionAndBody) {
   {
     HloComputation::Builder builder(TestName() + ".entry");
     HloInstruction* false_constant = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
     builder.AddInstruction(HloInstruction::CreateWhile(
         ShapeUtil::MakeShape(PRED, {}), cond_computation, cond_computation,
         false_constant));
@@ -232,11 +232,11 @@ TEST_F(FlattenCallGraphTest, FlattenCallsInConditional) {
   // computation in the true and false branch.
   HloComputation::Builder builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(56.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(56.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(12.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(12.0f)));
   builder.AddInstruction(HloInstruction::CreateConditional(
       kScalarShape, pred, constant1, sub_computation, constant2,
       sub_computation));
diff --git a/tensorflow/compiler/xla/service/gather_expander.cc b/tensorflow/compiler/xla/service/gather_expander.cc
index 7cd2c9c136..e3a42d0d06 100644
--- a/tensorflow/compiler/xla/service/gather_expander.cc
+++ b/tensorflow/compiler/xla/service/gather_expander.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include <utility>
 
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/gather_expander.h"
 #include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -113,7 +114,7 @@ static StatusOr<HloInstruction*> ExpandIndexVectorIntoOperandSpace(
   const Shape& index_shape = index_vector->shape();
   HloInstruction* zero =
       computation->AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateFromDimensions(index_shape.element_type(), {1})));
+          LiteralUtil::CreateFromDimensions(index_shape.element_type(), {1})));
 
   // We extract out individual components from the smaller index and concatenate
   // them (interspersing zeros as needed) into the larger index.
diff --git a/tensorflow/compiler/xla/service/generic_transfer_manager.cc b/tensorflow/compiler/xla/service/generic_transfer_manager.cc
index 85e28a0dfe..e314a469f0 100644
--- a/tensorflow/compiler/xla/service/generic_transfer_manager.cc
+++ b/tensorflow/compiler/xla/service/generic_transfer_manager.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/interpreter/platform_id.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -158,16 +158,10 @@ Status GenericTransferManager::TransferLiteralToInfeed(
   return Unimplemented("Generic transfer to Infeed");
 }
 
-Status GenericTransferManager::TransferBufferToInfeed(
-    se::StreamExecutor* executor, int64 size, const void* source) {
-  return Unimplemented("Generic transfer to Infeed");
-}
-
 Status GenericTransferManager::TransferLiteralFromOutfeed(
     se::StreamExecutor* executor, const Shape& literal_shape,
     Literal* literal) {
-  return Unimplemented(
-      "Outfeed is not supported on this platform (b/30467474)");
+  return Unimplemented("Generic transfer from Outfeed");
 }
 
 Status GenericTransferManager::ResetDevices(
diff --git a/tensorflow/compiler/xla/service/generic_transfer_manager.h b/tensorflow/compiler/xla/service/generic_transfer_manager.h
index d216fe7d29..3cd002c1bf 100644
--- a/tensorflow/compiler/xla/service/generic_transfer_manager.h
+++ b/tensorflow/compiler/xla/service/generic_transfer_manager.h
@@ -61,9 +61,6 @@ class GenericTransferManager : public TransferManager {
   int64 GetByteSizeRequirement(const Shape& shape) const override;
 
  protected:
-  Status TransferBufferToInfeed(se::StreamExecutor* executor, int64 size,
-                                const void* source) override;
-
   Status WriteSingleTupleIndexTable(
       se::Stream* stream,
       tensorflow::gtl::ArraySlice<se::DeviceMemoryBase> elements,
diff --git a/tensorflow/compiler/xla/service/gpu/BUILD b/tensorflow/compiler/xla/service/gpu/BUILD
index d90b0fb57d..4947dd278e 100644
--- a/tensorflow/compiler/xla/service/gpu/BUILD
+++ b/tensorflow/compiler/xla/service/gpu/BUILD
@@ -36,6 +36,7 @@ cc_library(
     hdrs = ["gpu_constants.h"],
     deps = [
         "//tensorflow/compiler/xla:types",
+        "//tensorflow/core:framework",
     ],
 )
 
@@ -113,11 +114,13 @@ cc_library(
     srcs = ["hlo_to_ir_bindings.cc"],
     hdrs = ["hlo_to_ir_bindings.h"],
     deps = [
+        ":buffer_allocations",
         ":ir_emission_utils",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla/service:buffer_assignment",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service/llvm_ir:alias_analysis",
+        "//tensorflow/compiler/xla/service/llvm_ir:buffer_assignment_util",
         "//tensorflow/compiler/xla/service/llvm_ir:ir_array",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/compiler/xla/service/llvm_ir:tuple_ops",
@@ -141,6 +144,7 @@ cc_library(
     ],
     deps = [
         ":backend_configs",
+        ":buffer_allocations",
         ":cudnn_convolution_runner",
         ":elemental_ir_emitter",
         ":gpu_constants",
@@ -150,7 +154,7 @@ cc_library(
         ":parallel_loop_emitter",
         ":partition_assignment",
         ":while_transformer",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -162,13 +166,16 @@ cc_library(
         "//tensorflow/compiler/xla/service:elemental_ir_emitter",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:name_uniquer",
+        "//tensorflow/compiler/xla/service/llvm_ir:buffer_assignment_util",
+        "//tensorflow/compiler/xla/service/llvm_ir:dynamic_update_slice_util",
         "//tensorflow/compiler/xla/service/llvm_ir:fused_ir_emitter",
         "//tensorflow/compiler/xla/service/llvm_ir:ir_array",
         "//tensorflow/compiler/xla/service/llvm_ir:kernel_support_library",
+        "//tensorflow/compiler/xla/service/llvm_ir:kernel_tiling",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_loop",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/compiler/xla/service/llvm_ir:loop_emitter",
-        "//tensorflow/compiler/xla/service/llvm_ir:ops",
+        "//tensorflow/compiler/xla/service/llvm_ir:sort_util",
         "//tensorflow/compiler/xla/service/llvm_ir:tuple_ops",
         "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
@@ -199,7 +206,7 @@ cc_library(
     srcs = ["elemental_ir_emitter.cc"],
     hdrs = ["elemental_ir_emitter.h"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -214,6 +221,7 @@ cc_library(
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_loop",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/compiler/xla/service/llvm_ir:loop_emitter",
+        "//tensorflow/compiler/xla/service/llvm_ir:math_ops",
         "//tensorflow/core:lib",
         "@llvm//:core",
         "@llvm//:support",
@@ -244,7 +252,7 @@ cc_library(
     deps = [
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_execution_profile",
-        "//tensorflow/compiler/xla/service:pool",
+        "//tensorflow/compiler/xla/service:stream_pool",
         "//tensorflow/core:lib",
         "//tensorflow/core:ptr_util",
         "//tensorflow/core:stream_executor_no_cuda",
@@ -265,7 +273,9 @@ cc_library(
         "infeed_thunk.cc",
         "kernel_thunk.cc",
         "memset_thunk.cc",
+        "outfeed_thunk.cc",
         "sequential_thunk.cc",
+        "thunk.cc",
         "thunk_schedule.cc",
         "tuple_thunk.cc",
         "while_thunk.cc",
@@ -282,6 +292,7 @@ cc_library(
         "infeed_thunk.h",
         "kernel_thunk.h",
         "memset_thunk.h",
+        "outfeed_thunk.h",
         "sequential_thunk.h",
         "thunk.h",
         "thunk_schedule.h",
@@ -289,15 +300,16 @@ cc_library(
         "while_thunk.h",
     ],
     deps = [
-        ":backend_configs",
         ":buffer_allocations",
         ":cudnn_convolution_runner",
         ":hlo_execution_profiler",
         ":infeed_manager",
         ":ir_emission_utils",
+        ":outfeed_manager",
         ":partition_assignment",
         ":stream_assignment",
         "//tensorflow/compiler/xla:array2d",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_tree",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status",
@@ -315,6 +327,7 @@ cc_library(
         "//tensorflow/compiler/xla/service:shaped_buffer",
         "//tensorflow/compiler/xla/service:transfer_manager",
         "//tensorflow/compiler/xla/service:tuple_points_to_analysis",
+        "//tensorflow/compiler/xla/service/llvm_ir:buffer_assignment_util",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:stream_executor_no_cuda",
@@ -351,6 +364,7 @@ cc_library(
         ":cudnn_convolution_runner",
         ":gpu_executable",
         ":ir_emission_utils",
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla/service:device_memory_allocator",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_pass",
@@ -382,7 +396,7 @@ cc_library(
     hdrs = ["cudnn_convolution_rewriter.h"],
     deps = [
         ":ir_emission_utils",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:window_util",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -443,6 +457,7 @@ cc_library(
     srcs = ["multi_output_fusion.cc"],
     hdrs = ["multi_output_fusion.h"],
     deps = [
+        ":instruction_fusion",
         ":ir_emission_utils",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla/service:hlo",
@@ -455,6 +470,7 @@ tf_cc_test(
     name = "multi_output_fusion_test",
     srcs = ["multi_output_fusion_test.cc"],
     deps = [
+        ":instruction_fusion",
         ":multi_output_fusion",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:util",
@@ -517,6 +533,7 @@ cc_library(
     hdrs = ["pad_insertion.h"],
     deps = [
         ":ir_emission_utils",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:window_util",
@@ -528,12 +545,47 @@ cc_library(
 )
 
 cc_library(
+    name = "pad_for_tensor_cores",
+    srcs = ["pad_for_tensor_cores.cc"],
+    hdrs = ["pad_for_tensor_cores.h"],
+    deps = [
+        ":ir_emission_utils",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:window_util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo_creation_utils",
+        "//tensorflow/compiler/xla/service:hlo_pass",
+        "//tensorflow/compiler/xla/service:shape_inference",
+    ],
+)
+
+tf_cc_test(
+    name = "pad_for_tensor_cores_test",
+    srcs = ["pad_for_tensor_cores_test.cc"],
+    deps = [
+        ":ir_emission_utils",
+        ":pad_for_tensor_cores",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # build_cleaner: keep
+    ],
+)
+
+cc_library(
     name = "gpu_transfer_manager",
     srcs = ["gpu_transfer_manager.cc"],
     hdrs = ["gpu_transfer_manager.h"],
     deps = [
         ":gpu_compiler",
+        ":outfeed_manager",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_tree",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -552,8 +604,8 @@ cc_library(
 
 cc_library(
     name = "gpu_compiler",
-    srcs = ["gpu_compiler.cc"],
-    hdrs = ["gpu_compiler.h"],
+    srcs = ["nvptx_compiler.cc"],
+    hdrs = ["nvptx_compiler.h"],
     deps = [
         ":cudnn_convolution_algorithm_picker",
         ":cudnn_convolution_rewriter",
@@ -568,9 +620,11 @@ cc_library(
         ":ir_emission_utils",
         ":ir_emitter",
         ":multi_output_fusion",
+        ":pad_for_tensor_cores",
         ":pad_insertion",
         ":partition_assignment",
         ":stream_assignment",
+        ":stream_executor_util",
         "//tensorflow/compiler/xla:protobuf_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -582,7 +636,6 @@ cc_library(
         "//tensorflow/compiler/xla/service:buffer_liveness",
         "//tensorflow/compiler/xla/service:call_inliner",
         "//tensorflow/compiler/xla/service:conditional_simplifier",
-        "//tensorflow/compiler/xla/service:dot_decomposer",
         "//tensorflow/compiler/xla/service:executable",
         "//tensorflow/compiler/xla/service:flatten_call_graph",
         "//tensorflow/compiler/xla/service:hlo",
@@ -624,6 +677,7 @@ cc_library(
     hdrs = ["cudnn_batchnorm_rewriter.h"],
     deps = [
         ":ir_emission_utils",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_pass",
@@ -631,18 +685,39 @@ cc_library(
 )
 
 cc_library(
+    name = "xfeed_queue",
+    hdrs = ["xfeed_queue.h"],
+    deps = ["//tensorflow/core:lib"],
+)
+
+cc_library(
     name = "infeed_manager",
     srcs = ["infeed_manager.cc"],
     hdrs = ["infeed_manager.h"],
     deps = [
+        ":xfeed_queue",
+        "//tensorflow/compiler/xla:shape_tree",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
-        "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
     ],
 )
 
 cc_library(
+    name = "outfeed_manager",
+    srcs = ["outfeed_manager.cc"],
+    hdrs = ["outfeed_manager.h"],
+    deps = [
+        ":xfeed_queue",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:shape_tree",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
     name = "gpu_layout_assignment",
     srcs = ["gpu_layout_assignment.cc"],
     hdrs = ["gpu_layout_assignment.h"],
@@ -673,6 +748,8 @@ tf_cc_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/service:computation_layout",
         "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_matchers",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # build_cleaner: keep
     ],
@@ -716,7 +793,7 @@ cc_library(
     srcs = ["while_transformer.cc"],
     hdrs = ["while_transformer.h"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -772,6 +849,7 @@ cc_library(
     deps = [
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/core:stream_executor_no_cuda",
     ],
diff --git a/tensorflow/compiler/xla/service/gpu/buffer_allocations.cc b/tensorflow/compiler/xla/service/gpu/buffer_allocations.cc
index ab5149dcdb..537295292b 100644
--- a/tensorflow/compiler/xla/service/gpu/buffer_allocations.cc
+++ b/tensorflow/compiler/xla/service/gpu/buffer_allocations.cc
@@ -44,17 +44,27 @@ StatusOr<std::unique_ptr<BufferAllocations>> BufferAllocations::Builder::Build(
       num_buffers, device_ordinal, memory_allocator, buffer_assignment));
 
   for (BufferAllocation::Index i = 0; i < num_buffers; ++i) {
+    const BufferAllocation& allocation = buffer_assignment->GetAllocation(i);
+    const int64 expected_alignment = [&] {
+      if (allocation.is_entry_computation_parameter()) {
+        return kEntryParameterAlignBytes;
+      } else if (allocation.is_constant()) {
+        return kConstantBufferAlignBytes;
+      } else {
+        return kXlaAllocatedBufferAlignBytes;
+      }
+    }();
+
     // If buffer #i's address is already registered (e.g. external arguments or
     // result buffers), use that registered buffer.
     if (registered_buffers_.count(i)) {
       se::DeviceMemoryBase address = FindOrDie(registered_buffers_, i);
-      if (reinterpret_cast<uintptr_t>(address.opaque()) %
-              kCudaMallocAlignBytes !=
+      if (reinterpret_cast<uintptr_t>(address.opaque()) % expected_alignment !=
           0) {
         return InternalError(
             "Address of registered buffer %lld must be a multiple of %llx, but "
             "was %p",
-            i, kCudaMallocAlignBytes, address.opaque());
+            i, kEntryParameterAlignBytes, address.opaque());
       }
       buffer_allocations->SetBuffer(i, FindOrDie(registered_buffers_, i));
       continue;
@@ -62,7 +72,6 @@ StatusOr<std::unique_ptr<BufferAllocations>> BufferAllocations::Builder::Build(
 
     // Allocate each allocation that might escape, or is the temp buffer.
     bool seen_temp_buffer = false;
-    const BufferAllocation& allocation = buffer_assignment->GetAllocation(i);
     if (allocation.maybe_live_out() || allocation.IsPreallocatedTempBuffer()) {
       const int64 buffer_size = allocation.size();
       se::DeviceMemoryBase buffer_address;
@@ -70,13 +79,12 @@ StatusOr<std::unique_ptr<BufferAllocations>> BufferAllocations::Builder::Build(
         OwningDeviceMemory buffer;
         TF_ASSIGN_OR_RETURN(
             buffer, memory_allocator->Allocate(device_ordinal, buffer_size));
-        if (reinterpret_cast<uintptr_t>(buffer.opaque()) %
-                kCudaMallocAlignBytes !=
+        if (reinterpret_cast<uintptr_t>(buffer.opaque()) % expected_alignment !=
             0) {
           return InternalError(
               "Address returned by memory_allocator->Allocate must be a "
               "multiple of %llx, but was %p",
-              kCudaMallocAlignBytes, buffer.opaque());
+              kXlaAllocatedBufferAlignBytes, buffer.opaque());
         }
         // We do manual memory management within BufferAllocations.  Be sure not
         // to do a TF_RETURN_IF_ERROR between this line and the
@@ -165,5 +173,10 @@ void BufferAllocations::SetBuffer(BufferAllocation::Index buffer_index,
   buffers_[buffer_index] = buffer;
 }
 
+bool ShouldEmitLiteralInLlvmIr(const Literal& literal) {
+  // LLVM can sometimes do interesting optimizations using scalar constants.
+  return ShapeUtil::IsScalar(literal.shape());
+}
+
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/buffer_allocations.h b/tensorflow/compiler/xla/service/gpu/buffer_allocations.h
index 6366235025..f13eab0dd7 100644
--- a/tensorflow/compiler/xla/service/gpu/buffer_allocations.h
+++ b/tensorflow/compiler/xla/service/gpu/buffer_allocations.h
@@ -107,6 +107,12 @@ class BufferAllocations {
   bool torn_down_ = false;
 };
 
+// LLVM and PTXAS don't deal well with large constants, so we only emit very
+// small constants directly in LLVM IR.  Larger constants are emitted with zero
+// initializers in LLVM IR and are later overwritten when the PTX/CUBIN is
+// loaded.
+bool ShouldEmitLiteralInLlvmIr(const Literal& literal);
+
 }  // namespace gpu
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/gpu/conditional_thunk.cc b/tensorflow/compiler/xla/service/gpu/conditional_thunk.cc
index 5e4fe1dd39..5780e0af40 100644
--- a/tensorflow/compiler/xla/service/gpu/conditional_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/conditional_thunk.cc
@@ -33,8 +33,11 @@ ConditionalThunk::ConditionalThunk(
       predicate_buffer_index_(predicate_buffer_index),
       true_operand_buffer_index_(true_operand_buffer_index),
       false_operand_buffer_index_(false_operand_buffer_index),
-      true_thunk_(std::move(true_thunk_sequence), hlo),
-      false_thunk_(std::move(false_thunk_sequence), hlo) {}
+      // Pass nullptr as the HloInstruction* to the true_thunk_ and false_thunk_
+      // constructors because these SequentialThunks are logically "part of"
+      // this ConditionalThunk, and shouldn't be profiled separately from it.
+      true_thunk_(std::move(true_thunk_sequence), nullptr),
+      false_thunk_(std::move(false_thunk_sequence), nullptr) {}
 
 Status ConditionalThunk::Initialize(const GpuExecutable& executable,
                                     se::StreamExecutor* executor) {
diff --git a/tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.cc b/tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.cc
index c77e3c81c9..6028950652 100644
--- a/tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.cc
+++ b/tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 
@@ -66,11 +67,12 @@ Status Visitor::HandleBatchNormInference(HloInstruction* batch_norm) {
     return Status::OK();
   }
 
-  HloInstruction* epsilon = computation_->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(batch_norm->epsilon())));
+  HloInstruction* epsilon =
+      computation_->AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR0(batch_norm->epsilon())));
   HloInstruction* feature_index =
       computation_->AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR0(batch_norm->feature_index())));
+          LiteralUtil::CreateR0(batch_norm->feature_index())));
 
   std::vector<HloInstruction*> operands(batch_norm->operands().begin(),
                                         batch_norm->operands().end());
@@ -101,11 +103,12 @@ Status Visitor::HandleBatchNormTraining(HloInstruction* batch_norm) {
     return Status::OK();
   }
 
-  HloInstruction* epsilon = computation_->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(batch_norm->epsilon())));
+  HloInstruction* epsilon =
+      computation_->AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR0(batch_norm->epsilon())));
   HloInstruction* feature_index =
       computation_->AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR0(batch_norm->feature_index())));
+          LiteralUtil::CreateR0(batch_norm->feature_index())));
 
   std::vector<HloInstruction*> operands(batch_norm->operands().begin(),
                                         batch_norm->operands().end());
@@ -128,8 +131,8 @@ Status Visitor::HandleBatchNormTraining(HloInstruction* batch_norm) {
           inverse_stddev->shape(), HloOpcode::kPower, inverse_stddev,
           computation_->AddInstruction(HloInstruction::CreateBroadcast(
               inverse_stddev->shape(),
-              computation_->AddInstruction(
-                  HloInstruction::CreateConstant(Literal::CreateR0<float>(-2))),
+              computation_->AddInstruction(HloInstruction::CreateConstant(
+                  LiteralUtil::CreateR0<float>(-2))),
               {}))));
   HloInstruction* variance =
       computation_->AddInstruction(HloInstruction::CreateBinary(
@@ -169,11 +172,12 @@ Status Visitor::HandleBatchNormGrad(HloInstruction* batch_norm) {
     return Status::OK();
   }
 
-  HloInstruction* epsilon = computation_->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(batch_norm->epsilon())));
+  HloInstruction* epsilon =
+      computation_->AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR0(batch_norm->epsilon())));
   HloInstruction* feature_index =
       computation_->AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR0(batch_norm->feature_index())));
+          LiteralUtil::CreateR0(batch_norm->feature_index())));
 
   // The cudnn libcall expects its input to be rsqrt(variance + epsilon), but
   // the batchnorm HLO takes plain variance as input.  Fix it up.
@@ -189,7 +193,7 @@ Status Visitor::HandleBatchNormGrad(HloInstruction* batch_norm) {
           computation_->AddInstruction(HloInstruction::CreateBroadcast(
               var_plus_epsilon->shape(),
               computation_->AddInstruction(HloInstruction::CreateConstant(
-                  Literal::CreateR0<float>(-.5))),
+                  LiteralUtil::CreateR0<float>(-.5))),
               {}))));
 
   std::vector<HloInstruction*> operands(batch_norm->operands().begin(),
diff --git a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc
index 3dc98c4c93..5a63e65208 100644
--- a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc
+++ b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/gpu/backend_configs.pb.h"
 #include "tensorflow/compiler/xla/service/gpu/convolution_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
@@ -80,8 +81,7 @@ bool ShouldIncludeWinogradNonfusedAlgo(const Shape& input_shape,
                                        const ConvolutionDimensionNumbers& dnums,
                                        se::StreamExecutor* stream_exec) {
   // Skip this check for cudnn7 and newer.
-  auto version =
-      stream_exec->AsDnn()->GetVersion();
+  auto version = stream_exec->AsDnn()->GetVersion();
   if (version.ok() && version.ValueOrDie().major_version() >= 7) {
     return true;
   }
@@ -338,8 +338,8 @@ StatusOr<bool> CudnnConvolutionAlgorithmPicker::RunOnInstruction(
       computation->AddInstruction(HloInstruction::CreateTuple(
           {computation->AddInstruction(HloInstruction::CreateGetTupleElement(
                new_call_shape.tuple_shapes(0), new_call, 0)),
-           computation->AddInstruction(
-               HloInstruction::CreateConstant(Literal::CreateR1<uint8>({})))}));
+           computation->AddInstruction(HloInstruction::CreateConstant(
+               LiteralUtil::CreateR1<uint8>({})))}));
 
   TF_RETURN_IF_ERROR(instr->parent()->ReplaceInstruction(instr, new_tuple));
   return true;
diff --git a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_rewriter.cc b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_rewriter.cc
index f9dccd287d..905b5ee876 100644
--- a/tensorflow/compiler/xla/service/gpu/cudnn_convolution_rewriter.cc
+++ b/tensorflow/compiler/xla/service/gpu/cudnn_convolution_rewriter.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <numeric>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
diff --git a/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc b/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc
index 27d2c3e491..cc38db27e2 100644
--- a/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.cc
@@ -29,12 +29,13 @@ limitations under the License.
 #include "llvm/IR/Intrinsics.h"
 #include "llvm/IR/Module.h"
 #include "llvm/IR/Type.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/ir_array.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/math_ops.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -67,8 +68,8 @@ bool IsFPLiteralWithValue(const HloInstruction* operand, float value) {
 
 GpuElementalIrEmitter::GpuElementalIrEmitter(
     const HloModuleConfig& hlo_module_config, llvm::Module* module,
-    llvm::IRBuilder<>* ir_builder, NestedComputer compute_nested)
-    : ElementalIrEmitter(hlo_module_config, module, ir_builder),
+    llvm::IRBuilder<>* b, NestedComputer compute_nested)
+    : ElementalIrEmitter(hlo_module_config, module, b),
       hlo_module_config_(hlo_module_config),
       compute_nested_(std::move(compute_nested)) {}
 
@@ -92,8 +93,8 @@ StatusOr<llvm::Value*> GpuElementalIrEmitter::EmitLibdeviceMathCall(
       cast_result_to_fp16 = true;
       for (int64 i = 0; i < operands.size(); ++i) {
         if (input_types[i] == F16) {
-          converted_operands[i] = ir_builder_->CreateFPCast(
-              converted_operands[i], ir_builder_->getFloatTy());
+          converted_operands[i] =
+              b_->CreateFPCast(converted_operands[i], b_->getFloatTy());
           converted_input_types[i] = F32;
         }
       }
@@ -112,7 +113,7 @@ StatusOr<llvm::Value*> GpuElementalIrEmitter::EmitLibdeviceMathCall(
                                      converted_input_types, output_type)
                             .ValueOrDie();
   if (cast_result_to_fp16) {
-    result = ir_builder_->CreateFPCast(result, ir_builder_->getHalfTy());
+    result = b_->CreateFPCast(result, b_->getHalfTy());
   }
   return result;
 }
@@ -215,7 +216,7 @@ StatusOr<llvm::Value*> GpuElementalIrEmitter::EmitPowerOp(
     // LLVM's NVPTX backend knows how to transform 1/sqrt(A) into the NVPTX
     // rsqrt.approx instruction.
     TF_ASSIGN_OR_RETURN(auto* sqrt, make_sqrt());
-    return ir_builder_->CreateFDiv(llvm::ConstantFP::get(llvm_ty, 1), sqrt);
+    return b_->CreateFDiv(llvm::ConstantFP::get(llvm_ty, 1), sqrt);
   }
 
   VLOG(10) << "emitting pow as regular call to pow(): " << op->ToString();
@@ -277,6 +278,16 @@ StatusOr<llvm::Value*> GpuElementalIrEmitter::EmitFloatUnaryOp(
   PrimitiveType output_type = op->shape().element_type();
   switch (op->opcode()) {
     case HloOpcode::kTanh:
+      // If we don't care much about precision, emit a fast approximation of
+      // tanh.
+      if (hlo_module_config_.debug_options().xla_enable_fast_math()) {
+        // Upcast F16 to F32 if necessary.
+        llvm::Type* type =
+            input_type == F16 ? b_->getFloatTy() : operand_value->getType();
+        llvm::Value* input = b_->CreateFPCast(operand_value, type);
+        llvm::Value* fast_tanh = llvm_ir::EmitFastTanh(b_, input);
+        return b_->CreateFPCast(fast_tanh, operand_value->getType());
+      }
       return EmitLibdeviceMathCall("__nv_tanh", {operand_value}, {input_type},
                                    output_type);
     default:
@@ -302,32 +313,31 @@ llvm::Value* GpuElementalIrEmitter::EmitDeviceFunctionCall(
 
   // Declares the callee if it is not declared already.
   llvm::Function* callee = llvm::cast<llvm::Function>(
-      ir_builder_->GetInsertBlock()->getModule()->getOrInsertFunction(
+      b_->GetInsertBlock()->getModule()->getOrInsertFunction(
           llvm_ir::AsStringRef(callee_name), callee_type));
 
   for (auto attribute : attributes) {
     callee->addFnAttr(attribute);
   }
 
-  return ir_builder_->CreateCall(callee, llvm_ir::AsArrayRef(operands));
+  return b_->CreateCall(callee, llvm_ir::AsArrayRef(operands));
 }
 
 llvm::Value* GpuElementalIrEmitter::EmitThreadId() const {
-  llvm::Value* block_id = ir_builder_->CreateIntCast(
+  llvm::Value* block_id = b_->CreateIntCast(
       llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::nvvm_read_ptx_sreg_ctaid_x,
-                                   {}, {}, ir_builder_),
-      ir_builder_->getIntNTy(128), /*isSigned=*/true, "block.id");
-  llvm::Value* thread_id_in_block = ir_builder_->CreateIntCast(
+                                   {}, {}, b_),
+      b_->getIntNTy(128), /*isSigned=*/true, "block.id");
+  llvm::Value* thread_id_in_block = b_->CreateIntCast(
       llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x,
-                                   {}, {}, ir_builder_),
-      ir_builder_->getIntNTy(128), /*isSigned=*/true, "thread.id");
-  llvm::Value* threads_per_block = ir_builder_->CreateIntCast(
+                                   {}, {}, b_),
+      b_->getIntNTy(128), /*isSigned=*/true, "thread.id");
+  llvm::Value* threads_per_block = b_->CreateIntCast(
       llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::nvvm_read_ptx_sreg_ntid_x,
-                                   {}, {}, ir_builder_),
-      ir_builder_->getIntNTy(128), /*isSigned=*/true, "threads_per_block");
-  return ir_builder_->CreateNSWAdd(
-      ir_builder_->CreateNSWMul(block_id, threads_per_block),
-      thread_id_in_block);
+                                   {}, {}, b_),
+      b_->getIntNTy(128), /*isSigned=*/true, "threads_per_block");
+  return b_->CreateNSWAdd(b_->CreateNSWMul(block_id, threads_per_block),
+                          thread_id_in_block);
 }
 
 llvm_ir::ElementGenerator GpuElementalIrEmitter::MakeElementGenerator(
@@ -373,12 +383,12 @@ llvm_ir::ElementGenerator GpuElementalIrEmitter::MakeElementGenerator(
         PrimitiveType operand_element_type = operand->shape().element_type();
         llvm::Value* accum_ptr = llvm_ir::EmitAllocaAtFunctionEntry(
             llvm_ir::PrimitiveTypeToIrType(operand_element_type, module_),
-            "reduce_window_accum_ptr", ir_builder_);
+            "reduce_window_accum_ptr", b_);
         {
           TF_ASSIGN_OR_RETURN(llvm::Value * init_value,
                               operand_to_generator.at(hlo->operand(1))(
                                   IrArray::Index(index.GetType())));
-          ir_builder_->CreateStore(init_value, accum_ptr);
+          b_->CreateStore(init_value, accum_ptr);
         }
 
         llvm::Type* index_type = index.GetType();
@@ -386,7 +396,7 @@ llvm_ir::ElementGenerator GpuElementalIrEmitter::MakeElementGenerator(
           return index.GetConstantWithIndexType(c);
         };
 
-        llvm_ir::ForLoopNest loops(IrName(hlo), ir_builder_, index_type);
+        llvm_ir::ForLoopNest loops(IrName(hlo), b_, index_type);
         std::vector<int64> window_size;
         for (const auto& dim : window.dimensions()) {
           window_size.push_back(dim.size());
@@ -395,15 +405,15 @@ llvm_ir::ElementGenerator GpuElementalIrEmitter::MakeElementGenerator(
             ShapeUtil::MakeShape(operand_element_type, window_size), "window");
         CHECK_EQ(window_index.size(), index.size());
 
-        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), ir_builder_);
+        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), b_);
 
         IrArray::Index input_index(index_type, index.size());
-        llvm::Value* in_bounds = ir_builder_->getInt1(true);
+        llvm::Value* in_bounds = b_->getInt1(true);
         for (size_t i = 0; i < index.size(); ++i) {
-          llvm::Value* stridden_index = ir_builder_->CreateNSWMul(
+          llvm::Value* stridden_index = b_->CreateNSWMul(
               index[i], index_typed_const(window.dimensions(i).stride()));
-          input_index[i] = ir_builder_->CreateNSWSub(
-              ir_builder_->CreateNSWAdd(stridden_index, window_index[i]),
+          input_index[i] = b_->CreateNSWSub(
+              b_->CreateNSWAdd(stridden_index, window_index[i]),
               index_typed_const(window.dimensions(i).padding_low()));
 
           // We must check whether 0 ≤ input_index[i] < bound, as otherwise
@@ -411,16 +421,16 @@ llvm_ir::ElementGenerator GpuElementalIrEmitter::MakeElementGenerator(
           // comparison is equivalent to the unsigned comparison
           // input_index[i] < bound, as a negative value wraps to a large
           // positive value.
-          in_bounds = ir_builder_->CreateAnd(
+          in_bounds = b_->CreateAnd(
               in_bounds,
-              ir_builder_->CreateICmpULT(
+              b_->CreateICmpULT(
                   input_index[i],
                   index_typed_const(operand->shape().dimensions(i))));
         }
 
         llvm_ir::LlvmIfData if_data =
-            llvm_ir::EmitIfThenElse(in_bounds, "in_bounds", ir_builder_);
-        SetToFirstInsertPoint(if_data.true_block, ir_builder_);
+            llvm_ir::EmitIfThenElse(in_bounds, "in_bounds", b_);
+        SetToFirstInsertPoint(if_data.true_block, b_);
 
         // We are not in pad, so do the computation.
         TF_ASSIGN_OR_RETURN(llvm::Value * input_value,
@@ -428,26 +438,26 @@ llvm_ir::ElementGenerator GpuElementalIrEmitter::MakeElementGenerator(
         TF_ASSIGN_OR_RETURN(
             llvm::Value * accum_value,
             compute_nested_(*hlo->to_apply(),
-                            {ir_builder_->CreateLoad(accum_ptr), input_value}));
-        ir_builder_->CreateStore(accum_value, accum_ptr);
+                            {b_->CreateLoad(accum_ptr), input_value}));
+        b_->CreateStore(accum_value, accum_ptr);
 
-        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), ir_builder_);
-        return ir_builder_->CreateLoad(accum_ptr);
+        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), b_);
+        return b_->CreateLoad(accum_ptr);
       };
     case HloOpcode::kReduce:
       return [=, &operand_to_generator](
                  const IrArray::Index& output_index) -> StatusOr<llvm::Value*> {
         const HloInstruction* operand = hlo->operand(0);
         llvm::Value* accum_ptr =
-            ir_builder()->CreateAlloca(llvm_ir::PrimitiveTypeToIrType(
+            b()->CreateAlloca(llvm_ir::PrimitiveTypeToIrType(
                 hlo->shape().element_type(), module_));
         llvm::Type* index_type = output_index.GetType();
         TF_ASSIGN_OR_RETURN(llvm::Value * init_value,
                             operand_to_generator.at(hlo->operand(1))(
                                 IrArray::Index(index_type)));
-        ir_builder()->CreateStore(init_value, accum_ptr);
+        b()->CreateStore(init_value, accum_ptr);
 
-        llvm_ir::ForLoopNest loops(IrName(hlo), ir_builder_, index_type);
+        llvm_ir::ForLoopNest loops(IrName(hlo), b_, index_type);
         IrArray::Index input_index = loops.AddLoopsForShapeOnDimensions(
             operand->shape(), hlo->dimensions(), "reduction_dim");
         if (!ShapeUtil::IsScalar(hlo->shape())) {
@@ -462,18 +472,17 @@ llvm_ir::ElementGenerator GpuElementalIrEmitter::MakeElementGenerator(
           CHECK_EQ(output_index.size(), j);
         }
 
-        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), ir_builder());
+        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), b());
         TF_ASSIGN_OR_RETURN(
             llvm::Value * input_value,
             operand_to_generator.at(hlo->operand(0))(input_index));
         TF_ASSIGN_OR_RETURN(
             llvm::Value * accum_value,
-            compute_nested_(
-                *hlo->to_apply(),
-                {ir_builder()->CreateLoad(accum_ptr), input_value}));
-        ir_builder()->CreateStore(accum_value, accum_ptr);
-        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), ir_builder());
-        return ir_builder()->CreateLoad(accum_ptr);
+            compute_nested_(*hlo->to_apply(),
+                            {b()->CreateLoad(accum_ptr), input_value}));
+        b()->CreateStore(accum_value, accum_ptr);
+        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), b());
+        return b()->CreateLoad(accum_ptr);
       };
     default:
       return ElementalIrEmitter::MakeElementGenerator(hlo,
diff --git a/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.h b/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.h
index 91f4d960aa..e3eacef133 100644
--- a/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.h
+++ b/tensorflow/compiler/xla/service/gpu/elemental_ir_emitter.h
@@ -43,7 +43,7 @@ class GpuElementalIrEmitter : public ElementalIrEmitter {
       const HloComputation&, tensorflow::gtl::ArraySlice<llvm::Value*>)>;
 
   GpuElementalIrEmitter(const HloModuleConfig& hlo_module_config,
-                        llvm::Module* module, llvm::IRBuilder<>* ir_builder,
+                        llvm::Module* module, llvm::IRBuilder<>* b,
                         NestedComputer compute_nested);
 
   llvm_ir::ElementGenerator MakeElementGenerator(
diff --git a/tensorflow/compiler/xla/service/gpu/for_thunk.cc b/tensorflow/compiler/xla/service/gpu/for_thunk.cc
index 4fdc55909a..b3a3c5dcb4 100644
--- a/tensorflow/compiler/xla/service/gpu/for_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/for_thunk.cc
@@ -28,8 +28,11 @@ ForThunk::ForThunk(const int64 loop_limit,
                    const HloInstruction* hlo)
     : Thunk(Kind::kWhile, hlo),
       loop_limit_(loop_limit),
-      body_thunk_sequence_(
-          MakeUnique<SequentialThunk>(std::move(*body_thunk_sequence), hlo)) {}
+      body_thunk_sequence_(MakeUnique<SequentialThunk>(
+          // Pass nullptr as the HloInstruction* to the body_thunk_sequence_
+          // constructor because this SequentialThunk is logically "part of"
+          // this ForThunk, and shouldn't be profiled separately from it.
+          std::move(*body_thunk_sequence), nullptr)) {}
 
 Status ForThunk::Initialize(const GpuExecutable& executable,
                             se::StreamExecutor* executor) {
diff --git a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
index dbc7754e25..74282c568c 100644
--- a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <functional>
 
 #include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 #include "tensorflow/core/platform/types.h"
@@ -31,16 +32,19 @@ namespace {
 // dimensions.
 struct MatrixDescriptor {
   MatrixDescriptor(se::DeviceMemoryBase matrix_data, bool needs_transpose,
-                   int64 matrix_num_rows, int64 matrix_num_cols)
+                   int64 matrix_num_rows, int64 matrix_num_cols,
+                   int64 matrix_batch_size)
       : data(matrix_data),
         transpose(needs_transpose),
         num_rows(matrix_num_rows),
-        num_cols(matrix_num_cols) {}
+        num_cols(matrix_num_cols),
+        batch_size(matrix_batch_size) {}
 
   se::DeviceMemoryBase data;
   bool transpose;  // Whether this matrix needs to be transposed.
   int64 num_rows;
   int64 num_cols;
+  int64 batch_size;
 };
 
 // Performs a gemm call without an explicit algorithm on lhs_matrix and
@@ -50,6 +54,9 @@ bool DoGemm(MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
             MatrixDescriptor output_matrix, double alpha, se::Stream* stream) {
   DCHECK(!output_matrix.transpose);
 
+  const int64 batch_size = lhs_matrix.batch_size;
+  CHECK_EQ(batch_size, rhs_matrix.batch_size);
+  CHECK_EQ(batch_size, output_matrix.batch_size);
   se::DeviceMemory<Element> lhs_data(lhs_matrix.data);
   se::DeviceMemory<Element> rhs_data(rhs_matrix.data);
   se::DeviceMemory<Element> output_data(output_matrix.data);
@@ -60,13 +67,30 @@ bool DoGemm(MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
                                             : se::blas::Transpose::kNoTranspose;
   auto k = lhs_matrix.transpose ? lhs_matrix.num_rows : lhs_matrix.num_cols;
 
+  if (batch_size == 1) {
+    return stream
+        ->ThenBlasGemm(
+            lhs_transpose, rhs_transpose, output_matrix.num_rows,
+            output_matrix.num_cols, /*size of reduce dim=*/k, /*alpha=*/alpha,
+            lhs_data, /*leading dim of LHS=*/lhs_matrix.num_rows, rhs_data,
+            /*leading dim of RHS=*/rhs_matrix.num_rows, /*beta=*/0.0,
+            &output_data, /*leading dim of output=*/output_matrix.num_rows)
+        .ok();
+  }
+
+  int64 lhs_stride = lhs_matrix.num_rows * lhs_matrix.num_cols;
+  int64 rhs_stride = rhs_matrix.num_rows * rhs_matrix.num_cols;
+  int64 output_stride = output_matrix.num_rows * output_matrix.num_cols;
   return stream
-      ->ThenBlasGemm(
+      ->ThenBlasGemmStridedBatched(
           lhs_transpose, rhs_transpose, output_matrix.num_rows,
-          output_matrix.num_cols, /*size of reduce dim=*/k, /*alpha=*/alpha,
-          lhs_data, /*leading dim of LHS=*/lhs_matrix.num_rows, rhs_data,
-          /*leading dim of RHS=*/rhs_matrix.num_rows, /*beta=*/0.0,
-          &output_data, /*leading dim of output=*/output_matrix.num_rows)
+          output_matrix.num_cols, /*size of reduce dim=*/k,
+          /*alpha=*/alpha, lhs_data,
+          /*leading dim of LHS=*/lhs_matrix.num_rows, lhs_stride, rhs_data,
+          /*leading dim of RHS=*/rhs_matrix.num_rows, rhs_stride,
+          /*beta=*/0.0, &output_data,
+          /*leading dim of output=*/output_matrix.num_rows, output_stride,
+          batch_size)
       .ok();
 }
 
@@ -93,6 +117,10 @@ bool DoGemmWithAlgorithm(MatrixDescriptor lhs_matrix,
                          se::blas::ProfileResult* output_profile_result) {
   DCHECK(!output_matrix.transpose);
 
+  CHECK_EQ(1, lhs_matrix.batch_size);
+  CHECK_EQ(1, rhs_matrix.batch_size);
+  CHECK_EQ(1, output_matrix.batch_size);
+
   se::DeviceMemory<Element> lhs_data(lhs_matrix.data);
   se::DeviceMemory<Element> rhs_data(rhs_matrix.data);
   se::DeviceMemory<Element> output_data(output_matrix.data);
@@ -141,9 +169,15 @@ StatusOr<se::blas::AlgorithmType> DoGemmAutotune(
                                        alpha, computation_type, algorithm,
                                        stream, &profile_result));
 
-    if (profile_result.is_valid() && profile_result.elapsed_time_in_ms() <
-                                         best_result.elapsed_time_in_ms()) {
-      best_result = profile_result;
+    if (profile_result.is_valid()) {
+      VLOG(3) << "cublas gemm algorithm " << algorithm << " took "
+              << profile_result.elapsed_time_in_ms() << "ms";
+      if (profile_result.elapsed_time_in_ms() <
+          best_result.elapsed_time_in_ms()) {
+        best_result = profile_result;
+      }
+    } else {
+      VLOG(4) << "cublas gemm algorithm " << algorithm << " failed.";
     }
   }
 
@@ -167,6 +201,8 @@ auto GetGemmFn(PrimitiveType type) -> decltype(&DoGemm<float>) {
       return &DoGemm<float>;
     case F64:
       return &DoGemm<double>;
+    case C64:
+      return &DoGemm<std::complex<float>>;
     default:
       LOG(FATAL) << "Unsupported type.";
   }
@@ -180,6 +216,8 @@ auto GetGemmWithAlgorithmFn(PrimitiveType type)
       return &DoGemmWithAlgorithm<float>;
     case F64:
       return &DoGemmWithAlgorithm<double>;
+    case C64:
+      return &DoGemmWithAlgorithm<std::complex<float>>;
     default:
       LOG(FATAL) << "Unsupported type.";
   }
@@ -192,6 +230,8 @@ auto GetGemmAutotuneFn(PrimitiveType type) -> decltype(&DoGemmAutotune<float>) {
       return &DoGemmAutotune<float>;
     case F64:
       return &DoGemmAutotune<double>;
+    case C64:
+      return &DoGemmAutotune<std::complex<float>>;
     default:
       LOG(FATAL) << "Unsupported type.";
   }
@@ -210,6 +250,8 @@ se::blas::ComputationType GetBlasComputationType(PrimitiveType type) {
       return se::blas::ComputationType::kF32;
     case F64:
       return se::blas::ComputationType::kF64;
+    case C64:
+      return se::blas::ComputationType::kComplexF32;
     default:
       LOG(FATAL) << "Unsupported type.";
   }
@@ -263,12 +305,37 @@ Status GemmThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
   se::DeviceMemoryBase output_data =
       buffer_allocations.GetDeviceAddress(output_buffer_);
 
+  DotDimensionNumbers dim_nums = GetDimensionNumbers(*hlo_instruction());
+  CHECK_EQ(dim_nums.lhs_batch_dimensions_size(),
+           dim_nums.rhs_batch_dimensions_size());
+  CHECK_EQ(dim_nums.lhs_batch_dimensions_size() + 2,
+           ShapeUtil::Rank(output_shape_));
+
+  int64 row_dim = dim_nums.lhs_batch_dimensions_size();
+  int64 col_dim = dim_nums.lhs_batch_dimensions_size() + 1;
+  int64 batch_size = std::accumulate(output_shape_.dimensions().begin(),
+                                     output_shape_.dimensions().end() - 2, 1,
+                                     std::multiplies<int64>());
+
+  // Check that the batch dims don't cover the last two dims.
+  for (int64 batch_dim : dim_nums.lhs_batch_dimensions()) {
+    CHECK_NE(row_dim, batch_dim);
+    CHECK_NE(col_dim, batch_dim);
+  }
+
+  // Verify that the non-batch dimensions are minor-most. This is required for
+  // efficient access.
+  for (const auto* shape : {&lhs_shape_, &rhs_shape_, &output_shape_}) {
+    CHECK_LT(shape->layout().minor_to_major(row_dim), 2);
+    CHECK_LT(shape->layout().minor_to_major(col_dim), 2);
+  }
+
   // BLAS gemm reduces rows of LHS and columns of RHS. The Dot operator between
   // matrices reduces dimension 1 of LHS and dimension 0 of RHS regardless of
   // their layout. Therefore, we should treat dimension 0 as row and dimension 1
   // as column when mapping a matrix Dot to BLAS gemm.
-  int64 output_num_rows = output_shape_.dimensions(0);
-  int64 output_num_cols = output_shape_.dimensions(1);
+  int64 output_num_rows = output_shape_.dimensions(row_dim);
+  int64 output_num_cols = output_shape_.dimensions(col_dim);
 
   // BLAS gemm expects the inputs and the output are in column-major order.
   // Therefore, we need to convert dot between row-major matrices to that
@@ -291,34 +358,46 @@ Status GemmThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
   // the leading dimension of the LHS matrix of gemm is the number of rows in
   // B^T and thus the number of columns in B.
 
-  auto make_descriptor = [this](se::DeviceMemoryBase data, const Shape& shape,
-                                bool transpose) -> MatrixDescriptor {
-    bool is_row_major = LayoutUtil::Minor(shape.layout(), 0) != 0;
-    bool layout_mismatch = LayoutUtil::Minor(shape.layout(), 0) !=
-                           LayoutUtil::Minor(output_shape_.layout(), 0);
-    return MatrixDescriptor(data, transpose ^ layout_mismatch,
-                            shape.dimensions(is_row_major),
-                            shape.dimensions(!is_row_major));
+  auto make_descriptor = [&](se::DeviceMemoryBase data, const Shape& shape,
+                             bool transpose) -> MatrixDescriptor {
+    bool is_row_major = LayoutUtil::Minor(shape.layout(), row_dim) != 0;
+    bool layout_mismatch = LayoutUtil::Minor(shape.layout(), row_dim) !=
+                           LayoutUtil::Minor(output_shape_.layout(), row_dim);
+    return MatrixDescriptor(
+        data, transpose ^ layout_mismatch,
+        shape.dimensions(row_dim + static_cast<int64>(is_row_major)),
+        shape.dimensions(row_dim + static_cast<int64>(!is_row_major)),
+        batch_size);
   };
 
-  DotDimensionNumbers dim_nums = GetDimensionNumbers(*hlo_instruction());
-
   const MatrixDescriptor lhs_descriptor = make_descriptor(
-      lhs_data, lhs_shape_, dim_nums.lhs_contracting_dimensions(0) == 0);
+      lhs_data, lhs_shape_, dim_nums.lhs_contracting_dimensions(0) == row_dim);
   const MatrixDescriptor rhs_descriptor = make_descriptor(
-      rhs_data, rhs_shape_, dim_nums.rhs_contracting_dimensions(0) == 1);
+      rhs_data, rhs_shape_, dim_nums.rhs_contracting_dimensions(0) == col_dim);
 
   // Dispatches to a regular cublas gemm, a gemm-with-algorithm, or attempts to
   // autotune this gemm to figure out the best algorithm.
-  auto launch = [this](MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
-                       MatrixDescriptor output_matrix, se::Stream* stream) {
+  auto launch = [&](MatrixDescriptor lhs_matrix, MatrixDescriptor rhs_matrix,
+                    MatrixDescriptor output_matrix, se::Stream* stream) {
     PrimitiveType element_type = output_shape_.element_type();
     se::blas::ComputationType computation_type =
         GetBlasComputationType(element_type);
 
+    // TODO(b/112111608): Implement auto tune for batched gemm.
+    if (batch_size != 1) {
+      return GetGemmFn(element_type)(lhs_matrix, rhs_matrix, output_matrix,
+                                     alpha_, stream);
+    }
+
+    auto thunk_name = [&] {
+      return hlo_instruction() != nullptr ? hlo_instruction()->ToString()
+                                          : "<null>";
+    };
+
     const string& device_name = stream->parent()->GetDeviceDescription().name();
     auto autotune_it = autotune_results_.find(device_name);
     if (autotune_it == autotune_results_.end()) {
+      VLOG(3) << "Starting autotune of GemmThunk " << thunk_name();
       StatusOr<se::blas::AlgorithmType> best_algorithm =
           GetGemmAutotuneFn(element_type)(lhs_matrix, rhs_matrix, output_matrix,
                                           alpha_, computation_type, stream);
@@ -326,11 +405,11 @@ Status GemmThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
           autotune_results_.insert({device_name, best_algorithm}).first;
 
       if (autotune_it->second.ok()) {
-        VLOG(2) << "Autotune on GemmThunk " << this
+        VLOG(2) << "Autotune on GemmThunk " << thunk_name()
                 << " successful; best algorithm is "
                 << best_algorithm.ValueOrDie();
       } else {
-        VLOG(2) << "Autotune on GemmThunk " << this
+        VLOG(2) << "Autotune on GemmThunk " << thunk_name()
                 << " unsuccessful.  Will use generic gemm.";
       }
     }
@@ -340,7 +419,7 @@ Status GemmThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
     if (best_algorithm.ok()) {
       auto algorithm = best_algorithm.ValueOrDie();
       VLOG(2) << "Using algorithm " << algorithm
-              << " chosen by autotuning on GemmThunk " << this;
+              << " chosen by autotuning on GemmThunk " << thunk_name();
       return GetGemmWithAlgorithmFn(element_type)(
           lhs_matrix, rhs_matrix, output_matrix, alpha_, computation_type,
           algorithm, stream,
@@ -355,16 +434,16 @@ Status GemmThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
 
   auto op_profiler = profiler->MakeScopedInstructionProfiler(hlo_instruction());
   bool launch_ok;
-  if (LayoutUtil::Minor(output_shape_.layout(), 0) == 0) {
-    launch_ok = launch(
-        lhs_descriptor, rhs_descriptor,
-        MatrixDescriptor(output_data, false, output_num_rows, output_num_cols),
-        stream);
+  if (LayoutUtil::Minor(output_shape_.layout(), row_dim) == 0) {
+    launch_ok = launch(lhs_descriptor, rhs_descriptor,
+                       MatrixDescriptor(output_data, false, output_num_rows,
+                                        output_num_cols, batch_size),
+                       stream);
   } else {
-    launch_ok = launch(
-        rhs_descriptor, lhs_descriptor,
-        MatrixDescriptor(output_data, false, output_num_cols, output_num_rows),
-        stream);
+    launch_ok = launch(rhs_descriptor, lhs_descriptor,
+                       MatrixDescriptor(output_data, false, output_num_cols,
+                                        output_num_rows, batch_size),
+                       stream);
   }
 
   if (!launch_ok) {
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_constants.cc b/tensorflow/compiler/xla/service/gpu/gpu_constants.cc
index aa360c7f73..7f0b030fec 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_constants.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_constants.cc
@@ -14,12 +14,23 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/gpu/gpu_constants.h"
+#include "tensorflow/core/framework/allocator.h"
 
 namespace xla {
 namespace gpu {
 
-// http://docs.nvidia.com/cuda/cuda-c-programming-guide/#device-memory-accesses
-const int64 kCudaMallocAlignBytes = 256;
+// kEntryParameterAlignBytes is equal to EIGEN_MAX_ALIGN_BYTES, though including
+// Eigen headers here to get that symbol may not be a good idea.
+// EIGEN_MAX_ALIGN_BYTES may differ between CUDA-enabled builds vs CUDA-disabled
+// builds and we don't want the IR generated by XLA:GPU to depend on that.
+//
+// TODO(b/111767313): Consider raising EIGEN_MAX_ALIGN_BYTES if it helps.
+const int64 kEntryParameterAlignBytes = 16;
+
+const int64 kXlaAllocatedBufferAlignBytes =
+    tensorflow::Allocator::kAllocatorAlignment;
+
+const int64 kConstantBufferAlignBytes = kXlaAllocatedBufferAlignBytes;
 
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_constants.h b/tensorflow/compiler/xla/service/gpu/gpu_constants.h
index eb1ca4c6c9..6f5f1fa09c 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_constants.h
+++ b/tensorflow/compiler/xla/service/gpu/gpu_constants.h
@@ -21,9 +21,15 @@ limitations under the License.
 namespace xla {
 namespace gpu {
 
-// Minimum alignment of cudaMalloc.  We require that buffers created by our
-// DeviceMemoryAllocator, and all input/output buffers, have this alignment.
-extern const int64 kCudaMallocAlignBytes;
+// Minimum alignment for buffers passed as incoming arguments by TensorFlow.
+extern const int64 kEntryParameterAlignBytes;
+
+// Minimum alignment for buffers allocated by XLA: the temp buffers and the live
+// out (result) buffers.
+extern const int64 kXlaAllocatedBufferAlignBytes;
+
+// Minimum alignment for constant buffers.
+extern const int64 kConstantBufferAlignBytes;
 
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc b/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
index fbc1303085..75f414e47f 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
@@ -48,80 +48,17 @@ StatusOr<bool> GpuCopyInsertion::Run(HloModule* module) {
 
   TF_ASSIGN_OR_RETURN(bool changed, generic_copy_insertion.Run(module));
 
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloDataflowAnalysis> dataflow,
-                      HloDataflowAnalysis::Run(*module));
-
-  // Make sure all operands of a library call are in memory instead of constants
-  // in IR. Also, init values of while and conditional nodes cannot be
-  // constants. Insert copies for any constants found at the operands of these
-  // nodes.
-  tensorflow::gtl::FlatSet<HloInstruction*> inserted_copies;
+  // Check the assumption that the epsilon and feature_index constants of the
+  // CUDNN batchnorm op are not shared with other ops where we would replace
+  // them with a copy. These custom op calls are generated with the
+  // CudnnBatchNormRewriter, so this would only happen if HloCSE merges them.
   for (HloComputation* computation : module->computations()) {
     for (HloInstruction* hlo : computation->instructions()) {
-      // Inserts a copy of hlo->operand(n) if it's a constant.
-      auto copy_operand_if_constant = [&](int64 n) -> Status {
-        HloInstruction* operand = hlo->mutable_operand(n);
-        // Skip the operands that have already been replaced with a copy in a
-        // previous iteration (which is possible when a constant is used as an
-        // operand in multiple places).
-        if (ContainsKey(inserted_copies, operand)) {
-          return Status::OK();
-        }
-        for (auto& pair : dataflow->GetInstructionValueSet(operand)) {
-          const HloValueSet& value_set = pair.second;
-          for (const HloValue* value : value_set.values()) {
-            if (value->defining_instruction()->IsConstant() &&
-                !ContainsKey(hlo_to_copy_map_, value->defining_instruction())) {
-              HloInstruction* constant = value->defining_instruction();
-              TF_ASSIGN_OR_RETURN(HloInstruction * copy,
-                                  FindOrInsertCopy(constant));
-              TF_RETURN_IF_ERROR(constant->ReplaceAllUsesWith(copy));
-              inserted_copies.insert(copy);
-              changed = true;
-            }
-          }
-        }
-        return Status::OK();
-      };
-
-      if (IsCustomCallToDnnBatchNorm(*hlo)) {
-        // The epsilon and feature_index operands to a CUDNN batchnorm op don't
-        // need to be materialized in memory -- in fact, they must be constants.
-        // These are the last two operands of all three batchnorm ops.
-        for (int64 i = 0; i < hlo->operand_count() - 2; ++i) {
-          TF_RETURN_IF_ERROR(copy_operand_if_constant(i));
-        }
-      } else if (ImplementedAsLibraryCall(*hlo) ||
-                 hlo->opcode() == HloOpcode::kCrossReplicaSum ||
-                 hlo->opcode() == HloOpcode::kWhile ||
-                 hlo->opcode() == HloOpcode::kConditional) {
-        // For all other library calls, cross-replica-sum, while and conditional
-        // ops materialize all the operands into memory.  (Cross-replica-sum
-        // gets its constant args materialized even if it's not implemented as a
-        // libcall to simplify the implementation.  It's slower, but we can
-        // constant fold away constant args *anyway*, so we just need to make it
-        // work.)
-        for (int64 i = 0; i < hlo->operand_count(); ++i) {
-          TF_RETURN_IF_ERROR(copy_operand_if_constant(i));
-        }
+      if (!IsCustomCallToDnnBatchNorm(*hlo)) {
+        continue;
       }
-    }
-  }
-
-  if (changed) {
-    // Check the assumption that the epsilon and feature_index constants of the
-    // CUDNN batchnorm op are not shared with other ops where we would replace
-    // them with a copy. These custom op calls are generated with the
-    // CudnnBatchNormRewriter, so this would only happen if HloCSE merges them.
-    for (HloComputation* computation : module->computations()) {
-      for (HloInstruction* hlo : computation->instructions()) {
-        if (!IsCustomCallToDnnBatchNorm(*hlo)) {
-          continue;
-        }
-        for (int64 i = hlo->operand_count() - 2; i < hlo->operand_count();
-             ++i) {
-          CHECK_EQ(hlo->operand(i)->opcode(), HloOpcode::kConstant);
-        }
+      for (int64 i = hlo->operand_count() - 2; i < hlo->operand_count(); ++i) {
+        CHECK_EQ(hlo->operand(i)->opcode(), HloOpcode::kConstant);
       }
     }
   }
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_executable.cc b/tensorflow/compiler/xla/service/gpu/gpu_executable.cc
index 0cad2958c7..bb71c79fd7 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_executable.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_executable.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/gpu/buffer_allocations.h"
 #include "tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h"
 #include "tensorflow/compiler/xla/service/logical_buffer.h"
 #include "tensorflow/compiler/xla/service/shaped_buffer.h"
 #include "tensorflow/compiler/xla/service/transfer_manager.h"
@@ -84,7 +85,7 @@ Status GpuExecutable::ExecuteThunks(
   }
 
   // Stream 0 indicates `main_stream` and substreams start from stream 1.
-  std::vector<Pool<se::Stream>::SmartPtr> sub_streams;
+  std::vector<StreamPool::Ptr> sub_streams;
   sub_streams.reserve(thunk_schedule_->StreamCount() - 1);
   while (sub_streams.size() + 1 < thunk_schedule_->StreamCount()) {
     sub_streams.emplace_back();
@@ -181,6 +182,55 @@ Status GpuExecutable::ExecuteThunks(
   return Status::OK();
 }
 
+StatusOr<const GpuExecutable::BufferAllocToDeviceMemoryMap*>
+GpuExecutable::ResolveConstantGlobals(se::StreamExecutor* executor) {
+  tensorflow::mutex_lock lock(module_handle_mutex_);
+  auto it = module_globals_.find(executor);
+  if (it != module_globals_.end()) {
+    return &it->second;
+  }
+
+  se::MultiModuleLoaderSpec module_spec;
+  if (!cubin().empty()) {
+    module_spec.AddCudaCubinInMemory(cubin());
+  }
+  module_spec.AddCudaPtxInMemory(ptx().c_str());
+
+  tensorflow::gtl::FlatMap<int64, se::DeviceMemoryBase> globals;
+  se::ModuleHandle module_handle;
+  executor->LoadModule(module_spec, &module_handle);
+
+  for (BufferAllocation::Index i = 0; i < assignment_->Allocations().size();
+       ++i) {
+    const BufferAllocation& allocation = assignment_->GetAllocation(i);
+    if (allocation.is_constant()) {
+      TF_ASSIGN_OR_RETURN(
+          se::DeviceMemoryBase global,
+          executor->GetUntypedSymbol(
+              llvm_ir::ConstantBufferAllocationToGlobalName(allocation),
+              module_handle));
+      VLOG(3) << "Resolved global "
+              << llvm_ir::ConstantBufferAllocationToGlobalName(allocation)
+              << " to " << global.opaque();
+      InsertOrDie(&globals, i, global);
+
+      const Literal& literal =
+          llvm_ir::LiteralForConstantAllocation(allocation);
+      CHECK(ShapeUtil::IsArray(literal.shape()));
+      if (!ShouldEmitLiteralInLlvmIr(literal)) {
+        VLOG(3) << "H2D memcpy for constant with shape "
+                << ShapeUtil::HumanString(literal.shape());
+        TF_RETURN_IF_ERROR(executor->SynchronousMemcpyH2D(
+            literal.untyped_data(), allocation.size(), &global));
+      }
+    }
+  }
+
+  module_handles_.emplace(executor,
+                          se::ScopedModuleHandle(executor, module_handle));
+  return &module_globals_.emplace(executor, std::move(globals)).first->second;
+}
+
 StatusOr<ScopedShapedBuffer> GpuExecutable::ExecuteOnStream(
     const ServiceExecutableRunOptions* run_options,
     tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
@@ -192,6 +242,10 @@ StatusOr<ScopedShapedBuffer> GpuExecutable::ExecuteOnStream(
   }
 
   BufferAllocations::Builder buffer_allocations_builder;
+  se::StreamExecutor* executor = run_options->stream()->parent();
+
+  TF_ASSIGN_OR_RETURN(auto* const globals, ResolveConstantGlobals(executor));
+
   for (BufferAllocation::Index i = 0; i < assignment_->Allocations().size();
        ++i) {
     const BufferAllocation& allocation = assignment_->GetAllocation(i);
@@ -213,8 +267,12 @@ StatusOr<ScopedShapedBuffer> GpuExecutable::ExecuteOnStream(
 
       buffer_allocations_builder.RegisterBuffer(i, buffer);
     }
+
+    if (allocation.is_constant()) {
+      buffer_allocations_builder.RegisterBuffer(i, FindOrDie(*globals, i));
+    }
   }
-  se::StreamExecutor* executor = run_options->stream()->parent();
+
   TF_ASSIGN_OR_RETURN(
       auto buffer_allocations,
       buffer_allocations_builder.Build(
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_executable.h b/tensorflow/compiler/xla/service/gpu/gpu_executable.h
index 80ec38c3ac..c7ce6d0acb 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_executable.h
+++ b/tensorflow/compiler/xla/service/gpu/gpu_executable.h
@@ -34,6 +34,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 
@@ -66,7 +68,7 @@ class GpuExecutable : public Executable {
   }
 
   // Returns the compiled PTX for the computation.
-  tensorflow::StringPiece ptx() const { return ptx_; }
+  const string& ptx() const { return ptx_; }
 
   // Returns the cubin (compiled PTX) stored in this GpuExecutable.  May be
   // empty, in which case compilation is left up to the GPU driver.
@@ -98,6 +100,15 @@ class GpuExecutable : public Executable {
   // computation. Uses points-to analysis from buffer assignment.
   const PointsToSet& GetRootPointsToSet() const;
 
+  using BufferAllocToDeviceMemoryMap =
+      tensorflow::gtl::FlatMap<BufferAllocation::Index, se::DeviceMemoryBase>;
+
+  // Loads the PTX or CUBIN for this executable into `executor` and resolves the
+  // globals corresponding to constant buffers.  Returns a map mapping buffer
+  // allocation indices to GPU pointers.
+  StatusOr<const BufferAllocToDeviceMemoryMap*> ResolveConstantGlobals(
+      stream_executor::StreamExecutor* executor);
+
   // The LLVM IR, in string format, of the unoptimized module generated for this
   // GpuExecutable. We save a string instead of an llvm::Module* because leaving
   // llvm::Module* in a singleton can cause the heap checker to emit false
@@ -126,6 +137,14 @@ class GpuExecutable : public Executable {
   // memory for every output/temp buffers.
   const std::unique_ptr<const BufferAssignment> assignment_;
 
+  // Cache of module handles and constant buffer allocation maps used by
+  // `ResolveConstantGlobals`.
+  tensorflow::mutex module_handle_mutex_;
+  std::map<stream_executor::StreamExecutor*, se::ScopedModuleHandle>
+      module_handles_ GUARDED_BY(module_handle_mutex_);
+  std::map<stream_executor::StreamExecutor*, BufferAllocToDeviceMemoryMap>
+      module_globals_ GUARDED_BY(module_handle_mutex_);
+
   TF_DISALLOW_COPY_AND_ASSIGN(GpuExecutable);
 };
 
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc
index 09ef62c87f..d033faee8d 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment.cc
@@ -31,20 +31,13 @@ limitations under the License.
 namespace xla {
 namespace gpu {
 
-using stream_executor::dnn::DataLayout;
-using stream_executor::dnn::FilterLayout;
-
-static bool IsVoltaOrLater(const se::StreamExecutor& stream_executor) {
-  int major, minor;
-  CHECK(stream_executor.GetDeviceDescription().cuda_compute_capability(&major,
-                                                                       &minor));
-  return major >= 7;
-}
+using se::dnn::DataLayout;
+using se::dnn::FilterLayout;
 
 // Returns (input, filter, output) layouts.
 static std::tuple<DataLayout, FilterLayout, DataLayout>
 HeuristicLayoutAssignment(const HloInstruction* instr,
-                          stream_executor::StreamExecutor* stream_executor) {
+                          se::StreamExecutor* stream_executor) {
   // DataLayout and FilterLayout uses weird enum names. Translations:
   //   N <=> Batch or Output
   //   C <=> Depth or Input
@@ -52,31 +45,44 @@ HeuristicLayoutAssignment(const HloInstruction* instr,
   //   W <=> X
   //
   // Therefore kOutputInputYX and kBatchDepthYX mean NCHW.
+  //
+  // If you have trouble keeping these straight, consider that all that matters
+  // is the location of the channel dim: Is it major (NCHW), or minor (NHWC)?
+
+  constexpr auto kAllNCHW =
+      std::make_tuple(DataLayout::kBatchDepthYX, FilterLayout::kOutputInputYX,
+                      DataLayout::kBatchDepthYX);
+  constexpr auto kAllNHWC =
+      std::make_tuple(DataLayout::kBatchYXDepth, FilterLayout::kOutputYXInput,
+                      DataLayout::kBatchYXDepth);
 
-  // As of today, our empirical evidence is that cudnn 7.0 is faster on V100 x
-  // fp16 with the mostly-NHWC layout. The heuristic may change as cudnn version
-  // changes, as well as the hardware updates.
+  // If we're not Volta or not fp16, the decision is easy: Use NCHW.
   if (!(instr->operand(0)->shape().element_type() == xla::PrimitiveType::F16 &&
         IsVoltaOrLater(*stream_executor))) {
-    return std::make_tuple(DataLayout::kBatchDepthYX,
-                           FilterLayout::kOutputInputYX,
-                           DataLayout::kBatchDepthYX);
+    return kAllNCHW;
   }
+
   VLOG(2) << "Using heuristic to figure out layouts for " << instr->ToString();
-  // For BackwardInput that has stride, full NHWC layouts run significantly
-  // slower than (NHWC, NCHW, NCHW) or (NHWC, NCHW, NHWC).
+
+  // Empirically we've found with Volta and cudnn 7 that backward-input convs
+  // with stride are significantly faster with NCHW layouts.
   //
-  // TODO(timshen): more closely compare (NHWC, NCHW, NCHW) and (NHWC, NCHW,
-  // NHWC).
+  // We could have used a mixed layout combination, e.g. (NHWC, NCHW, NCHW),
+  // which on paper gives good performance. However, there are two observations:
+  // * a mixed layout combination is more cuDNN-bug prone, based on empirical
+  //   envidence.
+  // * we've also observed that for mixed layouts, cuDNN transposes data back
+  //   and forth from a different layout combination. If we end up with
+  //   transposes anyway, we prefer to have them in XLA, as they can be fused.
+  // TODO(timshen): Figure out the exact condition. This may be achieved by
+  // auto-tuning layouts offline.
   if (instr->custom_call_target() == kCudnnConvBackwardInputCallTarget &&
       window_util::HasStride(instr->window())) {
-    return std::make_tuple(DataLayout::kBatchYXDepth,
-                           FilterLayout::kOutputInputYX,
-                           DataLayout::kBatchDepthYX);
+    return kAllNCHW;
   }
-  return std::make_tuple(DataLayout::kBatchYXDepth,
-                         FilterLayout::kOutputYXInput,
-                         DataLayout::kBatchYXDepth);
+
+  // For other Volta f16 convolutions, use NHWC.
+  return kAllNHWC;
 }
 
 // Adds layout constraints on the cudnn custom-call instruction. The layout
@@ -170,6 +176,38 @@ Status GpuLayoutAssignment::AddBackendConstraints(
       TF_RETURN_IF_ERROR(
           AddBackendConstraintsToDnnConvCustomCall(instruction, constraints));
     }
+
+    // For batched dot we require the default layout.
+    // TODO(b/112111608): This is overly conservative, the only real restriction
+    // is that batch dimensions must be major.
+    if (instruction->opcode() == HloOpcode::kDot &&
+        ImplementedAsGemm(*instruction) &&
+        instruction->dot_dimension_numbers().lhs_batch_dimensions_size() > 0) {
+      // Verify that the batch dims come before the row and col dims.
+      const DotDimensionNumbers& dim_nums =
+          instruction->dot_dimension_numbers();
+      CHECK_EQ(dim_nums.lhs_batch_dimensions_size(),
+               dim_nums.rhs_batch_dimensions_size());
+      CHECK_EQ(dim_nums.lhs_batch_dimensions_size() + 2,
+               ShapeUtil::Rank(instruction->shape()));
+      for (int64 batch_dim : dim_nums.lhs_batch_dimensions()) {
+        CHECK_LT(batch_dim, ShapeUtil::Rank(instruction->shape()) - 2);
+      }
+
+      // Set both inputs and the output to default layout.
+      Shape op0_shape = instruction->operand(0)->shape();
+      LayoutUtil::SetToDefaultLayout(&op0_shape);
+      Shape op1_shape = instruction->operand(1)->shape();
+      LayoutUtil::SetToDefaultLayout(&op1_shape);
+      Shape output_shape = instruction->shape();
+      LayoutUtil::SetToDefaultLayout(&output_shape);
+      TF_RETURN_IF_ERROR(
+          constraints->SetOperandLayout(op0_shape, instruction, 0));
+      TF_RETURN_IF_ERROR(
+          constraints->SetOperandLayout(op1_shape, instruction, 1));
+      TF_RETURN_IF_ERROR(
+          constraints->SetInstructionLayout(output_shape, instruction));
+    }
   }
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc
index e48165c142..286547ebae 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_layout_assignment_test.cc
@@ -20,8 +20,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_layout.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
@@ -31,6 +33,8 @@ namespace xla {
 namespace gpu {
 namespace {
 
+namespace op = xla::testing::opcode_matchers;
+
 using LayoutAssignmentTest = HloTestBase;
 
 TEST_F(LayoutAssignmentTest, Elementwise) {
@@ -132,10 +136,10 @@ TEST_F(LayoutAssignmentTest, BatchNormInference) {
           HloInstruction::CreateParameter(4, aux_shape, "variance"));
 
       auto* epsilon = builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(1)));
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1)));
       auto* feature_index =
           builder.AddInstruction(HloInstruction::CreateConstant(
-              Literal::CreateR0<int64>(kFeatureIndex)));
+              LiteralUtil::CreateR0<int64>(kFeatureIndex)));
 
       auto* batchnorm = builder.AddInstruction(HloInstruction::CreateCustomCall(
           shape,
@@ -201,10 +205,10 @@ TEST_F(LayoutAssignmentTest, BatchNormTraining) {
           HloInstruction::CreateParameter(2, offset_scale_shape, "offset"));
 
       auto* epsilon = builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(1)));
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1)));
       auto* feature_index =
           builder.AddInstruction(HloInstruction::CreateConstant(
-              Literal::CreateR0<int64>(kFeatureIndex)));
+              LiteralUtil::CreateR0<int64>(kFeatureIndex)));
 
       auto* batchnorm = builder.AddInstruction(HloInstruction::CreateCustomCall(
           batchnorm_shape, {operand, scale, offset, epsilon, feature_index},
@@ -278,10 +282,10 @@ TEST_F(LayoutAssignmentTest, BatchNormGrad) {
             HloInstruction::CreateParameter(4, shape, "var"));
 
         auto* epsilon = builder.AddInstruction(
-            HloInstruction::CreateConstant(Literal::CreateR0<float>(1)));
+            HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1)));
         auto* feature_index =
             builder.AddInstruction(HloInstruction::CreateConstant(
-                Literal::CreateR0<int64>(kFeatureIndex)));
+                LiteralUtil::CreateR0<int64>(kFeatureIndex)));
 
         auto* batchnorm =
             builder.AddInstruction(HloInstruction::CreateCustomCall(
@@ -327,6 +331,33 @@ TEST_F(LayoutAssignmentTest, BatchNormGrad) {
   }
 }
 
+TEST_F(LayoutAssignmentTest, DotLayout) {
+  const char* hlo_text = R"(
+  HloModule DotLayout
+  ENTRY dot {
+    p0 = f32[8,8,256,64]{3,1,2,0} parameter(0)
+    p1 = f32[8,8,256,64]{3,1,2,0} parameter(1)
+    ROOT dot.1330.10585 = f32[8,8,256,256]{3,2,1,0} dot(p0, p1),
+      lhs_batch_dims={0,1}, lhs_contracting_dims={3},
+      rhs_batch_dims={0,1}, rhs_contracting_dims={3}
+  })";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(hlo_text));
+
+  ComputationLayout computation_layout(
+      module->entry_computation()->ComputeProgramShape());
+  GpuLayoutAssignment layout_assignment(&computation_layout,
+                                        backend().default_stream_executor());
+  EXPECT_TRUE(layout_assignment.Run(module.get()).ValueOrDie());
+
+  Shape expected_shape =
+      ShapeUtil::MakeShapeWithLayout(F32, {8, 8, 256, 64}, {3, 2, 1, 0});
+  EXPECT_THAT(module->entry_computation()->root_instruction(),
+              op::Dot(op::ShapeWithLayout(expected_shape),
+                      op::ShapeWithLayout(expected_shape)));
+}
+
 }  // namespace
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.cc b/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.cc
index 5343497c03..79b3f1efec 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.cc
@@ -20,8 +20,10 @@ limitations under the License.
 #include <vector>
 
 #include "llvm/IR/DataLayout.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
-#include "tensorflow/compiler/xla/service/gpu/gpu_compiler.h"
+#include "tensorflow/compiler/xla/service/gpu/nvptx_compiler.h"
+#include "tensorflow/compiler/xla/service/gpu/outfeed_manager.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -34,15 +36,14 @@ limitations under the License.
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 
 namespace xla {
+namespace gpu {
 
 // TODO(b/30467474) Once GPU infeed implementation settles, consider
 // folding back the cpu and gpu infeed implementations into a generic
 // one if possible.
-GpuTransferManager::GpuTransferManager()
-    : GenericTransferManager(
-          se::cuda::kCudaPlatformId,
-          /*pointer_size=*/llvm::DataLayout(gpu::GpuCompiler::kDataLayout)
-              .getPointerSize(0 /* default address space */)) {}
+GpuTransferManager::GpuTransferManager(se::Platform::Id id,
+                                       unsigned pointer_size)
+    : GenericTransferManager(id, pointer_size) {}
 
 Status GpuTransferManager::TransferLiteralToInfeed(
     se::StreamExecutor* executor, const LiteralSlice& literal) {
@@ -50,48 +51,28 @@ Status GpuTransferManager::TransferLiteralToInfeed(
   VLOG(2) << "Transferring literal to infeed with shape: "
           << ShapeUtil::HumanString(shape);
 
-  if (!ShapeUtil::IsTuple(shape)) {
-    int64 size = GetByteSizeRequirement(shape);
-    return TransferBufferToInfeed(executor, size, literal.untyped_data());
-  }
-
   // For a tuple, we transfer each of its elements to the device and
   // enqueue the resulting destination device addresses with the
   // infeed manager.
-  std::vector<gpu::InfeedBuffer*> buffers;
-  auto cleanup = tensorflow::gtl::MakeCleanup([buffers]() {
-    for (gpu::InfeedBuffer* b : buffers) {
-      b->Done();
-    }
-  });
+  ShapeTree<InfeedBuffer> buffer_tree(shape);
 
   TF_RETURN_IF_ERROR(ShapeUtil::ForEachSubshapeWithStatus(
       shape, [&](const Shape& literal_subshape, const ShapeIndex& index) {
         if (ShapeUtil::IsArray(literal_subshape)) {
           int64 tuple_element_size = GetByteSizeRequirement(literal_subshape);
           TF_ASSIGN_OR_RETURN(
-              gpu::InfeedBuffer * buffer,
+              *buffer_tree.mutable_element(index),
               TransferBufferToInfeedInternal(executor, tuple_element_size,
                                              literal.untyped_data(index)));
-          buffers.push_back(buffer);
         }
         return Status::OK();
       }));
 
-  cleanup.release();
-  return EnqueueBuffersToInfeed(executor, buffers);
-}
-
-Status GpuTransferManager::TransferBufferToInfeed(se::StreamExecutor* executor,
-                                                  int64 size,
-                                                  const void* source) {
-  TF_ASSIGN_OR_RETURN(gpu::InfeedBuffer * buffer,
-                      TransferBufferToInfeedInternal(executor, size, source));
-  return EnqueueBuffersToInfeed(executor, {buffer});
+  return EnqueueBuffersToInfeed(executor, std::move(buffer_tree));
 }
 
 Status GpuTransferManager::EnqueueBuffersToInfeed(
-    se::StreamExecutor* executor, std::vector<gpu::InfeedBuffer*> buffers) {
+    se::StreamExecutor* executor, ShapeTree<InfeedBuffer> buffers) {
   gpu::InfeedManager* infeed_manager = gpu::GetOrCreateInfeedManager();
   se::Stream* stream = infeed_manager->GetStream(executor);
 
@@ -101,21 +82,18 @@ Status GpuTransferManager::EnqueueBuffersToInfeed(
   // possible.
   Status block_status = stream->BlockHostUntilDone();
   if (!block_status.ok()) {
-    for (gpu::InfeedBuffer* b : buffers) {
-      b->Done();
-    }
     return InternalError("Failed to complete data transfer on stream %p: %s",
                          stream, block_status.error_message().c_str());
   }
 
-  infeed_manager->EnqueueBuffers(buffers);
+  infeed_manager->EnqueueDestination(std::move(buffers));
 
   VLOG(2) << "Infeed data transferred";
 
   return Status::OK();
 }
 
-StatusOr<gpu::InfeedBuffer*> GpuTransferManager::TransferBufferToInfeedInternal(
+StatusOr<InfeedBuffer> GpuTransferManager::TransferBufferToInfeedInternal(
     se::StreamExecutor* executor, int64 size, const void* source) {
   if (size > std::numeric_limits<int32>::max()) {
     return InvalidArgument("Infeed shape is too large: needs %lld bytes", size);
@@ -131,23 +109,84 @@ StatusOr<gpu::InfeedBuffer*> GpuTransferManager::TransferBufferToInfeedInternal(
     return InternalError("Failed to obtain a stream");
   }
 
-  gpu::InfeedBuffer* buffer = new gpu::InfeedBuffer(executor, size);
-  stream->ThenMemcpy(buffer->device_memory(), source, size);
+  InfeedBuffer buffer(executor, size);
+  stream->ThenMemcpy(buffer.device_memory(), source, size);
 
   VLOG(2) << "Queued infeed data on stream " << stream;
 
-  return buffer;
+  return std::move(buffer);
+}
+
+static std::unique_ptr<Literal> ShapeTreeToLiteral(
+    ShapeTree<std::unique_ptr<gpu::OutfeedBuffer>>* shape_tree) {
+  // This is a struct instead of a lambda for std::function-free recursion.
+  struct Helper {
+    static std::unique_ptr<Literal> helper(
+        ShapeTree<std::unique_ptr<gpu::OutfeedBuffer>>* shape_tree,
+        ShapeIndex* index) {
+      const Shape& shape = ShapeUtil::GetSubshape(shape_tree->shape(), *index);
+      if (ShapeUtil::IsArray(shape)) {
+        return (*shape_tree->mutable_element(*index))->WaitUntilAvailable();
+      }
+
+      CHECK(ShapeUtil::IsTuple(shape))
+          << ShapeUtil::HumanStringWithLayout(shape);
+      const int64 tuple_element_count = ShapeUtil::TupleElementCount(shape);
+      index->push_back(0);
+      std::vector<std::unique_ptr<Literal>> tuple_operands;
+      for (int64 i = 0; i < tuple_element_count; ++i) {
+        index->back() = i;
+        tuple_operands.push_back(helper(shape_tree, index));
+      }
+      index->pop_back();
+      return LiteralUtil::MakeTupleOwned(std::move(tuple_operands));
+    }
+  };
+  ShapeIndex index;
+  return Helper::helper(shape_tree, &index);
+}
+
+Status GpuTransferManager::TransferLiteralFromOutfeed(
+    se::StreamExecutor* /*executor*/, const Shape& literal_shape,
+    Literal* literal) {
+  ShapeTree<std::unique_ptr<gpu::OutfeedBuffer>> outfeed_buffers(
+      &literal_shape);
+
+  // First create a tree of literal buffers that the device can write to.
+  outfeed_buffers.ForEachMutableElement(
+      [&](const ShapeIndex& index,
+          std::unique_ptr<gpu::OutfeedBuffer>* buffer) {
+        const Shape& shape = ShapeUtil::GetSubshape(literal_shape, index);
+        // Do not transfer tuple index buffers.
+        if (ShapeUtil::IsTuple(shape)) {
+          return;
+        }
+        *buffer = MakeUnique<gpu::OutfeedBuffer>(GetByteSizeRequirement(shape));
+      });
+
+  // Give the tree of buffers to the outfeed mananger. The device will fill it
+  // while we're waiting for it below.
+  gpu::OutfeedManager* outfeed_manager = gpu::GetOrCreateOutfeedManager();
+  outfeed_manager->EnqueueDestination(&outfeed_buffers);
+
+  // Now turn the tree of buffers back into a literal.
+  *literal = std::move(*ShapeTreeToLiteral(&outfeed_buffers));
+  return Status::OK();
 }
 
+}  // namespace gpu
 }  // namespace xla
 
-static std::unique_ptr<xla::TransferManager> CreateGpuTransferManager() {
-  return xla::MakeUnique<xla::GpuTransferManager>();
+static std::unique_ptr<xla::TransferManager> CreateNVPTXTransferManager() {
+  return xla::MakeUnique<xla::gpu::GpuTransferManager>(
+      /*id=*/stream_executor::cuda::kCudaPlatformId,
+      /*pointer_size=*/llvm::DataLayout(xla::gpu::NVPTXCompiler::kDataLayout)
+          .getPointerSize(0 /* default address space */));
 }
 
 static bool InitModule() {
   xla::TransferManager::RegisterTransferManager(
-      stream_executor::cuda::kCudaPlatformId, &CreateGpuTransferManager);
+      stream_executor::cuda::kCudaPlatformId, &CreateNVPTXTransferManager);
   return true;
 }
 static bool module_initialized = InitModule();
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.h b/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.h
index 09f8227f50..dceeb9e2eb 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.h
+++ b/tensorflow/compiler/xla/service/gpu/gpu_transfer_manager.h
@@ -21,6 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/generic_transfer_manager.h"
 #include "tensorflow/compiler/xla/service/gpu/infeed_manager.h"
 #include "tensorflow/compiler/xla/service/transfer_manager.h"
+#include "tensorflow/compiler/xla/shape_tree.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/platform/macros.h"
@@ -28,33 +29,36 @@ limitations under the License.
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
+namespace gpu {
 
 // An implementation of the XLA GenericTransferManager that
 // handles GPU-specific infeed.
 class GpuTransferManager : public GenericTransferManager {
  public:
-  GpuTransferManager();
+  GpuTransferManager(se::Platform::Id id, unsigned pointer_size);
   ~GpuTransferManager() override {}
 
   Status TransferLiteralToInfeed(se::StreamExecutor* executor,
                                  const LiteralSlice& literal) override;
-  Status TransferBufferToInfeed(se::StreamExecutor* executor, int64 size,
-                                const void* source) override;
+  Status TransferLiteralFromOutfeed(se::StreamExecutor* executor,
+                                    const Shape& literal_shape,
+                                    Literal* literal) override;
 
  private:
   // Initiates the infeed data transfers. InfeedBuffer->Done() must be
   // called to clean up the memory allocated for InfeedBuffer.
-  StatusOr<gpu::InfeedBuffer*> TransferBufferToInfeedInternal(
+  StatusOr<InfeedBuffer> TransferBufferToInfeedInternal(
       se::StreamExecutor* executor, int64 size, const void* source);
 
   // Enqueues infeed data buffers with the infeed manager after their
   // transfer completes.
   Status EnqueueBuffersToInfeed(se::StreamExecutor* executor,
-                                std::vector<gpu::InfeedBuffer*> buffers);
+                                ShapeTree<InfeedBuffer> buffers);
 
   TF_DISALLOW_COPY_AND_ASSIGN(GpuTransferManager);
 };
 
+}  // namespace gpu
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_TRANSFER_MANAGER_H_
diff --git a/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.cc b/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.cc
index 3e96beb575..1722676930 100644
--- a/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.cc
+++ b/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.cc
@@ -17,12 +17,13 @@ limitations under the License.
 
 #include <memory>
 #include <stack>
+#include <unordered_set>
 #include <vector>
 
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_execution_profile.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
-#include "tensorflow/compiler/xla/service/pool.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 #include "tensorflow/core/util/ptr_util.h"
@@ -36,10 +37,9 @@ void InitAndStartTimer(std::stack<std::unique_ptr<se::Timer>>* timers,
   stream->InitTimer(timers->top().get()).ThenStartTimer(timers->top().get());
 }
 
-uint64 GetCyclesTaken(
-    std::stack<std::unique_ptr<se::Timer>>* timers,
-    const std::vector<Pool<se::Stream>::SmartPtr>& sub_streams,
-    se::Stream* stream, double clock_rate_ghz) {
+uint64 GetCyclesTaken(std::stack<std::unique_ptr<se::Timer>>* timers,
+                      const std::vector<StreamPool::Ptr>& sub_streams,
+                      se::Stream* stream, double clock_rate_ghz) {
   CHECK_GT(timers->size(), 0);
   stream->ThenWaitFor(&sub_streams);
   stream->ThenStopTimer(timers->top().get());
@@ -52,7 +52,7 @@ uint64 GetCyclesTaken(
 
 HloExecutionProfiler::HloExecutionProfiler(
     bool do_profile, HloExecutionProfile* profile, se::Stream* stream,
-    const std::vector<Pool<se::Stream>::SmartPtr>& sub_streams,
+    const std::vector<StreamPool::Ptr>& sub_streams,
     const HloComputation* computation)
     : do_profile_(do_profile),
       profile_(profile),
@@ -99,6 +99,7 @@ void HloExecutionProfiler::StartHloInstruction() {
 void HloExecutionProfiler::FinishHloInstruction(
     const HloInstruction* hlo_instruction) {
   if (do_profile_) {
+    hlo_instructions_.erase(hlo_instruction);
     profile_->SetCyclesTakenBy(
         hlo_instruction,
         GetCyclesTaken(&timers_, sub_streams_, stream_, clock_rate_ghz_));
@@ -108,6 +109,12 @@ void HloExecutionProfiler::FinishHloInstruction(
 std::unique_ptr<ScopedInstructionProfiler>
 HloExecutionProfiler::MakeScopedInstructionProfiler(
     const HloInstruction* hlo_instruction) {
+  if (do_profile_ && hlo_instruction != nullptr) {
+    // Make sure that we are not already measuring the time for the same
+    // 'hlo_instruction'.
+    CHECK(hlo_instructions_.insert(hlo_instruction).second)
+        << hlo_instruction->name();
+  }
   return MakeUnique<ScopedInstructionProfiler>(this, hlo_instruction);
 }
 
diff --git a/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.h b/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.h
index e5c655edc6..80cde75f2b 100644
--- a/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.h
+++ b/tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.h
@@ -18,12 +18,13 @@ limitations under the License.
 
 #include <memory>
 #include <stack>
+#include <unordered_set>
 #include <vector>
 
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_execution_profile.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
-#include "tensorflow/compiler/xla/service/pool.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 
 namespace xla {
@@ -37,10 +38,10 @@ class ScopedInstructionProfiler;
 class HloExecutionProfiler {
  public:
   // If profiling is enabled, start an execution timer running.
-  explicit HloExecutionProfiler(
-      bool do_profile, HloExecutionProfile* profile, se::Stream* stream,
-      const std::vector<Pool<se::Stream>::SmartPtr>& sub_streams,
-      const HloComputation* computation);
+  explicit HloExecutionProfiler(bool do_profile, HloExecutionProfile* profile,
+                                se::Stream* stream,
+                                const std::vector<StreamPool::Ptr>& sub_streams,
+                                const HloComputation* computation);
 
   // If profiling is enabled, sets the total cycle count on the profile from the
   // execution timer.
@@ -71,9 +72,12 @@ class HloExecutionProfiler {
   double clock_rate_ghz_;
   HloExecutionProfile* profile_;
   se::Stream* stream_;
-  const std::vector<Pool<se::Stream>::SmartPtr>& sub_streams_;
+  const std::vector<StreamPool::Ptr>& sub_streams_;
   const HloComputation* computation_;
   std::stack<std::unique_ptr<se::Timer>> timers_;
+  // Contains the HLO instructions for which we are currently measuring the
+  // time.
+  std::unordered_set<const HloInstruction*> hlo_instructions_;
   bool finished_execution_ = false;
 };
 
diff --git a/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc b/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc
index 375709150e..19de37b0fb 100644
--- a/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc
+++ b/tensorflow/compiler/xla/service/gpu/hlo_schedule.cc
@@ -100,7 +100,7 @@ GpuHloOrdering::GpuHloOrdering(
       if (last_instruction_per_stream[stream_no] != nullptr) {
         immediate_preds.push_back(last_instruction_per_stream[stream_no]);
       }
-      predecessor_map->SetReachabilityToUnion(immediate_preds, hlo);
+      predecessor_map->FastSetReachabilityToUnion(immediate_preds, hlo);
       last_instruction_per_stream[stream_no] = hlo;
     } else {
       // Only parameters and constants don't have an assigned stream, since they
diff --git a/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.cc b/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.cc
index d420863b85..8c11cd0541 100644
--- a/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.cc
+++ b/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.cc
@@ -18,8 +18,10 @@ limitations under the License.
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Instructions.h"
+#include "tensorflow/compiler/xla/service/gpu/buffer_allocations.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h"
 #include "tensorflow/core/lib/strings/str_util.h"
@@ -39,7 +41,7 @@ void HloToIrBindings::EmitBasePointersForHlos(
   // I/O HLOs are bound to the arguments of the current IR function. I.e.,
   //
   // void IrFunction(io_0, io_1, ..., io_{m-1}, temp_buffer_base) {
-  llvm::Function* function = ir_builder_->GetInsertBlock()->getParent();
+  llvm::Function* function = b_->GetInsertBlock()->getParent();
   CHECK_EQ(io_hlos.size() + 1, function->arg_size());
 
   // An HLO can have duplicated operands. This data structure remembers which
@@ -79,8 +81,8 @@ void HloToIrBindings::EmitBasePointersForHlos(
         const int64 offset = slice.offset();
         CHECK_NE(nullptr, temp_buffer_base_);
         // Emit IR for GetTupleElement instruction and bind to emitted value.
-        llvm::Value* base_ptr = ir_builder_->CreateInBoundsGEP(
-            temp_buffer_base_, ir_builder_->getInt64(offset));
+        llvm::Value* base_ptr =
+            b_->CreateInBoundsGEP(temp_buffer_base_, b_->getInt64(offset));
         BindHloToIrValue(*non_io_hlo,
                          EmitGetTupleElement(non_io_hlo, base_ptr));
       }
@@ -108,15 +110,20 @@ void HloToIrBindings::EmitBasePointersForHlos(
           if (slice.allocation()->is_thread_local()) {
             llvm::Type* pointee_type =
                 llvm_ir::ShapeToIrType(non_io_hlo->shape(), module_);
-            BindHloToIrValue(*non_io_hlo,
-                             ir_builder_->CreateAlloca(pointee_type), index);
+            BindHloToIrValue(*non_io_hlo, b_->CreateAlloca(pointee_type),
+                             index);
+          } else if (slice.allocation()->is_constant()) {
+            llvm::Value* global_for_constant =
+                module_->getGlobalVariable(llvm_ir::AsStringRef(
+                    llvm_ir::ConstantBufferAllocationToGlobalName(
+                        *slice.allocation())));
+            BindHloToIrValue(*non_io_hlo, global_for_constant);
           } else {
             const int64 offset = slice.offset();
             CHECK_NE(nullptr, temp_buffer_base_);
             BindHloToIrValue(
                 *non_io_hlo,
-                ir_builder_->CreateInBoundsGEP(temp_buffer_base_,
-                                               ir_builder_->getInt64(offset)),
+                b_->CreateInBoundsGEP(temp_buffer_base_, b_->getInt64(offset)),
                 index);
           }
         });
@@ -129,11 +136,19 @@ llvm::Value* HloToIrBindings::EmitGetTupleElement(const HloInstruction* gte,
   if (gte->operand(0)->opcode() != HloOpcode::kGetTupleElement) {
     return llvm_ir::EmitGetTupleElement(
         gte->shape(), gte->tuple_index(), /*alignment=*/1,
-        GetTypedIrValue(*gte->operand(0), {}, base_ptr), ir_builder_, module_);
+        GetTypedIrValue(*gte->operand(0), {}, base_ptr), b_, module_);
   }
   return llvm_ir::EmitGetTupleElement(
       gte->shape(), gte->tuple_index(), /*alignment=*/1,
-      EmitGetTupleElement(gte->operand(0), base_ptr), ir_builder_, module_);
+      EmitGetTupleElement(gte->operand(0), base_ptr), b_, module_);
+}
+
+// Returns true if `value` has a name that should not be changed.
+static bool HasMeaningfulName(llvm::Value* value) {
+  if (auto* global = llvm::dyn_cast<llvm::GlobalValue>(value)) {
+    return global->getLinkage() != llvm::GlobalValue::PrivateLinkage;
+  }
+  return false;
 }
 
 llvm::Value* HloToIrBindings::GetTypedIrValue(const HloInstruction& hlo,
@@ -145,14 +160,18 @@ llvm::Value* HloToIrBindings::GetTypedIrValue(const HloInstruction& hlo,
 
   llvm::Value* typed_ir_value;
   if (llvm::isa<llvm::GlobalVariable>(ir_value)) {
-    typed_ir_value = llvm::ConstantExpr::getBitCast(
+    typed_ir_value = llvm::ConstantExpr::getPointerBitCastOrAddrSpaceCast(
         llvm::cast<llvm::GlobalVariable>(ir_value), dest_type);
   } else {
-    typed_ir_value =
-        ir_builder_->CreateBitCast(ir_value, pointee_type->getPointerTo());
+    typed_ir_value = b_->CreateBitCast(ir_value, pointee_type->getPointerTo());
+  }
+  if (!HasMeaningfulName(ir_value)) {
+    ir_value->setName(llvm_ir::AsStringRef(llvm_ir::IrName(&hlo, "raw")));
+  }
+  if (!HasMeaningfulName(typed_ir_value)) {
+    typed_ir_value->setName(
+        llvm_ir::AsStringRef(llvm_ir::IrName(&hlo, "typed")));
   }
-  ir_value->setName(llvm_ir::AsStringRef(llvm_ir::IrName(&hlo, "raw")));
-  typed_ir_value->setName(llvm_ir::AsStringRef(llvm_ir::IrName(&hlo, "typed")));
   return typed_ir_value;
 }
 
diff --git a/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.h b/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.h
index a86e6e78c6..eee40b0e91 100644
--- a/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.h
+++ b/tensorflow/compiler/xla/service/gpu/hlo_to_ir_bindings.h
@@ -36,14 +36,13 @@ class HloToIrBindings {
  public:
   HloToIrBindings(const HloModule& module,
                   const BufferAssignment* buffer_assignment,
-                  llvm::IRBuilder<>* ir_builder, llvm::Module* llvm_module,
+                  llvm::IRBuilder<>* b, llvm::Module* llvm_module,
                   bool is_nested)
       : buffer_assignment_(buffer_assignment),
         is_nested_(is_nested),
-        ir_builder_(ir_builder),
+        b_(b),
         module_(llvm_module),
-        alias_analysis_(module, *buffer_assignment_,
-                        &ir_builder_->getContext()) {}
+        alias_analysis_(module, *buffer_assignment_, &b_->getContext()) {}
 
   void EmitBasePointersForHlos(
       tensorflow::gtl::ArraySlice<const HloInstruction*> io_hlos,
@@ -104,7 +103,7 @@ class HloToIrBindings {
 
   const bool is_nested_;
 
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   llvm::Module* module_;
 
   // Stores the underlying llvm::IrArray for each HloInstruction.
diff --git a/tensorflow/compiler/xla/service/gpu/infeed_manager.cc b/tensorflow/compiler/xla/service/gpu/infeed_manager.cc
index ae310beefa..c5f0cdf6cd 100644
--- a/tensorflow/compiler/xla/service/gpu/infeed_manager.cc
+++ b/tensorflow/compiler/xla/service/gpu/infeed_manager.cc
@@ -15,76 +15,13 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/gpu/infeed_manager.h"
 
-#include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
-#include "tensorflow/core/platform/logging.h"
 
 namespace xla {
 namespace gpu {
 
-InfeedManager::InfeedManager() : host_to_device_executor_(nullptr) {}
-
-void InfeedManager::Reset() {
-  tensorflow::mutex_lock l(mu_);
-  CHECK(dequeued_buffer_.empty());
-  for (auto buffer : enqueued_buffer_) {
-    buffer->Done();
-  }
-  enqueued_buffer_.clear();
-}
-
-void InfeedManager::EnqueueBuffers(const std::vector<InfeedBuffer*>& buffers) {
-  tensorflow::mutex_lock l(mu_);
-  bool was_empty = enqueued_buffer_.empty();
-  for (gpu::InfeedBuffer* b : buffers) {
-    enqueued_buffer_.push_back(b);
-  }
-  if (was_empty) {
-    // This has the potential to suffer from the notified thread
-    // immediately trying and failing to acquire mu_, but seems
-    // preferable to the alternative of notifying outside the lock
-    // on every enqueue.
-    cv_.notify_one();
-  }
-}
-
-InfeedBuffer* InfeedManager::BlockingDequeueBuffer() {
-  bool became_empty = false;
-  InfeedBuffer* current_buffer;
-  {
-    tensorflow::mutex_lock l(mu_);
-    while (enqueued_buffer_.empty()) {
-      cv_.wait(l);
-    }
-    current_buffer = enqueued_buffer_.front();
-    enqueued_buffer_.pop_front();
-    dequeued_buffer_.insert(current_buffer);
-    if (enqueued_buffer_.empty()) {
-      became_empty = true;
-    }
-  }
-  if (became_empty) {
-    for (const auto& callback : on_empty_callbacks_) {
-      callback();
-    }
-  }
-  return current_buffer;
-}
-
-void InfeedManager::ReleaseBuffers(const std::vector<InfeedBuffer*>& buffers) {
-  {
-    tensorflow::mutex_lock l(mu_);
-    for (gpu::InfeedBuffer* b : buffers) {
-      CHECK(ContainsKey(dequeued_buffer_, b));
-      dequeued_buffer_.erase(b);
-    }
-  }
-  for (gpu::InfeedBuffer* b : buffers) {
-    b->Done();
-  }
-}
-
 se::Stream* InfeedManager::GetStream(se::StreamExecutor* executor) {
+  tensorflow::mutex_lock l(host_to_device_stream_mu_);
   if (host_to_device_executor_ == nullptr) {
     host_to_device_executor_ = executor;
     host_to_device_stream_ = MakeUnique<se::Stream>(executor);
@@ -100,10 +37,6 @@ se::Stream* InfeedManager::GetStream(se::StreamExecutor* executor) {
   return host_to_device_stream_.get();
 }
 
-void InfeedManager::RegisterOnEmptyCallback(std::function<void()> callback) {
-  on_empty_callbacks_.push_back(std::move(callback));
-}
-
 InfeedManager* GetOrCreateInfeedManager() {
   static InfeedManager* manager = new InfeedManager;
   return manager;
diff --git a/tensorflow/compiler/xla/service/gpu/infeed_manager.h b/tensorflow/compiler/xla/service/gpu/infeed_manager.h
index a3fc15cfe3..7e418882e0 100644
--- a/tensorflow/compiler/xla/service/gpu/infeed_manager.h
+++ b/tensorflow/compiler/xla/service/gpu/infeed_manager.h
@@ -20,12 +20,9 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_INFEED_MANAGER_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_INFEED_MANAGER_H_
 
-#include <deque>
-#include <vector>
-
+#include "tensorflow/compiler/xla/service/gpu/xfeed_queue.h"
+#include "tensorflow/compiler/xla/shape_tree.h"
 #include "tensorflow/compiler/xla/types.h"
-#include "tensorflow/core/lib/gtl/flatset.h"
-#include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 
 namespace xla {
@@ -47,90 +44,41 @@ namespace gpu {
 // the client. The client manages the memory of the buffer.
 class InfeedBuffer {
  public:
+  InfeedBuffer() = default;
   InfeedBuffer(se::StreamExecutor* executor, int64 length)
-      : executor_(executor), length_(length) {
-    device_memory_ = executor_->AllocateArray<uint8>(length);
-    CHECK(!device_memory_.is_null());
+      : device_memory_(executor, executor->AllocateArray<uint8>(length)),
+        length_(length) {
+    CHECK(!device_memory_->is_null());
   }
 
-  ~InfeedBuffer() { executor_->Deallocate(&device_memory_); }
-
   int64 length() const { return length_; }
 
-  // Callback to signal that this buffer is consumed. This helps the
-  // client to manage memory for the infeed buffers.
-  void Done() { delete this; }
-
-  se::DeviceMemoryBase* device_memory() { return &device_memory_; }
+  se::DeviceMemoryBase* device_memory() { return device_memory_.ptr(); }
 
  private:
-  se::StreamExecutor* executor_;  // Not owned.
-  const int64 length_;
-  se::DeviceMemoryBase device_memory_;
+  se::ScopedDeviceMemory<uint8> device_memory_;
+  int64 length_;
 };
 
 // Client-side class used to enqueue infeed buffers.
-class InfeedManager {
+class InfeedManager : public XfeedQueue<ShapeTree<InfeedBuffer>> {
  public:
-  InfeedManager();
-
-  // Calls the completion callback for any enqueued buffers that have
-  // not been dequeued by the runtime, and empties the infeed
-  // queue. Reset may not be called while a runtime computation is
-  // processing a dequeued buffer. The only safe way to ensure this
-  // condition is to call Reset when no computation is taking place.
-  void Reset();
-
-  // Adds a set of buffers to the infeed queue atomically. buffer->Done
-  // will be called when the buffer will no longer be accessed by the
-  // InfeedManager, either as a result of a call to Reset or because the
-  // runtime has dequeued and used the buffer.
-  void EnqueueBuffers(const std::vector<InfeedBuffer*>& buffers);
-
-  // Blocks until the infeed queue is non-empty, then returns the
-  // buffer at the head of the queue. Adds the current buffer to the
-  // to-be released set.
-  InfeedBuffer* BlockingDequeueBuffer();
-
-  // Releases a set of buffers from the to-be released set.
-  void ReleaseBuffers(const std::vector<InfeedBuffer*>& buffers);
-
   // Returns a cached stream associated with an executor. Allocates a
   // new stream on the first invocation. On subsequent invocations, if
   // the cached executor is not the same as the requested executor,
   // returns null.
   se::Stream* GetStream(se::StreamExecutor* executor);
 
-  // Registers a callback that will be called when 'enqueued_buffer_' becomes
-  // empty.
-  void RegisterOnEmptyCallback(std::function<void()> callback);
-
  private:
-  // TODO(b/30467474): Revisit if this mutex becomes a point of
-  // contention.
-  tensorflow::mutex mu_;
-
-  // Condition variable that is signaled every time a buffer is
-  // enqueued to an empty queue.
-  tensorflow::condition_variable cv_;
-
-  // InfeedBuffer* queue contents are not owned, but buffer->Done must
-  // be called when the buffer is no longer needed by the runtime.
-  std::deque<InfeedBuffer*> enqueued_buffer_;
-
-  // Buffers that are dequeued and currently being processed by the
-  // runtime. Not owned.
-  tensorflow::gtl::FlatSet<const InfeedBuffer*> dequeued_buffer_;
+  // Mutex for serializing the creation of host_to_device_stream_.
+  tensorflow::mutex host_to_device_stream_mu_;
 
   // Cached host to device stream for queuing infeed data.
-  std::unique_ptr<se::Stream> host_to_device_stream_;
+  std::unique_ptr<se::Stream> host_to_device_stream_
+      GUARDED_BY(host_to_device_stream_mu_);
 
   // Executor that the host_to_device_stream belongs to. Not owned.
-  se::StreamExecutor* host_to_device_executor_;
-
-  // List of callbacks which will be called when 'enqueued_buffer_' becomes
-  // empty.
-  std::vector<std::function<void()>> on_empty_callbacks_;
+  se::StreamExecutor* host_to_device_executor_ = nullptr;
 };
 
 // Singleton creator-or-accessor: Returns the GPU infeed manager.
diff --git a/tensorflow/compiler/xla/service/gpu/infeed_thunk.cc b/tensorflow/compiler/xla/service/gpu/infeed_thunk.cc
index 62915febb1..fee6d2af3b 100644
--- a/tensorflow/compiler/xla/service/gpu/infeed_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/infeed_thunk.cc
@@ -30,51 +30,68 @@ InfeedThunk::InfeedThunk(
 Status InfeedThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
                                     se::Stream* stream,
                                     HloExecutionProfiler* profiler) {
-  VLOG(2) << "Infeeding to GPU ";
+  VLOG(2) << "Infeeding to GPU: " << hlo_instruction()->ToString();
 
   auto op_profiler = profiler->MakeScopedInstructionProfiler(hlo_instruction());
-  // First copy the infeed data which is element 0 of the infeed instruction's
-  // two-tuple output (the other element is a token).
-  se::DeviceMemoryBase data_address =
-      buffer_allocations.GetDeviceAddress(infeed_slices_.element({0}));
-  InfeedManager* infeed_manager = GetOrCreateInfeedManager();
-  std::vector<InfeedBuffer*> infeed_buffers;
-  const Shape& data_shape =
-      ShapeUtil::GetTupleElementShape(hlo_instruction()->shape(), 0);
-  if (ShapeUtil::IsTuple(data_shape)) {
-    CHECK(!ShapeUtil::IsNestedTuple(data_shape));
-    // Transfer the tuple elements first.
+  ShapeTree<InfeedBuffer> infeed_buffers =
+      GetOrCreateInfeedManager()->BlockingGetNextDestination();
+
+  {
+    // The infeed buffer has an extra outer tuple with a token. Adjust the index
+    // accordingly.
+    ShapeIndex index = {0};
+    std::function<void(std::vector<void*>*)> copy_tuple_contents =
+        [&](std::vector<void*>* tuple_element_addresses) {
+          const Shape& shape = ShapeUtil::GetSubshape(infeed_buffers.shape(),
+                                                      ShapeIndexView(index, 1));
+          // For the leaf buffers of the tuple copy the elements directly.
+          if (ShapeUtil::IsArray(shape)) {
+            const BufferAllocation::Slice& tuple_element_buffer =
+                infeed_slices_.element(index);
+            se::DeviceMemoryBase tuple_element_address =
+                buffer_allocations.GetDeviceAddress(tuple_element_buffer);
+
+            InfeedBuffer* buffer =
+                infeed_buffers.mutable_element(ShapeIndexView(index, 1));
+            stream->ThenMemcpy(&tuple_element_address,
+                               *(buffer->device_memory()), buffer->length());
+            tuple_element_addresses->push_back(tuple_element_address.opaque());
+            return;
+          }
+
+          const int64 tuple_element_count = ShapeUtil::TupleElementCount(shape);
+          index.push_back(0);
+          std::vector<void*> inner_tuple_element_addresses;
+          for (int64 i = 0; i < tuple_element_count; ++i) {
+            index.back() = i;
+            copy_tuple_contents(&inner_tuple_element_addresses);
+          }
+          index.pop_back();
+
+          // Create a buffer of pointers for non-leaf buffers.
+          CHECK_EQ(tuple_element_count, inner_tuple_element_addresses.size());
+          auto host_size = inner_tuple_element_addresses.size() * sizeof(void*);
+          se::DeviceMemoryBase tuple_address =
+              buffer_allocations.GetDeviceAddress(
+                  infeed_slices_.element(index));
+          stream->ThenMemcpy(&tuple_address,
+                             inner_tuple_element_addresses.data(), host_size);
+          tuple_element_addresses->push_back(tuple_address.opaque());
+        };
+
     std::vector<void*> tuple_element_addresses;
-    for (int i = 0; i < ShapeUtil::TupleElementCount(data_shape); ++i) {
-      const BufferAllocation::Slice& tuple_element_buffer =
-          infeed_slices_.element({0, i});
-      se::DeviceMemoryBase tuple_element_address =
-          buffer_allocations.GetDeviceAddress(tuple_element_buffer);
-
-      InfeedBuffer* buffer = infeed_manager->BlockingDequeueBuffer();
-      infeed_buffers.push_back(buffer);
-      stream->ThenMemcpy(&tuple_element_address, *(buffer->device_memory()),
-                         buffer->length());
-      tuple_element_addresses.push_back(tuple_element_address.opaque());
-    }
-    // Transfer the tuple outer buffer.
-    auto host_size = tuple_element_addresses.size() * sizeof(void*);
-    stream->ThenMemcpy(&data_address, tuple_element_addresses.data(),
-                       host_size);
-  } else {
-    InfeedBuffer* buffer = infeed_manager->BlockingDequeueBuffer();
-    infeed_buffers.push_back(buffer);
-    stream->ThenMemcpy(&data_address, *(buffer->device_memory()),
-                       buffer->length());
+    copy_tuple_contents(&tuple_element_addresses);
+    CHECK_EQ(1, tuple_element_addresses.size());
   }
 
   // Construct top-level tuple of infeed containing the data and the token. Use
   // a nullptr for the token, it should never be dereferenced.
-  std::vector<void*> infeed_addresses = {data_address.opaque(), nullptr};
+  se::DeviceMemoryBase data_address =
+      buffer_allocations.GetDeviceAddress(infeed_slices_.element({0}));
+  void* infeed_addresses[] = {data_address.opaque(), nullptr};
   se::DeviceMemoryBase top_level_address =
       buffer_allocations.GetDeviceAddress(infeed_slices_.element({}));
-  stream->ThenMemcpy(&top_level_address, infeed_addresses.data(),
-                     2 * sizeof(void*));
+  stream->ThenMemcpy(&top_level_address, infeed_addresses, 2 * sizeof(void*));
 
   Status block_status = stream->BlockHostUntilDone();
   if (!block_status.ok()) {
@@ -82,8 +99,6 @@ Status InfeedThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
                          stream, block_status.error_message().c_str());
   }
 
-  infeed_manager->ReleaseBuffers(infeed_buffers);
-
   VLOG(2) << "Infeeding to GPU complete";
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc b/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
index 64ed3d748f..0f2c83aeb2 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion.cc
@@ -73,6 +73,67 @@ bool IsIEEEFloatingPointScalarConstant(const HloInstruction* constant) {
   }
 }
 
+// This function limits the maximum number of operands to a fusion.
+//
+// There's a cap on how many parameters we can pass to a CUDA kernel, but
+// exactly what that limit is is hazy, as it depends on (among other things) how
+// much GPU constant memory is in use for other purposes.
+//
+// Moreover, we don't even know at the point that we're running fusion how many
+// arguments the CUDA kernel for a fusion node will have: It depends on buffer
+// assignment, where we will decide which of the fusion's operands live in XLA's
+// big temp buffer versus in other allocations.
+//
+// As a heuristic, we simply cap the number of fusion operands plus outputs at
+// kMaxOperandsAndOutputsPerFusion.  This puts an upper bound on the number of
+// parameters to the kernel, working around the correctness problem.
+//
+// This limit is also often good for performance.  In a fusion with many
+// operands, each GPU thread likely has to do a lot of work, and so possibly
+// uses a lot of registers, thus limiting occupancy.
+/*static*/ bool GpuInstructionFusion::FusionWouldBeTooLarge(
+    const HloInstruction* a, const HloInstruction* b) {
+  // Compute the number of outputs of the (possibly multi-output) fusion node
+  // we're considering creating.
+  //
+  // This isn't precise; we may be off by one if
+  //  - We're creating a multi-output fusion out of two non-MOFs.  Creating a
+  //    MOF adds a new buffer, namely, the tuple buffer.
+  //  - We're merging two MOFs.  In this case, we should count the tuple buffer
+  //    only once.
+  //  - WLOG there's an edge from `a` to `b` and `b` is the only consumer of
+  //    `a`.  In this case the result of `a` is not part of the output of the
+  //    fusion.
+  //
+  // But because this is a heuristic and our limit
+  // kMaxOperandsAndOutputsPerFusion is a large value (so +/- 1 doesn't make a
+  // big difference), we ignore this small inaccuracy in favor of simplicity.
+  int64 num_output_buffers = ShapeUtil::SubshapeCount(a->shape()) +
+                             ShapeUtil::SubshapeCount(b->shape());
+
+  // The new fusion will have no more operands and outputs than
+  //   producer_operands + consumer_operands - 1 + num_output_buffers
+  // (minus one because we may be fusing a producer->consumer edge between `a`
+  // and `b`).
+  //
+  // This fact may be enough to let us avoid having to compute the true total
+  // number of operands, which can be expensive.
+  if (a->operand_count() + b->operand_count() - 1 + num_output_buffers <=
+      kMaxOperandsAndOutputsPerFusion) {
+    return false;
+  }
+
+  // Compute the precise number of operands to the new fusion.
+  tensorflow::gtl::FlatSet<const HloInstruction*> operands(
+      a->operands().begin(), a->operands().end());
+  operands.insert(b->operands().begin(), b->operands().end());
+  // If there's an edge between `a` and `b`, don't count it: We're fusing that
+  // producer -> consumer relationship.
+  operands.erase(a);
+  operands.erase(b);
+  return operands.size() + num_output_buffers > kMaxOperandsAndOutputsPerFusion;
+}
+
 bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
                                       int64 operand_index) {
   HloInstruction* producer = consumer->mutable_operand(operand_index);
@@ -141,6 +202,7 @@ bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
              IsIEEEFloatingPointScalarConstant(producer->operand(0)) &&
              fused_parameter_users[0]->opcode() == HloOpcode::kMultiply;
     }
+    return false;
   }
 
   // Other output fusions are not currently supported on GPUs.
@@ -183,8 +245,13 @@ bool GpuInstructionFusion::ShouldFuse(HloInstruction* consumer,
     return true;
   }
 
-  return IsFusile(*producer) && IsFusile(*consumer) &&
-         InstructionFusion::ShouldFuse(consumer, operand_index);
+  if (!IsFusile(*producer) || !IsFusile(*consumer) ||
+      !InstructionFusion::ShouldFuse(consumer, operand_index)) {
+    return false;
+  }
+
+  // We put this check last because it's potentially expensive.
+  return !FusionWouldBeTooLarge(consumer, producer);
 }
 
 bool GpuInstructionFusion::ShouldFuseIntoMultiOutput(HloInstruction* consumer,
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion.h b/tensorflow/compiler/xla/service/gpu/instruction_fusion.h
index f629d9ff2c..c91f6343a6 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion.h
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion.h
@@ -27,6 +27,19 @@ class GpuInstructionFusion : public InstructionFusion {
   explicit GpuInstructionFusion(bool may_duplicate)
       : InstructionFusion(GpuInstructionFusion::IsExpensive, may_duplicate) {}
 
+  // Maximum number of operands plus outputs allowed on a single fusion node.
+  // Exposed publicly mainly for tests.
+  static constexpr int64 kMaxOperandsAndOutputsPerFusion = 64;
+
+  // Determines whether the combination of `a` and `b` into a (possibly
+  // multi-output) fusion would be "too large" -- i.e., have more operands and
+  // outputs than is allowed.
+  //
+  // `ShouldFuse` and `ShouldFuseIntoMultiOutput` call this; it's public so that
+  // other fusion passes (e.g. GPU multi-output fusion) can also call this.
+  static bool FusionWouldBeTooLarge(const HloInstruction* a,
+                                    const HloInstruction* b);
+
   static bool IsExpensive(const HloInstruction& instruction);
 
   bool ShouldFuse(HloInstruction* consumer, int64 operand_index) override;
diff --git a/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc b/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
index 1963d9eef7..8d0522bd8f 100644
--- a/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/instruction_fusion_test.cc
@@ -33,7 +33,7 @@ TEST_F(InstructionFusionTest,
        CostlyProducerAndOperandElementReusingConsumerNotFused) {
   HloComputation::Builder builder(TestName());
   HloInstruction* const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(5)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(5)));
   HloInstruction* exp1 = builder.AddInstruction(HloInstruction::CreateUnary(
       ShapeUtil::MakeShape(S32, {}), HloOpcode::kExp, const0));
   HloInstruction* broadcast2 =
@@ -53,7 +53,7 @@ TEST_F(InstructionFusionTest,
        NonCostlyProducerAndOperandElementReusingConsumerFused) {
   HloComputation::Builder builder(TestName());
   HloInstruction* const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(5)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(5)));
   HloInstruction* negate1 = builder.AddInstruction(HloInstruction::CreateUnary(
       ShapeUtil::MakeShape(S32, {}), HloOpcode::kNegate, const0));
   HloInstruction* broadcast2 =
@@ -73,7 +73,7 @@ TEST_F(InstructionFusionTest,
        CostlyProducerAndNonOperandElementReusingConsumerFused_Reshape) {
   HloComputation::Builder builder(TestName());
   HloInstruction* const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(5)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(5)));
   HloInstruction* exp1 = builder.AddInstruction(HloInstruction::CreateUnary(
       ShapeUtil::MakeShape(S32, {}), HloOpcode::kExp, const0));
   HloInstruction* reshape2 = builder.AddInstruction(
@@ -92,7 +92,7 @@ TEST_F(InstructionFusionTest,
        CostlyProducerAndNonOperandElementReusingConsumerFused_Transpose) {
   HloComputation::Builder builder(TestName());
   HloInstruction* const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(5)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(5)));
   HloInstruction* exp1 = builder.AddInstruction(HloInstruction::CreateUnary(
       ShapeUtil::MakeShape(S32, {}), HloOpcode::kExp, const0));
   HloInstruction* transpose2 = builder.AddInstruction(
@@ -606,5 +606,35 @@ TEST_F(InstructionFusionTest, FuseScalarConstant) {
                       op::Parameter()));
 }
 
+// Check that we limit the number of operands to fusions we create.
+TEST_F(InstructionFusionTest, AvoidsLargeFusion) {
+  constexpr int64 kNumParams = 200;
+  ASSERT_GT(kNumParams, GpuInstructionFusion::kMaxOperandsAndOutputsPerFusion);
+
+  // Compute p0 + p1 + ... + pN.
+  HloComputation::Builder b(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {10, 100});
+  auto param0 =
+      b.AddInstruction(HloInstruction::CreateParameter(0, shape, "p"));
+  auto sum = param0;
+  for (int64 i = 1; i < kNumParams; ++i) {
+    auto param =
+        b.AddInstruction(HloInstruction::CreateParameter(i, shape, "p"));
+    sum = b.AddInstruction(
+        HloInstruction::CreateBinary(shape, HloOpcode::kAdd, sum, param));
+  }
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(b.Build());
+  EXPECT_TRUE(GpuInstructionFusion(/*may_duplicate=*/true)
+                  .Run(module.get())
+                  .ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+  for (const HloInstruction* instr : computation->instructions()) {
+    EXPECT_LE(instr->operand_count(),
+              GpuInstructionFusion::kMaxOperandsAndOutputsPerFusion)
+        << instr->ToString();
+  }
+}
+
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
index 388aa35d7d..c349063c71 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
@@ -38,24 +38,27 @@ namespace gpu {
 namespace {
 
 // Return whether the given shape is a matrix with no padding.
-bool IsRank2WithNoPadding(const Shape& shape) {
-  return ShapeUtil::Rank(shape) == 2 && !LayoutUtil::IsPadded(shape);
+bool IsRank2WithNoPadding(const Shape& shape, int64 batch_dimensions_size) {
+  return ShapeUtil::Rank(shape) == batch_dimensions_size + 2 &&
+         !LayoutUtil::IsPadded(shape);
 }
 
 // In a gemm operation where output = lhs * rhs, check whether the given shapes
 // are valid for the operation.
 bool AreValidGemmShapes(const Shape& lhs_shape, const Shape& rhs_shape,
-                        const Shape& output_shape) {
+                        const Shape& output_shape,
+                        int64 batch_dimensions_size) {
   // The inputs and the output must
   // 1) be matrices with no padding and a non-zero number of elements,
   // 2) have an allowed element type.
   PrimitiveType output_primitive_type = output_shape.element_type();
   bool type_is_allowed =
       (output_primitive_type == F16 || output_primitive_type == F32 ||
-       output_primitive_type == F64);
-  return type_is_allowed && IsRank2WithNoPadding(lhs_shape) &&
-         IsRank2WithNoPadding(rhs_shape) &&
-         IsRank2WithNoPadding(output_shape) &&
+       output_primitive_type == F64 || output_primitive_type == C64);
+  return type_is_allowed &&
+         IsRank2WithNoPadding(lhs_shape, batch_dimensions_size) &&
+         IsRank2WithNoPadding(rhs_shape, batch_dimensions_size) &&
+         IsRank2WithNoPadding(output_shape, batch_dimensions_size) &&
          !ShapeUtil::IsZeroElementArray(lhs_shape) &&
          !ShapeUtil::IsZeroElementArray(rhs_shape);
 }
@@ -64,14 +67,15 @@ bool DotImplementedAsGemm(const HloInstruction& dot) {
   CHECK_EQ(dot.opcode(), HloOpcode::kDot);
   const Shape& lhs_shape = dot.operand(0)->shape();
   const Shape& rhs_shape = dot.operand(1)->shape();
+  const DotDimensionNumbers& dim_numbers = dot.dot_dimension_numbers();
 
   // If gemm can accept the operand shapes, use it rather than a custom
   // kernel.
-  if (AreValidGemmShapes(lhs_shape, rhs_shape, dot.shape())) {
+  if (AreValidGemmShapes(lhs_shape, rhs_shape, dot.shape(),
+                         dim_numbers.lhs_batch_dimensions_size())) {
     // The size of the reduction dimension should match. The shape inference
     // guarantees this invariant, so the check here is for programming
     // errors.
-    const DotDimensionNumbers& dim_numbers = dot.dot_dimension_numbers();
     CHECK_EQ(lhs_shape.dimensions(dim_numbers.lhs_contracting_dimensions(0)),
              rhs_shape.dimensions(dim_numbers.rhs_contracting_dimensions(0)));
     return true;
@@ -81,11 +85,6 @@ bool DotImplementedAsGemm(const HloInstruction& dot) {
 }  // namespace
 
 bool ImplementedAsGemm(const HloInstruction& hlo) {
-  // We can only do this if the HLO is unnested.
-  if (hlo.parent() != hlo.GetModule()->entry_computation()) {
-    return false;
-  }
-
   // For certain types of Dot, we can call pre-canned BLAS gemm.
   if (hlo.opcode() == HloOpcode::kDot) {
     return DotImplementedAsGemm(hlo);
@@ -242,15 +241,17 @@ llvm::Value* EmitPrintf(tensorflow::StringPiece fmt,
        arguments_ptr});
 }
 
-llvm::Value* EmitShuffleDown(llvm::Value* value, llvm::Value* offset,
-                             llvm::IRBuilder<>* builder) {
+llvm::Value* EmitFullWarpShuffleDown(llvm::Value* value, llvm::Value* offset,
+                                     llvm::IRBuilder<>* builder) {
   int bit_width = value->getType()->getPrimitiveSizeInBits();
+  llvm::Value* all_warps_mask = builder->getInt32(-1);
 
   // Special case for efficiency
   if (value->getType()->isFloatTy() && bit_width == 32) {
     return llvm_ir::EmitCallToIntrinsic(
-        llvm::Intrinsic::nvvm_shfl_down_f32,
-        {value, offset, builder->getInt32(kWarpSize - 1)}, {}, builder);
+        llvm::Intrinsic::nvvm_shfl_sync_down_f32,
+        {all_warps_mask, value, offset, builder->getInt32(kWarpSize - 1)}, {},
+        builder);
   }
 
   // We must split values wider than 32 bits as the "shfl" instruction operates
@@ -264,10 +265,11 @@ llvm::Value* EmitShuffleDown(llvm::Value* value, llvm::Value* offset,
   for (int i = 0; i < num_segments; ++i) {
     x = builder->CreateInsertElement(
         x,
-        llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::nvvm_shfl_down_i32,
-                                     {builder->CreateExtractElement(x, i),
-                                      offset, builder->getInt32(kWarpSize - 1)},
-                                     {}, builder),
+        llvm_ir::EmitCallToIntrinsic(
+            llvm::Intrinsic::nvvm_shfl_sync_down_i32,
+            {all_warps_mask, builder->CreateExtractElement(x, i), offset,
+             builder->getInt32(kWarpSize - 1)},
+            {}, builder),
         i);
   }
   return builder->CreateBitCast(
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.h b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.h
index 59455f389e..5d23a3d018 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.h
+++ b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.h
@@ -31,6 +31,12 @@ namespace gpu {
 constexpr int64 kWarpSize = 32;
 
 // Returns true if `hlo` will be implemented as a call to BLAS gemm.
+//
+// Precondition: `hlo` is in an "unnested context", meaning, it lives within the
+// entry computation, within the either of a while loop's subcomputations,
+// within any of a conditional's subcomputations, etc., but *does not* live
+// within a reduce subcomputation, a map subcomputation, a fusion
+// subcomputation, etc.  It's OK if `hlo` *is* a fusion.
 bool ImplementedAsGemm(const HloInstruction& hlo);
 
 // A call to cuDNN for batch normalization is represented as CustomCall HLO with
@@ -125,13 +131,17 @@ llvm::Value* EmitPrintf(tensorflow::StringPiece fmt,
                         llvm::IRBuilder<>* builder);
 
 // Emits code to shuffle data between threads of a warp. This has the same
-// semantics as the PTX "shfl.down" instruction [0] but works for values of any
-// size. The last operand of the emitted "shfl" is `kWarpSize - 1`.
+// semantics as the PTX "shfl.sync.down" instruction but works for values that
+// aren't 32 bits in size. The last operand of the emitted "shfl" is
+// `kWarpSize - 1`.
+//
+// This function emits a "full-warp" shuffle, which all threads of a warp
+// participate in.  *Do not use this function from a divergent context:* You
+// can't correctly do so on both Volta and earlier GPUs.
 //
-// [0]
-// http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-shfl
-llvm::Value* EmitShuffleDown(llvm::Value* value, llvm::Value* offset,
-                             llvm::IRBuilder<>* builder);
+// https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-shfl-sync
+llvm::Value* EmitFullWarpShuffleDown(llvm::Value* value, llvm::Value* offset,
+                                     llvm::IRBuilder<>* builder);
 
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
index d5e07c3afb..541cacf697 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
@@ -57,12 +57,12 @@ IrEmitter::IrEmitter(const HloModuleConfig& hlo_module_config,
                      IrEmitterContext* ir_emitter_context, bool is_nested)
     : ir_emitter_context_(ir_emitter_context),
       module_(ir_emitter_context->llvm_module()),
-      ir_builder_(module_->getContext()),
+      b_(module_->getContext()),
       bindings_(ir_emitter_context->hlo_module(),
-                &ir_emitter_context->buffer_assignment(), &ir_builder_, module_,
+                &ir_emitter_context->buffer_assignment(), &b_, module_,
                 is_nested),
       hlo_module_config_(hlo_module_config) {
-  ir_builder_.setFastMathFlags(llvm_ir::GetFastMathFlags(
+  b_.setFastMathFlags(llvm_ir::GetFastMathFlags(
       /*fast_math_enabled=*/hlo_module_config.debug_options()
           .xla_enable_fast_math()));
 }
@@ -71,30 +71,16 @@ Status IrEmitter::DefaultAction(HloInstruction* hlo) {
   ElementalIrEmitter::HloToElementGeneratorMap operand_to_generator;
   for (const HloInstruction* operand : hlo->operands()) {
     operand_to_generator[operand] = [=](const llvm_ir::IrArray::Index& index) {
-      return GetIrArray(*operand, *hlo)
-          .EmitReadArrayElement(index, &ir_builder_);
+      return GetIrArray(*operand, *hlo).EmitReadArrayElement(index, &b_);
     };
   }
   return EmitTargetElementLoop(
-      *hlo, GpuElementalIrEmitter(hlo_module_config_, module_, &ir_builder_,
+      *hlo, GpuElementalIrEmitter(hlo_module_config_, module_, &b_,
                                   GetNestedComputer())
                 .MakeElementGenerator(hlo, operand_to_generator));
 }
 
 Status IrEmitter::HandleConstant(HloInstruction* constant) {
-  const Literal& literal = constant->literal();
-  llvm::Constant* initializer =
-      llvm_ir::ConvertLiteralToIrConstant(literal, module_);
-  llvm::GlobalVariable* global_for_const = new llvm::GlobalVariable(
-      *module_, initializer->getType(),
-      /*isConstant=*/true, llvm::GlobalValue::PrivateLinkage, initializer,
-      /*Name=*/"");
-  VLOG(2) << "HandleConstant: " << constant->ToString() << std::endl
-          << "  emitted_value: " << llvm_ir::DumpToString(*global_for_const)
-          << std::endl
-          << "  its type: "
-          << llvm_ir::DumpToString(*global_for_const->getType());
-  bindings_.BindHloToIrValue(*constant, global_for_const);
   return Status::OK();
 }
 
@@ -119,15 +105,10 @@ Status IrEmitter::HandleGetTupleElement(HloInstruction* get_tuple_element) {
           get_tuple_element->shape(), get_tuple_element->tuple_index(),
           // TODO(b/26344050): tighten the alignment here
           // based on the real element type.
-          /*alignment=*/1, GetBasePointer(*operand), &ir_builder_, module_));
+          /*alignment=*/1, GetBasePointer(*operand), &b_, module_));
   return Status::OK();
 }
 
-Status IrEmitter::HandleSort(HloInstruction*) {
-  // TODO(b/26783907): Implement sort on GPU.
-  return Unimplemented("sort");
-}
-
 Status IrEmitter::HandleSend(HloInstruction*) {
   return Unimplemented("Send is not implemented on GPU");
 }
@@ -144,13 +125,16 @@ Status IrEmitter::HandleRecvDone(HloInstruction*) {
   return Unimplemented("Recv-done is not implemented on GPU");
 }
 
+Status IrEmitter::HandleScatter(HloInstruction*) {
+  return Unimplemented("Scatter is not implemented on GPUs.");
+}
+
 Status IrEmitter::HandleTuple(HloInstruction* tuple) {
   std::vector<llvm::Value*> base_ptrs;
   for (const HloInstruction* operand : tuple->operands()) {
     base_ptrs.push_back(GetBasePointer(*operand));
   }
-  llvm_ir::EmitTuple(GetIrArray(*tuple, *tuple), base_ptrs, &ir_builder_,
-                     module_);
+  llvm_ir::EmitTuple(GetIrArray(*tuple, *tuple), base_ptrs, &b_, module_);
   return Status::OK();
 }
 
@@ -171,7 +155,7 @@ Status IrEmitter::EmitCallToNestedComputation(
   std::vector<llvm::Value*> arguments(operands.begin(), operands.end());
   arguments.push_back(output);
   arguments.push_back(bindings_.GetTempBufferBase());
-  ir_builder_.CreateCall(emitted_function, arguments);
+  b_.CreateCall(emitted_function, arguments);
 
   return Status::OK();
 }
@@ -193,21 +177,20 @@ bool IrEmitter::MaybeEmitDirectAtomicOperation(
       computation.root_instruction()->shape().element_type();
   bool is_atomic_integral = element_type == S32 || element_type == U32 ||
                             element_type == S64 || element_type == U64;
-  llvm::Value* source = ir_builder_.CreateLoad(source_address, "source");
+  llvm::Value* source = b_.CreateLoad(source_address, "source");
   if (root_opcode == HloOpcode::kAdd) {
     // NVPTX supports atomicAdd on F32 and integer types.
     if (element_type == F32) {
       // F32 + F32
       llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::nvvm_atomic_load_add_f32,
                                    {output_address, source},
-                                   {output_address->getType()}, &ir_builder_);
+                                   {output_address->getType()}, &b_);
       return true;
     }
     if (is_atomic_integral) {
       // integral + integral
-      ir_builder_.CreateAtomicRMW(llvm::AtomicRMWInst::Add, output_address,
-                                  source,
-                                  llvm::AtomicOrdering::SequentiallyConsistent);
+      b_.CreateAtomicRMW(llvm::AtomicRMWInst::Add, output_address, source,
+                         llvm::AtomicOrdering::SequentiallyConsistent);
       return true;
     }
   }
@@ -218,8 +201,8 @@ bool IrEmitter::MaybeEmitDirectAtomicOperation(
     auto opcode = primitive_util::IsSignedIntegralType(element_type)
                       ? llvm::AtomicRMWInst::Max
                       : llvm::AtomicRMWInst::UMax;
-    ir_builder_.CreateAtomicRMW(opcode, output_address, source,
-                                llvm::AtomicOrdering::SequentiallyConsistent);
+    b_.CreateAtomicRMW(opcode, output_address, source,
+                       llvm::AtomicOrdering::SequentiallyConsistent);
     return true;
   }
 
@@ -228,8 +211,8 @@ bool IrEmitter::MaybeEmitDirectAtomicOperation(
     auto opcode = primitive_util::IsSignedIntegralType(element_type)
                       ? llvm::AtomicRMWInst::Min
                       : llvm::AtomicRMWInst::UMin;
-    ir_builder_.CreateAtomicRMW(opcode, output_address, source,
-                                llvm::AtomicOrdering::SequentiallyConsistent);
+    b_.CreateAtomicRMW(opcode, output_address, source,
+                       llvm::AtomicOrdering::SequentiallyConsistent);
     return true;
   }
 
@@ -301,20 +284,20 @@ Status IrEmitter::EmitAtomicOperationUsingCAS(const HloComputation& computation,
   llvm::Type* element_address_type = element_type->getPointerTo();
 
   int atomic_size = (element_size < 32) ? 32 : element_size;
-  llvm::Type* atomic_type = ir_builder_.getIntNTy(atomic_size);
+  llvm::Type* atomic_type = b_.getIntNTy(atomic_size);
   llvm::Type* atomic_address_type =
       atomic_type->getPointerTo(output_address_type->getPointerAddressSpace());
 
   // cas_old_output_address and cas_new_output_address point to the scratch
   // memory where we store the old and new values for the repeated atomicCAS
   // operations.
-  llvm::Value* cas_old_output_address = ir_builder_.CreateAlloca(
+  llvm::Value* cas_old_output_address = b_.CreateAlloca(
       atomic_type, /*ArraySize=*/nullptr, "cas_old_output_address");
-  llvm::Value* cas_new_output_address = ir_builder_.CreateAlloca(
+  llvm::Value* cas_new_output_address = b_.CreateAlloca(
       atomic_type, /*ArraySize=*/nullptr, "cas_new_output_address");
 
   // Emit preparation code to the preheader.
-  llvm::BasicBlock* loop_preheader_bb = ir_builder_.GetInsertBlock();
+  llvm::BasicBlock* loop_preheader_bb = b_.GetInsertBlock();
 
   llvm::Value* atomic_memory_address;
   // binop_output_address points to the scratch memory that stores the
@@ -325,77 +308,71 @@ Status IrEmitter::EmitAtomicOperationUsingCAS(const HloComputation& computation,
     CHECK_EQ((element_size % sizeof(char)), 0);
     llvm::Type* address_int_type =
         module_->getDataLayout().getIntPtrType(output_address_type);
-    atomic_memory_address =
-        ir_builder_.CreatePtrToInt(output_address, address_int_type);
+    atomic_memory_address = b_.CreatePtrToInt(output_address, address_int_type);
     llvm::Value* mask = llvm::ConstantInt::get(address_int_type, 3);
-    llvm::Value* offset = ir_builder_.CreateAnd(atomic_memory_address, mask);
+    llvm::Value* offset = b_.CreateAnd(atomic_memory_address, mask);
     mask = llvm::ConstantInt::get(address_int_type, -4);
-    atomic_memory_address = ir_builder_.CreateAnd(atomic_memory_address, mask);
+    atomic_memory_address = b_.CreateAnd(atomic_memory_address, mask);
     atomic_memory_address =
-        ir_builder_.CreateIntToPtr(atomic_memory_address, atomic_address_type);
-    binop_output_address = ir_builder_.CreateAdd(
-        ir_builder_.CreatePtrToInt(cas_new_output_address, address_int_type),
-        offset);
+        b_.CreateIntToPtr(atomic_memory_address, atomic_address_type);
+    binop_output_address = b_.CreateAdd(
+        b_.CreatePtrToInt(cas_new_output_address, address_int_type), offset);
     binop_output_address =
-        ir_builder_.CreateIntToPtr(binop_output_address, element_address_type);
+        b_.CreateIntToPtr(binop_output_address, element_address_type);
   } else {
     atomic_memory_address =
-        ir_builder_.CreateBitCast(output_address, atomic_address_type);
+        b_.CreateBitCast(output_address, atomic_address_type);
     binop_output_address =
-        ir_builder_.CreateBitCast(cas_new_output_address, element_address_type);
+        b_.CreateBitCast(cas_new_output_address, element_address_type);
   }
 
   // Use the value from the memory that atomicCAS operates on to initialize
   // cas_old_output.
   llvm::Value* cas_old_output =
-      ir_builder_.CreateLoad(atomic_memory_address, "cas_old_output");
-  ir_builder_.CreateStore(cas_old_output, cas_old_output_address);
+      b_.CreateLoad(atomic_memory_address, "cas_old_output");
+  b_.CreateStore(cas_old_output, cas_old_output_address);
 
   llvm::BasicBlock* loop_exit_bb = loop_preheader_bb->splitBasicBlock(
-      ir_builder_.GetInsertPoint(), "atomic_op_loop_exit");
-  llvm::BasicBlock* loop_body_bb =
-      llvm::BasicBlock::Create(ir_builder_.getContext(), "atomic_op_loop_body",
-                               ir_builder_.GetInsertBlock()->getParent());
-  ir_builder_.SetInsertPoint(loop_body_bb);
+      b_.GetInsertPoint(), "atomic_op_loop_exit");
+  llvm::BasicBlock* loop_body_bb = llvm::BasicBlock::Create(
+      b_.getContext(), "atomic_op_loop_body", b_.GetInsertBlock()->getParent());
+  b_.SetInsertPoint(loop_body_bb);
   // Change preheader's successor from loop_exit_bb to loop_body_bb.
   loop_preheader_bb->getTerminator()->setSuccessor(0, loop_body_bb);
 
   // Emit the body of the loop that repeatedly invokes atomicCAS.
   //
   // Use cas_old_output to initialize cas_new_output.
-  cas_old_output =
-      ir_builder_.CreateLoad(cas_old_output_address, "cas_old_output");
-  ir_builder_.CreateStore(cas_old_output, cas_new_output_address);
+  cas_old_output = b_.CreateLoad(cas_old_output_address, "cas_old_output");
+  b_.CreateStore(cas_old_output, cas_new_output_address);
   // Emits code to calculate new_output = operation(old_output, source);
   TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
       computation, {binop_output_address, source_address},
       binop_output_address));
 
   llvm::Value* cas_new_output =
-      ir_builder_.CreateLoad(cas_new_output_address, "cas_new_output");
+      b_.CreateLoad(cas_new_output_address, "cas_new_output");
 
   // Emit code to perform the atomicCAS operation
   // (cas_old_output, success) = atomicCAS(memory_address, cas_old_output,
   //                                       cas_new_output);
-  llvm::Value* ret_value = ir_builder_.CreateAtomicCmpXchg(
+  llvm::Value* ret_value = b_.CreateAtomicCmpXchg(
       atomic_memory_address, cas_old_output, cas_new_output,
       llvm::AtomicOrdering::SequentiallyConsistent,
       llvm::AtomicOrdering::SequentiallyConsistent);
 
   // Extract the memory value returned from atomicCAS and store it as
   // cas_old_output.
-  ir_builder_.CreateStore(
-      ir_builder_.CreateExtractValue(ret_value, 0, "cas_old_output"),
-      cas_old_output_address);
+  b_.CreateStore(b_.CreateExtractValue(ret_value, 0, "cas_old_output"),
+                 cas_old_output_address);
   // Extract the success bit returned from atomicCAS and generate a
   // conditional branch on the success bit.
-  ir_builder_.CreateCondBr(
-      ir_builder_.CreateExtractValue(ret_value, 1, "success"), loop_exit_bb,
-      loop_body_bb);
+  b_.CreateCondBr(b_.CreateExtractValue(ret_value, 1, "success"), loop_exit_bb,
+                  loop_body_bb);
 
   // Set the insertion point to the exit basic block so that the caller of
   // this method can continue emitting code to the right place.
-  SetToFirstInsertPoint(loop_exit_bb, &ir_builder_);
+  SetToFirstInsertPoint(loop_exit_bb, &b_);
   return Status::OK();
 }
 
@@ -421,46 +398,49 @@ Status IrEmitter::EmitAtomicOperationForNestedComputation(
 
 Status IrEmitter::HandleSelect(HloInstruction* select) {
   auto pred = select->operand(0);
-  auto on_true = select->operand(1);
-  auto on_false = select->operand(2);
   TF_RET_CHECK(pred->shape().element_type() == PRED);
-
-  if (ShapeUtil::IsTuple(select->shape())) {
-    llvm_ir::EmitTupleSelect(GetIrArray(*select, *select),
-                             GetIrArray(*pred, *select),
-                             GetBasePointer(*on_true),
-                             GetBasePointer(*on_false), &ir_builder_, module_);
-    return Status::OK();
-  }
-
   // We must not call the subclass `DefaultAction` method, lest its
   // `HandleSelect` call `IrEmitter::HandleSelect` and its `DefaultAction`
   // assume no handler has already been called.
   return IrEmitter::DefaultAction(select);
 }
 
+Status IrEmitter::HandleTupleSelect(HloInstruction* tuple_select) {
+  auto pred = tuple_select->operand(0);
+  auto on_true = tuple_select->operand(1);
+  auto on_false = tuple_select->operand(2);
+  TF_RET_CHECK(pred->shape().element_type() == PRED);
+  TF_RET_CHECK(ShapeUtil::IsScalar(pred->shape()));
+  TF_RET_CHECK(ShapeUtil::IsTuple(tuple_select->shape()));
+  llvm_ir::EmitTupleSelect(GetIrArray(*tuple_select, *tuple_select),
+                           GetIrArray(*pred, *tuple_select),
+                           GetBasePointer(*on_true), GetBasePointer(*on_false),
+                           &b_, module_);
+  return Status::OK();
+}
+
 namespace {
-llvm::Value* Real(llvm::Value* x, llvm::IRBuilder<>* ir_builder) {
-  return ir_builder->CreateExtractValue(x, {0});
-}
-
-llvm::Value* Imag(llvm::Value* x, llvm::IRBuilder<>* ir_builder) {
-  return ir_builder->CreateExtractValue(x, {1});
-}
-
-std::pair<llvm::Value*, llvm::Value*> MultiplyComplex(
-    llvm::Value* lhs_value, llvm::Value* rhs_value,
-    llvm::IRBuilder<>* ir_builder) {
-  llvm::Value* lhs_real = Real(lhs_value, ir_builder);
-  llvm::Value* lhs_imag = Imag(lhs_value, ir_builder);
-  llvm::Value* rhs_real = Real(rhs_value, ir_builder);
-  llvm::Value* rhs_imag = Imag(rhs_value, ir_builder);
-  llvm::Value* real_result1 = ir_builder->CreateFMul(lhs_real, rhs_real);
-  llvm::Value* real_result2 = ir_builder->CreateFMul(lhs_imag, rhs_imag);
-  llvm::Value* real_result = ir_builder->CreateFSub(real_result1, real_result2);
-  llvm::Value* imag_result1 = ir_builder->CreateFMul(lhs_real, rhs_imag);
-  llvm::Value* imag_result2 = ir_builder->CreateFMul(lhs_imag, rhs_real);
-  llvm::Value* imag_result = ir_builder->CreateFAdd(imag_result1, imag_result2);
+llvm::Value* Real(llvm::Value* x, llvm::IRBuilder<>* b) {
+  return b->CreateExtractValue(x, {0});
+}
+
+llvm::Value* Imag(llvm::Value* x, llvm::IRBuilder<>* b) {
+  return b->CreateExtractValue(x, {1});
+}
+
+std::pair<llvm::Value*, llvm::Value*> MultiplyComplex(llvm::Value* lhs_value,
+                                                      llvm::Value* rhs_value,
+                                                      llvm::IRBuilder<>* b) {
+  llvm::Value* lhs_real = Real(lhs_value, b);
+  llvm::Value* lhs_imag = Imag(lhs_value, b);
+  llvm::Value* rhs_real = Real(rhs_value, b);
+  llvm::Value* rhs_imag = Imag(rhs_value, b);
+  llvm::Value* real_result1 = b->CreateFMul(lhs_real, rhs_real);
+  llvm::Value* real_result2 = b->CreateFMul(lhs_imag, rhs_imag);
+  llvm::Value* real_result = b->CreateFSub(real_result1, real_result2);
+  llvm::Value* imag_result1 = b->CreateFMul(lhs_real, rhs_imag);
+  llvm::Value* imag_result2 = b->CreateFMul(lhs_imag, rhs_real);
+  llvm::Value* imag_result = b->CreateFAdd(imag_result1, imag_result2);
   return {real_result, imag_result};
 }
 }  // namespace
@@ -474,27 +454,29 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
 
   const Shape& lhs_shape = lhs_instruction->shape();
   const Shape& rhs_shape = rhs_instruction->shape();
+  const DotDimensionNumbers& dnums = dot->dot_dimension_numbers();
+  CHECK_EQ(dnums.lhs_batch_dimensions_size(),
+           dnums.rhs_batch_dimensions_size());
 
   // TODO(b/110211620): Convert to use i32 index_type when it is possible.
-  llvm::Type* index_type = ir_builder_.getInt64Ty();
+  llvm::Type* index_type = b_.getInt64Ty();
   llvm_ir::IrArray::Index element_index(index_type);
   if (ShapeUtil::IsScalar(lhs_shape) && ShapeUtil::IsScalar(rhs_shape)) {
     // If the operands are scalar, don't emit any loops.
     llvm::Value* lhs_value =
-        lhs_array.EmitReadArrayElement(/*index=*/element_index, &ir_builder_);
+        lhs_array.EmitReadArrayElement(/*index=*/element_index, &b_);
     llvm::Value* rhs_value =
-        rhs_array.EmitReadArrayElement(/*index=*/element_index, &ir_builder_);
+        rhs_array.EmitReadArrayElement(/*index=*/element_index, &b_);
     llvm::Value* result;
     if (ShapeUtil::ElementIsComplex(lhs_shape)) {
-      auto value = MultiplyComplex(lhs_value, rhs_value, &ir_builder_);
+      auto value = MultiplyComplex(lhs_value, rhs_value, &b_);
       result = llvm::ConstantAggregateZero::get(lhs_array.GetElementLlvmType());
-      result = ir_builder_.CreateInsertValue(result, value.first, {0});
-      result = ir_builder_.CreateInsertValue(result, value.second, {1});
+      result = b_.CreateInsertValue(result, value.first, {0});
+      result = b_.CreateInsertValue(result, value.second, {1});
     } else {
-      result = ir_builder_.CreateFMul(lhs_value, rhs_value);
+      result = b_.CreateFMul(lhs_value, rhs_value);
     }
-    target_array.EmitWriteArrayElement(/*index=*/element_index, result,
-                                       &ir_builder_);
+    target_array.EmitWriteArrayElement(/*index=*/element_index, result, &b_);
     return Status::OK();
   }
 
@@ -510,9 +492,15 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
   const int64 lhs_reduction_dimension =
       ShapeUtil::GetDimensionNumber(lhs_shape, -1);
   const int64 rhs_reduction_dimension =
-      ShapeUtil::Rank(rhs_shape) >= 2
+      ShapeUtil::Rank(rhs_shape) >= 2 + dnums.lhs_batch_dimensions_size()
           ? ShapeUtil::GetDimensionNumber(rhs_shape, -2)
-          : 0;
+          : dnums.lhs_batch_dimensions_size();
+
+  // Check that the batch dims don't cover the last two dims.
+  for (int64 batch_dim : dnums.lhs_batch_dimensions()) {
+    CHECK_NE(lhs_reduction_dimension, batch_dim);
+    CHECK_NE(rhs_reduction_dimension, batch_dim);
+  }
 
   // Verify the reduction dimension in the two operands are the same size.
   TF_RET_CHECK(lhs_shape.dimensions(lhs_reduction_dimension) ==
@@ -521,11 +509,18 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
   // Create loop nests which loop through the LHS operand dimensions and the RHS
   // operand dimensions. The reduction dimension of the LHS and RHS are handled
   // in a separate innermost loop which performs the sum of products.
-  llvm_ir::ForLoopNest loop_nest(IrName(dot), &ir_builder_);
-  llvm_ir::IrArray::Index lhs_index = EmitOperandArrayLoopNest(
-      lhs_array, lhs_reduction_dimension, "lhs", &loop_nest);
-  llvm_ir::IrArray::Index rhs_index = EmitOperandArrayLoopNest(
-      rhs_array, rhs_reduction_dimension, "rhs", &loop_nest);
+  llvm_ir::ForLoopNest loop_nest(IrName(dot), &b_);
+  llvm_ir::IrArray::Index lhs_index = loop_nest.EmitOperandArrayLoopNest(
+      lhs_array, /*dimension_to_skip=*/lhs_reduction_dimension, "lhs");
+  llvm_ir::IrArray::Index rhs_index = loop_nest.EmitOperandArrayLoopNest(
+      rhs_array, /*dimension_to_skip=*/rhs_reduction_dimension, "rhs");
+
+  // We don't have to iterate over the batch dimensions in both arrays, simplify
+  // the loop nest of the rhs.
+  for (int i = 0; i != dnums.lhs_batch_dimensions_size(); ++i) {
+    DCHECK(c_linear_search(dnums.lhs_batch_dimensions(), i));
+    rhs_index[i] = lhs_index[i];
+  }
 
   // Create the reduction loop which does the sum of products reduction.
   std::unique_ptr<llvm_ir::ForLoop> reduction_loop = loop_nest.AddLoop(
@@ -545,7 +540,7 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
   llvm::Value* accum_address = llvm_ir::EmitAllocaAtFunctionEntry(
       accum_type,       // The pointee type of the alloca instruction.
       "accum_address",  // The name of the alloca instruction.
-      &ir_builder_);
+      &b_);
 
   // Initialize the accumulator in the preheader to zero.
   new llvm::StoreInst(
@@ -559,27 +554,25 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
   //   updated_accum = accum + lhs_element * rhs_element
   //   *accum_address = updated_accum
   TF_RET_CHECK(!reduction_loop->GetBodyBasicBlock()->empty());
-  ir_builder_.SetInsertPoint(
+  b_.SetInsertPoint(
       &*reduction_loop->GetBodyBasicBlock()->getFirstInsertionPt());
-  llvm::Value* lhs_element =
-      lhs_array.EmitReadArrayElement(lhs_index, &ir_builder_);
-  llvm::Value* rhs_element =
-      rhs_array.EmitReadArrayElement(rhs_index, &ir_builder_);
-  llvm::Value* accum = ir_builder_.CreateLoad(accum_address);
+  llvm::Value* lhs_element = lhs_array.EmitReadArrayElement(lhs_index, &b_);
+  llvm::Value* rhs_element = rhs_array.EmitReadArrayElement(rhs_index, &b_);
+  llvm::Value* accum = b_.CreateLoad(accum_address);
   llvm::Value* updated_accum;
   if (ShapeUtil::ElementIsComplex(lhs_shape)) {
-    auto value = MultiplyComplex(lhs_element, rhs_element, &ir_builder_);
-    llvm::Value* accum_real = Real(accum, &ir_builder_);
-    llvm::Value* real_sum = ir_builder_.CreateFAdd(accum_real, value.first);
-    updated_accum = ir_builder_.CreateInsertValue(accum, real_sum, {0});
-    llvm::Value* accum_imag = Imag(accum, &ir_builder_);
-    llvm::Value* imag_sum = ir_builder_.CreateFAdd(accum_imag, value.second);
-    updated_accum = ir_builder_.CreateInsertValue(updated_accum, imag_sum, {1});
+    auto value = MultiplyComplex(lhs_element, rhs_element, &b_);
+    llvm::Value* accum_real = Real(accum, &b_);
+    llvm::Value* real_sum = b_.CreateFAdd(accum_real, value.first);
+    updated_accum = b_.CreateInsertValue(accum, real_sum, {0});
+    llvm::Value* accum_imag = Imag(accum, &b_);
+    llvm::Value* imag_sum = b_.CreateFAdd(accum_imag, value.second);
+    updated_accum = b_.CreateInsertValue(updated_accum, imag_sum, {1});
   } else {
-    llvm::Value* product = ir_builder_.CreateFMul(lhs_element, rhs_element);
-    updated_accum = ir_builder_.CreateFAdd(accum, product);
+    llvm::Value* product = b_.CreateFMul(lhs_element, rhs_element);
+    updated_accum = b_.CreateFAdd(accum, product);
   }
-  ir_builder_.CreateStore(updated_accum, accum_address);
+  b_.CreateStore(updated_accum, accum_address);
 
   // After the reduction loop exits, store the accumulator into the target
   // address. The index into the target address is the concatenation of the rhs
@@ -591,21 +584,22 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
       target_index.push_back(lhs_index[dimension]);
     }
   }
-  for (size_t dimension = 0; dimension < rhs_index.size(); ++dimension) {
+  // Skip over the batch dimensions to not have them in the index twice.
+  for (size_t dimension = dnums.lhs_batch_dimensions_size();
+       dimension < rhs_index.size(); ++dimension) {
     if (dimension != rhs_reduction_dimension) {
       target_index.push_back(rhs_index[dimension]);
     }
   }
-  SetToFirstInsertPoint(reduction_loop->GetExitBasicBlock(), &ir_builder_);
+  SetToFirstInsertPoint(reduction_loop->GetExitBasicBlock(), &b_);
   target_array.EmitWriteArrayElement(
       target_index,
-      ir_builder_.CreateLoad(
-          accum_address),  // The value written to the target array.
-      &ir_builder_);
+      b_.CreateLoad(accum_address),  // The value written to the target array.
+      &b_);
 
   // Set the IR builder insert point to the exit basic block of the outer most
   // loop. This ensures later instructions are inserted after this loop nest.
-  ir_builder_.SetInsertPoint(loop_nest.GetOuterLoopExitBasicBlock());
+  b_.SetInsertPoint(loop_nest.GetOuterLoopExitBasicBlock());
 
   return Status::OK();
 }
@@ -647,11 +641,10 @@ Status IrEmitter::HandleReduce(HloInstruction* reduce) {
       [=](const llvm_ir::IrArray::Index& index) -> StatusOr<llvm::Value*> {
         // Initialize an accumulator with init_value.
         llvm::AllocaInst* accumulator_addr =
-            ir_builder_.CreateAlloca(llvm_ir::PrimitiveTypeToIrType(
+            b_.CreateAlloca(llvm_ir::PrimitiveTypeToIrType(
                 reduce->shape().element_type(), module_));
-        ir_builder_.CreateStore(
-            ir_builder_.CreateLoad(GetBasePointer(*init_value)),
-            accumulator_addr);
+        b_.CreateStore(b_.CreateLoad(GetBasePointer(*init_value)),
+                       accumulator_addr);
 
         // The enclosing loops go over all the target elements. Now we have to
         // compute the actual target element. For this, we build a new loop nest
@@ -659,12 +652,12 @@ Status IrEmitter::HandleReduce(HloInstruction* reduce) {
         // AddLoopsForShapeOnDimensions will return an Index where induction
         // Value*s are placed for each dimension in dimensions, and all the rest
         // are nullptrs.
-        llvm_ir::ForLoopNest loops(IrName(reduce, "inner"), &ir_builder_);
+        llvm_ir::ForLoopNest loops(IrName(reduce, "inner"), &b_);
         const llvm_ir::IrArray::Index reduced_dims_index =
             loops.AddLoopsForShapeOnDimensions(arg->shape(), dimensions,
                                                "reduction_dim");
 
-        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &ir_builder_);
+        SetToFirstInsertPoint(loops.GetInnerLoopBodyBasicBlock(), &b_);
 
         // Build a full index for the input argument, using reduced_dims_index
         // as the base. In reduced_dims_index only the reduction dimensions are
@@ -683,13 +676,12 @@ Status IrEmitter::HandleReduce(HloInstruction* reduce) {
 
         // Apply the reduction function to the loaded value.
         llvm::Value* input_address =
-            GetIrArray(*arg, *reduce)
-                .EmitArrayElementAddress(input_index, &ir_builder_);
+            GetIrArray(*arg, *reduce).EmitArrayElementAddress(input_index, &b_);
         TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
             *function, {accumulator_addr, input_address}, accumulator_addr));
 
-        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &ir_builder_);
-        return ir_builder_.CreateLoad(accumulator_addr);
+        SetToFirstInsertPoint(loops.GetOuterLoopExitBasicBlock(), &b_);
+        return b_.CreateLoad(accumulator_addr);
       });
 }
 
@@ -702,8 +694,8 @@ Status IrEmitter::HandleFusion(HloInstruction* fusion) {
   for (HloInstruction* operand : fusion->operands()) {
     parameter_arrays.push_back(GetIrArray(*operand, *fusion));
   }
-  GpuElementalIrEmitter elemental_emitter(hlo_module_config_, module_,
-                                          &ir_builder_, GetNestedComputer());
+  GpuElementalIrEmitter elemental_emitter(hlo_module_config_, module_, &b_,
+                                          GetNestedComputer());
   FusedIrEmitter fused_emitter(parameter_arrays, &elemental_emitter);
   TF_RETURN_IF_ERROR(fusion->fused_expression_root()->Accept(&fused_emitter));
 
@@ -733,24 +725,6 @@ Status IrEmitter::HandleOutfeed(HloInstruction*) {
   return Unimplemented("Outfeed is not supported on GPU.");
 }
 
-Status IrEmitter::HandleRng(HloInstruction* random) {
-  ElementalIrEmitter::HloToElementGeneratorMap operand_to_generator;
-  for (const HloInstruction* operand : random->operands()) {
-    operand_to_generator[operand] = [=](const llvm_ir::IrArray::Index& index) {
-      return GetIrArray(*operand, *random)
-          .EmitReadArrayElement(index, &ir_builder_);
-    };
-  }
-  // Emits a single-threaded loop because the loop body generated by the element
-  // generator for Rng can't be parallelized (b/32333178).
-  return llvm_ir::LoopEmitter(
-             GpuElementalIrEmitter(hlo_module_config_, module_, &ir_builder_,
-                                   GetNestedComputer())
-                 .MakeElementGenerator(random, operand_to_generator),
-             GetIrArray(*random, *random), &ir_builder_)
-      .EmitLoop(IrName(random));
-}
-
 Status IrEmitter::HandleBatchNormInference(HloInstruction*) {
   return Unimplemented(
       "The GPU backend does not implement BatchNormInference directly.  It "
@@ -774,34 +748,9 @@ Status IrEmitter::HandleBatchNormGrad(HloInstruction*) {
       "to a cudnn CustomCall using CudnnBatchNormRewriter.");
 }
 
-llvm_ir::IrArray::Index IrEmitter::EmitOperandArrayLoopNest(
-    const llvm_ir::IrArray& operand_array, int64 reduction_dimension,
-    tensorflow::StringPiece name_suffix, llvm_ir::ForLoopNest* loop_nest) {
-  // Prepares the dimension list we will use to emit the loop nest. Outermost
-  // loops are added first. Add loops in major-to-minor order, and skip the
-  // reduction dimension.
-  std::vector<int64> dimensions;
-  const Shape& shape = operand_array.GetShape();
-  for (int i = 0; i < LayoutUtil::MinorToMajor(shape).size(); ++i) {
-    int64 dimension = LayoutUtil::Major(shape.layout(), i);
-    if (dimension != reduction_dimension) {
-      dimensions.push_back(dimension);
-    }
-  }
-
-  // Create loop nest with one for-loop for each dimension of the
-  // output.
-  llvm_ir::IrArray::Index index =
-      loop_nest->AddLoopsForShapeOnDimensions(shape, dimensions, name_suffix);
-  // Verify every dimension except the reduction dimension was set in the index.
-  for (size_t dimension = 0; dimension < index.size(); ++dimension) {
-    if (dimension == reduction_dimension) {
-      DCHECK_EQ(nullptr, index[dimension]);
-    } else {
-      DCHECK_NE(nullptr, index[dimension]);
-    }
-  }
-  return index;
+Status IrEmitter::HandleIota(HloInstruction*) {
+  // TODO(b/64798317): implement iota on GPU.
+  return Unimplemented("Iota is not implemented on GPU.");
 }
 
 StatusOr<llvm::Value*> IrEmitter::ComputeNestedElement(
@@ -810,16 +759,16 @@ StatusOr<llvm::Value*> IrEmitter::ComputeNestedElement(
   llvm::Value* return_buffer = llvm_ir::EmitAllocaAtFunctionEntry(
       llvm_ir::PrimitiveTypeToIrType(
           computation.root_instruction()->shape().element_type(), module_),
-      "return_buffer", &ir_builder_);
+      "return_buffer", &b_);
   std::vector<llvm::Value*> parameter_buffers;
   for (llvm::Value* parameter_element : parameter_elements) {
     parameter_buffers.push_back(llvm_ir::EmitAllocaAtFunctionEntry(
-        parameter_element->getType(), "parameter_buffer", &ir_builder_));
-    ir_builder_.CreateStore(parameter_element, parameter_buffers.back());
+        parameter_element->getType(), "parameter_buffer", &b_));
+    b_.CreateStore(parameter_element, parameter_buffers.back());
   }
   TF_RETURN_IF_ERROR(EmitCallToNestedComputation(computation, parameter_buffers,
                                                  return_buffer));
-  return ir_builder_.CreateLoad(return_buffer);
+  return b_.CreateLoad(return_buffer);
 }
 
 }  // namespace gpu
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter.h b/tensorflow/compiler/xla/service/gpu/ir_emitter.h
index e55dfc6dae..561c683879 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter.h
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter.h
@@ -79,7 +79,6 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   Status HandleCrossReplicaSum(HloInstruction* crs) override;
   Status HandleInfeed(HloInstruction* infeed) override;
   Status HandleOutfeed(HloInstruction* outfeed) override;
-  Status HandleSort(HloInstruction* sort) override;
   Status HandleSend(HloInstruction* send) override;
   Status HandleSendDone(HloInstruction* send_done) override;
   Status HandleRecv(HloInstruction* recv) override;
@@ -87,14 +86,16 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   Status HandleParameter(HloInstruction* parameter) override;
   Status HandleReduce(HloInstruction* reduce) override;
   Status HandleTuple(HloInstruction* tuple) override;
+  Status HandleScatter(HloInstruction* scatter) override;
   Status HandleSelect(HloInstruction* select) override;
+  Status HandleTupleSelect(HloInstruction* tuple_select) override;
   Status HandleFusion(HloInstruction* fusion) override;
   Status HandleCall(HloInstruction* call) override;
   Status HandleCustomCall(HloInstruction* custom_call) override;
-  Status HandleRng(HloInstruction* random) override;
   Status HandleBatchNormInference(HloInstruction* batch_norm) override;
   Status HandleBatchNormTraining(HloInstruction* batch_norm) override;
   Status HandleBatchNormGrad(HloInstruction* batch_norm) override;
+  Status HandleIota(HloInstruction* iota) override;
 
   Status FinishVisit(HloInstruction* root) override { return Status::OK(); }
 
@@ -161,7 +162,7 @@ class IrEmitter : public DfsHloVisitorWithDefault {
 
   // The following fields track the IR emission state. According to LLVM memory
   // management rules, their memory is owned by the module.
-  llvm::IRBuilder<> ir_builder_;
+  llvm::IRBuilder<> b_;
 
   // Mapping from HLO to its underlying LLVM value.
   HloToIrBindings bindings_;
@@ -170,17 +171,6 @@ class IrEmitter : public DfsHloVisitorWithDefault {
   const HloModuleConfig& hlo_module_config_;
 
  private:
-  // Emits a series of nested loops for iterating over an operand array in the
-  // dot operation. Loops are constructed in major to minor dimension layout
-  // order. No loop is emitted for the given reduction_dimension. The function
-  // returns an IrArray index for the given operand_array containing the indvars
-  // of the loops. All dimensions of the index are filled except for the
-  // reduction dimension. name_suffix is the string to append to the names of
-  // LLVM constructs (eg, basic blocks) constructed by this method.
-  llvm_ir::IrArray::Index EmitOperandArrayLoopNest(
-      const llvm_ir::IrArray& operand_array, int64 reduction_dimension,
-      tensorflow::StringPiece name_suffix, llvm_ir::ForLoopNest* loop_nest);
-
   // A helper method for EmitAtomicOperationForNestedComputation. Certain
   // computations, such as floating-point addition and integer maximization, can
   // be simply implemented using an LLVM atomic instruction. If "computation" is
@@ -197,6 +187,13 @@ class IrEmitter : public DfsHloVisitorWithDefault {
                                      llvm::Value* output_address,
                                      llvm::Value* source_address);
 
+  // A helper method for HandleSort(). It adds the inner comparison loop where
+  // we compare elements pointed to by 'keys_index' and 'compare_keys_index'.
+  void EmitCompareLoop(int64 dimension_to_sort,
+                       const llvm_ir::IrArray::Index& keys_index,
+                       const llvm_ir::IrArray::Index& compare_keys_index,
+                       const llvm_ir::IrArray& keys_array);
+
   StatusOr<llvm::Value*> ComputeNestedElement(
       const HloComputation& computation,
       tensorflow::gtl::ArraySlice<llvm::Value*> parameter_elements);
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_nested.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter_nested.cc
index c9574c87a3..5c827e5f9c 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_nested.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_nested.cc
@@ -70,10 +70,10 @@ llvm::Function* IrEmitterNested::EmitBasePointersForNestedComputation(
     argument_dereferenceable_bytes.push_back(root_size);
   }
   // The base pointer of the memory block for all pre-allocated temp buffers.
-  argument_types.push_back(ir_builder_.getInt8PtrTy());
+  argument_types.push_back(b_.getInt8PtrTy());
 
   llvm::FunctionType* function_type =
-      llvm::FunctionType::get(ir_builder_.getVoidTy(), argument_types, false);
+      llvm::FunctionType::get(b_.getVoidTy(), argument_types, false);
   llvm::Function* function = llvm::Function::Create(
       function_type,                       // The function type.
       llvm::GlobalValue::InternalLinkage,  // The linkage type.
@@ -96,8 +96,7 @@ llvm::Function* IrEmitterNested::EmitBasePointersForNestedComputation(
       llvm::BasicBlock::Create(function->getContext(), "entry", function);
   // Emit a "return void" at entry_bb's end, and sets the insert point before
   // that return instruction.
-  ir_builder_.SetInsertPoint(
-      llvm::ReturnInst::Create(function->getContext(), entry_bb));
+  b_.SetInsertPoint(llvm::ReturnInst::Create(function->getContext(), entry_bb));
 
   std::vector<const HloInstruction*> non_io_hlos;
   for (const auto* hlo : nested_computation.instructions()) {
@@ -127,20 +126,17 @@ Status IrEmitterNested::EmitTargetElementLoop(
       target_arrays.push_back(GetIrArray(hlo, hlo, {i}));
     }
     TF_RETURN_IF_ERROR(
-        llvm_ir::LoopEmitter(element_generator, target_arrays, &ir_builder_)
-            .EmitLoop());
+        llvm_ir::LoopEmitter(element_generator, target_arrays, &b_).EmitLoop());
 
     std::vector<llvm::Value*> tuple_operand_ptrs;
     tuple_operand_ptrs.reserve(num_elems);
     for (const llvm_ir::IrArray& array : target_arrays) {
       tuple_operand_ptrs.push_back(array.GetBasePointer());
     }
-    llvm_ir::EmitTuple(GetIrArray(hlo, hlo), tuple_operand_ptrs, &ir_builder_,
-                       module_);
+    llvm_ir::EmitTuple(GetIrArray(hlo, hlo), tuple_operand_ptrs, &b_, module_);
     return Status::OK();
   }
-  return llvm_ir::LoopEmitter(element_generator, GetIrArray(hlo, hlo),
-                              &ir_builder_)
+  return llvm_ir::LoopEmitter(element_generator, GetIrArray(hlo, hlo), &b_)
       .EmitLoop();
 }
 
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
index bdb9e77da4..d5ecae88ed 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
@@ -28,11 +28,12 @@ limitations under the License.
 #include "llvm/IR/Instructions.h"
 #include "llvm/IR/LLVMContext.h"
 #include "llvm/IR/Module.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
 #include "tensorflow/compiler/xla/service/gpu/backend_configs.pb.h"
+#include "tensorflow/compiler/xla/service/gpu/buffer_allocations.h"
 #include "tensorflow/compiler/xla/service/gpu/conditional_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/convolution_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/copy_thunk.h"
@@ -48,6 +49,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/gpu/ir_emitter_context.h"
 #include "tensorflow/compiler/xla/service/gpu/kernel_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/memset_thunk.h"
+#include "tensorflow/compiler/xla/service/gpu/outfeed_thunk.h"
 #include "tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h"
 #include "tensorflow/compiler/xla/service/gpu/partition_assignment.h"
 #include "tensorflow/compiler/xla/service/gpu/sequential_thunk.h"
@@ -58,10 +60,12 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/dynamic_update_slice_util.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
-#include "tensorflow/compiler/xla/service/llvm_ir/ops.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/sort_util.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h"
 #include "tensorflow/compiler/xla/service/name_uniquer.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -70,8 +74,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/bits.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/platform/logging.h"
 
 namespace xla {
@@ -79,6 +85,7 @@ namespace gpu {
 
 namespace {
 
+using llvm_ir::IrArray;
 using llvm_ir::IrName;
 using tensorflow::gtl::ArraySlice;
 using tensorflow::gtl::InlinedVector;
@@ -164,40 +171,6 @@ Status IrEmitterUnnested::Postprocess(HloInstruction* hlo) {
   return DfsHloVisitor::Postprocess(hlo);
 }
 
-namespace {
-bool ImplementedAsHostToDeviceMemcpy(const BufferAssignment& buffer_assignment,
-                                     const HloInstruction& hlo) {
-  // `hlo` needs to satisfy the following conditions to be implemented as a
-  // host-to-device cuMemcpy.
-  //
-  // 1. `hlo` is a kCopy instruction.
-  // 2. `hlo`'s only operand is a kConstant instruction.
-  // 3. `hlo` and its operand have the same shape (thus the same layout too).
-  // 4. The address of `hlo`'s buffer is known at runtime (without dereferencing
-  //    pointers in a tuple).
-  return hlo.opcode() == HloOpcode::kCopy &&
-         hlo.operand(0)->opcode() == HloOpcode::kConstant &&
-         ShapeUtil::Equal(hlo.operand(0)->shape(), hlo.shape()) &&
-         buffer_assignment.GetUniqueTopLevelSlice(&hlo).ok();
-}
-
-bool ImplementedAsDeviceToDeviceMemcpy(
-    const BufferAssignment& buffer_assignment, const HloInstruction& hlo) {
-  // `hlo` needs to satisfy three conditions to be implemented as a
-  // device-to-device cuMemcpy.
-  //
-  // 1. `hlo` is a kCopy instruction.
-  // 2. `hlo` and its operand have the same shape (thus the same layout too).
-  // 3. `hlo` and its operand have a statically-known buffer assignment
-  //     (constants do not, for instance), which means the source buffer also
-  //     resides on the device.
-  return hlo.opcode() == HloOpcode::kCopy &&
-         ShapeUtil::Equal(hlo.operand(0)->shape(), hlo.shape()) &&
-         buffer_assignment.GetUniqueTopLevelSlice(&hlo).ok() &&
-         buffer_assignment.GetUniqueTopLevelSlice(hlo.operand(0)).ok();
-}
-}  // namespace
-
 llvm::Function* IrEmitterUnnested::BuildKernelPrototype(
     const HloInstruction& inst,
     tensorflow::gtl::ArraySlice<const BufferAllocation*> args) {
@@ -211,7 +184,7 @@ llvm::Function* IrEmitterUnnested::BuildKernelPrototype(
   llvm::LLVMContext& context = module->getContext();
   llvm::FunctionType* kernel_type = llvm::FunctionType::get(
       /*Result=*/llvm::Type::getVoidTy(context),
-      std::vector<llvm::Type*>(args.size(), ir_builder_.getInt8PtrTy()),
+      std::vector<llvm::Type*>(args.size(), b_.getInt8PtrTy()),
       /*isVarArg=*/false);
   llvm::Function* kernel =
       llvm::Function::Create(kernel_type, llvm::GlobalValue::ExternalLinkage,
@@ -226,9 +199,20 @@ llvm::Function* IrEmitterUnnested::BuildKernelPrototype(
     ++arg_it;
 
     kernel->addDereferenceableAttr(arg_no + 1, alloc->size());
+
+    const int64 alignment = [&] {
+      if (alloc->is_entry_computation_parameter()) {
+        return kEntryParameterAlignBytes;
+      } else if (alloc->is_constant()) {
+        return kConstantBufferAlignBytes;
+      } else {
+        return kXlaAllocatedBufferAlignBytes;
+      }
+    }();
+
     kernel->addParamAttr(
-        arg_no, llvm::Attribute::get(context, llvm::Attribute::Alignment,
-                                     kCudaMallocAlignBytes));
+        arg_no,
+        llvm::Attribute::get(context, llvm::Attribute::Alignment, alignment));
 
     if (alloc->IsPreallocatedTempBuffer()) {
       fn_arg->setName("temp_buf");
@@ -247,7 +231,7 @@ llvm::Function* IrEmitterUnnested::BuildKernelPrototype(
   nvvm_annotations_node->addOperand(llvm::MDNode::get(
       context, {llvm::ConstantAsMetadata::get(kernel),
                 llvm::MDString::get(context, "kernel"),
-                llvm::ConstantAsMetadata::get(ir_builder_.getInt32(1))}));
+                llvm::ConstantAsMetadata::get(b_.getInt32(1))}));
 
   // Update the insert point to the entry basic block.
   llvm::BasicBlock* entry_bb =
@@ -255,7 +239,7 @@ llvm::Function* IrEmitterUnnested::BuildKernelPrototype(
 
   // Emit a "return void" at entry_bb's end, and set the insert point before
   // that return instruction.
-  ir_builder_.SetInsertPoint(llvm::ReturnInst::Create(context, entry_bb));
+  b_.SetInsertPoint(llvm::ReturnInst::Create(context, entry_bb));
 
   return kernel;
 }
@@ -293,7 +277,7 @@ int ComputeMaxUnrollFactor(const HloInstruction* hlo) {
 //    range of i32.
 // Otherwise, the return type is i64.
 llvm::Type* GetIndexTypeForKernel(const HloInstruction* hlo, int64 launch_size,
-                                  llvm::IRBuilder<>* ir_builder) {
+                                  llvm::IRBuilder<>* b) {
   // Find the unnested hlo instructon for which the kernel is generated for.
   const HloInstruction* unnested_hlo = hlo;
   const HloComputation* computation = hlo->parent();
@@ -314,7 +298,7 @@ llvm::Type* GetIndexTypeForKernel(const HloInstruction* hlo, int64 launch_size,
     return in_range;
   };
 
-  llvm::Type* i64_ty = ir_builder->getInt64Ty();
+  llvm::Type* i64_ty = b->getInt64Ty();
   // Check launch dimension
   if (!IsInt32(launch_size)) {
     return i64_ty;
@@ -343,7 +327,7 @@ llvm::Type* GetIndexTypeForKernel(const HloInstruction* hlo, int64 launch_size,
     }
   }
 
-  return ir_builder->getInt32Ty();
+  return b->getInt32Ty();
 }
 
 }  // namespace
@@ -355,21 +339,18 @@ Status IrEmitterUnnested::DefaultAction(HloInstruction* hlo) {
     unroll_factor = ComputeMaxUnrollFactor(hlo);
   }
 
-  thunk_sequence_->emplace_back(BuildKernelThunk(hlo, unroll_factor));
+  thunk_sequence_->emplace_back(BuildKernelThunk(
+      hlo, /*implements_whole_instruction=*/true, unroll_factor));
   return IrEmitter::DefaultAction(hlo);
 }
 
 Status IrEmitterUnnested::HandleDot(HloInstruction* dot) {
-  const DotDimensionNumbers& dnums = dot->dot_dimension_numbers();
-  if (dnums.lhs_batch_dimensions_size() > 0 ||
-      dnums.rhs_batch_dimensions_size() > 0) {
-    return Unimplemented("Dot with batch dimensions not implemented.");
-  }
   if (ImplementedAsGemm(*dot)) {
     thunk_sequence_->emplace_back(BuildGemmThunk(dot));
     return Status::OK();
   }
-  thunk_sequence_->emplace_back(BuildKernelThunk(dot));
+  thunk_sequence_->emplace_back(
+      BuildKernelThunk(dot, /*implements_whole_instruction=*/true));
   return IrEmitter::HandleDot(dot);
 }
 
@@ -379,7 +360,8 @@ Status IrEmitterUnnested::HandleConditional(HloInstruction* conditional) {
 }
 
 Status IrEmitterUnnested::HandleConvolution(HloInstruction* convolution) {
-  thunk_sequence_->emplace_back(BuildKernelThunk(convolution));
+  thunk_sequence_->emplace_back(
+      BuildKernelThunk(convolution, /*implements_whole_instruction=*/true));
   return IrEmitter::HandleConvolution(convolution);
 }
 
@@ -586,16 +568,17 @@ Status IrEmitterUnnested::HandleFusion(HloInstruction* fusion) {
           }
         }
         CHECK(first_reduce != nullptr);
-        thunks.push_back(BuildKernelThunk(fusion));
+        thunks.push_back(
+            BuildKernelThunk(fusion, /*implements_whole_instruction=*/false));
         thunk_sequence_->emplace_back(
             MakeUnique<SequentialThunk>(std::move(thunks), fusion));
-        std::vector<llvm_ir::IrArray> parameter_arrays;
+        std::vector<IrArray> parameter_arrays;
         for (HloInstruction* operand : fusion->operands()) {
           parameter_arrays.push_back(GetIrArray(*operand, *fusion));
         }
         GpuElementalIrEmitter elemental_emitter(
-            hlo_module_config_, ir_emitter_context_->llvm_module(),
-            &ir_builder_, GetNestedComputer());
+            hlo_module_config_, ir_emitter_context_->llvm_module(), &b_,
+            GetNestedComputer());
         FusedIrEmitter fused_emitter(parameter_arrays, &elemental_emitter);
         TF_RETURN_IF_ERROR(root->Accept(&fused_emitter));
 
@@ -660,21 +643,22 @@ Status IrEmitterUnnested::HandleFusion(HloInstruction* fusion) {
     // touching the un-updated elements.
 
     // Set up kernel thunk and fused ir emitter.
-    thunk_sequence_->emplace_back(BuildKernelThunk(fusion));
-    std::vector<llvm_ir::IrArray> operand_arrays;
+    thunk_sequence_->emplace_back(
+        BuildKernelThunk(fusion, /*implements_whole_instruction=*/true));
+    std::vector<IrArray> operand_arrays;
     for (HloInstruction* operand : fusion->operands()) {
       operand_arrays.push_back(GetIrArray(*operand, *fusion));
     }
     GpuElementalIrEmitter elemental_emitter(hlo_module_config_,
                                             ir_emitter_context_->llvm_module(),
-                                            &ir_builder_, GetNestedComputer());
+                                            &b_, GetNestedComputer());
 
     // Shape of the dynamic-update-slice's "update" operand.
     Shape update_shape = root->operand(1)->shape();
 
     // Array to write into.  Because this is an in-place operation, this is the
     // same as operand 0's array.
-    llvm_ir::IrArray output_array = GetIrArray(*fusion, *fusion);
+    IrArray output_array = GetIrArray(*fusion, *fusion);
 
     LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
         update_shape, ir_emitter_context_->device_description());
@@ -685,346 +669,38 @@ Status IrEmitterUnnested::HandleFusion(HloInstruction* fusion) {
 
     return llvm_ir::EmitParallelFusedDynamicUpdateSliceInPlace(
         fusion, operand_arrays, output_array, &elemental_emitter,
-        launch_dimensions, &ir_builder_);
+        launch_dimensions, &b_);
   }
+
   if (ImplementedAsGemm(*fusion)) {
     thunk_sequence_->emplace_back(BuildGemmThunk(fusion));
     return Status::OK();
   }
 
-  CHECK(fusion->fusion_kind() == HloInstruction::FusionKind::kLoop);
-  int unroll_factor = ComputeMaxUnrollFactor(fusion);
+  CHECK_EQ(fusion->fusion_kind(), HloInstruction::FusionKind::kLoop);
 
-  thunk_sequence_->emplace_back(BuildKernelThunk(fusion, unroll_factor));
-  return IrEmitter::HandleFusion(fusion);
-}
-
-namespace {
-
-// Returns the indices of the first elements of all consecutive subarrays of the
-// given array. For example:
-// ConsecutiveSegments({m, m+1, m+2, n, k, k+1}) = {0, 3, 4}
-std::vector<size_t> ConsecutiveSegments(tensorflow::gtl::ArraySlice<int64> xs) {
-  std::vector<size_t> is = {0};
-  for (size_t i = 1; i < xs.size(); ++i) {
-    if (1 != xs[i] - xs[i - 1]) {
-      is.push_back(i);
-    }
-  }
-  return is;
-}
-
-// Merges the sequences of dimensions of the given shape which start at the
-// given indices `segs`.
-Shape MergeDimensions(tensorflow::gtl::ArraySlice<size_t> segs,
-                      const Shape& shape) {
-  std::vector<int64> dimensions;
-  for (size_t i = 1; i <= segs.size(); ++i) {
-    dimensions.push_back(std::accumulate(
-        shape.dimensions().begin() + segs[i - 1],
-        shape.dimensions().begin() +
-            (segs.size() == i ? shape.dimensions().size() : segs[i]),
-        1, std::multiplies<int64>()));
-  }
-  return ShapeUtil::MakeShapeWithDescendingLayout(shape.element_type(),
-                                                  dimensions);
-}
-
-// Returns whether the given shapes and permutation are a 0-2-1 transpose, and
-// if so, the normalized and rank-reduced shapes. The shapes must have the same
-// dimensions, so this considers layout only.
-//
-// This function recognizes higher-rank transposes which are elementwise
-// equivalent to a 0-2-1 transpose.
-std::tuple<bool, Shape, Shape> IsTranspose021(const Shape& a, const Shape& b) {
-  CHECK(ShapeUtil::Compatible(a, b));
-  std::vector<int64> perm(a.dimensions().size());
-  {
-    auto layout_a_orig = LayoutUtil::MinorToMajor(a);
-    std::vector<int64> layout_a(layout_a_orig.rbegin(), layout_a_orig.rend());
-    auto layout_b_orig = LayoutUtil::MinorToMajor(b);
-    std::vector<int64> layout_b(layout_b_orig.rbegin(), layout_b_orig.rend());
-    for (size_t i = 0; i < perm.size(); ++i) {
-      perm[i] = PositionInContainer(layout_b, layout_a[i]);
-    }
+  if (CheckAndEmitHloWithTile021(fusion)) {
+    return Status::OK();
   }
-  auto segs = ConsecutiveSegments(perm);
-  Shape norm_a =
-      ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(a);
-  Shape norm_b =
-      ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(b);
-  if (3 == segs.size() && 0 == perm[0]) {
-    Shape reduced_a = MergeDimensions(segs, norm_a);
-    Shape reduced_b = ShapeUtil::MakeShapeWithDescendingLayout(
-        b.element_type(),
-        Permute({0, 2, 1}, AsInt64Slice(reduced_a.dimensions())));
-    return std::make_tuple(true, reduced_a, reduced_b);
-  }
-  return std::make_tuple(false, ShapeUtil::MakeNil(), ShapeUtil::MakeNil());
-}
-
-// Returns whether the given shapes are potentially of a 0-2-1 transpose.
-// As 0-2-1 is a self-inverse permutation, which shape is input or output is
-// arbitrary.
-bool AreShapesForTranspose021(const Shape& a, const Shape& b) {
-  return 3 == b.dimensions().size() &&
-         ShapeUtil::Compatible(
-             ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(a),
-             ShapeUtil::PermuteDimensions(
-                 {0, 2, 1},
-                 ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
-                     b)));
-}
-
-// Emits a tiled 0-2-1 transpose, assuming both input and output lain out from
-// major to minor. The x- and y- dimensions are tiled in square tiles of edge
-// length `tile_size`. Each thread block of `tile_size` x `num_rows` threads
-// transposes one tile: each thread copies a row from the input to a shared
-// memory tile, then copies a column from the shared memory tile to the output.
-//
-// `tile_size` should usually be same as warp size.
-//
-// Returns (number of tiles = number of thread blocks needed).
-//
-// TODO(b/33320379): Here each block transposes 1 tile. It may be more efficient
-//                   to launch fewer blocks so each transposes many tiles, and
-//                   in any case, the number of blocks we can launch is limited.
-//
-// This is the same algorithm in CUDA:
-// https://github.com/tensorflow/tensorflow/blob/d2693c8a70567cc78b2e8a9ac8020d321620ca83/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc#L189
-int64 EmitTranspose021Tiled(llvm_ir::IrArray input, llvm_ir::IrArray output,
-                            const int64 tile_size, const int64 num_rows,
-                            llvm::IRBuilder<>* builder) {
-  // Adds `addend` to the given `dim` of `index`.
-  auto offset_dim = [builder](llvm_ir::IrArray::Index index,
-                              llvm::Value* addend, int64 dim) {
-    index[dim] = builder->CreateAdd(index[dim], addend);
-    return index;
-  };
 
-  CHECK(AreShapesForTranspose021(input.GetShape(), output.GetShape()));
-
-  Shape input_shape =
-      ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
-          input.GetShape());
-  Shape output_shape =
-      ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(
-          output.GetShape());
-  input = input.CastToShape(input_shape, builder);
-  output = output.CastToShape(output_shape, builder);
-
-  llvm::Type* tile_type = llvm::ArrayType::get(
-      llvm::ArrayType::get(input.GetElementLlvmType(), tile_size),
-      // One extra here to avoid share memory bank conflict
-      tile_size + 1);
-  auto* tile = new llvm::GlobalVariable(
-      *builder->GetInsertBlock()->getParent()->getParent(), tile_type,
-      /*isConstant=*/false, llvm::GlobalValue::PrivateLinkage,
-      llvm::UndefValue::get(tile_type), "tile", nullptr,
-      llvm::GlobalValue::NotThreadLocal,
-      /*AddressSpace=*/3 /* GPU shared memory */);
-
-  // let x = threadIdx.x
-  llvm::Value* x = llvm_ir::EmitCallToIntrinsic(
-      llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x, {}, {}, builder);
-  llvm_ir::AddRangeMetadata(0, num_rows * tile_size,
-                            static_cast<llvm::Instruction*>(x));
-  x = builder->CreateIntCast(x, builder->getInt64Ty(), /*isSigned=*/true,
-                             "thread.id.x");
-
-  // computing logical thread ids
-  // logical_x = x % tile_size
-  auto logical_x = builder->CreateURem(x, builder->getInt64(tile_size));
-
-  // logical_y = x / tile_size
-  auto logical_y = builder->CreateUDiv(x, builder->getInt64(tile_size));
-
-  // `emit_cp` emits equivalent to following pseudocode:
-  // if (tile_size == tile_width && tile_size == tile_height) {
-  //   unroll for (i in range(0, tile_size, num_rows)) {
-  //     emit_cp_element(index + {0, i, 0}, y + logical_y);
-  //   }
-  // } else if (x < tile_width) {
-  //   tile_height_upperbound = ceil(tile_height / num_rows) * num_rows;
-  //   for (i in range(0, tile_height_upperbound, num_rows)) {
-  //     y_loc = i + logical_y;
-  //     if (y_loc < tile_height)
-  //      emit_cp_element(index + {0, i, 0}, y_loc);
-  //   }
-  // }
-  //
-  // We use this to emit both the copy from input to tile and the copy from tile
-  // to output.
-  //
-  // `index` is the origin of the row or column in the input or output array.
-  //
-  // `emit_cp_element(index, y)` emits code to copy a single element between the
-  // tile and the input or output array, where `y` is the `y`-position in the
-  // tile, whether which is row or column is a function of whether we're copying
-  // from input or to output, and `index` is the index into the input or output
-  // array.
-  auto emit_cp_tile = [builder, tile_size, &offset_dim, num_rows, logical_x,
-                       logical_y](
-                          std::function<void(const llvm_ir::IrArray::Index&,
-                                             llvm::Value*)>
-                              emit_cp_element,
-                          llvm::Value* tile_width, llvm::Value* tile_height,
-                          const llvm_ir::IrArray::Index& index,
-                          const string& loop_name) {
-    llvm_ir::LlvmIfData if_not_last_row = llvm_ir::EmitIfThenElse(
-        builder->CreateAnd(
-            builder->CreateICmpEQ(builder->getInt64(tile_size), tile_width),
-            builder->CreateICmpEQ(builder->getInt64(tile_size), tile_height)),
-        "not_last_row", builder);
-    builder->SetInsertPoint(if_not_last_row.true_block->getTerminator());
-    for (int64 i = 0; i < tile_size; i += num_rows) {
-      auto source_idx = offset_dim(index, builder->getInt64(i), /*dim=*/1);
-      auto y_loc = builder->CreateAdd(builder->getInt64(i), logical_y);
-      emit_cp_element(source_idx, y_loc);
-    }
-    builder->SetInsertPoint(if_not_last_row.false_block->getTerminator());
-    llvm_ir::LlvmIfData if_in_tile = llvm_ir::EmitIfThenElse(
-        builder->CreateICmpULT(logical_x, tile_width), "x_in_tile", builder);
-    builder->SetInsertPoint(if_in_tile.true_block->getTerminator());
-
-    // tile_height_upper_bound = ceil(tile_height / num_rows) * num_rows
-    auto tile_height_upper_bound = builder->CreateMul(
-        builder->CreateUDiv(
-            builder->CreateAdd(tile_height, builder->getInt64(num_rows - 1)),
-            builder->getInt64(num_rows)),
-        builder->getInt64(num_rows));
-
-    auto loop = llvm_ir::ForLoop::EmitForLoop(
-        loop_name, builder->getInt64(0), tile_height_upper_bound,
-        builder->getInt64(num_rows), builder);
-    llvm_ir::SetToFirstInsertPoint(loop->GetHeaderBasicBlock(), builder);
-    builder->SetInsertPoint(loop->GetBodyBasicBlock()->getTerminator());
-
-    auto y_loc = builder->CreateAdd(loop->GetIndVarValue(), logical_y);
-    auto if_y_in_tile = llvm_ir::EmitIfThenElse(
-        builder->CreateICmpULT(y_loc, tile_height), "y_in_tile", builder);
-    builder->SetInsertPoint(if_y_in_tile.true_block->getTerminator());
-
-    emit_cp_element(offset_dim(index, loop->GetIndVarValue(), /*dim=*/1),
-                    y_loc);
-    builder->SetInsertPoint(if_not_last_row.after_block->getTerminator());
-  };
-
-  auto input_dims_in_tiles = input_shape.dimensions();
-  // Unpermuted dimensions are untiled.
-  for (int i = 1; i < 3; ++i) {
-    input_dims_in_tiles[i] =
-        CeilOfRatio<int64>(input_dims_in_tiles[i], tile_size);
-  }
-  int64 num_tiles =
-      std::accumulate(input_dims_in_tiles.begin(), input_dims_in_tiles.end(), 1,
-                      std::multiplies<int64>());
-  const llvm_ir::IrArray::Index input_tile_index(
-      /*linear=*/builder->CreateIntCast(
-          llvm_ir::AddRangeMetadata(
-              0, num_tiles,
-              static_cast<llvm::Instruction*>(llvm_ir::EmitCallToIntrinsic(
-                  llvm::Intrinsic::nvvm_read_ptx_sreg_ctaid_x, {}, {},
-                  builder))),
-          builder->getInt64Ty(), /*isSigned=*/true, "block.id.x"),
-      ShapeUtil::MakeShapeWithDescendingLayout(
-          PRED /*arbitrary*/, AsInt64Slice(input_dims_in_tiles)),
-      builder);
-  const llvm_ir::IrArray::Index input_tile_origin = ({
-    llvm_ir::IrArray::Index index = input_tile_index;
-    for (int i = 1; i < 3; ++i) {
-      index[i] = builder->CreateMul(index[i], builder->getInt64(tile_size),
-                                    "tile_origin." + std::to_string(i));
-    }
-    index;
-  });
-  const llvm_ir::IrArray::Index input_index =
-      offset_dim(offset_dim(input_tile_origin, logical_x, /*dim=*/2), logical_y,
-                 /*dim=*/1);
-  std::vector<llvm::Value*> tile_dims(input_shape.dimensions().size());
-  // Only last row or column may not have full size.
-  for (int i = 1; i < 3; ++i) {
-    tile_dims[i] = builder->CreateSelect(
-        builder->CreateICmpEQ(input_tile_index[i],
-                              builder->getInt64(input_dims_in_tiles[i] - 1)),
-        builder->getInt64(input_shape.dimensions(i) -
-                          (input_dims_in_tiles[i] - 1) * tile_size),
-        builder->getInt64(tile_size), "tile_size");
-  }
-
-  // Load data from input memory to shared memory tile.
-  emit_cp_tile(
-      // tile[y, x] = input_array[index]
-      [builder, tile, &input, logical_x](const llvm_ir::IrArray::Index& index,
-                                         llvm::Value* y) {
-        builder->CreateStore(
-            input.EmitReadArrayElement(index, builder, "input_element"),
-            builder->CreateGEP(tile, {builder->getInt64(0), y, logical_x}));
-      },
-      tile_dims[2], tile_dims[1], input_index, "input");
+  int unroll_factor = ComputeMaxUnrollFactor(fusion);
 
-  // Wait for all threads to reach this point, lest we copy a value from tile to
-  // output before the other thread copies it from input to tile.
-  // This is `__syncthreads` in CUDA.
-  llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::nvvm_barrier0, {}, {}, builder);
-
-  const llvm_ir::IrArray::Index output_tile_index(
-      Permute({0, 2, 1}, input_tile_index.multidim()));
-  const llvm_ir::IrArray::Index output_tile_origin(
-      Permute({0, 2, 1}, input_tile_origin.multidim()));
-  const llvm_ir::IrArray::Index output_index =
-      offset_dim(offset_dim(output_tile_origin, logical_x, /*dim=*/2),
-                 logical_y, /*dim=*/1);
-
-  // Store data from shared memory tile to output memory.
-  emit_cp_tile(
-      // output_array[index] = tile[x, y]
-      [builder, tile, &output, logical_x](const llvm_ir::IrArray::Index& index,
-                                          llvm::Value* y) {
-        output.EmitWriteArrayElement(
-            index,
-            builder->CreateLoad(
-                builder->CreateGEP(tile, {builder->getInt64(0), logical_x, y}),
-                "output_element"),
-            builder);
-      },
-      tile_dims[1], tile_dims[2], output_index, "output");
-
-  return num_tiles;
+  thunk_sequence_->emplace_back(BuildKernelThunk(
+      fusion, /*implements_whole_instruction=*/true, unroll_factor));
+  return IrEmitter::HandleFusion(fusion);
 }
 
-}  // namespace
-
 Status IrEmitterUnnested::HandleCopy(HloInstruction* copy) {
-  if (ImplementedAsHostToDeviceMemcpy(ir_emitter_context_->buffer_assignment(),
-                                      *copy)) {
-    thunk_sequence_->emplace_back(BuildHostToDeviceCopyThunk(copy));
-    return Status::OK();
-  }
-  if (ImplementedAsDeviceToDeviceMemcpy(
-          ir_emitter_context_->buffer_assignment(), *copy)) {
+  CHECK(ShapeUtil::Compatible(copy->operand(0)->shape(), copy->shape()));
+  const BufferAssignment& buffer_assignment =
+      ir_emitter_context_->buffer_assignment();
+  if (LayoutUtil::Equal(copy->operand(0)->shape().layout(),
+                        copy->shape().layout()) &&
+      buffer_assignment.GetUniqueTopLevelSlice(copy->operand(0)).ok()) {
     thunk_sequence_->emplace_back(BuildDeviceToDeviceCopyThunk(copy));
     return Status::OK();
   }
-  bool is_transpose_021;
-  Shape reduced_input_shape, reduced_output_shape;
-  std::tie(is_transpose_021, reduced_input_shape, reduced_output_shape) =
-      IsTranspose021(copy->operand(0)->shape(), copy->shape());
-  if (is_transpose_021 &&
-      reduced_input_shape.dimensions(1) >= kMinDimensionToTransposeTiled &&
-      reduced_input_shape.dimensions(2) >= kMinDimensionToTransposeTiled) {
-    thunk_sequence_->emplace_back(BuildKernelThunk(copy));
-    VLOG(3) << "Emitting tiled 0-2-1 transposition";
-    constexpr int64 tile_size = 32;
-    constexpr int64 num_rows = 8;
-    int64 num_tiles = EmitTranspose021Tiled(
-        GetIrArray(*copy->operand(0), *copy)
-            .CastToShape(reduced_input_shape, &ir_builder_),
-        GetIrArray(*copy, *copy)
-            .CastToShape(reduced_output_shape, &ir_builder_),
-        tile_size, num_rows, &ir_builder_);
-    UpdateLaunchDimensions(LaunchDimensions(num_tiles, num_rows * tile_size),
-                           LastThunk(), ir_emitter_context_->llvm_module());
+  if (CheckAndEmitHloWithTile021(copy)) {
     return Status::OK();
   }
 
@@ -1032,7 +708,7 @@ Status IrEmitterUnnested::HandleCopy(HloInstruction* copy) {
 }
 
 Status IrEmitterUnnested::EmitExtraOutputsForReduce(
-    const HloInstruction* reduce, const llvm_ir::IrArray::Index& index,
+    const HloInstruction* reduce, const IrArray::Index& index,
     tensorflow::gtl::ArraySlice<
         std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
         extra_output_gens) {
@@ -1040,11 +716,11 @@ Status IrEmitterUnnested::EmitExtraOutputsForReduce(
     const HloInstruction* output = reduce->parent()->FusionInstruction();
     llvm::Value* extra_output_address =
         GetIrArray(*output, *output, extra_output_gens[i].second)
-            .EmitArrayElementAddress(index, &ir_builder_,
+            .EmitArrayElementAddress(index, &b_,
                                      "extra_output_element_address");
     TF_ASSIGN_OR_RETURN(llvm::Value* const extra_output_ir_value,
                         extra_output_gens[i].first(index));
-    ir_builder_.CreateStore(extra_output_ir_value, extra_output_address);
+    b_.CreateStore(extra_output_ir_value, extra_output_address);
   }
   return Status::OK();
 }
@@ -1074,12 +750,10 @@ Status IrEmitterUnnested::EmitReductionToScalar(
   LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
       tiled_input_shape, ir_emitter_context_->device_description());
 
-  llvm::Type* index_ty = GetIndexTypeForKernel(
-      reduce,
-      launch_dimensions.block_count() * launch_dimensions.threads_per_block(),
-      &ir_builder_);
+  llvm::Type* index_ty =
+      GetIndexTypeForKernel(reduce, launch_dimensions.launch_bound(), &b_);
 
-  auto index_typed_const = [&](uint64 c) -> llvm::Constant* {
+  auto index_typed_constant = [&](uint64 c) -> llvm::Constant* {
     return llvm::ConstantInt::get(index_ty, c);
   };
 
@@ -1121,59 +795,57 @@ Status IrEmitterUnnested::EmitReductionToScalar(
   // // and threads_per_block is a multiple of warpSize.
   // reduce_kernel<<<num_blocks, threads_per_block>>>();
   //
-  auto loop_body_emitter =
-      [=](const llvm_ir::IrArray::Index& tile_index) -> Status {
+  auto loop_body_emitter = [=](const IrArray::Index& tile_index) -> Status {
     const int num_reduces = reducers.size();
     llvm::Type* element_ir_type =
         llvm_ir::PrimitiveTypeToIrType(input_shape.element_type(), module_);
     std::vector<llvm::Value*> partial_reduction_result_addresses;
     for (int i = 0; i != num_reduces; ++i) {
-      llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
-          element_ir_type, /*ArraySize=*/nullptr,
-          "partial_reduction_result." + llvm::Twine(i));
-      TF_ASSIGN_OR_RETURN(
-          llvm::Value* const init_ir_value,
-          init_value_gens[i](llvm_ir::IrArray::Index(index_ty)));
-      ir_builder_.CreateStore(init_ir_value, partial_reduction_result_address);
+      llvm::Value* partial_reduction_result_address =
+          b_.CreateAlloca(element_ir_type, /*ArraySize=*/nullptr,
+                          "partial_reduction_result." + llvm::Twine(i));
+      TF_ASSIGN_OR_RETURN(llvm::Value* const init_ir_value,
+                          init_value_gens[i](IrArray::Index(index_ty)));
+      b_.CreateStore(init_ir_value, partial_reduction_result_address);
       partial_reduction_result_addresses.push_back(
           partial_reduction_result_address);
     }
 
     llvm::Value* x_in_tiles = tile_index[0];
-    x_in_tiles = ir_builder_.CreateZExtOrTrunc(x_in_tiles, index_ty);
+    x_in_tiles = b_.CreateZExtOrTrunc(x_in_tiles, index_ty);
 
     // Emit an inner for-loop that reduces the elements in the tile.
     auto emit_tile_element_loop = [=](bool tile_in_bounds) -> Status {
       std::unique_ptr<llvm_ir::ForLoop> tile_element_loop =
           llvm_ir::ForLoop::EmitForLoop(
-              "element_id_in_tile", index_typed_const(0),
-              index_typed_const(kTileSize), index_typed_const(1), &ir_builder_);
+              "element_id_in_tile", index_typed_constant(0),
+              index_typed_constant(kTileSize), index_typed_constant(1), &b_);
 
       // Emit the body of the partial reduction loop.
       llvm_ir::SetToFirstInsertPoint(tile_element_loop->GetBodyBasicBlock(),
-                                     &ir_builder_);
-      llvm::Value* x = ir_builder_.CreateNSWAdd(
-          ir_builder_.CreateNSWMul(x_in_tiles, index_typed_const(kTileSize)),
+                                     &b_);
+      llvm::Value* x = b_.CreateNSWAdd(
+          b_.CreateNSWMul(x_in_tiles, index_typed_constant(kTileSize)),
           tile_element_loop->GetIndVarValue());
       // Unless we know the tile is entirely in bounds, we have to emit a
       // x-in-bounds check before reading from the input.
       if (!tile_in_bounds) {
         llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
-            ir_builder_.CreateICmpULT(x, index_typed_const(num_elems)),
-            "x_in_bounds", &ir_builder_);
+            b_.CreateICmpULT(x, index_typed_constant(num_elems)), "x_in_bounds",
+            &b_);
 
         // Emit code that reads the input element and accumulates it to
         // the partial reduction result.
-        llvm_ir::SetToFirstInsertPoint(if_data.true_block, &ir_builder_);
+        llvm_ir::SetToFirstInsertPoint(if_data.true_block, &b_);
       }
 
-      llvm_ir::IrArray::Index input_index(
-          /*linear=*/x, input_shape, &ir_builder_);
-      llvm::Value* input_address = ir_builder_.CreateAlloca(element_ir_type);
+      IrArray::Index input_index(
+          /*linear=*/x, input_shape, &b_);
+      llvm::Value* input_address = b_.CreateAlloca(element_ir_type);
       for (int i = 0; i != num_reduces; ++i) {
         TF_ASSIGN_OR_RETURN(llvm::Value* const input_ir_value,
                             input_gens[i](input_index));
-        ir_builder_.CreateStore(input_ir_value, input_address);
+        b_.CreateStore(input_ir_value, input_address);
         TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
             *reducers[i],
             {partial_reduction_result_addresses[i], input_address},
@@ -1184,49 +856,48 @@ Status IrEmitterUnnested::EmitReductionToScalar(
 
     // x_end = kTileSize + x_in_tiles * kTileSize, i.e., the location that's
     // immediately beyond the tile.
-    llvm::Value* x_end = ir_builder_.CreateNSWAdd(
-        index_typed_const(kTileSize),
-        ir_builder_.CreateNSWMul(x_in_tiles, index_typed_const(kTileSize)));
+    llvm::Value* x_end = b_.CreateNSWAdd(
+        index_typed_constant(kTileSize),
+        b_.CreateNSWMul(x_in_tiles, index_typed_constant(kTileSize)));
     // The tile is entirely in bound if all_threads_in_bounds or
     // x_end <= num_elems.
-    llvm::Value* tile_in_bounds = ir_builder_.CreateOr(
-        ir_builder_.CreateICmpULE(x_end, index_typed_const(num_elems)),
-        ir_builder_.getInt1(all_threads_in_bounds));
+    llvm::Value* tile_in_bounds =
+        b_.CreateOr(b_.CreateICmpULE(x_end, index_typed_constant(num_elems)),
+                    b_.getInt1(all_threads_in_bounds));
     llvm_ir::LlvmIfData if_tile_in_bounds_data =
-        llvm_ir::EmitIfThenElse(tile_in_bounds, "tile_in_bounds", &ir_builder_);
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.true_block,
-                                   &ir_builder_);
+        llvm_ir::EmitIfThenElse(tile_in_bounds, "tile_in_bounds", &b_);
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.true_block, &b_);
     TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_bounds=*/true));
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.false_block,
-                                   &ir_builder_);
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.false_block, &b_);
     TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_bounds=*/false));
 
     // After the if-then-else statement on tile_in_bounds, emit calls to
     // shfl_down that accumulate the partial reduction results of all threads
     // from the warp.
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.after_block,
-                                   &ir_builder_);
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.after_block, &b_);
     int bit_width = llvm_ir::GetSizeInBits(element_ir_type);
     // bitcast cannot be applied to aggregate types (even packed ones), so we
     // instead bitcast addresses of load/store to intN* of the same bit-width.
     llvm::Type* shuffle_ir_type = element_ir_type->isStructTy()
-                                      ? ir_builder_.getIntNTy(bit_width)
+                                      ? b_.getIntNTy(bit_width)
                                       : element_ir_type;
     for (int shuffle_distance = kWarpSize / 2; shuffle_distance >= 1;
          shuffle_distance /= 2) {
-      llvm::Value* result_from_other_lane = ir_builder_.CreateAlloca(
-          element_ir_type, nullptr, "result_from_other_lane");
+      llvm::Value* result_from_other_lane =
+          b_.CreateAlloca(element_ir_type, nullptr, "result_from_other_lane");
       for (int i = 0; i != num_reduces; ++i) {
-        llvm::Value* partial_reduction_result = ir_builder_.CreateLoad(
-            ir_builder_.CreateBitCast(partial_reduction_result_addresses[i],
-                                      shuffle_ir_type->getPointerTo()),
+        llvm::Value* partial_reduction_result = b_.CreateLoad(
+            b_.CreateBitCast(partial_reduction_result_addresses[i],
+                             shuffle_ir_type->getPointerTo()),
             "partial_reduction_result");
-        ir_builder_.CreateStore(
-            EmitShuffleDown(partial_reduction_result,
-                            ir_builder_.getInt32(shuffle_distance),
-                            &ir_builder_),
-            ir_builder_.CreateBitCast(result_from_other_lane,
-                                      shuffle_ir_type->getPointerTo()));
+        CHECK_EQ(launch_dimensions.threads_per_block() % kWarpSize, 0)
+            << "Requires block size a multiple of the warp size, otherwise we "
+               "will read undefined elements.";
+        b_.CreateStore(
+            EmitFullWarpShuffleDown(partial_reduction_result,
+                                    b_.getInt32(shuffle_distance), &b_),
+            b_.CreateBitCast(result_from_other_lane,
+                             shuffle_ir_type->getPointerTo()));
         TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
             *reducers[i],
             {partial_reduction_result_addresses[i], result_from_other_lane},
@@ -1240,24 +911,23 @@ Status IrEmitterUnnested::EmitReductionToScalar(
     // Emit an atomic operation that accumulates the partial reduction result of
     // lane 0 (which holds the partially accumulated result for its warp) to the
     // output element.
-    llvm::Value* lane_id = ir_builder_.CreateURem(
-        x_in_tiles, index_typed_const(kWarpSize), "lane_id");
+    llvm::Value* lane_id =
+        b_.CreateURem(x_in_tiles, index_typed_constant(kWarpSize), "lane_id");
     llvm_ir::LlvmIfData if_lane_id_is_zero_data = llvm_ir::EmitIfThenElse(
-        ir_builder_.CreateICmpEQ(lane_id, index_typed_const(0)),
-        "lane_id_is_zero", &ir_builder_);
-    llvm_ir::SetToFirstInsertPoint(if_lane_id_is_zero_data.true_block,
-                                   &ir_builder_);
+        b_.CreateICmpEQ(lane_id, index_typed_constant(0)), "lane_id_is_zero",
+        &b_);
+    llvm_ir::SetToFirstInsertPoint(if_lane_id_is_zero_data.true_block, &b_);
 
     for (int i = 0; i != num_reduces; ++i) {
       llvm::Value* output_address =
           GetIrArray(*output, *output, reduce_output_shapes[i])
               .EmitArrayElementAddress(
-                  llvm_ir::IrArray::Index(
-                      /*linear=*/ir_builder_.getInt64(0),
+                  IrArray::Index(
+                      /*linear=*/b_.getInt64(0),
                       ShapeUtil::GetSubshape(output->shape(),
                                              reduce_output_shapes[i]),
-                      &ir_builder_),
-                  &ir_builder_, "output_element_address");
+                      &b_),
+                  &b_, "output_element_address");
       TF_RETURN_IF_ERROR(EmitAtomicOperationForNestedComputation(
           *reducers[i], output_address, partial_reduction_result_addresses[i]));
     }
@@ -1271,7 +941,7 @@ Status IrEmitterUnnested::EmitReductionToScalar(
       static_cast<SequentialThunk*>(LastThunk())->thunks().back().get(),
       ir_emitter_context_->llvm_module());
   return ParallelLoopEmitter(loop_body_emitter, tiled_input_shape,
-                             launch_dimensions, &ir_builder_)
+                             launch_dimensions, &b_)
       .EmitLoop(IrName(reduce), index_ty);
 }
 
@@ -1284,8 +954,8 @@ Status IrEmitterUnnested::EmitColumnReduction(
     tensorflow::gtl::ArraySlice<
         std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
         extra_output_gens) {
-  // Divide the input matrix into tiles of size Kx1. For example, when the
-  // input matrix is 4x4 and K=2, the tiled matrix looks like
+  // Divide the input matrix into tiles of size KxL. For example, when the
+  // input matrix is 4x4, K=2, and L=1 the tiled matrix looks like
   //
   //   0123
   //   0123
@@ -1297,100 +967,131 @@ Status IrEmitterUnnested::EmitColumnReduction(
   //
   // We choose 128 as the tile size based on empirical evidence. It's big enough
   // to reduce the amount of atomic adds in the end, maximizing the memory
-  // bandwidth.
-  constexpr int64 kTileSize = 128;
+  // bandwidth. A tile width of 2 allows for high memory bandwidth utilization
+  // on 16b input data.
+  constexpr int64 kTileHeight = 128;
+  constexpr int64 kTileWidth = 2;
 
-  // If the height is not a multiple of the tile size, we pad the bottom of the
+  // If the height is not a multiple of kTileHeight, we pad the bottom of the
   // input matrix.
-  const int64 height_in_tiles = CeilOfRatio(height, kTileSize);
-  Shape tiled_input_shape = ShapeUtil::MakeShapeWithLayout(
-      reduce->shape().element_type(), {height_in_tiles, width}, {1, 0});
+  const int64 height_in_tiles = CeilOfRatio(height, kTileHeight);
+  // If width is not a multiple of kTileWidth the rightmost thread will process
+  // fewer input elements.
+  const int64 width_in_tiles = CeilOfRatio(width, kTileWidth);
+  Shape tiled_input_shape =
+      ShapeUtil::MakeShapeWithLayout(reduce->shape().element_type(),
+                                     {height_in_tiles, width_in_tiles}, {1, 0});
   LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
       tiled_input_shape, ir_emitter_context_->device_description());
 
   // TODO(b/110211620): Convert to use i32 index_type when it is possible.
-  llvm::Type* index_ty = ir_builder_.getInt64Ty();
+  llvm::Type* index_ty = b_.getInt64Ty();
 
-  auto index_typed_const = [&](uint64 c) -> llvm::Constant* {
+  auto index_typed_constant = [&](uint64 c) -> llvm::Constant* {
     return llvm::ConstantInt::get(index_ty, c);
   };
 
   // for (linear_index = threadIdx.x + blockIdx.x * blockDim.x;
-  //      linear_index < height_in_tiles * width;
+  //      linear_index < height_in_tiles * width_in_tiles;
   //      linear_index += blockDim.x * gridDim.x) {
-  //   y_in_tiles = linear_index / width;
-  //   x = linear_index % width;
+  //   y_in_tiles = linear_index / width_in_tiles;
+  //   x_in_tiles = linear_index % width_in_tiles;
   //
-  //   partial_result = init_value;
-  //   if (height % kTileSize == 0 ||
-  //       y_in_tiles * kTileSize + kTileSize <= height) {
-  //     for (element_id_in_tile : range(kTileSize)) {
-  //       y = y_in_tiles * kTileSize + element_id_in_tile;
-  //       partial_result = Reducer(partial_result, input[y][x]);
+  //   partial_results[kTileWidth] = init_values;
+  //   tile_in_y_bounds = height % kTileHeight == 0 ||
+  //       y_in_tiles * kTileHeight + kTileHeight <= height;
+  //   tile_in_x_bounds = width % kTileWidth == 0 ||
+  //       x_in_tiles * kTileWidth + kTileWidth <= width;
+  //   // The implementation handles y and x bound checks separately.
+  //   if (tile_in_y_bounds && tile_in_x_bounds) {
+  //     for (y_offset : range(kTileHeight)) {
+  //       y = y_in_tiles * kTileHeight + y_offset;
+  //       for (x_offset : range(kTileWidth)) {
+  //         x = x_in_tiles * kTileWidth + x_offset;
+  //         partial_result = Reducer(partial_result[x_offset], input[y][x]);
+  //       }
   //     }
   //   } else {
-  //     for (element_id_in_tile : range(kTileSize)) {
-  //       y = y_in_tiles * kTileSize + element_id_in_tile;
-  //       if (y < height) {
-  //         partial_result = Reducer(partial_result, input[y][x]);
+  //     for (y_offset : range(kTileHeight)) {
+  //       y = y_in_tiles * kTileHeight + y_offset;
+  //       for (y_offset : range(kTileHeight)) {
+  //         x = x_in_tiles * kTileWidth + x_offset;
+  //         if (y < height && x < width) {
+  //           partial_result = Reducer(partial_result, input[y][x]);
+  //         }
   //       }
   //     }
   //   }
-  //   AtomicReducer(&output[x], partial_result);
+  //   for (x_offset : range(kTileWidth)) {
+  //     AtomicReducer(&output[x + x_offset], partial_result[x_offset]);
+  //   }
   // }
-  auto loop_body_emitter =
-      [=](const llvm_ir::IrArray::Index& tile_index) -> Status {
+  auto loop_body_emitter = [=](const IrArray::Index& tile_index) -> Status {
     const int num_reduces = reducers.size();
     // Emit the loop body that reduces one tile.
     llvm::Type* element_ir_type =
         llvm_ir::PrimitiveTypeToIrType(input_shape.element_type(), module_);
     std::vector<llvm::Value*> partial_reduction_result_addresses;
     for (int i = 0; i != num_reduces; ++i) {
-      llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
-          element_ir_type, /*ArraySize=*/nullptr,
-          "partial_reduction_result." + llvm::Twine(i));
-      TF_ASSIGN_OR_RETURN(
-          llvm::Value* const init_ir_value,
-          init_value_gens[i](llvm_ir::IrArray::Index(index_ty)));
-      ir_builder_.CreateStore(init_ir_value, partial_reduction_result_address);
-      partial_reduction_result_addresses.push_back(
-          partial_reduction_result_address);
+      for (int x_offset = 0; x_offset < kTileWidth; ++x_offset) {
+        llvm::Value* partial_reduction_result_address =
+            b_.CreateAlloca(element_ir_type, /*ArraySize=*/nullptr,
+                            "partial_reduction_result." +
+                                llvm::Twine(i * kTileWidth + x_offset));
+        TF_ASSIGN_OR_RETURN(llvm::Value* const init_ir_value,
+                            init_value_gens[i](IrArray::Index(index_ty)));
+        b_.CreateStore(init_ir_value, partial_reduction_result_address);
+        partial_reduction_result_addresses.push_back(
+            partial_reduction_result_address);
+      }
     }
 
     // Emit an inner for-loop that partially reduces the elements in the given
     // tile.
     llvm::Value* y_in_tiles = tile_index[0];
-    llvm::Value* x = tile_index[1];
+    llvm::Value* x_in_tiles = tile_index[1];
 
-    y_in_tiles = ir_builder_.CreateZExtOrTrunc(y_in_tiles, index_ty);
-    x = ir_builder_.CreateZExtOrTrunc(x, index_ty);
+    y_in_tiles = b_.CreateZExtOrTrunc(y_in_tiles, index_ty);
+    x_in_tiles = b_.CreateZExtOrTrunc(x_in_tiles, index_ty);
 
-    auto emit_tile_element_loop = [=](bool tile_in_bounds) -> Status {
+    auto emit_tile_element_loop = [=](bool tile_in_y_bounds,
+                                      bool tile_in_x_bounds) -> Status {
       std::unique_ptr<llvm_ir::ForLoop> tile_element_loop =
           llvm_ir::ForLoop::EmitForLoop(
-              "element_id_in_tile", index_typed_const(0),
-              index_typed_const(kTileSize), index_typed_const(1), &ir_builder_);
+              "element_id_in_tile", index_typed_constant(0),
+              index_typed_constant(kTileHeight), index_typed_constant(1), &b_);
 
       // Emit the body of the partial reduction loop.
       llvm_ir::SetToFirstInsertPoint(tile_element_loop->GetBodyBasicBlock(),
-                                     &ir_builder_);
-      llvm::Value* y = ir_builder_.CreateNSWAdd(
-          ir_builder_.CreateNSWMul(y_in_tiles, index_typed_const(kTileSize)),
+                                     &b_);
+      llvm::Value* y = b_.CreateNSWAdd(
+          b_.CreateNSWMul(y_in_tiles, index_typed_constant(kTileHeight)),
           tile_element_loop->GetIndVarValue());
 
-      // Unless we know the tile is entirely in bounds, we have to emit a
-      // y-in-bounds check before reading from the input.
-      if (!tile_in_bounds) {
+      // Unless we know that y is in bounds, we have to emit a check before
+      // reading from the input.
+      if (!tile_in_y_bounds) {
         llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
-            ir_builder_.CreateICmpULT(y, index_typed_const(height)),
-            "y_in_bounds", &ir_builder_);
+            b_.CreateICmpULT(y, index_typed_constant(height)), "y_in_bounds",
+            &b_);
 
         // Emit code that reads the input element and accumulates it to
         // the partial reduction result.
-        llvm_ir::SetToFirstInsertPoint(if_data.true_block, &ir_builder_);
+        llvm_ir::SetToFirstInsertPoint(if_data.true_block, &b_);
       }
-      llvm::Value* input_address = ir_builder_.CreateAlloca(element_ir_type);
-      {
+      for (int x_offset = 0; x_offset < kTileWidth; ++x_offset) {
+        llvm::Value* x = b_.CreateNSWAdd(
+            b_.CreateNSWMul(x_in_tiles, index_typed_constant(kTileWidth)),
+            index_typed_constant(x_offset));
+        // Unless we know that x is in bounds, we have to emit a check before
+        // reading from the input.
+        if (!tile_in_x_bounds) {
+          llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
+              b_.CreateICmpULT(x, index_typed_constant(width)), "x_in_bounds",
+              &b_);
+          llvm_ir::SetToFirstInsertPoint(if_data.true_block, &b_);
+        }
+        llvm::Value* input_address = b_.CreateAlloca(element_ir_type);
         // {y,x} is an index to input_matrix_shape [height,width]. We need to
         // convert that to an index to input_shape (the shape of the operand of
         // "reduce"). This conversion is composed of a transposition from
@@ -1406,67 +1107,95 @@ Status IrEmitterUnnested::EmitColumnReduction(
         const Shape input_matrix_shape =
             ShapeUtil::MakeShapeWithDescendingLayout(input_shape.element_type(),
                                                      {height, width});
-        const llvm_ir::IrArray::Index input_matrix_index(
-            {y, x}, input_matrix_shape, &ir_builder_);
-        const llvm_ir::IrArray::Index input_index =
+        const IrArray::Index input_matrix_index({y, x}, input_matrix_shape,
+                                                &b_);
+        const IrArray::Index input_index =
             input_matrix_index
                 .SourceIndexOfReshape(input_matrix_shape,
-                                      normalized_input_shape, &ir_builder_)
+                                      normalized_input_shape, &b_)
                 .SourceIndexOfTranspose(normalized_input_shape, input_shape,
-                                        transpose_dimension_mapping,
-                                        &ir_builder_);
+                                        transpose_dimension_mapping, &b_);
         for (int i = 0; i != num_reduces; ++i) {
           TF_ASSIGN_OR_RETURN(llvm::Value* const input_ir_value,
                               input_gens[i](input_index));
-          ir_builder_.CreateStore(input_ir_value, input_address);
+          b_.CreateStore(input_ir_value, input_address);
           TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
               *reducers[i],
-              {partial_reduction_result_addresses[i], input_address},
-              partial_reduction_result_addresses[i]));
+              {partial_reduction_result_addresses[i * kTileWidth + x_offset],
+               input_address},
+              partial_reduction_result_addresses[i * kTileWidth + x_offset]));
+          TF_RETURN_IF_ERROR(EmitExtraOutputsForReduce(reduce, input_index,
+                                                       extra_output_gens));
         }
-        return EmitExtraOutputsForReduce(reduce, input_index,
-                                         extra_output_gens);
       }
+      return Status::OK();
     };
 
-    // y_end = kTileSize + y_in_tiles * kTileSize, i.e., the y location that's
-    // immediately beyond the tile.
-    llvm::Value* y_end = ir_builder_.CreateNSWAdd(
-        index_typed_const(kTileSize),
-        ir_builder_.CreateNSWMul(y_in_tiles, index_typed_const(kTileSize)));
-    llvm::Value* tile_in_bounds = ir_builder_.CreateOr(
-        ir_builder_.CreateICmpULE(y_end, index_typed_const(height)),
-        ir_builder_.getInt1(height % kTileSize == 0));
-    // The tile is entirely in bound if "height" is a multiple of kTileSize or
+    // y_end = kTileHeight + y_in_tiles * kTileHeight, i.e., the y location
+    // that's immediately beyond the tile.
+    llvm::Value* y_end = b_.CreateNSWAdd(
+        index_typed_constant(kTileHeight),
+        b_.CreateNSWMul(y_in_tiles, index_typed_constant(kTileHeight)));
+    // x_end = kTileWidth + x_in_tiles * kTileWidth, i.e., the x location
+    // that's immediately beyond the tile.
+    llvm::Value* x_end = b_.CreateNSWAdd(
+        index_typed_constant(kTileWidth),
+        b_.CreateNSWMul(x_in_tiles, index_typed_constant(kTileWidth)));
+    llvm::Value* tile_in_y_bounds =
+        b_.CreateOr(b_.CreateICmpULE(y_end, index_typed_constant(height)),
+                    b_.getInt1(height % kTileHeight == 0));
+    llvm::Value* tile_in_x_bounds =
+        b_.CreateOr(b_.CreateICmpULE(x_end, index_typed_constant(width)),
+                    b_.getInt1(width % kTileWidth == 0));
+    // The tile is in y bounds if "height" is a multiple of kTileHeight or
     // y_end <= height.
-    llvm_ir::LlvmIfData if_tile_in_bounds_data =
-        llvm_ir::EmitIfThenElse(tile_in_bounds, "tile_in_bounds", &ir_builder_);
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.true_block,
-                                   &ir_builder_);
-    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_bounds=*/true));
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.false_block,
-                                   &ir_builder_);
-    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_bounds=*/false));
-
-    // After the if-then-else statement on tile_in_bounds, emit atomic
-    // operations to accumulate the partial reduction result to the output
-    // element.
-    llvm_ir::SetToFirstInsertPoint(if_tile_in_bounds_data.after_block,
-                                   &ir_builder_);
+    llvm_ir::LlvmIfData if_tile_in_y_bounds_data =
+        llvm_ir::EmitIfThenElse(tile_in_y_bounds, "tile_in_y_bounds", &b_);
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_y_bounds_data.true_block, &b_);
+    // The tile is in x bounds if "width" is a multiple of kTileWidth or
+    // x_end <= width.
+    llvm_ir::LlvmIfData if_tile_in_x_bounds_data =
+        llvm_ir::EmitIfThenElse(tile_in_x_bounds, "tile_in_x_bounds", &b_);
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_x_bounds_data.true_block, &b_);
+    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_y_bounds=*/true,
+                                              /*tile_in_x_bounds=*/true));
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_x_bounds_data.false_block, &b_);
+    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_y_bounds=*/true,
+                                              /*tile_in_x_bounds=*/false));
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_y_bounds_data.false_block, &b_);
+    if_tile_in_x_bounds_data =
+        llvm_ir::EmitIfThenElse(tile_in_x_bounds, "tile_in_x_bounds", &b_);
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_x_bounds_data.true_block, &b_);
+    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_y_bounds=*/false,
+                                              /*tile_in_x_bounds=*/true));
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_x_bounds_data.false_block, &b_);
+    TF_RETURN_IF_ERROR(emit_tile_element_loop(/*tile_in_y_bounds=*/false,
+                                              /*tile_in_x_bounds=*/false));
+
+    // After the nested if-then-else statement on tile_in_y_bounds and
+    // tile_in_x_bounds, emit atomic operations to accumulate the partial
+    // reduction result to the output element.
+    llvm_ir::SetToFirstInsertPoint(if_tile_in_y_bounds_data.after_block, &b_);
     const HloInstruction* output =
         reduce->IsFused() ? reduce->parent()->FusionInstruction() : reduce;
     for (int i = 0; i != num_reduces; ++i) {
-      llvm::Value* output_address =
-          GetIrArray(*output, *output, reduce_output_shapes[i])
-              .EmitArrayElementAddress(
-                  llvm_ir::IrArray::Index(
-                      x,
-                      ShapeUtil::GetSubshape(output->shape(),
-                                             reduce_output_shapes[i]),
-                      &ir_builder_),
-                  &ir_builder_, "output_element_address");
-      TF_RETURN_IF_ERROR(EmitAtomicOperationForNestedComputation(
-          *reducers[i], output_address, partial_reduction_result_addresses[i]));
+      for (int x_offset = 0; x_offset < kTileWidth; ++x_offset) {
+        llvm::Value* x = b_.CreateNSWAdd(
+            b_.CreateNSWMul(x_in_tiles, index_typed_constant(kTileWidth)),
+            index_typed_constant(x_offset));
+        llvm::Value* output_address =
+            GetIrArray(*output, *output, reduce_output_shapes[i])
+                .EmitArrayElementAddress(
+                    IrArray::Index(
+                        x,
+                        ShapeUtil::GetSubshape(output->shape(),
+                                               reduce_output_shapes[i]),
+                        &b_),
+                    &b_, "output_element_address");
+        TF_RETURN_IF_ERROR(EmitAtomicOperationForNestedComputation(
+            *reducers[i], output_address,
+            partial_reduction_result_addresses[i * kTileWidth + x_offset]));
+      }
     }
     return Status::OK();
   };
@@ -1478,7 +1207,7 @@ Status IrEmitterUnnested::EmitColumnReduction(
       static_cast<SequentialThunk*>(LastThunk())->thunks().back().get(),
       ir_emitter_context_->llvm_module());
   return ParallelLoopEmitter(loop_body_emitter, tiled_input_shape,
-                             launch_dimensions, &ir_builder_)
+                             launch_dimensions, &b_)
       .EmitLoop(IrName(reduce), index_ty);
 }
 
@@ -1628,28 +1357,25 @@ Status IrEmitterUnnested::EmitRowReduction(
       {depth / z_tile_size, height, width_in_tiles}, {2, 1, 0});
   LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
       tiled_input_shape, ir_emitter_context_->device_description());
-  llvm::Type* index_ty = GetIndexTypeForKernel(
-      reduce,
-      launch_dimensions.block_count() * launch_dimensions.threads_per_block(),
-      &ir_builder_);
+  llvm::Type* index_ty =
+      GetIndexTypeForKernel(reduce, launch_dimensions.launch_bound(), &b_);
 
-  auto index_typed_const = [&](uint64 c) -> llvm::Constant* {
+  auto index_typed_constant = [&](uint64 c) -> llvm::Constant* {
     return llvm::ConstantInt::get(index_ty, c);
   };
 
-  auto loop_body_emitter = [=](const llvm_ir::IrArray::Index& tile_index) {
+  auto loop_body_emitter = [=](const IrArray::Index& tile_index) {
     const int num_reduces = reducers.size();
     llvm::Type* element_ir_type = llvm_ir::PrimitiveTypeToIrType(
         input_shape.element_type(), ir_emitter_context_->llvm_module());
     std::vector<llvm::Value*> partial_reduction_result_addresses;
     for (int i = 0; i != num_reduces; ++i) {
-      llvm::Value* partial_reduction_result_address = ir_builder_.CreateAlloca(
-          element_ir_type, /*ArraySize=*/nullptr,
-          "partial_reduction_result." + llvm::Twine(i));
-      TF_ASSIGN_OR_RETURN(
-          llvm::Value* const init_ir_value,
-          init_value_gens[i](llvm_ir::IrArray::Index(index_ty)));
-      ir_builder_.CreateStore(init_ir_value, partial_reduction_result_address);
+      llvm::Value* partial_reduction_result_address =
+          b_.CreateAlloca(element_ir_type, /*ArraySize=*/nullptr,
+                          "partial_reduction_result." + llvm::Twine(i));
+      TF_ASSIGN_OR_RETURN(llvm::Value* const init_ir_value,
+                          init_value_gens[i](IrArray::Index(index_ty)));
+      b_.CreateStore(init_ir_value, partial_reduction_result_address);
       partial_reduction_result_addresses.push_back(
           partial_reduction_result_address);
     }
@@ -1658,25 +1384,25 @@ Status IrEmitterUnnested::EmitRowReduction(
     llvm::Value* y = tile_index[1];
     llvm::Value* x_tile = tile_index[2];
 
-    x_tile = ir_builder_.CreateZExtOrTrunc(x_tile, index_ty);
+    x_tile = b_.CreateZExtOrTrunc(x_tile, index_ty);
 
     llvm::Value* warp_id =
-        ir_builder_.CreateUDiv(x_tile, index_typed_const(kWarpSize), "warp_id");
+        b_.CreateUDiv(x_tile, index_typed_constant(kWarpSize), "warp_id");
     llvm::Value* lane_id =
-        ir_builder_.CreateURem(x_tile, index_typed_const(kWarpSize), "lane_id");
+        b_.CreateURem(x_tile, index_typed_constant(kWarpSize), "lane_id");
 
     // The x-location of the last element in this z-x-tile.
     // last_x = lane_id + warpSize * (x_tile_size - 1 + warp_id * x_tile_size);
-    llvm::Value* last_x = ir_builder_.CreateNSWAdd(
-        lane_id, ir_builder_.CreateNSWMul(
-                     index_typed_const(kWarpSize),
-                     ir_builder_.CreateNSWAdd(
-                         index_typed_const(x_tile_size - 1),
-                         ir_builder_.CreateNSWMul(
-                             warp_id, index_typed_const(x_tile_size)))));
+    llvm::Value* last_x = b_.CreateNSWAdd(
+        lane_id,
+        b_.CreateNSWMul(
+            index_typed_constant(kWarpSize),
+            b_.CreateNSWAdd(
+                index_typed_constant(x_tile_size - 1),
+                b_.CreateNSWMul(warp_id, index_typed_constant(x_tile_size)))));
 
     KernelSupportLibrary ksl(
-        &ir_builder_,
+        &b_,
         /*unroll_mode=*/xla::llvm_ir::UnrollMode::kFullyUnroll,
         /*prevent_vectorization=*/false);
 
@@ -1685,22 +1411,22 @@ Status IrEmitterUnnested::EmitRowReduction(
     auto emit_z_x_tile_element_loop = [&](bool x_tile_in_bounds,
                                           int64 x_tile_loop_bound) -> Status {
       auto emit_z_tile_element_loop = [&](llvm::Value* z_indvar) -> Status {
-        llvm::Value* z = ir_builder_.CreateNSWAdd(
+        llvm::Value* z = b_.CreateNSWAdd(
             z_indvar,
-            ir_builder_.CreateNSWMul(index_typed_const(z_tile_size), z_tile));
+            b_.CreateNSWMul(index_typed_constant(z_tile_size), z_tile));
         TF_RETURN_IF_ERROR(ksl.For(
             "x_tile",
-            /*start=*/index_typed_const(0),
-            /*end=*/index_typed_const(x_tile_loop_bound),
+            /*start=*/index_typed_constant(0),
+            /*end=*/index_typed_constant(x_tile_loop_bound),
             /*step=*/1, [&](llvm::Value* x_indvar) -> Status {
               // x = lane_id +
               //     warpSize * (element_id_in_x_tile + warp_id * x_tile_size);
-              llvm::Value* x = ir_builder_.CreateNSWAdd(
+              llvm::Value* x = b_.CreateNSWAdd(
                   lane_id,
-                  ir_builder_.CreateNSWMul(
-                      index_typed_const(kWarpSize),
-                      ir_builder_.CreateNSWAdd(
-                          x_indvar, ir_builder_.CreateNSWMul(
+                  b_.CreateNSWMul(
+                      index_typed_constant(kWarpSize),
+                      b_.CreateNSWAdd(
+                          x_indvar, b_.CreateNSWMul(
                                         warp_id, llvm::ConstantInt::get(
                                                      index_ty, x_tile_size)))));
 
@@ -1709,17 +1435,16 @@ Status IrEmitterUnnested::EmitRowReduction(
               if (!x_tile_in_bounds) {
                 llvm_ir::LlvmIfData if_x_in_bounds_data =
                     llvm_ir::EmitIfThenElse(
-                        ir_builder_.CreateICmpULT(x, index_typed_const(width)),
-                        "x_in_bounds", &ir_builder_);
-                // Points ir_builder_ to the then-block.
+                        b_.CreateICmpULT(x, index_typed_constant(width)),
+                        "x_in_bounds", &b_);
+                // Points b_ to the then-block.
                 llvm_ir::SetToFirstInsertPoint(if_x_in_bounds_data.true_block,
-                                               &ir_builder_);
+                                               &b_);
               }
 
               // Emit code that reads the input element and accumulates it
               // to the partial reduction result.
-              llvm::Value* input_address =
-                  ir_builder_.CreateAlloca(element_ir_type);
+              llvm::Value* input_address = b_.CreateAlloca(element_ir_type);
               {
                 // {z,y,x} is an index to input_3d_tensor_shape
                 // [depth,height,width]. We need to convert that to an index
@@ -1737,21 +1462,20 @@ Status IrEmitterUnnested::EmitRowReduction(
                 const Shape input_3d_tensor_shape =
                     ShapeUtil::MakeShapeWithDescendingLayout(
                         input_shape.element_type(), {depth, height, width});
-                const llvm_ir::IrArray::Index input_3d_tensor_index(
-                    {z, y, x}, input_3d_tensor_shape, &ir_builder_);
-                const llvm_ir::IrArray::Index input_index =
+                const IrArray::Index input_3d_tensor_index(
+                    {z, y, x}, input_3d_tensor_shape, &b_);
+                const IrArray::Index input_index =
                     input_3d_tensor_index
                         .SourceIndexOfReshape(input_3d_tensor_shape,
-                                              normalized_input_shape,
-                                              &ir_builder_)
+                                              normalized_input_shape, &b_)
                         .SourceIndexOfTranspose(
                             normalized_input_shape, input_shape,
-                            transpose_dimension_mapping, &ir_builder_);
+                            transpose_dimension_mapping, &b_);
 
                 for (int i = 0; i != num_reduces; ++i) {
                   TF_ASSIGN_OR_RETURN(llvm::Value* const input_ir_value,
                                       input_gens[i](input_index));
-                  ir_builder_.CreateStore(input_ir_value, input_address);
+                  b_.CreateStore(input_ir_value, input_address);
                   TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
                       *reducers[i],
                       {partial_reduction_result_addresses[i], input_address},
@@ -1765,14 +1489,14 @@ Status IrEmitterUnnested::EmitRowReduction(
       };
 
       return ksl.For("z_tile",
-                     /*start=*/index_typed_const(0),
-                     /*end=*/index_typed_const(z_tile_size),
+                     /*start=*/index_typed_constant(0),
+                     /*end=*/index_typed_constant(z_tile_size),
                      /*step=*/1, emit_z_tile_element_loop);
     };
 
-    llvm::Value* tile_in_bounds = ir_builder_.CreateOr(
-        ir_builder_.getInt1(width % (x_tile_size * kWarpSize) == 0),
-        ir_builder_.CreateICmpULT(last_x, index_typed_const(width)));
+    llvm::Value* tile_in_bounds =
+        b_.CreateOr(b_.getInt1(width % (x_tile_size * kWarpSize) == 0),
+                    b_.CreateICmpULT(last_x, index_typed_constant(width)));
 
     TF_RETURN_IF_ERROR(
         ksl.If(tile_in_bounds,
@@ -1795,23 +1519,25 @@ Status IrEmitterUnnested::EmitRowReduction(
     // bitcast cannot be applied to aggregate types (even packed ones), so we
     // instead bitcast addresses of load/store to intN* of the same bit-width.
     llvm::Type* shuffle_ir_type = element_ir_type->isStructTy()
-                                      ? ir_builder_.getIntNTy(bit_width)
+                                      ? b_.getIntNTy(bit_width)
                                       : element_ir_type;
     for (int shuffle_distance = 16; shuffle_distance >= 1;
          shuffle_distance /= 2) {
-      llvm::Value* result_from_other_lane = ir_builder_.CreateAlloca(
-          element_ir_type, nullptr, "result_from_other_lane");
+      llvm::Value* result_from_other_lane =
+          b_.CreateAlloca(element_ir_type, nullptr, "result_from_other_lane");
       for (int i = 0; i != num_reduces; ++i) {
-        llvm::Value* partial_reduction_result = ir_builder_.CreateLoad(
-            ir_builder_.CreateBitCast(partial_reduction_result_addresses[i],
-                                      shuffle_ir_type->getPointerTo()),
+        llvm::Value* partial_reduction_result = b_.CreateLoad(
+            b_.CreateBitCast(partial_reduction_result_addresses[i],
+                             shuffle_ir_type->getPointerTo()),
             "partial_reduction_result");
-        ir_builder_.CreateStore(
-            EmitShuffleDown(partial_reduction_result,
-                            ir_builder_.getInt32(shuffle_distance),
-                            &ir_builder_),
-            ir_builder_.CreateBitCast(result_from_other_lane,
-                                      shuffle_ir_type->getPointerTo()));
+        CHECK_EQ(launch_dimensions.threads_per_block() % kWarpSize, 0)
+            << "Requires block size a multiple of the warp size, otherwise we "
+               "will read undefined elements.";
+        b_.CreateStore(
+            EmitFullWarpShuffleDown(partial_reduction_result,
+                                    b_.getInt32(shuffle_distance), &b_),
+            b_.CreateBitCast(result_from_other_lane,
+                             shuffle_ir_type->getPointerTo()));
         TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
             *reducers[i],
             {partial_reduction_result_addresses[i], result_from_other_lane},
@@ -1826,20 +1552,18 @@ Status IrEmitterUnnested::EmitRowReduction(
     // lane 0 (which holds the partially accumulated result for its warp) to the
     // output element.
     llvm_ir::LlvmIfData if_lane_id_is_zero_data = llvm_ir::EmitIfThenElse(
-        ir_builder_.CreateICmpEQ(lane_id, index_typed_const(0)),
-        "lane_id_is_zero", &ir_builder_);
-    llvm_ir::SetToFirstInsertPoint(if_lane_id_is_zero_data.true_block,
-                                   &ir_builder_);
+        b_.CreateICmpEQ(lane_id, index_typed_constant(0)), "lane_id_is_zero",
+        &b_);
+    llvm_ir::SetToFirstInsertPoint(if_lane_id_is_zero_data.true_block, &b_);
     for (int i = 0; i != num_reduces; ++i) {
       llvm::Value* output_address =
           GetIrArray(*output, *output, reduce_output_shapes[i])
               .EmitArrayElementAddress(
-                  llvm_ir::IrArray::Index(
-                      y,
-                      ShapeUtil::GetSubshape(output->shape(),
-                                             reduce_output_shapes[i]),
-                      &ir_builder_),
-                  &ir_builder_, "output_element_address");
+                  IrArray::Index(y,
+                                 ShapeUtil::GetSubshape(
+                                     output->shape(), reduce_output_shapes[i]),
+                                 &b_),
+                  &b_, "output_element_address");
       // We don't need to emit atomic operations if there is only one tile of
       // results. 'depth' is the z dimension, 'width' is the x dimension.
       if (z_tile_size >= depth && x_tile_size >= width) {
@@ -1863,7 +1587,7 @@ Status IrEmitterUnnested::EmitRowReduction(
       static_cast<SequentialThunk*>(LastThunk())->thunks().back().get(),
       ir_emitter_context_->llvm_module());
   return ParallelLoopEmitter(loop_body_emitter, tiled_input_shape,
-                             launch_dimensions, &ir_builder_)
+                             launch_dimensions, &b_)
       .EmitLoop(IrName(reduce), index_ty);
 }
 
@@ -1982,32 +1706,36 @@ Status IrEmitterUnnested::HandleReduce(HloInstruction* reduce) {
                         BuildInitializerThunk(reduce));
     std::vector<std::unique_ptr<Thunk>> thunks;
     thunks.push_back(std::move(initializer_thunk));
-    thunks.push_back(BuildKernelThunk(reduce));
+    thunks.push_back(
+        BuildKernelThunk(reduce, /*implements_whole_instruction=*/false));
     thunk_sequence_->emplace_back(
         MakeUnique<SequentialThunk>(std::move(thunks), reduce));
 
     return EmitReductionToVector(
-        reduce, input->shape(), {[&](const llvm_ir::IrArray::Index& index) {
-          return GetIrArray(*input, *reduce)
-              .EmitReadArrayElement(index, &ir_builder_);
+        reduce, input->shape(), {[&](const IrArray::Index& index) {
+          return GetIrArray(*input, *reduce).EmitReadArrayElement(index, &b_);
         }},
-        {[&](const llvm_ir::IrArray::Index& index) {
+        {[&](const IrArray::Index& index) {
           return GetIrArray(*init_value, *reduce)
-              .EmitReadArrayElement(index, &ir_builder_);
+              .EmitReadArrayElement(index, &b_);
         }},
         dimensions_to_reduce, {reducer}, {{}}, {});
   }
 
-  thunk_sequence_->emplace_back(BuildKernelThunk(reduce));
+  thunk_sequence_->emplace_back(
+      BuildKernelThunk(reduce, /*implements_whole_instruction=*/true));
   return IrEmitter::HandleReduce(reduce);
 }
 
 Status IrEmitterUnnested::HandleTuple(HloInstruction* tuple) {
   bool all_tuple_elements_have_buffer =
       c_all_of(tuple->operands(), [&](HloInstruction* tuple_element) {
-        return ir_emitter_context_->buffer_assignment().HasTopLevelAllocation(
-            tuple_element);
+        return ir_emitter_context_->buffer_assignment()
+            .GetUniqueTopLevelSlice(tuple_element)
+            .ok();
       });
+  // TODO(b/111689850): This logic isn't quite correct.
+  //
   // Tuples (especially tuples that are the final result of a computation) can
   // be so huge that if we were to emit a kernel that took each tuple element as
   // a parameter, we would exceed the max allowable number of parameters to a
@@ -2015,9 +1743,9 @@ Status IrEmitterUnnested::HandleTuple(HloInstruction* tuple) {
   // buffer, we collect their buffer addresses in a host array, and then copy
   // that array to the tuple's buffer.
   //
-  // Some tuple elements (e.g. const or bitcast of const) might not have a
-  // buffer -- their contents are stored in code. In that case, we fall back to
-  // emitting kernels which have access to their buffer addresses in code.
+  // Some tuple elements might not have an unambiguous buffer (like the result
+  // of a select-tuple). In that case, we fall back to emitting kernels which
+  // have access to their buffer addresses in code.
   if (all_tuple_elements_have_buffer) {
     std::vector<BufferAllocation::Slice> tuple_element_buffers;
     for (const HloInstruction* tuple_element : tuple->operands()) {
@@ -2027,7 +1755,8 @@ Status IrEmitterUnnested::HandleTuple(HloInstruction* tuple) {
         tuple_element_buffers, GetAllocationSlice(*tuple), tuple));
     return Status::OK();
   }
-  thunk_sequence_->emplace_back(BuildKernelThunk(tuple));
+  thunk_sequence_->emplace_back(
+      BuildKernelThunk(tuple, /*implements_whole_instruction=*/true));
   return IrEmitter::HandleTuple(tuple);
 }
 
@@ -2052,7 +1781,8 @@ Status IrEmitterUnnested::HandleSelectAndScatter(
                       BuildInitializerThunk(select_and_scatter));
   std::vector<std::unique_ptr<Thunk>> thunks;
   thunks.push_back(std::move(initializer_thunk));
-  thunks.push_back(BuildKernelThunk(select_and_scatter));
+  thunks.push_back(BuildKernelThunk(select_and_scatter,
+                                    /*implements_whole_instruction=*/false));
   thunk_sequence_->emplace_back(
       MakeUnique<SequentialThunk>(std::move(thunks), select_and_scatter));
 
@@ -2065,8 +1795,8 @@ Status IrEmitterUnnested::HandleSelectAndScatter(
   LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
       source->shape(), ir_emitter_context_->device_description());
   llvm::Type* index_type = GetIndexTypeForKernel(
-      select_and_scatter, launch_dimensions.launch_bound(), &ir_builder_);
-  auto index_typed_const = [&](uint64 c) -> llvm::Constant* {
+      select_and_scatter, launch_dimensions.launch_bound(), &b_);
+  auto index_typed_constant = [&](uint64 c) -> llvm::Constant* {
     return llvm::ConstantInt::get(index_type, c);
   };
 
@@ -2089,114 +1819,106 @@ Status IrEmitterUnnested::HandleSelectAndScatter(
   //         selected_index = I
   //         initialized_flag = true
   //   output(selected_index) = scatter(output(selected_index), source(S))
-  auto loop_body_emitter =
-      [=](const llvm_ir::IrArray::Index& source_index) -> Status {
+  auto loop_body_emitter = [=](const IrArray::Index& source_index) -> Status {
     // Allocate space to keep the currently selected value, its index, and a
     // boolean flag if the value is initialized. The initialized_flag is set
     // false.
     llvm::Value* selected_value_address = llvm_ir::EmitAllocaAtFunctionEntry(
         llvm_ir::PrimitiveTypeToIrType(operand_element_type,
                                        ir_emitter_context_->llvm_module()),
-        "selected_value_address", &ir_builder_);
+        "selected_value_address", &b_);
     llvm::Value* selected_index_address =
         llvm_ir::EmitAllocaAtFunctionEntryWithCount(
-            index_type, index_typed_const(rank), "selected_index_address",
-            &ir_builder_);
+            index_type, index_typed_constant(rank), "selected_index_address",
+            &b_);
     llvm::Value* initialized_flag_address = llvm_ir::EmitAllocaAtFunctionEntry(
-        ir_builder_.getInt1Ty(), "initialized_flag_address", &ir_builder_);
-    ir_builder_.CreateStore(ir_builder_.getInt1(false),
-                            initialized_flag_address);
+        b_.getInt1Ty(), "initialized_flag_address", &b_);
+    b_.CreateStore(b_.getInt1(false), initialized_flag_address);
 
     // Create the inner loop to iterate over the window.
-    llvm_ir::ForLoopNest window_loops(IrName(select_and_scatter, "inner"),
-                                      &ir_builder_, index_type);
+    llvm_ir::ForLoopNest window_loops(IrName(select_and_scatter, "inner"), &b_,
+                                      index_type);
     std::vector<int64> window_size;
     for (const auto& dim : window.dimensions()) {
       window_size.push_back(dim.size());
       CHECK_GT(dim.size(), 0);
     }
-    const llvm_ir::IrArray::Index window_index = window_loops.AddLoopsForShape(
+    const IrArray::Index window_index = window_loops.AddLoopsForShape(
         ShapeUtil::MakeShape(operand_element_type, window_size), "window");
     llvm_ir::SetToFirstInsertPoint(window_loops.GetInnerLoopBodyBasicBlock(),
-                                   &ir_builder_);
+                                   &b_);
 
     // Compute the operand index to visit and evaluate the condition whether the
     // operand index is within the bounds. The unsigned comparison includes
     // checking whether the operand index >= 0.
-    llvm_ir::IrArray::Index operand_index(index_type, source_index.size());
-    llvm::Value* in_bounds_condition = ir_builder_.getInt1(true);
+    IrArray::Index operand_index(index_type, source_index.size());
+    llvm::Value* in_bounds_condition = b_.getInt1(true);
     for (int64 i = 0; i < rank; ++i) {
-      llvm::Value* strided_index = ir_builder_.CreateNSWMul(
-          source_index[i], index_typed_const(window.dimensions(i).stride()));
-      operand_index[i] = ir_builder_.CreateNSWSub(
-          ir_builder_.CreateNSWAdd(strided_index, window_index[i]),
-          index_typed_const(window.dimensions(i).padding_low()));
-      llvm::Value* index_condition = ir_builder_.CreateICmpULT(
+      llvm::Value* strided_index = b_.CreateNSWMul(
+          source_index[i], index_typed_constant(window.dimensions(i).stride()));
+      operand_index[i] = b_.CreateNSWSub(
+          b_.CreateNSWAdd(strided_index, window_index[i]),
+          index_typed_constant(window.dimensions(i).padding_low()));
+      llvm::Value* index_condition = b_.CreateICmpULT(
           operand_index[i],
-          index_typed_const(ShapeUtil::GetDimension(operand->shape(), i)));
-      in_bounds_condition =
-          ir_builder_.CreateAnd(in_bounds_condition, index_condition);
+          index_typed_constant(ShapeUtil::GetDimension(operand->shape(), i)));
+      in_bounds_condition = b_.CreateAnd(in_bounds_condition, index_condition);
     }
     CHECK(in_bounds_condition != nullptr);
 
     // Only need to do something if the operand index is within the bounds.
     // First check if the initialized_flag is set.
     llvm_ir::LlvmIfData if_in_bounds =
-        llvm_ir::EmitIfThenElse(in_bounds_condition, "in-bounds", &ir_builder_);
-    llvm_ir::SetToFirstInsertPoint(if_in_bounds.true_block, &ir_builder_);
+        llvm_ir::EmitIfThenElse(in_bounds_condition, "in-bounds", &b_);
+    llvm_ir::SetToFirstInsertPoint(if_in_bounds.true_block, &b_);
     llvm_ir::LlvmIfData if_initialized = llvm_ir::EmitIfThenElse(
-        ir_builder_.CreateLoad(initialized_flag_address), "initialized",
-        &ir_builder_);
+        b_.CreateLoad(initialized_flag_address), "initialized", &b_);
 
     // If the initialized_flag is false, initialize the selected value and index
     // with the currently visiting operand.
-    llvm_ir::SetToFirstInsertPoint(if_initialized.false_block, &ir_builder_);
-    const auto save_operand_index = [&](
-        const llvm_ir::IrArray::Index& operand_index) {
+    llvm_ir::SetToFirstInsertPoint(if_initialized.false_block, &b_);
+    const auto save_operand_index = [&](const IrArray::Index& operand_index) {
       for (int64 i = 0; i < rank; ++i) {
         llvm::Value* selected_index_address_slot =
-            ir_builder_.CreateInBoundsGEP(selected_index_address,
-                                          {ir_builder_.getInt32(i)});
-        ir_builder_.CreateStore(operand_index[i], selected_index_address_slot);
+            b_.CreateInBoundsGEP(selected_index_address, {b_.getInt32(i)});
+        b_.CreateStore(operand_index[i], selected_index_address_slot);
       }
     };
-    llvm_ir::IrArray operand_array = GetIrArray(*operand, *select_and_scatter);
+    IrArray operand_array = GetIrArray(*operand, *select_and_scatter);
     llvm::Value* operand_data =
-        operand_array.EmitReadArrayElement(operand_index, &ir_builder_);
-    ir_builder_.CreateStore(operand_data, selected_value_address);
+        operand_array.EmitReadArrayElement(operand_index, &b_);
+    b_.CreateStore(operand_data, selected_value_address);
     save_operand_index(operand_index);
-    ir_builder_.CreateStore(ir_builder_.getInt1(true),
-                            initialized_flag_address);
+    b_.CreateStore(b_.getInt1(true), initialized_flag_address);
 
     // If the initialized_flag is true, call the `select` function to
     // potentially update the selected value and index with the currently
     // visiting operand.
-    llvm_ir::SetToFirstInsertPoint(if_initialized.true_block, &ir_builder_);
+    llvm_ir::SetToFirstInsertPoint(if_initialized.true_block, &b_);
     const Shape output_shape = ShapeUtil::MakeShape(PRED, {});
     llvm::Value* operand_address =
-        operand_array.EmitArrayElementAddress(operand_index, &ir_builder_);
+        operand_array.EmitArrayElementAddress(operand_index, &b_);
     llvm::Value* select_return_buffer = llvm_ir::EmitAllocaAtFunctionEntry(
         llvm_ir::PrimitiveTypeToIrType(PRED,
                                        ir_emitter_context_->llvm_module()),
-        "select_return_buffer", &ir_builder_);
+        "select_return_buffer", &b_);
     TF_RETURN_IF_ERROR(EmitCallToNestedComputation(
         *select_and_scatter->select(),
         {selected_value_address, operand_address}, select_return_buffer));
-    llvm::Value* result = ir_builder_.CreateLoad(select_return_buffer);
+    llvm::Value* result = b_.CreateLoad(select_return_buffer);
 
     // If the 'select' function returns false, update the selected value and the
     // index to the currently visiting operand.
-    llvm::Value* cond = ir_builder_.CreateICmpNE(
+    llvm::Value* cond = b_.CreateICmpNE(
         result,
         llvm::ConstantInt::get(llvm_ir::PrimitiveTypeToIrType(
                                    PRED, ir_emitter_context_->llvm_module()),
                                0),
         "boolean_predicate");
     llvm_ir::LlvmIfData if_select_lhs =
-        llvm_ir::EmitIfThenElse(cond, "if-select-lhs", &ir_builder_);
-    llvm_ir::SetToFirstInsertPoint(if_select_lhs.false_block, &ir_builder_);
-    ir_builder_.CreateStore(ir_builder_.CreateLoad(operand_address),
-                            selected_value_address);
+        llvm_ir::EmitIfThenElse(cond, "if-select-lhs", &b_);
+    llvm_ir::SetToFirstInsertPoint(if_select_lhs.false_block, &b_);
+    b_.CreateStore(b_.CreateLoad(operand_address), selected_value_address);
     save_operand_index(operand_index);
 
     // After iterating over the window elements, scatter the source element to
@@ -2204,20 +1926,19 @@ Status IrEmitterUnnested::HandleSelectAndScatter(
     // location is computed by calling the `scatter` function with the source
     // value and the current output value.
     llvm_ir::SetToFirstInsertPoint(window_loops.GetOuterLoopExitBasicBlock(),
-                                   &ir_builder_);
-    llvm_ir::IrArray::Index selected_index(operand_index.GetType());
+                                   &b_);
+    IrArray::Index selected_index(operand_index.GetType());
     for (int64 i = 0; i < rank; ++i) {
-      llvm::Value* selected_index_address_slot = ir_builder_.CreateInBoundsGEP(
-          selected_index_address, {ir_builder_.getInt32(i)});
-      selected_index.push_back(
-          ir_builder_.CreateLoad(selected_index_address_slot));
+      llvm::Value* selected_index_address_slot =
+          b_.CreateInBoundsGEP(selected_index_address, {b_.getInt32(i)});
+      selected_index.push_back(b_.CreateLoad(selected_index_address_slot));
     }
     llvm::Value* source_value_address =
         GetIrArray(*source, *select_and_scatter)
-            .EmitArrayElementAddress(source_index, &ir_builder_);
+            .EmitArrayElementAddress(source_index, &b_);
     llvm::Value* output_value_address =
         GetIrArray(*select_and_scatter, *select_and_scatter)
-            .EmitArrayElementAddress(selected_index, &ir_builder_);
+            .EmitArrayElementAddress(selected_index, &b_);
     return EmitAtomicOperationForNestedComputation(
         *select_and_scatter->scatter(), output_value_address,
         source_value_address);
@@ -2232,7 +1953,7 @@ Status IrEmitterUnnested::HandleSelectAndScatter(
       static_cast<SequentialThunk*>(LastThunk())->thunks().back().get(),
       ir_emitter_context_->llvm_module());
   return ParallelLoopEmitter(loop_body_emitter, source->shape(),
-                             launch_dimensions, &ir_builder_)
+                             launch_dimensions, &b_)
       .EmitLoop(IrName(select_and_scatter), index_type);
 }
 
@@ -2259,16 +1980,136 @@ Status IrEmitterUnnested::HandleWhile(HloInstruction* xla_while) {
   return Status::OK();
 }
 
-Status IrEmitterUnnested::HandleRng(HloInstruction* random) {
-  thunk_sequence_->push_back(BuildKernelThunk(random));
-  return IrEmitter::HandleRng(random);
+Status IrEmitterUnnested::HandleRng(HloInstruction* rng) {
+  // Build the kernel to generate the random numbers.
+  //
+  // Unroll the kernel so that the duplicated computation that calculates the
+  // 128 bit sample can be optimized away by LLVM.
+  thunk_sequence_->emplace_back(
+      BuildKernelThunk(rng, /*implements_whole_instruction=*/false,
+                       ComputeMaxUnrollFactor(rng)));
+  ElementalIrEmitter::HloToElementGeneratorMap operand_to_generator;
+  for (const HloInstruction* operand : rng->operands()) {
+    operand_to_generator[operand] = [=](const llvm_ir::IrArray::Index& index) {
+      return GetIrArray(*operand, *rng).EmitReadArrayElement(index, &b_);
+    };
+  }
+  TF_RETURN_IF_ERROR(EmitTargetElementLoop(
+      *rng, GpuElementalIrEmitter(hlo_module_config_, module_, &b_,
+                                  GetNestedComputer())
+                .MakeElementGenerator(rng, operand_to_generator)));
+  std::unique_ptr<Thunk> rng_thunk = std::move(thunk_sequence_->back());
+  thunk_sequence_->pop_back();
+
+  // Emit a kernel to increment the global state for Philox RNG algorithm.
+  thunk_sequence_->emplace_back(
+      BuildKernelThunk(rng, /*implements_whole_instruction=*/false));
+  llvm_ir::IncrementVariableForPhiloxRngState(1, module_, &b_);
+  std::unique_ptr<Thunk> increment_seed_thunk =
+      std::move(thunk_sequence_->back());
+  thunk_sequence_->pop_back();
+
+  // Build the SequentialThunk for the RNG hlo.
+  std::vector<std::unique_ptr<Thunk>> thunks;
+  thunks.reserve(2);
+  thunks.push_back(std::move(rng_thunk));
+  thunks.push_back(std::move(increment_seed_thunk));
+  thunk_sequence_->emplace_back(
+      MakeUnique<SequentialThunk>(std::move(thunks), rng));
+
+  return Status::OK();
 }
 
 Status IrEmitterUnnested::HandleSelect(HloInstruction* select) {
-  thunk_sequence_->push_back(BuildKernelThunk(select));
+  thunk_sequence_->push_back(
+      BuildKernelThunk(select, /*implements_whole_instruction=*/true));
   return IrEmitter::HandleSelect(select);
 }
 
+Status IrEmitterUnnested::HandleSort(HloInstruction* sort) {
+  std::vector<std::unique_ptr<Thunk>> thunks;
+  auto keys = sort->operand(0);
+  auto values = sort->operand_count() > 1 ? sort->operand(1) : nullptr;
+  ShapeIndex keys_shape_index({});
+  ShapeIndex values_shape_index({});
+  if (values != nullptr) {
+    keys_shape_index = ShapeIndex({0});
+    values_shape_index = ShapeIndex({1});
+  }
+  auto keys_destination = GetAllocationSlice(*sort, keys_shape_index);
+  auto values_destination = GetAllocationSlice(*sort, values_shape_index);
+
+  if (keys_destination != GetAllocationSlice(*keys)) {
+    thunks.push_back(MakeUnique<DeviceToDeviceCopyThunk>(
+        /*source_address=*/GetAllocationSlice(*keys),
+        /*destination_buffer=*/keys_destination,
+        /*mem_size=*/ShapeUtil::ByteSizeOf(keys->shape()), nullptr));
+  }
+  if (values != nullptr && values_destination != GetAllocationSlice(*values)) {
+    // TODO(b/26783907): Figure out why we never seem to share buffers for
+    // key/value sort.
+    thunks.push_back(MakeUnique<DeviceToDeviceCopyThunk>(
+        /*source_address=*/GetAllocationSlice(*values),
+        /*destination_buffer=*/values_destination,
+        /*mem_size=*/ShapeUtil::ByteSizeOf(values->shape()), nullptr));
+  }
+
+  int64 dimension_to_sort = sort->dimensions(0);
+  int64 dimension_to_sort_bound = keys->shape().dimensions(dimension_to_sort);
+  int64 num_stages = tensorflow::Log2Ceiling(dimension_to_sort_bound);
+  auto index_type = b_.getInt64Ty();
+
+  // Naive C++ code for the outer loops:
+  //
+  // for (int64 stage = 0; stage < Log2Ceiling(dimension_to_sort_bound);
+  //     ++stage) {
+  //   int64 first_xor_mask = (1LL << (stage + 1)) - 1;
+  //   SortInPlace(first_xor_mask);
+  //   for (int64 mask = stage - 1; mask >= 0; --mask) {
+  //     int64 later_xor_mask = 1LL << mask;
+  //     SortInPlace(later_xor_mask);
+  //   }
+  // }
+  //
+  // This follows the algorithm described on Wikipedia:
+  // https://en.wikipedia.org/wiki/Bitonic_sorter
+
+  for (int64 stage = 0; stage < num_stages; ++stage) {
+    for (int64 mask = stage; mask >= 0; --mask) {
+      thunks.push_back(
+          BuildKernelThunk(sort, /*implements_whole_instruction=*/false));
+      LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
+          keys->shape(), ir_emitter_context_->device_description());
+      UpdateLaunchDimensions(launch_dimensions, thunks.back().get(),
+                             ir_emitter_context_->llvm_module());
+
+      llvm::Value* xor_mask;
+      if (mask == stage) {
+        xor_mask = llvm::ConstantInt::get(index_type, (1LL << (stage + 1)) - 1);
+      } else {
+        xor_mask = llvm::ConstantInt::get(index_type, 1LL << mask);
+      }
+
+      TF_RETURN_IF_ERROR(llvm_ir::EmitSortInPlace(
+          dimension_to_sort, GetIrArray(*sort, *sort, keys_shape_index),
+          values != nullptr ? tensorflow::gtl::make_optional<IrArray>(
+                                  GetIrArray(*sort, *sort, values_shape_index))
+                            : tensorflow::gtl::nullopt,
+          IrName(sort), xor_mask, &b_, &launch_dimensions));
+    }
+  }
+
+  thunk_sequence_->emplace_back(
+      MakeUnique<SequentialThunk>(std::move(thunks), sort));
+  return Status::OK();
+}
+
+Status IrEmitterUnnested::HandleTupleSelect(HloInstruction* tuple_select) {
+  thunk_sequence_->push_back(
+      BuildKernelThunk(tuple_select, /*implements_whole_instruction=*/true));
+  return IrEmitter::HandleTupleSelect(tuple_select);
+}
+
 Status IrEmitterUnnested::HandleCrossReplicaSum(HloInstruction* crs) {
   if (hlo_module_config_.replica_count() != 1) {
     // TODO(b/33011107): Support nontrivial cross replica sum on GPU.
@@ -2304,12 +2145,12 @@ Status IrEmitterUnnested::HandleCrossReplicaSum(HloInstruction* crs) {
     thunks.push_back(MakeUnique<DeviceToDeviceCopyThunk>(
         /*source_address=*/GetAllocationSlice(*crs->operand(i)),
         /*destination_buffer=*/tuple_element_buffers.back(),
-        /*mem_size=*/ShapeUtil::ByteSizeOf(crs->operand(i)->shape()), crs));
+        /*mem_size=*/ShapeUtil::ByteSizeOf(crs->operand(i)->shape()), nullptr));
   }
 
   // Output a tuple of the buffers above.
   thunks.push_back(MakeUnique<TupleThunk>(tuple_element_buffers,
-                                          GetAllocationSlice(*crs), crs));
+                                          GetAllocationSlice(*crs), nullptr));
   thunk_sequence_->push_back(
       MakeUnique<SequentialThunk>(std::move(thunks), crs));
   return Status::OK();
@@ -2324,6 +2165,11 @@ Status IrEmitterUnnested::HandleInfeed(HloInstruction* infeed) {
   return Status::OK();
 }
 
+Status IrEmitterUnnested::HandleOutfeed(HloInstruction* outfeed) {
+  thunk_sequence_->emplace_back(BuildOutfeedThunk(outfeed));
+  return Status::OK();
+}
+
 // Figures out how to access the buffers for all subshapes of hlo's operands and
 // for hlo itself (i.e. all the buffers produced by HLO).
 //
@@ -2411,11 +2257,6 @@ GetHloBufferSlices(const HloInstruction* hlo,
 
   // Adds entries for all subshapes of instr to `slices`.
   auto add_slices_for = [&](const HloInstruction* instr) {
-    // GPU constants don't have buffers; don't bother looking for one.
-    if (instr->IsConstant()) {
-      return;
-    }
-
     ShapeUtil::ForEachSubshape(
         instr->shape(), [&](const Shape& /*shape*/, const ShapeIndex& index) {
           if (slices.count({instr, index})) {
@@ -2443,7 +2284,8 @@ GetHloBufferSlices(const HloInstruction* hlo,
 }
 
 std::unique_ptr<KernelThunk> IrEmitterUnnested::BuildKernelThunk(
-    const HloInstruction* inst, int unroll_factor) {
+    const HloInstruction* inst, bool implements_whole_instruction,
+    int unroll_factor) {
   const BufferAssignment& buffer_assn =
       ir_emitter_context_->buffer_assignment();
 
@@ -2476,21 +2318,25 @@ std::unique_ptr<KernelThunk> IrEmitterUnnested::BuildKernelThunk(
 
   // We'll pass a pointer to each of the elements of `buffers` to our kernel, in
   // this order.
-  std::vector<const BufferAllocation*> buffers(buffers_needed.begin(),
-                                               buffers_needed.end());
-  std::sort(buffers.begin(), buffers.end(),
+  std::vector<const BufferAllocation*> non_constant_buffers;
+  c_copy_if(buffers_needed, std::back_inserter(non_constant_buffers),
+            [](const BufferAllocation* allocation) {
+              return !allocation->is_constant();
+            });
+
+  std::sort(non_constant_buffers.begin(), non_constant_buffers.end(),
             [](const BufferAllocation* a, const BufferAllocation* b) {
               return a->index() < b->index();
             });
 
-  llvm::Function* kernel = BuildKernelPrototype(*inst, buffers);
+  llvm::Function* kernel = BuildKernelPrototype(*inst, non_constant_buffers);
 
   // Build a map from a BufferAllocation to the corresponding argument in our
   // kernel.
   std::unordered_map<const BufferAllocation*, llvm::Value*> kernel_args;
   {
     auto arg_it = kernel->arg_begin();
-    auto buffers_it = buffers.begin();
+    auto buffers_it = non_constant_buffers.begin();
     for (; arg_it != kernel->arg_end(); ++arg_it, ++buffers_it) {
       kernel_args[*buffers_it] = arg_it;
     }
@@ -2508,18 +2354,24 @@ std::unique_ptr<KernelThunk> IrEmitterUnnested::BuildKernelThunk(
             << " is found in slice " << slice.ToString() << " at GTE index "
             << gte_index.ToString();
 
-    llvm::Value* loc =
-        ir_builder_.CreateInBoundsGEP(kernel_args.at(slice.allocation()),
-                                      {ir_builder_.getInt64(slice.offset())});
+    llvm::Value* loc;
+    if (slice.allocation()->is_constant()) {
+      loc = ir_emitter_context_->llvm_module()->getGlobalVariable(
+          llvm_ir::AsStringRef(llvm_ir::ConstantBufferAllocationToGlobalName(
+              *slice.allocation())));
+      CHECK_NE(loc, nullptr);
+    } else {
+      loc = b_.CreateInBoundsGEP(kernel_args.at(slice.allocation()),
+                                 {b_.getInt64(slice.offset())});
+    }
 
     // If gte_index is nonempty, we have to dereference `loc` to get to the
     // value we're ultimately interested in.
     llvm::Type* int8_double_pointer =
-        llvm::PointerType::get(ir_builder_.getInt8PtrTy(), /*AddressSpace=*/0);
+        llvm::PointerType::get(b_.getInt8PtrTy(), /*AddressSpace=*/0);
     for (int64 idx : gte_index) {
-      loc = ir_builder_.CreateBitCast(loc, int8_double_pointer);
-      loc = ir_builder_.CreateLoad(
-          ir_builder_.CreateInBoundsGEP(loc, {ir_builder_.getInt64(idx)}));
+      loc = b_.CreateBitCast(loc, int8_double_pointer);
+      loc = b_.CreateLoad(b_.CreateInBoundsGEP(loc, {b_.getInt64(idx)}));
     }
 
     bindings_.BindHloToIrValue(*instr, loc, index);
@@ -2531,11 +2383,12 @@ std::unique_ptr<KernelThunk> IrEmitterUnnested::BuildKernelThunk(
     bindings_.SetTempBufferBase(kernel_args.at(*temp_buffer));
   } else {
     bindings_.SetTempBufferBase(
-        llvm::ConstantPointerNull::get(ir_builder_.getInt8PtrTy()));
+        llvm::ConstantPointerNull::get(b_.getInt8PtrTy()));
   }
 
-  return MakeUnique<KernelThunk>(buffers, llvm_ir::AsString(kernel->getName()),
-                                 inst, unroll_factor);
+  return MakeUnique<KernelThunk>(
+      non_constant_buffers, llvm_ir::AsString(kernel->getName()),
+      implements_whole_instruction ? inst : nullptr, unroll_factor);
 }
 
 std::unique_ptr<Thunk> IrEmitterUnnested::BuildHostToDeviceCopyThunk(
@@ -2569,7 +2422,7 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildInfeedThunk(
 
   ShapeTree<BufferAllocation::Slice> slices(inst->shape());
   slices.ForEachMutableElement(
-      [this, inst](const ShapeIndex& index, BufferAllocation::Slice* slice) {
+      [&](const ShapeIndex& index, BufferAllocation::Slice* slice) {
         *slice = ir_emitter_context_->buffer_assignment()
                      .GetUniqueSlice(inst, index)
                      .ConsumeValueOrDie();
@@ -2577,6 +2430,23 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildInfeedThunk(
   return MakeUnique<InfeedThunk>(slices, inst);
 }
 
+std::unique_ptr<Thunk> IrEmitterUnnested::BuildOutfeedThunk(
+    const HloInstruction* inst) {
+  CHECK_EQ(HloOpcode::kOutfeed, inst->opcode());
+
+  ShapeTree<BufferAllocation::Slice> slices(inst->operand(0)->shape());
+  slices.ForEachMutableElement(
+      [&](const ShapeIndex& index, BufferAllocation::Slice* slice) {
+        auto status_or_slice =
+            ir_emitter_context_->buffer_assignment().GetUniqueSlice(
+                inst->operand(0), index);
+        if (status_or_slice.ok()) {
+          *slice = status_or_slice.ConsumeValueOrDie();
+        }
+      });
+  return MakeUnique<OutfeedThunk>(std::move(slices), inst);
+}
+
 namespace {
 double GetScalarConstantAsDouble(const Literal& literal) {
   switch (literal.shape().element_type()) {
@@ -2692,6 +2562,11 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
     init_value = hlo->operand(init_value->parameter_number());
   }
 
+  // Initializer thunks don't implement a whole instruction, and we want to
+  // profile the whole instruction instead of the individual thunks it consists
+  // of. Therefore we pass nullptr as the HloInstruction* to the thunks we
+  // generate below.
+  //
   // In the common case, the initializer is a constant.  In this case, emit a
   // device-memset call if we can.  Currently StreamExecutor only supports
   // zeroing and 32-bit memsets.
@@ -2705,7 +2580,8 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
     ArraySlice<uint8> literal_bytes(
         reinterpret_cast<const uint8*>(literal.untyped_data()), num_bytes);
     if (c_all_of(literal_bytes, [](uint8 byte) { return byte == 0; })) {
-      return {MakeUnique<MemzeroThunk>(GetAllocationSlice(*hlo, index), hlo)};
+      return {
+          MakeUnique<MemzeroThunk>(GetAllocationSlice(*hlo, index), nullptr)};
     }
 
     // If the literal is 8 or 16 bits wide, we can emit a 32-bit memset by
@@ -2723,7 +2599,7 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
       }
       uint32 pattern32 = uint32{pattern16} | (uint32{pattern16} << 16);
       return {MakeUnique<Memset32BitValueThunk>(
-          pattern32, GetAllocationSlice(*hlo, index), hlo)};
+          pattern32, GetAllocationSlice(*hlo, index), nullptr)};
     }
 
     // If the literal is an even multiple of 32 bits wide, we can emit a 32-bit
@@ -2734,12 +2610,13 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
       uint32 word;
       memcpy(&word, literal_bytes.data(), sizeof(word));
       return {MakeUnique<Memset32BitValueThunk>(
-          word, GetAllocationSlice(*hlo, index), hlo)};
+          word, GetAllocationSlice(*hlo, index), nullptr)};
     }
   }
 
   // Otherwise fall back to our slow initializer code.
-  std::unique_ptr<KernelThunk> kernel_thunk = BuildKernelThunk(hlo);
+  std::unique_ptr<KernelThunk> kernel_thunk =
+      BuildKernelThunk(hlo, /*implements_whole_instruction=*/false);
   LaunchDimensions launch_dimensions =
       CalculateLaunchDimensions(ShapeUtil::GetSubshape(hlo->shape(), index),
                                 ir_emitter_context_->device_description());
@@ -2748,15 +2625,24 @@ StatusOr<std::unique_ptr<Thunk>> IrEmitterUnnested::BuildInitializerThunk(
   // If the init_value was fused into this reduce we have to generate it first.
   if (fused && init_value_operand->opcode() != HloOpcode::kParameter) {
     CHECK_EQ(HloOpcode::kConstant, init_value_operand->opcode());
-    TF_RETURN_IF_ERROR(HandleConstant(const_cast<HloInstruction*>(init_value)));
+
+    const Literal& literal = init_value_operand->literal();
+    llvm::Constant* initializer =
+        llvm_ir::ConvertLiteralToIrConstant(literal, module_);
+
+    llvm::GlobalVariable* global_for_const = new llvm::GlobalVariable(
+        *module_, initializer->getType(),
+        /*isConstant=*/true, llvm::GlobalValue::PrivateLinkage, initializer,
+        /*Name=*/"");
+    global_for_const->setAlignment(kConstantBufferAlignBytes);
+    bindings_.BindHloToIrValue(*init_value_operand, global_for_const);
   }
   TF_RETURN_IF_ERROR(ParallelLoopEmitter(
-                         [=](const llvm_ir::IrArray::Index& index) {
+                         [=](const IrArray::Index& index) {
                            return GetIrArray(*init_value, *hlo)
-                               .EmitReadArrayElement(index, &ir_builder_);
+                               .EmitReadArrayElement(index, &b_);
                          },
-                         GetIrArray(*hlo, *hlo, index), launch_dimensions,
-                         &ir_builder_)
+                         GetIrArray(*hlo, *hlo, index), launch_dimensions, &b_)
                          .EmitLoop(IrName(hlo)));
 
   // Clean up state left behind by emitting the loop above.  (This is normally
@@ -2867,13 +2753,13 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildWhileThunk(
   HloComputation* condition = hlo->while_condition();
   IrEmitterUnnested ir_emitter_condition(hlo_module_config_, condition,
                                          ir_emitter_context_);
-  TF_CHECK_OK(condition->root_instruction()->Accept(&ir_emitter_condition));
+  TF_CHECK_OK(condition->Accept(&ir_emitter_condition));
 
   // Generate thunk sequence for while 'body'.
   HloComputation* body = hlo->while_body();
   IrEmitterUnnested ir_emitter_body(hlo_module_config_, body,
                                     ir_emitter_context_);
-  TF_CHECK_OK(body->root_instruction()->Accept(&ir_emitter_body));
+  TF_CHECK_OK(body->Accept(&ir_emitter_body));
 
   return MakeUnique<WhileThunk>(
       GetAllocationSlice(*condition->root_instruction()),  // cond result
@@ -2891,7 +2777,7 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildForThunk(
   HloComputation* body = hlo->while_body();
   IrEmitterUnnested ir_emitter_body(hlo_module_config_, body,
                                     ir_emitter_context_);
-  TF_CHECK_OK(body->root_instruction()->Accept(&ir_emitter_body));
+  TF_CHECK_OK(body->Accept(&ir_emitter_body));
 
   return MakeUnique<ForThunk>(loop_limit,
                               ir_emitter_body.ConsumeThunkSequence(), hlo);
@@ -2907,12 +2793,12 @@ std::unique_ptr<Thunk> IrEmitterUnnested::BuildConditionalThunk(
   HloComputation* true_computation = hlo->true_computation();
   IrEmitterUnnested ir_emitter_true(hlo_module_config_, true_computation,
                                     ir_emitter_context_);
-  TF_CHECK_OK(true_computation->root_instruction()->Accept(&ir_emitter_true));
+  TF_CHECK_OK(true_computation->Accept(&ir_emitter_true));
 
   HloComputation* false_computation = hlo->false_computation();
   IrEmitterUnnested ir_emitter_false(hlo_module_config_, false_computation,
                                      ir_emitter_context_);
-  TF_CHECK_OK(false_computation->root_instruction()->Accept(&ir_emitter_false));
+  TF_CHECK_OK(false_computation->Accept(&ir_emitter_false));
 
   return MakeUnique<ConditionalThunk>(
       GetAllocationSlice(*hlo->operand(0)),
@@ -2940,41 +2826,588 @@ Status IrEmitterUnnested::EmitTargetElementLoopInThunk(
                          ir_emitter_context_->llvm_module());
   if (!hlo.IsMultiOutputFusion()) {
     return ParallelLoopEmitter(element_generator, GetIrArray(hlo, hlo),
-                               launch_dimensions, &ir_builder_, unroll_factor)
-        .EmitLoop(IrName(&hlo),
-                  GetIndexTypeForKernel(&hlo, launch_dimensions.launch_bound(),
-                                        &ir_builder_));
+                               launch_dimensions, &b_, unroll_factor)
+        .EmitLoop(
+            IrName(&hlo),
+            GetIndexTypeForKernel(&hlo, launch_dimensions.launch_bound(), &b_));
   }
 
-  // For multiple outputs fusion, we need to emit each operand and the root.
-  std::vector<llvm_ir::IrArray> output_arrays;
+  // For multioutput fusion, we need to emit each operand and the root.
+  std::vector<IrArray> output_arrays;
   for (int64 i = 0; i < ShapeUtil::TupleElementCount(hlo.shape()); ++i) {
     output_arrays.push_back(GetIrArray(hlo, hlo, {i}));
   }
   TF_RETURN_IF_ERROR(
       ParallelLoopEmitter(element_generator, output_arrays, launch_dimensions,
-                          &ir_builder_, unroll_factor)
+                          &b_, unroll_factor)
           .EmitLoop(IrName(&hlo),
                     GetIndexTypeForKernel(
-                        &hlo, launch_dimensions.launch_bound(), &ir_builder_)));
+                        &hlo, launch_dimensions.launch_bound(), &b_)));
 
   std::vector<llvm::Value*> tuple_operand_ptrs;
   for (int64 i = 0; i < output_arrays.size(); ++i) {
     tuple_operand_ptrs.push_back(output_arrays[i].GetBasePointer());
   }
-  ir_builder_.SetInsertPoint(ir_builder_.GetInsertBlock()->getTerminator());
-  llvm_ir::EmitTuple(GetIrArray(hlo, hlo), tuple_operand_ptrs, &ir_builder_,
-                     module_);
+  b_.SetInsertPoint(b_.GetInsertBlock()->getTerminator());
+  llvm_ir::EmitTuple(GetIrArray(hlo, hlo), tuple_operand_ptrs, &b_, module_);
   return Status::OK();
 }
 
 Status IrEmitterUnnested::EmitTargetElementLoop(
     const HloInstruction& hlo,
     const llvm_ir::ElementGenerator& element_generator) {
-  CHECK(Thunk::Kind::kKernel == LastThunk()->kind());
+  CHECK_EQ(Thunk::Kind::kKernel, LastThunk()->kind());
   return EmitTargetElementLoopInThunk(hlo, element_generator,
                                       static_cast<KernelThunk*>(LastThunk()));
 }
 
+int IrEmitterUnnested::ConstructIrArrayForOutputs(
+    const HloInstruction& hlo, std::vector<IrArray>* output_arrays) {
+  int64 num_outputs = 1;
+  if (hlo.IsMultiOutputFusion()) {
+    num_outputs = ShapeUtil::TupleElementCount(hlo.shape());
+    output_arrays->reserve(num_outputs);
+    for (int64 i = 0; i < num_outputs; ++i) {
+      output_arrays->push_back(GetIrArray(hlo, hlo, {i}));
+    }
+  } else {
+    output_arrays->push_back(GetIrArray(hlo, hlo));
+  }
+  return num_outputs;
+}
+
+int IrEmitterUnnested::ConstructIrArrayForInputs(
+    const HloInstruction& hlo, std::vector<IrArray>* param_arrays) {
+  int64 num_params = hlo.operands().size();
+  param_arrays->reserve(num_params);
+  for (const HloInstruction* param : hlo.operands()) {
+    param_arrays->push_back(GetIrArray(*param, hlo));
+  }
+  return num_params;
+}
+
+int IrEmitterUnnested::ConstructOutputReducedShapeAndCastOutputIrArrayToShape(
+    const HloInstruction& hlo, const std::vector<IrArray>& output_arrays,
+    tensorflow::gtl::ArraySlice<int64> reduced_output_dims,
+    std::vector<Shape>* output_reduced_shapes,
+    std::vector<IrArray>* output_in_reduced_shape_arrays) {
+  int64 num_outputs = 1;
+  if (hlo.IsMultiOutputFusion()) {
+    num_outputs = ShapeUtil::TupleElementCount(hlo.shape());
+    output_in_reduced_shape_arrays->reserve(num_outputs);
+    output_reduced_shapes->reserve(num_outputs);
+    for (int64 i = 0; i < num_outputs; ++i) {
+      output_reduced_shapes->push_back(ShapeUtil::MakeShapeWithDescendingLayout(
+          ShapeUtil::GetSubshape(hlo.shape(), {i}).element_type(),
+          reduced_output_dims));
+      output_in_reduced_shape_arrays->push_back(
+          output_arrays[i].CastToShape((*output_reduced_shapes)[i], &b_));
+    }
+  } else {
+    output_reduced_shapes->push_back(ShapeUtil::MakeShapeWithDescendingLayout(
+        hlo.shape().element_type(), reduced_output_dims));
+    output_in_reduced_shape_arrays->push_back(
+        output_arrays[0].CastToShape((*output_reduced_shapes)[0], &b_));
+  }
+  return num_outputs;
+}
+
+int IrEmitterUnnested::ConstructInputReducedShapeAndCastInputIrArrayToShape(
+    const HloInstruction& hlo, const std::vector<IrArray>& param_arrays,
+    const std::vector<llvm::Value*>& param_buffers,
+    tensorflow::gtl::ArraySlice<int64> reduced_output_dims,
+    std::vector<Shape>* param_reduced_shapes,
+    std::vector<IrArray>* param_in_reduced_shape_arrays) {
+  int64 num_params = hlo.operands().size();
+  param_in_reduced_shape_arrays->reserve(num_params);
+  param_reduced_shapes->reserve(num_params);
+  for (int64 id = 0; id < num_params; ++id) {
+    if (param_buffers[id] == nullptr) {
+      param_reduced_shapes->push_back(Shape());
+      param_in_reduced_shape_arrays->push_back(IrArray());
+      continue;
+    }
+    const HloInstruction* param = hlo.operand(id);
+    param_reduced_shapes->push_back(ShapeUtil::MakeShapeWithDescendingLayout(
+        param->shape().element_type(),
+        Permute({0, 2, 1}, reduced_output_dims)));
+    param_in_reduced_shape_arrays->push_back(
+        param_arrays[id].CastToShape((*param_reduced_shapes)[id], &b_));
+  }
+  return num_params;
+}
+
+namespace {
+
+// Reads thread_idx.x and converts it to a (y,x) coordinate, assuming that the
+// thread lives within a square tile of size tile_size (so thread blocks are of
+// size tile_size * tile_size).
+std::tuple<llvm::Value*, llvm::Value*> CalculateYXCoordinateWithinTile(
+    llvm::IRBuilder<>* builder, llvm::Value* tile_size,
+    int64 threads_per_tile) {
+  // Calculate the starting element coordinate within a tile for the current
+  // thread, (y, x) from thread_id.
+  llvm::Value* thread_id = llvm_ir::EmitCallToIntrinsic(
+      llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x, {}, {}, builder);
+  llvm_ir::AddRangeMetadata(0, threads_per_tile,
+                            llvm::cast<llvm::Instruction>(thread_id));
+  thread_id = builder->CreateIntCast(thread_id, tile_size->getType(),
+                                     /*isSigned=*/true, "thread.id.x");
+  auto x = builder->CreateURem(thread_id, tile_size);
+  auto y = builder->CreateUDiv(thread_id, tile_size);
+  return std::make_tuple(y, x);
+}
+
+// Reads block_idx.x, casts it to type index_ty, and adds the assumption that
+// it's in the range [0, num_blocks].
+llvm::Value* GetBlockIdx(llvm::IRBuilder<>* builder, llvm::Type* index_ty,
+                         int64 num_blocks) {
+  llvm::Value* block_id = llvm_ir::EmitCallToIntrinsic(
+      llvm::Intrinsic::nvvm_read_ptx_sreg_ctaid_x, {}, {}, builder);
+  llvm_ir::AddRangeMetadata(0, num_blocks,
+                            llvm::cast<llvm::Instruction>(block_id));
+  return builder->CreateIntCast(block_id, index_ty, /*isSigned=*/true,
+                                "block.id.x");
+}
+
+// Emits code to process up to (tile_size/num_rows) elements in a tile, given
+// `emit_elem_function` is the function to emit code to process one element, `y`
+// and `x` are the coordinates for the first element to process, and `index` is
+// the index for the origin of the tile. Emits bounds check to ensure that each
+// processed element is within the boundary defined by `tile_width` and
+// `tile_height`.
+void EmitTiledElementalCodeWithBoundsCheck(
+    int64 tile_size, int64 num_rows, const IrArray::Index& index,
+    const string& loop_name, KernelSupportLibrary* ksl,
+    llvm::IRBuilder<>* builder, llvm::Value* y, llvm::Value* x,
+    llvm::Value* tile_width, llvm::Value* tile_height,
+    const std::function<void(const IrArray::Index&, llvm::Value*)>&
+        emit_elem_function) {
+  llvm::Type* index_ty = tile_width->getType();
+  // Emits a constant value with index type.
+  auto index_typed_constant = [&](uint64 c) -> llvm::Constant* {
+    return llvm::ConstantInt::get(index_ty, c);
+  };
+  // Adds `addend` to the given `dim` of `index`.
+  auto offset_dim = [&](IrArray::Index index, llvm::Value* addend, int64 dim) {
+    index[dim] = builder->CreateAdd(index[dim], addend);
+    return index;
+  };
+
+  auto emit_full_tile = [&] {
+    for (int64 i = 0; i < tile_size; i += num_rows) {
+      auto source_idx = offset_dim(index, index_typed_constant(i), /*dim=*/1);
+      auto y_loc = builder->CreateAdd(index_typed_constant(i), y);
+      emit_elem_function(source_idx, y_loc);
+    }
+  };
+
+  auto emit_last_row = [&] {
+    ksl->IfReturnVoid("x_in_tile", builder->CreateICmpULT(x, tile_width), [&] {
+      // tile_height_upper_bound =
+      //   ceil(tile_height / num_rows) * num_rows
+      auto tile_height_upper_bound = builder->CreateMul(
+          builder->CreateUDiv(
+              builder->CreateAdd(tile_height,
+                                 index_typed_constant(num_rows - 1)),
+              index_typed_constant(num_rows)),
+          index_typed_constant(num_rows));
+      ksl->ForReturnVoid(
+          loop_name, /*start=*/index_typed_constant(0),
+          /*end=*/tile_height_upper_bound,
+          /*step=*/index_typed_constant(num_rows), [&](llvm::Value* y_indvar) {
+            auto y_loc = builder->CreateAdd(y_indvar, y);
+            ksl->IfReturnVoid(
+                "y_in_tile", builder->CreateICmpULT(y_loc, tile_height), [&] {
+                  emit_elem_function(offset_dim(index, y_indvar, /*dim=*/1),
+                                     y_loc);
+                });
+          });
+    });
+  };
+  ksl->IfReturnVoid(
+      "full_tile",
+      builder->CreateAnd(
+          builder->CreateICmpEQ(index_typed_constant(tile_size), tile_width),
+          builder->CreateICmpEQ(index_typed_constant(tile_size), tile_height)),
+      emit_full_tile, emit_last_row);
+}
+}  // namespace
+
+// Emits a kernel for the given hlo instruction using a tiled 0-2-1 transpose
+// algorithm to improve the memory access patterns for the input parameters
+// which have a shape that is a 0-2-1 transpose of the output tensors.
+//
+// For the purpose of tiling, the output tensors have a logical shape of three
+// components 0-2-1 while the relevant input parameters have a logical shape of
+// three components 0-1-2 in the order major to minor. The x- and y- dimensions
+// of the tensors are tiled in square tiles of edge length `kTileSize`. Each
+// thread block of `kTileSize` x `kNumRows` threads transposes one tile: each
+// thread copies kTileSize/kNumRows elements from the input to a shared memory
+// tile, then the otherwise "regular hlo kernel" reads from the shared memory
+// instead of the original input.
+//
+// This is similar to the following CUDA algorithm in TensorFlow:
+// https://goo.gl/MStRV6.
+//
+// `kTileSize` should usually be same as warp size. We currently choose 32 for
+// `kTileSize` and 4 for `kNumRows`. The CUDA algorithm uses 8 for `kNumRows`.
+//
+// TODO(b/33320379): Here each block transposes 1 tile. It may be more efficient
+// to launch fewer blocks so each transposes many tiles.
+LaunchDimensions IrEmitterUnnested::EmitHlo021Tile(
+    HloInstruction* hlo, tensorflow::gtl::ArraySlice<int64> reduced_output_dims,
+    tensorflow::gtl::ArraySlice<int64> tiled_param_ids) {
+  // Parameters for the tiling algorithm.
+  constexpr int64 kTileSize = 32;
+  constexpr int64 kNumRows = 4;
+  constexpr int64 kThreadsPerTile = kTileSize * kNumRows;
+
+  // Construct IrArrays for the inputs and outputs.
+  std::vector<IrArray> output_arrays;
+  int64 num_outputs = ConstructIrArrayForOutputs(*hlo, &output_arrays);
+  std::vector<IrArray> param_arrays;
+  int64 num_params = ConstructIrArrayForInputs(*hlo, &param_arrays);
+
+  // Allocate shared memory buffers to store the tiled inputs.
+  std::vector<llvm::Value*> param_shmem_buffers(num_params, nullptr);
+  for (int64 id : tiled_param_ids) {
+    const HloInstruction* param = hlo->operand(id);
+    // Add 1 to the minor dimension to reduce shared memory bank conflicts.
+    llvm::Type* tile_type = llvm::ArrayType::get(
+        llvm::ArrayType::get(llvm_ir::PrimitiveTypeToIrType(
+                                 param->shape().element_type(), module_),
+                             kTileSize + 1),
+        kTileSize);
+    const int kNVPTXSharedMemoryAddrSpace = 3;
+    auto* tile_base_ptr = new llvm::GlobalVariable(
+        *b_.GetInsertBlock()->getParent()->getParent(), tile_type,
+        /*isConstant=*/false, llvm::GlobalValue::PrivateLinkage,
+        llvm::UndefValue::get(tile_type),
+        llvm_ir::AsStringRef(IrName(hlo, StrCat("tile", id))), nullptr,
+        llvm::GlobalValue::NotThreadLocal, kNVPTXSharedMemoryAddrSpace);
+    param_shmem_buffers[id] = tile_base_ptr;
+    VLOG(3) << "Added shmem buffer for parameter " << id << ": "
+            << llvm_ir::DumpToString(*tile_base_ptr);
+  }
+
+  // The 0-2-1 shape of the tiling scheme is the reduced shape of the HLO result
+  // for the purpose of tiling. Calculate the logical output dimensions in the
+  // tile from the reduced output dimensions.
+  std::vector<int64> output_dims_in_tiles = std::vector<int64>(
+      reduced_output_dims.begin(), reduced_output_dims.end());
+  CHECK_EQ(output_dims_in_tiles.size(), 3);
+  for (int i = 1; i < 3; ++i) {
+    output_dims_in_tiles[i] =
+        CeilOfRatio<int64>(output_dims_in_tiles[i], kTileSize);
+  }
+  const int64 num_tiles =
+      c_accumulate(output_dims_in_tiles, 1, std::multiplies<int64>());
+  LaunchDimensions launch_dimensions(num_tiles, kThreadsPerTile);
+
+  llvm::Type* index_ty =
+      GetIndexTypeForKernel(hlo, launch_dimensions.launch_bound(), &b_);
+  auto index_typed_constant = [&](uint64 c) -> llvm::Constant* {
+    return llvm::ConstantInt::get(index_ty, c);
+  };
+
+  // Cast each output IrArray to its corresponding reduced shape and keep the
+  // reduced shape live during IR emission.
+  std::vector<IrArray> output_in_reduced_shape_arrays;
+  std::vector<Shape> output_reduced_shapes;
+  CHECK_EQ(ConstructOutputReducedShapeAndCastOutputIrArrayToShape(
+               *hlo, output_arrays, reduced_output_dims, &output_reduced_shapes,
+               &output_in_reduced_shape_arrays),
+           num_outputs);
+
+  // For each tiled parameter, cast its input IrArray to the corresponding
+  // reduced shape and keep the reduced shape live during IR emission.
+  std::vector<IrArray> param_in_reduced_shape_arrays;
+  std::vector<Shape> param_reduced_shapes;
+  CHECK_EQ(ConstructInputReducedShapeAndCastInputIrArrayToShape(
+               *hlo, param_arrays, param_shmem_buffers, reduced_output_dims,
+               &param_reduced_shapes, &param_in_reduced_shape_arrays),
+           num_params);
+
+  // Calculate the starting element coordinate within a tile for the current
+  // thread, (y, x) from thread_id.
+  llvm::Value* x;
+  llvm::Value* y;
+  std::tie(y, x) = CalculateYXCoordinateWithinTile(
+      &b_, index_typed_constant(kTileSize), kThreadsPerTile);
+
+  // Calculate the index for the current output tile from block_id.
+  const IrArray::Index output_tile_index(
+      GetBlockIdx(&b_, index_ty, num_tiles),
+      ShapeUtil::MakeShapeWithDescendingLayout(PRED /*arbitrary*/,
+                                               output_dims_in_tiles),
+      &b_);
+
+  // Output tile origin is the index for the first element of the current output
+  // tile.
+  const IrArray::Index output_tile_origin = [&] {
+    IrArray::Index index = output_tile_index;
+    for (int i = 1; i < 3; ++i) {
+      index[i] =
+          b_.CreateMul(output_tile_index[i], index_typed_constant(kTileSize),
+                       "tile_origin." + std::to_string(i));
+    }
+    return index;
+  }();
+
+  // Calculate the input tile origin from the output tile origin.
+  const IrArray::Index input_tile_origin(
+      Permute({0, 2, 1}, output_tile_origin.multidim()));
+
+  // Calculate the current output tile bounds in each of the logical dimensions.
+  std::vector<llvm::Value*> output_tile_bounds(3);
+  for (int i = 1; i < 3; ++i) {
+    // Only last row or column may not have full size.
+    output_tile_bounds[i] = b_.CreateSelect(
+        b_.CreateICmpEQ(output_tile_index[i],
+                        index_typed_constant(output_dims_in_tiles[i] - 1)),
+        index_typed_constant(reduced_output_dims[i] -
+                             (output_dims_in_tiles[i] - 1) * kTileSize),
+        index_typed_constant(kTileSize), "kTileSize");
+  }
+
+  KernelSupportLibrary ksl(&b_, llvm_ir::UnrollMode::kDefaultUnroll);
+
+  // Curry a few parameters to EmitTiledElementalCodeWithBoundsCheck.
+  auto emit_tiled_elemental_code_with_bounds_check =
+      [&](const IrArray::Index& index, const string& loop_name,
+          llvm::Value* tile_width, llvm::Value* tile_height,
+          const std::function<void(const IrArray::Index&, llvm::Value*)>&
+              emit_elem_function) {
+        EmitTiledElementalCodeWithBoundsCheck(
+            kTileSize, kNumRows, index, loop_name, &ksl, &b_, y, x, tile_width,
+            tile_height, emit_elem_function);
+      };
+
+  // Adds `addend` to the given `dim` of `index`.
+  auto offset_dim = [&](IrArray::Index index, llvm::Value* addend, int64 dim) {
+    index[dim] = b_.CreateAdd(index[dim], addend);
+    return index;
+  };
+  const IrArray::Index input_index =
+      offset_dim(offset_dim(input_tile_origin, x, /*dim=*/2), y, /*dim=*/1);
+
+  // Copy input parameter values to shared memory buffers:
+  // tile[y, x] = input[index]
+  emit_tiled_elemental_code_with_bounds_check(
+      input_index, "input", output_tile_bounds[1], output_tile_bounds[2],
+      [&](const IrArray::Index& index, llvm::Value* y_loc) {
+        for (int64 id : tiled_param_ids) {
+          IrArray& input_in_logical_shape = param_in_reduced_shape_arrays[id];
+          llvm::Value* shmem_buffer = param_shmem_buffers[id];
+          // TODO(jlebar): Add AA metadata to this store.  Tile buffers are
+          // global variables, so LLVM can't infer much about it.
+          b_.CreateStore(
+              input_in_logical_shape.EmitReadArrayElement(index, &b_,
+                                                          "input_element"),
+              b_.CreateGEP(shmem_buffer, {index_typed_constant(0), y_loc, x}));
+        }
+      });
+
+  // Wait for all threads to reach this point, lest we copy a value from tile to
+  // output before the other thread copies it from input to tile.
+  // This is `__syncthreads` in CUDA.
+  llvm_ir::EmitCallToIntrinsic(llvm::Intrinsic::nvvm_barrier0, {}, {}, &b_);
+
+  llvm_ir::TiledParameterInfo tiled_param_info(param_shmem_buffers, y, x);
+
+  const IrArray::Index output_index =
+      offset_dim(offset_dim(output_tile_origin, x, /*dim=*/2), y, /*dim=*/1);
+
+  // Write to output[index] by emitting code like normal, except that values for
+  // the tiled parameters are read from the shmem buffers.
+  if (hlo->opcode() == HloOpcode::kCopy) {
+    emit_tiled_elemental_code_with_bounds_check(
+        output_index, "output", output_tile_bounds[2], output_tile_bounds[1],
+        [&](const IrArray::Index& index, llvm::Value* y_loc) {
+          // TODO(jlebar): Add AA metadata to this load.
+          llvm::Instruction* load_from_shmem_buffer = b_.CreateLoad(
+              b_.CreateGEP(param_shmem_buffers[0], {b_.getInt64(0), x, y_loc}),
+              "output_element");
+          output_in_reduced_shape_arrays[0].EmitWriteArrayElement(
+              index, load_from_shmem_buffer, &b_);
+        });
+  } else {
+    CHECK_EQ(hlo->opcode(), HloOpcode::kFusion);
+    emit_tiled_elemental_code_with_bounds_check(
+        output_index, "output", output_tile_bounds[2], output_tile_bounds[1],
+        [&](const IrArray::Index& index, llvm::Value* y_loc) {
+          GpuElementalIrEmitter elem_emitter(hlo_module_config_, module_, &b_,
+                                             GetNestedComputer());
+          FusedIrEmitter fused_emitter(param_arrays, &elem_emitter);
+          tiled_param_info.set_y(y_loc);
+          fused_emitter.SetTiledParameterInfo(&tiled_param_info);
+          TF_CHECK_OK(hlo->fused_expression_root()->Accept(&fused_emitter));
+          IrArray::Index untiled_index = llvm_ir::GetUnreducedOutputIndex(
+              index, output_reduced_shapes[0], output_arrays[0].GetShape(),
+              &b_);
+          const llvm_ir::ElementGenerator& output_generator =
+              fused_emitter.GetRootGenerator();
+          llvm::Value* output_value =
+              output_generator(untiled_index).ValueOrDie();
+          if (hlo->IsMultiOutputFusion()) {
+            CHECK(output_value->getType()->isStructTy());
+            CHECK_EQ(output_value->getType()->getStructNumElements(),
+                     output_in_reduced_shape_arrays.size());
+            for (int64 i = 0; i < output_in_reduced_shape_arrays.size(); ++i) {
+              output_in_reduced_shape_arrays[i].EmitWriteArrayElement(
+                  index, b_.CreateExtractValue(output_value, i), &b_);
+            }
+          } else {
+            output_in_reduced_shape_arrays[0].EmitWriteArrayElement(
+                index, output_value, &b_);
+          }
+        });
+  }
+
+  // For multioutput fusion, emit a tuple with all the individual outputs.
+  if (hlo->IsMultiOutputFusion()) {
+    std::vector<llvm::Value*> tuple_operand_ptrs;
+    for (int64 i = 0; i < output_arrays.size(); ++i) {
+      tuple_operand_ptrs.push_back(output_arrays[i].GetBasePointer());
+    }
+    llvm_ir::EmitTuple(GetIrArray(*hlo, *hlo), tuple_operand_ptrs, &b_,
+                       module_);
+  }
+
+  return launch_dimensions;
+}
+
+bool IrEmitterUnnested::CheckAndEmitHloWithTile021(HloInstruction* hlo) {
+  HloOpcode opcode = hlo->opcode();
+  CHECK(opcode == HloOpcode::kFusion || opcode == HloOpcode::kCopy);
+  CHECK(opcode != HloOpcode::kFusion ||
+        hlo->fusion_kind() == HloInstruction::FusionKind::kLoop)
+      << "Only loop fusions are supported.";
+
+  const Shape& output_shape = hlo->IsMultiOutputFusion()
+                                  ? ShapeUtil::GetSubshape(hlo->shape(), {0})
+                                  : hlo->shape();
+
+  // If the output_shape is reduced to 021 shape, find all the parameters of the
+  // hlo that are in the corresponding 012 shape.
+  std::vector<int64> params_012;
+  optional<std::vector<int64>> reduced_dims_021;
+  for (int64 operand_idx = 0; operand_idx < hlo->operand_count();
+       ++operand_idx) {
+    HloInstruction* operand = hlo->mutable_operand(operand_idx);
+    auto find_transpose_result =
+        llvm_ir::FindTranspose021(operand->shape(), output_shape);
+    if (!find_transpose_result.has_value()) {
+      continue;
+    }
+    const std::vector<int64>& curr_reduced_dims_021 = *find_transpose_result;
+    if (!reduced_dims_021.has_value()) {
+      reduced_dims_021 = curr_reduced_dims_021;
+    }
+    if (!ContainersEqual(*reduced_dims_021, curr_reduced_dims_021)) {
+      // There is more than one possible transpose. Instead of picking one
+      // transpose, we simply give up here.
+      return false;
+    }
+    params_012.push_back(operand_idx);
+  }
+
+  if (!reduced_dims_021.has_value()) {
+    return false;
+  }
+
+  if ((*reduced_dims_021)[1] < kMinDimensionToTransposeTiled ||
+      (*reduced_dims_021)[2] < kMinDimensionToTransposeTiled) {
+    return false;
+  }
+
+  // Each of our shared memory tiles has 32*33 elements (so ~4kb, if the
+  // elements are of size 4 bytes), and CUDA has an architectural limit of 48kb
+  // shared memory per SM.  (This is increased to 96kb in Volta, but we don't
+  // use this, in part because it eats into our L1 cache space.)
+  //
+  // For correctness we need to ensure that we don't make more than 48kb worth
+  // of shmem tiles per block.  And for performance, we'd probably like to use
+  // significantly less, so that we can fit more than one block at a time on a
+  // gpu core.
+  //
+  // We say without benchmarks that we want at least 3 threads/block,
+  // corresponding to 3 shmem tiles if the elements are 32 bits wide.  We choose
+  // which params get the shmem transpose treatment arbitrarily; it's not clear
+  // if there's a Right Choice.
+  //
+  // This is only sound if tiled transposes are the only place where we use
+  // shared memory in fusions.  If in the future other fusile ops use shared
+  // memory, we'll have to adjust this heuristic.
+  constexpr int kMinBlocksPerCore = 3;
+  constexpr int64 kShmemPerCore = 48 * 1024;
+  int64 shmem_used = 0;
+  for (int64 i = 0; i < params_012.size(); ++i) {
+    const HloInstruction* operand = hlo->operand(params_012[i]);
+    shmem_used +=
+        32 * 33 *
+        ShapeUtil::ByteSizeOfPrimitiveType(operand->shape().element_type());
+
+    if (kMinBlocksPerCore * shmem_used > kShmemPerCore) {
+      // Erase this element and everything after it from params_012.
+      params_012.resize(i);
+      break;
+    }
+  }
+
+  VLOG(3) << "EmitHlo021Tile Emitting hlo tile 0-2-1" << hlo->ToString();
+  thunk_sequence_->emplace_back(
+      BuildKernelThunk(hlo, /*implements_whole_instruction=*/true));
+  const LaunchDimensions launch_dimensions =
+      EmitHlo021Tile(hlo, *reduced_dims_021, params_012);
+  UpdateLaunchDimensions(launch_dimensions, LastThunk(),
+                         ir_emitter_context_->llvm_module());
+
+  return true;
+}
+
+Status IrEmitterUnnested::EmitConstantGlobals() {
+  for (const BufferAllocation& allocation :
+       ir_emitter_context_->buffer_assignment().Allocations()) {
+    if (!allocation.is_constant()) {
+      continue;
+    }
+
+    const Literal& literal = llvm_ir::LiteralForConstantAllocation(allocation);
+    const bool should_emit_initializer = ShouldEmitLiteralInLlvmIr(literal);
+    llvm::ArrayType* global_type =
+        llvm::ArrayType::get(b_.getInt8Ty(), allocation.size());
+    llvm::Constant* initializer =
+        should_emit_initializer
+            ? llvm_ir::ConvertLiteralToIrConstant(literal, module_)
+            : llvm::ConstantAggregateZero::get(global_type);
+    if (should_emit_initializer) {
+      VLOG(3) << "Emitted initializer for constant with shape "
+              << ShapeUtil::HumanString(literal.shape());
+    }
+
+    // These globals will be looked up by name by GpuExecutable so we need to
+    // give them an external linkage.  Not all of their uses are visible in the
+    // LLVM IR (e.g. TupleThunk) so we can't give then a linkage that merely
+    // preserves their names (like available_externally), we also need to ensure
+    // that they stick around even if they're "unused".
+    //
+    // We may have to be more more clever here in the future if we notice that
+    // we're keeping around too many globals because of their linkage.
+    llvm::GlobalVariable* global_for_const = new llvm::GlobalVariable(
+        global_type, /*isConstant=*/should_emit_initializer,
+        llvm::GlobalValue::ExternalLinkage,
+        /*Initializer=*/initializer,
+        llvm_ir::AsStringRef(
+            llvm_ir::ConstantBufferAllocationToGlobalName(allocation)));
+    global_for_const->setAlignment(kConstantBufferAlignBytes);
+    ir_emitter_context_->llvm_module()->getGlobalList().push_back(
+        global_for_const);
+  }
+
+  return Status::OK();
+}
+
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
index 819060061a..5254419907 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/gpu/ir_emitter.h"
 #include "tensorflow/compiler/xla/service/gpu/thunk.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.h"
 
 namespace xla {
 namespace gpu {
@@ -73,8 +74,11 @@ class IrEmitterUnnested : public IrEmitter {
   Status HandleTuple(HloInstruction* tuple) override;
   Status HandleWhile(HloInstruction* xla_while) override;
   Status HandleInfeed(HloInstruction* xla_infeed) override;
+  Status HandleOutfeed(HloInstruction* outfeed) override;
   Status HandleRng(HloInstruction* random) override;
   Status HandleSelect(HloInstruction* select) override;
+  Status HandleSort(HloInstruction* sort) override;
+  Status HandleTupleSelect(HloInstruction* tuple_select) override;
   Status HandleCrossReplicaSum(HloInstruction* crs) override;
   Status HandleAfterAll(HloInstruction* gen_token) override;
 
@@ -88,6 +92,9 @@ class IrEmitterUnnested : public IrEmitter {
       const HloInstruction& hlo, const llvm_ir::ElementGenerator& body_emitter,
       KernelThunk* thunk);
 
+  // Emits LLVM global variables corresponding to constant instructions.
+  Status EmitConstantGlobals();
+
  private:
   // Builds the appropriate thunk for the instruction hlo and returns the owning
   // pointer to it. The caller needs to make sure `inst` outlives the lifetime
@@ -115,7 +122,7 @@ class IrEmitterUnnested : public IrEmitter {
   // Emits code that reduces a matrix of shape [height x width] to a vector of
   // [width]. Other parameters have the same meaning as those of
   // `EmitReductionToVector`. Note that input shape might not be
-  // [height x width], but can be bitcast to [height x weight] with "height"
+  // [height x width], but can be bitcast to [height x width] with "height"
   // being the major dimension.
   Status EmitColumnReduction(
       int64 height, int64 width, HloInstruction* reduce,
@@ -131,7 +138,7 @@ class IrEmitterUnnested : public IrEmitter {
   // Emits code that reduces a 3D tensor of shape [depth x height x width] to a
   // vector of shape [height]. Other parameters have the same meaning as those
   // of `EmitReductionToVector`. Note that input shape might not be
-  // [depth x height x width], but can be bitcast to [depth x height x weight]
+  // [depth x height x width], but can be bitcast to [depth x height x width]
   // with "depth" being the most major dimension.
   Status EmitRowReduction(
       int64 depth, int64 height, int64 width, HloInstruction* reduce,
@@ -182,12 +189,56 @@ class IrEmitterUnnested : public IrEmitter {
           std::pair<llvm_ir::ElementGenerator, ShapeIndex>>
           extra_output_gens);
 
+  // Returns true if a 0-2-1 tiling algorithm is already used to emit the kernel
+  // for the hlo instruction.
+  bool CheckAndEmitHloWithTile021(HloInstruction* hlo);
+  // Emits a kernel for the hlo instruction using a 0-2-1 tiling algorithm and
+  // returns the launch dimensions for the kernel. This is a helper to support
+  // the implementation of CheckAndEmitHloWithTile021.
+  LaunchDimensions EmitHlo021Tile(
+      HloInstruction* hlo,
+      tensorflow::gtl::ArraySlice<int64> reduced_output_dims,
+      tensorflow::gtl::ArraySlice<int64> tiled_param_ids);
+  // Generates the IrArray for each output of hlo and returns the number of
+  // outputs.
+  int ConstructIrArrayForOutputs(const HloInstruction& hlo,
+                                 std::vector<llvm_ir::IrArray>* output_arrays);
+  // Generates the IrArray for each input of hlo and returns the number of
+  // inputs.
+  int ConstructIrArrayForInputs(const HloInstruction& hlo,
+                                std::vector<llvm_ir::IrArray>* param_arrays);
+  // For each output of the `hlo` instruction, constructs the reduced shape for
+  // the output with the given `reduced_output_dims` and cast the original
+  // output IrArray element in `output_arrays` to the reduced shape. Returns
+  // the number of outputs.
+  int ConstructOutputReducedShapeAndCastOutputIrArrayToShape(
+      const HloInstruction& hlo,
+      const std::vector<llvm_ir::IrArray>& output_arrays,
+      tensorflow::gtl::ArraySlice<int64> reduced_output_dims,
+      std::vector<Shape>* output_reduced_shapes,
+      std::vector<llvm_ir::IrArray>* output_in_reduced_shape_arrays);
+  // For each input of the `hlo` instruction, checks its value in
+  // `param_buffers` to find out whether the input has a reduced shape. If the
+  // input has a reduced shape, constructs the reduced shape for the input and
+  // casts the original input IrArray in `param_arrays` to the reduced shape.
+  // Return the total number of inputs.
+  int ConstructInputReducedShapeAndCastInputIrArrayToShape(
+      const HloInstruction& hlo,
+      const std::vector<llvm_ir::IrArray>& param_arrays,
+      const std::vector<llvm::Value*>& param_buffers,
+      tensorflow::gtl::ArraySlice<int64> reduced_output_dims,
+      std::vector<Shape>* param_reduced_shapes,
+      std::vector<llvm_ir::IrArray>* param_in_reduced_shape_arrays);
+
   // Returns a KernelThunk that invokes the kernel emitted for `inst`. The
   // caller needs to make sure `inst` outlives the lifetime of the returned
   // Thunk object. The kernel implementation will be unrolled if unroll_factor
-  // is greater than one.
-  std::unique_ptr<KernelThunk> BuildKernelThunk(const HloInstruction* inst,
-                                                int unroll_factor = 1);
+  // is greater than one. 'implements_whole_instruction' specifies whether this
+  // KernelThunk implements the whole 'inst' HloInstruction. In some cases
+  // 'inst' will be implemented by a sequence of Thunks.
+  std::unique_ptr<KernelThunk> BuildKernelThunk(
+      const HloInstruction* inst, bool implements_whole_instruction,
+      int unroll_factor = 1);
 
   // Returns a FftThunk that calls cuFFT to implement `inst`.
   std::unique_ptr<Thunk> BuildFftThunk(const HloInstruction* inst);
@@ -208,10 +259,14 @@ class IrEmitterUnnested : public IrEmitter {
   std::unique_ptr<Thunk> BuildDeviceToDeviceCopyThunk(
       const HloInstruction* inst);
 
-  // Returns an InfeedThunk that performs device-to-device memcpy to implement
+  // Returns an InfeedThunk that performs a host-to-device memcpy to implement
   // `inst`.
   std::unique_ptr<Thunk> BuildInfeedThunk(const HloInstruction* inst);
 
+  // Returns an OutfeedThunk that performs a device-to-host memcpy to implement
+  // `inst`.
+  std::unique_ptr<Thunk> BuildOutfeedThunk(const HloInstruction* inst);
+
   // Returns a WhileThunk that invokes thunk sequences for 'condition' and
   // 'body' sub-computations of while instruction 'hlo'.
   std::unique_ptr<Thunk> BuildWhileThunk(const HloInstruction* hlo);
diff --git a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/BUILD b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/BUILD
index 7de8f9e1ee..eb93efc560 100644
--- a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/BUILD
+++ b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/BUILD
@@ -17,12 +17,12 @@ cc_library(
     name = "llvm_gpu_backend",
     srcs = [
         "dump_ir_pass.cc",
-        "gpu_backend_lib.cc",
+        "nvptx_backend_lib.cc",
         "utils.cc",
     ],
     hdrs = [
         "dump_ir_pass.h",
-        "gpu_backend_lib.h",
+        "nvptx_backend_lib.h",
         "utils.h",
     ],
     deps = [
@@ -34,6 +34,7 @@ cc_library(
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
+        "@llvm//:amdgpu_code_gen",
         "@llvm//:analysis",
         "@llvm//:bit_reader",
         "@llvm//:bit_writer",
diff --git a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.cc
index a4e4e85bf3..c48fca355b 100644
--- a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc
+++ b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.h"
+#include "tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.h"
 
 #include <map>
 #include <memory>
@@ -114,20 +114,20 @@ static string GetLibdeviceFilename(const string& libdevice_dir_path,
 // Gets the GPU name as it's known to LLVM for a given compute capability.  If
 // we see an unrecognized compute capability, we return "sm_30".
 static string GetSmName(std::pair<int, int> compute_capability) {
-  static auto* m = new std::map<std::pair<int, int>, int>({{{2, 0}, 20},
-                                                           {{2, 1}, 21},
-                                                           {{3, 0}, 30},
-                                                           {{3, 2}, 32},
-                                                           {{3, 5}, 35},
-                                                           {{3, 7}, 37},
-                                                           {{5, 0}, 50},
-                                                           {{5, 2}, 52},
-                                                           {{5, 3}, 53},
-                                                           {{6, 0}, 60},
-                                                           {{6, 1}, 61},
-                                                           {{6, 2}, 62},
-                    // TODO: Change this to 70 once LLVM NVPTX supports it
-                                                           {{7, 0}, 60}});
+  static auto* m = new std::map<std::pair<int, int>, int>({
+      {{3, 0}, 30},
+      {{3, 2}, 32},
+      {{3, 5}, 35},
+      {{3, 7}, 37},
+      {{5, 0}, 50},
+      {{5, 2}, 52},
+      {{5, 3}, 53},
+      {{6, 0}, 60},
+      {{6, 1}, 61},
+      {{6, 2}, 62},
+      {{7, 0}, 70},
+      {{7, 2}, 72},
+  });
   int sm_version = 30;
   auto it = m->find(compute_capability);
   if (it != m->end()) {
@@ -206,7 +206,7 @@ std::unique_ptr<llvm::TargetMachine> GetTargetMachine(
       codegen_opt_level = CodeGenOpt::None;
   }
   return WrapUnique(target->createTargetMachine(
-      triple.str(), llvm_ir::AsStringRef(cpu_name), "+ptx42", target_options,
+      triple.str(), llvm_ir::AsStringRef(cpu_name), "+ptx60", target_options,
       Optional<Reloc::Model>(RelocModel), Optional<CodeModel::Model>(CMModel),
       codegen_opt_level));
 }
@@ -319,8 +319,8 @@ Status LinkLibdeviceIfNecessary(llvm::Module* module,
 
   llvm::Linker linker(*module);
   string libdevice_path = tensorflow::io::JoinPath(
-      libdevice_dir_path, GetLibdeviceFilename(libdevice_dir_path,
-                                               compute_capability));
+      libdevice_dir_path,
+      GetLibdeviceFilename(libdevice_dir_path, compute_capability));
   TF_RETURN_IF_ERROR(tensorflow::Env::Default()->FileExists(libdevice_path));
   VLOG(1) << "Linking with libdevice from: " << libdevice_path;
   std::unique_ptr<llvm::Module> libdevice_module =
diff --git a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.h b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.h
index 0a345191d3..54e0e140de 100644
--- a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.h
+++ b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.h
@@ -14,8 +14,8 @@ limitations under the License.
 ==============================================================================*/
 
 // LLVM-based compiler backend.
-#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_LLVM_GPU_BACKEND_GPU_BACKEND_LIB_H_
-#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_LLVM_GPU_BACKEND_GPU_BACKEND_LIB_H_
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_LLVM_GPU_BACKEND_NVPTX_BACKEND_LIB_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_LLVM_GPU_BACKEND_NVPTX_BACKEND_LIB_H_
 
 #include <string>
 #include <utility>
@@ -44,4 +44,4 @@ StatusOr<string> CompileToPtx(llvm::Module* module,
 }  // namespace gpu
 }  // namespace xla
 
-#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_LLVM_GPU_BACKEND_GPU_BACKEND_LIB_H_
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_LLVM_GPU_BACKEND_NVPTX_BACKEND_LIB_H_
diff --git a/tensorflow/compiler/xla/service/gpu/multi_output_fusion.cc b/tensorflow/compiler/xla/service/gpu/multi_output_fusion.cc
index ea661b3c2c..c62bae0628 100644
--- a/tensorflow/compiler/xla/service/gpu/multi_output_fusion.cc
+++ b/tensorflow/compiler/xla/service/gpu/multi_output_fusion.cc
@@ -23,6 +23,8 @@ limitations under the License.
 #include <string>
 #include <utility>
 
+#include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/service/gpu/instruction_fusion.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
@@ -71,7 +73,6 @@ bool GpuMultiOutputFusion::ShapesCompatibleForFusion(HloInstruction* instr1,
   // In that case, the operand of the reduce needs to have the same shape
   // as the other tuple operands, but also we need to compare the output
   // shapes of the reduces.
-  // TODO(tjoerg): Allow differences in fp precision.
   auto* element_instr_1 = get_element_instr(instr1);
   auto* element_instr_2 = get_element_instr(instr2);
   if (element_instr_1->opcode() == HloOpcode::kReduce &&
@@ -80,8 +81,8 @@ bool GpuMultiOutputFusion::ShapesCompatibleForFusion(HloInstruction* instr1,
     return false;
   }
   // The elementwise output shapes must be the same (including layout).
-  return ShapeUtil::Equal(get_element_shape(element_instr_1),
-                          get_element_shape(element_instr_2));
+  return ShapeUtil::EqualIgnoringFpPrecision(
+      get_element_shape(element_instr_1), get_element_shape(element_instr_2));
 }
 
 namespace {
@@ -107,6 +108,35 @@ bool IsInputFusibleReduction(HloInstruction* instr) {
     return IsReductionToVector(*instr);
   }
 }
+
+// The code emitted for reduction suffers from poor data locality if the layouts
+// of input parameters differ. In such situtations it is beneficial not to fuse.
+// We consider input params with maximum rank only. Params with smaller ranks
+// will be broadcasted and have not been observed to cause data locality issues.
+// TODO(b/111977086): Improve reduce emitters to remove this limitation.
+bool ReduceFriendlyInputLayouts(HloInstruction* instr) {
+  std::vector<HloInstruction*> params;
+  if (instr->opcode() == HloOpcode::kFusion) {
+    params = instr->fused_parameters();
+  } else {
+    for (HloInstruction* operand : instr->operands()) {
+      params.push_back(operand);
+    }
+  }
+  int64 max_rank = 0;
+  const Layout* max_rank_layout;
+  for (HloInstruction* param : params) {
+    if (ShapeUtil::Rank(param->shape()) > max_rank) {
+      max_rank = ShapeUtil::Rank(param->shape());
+      max_rank_layout = &param->shape().layout();
+    }
+  }
+  return c_all_of(params, [&](HloInstruction* param) {
+    return (ShapeUtil::Rank(param->shape()) < max_rank) ||
+           (LayoutUtil::Equal(param->shape().layout(), *max_rank_layout));
+  });
+}
+
 }  // namespace
 
 bool GpuMultiOutputFusion::IsFusible(HloInstruction* instr) {
@@ -142,16 +172,22 @@ bool GpuMultiOutputFusion::LegalToFuse(HloInstruction* instr1,
   if (!MultiOutputFusion::LegalToFuse(instr1, instr2)) {
     return false;
   }
+
   // If we're fusing fusions only do it if the fusion kind matches. Loop fusions
   // merge into bigger loop fusions and input (reduce) fusions become fusions
   // with multiple reduce outputs. We could fuse reduce and loop fusions
   // together too (the result being an input fusion) if we find cases where this
   // improves things.
   CHECK(instr1->opcode() == HloOpcode::kFusion);
-  if (instr2->opcode() == HloOpcode::kFusion) {
-    return instr1->fusion_kind() == instr2->fusion_kind();
+  if ((instr2->opcode() == HloOpcode::kFusion &&
+       instr1->fusion_kind() != instr2->fusion_kind()) ||
+      (instr2->opcode() != HloOpcode::kFusion &&
+       instr1->fusion_kind() == HloInstruction::FusionKind::kLoop)) {
+    return false;
   }
-  return instr1->fusion_kind() != HloInstruction::FusionKind::kLoop;
+
+  // Do this check last, as it may be expensive.
+  return !GpuInstructionFusion::FusionWouldBeTooLarge(instr1, instr2);
 }
 
 bool GpuMultiOutputFusion::DoProducerConsumerMultiOutputFusion() {
@@ -173,29 +209,41 @@ bool GpuMultiOutputFusion::DoProducerConsumerMultiOutputFusion() {
   // fusions operands.
   for (HloInstruction* consumer : computation()->MakeInstructionPostOrder()) {
     if (consumer->user_count() == 0) {
+      VLOG(3) << consumer->name() << " has no users.";
       continue;
     }
     if (!IsInputFusibleReduction(consumer)) {
+      VLOG(3) << consumer->name() << " is not an input-fusable reduction.";
       continue;
     }
+    VLOG(3) << consumer->name()
+            << " is a fusion candidate. Looking for fuseable operands.";
 
     auto consumer_operands = consumer->operands();
     for (size_t i = 0; i < consumer_operands.size(); ++i) {
       HloInstruction* producer = consumer_operands[i];
       if (!producer->IsFusable()) {
+        VLOG(3) << producer->name() << " is not fusable.";
         continue;
       }
       const bool is_loop_fusion =
           producer->opcode() == HloOpcode::kFusion &&
           producer->fusion_kind() == HloInstruction::FusionKind::kLoop;
-      if (!is_loop_fusion) {
+      if (!producer->IsElementwise() && !is_loop_fusion) {
+        VLOG(3) << producer->name() << " is not a loop fusion.";
         continue;
       }
       if (!ShapesCompatibleForFusion(producer, consumer)) {
+        VLOG(3) << producer->name() << " has an incompatible shape.";
+        continue;
+      }
+      if (!ReduceFriendlyInputLayouts(producer)) {
+        VLOG(3) << producer->name() << " has inputs with mixed layouts.";
         continue;
       }
       // If we have already decided to fuse this producer, skip it.
       if (ContainsKey(to_fuse, producer)) {
+        VLOG(3) << producer->name() << " will be fused with another consumer.";
         continue;
       }
       // Do not fuse a producer if the other operands of the fusion are
@@ -204,6 +252,7 @@ bool GpuMultiOutputFusion::DoProducerConsumerMultiOutputFusion() {
             return producer != operand &&
                    reachability()->IsReachable(producer, operand);
           })) {
+        VLOG(3) << producer->name() << " would introduce a cycle when fused.";
         break;
       }
       to_fuse.insert(producer);
diff --git a/tensorflow/compiler/xla/service/gpu/multi_output_fusion_test.cc b/tensorflow/compiler/xla/service/gpu/multi_output_fusion_test.cc
index 979ea79243..14f157a5e5 100644
--- a/tensorflow/compiler/xla/service/gpu/multi_output_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/multi_output_fusion_test.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/gpu/multi_output_fusion.h"
 
+#include "tensorflow/compiler/xla/service/gpu/instruction_fusion.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -27,7 +28,7 @@ namespace op = xla::testing::opcode_matchers;
 namespace xla {
 namespace gpu {
 
-using InstructionFusionTest = HloTestBase;
+using MultiOutputFusionTest = HloTestBase;
 
 const char kModulePrefix[] = R"(
     HloModule test_module
@@ -40,10 +41,10 @@ const char kModulePrefix[] = R"(
     scalar_mul_computation {
       scalar_lhs.1 = f32[] parameter(0)
       scalar_rhs.1 = f32[] parameter(1)
-      ROOT mul.1 = f32[] add(scalar_lhs.1, scalar_rhs.1)
+      ROOT mul.1 = f32[] multiply(scalar_lhs.1, scalar_rhs.1)
     })";
 
-TEST_F(InstructionFusionTest, MultiOutputFusionSiblingReduceAndReduceFusion) {
+TEST_F(MultiOutputFusionTest, MultiOutputFusionSiblingReduceAndReduceFusion) {
   // Fusion with reduce instruction root and a sibling reduce instruction
   // sharing the same input param.
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
@@ -72,7 +73,7 @@ TEST_F(InstructionFusionTest, MultiOutputFusionSiblingReduceAndReduceFusion) {
               op::Tuple(op::Reduce(), op::Reduce()));
 }
 
-TEST_F(InstructionFusionTest, MultiOutputFusionDifferentReduceInputShapes) {
+TEST_F(MultiOutputFusionTest, MultiOutputFusionDifferentReduceInputShapes) {
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
     fused_computation_1 {
       p1.1 = f32[6400]{0} parameter(1)
@@ -99,7 +100,7 @@ TEST_F(InstructionFusionTest, MultiOutputFusionDifferentReduceInputShapes) {
   ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
 }
 
-TEST_F(InstructionFusionTest, MultiOutputFusionDifferentReduceOutputShapes) {
+TEST_F(MultiOutputFusionTest, MultiOutputFusionDifferentReduceOutputShapes) {
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
     fused_computation_1 {
       p1.1 = f32[10,10]{1,0} parameter(1)
@@ -126,7 +127,7 @@ TEST_F(InstructionFusionTest, MultiOutputFusionDifferentReduceOutputShapes) {
   ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
 }
 
-TEST_F(InstructionFusionTest, MultiOutputFusionSiblingReduceFusions) {
+TEST_F(MultiOutputFusionTest, MultiOutputFusionSiblingReduceFusions) {
   // Two sibling fusions with reduce instruction roots sharing the same input
   // param.
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
@@ -160,7 +161,7 @@ TEST_F(InstructionFusionTest, MultiOutputFusionSiblingReduceFusions) {
               op::Tuple(op::Reduce(), op::Reduce()));
 }
 
-TEST_F(InstructionFusionTest,
+TEST_F(MultiOutputFusionTest,
        MultiOutputFusionSiblingReduceAndReduceMultiOutputFusion) {
   // Multi-output fusion with two reduce instructions root and a sibling reduce
   // instruction sharing the same input param.
@@ -193,7 +194,7 @@ TEST_F(InstructionFusionTest,
               op::Tuple(op::Reduce(), op::Reduce(), op::Reduce()));
 }
 
-TEST_F(InstructionFusionTest,
+TEST_F(MultiOutputFusionTest,
        MultiOutputFusionSiblingFusionCheckAgainstReduceOperand) {
   // Verify that if we already have a multi-output fusion that we prefer to pick
   // a reduce op from its operands for checking shape compatibility.
@@ -226,7 +227,7 @@ TEST_F(InstructionFusionTest,
   ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
 }
 
-TEST_F(InstructionFusionTest, MultiOutputFusionTwoLoops) {
+TEST_F(MultiOutputFusionTest, MultiOutputFusionTwoLoops) {
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
     fused_computation_1 {
       p0.1 = f32[6400]{0} parameter(0)
@@ -255,7 +256,27 @@ TEST_F(InstructionFusionTest, MultiOutputFusionTwoLoops) {
               op::Tuple(op::Multiply(), op::Divide()));
 }
 
-TEST_F(InstructionFusionTest, ProducerConsumerFusionLoopFusionAndReduce) {
+TEST_F(MultiOutputFusionTest, ProducerConsumerFusionElementwiseAndReduce) {
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    ENTRY reduce {
+      p0 = f32[2,2,2]{2,1,0} parameter(0)
+      c0 = f32[] constant(0)
+      exp = f32[2,2,2]{2,1,0} exponential(p0)
+      reduce = f32[2,2]{1,0} reduce(exp, c0), dimensions={2}, to_apply=scalar_add_computation
+      ROOT root = (f32[2,2]{1,0}, f32[2,2,2]{2,1,0}) tuple(reduce, exp)
+    })"))
+                    .ValueOrDie();
+  ASSERT_TRUE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+  const HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Tuple(op::GetTupleElement(), op::GetTupleElement()));
+  const HloInstruction* fusion = root->operand(0)->operand(0);
+  ASSERT_TRUE(fusion->IsMultiOutputFusion());
+  EXPECT_THAT(fusion->fused_expression_root(),
+              op::Tuple(op::Reduce(), op::Exp()));
+}
+
+TEST_F(MultiOutputFusionTest, ProducerConsumerFusionLoopFusionAndReduce) {
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
     fused_add {
       p0.1 = f32[2,2,2]{2,1,0} parameter(0)
@@ -282,7 +303,7 @@ TEST_F(InstructionFusionTest, ProducerConsumerFusionLoopFusionAndReduce) {
               op::Tuple(op::Reduce(), op::Add()));
 }
 
-TEST_F(InstructionFusionTest, ProducerConsumerFusionLoopFusionAndReduceFusion) {
+TEST_F(MultiOutputFusionTest, ProducerConsumerFusionLoopFusionAndReduceFusion) {
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
     fused_select {
       p1.1 = f32[2,2,2]{2,1,0} parameter(1)
@@ -323,7 +344,7 @@ TEST_F(InstructionFusionTest, ProducerConsumerFusionLoopFusionAndReduceFusion) {
               op::Tuple(op::Reduce(), op::Reduce(), op::Select()));
 }
 
-TEST_F(InstructionFusionTest, ProducerConsumerFusionDoNotFuseLoopReduceFusion) {
+TEST_F(MultiOutputFusionTest, ProducerConsumerFusionDoNotFuseLoopReduceFusion) {
   auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
     fused_element_wise {
       p0.1 = f32[2,2,2]{2,1,0} parameter(0)
@@ -349,5 +370,128 @@ TEST_F(InstructionFusionTest, ProducerConsumerFusionDoNotFuseLoopReduceFusion) {
   ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
 }
 
+TEST_F(MultiOutputFusionTest,
+       ProducerConsumerFusionFp16LoopFusionAndReduceFusion) {
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    fused_select {
+      p1.1 = f16[2,2,2]{2,1,0} parameter(1)
+      c0 = f16[] constant(0)
+      broadcast = f16[2,2,2]{2,1,0} broadcast(f16[] c0), dimensions={}
+      greater-than = pred[2,2,2]{2,1,0} greater-than(f32[2,2,2]{2,1,0} p1.1, f32[2,2,2]{2,1,0} broadcast)
+      p0.1 = f16[2,2,2]{2,1,0} parameter(0)
+      ROOT select = f16[2,2,2]{2,1,0} select(pred[2,2,2]{2,1,0} greater-than, f16[2,2,2]{2,1,0} p0.1, f16[2,2,2]{2,1,0} broadcast)
+    }
+    fused_reduce {
+      p0.2 = f16[2,2,2]{2,1,0} parameter(0)
+      convert = f32[2,2,2]{2,1,0} convert(p0.2)
+      c1 = f32[] constant(0)
+      r1 = f32[2,2]{1,0} reduce(convert, c1), dimensions={2}, to_apply=scalar_add_computation
+      mul = f32[2,2,2]{2,1,0} multiply(convert, convert)
+      r2 = f32[2,2]{1,0} reduce(mul, c1), dimensions={2}, to_apply=scalar_add_computation
+      ROOT tuple = (f32[2,2]{1,0}, f32[2,2]{1,0}) tuple(r1, r2)
+    }
+    ENTRY reduce {
+      p0 = f16[2,2,2]{2,1,0} parameter(0)
+      p1 = f16[2,2,2]{2,1,0} parameter(1)
+      select = f16[2,2,2]{2,1,0} fusion(p0, p1), kind=kLoop, calls=fused_select
+      fusion = (f32[2,2]{1,0}, f32[2,2]{1,0}) fusion(select), kind=kInput, calls=fused_reduce
+      gte0 = f32[2,2]{1,0} get-tuple-element(fusion), index=0
+      gte1 = f32[2,2]{1,0} get-tuple-element(fusion), index=1
+      ROOT root = (f32[2,2]{1,0}, f32[2,2]{1,0}, f16[2,2,2]{2,1,0}) tuple(gte1, gte1, select)
+    })"))
+                    .ValueOrDie();
+  ASSERT_TRUE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+  const HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Tuple(op::GetTupleElement(), op::GetTupleElement(),
+                              op::GetTupleElement()));
+  const HloInstruction* fusion = root->operand(0)->operand(0);
+  ASSERT_TRUE(fusion->IsMultiOutputFusion());
+  EXPECT_THAT(fusion->fused_expression_root(),
+              op::Tuple(op::Reduce(), op::Reduce(), op::Select()));
+}
+
+TEST_F(MultiOutputFusionTest,
+       ProducerConsumerFusionReduceUnfriendlyLoopFusion) {
+  auto module = ParseHloString(tensorflow::strings::StrCat(kModulePrefix, R"(
+    mixed_input_layouts_computation {
+      p0.1 = f16[128,1024,32,32]{1,3,2,0} parameter(0)
+      p1.1 = f16[128,1024,32,32]{3,2,1,0} parameter(1)
+      copy = f16[128,1024,32,32]{1,3,2,0} copy(p1.1)
+      c0 = f16[] constant(0)
+      broadcast = f16[128,1024,32,32]{1,3,2,0} broadcast(c0), dimensions={}
+      greater-than = pred[128,1024,32,32]{1,3,2,0} greater-than(copy, broadcast)
+      ROOT root = f16[128,1024,32,32]{1,3,2,0} select(greater-than, p0.1, broadcast)
+    }
+    fused_reduce {
+      p0.2 = f16[128,1024,32,32]{1,3,2,0} parameter(0)
+      convert = f32[128,1024,32,32]{1,3,2,0} convert(p0.2)
+      c0.2 = f32[] constant(0)
+      ROOT reduce = f32[1024]{0} reduce(convert, c0.2), dimensions={0,2,3}, to_apply=scalar_add_computation
+    }
+    ENTRY reduce {
+      p0 = f16[128,1024,32,32]{3,2,1,0} parameter(0)
+      p1 = f16[128,1024,32,32]{1,3,2,0} parameter(1)
+      loop_fusion = f16[128,1024,32,32]{1,3,2,0} fusion(p0, p1), kind=kLoop, calls=mixed_input_layouts_computation
+      reduce_fusion = f32[1024]{0} fusion(loop_fusion), kind=kInput, calls=fused_reduce
+      ROOT root = (f32[1024]{0}, f16[128,1024,32,32]{1,3,2,0}) tuple(reduce_fusion, loop_fusion)
+    })"))
+                    .ValueOrDie();
+  ASSERT_FALSE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+}
+
+// Check that we limit the number of operands to fusions we create.
+TEST_F(MultiOutputFusionTest, AvoidsLargeFusion) {
+  constexpr int64 kNumParams = 200;
+  ASSERT_GT(kNumParams, GpuInstructionFusion::kMaxOperandsAndOutputsPerFusion);
+
+  // Compute
+  //   p0 * p1,
+  //   p0 * p1 + p1 * p2
+  //   p0 * p1 + p1 * p2 + p2 * p3
+  //   ...
+  // where each of the (pi * pj)'s is represented as a fusion node so that
+  // multi-output fusion will pay attention to it.
+  auto module = CreateNewModule();
+  HloComputation::Builder b(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {10, 100});
+
+  std::vector<HloInstruction*> params;
+  for (int64 i = 0; i < kNumParams; ++i) {
+    params.push_back(
+        b.AddInstruction(HloInstruction::CreateParameter(i, shape, "p")));
+  }
+
+  // Creates a fusion node that calculates x*y.
+  auto make_fusion = [&](HloInstruction* x, HloInstruction* y) {
+    HloComputation::Builder sub_builder("subcomp");
+    auto* p0 = sub_builder.AddInstruction(
+        HloInstruction::CreateParameter(0, shape, "p"));
+    auto* p1 = sub_builder.AddInstruction(
+        HloInstruction::CreateParameter(1, shape, "p"));
+    sub_builder.AddInstruction(
+        HloInstruction::CreateBinary(shape, HloOpcode::kMultiply, p0, p1));
+    HloComputation* subcomp =
+        module->AddEmbeddedComputation(sub_builder.Build());
+    return HloInstruction::CreateFusion(
+        shape, HloInstruction::FusionKind::kLoop, {x, y}, subcomp);
+  };
+
+  auto* sum = b.AddInstruction(make_fusion(params[0], params[1]));
+  for (int64 i = 2; i < kNumParams; ++i) {
+    sum = b.AddInstruction(HloInstruction::CreateBinary(
+        shape, HloOpcode::kAdd, sum,
+        b.AddInstruction(make_fusion(params[i - 1], params[i]))));
+  }
+  auto computation = module->AddEntryComputation(b.Build());
+  EXPECT_TRUE(GpuMultiOutputFusion().Run(module.get()).ValueOrDie());
+  SCOPED_TRACE(module->ToString());
+  for (const HloInstruction* instr : computation->instructions()) {
+    EXPECT_LE(instr->operand_count() + ShapeUtil::SubshapeCount(instr->shape()),
+              GpuInstructionFusion::kMaxOperandsAndOutputsPerFusion)
+        << instr->ToString();
+  }
+}
+
 }  // namespace gpu
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc b/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc
index decfc40daf..8fa0439006 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/service/gpu/gpu_compiler.h"
+#include "tensorflow/compiler/xla/service/gpu/nvptx_compiler.h"
 
 #include <stdlib.h>
 #include <atomic>
@@ -34,7 +34,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/buffer_liveness.h"
 #include "tensorflow/compiler/xla/service/call_inliner.h"
 #include "tensorflow/compiler/xla/service/conditional_simplifier.h"
-#include "tensorflow/compiler/xla/service/dot_decomposer.h"
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
 #include "tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.h"
 #include "tensorflow/compiler/xla/service/gpu/cudnn_convolution_algorithm_picker.h"
@@ -50,11 +49,13 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emitter_context.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h"
-#include "tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.h"
+#include "tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.h"
 #include "tensorflow/compiler/xla/service/gpu/multi_output_fusion.h"
+#include "tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.h"
 #include "tensorflow/compiler/xla/service/gpu/pad_insertion.h"
 #include "tensorflow/compiler/xla/service/gpu/partition_assignment.h"
 #include "tensorflow/compiler/xla/service/gpu/stream_assignment.h"
+#include "tensorflow/compiler/xla/service/gpu/stream_executor_util.h"
 #include "tensorflow/compiler/xla/service/gpu/thunk_schedule.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
@@ -96,8 +97,8 @@ limitations under the License.
 namespace xla {
 namespace gpu {
 
-/* static */ const char* GpuCompiler::kTargetTriple = "nvptx64-nvidia-cuda";
-/* static */ const char* GpuCompiler::kDataLayout =
+/* static */ const char* NVPTXCompiler::kTargetTriple = "nvptx64-nvidia-cuda";
+/* static */ const char* NVPTXCompiler::kDataLayout =
     "e-i64:64-i128:128-v16:16-v32:32-n16:32:64";
 
 namespace {
@@ -146,7 +147,6 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
     // support BF16 operations without directly implementing a BF16 lowering for
     // most ops.
     pipeline.AddPass<HloElementTypeConverter>(BF16, F32);
-    pipeline.AddPass<DotDecomposer>();
 
     {
       auto& pass =
@@ -199,6 +199,12 @@ Status OptimizeHloModule(HloModule* hlo_module, se::StreamExecutor* stream_exec,
     pipeline.AddInvariantChecker<HloVerifier>();
     pipeline.AddPass<CudnnConvolutionRewriter>();
     pipeline.AddPass<PadInsertion>();
+    if (IsVoltaOrLater(*stream_exec)) {
+      pipeline.AddPass<PadForTensorCores>();
+      // PadForTensorCores leaves behind unnecessary tuple/get-tuple-element
+      // pairs that TupleSimplifier fixes.
+      pipeline.AddPass<TupleSimplifier>();
+    }
     TF_RETURN_IF_ERROR(pipeline.Run(hlo_module).status());
   }
 
@@ -354,16 +360,30 @@ void WarnIfBadPtxasVersion(const string& ptxas_path) {
     return;
   }
 
+  // We need ptxas >= 9.0 as a hard requirement, because we compile targeting
+  // PTX 6.0.  An older ptxas will just fail to compile any of our code.
+  //
   // ptxas 9.0 before 9.0.276 and ptxas 9.1 before 9.1.121 miscompile some
   // address calculations with large offsets (e.g. "load ptr + large_constant"),
   // b/70245379.
-  if ((vmaj == 9 && vmin == 0 && vdot < 276) ||
-      (vmaj == 9 && vmin == 1 && vdot < 121)) {
-    LOG(WARNING) << "*** WARNING *** You are using ptxas " << vmaj << "."
-                 << vmin << "." << vdot
-                 << ", which is in range [9.0.0, 9.0.276) + [9.1.0, 9.1.121). "
-                    "These versions are known to miscompile XLA code, leading "
-                    "to incorrect results or invalid-address errors.";
+  //
+  // ptxas 9.1.121 miscompiles some large multioutput fusions, again in a way
+  // that appears related to address calculations, b/111107644.  ptxas 9.2.88
+  // appears to work, as far as we can tell.
+  if (vmaj < 9) {
+    LOG(ERROR)
+        << "You are using ptxas 8.x, but XLA requires ptxas 9.x (and strongly "
+           "prefers >= 9.2.88).  Compilation of XLA kernels below will likely "
+           "fail.\n\nYou do not need to update CUDA; cherry-picking the ptxas "
+           "binary is sufficient.";
+  } else if ((vmaj < 9 || vmin < 2 || vdot < 88)) {
+    LOG(WARNING)
+        << "*** WARNING *** You are using ptxas " << vmaj << "." << vmin << "."
+        << vdot
+        << ", which older than 9.2.88. ptxas 9.x before 9.2.88 is known to "
+           "miscompile XLA code, leading to incorrect results or "
+           "invalid-address errors.\n\nYou do not need to update to CUDA "
+           "9.2.88; cherry-picking the ptxas binary is sufficient.";
   }
 }
 
@@ -391,17 +411,18 @@ void WarnIfBadDriverJITVersion() {
     //  - 384.x before 384.108
     //  - 387.x before 387.40
     //  - 390.x before 390.10.
-    auto vmaj = std::get<0>(version);
-    auto vmin = std::get<1>(version);
-    if ((vmaj == 384 && vmin < 108) ||  //
-        (vmaj == 387 && vmin < 40) ||   //
-        (vmaj == 390 && vmin < 10)) {
+    //
+    // In addition, only >= 396.20 contains ptxas >= 9.2.88, which contains the
+    // fix for the "large multioutput fusions" miscompile, b/111107644.
+    if (version < std::make_tuple(396, 20, 0)) {
       LOG(WARNING)
           << "*** WARNING *** Invoking the PTX->SASS JIT from driver version "
           << se::cuda::DriverVersionToString(version)
-          << ", which is in range [384.0.0, 384.108.0) + [387.0.0, 387.40.0) + "
-             "[390.0.0, 390.10.0). These versions are known to miscompile XLA "
-             "code, leading to incorrect results or invalid-address errors.";
+          << ", which is older than 396.20.0. These versions are known to "
+             "miscompile XLA code, leading to incorrect results or "
+             "invalid-address errors.\nXLA only uses the driver JIT if it "
+             "cannot find ptxas; you don't need to update your driver if "
+             "you can point XLA to ptxas 9.2.88 or newer.";
     }
   });
 }
@@ -473,14 +494,14 @@ StatusOr<std::vector<uint8>> CompilePtx(const string& ptx, int cc_major,
 
 }  // namespace
 
-GpuCompiler::GpuCompiler()
+NVPTXCompiler::NVPTXCompiler()
     : pointer_size_(llvm::DataLayout(kDataLayout)
                         .getPointerSize(0 /* default address space */)) {}
 
-StatusOr<std::unique_ptr<HloModule>> GpuCompiler::RunHloPasses(
+StatusOr<std::unique_ptr<HloModule>> NVPTXCompiler::RunHloPasses(
     std::unique_ptr<HloModule> module, se::StreamExecutor* stream_exec,
     DeviceMemoryAllocator* device_allocator) {
-  XLA_SCOPED_LOGGING_TIMER("GpuCompiler::RunHloPasses");
+  XLA_SCOPED_LOGGING_TIMER("NVPTXCompiler::RunHloPasses");
   tracing::ScopedActivity activity("HLO Transforms", module->name(),
                                    /*is_expensive=*/true);
   TF_RETURN_IF_ERROR(
@@ -488,10 +509,10 @@ StatusOr<std::unique_ptr<HloModule>> GpuCompiler::RunHloPasses(
   return std::move(module);
 }
 
-StatusOr<std::unique_ptr<Executable>> GpuCompiler::RunBackend(
+StatusOr<std::unique_ptr<Executable>> NVPTXCompiler::RunBackend(
     std::unique_ptr<HloModule> module, se::StreamExecutor* stream_exec,
     DeviceMemoryAllocator* device_allocator) {
-  XLA_SCOPED_LOGGING_TIMER("GpuCompiler::RunBackend");
+  XLA_SCOPED_LOGGING_TIMER("NVPTXCompiler::RunBackend");
 
   TF_RET_CHECK(stream_exec != nullptr);
 
@@ -525,11 +546,13 @@ StatusOr<std::unique_ptr<Executable>> GpuCompiler::RunBackend(
   // temporary buffers are required to run the computation.
   TF_ASSIGN_OR_RETURN(
       std::unique_ptr<BufferAssignment> buffer_assignment,
-      BufferAssigner::Run(module.get(), hlo_schedule->ConsumeHloOrdering(),
-                          BufferSizeBytesFunction(),
-                          /*color_alignment=*/[](LogicalBuffer::Color) {
-                            return kCudaMallocAlignBytes;
-                          }));
+      BufferAssigner::Run(
+          module.get(), hlo_schedule->ConsumeHloOrdering(),
+          BufferSizeBytesFunction(),
+          /*color_alignment=*/
+          [](LogicalBuffer::Color) { return kXlaAllocatedBufferAlignBytes; },
+          /*allow_input_output_aliasing=*/false,
+          /*allocate_buffers_for_constants=*/true));
   // BufferAssignment::Stats::ToString() and BufferAssignment::ToString()
   // include headers, so no need for us to print them ourselves.
   XLA_VLOG_LINES(1, buffer_assignment->GetStats().ToString());
@@ -550,10 +573,12 @@ StatusOr<std::unique_ptr<Executable>> GpuCompiler::RunBackend(
   HloComputation* entry_computation = module->entry_computation();
   IrEmitterUnnested ir_emitter(module->config(), entry_computation,
                                &ir_emitter_context);
+
+  TF_RETURN_IF_ERROR(ir_emitter.EmitConstantGlobals());
+
   {
-    XLA_SCOPED_LOGGING_TIMER("GpuCompiler::RunBackend - IR emission");
-    TF_RETURN_IF_ERROR(
-        entry_computation->root_instruction()->Accept(&ir_emitter));
+    XLA_SCOPED_LOGGING_TIMER("NVPTXCompiler::RunBackend - IR emission");
+    TF_RETURN_IF_ERROR(entry_computation->Accept(&ir_emitter));
   }
 
   if (user_pre_optimization_hook_) {
@@ -579,7 +604,8 @@ StatusOr<std::unique_ptr<Executable>> GpuCompiler::RunBackend(
   }
 
   {
-    XLA_SCOPED_LOGGING_TIMER("GpuCompiler::RunBackend - Running LLVM verifier");
+    XLA_SCOPED_LOGGING_TIMER(
+        "NVPTXCompiler::RunBackend - Running LLVM verifier");
 
     std::string err;
     llvm::raw_string_ostream err_stream(err);
@@ -619,7 +645,7 @@ StatusOr<std::unique_ptr<Executable>> GpuCompiler::RunBackend(
 
   string ptx;
   {
-    XLA_SCOPED_LOGGING_TIMER("GpuCompiler::RunBackend - CompileToPtx");
+    XLA_SCOPED_LOGGING_TIMER("NVPTXCompiler::RunBackend - CompileToPtx");
     TF_ASSIGN_OR_RETURN(ptx, CompileToPtx(&llvm_module, {cc_major, cc_minor},
                                           module->config(), libdevice_dir));
   }
@@ -688,10 +714,10 @@ StatusOr<std::unique_ptr<Executable>> GpuCompiler::RunBackend(
   return std::unique_ptr<Executable>(gpu_executable);
 }
 
-std::vector<uint8> GpuCompiler::CompilePtxOrGetCachedResult(const string& ptx,
-                                                            int cc_major,
-                                                            int cc_minor) {
-  XLA_SCOPED_LOGGING_TIMER("GpuCompiler::CompilePtxOrGetCachedResult");
+std::vector<uint8> NVPTXCompiler::CompilePtxOrGetCachedResult(const string& ptx,
+                                                              int cc_major,
+                                                              int cc_minor) {
+  XLA_SCOPED_LOGGING_TIMER("NVPTXCompiler::CompilePtxOrGetCachedResult");
   tracing::ScopedActivity activity("PTX->CUBIN", /*is_expensive=*/true);
   bool inserted;
   decltype(compilation_cache_.begin()) iter;
@@ -764,12 +790,14 @@ std::vector<uint8> GpuCompiler::CompilePtxOrGetCachedResult(const string& ptx,
 }
 
 StatusOr<std::vector<std::unique_ptr<AotCompilationResult>>>
-GpuCompiler::CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> module,
-                                const AotCompilationOptions& options) {
-  return Unimplemented("not yet implemented: GpuCompiler::CompileAheadOfTime");
+NVPTXCompiler::CompileAheadOfTime(
+    std::vector<std::unique_ptr<HloModule>> module,
+    const AotCompilationOptions& options) {
+  return Unimplemented(
+      "not yet implemented: NVPTXCompiler::CompileAheadOfTime");
 }
 
-se::Platform::Id GpuCompiler::PlatformId() const {
+se::Platform::Id NVPTXCompiler::PlatformId() const {
   return se::cuda::kCudaPlatformId;
 }
 
@@ -779,7 +807,7 @@ se::Platform::Id GpuCompiler::PlatformId() const {
 static bool InitModule() {
   xla::Compiler::RegisterCompilerFactory(
       stream_executor::cuda::kCudaPlatformId,
-      []() { return xla::MakeUnique<xla::gpu::GpuCompiler>(); });
+      []() { return xla::MakeUnique<xla::gpu::NVPTXCompiler>(); });
   return true;
 }
 static bool module_initialized = InitModule();
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_compiler.h b/tensorflow/compiler/xla/service/gpu/nvptx_compiler.h
index f3b02ae5d8..d4d2909f1b 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_compiler.h
+++ b/tensorflow/compiler/xla/service/gpu/nvptx_compiler.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_GPU_COMPILER_H_
-#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_GPU_COMPILER_H_
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_NVPTX_COMPILER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_NVPTX_COMPILER_H_
 
 #include <memory>
 #include <string>
@@ -37,10 +37,10 @@ namespace xla {
 namespace gpu {
 
 // The GPU compiler generates efficient GPU executables.
-class GpuCompiler : public LLVMCompiler {
+class NVPTXCompiler : public LLVMCompiler {
  public:
-  GpuCompiler();
-  ~GpuCompiler() override {}
+  NVPTXCompiler();
+  ~NVPTXCompiler() override {}
 
   // Bring in
   // StatusOr<std::vector<std::unique_ptr<Executable>>> Compile(
@@ -64,7 +64,7 @@ class GpuCompiler : public LLVMCompiler {
   se::Platform::Id PlatformId() const override;
 
   HloCostAnalysis::ShapeSizeFunction ShapeSizeBytesFunction() const override {
-    // Capture just the pointer size, not the entire GpuCompiler object.
+    // Capture just the pointer size, not the entire NVPTXCompiler object.
     int64 pointer_size = pointer_size_;
     return [pointer_size](const Shape& shape) {
       return ShapeUtil::ByteSizeOf(shape, pointer_size);
@@ -146,10 +146,10 @@ class GpuCompiler : public LLVMCompiler {
                      CompilationCacheHash, CompilationCacheEq>
       compilation_cache_ GUARDED_BY(mutex_);
 
-  TF_DISALLOW_COPY_AND_ASSIGN(GpuCompiler);
+  TF_DISALLOW_COPY_AND_ASSIGN(NVPTXCompiler);
 };
 
 }  // namespace gpu
 }  // namespace xla
 
-#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_GPU_COMPILER_H_
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_NVPTX_COMPILER_H_
diff --git a/tensorflow/compiler/xla/service/gpu/outfeed_manager.cc b/tensorflow/compiler/xla/service/gpu/outfeed_manager.cc
new file mode 100644
index 0000000000..4aaf0c9e14
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/outfeed_manager.cc
@@ -0,0 +1,32 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/outfeed_manager.h"
+
+#include "tensorflow/compiler/xla/map_util.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace xla {
+namespace gpu {
+
+OutfeedManager* GetOrCreateOutfeedManager() {
+  static auto* manager = new OutfeedManager;
+  return manager;
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/outfeed_manager.h b/tensorflow/compiler/xla/service/gpu/outfeed_manager.h
new file mode 100644
index 0000000000..a752eb7011
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/outfeed_manager.h
@@ -0,0 +1,69 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_OUTFEED_MANAGER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_OUTFEED_MANAGER_H_
+
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/service/gpu/xfeed_queue.h"
+#include "tensorflow/compiler/xla/shape_tree.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/notification.h"
+
+namespace xla {
+namespace gpu {
+
+// TODO(b/30467474) Once GPU outfeed implementation settles, consider
+// folding back the cpu and gpu outfeed implementations into a generic
+// one if possible.
+
+// Defines a buffer holding the destination for an outfeed in host memory and a
+// notification when that triggers when the transfer is done.
+class OutfeedBuffer {
+ public:
+  OutfeedBuffer(int64 length) : length_(length) {}
+
+  // Waits for the device transfer to be finished.
+  std::unique_ptr<Literal> WaitUntilAvailable() {
+    done_.WaitForNotification();
+    return std::move(destination_);
+  }
+
+  int64 length() const { return length_; }
+  void set_destination(std::unique_ptr<Literal> destination) {
+    destination_ = std::move(destination);
+  }
+  Literal* destination() { return destination_.get(); }
+
+  // Callback to signal that this buffer is consumed.
+  void Done() { done_.Notify(); }
+
+ private:
+  std::unique_ptr<Literal> destination_;
+  const int64 length_;
+  tensorflow::Notification done_;
+};
+
+// Manages a thread-safe queue of buffers. The buffers are supposed to be
+// produced by the transfer manager and consumed by the device.
+using OutfeedManager = XfeedQueue<ShapeTree<std::unique_ptr<OutfeedBuffer>>*>;
+
+// Singleton creator-or-accessor: Returns the GPU outfeed manager.
+OutfeedManager* GetOrCreateOutfeedManager();
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_OUTFEED_MANAGER_H_
diff --git a/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc b/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc
new file mode 100644
index 0000000000..7986e63f43
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/outfeed_thunk.cc
@@ -0,0 +1,111 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/outfeed_thunk.h"
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.h"
+#include "tensorflow/compiler/xla/service/gpu/outfeed_manager.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/platform/stream_executor_no_cuda.h"
+
+namespace xla {
+namespace gpu {
+
+OutfeedThunk::OutfeedThunk(ShapeTree<BufferAllocation::Slice> outfeed_slices,
+                           const HloInstruction* hlo_instruction)
+    : Thunk(Kind::kOutfeed, hlo_instruction),
+      outfeed_slices_(std::move(outfeed_slices)) {}
+
+Status OutfeedThunk::ExecuteOnStream(
+    const BufferAllocations& buffer_allocations, se::Stream* stream,
+    HloExecutionProfiler* profiler) {
+  VLOG(2) << "Outfeeding from GPU: " << hlo_instruction()->ToString();
+
+  auto op_profiler = profiler->MakeScopedInstructionProfiler(hlo_instruction());
+  OutfeedManager* outfeed_manager = GetOrCreateOutfeedManager();
+  ShapeTree<std::unique_ptr<OutfeedBuffer>>* outfeed_buffers =
+      outfeed_manager->BlockingGetNextDestination();
+
+  // Nothing to be done for empty tuples.
+  if (ShapeUtil::IsEmptyTuple(hlo_instruction()->operand(0)->shape())) {
+    return Status::OK();
+  }
+  CHECK(ShapeUtil::Compatible(hlo_instruction()->operand(0)->shape(),
+                              outfeed_buffers->shape()));
+
+  TF_RETURN_IF_ERROR(outfeed_buffers->ForEachMutableElementWithStatus(
+      [&](const ShapeIndex& index, std::unique_ptr<OutfeedBuffer>* buffer) {
+        if (!*buffer) {  // Tuple pointers.
+          return Status::OK();
+        }
+        // Allocate storage for the literal data.
+        const Shape& shape =
+            ShapeUtil::GetSubshape(outfeed_buffers->shape(), index);
+        (*buffer)->set_destination(Literal::CreateFromShape(shape));
+
+        BufferAllocation::Slice slice = outfeed_slices_.element(index);
+        se::DeviceMemoryBase data_address;
+        if (slice.allocation()) {
+          // If we have a static allocation, read it from there. This avoids
+          // synchronizing the host and device just to read a pointer.
+          data_address = buffer_allocations.GetDeviceAddress(slice);
+        } else {
+          // Otherwise we have to read the tuple pointer first.
+          CHECK(!index.empty());
+          // Copy the parent buffer to the host.
+          BufferAllocation::Slice tuple_slice =
+              outfeed_slices_.element(ShapeIndexView(index).ConsumeFront());
+          if (!tuple_slice.allocation()) {
+            return Unimplemented(
+                "Nested dynamic tuples are not supported on GPU");
+          }
+          se::DeviceMemoryBase tuple_address =
+              buffer_allocations.GetDeviceAddress(tuple_slice);
+          CHECK(tuple_slice.size() % sizeof(void*) == 0)
+              << "Tuple size must be a multiple of pointer size";
+          std::vector<void*> tuple_element_buffer_addresses(tuple_slice.size() /
+                                                            sizeof(void*));
+          stream->ThenMemcpy(tuple_element_buffer_addresses.data(),
+                             tuple_address, tuple_slice.size());
+          TF_RETURN_IF_ERROR(stream->BlockHostUntilDone());
+          // The data address is specified by the element of the tuple pointer
+          // buffer.
+          data_address =
+              se::DeviceMemoryBase(tuple_element_buffer_addresses[index.back()],
+                                   (*buffer)->length());
+        }
+
+        // TODO(b/111309141): Run this on a separate stream so it doesn't block
+        // the GPU from doing work during the transfer. This could be handled by
+        // making StreamAssignment do something intelligent with outfeed thunks.
+        stream
+            ->ThenMemcpy((*buffer)->destination()->untyped_data(), data_address,
+                         (*buffer)->length())
+            .ThenDoHostCallback([buffer]() { (*buffer)->Done(); });
+        return Status::OK();
+      }));
+
+  Status block_status = stream->BlockHostUntilDone();
+  if (!block_status.ok()) {
+    return InternalError("Failed to complete data transfer on stream %p: %s",
+                         stream, block_status.error_message().c_str());
+  }
+
+  VLOG(2) << "Outfeeding from GPU complete";
+  return Status::OK();
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/outfeed_thunk.h b/tensorflow/compiler/xla/service/gpu/outfeed_thunk.h
new file mode 100644
index 0000000000..8ed89f05f0
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/outfeed_thunk.h
@@ -0,0 +1,52 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_OUTFEED_THUNK_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_OUTFEED_THUNK_H_
+
+#include "tensorflow/compiler/xla/service/gpu/buffer_allocations.h"
+#include "tensorflow/compiler/xla/service/gpu/hlo_execution_profiler.h"
+#include "tensorflow/compiler/xla/service/gpu/thunk.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/core/platform/stream_executor_no_cuda.h"
+
+namespace xla {
+namespace gpu {
+
+// A thunk that outfeeds data. Data must be already resident on the host. This
+// thunk performs a host to device copy from the buffer allocated for the
+// outfeed op to the host location.
+class OutfeedThunk : public Thunk {
+ public:
+  // Constructs a OutfeedThunk that copies data to the host-side
+  // outfeed queue from the buffers in the given shape tree.
+  OutfeedThunk(ShapeTree<BufferAllocation::Slice> outfeed_slices,
+               const HloInstruction* hlo_instruction);
+
+  OutfeedThunk(const OutfeedThunk&) = delete;
+  OutfeedThunk& operator=(const OutfeedThunk&) = delete;
+
+  Status ExecuteOnStream(const BufferAllocations& buffer_allocations,
+                         se::Stream* stream,
+                         HloExecutionProfiler* profiler) override;
+
+ private:
+  const ShapeTree<BufferAllocation::Slice> outfeed_slices_;
+};
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_OUTFEED_THUNK_H_
diff --git a/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.cc b/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.cc
new file mode 100644
index 0000000000..79f7d31816
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.cc
@@ -0,0 +1,233 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.h"
+
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/compiler/xla/window_util.h"
+
+namespace xla {
+namespace gpu {
+
+using tensorflow::gtl::ArraySlice;
+
+// We want the input/output feature counts of an f16 conv to be factors of 8,
+// because without this cudnn can't use tensor cores on the conv.
+static constexpr int64 kDesiredNumFeaturesFactor = 8;
+
+// We won't pad a conv if doing so increases the total number of bytes in the
+// lhs, rhs, or result by more than this amount.
+//
+// TODO(jlebar): This number was tuned experimentally.  It represents a
+// compromise on our current benchmarks; it speeds some up significantly, and
+// doesn't slow any down.  But we can observe by changing this value that
+// there's additional room for speedups.  Achieving those speedups without also
+// slowing other things down will likely require a more sophisticated heuristic,
+// possibly some form of auto-tuning.
+static constexpr double kMaxBytesTouchedIncrease = 1.2;
+
+// Pads the given dimensions in the given shape up to a multiple of
+// kDesiredNumFeaturesFactor.
+static Shape PadShape(Shape s, ArraySlice<int64> dims) {
+  for (int64 dim : dims) {
+    int64 dim_to_pad_size = s.dimensions(dim);
+    int64 new_dim_to_pad_size =
+        RoundUpToNearest(dim_to_pad_size, kDesiredNumFeaturesFactor);
+    s.set_dimensions(dim, new_dim_to_pad_size);
+  }
+  return s;
+}
+
+// Creates and returns an HLO that zero-pads one or more dimensions in the given
+// instruction so that its shape is equal to the given shape.
+//
+// Padding is added to the end of each relevant dimension.
+//
+// If the instruction already has the given shape, simply returns it without an
+// intervening pad.
+static HloInstruction* PadInstruction(HloInstruction* instr,
+                                      const Shape& new_shape) {
+  HloComputation* comp = instr->parent();
+
+  const Shape& shape = instr->shape();
+  auto* zero = comp->AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::Zero(shape.element_type()).CloneToUnique()));
+
+  PaddingConfig pad_config = MakeNoPaddingConfig(ShapeUtil::Rank(shape));
+
+  bool added_padding = false;
+  for (int64 dim = 0; dim < ShapeUtil::Rank(shape); ++dim) {
+    if (shape.dimensions(dim) == new_shape.dimensions(dim)) {
+      continue;
+    }
+    CHECK_GT(new_shape.dimensions(dim), shape.dimensions(dim));
+    pad_config.mutable_dimensions(dim)->set_edge_padding_high(
+        new_shape.dimensions(dim) - shape.dimensions(dim));
+    added_padding = true;
+  }
+
+  if (!added_padding) {
+    return instr;
+  }
+  return comp->AddInstruction(
+      HloInstruction::CreatePad(new_shape, instr, zero, pad_config));
+}
+
+// Pads the input/output feature dimensions of the given cudnn convolution
+// custom-call to be multiples of kDesiredNumFeaturesFactor.
+static StatusOr<bool> PadFeaturesDims(HloInstruction* conv) {
+  CHECK_EQ(0, conv->shape().tuple_shapes(1).dimensions(0))
+      << "conv must use 0 scratch bytes, i.e. this pass must be run "
+         "before CudnnConvolutionAlgorithmPicker.";
+
+  const auto& target = conv->custom_call_target();
+  const auto& dnums = conv->convolution_dimension_numbers();
+  auto* lhs = conv->mutable_operand(0);
+  auto* rhs = conv->mutable_operand(1);
+  const Shape& result_shape = conv->shape().tuple_shapes(0);
+
+  Shape new_lhs_shape = [&] {
+    if (target == kCudnnConvForwardCallTarget ||
+        target == kCudnnConvBackwardFilterCallTarget) {
+      // LHS is "input".
+      return PadShape(lhs->shape(), {dnums.input_feature_dimension()});
+    }
+    CHECK_EQ(target, kCudnnConvBackwardInputCallTarget);
+    // LHS is "output".
+    return PadShape(lhs->shape(), {dnums.output_feature_dimension()});
+  }();
+
+  Shape new_rhs_shape = [&] {
+    if (target == kCudnnConvForwardCallTarget ||
+        target == kCudnnConvBackwardInputCallTarget) {
+      // RHS is "filter".
+      return PadShape(rhs->shape(), {dnums.kernel_input_feature_dimension(),
+                                     dnums.kernel_output_feature_dimension()});
+    }
+    CHECK_EQ(target, kCudnnConvBackwardFilterCallTarget);
+    // RHS is "output".
+    return PadShape(rhs->shape(), {dnums.output_feature_dimension()});
+  }();
+
+  if (ShapeUtil::Equal(lhs->shape(), new_lhs_shape) &&
+      ShapeUtil::Equal(rhs->shape(), new_rhs_shape)) {
+    VLOG(3) << "No need to pad features of " << conv->ToString();
+    return false;
+  }
+
+  Shape new_result_shape = [&] {
+    if (target == kCudnnConvForwardCallTarget) {
+      // Result is "output".
+      return PadShape(result_shape, {dnums.output_feature_dimension()});
+    }
+    if (target == kCudnnConvBackwardInputCallTarget) {
+      // Result is "input".
+      return PadShape(result_shape, {dnums.input_feature_dimension()});
+    }
+    CHECK_EQ(target, kCudnnConvBackwardFilterCallTarget);
+    // Result is "filter".
+    return PadShape(result_shape, {dnums.kernel_input_feature_dimension(),
+                                   dnums.kernel_output_feature_dimension()});
+  }();
+
+  // Check that padding wouldn't increase the total bytes read/written by this
+  // operation too much.
+  auto check_size_increase = [&](const Shape& old_shape,
+                                 const Shape& new_shape) {
+    int64 old_bytes = ShapeUtil::ByteSizeOf(old_shape);
+    int64 new_bytes = ShapeUtil::ByteSizeOf(new_shape);
+    if (new_bytes <= old_bytes * kMaxBytesTouchedIncrease) {
+      return true;
+    }
+    VLOG(3) << "Not padding convolution; doing so would change input / result "
+               "shape from "
+            << ShapeUtil::HumanString(old_shape) << " to "
+            << ShapeUtil::HumanString(new_shape) << ", a size increase of "
+            << new_bytes / static_cast<double>(old_bytes) << "x > "
+            << kMaxBytesTouchedIncrease << "x: " << conv->ToString();
+    return false;
+  };
+  if (!check_size_increase(lhs->shape(), new_lhs_shape) ||
+      !check_size_increase(rhs->shape(), new_rhs_shape) ||
+      !check_size_increase(result_shape, new_result_shape)) {
+    return false;
+  }
+
+  // OK, let's do the transformation!
+
+  auto* new_lhs = PadInstruction(lhs, new_lhs_shape);
+  auto* new_rhs = PadInstruction(rhs, new_rhs_shape);
+  CHECK(new_lhs != lhs || new_rhs != rhs)
+      << "We should have had to pad either LHS or RHS.";
+
+  auto add = [&](std::unique_ptr<HloInstruction> new_instr) {
+    return conv->parent()->AddInstruction(std::move(new_instr));
+  };
+
+  Shape new_conv_shape = ShapeUtil::MakeTupleShape(
+      {new_result_shape, ShapeUtil::MakeShape(U8, {0})});
+  auto* new_conv =
+      add(conv->CloneWithNewOperands(new_conv_shape, {new_lhs, new_rhs}));
+
+  // Slice the new conv result if necessary, keeping in mind that new_conv has
+  // tuple shape (new_result_shape, u8[0]).
+  if (!ShapeUtil::Equal(result_shape, new_result_shape)) {
+    std::vector<int64> start_indices(result_shape.dimensions_size(), 0);
+    std::vector<int64> end_indices(result_shape.dimensions().begin(),
+                                   result_shape.dimensions().end());
+    std::vector<int64> strides(result_shape.dimensions_size(), 1);
+
+    auto* new_conv_result = add(
+        HloInstruction::CreateGetTupleElement(new_result_shape, new_conv, 0));
+    auto* empty_temp_buffer =
+        add(HloInstruction::CreateConstant(LiteralUtil::CreateR1<uint8>({})));
+    auto* sliced_result = add(HloInstruction::CreateSlice(
+        result_shape, new_conv_result, start_indices, end_indices, strides));
+    new_conv =
+        add(HloInstruction::CreateTuple({sliced_result, empty_temp_buffer}));
+  }
+
+  VLOG(2) << "Padded features of " << conv->ToString() << ", replaced with "
+          << new_conv->ToString();
+  TF_RETURN_IF_ERROR(conv->parent()->ReplaceInstruction(conv, new_conv));
+  return true;
+}
+
+static std::vector<HloInstruction*> GetRelevantConvs(HloComputation* comp) {
+  std::vector<HloInstruction*> convs;
+  for (HloInstruction* instr : comp->instructions()) {
+    if (IsCustomCallToDnnConvolution(*instr) &&
+        instr->operand(0)->shape().element_type() == F16) {
+      convs.push_back(instr);
+    }
+  }
+  return convs;
+}
+
+StatusOr<bool> PadForTensorCores::Run(HloModule* module) {
+  bool changed = false;
+  for (HloComputation* comp : module->MakeNonfusionComputations()) {
+    for (HloInstruction* conv : GetRelevantConvs(comp)) {
+      TF_ASSIGN_OR_RETURN(bool result, PadFeaturesDims(conv));
+      changed |= result;
+    }
+  }
+  return changed;
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.h b/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.h
new file mode 100644
index 0000000000..192359f026
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.h
@@ -0,0 +1,45 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_PAD_FOR_TENSOR_CORES_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_PAD_FOR_TENSOR_CORES_H_
+
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+
+namespace xla {
+namespace gpu {
+
+// Ensures that f16 cudnn convolutions have input/output channel dimensions that
+// are multiples of 8, inserting pads/slices as necessary.
+//
+// This is useful primarily for Volta and newer GPUs, where tensor cores can
+// only be used if the channel dims are multiples of 8.  It's probably the
+// opposite of useful on other GPUs, so you should check what GPU you're
+// targeting before running this pass.
+//
+// TODO(jlebar): Also pad dots.
+class PadForTensorCores : public HloPassInterface {
+ public:
+  tensorflow::StringPiece name() const override {
+    return "pad for tensor cores";
+  }
+
+  StatusOr<bool> Run(HloModule* module) override;
+};
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_PAD_FOR_TENSOR_CORES_H_
diff --git a/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores_test.cc b/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores_test.cc
new file mode 100644
index 0000000000..99e7580b82
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores_test.cc
@@ -0,0 +1,164 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/pad_for_tensor_cores.h"
+
+#include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
+#include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
+#include "tensorflow/compiler/xla/util.h"
+
+namespace xla {
+namespace gpu {
+namespace {
+
+namespace op = xla::testing::opcode_matchers;
+using ::testing::_;
+
+using PadForTensorCoresTest = HloVerifiedTestBase;
+
+TEST_F(PadForTensorCoresTest, PadF16ForwardConvInputChannels) {
+  ParseAndVerifyModule(R"(
+  HloModule TestModule
+
+  ENTRY TestComputation {
+    input = f16[10,20,30,41] parameter(0)
+    filter = f16[2,2,41,40] parameter(1)
+    ROOT result = (f16[10,20,30,40], u8[0]) custom-call(input, filter),
+                  window={size=2x2}, dim_labels=b01f_01io->b01f,
+                  custom_call_target="__cudnn$convForward"
+  })");
+  EXPECT_TRUE(PadForTensorCores().Run(&module()).ValueOrDie());
+  auto* root = module().entry_computation()->root_instruction();
+
+  SCOPED_TRACE(module().ToString());
+  EXPECT_THAT(root, op::CustomCall(kCudnnConvForwardCallTarget,
+                                   op::Pad(op::Parameter(0), _),
+                                   op::Pad(op::Parameter(1), _)));
+  EXPECT_TRUE(ShapeUtil::Equal(root->operand(0)->shape(),
+                               ShapeUtil::MakeShape(F16, {10, 20, 30, 48})));
+  EXPECT_TRUE(ShapeUtil::Equal(root->operand(1)->shape(),
+                               ShapeUtil::MakeShape(F16, {2, 2, 48, 40})));
+}
+
+TEST_F(PadForTensorCoresTest, PadF16BackwardInputConvOutputChannels) {
+  ParseAndVerifyModule(R"(
+  HloModule TestModule
+
+  ENTRY TestComputation {
+    output = f16[10,20,30,41] parameter(0)
+    filter = f16[2,2,40,41] parameter(1)
+    ROOT result = (f16[10,20,30,40], u8[0]) custom-call(output, filter),
+                  window={size=2x2}, dim_labels=b01f_01io->b01f,
+                  custom_call_target="__cudnn$convBackwardInput"
+  })");
+  EXPECT_TRUE(PadForTensorCores().Run(&module()).ValueOrDie());
+  auto* root = module().entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::CustomCall(kCudnnConvBackwardInputCallTarget,
+                                   op::Pad(op::Parameter(0), _),
+                                   op::Pad(op::Parameter(1), _)));
+  EXPECT_TRUE(ShapeUtil::Equal(root->operand(0)->shape(),
+                               ShapeUtil::MakeShape(F16, {10, 20, 30, 48})));
+  EXPECT_TRUE(ShapeUtil::Equal(root->operand(1)->shape(),
+                               ShapeUtil::MakeShape(F16, {2, 2, 40, 48})));
+}
+
+TEST_F(PadForTensorCoresTest, PadF16ForwardConvOutputChannels) {
+  ParseAndVerifyModule(R"(
+  HloModule TestModule
+
+  ENTRY TestComputation {
+    input = f16[10,20,30,40] parameter(0)
+    filter = f16[2,2,40,41] parameter(1)
+    ROOT result = (f16[10,20,30,41], u8[0]) custom-call(input, filter),
+                  window={size=2x2}, dim_labels=b01f_01io->b01f,
+                  custom_call_target="__cudnn$convForward"
+  })");
+  EXPECT_TRUE(PadForTensorCores().Run(&module()).ValueOrDie());
+  auto* root = module().entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::Tuple(op::Slice(op::GetTupleElement(op::CustomCall(
+                                  kCudnnConvForwardCallTarget, op::Parameter(0),
+                                  op::Pad(op::Parameter(1), _)))),
+                              _));
+}
+
+TEST_F(PadForTensorCoresTest, PadF16BackwardInputConvInputChannels) {
+  ParseAndVerifyModule(R"(
+  HloModule TestModule
+
+  ENTRY TestComputation {
+    output = f16[10,20,30,40] parameter(0)
+    filter = f16[2,2,41,40] parameter(1)
+    result = (f16[10,20,30,41], u8[0]) custom-call(output, filter),
+              window={size=2x2}, dim_labels=b01f_01io->b01f,
+              custom_call_target="__cudnn$convBackwardInput"
+    ROOT gte = f16[10,20,30,41] get-tuple-element(result), index=0
+  })");
+  EXPECT_TRUE(PadForTensorCores().Run(&module()).ValueOrDie());
+  auto* root = module().entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::GetTupleElement(op::Tuple(
+                        op::Slice(op::GetTupleElement(op::CustomCall(
+                            kCudnnConvBackwardInputCallTarget, op::Parameter(0),
+                            op::Pad(op::Parameter(1), _)))),
+                        _)));
+}
+
+TEST_F(PadForTensorCoresTest, PadF16BackwardFilterConvInputChannels) {
+  ParseAndVerifyModule(R"(
+  HloModule TestModule
+
+  ENTRY TestComputation {
+    input = f16[10,20,30,41] parameter(0)
+    output = f16[10,20,30,40] parameter(1)
+    result = (f16[2,2,41,40], u8[0]) custom-call(input, output),
+              window={size=2x2}, dim_labels=b01f_01io->b01f,
+              custom_call_target="__cudnn$convBackwardFilter"
+    ROOT gte = f16[2,2,41,40] get-tuple-element(result), index=0
+  })");
+  EXPECT_TRUE(PadForTensorCores().Run(&module()).ValueOrDie());
+  auto* root = module().entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::GetTupleElement(op::Tuple(
+                        op::Slice(op::GetTupleElement(op::CustomCall(
+                            kCudnnConvBackwardFilterCallTarget,
+                            op::Pad(op::Parameter(0), _), op::Parameter(1)))),
+                        _)));
+}
+
+TEST_F(PadForTensorCoresTest, PadF16BackwardFilterConvOutputChannels) {
+  ParseAndVerifyModule(R"(
+  HloModule TestModule
+
+  ENTRY TestComputation {
+    input = f16[10,20,30,40] parameter(0)
+    output = f16[10,20,30,41] parameter(1)
+    result = (f16[2,2,40,41], u8[0]) custom-call(input, output),
+              window={size=2x2}, dim_labels=b01f_01io->b01f,
+              custom_call_target="__cudnn$convBackwardFilter"
+    ROOT gte = f16[2,2,40,41] get-tuple-element(result), index=0
+  })");
+  EXPECT_TRUE(PadForTensorCores().Run(&module()).ValueOrDie());
+  auto* root = module().entry_computation()->root_instruction();
+  EXPECT_THAT(root, op::GetTupleElement(op::Tuple(
+                        op::Slice(op::GetTupleElement(op::CustomCall(
+                            kCudnnConvBackwardFilterCallTarget,
+                            op::Parameter(0), op::Pad(op::Parameter(1), _)))),
+                        _)));
+}
+
+}  // anonymous namespace
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/pad_insertion.cc b/tensorflow/compiler/xla/service/gpu/pad_insertion.cc
index c8f0d4185c..b22040eee1 100644
--- a/tensorflow/compiler/xla/service/gpu/pad_insertion.cc
+++ b/tensorflow/compiler/xla/service/gpu/pad_insertion.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/gpu/pad_insertion.h"
 
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
@@ -68,7 +69,7 @@ HloInstruction* MaybePaddedAndSlicedInput(
     PrimitiveType element_type = input->shape().element_type();
     HloInstruction* padding =
         computation->AddInstruction(HloInstruction::CreateConstant(
-            MakeUnique<Literal>(Literal::Zero(element_type))));
+            MakeUnique<Literal>(LiteralUtil::Zero(element_type))));
     input = MakePadHlo(input, padding, padding_config).ValueOrDie();
   }
 
@@ -125,7 +126,7 @@ HloInstruction* MaybePaddedKernel(const Window& conv_window,
   PrimitiveType element_type = kernel->shape().element_type();
   HloInstruction* padding =
       computation->AddInstruction(HloInstruction::CreateConstant(
-          MakeUnique<Literal>(Literal::Zero(element_type))));
+          MakeUnique<Literal>(LiteralUtil::Zero(element_type))));
   return MakePadHlo(kernel, padding, padding_config).ValueOrDie();
 }
 }  // namespace
@@ -234,9 +235,9 @@ bool PadInsertion::CanonicalizeBackwardFilterConvolution(
   // Create a new backward convolution replacing the old one.
   HloComputation* computation = backward_conv->parent();
   HloInstruction* output = backward_conv->mutable_operand(1);
-  HloInstruction* padding =
-      computation->AddInstruction(HloInstruction::CreateConstant(
-          MakeUnique<Literal>(Literal::Zero(input->shape().element_type()))));
+  HloInstruction* padding = computation->AddInstruction(
+      HloInstruction::CreateConstant(MakeUnique<Literal>(
+          LiteralUtil::Zero(input->shape().element_type()))));
   HloInstruction* padded_input =
       MakePadHlo(input, padding, input_padding_config).ValueOrDie();
 
diff --git a/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.cc b/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.cc
index cd833ec7bd..3838fee674 100644
--- a/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.cc
+++ b/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.cc
@@ -32,27 +32,27 @@ namespace gpu {
 
 ParallelLoopEmitter::ParallelLoopEmitter(
     BodyEmitter body_emitter, const Shape& shape,
-    const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* ir_builder,
+    const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* b,
     int unroll_factor)
-    : LoopEmitter(body_emitter, shape, ir_builder),
+    : LoopEmitter(body_emitter, shape, b),
       launch_dimensions_(launch_dimensions),
       unroll_factor_(unroll_factor) {}
 
 ParallelLoopEmitter::ParallelLoopEmitter(
     const llvm_ir::ElementGenerator& target_element_generator,
     tensorflow::gtl::ArraySlice<llvm_ir::IrArray> target_arrays,
-    const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* ir_builder,
+    const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* b,
     int unroll_factor)
-    : LoopEmitter(target_element_generator, target_arrays, ir_builder),
+    : LoopEmitter(target_element_generator, target_arrays, b),
       launch_dimensions_(launch_dimensions),
       unroll_factor_(unroll_factor) {}
 
 ParallelLoopEmitter::ParallelLoopEmitter(
     const llvm_ir::ElementGenerator& target_element_generator,
     const llvm_ir::IrArray& target_array,
-    const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* ir_builder,
+    const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* b,
     int unroll_factor)
-    : LoopEmitter(target_element_generator, target_array, ir_builder),
+    : LoopEmitter(target_element_generator, target_array, b),
       launch_dimensions_(launch_dimensions),
       unroll_factor_(unroll_factor) {}
 
@@ -74,29 +74,27 @@ ParallelLoopEmitter::EmitIndexAndSetExitBasicBlock(
   CHECK_NE(index_type, nullptr);
   std::vector<llvm_ir::IrArray::Index> array_indices;
   llvm::Value* block_id = llvm_ir::EmitCallToIntrinsic(
-      llvm::Intrinsic::nvvm_read_ptx_sreg_ctaid_x, {}, {}, ir_builder_);
+      llvm::Intrinsic::nvvm_read_ptx_sreg_ctaid_x, {}, {}, b_);
   llvm_ir::AddRangeMetadata(0, launch_dimensions_.block_count(),
                             static_cast<llvm::Instruction*>(block_id));
-  block_id = ir_builder_->CreateZExtOrTrunc(block_id, index_type, "block_id");
+  block_id = b_->CreateZExtOrTrunc(block_id, index_type, "block_id");
 
   // Per the PTX documentation:
   //   "It is guaranteed that [...] 0  <=  %tid.x <  %ntid.x"
   //
   // %ntid.x is currently specified as 1024.
   llvm::Value* thread_id = llvm_ir::EmitCallToIntrinsic(
-      llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x, {}, {}, ir_builder_);
+      llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x, {}, {}, b_);
   llvm_ir::AddRangeMetadata(0, launch_dimensions_.threads_per_block(),
                             static_cast<llvm::Instruction*>(thread_id));
-  thread_id =
-      ir_builder_->CreateZExtOrTrunc(thread_id, index_type, "thread_id");
-
-  llvm::Value* linear_index_base = ir_builder_->CreateAdd(
-      ir_builder_->CreateMul(
-          block_id,
-          llvm::ConstantInt::get(index_type,
-                                 launch_dimensions_.threads_per_block()),
-          "",
-          /*HasNUW=*/true, /*HasNSW=*/true),
+  thread_id = b_->CreateZExtOrTrunc(thread_id, index_type, "thread_id");
+
+  llvm::Value* linear_index_base = b_->CreateAdd(
+      b_->CreateMul(block_id,
+                    llvm::ConstantInt::get(
+                        index_type, launch_dimensions_.threads_per_block()),
+                    "",
+                    /*HasNUW=*/true, /*HasNSW=*/true),
       thread_id, "linear_index", /*HasNUW=*/true, /*HasNSW=*/true);
 
   // Add an @llvm.assume(linear_index < threads_per_block * num_blocks).
@@ -109,41 +107,41 @@ ParallelLoopEmitter::EmitIndexAndSetExitBasicBlock(
   // conditions in the same basic block as their operands.
   llvm_ir::EmitCallToIntrinsic(
       llvm::Intrinsic::assume,
-      {ir_builder_->CreateICmpULT(
+      {b_->CreateICmpULT(
           linear_index_base,
           llvm::ConstantInt::get(index_type,
                                  launch_dimensions_.threads_per_block() *
                                      launch_dimensions_.block_count()),
           "linear_index_in_range")},
-      {}, ir_builder_);
+      {}, b_);
 
   if (unroll_factor_ > 1) {
-    linear_index_base = ir_builder_->CreateMul(
+    linear_index_base = b_->CreateMul(
         linear_index_base, llvm::ConstantInt::get(index_type, unroll_factor_),
         "linear_index_base", /*HasNUW=*/true, /*HasNSW=*/true);
   }
 
-  array_indices.emplace_back(linear_index_base, shape_, ir_builder_);
+  array_indices.emplace_back(linear_index_base, shape_, b_);
   for (int i = 1; i < unroll_factor_; ++i) {
-    llvm::Value* linear_index = ir_builder_->CreateAdd(
-        linear_index_base, llvm::ConstantInt::get(index_type, i),
-        "linear_index",
-        /*HasNUW=*/true, /*HasNSW=*/true);
-    array_indices.emplace_back(linear_index, shape_, ir_builder_);
+    llvm::Value* linear_index =
+        b_->CreateAdd(linear_index_base, llvm::ConstantInt::get(index_type, i),
+                      "linear_index",
+                      /*HasNUW=*/true, /*HasNSW=*/true);
+    array_indices.emplace_back(linear_index, shape_, b_);
   }
 
   auto if_in_bounds = llvm_ir::EmitIfThenElse(
-      ir_builder_->CreateICmpULT(
+      b_->CreateICmpULT(
           linear_index_base,
           llvm::ConstantInt::get(index_type, ShapeUtil::ElementsIn(shape_))),
-      llvm_ir::IrName(loop_name, "in_bounds"), ir_builder_, false);
+      llvm_ir::IrName(loop_name, "in_bounds"), b_, false);
 
   // Set exit_bb_ to the exit block of the if structure.
   exit_bb_ = if_in_bounds.after_block;
   CHECK_NE(nullptr, exit_bb_);
 
   // Set IR builder insertion point to the body of the if structure.
-  llvm_ir::SetToFirstInsertPoint(if_in_bounds.true_block, ir_builder_);
+  llvm_ir::SetToFirstInsertPoint(if_in_bounds.true_block, b_);
 
   return array_indices;
 }
diff --git a/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h b/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h
index 302e1bf1bc..b82a23419d 100644
--- a/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h
+++ b/tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h
@@ -34,13 +34,13 @@ class ParallelLoopEmitter : public llvm_ir::LoopEmitter {
   // The meanings of other parameters are the same as LoopEmitter.
   ParallelLoopEmitter(BodyEmitter body_emitter, const Shape& shape,
                       const LaunchDimensions& launch_dimensions,
-                      llvm::IRBuilder<>* ir_builder, int unroll_factor = 1);
+                      llvm::IRBuilder<>* b, int unroll_factor = 1);
   // Constructs a ParallelLoopEmitter from an element generator that generates
   // each element of the given target array.
   ParallelLoopEmitter(const llvm_ir::ElementGenerator& target_element_generator,
                       const llvm_ir::IrArray& target_array,
                       const LaunchDimensions& launch_dimensions,
-                      llvm::IRBuilder<>* ir_builder, int unroll_factor = 1);
+                      llvm::IRBuilder<>* b, int unroll_factor = 1);
 
   // Constructs a loop emitter for a loop that generates on element of each of N
   // arrays on each iteration.
@@ -50,7 +50,7 @@ class ParallelLoopEmitter : public llvm_ir::LoopEmitter {
   ParallelLoopEmitter(
       const llvm_ir::ElementGenerator& target_element_generator,
       tensorflow::gtl::ArraySlice<llvm_ir::IrArray> target_arrays,
-      const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* ir_builder,
+      const LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* b,
       int unroll_factor = 1);
 
   ParallelLoopEmitter(const ParallelLoopEmitter&) = delete;
diff --git a/tensorflow/compiler/xla/service/gpu/sequential_thunk.cc b/tensorflow/compiler/xla/service/gpu/sequential_thunk.cc
index dfdba7d7d9..84285be70a 100644
--- a/tensorflow/compiler/xla/service/gpu/sequential_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/sequential_thunk.cc
@@ -36,12 +36,7 @@ Status SequentialThunk::Initialize(const GpuExecutable& executable,
 Status SequentialThunk::ExecuteOnStream(
     const BufferAllocations& buffer_allocations, se::Stream* stream,
     HloExecutionProfiler* profiler) {
-  // TODO(b/71544591): We need to potentially measure the total time of the
-  // sequential thunk. This happens for a reduce op which consists of
-  // SequentialThunk with a thunk that initializes the output, and another thunk
-  // that does the actual reduce. Right now, in this case we would only measure
-  // the time of the last thunk, because both thunks would have the same
-  // HloInstruction.
+  auto op_profiler = profiler->MakeScopedInstructionProfiler(hlo_instruction());
   for (const auto& thunk : thunks_) {
     TF_RETURN_IF_ERROR(
         thunk->ExecuteOnStream(buffer_allocations, stream, profiler));
diff --git a/tensorflow/compiler/xla/service/gpu/stream_assignment.cc b/tensorflow/compiler/xla/service/gpu/stream_assignment.cc
index e4cfc6999f..0806dd5161 100644
--- a/tensorflow/compiler/xla/service/gpu/stream_assignment.cc
+++ b/tensorflow/compiler/xla/service/gpu/stream_assignment.cc
@@ -33,13 +33,13 @@ int StreamAssignment::StreamNumberForHlo(const HloInstruction& hlo) const {
 }
 
 void StreamAssignment::AssignStreamToHlo(const HloInstruction* hlo,
-                                         int stream_no) {
-  CHECK_GE(stream_no, 0);
-  if (stream_no >= stream_count_) {
-    stream_count_ = stream_no + 1;
+                                         int stream_num) {
+  CHECK_GE(stream_num, 0);
+  if (stream_num >= stream_count_) {
+    stream_count_ = stream_num + 1;
   }
-  InsertOrDie(&hlo_to_stream_number_, hlo, stream_no);
-  VLOG(2) << "Assign stream #" << stream_no << " to " << hlo->ToString();
+  InsertOrDie(&hlo_to_stream_number_, hlo, stream_num);
+  VLOG(2) << "Assign stream #" << stream_num << " to " << hlo->ToString();
 }
 
 namespace {
@@ -51,6 +51,12 @@ bool CanRunConcurrently(const HloInstruction& a, const HloInstruction& b,
   return !reachability.IsConnected(&a, &b);
 }
 
+constexpr int kInvalidStreamNum = -1;
+//  Returns true iff `stream_num` is an invalid stream number.
+inline bool IsStreamNumValid(int stream_num) {
+  return stream_num != kInvalidStreamNum;
+}
+
 // Returns which existing stream to assign to `hlo`, or -1 if a stream is not
 // needed. `stream_assignment` is the existing stream assignment for all
 // instructions topologically before `hlo`. `seen_gemms` contains all GEMMs that
@@ -62,7 +68,7 @@ int ComputeStreamToAssign(
   if (hlo.opcode() == HloOpcode::kParameter ||
       hlo.opcode() == HloOpcode::kConstant) {
     // kParameter and kConstant do not need a thunk.
-    return -1;
+    return kInvalidStreamNum;
   }
 
   if (hlo.GetModule()
@@ -75,17 +81,17 @@ int ComputeStreamToAssign(
   if (!ImplementedAsGemm(hlo)) {
     // If `hlo` is not implemented as a GEMM, keep it close to its operands to
     // avoid excessive synchronization.
-    int stream_no = -1;
+    int stream_num = -1;
     for (const auto* operand : hlo.operands()) {
       if (stream_assignment.HasStreamAssigned(*operand)) {
-        stream_no =
-            std::max(stream_no, stream_assignment.StreamNumberForHlo(*operand));
+        stream_num = std::max(stream_num,
+                              stream_assignment.StreamNumberForHlo(*operand));
       }
     }
-    if (stream_no == -1) {
-      stream_no = 0;
+    if (!IsStreamNumValid(stream_num)) {
+      stream_num = 0;
     }
-    return stream_no;
+    return stream_num;
   }
 
   // Assign different streams to concurrent GEMMs. The code below uses a
@@ -94,17 +100,17 @@ int ComputeStreamToAssign(
   // `hlo` a different stream.
   std::set<int> forbidden_stream_numbers;
   for (const auto* seen_gemm : seen_gemms) {
-    int stream_no = stream_assignment.StreamNumberForHlo(*seen_gemm);
-    if (!forbidden_stream_numbers.count(stream_no) &&
+    int stream_num = stream_assignment.StreamNumberForHlo(*seen_gemm);
+    if (!forbidden_stream_numbers.count(stream_num) &&
         CanRunConcurrently(*seen_gemm, hlo, reachability)) {
-      forbidden_stream_numbers.insert(stream_no);
+      forbidden_stream_numbers.insert(stream_num);
     }
   }
 
-  for (int stream_no = 0; stream_no < stream_assignment.StreamCount();
-       ++stream_no) {
-    if (!forbidden_stream_numbers.count(stream_no)) {
-      return stream_no;
+  for (int stream_num = 0; stream_num < stream_assignment.StreamCount();
+       ++stream_num) {
+    if (!forbidden_stream_numbers.count(stream_num)) {
+      return stream_num;
     }
   }
   return stream_assignment.StreamCount();
@@ -118,11 +124,27 @@ std::unique_ptr<StreamAssignment> AssignStreams(const HloModule& module) {
   std::unique_ptr<HloReachabilityMap> reachability =
       computation.ComputeReachability();
   std::vector<const HloInstruction*> seen_gemms;
+  // The execution of different RNG Hlo instructions in the same module updates
+  // a common global variable. To avoid a race condition, we simply assign all
+  // RNG kernels to the same stream to make them run sequentially.
+  //
+  // TODO(b/111791052): If we remove such a common variable, we will need to
+  // clean up the code here.
+  int stream_num_for_rng = kInvalidStreamNum;
   for (const auto* hlo : computation.MakeInstructionPostOrder()) {
-    int stream_no = ComputeStreamToAssign(*hlo, *stream_assignment,
-                                          *reachability, seen_gemms);
-    if (stream_no != -1) {
-      stream_assignment->AssignStreamToHlo(hlo, stream_no);
+    // If we ever enable fusion of RNG instructions, we will need to extend this
+    // code to look inside a fused instruction.
+    int stream_num = (hlo->opcode() == HloOpcode::kRng &&
+                      IsStreamNumValid(stream_num_for_rng))
+                         ? stream_num_for_rng
+                         : ComputeStreamToAssign(*hlo, *stream_assignment,
+                                                 *reachability, seen_gemms);
+    if (IsStreamNumValid(stream_num)) {
+      stream_assignment->AssignStreamToHlo(hlo, stream_num);
+      if (hlo->opcode() == HloOpcode::kRng &&
+          !IsStreamNumValid(stream_num_for_rng)) {
+        stream_num_for_rng = stream_num;
+      }
     }
     if (ImplementedAsGemm(*hlo)) {
       seen_gemms.push_back(hlo);
diff --git a/tensorflow/compiler/xla/service/gpu/stream_executor_util.cc b/tensorflow/compiler/xla/service/gpu/stream_executor_util.cc
index a50ddf6ac6..05b305ea4c 100644
--- a/tensorflow/compiler/xla/service/gpu/stream_executor_util.cc
+++ b/tensorflow/compiler/xla/service/gpu/stream_executor_util.cc
@@ -20,10 +20,17 @@ limitations under the License.
 namespace xla {
 namespace gpu {
 
-using stream_executor::dnn::DataLayout;
-using stream_executor::dnn::DataLayoutString;
-using stream_executor::dnn::FilterLayout;
-using stream_executor::dnn::FilterLayoutString;
+using se::dnn::DataLayout;
+using se::dnn::DataLayoutString;
+using se::dnn::FilterLayout;
+using se::dnn::FilterLayoutString;
+
+bool IsVoltaOrLater(const se::StreamExecutor& stream_executor) {
+  int major, minor;
+  CHECK(stream_executor.GetDeviceDescription().cuda_compute_capability(&major,
+                                                                       &minor));
+  return major >= 7;
+}
 
 StatusOr<std::tuple<Layout, Layout, Layout>>
 StreamExecutorConvLayoutsToXlaLayouts(const ConvolutionDimensionNumbers& dnums,
diff --git a/tensorflow/compiler/xla/service/gpu/stream_executor_util.h b/tensorflow/compiler/xla/service/gpu/stream_executor_util.h
index 39a6a38d00..1fc46bafa1 100644
--- a/tensorflow/compiler/xla/service/gpu/stream_executor_util.h
+++ b/tensorflow/compiler/xla/service/gpu/stream_executor_util.h
@@ -17,6 +17,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_STREAM_EXECUTOR_UTIL_H_
 
 #include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 
@@ -25,18 +26,20 @@ limitations under the License.
 namespace xla {
 namespace gpu {
 
+// Returns true if the given StreamExecutor is for a Volta or newer nvidia GPU.
+bool IsVoltaOrLater(const se::StreamExecutor& stream_exec);
+
 // Returns (input, filter, output) XLA Layout protos given the StreamExecutor
 // layouts.
 StatusOr<std::tuple<Layout, Layout, Layout>>
 StreamExecutorConvLayoutsToXlaLayouts(const ConvolutionDimensionNumbers& dnums,
-                                      stream_executor::dnn::DataLayout input,
-                                      stream_executor::dnn::FilterLayout filter,
-                                      stream_executor::dnn::DataLayout output);
+                                      se::dnn::DataLayout input,
+                                      se::dnn::FilterLayout filter,
+                                      se::dnn::DataLayout output);
 
 // Returns (input, filter, output) StreamExecutor layouts given the XLA layouts.
-StatusOr<std::tuple<stream_executor::dnn::DataLayout,
-                    stream_executor::dnn::FilterLayout,
-                    stream_executor::dnn::DataLayout>>
+StatusOr<
+    std::tuple<se::dnn::DataLayout, se::dnn::FilterLayout, se::dnn::DataLayout>>
 XlaConvLayoutsToStreamExecutorLayouts(const ConvolutionDimensionNumbers& dnums,
                                       const Layout& input, const Layout& filter,
                                       const Layout& output);
diff --git a/tensorflow/compiler/xla/service/gpu/tests/BUILD b/tensorflow/compiler/xla/service/gpu/tests/BUILD
new file mode 100644
index 0000000000..4fad3f46cf
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/BUILD
@@ -0,0 +1,223 @@
+# Description: GPU-specific XLA tests. For example, codegen tests that
+# verify the IR emitted.
+#
+# TODO(jlebar): None of these tests actually use the GPU, so they should not
+# need to run on machines with GPUs present.
+
+licenses(["notice"])  # Apache 2.0
+
+package(default_visibility = [":friends"])
+
+package_group(
+    name = "friends",
+    includes = [
+        "//tensorflow/compiler/xla:friends",
+    ],
+)
+
+# Filegroup used to collect source files for dependency checking.
+filegroup(
+    name = "c_srcs",
+    data = glob([
+        "**/*.cc",
+        "**/*.h",
+    ]),
+)
+
+load("//tensorflow:tensorflow.bzl", "tf_cc_test")
+
+cc_library(
+    name = "gpu_codegen_test",
+    testonly = True,
+    srcs = ["gpu_codegen_test.cc"],
+    hdrs = ["gpu_codegen_test.h"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
+        "//tensorflow/compiler/xla/service:gpu_plugin",
+        "//tensorflow/compiler/xla/service/gpu:gpu_executable",
+        "//tensorflow/compiler/xla/tests:filecheck",
+        "//tensorflow/compiler/xla/tests:llvm_irgen_test_base",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_copy_test",
+    srcs = ["gpu_copy_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_ftz_test",
+    srcs = ["gpu_ftz_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_index_test",
+    srcs = ["gpu_index_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla:xla_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_module_config",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_infeed_test",
+    srcs = ["infeed_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:test_helpers",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/client:global_data",
+        "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client/lib:arithmetic",
+        "//tensorflow/compiler/xla/tests:client_library_test_base",
+        "//tensorflow/compiler/xla/tests:literal_test_util",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_kernel_tiling_test",
+    srcs = ["gpu_kernel_tiling_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_module_config",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_ldg_test",
+    srcs = ["gpu_ldg_test.cc"],
+    tags = ["requires-gpu-sm35"],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_noalias_test",
+    srcs = ["gpu_noalias_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla:literal",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_fusion_test",
+    srcs = ["gpu_fusion_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla/service:hlo_module_config",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_unrolling_test",
+    srcs = ["gpu_unrolling_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla/service:hlo_module_config",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "gpu_alignment_test",
+    testonly = True,
+    srcs = ["gpu_alignment_test.cc"],
+    tags = [
+        "requires-gpu-sm35",
+    ],
+    deps = [
+        ":gpu_codegen_test",
+        "//tensorflow/compiler/xla/service:gpu_plugin",
+        "//tensorflow/compiler/xla/service/cpu:custom_call_target_registry",
+        "//tensorflow/compiler/xla/service/llvm_ir:alias_analysis",
+        "//tensorflow/compiler/xla/tests:filecheck",
+        "//tensorflow/compiler/xla/tests:llvm_irgen_test_base",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_alignment_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_alignment_test.cc
new file mode 100644
index 0000000000..672c68e59b
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_alignment_test.cc
@@ -0,0 +1,54 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <memory>
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/cpu/custom_call_target_registry.h"
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/alias_analysis.h"
+#include "tensorflow/compiler/xla/tests/filecheck.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+namespace {
+
+class GpuAlignmentTest : public GpuCodegenTest {};
+
+TEST_F(GpuAlignmentTest, Test) {
+  const char* hlo_string = R"(
+HloModule GpuAlignmentTest
+
+ENTRY main {
+  zero = f32[] constant(0)
+  tok = token[] after-all()
+  a = f32[100] parameter(0)
+  b_tup = (f32[200], token[]) infeed(tok)
+  b = f32[200] get-tuple-element(b_tup), index=0
+  a_padded = f32[150] pad(a, zero), padding=0_50
+  b_sliced = f32[150] slice(b), slice={[0:150]}
+  ROOT c = f32[150] add(a_padded, b_sliced)
+}
+)";
+
+  CompileAndVerifyIr(hlo_string, R"(
+CHECK: @fusion(i8* align 64 dereferenceable(600) %alloc0, i8* align 16 dereferenceable(400) %alloc1, i8* align 64 dereferenceable(864) %temp_buf)
+)");
+}
+
+}  // namespace
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.cc
new file mode 100644
index 0000000000..4b8415fe91
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.cc
@@ -0,0 +1,50 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"
+#include "tensorflow/compiler/xla/tests/filecheck.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace xla {
+namespace gpu {
+
+std::unique_ptr<HloModule> GpuCodegenTest::CreateNewModuleWithFTZ(bool ftz) {
+  HloModuleConfig config;
+  auto debug_options = legacy_flags::GetDebugOptionsFromFlags();
+  debug_options.set_xla_gpu_ftz(ftz);
+  debug_options.set_xla_gpu_max_kernel_unroll_factor(1);
+  // TODO(b/38354253): Change tests to use Parameters instead of Constants.
+  debug_options.add_xla_disable_hlo_passes("constant_folding");
+  config.set_debug_options(debug_options);
+
+  return MakeUnique<HloModule>(TestName(), config);
+}
+
+void GpuCodegenTest::CompileAndVerifyPtx(std::unique_ptr<HloModule> hlo_module,
+                                         const string& pattern) {
+  std::unique_ptr<Executable> executable =
+      std::move(CompileToExecutable(std::move(hlo_module)).ValueOrDie());
+  string ptx_str =
+      std::string(static_cast<GpuExecutable*>(executable.get())->ptx());
+  StatusOr<bool> filecheck_result = RunFileCheck(ptx_str, pattern);
+  ASSERT_TRUE(filecheck_result.ok());
+  EXPECT_TRUE(filecheck_result.ValueOrDie());
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h b/tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h
new file mode 100644
index 0000000000..e4a3573bab
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h
@@ -0,0 +1,42 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_TESTS_GPU_CODEGEN_TEST_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_TESTS_GPU_CODEGEN_TEST_H_
+
+#include <string>
+
+#include "tensorflow/compiler/xla/tests/llvm_irgen_test_base.h"
+
+namespace xla {
+namespace gpu {
+
+// Tests that verify IR or PTX emitted by the GPU backend is as expected.
+class GpuCodegenTest : public LlvmIrGenTestBase {
+ protected:
+  // Like HloTestBase::CreateNewModule(), with a flag for configuring the ftz
+  // option.
+  std::unique_ptr<HloModule> CreateNewModuleWithFTZ(bool ftz);
+
+  // Compiles the given HLO module to PTX and verifies the PTX matches the given
+  // FileCheck pattern.  (See http://llvm.org/docs/CommandGuide/FileCheck.html).
+  void CompileAndVerifyPtx(std::unique_ptr<HloModule> hlo_module,
+                           const string& pattern);
+};
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_TESTS_GPU_CODEGEN_TEST_H_
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_copy_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_copy_test.cc
new file mode 100644
index 0000000000..ce69e058e6
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_copy_test.cc
@@ -0,0 +1,59 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <memory>
+#include <utility>
+
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+
+class GpuCopyTest : public GpuCodegenTest {};
+
+// The GPU backend should not emit a copy kernel for the kCopy instruction in
+// this test. Instead, it should generate a CopyThunk which invokes cuMemcpy at
+// runtime.
+TEST_F(GpuCopyTest, UseMemcpy) {
+  HloComputation::Builder builder(TestName());
+
+  std::unique_ptr<Literal> literal =
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  HloInstruction* constant = builder.AddInstruction(
+      HloInstruction::CreateConstant(std::move(literal)));
+  builder.AddInstruction(HloInstruction::CreateUnary(
+      constant->shape(), HloOpcode::kCopy, constant));
+
+  std::unique_ptr<HloComputation> computation = builder.Build();
+
+  auto hlo_module = CreateNewModule();
+  hlo_module->AddEntryComputation(std::move(computation));
+
+  // There should not be any kernel prefixed "copy".
+  CompileAndVerifyIr(std::move(hlo_module), "; CHECK-NOT: define void @_copy",
+                     /*match_optimized_ir=*/false);
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_ftz_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_ftz_test.cc
new file mode 100644
index 0000000000..177b94934c
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_ftz_test.cc
@@ -0,0 +1,119 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+
+// Check that the ftz (flush denormals to zero) flag is reflected in PTX as
+// expected.
+
+namespace xla {
+namespace gpu {
+namespace {
+
+class GpuFtzTest : public GpuCodegenTest {
+ public:
+  explicit GpuFtzTest(bool ftz) : ftz_(ftz) {}
+
+  // Creates an HLO module that performs the given binary operation on some
+  // data.
+  std::unique_ptr<HloModule> CreateBinaryOpModule(HloOpcode op) {
+    HloComputation::Builder builder(TestName());
+
+    Shape param_shape = ShapeUtil::MakeShapeWithLayout(
+        F32, /*dimensions=*/{100, 100}, /*minor_to_major=*/{1, 0});
+    HloInstruction* x = builder.AddInstruction(HloInstruction::CreateParameter(
+        /* parameter_number=*/0, param_shape, "x"));
+    HloInstruction* y = builder.AddInstruction(HloInstruction::CreateParameter(
+        /* parameter_number=*/1, param_shape, "y"));
+    builder.AddInstruction(HloInstruction::CreateBinary(param_shape, op, x, y));
+
+    auto hlo_module = CreateNewModuleWithFTZ(ftz_);
+    hlo_module->AddEntryComputation(builder.Build());
+    return hlo_module;
+  }
+
+  // Creates an HLO module that performs the given unary operation on some data.
+  std::unique_ptr<HloModule> CreateUnaryOpModule(HloOpcode op) {
+    HloComputation::Builder builder(TestName());
+
+    Shape param_shape = ShapeUtil::MakeShapeWithLayout(
+        F32, /*dimensions=*/{100, 100}, /*minor_to_major=*/{1, 0});
+    HloInstruction* x = builder.AddInstruction(HloInstruction::CreateParameter(
+        /* parameter_number=*/0, param_shape, "x"));
+    builder.AddInstruction(HloInstruction::CreateUnary(param_shape, op, x));
+
+    auto hlo_module = CreateNewModuleWithFTZ(ftz_);
+    hlo_module->AddEntryComputation(builder.Build());
+    return hlo_module;
+  }
+
+  bool ftz_;
+};
+
+class GpuFtzEnabledTest : public GpuFtzTest {
+ public:
+  GpuFtzEnabledTest() : GpuFtzTest(/*ftz=*/true) {}
+};
+
+class GpuFtzDisabledTest : public GpuFtzTest {
+ public:
+  GpuFtzDisabledTest() : GpuFtzTest(/*ftz=*/false) {}
+};
+
+// Check that we emit mul.ftz.f32 when in ftz mode, and plain mul.f32 otherwise.
+TEST_F(GpuFtzEnabledTest, MultiplyFtz) {
+  CompileAndVerifyPtx(CreateBinaryOpModule(HloOpcode::kMultiply), R"(
+    CHECK-NOT: mul.f32
+    CHECK: mul.ftz.f32
+    CHECK-NOT: mul.f32
+  )");
+}
+TEST_F(GpuFtzDisabledTest, MultiplyFtz) {
+  CompileAndVerifyPtx(CreateBinaryOpModule(HloOpcode::kMultiply), R"(
+    CHECK-NOT: mul.ftz.f32
+    CHECK: mul.f32
+    CHECK-NOT: mul.ftz.f32
+  )");
+}
+
+// In NVPTX, exp(float) is implemented in libdevice, and consults __nvvm_reflect
+// to determine whether or not ftz is enabled.  The implementation uses two
+// calls to ex2.approx.  When ftz is on, we get two calls to the ftz version;
+// when ftz is off, we get one call to the ftz version and one call to the
+// regular version.
+TEST_F(GpuFtzEnabledTest, ExpFtz) {
+  CompileAndVerifyPtx(CreateUnaryOpModule(HloOpcode::kExp), R"(
+    CHECK-NOT: ex2.approx.f32
+    CHECK:     ex2.approx.ftz.f32
+    CHECK-NOT: ex2.approx.f32
+    CHECK:     ex2.approx.ftz.f32
+    CHECK-NOT: ex2.approx.f32
+    CHECK-NOT: ex2.approx.ftz.f32
+  )");
+}
+
+TEST_F(GpuFtzDisabledTest, ExpFtz) {
+  CompileAndVerifyPtx(CreateUnaryOpModule(HloOpcode::kExp), R"(
+    CHECK-NOT: ex2.approx.f32
+    CHECK-DAG: ex2.approx.ftz.f32
+    CHECK-DAG: ex2.approx.f32
+    CHECK-NOT: ex2.approx.f32
+    CHECK-NOT: ex2.approx.ftz.f32
+  )");
+}
+
+}  // namespace
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_fusion_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_fusion_test.cc
new file mode 100644
index 0000000000..674b436a8e
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_fusion_test.cc
@@ -0,0 +1,59 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/hlo_module_config.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+namespace {
+
+class GpuFusionTest : public GpuCodegenTest {};
+
+TEST_F(GpuFusionTest, FusedReshape) {
+  const char* hlo_text = R"(
+    HloModule test_module
+
+    fused_computation {
+      p0.param_0 = f32[4,1,1]{2,1,0} parameter(0)
+      p1.param_1 = f32[4,1]{1,0} parameter(1)
+      reshape = f32[4,1]{1,0} reshape(p0.param_0)
+      ROOT add = f32[4,1] add(reshape, p1.param_1)
+    }
+
+    ENTRY BroadcastIntoAdd {
+      p0 = f32[4,1,1]{2,1,0} parameter(0)
+      p1 = f32[4,1]{1,0} parameter(1)
+      ROOT fusion = f32[4,1]{1,0} fusion(p0, p1), kind=kLoop,
+                                                  calls=fused_computation
+    }
+)";
+
+  CompileAndVerifyIr(hlo_text,
+                     R"(
+; CHECK-LABEL: @fusion
+; CHECK: fadd
+; CHECK: }
+      )");
+}
+
+}  // namespace
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_index_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_index_test.cc
new file mode 100644
index 0000000000..e5958165ef
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_index_test.cc
@@ -0,0 +1,147 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <memory>
+#include <utility>
+
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_module_config.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/xla.pb.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+
+// This file tests the index expressions used to reference source tensors. When
+// the destination tensor and source tensor have compatible shapes, the linear
+// index is used to access the source tensor. Otherwise, dimensional indices
+// computed from the linear index are used to access the source tensor.
+
+class GpuIndexTest : public GpuCodegenTest {};
+
+TEST_F(GpuIndexTest, CompatibleUseLinearIndex) {
+  HloComputation::Builder builder(TestName());
+
+  auto param_shape = ShapeUtil::MakeShape(F32, {5, 7, 2});
+  HloInstruction* param_x = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, param_shape, "x"));
+  HloInstruction* param_y = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, param_shape, "y"));
+  builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {5, 7, 2}), HloOpcode::kGe, param_x, param_y));
+
+  auto hlo_module = CreateNewModule();
+  hlo_module->AddEntryComputation(builder.Build());
+
+  // Check the optimized IR as the unoptimized IR contains dead udiv and urem.
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-NOT: udiv
+; CHECK-NOT: urem
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuIndexTest, CompatibleUseLinearIndexWithReshape) {
+  HloModuleConfig config;
+  config.set_debug_options(HloTestBase::GetDebugOptionsForTest());
+  auto module = ParseHloString(R"(
+    HloModule test_module
+
+    ENTRY CompatibleUseLinearIndexWithReshape {
+      x = f32[5,7,2]{2,1,0} parameter(0)
+      y = f32[5,14]{1,0} parameter(1)
+      reshape = f32[5,7,2]{2,1,0} reshape(y)
+      ROOT gte = pred[5,7,2]{2,1,0} greater-than-or-equal-to(x, reshape)
+    })",
+                               config)
+                    .ValueOrDie();
+
+  // Check the optimized IR as the unoptimized IR contains dead udiv and urem.
+  CompileAndVerifyIr(std::move(module),
+                     R"(
+; CHECK-NOT: udiv
+; CHECK-NOT: urem
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuIndexTest, CompatibleUseLinearIndexWithReshapeAndBroadcast) {
+  HloModuleConfig config;
+  config.set_debug_options(HloTestBase::GetDebugOptionsForTest());
+  auto module = ParseHloString(R"(
+    HloModule test_module
+
+    ENTRY CompatibleUseLinearIndexWithReshape {
+      x = f32[5,7,2]{2,1,0} parameter(0)
+      y = f32[14]{0} parameter(1)
+      reshape = f32[7,2]{1,0} reshape(y)
+      broadcast = f32[5,7,2]{2,1,0} broadcast(reshape), dimensions={1,2}
+      ROOT gte = pred[5,7,2]{2,1,0} greater-than-or-equal-to(x, broadcast)
+    })",
+                               config)
+                    .ValueOrDie();
+
+  // Check the optimized IR reuses the linear index by calculating modulo 14.
+  CompileAndVerifyIr(std::move(module),
+                     R"(
+; CHECK: %[[urem1:.*]] = urem i{{[0-9]*}} %[[linear_index:.*]], 14
+; CHECK: %[[bitcast:.*]] = bitcast i8 addrspace(1)* %[[alloc:.*]] to float addrspace(1)*
+; CHECK: %[[idx1:.*]] = zext i{{[0-9]*}} %[[urem1]] to i64
+; CHECK: getelementptr inbounds float, float addrspace(1)* %[[bitcast]], i64 %[[idx1]]
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuIndexTest, CompatibleUseLinearIndexWithSizeOneDimensions) {
+  HloModuleConfig config;
+  auto debug_options = HloTestBase::GetDebugOptionsForTest();
+  debug_options.set_xla_gpu_max_kernel_unroll_factor(1);
+  config.set_debug_options(debug_options);
+
+  auto module = ParseHloString(R"(
+    HloModule  test_module
+
+    ENTRY CompatibleUseLinearIndexWithSizeOneDimensions  {
+      x = f32[1,1024,1,256]{3,2,1,0} parameter(0)
+      ROOT y = f16[1,1024,1,256]{2,3,1,0} convert(x)
+    })",
+                               config)
+                    .ValueOrDie();
+
+  // Check that the unoptimized IR reuses the linear index.
+  CompileAndVerifyIr(std::move(module),
+                     R"(
+; CHECK-LABEL: @fusion
+; CHECK: udiv i32 %[[linear_index:.*]], 262144
+; CHECK: %[[ld_addr:.*]] = getelementptr inbounds float, float* {{.*}}, i32 %[[linear_index]]
+; CHECK: load float, float* %[[ld_addr]]
+; CHECK: %[[st_addr:.*]] = getelementptr inbounds half, half* {{.*}}, i32 %[[linear_index]]
+; CHECK: store half {{.*}}, half* %[[st_addr]]
+      )",
+                     /*match_optimized_ir=*/false);
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_kernel_tiling_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_kernel_tiling_test.cc
new file mode 100644
index 0000000000..cca35316f0
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_kernel_tiling_test.cc
@@ -0,0 +1,177 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/hlo_module_config.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+namespace {
+
+class GpuKernelTilingTest : public GpuCodegenTest {
+ protected:
+  GpuKernelTilingTest() {
+    auto debug_options = HloTestBase::GetDebugOptionsForTest();
+    config_.set_debug_options(debug_options);
+    // Disable layout_assignment to use the preassigned layouts.
+    debug_options.add_xla_disable_hlo_passes("layout_assignment");
+  }
+  HloModuleConfig config_;
+};
+
+TEST_F(GpuKernelTilingTest, UnnestedTransposeWithProperDimensionsTiled) {
+  const char *const kHloString = R"(
+    HloModule unnested_transpose_1
+
+    ENTRY unnested_transpose_1 {
+      para0 = f16[32,3,64]{2,1,0} parameter(0)
+      ROOT copy1 = f16[32,3,64]{1,0,2} copy(para0)
+    })";
+
+  // Check that a call to llvm.nvvm.barrier0 is generated.
+  auto hlo_module = ParseHloString(kHloString, config_).ValueOrDie();
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: define void @copy
+; CHECK: tail call void @llvm.nvvm.barrier0()
+; CHECK: }
+)",
+                     /*match_optimized_ir=*/true);
+
+  // Check that the kernel runs correctly.
+  EXPECT_TRUE(RunAndCompareNoHloPasses(kHloString, ErrorSpec{0.0}));
+}
+
+TEST_F(GpuKernelTilingTest, UnnestedTransposeWithSmallDimensionsNotTiled) {
+  const char *const kHloString = R"(
+    HloModule unnested_transpose_2
+
+    ENTRY unnested_transpose_2 {
+      para0 = f16[2,3,64]{2,1,0} parameter(0)
+      ROOT copy1 = f16[2,3,64]{1,0,2} copy(para0)
+    })";
+
+  // Check that a call to llvm.nvvm.barrier0 is not generated.
+  auto hlo_module = ParseHloString(kHloString, config_).ValueOrDie();
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: define void @copy
+; CHECK-NOT: tail call void @llvm.nvvm.barrier0()
+; CHECK: }
+)",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuKernelTilingTest, SimpleFusionWithTransposeTiled) {
+  const char *const kHloString = R"(
+    HloModule multiple_output_fusion_1
+    fused_computation.1 {
+      param0 = f32[4,5,6,7,8]{4,3,2,1,0} parameter(0)
+      copy = f32[4,5,6,7,8]{2,1,4,3,0} copy(param0)
+      ROOT convert = f16[4,5,6,7,8]{2,1,4,3,0} convert(copy)
+    }
+
+    ENTRY copy_in_fusion_run_without_hlo_passes {
+      para0 = f32[4,5,6,7,8]{4,3,2,1,0} parameter(0)
+      ROOT fusion.1 = f16[4,5,6,7,8]{2,1,4,3,0} fusion(para0), kind=kLoop,
+        calls=fused_computation.1
+    })";
+
+  // Check that a call to llvm.nvvm.barrier0 is generated.
+  auto hlo_module = ParseHloString(kHloString, config_).ValueOrDie();
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: define void @fusion
+; CHECK: tail call void @llvm.nvvm.barrier0()
+; CHECK: }
+)",
+                     /*match_optimized_ir=*/true);
+
+  // Check that the kernel runs correctly.
+  EXPECT_TRUE(RunAndCompareNoHloPasses(kHloString, ErrorSpec{0.0}));
+}
+
+TEST_F(GpuKernelTilingTest, MultipleOutputFusionWithOnePossibleTransposeTiled) {
+  const char *const kHloString = R"(
+    HloModule multiple_output_fusion_1
+    fused_computation.1 {
+      param0 = f16[8,31,31,65]{3,2,1,0} parameter(0)
+      param1 = f16[8,31,31,65]{3,2,1,0} parameter(1)
+      copy0 = f16[8,31,31,65]{2,1,3,0} copy(param0)
+      copy1 = f16[8,31,31,65]{2,1,3,0} copy(param1)
+      ROOT tuple1 = (f16[8,31,31,65]{2,1,3,0}, f16[8,31,31,65]{2,1,3,0})
+        tuple(copy0, copy1)
+    }
+
+    ENTRY multiple_output_fusion_1 {
+      para0 = f16[8,31,31,65]{3,2,1,0} parameter(0)
+      para1 = f16[8,31,31,65]{3,2,1,0} parameter(1)
+      ROOT fusion.1 = (f16[8,31,31,65]{2,1,3,0}, f16[8,31,31,65]{2,1,3,0})
+        fusion(para0,para1), kind=kLoop, calls=fused_computation.1
+    })";
+
+  // Check that a call to llvm.nvvm.barrier0 is generated.
+  auto hlo_module = ParseHloString(kHloString, config_).ValueOrDie();
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: define void @fusion
+; CHECK: tail call void @llvm.nvvm.barrier0()
+; CHECK: }
+)",
+                     /*match_optimized_ir=*/true);
+
+  // Check that the kernel runs correctly.
+  EXPECT_TRUE(RunAndCompareNoHloPasses(kHloString, ErrorSpec{0.0}));
+}
+
+TEST_F(GpuKernelTilingTest,
+       MultipleOutputFusionWithTwoPossibleTransposesNotTiled) {
+  const char *const kHloString = R"(
+    HloModule multiple_output_fusion_2
+    fused_computation.1 {
+      param0 = f16[8,31,31,65]{3,2,1,0} parameter(0)
+      param1 = f16[8,31,31,65]{1,3,2,0} parameter(1)
+      copy2 = f16[8,31,31,65]{2,1,3,0} copy(param0)
+      copy3 = f16[8,31,31,65]{2,1,3,0} copy(param1)
+      ROOT tuple1 = (f16[8,31,31,65]{2,1,3,0}, f16[8,31,31,65]{2,1,3,0})
+        tuple(copy2, copy3)
+    }
+
+    ENTRY multiple_output_fusion_2 {
+      para0 = f16[8,31,31,65]{3,2,1,0} parameter(0)
+      para1 = f16[8,31,31,65]{1,3,2,0} parameter(1)
+      ROOT fusion1 = (f16[8,31,31,65]{2,1,3,0}, f16[8,31,31,65]{2,1,3,0})
+        fusion(para0,para1), kind=kLoop, calls=fused_computation.1
+    })";
+
+  // Check that a call to llvm.nvvm.barrier0 is not generated.
+  auto hlo_module = ParseHloString(kHloString, config_).ValueOrDie();
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: define void @fusion
+; CHECK-NOT: tail call void @llvm.nvvm.barrier0()
+; CHECK: }
+)",
+                     /*match_optimized_ir=*/true);
+}
+
+}  // namespace
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_ldg_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_ldg_test.cc
new file mode 100644
index 0000000000..6c9ae7bada
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_ldg_test.cc
@@ -0,0 +1,141 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Tests that we emit ld.global.nc (the PTX instruction corresponding to CUDA's
+// __ldg builtin) for reads of buffers that don't change during a kernel's
+// execution.
+
+#include <memory>
+#include <utility>
+
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+
+class GpuLdgTest : public GpuCodegenTest {};
+
+// Parameters are never overwritten, so parameter reads should get ld.global.nc
+// reads.
+TEST_F(GpuLdgTest, LdgForParamRead) {
+  HloComputation::Builder builder(TestName());
+
+  auto shape = ShapeUtil::MakeShape(F32, {2, 2});
+  HloInstruction* param =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "x"));
+  builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param, param));
+  std::unique_ptr<HloComputation> computation = builder.Build();
+
+  auto hlo_module = CreateNewModule();
+  hlo_module->AddEntryComputation(std::move(computation));
+
+  CompileAndVerifyPtx(std::move(hlo_module), R"(
+    CHECK-NOT: ld.global.f32
+    CHECK: ld.global.nc.f32
+  )");
+}
+
+// Check that reading a buffer produced by a non-parameter HLO also results in
+// ld.global.nc, if that buffer isn't modified within the instruction that reads
+// it.
+TEST_F(GpuLdgTest, LdgForNonParamRead) {
+  HloComputation::Builder builder(TestName());
+
+  auto shape = ShapeUtil::MakeShape(F32, {2, 2});
+  HloInstruction* param =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "x"));
+  HloInstruction* add = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param, param));
+  HloInstruction* square = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kMultiply, add, add));
+  builder.AddInstruction(HloInstruction::CreateTuple({add, square}));
+  std::unique_ptr<HloComputation> computation = builder.Build();
+
+  auto hlo_module = CreateNewModule();
+  hlo_module->AddEntryComputation(std::move(computation));
+
+  CompileAndVerifyPtx(std::move(hlo_module), R"(
+    CHECK: {
+    CHECK-NOT: ld.global.f32
+    CHECK: ld.global.nc.f32
+    CHECK: }
+  )");
+}
+
+// Check that reading a buffer that's modified in-place does not produce
+// ld.global.nc.
+//
+// We do this by creating a reduce that feeds into a sin.  We don't currently
+// fuse sin into reduce, and the sin is elementwise, so it reuses its input
+// buffer as its output.
+//
+// It seems like a fair bet that we won't start fusing sin into the output of
+// reduce in the foreseeable future.  But if that turns out to be wrong, I give
+// you, future reader, permission to delete this test.
+TEST_F(GpuLdgTest, NoLdgWhenSharingBuffer) {
+  auto hlo_module = CreateNewModule();
+  HloComputation::Builder builder(TestName());
+
+  HloComputation* reduce_computation;
+  {
+    auto embedded_builder = HloComputation::Builder("add");
+    auto lhs = embedded_builder.AddInstruction(HloInstruction::CreateParameter(
+        0, ShapeUtil::MakeShape(F32, {}), "lhs"));
+    auto rhs = embedded_builder.AddInstruction(HloInstruction::CreateParameter(
+        1, ShapeUtil::MakeShape(F32, {}), "rhs"));
+    embedded_builder.AddInstruction(
+        HloInstruction::CreateBinary(lhs->shape(), HloOpcode::kAdd, lhs, rhs));
+    reduce_computation =
+        hlo_module->AddEmbeddedComputation(embedded_builder.Build());
+  }
+
+  auto param_shape = ShapeUtil::MakeShape(F32, {2, 2});
+  auto reduce_shape = ShapeUtil::MakeShape(F32, {2});
+  HloInstruction* param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, param_shape, "x"));
+  HloInstruction* reduce = builder.AddInstruction(HloInstruction::CreateReduce(
+      reduce_shape,
+      builder.AddInstruction(HloInstruction::CreateBinary(
+          param_shape, HloOpcode::kAdd, param, param)),
+      builder.AddInstruction(
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0))),
+      {0}, reduce_computation));
+  builder.AddInstruction(
+      HloInstruction::CreateUnary(reduce_shape, HloOpcode::kSin, reduce));
+
+  std::unique_ptr<HloComputation> computation = builder.Build();
+  hlo_module->AddEntryComputation(std::move(computation));
+
+  CompileAndVerifyPtx(std::move(hlo_module), R"(
+    CHECK-LABEL: .entry sin
+    CHECK: {
+    CHECK-NOT: ld.global.nc.f32
+    CHECK: ld.global.f32
+    CHECK: }
+  )");
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_noalias_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_noalias_test.cc
new file mode 100644
index 0000000000..c42e5704a4
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_noalias_test.cc
@@ -0,0 +1,68 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <memory>
+#include <utility>
+
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+
+class GpuNoAliasTest : public GpuCodegenTest {};
+
+TEST_F(GpuNoAliasTest, Concat) {
+  HloComputation::Builder builder(TestName());
+
+  auto param_shape = ShapeUtil::MakeShape(F32, {2, 2});
+  HloInstruction* param_x = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, param_shape, "x"));
+  HloInstruction* param_y = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, param_shape, "y"));
+  HloInstruction* concat =
+      builder.AddInstruction(HloInstruction::CreateConcatenate(
+          ShapeUtil::MakeShape(F32, {2, 4}), {param_x, param_y}, 1));
+  builder.AddInstruction(HloInstruction::CreateConcatenate(
+      ShapeUtil::MakeShape(F32, {2, 6}), {concat, param_x}, 1));
+
+  std::unique_ptr<HloComputation> computation = builder.Build();
+
+  auto hlo_module = CreateNewModule();
+  hlo_module->AddEntryComputation(std::move(computation));
+
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK: %[[x_gep:.*]] = getelementptr inbounds [2 x [2 x float]], [2 x [2 x float]]* %x{{.*}}, i32 0
+; CHECK: load float, float* %[[x_gep]], {{.*}}, !noalias ![[param_noalias:.*]]
+; CHECK: %[[y_gep:.*]] = getelementptr inbounds [2 x [2 x float]], [2 x [2 x float]]* %y{{.*}}, i32 0
+; CHECK: load float, float* %[[y_gep]], {{.*}}, !noalias ![[param_noalias]]
+; CHECK: %[[result_ptr:.*]] = bitcast [2 x [6 x float]]* %fusion{{.*}} to float*
+; CHECK: %[[result_gep:.*]] = getelementptr inbounds float, float* %[[result_ptr]]
+; CHECK: store float {{.*}}, float* %[[result_gep]], !alias.scope ![[param_noalias]]
+; CHECK: ![[param_noalias]] = !{![[retval_buffer:.*]]}
+      )",
+                     /*match_optimized_ir=*/false);
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/gpu_unrolling_test.cc b/tensorflow/compiler/xla/service/gpu/tests/gpu_unrolling_test.cc
new file mode 100644
index 0000000000..9622936306
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/gpu_unrolling_test.cc
@@ -0,0 +1,185 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/hlo_module_config.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace gpu {
+namespace {
+
+class GpuUnrollingTest : public GpuCodegenTest {};
+
+const char *const kAddModule = R"(
+    HloModule test_module
+
+    fused_computation {
+      p0.param_0 = f32[2,2]{1,0} parameter(0)
+      p1.param_1 = f32[2,2]{1,0} parameter(1)
+      ROOT add = f32[2,2] add(p0.param_0, p1.param_1)
+    }
+
+    ENTRY BroadcastIntoAdd {
+      p0 = f32[2,2]{1,0} parameter(0)
+      p1 = f32[2,2]{1,0} parameter(1)
+      ROOT fusion = f32[2,2]{1,0} fusion(p0, p1), kind=kLoop,
+                                                  calls=fused_computation
+    })";
+
+TEST_F(GpuUnrollingTest, DoNotUnroll) {
+  HloModuleConfig config;
+  auto debug_options = HloTestBase::GetDebugOptionsForTest();
+  debug_options.set_xla_gpu_max_kernel_unroll_factor(1);
+  config.set_debug_options(debug_options);
+  auto hlo_module = ParseHloString(kAddModule, config).ValueOrDie();
+
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: @fusion
+; CHECK: fadd
+; CHECK-NOT: fadd
+; CHECK: }
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuUnrollingTest, UnrollFourTimes) {
+  HloModuleConfig config;
+  auto debug_options = HloTestBase::GetDebugOptionsForTest();
+  // We request a factor of 8, but the computation works on 4 elements, limiting
+  // the maximum unroll factor.
+  debug_options.set_xla_gpu_max_kernel_unroll_factor(8);
+  config.set_debug_options(debug_options);
+  auto hlo_module = ParseHloString(kAddModule, config).ValueOrDie();
+
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: @fusion
+; CHECK: fadd
+; CHECK: fadd
+; CHECK: fadd
+; CHECK: fadd
+; CHECK-NOT: fadd
+; CHECK: }
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuUnrollingTest, UnrollDefaultTimes) {
+  // The default unrolling factor is 4.
+  HloModuleConfig config;
+  config.set_debug_options(legacy_flags::GetDebugOptionsFromFlags());
+  auto hlo_module = ParseHloString(kAddModule, config).ValueOrDie();
+
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: @fusion
+; CHECK: load <4 x float>
+; CHECK: fadd
+; CHECK: fadd
+; CHECK: fadd
+; CHECK: fadd
+; CHECK-NOT: fadd
+; CHECK: store <4 x float>
+; CHECK: }
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuUnrollingTest, UnrollUnfusedAdd) {
+  HloModuleConfig config;
+  auto debug_options = HloTestBase::GetDebugOptionsForTest();
+  debug_options.set_xla_gpu_max_kernel_unroll_factor(4);
+  config.set_debug_options(debug_options);
+
+  const char *const kUnfusedAddModule = R"(
+    HloModule test_module
+
+    ENTRY AddFunc {
+      p0 = f32[2,2]{1,0} parameter(0)
+      p1 = f32[2,2]{1,0} parameter(1)
+      ROOT add = f32[2,2]{1,0} add(p0, p1)
+    })";
+  auto hlo_module = ParseHloString(kUnfusedAddModule, config).ValueOrDie();
+
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: @add
+; CHECK: load <4 x float>
+; CHECK: fadd
+; CHECK: fadd
+; CHECK: fadd
+; CHECK: fadd
+; CHECK-NOT: fadd
+; CHECK: store <4 x float>
+; CHECK: }
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+TEST_F(GpuUnrollingTest, UnrollMultiOutputFusion) {
+  HloModuleConfig config;
+  auto debug_options = HloTestBase::GetDebugOptionsForTest();
+  debug_options.set_xla_gpu_max_kernel_unroll_factor(2);
+  config.set_debug_options(debug_options);
+
+  const char *const kMultiOutputFusionModule = R"(
+    HloModule test_module
+
+    fused_computation {
+      p0.param_0 = f32[2,2]{1,0} parameter(0)
+      p1.param_1 = f32[2,2]{1,0} parameter(1)
+      add = f32[2,2]{1,0} add(p0.param_0, p1.param_1)
+      mul = f32[2,2]{1,0} multiply(p0.param_0, p1.param_1)
+      ROOT tuple = (f32[2,2]{1,0}, f32[2,2]{1,0}) tuple(add, mul)
+    }
+
+    ENTRY BroadcastIntoAdd {
+      p0 = f32[2,2]{1,0} parameter(0)
+      p1 = f32[2,2]{1,0} parameter(1)
+      ROOT fusion = (f32[2,2]{1,0}, f32[2,2]{1,0}) fusion(p0, p1), kind=kLoop,
+                                                   calls=fused_computation
+    })";
+  auto hlo_module =
+      ParseHloString(kMultiOutputFusionModule, config).ValueOrDie();
+
+  CompileAndVerifyIr(std::move(hlo_module),
+                     R"(
+; CHECK-LABEL: @fusion
+; CHECK: load <2 x float>
+; CHECK: load <2 x float>
+; CHECK-NOT: load <2 x float>
+; CHECK: fadd
+; CHECK: fmul
+; CHECK: fadd
+; CHECK: fmul
+; CHECK: store <2 x float>
+; CHECK: store <2 x float>
+; CHECK-NOT: store <2 x float>
+; CHECK-NOT: fadd
+; CHECK-NOT: fmul
+; CHECK: }
+      )",
+                     /*match_optimized_ir=*/true);
+}
+
+}  // namespace
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/tests/infeed_test.cc b/tensorflow/compiler/xla/service/gpu/tests/infeed_test.cc
new file mode 100644
index 0000000000..9072b30317
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/tests/infeed_test.cc
@@ -0,0 +1,121 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <unistd.h>
+#include <memory>
+
+#include "tensorflow/compiler/xla/client/global_data.h"
+#include "tensorflow/compiler/xla/client/lib/arithmetic.h"
+#include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/test_helpers.h"
+#include "tensorflow/compiler/xla/tests/client_library_test_base.h"
+#include "tensorflow/compiler/xla/tests/literal_test_util.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/math/math_util.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+namespace {
+
+class InfeedTest : public ClientLibraryTestBase {
+ protected:
+  // Transfers the given literal to the infeed interface of the device, and
+  // check if the returned data from Infeed HLO is same as the literal.
+  void TestInfeedRoundTrip(const Literal& literal) {
+    // TODO(b/30481585) Explicitly reset the Infeed state so that the
+    // test is not affected by the state from the previous tests.
+    ASSERT_IS_OK(client_->TransferToInfeed(literal));
+    XlaBuilder builder(TestName());
+    Infeed(&builder, literal.shape());
+    if (ShapeUtil::IsTuple(literal.shape())) {
+      // TODO(b/30609564): Use ComputeAndCompareLiteral instead.
+      ComputeAndCompareTuple(&builder, literal, {});
+    } else {
+      ComputeAndCompareLiteral(&builder, literal, {});
+    }
+  }
+};
+
+TEST_F(InfeedTest, SingleInfeedR0Bool) {
+  TestInfeedRoundTrip(*LiteralUtil::CreateR0<bool>(true));
+}
+
+TEST_F(InfeedTest, SingleInfeedR1U32) {
+  TestInfeedRoundTrip(*LiteralUtil::CreateR1<uint32>({1, 2, 3}));
+}
+
+TEST_F(InfeedTest, SingleInfeedR2F32) {
+  TestInfeedRoundTrip(*LiteralUtil::CreateR2F32Linspace(0.0, 1.0, 128, 64));
+}
+
+TEST_F(InfeedTest, SingleInfeedR3F32) {
+  TestInfeedRoundTrip(
+      *LiteralUtil::CreateR3({{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
+                              {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}}));
+}
+
+TEST_F(InfeedTest, SingleInfeedR3F32DifferentLayout) {
+  const Layout r3_dim0minor = LayoutUtil::MakeLayout({0, 1, 2});
+  const Layout r3_dim0major = LayoutUtil::MakeLayout({2, 1, 0});
+
+  TestInfeedRoundTrip(*LiteralUtil::CreateR3WithLayout(
+      {{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
+       {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}},
+      r3_dim0minor));
+
+  TestInfeedRoundTrip(*LiteralUtil::CreateR3WithLayout(
+      {{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
+       {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}},
+      r3_dim0major));
+}
+
+TEST_F(InfeedTest, SingleInfeedR4S32) {
+  TestInfeedRoundTrip(*LiteralUtil::CreateR4(
+      {{{{1, -2}, {-4, 5}, {6, 7}}, {{8, 9}, {10, 11}, {12, 13}}},
+       {{{10, 3}, {7, -2}, {3, 6}}, {{2, 5}, {-11, 5}, {-2, -5}}}}));
+}
+
+// Tests that a large infeed can be handled.
+TEST_F(InfeedTest, LargeInfeed) {
+  Array4D<float> array(80, 100, 8, 128);
+  array.FillIota(1.0f);
+  TestInfeedRoundTrip(*LiteralUtil::CreateR4FromArray4D<float>(array));
+}
+
+TEST_F(InfeedTest, SingleInfeedTuple) {
+  TestInfeedRoundTrip(
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<uint32>({1, 2, 3}).get(),
+                               LiteralUtil::CreateR0<bool>(false).get()}));
+}
+
+TEST_F(InfeedTest, SingleInfeedEmptyTuple) {
+  TestInfeedRoundTrip(*LiteralUtil::MakeTuple({}));
+}
+
+// Tests that a large tuple infeed can be handled.
+TEST_F(InfeedTest, SingleInfeedLargeTuple) {
+  Array4D<float> array(40, 100, 8, 128);
+  array.FillIota(1.0f);
+  TestInfeedRoundTrip(*LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR4FromArray4D<float>(array).get(),
+       LiteralUtil::CreateR0<int32>(5).get()}));
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/thunk.cc b/tensorflow/compiler/xla/service/gpu/thunk.cc
new file mode 100644
index 0000000000..c78605cebb
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/thunk.cc
@@ -0,0 +1,59 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/gpu/thunk.h"
+
+namespace xla {
+namespace gpu {
+
+std::ostream& operator<<(std::ostream& os, Thunk::Kind kind) {
+  switch (kind) {
+    case Thunk::kConditional:
+      return os << "kConditional";
+    case Thunk::kConvolution:
+      return os << "kConvolution";
+    case Thunk::kCopy:
+      return os << "kCopy";
+    case Thunk::kCudnnBatchNormBackward:
+      return os << "kCudnnBatchNormBackward";
+    case Thunk::kCudnnBatchNormForwardInference:
+      return os << "kCudnnBatchNormForwardInference";
+    case Thunk::kCudnnBatchNormForwardTraining:
+      return os << "kCudnnBatchNormForwardTraining";
+    case Thunk::kFft:
+      return os << "kFft";
+    case Thunk::kGemm:
+      return os << "kGemm";
+    case Thunk::kInfeed:
+      return os << "kInfeed";
+    case Thunk::kKernel:
+      return os << "kKernel";
+    case Thunk::kMemset32BitValue:
+      return os << "kMemset32BitValue";
+    case Thunk::kMemzero:
+      return os << "kMemzero";
+    case Thunk::kOutfeed:
+      return os << "kOutfeed";
+    case Thunk::kSequential:
+      return os << "kSequential";
+    case Thunk::kTuple:
+      return os << "kTuple";
+    case Thunk::kWhile:
+      return os << "kWhile";
+  }
+}
+
+}  // namespace gpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/gpu/thunk.h b/tensorflow/compiler/xla/service/gpu/thunk.h
index 14d41033c2..4df0bb005b 100644
--- a/tensorflow/compiler/xla/service/gpu/thunk.h
+++ b/tensorflow/compiler/xla/service/gpu/thunk.h
@@ -41,7 +41,7 @@ class GpuExecutable;
 // This is thread-compatible.
 class Thunk {
  public:
-  enum class Kind {
+  enum Kind {
     kConditional,
     kConvolution,
     kCopy,
@@ -54,6 +54,7 @@ class Thunk {
     kKernel,
     kMemset32BitValue,
     kMemzero,
+    kOutfeed,
     kSequential,
     kTuple,
     kWhile,
@@ -110,6 +111,8 @@ class Thunk {
 // A sequence of thunks.
 using ThunkSequence = std::vector<std::unique_ptr<Thunk>>;
 
+std::ostream& operator<<(std::ostream& os, Thunk::Kind kind);
+
 }  // namespace gpu
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/gpu/while_thunk.cc b/tensorflow/compiler/xla/service/gpu/while_thunk.cc
index 5e13f989c2..d81d87e7dc 100644
--- a/tensorflow/compiler/xla/service/gpu/while_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/while_thunk.cc
@@ -30,10 +30,14 @@ WhileThunk::WhileThunk(
     const HloInstruction* hlo)
     : Thunk(Kind::kWhile, hlo),
       condition_result_buffer_index_(condition_result_buffer_index),
+      // Pass nullptr as the HloInstruction* to the condition_thunk_sequence_
+      // and body_thunk_sequence_ constructors because these SequentialThunks
+      // are logically "part of" this WhileThunk, and shouldn't be profiled
+      // separately from it.
       condition_thunk_sequence_(MakeUnique<SequentialThunk>(
-          std::move(*condition_thunk_sequence), hlo)),
-      body_thunk_sequence_(
-          MakeUnique<SequentialThunk>(std::move(*body_thunk_sequence), hlo)) {}
+          std::move(*condition_thunk_sequence), nullptr)),
+      body_thunk_sequence_(MakeUnique<SequentialThunk>(
+          std::move(*body_thunk_sequence), nullptr)) {}
 
 Status WhileThunk::Initialize(const GpuExecutable& executable,
                               se::StreamExecutor* executor) {
@@ -53,6 +57,7 @@ Status WhileThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
   while (true) {
     // Invoke thunk sequence for while 'condition' computation.
     profiler->StartHloComputation();
+    VLOG(3) << "Executing condition computation";
     TF_RETURN_IF_ERROR(condition_thunk_sequence_->ExecuteOnStream(
         buffer_allocations, stream, profiler));
     profiler->FinishHloComputation(hlo_instruction()->while_condition());
@@ -60,6 +65,7 @@ Status WhileThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
     // Copy the result of condition computation and break the loop if 'false'.
     bool condition_result;
     stream->ThenMemcpy(&condition_result, condition_result_data, sizeof(bool));
+    VLOG(3) << "condition_result = " << condition_result;
     Status block_status = stream->BlockHostUntilDone();
     if (!block_status.ok()) {
       return InternalError(
@@ -74,6 +80,7 @@ Status WhileThunk::ExecuteOnStream(const BufferAllocations& buffer_allocations,
     // We measure the time of one execution of the while body computation. The
     // while body may be executed more than once, the last measurement "wins".
     profiler->StartHloComputation();
+    VLOG(3) << "Executing body computation";
     // Invoke thunk sequence for while 'body' computation, and pass on
     // 'profiler' to measure the timing of the thunks in 'body_thunk_sequence_'.
     TF_RETURN_IF_ERROR(body_thunk_sequence_->ExecuteOnStream(buffer_allocations,
diff --git a/tensorflow/compiler/xla/service/gpu/while_transformer.cc b/tensorflow/compiler/xla/service/gpu/while_transformer.cc
index 7749201cbc..c5321df6c4 100644
--- a/tensorflow/compiler/xla/service/gpu/while_transformer.cc
+++ b/tensorflow/compiler/xla/service/gpu/while_transformer.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <unordered_map>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
diff --git a/tensorflow/compiler/xla/service/gpu/while_transformer_test.cc b/tensorflow/compiler/xla/service/gpu/while_transformer_test.cc
index 2f290f61bd..dbc8442ed2 100644
--- a/tensorflow/compiler/xla/service/gpu/while_transformer_test.cc
+++ b/tensorflow/compiler/xla/service/gpu/while_transformer_test.cc
@@ -42,7 +42,7 @@ class WhileTransformerTest : public HloTestBase {
       const int64 tuple_index, const int64 limit) {
     auto builder = HloComputation::Builder(TestName() + ".Condition");
     auto limit_const = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(limit)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(limit)));
     auto loop_state = builder.AddInstruction(HloInstruction::CreateParameter(
         0, GetLoopStateShape(tuple_index), "loop_state"));
     auto induction_variable =
@@ -65,8 +65,8 @@ class WhileTransformerTest : public HloTestBase {
     auto induction_variable =
         builder.AddInstruction(HloInstruction::CreateGetTupleElement(
             induction_variable_shape_, loop_state, ind_var_tuple_index));
-    auto inc = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(increment)));
+    auto inc = builder.AddInstruction(HloInstruction::CreateConstant(
+        LiteralUtil::CreateR0<int32>(increment)));
     auto add0 = builder.AddInstruction(HloInstruction::CreateBinary(
         induction_variable->shape(), HloOpcode::kAdd, induction_variable, inc));
     // Update data GTE(data_tuple_index).
@@ -89,10 +89,12 @@ class WhileTransformerTest : public HloTestBase {
                                         const int64 ind_var_tuple_index,
                                         const int64 ind_var_init) {
     auto builder = HloComputation::Builder(TestName() + ".While");
-    auto induction_var_init = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(ind_var_init)));
-    auto data_init = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f})));
+    auto induction_var_init =
+        builder.AddInstruction(HloInstruction::CreateConstant(
+            LiteralUtil::CreateR0<int32>(ind_var_init)));
+    auto data_init = builder.AddInstruction(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(
+            {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f})));
     auto loop_state_init =
         ind_var_tuple_index == 0
             ? builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/gpu/xfeed_queue.h b/tensorflow/compiler/xla/service/gpu/xfeed_queue.h
new file mode 100644
index 0000000000..dd46ff433b
--- /dev/null
+++ b/tensorflow/compiler/xla/service/gpu/xfeed_queue.h
@@ -0,0 +1,90 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_GPU_XFEED_QUEUE_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_GPU_XFEED_QUEUE_H_
+
+#include <deque>
+#include <functional>
+#include <vector>
+
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/notification.h"
+#include "tensorflow/core/platform/thread_annotations.h"
+
+namespace xla {
+namespace gpu {
+
+// TODO(b/30467474) Once GPU outfeed implementation settles, consider
+// folding back the cpu and gpu outfeed implementations into a generic
+// one if possible.
+
+// Manages a thread-safe queue of buffers.
+template <typename BufferType>
+class XfeedQueue {
+ public:
+  // Adds a tree of buffers to the queue. The individual buffers correspond to
+  // the elements of a tuple and may be nullptr if the buffer is a tuple index
+  // buffer.
+  void EnqueueDestination(BufferType buffers) {
+    tensorflow::mutex_lock l(mu_);
+    enqueued_buffers_.push_back(std::move(buffers));
+    cv_.notify_one();
+  }
+
+  // Blocks until the queue is non-empty, then returns the buffer at the head of
+  // the queue.
+  BufferType BlockingGetNextDestination() {
+    bool became_empty;
+    BufferType current_buffer;
+    {
+      tensorflow::mutex_lock l(mu_);
+      while (enqueued_buffers_.empty()) {
+        cv_.wait(l);
+      }
+      current_buffer = std::move(enqueued_buffers_.front());
+      enqueued_buffers_.pop_front();
+      became_empty = enqueued_buffers_.empty();
+    }
+    if (became_empty) {
+      for (const auto& callback : on_empty_callbacks_) {
+        callback();
+      }
+    }
+    return current_buffer;
+  }
+
+  void RegisterOnEmptyCallback(std::function<void()> callback) {
+    on_empty_callbacks_.push_back(std::move(callback));
+  }
+
+ private:
+  tensorflow::mutex mu_;
+
+  // Condition variable that is signaled every time a buffer is enqueued.
+  tensorflow::condition_variable cv_;
+
+  // The queue of trees of buffers. Buffer* queue contents are not owned.
+  std::deque<BufferType> enqueued_buffers_ GUARDED_BY(mu_);
+
+  // List of callbacks which will be called when 'enqueued_buffers_' becomes
+  // empty.
+  std::vector<std::function<void()>> on_empty_callbacks_;
+};
+
+}  // namespace gpu
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_GPU_XFEED_QUEUE_H_
diff --git a/tensorflow/compiler/xla/service/graphviz_example.cc b/tensorflow/compiler/xla/service/graphviz_example.cc
index acf6611486..aa89567ee8 100644
--- a/tensorflow/compiler/xla/service/graphviz_example.cc
+++ b/tensorflow/compiler/xla/service/graphviz_example.cc
@@ -22,6 +22,7 @@ limitations under the License.
 #include <memory>
 #include <string>
 
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
@@ -47,7 +48,7 @@ HloComputation* AddScalarConstantComputation(int64 addend, HloModule* module) {
   auto x_value = builder.AddInstruction(HloInstruction::CreateParameter(
       0, ShapeUtil::MakeShape(F32, {}), "x_value"));
   auto half = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.5)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.5)));
   builder.AddInstruction(HloInstruction::CreateBinary(
       half->shape(), HloOpcode::kAdd, x_value, half));
   return module->AddEmbeddedComputation(builder.Build());
@@ -122,7 +123,7 @@ std::unique_ptr<HloModule> MakeBigGraph() {
   auto rng = builder.AddInstruction(
       HloInstruction::CreateRng(vshape, RNG_UNIFORM, {param_m, param_m}));
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto add_computation = ScalarSumComputation(module.get());
   builder.AddInstruction(
       HloInstruction::CreateReduce(vshape, rng, one, {1}, add_computation));
diff --git a/tensorflow/compiler/xla/service/heap_simulator_test.cc b/tensorflow/compiler/xla/service/heap_simulator_test.cc
index 3849b565e3..b41dc66fe9 100644
--- a/tensorflow/compiler/xla/service/heap_simulator_test.cc
+++ b/tensorflow/compiler/xla/service/heap_simulator_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/buffer_value.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -239,7 +239,7 @@ class HeapSimulatorTest : public HloTestBase {
 TEST_F(HeapSimulatorTest, ScalarConstant) {
   auto builder = HloComputation::Builder(TestName());
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
 
   // Constants aren't assigned.  See b/32248867
   HeapSimulatorTracker tracker(TestName(), builder.Build(), {const0});
@@ -674,7 +674,7 @@ class HeapAlgorithmTestBase : public ::testing::Test {
   const BufferValue* DummyBufferValue() {
     const BufferValue::Id id = buffers_.size();
     auto const0 = builder_.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
     buffers_.emplace_back(MakeUnique<HloValue>(id, const0, ShapeIndex{}));
     return buffers_.back().get();
   }
diff --git a/tensorflow/compiler/xla/service/hlo.proto b/tensorflow/compiler/xla/service/hlo.proto
index d241791060..0b93d97c11 100644
--- a/tensorflow/compiler/xla/service/hlo.proto
+++ b/tensorflow/compiler/xla/service/hlo.proto
@@ -155,6 +155,13 @@ message HloInstructionProto {
   repeated int64 replica_group_ids = 44;
   int64 all_reduce_id = 45;
   string cross_replica_sum_barrier = 46;
+
+  // Whether this Send/Recv instruction transfers data to/from the host. Only
+  // present for Send and Recv instructions and their SendDone and RecvDone
+  // partners.
+  bool is_host_transfer = 47;
+
+  xla.ScatterDimensionNumbers scatter_dimension_numbers = 48;
 }
 
 // Serialization of HloComputation.
@@ -239,8 +246,9 @@ message BufferAllocationProto {
   int64 index = 1;
   int64 size = 2;
   bool is_thread_local = 3;
-  bool is_reusable = 4;
+  bool is_tuple = 11;
   bool is_entry_computation_parameter = 5;
+  bool is_constant = 12;
   int64 parameter_number = 6;
   repeated int64 parameter_shape_index = 10;
   bool maybe_live_out = 7;
diff --git a/tensorflow/compiler/xla/service/hlo_alias_analysis.h b/tensorflow/compiler/xla/service/hlo_alias_analysis.h
index afb0c20f0c..1fea544730 100644
--- a/tensorflow/compiler/xla/service/hlo_alias_analysis.h
+++ b/tensorflow/compiler/xla/service/hlo_alias_analysis.h
@@ -42,7 +42,7 @@ class HloAliasAnalysis {
   static StatusOr<std::unique_ptr<HloAliasAnalysis>> Run(
       HloModule* module,
       const HloDataflowAnalysis::FusionCanShareBufferFunction&
-          fusion_can_share_buffer = nullptr);
+          fusion_can_share_buffer);
 
   string ToString() const;
 
diff --git a/tensorflow/compiler/xla/service/hlo_alias_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_alias_analysis_test.cc
index 8f18d50f6e..da94ab5346 100644
--- a/tensorflow/compiler/xla/service/hlo_alias_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_alias_analysis_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <map>
 #include <memory>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
 #include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
@@ -47,7 +47,9 @@ class HloAliasAnalysisTest : public HloTestBase {
   // reference to the generated analysis stored in analysis_.
   HloAliasAnalysis& RunAnalysis() {
     hlo_graph_dumper::MaybeDumpHloModule(*module_, "Before alias analysis");
-    analysis_ = HloAliasAnalysis::Run(module_.get()).ConsumeValueOrDie();
+    analysis_ = HloAliasAnalysis::Run(module_.get(),
+                                      /*fusion_can_share_buffer=*/nullptr)
+                    .ConsumeValueOrDie();
     return *analysis_;
   }
 
@@ -116,9 +118,9 @@ TEST_F(HloAliasAnalysisTest, BinaryOperation) {
   // Test the analysis on a single binary operation (Add).
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
       scalar_shape_, HloOpcode::kAdd, constant1, constant2));
   module_->AddEntryComputation(builder.Build());
@@ -228,9 +230,9 @@ TEST_F(HloAliasAnalysisTest, SingleCall) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto call = builder.AddInstruction(HloInstruction::CreateCall(
       scalar_shape_, {constant1, constant2}, called_computation));
   module_->AddEntryComputation(builder.Build());
@@ -267,9 +269,9 @@ TEST_F(HloAliasAnalysisTest, ComputationCalledTwice) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto call1 = builder.AddInstruction(HloInstruction::CreateCall(
       scalar_shape_, {constant1, constant2}, called_computation));
   auto call2 = builder.AddInstruction(HloInstruction::CreateCall(
@@ -346,15 +348,15 @@ TEST_F(HloAliasAnalysisTest, SingleWhile) {
   auto cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto xla_while = builder.AddInstruction(
@@ -439,15 +441,15 @@ TEST_F(HloAliasAnalysisTest, SequentialWhiles) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto xla_while0 = builder.AddInstruction(
@@ -498,7 +500,7 @@ TEST_F(HloAliasAnalysisTest, NestedWhiles) {
     cond_builder.AddInstruction(
         HloInstruction::CreateParameter(0, tuple_shape, "param"));
     cond_builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
     return cond_builder.Build();
   };
   // Build separate condition computations so the call graph is flat. The
@@ -543,9 +545,9 @@ TEST_F(HloAliasAnalysisTest, NestedWhiles) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto entry_while = builder.AddInstruction(
@@ -608,17 +610,17 @@ TEST_F(HloAliasAnalysisTest, SwizzlingWhile) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   auto cond_constant = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2, constant3}));
   auto xla_while = builder.AddInstruction(
@@ -654,19 +656,18 @@ TEST_F(HloAliasAnalysisTest, SwizzlingWhile) {
 }
 
 TEST_F(HloAliasAnalysisTest, TupleSelect) {
-  // Test a kSelect of a tuple value. Non-top-level element flow through the
-  // instruction.
+  // Test a kTupleSelect. Non-top-level element flow through the instruction.
   auto builder = HloComputation::Builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto constant4 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(4.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(4.0)));
   auto tuple1 =
       builder.AddInstruction(HloInstruction::CreateTuple({constant1}));
   auto tuple2 =
@@ -677,13 +678,13 @@ TEST_F(HloAliasAnalysisTest, TupleSelect) {
       builder.AddInstruction(HloInstruction::CreateTuple({constant4}));
   const Shape tuple_shape = tuple1->shape();
   auto select11 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, tuple1, tuple1));
+      tuple_shape, HloOpcode::kTupleSelect, pred, tuple1, tuple1));
   auto select12 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple_shape, HloOpcode::kTupleSelect, pred, tuple1, tuple2));
   auto select34 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, tuple3, tuple4));
+      tuple_shape, HloOpcode::kTupleSelect, pred, tuple3, tuple4));
   auto select1234 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, select12, select34));
+      tuple_shape, HloOpcode::kTupleSelect, pred, select12, select34));
 
   module_->AddEntryComputation(builder.Build());
 
@@ -718,7 +719,7 @@ TEST_F(HloAliasAnalysisTest, TupleSelect) {
 }
 
 TEST_F(HloAliasAnalysisTest, TupleSelectToWhile) {
-  // Test a tuple-shaped kSelect feeding a kWhile instruction. HLO:
+  // Test a tuple-shaped kTupleSelect feeding a kWhile instruction. HLO:
   //
   // body((F32[], F32[]) %tuple_param):
   //   %negate = Negate(%tuple_param{0})
@@ -754,22 +755,22 @@ TEST_F(HloAliasAnalysisTest, TupleSelectToWhile) {
   auto cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple1 =
       builder.AddInstruction(HloInstruction::CreateTuple({constant1}));
   auto tuple2 =
       builder.AddInstruction(HloInstruction::CreateTuple({constant2}));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple_shape, HloOpcode::kTupleSelect, pred, tuple1, tuple2));
   auto xla_while = builder.AddInstruction(
       HloInstruction::CreateWhile(tuple_shape, condition, body, select));
 
@@ -806,7 +807,7 @@ TEST_F(HloAliasAnalysisTest, Bitcast) {
   // Bitcasting a value should not produce a new buffer.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto bitcast = builder.AddInstruction(HloInstruction::CreateUnary(
       scalar_shape_, HloOpcode::kBitcast, constant));
 
@@ -825,7 +826,7 @@ TEST_F(HloAliasAnalysisTest, BitcastInterference) {
   // interference.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto bitcast = builder.AddInstruction(HloInstruction::CreateUnary(
       scalar_shape_, HloOpcode::kBitcast, constant));
   builder.AddInstruction(HloInstruction::CreateTuple({constant, bitcast}));
@@ -844,13 +845,13 @@ TEST_F(HloAliasAnalysisTest, WhileInterference) {
   // the other use of the init.
   auto builder = HloComputation::Builder(TestName());
   auto init = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
 
   auto cond_builder = HloComputation::Builder("condition");
   auto cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, init->shape(), "param"));
   auto cond_root = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
diff --git a/tensorflow/compiler/xla/service/hlo_computation.cc b/tensorflow/compiler/xla/service/hlo_computation.cc
index 34b18b0e21..441288da1a 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.cc
+++ b/tensorflow/compiler/xla/service/hlo_computation.cc
@@ -284,9 +284,8 @@ void HloComputation::set_root_instruction(
   if (!IsFusionComputation()) {
     CHECK(ShapeUtil::Compatible(new_root_instruction->shape(),
                                 root_instruction_->shape()))
-        << new_root_instruction->shape().ShortDebugString()
-        << " is incompatible with "
-        << root_instruction_->shape().ShortDebugString();
+        << new_root_instruction->shape() << " is incompatible with "
+        << root_instruction_->shape();
   }
   bool root_found = false;
   for (auto& instruction : instructions_) {
@@ -529,8 +528,10 @@ HloInstruction* HloComputation::CreateFusionInstruction(
 }
 
 StatusOr<HloInstruction*> HloComputation::DeepCopyHelper(
-    HloInstruction* instruction, const ShapeTree<bool>* indices_to_copy,
-    ShapeTree<HloInstruction*>* copies_added, ShapeIndex* index) {
+    HloInstruction* instruction, ShapeIndex* index,
+    const std::function<
+        HloInstruction*(HloInstruction* leaf, const ShapeIndex& leaf_index,
+                        HloComputation* computation)>& copy_leaf) {
   if (ShapeUtil::IsTuple(instruction->shape())) {
     std::vector<HloInstruction*> elements;
     for (int64 i = 0; i < ShapeUtil::TupleElementCount(instruction->shape());
@@ -541,9 +542,8 @@ StatusOr<HloInstruction*> HloComputation::DeepCopyHelper(
               instruction, i));
 
       index->push_back(i);
-      TF_ASSIGN_OR_RETURN(
-          HloInstruction * element,
-          DeepCopyHelper(gte, indices_to_copy, copies_added, index));
+      TF_ASSIGN_OR_RETURN(HloInstruction * element,
+                          DeepCopyHelper(gte, index, copy_leaf));
       elements.push_back(element);
       index->pop_back();
     }
@@ -557,19 +557,7 @@ StatusOr<HloInstruction*> HloComputation::DeepCopyHelper(
 
   // Array shape.
   TF_RET_CHECK(ShapeUtil::IsArray(instruction->shape()));
-  if (indices_to_copy == nullptr || indices_to_copy->element(*index)) {
-    // Use kCopy to copy array elements
-    HloInstruction* copy = AddInstruction(HloInstruction::CreateUnary(
-        instruction->shape(), HloOpcode::kCopy, instruction));
-    if (copies_added != nullptr) {
-      *copies_added->mutable_element(*index) = copy;
-    }
-    return copy;
-  } else {
-    // Elements which are not to be copied are passed through
-    // transparently.
-    return instruction;
-  }
+  return copy_leaf(instruction, *index, this);
 }
 
 StatusOr<HloInstruction*> HloComputation::DeepCopyInstruction(
@@ -591,7 +579,36 @@ StatusOr<HloInstruction*> HloComputation::DeepCopyInstruction(
   }
 
   ShapeIndex index;
-  return DeepCopyHelper(instruction, indices_to_copy, copies_added, &index);
+  auto copy_leaf = [indices_to_copy, copies_added](
+                       HloInstruction* leaf, const ShapeIndex& leaf_index,
+                       HloComputation* computation) {
+    if (indices_to_copy == nullptr || indices_to_copy->element(leaf_index)) {
+      HloInstruction* copy = computation->AddInstruction(
+          HloInstruction::CreateUnary(leaf->shape(), HloOpcode::kCopy, leaf));
+      if (copies_added != nullptr) {
+        *copies_added->mutable_element(leaf_index) = copy;
+      }
+      return copy;
+    }
+    // Elements which are not to be copied are passed through
+    // transparently.
+    return leaf;
+  };
+  return DeepCopyHelper(instruction, &index, copy_leaf);
+}
+
+StatusOr<HloInstruction*> HloComputation::DeepCopyInstructionWithCustomCopier(
+    HloInstruction* instruction,
+    const std::function<
+        HloInstruction*(HloInstruction* leaf, const ShapeIndex& leaf_index,
+                        HloComputation* computation)>& copy_leaf) {
+  if (instruction->parent() != this) {
+    return FailedPrecondition(
+        "Can't deep copy instruction %s: instruction is not in computation %s",
+        instruction->name().c_str(), name().c_str());
+  }
+  ShapeIndex index;
+  return DeepCopyHelper(instruction, &index, copy_leaf);
 }
 
 ProgramShape HloComputation::ComputeProgramShape() const {
@@ -664,7 +681,7 @@ std::unique_ptr<HloReachabilityMap> HloComputation::ComputeReachability()
     inputs.assign(hlo->operands().begin(), hlo->operands().end());
     inputs.insert(inputs.end(), hlo->control_predecessors().begin(),
                   hlo->control_predecessors().end());
-    result->SetReachabilityToUnion(inputs, hlo);
+    result->FastSetReachabilityToUnion(inputs, hlo);
   }
   return result;
 }
@@ -881,4 +898,13 @@ void HloComputation::UniquifyName(NameUniquer* name_uniquer) {
   name_ = name_uniquer->GetUniqueName(name_);
 }
 
+HloInstruction* HloComputation::GetInstructionWithName(
+    tensorflow::StringPiece name) {
+  auto instructions_in_computation = instructions();
+  auto it = c_find_if(instructions_in_computation, [&](HloInstruction* instr) {
+    return instr->name() == name;
+  });
+  return it == instructions_in_computation.end() ? nullptr : *it;
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_computation.h b/tensorflow/compiler/xla/service/hlo_computation.h
index c1c3e79ebc..49ed65910f 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.h
+++ b/tensorflow/compiler/xla/service/hlo_computation.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_COMPUTATION_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_COMPUTATION_H_
 
+#include <functional>
 #include <list>
 #include <memory>
 #include <string>
@@ -254,6 +255,14 @@ class HloComputation {
       const ShapeTree<bool>* indices_to_copy = nullptr,
       ShapeTree<HloInstruction*>* copies_added = nullptr);
 
+  // As above, but uses a custom function to copy the leaf nodes, which could
+  // create alternative HLOs other than kCopy, or even pass-throughs.
+  StatusOr<HloInstruction*> DeepCopyInstructionWithCustomCopier(
+      HloInstruction* instruction,
+      const std::function<
+          HloInstruction*(HloInstruction* leaf, const ShapeIndex& leaf_index,
+                          HloComputation* computation)>& copy_leaf);
+
   // Computes and returns the ProgramShape of this computation (shape of
   // parameters and result with layout).
   ProgramShape ComputeProgramShape() const;
@@ -356,6 +365,10 @@ class HloComputation {
     unique_id_ = id;
   }
 
+  // Returns the instruction in this computation that has name `name`.  Returns
+  // null if there is no such computation.
+  HloInstruction* GetInstructionWithName(tensorflow::StringPiece name);
+
   int64 unique_id() const { return unique_id_; }
 
  private:
@@ -378,8 +391,10 @@ class HloComputation {
   // Internal helper for recursive copying of an instruction. Creates and
   // returns a deep copy of the given instruction.
   StatusOr<HloInstruction*> DeepCopyHelper(
-      HloInstruction* instruction, const ShapeTree<bool>* indices_to_copy,
-      ShapeTree<HloInstruction*>* copies_added, ShapeIndex* index);
+      HloInstruction* instruction, ShapeIndex* index,
+      const std::function<
+          HloInstruction*(HloInstruction* leaf, const ShapeIndex& leaf_index,
+                          HloComputation* computation)>& copy_leaf);
 
   // Internal helper to collect unreachable roots.
   std::vector<HloInstruction*> CollectUnreachableRoots() const;
diff --git a/tensorflow/compiler/xla/service/hlo_computation_test.cc b/tensorflow/compiler/xla/service/hlo_computation_test.cc
index a8f3f0e9c2..e4c5470331 100644
--- a/tensorflow/compiler/xla/service/hlo_computation_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_computation_test.cc
@@ -17,7 +17,7 @@ limitations under the License.
 
 #include <set>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
@@ -118,7 +118,7 @@ TEST_F(HloComputationTest, PostOrderSingleton) {
   // Test GetInstructionPostOrder for a computation with one instruction.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
   EXPECT_THAT(computation->MakeInstructionPostOrder(), ElementsAre(constant));
@@ -129,7 +129,7 @@ TEST_F(HloComputationTest, PostOrderSimple) {
   // instructions.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto negate1 = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kNegate, constant));
   auto negate2 = builder.AddInstruction(
@@ -144,7 +144,7 @@ TEST_F(HloComputationTest, PostOrderTrace) {
   // Test GetInstructionPostOrder for a computation with a trace instruction.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto negate1 = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kNegate, constant));
   auto trace =
@@ -163,13 +163,13 @@ TEST_F(HloComputationTest, PostOrderDisconnectedInstructions) {
   // which are not connected.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant4 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
   EXPECT_THAT(computation->MakeInstructionPostOrder(),
@@ -181,11 +181,11 @@ TEST_F(HloComputationTest, PostOrderWithMultipleRoots) {
   // which are not connected.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto add1 = builder.AddInstruction(HloInstruction::CreateBinary(
       r0f32_, HloOpcode::kAdd, constant1, constant2));
   auto add2 = builder.AddInstruction(HloInstruction::CreateBinary(
@@ -205,11 +205,11 @@ TEST_F(HloComputationTest, VisitWithMultipleRoots) {
   // computation has multiple roots (dead code).
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   // Add three disconnected add expressions.
   builder.AddInstruction(HloInstruction::CreateBinary(r0f32_, HloOpcode::kAdd,
                                                       constant1, constant2));
@@ -256,7 +256,7 @@ TEST_F(HloComputationTest, DeepCopyArray) {
   // Test that DeepCopyInstruction properly copies an array.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.0, 2.0, 3.0})));
+      LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0})));
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
   auto copy = computation->DeepCopyInstruction(constant).ValueOrDie();
@@ -268,9 +268,9 @@ TEST_F(HloComputationTest, DeepCopyTuple) {
   // Test that DeepCopyInstruction properly copies a tuple.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.0, 2.0, 3.0})));
+      LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0})));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
 
@@ -289,7 +289,7 @@ TEST_F(HloComputationTest, DeepCopyArrayAtIndices) {
   // copy are specified.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.0, 2.0, 3.0})));
+      LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0})));
   auto computation = builder.Build();
 
   {
@@ -314,9 +314,9 @@ TEST_F(HloComputationTest, DeepCopyTupleAtIndices) {
   // specified by the given indices.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.0, 2.0, 3.0})));
+      LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0})));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto computation = builder.Build();
@@ -375,7 +375,7 @@ TEST_F(HloComputationTest, DeepCopyToken) {
   // Test that DeepCopyInstruction properly handles tokens which should not be
   // copied.
   auto builder = HloComputation::Builder(TestName());
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
   auto copy = computation->DeepCopyInstruction(token).ValueOrDie();
@@ -388,9 +388,9 @@ TEST_F(HloComputationTest, DeepCopyTokenTuple) {
   // Test that DeepCopyInstruction properly handles tokens which should not be
   // copied.
   auto builder = HloComputation::Builder(TestName());
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
   auto tuple =
       builder.AddInstruction(HloInstruction::CreateTuple({token, constant}));
   auto module = CreateNewModule();
@@ -407,7 +407,7 @@ TEST_F(HloComputationTest, CycleDetection) {
   // Test whether the visitor can detect cycles in the graph.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto negate = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kNegate, constant));
   auto add = builder.AddInstruction(
@@ -433,7 +433,7 @@ TEST_F(HloComputationTest, RemoveInstructionWithDuplicateOperand) {
   // twice.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto dead_negate = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kNegate, constant));
   auto dead_add = builder.AddInstruction(HloInstruction::CreateBinary(
@@ -456,9 +456,9 @@ TEST_F(HloComputationTest, RemoveInstructionWithDuplicateOperand) {
 TEST_F(HloComputationTest, CloneWithControlDependency) {
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0f)));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
       r0f32_, HloOpcode::kAdd, constant1, constant2));
 
@@ -502,9 +502,9 @@ TEST_F(HloComputationTest, Reachability) {
   // There is a control dependency from 'add' to 'exp'.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0f)));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
       r0f32_, HloOpcode::kAdd, constant1, constant2));
   auto negate = builder.AddInstruction(
@@ -607,13 +607,14 @@ TEST_F(HloComputationTest, Stringification) {
   auto* computation = module->AddEntryComputation(builder.Build());
 
   auto options = HloPrintOptions().set_print_metadata(false);
-  EXPECT_EQ(computation->ToString(options),
-            R"(%TransposeDot (x: f32[5,10], y: f32[20,10]) -> f32[5,20] {
+  const string expected_computation =
+      R"(%TransposeDot (x: f32[5,10], y: f32[20,10]) -> f32[5,20] {
   %x = f32[5,10]{1,0} parameter(0)
   %y = f32[20,10]{1,0} parameter(1)
   %transpose = f32[10,20]{1,0} transpose(f32[20,10]{1,0} %y), dimensions={1,0}
   ROOT %dot = f32[5,20]{1,0} dot(f32[5,10]{1,0} %x, f32[10,20]{1,0} %transpose), lhs_contracting_dims={1}, rhs_contracting_dims={0}
-})");
+})";
+  EXPECT_EQ(computation->ToString(options), expected_computation);
 }
 
 TEST_F(HloComputationTest, StringificationIndent) {
@@ -639,13 +640,14 @@ TEST_F(HloComputationTest, StringificationIndent) {
 
   auto options =
       HloPrintOptions().set_print_metadata(false).set_indent_amount(2);
-  EXPECT_EQ(computation->ToString(options),
-            R"(    %TransposeDot (x: f32[5,10], y: f32[20,10]) -> f32[5,20] {
+  const string expected_computation =
+      R"(    %TransposeDot (x: f32[5,10], y: f32[20,10]) -> f32[5,20] {
       %x = f32[5,10]{1,0} parameter(0)
       %y = f32[20,10]{1,0} parameter(1)
       %transpose = f32[10,20]{1,0} transpose(f32[20,10]{1,0} %y), dimensions={1,0}
       ROOT %dot = f32[5,20]{1,0} dot(f32[5,10]{1,0} %x, f32[10,20]{1,0} %transpose), lhs_contracting_dims={1}, rhs_contracting_dims={0}
-    })");
+    })";
+  EXPECT_EQ(computation->ToString(options), expected_computation);
 }
 
 TEST_F(HloComputationTest, StringificationCanonical) {
@@ -670,21 +672,23 @@ TEST_F(HloComputationTest, StringificationCanonical) {
   auto* computation = module->AddEntryComputation(builder.Build());
 
   auto options = HloPrintOptions().set_print_metadata(false);
-  EXPECT_EQ(computation->ToString(options),
-            R"(%TransposeDot (x: f32[5,10], y: f32[20,10]) -> f32[5,20] {
+  const string expected_computation1 =
+      R"(%TransposeDot (x: f32[5,10], y: f32[20,10]) -> f32[5,20] {
   %x = f32[5,10]{1,0} parameter(0)
   %y = f32[20,10]{1,0} parameter(1)
   %transpose = f32[10,20]{1,0} transpose(f32[20,10]{1,0} %y), dimensions={1,0}
   ROOT %dot = f32[5,20]{1,0} dot(f32[5,10]{1,0} %x, f32[10,20]{1,0} %transpose), lhs_contracting_dims={1}, rhs_contracting_dims={0}
-})");
+})";
+  EXPECT_EQ(computation->ToString(options), expected_computation1);
 
   options = HloPrintOptions().Canonical();
-  EXPECT_EQ(computation->ToString(options), R"(TransposeDot {
+  const string expected_computation2 = R"(TransposeDot {
   tmp_0 = f32[5,10]{1,0} parameter(0)
   tmp_1 = f32[20,10]{1,0} parameter(1)
   tmp_2 = f32[10,20]{1,0} transpose(f32[20,10]{1,0} tmp_1), dimensions={1,0}
   ROOT tmp_3 = f32[5,20]{1,0} dot(f32[5,10]{1,0} tmp_0, f32[10,20]{1,0} tmp_2), lhs_contracting_dims={1}, rhs_contracting_dims={0}
-})");
+})";
+  EXPECT_EQ(computation->ToString(options), expected_computation2);
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/service/hlo_constant_folding.cc b/tensorflow/compiler/xla/service/hlo_constant_folding.cc
index 436d103f23..7229031c0c 100644
--- a/tensorflow/compiler/xla/service/hlo_constant_folding.cc
+++ b/tensorflow/compiler/xla/service/hlo_constant_folding.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_evaluator.h"
diff --git a/tensorflow/compiler/xla/service/hlo_constant_folding_test.cc b/tensorflow/compiler/xla/service/hlo_constant_folding_test.cc
index 5d05ccfc0b..64a42c1efc 100644
--- a/tensorflow/compiler/xla/service/hlo_constant_folding_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_constant_folding_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
@@ -41,7 +41,7 @@ using HloConstantFoldingTest = HloTestBase;
 TEST_F(HloConstantFoldingTest, ConvertF32ToS64) {
   HloComputation::Builder builder(TestName());
   HloInstruction* input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   builder.AddInstruction(
       HloInstruction::CreateConvert(ShapeUtil::MakeShape(S64, {}), input));
 
@@ -62,7 +62,7 @@ TEST_F(HloConstantFoldingTest, ConvertF32ToS64) {
 TEST_F(HloConstantFoldingTest, ConvertS64ToF32) {
   HloComputation::Builder builder(TestName());
   HloInstruction* input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int64>(42)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int64>(42)));
   builder.AddInstruction(
       HloInstruction::CreateConvert(ShapeUtil::MakeShape(F32, {}), input));
 
@@ -82,8 +82,8 @@ TEST_F(HloConstantFoldingTest, ConvertS64ToF32) {
 
 TEST_F(HloConstantFoldingTest, ConvertF32ArrayToS64Array) {
   HloComputation::Builder builder(TestName());
-  HloInstruction* input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({42.0f, 19.0f})));
+  HloInstruction* input = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<float>({42.0f, 19.0f})));
   builder.AddInstruction(
       HloInstruction::CreateConvert(ShapeUtil::MakeShape(S64, {2}), input));
 
@@ -120,7 +120,7 @@ TEST_F(HloConstantFoldingTest, Concatenate) {
     for (auto csize : test_config.concat_sizes) {
       dimensions[test_config.concat_dimension] = csize;
       concat_size += csize;
-      auto literal = Literal::CreateFromDimensions(F32, dimensions);
+      auto literal = LiteralUtil::CreateFromDimensions(F32, dimensions);
       HloInstruction* insn = builder.AddInstruction(
           HloInstruction::CreateConstant(std::move(literal)));
       operands.push_back(insn);
@@ -149,7 +149,7 @@ TEST_F(HloConstantFoldingTest, Slice) {
   const int64 slice_limits[] = {10, 8, 6, 5, 9};
   const int64 slice_strides[] = {1, 1, 1, 1, 1};
   TF_ASSERT_OK_AND_ASSIGN(auto literal,
-                          Literal::CreateRandomLiteral<F32>(
+                          LiteralUtil::CreateRandomLiteral<F32>(
                               ShapeUtil::MakeShape(F32, dimensions), 0.0, 1.0));
   HloInstruction* literal_instruction = builder.AddInstruction(
       HloInstruction::CreateConstant(std::move(literal)));
@@ -172,7 +172,7 @@ TEST_F(HloConstantFoldingTest, TransposeConstantFold) {
   HloComputation::Builder builder(TestName());
   const int64 dimensions[] = {11, 8, 7, 5, 9};
   TF_ASSERT_OK_AND_ASSIGN(auto literal,
-                          Literal::CreateRandomLiteral<F32>(
+                          LiteralUtil::CreateRandomLiteral<F32>(
                               ShapeUtil::MakeShape(F32, dimensions), 0.0, 1.0));
   auto literal_clone = literal->Literal::CloneToUnique();
   HloInstruction* literal_instruction = builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis.cc b/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
index 8955e26d5c..995521aed0 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
@@ -49,9 +49,9 @@ Status HloCostAnalysis::Preprocess(const HloInstruction* hlo) {
   // The default number of bytes accessed for an instruction is the sum of the
   // sizes of the inputs and outputs. The default ShapeUtil::ByteSizeOf does not
   // handle opaque types.
-  float bytes_accessed = shape_size_(hlo->shape());
+  float bytes_accessed = GetShapeSize(hlo->shape());
   for (const HloInstruction* operand : hlo->operands()) {
-    bytes_accessed += shape_size_(operand->shape());
+    bytes_accessed += GetShapeSize(operand->shape());
   }
   current_properties_[kBytesAccessedKey] = bytes_accessed;
 
@@ -121,6 +121,13 @@ Status HloCostAnalysis::HandleElementwiseOp(
   }
 }
 
+int64 HloCostAnalysis::GetShapeSize(const Shape& shape) const {
+  if (!LayoutUtil::HasLayout(shape)) {
+    return 0;
+  }
+  return shape_size_(shape);
+}
+
 Status HloCostAnalysis::HandleElementwiseUnary(const HloInstruction* hlo) {
   return HandleElementwiseOp(hlo);
 }
@@ -155,6 +162,10 @@ Status HloCostAnalysis::HandleConstant(const HloInstruction*) {
   return Status::OK();
 }
 
+Status HloCostAnalysis::HandleIota(const HloInstruction*) {
+  return Status::OK();
+}
+
 Status HloCostAnalysis::HandleGetTupleElement(const HloInstruction*) {
   // GetTupleElement forwards a pointer and does not touch each element in the
   // output.
@@ -164,7 +175,11 @@ Status HloCostAnalysis::HandleGetTupleElement(const HloInstruction*) {
   return Status::OK();
 }
 
-Status HloCostAnalysis::HandleSelect(const HloInstruction*) {
+Status HloCostAnalysis::HandleSelect(const HloInstruction* hlo) {
+  return HandleElementwiseOp(hlo);
+}
+
+Status HloCostAnalysis::HandleTupleSelect(const HloInstruction*) {
   return Status::OK();
 }
 
@@ -173,21 +188,21 @@ Status HloCostAnalysis::HandleReverse(const HloInstruction*) {
 }
 
 Status HloCostAnalysis::HandleSlice(const HloInstruction* slice) {
-  current_properties_[kBytesAccessedKey] = shape_size_(slice->shape()) * 2;
+  current_properties_[kBytesAccessedKey] = GetShapeSize(slice->shape()) * 2;
   return Status::OK();
 }
 
 Status HloCostAnalysis::HandleDynamicSlice(
     const HloInstruction* dynamic_slice) {
   current_properties_[kBytesAccessedKey] =
-      shape_size_(dynamic_slice->shape()) * 2;
+      GetShapeSize(dynamic_slice->shape()) * 2;
   return Status::OK();
 }
 
 Status HloCostAnalysis::HandleDynamicUpdateSlice(
     const HloInstruction* dynamic_update_slice) {
   current_properties_[kBytesAccessedKey] =
-      shape_size_(dynamic_update_slice->operand(1)->shape()) * 2;
+      GetShapeSize(dynamic_update_slice->operand(1)->shape()) * 2;
   return Status::OK();
 }
 
@@ -196,7 +211,7 @@ Status HloCostAnalysis::HandleTuple(const HloInstruction* tuple) {
   // through them). The memory touched is then only the size of the output
   // index table of the tuple.
 
-  current_properties_[kBytesAccessedKey] = shape_size_(tuple->shape());
+  current_properties_[kBytesAccessedKey] = GetShapeSize(tuple->shape());
   return Status::OK();
 }
 
@@ -538,15 +553,9 @@ Status HloCostAnalysis::HandleRng(const HloInstruction* random) {
 }
 
 Status HloCostAnalysis::HandleFusion(const HloInstruction* fusion) {
-  // Compute the properties of the fused expression and attribute them to the
-  // fusion node. Use a dummy shape_size to avoid any errors from trying to
-  // calculate the size of a shape that does not have a layout, since nodes
-  // inside fusion nodes do not necessarily have a layout assigned.
-  ShapeSizeFunction shape_size = [](const Shape& shape) { return 0; };
   TF_ASSIGN_OR_RETURN(
       current_properties_,
-      ProcessSubcomputation(fusion->fused_instructions_computation(),
-                            &shape_size));
+      ProcessSubcomputation(fusion->fused_instructions_computation()));
 
   // Fusion nodes that produce a tuple also produce the entries in the tuple.
   // Ignore the memory accessed inside fused ops, since fusion is supposed to
@@ -555,11 +564,11 @@ Status HloCostAnalysis::HandleFusion(const HloInstruction* fusion) {
   ShapeUtil::ForEachSubshape(
       fusion->shape(),
       [this](const Shape& subshape, const ShapeIndex& /*shape_index*/) {
-        current_properties_[kBytesAccessedKey] += shape_size_(subshape);
+        current_properties_[kBytesAccessedKey] += GetShapeSize(subshape);
       });
 
   for (const HloInstruction* operand : fusion->operands()) {
-    current_properties_[kBytesAccessedKey] += shape_size_(operand->shape());
+    current_properties_[kBytesAccessedKey] += GetShapeSize(operand->shape());
   }
 
   return Status::OK();
@@ -640,6 +649,11 @@ Status HloCostAnalysis::HandleGather(const HloInstruction* gather) {
   return Status::OK();
 }
 
+Status HloCostAnalysis::HandleScatter(const HloInstruction* scatter) {
+  // TODO(b/32945756): Compute the properties of the sub-computation.
+  return Status::OK();
+}
+
 Status HloCostAnalysis::FinishVisit(const HloInstruction*) {
   return Status::OK();
 }
@@ -677,11 +691,8 @@ float HloCostAnalysis::optimal_seconds(const HloInstruction& hlo) const {
 }
 
 StatusOr<HloCostAnalysis::Properties> HloCostAnalysis::ProcessSubcomputation(
-    HloComputation* computation, const ShapeSizeFunction* shape_size) {
-  if (shape_size == nullptr) {
-    shape_size = &shape_size_;
-  }
-  HloCostAnalysis visitor(*shape_size, per_second_rates_);
+    HloComputation* computation) {
+  HloCostAnalysis visitor(shape_size_, per_second_rates_);
   TF_RETURN_IF_ERROR(computation->Accept(&visitor));
   return visitor.properties();
 }
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis.h b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
index 44e5df587c..0a79c92f4a 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis.h
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
@@ -52,9 +52,11 @@ class HloCostAnalysis : public ConstDfsHloVisitor {
   Status HandleElementwiseUnary(const HloInstruction* hlo) override;
   Status HandleElementwiseBinary(const HloInstruction* hlo) override;
   Status HandleConstant(const HloInstruction* constant) override;
+  Status HandleIota(const HloInstruction* iota) override;
   Status HandleGetTupleElement(
       const HloInstruction* get_tuple_element) override;
-  Status HandleSelect(const HloInstruction* select) override;
+  Status HandleSelect(const HloInstruction* hlo) override;
+  Status HandleTupleSelect(const HloInstruction* hlo) override;
   Status HandleCompare(const HloInstruction* compare) override;
   Status HandleClamp(const HloInstruction* clamp) override;
   Status HandleReducePrecision(const HloInstruction* hlo) override;
@@ -102,6 +104,7 @@ class HloCostAnalysis : public ConstDfsHloVisitor {
   Status HandleWhile(const HloInstruction* xla_while) override;
   Status HandleConditional(const HloInstruction* conditional) override;
   Status HandleGather(const HloInstruction* gather) override;
+  Status HandleScatter(const HloInstruction* scatter) override;
   Status FinishVisit(const HloInstruction* root) override;
 
   Status Preprocess(const HloInstruction* hlo) override;
@@ -147,11 +150,8 @@ class HloCostAnalysis : public ConstDfsHloVisitor {
                   const Properties& per_second_rates);
 
   // Returns the properties computed from visiting the computation rooted at the
-  // given hlo. Uses shape_size_ to calculate shape sizes if shape_size is null,
-  // otherwise uses shape_size_.
-  StatusOr<Properties> ProcessSubcomputation(
-      HloComputation* computation,
-      const ShapeSizeFunction* shape_size = nullptr);
+  // given hlo.
+  StatusOr<Properties> ProcessSubcomputation(HloComputation* computation);
 
   // Utility function to handle all element-wise operations.
   Status HandleElementwiseOp(const HloInstruction* hlo_instruction);
@@ -168,6 +168,10 @@ class HloCostAnalysis : public ConstDfsHloVisitor {
   static float GetPropertyForHlo(const HloInstruction& hlo, const string& key,
                                  const HloToProperties& hlo_to_properties);
 
+  // Decorates shape_size_ by returning 0 immediately if the shape does not have
+  // a layout.
+  int64 GetShapeSize(const Shape& shape) const;
+
   // Function which computes the size of the top-level of a given shape (not
   // including nested elements, if any). If null then bytes_accessed methods
   // return an error.
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
index 9fc4c48226..2c854eea18 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
@@ -22,8 +22,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/local_service.h"
 #include "tensorflow/compiler/xla/service/service.h"
@@ -338,13 +338,13 @@ TEST_F(FusionCostAnalysis, LoopFusion) {
     //   tuple = Tuple({sub, sub, mul, C1})
     HloComputation::Builder builder(TestName());
     auto c1 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
             /*from=*/0.0f, /*to=*/1.0f, /*rows=*/2, /*cols=*/2)));
     auto c2 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
             /*from=*/1.0f, /*to=*/2.0f, /*rows=*/2, /*cols=*/2)));
     auto c3 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR2F32Linspace(
+        HloInstruction::CreateConstant(LiteralUtil::CreateR2F32Linspace(
             /*from=*/2.0f, /*to=*/3.0f, /*rows=*/2, /*cols=*/2)));
     auto add = builder.AddInstruction(
         HloInstruction::CreateBinary(r2f32, HloOpcode::kAdd, c1, c2));
@@ -391,9 +391,9 @@ TEST_F(FusionCostAnalysis, NoLayout) {
 
   HloComputation::Builder builder(TestName());
   auto c1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR4FromArray4D(Array4D<float>(2, 3, 4, 5))));
+      LiteralUtil::CreateR4FromArray4D(Array4D<float>(2, 3, 4, 5))));
   auto c2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({1, 2, 3})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({1, 2, 3})));
 
   auto broadcast = builder.AddInstruction(
       HloInstruction::CreateBroadcast(shape_without_layout, c2, {1}));
diff --git a/tensorflow/compiler/xla/service/hlo_creation_utils.cc b/tensorflow/compiler/xla/service/hlo_creation_utils.cc
index 0fb65c845a..90d2be118d 100644
--- a/tensorflow/compiler/xla/service/hlo_creation_utils.cc
+++ b/tensorflow/compiler/xla/service/hlo_creation_utils.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/shape_inference.h"
@@ -261,9 +262,9 @@ StatusOr<HloInstruction*> PadVectorWithZeros(HloInstruction* operand,
   padding_config_dim.set_edge_padding_high(zeros_to_append);
   *padding_config.add_dimensions() = padding_config_dim;
 
-  HloInstruction* zero =
-      computation->AddInstruction(HloInstruction::CreateConstant(
-          MakeUnique<Literal>(Literal::Zero(operand->shape().element_type()))));
+  HloInstruction* zero = computation->AddInstruction(
+      HloInstruction::CreateConstant(MakeUnique<Literal>(
+          LiteralUtil::Zero(operand->shape().element_type()))));
   return MakePadHlo(operand, zero, padding_config);
 }
 
@@ -272,7 +273,7 @@ StatusOr<HloInstruction*> BroadcastZeros(
     ArraySlice<int64> broadcast_dimensions) {
   HloInstruction* zero =
       computation->AddInstruction(HloInstruction::CreateConstant(
-          MakeUnique<Literal>(Literal::Zero(element_type))));
+          MakeUnique<Literal>(LiteralUtil::Zero(element_type))));
   return MakeBroadcastHlo(zero, /*broadcast_dimensions=*/{},
                           /*result_shape_bounds=*/broadcast_dimensions);
 }
diff --git a/tensorflow/compiler/xla/service/hlo_creation_utils_test.cc b/tensorflow/compiler/xla/service/hlo_creation_utils_test.cc
index 7e7c4f95fe..60d3e71757 100644
--- a/tensorflow/compiler/xla/service/hlo_creation_utils_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_creation_utils_test.cc
@@ -60,8 +60,8 @@ TEST_F(HloCreationUtilsTest, CollapseFirst1Dim) {
   HloEvaluator evaluator;
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
                           evaluator.Evaluate<std::unique_ptr<Literal>>(
-                              *module, {Literal::CreateR1<int32>({3, 4})}));
-  CHECK_EQ(*result_literal, *Literal::CreateR1<int32>({3, 4}));
+                              *module, {LiteralUtil::CreateR1<int32>({3, 4})}));
+  CHECK_EQ(*result_literal, *LiteralUtil::CreateR1<int32>({3, 4}));
 }
 
 TEST_F(HloCreationUtilsTest, CollapseFirst2Dims) {
@@ -82,10 +82,10 @@ TEST_F(HloCreationUtilsTest, CollapseFirst2Dims) {
       std::unique_ptr<Literal> result_literal,
       evaluator.Evaluate<std::unique_ptr<Literal>>(
           *module,
-          {Literal::CreateR3<int32>(
+          {LiteralUtil::CreateR3<int32>(
               {{{1, 2}, {3, 4}, {5, 6}}, {{-1, -2}, {-3, -4}, {-5, -6}}})}));
   CHECK_EQ(*result_literal,
-           *Literal::CreateR2<int32>(
+           *LiteralUtil::CreateR2<int32>(
                {{1, 2}, {3, 4}, {5, 6}, {-1, -2}, {-3, -4}, {-5, -6}}));
 }
 
@@ -103,10 +103,11 @@ TEST_F(HloCreationUtilsTest, Prepend1DegenerateDim) {
   entry_computation->set_root_instruction(with_1_degenerate_dim_prepended);
 
   HloEvaluator evaluator;
-  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
-                          evaluator.Evaluate<std::unique_ptr<Literal>>(
-                              *module, {Literal::CreateR1<int32>({9, 10})}));
-  CHECK_EQ(*result_literal, *Literal::CreateR2<int32>({{9, 10}}));
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<Literal> result_literal,
+      evaluator.Evaluate<std::unique_ptr<Literal>>(
+          *module, {LiteralUtil::CreateR1<int32>({9, 10})}));
+  CHECK_EQ(*result_literal, *LiteralUtil::CreateR2<int32>({{9, 10}}));
 }
 
 TEST_F(HloCreationUtilsTest, Prepend2DegenerateDims) {
@@ -123,10 +124,11 @@ TEST_F(HloCreationUtilsTest, Prepend2DegenerateDims) {
   entry_computation->set_root_instruction(with_2_degenerate_dims_prepended);
 
   HloEvaluator evaluator;
-  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
-                          evaluator.Evaluate<std::unique_ptr<Literal>>(
-                              *module, {Literal::CreateR1<int32>({9, 10})}));
-  CHECK_EQ(*result_literal, *Literal::CreateR3<int32>({{{9, 10}}}));
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<Literal> result_literal,
+      evaluator.Evaluate<std::unique_ptr<Literal>>(
+          *module, {LiteralUtil::CreateR1<int32>({9, 10})}));
+  CHECK_EQ(*result_literal, *LiteralUtil::CreateR3<int32>({{{9, 10}}}));
 }
 
 TEST_F(HloCreationUtilsTest, Prepend2DegenerateDimsToScalar) {
@@ -145,8 +147,8 @@ TEST_F(HloCreationUtilsTest, Prepend2DegenerateDimsToScalar) {
   HloEvaluator evaluator;
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
                           evaluator.Evaluate<std::unique_ptr<Literal>>(
-                              *module, {Literal::CreateR0<int32>(9)}));
-  CHECK_EQ(*result_literal, *Literal::CreateR2<int32>({{9}}));
+                              *module, {LiteralUtil::CreateR0<int32>(9)}));
+  CHECK_EQ(*result_literal, *LiteralUtil::CreateR2<int32>({{9}}));
 }
 
 TEST_F(HloCreationUtilsTest, ExpandFirstDimInto3Dims) {
@@ -166,9 +168,9 @@ TEST_F(HloCreationUtilsTest, ExpandFirstDimInto3Dims) {
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<Literal> result_literal,
       evaluator.Evaluate<std::unique_ptr<Literal>>(
-          *module, {Literal::CreateR1<int32>({1, 2, 3, 4, 5, 6})}));
+          *module, {LiteralUtil::CreateR1<int32>({1, 2, 3, 4, 5, 6})}));
   CHECK_EQ(*result_literal,
-           *Literal::CreateR3<int32>({{{1, 2}}, {{3, 4}}, {{5, 6}}}));
+           *LiteralUtil::CreateR3<int32>({{{1, 2}}, {{3, 4}}, {{5, 6}}}));
 }
 
 TEST_F(HloCreationUtilsTest, PadVectorWithZeros) {
@@ -188,8 +190,8 @@ TEST_F(HloCreationUtilsTest, PadVectorWithZeros) {
   HloEvaluator evaluator;
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
                           evaluator.Evaluate<std::unique_ptr<Literal>>(
-                              *module, {Literal::CreateR1<int32>({3, 4})}));
-  CHECK_EQ(*result_literal, *Literal::CreateR1<int32>({0, 0, 0, 3, 4, 0}));
+                              *module, {LiteralUtil::CreateR1<int32>({3, 4})}));
+  CHECK_EQ(*result_literal, *LiteralUtil::CreateR1<int32>({0, 0, 0, 3, 4, 0}));
 }
 
 TEST_F(HloCreationUtilsTest, BroadcastZeros_S32) {
@@ -209,8 +211,8 @@ TEST_F(HloCreationUtilsTest, BroadcastZeros_S32) {
   HloEvaluator evaluator;
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
                           evaluator.Evaluate<std::unique_ptr<Literal>>(
-                              *module, {Literal::CreateR0<int32>(0)}));
-  CHECK_EQ(*result_literal, *Literal::CreateR2<int32>({{0, 0}, {0, 0}}));
+                              *module, {LiteralUtil::CreateR0<int32>(0)}));
+  CHECK_EQ(*result_literal, *LiteralUtil::CreateR2<int32>({{0, 0}, {0, 0}}));
 }
 
 TEST_F(HloCreationUtilsTest, BroadcastZeros_F32) {
@@ -230,9 +232,9 @@ TEST_F(HloCreationUtilsTest, BroadcastZeros_F32) {
   HloEvaluator evaluator;
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result_literal,
                           evaluator.Evaluate<std::unique_ptr<Literal>>(
-                              *module, {Literal::CreateR0<float>(0.0f)}));
+                              *module, {LiteralUtil::CreateR0<float>(0.0f)}));
   CHECK_EQ(*result_literal,
-           *Literal::CreateR2<float>({{0.0f, 0.0f}, {0.0f, 0.0f}}));
+           *LiteralUtil::CreateR2<float>({{0.0f, 0.0f}, {0.0f, 0.0f}}));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/service/hlo_cse.cc b/tensorflow/compiler/xla/service/hlo_cse.cc
index a0ee889623..06484f4012 100644
--- a/tensorflow/compiler/xla/service/hlo_cse.cc
+++ b/tensorflow/compiler/xla/service/hlo_cse.cc
@@ -24,7 +24,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_domain_map.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -143,10 +143,8 @@ StatusOr<bool> HloCSE::Run(HloModule* module) {
       if (instruction->operand_count() == 0) {
         continue;
       }
-      // Skip instructions which have side effects or are a domain (which must
-      // not be CSE-ed).
-      if (instruction->HasSideEffect() ||
-          instruction->opcode() == HloOpcode::kDomain) {
+      // Skip instructions which have side effects.
+      if (instruction->HasSideEffect()) {
         continue;
       }
 
diff --git a/tensorflow/compiler/xla/service/hlo_cse_test.cc b/tensorflow/compiler/xla/service/hlo_cse_test.cc
index 16db374566..90fbaa37c5 100644
--- a/tensorflow/compiler/xla/service/hlo_cse_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_cse_test.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -53,9 +53,9 @@ TEST_F(HloCseTest, CombineTwoConstants) {
   // Test that two identical constants are commoned.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   builder.AddInstruction(HloInstruction::CreateBinary(
       constant1->shape(), HloOpcode::kAdd, constant1, constant2));
 
@@ -72,7 +72,7 @@ TEST_F(HloCseTest, CombineTwoConstants) {
   EXPECT_EQ(42.0f, constant->literal().Get<float>({}));
 
   auto result = ExecuteAndTransfer(std::move(module), {});
-  auto expected = Literal::CreateR0<float>(84.0);
+  auto expected = LiteralUtil::CreateR0<float>(84.0);
   EXPECT_TRUE(LiteralTestUtil::Near(*expected, *result, ErrorSpec(1e-4)));
 }
 
@@ -81,10 +81,10 @@ TEST_F(HloCseTest, CombineTwoConstantsDifferentLayoutsAndInsensitive) {
   // the pass is not layout sensitive.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({0, 1}))));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({1, 0}))));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
       constant1->shape(), HloOpcode::kAdd, constant1, constant2));
@@ -104,7 +104,7 @@ TEST_F(HloCseTest, CombineTwoConstantsDifferentLayoutsAndInsensitive) {
   EXPECT_THAT(add, op::Add(first_operand, first_operand));
 
   auto result = ExecuteAndTransfer(std::move(module), {});
-  auto expected = Literal::CreateR2<float>({{2.0, 4.0}, {6.0, 8.0}});
+  auto expected = LiteralUtil::CreateR2<float>({{2.0, 4.0}, {6.0, 8.0}});
   EXPECT_TRUE(LiteralTestUtil::Near(*expected, *result, ErrorSpec(1e-4)));
 }
 
@@ -113,10 +113,10 @@ TEST_F(HloCseTest, CombineTwoConstantsDifferentLayoutsAndSensitive) {
   // if the pass is layout sensitive.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({0, 1}))));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({1, 0}))));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
       constant1->shape(), HloOpcode::kAdd, constant1, constant2));
@@ -134,7 +134,7 @@ TEST_F(HloCseTest, CombineTwoConstantsDifferentLayoutsAndSensitive) {
   EXPECT_THAT(add, op::Add(constant1, constant2));
 
   auto result = ExecuteAndTransfer(std::move(module), {});
-  auto expected = Literal::CreateR2<float>({{2.0, 4.0}, {6.0, 8.0}});
+  auto expected = LiteralUtil::CreateR2<float>({{2.0, 4.0}, {6.0, 8.0}});
   EXPECT_TRUE(LiteralTestUtil::Near(*expected, *result, ErrorSpec(1e-4)));
 }
 
@@ -144,20 +144,20 @@ TEST_F(HloCseTest, ConstantsSameValueDifferentType) {
   auto builder = HloComputation::Builder(TestName());
   std::vector<HloInstruction*> constants;
   constants.push_back(builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<uint32>(42))));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<uint32>(42))));
   constants.push_back(builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42))));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(42))));
   constants.push_back(builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<uint64>(42.0))));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<uint64>(42.0))));
   constants.push_back(builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int64>(42.0))));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int64>(42.0))));
   constants.push_back(builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<double>(42.0))));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<double>(42.0))));
   constants.push_back(builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f))));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f))));
   // Duplicate the float constant to verify something happens.
   constants.push_back(builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f))));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f))));
 
   const Shape shape_r0 = ShapeUtil::MakeShape(F32, {});
   for (int64 i = 0; i < constants.size(); ++i) {
@@ -188,13 +188,13 @@ TEST_F(HloCseTest, NonscalarConstants) {
   // Test that identical nonscalar constants are merged.
   auto builder = HloComputation::Builder(TestName());
   auto common_constant1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   auto common_constant2 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   // Create a constant which has the same shape but a different value.
   auto uncommon_constant =
       builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR2<float>({{2.0, 4.0}, {6.0, 8.0}})));
+          LiteralUtil::CreateR2<float>({{2.0, 4.0}, {6.0, 8.0}})));
 
   // Tie the constants together with a tuple. This makes it easier to refer to
   // the constant instructions via their use.
@@ -223,7 +223,7 @@ TEST_F(HloCseTest, IdenticalInstructions) {
   // Test that three identical instructions are commoned.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
   auto exp1 = builder.AddInstruction(HloInstruction::CreateUnary(
       constant->shape(), HloOpcode::kExp, constant));
   auto exp2 = builder.AddInstruction(HloInstruction::CreateUnary(
@@ -239,7 +239,7 @@ TEST_F(HloCseTest, IdenticalInstructions) {
   EXPECT_EQ(5, computation->instruction_count());
   EXPECT_THAT(tuple, op::Tuple(exp1, exp2, exp3));
 
-  HloCSE cse(/*is_layout_sensitive=*/false);
+  HloCSE cse(/*is_layout_sensitive=*/true);
   EXPECT_TRUE(cse.Run(module.get()).ValueOrDie());
 
   EXPECT_EQ(3, computation->instruction_count());
@@ -248,12 +248,189 @@ TEST_F(HloCseTest, IdenticalInstructions) {
   EXPECT_THAT(tuple, op::Tuple(first_operand, first_operand, first_operand));
 }
 
+// Test two identical while loops with same inputs
+TEST_F(HloCseTest, WhileLoopsIdenticalConditionsAndBodiesSameInput) {
+  auto module = ParseHloString(R"(
+    HloModule WhileLoopsIdenticalConditionsAndBodiesSameInput
+
+    %body (param: (f32[], f32[])) -> (f32[], f32[]) {
+      %param = (f32[], f32[]) parameter(0)
+      %get-tuple-element = f32[] get-tuple-element((f32[], f32[]) %param),
+index=0 %get-tuple-element.1 = f32[] get-tuple-element((f32[], f32[]) %param),
+index=1 %add = f32[] add(f32[] %get-tuple-element, f32[] %get-tuple-element.1)
+      ROOT %tuple = (f32[], f32[]) tuple(f32[] %get-tuple-element, f32[] %add)
+    }
+
+    %condition (param.1: (f32[], f32[])) -> pred[] {
+      %param.1 = (f32[], f32[]) parameter(0)
+      ROOT %constant = pred[] constant(false)
+    }
+
+    %condition.1 (param.2: (f32[], f32[])) -> pred[] {
+      %param.2 = (f32[], f32[]) parameter(0)
+      ROOT %constant.1 = pred[] constant(false)
+    }
+
+    ENTRY %WhileLoopsIdenticalConditionsAndBodiesSameInput () -> (f32[], f32[])
+{ %constant.2 = f32[] constant(1) %constant.3 = f32[] constant(2) %tuple.1 =
+(f32[], f32[]) tuple(f32[] %constant.2, f32[] %constant.3) %while = (f32[],
+f32[]) while((f32[], f32[]) %tuple.1), condition=%condition, body=%body ROOT
+%while.1 = (f32[], f32[]) while((f32[], f32[]) %tuple.1),
+condition=%condition.1, body=%body
+    }
+    )")
+                    .ValueOrDie();
+
+  auto computation = module->entry_computation();
+
+  EXPECT_EQ(5, computation->instruction_count());
+  HloCSE cse(true);
+  EXPECT_TRUE(cse.Run(module.get()).ValueOrDie());
+  EXPECT_EQ(4, computation->instruction_count());
+}
+
+// Test two while loops with same conditions, same inputs, but different
+// bodies
+TEST_F(HloCseTest, WhileLoopsIdenticalConditionsSameInputAndDifferentBodies) {
+  auto module = ParseHloString(R"(
+    HloModule WhileLoopsIdenticalConditionsSameInputAndDifferentBodies
+
+    %body (param: (f32[], f32[])) -> (f32[], f32[]) {
+      %param = (f32[], f32[]) parameter(0)
+      %get-tuple-element = f32[] get-tuple-element((f32[], f32[]) %param),
+index=0 %get-tuple-element.1 = f32[] get-tuple-element((f32[], f32[]) %param),
+index=1 %add = f32[] add(f32[] %get-tuple-element, f32[] %get-tuple-element.1)
+      ROOT %tuple = (f32[], f32[]) tuple(f32[] %get-tuple-element, f32[] %add)
+    }
+
+    %body2 (param.1: (f32[], f32[])) -> (f32[], f32[]) {
+      %param.1 = (f32[], f32[]) parameter(0)
+      %get-tuple-element.2 = f32[] get-tuple-element((f32[], f32[]) %param.1),
+index=0 %get-tuple-element.3 = f32[] get-tuple-element((f32[], f32[]) %param.1),
+index=1 %sub = f32[] subtract(f32[] %get-tuple-element.2, f32[]
+%get-tuple-element.3) ROOT %tuple.2 = (f32[], f32[]) tuple(f32[]
+%get-tuple-element.2, f32[] %sub)
+    }
+
+    %condition (param.2: (f32[], f32[])) -> pred[] {
+      %param.2 = (f32[], f32[]) parameter(0)
+      ROOT %constant = pred[] constant(false)
+    }
+
+    %condition.1 (param.3: (f32[], f32[])) -> pred[] {
+      %param.3 = (f32[], f32[]) parameter(0)
+      ROOT %constant.1 = pred[] constant(false)
+    }
+
+    ENTRY %WhileLoopsIdenticalConditionsSameInputAndDifferentBodies () ->
+(f32[], f32[]) { %constant.2 = f32[] constant(1) %constant.3 = f32[] constant(2)
+      %tuple.1 = (f32[], f32[]) tuple(f32[] %constant.2, f32[] %constant.3)
+      %while = (f32[], f32[]) while((f32[], f32[]) %tuple.1),
+condition=%condition, body=%body ROOT %while.1 = (f32[], f32[]) while((f32[],
+f32[]) %tuple.1), condition=%condition.1, body=%body2
+    }
+    )")
+                    .ValueOrDie();
+
+  auto computation = module->entry_computation();
+
+  EXPECT_EQ(5, computation->instruction_count());
+  HloCSE cse(true);
+  EXPECT_FALSE(cse.Run(module.get()).ValueOrDie());
+  EXPECT_EQ(5, computation->instruction_count());
+}
+
+// Test two identical while loops with different inputs
+TEST_F(HloCseTest, WhileLoopsIdenticalConditionsAndBodiesDifferentInput) {
+  auto module = ParseHloString(R"(
+    HloModule WhileLoopsIdenticalConditionsAndBodiesDifferentInput
+
+    %body (param: (f32[], f32[])) -> (f32[], f32[]) {
+      %param = (f32[], f32[]) parameter(0)
+      %get-tuple-element = f32[] get-tuple-element((f32[], f32[]) %param),
+index=0 %get-tuple-element.1 = f32[] get-tuple-element((f32[], f32[]) %param),
+index=1 %add = f32[] add(f32[] %get-tuple-element, f32[] %get-tuple-element.1)
+      ROOT %tuple = (f32[], f32[]) tuple(f32[] %get-tuple-element, f32[] %add)
+    }
+
+    %condition (param.1: (f32[], f32[])) -> pred[] {
+      %param.1 = (f32[], f32[]) parameter(0)
+      ROOT %constant = pred[] constant(false)
+    }
+
+    %condition.1 (param.2: (f32[], f32[])) -> pred[] {
+      %param.2 = (f32[], f32[]) parameter(0)
+      ROOT %constant.1 = pred[] constant(false)
+    }
+
+    ENTRY %WhileLoopsIdenticalConditionsAndBodiesDifferentInput () -> (f32[],
+f32[]) { %constant.2 = f32[] constant(1) %constant.3 = f32[] constant(2)
+      %tuple.1 = (f32[], f32[]) tuple(f32[] %constant.2, f32[] %constant.3)
+      %while = (f32[], f32[]) while((f32[], f32[]) %tuple.1),
+condition=%condition, body=%body %constant.4 = f32[] constant(1) %constant.5 =
+f32[] constant(2) %tuple.2 = (f32[], f32[]) tuple(f32[] %constant.4, f32[]
+%constant.5) ROOT %while.1 = (f32[], f32[]) while((f32[], f32[]) %tuple.2),
+condition=%condition.1, body=%body
+    }
+
+    )")
+                    .ValueOrDie();
+
+  auto computation = module->entry_computation();
+
+  EXPECT_EQ(8, computation->instruction_count());
+  HloCSE cse(true);
+  EXPECT_FALSE(cse.Run(module.get()).ValueOrDie());
+  EXPECT_EQ(8, computation->instruction_count());
+}
+
+// Test two while loops with identical bodies and same inputs, but different
+// conditions
+TEST_F(HloCseTest, WhileLoopsIdenticalBodiesAndInputDifferntConditions) {
+  auto module = ParseHloString(R"(
+    HloModule WhileLoopsIdenticalBodiesAndInputDifferntConditions
+
+    %body (param: (f32[], f32[])) -> (f32[], f32[]) {
+      %param = (f32[], f32[]) parameter(0)
+      %get-tuple-element = f32[] get-tuple-element((f32[], f32[]) %param),
+index=0 %get-tuple-element.1 = f32[] get-tuple-element((f32[], f32[]) %param),
+index=1 %add = f32[] add(f32[] %get-tuple-element, f32[] %get-tuple-element.1)
+      ROOT %tuple = (f32[], f32[]) tuple(f32[] %get-tuple-element, f32[] %add)
+    }
+
+    %condition (param.1: (f32[], f32[])) -> pred[] {
+      %param.1 = (f32[], f32[]) parameter(0)
+      ROOT %constant = pred[] constant(false)
+    }
+
+    %condition.1 (param.2: (f32[], f32[])) -> pred[] {
+      %param.2 = (f32[], f32[]) parameter(0)
+      ROOT %constant.1 = pred[] constant(true)
+    }
+
+    ENTRY %WhileLoopsIdenticalBodiesAndInputDifferntConditions () -> (f32[],
+f32[]) { %constant.2 = f32[] constant(1) %constant.3 = f32[] constant(2)
+      %tuple.1 = (f32[], f32[]) tuple(f32[] %constant.2, f32[] %constant.3)
+      %while = (f32[], f32[]) while((f32[], f32[]) %tuple.1),
+condition=%condition, body=%body ROOT %while.1 = (f32[], f32[]) while((f32[],
+f32[]) %tuple.1), condition=%condition.1, body=%body
+    })")
+                    .ValueOrDie();
+
+  auto computation = module->entry_computation();
+
+  EXPECT_EQ(5, computation->instruction_count());
+  HloCSE cse(true);
+  EXPECT_FALSE(cse.Run(module.get()).ValueOrDie());
+  EXPECT_EQ(5, computation->instruction_count());
+}
+
 TEST_F(HloCseTest, IdenticalInstructionsDifferentLayoutsSensitive) {
   // Test that two identical instructions with different layouts are *not*
   // commoned if the pass is layout sensitive.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
 
   auto exp1 = builder.AddInstruction(HloInstruction::CreateUnary(
       constant->shape(), HloOpcode::kExp, constant));
@@ -284,7 +461,7 @@ TEST_F(HloCseTest, IdenticalInstructionsDifferentLayoutsInsensitive) {
   // the pass is layout insensitive.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
 
   auto exp1 = builder.AddInstruction(HloInstruction::CreateUnary(
       constant->shape(), HloOpcode::kExp, constant));
@@ -362,7 +539,7 @@ TEST_F(HloCseTest, IdenticalExpressions) {
   // The *1 instructions should be merged with the *2 instructions.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
 
   auto negate1 = builder.AddInstruction(HloInstruction::CreateUnary(
       constant->shape(), HloOpcode::kNegate, constant));
@@ -400,9 +577,9 @@ TEST_F(HloCseTest, DoNotCombineRng) {
   // Test that two RNG ops are not commoned.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0f)));
   auto rng1 = builder.AddInstruction(HloInstruction::CreateRng(
       ShapeUtil::MakeShape(F32, {}), RandomDistribution::RNG_UNIFORM,
       {constant1, constant2}));
@@ -442,9 +619,9 @@ TEST_F(HloCseTest, DoNotCombineCallsToImpureFunctions) {
     Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
     auto builder = HloComputation::Builder(TestName() + "_rng_fun");
     auto constant1 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
     auto constant2 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0f)));
     auto rng = builder.AddInstruction(HloInstruction::CreateRng(
         scalar_shape, RandomDistribution::RNG_UNIFORM, {constant1, constant2}));
     auto param = builder.AddInstruction(HloInstruction::CreateParameter(
@@ -459,7 +636,7 @@ TEST_F(HloCseTest, DoNotCombineCallsToImpureFunctions) {
   {
     auto builder = HloComputation::Builder(TestName());
     auto constant = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR1<float>({5.0f})));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({5.0f})));
     auto rng1 = builder.AddInstruction(
         HloInstruction::CreateMap(constant->shape(), {constant}, rng_function));
     auto rng2 = builder.AddInstruction(
@@ -521,9 +698,9 @@ TEST_F(HloCseTest, ConstantsSameValueInDifferentDomains) {
   // in this case) are not collapsed.
   auto builder = HloComputation::Builder(TestName());
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<uint32>(42)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<uint32>(42)));
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<uint32>(42)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<uint32>(42)));
 
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
@@ -536,5 +713,40 @@ TEST_F(HloCseTest, ConstantsSameValueInDifferentDomains) {
   EXPECT_EQ(2, computation->instruction_count());
 }
 
+TEST_F(HloCseTest, Domain) {
+  auto module = ParseHloString(R"(
+HloModule module
+ENTRY %entry {
+  %param = f32[] parameter(0), sharding={maximal device=0}
+  %domain.0 = f32[] domain(%param),
+    domain={kind="sharding", entry={maximal device=0}, exit={maximal device=1}}
+  %domain.1 = f32[] domain(%param),
+    domain={kind="sharding", entry={maximal device=0}, exit={maximal device=1}}
+  %domain.2 = f32[] domain(%param),
+    domain={kind="sharding", entry={maximal device=0}, exit={maximal device=2}}
+  %negate.0 = f32[] negate(%domain.0)
+  %negate.1 = f32[] negate(%domain.1)
+  %negate.2 = f32[] negate(%domain.2)
+  %domain.3 = f32[] domain(%negate.0),
+    domain={kind="sharding", entry={maximal device=1}, exit={maximal device=0}}
+  %domain.4 = f32[] domain(%negate.1),
+    domain={kind="sharding", entry={maximal device=1}, exit={maximal device=0}}
+  %domain.5 = f32[] domain(%negate.2),
+    domain={kind="sharding", entry={maximal device=2}, exit={maximal device=0}}
+  %add = f32[] add(%domain.3, %domain.4)
+  ROOT %sub = f32[] subtract(%add, %domain.5)
+})")
+                    .ValueOrDie();
+
+  HloCSE cse(/*is_layout_sensitive=*/false);
+  EXPECT_TRUE(cse.Run(module.get()).ValueOrDie());
+  LOG(INFO) << "AAAAA " << module->ToString();
+  const HloInstruction* sub = module->entry_computation()->root_instruction();
+  const HloInstruction* add = sub->operand(0);
+  EXPECT_EQ(add->operand(0), add->operand(1));
+  EXPECT_NE(add->operand(0), sub->operand(1));
+  EXPECT_NE(add->operand(1), sub->operand(1));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
index 8a4a9b5986..bbfb0c253f 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
@@ -398,18 +398,17 @@ bool HloDataflowAnalysis::UpdateSendValueSet(HloInstruction* send) {
 bool HloDataflowAnalysis::UpdateRecvDoneValueSet(HloInstruction* recv_done) {
   CHECK_EQ(recv_done->opcode(), HloOpcode::kRecvDone);
   bool changed = false;
-  // RecvDone forwards the operand value at {0} to the output.
+  // RecvDone forwards the operand value at {0} to element {0} of its output.
   for (auto& pair : GetInstructionValueSet(recv_done)) {
     ShapeIndex& index = pair.first;
     HloValueSet& value_set = pair.second;
 
-    ShapeIndex operand_index = {0};
-    for (int64 i : index) {
-      operand_index.push_back(i);
+    if (index.empty() || index[0] != 0) {
+      continue;
     }
 
     const HloValueSet& operand_value_set =
-        GetValueSet(recv_done->operand(0), operand_index);
+        GetValueSet(recv_done->operand(0), index);
     if (value_set != operand_value_set) {
       value_set = operand_value_set;
       changed = true;
@@ -578,17 +577,17 @@ bool HloDataflowAnalysis::UpdateParameterValueSet(HloInstruction* parameter) {
   }
 }
 
-bool HloDataflowAnalysis::UpdateSelectValueSet(HloInstruction* select) {
-  CHECK_EQ(select->opcode(), HloOpcode::kSelect);
-  // A phi value is not defined at a kSelect instruction because kSelect does
-  // not create a new value. Rather it forwards a value from its operands. This
-  // contrasts with kWhile instruction (which does define a phi value) which has
-  // in-place update semantics.
+bool HloDataflowAnalysis::UpdateTupleSelectValueSet(HloInstruction* select) {
+  CHECK_EQ(select->opcode(), HloOpcode::kTupleSelect);
+  // A phi value is not defined at a kTupleSelect instruction because
+  // kTupleSelect does not create a new value. Rather it forwards a value from
+  // its operands. This contrasts with kWhile instruction (which does define a
+  // phi value) which has in-place update semantics.
   bool changed = false;
   for (auto& pair : GetInstructionValueSet(select)) {
     const ShapeIndex& index = pair.first;
     if (index.empty()) {
-      // kSelect copies (not forwards) the top-level value.
+      // kTupleSelect copies (not forwards) the top-level value.
       continue;
     }
     HloValueSet& value_set = pair.second;
@@ -650,8 +649,8 @@ bool HloDataflowAnalysis::UpdateInstructionValueSet(
       return UpdateCopyValueSet(instruction);
     case HloOpcode::kGetTupleElement:
       return UpdateGetTupleElementValueSet(instruction);
-    case HloOpcode::kSelect:
-      return UpdateSelectValueSet(instruction);
+    case HloOpcode::kTupleSelect:
+      return UpdateTupleSelectValueSet(instruction);
     case HloOpcode::kTuple:
       return UpdateTupleValueSet(instruction);
     case HloOpcode::kParameter:
@@ -850,21 +849,25 @@ Status HloDataflowAnalysis::InitializeInstructionValueSets() {
           }
           break;
         case HloOpcode::kCopy:
-        case HloOpcode::kSelect:
+        case HloOpcode::kTupleSelect:
         case HloOpcode::kTuple:
           // These instructions only define their top-level values. Any other
           // values flow from their operands.
           define_top_level_only();
           break;
         case HloOpcode::kRecvDone:
-          // RecvDone aliases its input tuple element {0}, therefore does not
-          // define any values.
+          // RecvDone produces a two-element tuple. Element zero aliases its
+          // input tuple element {0}; element one is a token.
+          define_value_at(/*index=*/{});
+          define_value_at(/*index=*/{1});
           break;
         case HloOpcode::kSend:
-          // Send produces a tuple of {aliased operand, U32 context}, therefore
-          // only defines the top-level tuple and the tuple element at {1}.
+          // Send produces a tuple of {aliased operand, U32 context, token},
+          // therefore only defines the top-level tuple and the tuple elements
+          // at {1} and {2}.
           define_value_at(/*index=*/{});
           define_value_at(/*index=*/{1});
+          define_value_at(/*index=*/{2});
           break;
         default:
           define_all_values();
@@ -1014,19 +1017,17 @@ bool HloDataflowAnalysis::CanShareOperandBufferWithUser(
   }
 
   if (user->opcode() == HloOpcode::kFusion) {
+    if (fusion_can_share_buffer_ != nullptr) {
+      return fusion_can_share_buffer_(user, operand);
+    }
     // Get the parameter associated with 'operand';
     HloInstruction* fusion_param =
         user->fused_parameter(user->operand_index(operand));
 
     const HloValue& value = GetValueDefinedAt(fusion_param, operand_index);
-    if (value.uses().size() != 1) {
-      if (MultiDynamicSliceUseShareSameIndices(value.uses())) {
-        return true;
-      }
-      return false;
+    if (MultiDynamicSliceUseShareSameIndices(value.uses())) {
+      return true;
     }
-    const HloUse& use = value.uses()[0];
-
     if (user->fusion_kind() == HloInstruction::FusionKind::kLoop ||
         user->fusion_kind() == HloInstruction::FusionKind::kInput) {
       if (user->fused_expression_root()->opcode() ==
@@ -1036,13 +1037,17 @@ bool HloDataflowAnalysis::CanShareOperandBufferWithUser(
         // Returns true iff there is exactly one use of 'operand' at shape index
         // 'operand_index', and this singleton use is the fused root at operand
         // index 0.
-        return use.instruction == user->fused_expression_root() &&
-               use.operand_number == 0;
-      } else {
-        return AreTransitiveUsesElementwiseOrTuple(fusion_param);
+        if (value.uses().size() == 1) {
+          const HloUse& use = value.uses()[0];
+          return use.instruction == user->fused_expression_root() &&
+                 use.operand_number == 0;
+        }
+        return false;
       }
-    } else if (user->fusion_kind() == HloInstruction::FusionKind::kOutput &&
-               user->fused_expression_root()->opcode() == HloOpcode::kAdd) {
+      return AreTransitiveUsesElementwiseOrTuple(fusion_param);
+    }
+    if (user->fusion_kind() == HloInstruction::FusionKind::kOutput &&
+        user->fused_expression_root()->opcode() == HloOpcode::kAdd) {
       // Output fusion with kAdd fused root.
 
       // Check if one operand of kAdd fused root is kDot or kConvolution.
@@ -1063,11 +1068,12 @@ bool HloDataflowAnalysis::CanShareOperandBufferWithUser(
       // Returns true iff there is exactly one use of 'operand' at shape index
       // 'operand_index', and this singleton use is the fused root (at operand
       // index 'other_add_operand_index').
-      return use.instruction == user->fused_expression_root() &&
-             use.operand_number == other_add_operand_index;
-    } else if (fusion_can_share_buffer_ != nullptr &&
-               fusion_can_share_buffer_(user, operand)) {
-      return true;
+      if (value.uses().size() == 1) {
+        const HloUse& use = value.uses()[0];
+        return use.instruction == user->fused_expression_root() &&
+               use.operand_number == other_add_operand_index;
+      }
+      return false;
     }
   }
 
@@ -1078,6 +1084,21 @@ bool HloDataflowAnalysis::CanShareOperandBufferWithUser(
     std::vector<int64> operand_indices = user->OperandIndices(operand);
     return operand_indices.size() == 1 && operand_indices[0] == 0;
   }
+  if (user->opcode() == HloOpcode::kSort) {
+    // Only valid if there are no other users.
+    if (operand->users().size() != 1) {
+      return false;
+    }
+    // If we only sort keys, the output of sort is not a tuple, so we can always
+    // share the buffer.
+    if (user->operand_count() == 1) {
+      return true;
+    }
+    CHECK(!user_index.empty());
+    // Only share with the right tuple element buffer.
+    std::vector<int64> operand_indices = user->OperandIndices(operand);
+    return operand_indices.size() == 1 && user_index[0] == operand_indices[0];
+  }
   if (user->opcode() == HloOpcode::kCall) {
     // Get all uses of value defined by 'operand' at 'operand_index'.
     const auto& uses = GetValueDefinedAt(operand, operand_index).uses();
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h
index 9fea218af0..f4abc7a7c7 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h
@@ -189,7 +189,7 @@ class HloDataflowAnalysis {
   bool UpdateGetTupleElementValueSet(HloInstruction* gte);
   bool UpdateParameterValueSet(HloInstruction* parameter);
   bool UpdateRecvDoneValueSet(HloInstruction* recv_done);
-  bool UpdateSelectValueSet(HloInstruction* select);
+  bool UpdateTupleSelectValueSet(HloInstruction* select);
   bool UpdateSendValueSet(HloInstruction* send);
   bool UpdateTupleValueSet(HloInstruction* tuple);
   bool UpdateWhileValueSet(HloInstruction* xla_while);
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
index 70254e2c1a..2ec31a9148 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_dataflow_analysis.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
@@ -101,9 +101,9 @@ TEST_P(HloDataflowAnalysisTest, BinaryOperation) {
   // Test the dataflow for a simple binary operation (Add).
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
       scalar_shape_, HloOpcode::kAdd, constant1, constant2));
   module_->AddEntryComputation(builder.Build());
@@ -198,9 +198,9 @@ TEST_P(HloDataflowAnalysisTest, NestedTuple) {
   // Verify the dataflow through a nested tuple.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto nested_tuple = builder.AddInstruction(
@@ -259,9 +259,9 @@ TEST_P(HloDataflowAnalysisTest, SingleCall) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto call = builder.AddInstruction(HloInstruction::CreateCall(
       scalar_shape_, {constant1, constant2}, called_computation));
   module_->AddEntryComputation(builder.Build());
@@ -308,9 +308,9 @@ TEST_P(HloDataflowAnalysisTest, ComputationCalledTwiceWithSameArguments) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto call1 = builder.AddInstruction(HloInstruction::CreateCall(
       scalar_shape_, {constant1, constant2}, called_computation));
   auto call2 = builder.AddInstruction(HloInstruction::CreateCall(
@@ -362,9 +362,9 @@ TEST_P(HloDataflowAnalysisTest, ComputationCalledTwiceWithDifferentArguments) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto call1 = builder.AddInstruction(HloInstruction::CreateCall(
       scalar_shape_, {constant1, constant2}, called_computation));
   auto call2 = builder.AddInstruction(HloInstruction::CreateCall(
@@ -426,9 +426,9 @@ TEST_P(HloDataflowAnalysisTest, NestedCalls) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto call = builder.AddInstruction(HloInstruction::CreateCall(
       scalar_shape_, {constant1, constant2}, outer_computation));
   module_->AddEntryComputation(builder.Build());
@@ -493,15 +493,15 @@ TEST_P(HloDataflowAnalysisTest, SingleWhile) {
   auto cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   auto cond_constant = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto xla_while = builder.AddInstruction(
@@ -594,15 +594,15 @@ TEST_P(HloDataflowAnalysisTest, SequentialWhiles) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto xla_while0 = builder.AddInstruction(
@@ -653,7 +653,7 @@ TEST_P(HloDataflowAnalysisTest, NestedWhiles) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
@@ -691,9 +691,9 @@ TEST_P(HloDataflowAnalysisTest, NestedWhiles) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto entry_while = builder.AddInstruction(
@@ -780,15 +780,15 @@ TEST_P(HloDataflowAnalysisTest, SwizzlingWhile) {
   auto cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto xla_while = builder.AddInstruction(
@@ -840,11 +840,11 @@ TEST_P(HloDataflowAnalysisTest, ArraySelect) {
   // Test a kSelect of an array value.
   auto builder = HloComputation::Builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
       scalar_shape_, HloOpcode::kSelect, pred, constant1, constant2));
 
@@ -860,19 +860,18 @@ TEST_P(HloDataflowAnalysisTest, ArraySelect) {
 }
 
 TEST_P(HloDataflowAnalysisTest, TupleSelect) {
-  // Test a kSelect of a tuple value. Non-top-level element flow through the
-  // instruction.
+  // Test a kTupleSelect. Non-top-level element flow through the instruction.
   auto builder = HloComputation::Builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto constant4 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(4.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(4.0)));
   auto tuple1 =
       builder.AddInstruction(HloInstruction::CreateTuple({constant1}));
   auto tuple2 =
@@ -883,20 +882,20 @@ TEST_P(HloDataflowAnalysisTest, TupleSelect) {
       builder.AddInstruction(HloInstruction::CreateTuple({constant4}));
   const Shape tuple_shape = tuple1->shape();
   auto select11 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, tuple1, tuple1));
+      tuple_shape, HloOpcode::kTupleSelect, pred, tuple1, tuple1));
   auto select12 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple_shape, HloOpcode::kTupleSelect, pred, tuple1, tuple2));
   auto select34 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, tuple3, tuple4));
+      tuple_shape, HloOpcode::kTupleSelect, pred, tuple3, tuple4));
   auto select1234 = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, select12, select34));
+      tuple_shape, HloOpcode::kTupleSelect, pred, select12, select34));
 
   module_->AddEntryComputation(builder.Build());
 
   bool ssa_form = GetParam();
   const HloDataflowAnalysis& analysis = RunAnalysis(ssa_form);
 
-  // Top-level value is always defined by a kSelect.
+  // Top-level value is always defined by a kTupleSelect.
   EXPECT_TRUE(analysis.ValueIsDefinedAt(select11));
   EXPECT_TRUE(analysis.ValueIsDefinedAt(select12));
   EXPECT_TRUE(analysis.ValueIsDefinedAt(select34));
@@ -937,20 +936,20 @@ TEST_P(HloDataflowAnalysisTest, TupleSelect) {
 }
 
 TEST_P(HloDataflowAnalysisTest, NestedTupleSelect) {
-  // Test kSelect of a nested tuple.
+  // Test kTupleSelect of a nested tuple.
   auto builder = HloComputation::Builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto constant4 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(4.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(4.0)));
   auto constant5 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(5.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(5.0)));
   auto inner_tuple1 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant2, constant3}));
   auto tuple1 = builder.AddInstruction(
@@ -960,7 +959,7 @@ TEST_P(HloDataflowAnalysisTest, NestedTupleSelect) {
   auto tuple2 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant4, inner_tuple2}));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple1->shape(), HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple1->shape(), HloOpcode::kTupleSelect, pred, tuple1, tuple2));
 
   module_->AddEntryComputation(builder.Build());
 
@@ -983,7 +982,7 @@ TEST_P(HloDataflowAnalysisTest, NestedTupleSelect) {
 }
 
 TEST_P(HloDataflowAnalysisTest, TupleSelectToWhile) {
-  // Test a tuple-shaped kSelect feeding a kWhile instruction. HLO:
+  // Test a tuple-shaped kTupleSelect feeding a kWhile instruction. HLO:
   //
   // body((F32[], F32[]) %tuple_param):
   //   %add = Add(%tuple_param{0}, %tuple_param{1})
@@ -1026,24 +1025,24 @@ TEST_P(HloDataflowAnalysisTest, TupleSelectToWhile) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, tuple_shape, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto tuple1 =
       builder.AddInstruction(HloInstruction::CreateTuple({constant1}));
   auto tuple2 =
       builder.AddInstruction(HloInstruction::CreateTuple({constant2}));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple1->shape(), HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple1->shape(), HloOpcode::kTupleSelect, pred, tuple1, tuple2));
   auto gte = builder.AddInstruction(
       HloInstruction::CreateGetTupleElement(scalar_shape_, select, 0));
   auto tuple =
@@ -1089,7 +1088,7 @@ TEST_P(HloDataflowAnalysisTest, BitcastDefinesValue) {
   // Test the bitcast_defines_value flag to the dataflow analysis.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto bitcast = builder.AddInstruction(HloInstruction::CreateUnary(
       scalar_shape_, HloOpcode::kBitcast, constant));
 
@@ -1158,7 +1157,7 @@ TEST_P(HloDataflowAnalysisTest, SendAndSendDone) {
   auto builder = HloComputation::Builder(TestName());
   auto param = builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "param0"));
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto send = builder.AddInstruction(
       HloInstruction::CreateSend(param, token, /*channel_id=*/0));
   auto send_done = builder.AddInstruction(HloInstruction::CreateSendDone(send));
@@ -1167,22 +1166,23 @@ TEST_P(HloDataflowAnalysisTest, SendAndSendDone) {
   bool ssa_form = GetParam();
   const HloDataflowAnalysis& analysis = RunAnalysis(ssa_form);
 
-  EXPECT_EQ(analysis.values().size(), 5);
+  EXPECT_EQ(analysis.values().size(), 6);
 
   EXPECT_TRUE(analysis.ValueIsDefinedAt(param));
   EXPECT_TRUE(analysis.ValueIsDefinedAt(send, /*index=*/{}));
   EXPECT_FALSE(analysis.ValueIsDefinedAt(send, /*index=*/{0}));
   EXPECT_TRUE(analysis.ValueIsDefinedAt(send, /*index=*/{1}));
+  EXPECT_TRUE(analysis.ValueIsDefinedAt(send, /*index=*/{2}));
   EXPECT_TRUE(analysis.ValueIsDefinedAt(send_done));
   EXPECT_THAT(HloValuesAt(send, /*index=*/{0}),
               UnorderedElementsAre(analysis.GetValueDefinedAt(param)));
 }
 
 TEST_P(HloDataflowAnalysisTest, RecvAndRecvDone) {
-  // Test that a RecvDone forwards its operand tuple element at {0} to the
-  // output.
+  // Test that a RecvDone forwards its operand tuple element at {0} to element
+  // {0} of the output.
   auto builder = HloComputation::Builder(TestName());
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto recv = builder.AddInstruction(
       HloInstruction::CreateRecv(scalar_shape_, token, /*channel_id=*/0));
   auto recv_done = builder.AddInstruction(HloInstruction::CreateRecvDone(recv));
@@ -1191,13 +1191,16 @@ TEST_P(HloDataflowAnalysisTest, RecvAndRecvDone) {
   bool ssa_form = GetParam();
   const HloDataflowAnalysis& analysis = RunAnalysis(ssa_form);
 
-  EXPECT_EQ(analysis.values().size(), 4);
+  EXPECT_EQ(analysis.values().size(), 7);
 
   EXPECT_TRUE(analysis.ValueIsDefinedAt(recv, /*index=*/{}));
   EXPECT_TRUE(analysis.ValueIsDefinedAt(recv, /*index=*/{0}));
   EXPECT_TRUE(analysis.ValueIsDefinedAt(recv, /*index=*/{1}));
-  EXPECT_FALSE(analysis.ValueIsDefinedAt(recv_done));
-  EXPECT_THAT(HloValuesAt(recv_done),
+  EXPECT_TRUE(analysis.ValueIsDefinedAt(recv, /*index=*/{2}));
+  EXPECT_TRUE(analysis.ValueIsDefinedAt(recv_done, /*index=*/{}));
+  EXPECT_FALSE(analysis.ValueIsDefinedAt(recv_done, /*index=*/{0}));
+  EXPECT_TRUE(analysis.ValueIsDefinedAt(recv_done, /*index=*/{1}));
+  EXPECT_THAT(HloValuesAt(recv_done, /*index=*/{0}),
               UnorderedElementsAre(analysis.GetValueDefinedAt(recv, {0})));
   EXPECT_TRUE(
       analysis.GetValueDefinedAt(recv, /*index=*/{0}).live_out_of_module());
@@ -1306,13 +1309,13 @@ TEST_P(HloDataflowAnalysisTest, WhileParameters_Sequential) {
   auto body_param = body_builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "body_param"));
   auto constant = body_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto exp = body_builder.AddInstruction(
       HloInstruction::CreateUnary(scalar_shape_, HloOpcode::kExp, constant));
   auto add = body_builder.AddInstruction(HloInstruction::CreateBinary(
       scalar_shape_, HloOpcode::kAdd, exp, body_param));
   auto dead_constant = body_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto dead_negate = body_builder.AddInstruction(HloInstruction::CreateUnary(
       scalar_shape_, HloOpcode::kNegate, dead_constant));
   HloComputation* body = module_->AddEmbeddedComputation(
@@ -1322,7 +1325,7 @@ TEST_P(HloDataflowAnalysisTest, WhileParameters_Sequential) {
   auto cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_shape_, "cond_param"));
   auto cond_constant = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   HloComputation* condition =
       module_->AddEmbeddedComputation(cond_builder.Build());
 
@@ -1573,11 +1576,11 @@ TEST_P(HloDataflowAnalysisTest, ConditionalWithIdentity) {
 
   auto builder = HloComputation::Builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(56.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(56.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(12.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(12.0f)));
   auto conditional = builder.AddInstruction(HloInstruction::CreateConditional(
       scalar_shape_, pred, constant1, true_computation, constant2,
       false_computation));
@@ -1664,11 +1667,11 @@ TEST_P(HloDataflowAnalysisTest, ConditionalTakingTupleOperand) {
 
   auto builder = HloComputation::Builder(TestName());
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(56.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(56.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(12.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(12.0f)));
   auto tuple_operand = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto conditional = builder.AddInstruction(HloInstruction::CreateConditional(
@@ -1794,15 +1797,15 @@ TEST_P(HloDataflowAnalysisTest, NestedConditionals) {
   // Build entry computation.
   auto builder = HloComputation::Builder(TestName());
   auto pred1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   auto pred2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.2f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.2f)));
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.3f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.3f)));
   auto tuple_operand = builder.AddInstruction(
       HloInstruction::CreateTuple({pred2, constant1, constant2}));
   auto conditional = builder.AddInstruction(HloInstruction::CreateConditional(
@@ -1940,9 +1943,9 @@ TEST_F(DoesNotUseOperandBufferTest, FusedDynamicUpdateSlice) {
 
   // Create a DynamicUpdateSlice instruction of tuple element 1.
   auto starts = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
   auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+      LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f})));
   auto dynamic_update_slice =
       builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
           data_shape, gte1, update, starts));
@@ -2045,7 +2048,7 @@ TEST_F(CanShareOperandBufferWithUserTest,
   Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
 
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto operand = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape, one, {1}));
 
@@ -2073,7 +2076,7 @@ TEST_F(CanShareOperandBufferWithUserTest,
   auto param = builder.AddInstruction(
       HloInstruction::CreateParameter(0, data_shape, "param0"));
   auto index = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int64>({0, 0})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int64>({0, 0})));
   auto ds = builder.AddInstruction(
       HloInstruction::CreateDynamicSlice(slice_shape, param, index, {1, 2, 2}));
 
@@ -2141,9 +2144,9 @@ TEST_F(CanShareOperandBufferWithUserTest, FusedDynamicUpdateSlice) {
 
   // Create a DynamicUpdateSlice instruction of tuple element 1.
   auto starts = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
   auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+      LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f})));
   auto dynamic_update_slice =
       builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
           data_shape, gte1, update, starts));
@@ -2181,9 +2184,9 @@ TEST_F(CanShareOperandBufferWithUserTest,
 
   // Create a DynamicUpdateSlice instruction of tuple element 1.
   auto starts = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
   auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+      LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f})));
   auto dynamic_update_slice =
       builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
           data_shape_bf16, convert1, update, starts));
@@ -2229,14 +2232,56 @@ TEST_F(CanShareOperandBufferWithUserTest, DynamicUpdateSliceCanShare) {
       dataflow_analysis_->CanShareOperandBufferWithUser(starts, {}, dus, {}));
 }
 
+TEST_F(CanShareOperandBufferWithUserTest, SortCanShare) {
+  auto builder = HloComputation::Builder(TestName());
+
+  Shape keys_shape = ShapeUtil::MakeShape(F32, {8});
+  auto keys = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, keys_shape, "keys"));
+  auto sort =
+      builder.AddInstruction(HloInstruction::CreateSort(keys_shape, 0, keys));
+
+  BuildModuleAndRunAnalysis(builder.Build());
+
+  EXPECT_TRUE(
+      dataflow_analysis_->CanShareOperandBufferWithUser(keys, {}, sort, {}));
+}
+
+TEST_F(CanShareOperandBufferWithUserTest, SortCanShareWithTupleUser) {
+  auto builder = HloComputation::Builder(TestName());
+
+  Shape keys_shape = ShapeUtil::MakeShape(F32, {8});
+  Shape values_shape = ShapeUtil::MakeShape(F32, {8});
+  auto keys = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, keys_shape, "keys"));
+  auto values = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, values_shape, "values"));
+  auto sort = builder.AddInstruction(HloInstruction::CreateSort(
+      ShapeUtil::MakeTupleShape({keys_shape, values_shape}), 0, keys, values));
+
+  BuildModuleAndRunAnalysis(builder.Build());
+
+  // The buffer for the keys can be shared with the first tuple entry.
+  EXPECT_TRUE(
+      dataflow_analysis_->CanShareOperandBufferWithUser(keys, {}, sort, {0}));
+  // The buffer for the values can be shared with the second tuple entry.
+  EXPECT_TRUE(
+      dataflow_analysis_->CanShareOperandBufferWithUser(values, {}, sort, {1}));
+  // Verify that the buffers are not shared with the "wrong" tuple entry.
+  EXPECT_FALSE(
+      dataflow_analysis_->CanShareOperandBufferWithUser(keys, {}, sort, {1}));
+  EXPECT_FALSE(
+      dataflow_analysis_->CanShareOperandBufferWithUser(values, {}, sort, {0}));
+}
+
 TEST_F(CanShareOperandBufferWithUserTest, FusedDotAdd) {
   auto builder = HloComputation::Builder(TestName());
   Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
 
   auto a = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 0.0}, {0.0, 1.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 0.0}, {0.0, 1.0}})));
   auto b = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
+      LiteralUtil::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
 
   DotDimensionNumbers dot_dnums;
   dot_dnums.add_lhs_contracting_dimensions(1);
@@ -2245,7 +2290,7 @@ TEST_F(CanShareOperandBufferWithUserTest, FusedDotAdd) {
       HloInstruction::CreateDot(data_shape, a, b, dot_dnums));
 
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto add_operand = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape, one, {1}));
 
@@ -2267,7 +2312,7 @@ TEST_F(CanShareOperandBufferWithUserTest, OutputFusionCantAliasOperandBuffer) {
   Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
 
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto operand = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape, one, {1}));
 
@@ -2275,7 +2320,7 @@ TEST_F(CanShareOperandBufferWithUserTest, OutputFusionCantAliasOperandBuffer) {
       HloInstruction::CreateReverse(data_shape, operand, {0, 1}));
 
   auto two = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
+      LiteralUtil::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
 
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(data_shape, HloOpcode::kAdd, reverse, two));
@@ -2295,13 +2340,13 @@ TEST_F(CanShareOperandBufferWithUserTest, FusionCanShareBufferCustomized) {
   Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
 
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto operand = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape, one, {1}));
   auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
       data_shape, HloOpcode::kMultiply, operand, operand));
   auto two = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
+      LiteralUtil::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(data_shape, HloOpcode::kAdd, mul, two));
 
@@ -2367,7 +2412,7 @@ TEST_F(CanShareOperandBufferWithUserTest, CallToComputationWithFusionRoot) {
   auto sub_param = sub_builder.AddInstruction(
       HloInstruction::CreateParameter(0, shape, "sub_param"));
   auto one = sub_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto ones = sub_builder.AddInstruction(
       HloInstruction::CreateBroadcast(shape, one, {1}));
   auto add = sub_builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/hlo_dce_test.cc b/tensorflow/compiler/xla/service/hlo_dce_test.cc
index f5524dc6fe..26e3736e01 100644
--- a/tensorflow/compiler/xla/service/hlo_dce_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_dce_test.cc
@@ -53,9 +53,9 @@ TEST_F(HloDceTest, NoDeadCode) {
   // Verify that no dead code is removed from a computation with no dead code.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(123.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(123.0f)));
   builder.AddInstruction(HloInstruction::CreateBinary(
       constant1->shape(), HloOpcode::kAdd, constant1, constant2));
 
@@ -74,8 +74,8 @@ TEST_F(HloDceTest, InstructionsWithSideEffect) {
   // Verify that side-effect instructions (Send in this test) are not removed.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   builder.AddInstruction(
       HloInstruction::CreateSend(constant, token, /*channel_id=*/0));
   builder.AddInstruction(HloInstruction::CreateTuple({}));
@@ -127,9 +127,9 @@ TEST_F(HloDceTest, ControlDependencies) {
   // Verify that instructions with control dependencies are not removed.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(123.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(123.0f)));
 
   // Create two dead instructions: a negate and an add.
   auto dead_negate = builder.AddInstruction(HloInstruction::CreateUnary(
@@ -224,7 +224,7 @@ TEST_F(HloDceTest, CalledComputationWithSideEffect) {
     auto param = cond_builder.AddInstruction(
         HloInstruction::CreateParameter(0, shape, "cond_param"));
     auto constant = cond_builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
     cond_builder.AddInstruction(HloInstruction::CreateBinary(
         ShapeUtil::MakeShape(PRED, {}), HloOpcode::kLt, param, constant));
   }
@@ -235,8 +235,7 @@ TEST_F(HloDceTest, CalledComputationWithSideEffect) {
   {
     auto param = body_builder.AddInstruction(
         HloInstruction::CreateParameter(0, shape, "param"));
-    auto token =
-        body_builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+    auto token = body_builder.AddInstruction(HloInstruction::CreateToken());
     auto infeed = body_builder.AddInstruction(
         HloInstruction::CreateInfeed(shape, token, ""));
     body_builder.AddInstruction(
@@ -280,8 +279,8 @@ TEST_F(HloDceTest, CalledComputationWithNestedSideEffect) {
   {
     auto param = nested_callee_builder.AddInstruction(
         HloInstruction::CreateParameter(0, shape, "param"));
-    auto token = nested_callee_builder.AddInstruction(
-        HloInstruction::CreateAfterAll({}));
+    auto token =
+        nested_callee_builder.AddInstruction(HloInstruction::CreateToken());
     nested_callee_builder.AddInstruction(
         HloInstruction::CreateOutfeed(shape, param, token, ""));
   }
@@ -346,12 +345,12 @@ TEST_F(HloDceTest, RemoveDeadSubcomputation) {
       builder.AddInstruction(HloInstruction::CreateParameter(
           /*parameter_number=*/0, ShapeUtil::MakeShape(F32, {100}), "param0")),
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(0))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0))),
       /*dimensions_to_reduce=*/{0}, reduce_subcomp));
 
   // Add another instruction as the root of the computation.
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0)));
 
   module->AddEntryComputation(builder.Build());
   EXPECT_EQ(module->MakeComputationPostOrder().size(), 2);
@@ -387,7 +386,7 @@ TEST_F(HloDceTest, KeepUsedSubcomputation) {
       builder.AddInstruction(HloInstruction::CreateParameter(
           /*parameter_number=*/0, ShapeUtil::MakeShape(F32, {100}), "param0")),
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(0))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0))),
       /*dimensions_to_reduce=*/{0}, reduce_subcomp));
 
   // Add another instruction as the root of the computation that also uses
@@ -397,7 +396,7 @@ TEST_F(HloDceTest, KeepUsedSubcomputation) {
       builder.AddInstruction(HloInstruction::CreateParameter(
           /*parameter_number=*/1, ShapeUtil::MakeShape(F32, {100}), "param1")),
       builder.AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<float>(0))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0))),
       /*dimensions_to_reduce=*/{0}, reduce_subcomp));
 
   module->AddEntryComputation(builder.Build());
diff --git a/tensorflow/compiler/xla/service/hlo_domain_map.cc b/tensorflow/compiler/xla/service/hlo_domain_map.cc
index ebd5adb5d5..9e096320db 100644
--- a/tensorflow/compiler/xla/service/hlo_domain_map.cc
+++ b/tensorflow/compiler/xla/service/hlo_domain_map.cc
@@ -41,11 +41,15 @@ namespace xla {
 
 bool HloDomainMap::InSameDomain(HloInstruction* instruction1,
                                 HloInstruction* instruction2) const {
-  int64 domain_id1 = FindOrDefault(instruction_to_domain_, instruction1, -1);
-  int64 domain_id2 = FindOrDefault(instruction_to_domain_, instruction2, -1);
+  int64 domain_id1 = GetDomainId(instruction1);
+  int64 domain_id2 = GetDomainId(instruction2);
   return domain_id1 >= 0 && domain_id1 == domain_id2;
 }
 
+int64 HloDomainMap::GetDomainId(HloInstruction* instruction) const {
+  return FindOrDefault(instruction_to_domain_, instruction, -1);
+}
+
 Status HloDomainMap::TryProcessEmptyDomain(HloInstruction* instruction) {
   TF_RET_CHECK(instruction->opcode() == HloOpcode::kDomain);
   // We only check operands, so we are sure to not process the empty domain from
@@ -58,6 +62,11 @@ Status HloDomainMap::TryProcessEmptyDomain(HloInstruction* instruction) {
       TF_RETURN_IF_ERROR(InsertDomain(std::move(domain)));
     }
   }
+  if (instruction == instruction->parent()->root_instruction()) {
+    auto domain = MakeUnique<DomainMetadata::Domain>();
+    domain->enter_domains.insert(instruction);
+    TF_RETURN_IF_ERROR(InsertDomain(std::move(domain)));
+  }
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_domain_map.h b/tensorflow/compiler/xla/service/hlo_domain_map.h
index e62ef763fb..1ca7159725 100644
--- a/tensorflow/compiler/xla/service/hlo_domain_map.h
+++ b/tensorflow/compiler/xla/service/hlo_domain_map.h
@@ -65,6 +65,10 @@ class HloDomainMap {
   // currently processing.
   bool IsDomainInstruction(HloInstruction* instruction) const;
 
+  // Retrieves the domain identifier of the instruction, or -1 in case
+  // instruction is not found within any domain.
+  int64 GetDomainId(HloInstruction* instruction) const;
+
  private:
   HloDomainMap(string domain_kind) : domain_kind_(std::move(domain_kind)) {}
 
diff --git a/tensorflow/compiler/xla/service/hlo_domain_metadata.h b/tensorflow/compiler/xla/service/hlo_domain_metadata.h
index aa0308100a..f855f2a1fc 100644
--- a/tensorflow/compiler/xla/service/hlo_domain_metadata.h
+++ b/tensorflow/compiler/xla/service/hlo_domain_metadata.h
@@ -71,12 +71,6 @@ class DomainMetadata {
 
   // Returns a string representation of the metadata.
   virtual string ToString() const = 0;
-
-  // Given a reachable set (the set of instructions which are reachable from
-  // each other via user/operand pathways, without crossing a kDomain
-  // instruciton), makes sure that all of them have metadata attributes which
-  // are coherent with this metadata object.
-  virtual Status NormalizeInstructions(const Domain& domain) const = 0;
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_domain_remover.cc b/tensorflow/compiler/xla/service/hlo_domain_remover.cc
index 1d06040b0e..67fad0769f 100644
--- a/tensorflow/compiler/xla/service/hlo_domain_remover.cc
+++ b/tensorflow/compiler/xla/service/hlo_domain_remover.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_domain_remover.h"
 
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
-#include "tensorflow/compiler/xla/service/hlo_domain_isolator.h"
 #include "tensorflow/compiler/xla/service/hlo_domain_map.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_verifier.h"
 #include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
@@ -43,54 +43,16 @@ class HloDomainRemover::RunContext {
 
 Status HloDomainRemover::RunContext::VerifyAndNormalizeDomain(
     const DomainMetadata::Domain& domain) {
-  // Verify that the whole kDomain frontier bounding the instruction reach set,
-  // has matching metadata.
-  // A kDomain instruction has two sides of metadata, a user facing and an
-  // operand facing.
-  // A reachable instruction set can make contact with a kDomain instruction on
-  // a user facing side (the kDomain is operand of the instruction), or on a
-  // operand facing side (the kDomain is user of the instruction).
-  // And depending on the contact side, the proper metadata object
-  // (user_side_metadata() vs. operand_side_metadata()) needs to be used for
-  // consistency checks.
-  const DomainMetadata* ref_metadata = nullptr;
-  VLOG(4) << "Reach set:";
-  for (HloInstruction* instruction : domain.instructions) {
-    VLOG(4) << "  " << instruction->name();
-  }
-  VLOG(4) << "  Domains:";
-  for (HloInstruction* instruction : domain.enter_domains) {
-    const DomainMetadata& meta = instruction->user_side_metadata();
-    VLOG(4) << "    User side: " << instruction->name();
-    VLOG(4) << "      " << meta.ToString();
-    if (ref_metadata == nullptr) {
-      ref_metadata = &meta;
-    } else {
-      TF_RET_CHECK(meta.Matches(*ref_metadata))
-          << "Metadata mismatch at instruction " << instruction->name() << " : "
-          << meta.ToString() << " vs " << ref_metadata->ToString();
-    }
-  }
-  for (HloInstruction* instruction : domain.exit_domains) {
-    const DomainMetadata& meta = instruction->operand_side_metadata();
-    VLOG(4) << "    Operand side: " << instruction->name();
-    VLOG(4) << "      " << meta.ToString();
-    if (ref_metadata == nullptr) {
-      ref_metadata = &meta;
-    } else {
-      TF_RET_CHECK(meta.Matches(*ref_metadata))
-          << "Metadata mismatch at instruction " << instruction->name() << " : "
-          << meta.ToString() << " vs " << ref_metadata->ToString();
-    }
-  }
+  TF_ASSIGN_OR_RETURN(const DomainMetadata* ref_metadata,
+                      HloDomainVerifier::VerifyDomain(domain));
   if (ref_metadata != nullptr) {
     VLOG(4) << "Applying domain normalization: " << ref_metadata->ToString();
-    TF_RETURN_IF_ERROR(ref_metadata->NormalizeInstructions(domain));
+    TF_RETURN_IF_ERROR(remover_->normalizer_(domain, ref_metadata));
   } else {
     // No kDomain instruction was present within this domain, so call the
     // generic normalization functions and have them apply their heuristic.
     VLOG(2) << "Applying domain-less normalization";
-    TF_RETURN_IF_ERROR(remover_->normalizer_(domain));
+    TF_RETURN_IF_ERROR(remover_->normalizer_(domain, nullptr));
   }
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/hlo_domain_remover.h b/tensorflow/compiler/xla/service/hlo_domain_remover.h
index 0c71dd34fd..c859e05f02 100644
--- a/tensorflow/compiler/xla/service/hlo_domain_remover.h
+++ b/tensorflow/compiler/xla/service/hlo_domain_remover.h
@@ -35,9 +35,10 @@ class HloDomainRemover : public HloPassInterface {
   // instructions in it with the same attributes (ie, sharding), a normalizer
   // function is tasked at applying attribute normalization on the instructions
   // within such domain.
-  HloDomainRemover(
-      tensorflow::StringPiece kind,
-      std::function<Status(const DomainMetadata::Domain&)> normalizer)
+  HloDomainRemover(tensorflow::StringPiece kind,
+                   std::function<Status(const DomainMetadata::Domain&,
+                                        const DomainMetadata* metadata)>
+                       normalizer)
       : kind_(kind.ToString()), normalizer_(std::move(normalizer)) {}
 
   tensorflow::StringPiece name() const override { return "domain_remover"; }
@@ -48,7 +49,9 @@ class HloDomainRemover : public HloPassInterface {
   class RunContext;
 
   string kind_;
-  std::function<Status(const DomainMetadata::Domain&)> normalizer_;
+  std::function<Status(const DomainMetadata::Domain&,
+                       const DomainMetadata* metadata)>
+      normalizer_;
 };
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_domain_test.cc b/tensorflow/compiler/xla/service/hlo_domain_test.cc
index c1412f7c68..ffc18a0f88 100644
--- a/tensorflow/compiler/xla/service/hlo_domain_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_domain_test.cc
@@ -97,12 +97,6 @@ class OpNameMetadata : public DomainMetadata {
 
   string ToString() const override { return opname_; }
 
-  Status NormalizeInstructions(
-      const DomainMetadata::Domain& domain) const override {
-    // For the purposes of this test, nothing to do.
-    return Status::OK();
-  }
-
   static tensorflow::StringPiece KindName() { return "opname"; }
 
  private:
@@ -124,7 +118,8 @@ std::unique_ptr<HloInstruction> OpNameDomainCreator(HloInstruction* instruction,
                                       std::move(user_side_metadata));
 }
 
-Status OpNameDomainNormalizer(const DomainMetadata::Domain& domain) {
+Status OpNameDomainNormalizer(const DomainMetadata::Domain& domain,
+                              const DomainMetadata* metadata) {
   // Nothing to do for the particular use this test make of the OpName domains.
   return Status::OK();
 }
@@ -159,7 +154,7 @@ ENTRY entry {
   EXPECT_FALSE(HasDomainEdge(module, "e", "d"));
 
   HloDomainRemover remover(ShardingMetadata::KindName(),
-                           NormalizeShardingDomain);
+                           ShardingMetadata::NormalizeShardingDomain);
   TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module));
   EXPECT_TRUE(remover_changed);
 
@@ -202,12 +197,13 @@ ENTRY entry {
   p0 = (f32[4]) parameter(0)
   a = f32[4] get-tuple-element(p0), index=0
   token = token[] after-all()
-  b = (f32[4], u32[]) send(a, token), channel_id=1, sharding={maximal device=0}
-  c = () send-done(b), channel_id=1, sharding={maximal device=0}
-  d = (f32[4], u32[]) recv(token), channel_id=2, sharding={maximal device=0}
-  e = f32[4] recv-done(d), channel_id=2, sharding={maximal device=0}
-  f = f32[4] add(a, e)
-  g = f32[4] subtract(a, e)
+  b = (f32[4], u32[], token[]) send(a, token), channel_id=1, sharding={maximal device=0}
+  c = token[] send-done(b), channel_id=1, sharding={maximal device=0}
+  d = (f32[4], u32[], token[]) recv(token), channel_id=2, sharding={maximal device=0}
+  e = (f32[4], token[]) recv-done(d), channel_id=2, sharding={maximal device=0}
+  e_element = f32[4] get-tuple-element(e), index=0, sharding={maximal device=0}
+  f = f32[4] add(a, e_element)
+  g = f32[4] subtract(a, e_element)
   ROOT h = (f32[4], f32[4]) tuple(f, g)
 }
 )";
@@ -220,18 +216,18 @@ ENTRY entry {
   EXPECT_TRUE(isolator_changed);
 
   EXPECT_TRUE(HasDomainEdge(module, "b", "a"));
-  EXPECT_TRUE(HasDomainEdge(module, "f", "e"));
+  EXPECT_TRUE(HasDomainEdge(module, "f", "e_element"));
   EXPECT_FALSE(HasDomainEdge(module, "a", "p0"));
   EXPECT_FALSE(HasDomainEdge(module, "c", "b"));
   EXPECT_FALSE(HasDomainEdge(module, "e", "d"));
 
   HloDomainRemover remover(ShardingMetadata::KindName(),
-                           NormalizeShardingDomain);
+                           ShardingMetadata::NormalizeShardingDomain);
   TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module));
   EXPECT_TRUE(remover_changed);
 
   EXPECT_FALSE(HasDomainEdge(module, "b", "a"));
-  EXPECT_FALSE(HasDomainEdge(module, "f", "e"));
+  EXPECT_FALSE(HasDomainEdge(module, "f", "e_element"));
 }
 
 TEST_F(HloDomainTest, CheckNoDomainAddedOnPureIOComputation) {
@@ -240,11 +236,12 @@ HloModule Module
 
 ENTRY entry {
   token = token[] after-all(), sharding={maximal device=-1}
-  a = (f32[4], u32[]) recv(token), channel_id=1, sharding={maximal device=-1}
-  b = f32[4] recv-done(a), channel_id=1, sharding={maximal device=-1}
-  c = f32[4] add(b, b), sharding={maximal device=-1}
-  d = (f32[4], u32[]) send(c, token), channel_id=2, sharding={maximal device=-1}
-  ROOT e = () send-done(d), channel_id=2, sharding={maximal device=-1}
+  a = (f32[4], u32[], token[]) recv(token), channel_id=1, sharding={maximal device=-1}
+  b = (f32[4], token[]) recv-done(a), channel_id=1, sharding={maximal device=-1}
+  b_element = f32[4] get-tuple-element(b), index=0, sharding={maximal device=-1}
+  c = f32[4] add(b_element, b_element), sharding={maximal device=-1}
+  d = (f32[4], u32[], token[]) send(c, token), channel_id=2, sharding={maximal device=-1}
+  ROOT e = token[] send-done(d), channel_id=2, sharding={maximal device=-1}
 }
 )";
 
@@ -262,11 +259,12 @@ HloModule Module
 
 ENTRY entry {
   token = token[] after-all(), sharding={maximal device=0}
-  a = (f32[4], u32[]) recv(token), channel_id=1, sharding={maximal device=0}
-  b = f32[4] recv-done(a), channel_id=1, sharding={maximal device=0}
-  c = f32[4] add(b, b)
-  d = (f32[4], u32[]) send(c, token), channel_id=2, sharding={maximal device=0}
-  ROOT e = () send-done(d), channel_id=2, sharding={maximal device=0}
+  a = (f32[4], u32[], token[]) recv(token), channel_id=1, sharding={maximal device=0}
+  b = (f32[4], token[]) recv-done(a), channel_id=1, sharding={maximal device=0}
+  b_element = f32[4] get-tuple-element(b), index=0, sharding={maximal device=0}
+  c = f32[4] add(b_element, b_element)
+  d = (f32[4], u32[], token[]) send(c, token), channel_id=2, sharding={maximal device=0}
+  ROOT e = token[] send-done(d), channel_id=2, sharding={maximal device=0}
 }
 )";
 
@@ -274,7 +272,7 @@ ENTRY entry {
   LOG(INFO) << "Original module:\n" << module->ToString();
 
   HloDomainRemover remover(ShardingMetadata::KindName(),
-                           NormalizeShardingDomain);
+                           ShardingMetadata::NormalizeShardingDomain);
   TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module));
   EXPECT_FALSE(remover_changed);
 
@@ -321,7 +319,7 @@ ENTRY entry {
   EXPECT_FALSE(HasDomainEdge(module, "e", "d"));
 
   HloDomainRemover sharding_remover(ShardingMetadata::KindName(),
-                                    NormalizeShardingDomain);
+                                    ShardingMetadata::NormalizeShardingDomain);
   TF_ASSERT_OK_AND_ASSIGN(bool sharding_remover_changed,
                           sharding_remover.Run(module));
   EXPECT_TRUE(sharding_remover_changed);
@@ -408,7 +406,7 @@ ENTRY entry {
   }
 
   HloDomainRemover remover(ShardingMetadata::KindName(),
-                           NormalizeShardingDomain);
+                           ShardingMetadata::NormalizeShardingDomain);
   TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module));
   EXPECT_TRUE(remover_changed);
 
@@ -433,6 +431,44 @@ ENTRY entry {
                                               HloSharding::AssignDevice(0)}));
 }
 
+TEST_F(HloDomainTest, EmptyRootDomain) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+ENTRY entry {
+  %param = f32[1] parameter(0), sharding={maximal device=0}
+  %tuple = (f32[1]) tuple(%param),
+    sharding={maximal device=1}
+  ROOT %gte = f32[1] get-tuple-element(%tuple), index=0,
+    sharding={maximal device=1}
+})";
+
+  TF_ASSERT_OK_AND_ASSIGN(HloModule * module, ParseModule(hlo_string));
+
+  HloDomainIsolator isolator(CreateShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool isolator_changed, isolator.Run(module));
+  EXPECT_TRUE(isolator_changed);
+
+  EXPECT_TRUE(HasDomainEdge(module, "tuple", "param"));
+  EXPECT_FALSE(HasDomainEdge(module, "gte", "tuple"));
+
+  // Remove %tuple and %gte (tuple simplification)
+  HloInstruction* gte = FindInstruction(module, "gte");
+  HloInstruction* tuple = FindInstruction(module, "tuple");
+  module->entry_computation()->set_root_instruction(tuple->mutable_operand(0));
+  TF_EXPECT_OK(module->entry_computation()->RemoveInstruction(gte));
+  TF_EXPECT_OK(module->entry_computation()->RemoveInstruction(tuple));
+
+  HloDomainRemover remover(ShardingMetadata::KindName(),
+                           ShardingMetadata::NormalizeShardingDomain);
+  TF_ASSERT_OK_AND_ASSIGN(bool remover_changed, remover.Run(module));
+  EXPECT_TRUE(remover_changed);
+
+  const HloInstruction* root = module->entry_computation()->root_instruction();
+  EXPECT_TRUE(root->has_sharding());
+  EXPECT_EQ(root->sharding(), HloSharding::AssignDevice(1));
+}
+
 // Tests that text dumps of domain instructions can be parsed back, in the
 // specific case of null shardings.
 TEST_F(HloDomainTest, DumpParseNullSharding) {
diff --git a/tensorflow/compiler/xla/service/hlo_domain_verifier.cc b/tensorflow/compiler/xla/service/hlo_domain_verifier.cc
new file mode 100644
index 0000000000..751fc677e2
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_verifier.cc
@@ -0,0 +1,124 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/hlo_domain_verifier.h"
+
+#include <set>
+
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_map.h"
+#include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/types.h"
+
+namespace xla {
+
+class HloDomainVerifier::RunContext {
+ public:
+  RunContext(HloModule* module, HloDomainVerifier* verifier)
+      : module_(module), verifier_(verifier) {}
+
+  Status Run();
+
+ private:
+  // If the verifier caller passed an empty vector for kinds, we collect all the
+  // avalable domain types.
+  Status PopulateDomainKinds();
+
+  HloModule* module_;
+  HloDomainVerifier* verifier_;
+};
+
+Status HloDomainVerifier::RunContext::PopulateDomainKinds() {
+  if (verifier_->kinds_.empty()) {
+    // The caller specified no domain kinds, collect all the ones available.
+    std::set<string> kinds;
+    for (HloComputation* computation : module_->computations()) {
+      for (HloInstruction* instruction : computation->instructions()) {
+        if (instruction->opcode() == HloOpcode::kDomain) {
+          TF_RET_CHECK(instruction->user_side_metadata().Kind() ==
+                       instruction->operand_side_metadata().Kind())
+              << instruction->ToString();
+          kinds.insert(instruction->user_side_metadata().Kind().ToString());
+        }
+      }
+    }
+    verifier_->kinds_.insert(verifier_->kinds_.end(), kinds.begin(),
+                             kinds.end());
+  }
+  return Status::OK();
+}
+
+Status HloDomainVerifier::RunContext::Run() {
+  VLOG(4) << "Running HLO Domain Verifier";
+  TF_RETURN_IF_ERROR(PopulateDomainKinds());
+  for (HloComputation* computation : module_->computations()) {
+    for (auto& kind : verifier_->kinds_) {
+      // First create the domain instruciton sets. A domain instruction set is
+      // the set of instructions whose edges never cross a kDomain instruction.
+      TF_ASSIGN_OR_RETURN(std::unique_ptr<HloDomainMap> domain_map,
+                          HloDomainMap::Create(computation, kind));
+      // Verify every domain populated within the map.
+      for (auto& domain : domain_map->GetDomains()) {
+        TF_RETURN_IF_ERROR(VerifyDomain(*domain).status());
+      }
+    }
+  }
+  return Status::OK();
+}
+
+StatusOr<bool> HloDomainVerifier::Run(HloModule* module) {
+  RunContext run_context(module, this);
+  TF_RETURN_IF_ERROR(run_context.Run());
+  return false;
+}
+
+StatusOr<const DomainMetadata*> HloDomainVerifier::VerifyDomain(
+    const DomainMetadata::Domain& domain) {
+  const DomainMetadata* ref_metadata = nullptr;
+  VLOG(4) << "Reach set:";
+  for (HloInstruction* instruction : domain.instructions) {
+    VLOG(4) << "  " << instruction->name();
+  }
+  VLOG(4) << "  Domains:";
+  for (HloInstruction* instruction : domain.enter_domains) {
+    const DomainMetadata& meta = instruction->user_side_metadata();
+    VLOG(4) << "    User side: " << instruction->name();
+    VLOG(4) << "      " << meta.ToString();
+    if (ref_metadata == nullptr) {
+      ref_metadata = &meta;
+    } else {
+      TF_RET_CHECK(meta.Matches(*ref_metadata))
+          << "Metadata mismatch at instruction " << instruction->name() << " : "
+          << meta.ToString() << " vs " << ref_metadata->ToString();
+    }
+  }
+  for (HloInstruction* instruction : domain.exit_domains) {
+    const DomainMetadata& meta = instruction->operand_side_metadata();
+    VLOG(4) << "    Operand side: " << instruction->name();
+    VLOG(4) << "      " << meta.ToString();
+    if (ref_metadata == nullptr) {
+      ref_metadata = &meta;
+    } else {
+      TF_RET_CHECK(meta.Matches(*ref_metadata))
+          << "Metadata mismatch at instruction " << instruction->name() << " : "
+          << meta.ToString() << " vs " << ref_metadata->ToString();
+    }
+  }
+  return ref_metadata;
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_domain_verifier.h b/tensorflow/compiler/xla/service/hlo_domain_verifier.h
new file mode 100644
index 0000000000..8e53cf97f8
--- /dev/null
+++ b/tensorflow/compiler/xla/service/hlo_domain_verifier.h
@@ -0,0 +1,65 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_VERIFIER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_VERIFIER_H_
+
+#include <string>
+#include <vector>
+
+#include "tensorflow/compiler/xla/service/hlo_domain_map.h"
+#include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace xla {
+
+// Verifies that the domain instructions are consistent, and the each domain is
+// surrounded by the same metadata.
+class HloDomainVerifier : public HloPassInterface {
+ public:
+  HloDomainVerifier(std::vector<string> kinds) : kinds_(std::move(kinds)) {}
+
+  tensorflow::StringPiece name() const override { return "domain_verifier"; }
+
+  StatusOr<bool> Run(HloModule* module) override;
+
+  // Verify that the whole kDomain frontier bounding the instruction reach set,
+  // has matching metadata.
+  // A kDomain instruction has two sides of metadata, a user facing and an
+  // operand facing.
+  // A reachable instruction set can make contact with a kDomain instruction on
+  // a user facing side (the kDomain is operand of the instruction), or on a
+  // operand facing side (the kDomain is user of the instruction).
+  // And depending on the contact side, the proper metadata object
+  // (user_side_metadata() vs. operand_side_metadata()) needs to be used for
+  // consistency checks.
+  // Returns the DomainMetadata pointer which surrounds the domain, and
+  // represents the common metadata within such domain. If the returned
+  // DomainMetadata pointer is nullptr, the input domain had no kDomain
+  // boundary.
+  static StatusOr<const DomainMetadata*> VerifyDomain(
+      const DomainMetadata::Domain& domain);
+
+ private:
+  class RunContext;
+
+  std::vector<string> kinds_;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_DOMAIN_VERIFIER_H_
diff --git a/tensorflow/compiler/xla/service/hlo_element_type_converter.cc b/tensorflow/compiler/xla/service/hlo_element_type_converter.cc
index 4ed1508d70..c804f4364f 100644
--- a/tensorflow/compiler/xla/service/hlo_element_type_converter.cc
+++ b/tensorflow/compiler/xla/service/hlo_element_type_converter.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_evaluator.h"
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.cc b/tensorflow/compiler/xla/service/hlo_evaluator.cc
index e65e1af20c..51353eea6e 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.cc
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.cc
@@ -25,6 +25,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/index_util.h"
 #include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
@@ -135,7 +136,6 @@ StatusOr<std::unique_ptr<Literal>> Compare<complex64>(
 
 }  // namespace
 
-
 HloEvaluator::HloEvaluator(int64 max_loop_iterations)
     : max_loop_iterations_(max_loop_iterations) {
   typed_visitors_[PRED] = MakeUnique<HloEvaluatorTypedVisitor<bool>>(this);
@@ -330,6 +330,24 @@ StatusOr<std::unique_ptr<Literal>> HloEvaluator::EvaluateElementwiseUnaryOp(
   return result;
 }
 
+StatusOr<std::unique_ptr<Literal>> HloEvaluator::EvaluateDotOp(
+    const DotDimensionNumbers& dim_numbers, const Literal& lhs,
+    const Literal& rhs) {
+  std::unique_ptr<HloInstruction> lhs_instr =
+      HloInstruction::CreateConstant(lhs.CloneToUnique());
+  std::unique_ptr<HloInstruction> rhs_instr =
+      HloInstruction::CreateConstant(rhs.CloneToUnique());
+
+  TF_ASSIGN_OR_RETURN(
+      Shape dot_shape,
+      ShapeInference::InferDotOpShape(lhs.shape(), rhs.shape(), dim_numbers));
+
+  std::unique_ptr<HloInstruction> cloned_instruction =
+      HloInstruction::CreateDot(dot_shape, lhs_instr.get(), rhs_instr.get(),
+                                dim_numbers);
+  return Evaluate(cloned_instruction.get());
+}
+
 Status HloEvaluator::HandleParameter(HloInstruction* parameter) {
   CHECK_LT(parameter->parameter_number(), arg_literals_.size());
   const Literal* input_literal = arg_literals_[parameter->parameter_number()];
@@ -382,7 +400,7 @@ Status HloEvaluator::HandleConcatenate(HloInstruction* concatenate) {
         ShapeUtil::GetDimension(operand_shape, concat_dim);
   }
 
-  auto result_literal = Literal::CreateFromDimensions(
+  auto result_literal = LiteralUtil::CreateFromDimensions(
       reference_shape.element_type(), concat_dimensions);
   DimensionVector source_indices(rank, 0);
   DimensionVector dest_indices(concat_dimensions.size(), 0);
@@ -533,7 +551,7 @@ Status HloEvaluator::HandleTuple(HloInstruction* tuple) {
     operand_literals.push_back(&GetEvaluatedLiteralFor(operand));
   }
 
-  evaluated_[tuple] = Literal::MakeTuple(operand_literals);
+  evaluated_[tuple] = LiteralUtil::MakeTuple(operand_literals);
   return Status::OK();
 }
 
@@ -757,6 +775,12 @@ class OutputWindowIndexToInputIndex {
     return ArraySlice<int64>(input_index_);
   }
 
+  // Returns for a given 'input_dim' the corresponding output dimension index,
+  // or -1 if 'input_dim' is an elided window dimension.
+  int64 input_dim_value_to_output_index(int64 input_dim) {
+    return input_dim_value_to_output_index_[input_dim];
+  }
+
  private:
   // Propagates window dimensions from the output index to input_index_ by
   // mutating input_index_ in place.
@@ -774,7 +798,7 @@ class OutputWindowIndexToInputIndex {
 
   // input_dim_value_to_index_vector_[i] tells us how to compute dimension i of
   // the input index from the output index. See
-  // PropagateOutputIndexToInputIndex.
+  // PropagateOutputIndexWindowDimsToInputIndex.
   std::vector<int64> input_dim_value_to_output_index_;
 
   // The result computed by this functor.  operator() returns an ArraySlice into
@@ -827,6 +851,8 @@ Status HloEvaluator::HandleGather(HloInstruction* gather) {
   // corresponding index in the input shape.
   std::vector<int64> input_index(operand.shape().dimensions_size());
   std::vector<int64> output_index(gather->shape().dimensions_size());
+  std::vector<int64> input_gather_index_clamped(
+      operand.shape().dimensions_size());
 
   OutputGatherIndexToInputIndex output_gather_index_to_input_index(
       &gather->gather_dimension_numbers(), /*input_shape=*/operand.shape(),
@@ -848,14 +874,26 @@ Status HloEvaluator::HandleGather(HloInstruction* gather) {
       output_index[i] = output_gather_index[i] + output_window_index[i];
       DCHECK_LT(output_index[i], shape.dimensions(i));
     }
+    for (int i = 0, e = input_gather_index.size(); i < e; i++) {
+      int64 output_dim =
+          output_window_index_to_input_index.input_dim_value_to_output_index(i);
+      // If 'output_dim' is -1, it means 'i' is an elided window dim. This means
+      // we set the iteration index to 0, so for the purpose of the following
+      // calculations we can consider the output dimension size to be 1.
+      int64 output_dim_size =
+          output_dim == -1 ? 1 : shape.dimensions(output_dim);
+      // Clamp the gather index so that the gather region fits in the operand.
+      // input_gather_index_clamped[i] = clamp(input_gather_index[i], 0,
+      //                                       operand_shape.dimensions(i) -
+      //                                       output_dim_size);
+      input_gather_index_clamped[i] =
+          std::min(operand_shape.dimensions(i) - output_dim_size,
+                   std::max(0LL, input_gather_index[i]));
+    }
     for (int i = 0, e = input_index.size(); i < e; i++) {
-      // TODO(b/74360564): We should implement whatever out of bounds behavior
-      // we decide for dynamic-slice here as well.
-      input_index[i] = (input_gather_index[i] + input_window_index[i]) %
-                       operand_shape.dimensions(i);
-      if (input_index[i] < 0) {
-        input_index[i] += operand_shape.dimensions(i);
-      }
+      input_index[i] = input_gather_index_clamped[i] + input_window_index[i];
+      DCHECK_GE(input_index[i], 0);
+      DCHECK_LT(input_index[i], operand_shape.dimensions(i));
     }
     TF_RETURN_IF_ERROR(
         result->CopyElementFrom(operand, input_index, output_index));
@@ -903,7 +941,7 @@ Status HloEvaluator::HandleBroadcast(HloInstruction* broadcast) {
 }
 
 Status HloEvaluator::HandleAfterAll(HloInstruction* token) {
-  evaluated_[token] = Literal::CreateToken();
+  evaluated_[token] = LiteralUtil::CreateToken();
   return Status::OK();
 }
 
@@ -1024,8 +1062,6 @@ Status HloEvaluator::HandleSelect(HloInstruction* select) {
   const auto& on_false = GetEvaluatedLiteralFor(select->operand(2));
 
   // If predicate is of scalar type, no element-wise selection would be needed.
-  // This would also handle output array of tuple types as the DefaultAction
-  // would go through the HloEvaluatorTypedVisitor which doesn't handle tuples.
   if (ShapeUtil::IsScalar(pred.shape())) {
     if (pred.Get<bool>({})) {
       evaluated_[select] = on_true.CloneToUnique();
@@ -1038,6 +1074,19 @@ Status HloEvaluator::HandleSelect(HloInstruction* select) {
   return DefaultAction(select);
 }
 
+Status HloEvaluator::HandleTupleSelect(HloInstruction* tuple_select) {
+  const auto& pred = GetEvaluatedLiteralFor(tuple_select->operand(0));
+  const auto& on_true = GetEvaluatedLiteralFor(tuple_select->operand(1));
+  const auto& on_false = GetEvaluatedLiteralFor(tuple_select->operand(2));
+
+  if (pred.Get<bool>({})) {
+    evaluated_[tuple_select] = on_true.CloneToUnique();
+  } else {
+    evaluated_[tuple_select] = on_false.CloneToUnique();
+  }
+  return Status::OK();
+}
+
 Status HloEvaluator::HandleWhile(HloInstruction* while_hlo) {
   HloComputation* cond_comp = while_hlo->while_condition();
   HloComputation* body_comp = while_hlo->while_body();
@@ -1068,17 +1117,159 @@ Status HloEvaluator::HandleWhile(HloInstruction* while_hlo) {
   return Status::OK();
 }
 
+// Key-value sort is a special snowflake: it's templated on two different
+// element types, one for the keys, and one for the values. Jump through some
+// hoops to make this work.
+namespace {
+template <typename KeyType, typename ValueType>
+StatusOr<std::unique_ptr<Literal>> EvaluateSortInternal(
+    HloInstruction* sort, const Literal& keys_literal,
+    const Literal& values_literal) {
+  auto rank = ShapeUtil::Rank(keys_literal.shape());
+  TF_RET_CHECK(
+      ShapeUtil::SameDimensions(keys_literal.shape(), values_literal.shape()))
+      << "Sort keys and values must have the same dimensions";
+  TF_RET_CHECK(rank > 0 && rank <= 2)
+      << "Sort is only supported for rank-1 and rank-2 shapes, rank is: "
+      << rank;
+  TF_RET_CHECK(sort->operand_count() == 2) << "Expected key-value sort";
+  // We need to sort and array of keys and an array of values, where the
+  // sorted order of the values is determined by the keys. The simplest(?)
+  // way to do this is to go to an array-of-pairs representation, sort the
+  // array using the keys, and then go back to pair-of-arrays.
+  VLOG(3) << "HandleSort keys_literal: " << keys_literal.ToString();
+  VLOG(3) << "HandleSort values_literal: " << values_literal.ToString();
+
+  auto sort_r1 = [](const Literal& keys_literal,
+                    const Literal& values_literal) {
+    const auto& keys_data = keys_literal.data<KeyType>();
+    const auto& values_data = values_literal.data<ValueType>();
+
+    using kv_pair = std::pair<KeyType, ValueType>;
+    std::vector<kv_pair> key_value_vector;
+    CHECK_EQ(keys_data.size(), values_data.size());
+    key_value_vector.reserve(keys_data.size());
+    for (int i = 0; i < keys_data.size(); ++i) {
+      key_value_vector.push_back(std::make_pair(keys_data[i], values_data[i]));
+    }
+    std::sort(key_value_vector.begin(), key_value_vector.end(),
+              [](const kv_pair& a, const kv_pair& b) {
+                return SafeLess<KeyType>(a.first, b.first);
+              });
+    std::vector<KeyType> result_keys;
+    std::vector<ValueType> result_values;
+    for (const auto& key_value : key_value_vector) {
+      result_keys.push_back(key_value.first);
+      result_values.push_back(key_value.second);
+    }
+    auto result_keys_literal = MakeUnique<Literal>(keys_literal.shape());
+    result_keys_literal->PopulateR1(
+        tensorflow::gtl::ArraySlice<KeyType>(result_keys));
+    auto result_values_literal = MakeUnique<Literal>(values_literal.shape());
+    result_values_literal->PopulateR1(
+        tensorflow::gtl::ArraySlice<ValueType>(result_values));
+    return std::make_pair(std::move(result_keys_literal),
+                          std::move(result_values_literal));
+  };
+
+  std::unique_ptr<Literal> result_tuple;
+  if (rank == 1) {
+    auto result_pair = sort_r1(keys_literal, values_literal);
+    result_tuple = LiteralUtil::MakeTuple(
+        {result_pair.first.get(), result_pair.second.get()});
+  } else {
+    // For R2 sort, the desired semantics are to sort each matrix row
+    // independently.
+    auto keys_result_literal = MakeUnique<Literal>(keys_literal.shape());
+    auto values_result_literal = MakeUnique<Literal>(values_literal.shape());
+    int64 r1_length = keys_literal.shape().dimensions(1);
+    for (int64 row = 0; row < keys_literal.shape().dimensions(0); ++row) {
+      TF_ASSIGN_OR_RETURN(auto keys_r1_slice,
+                          keys_literal.Slice({row, 0}, {row + 1, r1_length})
+                              ->Reshape({r1_length}));
+      TF_ASSIGN_OR_RETURN(auto values_r1_slice,
+                          values_literal.Slice({row, 0}, {row + 1, r1_length})
+                              ->Reshape({r1_length}));
+      auto r1_result_pair = sort_r1(*keys_r1_slice, *values_r1_slice);
+      TF_ASSIGN_OR_RETURN(auto sorted_keys,
+                          r1_result_pair.first->Reshape({1, r1_length}));
+      TF_ASSIGN_OR_RETURN(auto sorted_values,
+                          r1_result_pair.second->Reshape({1, r1_length}));
+      TF_RETURN_IF_ERROR(keys_result_literal->CopySliceFrom(
+          *sorted_keys, {0, 0}, {row, 0}, {1, r1_length}));
+      TF_RETURN_IF_ERROR(values_result_literal->CopySliceFrom(
+          *sorted_values, {0, 0}, {row, 0}, {1, r1_length}));
+    }
+    result_tuple = LiteralUtil::MakeTuple(
+        {keys_result_literal.get(), values_result_literal.get()});
+  }
+
+  VLOG(3) << "HandleSort result_tuple: " << result_tuple->ToString();
+  return std::move(result_tuple);
+}
+
+template <typename KeyType>
+StatusOr<std::unique_ptr<Literal>> EvaluateSortCurried(
+    HloInstruction* sort, const Literal& keys_literal,
+    const Literal& values_literal) {
+  switch (sort->operand(1)->shape().element_type()) {
+    case F32:
+      return EvaluateSortInternal<KeyType, float>(sort, keys_literal,
+                                                  values_literal);
+    case U32:
+      return EvaluateSortInternal<KeyType, uint32>(sort, keys_literal,
+                                                   values_literal);
+    case S32:
+      return EvaluateSortInternal<KeyType, int32>(sort, keys_literal,
+                                                  values_literal);
+    case BF16:
+      return EvaluateSortInternal<KeyType, bfloat16>(sort, keys_literal,
+                                                     values_literal);
+    default:
+      return InvalidArgument("Unsupported type for Sort");
+  }
+}
+
+StatusOr<std::unique_ptr<Literal>> EvaluateSort(HloInstruction* sort,
+                                                const Literal& keys_literal,
+                                                const Literal& values_literal) {
+  switch (sort->operand(0)->shape().element_type()) {
+    case F32:
+      return EvaluateSortCurried<float>(sort, keys_literal, values_literal);
+    case U32:
+      return EvaluateSortCurried<uint32>(sort, keys_literal, values_literal);
+    case S32:
+      return EvaluateSortCurried<int32>(sort, keys_literal, values_literal);
+    case BF16:
+      return EvaluateSortCurried<bfloat16>(sort, keys_literal, values_literal);
+    default:
+      return InvalidArgument("Unsupported type for Sort");
+  }
+}
+}  // namespace
+
 Status HloEvaluator::HandleSort(HloInstruction* sort) {
+  const int64 sort_dim = sort->dimensions(0);
+  const int64 rank = ShapeUtil::Rank(sort->operand(0)->shape());
+  if (sort_dim != rank - 1) {
+    return Unimplemented(
+        "Trying to support along dimension %lld, which is not the last "
+        "dimension",
+        sort_dim);
+  }
+
   if (!ShapeUtil::IsTuple(sort->shape())) {
     return DefaultAction(sort);
+  } else {
+    auto result = EvaluateSort(sort, GetEvaluatedLiteralFor(sort->operand(0)),
+                               GetEvaluatedLiteralFor(sort->operand(1)));
+    if (result.ok()) {
+      evaluated_[sort] = std::move(result.ValueOrDie());
+      return Status::OK();
+    } else {
+      return result.status();
+    }
   }
-  // The key-value version of Sort is a special snowflake, since the output
-  // shape is a tuple, so its element type is not meaningful.
-  //
-  // TODO(mkuper): Do something sane here, so that we can support different key
-  // and value types.
-  return sort->Visit(
-      typed_visitors_.at(sort->operand(0)->shape().element_type()).get());
 }
 
 Status HloEvaluator::Preprocess(HloInstruction* hlo) {
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.h b/tensorflow/compiler/xla/service/hlo_evaluator.h
index b330c30eeb..a4c37ef328 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.h
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.h
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/shape_inference.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -115,6 +116,10 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
   StatusOr<std::unique_ptr<Literal>> EvaluateElementwiseUnaryOp(
       HloOpcode opcode, const Literal& operand);
 
+  StatusOr<std::unique_ptr<Literal>> EvaluateDotOp(
+      const DotDimensionNumbers& dim_numbers, const Literal& lhs,
+      const Literal& rhs);
+
  protected:
   // Make HloEvaluatorTypedVisitor a friend because it is logically part of this
   // class.
@@ -172,6 +177,8 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
 
   Status HandleSelect(HloInstruction* select) override;
 
+  Status HandleTupleSelect(HloInstruction* tuple_select) override;
+
   Status HandleBroadcast(HloInstruction* broadcast) override;
 
   Status HandleAfterAll(HloInstruction* token) override;
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator_test.cc b/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
index 42770d848a..cba72469ce 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_evaluator_test.cc
@@ -21,8 +21,8 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_element_type_converter.h"
@@ -112,9 +112,9 @@ class HloEvaluatorTest : public ::testing::WithParamInterface<bool>,
 // Verifies that HloEvaluator evaluates a HLO instruction that performs clamp
 // with 3 operands.
 TEST_P(HloEvaluatorTest, DoesClamp) {
-  auto low = Literal::CreateR2<float>({{0.f, 2.f}, {2.f, 4.f}});
-  auto value = Literal::CreateR2<float>({{0.f, 5.f}, {0.f, 4.f}});
-  auto high = Literal::CreateR2<float>({{2.f, 4.f}, {4.f, 4.f}});
+  auto low = LiteralUtil::CreateR2<float>({{0.f, 2.f}, {2.f, 4.f}});
+  auto value = LiteralUtil::CreateR2<float>({{0.f, 5.f}, {0.f, 4.f}});
+  auto high = LiteralUtil::CreateR2<float>({{2.f, 4.f}, {4.f, 4.f}});
 
   Shape shape = low->shape();
   HloComputation::Builder b(TestName());
@@ -127,15 +127,15 @@ TEST_P(HloEvaluatorTest, DoesClamp) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<float>({{0, 4}, {2, 4}});
+  auto expected = LiteralUtil::CreateR2<float>({{0, 4}, {2, 4}});
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
 
 TEST_P(HloEvaluatorTest, DISABLED_DoesClampSpecialBroadcast) {
-  auto low = Literal::CreateR0<float>(0.f);
-  auto value = Literal::CreateR2<float>({{-1.f, 0.f}, {1.f, 2.f}});
-  auto high = Literal::CreateR0<float>(1.f);
+  auto low = LiteralUtil::CreateR0<float>(0.f);
+  auto value = LiteralUtil::CreateR2<float>({{-1.f, 0.f}, {1.f, 2.f}});
+  auto high = LiteralUtil::CreateR0<float>(1.f);
 
   Shape shape = value->shape();
   HloComputation::Builder b(TestName());
@@ -148,7 +148,7 @@ TEST_P(HloEvaluatorTest, DISABLED_DoesClampSpecialBroadcast) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<float>({{0, 0}, {1, 1}});
+  auto expected = LiteralUtil::CreateR2<float>({{0, 0}, {1, 1}});
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -156,9 +156,9 @@ TEST_P(HloEvaluatorTest, DISABLED_DoesClampSpecialBroadcast) {
 // Verifies that HloEvaluator evaluates a HLO instruction that performs select
 // with 3 operands.
 TEST_P(HloEvaluatorTest, DoesSelect) {
-  auto pred = Literal::CreateR2<bool>({{true, false}, {false, true}});
-  auto on_true = Literal::CreateR2<float>({{2.f, 4.f}, {4.f, 4.f}});
-  auto on_false = Literal::CreateR2<float>({{0.f, 5.f}, {0.f, 4.f}});
+  auto pred = LiteralUtil::CreateR2<bool>({{true, false}, {false, true}});
+  auto on_true = LiteralUtil::CreateR2<float>({{2.f, 4.f}, {4.f, 4.f}});
+  auto on_false = LiteralUtil::CreateR2<float>({{0.f, 5.f}, {0.f, 4.f}});
 
   Shape shape = on_true->shape();
   HloComputation::Builder b(TestName());
@@ -173,7 +173,7 @@ TEST_P(HloEvaluatorTest, DoesSelect) {
 
   std::unique_ptr<Literal> result = Evaluate({});
 
-  auto expected = Literal::CreateR2<float>({{2, 5}, {0, 4}});
+  auto expected = LiteralUtil::CreateR2<float>({{2, 5}, {0, 4}});
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -181,46 +181,46 @@ TEST_P(HloEvaluatorTest, DoesSelect) {
 // Verifies that HloEvaluator evaluates a HLO instruction that performs
 // element-wise addition with 2 operands.
 TEST_P(HloEvaluatorTest, DoesAdd) {
-  auto lhs = Literal::CreateR2<int64>({{1, 0}, {-100, 4}});
-  auto rhs = Literal::CreateR2<int64>({{2, 4}, {4, 4}});
-  auto expected = Literal::CreateR2<int64>({{3, 4}, {-96, 8}});
+  auto lhs = LiteralUtil::CreateR2<int64>({{1, 0}, {-100, 4}});
+  auto rhs = LiteralUtil::CreateR2<int64>({{2, 4}, {4, 4}});
+  auto expected = LiteralUtil::CreateR2<int64>({{3, 4}, {-96, 8}});
   TestBinaryOp(HloOpcode::kAdd, std::move(expected), std::move(lhs),
                std::move(rhs));
 }
 // Verifies that HloEvaluator evaluates a HLO instruction that performs
 // element-wise and with 2 operands.
 TEST_P(HloEvaluatorTest, DoesAnd) {
-  auto lhs = Literal::CreateR2<int64>({{1, 0}, {-100, 4}});
-  auto rhs = Literal::CreateR2<int64>({{2, 4}, {4, 4}});
-  auto expected = Literal::CreateR2<int64>({{0, 0}, {4, 4}});
+  auto lhs = LiteralUtil::CreateR2<int64>({{1, 0}, {-100, 4}});
+  auto rhs = LiteralUtil::CreateR2<int64>({{2, 4}, {4, 4}});
+  auto expected = LiteralUtil::CreateR2<int64>({{0, 0}, {4, 4}});
   TestBinaryOp(HloOpcode::kAnd, std::move(expected), std::move(lhs),
                std::move(rhs));
 }
 // Verifies that HloEvaluator evaluates a HLO instruction that performs
 // element-wise or with 2 operands.
 TEST_P(HloEvaluatorTest, DoesOr) {
-  auto lhs = Literal::CreateR2<int64>({{1, 0}, {-100, 4}});
-  auto rhs = Literal::CreateR2<int64>({{2, 4}, {4, 4}});
-  auto expected = Literal::CreateR2<int64>({{3, 4}, {-100, 4}});
+  auto lhs = LiteralUtil::CreateR2<int64>({{1, 0}, {-100, 4}});
+  auto rhs = LiteralUtil::CreateR2<int64>({{2, 4}, {4, 4}});
+  auto expected = LiteralUtil::CreateR2<int64>({{3, 4}, {-100, 4}});
   TestBinaryOp(HloOpcode::kOr, std::move(expected), std::move(lhs),
                std::move(rhs));
 }
 // Verifies that HloEvaluator evaluates a HLO instruction that performs
 // element-wise or with 2 operands.
 TEST_P(HloEvaluatorTest, DoesXor) {
-  auto lhs = Literal::CreateR2<int64>({{1, 0}, {-100, 4}});
-  auto rhs = Literal::CreateR2<int64>({{2, 4}, {4, 4}});
-  auto expected = Literal::CreateR2<int64>({{3, 4}, {-104, 0}});
+  auto lhs = LiteralUtil::CreateR2<int64>({{1, 0}, {-100, 4}});
+  auto rhs = LiteralUtil::CreateR2<int64>({{2, 4}, {4, 4}});
+  auto expected = LiteralUtil::CreateR2<int64>({{3, 4}, {-104, 0}});
   TestBinaryOp(HloOpcode::kXor, std::move(expected), std::move(lhs),
                std::move(rhs));
 }
 // Verifies that HloEvaluator evaluates a HLO instruction that performs
 // element-wise multiply with 2 operands.
 TEST_P(HloEvaluatorTest, DoesMultiply) {
-  auto lhs = Literal::CreateR2<int32>({{-1, 0}, {-100, 4}});
-  auto rhs = Literal::CreateR2<int32>(
+  auto lhs = LiteralUtil::CreateR2<int32>({{-1, 0}, {-100, 4}});
+  auto rhs = LiteralUtil::CreateR2<int32>(
       {{std::numeric_limits<int32>::min(), 4}, {4, 4}});
-  auto expected = Literal::CreateR2<int32>(
+  auto expected = LiteralUtil::CreateR2<int32>(
       {{std::numeric_limits<int32>::min(), 0}, {-400, 16}});
   TestBinaryOp(HloOpcode::kMultiply, std::move(expected), std::move(lhs),
                std::move(rhs));
@@ -228,17 +228,17 @@ TEST_P(HloEvaluatorTest, DoesMultiply) {
 // Verifies that HloEvaluator evaluates a HLO instruction that performs
 // element-wise divide with 2 operands.
 TEST_P(HloEvaluatorTest, DoesDivideInt64) {
-  auto lhs = Literal::CreateR2<int64>({{1, 0}, {-100, 4}});
-  auto rhs = Literal::CreateR2<int64>({{2, 4}, {4, 4}});
-  auto expected = Literal::CreateR2<int64>({{0, 0}, {-25, 1}});
+  auto lhs = LiteralUtil::CreateR2<int64>({{1, 0}, {-100, 4}});
+  auto rhs = LiteralUtil::CreateR2<int64>({{2, 4}, {4, 4}});
+  auto expected = LiteralUtil::CreateR2<int64>({{0, 0}, {-25, 1}});
   TestBinaryOp(HloOpcode::kDivide, std::move(expected), std::move(lhs),
                std::move(rhs));
 }
 TEST_P(HloEvaluatorTest, DoesDivideDouble) {
-  auto lhs = Literal::CreateR2<double>({{1.0, 0.0}, {-100.0, 4.0}});
-  auto rhs = Literal::CreateR2<double>({{2.2, 4.0}, {4.0, 4.0}});
+  auto lhs = LiteralUtil::CreateR2<double>({{1.0, 0.0}, {-100.0, 4.0}});
+  auto rhs = LiteralUtil::CreateR2<double>({{2.2, 4.0}, {4.0, 4.0}});
   auto expected =
-      Literal::CreateR2<double>({{0.45454545454545453, 0}, {-25, 1}});
+      LiteralUtil::CreateR2<double>({{0.45454545454545453, 0}, {-25, 1}});
   TestBinaryOp(HloOpcode::kDivide, std::move(expected), std::move(lhs),
                std::move(rhs));
 }
@@ -246,54 +246,54 @@ TEST_P(HloEvaluatorTest, DoesDivideDouble) {
 // Verifies that HloEvaluator evaluates a HLO instruction that performs
 // element-wise abs op with 1 operand.
 TEST_P(HloEvaluatorTest, DoesAbsR2) {
-  auto operand = Literal::CreateR2<int64>({{1, -20}, {-100, 4}});
-  auto expected = Literal::CreateR2<int64>({{1, 20}, {100, 4}});
+  auto operand = LiteralUtil::CreateR2<int64>({{1, -20}, {-100, 4}});
+  auto expected = LiteralUtil::CreateR2<int64>({{1, 20}, {100, 4}});
   TestUnaryOp(HloOpcode::kAbs, std::move(expected), std::move(operand));
 }
 TEST_P(HloEvaluatorTest, DoesAbsR0) {
-  auto operand = Literal::CreateR0<float>(-1.0f);
-  auto expected = Literal::CreateR0<float>(1.0f);
+  auto operand = LiteralUtil::CreateR0<float>(-1.0f);
+  auto expected = LiteralUtil::CreateR0<float>(1.0f);
   TestUnaryOp(HloOpcode::kAbs, std::move(expected), std::move(operand));
 }
 TEST_P(HloEvaluatorTest, DoesAbsR1WithZeroSize) {
-  auto operand = Literal::CreateR1<float>({});
-  auto expected = Literal::CreateR1<float>({});
+  auto operand = LiteralUtil::CreateR1<float>({});
+  auto expected = LiteralUtil::CreateR1<float>({});
   TestUnaryOp(HloOpcode::kAbs, std::move(expected), std::move(operand));
 }
 TEST_P(HloEvaluatorTest, DoesNegateR2) {
-  auto operand = Literal::CreateR2<int32>(
+  auto operand = LiteralUtil::CreateR2<int32>(
       {{0, std::numeric_limits<int32>::min()}, {-1, 4}});
-  auto expected =
-      Literal::CreateR2<int32>({{0, std::numeric_limits<int>::min()}, {1, -4}});
+  auto expected = LiteralUtil::CreateR2<int32>(
+      {{0, std::numeric_limits<int>::min()}, {1, -4}});
   TestUnaryOp(HloOpcode::kNegate, std::move(expected), std::move(operand));
 }
 TEST_P(HloEvaluatorTest, DoesCosR2) {
-  auto operand = Literal::CreateR2<float>({{0, M_PI}, {-M_PI, 2 * M_PI}});
-  auto expected = Literal::CreateR2<float>({{1, -1}, {-1, 1}});
+  auto operand = LiteralUtil::CreateR2<float>({{0, M_PI}, {-M_PI, 2 * M_PI}});
+  auto expected = LiteralUtil::CreateR2<float>({{1, -1}, {-1, 1}});
   TestUnaryOp(HloOpcode::kCos, std::move(expected), std::move(operand),
               use_bfloat16_ ? 0.031250 : 9.5367431640625E-7);
 }
 TEST_P(HloEvaluatorTest, DoesSinR2) {
-  auto operand = Literal::CreateR2<float>({{0, M_PI}, {-M_PI, 2 * M_PI}});
-  auto expected = Literal::CreateR2<float>({{0, 0}, {0, 0}});
+  auto operand = LiteralUtil::CreateR2<float>({{0, M_PI}, {-M_PI, 2 * M_PI}});
+  auto expected = LiteralUtil::CreateR2<float>({{0, 0}, {0, 0}});
   TestUnaryOp(HloOpcode::kSin, std::move(expected), std::move(operand),
               use_bfloat16_ ? 0.031250 : 9.5367431640625E-7);
 }
 TEST_P(HloEvaluatorTest, DoesNotR2) {
   auto operand =
-      Literal::CreateR2<int32>({{0, std::numeric_limits<int>::min()},
-                                {-1, std::numeric_limits<int>::max()}});
+      LiteralUtil::CreateR2<int32>({{0, std::numeric_limits<int>::min()},
+                                    {-1, std::numeric_limits<int>::max()}});
   auto expected =
-      Literal::CreateR2<int32>({{-1, std::numeric_limits<int>::max()},
-                                {0, std::numeric_limits<int>::min()}});
+      LiteralUtil::CreateR2<int32>({{-1, std::numeric_limits<int>::max()},
+                                    {0, std::numeric_limits<int>::min()}});
   TestUnaryOp(HloOpcode::kNot, std::move(expected), std::move(operand));
 }
 // Verifies that HloEvaluator evaluates a HLO Computation with non-parameter nor
 // constant operands.
 TEST_P(HloEvaluatorTest, DoesTraverseInstructions) {
-  auto lhs = Literal::CreateR2<int64>({{1, 0}, {-100, 4}});
-  auto rhs = Literal::CreateR2<int64>({{2, 4}, {4, 4}});
-  auto rhs2 = Literal::CreateR2<int64>({{1, -20}, {-100, 4}});
+  auto lhs = LiteralUtil::CreateR2<int64>({{1, 0}, {-100, 4}});
+  auto rhs = LiteralUtil::CreateR2<int64>({{2, 4}, {4, 4}});
+  auto rhs2 = LiteralUtil::CreateR2<int64>({{1, -20}, {-100, 4}});
   std::vector<const Literal*> args = {lhs.get(), rhs.get(), rhs2.get()};
 
   Shape shape = ShapeUtil::MakeShape(S64, {2, 2});
@@ -314,7 +314,7 @@ TEST_P(HloEvaluatorTest, DoesTraverseInstructions) {
 
   std::unique_ptr<Literal> result = Evaluate(args);
 
-  auto expected = Literal::CreateR2<int64>({{4, -16}, {-196, 12}});
+  auto expected = LiteralUtil::CreateR2<int64>({{4, -16}, {-196, 12}});
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -324,7 +324,7 @@ TEST_P(HloEvaluatorTest, DoesReshape) {
   HloComputation::Builder b(TestName());
   const int64 dimensions[] = {11, 8, 7, 5, 9};
   TF_ASSERT_OK_AND_ASSIGN(auto literal,
-                          Literal::CreateRandomLiteral<F32>(
+                          LiteralUtil::CreateRandomLiteral<F32>(
                               ShapeUtil::MakeShape(F32, dimensions), 0.0, 1.0));
   auto literal_clone = literal->CloneToUnique();
   HloInstruction* literal_instruction =
@@ -349,8 +349,8 @@ TEST_P(HloEvaluatorTest, DoesReshape) {
 // Verifies Broadcast operation is correctly evaluated.
 TEST_P(HloEvaluatorTest, DoesBroadcast) {
   HloComputation::Builder b(TestName());
-  auto input_literal = Literal::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}});
-  auto output_literal = Literal::CreateR3<int32>(
+  auto input_literal = LiteralUtil::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}});
+  auto output_literal = LiteralUtil::CreateR3<int32>(
       {{{1, 2}, {3, 4}, {5, 6}}, {{1, 2}, {3, 4}, {5, 6}}});
   HloInstruction* literal_instruction = b.AddInstruction(
       HloInstruction::CreateConstant(std::move(input_literal)));
@@ -365,8 +365,8 @@ TEST_P(HloEvaluatorTest, DoesBroadcast) {
 
 TEST_P(HloEvaluatorTest, DoesBroadcastScalar) {
   HloComputation::Builder b(TestName());
-  auto input_literal = Literal::CreateR0<int32>(111);
-  auto output_literal = Literal::CreateR2<int32>(
+  auto input_literal = LiteralUtil::CreateR0<int32>(111);
+  auto output_literal = LiteralUtil::CreateR2<int32>(
       {{111, 111}, {111, 111}, {111, 111}, {111, 111}, {111, 111}, {111, 111}});
 
   HloInstruction* literal_instruction = b.AddInstruction(
@@ -386,9 +386,9 @@ TEST_P(HloEvaluatorTest, DoesConcatenateSimple) {
   HloComputation::Builder b(TestName());
 
   HloInstruction* operand1 = b.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int64>({{-1, -2}, {100, 200}})));
+      LiteralUtil::CreateR2<int64>({{-1, -2}, {100, 200}})));
   HloInstruction* operand2 = b.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int64>({{-2, -3}, {-100, -200}})));
+      LiteralUtil::CreateR2<int64>({{-2, -3}, {-100, -200}})));
 
   std::vector<HloInstruction*> operands = {operand1, operand2};
 
@@ -399,8 +399,8 @@ TEST_P(HloEvaluatorTest, DoesConcatenateSimple) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected =
-      Literal::CreateR2<int64>({{-1, -2}, {100, 200}, {-2, -3}, {-100, -200}});
+  auto expected = LiteralUtil::CreateR2<int64>(
+      {{-1, -2}, {100, 200}, {-2, -3}, {-100, -200}});
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
 
@@ -408,9 +408,9 @@ TEST_P(HloEvaluatorTest, ConcatenateHandlesShapeWithZeroElement) {
   HloComputation::Builder b(TestName());
 
   HloInstruction* operand1 = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int64>({100, 200})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int64>({100, 200})));
   HloInstruction* operand2 = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int64>({})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int64>({})));
 
   std::vector<HloInstruction*> operands = {operand1, operand2};
 
@@ -421,16 +421,16 @@ TEST_P(HloEvaluatorTest, ConcatenateHandlesShapeWithZeroElement) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR1<int64>({100, 200});
+  auto expected = LiteralUtil::CreateR1<int64>({100, 200});
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
 
 TEST_P(HloEvaluatorTest, ConvertWithSameLayout) {
   HloComputation::Builder b(TestName());
 
-  auto input_literal = Literal::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}});
+  auto input_literal = LiteralUtil::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}});
   auto expected =
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}});
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}});
   ASSERT_TRUE(LayoutUtil::LayoutsInShapesEqual(input_literal->shape(),
                                                expected->shape()));
 
@@ -447,9 +447,9 @@ TEST_P(HloEvaluatorTest, ConvertWithSameLayout) {
 TEST_P(HloEvaluatorTest, ConvertWithDifferentLayout) {
   HloComputation::Builder b(TestName());
 
-  auto input_literal = Literal::CreateR2WithLayout<int32>(
+  auto input_literal = LiteralUtil::CreateR2WithLayout<int32>(
       {{1, 2}, {3, 4}, {5, 6}}, LayoutUtil::MakeLayout({0, 1}));
-  auto expected = Literal::CreateR2WithLayout<float>(
+  auto expected = LiteralUtil::CreateR2WithLayout<float>(
       {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}}, LayoutUtil::MakeLayout({1, 0}));
   ASSERT_FALSE(LayoutUtil::LayoutsInShapesEqual(input_literal->shape(),
                                                 expected->shape()));
@@ -478,13 +478,13 @@ PaddingConfig CreatePaddingConfig(
 }
 
 TEST_P(HloEvaluatorTest, Pad2DIntegerArrayWithZeroDimension) {
-  auto operand = Literal::CreateR2<int32>({{}, {}});
+  auto operand = LiteralUtil::CreateR2<int32>({{}, {}});
   HloComputation::Builder b(TestName());
   auto operand_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(operand)));
 
   constexpr int32 kPadValue = 10;
-  auto pad_value = Literal::CreateR0<int32>(kPadValue);
+  auto pad_value = LiteralUtil::CreateR0<int32>(kPadValue);
   auto padding_value_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(pad_value)));
 
@@ -496,7 +496,7 @@ TEST_P(HloEvaluatorTest, Pad2DIntegerArrayWithZeroDimension) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<int32>(
+  auto expected = LiteralUtil::CreateR2<int32>(
       {{10, 10}, {10, 10}, {10, 10}, {10, 10}, {10, 10}});
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
@@ -506,11 +506,11 @@ TEST_P(HloEvaluatorTest, Pad4DFloatArrayWithInteriorPadding) {
   HloComputation::Builder b(TestName());
 
   Array4D<float> input_array(3, 2, 1, 1, {1, 2, 3, 4, 5, 6});
-  auto input = Literal::CreateR4FromArray4D<float>(input_array);
+  auto input = LiteralUtil::CreateR4FromArray4D<float>(input_array);
   HloInstruction* input_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(input)));
   constexpr float kPadValue = 1.5;
-  auto pad_value = Literal::CreateR0<float>(kPadValue);
+  auto pad_value = LiteralUtil::CreateR0<float>(kPadValue);
   HloInstruction* pad_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(pad_value)));
 
@@ -532,7 +532,7 @@ TEST_P(HloEvaluatorTest, Pad4DFloatArrayWithInteriorPadding) {
   (*expected_array)(7, 0, 0, 0) = 5.0f;
   (*expected_array)(7, 2, 0, 0) = 6.0f;
 
-  auto expected = Literal::CreateR4FromArray4D<float>(*expected_array);
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>(*expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -549,12 +549,12 @@ TEST_P(HloEvaluatorTest, NegativePadding2D) {
   // }
   auto input_array = MakeUnique<Array2D<float>>(4, 3);
   input_array->FillUnique(1.0f);
-  auto input = Literal::CreateR2FromArray2D<float>(*input_array);
+  auto input = LiteralUtil::CreateR2FromArray2D<float>(*input_array);
   HloInstruction* input_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(input)));
 
   auto pad_value_instruction = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.718f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.718f)));
 
   auto r2_padding_on_dim0_dim1 =
       CreatePaddingConfig({{{-1, -2, 0}}, {{-2, 4, 0}}});
@@ -574,7 +574,7 @@ TEST_P(HloEvaluatorTest, NegativePadding2D) {
   (*expected_array)(0, 2) = 2.718f;
   (*expected_array)(0, 3) = 2.718f;
   (*expected_array)(0, 4) = 2.718f;
-  auto expected = Literal::CreateR2FromArray2D<float>(*expected_array);
+  auto expected = LiteralUtil::CreateR2FromArray2D<float>(*expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Near(*expected, *result, ErrorSpec(0.031250)));
 }
@@ -590,12 +590,12 @@ TEST_P(HloEvaluatorTest, NegativeAndInteriorPadding2D) {
   // }
   auto input_array = MakeUnique<Array2D<float>>(4, 3);
   input_array->FillUnique(1.0f);
-  auto input = Literal::CreateR2FromArray2D<float>(*input_array);
+  auto input = LiteralUtil::CreateR2FromArray2D<float>(*input_array);
   HloInstruction* input_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(input)));
 
   auto pad_value_instruction = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.718f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.718f)));
 
   PaddingConfig padding_config = MakeNoPaddingConfig(2);
 
@@ -613,7 +613,7 @@ TEST_P(HloEvaluatorTest, NegativeAndInteriorPadding2D) {
   std::unique_ptr<Literal> result = Evaluate();
 
   auto expected_array = MakeUnique<Array2D<float>>(0, 9);
-  auto expected = Literal::CreateR2FromArray2D<float>(*expected_array);
+  auto expected = LiteralUtil::CreateR2FromArray2D<float>(*expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -630,13 +630,13 @@ TEST_P(HloEvaluatorTest, DotRank2AndRank1) {
   // }
   auto lhs_array = MakeUnique<Array2D<float>>(4, 1);
   lhs_array->FillUnique(1.0f);
-  auto lhs_literal = Literal::CreateR2FromArray2D<float>(*lhs_array);
+  auto lhs_literal = LiteralUtil::CreateR2FromArray2D<float>(*lhs_array);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
   // rhs:
   // f32[2] { 1, 2 },
-  auto rhs_literal = Literal::CreateR2<float>({{1, 2}});
+  auto rhs_literal = LiteralUtil::CreateR2<float>({{1, 2}});
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -658,7 +658,7 @@ TEST_P(HloEvaluatorTest, DotRank2AndRank1) {
       {4.f, 8.f},
   });
   // clang-format on
-  auto expected = Literal::CreateR2FromArray2D<float>(expected_array);
+  auto expected = LiteralUtil::CreateR2FromArray2D<float>(expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -669,7 +669,7 @@ TEST_P(HloEvaluatorTest, DotRank1AndRank2) {
   // lhs:
   // f32[3]
   //  { 1, 2, 3 },
-  auto lhs_literal = Literal::CreateR1<float>({1, 2, 3});
+  auto lhs_literal = LiteralUtil::CreateR1<float>({1, 2, 3});
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
@@ -681,7 +681,7 @@ TEST_P(HloEvaluatorTest, DotRank1AndRank2) {
   // }
   auto rhs_array = MakeUnique<Array2D<float>>(3, 2);
   rhs_array->FillUnique(1.0f);
-  auto rhs_literal = Literal::CreateR2FromArray2D<float>(*rhs_array);
+  auto rhs_literal = LiteralUtil::CreateR2FromArray2D<float>(*rhs_array);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -695,7 +695,7 @@ TEST_P(HloEvaluatorTest, DotRank1AndRank2) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR1<float>({22.f, 28.f});
+  auto expected = LiteralUtil::CreateR1<float>({22.f, 28.f});
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -712,7 +712,7 @@ TEST_P(HloEvaluatorTest, DotRank2AndRank2) {
   // }
   auto lhs_array = MakeUnique<Array2D<float>>(4, 3);
   lhs_array->FillUnique(1.0f);
-  auto lhs_literal = Literal::CreateR2FromArray2D<float>(*lhs_array);
+  auto lhs_literal = LiteralUtil::CreateR2FromArray2D<float>(*lhs_array);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
@@ -724,7 +724,7 @@ TEST_P(HloEvaluatorTest, DotRank2AndRank2) {
   // }
   auto rhs_array = MakeUnique<Array2D<float>>(3, 2);
   rhs_array->FillUnique(1.0f);
-  auto rhs_literal = Literal::CreateR2FromArray2D<float>(*rhs_array);
+  auto rhs_literal = LiteralUtil::CreateR2FromArray2D<float>(*rhs_array);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -744,7 +744,7 @@ TEST_P(HloEvaluatorTest, DotRank2AndRank2) {
       {94.f, 124.f},
       {130.f, 172.f},
   });
-  auto expected = Literal::CreateR2FromArray2D<float>(expected_array);
+  auto expected = LiteralUtil::CreateR2FromArray2D<float>(expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -753,12 +753,12 @@ TEST_P(HloEvaluatorTest, SimpleConv1D) {
   HloComputation::Builder b(TestName());
 
   Array3D<float> lhs_array = {{{1, 2, 3}}};
-  auto lhs_literal = Literal::CreateR3FromArray3D<float>(lhs_array);
+  auto lhs_literal = LiteralUtil::CreateR3FromArray3D<float>(lhs_array);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
   Array3D<float> rhs_array = {{{3.f, 4.f}}};
-  auto rhs_literal = Literal::CreateR3FromArray3D<float>(rhs_array);
+  auto rhs_literal = LiteralUtil::CreateR3FromArray3D<float>(rhs_array);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -792,7 +792,7 @@ TEST_P(HloEvaluatorTest, SimpleConv1D) {
   std::unique_ptr<Literal> result = Evaluate();
 
   Array3D<float> expected_array = {{{11.f, 18.f, 9.f}}};
-  auto expected = Literal::CreateR3FromArray3D<float>(expected_array);
+  auto expected = LiteralUtil::CreateR3FromArray3D<float>(expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -809,7 +809,7 @@ TEST_P(HloEvaluatorTest, Simple4x4Conv2DWith2x2Kernel) {
     {13, 14, 15, 16},
   }));
   // clang-format on
-  auto lhs_literal = Literal::CreateR4FromArray4D<float>(lhs_array);
+  auto lhs_literal = LiteralUtil::CreateR4FromArray4D<float>(lhs_array);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
@@ -820,7 +820,7 @@ TEST_P(HloEvaluatorTest, Simple4x4Conv2DWith2x2Kernel) {
     {7, 8},
   }));
   // clang-format on
-  auto rhs_literal = Literal::CreateR4FromArray4D<float>(rhs_array);
+  auto rhs_literal = LiteralUtil::CreateR4FromArray4D<float>(rhs_array);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -854,7 +854,7 @@ TEST_P(HloEvaluatorTest, Simple4x4Conv2DWith2x2Kernel) {
     {149, 160, 171,  80},
   }));
   // clang-format on
-  auto expected = Literal::CreateR4FromArray4D<float>(expected_array);
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>(expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -884,11 +884,11 @@ TEST_P(HloEvaluatorTest, Conv2DGeneralDimensionsReversed) {
   }});
   // clang-format on
 
-  auto lhs_literal = Literal::CreateR4FromArray4D<float>(input);
+  auto lhs_literal = LiteralUtil::CreateR4FromArray4D<float>(input);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
-  auto rhs_literal = Literal::CreateR4FromArray4D<float>(weight);
+  auto rhs_literal = LiteralUtil::CreateR4FromArray4D<float>(weight);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
   rhs_instruction = b.AddInstruction(HloInstruction::CreateReverse(
@@ -933,7 +933,7 @@ TEST_P(HloEvaluatorTest, Conv2DGeneralDimensionsReversed) {
   Array4D<float> expected_array({{{{2514, 2685}}}});
   Array4D<float> expected_array_bf16({{{{2512, 2672}}}});
   // clang-format on
-  auto expected = Literal::CreateR4FromArray4D<float>(
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>(
       use_bfloat16_ ? expected_array_bf16 : expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
@@ -964,11 +964,11 @@ TEST_P(HloEvaluatorTest, Conv2DGeneralDimensions) {
   }});
   // clang-format on
 
-  auto lhs_literal = Literal::CreateR4FromArray4D<float>(input);
+  auto lhs_literal = LiteralUtil::CreateR4FromArray4D<float>(input);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
-  auto rhs_literal = Literal::CreateR4FromArray4D<float>(weight);
+  auto rhs_literal = LiteralUtil::CreateR4FromArray4D<float>(weight);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -1010,7 +1010,7 @@ TEST_P(HloEvaluatorTest, Conv2DGeneralDimensions) {
   Array4D<float> expected_array({{{{2514, 2685}}}});
   Array4D<float> expected_array_bf16({{{{2512, 2672}}}});
   // clang-format on
-  auto expected = Literal::CreateR4FromArray4D<float>(
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>(
       use_bfloat16_ ? expected_array_bf16 : expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
@@ -1028,7 +1028,7 @@ TEST_P(HloEvaluatorTest, DilatedBaseConv2DWithHighPadding) {
     {13, 14, 15, 16},
   }));
   // clang-format on
-  auto lhs_literal = Literal::CreateR4FromArray4D<float>(lhs_array);
+  auto lhs_literal = LiteralUtil::CreateR4FromArray4D<float>(lhs_array);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
@@ -1039,7 +1039,7 @@ TEST_P(HloEvaluatorTest, DilatedBaseConv2DWithHighPadding) {
     {7, 8},
   }));
   // clang-format on
-  auto rhs_literal = Literal::CreateR4FromArray4D<float>(rhs_array);
+  auto rhs_literal = LiteralUtil::CreateR4FromArray4D<float>(rhs_array);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -1074,7 +1074,7 @@ TEST_P(HloEvaluatorTest, DilatedBaseConv2DWithHighPadding) {
       {91, 112, 98, 120, 105, 128, 112},
       {65, 84, 70, 90, 75, 96, 80},
   }));
-  auto expected = Literal::CreateR4FromArray4D<float>(expected_array);
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>(expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -1091,7 +1091,7 @@ TEST_P(HloEvaluatorTest, DilatedBaseConv2DWithLowAndHighPadding) {
     {13, 14, 15, 16},
   }));
   // clang-format on
-  auto lhs_literal = Literal::CreateR4FromArray4D<float>(lhs_array);
+  auto lhs_literal = LiteralUtil::CreateR4FromArray4D<float>(lhs_array);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
@@ -1102,7 +1102,7 @@ TEST_P(HloEvaluatorTest, DilatedBaseConv2DWithLowAndHighPadding) {
     {7, 8},
   }));
   // clang-format on
-  auto rhs_literal = Literal::CreateR4FromArray4D<float>(rhs_array);
+  auto rhs_literal = LiteralUtil::CreateR4FromArray4D<float>(rhs_array);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -1138,7 +1138,7 @@ TEST_P(HloEvaluatorTest, DilatedBaseConv2DWithLowAndHighPadding) {
       {104, 91, 112, 98, 120, 105, 128, 112},
       {78, 65, 84, 70, 90, 75, 96, 80},
   }));
-  auto expected = Literal::CreateR4FromArray4D<float>(expected_array);
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>(expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -1156,7 +1156,7 @@ TEST_P(HloEvaluatorTest,
     {13, 14, 15, 16},
   }));
   // clang-format on
-  auto lhs_literal = Literal::CreateR4FromArray4D<float>(lhs_array);
+  auto lhs_literal = LiteralUtil::CreateR4FromArray4D<float>(lhs_array);
   HloInstruction* lhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(lhs_literal)));
 
@@ -1167,7 +1167,7 @@ TEST_P(HloEvaluatorTest,
     {8, 9, 10},
   }));
   // clang-format on
-  auto rhs_literal = Literal::CreateR4FromArray4D<float>(rhs_array);
+  auto rhs_literal = LiteralUtil::CreateR4FromArray4D<float>(rhs_array);
   HloInstruction* rhs_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(rhs_literal)));
 
@@ -1210,7 +1210,7 @@ TEST_P(HloEvaluatorTest,
       {0, 0, 0},
       {91, 98, 105},
   }));
-  auto expected = Literal::CreateR4FromArray4D<float>(expected_array);
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>(expected_array);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -1225,9 +1225,9 @@ TEST_F(HloEvaluatorPreciseReduceTest, AddReductionPrecisionTest) {
   constexpr int kNumElements = 1 << 25;  // float += 1 saturates at 1<<24
   std::vector<float> v(kNumElements, 1.0f);
   HloInstruction* arg_instruction = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>(v)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(v)));
   HloInstruction* init_value = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.f)));
 
   HloComputation::Builder add_computation("add");
   Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
@@ -1262,9 +1262,9 @@ void BM_ReducePrecisely(int num_iters) {
   constexpr int kNumElements = 1 << 25;  // float += 1 saturates at 1<<24
   std::vector<float> v(kNumElements, 1.0f);
   HloInstruction* arg_instruction = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>(v)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(v)));
   auto init_value = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.f)));
 
   HloComputation::Builder add_computation("add");
   Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
@@ -1299,13 +1299,13 @@ TEST_P(HloEvaluatorTest, ReduceAdd) {
   // }
   auto arg_array = MakeUnique<Array2D<float>>(2, 3);
   arg_array->FillUnique(1.0f);
-  auto arg_literal = Literal::CreateR2FromArray2D<float>(*arg_array);
+  auto arg_literal = LiteralUtil::CreateR2FromArray2D<float>(*arg_array);
 
   HloInstruction* arg_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(arg_literal)));
 
   auto init_value = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.f)));
 
   HloComputation::Builder add_computation("add");
   Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
@@ -1326,7 +1326,7 @@ TEST_P(HloEvaluatorTest, ReduceAdd) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR1<float>({6, 18});
+  auto expected = LiteralUtil::CreateR1<float>({6, 18});
 
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
@@ -1341,13 +1341,13 @@ TEST_P(HloEvaluatorTest, ReduceWindowMax) {
   // }
   auto arg_array = MakeUnique<Array2D<float>>(2, 3);
   arg_array->FillUnique(1.0f);
-  auto arg_literal = Literal::CreateR2FromArray2D<float>(*arg_array);
+  auto arg_literal = LiteralUtil::CreateR2FromArray2D<float>(*arg_array);
 
   HloInstruction* arg_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(arg_literal)));
 
   auto init_value = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.f)));
 
   HloComputation::Builder max_computation("max");
   Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
@@ -1378,7 +1378,7 @@ TEST_P(HloEvaluatorTest, ReduceWindowMax) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<float>({{6, 7}});
+  auto expected = LiteralUtil::CreateR2<float>({{6, 7}});
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
 
@@ -1392,13 +1392,13 @@ TEST_P(HloEvaluatorTest, ReduceWindowAdd) {
   // }
   auto arg_array = MakeUnique<Array2D<float>>(2, 3);
   arg_array->FillUnique(1.0f);
-  auto arg_literal = Literal::CreateR2FromArray2D<float>(*arg_array);
+  auto arg_literal = LiteralUtil::CreateR2FromArray2D<float>(*arg_array);
 
   HloInstruction* arg_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(arg_literal)));
 
   auto init_value = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.f)));
 
   HloComputation::Builder add_computation("add");
   Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
@@ -1435,7 +1435,7 @@ TEST_P(HloEvaluatorTest, ReduceWindowAdd) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<float>({{1, 3, 5}, {5, 11, 13}});
+  auto expected = LiteralUtil::CreateR2<float>({{1, 3, 5}, {5, 11, 13}});
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *result));
 }
 
@@ -1445,13 +1445,13 @@ TEST_P(HloEvaluatorTest, ReduceWindowAdd6D) {
   // arg: f32[4,4,4,4,4,4] full of ones. Using small dims to limit run-time.
   std::vector<int64> input_dims(6, 4);
   std::unique_ptr<Literal> arg_literal =
-      Literal::CreateFullWithDescendingLayout<float>(input_dims, 1.0f);
+      LiteralUtil::CreateFullWithDescendingLayout<float>(input_dims, 1.0f);
 
   HloInstruction* arg_instruction =
       b.AddInstruction(HloInstruction::CreateConstant(std::move(arg_literal)));
 
   auto init_value = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.f)));
 
   HloComputation::Builder add_computation("add");
   Shape scalar_shape = ShapeUtil::MakeShape(F32, {});
@@ -1498,7 +1498,7 @@ TEST_P(HloEvaluatorTest, ReduceWindowAdd6D) {
 
   std::vector<int64> output_dims = {4, 3, 3, 3, 4, 4};
   std::unique_ptr<Literal> result_literal =
-      Literal::CreateFullWithDescendingLayout<float>(output_dims, 8.0f);
+      LiteralUtil::CreateFullWithDescendingLayout<float>(output_dims, 8.0f);
   EXPECT_TRUE(LiteralTestUtil::Equal(*result_literal, *result));
 }
 
@@ -1513,7 +1513,8 @@ TEST_P(HloEvaluatorTest, StridedSlice) {
   // }
   auto operand_array = MakeUnique<Array2D<float>>(3, 5);
   operand_array->FillUnique(1.0f);
-  auto operand_literal = Literal::CreateR2FromArray2D<float>(*operand_array);
+  auto operand_literal =
+      LiteralUtil::CreateR2FromArray2D<float>(*operand_array);
 
   HloInstruction* operand = b.AddInstruction(
       HloInstruction::CreateConstant(std::move(operand_literal)));
@@ -1527,7 +1528,7 @@ TEST_P(HloEvaluatorTest, StridedSlice) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<float>({
+  auto expected = LiteralUtil::CreateR2<float>({
       {3},
       {19},
   });
@@ -1545,13 +1546,14 @@ TEST_P(HloEvaluatorTest, DynamicSlice) {
   // }
   auto operand_array = MakeUnique<Array2D<float>>(2, 4);
   operand_array->FillUnique(1.0f);
-  auto operand_literal = Literal::CreateR2FromArray2D<float>(*operand_array);
+  auto operand_literal =
+      LiteralUtil::CreateR2FromArray2D<float>(*operand_array);
 
   HloInstruction* operand = b.AddInstruction(
       HloInstruction::CreateConstant(std::move(operand_literal)));
 
   auto start_indices = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({0, 1})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({0, 1})));
 
   Shape shape = ShapeUtil::MakeShape(F32, {2, 3});
   b.AddInstruction(HloInstruction::CreateDynamicSlice(shape, operand,
@@ -1560,7 +1562,7 @@ TEST_P(HloEvaluatorTest, DynamicSlice) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<float>({
+  auto expected = LiteralUtil::CreateR2<float>({
       {2, 3, 4},
       {6, 7, 8},
   });
@@ -1580,13 +1582,14 @@ TEST_P(HloEvaluatorTest, DynamicSliceModSlice) {
   // }
   auto operand_array = MakeUnique<Array2D<float>>(2, 4);
   operand_array->FillUnique(1.0f);
-  auto operand_literal = Literal::CreateR2FromArray2D<float>(*operand_array);
+  auto operand_literal =
+      LiteralUtil::CreateR2FromArray2D<float>(*operand_array);
 
   HloInstruction* operand = b.AddInstruction(
       HloInstruction::CreateConstant(std::move(operand_literal)));
 
   auto start_indices = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2, 1})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2, 1})));
 
   Shape shape = ShapeUtil::MakeShape(F32, {2, 3});
   b.AddInstruction(HloInstruction::CreateDynamicSlice(shape, operand,
@@ -1595,7 +1598,7 @@ TEST_P(HloEvaluatorTest, DynamicSliceModSlice) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<float>({
+  auto expected = LiteralUtil::CreateR2<float>({
       {2, 3, 4},
       {6, 7, 8},
   });
@@ -1613,16 +1616,17 @@ TEST_P(HloEvaluatorTest, DynamicSliceUpdate) {
   // }
   auto operand_array = MakeUnique<Array2D<double>>(2, 3);
   operand_array->FillUnique(1.0);
-  auto operand_literal = Literal::CreateR2FromArray2D<double>(*operand_array);
+  auto operand_literal =
+      LiteralUtil::CreateR2FromArray2D<double>(*operand_array);
 
   HloInstruction* operand = b.AddInstruction(
       HloInstruction::CreateConstant(std::move(operand_literal)));
 
   auto start_indices = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int64>({0, 1})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int64>({0, 1})));
 
   auto update = b.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<double>({{-2.0, -3.0}, {-6.0, -7.0}})));
+      LiteralUtil::CreateR2<double>({{-2.0, -3.0}, {-6.0, -7.0}})));
 
   Shape shape = ShapeUtil::MakeShape(F64, {2, 3});
   b.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
@@ -1631,7 +1635,7 @@ TEST_P(HloEvaluatorTest, DynamicSliceUpdate) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<double>({
+  auto expected = LiteralUtil::CreateR2<double>({
       {1, -2, -3},
       {5, -6, -7},
   });
@@ -1649,12 +1653,13 @@ TEST_P(HloEvaluatorTest, SetAndGetTuples) {
   // }
   auto operand_array = MakeUnique<Array2D<double>>(2, 3);
   operand_array->FillUnique(1.0);
-  auto operand_literal2 = Literal::CreateR2FromArray2D<double>(*operand_array);
+  auto operand_literal2 =
+      LiteralUtil::CreateR2FromArray2D<double>(*operand_array);
 
   HloInstruction* operand2 = b.AddInstruction(
       HloInstruction::CreateConstant(std::move(operand_literal2)));
   HloInstruction* operand1 = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int64>({0, 1})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int64>({0, 1})));
 
   auto tuple =
       b.AddInstruction(HloInstruction::CreateTuple({operand1, operand2}));
@@ -1666,7 +1671,7 @@ TEST_P(HloEvaluatorTest, SetAndGetTuples) {
 
   std::unique_ptr<Literal> result = Evaluate();
 
-  auto expected = Literal::CreateR2<double>({
+  auto expected = LiteralUtil::CreateR2<double>({
       {1, 2, 3},
       {5, 6, 7},
   });
@@ -1686,9 +1691,9 @@ TEST_P(HloEvaluatorTest, SetAndGetNestedTuples) {
   operand_array->FillUnique(1.0);
 
   HloInstruction* operand2 = b.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2FromArray2D<double>(*operand_array)));
+      LiteralUtil::CreateR2FromArray2D<double>(*operand_array)));
   HloInstruction* operand1 = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int64>({0, 1})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int64>({0, 1})));
 
   auto tuple1 =
       b.AddInstruction(HloInstruction::CreateTuple({operand1, operand2}));
@@ -1706,8 +1711,8 @@ TEST_P(HloEvaluatorTest, SetAndGetNestedTuples) {
   std::unique_ptr<Literal> result = Evaluate();
 
   auto result_inner_literal =
-      Literal::CreateR2FromArray2D<double>(*operand_array);
-  auto expected = Literal::MakeTuple({
+      LiteralUtil::CreateR2FromArray2D<double>(*operand_array);
+  auto expected = LiteralUtil::MakeTuple({
       result_inner_literal.get(),
       result_inner_literal.get(),
   });
@@ -1735,7 +1740,7 @@ TEST_P(HloEvaluatorTest, Reverse) {
      {{23.0f}, {24.0f}}},
   });
   // clang-format on
-  auto operand_literal = Literal::CreateR4FromArray4D<float>(input);
+  auto operand_literal = LiteralUtil::CreateR4FromArray4D<float>(input);
   HloInstruction* operand = b.AddInstruction(
       HloInstruction::CreateConstant(std::move(operand_literal)));
 
@@ -1746,7 +1751,7 @@ TEST_P(HloEvaluatorTest, Reverse) {
   std::unique_ptr<Literal> result = Evaluate();
 
   // clang-format off
-  auto expected = Literal::CreateR4FromArray4D<float>({
+  auto expected = LiteralUtil::CreateR4FromArray4D<float>({
     {{{23.0f}, {24.0f}},
      {{21.0f}, {22.0f}},
      {{19.0f}, {20.0f}}},
@@ -1782,11 +1787,11 @@ TEST_P(HloEvaluatorTest, EvaluateWithSubstitutions) {
   // Evaluate add with param0 = {1, 2, 3, 4}, square = {10, 20, 30, 40}.
   HloEvaluator evaluator;
   auto result = evaluator.EvaluateWithSubstitutions(
-      add, {{param0, Literal::CreateR1<float>({1, 2, 3, 4}).get()},
-            {square, Literal::CreateR1<float>({10, 20, 30, 40}).get()}});
+      add, {{param0, LiteralUtil::CreateR1<float>({1, 2, 3, 4}).get()},
+            {square, LiteralUtil::CreateR1<float>({10, 20, 30, 40}).get()}});
   TF_ASSERT_OK(result.status());
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR1<float>({11, 22, 33, 44}), *result.ValueOrDie()));
+      *LiteralUtil::CreateR1<float>({11, 22, 33, 44}), *result.ValueOrDie()));
 }
 
 // Check that EvaluateWithSubstitutions works if one of the operands to the op
@@ -1799,18 +1804,18 @@ TEST_P(HloEvaluatorTest, EvaluateWithSubstitutionsWithConstantOperand) {
       b.AddInstruction(HloInstruction::CreateParameter(0, shape, "param0"));
   HloInstruction* square = b.AddInstruction(HloInstruction::CreateBinary(
       shape, HloOpcode::kMultiply, param0, param0));
-  HloInstruction* constant = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({1, 2, 3, 4})));
+  HloInstruction* constant = b.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<float>({1, 2, 3, 4})));
   HloInstruction* add = b.AddInstruction(
       HloInstruction::CreateBinary(shape, HloOpcode::kAdd, constant, square));
 
   // Evaluate add with square = {10, 20, 30, 40}.
   HloEvaluator evaluator;
   auto result = evaluator.EvaluateWithSubstitutions(
-      add, {{square, Literal::CreateR1<float>({10, 20, 30, 40}).get()}});
+      add, {{square, LiteralUtil::CreateR1<float>({10, 20, 30, 40}).get()}});
   TF_ASSERT_OK(result.status());
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR1<float>({11, 22, 33, 44}), *result.ValueOrDie()));
+      *LiteralUtil::CreateR1<float>({11, 22, 33, 44}), *result.ValueOrDie()));
 }
 
 TEST_P(HloEvaluatorTest, EvaluateGather_TensorFlowGatherV1) {
@@ -1830,11 +1835,12 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
-  EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{1, 2, 3}, {7, 8, 9}}),
-                             *Evaluate({operand.get(), gather_indices.get()})));
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({0, 2});
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *LiteralUtil::CreateR2<int32>({{1, 2, 3}, {7, 8, 9}}),
+      *Evaluate({operand.get(), gather_indices.get()})));
 }
 
 TEST_P(HloEvaluatorTest, EvaluateGather_TensorFlowGatherV2) {
@@ -1854,10 +1860,11 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({0, 2});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR2<int32>({{1, 3}, {4, 6}, {7, 9}}),
+      *LiteralUtil::CreateR2<int32>({{1, 3}, {4, 6}, {7, 9}}),
       *Evaluate({operand.get(), gather_indices.get()})));
 }
 
@@ -1878,11 +1885,11 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 2}, {2, 1}});
+      LiteralUtil::CreateR2<int32>({{0, 2}, {2, 1}});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR3<int32>(
+      *LiteralUtil::CreateR3<int32>(
           {{{1, 3}, {4, 6}, {7, 9}}, {{3, 2}, {6, 5}, {9, 8}}}),
       *Evaluate({operand.get(), gather_indices.get()})));
 }
@@ -1904,13 +1911,13 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
   std::unique_ptr<Literal> operand =
-      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
-                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
-                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+      LiteralUtil::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                    {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                    {{-7, 7}, {-8, 8}, {-9, 9}}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+      LiteralUtil::CreateR2<int32>({{0, 0}, {1, 0}});
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{-1, 1}, {-4, 4}}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR2<int32>({{-1, 1}, {-4, 4}}),
                              *Evaluate({operand.get(), gather_indices.get()})));
 }
 
@@ -1932,13 +1939,13 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
   std::unique_ptr<Literal> operand =
-      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
-                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
-                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+      LiteralUtil::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                    {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                    {{-7, 7}, {-8, 8}, {-9, 9}}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+      LiteralUtil::CreateR2<int32>({{0, 0}, {1, 0}});
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{-2, 2}, {-1, 1}}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR2<int32>({{-2, 2}, {-1, 1}}),
                              *Evaluate({operand.get(), gather_indices.get()})));
 }
 
@@ -1959,10 +1966,11 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({1, 1});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({1, 1});
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{5}}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR2<int32>({{5}}),
                              *Evaluate({operand.get(), gather_indices.get()})));
 }
 
@@ -1983,11 +1991,11 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{2, 1}, {1, 1}});
+      LiteralUtil::CreateR2<int32>({{2, 1}, {1, 1}});
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR3<int32>({{{8}}, {{5}}}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR3<int32>({{{8}}, {{5}}}),
                              *Evaluate({operand.get(), gather_indices.get()})));
 }
 
@@ -2007,10 +2015,11 @@ ENTRY main {
 }
 )";
   ParseAndVerifyModule(hlo_text);
-  std::unique_ptr<Literal> operand = Literal::CreateR2<int32>({{}, {}, {}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  std::unique_ptr<Literal> operand = LiteralUtil::CreateR2<int32>({{}, {}, {}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({0, 2});
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{}, {}}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR2<int32>({{}, {}}),
                              *Evaluate({operand.get(), gather_indices.get()})));
 }
 
@@ -2031,11 +2040,11 @@ ENTRY main {
 )";
   ParseAndVerifyModule(hlo_text);
 
-  std::unique_ptr<Literal> operand = Literal::CreateR1<int32>({0, 1, 2});
+  std::unique_ptr<Literal> operand = LiteralUtil::CreateR1<int32>({0, 1, 2});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR3<int32>({{{0}, {1}}, {{2}, {1}}});
+      LiteralUtil::CreateR3<int32>({{{0}, {1}}, {{2}, {1}}});
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{0, 1}, {2, 1}}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR2<int32>({{0, 1}, {2, 1}}),
                              *Evaluate({operand.get(), gather_indices.get()})));
 }
 
@@ -2043,14 +2052,14 @@ ENTRY main {
 // element-wise comparison with 2 bfloat16 operands.
 TEST_P(HloEvaluatorTest, DoesCompareBF16) {
   // lhs >= rhs
-  auto lhs = Literal::CreateR2<bfloat16>(
+  auto lhs = LiteralUtil::CreateR2<bfloat16>(
       {{bfloat16(0.25), bfloat16(0.35), bfloat16(0.125)},
        {bfloat16(-0.25), bfloat16(-0.35), bfloat16(-0.125)}});
-  auto rhs = Literal::CreateR2<bfloat16>(
+  auto rhs = LiteralUtil::CreateR2<bfloat16>(
       {{bfloat16(0.5), bfloat16(0.125), bfloat16(0.125)},
        {bfloat16(0.25), bfloat16(-0.375), bfloat16(-0.127)}});
   auto expected =
-      Literal::CreateR2<bool>({{false, true, true}, {false, true, true}});
+      LiteralUtil::CreateR2<bool>({{false, true, true}, {false, true, true}});
   TestBinaryOp(HloOpcode::kGe, std::move(expected), std::move(lhs),
                std::move(rhs));
 }
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h b/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h
index 1136178e90..d1ee4a180b 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h
+++ b/tensorflow/compiler/xla/service/hlo_evaluator_typed_visitor.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_HLO_EVALUATOR_TYPED_VISITOR_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_HLO_EVALUATOR_TYPED_VISITOR_H_
 
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_evaluator.h"
 #include "tensorflow/compiler/xla/service/shape_inference.h"
 #include "tensorflow/core/lib/core/casts.h"
@@ -34,6 +35,37 @@ using is_complex_t = std::is_same<T, complex64>;
 template <typename T>
 using is_complex64_t = std::is_same<T, complex64>;
 
+// It's UB to use std::sort with std::less<float>, because of NaNs. Define
+// "safe" less functions which are actually strict weak orders.
+template <
+    typename NativeT,
+    typename std::enable_if<std::is_integral<NativeT>::value>::type* = nullptr>
+bool SafeLess(const NativeT& a, const NativeT& b) {
+  return a < b;
+}
+
+template <typename NativeT,
+          typename std::enable_if<
+              std::is_floating_point<NativeT>::value ||
+              std::is_same<NativeT, bfloat16>::value>::type* = nullptr>
+bool SafeLess(const NativeT& a, const NativeT& b) {
+  if (std::isnan(b)) {
+    return !std::isnan(a);
+  } else {
+    return a < b;
+  }
+}
+
+template <typename NativeT, typename std::enable_if<std::is_same<
+                                NativeT, Eigen::half>::value>::type* = nullptr>
+bool SafeLess(const NativeT& a, const NativeT& b) {
+  if (Eigen::half_impl::isnan(b)) {
+    return !Eigen::half_impl::isnan(a);
+  } else {
+    return a < b;
+  }
+}
+
 // Templated DfsHloVisitor for use by HloEvaluator.
 //
 // Typically ReturnT here indicates the resulting literal type of each evaluated
@@ -269,6 +301,14 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
     return HandleFloor<ReturnT>(floor);
   }
 
+  Status HandleImag(HloInstruction* imag) override {
+    TF_ASSIGN_OR_RETURN(parent_->evaluated_[imag],
+                        ElementWiseUnaryOp(imag, [](ElementwiseT elem_operand) {
+                          return std::imag(elem_operand);
+                        }));
+    return Status::OK();
+  }
+
   Status HandleLog(HloInstruction* log) override {
     TF_ASSIGN_OR_RETURN(parent_->evaluated_[log],
                         ElementWiseUnaryOp(log, [](ElementwiseT elem_operand) {
@@ -572,6 +612,14 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
     return Status::OK();
   }
 
+  Status HandleReal(HloInstruction* real) override {
+    TF_ASSIGN_OR_RETURN(parent_->evaluated_[real],
+                        ElementWiseUnaryOp(real, [](ElementwiseT elem_operand) {
+                          return std::real(elem_operand);
+                        }));
+    return Status::OK();
+  }
+
   template <
       typename NativeT,
       typename std::enable_if<!is_complex_t<NativeT>::value>::type* = nullptr>
@@ -1285,7 +1333,7 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
                 parent_->GetEvaluatedLiteralFor(operand);
 
             auto curr_val = arg_literal.Get<NativeT>(multi_index);
-            auto curr_val_literal = Literal::CreateR0<NativeT>(curr_val);
+            auto curr_val_literal = LiteralUtil::CreateR0<NativeT>(curr_val);
 
             arg_literals.push_back(std::move(curr_val_literal));
           }
@@ -1367,66 +1415,47 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
                 !std::is_same<NativeT, bool>::value>::type* = nullptr>
   Status HandleSort(HloInstruction* sort) {
     auto keys = sort->operand(0);
-    TF_RET_CHECK(ShapeUtil::Rank(keys->shape()) == 1)
-        << "Sort is only supported for R1 shapes";
+    auto rank = ShapeUtil::Rank(keys->shape());
+    TF_RET_CHECK(rank > 0 && rank <= 2)
+        << "Sort is only supported for R1 and R2 shapes";
+    TF_RET_CHECK(sort->operand_count() == 1)
+        << "Typed visitor does not support key-value sort";
 
     const Literal& keys_literal = parent_->GetEvaluatedLiteralFor(keys);
-    VLOG(3) << "HandleSort keys_literal: " << keys_literal.ToString();
-    const auto& keys_data = keys_literal.data<ReturnT>();
 
-    if (sort->operand_count() == 1) {
+    auto sort_r1 = [this](const Literal& keys_literal) {
+      VLOG(3) << "HandleSort keys_literal: " << keys_literal.ToString();
+      const auto& keys_data = keys_literal.data<ReturnT>();
+
       std::vector<ReturnT> result_data(keys_data.begin(), keys_data.end());
       std::sort(result_data.begin(), result_data.end(),
                 [](const ReturnT& a, const ReturnT& b) {
                   return SafeLess<ReturnT>(a, b);
                 });
-      auto result_literal = MakeUnique<Literal>(sort->shape());
+      auto result_literal = MakeUnique<Literal>(keys_literal.shape());
       result_literal->PopulateR1(
           tensorflow::gtl::ArraySlice<ReturnT>(result_data));
       VLOG(3) << "HandleSort result_literal: " << result_literal->ToString();
-      parent_->evaluated_[sort] = std::move(result_literal);
-    } else {
-      CHECK_EQ(sort->operand_count(), 2);
-      auto values = sort->operand(1);
-      if (values->shape().element_type() !=
-          primitive_util::NativeToPrimitiveType<ReturnT>()) {
-        return InvalidArgument(
-            "Evaluator requires value and key types for Sort to match");
-      }
+      return result_literal;
+    };
 
-      // We need to sort and array of keys and an array of values, where the
-      // sorted order of the values is determined by the keys. The simplest(?)
-      // way to do this is to go to an array-of-pairs representation, sort the
-      // array using the keys, and then go back to pair-of-arrays.
-      const Literal& values_literal = parent_->GetEvaluatedLiteralFor(values);
-      VLOG(3) << "HandleSort values_literal: " << values_literal.ToString();
-      const auto& values_data = values_literal.data<ReturnT>();
-      using kv_pair = std::pair<ReturnT, ReturnT>;
-      std::vector<kv_pair> key_value_vector;
-      CHECK_EQ(keys_data.size(), values_data.size());
-      for (int i = 0; i < keys_data.size(); ++i) {
-        key_value_vector.push_back(
-            std::make_pair(keys_data[i], values_data[i]));
-      }
-      std::sort(key_value_vector.begin(), key_value_vector.end(),
-                [](const kv_pair& a, const kv_pair& b) {
-                  return SafeLess<ReturnT>(a.first, b.first);
-                });
-      std::vector<ReturnT> result_keys, result_values;
-      for (const auto& key_value : key_value_vector) {
-        result_keys.push_back(key_value.first);
-        result_values.push_back(key_value.second);
+    if (rank == 1) {
+      parent_->evaluated_[sort] = std::move(sort_r1(keys_literal));
+    } else {
+      // For R2 sort, the desired semantics are to sort each matrix row
+      // independently.
+      auto result_literal = MakeUnique<Literal>(keys_literal.shape());
+      int64 r1_length = keys->shape().dimensions(1);
+      for (int64 row = 0; row < keys->shape().dimensions(0); ++row) {
+        TF_ASSIGN_OR_RETURN(auto r1_slice,
+                            keys_literal.Slice({row, 0}, {row + 1, r1_length})
+                                ->Reshape({r1_length}));
+        auto r1_result = sort_r1(*r1_slice);
+        TF_ASSIGN_OR_RETURN(r1_result, r1_result->Reshape({1, r1_length}));
+        TF_RETURN_IF_ERROR(result_literal->CopySliceFrom(
+            *r1_result, {0, 0}, {row, 0}, {1, r1_length}));
       }
-      auto result_keys_literal = MakeUnique<Literal>(keys->shape());
-      result_keys_literal->PopulateR1(
-          tensorflow::gtl::ArraySlice<ReturnT>(result_keys));
-      auto result_values_literal = MakeUnique<Literal>(values->shape());
-      result_values_literal->PopulateR1(
-          tensorflow::gtl::ArraySlice<ReturnT>(result_values));
-      auto result_tuple = Literal::MakeTuple(
-          {result_keys_literal.get(), result_values_literal.get()});
-      VLOG(3) << "HandleSort result_tuple: " << result_tuple->ToString();
-      parent_->evaluated_[sort] = std::move(result_tuple);
+      parent_->evaluated_[sort] = std::move(result_literal);
     }
     return Status::OK();
   }
@@ -1452,8 +1481,7 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
                  ShapeUtil::Rank(arg->shape()) - dimensions.size());
     TF_ASSIGN_OR_RETURN(auto inferred_return_shape,
                         ShapeInference::InferReduceShape(
-                            /*arg=*/arg->shape(),
-                            /*init_value=*/init_value->shape(),
+                            {&arg->shape(), &init_value->shape()},
                             /*dimensions_to_reduce=*/dimensions,
                             /*to_apply=*/function->ComputeProgramShape()));
     TF_RET_CHECK(ShapeUtil::Compatible(reduce->shape(), inferred_return_shape))
@@ -1515,8 +1543,9 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
             auto curr_val = arg_literal.Get<ReturnT>(input_index);
 
             // Evaluate computation with specified literal operands.
-            auto curr_val_literal = Literal::CreateR0<ReturnT>(curr_val);
-            auto result_val_literal = Literal::CreateR0<ReturnT>(result_val);
+            auto curr_val_literal = LiteralUtil::CreateR0<ReturnT>(curr_val);
+            auto result_val_literal =
+                LiteralUtil::CreateR0<ReturnT>(result_val);
 
             std::unique_ptr<Literal> computed_result =
                 embedded_evaluator
@@ -1594,10 +1623,10 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
 
     // Used in the dual IterateThroughWindow lambdas below. Hoisted to avoid
     // dynamic memory allocations.
-    auto curr_val_literal = Literal::CreateR0<ReturnT>(ReturnT());
-    auto selected_val_literal = Literal::CreateR0<ReturnT>(ReturnT());
-    auto source_literal_scatter = Literal::CreateR0<ReturnT>(ReturnT());
-    auto scattered_literal = Literal::CreateR0<ReturnT>(ReturnT());
+    auto curr_val_literal = LiteralUtil::CreateR0<ReturnT>(ReturnT());
+    auto selected_val_literal = LiteralUtil::CreateR0<ReturnT>(ReturnT());
+    auto source_literal_scatter = LiteralUtil::CreateR0<ReturnT>(ReturnT());
+    auto scattered_literal = LiteralUtil::CreateR0<ReturnT>(ReturnT());
     do {
       // For each element in `source`, we place a window in `operand`. For each
       // window placement, we iterate inside the window twice:
@@ -1718,9 +1747,9 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
 
                 // Evaluate computation with specified literal operands.
                 const auto curr_val_literal =
-                    Literal::CreateR0<ReturnT>(curr_val);
+                    LiteralUtil::CreateR0<ReturnT>(curr_val);
                 const auto result_val_literal =
-                    Literal::CreateR0<ReturnT>(result_val);
+                    LiteralUtil::CreateR0<ReturnT>(result_val);
                 std::unique_ptr<Literal> computed_result =
                     embedded_evaluator
                         .Evaluate<const Literal*>(
@@ -1765,7 +1794,7 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
       return operand_literal.Get<ReturnT>(operand_index);
     };
 
-    auto result = Literal::CreateFromDimensions(
+    auto result = LiteralUtil::CreateFromDimensions(
         shape.element_type(), AsInt64Slice(shape.dimensions()));
     TF_RETURN_IF_ERROR(result->Populate<ReturnT>(func));
     parent_->evaluated_[slice] = std::move(result);
@@ -1967,6 +1996,30 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
     return HandleReducePrecision<ElementwiseT>(reduce_precision);
   }
 
+  template <typename NativeT,
+            typename std::enable_if<
+                std::is_same<NativeT, float>::value ||
+                std::is_same<NativeT, int32>::value ||
+                std::is_same<NativeT, uint32>::value>::type* = nullptr>
+  Status HandleIota(HloInstruction* iota) {
+    auto result = MakeUnique<Literal>(iota->shape());
+    auto data = result->data<ReturnT>();
+    std::iota(data.begin(), data.end(), 0);
+    parent_->evaluated_[iota] = std::move(result);
+    return Status::OK();
+  }
+  template <typename NativeT,
+            typename std::enable_if<
+                !(std::is_same<NativeT, float>::value ||
+                  std::is_same<NativeT, int32>::value ||
+                  std::is_same<NativeT, uint32>::value)>::type* = nullptr>
+  Status HandleIota(HloInstruction* iota) {
+    return InvalidArgument("Unsupported type for iota");
+  }
+  Status HandleIota(HloInstruction* iota) override {
+    return HandleIota<ReturnT>(iota);
+  }
+
  private:
   // Creates a vector of multipliers which can be used to create a linear index
   // into shape.
@@ -2024,10 +2077,6 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
                              start_indices_typed.end());
 
     // Clamp the start indices so the slice is in-bounds w.r.t the operand.
-
-    // TODO(b/74360564): This is implementation defined behavior, but is
-    // currently respected by all implementations. Change this if we ever decide
-    // to officially document different behavior.
     for (int64 i = 0; i < start.size(); ++i) {
       start[i] = std::min<int64>(
           std::max(int64{0}, start[i]),
@@ -2061,10 +2110,6 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
                              start_indices_typed.end());
     // Clamp the update start indices so the slice is in-bounds w.r.t the
     // operand.
-
-    // TODO(b/74360564): This is implementation defined behavior, but is
-    // currently respected by all implementations. Change this if we ever decide
-    // to oficially document different behavior.
     for (int64 i = 0; i < rank; ++i) {
       start[i] = std::min<int64>(
           std::max<int64>(0, start[i]),
@@ -2183,38 +2228,6 @@ class HloEvaluatorTypedVisitor : public DfsHloVisitorWithDefault {
     return rhs_unsigned >= lhs_size_unsigned;
   }
 
-  // It's UB to use std::sort with std::less<float>, because of NaNs. Define
-  // "safe" less functions which are actually strict weak orders.
-  template <typename NativeT,
-            typename std::enable_if<std::is_integral<NativeT>::value>::type* =
-                nullptr>
-  static bool SafeLess(const NativeT& a, const NativeT& b) {
-    return a < b;
-  }
-
-  template <typename NativeT,
-            typename std::enable_if<
-                std::is_floating_point<NativeT>::value ||
-                std::is_same<NativeT, bfloat16>::value>::type* = nullptr>
-  static bool SafeLess(const NativeT& a, const NativeT& b) {
-    if (std::isnan(b)) {
-      return !std::isnan(a);
-    } else {
-      return a < b;
-    }
-  }
-
-  template <typename NativeT,
-            typename std::enable_if<
-                std::is_same<NativeT, Eigen::half>::value>::type* = nullptr>
-  static bool SafeLess(const NativeT& a, const NativeT& b) {
-    if (Eigen::half_impl::isnan(b)) {
-      return !Eigen::half_impl::isnan(a);
-    } else {
-      return a < b;
-    }
-  }
-
   HloEvaluator* parent_;
 };
 
diff --git a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
index 8856723f67..7e5866a356 100644
--- a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
+++ b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
@@ -27,7 +27,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
@@ -948,6 +948,7 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     case HloOpcode::kGe:
     case HloOpcode::kGt:
     case HloOpcode::kImag:
+    case HloOpcode::kIota:
     case HloOpcode::kIsFinite:
     case HloOpcode::kLe:
     case HloOpcode::kLog:
@@ -966,6 +967,7 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     case HloOpcode::kRemainder:
     case HloOpcode::kRng:
     case HloOpcode::kRoundNearestAfz:
+    case HloOpcode::kSelect:
     case HloOpcode::kShiftLeft:
     case HloOpcode::kShiftRightArithmetic:
     case HloOpcode::kShiftRightLogical:
@@ -1001,7 +1003,7 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     case HloOpcode::kPad:
     case HloOpcode::kReshape:
     case HloOpcode::kReverse:
-    case HloOpcode::kSelect:
+    case HloOpcode::kTupleSelect:
     case HloOpcode::kTranspose:
       // De-emphasize scalar-shaped data movement ops and all data movement ops
       // inside fusion nodes, both of which are essentially free.
@@ -1017,6 +1019,8 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
         return kWhite;
       }
       return kGreen;
+    case HloOpcode::kScatter:
+      // Do not de-emphasize Scatter, since it involves significant work.
     case HloOpcode::kCopy:
       // Emphasize copy nodes, which are either physical transposes (and thus
       // significant), or copies of read-only buffers (and thus dead weight).
diff --git a/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc b/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc
index 68f41a1cbb..1d7a062c55 100644
--- a/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_graph_dumper_test.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
 
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
@@ -120,7 +121,7 @@ TEST(HloGraphDumperTest, NestedFusion) {
 TEST(HloGraphDumperTest, Constant) {
   HloComputation::Builder b("b");
   auto instruction = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(-42)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(-42)));
   instruction->SetAndSanitizeName("i_am_a_constant_root_instruction");
   HloModuleConfig config;
   HloModule m(TestName(), config);
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.cc b/tensorflow/compiler/xla/service/hlo_instruction.cc
index 5b416d9654..7591b99204 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/protobuf_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
@@ -115,26 +115,27 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
       TF_RET_CHECK(proto.operand_ids_size() == 2)
           << "Send instruction should have 2 operand but sees "
           << proto.operand_ids_size();
-      instruction = CreateSend(operands(0), operands(1), proto.channel_id());
+      instruction = CreateSend(operands(0), operands(1), proto.channel_id(),
+                               proto.is_host_transfer());
       break;
     case HloOpcode::kSendDone:
       TF_RET_CHECK(proto.operand_ids_size() == 1)
           << "SendDone instruction should have 1 operand but sees "
           << proto.operand_ids_size();
-      instruction = CreateSendDone(operands(0));
+      instruction = CreateSendDone(operands(0), proto.is_host_transfer());
       break;
     case HloOpcode::kRecv:
       TF_RET_CHECK(proto.operand_ids_size() == 1)
           << "Recv instruction should have 1 operand but sees "
           << proto.operand_ids_size();
       instruction = CreateRecv(proto.shape().tuple_shapes(0), operands(0),
-                               proto.channel_id());
+                               proto.channel_id(), proto.is_host_transfer());
       break;
     case HloOpcode::kRecvDone:
       TF_RET_CHECK(proto.operand_ids_size() == 1)
           << "RecvDone instruction should have 1 operand but sees "
           << proto.operand_ids_size();
-      instruction = CreateRecvDone(operands(0));
+      instruction = CreateRecvDone(operands(0), proto.is_host_transfer());
       break;
     case HloOpcode::kReverse:
       TF_RET_CHECK(proto.operand_ids_size() == 1)
@@ -163,6 +164,20 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
                                                     proto.dimensions().end()),
                                  computations(0));
       break;
+    case HloOpcode::kSort: {
+      TF_RET_CHECK(proto.operand_ids_size() == 1 ||
+                   proto.operand_ids_size() == 2)
+          << "Sort instruction should have 1 or 2 operands but has "
+          << proto.operand_ids_size();
+      TF_RET_CHECK(proto.dimensions().size() == 1)
+          << "Sort instruction should have 1 dimension";
+      HloInstruction* keys = operands(0);
+      HloInstruction* values =
+          proto.operand_ids_size() == 2 ? operands(1) : nullptr;
+      instruction =
+          CreateSort(proto.shape(), proto.dimensions(0), keys, values);
+      break;
+    }
     case HloOpcode::kTranspose:
       TF_RET_CHECK(proto.operand_ids_size() == 1)
           << "Transpose instruction should have 1 operand but sees "
@@ -271,7 +286,7 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
         // converted to take tokens.
         instruction = CreateInfeed(data_shape, proto.infeed_config());
       } else {
-        CHECK_EQ(proto.operand_ids_size(), 2);
+        CHECK_EQ(proto.operand_ids_size(), 1);
         instruction =
             CreateInfeed(data_shape, operands(0), proto.infeed_config());
       }
@@ -372,6 +387,39 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
                                        slice_sizes);
       break;
     }
+    case HloOpcode::kGather: {
+      TF_RET_CHECK(proto.operand_ids_size() == 2)
+          << "Gather instruction should have 2 operands but sees "
+          << proto.operand_ids_size();
+      TF_RET_CHECK(proto.has_gather_dimension_numbers())
+          << "Gather instruction should have GatherDimensionNumbers set.";
+      std::unique_ptr<GatherDimensionNumbers> gather_dimension_numbers =
+          MakeUnique<GatherDimensionNumbers>(proto.gather_dimension_numbers());
+      std::vector<int64> gather_window_bounds;
+      for (int64 bound : proto.gather_window_bounds()) {
+        gather_window_bounds.push_back(bound);
+      }
+      instruction =
+          CreateGather(proto.shape(), operands(0), operands(1),
+                       *gather_dimension_numbers, gather_window_bounds);
+      break;
+    }
+    case HloOpcode::kScatter: {
+      TF_RET_CHECK(proto.operand_ids_size() == 3)
+          << "Scatter instruction should have 3 operands but sees "
+          << proto.operand_ids_size();
+      TF_RET_CHECK(proto.has_scatter_dimension_numbers())
+          << "Scatter instruction should have ScatterDimensionNumbers set.";
+      TF_RET_CHECK(proto.called_computation_ids_size() == 1)
+          << "Scatter instruction should have 1 called computation but sees "
+          << proto.called_computation_ids_size();
+      auto scatter_dimension_numbers = MakeUnique<ScatterDimensionNumbers>(
+          proto.scatter_dimension_numbers());
+      instruction =
+          CreateScatter(proto.shape(), operands(0), operands(1), operands(2),
+                        computations(0), *scatter_dimension_numbers);
+      break;
+    }
     default: {
       instruction = WrapUnique(new HloInstruction(opcode, proto.shape()));
       for (const int64 operand_id : proto.operand_ids()) {
@@ -413,13 +461,6 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
     instruction->set_sharding(sharding);
   }
 
-  if (proto.has_gather_dimension_numbers()) {
-    instruction->gather_dimension_numbers_ =
-        MakeUnique<GatherDimensionNumbers>(proto.gather_dimension_numbers());
-  }
-  for (int64 bound : proto.gather_window_bounds()) {
-    instruction->gather_window_bounds_.push_back(bound);
-  }
   return std::move(instruction);
 }
 
@@ -438,6 +479,11 @@ StatusOr<std::unique_ptr<HloInstruction>> HloInstruction::CreateFromProto(
   return MakeUnique<HloConstantInstruction>(std::move(literal));
 }
 
+/* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateIota(
+    const Shape& shape) {
+  return WrapUnique(new HloInstruction(HloOpcode::kIota, shape));
+}
+
 /* static */ std::unique_ptr<HloInstruction>
 HloInstruction::CreateGetTupleElement(const Shape& shape,
                                       HloInstruction* operand, int64 index) {
@@ -541,8 +587,9 @@ HloInstruction::CreateGetTupleElement(const Shape& shape,
   // Only certain opcodes are supported with CreateTernary: opcodes of ternary
   // instructions with no auxiliary fields.
   switch (opcode) {
-    case (HloOpcode::kClamp):
-    case (HloOpcode::kSelect):
+    case HloOpcode::kClamp:
+    case HloOpcode::kSelect:
+    case HloOpcode::kTupleSelect:
       break;
     default:
       LOG(FATAL) << "Invalid ternary instruction opcode "
@@ -650,29 +697,33 @@ HloInstruction::CreateCrossReplicaSum(
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateSend(
-    HloInstruction* operand, HloInstruction* token, int64 channel_id) {
-  return MakeUnique<HloSendInstruction>(operand, token, channel_id);
+    HloInstruction* operand, HloInstruction* token, int64 channel_id,
+    bool is_host_transfer) {
+  return MakeUnique<HloSendInstruction>(operand, token, channel_id,
+                                        is_host_transfer);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateSendDone(
-    HloInstruction* operand) {
+    HloInstruction* operand, bool is_host_transfer) {
   auto send_operand = DynCast<HloSendInstruction>(operand);
   CHECK(send_operand != nullptr)
       << "SendDone must take the context operand from Send";
-  return MakeUnique<HloSendDoneInstruction>(send_operand);
+  return MakeUnique<HloSendDoneInstruction>(send_operand, is_host_transfer);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateRecv(
-    const Shape& shape, HloInstruction* token, int64 channel_id) {
-  return MakeUnique<HloRecvInstruction>(shape, token, channel_id);
+    const Shape& shape, HloInstruction* token, int64 channel_id,
+    bool is_host_transfer) {
+  return MakeUnique<HloRecvInstruction>(shape, token, channel_id,
+                                        is_host_transfer);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateRecvDone(
-    HloInstruction* operand) {
+    HloInstruction* operand, bool is_host_transfer) {
   auto recv_operand = DynCast<HloRecvInstruction>(operand);
   CHECK(recv_operand != nullptr)
       << "RecvDone must take the context operand from Recv";
-  return MakeUnique<HloRecvDoneInstruction>(recv_operand);
+  return MakeUnique<HloRecvDoneInstruction>(recv_operand, is_host_transfer);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateReverse(
@@ -683,6 +734,7 @@ HloInstruction::CreateCrossReplicaSum(
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateAfterAll(
     tensorflow::gtl::ArraySlice<HloInstruction*> operands) {
+  CHECK(!operands.empty());
   auto instruction = WrapUnique(
       new HloInstruction(HloOpcode::kAfterAll, ShapeUtil::MakeTokenShape()));
   for (auto operand : operands) {
@@ -691,6 +743,11 @@ HloInstruction::CreateCrossReplicaSum(
   return instruction;
 }
 
+/* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateToken() {
+  return WrapUnique(
+      new HloInstruction(HloOpcode::kAfterAll, ShapeUtil::MakeTokenShape()));
+}
+
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateWhile(
     const Shape& shape, HloComputation* condition, HloComputation* body,
     HloInstruction* init) {
@@ -771,11 +828,25 @@ HloInstruction::CreateBitcastConvert(const Shape& shape,
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateReduce(
-    const Shape& shape, HloInstruction* arg, HloInstruction* init_value,
+    const Shape& shape, HloInstruction* operand, HloInstruction* init_value,
+    tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
+    HloComputation* reduce_computation) {
+  auto instruction = WrapUnique(new HloReduceInstruction(
+      shape, {operand, init_value}, dimensions_to_reduce, reduce_computation));
+  return std::move(instruction);
+}
+
+/* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateReduce(
+    const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+    tensorflow::gtl::ArraySlice<HloInstruction*> init_values,
     tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
     HloComputation* reduce_computation) {
-  return MakeUnique<HloReduceInstruction>(
-      shape, arg, init_value, dimensions_to_reduce, reduce_computation);
+  std::vector<HloInstruction*> all_args;
+  all_args.reserve(operands.size() * 2);
+  all_args.insert(all_args.end(), operands.begin(), operands.end());
+  all_args.insert(all_args.end(), init_values.begin(), init_values.end());
+  return MakeUnique<HloReduceInstruction>(shape, all_args, dimensions_to_reduce,
+                                          reduce_computation);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateReduceWindow(
@@ -908,13 +979,9 @@ HloInstruction::CreateBroadcastSequence(
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateSort(
-    const Shape& shape, HloInstruction* keys, HloInstruction* values) {
-  auto instruction = WrapUnique(new HloInstruction(HloOpcode::kSort, shape));
-  instruction->AppendOperand(keys);
-  if (values) {
-    instruction->AppendOperand(values);
-  }
-  return instruction;
+    const Shape& shape, int64 dimension, HloInstruction* keys,
+    HloInstruction* values) {
+  return MakeUnique<HloSortInstruction>(shape, dimension, keys, values);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateFusion(
@@ -961,6 +1028,8 @@ bool HloInstruction::HasSideEffectNoRecurse() const {
     case HloOpcode::kTrace:
     case HloOpcode::kHostCompute:
       return true;
+    case HloOpcode::kCrossReplicaSum:
+      return all_reduce_id().has_value();
     default:
       return false;
   }
@@ -1019,34 +1088,18 @@ bool HloInstruction::HasSideEffect() const {
     const Shape& shape, HloInstruction* operand, HloInstruction* gather_indices,
     const GatherDimensionNumbers& gather_dim_numbers,
     tensorflow::gtl::ArraySlice<int64> window_bounds) {
-  std::unique_ptr<HloInstruction> instruction =
-      WrapUnique(new HloInstruction(HloOpcode::kGather, shape));
-  instruction->AppendOperand(operand);
-  instruction->AppendOperand(gather_indices);
-  instruction->gather_dimension_numbers_ =
-      MakeUnique<GatherDimensionNumbers>(gather_dim_numbers);
-  c_copy(window_bounds, std::back_inserter(instruction->gather_window_bounds_));
-  return instruction;
+  return MakeUnique<HloGatherInstruction>(shape, operand, gather_indices,
+                                          gather_dim_numbers, window_bounds);
 }
 
-/* static */ GatherDimensionNumbers HloInstruction::MakeGatherDimNumbers(
-    tensorflow::gtl::ArraySlice<int64> output_window_dims,
-    tensorflow::gtl::ArraySlice<int64> elided_window_dims,
-    tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
-    int64 index_vector_dim) {
-  GatherDimensionNumbers gather_dim_numbers;
-  for (int64 output_window_dim : output_window_dims) {
-    gather_dim_numbers.add_output_window_dims(output_window_dim);
-  }
-  for (int64 elided_window_dim : elided_window_dims) {
-    gather_dim_numbers.add_elided_window_dims(elided_window_dim);
-  }
-  for (int64 gather_dim_to_input_dim : gather_dims_to_operand_dims) {
-    gather_dim_numbers.add_gather_dims_to_operand_dims(gather_dim_to_input_dim);
-  }
-
-  gather_dim_numbers.set_index_vector_dim(index_vector_dim);
-  return gather_dim_numbers;
+/* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateScatter(
+    const Shape& shape, HloInstruction* operand,
+    HloInstruction* scatter_indices, HloInstruction* updates,
+    HloComputation* update_computation,
+    const ScatterDimensionNumbers& scatter_dim_numbers) {
+  return MakeUnique<HloScatterInstruction>(shape, operand, scatter_indices,
+                                           updates, update_computation,
+                                           scatter_dim_numbers);
 }
 
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateDomain(
@@ -1109,6 +1162,10 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
     case HloOpcode::kHostCompute:
     case HloOpcode::kPad:
     case HloOpcode::kDynamicSlice:
+    case HloOpcode::kSort:
+    case HloOpcode::kGather:
+    case HloOpcode::kScatter:
+    case HloOpcode::kIota:
       clone = CloneWithNewOperandsImpl(shape, new_operands, context);
       break;
     // Unary ops.
@@ -1164,6 +1221,7 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
     // Ternary ops.
     case HloOpcode::kClamp:
     case HloOpcode::kSelect:
+    case HloOpcode::kTupleSelect:
       CHECK_EQ(new_operands.size(), 3);
       clone = CreateTernary(shape, opcode_, new_operands[0], new_operands[1],
                             new_operands[2]);
@@ -1209,11 +1267,6 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
                                 true_computation(), new_operands[2],
                                 false_computation());
       break;
-    case HloOpcode::kGather:
-      CHECK_EQ(new_operands.size(), 2);
-      clone = CreateGather(shape, new_operands[0], new_operands[1],
-                           *gather_dimension_numbers_, gather_window_bounds_);
-      break;
     case HloOpcode::kDomain:
       CHECK_EQ(new_operands.size(), 1);
       clone =
@@ -1221,15 +1274,11 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
                        user_side_metadata_->Clone());
       break;
     case HloOpcode::kAfterAll:
-      clone = CreateAfterAll(new_operands);
-      break;
-    case HloOpcode::kSort:
-      CHECK(new_operands.size() == 1 || new_operands.size() == 2)
-          << "Too many operands for sort: " << new_operands.size();
-      HloInstruction* keys = new_operands[0];
-      HloInstruction* values =
-          new_operands.size() == 2 ? new_operands[1] : nullptr;
-      clone = CreateSort(shape, keys, values);
+      if (new_operands.empty()) {
+        clone = CreateToken();
+      } else {
+        clone = CreateAfterAll(new_operands);
+      }
       break;
   }
   SetupDerivedInstruction(clone.get());
@@ -1507,16 +1556,14 @@ bool HloInstruction::IdenticalSlowPath(
     case HloOpcode::kShiftRightArithmetic:
     case HloOpcode::kShiftRightLogical:
     case HloOpcode::kSign:
-    case HloOpcode::kSort:
     case HloOpcode::kSin:
     case HloOpcode::kSubtract:
     case HloOpcode::kTanh:
     case HloOpcode::kTuple:
+    case HloOpcode::kTupleSelect:
       return true;
 
-    // These opcodes have complex or special behavior so just return false.
-    case HloOpcode::kDomain:
-    case HloOpcode::kWhile:
+    // This opcode has complex or special behavior so just return false.
     case HloOpcode::kAfterAll:
       return false;
 
@@ -1525,11 +1572,6 @@ bool HloInstruction::IdenticalSlowPath(
       return protobuf_util::ProtobufEquals(dot_dimension_numbers(),
                                            other.dot_dimension_numbers());
 
-    case HloOpcode::kGather:
-      return protobuf_util::ProtobufEquals(gather_dimension_numbers(),
-                                           other.gather_dimension_numbers()) &&
-             gather_window_bounds() == other.gather_window_bounds();
-
     // Remaining instructions with special values.
     case HloOpcode::kCall:
       return eq_computations(to_apply(), other.to_apply());
@@ -1537,6 +1579,18 @@ bool HloInstruction::IdenticalSlowPath(
       return eq_computations(true_computation(), other.true_computation()) &&
              eq_computations(false_computation(), other.false_computation());
 
+    case HloOpcode::kWhile: {
+      if (eq_computations(while_body(), other.while_body()) &&
+          eq_computations(while_condition(), other.while_condition())) {
+        return true;
+      }
+      return false;
+    }
+
+    case HloOpcode::kDomain:
+      return operand_side_metadata().Matches(other.operand_side_metadata()) &&
+             user_side_metadata().Matches(other.user_side_metadata());
+
     // Ops migrated to subclasses should never come to this line.
     // TODO(b/80131774): Remove this switch when migration is complete.
     case HloOpcode::kBatchNormTraining:
@@ -1550,11 +1604,13 @@ bool HloInstruction::IdenticalSlowPath(
     case HloOpcode::kReverse:
     case HloOpcode::kConcatenate:
     case HloOpcode::kReduce:
+    case HloOpcode::kSort:
     case HloOpcode::kTranspose:
     case HloOpcode::kBroadcast:
     case HloOpcode::kMap:
     case HloOpcode::kSlice:
     case HloOpcode::kConstant:
+    case HloOpcode::kIota:
     case HloOpcode::kTrace:
     case HloOpcode::kFusion:
     case HloOpcode::kRng:
@@ -1571,9 +1627,12 @@ bool HloInstruction::IdenticalSlowPath(
     case HloOpcode::kHostCompute:
     case HloOpcode::kPad:
     case HloOpcode::kDynamicSlice:
+    case HloOpcode::kGather:
+    case HloOpcode::kScatter:
       LOG(FATAL) << "Base class impl called for opcode with subclass: "
                  << opcode();
   }
+  return false;
 }
 
 void HloInstruction::RemoveUser(HloInstruction* user) {
@@ -1623,8 +1682,8 @@ Status HloInstruction::ReplaceOperandWith(int64 operand_num,
 
   TF_RET_CHECK(ShapeUtil::CompatibleIgnoringFpPrecision(old_operand->shape(),
                                                         new_operand->shape()))
-      << old_operand->shape().ShortDebugString() << " is not compatible with "
-      << new_operand->shape().ShortDebugString();
+      << old_operand->shape() << " is not compatible with "
+      << new_operand->shape();
   operands_[operand_num] = new_operand;
 
   VLOG(3) << "Replacing operand " << operand_num << " of " << name() << " with "
@@ -1676,6 +1735,7 @@ HloComputation* HloInstruction::to_apply() const {
     case HloOpcode::kReduceWindow:
     case HloOpcode::kReduce:
     case HloOpcode::kCrossReplicaSum:
+    case HloOpcode::kScatter:
       CHECK_EQ(called_computations_.size(), 1);
       return called_computations_[0];
     default:
@@ -1694,6 +1754,7 @@ void HloInstruction::set_to_apply(HloComputation* computation) {
     case HloOpcode::kReduceWindow:
     case HloOpcode::kReduce:
     case HloOpcode::kCrossReplicaSum:
+    case HloOpcode::kScatter:
       CHECK_EQ(called_computations_.size(), 1);
       called_computations_[0] = computation;
       break;
@@ -1833,7 +1894,6 @@ bool HloInstruction::IsElementwiseImpl(
 
     // Ternary elementwise operations.
     case HloOpcode::kSelect:
-      return !ShapeUtil::IsTuple(shape_);
     case HloOpcode::kClamp:
       return true;
 
@@ -1845,6 +1905,10 @@ bool HloInstruction::IsElementwiseImpl(
   }
 }
 
+bool HloInstruction::IsCrossModuleAllReduce() const {
+  return opcode() == HloOpcode::kCrossReplicaSum && all_reduce_id();
+}
+
 string HloInstruction::ToStringWithCanonicalNameMap(
     const HloPrintOptions& options,
     CanonicalNameMap* canonical_name_map) const {
@@ -1937,11 +2001,6 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
   if (dot_dimension_numbers_ != nullptr) {
     extra.push_back(DotDimensionNumbersToString());
   }
-  if (gather_dimension_numbers_ != nullptr) {
-    extra.push_back(GatherDimensionNumbersToString());
-    extra.push_back(
-        StrCat("window_bounds={", Join(gather_window_bounds(), ","), "}"));
-  }
 
   if (options.print_subcomputation_mode() ==
       HloPrintOptions::PrintSubcomputationMode::kNameOnly) {
@@ -1962,7 +2021,8 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
     } else if (opcode() == HloOpcode::kCall || opcode() == HloOpcode::kMap ||
                opcode() == HloOpcode::kReduceWindow ||
                opcode() == HloOpcode::kReduce ||
-               opcode() == HloOpcode::kCrossReplicaSum) {
+               opcode() == HloOpcode::kCrossReplicaSum ||
+               opcode() == HloOpcode::kScatter) {
       extra.push_back(
           StrCat("to_apply=", PrintName(to_apply()->name(), options)));
     } else if (!called_computations().empty()) {
@@ -1998,6 +2058,7 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
       case HloOpcode::kReduceWindow:
       case HloOpcode::kReduce:
       case HloOpcode::kCrossReplicaSum:
+      case HloOpcode::kScatter:
         extra.push_back(
             StrCat("to_apply=\n", to_apply()->ToString(new_options)));
         break;
@@ -2028,8 +2089,8 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
   }
   if (operand_side_metadata_ != nullptr && user_side_metadata_ != nullptr) {
     extra.push_back(StrCat("domain={kind=\"", operand_side_metadata_->Kind(),
-                           "\", entry=", operand_side_metadata_->ToString(),
-                           ", exit=", user_side_metadata_->ToString(), "}"));
+                           "\", entry=", user_side_metadata_->ToString(),
+                           ", exit=", operand_side_metadata_->ToString(), "}"));
   }
 
   return extra;
@@ -2071,14 +2132,6 @@ HloInstructionProto HloInstruction::ToProto() const {
   if (dot_dimension_numbers_ != nullptr) {
     *proto.mutable_dot_dimension_numbers() = *dot_dimension_numbers_;
   }
-  if (gather_dimension_numbers_ != nullptr) {
-    *proto.mutable_gather_dimension_numbers() = *gather_dimension_numbers_;
-  }
-  if (opcode() == HloOpcode::kGather) {
-    for (int64 bound : gather_window_bounds()) {
-      proto.add_gather_window_bounds(bound);
-    }
-  }
 
   if (has_sharding()) {
     *proto.mutable_sharding() = sharding().ToProto();
@@ -2204,6 +2257,8 @@ Status HloInstruction::Visit(DfsHloVisitorBase<HloInstructionPtr>* visitor) {
       return visitor->HandleRemainder(this);
     case HloOpcode::kSelect:
       return visitor->HandleSelect(this);
+    case HloOpcode::kTupleSelect:
+      return visitor->HandleTupleSelect(this);
     case HloOpcode::kConvolution:
       return visitor->HandleConvolution(this);
     case HloOpcode::kFft:
@@ -2302,10 +2357,14 @@ Status HloInstruction::Visit(DfsHloVisitorBase<HloInstructionPtr>* visitor) {
       return visitor->HandleSendDone(this);
     case HloOpcode::kGather:
       return visitor->HandleGather(this);
+    case HloOpcode::kScatter:
+      return visitor->HandleScatter(this);
     case HloOpcode::kDomain:
       return visitor->HandleDomain(this);
     case HloOpcode::kAfterAll:
       return visitor->HandleAfterAll(this);
+    case HloOpcode::kIota:
+      return visitor->HandleIota(this);
 
     // These opcodes are not handled here.
     case HloOpcode::kTrace:
@@ -2837,26 +2896,6 @@ std::ostream& operator<<(std::ostream& os, HloInstruction::FusionKind kind) {
   return os << ToString(kind);
 }
 
-string HloInstruction::GatherDimensionNumbersToString() const {
-  CHECK_NE(gather_dimension_numbers_.get(), nullptr);
-  string output_window_dims =
-      StrCat("output_window_dims={",
-             Join(gather_dimension_numbers_->output_window_dims(), ","), "}");
-  string elided_window_dims =
-      StrCat("elided_window_dims={",
-             Join(gather_dimension_numbers_->elided_window_dims(), ","), "}");
-  string gather_dims_to_operand_dims = StrCat(
-      "gather_dims_to_operand_dims={",
-      Join(gather_dimension_numbers_->gather_dims_to_operand_dims(), ","), "}");
-  string index_vector_dim = StrCat(
-      "index_vector_dim=", gather_dimension_numbers_->index_vector_dim());
-
-  return Join<std::initializer_list<string>>(
-      {output_window_dims, elided_window_dims, gather_dims_to_operand_dims,
-       index_vector_dim},
-      ", ");
-}
-
 bool HloInstruction::CouldBeBitcast() const {
   switch (opcode_) {
     case HloOpcode::kTranspose:
@@ -3170,4 +3209,19 @@ int64 HloInstruction::slice_sizes(int64 dimension) const {
 const std::vector<int64>& HloInstruction::dynamic_slice_sizes() const {
   return Cast<HloDynamicSliceInstruction>(this)->dynamic_slice_sizes();
 }
+
+const GatherDimensionNumbers& HloInstruction::gather_dimension_numbers() const {
+  return Cast<HloGatherInstruction>(this)->gather_dimension_numbers();
+}
+
+tensorflow::gtl::ArraySlice<int64> HloInstruction::gather_window_bounds()
+    const {
+  return Cast<HloGatherInstruction>(this)->gather_window_bounds();
+}
+
+const ScatterDimensionNumbers& HloInstruction::scatter_dimension_numbers()
+    const {
+  return Cast<HloScatterInstruction>(this)->scatter_dimension_numbers();
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.h b/tensorflow/compiler/xla/service/hlo_instruction.h
index 34e7dcb43d..e722086732 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.h
+++ b/tensorflow/compiler/xla/service/hlo_instruction.h
@@ -33,7 +33,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/iterator_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
@@ -346,6 +346,9 @@ class HloInstruction {
   static std::unique_ptr<HloInstruction> CreateConstant(
       std::unique_ptr<Literal> literal);
 
+  // Creates an Iota instruction.
+  static std::unique_ptr<HloInstruction> CreateIota(const Shape& shape);
+
   // Creates a get tuple element instruction.
   static std::unique_ptr<HloInstruction> CreateGetTupleElement(
       const Shape& shape, HloInstruction* operand, int64 index);
@@ -444,8 +447,7 @@ class HloInstruction {
       HloComputation* reduce_computation,
       tensorflow::gtl::ArraySlice<int64> replica_group_ids,
       tensorflow::StringPiece barrier,
-      const tensorflow::gtl::optional<int64>& all_reduce_id =
-          tensorflow::gtl::nullopt);
+      const tensorflow::gtl::optional<int64>& all_reduce_id);
 
   // Creates a conversion instruction, where operand is the data to convert and
   // shape is the target shape for the conversion.
@@ -485,27 +487,30 @@ class HloInstruction {
 
   // Creates an asynchronous send instruction with the given channel id, which
   // initiates sending the operand data to a unique receive instruction in
-  // another computation that has the same channel id.
-  static std::unique_ptr<HloInstruction> CreateSend(HloInstruction* operand,
-                                                    HloInstruction* token,
-                                                    int64 channel_id);
+  // another computation that has the same channel id. If is_host_transfer is
+  // true, then this Send operation transfers data to the host.
+  static std::unique_ptr<HloInstruction> CreateSend(
+      HloInstruction* operand, HloInstruction* token, int64 channel_id,
+      bool is_host_transfer = false);
 
   // Blocks until data transfer for the Send instruction (operand) is complete.
   // The operand must be kSend.
   static std::unique_ptr<HloInstruction> CreateSendDone(
-      HloInstruction* operand);
+      HloInstruction* operand, bool is_host_transfer = false);
 
   // Creates an asynchronous receive instruction with the given channel id,
   // which allocates resources to receive data of the given shape from a unique
-  // send instruction in another computation that has the same channel id.
-  static std::unique_ptr<HloInstruction> CreateRecv(const Shape& shape,
-                                                    HloInstruction* token,
-                                                    int64 channel_id);
+  // send instruction in another computation that has the same channel id.  If
+  // is_host_transfer is true, then this Send operation transfers data from the
+  // host.
+  static std::unique_ptr<HloInstruction> CreateRecv(
+      const Shape& shape, HloInstruction* token, int64 channel_id,
+      bool is_host_transfer = false);
 
   // Blocks until data transfer for the Recv instruction (operand) is complete
   // and returns the receive buffer. The operand must be kRecv.
   static std::unique_ptr<HloInstruction> CreateRecvDone(
-      HloInstruction* operand);
+      HloInstruction* operand, bool is_host_transfer = false);
 
   // Creates a slice instruction, where the operand is sliced by the given
   // start/limit indices.
@@ -536,17 +541,34 @@ class HloInstruction {
       int64 dimension);
 
   // Creates a reduce instruction, where the computation (given by the handle)
-  // is applied successively to every element in operand. That is, if f is the
-  // function to apply (which either takes 2 [accumulator, value] or 3
-  // [accumulator, index, value] arguments) and init is a reduction operator
-  // specified initial value (for example, 0 for addition), then this operation
-  // will compute:
-  //   f(f(init, [index0], value0), [index1], value1), ...)
+  // is applied successively to every element in operand. For example, let f be
+  // the function to apply, which takes 2 arguments, an accumulator and the
+  // current value. Let init be an initial value (which is normally chosen to be
+  // the identity element for f, e.g. 0 if f is addition).
+  // Then the reduce HLO will compute:
+  // f(f(init, value0), value1), ...)
   static std::unique_ptr<HloInstruction> CreateReduce(
       const Shape& shape, HloInstruction* operand, HloInstruction* init_value,
       tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
       HloComputation* reduce_computation);
 
+  // A more general, multiple-argument version of the above.
+  // The function to apply, f, now takes N arguments:
+  // [accumulator0, accumulator1, ..., accumulatorN, value0, value1, ...,
+  // init_valueN], and returns an N-tuple. The performed computation is (for
+  // commutative and associative f operators) equivalent to:
+  //
+  // f_1 = f(init0, ...  initN, input0.value0, ..., inputN.value0)
+  // f_2 = f(f_1.tuple_element(0), ..., f_1.tuple_element(N), input0.value1,
+  // ..., inputN.value1)
+  // ...
+  // TODO(b/112040122): Add support to this in HLO passes and in backends.
+  static std::unique_ptr<HloInstruction> CreateReduce(
+      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+      tensorflow::gtl::ArraySlice<HloInstruction*> init_values,
+      tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
+      HloComputation* reduce_computation);
+
   // Creates a reduce-window instruction, where the computation (given
   // by the handle) is applied window-wise at each valid window
   // position in the operand.
@@ -615,7 +637,7 @@ class HloInstruction {
 
   // Creates a sort op, with a keys operand, and an optional values operand.
   static std::unique_ptr<HloInstruction> CreateSort(
-      const Shape& shape, HloInstruction* keys,
+      const Shape& shape, int64 dimension, HloInstruction* keys,
       HloInstruction* values = nullptr);
 
   // Creates a while instruction, given a condition computation, a body
@@ -639,6 +661,12 @@ class HloInstruction {
       const GatherDimensionNumbers& gather_dim_numbers,
       tensorflow::gtl::ArraySlice<int64> window_bounds);
 
+  static std::unique_ptr<HloInstruction> CreateScatter(
+      const Shape& shape, HloInstruction* operand,
+      HloInstruction* scatter_indices, HloInstruction* updates,
+      HloComputation* update_computation,
+      const ScatterDimensionNumbers& scatter_dim_numbers);
+
   // Creates a kDomain instruction which delimits an HLO domain which have
   // the provided user and operand side metadata.
   static std::unique_ptr<HloInstruction> CreateDomain(
@@ -687,17 +715,18 @@ class HloInstruction {
       const Shape& shape, HloInstruction* operand,
       tensorflow::gtl::ArraySlice<int64> dimensions);
 
-  // Creates a token instruction used for joining or creating new values of
-  // token type which thread through side-effecting operations.
+  // Creates a Afterall instruction used for joining or creating new values of
+  // token type which thread through side-effecting operations. Operands must
+  // all be tokens, and there must be at least one operand.
   static std::unique_ptr<HloInstruction> CreateAfterAll(
       tensorflow::gtl::ArraySlice<HloInstruction*> operands);
 
-  // Creates an instance of GatherDimensionNumbers.
-  static GatherDimensionNumbers MakeGatherDimNumbers(
-      tensorflow::gtl::ArraySlice<int64> output_window_dims,
-      tensorflow::gtl::ArraySlice<int64> elided_window_dims,
-      tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
-      int64 index_vector_dim);
+  // Creates an AfterAll instruction which creates a token type out of thin air
+  // (no operands). This is a separate method from CreateAfterAll to facility
+  // the removal of operand-less AfterAll instructions.
+  // TODO(b/110532604): Remove this capability of creating a token from nothing
+  // when we plumb a primordial token from the entry computation.
+  static std::unique_ptr<HloInstruction> CreateToken();
 
   // Returns the opcode for this instruction.
   HloOpcode opcode() const { return opcode_; }
@@ -1008,9 +1037,7 @@ class HloInstruction {
     if (sharding_ == nullptr) {
       return tensorflow::gtl::optional<int64>();
     }
-    auto device = sharding_->UniqueDevice();
-    return device.ok() ? device.ValueOrDie()
-                       : tensorflow::gtl::optional<int64>();
+    return sharding_->UniqueDevice();
   }
   // Sets the sharding of this operator. Should only be called by HloModule or
   // HloComputation methods.
@@ -1073,19 +1100,6 @@ class HloInstruction {
   // Returns the dump string of the dot dimension numbers.
   string DotDimensionNumbersToString() const;
 
-  const GatherDimensionNumbers& gather_dimension_numbers() const {
-    CHECK(gather_dimension_numbers_ != nullptr);
-    return *gather_dimension_numbers_;
-  }
-
-  tensorflow::gtl::ArraySlice<int64> gather_window_bounds() const {
-    CHECK_EQ(opcode(), HloOpcode::kGather);
-    return gather_window_bounds_;
-  }
-
-  // Returns the dump string of the gather dimension numbers.
-  string GatherDimensionNumbersToString() const;
-
   // Clones the HLO instruction. The clone will have the same opcode, shape, and
   // operands. After creation the clone has no uses. "this" (the instruction
   // cloned from) is not changed. Suffix is the string to append to the name of
@@ -1140,6 +1154,9 @@ class HloInstruction {
   // Returns true if this instruction is elementwise on all its operands.
   bool IsElementwise() const;
 
+  // Returns true if this is an cross module all-reduce instrucion.
+  bool IsCrossModuleAllReduce() const;
+
   // Returns true if this elementwise instruction implicitly broadcasts operand
   // `operand_idx`.
   //
@@ -1452,6 +1469,15 @@ class HloInstruction {
 
   // Delegates to HloDynamicSliceInstruction::dynamic_slice_sizes.
   const std::vector<int64>& dynamic_slice_sizes() const;
+
+  // Delegates to HloGatherInstruction::gather_dimension_numbers.
+  const GatherDimensionNumbers& gather_dimension_numbers() const;
+  // Delegates to HloGatherInstruction::gather_window_bounds.
+  tensorflow::gtl::ArraySlice<int64> gather_window_bounds() const;
+
+  // Delegates to HloScatterInstruction::scatter_dimension_numbers().
+  const ScatterDimensionNumbers& scatter_dimension_numbers() const;
+
   // Old methods kept for smooth subclassing transition END.
 
  protected:
@@ -1595,9 +1621,6 @@ class HloInstruction {
   // Describes the dimension numbers used for a dot.
   std::unique_ptr<DotDimensionNumbers> dot_dimension_numbers_;
 
-  std::unique_ptr<GatherDimensionNumbers> gather_dimension_numbers_;
-  std::vector<int64> gather_window_bounds_;
-
   // Used to tag kCopy instructions that are eligible for copy elision.
   bool copy_elision_allowed_ = true;
 
diff --git a/tensorflow/compiler/xla/service/hlo_instruction_test.cc b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
index d8ca99dfd1..8a694dde80 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
@@ -20,10 +20,11 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/protobuf_util.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -249,7 +250,7 @@ TEST_F(HloInstructionTest, MultipleUsersAndOperands) {
   auto param1 = builder.AddInstruction(
       HloInstruction::CreateParameter(1, r0f32_, "param1"));
   auto c0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto addleft = builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32_, HloOpcode::kAdd, param0, c0));
   auto addright = builder.AddInstruction(
@@ -294,7 +295,7 @@ TEST_F(HloInstructionTest, MultipleUsersAndOperandsWithUnaryOps) {
   auto param1 = builder.AddInstruction(
       HloInstruction::CreateParameter(1, r0f32_, "param1"));
   auto c0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto neg1 = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kNegate, c0));
   auto addleft = builder.AddInstruction(
@@ -334,7 +335,7 @@ TEST_F(HloInstructionTest, TrivialMap) {
   auto param = embedded_builder.AddInstruction(
       HloInstruction::CreateParameter(0, r0f32, "x"));
   auto value = embedded_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   embedded_builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32, HloOpcode::kAdd, param, value));
   auto add_f32 = module->AddEmbeddedComputation(embedded_builder.Build());
@@ -383,9 +384,9 @@ TEST_F(HloInstructionTest, TrivialReduce) {
   auto param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, f32a100x10, "p"));
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto reduce = builder.AddInstruction(
       HloInstruction::CreateReduce(f32v100, param0, const0,
                                    /*dimensions_to_reduce=*/{1}, add_f32));
@@ -626,7 +627,7 @@ TEST_F(HloInstructionTest, SingletonFusionOp) {
   HloComputation::Builder builder(TestName());
   // Create a fusion instruction containing a single unary operation.
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto exp = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kExp, constant));
   auto module = CreateNewModule();
@@ -642,9 +643,9 @@ TEST_F(HloInstructionTest, BinaryFusionOp) {
   HloComputation::Builder builder(TestName());
   // Create a fusion instruction containing a single binary operation.
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.1f)));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
       r0f32_, HloOpcode::kAdd, constant1, constant2));
   auto module = CreateNewModule();
@@ -661,7 +662,7 @@ TEST_F(HloInstructionTest, ChainFusionOp) {
   HloComputation::Builder builder(TestName());
   // Create a chain of fused unary ops.
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto exp1 = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kExp, constant));
   auto exp2 = builder.AddInstruction(
@@ -682,7 +683,7 @@ TEST_F(HloInstructionTest, PreserveMetadataInFusionAndClone) {
   HloComputation::Builder builder(TestName());
   // Create a chain of fused unary ops.
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto exp1 = builder.AddInstruction(
       HloInstruction::CreateUnary(r0f32_, HloOpcode::kExp, constant));
   auto exp2 = builder.AddInstruction(
@@ -710,13 +711,13 @@ TEST_F(HloInstructionTest, PreserveMetadataInFusionAndClone) {
 TEST_F(HloInstructionTest, PreserveOutfeedShapeThroughClone) {
   HloComputation::Builder builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2<float>({
           {1, 2},
           {3, 4},
       })));
   auto shape10 = ShapeUtil::MakeShapeWithLayout(F32, {2, 2}, {1, 0});
   auto shape01 = ShapeUtil::MakeShapeWithLayout(F32, {2, 2}, {0, 1});
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto outfeed10 = builder.AddInstruction(
       HloInstruction::CreateOutfeed(shape10, constant, token, ""));
   auto outfeed01 = builder.AddInstruction(
@@ -732,7 +733,7 @@ TEST_F(HloInstructionTest, PreserveOutfeedShapeThroughClone) {
 TEST_F(HloInstructionTest, PreserveTupleShapeThroughClone) {
   HloComputation::Builder builder(TestName());
   auto* constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2<float>({
           {1, 2},
           {3, 4},
       })));
@@ -763,7 +764,7 @@ TEST_F(HloInstructionTest, FusionOpWithCalledComputations) {
 
   HloComputation::Builder builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto map_1_x = builder.AddInstruction(
       HloInstruction::CreateMap(scalar_shape, {constant}, computation_x));
   auto map_2_x = builder.AddInstruction(
@@ -798,11 +799,11 @@ TEST_F(HloInstructionTest, ComplexFusionOp) {
   // Notable complexities are repeated operands in the same instruction,
   // different shapes, use of value in different expressions.
   auto c1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.1f)));
   auto c2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.1f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.1f)));
   auto c3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(9.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(9.0f)));
 
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(r0f32_, HloOpcode::kAdd, c1, c2));
@@ -873,11 +874,11 @@ TEST_F(HloInstructionTest, IdenticalInstructions) {
   // Create a set of random constant operands to use below. Make them matrices
   // so dimensions are interesting.
   auto operand1 = HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}));
   auto operand2 = HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{10.0, 20.0}, {30.0, 40.0}}));
-  auto vector_operand =
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({42.0, 123.0}));
+      LiteralUtil::CreateR2<float>({{10.0, 20.0}, {30.0, 40.0}}));
+  auto vector_operand = HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<float>({42.0, 123.0}));
   Shape shape = operand1->shape();
 
   // Convenient short names for the operands.
@@ -1234,9 +1235,9 @@ TEST_F(HloInstructionTest, NestedFusionEquality) {
   // Build a nested fusion computation.
   Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
   auto a = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 0.0}, {0.0, 1.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 0.0}, {0.0, 1.0}})));
   auto b = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
+      LiteralUtil::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
   auto b_t = builder.AddInstruction(
       HloInstruction::CreateTranspose(data_shape, b, {1, 0}));
   DotDimensionNumbers dot_dnums;
@@ -1245,7 +1246,7 @@ TEST_F(HloInstructionTest, NestedFusionEquality) {
   auto dot = builder.AddInstruction(
       HloInstruction::CreateDot(data_shape, a, b_t, dot_dnums));
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto add_operand = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape, one, {1}));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
@@ -1342,7 +1343,7 @@ TEST_F(HloInstructionTest, Stringification) {
             "condition=%TransposeDot, body=%TransposeDot");
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   HloInstruction* conditional =
       builder.AddInstruction(HloInstruction::CreateConditional(
           sout, pred, x, computation, x, computation));
@@ -1369,7 +1370,7 @@ TEST_F(HloInstructionTest, StringifyGather_0) {
   HloInstruction* gather_instruction =
       builder.AddInstruction(HloInstruction::CreateGather(
           gather_result_shape, input, gather_indices,
-          HloInstruction::MakeGatherDimNumbers(
+          HloGatherInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{4, 5, 6, 7, 8},
               /*elided_window_dims=*/{},
               /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1405,7 +1406,7 @@ TEST_F(HloInstructionTest, StringifyGather_1) {
   HloInstruction* gather_instruction =
       builder.AddInstruction(HloInstruction::CreateGather(
           gather_result_shape, input, gather_indices,
-          HloInstruction::MakeGatherDimNumbers(
+          HloGatherInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{4, 5, 6, 7, 8},
               /*elided_window_dims=*/{},
               /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1424,6 +1425,55 @@ TEST_F(HloInstructionTest, StringifyGather_1) {
             "index_vector_dim=2, window_bounds={30,29,28,27,26}");
 }
 
+TEST_F(HloInstructionTest, StringifyScatter) {
+  Shape input_tensor_shape = ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
+  Shape scatter_indices_tensor_shape =
+      ShapeUtil::MakeShape(S64, {10, 9, 5, 7, 6});
+  Shape scatter_updates_shape =
+      ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26});
+
+  HloComputation::Builder builder("Scatter");
+  HloInstruction* input = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_tensor_shape, "input_tensor"));
+  HloInstruction* scatter_indices =
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          1, scatter_indices_tensor_shape, "scatter_indices"));
+  HloInstruction* scatter_updates =
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          2, scatter_updates_shape, "scatter_updates"));
+
+  HloComputation::Builder update_builder("Scatter.update");
+  update_builder.AddInstruction(
+      HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(F32, {}), "p1"));
+  update_builder.AddInstruction(
+      HloInstruction::CreateParameter(1, ShapeUtil::MakeShape(F32, {}), "p2"));
+
+  auto module = CreateNewModule();
+  auto* update_computation =
+      module->AddEmbeddedComputation(update_builder.Build());
+
+  HloInstruction* scatter_instruction =
+      builder.AddInstruction(HloInstruction::CreateScatter(
+          input_tensor_shape, input, scatter_indices, scatter_updates,
+          update_computation,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4, 5, 6, 7, 8},
+              /*inserted_window_dims=*/{},
+              /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/2)));
+  module->AddEntryComputation(builder.Build());
+
+  EXPECT_EQ(
+      scatter_instruction->ToString(),
+      "%scatter = f32[50,49,48,47,46]{4,3,2,1,0} "
+      "scatter(f32[50,49,48,47,46]{4,3,2,1,0} %input_tensor, "
+      "s64[10,9,5,7,6]{4,3,2,1,0} %scatter_indices, "
+      "f32[10,9,7,6,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} %scatter_updates), "
+      "update_window_dims={4,5,6,7,8}, inserted_window_dims={}, "
+      "scatter_dims_to_operand_dims={0,1,2,3,4}, index_vector_dim=2, "
+      "to_apply=%Scatter.update");
+}
+
 TEST_F(HloInstructionTest, CanonnicalStringificationFusion) {
   // Tests stringification of a simple op, fusion, while, and conditional.
   const Shape s1 = ShapeUtil::MakeShape(F32, {5, 10});
@@ -1455,15 +1505,15 @@ TEST_F(HloInstructionTest, CanonnicalStringificationFusion) {
   HloInstruction* fusion = computation->CreateFusionInstruction(
       {dot, reshape}, HloInstruction::FusionKind::kLoop);
 
-  EXPECT_EQ(
-      fusion->ToString(options),
+  const string expected_fusion =
       R"(f32[5,20]{1,0} fusion(f32[5,10]{1,0}, f32[20,10]{1,0}), kind=kLoop, calls=
 {
   tmp_0 = f32[5,10]{1,0} parameter(0)
   tmp_1 = f32[20,10]{1,0} parameter(1)
   tmp_2 = f32[10,20]{1,0} transpose(f32[20,10]{1,0} tmp_1), dimensions={1,0}
   ROOT tmp_3 = f32[5,20]{1,0} dot(f32[5,10]{1,0} tmp_0, f32[10,20]{1,0} tmp_2), lhs_contracting_dims={1}, rhs_contracting_dims={0}
-})");
+})";
+  EXPECT_EQ(fusion->ToString(options), expected_fusion);
 }
 
 TEST_F(HloInstructionTest, CanonnicalStringificationWhile) {
@@ -1495,8 +1545,8 @@ TEST_F(HloInstructionTest, CanonnicalStringificationWhile) {
       HloInstruction::CreateWhile(sout, computation, computation, x));
 
   auto options = HloPrintOptions().Canonical();
-  EXPECT_EQ(loop->ToString(options),
-            R"(f32[5,20]{1,0} while(f32[5,10]{1,0}), condition=
+  const string expected_loop =
+      R"(f32[5,20]{1,0} while(f32[5,10]{1,0}), condition=
 {
   tmp_0 = f32[5,10]{1,0} parameter(0)
   tmp_1 = f32[20,10]{1,0} parameter(1)
@@ -1518,7 +1568,8 @@ TEST_F(HloInstructionTest, CanonnicalStringificationWhile) {
     tmp_2 = f32[10,20]{1,0} transpose(f32[20,10]{1,0} tmp_1), dimensions={1,0}
     ROOT tmp_3 = f32[5,20]{1,0} dot(f32[5,10]{1,0} tmp_0, f32[10,20]{1,0} tmp_2), lhs_contracting_dims={1}, rhs_contracting_dims={0}
   }
-})");
+})";
+  EXPECT_EQ(loop->ToString(options), expected_loop);
 }
 
 TEST_F(HloInstructionTest, CanonnicalStringificationConditional) {
@@ -1550,13 +1601,12 @@ TEST_F(HloInstructionTest, CanonnicalStringificationConditional) {
       HloInstruction::CreateWhile(sout, computation, computation, x));
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   HloInstruction* conditional =
       builder.AddInstruction(HloInstruction::CreateConditional(
           sout, pred, x, computation, x, computation));
   auto options = HloPrintOptions().Canonical();
-  EXPECT_EQ(
-      conditional->ToString(options),
+  const string expected_conditional =
       R"(f32[5,20]{1,0} conditional(pred[], f32[5,10]{1,0}, f32[5,10]{1,0}), true_computation=
 {
   tmp_0 = f32[5,10]{1,0} parameter(0)
@@ -1579,7 +1629,8 @@ TEST_F(HloInstructionTest, CanonnicalStringificationConditional) {
     tmp_2 = f32[10,20]{1,0} transpose(f32[20,10]{1,0} tmp_1), dimensions={1,0}
     ROOT tmp_3 = f32[5,20]{1,0} dot(f32[5,10]{1,0} tmp_0, f32[10,20]{1,0} tmp_2), lhs_contracting_dims={1}, rhs_contracting_dims={0}
   }
-})");
+})";
+  EXPECT_EQ(conditional->ToString(options), expected_conditional);
 }
 
 TEST_F(HloInstructionTest, CheckDeepClone) {
diff --git a/tensorflow/compiler/xla/service/hlo_instructions.cc b/tensorflow/compiler/xla/service/hlo_instructions.cc
index dcc1e3c8af..1d71a74c40 100644
--- a/tensorflow/compiler/xla/service/hlo_instructions.cc
+++ b/tensorflow/compiler/xla/service/hlo_instructions.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <deque>
 
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
@@ -180,8 +181,11 @@ std::unique_ptr<HloInstruction> HloFftInstruction::CloneWithNewOperandsImpl(
 
 HloSendRecvInstruction::HloSendRecvInstruction(HloOpcode opcode,
                                                const Shape& shape,
-                                               int64 channel_id)
-    : HloInstruction(opcode, shape), channel_id_(channel_id) {}
+                                               int64 channel_id,
+                                               bool is_host_transfer)
+    : HloInstruction(opcode, shape),
+      channel_id_(channel_id),
+      is_host_transfer_(is_host_transfer) {}
 
 HloInstructionProto HloSendRecvInstruction::ToProto() const {
   HloInstructionProto proto = HloInstruction::ToProto();
@@ -191,7 +195,12 @@ HloInstructionProto HloSendRecvInstruction::ToProto() const {
 
 std::vector<string> HloSendRecvInstruction::ExtraAttributesToStringImpl(
     const HloPrintOptions& options) const {
-  return {StrCat("channel_id=", channel_id_)};
+  std::vector<string> attrs;
+  attrs.push_back(StrCat("channel_id=", channel_id_));
+  if (is_host_transfer()) {
+    attrs.push_back("is_host_transfer=true");
+  }
+  return attrs;
 }
 
 bool HloSendRecvInstruction::IdenticalSlowPath(
@@ -204,12 +213,14 @@ bool HloSendRecvInstruction::IdenticalSlowPath(
 
 // Send instruction produces a tuple of {aliased operand, U32 context}.
 HloSendInstruction::HloSendInstruction(HloInstruction* operand,
-                                       HloInstruction* token, int64 channel_id)
+                                       HloInstruction* token, int64 channel_id,
+                                       bool is_host_transfer)
     : HloSendRecvInstruction(
           HloOpcode::kSend,
-          ShapeUtil::MakeTupleShape(
-              {CHECK_NOTNULL(operand)->shape(), ShapeUtil::MakeShape(U32, {})}),
-          channel_id) {
+          ShapeUtil::MakeTupleShape({CHECK_NOTNULL(operand)->shape(),
+                                     ShapeUtil::MakeShape(U32, {}),
+                                     ShapeUtil::MakeTokenShape()}),
+          channel_id, is_host_transfer) {
   AppendOperand(operand);
   AppendOperand(token);
 }
@@ -220,12 +231,14 @@ std::unique_ptr<HloInstruction> HloSendInstruction::CloneWithNewOperandsImpl(
     HloCloneContext* context) const {
   CHECK_EQ(new_operands.size(), 2);
   return MakeUnique<HloSendInstruction>(new_operands[0], new_operands[1],
-                                        channel_id());
+                                        channel_id(), is_host_transfer());
 }
 
-HloSendDoneInstruction::HloSendDoneInstruction(HloSendInstruction* operand)
-    : HloSendRecvInstruction(HloOpcode::kSendDone, ShapeUtil::MakeNil(),
-                             CHECK_NOTNULL(operand)->channel_id()) {
+HloSendDoneInstruction::HloSendDoneInstruction(HloSendInstruction* operand,
+                                               bool is_host_transfer)
+    : HloSendRecvInstruction(HloOpcode::kSendDone, ShapeUtil::MakeTokenShape(),
+                             CHECK_NOTNULL(operand)->channel_id(),
+                             is_host_transfer) {
   AppendOperand(operand);
 }
 
@@ -236,16 +249,18 @@ HloSendDoneInstruction::CloneWithNewOperandsImpl(
     HloCloneContext* context) const {
   CHECK_EQ(new_operands.size(), 1);
   return MakeUnique<HloSendDoneInstruction>(
-      Cast<HloSendInstruction>(new_operands[0]));
+      Cast<HloSendInstruction>(new_operands[0]), is_host_transfer());
 }
 
 // Recv instruction produces a tuple of {receive buffer, U32 context}.
 HloRecvInstruction::HloRecvInstruction(const Shape& shape,
-                                       HloInstruction* token, int64 channel_id)
+                                       HloInstruction* token, int64 channel_id,
+                                       bool is_host_transfer)
     : HloSendRecvInstruction(
           HloOpcode::kRecv,
-          ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeShape(U32, {})}),
-          channel_id) {
+          ShapeUtil::MakeTupleShape({shape, ShapeUtil::MakeShape(U32, {}),
+                                     ShapeUtil::MakeTokenShape()}),
+          channel_id, is_host_transfer) {
   AppendOperand(token);
 }
 
@@ -255,14 +270,18 @@ std::unique_ptr<HloInstruction> HloRecvInstruction::CloneWithNewOperandsImpl(
     HloCloneContext* context) const {
   CHECK_EQ(new_operands.size(), 1);
   return MakeUnique<HloRecvInstruction>(
-      ShapeUtil::GetTupleElementShape(shape, 0), new_operands[0], channel_id());
+      ShapeUtil::GetTupleElementShape(shape, 0), new_operands[0], channel_id(),
+      is_host_transfer());
 }
 
-HloRecvDoneInstruction::HloRecvDoneInstruction(HloRecvInstruction* operand)
+HloRecvDoneInstruction::HloRecvDoneInstruction(HloRecvInstruction* operand,
+                                               bool is_host_transfer)
     : HloSendRecvInstruction(
           HloOpcode::kRecvDone,
-          ShapeUtil::GetTupleElementShape(operand->shape(), 0),
-          CHECK_NOTNULL(operand)->channel_id()) {
+          ShapeUtil::MakeTupleShape(
+              {ShapeUtil::GetTupleElementShape(operand->shape(), 0),
+               ShapeUtil::MakeTokenShape()}),
+          CHECK_NOTNULL(operand)->channel_id(), is_host_transfer) {
   AppendOperand(operand);
 }
 
@@ -273,7 +292,7 @@ HloRecvDoneInstruction::CloneWithNewOperandsImpl(
     HloCloneContext* context) const {
   CHECK_EQ(new_operands.size(), 1);
   return MakeUnique<HloRecvDoneInstruction>(
-      Cast<HloRecvInstruction>(new_operands[0]));
+      Cast<HloRecvInstruction>(new_operands[0]), is_host_transfer());
 }
 
 HloAllReduceInstruction::HloAllReduceInstruction(
@@ -286,8 +305,6 @@ HloAllReduceInstruction::HloAllReduceInstruction(
       replica_group_ids_(replica_group_ids.begin(), replica_group_ids.end()),
       cross_replica_sum_barrier_(barrier.begin(), barrier.end()),
       all_reduce_id_(all_reduce_id) {
-  // TODO(b/79737069): Remove the CHECK when supported.
-  CHECK(!all_reduce_id_);
   for (auto operand : operands) {
     AppendOperand(operand);
   }
@@ -421,13 +438,14 @@ HloConcatenateInstruction::CloneWithNewOperandsImpl(
 }
 
 HloReduceInstruction::HloReduceInstruction(
-    const Shape& shape, HloInstruction* arg, HloInstruction* init_value,
+    const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> args,
     tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
     HloComputation* reduce_computation)
     : HloInstruction(HloOpcode::kReduce, shape),
       dimensions_(dimensions_to_reduce.begin(), dimensions_to_reduce.end()) {
-  AppendOperand(arg);
-  AppendOperand(init_value);
+  for (HloInstruction* arg : args) {
+    AppendOperand(arg);
+  }
   AppendComputation(reduce_computation);
 }
 
@@ -460,8 +478,48 @@ std::unique_ptr<HloInstruction> HloReduceInstruction::CloneWithNewOperandsImpl(
     tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
     HloCloneContext* context) const {
   CHECK_EQ(new_operands.size(), 2);
-  return MakeUnique<HloReduceInstruction>(
-      shape, new_operands[0], new_operands[1], dimensions(), to_apply());
+  return MakeUnique<HloReduceInstruction>(shape, new_operands, dimensions(),
+                                          to_apply());
+}
+
+HloSortInstruction::HloSortInstruction(const Shape& shape, int64 dimension,
+                                       HloInstruction* keys,
+                                       HloInstruction* values)
+    : HloInstruction(HloOpcode::kSort, shape), dimensions_({dimension}) {
+  AppendOperand(keys);
+  if (values) {
+    AppendOperand(values);
+  }
+}
+
+HloInstructionProto HloSortInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  for (int64 dimension : dimensions_) {
+    proto.add_dimensions(dimension);
+  }
+  return proto;
+}
+
+std::vector<string> HloSortInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {StrCat("dimensions={", Join(dimensions(), ","), "}")};
+}
+
+bool HloSortInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloSortInstruction&>(other);
+  return dimensions() == casted_other.dimensions();
+}
+
+std::unique_ptr<HloInstruction> HloSortInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  HloInstruction* keys = new_operands[0];
+  HloInstruction* values = new_operands.size() == 2 ? new_operands[1] : nullptr;
+  return MakeUnique<HloSortInstruction>(shape, dimensions(0), keys, values);
 }
 
 HloTransposeInstruction::HloTransposeInstruction(
@@ -762,7 +820,7 @@ string HloConstantInstruction::OperandsToStringWithCanonicalNameMap(
 HloTraceInstruction::HloTraceInstruction(const string& tag,
                                          HloInstruction* operand)
     : HloInstruction(HloOpcode::kTrace, ShapeUtil::MakeNil()),
-      literal_(Literal::CreateR1U8(tag)) {
+      literal_(LiteralUtil::CreateR1U8(tag)) {
   AppendOperand(operand);
   operand->set_tracing(this);
 }
@@ -1048,8 +1106,6 @@ HloInstruction* HloFusionInstruction::CloneAndFuseInternal(
         CHECK_NOTNULL(GetModule())->AddEmbeddedComputation(builder.Build()));
     clone = fused_expression_root();
   } else {
-    clone = fused_instructions_computation()->AddInstruction(
-        instruction_to_fuse->Clone(/*suffix=*/""));
     // When add_output is false, instruction_to_fuse is necessarily an operand
     // of the fusion instruction. After fusion this will no longer be the
     // case. Remove the operand from the operand list and remove its
@@ -1059,6 +1115,16 @@ HloInstruction* HloFusionInstruction::CloneAndFuseInternal(
     bool in_operand_list = std::find(operands().begin(), operands().end(),
                                      instruction_to_fuse) != operands().end();
     CHECK(add_output || in_operand_list);
+    if (instruction_to_fuse->opcode() == HloOpcode::kTuple) {
+      // We assume all uses of a kTuple operation are GTE ops, not another
+      // fusion node. In this case, we don't need to clone
+      // 'instruction_to_fuse'.
+      CHECK(!in_operand_list);
+      clone = instruction_to_fuse;
+    } else {
+      clone = fused_instructions_computation()->AddInstruction(
+          instruction_to_fuse->Clone(/*suffix=*/""));
+    }
     const std::vector<HloInstruction*>& fused_parameters =
         fused_instructions_computation()->parameter_instructions();
     for (int64 operand_num = 0; operand_num < operand_count(); ++operand_num) {
@@ -1155,9 +1221,10 @@ HloInstruction* HloFusionInstruction::CloneAndFuseInternal(
     }
     int64 index = tuple_elements.size();
     if (instruction_to_fuse->opcode() == HloOpcode::kTuple) {
-      index -= instruction_to_fuse->operand_count();
+      CHECK_EQ(clone, instruction_to_fuse);
+      index -= clone->operand_count();
       std::vector<HloInstruction*> to_be_removed;
-      for (auto old_gte : instruction_to_fuse->users()) {
+      for (auto old_gte : clone->users()) {
         CHECK_EQ(old_gte->opcode(), HloOpcode::kGetTupleElement);
         int64 old_tuple_index = old_gte->tuple_index();
         HloInstruction* new_gte =
@@ -1169,7 +1236,6 @@ HloInstruction* HloFusionInstruction::CloneAndFuseInternal(
       for (auto old_gte : to_be_removed) {
         TF_CHECK_OK(parent()->RemoveInstruction(old_gte));
       }
-      TF_CHECK_OK(fused_instructions_computation()->RemoveInstruction(clone));
     } else {
       HloInstruction* new_gte =
           parent()->AddInstruction(HloInstruction::CreateGetTupleElement(
@@ -1178,7 +1244,9 @@ HloInstruction* HloFusionInstruction::CloneAndFuseInternal(
     }
   }
 
-  VLOG(2) << "New clone:\n" << clone->ToString();
+  if (clone != instruction_to_fuse) {
+    VLOG(2) << "New clone:\n" << clone->ToString();
+  }
   return clone;
 }
 
@@ -1859,4 +1927,180 @@ HloDynamicSliceInstruction::CloneWithNewOperandsImpl(
   return MakeUnique<HloDynamicSliceInstruction>(
       shape, new_operands[0], new_operands[1], dynamic_slice_sizes_);
 }
+
+HloGatherInstruction::HloGatherInstruction(
+    const Shape& shape, HloInstruction* operand, HloInstruction* gather_indices,
+    const GatherDimensionNumbers& gather_dim_numbers,
+    tensorflow::gtl::ArraySlice<int64> window_bounds)
+    : HloInstruction(HloOpcode::kGather, shape) {
+  AppendOperand(operand);
+  AppendOperand(gather_indices);
+  gather_dimension_numbers_ =
+      MakeUnique<GatherDimensionNumbers>(gather_dim_numbers);
+  c_copy(window_bounds, std::back_inserter(gather_window_bounds_));
+}
+
+string HloGatherInstruction::GatherDimensionNumbersToString() const {
+  CHECK(gather_dimension_numbers_ != nullptr);
+  string output_window_dims =
+      StrCat("output_window_dims={",
+             Join(gather_dimension_numbers_->output_window_dims(), ","), "}");
+  string elided_window_dims =
+      StrCat("elided_window_dims={",
+             Join(gather_dimension_numbers_->elided_window_dims(), ","), "}");
+  string gather_dims_to_operand_dims = StrCat(
+      "gather_dims_to_operand_dims={",
+      Join(gather_dimension_numbers_->gather_dims_to_operand_dims(), ","), "}");
+  string index_vector_dim = StrCat(
+      "index_vector_dim=", gather_dimension_numbers_->index_vector_dim());
+
+  return Join<std::initializer_list<string>>(
+      {output_window_dims, elided_window_dims, gather_dims_to_operand_dims,
+       index_vector_dim},
+      ", ");
+}
+
+/* static */ GatherDimensionNumbers HloGatherInstruction::MakeGatherDimNumbers(
+    tensorflow::gtl::ArraySlice<int64> output_window_dims,
+    tensorflow::gtl::ArraySlice<int64> elided_window_dims,
+    tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
+    int64 index_vector_dim) {
+  GatherDimensionNumbers gather_dim_numbers;
+  for (int64 output_window_dim : output_window_dims) {
+    gather_dim_numbers.add_output_window_dims(output_window_dim);
+  }
+  for (int64 elided_window_dim : elided_window_dims) {
+    gather_dim_numbers.add_elided_window_dims(elided_window_dim);
+  }
+  for (int64 gather_dim_to_input_dim : gather_dims_to_operand_dims) {
+    gather_dim_numbers.add_gather_dims_to_operand_dims(gather_dim_to_input_dim);
+  }
+
+  gather_dim_numbers.set_index_vector_dim(index_vector_dim);
+  return gather_dim_numbers;
+}
+
+HloInstructionProto HloGatherInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  *proto.mutable_gather_dimension_numbers() = gather_dimension_numbers();
+  for (int64 bound : gather_window_bounds()) {
+    proto.add_gather_window_bounds(bound);
+  }
+  return proto;
+}
+
+std::vector<string> HloGatherInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {GatherDimensionNumbersToString(),
+          StrCat("window_bounds={", Join(gather_window_bounds(), ","), "}")};
+}
+
+bool HloGatherInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloGatherInstruction&>(other);
+  return protobuf_util::ProtobufEquals(
+             gather_dimension_numbers(),
+             casted_other.gather_dimension_numbers()) &&
+         gather_window_bounds() == casted_other.gather_window_bounds();
+}
+
+std::unique_ptr<HloInstruction> HloGatherInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 2);
+  return MakeUnique<HloGatherInstruction>(
+      shape, new_operands[0], new_operands[1], gather_dimension_numbers(),
+      gather_window_bounds());
+}
+
+HloScatterInstruction::HloScatterInstruction(
+    const Shape& shape, HloInstruction* operand,
+    HloInstruction* scatter_indices, HloInstruction* updates,
+    HloComputation* update_computation,
+    const ScatterDimensionNumbers& scatter_dim_numbers)
+    : HloInstruction(HloOpcode::kScatter, shape) {
+  AppendOperand(operand);
+  AppendOperand(scatter_indices);
+  AppendOperand(updates);
+  AppendComputation(update_computation);
+  scatter_dimension_numbers_ =
+      MakeUnique<ScatterDimensionNumbers>(scatter_dim_numbers);
+}
+
+string HloScatterInstruction::ScatterDimensionNumbersToString() const {
+  string update_window_dims =
+      StrCat("update_window_dims={",
+             Join(scatter_dimension_numbers().update_window_dims(), ","), "}");
+  string inserted_window_dims = StrCat(
+      "inserted_window_dims={",
+      Join(scatter_dimension_numbers().inserted_window_dims(), ","), "}");
+  string scatter_dims_to_operand_dims = StrCat(
+      "scatter_dims_to_operand_dims={",
+      Join(scatter_dimension_numbers().scatter_dims_to_operand_dims(), ","),
+      "}");
+  string index_vector_dim = StrCat(
+      "index_vector_dim=", scatter_dimension_numbers().index_vector_dim());
+
+  return Join<std::initializer_list<string>>(
+      {update_window_dims, inserted_window_dims, scatter_dims_to_operand_dims,
+       index_vector_dim},
+      ", ");
+}
+
+/* static */ ScatterDimensionNumbers
+HloScatterInstruction::MakeScatterDimNumbers(
+    tensorflow::gtl::ArraySlice<int64> update_window_dims,
+    tensorflow::gtl::ArraySlice<int64> inserted_window_dims,
+    tensorflow::gtl::ArraySlice<int64> scatter_dims_to_operand_dims,
+    int64 index_vector_dim) {
+  ScatterDimensionNumbers scatter_dim_numbers;
+  for (int64 update_window_dim : update_window_dims) {
+    scatter_dim_numbers.add_update_window_dims(update_window_dim);
+  }
+  for (int64 inserted_window_dim : inserted_window_dims) {
+    scatter_dim_numbers.add_inserted_window_dims(inserted_window_dim);
+  }
+  for (int64 scatter_dim_to_operand_dim : scatter_dims_to_operand_dims) {
+    scatter_dim_numbers.add_scatter_dims_to_operand_dims(
+        scatter_dim_to_operand_dim);
+  }
+  scatter_dim_numbers.set_index_vector_dim(index_vector_dim);
+  return scatter_dim_numbers;
+}
+
+HloInstructionProto HloScatterInstruction::ToProto() const {
+  HloInstructionProto proto = HloInstruction::ToProto();
+  *proto.mutable_scatter_dimension_numbers() = scatter_dimension_numbers();
+  return proto;
+}
+
+std::vector<string> HloScatterInstruction::ExtraAttributesToStringImpl(
+    const HloPrintOptions& options) const {
+  return {ScatterDimensionNumbersToString()};
+}
+
+bool HloScatterInstruction::IdenticalSlowPath(
+    const HloInstruction& other,
+    const std::function<bool(const HloComputation*, const HloComputation*)>&
+        eq_computations) const {
+  const auto& casted_other = static_cast<const HloScatterInstruction&>(other);
+  return protobuf_util::ProtobufEquals(
+             scatter_dimension_numbers(),
+             casted_other.scatter_dimension_numbers()) &&
+         eq_computations(to_apply(), casted_other.to_apply());
+}
+
+std::unique_ptr<HloInstruction> HloScatterInstruction::CloneWithNewOperandsImpl(
+    const Shape& shape,
+    tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+    HloCloneContext* context) const {
+  CHECK_EQ(new_operands.size(), 3);
+  return MakeUnique<HloScatterInstruction>(
+      shape, new_operands[0], new_operands[1], new_operands[2], to_apply(),
+      scatter_dimension_numbers());
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_instructions.h b/tensorflow/compiler/xla/service/hlo_instructions.h
index df6969c410..ac5a1ca080 100644
--- a/tensorflow/compiler/xla/service/hlo_instructions.h
+++ b/tensorflow/compiler/xla/service/hlo_instructions.h
@@ -141,12 +141,15 @@ class HloSendRecvInstruction : public HloInstruction {
   // channel.
   int64 channel_id() const { return channel_id_; }
 
+  // Returns whether this send/recv instruction sends data to/from the host.
+  bool is_host_transfer() const { return is_host_transfer_; }
+
   // Returns a serialized representation of this instruction.
   HloInstructionProto ToProto() const override;
 
  protected:
   explicit HloSendRecvInstruction(HloOpcode opcode, const Shape& shape,
-                                  int64 channel_id);
+                                  int64 channel_id, bool is_host_transfer);
 
  private:
   std::vector<string> ExtraAttributesToStringImpl(
@@ -157,12 +160,15 @@ class HloSendRecvInstruction : public HloInstruction {
           eq_computations) const override;
   // Represents a unique identifier for each Send/Recv instruction pair.
   int64 channel_id_;
+
+  // Whether this send/recv instruction sends data to/from the host.
+  bool is_host_transfer_;
 };
 
 class HloSendInstruction : public HloSendRecvInstruction {
  public:
   explicit HloSendInstruction(HloInstruction* operand, HloInstruction* token,
-                              int64 channel_id);
+                              int64 channel_id, bool is_host_transfer);
 
  private:
   // Implementation for non-common logic of CloneWithNewOperands.
@@ -174,7 +180,8 @@ class HloSendInstruction : public HloSendRecvInstruction {
 
 class HloSendDoneInstruction : public HloSendRecvInstruction {
  public:
-  explicit HloSendDoneInstruction(HloSendInstruction* operand);
+  explicit HloSendDoneInstruction(HloSendInstruction* operand,
+                                  bool is_host_transfer);
 
  private:
   // Implementation for non-common logic of CloneWithNewOperands.
@@ -187,7 +194,7 @@ class HloSendDoneInstruction : public HloSendRecvInstruction {
 class HloRecvInstruction : public HloSendRecvInstruction {
  public:
   explicit HloRecvInstruction(const Shape& shape, HloInstruction* token,
-                              int64 channel_id);
+                              int64 channel_id, bool is_host_transfer);
 
  private:
   // Implementation for non-common logic of CloneWithNewOperands.
@@ -199,7 +206,8 @@ class HloRecvInstruction : public HloSendRecvInstruction {
 
 class HloRecvDoneInstruction : public HloSendRecvInstruction {
  public:
-  explicit HloRecvDoneInstruction(HloRecvInstruction* operand);
+  explicit HloRecvDoneInstruction(HloRecvInstruction* operand,
+                                  bool is_host_transfer);
 
  private:
   // Implementation for non-common logic of CloneWithNewOperands.
@@ -216,8 +224,7 @@ class HloAllReduceInstruction : public HloInstruction {
       HloComputation* reduce_computation,
       tensorflow::gtl::ArraySlice<int64> replica_group_ids,
       tensorflow::StringPiece barrier,
-      const tensorflow::gtl::optional<int64>& all_reduce_id =
-          tensorflow::gtl::nullopt);
+      const tensorflow::gtl::optional<int64>& all_reduce_id);
 
   // Returns the group ids of each replica for CrossReplicaSum op.
   const std::vector<int64>& replica_group_ids() const {
@@ -324,7 +331,7 @@ class HloConcatenateInstruction : public HloInstruction {
 class HloReduceInstruction : public HloInstruction {
  public:
   explicit HloReduceInstruction(
-      const Shape& shape, HloInstruction* arg, HloInstruction* init_value,
+      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> args,
       tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
       HloComputation* reduce_computation);
   // Returns the dimension sizes or numbers associated with this instruction.
@@ -349,6 +356,35 @@ class HloReduceInstruction : public HloInstruction {
   std::vector<int64> dimensions_;
 };
 
+class HloSortInstruction : public HloInstruction {
+ public:
+  explicit HloSortInstruction(const Shape& shape, int64 dimension,
+                              HloInstruction* keys,
+                              HloInstruction* values = nullptr);
+  // Returns the dimension sizes or numbers associated with this instruction.
+  const std::vector<int64>& dimensions() const override { return dimensions_; }
+  int64 dimensions(int64 index) const override { return dimensions()[index]; }
+  // Returns the sort dimension for this instruction
+  int64 sort_dimension() { return dimensions(0); }
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::vector<int64> dimensions_;
+};
+
 class HloTransposeInstruction : public HloInstruction {
  public:
   explicit HloTransposeInstruction(
@@ -1119,6 +1155,88 @@ class HloDynamicSliceInstruction : public HloInstruction {
   // ('start' is specified dynamically in the second operand of the operation).
   std::vector<int64> dynamic_slice_sizes_;
 };
+
+class HloGatherInstruction : public HloInstruction {
+ public:
+  explicit HloGatherInstruction(
+      const Shape& shape, HloInstruction* operand,
+      HloInstruction* gather_indices,
+      const GatherDimensionNumbers& gather_dim_numbers,
+      tensorflow::gtl::ArraySlice<int64> window_bounds);
+  const GatherDimensionNumbers& gather_dimension_numbers() const {
+    CHECK(gather_dimension_numbers_ != nullptr);
+    return *gather_dimension_numbers_;
+  }
+  tensorflow::gtl::ArraySlice<int64> gather_window_bounds() const {
+    return gather_window_bounds_;
+  }
+  // Returns the dump string of the gather dimension numbers.
+  string GatherDimensionNumbersToString() const;
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+  // Creates an instance of GatherDimensionNumbers.
+  static GatherDimensionNumbers MakeGatherDimNumbers(
+      tensorflow::gtl::ArraySlice<int64> output_window_dims,
+      tensorflow::gtl::ArraySlice<int64> elided_window_dims,
+      tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
+      int64 index_vector_dim);
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::unique_ptr<GatherDimensionNumbers> gather_dimension_numbers_;
+  std::vector<int64> gather_window_bounds_;
+};
+
+class HloScatterInstruction : public HloInstruction {
+ public:
+  explicit HloScatterInstruction(
+      const Shape& shape, HloInstruction* operand,
+      HloInstruction* scatter_indices, HloInstruction* updates,
+      HloComputation* update_computation,
+      const ScatterDimensionNumbers& scatter_dim_numbers);
+  const ScatterDimensionNumbers& scatter_dimension_numbers() const {
+    CHECK(scatter_dimension_numbers_ != nullptr);
+    return *scatter_dimension_numbers_;
+  }
+  // Returns the dump string of the scatter dimension numbers.
+  string ScatterDimensionNumbersToString() const;
+  // Returns a serialized representation of this instruction.
+  HloInstructionProto ToProto() const override;
+
+  // Creates an instance of ScatterDimensionNumbers.
+  static ScatterDimensionNumbers MakeScatterDimNumbers(
+      tensorflow::gtl::ArraySlice<int64> update_window_dims,
+      tensorflow::gtl::ArraySlice<int64> inserted_window_dims,
+      tensorflow::gtl::ArraySlice<int64> scatter_dims_to_operand_dims,
+      int64 index_vector_dim);
+
+ private:
+  std::vector<string> ExtraAttributesToStringImpl(
+      const HloPrintOptions& options) const override;
+  bool IdenticalSlowPath(
+      const HloInstruction& other,
+      const std::function<bool(const HloComputation*, const HloComputation*)>&
+          eq_computations) const override;
+  // Implementation for non-common logic of CloneWithNewOperands.
+  std::unique_ptr<HloInstruction> CloneWithNewOperandsImpl(
+      const Shape& shape,
+      tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
+      HloCloneContext* context) const override;
+
+  std::unique_ptr<ScatterDimensionNumbers> scatter_dimension_numbers_;
+};
+
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_INSTRUCTIONS_H_
diff --git a/tensorflow/compiler/xla/service/hlo_lexer.cc b/tensorflow/compiler/xla/service/hlo_lexer.cc
index f0d9fdbc8f..71b44507cc 100644
--- a/tensorflow/compiler/xla/service/hlo_lexer.cc
+++ b/tensorflow/compiler/xla/service/hlo_lexer.cc
@@ -299,9 +299,12 @@ TokKind HloLexer::LexNumberOrPattern() {
   static LazyRE2 int_pattern = {R"([-]?\d+)"};
   if (RE2::Consume(&consumable, *int_pattern)) {
     current_ptr_ = consumable.begin();
-    tensorflow::strings::safe_strto64(
-        StringPieceFromPointers(token_start_, current_ptr_), &int64_val_);
-    return TokKind::kInt;
+    auto slice = StringPieceFromPointers(token_start_, current_ptr_);
+    if (tensorflow::strings::safe_strto64(slice, &int64_val_)) {
+      return TokKind::kInt;
+    }
+    LOG(ERROR) << "Failed to parse int literal: " << slice;
+    return TokKind::kError;
   }
 
   static LazyRE2 neg_inf = {"-inf"};
diff --git a/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc
index 0275294a1a..01b625c29c 100644
--- a/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_liveness_analysis_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_liveness_analysis.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
diff --git a/tensorflow/compiler/xla/service/hlo_matchers_test.cc b/tensorflow/compiler/xla/service/hlo_matchers_test.cc
index 9a3010cf1f..7de59acc1e 100644
--- a/tensorflow/compiler/xla/service/hlo_matchers_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_matchers_test.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 
@@ -75,8 +76,10 @@ TEST(HloMatchersTest, Test) {
 }
 
 TEST(HloMatchersTest, CustomCallMatcher) {
-  auto c1 = HloInstruction::CreateConstant(Literal::CreateR1<float>({1, 2, 3}));
-  auto c2 = HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 3}));
+  auto c1 =
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({1, 2, 3}));
+  auto c2 =
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({1, 2, 3}));
   auto call = HloInstruction::CreateCustomCall(
       ShapeUtil::MakeShape(F32, {1}), {c1.get(), c2.get()}, "foo_target");
 
diff --git a/tensorflow/compiler/xla/service/hlo_module.cc b/tensorflow/compiler/xla/service/hlo_module.cc
index 39bc25ba42..55ff073d3f 100644
--- a/tensorflow/compiler/xla/service/hlo_module.cc
+++ b/tensorflow/compiler/xla/service/hlo_module.cc
@@ -537,10 +537,11 @@ uint64 HloModule::RandomNew64() const {
 
 HloComputation* HloModule::GetComputationWithName(
     tensorflow::StringPiece name) {
-  auto it = c_find_if(computations(), [&](HloComputation* computation) {
+  auto computations_in_module = computations();
+  auto it = c_find_if(computations_in_module, [&](HloComputation* computation) {
     return computation->name() == name;
   });
-  return it == computations().end() ? nullptr : *it;
+  return it == computations_in_module.end() ? nullptr : *it;
 }
 
 /* static */ std::atomic<int> HloModule::next_unique_module_id_(0);
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc b/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
index bf33640db1..10bf9ffd6c 100644
--- a/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
+++ b/tensorflow/compiler/xla/service/hlo_module_group_metadata.cc
@@ -20,6 +20,8 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/util.h"
@@ -75,10 +77,23 @@ Status HloModuleGroupMetadata::Build() {
     if (tracked == nullptr) {
       return Status::OK();
     }
-    // Add the parent computation of this channel instruction and its peer
-    // computation (both must be while computations) as companions.
+
+    std::vector<HloComputation*> peers;
     if (IsChannelInstruction(hlo)) {
-      HloComputation* peer_computation = PeerComputation(hlo);
+      peers.push_back(PeerComputation(hlo));
+    } else if (hlo->IsCrossModuleAllReduce()) {
+      for (HloInstruction* instr : GetAllReduceGroup(*hlo->all_reduce_id())) {
+        if (instr == hlo) {
+          continue;
+        }
+        peers.push_back(instr->parent());
+      }
+    }
+
+    // Add the parent computation of this channel (or all-reduce) instruction
+    // and its peer computation(s) (both must be while computations) as
+    // companions.
+    for (HloComputation* peer_computation : peers) {
       const TrackedInstruction* peer_tracked =
           GetTrackedInstruction(peer_computation);
       TF_RET_CHECK(peer_tracked != nullptr)
@@ -162,8 +177,12 @@ bool HloModuleGroupMetadata::IsChannelInstruction(
     case HloOpcode::kSend:
     case HloOpcode::kRecv:
     case HloOpcode::kSendDone:
-    case HloOpcode::kRecvDone:
-      return true;
+    case HloOpcode::kRecvDone: {
+      const HloSendRecvInstruction* send_recv_instr =
+          DynCast<HloSendRecvInstruction>(instruction);
+      CHECK(send_recv_instr != nullptr);
+      return !send_recv_instr->is_host_transfer();
+    }
     default:
       return false;
   }
@@ -175,7 +194,8 @@ bool HloModuleGroupMetadata::IsCompanionInstruction(HloInstruction* hlo) const {
 
 bool HloModuleGroupMetadata::InstructionCommunicates(
     HloInstruction* hlo) const {
-  return IsChannelInstruction(hlo) || IsCompanionInstruction(hlo);
+  return IsChannelInstruction(hlo) || IsCompanionInstruction(hlo) ||
+         hlo->IsCrossModuleAllReduce();
 }
 
 const HloModuleGroupMetadata::Channel& HloModuleGroupMetadata::GetChannel(
@@ -200,6 +220,13 @@ HloComputation* HloModuleGroupMetadata::PeerComputation(
   }
 }
 
+const std::vector<HloInstruction*>& HloModuleGroupMetadata::GetAllReduceGroup(
+    int64 all_reduce_id) const {
+  auto it = all_reduce_map_.find(all_reduce_id);
+  CHECK(it != all_reduce_map_.end());
+  return it->second;
+}
+
 std::vector<HloModuleGroupMetadata::TrackedInstruction>
 HloModuleGroupMetadata::GetCompanionsPath(const HloInstruction* hlo) const {
   std::vector<TrackedInstruction> path;
@@ -278,10 +305,27 @@ Status HloModuleGroupMetadata::RecordInstructions() {
       tracked_instructions_[hlo->to_apply()] =
           TrackedInstruction(hlo, ComputationKind::kCallFunction);
     }
+
+    // Group cross module all-reduce instructions by the all_reduce id.
+    if (hlo->IsCrossModuleAllReduce()) {
+      TF_RET_CHECK(channel_id_map_.find(*hlo->all_reduce_id()) ==
+                   channel_id_map_.end())
+          << "all_reduce_id " << *hlo->all_reduce_id()
+          << " is already used by a send/recv instruction";
+      all_reduce_map_[*hlo->all_reduce_id()].push_back(hlo);
+      max_channel_id_ = std::max(max_channel_id_, *hlo->all_reduce_id());
+      return Status::OK();
+    }
+
     if (!IsChannelInstruction(hlo)) {
       return Status::OK();
     }
 
+    TF_RET_CHECK(all_reduce_map_.find(hlo->channel_id()) ==
+                 all_reduce_map_.end())
+        << "channel id " << hlo->channel_id()
+        << " is already used by an all-reduce instruction";
+
     // Add a new channel if needed.
     if (channel_id_map_.find(hlo->channel_id()) == channel_id_map_.end()) {
       channels_.emplace_back();
@@ -324,6 +368,7 @@ Status HloModuleGroupMetadata::RecordInstructions() {
     }
   }
   VLOG(2) << "Created " << channels_.size() << " channels";
+  VLOG(2) << "Created " << all_reduce_map_.size() << " all-reduce groups";
   return Status::OK();
 }
 
@@ -382,7 +427,8 @@ Status HloModuleGroupMetadata::VerifyChannelInstructions() {
   // Check if the shapes match for each channel.
   for (const Channel& channel : channels_) {
     const Shape& send_shape = channel.send->operand(0)->shape();
-    const Shape& recv_shape = channel.recv_done->shape();
+    const Shape& recv_shape =
+        ShapeUtil::GetTupleElementShape(channel.recv_done->shape(), 0);
     if (!ShapeUtil::Compatible(send_shape, recv_shape)) {
       return FailedPrecondition("send/recv shapes do not match");
     }
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_metadata.h b/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
index ffde3a332d..84f2d3f5fb 100644
--- a/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
+++ b/tensorflow/compiler/xla/service/hlo_module_group_metadata.h
@@ -92,7 +92,7 @@ class HloModuleGroupMetadata {
     ComputationKind kind_ = ComputationKind::kInvalid;
   };
 
-  // Represents a channel and the 4 instructions that form the channel.
+  // Represents a channel and the instructions that form the channel.
   struct Channel {
     int64 id = -1;
     HloInstruction* send = nullptr;
@@ -118,13 +118,17 @@ class HloModuleGroupMetadata {
   // comment above on companion instructions.
   bool IsCompanionInstruction(HloInstruction* hlo) const;
 
-  // Returns true if the instruction is either a channel instruction or a
-  // companion instruction.
+  // Returns true if the instruction is either a channel instruction, a
+  // cross-module all-reduce instruction, or a companion instruction.
   bool InstructionCommunicates(HloInstruction* hlo) const;
 
   // Returns the Channel instance for the given channel id.
   const Channel& GetChannel(int64 channel_id) const;
 
+  // Returns the all-reduce instructions with the same all_reduce_id.
+  const std::vector<HloInstruction*>& GetAllReduceGroup(
+      int64 all_reduce_id) const;
+
   // Returns the computation that contains the peer channel instructions for
   // the given instruction.
   //
@@ -187,13 +191,14 @@ class HloModuleGroupMetadata {
   // Returns all channels in the module group.
   const std::vector<Channel>& channels() const { return channels_; }
 
-  // Returns the maximum channel id used in the module group.
+  // Returns the maximum channel id or all_reduce_id used in the module group.
   int64 max_channel_id() const { return max_channel_id_; }
 
  private:
   Status Build();
 
-  // Record all channel instructions and While instructions.
+  // Record all channel instructions, cross-module AllReduce instructions, and
+  // While/Conditional/Call instructions.
   Status RecordInstructions();
 
   // Verifies the given HloModules are well-formed and follow the specification,
@@ -255,6 +260,9 @@ class HloModuleGroupMetadata {
   // Map from channel ids to the index in channels_.
   tensorflow::gtl::FlatMap<int64, int64> channel_id_map_;
 
+  // Map from all-reduce ids to the all reduce instructions.
+  tensorflow::gtl::FlatMap<int64, std::vector<HloInstruction*>> all_reduce_map_;
+
   // The maximum channel id used in the module group.
   int64 max_channel_id_ = -1;
 
diff --git a/tensorflow/compiler/xla/service/hlo_module_group_util.cc b/tensorflow/compiler/xla/service/hlo_module_group_util.cc
index 21a9b7291a..9fd0ade153 100644
--- a/tensorflow/compiler/xla/service/hlo_module_group_util.cc
+++ b/tensorflow/compiler/xla/service/hlo_module_group_util.cc
@@ -56,12 +56,17 @@ std::vector<HloInstruction*> HloModuleGroupUtil::GlobalPredecessors(
   };
 
   // If the given instruction is a companion instruction, we need to find the
-  // predecessors of all of its companion instructions.
+  // predecessors of all of its companion instructions. If the instruction is an
+  // all-reduce, we need to find the predecessors of all the peer all-reduce
+  // instructions.
   std::vector<HloInstruction*> instruction_group;
   if (metadata_.IsCompanionInstruction(instruction)) {
     for (HloInstruction* companion : metadata_.Companions(instruction)) {
       instruction_group.push_back(companion);
     }
+  } else if (instruction->IsCrossModuleAllReduce()) {
+    instruction_group =
+        metadata_.GetAllReduceGroup(*instruction->all_reduce_id());
   } else {
     instruction_group.push_back(instruction);
   }
@@ -112,12 +117,17 @@ std::vector<HloInstruction*> HloModuleGroupUtil::GlobalSuccessors(
   };
 
   // If the given instruction is a companion instruction, we need to find the
-  // successors of all of its companion instructions.
+  // successors of all of its companion instructions. If the instruction is an
+  // all-reduce, we need to find the successors of all its peer all-reduce
+  // instructions.
   std::vector<HloInstruction*> instruction_group;
   if (metadata_.IsCompanionInstruction(instruction)) {
     for (HloInstruction* companion : metadata_.Companions(instruction)) {
       instruction_group.push_back(companion);
     }
+  } else if (instruction->IsCrossModuleAllReduce()) {
+    instruction_group =
+        metadata_.GetAllReduceGroup(*instruction->all_reduce_id());
   } else {
     instruction_group.push_back(instruction);
   }
@@ -170,15 +180,17 @@ Status HloModuleGroupUtil::VisitTopologicalOrder(
     HloInstruction* hlo = stack.top();
 
     // Find the instruction group of the currently visited instruction. The
-    // instruction group represents all companion instructions of the
-    // current instruction, and are considered to be a single entity for the
-    // purpose of the traversal (i.e., they must always be in the same visit
-    // state).
+    // instruction group represents all companion instructions of the current
+    // instruction, or all the all-reduce instructions that belong to the same
+    // group, or are considered to be a single entity for the purpose of the
+    // traversal (i.e., they must always be in the same visit state).
     std::vector<HloInstruction*> instruction_group;
     if (metadata_.IsCompanionInstruction(hlo)) {
       for (HloInstruction* companion : metadata_.Companions(hlo)) {
         instruction_group.push_back(companion);
       }
+    } else if (hlo->IsCrossModuleAllReduce()) {
+      instruction_group = metadata_.GetAllReduceGroup(*hlo->all_reduce_id());
     } else {
       instruction_group.push_back(hlo);
     }
@@ -292,7 +304,7 @@ HloModuleGroupUtil::ComputeReachability(
   }
   auto reachability = MakeUnique<HloReachabilityMap>(post_order);
   for (HloInstruction* hlo : post_order) {
-    reachability->SetReachabilityToUnion(GlobalPredecessors(hlo), hlo);
+    reachability->FastSetReachabilityToUnion(GlobalPredecessors(hlo), hlo);
   }
   return std::move(reachability);
 }
diff --git a/tensorflow/compiler/xla/service/hlo_module_test.cc b/tensorflow/compiler/xla/service/hlo_module_test.cc
index 7f28a804bf..236f450086 100644
--- a/tensorflow/compiler/xla/service/hlo_module_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_module_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -38,7 +38,7 @@ class HloModuleTest : public HloTestBase {
   std::unique_ptr<HloComputation> CreateConstantComputation() {
     auto builder = HloComputation::Builder("Constant");
     builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
     return builder.Build();
   }
 
@@ -122,7 +122,7 @@ TEST_F(HloModuleTest, CloneHasFusion) {
   {
     auto b = HloComputation::Builder("Entry");
     auto input = b.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
     b.AddInstruction(
         HloInstruction::CreateFusion(r0f32_, HloInstruction::FusionKind::kInput,
                                      /*operands=*/{input}, fused_computation));
@@ -173,7 +173,7 @@ TEST_F(HloModuleTest, LargeConstantToString) {
   auto builder = HloComputation::Builder("Constant");
   std::vector<float> values(16, 42.0);
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>(values)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(values)));
   module->AddEntryComputation(builder.Build());
 
   EXPECT_EQ(
diff --git a/tensorflow/compiler/xla/service/hlo_opcode.h b/tensorflow/compiler/xla/service/hlo_opcode.h
index 05e47a698f..88531b6f20 100644
--- a/tensorflow/compiler/xla/service/hlo_opcode.h
+++ b/tensorflow/compiler/xla/service/hlo_opcode.h
@@ -87,6 +87,7 @@ namespace xla {
   V(kHostCompute, "host-compute")                            \
   V(kImag, "imag")                                           \
   V(kInfeed, "infeed")                                       \
+  V(kIota, "iota")                                           \
   V(kIsFinite, "is-finite")                                  \
   V(kLe, "less-than-or-equal-to", kHloOpcodeIsComparison)    \
   V(kLog, "log")                                             \
@@ -117,6 +118,7 @@ namespace xla {
   V(kReverse, "reverse")                                     \
   V(kRng, "rng")                                             \
   V(kRoundNearestAfz, "round-nearest-afz")                   \
+  V(kScatter, "scatter")                                     \
   V(kSelect, "select")                                       \
   V(kSelectAndScatter, "select-and-scatter")                 \
   V(kSend, "send")                                           \
@@ -133,6 +135,7 @@ namespace xla {
   V(kTrace, "trace")                                         \
   V(kTranspose, "transpose")                                 \
   V(kTuple, "tuple", kHloOpcodeIsVariadic)                   \
+  V(kTupleSelect, "tuple-select")                            \
   V(kWhile, "while")
 
 enum class HloOpcode {
diff --git a/tensorflow/compiler/xla/service/hlo_ordering_test.cc b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
index cfe5dace05..126d3a2d9c 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
@@ -57,7 +57,7 @@ TEST_F(HloOrderingTest, InstructionsInDifferentComputations) {
 
   auto builder_c = HloComputation::Builder("C");
   HloInstruction* c = builder_c.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   HloComputation* computation_c =
       module->AddEmbeddedComputation(builder_c.Build());
 
@@ -145,7 +145,7 @@ TEST_F(HloOrderingTest, InstructionsInWhileComputations) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto xla_while = builder.AddInstruction(
       HloInstruction::CreateWhile(scalar_shape, condition, body, constant));
   module->AddEntryComputation(builder.Build());
@@ -208,7 +208,7 @@ TEST_F(HloOrderingTest, ValuesInWhileComputations) {
 
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto xla_while = builder.AddInstruction(
       HloInstruction::CreateWhile(scalar_shape, condition, body, constant));
   auto add = builder.AddInstruction(HloInstruction::CreateBinary(
diff --git a/tensorflow/compiler/xla/service/hlo_parser.cc b/tensorflow/compiler/xla/service/hlo_parser.cc
index 5b0f09a498..2a4009604f 100644
--- a/tensorflow/compiler/xla/service/hlo_parser.cc
+++ b/tensorflow/compiler/xla/service/hlo_parser.cc
@@ -15,8 +15,10 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_domain_metadata.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/hlo_sharding_metadata.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -117,6 +119,7 @@ class HloParser {
 
   // Types of attributes.
   enum class AttrTy {
+    kBool,
     kInt64,
     kInt32,
     kFloat,
@@ -489,6 +492,14 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
           HloInstruction::CreateConstant(std::move(literal)));
       break;
     }
+    case HloOpcode::kIota: {
+      if (!ParseOperands(&operands, /*expected_size=*/0) ||
+          !ParseAttributes(attrs)) {
+        return false;
+      }
+      instruction = builder->AddInstruction(HloInstruction::CreateIota(shape));
+      break;
+    }
     // Unary ops.
     case HloOpcode::kAbs:
     case HloOpcode::kRoundNearestAfz:
@@ -551,7 +562,8 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     }
     // Ternary ops.
     case HloOpcode::kClamp:
-    case HloOpcode::kSelect: {
+    case HloOpcode::kSelect:
+    case HloOpcode::kTupleSelect: {
       if (!ParseOperands(&operands, /*expected_size=*/3) ||
           !ParseAttributes(attrs)) {
         return false;
@@ -620,23 +632,32 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
         return false;
       }
-      instruction =
-          builder->AddInstruction(HloInstruction::CreateAfterAll(operands));
+      if (operands.empty()) {
+        instruction = builder->AddInstruction(HloInstruction::CreateToken());
+      } else {
+        instruction =
+            builder->AddInstruction(HloInstruction::CreateAfterAll(operands));
+      }
       break;
     }
     case HloOpcode::kSort: {
       auto loc = lexer_.GetLoc();
-      if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
+
+      optional<std::vector<tensorflow::int64>> dimensions;
+      attrs["dimensions"] = {/*required=*/true, AttrTy::kBracedInt64List,
+                             &dimensions};
+      if (!ParseOperands(&operands) || !ParseAttributes(attrs) ||
+          dimensions->size() != 1) {
         return false;
       }
       switch (operands.size()) {
         case 1:
-          instruction = builder->AddInstruction(
-              HloInstruction::CreateSort(shape, /*keys=*/operands[0]));
+          instruction = builder->AddInstruction(HloInstruction::CreateSort(
+              shape, dimensions->at(0), /*keys=*/operands[0]));
           break;
         case 2:
           instruction = builder->AddInstruction(HloInstruction::CreateSort(
-              shape,
+              shape, dimensions->at(0),
               /*keys=*/operands[0], /*values=*/operands[1]));
           break;
         default:
@@ -669,18 +690,27 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     }
     case HloOpcode::kRecv: {
       optional<tensorflow::int64> channel_id;
+      // If the is_host_transfer attribute is not present then default to false.
+      optional<bool> is_host_transfer = false;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
+      attrs["is_host_transfer"] = {/*required=*/false, AttrTy::kBool,
+                                   &is_host_transfer};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
           !ParseAttributes(attrs)) {
         return false;
       }
+      // If the is_host_transfer attribute is not present then default to false.
       instruction = builder->AddInstruction(HloInstruction::CreateRecv(
-          shape.tuple_shapes(0), operands[0], *channel_id));
+          shape.tuple_shapes(0), operands[0], *channel_id, *is_host_transfer));
       break;
     }
     case HloOpcode::kRecvDone: {
       optional<tensorflow::int64> channel_id;
+      // If the is_host_transfer attribute is not present then default to false.
+      optional<bool> is_host_transfer = false;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
+      attrs["is_host_transfer"] = {/*required=*/false, AttrTy::kBool,
+                                   &is_host_transfer};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
           !ParseAttributes(attrs)) {
         return false;
@@ -688,24 +718,32 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       if (channel_id != operands[0]->channel_id()) {
         return false;
       }
-      instruction =
-          builder->AddInstruction(HloInstruction::CreateRecvDone(operands[0]));
+      instruction = builder->AddInstruction(
+          HloInstruction::CreateRecvDone(operands[0], *is_host_transfer));
       break;
     }
     case HloOpcode::kSend: {
       optional<tensorflow::int64> channel_id;
+      // If the is_host_transfer attribute is not present then default to false.
+      optional<bool> is_host_transfer = false;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
+      attrs["is_host_transfer"] = {/*required=*/false, AttrTy::kBool,
+                                   &is_host_transfer};
       if (!ParseOperands(&operands, /*expected_size=*/2) ||
           !ParseAttributes(attrs)) {
         return false;
       }
-      instruction = builder->AddInstruction(
-          HloInstruction::CreateSend(operands[0], operands[1], *channel_id));
+      instruction = builder->AddInstruction(HloInstruction::CreateSend(
+          operands[0], operands[1], *channel_id, *is_host_transfer));
       break;
     }
     case HloOpcode::kSendDone: {
       optional<tensorflow::int64> channel_id;
+      // If the is_host_transfer attribute is not present then default to false.
+      optional<bool> is_host_transfer = false;
       attrs["channel_id"] = {/*required=*/true, AttrTy::kInt64, &channel_id};
+      attrs["is_host_transfer"] = {/*required=*/false, AttrTy::kBool,
+                                   &is_host_transfer};
       if (!ParseOperands(&operands, /*expected_size=*/1) ||
           !ParseAttributes(attrs)) {
         return false;
@@ -713,8 +751,8 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       if (channel_id != operands[0]->channel_id()) {
         return false;
       }
-      instruction =
-          builder->AddInstruction(HloInstruction::CreateSendDone(operands[0]));
+      instruction = builder->AddInstruction(
+          HloInstruction::CreateSendDone(operands[0], *is_host_transfer));
       break;
     }
     case HloOpcode::kGetTupleElement: {
@@ -827,18 +865,28 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
       break;
     }
     case HloOpcode::kReduce: {
+      auto loc = lexer_.GetLoc();
+
       optional<HloComputation*> reduce_computation;
       attrs["to_apply"] = {/*required=*/true, AttrTy::kHloComputation,
                            &reduce_computation};
       optional<std::vector<tensorflow::int64>> dimensions_to_reduce;
       attrs["dimensions"] = {/*required=*/true, AttrTy::kBracedInt64List,
                              &dimensions_to_reduce};
-      if (!ParseOperands(&operands, /*expected_size=*/2) ||
-          !ParseAttributes(attrs)) {
+      if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
         return false;
       }
+      if (operands.size() % 2) {
+        return Error(loc, StrCat("expects an even number of operands, but has ",
+                                 operands.size(), " operands"));
+      }
       instruction = builder->AddInstruction(HloInstruction::CreateReduce(
-          shape, /*operand=*/operands[0], /*init_value=*/operands[1],
+          shape, /*operands=*/
+          tensorflow::gtl::ArraySlice<HloInstruction*>(operands, 0,
+                                                       operands.size() / 2),
+          /*init_values=*/
+          tensorflow::gtl::ArraySlice<HloInstruction*>(
+              operands, operands.size() / 2, operands.size()),
           *dimensions_to_reduce, *reduce_computation));
       break;
     }
@@ -1094,13 +1142,24 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
     }
     case HloOpcode::kCustomCall: {
       optional<string> custom_call_target;
+      optional<Window> window;
+      optional<ConvolutionDimensionNumbers> dnums;
       attrs["custom_call_target"] = {/*required=*/true, AttrTy::kString,
                                      &custom_call_target};
+      attrs["window"] = {/*required=*/false, AttrTy::kWindow, &window};
+      attrs["dim_labels"] = {/*required=*/false,
+                             AttrTy::kConvolutionDimensionNumbers, &dnums};
       if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
         return false;
       }
       instruction = builder->AddInstruction(HloInstruction::CreateCustomCall(
           shape, operands, *custom_call_target));
+      if (window.has_value()) {
+        instruction->set_window(*window);
+      }
+      if (dnums.has_value()) {
+        instruction->set_convolution_dimension_numbers(*dnums);
+      }
       break;
     }
     case HloOpcode::kHostCompute: {
@@ -1181,17 +1240,22 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
         return false;
       }
 
-      GatherDimensionNumbers dim_numbers = HloInstruction::MakeGatherDimNumbers(
-          /*output_window_dims=*/*output_window_dims,
-          /*elided_window_dims=*/*elided_window_dims,
-          /*gather_dims_to_operand_dims=*/*gather_dims_to_operand_dims,
-          /*index_vector_dim=*/*index_vector_dim);
+      GatherDimensionNumbers dim_numbers =
+          HloGatherInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/*output_window_dims,
+              /*elided_window_dims=*/*elided_window_dims,
+              /*gather_dims_to_operand_dims=*/*gather_dims_to_operand_dims,
+              /*index_vector_dim=*/*index_vector_dim);
 
       instruction = builder->AddInstruction(HloInstruction::CreateGather(
           shape, /*operand=*/operands[0], /*gather_indices=*/operands[1],
           dim_numbers, *window_bounds));
       break;
     }
+    case HloOpcode::kScatter: {
+      // TODO(b/32945756): Implement HLO parsing for Scatter.
+      return TokenError("HLO parsing is not implemented for Scatter.");
+    }
     case HloOpcode::kDomain: {
       DomainData domain;
       attrs["domain"] = {/*required=*/true, AttrTy::kDomain, &domain};
@@ -1200,8 +1264,8 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
         return false;
       }
       instruction = builder->AddInstruction(HloInstruction::CreateDomain(
-          shape, operands[0], std::move(domain.entry_metadata),
-          std::move(domain.exit_metadata)));
+          shape, operands[0], std::move(domain.exit_metadata),
+          std::move(domain.entry_metadata)));
       break;
     }
     case HloOpcode::kTrace:
@@ -1540,6 +1604,24 @@ bool HloParser::SetValueInLiteralHelper(ParsedElemT value,
           "value ", value, " is out of range for literal's primitive type ",
           PrimitiveType_Name(literal->shape().element_type())));
     }
+  } else if (std::is_unsigned<LiteralNativeT>::value) {
+    CHECK((std::is_same<ParsedElemT, tensorflow::int64>::value ||
+           std::is_same<ParsedElemT, bool>::value))
+        << "Unimplemented checking for ParsedElemT";
+
+    ParsedElemT upper_bound;
+    if (sizeof(LiteralNativeT) >= sizeof(ParsedElemT)) {
+      upper_bound = std::numeric_limits<ParsedElemT>::max();
+    } else {
+      upper_bound =
+          static_cast<ParsedElemT>(std::numeric_limits<LiteralNativeT>::max());
+    }
+    if (value > upper_bound || value < 0) {
+      // Value is out of range for LiteralNativeT.
+      return TokenError(StrCat(
+          "value ", value, " is out of range for literal's primitive type ",
+          PrimitiveType_Name(literal->shape().element_type())));
+    }
   } else if (value > static_cast<ParsedElemT>(
                          std::numeric_limits<LiteralNativeT>::max()) ||
              value < static_cast<ParsedElemT>(
@@ -1608,7 +1690,7 @@ bool HloParser::ParseTupleLiteral(std::unique_ptr<Literal>* literal,
       }
     }
   }
-  *literal = Literal::MakeTupleOwned(std::move(elements));
+  *literal = LiteralUtil::MakeTupleOwned(std::move(elements));
   return ParseToken(TokKind::kRparen,
                     StrCat("expects ')' at the end of the tuple with ",
                            ShapeUtil::TupleElementCount(shape), "elements"));
@@ -1636,8 +1718,8 @@ bool HloParser::ParseDenseLiteral(std::unique_ptr<Literal>* literal,
   }
 
   // Create a literal with the given shape in default layout.
-  *literal = Literal::CreateFromDimensions(shape.element_type(),
-                                           AsInt64Slice(shape.dimensions()));
+  *literal = LiteralUtil::CreateFromDimensions(
+      shape.element_type(), AsInt64Slice(shape.dimensions()));
   tensorflow::int64 nest_level = 0;
   tensorflow::int64 linear_index = 0;
   // elems_seen_per_dim[i] is how many elements or sub-arrays we have seen for
@@ -2030,6 +2112,14 @@ bool HloParser::ParseAttributeHelper(
   bool success = [&] {
     LocTy attr_loc = lexer_.GetLoc();
     switch (attr_type) {
+      case AttrTy::kBool: {
+        bool result;
+        if (!ParseBool(&result)) {
+          return false;
+        }
+        static_cast<optional<bool>*>(attr_out_ptr)->emplace(result);
+        return true;
+      }
       case AttrTy::kInt64: {
         tensorflow::int64 result;
         if (!ParseInt64(&result)) {
diff --git a/tensorflow/compiler/xla/service/hlo_parser_test.cc b/tensorflow/compiler/xla/service/hlo_parser_test.cc
index f40cd60907..16bd8fcea6 100644
--- a/tensorflow/compiler/xla/service/hlo_parser_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_parser_test.cc
@@ -277,13 +277,28 @@ ENTRY %WhileWithScalarS32Result.v2 () -> s32[] {
 "SendRecv",
 R"(HloModule TwoSendRecvBothWayRecvFist_module
 
-ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> f32[] {
+ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> (f32[], token[]) {
+  %token = token[] after-all()
+  %recv = (f32[], u32[], token[]) recv(token[] %token), channel_id=15, sharding={maximal device=1}
+  ROOT %recv-done = (f32[], token[]) recv-done((f32[], u32[], token[]) %recv), channel_id=15, sharding={maximal device=1}
+  %constant = f32[] constant(2.1), sharding={maximal device=0}
+  %send = (f32[], u32[], token[]) send(f32[] %constant, token[] %token), channel_id=16, sharding={maximal device=0}, control-predecessors={%recv}
+  %send-done = token[] send-done((f32[], u32[], token[]) %send), channel_id=16, sharding={maximal device=0}
+}
+
+)"
+},
+{
+"SendRecvWithHostTransfer",
+R"(HloModule HostTransferSendRecv_module
+
+ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> (f32[], token[]) {
   %token = token[] after-all()
-  %recv = (f32[], u32[]) recv(token[] %token), channel_id=15, sharding={maximal device=1}
-  ROOT %recv-done = f32[] recv-done((f32[], u32[]) %recv), channel_id=15, sharding={maximal device=1}
+  %recv = (f32[], u32[], token[]) recv(token[] %token), channel_id=15, is_host_transfer=true
+  ROOT %recv-done = (f32[], token[]) recv-done((f32[], u32[], token[]) %recv), channel_id=15, is_host_transfer=true
   %constant = f32[] constant(2.1), sharding={maximal device=0}
-  %send = (f32[], u32[]) send(f32[] %constant, token[] %token), channel_id=16, sharding={maximal device=0}, control-predecessors={%recv}
-  %send-done = () send-done((f32[], u32[]) %send), channel_id=16, sharding={maximal device=0}
+  %send = (f32[], u32[], token[]) send(f32[] %constant, token[] %token), channel_id=16, is_host_transfer=true
+  %send-done = token[] send-done((f32[], u32[], token[]) %send), channel_id=16, is_host_transfer=true
 }
 
 )"
@@ -745,6 +760,27 @@ ENTRY %Gather (input_tensor: f32[50,49,48,47,46], gather_indices: s64[10,9,8,7,5
 
 )"
 },
+{
+  "ConstantUnsignedNoUnderflow",
+  R"(HloModule ConstantUnsignedNoUnderflow_module
+
+ENTRY %ConstantUnsignedNoUnderflow () -> u64[] {
+  ROOT %constant = u64[] constant(1)
+}
+
+)"
+},
+
+{
+  "ConstantUnsignedNoOverflow",
+  R"(HloModule ConstantUnsignedNoOverflow_module
+
+ENTRY %ConstantUnsignedNoOverflow () -> u64[] {
+  ROOT %constant = u64[] constant(9223372036854775807)
+}
+
+)"
+},
   });
   // clang-format on
 }
@@ -790,6 +826,32 @@ ENTRY ReduceR3ToR2.v3 {
 
 )"
 },
+// tuple reduce
+{
+"TupleReduce",
+R"(HloModule TupleReduce
+
+max_argmax {
+  value = f32[] parameter(2)
+  prev_max = f32[] parameter(0)
+  is_next_larger = pred[] greater-than-or-equal-to(value, prev_max)
+  max = f32[] select(is_next_larger, value, prev_max)
+  index = s32[] parameter(3)
+  prev_argmax = s32[] parameter(1)
+  argmax = s32[] select(is_next_larger, index, prev_argmax)
+  ROOT pair = (f32[], s32[]) tuple(max, argmax)
+}
+
+ENTRY reduce_entry {
+  values = f32[1024]{0} parameter(0)
+  indices = f32[1024]{0} parameter(1)
+  init_value = f32[] constant(-inf)
+  init_index = s32[] constant(-1)
+  ROOT result = (f32[], s32[]) reduce(values, indices, init_value, init_index), dimensions={0}, to_apply=max_argmax
+}
+
+)"
+},
 // infeed/outfeed
 {
 "InfeedOutfeed",
@@ -840,7 +902,7 @@ R"(HloModule sort
 
 ENTRY Sort {
   x = f32[1024]{0} parameter(0)
-  ROOT sorted = f32[1024]{0} sort(x)
+  ROOT sorted = f32[1024]{0} sort(x), dimensions={0}
 }
 
 )"
@@ -853,7 +915,32 @@ R"(HloModule sort
 ENTRY Sort {
   keys = f32[1024]{0} parameter(0)
   values = s32[1024]{0} parameter(1)
-  ROOT sorted = (f32[1024]{0}, s32[1024]{0}) sort(keys, values)
+  ROOT sorted = (f32[1024]{0}, s32[1024]{0}) sort(keys, values), dimensions={0}
+}
+
+)"
+},
+// R2 Sort (Key)
+{
+"SortKeyR2",
+R"(HloModule sort
+
+ENTRY Sort {
+  x = f32[1024,16]{0,1} parameter(0)
+  ROOT sorted = f32[1024,16]{0,1} sort(x), dimensions={0}
+}
+
+)"
+},
+// R2 Sort (Key, Value)
+{
+"SortKeyValueR2",
+R"(HloModule sort
+
+ENTRY Sort {
+  keys = f32[1024,16]{0,1} parameter(0)
+  values = s32[1024,16]{0,1} parameter(1)
+  ROOT sorted = (f32[1024,16]{0,1}, s32[1024,16]{0,1}) sort(keys, values), dimensions={0}
 }
 
 )"
@@ -965,6 +1052,28 @@ ENTRY CrossReplicaSumWithSubgroups {
 }
 
 )"
+},
+// Iota
+{
+"Iota",
+R"(HloModule iota
+
+ENTRY Iota {
+  ROOT iota = f32[100]{0} iota()
+}
+
+)"
+},
+// custom-call with window and dim_labels
+{
+"CustomCallWithWindowAndDimLabels",
+R"(HloModule CustomCallWithWindowAndDimLabels
+
+ENTRY Computation {
+  ROOT r = f32[100]{0} custom-call(), window={size=2x2}, dim_labels=b01f_01io->b01f, custom_call_target="target"
+}
+
+)"
 }
   });
   // clang-format on
@@ -1162,6 +1271,40 @@ ENTRY %ConstantF16Overflow.v4 () -> f16[] {
                   "is out of range for literal's primitive type F16");
 }
 
+TEST_F(HloParserTest, ConstantUnsignedUnderflow) {
+  const string original = R"(
+      HloModule ConstantUnsignedUnderflow_module
+      ENTRY %ConstantUnsignedUnderflow () -> u64[] {
+        ROOT %constant = u64[] constant(-1)
+      })";
+  auto result = ParseHloString(original);
+  EXPECT_NE(Status::OK(), result.status());
+  ExpectHasSubstr(result.status().error_message(),
+                  "is out of range for literal's primitive type U64");
+}
+
+TEST_F(HloParserTest, ConstantUnsignedOverflow) {
+  const string original = R"(
+      HloModule ConstantUnsignedOverflow_module
+      ENTRY %ConstantUnsignedOverflow () -> u32[] {
+        ROOT %constant = u32[] constant(4294967296)
+      })";
+  auto result = ParseHloString(original);
+  EXPECT_NE(Status::OK(), result.status());
+  ExpectHasSubstr(result.status().error_message(),
+                  "is out of range for literal's primitive type U32");
+}
+
+TEST_F(HloParserTest, ConstantUnsignedInt64Overflow) {
+  const string original = R"(
+      HloModule ConstantUnsignedOverflow_module
+      ENTRY %ConstantUnsignedOverflow () -> u64[] {
+        ROOT %constant = u64[] constant(9223372036854775808)
+      })";
+  auto result = ParseHloString(original);
+  EXPECT_NE(Status::OK(), result.status());
+}
+
 TEST_F(HloParserTest, ConstantWithExp) {
   const string original = R"(HloModule ConstantWithExp_module
 
@@ -1223,11 +1366,11 @@ TEST_F(HloParserTest, UnexpectedAttribute) {
 
 ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> f32[] {
   %token = token[] after-all()
-  %recv = (f32[], u32[]) recv(token[] %token), channel_id=15
-  %recv-done = f32[] recv-done((f32[], u32[]) %recv), channel_id=15
+  %recv = (f32[], u32[], token[]) recv(token[] %token), channel_id=15
+  %recv-done = (f32[], token[]) recv-done((f32[], u32[], token[]) %recv), channel_id=15
   ROOT %constant = f32[] constant(2.1)
-  %send = (f32[], u32[]) send(f32[] %constant, token[] %token), channel_id=16, calls=%recv
-  %send-done = () send-done((f32[], u32[]) %send), channel_id=16
+  %send = (f32[], u32[], token[]) send(f32[] %constant, token[] %token), channel_id=16, calls=%recv
+  %send-done = token[] send-done((f32[], u32[], token[]) %send), channel_id=16
 }
 
 )";
@@ -1240,11 +1383,11 @@ TEST_F(HloParserTest, MissingAttribute) {
 
 ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> f32[] {
   %token = token[] after-all()
-  %recv = (f32[], u32[]) recv(token[] %token), channel_id=15
-  %recv-done = f32[] recv-done((f32[], u32[]) %recv), channel_id=15
+  %recv = (f32[], u32[], token[]) recv(token[] %token), channel_id=15
+  %recv-done = (f32[], token[]) recv-done((f32[], u32[], token[]) %recv), channel_id=15
   ROOT %constant = f32[] constant(-2.1)
-  %send = (f32[], u32[]) send(f32[] %constant, token[] %token)
-  %send-done = () send-done((f32[], u32[]) %send), channel_id=16
+  %send = (f32[], u32[], token[]) send(f32[] %constant, token[] %token)
+  %send-done = token[] send-done((f32[], u32[], token[]) %send), channel_id=16
 }
 
 )";
@@ -1257,11 +1400,11 @@ TEST_F(HloParserTest, PredecessorUndefined) {
 
 ENTRY %TwoSendRecvBothWayRecvFist.v3 () -> f32[] {
   %token = token[] after-all()
-  %recv = (f32[], u32[]) recv(token[] %token), channel_id=15
-  %recv-done = f32[] recv-done((f32[], u32[]) %recv), channel_id=15
+  %recv = (f32[], u32[], token[]) recv(token[] %token), channel_id=15
+  %recv-done = (f32[], token[]) recv-done((f32[], u32[], token[]) %recv), channel_id=15
   ROOT %constant = f32[] constant(2.1)
-  %send = (f32[], u32[]) send(f32[] %constant, token[] %token), channel_id=16, control-predecessors={%done}
-  %send-done = () send-done((f32[], u32[]) %send), channel_id=16
+  %send = (f32[], u32[], token[]) send(f32[] %constant, token[] %token), channel_id=16, control-predecessors={%done}
+  %send-done = token[] send-done((f32[], u32[], token[]) %send), channel_id=16
 }
 
 )";
diff --git a/tensorflow/compiler/xla/service/hlo_query.cc b/tensorflow/compiler/xla/service/hlo_query.cc
index 2418c19f3d..2a07b6fcbc 100644
--- a/tensorflow/compiler/xla/service/hlo_query.cc
+++ b/tensorflow/compiler/xla/service/hlo_query.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_query.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 
diff --git a/tensorflow/compiler/xla/service/hlo_reachability_test.cc b/tensorflow/compiler/xla/service/hlo_reachability_test.cc
index 657a9ee83d..585c95972b 100644
--- a/tensorflow/compiler/xla/service/hlo_reachability_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_reachability_test.cc
@@ -39,15 +39,15 @@ TEST_F(HloReachabilityTest, Reachability) {
   */
   auto builder = HloComputation::Builder(TestName());
   auto a = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   auto b = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   auto c = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   auto d = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   auto e = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(0.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(0.0f)));
   builder.Build();
 
   HloReachabilityMap reachability({a, b, c, d, e});
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization.cc b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
index 59a8800a7d..cf0be30c7a 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization.cc
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
@@ -1203,7 +1203,7 @@ StatusOr<bool> HloRematerialization::RematerializeComputation(
 StatusOr<bool> HloRematerialization::Run(
     HloModule* module, SequentialHloOrdering::HloModuleSequence* sequence,
     int64 memory_limit_bytes, RematerializationSizes* sizes,
-    bool run_copy_elision) {
+    CopyInsertion* copy_insertion) {
   // The sequence is constructed entirely by this method.
   TF_RET_CHECK(sequence->empty());
 
@@ -1238,13 +1238,14 @@ StatusOr<bool> HloRematerialization::Run(
                                        return size_function_(buffer.shape());
                                      },
                                      scheduler_algorithm_));
-  if (run_copy_elision) {
+  if (copy_insertion) {
     // We run a separate pass of copy elision here because the sequential
     // ordering from the HLO schedule allows for more copies to be eliminated.
     // TODO(b/80249101): Instead of a separate copy elision pass, use the
     // ordering from the HLO schedule directly for copy insertion.
     SequentialHloOrdering ordering(module, *sequence);
-    TF_RETURN_IF_ERROR(RemoveUnnecessaryCopies(ordering, module));
+    TF_RETURN_IF_ERROR(
+        copy_insertion->RemoveUnnecessaryCopies(ordering, module));
   }
 
   // Compute peak memory usage of all computations in the module called in a
@@ -1349,10 +1350,10 @@ StatusOr<bool> HloRematerialization::Run(
     int64 memory_limit_bytes, HloModule* hlo_module,
     MemorySchedulerAlgorithm scheduler_algorithm,
     SequentialHloOrdering::HloModuleSequence* sequence,
-    RematerializationSizes* sizes, bool run_copy_elision) {
+    RematerializationSizes* sizes, CopyInsertion* copy_insertion) {
   HloRematerialization remat(scheduler_algorithm, size_function);
   return remat.Run(hlo_module, sequence, memory_limit_bytes, sizes,
-                   run_copy_elision);
+                   copy_insertion);
 }
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization.h b/tensorflow/compiler/xla/service/hlo_rematerialization.h
index 59b4cf5dcc..2ec004350a 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization.h
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization.h
@@ -17,6 +17,7 @@
 
 #include "tensorflow/compiler/xla/service/buffer_liveness.h"
 #include "tensorflow/compiler/xla/service/call_graph.h"
+#include "tensorflow/compiler/xla/service/copy_insertion.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
@@ -57,8 +58,9 @@ class HloRematerialization {
   //   sizes: Optional outparam that indicates the peak memory usage of the HLO
   //     module before/after rematerialization.
   //
-  //   run_copy_elision: Enable copy elision. This pass is used to eliminate
-  //     copies that were inserted before HLO scheduling.
+  //   copy_insertion: If non-null, run copy elision after scheduling. This
+  //     pass is used to eliminate copies that were inserted by copy insertion
+  //     before HLO scheduling.
   //
   // TODO(b/80249101): Remove the 'run_copy_elision' parameter when copy
   // insertion is integrated with HLO scheduling.
@@ -74,7 +76,7 @@ class HloRematerialization {
       const ShapeSizeFunction& size_function, int64 memory_limit_bytes,
       HloModule* hlo_module, MemorySchedulerAlgorithm scheduler_algorithm,
       SequentialHloOrdering::HloModuleSequence* sequence,
-      RematerializationSizes* sizes, bool run_copy_elision = true);
+      RematerializationSizes* sizes, CopyInsertion* copy_insertion = nullptr);
 
  protected:
   HloRematerialization(MemorySchedulerAlgorithm scheduler_algorithm,
@@ -90,7 +92,7 @@ class HloRematerialization {
   StatusOr<bool> Run(HloModule* module,
                      SequentialHloOrdering::HloModuleSequence* sequence,
                      int64 memory_limit, RematerializationSizes* sizes,
-                     bool run_copy_elision);
+                     CopyInsertion* copy_insertion);
 
   // Rematerializes instructions within the given computation. 'order' is the
   // order in which the computation's instructions will be emitted in the
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc b/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
index 7a46da6efe..ac8c97d380 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization_test.cc
@@ -132,7 +132,7 @@ class HloRematerializationTest : public HloTestBase {
     builder.AddInstruction(
         HloInstruction::CreateParameter(0, vec1_shape_, "param"));
     builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
     return builder.Build();
   }
 
@@ -147,7 +147,7 @@ class HloRematerializationTest : public HloTestBase {
     TF_EXPECT_OK(verifier().Run(module).status());
     return HloRematerialization::RematerializeAndSchedule(
         ByteSizeOf, memory_limit_bytes, module, DefaultMemoryScheduler,
-        sequence, /*sizes=*/nullptr, /*run_copy_elision=*/false);
+        sequence, /*sizes=*/nullptr);
   }
 
   // Various shapes used in the canned computations.
@@ -226,7 +226,7 @@ TEST_F(HloRematerializationTest, RematerializeAroundWhile) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, vec1_shape_, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   HloComputation* while_cond =
       module->AddEmbeddedComputation(cond_builder.Build());
 
@@ -263,7 +263,7 @@ TEST_F(HloRematerializationTest, RematerializeEntryAndWhileBody) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, vec1_shape_, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   HloComputation* while_cond =
       module->AddEmbeddedComputation(cond_builder.Build());
 
@@ -296,7 +296,7 @@ TEST_F(HloRematerializationTest, RematerializeNestedComputations) {
   cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, vec1_shape_, "param"));
   cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   HloComputation* while_cond =
       module->AddEmbeddedComputation(cond_builder.Build());
 
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling.cc b/tensorflow/compiler/xla/service/hlo_scheduling.cc
index c6d3909af6..27cc5361cd 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling.cc
+++ b/tensorflow/compiler/xla/service/hlo_scheduling.cc
@@ -567,6 +567,7 @@ StatusOr<SequentialHloOrdering::HloModuleSequence> ScheduleComputationsInModule(
       sequence[computation] = std::move(one_computation_sequence);
     }
   }
+  VLOG(1) << "Module schedule:\n" << sequence;
   return sequence;
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling_test.cc b/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
index 73f22f81f4..cf9ceed5b2 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_scheduling_test.cc
@@ -168,8 +168,9 @@ TEST_F(HloSchedulingTest, ListAccountsForSubcomputations) {
   auto cond_builder = HloComputation::Builder("WhileCond");
   HloInstruction* cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, r1f32, "cond_param"));
-  HloInstruction* zero_vector = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({{0, 0, 0, 0}})));
+  HloInstruction* zero_vector =
+      cond_builder.AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR2<float>({{0, 0, 0, 0}})));
   cond_builder.AddInstruction(HloInstruction::CreateBinary(
       ShapeUtil::MakeShape(PRED, {}), HloOpcode::kNe, cond_param, zero_vector));
   auto cond_computation = module->AddEmbeddedComputation(cond_builder.Build());
@@ -179,16 +180,18 @@ TEST_F(HloSchedulingTest, ListAccountsForSubcomputations) {
   auto body_builder = HloComputation::Builder("WhileBody");
   HloInstruction* body_param = body_builder.AddInstruction(
       HloInstruction::CreateParameter(0, r1f32, "body_param"));
-  HloInstruction* one_vector = body_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({{1, 1, 1, 1}})));
+  HloInstruction* one_vector =
+      body_builder.AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR2<float>({{1, 1, 1, 1}})));
   body_builder.AddInstruction(HloInstruction::CreateBinary(
       r1f32, HloOpcode::kSubtract, body_param, one_vector));
   auto body_computation = module->AddEmbeddedComputation(body_builder.Build());
 
   // transpose(matrix) + bcast(while)
   auto builder = HloComputation::Builder(TestName());
-  HloInstruction* while_init = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({{1, 1, 1, 1}})));
+  HloInstruction* while_init =
+      builder.AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR2<float>({{1, 1, 1, 1}})));
   // Creates 16 bytes, ignoring subcomputations
   HloInstruction* while_loop =
       builder.AddInstruction(HloInstruction::CreateWhile(
@@ -199,7 +202,7 @@ TEST_F(HloSchedulingTest, ListAccountsForSubcomputations) {
       HloInstruction::CreateBroadcast(r2f32, while_loop, {0}));
 
   HloInstruction* matrix = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2<float>(
           {{1.0, 2.0, 3.0, 4.0}, {1.0, 2.0, 3.0, 4.0}})));
   // Creates 32 bytes
   HloInstruction* transpose = builder.AddInstruction(
@@ -257,7 +260,7 @@ TEST_F(HloSchedulingTest, TuplesAreAccountedCorrectly) {
   // Wrap lit in abs because constants are considered free by
   // IgnoreInstruction, and it skews the accounting.
   auto lit = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1, 1, 1, 1, 1, 1})));
+      LiteralUtil::CreateR1<float>({1, 1, 1, 1, 1, 1})));
   auto abs_const = builder.AddInstruction(
       HloInstruction::CreateUnary(r1f32, HloOpcode::kAbs, lit));
 
@@ -300,11 +303,11 @@ TEST_F(HloSchedulingTest, MultiOutputFusionAccountedCorrectly) {
   HloComputation::Builder builder(TestName());
 
   auto c1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1, 1, 1, 1, 1})));
+      LiteralUtil::CreateR1<float>({1, 1, 1, 1, 1})));
   auto c2 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1, 2, 3, 4, 5})));
+      LiteralUtil::CreateR1<float>({1, 2, 3, 4, 5})));
   auto c3 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({0, 2, 4, 6, 8})));
+      LiteralUtil::CreateR1<float>({0, 2, 4, 6, 8})));
 
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(r1f32, HloOpcode::kAdd, c1, c2));
@@ -354,8 +357,9 @@ TEST_F(HloSchedulingTest, HeapSimulatorAccountsForSubcomputations) {
   auto cond_builder = HloComputation::Builder("WhileCond");
   HloInstruction* cond_param = cond_builder.AddInstruction(
       HloInstruction::CreateParameter(0, r1f32, "cond_param"));
-  HloInstruction* zero_vector = cond_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({{0, 0, 0, 0}})));
+  HloInstruction* zero_vector =
+      cond_builder.AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR2<float>({{0, 0, 0, 0}})));
   cond_builder.AddInstruction(HloInstruction::CreateBinary(
       ShapeUtil::MakeShape(PRED, {}), HloOpcode::kNe, cond_param, zero_vector));
   auto cond_computation = module->AddEmbeddedComputation(cond_builder.Build());
@@ -365,15 +369,17 @@ TEST_F(HloSchedulingTest, HeapSimulatorAccountsForSubcomputations) {
   auto body_builder = HloComputation::Builder("WhileBody");
   HloInstruction* body_param = body_builder.AddInstruction(
       HloInstruction::CreateParameter(0, r1f32, "body_param"));
-  HloInstruction* one_vector = body_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({{1, 1, 1, 1}})));
+  HloInstruction* one_vector =
+      body_builder.AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR2<float>({{1, 1, 1, 1}})));
   body_builder.AddInstruction(HloInstruction::CreateBinary(
       r1f32, HloOpcode::kSubtract, body_param, one_vector));
   auto body_computation = module->AddEmbeddedComputation(body_builder.Build());
 
   auto builder = HloComputation::Builder(TestName());
-  HloInstruction* while_init = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<float>({{1, 1, 1, 1}})));
+  HloInstruction* while_init =
+      builder.AddInstruction(HloInstruction::CreateConstant(
+          LiteralUtil::CreateR2<float>({{1, 1, 1, 1}})));
   // Creates 16 bytes, ignoring subcomputations
   builder.AddInstruction(HloInstruction::CreateWhile(
       r1f32, cond_computation, body_computation, while_init));
diff --git a/tensorflow/compiler/xla/service/hlo_sharding.cc b/tensorflow/compiler/xla/service/hlo_sharding.cc
index 268b4727bc..6399f6ef3c 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding.cc
@@ -60,6 +60,9 @@ HloSharding HloSharding::Tuple(
     const Shape& tuple_shape,
     tensorflow::gtl::ArraySlice<HloSharding> shardings) {
   CHECK(ShapeUtil::IsTuple(tuple_shape)) << ShapeUtil::HumanString(tuple_shape);
+  for (auto& sharding : shardings) {
+    CHECK(!sharding.IsTuple()) << sharding.ToString();
+  }
   std::vector<HloSharding> flattened_list(shardings.begin(), shardings.end());
   CHECK_EQ(flattened_list.size(), RequiredLeaves(tuple_shape))
       << "Flat list has " << flattened_list.size() << ", required "
@@ -67,6 +70,24 @@ HloSharding HloSharding::Tuple(
   return HloSharding(flattened_list);
 }
 
+HloSharding HloSharding::SingleTuple(const Shape& tuple_shape,
+                                     const HloSharding& sharding) {
+  CHECK(ShapeUtil::IsTuple(tuple_shape)) << ShapeUtil::HumanString(tuple_shape);
+  CHECK(!sharding.IsTuple()) << sharding.ToString();
+  int64 leaf_count = ShapeUtil::GetLeafCount(tuple_shape);
+  std::vector<HloSharding> flattened_list;
+  flattened_list.reserve(leaf_count);
+  for (int64 i = 0; i < leaf_count; ++i) {
+    flattened_list.push_back(sharding);
+  }
+  return HloSharding(flattened_list);
+}
+
+HloSharding HloSharding::Single(const Shape& shape,
+                                const HloSharding& sharding) {
+  return ShapeUtil::IsTuple(shape) ? SingleTuple(shape, sharding) : sharding;
+}
+
 string HloSharding::ToString() const {
   if (IsTuple()) {
     std::vector<string> parts;
@@ -106,15 +127,15 @@ std::map<int64, int64> HloSharding::UsedDevices(int64* count) const {
   if (IsTuple()) {
     for (auto& tuple_element_sharding : tuple_elements()) {
       auto unique_device = tuple_element_sharding.UniqueDevice();
-      if (unique_device.ok()) {
-        device_map[unique_device.ValueOrDie()] += 1;
+      if (unique_device) {
+        device_map[*unique_device] += 1;
       }
     }
     element_count = tuple_elements().size();
   } else {
     auto unique_device = UniqueDevice();
-    if (unique_device.ok()) {
-      device_map[unique_device.ValueOrDie()] += 1;
+    if (unique_device) {
+      device_map[*unique_device] += 1;
     }
   }
   if (count != nullptr) {
@@ -217,40 +238,31 @@ StatusOr<HloSharding> HloSharding::GetTupleSharding(const Shape& shape) const {
   return Tuple(ShapeTree<HloSharding>(shape, *this));
 }
 
-StatusOr<int64> HloSharding::UniqueDevice() const {
+tensorflow::gtl::optional<int64> HloSharding::UniqueDevice() const {
   if (IsTuple()) {
     if (tuple_elements_.empty()) {
-      return tensorflow::errors::InvalidArgument(
-          "UniqueDevice() called on empty tuple");
+      return tensorflow::gtl::nullopt;
     }
-    std::vector<StatusOr<int64>> results;
-    std::transform(tuple_elements_.begin(), tuple_elements_.end(),
-                   std::back_inserter(results),
-                   [](const HloSharding& s) { return s.UniqueDevice(); });
-    if (std::all_of(results.begin(), results.end(),
-                    [&](const StatusOr<int64>& s) {
-                      return s.ok() && results[0].ok() &&
-                             s.ValueOrDie() == results[0].ValueOrDie();
-                    })) {
-      return results[0];
-    } else {
-      return tensorflow::errors::InvalidArgument(
-          "Tuple did not contain a unique device");
+    tensorflow::gtl::optional<int64> unique_device;
+    for (auto& tuple_sharding : tuple_elements_) {
+      auto device = tuple_sharding.UniqueDevice();
+      if (!device || (unique_device && *device != *unique_device)) {
+        return tensorflow::gtl::nullopt;
+      }
+      unique_device = device;
     }
+    return unique_device;
   }
-  if (!replicated_ && maximal_ && !IsTuple()) {
+  if (!replicated_ && maximal_) {
     return static_cast<int64>(*tile_assignment_.begin());
   }
-  return tensorflow::errors::InvalidArgument(
-      "UniqueDevice() called on sharding that executes on multiple devices");
+  return tensorflow::gtl::nullopt;
 }
 
-bool HloSharding::HasUniqueDevice() const {
-  if (IsTuple()) {
-    return UniqueDevice().status().ok();
-  } else {
-    return !IsReplicated() && IsTileMaximal();
-  }
+int64 HloSharding::GetUniqueDevice() const {
+  auto device = UniqueDevice();
+  CHECK(device) << "Sharding does not have a unique device: " << *this;
+  return *device;
 }
 
 Status HloSharding::ValidateTuple(const Shape& shape, int64 num_devices) const {
diff --git a/tensorflow/compiler/xla/service/hlo_sharding.h b/tensorflow/compiler/xla/service/hlo_sharding.h
index 34324d2058..28575c0e75 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding.h
+++ b/tensorflow/compiler/xla/service/hlo_sharding.h
@@ -24,7 +24,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/array.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/protobuf_util.h"
 #include "tensorflow/compiler/xla/shape_tree.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -80,6 +80,15 @@ class HloSharding {
   static HloSharding Tuple(const Shape& tuple_shape,
                            tensorflow::gtl::ArraySlice<HloSharding> shardings);
 
+  // Creates a new sharding for a tuple type, with a single input sharding
+  // repeated on each leaf.
+  static HloSharding SingleTuple(const Shape& tuple_shape,
+                                 const HloSharding& sharding);
+
+  // If shape is an array, returns sharding, otherwise returns the tuple shaped
+  // sharding with all the leaf nodes having the same input sharding.
+  static HloSharding Single(const Shape& shape, const HloSharding& sharding);
+
   // Create a new sharding from a protobuf OpSharding.
   static StatusOr<HloSharding> FromProto(const OpSharding& proto);
 
@@ -149,12 +158,17 @@ class HloSharding {
   // REQUIRES: !IsTuple()
   std::vector<int64> TileLimitForDevice(int64 device) const;
 
-  // Returns the single device this op operates on.
-  // REQUIRES: !IsTuple&& !Replicated() && IsTileMaximal()
-  StatusOr<int64> UniqueDevice() const;
+  // Returns the single device this op operates on. If the sharding does not
+  // span a single device, the return value will be empty.
+  // In order for a sharding to span a single device, every leaf sharding must
+  // be maximal and not replicated, and the used device must match.
+  tensorflow::gtl::optional<int64> UniqueDevice() const;
+
+  // Retrieves the unique device or fails with a CHECK.
+  int64 GetUniqueDevice() const;
 
   // Returns true if this op only uses a single device.
-  bool HasUniqueDevice() const;
+  bool HasUniqueDevice() const { return UniqueDevice().has_value(); }
 
   // Returns the ShapeTree containing the shardings for each element of this
   // tuple, if IsTuple, or a ShapeTree with a single element containing this
diff --git a/tensorflow/compiler/xla/service/hlo_sharding_metadata.cc b/tensorflow/compiler/xla/service/hlo_sharding_metadata.cc
index 39036e205e..94f5a3b273 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding_metadata.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding_metadata.cc
@@ -88,6 +88,12 @@ std::vector<PassThrough> LocatePassThroughDomainLinks(
         VLOG(2) << "  " << instruction->ToString();
       }
     }
+    if (instruction == instruction->parent()->root_instruction()) {
+      pass_through.emplace_back(nullptr, instruction);
+      VLOG(2) << "Found passthrough domain link:";
+      VLOG(2) << "  <root>";
+      VLOG(2) << "  " << instruction->ToString();
+    }
   }
   return pass_through;
 }
@@ -101,8 +107,12 @@ Status FixupPassThroughDomainLinks(const DomainMetadata::Domain& domain,
         HloInstruction::CreateGetTupleElement(pass_through.operand->shape(),
                                               tuple, 0));
     gte->set_sharding(sharding);
-    TF_RETURN_IF_ERROR(
-        pass_through.operand->ReplaceUseWith(pass_through.user, gte));
+    if (pass_through.user != nullptr) {
+      TF_RETURN_IF_ERROR(
+          pass_through.operand->ReplaceUseWith(pass_through.user, gte));
+    } else {
+      pass_through.operand->parent()->set_root_instruction(gte);
+    }
   }
   return Status::OK();
 }
@@ -235,21 +245,6 @@ StatusOr<int64> ApplyDomainShardingPass(const DomainMetadata::Domain& domain,
 
 Status ApplyDomainSharding(const DomainMetadata::Domain& domain,
                            const HloSharding& sharding) {
-  // Here is the place to call external sharding normalizers, which are
-  // implemented in other modules (ie, spatial partitioning).
-  // The signature of the external normalizer function should be something
-  // like:
-  //
-  //   StatusOr<bool> Normalizer(const DomainMetadata::Domain&,
-  //                             const HloSharding& sharding);
-  //
-  // The function should return true if it has processed the domain
-  // normalization, false if domain was not one recognized by it, or an error.
-  // We will call the functions in order below, and fall back to local code if
-  // none of the external normalizers acted on the domain.
-  // External normalizers should not handle the cases that are already handled
-  // locally.
-
   // None of the external normalizers handled the domain sharding, try to see
   // whether this is a single sharding first.
   auto single_sharding = sharding.ExtractSingleSharding();
@@ -380,25 +375,36 @@ string ShardingMetadata::ToString() const {
   return sharding_ != nullptr ? sharding_->ToString() : "{}";
 }
 
-Status ShardingMetadata::NormalizeInstructions(
-    const DomainMetadata::Domain& domain) const {
-  if (sharding_ != nullptr) {
-    VLOG(4) << "Normalizing sharding to " << sharding_->ToString() << ":";
-    TF_RETURN_IF_ERROR(ApplyDomainSharding(domain, *sharding_));
-    TF_RETURN_IF_ERROR(FixupPassThroughDomainLinks(domain, *sharding_));
+/*static*/ StatusOr<const ShardingMetadata*>
+ShardingMetadata::ToShardingMetadata(const DomainMetadata* metadata) {
+  if (metadata->Kind() != ShardingMetadata::KindName()) {
+    return Status(
+        tensorflow::error::INVALID_ARGUMENT,
+        "ShardingMetadata normalizer called with incorrect domain metadata");
   }
-  return Status::OK();
+  return static_cast<const ShardingMetadata*>(metadata);
 }
 
-Status NormalizeShardingDomain(const DomainMetadata::Domain& domain) {
-  TF_ASSIGN_OR_RETURN(std::unique_ptr<HloSharding> sharding,
-                      ExtractOriginalCommonSharding(domain.instructions));
-  if (sharding != nullptr) {
-    VLOG(4) << "Normalizing sharding-less domain to " << sharding->ToString()
-            << ":";
-    TF_RETURN_IF_ERROR(ApplyDomainSharding(domain, *sharding));
+Status ShardingMetadata::NormalizeShardingDomain(
+    const DomainMetadata::Domain& domain, const DomainMetadata* metadata) {
+  if (metadata != nullptr) {
+    TF_ASSIGN_OR_RETURN(const auto& sharding_metadata,
+                        ToShardingMetadata(metadata));
+    const HloSharding* sharding = sharding_metadata->sharding();
+    if (sharding != nullptr) {
+      VLOG(4) << "Normalizing sharding to " << sharding->ToString() << ":";
+      TF_RETURN_IF_ERROR(ApplyDomainSharding(domain, *sharding));
+      TF_RETURN_IF_ERROR(FixupPassThroughDomainLinks(domain, *sharding));
+    }
   } else {
-    VLOG(1) << "Unable to find common sharding";
+    TF_ASSIGN_OR_RETURN(std::unique_ptr<HloSharding> sharding,
+                        ExtractOriginalCommonSharding(domain.instructions));
+    if (sharding != nullptr) {
+      VLOG(4) << "Normalizing sharding-less domain to " << sharding->ToString();
+      TF_RETURN_IF_ERROR(ApplyDomainSharding(domain, *sharding));
+    } else {
+      VLOG(1) << "Unable to find common sharding";
+    }
   }
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/hlo_sharding_metadata.h b/tensorflow/compiler/xla/service/hlo_sharding_metadata.h
index ec162c3490..5e01fc0e22 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding_metadata.h
+++ b/tensorflow/compiler/xla/service/hlo_sharding_metadata.h
@@ -38,23 +38,26 @@ class ShardingMetadata : public DomainMetadata {
 
   string ToString() const override;
 
-  Status NormalizeInstructions(
-      const DomainMetadata::Domain& domain) const override;
+  const HloSharding* sharding() const { return sharding_.get(); }
 
   static tensorflow::StringPiece KindName() { return "sharding"; }
 
+  static StatusOr<const ShardingMetadata*> ToShardingMetadata(
+      const DomainMetadata* metadata);
+
+  // Apply the specified domain metadata onto the specified domain. If no
+  // metadata is specified then apply sharding heuristics and normalize the
+  // instructions whose sharding deviates from the one which is inferred as to
+  // be the original one. Policy wise, HLO passes are allowed to create new
+  // unassigned instructions, but if they do create assigned ones, they have to
+  // conform to the ones around.
+  static Status NormalizeShardingDomain(const DomainMetadata::Domain& domain,
+                                        const DomainMetadata* metadata);
+
  private:
   std::unique_ptr<HloSharding> sharding_;
 };
 
-// Within a set of instructions which had common sharding attributes before
-// entring the HLO passes pipeline, apply sharding heuristics and normalize the
-// instructions whose sharding deviates from the one which is inferred as to be
-// the original one.
-// Policy wise, HLO passes are allowed to create new unassigned instructions,
-// but if they do create assigned ones, they have to conform to the ones around.
-Status NormalizeShardingDomain(const DomainMetadata::Domain& domain);
-
 // Given an HLO graph edge between instruction and one of its operands, creates
 // a ShardingMetadata based kDomain instruction if the sharding between
 // instruction and operand changes. Returns nullptr if there is no need for a
diff --git a/tensorflow/compiler/xla/service/hlo_sharding_test.cc b/tensorflow/compiler/xla/service/hlo_sharding_test.cc
index 54b7402b86..aebda562d3 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -51,7 +51,7 @@ TEST_F(HloShardingTest, Replicate) {
 
   EXPECT_IS_OK(sharding.Validate(ShapeUtil::MakeShape(U32, {4}),
                                  /*num_devices=*/2));
-  EXPECT_IS_NOT_OK(sharding.UniqueDevice());
+  EXPECT_FALSE(sharding.HasUniqueDevice());
 }
 
 TEST_F(HloShardingTest, DevicePlacement) {
@@ -60,7 +60,7 @@ TEST_F(HloShardingTest, DevicePlacement) {
   EXPECT_TRUE(sharding.IsTileMaximal());
   EXPECT_FALSE(sharding.UsesDevice(0));
   EXPECT_TRUE(sharding.UsesDevice(5));
-  EXPECT_EQ(5, sharding.UniqueDevice().ValueOrDie());
+  EXPECT_EQ(5, sharding.GetUniqueDevice());
 
   HloSharding other = HloSharding::Replicate();
   EXPECT_NE(other, sharding);
@@ -123,7 +123,7 @@ TEST_F(HloShardingTest, Tile) {
     EXPECT_EQ(sharding.TileOffsetForDevice(2), (std::vector<int64>{2, 0}));
     EXPECT_EQ(sharding.TileOffsetForDevice(1), (std::vector<int64>{2, 3}));
 
-    EXPECT_IS_NOT_OK(sharding.UniqueDevice());
+    EXPECT_FALSE(sharding.HasUniqueDevice());
   }
 }
 
diff --git a/tensorflow/compiler/xla/service/hlo_subcomputation_unification_test.cc b/tensorflow/compiler/xla/service/hlo_subcomputation_unification_test.cc
index 7b601f9a95..45c684d667 100644
--- a/tensorflow/compiler/xla/service/hlo_subcomputation_unification_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_subcomputation_unification_test.cc
@@ -75,7 +75,7 @@ TEST_F(HloSubcomputationUnificationTest, UnifyIdentities) {
       module->AddEmbeddedComputation(CreateR0S32IdentityComputation());
 
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(5)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(5)));
   auto x = builder.AddInstruction(
       HloInstruction::CreateCall(r0s32_, {constant}, callee1));
   auto y = builder.AddInstruction(
@@ -112,9 +112,9 @@ TEST_F(HloSubcomputationUnificationTest, UnifyAdditions) {
       module->AddEmbeddedComputation(CreateR0S32AdditionComputation());
 
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(5)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(5)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(3)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(3)));
   auto x = builder.AddInstruction(
       HloInstruction::CreateCall(r0s32_, {constant1, constant2}, callee1));
   auto y = builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/hlo_tfgraph_builder.cc b/tensorflow/compiler/xla/service/hlo_tfgraph_builder.cc
index 3dc733940f..b78bfa0cdf 100644
--- a/tensorflow/compiler/xla/service/hlo_tfgraph_builder.cc
+++ b/tensorflow/compiler/xla/service/hlo_tfgraph_builder.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_tfgraph_builder.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/core/framework/attr_value.pb.h"
@@ -101,11 +101,11 @@ const string& HloTfGraphBuilder::GetNodeNameForInstruction(
     }
   };
   string node_name;
-  if (debug_options_.xla_hlo_tfgraph_device_scopes() &&
-      instruction->has_sharding() &&
-      instruction->sharding().HasUniqueDevice()) {
-    node_name = StrCat(
-        "dev", instruction->sharding().UniqueDevice().ConsumeValueOrDie());
+  if (debug_options_.xla_hlo_tfgraph_device_scopes()) {
+    auto device = instruction->sharding_unique_device();
+    if (device) {
+      node_name = StrCat("dev", *device);
+    }
   }
   // If an instruction is fused, put it in the subgraph of the fusion;
   // otherwise, put it in the computation subgraph.
@@ -215,10 +215,10 @@ Status HloTfGraphBuilder::AddInstruction(const HloInstruction* instruction) {
   NodeDef* node_def = graph_def_.add_node();
   node_def->set_name(GetNodeNameForInstruction(instruction));
   node_def->set_op(GetOpDefName(instruction));
-  if (instruction->has_sharding() &&
-      instruction->sharding().HasUniqueDevice()) {
-    TF_ASSIGN_OR_RETURN(int64 device, instruction->sharding().UniqueDevice());
-    node_def->set_device(GetDeviceName(device));
+
+  auto device = instruction->sharding_unique_device();
+  if (device) {
+    node_def->set_device(GetDeviceName(*device));
   }
   SetNodeAttrs(instruction, node_def);
   if (instruction->opcode() == HloOpcode::kFusion) {
diff --git a/tensorflow/compiler/xla/service/hlo_tfgraph_builder_test.cc b/tensorflow/compiler/xla/service/hlo_tfgraph_builder_test.cc
index be156d765d..1e2b31a1f2 100644
--- a/tensorflow/compiler/xla/service/hlo_tfgraph_builder_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_tfgraph_builder_test.cc
@@ -90,7 +90,7 @@ TEST_F(HloTfGraphBuilderTest, CheckConcatenateDimsAndShapes) {
 TEST_F(HloTfGraphBuilderTest, CheckScalarValue) {
   auto builder = HloComputation::Builder("Const");
   HloInstruction *instruction = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0(123)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0(123)));
   OpMetadata metadata;
   metadata.set_op_name("x");
   metadata.set_op_type("y");
diff --git a/tensorflow/compiler/xla/service/hlo_value.cc b/tensorflow/compiler/xla/service/hlo_value.cc
index 7b27dbfec3..4e3c9df3a0 100644
--- a/tensorflow/compiler/xla/service/hlo_value.cc
+++ b/tensorflow/compiler/xla/service/hlo_value.cc
@@ -125,7 +125,7 @@ bool MayUseOperandValue(int64 operand_number, const ShapeIndex& index,
       // transparently.
       CHECK_EQ(operand_number, 0);
       return index.empty();
-    case HloOpcode::kSelect:
+    case HloOpcode::kTupleSelect:
       // Select does not use any nested elements of its selected-from operands
       // (operand 1 and 2)
       CHECK_GE(operand_number, 0);
diff --git a/tensorflow/compiler/xla/service/hlo_verifier.cc b/tensorflow/compiler/xla/service/hlo_verifier.cc
index 765245096b..1a8c206aaf 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier.cc
+++ b/tensorflow/compiler/xla/service/hlo_verifier.cc
@@ -41,6 +41,10 @@ Status ShapeVerifier::HandleSelect(HloInstruction* select) {
   return CheckTernaryShape(select);
 }
 
+Status ShapeVerifier::HandleTupleSelect(HloInstruction* tuple_select) {
+  return CheckTernaryShape(tuple_select);
+}
+
 Status ShapeVerifier::HandleConcatenate(HloInstruction* concatenate) {
   std::vector<const Shape*> operand_shapes;
   for (const HloInstruction* operand : concatenate->operands()) {
@@ -115,7 +119,7 @@ Status CheckIsTokenOperand(const HloInstruction* instruction,
   const HloInstruction* token = instruction->operand(operand_no);
   if (!ShapeUtil::Equal(token->shape(), ShapeUtil::MakeTokenShape())) {
     return InternalError(
-        "Expected operand %lld to be token-shaped, actual shape is"
+        "Expected operand %lld to be token-shaped, actual shape is "
         "%s:\n%s",
         operand_no, ShapeUtil::HumanString(token->shape()).c_str(),
         instruction->ToString().c_str());
@@ -123,6 +127,22 @@ Status CheckIsTokenOperand(const HloInstruction* instruction,
   return Status::OK();
 }
 
+Status CheckOperandAndParameter(const HloInstruction* instruction,
+                                int64 operand_number,
+                                const HloComputation* computation,
+                                int64 parameter_number) {
+  const HloInstruction* operand = instruction->operand(operand_number);
+  const HloInstruction* parameter =
+      computation->parameter_instruction(parameter_number);
+  if (!ShapeUtil::Compatible(operand->shape(), parameter->shape())) {
+    return InternalError("Operand %s shape does not match parameter's %s in %s",
+                         operand->ToString().c_str(),
+                         parameter->ToString().c_str(),
+                         instruction->ToString().c_str());
+  }
+  return Status::OK();
+}
+
 }  // namespace
 
 Status ShapeVerifier::HandleInfeed(HloInstruction* instruction) {
@@ -190,6 +210,12 @@ Status ShapeVerifier::HandleConstant(HloInstruction* constant) {
   return CheckShape(constant, constant->literal().shape());
 }
 
+Status ShapeVerifier::HandleIota(HloInstruction* iota) {
+  return ShapeUtil::Rank(iota->shape()) == 1
+             ? Status::OK()
+             : InternalError("Iota only supports arrays of rank 1.");
+}
+
 Status ShapeVerifier::HandleGetTupleElement(HloInstruction* get_tuple_element) {
   return CheckShape(get_tuple_element,
                     ShapeInference::InferGetTupleElementShape(
@@ -198,10 +224,13 @@ Status ShapeVerifier::HandleGetTupleElement(HloInstruction* get_tuple_element) {
 }
 
 Status ShapeVerifier::HandleReduce(HloInstruction* reduce) {
+  if (!ShapeUtil::IsArray(reduce->shape())) {
+    return InvalidArgument("Variadic reduce is not supported.");
+  }
   return CheckShape(
       reduce,
       ShapeInference::InferReduceShape(
-          reduce->operand(0)->shape(), reduce->operand(1)->shape(),
+          {&reduce->operand(0)->shape(), &reduce->operand(1)->shape()},
           reduce->dimensions(), reduce->to_apply()->ComputeProgramShape()));
 }
 
@@ -249,8 +278,11 @@ Status ShapeVerifier::HandleParameter(HloInstruction* hlo) {
 Status ShapeVerifier::HandleFusion(HloInstruction*) { return Status::OK(); }
 
 Status ShapeVerifier::HandleCall(HloInstruction* call) {
+  for (int64 i = 0; i < call->to_apply()->num_parameters(); ++i) {
+    TF_RETURN_IF_ERROR(CheckOperandAndParameter(call, i, call->to_apply(), i));
+  }
   // The shape of kCall should match the shape of the computation it calls.
-  return CheckShape(call, call->to_apply()->ComputeProgramShape().result());
+  return CheckShape(call, call->to_apply()->root_instruction()->shape());
 }
 
 Status ShapeVerifier::HandleCustomCall(HloInstruction*) { return Status::OK(); }
@@ -319,19 +351,37 @@ Status ShapeVerifier::HandleSelectAndScatter(HloInstruction* instruction) {
 }
 
 Status ShapeVerifier::HandleWhile(HloInstruction* xla_while) {
+  TF_RETURN_IF_ERROR(
+      CheckOperandAndParameter(xla_while, 0, xla_while->while_body(), 0));
+  TF_RETURN_IF_ERROR(
+      CheckOperandAndParameter(xla_while, 0, xla_while->while_condition(), 0));
+  const Shape& conditional_shape =
+      xla_while->while_condition()->root_instruction()->shape();
+  if (!ShapeUtil::Compatible(conditional_shape,
+                             ShapeUtil::MakeShape(PRED, {}))) {
+    return InternalError(
+        "Conditional computation shape does not lead to a scalar predicate "
+        "shape: %s",
+        ShapeUtil::HumanString(conditional_shape).c_str());
+  }
   // The shape of kWhile should match the shape of the body computation it
   // calls.
   return CheckShape(xla_while,
-                    xla_while->while_body()->ComputeProgramShape().result());
+                    xla_while->while_body()->root_instruction()->shape());
 }
 
 Status ShapeVerifier::HandleConditional(HloInstruction* conditional) {
+  TF_RETURN_IF_ERROR(CheckOperandAndParameter(
+      conditional, 1, conditional->true_computation(), 0));
+  TF_RETURN_IF_ERROR(CheckOperandAndParameter(
+      conditional, 2, conditional->false_computation(), 0));
+  TF_RETURN_IF_ERROR(
+      CheckShape(conditional,
+                 conditional->true_computation()->root_instruction()->shape()));
   TF_RETURN_IF_ERROR(CheckShape(
       conditional,
-      conditional->true_computation()->ComputeProgramShape().result()));
-  return CheckShape(
-      conditional,
-      conditional->false_computation()->ComputeProgramShape().result());
+      conditional->false_computation()->root_instruction()->shape()));
+  return Status::OK();
 }
 
 Status ShapeVerifier::HandlePad(HloInstruction* pad) {
@@ -341,42 +391,29 @@ Status ShapeVerifier::HandlePad(HloInstruction* pad) {
 }
 
 Status ShapeVerifier::HandleSend(HloInstruction* send) {
-  TF_RET_CHECK(send->users().size() == 1);
-  const HloInstruction* send_done = send->users().front();
-  TF_RET_CHECK(send_done->opcode() == HloOpcode::kSendDone);
-  TF_RETURN_IF_ERROR(CheckSameChannel(send, send_done));
-  TF_RETURN_IF_ERROR(CheckIsTokenOperand(send, 1));
-  return CheckShape(
-      send, ShapeUtil::MakeTupleShape(
-                {send->operand(0)->shape(), ShapeUtil::MakeShape(U32, {})}));
+  return CheckShape(send,
+                    ShapeUtil::MakeTupleShape({send->operand(0)->shape(),
+                                               ShapeUtil::MakeShape(U32, {}),
+                                               ShapeUtil::MakeTokenShape()}));
 }
 
 Status ShapeVerifier::HandleSendDone(HloInstruction* send_done) {
-  TF_RET_CHECK(send_done->operands().size() == 1);
-  const HloInstruction* send = send_done->operand(0);
-  TF_RET_CHECK(send->opcode() == HloOpcode::kSend);
-  TF_RETURN_IF_ERROR(CheckSameChannel(send, send_done));
-
-  return CheckShape(send_done, ShapeUtil::MakeNil());
+  return CheckShape(send_done, ShapeUtil::MakeTokenShape());
 }
 
 Status ShapeVerifier::HandleRecv(HloInstruction* recv) {
-  TF_RET_CHECK(recv->users().size() == 1);
-  const HloInstruction* recv_done = recv->users().front();
-  TF_RET_CHECK(recv_done->opcode() == HloOpcode::kRecvDone);
-  TF_RETURN_IF_ERROR(CheckSameChannel(recv, recv_done));
-  TF_RETURN_IF_ERROR(CheckIsTokenOperand(recv, 0));
-  return CheckShape(recv,
-                    ShapeUtil::MakeTupleShape(
-                        {recv_done->shape(), ShapeUtil::MakeShape(U32, {})}));
+  return CheckShape(
+      recv, ShapeUtil::MakeTupleShape(
+                {ShapeUtil::GetTupleElementShape(recv->shape(), 0),
+                 ShapeUtil::MakeShape(U32, {}), ShapeUtil::MakeTokenShape()}));
 }
 
 Status ShapeVerifier::HandleRecvDone(HloInstruction* recv_done) {
-  TF_RET_CHECK(recv_done->operands().size() == 1);
-  const HloInstruction* recv = recv_done->operand(0);
-  TF_RET_CHECK(recv->opcode() == HloOpcode::kRecv);
-  TF_RETURN_IF_ERROR(CheckSameChannel(recv, recv_done));
-  return CheckShape(recv_done, recv->shape().tuple_shapes(0));
+  return CheckShape(
+      recv_done,
+      ShapeUtil::MakeTupleShape(
+          {ShapeUtil::GetTupleElementShape(recv_done->operand(0)->shape(), 0),
+           ShapeUtil::MakeTokenShape()}));
 }
 
 Status ShapeVerifier::HandleBatchNormTraining(
@@ -435,6 +472,7 @@ Status CheckMixedPrecisionOperands(const HloInstruction* instruction) {
     case HloOpcode::kRecvDone:
     case HloOpcode::kReducePrecision:
     case HloOpcode::kSelect:
+    case HloOpcode::kTupleSelect:
     case HloOpcode::kSend:
     case HloOpcode::kSendDone:
     case HloOpcode::kTuple:
@@ -475,6 +513,15 @@ Status ShapeVerifier::HandleGather(HloInstruction* gather) {
           gather->gather_dimension_numbers(), gather->gather_window_bounds()));
 }
 
+Status ShapeVerifier::HandleScatter(HloInstruction* scatter) {
+  return CheckShape(
+      scatter, ShapeInference::InferScatterShape(
+                   scatter->operand(0)->shape(), scatter->operand(1)->shape(),
+                   scatter->operand(2)->shape(),
+                   scatter->to_apply()->ComputeProgramShape(),
+                   scatter->scatter_dimension_numbers()));
+}
+
 Status ShapeVerifier::HandleAfterAll(HloInstruction* token) {
   std::vector<const Shape*> operand_shapes;
   for (const HloInstruction* operand : token->operands()) {
@@ -497,16 +544,10 @@ Status ShapeVerifier::CheckShape(const HloInstruction* instruction,
   // We treat BF16 and F32 as compatible types if mixed precision is allowed,
   // but only when the instruction defines the BF16/F32 buffer.
   switch (instruction->opcode()) {
-    case HloOpcode::kSelect:
-      if (ShapeUtil::IsTuple(inferred_shape) || !allow_mixed_precision_) {
-        // Select only defines the top-level buffer, which in this case is the
-        // tuple, so we cannot allow mixed precision.
-        compatible =
-            ShapeUtil::Compatible(instruction->shape(), inferred_shape);
-      } else {
-        compatible = ShapeUtil::CompatibleIgnoringFpPrecision(
-            instruction->shape(), inferred_shape);
-      }
+    case HloOpcode::kTupleSelect:
+      // TupleSelect only defines the top-level buffer, which in this case is
+      // the tuple, so we cannot allow mixed precision.
+      compatible = ShapeUtil::Compatible(instruction->shape(), inferred_shape);
       break;
     case HloOpcode::kGetTupleElement:
     case HloOpcode::kTuple:
@@ -587,19 +628,6 @@ Status ShapeVerifier::CheckVariadicShape(const HloInstruction* instruction) {
                         instruction->opcode(), instruction->operands()));
 }
 
-// Checks if the given two instructions shares the same channel id.
-Status ShapeVerifier::CheckSameChannel(const HloInstruction* instr1,
-                                       const HloInstruction* instr2) {
-  if (instr1->channel_id() != instr2->channel_id()) {
-    return InternalError(
-        "Expected to have the same channel id, actual channel ids are: %s "
-        "(%lld), %s (%lld)",
-        instr1->ToString().c_str(), instr1->channel_id(),
-        instr2->ToString().c_str(), instr2->channel_id());
-  }
-  return Status::OK();
-}
-
 string ComputationsToString(
     tensorflow::gtl::ArraySlice<HloComputation*> computations) {
   return tensorflow::str_util::Join(
@@ -799,33 +827,23 @@ Status HloVerifier::CheckWhileInstruction(HloInstruction* instruction) {
         "While loop must have exactly one operand; had %lld : %s",
         instruction->operand_count(), instruction->ToString().c_str());
   }
-  auto* init = instruction->operand(0);
-  auto* cond_param = while_cond->parameter_instruction(0);
-  if (!ShapeUtil::Compatible(init->shape(), cond_param->shape())) {
-    return FailedPrecondition(
-        "While condition's parameter must have the same shape as the "
-        "loop's 'init'. init: %s, param: %s",
-        init->ToString().c_str(), cond_param->ToString().c_str());
-  }
-  auto* cond_root = while_cond->root_instruction();
-  if (!ShapeUtil::Compatible(cond_root->shape(),
-                             ShapeUtil::MakeShape(PRED, {}))) {
-    return FailedPrecondition("While condition should have shape PRED: %s",
-                              cond_root->ToString().c_str());
-  }
-  auto* body_param = while_body->parameter_instruction(0);
-  if (!ShapeUtil::Compatible(init->shape(), body_param->shape())) {
+  return Status::OK();
+}
+
+Status HloVerifier::CheckConditionalInstruction(HloInstruction* instruction) {
+  if (instruction->true_computation()->num_parameters() != 1) {
     return FailedPrecondition(
-        "While body's parameter must have the same shape as the loop's"
-        " 'init'. init: %s, param: %s",
-        init->ToString().c_str(), body_param->ToString().c_str());
+        "True computation %s of %s must have 1 parameter insted of %lld",
+        instruction->true_computation()->name().c_str(),
+        instruction->ToString().c_str(),
+        instruction->true_computation()->num_parameters());
   }
-  auto* body_root = while_body->root_instruction();
-  if (!ShapeUtil::Compatible(init->shape(), body_root->shape())) {
+  if (instruction->false_computation()->num_parameters() != 1) {
     return FailedPrecondition(
-        "While body should have same shape as the loop's 'init'."
-        "init: %s, body: %s",
-        init->ToString().c_str(), body_root->ToString().c_str());
+        "False computation %s of %s must have 1 parameter insted of %lld",
+        instruction->false_computation()->name().c_str(),
+        instruction->ToString().c_str(),
+        instruction->false_computation()->num_parameters());
   }
   return Status::OK();
 }
@@ -878,10 +896,105 @@ Status VerifyEntryAndExitShapes(const HloModule& module) {
   return Status::OK();
 }
 
+// Checks if the given two instructions share the same channel id.
+Status CheckSameChannel(const HloInstruction* instr1,
+                        const HloInstruction* instr2) {
+  if (instr1->channel_id() != instr2->channel_id()) {
+    return InternalError(
+        "Expected to have the same channel id, actual channel ids are: %s "
+        "(%lld), %s (%lld)",
+        instr1->ToString().c_str(), instr1->channel_id(),
+        instr2->ToString().c_str(), instr2->channel_id());
+  }
+  return Status::OK();
+}
+
+// Checks if the given two instructions have the same is_host_transfer attribute
+// value. Intsructions must be send/recv instructions or their 'done' variant.
+Status CheckSameIsHostTransfer(const HloInstruction* instr1,
+                               const HloInstruction* instr2) {
+  const HloSendRecvInstruction* send_recv1 =
+      DynCast<const HloSendRecvInstruction>(instr1);
+  const HloSendRecvInstruction* send_recv2 =
+      DynCast<const HloSendRecvInstruction>(instr2);
+  TF_RET_CHECK(send_recv1 != nullptr);
+  TF_RET_CHECK(send_recv2 != nullptr);
+  if (send_recv1->is_host_transfer() != send_recv2->is_host_transfer()) {
+    return InternalError(
+        "Expected instructions to have the same is-host-transfer property: %s, "
+        "%s ",
+        instr1->ToString().c_str(), instr2->ToString().c_str());
+  }
+  return Status::OK();
+}
+
+// Checks various invariants of send and recv instructions.
+Status VerifySendsAndRecvs(const HloModule& module) {
+  tensorflow::gtl::FlatMap<int64, const HloInstruction*> host_channels;
+  // Host send/recv instructions must have their own unique channel.
+  auto check_unique_host_channel = [&](const HloInstruction* instruction) {
+    const HloSendRecvInstruction* sendrecv =
+        DynCast<const HloSendRecvInstruction>(instruction);
+    if (sendrecv->is_host_transfer()) {
+      auto it_inserted =
+          host_channels.insert({sendrecv->channel_id(), sendrecv});
+      if (!it_inserted.second) {
+        return FailedPrecondition(
+            "Channel %lld is used for multiple host send/recv instructions: %s "
+            "and "
+            "%s",
+            sendrecv->channel_id(), sendrecv->ToString().c_str(),
+            it_inserted.first->second->ToString().c_str());
+      }
+    }
+
+    return Status::OK();
+  };
+
+  // Send/Recv instruction must have a single user: the corresponding
+  // SendDone/RecvDone. with matching channel.
+  for (const HloComputation* computation : module.computations()) {
+    for (const HloInstruction* instruction : computation->instructions()) {
+      switch (instruction->opcode()) {
+        case HloOpcode::kSend: {
+          TF_RETURN_IF_ERROR(check_unique_host_channel(instruction));
+          TF_RET_CHECK(instruction->users().size() == 1);
+          const HloInstruction* send_done = instruction->users().front();
+          TF_RET_CHECK(send_done->opcode() == HloOpcode::kSendDone);
+          TF_RETURN_IF_ERROR(CheckSameChannel(instruction, send_done));
+          TF_RETURN_IF_ERROR(CheckSameIsHostTransfer(instruction, send_done));
+          break;
+        }
+        case HloOpcode::kRecv: {
+          TF_RETURN_IF_ERROR(check_unique_host_channel(instruction));
+          TF_RET_CHECK(instruction->users().size() == 1);
+          const HloInstruction* recv_done = instruction->users().front();
+          TF_RET_CHECK(recv_done->opcode() == HloOpcode::kRecvDone);
+          TF_RETURN_IF_ERROR(CheckSameChannel(instruction, recv_done));
+          TF_RETURN_IF_ERROR(CheckSameIsHostTransfer(instruction, recv_done));
+          break;
+        }
+        case HloOpcode::kSendDone:
+          TF_RET_CHECK(instruction->operands().size() == 1);
+          TF_RET_CHECK(instruction->operand(0)->opcode() == HloOpcode::kSend);
+          break;
+        case HloOpcode::kRecvDone:
+          TF_RET_CHECK(instruction->operands().size() == 1);
+          TF_RET_CHECK(instruction->operand(0)->opcode() == HloOpcode::kRecv);
+          break;
+        default:
+          break;
+      }
+    }
+  }
+  return Status::OK();
+}
+
 }  // namespace
 
 StatusOr<bool> HloVerifier::Run(HloModule* module) {
   TF_RETURN_IF_ERROR(VerifyHloStructure(module));
+  TF_RETURN_IF_ERROR(VerifySendsAndRecvs(*module));
 
   tensorflow::gtl::FlatMap<string, const HloInstruction*> instructions;
 
@@ -921,6 +1034,8 @@ StatusOr<bool> HloVerifier::Run(HloModule* module) {
             << " != " << ShapeUtil::Rank(instruction->operand(0)->shape());
       } else if (instruction->opcode() == HloOpcode::kWhile) {
         TF_RETURN_IF_ERROR(CheckWhileInstruction(instruction));
+      } else if (instruction->opcode() == HloOpcode::kConditional) {
+        TF_RETURN_IF_ERROR(CheckConditionalInstruction(instruction));
       } else if (instruction->opcode() !=
                      HloOpcode::kRng /* Rng operands are always scalar. */
                  && instruction->IsElementwise()) {
diff --git a/tensorflow/compiler/xla/service/hlo_verifier.h b/tensorflow/compiler/xla/service/hlo_verifier.h
index da6b5d2222..7feddaeabf 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier.h
+++ b/tensorflow/compiler/xla/service/hlo_verifier.h
@@ -35,7 +35,9 @@ class ShapeVerifier : public DfsHloVisitor {
   Status HandleElementwiseBinary(HloInstruction* hlo) override;
   Status HandleClamp(HloInstruction* clamp) override;
   Status HandleSelect(HloInstruction* select) override;
+  Status HandleTupleSelect(HloInstruction* tuple_select) override;
   Status HandleConcatenate(HloInstruction* concatenate) override;
+  Status HandleIota(HloInstruction* iota) override;
   Status HandleConvert(HloInstruction* convert) override;
   Status HandleBitcastConvert(HloInstruction* convert) override;
   Status HandleCopy(HloInstruction* copy) override;
@@ -81,6 +83,7 @@ class ShapeVerifier : public DfsHloVisitor {
       HloInstruction* batch_norm_inference) override;
   Status HandleBatchNormGrad(HloInstruction* batch_norm_grad) override;
   Status HandleGather(HloInstruction* gather) override;
+  Status HandleScatter(HloInstruction* scatter) override;
   Status HandleAfterAll(HloInstruction* token) override;
 
   Status FinishVisit(HloInstruction*) override { return Status::OK(); }
@@ -101,10 +104,6 @@ class ShapeVerifier : public DfsHloVisitor {
   Status CheckTernaryShape(const HloInstruction* instruction);
   Status CheckVariadicShape(const HloInstruction* instruction);
 
-  // Checks if the given two instructions share the same channel id.
-  Status CheckSameChannel(const HloInstruction* instr1,
-                          const HloInstruction* instr2);
-
  private:
   // Whether the inputs and output of an instruction can contain both F32s and
   // BF16s. Tuples that include both F32s and BF16s are allowed regardless of
@@ -145,6 +144,8 @@ class HloVerifier : public HloPassInterface {
 
   Status CheckWhileInstruction(HloInstruction* instruction);
 
+  Status CheckConditionalInstruction(HloInstruction* instruction);
+
   // Checks that the non-scalar operand shapes are compatible to the output
   // shape, i.e., that there are no implicit broadcasts of size-one dimensions.
   Status CheckElementwiseInstruction(HloInstruction* instruction);
diff --git a/tensorflow/compiler/xla/service/hlo_verifier_test.cc b/tensorflow/compiler/xla/service/hlo_verifier_test.cc
index c92db0be14..04c6ba3eeb 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_verifier_test.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
@@ -123,5 +124,55 @@ TEST_F(HloVerifierTest, ResetsShapeVerifierState) {
   EXPECT_FALSE(verifier().Run(module.get()).status().ok());
 }
 
+TEST_F(HloVerifierTest, CheckCallOperandParameterShapesMismatch) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+callme {
+  ROOT param = (s32[], f32[4]) parameter(0)
+}
+
+ENTRY entry {
+  p0 = (f32[4], s32[]) parameter(0)
+  ROOT mycall = (s32[], f32[4]) call(p0), to_apply=callme
+}
+)";
+  TF_ASSERT_OK_AND_ASSIGN(auto module, ParseHloString(hlo_string));
+
+  auto status = verifier().Run(module.get()).status();
+  ASSERT_FALSE(status.ok());
+  EXPECT_THAT(status.error_message(),
+              HasSubstr("shape does not match parameter"));
+}
+
+TEST_F(HloVerifierTest, CheckConditionalOperandParameterShapesMismatch) {
+  const char* const hlo_string = R"(
+HloModule Module
+
+true_branch {
+  tparam = (s32[], f32[4]) parameter(0)
+  ROOT tgte1 = f32[4] get-tuple-element(tparam), index=1
+}
+
+false_branch {
+  fparam = (s32[], f32[4]) parameter(0)
+  ROOT fgte1 = f32[4] get-tuple-element(fparam), index=1
+}
+
+ENTRY entry {
+  p0 = (f32[4], s32[]) parameter(0)
+  constant = pred[] constant(true)
+  ROOT conditional = f32[4] conditional(constant, p0, p0),
+    true_computation=true_branch, false_computation=false_branch
+}
+)";
+  TF_ASSERT_OK_AND_ASSIGN(auto module, ParseHloString(hlo_string));
+
+  auto status = verifier().Run(module.get()).status();
+  ASSERT_FALSE(status.ok());
+  EXPECT_THAT(status.error_message(),
+              HasSubstr("shape does not match parameter"));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/implicit_broadcast_remover_test.cc b/tensorflow/compiler/xla/service/implicit_broadcast_remover_test.cc
index 8c7b38dd1b..f85d31d522 100644
--- a/tensorflow/compiler/xla/service/implicit_broadcast_remover_test.cc
+++ b/tensorflow/compiler/xla/service/implicit_broadcast_remover_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/implicit_broadcast_remover.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
diff --git a/tensorflow/compiler/xla/service/indexed_array_analysis.cc b/tensorflow/compiler/xla/service/indexed_array_analysis.cc
index 1985d20578..8b2df32567 100644
--- a/tensorflow/compiler/xla/service/indexed_array_analysis.cc
+++ b/tensorflow/compiler/xla/service/indexed_array_analysis.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/core/lib/gtl/flatset.h"
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
+#include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 
 namespace xla {
@@ -160,6 +161,12 @@ StatusOr<Analysis::Array*> IndexedArrayAnalysis::ComputeArrayFor(
         computed_array,
         ComputeArrayForReshape(instr->shape(),
                                FindOrDie(cache_, instr->operand(0))));
+  } else if (instr->opcode() == HloOpcode::kDot) {
+    TF_ASSIGN_OR_RETURN(
+        computed_array,
+        ComputeArrayForDot(instr->shape(), instr->dot_dimension_numbers(),
+                           FindOrDie(cache_, instr->operand(0)),
+                           FindOrDie(cache_, instr->operand(1))));
   } else {
     computed_array = nullptr;
   }
@@ -290,8 +297,7 @@ StatusOr<Analysis::Array*> IndexedArrayAnalysis::ComputeArrayForGather(
   }
 
   if (auto* indexed = dynamic_cast<ScalarIndexedArray*>(source)) {
-    auto it = c_find(indexed->output_dims(), source_dim);
-    if (it != indexed->output_dims().end()) {
+    if (c_linear_search(indexed->output_dims(), source_dim)) {
       return FoldGatherOfGather(indexed, indices, source_dim, output_dims,
                                 shape);
     }
@@ -956,11 +962,177 @@ IndexedArrayAnalysis::ComputeArrayForElementwiseUnaryOp(HloOpcode opcode,
   return Construct<ScalarIndexedConstantArray>(
       new_source, scalar_indexed_const->indices(),
       scalar_indexed_const->source_dim(),
-      std::vector<int64>(scalar_indexed_const->output_dims().begin(),
-                         scalar_indexed_const->output_dims().end()),
+      ArraySliceToVector(scalar_indexed_const->output_dims()),
       scalar_indexed_const->shape());
 }
 
+namespace {
+
+// Returns the non-contracting non-batch dimension (as per `contracting_dims`
+// and `batch_dims`) if there is exactly one, otherwise returns nullopt.
+gtl::optional<int64> GetOnlyNonContractingNonBatchDim(
+    int64 rank, ArraySlice<int64> contracting_dims,
+    ArraySlice<int64> batch_dims) {
+  gtl::optional<int64> result;
+  for (int64 dim = 0; dim < rank; dim++) {
+    if (!ArrayContains(contracting_dims, dim) &&
+        !ArrayContains(batch_dims, dim)) {
+      if (result.has_value()) {
+        return gtl::nullopt;
+      }
+      result = dim;
+    }
+  }
+  return result;
+}
+
+// Returns true if `indexed_array`, which is either the LHS or the RHS of a Dot
+// HLO, can be folded into the dot operation.  For now these conditions are both
+// necessary and sufficient.
+//
+// `tag` describes the caller.  Used only for logging.
+//
+// `contracting_dims` and `batch_dims` are the contracting and batch dimensions
+// of whatever operand `indexed_array` is to the dot (LHS or RHS).
+bool CanFoldDotIntoIndexedArray(
+    tensorflow::StringPiece tag,
+    Analysis::ScalarIndexedConstantArray* indexed_array,
+    ArraySlice<int64> contracting_dims, ArraySlice<int64> batch_dims) {
+  gtl::optional<int64> non_contracting_non_batch_dim =
+      GetOnlyNonContractingNonBatchDim(ShapeUtil::Rank(indexed_array->shape()),
+                                       contracting_dims, batch_dims);
+  if (!non_contracting_non_batch_dim.has_value()) {
+    VLOG(3) << tag << ": multiple or no non-contracting non-batch dimensions";
+    return false;
+  }
+
+  if (indexed_array->output_dims().size() != 1 ||
+      indexed_array->output_dims()[0] != *non_contracting_non_batch_dim) {
+    VLOG(3) << tag << ": output dims != the lhs non-contracting non-batch dim";
+    return false;
+  }
+
+  int64 indexed_array_rank = ShapeUtil::Rank(indexed_array->shape());
+  if (indexed_array->source_dim() < (indexed_array_rank - 2)) {
+    // This restriction can be lifted by inserting reshape nodes.
+    VLOG(3) << tag
+            << ": source dim is not in the low two dims, won't be able to form "
+               "a matmul";
+    return false;
+  }
+
+  return true;
+}
+
+}  // namespace
+
+StatusOr<Analysis::Array*>
+IndexedArrayAnalysis::ComputeArrayForDotWithIndexedLhs(
+    const Shape& shape, const DotDimensionNumbers& dim_numbers,
+    ScalarIndexedConstantArray* lhs, ConstantArray* rhs) {
+  VLOG(3) << "ComputeArrayForDotWithIndexedLhs(" << ToString(lhs) << " "
+          << ToString(rhs);
+  if (!CanFoldDotIntoIndexedArray(
+          "ComputeArrayForDotWithIndexedLhs", lhs, /*contracting_dims=*/
+          AsInt64Slice(dim_numbers.lhs_contracting_dimensions()),
+          /*batch_dims=*/AsInt64Slice(dim_numbers.lhs_batch_dimensions()))) {
+    return nullptr;
+  }
+
+  int64 lhs_rank = ShapeUtil::Rank(lhs->shape());
+  DotDimensionNumbers new_dim_numbers = dim_numbers;
+  new_dim_numbers.set_lhs_contracting_dimensions(
+      0, lhs->source_dim() == (lhs_rank - 1) ? (lhs_rank - 2) : (lhs_rank - 1));
+
+  TF_ASSIGN_OR_RETURN(Literal * literal_for_new_source,
+                      TakeOwnership(HloEvaluator{}.EvaluateDotOp(
+                          new_dim_numbers, lhs->literal(), *rhs->literal())));
+
+  // The new source dimension is wherever the non-batch non-contracting LHS
+  // dimension "went".
+  int64 new_source_dim = dim_numbers.lhs_batch_dimensions_size() +
+                         dim_numbers.rhs_batch_dimensions_size();
+
+  ConstantArray* new_source = Construct<ConstantArray>(literal_for_new_source);
+  return Construct<ScalarIndexedConstantArray>(
+      new_source, lhs->indices(), new_source_dim,
+      ArraySliceToVector(lhs->output_dims()), shape);
+}
+
+StatusOr<Analysis::Array*>
+IndexedArrayAnalysis::ComputeArrayForDotWithIndexedRhs(
+    const Shape& shape, const DotDimensionNumbers& dim_numbers,
+    ConstantArray* lhs, ScalarIndexedConstantArray* rhs) {
+  VLOG(3) << "ComputeArrayForDotWithIndexedRhs(" << ToString(lhs) << " "
+          << ToString(rhs);
+  if (!CanFoldDotIntoIndexedArray(
+          "ComputeArrayForDotWithIndexedRhs", rhs, /*contracting_dims=*/
+          AsInt64Slice(dim_numbers.rhs_contracting_dimensions()),
+          /*batch_dims=*/AsInt64Slice(dim_numbers.rhs_batch_dimensions()))) {
+    return nullptr;
+  }
+
+  int64 rhs_rank = ShapeUtil::Rank(rhs->shape());
+
+  DotDimensionNumbers new_dim_numbers = dim_numbers;
+  new_dim_numbers.set_rhs_contracting_dimensions(
+      0, rhs->source_dim() == (rhs_rank - 1) ? (rhs_rank - 2) : (rhs_rank - 1));
+
+  TF_ASSIGN_OR_RETURN(Literal * literal_for_new_source,
+                      TakeOwnership(HloEvaluator{}.EvaluateDotOp(
+                          new_dim_numbers, *lhs->literal(), rhs->literal())));
+
+  // The new source dimension is wherever the non-batch non-contracting RHS
+  // dimension "went".
+  int64 new_source_dim = dim_numbers.lhs_batch_dimensions_size() +
+                         dim_numbers.rhs_batch_dimensions_size() + 1;
+
+  ConstantArray* new_source = Construct<ConstantArray>(literal_for_new_source);
+  return Construct<ScalarIndexedConstantArray>(
+      new_source, rhs->indices(), new_source_dim,
+      ArraySliceToVector(rhs->output_dims()), shape);
+}
+
+StatusOr<Analysis::Array*> IndexedArrayAnalysis::ComputeArrayForDot(
+    const Shape& shape, const DotDimensionNumbers& dim_numbers, Array* lhs,
+    Array* rhs) {
+  // Intuitively, if
+  //
+  //  - The LHS of a dot product is a gathered sequence of rows from a constant
+  //    array (i.e. LHS[I,J] = Const[Indices[I],J]) and the RHS is a constant
+  //
+  //  OR
+  //
+  //  - If the RHS of a dot product is a gathered sequence of columns from a
+  //    constant array (i.e. RHS[I,J] = Const[I, Indices[J]]) and the LHS is a
+  //    constant
+  //
+  // then the result of the dot product itself is a gather from a constant
+  // array.  E.g. Dot(LHS, ConstRhs) where LHS[I,J] = Const[Indices[I],J] can be
+  // rewritten as Result where Result[I,J] = Dot(Const, ConstRhs)[Indices[I],
+  // J].
+  //
+  // We do a general version of this rewrite here.
+  VLOG(3) << "ComputeArrayForDot(" << ToString(lhs) << " " << ToString(rhs);
+  if (auto* lhs_indexed_array =
+          dynamic_cast<ScalarIndexedConstantArray*>(lhs)) {
+    if (auto* rhs_constant = dynamic_cast<ConstantArray*>(rhs)) {
+      return ComputeArrayForDotWithIndexedLhs(shape, dim_numbers,
+                                              lhs_indexed_array, rhs_constant);
+    }
+  }
+
+  if (auto* rhs_indexed_array =
+          dynamic_cast<ScalarIndexedConstantArray*>(rhs)) {
+    if (auto* lhs_constant = dynamic_cast<ConstantArray*>(lhs)) {
+      return ComputeArrayForDotWithIndexedRhs(shape, dim_numbers, lhs_constant,
+                                              rhs_indexed_array);
+    }
+  }
+
+  return nullptr;
+}
+
 tensorflow::StringPiece IndexedArrayAnalysisPrinterPass::name() const {
   return "indexed-array-analysis-printer-pass";
 }
diff --git a/tensorflow/compiler/xla/service/indexed_array_analysis.h b/tensorflow/compiler/xla/service/indexed_array_analysis.h
index 8684430231..e923dc39f7 100644
--- a/tensorflow/compiler/xla/service/indexed_array_analysis.h
+++ b/tensorflow/compiler/xla/service/indexed_array_analysis.h
@@ -268,6 +268,18 @@ class IndexedArrayAnalysis {
       tensorflow::gtl::ArraySlice<int64> window_bounds, Array* source,
       Array* indices);
 
+  StatusOr<Array*> ComputeArrayForDotWithIndexedLhs(
+      const Shape& shape, const DotDimensionNumbers& dim_numbers,
+      ScalarIndexedConstantArray* lhs, ConstantArray* rhs);
+
+  StatusOr<Array*> ComputeArrayForDotWithIndexedRhs(
+      const Shape& shape, const DotDimensionNumbers& dim_numbers,
+      ConstantArray* lhs, ScalarIndexedConstantArray* rhs);
+
+  StatusOr<Array*> ComputeArrayForDot(const Shape& shape,
+                                      const DotDimensionNumbers& dim_numbers,
+                                      Array* lhs, Array* rhs);
+
   // This tries to fold a ScalarIndexedArray which has another
   // ScalarIndexedArray as a source into a ScalarIndexedArray that instead has a
   // ScalarIndexedArray as indices.  If `source` happened to be a
diff --git a/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc b/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc
index fc2befe05b..5f4b42799b 100644
--- a/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/indexed_array_analysis_test.cc
@@ -799,5 +799,170 @@ ENTRY main {
   AssertArrayForRootExpressionIs(hlo_text, "%add");
 }
 
+TEST_F(IndexedArrayAnalysisTest, DotOpBasic_0) {
+  string hlo_text = R"(
+HloModule DotOp
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{5,6,7,8},{9,10,11,12}})
+  dot_rhs_constant = s32[4,3] constant(s32[4,3]{{1,2,3},{4,5,6},{7,8,9},{10,11,12}})
+  indices = s32[5] parameter(0)
+  dot_lhs = s32[5,4] gather(gather_operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,4}
+  ROOT dot = s32[5,3] dot(dot_lhs, dot_rhs_constant), lhs_contracting_dims={1}, rhs_contracting_dims={0}
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, R"(
+(scalar-indexed-const
+  (constant s32[3,3] s32[3,3] {
+    { 70, 80, 90 },
+    { 158, 184, 210 },
+    { 246, 288, 330 } })
+  %indices 0->[0]))");
+}
+
+TEST_F(IndexedArrayAnalysisTest, DotOpBasic_1) {
+  string hlo_text = R"(
+HloModule DotOp
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{5,6,7,8},{9,10,11,12}})
+  dot_rhs_constant = s32[3,3] constant(s32[3,3]{{1,2,3},{4,5,6},{7,8,9}})
+  indices = s32[5] parameter(0)
+  dot_lhs = s32[3,5] gather(gather_operand, indices),
+      output_window_dims={0},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=1,
+      window_bounds={3,1}
+  ROOT dot = s32[5,3] dot(dot_lhs, dot_rhs_constant), lhs_contracting_dims={0}, rhs_contracting_dims={0}
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, R"(
+(scalar-indexed-const
+  (constant s32[4,3] s32[4,3] {
+    { 84, 99, 114 },
+    { 96, 114, 132 },
+    { 108, 129, 150 },
+    { 120, 144, 168 } })
+   %indices 0->[1]))");
+}
+
+TEST_F(IndexedArrayAnalysisTest, DotOpBasic_2) {
+  string hlo_text = R"(
+HloModule DotOp
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{5,6,7,8},{9,10,11,12}})
+  dot_lhs_constant = s32[4,3] constant(s32[4,3]{{1,2,3},{4,5,6},{7,8,9},{10,11,12}})
+  indices = s32[5] parameter(0)
+  dot_rhs = s32[3,5] gather(gather_operand, indices),
+      output_window_dims={0},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=1,
+      window_bounds={3,1}
+  ROOT dot = s32[4,5] dot(dot_lhs_constant, dot_rhs), lhs_contracting_dims={1}, rhs_contracting_dims={0}
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, R"(
+(scalar-indexed-const
+  (constant s32[4,4] s32[4,4] {
+    { 38, 44, 50, 56 },
+    { 83, 98, 113, 128 },
+    { 128, 152, 176, 200 },
+    { 173, 206, 239, 272 } })
+  %indices 1->[1])
+)");
+}
+
+TEST_F(IndexedArrayAnalysisTest, DotOpBasic_3) {
+  string hlo_text = R"(
+HloModule DotOp
+
+ENTRY main {
+  gather_operand = s32[4,3] constant(s32[4,3]{{1,2,3},{4,5,6},{7,8,9},{10,11,12}})
+  dot_lhs_constant = s32[4,3] constant(s32[4,3]{{1,2,3},{4,5,6},{7,8,9},{10,11,12}})
+  indices = s32[5] parameter(0)
+  dot_rhs = s32[5,3] gather(gather_operand, indices),
+      output_window_dims={1},
+      elided_window_dims={0},
+      gather_dims_to_operand_dims={0},
+      index_vector_dim=1,
+      window_bounds={1,3}
+  ROOT dot = s32[4,5] dot(dot_lhs_constant, dot_rhs), lhs_contracting_dims={1}, rhs_contracting_dims={1}
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, R"(
+(scalar-indexed-const
+  (constant s32[4,4] s32[4,4] {
+    { 14, 32, 50, 68 },
+    { 32, 77, 122, 167 },
+    { 50, 122, 194, 266 },
+    { 68, 167, 266, 365 } })
+  %indices 1->[0])
+)");
+}
+
+TEST_F(IndexedArrayAnalysisTest, DotOpWithBatch) {
+  string hlo_text = R"(
+HloModule DotOp
+
+ENTRY main {
+  gather_operand = s32[2,3,2] constant(s32[2,3,2]{{{1,2},{3,4},{5,6}},{{7,8},{9,10},{11,12}}})
+  dot_lhs_constant = s32[2,2,3] constant(s32[2,2,3]{{{1,2,3},{4,5,6}},{{7,8,9},{10,11,12}}})
+  indices = s32[4] parameter(0)
+  dot_rhs = s32[2,3,4] gather(gather_operand, indices),
+      output_window_dims={0,1},
+      elided_window_dims={2},
+      gather_dims_to_operand_dims={2},
+      index_vector_dim=1,
+      window_bounds={2,3,1}
+  ROOT dot = s32[2,2,4] dot(dot_lhs_constant, dot_rhs),
+      lhs_contracting_dims={2}, rhs_contracting_dims={1},
+      lhs_batch_dims={0}, rhs_batch_dims={0}
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, R"(
+(scalar-indexed-const
+  (constant s32[2,2,2] s32[2,2,2] {
+    { { 22, 28 },
+      { 49, 64 } },
+    { { 220, 244 },
+      { 301, 334 } } })
+  %indices 3->[2])
+)");
+}
+
+TEST_F(IndexedArrayAnalysisTest, DotOpNegative) {
+  string hlo_text = R"(
+HloModule DotOp
+
+ENTRY main {
+  gather_operand = s32[3,4] constant(s32[3,4]{{1,2,3,4},{5,6,7,8},{9,10,11,12}})
+  dot_rhs_constant = s32[2,3] constant(s32[2,3]{{1,2,3},{4,5,6}})
+  indices = s32[2] parameter(0)
+  dot_lhs = s32[3,2] gather(gather_operand, indices),
+      output_window_dims={0},
+      elided_window_dims={1},
+      gather_dims_to_operand_dims={1},
+      index_vector_dim=1,
+      window_bounds={3,1}
+  ROOT dot = s32[3,3] dot(dot_lhs, dot_rhs_constant), lhs_contracting_dims={1}, rhs_contracting_dims={0}
+}
+)";
+
+  AssertArrayWithConstantsForRootExpressionIs(hlo_text, "%dot");
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/inliner_test.cc b/tensorflow/compiler/xla/service/inliner_test.cc
index d2af261008..32937b33b3 100644
--- a/tensorflow/compiler/xla/service/inliner_test.cc
+++ b/tensorflow/compiler/xla/service/inliner_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <memory>
 #include <utility>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -51,10 +51,10 @@ TEST_F(InlinerTest, MapMax) {
   auto max_f32 = max_builder.Build();
 
   auto builder = HloComputation::Builder("MapMaxFunction");
-  auto lhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({1, 2, 3, 4})));
-  auto rhs = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({4, 3, 2, 1})));
+  auto lhs = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<float>({1, 2, 3, 4})));
+  auto rhs = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<float>({4, 3, 2, 1})));
   builder.AddInstruction(
       HloInstruction::CreateMap(lhs->shape(), {lhs, rhs}, max_f32.get()));
 
@@ -70,7 +70,7 @@ TEST_F(InlinerTest, MapMax) {
 
   // Verify execution on CPU.
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
-  auto expected = Literal::CreateR1<float>({4, 3, 3, 4});
+  auto expected = LiteralUtil::CreateR1<float>({4, 3, 3, 4});
   EXPECT_TRUE(LiteralTestUtil::Equal(*result, *expected));
 }
 
@@ -83,12 +83,12 @@ TEST_F(InlinerTest, MapConstant) {
       HloInstruction::CreateParameter(0, r0f32, "x"));
   (void)param1;
   const2_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0f)));
   auto const2_f32 = const2_builder.Build();
 
   auto builder = HloComputation::Builder("MapConstFunction");
   auto lhs = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1, 2, 3, 4}, {5, 6, 7, 8}})));
+      LiteralUtil::CreateR2<float>({{1, 2, 3, 4}, {5, 6, 7, 8}})));
   builder.AddInstruction(
       HloInstruction::CreateMap(lhs->shape(), {lhs}, const2_f32.get()));
 
@@ -104,7 +104,7 @@ TEST_F(InlinerTest, MapConstant) {
 
   // Verify execution on CPU.
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
-  auto expected = Literal::CreateR2<float>({{2, 2, 2, 2}, {2, 2, 2, 2}});
+  auto expected = LiteralUtil::CreateR2<float>({{2, 2, 2, 2}, {2, 2, 2, 2}});
   EXPECT_TRUE(LiteralTestUtil::Equal(*result, *expected));
 }
 
@@ -123,10 +123,10 @@ TEST_F(InlinerTest, MapSubtractOppositeOrder) {
   auto max_f32 = max_builder.Build();
 
   auto builder = HloComputation::Builder("MapSubFunction");
-  auto lhs = builder.AddInstruction(
-    HloInstruction::CreateConstant(Literal::CreateR1<float>({1, 2, 3, 4})));
-  auto rhs = builder.AddInstruction(
-    HloInstruction::CreateConstant(Literal::CreateR1<float>({4, 3, 2, 1})));
+  auto lhs = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<float>({1, 2, 3, 4})));
+  auto rhs = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<float>({4, 3, 2, 1})));
   builder.AddInstruction(
     HloInstruction::CreateMap(lhs->shape(), {lhs, rhs}, max_f32.get()));
 
@@ -142,7 +142,7 @@ TEST_F(InlinerTest, MapSubtractOppositeOrder) {
 
   // Verify execution on CPU.
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
-  auto expected = Literal::CreateR1<float>({3, 1, -1, -3});
+  auto expected = LiteralUtil::CreateR1<float>({3, 1, -1, -3});
   EXPECT_TRUE(LiteralTestUtil::Equal(*result, *expected));
 }
 
diff --git a/tensorflow/compiler/xla/service/instruction_fusion.cc b/tensorflow/compiler/xla/service/instruction_fusion.cc
index 088cc26226..e2191aedb7 100644
--- a/tensorflow/compiler/xla/service/instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/instruction_fusion.cc
@@ -73,6 +73,7 @@ bool IsAlwaysDuplicable(const HloInstruction& instruction) {
     case HloOpcode::kGt:
     case HloOpcode::kImag:
     case HloOpcode::kInfeed:
+    case HloOpcode::kIota:
     case HloOpcode::kIsFinite:
     case HloOpcode::kLe:
     case HloOpcode::kLt:
@@ -100,6 +101,7 @@ bool IsAlwaysDuplicable(const HloInstruction& instruction) {
     case HloOpcode::kAfterAll:
     case HloOpcode::kTranspose:
     case HloOpcode::kTuple:
+    case HloOpcode::kTupleSelect:
       return false;
 
     // Cheap instructions for reals, but expensive for complex.
@@ -139,6 +141,7 @@ bool IsAlwaysDuplicable(const HloInstruction& instruction) {
     case HloOpcode::kReduceWindow:
     case HloOpcode::kRemainder:
     case HloOpcode::kRng:
+    case HloOpcode::kScatter:
     case HloOpcode::kSelectAndScatter:
     case HloOpcode::kSend:
     case HloOpcode::kSendDone:
diff --git a/tensorflow/compiler/xla/service/instruction_fusion_test.cc b/tensorflow/compiler/xla/service/instruction_fusion_test.cc
index bb7231c8c8..9e7a15f033 100644
--- a/tensorflow/compiler/xla/service/instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/instruction_fusion_test.cc
@@ -167,7 +167,7 @@ TEST_F(InstructionFusionTest, AvoidDuplicationIfNotAllFusable) {
       builder.AddInstruction(HloInstruction::CreateParameter(1, shape, "1"));
   HloInstruction* binary1 = builder.AddInstruction(
       HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   builder.AddInstruction(HloInstruction::CreateSend(binary1, token, 0));
   HloInstruction* unary = builder.AddInstruction(
       HloInstruction::CreateUnary(shape, HloOpcode::kAbs, binary1));
@@ -356,7 +356,7 @@ TEST_F(InstructionFusionTest, AllowUnaryDuplication) {
       builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "0"));
   HloInstruction* unary1 = builder.AddInstruction(
       HloInstruction::CreateUnary(shape, HloOpcode::kFloor, param0));
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   builder.AddInstruction(HloInstruction::CreateSend(unary1, token, 0));
   HloInstruction* unary2 = builder.AddInstruction(
       HloInstruction::CreateUnary(shape, HloOpcode::kAbs, unary1));
@@ -380,7 +380,7 @@ TEST_F(InstructionFusionTest, AllowEffectiveUnaryDuplication) {
       builder.AddInstruction(HloInstruction::CreateParameter(1, shape, "1"));
   HloInstruction* binary1 = builder.AddInstruction(
       HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   builder.AddInstruction(HloInstruction::CreateSend(binary1, token, 0));
   HloInstruction* unary = builder.AddInstruction(
       HloInstruction::CreateUnary(shape, HloOpcode::kAbs, binary1));
diff --git a/tensorflow/compiler/xla/service/interpreter/BUILD b/tensorflow/compiler/xla/service/interpreter/BUILD
index 524d3234eb..8652599dc6 100644
--- a/tensorflow/compiler/xla/service/interpreter/BUILD
+++ b/tensorflow/compiler/xla/service/interpreter/BUILD
@@ -74,7 +74,7 @@ cc_library(
     hdrs = ["executable.h"],
     deps = [
         ":executor",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
diff --git a/tensorflow/compiler/xla/service/interpreter/executable.cc b/tensorflow/compiler/xla/service/interpreter/executable.cc
index 9816acf650..8d40c08d55 100644
--- a/tensorflow/compiler/xla/service/interpreter/executable.cc
+++ b/tensorflow/compiler/xla/service/interpreter/executable.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
diff --git a/tensorflow/compiler/xla/service/layout_assignment.cc b/tensorflow/compiler/xla/service/layout_assignment.cc
index 36fdfa868d..c5df6bd223 100644
--- a/tensorflow/compiler/xla/service/layout_assignment.cc
+++ b/tensorflow/compiler/xla/service/layout_assignment.cc
@@ -30,10 +30,12 @@ limitations under the License.
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/computation_layout.h"
+#include "tensorflow/compiler/xla/service/hlo_casting_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_dce.h"
 #include "tensorflow/compiler/xla/service/hlo_graph_dumper.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
 #include "tensorflow/compiler/xla/service/logical_buffer.h"
 #include "tensorflow/compiler/xla/service/tuple_simplifier.h"
@@ -59,7 +61,6 @@ namespace xla {
 // anonymous namespace, instead of three or four spread all over this file.
 namespace {
 
-
 }  // namespace
 
 std::ostream& operator<<(std::ostream& out,
@@ -113,14 +114,18 @@ LayoutConstraints::LayoutConstraints(
     HloComputation* computation)
     : points_to_analysis_(points_to_analysis), computation_(computation) {
   // Gather all array-shaped logical buffers into unconstrained_buffer_ids.
-  for (LogicalBuffer::Id id = 0; id < points_to_analysis_.num_logical_buffers();
-       id++) {
-    auto& buffer = points_to_analysis_.logical_buffer(id);
-    // The points to analysis is computed per module, restrict constraints to
-    // array buffers in this computation.
-    if (buffer.IsArray() && buffer.instruction()->parent() == computation) {
-      unconstrained_buffer_ids_.insert(buffer.id());
-    }
+  for (HloInstruction* inst : computation_->instructions()) {
+    points_to_analysis_.GetPointsToSet(inst).ForEachElement(
+        [&](const ShapeIndex&, const PointsToSet::BufferList& buffers) {
+          for (const LogicalBuffer* buffer : buffers) {
+            // The points to analysis is computed per module, restrict
+            // constraints to array buffers in this computation.
+            if (buffer->IsArray() &&
+                buffer->instruction()->parent() == computation) {
+              unconstrained_buffer_ids_.insert(buffer->id());
+            }
+          }
+        });
   }
 }
 
@@ -392,6 +397,43 @@ string LayoutConstraints::ToString() const {
   return output;
 }
 
+namespace {
+
+bool IsHostSendRecv(const HloInstruction* instruction) {
+  const HloSendRecvInstruction* send_recv_instr =
+      DynCast<HloSendRecvInstruction>(instruction);
+  return send_recv_instr != nullptr && send_recv_instr->is_host_transfer();
+}
+
+}  // namespace
+
+Status LayoutAssignment::BuildHostChannelConstraints(
+    HloComputation* computation) {
+  for (auto* instruction : computation->instructions()) {
+    const HloSendRecvInstruction* send_recv_instr =
+        DynCast<HloSendRecvInstruction>(instruction);
+    if (send_recv_instr == nullptr || !send_recv_instr->is_host_transfer()) {
+      continue;
+    }
+
+    // For host transfers the Send and Recv instruction carry the layout.
+    if (instruction->opcode() == HloOpcode::kSend ||
+        instruction->opcode() == HloOpcode::kRecv) {
+      const Shape& data_shape =
+          ShapeUtil::GetTupleElementShape(send_recv_instr->shape(), 0);
+      TF_RET_CHECK(ShapeUtil::IsArray(data_shape));
+      TF_RET_CHECK(LayoutUtil::HasLayout(data_shape));
+      const Layout* prev_layout = host_channel_constraints_.ConstrainChannel(
+          send_recv_instr->channel_id(), data_shape.layout());
+      TF_RET_CHECK(prev_layout == nullptr)
+          << "Cannot constrain host transfer layout as it was set to "
+          << LayoutUtil::HumanString(*prev_layout) << ": "
+          << send_recv_instr->ToString();
+    }
+  }
+  return Status::OK();
+}
+
 Status LayoutAssignment::AddMandatoryConstraints(
     const ComputationLayout* computation_layout,
     ChannelLayoutConstraints* channel_constraints, HloComputation* computation,
@@ -399,6 +441,11 @@ Status LayoutAssignment::AddMandatoryConstraints(
   VLOG(3) << "Adding mandatory layout constraints to computation "
           << computation->name();
 
+  auto get_channel_constraints = [&](const HloInstruction* instruction) {
+    return IsHostSendRecv(instruction) ? &host_channel_constraints_
+                                       : channel_constraints;
+  };
+
   // Constrain layouts of instructions which define values with pre-existing
   // layouts.
   for (auto* instruction : computation->instructions()) {
@@ -435,18 +482,21 @@ Status LayoutAssignment::AddMandatoryConstraints(
 
     if (instruction->opcode() == HloOpcode::kSend ||
         instruction->opcode() == HloOpcode::kRecv) {
-      CHECK(channel_constraints)
+      CHECK(get_channel_constraints(instruction))
           << "Multi-module layout assignment requires ChannelLayoutConstraints";
       int64 channel_id = instruction->channel_id();
-      if (!channel_constraints->IsChannelConstrained(channel_id)) {
+      if (!get_channel_constraints(instruction)
+               ->IsChannelConstrained(channel_id)) {
         continue;
       }
       if (instruction->opcode() == HloOpcode::kSend) {
         // TODO(b/68493863): Change to use SetOperandLayout().
         const Shape send_buffer_shape = instruction->operand(0)->shape();
         TF_RET_CHECK(ShapeUtil::IsArray(send_buffer_shape));
-        Shape new_buffer_shape = channel_constraints->LayoutShapeForChannel(
-            send_buffer_shape, instruction->channel_id());
+        Shape new_buffer_shape =
+            get_channel_constraints(instruction)
+                ->LayoutShapeForChannel(send_buffer_shape,
+                                        instruction->channel_id());
         TF_RETURN_IF_ERROR(constraints->SetInstructionLayout(
             new_buffer_shape, instruction->operand(0)));
       } else {
@@ -457,8 +507,9 @@ Status LayoutAssignment::AddMandatoryConstraints(
             const LogicalBuffer* buffer,
             constraints->points_to_analysis().GetBufferDefinedAt(instruction,
                                                                  {0}));
-        Shape new_shape = channel_constraints->LayoutShapeForChannel(
-            recv_buffer_shape, instruction->channel_id());
+        Shape new_shape = get_channel_constraints(instruction)
+                              ->LayoutShapeForChannel(
+                                  recv_buffer_shape, instruction->channel_id());
         TF_RETURN_IF_ERROR(
             constraints->SetBufferLayout(new_shape.layout(), *buffer));
       }
@@ -823,8 +874,8 @@ void LayoutAssignment::SetupCopiedInstruction(const HloInstruction& instruction,
     // HostCompute module.
     // Otherwise it is preferable to leave the new instruction without device,
     // and let the automatic device placer to choose the best location.
-    if (!sharding.HasUniqueDevice() ||
-        HloSharding::IsReservedDevice(sharding.UniqueDevice().ValueOrDie())) {
+    auto device = sharding.UniqueDevice();
+    if (!device || HloSharding::IsReservedDevice(*device)) {
       copy->set_sharding(sharding);
     }
   }
@@ -1535,6 +1586,10 @@ Status LayoutAssignment::RunOnComputation(
     ChannelLayoutConstraints* channel_constraints) {
   VLOG(2) << "LayoutAssignment::RunOnComputation(" << computation->name()
           << ")";
+
+  // Must be run before clearing layouts.
+  TF_RETURN_IF_ERROR(BuildHostChannelConstraints(computation));
+
   TF_RETURN_IF_ERROR(ClearComputationLayouts(computation));
   if (computation_layout != nullptr) {
     auto it = computation_layouts_.find(computation);
@@ -1624,13 +1679,20 @@ Status LayoutAssignment::RunOnComputation(
 Status LayoutAssignment::ConstrainChannelLayouts(
     HloComputation* computation,
     ChannelLayoutConstraints* channel_constraints) {
+  auto get_channel_constraints = [&](const HloInstruction* instruction) {
+    return IsHostSendRecv(instruction) ? &host_channel_constraints_
+                                       : channel_constraints;
+  };
   // We go through the kRecvDone before. These must either impose their layout,
-  // of find a matching one already existing (ConstrainChannel() returns
+  // or find a matching one already existing (ConstrainChannel() returns
   // nullptr).
   for (HloInstruction* instruction : computation->instructions()) {
     if (instruction->opcode() == HloOpcode::kRecvDone) {
-      const Layout* layout = channel_constraints->ConstrainChannel(
-          instruction->channel_id(), instruction->shape().layout());
+      const Layout* layout =
+          get_channel_constraints(instruction)
+              ->ConstrainChannel(
+                  instruction->channel_id(),
+                  ShapeUtil::GetSubshape(instruction->shape(), {0}).layout());
       TF_RET_CHECK(layout == nullptr)
           << instruction->ToString()
           << " cannot constrain layout as it was set to "
@@ -1643,11 +1705,12 @@ Status LayoutAssignment::ConstrainChannelLayouts(
   for (HloInstruction* instruction : computation->MakeInstructionPostOrder()) {
     if (instruction->opcode() == HloOpcode::kSend) {
       HloInstruction* operand = instruction->mutable_operand(0);
-      const Layout* layout = channel_constraints->ConstrainChannel(
-          instruction->channel_id(), operand->shape().layout());
+      const Layout* layout = get_channel_constraints(instruction)
+                                 ->ConstrainChannel(instruction->channel_id(),
+                                                    operand->shape().layout());
       if (layout != nullptr) {
         // We found an already constrained layout which does not match the one
-        // the kSend wants to impose. Eitehr add a new kCopy, or use the
+        // the kSend wants to impose. Either add a new kCopy, or use the
         // existing one to marshal the correct shape.
         Shape shape = operand->shape();
         *shape.mutable_layout() = *layout;
diff --git a/tensorflow/compiler/xla/service/layout_assignment.h b/tensorflow/compiler/xla/service/layout_assignment.h
index b75ecb311a..f9e8dbea2f 100644
--- a/tensorflow/compiler/xla/service/layout_assignment.h
+++ b/tensorflow/compiler/xla/service/layout_assignment.h
@@ -488,6 +488,9 @@ class LayoutAssignment : public HloPassInterface {
     }
   }
 
+  // Adds constraints related to host Send/Recv instructions.
+  Status BuildHostChannelConstraints(HloComputation* computation);
+
   // Map containing the layouts of all computations assigned so
   // far. Computations are handled in a topological sort where computations are
   // handled before their caller instructions so the layouts of caller
@@ -507,6 +510,10 @@ class LayoutAssignment : public HloPassInterface {
   // computations/instructions.
   ChannelLayoutConstraints channel_constraints_;
 
+  // Layout constraints for send/recv instructions which communicate with the
+  // host.
+  ChannelLayoutConstraints host_channel_constraints_;
+
   // The set of HLO instructions which lacked any layout constraint, thus
   // receiving propagated default layouts.
   tensorflow::gtl::FlatSet<const HloInstruction*>
diff --git a/tensorflow/compiler/xla/service/layout_assignment_test.cc b/tensorflow/compiler/xla/service/layout_assignment_test.cc
index 4cd584bf8b..a16fa75e30 100644
--- a/tensorflow/compiler/xla/service/layout_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/layout_assignment_test.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/algebraic_simplifier.h"
 #include "tensorflow/compiler/xla/service/computation_layout.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
@@ -141,9 +141,9 @@ TEST_F(LayoutAssignmentTest, FusionInstruction) {
   std::vector<std::initializer_list<int64>> minor_to_majors = {{0, 1}, {1, 0}};
   for (auto& minor_to_major : minor_to_majors) {
     auto builder = HloComputation::Builder(TestName());
-    auto constant_literal1 = Literal::CreateR2WithLayout<float>(
+    auto constant_literal1 = LiteralUtil::CreateR2WithLayout<float>(
         {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout(minor_to_major));
-    auto constant_literal2 = Literal::CreateR2WithLayout<float>(
+    auto constant_literal2 = LiteralUtil::CreateR2WithLayout<float>(
         {{5.0, 6.0}, {7.0, 8.0}}, LayoutUtil::MakeLayout(minor_to_major));
     Shape ashape = constant_literal1->shape();
 
@@ -192,10 +192,10 @@ TEST_F(LayoutAssignmentTest, TupleLayout) {
   // match their source).
   auto builder = HloComputation::Builder(TestName());
   auto constant0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({0, 1}))));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({1, 0}))));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant0, constant1}));
@@ -229,10 +229,10 @@ TEST_F(LayoutAssignmentTest, TupleSelect) {
   // Verify layouts of a select with tuple operands is assigned properly.
   auto builder = HloComputation::Builder(TestName());
   auto constant0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({0, 1}))));
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({1, 0}))));
   auto tuple0 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant0, constant1}));
@@ -240,7 +240,7 @@ TEST_F(LayoutAssignmentTest, TupleSelect) {
       HloInstruction::CreateTuple({constant0, constant1}));
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
 
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
       tuple0->shape(), HloOpcode::kSelect, pred, tuple0, tuple1));
@@ -274,7 +274,7 @@ TEST_F(LayoutAssignmentTest, ConflictingLayoutTuple) {
   // tuple and assigning the layouts of the copied arrays as needed.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   auto inner_tuple =
       builder.AddInstruction(HloInstruction::CreateTuple({constant}));
   auto nested_tuple = builder.AddInstruction(
@@ -584,7 +584,7 @@ TEST_F(LayoutAssignmentTest, TransposeToBitcastToUser) {
   auto builder = HloComputation::Builder(TestName());
   Shape input_shape = ShapeUtil::MakeShape(F32, {3, 5, 6, 7});
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0f)));
   auto broadcast = builder.AddInstruction(
       HloInstruction::CreateBroadcast(input_shape, constant, {}));
   auto transpose = builder.AddInstruction(HloInstruction::CreateTranspose(
@@ -770,8 +770,7 @@ TEST_F(LayoutAssignmentTest, ConditionalAsymmetricLayout) {
     false_builder.AddInstruction(
         HloInstruction::CreateParameter(0, tshape, "param"));
     // Using infeed as layout assignment does not mess up with it.
-    auto token =
-        false_builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+    auto token = false_builder.AddInstruction(HloInstruction::CreateToken());
     auto infeed = false_builder.AddInstruction(
         HloInstruction::CreateInfeed(xshape, token, ""));
     auto infeed_data = false_builder.AddInstruction(
@@ -803,7 +802,7 @@ TEST_F(LayoutAssignmentTest, ConditionalAsymmetricLayout) {
 TEST_F(LayoutAssignmentTest, InternalErrorOnBitcast) {
   auto builder = HloComputation::Builder(TestName());
   auto constant0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2WithLayout<float>(
           {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({0, 1}))));
   builder.AddInstruction(HloInstruction::CreateUnary(
       constant0->shape(), HloOpcode::kBitcast, constant0));
@@ -830,12 +829,13 @@ TEST_F(LayoutAssignmentTest, ChannelLayoutMismatch) {
       param = (f32[2,2]) parameter(0)
       gte = f32[2,2] get-tuple-element(param), index=0
       token = token[] after-all()
-      recv = (f32[2,2], u32[]) recv(token), channel_id=1, sharding={maximal device=1}
-      ROOT recv-done = f32[2,2] recv-done(recv), channel_id=1,
+      recv = (f32[2,2], u32[], token[]) recv(token), channel_id=1, sharding={maximal device=1}
+      recv-done = (f32[2,2], token[]) recv-done(recv), channel_id=1,
         sharding={maximal device=1}
-      send = (f32[2,2], u32[]) send(gte, token), channel_id=1,
+      ROOT root = f32[2,2] get-tuple-element(recv-done), index=0
+      send = (f32[2,2], u32[], token[]) send(gte, token), channel_id=1,
         sharding={maximal device=0}
-      send-done = () send-done(send), channel_id=1, sharding={maximal device=0}
+      send-done = token[] send-done(send), channel_id=1, sharding={maximal device=0}
     }
   )";
 
@@ -854,7 +854,7 @@ TEST_F(LayoutAssignmentTest, ChannelLayoutMismatch) {
   AssignLayouts(module.get(), &computation_layout, &channel_constraints);
 
   EXPECT_THAT(LayoutOf(module.get(), "gte"), ElementsAre(0, 1));
-  EXPECT_THAT(LayoutOf(module.get(), "recv-done"), ElementsAre(1, 0));
+  EXPECT_THAT(LayoutOf(module.get(), "root"), ElementsAre(1, 0));
   EXPECT_TRUE(
       ShapeUtil::Equal(ShapeUtil::GetSubshape(
                            FindInstruction(module.get(), "send")->shape(), {0}),
diff --git a/tensorflow/compiler/xla/service/llvm_ir/BUILD b/tensorflow/compiler/xla/service/llvm_ir/BUILD
index f1e7fc2953..cdd3daf73b 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/BUILD
+++ b/tensorflow/compiler/xla/service/llvm_ir/BUILD
@@ -21,6 +21,11 @@ filegroup(
     ]),
 )
 
+load(
+    "//tensorflow:tensorflow.bzl",
+    "tf_cc_test",
+)
+
 cc_library(
     name = "alias_analysis",
     srcs = ["alias_analysis.cc"],
@@ -37,12 +42,25 @@ cc_library(
     ],
 )
 
+tf_cc_test(
+    name = "alias_analysis_test",
+    srcs = ["alias_analysis_test.cc"],
+    deps = [
+        ":alias_analysis",
+        "//tensorflow/compiler/xla/service:hlo_parser",
+        "//tensorflow/compiler/xla/service/cpu:custom_call_target_registry",
+        "//tensorflow/compiler/xla/service/cpu/tests:cpu_codegen_test",
+        "//tensorflow/compiler/xla/tests:filecheck",
+        "//tensorflow/core:test",
+    ],
+)
+
 cc_library(
     name = "llvm_util",
     srcs = ["llvm_util.cc"],
     hdrs = ["llvm_util.h"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
@@ -107,11 +125,30 @@ cc_library(
 )
 
 cc_library(
+    name = "kernel_tiling",
+    srcs = ["kernel_tiling.cc"],
+    hdrs = ["kernel_tiling.h"],
+    deps = [
+        ":ir_array",
+        ":llvm_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/core:lib",
+        "@llvm//:core",
+    ],
+)
+
+cc_library(
     name = "fused_ir_emitter",
     srcs = ["fused_ir_emitter.cc"],
     hdrs = ["fused_ir_emitter.h"],
     deps = [
         ":ir_array",
+        ":kernel_tiling",
         ":llvm_util",
         ":loop_emitter",
         ":tuple_ops",
@@ -127,9 +164,9 @@ cc_library(
 )
 
 cc_library(
-    name = "ops",
-    srcs = ["ops.cc"],
-    hdrs = ["ops.h"],
+    name = "dynamic_update_slice_util",
+    srcs = ["dynamic_update_slice_util.cc"],
+    hdrs = ["dynamic_update_slice_util.h"],
     deps = [
         ":fused_ir_emitter",
         ":ir_array",
@@ -144,6 +181,23 @@ cc_library(
 )
 
 cc_library(
+    name = "sort_util",
+    srcs = ["sort_util.cc"],
+    hdrs = ["sort_util.h"],
+    deps = [
+        ":ir_array",
+        ":llvm_loop",
+        ":llvm_util",
+        ":loop_emitter",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla/service/gpu:parallel_loop_emitter",
+        "//tensorflow/compiler/xla/service/gpu:partition_assignment",
+        "//tensorflow/core:lib",
+        "@llvm//:core",
+    ],
+)
+
+cc_library(
     name = "tuple_ops",
     srcs = ["tuple_ops.cc"],
     hdrs = ["tuple_ops.h"],
@@ -169,3 +223,22 @@ cc_library(
         "@llvm//:core",
     ],
 )
+
+cc_library(
+    name = "buffer_assignment_util",
+    srcs = ["buffer_assignment_util.cc"],
+    hdrs = ["buffer_assignment_util.h"],
+    deps = [
+        "//tensorflow/compiler/xla/service:buffer_assignment",
+    ],
+)
+
+cc_library(
+    name = "math_ops",
+    srcs = ["math_ops.cc"],
+    hdrs = ["math_ops.h"],
+    deps = [
+        ":llvm_util",
+        "@llvm//:core",
+    ],
+)
diff --git a/tensorflow/compiler/xla/service/llvm_ir/alias_analysis.cc b/tensorflow/compiler/xla/service/llvm_ir/alias_analysis.cc
index f200a08a3c..e5370eca56 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/alias_analysis.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/alias_analysis.cc
@@ -28,16 +28,16 @@ namespace llvm_ir {
 // Sentry allocation used to represent parameters of the entry computation in
 // alias_scope_metadata_ and noalias_metadata_.
 static const BufferAllocation* kParameterAllocation = new BufferAllocation(
-    /*index=*/-1, /*size=*/0, /*is_thread_local=*/false, /*is_reusable=*/false,
-    LogicalBuffer::Color(0));
+    /*index=*/-1, /*size=*/0, LogicalBuffer::Color(0));
 
 void AliasAnalysis::AddAliasingInformationToIrArray(const HloInstruction& hlo,
                                                     llvm_ir::IrArray* array,
                                                     const ShapeIndex& index) {
   BufferAllocation::Slice buffer_slice;
-  if (hlo.opcode() == HloOpcode::kParameter) {
-    // Parameters may alias with each other but may not alias with our temporary
-    // buffers.
+  if (hlo.opcode() == HloOpcode::kParameter &&
+      hlo.parent() == hlo.parent()->parent()->entry_computation()) {
+    // Entry computation parameters may alias with each other but may not alias
+    // with our temporary buffers.
     buffer_slice = BufferAllocation::Slice(kParameterAllocation, 0, 0);
   } else {
     const std::set<BufferAllocation::Slice> slices =
diff --git a/tensorflow/compiler/xla/service/llvm_ir/alias_analysis_test.cc b/tensorflow/compiler/xla/service/llvm_ir/alias_analysis_test.cc
new file mode 100644
index 0000000000..fe5ec1cc66
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/alias_analysis_test.cc
@@ -0,0 +1,83 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <memory>
+#include <utility>
+
+#include "tensorflow/compiler/xla/service/cpu/custom_call_target_registry.h"
+#include "tensorflow/compiler/xla/service/cpu/tests/cpu_codegen_test.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/alias_analysis.h"
+#include "tensorflow/compiler/xla/tests/filecheck.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace xla {
+namespace cpu {
+namespace {
+class AliasAnalysisTest : public CpuCodegenTest {};
+
+void FakeCustomCallTarget(float* out, float** in) {}
+
+REGISTER_CUSTOM_CALL_TARGET(FakeCustomCallTarget);
+
+TEST_F(AliasAnalysisTest, EmbeddedComputationParamsMayAliasTemps) {
+  const char* hlo_string = R"(
+HloModule while
+
+body {
+  const.0.125 = f32[] constant(0.125)
+  body.state = f32[] parameter(0)
+  ROOT add.2.2 = f32[] add(const.0.125, body.state)
+}
+
+condition {
+  const.100 = f32[] constant(100)
+  condition.state = f32[] parameter(0)
+  addend = f32[] custom-call(condition.state), custom_call_target="FakeCustomCallTarget"
+  add = f32[] add(addend, condition.state)
+  ROOT greater-than = pred[] greater-than(const.100, add)
+}
+
+ENTRY while3 {
+  const.0 = f32[] constant(0)
+  ROOT while = f32[] while(const.0), condition=condition, body=body
+}
+)";
+
+  CompileAndVerifyIr(hlo_string, R"(
+; CHECK-LABEL: @body(i8* %retval
+; CHECK: %[[add_result:.*]] = fadd fast float %[[fadd_lhs:.*]], %[[fadd_rhs:.*]]
+; CHECK: store float %[[add_result]], float* %[[store_dest:.*]], !alias.scope ![[alias_scope_md_for_store:[0-9]+]]
+;
+; CHECK-LABEL: @condition(i8* %retval, i8* noalias %run_options, i8** noalias %params
+; CHECK: %[[cond_state_buf_ptr:.*]] = getelementptr inbounds i8*, i8** %temps, i64 0
+; CHECK: %[[cond_state_buf_untyped:.*]] = load i8*, i8** %[[cond_state_buf_ptr]]
+; CHECK: %[[cond_state_buf_typed:.*]] = bitcast i8* %[[cond_state_buf_untyped]] to float*
+; CHECK: load float, float* %[[cond_state_buf_typed]], !alias.scope ![[alias_scope_md_for_store]], !noalias ![[noalias_md_for_load:.*]]
+;
+; CHECK-LABEL: @while3(
+
+![[alias_scope_md_for_store]] = !{![[buffer_idx_0:.*]]}
+![[buffer_idx_0]] = !{!"buffer: {index:0, offset:0, size:4}", ![[aa_md_root:.*]]}
+![[aa_md_root]] = !{!"XLA global AA domain"}
+![[buffer_idx_1:.*]] = !{!"buffer: {index:1, offset:0, size:4}", !3}
+![[buffer_idx_1_offset_16:.*]] = !{!"buffer: {index:1, offset:16, size:1}", !3}
+![[noalias_md_for_load]] = !{![[buffer_idx_1_offset_16]], ![[buffer_idx_1]]}
+}
+)");
+}
+
+}  // namespace
+}  // namespace cpu
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.cc b/tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.cc
new file mode 100644
index 0000000000..4eb5d9fb47
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.cc
@@ -0,0 +1,59 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h"
+
+namespace xla {
+namespace llvm_ir {
+static const HloInstruction& InstrForConstantBufferAllocation(
+    const BufferAllocation& allocation) {
+  CHECK(allocation.is_constant());
+  HloInstruction* const_instr = nullptr;
+  for (const auto& buffer_offset_pair : allocation.assigned_buffers()) {
+    const LogicalBuffer* buffer = buffer_offset_pair.first;
+    // BufferAssignment may have assigned non-constant instructions to this
+    // allocation too so we can't CHECK this condition.  E.g. for
+    //
+    //   while(init = constant, body = identity, cond = ...)
+    //
+    // the LogicalBuffer for the kWhile instruction will have the same
+    // BufferAllocation as the LogicalBuffer for the (init) constant.
+    if (buffer->instruction()->opcode() == HloOpcode::kConstant) {
+      CHECK_EQ(const_instr, nullptr)
+          << const_instr->ToString() << " " << buffer->ToString();
+      const_instr = buffer->instruction();
+    }
+  }
+  CHECK_NE(const_instr, nullptr);
+  return *const_instr;
+}
+
+string ConstantBufferAllocationToGlobalName(
+    const BufferAllocation& allocation) {
+  string instr_name = InstrForConstantBufferAllocation(allocation).name();
+  for (char& c : instr_name) {
+    if (c == '.') {
+      c = '_';
+    }
+  }
+  return tensorflow::strings::StrCat("buffer_for_", instr_name);
+}
+
+const Literal& LiteralForConstantAllocation(
+    const BufferAllocation& allocation) {
+  return InstrForConstantBufferAllocation(allocation).literal();
+}
+}  // namespace llvm_ir
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h b/tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h
new file mode 100644
index 0000000000..bfb6eecb87
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/buffer_assignment_util.h
@@ -0,0 +1,34 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_BUFFER_ASSIGNMENT_UTIL_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_BUFFER_ASSIGNMENT_UTIL_H_
+
+#include "tensorflow/compiler/xla/service/buffer_assignment.h"
+
+namespace xla {
+namespace llvm_ir {
+// In XLA:GPU we map constant buffer allocations to globals in the generated
+// LLVM IR.  This function gives us the name of the global variable a constant
+// buffer is mapped to.  Not used on XLA:CPU.
+string ConstantBufferAllocationToGlobalName(const BufferAllocation& allocation);
+
+// Returns the Literal corresponding to `allocation`, which must be a constant
+// allocation.
+const Literal& LiteralForConstantAllocation(const BufferAllocation& allocation);
+}  // namespace llvm_ir
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_BUFFER_ASSIGNMENT_UTIL_H_
diff --git a/tensorflow/compiler/xla/service/llvm_ir/ops.cc b/tensorflow/compiler/xla/service/llvm_ir/dynamic_update_slice_util.cc
index 3b298f4746..27fbb11e2e 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/ops.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/dynamic_update_slice_util.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/service/llvm_ir/ops.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/dynamic_update_slice_util.h"
 #include "tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h"
 #include "tensorflow/compiler/xla/service/gpu/partition_assignment.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h"
@@ -38,16 +38,16 @@ bool CanUpdateDynamicSliceInPlace(HloInstruction* dynamic_update_slice,
 // Emits a sequential loop if launch_dimensions is null.
 static Status EmitDynamicUpdateSliceInPlaceImpl(
     const Shape& update_shape, const ElementGenerator& start_indices_generator,
-    ElementGenerator update_array_generator, const IrArray& output_array,
-    const gpu::LaunchDimensions* launch_dimensions,
-    tensorflow::StringPiece name, llvm::IRBuilder<>* ir_builder) {
+    bool is_signed, ElementGenerator update_array_generator,
+    const IrArray& output_array, const gpu::LaunchDimensions* launch_dimensions,
+    tensorflow::StringPiece name, llvm::IRBuilder<>* b) {
   const Shape& output_shape = output_array.GetShape();
 
   // Read start indices from start_indices_generator.
   const int64 rank = ShapeUtil::Rank(output_shape);
-  IrArray::Index start_index(ir_builder->getInt64Ty(), rank);
+  IrArray::Index start_index(b->getInt64Ty(), rank);
   for (int64 i = 0; i < rank; ++i) {
-    IrArray::Index dim_index({ir_builder->getInt64(i)});
+    IrArray::Index dim_index({b->getInt64(i)});
     TF_ASSIGN_OR_RETURN(start_index[i], start_indices_generator(dim_index));
     llvm::Value* output_dim_size = llvm::ConstantInt::get(
         start_index[i]->getType(), output_shape.dimensions(i));
@@ -56,21 +56,19 @@ static Status EmitDynamicUpdateSliceInPlaceImpl(
 
     // Clamp the start index so that the update region fits in the operand.
     // start_index = clamp(start_index, 0, output_dim_size - update_dim_size)
-
-    // TODO(b/74360564): This is implementation defined behavior, but is
-    // currently respected by all implementations. Change this if we ever decide
-    // to oficially document different behavior.
-    llvm::Value* max_bound =
-        ir_builder->CreateSub(output_dim_size, update_dim_size);
+    llvm::Value* max_bound = b->CreateSub(output_dim_size, update_dim_size);
     llvm::Value* zero = llvm::ConstantInt::get(start_index[i]->getType(), 0);
-    start_index[i] = ir_builder->CreateSelect(
-        ir_builder->CreateICmp(llvm::ICmpInst::ICMP_SGE, zero, start_index[i]),
-        zero, start_index[i]);
-
-    start_index[i] = ir_builder->CreateSelect(
-        ir_builder->CreateICmp(llvm::ICmpInst::ICMP_SLE, max_bound,
-                               start_index[i]),
-        max_bound, start_index[i]);
+    start_index[i] =
+        b->CreateSelect(b->CreateICmp(is_signed ? llvm::ICmpInst::ICMP_SGE
+                                                : llvm::ICmpInst::ICMP_UGE,
+                                      zero, start_index[i]),
+                        zero, start_index[i]);
+
+    start_index[i] =
+        b->CreateSelect(b->CreateICmp(is_signed ? llvm::ICmpInst::ICMP_SLE
+                                                : llvm::ICmpInst::ICMP_ULE,
+                                      max_bound, start_index[i]),
+                        max_bound, start_index[i]);
   }
 
   auto loop_body_emitter = [&](const IrArray::Index& update_index) -> Status {
@@ -81,31 +79,30 @@ static Status EmitDynamicUpdateSliceInPlaceImpl(
     //
     IrArray::Index output_index(start_index.GetType(), rank);
     for (int64 i = 0; i < rank; ++i) {
-      llvm::Value* start_index0 = ir_builder->CreateSExtOrBitCast(
-          start_index[i], update_index[i]->getType());
-      output_index[i] = ir_builder->CreateAdd(start_index0, update_index[i]);
+      llvm::Value* start_index0 =
+          b->CreateSExtOrBitCast(start_index[i], update_index[i]->getType());
+      output_index[i] = b->CreateAdd(start_index0, update_index[i]);
     }
 
     // Do output[output_index] = update[update_index].
     TF_ASSIGN_OR_RETURN(llvm::Value * update_data,
                         update_array_generator(update_index));
-    output_array.EmitWriteArrayElement(output_index, update_data, ir_builder);
+    output_array.EmitWriteArrayElement(output_index, update_data, b);
     return Status::OK();
   };
 
   if (launch_dimensions != nullptr) {
     return gpu::ParallelLoopEmitter(loop_body_emitter, update_shape,
-                                    *launch_dimensions, ir_builder)
+                                    *launch_dimensions, b)
         .EmitLoop(name);
   }
-  return LoopEmitter(loop_body_emitter, update_shape, ir_builder)
-      .EmitLoop(name);
+  return LoopEmitter(loop_body_emitter, update_shape, b).EmitLoop(name);
 }
 
 Status EmitDynamicUpdateSliceInPlace(
     tensorflow::gtl::ArraySlice<IrArray> operand_arrays,
     const IrArray& output_array, tensorflow::StringPiece name,
-    llvm::IRBuilder<>* ir_builder) {
+    llvm::IRBuilder<>* b) {
   VLOG(2) << "EmitDynamicUpdateSliceInPlace for " << name;
 
   // No need to use operand_arrays[0], the input array of the
@@ -116,15 +113,16 @@ Status EmitDynamicUpdateSliceInPlace(
   Shape update_shape = update_array.GetShape();
 
   ElementGenerator start_indices_generator = [&](const IrArray::Index& index) {
-    return start_indices_array.EmitReadArrayElement(index, ir_builder);
+    return start_indices_array.EmitReadArrayElement(index, b);
   };
   ElementGenerator update_array_generator = [&](const IrArray::Index& index) {
-    return update_array.EmitReadArrayElement(index, ir_builder);
+    return update_array.EmitReadArrayElement(index, b);
   };
 
+  bool is_signed = ShapeUtil::ElementIsSigned(start_indices_array.GetShape());
   return EmitDynamicUpdateSliceInPlaceImpl(
-      update_shape, start_indices_generator, update_array_generator,
-      output_array, /*launch_dimensions=*/nullptr, name, ir_builder);
+      update_shape, start_indices_generator, is_signed, update_array_generator,
+      output_array, /*launch_dimensions=*/nullptr, name, b);
 }
 
 // Shared implementation for EmitFusedDynamicUpdateSliceInPlace and
@@ -135,8 +133,7 @@ static Status EmitFusedDynamicUpdateSliceInPlaceImpl(
     HloInstruction* fusion,
     tensorflow::gtl::ArraySlice<IrArray> fusion_operand_arrays,
     const IrArray& fusion_output_array, ElementalIrEmitter* elemental_emitter,
-    const gpu::LaunchDimensions* launch_dimensions,
-    llvm::IRBuilder<>* ir_builder) {
+    const gpu::LaunchDimensions* launch_dimensions, llvm::IRBuilder<>* b) {
   CHECK_EQ(fusion->opcode(), HloOpcode::kFusion);
   VLOG(2) << "EmitFusedDynamicUpdateSliceInPlace for "
           << fusion->ToShortString();
@@ -170,30 +167,30 @@ static Status EmitFusedDynamicUpdateSliceInPlaceImpl(
   ElementGenerator start_indices_generator =
       fused_emitter.GetGenerator(start_indices);
 
+  bool is_signed = ShapeUtil::ElementIsSigned(start_indices->shape());
   return EmitDynamicUpdateSliceInPlaceImpl(
-      update_shape, start_indices_generator, update_array_generator,
-      fusion_output_array, launch_dimensions, IrName(fusion), ir_builder);
+      update_shape, start_indices_generator, is_signed, update_array_generator,
+      fusion_output_array, launch_dimensions, IrName(fusion), b);
 }
 
 Status EmitFusedDynamicUpdateSliceInPlace(
     HloInstruction* fusion,
     tensorflow::gtl::ArraySlice<IrArray> fusion_operand_arrays,
     const IrArray& fusion_output_array, ElementalIrEmitter* elemental_emitter,
-    llvm::IRBuilder<>* ir_builder) {
+    llvm::IRBuilder<>* b) {
   return EmitFusedDynamicUpdateSliceInPlaceImpl(
       fusion, fusion_operand_arrays, fusion_output_array, elemental_emitter,
-      /*launch_dimensions=*/nullptr, ir_builder);
+      /*launch_dimensions=*/nullptr, b);
 }
 
 Status EmitParallelFusedDynamicUpdateSliceInPlace(
     HloInstruction* fusion,
     tensorflow::gtl::ArraySlice<IrArray> fusion_operand_arrays,
     const IrArray& fusion_output_array, ElementalIrEmitter* elemental_emitter,
-    const gpu::LaunchDimensions& launch_dimensions,
-    llvm::IRBuilder<>* ir_builder) {
+    const gpu::LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* b) {
   return EmitFusedDynamicUpdateSliceInPlaceImpl(
       fusion, fusion_operand_arrays, fusion_output_array, elemental_emitter,
-      &launch_dimensions, ir_builder);
+      &launch_dimensions, b);
 }
 
 }  // namespace llvm_ir
diff --git a/tensorflow/compiler/xla/service/llvm_ir/ops.h b/tensorflow/compiler/xla/service/llvm_ir/dynamic_update_slice_util.h
index 175b081e84..3502577d23 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/ops.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/dynamic_update_slice_util.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_OPS_H_
-#define TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_OPS_H_
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_DYNAMIC_UPDATE_SLICE_UTIL_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_DYNAMIC_UPDATE_SLICE_UTIL_H_
 
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/elemental_ir_emitter.h"
@@ -66,7 +66,7 @@ inline bool CanEmitFusedDynamicUpdateSliceInPlace(
 Status EmitDynamicUpdateSliceInPlace(
     tensorflow::gtl::ArraySlice<IrArray> operand_arrays,
     const IrArray& output_array, tensorflow::StringPiece name,
-    llvm::IRBuilder<>* ir_builder);
+    llvm::IRBuilder<>* b);
 
 // Given a loop-fusion node whose root is a dynamic-update-slice op whose
 // array-to-be-updated and output share the same buffer slice, emits
@@ -76,7 +76,7 @@ Status EmitFusedDynamicUpdateSliceInPlace(
     HloInstruction* fusion,
     tensorflow::gtl::ArraySlice<IrArray> fusion_operand_arrays,
     const IrArray& fusion_output_array, ElementalIrEmitter* elemental_emitter,
-    llvm::IRBuilder<>* ir_builder);
+    llvm::IRBuilder<>* b);
 
 // Same as EmitFusedDynamicUpdateSliceInPlace, except emits a parallel loop with
 // the given launch dimensions.
@@ -84,10 +84,9 @@ Status EmitParallelFusedDynamicUpdateSliceInPlace(
     HloInstruction* fusion,
     tensorflow::gtl::ArraySlice<IrArray> fusion_operand_arrays,
     const IrArray& fusion_output_array, ElementalIrEmitter* elemental_emitter,
-    const gpu::LaunchDimensions& launch_dimensions,
-    llvm::IRBuilder<>* ir_builder);
+    const gpu::LaunchDimensions& launch_dimensions, llvm::IRBuilder<>* b);
 
 }  // namespace llvm_ir
 }  // namespace xla
 
-#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_OPS_H_
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_DYNAMIC_UPDATE_SLICE_UTIL_H_
diff --git a/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc b/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc
index d909845a3a..72ede377e1 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.cc
@@ -52,7 +52,7 @@ Status FusedIrEmitter::DefaultAction(HloInstruction* hlo) {
       // that would be regenerated without caching. But this might increase the
       // JIT compilation time.
       if (generated_value_bb == nullptr ||
-          generated_value_bb == ir_builder_->GetInsertBlock()) {
+          generated_value_bb == b_->GetInsertBlock()) {
         VLOG(3) << "The cached generated value is reused.";
         return generated_value;
       }
@@ -60,8 +60,7 @@ Status FusedIrEmitter::DefaultAction(HloInstruction* hlo) {
                  "a different BB ("
               << llvm_ir::AsString(generated_value_bb->getName())
               << ") from the current insertion block ("
-              << llvm_ir::AsString(ir_builder_->GetInsertBlock()->getName())
-              << ").";
+              << llvm_ir::AsString(b_->GetInsertBlock()->getName()) << ").";
     }
 
     TF_ASSIGN_OR_RETURN(
@@ -77,14 +76,14 @@ Status FusedIrEmitter::HandleConstant(HloInstruction* constant) {
   llvm::Constant* initializer =
       llvm_ir::ConvertLiteralToIrConstant(literal, module_);
   llvm::GlobalVariable* global = new llvm::GlobalVariable(
-      *ir_builder_->GetInsertBlock()->getModule(), initializer->getType(),
+      *b_->GetInsertBlock()->getModule(), initializer->getType(),
       /*isConstant=*/true, llvm::GlobalValue::ExternalLinkage, initializer,
       /*Name=*/"");
   llvm::Constant* shape_constant = llvm::ConstantExpr::getBitCast(
       global, llvm_ir::ShapeToIrType(literal.shape(), module_)->getPointerTo());
   generators_[constant] = [=](const IrArray::Index& index) {
     return IrArray(shape_constant, constant->shape())
-        .EmitReadArrayElement(index, ir_builder_);
+        .EmitReadArrayElement(index, b_);
   };
 
   return Status::OK();
@@ -104,7 +103,7 @@ Status FusedIrEmitter::HandleGetTupleElement(
   // Emit code to lookup tuple element pointer, and store it in 'gte_values_'.
   llvm::Value* tuple_element_ptr = llvm_ir::EmitGetTupleElement(
       get_tuple_element->shape(), get_tuple_element->tuple_index(),
-      /*alignment=*/1, it->second, ir_builder_, module_);
+      /*alignment=*/1, it->second, b_, module_);
   gte_values_.insert(std::make_pair(get_tuple_element, tuple_element_ptr));
   // Emit code to read base tuple element array (if non-tuple shaped).
   if (!ShapeUtil::IsTuple(get_tuple_element->shape())) {
@@ -112,16 +111,32 @@ Status FusedIrEmitter::HandleGetTupleElement(
         [=](const IrArray::Index& index) -> StatusOr<llvm::Value*> {
       // TODO(b/34080002) Add aliasing information to tuple element IrArray.
       return IrArray(tuple_element_ptr, get_tuple_element->shape())
-          .EmitReadArrayElement(index, ir_builder_);
+          .EmitReadArrayElement(index, b_);
     };
   }
   return Status::OK();
 }
 
 Status FusedIrEmitter::HandleParameter(HloInstruction* parameter) {
-  generators_[parameter] = [=](const IrArray::Index& index) {
+  generators_[parameter] = [=](const IrArray::Index& index) -> llvm::Value* {
+    if (tiled_parameter_info_) {
+      if (llvm::Value* param_tile_buffer =
+              tiled_parameter_info_->GetBufferForParameter(
+                  parameter->parameter_number())) {
+        // TODO(jlebar): Add AA metadata to this load.  Tile buffers are global
+        // variables, so LLVM's points-to analysis doesn't help us much.  And we
+        // want the AA info to be present before address spaces are inferred
+        // (which is pretty late in the pipeline), so even if we had
+        // address-space-based AA in LLVM, it wouldn't help us much here.
+        return b_->CreateLoad(
+            b_->CreateGEP(param_tile_buffer, {index.GetConstantWithIndexType(0),
+                                              tiled_parameter_info_->x(),
+                                              tiled_parameter_info_->y()}),
+            "tiled_buffer");
+      }
+    }
     return parameter_arrays_[parameter->parameter_number()]
-        .EmitReadArrayElement(index, ir_builder_);
+        .EmitReadArrayElement(index, b_);
   };
   // Store ir value for fusion operand associated with fusion parameter to be
   // accessed by subsequent fused GetTupleElement instructions.
@@ -140,11 +155,11 @@ Status FusedIrEmitter::HandleTuple(HloInstruction* tuple) {
   }
   generators_[tuple] =
       [=](const IrArray::Index& index) -> StatusOr<llvm::Value*> {
-    llvm::Value* ret = llvm::UndefValue::get(llvm::StructType::get(
-        ir_builder_->getContext(), operand_elemental_ir_types));
+    llvm::Value* ret = llvm::UndefValue::get(
+        llvm::StructType::get(b_->getContext(), operand_elemental_ir_types));
     for (size_t i = 0; i < ShapeUtil::TupleElementCount(tuple->shape()); ++i) {
       TF_ASSIGN_OR_RETURN(llvm::Value * val_i, generators_[operands[i]](index));
-      ret = ir_builder_->CreateInsertValue(ret, val_i, i);
+      ret = b_->CreateInsertValue(ret, val_i, i);
     }
     return ret;
   };
diff --git a/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h b/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h
index b3b6026ef1..30471480c4 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/elemental_ir_emitter.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/ir_array.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/loop_emitter.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -56,8 +57,9 @@ class FusedIrEmitter : public DfsHloVisitorWithDefault {
   FusedIrEmitter(tensorflow::gtl::ArraySlice<llvm_ir::IrArray> parameter_arrays,
                  ElementalIrEmitter* elemental_emitter)
       : parameter_arrays_(parameter_arrays),
+        tiled_parameter_info_(nullptr),
         elemental_emitter_(elemental_emitter),
-        ir_builder_(elemental_emitter->ir_builder()),
+        b_(elemental_emitter->b()),
         module_(elemental_emitter->module()) {}
 
   Status DefaultAction(HloInstruction* hlo) override;
@@ -86,9 +88,14 @@ class FusedIrEmitter : public DfsHloVisitorWithDefault {
     return it->second;
   }
 
+  void SetTiledParameterInfo(const llvm_ir::TiledParameterInfo* info) {
+    tiled_parameter_info_ = info;
+  }
+
  private:
   // Arrays of parameters of fusion instruction
   tensorflow::gtl::ArraySlice<llvm_ir::IrArray> parameter_arrays_;
+  const llvm_ir::TiledParameterInfo* tiled_parameter_info_;
 
   ElementalIrEmitter* elemental_emitter_;
 
@@ -96,7 +103,7 @@ class FusedIrEmitter : public DfsHloVisitorWithDefault {
   const HloInstruction* fused_root_ = nullptr;
 
   // Borrowed
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   llvm::Module* module_;
 
   // Map from instruction pointers to functions to generate elements of their
diff --git a/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc b/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
index ea10cef49a..2b6caee6aa 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
@@ -31,7 +31,7 @@ namespace llvm_ir {
 
 void IrArray::Index::Delinearize(std::vector<llvm::Value*>* multidim,
                                  llvm::Value* linear, const Shape& shape,
-                                 llvm::IRBuilder<>* ir_builder) const {
+                                 llvm::IRBuilder<>* b) const {
   int64 divisor = 1;
   const Layout& layout = shape.layout();
   for (int64 i = 0; i < layout.minor_to_major_size(); ++i) {
@@ -48,10 +48,9 @@ void IrArray::Index::Delinearize(std::vector<llvm::Value*>* multidim,
     // useful because cuda-memcheck can't help us much in XLA: Most of our
     // memory lives in one big allocation, so cuda-memcheck can't detect
     // out-of-bounds accesses.
-    auto* quot =
-        ir_builder->CreateUDiv(linear, GetConstantWithIndexType(divisor));
+    auto* quot = b->CreateUDiv(linear, GetConstantWithIndexType(divisor));
     if (i < layout.minor_to_major_size() - 1) {
-      (*multidim)[dimension] = ir_builder->CreateURem(
+      (*multidim)[dimension] = b->CreateURem(
           quot, GetConstantWithIndexType(size_of_current_dimension));
     } else {
       (*multidim)[dimension] = quot;
@@ -61,7 +60,7 @@ void IrArray::Index::Delinearize(std::vector<llvm::Value*>* multidim,
 }
 
 IrArray::Index::Index(llvm::Value* linear, const Shape& shape,
-                      llvm::IRBuilder<>* ir_builder)
+                      llvm::IRBuilder<>* b)
     : multidim_(ShapeUtil::Rank(shape)),
       linear_(linear),
       layout_(shape.layout()),
@@ -71,7 +70,7 @@ IrArray::Index::Index(llvm::Value* linear, const Shape& shape,
   CHECK(LayoutUtil::HasLayout(shape))
       << "Shape " << ShapeUtil::HumanStringWithLayout(shape)
       << " should have a layout.";
-  Delinearize(&multidim_, linear, shape, ir_builder);
+  Delinearize(&multidim_, linear, shape, b);
 }
 
 IrArray::Index::Index(tensorflow::gtl::ArraySlice<llvm::Value*> multidim,
@@ -94,7 +93,7 @@ IrArray::Index::Index(tensorflow::gtl::ArraySlice<llvm::Value*> multidim,
 }
 
 IrArray::Index::Index(tensorflow::gtl::ArraySlice<llvm::Value*> multidim,
-                      const Shape& shape, llvm::IRBuilder<>* ir_builder)
+                      const Shape& shape, llvm::IRBuilder<>* b)
     : multidim_(multidim.begin(), multidim.end()),
       layout_(shape.layout()),
       dims_(shape.dimensions().begin(), shape.dimensions().end()) {
@@ -328,6 +327,7 @@ llvm::Value* IrArray::Index::Linearize(
     llvm::IRBuilder<>* builder) const {
   // Each dimension is multiplied by the product of the sizes of all
   // earlier dimensions and added to the accumulator logical_linear_index.
+  CHECK_EQ(size(), dimensions.size());
   llvm::Value* logical_linear_index = GetConstantWithIndexType(0);
   int64 multiplier = 1;
   for (ssize_t i = size() - 1; i >= 0; --i) {
@@ -343,7 +343,7 @@ llvm::Value* IrArray::Index::Linearize(
 }
 
 llvm::Value* IrArray::EmitArrayElementAddress(
-    const IrArray::Index& index, llvm::IRBuilder<>* ir_builder,
+    const IrArray::Index& index, llvm::IRBuilder<>* b,
     tensorflow::StringPiece name) const {
   if (ShapeUtil::IsScalar(*shape_)) {
     // Special handling of scalars: a scalar pretends to have the same value for
@@ -354,12 +354,11 @@ llvm::Value* IrArray::EmitArrayElementAddress(
   CHECK_EQ(index.size(), ShapeUtil::Rank(*shape_));
 
   if (index.LinearValidOnShape(*shape_)) {
-    llvm::Module* module =
-        ir_builder->GetInsertBlock()->getParent()->getParent();
-    return ir_builder->CreateInBoundsGEP(
-        ir_builder->CreateBitCast(
-            base_ptr_, PrimitiveTypeToIrType(shape_->element_type(), module)
-                           ->getPointerTo()),
+    llvm::Module* module = b->GetInsertBlock()->getParent()->getParent();
+    return b->CreateInBoundsGEP(
+        b->CreateBitCast(base_ptr_,
+                         PrimitiveTypeToIrType(shape_->element_type(), module)
+                             ->getPointerTo()),
         {index.linear()}, llvm_ir::AsStringRef(name));
   }
 
@@ -385,8 +384,8 @@ llvm::Value* IrArray::EmitArrayElementAddress(
     int64 dimension = LayoutUtil::Major(shape_->layout(), i);
     gep_indices.push_back(actual_index[dimension]);
   }
-  return ir_builder->CreateInBoundsGEP(base_ptr_, gep_indices,
-                                       llvm_ir::AsStringRef(name));
+  return b->CreateInBoundsGEP(base_ptr_, gep_indices,
+                              llvm_ir::AsStringRef(name));
 }
 
 void IrArray::AnnotateLoadStoreInstructionWithMetadata(
@@ -402,37 +401,37 @@ void IrArray::AnnotateLoadStoreInstructionWithMetadata(
 }
 
 llvm::Value* IrArray::EmitReadArrayElement(const Index& index,
-                                           llvm::IRBuilder<>* ir_builder,
+                                           llvm::IRBuilder<>* b,
                                            tensorflow::StringPiece name) const {
-  llvm::Value* element_address =
-      EmitArrayElementAddress(index, ir_builder, name);
-  llvm::LoadInst* load = ir_builder->CreateLoad(element_address);
+  llvm::Value* element_address = EmitArrayElementAddress(index, b, name);
+  llvm::LoadInst* load = b->CreateLoad(element_address);
   AnnotateLoadStoreInstructionWithMetadata(load);
   return load;
 }
 
 void IrArray::EmitWriteArrayElement(const Index& index, llvm::Value* value,
-                                    llvm::IRBuilder<>* ir_builder) const {
-  llvm::Value* element_address = EmitArrayElementAddress(index, ir_builder);
-  llvm::StoreInst* store = ir_builder->CreateStore(value, element_address);
+                                    llvm::IRBuilder<>* b) const {
+  llvm::Value* element_address = EmitArrayElementAddress(index, b);
+  llvm::StoreInst* store = b->CreateStore(value, element_address);
   AnnotateLoadStoreInstructionWithMetadata(store);
 }
 
 IrArray IrArray::CastToShape(const Shape& new_shape,
-                             llvm::IRBuilder<>* ir_builder) const {
-  llvm::Module* module = ir_builder->GetInsertBlock()->getParent()->getParent();
+                             llvm::IRBuilder<>* b) const {
+  llvm::Module* module = b->GetInsertBlock()->getParent()->getParent();
   llvm::Type* new_ir_type = llvm_ir::ShapeToIrType(new_shape, module);
-  return IrArray(
-      ir_builder->CreatePointerCast(base_ptr_, new_ir_type->getPointerTo()),
-      new_shape);
+  IrArray new_irarray(
+      b->CreatePointerCast(base_ptr_, new_ir_type->getPointerTo()), new_shape);
+  new_irarray.metadata_ = metadata_;
+  return new_irarray;
 }
 
 /* static */ IrArray::Index IrArray::BumpIndex(const Index& index,
                                                int64 which_dimension,
                                                int64 addend,
-                                               llvm::IRBuilder<>* ir_builder) {
+                                               llvm::IRBuilder<>* b) {
   Index new_index = index;
-  new_index[which_dimension] = ir_builder->CreateAdd(
+  new_index[which_dimension] = b->CreateAdd(
       index[which_dimension],
       llvm::ConstantInt::get(index[which_dimension]->getType(), addend), "",
       /*HasNUW=*/true,
diff --git a/tensorflow/compiler/xla/service/llvm_ir/ir_array.h b/tensorflow/compiler/xla/service/llvm_ir/ir_array.h
index 4648c6d7ac..28ca793e3e 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/ir_array.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/ir_array.h
@@ -87,20 +87,19 @@ class IrArray {
     }
 
     // Constructs an index from linear index "linear" and computes the
-    // multi-dimensional index from "linear" and "shape". "ir_builder" is the IR
+    // multi-dimensional index from "linear" and "shape". "b" is the IR
     // builder to emit the index of each dimension in the multi-dimensional
     // index.
     //
     // Precondition: "shape" has a layout.
-    Index(llvm::Value* linear, const Shape& shape,
-          llvm::IRBuilder<>* ir_builder);
+    Index(llvm::Value* linear, const Shape& shape, llvm::IRBuilder<>* b);
 
     // Constructs an index from the given multi-dimensional index and the shape
     // that it indexes into.
     //
     // Precondition: "shape" has a layout.
     Index(tensorflow::gtl::ArraySlice<llvm::Value*> multidim,
-          const Shape& shape, llvm::IRBuilder<>* ir_builder);
+          const Shape& shape, llvm::IRBuilder<>* b);
 
     // Constructs an index from both a multi-dimensional index and a linear
     // index. "shape" has the same meaning as that in the constructor that takes
@@ -114,19 +113,19 @@ class IrArray {
     size_t size() const { return multidim().size(); }
 
     llvm::Value* operator[](size_t i) const { return multidim()[i]; }
-    llvm::Value*& operator[](size_t i) { return multidim()[i]; }
+    llvm::Value*& operator[](size_t i) { return mutable_multidim()[i]; }
 
-    void push_back(llvm::Value* value) { multidim().push_back(value); }
+    void push_back(llvm::Value* value) { mutable_multidim().push_back(value); }
     void InsertAt(int64 index, llvm::Value* value) {
       CHECK_LE(index, size());
-      multidim().insert(multidim().begin() + index, value);
+      mutable_multidim().insert(mutable_multidim().begin() + index, value);
     }
 
     using iterator = std::vector<llvm::Value*>::iterator;
     using const_iterator = std::vector<llvm::Value*>::const_iterator;
 
-    iterator begin() { return multidim().begin(); }
-    iterator end() { return multidim().end(); }
+    iterator begin() { return mutable_multidim().begin(); }
+    iterator end() { return mutable_multidim().end(); }
 
     const_iterator begin() const { return multidim().begin(); }
     const_iterator end() const { return multidim().end(); }
@@ -185,13 +184,13 @@ class IrArray {
 
    private:
     // Changing the multi-dimensional index invalidates the linear index.
-    std::vector<llvm::Value*>& multidim() {
+    std::vector<llvm::Value*>& mutable_multidim() {
       linear_ = nullptr;
       return multidim_;
     }
 
     void Delinearize(std::vector<llvm::Value*>* multidim, llvm::Value* linear,
-                     const Shape& shape, llvm::IRBuilder<>* ir_builder) const;
+                     const Shape& shape, llvm::IRBuilder<>* b) const;
 
     std::vector<llvm::Value*> multidim_;
 
@@ -240,8 +239,7 @@ class IrArray {
   //
   // The optional name is useful for debugging when looking at
   // the emitted LLVM IR.
-  llvm::Value* EmitArrayElementAddress(const Index& index,
-                                       llvm::IRBuilder<>* ir_builder,
+  llvm::Value* EmitArrayElementAddress(const Index& index, llvm::IRBuilder<>* b,
                                        tensorflow::StringPiece name = "") const;
 
   // Attach metadata this IrArray instance knows about to "instruction".
@@ -255,18 +253,16 @@ class IrArray {
   //
   // The optional name is useful for debugging when looking at
   // the emitted LLVM IR.
-  llvm::Value* EmitReadArrayElement(const Index& index,
-                                    llvm::IRBuilder<>* ir_builder,
+  llvm::Value* EmitReadArrayElement(const Index& index, llvm::IRBuilder<>* b,
                                     tensorflow::StringPiece name = "") const;
 
   // Emit IR to write the given value to the array element at the given index.
   void EmitWriteArrayElement(const Index& index, llvm::Value* value,
-                             llvm::IRBuilder<>* ir_builder) const;
+                             llvm::IRBuilder<>* b) const;
 
   // Returns a new IrArray whose shape is "new_shape" and base pointer is a
   // bitcast of the base pointer of "this" IrArray.
-  IrArray CastToShape(const Shape& new_shape,
-                      llvm::IRBuilder<>* ir_builder) const;
+  IrArray CastToShape(const Shape& new_shape, llvm::IRBuilder<>* b) const;
 
   void AddAliasScopeMetadata(llvm::MDNode* alias_scope) {
     CHECK_NE(alias_scope, nullptr);
@@ -312,7 +308,7 @@ class IrArray {
   // Bumps the "which_dimension" value within the provided index by the provided
   // addend.
   static Index BumpIndex(const Index& index, int64 which_dimension,
-                         int64 addend, llvm::IRBuilder<>* ir_builder);
+                         int64 addend, llvm::IRBuilder<>* b);
 
  private:
   // Add the specified LLVM IR metadata to loads/stores associated with this
diff --git a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc
index 1f6e3c829f..b79567369a 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.cc
@@ -22,9 +22,9 @@ Status KernelSupportLibrary::For(
     tensorflow::StringPiece name, llvm::Value* start, llvm::Value* end,
     llvm::Value* step,
     const std::function<Status(llvm::Value*, bool)>& for_body_generator) {
-  return If(ir_builder_->CreateICmpSLT(start, end), [&]() -> Status {
+  return If(b_->CreateICmpSLT(start, end), [&]() -> Status {
     TF_RETURN_IF_ERROR(for_body_generator(start, /*is_first_iteration=*/true));
-    return For(name, ir_builder_->CreateAdd(start, step), end, step,
+    return For(name, b_->CreateAdd(start, step), end, step,
                [&](llvm::Value* iv) { return for_body_generator(iv, false); });
   });
 }
@@ -37,44 +37,44 @@ Status KernelSupportLibrary::For(
   if (peel_first_iteration) {
     return For(name, start, end, step, true,
                [&](llvm::Value* indvar, bool is_first_iteration) -> Status {
-                 return for_body_generator(
-                     indvar, ir_builder_->getInt1(is_first_iteration));
+                 return for_body_generator(indvar,
+                                           b_->getInt1(is_first_iteration));
                });
   } else {
     std::unique_ptr<llvm_ir::ForLoop> loop = llvm_ir::ForLoop::EmitForLoop(
-        name, start, end, step, ir_builder_,
+        name, start, end, step, b_,
         /*unroll_mode=*/unroll_mode_,
         /*prevent_vectorization=*/prevent_vectorization_);
-    ir_builder_->SetInsertPoint(&loop->GetBodyBasicBlock()->back());
+    b_->SetInsertPoint(&loop->GetBodyBasicBlock()->back());
     TF_RETURN_IF_ERROR(
         for_body_generator(loop->GetIndVarValue(),
-                           /*is_first_iteration=*/ir_builder_->CreateICmpEQ(
+                           /*is_first_iteration=*/b_->CreateICmpEQ(
                                loop->GetIndVarValue(), start)));
-    llvm_ir::SetToLastInsertPoint(loop->GetExitBasicBlock(), ir_builder_);
+    llvm_ir::SetToLastInsertPoint(loop->GetExitBasicBlock(), b_);
     return Status::OK();
   }
 }
 
 Status KernelSupportLibrary::If(
-    llvm::Value* condition, const std::function<Status()>& true_block_generator,
+    tensorflow::StringPiece name, llvm::Value* condition,
+    const std::function<Status()>& true_block_generator,
     const std::function<Status()>& false_block_generator) {
-  llvm_ir::LlvmIfData if_data =
-      llvm_ir::EmitIfThenElse(condition, "", ir_builder_);
-  ir_builder_->SetInsertPoint(&if_data.true_block->back());
+  llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(condition, name, b_);
+  b_->SetInsertPoint(&if_data.true_block->back());
   TF_RETURN_IF_ERROR(true_block_generator());
-  ir_builder_->SetInsertPoint(&if_data.false_block->back());
+  b_->SetInsertPoint(&if_data.false_block->back());
   TF_RETURN_IF_ERROR(false_block_generator());
-  llvm_ir::SetToLastInsertPoint(if_data.after_block, ir_builder_);
+  llvm_ir::SetToLastInsertPoint(if_data.after_block, b_);
   return Status::OK();
 }
 
 void KernelSupportLibrary::EmitAndCallOutlinedKernel(
-    bool enable_fast_math, bool optimize_for_size,
-    llvm::IRBuilder<>* ir_builder, tensorflow::StringPiece kernel_name,
+    bool enable_fast_math, bool optimize_for_size, llvm::IRBuilder<>* b,
+    tensorflow::StringPiece kernel_name,
     KernelSupportLibrary::ArgumentVector arguments,
     const std::function<void(KernelSupportLibrary::ArgumentVector)>&
         kernel_body_generator) {
-  llvm::Module* module = ir_builder->GetInsertBlock()->getModule();
+  llvm::Module* module = b->GetInsertBlock()->getModule();
   llvm::Function* function =
       module->getFunction(llvm_ir::AsStringRef(kernel_name));
 
@@ -97,22 +97,22 @@ void KernelSupportLibrary::EmitAndCallOutlinedKernel(
                    std::back_inserter(arg_types),
                    [](llvm::Value* arg) { return arg->getType(); });
 
-    auto* function_type = llvm::FunctionType::get(
-        ir_builder->getVoidTy(), arg_types, /*isVarArg=*/false);
+    auto* function_type =
+        llvm::FunctionType::get(b->getVoidTy(), arg_types, /*isVarArg=*/false);
 
     function = llvm_ir::CreateFunction(
         function_type, llvm::GlobalValue::InternalLinkage,
         /*enable_fast_math=*/enable_fast_math,
         /*optimize_for_size=*/optimize_for_size, kernel_name, module);
 
-    llvm::IRBuilder<>::InsertPointGuard guard(*ir_builder);
+    llvm::IRBuilder<>::InsertPointGuard guard(*b);
 
     auto* entry_bb =
-        llvm::BasicBlock::Create(ir_builder->getContext(), "entry", function);
-    auto* return_inst = llvm::ReturnInst::Create(ir_builder->getContext(),
+        llvm::BasicBlock::Create(b->getContext(), "entry", function);
+    auto* return_inst = llvm::ReturnInst::Create(b->getContext(),
                                                  /*retVal=*/nullptr, entry_bb);
     // Set the insert point to before return_inst.
-    ir_builder->SetInsertPoint(return_inst);
+    b->SetInsertPoint(return_inst);
 
     std::vector<llvm::Value*> arg_values;
     /*
@@ -132,7 +132,7 @@ void KernelSupportLibrary::EmitAndCallOutlinedKernel(
     VLOG(3) << "Re-using kernel for " << kernel_name;
   }
 
-  ir_builder->CreateCall(function, llvm_ir::AsArrayRef(sanitized_args));
+  b->CreateCall(function, llvm_ir::AsArrayRef(sanitized_args));
 }
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h
index 6f7a9d94e3..b00f903d56 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/kernel_support_library.h
@@ -30,14 +30,14 @@ namespace xla {
 // flow more readable.
 class KernelSupportLibrary {
  public:
-  // `ir_builder` is the llvm::IRBuilder instance used to generate LLVM IR.
+  // `b` is the llvm::IRBuilder instance used to generate LLVM IR.
   // `unroll_mode` specifies the desired LLVM unrolling behavior for every loop
   // generated by this instance of KernelSupportLibrary.
   explicit KernelSupportLibrary(
-      llvm::IRBuilder<>* ir_builder,
+      llvm::IRBuilder<>* b,
       llvm_ir::UnrollMode unroll_mode = llvm_ir::UnrollMode::kNoUnroll,
       bool prevent_vectorization = true)
-      : ir_builder_(ir_builder),
+      : b_(b),
         unroll_mode_(unroll_mode),
         prevent_vectorization_(prevent_vectorization) {}
 
@@ -71,18 +71,18 @@ class KernelSupportLibrary {
              const std::function<Status(llvm::Value* ind_var,
                                         bool is_first_iteration)>&
                  for_body_generator) {
-    return For(name, /*start=*/ir_builder_->getInt64(start),
-               /*end=*/ir_builder_->getInt64(end),
-               /*step=*/ir_builder_->getInt64(step), for_body_generator);
+    return For(name, /*start=*/b_->getInt64(start),
+               /*end=*/b_->getInt64(end),
+               /*step=*/b_->getInt64(step), for_body_generator);
   }
 
   void ForReturnVoid(
       tensorflow::StringPiece name, int64 start, int64 end, int64 step,
       const std::function<void(llvm::Value* ind_var, bool is_first_iteration)>&
           for_body_generator) {
-    ForReturnVoid(name, /*start=*/ir_builder_->getInt64(start),
-                  /*end=*/ir_builder_->getInt64(end),
-                  /*step=*/ir_builder_->getInt64(step), for_body_generator);
+    ForReturnVoid(name, /*start=*/b_->getInt64(start),
+                  /*end=*/b_->getInt64(end),
+                  /*step=*/b_->getInt64(step), for_body_generator);
   }
 
   // Generates the following control flow structure if `peel_first_iteration` is
@@ -184,17 +184,17 @@ class KernelSupportLibrary {
   Status For(
       tensorflow::StringPiece name, int64 start, int64 end, int64 step,
       const std::function<Status(llvm::Value* ind_var)>& for_body_generator) {
-    return For(name, /*start=*/ir_builder_->getInt64(start),
-               /*end=*/ir_builder_->getInt64(end),
-               /*step=*/ir_builder_->getInt64(step), for_body_generator);
+    return For(name, /*start=*/b_->getInt64(start),
+               /*end=*/b_->getInt64(end),
+               /*step=*/b_->getInt64(step), for_body_generator);
   }
 
   void ForReturnVoid(
       tensorflow::StringPiece name, int64 start, int64 end, int64 step,
       const std::function<void(llvm::Value* ind_var)>& for_body_generator) {
-    ForReturnVoid(name, /*start=*/ir_builder_->getInt64(start),
-                  /*end=*/ir_builder_->getInt64(end),
-                  /*step=*/ir_builder_->getInt64(step), for_body_generator);
+    ForReturnVoid(name, /*start=*/b_->getInt64(start),
+                  /*end=*/b_->getInt64(end),
+                  /*step=*/b_->getInt64(step), for_body_generator);
   }
 
   // Generates the following control flow structure:
@@ -203,16 +203,30 @@ class KernelSupportLibrary {
   //     `true_block_generator()`;
   //   else
   //      `false_block_generator()`;
-  Status If(llvm::Value* condition,
+  Status If(tensorflow::StringPiece name, llvm::Value* condition,
             const std::function<Status()>& true_block_generator,
             const std::function<Status()>& false_block_generator =
                 []() -> Status { return Status::OK(); });
 
+  Status If(llvm::Value* condition,
+            const std::function<Status()>& true_block_generator,
+            const std::function<Status()>& false_block_generator =
+                []() -> Status { return Status::OK(); }) {
+    return If("", condition, true_block_generator, false_block_generator);
+  }
+
   void IfReturnVoid(llvm::Value* condition,
                     const std::function<void()>& true_block_generator,
                     const std::function<void()>& false_block_generator = []() {
                     }) {
-    TF_CHECK_OK(If(condition,
+    IfReturnVoid("", condition, true_block_generator, false_block_generator);
+  }
+
+  void IfReturnVoid(tensorflow::StringPiece name, llvm::Value* condition,
+                    const std::function<void()>& true_block_generator,
+                    const std::function<void()>& false_block_generator = []() {
+                    }) {
+    TF_CHECK_OK(If(name, condition,
                    [&]() {
                      true_block_generator();
                      return Status::OK();
@@ -244,41 +258,39 @@ class KernelSupportLibrary {
   // in a nullptr llvm::Value* in its position to `kernel_body_generator`.
   // Currently we only support at most one nullptr value in `arguments`.
   static void EmitAndCallOutlinedKernel(
-      bool enable_fast_math, bool optimize_for_size,
-      llvm::IRBuilder<>* ir_builder, tensorflow::StringPiece kernel_name,
-      ArgumentVector arguments,
+      bool enable_fast_math, bool optimize_for_size, llvm::IRBuilder<>* b,
+      tensorflow::StringPiece kernel_name, ArgumentVector arguments,
       const std::function<void(ArgumentVector)>& kernel_body_generator);
 
   // Thin wrappers around the more general EmitAndCallOutlinedKernel above.
   static void EmitAndCallOutlinedKernel(
-      bool enable_fast_math, bool optimize_for_size,
-      llvm::IRBuilder<>* ir_builder, tensorflow::StringPiece kernel_name,
-      llvm::Value* arg0, llvm::Value* arg1, llvm::Value* arg2,
+      bool enable_fast_math, bool optimize_for_size, llvm::IRBuilder<>* b,
+      tensorflow::StringPiece kernel_name, llvm::Value* arg0, llvm::Value* arg1,
+      llvm::Value* arg2,
       const std::function<void(llvm::Value*, llvm::Value*, llvm::Value*)>&
           kernel_body_generator) {
     EmitAndCallOutlinedKernel(
-        enable_fast_math, optimize_for_size, ir_builder, kernel_name,
-        {arg0, arg1, arg2}, [&](ArgumentVector args) {
+        enable_fast_math, optimize_for_size, b, kernel_name, {arg0, arg1, arg2},
+        [&](ArgumentVector args) {
           kernel_body_generator(args[0], args[1], args[2]);
         });
   }
 
   static void EmitAndCallOutlinedKernel(
-      bool enable_fast_math, bool optimize_for_size,
-      llvm::IRBuilder<>* ir_builder, tensorflow::StringPiece kernel_name,
-      llvm::Value* arg0, llvm::Value* arg1, llvm::Value* arg2,
-      llvm::Value* arg3,
+      bool enable_fast_math, bool optimize_for_size, llvm::IRBuilder<>* b,
+      tensorflow::StringPiece kernel_name, llvm::Value* arg0, llvm::Value* arg1,
+      llvm::Value* arg2, llvm::Value* arg3,
       const std::function<void(llvm::Value*, llvm::Value*, llvm::Value*,
                                llvm::Value*)>& kernel_body_generator) {
     EmitAndCallOutlinedKernel(
-        enable_fast_math, optimize_for_size, ir_builder, kernel_name,
+        enable_fast_math, optimize_for_size, b, kernel_name,
         {arg0, arg1, arg2, arg3}, [&](ArgumentVector args) {
           kernel_body_generator(args[0], args[1], args[2], args[3]);
         });
   }
 
  private:
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
   llvm_ir::UnrollMode unroll_mode_;
   bool prevent_vectorization_;
 };
diff --git a/tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.cc b/tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.cc
new file mode 100644
index 0000000000..35b3941272
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.cc
@@ -0,0 +1,118 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.h"
+#include "tensorflow/compiler/xla/layout_util.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace xla {
+namespace llvm_ir {
+
+namespace {
+// Returns the indices of the first elements of all consecutive subarrays of the
+// given array. For example:
+// ConsecutiveSegments({m, m+1, m+2, n, k, k+1}) = {0, 3, 4}
+std::vector<size_t> ConsecutiveSegments(tensorflow::gtl::ArraySlice<int64> xs) {
+  std::vector<size_t> is = {0};
+  for (size_t i = 1; i < xs.size(); ++i) {
+    if (1 != xs[i] - xs[i - 1]) {
+      is.push_back(i);
+    }
+  }
+  return is;
+}
+
+// Merges the sequences of dimensions of the given shape which start at the
+// given indices `segs`.
+Shape MergeDimensions(tensorflow::gtl::ArraySlice<size_t> segs,
+                      const Shape& shape) {
+  std::vector<int64> dimensions;
+  for (size_t i = 1; i <= segs.size(); ++i) {
+    dimensions.push_back(std::accumulate(
+        shape.dimensions().begin() + segs[i - 1],
+        shape.dimensions().begin() +
+            (segs.size() == i ? shape.dimensions().size() : segs[i]),
+        1, std::multiplies<int64>()));
+  }
+  return ShapeUtil::MakeShapeWithDescendingLayout(shape.element_type(),
+                                                  dimensions);
+}
+}  // namespace
+
+tensorflow::gtl::optional<std::vector<int64> > FindTranspose021(
+    const Shape& a, const Shape& b) {
+  if (!ShapeUtil::CompatibleIgnoringElementType(a, b)) {
+    return tensorflow::gtl::nullopt;
+  }
+
+  std::vector<int64> perm(a.dimensions().size());
+  {
+    auto layout_a_orig = LayoutUtil::MinorToMajor(a);
+    std::vector<int64> layout_a(layout_a_orig.rbegin(), layout_a_orig.rend());
+    auto layout_b_orig = LayoutUtil::MinorToMajor(b);
+    std::vector<int64> layout_b(layout_b_orig.rbegin(), layout_b_orig.rend());
+    for (size_t i = 0; i < perm.size(); ++i) {
+      perm[i] = PositionInContainer(layout_b, layout_a[i]);
+    }
+  }
+  auto segs = ConsecutiveSegments(perm);
+  if ((3 == segs.size() && 0 == perm[0]) || 2 == segs.size()) {
+    Shape norm_a =
+        ShapeUtil::MakeShapeWithDescendingLayoutAndSamePhysicalLayout(a);
+    Shape reduced_a = MergeDimensions(segs, norm_a);
+    auto reduced_a_dims = reduced_a.dimensions();
+    std::vector<int64> dims_021;
+    if (2 == segs.size()) {
+      // The logical component-0 is of size one.
+      dims_021 = {1, reduced_a_dims[1], reduced_a_dims[0]};
+    } else {
+      dims_021 = {reduced_a_dims[0], reduced_a_dims[2], reduced_a_dims[1]};
+    }
+
+    return dims_021;
+  }
+
+  return tensorflow::gtl::nullopt;
+}
+
+IrArray::Index GetUnreducedOutputIndex(
+    const IrArray::Index& reduced_output_index,
+    const Shape& reduced_output_shape, const Shape& unreduced_output_shape,
+    llvm::IRBuilder<>* b) {
+  auto bounds = reduced_output_shape.dimensions();
+  auto minor_to_major = reduced_output_shape.layout().minor_to_major();
+  llvm::Value* linear_index = reduced_output_index.GetConstantWithIndexType(0);
+  int64 multiplier = 1;
+  for (int i = 0; i < reduced_output_index.size(); ++i) {
+    int64 dim = minor_to_major[i];
+    llvm::Value* addend =
+        b->CreateMul(reduced_output_index[dim],
+                     reduced_output_index.GetConstantWithIndexType(multiplier),
+                     "linearizing",
+                     /*HasNUW=*/true, /*HasNSW=*/true);
+    linear_index = b->CreateAdd(linear_index, addend, "",
+                                /*HasNUW=*/true, /*HasNSW=*/true);
+    multiplier *= bounds[dim];
+  }
+
+  return IrArray::Index(linear_index, unreduced_output_shape, b);
+}
+
+}  // namespace llvm_ir
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.h b/tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.h
new file mode 100644
index 0000000000..ccb9b8ba3e
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/kernel_tiling.h
@@ -0,0 +1,80 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_KERNEL_TILING_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_KERNEL_TILING_H_
+
+#include "llvm/IR/Value.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/ir_array.h"
+
+namespace xla {
+namespace llvm_ir {
+
+// About 0-2-1 transpose:
+//
+// If a shape can be viewed as three logical components 0-1-2 in the order of
+// major to minor, a 0-2-1-transpose changes the order of such logical
+// components to 0-2-1. We call the shape being transposed the input shape and
+// the transposed shape the output shape. The logical view of the input and
+// output shapes for the transpose are called the 0-1-2 shape or reduced input
+// shape and the 0-2-1 shape or the reduced output shape respectively. The
+// original input and output shapes are called the unreduced input and output
+// shapes.
+
+// If `b` is a 0-2-1 transpose of `a` in 0-1-2, return the dimensions for the
+// reduced shape of `b` or the 0-2-1 shape.
+tensorflow::gtl::optional<std::vector<int64> > FindTranspose021(const Shape& a,
+                                                                const Shape& b);
+
+// Return the unreduced output index corresponding to the given reduced output
+// index.
+IrArray::Index GetUnreducedOutputIndex(
+    const IrArray::Index& reduced_output_index,
+    const Shape& reduced_output_shape, const Shape& unreduced_output_shape,
+    llvm::IRBuilder<>* b);
+
+// A class to represent information for tiled parameters to support IR emission
+// for 021 transpose.
+class TiledParameterInfo {
+ public:
+  TiledParameterInfo(tensorflow::gtl::ArraySlice<llvm::Value*> param_buffers,
+                     llvm::Value* y, llvm::Value* x)
+      : param_buffers_(param_buffers), y_(y), x_(x) {}
+
+  llvm::Value* x() const { return x_; }
+  llvm::Value* y() const { return y_; }
+
+  void set_x(llvm::Value* x) { x_ = x; }
+  void set_y(llvm::Value* y) { y_ = y; }
+
+  llvm::Value* GetBufferForParameter(int64 index) const {
+    return param_buffers_[index];
+  }
+
+ private:
+  // Param_buffers_[i] stores the tile buffer for the ith parameter or nullptr
+  // if the parameter is not tiled.
+  tensorflow::gtl::ArraySlice<llvm::Value*> param_buffers_;
+  // The y coordinate within a tile.
+  llvm::Value* y_;
+  // The x coordinate within a tile.
+  llvm::Value* x_;
+};
+
+}  // namespace llvm_ir
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_KERNEL_TILING_H_
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc
index c9ae7d3afd..ba7f94834c 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.cc
@@ -47,27 +47,27 @@ ForLoop::ForLoop(tensorflow::StringPiece prefix, tensorflow::StringPiece suffix,
 
 /* static */ std::unique_ptr<ForLoop> ForLoop::EmitForLoop(
     tensorflow::StringPiece prefix, llvm::Value* start_index,
-    llvm::Value* end_index, llvm::Value* step, llvm::IRBuilder<>* ir_builder,
+    llvm::Value* end_index, llvm::Value* step, llvm::IRBuilder<>* b,
     UnrollMode unroll_mode, bool prevent_vectorization) {
   std::unique_ptr<ForLoop> loop(new ForLoop(prefix, /*suffix=*/"", start_index,
                                             end_index, step, unroll_mode,
                                             prevent_vectorization));
-  loop->Emit(ir_builder);
+  loop->Emit(b);
   return loop;
 }
 
-void ForLoop::Emit(llvm::IRBuilder<>* ir_builder) {
+void ForLoop::Emit(llvm::IRBuilder<>* b) {
   // The preheader block is the block the builder is currently emitting
   // code into.
-  preheader_bb_ = ir_builder->GetInsertBlock();
+  preheader_bb_ = b->GetInsertBlock();
 
-  llvm::BasicBlock::iterator insert_point = ir_builder->GetInsertPoint();
+  llvm::BasicBlock::iterator insert_point = b->GetInsertPoint();
   if (insert_point == preheader_bb_->end()) {
     // We're emitting the loop at the end of a basic block. Verify there is no
     // terminator (eg, branch) in the basic block.
     CHECK_EQ(nullptr, preheader_bb_->getTerminator());
 
-    exit_bb_ = CreateLoopBB("loop_exit", ir_builder);
+    exit_bb_ = CreateLoopBB("loop_exit", b);
   } else {
     // We're emitting the loop into the middle of a basic block. splitBasicBlock
     // requires that this basic block be well-formed (have a terminator).
@@ -86,51 +86,50 @@ void ForLoop::Emit(llvm::IRBuilder<>* ir_builder) {
   insert_before_bb_ = exit_bb_;
 
   // Create remaining basic block which form the inside of the loop.
-  header_bb_ = CreateLoopBB("loop_header", ir_builder);
-  body_bb_ = CreateLoopBB("loop_body", ir_builder);
+  header_bb_ = CreateLoopBB("loop_header", b);
+  body_bb_ = CreateLoopBB("loop_body", b);
 
   // Function entry basic block.
   // Emit alloca for the induction variable. We do this at the entry to the
   // basic block to ensure the alloc only executes once per function (we could
   // be emitting a nested loop).
   llvm::Function* func = preheader_bb_->getParent();
-  ir_builder->SetInsertPoint(&func->getEntryBlock(),
-                             func->getEntryBlock().getFirstInsertionPt());
+  b->SetInsertPoint(&func->getEntryBlock(),
+                    func->getEntryBlock().getFirstInsertionPt());
   llvm::Value* indvar_address =
-      ir_builder->CreateAlloca(start_index_->getType(), nullptr,
-                               AsStringRef(GetQualifiedName("invar_address")));
+      b->CreateAlloca(start_index_->getType(), nullptr,
+                      AsStringRef(GetQualifiedName("invar_address")));
 
   // Preheader basic block.
   // Initialize induction variable starting index. Create branch to the header.
-  ir_builder->SetInsertPoint(preheader_bb_);
-  ir_builder->CreateStore(start_index_, indvar_address);
+  b->SetInsertPoint(preheader_bb_);
+  b->CreateStore(start_index_, indvar_address);
   // The preheader should not have a branch yet.
   CHECK_EQ(preheader_bb_->getTerminator(), nullptr);
-  ir_builder->CreateBr(header_bb_);
+  b->CreateBr(header_bb_);
 
   // Header basic block.
   // Emit the loop conditional branch. Load and compare indvar with ending
   // index and jump to loop exit if equal. Jump to body otherwise.
-  ir_builder->SetInsertPoint(header_bb_);
-  indvar_ = ir_builder->CreateLoad(indvar_address,
-                                   AsStringRef(GetQualifiedName("indvar")));
-  llvm::Value* exit_cond = ir_builder->CreateICmpUGE(indvar_, end_index_);
-  ir_builder->CreateCondBr(/*Cond=*/exit_cond,
-                           /*True=*/exit_bb_, /*False=*/body_bb_);
+  b->SetInsertPoint(header_bb_);
+  indvar_ =
+      b->CreateLoad(indvar_address, AsStringRef(GetQualifiedName("indvar")));
+  llvm::Value* exit_cond = b->CreateICmpUGE(indvar_, end_index_);
+  b->CreateCondBr(/*Cond=*/exit_cond,
+                  /*True=*/exit_bb_, /*False=*/body_bb_);
 
   // Body basic block.
   // Increment indvar, store indvar, and jump to header.
-  ir_builder->SetInsertPoint(body_bb_);
+  b->SetInsertPoint(body_bb_);
   llvm::Value* step = step_;
   llvm::Value* indvar = indvar_;
 
-  llvm::Value* indvar_inc =
-      ir_builder->CreateAdd(indvar, step, "invar.inc",
-                            /*HasNUW=*/true, /*HasNSW=*/true);
-  ir_builder->CreateStore(indvar_inc, indvar_address);
-  llvm::BranchInst* back_branch = ir_builder->CreateBr(header_bb_);
+  llvm::Value* indvar_inc = b->CreateAdd(indvar, step, "invar.inc",
+                                         /*HasNUW=*/true, /*HasNSW=*/true);
+  b->CreateStore(indvar_inc, indvar_address);
+  llvm::BranchInst* back_branch = b->CreateBr(header_bb_);
 
-  std::vector<llvm::Metadata*> loop_metadata = GetLoopMetadata(ir_builder);
+  std::vector<llvm::Metadata*> loop_metadata = GetLoopMetadata(b);
   if (!loop_metadata.empty()) {
     llvm::LLVMContext* ctx = &start_index_->getContext();
     auto temp_node = llvm::MDNode::getTemporary(*ctx, llvm::None);
@@ -141,11 +140,10 @@ void ForLoop::Emit(llvm::IRBuilder<>* ir_builder) {
   }
 
   // Re-point the IR builder to the loop exit block.
-  ir_builder->SetInsertPoint(exit_bb_);
+  b->SetInsertPoint(exit_bb_);
 }
 
-std::vector<llvm::Metadata*> ForLoop::GetLoopMetadata(
-    llvm::IRBuilder<>* ir_builder) {
+std::vector<llvm::Metadata*> ForLoop::GetLoopMetadata(llvm::IRBuilder<>* b) {
   const char* const kLlvmLoopUnrollDisableMDName = "llvm.loop.unroll.disable";
   const char* const kLlvmLoopUnrollFullMDName = "llvm.loop.unroll.full";
   const char* const kLlvmLoopVectorizeMDName = "llvm.loop.vectorize.enable";
@@ -160,7 +158,7 @@ std::vector<llvm::Metadata*> ForLoop::GetLoopMetadata(
   if (prevent_vectorization_) {
     result.push_back(llvm::MDNode::get(
         *ctx, {llvm::MDString::get(*ctx, kLlvmLoopVectorizeMDName),
-               llvm::ConstantAsMetadata::get(ir_builder->getFalse())}));
+               llvm::ConstantAsMetadata::get(b->getFalse())}));
   }
 
   if (unroll_mode_ == xla::llvm_ir::UnrollMode::kFullyUnroll) {
@@ -175,9 +173,8 @@ string ForLoop::GetQualifiedName(tensorflow::StringPiece name) {
 }
 
 llvm::BasicBlock* ForLoop::CreateLoopBB(tensorflow::StringPiece name,
-                                        llvm::IRBuilder<>* ir_builder) {
-  return CreateBasicBlock(insert_before_bb_, GetQualifiedName(name),
-                          ir_builder);
+                                        llvm::IRBuilder<>* b) {
+  return CreateBasicBlock(insert_before_bb_, GetQualifiedName(name), b);
 }
 
 std::unique_ptr<ForLoop> ForLoopNest::AddLoop(tensorflow::StringPiece suffix,
@@ -197,12 +194,12 @@ std::unique_ptr<ForLoop> ForLoopNest::AddLoop(tensorflow::StringPiece suffix,
                                               bool prevent_vectorization) {
   if (inner_loop_body_bb_ != nullptr) {
     // Create this loop inside the previous one.
-    ir_builder_->SetInsertPoint(&*inner_loop_body_bb_->getFirstInsertionPt());
+    b_->SetInsertPoint(&*inner_loop_body_bb_->getFirstInsertionPt());
   }
   std::unique_ptr<ForLoop> loop(new ForLoop(
       /*prefix=*/name_, suffix, start_index, end_index, stride, unroll_mode,
       prevent_vectorization));
-  loop->Emit(ir_builder_);
+  loop->Emit(b_);
 
   if (outer_loop_preheader_bb_ == nullptr) {
     outer_loop_preheader_bb_ = loop->GetPreheaderBasicBlock();
@@ -262,5 +259,35 @@ IrArray::Index ForLoopNest::AddLoopsForShapeOnDimensions(
   return index;
 }
 
+IrArray::Index ForLoopNest::EmitOperandArrayLoopNest(
+    const llvm_ir::IrArray& operand_array, int64 dimension_to_skip,
+    tensorflow::StringPiece name_suffix) {
+  // Prepares the dimension list we will use to emit the loop nest. Outermost
+  // loops are added first. Add loops in major-to-minor order, and skip the
+  // 'dimension_to_skip' dimension.
+  std::vector<int64> dimensions;
+  const Shape& shape = operand_array.GetShape();
+  for (int64 dimension : LayoutUtil::MinorToMajor(shape)) {
+    if (dimension != dimension_to_skip) {
+      dimensions.push_back(dimension);
+    }
+  }
+
+  // Create loop nest with one for-loop for each dimension of the
+  // output.
+  llvm_ir::IrArray::Index index =
+      AddLoopsForShapeOnDimensions(shape, dimensions, name_suffix);
+  // Verify every dimension except the 'dimension_to_skip' dimension was set in
+  // the index.
+  for (size_t dimension = 0; dimension < index.size(); ++dimension) {
+    if (dimension == dimension_to_skip) {
+      DCHECK_EQ(nullptr, index[dimension]);
+    } else {
+      DCHECK_NE(nullptr, index[dimension]);
+    }
+  }
+  return index;
+}
+
 }  // namespace llvm_ir
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h
index 0dd5b9d3b2..a4fed5c8dc 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h
@@ -79,7 +79,7 @@ class ForLoop {
   //  loop.
   static std::unique_ptr<ForLoop> EmitForLoop(
       tensorflow::StringPiece prefix, llvm::Value* start_index,
-      llvm::Value* end_index, llvm::Value* step, llvm::IRBuilder<>* ir_builder,
+      llvm::Value* end_index, llvm::Value* step, llvm::IRBuilder<>* b,
       UnrollMode unroll_mode = llvm_ir::UnrollMode::kDefaultUnroll,
       bool prevent_vectorization = false);
 
@@ -138,10 +138,10 @@ class ForLoop {
           UnrollMode unroll_mode, bool prevent_vectorization);
 
   // Emit the loop at the insert point of the builder.
-  void Emit(llvm::IRBuilder<>* ir_builder);
+  void Emit(llvm::IRBuilder<>* b);
 
   llvm::BasicBlock* CreateLoopBB(tensorflow::StringPiece name,
-                                 llvm::IRBuilder<>* ir_builder);
+                                 llvm::IRBuilder<>* b);
 
   // Creates a name for an LLVM construct, appending prefix_ and suffix_, if
   // they are set.
@@ -149,7 +149,7 @@ class ForLoop {
 
   // Return a list of metadata nodes that should be associated with the
   // llvm::Loop for this `ForLoop`.
-  std::vector<llvm::Metadata*> GetLoopMetadata(llvm::IRBuilder<>* ir_builder);
+  std::vector<llvm::Metadata*> GetLoopMetadata(llvm::IRBuilder<>* b);
 
   string prefix_;
   string suffix_;
@@ -177,19 +177,18 @@ class ForLoop {
 // A simple class for constructing nested for-loops.
 class ForLoopNest {
  public:
-  explicit ForLoopNest(llvm::IRBuilder<>* ir_builder,
-                       llvm::Type* index_ty = nullptr)
-      : ForLoopNest(/*name=*/"", ir_builder) {
+  explicit ForLoopNest(llvm::IRBuilder<>* b, llvm::Type* index_ty = nullptr)
+      : ForLoopNest(/*name=*/"", b) {
     SetIndexType(index_ty);
   }
 
-  ForLoopNest(tensorflow::StringPiece name, llvm::IRBuilder<>* ir_builder,
+  ForLoopNest(tensorflow::StringPiece name, llvm::IRBuilder<>* b,
               llvm::Type* index_ty = nullptr)
       : name_(std::string(name)),
         outer_loop_preheader_bb_(nullptr),
         outer_loop_exit_bb_(nullptr),
         inner_loop_body_bb_(nullptr),
-        ir_builder_(ir_builder) {
+        b_(b) {
     SetIndexType(index_ty);
   }
 
@@ -248,6 +247,17 @@ class ForLoopNest {
       const Shape& shape, tensorflow::gtl::ArraySlice<int64> dimensions,
       tensorflow::StringPiece suffix);
 
+  // Emits a series of nested loops for iterating over an operand array. Loops
+  // are constructed in major to minor dimension layout order. No loop is
+  // emitted for the given 'dimension_to_skip'. The function returns an IrArray
+  // index for the given operand_array containing the indvars of the loops. All
+  // dimensions of the index are filled except for 'dimension_to_skip'.
+  // name_suffix is the string to append to the names of LLVM constructs (eg,
+  // basic blocks) constructed by this method.
+  IrArray::Index EmitOperandArrayLoopNest(const llvm_ir::IrArray& operand_array,
+                                          int64 dimension_to_skip,
+                                          tensorflow::StringPiece name_suffix);
+
   // Convenience methods which return particular basic blocks of the outermost
   // or innermost loops. These methods return nullptr if no loops have been
   // added yet.
@@ -259,7 +269,7 @@ class ForLoopNest {
 
  private:
   void SetIndexType(llvm::Type* index_ty) {
-    index_type_ = index_ty == nullptr ? ir_builder_->getInt64Ty() : index_ty;
+    index_type_ = index_ty == nullptr ? b_->getInt64Ty() : index_ty;
   }
 
   llvm::Constant* GetConstantWithIndexType(int64 c) const {
@@ -278,7 +288,7 @@ class ForLoopNest {
   // has been added yet.
   llvm::BasicBlock* inner_loop_body_bb_;
 
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
 
   llvm::Type* index_type_;
 
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
index 97bacc34b5..e6126881af 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
@@ -26,7 +26,7 @@ limitations under the License.
 #include "llvm/Target/TargetOptions.h"
 #include "llvm/Transforms/Utils/Cloning.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/name_uniquer.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -48,8 +48,8 @@ namespace {
 
 // Note, this function is only useful in an insertion context; in a global
 // (e.g. constants) context it will CHECK fail.
-llvm::Module* ModuleFromIRBuilder(llvm::IRBuilder<>* ir_builder) {
-  auto block = CHECK_NOTNULL(ir_builder->GetInsertBlock());
+llvm::Module* ModuleFromIRBuilder(llvm::IRBuilder<>* b) {
+  auto block = CHECK_NOTNULL(b->GetInsertBlock());
   auto fn = CHECK_NOTNULL(block->getParent());
   auto module = CHECK_NOTNULL(fn->getParent());
   return module;
@@ -87,41 +87,41 @@ llvm::Value* EmitCallToIntrinsic(
     llvm::Intrinsic::ID intrinsic_id,
     tensorflow::gtl::ArraySlice<llvm::Value*> operands,
     tensorflow::gtl::ArraySlice<llvm::Type*> overloaded_types,
-    llvm::IRBuilder<>* ir_builder) {
-  llvm::Module* module = ModuleFromIRBuilder(ir_builder);
+    llvm::IRBuilder<>* b) {
+  llvm::Module* module = ModuleFromIRBuilder(b);
   llvm::Function* intrinsic = llvm::Intrinsic::getDeclaration(
       module, intrinsic_id, AsArrayRef(overloaded_types));
-  return ir_builder->CreateCall(intrinsic, AsArrayRef(operands));
+  return b->CreateCall(intrinsic, AsArrayRef(operands));
 }
 
 llvm::Value* EmitFloatMax(llvm::Value* lhs_value, llvm::Value* rhs_value,
-                          llvm::IRBuilder<>* ir_builder) {
-  if (ir_builder->getFastMathFlags().noNaNs()) {
-    auto cmp = ir_builder->CreateFCmpUGE(lhs_value, rhs_value);
-    return ir_builder->CreateSelect(cmp, lhs_value, rhs_value);
+                          llvm::IRBuilder<>* b) {
+  if (b->getFastMathFlags().noNaNs()) {
+    auto cmp = b->CreateFCmpUGE(lhs_value, rhs_value);
+    return b->CreateSelect(cmp, lhs_value, rhs_value);
   } else {
-    auto cmp_ge = ir_builder->CreateFCmpOGE(lhs_value, rhs_value);
-    auto lhs_is_nan = ir_builder->CreateFCmpUNE(lhs_value, lhs_value);
-    auto sel_lhs = ir_builder->CreateOr(cmp_ge, lhs_is_nan);
-    return ir_builder->CreateSelect(sel_lhs, lhs_value, rhs_value);
+    auto cmp_ge = b->CreateFCmpOGE(lhs_value, rhs_value);
+    auto lhs_is_nan = b->CreateFCmpUNE(lhs_value, lhs_value);
+    auto sel_lhs = b->CreateOr(cmp_ge, lhs_is_nan);
+    return b->CreateSelect(sel_lhs, lhs_value, rhs_value);
   }
 }
 
 llvm::Value* EmitFloatMin(llvm::Value* lhs_value, llvm::Value* rhs_value,
-                          llvm::IRBuilder<>* ir_builder) {
-  if (ir_builder->getFastMathFlags().noNaNs()) {
-    auto cmp = ir_builder->CreateFCmpULE(lhs_value, rhs_value);
-    return ir_builder->CreateSelect(cmp, lhs_value, rhs_value);
+                          llvm::IRBuilder<>* b) {
+  if (b->getFastMathFlags().noNaNs()) {
+    auto cmp = b->CreateFCmpULE(lhs_value, rhs_value);
+    return b->CreateSelect(cmp, lhs_value, rhs_value);
   } else {
-    auto cmp_le = ir_builder->CreateFCmpOLE(lhs_value, rhs_value);
-    auto lhs_is_nan = ir_builder->CreateFCmpUNE(lhs_value, lhs_value);
-    auto sel_lhs = ir_builder->CreateOr(cmp_le, lhs_is_nan);
-    return ir_builder->CreateSelect(sel_lhs, lhs_value, rhs_value);
+    auto cmp_le = b->CreateFCmpOLE(lhs_value, rhs_value);
+    auto lhs_is_nan = b->CreateFCmpUNE(lhs_value, lhs_value);
+    auto sel_lhs = b->CreateOr(cmp_le, lhs_is_nan);
+    return b->CreateSelect(sel_lhs, lhs_value, rhs_value);
   }
 }
 
 llvm::Value* EmitBufferIndexingGEP(llvm::Value* array, llvm::Value* index,
-                                   llvm::IRBuilder<>* ir_builder) {
+                                   llvm::IRBuilder<>* b) {
   llvm::Type* array_type = array->getType();
   CHECK(array_type->isPointerTy());
   llvm::PointerType* array_type_as_pointer =
@@ -131,16 +131,16 @@ llvm::Value* EmitBufferIndexingGEP(llvm::Value* array, llvm::Value* index,
           << " array=" << llvm_ir::DumpToString(*array)
           << " index=" << llvm_ir::DumpToString(*index);
 
-  return ir_builder->CreateInBoundsGEP(
+  return b->CreateInBoundsGEP(
       array_type_as_pointer->getElementType(), array,
       llvm::isa<llvm::GlobalVariable>(array)
-          ? llvm::ArrayRef<llvm::Value*>({ir_builder->getInt64(0), index})
+          ? llvm::ArrayRef<llvm::Value*>({b->getInt64(0), index})
           : index);
 }
 
 llvm::Value* EmitBufferIndexingGEP(llvm::Value* array, int64 index,
-                                   llvm::IRBuilder<>* ir_builder) {
-  return EmitBufferIndexingGEP(array, ir_builder->getInt64(index), ir_builder);
+                                   llvm::IRBuilder<>* b) {
+  return EmitBufferIndexingGEP(array, b->getInt64(index), b);
 }
 
 llvm::Type* PrimitiveTypeToIrType(PrimitiveType element_type,
@@ -232,14 +232,15 @@ llvm::Type* ShapeToIrType(const Shape& shape, llvm::Module* module) {
   return result_type;
 }
 
-StatusOr<llvm::Value*> EncodeSelfDescribingShapeConstant(
-    const Shape& shape, int32* shape_size, llvm::IRBuilder<>* ir_builder) {
+StatusOr<llvm::Value*> EncodeSelfDescribingShapeConstant(const Shape& shape,
+                                                         int32* shape_size,
+                                                         llvm::IRBuilder<>* b) {
   string encoded_shape = shape.SerializeAsString();
   if (encoded_shape.size() > std::numeric_limits<int32>::max()) {
     return InternalError("Encoded shape size exceeded int32 size limit.");
   }
   *shape_size = static_cast<int32>(encoded_shape.size());
-  return ir_builder->CreateGlobalStringPtr(llvm_ir::AsStringRef(encoded_shape));
+  return b->CreateGlobalStringPtr(llvm_ir::AsStringRef(encoded_shape));
 }
 
 StatusOr<Shape> DecodeSelfDescribingShapeConstant(const void* shape_ptr,
@@ -262,59 +263,57 @@ llvm::Constant* ConvertLiteralToIrConstant(const Literal& literal,
 
 llvm::AllocaInst* EmitAllocaAtFunctionEntry(llvm::Type* type,
                                             tensorflow::StringPiece name,
-                                            llvm::IRBuilder<>* ir_builder,
+                                            llvm::IRBuilder<>* b,
                                             int alignment) {
-  return EmitAllocaAtFunctionEntryWithCount(type, nullptr, name, ir_builder,
-                                            alignment);
+  return EmitAllocaAtFunctionEntryWithCount(type, nullptr, name, b, alignment);
 }
 
 llvm::AllocaInst* EmitAllocaAtFunctionEntryWithCount(
     llvm::Type* type, llvm::Value* element_count, tensorflow::StringPiece name,
-    llvm::IRBuilder<>* ir_builder, int alignment) {
-  llvm::IRBuilder<>::InsertPoint insert_point = ir_builder->saveIP();
-  llvm::Function* function = ir_builder->GetInsertBlock()->getParent();
-  ir_builder->SetInsertPoint(&function->getEntryBlock(),
-                             function->getEntryBlock().getFirstInsertionPt());
+    llvm::IRBuilder<>* b, int alignment) {
+  llvm::IRBuilder<>::InsertPoint insert_point = b->saveIP();
+  llvm::Function* function = b->GetInsertBlock()->getParent();
+  b->SetInsertPoint(&function->getEntryBlock(),
+                    function->getEntryBlock().getFirstInsertionPt());
   llvm::AllocaInst* alloca =
-      ir_builder->CreateAlloca(type, element_count, AsStringRef(name));
+      b->CreateAlloca(type, element_count, AsStringRef(name));
   if (alignment != 0) {
     alloca->setAlignment(alignment);
   }
-  ir_builder->restoreIP(insert_point);
+  b->restoreIP(insert_point);
   return alloca;
 }
 
 llvm::BasicBlock* CreateBasicBlock(llvm::BasicBlock* insert_before,
                                    tensorflow::StringPiece name,
-                                   llvm::IRBuilder<>* ir_builder) {
+                                   llvm::IRBuilder<>* b) {
   return llvm::BasicBlock::Create(
-      /*Context=*/ir_builder->getContext(),
+      /*Context=*/b->getContext(),
       /*Name=*/AsStringRef(name),
-      /*Parent=*/ir_builder->GetInsertBlock()->getParent(),
+      /*Parent=*/b->GetInsertBlock()->getParent(),
       /*InsertBefore*/ insert_before);
 }
 
 LlvmIfData EmitIfThenElse(llvm::Value* condition, tensorflow::StringPiece name,
-                          llvm::IRBuilder<>* ir_builder, bool emit_else) {
+                          llvm::IRBuilder<>* b, bool emit_else) {
   llvm_ir::LlvmIfData if_data;
-  if_data.if_block = ir_builder->GetInsertBlock();
-  if_data.true_block = CreateBasicBlock(
-      nullptr, tensorflow::strings::StrCat(name, "-true"), ir_builder);
+  if_data.if_block = b->GetInsertBlock();
+  if_data.true_block =
+      CreateBasicBlock(nullptr, tensorflow::strings::StrCat(name, "-true"), b);
   if_data.false_block =
-      emit_else ? CreateBasicBlock(nullptr,
-                                   tensorflow::strings::StrCat(name, "-false"),
-                                   ir_builder)
+      emit_else ? CreateBasicBlock(
+                      nullptr, tensorflow::strings::StrCat(name, "-false"), b)
                 : nullptr;
 
   // Add a terminator to the if block, if necessary.
   if (if_data.if_block->getTerminator() == nullptr) {
-    ir_builder->SetInsertPoint(if_data.if_block);
+    b->SetInsertPoint(if_data.if_block);
     if_data.after_block = CreateBasicBlock(
-        nullptr, tensorflow::strings::StrCat(name, "-after"), ir_builder);
-    ir_builder->CreateBr(if_data.after_block);
+        nullptr, tensorflow::strings::StrCat(name, "-after"), b);
+    b->CreateBr(if_data.after_block);
   } else {
     if_data.after_block = if_data.if_block->splitBasicBlock(
-        ir_builder->GetInsertPoint(),
+        b->GetInsertPoint(),
         AsStringRef(tensorflow::strings::StrCat(name, "-after")));
   }
 
@@ -322,39 +321,37 @@ LlvmIfData EmitIfThenElse(llvm::Value* condition, tensorflow::StringPiece name,
   // we're going to replace it with a conditional branch.
   if_data.if_block->getTerminator()->eraseFromParent();
 
-  ir_builder->SetInsertPoint(if_data.if_block);
-  ir_builder->CreateCondBr(
-      condition, if_data.true_block,
-      emit_else ? if_data.false_block : if_data.after_block);
+  b->SetInsertPoint(if_data.if_block);
+  b->CreateCondBr(condition, if_data.true_block,
+                  emit_else ? if_data.false_block : if_data.after_block);
 
-  ir_builder->SetInsertPoint(if_data.true_block);
-  ir_builder->CreateBr(if_data.after_block);
+  b->SetInsertPoint(if_data.true_block);
+  b->CreateBr(if_data.after_block);
 
   if (emit_else) {
-    ir_builder->SetInsertPoint(if_data.false_block);
-    ir_builder->CreateBr(if_data.after_block);
+    b->SetInsertPoint(if_data.false_block);
+    b->CreateBr(if_data.after_block);
   }
 
-  ir_builder->SetInsertPoint(if_data.after_block,
-                             if_data.after_block->getFirstInsertionPt());
+  b->SetInsertPoint(if_data.after_block,
+                    if_data.after_block->getFirstInsertionPt());
 
   return if_data;
 }
 
 llvm::Value* EmitComparison(llvm::CmpInst::Predicate predicate,
                             llvm::Value* lhs_value, llvm::Value* rhs_value,
-                            llvm::IRBuilder<>* ir_builder) {
+                            llvm::IRBuilder<>* b) {
   llvm::Value* comparison_result;
   if (lhs_value->getType()->isIntegerTy()) {
-    comparison_result = ir_builder->CreateICmp(predicate, lhs_value, rhs_value);
+    comparison_result = b->CreateICmp(predicate, lhs_value, rhs_value);
   } else {
-    comparison_result = ir_builder->CreateFCmp(predicate, lhs_value, rhs_value);
+    comparison_result = b->CreateFCmp(predicate, lhs_value, rhs_value);
   }
   // comparison_result is i1, but the NVPTX codegen incorrectly lowers i1
   // arrays. So we extend it to i8 so that it's addressable.
-  return ir_builder->CreateZExt(
-      comparison_result,
-      llvm_ir::PrimitiveTypeToIrType(PRED, ModuleFromIRBuilder(ir_builder)));
+  return b->CreateZExt(comparison_result, llvm_ir::PrimitiveTypeToIrType(
+                                              PRED, ModuleFromIRBuilder(b)));
 }
 
 // Internal helper that is called from emitted code to log an int64 value with a
@@ -363,17 +360,14 @@ static void LogS64(const char* tag, int64 value) {
   LOG(INFO) << tag << " (int64): " << value;
 }
 
-void EmitLogging(const char* tag, llvm::Value* value,
-                 llvm::IRBuilder<>* ir_builder) {
+void EmitLogging(const char* tag, llvm::Value* value, llvm::IRBuilder<>* b) {
   llvm::FunctionType* log_function_type = llvm::FunctionType::get(
-      ir_builder->getVoidTy(),
-      {ir_builder->getInt64Ty(), ir_builder->getInt64Ty()}, /*isVarArg=*/false);
-  ir_builder->CreateCall(
+      b->getVoidTy(), {b->getInt64Ty(), b->getInt64Ty()}, /*isVarArg=*/false);
+  b->CreateCall(
       log_function_type,
-      ir_builder->CreateIntToPtr(
-          ir_builder->getInt64(tensorflow::bit_cast<int64>(&LogS64)),
-          log_function_type->getPointerTo()),
-      {ir_builder->getInt64(tensorflow::bit_cast<int64>(tag)), value});
+      b->CreateIntToPtr(b->getInt64(tensorflow::bit_cast<int64>(&LogS64)),
+                        log_function_type->getPointerTo()),
+      {b->getInt64(tensorflow::bit_cast<int64>(tag)), value});
 }
 
 void SetAlignmentMetadataForLoad(llvm::LoadInst* load, uint64_t alignment) {
@@ -663,5 +657,56 @@ void InitializeLLVMCommandLineOptions(const HloModuleConfig& config) {
   }
 }
 
+std::pair<llvm::Value*, llvm::Value*> UMulLowHigh32(llvm::IRBuilder<>* b,
+                                                    llvm::Value* src0,
+                                                    llvm::Value* src1) {
+  CHECK_EQ(src0->getType()->getPrimitiveSizeInBits(), 32);
+  CHECK_EQ(src1->getType()->getPrimitiveSizeInBits(), 32);
+  llvm::Type* int64_ty = b->getInt64Ty();
+  src0 = b->CreateZExt(src0, int64_ty);
+  src1 = b->CreateZExt(src1, int64_ty);
+  return SplitInt64ToInt32s(b, b->CreateMul(src0, src1));
+}
+
+std::pair<llvm::Value*, llvm::Value*> SplitInt64ToInt32s(
+    llvm::IRBuilder<>* b, llvm::Value* value_64bits) {
+  CHECK_EQ(value_64bits->getType()->getPrimitiveSizeInBits(), 64);
+  llvm::Type* int32_ty = b->getInt32Ty();
+  llvm::Value* low_32bits = b->CreateTrunc(value_64bits, int32_ty);
+  llvm::Value* high_32bits =
+      b->CreateTrunc(b->CreateLShr(value_64bits, 32), int32_ty);
+  return std::make_pair(low_32bits, high_32bits);
+}
+
+llvm::GlobalVariable* GetOrCreateVariableForPhiloxRngState(
+    llvm::Module* module, llvm::IRBuilder<>* b) {
+  static const char* kPhiloxRngStateVariableName = "philox_rng_state";
+  llvm::GlobalVariable* state_ptr =
+      module->getNamedGlobal(kPhiloxRngStateVariableName);
+  if (!state_ptr) {
+    state_ptr = new llvm::GlobalVariable(
+        /*M=*/*module,
+        /*Ty=*/b->getInt64Ty(),
+        /*isConstant=*/false,
+        /*Linkage=*/llvm::GlobalValue::PrivateLinkage,
+        /*Initializer=*/b->getInt64(0),
+        /*Name=*/kPhiloxRngStateVariableName);
+  }
+  return state_ptr;
+}
+
+void IncrementVariableForPhiloxRngState(int64 value, llvm::Module* module,
+                                        llvm::IRBuilder<>* builder) {
+  llvm::GlobalVariable* state_ptr =
+      GetOrCreateVariableForPhiloxRngState(module, builder);
+  llvm::Value* state_value_old = builder->CreateLoad(state_ptr, "load_state");
+  // If the 64-bit value overflows, we use the wraparound value. This should
+  // be fine in practice as we only add one to the value each time when a RNG is
+  // executed.
+  llvm::Value* state_value_new = builder->CreateAdd(
+      state_value_old, builder->getInt64(value), "inc_state");
+  builder->CreateStore(state_value_new, state_ptr);
+}
+
 }  // namespace llvm_ir
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h
index 4a10ec466d..0958398534 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h
@@ -27,7 +27,7 @@ limitations under the License.
 #include "llvm/IR/Module.h"
 #include "llvm/IR/Value.h"
 #include "llvm/Support/raw_ostream.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module_config.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -105,26 +105,26 @@ llvm::Value* EmitCallToIntrinsic(
     llvm::Intrinsic::ID intrinsic_id,
     tensorflow::gtl::ArraySlice<llvm::Value*> operands,
     tensorflow::gtl::ArraySlice<llvm::Type*> overloaded_types,
-    llvm::IRBuilder<>* ir_builder);
+    llvm::IRBuilder<>* b);
 
 // Emit float max. Emit maxnum intrinsic is fast math is disabled, or
 // fcmp+select otherwise
 llvm::Value* EmitFloatMax(llvm::Value* lhs_value, llvm::Value* rhs_value,
-                          llvm::IRBuilder<>* ir_builder);
+                          llvm::IRBuilder<>* b);
 
 // Emit float min. Emit minnum intrinsic is fast math is disabled, or
 // fcmp+select otherwise
 llvm::Value* EmitFloatMin(llvm::Value* lhs_value, llvm::Value* rhs_value,
-                          llvm::IRBuilder<>* ir_builder);
+                          llvm::IRBuilder<>* b);
 
 // Convenience methods for emitting a GEP instruction that indexes into a buffer
 // (1-dimensional array), equivalent to array[index]. The type is automatically
 // determined from the element type of the array.  The int64 index overload
 // wraps the index in a i64 llvm::Value.
 llvm::Value* EmitBufferIndexingGEP(llvm::Value* array, llvm::Value* index,
-                                   llvm::IRBuilder<>* ir_builder);
+                                   llvm::IRBuilder<>* b);
 llvm::Value* EmitBufferIndexingGEP(llvm::Value* array, int64 index,
-                                   llvm::IRBuilder<>* ir_builder);
+                                   llvm::IRBuilder<>* b);
 
 // Returns the LLVM type which represents the given XLA primitive type.
 llvm::Type* PrimitiveTypeToIrType(PrimitiveType element_type,
@@ -139,8 +139,9 @@ llvm::Type* ShapeToIrType(const Shape& shape, llvm::Module* module);
 
 // Returns a value that represents a pointer to a global string constant that
 // encodes the shape as a serialized protobuf.
-StatusOr<llvm::Value*> EncodeSelfDescribingShapeConstant(
-    const Shape& shape, int32* shape_size, llvm::IRBuilder<>* ir_builder);
+StatusOr<llvm::Value*> EncodeSelfDescribingShapeConstant(const Shape& shape,
+                                                         int32* shape_size,
+                                                         llvm::IRBuilder<>* b);
 
 // Inverses the encoding of a Shape protobuf into an LLVM global variable.
 //
@@ -164,21 +165,21 @@ llvm::Constant* ConvertLiteralToIrConstant(const Literal& literal,
 // through a loop.
 llvm::AllocaInst* EmitAllocaAtFunctionEntry(llvm::Type* type,
                                             tensorflow::StringPiece name,
-                                            llvm::IRBuilder<>* ir_builder,
+                                            llvm::IRBuilder<>* b,
                                             int alignment = 0);
 
 // As EmitAllocaAtFunctionEntry, but allocates element_count entries
 // instead of a single element.
 llvm::AllocaInst* EmitAllocaAtFunctionEntryWithCount(
     llvm::Type* type, llvm::Value* element_count, tensorflow::StringPiece name,
-    llvm::IRBuilder<>* ir_builder, int alignment = 0);
+    llvm::IRBuilder<>* b, int alignment = 0);
 
 // Creates a basic block with the same context and function as for the
 // builder. Inserts at the end of the function if insert_before is
 // null.
 llvm::BasicBlock* CreateBasicBlock(llvm::BasicBlock* insert_before,
                                    tensorflow::StringPiece name,
-                                   llvm::IRBuilder<>* ir_builder);
+                                   llvm::IRBuilder<>* b);
 
 // Struct with data on a conditional branch in a diamond shape created
 // via EmitIfThenElse.
@@ -210,13 +211,13 @@ struct LlvmIfData {
 // block with a terminator. If you need to use this for a
 // non-terminated block, just make the function able to do that too.
 LlvmIfData EmitIfThenElse(llvm::Value* condition, tensorflow::StringPiece name,
-                          llvm::IRBuilder<>* ir_builder, bool emit_else = true);
+                          llvm::IRBuilder<>* b, bool emit_else = true);
 
 // Emits a compare operation between "lhs" and "rhs" with the given predicate,
 // and then converts the result to i8 so that it is addressable.
 llvm::Value* EmitComparison(llvm::CmpInst::Predicate predicate,
                             llvm::Value* lhs, llvm::Value* rhs,
-                            llvm::IRBuilder<>* ir_builder);
+                            llvm::IRBuilder<>* b);
 
 // Emits a call that logs the given value with the given tag as a prefix.
 // The provided tag and value are passed to a runtime logging call that is
@@ -228,8 +229,7 @@ llvm::Value* EmitComparison(llvm::CmpInst::Predicate predicate,
 // Precondition: value must be an int64.
 // Precondition: tag must be a stable pointer for the lifetime of the generated
 // program (the constant pointer is burned in to the program).
-void EmitLogging(const char* tag, llvm::Value* value,
-                 llvm::IRBuilder<>* ir_builder);
+void EmitLogging(const char* tag, llvm::Value* value, llvm::IRBuilder<>* b);
 
 // Adds alignment metadata to a load instruction using the given alignment.
 // The alignment refers to the result of the load, not the load itself.
@@ -292,6 +292,27 @@ llvm::Function* CreateFunction(llvm::FunctionType* function_type,
 // don't start with xla_ to LLVM.
 void InitializeLLVMCommandLineOptions(const HloModuleConfig& config);
 
+// Zero-extends two 32-bit values to 64 bits, multiplies them, and returns the
+// result as a pair of (low 32 bits, high 32 bits).
+std::pair<llvm::Value*, llvm::Value*> UMulLowHigh32(llvm::IRBuilder<>* b,
+                                                    llvm::Value* src0,
+                                                    llvm::Value* src1);
+// Splits the 64-bit integer value into its high and low 32 bits.
+std::pair<llvm::Value*, llvm::Value*> SplitInt64ToInt32s(
+    llvm::IRBuilder<>* b, llvm::Value* value_64bits);
+
+// Checks whether a global variable is already created to represent a
+// state passed between RNG calls implemented with Philox algorithm. If not,
+// creates such a variable. Returns the global variable.
+llvm::GlobalVariable* GetOrCreateVariableForPhiloxRngState(
+    llvm::Module* module, llvm::IRBuilder<>* b);
+
+// Adds a value to the global state variable each time when a RNG hlo is
+// executed. The value of this global state variable is added to the seed
+// of the Philox RNG algorithm so that calling the same RNG Hlo multiple times
+// should rarely produce the same result.
+void IncrementVariableForPhiloxRngState(int64 value, llvm::Module* module,
+                                        llvm::IRBuilder<>* b);
 }  // namespace llvm_ir
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc b/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc
index e8b0605b9d..36f5fa1952 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.cc
@@ -33,26 +33,24 @@ namespace xla {
 namespace llvm_ir {
 
 LoopEmitter::LoopEmitter(const BodyEmitter& body_emitter, const Shape& shape,
-                         llvm::IRBuilder<>* ir_builder)
-    : body_emitter_(body_emitter), shape_(shape), ir_builder_(ir_builder) {}
+                         llvm::IRBuilder<>* b)
+    : body_emitter_(body_emitter), shape_(shape), b_(b) {}
 
 LoopEmitter::LoopEmitter(const ElementGenerator& target_element_generator,
-                         const IrArray& target_array,
-                         llvm::IRBuilder<>* ir_builder)
+                         const IrArray& target_array, llvm::IRBuilder<>* b)
     : body_emitter_([=](const llvm_ir::IrArray::Index array_index) -> Status {
         // Convert target_element_generator to a BodyEmitter.
         TF_ASSIGN_OR_RETURN(llvm::Value * target_element,
                             target_element_generator(array_index));
-        target_array.EmitWriteArrayElement(array_index, target_element,
-                                           ir_builder);
+        target_array.EmitWriteArrayElement(array_index, target_element, b);
         return Status::OK();
       }),
       shape_(target_array.GetShape()),
-      ir_builder_(ir_builder) {}
+      b_(b) {}
 
 static LoopEmitter::BodyEmitter MakeBodyEmitterForMultiOutputFusion(
     const ElementGenerator& target_element_generator,
-    const std::vector<IrArray>& target_arrays, llvm::IRBuilder<>* ir_builder) {
+    const std::vector<IrArray>& target_arrays, llvm::IRBuilder<>* b) {
   return [=](const llvm_ir::IrArray::Index array_index) {
     TF_ASSIGN_OR_RETURN(llvm::Value * target_element,
                         target_element_generator(array_index));
@@ -64,8 +62,7 @@ static LoopEmitter::BodyEmitter MakeBodyEmitterForMultiOutputFusion(
 
     for (int64 i = 0; i < target_arrays.size(); ++i) {
       target_arrays[i].EmitWriteArrayElement(
-          array_index, ir_builder->CreateExtractValue(target_element, i),
-          ir_builder);
+          array_index, b->CreateExtractValue(target_element, i), b);
     }
     return Status::OK();
   };
@@ -73,13 +70,12 @@ static LoopEmitter::BodyEmitter MakeBodyEmitterForMultiOutputFusion(
 
 LoopEmitter::LoopEmitter(const ElementGenerator& target_element_generator,
                          tensorflow::gtl::ArraySlice<IrArray> target_arrays,
-                         llvm::IRBuilder<>* ir_builder)
+                         llvm::IRBuilder<>* b)
     : body_emitter_(MakeBodyEmitterForMultiOutputFusion(
           target_element_generator,
-          std::vector<IrArray>(target_arrays.begin(), target_arrays.end()),
-          ir_builder)),
+          std::vector<IrArray>(target_arrays.begin(), target_arrays.end()), b)),
       shape_(target_arrays[0].GetShape()),
-      ir_builder_(ir_builder) {
+      b_(b) {
   // Sanity check: In multi-output fusion, all shapes produced must have the
   // same dimensions.
   for (const IrArray& array : target_arrays) {
@@ -102,7 +98,7 @@ std::vector<IrArray::Index> LoopEmitter::EmitIndexAndSetExitBasicBlock(
   // Loops are added from outermost to innermost order with the ForLoopNest
   // class so emit loops in order from most-major dimension down to most-minor
   // dimension (of the target shape).
-  ForLoopNest loop_nest(loop_name, ir_builder_);
+  ForLoopNest loop_nest(loop_name, b_);
   IrArray::Index array_index(index_type, shape_.dimensions_size());
   for (int i = 0; i < LayoutUtil::MinorToMajor(shape_).size(); ++i) {
     int64 dimension = LayoutUtil::Major(shape_.layout(), i);
@@ -116,8 +112,8 @@ std::vector<IrArray::Index> LoopEmitter::EmitIndexAndSetExitBasicBlock(
   // Set IR builder insertion point to the loop body basic block of the
   // innermost loop.
   llvm::BasicBlock* innermost_body_bb = loop_nest.GetInnerLoopBodyBasicBlock();
-  ir_builder_->SetInsertPoint(innermost_body_bb,
-                              innermost_body_bb->getFirstInsertionPt());
+  b_->SetInsertPoint(innermost_body_bb,
+                     innermost_body_bb->getFirstInsertionPt());
 
   // Set exit_bb_ to the exit block of the loop nest.
   exit_bb_ = loop_nest.GetOuterLoopExitBasicBlock();
@@ -129,7 +125,7 @@ std::vector<IrArray::Index> LoopEmitter::EmitIndexAndSetExitBasicBlock(
 Status LoopEmitter::EmitLoop(tensorflow::StringPiece loop_name,
                              llvm::Type* index_type) {
   if (index_type == nullptr) {
-    index_type = ir_builder_->getInt64Ty();
+    index_type = b_->getInt64Ty();
   }
 
   for (const IrArray::Index& array_index :
@@ -137,10 +133,10 @@ Status LoopEmitter::EmitLoop(tensorflow::StringPiece loop_name,
     TF_RETURN_IF_ERROR(body_emitter_(array_index));
   }
 
-  // Set the insertion point of ir_builder_ to the loop exit, so that
+  // Set the insertion point of b_ to the loop exit, so that
   // code emitted for later instructions will be correctly placed.
   if (exit_bb_ != nullptr) {
-    ir_builder_->SetInsertPoint(exit_bb_);
+    b_->SetInsertPoint(exit_bb_);
   }
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.h b/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.h
index 6be1c2fba2..c4f5c82086 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/loop_emitter.h
@@ -41,11 +41,11 @@ class LoopEmitter {
   using BodyEmitter = std::function<Status(const IrArray::Index& index)>;
 
   LoopEmitter(const BodyEmitter& body_emitter, const Shape& shape,
-              llvm::IRBuilder<>* ir_builder);
+              llvm::IRBuilder<>* b);
   // Constructs a LoopEmitter from an element generator that generates each
   // element of the given target array.
   LoopEmitter(const ElementGenerator& target_element_generator,
-              const IrArray& target_array, llvm::IRBuilder<>* ir_builder);
+              const IrArray& target_array, llvm::IRBuilder<>* b);
 
   // Constructs a LoopEmitter that emits one element into each of N separate
   // arrays on each iteration of the loop.
@@ -54,7 +54,7 @@ class LoopEmitter {
   // produce an LLVM struct with N elements.
   LoopEmitter(const ElementGenerator& target_element_generator,
               tensorflow::gtl::ArraySlice<IrArray> target_arrays,
-              llvm::IRBuilder<>* ir_builder);
+              llvm::IRBuilder<>* b);
 
   LoopEmitter(const LoopEmitter&) = delete;
   LoopEmitter& operator=(const LoopEmitter&) = delete;
@@ -65,8 +65,7 @@ class LoopEmitter {
   // specifies the element, will return multiple indices if the loop is
   // unrolled.
   std::vector<IrArray::Index> EmitIndexAndSetExitBasicBlock() {
-    return EmitIndexAndSetExitBasicBlock(/*loop_name=*/"",
-                                         ir_builder_->getInt64Ty());
+    return EmitIndexAndSetExitBasicBlock(/*loop_name=*/"", b_->getInt64Ty());
   }
 
   virtual std::vector<IrArray::Index> EmitIndexAndSetExitBasicBlock(
@@ -87,7 +86,7 @@ class LoopEmitter {
   // scalar, no loops are emitted and exit_bb_ is nullptr in that case.
   llvm::BasicBlock* exit_bb_;
 
-  llvm::IRBuilder<>* ir_builder_;
+  llvm::IRBuilder<>* b_;
 };
 
 }  // namespace llvm_ir
diff --git a/tensorflow/compiler/xla/service/llvm_ir/math_ops.cc b/tensorflow/compiler/xla/service/llvm_ir/math_ops.cc
new file mode 100644
index 0000000000..0e115cdabf
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/math_ops.cc
@@ -0,0 +1,59 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/llvm_ir/math_ops.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
+
+namespace xla {
+namespace llvm_ir {
+
+llvm::Value* EmitFastTanh(llvm::IRBuilder<>* b, llvm::Value* input) {
+  llvm::Type* type = input->getType();
+
+  // Clamp the input to [-9, 9].
+  llvm::Value* input_clamped = llvm_ir::EmitFloatMin(
+      llvm_ir::EmitFloatMax(input, llvm::ConstantFP::get(type, -9.0), b),
+      llvm::ConstantFP::get(type, 9.0), b);
+
+  static constexpr std::array<float, 7> numerator_coeffs{
+      -2.76076847742355e-16f, 2.00018790482477e-13f, -8.60467152213735e-11f,
+      5.12229709037114e-08f,  1.48572235717979e-05f, 6.37261928875436e-04f,
+      4.89352455891786e-03f};
+
+  static constexpr std::array<float, 4> denominator_coeffs{
+      1.19825839466702e-06f, 1.18534705686654e-04f, 2.26843463243900e-03f,
+      4.89352518554385e-03f};
+
+  llvm::Value* input_squared = b->CreateFMul(input_clamped, input_clamped);
+  llvm::Value* numerator = llvm::ConstantFP::get(type, numerator_coeffs[0]);
+  for (int i = 1; i < numerator_coeffs.size(); i++) {
+    numerator = b->CreateFAdd(b->CreateFMul(input_squared, numerator),
+                              llvm::ConstantFP::get(type, numerator_coeffs[i]));
+  }
+
+  numerator = b->CreateFMul(input_clamped, numerator);
+
+  llvm::Value* denominator = llvm::ConstantFP::get(type, denominator_coeffs[0]);
+  for (int i = 1; i < denominator_coeffs.size(); i++) {
+    denominator =
+        b->CreateFAdd(b->CreateFMul(input_squared, denominator),
+                      llvm::ConstantFP::get(type, denominator_coeffs[i]));
+  }
+
+  return b->CreateFDiv(numerator, denominator);
+}
+
+}  // namespace llvm_ir
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/math_ops.h b/tensorflow/compiler/xla/service/llvm_ir/math_ops.h
new file mode 100644
index 0000000000..6c8bc3a076
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/math_ops.h
@@ -0,0 +1,32 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_MATH_OPS_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_MATH_OPS_H_
+
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Value.h"
+
+namespace xla {
+namespace llvm_ir {
+
+// Emits an approximation of tanh. The implementation uses the same rational
+// interpolant as implemented in Eigen3.
+llvm::Value* EmitFastTanh(llvm::IRBuilder<>* b, llvm::Value* input);
+
+}  // namespace llvm_ir
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_MATH_OPS_H_
diff --git a/tensorflow/compiler/xla/service/llvm_ir/sort_util.cc b/tensorflow/compiler/xla/service/llvm_ir/sort_util.cc
new file mode 100644
index 0000000000..e546f5cc4a
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/sort_util.cc
@@ -0,0 +1,161 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/llvm_ir/sort_util.h"
+
+// IWYU pragma: no_include "llvm/IR/Intrinsics.gen.inc"
+#include "llvm/IR/BasicBlock.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Value.h"
+#include "tensorflow/compiler/xla/primitive_util.h"
+#include "tensorflow/compiler/xla/service/gpu/parallel_loop_emitter.h"
+#include "tensorflow/compiler/xla/service/gpu/partition_assignment.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/ir_array.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/llvm_loop.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/loop_emitter.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/optional.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+namespace llvm_ir {
+
+namespace {
+// Adds the inner comparison loop where we compare elements pointed to by
+// 'keys_index' and 'compare_keys_index'.
+void EmitCompareLoop(int64 dimension_to_sort, const IrArray::Index& keys_index,
+                     const IrArray::Index& compare_keys_index,
+                     const IrArray& keys_array,
+                     const tensorflow::gtl::optional<IrArray>& values_array,
+                     llvm::IRBuilder<>* b) {
+  // if (is_smaller_index &&
+  //     compare_keys[dimension_to_sort] < dimension_to_sort_bound)
+  llvm::Value* is_smaller_index = b->CreateICmpSLT(
+      keys_index[dimension_to_sort], compare_keys_index[dimension_to_sort]);
+  int64 dimension_to_sort_bound =
+      keys_array.GetShape().dimensions(dimension_to_sort);
+  auto if_data = EmitIfThenElse(
+      b->CreateAnd(is_smaller_index,
+                   b->CreateICmpSLT(compare_keys_index[dimension_to_sort],
+                                    keys_index.GetConstantWithIndexType(
+                                        dimension_to_sort_bound))),
+      "smaller_comparison_index", b, /*emit_else=*/false);
+  SetToFirstInsertPoint(if_data.true_block, b);
+  auto key1 = keys_array.EmitReadArrayElement(keys_index, b);
+  auto key2 = keys_array.EmitReadArrayElement(compare_keys_index, b);
+  auto key_type = keys_array.GetShape().element_type();
+  auto comparison =
+      primitive_util::IsFloatingPointType(key_type)
+          // TODO(b/26783907): Figure out how to handle NaNs.
+          ? b->CreateFCmp(llvm::FCmpInst::FCMP_ULT, key2, key1)
+          : b->CreateICmp(primitive_util::IsSignedIntegralType(key_type)
+                              ? llvm::ICmpInst::ICMP_SLT
+                              : llvm::ICmpInst::ICMP_ULT,
+                          key2, key1);
+  // If key2 < key1
+  auto if_smaller_data =
+      EmitIfThenElse(comparison, "is_smaller_than", b, /*emit_else=*/false);
+  SetToFirstInsertPoint(if_smaller_data.true_block, b);
+  // Swap key1 with key2.
+  keys_array.EmitWriteArrayElement(keys_index, key2, b);
+  keys_array.EmitWriteArrayElement(compare_keys_index, key1, b);
+  if (values_array.has_value()) {
+    // Also swap the values.
+    auto value1 = values_array.value().EmitReadArrayElement(keys_index, b);
+    auto value2 =
+        values_array.value().EmitReadArrayElement(compare_keys_index, b);
+    values_array.value().EmitWriteArrayElement(keys_index, value2, b);
+    values_array.value().EmitWriteArrayElement(compare_keys_index, value1, b);
+  }
+}
+}  // namespace
+
+Status EmitSortInPlace(int64 dimension_to_sort, const IrArray& keys_array,
+                       const tensorflow::gtl::optional<IrArray>& values_array,
+                       tensorflow::StringPiece name, llvm::Value* xor_mask,
+                       llvm::IRBuilder<>* b,
+                       const gpu::LaunchDimensions* launch_dimensions) {
+  const Shape& keys_shape = keys_array.GetShape();
+
+  // Create loop nests which loop through the operand dimensions. The sort
+  // dimension is handled in the innermost loop which performs the sorting.
+  ForLoopNest loop_nest(name, b);
+  IrArray::Index keys_index =
+      loop_nest.EmitOperandArrayLoopNest(keys_array, dimension_to_sort, "keys");
+  if (loop_nest.GetInnerLoopBodyBasicBlock() != nullptr) {
+    SetToFirstInsertPoint(loop_nest.GetInnerLoopBodyBasicBlock(), b);
+  }
+
+  // 'compare_keys_index' is the index of the element that 'keys_index' should
+  // be compared to.
+  IrArray::Index compare_keys_index(keys_index.GetType());
+  for (size_t dimension = 0; dimension < keys_index.size(); ++dimension) {
+    if (dimension != dimension_to_sort) {
+      compare_keys_index.push_back(keys_index[dimension]);
+    } else {
+      compare_keys_index.push_back(nullptr);
+    }
+  }
+
+  // Naive C++ code for the inner compare loop:
+  //
+  // for (int64 i = 0; i < dimension_to_sort_bound; ++i) {
+  //   int64 j = i ^ xor_mask;
+  //   if (i < j && j < dimension_to_sort_bound) {
+  //     int64 min_key = std::min(keys[i], keys[j]);
+  //     keys[j] = std::max(keys[i], keys[j]);
+  //     keys[i] = min_key;
+  //   }
+  // }
+  //
+  // This follows the algorithm described on Wikipedia:
+  // https://en.wikipedia.org/wiki/Bitonic_sorter
+
+  int64 dimension_to_sort_bound =
+      keys_array.GetShape().dimensions(dimension_to_sort);
+  Shape compare_shape = ShapeUtil::MakeShape(keys_shape.element_type(),
+                                             {dimension_to_sort_bound});
+  auto compare_loop_body_emitter =
+      [&](const IrArray::Index& compare_index) -> Status {
+    keys_index[dimension_to_sort] = compare_index[0];
+    compare_keys_index[dimension_to_sort] =
+        b->CreateXor(compare_index[0], xor_mask);
+    EmitCompareLoop(dimension_to_sort, keys_index, compare_keys_index,
+                    keys_array, values_array, b);
+    return Status::OK();
+  };
+  if (launch_dimensions != nullptr) {
+    TF_RETURN_IF_ERROR(gpu::ParallelLoopEmitter(compare_loop_body_emitter,
+                                                compare_shape,
+                                                *launch_dimensions, b)
+                           .EmitLoop(name));
+  } else {
+    TF_RETURN_IF_ERROR(LoopEmitter(compare_loop_body_emitter, compare_shape, b)
+                           .EmitLoop(name));
+  }
+
+  // Set the IR builder insert point to the exit basic block of the outer most
+  // loop. This ensures later instructions are inserted after this loop nest.
+  b->SetInsertPoint(loop_nest.GetOuterLoopExitBasicBlock());
+
+  return Status::OK();
+}
+
+}  // namespace llvm_ir
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/llvm_ir/sort_util.h b/tensorflow/compiler/xla/service/llvm_ir/sort_util.h
new file mode 100644
index 0000000000..8458744c6b
--- /dev/null
+++ b/tensorflow/compiler/xla/service/llvm_ir/sort_util.h
@@ -0,0 +1,41 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_SORT_UTIL_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_SORT_UTIL_H_
+
+#include "llvm/IR/Value.h"
+#include "tensorflow/compiler/xla/service/gpu/partition_assignment.h"
+#include "tensorflow/compiler/xla/service/llvm_ir/ir_array.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/optional.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+namespace llvm_ir {
+// Emits llvm IR to do pairwise comparisons/swaps in the 'dimension_to_sort'
+// dimension of 'keys_array'. All other dimensions are kept as-is. This
+// implements the inner loop of BitonicSort. If 'launch_dimensions' is nullptr,
+// the inner compare loop will not be parallelized.
+Status EmitSortInPlace(int64 dimension_to_sort, const IrArray& keys_array,
+                       const tensorflow::gtl::optional<IrArray>& values_array,
+                       tensorflow::StringPiece name, llvm::Value* xor_mask,
+                       llvm::IRBuilder<>* b,
+                       const gpu::LaunchDimensions* launch_dimensions);
+}  // namespace llvm_ir
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_LLVM_IR_SORT_UTIL_H_
diff --git a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc
index 5fc08aab91..11ed6ee59f 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.cc
@@ -31,12 +31,12 @@ namespace llvm_ir {
 
 void EmitTupleSelect(const IrArray& select, const IrArray& pred,
                      llvm::Value* on_true, llvm::Value* on_false,
-                     llvm::IRBuilder<>* ir_builder, llvm::Module* module) {
+                     llvm::IRBuilder<>* b, llvm::Module* module) {
   CHECK(ShapeUtil::IsScalar(pred.GetShape()));
 
   llvm::LoadInst* pred_value =
-      ir_builder->CreateLoad(pred.GetBasePointer(), "load_predicate_value");
-  llvm::Value* pred_cond = ir_builder->CreateICmpNE(
+      b->CreateLoad(pred.GetBasePointer(), "load_predicate_value");
+  llvm::Value* pred_cond = b->CreateICmpNE(
       pred_value,
       llvm::ConstantInt::get(PrimitiveTypeToIrType(PRED, module), 0),
       "boolean_predicate");
@@ -46,47 +46,42 @@ void EmitTupleSelect(const IrArray& select, const IrArray& pred,
   VLOG(2) << "  pred_cond: " << DumpToString(*pred_cond);
 
   for (int i = 0; i < ShapeUtil::TupleElementCount(select.GetShape()); ++i) {
-    llvm::Value* const element_index[] = {ir_builder->getInt64(0),
-                                          ir_builder->getInt64(i)};
+    llvm::Value* const element_index[] = {b->getInt64(0), b->getInt64(i)};
     llvm::Value* on_true_element_address =
-        ir_builder->CreateInBoundsGEP(on_true, element_index);
-    llvm::Value* on_true_element = ir_builder->CreateLoad(
+        b->CreateInBoundsGEP(on_true, element_index);
+    llvm::Value* on_true_element = b->CreateLoad(
         on_true_element_address, "on_true_element_" + llvm::Twine(i));
     llvm::Value* on_false_element_address =
-        ir_builder->CreateInBoundsGEP(on_false, element_index);
-    llvm::Value* on_false_element = ir_builder->CreateLoad(
+        b->CreateInBoundsGEP(on_false, element_index);
+    llvm::Value* on_false_element = b->CreateLoad(
         on_false_element_address, "on_false_element_" + llvm::Twine(i));
 
     llvm::Value* output_element_address =
-        ir_builder->CreateInBoundsGEP(select.GetBasePointer(), element_index);
-    ir_builder->CreateStore(
-        ir_builder->CreateSelect(pred_cond, on_true_element, on_false_element,
-                                 "select_output_element_" + llvm::Twine(i)),
-        output_element_address);
+        b->CreateInBoundsGEP(select.GetBasePointer(), element_index);
+    b->CreateStore(b->CreateSelect(pred_cond, on_true_element, on_false_element,
+                                   "select_output_element_" + llvm::Twine(i)),
+                   output_element_address);
   }
 }
 
 void EmitTuple(const IrArray& tuple,
                tensorflow::gtl::ArraySlice<llvm::Value*> operands,
-               llvm::IRBuilder<>* ir_builder, llvm::Module* module) {
+               llvm::IRBuilder<>* b, llvm::Module* module) {
   for (size_t i = 0; i < operands.size(); ++i) {
-    auto* store = ir_builder->CreateStore(
-        ir_builder->CreatePointerCast(operands[i],
-                                      PrimitiveTypeToIrType(TUPLE, module)),
-        ir_builder->CreateInBoundsGEP(
-            tuple.GetBasePointer(),
-            {ir_builder->getInt64(0), ir_builder->getInt64(i)}));
+    auto* store = b->CreateStore(
+        b->CreatePointerCast(operands[i], PrimitiveTypeToIrType(TUPLE, module)),
+        b->CreateInBoundsGEP(tuple.GetBasePointer(),
+                             {b->getInt64(0), b->getInt64(i)}));
     tuple.AnnotateLoadStoreInstructionWithMetadata(store);
   }
 }
 
 llvm::Value* EmitGetTupleElement(const Shape& target_shape, int64 index,
                                  int alignment, llvm::Value* operand,
-                                 llvm::IRBuilder<>* ir_builder,
-                                 llvm::Module* module) {
-  llvm::Value* element_ptr = ir_builder->CreateInBoundsGEP(
-      operand, {ir_builder->getInt64(0), ir_builder->getInt64(index)});
-  llvm::LoadInst* src_buffer = ir_builder->CreateLoad(element_ptr);
+                                 llvm::IRBuilder<>* b, llvm::Module* module) {
+  llvm::Value* element_ptr =
+      b->CreateInBoundsGEP(operand, {b->getInt64(0), b->getInt64(index)});
+  llvm::LoadInst* src_buffer = b->CreateLoad(element_ptr);
 
   // Mark the loaded pointer as dereferenceable if we know its shape.
   if (!ShapeUtil::IsOpaque(target_shape)) {
@@ -98,7 +93,7 @@ llvm::Value* EmitGetTupleElement(const Shape& target_shape, int64 index,
 
   llvm::Type* element_type = ShapeToIrType(target_shape, module);
   llvm::Value* ret_val =
-      ir_builder->CreateBitCast(src_buffer, element_type->getPointerTo());
+      b->CreateBitCast(src_buffer, element_type->getPointerTo());
   return ret_val;
 }
 
diff --git a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h
index 352d34ebf8..cf6bf5d0b1 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/tuple_ops.h
@@ -61,13 +61,13 @@ namespace llvm_ir {
 //   output[i] = pred ? tuple_on_true[i] : tuple_on_false[i]
 void EmitTupleSelect(const IrArray& select, const IrArray& pred,
                      llvm::Value* on_true, llvm::Value* on_false,
-                     llvm::IRBuilder<>* ir_builder, llvm::Module* module);
+                     llvm::IRBuilder<>* b, llvm::Module* module);
 
 // A tuple is an array of pointers, one for each operand. Each pointer points to
 // the output buffer of its corresponding operand.
 void EmitTuple(const IrArray& tuple,
                tensorflow::gtl::ArraySlice<llvm::Value*> operands,
-               llvm::IRBuilder<>* ir_builder, llvm::Module* module);
+               llvm::IRBuilder<>* b, llvm::Module* module);
 
 // A tuple is an array of pointers, one for each operand. Each pointer points to
 // the output buffer of its corresponding operand. A GetTupleElement instruction
@@ -75,8 +75,7 @@ void EmitTuple(const IrArray& tuple,
 // Returns an llvm value representing a pointer to the tuple element buffer.
 llvm::Value* EmitGetTupleElement(const Shape& target_shape, int64 index,
                                  int alignment, llvm::Value* operand,
-                                 llvm::IRBuilder<>* ir_builder,
-                                 llvm::Module* module);
+                                 llvm::IRBuilder<>* b, llvm::Module* module);
 }  // namespace llvm_ir
 }  // namespace xla
 
diff --git a/tensorflow/compiler/xla/service/local_service.cc b/tensorflow/compiler/xla/service/local_service.cc
index 53efc30c36..5e02096ee5 100644
--- a/tensorflow/compiler/xla/service/local_service.cc
+++ b/tensorflow/compiler/xla/service/local_service.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/client/executable_build_options.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/backend.h"
diff --git a/tensorflow/compiler/xla/service/local_service.h b/tensorflow/compiler/xla/service/local_service.h
index 39d6734c3f..8f707ea904 100644
--- a/tensorflow/compiler/xla/service/local_service.h
+++ b/tensorflow/compiler/xla/service/local_service.h
@@ -19,7 +19,7 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/compiler/xla/client/executable_build_options.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/backend.h"
 #include "tensorflow/compiler/xla/service/compiler.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
diff --git a/tensorflow/compiler/xla/service/logical_buffer_analysis.cc b/tensorflow/compiler/xla/service/logical_buffer_analysis.cc
index f410921b4b..d631fb5ee4 100644
--- a/tensorflow/compiler/xla/service/logical_buffer_analysis.cc
+++ b/tensorflow/compiler/xla/service/logical_buffer_analysis.cc
@@ -131,18 +131,23 @@ Status LogicalBufferAnalysis::HandleDomain(HloInstruction*) {
   return Status::OK();
 }
 
-Status LogicalBufferAnalysis::HandleRecvDone(HloInstruction*) {
-  // RecvDone doesn't create a new buffer but rather aliases its input (Recv)
-  // tuple element at {0} to its output.
+Status LogicalBufferAnalysis::HandleRecvDone(HloInstruction* recv_done) {
+  // RecvDone produces a two-element tuple containing the data value (which
+  // aliases part of its operand) and a token. Only the tuple index table and
+  // the token are defined by the RecvDone.
+  NewLogicalBuffer(recv_done, /*index=*/{});
+  NewLogicalBuffer(recv_done, /*index=*/{1});
   return Status::OK();
 }
 
 Status LogicalBufferAnalysis::HandleSend(HloInstruction* send) {
-  // Send creates new buffers for the top-level tuple and the context (tuple
-  // element at {1}). Tuple element at {0} is an alias of the Send operand, so
-  // we don't need to create a new Logical Buffer for that.
+  // Send creates new buffers for the top-level tuple, the context (tuple
+  // element at {1}), and the token (tuple element at {2}). Tuple element at {0}
+  // is an alias of the Send operand, so we don't need to create a new Logical
+  // Buffer for that.
   NewLogicalBuffer(send, /*index=*/{});
   NewLogicalBuffer(send, /*index=*/{1});
+  NewLogicalBuffer(send, /*index=*/{2});
   return Status::OK();
 }
 
@@ -152,10 +157,10 @@ Status LogicalBufferAnalysis::HandleTuple(HloInstruction* tuple) {
   return Status::OK();
 }
 
-Status LogicalBufferAnalysis::HandleSelect(HloInstruction* select) {
+Status LogicalBufferAnalysis::HandleTupleSelect(HloInstruction* tuple_select) {
   // Select allocates a new buffer and then shallow copies the on_true or
   // on_false buffer into this new buffer.
-  NewLogicalBuffer(select, /*index=*/{});
+  NewLogicalBuffer(tuple_select, /*index=*/{});
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/xla/service/logical_buffer_analysis.h b/tensorflow/compiler/xla/service/logical_buffer_analysis.h
index b5ef396787..81f524d84a 100644
--- a/tensorflow/compiler/xla/service/logical_buffer_analysis.h
+++ b/tensorflow/compiler/xla/service/logical_buffer_analysis.h
@@ -63,7 +63,7 @@ class LogicalBufferAnalysis : public DfsHloVisitorWithDefault {
   Status HandleCopy(HloInstruction* copy) override;
   Status HandleRecvDone(HloInstruction* recv_done) override;
   Status HandleSend(HloInstruction* send) override;
-  Status HandleSelect(HloInstruction* select) override;
+  Status HandleTupleSelect(HloInstruction* tuple_select) override;
 
   // A map from the buffer ID to the logical buffer
   std::vector<std::unique_ptr<LogicalBuffer>> logical_buffers_;
diff --git a/tensorflow/compiler/xla/service/pattern_matcher.h b/tensorflow/compiler/xla/service/pattern_matcher.h
index 2515222cf2..ac6ea4c72f 100644
--- a/tensorflow/compiler/xla/service/pattern_matcher.h
+++ b/tensorflow/compiler/xla/service/pattern_matcher.h
@@ -86,8 +86,8 @@ namespace xla {
 // are provided below.
 //
 // Example nullary instruction:
-//   Recv()                            == Op().WithOpcode(HloOpcode::kRecv)
-//   Recv(&a)                          == Op(&a).WithOpcode(HloOpcode::kRecv)
+//   Param()                        == Op().WithOpcode(HloOpcode::kParam)
+//   Param(&a)                      == Op(&a).WithOpcode(HloOpcode::kParam)
 //
 // Example unary instruction:
 //   Abs()                             == Op().WithOpcode(HloOpcode::kAbs)
@@ -726,6 +726,32 @@ class HloInstructionPatternFusionKindImpl {
   ::xla::HloInstruction::FusionKind kind_;
 };
 
+// An HloInstructionPattern implementation that matches only if the instruction
+// is a kGetTupleElement with a particular tuple index.
+template <typename Previous>
+class HloInstructionPatternTupleIndexImpl {
+ public:
+  explicit constexpr HloInstructionPatternTupleIndexImpl(
+      const Previous& previous, int64 tuple_index)
+      : previous_(previous), tuple_index_(tuple_index) {}
+
+  bool Match(const ::xla::HloInstruction* inst) const {
+    return previous_.Match(inst) &&
+           inst->opcode() == HloOpcode::kGetTupleElement &&
+           inst->tuple_index() == tuple_index_;
+  }
+
+  bool Match(::xla::HloInstruction* inst) const {
+    return previous_.Match(inst) &&
+           inst->opcode() == HloOpcode::kGetTupleElement &&
+           inst->tuple_index() == tuple_index_;
+  }
+
+ private:
+  Previous previous_;
+  int64 tuple_index_;
+};
+
 // A pattern that matches HloInstructions.
 template <typename HloInstructionType, typename Impl>
 class HloInstructionPattern {
@@ -841,6 +867,17 @@ class HloInstructionPattern {
         HloInstructionPatternFusionKindImpl<Impl>(impl_, kind), matched_inst_);
   }
 
+  // Modifies the pattern to match only if the instruction is a
+  // get-tuple-element with the given tuple index.
+  constexpr HloInstructionPattern<HloInstructionType,
+                                  HloInstructionPatternTupleIndexImpl<Impl>>
+  WithTupleIndex(int64 tuple_index) const {
+    return HloInstructionPattern<HloInstructionType,
+                                 HloInstructionPatternTupleIndexImpl<Impl>>(
+        HloInstructionPatternTupleIndexImpl<Impl>(impl_, tuple_index),
+        matched_inst_);
+  }
+
  private:
   Impl impl_;
   HloInstructionType** matched_inst_;
@@ -880,9 +917,7 @@ Op(::xla::HloInstruction** matched_inst) {
     return Op(matched_inst).WithOpcode(HloOpcode::k##NAME);           \
   }
 XLA_NULLOP_PATTERN(Constant)
-XLA_NULLOP_PATTERN(Infeed)
 XLA_NULLOP_PATTERN(Parameter)
-XLA_NULLOP_PATTERN(Recv)
 #undef XLA_NULLOP_PATTERN
 
 // Helpers for unary instructions.
@@ -919,18 +954,21 @@ XLA_UNOP_PATTERN(Cos)
 XLA_UNOP_PATTERN(Exp)
 XLA_UNOP_PATTERN(Fft)
 XLA_UNOP_PATTERN(Floor)
+XLA_UNOP_PATTERN(GetTupleElement)
 XLA_UNOP_PATTERN(Imag)
+XLA_UNOP_PATTERN(Infeed)
 XLA_UNOP_PATTERN(IsFinite)
 XLA_UNOP_PATTERN(Log)
 XLA_UNOP_PATTERN(Not)
 XLA_UNOP_PATTERN(Negate)
-XLA_UNOP_PATTERN(Outfeed)
 XLA_UNOP_PATTERN(Real)
+XLA_UNOP_PATTERN(Recv)
+XLA_UNOP_PATTERN(RecvDone)
 XLA_UNOP_PATTERN(Reduce)
 XLA_UNOP_PATTERN(ReducePrecision)
 XLA_UNOP_PATTERN(Reshape)
 XLA_UNOP_PATTERN(Reverse)
-XLA_UNOP_PATTERN(Send)
+XLA_UNOP_PATTERN(SendDone)
 XLA_UNOP_PATTERN(Sign)
 XLA_UNOP_PATTERN(Sin)
 XLA_UNOP_PATTERN(Sort)
@@ -981,8 +1019,10 @@ XLA_BINOP_PATTERN(Maximum)
 XLA_BINOP_PATTERN(Minimum)
 XLA_BINOP_PATTERN(Multiply)
 XLA_BINOP_PATTERN(Ne)
+XLA_BINOP_PATTERN(Outfeed)
 XLA_BINOP_PATTERN(Power)
 XLA_BINOP_PATTERN(Remainder)
+XLA_BINOP_PATTERN(Send)
 XLA_BINOP_PATTERN(Subtract)
 XLA_BINOP_PATTERN(And)
 XLA_BINOP_PATTERN(Or)
@@ -1040,6 +1080,32 @@ inline auto NonConstant(HloInstructionType** matched_inst)
   return Op(matched_inst).IsNonConstant();
 }
 
+// Add overloads for GetTupleElement which take a int64 specifying which tuple
+// element is selected.
+template <typename Arg>
+inline auto GetTupleElement(Arg&& arg, int64 tuple_index)
+    -> decltype(Op().WithOpcode(HloOpcode::kGetTupleElement)
+                    .WithOperand(0, std::forward<Arg>(arg))
+                    .WithTupleIndex(tuple_index)) {
+  return Op()
+      .WithOpcode(HloOpcode::kGetTupleElement)
+      .WithOperand(0, std::forward<Arg>(arg))
+      .WithTupleIndex(tuple_index);
+}
+
+template <typename HloInstructionType, typename Arg>
+inline auto GetTupleElement(HloInstructionType** matched_inst, Arg&& arg,
+                            int64 tuple_index)
+    -> decltype(Op(matched_inst)
+                    .WithOpcode(HloOpcode::kGetTupleElement)
+                    .WithOperand(0, std::forward<Arg>(arg))
+                    .WithTupleIndex(tuple_index)) {
+  return Op(matched_inst)
+      .WithOpcode(HloOpcode::kGetTupleElement)
+      .WithOperand(0, std::forward<Arg>(arg))
+      .WithTupleIndex(tuple_index);
+}
+
 }  // namespace match
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/pattern_matcher_test.cc b/tensorflow/compiler/xla/service/pattern_matcher_test.cc
index fef3c132b0..a530581c34 100644
--- a/tensorflow/compiler/xla/service/pattern_matcher_test.cc
+++ b/tensorflow/compiler/xla/service/pattern_matcher_test.cc
@@ -193,5 +193,23 @@ TEST(PatternMatcherTest, FusionKind) {
                                            HloInstruction::FusionKind::kLoop)));
 }
 
+TEST(PatternMatcherTest, GetTupleElement) {
+  constexpr char kModuleStr[] = R"(
+    HloModule test_module
+
+    ENTRY while.v11 {
+      p0 = (f32[], f32[], f32[]) parameter(0)
+      ROOT gte = f32[] get-tuple-element(p0), index=1
+    })";
+  TF_ASSERT_OK_AND_ASSIGN(auto hlo_module, ParseHloString(kModuleStr));
+
+  auto* root = hlo_module->entry_computation()->root_instruction();
+  EXPECT_FALSE(Match(root, match::Op().WithTupleIndex(0)));
+  EXPECT_TRUE(Match(root, match::Op().WithTupleIndex(1)));
+  EXPECT_FALSE(Match(root, match::Op().WithTupleIndex(2)));
+  EXPECT_FALSE(Match(root, match::GetTupleElement(match::Op(), 0)));
+  EXPECT_TRUE(Match(root, match::GetTupleElement(match::Op(), 1)));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/platform_util.cc b/tensorflow/compiler/xla/service/platform_util.cc
index 7c63c0acc7..39fe3c7835 100644
--- a/tensorflow/compiler/xla/service/platform_util.cc
+++ b/tensorflow/compiler/xla/service/platform_util.cc
@@ -75,19 +75,6 @@ PlatformUtil::GetSupportedPlatforms() {
     auto* platform = platform_pair.second;
     auto compiler_status = Compiler::GetForPlatform(platform);
     if (compiler_status.ok()) {
-      if (platform->VisibleDeviceCount() > 0) {
-        LOG(INFO) << "platform " << platform->Name() << " present with "
-                  << platform->VisibleDeviceCount() << " visible devices";
-      } else {
-        LOG(WARNING) << "platform " << platform->Name() << " present but no "
-                     << "visible devices found";
-      }
-      // Note: currently we call zero device platforms "supported" on the basis
-      // that, if the platform support was linked in, it was probably intended
-      // to be used for execution, and this way we can flag an error.
-      //
-      // TODO(b/33730287) If we want an alternative version of this behavior we
-      // could add an --xla_fallback_to_host flag.
       platforms.push_back(platform);
     } else {
       LOG(INFO) << "platform " << platform->Name() << " present but no "
diff --git a/tensorflow/compiler/xla/service/pool.h b/tensorflow/compiler/xla/service/pool.h
deleted file mode 100644
index 8e710ebb6d..0000000000
--- a/tensorflow/compiler/xla/service/pool.h
+++ /dev/null
@@ -1,84 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_COMPILER_XLA_POOL_H_
-#define TENSORFLOW_COMPILER_XLA_POOL_H_
-
-#include <functional>
-#include <vector>
-
-#include "tensorflow/compiler/xla/ptr_util.h"
-#include "tensorflow/core/platform/mutex.h"
-
-namespace xla {
-
-// Pool of values, which are created as needed and destroyed when the `Pool` is
-// destroyed
-template <typename T>
-class Pool {
- public:
-  struct Deleter {
-    void operator()(T* ptr) { pool->Deallocate(ptr); }
-
-    Pool<T>* pool;
-  };
-
-  // A pointer to a taken element of a `Pool` which returns it to the pool on
-  // destruction
-  using SmartPtr = std::unique_ptr<T, Deleter>;
-
-  // Constructs a `Pool` with given factory function, which need not be
-  // thread-safe.
-  explicit Pool(std::function<std::unique_ptr<T>()> factory)
-      : factory_(factory) {}
-
-  explicit Pool() : Pool([]() { return MakeUnique<T>(); }) {}
-
-  // Returns a pointer to a value in the pool, creating a new value if none is
-  // free. The returned smart pointer returns the element to the pool on
-  // destruction.
-  //
-  // This method is thread-safe.
-  SmartPtr Allocate() {
-    tensorflow::mutex_lock lock(mu_);
-    T* ptr;
-    if (!xs_.empty()) {
-      ptr = std::move(xs_.back()).release();
-      xs_.pop_back();
-    } else {
-      ptr = factory_().release();
-    }
-    Deleter del = {this};
-    return std::unique_ptr<T, Deleter>(ptr, del);
-  }
-
- private:
-  // Puts a pointer to a value back into the pool, leaving it free for future
-  // use.
-  //
-  // This method is thread-safe.
-  void Deallocate(T* ptr) {
-    tensorflow::mutex_lock lock(mu_);
-    xs_.push_back(std::unique_ptr<T>(ptr));
-  }
-
-  const std::function<std::unique_ptr<T>()> factory_ GUARDED_BY(mu_);
-  std::vector<std::unique_ptr<T>> xs_ GUARDED_BY(mu_);
-  tensorflow::mutex mu_;
-};
-
-}  // namespace xla
-
-#endif  // TENSORFLOW_COMPILER_XLA_POOL_H_
diff --git a/tensorflow/compiler/xla/service/reshape_mover.cc b/tensorflow/compiler/xla/service/reshape_mover.cc
index 49ec38eb62..ca86c5d13e 100644
--- a/tensorflow/compiler/xla/service/reshape_mover.cc
+++ b/tensorflow/compiler/xla/service/reshape_mover.cc
@@ -38,7 +38,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/reshape_mover.h"
 
 #include <algorithm>
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/util.h"
diff --git a/tensorflow/compiler/xla/service/reshape_mover_test.cc b/tensorflow/compiler/xla/service/reshape_mover_test.cc
index 13e2d3258e..ad3b662c20 100644
--- a/tensorflow/compiler/xla/service/reshape_mover_test.cc
+++ b/tensorflow/compiler/xla/service/reshape_mover_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/reshape_mover.h"
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -175,8 +175,9 @@ TEST_F(ReshapeMoverTest, EquivalentReshapesMoved) {
 TEST_F(ReshapeMoverTest, 1ConstantAnd2ReshapesMoved) {
   HloComputation::Builder builder(TestName());
   auto root_shape = ShapeUtil::MakeShape(F32, {2, 3});
-  auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<bool>({{true, true, false}, {false, false, true}})));
+  auto const0 = builder.AddInstruction(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2<bool>(
+          {{true, true, false}, {false, false, true}})));
 
   auto param1 = builder.AddInstruction(HloInstruction::CreateParameter(
       0, ShapeUtil::MakeShape(F32, {1, 3, 1, 2}), "param1"));
@@ -255,12 +256,12 @@ TEST_F(ReshapeMoverTest, 2TrivialConstantReshapeNotMoved) {
   HloComputation::Builder builder(TestName());
   auto root_shape = ShapeUtil::MakeShape(F32, {3, 2});
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1, 2, 3}, {4, 5, 6}})));
+      LiteralUtil::CreateR2<float>({{1, 2, 3}, {4, 5, 6}})));
   auto reshape0 =
       builder.AddInstruction(HloInstruction::CreateReshape(root_shape, const0));
 
   auto const1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1, 2, 3}, {4, 5, 6}})));
+      LiteralUtil::CreateR2<float>({{1, 2, 3}, {4, 5, 6}})));
   auto reshape1 =
       builder.AddInstruction(HloInstruction::CreateReshape(root_shape, const1));
 
@@ -309,7 +310,7 @@ TEST_F(ReshapeMoverTest, 1NonTrivialReshapeMoved) {
   auto param0 = builder.AddInstruction(HloInstruction::CreateParameter(
       0, ShapeUtil::MakeShape(F32, {1, 3, 1, 2}), "param0"));
   auto const1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1, 2, 3}, {4, 5, 6}})));
+      LiteralUtil::CreateR2<float>({{1, 2, 3}, {4, 5, 6}})));
   auto reshape0 =
       builder.AddInstruction(HloInstruction::CreateReshape(root_shape, param0));
   builder.AddInstruction(HloInstruction::CreateBinary(
@@ -348,7 +349,7 @@ TEST_F(ReshapeMoverTest, 1NonTrivialReshapeWith1ReshapedConstNotMoved) {
   auto param0 = builder.AddInstruction(HloInstruction::CreateParameter(
       0, ShapeUtil::MakeShape(F32, {1, 3}), "param0"));
   auto const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({9, 8, 7})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({9, 8, 7})));
   auto reshape0 =
       builder.AddInstruction(HloInstruction::CreateReshape(root_shape, param0));
   auto reshape1 =
diff --git a/tensorflow/compiler/xla/service/service.cc b/tensorflow/compiler/xla/service/service.cc
index da3b622bfa..ce070bc5b6 100644
--- a/tensorflow/compiler/xla/service/service.cc
+++ b/tensorflow/compiler/xla/service/service.cc
@@ -37,6 +37,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_proto_util.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
 #include "tensorflow/compiler/xla/service/source_map_util.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/compiler/xla/service/transfer_manager.h"
 #include "tensorflow/compiler/xla/shape_layout.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -169,7 +170,8 @@ Service::Service(const ServiceOptions& options,
 
 Status Service::CreateChannelHandle(const CreateChannelHandleRequest* arg,
                                     CreateChannelHandleResponse* result) {
-  *result->mutable_channel() = channel_tracker_.NewChannel();
+  TF_ASSIGN_OR_RETURN(*result->mutable_channel(),
+                      channel_tracker_.NewChannel(arg->channel_type()));
   return Status::OK();
 }
 
@@ -375,7 +377,7 @@ Service::ExecuteParallelAndRegisterResult(
     ExecutionProfile* profile) {
   // Streams where the computation are launched, so we can wait on the streams
   // to complete.
-  std::vector<Pool<se::Stream>::SmartPtr> streams;
+  std::vector<StreamPool::Ptr> streams;
   std::vector<std::unique_ptr<se::Timer>> timers;
 
   // Global data handles for the computation results, one for each computation.
@@ -402,7 +404,7 @@ Service::ExecuteParallelAndRegisterResult(
     CHECK_EQ(replicas.size(), arguments[i].size());
     std::vector<ScopedShapedBuffer> result_buffers;
     for (int64 replica = 0; replica < replicas.size(); ++replica) {
-      TF_ASSIGN_OR_RETURN(Pool<se::Stream>::SmartPtr stream,
+      TF_ASSIGN_OR_RETURN(StreamPool::Ptr stream,
                           backend->BorrowStream(replicas[replica]));
       streams.push_back(std::move(stream));
 
@@ -514,13 +516,13 @@ StatusOr<GlobalDataHandle> Service::ExecuteAndRegisterResult(
         arguments,
     Backend* backend, const string& result_tag, ExecutionProfile* profile) {
   // Set up streams.
-  std::vector<Pool<se::Stream>::SmartPtr> streams;
+  std::vector<StreamPool::Ptr> streams;
 
   TF_ASSIGN_OR_RETURN(auto replicas,
                       Replicas(*backend, SingleComputationDeviceHandle()));
   TF_RET_CHECK(!replicas.empty());
   for (se::StreamExecutor* executor : replicas) {
-    TF_ASSIGN_OR_RETURN(Pool<se::Stream>::SmartPtr stream,
+    TF_ASSIGN_OR_RETURN(StreamPool::Ptr stream,
                         backend->BorrowStream(executor));
     streams.push_back(std::move(stream));
   }
@@ -532,7 +534,7 @@ StatusOr<GlobalDataHandle> Service::ExecuteAndRegisterResult(
 
   // Set up run options.
   std::vector<ServiceExecutableRunOptions> run_options;
-  for (const Pool<se::Stream>::SmartPtr& stream : streams) {
+  for (const StreamPool::Ptr& stream : streams) {
     ExecutableRunOptions options;
     options.set_stream(stream.get());
     options.set_device_ordinal(stream->parent()->device_ordinal());
diff --git a/tensorflow/compiler/xla/service/service_executable_run_options.h b/tensorflow/compiler/xla/service/service_executable_run_options.h
index 7f3910cdb0..dbfed628bf 100644
--- a/tensorflow/compiler/xla/service/service_executable_run_options.h
+++ b/tensorflow/compiler/xla/service/service_executable_run_options.h
@@ -17,7 +17,7 @@ limitations under the License.
 #define TENSORFLOW_COMPILER_XLA_SERVICE_SERVICE_EXECUTABLE_RUN_OPTIONS_H_
 
 #include "tensorflow/compiler/xla/executable_run_options.h"
-#include "tensorflow/compiler/xla/service/pool.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/stream_executor/stream_executor.h"
 
@@ -27,8 +27,7 @@ namespace xla {
 // data, now only a stream cache for GPU backend.
 class ServiceExecutableRunOptions {
  public:
-  using StreamBorrower =
-      std::function<StatusOr<Pool<se::Stream>::SmartPtr>(int)>;
+  using StreamBorrower = std::function<StatusOr<StreamPool::Ptr>(int)>;
 
   ServiceExecutableRunOptions()
       : ServiceExecutableRunOptions(ExecutableRunOptions()) {}
@@ -51,7 +50,7 @@ class ServiceExecutableRunOptions {
 
   // Borrows a stream and returns a smart pointer which returns the stream on
   // destruction.
-  StatusOr<Pool<se::Stream>::SmartPtr> BorrowStream(int device_ordinal) const {
+  StatusOr<StreamPool::Ptr> BorrowStream(int device_ordinal) const {
     return borrow_stream_
                ? borrow_stream_(device_ordinal)
                : Status(tensorflow::error::UNIMPLEMENTED, "No stream cache");
diff --git a/tensorflow/compiler/xla/service/shape_inference.cc b/tensorflow/compiler/xla/service/shape_inference.cc
index 81f071ecc5..c888bbf144 100644
--- a/tensorflow/compiler/xla/service/shape_inference.cc
+++ b/tensorflow/compiler/xla/service/shape_inference.cc
@@ -58,66 +58,101 @@ Status ExpectArray(const Shape& shape, tensorflow::StringPiece op_type) {
   return Status::OK();
 }
 
-Status VerifyReducerShape(const ProgramShape& reducer_shape,
-                          const Shape& init_value_shape,
-                          const PrimitiveType& input_element_type) {
-  if (reducer_shape.parameters_size() != 2) {
-    return InvalidArgument(
-        "Reduction function must take 2 parameters, but "
+Status VerifyReducerShape(
+    const ProgramShape& reducer_shape,
+    tensorflow::gtl::ArraySlice<const Shape*> init_value_shapes,
+    tensorflow::gtl::ArraySlice<PrimitiveType> input_element_types,
+    int64 inputs) {
+  if (reducer_shape.parameters_size() != inputs * 2) {
+    return InvalidArgument(
+        "Reduction function must take %lld parameters, but "
         "takes %d parameter(s).",
-        reducer_shape.parameters_size());
+        inputs * 2, reducer_shape.parameters_size());
   }
 
   const Shape& accumulator_shape = reducer_shape.result();
-  if (!ShapeUtil::IsArray(accumulator_shape) ||
-      ShapeUtil::Rank(accumulator_shape) != 0) {
-    return InvalidArgument(
-        "Reduction function must produce a scalar but has shape: %s",
-        ShapeUtil::HumanString(accumulator_shape).c_str());
-  }
-
-  // Check that the accumulator can be passed in as the first argument.
-  // Note: comparing here and below with Compatible since we don't care about
-  // layout in scalars - see b/26668201 for a longer-term vision.
-  if (!ShapeUtil::Compatible(accumulator_shape, reducer_shape.parameters(0))) {
+  std::vector<const Shape*> accumulator_subshapes;
+  if (ShapeUtil::IsArray(accumulator_shape)) {
+    if (inputs != 1) {
+      return InvalidArgument(
+          "Reduction function must produce a tuple with %lld elements, but "
+          "produces a scalar",
+          inputs);
+    }
+    accumulator_subshapes.push_back(&accumulator_shape);
+  } else if (ShapeUtil::IsTuple(accumulator_shape)) {
+    if (ShapeUtil::TupleElementCount(accumulator_shape) != inputs) {
+      return InvalidArgument(
+          "Reduction function must produce a tuple with %lld elements, but has "
+          "%lld elements",
+          inputs, ShapeUtil::TupleElementCount(accumulator_shape));
+    }
+    for (const Shape& element_shape : accumulator_shape.tuple_shapes()) {
+      accumulator_subshapes.push_back(&element_shape);
+    }
+  } else {
     return InvalidArgument(
-        "Reduction function's first parameter shape differs from the "
-        "result shape: %s vs %s",
-        ShapeUtil::HumanString(reducer_shape.parameters(0)).c_str(),
+        "Reduction function must produce a scalar or tuple of scalars, but has "
+        "shape: %s",
         ShapeUtil::HumanString(accumulator_shape).c_str());
   }
 
-  // Check that init_value's shape is suitable for reducer_shape.
-  if (!ShapeUtil::CompatibleIgnoringFpPrecision(accumulator_shape,
-                                                init_value_shape)) {
-    return InvalidArgument(
-        "Reduction function's accumulator shape differs from the "
-        "init_value shape: %s vs %s",
-        ShapeUtil::HumanString(accumulator_shape).c_str(),
-        ShapeUtil::HumanString(init_value_shape).c_str());
-  }
-
-  // Check that the inputs can be passed in as the second argument.
-  const Shape& input_element_shape =
-      ShapeUtil::MakeShape(input_element_type, {});
-  if (!ShapeUtil::CompatibleIgnoringFpPrecision(input_element_shape,
-                                                reducer_shape.parameters(1))) {
-    return InvalidArgument(
-        "Reduction function's second parameter shape differs from the "
-        "input type element type: %s vs %s",
-        ShapeUtil::HumanString(reducer_shape.parameters(1)).c_str(),
-        ShapeUtil::HumanString(input_element_shape).c_str());
+  for (const Shape* element_shape : accumulator_subshapes) {
+    if (ShapeUtil::Rank(*element_shape) != 0) {
+      return InvalidArgument(
+          "Reduction function must return a scalar or tuple of scalars but "
+          "returns shape: %s",
+          ShapeUtil::HumanString(accumulator_shape).c_str());
+    }
   }
 
-  // Currently the accumulator and inputs must be the same type,
-  // though that restriction could be relaxed.
-  if (!ShapeUtil::CompatibleIgnoringFpPrecision(accumulator_shape,
-                                                reducer_shape.parameters(1))) {
-    return InvalidArgument(
-        "Reduction function's second parameter shape must "
-        "match the result shape, but got %s vs %s.",
-        ShapeUtil::HumanString(reducer_shape.parameters(1)).c_str(),
-        ShapeUtil::HumanString(accumulator_shape).c_str());
+  for (int64 i = 0; i < inputs; ++i) {
+    // Check that the accumulator can be passed in as the first argument.
+    // Note: comparing here and below with Compatible since we don't care about
+    // layout in scalars - see b/26668201 for a longer-term vision.
+    if (!ShapeUtil::Compatible(*accumulator_subshapes[i],
+                               reducer_shape.parameters(i))) {
+      return InvalidArgument(
+          "Reduction function's %lld-th parameter shape differs from the "
+          "result shape: %s vs %s",
+          i, ShapeUtil::HumanString(reducer_shape.parameters(i)).c_str(),
+          ShapeUtil::HumanString(*accumulator_subshapes[i]).c_str());
+    }
+    // Check that init_value's shapes are suitable for reducer_shape.
+    if (!ShapeUtil::CompatibleIgnoringFpPrecision(*accumulator_subshapes[i],
+                                                  *init_value_shapes[i])) {
+      return InvalidArgument(
+          "Reduction function's accumulator shape at index %lld differs from "
+          "the init_value shape: %s vs %s",
+          i, ShapeUtil::HumanString(*accumulator_subshapes[i]).c_str(),
+          ShapeUtil::HumanString(*init_value_shapes[i]).c_str());
+    }
+    // Check that the inputs can be passed in as the non-accumulator arguments.
+    const Shape input_element_shape =
+        ShapeUtil::MakeShape(input_element_types[i], {});
+    if (!ShapeUtil::CompatibleIgnoringFpPrecision(
+            input_element_shape, reducer_shape.parameters(inputs + i))) {
+      return InvalidArgument(
+          "Reduction function's %lld-th parameter shape differs from the "
+          "input type element type: %s vs %s",
+          inputs + i,
+          ShapeUtil::HumanString(reducer_shape.parameters(inputs + i)).c_str(),
+          ShapeUtil::HumanString(input_element_shape).c_str());
+    }
+    // Check that the accumulator and inputs to the reducer function match.
+    // If the accumulator is scalar, it must have the same type as the inputs
+    // (up to fp precision). If it is a tuple, then the k-th element of the
+    // tuple must have the same type as the K-th input (again, up to fp
+    // precision.)
+    if (!ShapeUtil::CompatibleIgnoringFpPrecision(
+            *accumulator_subshapes[i], reducer_shape.parameters(inputs + i))) {
+      return InvalidArgument(
+          "Reduction function's %lld-th parameter shape must "
+          "match the result shape, but got %s vs %s.",
+          inputs + i,
+          ShapeUtil::HumanString(reducer_shape.parameters(inputs + i)).c_str(),
+          ShapeUtil::HumanString(*accumulator_subshapes[i]).c_str());
+    }
   }
 
   return Status::OK();
@@ -222,13 +257,16 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
       return shape;
     case HloOpcode::kReal:
     case HloOpcode::kImag:
-      if (!ShapeUtil::ElementIsComplex(shape)) {
+      if (ShapeUtil::ElementIsComplex(shape)) {
+        return ShapeUtil::ComplexComponentShape(shape);
+      } else if (ShapeUtil::ElementIsFloating(shape)) {
+        return shape;
+      } else {
         return InvalidArgument(
-            "Expected element type in shape to be complex for real/imag "
-            "operation; got %s.",
+            "Expected element type in shape to be floating or complex for "
+            "real/imag operation; got %s.",
             PrimitiveType_Name(shape.element_type()).c_str());
       }
-      return ShapeUtil::ChangeElementType(shape, F32);
     case HloOpcode::kAbs:
       if (ShapeUtil::ElementIsComplex(shape)) {
         return ShapeUtil::ChangeElementType(
@@ -929,6 +967,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
       return InferClampShape(lhs, rhs, ehs);
     case HloOpcode::kSelect:
       return InferSelectShape(lhs, rhs, ehs);
+    case HloOpcode::kTupleSelect:
+      return InferTupleSelectShape(lhs, rhs, ehs);
     default:
       return InvalidArgument("Unknown operation %s.",
                              HloOpcodeString(opcode).c_str());
@@ -965,6 +1005,14 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
       if (operand_shapes.size() == 1) {
         return *operand_shapes[0];
       } else if (operand_shapes.size() == 2) {
+        if (!ShapeUtil::SameDimensions(*operand_shapes[0],
+                                       *operand_shapes[1])) {
+          return InvalidArgument(
+              "Sort keys and values dimensions must match. "
+              "Keys shape is: %s\n, Values shape is: %s",
+              ShapeUtil::HumanString(*operand_shapes[0]).c_str(),
+              ShapeUtil::HumanString(*operand_shapes[1]).c_str());
+        }
         return ShapeUtil::MakeTupleShape(
             {*operand_shapes[0], *operand_shapes[1]});
       }
@@ -1732,10 +1780,37 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
 }
 
 /* static */ StatusOr<Shape> ShapeInference::InferReduceShape(
-    const Shape& arg, const Shape& init_value,
+    tensorflow::gtl::ArraySlice<const Shape*> arg_shapes,
     tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
     const ProgramShape& to_apply) {
-  // Check that the dimension to reduce are in-bounds for the given shape.
+  if (arg_shapes.empty()) {
+    return InvalidArgument("Reduce must have at least 2 arguments, has 0");
+  }
+  if (arg_shapes.size() % 2) {
+    return InvalidArgument(
+        "Reduce must have an even number of arguments, has %lu",
+        arg_shapes.size());
+  }
+  int64 num_reduced_args = arg_shapes.size() / 2;
+
+  tensorflow::gtl::ArraySlice<const Shape*> reduced_args(arg_shapes, 0,
+                                                         num_reduced_args);
+  // Check that all of the reduced tensors have the same dimensions. The element
+  // types may be different.
+  for (int64 i = 1; i < num_reduced_args; ++i) {
+    if (!ShapeUtil::SameDimensions(*reduced_args[0], *reduced_args[i])) {
+      return InvalidArgument(
+          "All reduced tensors must have the sime dimension. Tensor 0 has "
+          "shape %s, Tensor %lld has shape %s",
+          ShapeUtil::HumanString(*reduced_args[0]).c_str(), i,
+          ShapeUtil::HumanString(*reduced_args[i]).c_str());
+    }
+  }
+
+  // Check that the dimensions to reduce are in-bounds for the given shape.
+  // We've already verified all reduced tensors have the same dimensions, so it
+  // doesn't matter which one we choose.
+  const Shape& arg = *reduced_args[0];
   for (int64 dimension : dimensions_to_reduce) {
     if (dimension >= ShapeUtil::Rank(arg) || dimension < 0) {
       return InvalidArgument(
@@ -1743,8 +1818,15 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
           ShapeUtil::HumanString(arg).c_str());
     }
   }
-  TF_RETURN_IF_ERROR(
-      VerifyReducerShape(to_apply, init_value, arg.element_type()));
+
+  tensorflow::gtl::ArraySlice<const Shape*> init_values(
+      arg_shapes, num_reduced_args, arg_shapes.size());
+  std::vector<PrimitiveType> element_types;
+  for (const Shape* arg : reduced_args) {
+    element_types.push_back(arg->element_type());
+  }
+  TF_RETURN_IF_ERROR(VerifyReducerShape(to_apply, init_values, element_types,
+                                        num_reduced_args));
 
   std::set<int64> dimensions_to_reduce_set(dimensions_to_reduce.begin(),
                                            dimensions_to_reduce.end());
@@ -1755,15 +1837,26 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
     }
   }
 
-  return ShapeUtil::MakeShape(to_apply.result().element_type(), new_dimensions);
+  if (ShapeUtil::IsScalar(to_apply.result())) {
+    return ShapeUtil::MakeShape(to_apply.result().element_type(),
+                                new_dimensions);
+  } else {
+    std::vector<Shape> result_subshapes;
+    for (const Shape& subshape : to_apply.result().tuple_shapes()) {
+      result_subshapes.push_back(
+          ShapeUtil::MakeShape(subshape.element_type(), new_dimensions));
+    }
+    return ShapeUtil::MakeTupleShape(result_subshapes);
+  }
 }
 
 /* static */ StatusOr<Shape> ShapeInference::InferReduceWindowShape(
     const Shape& operand_shape, const Shape& init_value_shape,
     const Window& window, const ProgramShape& to_apply_shape) {
   TF_RETURN_IF_ERROR(ExpectArray(operand_shape, "operand of reduce-window"));
-  TF_RETURN_IF_ERROR(VerifyReducerShape(to_apply_shape, init_value_shape,
-                                        operand_shape.element_type()));
+  TF_RETURN_IF_ERROR(VerifyReducerShape(to_apply_shape, {&init_value_shape},
+                                        {operand_shape.element_type()},
+                                        /*inputs=*/1));
   return InferWindowOutputShape(operand_shape, window,
                                 init_value_shape.element_type(),
                                 /*allow_negative_padding=*/false);
@@ -1808,8 +1901,9 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
   }
 
   // Check if the scatter function has a proper shape as a reduction.
-  TF_RETURN_IF_ERROR(VerifyReducerShape(scatter_shape, init_value_shape,
-                                        source_shape.element_type()));
+  TF_RETURN_IF_ERROR(VerifyReducerShape(scatter_shape, {&init_value_shape},
+                                        {source_shape.element_type()},
+                                        /*inputs=*/1));
 
   // Check if the result shape of window operation matches the source shape.
   TF_ASSIGN_OR_RETURN(const Shape& window_result_shape,
@@ -2267,15 +2361,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
 // broadcast from all operands, not just the predicate.
 /* static */ StatusOr<Shape> ShapeInference::InferSelectShape(
     const Shape& pred, const Shape& on_true, const Shape& on_false) {
-  bool compatible;
-  if (ShapeUtil::IsTuple(on_true)) {
-    // Select only defines the top-level buffer, so if it's a tuple, the two
-    // input must match exactly.
-    compatible = ShapeUtil::Compatible(on_true, on_false);
-  } else {
-    compatible = ShapeUtil::CompatibleIgnoringFpPrecision(on_true, on_false);
-  }
-  if (!compatible) {
+  if (!ShapeUtil::CompatibleIgnoringFpPrecision(on_true, on_false)) {
     return InvalidArgument(
         "Operands to select must be the same shape; got %s and %s.",
         ShapeUtil::HumanString(on_true).c_str(),
@@ -2287,7 +2373,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
         ShapeUtil::HumanString(pred).c_str());
   }
   if (ShapeUtil::CompatibleIgnoringElementType(pred, on_true) ||
-      ShapeUtil::Rank(pred) == 0) {
+      ShapeUtil::IsScalar(pred)) {
     // By this stage we know that pred's element type is PRED. Therefore, this
     // check restricts pred to be a PRED scalar, or a PRED array with the same
     // dimensions as on_true and on_false.
@@ -2301,6 +2387,29 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(HloOpcode operation,
   }
 }
 
+/* static */ StatusOr<Shape> ShapeInference::InferTupleSelectShape(
+    const Shape& pred, const Shape& on_true, const Shape& on_false) {
+  // Select only defines the top-level buffer, so if it's a tuple, the two
+  // input must match exactly.
+  if (!ShapeUtil::Compatible(on_true, on_false)) {
+    return InvalidArgument(
+        "Operands to tuple-select must be the same shape; got %s and %s.",
+        ShapeUtil::HumanString(on_true).c_str(),
+        ShapeUtil::HumanString(on_false).c_str());
+  }
+  if (pred.element_type() != PRED) {
+    return InvalidArgument(
+        "TupleSelect's pred operand must have PRED element type; got %s.",
+        ShapeUtil::HumanString(pred).c_str());
+  }
+  if (!ShapeUtil::IsScalar(pred)) {
+    return InvalidArgument(
+        "TupleSelect operation with non-scalar predicate: %s.",
+        ShapeUtil::HumanString(pred).c_str());
+  }
+  return on_true;
+}
+
 /* static */ StatusOr<Shape> ShapeInference::InferCallShape(
     tensorflow::gtl::ArraySlice<const Shape*> arg_shapes,
     const ProgramShape& to_apply) {
@@ -2540,4 +2649,194 @@ static Status ValidateGatherDimensionNumbers(
   return ShapeUtil::MakeShape(input_shape.element_type(), output_dim_bounds);
 }
 
+namespace {
+
+Status ValidateScatterDimensionNumbers(
+    const Shape& operand_shape,
+    tensorflow::gtl::ArraySlice<int64> scatter_indices_shape,
+    const Shape& updates_shape, const ScatterDimensionNumbers& dim_numbers) {
+  // Validate update_window_dims in ScatterDimensionNumbers.
+  if (!c_is_sorted(dim_numbers.update_window_dims())) {
+    return InvalidArgument(
+        "update_window_dims in scatter op must be sorted; got: %s.",
+        Join(dim_numbers.update_window_dims(), ", ").c_str());
+  }
+  if (c_adjacent_find(dim_numbers.update_window_dims()) !=
+      dim_numbers.update_window_dims().end()) {
+    return InvalidArgument(
+        "update_window_dims in scatter op must not repeat; got: %s.",
+        Join(dim_numbers.update_window_dims(), ", ").c_str());
+  }
+  const int64 updates_rank = ShapeUtil::Rank(updates_shape);
+  for (int64 window_dim : dim_numbers.update_window_dims()) {
+    if (window_dim < 0 || window_dim >= updates_rank) {
+      return InvalidArgument(
+          "Invalid update_window_dims set in scatter op; valid range is [0, "
+          "%lld). got: %lld.",
+          updates_rank, window_dim);
+    }
+  }
+
+  // Validate inserted_window_dims in ScatterDimensionNumbers.
+  if (!c_is_sorted(dim_numbers.inserted_window_dims())) {
+    return InvalidArgument(
+        "inserted_window_dims in scatter op must be sorted; got: %s.",
+        Join(dim_numbers.inserted_window_dims(), ", ").c_str());
+  }
+  if (c_adjacent_find(dim_numbers.inserted_window_dims()) !=
+      dim_numbers.inserted_window_dims().end()) {
+    return InvalidArgument(
+        "inserted_window_dims in scatter op must not repeat; got: %s.",
+        Join(dim_numbers.inserted_window_dims(), ", ").c_str());
+  }
+  for (int64 inserted_dim : dim_numbers.inserted_window_dims()) {
+    if (inserted_dim < 0 || inserted_dim >= operand_shape.dimensions_size()) {
+      return InvalidArgument(
+          "Invalid inserted_window_dims set in scatter op; valid range is [0, "
+          "%d), got: %lld.",
+          operand_shape.dimensions_size(), inserted_dim);
+    }
+  }
+
+  // Validate scatter_dims_to_operand_dims in ScatterDimensionNumbers.
+  if (dim_numbers.scatter_dims_to_operand_dims_size() !=
+      scatter_indices_shape[dim_numbers.index_vector_dim()]) {
+    return InvalidArgument(
+        "Scatter op has %d elements in scatter_dims_to_operand_dims and the "
+        "bound of dimension index_vector_dim=%lld of scatter_indices is %lld. "
+        "These two numbers must be equal.",
+        dim_numbers.scatter_dims_to_operand_dims_size(),
+        dim_numbers.index_vector_dim(),
+        scatter_indices_shape[dim_numbers.index_vector_dim()]);
+  }
+  for (int i = 0; i < dim_numbers.scatter_dims_to_operand_dims_size(); ++i) {
+    int64 scatter_dim_to_operand_dim =
+        dim_numbers.scatter_dims_to_operand_dims(i);
+    if (scatter_dim_to_operand_dim < 0 ||
+        scatter_dim_to_operand_dim >= operand_shape.dimensions_size()) {
+      return InvalidArgument(
+          "Invalid scatter_dims_to_operand_dims mapping; domain is [0, %d), "
+          "got: %d->%lld.",
+          operand_shape.dimensions_size(), i, scatter_dim_to_operand_dim);
+    }
+  }
+  std::vector<int64> sorted_scatter_dims_to_operand_dims(
+      dim_numbers.scatter_dims_to_operand_dims().begin(),
+      dim_numbers.scatter_dims_to_operand_dims().end());
+  c_sort(sorted_scatter_dims_to_operand_dims);
+  if (c_adjacent_find(sorted_scatter_dims_to_operand_dims) !=
+      sorted_scatter_dims_to_operand_dims.end()) {
+    return InvalidArgument(
+        "Repeated dimensions not allowed in scatter_dims_to_operand_dims; "
+        "got: %s.",
+        Join(dim_numbers.scatter_dims_to_operand_dims(), ", ").c_str());
+  }
+
+  return Status::OK();
+}
+
+}  // namespace
+
+/*static*/ StatusOr<Shape> ShapeInference::InferScatterShape(
+    const Shape& operand_shape, const Shape& scatter_indices_shape,
+    const Shape& updates_shape, const ProgramShape& to_apply_shape,
+    const ScatterDimensionNumbers& scatter_dim_numbers) {
+  TF_RETURN_IF_ERROR(
+      ExpectArray(operand_shape, "operand tensor of scatter op"));
+  TF_RETURN_IF_ERROR(
+      ExpectArray(scatter_indices_shape, "scatter indices of scatter op"));
+  TF_RETURN_IF_ERROR(ExpectArray(updates_shape, "updates of scatter op"));
+
+  if (!ShapeUtil::ElementIsIntegral(scatter_indices_shape)) {
+    return InvalidArgument(
+        "Scatter indices parameter must be an integral tensor; got %s.",
+        ShapeUtil::HumanString(scatter_indices_shape).c_str());
+  }
+
+  if (scatter_indices_shape.dimensions_size() <
+          scatter_dim_numbers.index_vector_dim() ||
+      scatter_dim_numbers.index_vector_dim() < 0) {
+    return InvalidArgument(
+        "Scatter index leaf dimension must be within [0, rank(scatter_indices)"
+        " + 1). rank(scatter_indices) is %d and scatter index leaf dimension "
+        "is %lld.",
+        scatter_indices_shape.dimensions_size(),
+        scatter_dim_numbers.index_vector_dim());
+  }
+
+  // Check if the update computation has a proper shape as a reduction.
+  const Shape init_value_shape =
+      ShapeUtil::MakeShape(operand_shape.element_type(), {});
+  TF_RETURN_IF_ERROR(VerifyReducerShape(to_apply_shape, {&init_value_shape},
+                                        {updates_shape.element_type()},
+                                        /*inputs=*/1));
+
+  std::vector<int64> expanded_scatter_indices_shape =
+      ArraySliceToVector(AsInt64Slice(scatter_indices_shape.dimensions()));
+  if (expanded_scatter_indices_shape.size() ==
+      scatter_dim_numbers.index_vector_dim()) {
+    expanded_scatter_indices_shape.push_back(1);
+  }
+
+  int64 expected_updates_rank = expanded_scatter_indices_shape.size() - 1 +
+                                scatter_dim_numbers.update_window_dims_size();
+  if (ShapeUtil::Rank(updates_shape) != expected_updates_rank) {
+    return InvalidArgument("Updates tensor must be of rank %lld; got %lld.",
+                           expected_updates_rank,
+                           ShapeUtil::Rank(updates_shape));
+  }
+
+  TF_RETURN_IF_ERROR(ValidateScatterDimensionNumbers(
+      operand_shape, expanded_scatter_indices_shape, updates_shape,
+      scatter_dim_numbers));
+
+  int64 inserted_dims_seen = 0;
+  std::vector<int64> max_update_window_bounds;
+  for (int i = 0; i < operand_shape.dimensions_size(); ++i) {
+    if (inserted_dims_seen < scatter_dim_numbers.inserted_window_dims_size() &&
+        scatter_dim_numbers.inserted_window_dims(inserted_dims_seen) == i) {
+      ++inserted_dims_seen;
+    } else {
+      max_update_window_bounds.push_back(operand_shape.dimensions(i));
+    }
+  }
+  for (int i = 0; i < scatter_dim_numbers.update_window_dims_size(); ++i) {
+    auto update_window_dim = scatter_dim_numbers.update_window_dims(i);
+    if (updates_shape.dimensions(update_window_dim) >
+        max_update_window_bounds[i]) {
+      return InvalidArgument(
+          "Bounds of the window dimensions of updates must not exceed the "
+          "bounds of the corresponding dimensions of operand. For dimension "
+          "%lld, updates bound is %lld, operand bound is %lld.",
+          update_window_dim, updates_shape.dimensions(update_window_dim),
+          max_update_window_bounds[i]);
+    }
+  }
+
+  int64 scatter_dims_seen = 0;
+  for (int64 i = 0; i < ShapeUtil::Rank(updates_shape); ++i) {
+    bool is_update_window_dim =
+        c_binary_search(scatter_dim_numbers.update_window_dims(), i);
+    if (is_update_window_dim) {
+      continue;
+    }
+    if (scatter_dims_seen == scatter_dim_numbers.index_vector_dim()) {
+      ++scatter_dims_seen;
+    }
+    if (updates_shape.dimensions(i) !=
+        expanded_scatter_indices_shape[scatter_dims_seen]) {
+      return InvalidArgument(
+          "Bounds of the scatter dimensions of updates must be same as the "
+          "bounds of the corresponding dimensions of scatter indices. For "
+          "scatter dimension %lld, updates bound is %lld, scatter_indices "
+          "bound is %lld.",
+          i, updates_shape.dimensions(i),
+          expanded_scatter_indices_shape[scatter_dims_seen]);
+    }
+    ++scatter_dims_seen;
+  }
+
+  return operand_shape;
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/shape_inference.h b/tensorflow/compiler/xla/service/shape_inference.h
index ad34a2aa18..33da323b3d 100644
--- a/tensorflow/compiler/xla/service/shape_inference.h
+++ b/tensorflow/compiler/xla/service/shape_inference.h
@@ -131,7 +131,7 @@ class ShapeInference {
   // index as the leading parameter, and the program shape should match
   // accordingly (or an error will result).
   static StatusOr<Shape> InferReduceShape(
-      const Shape& arg, const Shape& init_value,
+      tensorflow::gtl::ArraySlice<const Shape*> arg_shapes,
       tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce,
       const ProgramShape& to_apply);
 
@@ -268,6 +268,14 @@ class ShapeInference {
       const GatherDimensionNumbers& gather_dim_numbers,
       tensorflow::gtl::ArraySlice<int64> window_bounds);
 
+  // Helper that validates the given input shape, scatter indices shape, updates
+  // shape, and scatter dimension numbers that constitute a scatter operation,
+  // and returns the result shape of the scatter operation.
+  static StatusOr<Shape> InferScatterShape(
+      const Shape& operand_shape, const Shape& scatter_indices_shape,
+      const Shape& updates_shape, const ProgramShape& to_apply_shape,
+      const ScatterDimensionNumbers& scatter_dim_numbers);
+
  private:
   // Helper that infers the shape produced by performing an element-wise binary
   // operation with the given LHS and RHS shapes.
@@ -286,6 +294,10 @@ class ShapeInference {
   static StatusOr<Shape> InferSelectShape(const Shape& pred,
                                           const Shape& on_true,
                                           const Shape& on_false);
+  // Helper for inferring the shape of TupleSelect ops.
+  static StatusOr<Shape> InferTupleSelectShape(const Shape& pred,
+                                               const Shape& on_true,
+                                               const Shape& on_false);
 
   // Helper for inferring shapes of binary operations which use degenerate
   // dimension broadcasting (a dimension of size 1 in one operand is broadcast
diff --git a/tensorflow/compiler/xla/service/shape_inference_test.cc b/tensorflow/compiler/xla/service/shape_inference_test.cc
index bafe14d6f4..a73fa181cd 100644
--- a/tensorflow/compiler/xla/service/shape_inference_test.cc
+++ b/tensorflow/compiler/xla/service/shape_inference_test.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <string>
 
+#include "tensorflow/compiler/xla/service/hlo_instructions.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -62,7 +63,7 @@ class ReduceShapeInferenceTest : public ShapeInferenceTest {
       tensorflow::gtl::ArraySlice<int64> dimensions_to_reduce) {
     ProgramShape to_apply = ShapeUtil::MakeProgramShape({f32_, f32_}, f32_);
     auto inferred_status = ShapeInference::InferReduceShape(
-        arg, f32_, dimensions_to_reduce, to_apply);
+        {&arg, &f32_}, dimensions_to_reduce, to_apply);
     EXPECT_IS_OK(inferred_status.status());
     EXPECT_TRUE(ShapeUtil::Equal(expected_inferred_shape,
                                  inferred_status.ValueOrDie()));
@@ -702,11 +703,99 @@ TEST_F(ReduceShapeInferenceTest, ReduceCubeAmongAllDimensions) {
                             /*dimensions_to_reduce=*/{0, 1, 2});
 }
 
+TEST_F(ReduceShapeInferenceTest, ReduceMultiOutput) {
+  Shape f32_arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
+  Shape s32_arg_shape = ShapeUtil::MakeShape(S32, {5, 3});
+  ProgramShape to_apply = ShapeUtil::MakeProgramShape(
+      {f32_, s32_, f32_, s32_}, ShapeUtil::MakeTupleShape({f32_, s32_}));
+  auto inferred_status = ShapeInference::InferReduceShape(
+      {&f32_arg_shape, &s32_arg_shape, &f32_, &s32_}, {0, 1}, to_apply);
+  EXPECT_IS_OK(inferred_status.status());
+  EXPECT_TRUE(ShapeUtil::Equal(ShapeUtil::MakeTupleShape({f32_, s32_}),
+                               inferred_status.ValueOrDie()));
+}
+
+TEST_F(ReduceShapeInferenceTest, ErrorMultiOutputBadReducerInput1) {
+  Shape f32_arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
+  Shape s32_arg_shape = ShapeUtil::MakeShape(S32, {5, 3});
+  ProgramShape to_apply =
+      ShapeUtil::MakeProgramShape({f32_, s32_, f32_, s32_, f32_, s32_},
+                                  ShapeUtil::MakeTupleShape({f32_, s32_}));
+  auto inferred_status = ShapeInference::InferReduceShape(
+      {&f32_arg_shape, &s32_arg_shape, &f32_, &s32_}, {0, 1}, to_apply);
+  EXPECT_FALSE(inferred_status.ok());
+  EXPECT_THAT(inferred_status.status().error_message(),
+              HasSubstr("must take 4 parameters, but takes 6 parameter(s)"));
+}
+
+TEST_F(ReduceShapeInferenceTest, ErrorMultiOutputBadReducerInput2) {
+  Shape f32_arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
+  Shape s32_arg_shape = ShapeUtil::MakeShape(S32, {5, 3});
+  ProgramShape to_apply = ShapeUtil::MakeProgramShape(
+      {s32_, s32_, f32_, s32_}, ShapeUtil::MakeTupleShape({f32_, s32_}));
+  auto inferred_status = ShapeInference::InferReduceShape(
+      {&f32_arg_shape, &s32_arg_shape, &f32_, &s32_}, {0, 1}, to_apply);
+  EXPECT_FALSE(inferred_status.ok());
+  EXPECT_THAT(
+      inferred_status.status().error_message(),
+      HasSubstr(
+          "parameter shape differs from the result shape: s32[] vs f32[]"));
+}
+
+TEST_F(ReduceShapeInferenceTest, ErrorMultiOutputBadReducerInput3) {
+  ProgramShape to_apply = ShapeUtil::MakeProgramShape(
+      {s32_, s32_, f32_, s32_}, ShapeUtil::MakeTupleShape({f32_, s32_}));
+  auto inferred_status = ShapeInference::InferReduceShape({}, {0, 1}, to_apply);
+  EXPECT_FALSE(inferred_status.ok());
+  EXPECT_THAT(inferred_status.status().error_message(),
+              HasSubstr("must have at least 2 arguments, has 0"));
+}
+
+TEST_F(ReduceShapeInferenceTest, ErrorMultiOutputBadReducerOutput1) {
+  Shape f32_arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
+  Shape s32_arg_shape = ShapeUtil::MakeShape(S32, {5, 3});
+  ProgramShape to_apply =
+      ShapeUtil::MakeProgramShape({f32_, s32_, f32_, s32_}, f32_);
+  auto inferred_status = ShapeInference::InferReduceShape(
+      {&f32_arg_shape, &s32_arg_shape, &f32_, &s32_}, {0, 1}, to_apply);
+  EXPECT_FALSE(inferred_status.ok());
+  EXPECT_THAT(
+      inferred_status.status().error_message(),
+      HasSubstr("must produce a tuple with 2 elements, but produces a scalar"));
+}
+
+TEST_F(ReduceShapeInferenceTest, ErrorMultiOutputBadReducerOutput2) {
+  Shape f32_arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
+  Shape s32_arg_shape = ShapeUtil::MakeShape(S32, {5, 3});
+  ProgramShape to_apply = ShapeUtil::MakeProgramShape(
+      {f32_, s32_, f32_, s32_}, ShapeUtil::MakeTupleShape({f32_, s32_, s32_}));
+  auto inferred_status = ShapeInference::InferReduceShape(
+      {&f32_arg_shape, &s32_arg_shape, &f32_, &s32_}, {0, 1}, to_apply);
+  EXPECT_FALSE(inferred_status.ok());
+  EXPECT_THAT(
+      inferred_status.status().error_message(),
+      HasSubstr("must produce a tuple with 2 elements, but has 3 elements"));
+}
+
+TEST_F(ReduceShapeInferenceTest, ErrorMultiOutputBadReducerBoth) {
+  Shape f32_arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
+  Shape s32_arg_shape = ShapeUtil::MakeShape(S32, {5, 3});
+  ProgramShape to_apply = ShapeUtil::MakeProgramShape(
+      {s32_, s32_, s32_, s32_}, ShapeUtil::MakeTupleShape({s32_, s32_}));
+  auto inferred_status = ShapeInference::InferReduceShape(
+      {&f32_arg_shape, &s32_arg_shape, &f32_, &s32_}, {0, 1}, to_apply);
+  EXPECT_FALSE(inferred_status.ok());
+  EXPECT_THAT(inferred_status.status().error_message(),
+              HasSubstr("accumulator shape at index 0 differs from the "
+                        "init_value shape: s32[] vs f32[]"));
+}
+
 TEST_F(ReduceShapeInferenceTest, ErrorOutOfBoundsDimension) {
   ProgramShape to_apply = ShapeUtil::MakeProgramShape({f32_, f32_}, f32_);
+  Shape arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
   auto inferred_status = ShapeInference::InferReduceShape(
-      ShapeUtil::MakeShape(F32, {5, 3}), f32_, /*dimensions_to_reduce=*/{3, 4},
-      to_apply);
+      {&arg_shape, &f32_},
+      /*dimensions_to_reduce=*/{3, 4}, to_apply);
   EXPECT_FALSE(inferred_status.ok());
   EXPECT_THAT(inferred_status.status().error_message(),
               HasSubstr("out-of-bounds dimension"));
@@ -714,8 +803,9 @@ TEST_F(ReduceShapeInferenceTest, ErrorOutOfBoundsDimension) {
 
 TEST_F(ReduceShapeInferenceTest, ErrorToApplyArity) {
   ProgramShape to_apply = ShapeUtil::MakeProgramShape({f32_, f32_, f32_}, f32_);
+  Shape arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
   auto inferred_status =
-      ShapeInference::InferReduceShape(ShapeUtil::MakeShape(F32, {5, 3}), f32_,
+      ShapeInference::InferReduceShape({&arg_shape, &f32_},
                                        /*dimensions_to_reduce=*/{0}, to_apply);
   EXPECT_FALSE(inferred_status.ok());
   EXPECT_THAT(inferred_status.status().error_message(),
@@ -724,12 +814,13 @@ TEST_F(ReduceShapeInferenceTest, ErrorToApplyArity) {
 
 TEST_F(ReduceShapeInferenceTest, ErrorElementTypeVsApplyType) {
   ProgramShape to_apply = ShapeUtil::MakeProgramShape({f32_, f32_}, s32_);
+  Shape arg_shape = ShapeUtil::MakeShape(F32, {5, 3});
   auto inferred_status =
-      ShapeInference::InferReduceShape(ShapeUtil::MakeShape(F32, {5, 3}), f32_,
+      ShapeInference::InferReduceShape({&arg_shape, &f32_},
                                        /*dimensions_to_reduce=*/{0}, to_apply);
   EXPECT_FALSE(inferred_status.ok());
   EXPECT_THAT(inferred_status.status().error_message(),
-              HasSubstr("first parameter shape differs"));
+              HasSubstr("0-th parameter shape differs"));
 }
 
 TEST_F(ShapeInferenceTest, InferSliceShapeRank2) {
@@ -1523,7 +1614,19 @@ TEST_F(ShapeInferenceTest, BadSlice) {
       << statusor.status();
 }
 
-class GatherShapeInferenceTest : public ShapeInferenceTest {
+TEST_F(ShapeInferenceTest, BadSort) {
+  auto keys = ShapeUtil::MakeShape(F32, {4});
+  auto values = ShapeUtil::MakeShape(F32, {5});
+  StatusOr<Shape> statusor =
+      ShapeInference::InferVariadicOpShape(HloOpcode::kSort, {&keys, &values});
+  ASSERT_FALSE(statusor.ok());
+
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("dimensions must match"))
+      << statusor.status();
+}
+
+class ScatterGatherShapeInferenceTest : public ShapeInferenceTest {
  protected:
   const Shape s64_scalar_ = ShapeUtil::MakeShape(S64, {});
   const Shape s64_vector_5_ = ShapeUtil::MakeShape(S64, {5});
@@ -1540,59 +1643,63 @@ class GatherShapeInferenceTest : public ShapeInferenceTest {
       ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
   const Shape tuple_shape_ = ShapeUtil::MakeTupleShape(
       {s64_4d_tensor_10_9_8_7_1_, s64_4d_tensor_10_9_8_7_1_});
+  const ProgramShape to_apply_ =
+      ShapeUtil::MakeProgramShape({f32_, f32_}, f32_);
 };
 
-TEST_F(GatherShapeInferenceTest, TensorFlowGather) {
-  TF_ASSERT_OK_AND_ASSIGN(
-      Shape gather_shape,
-      ShapeInference::InferGatherShape(matrix_64_48_, s64_vector_32_,
-                                       HloInstruction::MakeGatherDimNumbers(
-                                           /*output_window_dims=*/{0},
-                                           /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1},
-                                           /*index_vector_dim=*/1),
-                                       /*window_bounds=*/{64, 1}));
+// Shape inference tests for Gather.
+
+TEST_F(ScatterGatherShapeInferenceTest, TensorFlowGather) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape gather_shape,
+                          ShapeInference::InferGatherShape(
+                              matrix_64_48_, s64_vector_32_,
+                              HloGatherInstruction::MakeGatherDimNumbers(
+                                  /*output_window_dims=*/{0},
+                                  /*elided_window_dims=*/{1},
+                                  /*gather_dims_to_operand_dims=*/{1},
+                                  /*index_vector_dim=*/1),
+                              /*window_bounds=*/{64, 1}));
   EXPECT_TRUE(
       ShapeUtil::Equal(gather_shape, ShapeUtil::MakeShape(F32, {64, 32})))
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, TensorFlowGatherV2) {
-  TF_ASSERT_OK_AND_ASSIGN(
-      Shape gather_shape,
-      ShapeInference::InferGatherShape(matrix_64_48_, s64_vector_32_,
-                                       HloInstruction::MakeGatherDimNumbers(
-                                           /*output_window_dims=*/{1},
-                                           /*elided_window_dims=*/{0},
-                                           /*gather_dims_to_operand_dims=*/{0},
-                                           /*index_vector_dim=*/1),
-                                       /*window_bounds=*/{1, 48}));
+TEST_F(ScatterGatherShapeInferenceTest, TensorFlowGatherV2) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape gather_shape,
+                          ShapeInference::InferGatherShape(
+                              matrix_64_48_, s64_vector_32_,
+                              HloGatherInstruction::MakeGatherDimNumbers(
+                                  /*output_window_dims=*/{1},
+                                  /*elided_window_dims=*/{0},
+                                  /*gather_dims_to_operand_dims=*/{0},
+                                  /*index_vector_dim=*/1),
+                              /*window_bounds=*/{1, 48}));
   EXPECT_TRUE(
       ShapeUtil::Equal(gather_shape, ShapeUtil::MakeShape(F32, {32, 48})))
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, TensorFlowGatherNd) {
-  TF_ASSERT_OK_AND_ASSIGN(
-      Shape gather_shape,
-      ShapeInference::InferGatherShape(matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
-                                       HloInstruction::MakeGatherDimNumbers(
-                                           /*output_window_dims=*/{4},
-                                           /*elided_window_dims=*/{0},
-                                           /*gather_dims_to_operand_dims=*/{0},
-                                           /*index_vector_dim=*/4),
-                                       /*window_bounds=*/{1, 48}));
+TEST_F(ScatterGatherShapeInferenceTest, TensorFlowGatherNd) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape gather_shape,
+                          ShapeInference::InferGatherShape(
+                              matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+                              HloGatherInstruction::MakeGatherDimNumbers(
+                                  /*output_window_dims=*/{4},
+                                  /*elided_window_dims=*/{0},
+                                  /*gather_dims_to_operand_dims=*/{0},
+                                  /*index_vector_dim=*/4),
+                              /*window_bounds=*/{1, 48}));
   EXPECT_TRUE(ShapeUtil::Equal(gather_shape,
                                ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 48})))
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, TensorFlowBatchDynamicSlice) {
+TEST_F(ScatterGatherShapeInferenceTest, TensorFlowBatchDynamicSlice) {
   TF_ASSERT_OK_AND_ASSIGN(
       Shape gather_shape,
       ShapeInference::InferGatherShape(
           f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-          HloInstruction::MakeGatherDimNumbers(
+          HloGatherInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{4, 5, 6, 7, 8},
               /*elided_window_dims=*/{},
               /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1604,12 +1711,12 @@ TEST_F(GatherShapeInferenceTest, TensorFlowBatchDynamicSlice) {
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_A) {
+TEST_F(ScatterGatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_A) {
   TF_ASSERT_OK_AND_ASSIGN(
       Shape gather_shape,
       ShapeInference::InferGatherShape(
           f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_5_7_6_,
-          HloInstruction::MakeGatherDimNumbers(
+          HloGatherInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{4, 5, 6, 7, 8},
               /*elided_window_dims=*/{},
               /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1622,12 +1729,12 @@ TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_A) {
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_B) {
+TEST_F(ScatterGatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_B) {
   TF_ASSERT_OK_AND_ASSIGN(
       Shape gather_shape,
       ShapeInference::InferGatherShape(
           f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_5_10_9_7_6_,
-          HloInstruction::MakeGatherDimNumbers(
+          HloGatherInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{4, 5, 6, 7, 8},
               /*elided_window_dims=*/{},
               /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1640,13 +1747,13 @@ TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_B) {
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, NoOutputGatherDims) {
+TEST_F(ScatterGatherShapeInferenceTest, NoOutputGatherDims) {
   // This is equivalent to a dynamic slice.
   TF_ASSERT_OK_AND_ASSIGN(
       Shape gather_shape,
       ShapeInference::InferGatherShape(
           f32_5d_tensor_50_49_48_47_46_, s64_vector_5_,
-          HloInstruction::MakeGatherDimNumbers(
+          HloGatherInstruction::MakeGatherDimNumbers(
               /*output_window_dims=*/{0, 1, 2, 3, 4},
               /*elided_window_dims=*/{},
               /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1658,13 +1765,13 @@ TEST_F(GatherShapeInferenceTest, NoOutputGatherDims) {
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, ScalarGatherIndices) {
+TEST_F(ScatterGatherShapeInferenceTest, ScalarGatherIndices) {
   // The gather indices "tensor" is a scalar S here that's used to slice out
   // [S,0,0,0,0]..[S,30,29,28,27] into a [30,29,28,27] shaped result.
   TF_ASSERT_OK_AND_ASSIGN(Shape gather_shape,
                           ShapeInference::InferGatherShape(
                               f32_5d_tensor_50_49_48_47_46_, s64_scalar_,
-                              HloInstruction::MakeGatherDimNumbers(
+                              HloGatherInstruction::MakeGatherDimNumbers(
                                   /*output_window_dims=*/{0, 1, 2, 3},
                                   /*elided_window_dims=*/{0},
                                   /*gather_dims_to_operand_dims=*/{0},
@@ -1676,13 +1783,14 @@ TEST_F(GatherShapeInferenceTest, ScalarGatherIndices) {
       << ShapeUtil::HumanString(gather_shape);
 }
 
-TEST_F(GatherShapeInferenceTest, TupleShapedTensorInput) {
+TEST_F(ScatterGatherShapeInferenceTest, TupleShapedTensorInput) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       tuple_shape_, s64_vector_32_,
-      HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
-                                           /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1},
-                                           /*index_vector_dim=*/1),
+      HloGatherInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{0},
+          /*elided_window_dims=*/{1},
+          /*gather_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/1),
       /*window_bounds=*/{64, 1});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1690,13 +1798,14 @@ TEST_F(GatherShapeInferenceTest, TupleShapedTensorInput) {
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest, TupleShapedGatherIndicesInput) {
+TEST_F(ScatterGatherShapeInferenceTest, TupleShapedGatherIndicesInput) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       s64_vector_32_, tuple_shape_,
-      HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
-                                           /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1},
-                                           /*index_vector_dim=*/0),
+      HloGatherInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{0},
+          /*elided_window_dims=*/{1},
+          /*gather_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/0),
       /*window_bounds=*/{64, 1});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1704,13 +1813,14 @@ TEST_F(GatherShapeInferenceTest, TupleShapedGatherIndicesInput) {
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest, FloatingPointGatherIndicesInput) {
+TEST_F(ScatterGatherShapeInferenceTest, FloatingPointGatherIndicesInput) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       s64_vector_32_, vector_32_,
-      HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
-                                           /*elided_window_dims=*/{1},
-                                           /*gather_dims_to_operand_dims=*/{1},
-                                           /*index_vector_dim=*/0),
+      HloGatherInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{0},
+          /*elided_window_dims=*/{1},
+          /*gather_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/0),
       /*window_bounds=*/{64, 1});
   ASSERT_FALSE(statusor.ok());
   EXPECT_THAT(statusor.status().error_message(),
@@ -1718,11 +1828,11 @@ TEST_F(GatherShapeInferenceTest, FloatingPointGatherIndicesInput) {
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_NonAscendingWindowIndices) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 8, 7},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1735,11 +1845,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_RepeatedWindowIndices) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 7},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1752,11 +1862,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_WindowIndexOutOfBounds) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 99, 100, 101},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1768,11 +1878,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_WindowIndexBarelyOutOfBounds) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 9},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1784,11 +1894,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_MismatchingElidedWindowDims) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{4},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1802,11 +1912,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_OutOfBoundsWindowToInputMapping) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{0, 1, 2, 3, 19},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1819,11 +1929,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_RepeatedWindowToInputMapping) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{0, 1, 2, 3, 3},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1837,11 +1947,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_MismatchingGatherToInputMapping) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3},
@@ -1856,11 +1966,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_OutOfBoundsGatherToInputMapping) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 7},
@@ -1874,11 +1984,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_RepeatedGatherToInputMapping) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 3},
@@ -1892,11 +2002,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_NonAscendingElidedWindowDims) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{2, 1},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1908,10 +2018,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest, InvalidGatherDimNumbers_WindowBoundsTooLarge) {
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidGatherDimNumbers_WindowBoundsTooLarge) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7},
           /*elided_window_dims=*/{2},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1924,11 +2035,11 @@ TEST_F(GatherShapeInferenceTest, InvalidGatherDimNumbers_WindowBoundsTooLarge) {
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_MismatchingNumberOfWindowBounds) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1942,11 +2053,11 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest,
+TEST_F(ScatterGatherShapeInferenceTest,
        InvalidGatherDimNumbers_WindowBoundsNot1ForElidedDim) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7},
           /*elided_window_dims=*/{1},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1959,10 +2070,10 @@ TEST_F(GatherShapeInferenceTest,
       << statusor.status();
 }
 
-TEST_F(GatherShapeInferenceTest, OutOfBoundsGatherIndicesLeafDim) {
+TEST_F(ScatterGatherShapeInferenceTest, OutOfBoundsGatherIndicesLeafDim) {
   StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
       f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_5_7_6_,
-      HloInstruction::MakeGatherDimNumbers(
+      HloGatherInstruction::MakeGatherDimNumbers(
           /*output_window_dims=*/{4, 5, 6, 7, 8},
           /*elided_window_dims=*/{},
           /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
@@ -1976,5 +2087,575 @@ TEST_F(GatherShapeInferenceTest, OutOfBoundsGatherIndicesLeafDim) {
       << statusor.status();
 }
 
+// Shape inference tests for Scatter.
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterWithFullUpdates) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape scatter_shape,
+                          ShapeInference::InferScatterShape(
+                              matrix_64_48_, s64_vector_32_,
+                              ShapeUtil::MakeShape(F32, {64, 32}), to_apply_,
+                              HloScatterInstruction::MakeScatterDimNumbers(
+                                  /*update_window_dims=*/{0},
+                                  /*inserted_window_dims=*/{1},
+                                  /*scatter_dims_to_operand_dims=*/{1},
+                                  /*index_vector_dim=*/1)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterWithFullUpdatesV2) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape scatter_shape,
+                          ShapeInference::InferScatterShape(
+                              matrix_64_48_, s64_vector_32_,
+                              ShapeUtil::MakeShape(F32, {32, 48}), to_apply_,
+                              HloScatterInstruction::MakeScatterDimNumbers(
+                                  /*update_window_dims=*/{1},
+                                  /*inserted_window_dims=*/{0},
+                                  /*scatter_dims_to_operand_dims=*/{0},
+                                  /*index_vector_dim=*/1)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterWithPartialUpdates) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape scatter_shape,
+                          ShapeInference::InferScatterShape(
+                              matrix_64_48_, s64_vector_32_,
+                              ShapeUtil::MakeShape(F32, {10, 32}), to_apply_,
+                              HloScatterInstruction::MakeScatterDimNumbers(
+                                  /*update_window_dims=*/{0},
+                                  /*inserted_window_dims=*/{1},
+                                  /*scatter_dims_to_operand_dims=*/{1},
+                                  /*index_vector_dim=*/1)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterWithPartialUpdatesV2) {
+  TF_ASSERT_OK_AND_ASSIGN(Shape scatter_shape,
+                          ShapeInference::InferScatterShape(
+                              matrix_64_48_, s64_vector_32_,
+                              ShapeUtil::MakeShape(F32, {32, 8}), to_apply_,
+                              HloScatterInstruction::MakeScatterDimNumbers(
+                                  /*update_window_dims=*/{1},
+                                  /*inserted_window_dims=*/{0},
+                                  /*scatter_dims_to_operand_dims=*/{0},
+                                  /*index_vector_dim=*/1)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterWithUpdatesBiggerThanInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      matrix_64_48_, s64_vector_32_, ShapeUtil::MakeShape(F32, {65, 32}),
+      to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{0},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/1));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Bounds of the window dimensions of updates must not exceed "
+                "the bounds of the corresponding dimensions of operand."))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterWithUpdatesBiggerThanInputV2) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      matrix_64_48_, s64_vector_32_, ShapeUtil::MakeShape(F32, {32, 49}),
+      to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{1},
+          /*inserted_window_dims=*/{0},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/1));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Bounds of the window dimensions of updates must not exceed "
+                "the bounds of the corresponding dimensions of operand."))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       TfScatterWithUpdatesNotMatchingIndices) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      matrix_64_48_, s64_vector_32_, ShapeUtil::MakeShape(F32, {64, 31}),
+      to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{0},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/1));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr(
+          "Bounds of the scatter dimensions of updates must be same as the "
+          "bounds of the corresponding dimensions of scatter indices."))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       TfScatterWithUpdatesNotMatchingIndicesV2) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      matrix_64_48_, s64_vector_32_, ShapeUtil::MakeShape(F32, {31, 48}),
+      to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{1},
+          /*inserted_window_dims=*/{0},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/1));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr(
+          "Bounds of the scatter dimensions of updates must be same as the "
+          "bounds of the corresponding dimensions of scatter indices."))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterNdWithFullUpdates) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+          ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 48}), to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4},
+              /*inserted_window_dims=*/{0},
+              /*scatter_dims_to_operand_dims=*/{0},
+              /*index_vector_dim=*/4)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterNdWithFullUpdatesV2) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+          ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 64}), to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4},
+              /*inserted_window_dims=*/{1},
+              /*scatter_dims_to_operand_dims=*/{0},
+              /*index_vector_dim=*/4)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterNdWithPartialUpdates) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+          ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 10}), to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4},
+              /*inserted_window_dims=*/{0},
+              /*scatter_dims_to_operand_dims=*/{0},
+              /*index_vector_dim=*/4)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterNdWithPartialUpdatesV2) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+          ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 12}), to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4},
+              /*inserted_window_dims=*/{1},
+              /*scatter_dims_to_operand_dims=*/{0},
+              /*index_vector_dim=*/4)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, matrix_64_48_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfScatterNdWithUpdatesBiggerThanInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 65}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{0},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Bounds of the window dimensions of updates must not exceed "
+                "the bounds of the corresponding dimensions of operand."))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       TfScatterNdWithUpdatesNotMatchingIndices) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+      ShapeUtil::MakeShape(F32, {9, 9, 8, 7, 64}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{0},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr(
+          "Bounds of the scatter dimensions of updates must be same as the "
+          "bounds of the corresponding dimensions of scatter indices."))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, TfBatchDynamicUpdateSlice) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+          ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28, 27, 26}),
+          to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4, 5, 6, 7, 8},
+              /*inserted_window_dims=*/{},
+              /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/4)));
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, f32_5d_tensor_50_49_48_47_46_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, NonDefaultScatterIndicesLeafDim) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_5_7_6_,
+          ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26}),
+          to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4, 5, 6, 7, 8},
+              /*inserted_window_dims=*/{},
+              /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/2)));
+
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, f32_5d_tensor_50_49_48_47_46_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, NonDefaultScatterIndicesLeafDimV2) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_5_10_9_7_6_,
+          ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26}),
+          to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{4, 5, 6, 7, 8},
+              /*inserted_window_dims=*/{},
+              /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/0)));
+
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, f32_5d_tensor_50_49_48_47_46_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, NoUpdateScatterDims) {
+  // This is equivalent to a dynamic update slice.
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_vector_5_,
+          ShapeUtil::MakeShape(F32, {30, 29, 28, 27, 26}), to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{0, 1, 2, 3, 4},
+              /*inserted_window_dims=*/{},
+              /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/0)));
+
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, f32_5d_tensor_50_49_48_47_46_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, ScalarScatterIndices) {
+  // The scalar indices "tensor" is a scalar S here that's used to update a
+  // [30,29,28,27] shaped tensor within the operand at position S.
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape scatter_shape,
+      ShapeInference::InferScatterShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_scalar_,
+          ShapeUtil::MakeShape(F32, {30, 29, 28, 27}), to_apply_,
+          HloScatterInstruction::MakeScatterDimNumbers(
+              /*update_window_dims=*/{0, 1, 2, 3},
+              /*inserted_window_dims=*/{0},
+              /*scatter_dims_to_operand_dims=*/{0},
+              /*index_vector_dim=*/0)));
+
+  EXPECT_TRUE(ShapeUtil::Equal(scatter_shape, f32_5d_tensor_50_49_48_47_46_))
+      << ShapeUtil::HumanString(scatter_shape);
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, ScatterWithTupleShapedTensorInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      tuple_shape_, s64_vector_32_, s64_vector_32_, to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{0},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/1));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Expected array argument for operand"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       ScatterWithTupleShapedScatterIndicesInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      s64_vector_32_, tuple_shape_, s64_vector_32_, to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{0},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/0));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Expected array argument for scatter indices"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, ScatterWithTupleShapedUpdatesInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      s64_vector_32_, s64_vector_32_, tuple_shape_, to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{0},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/0));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Expected array argument for updates"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, FloatingPointScatterIndicesInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      s64_vector_32_, vector_32_, s64_vector_32_, to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{0},
+          /*inserted_window_dims=*/{1},
+          /*scatter_dims_to_operand_dims=*/{1},
+          /*index_vector_dim=*/0));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Scatter indices parameter must be an integral tensor"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, OutOfBoundsScatterIndicesLeafDim) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 2},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/10));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Scatter index leaf dimension must be within [0, "
+                        "rank(scatter_indices) + 1)"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, InvalidUpdates) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28, 50}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 2},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Updates tensor must be of rank 7; got 8."))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest, InvalidUpdateComputation) {
+  const ProgramShape invalid_update_computation =
+      ShapeUtil::MakeProgramShape({f32_}, f32_);
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}),
+      invalid_update_computation,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 2},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Reduction function must take 2 parameters, but takes 1"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_NonAscendingUpdateWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28, 27, 26}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6, 8, 7},
+          /*inserted_window_dims=*/{},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("update_window_dims in scatter op must be sorted"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_RepeatedUpdateWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28, 27, 26}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6, 7, 7},
+          /*inserted_window_dims=*/{},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("update_window_dims in scatter op must not repeat"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_OutOfBoundsUpdateWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28, 27, 26}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6, 7, 9},
+          /*inserted_window_dims=*/{},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Invalid update_window_dims set in scatter op; valid "
+                        "range is [0, 9)"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_NonAscendingInsertedWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{2, 1},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("inserted_window_dims in scatter op must be sorted"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_RepeatedInsertedWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 1},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("inserted_window_dims in scatter op must not repeat"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_OutOfBoundsInsertedWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 5},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Invalid inserted_window_dims set in scatter op; valid "
+                        "range is [0, 5)"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_MismatchingScatterDimsToOperandDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 2},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Scatter op has 4 elements in scatter_dims_to_operand_dims and "
+                "the bound of dimension index_vector_dim=4 of scatter_indices "
+                "is 5. These two numbers must be equal"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_OutOfBoundsScatterDimsToOperandDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 2},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 3, 10},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Invalid scatter_dims_to_operand_dims mapping; domain "
+                        "is [0, 5), got: 4->10"))
+      << statusor.status();
+}
+
+TEST_F(ScatterGatherShapeInferenceTest,
+       InvalidScatterDimNumbers_RepeatedValuesInScatterDimsToOperandDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferScatterShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28}), to_apply_,
+      HloScatterInstruction::MakeScatterDimNumbers(
+          /*update_window_dims=*/{4, 5, 6},
+          /*inserted_window_dims=*/{1, 2},
+          /*scatter_dims_to_operand_dims=*/{0, 1, 2, 2, 3},
+          /*index_vector_dim=*/4));
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr(
+          "Repeated dimensions not allowed in scatter_dims_to_operand_dims"))
+      << statusor.status();
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/stream_pool.cc b/tensorflow/compiler/xla/service/stream_pool.cc
new file mode 100644
index 0000000000..92bb21b816
--- /dev/null
+++ b/tensorflow/compiler/xla/service/stream_pool.cc
@@ -0,0 +1,56 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/stream_pool.h"
+
+#include "tensorflow/compiler/xla/ptr_util.h"
+
+namespace xla {
+
+StreamPool::Ptr StreamPool::BorrowStream(se::StreamExecutor* executor) {
+  std::unique_ptr<se::Stream> stream;
+  {
+    tensorflow::mutex_lock lock(mu_);
+    if (!streams_.empty()) {
+      // Re-use an existing stream from the pool.
+      stream = std::move(streams_.back());
+      streams_.pop_back();
+    }
+  }
+
+  if (!stream) {
+    // Create a new stream.
+    stream = MakeUnique<se::Stream>(executor);
+    stream->Init();
+  }
+
+  // Return the stream wrapped in Ptr, which has our special deleter semantics.
+  PtrDeleter deleter = {this};
+  return Ptr(stream.release(), deleter);
+}
+
+void StreamPool::ReturnStream(se::Stream* stream) {
+  if (stream->ok()) {
+    tensorflow::mutex_lock lock(mu_);
+    streams_.emplace_back(stream);
+  } else {
+    // If the stream has encountered any errors, all subsequent
+    // operations on it will fail. So just delete the stream, and rely
+    // on new streams to be created in the future.
+    delete stream;
+  }
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/stream_pool.h b/tensorflow/compiler/xla/service/stream_pool.h
new file mode 100644
index 0000000000..7221d323a6
--- /dev/null
+++ b/tensorflow/compiler/xla/service/stream_pool.h
@@ -0,0 +1,64 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_STREAM_POOL_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_STREAM_POOL_H_
+
+#include <memory>
+#include <vector>
+
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/stream_executor_no_cuda.h"
+
+namespace xla {
+
+// Pool of stream_executor::Streams, which are created as needed and
+// destroyed when the pool is destroyed.
+class StreamPool {
+ public:
+  struct PtrDeleter {
+    void operator()(se::Stream* stream) { pool->ReturnStream(stream); }
+    StreamPool* pool;
+  };
+
+  // Stream pointer type returned by BorrowStream, which returns the
+  // stream to the pool on destruction.
+  using Ptr = std::unique_ptr<se::Stream, PtrDeleter>;
+
+  StreamPool() {}
+
+  // Returns a pointer to a stream in the pool, creating a new stream
+  // if none are available in the pool. The returned smart pointer
+  // returns the stream to the pool on destruction.
+  //
+  // This method is thread-safe.
+  Ptr BorrowStream(se::StreamExecutor* executor);
+
+ private:
+  // Puts a pointer to a stream back into the pool, leaving it free
+  // for future use. Streams that have previously encountered errors
+  // are deleted, and not returned to the pool.
+  //
+  // This method is thread-safe.
+  void ReturnStream(se::Stream* stream);
+
+  tensorflow::mutex mu_;
+  std::vector<std::unique_ptr<se::Stream>> streams_ GUARDED_BY(mu_);
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_STREAM_POOL_H_
diff --git a/tensorflow/compiler/xla/service/stream_pool_test.cc b/tensorflow/compiler/xla/service/stream_pool_test.cc
new file mode 100644
index 0000000000..aaf5c37b0d
--- /dev/null
+++ b/tensorflow/compiler/xla/service/stream_pool_test.cc
@@ -0,0 +1,136 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/stream_pool.h"
+
+#include <memory>
+
+#include "tensorflow/compiler/xla/test_helpers.h"
+#include "tensorflow/core/platform/stream_executor_no_cuda.h"
+
+namespace xla {
+namespace {
+
+class StreamPoolTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<se::StreamExecutor> NewStreamExecutor() {
+    se::Platform* platform =
+        se::MultiPlatformManager::PlatformWithName("Host").ConsumeValueOrDie();
+    se::StreamExecutorConfig config(/*ordinal=*/0);
+    return platform->GetUncachedExecutor(config).ConsumeValueOrDie();
+  }
+};
+
+TEST_F(StreamPoolTest, EmptyPool) { StreamPool pool; }
+
+TEST_F(StreamPoolTest, OneStreamPool) {
+  std::unique_ptr<se::StreamExecutor> executor = NewStreamExecutor();
+  StreamPool pool;
+
+  // Borrow and return a stream.
+  StreamPool::Ptr stream1 = pool.BorrowStream(executor.get());
+  se::Stream* stream1_ptr = stream1.get();
+  EXPECT_TRUE(stream1->ok());
+  stream1 = nullptr;
+
+  // Borrow and return another stream.
+  StreamPool::Ptr stream2 = pool.BorrowStream(executor.get());
+  se::Stream* stream2_ptr = stream2.get();
+  EXPECT_TRUE(stream2->ok());
+  stream2 = nullptr;
+
+  // The underlying streams should be the same, since stream1 was the
+  // only stream available in the pool when stream2 was borrowed.
+  EXPECT_EQ(stream1_ptr, stream2_ptr);
+}
+
+TEST_F(StreamPoolTest, TwoStreamPool) {
+  std::unique_ptr<se::StreamExecutor> executor = NewStreamExecutor();
+  StreamPool pool;
+
+  // Borrow two streams.
+  StreamPool::Ptr stream1 = pool.BorrowStream(executor.get());
+  se::Stream* stream1_ptr = stream1.get();
+  EXPECT_TRUE(stream1->ok());
+  StreamPool::Ptr stream2 = pool.BorrowStream(executor.get());
+  se::Stream* stream2_ptr = stream2.get();
+  EXPECT_TRUE(stream2->ok());
+
+  // The underlying streams should be different, since we haven't
+  // returned either of them yet.
+  EXPECT_NE(stream1_ptr, stream2_ptr);
+
+  // Return stream1 and borrow stream3.
+  stream1 = nullptr;
+  StreamPool::Ptr stream3 = pool.BorrowStream(executor.get());
+  se::Stream* stream3_ptr = stream3.get();
+  EXPECT_TRUE(stream3->ok());
+
+  // stream1 and stream3 should be the same.
+  EXPECT_EQ(stream1_ptr, stream3_ptr);
+  EXPECT_NE(stream2_ptr, stream3_ptr);
+
+  // Return stream2, and borrow stream4.
+  stream2 = nullptr;
+  StreamPool::Ptr stream4 = pool.BorrowStream(executor.get());
+  se::Stream* stream4_ptr = stream4.get();
+  EXPECT_TRUE(stream4->ok());
+
+  // Stream2 and stream4 should be the same.
+  EXPECT_EQ(stream2_ptr, stream4_ptr);
+  EXPECT_NE(stream3_ptr, stream4_ptr);
+}
+
+TEST_F(StreamPoolTest, BadStreamDiscarded) {
+  std::unique_ptr<se::StreamExecutor> executor = NewStreamExecutor();
+  StreamPool pool;
+
+  // Borrow a stream.
+  StreamPool::Ptr stream1 = pool.BorrowStream(executor.get());
+  EXPECT_TRUE(stream1->ok());
+
+  // Force an error on the stream; here we call a method that requires
+  // DNN support, which we know the Host platform doesn't support.
+  stream1->ThenDepthConcatenate({}, {}, nullptr);
+  EXPECT_FALSE(stream1->ok());
+
+  // Return stream1 and borrow stream2.
+  stream1 = nullptr;
+  StreamPool::Ptr stream2 = pool.BorrowStream(executor.get());
+  se::Stream* stream2_ptr = stream2.get();
+  EXPECT_TRUE(stream2->ok());
+
+  // The underlying streams should be different. They would have been
+  // the same, but since we forced an error on stream1, it cannot be
+  // put back into the pool. Sadly we can't just check:
+  //    EXPECT_NE(stream1_ptr, stream2_ptr);
+  //
+  // The above should hold logically, but it may fail if the new
+  // stream instance allocated for stream2 happens to reside in the
+  // same memory address as stream1, which has been deleted.
+  //
+  // The check that stream2->ok() serves as a good-enough check.
+
+  // Return stream2 and borrow stream3. The previous error on stream1
+  // has no effect on these streams, and they are the same.
+  stream2 = nullptr;
+  StreamPool::Ptr stream3 = pool.BorrowStream(executor.get());
+  se::Stream* stream3_ptr = stream3.get();
+  EXPECT_TRUE(stream3->ok());
+  EXPECT_EQ(stream2_ptr, stream3_ptr);
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/transfer_manager.cc b/tensorflow/compiler/xla/service/transfer_manager.cc
index 4c5038a009..7232c658b3 100644
--- a/tensorflow/compiler/xla/service/transfer_manager.cc
+++ b/tensorflow/compiler/xla/service/transfer_manager.cc
@@ -44,6 +44,7 @@ StatusOr<std::unique_ptr<Literal>> TransferManager::TransferLiteralFromDevice(
     se::Stream* stream, const ShapedBuffer& device_buffer) {
   StatusOr<std::unique_ptr<Literal>> ret;
   se::Stream* substream = stream->GetOrCreateSubStream();
+  substream->ThenWaitFor(stream);
   auto cleanup = tensorflow::gtl::MakeCleanup(
       [&]() { stream->ReturnSubStream(substream); });
 
@@ -64,6 +65,7 @@ Status TransferManager::TransferLiteralToDevice(
   // Use a substream so that if we are called from a HostCallback we don't
   // deadlock.
   se::Stream* substream = stream->GetOrCreateSubStream();
+  substream->ThenWaitFor(stream);
   auto cleanup = tensorflow::gtl::MakeCleanup(
       [&]() { stream->ReturnSubStream(substream); });
   TF_RETURN_IF_ERROR(
diff --git a/tensorflow/compiler/xla/service/transfer_manager.h b/tensorflow/compiler/xla/service/transfer_manager.h
index e384359642..82c599e482 100644
--- a/tensorflow/compiler/xla/service/transfer_manager.h
+++ b/tensorflow/compiler/xla/service/transfer_manager.h
@@ -20,7 +20,7 @@ limitations under the License.
 #include <set>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/shaped_buffer.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -167,16 +167,6 @@ class TransferManager {
       const se::Platform* platform);
 
  protected:
-  // Transfer a memory block of the given size from 'source' buffer to the
-  // Infeed interface of the device using the given executor.
-  //
-  // size is the size to transfer from source in bytes.
-  //
-  // source is the source data that must be in the target-dependent layout that
-  // the Infeed HLO used in the computation expects.
-  virtual Status TransferBufferToInfeed(se::StreamExecutor* executor,
-                                        int64 size, const void* source) = 0;
-
   // Transfer a memory block of the given size from the device source into the
   // 'destination' buffer.
   //
diff --git a/tensorflow/compiler/xla/service/transpose_folding_test.cc b/tensorflow/compiler/xla/service/transpose_folding_test.cc
index cccb8f2fbb..58f767e913 100644
--- a/tensorflow/compiler/xla/service/transpose_folding_test.cc
+++ b/tensorflow/compiler/xla/service/transpose_folding_test.cc
@@ -19,8 +19,8 @@ limitations under the License.
 #include <unordered_set>
 #include <vector>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/gpu/ir_emission_utils.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -160,11 +160,11 @@ TEST_F(TransposeFoldingTest, FuseDotWithConstantOperands) {
   auto builder = HloComputation::Builder("entry");
   // (1.0 + 2.0) * (2.0 - 3.0)
   HloInstruction* const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   HloInstruction* const2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   HloInstruction* const3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   HloInstruction* add = builder.AddInstruction(HloInstruction::CreateBinary(
       const1->shape(), HloOpcode::kAdd, const1, const2));
   HloInstruction* sub = builder.AddInstruction(HloInstruction::CreateBinary(
diff --git a/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc b/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc
index d1e1744647..0effdc80a4 100644
--- a/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc
+++ b/tensorflow/compiler/xla/service/tuple_points_to_analysis.cc
@@ -292,22 +292,29 @@ Status TuplePointsToAnalysis::HandleSlice(HloInstruction* slice) {
 }
 
 Status TuplePointsToAnalysis::HandleRecvDone(HloInstruction* recv_done) {
-  // RecvDone aliases its input (Recv) tuple element {0} to its output.
+  // RecvDone aliases its input (Recv) tuple element {0} to element {0} of its
+  // output. The other indices ({} and {1}) define their own buffers.
   PointsToSet& points_to_set = CreateEmptyPointsToSet(recv_done);
+  points_to_set.AddPointedToBuffer(
+      logical_buffer_analysis_->GetBuffer(recv_done, /*index=*/{}),
+      /*index=*/{});
+  points_to_set.AddPointedToBuffer(
+      logical_buffer_analysis_->GetBuffer(recv_done, /*index=*/{1}),
+      /*index=*/{1});
+
   const PointsToSet& operand_points_to_set =
       GetPointsToSet(recv_done->operand(0));
 
-  // Recursively copy the points to set of the operand tuple {0}.
+  // Recursively copy the points to set of the operand tuple {0} to the output
+  // element {0}.
   points_to_set.ForEachMutableElement(
       [this, &points_to_set, &operand_points_to_set](
           const ShapeIndex& index, PointsToSet::BufferList* buffers) {
-        ShapeIndex src_index({0});
-        for (auto element : index) {
-          src_index.push_back(element);
+        if (index.empty() || index[0] != 0) {
+          return;
         }
-        *buffers = operand_points_to_set.element(src_index);
-        for (auto& tuple_source :
-             operand_points_to_set.tuple_sources(src_index)) {
+        *buffers = operand_points_to_set.element(index);
+        for (auto& tuple_source : operand_points_to_set.tuple_sources(index)) {
           points_to_set.add_tuple_source(index, tuple_source);
         }
       });
@@ -315,7 +322,7 @@ Status TuplePointsToAnalysis::HandleRecvDone(HloInstruction* recv_done) {
 }
 
 Status TuplePointsToAnalysis::HandleSend(HloInstruction* send) {
-  // Send creates a tuple of {aliased operand, U32 context}.
+  // Send creates a tuple of {aliased operand, U32 context, token}.
   PointsToSet& points_to_set = CreateEmptyPointsToSet(send);
 
   // Creates the points to set for the tuple and its element at {1}.
@@ -328,6 +335,10 @@ Status TuplePointsToAnalysis::HandleSend(HloInstruction* send) {
   context_buffer->push_back(
       &logical_buffer_analysis_->GetBuffer(send, ShapeIndex({1})));
 
+  auto token_buffer = points_to_set.mutable_element(ShapeIndex({2}));
+  token_buffer->push_back(
+      &logical_buffer_analysis_->GetBuffer(send, ShapeIndex({2})));
+
   // Recursively copy the points to set of the operand to output tuple {0}.
   const PointsToSet& operand_points_to_set = GetPointsToSet(send->operand(0));
   operand_points_to_set.ForEachElement(
@@ -388,7 +399,7 @@ Status TuplePointsToAnalysis::HandleTuple(HloInstruction* tuple) {
   return Status::OK();
 }
 
-Status TuplePointsToAnalysis::HandleSelect(HloInstruction* select) {
+Status TuplePointsToAnalysis::HandleTupleSelect(HloInstruction* tuple_select) {
   // Select allocates a new buffer and then shallow copies the on_true or
   // on_false buffer into this new buffer. Which side is chosen cannot be
   // determined statically so conservatively set the points-to set to the union
@@ -396,9 +407,9 @@ Status TuplePointsToAnalysis::HandleSelect(HloInstruction* select) {
   //
   // First create a copy of the on_true points-to set (and tuple sources), then
   // add in elements of the on_false points-to set (tuple sources).
-  auto on_true = select->operand(1);
-  auto on_false = select->operand(2);
-  PointsToSet& points_to_set = CreateCopiedPointsToSet(select, on_true);
+  auto on_true = tuple_select->operand(1);
+  auto on_false = tuple_select->operand(2);
+  PointsToSet& points_to_set = CreateCopiedPointsToSet(tuple_select, on_true);
   const PointsToSet& false_points_to_set = *PerInst(on_false)->points_to_set;
   points_to_set.ForEachMutableElement(
       [&](const ShapeIndex& index, PointsToSet::BufferList* buffers) {
@@ -416,7 +427,7 @@ Status TuplePointsToAnalysis::HandleSelect(HloInstruction* select) {
   // respective element in the points-to set should contain only itself.
   points_to_set.mutable_element({})->clear();
   points_to_set.AddPointedToBuffer(
-      logical_buffer_analysis_->GetBuffer(select, /*index=*/{}),
+      logical_buffer_analysis_->GetBuffer(tuple_select, /*index=*/{}),
       /*index=*/{});
   return Status::OK();
 }
@@ -707,6 +718,7 @@ bool TuplePointsToAnalysis::HasUniqueFusedUseOfOperandAt(
 //     root at operand 0 or 1. Or...
 // (4) The 'user' of 'operand' is DynamicUpdateSlice or While at operand index
 //     0.
+// (5) The 'user' of 'operand' is Sort, and it is the only user.
 //
 // (2) and (3) can only be determined if points-to analysis is available.
 bool TuplePointsToAnalysis::CanShareOperandBufferWithUser(
@@ -772,6 +784,21 @@ bool TuplePointsToAnalysis::CanShareOperandBufferWithUser(
     std::vector<int64> operand_indices = user->OperandIndices(operand);
     return operand_indices.size() == 1 && operand_indices[0] == 0;
   }
+  if (user->opcode() == HloOpcode::kSort) {
+    // Only valid if there are no other users.
+    if (operand->users().size() != 1) {
+      return false;
+    }
+    // If we only sort keys, the output of sort is not a tuple, so we can always
+    // share the buffer.
+    if (user->operand_count() == 1) {
+      return true;
+    }
+    CHECK(!user_index.empty());
+    // Only share with the right tuple element buffer.
+    std::vector<int64> operand_indices = user->OperandIndices(operand);
+    return operand_indices.size() == 1 && user_index[0] == operand_indices[0];
+  }
   if (user->opcode() == HloOpcode::kCall) {
     // TODO(b/62548313): Remove when buffer assignment is module scoped and
     // does not assign buffers to calls.
diff --git a/tensorflow/compiler/xla/service/tuple_points_to_analysis.h b/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
index c0d8241480..686bb05328 100644
--- a/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
+++ b/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
@@ -253,7 +253,7 @@ class TuplePointsToAnalysis : public DfsHloVisitorWithDefault {
   Status HandleCopy(HloInstruction* copy) override;
   Status HandleRecvDone(HloInstruction* recv_done) override;
   Status HandleSend(HloInstruction* send) override;
-  Status HandleSelect(HloInstruction* select) override;
+  Status HandleTupleSelect(HloInstruction* tuple_select) override;
 
   string ToString() const;
 
diff --git a/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc b/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc
index a8f885fd86..2e5f646804 100644
--- a/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/tuple_points_to_analysis_test.cc
@@ -124,9 +124,9 @@ class TuplePointsToAnalysisTest : public HloTestBase {
 TEST_F(TuplePointsToAnalysisTest, SimpleTuple) {
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
 
@@ -177,14 +177,14 @@ TEST_F(TuplePointsToAnalysisTest, NestedTuple) {
   // tuple.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto inner_tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
 
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({inner_tuple, constant3}));
 
@@ -238,14 +238,14 @@ TEST_F(TuplePointsToAnalysisTest, GetTupleElement) {
   // tuple.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto inner_tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
 
   auto constant3 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(3.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(3.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({inner_tuple, constant3}));
 
@@ -270,7 +270,7 @@ TEST_F(TuplePointsToAnalysisTest, DuplicatedElement) {
   // Create a tuple which contains duplicate elements.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant, constant, constant}));
 
@@ -291,9 +291,9 @@ TEST_F(TuplePointsToAnalysisTest, TupleCopy) {
   // the same.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto copy = builder.AddInstruction(
@@ -317,8 +317,8 @@ TEST_F(TuplePointsToAnalysisTest, SendAndSendDone) {
   // Send forwards its operand to the output tuple at {0}.
   auto builder = HloComputation::Builder(TestName());
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto send = builder.AddInstruction(
       HloInstruction::CreateSend(constant, token, /*channel_id=*/0));
   auto send_done = builder.AddInstruction(HloInstruction::CreateSendDone(send));
@@ -343,7 +343,7 @@ TEST_F(TuplePointsToAnalysisTest, SendAndSendDone) {
 TEST_F(TuplePointsToAnalysisTest, RecvAndRecvDone) {
   // RecvDone forwards its operand tuple element at {0} to the output.
   auto builder = HloComputation::Builder(TestName());
-  auto token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder.AddInstruction(HloInstruction::CreateToken());
   auto recv = builder.AddInstruction(HloInstruction::CreateRecv(
       ShapeUtil::MakeShape(F32, {1, 2, 3}), token, /*channel_id=*/0));
   auto recv_done = builder.AddInstruction(HloInstruction::CreateRecvDone(recv));
@@ -357,7 +357,7 @@ TEST_F(TuplePointsToAnalysisTest, RecvAndRecvDone) {
 
   ExpectHasTopLevelBuffers(
       points_to_analysis_->GetPointsToSet(recv).element({}), {recv});
-  ExpectHasBufferAliases(recv, {0}, {{recv, {0}}, {recv_done, {}}});
+  ExpectHasBufferAliases(recv, {0}, {{recv, {0}}, {recv_done, {0}}});
 }
 
 TEST_F(TuplePointsToAnalysisTest, TupleSelect) {
@@ -365,18 +365,18 @@ TEST_F(TuplePointsToAnalysisTest, TupleSelect) {
   // set containing the union of both sides.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple1 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto tuple2 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant2, constant2}));
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple1->shape(), HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple1->shape(), HloOpcode::kTupleSelect, pred, tuple1, tuple2));
 
   BuildModuleAndRunAnalysis(builder.Build());
 
@@ -403,9 +403,9 @@ TEST_F(TuplePointsToAnalysisTest, SelectTupleParameters) {
   auto param1 = builder.AddInstruction(
       HloInstruction::CreateParameter(1, tuple_shape, "param1"));
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple_shape, HloOpcode::kSelect, pred, param0, param1));
+      tuple_shape, HloOpcode::kTupleSelect, pred, param0, param1));
   auto copy = builder.AddInstruction(
       HloInstruction::CreateUnary(tuple_shape, HloOpcode::kCopy, select));
 
@@ -443,18 +443,18 @@ TEST_F(TuplePointsToAnalysisTest, UnambiguousTupleSelect) {
   // Select from two identical tuples. The result should not be ambiguous.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto tuple1 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto tuple2 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple1->shape(), HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple1->shape(), HloOpcode::kTupleSelect, pred, tuple1, tuple2));
 
   BuildModuleAndRunAnalysis(builder.Build());
 
@@ -474,9 +474,9 @@ TEST_F(TuplePointsToAnalysisTest, NestedTupleSelect) {
   // the right values.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto inner_tuple1 = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto inner_tuple2 = builder.AddInstruction(
@@ -488,9 +488,9 @@ TEST_F(TuplePointsToAnalysisTest, NestedTupleSelect) {
       builder.AddInstruction(HloInstruction::CreateTuple({inner_tuple2}));
 
   auto pred = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
   auto select = builder.AddInstruction(HloInstruction::CreateTernary(
-      tuple1->shape(), HloOpcode::kSelect, pred, tuple1, tuple2));
+      tuple1->shape(), HloOpcode::kTupleSelect, pred, tuple1, tuple2));
 
   BuildModuleAndRunAnalysis(builder.Build());
 
@@ -521,9 +521,9 @@ TEST_F(TuplePointsToAnalysisTest, TupleWithBitcast) {
   // have the operand of the bitcast in its points-to set.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto bitcast = builder.AddInstruction(HloInstruction::CreateUnary(
       constant2->shape(), HloOpcode::kBitcast, constant2));
   auto tuple =
@@ -557,9 +557,10 @@ TEST_F(TuplePointsToAnalysisTest, PointsToTupleConstantElements) {
   // Construct a tuple constant and kCopy it. Verify the points-to set of the
   // copy correctly correctly points into the nested elements of the constant.
   auto builder = HloComputation::Builder(TestName());
-  auto tuple_constant = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::MakeTuple({Literal::CreateR2<float>({{1.0}, {2.0}}).get(),
-                          Literal::CreateR1<float>({2.0, 42}).get()})));
+  auto tuple_constant = builder.AddInstruction(
+      HloInstruction::CreateConstant(LiteralUtil::MakeTuple(
+          {LiteralUtil::CreateR2<float>({{1.0}, {2.0}}).get(),
+           LiteralUtil::CreateR1<float>({2.0, 42}).get()})));
   auto copy = builder.AddInstruction(HloInstruction::CreateUnary(
       tuple_constant->shape(), HloOpcode::kCopy, tuple_constant));
 
@@ -579,9 +580,9 @@ TEST_F(TuplePointsToAnalysisTest, BufferAliases) {
   // times. Verify buffer alias sets.
   auto builder = HloComputation::Builder(TestName());
   auto constant1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto constant2 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto inner_tuple = builder.AddInstruction(
       HloInstruction::CreateTuple({constant1, constant2}));
   auto tuple = builder.AddInstruction(
@@ -620,7 +621,7 @@ class FusionPointsToAnalysisTest : public TuplePointsToAnalysisTest {
     auto tuple_element1 = builder.AddInstruction(
         HloInstruction::CreateGetTupleElement(update_shape, tuple_param0, 1));
     auto ones = builder.AddInstruction(HloInstruction::CreateConstant(
-        Literal::CreateR1<float>({1.f, 1.f, 1.f, 1.f})));
+        LiteralUtil::CreateR1<float>({1.f, 1.f, 1.f, 1.f})));
     // Create 'update' = Add(GetTupleElement(tuple_param0, 1), ones)
     auto update = builder.AddInstruction(HloInstruction::CreateBinary(
         update_shape, HloOpcode::kAdd, tuple_element1, ones));
@@ -868,9 +869,9 @@ TEST_F(DoesNotUseOperandBufferTest, FusedDynamicUpdateSlice) {
 
   // Create a DynamicUpdateSlice instruction of tuple element 1.
   auto starts = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
   auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+      LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f})));
   auto dynamic_update_slice =
       builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
           data_shape, gte1, update, starts));
@@ -962,9 +963,9 @@ TEST_F(CanShareOperandBufferWithUserTest, FusedDynamicUpdateSlice) {
 
   // Create a DynamicUpdateSlice instruction of tuple element 1.
   auto starts = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
   auto update = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({2.f, 2.f, 2.f})));
+      LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f})));
   auto dynamic_update_slice =
       builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
           data_shape, gte1, update, starts));
@@ -1011,14 +1012,56 @@ TEST_F(CanShareOperandBufferWithUserTest, DynamicUpdateSliceCanShare) {
       points_to_analysis_->CanShareOperandBufferWithUser(starts, {}, dus, {}));
 }
 
+TEST_F(CanShareOperandBufferWithUserTest, SortCanShare) {
+  auto builder = HloComputation::Builder(TestName());
+
+  Shape keys_shape = ShapeUtil::MakeShape(F32, {8});
+  auto keys = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, keys_shape, "keys"));
+  auto sort =
+      builder.AddInstruction(HloInstruction::CreateSort(keys_shape, 0, keys));
+
+  BuildModuleAndRunAnalysis(builder.Build());
+
+  EXPECT_TRUE(
+      points_to_analysis_->CanShareOperandBufferWithUser(keys, {}, sort, {}));
+}
+
+TEST_F(CanShareOperandBufferWithUserTest, SortCanShareWithTupleUser) {
+  auto builder = HloComputation::Builder(TestName());
+
+  Shape keys_shape = ShapeUtil::MakeShape(F32, {8});
+  Shape values_shape = ShapeUtil::MakeShape(F32, {8});
+  auto keys = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, keys_shape, "keys"));
+  auto values = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, values_shape, "values"));
+  auto sort = builder.AddInstruction(HloInstruction::CreateSort(
+      ShapeUtil::MakeTupleShape({keys_shape, values_shape}), 0, keys, values));
+
+  BuildModuleAndRunAnalysis(builder.Build());
+
+  // The buffer for the keys can be shared with the first tuple entry.
+  EXPECT_TRUE(
+      points_to_analysis_->CanShareOperandBufferWithUser(keys, {}, sort, {0}));
+  // The buffer for the values can be shared with the second tuple entry.
+  EXPECT_TRUE(points_to_analysis_->CanShareOperandBufferWithUser(values, {},
+                                                                 sort, {1}));
+  // Verify that the buffers are not shared with the "wrong" tuple entry.
+  EXPECT_FALSE(
+      points_to_analysis_->CanShareOperandBufferWithUser(keys, {}, sort, {1}));
+  EXPECT_FALSE(points_to_analysis_->CanShareOperandBufferWithUser(values, {},
+                                                                  sort, {0}));
+}
+
 TEST_F(CanShareOperandBufferWithUserTest, FusedDotAdd) {
   auto builder = HloComputation::Builder(TestName());
   Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
 
   auto a = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 0.0}, {0.0, 1.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 0.0}, {0.0, 1.0}})));
   auto b = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
+      LiteralUtil::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
 
   DotDimensionNumbers dot_dnums;
   dot_dnums.add_lhs_contracting_dimensions(1);
@@ -1027,7 +1070,7 @@ TEST_F(CanShareOperandBufferWithUserTest, FusedDotAdd) {
       HloInstruction::CreateDot(data_shape, a, b, dot_dnums));
 
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto add_operand = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape, one, {1}));
 
@@ -1049,7 +1092,7 @@ TEST_F(CanShareOperandBufferWithUserTest, OutputFusionCantAliasOperandBuffer) {
   Shape data_shape = ShapeUtil::MakeShape(F32, {2, 2});
 
   auto one = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto operand = builder.AddInstruction(
       HloInstruction::CreateBroadcast(data_shape, one, {1}));
 
@@ -1057,7 +1100,7 @@ TEST_F(CanShareOperandBufferWithUserTest, OutputFusionCantAliasOperandBuffer) {
       HloInstruction::CreateReverse(data_shape, operand, {0, 1}));
 
   auto two = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
+      LiteralUtil::CreateR2<float>({{2.0, 2.0}, {2.0, 2.0}})));
 
   auto add = builder.AddInstruction(
       HloInstruction::CreateBinary(data_shape, HloOpcode::kAdd, reverse, two));
@@ -1122,7 +1165,7 @@ TEST_F(CanShareOperandBufferWithUserTest, CallToComputationWithFusionRoot) {
   auto sub_param = sub_builder.AddInstruction(
       HloInstruction::CreateParameter(0, shape, "sub_param"));
   auto one = sub_builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0)));
   auto ones = sub_builder.AddInstruction(
       HloInstruction::CreateBroadcast(shape, one, {1}));
   auto add = sub_builder.AddInstruction(
diff --git a/tensorflow/compiler/xla/service/tuple_simplifier_test.cc b/tensorflow/compiler/xla/service/tuple_simplifier_test.cc
index d3635eae81..39b693872d 100644
--- a/tensorflow/compiler/xla/service/tuple_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/tuple_simplifier_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include <memory>
 #include <utility>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_matchers.h"
diff --git a/tensorflow/compiler/xla/service/while_loop_constant_sinking.cc b/tensorflow/compiler/xla/service/while_loop_constant_sinking.cc
index 10fc4958fa..62af45128a 100644
--- a/tensorflow/compiler/xla/service/while_loop_constant_sinking.cc
+++ b/tensorflow/compiler/xla/service/while_loop_constant_sinking.cc
@@ -61,6 +61,12 @@ StatusOr<bool> WhileLoopConstantSinking::TrySinkingConstantsIntoWhileBody(
        WhileUtil::GetInvariantGTEsForWhileBody(*while_body)) {
     int64 index = invariant_gte->tuple_index();
     const HloInstruction& invariant_value = *init_value.operand(index);
+
+    // Should have at least one user that's not while_body_root.
+    if (invariant_gte->user_count() <= 1) {
+      continue;
+    }
+
     if (invariant_value.opcode() == HloOpcode::kConstant) {
       auto* constant_instr =
           while_body->AddInstruction(invariant_value.Clone(/*suffix=*/".sunk"));
diff --git a/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc b/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc
index 393e758038..266039d2ff 100644
--- a/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc
+++ b/tensorflow/compiler/xla/service/while_loop_constant_sinking_test.cc
@@ -196,5 +196,50 @@ ENTRY entry {
                         op::GetTupleElement(op::Parameter(0)),
                         op::GetTupleElement(op::Parameter(0))));
 }
+
+TEST_F(WhileLoopConstantSinkingTest, DontCreateDeadConstant) {
+  const char* const hlo_string = R"(
+HloModule ModuleWithWhile
+
+body {
+  p_body = (f32[2],f32[2]) parameter(0)
+  p_body.0 = f32[2] get-tuple-element((f32[2],f32[2]) p_body), index=0
+  p_body.1 = f32[2] get-tuple-element((f32[2],f32[2]) p_body), index=1
+
+  outfeed = token[] outfeed(p_body.0)
+  ROOT root = (f32[2],f32[2],f32[2]) tuple(p_body.0, p_body.1, p_body.1)
+}
+
+condition {
+  p_cond = (f32[2],f32[2]) parameter(0)
+  ROOT result = pred[] constant(true)
+}
+
+ENTRY entry {
+  const_0 = f32[2] constant({1, 2})
+  const_1 = f32[2] constant({2, 1})
+  while_init = (f32[2],f32[2]) tuple(const_0, const_1)
+  ROOT while = (f32[2],f32[2],f32[2]) while(while_init), condition=condition,
+                                      body=body
+}
+)";
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(hlo_string));
+
+  TF_ASSERT_OK_AND_ASSIGN(bool changed,
+                          WhileLoopConstantSinking{}.Run(module.get()));
+  ASSERT_TRUE(changed);
+
+  auto* while_body = module->GetComputationWithName("body");
+  EXPECT_THAT(while_body->root_instruction(),
+              op::Tuple(op::GetTupleElement(), op::GetTupleElement(),
+                        op::GetTupleElement()));
+  for (const HloInstruction* inst : while_body->instructions()) {
+    if (inst->opcode() == HloOpcode::kConstant) {
+      EXPECT_GT(inst->user_count(), 0);
+    }
+  }
+}
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc
index 23519e445e..32e69c335b 100644
--- a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc
+++ b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion_test.cc
@@ -53,7 +53,7 @@ HloComputation* WhileLoopInvariantCodeMotionTest::MakeAlwaysTrueComputation(
   builder.AddInstruction(
       HloInstruction::CreateParameter(0, param_shape, "param"));
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   return module->AddEmbeddedComputation(builder.Build());
 }
 
@@ -125,7 +125,7 @@ TEST_F(WhileLoopInvariantCodeMotionTest, HoistInvariantOperationTree) {
         builder.AddInstruction(HloInstruction::CreateUnary(
             scalar_s32, HloOpcode::kNegate, mul_result));
     HloInstruction* constant = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<int32>(4)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(4)));
     HloInstruction* sub_result =
         builder.AddInstruction(HloInstruction::CreateBinary(
             scalar_s32, HloOpcode::kSubtract, negate_result, constant));
@@ -273,7 +273,7 @@ TEST_F(WhileLoopInvariantCodeMotionTest, DontHoistInstructionWithSideEffects) {
   HloComputation::Builder builder(TestName());
   auto* scalar_param = builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_s32, "param"));
-  auto* token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto* token = builder.AddInstruction(HloInstruction::CreateToken());
   auto* init_value = builder.AddInstruction(
       HloInstruction::CreateTuple({scalar_param, scalar_param, token}));
   auto* while_inst = builder.AddInstruction(HloInstruction::CreateWhile(
@@ -323,7 +323,7 @@ TEST_F(WhileLoopInvariantCodeMotionTest, DontHoistBitcastAlone) {
   HloComputation::Builder builder(TestName());
   auto* scalar_param = builder.AddInstruction(
       HloInstruction::CreateParameter(0, scalar_s32, "param"));
-  auto* token = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto* token = builder.AddInstruction(HloInstruction::CreateToken());
   auto* init_value = builder.AddInstruction(
       HloInstruction::CreateTuple({scalar_param, scalar_param, token}));
   auto* while_inst = builder.AddInstruction(HloInstruction::CreateWhile(
diff --git a/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc b/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc
index 3c83049216..2e1571943e 100644
--- a/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/while_loop_simplifier_test.cc
@@ -157,7 +157,7 @@ TEST_F(WhileLoopSimplifierTest,
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* true_op = while_op->while_body()->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true)));
   TF_ASSERT_OK(true_op->AddControlDependencyTo(
       while_op->while_body()->root_instruction()));
   ASSERT_TRUE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
@@ -175,10 +175,10 @@ TEST_F(WhileLoopSimplifierTest, LoopWithSendNotSimplified) {
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* while_body = while_op->while_body();
-  auto* token = while_body->AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto* token = while_body->AddInstruction(HloInstruction::CreateToken());
   auto* send = while_body->AddInstruction(HloInstruction::CreateSend(
       while_body->AddInstruction(
-          HloInstruction::CreateConstant(Literal::CreateR0<bool>(true))),
+          HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(true))),
       token,
       /*channel_id=*/0));
   while_body->AddInstruction(HloInstruction::CreateSendDone(send));
@@ -192,7 +192,7 @@ TEST_F(WhileLoopSimplifierTest, LoopWithRecvNotSimplified) {
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* while_body = while_op->while_body();
-  auto* token = while_body->AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto* token = while_body->AddInstruction(HloInstruction::CreateToken());
   auto* recv = while_body->AddInstruction(
       HloInstruction::CreateRecv(ShapeUtil::MakeShape(F32, {1}), token,
                                  /*channel_id=*/0));
@@ -211,7 +211,7 @@ TEST_F(WhileLoopSimplifierTest, LoopWithInfeedNotSimplified) {
   auto* while_op = computation->root_instruction();
   ASSERT_EQ(while_op->opcode(), HloOpcode::kWhile);
   auto* while_body = while_op->while_body();
-  auto token = while_body->AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = while_body->AddInstruction(HloInstruction::CreateToken());
   while_body->AddInstruction(HloInstruction::CreateInfeed(
       ShapeUtil::MakeShape(F32, {1}), token, "config"));
   EXPECT_FALSE(WhileLoopSimplifier().Run(the_module).ValueOrDie());
diff --git a/tensorflow/compiler/xla/service/while_util.cc b/tensorflow/compiler/xla/service/while_util.cc
index 473eab2ea8..1ef17b9d7d 100644
--- a/tensorflow/compiler/xla/service/while_util.cc
+++ b/tensorflow/compiler/xla/service/while_util.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/while_util.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_creation_utils.h"
 #include "tensorflow/compiler/xla/service/tuple_util.h"
@@ -38,7 +39,7 @@ static StatusOr<HloComputation*> WidenWhileCondition(
     // the root instruction later.  We later change the root instruction to
     // something more appropriate.
     builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<bool>(false)));
     return narrow_condition->parent()->AddEmbeddedComputation(builder.Build());
   }();
 
@@ -154,7 +155,7 @@ MakeCountedLoopConditionComputation(const Shape& loop_state_shape,
                           {&loop_state_shape}, scalar_pred, "while_cond"));
 
   HloInstruction* trip_count_constant = cond_computation->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(trip_count)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(trip_count)));
 
   HloInstruction* param = cond_computation->parameter_instruction(0);
   TF_ASSIGN_OR_RETURN(HloInstruction * indvar,
@@ -175,7 +176,7 @@ static StatusOr<std::unique_ptr<HloComputation>> MakeCountedLoopBodyComputation(
                       CreateComputationWithSignature(
                           {&loop_state_shape}, loop_state_shape, "while_body"));
   HloInstruction* one = body_computation->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
   HloInstruction* param = body_computation->parameter_instruction(0);
   TF_ASSIGN_OR_RETURN(HloInstruction * indvar,
                       MakeGetTupleElementHlo(param, 0));
@@ -203,7 +204,7 @@ static StatusOr<HloInstruction*> MakeInitTupleFromInitValues(
   std::vector<HloInstruction*> init_values_with_indvar;
   init_values_with_indvar.reserve(init_values.size() + 1);
   HloInstruction* zero = computation->AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(0)));
   init_values_with_indvar.push_back(zero);
   c_copy(init_values, std::back_inserter(init_values_with_indvar));
   return computation->AddInstruction(
diff --git a/tensorflow/compiler/xla/service/zero_sized_hlo_elimination.cc b/tensorflow/compiler/xla/service/zero_sized_hlo_elimination.cc
index 44b0ec5cd4..83d696fe09 100644
--- a/tensorflow/compiler/xla/service/zero_sized_hlo_elimination.cc
+++ b/tensorflow/compiler/xla/service/zero_sized_hlo_elimination.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/zero_sized_hlo_elimination.h"
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -32,7 +32,8 @@ StatusOr<bool> ZeroSizedHloElimination::Run(HloModule* module) {
   for (HloComputation* comp : module->MakeNonfusionComputations()) {
     for (HloInstruction* instruction : comp->MakeInstructionPostOrder()) {
       if (instruction->HasSideEffect() ||
-          !ShapeUtil::IsArray(instruction->shape())) {
+          !ShapeUtil::IsArray(instruction->shape()) ||
+          instruction->opcode() == HloOpcode::kConstant) {
         continue;
       }
       if (comp->IsRemovable(instruction) &&
diff --git a/tensorflow/compiler/xla/service/zero_sized_hlo_elimination_test.cc b/tensorflow/compiler/xla/service/zero_sized_hlo_elimination_test.cc
index c6bd013a1a..b9ef18892d 100644
--- a/tensorflow/compiler/xla/service/zero_sized_hlo_elimination_test.cc
+++ b/tensorflow/compiler/xla/service/zero_sized_hlo_elimination_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <unordered_set>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
@@ -67,12 +67,19 @@ TEST_F(ZeroSizedHloEliminationTest, DoesNotEliminateParameter) {
 }
 
 TEST_F(ZeroSizedHloEliminationTest, DoesNotEliminateSideEffects) {
-  auto token = builder_.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token = builder_.AddInstruction(HloInstruction::CreateToken());
   builder_.AddInstruction(
       HloInstruction::CreateSend(zero_sized_param_, token, 0));
   TF_ASSERT_OK_AND_ASSIGN(bool changed, RunZeroSizedElimination());
   EXPECT_FALSE(changed);
 }
 
+TEST_F(ZeroSizedHloEliminationTest, DoesNotEliminateConstant) {
+  builder_.AddInstruction(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1({})));
+  TF_ASSERT_OK_AND_ASSIGN(bool changed, RunZeroSizedElimination());
+  EXPECT_FALSE(changed);
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/shape_layout.cc b/tensorflow/compiler/xla/shape_layout.cc
index 7ee366b27a..caad31d6ce 100644
--- a/tensorflow/compiler/xla/shape_layout.cc
+++ b/tensorflow/compiler/xla/shape_layout.cc
@@ -67,6 +67,14 @@ void ShapeLayout::ResetLayout(const Layout& layout) {
   TF_CHECK_OK(ShapeUtil::ValidateShape(shape_));
 }
 
+void ShapeLayout::ResetLayout(const Layout& layout,
+                              ShapeIndexView shape_index) {
+  CHECK(ShapeUtil::IsTuple(shape_));
+  *ShapeUtil::GetMutableSubshape(&shape_, shape_index)->mutable_layout() =
+      layout;
+  TF_CHECK_OK(ShapeUtil::ValidateShape(shape_));
+}
+
 bool ShapeLayout::operator==(const ShapeLayout& other) const {
   return ShapeUtil::Equal(shape_, other.shape_);
 }
diff --git a/tensorflow/compiler/xla/shape_layout.h b/tensorflow/compiler/xla/shape_layout.h
index 36806da599..214cf98854 100644
--- a/tensorflow/compiler/xla/shape_layout.h
+++ b/tensorflow/compiler/xla/shape_layout.h
@@ -72,6 +72,10 @@ class ShapeLayout {
   // tuple.
   void ResetLayout(const Layout& layout);
 
+  // Resets the layout on the shape at the provided ShapeIndex to the provided
+  // layout. Shape must be a tuple.
+  void ResetLayout(const Layout& layout, ShapeIndexView shape_index);
+
   // Returns a string representation of this object.
   string ToString() const { return ShapeUtil::HumanStringWithLayout(shape_); }
 
diff --git a/tensorflow/compiler/xla/shape_tree.h b/tensorflow/compiler/xla/shape_tree.h
index 4aacc87b78..c74dd648ad 100644
--- a/tensorflow/compiler/xla/shape_tree.h
+++ b/tensorflow/compiler/xla/shape_tree.h
@@ -44,10 +44,6 @@ struct ShapeTreeNode {
   // Data corresponding to this node.
   std::pair<ShapeIndex, T> data;
 
-  // Children of this node, as indices into the container's nodes_ array.
-  std::vector<size_t> children;
-
-  // Tells whether this is a leaf node.
   bool is_leaf = true;
 
   explicit ShapeTreeNode(ShapeIndex index)
@@ -56,6 +52,20 @@ struct ShapeTreeNode {
       : data(std::move(index), std::move(data)) {}
 };
 
+// Internal representation of an index table entry.
+struct IndexTableEntry {
+  // Index of the node in the ShapeTreeNode vector.
+  uint32 index;
+  // Index of the first child in a IndexTableEntry vector. In the index
+  // table all children entries for a given node will be placed next to each
+  // other. This allows us to use a single field to index them.
+  uint32 children_start;
+#ifndef NDEBUG
+  // Number of children, used for bounds checking.
+  uint32 children_count;
+#endif
+};
+
 }  // namespace internal
 
 template <typename ContainerType, typename IteratorType, typename ValueType>
@@ -84,6 +94,7 @@ template <typename T>
 class ShapeTree {
  public:
   using Node = internal::ShapeTreeNode<T>;
+  using Index = internal::IndexTableEntry;
 
   // Default constructor creates a tree with a nil shape (i.e. an empty tuple).
   ShapeTree() : ShapeTree(ShapeUtil::MakeNil()) {}
@@ -267,11 +278,12 @@ class ShapeTree {
  private:
   // Initialize node->children based on 'shape'. All children are assigned the
   // the given 'init_value'.
-  void InitChildren(const Shape& shape, const T& init_value, Node* node);
+  void InitChildren(const Shape& shape, const T& init_value, Node* node,
+                    Index* index);
 
   // Initialize node->children based on 'shape'. All children have
   // default-constructed data values.
-  void InitChildren(const Shape& shape, Node* node);
+  void InitChildren(const Shape& shape, Node* node, Index* index);
 
   // Returns the number of subshapes, including interior nodes, in shape.
   int64 CountSubshapes(const Shape& shape);
@@ -291,6 +303,9 @@ class ShapeTree {
   // The nodes in this shape tree.
   std::vector<Node> nodes_;
 
+  // Index table for node lookups.
+  std::vector<Index> index_table_;
+
   // If we own our Shape, this field contains it, and shape_ is a pointer into
   // here.  Otherwise if we don't own our shape, this is nullptr.
   std::shared_ptr<Shape> shape_storage_;
@@ -373,36 +388,74 @@ int64 ShapeTree<T>::CountSubshapes(const Shape& shape) {
 
 template <typename T>
 void ShapeTree<T>::InitChildren(const Shape& shape, const T& init_value,
-                                Node* node) {
+                                Node* node, Index* index) {
   if (ShapeUtil::IsTuple(shape)) {
     const int64 size = ShapeUtil::TupleElementCount(shape);
-    node->children.reserve(size);
+#ifndef NDEBUG
+    index->children_count = size;
+#endif
     node->is_leaf = false;
     ShapeIndex shape_index = node->data.first;
     shape_index.push_back(0);
+
+    // At the end of the index_table, reserve a continuous space to hold the
+    // children of current node. In order to enforce the invariant that all
+    // children of a given node are placed together, we need to do the
+    // reservation before we recurse into any of its children.
+    int64 children_start_position = index_table_.size();
+    index_table_.resize(index_table_.size() + size);
+
     for (int i = 0; i < size; ++i) {
       shape_index[shape_index.size() - 1] = i;
-      node->children.push_back(nodes_.size());
+      index_table_[children_start_position + i].index = nodes_.size();
+      // The first child of the node in the index table is placed at the end of
+      // the table.
+      index_table_[children_start_position + i].children_start =
+          index_table_.size();
       nodes_.emplace_back(shape_index, init_value);
-      InitChildren(shape.tuple_shapes(i), init_value, &nodes_.back());
+      InitChildren(shape.tuple_shapes(i), init_value, &nodes_.back(),
+                   &index_table_[children_start_position + i]);
     }
+  } else {
+#ifndef NDEBUG
+    index->children_count = 0;
+#endif
   }
 }
 
 template <typename T>
-void ShapeTree<T>::InitChildren(const Shape& shape, Node* node) {
+void ShapeTree<T>::InitChildren(const Shape& shape, Node* node, Index* index) {
   if (ShapeUtil::IsTuple(shape)) {
     const int64 size = ShapeUtil::TupleElementCount(shape);
-    node->children.reserve(size);
+#ifndef NDEBUG
+    index->children_count = size;
+#endif
     node->is_leaf = false;
     ShapeIndex shape_index = node->data.first;
     shape_index.push_back(0);
+
+    // At the end of the index_table, reserve a continuous space to hold the
+    // children of current node. In order to enforce the invariant that all
+    // children of a given node are placed together, we need to do the
+    // reservation before we recurse into any of its children.
+    int64 children_start_position = index_table_.size();
+    index_table_.resize(index_table_.size() + size);
+
     for (int i = 0; i < size; ++i) {
       shape_index[shape_index.size() - 1] = i;
-      node->children.push_back(nodes_.size());
+      index_table_[children_start_position + i].index = nodes_.size();
+      // The first child of the node in the index table is placed at the end of
+      // the table.
+      index_table_[children_start_position + i].children_start =
+          index_table_.size();
       nodes_.emplace_back(shape_index);
-      InitChildren(shape.tuple_shapes(i), &nodes_.back());
+      InitChildren(shape.tuple_shapes(i), &nodes_.back(),
+                   &index_table_[children_start_position + i]);
     }
+  } else {
+#ifndef NDEBUG
+    index->children_count = 0;
+#endif
   }
 }
 
@@ -413,24 +466,36 @@ ShapeTree<T>::ShapeTree(Shape shape)
   // The shape_ field is just used to hold the structure of the shape.
   // It should not be relied upon to store layout information.
   LayoutUtil::ClearLayout(shape_storage_.get());
-  nodes_.reserve(CountSubshapes(*shape_));
+  const int64 count = CountSubshapes(*shape_);
+  nodes_.reserve(count);
   nodes_.emplace_back(ShapeIndex{});
-  InitChildren(*shape_, &nodes_[0]);
+
+  index_table_.reserve(count);
+  index_table_.emplace_back(Index{0, 1});
+  InitChildren(*shape_, &nodes_[0], &index_table_[0]);
 }
 
 template <typename T>
 ShapeTree<T>::ShapeTree(const Shape* shape) : shape_(shape) {
-  nodes_.reserve(CountSubshapes(*shape_));
+  const int64 count = CountSubshapes(*shape_);
+  nodes_.reserve(count);
   nodes_.emplace_back(ShapeIndex{});
-  InitChildren(*shape_, &nodes_[0]);
+
+  index_table_.reserve(count);
+  index_table_.emplace_back(Index{0, 1});
+  InitChildren(*shape_, &nodes_[0], &index_table_[0]);
 }
 
 template <typename T>
 ShapeTree<T>::ShapeTree(const std::shared_ptr<Shape>& shape)
     : shape_storage_(shape), shape_(shape_storage_.get()) {
-  nodes_.reserve(CountSubshapes(*shape_));
+  const int64 count = CountSubshapes(*shape_);
+  nodes_.reserve(count);
   nodes_.emplace_back(ShapeIndex{});
-  InitChildren(*shape_, &nodes_[0]);
+
+  index_table_.reserve(count);
+  index_table_.emplace_back(Index{0, 1});
+  InitChildren(*shape_, &nodes_[0], &index_table_[0]);
 }
 
 template <typename T>
@@ -440,26 +505,38 @@ ShapeTree<T>::ShapeTree(Shape shape, const T& init_value)
   // The shape_ field is just used to hold the structure of the shape.
   // It should not be relied upon to store layout information.
   LayoutUtil::ClearLayout(shape_storage_.get());
-  nodes_.reserve(CountSubshapes(*shape_));
+  const int64 count = CountSubshapes(*shape_);
+  nodes_.reserve(count);
   nodes_.emplace_back(ShapeIndex{}, init_value);
-  InitChildren(*shape_, init_value, &nodes_[0]);
+
+  index_table_.reserve(count);
+  index_table_.emplace_back(Index{0, 1});
+  InitChildren(*shape_, init_value, &nodes_[0], &index_table_[0]);
 }
 
 template <typename T>
 ShapeTree<T>::ShapeTree(const Shape* shape, const T& init_value)
     : shape_(shape) {
-  nodes_.reserve(CountSubshapes(*shape_));
+  const int64 count = CountSubshapes(*shape_);
+  nodes_.reserve(count);
   nodes_.emplace_back(ShapeIndex{}, init_value);
-  InitChildren(*shape_, init_value, &nodes_[0]);
+
+  index_table_.reserve(count);
+  index_table_.emplace_back(Index{0, 1});
+  InitChildren(*shape_, init_value, &nodes_[0], &index_table_[0]);
 }
 
 template <typename T>
 ShapeTree<T>::ShapeTree(const std::shared_ptr<Shape>& shape,
                         const T& init_value)
     : shape_storage_(shape), shape_(shape_storage_.get()) {
-  nodes_.reserve(CountSubshapes(*shape_));
+  const int64 count = CountSubshapes(*shape_);
+  nodes_.reserve(count);
   nodes_.emplace_back(ShapeIndex{}, init_value);
-  InitChildren(*shape_, init_value, &nodes_[0]);
+
+  index_table_.reserve(count);
+  index_table_.emplace_back(Index{0, 1});
+  InitChildren(*shape_, init_value, &nodes_[0], &index_table_[0]);
 }
 
 template <typename T>
@@ -474,13 +551,16 @@ T* ShapeTree<T>::mutable_element(ShapeIndexView index) {
 
 template <typename T>
 internal::ShapeTreeNode<T>* ShapeTree<T>::Lookup(ShapeIndexView index) {
-  Node* node = &nodes_[0];
+  Index* iter = &index_table_[0];
   for (const int64 i : index) {
     CHECK_GE(i, 0);
-    CHECK_LT(i, node->children.size());
-    node = &nodes_[node->children[i]];
+#ifndef NDEBUG
+    CHECK_LT(i, iter->children_count);
+#endif
+    iter = &index_table_[iter->children_start + i];
   }
-  return node;
+
+  return &nodes_[iter->index];
 }
 
 template <typename T>
diff --git a/tensorflow/compiler/xla/shape_tree_test.cc b/tensorflow/compiler/xla/shape_tree_test.cc
index 51de82e957..4391078b64 100644
--- a/tensorflow/compiler/xla/shape_tree_test.cc
+++ b/tensorflow/compiler/xla/shape_tree_test.cc
@@ -227,14 +227,16 @@ TEST_F(ShapeTreeTest, NestedTupleShape) {
 
 TEST_F(ShapeTreeTest, InvalidIndexingTuple) {
   ShapeTree<int> shape_tree{tuple_shape_};
-
+#ifndef NDEBUG
   EXPECT_DEATH(shape_tree.element({4}), "");
+#endif
 }
 
 TEST_F(ShapeTreeTest, InvalidIndexingNestedTuple) {
   ShapeTree<int> shape_tree{nested_tuple_shape_};
-
+#ifndef NDEBUG
   EXPECT_DEATH(shape_tree.element({0, 0}), "");
+#endif
 }
 
 TEST_F(ShapeTreeTest, ShapeTreeOfNonCopyableType) {
@@ -602,12 +604,15 @@ void BM_Iterate(int iters, int depth, int fan_out) {
   }
 }
 
-BENCHMARK(BM_Construct)->ArgPair(2, 8);
-BENCHMARK(BM_ConstructUnowned)->ArgPair(2, 8);
-BENCHMARK(BM_Copy)->ArgPair(2, 8);
-BENCHMARK(BM_Move)->ArgPair(2, 8);
-BENCHMARK(BM_ForEach)->ArgPair(2, 8);
-BENCHMARK(BM_Iterate)->ArgPair(2, 8);
+#define BENCHMARK_WITH_ARGS(name) \
+  BENCHMARK(name)->ArgPair(2, 8)->ArgPair(1, 1000)
+
+BENCHMARK_WITH_ARGS(BM_Construct);
+BENCHMARK_WITH_ARGS(BM_ConstructUnowned);
+BENCHMARK_WITH_ARGS(BM_Copy);
+BENCHMARK_WITH_ARGS(BM_Move);
+BENCHMARK_WITH_ARGS(BM_ForEach);
+BENCHMARK_WITH_ARGS(BM_Iterate);
 
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/shape_util.cc b/tensorflow/compiler/xla/shape_util.cc
index 56d24423c4..9ea41b7c92 100644
--- a/tensorflow/compiler/xla/shape_util.cc
+++ b/tensorflow/compiler/xla/shape_util.cc
@@ -46,28 +46,14 @@ namespace xla {
 using ::tensorflow::strings::StrAppend;
 using ::tensorflow::strings::StrCat;
 
-string ShapeIndex::ToString() const {
-  return StrCat("{", tensorflow::str_util::Join(indices_, ","), "}");
-}
+string ShapeIndex::ToString() const { return ShapeIndexView(*this).ToString(); }
 
 string ShapeIndexView::ToString() const {
-  return StrCat("{",
-                tensorflow::str_util::Join(
-                    tensorflow::gtl::make_range(begin_, end_), ","),
-                "}");
+  return StrCat("{", tensorflow::str_util::Join(indices_, ","), "}");
 }
 
 bool ShapeIndexView::operator==(const ShapeIndexView& other) const {
-  if (size() != other.size()) {
-    return false;
-  }
-  for (auto it = begin(), other_it = other.begin(); it != end();
-       ++it, ++other_it) {
-    if (*it != *other_it) {
-      return false;
-    }
-  }
-  return true;
+  return indices_ == other.indices_;
 }
 
 bool ShapeIndexView::operator!=(const ShapeIndexView& other) const {
@@ -696,7 +682,7 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
                            CompatibleIgnoringElementType);
   } else {
     // Opaque, token, etc types are vacuously compatible.
-    return true;
+    return lhs.element_type() == rhs.element_type();
   }
 }
 
@@ -711,7 +697,7 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
                            CompatibleIgnoringFpPrecision);
   } else {
     // Opaque, token, etc types are vacuously compatible.
-    return true;
+    return lhs.element_type() == rhs.element_type();
   }
 }
 
@@ -806,7 +792,7 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
   if (LayoutUtil::IsSparseArray(shape)) {
     allocated_element_count = LayoutUtil::MaxSparseElements(shape.layout());
   } else {
-    CHECK(LayoutUtil::IsDenseArray(shape));
+    CHECK(LayoutUtil::IsDenseArray(shape)) << shape.ShortDebugString();
     tensorflow::gtl::ArraySlice<int64> padded_dimensions =
         LayoutUtil::PaddedDimensions(shape);
     if (!padded_dimensions.empty()) {
@@ -897,40 +883,51 @@ StatusOr<Shape> ParseShapeStringInternal(tensorflow::StringPiece* s) {
   }
 
   int64 shape_size = [&shape]() {
-    int64 shape_size;
     if (LayoutUtil::IsSparseArray(shape)) {
-      shape_size = LayoutUtil::MaxSparseElements(shape.layout());
-      if (shape_size < 0) {
-        return shape_size;
+      int64 max_sparse_elements = LayoutUtil::MaxSparseElements(shape.layout());
+      if (max_sparse_elements < 0) {
+        return max_sparse_elements;
       }
-      shape_size = MultiplyWithoutOverflow(shape_size, ShapeUtil::Rank(shape));
-      if (shape_size < 0) {
-        return shape_size;
+      int64 sparse_elements_size = MultiplyWithoutOverflow(
+          max_sparse_elements, ByteSizeOfPrimitiveType(shape.element_type()));
+      if (sparse_elements_size < 0) {
+        return sparse_elements_size;
       }
-      shape_size = MultiplyWithoutOverflow(shape_size, sizeof(int64));
-      if (shape_size < 0) {
-        return shape_size;
+      int64 sparse_indices_size =
+          MultiplyWithoutOverflow(max_sparse_elements, ShapeUtil::Rank(shape));
+      if (sparse_indices_size < 0) {
+        return sparse_indices_size;
+      }
+      sparse_indices_size =
+          MultiplyWithoutOverflow(sparse_indices_size, sizeof(int64));
+      if (sparse_indices_size < 0) {
+        return sparse_indices_size;
+      }
+      // At this point, both sparse_indices_size and sparse_elements_size are
+      // non-negative, so we can easily check if adding them wraps.
+      if (static_cast<uint64>(sparse_elements_size) +
+              static_cast<uint64>(sparse_indices_size) >
+          INT64_MAX) {
+        return static_cast<int64>(-1);
       }
     }
 
-    shape_size = 1;
-
     // This is intentionally unconditional: even if the shape is sparse, we want
     // to verify the densified version has a reasonable size.
+    int64 dense_shape_size = 1;
     if (shape.dimensions().empty()) {
-      return shape_size;
+      return dense_shape_size;
     }
 
     for (int64 dim : shape.dimensions()) {
-      shape_size = MultiplyWithoutOverflow(shape_size, dim);
-      if (shape_size < 0) {
-        return shape_size;
+      dense_shape_size = MultiplyWithoutOverflow(dense_shape_size, dim);
+      if (dense_shape_size < 0) {
+        return dense_shape_size;
       }
     }
-    shape_size = MultiplyWithoutOverflow(
-        shape_size, ByteSizeOfPrimitiveType(shape.element_type()));
-
-    return shape_size;
+    dense_shape_size = MultiplyWithoutOverflow(
+        dense_shape_size, ByteSizeOfPrimitiveType(shape.element_type()));
+    return dense_shape_size;
   }();
 
   if (shape_size < 0) {
@@ -1126,12 +1123,41 @@ Status ForEachMutableSubshapeHelper(
   for (auto dim : Permute(permutation, shape.dimensions())) {
     new_shape.add_dimensions(dim);
   }
+
+  // If `shape` has a layout, by contract we choose a new layout such that the
+  // transpose defined by this permutation is a bitcast.
+  //
+  // Some formalism helps to understand the correct way to do this.  We're going
+  // to do algebra in the group of permutations of the dimensions of `shape`.
+  //
+  // Since the order of `shape`'s dimensions is not permuted relative to itself,
+  // `shape`'s list of dimensions is isomorphic to the identity I.
+  //
+  // Let `shape`'s layout be L.  A layout is a permutation which maps a
+  // minor-to-major physical layout to the order of a shape's logical dims.
+  // Therefore inverse of a layout maps from logical to physical dims, and so
+  // the physical layout of I is simply L'.I = L', where L' is the inverse of L.
+  //
+  // Let the argument `permutation` be P.  This is a permutation over `shape`'s
+  // dimensions, so our return value will be a shape with dims P.I = P.  Our
+  // goal is to construct a layout permutation L* that we can apply to P such
+  // that that the physical dimension ordering of the returned shape is the same
+  // as that of the original shape, namely L'.
+  //
+  // Our returned shape has dims P and layout L*, so its in-memory layout is
+  // L*'.P.  Setting this equal to L' and solving for L*, we get:
+  //
+  //   L*'.P = L'    =>
+  //   L*'   = L'P'  =>
+  //   L*    = P.L
+  //
   if (shape.has_layout()) {
     CHECK(LayoutUtil::IsDenseArray(shape));
     Layout* new_layout = new_shape.mutable_layout();
     new_layout->set_format(DENSE);
     new_layout->clear_minor_to_major();
-    for (auto index : Permute(permutation, shape.layout().minor_to_major())) {
+    for (auto index : ComposePermutations(
+             permutation, AsInt64Slice(shape.layout().minor_to_major()))) {
       new_layout->add_minor_to_major(index);
     }
     if (shape.layout().padded_dimensions_size() > 0) {
@@ -1141,6 +1167,13 @@ Status ForEachMutableSubshapeHelper(
         new_layout->add_padded_dimensions(dim);
       }
     }
+    // The permutation accepted by TransposeIsBitcast is the inverse of the
+    // permutation here.
+    CHECK(TransposeIsBitcast(shape, new_shape, InversePermutation(permutation)))
+        << "shape=" << HumanStringWithLayout(shape)
+        << ", new_shape=" << HumanStringWithLayout(new_shape)
+        << ", permutation={" << tensorflow::str_util::Join(permutation, ",")
+        << "}";
   }
   return new_shape;
 }
diff --git a/tensorflow/compiler/xla/shape_util.h b/tensorflow/compiler/xla/shape_util.h
index 5ae04451d3..d6f17fc965 100644
--- a/tensorflow/compiler/xla/shape_util.h
+++ b/tensorflow/compiler/xla/shape_util.h
@@ -31,6 +31,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
 #include "tensorflow/core/lib/gtl/optional.h"
 #include "tensorflow/core/platform/cpu_info.h"
 #include "tensorflow/core/platform/env.h"
@@ -73,10 +74,12 @@ class ShapeIndex {
   // push_front is O(n^2), but shapes don't usually have a ton of dimensions.
   void push_front(int64 value) { indices_.insert(indices_.begin(), value); }
 
-  std::vector<int64>::const_iterator begin() const { return indices_.begin(); }
-  std::vector<int64>::const_iterator end() const { return indices_.end(); }
-  std::vector<int64>::iterator begin() { return indices_.begin(); }
-  std::vector<int64>::iterator end() { return indices_.end(); }
+  using container_type = tensorflow::gtl::InlinedVector<int64, 2>;
+
+  container_type::const_iterator begin() const { return indices_.begin(); }
+  container_type::const_iterator end() const { return indices_.end(); }
+  container_type::iterator begin() { return indices_.begin(); }
+  container_type::iterator end() { return indices_.end(); }
 
   const int64* data() const { return indices_.data(); }
 
@@ -97,7 +100,7 @@ class ShapeIndex {
   string ToString() const;
 
  private:
-  std::vector<int64> indices_;
+  container_type indices_;
 };
 
 // A view into a ShapeIndex as above, with the cheap/easy ability to consume the
@@ -110,31 +113,33 @@ class ShapeIndex {
 class ShapeIndexView {
  public:
   ShapeIndexView(const ShapeIndex& shape_index, int64 offset = 0)
-      : ShapeIndexView(shape_index.data() + offset,
-                       shape_index.data() + shape_index.size()) {
+      : indices_(shape_index.data() + offset, shape_index.size() - offset) {
     CHECK_LE(offset, shape_index.size());
   }
-  ShapeIndexView(std::initializer_list<int64> indices)
-      : ShapeIndexView(indices.begin(), indices.end()) {}
+  ShapeIndexView(std::initializer_list<int64> indices) : indices_(indices) {}
   ShapeIndexView(const ShapeIndexView& other) = default;
 
   using iterator = const int64*;
 
-  iterator begin() const { return begin_; }
-  iterator end() const { return end_; }
-  int64 size() const { return std::distance(begin_, end_); }
-  bool empty() const { return begin_ == end_; }
+  iterator begin() const { return indices_.begin(); }
+  iterator end() const { return indices_.end(); }
+  int64 size() const { return indices_.size(); }
+  bool empty() const { return indices_.empty(); }
   int64 front() const {
     CHECK(!empty());
-    return *begin_;
+    return indices_.front();
   }
   ShapeIndexView ConsumeFront() const {
-    CHECK(!empty());
-    auto new_begin = begin_;
-    ++new_begin;
-    return ShapeIndexView(new_begin, end_);
+    ShapeIndexView result = *this;
+    result.indices_.pop_front();
+    return result;
+  }
+  ShapeIndexView ConsumeBack() const {
+    ShapeIndexView result = *this;
+    result.indices_.pop_back();
+    return result;
   }
-  ShapeIndex ToShapeIndex() const { return ShapeIndex(begin_, end_); }
+  ShapeIndex ToShapeIndex() const { return ShapeIndex(begin(), end()); }
 
   bool operator==(const ShapeIndexView& other) const;
   bool operator!=(const ShapeIndexView& other) const;
@@ -142,10 +147,7 @@ class ShapeIndexView {
   string ToString() const;
 
  private:
-  ShapeIndexView(iterator begin, iterator end) : begin_(begin), end_(end) {}
-
-  iterator begin_;
-  iterator end_;
+  tensorflow::gtl::ArraySlice<int64> indices_;
 };
 
 std::ostream& operator<<(std::ostream& out, const ShapeIndex& shape_index);
@@ -530,7 +532,13 @@ class ShapeUtil {
   static bool HasDegenerateDimensions(const Shape& shape);
 
   // Permutes the dimensions by the given permutation, so
-  // return_value.dimensions[permutation[i]] = argument.dimensions[i]
+  // return_value.dimensions[permutation[i]] = argument.dimensions[i].
+  //
+  // Postcondition: For any valid permutation,
+  //
+  //   !HasLayout(shape) ||
+  //   TransposeIsBitcast(shape, PermuteDimensions(permutation, shape),
+  //                      InversePermutation(permutation)).
   static Shape PermuteDimensions(tensorflow::gtl::ArraySlice<int64> permutation,
                                  const Shape& shape);
 
diff --git a/tensorflow/compiler/xla/shape_util_test.cc b/tensorflow/compiler/xla/shape_util_test.cc
index b6f30af381..e5dd62ae9a 100644
--- a/tensorflow/compiler/xla/shape_util_test.cc
+++ b/tensorflow/compiler/xla/shape_util_test.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/shape_util.h"
 
+#include <numeric>
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -22,12 +23,23 @@ limitations under the License.
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
 
 namespace xla {
 namespace {
 
 using ::testing::ElementsAre;
 
+TEST(ShapeUtilTest, ShapeIndexViewTest) {
+  ShapeIndex index = {1, 2, 3, 4};
+  ShapeIndexView index_view(index, 1);
+  EXPECT_EQ(3, index_view.size());
+  EXPECT_EQ(ShapeIndexView({2, 3, 4}), index_view);
+  EXPECT_EQ(ShapeIndexView({3, 4}), index_view.ConsumeFront());
+  EXPECT_EQ(ShapeIndexView({2, 3}), index_view.ConsumeBack());
+}
+
 TEST(ShapeUtilTest, GetDimensionHelperCanNegativeIndex) {
   Shape matrix = ShapeUtil::MakeShape(F32, {2, 3});
   EXPECT_EQ(3, ShapeUtil::GetDimension(matrix, -1));
@@ -322,6 +334,17 @@ TEST(ShapeUtilTest, IncompatibleScalarVsTuple) {
   EXPECT_FALSE(ShapeUtil::CompatibleIgnoringFpPrecision(shape2, shape1));
 }
 
+TEST(ShapeUtilTest, OpaqueVsArray) {
+  Shape shape1 = ShapeUtil::MakeShape(F32, {5, 7});
+  Shape shape2 = ShapeUtil::MakeOpaqueShape();
+  EXPECT_FALSE(ShapeUtil::Compatible(shape1, shape2));
+  EXPECT_FALSE(ShapeUtil::Compatible(shape2, shape1));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringFpPrecision(shape1, shape2));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringFpPrecision(shape2, shape1));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringElementType(shape1, shape2));
+  EXPECT_FALSE(ShapeUtil::CompatibleIgnoringElementType(shape2, shape1));
+}
+
 TEST(ShapeUtilTest, CompareShapesWithPaddedDimensionsMismatch) {
   Shape shape1 = ShapeUtil::MakeShape(F32, {20, 30});
   shape1.mutable_layout()->add_padded_dimensions(10);
@@ -821,6 +844,28 @@ TEST(ShapeUtilTest, HasDegenerateDimensions) {
       ShapeUtil::HasDegenerateDimensions(ShapeUtil::MakeShape(F32, {3, 0, 5})));
 }
 
+TEST(ShapeUtilTest, PermuteDimensionsLayout) {
+  std::vector<int64> layout(3);
+  std::iota(layout.begin(), layout.end(), 0);
+  do {
+    Shape s = ShapeUtil::MakeShapeWithLayout(F32, {10, 100, 1000}, layout);
+    SCOPED_TRACE(tensorflow::strings::StrCat("s=", ShapeUtil::HumanString(s)));
+
+    std::vector<int64> permutation(3);
+    std::iota(permutation.begin(), permutation.end(), 0);
+    do {
+      SCOPED_TRACE(tensorflow::strings::StrCat(
+          "permutation=", tensorflow::str_util::Join(permutation, ",")));
+
+      // TransposeIsBitcast takes the inverse of the permutation that
+      // PermuteDimensions takes.
+      EXPECT_TRUE(ShapeUtil::TransposeIsBitcast(
+          s, ShapeUtil::PermuteDimensions(permutation, s),
+          InversePermutation(permutation)));
+    } while (std::next_permutation(permutation.begin(), permutation.end()));
+  } while (std::next_permutation(layout.begin(), layout.end()));
+}
+
 TEST(AlgebraicSimplifierTest, ReshapeIsBitcast_3x2x2_6x2_Dim0IsMostMinor) {
   EXPECT_FALSE(ShapeUtil::ReshapeIsBitcast(
       ShapeUtil::MakeShapeWithLayout(F32, {3, 2, 2}, {0, 1, 2}),
diff --git a/tensorflow/compiler/xla/tests/BUILD b/tensorflow/compiler/xla/tests/BUILD
index 77d398e5e2..42d52aee78 100644
--- a/tensorflow/compiler/xla/tests/BUILD
+++ b/tensorflow/compiler/xla/tests/BUILD
@@ -65,6 +65,7 @@ cc_library(
     srcs = ["test_utils.cc"],
     hdrs = ["test_utils.h"],
     deps = [
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
@@ -88,6 +89,7 @@ cc_library(
         "//tensorflow/compiler/xla:array3d",
         "//tensorflow/compiler/xla:array4d",
         "//tensorflow/compiler/xla:error_spec",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_comparison",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:test",
@@ -152,8 +154,8 @@ tf_cc_binary(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla/client:client_library",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service/cpu:cpu_compiler",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/core:lib",
@@ -179,6 +181,7 @@ cc_library(
         "//tensorflow/compiler/xla:array3d",
         "//tensorflow/compiler/xla:array4d",
         "//tensorflow/compiler/xla:execution_options_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -189,8 +192,8 @@ cc_library(
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:interpreter_plugin",  # reference backend
         "//tensorflow/compiler/xla/service:platform_util",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -209,6 +212,7 @@ cc_library(
     deps = [
         ":codegen_test_base",
         ":filecheck",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service:llvm_compiler",
         "//tensorflow/compiler/xla/service/llvm_ir:llvm_util",
         "//tensorflow/core:test",
@@ -258,7 +262,7 @@ cc_library(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:computation_placer",
         "//tensorflow/compiler/xla/service:device_memory_allocator",
         "//tensorflow/compiler/xla/service:local_service",
@@ -286,8 +290,8 @@ xla_test(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
@@ -302,7 +306,7 @@ xla_test(
         "enable_for_xla_interpreter",
     ],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -310,8 +314,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:test",
@@ -330,8 +334,8 @@ xla_test(
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:test",
@@ -345,16 +349,16 @@ xla_test(
         "enable_for_xla_interpreter",
     ],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/service:platform_util",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -372,9 +376,10 @@ xla_test(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:platform_util",
+        "//tensorflow/compiler/xla/service:stream_pool",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:test_utils",
         "//tensorflow/core:lib",
@@ -391,8 +396,8 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -406,7 +411,7 @@ xla_test(
     tags = ["enable_for_xla_interpreter"],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -415,9 +420,9 @@ xla_test(
         "//tensorflow/compiler/xla:xla_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
@@ -435,14 +440,14 @@ xla_test(
     tags = ["optonly"],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -460,9 +465,9 @@ xla_test(
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:test",
@@ -479,8 +484,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -497,8 +502,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -515,9 +520,9 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -531,6 +536,7 @@ xla_test(
     srcs = ["scalar_computations_test.cc"],
     shard_count = 32,
     deps = [
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -538,8 +544,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -557,8 +563,8 @@ xla_test(
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
@@ -573,7 +579,7 @@ xla_test(
         "enable_for_xla_interpreter",
     ],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -581,8 +587,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
@@ -599,7 +605,7 @@ xla_test(
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
         "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -607,8 +613,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -633,7 +639,7 @@ xla_test(
     deps = [
         ":client_library_test_base",
         ":literal_test_util",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
     ],
@@ -645,7 +651,7 @@ xla_test(
     tags = ["enable_for_xla_interpreter"],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -653,7 +659,7 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:reduce_precision_insertion",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -676,8 +682,8 @@ xla_test(
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
@@ -697,7 +703,7 @@ xla_test(
         "//tensorflow/compiler/xla:execution_options_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:test",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
     ],
@@ -720,8 +726,8 @@ xla_test(
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
@@ -744,8 +750,8 @@ xla_test(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -764,11 +770,12 @@ xla_test(
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
         "//tensorflow/compiler/xla:array4d",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -780,7 +787,7 @@ xla_test(
 CONVOLUTION_TEST_DEPS = [
     "//tensorflow/compiler/xla:array2d",
     "//tensorflow/compiler/xla:array4d",
-    "//tensorflow/compiler/xla:literal_util",
+    "//tensorflow/compiler/xla:literal",
     "//tensorflow/compiler/xla:reference_util",
     "//tensorflow/compiler/xla:shape_util",
     "//tensorflow/compiler/xla:statusor",
@@ -789,7 +796,7 @@ CONVOLUTION_TEST_DEPS = [
     "//tensorflow/compiler/xla/client:global_data",
     "//tensorflow/compiler/xla/client:local_client",
     "//tensorflow/compiler/xla/client:padding",
-    "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+    "//tensorflow/compiler/xla/client:xla_builder",
     "//tensorflow/compiler/xla/tests:client_library_test_base",
     "//tensorflow/compiler/xla/tests:literal_test_util",
     "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -827,13 +834,13 @@ xla_test(
     deps = [
         "//tensorflow/compiler/xla:array3d",
         "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client:padding",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -856,8 +863,8 @@ xla_test(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client:padding",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -874,7 +881,7 @@ xla_test(
         ":test_utils",
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -885,10 +892,10 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
         "//tensorflow/compiler/xla/client/lib:math",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
@@ -907,7 +914,7 @@ xla_test(
         ":test_utils",
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -918,9 +925,9 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
@@ -940,12 +947,12 @@ xla_test(
     ],
     deps = [
         ":test_utils",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -966,7 +973,7 @@ xla_test(
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -985,8 +992,8 @@ xla_test(
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1007,7 +1014,7 @@ xla_test(
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:computation_placer",
         "//tensorflow/compiler/xla/service:device_memory_allocator",
         "//tensorflow/compiler/xla/service:local_service",
@@ -1031,14 +1038,15 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -1058,9 +1066,9 @@ xla_test(
         "//tensorflow/compiler/xla:array3d",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1079,6 +1087,7 @@ xla_test(
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -1088,9 +1097,9 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1115,9 +1124,9 @@ xla_test_library(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client:padding",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -1149,16 +1158,16 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/client:padding",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1176,10 +1185,10 @@ xla_test(
     deps = [
         ":client_library_test_base",
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -1228,12 +1237,13 @@ xla_test(
         "enable_for_xla_interpreter",
     ],
     deps = [
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1246,11 +1256,12 @@ xla_test(
     name = "custom_call_test",
     srcs = ["custom_call_test.cc"],
     deps = [
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service/cpu:custom_call_target_registry",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
@@ -1273,8 +1284,8 @@ xla_test(
         "//tensorflow/compiler/xla:array4d",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1291,12 +1302,13 @@ xla_test(
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1316,8 +1328,8 @@ xla_test(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1334,8 +1346,8 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1351,8 +1363,8 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1368,7 +1380,7 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
@@ -1376,8 +1388,8 @@ xla_test(
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
@@ -1391,14 +1403,14 @@ xla_test(
     name = "prng_test",
     srcs = ["prng_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
@@ -1416,6 +1428,7 @@ xla_test(
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -1426,8 +1439,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1446,7 +1459,7 @@ xla_test(
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1469,9 +1482,9 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:arithmetic",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1495,8 +1508,8 @@ xla_test(
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1515,8 +1528,8 @@ xla_test(
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1530,15 +1543,15 @@ xla_test(
     name = "cross_replica_sum_test",
     srcs = ["cross_replica_sum_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:test",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
@@ -1560,7 +1573,7 @@ xla_test(
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1574,15 +1587,15 @@ xla_test(
     name = "compilation_cache_test",
     srcs = ["compilation_cache_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla:xla_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
@@ -1600,8 +1613,8 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1614,7 +1627,7 @@ xla_test(
     name = "compute_constant_test",
     srcs = ["compute_constant_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -1623,8 +1636,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:global_data",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1644,8 +1657,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:test_utils",
@@ -1661,8 +1674,8 @@ xla_test(
     deps = [
         ":client_library_test_base",
         "//tensorflow/compiler/xla/client:global_data",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:test",
     ],
@@ -1675,8 +1688,8 @@ xla_test(
     deps = [
         ":client_library_test_base",
         "//tensorflow/compiler/xla/client:global_data",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:test",
     ],
@@ -1689,15 +1702,15 @@ xla_test(
         "enable_for_xla_interpreter",
     ],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:protobuf_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -1714,7 +1727,7 @@ xla_test(
         "enable_for_xla_interpreter",
     ],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -1731,6 +1744,7 @@ tf_cc_test(
     srcs = ["llvm_compiler_test.cc"],
     tags = ["requires-gpu-sm35"],
     deps = [
+        "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla/service:backend",
         "//tensorflow/compiler/xla/service:cpu_plugin",
@@ -1751,7 +1765,7 @@ xla_test(
     name = "round_trip_packed_literal_test",
     srcs = ["round_trip_packed_literal_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:packed_literal_reader",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
@@ -1774,15 +1788,16 @@ xla_test(
     ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:hlo",
+        "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service:hlo_runner",
         "//tensorflow/compiler/xla/service:platform_util",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
@@ -1801,14 +1816,14 @@ xla_test(
     srcs = ["multioutput_fusion_test.cc"],
     deps = [
         "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:util",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_runner",
         "//tensorflow/compiler/xla/service:platform_util",
@@ -1841,11 +1856,11 @@ xla_test(
     name = "local_client_allocation_test",
     srcs = ["local_client_allocation_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:local_service",
         "//tensorflow/compiler/xla/service:shaped_buffer",
         "//tensorflow/compiler/xla/tests:literal_test_util",
@@ -1864,7 +1879,7 @@ xla_test(
     shard_count = 30,
     tags = ["optonly"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -1872,8 +1887,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:device_memory_allocator",
         "//tensorflow/compiler/xla/service:local_service",
         "//tensorflow/compiler/xla/service:platform_util",
@@ -1889,6 +1904,16 @@ xla_test(
     ],
 )
 
+xla_test(
+    name = "outfeed_in_nested_computation_test",
+    srcs = ["outfeed_in_nested_computation_test.cc"],
+    deps = [
+        "//tensorflow/compiler/xla/tests:local_client_test_base",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/core:test",
+    ],
+)
+
 tf_cc_test(
     name = "hlo_metadata_test",
     srcs = [
@@ -1898,7 +1923,7 @@ tf_cc_test(
         ":local_client_test_base",
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/service:cpu_plugin",
         "//tensorflow/compiler/xla/service:local_service",
         "//tensorflow/core:test_main",
@@ -1910,7 +1935,7 @@ xla_test(
     srcs = ["round_trip_transfer_test.cc"],
     deps = [
         "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:xla_data_proto",
@@ -1931,7 +1956,7 @@ xla_test(
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:reference_util",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
@@ -1940,8 +1965,8 @@ xla_test(
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
@@ -1954,7 +1979,7 @@ xla_test(
     name = "deep_graph_test",
     srcs = ["deep_graph_test.cc"],
     deps = [
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_builder",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
     ],
@@ -1979,7 +2004,7 @@ xla_test(
         ":literal_test_util",
         ":local_client_test_base",
         ":xla_internal_test_main",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:types",
@@ -1987,6 +2012,7 @@ xla_test(
         "//tensorflow/compiler/xla/service:device_memory_allocator",
         "//tensorflow/compiler/xla/service:generic_transfer_manager",
         "//tensorflow/compiler/xla/service:shaped_buffer",
+        "//tensorflow/compiler/xla/service:stream_pool",
         "//tensorflow/core:lib",
         "//tensorflow/core:stream_executor_no_cuda",
         "//tensorflow/core:test",
@@ -2039,10 +2065,30 @@ xla_test(
         ":local_client_test_base",
         ":test_utils",
         "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla/client/xla_client:xla_builder",
-        "//tensorflow/compiler/xla/client/xla_client:xla_computation",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:test",
     ],
 )
+
+xla_test(
+    name = "iota_test",
+    srcs = ["iota_test.cc"],
+    blacklisted_backends = [
+        "cpu",
+        "gpu",
+    ],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
+    deps = [
+        ":client_library_test_base",
+        ":literal_test_util",
+        ":xla_internal_test_main",
+        "//tensorflow/compiler/xla/client:xla_builder",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+    ],
+)
diff --git a/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc b/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
index 3bdf98544a..74f2e36f82 100644
--- a/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
+++ b/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
@@ -24,9 +24,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@@ -225,7 +225,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, AddTwoConstantU64s) {
                           0x8000000000000000LL,
                           0x8000000000000000LL,
                           1};
-  std::unique_ptr<Literal> lhs_literal = Literal::CreateR1<uint64>({lhs});
+  std::unique_ptr<Literal> lhs_literal = LiteralUtil::CreateR1<uint64>({lhs});
   auto lhs_param = Parameter(&b, 0, lhs_literal->shape(), "lhs_param");
   std::unique_ptr<GlobalData> lhs_data =
       client_->TransferToServer(*lhs_literal).ConsumeValueOrDie();
@@ -239,7 +239,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, AddTwoConstantU64s) {
                           0,
                           1,
                           0x8000000000000000LL};
-  std::unique_ptr<Literal> rhs_literal = Literal::CreateR1<uint64>({rhs});
+  std::unique_ptr<Literal> rhs_literal = LiteralUtil::CreateR1<uint64>({rhs});
   auto rhs_param = Parameter(&b, 1, rhs_literal->shape(), "rhs_param");
   std::unique_ptr<GlobalData> rhs_data =
       client_->TransferToServer(*rhs_literal).ConsumeValueOrDie();
@@ -265,7 +265,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, SubTwoConstantS64s) {
                          1,
                          0,
                          -1};
-  std::unique_ptr<Literal> lhs_literal = Literal::CreateR1<int64>({lhs});
+  std::unique_ptr<Literal> lhs_literal = LiteralUtil::CreateR1<int64>({lhs});
   auto lhs_param = Parameter(&b, 0, lhs_literal->shape(), "lhs_param");
   std::unique_ptr<GlobalData> lhs_data =
       client_->TransferToServer(*lhs_literal).ConsumeValueOrDie();
@@ -278,7 +278,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, SubTwoConstantS64s) {
                          0x7FFFFFFFFFFFFFFLL,
                          0x7FFFFFFFFFFFFFFFLL,
                          0x7FFFFFFFFFFFFFFFLL};
-  std::unique_ptr<Literal> rhs_literal = Literal::CreateR1<int64>({rhs});
+  std::unique_ptr<Literal> rhs_literal = LiteralUtil::CreateR1<int64>({rhs});
   auto rhs_param = Parameter(&b, 1, rhs_literal->shape(), "rhs_param");
   std::unique_ptr<GlobalData> rhs_data =
       client_->TransferToServer(*rhs_literal).ConsumeValueOrDie();
@@ -303,13 +303,13 @@ TEST_P(ArrayElementwiseOpTestParamCount, AddManyValues) {
     b_values.push_back(2 * i / static_cast<float>(count + 2));
   }
 
-  std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({a_values});
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR1<float>({a_values});
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   auto a_constant = ConstantR1<float>(&builder, a_values);
   auto a_param = Parameter(&builder, 0, a_literal->shape(), "a_param");
 
-  std::unique_ptr<Literal> b_literal = Literal::CreateR1<float>({b_values});
+  std::unique_ptr<Literal> b_literal = LiteralUtil::CreateR1<float>({b_values});
   std::unique_ptr<GlobalData> b_data =
       client_->TransferToServer(*b_literal).ConsumeValueOrDie();
   auto b_constant = Parameter(&builder, 1, a_literal->shape(), "b_param");
@@ -1426,7 +1426,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, PowSpecialF32) {
   std::vector<float> values = {1.0f, 2.0f, 3.2f, -4.0f};
   std::vector<float> exponents = {0.0f, 1.0f, 2.0f, 0.5f, -1.0f, -0.5f};
 
-  std::unique_ptr<Literal> param_literal = Literal::CreateR1<float>(values);
+  std::unique_ptr<Literal> param_literal = LiteralUtil::CreateR1<float>(values);
   std::unique_ptr<GlobalData> param_data =
       client_->TransferToServer(*param_literal).ConsumeValueOrDie();
 
@@ -1454,10 +1454,10 @@ XLA_TEST_F(ArrayElementwiseOpTest, PowOfExpF32) {
   std::vector<float> values0 = {1.0f, 2.0f, 3.2f, -4.0f, 0.0f, 5.7f};
   std::vector<float> values1 = {0.0f, 1.0f, 2.0f, 0.5f, -1.0f, -0.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
   auto param0 = Parameter(&b, 0, literal0->shape(), "param0");
@@ -1479,10 +1479,10 @@ XLA_TEST_F(ArrayElementwiseOpTest, LogOfPowerF32) {
   std::vector<float> values0 = {1.0f, 2.0f, 3.2f, 4.0f, 0.5f, 5.7f};
   std::vector<float> values1 = {0.0f, 1.0f, 2.0f, 0.5f, -1.0f, -0.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
   auto param0 = Parameter(&b, 0, literal0->shape(), "param0");
@@ -1504,10 +1504,10 @@ XLA_TEST_F(ArrayElementwiseOpTest, MulOfExpF32) {
   std::vector<float> values0 = {1.0f, 2.0f, 3.2f, -4.0f, 0.0f, 5.7f};
   std::vector<float> values1 = {0.0f, 1.0f, 2.0f, 0.5f, -1.0f, -0.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
   auto param0 = Parameter(&b, 0, literal0->shape(), "param0");
@@ -1529,10 +1529,10 @@ XLA_TEST_F(ArrayElementwiseOpTest, DivOfExpF32) {
   std::vector<float> values0 = {1.0f, 2.0f, 3.2f, -4.0f, 0.0f, 5.7f};
   std::vector<float> values1 = {0.0f, 1.0f, 2.0f, 0.5f, -1.0f, -0.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
   auto param0 = Parameter(&b, 0, literal0->shape(), "param0");
@@ -1555,15 +1555,15 @@ XLA_TEST_F(ArrayElementwiseOpTest, Div3_lhs_F32) {
   std::vector<float> values1 = {0.1f, 1.0f, 2.0f, 0.5f, -1.0f, -0.5f};
   std::vector<float> values2 = {0.1f, 1.1f, 6.9f, 12.5f, -15.0f, -0.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal2 = Literal::CreateR1<float>(values2);
+  std::unique_ptr<Literal> literal2 = LiteralUtil::CreateR1<float>(values2);
   std::unique_ptr<GlobalData> data2 =
       client_->TransferToServer(*literal2).ConsumeValueOrDie();
   auto param0 = Parameter(&b, 0, literal0->shape(), "param0");
@@ -1587,15 +1587,15 @@ XLA_TEST_F(ArrayElementwiseOpTest, Div3_rhs_F32) {
   std::vector<float> values1 = {0.1f, 1.0f, 2.0f, 0.5f, -1.0f, -0.5f};
   std::vector<float> values2 = {0.1f, 1.1f, 6.9f, 12.5f, -15.0f, -0.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal2 = Literal::CreateR1<float>(values2);
+  std::unique_ptr<Literal> literal2 = LiteralUtil::CreateR1<float>(values2);
   std::unique_ptr<GlobalData> data2 =
       client_->TransferToServer(*literal2).ConsumeValueOrDie();
 
@@ -1620,15 +1620,15 @@ XLA_TEST_F(ArrayElementwiseOpTest, DivOfPowerF32) {
   std::vector<float> values1 = {0.1f, 1.0f, 2.0f, 0.5f, 1.0f, 0.5f};
   std::vector<float> values2 = {0.1f, 1.1f, 6.9f, 9.5f, -11.0f, -0.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal2 = Literal::CreateR1<float>(values2);
+  std::unique_ptr<Literal> literal2 = LiteralUtil::CreateR1<float>(values2);
   std::unique_ptr<GlobalData> data2 =
       client_->TransferToServer(*literal2).ConsumeValueOrDie();
 
@@ -1654,19 +1654,19 @@ XLA_TEST_F(ArrayElementwiseOpTest, Div4F32) {
   std::vector<float> values2 = {0.1f, 1.1f, 6.9f, 12.5f, -15.0f, -0.5f};
   std::vector<float> values3 = {2.1f, 3.1f, 9.9f, -4.5f, -11.0f, -21.5f};
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>(values0);
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>(values0);
   std::unique_ptr<GlobalData> data0 =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>(values1);
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>(values1);
   std::unique_ptr<GlobalData> data1 =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal2 = Literal::CreateR1<float>(values2);
+  std::unique_ptr<Literal> literal2 = LiteralUtil::CreateR1<float>(values2);
   std::unique_ptr<GlobalData> data2 =
       client_->TransferToServer(*literal2).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal3 = Literal::CreateR1<float>(values3);
+  std::unique_ptr<Literal> literal3 = LiteralUtil::CreateR1<float>(values3);
   std::unique_ptr<GlobalData> data3 =
       client_->TransferToServer(*literal3).ConsumeValueOrDie();
 
@@ -2101,12 +2101,12 @@ XLA_TEST_F(ArrayElementwiseOpTest, AddTwoParametersF32s) {
   XlaBuilder builder(TestName());
 
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({1.1f, 2.2f, 3.3f, 5.5f});
+      LiteralUtil::CreateR1<float>({1.1f, 2.2f, 3.3f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
   std::unique_ptr<Literal> param1_literal =
-      Literal::CreateR1<float>({7.2f, 2.3f, 3.4f, 5.6f});
+      LiteralUtil::CreateR1<float>({7.2f, 2.3f, 3.4f, 5.6f});
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
 
@@ -2123,12 +2123,12 @@ XLA_TEST_F(ArrayElementwiseOpTest, AddTwoParametersZeroElementF32s) {
   XlaBuilder builder(TestName());
 
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR3FromArray3D<float>(Array3D<float>(0, 7, 0));
+      LiteralUtil::CreateR3FromArray3D<float>(Array3D<float>(0, 7, 0));
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
   std::unique_ptr<Literal> param1_literal =
-      Literal::CreateR3FromArray3D<float>(Array3D<float>(0, 7, 0));
+      LiteralUtil::CreateR3FromArray3D<float>(Array3D<float>(0, 7, 0));
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
 
@@ -2145,7 +2145,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, AddParameterToConstantF32s) {
   XlaBuilder builder(TestName());
 
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({1.1f, 2.2f, 3.3f, 5.5f});
+      LiteralUtil::CreateR1<float>({1.1f, 2.2f, 3.3f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -2201,7 +2201,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, TanhF32sVector) {
   // the input tensor is large enough to exercise the vectorized tanh
   // implementation on XLA CPU.
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateR1<float>(
+  auto input_literal = LiteralUtil::CreateR1<float>(
       {1.02,  -0.32, 0.85,  0.90,  1.23,  -0.91, -0.49, 0.80,  -0.67, 0.16,
        -0.07, 0.39,  -0.41, 0.04,  1.36,  1.25,  0.41,  0.65,  -1.08, 0.32,
        -1.45, -0.77, -1.09, 0.91,  -1.03, -0.30, -1.11, -1.17, 1.50,  -0.85,
@@ -2243,7 +2243,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, ExpF32sVector) {
 
   // Just to help make sense of the scales here -- exp(89) saturates float32 and
   // exp(-10) is smaller than our error spec.
-  std::unique_ptr<Literal> input_literal = Literal::CreateR1<float>(
+  std::unique_ptr<Literal> input_literal = LiteralUtil::CreateR1<float>(
       {1.02,   -0.32,  0.85,   0.9,    1.23,   -0.91,  -0.49, 0.8,    -1.31,
        -1.44,  -0.13,  -1.31,  -0.79,  1.41,   1.21,   1.05,  -195.6, -194.5,
        -193.4, -192.3, -191.2, -190.1, -189.0, -187.9, -19.6, -18.5,  -17.4,
@@ -2277,7 +2277,7 @@ XLA_TEST_F(ArrayElementwiseOpTest, LogF32sVector) {
   // implementation on XLA CPU.
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> input_literal = Literal::CreateR1<float>(
+  std::unique_ptr<Literal> input_literal = LiteralUtil::CreateR1<float>(
       {-1.29,    -1.41,    -1.25,    -13.5,    -11.7,    -17.9,    -198,
        -167,     1.29,     1.41,     1.25,     13.5,     11.7,     17.9,
        198,      167,      1.27e+03, 1.33e+03, 1.74e+03, 1.6e+04,  1.84e+04,
@@ -2469,9 +2469,9 @@ XLA_TEST_F(ArrayElementwiseOpTest, Compare1DTo2DS32Eq) {
   auto cmp_dim_1 = Eq(v, m, /*broadcast_dimensions=*/{0});
   Tuple(&builder, {cmp_dim_0, cmp_dim_1});
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR2<bool>({{true, true}, {true, false}}).get(),
-       Literal::CreateR2<bool>({{true, false}, {false, false}}).get()});
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR2<bool>({{true, true}, {true, false}}).get(),
+       LiteralUtil::CreateR2<bool>({{true, false}, {false, false}}).get()});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -2825,8 +2825,9 @@ XLA_TEST_F(ArrayElementwiseOpTest, R4_16x16x2x2_Plus_R1_16) {
   std::iota(r1.begin(), r1.end(), 1.0);
 
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> a_literal = Literal::CreateR4FromArray4DWithLayout(
-      r4, LayoutUtil::MakeLayout({0, 1, 2, 3}));
+  std::unique_ptr<Literal> a_literal =
+      LiteralUtil::CreateR4FromArray4DWithLayout(
+          r4, LayoutUtil::MakeLayout({0, 1, 2, 3}));
   auto a = ConstantLiteral(&builder, *a_literal);
   auto b = ConstantR1<float>(&builder, r1);
   Add(a, b, {1});
@@ -2887,8 +2888,8 @@ XLA_TEST_F(ArrayElementwiseOpTest, NonIdentityBroadcastOfSameRankIsDisallowed) {
 // broadcast.
 XLA_TEST_F(ArrayElementwiseOpTest, ImplictBroadcastInFusedExpressions) {
   XlaBuilder builder(TestName());
-  auto x_literal = Literal::CreateR1<float>({1, 2, 3});
-  auto y_literal = Literal::CreateR1<float>({4, 5});
+  auto x_literal = LiteralUtil::CreateR1<float>({1, 2, 3});
+  auto y_literal = LiteralUtil::CreateR1<float>({4, 5});
   auto x_data = client_->TransferToServer(*x_literal).ConsumeValueOrDie();
   auto y_data = client_->TransferToServer(*y_literal).ConsumeValueOrDie();
 
diff --git a/tensorflow/compiler/xla/tests/axpy_simple_test.cc b/tensorflow/compiler/xla/tests/axpy_simple_test.cc
index 8d15b7841b..caeb0bf49a 100644
--- a/tensorflow/compiler/xla/tests/axpy_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/axpy_simple_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
diff --git a/tensorflow/compiler/xla/tests/bad_rng_shape_validation_test.cc b/tensorflow/compiler/xla/tests/bad_rng_shape_validation_test.cc
index 8c227df7f0..af0b852239 100644
--- a/tensorflow/compiler/xla/tests/bad_rng_shape_validation_test.cc
+++ b/tensorflow/compiler/xla/tests/bad_rng_shape_validation_test.cc
@@ -19,8 +19,8 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
diff --git a/tensorflow/compiler/xla/tests/batch_normalization_test.cc b/tensorflow/compiler/xla/tests/batch_normalization_test.cc
index 217673c8cb..d372d1ca43 100644
--- a/tensorflow/compiler/xla/tests/batch_normalization_test.cc
+++ b/tensorflow/compiler/xla/tests/batch_normalization_test.cc
@@ -22,9 +22,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/lib/math.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -63,7 +63,7 @@ class BatchNormalizationTest
         {5.0f, 4.4f},   // p2
     });
     input_array_.FillWithPZ(pz);
-    input_literal_ = std::move(*Literal::CreateR4FromArray4D(input_array_));
+    input_literal_ = std::move(*LiteralUtil::CreateR4FromArray4D(input_array_));
     CHECK_EQ(kSamples, input_array_.planes());
     CHECK_EQ(kZ, input_array_.depth());
     CHECK_EQ(kY, input_array_.height());
@@ -242,12 +242,12 @@ XLA_TEST_P(BatchNormalizationTest, BasicTraining) {
   BatchNormTraining(operand, scale, offset,
                     /*epsilon=*/0.001, kFeatureIndex);
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR4<float>({{{{-1.6f, -2.0f}}, {{0.1f, 0.6f}}},
-                                 {{{1.9f, 3.3f}}, {{3.7f, 6.0f}}}})
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR4<float>({{{{-1.6f, -2.0f}}, {{0.1f, 0.6f}}},
+                                     {{{1.9f, 3.3f}}, {{3.7f, 6.0f}}}})
            .get(),
-       Literal::CreateR1<float>({4, 5}).get(),
-       Literal::CreateR1<float>({5, 5}).get()});
+       LiteralUtil::CreateR1<float>({4, 5}).get(),
+       LiteralUtil::CreateR1<float>({5, 5}).get()});
 
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.1));
 }
@@ -267,12 +267,12 @@ XLA_TEST_P(BatchNormalizationTest, BasicTrainingOnDimension2) {
   BatchNormTraining(operand, scale, offset,
                     /*epsilon=*/0.001, kFeatureIndex);
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR4<float>({{{{-1.6f}, {-2.0f}}, {{0.1f}, {0.6f}}},
-                                 {{{1.9f}, {3.3f}}, {{3.7f}, {6.0f}}}})
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR4<float>({{{{-1.6f}, {-2.0f}}, {{0.1f}, {0.6f}}},
+                                     {{{1.9f}, {3.3f}}, {{3.7f}, {6.0f}}}})
            .get(),
-       Literal::CreateR1<float>({4, 5}).get(),
-       Literal::CreateR1<float>({5, 5}).get()});
+       LiteralUtil::CreateR1<float>({4, 5}).get(),
+       LiteralUtil::CreateR1<float>({5, 5}).get()});
 
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.1));
 }
@@ -298,11 +298,11 @@ XLA_TEST_P(BatchNormalizationTest, TrainingWithFeatureOnLowDimension) {
   BatchNormTraining(h0, h1, h2,
                     /*epsilon=*/1, kFeatureIndex);
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR3FromArray3D<float>(Array3D<float>(260, 2, 2, 1.0f))
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR3FromArray3D<float>(Array3D<float>(260, 2, 2, 1.0f))
            .get(),
-       Literal::CreateR1<float>(std::vector<float>(260, 1.0f)).get(),
-       Literal::CreateR1<float>(std::vector<float>(260, 0.0f)).get()});
+       LiteralUtil::CreateR1<float>(std::vector<float>(260, 1.0f)).get(),
+       LiteralUtil::CreateR1<float>(std::vector<float>(260, 0.0f)).get()});
 
   ComputeAndCompareTuple(&builder, *expected,
                          {operand.get(), scale.get(), offset.get()},
@@ -331,11 +331,12 @@ XLA_TEST_P(BatchNormalizationTest, LargeEpsilonTest) {
   BatchNormTraining(h0, h1, h2,
                     /*epsilon=*/-100, kFeatureIndex);
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR3FromArray3D<float>({{{-3.0f}, {-1.0f}, {1.0f}, {3.0f}}})
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR3FromArray3D<float>(
+           {{{-3.0f}, {-1.0f}, {1.0f}, {3.0f}}})
            .get(),
-       Literal::CreateR1<float>(std::vector<float>(1, 15.0f)).get(),
-       Literal::CreateR1<float>(std::vector<float>(1, 125.0f)).get()});
+       LiteralUtil::CreateR1<float>(std::vector<float>(1, 15.0f)).get(),
+       LiteralUtil::CreateR1<float>(std::vector<float>(1, 125.0f)).get()});
 
   ComputeAndCompareTuple(&builder, *expected,
                          {operand.get(), scale.get(), offset.get()},
@@ -362,12 +363,12 @@ XLA_TEST_P(BatchNormalizationTest, BatchNormGradBasic) {
   BatchNormGrad(operand, scale, mean, var, grad_output,
                 /*epsilon=*/0.0, kFeatureIndex);
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR4<float>({{{{-3.f}, {-3.f}}, {{-1.f}, {-1.f}}},
-                                 {{{1.f}, {1.f}}, {{3.f}, {3.f}}}})
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR4<float>({{{{-3.f}, {-3.f}}, {{-1.f}, {-1.f}}},
+                                     {{{1.f}, {1.f}}, {{3.f}, {3.f}}}})
            .get(),
-       Literal::CreateR1<float>({0, 0}).get(),
-       Literal::CreateR1<float>({16, 20}).get()});
+       LiteralUtil::CreateR1<float>({0, 0}).get(),
+       LiteralUtil::CreateR1<float>({16, 20}).get()});
 
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.1));
 }
@@ -513,11 +514,12 @@ XLA_TEST_P(BatchNormTestManySizes, RandomizedTrainingTests) {
   auto normalized = *ReferenceUtil::BatchNorm4D(input_array, mean4D, var4D,
                                                 scale4D, offset4D, epsilon);
 
-  auto expected_normalized = Literal::CreateR4FromArray4D<float>(normalized);
+  auto expected_normalized =
+      LiteralUtil::CreateR4FromArray4D<float>(normalized);
 
-  auto offset_literal = Literal::CreateR1<float>(offset);
-  auto scale_literal = Literal::CreateR1<float>(scale);
-  auto input_literal = Literal::CreateR4FromArray4D<float>(input_array);
+  auto offset_literal = LiteralUtil::CreateR1<float>(offset);
+  auto scale_literal = LiteralUtil::CreateR1<float>(scale);
+  auto input_literal = LiteralUtil::CreateR4FromArray4D<float>(input_array);
 
   auto input_activations =
       Parameter(&builder, 0, input_literal->shape(), "input");
@@ -526,9 +528,9 @@ XLA_TEST_P(BatchNormTestManySizes, RandomizedTrainingTests) {
   auto offset_activations =
       Parameter(&builder, 2, offset_literal->shape(), "scale");
 
-  auto expected = Literal::MakeTuple({expected_normalized.get(),
-                                      Literal::CreateR1<float>(mean).get(),
-                                      Literal::CreateR1<float>(var).get()});
+  auto expected = LiteralUtil::MakeTuple(
+      {expected_normalized.get(), LiteralUtil::CreateR1<float>(mean).get(),
+       LiteralUtil::CreateR1<float>(var).get()});
 
   std::unique_ptr<GlobalData> input_data =
       client_->TransferToServer(*input_literal).ConsumeValueOrDie();
@@ -613,11 +615,11 @@ XLA_TEST_P(BatchNormTestManySizes, RandomizedInferencingTests) {
   auto normalized = *ReferenceUtil::BatchNorm4D(input_array, mean4D, var4D,
                                                 scale4D, offset4D, epsilon);
 
-  auto offset_literal = Literal::CreateR1<float>(offset);
-  auto scale_literal = Literal::CreateR1<float>(scale);
-  auto mean_literal = Literal::CreateR1<float>(mean);
-  auto var_literal = Literal::CreateR1<float>(var);
-  auto input_literal = Literal::CreateR4FromArray4D<float>(input_array);
+  auto offset_literal = LiteralUtil::CreateR1<float>(offset);
+  auto scale_literal = LiteralUtil::CreateR1<float>(scale);
+  auto mean_literal = LiteralUtil::CreateR1<float>(mean);
+  auto var_literal = LiteralUtil::CreateR1<float>(var);
+  auto input_literal = LiteralUtil::CreateR4FromArray4D<float>(input_array);
 
   auto input_activations =
       Parameter(&builder, 0, input_literal->shape(), "input");
@@ -800,14 +802,14 @@ XLA_TEST_P(BatchNormTestManySizes, RandomizedGradTests) {
       });
 
   auto expected_grad_activation =
-      Literal::CreateR4FromArray4D<float>(grad_activation);
+      LiteralUtil::CreateR4FromArray4D<float>(grad_activation);
 
-  auto input_literal = Literal::CreateR4FromArray4D<float>(input_array);
-  auto scale_literal = Literal::CreateR1<float>(scale);
-  auto mean_literal = Literal::CreateR1<float>(mean);
-  auto var_literal = Literal::CreateR1<float>(var);
+  auto input_literal = LiteralUtil::CreateR4FromArray4D<float>(input_array);
+  auto scale_literal = LiteralUtil::CreateR1<float>(scale);
+  auto mean_literal = LiteralUtil::CreateR1<float>(mean);
+  auto var_literal = LiteralUtil::CreateR1<float>(var);
   auto grad_output_literal =
-      Literal::CreateR4FromArray4D<float>(grad_output_array);
+      LiteralUtil::CreateR4FromArray4D<float>(grad_output_array);
 
   auto input_parameter =
       Parameter(&builder, 0, input_literal->shape(), "input");
@@ -833,9 +835,9 @@ XLA_TEST_P(BatchNormTestManySizes, RandomizedGradTests) {
                 grad_output_parameter, epsilon, feature_index);
 
   auto expected =
-      Literal::MakeTuple({expected_grad_activation.get(),
-                          Literal::CreateR1<float>(grad_scale).get(),
-                          Literal::CreateR1<float>(grad_offset).get()});
+      LiteralUtil::MakeTuple({expected_grad_activation.get(),
+                              LiteralUtil::CreateR1<float>(grad_scale).get(),
+                              LiteralUtil::CreateR1<float>(grad_offset).get()});
 
   // Run all HLO passes during this test.  In particular, ClientLibraryTestBase
   // disables constant folding, but we want it enabled for our zero-sized tensor
diff --git a/tensorflow/compiler/xla/tests/bfloat16_test.cc b/tensorflow/compiler/xla/tests/bfloat16_test.cc
index f40d03bea7..6c20f654fe 100644
--- a/tensorflow/compiler/xla/tests/bfloat16_test.cc
+++ b/tensorflow/compiler/xla/tests/bfloat16_test.cc
@@ -21,8 +21,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -95,18 +95,18 @@ XLA_TEST_F(Bfloat16Test, BatchNormTraining) {
 
   BatchNormTraining(operand, scale, offset, /*epsilon=*/0.001, kFeatureIndex);
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR4<bfloat16>(
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR4<bfloat16>(
            {{{{static_cast<bfloat16>(-1.6875f)},
               {static_cast<bfloat16>(-2.04f)}},
              {{static_cast<bfloat16>(0.105f)}, {static_cast<bfloat16>(0.66f)}}},
             {{{static_cast<bfloat16>(1.89f)}, {static_cast<bfloat16>(3.35f)}},
              {{static_cast<bfloat16>(3.7f)}, {static_cast<bfloat16>(6.04f)}}}})
            .get(),
-       Literal::CreateR1<bfloat16>(
+       LiteralUtil::CreateR1<bfloat16>(
            {static_cast<bfloat16>(4), static_cast<bfloat16>(5)})
            .get(),
-       Literal::CreateR1<bfloat16>(
+       LiteralUtil::CreateR1<bfloat16>(
            {static_cast<bfloat16>(5), static_cast<bfloat16>(5)})
            .get()});
 
@@ -139,17 +139,17 @@ XLA_TEST_F(Bfloat16Test, BatchNormGrad) {
   BatchNormGrad(operand, scale, mean, var, grad_output,
                 /*epsilon=*/0.0, kFeatureIndex);
 
-  auto expected = Literal::MakeTuple(
-      {Literal::CreateR4<bfloat16>(
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR4<bfloat16>(
            {{{{static_cast<bfloat16>(-3.f)}, {static_cast<bfloat16>(-3.f)}},
              {{static_cast<bfloat16>(-1.f)}, {static_cast<bfloat16>(-1.f)}}},
             {{{static_cast<bfloat16>(1.f)}, {static_cast<bfloat16>(1.f)}},
              {{static_cast<bfloat16>(3.f)}, {static_cast<bfloat16>(3.f)}}}})
            .get(),
-       Literal::CreateR1<bfloat16>(
+       LiteralUtil::CreateR1<bfloat16>(
            {static_cast<bfloat16>(0), static_cast<bfloat16>(0)})
            .get(),
-       Literal::CreateR1<bfloat16>(
+       LiteralUtil::CreateR1<bfloat16>(
            {static_cast<bfloat16>(16), static_cast<bfloat16>(20)})
            .get()});
 
diff --git a/tensorflow/compiler/xla/tests/binop_scaling_test.cc b/tensorflow/compiler/xla/tests/binop_scaling_test.cc
index 20cb989751..0d7a3aa46a 100644
--- a/tensorflow/compiler/xla/tests/binop_scaling_test.cc
+++ b/tensorflow/compiler/xla/tests/binop_scaling_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
diff --git a/tensorflow/compiler/xla/tests/bitcast_convert_test.cc b/tensorflow/compiler/xla/tests/bitcast_convert_test.cc
index d531e8fa82..c6b5108fe9 100644
--- a/tensorflow/compiler/xla/tests/bitcast_convert_test.cc
+++ b/tensorflow/compiler/xla/tests/bitcast_convert_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
diff --git a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
index 91aba9a8de..1d28e85b16 100644
--- a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
@@ -20,7 +20,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -58,7 +59,7 @@ class BroadcastSimpleTest : public ClientLibraryTestBase {
       Array3D<float>* r3_array, float start, float end, int seed) {
     *r3_shape = ShapeUtil::MakeShapeWithLayout(F32, bounds, minor_to_major);
     r3_array->FillRandom(start, end, seed);
-    auto r3_data = Literal::CreateR3FromArray3D(*r3_array)->Relayout(
+    auto r3_data = LiteralUtil::CreateR3FromArray3D(*r3_array)->Relayout(
         LayoutUtil::MakeLayout(minor_to_major));
     std::unique_ptr<GlobalData> r3_global_data =
         client_->TransferToServer(*r3_data).ConsumeValueOrDie();
@@ -71,7 +72,7 @@ class BroadcastSimpleTest : public ClientLibraryTestBase {
       Array2D<float>* r2_array, float start, float end, int seed) {
     *r2_shape = ShapeUtil::MakeShapeWithLayout(F32, bounds, minor_to_major);
     r2_array->FillRandom(start, end, seed);
-    auto r2_data = Literal::CreateR2FromArray2D(*r2_array)->Relayout(
+    auto r2_data = LiteralUtil::CreateR2FromArray2D(*r2_array)->Relayout(
         LayoutUtil::MakeLayout(minor_to_major));
     std::unique_ptr<GlobalData> r2_global_data =
         client_->TransferToServer(*r2_data).ConsumeValueOrDie();
@@ -290,13 +291,13 @@ XLA_TEST_F(BroadcastSimpleTest, InDimensionAndDegenerateBroadcasting) {
   XlaBuilder b(TestName());
 
   Add(ConstantR2<float>(&b, {{1.0, 5.0}}),
-      ConstantLiteral(&b, *Literal::CreateR3<float>(
+      ConstantLiteral(&b, *LiteralUtil::CreateR3<float>(
                               {{{2.0}, {3.0}, {4.0}}, {{5.0}, {6.0}, {7.0}}})),
       /*broadcast_dimensions=*/{1, 2});
 
   auto expected =
-      Literal::CreateR3<float>({{{3.0, 7.0}, {4.0, 8.0}, {5.0, 9.0}},
-                                {{6.0, 10.0}, {7.0, 11.0}, {8.0, 12.0}}});
+      LiteralUtil::CreateR3<float>({{{3.0, 7.0}, {4.0, 8.0}, {5.0, 9.0}},
+                                    {{6.0, 10.0}, {7.0, 11.0}, {8.0, 12.0}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
@@ -365,7 +366,7 @@ XLA_TEST_P(BroadcastR3ImplicitTest, Doit) {
       }
     }
   }
-  auto expected = Literal::CreateR3FromArray3D(expected_array);
+  auto expected = LiteralUtil::CreateR3FromArray3D(expected_array);
   ComputeAndCompareLiteral(
       &builder, *expected,
       {r3_implicit_global_data.get(), r3_global_data.get()},
@@ -390,7 +391,7 @@ XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_1_2) {
   Add(r3h, r1h);
 
   auto expected =
-      Literal::CreateR3<float>({{{2, 3}, {4, 5}}, {{7, 8}, {9, 10}}});
+      LiteralUtil::CreateR3<float>({{{2, 3}, {4, 5}}, {{7, 8}, {9, 10}}});
 
   ComputeAndCompareLiteral(&b, *expected, {r3.get(), r1.get()},
                            ErrorSpec(0.0001));
@@ -398,39 +399,40 @@ XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_1_2) {
 
 XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_0_1) {
   XlaBuilder b(TestName());
-  auto r1 = ConstantLiteral(&b, *Literal::CreateR3<float>({{{1, 2}}}));
+  auto r1 = ConstantLiteral(&b, *LiteralUtil::CreateR3<float>({{{1, 2}}}));
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r3, r1);
 
   auto expected =
-      Literal::CreateR3<float>({{{2, 4}, {4, 6}}, {{6, 8}, {8, 10}}});
+      LiteralUtil::CreateR3<float>({{{2, 4}, {4, 6}}, {{6, 8}, {8, 10}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_0_2) {
   XlaBuilder b(TestName());
-  auto r1 = ConstantLiteral(&b, *Literal::CreateR3<float>({{{1}, {2}}}));
+  auto r1 = ConstantLiteral(&b, *LiteralUtil::CreateR3<float>({{{1}, {2}}}));
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r3, r1);
 
   auto expected =
-      Literal::CreateR3<float>({{{2, 3}, {5, 6}}, {{6, 7}, {9, 10}}});
+      LiteralUtil::CreateR3<float>({{{2, 3}, {5, 6}}, {{6, 7}, {9, 10}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_0) {
   XlaBuilder b(TestName());
-  auto r1 = ConstantLiteral(&b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}}));
+  auto r1 =
+      ConstantLiteral(&b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}}));
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r3, r1);
 
   auto expected =
-      Literal::CreateR3<float>({{{2, 4}, {6, 8}}, {{6, 8}, {10, 12}}});
+      LiteralUtil::CreateR3<float>({{{2, 4}, {6, 8}}, {{6, 8}, {10, 12}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
@@ -438,40 +440,40 @@ XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_0) {
 XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_1) {
   XlaBuilder b(TestName());
   auto r1 =
-      ConstantLiteral(&b, *Literal::CreateR3<float>({{{1, 2}}, {{3, 4}}}));
+      ConstantLiteral(&b, *LiteralUtil::CreateR3<float>({{{1, 2}}, {{3, 4}}}));
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r3, r1);
 
   auto expected =
-      Literal::CreateR3<float>({{{2, 4}, {4, 6}}, {{8, 10}, {10, 12}}});
+      LiteralUtil::CreateR3<float>({{{2, 4}, {4, 6}}, {{8, 10}, {10, 12}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_2) {
   XlaBuilder b(TestName());
-  auto r1 =
-      ConstantLiteral(&b, *Literal::CreateR3<float>({{{1}, {2}}, {{3}, {4}}}));
+  auto r1 = ConstantLiteral(
+      &b, *LiteralUtil::CreateR3<float>({{{1}, {2}}, {{3}, {4}}}));
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r3, r1);
 
   auto expected =
-      Literal::CreateR3<float>({{{2, 3}, {5, 6}}, {{8, 9}, {11, 12}}});
+      LiteralUtil::CreateR3<float>({{{2, 3}, {5, 6}}, {{8, 9}, {11, 12}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, Add3DTo3DDegenerate_0_1_2) {
   XlaBuilder b(TestName());
-  auto r1 = ConstantLiteral(&b, *Literal::CreateR3<float>({{{1}}}));
+  auto r1 = ConstantLiteral(&b, *LiteralUtil::CreateR3<float>({{{1}}}));
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r3, r1);
 
   auto expected =
-      Literal::CreateR3<float>({{{2, 3}, {4, 5}}, {{6, 7}, {8, 9}}});
+      LiteralUtil::CreateR3<float>({{{2, 3}, {4, 5}}, {{6, 7}, {8, 9}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
@@ -612,7 +614,7 @@ XLA_TEST_P(BroadcastR2ImplicitTest, Doit) {
     *v = ApplyOpToFloats(spec.op2, tmp, v3);
   });
 
-  auto expected = Literal::CreateR2FromArray2D(expected_array);
+  auto expected = LiteralUtil::CreateR2FromArray2D(expected_array);
   ComputeAndCompareLiteral(
       &builder, *expected,
       {r2_implicit_global_data1.get(), r2_global_data.get(),
@@ -626,22 +628,24 @@ INSTANTIATE_TEST_CASE_P(BroadcastR2ImplicitTestInstances,
 
 XLA_TEST_F(BroadcastSimpleTest, Add2DTo2DDegenerate_0) {
   XlaBuilder b(TestName());
-  auto r1 = ConstantLiteral(&b, *Literal::CreateR2<float>({{1, 2}}));
-  auto r2 = ConstantLiteral(&b, *Literal::CreateR2<float>({{1, 2}, {3, 4}}));
+  auto r1 = ConstantLiteral(&b, *LiteralUtil::CreateR2<float>({{1, 2}}));
+  auto r2 =
+      ConstantLiteral(&b, *LiteralUtil::CreateR2<float>({{1, 2}, {3, 4}}));
   Add(r2, r1);
 
-  auto expected = Literal::CreateR2<float>({{2, 4}, {4, 6}});
+  auto expected = LiteralUtil::CreateR2<float>({{2, 4}, {4, 6}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, Add2DTo2DDegenerate_1) {
   XlaBuilder b(TestName());
-  auto r1 = ConstantLiteral(&b, *Literal::CreateR2<float>({{1}, {2}}));
-  auto r2 = ConstantLiteral(&b, *Literal::CreateR2<float>({{1, 2}, {3, 4}}));
+  auto r1 = ConstantLiteral(&b, *LiteralUtil::CreateR2<float>({{1}, {2}}));
+  auto r2 =
+      ConstantLiteral(&b, *LiteralUtil::CreateR2<float>({{1, 2}, {3, 4}}));
   Add(r2, r1);
 
-  auto expected = Literal::CreateR2<float>({{2, 3}, {5, 6}});
+  auto expected = LiteralUtil::CreateR2<float>({{2, 3}, {5, 6}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
@@ -650,11 +654,11 @@ XLA_TEST_F(BroadcastSimpleTest, Add1DTo3DInDim0) {
   XlaBuilder b(TestName());
   auto r1 = ConstantR1<float>(&b, {10, 20});
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r3, r1, {0});
 
-  auto expected =
-      Literal::CreateR3<float>({{{11, 12}, {13, 14}}, {{25, 26}, {27, 28}}});
+  auto expected = LiteralUtil::CreateR3<float>(
+      {{{11, 12}, {13, 14}}, {{25, 26}, {27, 28}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
@@ -663,11 +667,11 @@ XLA_TEST_F(BroadcastSimpleTest, Add1DTo3DInDim1) {
   XlaBuilder b(TestName());
   auto r1 = ConstantR1<float>(&b, {10, 20});
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r1, r3, {1});
 
-  auto expected =
-      Literal::CreateR3<float>({{{11, 12}, {23, 24}}, {{15, 16}, {27, 28}}});
+  auto expected = LiteralUtil::CreateR3<float>(
+      {{{11, 12}, {23, 24}}, {{15, 16}, {27, 28}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
@@ -676,11 +680,11 @@ XLA_TEST_F(BroadcastSimpleTest, Add1DTo3DInDim2) {
   XlaBuilder b(TestName());
   auto r1 = ConstantR1<float>(&b, {10, 20});
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   Add(r1, r3, {2});
 
-  auto expected =
-      Literal::CreateR3<float>({{{11, 22}, {13, 24}}, {{15, 26}, {17, 28}}});
+  auto expected = LiteralUtil::CreateR3<float>(
+      {{{11, 22}, {13, 24}}, {{15, 26}, {17, 28}}});
 
   ComputeAndCompareLiteral(&b, *expected, {}, ErrorSpec(0.0001));
 }
@@ -691,7 +695,7 @@ XLA_TEST_F(BroadcastSimpleTest, Add1DTo3DInDimAll) {
   auto r1_1 = ConstantR1<float>(&b, {100, 200});
   auto r1_2 = ConstantR1<float>(&b, {10, 20});
   auto r3 = ConstantLiteral(
-      &b, *Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
+      &b, *LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}));
   for (int i = 0; i < 3; ++i) {
     r3 = Add(r1_0, r3, {0});
     r3 = Add(r3, r1_1, {1});
@@ -699,7 +703,7 @@ XLA_TEST_F(BroadcastSimpleTest, Add1DTo3DInDimAll) {
   }
   r3 = Mul(r3, ConstantR0<float>(&b, -2));
 
-  auto expected = Literal::CreateR3<float>(
+  auto expected = LiteralUtil::CreateR3<float>(
       {{{-6 * 1110 - 2, -6 * 1120 - 4}, {-6 * 1210 - 6, -6 * 1220 - 8}},
        {{-6 * 2110 - 10, -6 * 2120 - 12}, {-6 * 2210 - 14, -6 * 2220 - 16}}});
 
@@ -720,7 +724,7 @@ XLA_TEST_F(BroadcastSimpleTest, Add1DTo3DInDimAllWithScalarBroadcast) {
   }
   r3 = Mul(r3, ConstantR0<float>(&b, -1));
 
-  auto expected = Literal::CreateR3<float>(
+  auto expected = LiteralUtil::CreateR3<float>(
       {{{-3 * 1110 - 3, -3 * 1120 - 3}, {-3 * 1210 - 3, -3 * 1220 - 3}},
        {{-3 * 2110 - 3, -3 * 2120 - 3}, {-3 * 2210 - 3, -3 * 2220 - 3}}});
 
@@ -733,7 +737,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidBinaryAndDegenerateBroadcasting) {
   XlaBuilder b(TestName());
 
   Add(ConstantR2<float>(&b, {{1.0, 5.0}, {1.0, 5.0}}),
-      ConstantLiteral(&b, *Literal::CreateR3<float>(
+      ConstantLiteral(&b, *LiteralUtil::CreateR3<float>(
                               {{{2.0}, {3.0}, {4.0}}, {{5.0}, {6.0}, {7.0}}})),
       /*broadcast_dimensions=*/{1, 2});
 
diff --git a/tensorflow/compiler/xla/tests/broadcast_test.cc b/tensorflow/compiler/xla/tests/broadcast_test.cc
index 51b9f0d3e3..c7b94b5bba 100644
--- a/tensorflow/compiler/xla/tests/broadcast_test.cc
+++ b/tensorflow/compiler/xla/tests/broadcast_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include <memory>
 #include <utility>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -37,7 +37,7 @@ XLA_TEST_F(BroadcastTest, BroadcastScalarToScalar) {
   // Test degenerate case of broadcasting a scalar into a scalar.
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
   builder.AddInstruction(HloInstruction::CreateBroadcast(
       ShapeUtil::MakeShape(F32, {}), input, {}));
 
@@ -46,14 +46,14 @@ XLA_TEST_F(BroadcastTest, BroadcastScalarToScalar) {
   hlo_module->AddEntryComputation(builder.Build());
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
-  EXPECT_TRUE(LiteralTestUtil::Near(*Literal::CreateR0<float>(42.0), *result,
-                                    error_spec_));
+  EXPECT_TRUE(LiteralTestUtil::Near(*LiteralUtil::CreateR0<float>(42.0),
+                                    *result, error_spec_));
 }
 
 XLA_TEST_F(BroadcastTest, BroadcastScalarTo2D) {
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
   builder.AddInstruction(HloInstruction::CreateBroadcast(
       ShapeUtil::MakeShape(F32, {2, 2}), input, {}));
 
@@ -63,14 +63,14 @@ XLA_TEST_F(BroadcastTest, BroadcastScalarTo2D) {
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
   EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR2<float>({{42.0, 42.0}, {42.0, 42.0}}), *result,
+      *LiteralUtil::CreateR2<float>({{42.0, 42.0}, {42.0, 42.0}}), *result,
       error_spec_));
 }
 
 XLA_TEST_F(BroadcastTest, BroadcastVectorTo2D) {
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.0, 2.0, 3.0})));
+      LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0})));
 
   // Broadcast vector in both dimension 0 and dimension 1. Join them in a tuple
   // to enable testing of the results.
@@ -86,18 +86,18 @@ XLA_TEST_F(BroadcastTest, BroadcastVectorTo2D) {
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
   EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR2<float>({{1.0, 1.0}, {2.0, 2.0}, {3.0, 3.0}}),
+      *LiteralUtil::CreateR2<float>({{1.0, 1.0}, {2.0, 2.0}, {3.0, 3.0}}),
       LiteralSlice(*result, {0}), error_spec_));
 
   EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR2<float>({{1.0, 2.0, 3.0}, {1.0, 2.0, 3.0}}),
+      *LiteralUtil::CreateR2<float>({{1.0, 2.0, 3.0}, {1.0, 2.0, 3.0}}),
       LiteralSlice(*result, {1}), error_spec_));
 }
 
 XLA_TEST_F(BroadcastTest, Broadcast2DTo2D) {
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   builder.AddInstruction(HloInstruction::CreateBroadcast(
       ShapeUtil::MakeShape(F32, {2, 2}), input, {0, 1}));
 
@@ -106,9 +106,9 @@ XLA_TEST_F(BroadcastTest, Broadcast2DTo2D) {
   hlo_module->AddEntryComputation(builder.Build());
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
-  EXPECT_TRUE(
-      LiteralTestUtil::Near(*Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}),
-                            *result, error_spec_));
+  EXPECT_TRUE(LiteralTestUtil::Near(
+      *LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}), *result,
+      error_spec_));
 }
 
 XLA_TEST_F(BroadcastTest, Broadcast2DTo2DTranspose) {
@@ -116,7 +116,7 @@ XLA_TEST_F(BroadcastTest, Broadcast2DTo2DTranspose) {
   // the dimensions, ie transpose.
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   builder.AddInstruction(HloInstruction::CreateBroadcast(
       ShapeUtil::MakeShape(F32, {2, 2}), input, {1, 0}));
 
@@ -125,15 +125,15 @@ XLA_TEST_F(BroadcastTest, Broadcast2DTo2DTranspose) {
   hlo_module->AddEntryComputation(builder.Build());
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
-  EXPECT_TRUE(
-      LiteralTestUtil::Near(*Literal::CreateR2<float>({{1.0, 3.0}, {2.0, 4.0}}),
-                            *result, error_spec_));
+  EXPECT_TRUE(LiteralTestUtil::Near(
+      *LiteralUtil::CreateR2<float>({{1.0, 3.0}, {2.0, 4.0}}), *result,
+      error_spec_));
 }
 
 XLA_TEST_F(BroadcastTest, Broadcast2DTo3D) {
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}})));
   builder.AddInstruction(HloInstruction::CreateBroadcast(
       ShapeUtil::MakeShape(F32, {2, 3, 2}), input, {0, 2}));
 
@@ -143,15 +143,15 @@ XLA_TEST_F(BroadcastTest, Broadcast2DTo3D) {
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
   EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR3<float>({{{1.0, 2.0}, {1.0, 2.0}, {1.0, 2.0}},
-                                 {{3.0, 4.0}, {3.0, 4.0}, {3.0, 4.0}}}),
+      *LiteralUtil::CreateR3<float>({{{1.0, 2.0}, {1.0, 2.0}, {1.0, 2.0}},
+                                     {{3.0, 4.0}, {3.0, 4.0}, {3.0, 4.0}}}),
       *result, error_spec_));
 }
 
 TEST_F(BroadcastTest, Broadcast_R1_2_To_R4_2x2x3x3) {
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>({1.0, 2.0})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>({1.0, 2.0})));
 
   // Broadcast vector in dimension 1.
   builder.AddInstruction(HloInstruction::CreateBroadcast(
@@ -166,8 +166,9 @@ TEST_F(BroadcastTest, Broadcast_R1_2_To_R4_2x2x3x3) {
   Array2D<float> pz({{1, 2}, {1, 2}});
   expected.FillWithPZ(pz);
 
-  EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR4FromArray4D<float>(expected), *result, error_spec_));
+  EXPECT_TRUE(
+      LiteralTestUtil::Near(*LiteralUtil::CreateR4FromArray4D<float>(expected),
+                            *result, error_spec_));
 }
 
 TEST_F(BroadcastTest, Broadcast_R1_1025_To_R4_3x3x3x1025) {
@@ -176,7 +177,7 @@ TEST_F(BroadcastTest, Broadcast_R1_1025_To_R4_3x3x3x1025) {
   int64 r1_size = input_data.size();
   std::iota(input_data.begin(), input_data.end(), 0.0f);
   auto input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>(input_data)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(input_data)));
 
   // Broadcast vector in dimension 3.
   builder.AddInstruction(HloInstruction::CreateBroadcast(
@@ -196,8 +197,9 @@ TEST_F(BroadcastTest, Broadcast_R1_1025_To_R4_3x3x3x1025) {
   }
   expected.FillWithYX(yx);
 
-  EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR4FromArray4D<float>(expected), *result, error_spec_));
+  EXPECT_TRUE(
+      LiteralTestUtil::Near(*LiteralUtil::CreateR4FromArray4D<float>(expected),
+                            *result, error_spec_));
 }
 
 XLA_TEST_F(BroadcastTest, Broadcast_R1_64_To_R4_32x64x7x7) {
@@ -207,7 +209,7 @@ XLA_TEST_F(BroadcastTest, Broadcast_R1_64_To_R4_32x64x7x7) {
   std::vector<float> r1_array(64, 42.0);
 
   auto input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<float>(r1_array)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<float>(r1_array)));
 
   // Broadcast vector in dimension 1.
   builder.AddInstruction(HloInstruction::CreateBroadcast(
@@ -218,14 +220,14 @@ XLA_TEST_F(BroadcastTest, Broadcast_R1_64_To_R4_32x64x7x7) {
   hlo_module->AddEntryComputation(builder.Build());
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
-  EXPECT_TRUE(LiteralTestUtil::Near(*Literal::CreateR4FromArray4D(r4_array),
+  EXPECT_TRUE(LiteralTestUtil::Near(*LiteralUtil::CreateR4FromArray4D(r4_array),
                                     *result, error_spec_));
 }
 
 TEST_F(BroadcastTest, Broadcast_R0_to_R4_64x64x3x3) {
   auto builder = HloComputation::Builder(TestName());
   auto input = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(1.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(1.0f)));
   builder.AddInstruction(HloInstruction::CreateBroadcast(
       ShapeUtil::MakeShape(F32, {64, 64, 3, 3}), input, {}));
 
@@ -238,15 +240,16 @@ TEST_F(BroadcastTest, Broadcast_R0_to_R4_64x64x3x3) {
   Array4D<float> expected(64, 64, 3, 3);
   expected.Fill(1.0f);
 
-  EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR4FromArray4D<float>(expected), *result, error_spec_));
+  EXPECT_TRUE(
+      LiteralTestUtil::Near(*LiteralUtil::CreateR4FromArray4D<float>(expected),
+                            *result, error_spec_));
 }
 
 TEST_F(BroadcastTest, Broadcast_R2_2x2_To_R4_3x3x2x2) {
   auto builder = HloComputation::Builder(TestName());
   Array2D<float> to_broadcast({{1.0f, 2.0f}, {3.0f, 4.0f}});
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2FromArray2D<float>(to_broadcast)));
+      LiteralUtil::CreateR2FromArray2D<float>(to_broadcast)));
 
   // Broadcast vector in dimensions 2 and 3.
   builder.AddInstruction(HloInstruction::CreateBroadcast(
@@ -260,8 +263,9 @@ TEST_F(BroadcastTest, Broadcast_R2_2x2_To_R4_3x3x2x2) {
   Array4D<float> expected(3, 3, 2, 2);
   expected.FillWithYX(to_broadcast);
 
-  EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR4FromArray4D<float>(expected), *result, error_spec_));
+  EXPECT_TRUE(
+      LiteralTestUtil::Near(*LiteralUtil::CreateR4FromArray4D<float>(expected),
+                            *result, error_spec_));
 }
 
 TEST_F(BroadcastTest, Broadcast_R3_2x3x4_to_R4_2x3x4x5) {
@@ -280,7 +284,7 @@ TEST_F(BroadcastTest, Broadcast_R3_2x3x4_to_R4_2x3x4x5) {
     }
   }
   auto input = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR3FromArray3D<float>(input_vals)));
+      LiteralUtil::CreateR3FromArray3D<float>(input_vals)));
 
   // Broadcast vector in dimensions 2 and 3.
   builder.AddInstruction(HloInstruction::CreateBroadcast(
@@ -291,8 +295,9 @@ TEST_F(BroadcastTest, Broadcast_R3_2x3x4_to_R4_2x3x4x5) {
   hlo_module->AddEntryComputation(builder.Build());
   auto result = ExecuteAndTransfer(std::move(hlo_module), {});
 
-  EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR4FromArray4D<float>(expected), *result, error_spec_));
+  EXPECT_TRUE(
+      LiteralTestUtil::Near(*LiteralUtil::CreateR4FromArray4D<float>(expected),
+                            *result, error_spec_));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/call_test.cc b/tensorflow/compiler/xla/tests/call_test.cc
index bc64a19ce2..b1d18210ea 100644
--- a/tensorflow/compiler/xla/tests/call_test.cc
+++ b/tensorflow/compiler/xla/tests/call_test.cc
@@ -16,8 +16,9 @@ limitations under the License.
 #include <memory>
 #include <utility>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -76,7 +77,8 @@ class CallOpTest : public ClientLibraryTestBase {
 XLA_TEST_F(CallOpTest, CallR0F32IdentityScalar) {
   XlaBuilder builder(TestName());
   XlaComputation callee = CreateR0F32IdentityComputation();
-  auto constant = ConstantLiteral(&builder, *Literal::CreateR0<float>(42.0));
+  auto constant =
+      ConstantLiteral(&builder, *LiteralUtil::CreateR0<float>(42.0));
   Call(&builder, callee, {constant});
 
   ComputeAndCompareR0<float>(&builder, 42.0, {}, ErrorSpec(0.01f));
@@ -85,8 +87,8 @@ XLA_TEST_F(CallOpTest, CallR0F32IdentityScalar) {
 XLA_TEST_F(CallOpTest, CallR1S0F32AddArray) {
   XlaBuilder builder(TestName());
   XlaComputation callee = CreateR1S0F32AdditionComputation();
-  auto x = ConstantLiteral(&builder, *Literal::CreateR1<float>({}));
-  auto y = ConstantLiteral(&builder, *Literal::CreateR1<float>({}));
+  auto x = ConstantLiteral(&builder, *LiteralUtil::CreateR1<float>({}));
+  auto y = ConstantLiteral(&builder, *LiteralUtil::CreateR1<float>({}));
   Call(&builder, callee, {x, y});
 
   ComputeAndCompareR1<float>(&builder, {}, {}, ErrorSpec(0.01f));
@@ -95,8 +97,10 @@ XLA_TEST_F(CallOpTest, CallR1S0F32AddArray) {
 XLA_TEST_F(CallOpTest, CallR1S2F32AddArray) {
   XlaBuilder builder(TestName());
   XlaComputation callee = CreateR1S2F32AdditionComputation();
-  auto x = ConstantLiteral(&builder, *Literal::CreateR1<float>({1.0f, 2.0f}));
-  auto y = ConstantLiteral(&builder, *Literal::CreateR1<float>({2.0f, 3.0f}));
+  auto x =
+      ConstantLiteral(&builder, *LiteralUtil::CreateR1<float>({1.0f, 2.0f}));
+  auto y =
+      ConstantLiteral(&builder, *LiteralUtil::CreateR1<float>({2.0f, 3.0f}));
   Call(&builder, callee, {x, y});
 
   ComputeAndCompareR1<float>(&builder, {3.0f, 5.0f}, {}, ErrorSpec(0.01f));
@@ -129,15 +133,15 @@ XLA_TEST_F(CallOpTest, CallTreeTwoDeepBranchFactorThree) {
 
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> start,
-      client_->TransferToServer(*Literal::CreateR0<float>(1.0f)));
+      client_->TransferToServer(*LiteralUtil::CreateR0<float>(1.0f)));
   ComputeAndCompareR0<float>(&builder3, 10.0f, {start.get()}, ErrorSpec(0.0f));
 }
 
 XLA_TEST_F(CallOpTest, CallR0F32Tuple) {
   XlaBuilder builder(TestName());
   XlaComputation callee = CreateR0F32TupleComputation();
-  auto elem = Literal::CreateR0<float>(42.0);
-  auto tuple = Literal::MakeTuple({elem.get()});
+  auto elem = LiteralUtil::CreateR0<float>(42.0);
+  auto tuple = LiteralUtil::MakeTuple({elem.get()});
   Call(&builder, callee, {ConstantLiteral(&builder, *elem)});
 
   ComputeAndCompareTuple(&builder, *tuple, {}, ErrorSpec(0.01f));
diff --git a/tensorflow/compiler/xla/tests/check_execution_arity_test.cc b/tensorflow/compiler/xla/tests/check_execution_arity_test.cc
index 1ad57c075b..a4eb57fc7b 100644
--- a/tensorflow/compiler/xla/tests/check_execution_arity_test.cc
+++ b/tensorflow/compiler/xla/tests/check_execution_arity_test.cc
@@ -17,8 +17,8 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -36,7 +36,7 @@ class CheckExecutionArityTest : public ClientLibraryTestBase {};
 
 TEST_F(CheckExecutionArityTest, TwoParamComputationNumArguments) {
   XlaBuilder builder("add_two_params");
-  auto param_literal = Literal::CreateR1<float>({1.1f, 2.2f});
+  auto param_literal = LiteralUtil::CreateR1<float>({1.1f, 2.2f});
 
   auto p0 = Parameter(&builder, 0, param_literal->shape(), "param0");
   auto p1 = Parameter(&builder, 1, param_literal->shape(), "param1");
@@ -85,12 +85,12 @@ XLA_TEST_F(CheckExecutionArityTest, CheckArgumentShapes) {
   ASSERT_IS_OK(computation_status.status());
   auto computation = computation_status.ConsumeValueOrDie();
 
-  auto f32_literal = Literal::CreateR0<float>(1.1f);
+  auto f32_literal = LiteralUtil::CreateR0<float>(1.1f);
   auto f32_data = client_->TransferToServer(*f32_literal).ConsumeValueOrDie();
-  auto f32_4_literal = Literal::CreateR1<float>({1.0f, 2.0f, 3.0f, 4.0f});
+  auto f32_4_literal = LiteralUtil::CreateR1<float>({1.0f, 2.0f, 3.0f, 4.0f});
   auto f32_4_data =
       client_->TransferToServer(*f32_4_literal).ConsumeValueOrDie();
-  auto u8_4_literal = Literal::CreateR1U8("hola");
+  auto u8_4_literal = LiteralUtil::CreateR1U8("hola");
   auto u8_4_data = client_->TransferToServer(*u8_4_literal).ConsumeValueOrDie();
 
   // Match
diff --git a/tensorflow/compiler/xla/tests/client_library_test_base.cc b/tensorflow/compiler/xla/tests/client_library_test_base.cc
index dafd6ebabb..59d917054b 100644
--- a/tensorflow/compiler/xla/tests/client_library_test_base.cc
+++ b/tensorflow/compiler/xla/tests/client_library_test_base.cc
@@ -19,7 +19,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
@@ -157,7 +157,7 @@ string ClientLibraryTestBase::ExecuteToString(
 void ClientLibraryTestBase::ComputeAndCompareR1(
     XlaBuilder* builder, const tensorflow::core::Bitmap& expected,
     tensorflow::gtl::ArraySlice<GlobalData*> arguments) {
-  std::unique_ptr<Literal> expected_literal = Literal::CreateR1(expected);
+  std::unique_ptr<Literal> expected_literal = LiteralUtil::CreateR1(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments);
 }
@@ -273,10 +273,16 @@ Status ClientLibraryTestBase::ComputeAndCompareLiteralWithStatus(
     const Shape* shape_with_layout) {
   std::vector<GlobalData*> arguments(arguments_passed_in.begin(),
                                      arguments_passed_in.end());
+
+  // Transfer and use elements of arguments_, if the AddParam() API was used.
+  std::vector<std::unique_ptr<GlobalData>> owning_arguments;
   if (!arguments_.empty()) {
     CHECK(arguments.empty());
     for (const auto& argument : arguments_) {
-      arguments.push_back(argument.get());
+      owning_arguments.push_back(
+          client_->TransferToServer(MaybeConvertLiteralToBfloat16(argument))
+              .ValueOrDie());
+      arguments.push_back(owning_arguments.back().get());
     }
   }
 
@@ -295,7 +301,7 @@ Status ClientLibraryTestBase::ComputeAndCompareLiteralWithStatus(
   std::unique_ptr<Literal> converted_expected;
   Shape layout_shape;
   if (use_bfloat16_) {
-    converted_expected = Literal::ConvertF32ToBF16(expected);
+    converted_expected = LiteralUtil::ConvertF32ToBF16(expected);
     expected_ptr = converted_expected.get();
     if (shape_with_layout != nullptr) {
       layout_shape = *shape_with_layout;
@@ -331,10 +337,16 @@ Status ClientLibraryTestBase::ComputeAndCompareLiteralWithStatus(
     ErrorSpec error, const Shape* shape_with_layout) {
   std::vector<GlobalData*> arguments(arguments_passed_in.begin(),
                                      arguments_passed_in.end());
+
+  // Transfer and use elements of arguments_, if the AddParam() API was used.
+  std::vector<std::unique_ptr<GlobalData>> owning_arguments;
   if (!arguments_.empty()) {
     CHECK(arguments.empty());
     for (const auto& argument : arguments_) {
-      arguments.push_back(argument.get());
+      owning_arguments.push_back(
+          client_->TransferToServer(MaybeConvertLiteralToBfloat16(argument))
+              .ValueOrDie());
+      arguments.push_back(owning_arguments.back().get());
     }
   }
 
@@ -347,7 +359,7 @@ Status ClientLibraryTestBase::ComputeAndCompareLiteralWithStatus(
   std::unique_ptr<Literal> converted_expected;
   Shape layout_shape;
   if (use_bfloat16_) {
-    converted_expected = Literal::ConvertF32ToBF16(expected);
+    converted_expected = LiteralUtil::ConvertF32ToBF16(expected);
     expected_ptr = converted_expected.get();
     if (shape_with_layout != nullptr) {
       layout_shape = *shape_with_layout;
@@ -389,7 +401,7 @@ void ClientLibraryTestBase::ComputeAndCompareR1U8(
   auto actual = actual_status.ConsumeValueOrDie();
 
   // Turn the expected value into a literal.
-  std::unique_ptr<Literal> expected_literal = Literal::CreateR1U8(expected);
+  std::unique_ptr<Literal> expected_literal = LiteralUtil::CreateR1U8(expected);
 
   VLOG(1) << "expected: " << expected_literal->ToString();
   VLOG(1) << "actual:   " << actual->ToString();
@@ -454,6 +466,14 @@ ClientLibraryTestBase::ComputeValueAndReference(
   // function.
   std::vector<std::unique_ptr<GlobalData>> argument_data;
   std::vector<std::unique_ptr<GlobalData>> ref_argument_data;
+
+  // Use `arguments_` if the AddParam() API was used.  Otherwise, use
+  // plain `arguments`.
+  if (!arguments_.empty()) {
+    CHECK_EQ(arguments.size(), 0);
+    arguments = arguments_;
+  }
+
   for (const auto& arg : arguments) {
     TF_ASSIGN_OR_RETURN(auto data, client_->TransferToServer(arg.Clone()));
     TF_ASSIGN_OR_RETURN(auto ref_data, ref_client_->TransferToServer(arg));
@@ -552,16 +572,16 @@ ClientLibraryTestBase::CreatePatternedMatrixWithZeroPadding(int rows, int cols,
 
 XlaOp ClientLibraryTestBase::AddParam(const Literal& argument,
                                       XlaBuilder* builder) {
-  XlaOp data_handle;
-  arguments_.push_back(CreateParameterAndTransferLiteral(
-      arguments_.size(), argument, "", builder, &data_handle));
-  return data_handle;
+  arguments_.push_back(argument.Clone());
+  return Parameter(builder, /*parameter_number=*/arguments_.size() - 1,
+                   MaybeConvertShapeToBfloat16(argument.shape()), "");
 }
 
 XlaOp ClientLibraryTestBase::CreateConstantFromLiteral(const Literal& literal,
                                                        XlaBuilder* builder) {
-  return ConstantLiteral(
-      builder, use_bfloat16_ ? *Literal::ConvertF32ToBF16(literal) : literal);
+  return ConstantLiteral(builder, use_bfloat16_
+                                      ? *LiteralUtil::ConvertF32ToBF16(literal)
+                                      : literal);
 }
 
 std::unique_ptr<GlobalData>
@@ -574,22 +594,39 @@ ClientLibraryTestBase::CreateParameterAndTransferLiteral(int64 parameter_number,
                                            nullptr, builder, data_handle);
 }
 
+Shape ClientLibraryTestBase::MaybeConvertShapeToBfloat16(const Shape& shape) {
+  if (!use_bfloat16_) {
+    return shape;
+  }
+  Shape new_shape = shape;
+  ShapeUtil::ForEachMutableSubshape(&new_shape,
+                                    [](Shape* subshape, const ShapeIndex&) {
+                                      if (subshape->element_type() == F32) {
+                                        subshape->set_element_type(BF16);
+                                      }
+                                    });
+  return new_shape;
+}
+
+Literal ClientLibraryTestBase::MaybeConvertLiteralToBfloat16(
+    const Literal& literal) {
+  if (use_bfloat16_) {
+    return std::move(*LiteralUtil::ConvertF32ToBF16(literal));
+  }
+  return literal.Clone();
+}
+
 std::unique_ptr<GlobalData>
 ClientLibraryTestBase::CreateParameterAndTransferLiteral(
     int64 parameter_number, const Literal& literal, const string& name,
     const DeviceHandle* device_handle, XlaBuilder* builder,
     XlaOp* data_handle) {
-  const Literal* param_literal = &literal;
-  std::unique_ptr<Literal> converted_literal;
-  if (use_bfloat16_) {
-    converted_literal = Literal::ConvertF32ToBF16(literal);
-    param_literal = converted_literal.get();
-  }
+  Literal param_literal = MaybeConvertLiteralToBfloat16(literal);
   std::unique_ptr<GlobalData> data =
-      client_->TransferToServer(*param_literal, device_handle)
+      client_->TransferToServer(param_literal, device_handle)
           .ConsumeValueOrDie();
   *data_handle =
-      Parameter(builder, parameter_number, param_literal->shape(), name);
+      Parameter(builder, parameter_number, param_literal.shape(), name);
   return data;
 }
 
diff --git a/tensorflow/compiler/xla/tests/client_library_test_base.h b/tensorflow/compiler/xla/tests/client_library_test_base.h
index 5361ae6783..4a6e8a3124 100644
--- a/tensorflow/compiler/xla/tests/client_library_test_base.h
+++ b/tensorflow/compiler/xla/tests/client_library_test_base.h
@@ -26,8 +26,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -284,7 +285,7 @@ class ClientLibraryTestBase : public ::testing::Test {
 
   template <class T>
   XlaOp AddParam(const Array<T>& argument, XlaBuilder* builder) {
-    return AddParam(*Literal::CreateFromArray(argument), builder);
+    return AddParam(*LiteralUtil::CreateFromArray(argument), builder);
   }
 
   // Creates a constant instruction with the given literal. When the
@@ -299,13 +300,14 @@ class ClientLibraryTestBase : public ::testing::Test {
   template <typename NativeT>
   XlaOp CreateConstantFromArray(const Array<NativeT>& array,
                                 XlaBuilder* builder) {
-    return CreateConstantFromLiteral(*Literal::CreateFromArray(array), builder);
+    return CreateConstantFromLiteral(*LiteralUtil::CreateFromArray(array),
+                                     builder);
   }
 
   // Same as CreateConstantFromArray, but for scalars.
   template <typename NativeT>
   XlaOp CreateConstantFromScalar(NativeT value, XlaBuilder* builder) {
-    return CreateConstantFromLiteral(*Literal::CreateR0<NativeT>(value),
+    return CreateConstantFromLiteral(*LiteralUtil::CreateR0<NativeT>(value),
                                      builder);
   }
 
@@ -397,12 +399,16 @@ class ClientLibraryTestBase : public ::testing::Test {
                                const string& error_message)>& verify_output,
       const Shape* output_with_layout = nullptr);
 
+  // Converts an f32 shape/literal to bf16 if use_bfloat16_ is true.
+  Literal MaybeConvertLiteralToBfloat16(const Literal& literal);
+  Shape MaybeConvertShapeToBfloat16(const Shape& shape);
+
   // Whether to run tests with all float-type input/output converted to
   // bfloat16.
   bool use_bfloat16_ = false;
 
   // Arguments to be passed to the computation when it runs.
-  std::vector<std::unique_ptr<GlobalData>> arguments_;
+  std::vector<Literal> arguments_;
 };
 
 template <typename NativeT>
@@ -410,7 +416,7 @@ void ClientLibraryTestBase::ComputeAndCompareR0(
     XlaBuilder* builder, NativeT expected,
     tensorflow::gtl::ArraySlice<GlobalData*> arguments) {
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR0<NativeT>(expected);
+      LiteralUtil::CreateR0<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments);
 }
@@ -426,7 +432,7 @@ void ClientLibraryTestBase::ComputeAndCompareR0(
                     std::is_same<NativeT, complex64>::value,
                 "Float or complex type required when specifying an ErrorSpec");
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR0<NativeT>(expected);
+      LiteralUtil::CreateR0<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments, error);
 }
@@ -436,7 +442,7 @@ void ClientLibraryTestBase::ComputeAndCompareR1(
     XlaBuilder* builder, tensorflow::gtl::ArraySlice<NativeT> expected,
     tensorflow::gtl::ArraySlice<GlobalData*> arguments) {
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR1<NativeT>(expected);
+      LiteralUtil::CreateR1<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments);
 }
@@ -452,7 +458,7 @@ void ClientLibraryTestBase::ComputeAndCompareR1(
                     std::is_same<NativeT, complex64>::value,
                 "Float or complex type required when specifying an ErrorSpec");
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR1<NativeT>(expected);
+      LiteralUtil::CreateR1<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments, error);
 }
@@ -462,7 +468,7 @@ void ClientLibraryTestBase::ComputeAndCompareR2(
     XlaBuilder* builder, const Array2D<NativeT>& expected,
     tensorflow::gtl::ArraySlice<GlobalData*> arguments) {
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR2FromArray2D<NativeT>(expected);
+      LiteralUtil::CreateR2FromArray2D<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments);
 }
@@ -478,7 +484,7 @@ void ClientLibraryTestBase::ComputeAndCompareR2(
                     std::is_same<NativeT, complex64>::value,
                 "Float or complex type required when specifying an ErrorSpec");
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR2FromArray2D<NativeT>(expected);
+      LiteralUtil::CreateR2FromArray2D<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments, error);
 }
@@ -488,7 +494,7 @@ void ClientLibraryTestBase::ComputeAndCompareR3(
     XlaBuilder* builder, const Array3D<NativeT>& expected,
     tensorflow::gtl::ArraySlice<GlobalData*> arguments) {
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR3FromArray3D<NativeT>(expected);
+      LiteralUtil::CreateR3FromArray3D<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments);
 }
@@ -504,7 +510,7 @@ void ClientLibraryTestBase::ComputeAndCompareR3(
                     std::is_same<NativeT, complex64>::value,
                 "Float or complex type required when specifying an ErrorSpec");
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR3FromArray3D<NativeT>(expected);
+      LiteralUtil::CreateR3FromArray3D<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments, error);
 }
@@ -514,7 +520,7 @@ void ClientLibraryTestBase::ComputeAndCompareR4(
     XlaBuilder* builder, const Array4D<NativeT>& expected,
     tensorflow::gtl::ArraySlice<GlobalData*> arguments) {
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR4FromArray4D<NativeT>(expected);
+      LiteralUtil::CreateR4FromArray4D<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments);
 }
@@ -530,7 +536,7 @@ void ClientLibraryTestBase::ComputeAndCompareR4(
                     std::is_same<NativeT, complex64>::value,
                 "Float or complex type required when specifying an ErrorSpec");
   std::unique_ptr<Literal> expected_literal =
-      Literal::CreateR4FromArray4D<NativeT>(expected);
+      LiteralUtil::CreateR4FromArray4D<NativeT>(expected);
   ClientLibraryTestBase::ComputeAndCompareLiteral(builder, *expected_literal,
                                                   arguments, error);
 }
@@ -539,9 +545,9 @@ template <typename NativeT>
 std::unique_ptr<GlobalData> ClientLibraryTestBase::CreateR0Parameter(
     NativeT value, int64 parameter_number, const string& name,
     XlaBuilder* builder, XlaOp* data_handle) {
-  std::unique_ptr<Literal> literal = Literal::CreateR0(value);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR0(value);
   if (use_bfloat16_ && literal->shape().element_type() == F32) {
-    literal = Literal::ConvertF32ToBF16(*literal);
+    literal = LiteralUtil::ConvertF32ToBF16(*literal);
   }
   std::unique_ptr<GlobalData> data =
       client_->TransferToServer(*literal).ConsumeValueOrDie();
@@ -553,9 +559,9 @@ template <typename NativeT>
 std::unique_ptr<GlobalData> ClientLibraryTestBase::CreateR1Parameter(
     tensorflow::gtl::ArraySlice<NativeT> values, int64 parameter_number,
     const string& name, XlaBuilder* builder, XlaOp* data_handle) {
-  std::unique_ptr<Literal> literal = Literal::CreateR1(values);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR1(values);
   if (use_bfloat16_ && literal->shape().element_type() == F32) {
-    literal = Literal::ConvertF32ToBF16(*literal);
+    literal = LiteralUtil::ConvertF32ToBF16(*literal);
   }
   std::unique_ptr<GlobalData> data =
       client_->TransferToServer(*literal).ConsumeValueOrDie();
@@ -567,9 +573,9 @@ template <typename NativeT>
 std::unique_ptr<GlobalData> ClientLibraryTestBase::CreateR2Parameter(
     const Array2D<NativeT>& array_2d, int64 parameter_number,
     const string& name, XlaBuilder* builder, XlaOp* data_handle) {
-  std::unique_ptr<Literal> literal = Literal::CreateR2FromArray2D(array_2d);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR2FromArray2D(array_2d);
   if (use_bfloat16_ && literal->shape().element_type() == F32) {
-    literal = Literal::ConvertF32ToBF16(*literal);
+    literal = LiteralUtil::ConvertF32ToBF16(*literal);
   }
   std::unique_ptr<GlobalData> data =
       client_->TransferToServer(*literal).ConsumeValueOrDie();
@@ -581,9 +587,9 @@ template <typename NativeT>
 std::unique_ptr<GlobalData> ClientLibraryTestBase::CreateR3Parameter(
     const Array3D<NativeT>& array_3d, int64 parameter_number,
     const string& name, XlaBuilder* builder, XlaOp* data_handle) {
-  std::unique_ptr<Literal> literal = Literal::CreateR3FromArray3D(array_3d);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR3FromArray3D(array_3d);
   if (use_bfloat16_ && literal->shape().element_type() == F32) {
-    literal = Literal::ConvertF32ToBF16(*literal);
+    literal = LiteralUtil::ConvertF32ToBF16(*literal);
   }
   std::unique_ptr<GlobalData> data =
       client_->TransferToServer(*literal).ConsumeValueOrDie();
diff --git a/tensorflow/compiler/xla/tests/client_test.cc b/tensorflow/compiler/xla/tests/client_test.cc
index 831b863998..c898dacf48 100644
--- a/tensorflow/compiler/xla/tests/client_test.cc
+++ b/tensorflow/compiler/xla/tests/client_test.cc
@@ -18,8 +18,8 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -56,7 +56,7 @@ XLA_TEST_F(ClientTest, ExecuteWithLayout) {
           client_->Execute(computation, {}, &execution_options));
 
       std::unique_ptr<Literal> expected_literal =
-          Literal::CreateR2WithLayout<int32>(
+          LiteralUtil::CreateR2WithLayout<int32>(
               {{11, 22}, {33, 44}}, LayoutUtil::MakeLayout(transfer_layout));
 
       TF_ASSERT_OK_AND_ASSIGN(
@@ -112,9 +112,9 @@ XLA_TEST_F(ClientTest, DISABLED_ON_GPU(ExecuteParallel)) {
   XlaComputation add_with_one_arg, mul_with_two_args, dot_with_one_arg;
   Shape shape = ShapeUtil::MakeShape(S32, {2, 2});
 
-  TF_ASSERT_OK_AND_ASSIGN(
-      std::unique_ptr<GlobalData> const_arg,
-      client_->TransferToServer(*Literal::CreateR2<int32>({{5, 6}, {7, 8}})));
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<GlobalData> const_arg,
+                          client_->TransferToServer(
+                              *LiteralUtil::CreateR2<int32>({{5, 6}, {7, 8}})));
 
   XlaBuilder b(TestName() + ".add");
   Add(Parameter(&b, 0, shape, "param_0"),
@@ -136,7 +136,7 @@ XLA_TEST_F(ClientTest, DISABLED_ON_GPU(ExecuteParallel)) {
 
   TF_ASSERT_OK_AND_ASSIGN(auto results,
                           client_->ExecuteParallel(computation_instances));
-  auto expected_result = Literal::CreateR2<int32>({{6, 8}, {10, 12}});
+  auto expected_result = LiteralUtil::CreateR2<int32>({{6, 8}, {10, 12}});
 
   TF_ASSERT_OK_AND_ASSIGN(
       auto result_literal,
diff --git a/tensorflow/compiler/xla/tests/compilation_cache_test.cc b/tensorflow/compiler/xla/tests/compilation_cache_test.cc
index eb211dd8ff..7c52c9fbbb 100644
--- a/tensorflow/compiler/xla/tests/compilation_cache_test.cc
+++ b/tensorflow/compiler/xla/tests/compilation_cache_test.cc
@@ -19,9 +19,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@@ -50,7 +50,7 @@ class CompilationCacheTest : public ClientLibraryTestBase {
                                  &execution_profile)
             .ConsumeValueOrDie();
     EXPECT_TRUE(LiteralTestUtil::Near(
-        *Literal::CreateR0<float>(expected_result), *result, error_spec_));
+        *LiteralUtil::CreateR0<float>(expected_result), *result, error_spec_));
     EXPECT_EQ(expect_cache_hit, execution_profile.compilation_cache_hit());
   }
 
@@ -67,7 +67,7 @@ class CompilationCacheTest : public ClientLibraryTestBase {
     std::unique_ptr<Literal> result =
         client_->Transfer(*data_handle).ConsumeValueOrDie();
     EXPECT_TRUE(LiteralTestUtil::Near(
-        *Literal::CreateR2<float>(expected_result), *result, error_spec_));
+        *LiteralUtil::CreateR2<float>(expected_result), *result, error_spec_));
     EXPECT_EQ(expect_cache_hit, execution_profile.compilation_cache_hit());
   }
 
@@ -89,13 +89,13 @@ XLA_TEST_F(CompilationCacheTest, DISABLED_ComputationCalledMultipleTimes) {
 XLA_TEST_F(CompilationCacheTest,
            DISABLED_ComputationCalledWithDifferentParameters) {
   std::unique_ptr<GlobalData> data_42 =
-      client_->TransferToServer(*Literal::CreateR0<float>(42.0f))
+      client_->TransferToServer(*LiteralUtil::CreateR0<float>(42.0f))
           .ConsumeValueOrDie();
   std::unique_ptr<GlobalData> data_123 =
-      client_->TransferToServer(*Literal::CreateR0<float>(123.0f))
+      client_->TransferToServer(*LiteralUtil::CreateR0<float>(123.0f))
           .ConsumeValueOrDie();
   std::unique_ptr<GlobalData> data_456 =
-      client_->TransferToServer(*Literal::CreateR0<float>(456.0f))
+      client_->TransferToServer(*LiteralUtil::CreateR0<float>(456.0f))
           .ConsumeValueOrDie();
 
   XlaBuilder builder(TestName());
@@ -143,12 +143,12 @@ XLA_TEST_F(CompilationCacheTest, DISABLED_DifferentParameterLayouts) {
   // layouts. Use these arrays as parameters to a simple computation. If the
   // layout of the array changes then computation should be recompiled (cache
   // miss).
-  auto rowmaj_array = Literal::CreateR2WithLayout(
+  auto rowmaj_array = LiteralUtil::CreateR2WithLayout(
       {{1.0f, 2.0f}, {3.0f, 4.0f}}, LayoutUtil::MakeLayout({1, 0}));
   auto rowmaj_handle =
       client_->TransferToServer(*rowmaj_array).ConsumeValueOrDie();
 
-  auto colmaj_array = Literal::CreateR2WithLayout(
+  auto colmaj_array = LiteralUtil::CreateR2WithLayout(
       {{1.0f, 2.0f}, {3.0f, 4.0f}}, LayoutUtil::MakeLayout({0, 1}));
   auto colmaj_handle =
       client_->TransferToServer(*colmaj_array).ConsumeValueOrDie();
diff --git a/tensorflow/compiler/xla/tests/compute_constant_test.cc b/tensorflow/compiler/xla/tests/compute_constant_test.cc
index 1a396b090c..5a06d061f0 100644
--- a/tensorflow/compiler/xla/tests/compute_constant_test.cc
+++ b/tensorflow/compiler/xla/tests/compute_constant_test.cc
@@ -19,10 +19,10 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -207,7 +207,7 @@ TEST_F(ComputeConstantTest, NonScalarAdd) {
     TF_ASSERT_OK_AND_ASSIGN(auto computed,
                             ComputeConstantLiteral(client, computation, &b));
     std::unique_ptr<Literal> expected_literal =
-        Literal::CreateR1<int32>({4, 6});
+        LiteralUtil::CreateR1<int32>({4, 6});
     EXPECT_TRUE(LiteralTestUtil::Equal(*expected_literal, *computed));
   }
 }
@@ -221,7 +221,7 @@ TEST_F(ComputeConstantTest, IntegerDivide) {
 
     TF_ASSERT_OK_AND_ASSIGN(auto computed,
                             ComputeConstantLiteral(client, computation, &b));
-    std::unique_ptr<Literal> expected_literal = Literal::CreateR0<int32>(5);
+    std::unique_ptr<Literal> expected_literal = LiteralUtil::CreateR0<int32>(5);
     EXPECT_TRUE(LiteralTestUtil::Equal(*expected_literal, *computed));
   }
 }
@@ -242,8 +242,8 @@ XLA_TEST_F(ComputeConstantTest, Layout) {
                              &b, &layout_proto));
 
       std::unique_ptr<Literal> expected_literal =
-          Literal::CreateR2WithLayout<int32>({{11, 22}, {33, 44}},
-                                             LayoutUtil::MakeLayout(layout));
+          LiteralUtil::CreateR2WithLayout<int32>(
+              {{11, 22}, {33, 44}}, LayoutUtil::MakeLayout(layout));
       ASSERT_TRUE(LiteralTestUtil::EqualShapesAndLayouts(
           expected_literal->shape(), computed->shape()));
       EXPECT_TRUE(LiteralTestUtil::Equal(*expected_literal, *computed));
diff --git a/tensorflow/compiler/xla/tests/concat_test.cc b/tensorflow/compiler/xla/tests/concat_test.cc
index 1161b560b7..be017477d8 100644
--- a/tensorflow/compiler/xla/tests/concat_test.cc
+++ b/tensorflow/compiler/xla/tests/concat_test.cc
@@ -19,8 +19,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -534,8 +534,8 @@ TEST_P(ConcatR2BinaryTest, DoIt) {
 //     concat
 XLA_TEST_F(ConcatTest, ConcatOperandsOfSameOperand) {
   auto f32_scalar = ShapeUtil::MakeShape(xla::F32, {});
-  auto x_literal = Literal::CreateR0<float>(2.f);
-  auto y_literal = Literal::CreateR0<float>(3.f);
+  auto x_literal = LiteralUtil::CreateR0<float>(2.f);
+  auto y_literal = LiteralUtil::CreateR0<float>(3.f);
   auto x_data = client_->TransferToServer(*x_literal).ConsumeValueOrDie();
   auto y_data = client_->TransferToServer(*y_literal).ConsumeValueOrDie();
 
@@ -556,9 +556,9 @@ XLA_TEST_F(ConcatTest, ConcatOperandsOfSameOperand) {
 // produces the correct result in rank 1.
 XLA_TEST_F(ConcatTest, ConcatBroadcastArgument) {
   auto f32_scalar = ShapeUtil::MakeShape(xla::F32, {});
-  auto x_literal = Literal::CreateR1<float>({2.0f, 3.0f, 5.0f, 6.0f});
-  auto y_literal = Literal::CreateR0<float>(1.5f);
-  auto z_literal = Literal::CreateR0<float>(5.5f);
+  auto x_literal = LiteralUtil::CreateR1<float>({2.0f, 3.0f, 5.0f, 6.0f});
+  auto y_literal = LiteralUtil::CreateR0<float>(1.5f);
+  auto z_literal = LiteralUtil::CreateR0<float>(5.5f);
   auto x_data = client_->TransferToServer(*x_literal).ConsumeValueOrDie();
   auto y_data = client_->TransferToServer(*y_literal).ConsumeValueOrDie();
   auto z_data = client_->TransferToServer(*z_literal).ConsumeValueOrDie();
@@ -584,9 +584,9 @@ XLA_TEST_F(ConcatTest, ConcatBroadcastArgument) {
 XLA_TEST_F(ConcatTest, ConcatBroadcastArgumentR3) {
   auto f32_scalar = ShapeUtil::MakeShape(xla::F32, {});
   Array3D<float> x3d(3, 5, 7, 3.14f);
-  auto x_literal = Literal::CreateR3FromArray3D<float>(x3d);
-  auto y_literal = Literal::CreateR0<float>(1.5f);
-  auto z_literal = Literal::CreateR0<float>(5.5f);
+  auto x_literal = LiteralUtil::CreateR3FromArray3D<float>(x3d);
+  auto y_literal = LiteralUtil::CreateR0<float>(1.5f);
+  auto z_literal = LiteralUtil::CreateR0<float>(5.5f);
   auto x_data = client_->TransferToServer(*x_literal).ConsumeValueOrDie();
   auto y_data = client_->TransferToServer(*y_literal).ConsumeValueOrDie();
   auto z_data = client_->TransferToServer(*z_literal).ConsumeValueOrDie();
diff --git a/tensorflow/compiler/xla/tests/conditional_test.cc b/tensorflow/compiler/xla/tests/conditional_test.cc
index ee3c83039b..b27c1044ba 100644
--- a/tensorflow/compiler/xla/tests/conditional_test.cc
+++ b/tensorflow/compiler/xla/tests/conditional_test.cc
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
@@ -172,88 +172,95 @@ class ConditionalOpTest : public ClientLibraryTestBase {
 // Test true and false computations that do not take any parameters.
 XLA_TEST_F(ConditionalOpTest, Parameters0) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, true);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(true, 0, "pred", &builder, &pred);
   auto operands = Tuple(&builder, {});
   auto true_computation = CreateR0ConstantComputation(56.0f);
   auto false_computation = CreateR0ConstantComputation(12.0f);
   Conditional(pred, operands, true_computation, operands, false_computation);
 
-  ComputeAndCompareR0<float>(&builder, 56.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 56.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test true and false computations that take in 1 parameter.
 XLA_TEST_F(ConditionalOpTest, Parameters1) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 56.0f);
   auto operand2 = ConstantR0<float>(&builder, 12.0f);
   auto identity = CreateR0IdentityComputation();
   Conditional(pred, operand1, identity, operand2, identity);
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test conditional with two different computations in the true and false cases
 // that take in different arguments.
 XLA_TEST_F(ConditionalOpTest, DiffComputationsDiffArgs) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 56.4f);
   auto operand2 = ConstantR0<float>(&builder, 12.6f);
   Conditional(pred, operand1, CreateR0CeilComputation(), operand2,
               CreateR0FloorComputation());
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test conditional with two different computations in the true and false cases
 // that take in the same arguments.
 XLA_TEST_F(ConditionalOpTest, DiffComputationsSameArg) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand = ConstantR0<float>(&builder, 12.6f);
   Conditional(pred, operand, CreateR0CeilComputation(), operand,
               CreateR0FloorComputation());
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test conditional with the same computation in the true and false cases but
 // take in different arguments.
 XLA_TEST_F(ConditionalOpTest, SameComputationDiffArgs) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 56.4f);
   auto operand2 = ConstantR0<float>(&builder, 12.6f);
   auto floor = CreateR0FloorComputation();
   Conditional(pred, operand1, floor, operand2, floor);
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test conditional with the same computation in the true and false cases that
 // take in the same arguments.
 XLA_TEST_F(ConditionalOpTest, SameComputationSameArg) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand = ConstantR0<float>(&builder, 12.6f);
   auto floor = CreateR0FloorComputation();
   Conditional(pred, operand, floor, operand, floor);
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test conditional with different instances of the same computation in the true
 // and false cases.
 XLA_TEST_F(ConditionalOpTest, SameComputationDiffInstances) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 56.4f);
   auto operand2 = ConstantR0<float>(&builder, 12.6f);
   Conditional(pred, operand1, CreateR0FloorComputation(), operand2,
               CreateR0FloorComputation());
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test the case when a call invokes a computation that contains a conditional.
@@ -268,75 +275,83 @@ XLA_TEST_F(ConditionalOpTest, ConditionalWithCall) {
   auto inner_builder_result = inner_builder.Build();
 
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 56.4f);
   auto operand2 = ConstantR0<float>(&builder, 12.6f);
   Call(&builder, inner_builder_result.ConsumeValueOrDie(),
        {pred, operand1, operand2});
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test true and false computations that take in 2 parameters and predicate is
 // true.
 XLA_TEST_F(ConditionalOpTest, Parameters2TrueBranch) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, true);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(true, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 56.0f);
   auto operand2 = ConstantR0<float>(&builder, 12.0f);
   auto operands = Tuple(&builder, {operand1, operand2});
   Conditional(pred, operands, CreateR0TupleAddComputation(), operands,
               CreateR0TupleSubComputation());
 
-  ComputeAndCompareR0<float>(&builder, 68.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 68.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test true and false computations that take in 2 parameters and predicate is
 // false.
 XLA_TEST_F(ConditionalOpTest, Parameters2FalseBranch) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 56.0f);
   auto operand2 = ConstantR0<float>(&builder, 12.0f);
   auto operands = Tuple(&builder, {operand1, operand2});
   Conditional(pred, operands, CreateR0TupleAddComputation(), operands,
               CreateR0TupleSubComputation());
 
-  ComputeAndCompareR0<float>(&builder, 44.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 44.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test true and false computations that take in 2 array parameters and
 // predicate is true.
 XLA_TEST_F(ConditionalOpTest, Parameters2ArrayTrueBranch) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, true);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(true, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR1<float>(&builder, {24.0f, 56.0f});
   auto operand2 = ConstantR1<float>(&builder, {10.0f, 11.0f});
   auto operands = Tuple(&builder, {operand1, operand2});
   Conditional(pred, operands, CreateR1TupleAddComputation(), operands,
               CreateR1TupleSubComputation());
 
-  ComputeAndCompareR1<float>(&builder, {34.0f, 67.0f}, {}, error_spec_);
+  ComputeAndCompareR1<float>(&builder, {34.0f, 67.0f}, {pred_arg.get()},
+                             error_spec_);
 }
 
 // Test true and false computations that take in 2 array parameters and
 // predicate is false.
 XLA_TEST_F(ConditionalOpTest, Parameters2ArrayFalseBranch) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR1<float>(&builder, {24.0f, 56.0f});
   auto operand2 = ConstantR1<float>(&builder, {10.0f, 11.0f});
   auto operands = Tuple(&builder, {operand1, operand2});
   Conditional(pred, operands, CreateR1TupleAddComputation(), operands,
               CreateR1TupleSubComputation());
 
-  ComputeAndCompareR1<float>(&builder, {14.0f, 45.0f}, {}, error_spec_);
+  ComputeAndCompareR1<float>(&builder, {14.0f, 45.0f}, {pred_arg.get()},
+                             error_spec_);
 }
 
 // Test true and false computations that return a tuple of scalars.
 XLA_TEST_F(ConditionalOpTest, ReturnTupleOfScalars) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operands = Tuple(&builder, {ConstantR0<float>(&builder, 12.2f),
                                    ConstantR0<float>(&builder, 25.6f)});
   Conditional(pred, operands, CreateR0TupleCeilComputation(), operands,
@@ -344,15 +359,16 @@ XLA_TEST_F(ConditionalOpTest, ReturnTupleOfScalars) {
 
   ComputeAndCompareTuple(
       &builder,
-      *Literal::MakeTuple({Literal::CreateR0<float>(12.0f).get(),
-                           Literal::CreateR0<float>(25.0f).get()}),
-      {}, error_spec_);
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR0<float>(12.0f).get(),
+                               LiteralUtil::CreateR0<float>(25.0f).get()}),
+      {pred_arg.get()}, error_spec_);
 }
 
 // Test true and false computations that return a tuple of arrays.
 XLA_TEST_F(ConditionalOpTest, ReturnTupleOfArrays) {
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, true);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(true, 0, "pred", &builder, &pred);
   auto operands =
       Tuple(&builder, {ConstantR1<float>(&builder, {12.2f, 15.8f}),
                        ConstantR1<float>(&builder, {25.6f, 29.2f})});
@@ -361,9 +377,10 @@ XLA_TEST_F(ConditionalOpTest, ReturnTupleOfArrays) {
 
   ComputeAndCompareTuple(
       &builder,
-      *Literal::MakeTuple({Literal::CreateR1<float>({13.0f, 16.0f}).get(),
-                           Literal::CreateR1<float>({26.0f, 30.0f}).get()}),
-      {}, error_spec_);
+      *LiteralUtil::MakeTuple(
+          {LiteralUtil::CreateR1<float>({13.0f, 16.0f}).get(),
+           LiteralUtil::CreateR1<float>({26.0f, 30.0f}).get()}),
+      {pred_arg.get()}, error_spec_);
 }
 
 // Test true and false computations that return a tuple of a predicate, a
@@ -392,17 +409,19 @@ XLA_TEST_F(ConditionalOpTest, ReturnTupleofPredicateScalarArray) {
   EXPECT_IS_OK(false_builder_result.status());
 
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, true);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(true, 0, "pred", &builder, &pred);
   auto operands = Tuple(&builder, {});
   Conditional(pred, operands, true_builder_result.ConsumeValueOrDie(), operands,
               false_builder_result.ConsumeValueOrDie());
 
   ComputeAndCompareTuple(
       &builder,
-      *Literal::MakeTuple({Literal::CreateR0<bool>(true).get(),
-                           Literal::CreateR0<float>(12.2f).get(),
-                           Literal::CreateR1<float>({12.8f, 14.6f}).get()}),
-      {}, error_spec_);
+      *LiteralUtil::MakeTuple(
+          {LiteralUtil::CreateR0<bool>(true).get(),
+           LiteralUtil::CreateR0<float>(12.2f).get(),
+           LiteralUtil::CreateR1<float>({12.8f, 14.6f}).get()}),
+      {pred_arg.get()}, error_spec_);
 }
 
 // Test true and false computations that return a nested tuple.
@@ -436,21 +455,24 @@ XLA_TEST_F(ConditionalOpTest, ReturnNestedTuple) {
   EXPECT_IS_OK(false_builder_result.status());
 
   XlaBuilder builder(TestName());
-  auto pred = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operands = Tuple(&builder, {});
   Conditional(pred, operands, true_builder_result.ConsumeValueOrDie(), operands,
               false_builder_result.ConsumeValueOrDie());
 
   ComputeAndCompareTuple(
       &builder,
-      *Literal::MakeTuple(
-          {Literal::MakeTuple({Literal::CreateR0<float>(46.6f).get(),
-                               Literal::CreateR1<float>({54.4f, 58.4f}).get()})
+      *LiteralUtil::MakeTuple(
+          {LiteralUtil::MakeTuple(
+               {LiteralUtil::CreateR0<float>(46.6f).get(),
+                LiteralUtil::CreateR1<float>({54.4f, 58.4f}).get()})
                .get(),
-           Literal::MakeTuple({Literal::CreateR1<float>({62.1f, 67.4f}).get(),
-                               Literal::CreateR0<float>(9.3f).get()})
+           LiteralUtil::MakeTuple(
+               {LiteralUtil::CreateR1<float>({62.1f, 67.4f}).get(),
+                LiteralUtil::CreateR0<float>(9.3f).get()})
                .get()}),
-      {}, error_spec_);
+      {pred_arg.get()}, error_spec_);
 }
 
 // Test conditional that takes in scalar operands in the form of external
@@ -511,8 +533,9 @@ XLA_TEST_F(ConditionalOpTest, NestedConditionals) {
   EXPECT_IS_OK(inner_builder_result.status());
 
   XlaBuilder builder(TestName());
-  auto pred1 = ConstantR0<bool>(&builder, true);
-  auto pred2 = ConstantR0<bool>(&builder, false);
+  XlaOp pred1, pred2;
+  auto pred1_arg = CreateR0Parameter<bool>(true, 0, "pred1", &builder, &pred1);
+  auto pred2_arg = CreateR0Parameter<bool>(false, 1, "pred2", &builder, &pred2);
   auto operand1 = ConstantR0<float>(&builder, 1.1f);
   auto operand2 = ConstantR0<float>(&builder, 12.2f);
   auto operand3 = ConstantR0<float>(&builder, 43.3f);
@@ -520,7 +543,8 @@ XLA_TEST_F(ConditionalOpTest, NestedConditionals) {
   Conditional(pred1, tuple_operand, inner_builder_result.ConsumeValueOrDie(),
               operand3, CreateR0IdentityComputation());
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f,
+                             {pred1_arg.get(), pred2_arg.get()}, error_spec_);
 }
 
 XLA_TEST_F(ConditionalOpTest, ConditionalInNestedComputation) {
@@ -539,13 +563,14 @@ XLA_TEST_F(ConditionalOpTest, ConditionalInNestedComputation) {
   EXPECT_IS_OK(inner_builder_result.status());
 
   XlaBuilder builder(TestName());
-  auto pred2 = ConstantR0<bool>(&builder, false);
+  XlaOp pred;
+  auto pred_arg = CreateR0Parameter<bool>(false, 0, "pred", &builder, &pred);
   auto operand1 = ConstantR0<float>(&builder, 1.1f);
   auto operand2 = ConstantR0<float>(&builder, 12.2f);
-  auto tuple_operand = Tuple(&builder, {pred2, operand1, operand2});
+  auto tuple_operand = Tuple(&builder, {pred, operand1, operand2});
   Call(&builder, inner_builder_result.ConsumeValueOrDie(), {tuple_operand});
 
-  ComputeAndCompareR0<float>(&builder, 12.0f, {}, error_spec_);
+  ComputeAndCompareR0<float>(&builder, 12.0f, {pred_arg.get()}, error_spec_);
 }
 
 // Test a mismatch in the shape of the true operand and true computation.
@@ -600,16 +625,17 @@ XLA_TEST_F(ConditionalOpTest, SwappedInputsInSequentialConditionals) {
 
   auto test_swap = [&](float a, float b) {
     XlaBuilder builder(TestName());
-    auto x = ConstantR0<float>(&builder, a);
-    auto y = ConstantR0<float>(&builder, b);
+    XlaOp x, y;
+    auto x_arg = CreateR0Parameter<float>(a, 0, "x", &builder, &x);
+    auto y_arg = CreateR0Parameter<float>(b, 1, "y", &builder, &y);
     auto tuple_operand = Tuple(&builder, {x, y});
     Call(&builder, main, {tuple_operand});
 
     ComputeAndCompareTuple(
         &builder,
-        *Literal::MakeTuple({Literal::CreateR0<float>(a).get(),
-                             Literal::CreateR0<float>(b).get()}),
-        {}, error_spec_);
+        *LiteralUtil::MakeTuple({LiteralUtil::CreateR0<float>(a).get(),
+                                 LiteralUtil::CreateR0<float>(b).get()}),
+        {x_arg.get(), y_arg.get()}, error_spec_);
   };
 
   test_swap(3.11f, 9.4f);
diff --git a/tensorflow/compiler/xla/tests/constants_test.cc b/tensorflow/compiler/xla/tests/constants_test.cc
index cc5d3b1176..4937574831 100644
--- a/tensorflow/compiler/xla/tests/constants_test.cc
+++ b/tensorflow/compiler/xla/tests/constants_test.cc
@@ -22,7 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -110,8 +110,8 @@ TEST_F(ConstantsTest, Small_2x2) {
 
 TEST_F(ConstantsTest, Empty_3x0x2) {
   XlaBuilder builder(TestName());
-  ConstantLiteral(
-      &builder, *Literal::CreateR3FromArray3D<float>(Array3D<float>(3, 0, 2)));
+  ConstantLiteral(&builder, *LiteralUtil::CreateR3FromArray3D<float>(
+                                Array3D<float>(3, 0, 2)));
 
   ComputeAndCompareR3<float>(&builder, Array3D<float>(3, 0, 2), {});
 }
@@ -126,7 +126,7 @@ TEST_F(ConstantsTest, Small_2x2x2) {
       {{5.f, 6.f},   // y0
        {7.f, 8.f}},  // y1
   });
-  ConstantLiteral(&builder, *Literal::CreateR3FromArray3D<float>(array3d));
+  ConstantLiteral(&builder, *LiteralUtil::CreateR3FromArray3D<float>(array3d));
 
   ComputeAndCompareR3<float>(&builder, array3d, {});
 }
@@ -141,7 +141,7 @@ TEST_F(ConstantsTest, Small_3x2x1x1) {
   });
   input_array.FillWithPZ(pz);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4D(input_array);
+      LiteralUtil::CreateR4FromArray4D(input_array);
 
   {
     XlaBuilder builder(TestName());
@@ -159,22 +159,23 @@ TEST_F(ConstantsTest, Small_3x2x1x1) {
 // TODO(b/29263943): Support tuple constants.
 TEST_F(ConstantsTest, DISABLED_TupleConstant) {
   XlaBuilder builder(TestName());
-  ConstantLiteral(&builder, *Literal::MakeTuple(
-                                {Literal::CreateR2<float>({{1.0}, {2.0}}).get(),
-                                 Literal::CreateR1<float>({2.0, 42}).get()}));
+  ConstantLiteral(&builder,
+                  *LiteralUtil::MakeTuple(
+                      {LiteralUtil::CreateR2<float>({{1.0}, {2.0}}).get(),
+                       LiteralUtil::CreateR1<float>({2.0, 42}).get()}));
 
   std::unique_ptr<Literal> result =
       ExecuteAndTransfer(&builder, {}).ConsumeValueOrDie();
 
-  LiteralTestUtil::ExpectR2Near<float>(
-      {{1.0}, {2.0}}, LiteralSlice(*result, {0}), error_spec_);
-  LiteralTestUtil::ExpectR1Near<float>(
-      {2.0, 42.0}, LiteralSlice(*result, {1}), error_spec_);
+  LiteralTestUtil::ExpectR2Near<float>({{1.0}, {2.0}},
+                                       LiteralSlice(*result, {0}), error_spec_);
+  LiteralTestUtil::ExpectR1Near<float>({2.0, 42.0}, LiteralSlice(*result, {1}),
+                                       error_spec_);
 }
 
 TEST_F(ConstantsTest, Token) {
   XlaBuilder builder(TestName());
-  ConstantLiteral(&builder, *Literal::CreateToken());
+  ConstantLiteral(&builder, *LiteralUtil::CreateToken());
   // TODO(b/80000000): tokens cannot be returned from computations.
   Tuple(&builder, {});
   TF_ASSERT_OK(Execute(&builder, {}).status());
diff --git a/tensorflow/compiler/xla/tests/convert_test.cc b/tensorflow/compiler/xla/tests/convert_test.cc
index 292942a49e..1adc68cc48 100644
--- a/tensorflow/compiler/xla/tests/convert_test.cc
+++ b/tensorflow/compiler/xla/tests/convert_test.cc
@@ -13,13 +13,14 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include <array>
 #include <cstdint>
 #include <limits>
 #include <memory>
 #include <vector>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -52,13 +53,67 @@ TEST_F(ConvertTest, ConvertR1S32ToR1S32) {
   ComputeAndCompareR1<int32>(&builder, expected, {});
 }
 
+TEST_F(ConvertTest, ConvertR1S32ToR1U32) {
+  XlaBuilder builder(TestName());
+  auto a = ConstantR1<int32>(&builder, {42, 64});
+  ConvertElementType(a, U32);
+
+  std::vector<uint32> expected = {42, 64};
+  ComputeAndCompareR1<uint32>(&builder, expected, {});
+}
+
+TEST_F(ConvertTest, ConvertR1S32ToR1PRED) {
+  XlaBuilder builder(TestName());
+  auto a = ConstantR1<int32>(&builder, {42, 0, -64});
+  ConvertElementType(a, PRED);
+
+  std::array<bool, 3> expected = {true, false, true};
+  ComputeAndCompareR1<bool>(&builder, expected, {});
+}
+
+TEST_F(ConvertTest, ConvertR1U32ToR1U32) {
+  XlaBuilder builder(TestName());
+  auto a = ConstantR1<uint32>(&builder, {42, 64});
+  ConvertElementType(a, U32);
+
+  std::vector<uint32> expected = {42, 64};
+  ComputeAndCompareR1<uint32>(&builder, expected, {});
+}
+
+TEST_F(ConvertTest, ConvertR1U32ToR1S32) {
+  XlaBuilder builder(TestName());
+  auto a = ConstantR1<uint32>(&builder, {42, 64});
+  ConvertElementType(a, S32);
+
+  std::vector<int32> expected = {42, 64};
+  ComputeAndCompareR1<int32>(&builder, expected, {});
+}
+
+TEST_F(ConvertTest, ConvertR1U32ToR1PRED) {
+  XlaBuilder builder(TestName());
+  auto a = ConstantR1<uint32>(&builder, {42, 0, 64});
+  ConvertElementType(a, PRED);
+
+  std::array<bool, 3> expected = {true, false, true};
+  ComputeAndCompareR1<bool>(&builder, expected, {});
+}
+
 TEST_F(ConvertTest, ConvertR1F32ToR1F32) {
   XlaBuilder builder(TestName());
   auto a = ConstantR1<float>(&builder, {42.0f, 64.0f});
   ConvertElementType(a, F32);
 
   std::vector<float> expected = {42.0f, 64.0f};
-  ComputeAndCompareR1<float>(&builder, expected, {}, ErrorSpec(0.0001));
+  ComputeAndCompareR1<float>(&builder, expected, {});
+}
+
+TEST_F(ConvertTest, ConvertR1F32ToR1PRED) {
+  XlaBuilder builder(TestName());
+  auto a = ConstantR1<float>(&builder, {42.0f, 0.0f, 64.0f});
+  ConvertElementType(a, PRED);
+
+  std::array<bool, 3> expected = {true, false, true};
+  ComputeAndCompareR1<bool>(&builder, expected, {});
 }
 
 TEST_F(ConvertTest, ConvertR1S32ToR1F32) {
@@ -67,7 +122,7 @@ TEST_F(ConvertTest, ConvertR1S32ToR1F32) {
   ConvertElementType(a, F32);
 
   std::vector<float> expected = {42.0f, 64.0f};
-  ComputeAndCompareR1<float>(&builder, expected, {}, ErrorSpec(0.0001));
+  ComputeAndCompareR1<float>(&builder, expected, {});
 }
 
 TEST_F(ConvertTest, ConvertR1PREDToR1S32) {
@@ -79,6 +134,15 @@ TEST_F(ConvertTest, ConvertR1PREDToR1S32) {
   ComputeAndCompareR1<int32>(&builder, expected, {});
 }
 
+TEST_F(ConvertTest, ConvertR1PREDToR1U32) {
+  XlaBuilder builder(TestName());
+  auto a = ConstantR1<bool>(&builder, {true, false, true});
+  ConvertElementType(a, U32);
+
+  std::vector<uint32> expected = {1, 0, 1};
+  ComputeAndCompareR1<uint32>(&builder, expected, {});
+}
+
 TEST_F(ConvertTest, ConvertR1PREDToR1F32) {
   XlaBuilder builder(TestName());
   auto a = ConstantR1<bool>(&builder, {true, false, true});
@@ -94,7 +158,7 @@ XLA_TEST_F(ConvertTest, ConvertR1S0S32ToR1S0F32) {
   ConvertElementType(a, F32);
 
   std::vector<float> expected = {};
-  ComputeAndCompareR1<float>(&builder, expected, {}, ErrorSpec(0.0001));
+  ComputeAndCompareR1<float>(&builder, expected, {});
 }
 
 TEST_F(ConvertTest, ConvertR1F32ToR1S32) {
@@ -145,7 +209,7 @@ XLA_TEST_F(ConvertTest, ConvertR1S64ToR1F32) {
       static_cast<int64>(0x8000008000000000LL),
       static_cast<int64>(0x8000010000000000LL),
   };
-  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<int64>({arg});
+  std::unique_ptr<Literal> arg_literal = LiteralUtil::CreateR1<int64>({arg});
   auto arg_param = Parameter(&builder, 0, arg_literal->shape(), "arg_param");
   std::unique_ptr<GlobalData> arg_data =
       client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
@@ -164,7 +228,7 @@ XLA_TEST_F(ConvertTest, ConvertR1U32ToR1F32) {
   std::vector<uint32> arg{0,          1,          0x1000,     0x7fffffff,
                           0x80000000, 0x80000001, 0x80000002, 0x80000003,
                           0x80000080, 0x80000081, 0x80000082, 0xFFFFFFFF};
-  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<uint32>({arg});
+  std::unique_ptr<Literal> arg_literal = LiteralUtil::CreateR1<uint32>({arg});
   auto arg_param = Parameter(&builder, 0, arg_literal->shape(), "arg_param");
   std::unique_ptr<GlobalData> arg_data =
       client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
@@ -182,7 +246,7 @@ XLA_TEST_F(ConvertTest, ConvertR1F32ToR1U32) {
   XlaBuilder builder(TestName());
   std::vector<float> arg{0.0f,        1.0f,          16777216.0f,
                          16777218.0f, 2147483647.0f, 4294967040.0f};
-  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<float>({arg});
+  std::unique_ptr<Literal> arg_literal = LiteralUtil::CreateR1<float>({arg});
   auto arg_param = Parameter(&builder, 0, arg_literal->shape(), "arg_param");
   std::unique_ptr<GlobalData> arg_data =
       client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
@@ -199,7 +263,7 @@ XLA_TEST_F(ConvertTest, ConvertR1F32ToR1U32) {
 XLA_TEST_F(ConvertTest, ConvertR1U32ToR1S64) {
   XlaBuilder builder(TestName());
   std::vector<uint32> arg{0, 1, 0x1000, 0x7fffffff, 0x80000082, 0xFFFFFFFF};
-  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<uint32>({arg});
+  std::unique_ptr<Literal> arg_literal = LiteralUtil::CreateR1<uint32>({arg});
   auto arg_param = Parameter(&builder, 0, arg_literal->shape(), "arg_param");
   std::unique_ptr<GlobalData> arg_data =
       client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
@@ -216,7 +280,7 @@ XLA_TEST_F(ConvertTest, ConvertR1U32ToR1S64) {
 XLA_TEST_F(ConvertTest, ConvertR1S32ToR1S64) {
   XlaBuilder builder(TestName());
   std::vector<int32> arg{0, 1, 0x1000, -1, -0x1000};
-  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<int32>({arg});
+  std::unique_ptr<Literal> arg_literal = LiteralUtil::CreateR1<int32>({arg});
   auto arg_param = Parameter(&builder, 0, arg_literal->shape(), "arg_param");
   std::unique_ptr<GlobalData> arg_data =
       client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
@@ -253,7 +317,7 @@ XLA_TEST_F(ConvertTest, ConvertR1F32ToR1S64) {
                          9223370937343148032.f,
                          -9223371487098961920.f,
                          -9223370937343148032.f};
-  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<float>({arg});
+  std::unique_ptr<Literal> arg_literal = LiteralUtil::CreateR1<float>({arg});
   auto arg_param = Parameter(&builder, 0, arg_literal->shape(), "arg_param");
   std::unique_ptr<GlobalData> arg_data =
       client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
@@ -391,7 +455,7 @@ XLA_TEST_F(ConvertTest, ConvertR1F16ToR1F32) {
 
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> dot_lhs_handle,
-      client_->TransferToServer(*Literal::CreateR1<half>(input)));
+      client_->TransferToServer(*LiteralUtil::CreateR1<half>(input)));
 
   XlaBuilder builder(TestName());
   ConvertElementType(
@@ -411,7 +475,7 @@ XLA_TEST_F(ConvertTest, ConvertR1F32ToR1F16) {
 
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> dot_lhs_handle,
-      client_->TransferToServer(*Literal::CreateR1<float>(input)));
+      client_->TransferToServer(*LiteralUtil::CreateR1<float>(input)));
 
   XlaBuilder builder(TestName());
   ConvertElementType(
diff --git a/tensorflow/compiler/xla/tests/convolution_dimension_numbers_test.cc b/tensorflow/compiler/xla/tests/convolution_dimension_numbers_test.cc
index 7605ebf4c0..7b6bbc4f57 100644
--- a/tensorflow/compiler/xla/tests/convolution_dimension_numbers_test.cc
+++ b/tensorflow/compiler/xla/tests/convolution_dimension_numbers_test.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -93,7 +93,8 @@ XLA_TEST_F(ConvolutionDimensionNumbersTest,
   auto weight_array = MakeUnique<Array4D<float>>(4, 3, 1, 1);
   weight_array->FillWithMultiples(0.2);
   auto weight_data =
-      client_->TransferToServer(*Literal::CreateR4FromArray4D(*weight_array))
+      client_
+          ->TransferToServer(*LiteralUtil::CreateR4FromArray4D(*weight_array))
           .ConsumeValueOrDie();
 
   XlaBuilder builder(TestName());
diff --git a/tensorflow/compiler/xla/tests/convolution_test.cc b/tensorflow/compiler/xla/tests/convolution_test.cc
index 0f6d54d042..5ed8122e00 100644
--- a/tensorflow/compiler/xla/tests/convolution_test.cc
+++ b/tensorflow/compiler/xla/tests/convolution_test.cc
@@ -23,9 +23,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -123,8 +123,8 @@ class Convolve_1x1x1x2_1x1x1x2_Valid : public ConvolutionTest {
     }));
 
     ComputeAndCompare(&builder,
-                      {std::move(*Literal::CreateFromArray(input_data)),
-                       std::move(*Literal::CreateFromArray(filter_data))},
+                      {std::move(*LiteralUtil::CreateFromArray(input_data)),
+                       std::move(*LiteralUtil::CreateFromArray(filter_data))},
                       error_spec_);
   }
 };
@@ -157,8 +157,8 @@ class Convolve_1x1x4x4_1x1x2x2_Valid : public ConvolutionTest {
         {7.0f, 8.0f},
     }));
     ComputeAndCompare(&builder,
-                      {std::move(*Literal::CreateFromArray(input_data)),
-                       std::move(*Literal::CreateFromArray(filter_data))},
+                      {std::move(*LiteralUtil::CreateFromArray(input_data)),
+                       std::move(*LiteralUtil::CreateFromArray(filter_data))},
                       error_spec_);
   }
 };
@@ -192,8 +192,8 @@ class Convolve_1x1x4x4_1x1x2x2_Same : public ConvolutionTest {
     }));
 
     ComputeAndCompare(&builder,
-                      {std::move(*Literal::CreateFromArray(input_data)),
-                       std::move(*Literal::CreateFromArray(filter_data))},
+                      {std::move(*LiteralUtil::CreateFromArray(input_data)),
+                       std::move(*LiteralUtil::CreateFromArray(filter_data))},
                       error_spec_);
   }
 };
@@ -224,8 +224,8 @@ class Convolve_1x1x4x4_1x1x3x3_Same : public ConvolutionTest {
         {{5.0f, 6.0f, 7.0f}, {8.0f, 9.0f, 10.0f}, {11.0f, 12.0f, 13.0f}}));
     // clang-format on
     ComputeAndCompare(&builder,
-                      {std::move(*Literal::CreateFromArray(input_data)),
-                       std::move(*Literal::CreateFromArray(filter_data))},
+                      {std::move(*LiteralUtil::CreateFromArray(input_data)),
+                       std::move(*LiteralUtil::CreateFromArray(filter_data))},
                       error_spec_);
   }
 };
@@ -249,10 +249,10 @@ XLA_TEST_F(ConvolutionTest, Convolve1D_1x2x5_1x2x2_Valid) {
   Array3D<float> expected({{{510, 610, 710, 810}}});
 
   auto input_literal =
-      client_->TransferToServer(*Literal::CreateR3FromArray3D(input))
+      client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(input))
           .ConsumeValueOrDie();
   auto filter_literal =
-      client_->TransferToServer(*Literal::CreateR3FromArray3D(filter))
+      client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(filter))
           .ConsumeValueOrDie();
 
   ComputeAndCompareR3<float>(&builder, expected,
@@ -284,10 +284,10 @@ class Convolve1D_1x2x5_1x2x2_WithRHSDilation : public ConvolutionTest {
     Array3D<T> expected({{{570.0f, 670.0f, 770.0f}}});
 
     auto input_literal =
-        client_->TransferToServer(*Literal::CreateR3FromArray3D(input))
+        client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(input))
             .ConsumeValueOrDie();
     auto filter_literal =
-        client_->TransferToServer(*Literal::CreateR3FromArray3D(filter))
+        client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(filter))
             .ConsumeValueOrDie();
 
     ComputeAndCompareR3<T>(&builder, expected,
@@ -319,10 +319,10 @@ XLA_TEST_F(ConvolutionTest, Convolve1D_1x2x5_1x2x2_WithLHSDilation) {
   Array3D<float> expected({{{190, 320, 230, 380, 270, 440, 310, 500}}});
 
   auto input_literal =
-      client_->TransferToServer(*Literal::CreateR3FromArray3D(input))
+      client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(input))
           .ConsumeValueOrDie();
   auto filter_literal =
-      client_->TransferToServer(*Literal::CreateR3FromArray3D(filter))
+      client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(filter))
           .ConsumeValueOrDie();
 
   ComputeAndCompareR3<float>(&builder, expected,
@@ -350,10 +350,10 @@ XLA_TEST_F(ConvolutionTest, Convolve1D_1x2x5_1x2x2_WithLHSAndRHSDilation) {
   Array3D<float> expected({{{510, 0, 610, 0, 710, 0, 810}}});
 
   auto input_literal =
-      client_->TransferToServer(*Literal::CreateR3FromArray3D(input))
+      client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(input))
           .ConsumeValueOrDie();
   auto filter_literal =
-      client_->TransferToServer(*Literal::CreateR3FromArray3D(filter))
+      client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(filter))
           .ConsumeValueOrDie();
 
   ComputeAndCompareR3<float>(&builder, expected,
@@ -386,10 +386,10 @@ class Convolve1D_1x2x5_1x2x2_WithPadding : public ConvolutionTest {
         {{{0.0f, 260.0f, 510.0f, 610.0f, 710.0f, 810.0f, 350.0f, 0.0f}}});
 
     auto input_literal =
-        client_->TransferToServer(*Literal::CreateR3FromArray3D(input))
+        client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(input))
             .ConsumeValueOrDie();
     auto filter_literal =
-        client_->TransferToServer(*Literal::CreateR3FromArray3D(filter))
+        client_->TransferToServer(*LiteralUtil::CreateR3FromArray3D(filter))
             .ConsumeValueOrDie();
 
     ComputeAndCompareR3<T>(&builder, expected,
@@ -434,15 +434,15 @@ XLA_TEST_F(ConvolutionTest, Convolve3D_1x4x2x3x3_2x2x2x3x3_Valid) {
 
   std::vector<float> input_elems(ShapeUtil::ElementsIn(input_shape));
   iota(input_elems.begin(), input_elems.end(), 1.0f);
-  auto input_r1 = Literal::CreateR1<float>(input_elems);
+  auto input_r1 = LiteralUtil::CreateR1<float>(input_elems);
   auto input_r5 = input_r1->Reshape(input_dims).ConsumeValueOrDie();
 
   std::vector<float> filter_elems(ShapeUtil::ElementsIn(filter_shape));
   iota(filter_elems.begin(), filter_elems.end(), 1.0f);
-  auto filter_r1 = Literal::CreateR1<float>(filter_elems);
+  auto filter_r1 = LiteralUtil::CreateR1<float>(filter_elems);
   auto filter_r5 = filter_r1->Reshape(filter_dims).ConsumeValueOrDie();
 
-  auto expected_r1 = Literal::CreateR1<float>(
+  auto expected_r1 = LiteralUtil::CreateR1<float>(
       {19554, 19962, 20370, 22110, 22590, 23070, 34890, 35730, 36570, 37446,
        38358, 39270, 50226, 51498, 52770, 52782, 54126, 55470});
   auto expected_r5 = expected_r1->Reshape({1, 3, 1, 2, 3}).ConsumeValueOrDie();
@@ -497,15 +497,15 @@ class Convolve2D_1x3x3x5_3x3x5x5_Valid : public ConvolutionTest {
 
     std::vector<T> input_elems(ShapeUtil::ElementsIn(input_shape));
     iota_int_init_value(input_elems, 1);
-    auto input_r1 = Literal::CreateR1<T>(input_elems);
+    auto input_r1 = LiteralUtil::CreateR1<T>(input_elems);
     auto input_r4 = input_r1->Reshape(input_dims).ConsumeValueOrDie();
 
     std::vector<T> filter_elems(ShapeUtil::ElementsIn(filter_shape));
     iota_int_init_value(filter_elems, 1);
-    auto filter_r1 = Literal::CreateR1<T>(filter_elems);
+    auto filter_r1 = LiteralUtil::CreateR1<T>(filter_elems);
     auto filter_r4 = filter_r1->Reshape(filter_dims).ConsumeValueOrDie();
 
-    auto expected_r1 = Literal::CreateR1<T>(
+    auto expected_r1 = LiteralUtil::CreateR1<T>(
         {static_cast<T>(92115), static_cast<T>(93150), static_cast<T>(94185)});
     auto expected_r4 = expected_r1->Reshape({1, 1, 1, 3}).ConsumeValueOrDie();
 
@@ -561,8 +561,8 @@ XLA_TEST_P(ConvolveWithAndWithoutCanonicalization,
   expected_result.Fill(0);
 
   ComputeAndCompare(&builder,
-                    {std::move(*Literal::CreateFromArray(param0)),
-                     std::move(*Literal::CreateFromArray(param1))},
+                    {std::move(*LiteralUtil::CreateFromArray(param0)),
+                     std::move(*LiteralUtil::CreateFromArray(param1))},
                     error_spec_);
 }
 
@@ -617,18 +617,18 @@ class Convolve1D1WindowTestBase
 
     std::vector<T> input_elems(ShapeUtil::ElementsIn(input_shape),
                                static_cast<T>(1.0f));
-    auto input_r1 = Literal::CreateR1<T>(input_elems);
+    auto input_r1 = LiteralUtil::CreateR1<T>(input_elems);
     auto input_r3 = input_r1->Reshape(input_dims).ConsumeValueOrDie();
 
     std::vector<T> filter_elems(ShapeUtil::ElementsIn(filter_shape),
                                 static_cast<T>(1.0f));
 
-    auto filter_r1 = Literal::CreateR1<T>(filter_elems);
+    auto filter_r1 = LiteralUtil::CreateR1<T>(filter_elems);
     auto filter_r3 = filter_r1->Reshape(filter_dims).ConsumeValueOrDie();
 
     std::vector<T> expect_elems(batch * output_feature * num_windows,
                                 static_cast<T>(window_size * input_feature));
-    auto expected_r1 = Literal::CreateR1<T>(expect_elems);
+    auto expected_r1 = LiteralUtil::CreateR1<T>(expect_elems);
     auto expected_r3 =
         expected_r1->Reshape({batch, num_windows, output_feature})
             .ConsumeValueOrDie();
@@ -737,8 +737,8 @@ XLA_TEST_F(ConvolutionTest, Convolve_bf16_1x1x1x2_1x1x1x2_Valid) {
   }));
 
   ComputeAndCompare(&builder,
-                    {std::move(*Literal::CreateFromArray(input_data)),
-                     std::move(*Literal::CreateFromArray(filter_data))},
+                    {std::move(*LiteralUtil::CreateFromArray(input_data)),
+                     std::move(*LiteralUtil::CreateFromArray(filter_data))},
                     error_spec_);
 }
 
@@ -761,8 +761,8 @@ XLA_TEST_F(ConvolutionTest, NoCudnnAlgorithmPicker) {
   filter_data.FillIota(10);
 
   ComputeAndCompare(&builder,
-                    {std::move(*Literal::CreateFromArray(input_data)),
-                     std::move(*Literal::CreateFromArray(filter_data))});
+                    {std::move(*LiteralUtil::CreateFromArray(input_data)),
+                     std::move(*LiteralUtil::CreateFromArray(filter_data))});
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/convolution_variants_test.cc b/tensorflow/compiler/xla/tests/convolution_variants_test.cc
index c31d033bb0..6784c16715 100644
--- a/tensorflow/compiler/xla/tests/convolution_variants_test.cc
+++ b/tensorflow/compiler/xla/tests/convolution_variants_test.cc
@@ -27,8 +27,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -1333,17 +1333,17 @@ XLA_TEST_F(ConvolutionVariantsTest, BackwardFilterEvenPadding1D) {
 XLA_TEST_F(ConvolutionVariantsTest, BackwardInputEvenPadding3D) {
   XlaBuilder builder(TestName());
 
-  auto gradients_flat = Literal::CreateR1<float>({1});
+  auto gradients_flat = LiteralUtil::CreateR1<float>({1});
   auto gradients_literal =
       gradients_flat->Reshape({1, 1, 1, 1, 1}).ConsumeValueOrDie();
   auto gradients = ConstantLiteral(&builder, *gradients_literal);
 
-  auto weights_flat = Literal::CreateR1<float>({1, 10, 100});
+  auto weights_flat = LiteralUtil::CreateR1<float>({1, 10, 100});
   auto weights_literal =
       weights_flat->Reshape({1, 1, 1, 1, 3}).ConsumeValueOrDie();
   auto weights = ConstantLiteral(&builder, *weights_literal);
 
-  auto expected_flat = Literal::CreateR1<float>({10});
+  auto expected_flat = LiteralUtil::CreateR1<float>({10});
   auto expected_literal =
       expected_flat->Reshape({1, 1, 1, 1, 1}).ConsumeValueOrDie();
 
@@ -1357,17 +1357,17 @@ XLA_TEST_F(ConvolutionVariantsTest, BackwardInputEvenPadding3D) {
 XLA_TEST_F(ConvolutionVariantsTest, BackwardFilterEvenPadding3D) {
   XlaBuilder builder(TestName());
 
-  auto activations_flat = Literal::CreateR1<float>({1, 2, 3, 4});
+  auto activations_flat = LiteralUtil::CreateR1<float>({1, 2, 3, 4});
   auto activations_literal =
       activations_flat->Reshape({1, 1, 1, 1, 4}).ConsumeValueOrDie();
   auto activations = ConstantLiteral(&builder, *activations_literal);
 
-  auto gradients_flat = Literal::CreateR1<float>({100, 10, 1});
+  auto gradients_flat = LiteralUtil::CreateR1<float>({100, 10, 1});
   auto gradients_literal =
       gradients_flat->Reshape({1, 1, 1, 1, 3}).ConsumeValueOrDie();
   auto gradients = ConstantLiteral(&builder, *gradients_literal);
 
-  auto expected_flat = Literal::CreateR1<float>({13, 24, 130});
+  auto expected_flat = LiteralUtil::CreateR1<float>({13, 24, 130});
   auto expected_literal =
       expected_flat->Reshape({1, 1, 1, 1, 3}).ConsumeValueOrDie();
 
diff --git a/tensorflow/compiler/xla/tests/copy_test.cc b/tensorflow/compiler/xla/tests/copy_test.cc
index fef42885e5..5ef273e5a2 100644
--- a/tensorflow/compiler/xla/tests/copy_test.cc
+++ b/tensorflow/compiler/xla/tests/copy_test.cc
@@ -17,8 +17,8 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/array2d.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
@@ -58,37 +58,38 @@ class CopyOpTest : public HloTestBase {
 };
 
 XLA_TEST_F(CopyOpTest, CopyR0Bool) {
-  TestCopyOp(*Literal::CreateR0<bool>(true));
+  TestCopyOp(*LiteralUtil::CreateR0<bool>(true));
 }
 
 XLA_TEST_F(CopyOpTest, CopyR1S0U32) {
-  TestCopyOp(*Literal::CreateR1<uint32>({}));
+  TestCopyOp(*LiteralUtil::CreateR1<uint32>({}));
 }
 
 XLA_TEST_F(CopyOpTest, CopyR1S3U32) {
-  TestCopyOp(*Literal::CreateR1<uint32>({1, 2, 3}));
+  TestCopyOp(*LiteralUtil::CreateR1<uint32>({1, 2, 3}));
 }
 
 XLA_TEST_F(CopyOpTest, CopyR3F32_2x2x3) {
-  TestCopyOp(*Literal::CreateR3({{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
-                                 {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}}));
+  TestCopyOp(
+      *LiteralUtil::CreateR3({{{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}},
+                              {{1.1f, 2.1f, 3.1f}, {6.1f, 3.5f, 2.8f}}}));
 }
 
 XLA_TEST_F(CopyOpTest, CopyR4S32_2x2x3x2) {
-  TestCopyOp(*Literal::CreateR4(
+  TestCopyOp(*LiteralUtil::CreateR4(
       {{{{1, -2}, {-4, 5}, {6, 7}}, {{8, 9}, {10, 11}, {12, 13}}},
        {{{10, 3}, {7, -2}, {3, 6}}, {{2, 5}, {-11, 5}, {-2, -5}}}}));
 }
 
 XLA_TEST_F(CopyOpTest, CopyR4S32_0x2x3x2) {
-  TestCopyOp(*Literal::CreateR4FromArray4D(Array4D<int32>(0, 2, 3, 2)));
+  TestCopyOp(*LiteralUtil::CreateR4FromArray4D(Array4D<int32>(0, 2, 3, 2)));
 }
 
 XLA_TEST_F(CopyOpTest, CopyParameterScalar) {
   auto builder = HloComputation::Builder(TestName());
 
   // Copy literal to device to use as parameter.
-  auto literal = Literal::CreateR0<float>(42.0);
+  auto literal = LiteralUtil::CreateR0<float>(42.0);
   Shape shape = literal->shape();
 
   auto param0 = builder.AddInstruction(
@@ -109,7 +110,7 @@ XLA_TEST_F(CopyOpTest, CopyParameterScalar) {
 XLA_TEST_F(CopyOpTest, CopyConstantR2Twice) {
   auto builder = HloComputation::Builder(TestName());
 
-  auto literal = Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+  auto literal = LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
   auto constant = builder.AddInstruction(
       HloInstruction::CreateConstant(std::move(literal)));
 
@@ -131,7 +132,7 @@ XLA_TEST_F(CopyOpTest, CopyConstantR2DifferentLayouts) {
   HloComputation::Builder builder(TestName());
 
   std::unique_ptr<Literal> literal =
-      Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
+      LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}});
   // Reverse the minor-to-major order of the literal.
   Layout* literal_layout =
       literal->mutable_shape_do_not_use()->mutable_layout();
@@ -168,7 +169,7 @@ void CopyOpTest::TestCopyConstantLayout021(size_t n1, size_t n2, size_t n3) {
 
   HloComputation::Builder builder(TestName());
 
-  std::unique_ptr<Literal> literal = Literal::CreateR3FromArray3D(a);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR3FromArray3D(a);
 
   HloInstruction* constant = builder.AddInstruction(
       HloInstruction::CreateConstant(std::move(literal)));
@@ -202,7 +203,7 @@ void CopyOpTest::TestCopyConstantLayoutR4(
 
   HloComputation::Builder builder(TestName());
 
-  std::unique_ptr<Literal> literal = Literal::CreateR4FromArray4D(a);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR4FromArray4D(a);
 
   HloInstruction* constant = builder.AddInstruction(
       HloInstruction::CreateConstant(std::move(literal)));
diff --git a/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc b/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc
index b151187c4b..d12a4e7fcd 100644
--- a/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc
+++ b/tensorflow/compiler/xla/tests/cross_replica_sum_test.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -45,7 +45,7 @@ XLA_TEST_F(TrivialCrossReplicaSumTest, OneOperand) {
   })";
   auto module =
       ParseHloString(module_str, GetModuleConfigForTest()).ValueOrDie();
-  auto literal = Literal::CreateR1<float>({1, 2, 3});
+  auto literal = LiteralUtil::CreateR1<float>({1, 2, 3});
   EXPECT_EQ(*literal, *ExecuteAndTransfer(std::move(module), {literal.get()}));
 }
 
@@ -66,10 +66,10 @@ XLA_TEST_F(TrivialCrossReplicaSumTest, MultipleOperands) {
   })";
   auto module =
       ParseHloString(module_str, GetModuleConfigForTest()).ValueOrDie();
-  auto literal0 = Literal::CreateR1<float>({1, 2, 3});
-  auto literal1 = Literal::CreateR1<float>({10, 20});
+  auto literal0 = LiteralUtil::CreateR1<float>({1, 2, 3});
+  auto literal1 = LiteralUtil::CreateR1<float>({10, 20});
   EXPECT_EQ(
-      *Literal::MakeTuple({literal0.get(), literal1.get()}),
+      *LiteralUtil::MakeTuple({literal0.get(), literal1.get()}),
       *ExecuteAndTransfer(std::move(module), {literal0.get(), literal1.get()}));
 }
 
@@ -93,9 +93,9 @@ XLA_TEST_F(TrivialCrossReplicaSumTest, ConstantOperand) {
   })";
   auto module =
       ParseHloString(module_str, GetModuleConfigForTest()).ValueOrDie();
-  auto literal0 = Literal::CreateR1<float>({1, 2, 3});
-  auto literal1 = Literal::CreateR1<float>({10, 20});
-  EXPECT_EQ(*Literal::MakeTuple({literal0.get(), literal1.get()}),
+  auto literal0 = LiteralUtil::CreateR1<float>({1, 2, 3});
+  auto literal1 = LiteralUtil::CreateR1<float>({10, 20});
+  EXPECT_EQ(*LiteralUtil::MakeTuple({literal0.get(), literal1.get()}),
             *ExecuteAndTransfer(std::move(module), {literal0.get()}));
 }
 
diff --git a/tensorflow/compiler/xla/tests/custom_call_test.cc b/tensorflow/compiler/xla/tests/custom_call_test.cc
index d1516a28b0..13c777835e 100644
--- a/tensorflow/compiler/xla/tests/custom_call_test.cc
+++ b/tensorflow/compiler/xla/tests/custom_call_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include <memory>
 #include <utility>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/cpu/custom_call_target_registry.h"
@@ -74,7 +74,7 @@ XLA_TEST_F(CustomCallTest, DISABLED_ON_GPU(CustomCallR0F32Add2)) {
   auto builder = HloComputation::Builder(TestName());
 
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0f)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0f)));
   builder.AddInstruction(
       HloInstruction::CreateCustomCall(r0f32_, {constant}, "R0F32Add2"));
 
@@ -95,7 +95,7 @@ XLA_TEST_F(CustomCallTest, DISABLED_ON_GPU(CustomCallR2F32Reduce)) {
   array(1, 1) = 4.0f;
 
   auto constant = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2FromArray2D(array)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2FromArray2D(array)));
   builder.AddInstruction(
       HloInstruction::CreateCustomCall(r0f32_, {constant}, "R2F32ReduceSum"));
 
@@ -111,7 +111,7 @@ XLA_TEST_F(CustomCallTest,
   auto b = HloComputation::Builder(TestName());
 
   auto input = b.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2FromArray2D(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2FromArray2D(
           Array2D<float>{{1.0f, 2.0f}, {3.0f, 4.0f}})));
   auto incremented = b.AddInstruction(HloInstruction::CreateCustomCall(
       ShapeUtil::MakeShape(F32, {1, 2, 2}), {input}, "Add1ToValues"));
diff --git a/tensorflow/compiler/xla/tests/deallocation_test.cc b/tensorflow/compiler/xla/tests/deallocation_test.cc
index d4b3aac85b..5f234f36a8 100644
--- a/tensorflow/compiler/xla/tests/deallocation_test.cc
+++ b/tensorflow/compiler/xla/tests/deallocation_test.cc
@@ -17,8 +17,8 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
diff --git a/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc b/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
index acba67491d..2db6503afa 100644
--- a/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
+++ b/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
@@ -18,9 +18,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -171,7 +171,7 @@ TEST_F(DeconstructTupleTest, DeconstructNonTuple) {
 XLA_TEST_F(DeconstructTupleTest, DeconstructTupleFromParam) {
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({3.14f, -100.25f});
+      LiteralUtil::CreateR1<float>({3.14f, -100.25f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   auto p = Parameter(&builder, 0, ShapeUtil::MakeShape(F32, {2}), "param0");
diff --git a/tensorflow/compiler/xla/tests/deep_graph_test.cc b/tensorflow/compiler/xla/tests/deep_graph_test.cc
index 810947ab01..3f3e8ab712 100644
--- a/tensorflow/compiler/xla/tests/deep_graph_test.cc
+++ b/tensorflow/compiler/xla/tests/deep_graph_test.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 
 namespace xla {
diff --git a/tensorflow/compiler/xla/tests/dot_operation_test.cc b/tensorflow/compiler/xla/tests/dot_operation_test.cc
index cf2e645d47..0e9e92ed99 100644
--- a/tensorflow/compiler/xla/tests/dot_operation_test.cc
+++ b/tensorflow/compiler/xla/tests/dot_operation_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -67,15 +67,16 @@ XLA_TEST_F(DotOperationTest, DotOfInputTupleElem) {
   XlaOp param;
   auto param_data = CreateParameterAndTransferLiteral(
       0,
-      *Literal::MakeTuple({Literal::CreateR2<float>({{1, 2}, {3, 4}}).get(),
-                           Literal::CreateR2<float>({{5, 6}, {7, 8}}).get()}),
+      *LiteralUtil::MakeTuple(
+          {LiteralUtil::CreateR2<float>({{1, 2}, {3, 4}}).get(),
+           LiteralUtil::CreateR2<float>({{5, 6}, {7, 8}}).get()}),
       "arg0", &builder, &param);
   auto lhs = GetTupleElement(param, 0);
   auto rhs = GetTupleElement(param, 1);
   Dot(lhs, rhs);
 
   ComputeAndCompareLiteral(&builder,
-                           *Literal::CreateR2<float>({{19, 22}, {43, 50}}),
+                           *LiteralUtil::CreateR2<float>({{19, 22}, {43, 50}}),
                            {param_data.get()});
 }
 
@@ -110,7 +111,7 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, TrivialMatrixVectorDot) {
                                         this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, OneElementVectorDot) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, OneElementVectorDot) {
   using T = TypeParam;
   XlaBuilder builder(this->TestName());
   auto lhs = ConstantR1<T>(&builder, {static_cast<T>(2.0f)});
@@ -136,7 +137,7 @@ std::vector<int64> MinorToMajorForIsRowMajor(bool row_major) {
   return {row_major ? 1 : 0, row_major ? 0 : 1};
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x0) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, Dot_0x2_2x0) {
   using T = TypeParam;
   XlaBuilder builder(this->TestName());
   auto lhs = ConstantR2FromArray2D<T>(&builder, Array2D<T>(0, 2));
@@ -147,7 +148,7 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x0) {
                                         this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x3) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, Dot_0x2_2x3) {
   using T = TypeParam;
   XlaBuilder builder(this->TestName());
   auto lhs = ConstantR2FromArray2D<T>(&builder, Array2D<T>(0, 2));
@@ -159,7 +160,7 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x3) {
                                         this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_3x2_2x0) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, Dot_3x2_2x0) {
   using T = TypeParam;
   XlaBuilder builder(this->TestName());
   auto lhs = ConstantR2FromArray2D<T>(
@@ -171,7 +172,7 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_3x2_2x0) {
                                         this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_2x0_0x2) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, Dot_2x0_0x2) {
   using T = TypeParam;
   XlaBuilder builder(this->TestName());
   auto lhs = ConstantR2FromArray2D<T>(&builder, Array2D<T>(2, 0));
@@ -182,7 +183,7 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_2x0_0x2) {
       &builder, Array2D<T>(2, 2, static_cast<T>(0.0f)), {}, this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, FusedDot) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, FusedDot) {
   using T = TypeParam;
   XlaBuilder builder(this->TestName());
   auto param0 =
@@ -194,11 +195,11 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, FusedDot) {
 
   auto lhs_handle =
       this->client_
-          ->TransferToServer(*Literal::CreateR2FromArray2D<T>(
+          ->TransferToServer(*LiteralUtil::CreateR2FromArray2D<T>(
               {{1.0f, 2.0f, 3.0f, 4.0f}, {-1.0f, -2.0f, -3.0f, -4.0f}}))
           .ConsumeValueOrDie();
   auto rhs_handle = this->client_
-                        ->TransferToServer(*Literal::CreateR2FromArray2D<T>(
+                        ->TransferToServer(*LiteralUtil::CreateR2FromArray2D<T>(
                             {{1.0f}, {2.0f}, {3.0f}, {4.0f}}))
                         .ConsumeValueOrDie();
 
@@ -217,14 +218,14 @@ class SquareMatrixDot : public DotOperationTest {
   void TestImpl(bool lhs_row_major, bool rhs_row_major) {
     auto lhs_handle =
         client_
-            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+            ->TransferToServer(*LiteralUtil::CreateFromArrayWithLayout<T>(
                 {{1.0f, 2.0f}, {3.0f, -4.0f}},
                 LayoutUtil::MakeLayout(
                     MinorToMajorForIsRowMajor(lhs_row_major))))
             .ConsumeValueOrDie();
     auto rhs_handle =
         client_
-            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+            ->TransferToServer(*LiteralUtil::CreateFromArrayWithLayout<T>(
                 {{1.0f, 6.0f}, {7.0f, -4.0f}},
                 LayoutUtil::MakeLayout(
                     MinorToMajorForIsRowMajor(rhs_row_major))))
@@ -286,9 +287,10 @@ void ParametricDotTest::TestImpl() {
 
   std::unique_ptr<Array2D<NativeT>> dot_lhs_data =
       MakeLinspaceArray2D<NativeT>(0.0, 1.0, param.m, param.k);
-  std::unique_ptr<Literal> dot_lhs_lit = Literal::CreateR2FromArray2DWithLayout(
-      *dot_lhs_data, LayoutUtil::MakeLayout(
-                         MinorToMajorForIsRowMajor(param.dot_lhs_row_major)));
+  std::unique_ptr<Literal> dot_lhs_lit =
+      LiteralUtil::CreateR2FromArray2DWithLayout(
+          *dot_lhs_data, LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(
+                             param.dot_lhs_row_major)));
   std::unique_ptr<GlobalData> dot_lhs_handle =
       client_->TransferToServer(*dot_lhs_lit).ConsumeValueOrDie();
 
@@ -297,7 +299,7 @@ void ParametricDotTest::TestImpl() {
   Layout rhs_layout = LayoutUtil::MakeLayout(
       MinorToMajorForIsRowMajor(param.dot_rhs_row_major));
   std::unique_ptr<Literal> dot_rhs_lit =
-      Literal::CreateR2FromArray2DWithLayout(*dot_rhs_data, rhs_layout);
+      LiteralUtil::CreateR2FromArray2DWithLayout(*dot_rhs_data, rhs_layout);
   std::unique_ptr<GlobalData> dot_rhs_handle =
       client_->TransferToServer(*dot_rhs_lit).ConsumeValueOrDie();
 
@@ -307,7 +309,7 @@ void ParametricDotTest::TestImpl() {
 
   if (param.has_addend) {
     addend_data = MakeLinspaceArray2D<NativeT>(0.0, 1.0, param.m, param.n);
-    addend_lit = Literal::CreateR2FromArray2DWithLayout(
+    addend_lit = LiteralUtil::CreateR2FromArray2DWithLayout(
         *addend_data, LayoutUtil::MakeLayout(
                           MinorToMajorForIsRowMajor(param.addend_row_major)));
     addend_handle = client_->TransferToServer(*addend_lit).ConsumeValueOrDie();
@@ -476,14 +478,14 @@ class NonsquareMatrixDot : public DotOperationTest {
   void TestImpl(bool lhs_row_major, bool rhs_row_major) {
     auto lhs_handle =
         client_
-            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+            ->TransferToServer(*LiteralUtil::CreateFromArrayWithLayout<T>(
                 {{1.0f, 2.0f, 3.0f}, {3.0f, -4.0f, -1.0f}},
                 LayoutUtil::MakeLayout(
                     MinorToMajorForIsRowMajor(lhs_row_major))))
             .ConsumeValueOrDie();
     auto rhs_handle =
         client_
-            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+            ->TransferToServer(*LiteralUtil::CreateFromArrayWithLayout<T>(
                 {{1.0f, 6.0f}, {2.0f, 3.0f}, {7.0f, -4.0f}},
                 LayoutUtil::MakeLayout(
                     MinorToMajorForIsRowMajor(rhs_row_major))))
@@ -510,12 +512,12 @@ XLA_TYPED_TEST(NonsquareMatrixDot, TestTT) { this->TestImpl(true, true); }
 XLA_TEST_F(DotOperationTest, MatrixVectorC64) {
   auto lhs_handle =
       client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<complex64>(
+          ->TransferToServer(*LiteralUtil::CreateR2WithLayout<complex64>(
               {{1.0, 2.0, 3.0, -4.0}}, LayoutUtil::MakeLayout({1, 0})))
           .ConsumeValueOrDie();
   auto rhs_handle =
       client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<complex64>(
+          ->TransferToServer(*LiteralUtil::CreateR2WithLayout<complex64>(
               {{1.0, 1.0}, {2.0, 2.0}, {3.0, 3.0}, {-4.0, 4.0}},
               LayoutUtil::MakeLayout({1, 0})))
           .ConsumeValueOrDie();
@@ -531,7 +533,7 @@ XLA_TEST_F(DotOperationTest, MatrixVectorC64) {
       &builder, expected, {lhs_handle.get(), rhs_handle.get()}, error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, ConcurrentMatMult) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, ConcurrentMatMult) {
   using T = TypeParam;
 
   XlaBuilder builder(this->TestName());
@@ -583,7 +585,7 @@ XLA_TYPED_TEST(DotOperationTestForBatchMatMul, Types) {
   Reshape(out_flat, {0, 1, 2}, {2, 2, 2, 2});
 
   auto x_data = this->client_
-                    ->TransferToServer(*Literal::CreateR4FromArray4D<T>(
+                    ->TransferToServer(*LiteralUtil::CreateR4FromArray4D<T>(
                         {{{{1000.0f, 100.0f}, {10.0f, 1.0f}},
                           {{2000.0f, 200.0f}, {20.0f, 2.0f}}},
                          {{{3000.0f, 300.0f}, {30.0f, 3.0f}},
@@ -591,7 +593,7 @@ XLA_TYPED_TEST(DotOperationTestForBatchMatMul, Types) {
                     .ConsumeValueOrDie();
   auto y_data =
       this->client_
-          ->TransferToServer(*Literal::CreateR4FromArray4D<T>(
+          ->TransferToServer(*LiteralUtil::CreateR4FromArray4D<T>(
               {{{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}},
                {{{11.0f, 22.0f}, {33.0f, 44.0f}},
                 {{55.0f, 66.0f}, {77.0f, 88.0f}}}}))
@@ -610,7 +612,7 @@ XLA_TYPED_TEST(DotOperationTestForBatchMatMul, Types) {
       {x_data.get(), y_data.get()}, this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, GeneralMatMul) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, GeneralMatMul) {
   using T = TypeParam;
 
   XlaBuilder builder(this->TestName());
@@ -629,13 +631,13 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, GeneralMatMul) {
 
   auto x_data =
       this->client_
-          ->TransferToServer(*Literal::CreateR3FromArray3D<T>(
+          ->TransferToServer(*LiteralUtil::CreateR3FromArray3D<T>(
               {{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}}))
           .ConsumeValueOrDie();
 
   auto y_data =
       this->client_
-          ->TransferToServer(*Literal::CreateR3FromArray3D<T>(
+          ->TransferToServer(*LiteralUtil::CreateR3FromArray3D<T>(
               {{{1.0f, 0.0f}, {0.0f, 1.0f}}, {{1.0f, 0.0f}, {0.0f, 1.0f}}}))
           .ConsumeValueOrDie();
 
@@ -646,7 +648,49 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, GeneralMatMul) {
       {x_data.get(), y_data.get()}, this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64, TransposeFolding) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, GeneralMatMulMultipleBatch) {
+  using T = TypeParam;
+
+  XlaBuilder builder(this->TestName());
+  auto x = Parameter(&builder, 0, ShapeUtil::MakeShapeWithType<T>({2, 2, 2, 2}),
+                     "x");
+  auto y = Parameter(&builder, 1, ShapeUtil::MakeShapeWithType<T>({2, 2, 2, 2}),
+                     "y");
+
+  DotDimensionNumbers dnums;
+  dnums.add_lhs_contracting_dimensions(3);
+  dnums.add_rhs_contracting_dimensions(2);
+  dnums.add_lhs_batch_dimensions(0);
+  dnums.add_lhs_batch_dimensions(1);
+  dnums.add_rhs_batch_dimensions(0);
+  dnums.add_rhs_batch_dimensions(1);
+
+  DotGeneral(x, y, dnums);
+
+  auto x_data =
+      this->client_
+          ->TransferToServer(*LiteralUtil::CreateR4FromArray4D<T>(
+              {{{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}},
+               {{{9.0f, 10.0f}, {11.0f, 12.0f}},
+                {{13.0f, 14.0f}, {15.0f, 16.0f}}}}))
+          .ConsumeValueOrDie();
+
+  auto y_data =
+      this->client_
+          ->TransferToServer(*LiteralUtil::CreateR4FromArray4D<T>(
+              {{{{1.0f, 0.0f}, {0.0f, 1.0f}}, {{1.0f, 0.0f}, {0.0f, 1.0f}}},
+               {{{0.0f, 1.0f}, {1.0f, 0.0f}}, {{0.0f, 1.0f}, {1.0f, 0.0f}}}}))
+          .ConsumeValueOrDie();
+
+  this->template ComputeAndCompareR4<T>(
+      &builder,
+      /*expected=*/
+      {{{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}},
+       {{{10.0f, 9.0f}, {12.0f, 11.0f}}, {{14.0f, 13.0f}, {16.0f, 15.0f}}}},
+      {x_data.get(), y_data.get()}, this->error_spec_);
+}
+
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, TransposeFolding) {
   using T = TypeParam;
   for (bool transpose_lhs : {false, true}) {
     for (bool transpose_rhs : {false, true}) {
@@ -664,15 +708,17 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, TransposeFolding) {
         }
         auto lhs_handle =
             this->client_
-                ->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
-                    *lhs, LayoutUtil::MakeLayout(
-                              MinorToMajorForIsRowMajor(row_major))))
+                ->TransferToServer(
+                    *LiteralUtil::CreateR2FromArray2DWithLayout<T>(
+                        *lhs, LayoutUtil::MakeLayout(
+                                  MinorToMajorForIsRowMajor(row_major))))
                 .ConsumeValueOrDie();
         auto rhs_handle =
             this->client_
-                ->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
-                    *rhs, LayoutUtil::MakeLayout(
-                              MinorToMajorForIsRowMajor(row_major))))
+                ->TransferToServer(
+                    *LiteralUtil::CreateR2FromArray2DWithLayout<T>(
+                        *rhs, LayoutUtil::MakeLayout(
+                                  MinorToMajorForIsRowMajor(row_major))))
                 .ConsumeValueOrDie();
 
         XlaBuilder builder(this->TestName());
@@ -704,7 +750,7 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64, TransposeFolding) {
   }
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64,
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64,
                DotOfConcatOptimizationWithConstLHS) {
   using T = TypeParam;
   auto prim_type = primitive_util::NativeToPrimitiveType<T>();
@@ -733,15 +779,15 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64,
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_0_value,
       this->client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<T>(*arg_0_value_array)));
+          *LiteralUtil::CreateR2FromArray2D<T>(*arg_0_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_1_value,
       this->client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<T>(*arg_1_value_array)));
+          *LiteralUtil::CreateR2FromArray2D<T>(*arg_1_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_2_value,
       this->client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<T>(*arg_2_value_array)));
+          *LiteralUtil::CreateR2FromArray2D<T>(*arg_2_value_array)));
 
   Array2D<T> expected({{53.0f, 74.0f}, {45.0f, 66.0f}});
   this->template ComputeAndCompareR2<T>(
@@ -750,7 +796,7 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64,
       this->error_spec_);
 }
 
-XLA_TYPED_TEST(DotOperationTest_F16F32F64,
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64,
                DotOfConcatOptimizationWithConstRHS) {
   using T = TypeParam;
   std::unique_ptr<Array2D<T>> constant_rhs_array(
@@ -782,15 +828,15 @@ XLA_TYPED_TEST(DotOperationTest_F16F32F64,
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_0_value,
       this->client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<T>(*arg_0_value_array)));
+          *LiteralUtil::CreateR2FromArray2D<T>(*arg_0_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_1_value,
       this->client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<T>(*arg_1_value_array)));
+          *LiteralUtil::CreateR2FromArray2D<T>(*arg_1_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_2_value,
       this->client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<T>(*arg_2_value_array)));
+          *LiteralUtil::CreateR2FromArray2D<T>(*arg_2_value_array)));
 
   Array2D<T> expected({{38.0f, 36.0f}, {93.0f, 91.0f}});
   this->template ComputeAndCompareR2<T>(
diff --git a/tensorflow/compiler/xla/tests/dynamic_ops_test.cc b/tensorflow/compiler/xla/tests/dynamic_ops_test.cc
index f3c258a4d4..7f6f203a1b 100644
--- a/tensorflow/compiler/xla/tests/dynamic_ops_test.cc
+++ b/tensorflow/compiler/xla/tests/dynamic_ops_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
 #include "tensorflow/compiler/xla/service/local_service.h"
@@ -124,11 +124,11 @@ class DynamicSliceTest : public ClientLibraryTestBase {
     // vector<bool> is special so that it cannot be an ArraySlice<bool>, which
     // is what the code below wants. So instead we do this.
     Literal input_values =
-        std::move(*Literal::CreateR1(input_values_int)
+        std::move(*LiteralUtil::CreateR1(input_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal expected_values =
-        std::move(*Literal::CreateR1(expected_values_int)
+        std::move(*LiteralUtil::CreateR1(expected_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
 
@@ -150,11 +150,11 @@ class DynamicSliceTest : public ClientLibraryTestBase {
              const std::vector<int64>& slice_sizes,
              const Array2D<int>& expected_values_int) {
     Literal input_values =
-        std::move(*Literal::CreateR2FromArray2D(input_values_int)
+        std::move(*LiteralUtil::CreateR2FromArray2D(input_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal expected_values =
-        std::move(*Literal::CreateR2FromArray2D(expected_values_int)
+        std::move(*LiteralUtil::CreateR2FromArray2D(expected_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
 
@@ -176,11 +176,11 @@ class DynamicSliceTest : public ClientLibraryTestBase {
              const std::vector<int64>& slice_sizes,
              const Array3D<int>& expected_values_int) {
     Literal input_values =
-        std::move(*Literal::CreateR3FromArray3D(input_values_int)
+        std::move(*LiteralUtil::CreateR3FromArray3D(input_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal expected_values =
-        std::move(*Literal::CreateR3FromArray3D(expected_values_int)
+        std::move(*LiteralUtil::CreateR3FromArray3D(expected_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
 
@@ -202,18 +202,28 @@ XLA_TEST_F(DynamicSliceTest, Int32R1) { TestR1<int32, int32>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R1OOB) { TestR1OOB<int32, int32>(); }
 XLA_TEST_F(DynamicSliceTest, Int64R1) { TestR1<int64, float>(); }
 XLA_TEST_F(DynamicSliceTest, UInt64R1) { TestR1<uint64, float>(); }
+XLA_TEST_F(DynamicSliceTest, UInt32R1OOB) {
+  RunR1<uint32, int32>({0, 1, 2, 3, 4}, {2147483648u}, {2}, {3, 4});
+}
 
 XLA_TEST_F(DynamicSliceTest, Int32R2BF16) { TestR2<int32, bfloat16>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R2) { TestR2<int32, int32>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R2OOB) { TestR2OOB<int32, int32>(); }
 XLA_TEST_F(DynamicSliceTest, Int64R2) { TestR2<int64, float>(); }
 XLA_TEST_F(DynamicSliceTest, UInt64R2) { TestR2<uint64, int32>(); }
+XLA_TEST_F(DynamicSliceTest, UInt32R2OOB) {
+  RunR2<uint32, int32>({{0, 1}, {2, 3}}, {2147483648u, 0}, {1, 1}, {{2}});
+}
 
 XLA_TEST_F(DynamicSliceTest, Int32R3BF16) { TestR3<int32, bfloat16>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R3) { TestR3<int32, float>(); }
 XLA_TEST_F(DynamicSliceTest, Int32R3OOB) { TestR3OOB<int32, float>(); }
 XLA_TEST_F(DynamicSliceTest, Int64R3) { TestR3<int64, float>(); }
 XLA_TEST_F(DynamicSliceTest, UInt64R3) { TestR3<uint64, float>(); }
+XLA_TEST_F(DynamicSliceTest, UInt32R3OOB) {
+  RunR3<uint32, int32>({{{0, 1}, {2, 3}}, {{4, 5}, {6, 7}}},
+                       {2147483648u, 0, 2147483648u}, {1, 1, 1}, {{{5}}});
+}
 
 XLA_TEST_F(DynamicSliceTest, Int32R1Pred) {
   // Slice at dimension start.
@@ -349,15 +359,15 @@ class DynamicUpdateSliceTest : public ClientLibraryTestBase {
   void RunR0(int input_value_int, int update_value_int,
              const std::vector<IndexT> slice_starts, int expected_value_int) {
     Literal input_value =
-        std::move(*Literal::CreateR0(input_value_int)
+        std::move(*LiteralUtil::CreateR0(input_value_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal update_value =
-        std::move(*Literal::CreateR0(update_value_int)
+        std::move(*LiteralUtil::CreateR0(update_value_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal expected_value =
-        std::move(*Literal::CreateR0(expected_value_int)
+        std::move(*LiteralUtil::CreateR0(expected_value_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
 
@@ -380,15 +390,15 @@ class DynamicUpdateSliceTest : public ClientLibraryTestBase {
              const std::vector<IndexT> slice_starts,
              tensorflow::gtl::ArraySlice<int> expected_values_int) {
     Literal input_values =
-        std::move(*Literal::CreateR1(input_values_int)
+        std::move(*LiteralUtil::CreateR1(input_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal update_values =
-        std::move(*Literal::CreateR1(update_values_int)
+        std::move(*LiteralUtil::CreateR1(update_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal expected_values =
-        std::move(*Literal::CreateR1(expected_values_int)
+        std::move(*LiteralUtil::CreateR1(expected_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
 
@@ -411,15 +421,15 @@ class DynamicUpdateSliceTest : public ClientLibraryTestBase {
              const std::vector<IndexT> slice_starts,
              const Array2D<int>& expected_values_int) {
     Literal input_values =
-        std::move(*Literal::CreateR2FromArray2D(input_values_int)
+        std::move(*LiteralUtil::CreateR2FromArray2D(input_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal update_values =
-        std::move(*Literal::CreateR2FromArray2D(update_values_int)
+        std::move(*LiteralUtil::CreateR2FromArray2D(update_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal expected_values =
-        std::move(*Literal::CreateR2FromArray2D(expected_values_int)
+        std::move(*LiteralUtil::CreateR2FromArray2D(expected_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
 
@@ -442,15 +452,15 @@ class DynamicUpdateSliceTest : public ClientLibraryTestBase {
              const std::vector<IndexT> slice_starts,
              const Array3D<int>& expected_values_int) {
     Literal input_values =
-        std::move(*Literal::CreateR3FromArray3D(input_values_int)
+        std::move(*LiteralUtil::CreateR3FromArray3D(input_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal update_values =
-        std::move(*Literal::CreateR3FromArray3D(update_values_int)
+        std::move(*LiteralUtil::CreateR3FromArray3D(update_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
     Literal expected_values =
-        std::move(*Literal::CreateR3FromArray3D(expected_values_int)
+        std::move(*LiteralUtil::CreateR3FromArray3D(expected_values_int)
                        ->Convert(primitive_util::NativeToPrimitiveType<DataT>())
                        .ValueOrDie());
 
@@ -520,7 +530,7 @@ class DynamicUpdateSliceTest : public ClientLibraryTestBase {
   template <typename NativeT>
   void DumpArray(const string& name, const Array3D<NativeT> values) {
     std::unique_ptr<Literal> literal =
-        Literal::CreateR3FromArray3D<NativeT>(values);
+        LiteralUtil::CreateR3FromArray3D<NativeT>(values);
     LOG(INFO) << name << ":" << literal->ToString();
   }
 };
@@ -530,21 +540,32 @@ XLA_TEST_F(DynamicUpdateSliceTest, Int32R0) { TestR0<int32, float>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int64R0) { TestR0<int64, float>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, UInt64R0) { TestR0<uint64, float>(); }
 
-// TODO(b/71820067): The CPU parallel backend failed for this on 2018-01-10.
 XLA_TEST_F(DynamicUpdateSliceTest, Int32R1BF16) { TestR1<int32, bfloat16>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int32R1) { TestR1<int32, float>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int64R1) { TestR1<int64, float>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, UInt64R1) { TestR1<uint64, float>(); }
+XLA_TEST_F(DynamicUpdateSliceTest, UInt32R1OOB) {
+  RunR1<uint32, int32>({0, 1, 2, 3, 4}, {5, 6}, {2147483648u}, {0, 1, 2, 5, 6});
+}
 
 XLA_TEST_F(DynamicUpdateSliceTest, Int32R2BF16) { TestR2<int32, bfloat16>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int32R2) { TestR2<int32, float>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int64R2) { TestR2<int64, int64>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, UInt64R2) { TestR2<uint64, int32>(); }
+XLA_TEST_F(DynamicUpdateSliceTest, UInt32R2OOB) {
+  RunR2<uint32, int32>({{0, 1}, {2, 3}}, {{4}}, {2147483648u, 0},
+                       {{0, 1}, {4, 3}});
+}
 
 XLA_TEST_F(DynamicUpdateSliceTest, Int32R3BF16) { TestR3<int32, bfloat16>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int32R3) { TestR3<int32, float>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int64R3) { TestR3<int64, int64>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, UInt64R3) { TestR3<uint64, uint64>(); }
+XLA_TEST_F(DynamicUpdateSliceTest, UInt32R3OOB) {
+  RunR3<uint32, int32>({{{0, 1}, {2, 3}}, {{4, 5}, {6, 7}}}, {{{8}}},
+                       {2147483648u, 0, 2147483648u},
+                       {{{0, 1}, {2, 3}}, {{4, 8}, {6, 7}}});
+}
 
 XLA_TEST_F(DynamicUpdateSliceTest, Int32OOBBF16) { TestOOB<int32, bfloat16>(); }
 XLA_TEST_F(DynamicUpdateSliceTest, Int32OOB) { TestOOB<int32, float>(); }
@@ -695,7 +716,7 @@ void BM_DynamicSlice(int num_iters) {
   XlaBuilder builder("DynamicSlice");
 
   // Create input as a constant: shape [1, 2, 3, 4]
-  auto input_literal = Literal::CreateR4(
+  auto input_literal = LiteralUtil::CreateR4(
       {{{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}},
         {{13, 14, 15, 16}, {17, 18, 19, 20}, {21, 22, 23, 24}}}});
   auto input = ConstantLiteral(&builder, *input_literal);
@@ -715,7 +736,7 @@ void BM_DynamicSlice(int num_iters) {
                         start_indices_shape, &allocator, /*device_ordinal=*/0)
                     .ConsumeValueOrDie();
 
-  auto start_indices_literal = Literal::CreateR1<int32>({0, 1, 2, 3});
+  auto start_indices_literal = LiteralUtil::CreateR1<int32>({0, 1, 2, 3});
   auto stream =
       client->mutable_backend()->BorrowStream(device_ordinal).ValueOrDie();
   ASSERT_IS_OK(transfer_manager->TransferLiteralToDevice(
diff --git a/tensorflow/compiler/xla/tests/execution_profile_test.cc b/tensorflow/compiler/xla/tests/execution_profile_test.cc
index ddc6a7db18..5116e60ca6 100644
--- a/tensorflow/compiler/xla/tests/execution_profile_test.cc
+++ b/tensorflow/compiler/xla/tests/execution_profile_test.cc
@@ -14,8 +14,8 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/client/global_data.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
 #include "tensorflow/core/platform/test.h"
@@ -31,7 +31,7 @@ XLA_TEST_F(ExecutionProfileTest, ExecuteWithExecutionProfile) {
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> input,
       client_->TransferToServer(
-          *Literal::CreateR2F32Linspace(1e0, 1e5, 256, 256)));
+          *LiteralUtil::CreateR2F32Linspace(1e0, 1e5, 256, 256)));
 
   XlaBuilder b(TestName() + ".add");
   Dot(Parameter(&b, 0, shape, "param_0"), Parameter(&b, 1, shape, "param_1"));
diff --git a/tensorflow/compiler/xla/tests/exhaustive_f32_elementwise_op_test.cc b/tensorflow/compiler/xla/tests/exhaustive_f32_elementwise_op_test.cc
index 74cf8b213e..bf1de02ba9 100644
--- a/tensorflow/compiler/xla/tests/exhaustive_f32_elementwise_op_test.cc
+++ b/tensorflow/compiler/xla/tests/exhaustive_f32_elementwise_op_test.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
@@ -39,7 +39,7 @@ class ExhaustiveF32ElementwiseOpTest
     XlaBuilder builder(TestName());
 
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateFromDimensions(F32, {input_size});
+        LiteralUtil::CreateFromDimensions(F32, {input_size});
     for (int64 i = begin; i < end; i++) {
       if (i >= known_incorrect_range.first &&
           i < known_incorrect_range.second) {
diff --git a/tensorflow/compiler/xla/tests/filecheck.cc b/tensorflow/compiler/xla/tests/filecheck.cc
index 93d1c921c4..dcb469087e 100644
--- a/tensorflow/compiler/xla/tests/filecheck.cc
+++ b/tensorflow/compiler/xla/tests/filecheck.cc
@@ -76,6 +76,11 @@ StatusOr<bool> RunFileCheck(const string& input, const string& pattern) {
     XLA_LOG_LINES(tensorflow::WARNING, input);
     LOG(WARNING) << "FileCheck pattern was:";
     XLA_LOG_LINES(tensorflow::WARNING, pattern);
+  } else if (!standard_error.empty()) {
+    LOG(INFO) << "FileCheck stderr:";
+    XLA_LOG_LINES(tensorflow::INFO, standard_error);
+    LOG(INFO) << "FileCheck input was:";
+    XLA_LOG_LINES(tensorflow::INFO, input);
   }
   return succeeded;
 }
diff --git a/tensorflow/compiler/xla/tests/floor_ceil_test.cc b/tensorflow/compiler/xla/tests/floor_ceil_test.cc
index 30dc639f11..39cc6c5927 100644
--- a/tensorflow/compiler/xla/tests/floor_ceil_test.cc
+++ b/tensorflow/compiler/xla/tests/floor_ceil_test.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include <string>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
diff --git a/tensorflow/compiler/xla/tests/fmax_test.cc b/tensorflow/compiler/xla/tests/fmax_test.cc
index 0254ae1baa..c5bbbe778d 100644
--- a/tensorflow/compiler/xla/tests/fmax_test.cc
+++ b/tensorflow/compiler/xla/tests/fmax_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/core/platform/test.h"
diff --git a/tensorflow/compiler/xla/tests/fusion_test.cc b/tensorflow/compiler/xla/tests/fusion_test.cc
index ab470f16a3..792be0d3fc 100644
--- a/tensorflow/compiler/xla/tests/fusion_test.cc
+++ b/tensorflow/compiler/xla/tests/fusion_test.cc
@@ -25,14 +25,15 @@ limitations under the License.
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@@ -89,7 +90,7 @@ class FusionTest : public HloTestBase {
     HloInstruction* hlos[4];
     for (int i = 0; i < Arity; ++i) {
       hlos[i + 1] = builder.AddInstruction(HloInstruction::CreateConstant(
-          Literal::CreateR2FromArray2D(operand_data[i])));
+          LiteralUtil::CreateR2FromArray2D(operand_data[i])));
     }
     auto answer_shape =
         ShapeUtil::MakeShape(prim_type, {test_width, test_height});
@@ -115,7 +116,7 @@ class FusionTest : public HloTestBase {
             ArraySlice<HloInstruction*>(hlos, 0, Arity + 1),
             HloInstruction::FusionKind::kLoop);
 
-    auto expected = Literal::CreateR2FromArray2D(answer_data);
+    auto expected = LiteralUtil::CreateR2FromArray2D(answer_data);
     auto actual = ExecuteAndTransfer(std::move(hlo_module), {});
     if (primitive_util::IsFloatingPointType(prim_type)) {
       EXPECT_TRUE(LiteralTestUtil::Near(*expected, *actual, ErrorSpec(1e-4)));
@@ -186,27 +187,28 @@ XLA_TEST_F(FusionTest, Test) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0}, {2.0}, {3.0}})));
+      LiteralUtil::CreateR2<float>({{1.0}, {2.0}, {3.0}})));
   auto const1 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{-1.0}, {-1.0}, {-1.0}})));
+      LiteralUtil::CreateR2<float>({{-1.0}, {-1.0}, {-1.0}})));
   auto add2 = builder.AddInstruction(HloInstruction::CreateBinary(
       ShapeUtil::MakeShape(F32, {3, 1}), HloOpcode::kAdd, const0, const1));
   auto reshape3 = builder.AddInstruction(HloInstruction::CreateTranspose(
       ShapeUtil::MakeShape(F32, {1, 3}), add2, {1, 0}));
   auto const4 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.62, 2.72, 3.14}})));
+      LiteralUtil::CreateR2<float>({{1.62, 2.72, 3.14}})));
   auto concat5 = builder.AddInstruction(HloInstruction::CreateConcatenate(
       ShapeUtil::MakeShape(F32, {2, 3}), {reshape3, const4}, 0));
   auto const6 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 1.0, 1.0}, {0.0, 0.0, 0.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 1.0, 1.0}, {0.0, 0.0, 0.0}})));
   auto negate7 = builder.AddInstruction(HloInstruction::CreateUnary(
       ShapeUtil::MakeShape(F32, {2, 3}), HloOpcode::kNegate, const6));
   auto add8 = builder.AddInstruction(HloInstruction::CreateBinary(
       ShapeUtil::MakeShape(F32, {2, 3}), HloOpcode::kAdd, concat5, negate7));
   auto const9 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{0.5, 0.5, 0.5}, {0.5, 0.5, 0.5}})));
-  auto const10 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<bool>({{true, false, true}, {false, true, false}})));
+      LiteralUtil::CreateR2<float>({{0.5, 0.5, 0.5}, {0.5, 0.5, 0.5}})));
+  auto const10 = builder.AddInstruction(
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2<bool>(
+          {{true, false, true}, {false, true, false}})));
   auto select11 = builder.AddInstruction(
       HloInstruction::CreateTernary(ShapeUtil::MakeShape(F32, {2, 3}),
                                     HloOpcode::kSelect, const10, add8, const9));
@@ -222,7 +224,7 @@ XLA_TEST_F(FusionTest, Test) {
           HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR2<float>({{0.5}, {2.72}}),
+      *LiteralUtil::CreateR2<float>({{0.5}, {2.72}}),
       *ExecuteAndTransfer(std::move(hlo_module), {}), ErrorSpec(1e-4)));
 }
 
@@ -233,11 +235,11 @@ XLA_TEST_F(FusionTest, Parameter) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{1.0, 2.0, 3.0}})));
+      LiteralUtil::CreateR2<float>({{1.0, 2.0, 3.0}})));
   auto copy1 = builder.AddInstruction(HloInstruction::CreateUnary(
       ShapeUtil::MakeShape(F32, {1, 3}), HloOpcode::kCopy, const0));
   auto const2 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{-2.0, -2.0, -2.0}})));
+      LiteralUtil::CreateR2<float>({{-2.0, -2.0, -2.0}})));
   // add3 = copy1 + const2 = const0 + const2 = {1,2,3} + {-2,-2,-2} = {-1,0,+1}
   auto add3 = builder.AddInstruction(HloInstruction::CreateBinary(
       ShapeUtil::MakeShape(F32, {1, 3}), HloOpcode::kAdd, copy1, const2));
@@ -248,7 +250,7 @@ XLA_TEST_F(FusionTest, Parameter) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR2<float>({{-1.0, 0.0, 1.0}}),
+      *LiteralUtil::CreateR2<float>({{-1.0, 0.0, 1.0}}),
       *ExecuteAndTransfer(std::move(hlo_module), {}), ErrorSpec(1e-4)));
 }
 
@@ -269,7 +271,7 @@ XLA_TEST_F(FusionTest, RandomizedParallelPartition) {
   auto hlo_module = CreateNewModule();
 
   auto two = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<float>(2.0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(2.0)));
   auto x =
       builder.AddInstruction(HloInstruction::CreateBroadcast(shape, two, {}));
   auto y = builder.AddInstruction(
@@ -292,9 +294,9 @@ XLA_TEST_F(FusionTest, BroadcastIntoBinaryOp) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const_vector = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR1<float>({1.0, 2.0, 3.0})));
+      LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0})));
   auto const_array = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<float>({{-1.0, -2.0, -4.0}, {10.0, 20.0, 30.0}})));
+      LiteralUtil::CreateR2<float>({{-1.0, -2.0, -4.0}, {10.0, 20.0, 30.0}})));
   auto broadcast = builder.AddInstruction(
       HloInstruction::CreateBroadcast(const_array->shape(), const_vector, {1}));
   // add2 = broadcast(const_vector) + const_array
@@ -308,7 +310,7 @@ XLA_TEST_F(FusionTest, BroadcastIntoBinaryOp) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(LiteralTestUtil::Near(
-      *Literal::CreateR2<float>({{0.0, 0.0, -1.0}, {11.0, 22.0, 33.0}}),
+      *LiteralUtil::CreateR2<float>({{0.0, 0.0, -1.0}, {11.0, 22.0, 33.0}}),
       *ExecuteAndTransfer(std::move(hlo_module), {}), ErrorSpec(1e-4)));
 }
 
@@ -316,14 +318,14 @@ XLA_TEST_F(FusionTest, ReshapeToScalar) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto single_element_array = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR2<int32>({{5}})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR2<int32>({{5}})));
   auto reshape = builder.AddInstruction(HloInstruction::CreateReshape(
       ShapeUtil::MakeShape(S32, {}), single_element_array));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR0<int32>(5),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR0<int32>(5),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -331,14 +333,14 @@ XLA_TEST_F(FusionTest, Reshape_3by2_1by2by3) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}})));
+      LiteralUtil::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}})));
   auto reshape1 = builder.AddInstruction(HloInstruction::CreateReshape(
       ShapeUtil::MakeShape(S32, {1, 2, 3}), const0));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR3<int32>({{{1, 2, 3}, {4, 5, 6}}}),
+      *LiteralUtil::CreateR3<int32>({{{1, 2, 3}, {4, 5, 6}}}),
       *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -346,14 +348,14 @@ XLA_TEST_F(FusionTest, Reshape_1by2by3_3by2) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR3<int32>({{{1, 2, 3}, {4, 5, 6}}})));
+      LiteralUtil::CreateR3<int32>({{{1, 2, 3}, {4, 5, 6}}})));
   auto reshape1 = builder.AddInstruction(
       HloInstruction::CreateReshape(ShapeUtil::MakeShape(S32, {3, 2}), const0));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}}),
+      *LiteralUtil::CreateR2<int32>({{1, 2}, {3, 4}, {5, 6}}),
       *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -361,14 +363,14 @@ XLA_TEST_F(FusionTest, Reshape_1by1by1_) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR3<int32>({{{7}}})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR3<int32>({{{7}}})));
   auto reshape1 = builder.AddInstruction(
       HloInstruction::CreateReshape(ShapeUtil::MakeShape(S32, {}), const0));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR0<int32>(7),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR0<int32>(7),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -376,14 +378,14 @@ XLA_TEST_F(FusionTest, Reshape__1by1by1) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(7)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(7)));
   auto reshape1 = builder.AddInstruction(HloInstruction::CreateReshape(
       ShapeUtil::MakeShape(S32, {1, 1, 1}), const0));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR3<int32>({{{7}}}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR3<int32>({{{7}}}),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -391,14 +393,14 @@ XLA_TEST_F(FusionTest, Reshape__) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(7)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(7)));
   auto reshape1 = builder.AddInstruction(
       HloInstruction::CreateReshape(ShapeUtil::MakeShape(S32, {}), const0));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR0<int32>(7),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR0<int32>(7),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -406,14 +408,14 @@ XLA_TEST_F(FusionTest, Reshape_3by3_3by3) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}})));
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}})));
   auto reshape1 = builder.AddInstruction(
       HloInstruction::CreateReshape(ShapeUtil::MakeShape(S32, {3, 3}), const0));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}),
+      *LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}),
       *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -421,14 +423,14 @@ XLA_TEST_F(FusionTest, Transpose_2by3) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}})));
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}})));
   auto reshape1 = builder.AddInstruction(HloInstruction::CreateTranspose(
       ShapeUtil::MakeShape(S32, {3, 2}), const0, {1, 0}));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR2<int32>({{1, 4}, {2, 5}, {3, 6}}),
+      *LiteralUtil::CreateR2<int32>({{1, 4}, {2, 5}, {3, 6}}),
       *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -436,14 +438,14 @@ XLA_TEST_F(FusionTest, Transpose_3by3) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}})));
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}})));
   auto reshape1 = builder.AddInstruction(HloInstruction::CreateTranspose(
       ShapeUtil::MakeShape(S32, {3, 3}), const0, {1, 0}));
   hlo_module->AddEntryComputation(builder.Build())
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{reshape1},
                                 HloInstruction::FusionKind::kLoop);
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR2<int32>({{1, 4, 7}, {2, 5, 8}, {3, 6, 9}}),
+      *LiteralUtil::CreateR2<int32>({{1, 4, 7}, {2, 5, 8}, {3, 6, 9}}),
       *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -451,7 +453,7 @@ XLA_TEST_F(FusionTest, Reverse) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 3})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({1, 2, 3})));
   auto reverse1 = builder.AddInstruction(HloInstruction::CreateReverse(
       ShapeUtil::MakeShape(S32, {3}), const0, {0}));
   hlo_module->AddEntryComputation(builder.Build())
@@ -459,7 +461,7 @@ XLA_TEST_F(FusionTest, Reverse) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR1<int32>({3, 2, 1}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR1<int32>({3, 2, 1}),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -467,7 +469,7 @@ XLA_TEST_F(FusionTest, ReverseNegate) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 3})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({1, 2, 3})));
   auto reverse1 = builder.AddInstruction(HloInstruction::CreateReverse(
       ShapeUtil::MakeShape(S32, {3}), const0, {0}));
   auto negate2 = builder.AddInstruction(HloInstruction::CreateUnary(
@@ -477,7 +479,7 @@ XLA_TEST_F(FusionTest, ReverseNegate) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR1<int32>({-3, -2, -1}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR1<int32>({-3, -2, -1}),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -485,7 +487,7 @@ XLA_TEST_F(FusionTest, BroadcastNegate) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
   auto broadcast1 = builder.AddInstruction(HloInstruction::CreateBroadcast(
       ShapeUtil::MakeShape(S32, {2}), const0, {}));
   auto negate2 = builder.AddInstruction(HloInstruction::CreateUnary(
@@ -495,15 +497,15 @@ XLA_TEST_F(FusionTest, BroadcastNegate) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR1<int32>({-1, -1}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR1<int32>({-1, -1}),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
 XLA_TEST_F(FusionTest, SliceNegate) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
-  auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 3, 4})));
+  auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<int32>({1, 2, 3, 4})));
   auto slice1 = builder.AddInstruction(HloInstruction::CreateSlice(
       ShapeUtil::MakeShape(S32, {2}), const0, {0}, {4}, {2}));
   auto negate2 = builder.AddInstruction(HloInstruction::CreateUnary(
@@ -513,17 +515,17 @@ XLA_TEST_F(FusionTest, SliceNegate) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR1<int32>({-1, -3}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR1<int32>({-1, -3}),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
 XLA_TEST_F(FusionTest, DynamicSliceNegate) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
-  auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 3, 4})));
+  auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<int32>({1, 2, 3, 4})));
   auto const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({1})));
   auto dynamic_slice2 =
       builder.AddInstruction(HloInstruction::CreateDynamicSlice(
           ShapeUtil::MakeShape(S32, {2}), const0, const1, {2}));
@@ -535,15 +537,15 @@ XLA_TEST_F(FusionTest, DynamicSliceNegate) {
           HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR1<int32>({-2, -3}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR1<int32>({-2, -3}),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
 XLA_TEST_F(FusionTest, ReshapeNegate) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
-  auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 3, 4})));
+  auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<int32>({1, 2, 3, 4})));
   auto reshape1 = builder.AddInstruction(
       HloInstruction::CreateReshape(ShapeUtil::MakeShape(S32, {2, 2}), const0));
   auto negate2 = builder.AddInstruction(HloInstruction::CreateUnary(
@@ -552,16 +554,16 @@ XLA_TEST_F(FusionTest, ReshapeNegate) {
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{negate2, reshape1},
                                 HloInstruction::FusionKind::kLoop);
 
-  EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{-1, -2}, {-3, -4}}),
-                             *ExecuteAndTransfer(std::move(hlo_module), {})));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *LiteralUtil::CreateR2<int32>({{-1, -2}, {-3, -4}}),
+      *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
 XLA_TEST_F(FusionTest, TransposeNegate) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int32>({{1, 2}, {3, 4}})));
+      LiteralUtil::CreateR2<int32>({{1, 2}, {3, 4}})));
   auto transpose1 = builder.AddInstruction(HloInstruction::CreateTranspose(
       ShapeUtil::MakeShape(S32, {2, 2}), const0, {1, 0}));
   auto negate2 = builder.AddInstruction(HloInstruction::CreateUnary(
@@ -570,9 +572,9 @@ XLA_TEST_F(FusionTest, TransposeNegate) {
       ->CreateFusionInstruction(/*instructions_to_fuse=*/{negate2, transpose1},
                                 HloInstruction::FusionKind::kLoop);
 
-  EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR2<int32>({{-1, -3}, {-2, -4}}),
-                             *ExecuteAndTransfer(std::move(hlo_module), {})));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *LiteralUtil::CreateR2<int32>({{-1, -3}, {-2, -4}}),
+      *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
 std::unique_ptr<HloComputation> MakeReduceTestComputation() {
@@ -590,10 +592,10 @@ XLA_TEST_F(FusionTest, DISABLED_ON_CPU(Reduce)) {
   auto hlo_module = CreateNewModule();
 
   auto builder = HloComputation::Builder(TestName());
-  auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 4, 8})));
+  auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<int32>({1, 2, 4, 8})));
   auto const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(0)));
   auto reduce2 = builder.AddInstruction(HloInstruction::CreateReduce(
       ShapeUtil::MakeShape(S32, {}), const0, const1, {0},
       hlo_module->AddEmbeddedComputation(MakeReduceTestComputation())));
@@ -602,7 +604,7 @@ XLA_TEST_F(FusionTest, DISABLED_ON_CPU(Reduce)) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR0<int32>(15),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR0<int32>(15),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -610,10 +612,10 @@ XLA_TEST_F(FusionTest, DISABLED_ON_CPU(ReduceImplicitBroadcast)) {
   auto hlo_module = CreateNewModule();
 
   auto builder = HloComputation::Builder(TestName());
-  auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({1, 2, 4, 8})));
+  auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
+      LiteralUtil::CreateR1<int32>({1, 2, 4, 8})));
   auto const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(0)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(0)));
   auto reduce2 = builder.AddInstruction(HloInstruction::CreateReduce(
       ShapeUtil::MakeShape(S32, {}), const0, const1, {0},
       hlo_module->AddEmbeddedComputation(MakeReduceTestComputation())));
@@ -624,7 +626,7 @@ XLA_TEST_F(FusionTest, DISABLED_ON_CPU(ReduceImplicitBroadcast)) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR0<int32>(-15),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR0<int32>(-15),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -632,9 +634,9 @@ XLA_TEST_F(FusionTest, DISABLED_ON_CPU(ReduceWindow)) {
   auto builder = HloComputation::Builder(TestName());
   auto hlo_module = CreateNewModule();
   auto const0 = builder.AddInstruction(HloInstruction::CreateConstant(
-      Literal::CreateR2<int32>({{2, 3, 5}, {7, 11, 13}, {17, 19, 23}})));
+      LiteralUtil::CreateR2<int32>({{2, 3, 5}, {7, 11, 13}, {17, 19, 23}})));
   auto const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(1)));
   Window window;
   ASSERT_TRUE(
       tensorflow::protobuf::TextFormat::ParseFromString("dimensions:{\n"
@@ -674,7 +676,7 @@ XLA_TEST_F(FusionTest, DISABLED_ON_CPU(ReduceWindow)) {
                                 HloInstruction::FusionKind::kLoop);
 
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::CreateR2<int32>({{462, 2145}, {24871, 62491}}),
+      *LiteralUtil::CreateR2<int32>({{462, 2145}, {24871, 62491}}),
       *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -686,9 +688,9 @@ XLA_TEST_F(FusionTest, SharedConstant) {
 
   auto builder = HloComputation::Builder(TestName());
   auto const0 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({0})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({0})));
   auto const1 = builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR1<int32>({2})));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR1<int32>({2})));
   auto add1 = builder.AddInstruction(HloInstruction::CreateBinary(
       ShapeUtil::MakeShape(S32, {1}), HloOpcode::kAdd, const1, const0));
   auto add2 = builder.AddInstruction(HloInstruction::CreateBinary(
@@ -710,7 +712,7 @@ XLA_TEST_F(FusionTest, SharedConstant) {
   EXPECT_EQ(entry_comp->root_instruction()->fused_instruction_count(), 6);
 
   EXPECT_TRUE(
-      LiteralTestUtil::Equal(*Literal::CreateR1<int32>({8}),
+      LiteralTestUtil::Equal(*LiteralUtil::CreateR1<int32>({8}),
                              *ExecuteAndTransfer(std::move(hlo_module), {})));
 }
 
@@ -764,6 +766,79 @@ XLA_TEST_F(FusionTest, Clamp2D) {
   TestElementwise2D<float, 3>(HloOpcode::kClamp);
 }
 
+// TODO(b/73903144): Enable on interpreter once interpreter supports bitcast.
+XLA_TEST_F(FusionTest, DISABLED_ON_INTERPRETER(FusionWithLayout)) {
+  const string hlo_text = R"(
+HloModule Cluster
+
+fusion_c {
+  fusion.arg = f32[2,2]{1,0} parameter(0)
+  bitcast.0 = f32[2,2,1]{2,1,0} bitcast(fusion.arg)
+  tanh.0 = f32[2,2,1]{0,2,1} tanh(bitcast.0)
+  ROOT bitcast.2 = f32[2,2,1]{1,2,0} bitcast(tanh.0)
+}
+
+ENTRY main {
+  arg = f32[2,2]{1,0} parameter(0)
+  ROOT fusion = f32[2,2,1]{1,2,0} fusion(arg), kind=kLoop, calls=fusion_c
+}
+)";
+
+  std::unique_ptr<Literal> operand =
+      LiteralUtil::CreateR2<float>({{0., 0.}, {1., 0.}});
+  HloModuleConfig config;
+  config.set_debug_options(GetDebugOptionsForTest());
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(hlo_text, config));
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<Literal> result,
+      test_runner_.Execute(std::move(module), {operand.get()},
+                           /*run_hlo_passes=*/false));
+  EXPECT_TRUE(LiteralTestUtil::Equal(
+      *LiteralUtil::CreateR3<float>({{{0.}, {0.76159415595}}, {{0.}, {0.}}}),
+      *result));
+}
+
+class FusionClientLibraryTest : public ClientLibraryTestBase {};
+
+XLA_TEST_F(FusionClientLibraryTest, ManyLayoutTransformations) {
+  // On the GPU backend, it's possible to have too many transposes within one
+  // fusion, causing the kernel to run out shared memory and thus not compile.
+  // We want to check that doesn't happen.
+  //
+  // To do this, we create a computation that computes
+  //
+  //   P0 + P0*P1*P1 + P0*P2*P2 ...
+  //
+  // where even parameters have layout 1 and odd parameters have layout 2.
+  //
+  // Our goal is to tempt the backend into creating one giant multi-output
+  // fusion for the whole computation, including the transposes.  Currently
+  // multi-output fusion only fuses fusions, so each of the terms in the sum
+  // needs to be a fusion itself, thus the contortions above.
+  constexpr int kNumParams = 25;
+  XlaBuilder b("ManyLayoutTransformations");
+
+  // This test produces values that overflow int32, which is UB, so use uint32,
+  // where overflow is OK.
+  Array2D<uint32> arr(32, 32);
+  arr.FillUnique();
+  std::unique_ptr<Literal> l1 = LiteralUtil::CreateR2FromArray2D(arr)->Relayout(
+      LayoutUtil::MakeLayout({0, 1}));
+
+  std::unique_ptr<Literal> l2 = LiteralUtil::CreateR2FromArray2D(arr)->Relayout(
+      LayoutUtil::MakeLayout({1, 0}));
+
+  XlaOp p0 = AddParam(*l1, &b);
+  XlaOp sum = p0;
+  for (int i = 1; i < kNumParams; ++i) {
+    auto pN = AddParam((i % 2 == 0 ? *l1 : *l2), &b);
+    sum = sum + p0 * pN * pN;
+  }
+
+  ComputeAndCompare(&b, {});
+}
+
 void BM_ParallelFusion(int num_iters) {
   // Simple element-wise computation to benchmark parallel task partitioning.
   tensorflow::testing::StopTiming();
@@ -804,19 +879,19 @@ void BM_ParallelFusion(int num_iters) {
 
   // Transfer literals to device.
   auto param0_literal =
-      Literal::CreateR2F32Linspace(1.0, 2.0, param0_dim0, param0_dim1);
+      LiteralUtil::CreateR2F32Linspace(1.0, 2.0, param0_dim0, param0_dim1);
   ScopedShapedBuffer buffer0 =
       client->LiteralToShapedBuffer(*param0_literal, device_ordinal)
           .ConsumeValueOrDie();
 
   auto param1_literal =
-      Literal::CreateR2F32Linspace(1.0, 2.0, param1_dim0, param1_dim1);
+      LiteralUtil::CreateR2F32Linspace(1.0, 2.0, param1_dim0, param1_dim1);
   ScopedShapedBuffer buffer1 =
       client->LiteralToShapedBuffer(*param1_literal, device_ordinal)
           .ConsumeValueOrDie();
 
   auto param2_literal =
-      Literal::CreateR2F32Linspace(1.0, 2.0, param2_dim0, param2_dim1);
+      LiteralUtil::CreateR2F32Linspace(1.0, 2.0, param2_dim0, param2_dim1);
   ScopedShapedBuffer buffer2 =
       client->LiteralToShapedBuffer(*param2_literal, device_ordinal)
           .ConsumeValueOrDie();
diff --git a/tensorflow/compiler/xla/tests/gather_operation_test.cc b/tensorflow/compiler/xla/tests/gather_operation_test.cc
index b8404826b1..b77bece85a 100644
--- a/tensorflow/compiler/xla/tests/gather_operation_test.cc
+++ b/tensorflow/compiler/xla/tests/gather_operation_test.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -22,9 +22,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
 
-// NB!  TODO(b/74360564): These tests do not test out of bounds behavior since
-// that hasn't been specced yet.
-
 namespace xla {
 namespace {
 
@@ -63,8 +60,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({0, 2});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -84,8 +82,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({0, 2});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -105,9 +104,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 2}, {2, 1}});
+      LiteralUtil::CreateR2<int32>({{0, 2}, {2, 1}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -127,9 +126,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
+      LiteralUtil::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -149,9 +148,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
+      LiteralUtil::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -171,11 +170,11 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
-                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
-                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+      LiteralUtil::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                    {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                    {{-7, 7}, {-8, 8}, {-9, 9}}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+      LiteralUtil::CreateR2<int32>({{0, 0}, {1, 0}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -195,11 +194,11 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
-                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
-                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+      LiteralUtil::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                    {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                    {{-7, 7}, {-8, 8}, {-9, 9}}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+      LiteralUtil::CreateR2<int32>({{0, 0}, {1, 0}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -219,8 +218,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({1, 1});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({1, 1});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -240,9 +240,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{2, 1}, {1, 1}});
+      LiteralUtil::CreateR2<int32>({{2, 1}, {1, 1}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -261,18 +261,15 @@ ENTRY main {
       window_bounds={1, 0}
 }
 )";
-  std::unique_ptr<Literal> operand = Literal::CreateR2<int32>({{}, {}, {}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+  std::unique_ptr<Literal> operand = LiteralUtil::CreateR2<int32>({{}, {}, {}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({0, 2});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
 XLA_TEST_F(GatherOperationTest, OutOfBoundsIndex) {
   // Out of bounds indices must not crash, and the indices in range should
   // produce the same values across all backends.
-  //
-  // TODO(b/74360564): Once we have a well defined semantics for OOB accesses,
-  // we should get rid of the mask and check that backends produce the same
-  // value for OOB indices too.
 
   const string hlo_text = R"(
 HloModule BatchDynamicSlice
@@ -286,29 +283,45 @@ ENTRY main {
       gather_dims_to_operand_dims={0,1},
       index_vector_dim=1,
       window_bounds={1,1}
-  gather_reshaped = s32[6]{0} reshape(gather)
-  in_bounds_mask = s32[6]{0} parameter(2)
-  ROOT result = s32[6]{0} multiply(gather_reshaped, in_bounds_mask)
+  ROOT result = s32[6]{0} reshape(gather)
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR2<int32>(
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = LiteralUtil::CreateR2<int32>(
       {{2, 7}, {2, 1}, {1, 1}, {5, 1}, {2147483647, 1}, {1, 2}});
-  std::unique_ptr<Literal> in_bounds_mask =
-      Literal::CreateR1<int32>({0, 1, 1, 0, 0, 1});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, OutOfBoundsUnsignedIndex) {
+  // Out of bounds indices must not crash, and the indices in range should
+  // produce the same values across all backends.
 
-  RunTest(hlo_text,
-          {operand.get(), gather_indices.get(), in_bounds_mask.get()});
+  const string hlo_text = R"(
+HloModule BatchDynamicSlice
+
+ENTRY main {
+  operand = s32[3,3]{1,0} parameter(0)
+  indices = u32[6,2]{1,0} parameter(1)
+  gather = s32[6,1,1]{2,1,0} gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=1,
+      window_bounds={1,1}
+  ROOT result = s32[6]{0} reshape(gather)
+}
+)";
+  std::unique_ptr<Literal> operand =
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = LiteralUtil::CreateR2<uint32>(
+      {{2, 7}, {2, 1}, {1, 1}, {5, 1}, {2147483648u, 1}, {1, 2}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
 XLA_TEST_F(GatherOperationTest, NegativeIndex) {
   // Negative indices must not crash, and the indices in range should produce
   // the same values across all backends.
-  //
-  // TODO(b/74360564): Once we have a well defined semantics for negative
-  // accesses, we should get rid of the mask and check that backends produce the
-  // same value for negative indices too.
 
   const string hlo_text = R"(
 HloModule BatchDynamicSlice
@@ -322,20 +335,40 @@ ENTRY main {
       gather_dims_to_operand_dims={0,1},
       index_vector_dim=1,
       window_bounds={1,1}
-  gather_reshaped = s32[6]{0} reshape(gather)
-  in_bounds_mask = s32[6]{0} parameter(2)
-  ROOT result = s32[6]{0} multiply(gather_reshaped, in_bounds_mask)
+  ROOT result = s32[6]{0} reshape(gather)
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR2<int32>(
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = LiteralUtil::CreateR2<int32>(
       {{2, -1}, {2, 1}, {1, 1}, {-500, 1}, {-2147483648, 1}, {1, 2}});
-  std::unique_ptr<Literal> in_bounds_mask =
-      Literal::CreateR1<int32>({0, 1, 1, 0, 0, 1});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
+}
+
+XLA_TEST_F(GatherOperationTest, NegativeIndexIntoUnsignedOperand) {
+  // Negative indices must not crash, and the indices in range should produce
+  // the same values across all backends.
 
-  RunTest(hlo_text,
-          {operand.get(), gather_indices.get(), in_bounds_mask.get()});
+  const string hlo_text = R"(
+HloModule BatchDynamicSlice
+
+ENTRY main {
+  operand = u32[3,3]{1,0} parameter(0)
+  indices = s32[6,2]{1,0} parameter(1)
+  gather = u32[6,1,1]{2,1,0} gather(operand, indices),
+      output_window_dims={1,2},
+      elided_window_dims={},
+      gather_dims_to_operand_dims={0,1},
+      index_vector_dim=1,
+      window_bounds={1,1}
+  ROOT result = u32[6]{0} reshape(gather)
+}
+)";
+  std::unique_ptr<Literal> operand =
+      LiteralUtil::CreateR2<uint32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = LiteralUtil::CreateR2<int32>(
+      {{2, -1}, {2, 1}, {1, 1}, {-500, 1}, {-2147483648, 1}, {1, 2}});
+  RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
 XLA_TEST_F(GatherOperationTest, OneScalarIndex) {
@@ -353,9 +386,9 @@ ENTRY main {
       window_bounds={1,3,2}
 }
 )";
-  std::unique_ptr<Literal> operand = Literal::CreateR3<int32>(
+  std::unique_ptr<Literal> operand = LiteralUtil::CreateR3<int32>(
       {{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 10}, {11, 12}}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR0<int32>(1);
+  std::unique_ptr<Literal> gather_indices = LiteralUtil::CreateR0<int32>(1);
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -374,8 +407,8 @@ ENTRY main {
       window_bounds={1}
 }
 )";
-  std::unique_ptr<Literal> operand = Literal::CreateR1<int32>({1, 2, 3, 4});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR0<int32>(1);
+  std::unique_ptr<Literal> operand = LiteralUtil::CreateR1<int32>({1, 2, 3, 4});
+  std::unique_ptr<Literal> gather_indices = LiteralUtil::CreateR0<int32>(1);
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -395,8 +428,8 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices = LiteralUtil::CreateR1<int32>({});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -419,8 +452,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({0, 2});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({0, 2});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -443,9 +477,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 2}, {2, 1}});
+      LiteralUtil::CreateR2<int32>({{0, 2}, {2, 1}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -468,9 +502,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
+      LiteralUtil::CreateR3<int32>({{{0, 2}, {2, 1}}, {{1, 2}, {2, 0}}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -493,11 +527,11 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
-                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
-                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+      LiteralUtil::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                    {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                    {{-7, 7}, {-8, 8}, {-9, 9}}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+      LiteralUtil::CreateR2<int32>({{0, 0}, {1, 0}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -521,11 +555,11 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
-                                {{-4, 4}, {-5, 5}, {-6, 6}},  //
-                                {{-7, 7}, {-8, 8}, {-9, 9}}});
+      LiteralUtil::CreateR3<int32>({{{-1, 1}, {-2, 2}, {-3, 3}},  //
+                                    {{-4, 4}, {-5, 5}, {-6, 6}},  //
+                                    {{-7, 7}, {-8, 8}, {-9, 9}}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{0, 0}, {1, 0}});
+      LiteralUtil::CreateR2<int32>({{0, 0}, {1, 0}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -548,8 +582,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
-  std::unique_ptr<Literal> gather_indices = Literal::CreateR1<int32>({1, 1});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+  std::unique_ptr<Literal> gather_indices =
+      LiteralUtil::CreateR1<int32>({1, 1});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -572,9 +607,9 @@ ENTRY main {
 }
 )";
   std::unique_ptr<Literal> operand =
-      Literal::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
+      LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
   std::unique_ptr<Literal> gather_indices =
-      Literal::CreateR2<int32>({{2, 1}, {1, 1}});
+      LiteralUtil::CreateR2<int32>({{2, 1}, {1, 1}});
   RunTest(hlo_text, operand.get(), gather_indices.get());
 }
 
@@ -609,12 +644,13 @@ XLA_TEST_F(GatherClientLibraryTest, DISABLED_ON_GPU(Basic)) {
   Gather(operand, indices, dim_numbers, {1, 3});
 
   std::vector<int32> expected = {};
-  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<GlobalData> operand_arg,
-                          client_->TransferToServer(*Literal::CreateR2<int32>(
-                              {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}})));
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<GlobalData> operand_arg,
+      client_->TransferToServer(
+          *LiteralUtil::CreateR2<int32>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}})));
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> indices_arg,
-      client_->TransferToServer(*Literal::CreateR1<int32>({0, 2})));
+      client_->TransferToServer(*LiteralUtil::CreateR1<int32>({0, 2})));
   TF_ASSERT_OK_AND_ASSIGN(std::vector<xla::DeviceHandle> devices,
                           client_->GetDeviceHandles(1));
   xla::ExecutionOptions execution_options = CreateDefaultExecutionOptions();
diff --git a/tensorflow/compiler/xla/tests/half_test.cc b/tensorflow/compiler/xla/tests/half_test.cc
index fd85118849..51450314b6 100644
--- a/tensorflow/compiler/xla/tests/half_test.cc
+++ b/tensorflow/compiler/xla/tests/half_test.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include <cmath>
 #include <vector>
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -48,7 +48,8 @@ class UnaryOpTest : public HalfTestBase,
                     public ::testing::WithParamInterface<UnaryOpTestParam> {};
 
 XLA_TEST_P(UnaryOpTest, Ops) {
-  std::vector<half> x({half(1.4), half(-2.3), half(3.2), half(-4.1)});
+  std::vector<half> x({half(1.4), half(-2.3), half(3.2), half(-4.1), half(9.0),
+                       half(42.0), half(-9.0), half(-100.0)});
   XlaBuilder builder(TestName());
   XlaOp x_opnd;
   auto x_data = CreateR1Parameter<half>(x, /*parameter_number=*/0, "x",
diff --git a/tensorflow/compiler/xla/tests/hlo_metadata_test.cc b/tensorflow/compiler/xla/tests/hlo_metadata_test.cc
index 4d82442f7e..5511190caf 100644
--- a/tensorflow/compiler/xla/tests/hlo_metadata_test.cc
+++ b/tensorflow/compiler/xla/tests/hlo_metadata_test.cc
@@ -14,7 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/service/local_service.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/local_client_test_base.h"
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.cc b/tensorflow/compiler/xla/tests/hlo_test_base.cc
index 242cc5db11..b662e83716 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.cc
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.cc
@@ -276,9 +276,10 @@ StatusOr<::testing::AssertionResult> HloTestBase::RunAndCompareInternal(
 
 HloComputation* HloTestBase::FindComputation(HloModule* module,
                                              tensorflow::StringPiece name) {
-  auto it = c_find_if(module->computations(),
+  auto computations = module->computations();
+  auto it = c_find_if(computations,
                       [&](HloComputation* c) { return c->name() == name; });
-  if (it == module->computations().end()) {
+  if (it == computations.end()) {
     return nullptr;
   }
   return *it;
@@ -287,9 +288,10 @@ HloComputation* HloTestBase::FindComputation(HloModule* module,
 HloInstruction* HloTestBase::FindInstruction(HloModule* module,
                                              tensorflow::StringPiece name) {
   for (const HloComputation* c : module->computations()) {
-    auto it = c_find_if(c->instructions(),
+    auto instructions = c->instructions();
+    auto it = c_find_if(instructions,
                         [&](HloInstruction* i) { return i->name() == name; });
-    if (it != c->instructions().end()) {
+    if (it != instructions.end()) {
       return *it;
     }
   }
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.h b/tensorflow/compiler/xla/tests/hlo_test_base.h
index 9009d67cea..66719b1460 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.h
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.h
@@ -200,6 +200,13 @@ class HloTestBase : public ::testing::Test {
         ->ResetLayout(layout);
   }
 
+  void ForceResultLayout(HloModule* module, const Layout& layout,
+                         ShapeIndexView shape_index) {
+    module->mutable_entry_computation_layout()
+        ->mutable_result_layout()
+        ->ResetLayout(layout, shape_index);
+  }
+
   // Convenience method to clear the layout of the computation result in
   // 'module'.
   void ForceClearResultLayout(HloModule* module) {
diff --git a/tensorflow/compiler/xla/tests/iota_test.cc b/tensorflow/compiler/xla/tests/iota_test.cc
new file mode 100644
index 0000000000..f950aa1e8f
--- /dev/null
+++ b/tensorflow/compiler/xla/tests/iota_test.cc
@@ -0,0 +1,61 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <numeric>
+#include <vector>
+
+#include "tensorflow/compiler/xla/tests/client_library_test_base.h"
+#include "tensorflow/core/lib/core/errors.h"
+
+namespace xla {
+namespace {
+
+class IotaTest : public ClientLibraryTestBase {
+ public:
+  explicit IotaTest(se::Platform* platform = nullptr)
+      : ClientLibraryTestBase(platform) {}
+  template <typename T>
+  std::vector<T> GetExpected(const int64 num_elements) {
+    std::vector<T> result(num_elements);
+    std::iota(result.begin(), result.end(), 0);
+    return result;
+  }
+};
+
+TEST_F(IotaTest, SimpleR1) {
+  for (int num_elements = 1; num_elements < 10000001; num_elements *= 10) {
+    {
+      XlaBuilder builder(TestName() + "_f32");
+      IotaGen(&builder, F32, num_elements);
+      ComputeAndCompareR1<float>(&builder, GetExpected<float>(num_elements), {},
+                                 ErrorSpec{0.0001});
+    }
+    {
+      XlaBuilder builder(TestName() + "_u32");
+      IotaGen(&builder, U32, num_elements);
+      ComputeAndCompareR1<uint32>(&builder, GetExpected<uint32>(num_elements),
+                                  {});
+    }
+    {
+      XlaBuilder builder(TestName() + "_s32");
+      IotaGen(&builder, S32, num_elements);
+      ComputeAndCompareR1<int32>(&builder, GetExpected<int32>(num_elements),
+                                 {});
+    }
+  }
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/literal_test_util.h b/tensorflow/compiler/xla/tests/literal_test_util.h
index d1b8a6cf0b..31a099c15f 100644
--- a/tensorflow/compiler/xla/tests/literal_test_util.h
+++ b/tensorflow/compiler/xla/tests/literal_test_util.h
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/error_spec.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -154,20 +155,20 @@ class LiteralTestUtil {
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR0Equal(NativeT expected,
                                                  const LiteralSlice& actual) {
-  EXPECT_TRUE(Equal(*Literal::CreateR0<NativeT>(expected), actual));
+  EXPECT_TRUE(Equal(*LiteralUtil::CreateR0<NativeT>(expected), actual));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR1Equal(
     tensorflow::gtl::ArraySlice<NativeT> expected, const LiteralSlice& actual) {
-  EXPECT_TRUE(Equal(*Literal::CreateR1<NativeT>(expected), actual));
+  EXPECT_TRUE(Equal(*LiteralUtil::CreateR1<NativeT>(expected), actual));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR2Equal(
     std::initializer_list<std::initializer_list<NativeT>> expected,
     const LiteralSlice& actual) {
-  EXPECT_TRUE(Equal(*Literal::CreateR2<NativeT>(expected), actual));
+  EXPECT_TRUE(Equal(*LiteralUtil::CreateR2<NativeT>(expected), actual));
 }
 
 template <typename NativeT>
@@ -175,46 +176,46 @@ template <typename NativeT>
     std::initializer_list<std::initializer_list<std::initializer_list<NativeT>>>
         expected,
     const LiteralSlice& actual) {
-  EXPECT_TRUE(Equal(*Literal::CreateR3<NativeT>(expected), actual));
+  EXPECT_TRUE(Equal(*LiteralUtil::CreateR3<NativeT>(expected), actual));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR2EqualArray2D(
     const Array2D<NativeT>& expected, const LiteralSlice& actual) {
-  EXPECT_TRUE(Equal(*Literal::CreateR2FromArray2D(expected), actual));
+  EXPECT_TRUE(Equal(*LiteralUtil::CreateR2FromArray2D(expected), actual));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR3EqualArray3D(
     const Array3D<NativeT>& expected, const LiteralSlice& actual) {
-  EXPECT_TRUE(Equal(*Literal::CreateR3FromArray3D(expected), actual));
+  EXPECT_TRUE(Equal(*LiteralUtil::CreateR3FromArray3D(expected), actual));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR4EqualArray4D(
     const Array4D<NativeT>& expected, const LiteralSlice& actual) {
-  EXPECT_TRUE(Equal(*Literal::CreateR4FromArray4D(expected), actual));
+  EXPECT_TRUE(Equal(*LiteralUtil::CreateR4FromArray4D(expected), actual));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR0Near(NativeT expected,
                                                 const LiteralSlice& actual,
                                                 const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR0<NativeT>(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR0<NativeT>(expected), actual, error));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR1Near(
     tensorflow::gtl::ArraySlice<NativeT> expected, const LiteralSlice& actual,
     const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR1<NativeT>(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR1<NativeT>(expected), actual, error));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR2Near(
     std::initializer_list<std::initializer_list<NativeT>> expected,
     const LiteralSlice& actual, const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR2<NativeT>(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR2<NativeT>(expected), actual, error));
 }
 
 template <typename NativeT>
@@ -222,7 +223,7 @@ template <typename NativeT>
     std::initializer_list<std::initializer_list<std::initializer_list<NativeT>>>
         expected,
     const LiteralSlice& actual, const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR3<NativeT>(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR3<NativeT>(expected), actual, error));
 }
 
 template <typename NativeT>
@@ -231,28 +232,28 @@ template <typename NativeT>
         std::initializer_list<std::initializer_list<NativeT>>>>
         expected,
     const LiteralSlice& actual, const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR4<NativeT>(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR4<NativeT>(expected), actual, error));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR2NearArray2D(
     const Array2D<NativeT>& expected, const LiteralSlice& actual,
     const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR2FromArray2D(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR2FromArray2D(expected), actual, error));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR3NearArray3D(
     const Array3D<NativeT>& expected, const LiteralSlice& actual,
     const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR3FromArray3D(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR3FromArray3D(expected), actual, error));
 }
 
 template <typename NativeT>
 /* static */ void LiteralTestUtil::ExpectR4NearArray4D(
     const Array4D<NativeT>& expected, const LiteralSlice& actual,
     const ErrorSpec& error) {
-  EXPECT_TRUE(Near(*Literal::CreateR4FromArray4D(expected), actual, error));
+  EXPECT_TRUE(Near(*LiteralUtil::CreateR4FromArray4D(expected), actual, error));
 }
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/literal_test_util_test.cc b/tensorflow/compiler/xla/tests/literal_test_util_test.cc
index bbac7285ae..f297b2b847 100644
--- a/tensorflow/compiler/xla/tests/literal_test_util_test.cc
+++ b/tensorflow/compiler/xla/tests/literal_test_util_test.cc
@@ -31,8 +31,9 @@ namespace xla {
 namespace {
 
 TEST(LiteralTestUtilTest, ComparesEqualTuplesEqual) {
-  std::unique_ptr<Literal> literal = Literal::MakeTuple({
-      Literal::CreateR0<int32>(42).get(), Literal::CreateR0<int32>(64).get(),
+  std::unique_ptr<Literal> literal = LiteralUtil::MakeTuple({
+      LiteralUtil::CreateR0<int32>(42).get(),
+      LiteralUtil::CreateR0<int32>(64).get(),
   });
   EXPECT_TRUE(LiteralTestUtil::Equal(*literal, *literal));
 }
@@ -42,11 +43,13 @@ TEST(LiteralTestUtilTest, ComparesUnequalTuplesUnequal) {
   // un-fail an assertion failure. The CHECK-failure is death, so we can make a
   // death assertion.
   auto unequal_things_are_equal = [] {
-    std::unique_ptr<Literal> lhs = Literal::MakeTuple({
-        Literal::CreateR0<int32>(42).get(), Literal::CreateR0<int32>(64).get(),
+    std::unique_ptr<Literal> lhs = LiteralUtil::MakeTuple({
+        LiteralUtil::CreateR0<int32>(42).get(),
+        LiteralUtil::CreateR0<int32>(64).get(),
     });
-    std::unique_ptr<Literal> rhs = Literal::MakeTuple({
-        Literal::CreateR0<int32>(64).get(), Literal::CreateR0<int32>(42).get(),
+    std::unique_ptr<Literal> rhs = LiteralUtil::MakeTuple({
+        LiteralUtil::CreateR0<int32>(64).get(),
+        LiteralUtil::CreateR0<int32>(42).get(),
     });
     CHECK(LiteralTestUtil::Equal(*lhs, *rhs)) << "LHS and RHS are unequal";
   };
@@ -55,8 +58,8 @@ TEST(LiteralTestUtilTest, ComparesUnequalTuplesUnequal) {
 
 TEST(LiteralTestUtilTest, ExpectNearFailurePlacesResultsInTemporaryDirectory) {
   auto dummy_lambda = [] {
-    auto two = Literal::CreateR0<float>(2);
-    auto four = Literal::CreateR0<float>(4);
+    auto two = LiteralUtil::CreateR0<float>(2);
+    auto four = LiteralUtil::CreateR0<float>(4);
     ErrorSpec error(0.001);
     CHECK(LiteralTestUtil::Near(*two, *four, error)) << "two is not near four";
   };
@@ -98,8 +101,8 @@ TEST(LiteralTestUtilTest, ExpectNearFailurePlacesResultsInTemporaryDirectory) {
 }
 
 TEST(LiteralTestUtilTest, NotEqualHasValuesInMessage) {
-  auto expected = Literal::CreateR1<int32>({1, 2, 3});
-  auto actual = Literal::CreateR1<int32>({4, 5, 6});
+  auto expected = LiteralUtil::CreateR1<int32>({1, 2, 3});
+  auto actual = LiteralUtil::CreateR1<int32>({4, 5, 6});
   ::testing::AssertionResult result =
       LiteralTestUtil::Equal(*expected, *actual);
   EXPECT_THAT(result.message(), ::testing::HasSubstr("expected: {1, 2, 3}"));
@@ -107,25 +110,26 @@ TEST(LiteralTestUtilTest, NotEqualHasValuesInMessage) {
 }
 
 TEST(LiteralTestUtilTest, NearComparatorR1) {
-  auto a =
-      Literal::CreateR1<float>({0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8});
-  auto b =
-      Literal::CreateR1<float>({0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8});
+  auto a = LiteralUtil::CreateR1<float>(
+      {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8});
+  auto b = LiteralUtil::CreateR1<float>(
+      {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8});
   EXPECT_TRUE(LiteralTestUtil::Near(*a, *b, ErrorSpec{0.0001}));
 }
 
 TEST(LiteralTestUtilTest, NearComparatorR1Nan) {
-  auto a =
-      Literal::CreateR1<float>({0.0, 0.1, 0.2, 0.3, NAN, 0.5, 0.6, 0.7, 0.8});
-  auto b =
-      Literal::CreateR1<float>({0.0, 0.1, 0.2, 0.3, NAN, 0.5, 0.6, 0.7, 0.8});
+  auto a = LiteralUtil::CreateR1<float>(
+      {0.0, 0.1, 0.2, 0.3, NAN, 0.5, 0.6, 0.7, 0.8});
+  auto b = LiteralUtil::CreateR1<float>(
+      {0.0, 0.1, 0.2, 0.3, NAN, 0.5, 0.6, 0.7, 0.8});
   EXPECT_TRUE(LiteralTestUtil::Near(*a, *b, ErrorSpec{0.0001}));
 }
 
 TEST(LiteralTestUtil, NearComparatorDifferentLengths) {
-  auto a =
-      Literal::CreateR1<float>({0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8});
-  auto b = Literal::CreateR1<float>({0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7});
+  auto a = LiteralUtil::CreateR1<float>(
+      {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8});
+  auto b =
+      LiteralUtil::CreateR1<float>({0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7});
   EXPECT_FALSE(LiteralTestUtil::Near(*a, *b, ErrorSpec{0.0001}));
   EXPECT_FALSE(LiteralTestUtil::Near(*b, *a, ErrorSpec{0.0001}));
 }
diff --git a/tensorflow/compiler/xla/tests/llvm_compiler_test.cc b/tensorflow/compiler/xla/tests/llvm_compiler_test.cc
index 082bc34136..e719da54d4 100644
--- a/tensorflow/compiler/xla/tests/llvm_compiler_test.cc
+++ b/tensorflow/compiler/xla/tests/llvm_compiler_test.cc
@@ -14,9 +14,10 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/llvm_compiler.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/service/backend.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_compiler.h"
-#include "tensorflow/compiler/xla/service/gpu/gpu_compiler.h"
+#include "tensorflow/compiler/xla/service/gpu/nvptx_compiler.h"
 #include "tensorflow/compiler/xla/service/hlo_instruction.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -64,7 +65,7 @@ class LLVMCompilerTest : public ::testing::Test {
     // Create HLO module, and run the compiler.
     auto builder = HloComputation::Builder(TestName());
     builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
 
     auto hlo_module = CreateNewModule();
     hlo_module->AddEntryComputation(builder.Build());
@@ -86,7 +87,7 @@ class LLVMCompilerTest : public ::testing::Test {
   void TestMultiModuleCompilation(LLVMCompiler *compiler) {
     HloComputation::Builder builder(TestName());
     builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(42.0)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(42.0)));
 
     std::unique_ptr<HloModule> hlo_module = CreateNewModule();
     hlo_module->AddEntryComputation(builder.Build());
@@ -144,7 +145,7 @@ TEST_F(CpuCompilerTest, HooksTest) {
 }
 
 TEST_F(GpuCompilerTest, HooksTest) {
-  gpu::GpuCompiler compiler;
+  gpu::NVPTXCompiler compiler;
   TestCompilerHooks(&compiler);
 }
 
@@ -154,7 +155,7 @@ TEST_F(CpuCompilerTest, MultiModuleCompilation) {
 }
 
 TEST_F(GpuCompilerTest, MultModuleCompilation) {
-  gpu::GpuCompiler compiler;
+  gpu::NVPTXCompiler compiler;
   TestMultiModuleCompilation(&compiler);
 }
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc b/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc
index 2c45f19c09..6fc1115097 100644
--- a/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc
+++ b/tensorflow/compiler/xla/tests/llvm_irgen_test_base.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <functional>
 #include <utility>
 
+#include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 #include "tensorflow/compiler/xla/tests/filecheck.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
@@ -25,28 +26,28 @@ limitations under the License.
 
 namespace xla {
 
-void LLVMIRGenTestBase::SetIrHook(bool match_optimized_ir) {
+void LlvmIrGenTestBase::SetIrHook(bool match_optimized_ir) {
   auto llvm_compiler = GetLLVMCompiler();
   using std::placeholders::_1;
 
   // Add the IR inspection hook to the LLVM compiler.
   if (match_optimized_ir) {
     llvm_compiler->SetPostOptimizationHook(
-        std::bind(&LLVMIRGenTestBase::IrHook, this, _1));
+        std::bind(&LlvmIrGenTestBase::IrHook, this, _1));
   } else {
     llvm_compiler->SetPreOptimizationHook(
-        std::bind(&LLVMIRGenTestBase::IrHook, this, _1));
+        std::bind(&LlvmIrGenTestBase::IrHook, this, _1));
   }
 }
 
-void LLVMIRGenTestBase::ResetIrHook() {
+void LlvmIrGenTestBase::ResetIrHook() {
   auto llvm_compiler = GetLLVMCompiler();
 
   llvm_compiler->RemovePreOptimizationHook();
   llvm_compiler->RemovePostOptimizationHook();
 }
 
-void LLVMIRGenTestBase::CompileAndVerifyIr(
+void LlvmIrGenTestBase::CompileAndVerifyIr(
     std::unique_ptr<HloModule> hlo_module, const string& pattern,
     bool match_optimized_ir) {
   SetIrHook(match_optimized_ir);
@@ -58,7 +59,17 @@ void LLVMIRGenTestBase::CompileAndVerifyIr(
   EXPECT_TRUE(filecheck_result.ValueOrDie());
 }
 
-void LLVMIRGenTestBase::CompileAheadOfTimeAndVerifyIr(
+void LlvmIrGenTestBase::CompileAndVerifyIr(const string& hlo_text,
+                                           const string& expected_llvm_ir,
+                                           bool match_optimized_ir) {
+  HloModuleConfig config;
+  config.set_debug_options(GetDebugOptionsForTest());
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<HloModule> module,
+                          ParseHloString(hlo_text, config));
+  CompileAndVerifyIr(std::move(module), expected_llvm_ir, match_optimized_ir);
+}
+
+void LlvmIrGenTestBase::CompileAheadOfTimeAndVerifyIr(
     std::unique_ptr<HloModule> hlo_module, const AotCompilationOptions& options,
     const string& pattern, bool match_optimized_ir) {
   SetIrHook(match_optimized_ir);
@@ -71,11 +82,11 @@ void LLVMIRGenTestBase::CompileAheadOfTimeAndVerifyIr(
   EXPECT_TRUE(filecheck_result.ValueOrDie());
 }
 
-LLVMCompiler* LLVMIRGenTestBase::GetLLVMCompiler() {
+LLVMCompiler* LlvmIrGenTestBase::GetLLVMCompiler() {
   return static_cast<LLVMCompiler*>(backend().compiler());
 }
 
-Status LLVMIRGenTestBase::IrHook(const llvm::Module& module) {
+Status LlvmIrGenTestBase::IrHook(const llvm::Module& module) {
   ir_ = llvm_ir::DumpModuleToString(module);
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/tests/llvm_irgen_test_base.h b/tensorflow/compiler/xla/tests/llvm_irgen_test_base.h
index 74cbb5f5df..018f9546af 100644
--- a/tensorflow/compiler/xla/tests/llvm_irgen_test_base.h
+++ b/tensorflow/compiler/xla/tests/llvm_irgen_test_base.h
@@ -24,7 +24,7 @@ limitations under the License.
 namespace xla {
 
 // Tests that verify IR emitted by the CPU/GPU backend is as expected.
-class LLVMIRGenTestBase : public CodegenTestBase {
+class LlvmIrGenTestBase : public CodegenTestBase {
  protected:
   // Compiles the given HLO module to LLVM IR and verifies the IR matches the
   // given pattern. `pattern` is in the FileCheck pattern matching syntax
@@ -38,6 +38,12 @@ class LLVMIRGenTestBase : public CodegenTestBase {
   void CompileAndVerifyIr(std::unique_ptr<HloModule> hlo_module,
                           const string& pattern, bool match_optimized_ir);
 
+  // A thin wrapper around CompileAndVerifyIr that parses `hlo_text` to create
+  // an HLO module.
+  void CompileAndVerifyIr(const string& hlo_text,
+                          const string& expected_llvm_ir,
+                          bool match_optimized_ir = false);
+
   // Compiles the given HLO module to LLVM IR and verifies the IR matches the
   // given pattern. `pattern` is in the FileCheck pattern matching syntax
   // (http://llvm.org/docs/CommandGuide/FileCheck.html).
diff --git a/tensorflow/compiler/xla/tests/local_client_allocation_test.cc b/tensorflow/compiler/xla/tests/local_client_allocation_test.cc
index 9191be9fd9..e2cd5bcc5a 100644
--- a/tensorflow/compiler/xla/tests/local_client_allocation_test.cc
+++ b/tensorflow/compiler/xla/tests/local_client_allocation_test.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/local_service.h"
 #include "tensorflow/compiler/xla/service/shaped_buffer.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -45,7 +45,7 @@ XLA_TEST_F(LocalClientAllocationTest, AddVectors) {
   TestAllocator* allocator = GetOrCreateAllocator(local_client_->platform());
 
   auto x_array =
-      LiteralToShapedBuffer(*Literal::CreateR1<float>({0.0f, 1.0f, 2.0f}));
+      LiteralToShapedBuffer(*LiteralUtil::CreateR1<float>({0.0f, 1.0f, 2.0f}));
 
   int64 allocation_count_before = allocator_->allocation_count();
 
diff --git a/tensorflow/compiler/xla/tests/local_client_aot_test.cc b/tensorflow/compiler/xla/tests/local_client_aot_test.cc
index 47cab79604..115448c908 100644
--- a/tensorflow/compiler/xla/tests/local_client_aot_test.cc
+++ b/tensorflow/compiler/xla/tests/local_client_aot_test.cc
@@ -42,13 +42,12 @@ extern "C" void SumStructElements(float* out, void** parameters) {
 TEST_F(LocalClientAotTest, Constant) {
   xla::ExecutableRunOptions run_options;
   OpaqueData opaque_data{100, 20, 3};
-  void* parameters[] = {&opaque_data};
   float out = 0;
-  void* temporary_buffers[] = {nullptr, &out};
-  SumAndDouble(&out, &run_options, parameters, temporary_buffers);
+  void* temporary_buffers[] = {&opaque_data, &out};
+  SumAndDouble(&out, &run_options, nullptr, temporary_buffers);
   EXPECT_EQ(out, 246.0f);
 
   opaque_data = {1, 2, 3};
-  SumAndDouble(&out, &run_options, parameters, temporary_buffers);
+  SumAndDouble(&out, &run_options, nullptr, temporary_buffers);
   EXPECT_EQ(out, 12.0f);
 }
diff --git a/tensorflow/compiler/xla/tests/local_client_aot_test_helper.cc b/tensorflow/compiler/xla/tests/local_client_aot_test_helper.cc
index 70612e7c49..e310966d8b 100644
--- a/tensorflow/compiler/xla/tests/local_client_aot_test_helper.cc
+++ b/tensorflow/compiler/xla/tests/local_client_aot_test_helper.cc
@@ -21,8 +21,8 @@ limitations under the License.
 
 #include "llvm/ADT/Triple.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_compiler.h"
 #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
 #include "tensorflow/compiler/xla/types.h"
@@ -92,9 +92,10 @@ int main(int argc, char** argv) {
   // It's lame to hard-code the buffer assignments, but we need
   // local_client_aot_test.cc to be able to easily invoke the function.
   CHECK_EQ(result->result_buffer_index(), 1);
-  CHECK_EQ(result->buffer_sizes().size(), 2);
-  CHECK_EQ(result->buffer_sizes()[0], -1);             // param buffer
+  CHECK_EQ(result->buffer_sizes().size(), 3);
+  CHECK_EQ(result->buffer_sizes()[0], -2);             // param buffer
   CHECK_EQ(result->buffer_sizes()[1], sizeof(float));  // result buffer
+  CHECK_EQ(result->buffer_sizes()[2], -1);             // const buffer
   if (triple.isOSBinFormatELF()) {
     // Check the ELF magic.
     CHECK_EQ(result->object_file_data()[0], 0x7F);
diff --git a/tensorflow/compiler/xla/tests/local_client_execute_test.cc b/tensorflow/compiler/xla/tests/local_client_execute_test.cc
index 2c6393794e..1a823cf189 100644
--- a/tensorflow/compiler/xla/tests/local_client_execute_test.cc
+++ b/tensorflow/compiler/xla/tests/local_client_execute_test.cc
@@ -19,9 +19,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
 #include "tensorflow/compiler/xla/service/local_service.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
@@ -68,7 +68,7 @@ XLA_TEST_F(LocalClientExecuteTest, AddScalars) {
   auto y = ConstantR0<float>(&builder, 123.0f);
   Add(x, y);
 
-  auto x_value = LiteralToShapedBuffer(*Literal::CreateR0<float>(42.0f));
+  auto x_value = LiteralToShapedBuffer(*LiteralUtil::CreateR0<float>(42.0f));
   ScopedShapedBuffer result =
       ExecuteLocallyOrDie(builder.Build().ValueOrDie(), {&x_value});
   LiteralTestUtil::ExpectR0Near<float>(165.f, *ShapedBufferToLiteral(result),
@@ -81,7 +81,7 @@ XLA_TEST_F(LocalClientExecuteTest, AddZeroElementVectors) {
   auto y = ConstantR1<float>(&builder, {});
   Add(x, y);
 
-  auto x_array = LiteralToShapedBuffer(*Literal::CreateR1<float>({}));
+  auto x_array = LiteralToShapedBuffer(*LiteralUtil::CreateR1<float>({}));
   ScopedShapedBuffer result =
       ExecuteLocallyOrDie(builder.Build().ValueOrDie(), {&x_array});
   LiteralTestUtil::ExpectR1Near<float>({}, *ShapedBufferToLiteral(result),
@@ -95,7 +95,7 @@ XLA_TEST_F(LocalClientExecuteTest, AddVectors) {
   Add(x, y);
 
   auto x_array =
-      LiteralToShapedBuffer(*Literal::CreateR1<float>({0.0f, 1.0f, 2.0f}));
+      LiteralToShapedBuffer(*LiteralUtil::CreateR1<float>({0.0f, 1.0f, 2.0f}));
   ScopedShapedBuffer result =
       ExecuteLocallyOrDie(builder.Build().ValueOrDie(), {&x_array});
   LiteralTestUtil::ExpectR1Near<float>(
@@ -109,7 +109,7 @@ XLA_TEST_F(LocalClientExecuteTest, AddVectorsWithProfile) {
   Add(x, y);
 
   auto x_array =
-      LiteralToShapedBuffer(*Literal::CreateR1<float>({0.0f, 1.0f, 2.0f}));
+      LiteralToShapedBuffer(*LiteralUtil::CreateR1<float>({0.0f, 1.0f, 2.0f}));
   ExecutionProfile profile;
   ScopedShapedBuffer result = ExecuteLocallyOrDie(
       builder.Build().ValueOrDie(), {&x_array}, DefaultExecutableBuildOptions(),
@@ -128,13 +128,13 @@ XLA_TEST_F(LocalClientExecuteTest, AddArraysWithDifferentInputLayouts) {
   auto computation = builder.Build().ConsumeValueOrDie();
 
   // Create x as a col-major array.
-  auto x_array = LiteralToShapedBuffer(*Literal::CreateR2WithLayout(
+  auto x_array = LiteralToShapedBuffer(*LiteralUtil::CreateR2WithLayout(
       {{1.0f, 2.0f}, {3.0f, 4.0f}}, LayoutUtil::MakeLayout({0, 1})));
   EXPECT_TRUE(LayoutUtil::Equal(x_array.on_device_shape().layout(),
                                 LayoutUtil::MakeLayout({0, 1})));
 
   // Create y as a row-major array.
-  auto y_array = LiteralToShapedBuffer(*Literal::CreateR2WithLayout(
+  auto y_array = LiteralToShapedBuffer(*LiteralUtil::CreateR2WithLayout(
       {{10.0f, 20.0f}, {30.0f, 40.0f}}, LayoutUtil::MakeLayout({1, 0})));
   EXPECT_TRUE(LayoutUtil::Equal(y_array.on_device_shape().layout(),
                                 LayoutUtil::MakeLayout({1, 0})));
@@ -161,9 +161,9 @@ XLA_TEST_F(LocalClientExecuteTest, AddArraysWithDifferentOutputLayouts) {
   auto computation = builder.Build().ConsumeValueOrDie();
 
   auto x_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+      *LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
   auto y_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{10.0f, 20.0f}, {30.0f, 40.0f}}));
+      *LiteralUtil::CreateR2<float>({{10.0f, 20.0f}, {30.0f, 40.0f}}));
 
   // Run with col-major result layout.
   ScopedShapedBuffer result_colmaj = ExecuteLocallyOrDie(
@@ -198,9 +198,9 @@ XLA_TEST_F(LocalClientExecuteTest, TupleResult) {
   auto computation = builder.Build().ConsumeValueOrDie();
 
   auto x_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+      *LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
   auto y_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{10.0f, 20.0f}, {30.0f, 40.0f}}));
+      *LiteralUtil::CreateR2<float>({{10.0f, 20.0f}, {30.0f, 40.0f}}));
 
   ScopedShapedBuffer result =
       ExecuteLocallyOrDie(computation, {&x_array, &y_array});
@@ -226,9 +226,9 @@ XLA_TEST_F(LocalClientExecuteTest, NestedTupleResult) {
   auto computation = builder.Build().ConsumeValueOrDie();
 
   auto x_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+      *LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
   auto y_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{10.0f, 20.0f}, {30.0f, 40.0f}}));
+      *LiteralUtil::CreateR2<float>({{10.0f, 20.0f}, {30.0f, 40.0f}}));
 
   ScopedShapedBuffer result =
       ExecuteLocallyOrDie(computation, {&x_array, &y_array});
@@ -255,7 +255,7 @@ XLA_TEST_F(LocalClientExecuteTest, TupleResultWithLayout) {
   Tuple(&builder, {x, y});
 
   auto array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+      *LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
 
   ExecutableBuildOptions options = DefaultExecutableBuildOptions();
   Shape shape_with_layout = ShapeUtil::MakeTupleShape(
@@ -298,12 +298,12 @@ XLA_TEST_F(LocalClientExecuteTest, TupleArguments) {
   Tuple(&builder, {array_sum, vector_diff});
   auto computation = builder.Build().ConsumeValueOrDie();
 
-  auto x_literal = Literal::MakeTuple(
-      {Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get(),
-       Literal::CreateR1<float>({42.0, 75.0, 123.0}).get()});
-  auto y_literal = Literal::MakeTuple(
-      {Literal::CreateR1<float>({2.0, 4.0, 6.0}).get(),
-       Literal::CreateR2<float>({{55.0, 44.0}, {33.0, 22.0}}).get()});
+  auto x_literal = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get(),
+       LiteralUtil::CreateR1<float>({42.0, 75.0, 123.0}).get()});
+  auto y_literal = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR1<float>({2.0, 4.0, 6.0}).get(),
+       LiteralUtil::CreateR2<float>({{55.0, 44.0}, {33.0, 22.0}}).get()});
 
   auto x_buffer = LiteralToShapedBuffer(*x_literal);
   auto y_buffer = LiteralToShapedBuffer(*y_literal);
@@ -344,12 +344,12 @@ XLA_TEST_F(LocalClientExecuteTest, NestedTupleArgument) {
   Tuple(&builder, {negate_array, vector_sum});
   auto computation = builder.Build().ConsumeValueOrDie();
 
-  auto arg_literal = Literal::MakeTuple(
-      {Literal::MakeTuple(
-           {Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get(),
-            Literal::CreateR1<float>({42.0, 75.0, 123.0}).get()})
+  auto arg_literal = LiteralUtil::MakeTuple(
+      {LiteralUtil::MakeTuple(
+           {LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get(),
+            LiteralUtil::CreateR1<float>({42.0, 75.0, 123.0}).get()})
            .get(),
-       Literal::CreateR1<float>({222.0, -2.0, 10.0}).get()});
+       LiteralUtil::CreateR1<float>({222.0, -2.0, 10.0}).get()});
   auto arg_buffer = LiteralToShapedBuffer(*arg_literal);
 
   ScopedShapedBuffer result = ExecuteLocallyOrDie(computation, {&arg_buffer});
@@ -377,9 +377,9 @@ XLA_TEST_F(LocalClientExecuteTest, PassingTupleResultBackIntoComputation) {
   Tuple(&builder, {Neg(element_0), Add(element_1, element_1)});
   auto computation = builder.Build().ConsumeValueOrDie();
 
-  auto arg_literal = Literal::MakeTuple(
-      {Literal::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get(),
-       Literal::CreateR2<float>({{11.0, 3.0}, {4.0, 5.0}}).get()});
+  auto arg_literal = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR2<float>({{1.0, 2.0}, {3.0, 4.0}}).get(),
+       LiteralUtil::CreateR2<float>({{11.0, 3.0}, {4.0, 5.0}}).get()});
   auto arg_buffer = LiteralToShapedBuffer(*arg_literal);
 
   ScopedShapedBuffer result_0 = ExecuteLocallyOrDie(computation, {&arg_buffer});
@@ -429,10 +429,10 @@ XLA_TEST_F(LocalClientExecuteTest, LargeTuple) {
   // -tuple_index}.
   std::vector<std::unique_ptr<Literal>> arg_elements;
   for (int i = 0; i < kElementCount; ++i) {
-    arg_elements.push_back(Literal::CreateR1<float>({1.0f * i, -1.0f * i}));
+    arg_elements.push_back(LiteralUtil::CreateR1<float>({1.0f * i, -1.0f * i}));
   }
   std::unique_ptr<Literal> arg_literal =
-      Literal::MakeTupleOwned(std::move(arg_elements));
+      LiteralUtil::MakeTupleOwned(std::move(arg_elements));
   auto arg_buffer = LiteralToShapedBuffer(*arg_literal);
 
   ScopedShapedBuffer result = ExecuteLocallyOrDie(computation, {&arg_buffer});
@@ -480,12 +480,13 @@ XLA_TEST_F(LocalClientExecuteTest, LargeNestedTuple) {
   for (int i = 0; i < kFanout; ++i) {
     std::vector<std::unique_ptr<Literal>> inner_tuple_elements;
     for (int j = 0; j < kFanout; ++j) {
-      inner_tuple_elements.push_back(Literal::CreateR0<float>(i + j));
+      inner_tuple_elements.push_back(LiteralUtil::CreateR0<float>(i + j));
     }
     outer_tuple_elements.push_back(
-        Literal::MakeTupleOwned(std::move(inner_tuple_elements)));
+        LiteralUtil::MakeTupleOwned(std::move(inner_tuple_elements)));
   }
-  auto arg_literal = Literal::MakeTupleOwned(std::move(outer_tuple_elements));
+  auto arg_literal =
+      LiteralUtil::MakeTupleOwned(std::move(outer_tuple_elements));
   auto arg_buffer = LiteralToShapedBuffer(*arg_literal);
 
   ScopedShapedBuffer result = ExecuteLocallyOrDie(computation, {&arg_buffer});
@@ -524,11 +525,11 @@ XLA_TEST_F(LocalClientExecuteTest, DeepTuple) {
   auto computation = builder.Build().ConsumeValueOrDie();
 
   // Construct the argument to pass to the computation.
-  std::unique_ptr<Literal> arg_literal = Literal::CreateR0<float>(123.0);
+  std::unique_ptr<Literal> arg_literal = LiteralUtil::CreateR0<float>(123.0);
   for (int i = 0; i < kTupleDepth; ++i) {
     std::vector<std::unique_ptr<Literal>> arg_vector;
     arg_vector.push_back(std::move(arg_literal));
-    arg_literal = Literal::MakeTupleOwned(std::move(arg_vector));
+    arg_literal = LiteralUtil::MakeTupleOwned(std::move(arg_vector));
   }
   auto arg_buffer = LiteralToShapedBuffer(*arg_literal);
 
@@ -551,7 +552,7 @@ XLA_TEST_F(LocalClientExecuteTest, InvalidNumberOfArguments) {
   Add(x, y);
 
   auto x_array =
-      LiteralToShapedBuffer(*Literal::CreateR1<float>({1.0f, 2.0f, 3.0f}));
+      LiteralToShapedBuffer(*LiteralUtil::CreateR1<float>({1.0f, 2.0f, 3.0f}));
   auto execute_status =
       ExecuteLocally(builder.Build().ValueOrDie(), {&x_array});
 
@@ -567,7 +568,7 @@ XLA_TEST_F(LocalClientExecuteTest, IncorrectArgumentShape) {
   Neg(x);
 
   auto x_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{0.0f, 1.0f}, {2.0f, 3.0f}}));
+      *LiteralUtil::CreateR2<float>({{0.0f, 1.0f}, {2.0f, 3.0f}}));
   auto execute_status =
       ExecuteLocally(builder.Build().ValueOrDie(), {&x_array});
 
@@ -584,7 +585,7 @@ XLA_TEST_F(LocalClientExecuteTest, InvalidResultLayout) {
   Neg(x);
 
   auto x_array = LiteralToShapedBuffer(
-      *Literal::CreateR2<float>({{0.0f, 1.0f}, {2.0f, 3.0f}}));
+      *LiteralUtil::CreateR2<float>({{0.0f, 1.0f}, {2.0f, 3.0f}}));
   auto execute_status = ExecuteLocally(
       builder.Build().ValueOrDie(), {&x_array},
       DefaultExecutableBuildOptions().set_result_layout(
@@ -767,7 +768,7 @@ XLA_TEST_F(LocalClientExecuteTest, CompileExecutable) {
       executable_status.ConsumeValueOrDie();
 
   auto x_array =
-      LiteralToShapedBuffer(*Literal::CreateR1<float>({0.0f, 1.0f, 2.0f}));
+      LiteralToShapedBuffer(*LiteralUtil::CreateR1<float>({0.0f, 1.0f, 2.0f}));
   ScopedShapedBuffer result =
       executable->Run({&x_array}, DefaultExecutableRunOptions())
           .ConsumeValueOrDie();
@@ -795,29 +796,29 @@ XLA_TEST_F(LocalClientExecuteTest, ShapeBufferToLiteralConversion) {
   };
 
   // Array shapes.
-  test_to_device_and_back(*Literal::CreateR0<float>(42.0));
-  test_to_device_and_back(*Literal::CreateR0<bool>(true));
-  test_to_device_and_back(*Literal::CreateR1<float>({1.0, 42.0, 744.4}));
+  test_to_device_and_back(*LiteralUtil::CreateR0<float>(42.0));
+  test_to_device_and_back(*LiteralUtil::CreateR0<bool>(true));
+  test_to_device_and_back(*LiteralUtil::CreateR1<float>({1.0, 42.0, 744.4}));
   test_to_device_and_back(
-      *Literal::CreateR2<float>({{1.0, 2.0, 3.0}, {44.0, 0.1, -3}}));
-  test_to_device_and_back(*Literal::CreateR2<int32>({{2, 1}, {4444, 56}}));
+      *LiteralUtil::CreateR2<float>({{1.0, 2.0, 3.0}, {44.0, 0.1, -3}}));
+  test_to_device_and_back(*LiteralUtil::CreateR2<int32>({{2, 1}, {4444, 56}}));
 
   // Null shape (empty tuple).
-  test_to_device_and_back(*Literal::MakeTuple({}));
+  test_to_device_and_back(*LiteralUtil::MakeTuple({}));
 
   // Non-nested tuples.
   test_to_device_and_back(
-      *Literal::MakeTuple({Literal::CreateR0<float>(12223.0).get()}));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR0<float>(12223.0).get()}));
   test_to_device_and_back(
-      *Literal::MakeTuple({Literal::CreateR1<float>({1.0, -42.0}).get(),
-                           Literal::CreateR0<float>(123456.0).get()}));
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({1.0, -42.0}).get(),
+                               LiteralUtil::CreateR0<float>(123456.0).get()}));
 
   // Nested tuple.
-  test_to_device_and_back(*Literal::MakeTuple(
-      {Literal::MakeTuple({Literal::CreateR1<float>({1.0, -42.0}).get(),
-                           Literal::CreateR0<float>(123456.0).get()})
+  test_to_device_and_back(*LiteralUtil::MakeTuple(
+      {LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({1.0, -42.0}).get(),
+                               LiteralUtil::CreateR0<float>(123456.0).get()})
            .get(),
-       Literal::CreateR0<bool>(false).get()}));
+       LiteralUtil::CreateR0<bool>(false).get()}));
 }
 
 XLA_TEST_F(LocalClientExecuteTest, ShapeBufferToLiteralConversion64bit) {
@@ -835,13 +836,13 @@ XLA_TEST_F(LocalClientExecuteTest, ShapeBufferToLiteralConversion64bit) {
   };
 
   test_to_device_and_back(
-      *Literal::CreateR2<double>({{1.0, 2.0, 3.0}, {44.0, 0.1, -3}}));
-  test_to_device_and_back(*Literal::CreateR2<int64>({{2, 1}, {4444, 56}}));
+      *LiteralUtil::CreateR2<double>({{1.0, 2.0, 3.0}, {44.0, 0.1, -3}}));
+  test_to_device_and_back(*LiteralUtil::CreateR2<int64>({{2, 1}, {4444, 56}}));
   test_to_device_and_back(
-      *Literal::CreateR2<uint64>({{20000000000ULL, 1}, {4444, 56}}));
-  test_to_device_and_back(
-      *Literal::MakeTuple({Literal::CreateR1<double>({1.0, -42.0}).get(),
-                           Literal::CreateR0<int64>(123456789000LL).get()}));
+      *LiteralUtil::CreateR2<uint64>({{20000000000ULL, 1}, {4444, 56}}));
+  test_to_device_and_back(*LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR1<double>({1.0, -42.0}).get(),
+       LiteralUtil::CreateR0<int64>(123456789000LL).get()}));
 }
 
 XLA_TEST_F(LocalClientExecuteTest, InfeedTest) {
@@ -860,7 +861,7 @@ XLA_TEST_F(LocalClientExecuteTest, InfeedTest) {
           }));
 
   ASSERT_IS_OK(local_client_->TransferToInfeedLocal(
-      *Literal::CreateR1<float>({-5.0, 123.0, 42.0}),
+      *LiteralUtil::CreateR1<float>({-5.0, 123.0, 42.0}),
       local_client_->default_device_ordinal()));
 
   // Join the thread.
@@ -869,9 +870,7 @@ XLA_TEST_F(LocalClientExecuteTest, InfeedTest) {
   LiteralTestUtil::ExpectR1Equal<float>({-4.0, 125.0, 45.0}, *result);
 }
 
-// TODO(b/34359662): Support infeed/outfeed on GPU and CPU parallel.
-// 2017-10-18.
-XLA_TEST_F(LocalClientExecuteTest, DISABLED_ON_GPU(InfeedOutfeedTest)) {
+XLA_TEST_F(LocalClientExecuteTest, InfeedOutfeedTest) {
   XlaBuilder builder(TestName());
   const Shape shape = ShapeUtil::MakeShape(F32, {3});
   auto in = Infeed(&builder, shape);
@@ -885,7 +884,7 @@ XLA_TEST_F(LocalClientExecuteTest, DISABLED_ON_GPU(InfeedOutfeedTest)) {
           [&] { ExecuteLocallyOrDie(builder.Build().ValueOrDie(), {}); }));
 
   ASSERT_IS_OK(local_client_->TransferToInfeedLocal(
-      *Literal::CreateR1<float>({-5.0, 123.0, 42.0}),
+      *LiteralUtil::CreateR1<float>({-5.0, 123.0, 42.0}),
       local_client_->default_device_ordinal()));
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result,
@@ -920,7 +919,7 @@ void BM_LocalClientOverhead(int num_iters) {
       transfer_manager
           ->AllocateScopedShapedBuffer(shape, &allocator, /*device_ordinal=*/0)
           .ConsumeValueOrDie();
-  auto literal = Literal::CreateR2<float>({{0, 0, 0}, {0, 0, 0}});
+  auto literal = LiteralUtil::CreateR2<float>({{0, 0, 0}, {0, 0, 0}});
   auto stream =
       client->mutable_backend()->BorrowStream(device_ordinal).ValueOrDie();
   ASSERT_IS_OK(transfer_manager->TransferLiteralToDevice(stream.get(), *literal,
diff --git a/tensorflow/compiler/xla/tests/local_client_test_base.cc b/tensorflow/compiler/xla/tests/local_client_test_base.cc
index c31ba0e713..eaddf756db 100644
--- a/tensorflow/compiler/xla/tests/local_client_test_base.cc
+++ b/tensorflow/compiler/xla/tests/local_client_test_base.cc
@@ -20,6 +20,7 @@ limitations under the License.
 
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
diff --git a/tensorflow/compiler/xla/tests/local_client_test_base.h b/tensorflow/compiler/xla/tests/local_client_test_base.h
index 258226523d..b4477e9a6b 100644
--- a/tensorflow/compiler/xla/tests/local_client_test_base.h
+++ b/tensorflow/compiler/xla/tests/local_client_test_base.h
@@ -22,7 +22,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
 #include "tensorflow/compiler/xla/service/local_service.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
diff --git a/tensorflow/compiler/xla/tests/log_test.cc b/tensorflow/compiler/xla/tests/log_test.cc
index cdf70ee418..2d622242e6 100644
--- a/tensorflow/compiler/xla/tests/log_test.cc
+++ b/tensorflow/compiler/xla/tests/log_test.cc
@@ -17,7 +17,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
diff --git a/tensorflow/compiler/xla/tests/map_test.cc b/tensorflow/compiler/xla/tests/map_test.cc
index 1b3bc9d504..0732e195d4 100644
--- a/tensorflow/compiler/xla/tests/map_test.cc
+++ b/tensorflow/compiler/xla/tests/map_test.cc
@@ -19,9 +19,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -169,7 +169,7 @@ class MapTest : public ClientLibraryTestBase {
 TEST_F(MapTest, MapEachElemPlusOneR0) {
   // Applies lambda (x) (+ x 1)) to an input scalar.
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR0<float>(42.0);
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR0<float>(42.0);
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -183,7 +183,7 @@ TEST_F(MapTest, MapEachElemPlusOneR0) {
 XLA_TEST_F(MapTest, MapEachElemPlusOneR1S0) {
   // Maps (lambda (x) (+ x 1)) onto an input R1F32 vector of length 0.
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR1<float>({});
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR1<float>({});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -198,7 +198,7 @@ TEST_F(MapTest, MapEachElemPlusOneR1S4) {
   // Maps (lambda (x) (+ x 1)) onto an input R1F32 vector of length 4.
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -212,7 +212,7 @@ TEST_F(MapTest, MapEachElemPlusOneR1S4) {
 TEST_F(MapTest, MapEachF32ElementToS32Constant) {
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -225,7 +225,7 @@ TEST_F(MapTest, MapEachF32ElementToS32Constant) {
 TEST_F(MapTest, MapEachF32ElementToU32Constant) {
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -239,7 +239,7 @@ TEST_F(MapTest, MapEachElemLongerChainR1) {
   // Maps (lambda (x) (* (+ x 1) x)) onto an input R1F32 vector.
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.6f, -5.1f, 0.1f, 0.2f, 999.0f, 255.5f});
+      LiteralUtil::CreateR1<float>({2.6f, -5.1f, 0.1f, 0.2f, 999.0f, 255.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -255,7 +255,7 @@ XLA_TEST_F(MapTest, MapMultipleMapsR1S0) {
   // Maps (lambda (x) (+ x 1)) onto an input R1F32 vector of length 0, and then
   // maps (lambda (x) (* x 2)) on the result.
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR1<float>({});
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR1<float>({});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -272,7 +272,7 @@ TEST_F(MapTest, MapMultipleMapsR1S4) {
   // maps (lambda (x) (* x 2)) on the result.
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -287,7 +287,7 @@ TEST_F(MapTest, MapMultipleMapsR1S4) {
 TEST_F(MapTest, MapEachElemPlusOneR2) {
   // Maps (lambda (x) (+ x 1)) onto an input R2F32 vector.
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR2<float>(
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR2<float>(
       {{13.25f, 14.0f}, {-7.1f, -7.2f}, {-8.8f, 8.8f}});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
@@ -343,11 +343,11 @@ TEST_F(MapTest, MapBinaryAdder) {
   // Maps (lambda (x y) (+ x y)) onto two R1F32 vectors.
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<Literal> param1_literal =
-      Literal::CreateR1<float>({5.1f, 4.4f, -0.1f, -5.5f});
+      LiteralUtil::CreateR1<float>({5.1f, 4.4f, -0.1f, -5.5f});
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
 
@@ -365,12 +365,12 @@ TEST_F(MapTest, MapBinaryAdder) {
 // for Map that used to fail in shape inference (b/28989438).
 XLA_TEST_F(MapTest, AddWithMixedLayouts) {
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR2WithLayout(
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR2WithLayout(
       {{1, 2}, {3, 4}}, LayoutUtil::MakeLayout({1, 0}));
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> param1_literal = Literal::CreateR2WithLayout(
+  std::unique_ptr<Literal> param1_literal = LiteralUtil::CreateR2WithLayout(
       {{10, 20}, {30, 40}}, LayoutUtil::MakeLayout({0, 1}));
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
@@ -392,12 +392,12 @@ XLA_TEST_F(MapTest, AddWithMixedLayouts) {
 XLA_TEST_F(MapTest, AddR3_3x0x2) {
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR3FromArray3D<int32>(Array3D<int32>(3, 0, 2));
+      LiteralUtil::CreateR3FromArray3D<int32>(Array3D<int32>(3, 0, 2));
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
   std::unique_ptr<Literal> param1_literal =
-      Literal::CreateR3FromArray3D<int32>(Array3D<int32>(3, 0, 2));
+      LiteralUtil::CreateR3FromArray3D<int32>(Array3D<int32>(3, 0, 2));
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
 
@@ -414,15 +414,15 @@ TEST_F(MapTest, MapTernaryAdder) {
   // Maps (lambda (x y z) (+ x y z)) onto three R1F32 vectors.
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<Literal> param1_literal =
-      Literal::CreateR1<float>({5.1f, 4.4f, -0.1f, -5.5f});
+      LiteralUtil::CreateR1<float>({5.1f, 4.4f, -0.1f, -5.5f});
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
   std::unique_ptr<Literal> param2_literal =
-      Literal::CreateR1<float>({-10.0f, -100.0f, -900.0f, -400.0f});
+      LiteralUtil::CreateR1<float>({-10.0f, -100.0f, -900.0f, -400.0f});
   std::unique_ptr<GlobalData> param2_data =
       client_->TransferToServer(*param2_literal).ConsumeValueOrDie();
 
@@ -476,11 +476,11 @@ TEST_F(MapTest, MapOperantionWithBuildError) {
   auto error_add = sub_builder->BuildAndNoteError();
 
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 3.3f, 4.4f, 5.5f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<Literal> param1_literal =
-      Literal::CreateR1<float>({5.1f, 4.4f, -0.1f, -5.5f});
+      LiteralUtil::CreateR1<float>({5.1f, 4.4f, -0.1f, -5.5f});
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*param1_literal).ConsumeValueOrDie();
 
@@ -513,8 +513,8 @@ TEST_F(MapTestWithFullOpt, MapScalarPower) {
   Pow(x, y);
   auto power = sub_builder->BuildAndNoteError();
 
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR0<float>(2.0f);
-  std::unique_ptr<Literal> param1_literal = Literal::CreateR0<float>(5.0f);
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR0<float>(2.0f);
+  std::unique_ptr<Literal> param1_literal = LiteralUtil::CreateR0<float>(5.0f);
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<GlobalData> param1_data =
@@ -540,8 +540,8 @@ TEST_F(MapTestWithFullOpt, MapSubtractOppositeOrder) {
   Sub(y, x);  // note that this is y - x, not x - y
   auto sub_opposite = sub_builder->BuildAndNoteError();
 
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR0<float>(2.0f);
-  std::unique_ptr<Literal> param1_literal = Literal::CreateR0<float>(5.0f);
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR0<float>(2.0f);
+  std::unique_ptr<Literal> param1_literal = LiteralUtil::CreateR0<float>(5.0f);
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
   std::unique_ptr<GlobalData> param1_data =
@@ -565,7 +565,7 @@ TEST_F(MapTestWithFullOpt, MapSquare) {
   Mul(x, x);
   auto square = sub_builder->BuildAndNoteError();
 
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR0<float>(10.0f);
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR0<float>(10.0f);
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
diff --git a/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc b/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
index 17b1807f44..da8c42d465 100644
--- a/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
@@ -19,9 +19,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -63,8 +63,8 @@ XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, ExpTwoByTwoValues) {
   Exp(data);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2FromArray2D<T>({{2.71828f, 1.00000f},    // row 0
-                                       {0.36788f, 1.64872f}});  // row 1
+      LiteralUtil::CreateR2FromArray2D<T>({{2.71828f, 1.00000f},    // row 0
+                                           {0.36788f, 1.64872f}});  // row 1
 
   this->ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-5));
 }
@@ -92,8 +92,8 @@ XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, MapTwoByTwo) {
   Map(&builder, {data}, add_half, {0, 1});
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2FromArray2D<T>({{1.5f, 0.5f},     // row 0
-                                       {-0.5f, 1.0f}});  // row 1
+      LiteralUtil::CreateR2FromArray2D<T>({{1.5f, 0.5f},     // row 0
+                                           {-0.5f, 1.0f}});  // row 1
   this->ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-5));
 }
 
@@ -111,8 +111,8 @@ XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, MaxTwoByTwoValues) {
   Max(lhs, rhs);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2FromArray2D<T>({{7.0f, 6.0f},     // row 0
-                                       {3.0f, -4.0f}});  // row 1
+      LiteralUtil::CreateR2FromArray2D<T>({{7.0f, 6.0f},     // row 0
+                                           {3.0f, -4.0f}});  // row 1
   this->ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-6));
 }
 
@@ -200,12 +200,14 @@ class MatOpsDotAddTest
 
     TF_ASSERT_OK_AND_ASSIGN(
         auto lhs_handle,
-        client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
-            lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
+        client_->TransferToServer(
+            *LiteralUtil::CreateR2FromArray2DWithLayout<T>(
+                lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
     TF_ASSERT_OK_AND_ASSIGN(
         auto rhs_handle,
-        client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
-            rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
+        client_->TransferToServer(
+            *LiteralUtil::CreateR2FromArray2DWithLayout<T>(
+                rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
 
     XlaBuilder builder(TestName());
     auto lhs_arg = Parameter(&builder, 0, lhs_shape, "lhs");
diff --git a/tensorflow/compiler/xla/tests/multidimensional_slice_test.cc b/tensorflow/compiler/xla/tests/multidimensional_slice_test.cc
index e576f000ef..955dbef6dc 100644
--- a/tensorflow/compiler/xla/tests/multidimensional_slice_test.cc
+++ b/tensorflow/compiler/xla/tests/multidimensional_slice_test.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
diff --git a/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc b/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc
index 6597748c8d..eb06b115da 100644
--- a/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc
+++ b/tensorflow/compiler/xla/tests/multioutput_fusion_test.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include <utility>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_computation.h"
@@ -60,7 +60,7 @@ class MultiOutputFusionTest : public HloTestBase {
     const Shape elem_shape2 = ShapeUtil::MakeShape(F32, {size, size});
 
     auto const0 = builder.AddInstruction(
-        HloInstruction::CreateConstant(Literal::CreateR0<float>(8.0f)));
+        HloInstruction::CreateConstant(LiteralUtil::CreateR0<float>(8.0f)));
     auto param0 = builder.AddInstruction(
         HloInstruction::CreateParameter(0, elem_shape0, "0"));
 
@@ -105,8 +105,9 @@ class MultiOutputFusionTest : public HloTestBase {
 
     Literal expect(ShapeUtil::MakeShape(F32, {size, size}));
     expect.PopulateWithValue<float>(size * 1.5f * 3.5f);
-    auto actual = ExecuteAndTransfer(
-        std::move(hlo_module), {Literal::CreateR0<float>(-9.0f).get(), &arg1});
+    auto actual =
+        ExecuteAndTransfer(std::move(hlo_module),
+                           {LiteralUtil::CreateR0<float>(-9.0f).get(), &arg1});
     EXPECT_TRUE(LiteralTestUtil::Near(expect, *actual, error_spec_));
   }
 
@@ -165,7 +166,8 @@ class MultiOutputFusionTest : public HloTestBase {
     Literal input1(ShapeUtil::MakeShape(F64, {size}));
     input1.PopulateWithValue(1.);
 
-    Literal expect = std::move(*Literal::CreateR1<float>({size * 1.5f * 3.5f}));
+    Literal expect =
+        std::move(*LiteralUtil::CreateR1<float>({size * 1.5f * 3.5f}));
     auto actual = ExecuteAndTransfer(std::move(hlo_module), {&input0, &input1});
     EXPECT_TRUE(LiteralTestUtil::Near(expect, *actual, error_spec_));
   }
@@ -198,16 +200,16 @@ XLA_TEST_F(MultiOutputFusionTest, FusionNodeIsRoot) {
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::MakeTupleOwned(
-      Literal::MakeTupleOwned(
-          Literal::MakeTupleOwned(Literal::CreateR0<int32>(42)),
-          Literal::CreateR0<float>(1.0)),
-      Literal::MakeTupleOwned(Literal::CreateR0<float>(3.0),
-                              Literal::CreateR0<int32>(4)));
+  auto param = LiteralUtil::MakeTupleOwned(
+      LiteralUtil::MakeTupleOwned(
+          LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR0<int32>(42)),
+          LiteralUtil::CreateR0<float>(1.0)),
+      LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR0<float>(3.0),
+                                  LiteralUtil::CreateR0<int32>(4)));
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(Literal::CreateR0<int32>(42)), *result));
+      *LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR0<int32>(42)), *result));
 }
 
 XLA_TEST_F(MultiOutputFusionTest, MultiOutputLoopFusion) {
@@ -232,7 +234,7 @@ XLA_TEST_F(MultiOutputFusionTest, MultiOutputLoopFusion) {
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR1<float>({1.0, 2.0, 3.0, -1.0});
+  auto param = LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0, -1.0});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   LiteralTestUtil::ExpectR1Equal<float>({0.0, 4.0, 9.0, 1.0}, *result);
@@ -265,7 +267,7 @@ XLA_TEST_F(MultiOutputFusionTest, MultiOutputLoopFeedingMap) {
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR1<float>({1.0, 2.0, 3.0});
+  auto param = LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   LiteralTestUtil::ExpectR1Equal<float>({0.0, 4.0, 9.0}, *result);
@@ -308,12 +310,14 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto param =
+      LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(Literal::CreateR2<float>({{3, 7}, {11, 15}}),
-                               Literal::CreateR2<float>({{5, 16}, {36, 64}})),
+      *LiteralUtil::MakeTupleOwned(
+          LiteralUtil::CreateR2<float>({{3, 7}, {11, 15}}),
+          LiteralUtil::CreateR2<float>({{5, 16}, {36, 64}})),
       *result));
 }
 
@@ -338,12 +342,14 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto param =
+      LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(Literal::CreateR2<float>({{6, 8}, {10, 12}}),
-                               Literal::CreateR2<float>({{25, 36}, {49, 64}})),
+      *LiteralUtil::MakeTupleOwned(
+          LiteralUtil::CreateR2<float>({{6, 8}, {10, 12}}),
+          LiteralUtil::CreateR2<float>({{25, 36}, {49, 64}})),
       *result));
 }
 
@@ -369,13 +375,14 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto param =
+      LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(Literal::CreateR1<float>({14, 22}),
-                               Literal::CreateR1<float>({36, 64}),
-                               Literal::CreateR1<float>({66, 138})),
+      *LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR1<float>({14, 22}),
+                                   LiteralUtil::CreateR1<float>({36, 64}),
+                                   LiteralUtil::CreateR1<float>({66, 138})),
       *result));
 }
 
@@ -401,14 +408,15 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto param =
+      LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(
-          Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}),
-          Literal::CreateR2<float>({{3, 7}, {11, 15}}),
-          Literal::CreateR2<float>({{5, 16}, {36, 64}})),
+      *LiteralUtil::MakeTupleOwned(
+          LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}),
+          LiteralUtil::CreateR2<float>({{3, 7}, {11, 15}}),
+          LiteralUtil::CreateR2<float>({{5, 16}, {36, 64}})),
       *result));
 }
 
@@ -434,14 +442,16 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto param =
+      LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(
-          Literal::CreateR2<float>({{6, 8}, {10, 12}}),
-          Literal::CreateR3<float>({{{1, 4}, {9, 16}}, {{25, 36}, {49, 64}}}),
-          Literal::CreateR2<float>({{25, 36}, {49, 64}})),
+      *LiteralUtil::MakeTupleOwned(
+          LiteralUtil::CreateR2<float>({{6, 8}, {10, 12}}),
+          LiteralUtil::CreateR3<float>(
+              {{{1, 4}, {9, 16}}, {{25, 36}, {49, 64}}}),
+          LiteralUtil::CreateR2<float>({{25, 36}, {49, 64}})),
       *result));
 }
 
@@ -468,14 +478,16 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto param =
+      LiteralUtil::CreateR3<float>({{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(
-          Literal::CreateR1<float>({14, 22}),
-          Literal::CreateR3<float>({{{1, 4}, {9, 16}}, {{25, 36}, {49, 64}}}),
-          Literal::CreateR3<float>(
+      *LiteralUtil::MakeTupleOwned(
+          LiteralUtil::CreateR1<float>({14, 22}),
+          LiteralUtil::CreateR3<float>(
+              {{{1, 4}, {9, 16}}, {{25, 36}, {49, 64}}}),
+          LiteralUtil::CreateR3<float>(
               {{{5, 10}, {15, 20}}, {{25, 30}, {35, 40}}})),
       *result));
 }
@@ -502,15 +514,16 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<float>({{{0, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
-  auto init1 = Literal::CreateR0<float>(5);
-  auto init2 = Literal::CreateR0<float>(6);
+  auto param =
+      LiteralUtil::CreateR3<float>({{{0, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
+  auto init1 = LiteralUtil::CreateR0<float>(5);
+  auto init2 = LiteralUtil::CreateR0<float>(6);
   std::unique_ptr<Literal> result = ExecuteNoHloPasses(
       std::move(module), {param.get(), init1.get(), init2.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(
-          Literal::CreateR2<float>({{167, 172}, {176, 180}}),
-          Literal::CreateR2<float>({{6, 6}, {6, 8}})),
+      *LiteralUtil::MakeTupleOwned(
+          LiteralUtil::CreateR2<float>({{167, 172}, {176, 180}}),
+          LiteralUtil::CreateR2<float>({{6, 6}, {6, 8}})),
       *result));
 }
 
@@ -537,19 +550,20 @@ XLA_TEST_F(MultiOutputFusionTest,
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::CreateR3<Eigen::half>(
+  auto param = LiteralUtil::CreateR3<Eigen::half>(
       {{{Eigen::half(1), Eigen::half(2)}, {Eigen::half(3), Eigen::half(4)}},
        {{Eigen::half(5), Eigen::half(6)}, {Eigen::half(7), Eigen::half(8)}}});
   std::unique_ptr<Literal> result =
       ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(
-          Literal::CreateR2<float>({{3, 7}, {11, 15}}),
-          Literal::CreateR2<float>({{5, 16}, {36, 64}}),
-          Literal::CreateR3<Eigen::half>({{{Eigen::half(1), Eigen::half(2)},
-                                           {Eigen::half(3), Eigen::half(4)}},
-                                          {{Eigen::half(5), Eigen::half(6)},
-                                           {Eigen::half(7), Eigen::half(8)}}})),
+      *LiteralUtil::MakeTupleOwned(
+          LiteralUtil::CreateR2<float>({{3, 7}, {11, 15}}),
+          LiteralUtil::CreateR2<float>({{5, 16}, {36, 64}}),
+          LiteralUtil::CreateR3<Eigen::half>(
+              {{{Eigen::half(1), Eigen::half(2)},
+                {Eigen::half(3), Eigen::half(4)}},
+               {{Eigen::half(5), Eigen::half(6)},
+                {Eigen::half(7), Eigen::half(8)}}})),
       *result));
 }
 
diff --git a/tensorflow/compiler/xla/tests/outfeed_in_nested_computation_test.cc b/tensorflow/compiler/xla/tests/outfeed_in_nested_computation_test.cc
new file mode 100644
index 0000000000..0a0426adcb
--- /dev/null
+++ b/tensorflow/compiler/xla/tests/outfeed_in_nested_computation_test.cc
@@ -0,0 +1,169 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/tests/local_client_test_base.h"
+#include "tensorflow/compiler/xla/tests/test_macros.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+
+namespace xla {
+namespace {
+
+// Tests that ensure outfeed instructions that are contained in nested
+// computations in non-root positions are executed.
+
+class OutfeedInNestedComputationTest : public LocalClientTestBase {};
+
+XLA_TEST_F(OutfeedInNestedComputationTest, OutfeedInWhile) {
+  XlaBuilder b(TestName());
+
+  Shape state_tuple_array_shape = ShapeUtil::MakeShape(xla::S32, {10, 5});
+  Shape int_shape = ShapeUtil::MakeShape(xla::S32, {});
+  Shape state_tuple_shape =
+      ShapeUtil::MakeTupleShape({int_shape, state_tuple_array_shape});
+  Shape xfeed_shape = ShapeUtil::MakeShape(xla::S32, {2});
+
+  XlaOp some_buffer = Broadcast(ConstantR0<int32_t>(&b, 0), {10, 5});
+  XlaOp num_iter = Infeed(&b, int_shape);
+  XlaOp init_tuple = Tuple(&b, {num_iter, some_buffer});
+
+  TF_ASSERT_OK_AND_ASSIGN(XlaComputation loop_cond, [&] {
+    // Condition: iteration variable > 0
+    XlaBuilder cond_builder("loop_condition");
+    XlaOp state_tuple = Parameter(&cond_builder, 0, state_tuple_shape, "state");
+    XlaOp loop_counter = GetTupleElement(state_tuple, 0);
+    Outfeed(loop_counter, int_shape, "");
+    Gt(loop_counter, ConstantR0<int32_t>(&cond_builder, 0));
+    return cond_builder.Build();
+  }());
+
+  TF_ASSERT_OK_AND_ASSIGN(XlaComputation loop_body, [&] {
+    XlaBuilder body_builder("loop_body");
+    XlaOp state_tuple = Parameter(&body_builder, 0, state_tuple_shape, "state");
+    XlaOp loop_counter = GetTupleElement(state_tuple, 0);
+    XlaOp buffer_inside = GetTupleElement(state_tuple, 1);
+
+    // Read some stuff from Infeed.
+    XlaOp some_input = Infeed(&body_builder, xfeed_shape);
+    XlaOp sum = Add(some_input, Broadcast(loop_counter, {2}));
+    Outfeed(sum, xfeed_shape, "");
+
+    XlaOp iter_left = Sub(loop_counter, ConstantR0<int32_t>(&body_builder, 1));
+
+    Tuple(&body_builder, {iter_left, buffer_inside});
+    return body_builder.Build();
+  }());
+
+  // Build loop.
+  XlaOp result_tuple = While(loop_cond, loop_body, init_tuple);
+  GetTupleElement(result_tuple, 0);
+  TF_ASSERT_OK_AND_ASSIGN(XlaComputation computation, b.Build());
+
+  std::unique_ptr<xla::Literal> comp_result;
+  std::unique_ptr<tensorflow::Thread> thread(
+      tensorflow::Env::Default()->StartThread(
+          tensorflow::ThreadOptions(), "execute_thread", [&] {
+            comp_result = local_client_->ExecuteAndTransfer(computation, {})
+                              .ConsumeValueOrDie();
+          }));
+
+  VLOG(1) << "Transferring trip count to computation";
+  // Transfer number of iterations to Infeed.
+  TF_ASSERT_OK(
+      local_client_->TransferToInfeed(*LiteralUtil::CreateR0<int32_t>(1)));
+
+  // Pick up value from outfeed
+  {
+    VLOG(1) << "Reading from condition outfeed";
+    TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> r,
+                            local_client_->TransferFromOutfeed(&int_shape));
+    EXPECT_EQ(r->Get<int32>({}), 1);
+  }
+
+  VLOG(1) << "Writing data to infeed";
+  // Transfer some stuff to Infeed for use inside of loop.
+  TF_ASSERT_OK(local_client_->TransferToInfeed(
+      *LiteralUtil::CreateR1<int32_t>({10, 20})));
+
+  // Pick up value from outfeed
+  {
+    VLOG(1) << "Reading from body outfeed";
+    TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> r,
+                            local_client_->TransferFromOutfeed(&xfeed_shape));
+    EXPECT_EQ(r->Get<int32>({0}), 11);
+    EXPECT_EQ(r->Get<int32>({1}), 21);
+  }
+
+  {
+    VLOG(1) << "Reading from condition outfeed";
+    TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> r,
+                            local_client_->TransferFromOutfeed(&int_shape));
+    EXPECT_EQ(r->Get<int32>({}), 0);
+  }
+
+  // Joins the thread
+  thread.reset();
+
+  EXPECT_EQ(comp_result->Get<int32>({}), 0);
+}
+
+XLA_TEST_F(OutfeedInNestedComputationTest, OutfeedInConditional) {
+  XlaBuilder b(TestName());
+
+  Shape condition_shape = ShapeUtil::MakeShape(xla::PRED, {});
+  Shape result_shape = ShapeUtil::MakeShape(xla::PRED, {});
+
+  TF_ASSERT_OK_AND_ASSIGN(XlaComputation true_computation, [&] {
+    XlaBuilder inner_builder("true_computation");
+    XlaOp param = Parameter(&inner_builder, 0, result_shape, "param");
+    Outfeed(param, result_shape, "");
+    Or(param, param);
+    return inner_builder.Build();
+  }());
+
+  TF_ASSERT_OK_AND_ASSIGN(XlaComputation false_computation, [&] {
+    XlaBuilder inner_builder("false_computation");
+    Parameter(&inner_builder, 0, result_shape, "param");
+    return inner_builder.Build();
+  }());
+
+  XlaOp pred = Infeed(&b, condition_shape);
+  Conditional(/*predicate=*/pred, /*true_operand=*/pred,
+              /*true_computation=*/true_computation, /*false_operand=*/pred,
+              /*false_computation=*/false_computation);
+
+  TF_ASSERT_OK_AND_ASSIGN(XlaComputation computation, b.Build());
+
+  std::unique_ptr<xla::Literal> comp_result;
+  std::unique_ptr<tensorflow::Thread> thread(
+      tensorflow::Env::Default()->StartThread(
+          tensorflow::ThreadOptions(), "execute_thread", [&] {
+            comp_result = local_client_->ExecuteAndTransfer(computation, {})
+                              .ConsumeValueOrDie();
+          }));
+
+  TF_ASSERT_OK(
+      local_client_->TransferToInfeed(*LiteralUtil::CreateR0<bool>(true)));
+
+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> r,
+                          local_client_->TransferFromOutfeed(&result_shape));
+
+  EXPECT_EQ(r->Get<bool>({}), true);
+
+  // Join the thread
+  thread.reset();
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/pad_test.cc b/tensorflow/compiler/xla/tests/pad_test.cc
index 2e5081bbcb..ca21b0b2ba 100644
--- a/tensorflow/compiler/xla/tests/pad_test.cc
+++ b/tensorflow/compiler/xla/tests/pad_test.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@@ -93,8 +93,8 @@ XLA_TEST_P(PadTestFloat, Pad1DS0ToS0Array) {
   dimension->set_edge_padding_high(0);
   dimension->set_interior_padding(0);
 
-  Pad(AddParam(*Literal::CreateR1<float>({}), &b),
-      AddParam(*Literal::CreateR0<float>(0.1), &b), padding_config);
+  Pad(AddParam(*LiteralUtil::CreateR1<float>({}), &b),
+      AddParam(*LiteralUtil::CreateR0<float>(0.1), &b), padding_config);
   ComputeAndCompareR1<float>(&b, {}, {}, DefaultErrorSpec());
 }
 
@@ -108,8 +108,8 @@ XLA_TEST_P(PadTestFloat, Pad1DS0ToS5Array) {
   dimension->set_edge_padding_high(4);
   dimension->set_interior_padding(7);
 
-  Pad(AddParam(*Literal::CreateR1<float>({}), &b),
-      AddParam(*Literal::CreateR0<float>(0.1), &b), padding_config);
+  Pad(AddParam(*LiteralUtil::CreateR1<float>({}), &b),
+      AddParam(*LiteralUtil::CreateR0<float>(0.1), &b), padding_config);
   ComputeAndCompareR1<float>(&b, std::vector<float>(5, 0.1), {},
                              DefaultErrorSpec());
 }
@@ -123,8 +123,8 @@ XLA_TEST_P(PadTestFloat, Pad1DS3Array) {
   dimension->set_edge_padding_high(0);
   dimension->set_interior_padding(1);
 
-  Pad(AddParam(*Literal::CreateR1<float>({1, 2, 3}), &b),
-      AddParam(*Literal::CreateR0<float>(0.1), &b), padding_config);
+  Pad(AddParam(*LiteralUtil::CreateR1<float>({1, 2, 3}), &b),
+      AddParam(*LiteralUtil::CreateR0<float>(0.1), &b), padding_config);
   std::vector<float> expected({0.1, 0.1, 0.1, 1, 0.1, 2, 0.1, 3});
   ComputeAndCompareR1<float>(&b, expected, {}, DefaultErrorSpec());
 }
@@ -132,7 +132,8 @@ XLA_TEST_P(PadTestFloat, Pad1DS3Array) {
 XLA_TEST_P(PadTestFloat, Pad4D_2x0x3x2_FloatArray) {
   XlaBuilder b(TestName());
   Pad(AddParam(Array4D<float>(2, 0, 3, 2), &b),
-      AddParam(*Literal::CreateR0<float>(1.5), &b), r4_padding_on_dim0_dim1_);
+      AddParam(*LiteralUtil::CreateR0<float>(1.5), &b),
+      r4_padding_on_dim0_dim1_);
   ComputeAndCompareR4<float>(&b, Array4D<float>(5, 2, 3, 2, 1.5f), {},
                              DefaultErrorSpec());
 }
@@ -147,7 +148,7 @@ TEST_P(PadTestFloat, Pad4DFloat_1x1x3x2_Array) {
   });
   input->FillWithYX(input_xy);
 
-  Pad(AddParam(*input, &b), AddParam(*Literal::CreateR0<float>(1.5), &b),
+  Pad(AddParam(*input, &b), AddParam(*LiteralUtil::CreateR0<float>(1.5), &b),
       r4_padding_on_dim0_dim1_);
 
   auto expected = MakeUnique<Array4D<float>>(2, 3, 3, 2);
@@ -166,7 +167,8 @@ TEST_P(PadTestFloat, Pad4DFloatArrayWithInteriorPadding) {
 
   const float pad_value = 1.5f;
   Array4D<float> input(3, 2, 1, 1, {1, 2, 3, 4, 5, 6});
-  Pad(AddParam(input, &b), AddParam(*Literal::CreateR0<float>(pad_value), &b),
+  Pad(AddParam(input, &b),
+      AddParam(*LiteralUtil::CreateR0<float>(pad_value), &b),
       r4_padding_on_dim0_dim1_);
 
   auto expected = MakeUnique<Array4D<float>>(8, 5, 1, 1);
@@ -205,11 +207,11 @@ TEST_P(PadTestFloat, Pad4DFloatArrayMinorFirstSmall) {
 
   const float pad_value = -5.123f;
   Array4D<float> input_array(1, 1, 2, 3, {1, 2, 3, 4, 5, 6});
-  auto input = Literal::CreateR4FromArray4D<float>(input_array);
+  auto input = LiteralUtil::CreateR4FromArray4D<float>(input_array);
   input = input->Relayout(layout);
 
-  Pad(AddParam(*input, &b), AddParam(*Literal::CreateR0<float>(pad_value), &b),
-      padding_config);
+  Pad(AddParam(*input, &b),
+      AddParam(*LiteralUtil::CreateR0<float>(pad_value), &b), padding_config);
 
   Array4D<float> expected_array(1, 1, 5, 8);
   expected_array.Fill(pad_value);
@@ -251,11 +253,11 @@ XLA_TEST_P(PadTestFloat, Pad4DFloatArrayMinorFirstNonTrivialMinorDimensions) {
   input_array(0, 0, 0, 0) = 1.0f;
   input_array(0, 24, 6, 6) = 2.0f;
   input_array(0, 17, 2, 5) = 3.0f;
-  auto input = Literal::CreateR4FromArray4D<float>(input_array);
+  auto input = LiteralUtil::CreateR4FromArray4D<float>(input_array);
   input = input->Relayout(layout);
 
-  Pad(AddParam(*input, &b), AddParam(*Literal::CreateR0<float>(pad_value), &b),
-      padding_config);
+  Pad(AddParam(*input, &b),
+      AddParam(*LiteralUtil::CreateR0<float>(pad_value), &b), padding_config);
 
   Array4D<float> expected_array(1, 25, 17, 11);
   expected_array.Fill(pad_value);
@@ -329,7 +331,7 @@ XLA_TEST_P(PadTestFloat, Large2DPad) {
     padding_config.mutable_dimensions(dim)->set_edge_padding_high(58 +
                                                                   100 * dim);
   }
-  Pad(input, AddParam(*Literal::CreateR0<float>(0.0f), &b), padding_config);
+  Pad(input, AddParam(*LiteralUtil::CreateR0<float>(0.0f), &b), padding_config);
 
   auto expected = ReferenceUtil::PadArray2D(*ones, padding_config, 0.0f);
   ComputeAndCompareR2<float>(&b, *expected, {}, DefaultErrorSpec());
@@ -351,7 +353,8 @@ XLA_TEST_P(PadTestFloat, AllTypes2DPad) {
   padding_config.mutable_dimensions(1)->set_edge_padding_low(6);
   padding_config.mutable_dimensions(1)->set_edge_padding_high(4);
   padding_config.mutable_dimensions(1)->set_interior_padding(2);
-  Pad(input, AddParam(*Literal::CreateR0<float>(3.14f), &b), padding_config);
+  Pad(input, AddParam(*LiteralUtil::CreateR0<float>(3.14f), &b),
+      padding_config);
 
   auto expected = ReferenceUtil::PadArray2D(*operand, padding_config, 3.14f);
   ComputeAndCompareR2<float>(&b, *expected, {}, DefaultErrorSpec());
@@ -376,7 +379,8 @@ XLA_TEST_P(PadTestFloat, High2DPad) {
     padding_config.mutable_dimensions(dim)->set_interior_padding(
         interior_padding);
   }
-  Pad(input, AddParam(*Literal::CreateR0<float>(2.718f), &b), padding_config);
+  Pad(input, AddParam(*LiteralUtil::CreateR0<float>(2.718f), &b),
+      padding_config);
 
   auto expected = ReferenceUtil::PadArray2D(*operand, padding_config, 2.718f);
 
@@ -403,7 +407,8 @@ XLA_TEST_P(PadTestFloat, NegativePadding2D) {
     padding_config.mutable_dimensions(dim)->set_interior_padding(
         interior_padding);
   }
-  Pad(input, AddParam(*Literal::CreateR0<float>(2.718f), &b), padding_config);
+  Pad(input, AddParam(*LiteralUtil::CreateR0<float>(2.718f), &b),
+      padding_config);
 
   auto expected = ReferenceUtil::PadArray2D(*operand, padding_config, 2.718f);
 
@@ -430,7 +435,8 @@ XLA_TEST_P(PadTestFloat, NegativeAndInteriorPadding2D) {
     padding_config.mutable_dimensions(dim)->set_interior_padding(
         interior_padding[dim]);
   }
-  Pad(input, AddParam(*Literal::CreateR0<float>(2.718f), &b), padding_config);
+  Pad(input, AddParam(*LiteralUtil::CreateR0<float>(2.718f), &b),
+      padding_config);
 
   auto expected = ReferenceUtil::PadArray2D(*operand, padding_config, 2.718f);
 
@@ -446,12 +452,13 @@ XLA_TEST_P(PadTestFloat, ReducePad) {
 
   XlaComputation add = CreateScalarAddComputation(FloatType(), &b);
   auto reduce =
-      Reduce(input, AddParam(*Literal::CreateR0<float>(0.0), &b), add, {0});
+      Reduce(input, AddParam(*LiteralUtil::CreateR0<float>(0.0), &b), add, {0});
 
   PaddingConfig padding_config = MakeNoPaddingConfig(3);
   padding_config.mutable_dimensions(0)->set_edge_padding_low(1);
   padding_config.mutable_dimensions(0)->set_edge_padding_high(1);
-  Pad(reduce, AddParam(*Literal::CreateR0<float>(0.0f), &b), padding_config);
+  Pad(reduce, AddParam(*LiteralUtil::CreateR0<float>(0.0f), &b),
+      padding_config);
 
   Array3D<float> expected({{{0.0, 0.0}, {0.0, 0.0}},
                            {{2.0, 2.0}, {2.0, 2.0}},
diff --git a/tensorflow/compiler/xla/tests/params_test.cc b/tensorflow/compiler/xla/tests/params_test.cc
index 2620063aa4..f6c762e7a4 100644
--- a/tensorflow/compiler/xla/tests/params_test.cc
+++ b/tensorflow/compiler/xla/tests/params_test.cc
@@ -21,10 +21,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@@ -42,7 +42,8 @@ class ParamsTest : public ClientLibraryTestBase {};
 
 XLA_TEST_F(ParamsTest, ConstantR0F32Param) {
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR0<float>(3.14159f);
+  std::unique_ptr<Literal> param0_literal =
+      LiteralUtil::CreateR0<float>(3.14159f);
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -54,7 +55,7 @@ XLA_TEST_F(ParamsTest, ConstantR0F32Param) {
 
 XLA_TEST_F(ParamsTest, ConstantR1S0F32Param) {
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR1<float>({});
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR1<float>({});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -67,7 +68,7 @@ XLA_TEST_F(ParamsTest, ConstantR1S0F32Param) {
 XLA_TEST_F(ParamsTest, ConstantR1S2F32Param) {
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({3.14f, -100.25f});
+      LiteralUtil::CreateR1<float>({3.14f, -100.25f});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -80,7 +81,7 @@ XLA_TEST_F(ParamsTest, ConstantR1S2F32Param) {
 XLA_TEST_F(ParamsTest, ConstantR1U8Param) {
   XlaBuilder builder(TestName());
   string str("hello world");
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR1U8(str);
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR1U8(str);
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -94,7 +95,7 @@ XLA_TEST_F(ParamsTest, ConstantR1U8Param) {
 XLA_TEST_F(ParamsTest, ConstantR2_3x0_F32Param) {
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR2FromArray2D<float>(Array2D<float>(3, 0));
+      LiteralUtil::CreateR2FromArray2D<float>(Array2D<float>(3, 0));
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
 
@@ -106,7 +107,7 @@ XLA_TEST_F(ParamsTest, ConstantR2_3x0_F32Param) {
 
 XLA_TEST_F(ParamsTest, ConstantR2F32Param) {
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR2<float>(
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR2<float>(
       {{3.14f, -100.25f}, {7e8f, 7e-9f}, {30.3f, -100.0f}});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*param0_literal).ConsumeValueOrDie();
@@ -122,12 +123,12 @@ XLA_TEST_F(ParamsTest, ConstantR2F32Param) {
 XLA_TEST_F(ParamsTest, TwoParameters) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>({1, 2});
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>({1, 2});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
   auto param0 = Parameter(&builder, 0, literal0->shape(), "param0");
 
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>({10, 20});
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>({10, 20});
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
   auto param1 = Parameter(&builder, 1, literal1->shape(), "param1");
@@ -153,7 +154,7 @@ XLA_TEST_F(ParamsTest, TwoParameters) {
 XLA_TEST_F(ParamsTest, MissingParameter) {
   // Test that an error is returned when a computation with an incomplete set of
   // parameters (parameter numbers not contiguous from 0) is executed.
-  std::unique_ptr<Literal> literal = Literal::CreateR0<float>(3.14159f);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR0<float>(3.14159f);
   std::unique_ptr<GlobalData> data =
       client_->TransferToServer(*literal).ConsumeValueOrDie();
 
@@ -167,12 +168,12 @@ XLA_TEST_F(ParamsTest, MissingParameter) {
 XLA_TEST_F(ParamsTest, UnusedParameter) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>({1, 2});
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>({1, 2});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
   Parameter(&builder, 0, literal0->shape(), "param0");
 
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>({10, 20});
+  std::unique_ptr<Literal> literal1 = LiteralUtil::CreateR1<float>({10, 20});
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
   Parameter(&builder, 1, literal1->shape(), "param1");
@@ -187,11 +188,12 @@ XLA_TEST_F(ParamsTest, UnusedParametersInUnusedExpression) {
   // unused expression.
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> literal0 = Literal::CreateR1<float>({1, 2});
+  std::unique_ptr<Literal> literal0 = LiteralUtil::CreateR1<float>({1, 2});
   std::unique_ptr<GlobalData> param0_data =
       client_->TransferToServer(*literal0).ConsumeValueOrDie();
 
-  std::unique_ptr<Literal> literal1 = Literal::CreateR1<float>({10, 20, 30});
+  std::unique_ptr<Literal> literal1 =
+      LiteralUtil::CreateR1<float>({10, 20, 30});
   std::unique_ptr<GlobalData> param1_data =
       client_->TransferToServer(*literal1).ConsumeValueOrDie();
 
@@ -231,7 +233,7 @@ XLA_TEST_F(ParamsTest, HundredLargeR1Parameters) {
 
     std::vector<float> sum_value = {{entry0, entry1}};
     sum_value.resize(size);
-    std::unique_ptr<Literal> literal = Literal::CreateR1<float>(sum_value);
+    std::unique_ptr<Literal> literal = LiteralUtil::CreateR1<float>(sum_value);
     param_data_owner.push_back(
         client_->TransferToServer(*literal).ConsumeValueOrDie());
     XlaOp param = Parameter(&builder, i, literal->shape(), "param");
@@ -266,7 +268,7 @@ XLA_TEST_F(ParamsTest,
   constexpr int kParamCount = 3000;
   for (int i = 0; i < kParamCount; ++i) {
     target += i;
-    std::unique_ptr<Literal> literal = Literal::CreateR0<float>(i);
+    std::unique_ptr<Literal> literal = LiteralUtil::CreateR0<float>(i);
     param_data_owner.push_back(
         std::move(client_->TransferToServer(*literal)).ValueOrDie());
     XlaOp param = Parameter(&builder, i, literal->shape(), "param");
@@ -298,7 +300,7 @@ XLA_TEST_F(ParamsTest, DISABLED_ON_CPU(DISABLED_ON_GPU(
   std::vector<XlaOp> params;
   for (int i = 0; i < kParamCount; ++i) {
     target += i;
-    std::unique_ptr<Literal> literal = Literal::CreateR1<int32>({i, i});
+    std::unique_ptr<Literal> literal = LiteralUtil::CreateR1<int32>({i, i});
     param_data_owner.push_back(
         std::move(client_->TransferToServer(*literal)).ValueOrDie());
     XlaOp param = Parameter(&builder, i, literal->shape(), "param");
@@ -322,10 +324,10 @@ XLA_TEST_F(ParamsTest, DISABLED_ON_CPU(DISABLED_ON_GPU(
   std::vector<std::unique_ptr<Literal>> elements;
   std::vector<const Literal*> ptrs;
   for (int i = 0; i < kParamCount; ++i) {
-    elements.push_back(Literal::CreateR1<int32>({target + i, target + i}));
+    elements.push_back(LiteralUtil::CreateR1<int32>({target + i, target + i}));
     ptrs.push_back(elements.back().get());
   }
-  ComputeAndCompareTuple(&builder, *Literal::MakeTuple(ptrs), param_data);
+  ComputeAndCompareTuple(&builder, *LiteralUtil::MakeTuple(ptrs), param_data);
 }
 
 // Test large number of parameters flowing into a while-loop.
@@ -354,7 +356,7 @@ XLA_TEST_F(ParamsTest,
   std::vector<XlaOp> params;
   std::vector<Shape> parameter_shapes;
   for (int i = 0; i < kParamCount; ++i) {
-    std::unique_ptr<Literal> literal = Literal::CreateR1<int32>({i, i});
+    std::unique_ptr<Literal> literal = LiteralUtil::CreateR1<int32>({i, i});
     param_data_owner.push_back(
         std::move(client_->TransferToServer(*literal)).ValueOrDie());
     XlaOp param = Parameter(&builder, i, literal->shape(), "param");
@@ -364,7 +366,7 @@ XLA_TEST_F(ParamsTest,
 
   // Add bool parameter for the loop condition. Use a parameter HLO instead of a
   // constant because DCE may eliminate the while-body otherwise.
-  std::unique_ptr<Literal> bool_literal = Literal::CreateR0<bool>(false);
+  std::unique_ptr<Literal> bool_literal = LiteralUtil::CreateR0<bool>(false);
   param_data_owner.push_back(
       std::move(client_->TransferToServer(*bool_literal)).ValueOrDie());
   XlaOp bool_param =
@@ -421,10 +423,10 @@ XLA_TEST_F(ParamsTest,
   std::vector<std::unique_ptr<Literal>> elements;
   std::vector<const Literal*> ptrs;
   for (int i = 0; i < kParamCount; ++i) {
-    elements.push_back(Literal::CreateR1<int32>({i, i}));
+    elements.push_back(LiteralUtil::CreateR1<int32>({i, i}));
     ptrs.push_back(elements.back().get());
   }
-  ComputeAndCompareTuple(&builder, *Literal::MakeTuple(ptrs), param_data);
+  ComputeAndCompareTuple(&builder, *LiteralUtil::MakeTuple(ptrs), param_data);
 }
 
 #endif
@@ -441,9 +443,9 @@ XLA_TEST_F(ParamsTest, TupleOfR1ParametersAddedTogether) {
 
   std::unique_ptr<GlobalData> data =
       client_
-          ->TransferToServer(*Literal::MakeTuple({
-              Literal::CreateR1<float>({1, 2, 3}).get(),
-              Literal::CreateR1<float>({4, 5, 6}).get(),
+          ->TransferToServer(*LiteralUtil::MakeTuple({
+              LiteralUtil::CreateR1<float>({1, 2, 3}).get(),
+              LiteralUtil::CreateR1<float>({4, 5, 6}).get(),
           }))
           .ConsumeValueOrDie();
 
@@ -455,7 +457,7 @@ XLA_TEST_F(ParamsTest, TupleOfR1ParametersAddedTogether) {
 // Verifies that passing a 2x2 with {0, 1} layout returns the same value back
 // when (transferred to the server and) passed through a parameter.
 XLA_TEST_F(ParamsTest, R2_2x2_Layout_01) {
-  std::unique_ptr<Literal> literal = Literal::CreateR2WithLayout<float>(
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR2WithLayout<float>(
       {{1, 2}, {3, 4}}, LayoutUtil::MakeLayout({0, 1}));
   XlaBuilder builder(TestName());
   Parameter(&builder, 0, literal->shape(), "input");
@@ -467,7 +469,7 @@ XLA_TEST_F(ParamsTest, R2_2x2_Layout_01) {
 
 // As above, but for {1, 0} layout.
 XLA_TEST_F(ParamsTest, R2_2x2_Layout_10) {
-  std::unique_ptr<Literal> literal = Literal::CreateR2WithLayout<float>(
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR2WithLayout<float>(
       {{1, 3}, {2, 4}}, LayoutUtil::MakeLayout({1, 0}));
   XlaBuilder builder(TestName());
   Parameter(&builder, 0, literal->shape(), "input");
@@ -478,7 +480,7 @@ XLA_TEST_F(ParamsTest, R2_2x2_Layout_10) {
 }
 
 XLA_TEST_F(ParamsTest, R2_2x2_TryToPassReverseLayoutToParameter) {
-  std::unique_ptr<Literal> literal = Literal::CreateR2<float>({
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR2<float>({
       {1, 3},
       {2, 4},
   });
diff --git a/tensorflow/compiler/xla/tests/pred_test.cc b/tensorflow/compiler/xla/tests/pred_test.cc
index 5c351b2d11..2fc7f816b5 100644
--- a/tensorflow/compiler/xla/tests/pred_test.cc
+++ b/tensorflow/compiler/xla/tests/pred_test.cc
@@ -19,7 +19,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
diff --git a/tensorflow/compiler/xla/tests/prng_test.cc b/tensorflow/compiler/xla/tests/prng_test.cc
index 8e163e885d..029af69573 100644
--- a/tensorflow/compiler/xla/tests/prng_test.cc
+++ b/tensorflow/compiler/xla/tests/prng_test.cc
@@ -17,8 +17,8 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -177,7 +177,7 @@ XLA_TEST_F(PrngTest, Uniformity108) {
   EXPECT_LT(UniformChiSquared(108, 256), 132.144);
 }
 XLA_TEST_F(PrngTest, Uniformity256) {
-  EXPECT_LT(UniformChiSquared(256, 256), 293.248);
+  EXPECT_LT(UniformChiSquared(256, 512), 293.248);
 }
 
 XLA_TEST_F(PrngTest, MapUsingRng) {
@@ -193,7 +193,7 @@ XLA_TEST_F(PrngTest, MapUsingRng) {
 
   XlaBuilder builder(TestName());
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR1<float>({2.2f, 5.3f, 4.4f, 5.5f});
+      LiteralUtil::CreateR1<float>({2.2f, 5.3f, 4.4f, 5.5f});
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<GlobalData> param0_data,
                           client_->TransferToServer(*param0_literal));
 
diff --git a/tensorflow/compiler/xla/tests/query_inferred_shape_test.cc b/tensorflow/compiler/xla/tests/query_inferred_shape_test.cc
index 526a38e8d1..fab2a65de1 100644
--- a/tensorflow/compiler/xla/tests/query_inferred_shape_test.cc
+++ b/tensorflow/compiler/xla/tests/query_inferred_shape_test.cc
@@ -16,7 +16,7 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
diff --git a/tensorflow/compiler/xla/tests/reduce_hlo_test.cc b/tensorflow/compiler/xla/tests/reduce_hlo_test.cc
index 9052b188ed..a080dd1732 100644
--- a/tensorflow/compiler/xla/tests/reduce_hlo_test.cc
+++ b/tensorflow/compiler/xla/tests/reduce_hlo_test.cc
@@ -95,21 +95,21 @@ XLA_TEST_P(ReduceWithLayoutTest, DISABLED_ON_GPU(Reduce)) {
   *reduce_input_shape->mutable_layout() =
       LayoutUtil::MakeLayout(reduce_layout.input_minor_to_major);
 
-  std::unique_ptr<Literal> reduce_input =
-      Literal::CreateR4<float>({{ /*i0=0*/
-                                 {/*i1=0*/
-                                  {-0.246092796, -0.179497838, -0.161181688},
-                                  {-0.151643038, -0.240213156, -0.198156}},
-                                 {/*i1=1*/
-                                  {-0.14222312, -0.162200093, -0.193907976},
-                                  {-0.239411, -0.198166847, -0.172471642}}},
-                                { /*i0=1*/
-                                 {/*i1=0*/
-                                  {-0.22965157, -0.218723893, -0.129257083},
-                                  {-0.188762426, -0.16123569, -0.181166649}},
-                                 {/*i1=1*/
-                                  {-0.241772294, -0.245131493, -0.160247207},
-                                  {-0.179881215, -0.23383224, -0.121976733}}}});
+  std::unique_ptr<Literal> reduce_input = LiteralUtil::CreateR4<float>(
+      {{ /*i0=0*/
+        {/*i1=0*/
+         {-0.246092796, -0.179497838, -0.161181688},
+         {-0.151643038, -0.240213156, -0.198156}},
+        {/*i1=1*/
+         {-0.14222312, -0.162200093, -0.193907976},
+         {-0.239411, -0.198166847, -0.172471642}}},
+       { /*i0=1*/
+        {/*i1=0*/
+         {-0.22965157, -0.218723893, -0.129257083},
+         {-0.188762426, -0.16123569, -0.181166649}},
+        {/*i1=1*/
+         {-0.241772294, -0.245131493, -0.160247207},
+         {-0.179881215, -0.23383224, -0.121976733}}}});
 
   EXPECT_TRUE(RunAndCompareNoHloPasses(std::move(module), ErrorSpec(1e-5)));
 }
diff --git a/tensorflow/compiler/xla/tests/reduce_precision_test.cc b/tensorflow/compiler/xla/tests/reduce_precision_test.cc
index 4c1aa12106..531648fe3e 100644
--- a/tensorflow/compiler/xla/tests/reduce_precision_test.cc
+++ b/tensorflow/compiler/xla/tests/reduce_precision_test.cc
@@ -22,9 +22,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/reduce_precision_insertion.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test.h"
@@ -230,7 +230,8 @@ XLA_TEST_P(ReducePrecisionAccuracyTest, ReducePrecisionF32) {
 
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({input_values});
+  std::unique_ptr<Literal> a_literal =
+      LiteralUtil::CreateR1<float>({input_values});
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   auto a = Parameter(&builder, 0, a_literal->shape(), "a");
@@ -253,7 +254,7 @@ XLA_TEST_F(ReducePrecisionInsertionTest,
            DISABLED_ON_INTERPRETER(ReducePrecisionBeforeFusion)) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR1<float>({1.00001});
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   auto a = Parameter(&builder, 0, a_literal->shape(), "a");
@@ -282,7 +283,7 @@ XLA_TEST_F(ReducePrecisionInsertionTest,
            DISABLED_ON_INTERPRETER(ReducePrecisionSkippedAfterFusion)) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR1<float>({1.00001});
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   auto a = Parameter(&builder, 0, a_literal->shape(), "a");
@@ -308,7 +309,7 @@ XLA_TEST_F(ReducePrecisionInsertionTest,
            DISABLED_ON_INTERPRETER(ReducePrecisionAddedAfterFusion)) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR1<float>({1.00001});
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   auto a = Parameter(&builder, 0, a_literal->shape(), "a");
@@ -332,7 +333,7 @@ XLA_TEST_F(ReducePrecisionInsertionTest,
            DISABLED_ON_INTERPRETER(ReducePrecisionSkippedFusionContains)) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR1<float>({1.00001});
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   auto a = Parameter(&builder, 0, a_literal->shape(), "a");
@@ -357,7 +358,7 @@ XLA_TEST_F(ReducePrecisionInsertionTest,
            DISABLED_ON_INTERPRETER(ReducePrecisionAddedFusionContains)) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR1<float>({1.00001});
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   auto a = Parameter(&builder, 0, a_literal->shape(), "a");
diff --git a/tensorflow/compiler/xla/tests/reduce_test.cc b/tensorflow/compiler/xla/tests/reduce_test.cc
index c9f57cbb16..2065271a7f 100644
--- a/tensorflow/compiler/xla/tests/reduce_test.cc
+++ b/tensorflow/compiler/xla/tests/reduce_test.cc
@@ -37,8 +37,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/reference_util.h"
@@ -67,12 +67,12 @@ class ReduceTest : public ClientLibraryTestBase {
   ReduceTest() {
     // Implementation note: laid out z >> y >> x by default.
     // clang-format off
-    literal_2d_ = Literal::CreateR2<float>({
+    literal_2d_ = LiteralUtil::CreateR2<float>({
       // x0   x1   x2
       { 1.f, 2.f, 3.f},  // y0
       { 4.f, 5.f, 6.f},  // y1
     });
-    literal_3d_ = Literal::CreateR3Projected<float>({
+    literal_3d_ = LiteralUtil::CreateR3Projected<float>({
       // x0   x1   x2
       { 1.f, 2.f, 3.f},  // y0
       { 4.f, 5.f, 6.f},  // y1
@@ -101,7 +101,7 @@ class ReduceTest : public ClientLibraryTestBase {
       }
     }
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateR1(AsSlice(input_data));
+        LiteralUtil::CreateR1(AsSlice(input_data));
     std::unique_ptr<GlobalData> input_global_data =
         client_->TransferToServer(*input_literal).ConsumeValueOrDie();
 
@@ -125,15 +125,15 @@ class ReduceTest : public ClientLibraryTestBase {
     XlaComputation reduce;
     if (and_reduce) {
       init_value = ConstantR0<bool>(&builder, true);
-      reduce = CreateScalarAndComputation(&builder);
+      reduce = CreateScalarAndComputation(PRED, &builder);
     } else {
       init_value = ConstantR0<bool>(&builder, false);
-      reduce = CreateScalarOrComputation(&builder);
+      reduce = CreateScalarOrComputation(PRED, &builder);
     }
     Reduce(pred_values, init_value, reduce,
            /*dimensions_to_reduce=*/{0});
 
-    std::unique_ptr<Literal> input_literal = Literal::CreateR1(input_data);
+    std::unique_ptr<Literal> input_literal = LiteralUtil::CreateR1(input_data);
     std::unique_ptr<GlobalData> input_global_data =
         client_->TransferToServer(*input_literal).ConsumeValueOrDie();
 
@@ -163,10 +163,10 @@ class ReduceTest : public ClientLibraryTestBase {
     XlaComputation reduce_op;
     if (and_reduce) {
       init_value = ConstantR0<bool>(&builder, true);
-      reduce_op = CreateScalarAndComputation(&builder);
+      reduce_op = CreateScalarAndComputation(PRED, &builder);
     } else {
       init_value = ConstantR0<bool>(&builder, false);
-      reduce_op = CreateScalarOrComputation(&builder);
+      reduce_op = CreateScalarOrComputation(PRED, &builder);
     }
 
     Reduce(input_pred, init_value, reduce_op,
@@ -175,7 +175,7 @@ class ReduceTest : public ClientLibraryTestBase {
     Array2D<uint8> input_data(rows, cols);
     input_data.FillRandom(0, 1);
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateR2FromArray2D(input_data);
+        LiteralUtil::CreateR2FromArray2D(input_data);
     input_literal =
         input_literal->Relayout(LayoutUtil::MakeLayout({minor, major}));
     std::unique_ptr<GlobalData> input_global_data =
@@ -209,7 +209,7 @@ class ReduceTest : public ClientLibraryTestBase {
     Array2D<float> input_data(rows, cols);
     input_data.FillRandom(3.14f, 0.04);
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateR2FromArray2D(input_data);
+        LiteralUtil::CreateR2FromArray2D(input_data);
     input_literal =
         input_literal->Relayout(LayoutUtil::MakeLayout({minor, major}));
     std::unique_ptr<GlobalData> input_global_data =
@@ -237,7 +237,7 @@ class ReduceTest : public ClientLibraryTestBase {
     Array2D<float> input_data(rows, cols);
     input_data.FillRandom(3.14f, 0.04);
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateR2FromArray2D(input_data);
+        LiteralUtil::CreateR2FromArray2D(input_data);
     input_literal =
         input_literal->Relayout(LayoutUtil::MakeLayout({minor, major}));
     std::unique_ptr<GlobalData> input_global_data =
@@ -295,7 +295,7 @@ class ReduceTest : public ClientLibraryTestBase {
     Array2D<NativeT> input_data(rows, cols);
     input_data.FillUnique(initial_value);
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateR2FromArray2D(input_data);
+        LiteralUtil::CreateR2FromArray2D(input_data);
     input_literal =
         input_literal->Relayout(LayoutUtil::MakeLayout({minor, major}));
     std::unique_ptr<GlobalData> input_global_data =
@@ -450,7 +450,7 @@ XLA_TEST_F(ReduceTest, ReduceElementwiseR2_111x50_To_R1) {
   Array2D<float> input_data(rows, cols);
   input_data.FillRandom(3.14f, 0.04);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR2FromArray2D(input_data);
+      LiteralUtil::CreateR2FromArray2D(input_data);
   input_literal = input_literal->Relayout(LayoutUtil::MakeLayout({0, 1}));
   std::unique_ptr<GlobalData> input_global_data =
       client_->TransferToServer(*input_literal).ConsumeValueOrDie();
@@ -482,7 +482,7 @@ XLA_TEST_F(ReduceTest, TransposeAndReduceElementwiseR2_111x50_To_R1) {
   Array2D<float> input_data(rows, cols);
   input_data.FillRandom(3.14f, 0.04);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR2FromArray2D(input_data);
+      LiteralUtil::CreateR2FromArray2D(input_data);
   input_literal = input_literal->Relayout(LayoutUtil::MakeLayout({0, 1}));
   std::unique_ptr<GlobalData> input_global_data =
       client_->TransferToServer(*input_literal).ConsumeValueOrDie();
@@ -531,7 +531,7 @@ XLA_TEST_F(ReduceTest, Reshape_111x2x25Reduce_111x50_To_R1) {
   Array3D<float> input_data(rows, 2, cols / 2);
   input_data.FillRandom(3.14f, 0.04);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR3FromArray3D(input_data);
+      LiteralUtil::CreateR3FromArray3D(input_data);
   std::unique_ptr<GlobalData> input_global_data =
       client_->TransferToServer(*input_literal).ConsumeValueOrDie();
 
@@ -594,7 +594,7 @@ XLA_TEST_F(ReduceTest, MaxReduce2DToR0) {
   auto max = CreateScalarMaxComputation(F32, &builder);
   Array2D<float> input(300, 250);
   input.FillRandom(214.0f);
-  auto input_literal = Literal::CreateR2FromArray2D(input);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input);
   Reduce(ConstantLiteral(&builder, *input_literal),
          ConstantR0<float>(&builder, FLT_MIN), max, {0, 1});
   auto input_max = FLT_MIN;
@@ -609,7 +609,7 @@ XLA_TEST_F(ReduceTest, MinReduce2DToR0) {
   auto min = CreateScalarMinComputation(F32, &builder);
   Array2D<float> input(150, 130);
   input.FillRandom(214.0f);
-  auto input_literal = Literal::CreateR2FromArray2D(input);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input);
   Reduce(ConstantLiteral(&builder, *input_literal),
          ConstantR0<float>(&builder, FLT_MAX), min, {0, 1});
 
@@ -623,7 +623,7 @@ XLA_TEST_F(ReduceTest, UnsignedInt_MinReduce) {
   XlaBuilder builder(TestName());
   Array2D<uint32> input({{1}, {2}});
   auto min = CreateScalarMinComputation(U32, &builder);
-  auto input_literal = Literal::CreateR2FromArray2D(input);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input);
   auto initial_value =
       ConstantR0<uint32>(&builder, std::numeric_limits<uint32>::max());
 
@@ -635,7 +635,7 @@ XLA_TEST_F(ReduceTest, UnsignedInt_MaxReduce) {
   XlaBuilder builder(TestName());
   Array2D<uint32> input({{1}, {2}});
   auto max = CreateScalarMaxComputation(U32, &builder);
-  auto input_literal = Literal::CreateR2FromArray2D(input);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input);
   auto initial_value =
       ConstantR0<uint32>(&builder, std::numeric_limits<uint32>::min());
 
@@ -798,13 +798,17 @@ XLA_TEST_F(ReduceTest, VectorizedReduce_Min) {
 
 XLA_TEST_F(ReduceTest, VectorizedReduce_BooleanAnd) {
   RunVectorizedReduceTestForType<bool>(
-      static_cast<FuncGenerator>(CreateScalarAndComputation),
+      static_cast<FuncGenerator>([](XlaBuilder* builder) {
+        return CreateScalarAndComputation(PRED, builder);
+      }),
       [](bool a, bool b) { return a && b; }, true);
 }
 
 XLA_TEST_F(ReduceTest, VectorizedReduce_BooleanOr) {
   RunVectorizedReduceTestForType<bool>(
-      static_cast<FuncGenerator>(CreateScalarOrComputation),
+      static_cast<FuncGenerator>([](XlaBuilder* builder) {
+        return CreateScalarOrComputation(PRED, builder);
+      }),
       [](bool a, bool b) { return a || b; }, false);
 }
 
@@ -818,7 +822,7 @@ XLA_TEST_P(ReduceR3ToR2Test, ReduceR3ToR2) {
   //  input_array.FillRandom(3.14f, 0.05);
   input_array.Fill(1.0f);
 
-  auto input_literal = Literal::CreateR3FromArray3D(input_array);
+  auto input_literal = LiteralUtil::CreateR3FromArray3D(input_array);
   input_literal =
       input_literal->Relayout(LayoutUtil::MakeLayout(GetParam().layout));
   std::unique_ptr<GlobalData> input_data =
@@ -872,7 +876,8 @@ XLA_TEST_F(ReduceTest, DISABLED_ON_GPU(OperationOnConstantAsInitValue)) {
   auto a = ConstantR0<float>(&builder, 2.0f);
   auto a2 = Abs(a);
 
-  std::unique_ptr<Literal> b_literal = Literal::CreateR1<float>({1.0f, 4.0f});
+  std::unique_ptr<Literal> b_literal =
+      LiteralUtil::CreateR1<float>({1.0f, 4.0f});
   std::unique_ptr<GlobalData> b_data =
       client_->TransferToServer(*b_literal).ConsumeValueOrDie();
   auto b = Parameter(&builder, 0, b_literal->shape(), "b");
@@ -900,7 +905,7 @@ class ReduceInitializerTest : public ReduceTest {
 
     auto init = ConstantR0<T>(&builder, initializer);
     std::vector<T> input_arr(num_elems, std::numeric_limits<T>::lowest());
-    auto input_literal = Literal::CreateR1<T>(input_arr);
+    auto input_literal = LiteralUtil::CreateR1<T>(input_arr);
     auto input_data =
         client_->TransferToServer(*input_literal).ConsumeValueOrDie();
     Reduce(Parameter(&builder, 0, input_literal->shape(), "input"), init,
@@ -950,10 +955,11 @@ XLA_TEST_F(ReduceTest, ReduceIdentity) {
   float operand[] = {42.0f};
   float init = 58.5f;
   float expected = 42.0f;
-  std::unique_ptr<Literal> input_literal = Literal::CreateR1<float>(operand);
+  std::unique_ptr<Literal> input_literal =
+      LiteralUtil::CreateR1<float>(operand);
   std::unique_ptr<GlobalData> input_global_data =
       client_->TransferToServer(*input_literal).ConsumeValueOrDie();
-  std::unique_ptr<Literal> input_literal2 = Literal::CreateR0<float>(init);
+  std::unique_ptr<Literal> input_literal2 = LiteralUtil::CreateR0<float>(init);
   std::unique_ptr<GlobalData> input_global_data2 =
       client_->TransferToServer(*input_literal2).ConsumeValueOrDie();
   ComputeAndCompareR0<float>(
@@ -961,5 +967,32 @@ XLA_TEST_F(ReduceTest, ReduceIdentity) {
       ErrorSpec(0.0001));
 }
 
+XLA_TEST_F(ReduceTest, AndReduceU64) {
+  XlaBuilder builder(TestName());
+  Array2D<uint64> initializer = {{0x123456789ABCDEF0LL, 0x3BCDEF12A4567890LL},
+                                 {0XFFFFFFFFFFFFFFD6LL, 101},
+                                 {1, 0XFFFFFFFFFFFFFFFFLL}};
+  auto reducer = CreateScalarAndComputation(U64, &builder);
+  auto m = ConstantR2FromArray2D(&builder, initializer);
+  Reduce(m, ConstantR0<uint64>(&builder, 0xFFFFFFFFFFFFFFFFLL), reducer, {1});
+
+  std::vector<uint64> expected = {0x1204461080145890LL, 68, 1};
+  ComputeAndCompareR1<uint64>(&builder, expected, {});
+}
+
+XLA_TEST_F(ReduceTest, OrReduceU64) {
+  XlaBuilder builder(TestName());
+  Array2D<uint64> initializer = {{0x123456789ABCDEF0LL, 0x3BCDEF12A4567890LL},
+                                 {0xFFFFFFFFFFFFFFD6LL, 101},
+                                 {1, 0xCAFEBEEFABABABABLL}};
+  auto reducer = CreateScalarOrComputation(U64, &builder);
+  auto m = ConstantR2FromArray2D(&builder, initializer);
+  Reduce(m, ConstantR0<uint64>(&builder, 0), reducer, {1});
+
+  std::vector<uint64> expected = {0X3BFDFF7ABEFEFEF0LL, 0XFFFFFFFFFFFFFFF7LL,
+                                  0xCAFEBEEFABABABABLL};
+  ComputeAndCompareR1<uint64>(&builder, expected, {});
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/reduce_window_test.cc b/tensorflow/compiler/xla/tests/reduce_window_test.cc
index 741974480c..1bd6fdab31 100644
--- a/tensorflow/compiler/xla/tests/reduce_window_test.cc
+++ b/tensorflow/compiler/xla/tests/reduce_window_test.cc
@@ -24,8 +24,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@@ -70,8 +70,8 @@ class ReduceWindowTest : public ::testing::WithParamInterface<bool>,
                        tensorflow::gtl::ArraySlice<int64> window_dimensions,
                        tensorflow::gtl::ArraySlice<int64> window_strides,
                        Padding padding) {
-    auto init =
-        CreateConstantFromLiteral(*Literal::CreateR0<float>(0.0f), &builder_);
+    auto init = CreateConstantFromLiteral(*LiteralUtil::CreateR0<float>(0.0f),
+                                          &builder_);
     ReduceWindow(input, init,
                  CreateScalarAddComputation(FloatType(), &builder_),
                  window_dimensions, window_strides, padding);
@@ -81,7 +81,8 @@ class ReduceWindowTest : public ::testing::WithParamInterface<bool>,
                        tensorflow::gtl::ArraySlice<int64> window_dimensions,
                        tensorflow::gtl::ArraySlice<int64> window_strides,
                        Padding padding) {
-    auto init = CreateConstantFromLiteral(Literal::MinValue(F32), &builder_);
+    auto init =
+        CreateConstantFromLiteral(LiteralUtil::MinValue(F32), &builder_);
     ReduceWindow(input, init,
                  CreateScalarMaxComputation(FloatType(), &builder_),
                  window_dimensions, window_strides, padding);
@@ -91,7 +92,8 @@ class ReduceWindowTest : public ::testing::WithParamInterface<bool>,
                        tensorflow::gtl::ArraySlice<int64> window_dimensions,
                        tensorflow::gtl::ArraySlice<int64> window_strides,
                        Padding padding) {
-    auto init = CreateConstantFromLiteral(Literal::MaxValue(F32), &builder_);
+    auto init =
+        CreateConstantFromLiteral(LiteralUtil::MaxValue(F32), &builder_);
     ReduceWindow(input, init,
                  CreateScalarMinComputation(FloatType(), &builder_),
                  window_dimensions, window_strides, padding);
@@ -102,9 +104,9 @@ class ReduceWindowTest : public ::testing::WithParamInterface<bool>,
 
 TEST_P(ReduceWindowTest, MismatchedRanksGivesErrorStatus) {
   const auto input = CreateConstantFromLiteral(
-      *Literal::CreateR1<float>({1, 1, 1, 1}), &builder_);
+      *LiteralUtil::CreateR1<float>({1, 1, 1, 1}), &builder_);
   const auto init_value =
-      CreateConstantFromLiteral(*Literal::CreateR0<float>(0), &builder_);
+      CreateConstantFromLiteral(*LiteralUtil::CreateR0<float>(0), &builder_);
   TF_ASSERT_OK(builder_.first_error());
   ReduceWindow(input, init_value,
                CreateScalarAddComputation(FloatType(), &builder_),
@@ -119,32 +121,32 @@ TEST_P(ReduceWindowTest, MismatchedRanksGivesErrorStatus) {
 // Regression test for b/68964348.
 TEST_P(ReduceWindowTest, R0ReduceWindow) {
   const auto input =
-      CreateConstantFromLiteral(*Literal::CreateR0<float>(42.0), &builder_);
+      CreateConstantFromLiteral(*LiteralUtil::CreateR0<float>(42.0), &builder_);
   const auto init =
-      CreateConstantFromLiteral(*Literal::CreateR0<float>(1.0), &builder_);
+      CreateConstantFromLiteral(*LiteralUtil::CreateR0<float>(1.0), &builder_);
   ReduceWindow(input, init, CreateScalarAddComputation(FloatType(), &builder_),
                /*window_dimensions=*/{},
                /*window_strides=*/{}, Padding::kSame);
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateR0<float>(43.0), {},
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateR0<float>(43.0), {},
                            ErrorSpec(0.00001));
 }
 
 TEST_P(ReduceWindowTest, Min3In5Stride2) {
   const auto input = CreateConstantFromLiteral(
-      *Literal::CreateR1<float>({10000, 1000, 100, 10, 1}), &builder_);
+      *LiteralUtil::CreateR1<float>({10000, 1000, 100, 10, 1}), &builder_);
   ReduceWindowMin(input, {3}, {2}, Padding::kValid);
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateR1<float>({100, 1}), {},
-                           ErrorSpec(0.00001));
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateR1<float>({100, 1}),
+                           {}, ErrorSpec(0.00001));
 }
 
 TEST_P(ReduceWindowTest, Min3In5Stride1WithSamePadding) {
   const auto input = CreateConstantFromLiteral(
-      *Literal::CreateR1<float>({10000, 1000, 100, 10, 1}), &builder_);
+      *LiteralUtil::CreateR1<float>({10000, 1000, 100, 10, 1}), &builder_);
   ReduceWindowMin(input, /*window_dimensions=*/{3}, /*window_strides=*/{1},
                   Padding::kSame);
   ComputeAndCompareLiteral(&builder_,
-                           *Literal::CreateR1<float>({1000, 100, 10, 1, 1}), {},
-                           ErrorSpec(0.00001));
+                           *LiteralUtil::CreateR1<float>({1000, 100, 10, 1, 1}),
+                           {}, ErrorSpec(0.00001));
 }
 
 XLA_TEST_P(ReduceWindowTest, ZeroElementSmall) {
@@ -156,7 +158,7 @@ XLA_TEST_P(ReduceWindowTest, ZeroElementSmall) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(input_array, 0.0f, {1, 1, 2, 1},
                                               {1, 1, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res), {},
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res), {},
                            DefaultErrorSpec());
 }
 
@@ -171,7 +173,7 @@ TEST_P(ReduceWindowTest, NonSquareSmall) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(input_array, 0.0f, {1, 1, 2, 1},
                                               {1, 1, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res), {},
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res), {},
                            DefaultErrorSpec());
 }
 
@@ -185,7 +187,7 @@ TEST_P(ReduceWindowTest, MiddleDimsSmall) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(input_array, 0.0f, {1, 1, 1, 1},
                                               {1, 2, 2, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res), {},
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res), {},
                            DefaultErrorSpec());
 }
 
@@ -202,7 +204,7 @@ TEST_P(ReduceWindowTest, Along2ndMinorDim) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(
       input_array, 0.0f, {1, 1, lrn_diameter, 1}, {1, 1, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res), {},
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res), {},
                            DefaultErrorSpec());
 }
 
@@ -224,8 +226,8 @@ TEST_P(ReduceWindowTest, AmongMajor2Dims) {
       input_array, 0.0f, {win_len, win_len, 1, 1},
       {win_stride, win_stride, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*result), {},
-                           DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*result),
+                           {}, DefaultErrorSpec());
 }
 
 TEST_P(ReduceWindowTest, AmongMajor2DimsMediumSize) {
@@ -247,8 +249,8 @@ TEST_P(ReduceWindowTest, AmongMajor2DimsMediumSize) {
       input_array, 0.0f, {win_len, win_len, 1, 1},
       {win_stride, win_stride, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*result), {},
-                           DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*result),
+                           {}, DefaultErrorSpec());
 }
 
 // Tests the super windowing logic w.r.t handling prime number of windows in a
@@ -272,8 +274,8 @@ TEST_P(ReduceWindowTest, PrimeWindowsInReductionDimension) {
       input_array, 0.0f, {win_len, win_len, 1, 1},
       {win_stride, win_stride, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*result), {},
-                           DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*result),
+                           {}, DefaultErrorSpec());
 }
 
 TEST_P(ReduceWindowTest, ReduceAlongLaneDimension) {
@@ -289,8 +291,8 @@ TEST_P(ReduceWindowTest, ReduceAlongLaneDimension) {
   auto result = ReferenceUtil::ReduceWindow4DAdd(
       input_array, 0.0f, {1, 1, 1, 11}, {1, 1, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*result), {},
-                           DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*result),
+                           {}, DefaultErrorSpec());
 }
 
 // Tests a reduction function that is not a simple add/min/max/etc.
@@ -308,12 +310,12 @@ XLA_TEST_P(ReduceWindowTest, NonstandardReduceFunction) {
   auto lhs = Parameter(b.get(), 0, scalar, "lhs");
   auto rhs = Parameter(b.get(), 1, scalar, "rhs");
   Min(Add(lhs, rhs),
-      CreateConstantFromLiteral(*Literal::CreateR0<float>(8.0f), b.get()));
+      CreateConstantFromLiteral(*LiteralUtil::CreateR0<float>(8.0f), b.get()));
   XlaComputation reduce_fn = b->BuildAndNoteError();
 
   ReduceWindow(
       input,
-      CreateConstantFromLiteral(*Literal::CreateR0<float>(0.0f), &builder_),
+      CreateConstantFromLiteral(*LiteralUtil::CreateR0<float>(0.0f), &builder_),
       reduce_fn,
       /*window_dimensions=*/{1, 1, 2, 1},
       /*window_strides=*/{1, 1, 1, 1}, padding);
@@ -327,15 +329,15 @@ XLA_TEST_P(ReduceWindowTest, NonstandardReduceFunction) {
                                            /*window=*/{1, 1, 2, 1},
                                            /*stride=*/{1, 1, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*expected), {},
-                           DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*expected),
+                           {}, DefaultErrorSpec());
 }
 
 TEST_P(ReduceWindowTest, R4UnitWindow) {
   Array4D<float> input_array(13, 12, 8, 15);
   input_array.FillRandom(2.f, 2.f);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input_array, LayoutUtil::MakeLayout({0, 3, 2, 1}));
   XlaOp input;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -347,7 +349,7 @@ TEST_P(ReduceWindowTest, R4UnitWindow) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(input_array, 0.0f, {1, 1, 7, 1},
                                               {1, 4, 1, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res),
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res),
                            {input_data.get()}, DefaultErrorSpec());
 }
 
@@ -376,7 +378,7 @@ XLA_TEST_P(ReduceWindowTest, R6Add) {
   auto shape = ShapeUtil::MakeShape(F32, input_dims);
 
   std::unique_ptr<Literal> arg_literal =
-      Literal::CreateFullWithDescendingLayout<float>(input_dims, 1.0f);
+      LiteralUtil::CreateFullWithDescendingLayout<float>(input_dims, 1.0f);
 
   const auto input = CreateConstantFromLiteral(*arg_literal, &builder_);
 
@@ -385,7 +387,7 @@ XLA_TEST_P(ReduceWindowTest, R6Add) {
 
   std::vector<int64> output_dims = {8, 8, 6, 6, 8, 8};
   std::unique_ptr<Literal> expected =
-      Literal::CreateFullWithDescendingLayout<float>(output_dims, 9.0f);
+      LiteralUtil::CreateFullWithDescendingLayout<float>(output_dims, 9.0f);
 
   ComputeAndCompareLiteral(&builder_, *expected, {}, DefaultErrorSpec());
 }
@@ -394,7 +396,7 @@ XLA_TEST_P(ReduceWindowTest, R4SecondMinorStride) {
   Array4D<float> input_array(2, 1, 27, 119);
   input_array.FillRandom(2.0f);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input_array, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaOp input;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -408,7 +410,7 @@ XLA_TEST_P(ReduceWindowTest, R4SecondMinorStride) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(
       input_array, 0.0f, {1, 1, win_len, 1}, {1, 1, stride, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res),
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res),
                            {input_data.get()}, DefaultErrorSpec());
 }
 
@@ -416,7 +418,7 @@ XLA_TEST_P(ReduceWindowTest, R4SecondMinorUnitStride) {
   Array4D<float> input_array(3, 2, 4, 64);
   input_array.FillRandom(2.0f);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input_array, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaOp input;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -430,7 +432,7 @@ XLA_TEST_P(ReduceWindowTest, R4SecondMinorUnitStride) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(
       input_array, 0.0f, {1, 1, win_len, 1}, {1, 1, stride, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res),
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res),
                            {input_data.get()}, DefaultErrorSpec());
 }
 
@@ -438,7 +440,7 @@ XLA_TEST_P(ReduceWindowTest, R4SecondMinorWin) {
   Array4D<float> input_array(1, 3, 12, 200);
   input_array.FillRandom(2.0f);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input_array, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaOp input;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -452,7 +454,7 @@ XLA_TEST_P(ReduceWindowTest, R4SecondMinorWin) {
   auto res = ReferenceUtil::ReduceWindow4DAdd(
       input_array, 0.0f, {1, 1, win_len, 1}, {1, 1, stride, 1}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*res),
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*res),
                            {input_data.get()}, DefaultErrorSpec());
 }
 
@@ -473,18 +475,18 @@ TEST_P(ReduceWindowTest, AmongMajor2DimsMultipleMinor) {
   auto result = ReferenceUtil::ReduceWindow4DAdd(
       input_array, 0.0f, {win_len, win_len, 1, 1},
       {win_stride, win_stride, 1, 1}, padding);
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray(*result), {},
-                           DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateFromArray(*result),
+                           {}, DefaultErrorSpec());
 }
 
 XLA_TEST_P(ReduceWindowTest, Add24In1152_NoOverlap) {
   std::vector<float> input_vector(128 * 9, 1);
   const auto input = CreateConstantFromLiteral(
-      *Literal::CreateR1<float>(input_vector), &builder_);
+      *LiteralUtil::CreateR1<float>(input_vector), &builder_);
   ReduceWindowAdd(input, {32}, {128}, Padding::kValid);
   ComputeAndCompareLiteral(
       &builder_,
-      *Literal::CreateR1<float>({32, 32, 32, 32, 32, 32, 32, 32, 32}), {},
+      *LiteralUtil::CreateR1<float>({32, 32, 32, 32, 32, 32, 32, 32, 32}), {},
       DefaultErrorSpec());
 }
 
@@ -499,9 +501,9 @@ XLA_TEST_P(ReduceWindowTest, Add128In128Stride128) {
       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
   const auto input = CreateConstantFromLiteral(
-      *Literal::CreateR1<float>(input_vector), &builder_);
+      *LiteralUtil::CreateR1<float>(input_vector), &builder_);
   ReduceWindowAdd(input, {128}, {128}, Padding::kValid);
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateR1<float>({1088}), {},
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateR1<float>({1088}), {},
                            DefaultErrorSpec());
 }
 
@@ -516,9 +518,9 @@ XLA_TEST_P(ReduceWindowTest, Add128In128) {
       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
   const auto input = CreateConstantFromLiteral(
-      *Literal::CreateR1<float>(input_vector), &builder_);
+      *LiteralUtil::CreateR1<float>(input_vector), &builder_);
   ReduceWindowAdd(input, {128}, {1}, Padding::kValid);
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateR1<float>({1088}), {},
+  ComputeAndCompareLiteral(&builder_, *LiteralUtil::CreateR1<float>({1088}), {},
                            DefaultErrorSpec());
 }
 
@@ -535,14 +537,15 @@ TEST_P(ReduceWindowTest, R2ReduceWindowInceptionFromBroadcast) {
   auto res = ReferenceUtil::ReduceWindow2DAdd(
       input_array, 0.0f, {win_len, win_len}, {stride, stride}, padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray<float>(*res),
-                           {}, DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_,
+                           *LiteralUtil::CreateFromArray<float>(*res), {},
+                           DefaultErrorSpec());
 }
 
 TEST_P(ReduceWindowTest, R2ReduceWindowNonOverlappingFromBroadcast) {
   Array2D<float> input_array(6, 4, 1.0f);
   XlaOp input = Broadcast(
-      CreateConstantFromLiteral(Literal::One(F32), &builder_), {6, 4});
+      CreateConstantFromLiteral(LiteralUtil::One(F32), &builder_), {6, 4});
 
   Padding padding = Padding::kSame;
   ReduceWindowAdd(input, {4, 2}, {3, 3}, padding);
@@ -550,8 +553,9 @@ TEST_P(ReduceWindowTest, R2ReduceWindowNonOverlappingFromBroadcast) {
   auto res = ReferenceUtil::ReduceWindow2DAdd(input_array, 0.0f, {4, 2}, {3, 3},
                                               padding);
 
-  ComputeAndCompareLiteral(&builder_, *Literal::CreateFromArray<float>(*res),
-                           {}, DefaultErrorSpec());
+  ComputeAndCompareLiteral(&builder_,
+                           *LiteralUtil::CreateFromArray<float>(*res), {},
+                           DefaultErrorSpec());
 }
 
 INSTANTIATE_TEST_CASE_P(ReduceWindowTestInstance, ReduceWindowTest,
@@ -609,7 +613,7 @@ class R4ReduceWindowTest : public ReduceWindowTestBase,
                          param.base_bounds[2], param.base_bounds[3]);
     input.FillIota(1);
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateR4FromArray4DWithLayout(
+        LiteralUtil::CreateR4FromArray4DWithLayout(
             input, LayoutUtil::MakeLayout(param.layout));
     XlaOp parameter;
     auto input_arg = CreateParameterAndTransferLiteral(0, *input_literal, "p0",
@@ -621,7 +625,7 @@ class R4ReduceWindowTest : public ReduceWindowTestBase,
     }
 
     auto init_value =
-        CreateConstantFromLiteral(*Literal::CreateR0(kInitValue), &b);
+        CreateConstantFromLiteral(*LiteralUtil::CreateR0(kInitValue), &b);
     CHECK(param.reducer == kAdd || param.reducer == kMax);
     auto computation = param.reducer == kAdd
                            ? CreateScalarAddComputation(FloatType(), &b)
@@ -647,7 +651,7 @@ class R4ReduceWindowTest : public ReduceWindowTestBase,
             /*stride=*/param.strides,
             /*padding=*/padding);
     std::unique_ptr<Literal> expected_literal =
-        Literal::CreateFromArray(*expected);
+        LiteralUtil::CreateFromArray(*expected);
     const Shape& expected_shape_with_layout = ShapeUtil::MakeShapeWithLayout(
         input_literal->shape().element_type(),
         AsInt64Slice(expected_literal->shape().dimensions()), param.layout);
@@ -959,14 +963,14 @@ TEST_P(R3ReduceWindowTest, Add) {
   Array3D<float> input(param.base_bounds[0], param.base_bounds[1],
                        param.base_bounds[2], 1.0f);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR3FromArray3DWithLayout(
+      LiteralUtil::CreateR3FromArray3DWithLayout(
           input, LayoutUtil::MakeLayout(param.layout));
 
   XlaOp parameter;
   auto input_arg = CreateParameterAndTransferLiteral(0, *input_literal, "p0",
                                                      &b, &parameter);
   auto init_value =
-      CreateConstantFromLiteral(*Literal::CreateR0(kInitValue), &b);
+      CreateConstantFromLiteral(*LiteralUtil::CreateR0(kInitValue), &b);
   ReduceWindow(/*operand=*/parameter,
                /*init_value=*/init_value,
                /*computation=*/CreateScalarAddComputation(FloatType(), &b),
@@ -977,7 +981,7 @@ TEST_P(R3ReduceWindowTest, Add) {
       /*operand=*/input, /*init=*/kInitValue, /*window=*/param.window_bounds,
       /*stride=*/param.strides, /*padding=*/param.padding);
 
-  ComputeAndCompareLiteral(&b, *Literal::CreateFromArray(*expected),
+  ComputeAndCompareLiteral(&b, *LiteralUtil::CreateFromArray(*expected),
                            {input_arg.get()}, DefaultErrorSpec());
 }
 
@@ -1093,7 +1097,7 @@ class R2ReduceWindowTest : public ReduceWindowTestBase,
     const float kInitValue = 0.0f;
     Array2D<float> input(param.base_bounds[0], param.base_bounds[1], 1.0f);
     std::unique_ptr<Literal> input_literal =
-        Literal::CreateR2FromArray2DWithLayout(
+        LiteralUtil::CreateR2FromArray2DWithLayout(
             input, LayoutUtil::MakeLayout(param.layout));
 
     XlaOp parameter;
@@ -1107,7 +1111,7 @@ class R2ReduceWindowTest : public ReduceWindowTestBase,
                            ? CreateScalarAddComputation(FloatType(), &b)
                            : CreateScalarMaxComputation(FloatType(), &b);
     auto init_value =
-        CreateConstantFromLiteral(*Literal::CreateR0(kInitValue), &b);
+        CreateConstantFromLiteral(*LiteralUtil::CreateR0(kInitValue), &b);
     ReduceWindowWithGeneralPadding(
         /*operand=*/parameter,
         /*init_value=*/init_value,
@@ -1123,7 +1127,7 @@ class R2ReduceWindowTest : public ReduceWindowTestBase,
         /*window=*/param.window_bounds,
         /*stride=*/param.strides, /*padding=*/padding);
 
-    ComputeAndCompareLiteral(&b, *Literal::CreateFromArray(*expected),
+    ComputeAndCompareLiteral(&b, *LiteralUtil::CreateFromArray(*expected),
                              {input_arg.get()}, DefaultErrorSpec());
   }
 };
@@ -1292,7 +1296,7 @@ TEST_P(R1ReduceWindowTest, DoIt) {
   std::vector<float> input_vector(param.base_bounds[0]);
   std::iota(std::begin(input_vector), std::end(input_vector), 0);
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR1(tensorflow::gtl::ArraySlice<float>(input_vector));
+      LiteralUtil::CreateR1(tensorflow::gtl::ArraySlice<float>(input_vector));
   XlaOp parameter;
   auto input_arg = CreateParameterAndTransferLiteral(0, *input_literal, "p0",
                                                      &b, &parameter);
@@ -1304,7 +1308,7 @@ TEST_P(R1ReduceWindowTest, DoIt) {
                          ? CreateScalarAddComputation(FloatType(), &b)
                          : CreateScalarMaxComputation(FloatType(), &b);
   auto init_value =
-      CreateConstantFromLiteral(*Literal::CreateR0(kInitValue), &b);
+      CreateConstantFromLiteral(*LiteralUtil::CreateR0(kInitValue), &b);
   ReduceWindowWithGeneralPadding(
       /*operand=*/parameter,
       /*init_value=*/init_value,
@@ -1323,7 +1327,7 @@ TEST_P(R1ReduceWindowTest, DoIt) {
       /*stride=*/param.strides,
       /*padding=*/padding);
 
-  ComputeAndCompareLiteral(&b, *Literal::CreateR1<float>(*expected),
+  ComputeAndCompareLiteral(&b, *LiteralUtil::CreateR1<float>(*expected),
                            {input_arg.get()}, DefaultErrorSpec());
 }
 
diff --git a/tensorflow/compiler/xla/tests/replay_test.cc b/tensorflow/compiler/xla/tests/replay_test.cc
index bebd814fa8..d891451381 100644
--- a/tensorflow/compiler/xla/tests/replay_test.cc
+++ b/tensorflow/compiler/xla/tests/replay_test.cc
@@ -17,9 +17,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/protobuf_util.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/shape_util.h"
@@ -91,10 +91,10 @@ XLA_TEST_F(ReplayTest, XPlusYReplayWithParameters) {
 
   // Run it.
   std::unique_ptr<GlobalData> x_data =
-      client_->TransferToServer(*Literal::CreateR0<int32>(2))
+      client_->TransferToServer(*LiteralUtil::CreateR0<int32>(2))
           .ConsumeValueOrDie();
   std::unique_ptr<GlobalData> y_data =
-      client_->TransferToServer(*Literal::CreateR0<int32>(3))
+      client_->TransferToServer(*LiteralUtil::CreateR0<int32>(3))
           .ConsumeValueOrDie();
   std::unique_ptr<Literal> literal =
       client_
diff --git a/tensorflow/compiler/xla/tests/reshape_motion_test.cc b/tensorflow/compiler/xla/tests/reshape_motion_test.cc
index 5812fe442b..368f5583c9 100644
--- a/tensorflow/compiler/xla/tests/reshape_motion_test.cc
+++ b/tensorflow/compiler/xla/tests/reshape_motion_test.cc
@@ -22,9 +22,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
diff --git a/tensorflow/compiler/xla/tests/reshape_test.cc b/tensorflow/compiler/xla/tests/reshape_test.cc
index d3d6c3c7d7..382d1b1ae7 100644
--- a/tensorflow/compiler/xla/tests/reshape_test.cc
+++ b/tensorflow/compiler/xla/tests/reshape_test.cc
@@ -22,8 +22,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/reference_util.h"
@@ -55,39 +55,39 @@ XLA_TEST_P(ReshapeTest, CollapseTrivial1x1) {
   XlaBuilder builder(TestName());
   Array2D<float> input_array(1, 1);
   input_array.Fill(1.0f);
-  auto input_literal = Literal::CreateR2FromArray2D(input_array);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input_array);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "parameter",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{0, 1});
 
-  auto expected_literal = Literal::CreateR1<float>({1.0f});
+  auto expected_literal = LiteralUtil::CreateR1<float>({1.0f});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
 
 XLA_TEST_P(ReshapeTest, CollapseTrivialR1EmptyDims) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateR1<float>({1.0f});
+  auto input_literal = LiteralUtil::CreateR1<float>({1.0f});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "parameter",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{});
 
-  auto expected_literal = Literal::CreateR1<float>({1.0f});
+  auto expected_literal = LiteralUtil::CreateR1<float>({1.0f});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
 
 XLA_TEST_P(ReshapeTest, CollapseTrivialR1OnlyDim) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateR1<float>({1.0f});
+  auto input_literal = LiteralUtil::CreateR1<float>({1.0f});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "parameter",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{0});
 
-  auto expected_literal = Literal::CreateR1<float>({1.0f});
+  auto expected_literal = LiteralUtil::CreateR1<float>({1.0f});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -97,7 +97,7 @@ XLA_TEST_P(ReshapeTest, SingleElementArrayToScalar) {
   XlaBuilder builder(TestName());
   Array2D<float> input_array(1, 1);
   input_array.Fill(1.0f);
-  auto input_literal = Literal::CreateR2FromArray2D(input_array);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input_array);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "parameter",
                                                  &builder, &parameter);
@@ -105,7 +105,7 @@ XLA_TEST_P(ReshapeTest, SingleElementArrayToScalar) {
                          /*new_sizes=*/{});
   auto new_shape = builder.GetShape(reshape).ConsumeValueOrDie();
 
-  auto expected_literal = Literal::CreateR0<float>(1.0f);
+  auto expected_literal = LiteralUtil::CreateR0<float>(1.0f);
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -113,14 +113,14 @@ XLA_TEST_P(ReshapeTest, SingleElementArrayToScalar) {
 XLA_TEST_P(ReshapeTest, ScalarToSingleElementArray) {
   XlaBuilder builder(TestName());
 
-  std::unique_ptr<Literal> param0_literal = Literal::CreateR0<float>(1.0f);
+  std::unique_ptr<Literal> param0_literal = LiteralUtil::CreateR0<float>(1.0f);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *param0_literal, "param0",
                                                  &builder, &parameter);
   auto a = Neg(parameter);
   Reshape(/*operand=*/a, /*dimensions=*/{}, /*new_sizes=*/{1});
 
-  auto expected_literal = Literal::CreateR1<float>({-1.0f});
+  auto expected_literal = LiteralUtil::CreateR1<float>({-1.0f});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -128,12 +128,12 @@ XLA_TEST_P(ReshapeTest, ScalarToSingleElementArray) {
 XLA_TEST_P(ReshapeTest, Trivial0x3) {
   XlaBuilder builder(TestName());
   Array2D<float> input_array(0, 3);
-  auto input_literal = Literal::CreateR2FromArray2D(input_array);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input_array);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{0, 1});
-  auto expected_literal = Literal::CreateR1<float>({});
+  auto expected_literal = LiteralUtil::CreateR1<float>({});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -142,12 +142,12 @@ XLA_TEST_P(ReshapeTest, Trivial0x3WithParameter) {
   XlaBuilder builder(TestName());
 
   std::unique_ptr<Literal> param0_literal =
-      Literal::CreateR2FromArray2D<float>(Array2D<float>(0, 3));
+      LiteralUtil::CreateR2FromArray2D<float>(Array2D<float>(0, 3));
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *param0_literal, "param0",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{0, 1});
-  auto expected_literal = Literal::CreateR1<float>({});
+  auto expected_literal = LiteralUtil::CreateR1<float>({});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -155,12 +155,12 @@ XLA_TEST_P(ReshapeTest, Trivial0x3WithParameter) {
 XLA_TEST_P(ReshapeTest, Trivial3x0) {
   XlaBuilder builder(TestName());
   Array2D<float> input_array(3, 0);
-  auto input_literal = Literal::CreateR2FromArray2D(input_array);
+  auto input_literal = LiteralUtil::CreateR2FromArray2D(input_array);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{0, 1});
-  auto expected_literal = Literal::CreateR1<float>({});
+  auto expected_literal = LiteralUtil::CreateR1<float>({});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -168,12 +168,12 @@ XLA_TEST_P(ReshapeTest, Trivial3x0) {
 // Collapses a 2-dimensional row vector to 1 dimension.
 XLA_TEST_P(ReshapeTest, Trivial1x3) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateR2<float>({{1.0f, 2.0f, 3.0f}});
+  auto input_literal = LiteralUtil::CreateR2<float>({{1.0f, 2.0f, 3.0f}});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{0, 1});
-  auto expected_literal = Literal::CreateR1<float>({1.0f, 2.0f, 3.0f});
+  auto expected_literal = LiteralUtil::CreateR1<float>({1.0f, 2.0f, 3.0f});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -181,12 +181,12 @@ XLA_TEST_P(ReshapeTest, Trivial1x3) {
 // Collapses a 2-dimensional column vector to 1 dimension.
 XLA_TEST_P(ReshapeTest, Trivial3x1) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateR2<float>({{1.0f}, {2.0f}, {3.0f}});
+  auto input_literal = LiteralUtil::CreateR2<float>({{1.0f}, {2.0f}, {3.0f}});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{0, 1});
-  auto expected_literal = Literal::CreateR1<float>({1.0f, 2.0f, 3.0f});
+  auto expected_literal = LiteralUtil::CreateR1<float>({1.0f, 2.0f, 3.0f});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -194,13 +194,13 @@ XLA_TEST_P(ReshapeTest, Trivial3x1) {
 // Splits an empty vector into an empty matrix.
 XLA_TEST_P(ReshapeTest, R1ToR2_0_To_2x0) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateR1<float>({});
+  auto input_literal = LiteralUtil::CreateR1<float>({});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{0},
           /*new_sizes=*/{2, 0});
-  auto expected_literal = Literal::CreateR2<float>({{}, {}});
+  auto expected_literal = LiteralUtil::CreateR2<float>({{}, {}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -209,14 +209,14 @@ XLA_TEST_P(ReshapeTest, R1ToR2_0_To_2x0) {
 XLA_TEST_P(ReshapeTest, R1ToR2_6_To_2x3) {
   XlaBuilder builder(TestName());
   auto input_literal =
-      Literal::CreateR1<float>({1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f});
+      LiteralUtil::CreateR1<float>({1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{0},
           /*new_sizes=*/{2, 3});
   auto expected_literal =
-      Literal::CreateR2<float>({{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}});
+      LiteralUtil::CreateR2<float>({{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -224,13 +224,13 @@ XLA_TEST_P(ReshapeTest, R1ToR2_6_To_2x3) {
 // Transposes a 2x0 array to a 0x2 array.
 XLA_TEST_P(ReshapeTest, Reshape0x2To2x0) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(Array2D<float>(0, 2));
+  auto input_literal = LiteralUtil::CreateFromArray(Array2D<float>(0, 2));
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{0, 1},
           /*new_sizes=*/{2, 0});
-  auto expected_literal = Literal::CreateR2<float>({{}, {}});
+  auto expected_literal = LiteralUtil::CreateR2<float>({{}, {}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -239,7 +239,7 @@ XLA_TEST_P(ReshapeTest, Reshape0x2To2x0) {
 XLA_TEST_P(ReshapeTest, ReshapeRowToCol) {
   XlaBuilder builder(TestName());
   auto simple = MakeLinspaceArray2D(1.0f, 3.0f, 1, 3);
-  auto input_literal = Literal::CreateFromArray(*simple);
+  auto input_literal = LiteralUtil::CreateFromArray(*simple);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
@@ -247,7 +247,7 @@ XLA_TEST_P(ReshapeTest, ReshapeRowToCol) {
           /*new_sizes=*/{3, 1});
 
   auto expected = ReferenceUtil::TransposeArray2D(*simple);
-  auto expected_literal = Literal::CreateFromArray(*expected);
+  auto expected_literal = LiteralUtil::CreateFromArray(*expected);
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -256,7 +256,7 @@ XLA_TEST_P(ReshapeTest, ReshapeRowToCol) {
 XLA_TEST_P(ReshapeTest, TransposeAsReshape) {
   XlaBuilder builder(TestName());
   auto a4x3 = MakeLinspaceArray2D(1.0f, 12.0f, 4, 3);
-  auto input_literal = Literal::CreateFromArray(*a4x3);
+  auto input_literal = LiteralUtil::CreateFromArray(*a4x3);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
@@ -264,7 +264,7 @@ XLA_TEST_P(ReshapeTest, TransposeAsReshape) {
           /*new_sizes=*/{3, 4});
 
   auto expected = ReferenceUtil::TransposeArray2D(*a4x3);
-  auto expected_literal = Literal::CreateFromArray(*expected);
+  auto expected_literal = LiteralUtil::CreateFromArray(*expected);
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -272,12 +272,12 @@ XLA_TEST_P(ReshapeTest, TransposeAsReshape) {
 // Transposes a 0x4 array with XlaBuilder::Transpose.
 XLA_TEST_P(ReshapeTest, Transpose0x4) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(Array2D<float>(0, 4));
+  auto input_literal = LiteralUtil::CreateFromArray(Array2D<float>(0, 4));
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Transpose(parameter, {1, 0});
-  auto expected_literal = Literal::CreateR2<float>({{}, {}, {}, {}});
+  auto expected_literal = LiteralUtil::CreateR2<float>({{}, {}, {}, {}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -286,14 +286,14 @@ XLA_TEST_P(ReshapeTest, Transpose0x4) {
 XLA_TEST_P(ReshapeTest, Transpose4x3) {
   XlaBuilder builder(TestName());
   auto a4x3 = MakeLinspaceArray2D(1.0f, 12.0f, 4, 3);
-  auto input_literal = Literal::CreateFromArray(*a4x3);
+  auto input_literal = LiteralUtil::CreateFromArray(*a4x3);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Transpose(parameter, {1, 0});
 
   auto expected = ReferenceUtil::TransposeArray2D(*a4x3);
-  auto expected_literal = Literal::CreateFromArray(*expected);
+  auto expected_literal = LiteralUtil::CreateFromArray(*expected);
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -302,26 +302,27 @@ XLA_TEST_P(ReshapeTest, Transpose4x3) {
 // rearrangement of the originals (split), but no reordering (no shuffle).
 XLA_TEST_P(ReshapeTest, ReshapeSplitNoShuffleZeroElements) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(Array2D<float>(6, 0));
+  auto input_literal = LiteralUtil::CreateFromArray(Array2D<float>(6, 0));
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{0, 1},
           /*new_sizes=*/{2, 3, 0, 0});
-  auto expected_literal = Literal::CreateFromArray(Array4D<float>(2, 3, 0, 0));
+  auto expected_literal =
+      LiteralUtil::CreateFromArray(Array4D<float>(2, 3, 0, 0));
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
 
 XLA_TEST_P(ReshapeTest, ReshapeR4ToR2ZeroElements) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(Array4D<float>(2, 3, 4, 0));
+  auto input_literal = LiteralUtil::CreateFromArray(Array4D<float>(2, 3, 4, 0));
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{0, 1, 2, 3},
           /*new_sizes=*/{24, 0});
-  auto expected_literal = Literal::CreateFromArray(Array2D<float>(24, 0));
+  auto expected_literal = LiteralUtil::CreateFromArray(Array2D<float>(24, 0));
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -331,7 +332,7 @@ XLA_TEST_P(ReshapeTest, ReshapeR4ToR2ZeroElements) {
 XLA_TEST_P(ReshapeTest, ReshapeSplitNoShuffle) {
   XlaBuilder builder(TestName());
   auto a4x3 = MakeLinspaceArray2D(1.0f, 12.0f, 4, 3);
-  auto input_literal = Literal::CreateFromArray(*a4x3);
+  auto input_literal = LiteralUtil::CreateFromArray(*a4x3);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
@@ -339,20 +340,20 @@ XLA_TEST_P(ReshapeTest, ReshapeSplitNoShuffle) {
           /*new_sizes=*/{2, 6});
 
   auto expected = MakeLinspaceArray2D(1.0f, 12.0f, 2, 6);
-  auto expected_literal = Literal::CreateFromArray(*expected);
+  auto expected_literal = LiteralUtil::CreateFromArray(*expected);
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
 
 XLA_TEST_P(ReshapeTest, ReshapeSplitAndShuffleZeroElements) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(Array2D<float>(0, 6));
+  auto input_literal = LiteralUtil::CreateFromArray(Array2D<float>(0, 6));
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{1, 0},
           /*new_sizes=*/{3, 0});
-  auto expected_literal = Literal::CreateFromArray(Array2D<float>(3, 0));
+  auto expected_literal = LiteralUtil::CreateFromArray(Array2D<float>(3, 0));
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -362,7 +363,7 @@ XLA_TEST_P(ReshapeTest, ReshapeSplitAndShuffleZeroElements) {
 XLA_TEST_P(ReshapeTest, ReshapeSplitAndShuffle) {
   XlaBuilder builder(TestName());
   auto a4x3 = MakeLinspaceArray2D(1.0f, 12.0f, 4, 3);
-  auto input_literal = Literal::CreateFromArray(*a4x3);
+  auto input_literal = LiteralUtil::CreateFromArray(*a4x3);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
@@ -370,7 +371,7 @@ XLA_TEST_P(ReshapeTest, ReshapeSplitAndShuffle) {
           /*new_sizes=*/{2, 6});
   Array2D<float> expected({{1.0f, 4.0f, 7.0f, 10.0f, 2.0f, 5.0f},
                            {8.0f, 11.0f, 3.0f, 6.0f, 9.0f, 12.0f}});
-  auto expected_literal = Literal::CreateFromArray(expected);
+  auto expected_literal = LiteralUtil::CreateFromArray(expected);
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -388,13 +389,13 @@ static Array3D<float> ArrayForDocR3Tests() {
 
 XLA_TEST_P(ReshapeTest, DocR3_R1_Collapse_012) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(ArrayForDocR3Tests());
+  auto input_literal = LiteralUtil::CreateFromArray(ArrayForDocR3Tests());
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{0, 1, 2},
           /*new_sizes=*/{24});
-  auto expected_literal = Literal::CreateR1<float>(
+  auto expected_literal = LiteralUtil::CreateR1<float>(
       {10, 11, 12, 15, 16, 17, 20, 21, 22, 25, 26, 27,
        30, 31, 32, 35, 36, 37, 40, 41, 42, 45, 46, 47});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
@@ -403,33 +404,33 @@ XLA_TEST_P(ReshapeTest, DocR3_R1_Collapse_012) {
 
 XLA_TEST_P(ReshapeTest, DocR3_R2_Collapse_012_Refine_83) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(ArrayForDocR3Tests());
+  auto input_literal = LiteralUtil::CreateFromArray(ArrayForDocR3Tests());
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{0, 1, 2},
           /*new_sizes=*/{8, 3});
-  auto expected_literal = Literal::CreateR2<float>({{10, 11, 12},
-                                                    {15, 16, 17},
-                                                    {20, 21, 22},
-                                                    {25, 26, 27},
-                                                    {30, 31, 32},
-                                                    {35, 36, 37},
-                                                    {40, 41, 42},
-                                                    {45, 46, 47}});
+  auto expected_literal = LiteralUtil::CreateR2<float>({{10, 11, 12},
+                                                        {15, 16, 17},
+                                                        {20, 21, 22},
+                                                        {25, 26, 27},
+                                                        {30, 31, 32},
+                                                        {35, 36, 37},
+                                                        {40, 41, 42},
+                                                        {45, 46, 47}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
 
 XLA_TEST_P(ReshapeTest, DocR3_R1_Collapse_120) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(ArrayForDocR3Tests());
+  auto input_literal = LiteralUtil::CreateFromArray(ArrayForDocR3Tests());
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{1, 2, 0},
           /*new_sizes=*/{24});
-  auto expected_literal = Literal::CreateR1<float>(
+  auto expected_literal = LiteralUtil::CreateR1<float>(
       {10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42,
        15, 25, 35, 45, 16, 26, 36, 46, 17, 27, 37, 47});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
@@ -438,33 +439,33 @@ XLA_TEST_P(ReshapeTest, DocR3_R1_Collapse_120) {
 
 XLA_TEST_P(ReshapeTest, DocR3_R2_Collapse_120_Refine_83) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(ArrayForDocR3Tests());
+  auto input_literal = LiteralUtil::CreateFromArray(ArrayForDocR3Tests());
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{1, 2, 0},
           /*new_sizes=*/{8, 3});
-  auto expected_literal = Literal::CreateR2<float>({{10, 20, 30},
-                                                    {40, 11, 21},
-                                                    {31, 41, 12},
-                                                    {22, 32, 42},
-                                                    {15, 25, 35},
-                                                    {45, 16, 26},
-                                                    {36, 46, 17},
-                                                    {27, 37, 47}});
+  auto expected_literal = LiteralUtil::CreateR2<float>({{10, 20, 30},
+                                                        {40, 11, 21},
+                                                        {31, 41, 12},
+                                                        {22, 32, 42},
+                                                        {15, 25, 35},
+                                                        {45, 16, 26},
+                                                        {36, 46, 17},
+                                                        {27, 37, 47}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
 
 XLA_TEST_P(ReshapeTest, DocR3_R3_Collapse_120_Refine_262) {
   XlaBuilder builder(TestName());
-  auto input_literal = Literal::CreateFromArray(ArrayForDocR3Tests());
+  auto input_literal = LiteralUtil::CreateFromArray(ArrayForDocR3Tests());
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Reshape(/*operand=*/parameter, /*dimensions=*/{1, 2, 0},
           /*new_sizes=*/{2, 6, 2});
-  auto expected_literal = Literal::CreateR3<float>(
+  auto expected_literal = LiteralUtil::CreateR3<float>(
       {{{10, 20}, {30, 40}, {11, 21}, {31, 41}, {12, 22}, {32, 42}},
        {{15, 25}, {35, 45}, {16, 26}, {36, 46}, {17, 27}, {37, 47}}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
@@ -491,12 +492,12 @@ XLA_TEST_P(ReshapeTest, FullyConnectedCollapse) {
   Array4D<float> t2x2x2x3(2, 2, 2, 3);
   auto filler2x3 = MakeLinspaceArray2D(1.0f, 6.0f, 2, 3);
   t2x2x2x3.FillWithYX(*filler2x3);
-  auto input_literal = Literal::CreateFromArray(t2x2x2x3);
+  auto input_literal = LiteralUtil::CreateFromArray(t2x2x2x3);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
   Collapse(/*operand=*/parameter, /*dimensions=*/{1, 2, 3});
-  auto expected_literal = Literal::CreateR2<float>(
+  auto expected_literal = LiteralUtil::CreateR2<float>(
       {{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f},
        {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f,
         6.0f}});
@@ -516,7 +517,7 @@ XLA_TEST_P(ReshapeTest, FullyConnectedCollapseDesugared) {
   t(1, 0, 0, 1) = 5;
   t(1, 0, 1, 0) = 6;
   t(1, 0, 1, 1) = 7;
-  auto input_literal = Literal::CreateFromArray(t);
+  auto input_literal = LiteralUtil::CreateFromArray(t);
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input",
                                                  &builder, &parameter);
@@ -524,7 +525,7 @@ XLA_TEST_P(ReshapeTest, FullyConnectedCollapseDesugared) {
           /*new_sizes=*/{2, 4});
 
   auto expected_literal =
-      Literal::CreateR2<float>({{0, 1, 2, 3}, {4, 5, 6, 7}});
+      LiteralUtil::CreateR2<float>({{0, 1, 2, 3}, {4, 5, 6, 7}});
   ComputeAndCompareLiteral(&builder, *expected_literal, {input.get()},
                            zero_error_spec_);
 }
@@ -545,7 +546,7 @@ XLA_TEST_P(ReshapeTest, ToScalar) {
                                                    &b, &parameter);
     Reshape(parameter, dimensions, {});
 
-    auto expected_literal = Literal::CreateR0<float>(83.0f);
+    auto expected_literal = LiteralUtil::CreateR0<float>(83.0f);
     ComputeAndCompareLiteral(&b, *expected_literal, {input.get()},
                              zero_error_spec_);
   }
@@ -553,7 +554,7 @@ XLA_TEST_P(ReshapeTest, ToScalar) {
 
 XLA_TEST_P(ReshapeTest, BadDimensions) {
   XlaBuilder b(TestName());
-  auto input_literal = Literal::CreateR1<float>({1.0f});
+  auto input_literal = LiteralUtil::CreateR1<float>({1.0f});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input", &b,
                                                  &parameter);
@@ -565,7 +566,7 @@ XLA_TEST_P(ReshapeTest, BadDimensions) {
 
 XLA_TEST_P(ReshapeTest, BadNewSizes) {
   XlaBuilder b(TestName());
-  auto input_literal = Literal::CreateR1<float>({1.0f, 2.0f});
+  auto input_literal = LiteralUtil::CreateR1<float>({1.0f, 2.0f});
   XlaOp parameter;
   auto input = CreateParameterAndTransferLiteral(0, *input_literal, "input", &b,
                                                  &parameter);
@@ -577,7 +578,8 @@ XLA_TEST_P(ReshapeTest, BadNewSizes) {
 XLA_TEST_P(ReshapeTest, R4Dim0MinorLayoutToR2Dim0MajorLayout) {
   XlaBuilder builder(TestName());
   // clang-format off
-  auto input_literal = Literal::CreateR4FromArray4DWithLayout(Array4D<float>{
+  auto input_literal = LiteralUtil::CreateR4FromArray4DWithLayout(
+      Array4D<float>{
     {
       {
         {0, 1},
@@ -622,16 +624,16 @@ XLA_TEST_P(ReshapeTest, R4Dim0MinorLayoutToR2Dim0MajorLayout) {
           ->ExecuteAndTransfer(computation, {input.get()}, &execution_options)
           .ConsumeValueOrDie();
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2FromArray2D<float>(expected_array);
+      LiteralUtil::CreateR2FromArray2D<float>(expected_array);
   if (use_bfloat16()) {
-    expected = Literal::ConvertF32ToBF16(*expected);
+    expected = LiteralUtil::ConvertF32ToBF16(*expected);
   }
   EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *actual));
 }
 
 XLA_TEST_P(ReshapeTest, R2ToR4_3x8_To_3x2x1x4) {
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> input_literal = Literal::CreateR2<float>({
+  std::unique_ptr<Literal> input_literal = LiteralUtil::CreateR2<float>({
       {0, 1, 2, 3, 4, 5, 6, 7},
       {100, 101, 102, 103, 104, 105, 106, 107},
       {200, 201, 202, 203, 204, 205, 206, 207},
@@ -642,7 +644,7 @@ XLA_TEST_P(ReshapeTest, R2ToR4_3x8_To_3x2x1x4) {
   Reshape(parameter, /*dimensions=*/{0, 1}, /*new_sizes=*/{3, 2, 1, 4});
 
   // clang-format off
-  auto expected_literal = Literal::CreateR4<float>({
+  auto expected_literal = LiteralUtil::CreateR4<float>({
     {{{0, 1, 2, 3}},
      {{4, 5, 6, 7}}},
     {{{100, 101, 102, 103}},
@@ -658,7 +660,7 @@ XLA_TEST_P(ReshapeTest, R2ToR4_3x8_To_3x2x1x4) {
 // Tests R2->R4 reshape with the reshape dimensions {1, 0}.
 XLA_TEST_P(ReshapeTest, R2ToR4_3x8_To_3x2x1x4_Dimensions_10) {
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> input_literal = Literal::CreateR2<float>({
+  std::unique_ptr<Literal> input_literal = LiteralUtil::CreateR2<float>({
       {0, 1, 2, 3, 4, 5, 6, 7},
       {100, 101, 102, 103, 104, 105, 106, 107},
       {200, 201, 202, 203, 204, 205, 206, 207},
@@ -669,7 +671,7 @@ XLA_TEST_P(ReshapeTest, R2ToR4_3x8_To_3x2x1x4_Dimensions_10) {
   Reshape(parameter, /*dimensions=*/{1, 0}, /*new_sizes=*/{3, 2, 1, 4});
 
   // clang-format off
-  auto expected_literal = Literal::CreateR4<float>({
+  auto expected_literal = LiteralUtil::CreateR4<float>({
     {{{0, 100, 200, 1}},
      {{101, 201, 2, 102}}},
     {{{202, 3, 103, 203}},
@@ -691,7 +693,7 @@ XLA_TEST_P(ReshapeTest, R4ToR2_2x1x1x1_To_2x1) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaOp parameter;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -699,7 +701,7 @@ XLA_TEST_P(ReshapeTest, R4ToR2_2x1x1x1_To_2x1) {
   Reshape(parameter, /*dimensions=*/{0, 1, 2, 3}, /*new_sizes=*/{2, 1});
 
   std::unique_ptr<Literal> expected =
-      Literal::ReshapeSlice({2, 1}, {1, 0}, *input_literal);
+      LiteralUtil::ReshapeSlice({2, 1}, {1, 0}, *input_literal);
   ComputeAndCompareLiteral(&builder, *expected, {input_data.get()},
                            zero_error_spec_);
 }
@@ -713,7 +715,7 @@ XLA_TEST_P(ReshapeTest, R4ToR2_2x1x4x1_To_4x2) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaOp parameter;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -721,7 +723,7 @@ XLA_TEST_P(ReshapeTest, R4ToR2_2x1x4x1_To_4x2) {
   Reshape(parameter, /*dimensions=*/{0, 1, 2, 3}, /*new_sizes=*/{4, 2});
 
   std::unique_ptr<Literal> expected =
-      Literal::ReshapeSlice({4, 2}, {1, 0}, *input_literal);
+      LiteralUtil::ReshapeSlice({4, 2}, {1, 0}, *input_literal);
   ComputeAndCompareLiteral(&builder, *expected, {input_data.get()},
                            zero_error_spec_);
 }
@@ -736,7 +738,7 @@ XLA_TEST_P(ReshapeTest, R4ToR2_5x10x2x3_To_5x60_Dimensions_0213) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaOp parameter;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -749,7 +751,7 @@ XLA_TEST_P(ReshapeTest, R4ToR2_5x10x2x3_To_5x60_Dimensions_0213) {
     expected_array(indices[0], indices[2] * 30 + indices[1] * 3 + indices[3]) =
         *cell;
   });
-  auto expected = Literal::CreateR2FromArray2D(expected_array);
+  auto expected = LiteralUtil::CreateR2FromArray2D(expected_array);
   ComputeAndCompareLiteral(&builder, *expected, {input_data.get()},
                            zero_error_spec_);
 }
@@ -763,7 +765,7 @@ XLA_TEST_P(ReshapeTest, NoopReshape) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input_array, LayoutUtil::MakeLayout({1, 2, 3, 0}));
   XlaOp parameter;
   auto input_data = CreateParameterAndTransferLiteral(
@@ -785,7 +787,7 @@ XLA_TEST_P(ReshapeTest, NoopReshape) {
   // Since the reshape is a no-op, verify that it does not change the underlying
   // data.
   if (use_bfloat16()) {
-    auto expected = Literal::ConvertF32ToBF16(*input_literal);
+    auto expected = LiteralUtil::ConvertF32ToBF16(*input_literal);
     EXPECT_EQ(expected->data<bfloat16>(), output_literal->data<bfloat16>());
   } else {
     EXPECT_EQ(input_literal->data<float>(), output_literal->data<float>());
@@ -794,7 +796,7 @@ XLA_TEST_P(ReshapeTest, NoopReshape) {
 
 XLA_TEST_P(ReshapeTest, R4ToR4Reshape_Trivial) {
   XlaBuilder builder(TestName());
-  auto literal_1x2x3x4 = Literal::CreateR4<float>(
+  auto literal_1x2x3x4 = LiteralUtil::CreateR4<float>(
       {{{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}},
         {{13, 14, 15, 16}, {17, 18, 19, 20}, {21, 22, 23, 24}}}});
 
@@ -808,7 +810,7 @@ XLA_TEST_P(ReshapeTest, R4ToR4Reshape_Trivial) {
 }
 
 XLA_TEST_P(ReshapeTest, R4ToR4Reshape) {
-  auto literal_1x2x3x4 = Literal::CreateR4<float>(
+  auto literal_1x2x3x4 = LiteralUtil::CreateR4<float>(
       {{{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}},
         {{13, 14, 15, 16}, {17, 18, 19, 20}, {21, 22, 23, 24}}}});
 
@@ -820,7 +822,7 @@ XLA_TEST_P(ReshapeTest, R4ToR4Reshape) {
           /*new_sizes=*/{2, 4, 3, 1});
 
   // clang-format off
-  auto expected_2x4x3x1 = Literal::CreateR4<float>(
+  auto expected_2x4x3x1 = LiteralUtil::CreateR4<float>(
       {{{{1}, {5}, {9}},
         {{2}, {6}, {10}},
         {{3}, {7}, {11}},
@@ -844,7 +846,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeSimple) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaBuilder builder(TestName());
   XlaOp parameter;
@@ -854,7 +856,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeSimple) {
           /*new_sizes=*/new_bounds);
 
   std::unique_ptr<Literal> expected =
-      Literal::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
+      LiteralUtil::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
           ->Relayout(LayoutUtil::MakeLayout({3, 2, 1, 0}));
 
   // Specify the requested output shape explicitly to ensure that this reshape
@@ -873,7 +875,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeMajorFirstEffectiveR2) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaBuilder builder(TestName());
   XlaOp parameter;
@@ -883,7 +885,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeMajorFirstEffectiveR2) {
           /*new_sizes=*/new_bounds);
 
   std::unique_ptr<Literal> expected =
-      Literal::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
+      LiteralUtil::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
           ->Relayout(LayoutUtil::MakeLayout({3, 2, 1, 0}));
 
   // Specify the requested output shape explicitly to ensure that this reshape
@@ -902,7 +904,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeMajorFirstMinorEffectiveR1) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaBuilder builder(TestName());
   XlaOp parameter;
@@ -912,7 +914,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeMajorFirstMinorEffectiveR1) {
           /*new_sizes=*/new_bounds);
 
   std::unique_ptr<Literal> expected =
-      Literal::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
+      LiteralUtil::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
           ->Relayout(LayoutUtil::MakeLayout({3, 2, 1, 0}));
 
   // Specify the requested output shape explicitly to ensure that this reshape
@@ -932,7 +934,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeMajorFirstMinorEffectiveR1InR2) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({3, 2, 1, 0}));
   XlaBuilder builder(TestName());
   XlaOp parameter;
@@ -942,7 +944,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeMajorFirstMinorEffectiveR1InR2) {
           /*new_sizes=*/new_bounds);
 
   std::unique_ptr<Literal> expected =
-      Literal::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
+      LiteralUtil::ReshapeSlice(new_bounds, {2, 3, 1, 0}, *input_literal)
           ->Relayout(LayoutUtil::MakeLayout({3, 2, 1, 0}));
 
   // Specify the requested output shape explicitly to ensure that this reshape
@@ -961,7 +963,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeTrivialR2) {
       [&rng, &distribution](tensorflow::gtl::ArraySlice<int64> /* indices */,
                             float* cell) { *cell = distribution(rng); });
   std::unique_ptr<Literal> input_literal =
-      Literal::CreateR4FromArray4DWithLayout(
+      LiteralUtil::CreateR4FromArray4DWithLayout(
           input, LayoutUtil::MakeLayout({0, 1, 2, 3}));
   XlaBuilder builder(TestName());
   XlaOp parameter;
@@ -971,7 +973,7 @@ XLA_TEST_P(ReshapeTest, R4TwoMinorTransposeTrivialR2) {
           /*new_sizes=*/new_bounds);
 
   std::unique_ptr<Literal> expected =
-      Literal::ReshapeSlice(new_bounds, {1, 0, 2, 3}, *input_literal)
+      LiteralUtil::ReshapeSlice(new_bounds, {1, 0, 2, 3}, *input_literal)
           ->Relayout(input_literal->shape().layout());
 
   // Specify the requested output shape explicitly to ensure that this reshape
diff --git a/tensorflow/compiler/xla/tests/reverse_test.cc b/tensorflow/compiler/xla/tests/reverse_test.cc
index 662bc42224..41e49b4003 100644
--- a/tensorflow/compiler/xla/tests/reverse_test.cc
+++ b/tensorflow/compiler/xla/tests/reverse_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
@@ -82,7 +82,7 @@ TEST_P(FloatReverseTest, Reverses) {
   std::vector<float> input_vector(
       ShapeUtil::ElementsIn(ShapeUtil::MakeShape(F32, spec.input_dims)));
   std::iota(input_vector.begin(), input_vector.end(), 0.0);
-  auto r1_literal = Literal::CreateR1<float>(input_vector);
+  auto r1_literal = LiteralUtil::CreateR1<float>(input_vector);
   auto input_literal = r1_literal->Reshape(spec.input_dims).ConsumeValueOrDie();
 
   XlaBuilder builder(TestName());
diff --git a/tensorflow/compiler/xla/tests/round_trip_packed_literal_test.cc b/tensorflow/compiler/xla/tests/round_trip_packed_literal_test.cc
index 7cfca781ac..a620fe1908 100644
--- a/tensorflow/compiler/xla/tests/round_trip_packed_literal_test.cc
+++ b/tensorflow/compiler/xla/tests/round_trip_packed_literal_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/packed_literal_reader.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
diff --git a/tensorflow/compiler/xla/tests/round_trip_transfer_test.cc b/tensorflow/compiler/xla/tests/round_trip_transfer_test.cc
index f334a8c131..a8193c2eac 100644
--- a/tensorflow/compiler/xla/tests/round_trip_transfer_test.cc
+++ b/tensorflow/compiler/xla/tests/round_trip_transfer_test.cc
@@ -23,7 +23,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array4d.h"
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -46,61 +46,62 @@ class RoundTripTransferTest : public ClientLibraryTestBase {
 };
 
 TEST_F(RoundTripTransferTest, R0S32) {
-  RoundTripTest(*Literal::CreateR0<int32>(42));
+  RoundTripTest(*LiteralUtil::CreateR0<int32>(42));
 }
 
 TEST_F(RoundTripTransferTest, R0F32) {
-  RoundTripTest(*Literal::CreateR0<float>(42.0));
+  RoundTripTest(*LiteralUtil::CreateR0<float>(42.0));
 }
 
 TEST_F(RoundTripTransferTest, R1F32_Len0) {
-  RoundTripTest(*Literal::CreateR1<float>({}));
+  RoundTripTest(*LiteralUtil::CreateR1<float>({}));
 }
 
 TEST_F(RoundTripTransferTest, R1F32_Len2) {
-  RoundTripTest(*Literal::CreateR1<float>({42.0, 64.0}));
+  RoundTripTest(*LiteralUtil::CreateR1<float>({42.0, 64.0}));
 }
 
 TEST_F(RoundTripTransferTest, R1F32_Len256) {
   std::vector<float> values(256);
   std::iota(values.begin(), values.end(), 1.0);
-  RoundTripTest(*Literal::CreateR1<float>(values));
+  RoundTripTest(*LiteralUtil::CreateR1<float>(values));
 }
 
 TEST_F(RoundTripTransferTest, R1F32_Len1024) {
   std::vector<float> values(1024);
   std::iota(values.begin(), values.end(), 1.0);
-  RoundTripTest(*Literal::CreateR1<float>(values));
+  RoundTripTest(*LiteralUtil::CreateR1<float>(values));
 }
 
 TEST_F(RoundTripTransferTest, R1F32_Len1025) {
   std::vector<float> values(1025);
   std::iota(values.begin(), values.end(), 1.0);
-  RoundTripTest(*Literal::CreateR1<float>(values));
+  RoundTripTest(*LiteralUtil::CreateR1<float>(values));
 }
 
 TEST_F(RoundTripTransferTest, R1F32_Len4096) {
   std::vector<float> values(4096);
   std::iota(values.begin(), values.end(), 1.0);
-  RoundTripTest(*Literal::CreateR1<float>(values));
+  RoundTripTest(*LiteralUtil::CreateR1<float>(values));
 }
 
 TEST_F(RoundTripTransferTest, R2F32_Len10x0) {
-  RoundTripTest(*Literal::CreateR2FromArray2D<float>(Array2D<float>(10, 0)));
+  RoundTripTest(
+      *LiteralUtil::CreateR2FromArray2D<float>(Array2D<float>(10, 0)));
 }
 
 TEST_F(RoundTripTransferTest, R2F32_Len2x2) {
-  RoundTripTest(*Literal::CreateR2<float>({{42.0, 64.0}, {77.0, 88.0}}));
+  RoundTripTest(*LiteralUtil::CreateR2<float>({{42.0, 64.0}, {77.0, 88.0}}));
 }
 
 TEST_F(RoundTripTransferTest, R3F32) {
   RoundTripTest(
-      *Literal::CreateR3<float>({{{1.0, 2.0}, {1.0, 2.0}, {1.0, 2.0}},
-                                 {{3.0, 4.0}, {3.0, 4.0}, {3.0, 4.0}}}));
+      *LiteralUtil::CreateR3<float>({{{1.0, 2.0}, {1.0, 2.0}, {1.0, 2.0}},
+                                     {{3.0, 4.0}, {3.0, 4.0}, {3.0, 4.0}}}));
 }
 
 TEST_F(RoundTripTransferTest, R4F32) {
-  RoundTripTest(*Literal::CreateR4<float>({{
+  RoundTripTest(*LiteralUtil::CreateR4<float>({{
       {{10, 11, 12, 13}, {14, 15, 16, 17}},
       {{18, 19, 20, 21}, {22, 23, 24, 25}},
       {{26, 27, 28, 29}, {30, 31, 32, 33}},
@@ -108,33 +109,36 @@ TEST_F(RoundTripTransferTest, R4F32) {
 }
 
 TEST_F(RoundTripTransferTest, EmptyTuple) {
-  RoundTripTest(*Literal::MakeTuple({}));
+  RoundTripTest(*LiteralUtil::MakeTuple({}));
 }
 
 TEST_F(RoundTripTransferTest, TupleOfR1F32) {
-  RoundTripTest(*Literal::MakeTuple({Literal::CreateR1<float>({1, 2}).get(),
-                                     Literal::CreateR1<float>({3, 4}).get()}));
+  RoundTripTest(
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({1, 2}).get(),
+                               LiteralUtil::CreateR1<float>({3, 4}).get()}));
 }
 
 TEST_F(RoundTripTransferTest, TupleOfR1F32_Len0_Len2) {
-  RoundTripTest(*Literal::MakeTuple({Literal::CreateR1<float>({}).get(),
-                                     Literal::CreateR1<float>({3, 4}).get()}));
+  RoundTripTest(
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>({}).get(),
+                               LiteralUtil::CreateR1<float>({3, 4}).get()}));
 }
 
 TEST_F(RoundTripTransferTest, TupleOfR0F32AndR1S32) {
-  RoundTripTest(*Literal::MakeTuple({Literal::CreateR0<float>(1.0).get(),
-                                     Literal::CreateR1<int>({2, 3}).get()}));
+  RoundTripTest(
+      *LiteralUtil::MakeTuple({LiteralUtil::CreateR0<float>(1.0).get(),
+                               LiteralUtil::CreateR1<int>({2, 3}).get()}));
 }
 
 // Below two tests are added to identify the cost of large data transfers.
 TEST_F(RoundTripTransferTest, R2F32_Large) {
-  RoundTripTest(*Literal::CreateR2F32Linspace(-1.0f, 1.0f, 512, 512));
+  RoundTripTest(*LiteralUtil::CreateR2F32Linspace(-1.0f, 1.0f, 512, 512));
 }
 
 TEST_F(RoundTripTransferTest, R4F32_Large) {
   Array4D<float> array4d(2, 2, 256, 256);
   array4d.FillWithMultiples(1.0f);
-  RoundTripTest(*Literal::CreateR4FromArray4D<float>(array4d));
+  RoundTripTest(*LiteralUtil::CreateR4FromArray4D<float>(array4d));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/scalar_computations_test.cc b/tensorflow/compiler/xla/tests/scalar_computations_test.cc
index 3afd8c8fc8..e42c71eb28 100644
--- a/tensorflow/compiler/xla/tests/scalar_computations_test.cc
+++ b/tensorflow/compiler/xla/tests/scalar_computations_test.cc
@@ -19,8 +19,9 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -162,7 +163,7 @@ XLA_TEST_F(ScalarComputationsTest, CastS64ToF32) {
   ConvertElementType(a, F32);
 
   int64 value = 3LL << 35;
-  std::unique_ptr<Literal> a_literal = Literal::CreateR0<int64>(value);
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR0<int64>(value);
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
   ComputeAndCompareR0<float>(&builder, static_cast<float>(value),
@@ -226,9 +227,9 @@ XLA_TEST_F(ScalarComputationsTest, MulThreeScalarsS32) {
 
 XLA_TEST_F(ScalarComputationsTest, MulThreeScalarsF32Params) {
   XlaBuilder builder(TestName());
-  std::unique_ptr<Literal> a_literal = Literal::CreateR0<float>(2.1f);
-  std::unique_ptr<Literal> b_literal = Literal::CreateR0<float>(5.5f);
-  std::unique_ptr<Literal> c_literal = Literal::CreateR0<float>(0.5f);
+  std::unique_ptr<Literal> a_literal = LiteralUtil::CreateR0<float>(2.1f);
+  std::unique_ptr<Literal> b_literal = LiteralUtil::CreateR0<float>(5.5f);
+  std::unique_ptr<Literal> c_literal = LiteralUtil::CreateR0<float>(0.5f);
 
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
@@ -375,8 +376,8 @@ XLA_TEST_F(ScalarComputationsTest, DivU32s) {
   for (uint32 divisor : vals) {
     if (divisor != 0) {
       for (uint32 dividend : vals) {
-        auto dividend_literal = Literal::CreateR0<uint32>(dividend);
-        auto divisor_literal = Literal::CreateR0<uint32>(divisor);
+        auto dividend_literal = LiteralUtil::CreateR0<uint32>(dividend);
+        auto divisor_literal = LiteralUtil::CreateR0<uint32>(divisor);
         TF_ASSERT_OK_AND_ASSIGN(auto dividend_data,
                                 client_->TransferToServer(*dividend_literal));
         TF_ASSERT_OK_AND_ASSIGN(auto divisor_data,
@@ -387,7 +388,8 @@ XLA_TEST_F(ScalarComputationsTest, DivU32s) {
                                      {dividend_data.get(), divisor_data.get()},
                                      &execution_options_)
                 .ConsumeValueOrDie();
-        auto expected_literal = Literal::CreateR0<uint32>(dividend / divisor);
+        auto expected_literal =
+            LiteralUtil::CreateR0<uint32>(dividend / divisor);
         EXPECT_TRUE(LiteralTestUtil::Equal(*expected_literal, *actual_literal));
       }
     }
@@ -416,8 +418,8 @@ XLA_TEST_F(ScalarComputationsTest, RemU32s) {
   for (uint32 divisor : vals) {
     if (divisor != 0) {
       for (uint32 dividend : vals) {
-        auto dividend_literal = Literal::CreateR0<uint32>(dividend);
-        auto divisor_literal = Literal::CreateR0<uint32>(divisor);
+        auto dividend_literal = LiteralUtil::CreateR0<uint32>(dividend);
+        auto divisor_literal = LiteralUtil::CreateR0<uint32>(divisor);
         TF_ASSERT_OK_AND_ASSIGN(auto dividend_data,
                                 client_->TransferToServer(*dividend_literal));
         TF_ASSERT_OK_AND_ASSIGN(auto divisor_data,
@@ -428,7 +430,8 @@ XLA_TEST_F(ScalarComputationsTest, RemU32s) {
                                      {dividend_data.get(), divisor_data.get()},
                                      &execution_options_)
                 .ConsumeValueOrDie();
-        auto expected_literal = Literal::CreateR0<uint32>(dividend % divisor);
+        auto expected_literal =
+            LiteralUtil::CreateR0<uint32>(dividend % divisor);
         EXPECT_TRUE(LiteralTestUtil::Equal(*expected_literal, *actual_literal));
       }
     }
@@ -440,7 +443,7 @@ XLA_TEST_F(ScalarComputationsTest, RemainderTwoScalarsNonConstDividendS32) {
   auto x = Parameter(&builder, 0, ShapeUtil::MakeShape(S32, {}), "x");
   Rem(x, ConstantR0<int32>(&builder, 80000));
 
-  std::unique_ptr<Literal> literal = Literal::CreateR0<int32>(87919);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR0<int32>(87919);
   TF_ASSERT_OK_AND_ASSIGN(auto input_data, client_->TransferToServer(*literal));
   ComputeAndCompareR0<int32>(&builder, 7919, {input_data.get()});
 }
diff --git a/tensorflow/compiler/xla/tests/select_and_scatter_test.cc b/tensorflow/compiler/xla/tests/select_and_scatter_test.cc
index 0a173fbbbd..e3d4f98dd7 100644
--- a/tensorflow/compiler/xla/tests/select_and_scatter_test.cc
+++ b/tensorflow/compiler/xla/tests/select_and_scatter_test.cc
@@ -22,10 +22,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
 #include "tensorflow/compiler/xla/client/padding.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
diff --git a/tensorflow/compiler/xla/tests/select_test.cc b/tensorflow/compiler/xla/tests/select_test.cc
index 59409ab26e..1c01402798 100644
--- a/tensorflow/compiler/xla/tests/select_test.cc
+++ b/tensorflow/compiler/xla/tests/select_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
diff --git a/tensorflow/compiler/xla/tests/slice_test.cc b/tensorflow/compiler/xla/tests/slice_test.cc
index 3e5c01d6d4..b8ad6668f8 100644
--- a/tensorflow/compiler/xla/tests/slice_test.cc
+++ b/tensorflow/compiler/xla/tests/slice_test.cc
@@ -20,7 +20,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -170,7 +170,7 @@ XLA_TEST_F(SliceTest, StridedSliceR4WithOutputLayout) {
   values.FillRandom(3.14f);
   auto expected = ReferenceUtil::Slice4D(values, {{0, 0, 0, 0}}, {{2, 4, 6, 8}},
                                          /*strides=*/{{1, 1, 2, 1}});
-  auto expected_literal = Literal::CreateR4FromArray4DWithLayout(
+  auto expected_literal = LiteralUtil::CreateR4FromArray4DWithLayout(
       *expected, LayoutUtil::MakeLayout({0, 1, 2, 3}));
   XlaBuilder builder(TestName());
   auto original = ConstantR4FromArray4D(&builder, values);
@@ -197,7 +197,7 @@ class SliceR1Test : public ClientLibraryTestBase,
     // vector<bool>.
     tensorflow::gtl::InlinedVector<NativeT, 1> input(spec.input_dim0);
     std::iota(input.begin(), input.end(), NativeT());
-    auto literal = Literal::CreateR1<NativeT>(input);
+    auto literal = LiteralUtil::CreateR1<NativeT>(input);
 
     XlaBuilder builder(TestName());
     auto original = Parameter(&builder, 0, literal->shape(), "p0");
@@ -344,7 +344,11 @@ INSTANTIATE_TEST_CASE_P(
         R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 2},
         R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 8},
         R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 7},
-        R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 125}
+        R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 125},
+        R1Spec{16 * 1024 * 1024, 0, 16 * 1024 * 1024, 4097},
+        R1Spec{16 * 1024 * 1024, 0, 16 * 1024 * 1024, 4093},
+        R1Spec{16 * 1024 * 1024, 12 * 1024 + 17, 16 * 1024 * 1024 - 231, 4097},
+        R1Spec{16 * 1024 * 1024, 12 * 1024 + 17, 16 * 1024 * 1024 - 231, 4093}
     ),
     SliceR1TestDataToString
 );
@@ -368,7 +372,7 @@ XLA_TEST_P(SliceR2Test, DoIt) {
   const R2Spec& spec = GetParam();
   Array2D<int32> input(spec.input_dim0, spec.input_dim1);
   input.FillUnique();
-  auto literal = Literal::CreateR2FromArray2DWithLayout(
+  auto literal = LiteralUtil::CreateR2FromArray2DWithLayout(
       input, LayoutUtil::MakeLayout(spec.layout));
 
   XlaBuilder builder(TestName());
@@ -463,7 +467,7 @@ class SliceR4Test : public ClientLibraryTestBase,
     auto expected = ReferenceUtil::Slice4D(
         values, spec.slice_starts, spec.slice_limits, spec.slice_strides);
     XlaBuilder builder(TestName());
-    auto literal = Literal::CreateR4FromArray4DWithLayout(
+    auto literal = LiteralUtil::CreateR4FromArray4DWithLayout(
         values, LayoutUtil::MakeLayout(spec.input_layout));
     auto parameter = Parameter(&builder, 0, literal->shape(), "p0");
     TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<GlobalData> arg,
diff --git a/tensorflow/compiler/xla/tests/test_utils.cc b/tensorflow/compiler/xla/tests/test_utils.cc
index 20c7c30878..2647937013 100644
--- a/tensorflow/compiler/xla/tests/test_utils.cc
+++ b/tensorflow/compiler/xla/tests/test_utils.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/tests/test_utils.h"
+#include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
 #include "tensorflow/compiler/xla/service/hlo_dataflow_analysis.h"
 #include "tensorflow/compiler/xla/service/hlo_verifier.h"
@@ -110,7 +111,7 @@ StatusOr<std::unique_ptr<Literal>> MakeFakeLiteralInternal(
                           MakeFakeLiteralInternal(element_shape, engine));
       elements.push_back(std::move(element));
     }
-    return Literal::MakeTupleOwned(std::move(elements));
+    return LiteralUtil::MakeTupleOwned(std::move(elements));
   }
   if (engine == nullptr) {
     return Literal::CreateFromShape(shape);
@@ -220,7 +221,7 @@ std::unique_ptr<Literal> MakeRandomNonwrappingSliceIndex(
       start_indices[i] = generator(*engine);
     }
   }
-  return Literal::CreateR1<int32>(start_indices);
+  return LiteralUtil::CreateR1<int32>(start_indices);
 }
 
 // Use dataflow analysis on each parameter to see if there are uses that would
@@ -318,9 +319,9 @@ StatusOr<std::unique_ptr<Literal>> CreateLiteralForConstrainedUses(
   } else if (needs_constant != nullptr) {
     switch (constant_type) {
       case ConstantType::kZero:
-        return Literal::Zero(param.shape().element_type()).CloneToUnique();
+        return LiteralUtil::Zero(param.shape().element_type()).CloneToUnique();
       case ConstantType::kOne:
-        return Literal::One(param.shape().element_type()).CloneToUnique();
+        return LiteralUtil::One(param.shape().element_type()).CloneToUnique();
       case ConstantType::kUnknown:
         // We want the identity element for the computation, but we don't really
         // know what it is - so any value we generate will be just as wrong.
diff --git a/tensorflow/compiler/xla/tests/test_utils.h b/tensorflow/compiler/xla/tests/test_utils.h
index a8689f6498..e59f215a9a 100644
--- a/tensorflow/compiler/xla/tests/test_utils.h
+++ b/tensorflow/compiler/xla/tests/test_utils.h
@@ -21,7 +21,7 @@ limitations under the License.
 #include <random>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/service/hlo_module.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
diff --git a/tensorflow/compiler/xla/tests/test_utils_test.cc b/tensorflow/compiler/xla/tests/test_utils_test.cc
index 8f424ae81f..a2f0338e25 100644
--- a/tensorflow/compiler/xla/tests/test_utils_test.cc
+++ b/tensorflow/compiler/xla/tests/test_utils_test.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/tests/test_utils.h"
 
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/local_client_test_base.h"
diff --git a/tensorflow/compiler/xla/tests/token_hlo_test.cc b/tensorflow/compiler/xla/tests/token_hlo_test.cc
index e9008fa48a..2bdbd08309 100644
--- a/tensorflow/compiler/xla/tests/token_hlo_test.cc
+++ b/tensorflow/compiler/xla/tests/token_hlo_test.cc
@@ -31,21 +31,21 @@ class TokenHloTest : public HloTestBase {};
 XLA_TEST_F(TokenHloTest, SingleTokenInstruction) {
   std::unique_ptr<HloModule> module = CreateNewModule();
   auto builder = HloComputation::Builder(TestName());
-  builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  builder.AddInstruction(HloInstruction::CreateToken());
 
   module->AddEntryComputation(builder.Build());
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result,
                           Execute(std::move(module), {}));
-  EXPECT_TRUE(LiteralTestUtil::Equal(*result, *Literal::CreateToken()));
+  EXPECT_TRUE(LiteralTestUtil::Equal(*result, *LiteralUtil::CreateToken()));
 }
 
 XLA_TEST_F(TokenHloTest, TokenTree) {
   std::unique_ptr<HloModule> module = CreateNewModule();
   auto builder = HloComputation::Builder(TestName());
-  auto token0 = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
-  auto token1 = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
-  auto token2 = builder.AddInstruction(HloInstruction::CreateAfterAll({}));
+  auto token0 = builder.AddInstruction(HloInstruction::CreateToken());
+  auto token1 = builder.AddInstruction(HloInstruction::CreateToken());
+  auto token2 = builder.AddInstruction(HloInstruction::CreateToken());
   builder.AddInstruction(
       HloInstruction::CreateAfterAll({token0, token0, token1, token2}));
 
@@ -53,7 +53,7 @@ XLA_TEST_F(TokenHloTest, TokenTree) {
 
   TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result,
                           Execute(std::move(module), {}));
-  EXPECT_TRUE(LiteralTestUtil::Equal(*result, *Literal::CreateToken()));
+  EXPECT_TRUE(LiteralTestUtil::Equal(*result, *LiteralUtil::CreateToken()));
 }
 
 XLA_TEST_F(TokenHloTest, InvalidTokenShapedEntryParameter) {
@@ -64,7 +64,7 @@ XLA_TEST_F(TokenHloTest, InvalidTokenShapedEntryParameter) {
   builder.AddInstruction(
       HloInstruction::CreateParameter(1, ShapeUtil::MakeTokenShape(), "p1"));
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(42)));
   module->AddEntryComputation(builder.Build());
 
   Status status = HloVerifier().Run(module.get()).status();
@@ -98,7 +98,7 @@ XLA_TEST_F(TokenHloTest, InvalidOperandToTokenInstruction) {
       HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(F32, {}), "p0"));
   builder.AddInstruction(HloInstruction::CreateAfterAll({param}));
   builder.AddInstruction(
-      HloInstruction::CreateConstant(Literal::CreateR0<int32>(123)));
+      HloInstruction::CreateConstant(LiteralUtil::CreateR0<int32>(123)));
   module->AddEntryComputation(builder.Build());
 
   Status status = HloVerifier().Run(module.get()).status();
@@ -184,7 +184,7 @@ ENTRY %TokenInConditional (param.3: pred[]) -> s32[] {
     TF_ASSERT_OK_AND_ASSIGN(
         std::unique_ptr<HloModule> module,
         HloRunner::CreateModuleFromString(module_string, debug_options));
-    auto arg = Literal::CreateR0<bool>(true);
+    auto arg = LiteralUtil::CreateR0<bool>(true);
     TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result,
                             Execute(std::move(module), {arg.get()}));
     EXPECT_EQ(42, result->Get<int32>({}));
@@ -195,7 +195,7 @@ ENTRY %TokenInConditional (param.3: pred[]) -> s32[] {
     TF_ASSERT_OK_AND_ASSIGN(
         std::unique_ptr<HloModule> module,
         HloRunner::CreateModuleFromString(module_string, debug_options));
-    auto arg = Literal::CreateR0<bool>(false);
+    auto arg = LiteralUtil::CreateR0<bool>(false);
     TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<Literal> result,
                             Execute(std::move(module), {arg.get()}));
     EXPECT_EQ(7, result->Get<int32>({}));
diff --git a/tensorflow/compiler/xla/tests/transfer_manager_test.cc b/tensorflow/compiler/xla/tests/transfer_manager_test.cc
index 86babb58c9..125513ddfd 100644
--- a/tensorflow/compiler/xla/tests/transfer_manager_test.cc
+++ b/tensorflow/compiler/xla/tests/transfer_manager_test.cc
@@ -18,10 +18,11 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/layout_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/device_memory_allocator.h"
 #include "tensorflow/compiler/xla/service/generic_transfer_manager.h"
 #include "tensorflow/compiler/xla/service/shaped_buffer.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
@@ -60,7 +61,7 @@ class TransferManagerTest : public LocalClientTestBase {
   }
 
  protected:
-  Backend::StreamPtr stream_ptr_;
+  StreamPool::Ptr stream_ptr_;
   se::Stream* stream_;
 
  private:
@@ -68,7 +69,7 @@ class TransferManagerTest : public LocalClientTestBase {
 };
 
 XLA_TEST_F(TransferManagerTest, TransferR0U32) {
-  std::unique_ptr<Literal> literal = Literal::CreateR0<uint32>(42);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR0<uint32>(42);
   const Shape& shape = literal->shape();
   auto device_buffer = AllocateDeviceBuffer(shape);
 
@@ -84,7 +85,7 @@ XLA_TEST_F(TransferManagerTest, TransferR0U32) {
 
 XLA_TEST_F(TransferManagerTest, TransferR1F32) {
   std::unique_ptr<Literal> literal =
-      Literal::CreateR1<float>({1.25f, 2.5f, -17.0f, -20.125f});
+      LiteralUtil::CreateR1<float>({1.25f, 2.5f, -17.0f, -20.125f});
   const Shape& shape = literal->shape();
   auto device_buffer = AllocateDeviceBuffer(shape);
 
@@ -102,7 +103,7 @@ XLA_TEST_F(TransferManagerTest, TransferR1F32) {
 XLA_TEST_F(TransferManagerTest, TransferR1LargeF32) {
   std::vector<float> test_vector(1024 * 1024);
   std::iota(test_vector.begin(), test_vector.end(), 0);
-  std::unique_ptr<Literal> literal = Literal::CreateR1<float>(test_vector);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR1<float>(test_vector);
   const Shape& shape = literal->shape();
   auto device_buffer = AllocateDeviceBuffer(shape);
 
@@ -118,7 +119,7 @@ XLA_TEST_F(TransferManagerTest, TransferR1LargeF32) {
 
 XLA_TEST_F(TransferManagerTest, TransferR1U8) {
   const char* test_string = "0123456789abcdef";
-  std::unique_ptr<Literal> literal = Literal::CreateR1U8(test_string);
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR1U8(test_string);
   const Shape& shape = literal->shape();
   auto device_buffer = AllocateDeviceBuffer(shape);
 
@@ -134,7 +135,7 @@ XLA_TEST_F(TransferManagerTest, TransferR1U8) {
 
 XLA_TEST_F(TransferManagerTest, TransferR2F32) {
   std::unique_ptr<Literal> literal =
-      Literal::CreateR2<float>({{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}});
+      LiteralUtil::CreateR2<float>({{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}});
   const Shape& shape = literal->shape();
   auto device_buffer = AllocateDeviceBuffer(shape);
 
@@ -151,7 +152,7 @@ XLA_TEST_F(TransferManagerTest, TransferR2F32) {
 
 XLA_TEST_F(TransferManagerTest,
            TransferR2F32AndChangeLayoutTransferringToDevice) {
-  std::unique_ptr<Literal> literal = Literal::CreateR2WithLayout<float>(
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR2WithLayout<float>(
       {{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}}, LayoutUtil::MakeLayout({0, 1}));
   const Shape ondevice_shape =
       ShapeUtil::MakeShapeWithLayout(F32, {2, 3}, {1, 0});
@@ -172,10 +173,10 @@ XLA_TEST_F(TransferManagerTest,
 }
 
 XLA_TEST_F(TransferManagerTest, TransferTuple) {
-  std::unique_ptr<Literal> literal = Literal::MakeTuple(
-      {Literal::CreateR0<float>(123.0f).get(),
-       Literal::CreateR2<float>({{1.0f, 2.0f}, {4.0f, 5.0f}}).get(),
-       Literal::CreateR1<float>({44.0f, -10.0f, 3333333.3f}).get()});
+  std::unique_ptr<Literal> literal = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(123.0f).get(),
+       LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {4.0f, 5.0f}}).get(),
+       LiteralUtil::CreateR1<float>({44.0f, -10.0f, 3333333.3f}).get()});
   auto device_buffer = AllocateDeviceBuffer(literal->shape());
 
   // Round trip literal through device.
@@ -189,7 +190,7 @@ XLA_TEST_F(TransferManagerTest, TransferTuple) {
 }
 
 XLA_TEST_F(TransferManagerTest, TransferEmptyTuple) {
-  std::unique_ptr<Literal> literal = Literal::MakeTuple({});
+  std::unique_ptr<Literal> literal = LiteralUtil::MakeTuple({});
   auto device_buffer = AllocateDeviceBuffer(literal->shape());
 
   // Round trip literal through device.
@@ -203,13 +204,13 @@ XLA_TEST_F(TransferManagerTest, TransferEmptyTuple) {
 }
 
 XLA_TEST_F(TransferManagerTest, TransferNestedTuple) {
-  std::unique_ptr<Literal> literal = Literal::MakeTuple(
-      {Literal::CreateR0<float>(123.0f).get(),
-       Literal::MakeTuple(
-           {Literal::CreateR2<float>({{1.0f, 2.0f}, {4.0f, 5.0f}}).get(),
-            Literal::CreateR1<float>({44.0f, -10.0f, 3333333.3f}).get()})
+  std::unique_ptr<Literal> literal = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(123.0f).get(),
+       LiteralUtil::MakeTuple(
+           {LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {4.0f, 5.0f}}).get(),
+            LiteralUtil::CreateR1<float>({44.0f, -10.0f, 3333333.3f}).get()})
            .get(),
-       Literal::CreateR1<float>({-10.0f, 123.0f}).get()});
+       LiteralUtil::CreateR1<float>({-10.0f, 123.0f}).get()});
   auto device_buffer = AllocateDeviceBuffer(literal->shape());
 
   // Round trip literal through device.
@@ -223,7 +224,7 @@ XLA_TEST_F(TransferManagerTest, TransferNestedTuple) {
 }
 
 XLA_TEST_F(TransferManagerTest, TransferComplexValue) {
-  std::unique_ptr<Literal> literal = Literal::CreateR1<complex64>(
+  std::unique_ptr<Literal> literal = LiteralUtil::CreateR1<complex64>(
       {complex64(1.0f, 2.0f), complex64(42.0f, -123.4f)});
   auto device_buffer = AllocateDeviceBuffer(literal->shape());
 
@@ -238,12 +239,12 @@ XLA_TEST_F(TransferManagerTest, TransferComplexValue) {
 }
 
 XLA_TEST_F(TransferManagerTest, TransferComplexValueInTuple) {
-  std::unique_ptr<Literal> literal = Literal::MakeTuple(
-      {Literal::CreateR1<complex64>(
+  std::unique_ptr<Literal> literal = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR1<complex64>(
            {complex64(1.0f, 2.0f), complex64(42.0f, -123.4f)})
            .get(),
-       Literal::CreateR1<int32>({1, 2, 3, 4, 5, 6}).get(),
-       Literal::CreateR0<complex64>(complex64(0.3f, -0.4f)).get()});
+       LiteralUtil::CreateR1<int32>({1, 2, 3, 4, 5, 6}).get(),
+       LiteralUtil::CreateR0<complex64>(complex64(0.3f, -0.4f)).get()});
   auto device_buffer = AllocateDeviceBuffer(literal->shape());
 
   // Round trip literal through device.
@@ -265,25 +266,25 @@ XLA_TEST_F(TransferManagerTest, TransferTokenFromDevice) {
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<Literal> result,
       transfer_manager_->TransferLiteralFromDevice(stream_, device_buffer));
-  EXPECT_TRUE(LiteralTestUtil::Equal(*Literal::CreateToken(), *result));
+  EXPECT_TRUE(LiteralTestUtil::Equal(*LiteralUtil::CreateToken(), *result));
 }
 
 XLA_TEST_F(TransferManagerTest, MultiStreamRoundTripSoak) {
   const int64 kIterationCount = 5000;
-  std::unique_ptr<Literal> literal1 = Literal::MakeTuple(
-      {Literal::CreateR0<float>(123.0f).get(),
-       Literal::MakeTuple(
-           {Literal::CreateR2<float>({{1.0f, 2.0f}, {4.0f, 5.0f}}).get(),
-            Literal::CreateR1<float>({44.0f, -10.0f, 3333333.3f}).get()})
+  std::unique_ptr<Literal> literal1 = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(123.0f).get(),
+       LiteralUtil::MakeTuple(
+           {LiteralUtil::CreateR2<float>({{1.0f, 2.0f}, {4.0f, 5.0f}}).get(),
+            LiteralUtil::CreateR1<float>({44.0f, -10.0f, 3333333.3f}).get()})
            .get(),
-       Literal::CreateR1<float>({-10.0f, 123.0f}).get()});
-  std::unique_ptr<Literal> literal2 = Literal::MakeTuple(
-      {Literal::CreateR0<float>(456.0f).get(),
-       Literal::MakeTuple(
-           {Literal::CreateR2<float>({{5.0f, 7.0f}, {9.0f, 4.0f}}).get(),
-            Literal::CreateR1<float>({44.0f, -11.0f, 3333333.3f}).get()})
+       LiteralUtil::CreateR1<float>({-10.0f, 123.0f}).get()});
+  std::unique_ptr<Literal> literal2 = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(456.0f).get(),
+       LiteralUtil::MakeTuple(
+           {LiteralUtil::CreateR2<float>({{5.0f, 7.0f}, {9.0f, 4.0f}}).get(),
+            LiteralUtil::CreateR1<float>({44.0f, -11.0f, 3333333.3f}).get()})
            .get(),
-       Literal::CreateR1<float>({-98.0f, 153.0f}).get()});
+       LiteralUtil::CreateR1<float>({-98.0f, 153.0f}).get()});
 
   auto device_buffer1 = AllocateDeviceBuffer(literal1->shape());
   auto device_buffer2 = AllocateDeviceBuffer(literal2->shape());
@@ -325,10 +326,10 @@ class TransferDeviceToHostBenchmark : public TransferManagerTest {
     std::vector<std::unique_ptr<Literal>> tuple_elements;
     for (int i = 0; i < num_tuple_elements; ++i) {
       tuple_elements.push_back(
-          Literal::CreateR2F32Linspace(0.0f, 1.0f, array_size, array_size));
+          LiteralUtil::CreateR2F32Linspace(0.0f, 1.0f, array_size, array_size));
     }
     std::unique_ptr<Literal> literal =
-        Literal::MakeTupleOwned(std::move(tuple_elements));
+        LiteralUtil::MakeTupleOwned(std::move(tuple_elements));
     auto device_buffer = AllocateDeviceBuffer(literal->shape());
     TF_CHECK_OK(transfer_manager_->TransferLiteralToDevice(stream_, *literal,
                                                            device_buffer));
@@ -357,10 +358,10 @@ class TransferHostToDeviceBenchmark : public TransferManagerTest {
     std::vector<std::unique_ptr<Literal>> tuple_elements;
     for (int i = 0; i < num_tuple_elements; ++i) {
       tuple_elements.push_back(
-          Literal::CreateR2F32Linspace(0.0f, 1.0f, array_size, array_size));
+          LiteralUtil::CreateR2F32Linspace(0.0f, 1.0f, array_size, array_size));
     }
     std::unique_ptr<Literal> literal =
-        Literal::MakeTupleOwned(std::move(tuple_elements));
+        LiteralUtil::MakeTupleOwned(std::move(tuple_elements));
     auto device_buffer = AllocateDeviceBuffer(literal->shape());
     tensorflow::testing::StartTiming();
     for (int i = 0; i < iters; ++i) {
diff --git a/tensorflow/compiler/xla/tests/transpose_test.cc b/tensorflow/compiler/xla/tests/transpose_test.cc
index 6ebb4324f8..fbe9d1b64a 100644
--- a/tensorflow/compiler/xla/tests/transpose_test.cc
+++ b/tensorflow/compiler/xla/tests/transpose_test.cc
@@ -17,7 +17,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/reference_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/hlo_test_base.h"
diff --git a/tensorflow/compiler/xla/tests/tuple_test.cc b/tensorflow/compiler/xla/tests/tuple_test.cc
index ec11508891..2fd70b72b5 100644
--- a/tensorflow/compiler/xla/tests/tuple_test.cc
+++ b/tensorflow/compiler/xla/tests/tuple_test.cc
@@ -18,8 +18,8 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
 
 namespace xla {
@@ -49,10 +50,10 @@ XLA_TEST_F(TupleTest, TupleConstant) {
       {1.1f, 2.2f, 3.5f},  // row 0
       {4.8f, 5.0f, 6.7f},  // row 1
   };
-  auto value =
-      Literal::MakeTuple({Literal::CreateR0<float>(constant_scalar).get(),
-                          Literal::CreateR1<float>(constant_vector).get(),
-                          Literal::CreateR2<float>(constant_matrix).get()});
+  auto value = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(constant_scalar).get(),
+       LiteralUtil::CreateR1<float>(constant_vector).get(),
+       LiteralUtil::CreateR2<float>(constant_matrix).get()});
 
   ConstantLiteral(&builder, *value);
   ComputeAndCompareTuple(&builder, *value, {}, error_spec_);
@@ -64,9 +65,9 @@ XLA_TEST_F(TupleTest, TupleScalarConstant) {
 
   const float constant_scalar1 = 7.3f;
   const float constant_scalar2 = 1.2f;
-  auto value =
-      Literal::MakeTuple({Literal::CreateR0<float>(constant_scalar1).get(),
-                          Literal::CreateR0<float>(constant_scalar2).get()});
+  auto value = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(constant_scalar1).get(),
+       LiteralUtil::CreateR0<float>(constant_scalar2).get()});
 
   ConstantLiteral(&builder, *value);
   ComputeAndCompareTuple(&builder, *value, {}, error_spec_);
@@ -86,10 +87,10 @@ XLA_TEST_F(TupleTest, TupleCreate) {
                    ConstantR1<float>(&builder, constant_vector),
                    ConstantR2<float>(&builder, constant_matrix)});
 
-  auto expected =
-      Literal::MakeTuple({Literal::CreateR0<float>(constant_scalar).get(),
-                          Literal::CreateR1<float>(constant_vector).get(),
-                          Literal::CreateR2<float>(constant_matrix).get()});
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR0<float>(constant_scalar).get(),
+       LiteralUtil::CreateR1<float>(constant_vector).get(),
+       LiteralUtil::CreateR2<float>(constant_matrix).get()});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -100,8 +101,9 @@ XLA_TEST_F(TupleTest, TupleCreateWithZeroElementEntry) {
   Tuple(&builder,
         {ConstantR0<float>(&builder, 7.0), ConstantR1<float>(&builder, {})});
 
-  auto expected = Literal::MakeTuple({Literal::CreateR0<float>(7.0).get(),
-                                      Literal::CreateR1<float>({}).get()});
+  auto expected =
+      LiteralUtil::MakeTuple({LiteralUtil::CreateR0<float>(7.0).get(),
+                              LiteralUtil::CreateR1<float>({}).get()});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -109,7 +111,7 @@ XLA_TEST_F(TupleTest, TupleCreateWithZeroElementEntry) {
 XLA_TEST_F(TupleTest, EmptyTupleCreate) {
   XlaBuilder builder(TestName());
   Tuple(&builder, {});
-  auto expected = Literal::MakeTuple({});
+  auto expected = LiteralUtil::MakeTuple({});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -193,9 +195,9 @@ XLA_TEST_F(TupleTest, TupleGTEToTuple) {
                        ConstantR2<float>(&builder, constant_matrix)});
   Tuple(&builder,
         {GetTupleElement(tuple_data, 1), GetTupleElement(tuple_data, 0)});
-  auto expected =
-      Literal::MakeTuple({Literal::CreateR2<float>(constant_matrix).get(),
-                          Literal::CreateR1<float>(constant_vector).get()});
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::CreateR2<float>(constant_matrix).get(),
+       LiteralUtil::CreateR1<float>(constant_vector).get()});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -216,8 +218,8 @@ XLA_TEST_F(TupleTest, SelectBetweenPredTuples) {
     auto v2_v1 = Tuple(&b, {v2_gt, v1_gt});  // {true, false}
     Select(direction ? v1_gt : v2_gt, v1_v2, v2_v1);
     auto expected =
-        Literal::MakeTuple({Literal::CreateR0<bool>(direction).get(),
-                            Literal::CreateR0<bool>(!direction).get()});
+        LiteralUtil::MakeTuple({LiteralUtil::CreateR0<bool>(direction).get(),
+                                LiteralUtil::CreateR0<bool>(!direction).get()});
 
     ComputeAndCompareTuple(&b, *expected, {v1_data.get(), v2_data.get()},
                            error_spec_);
@@ -284,8 +286,9 @@ XLA_TEST_F(TupleTest, SelectBetweenTuplesOnFalse) {
                                   ConstantR1<float>(&builder, vec1)});
 
   Select(ConstantR0<bool>(&builder, false), tuple12, tuple21);
-  auto expected = Literal::MakeTuple({Literal::CreateR1<float>(vec2).get(),
-                                      Literal::CreateR1<float>(vec1).get()});
+  auto expected =
+      LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>(vec2).get(),
+                              LiteralUtil::CreateR1<float>(vec1).get()});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -328,8 +331,9 @@ XLA_TEST_F(TupleTest, SelectBetweenTuplesOnTrue) {
                                   ConstantR1<float>(&builder, vec1)});
 
   Select(ConstantR0<bool>(&builder, true), tuple12, tuple21);
-  auto expected = Literal::MakeTuple({Literal::CreateR1<float>(vec1).get(),
-                                      Literal::CreateR1<float>(vec2).get()});
+  auto expected =
+      LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>(vec1).get(),
+                              LiteralUtil::CreateR1<float>(vec2).get()});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -403,8 +407,9 @@ XLA_TEST_F(TupleTest, SelectBetweenTuplesReuseConstants) {
 
   Select(ConstantR0<bool>(&builder, false), tuple12, tuple21);
 
-  auto expected = Literal::MakeTuple({Literal::CreateR1<float>(vec2).get(),
-                                      Literal::CreateR1<float>(vec1).get()});
+  auto expected =
+      LiteralUtil::MakeTuple({LiteralUtil::CreateR1<float>(vec2).get(),
+                              LiteralUtil::CreateR1<float>(vec1).get()});
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
 
@@ -414,13 +419,13 @@ XLA_TEST_F(TupleTest, NestedTuples) {
                                       ConstantR0<float>(&builder, 42.0)});
   Tuple(&builder, {inner_tuple, ConstantR1<float>(&builder, {22.0, 44.0})});
 
-  auto expected_v1 = Literal::CreateR1<float>({1.0, 2.0});
-  auto expected_s = Literal::CreateR0<float>(42.0);
+  auto expected_v1 = LiteralUtil::CreateR1<float>({1.0, 2.0});
+  auto expected_s = LiteralUtil::CreateR0<float>(42.0);
   auto expected_inner_tuple =
-      Literal::MakeTuple({expected_v1.get(), expected_s.get()});
-  auto expected_v2 = Literal::CreateR1<float>({22.0, 44.0});
+      LiteralUtil::MakeTuple({expected_v1.get(), expected_s.get()});
+  auto expected_v2 = LiteralUtil::CreateR1<float>({22.0, 44.0});
   auto expected =
-      Literal::MakeTuple({expected_inner_tuple.get(), expected_v2.get()});
+      LiteralUtil::MakeTuple({expected_inner_tuple.get(), expected_v2.get()});
 
   ComputeAndCompareTuple(&builder, *expected, {}, error_spec_);
 }
@@ -440,14 +445,14 @@ XLA_TEST_F(TupleTest, GetTupleElementOfNestedTuple) {
 
   std::unique_ptr<GlobalData> data =
       client_
-          ->TransferToServer(*Literal::MakeTuple({
-              Literal::MakeTuple(
+          ->TransferToServer(*LiteralUtil::MakeTuple({
+              LiteralUtil::MakeTuple(
                   {
-                      Literal::CreateR1<float>({1.0, 2.0, 3.0}).get(),
-                      Literal::CreateR1<float>({4.0, 5.0, 6.0}).get(),
+                      LiteralUtil::CreateR1<float>({1.0, 2.0, 3.0}).get(),
+                      LiteralUtil::CreateR1<float>({4.0, 5.0, 6.0}).get(),
                   })
                   .get(),
-              Literal::CreateR1<float>({7.0, 8.0, 9.0}).get(),
+              LiteralUtil::CreateR1<float>({7.0, 8.0, 9.0}).get(),
           }))
           .ConsumeValueOrDie();
 
@@ -478,11 +483,12 @@ XLA_TEST_F(TupleTest, ComplexTuples) {
 
   std::unique_ptr<GlobalData> arg0 =
       client_
-          ->TransferToServer(*Literal::MakeTuple(
-              {Literal::CreateR0<complex64>({1, 2}).get(),
-               Literal::MakeTuple(
-                   {Literal::CreateR1<complex64>({{10, 20}, {30, 40}}).get(),
-                    Literal::CreateR2<complex64>(
+          ->TransferToServer(*LiteralUtil::MakeTuple(
+              {LiteralUtil::CreateR0<complex64>({1, 2}).get(),
+               LiteralUtil::MakeTuple(
+                   {LiteralUtil::CreateR1<complex64>({{10, 20}, {30, 40}})
+                        .get(),
+                    LiteralUtil::CreateR2<complex64>(
                         {{{100, 200}, {300, 400}},
                          {{1000, 2000}, {3000, 4000}},
                          {{10000, 20000}, {30000, 40000}}})
@@ -491,11 +497,13 @@ XLA_TEST_F(TupleTest, ComplexTuples) {
           .ConsumeValueOrDie();
   std::unique_ptr<GlobalData> arg1 =
       client_
-          ->TransferToServer(*Literal::CreateR1<complex64>({{1, 2}, {1, -2}}))
+          ->TransferToServer(
+              *LiteralUtil::CreateR1<complex64>({{1, 2}, {1, -2}}))
           .ConsumeValueOrDie();
-  auto sum = Literal::CreateR2<complex64>({{{111, 222}, {331, 442}},
-                                           {{1011, 2022}, {3031, 4042}},
-                                           {{10011, 20022}, {30031, 40042}}});
+  auto sum =
+      LiteralUtil::CreateR2<complex64>({{{111, 222}, {331, 442}},
+                                        {{1011, 2022}, {3031, 4042}},
+                                        {{10011, 20022}, {30031, 40042}}});
   auto prod = MakeUnique<Literal>(sum->shape());
   ASSERT_TRUE(prod->Populate<complex64>(
                       [&sum](tensorflow::gtl::ArraySlice<int64> indexes) {
@@ -505,9 +513,9 @@ XLA_TEST_F(TupleTest, ComplexTuples) {
                                     : complex64(1, -2));
                       })
                   .ok());
-  auto expected =
-      Literal::MakeTuple({Literal::MakeTuple({prod.get(), sum.get()}).get(),
-                          Literal::CreateR0<complex64>({123, 456}).get()});
+  auto expected = LiteralUtil::MakeTuple(
+      {LiteralUtil::MakeTuple({prod.get(), sum.get()}).get(),
+       LiteralUtil::CreateR0<complex64>({123, 456}).get()});
   ComputeAndCompareTuple(&builder, *expected, {arg0.get(), arg1.get()},
                          error_spec_);
 }
@@ -530,12 +538,59 @@ XLA_TEST_F(TupleHloTest, DISABLED_ON_INTERPRETER(BitcastAfterGTE)) {
   auto module =
       HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
           .ValueOrDie();
-  auto param = Literal::MakeTupleOwned(Literal::CreateR1<float>({1, 2, 3}));
+  auto param =
+      LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR1<float>({1, 2, 3}));
   auto result = ExecuteNoHloPasses(std::move(module), {param.get()});
   EXPECT_TRUE(LiteralTestUtil::Equal(
-      *Literal::MakeTupleOwned(Literal::CreateR2<float>({{1, 2, 3}})),
+      *LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR2<float>({{1, 2, 3}})),
       *result));
 }
 
+// Disabled on interpreter due to lack of outfeed.
+XLA_TEST_F(TupleHloTest,
+           DISABLED_ON_INTERPRETER(NonAmbiguousTopLevelAllocation)) {
+  const char* testcase = R"(
+    HloModule tuple
+
+    ENTRY main {
+      a = f32[2] parameter(0)
+      b = f32[2] parameter(1)
+      c = f32[2] parameter(2)
+      d = f32[2] parameter(3)
+      cond = pred[] parameter(4)
+
+      tup0 = (f32[2],f32[2]) tuple(a, b)
+      tup1 = (f32[2],f32[2]) tuple(c, d)
+
+      s = (f32[2],f32[2]) tuple-select(cond, tup0, tup1)
+      gte = f32[2] get-tuple-element(s), index=0
+      tuple = (f32[2]) tuple(gte)
+      token = token[] after-all()
+      ROOT outfeed = token[] outfeed(tuple, token)
+    }
+  )";
+  auto module =
+      HloRunner::CreateModuleFromString(testcase, GetDebugOptionsForTest())
+          .ValueOrDie();
+  auto param0 = LiteralUtil::CreateR1<float>({1, 2});
+  auto param1 = LiteralUtil::CreateR1<float>({2, 3});
+  auto param4 = LiteralUtil::CreateR0<bool>(false);
+  // Put execution on a separate thread so we can block on outfeed.
+  std::unique_ptr<tensorflow::Thread> thread(
+      tensorflow::Env::Default()->StartThread(
+          tensorflow::ThreadOptions(), "execute_thread", [&] {
+            TF_EXPECT_OK(Execute(std::move(module),
+                                 {param0.get(), param1.get(), param1.get(),
+                                  param0.get(), param4.get()})
+                             .status());
+          }));
+  auto expected =
+      LiteralUtil::MakeTupleOwned(LiteralUtil::CreateR1<float>({2, 3}));
+  auto literal = MakeUnique<Literal>();
+  TF_EXPECT_OK(backend().transfer_manager()->TransferLiteralFromOutfeed(
+      backend().default_stream_executor(), expected->shape(), literal.get()));
+  EXPECT_TRUE(LiteralTestUtil::Equal(*expected, *literal));
+}
+
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/unary_op_test.cc b/tensorflow/compiler/xla/tests/unary_op_test.cc
index 929b1ca7fb..20ae68ab74 100644
--- a/tensorflow/compiler/xla/tests/unary_op_test.cc
+++ b/tensorflow/compiler/xla/tests/unary_op_test.cc
@@ -18,7 +18,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
@@ -101,7 +101,7 @@ void UnaryOpTest::AbsTestHelper<complex64>() {
   Abs(arg);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR1<float>({2, 25, 0, 0.5, inf<float>(), inf<float>()});
+      LiteralUtil::CreateR1<float>({2, 25, 0, 0.5, inf<float>(), inf<float>()});
   ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-6f));
 }
 
@@ -113,7 +113,7 @@ void UnaryOpTest::SignTestHelper<complex64>() {
       {{-2, 0}, {0, 25}, {0, 0}, {static_cast<float>(-0.0), 0}, {-1, 1}});
   Sign(arg);
 
-  std::unique_ptr<Literal> expected = Literal::CreateR1<complex64>(
+  std::unique_ptr<Literal> expected = LiteralUtil::CreateR1<complex64>(
       {{-1, 0}, {0, 1}, {0, 0}, {0, 0}, {-std::sqrt(0.5f), std::sqrt(0.5f)}});
   ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-6f));
 }
@@ -128,7 +128,7 @@ void UnaryOpTest::SignAbsTestHelper<complex64>() {
   Sub(Mul(sign, ConvertElementType(abs, C64)), arg);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR1<complex64>({0, 0, 0, 0});
+      LiteralUtil::CreateR1<complex64>({0, 0, 0, 0});
   ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-6f));
 }
 
@@ -173,7 +173,7 @@ XLA_TEST_F(UnaryOpTest, SignTestR0) {
                 Add(Add(sgnf0, sgnf), ConvertElementType(sgni, F32)), C64));
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR0<complex64>({-2.6f, 0.8f});
+      LiteralUtil::CreateR0<complex64>({-2.6f, 0.8f});
   ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-6f));
 }
 
diff --git a/tensorflow/compiler/xla/tests/vector_ops_reduce_test.cc b/tensorflow/compiler/xla/tests/vector_ops_reduce_test.cc
index ea3aba6df1..ef1b1445bb 100644
--- a/tensorflow/compiler/xla/tests/vector_ops_reduce_test.cc
+++ b/tensorflow/compiler/xla/tests/vector_ops_reduce_test.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/array3d.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
diff --git a/tensorflow/compiler/xla/tests/vector_ops_simple_test.cc b/tensorflow/compiler/xla/tests/vector_ops_simple_test.cc
index 79bae22dac..3848ec1684 100644
--- a/tensorflow/compiler/xla/tests/vector_ops_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/vector_ops_simple_test.cc
@@ -21,8 +21,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
diff --git a/tensorflow/compiler/xla/tests/while_test.cc b/tensorflow/compiler/xla/tests/while_test.cc
index bbd67cd8d7..1bdf1867b9 100644
--- a/tensorflow/compiler/xla/tests/while_test.cc
+++ b/tensorflow/compiler/xla/tests/while_test.cc
@@ -20,9 +20,9 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/lib/arithmetic.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
@@ -347,8 +347,8 @@ TEST_F(WhileTest, WhileWithVectorResultIntoTuple) {
   // the sum will increase by 1.0.  It will first be >15.5 when the elements
   // have all reached 2.0.
   auto expected_data =
-      Literal::CreateR1<float>({2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f});
-  auto expected = Literal::MakeTuple({expected_data.get()});
+      LiteralUtil::CreateR1<float>({2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f, 2.f});
+  auto expected = LiteralUtil::MakeTuple({expected_data.get()});
   VLOG(2) << "expected = " << ShapeUtil::HumanString(expected->shape());
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.0001));
 }
@@ -397,12 +397,13 @@ TEST_F(WhileTest, WhileWithPermutationAndTupleResult) {
           << ShapeUtil::HumanString(
                  builder.GetShape(result).ConsumeValueOrDie());
 
-  auto expected_counter = Literal::CreateR0<int32>(N);
-  auto expected_w1 = Literal::CreateR1<float>({1.0f, 1.0f, 1.0f});
-  auto expected_w2 = Literal::CreateR1<float>({2.0f, 2.0f, 2.0f});
-  auto expected_w3 = Literal::CreateR1<float>({3.0f, 3.0f, 3.0f});
-  auto expected = Literal::MakeTuple({expected_counter.get(), expected_w2.get(),
-                                      expected_w3.get(), expected_w1.get()});
+  auto expected_counter = LiteralUtil::CreateR0<int32>(N);
+  auto expected_w1 = LiteralUtil::CreateR1<float>({1.0f, 1.0f, 1.0f});
+  auto expected_w2 = LiteralUtil::CreateR1<float>({2.0f, 2.0f, 2.0f});
+  auto expected_w3 = LiteralUtil::CreateR1<float>({3.0f, 3.0f, 3.0f});
+  auto expected =
+      LiteralUtil::MakeTuple({expected_counter.get(), expected_w2.get(),
+                              expected_w3.get(), expected_w1.get()});
   VLOG(2) << "expected = " << ShapeUtil::HumanString(expected->shape());
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.0001));
 }
@@ -506,11 +507,11 @@ TEST_F(WhileTest, WhileWithTupleResult) {
           << ShapeUtil::HumanString(
                  builder.GetShape(result).ConsumeValueOrDie());
 
-  auto expected_counter = Literal::CreateR0<int32>(5);
-  auto expected_data = Literal::CreateR1<float>(
+  auto expected_counter = LiteralUtil::CreateR0<int32>(5);
+  auto expected_data = LiteralUtil::CreateR1<float>(
       {5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f});
   auto expected =
-      Literal::MakeTuple({expected_counter.get(), expected_data.get()});
+      LiteralUtil::MakeTuple({expected_counter.get(), expected_data.get()});
   VLOG(2) << "expected = " << ShapeUtil::HumanString(expected->shape());
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.0001));
 }
@@ -554,10 +555,10 @@ TEST_F(WhileTest, WhileWithPredicateTupleResult) {
           << ShapeUtil::HumanString(
                  builder.GetShape(result).ConsumeValueOrDie());
 
-  auto expected_counter = Literal::CreateR0<int32>(5);
-  auto expected_predicate = Literal::CreateR0<bool>(true);
-  auto expected =
-      Literal::MakeTuple({expected_counter.get(), expected_predicate.get()});
+  auto expected_counter = LiteralUtil::CreateR0<int32>(5);
+  auto expected_predicate = LiteralUtil::CreateR0<bool>(true);
+  auto expected = LiteralUtil::MakeTuple(
+      {expected_counter.get(), expected_predicate.get()});
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0));
 }
 
@@ -599,10 +600,10 @@ TEST_F(WhileTest, WhileWithTupleConstantScalarResult) {
           << ShapeUtil::HumanString(
                  builder.GetShape(result).ConsumeValueOrDie());
 
-  auto expected_counter = Literal::CreateR0<int32>(5);
-  auto expected_data = Literal::CreateR0<int32>(7);
+  auto expected_counter = LiteralUtil::CreateR0<int32>(5);
+  auto expected_data = LiteralUtil::CreateR0<int32>(7);
   auto expected =
-      Literal::MakeTuple({expected_counter.get(), expected_data.get()});
+      LiteralUtil::MakeTuple({expected_counter.get(), expected_data.get()});
   VLOG(2) << "expected = " << ShapeUtil::HumanString(expected->shape());
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.0001));
 }
@@ -882,11 +883,11 @@ XLA_TEST_F(WhileTest, WhileWithDynamicUpdateSlice) {
           << ShapeUtil::HumanString(
                  builder.GetShape(result).ConsumeValueOrDie());
 
-  auto expected_counter = Literal::CreateR0<int32>(5);
-  auto expected_data = Literal::CreateR1<float>(
+  auto expected_counter = LiteralUtil::CreateR0<int32>(5);
+  auto expected_data = LiteralUtil::CreateR1<float>(
       {1.0f, 1.0f, 2.0f, 2.0f, 3.0f, 3.0f, 4.0f, 4.0f, 5.0f, 5.0f});
   auto expected =
-      Literal::MakeTuple({expected_counter.get(), expected_data.get()});
+      LiteralUtil::MakeTuple({expected_counter.get(), expected_data.get()});
   VLOG(2) << "expected = " << ShapeUtil::HumanString(expected->shape());
   ComputeAndCompareTuple(&builder, *expected, {}, ErrorSpec(0.0001));
 }
@@ -974,12 +975,12 @@ TEST_F(WhileTest, WhileThatSwapsParameterWithTupleElement) {
   TF_ASSERT_OK_AND_ASSIGN(auto body_computation, body.Build());
   While(cond_computation, body_computation, t);
 
-  auto expected_element = Literal::CreateR1<float>({1, 1});
+  auto expected_element = LiteralUtil::CreateR1<float>({1, 1});
   auto expected =
-      Literal::MakeTuple({expected_element.get(), expected_element.get()});
+      LiteralUtil::MakeTuple({expected_element.get(), expected_element.get()});
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> parameter_data,
-      client_->TransferToServer(*Literal::CreateR1<float>({42, 42})));
+      client_->TransferToServer(*LiteralUtil::CreateR1<float>({42, 42})));
   ComputeAndCompareTuple(&outer, *expected, {parameter_data.get()},
                          ErrorSpec(1e-6));
 }
@@ -1004,7 +1005,7 @@ TEST_F(WhileTest, WhileThatSwapsParameterWithBroadcast) {
 
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> parameter_data,
-      client_->TransferToServer(*Literal::CreateR1<float>({42, 42})));
+      client_->TransferToServer(*LiteralUtil::CreateR1<float>({42, 42})));
   ComputeAndCompareR1<float>(&outer, {1.0f, 1.0f}, {parameter_data.get()},
                              ErrorSpec(1e-6));
 }
@@ -1030,7 +1031,7 @@ TEST_F(WhileTest, WhileThatTurnsScalarParameterToTupleElement) {
 
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> parameter_data,
-      client_->TransferToServer(*Literal::CreateR0<float>(42)));
+      client_->TransferToServer(*LiteralUtil::CreateR0<float>(42)));
   ComputeAndCompareR0<float>(&outer, 43.0f, {parameter_data.get()},
                              ErrorSpec(1e-6));
 }
@@ -1069,11 +1070,11 @@ TEST_F(WhileTest, WhileWithMixedTupleElements) {
 
   TF_ASSERT_OK_AND_ASSIGN(
       std::unique_ptr<GlobalData> parameter_data,
-      client_->TransferToServer(*Literal::CreateR0<int32>(1)));
+      client_->TransferToServer(*LiteralUtil::CreateR0<int32>(1)));
 
-  auto add1 = Literal::CreateR0<int32>(15);
-  auto add2 = Literal::CreateR0<int32>(16);
-  auto expected = Literal::MakeTuple({add1.get(), add2.get()});
+  auto add1 = LiteralUtil::CreateR0<int32>(15);
+  auto add2 = LiteralUtil::CreateR0<int32>(16);
+  auto expected = LiteralUtil::MakeTuple({add1.get(), add2.get()});
   ComputeAndCompareTuple(&outer, *expected, {parameter_data.get()},
                          ErrorSpec(1e-6));
 }
@@ -1226,15 +1227,44 @@ TEST_F(WhileTest, WhileWithLoopInvariantOperation) {
   auto while_instruction = While(condition, body, init);
   GetTupleElement(while_instruction, 3);
 
-  TF_ASSERT_OK_AND_ASSIGN(auto param_value,
-                          client_->TransferToServer(*Literal::CreateR2<float>(
-                              {{1.0, 2.0}, {-1.0, -2.0}})));
+  TF_ASSERT_OK_AND_ASSIGN(
+      auto param_value, client_->TransferToServer(*LiteralUtil::CreateR2<float>(
+                            {{1.0, 2.0}, {-1.0, -2.0}})));
 
   ComputeAndCompareR2<float>(
       &builder, {{-0.76159416, -0.96402758}, {0.76159416, 0.96402758}},
       {param_value.get()}, ErrorSpec(4e-5));
 }
 
+TEST_F(WhileTest, DISABLED_ON_INTERPRETER(WhileInfeedCondition)) {
+  auto while_shape = ShapeUtil::MakeShape(S32, {});
+
+  XlaComputation condition;
+  {
+    XlaBuilder builder("condition");
+    Parameter(&builder, 0, while_shape, "state");
+    Infeed(&builder, ShapeUtil::MakeShape(PRED, {}));
+    TF_ASSERT_OK_AND_ASSIGN(condition, builder.Build());
+  }
+
+  XlaComputation body;
+  {
+    XlaBuilder builder("body");
+    auto indvar = Parameter(&builder, 0, while_shape, "state");
+    Add(indvar, ConstantR0<int32>(&builder, 1));
+    TF_ASSERT_OK_AND_ASSIGN(body, builder.Build());
+  }
+
+  XlaBuilder builder(TestName());
+  While(condition, body, ConstantR0<int32>(&builder, 0));
+
+  TF_ASSERT_OK(client_->TransferToInfeed(*LiteralUtil::CreateR0<bool>(true)));
+  TF_ASSERT_OK(client_->TransferToInfeed(*LiteralUtil::CreateR0<bool>(true)));
+  TF_ASSERT_OK(client_->TransferToInfeed(*LiteralUtil::CreateR0<bool>(false)));
+
+  ComputeAndCompareR0<int32>(&builder, 2, {});
+}
+
 void BM_WhileLoop(int num_iters) {
   // Benchmark a simple kernel to measure while loop overheads.
   tensorflow::testing::StopTiming();
diff --git a/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc b/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc
index 7dba058d40..0ee8e68c88 100644
--- a/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc
+++ b/tensorflow/compiler/xla/tests/xla_hlo_profile_test.cc
@@ -18,10 +18,11 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/array2d.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_builder.h"
-#include "tensorflow/compiler/xla/client/xla_client/xla_computation.h"
+#include "tensorflow/compiler/xla/client/xla_builder.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/map_util.h"
 #include "tensorflow/compiler/xla/service/platform_util.h"
+#include "tensorflow/compiler/xla/service/stream_pool.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
@@ -79,7 +80,9 @@ struct ParsedProfileOutputLine {
 
 Status ParseOneProfileOutputLine(
     const string& line, bool expect_hlo,
-    gtl::FlatMap<string, ParsedProfileOutputLine>* parsed_results) {
+    gtl::FlatMap<string, ParsedProfileOutputLine>* parsed_results,
+    tensorflow::gtl::ArraySlice<tensorflow::StringPiece> opcodes_to_ignore =
+        {}) {
   string separator = "[^:]*:: +";
   string match_percentage = "\\d+\\.\\d\\d%";
   string match_cycles = "(\\d+) cycles +\\( *(" + match_percentage + ")\\)";
@@ -113,7 +116,9 @@ Status ParseOneProfileOutputLine(
         ", Regexp: ", regexp_pattern);
   }
 
-  InsertOrDie(parsed_results, parsed_line.opcode, parsed_line);
+  if (!c_linear_search(opcodes_to_ignore, parsed_line.opcode)) {
+    InsertOrDie(parsed_results, parsed_line.opcode, parsed_line);
+  }
 
   return Status::OK();
 }
@@ -129,7 +134,7 @@ void ExecuteAndFetchProfile(string* profile_output, LocalClient* client,
   DeviceMemoryAllocator* allocator = backend->memory_allocator();
   auto* transfer_manager = backend->transfer_manager();
   TF_ASSERT_OK_AND_ASSIGN(
-      Backend::StreamPtr stream_ptr,
+      StreamPool::Ptr stream_ptr,
       backend->BorrowStream(backend->default_device_ordinal()));
 
   TF_ASSERT_OK_AND_ASSIGN(
@@ -267,7 +272,7 @@ XLA_TEST_F(HloProfileTest, ProfileWhileComputation) {
     auto matrix = GetTupleElement(state, 1);
     auto next_iteration =
         Add(GetTupleElement(state, 0), ConstantR0<int32>(&builder, 1));
-    Tuple(&builder, {next_iteration, Add(matrix, matrix)});
+    Tuple(&builder, {next_iteration, Mul(matrix, matrix)});
     TF_ASSERT_OK_AND_ASSIGN(body, builder.Build());
   }
 
@@ -289,36 +294,50 @@ XLA_TEST_F(HloProfileTest, ProfileWhileComputation) {
       tensorflow::str_util::Split(profile_output, '\n');
 
   auto while_body_profile_start =
-      std::find_if(profile_output_lines.begin(), profile_output_lines.end(),
+      c_find_if(profile_output_lines, [](tensorflow::StringPiece s) {
+        return tensorflow::str_util::StartsWith(s,
+                                                "Execution profile for body");
+      });
+
+  ASSERT_NE(while_body_profile_start, profile_output_lines.cend());
+
+  auto while_body_profile_end =
+      std::find_if(while_body_profile_start, profile_output_lines.end(),
                    [](tensorflow::StringPiece s) {
                      return tensorflow::str_util::StartsWith(
-                         s, "Execution profile for body");
+                         s, "********** microseconds report **********");
                    });
 
-  ASSERT_NE(while_body_profile_start, profile_output_lines.end());
+  // We emit a blank line before the "********** microseconds report **********"
+  // line.
+  while_body_profile_end--;
 
-  gtl::FlatMap<string, ParsedProfileOutputLine> parsed_profile_lines;
+  ASSERT_NE(while_body_profile_end, profile_output_lines.end());
 
-  TF_ASSERT_OK(
-      ParseOneProfileOutputLine(*std::next(while_body_profile_start, 1),
-                                /*expect_hlo=*/false, &parsed_profile_lines));
+  gtl::FlatMap<string, ParsedProfileOutputLine> parsed_profile_lines;
 
-  TF_ASSERT_OK(
-      ParseOneProfileOutputLine(*std::next(while_body_profile_start, 2),
-                                /*expect_hlo=*/true, &parsed_profile_lines));
+  for (auto while_body_profile_i = while_body_profile_start + 1;
+       while_body_profile_i != while_body_profile_end; while_body_profile_i++) {
+    // There are multiple "get-tuple-element" instructions in the while body so
+    // we ignore them -- we don't want parsed_profile_lines to be a multi-map.
+    TF_ASSERT_OK(ParseOneProfileOutputLine(
+        *while_body_profile_i,
+        /*expect_hlo=*/while_body_profile_i != (while_body_profile_start + 1),
+        &parsed_profile_lines, {"get-tuple-element"}));
+  }
 
   TF_ASSERT_OK_AND_ASSIGN(ParsedProfileOutputLine total_while_body_profile,
                           MaybeFind(parsed_profile_lines, "[total]"));
-  TF_ASSERT_OK_AND_ASSIGN(ParsedProfileOutputLine dot_profile,
-                          MaybeFind(parsed_profile_lines, "add"));
+  TF_ASSERT_OK_AND_ASSIGN(ParsedProfileOutputLine multiply_profile,
+                          MaybeFind(parsed_profile_lines, "multiply"));
 
   EXPECT_GT(total_while_body_profile.cycles, 0);
   EXPECT_EQ(total_while_body_profile.opcode, "[total]");
   EXPECT_EQ(total_while_body_profile.cycles_percentage, "100.00%");
 
-  EXPECT_GT(total_while_body_profile.cycles, dot_profile.cycles);
-  EXPECT_NE(dot_profile.cycles_percentage, "0.00%");
-  EXPECT_NE(dot_profile.cycles_percentage, "100.00%");
+  EXPECT_GT(total_while_body_profile.cycles, multiply_profile.cycles);
+  EXPECT_NE(multiply_profile.cycles_percentage, "0.00%");
+  EXPECT_NE(multiply_profile.cycles_percentage, "100.00%");
 }
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/text_literal_reader.cc b/tensorflow/compiler/xla/text_literal_reader.cc
index 56702feab9..897123d760 100644
--- a/tensorflow/compiler/xla/text_literal_reader.cc
+++ b/tensorflow/compiler/xla/text_literal_reader.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include <utility>
 #include <vector>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
diff --git a/tensorflow/compiler/xla/text_literal_reader.h b/tensorflow/compiler/xla/text_literal_reader.h
index e45e5291c9..708e8c80d8 100644
--- a/tensorflow/compiler/xla/text_literal_reader.h
+++ b/tensorflow/compiler/xla/text_literal_reader.h
@@ -18,7 +18,7 @@ limitations under the License.
 
 #include <memory>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
diff --git a/tensorflow/compiler/xla/text_literal_reader_test.cc b/tensorflow/compiler/xla/text_literal_reader_test.cc
index 23070b6638..92f9b4f9f0 100644
--- a/tensorflow/compiler/xla/text_literal_reader_test.cc
+++ b/tensorflow/compiler/xla/text_literal_reader_test.cc
@@ -17,7 +17,7 @@ limitations under the License.
 
 #include <string>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/types.h"
diff --git a/tensorflow/compiler/xla/text_literal_writer.cc b/tensorflow/compiler/xla/text_literal_writer.cc
index 373c0d2d8d..24e0784741 100644
--- a/tensorflow/compiler/xla/text_literal_writer.cc
+++ b/tensorflow/compiler/xla/text_literal_writer.cc
@@ -17,7 +17,7 @@ limitations under the License.
 
 #include <string>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/shape_util.h"
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
diff --git a/tensorflow/compiler/xla/text_literal_writer.h b/tensorflow/compiler/xla/text_literal_writer.h
index 0a1235b5e0..159ac1b7e1 100644
--- a/tensorflow/compiler/xla/text_literal_writer.h
+++ b/tensorflow/compiler/xla/text_literal_writer.h
@@ -16,7 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_TEXT_LITERAL_WRITER_H_
 #define TENSORFLOW_COMPILER_XLA_TEXT_LITERAL_WRITER_H_
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/status.h"
diff --git a/tensorflow/compiler/xla/text_literal_writer_test.cc b/tensorflow/compiler/xla/text_literal_writer_test.cc
index 70cf2fb1b8..4ea02faffc 100644
--- a/tensorflow/compiler/xla/text_literal_writer_test.cc
+++ b/tensorflow/compiler/xla/text_literal_writer_test.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <memory>
 #include <string>
 
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/literal_util.h"
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
@@ -30,8 +31,9 @@ namespace xla {
 namespace {
 
 TEST(TextLiteralWriterTest, WritesFloatLiteral) {
-  auto literal = Literal::CreateR2<float>({
-      {3.14, 2.17}, {1.23, 4.56},
+  auto literal = LiteralUtil::CreateR2<float>({
+      {3.14, 2.17},
+      {1.23, 4.56},
   });
   string path =
       tensorflow::io::JoinPath(tensorflow::testing::TmpDir(), "/whatever");
diff --git a/tensorflow/compiler/xla/tools/BUILD b/tensorflow/compiler/xla/tools/BUILD
index e4a052c8f1..40d28a57bf 100644
--- a/tensorflow/compiler/xla/tools/BUILD
+++ b/tensorflow/compiler/xla/tools/BUILD
@@ -37,6 +37,7 @@ cc_library(
         "//tensorflow/compiler/xla/client",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
         "//tensorflow/compiler/xla/service",
         "//tensorflow/compiler/xla/service:hlo_proto",
@@ -74,7 +75,7 @@ cc_library(
     srcs = ["replay_computation.cc"],
     deps = [
         "//tensorflow/compiler/xla:execution_options_util",
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:status_macros",
         "//tensorflow/compiler/xla:statusor",
@@ -84,7 +85,9 @@ cc_library(
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:global_data",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/client/lib:testing",
+        "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
         "//tensorflow/compiler/xla/service:hlo_parser",
         "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/compiler/xla/service/gpu:infeed_manager",
@@ -123,7 +126,7 @@ tf_cc_binary(
     name = "show_literal",
     srcs = ["show_literal.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/core:lib",
@@ -145,7 +148,7 @@ tf_cc_binary(
     name = "show_text_literal",
     srcs = ["show_text_literal.cc"],
     deps = [
-        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:literal",
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:text_literal_reader",
         "//tensorflow/compiler/xla:types",
@@ -164,6 +167,7 @@ tf_cc_binary(
         "//tensorflow/compiler/xla/client",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service",
         "//tensorflow/compiler/xla/service:hlo_proto",
         "//tensorflow/compiler/xla/service:interpreter_plugin",
@@ -181,6 +185,7 @@ tf_cc_binary(
         "//tensorflow/compiler/xla/client",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/service",
         "//tensorflow/compiler/xla/service:hlo",
         "//tensorflow/compiler/xla/service:hlo_proto",
@@ -198,6 +203,7 @@ tf_cc_binary(
         "//tensorflow/compiler/xla/client",
         "//tensorflow/compiler/xla/client:client_library",
         "//tensorflow/compiler/xla/client:local_client",
+        "//tensorflow/compiler/xla/client:xla_computation",
         "//tensorflow/compiler/xla/legacy_flags:debug_options_flags",
         "//tensorflow/compiler/xla/service",
         "//tensorflow/compiler/xla/service:hlo_graph_dumper",
diff --git a/tensorflow/compiler/xla/tools/dumped_computation_to_graphviz.cc b/tensorflow/compiler/xla/tools/dumped_computation_to_graphviz.cc
index befb554537..f20dcef382 100644
--- a/tensorflow/compiler/xla/tools/dumped_computation_to_graphviz.cc
+++ b/tensorflow/compiler/xla/tools/dumped_computation_to_graphviz.cc
@@ -31,6 +31,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/service.h"
diff --git a/tensorflow/compiler/xla/tools/dumped_computation_to_operation_list.cc b/tensorflow/compiler/xla/tools/dumped_computation_to_operation_list.cc
index cfb8f37487..f0af0580c1 100644
--- a/tensorflow/compiler/xla/tools/dumped_computation_to_operation_list.cc
+++ b/tensorflow/compiler/xla/tools/dumped_computation_to_operation_list.cc
@@ -22,6 +22,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/service.h"
diff --git a/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc b/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc
index 5dd5150be3..f03e1b1f96 100644
--- a/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc
+++ b/tensorflow/compiler/xla/tools/dumped_computation_to_text.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/service.h"
 #include "tensorflow/compiler/xla/statusor.h"
diff --git a/tensorflow/compiler/xla/tools/dumped_computation_to_tf_graphdef.cc b/tensorflow/compiler/xla/tools/dumped_computation_to_tf_graphdef.cc
index a5dce20456..dc5c106d02 100644
--- a/tensorflow/compiler/xla/tools/dumped_computation_to_tf_graphdef.cc
+++ b/tensorflow/compiler/xla/tools/dumped_computation_to_tf_graphdef.cc
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/client.h"
 #include "tensorflow/compiler/xla/client/client_library.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/service.h"
diff --git a/tensorflow/compiler/xla/tools/replay_computation.cc b/tensorflow/compiler/xla/tools/replay_computation.cc
index 3a7917cf30..563e2d8fdb 100644
--- a/tensorflow/compiler/xla/tools/replay_computation.cc
+++ b/tensorflow/compiler/xla/tools/replay_computation.cc
@@ -42,8 +42,10 @@ limitations under the License.
 #include "tensorflow/compiler/xla/client/global_data.h"
 #include "tensorflow/compiler/xla/client/lib/testing.h"
 #include "tensorflow/compiler/xla/client/local_client.h"
+#include "tensorflow/compiler/xla/client/xla_computation.h"
 #include "tensorflow/compiler/xla/execution_options_util.h"
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/service/gpu/infeed_manager.h"
 #include "tensorflow/compiler/xla/service/hlo.pb.h"
 #include "tensorflow/compiler/xla/service/hlo_parser.h"
@@ -181,11 +183,18 @@ StatusOr<Literal> ReplayComputation(const HloSnapshot& module,
 
   // Run the computation num_runs times, and return the result from the last
   // execution.
+  const bool xla_hlo_profile =
+      legacy_flags::GetDebugOptionsFromFlags().xla_hlo_profile();
   StreamExecutorMemoryAllocator allocator(
       client->platform(),
       {client->platform()->ExecutorForDevice(0).ValueOrDie()});
   tensorflow::gtl::optional<ScopedShapedBuffer> result;
   for (int i = 0; i < opts.num_runs; ++i) {
+    // If xla_hlo_profile is enabled, print a noisy message before the last run,
+    // making it easier to separate this profile from the others in the logspam.
+    if (xla_hlo_profile && i == opts.num_runs - 1) {
+      LOG(INFO) << "\n\n***** Final run below ******";
+    }
     ExecutionProfile profile;
     ExecutableRunOptions run_options;
     run_options.set_execution_profile(&profile);
diff --git a/tensorflow/compiler/xla/tools/show_literal.cc b/tensorflow/compiler/xla/tools/show_literal.cc
index fe8e72ba32..51909190a3 100644
--- a/tensorflow/compiler/xla/tools/show_literal.cc
+++ b/tensorflow/compiler/xla/tools/show_literal.cc
@@ -21,7 +21,7 @@ limitations under the License.
 #include <stdio.h>
 #include <string>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/status.h"
diff --git a/tensorflow/compiler/xla/tools/show_text_literal.cc b/tensorflow/compiler/xla/tools/show_text_literal.cc
index 8525873e91..48c8374811 100644
--- a/tensorflow/compiler/xla/tools/show_text_literal.cc
+++ b/tensorflow/compiler/xla/tools/show_text_literal.cc
@@ -20,7 +20,7 @@ limitations under the License.
 #include <memory>
 #include <string>
 
-#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/literal.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/text_literal_reader.h"
 #include "tensorflow/compiler/xla/types.h"
diff --git a/tensorflow/compiler/xla/util.h b/tensorflow/compiler/xla/util.h
index b23b968aae..5ae099a462 100644
--- a/tensorflow/compiler/xla/util.h
+++ b/tensorflow/compiler/xla/util.h
@@ -500,17 +500,17 @@ bool c_is_sorted(const C& c, Compare&& comp) {
 }
 
 template <typename C>
-auto c_adjacent_find(const C& c) -> decltype(std::begin(c)) {
+auto c_adjacent_find(C& c) -> decltype(std::begin(c)) {
   return std::adjacent_find(std::begin(c), std::end(c));
 }
 
 template <typename C, typename Pred>
-auto c_find_if(const C& c, Pred&& pred) -> decltype(std::begin(c)) {
+auto c_find_if(C& c, Pred&& pred) -> decltype(std::begin(c)) {
   return std::find_if(std::begin(c), std::end(c), std::forward<Pred>(pred));
 }
 
 template <typename C, typename Value>
-auto c_find(const C& c, Value&& value) -> decltype(std::begin(c)) {
+auto c_find(C& c, Value&& value) -> decltype(std::begin(c)) {
   return std::find(std::begin(c), std::end(c), std::forward<Value>(value));
 }
 
@@ -562,6 +562,11 @@ void EraseAt(C* c, int64 index) {
   c->erase(c->begin() + index);
 }
 
+template <typename T>
+std::vector<T> ArraySliceToVector(tensorflow::gtl::ArraySlice<T> slice) {
+  return std::vector<T>(slice.begin(), slice.end());
+}
+
 template <typename T, int N>
 std::vector<T> InlinedVectorToVector(
     const tensorflow::gtl::InlinedVector<T, N>& inlined_vector) {
diff --git a/tensorflow/compiler/xla/xla.proto b/tensorflow/compiler/xla/xla.proto
index 6f07e4606b..10c0adc670 100644
--- a/tensorflow/compiler/xla/xla.proto
+++ b/tensorflow/compiler/xla/xla.proto
@@ -293,6 +293,7 @@ message ComputationStatsResponse {
 }
 
 message CreateChannelHandleRequest {
+  ChannelHandle.ChannelType channel_type = 1;
 }
 
 message CreateChannelHandleResponse {
diff --git a/tensorflow/compiler/xla/xla_data.proto b/tensorflow/compiler/xla/xla_data.proto
index c7472173a7..fd784e909c 100644
--- a/tensorflow/compiler/xla/xla_data.proto
+++ b/tensorflow/compiler/xla/xla_data.proto
@@ -308,6 +308,22 @@ message DeviceHandle {
 // Send instructions will be blocked until the data is transferred.
 message ChannelHandle {
   int64 handle = 1;
+  enum ChannelType {
+    // Invalid primitive type to serve as default.
+    CHANNEL_TYPE_INVALID = 0;
+
+    // A channel for sending data between devices.
+    DEVICE_TO_DEVICE = 1;
+
+    // A channel for sending data from the device to the host. Can only be used
+    // with a Send operation.
+    DEVICE_TO_HOST = 2;
+
+    // A channel for sending data from the host to the device. Can only be used
+    // with a Recv operation.
+    HOST_TO_DEVICE = 3;
+  }
+  ChannelType type = 2;
 }
 
 // DeviceAssignmentProto is a serialized form of DeviceAssignment class, which
@@ -431,6 +447,20 @@ message GatherDimensionNumbers {
   int64 index_vector_dim = 4;
 }
 
+// Describes the dimension numbers for a scatter operation.
+//
+// All the fields are similar to the corresponding fields in
+// GatherDimensionNumbers. Differences are noted below.
+message ScatterDimensionNumbers {
+  // The set of dimensions in the updates shape that are window dimensions.
+  repeated int64 update_window_dims = 1;
+  // The set of window dimensions that must be inserted into the updates shape.
+  repeated int64 inserted_window_dims = 2;
+
+  repeated int64 scatter_dims_to_operand_dims = 3;
+  int64 index_vector_dim = 4;
+}
+
 message ConvolutionDimensionNumbers {
   // The number of the dimension that represents batch in the input.
   int64 input_batch_dimension = 7;
diff --git a/tensorflow/contrib/BUILD b/tensorflow/contrib/BUILD
index c039624daa..6a4e252b44 100644
--- a/tensorflow/contrib/BUILD
+++ b/tensorflow/contrib/BUILD
@@ -7,7 +7,6 @@ package(default_visibility = ["//tensorflow:__subpackages__"])
 
 load("//third_party/mpi:mpi.bzl", "if_mpi")
 load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda")
-load("@local_config_tensorrt//:build_defs.bzl", "if_tensorrt")
 load("//tensorflow:tensorflow.bzl", "if_not_windows")
 load("//tensorflow:tensorflow.bzl", "if_not_windows_cuda")
 
@@ -27,7 +26,6 @@ py_library(
         "//tensorflow/contrib/bayesflow:bayesflow_py",
         "//tensorflow/contrib/boosted_trees:init_py",
         "//tensorflow/contrib/checkpoint/python:checkpoint",
-        "//tensorflow/contrib/cluster_resolver:cluster_resolver_pip",
         "//tensorflow/contrib/cluster_resolver:cluster_resolver_py",
         "//tensorflow/contrib/coder:coder_py",
         "//tensorflow/contrib/compiler:compiler_py",
@@ -114,9 +112,7 @@ py_library(
         "//tensorflow/contrib/util:util_py",
         "//tensorflow/python:util",
         "//tensorflow/python/estimator:estimator_py",
-    ] + if_mpi(["//tensorflow/contrib/mpi_collectives:mpi_collectives_py"]) + if_tensorrt([
-        "//tensorflow/contrib/tensorrt:init_py",
-    ]) + select({
+    ] + if_mpi(["//tensorflow/contrib/mpi_collectives:mpi_collectives_py"]) + select({
         "//tensorflow:with_kafka_support_windows_override": [],
         "//tensorflow:with_kafka_support": [
             "//tensorflow/contrib/kafka",
@@ -134,6 +130,11 @@ py_library(
         "//tensorflow/contrib/bigtable",  # depends on bigtable
         "//tensorflow/contrib/cloud:cloud_py",  # doesn't compile on Windows
         "//tensorflow/contrib/ffmpeg:ffmpeg_ops_py",
+        # TODO(aaroey): tensorrt dependency has to appear before tflite so the
+        # build can resolve its flatbuffers symbols within the tensorrt library.
+        # This is an issue with the tensorrt static library and will be fixed by
+        # the next tensorrt release, so fix the order here after that.
+        "//tensorflow/contrib/tensorrt:init_py",  # doesn't compile on windows
         "//tensorflow/contrib/lite/python:lite",  # unix dependency, need to fix code
     ]),
 )
diff --git a/tensorflow/contrib/android/cmake/src/main/AndroidManifest.xml b/tensorflow/contrib/android/cmake/src/main/AndroidManifest.xml
index bced47e046..c17110a78b 100644
--- a/tensorflow/contrib/android/cmake/src/main/AndroidManifest.xml
+++ b/tensorflow/contrib/android/cmake/src/main/AndroidManifest.xml
@@ -1,6 +1,10 @@
 <manifest xmlns:android="http://schemas.android.com/apk/res/android"
     package="org.tensorflow.contrib.android">
 
+    <uses-sdk
+        android:minSdkVersion="4"
+        android:targetSdkVersion="19" />
+
     <application android:allowBackup="true" android:label="@string/app_name"
         android:supportsRtl="true">
 
diff --git a/tensorflow/contrib/autograph/README.md b/tensorflow/contrib/autograph/README.md
index 7e26f47118..cc54da4daa 100644
--- a/tensorflow/contrib/autograph/README.md
+++ b/tensorflow/contrib/autograph/README.md
@@ -1,10 +1,10 @@
 # AutoGraph
 
-IMPORTANT: AutoGraph is alpha software, and under active development. Expect rough edges and bugs, but if you try it, we appreciate early feedback! We'd also love contributions ([please see our contributing guidelines](CONTRIBUTING.md) and our [style guide](STYLE_GUIDE.md)).
+IMPORTANT: AutoGraph is beta software, and under active development. Expect rough edges and bugs, but if you try it, we appreciate early feedback! We'd also love contributions ([please see our contributing guidelines](CONTRIBUTING.md) and our [style guide](STYLE_GUIDE.md)).
 
 AutoGraph is a Python to TensorFlow compiler.
 
-With AutoGraph, you can write [Eager style](https://www.tensorflow.org/guide/eager) code in a concise manner, and run it as a TensorFlow graph. AutoGraph uses source code transformation and partial evaluation to generate Python code that builds an equivalent TensorFlow subgraph. The result is code that behaves like ops and can be freely combined with other TensorFlow ops.
+With AutoGraph, you can write [Eager style](https://www.tensorflow.org/guide/eager) code in a concise manner, and run it as a TensorFlow graph. AutoGraph uses source code transformation and partial evaluation to generate Python code that builds an equivalent TensorFlow subgraph. The result is code that behaves like ops and can be freely combined with other TensorFlow ops.  [Please see this file for which parts of the Python language we currently support](LIMITATIONS.md).
 
 For example, this Python function:
 
@@ -68,12 +68,21 @@ Then import the `autograph` module from `tf.contrib`:
 from tensorflow.contrib import autograph as ag
 ```
 
-### Interactive demo notebooks
+### Related links
 
-For more extensive examples, check out these interactive notebooks:
+Articles:
 
- * [RNN trained using Keras and Estimators](https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/examples/notebooks/rnn_keras_estimator.ipynb)
+ * [TensorFlow blog post](https://medium.com/tensorflow/autograph-converts-python-into-tensorflow-graphs-b2a871f87ec7)
+
+Interactive notebooks:
+
+ * [Quick guide](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/guide/autograph.ipynb)
+ * [RNN trained using Keras and Estimators](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/examples/notebooks/rnn_keras_estimator.ipynb)
  * [Demo from the TF Dev Summit 2018](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/examples/notebooks/dev_summit_2018_demo.ipynb)
+ * [Basic control flow speed test](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_collatz_speed_test.ipynb)
+ * [MNIST training speed test](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_mnist_speed_test.ipynb)
+ * [Basic algorithm samples](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/examples/notebooks/algorithms.ipynb)
+ * [Introductory workshop support notebook](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/examples/notebooks/workshop.ipynb)
 
 ## Using with annotations
 
diff --git a/tensorflow/contrib/autograph/__init__.py b/tensorflow/contrib/autograph/__init__.py
index 361cf2d77c..26e7a4a4d3 100644
--- a/tensorflow/contrib/autograph/__init__.py
+++ b/tensorflow/contrib/autograph/__init__.py
@@ -22,17 +22,21 @@ from __future__ import division
 from __future__ import print_function
 
 # TODO(mdan): Bring only the relevant symbols to the top level.
-from tensorflow.contrib.autograph import utils
 from tensorflow.contrib.autograph import operators
+from tensorflow.contrib.autograph import utils
+from tensorflow.contrib.autograph.core.errors import GraphConstructionError
+from tensorflow.contrib.autograph.core.errors import TfRuntimeError
+from tensorflow.contrib.autograph.core.errors import improved_errors
+from tensorflow.contrib.autograph.impl.api import RunMode
 from tensorflow.contrib.autograph.impl.api import convert
 from tensorflow.contrib.autograph.impl.api import converted_call
 from tensorflow.contrib.autograph.impl.api import do_not_convert
-from tensorflow.contrib.autograph.impl.api import RunMode
 from tensorflow.contrib.autograph.impl.api import to_code
 from tensorflow.contrib.autograph.impl.api import to_graph
 from tensorflow.contrib.autograph.lang.directives import set_element_type
 from tensorflow.contrib.autograph.lang.directives import set_loop_options
 from tensorflow.contrib.autograph.lang.special_functions import stack
+from tensorflow.contrib.autograph.lang.special_functions import tensor_list
 from tensorflow.contrib.autograph.pyct.transformer import AutographParseError
 from tensorflow.python.util.all_util import remove_undocumented
 
@@ -46,10 +50,15 @@ _allowed_symbols = [
     'to_graph',
     # Overloaded operators
     'operators',
+    # Errors
+    'improved_errors',
+    'GraphConstructionError',
+    'TfRuntimeError',
     # Python language "extensions"
     'set_element_type',
     'set_loop_options',
     'stack',
+    'tensor_list',
     # Exceptions
     'AutographParseError',
     # Utilities: to be removed
diff --git a/tensorflow/contrib/autograph/converters/BUILD b/tensorflow/contrib/autograph/converters/BUILD
index b2e2e27673..7cbba71683 100644
--- a/tensorflow/contrib/autograph/converters/BUILD
+++ b/tensorflow/contrib/autograph/converters/BUILD
@@ -21,16 +21,18 @@ py_library(
         "break_statements.py",
         "builtin_functions.py",
         "call_trees.py",
+        "conditional_expressions.py",
         "continue_statements.py",
         "control_flow.py",
         "decorators.py",
-        "ifexp.py",
-        "list_comprehension.py",
+        "directives.py",
+        "error_handlers.py",
+        "list_comprehensions.py",
         "lists.py",
         "logical_expressions.py",
         "name_scopes.py",
+        "return_statements.py",
         "side_effect_guards.py",
-        "single_return.py",
         "slices.py",
     ],
     srcs_version = "PY2AND3",
@@ -95,6 +97,17 @@ py_test(
 )
 
 py_test(
+    name = "conditional_expressions_test",
+    srcs = ["conditional_expressions_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":converters",
+        "//tensorflow/contrib/autograph/core:test_lib",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
     name = "continue_statements_test",
     srcs = ["continue_statements_test.py"],
     srcs_version = "PY2AND3",
@@ -132,6 +145,18 @@ py_test(
 )
 
 py_test(
+    name = "directives_test",
+    srcs = ["directives_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":converters",
+        "//tensorflow/contrib/autograph/core:test_lib",
+        "//tensorflow/contrib/autograph/lang",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
     name = "name_scopes_test",
     srcs = ["name_scopes_test.py"],
     deps = [
@@ -143,8 +168,8 @@ py_test(
 )
 
 py_test(
-    name = "list_comprehension_test",
-    srcs = ["list_comprehension_test.py"],
+    name = "list_comprehensions_test",
+    srcs = ["list_comprehensions_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":converters",
@@ -179,11 +204,6 @@ py_test(
     name = "side_effect_guards_test",
     srcs = ["side_effect_guards_test.py"],
     srcs_version = "PY2AND3",
-    tags = [
-        # TODO(mdan): Fix.
-        "flaky",
-        "notap",
-    ],
     deps = [
         ":converters",
         "//tensorflow/contrib/autograph/core:test_lib",
@@ -192,8 +212,8 @@ py_test(
 )
 
 py_test(
-    name = "single_return_test",
-    srcs = ["single_return_test.py"],
+    name = "return_statements_test",
+    srcs = ["return_statements_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":converters",
@@ -204,8 +224,8 @@ py_test(
 )
 
 py_test(
-    name = "ifexp_test",
-    srcs = ["ifexp_test.py"],
+    name = "error_handlers_test",
+    srcs = ["error_handlers_test.py"],
     srcs_version = "PY2AND3",
     deps = [
         ":converters",
diff --git a/tensorflow/contrib/autograph/converters/__init__.py b/tensorflow/contrib/autograph/converters/__init__.py
index e4e8eda42f..6325ac78dc 100644
--- a/tensorflow/contrib/autograph/converters/__init__.py
+++ b/tensorflow/contrib/autograph/converters/__init__.py
@@ -18,5 +18,15 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-# TODO(mdan): Define a base transformer class that can recognize skip_processing
-# TODO(mdan): All converters are incomplete, especially those that change blocks
+# Naming conventions:
+#  * each converter should specialize on a single idiom; be consistent with
+#    the Python reference for naming
+#  * all converters inherit core.converter.Base
+#  * module names describe the idiom that the converter covers, plural
+#  * the converter class is named consistent with the module, singular and
+#    includes the word Transformer
+#
+# Example:
+#
+#   lists.py
+#     class ListTransformer(converter.Base)
diff --git a/tensorflow/contrib/autograph/converters/asserts.py b/tensorflow/contrib/autograph/converters/asserts.py
index e664a403a5..af2f20f267 100644
--- a/tensorflow/contrib/autograph/converters/asserts.py
+++ b/tensorflow/contrib/autograph/converters/asserts.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Converts Assert statements to their corresponding TF calls."""
+"""Converts assert statements to their corresponding TF calls."""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -24,8 +24,8 @@ from tensorflow.contrib.autograph.core import converter
 from tensorflow.contrib.autograph.pyct import templates
 
 
-class AssertsTransformer(converter.Base):
-  """Transforms Print nodes to Call so they can be handled as functions."""
+class AssertTransformer(converter.Base):
+  """Transforms Assert nodes to Call so they can be handled as functions."""
 
   def visit_Assert(self, node):
     self.generic_visit(node)
@@ -46,4 +46,4 @@ class AssertsTransformer(converter.Base):
 
 
 def transform(node, ctx):
-  return AssertsTransformer(ctx).visit(node)
+  return AssertTransformer(ctx).visit(node)
diff --git a/tensorflow/contrib/autograph/converters/asserts_test.py b/tensorflow/contrib/autograph/converters/asserts_test.py
index 2cd0e626bc..38faba45df 100644
--- a/tensorflow/contrib/autograph/converters/asserts_test.py
+++ b/tensorflow/contrib/autograph/converters/asserts_test.py
@@ -32,10 +32,10 @@ class AssertsTest(converter_testing.TestCase):
     def test_fn(a):
       assert a > 0
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = asserts.transform(node, self.ctx)
+    node, ctx = self.prepare(test_fn, {})
+    node = asserts.transform(node, ctx)
 
-    self.assertTrue(isinstance(node.body[0].body[0].value, gast.Call))
+    self.assertTrue(isinstance(node.body[0].value, gast.Call))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/break_statements.py b/tensorflow/contrib/autograph/converters/break_statements.py
index a990e359a2..180779670d 100644
--- a/tensorflow/contrib/autograph/converters/break_statements.py
+++ b/tensorflow/contrib/autograph/converters/break_statements.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Canonicalizes break statements by de-sugaring into a control boolean."""
+"""Lowers break statements to conditionals."""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -24,20 +24,25 @@ from tensorflow.contrib.autograph.pyct import templates
 from tensorflow.contrib.autograph.pyct.static_analysis.annos import NodeAnno
 
 
-# Tags for local state.
-BREAK_USED = 'break_used'
-CONTROL_VAR_NAME = 'control_var_name'
+class _Break(object):
 
+  def __init__(self):
+    self.used = False
+    self.control_var_name = None
 
-class BreakStatementTransformer(converter.Base):
+  def __repr__(self):
+    return 'used: %s, var: %s' % (self.used, self.control_var_name)
+
+
+class BreakTransformer(converter.Base):
   """Canonicalizes break statements into additional conditionals."""
 
   def visit_Break(self, node):
-    self.set_local(BREAK_USED, True)
-    var_name = self.get_local(CONTROL_VAR_NAME)
+    self.state[_Break].used = True
+    var_name = self.state[_Break].control_var_name
     # TODO(mdan): This will fail when expanded inside a top-level else block.
     template = """
-      var_name = True
+      var_name = tf.constant(True)
       continue
     """
     return templates.replace(template, var_name=var_name)
@@ -57,12 +62,12 @@ class BreakStatementTransformer(converter.Base):
         block=block)
     return node
 
-  def _track_body(self, nodes, break_var):
-    self.enter_local_scope()
-    self.set_local(CONTROL_VAR_NAME, break_var)
+  def _process_body(self, nodes, break_var):
+    self.state[_Break].enter()
+    self.state[_Break].control_var_name = break_var
     nodes = self.visit_block(nodes)
-    break_used = self.get_local(BREAK_USED, False)
-    self.exit_local_scope()
+    break_used = self.state[_Break].used
+    self.state[_Break].exit()
     return nodes, break_used
 
   def visit_While(self, node):
@@ -70,7 +75,7 @@ class BreakStatementTransformer(converter.Base):
     break_var = self.ctx.namer.new_symbol('break_', scope.referenced)
 
     node.test = self.visit(node.test)
-    node.body, break_used = self._track_body(node.body, break_var)
+    node.body, break_used = self._process_body(node.body, break_var)
     # A break in the else clause applies to the containing scope.
     node.orelse = self.visit_block(node.orelse)
 
@@ -80,7 +85,7 @@ class BreakStatementTransformer(converter.Base):
       guarded_orelse = self._guard_if_present(node.orelse, break_var)
 
       template = """
-        var_name = False
+        var_name = tf.constant(False)
         while test and not var_name:
           body
         else:
@@ -101,7 +106,7 @@ class BreakStatementTransformer(converter.Base):
 
     node.target = self.visit(node.target)
     node.iter = self.visit(node.iter)
-    node.body, break_used = self._track_body(node.body, break_var)
+    node.body, break_used = self._process_body(node.body, break_var)
     # A break in the else clause applies to the containing scope.
     node.orelse = self.visit_block(node.orelse)
 
@@ -117,7 +122,7 @@ class BreakStatementTransformer(converter.Base):
       # the control variable is marked as used.
       # TODO(mdan): Use a marker instead, e.g. ag__.condition_loop_on(var_name)
       template = """
-        var_name = False
+        var_name = tf.constant(False)
         for target in iter_:
           (var_name,)
           body
@@ -138,4 +143,4 @@ class BreakStatementTransformer(converter.Base):
 
 
 def transform(node, ctx):
-  return BreakStatementTransformer(ctx).visit(node)
+  return BreakTransformer(ctx).visit(node)
diff --git a/tensorflow/contrib/autograph/converters/break_statements_test.py b/tensorflow/contrib/autograph/converters/break_statements_test.py
index dcff1c54c2..fcae7d68c0 100644
--- a/tensorflow/contrib/autograph/converters/break_statements_test.py
+++ b/tensorflow/contrib/autograph/converters/break_statements_test.py
@@ -20,12 +20,19 @@ from __future__ import print_function
 
 from tensorflow.contrib.autograph.converters import break_statements
 from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.python.eager import context as tfe_ctx
+from tensorflow.python.framework import constant_op
 from tensorflow.python.platform import test
 
 
 class BreakCanonicalizationTest(converter_testing.TestCase):
 
-  def test_basic_while(self):
+  def assertTransformedEquivalent(self, test_fn, *inputs):
+    with self.converted(test_fn, break_statements, {},
+                        constant_op.constant) as result:
+      self.assertEqual(test_fn(*inputs), result.test_fn(*inputs))
+
+  def test_while_loop(self):
 
     def test_fn(x):
       v = []
@@ -36,15 +43,12 @@ class BreakCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = break_statements.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      self.assertEqual([], result.test_fn(0))
-      self.assertEqual([], result.test_fn(1))
-      self.assertEqual([3], result.test_fn(4))
+    with tfe_ctx.eager_mode():
+      self.assertTransformedEquivalent(test_fn, 0)
+      self.assertTransformedEquivalent(test_fn, 1)
+      self.assertTransformedEquivalent(test_fn, 4)
 
-  def test_basic_for(self):
+  def test_for_loop(self):
 
     def test_fn(a):
       v = []
@@ -55,18 +59,13 @@ class BreakCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = break_statements.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
+    with self.converted(test_fn, break_statements, {},
+                        constant_op.constant) as result:
       # The break is incompletely canonicalized. The loop will not interrupt,
       # but the section following the break will be skipped.
-      self.assertEqual([], result.test_fn([]))
-      self.assertEqual([3, 3], result.test_fn([4, 4]))
-      self.assertEqual([3], result.test_fn([4, 5]))
       self.assertEqual([3], result.test_fn([5, 4]))
 
-  def test_deeply_nested(self):
+  def test_nested(self):
 
     def test_fn(x):
       v = []
@@ -83,13 +82,10 @@ class BreakCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v, u, w
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = break_statements.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      self.assertEqual(([], [], []), result.test_fn(0))
-      self.assertEqual(([2, 1], [2], [0]), result.test_fn(3))
-      self.assertEqual(([10, 9, 8, 7], [10, 8], [6]), result.test_fn(11))
+    with tfe_ctx.eager_mode():
+      self.assertTransformedEquivalent(test_fn, 0)
+      self.assertTransformedEquivalent(test_fn, 3)
+      self.assertTransformedEquivalent(test_fn, 11)
 
   def test_nested_loops(self):
 
@@ -109,16 +105,13 @@ class BreakCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v, u
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = break_statements.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      self.assertEqual(([], []), result.test_fn(0))
-      self.assertEqual(([1], []), result.test_fn(2))
-      self.assertEqual(([2, 1], [1]), result.test_fn(3))
-      self.assertEqual(([4, 3, 2, 1], [3, 1]), result.test_fn(5))
+    with tfe_ctx.eager_mode():
+      self.assertTransformedEquivalent(test_fn, 0)
+      self.assertTransformedEquivalent(test_fn, 2)
+      self.assertTransformedEquivalent(test_fn, 3)
+      self.assertTransformedEquivalent(test_fn, 5)
 
-  def test_loop_else(self):
+  def test_loop_orelse(self):
 
     def test_fn(x):
       v = []
@@ -134,13 +127,10 @@ class BreakCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v, u
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = break_statements.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      self.assertEqual(([], []), result.test_fn(0))
-      self.assertEqual(([], [1]), result.test_fn(2))
-      self.assertEqual(([2], [1]), result.test_fn(3))
+    with tfe_ctx.eager_mode():
+      self.assertTransformedEquivalent(test_fn, 0)
+      self.assertTransformedEquivalent(test_fn, 2)
+      self.assertTransformedEquivalent(test_fn, 3)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/builtin_functions_test.py b/tensorflow/contrib/autograph/converters/builtin_functions_test.py
index e9000e518c..d5c3e2c250 100644
--- a/tensorflow/contrib/autograph/converters/builtin_functions_test.py
+++ b/tensorflow/contrib/autograph/converters/builtin_functions_test.py
@@ -18,8 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import sys
-
 import six
 
 from tensorflow.contrib.autograph.converters import builtin_functions
@@ -36,55 +34,39 @@ class BuiltinFunctionsTest(converter_testing.TestCase):
     def test_fn(a):
       return len(a)
 
-    node = self.parse_and_analyze(test_fn, {'len': len})
-    node = builtin_functions.transform(node, self.ctx)
-
-    with self.compiled(node, array_ops.shape) as result:
+    with self.converted(test_fn, builtin_functions, {'len': len},
+                        array_ops.shape) as result:
       with self.test_session() as sess:
-        self.assertEqual(3,
-                         sess.run(
-                             result.test_fn(constant_op.constant([0, 0, 0]))))
-
-        self.assertEqual(3, result.test_fn([0, 0, 0]))
+        ops = result.test_fn(constant_op.constant([0, 0, 0]))
+        self.assertEqual(sess.run(ops), 3)
 
   def test_print(self):
 
-    def test_fn(a):
-      print(a)
+    if six.PY2:
+      return
 
-    node = self.parse_and_analyze(test_fn, {'print': print})
-    node = builtin_functions.transform(node, self.ctx)
+    def test_fn(a):
+      return print(a)
 
-    with self.compiled(node) as result:
+    with self.converted(test_fn, builtin_functions, {'print': print}) as result:
       with self.test_session() as sess:
-        try:
-          out_capturer = six.StringIO()
-          sys.stdout = out_capturer
-          result.test_fn(constant_op.constant('a'))
-          sess.run(sess.graph.get_operations())
-          self.assertEqual(out_capturer.getvalue(), 'a\n')
-        finally:
-          sys.stdout = sys.__stdout__
+        with self.assertPrints('a\n'):
+          sess.run(result.test_fn('a'))
 
-  def test_print_with_op_multiple_values(self):
+  def test_print_multiple_values(self):
 
-    def test_fn(a, b, c):
-      print(a, b, c)
+    if six.PY2:
+      return
 
-    node = self.parse_and_analyze(test_fn, {'print': print})
-    node = builtin_functions.transform(node, self.ctx)
+    def test_fn(a, b, c):
+      return print(a, b, c)
 
-    with self.compiled(node) as result:
+    with self.converted(test_fn, builtin_functions, {'print': print}) as result:
       with self.test_session() as sess:
-        try:
-          out_capturer = six.StringIO()
-          sys.stdout = out_capturer
-          result.test_fn(
-              constant_op.constant('a'), constant_op.constant(1), [2, 3])
-          sess.run(sess.graph.get_operations())
-          self.assertEqual(out_capturer.getvalue(), 'a 1 [2, 3]\n')
-        finally:
-          sys.stdout = sys.__stdout__
+        with self.assertPrints('a 1 [2, 3]\n'):
+          sess.run(
+              result.test_fn(
+                  constant_op.constant('a'), constant_op.constant(1), [2, 3]))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/call_trees.py b/tensorflow/contrib/autograph/converters/call_trees.py
index a36b3d77a9..2d1bed3367 100644
--- a/tensorflow/contrib/autograph/converters/call_trees.py
+++ b/tensorflow/contrib/autograph/converters/call_trees.py
@@ -238,7 +238,7 @@ class CallTreeTransformer(converter.Base):
     # Before we could convert all the time though, we'd need a reasonable
     # caching mechanism.
     template = """
-      ag__.converted_call(func, True, False, {}, args)
+      ag__.converted_call(func, True, False, False, {}, args)
     """
     call_expr = templates.replace(template, func=node.func, args=node.args)
     new_call = call_expr[0].value
diff --git a/tensorflow/contrib/autograph/converters/call_trees_test.py b/tensorflow/contrib/autograph/converters/call_trees_test.py
index 27d8281b85..8cdba659ee 100644
--- a/tensorflow/contrib/autograph/converters/call_trees_test.py
+++ b/tensorflow/contrib/autograph/converters/call_trees_test.py
@@ -36,37 +36,34 @@ class CallTreesTest(converter_testing.TestCase):
     def test_fn_1(_):
       raise ValueError('This should not be called in the compiled version.')
 
-    def renamed_test_fn_1(a):
+    def other_test_fn_1(a):
       return a + 1
 
     def test_fn_2(a):
       return test_fn_1(a) + 1
 
-    node = self.parse_and_analyze(test_fn_2, {'test_fn_1': test_fn_1})
-    node = call_trees.transform(node, self.ctx)
+    ns = {'test_fn_1': test_fn_1}
+    node, ctx = self.prepare(test_fn_2, ns)
+    node = call_trees.transform(node, ctx)
 
-    with self.compiled(node) as result:
-      # Only test_fn_2 is transformed, so we'll insert renamed_test_fn_1
-      # manually.
-      result.renamed_test_fn_1 = renamed_test_fn_1
-      self.assertEquals(3, result.test_fn_2(1))
+    with self.compiled(node, ns) as result:
+      new_name, _ = ctx.namer.compiled_function_name(('test_fn_1',))
+      setattr(result, new_name, other_test_fn_1)
+      self.assertEquals(result.test_fn_2(1), 3)
 
   def test_dynamic_function(self):
 
     def test_fn_1():
-      raise ValueError('This should be masked by the mock.')
+      raise ValueError('This should be masked by the mock in self.compiled.')
 
     def test_fn_2(f):
       return f() + 3
 
-    node = self.parse_and_analyze(test_fn_2, {})
-    node = call_trees.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
+    with self.converted(test_fn_2, call_trees, {}) as result:
       # 10 = 7 (from the mock) + 3 (from test_fn_2)
       self.assertEquals(10, result.test_fn_2(test_fn_1))
 
-  def test_simple_methods(self):
+  def test_basic_method(self):
 
     class TestClass(object):
 
@@ -76,49 +73,43 @@ class CallTreesTest(converter_testing.TestCase):
       def test_fn_2(self, a):
         return self.test_fn_1(a) + 1
 
-    node = self.parse_and_analyze(
-        TestClass.test_fn_2, {'TestClass': TestClass},
+    ns = {'TestClass': TestClass}
+    node, ctx = self.prepare(
+        TestClass.test_fn_2,
+        ns,
         namer=converter_testing.FakeNoRenameNamer(),
         arg_types={'self': (TestClass.__name__, TestClass)})
-    node = call_trees.transform(node, self.ctx)
+    node = call_trees.transform(node, ctx)
 
-    with self.compiled(node) as result:
+    with self.compiled(node, ns) as result:
       tc = TestClass()
       self.assertEquals(3, result.test_fn_2(tc, 1))
 
-  def test_py_func_wrap_no_retval(self):
+  def test_py_func_no_retval(self):
 
     def test_fn(a):
       setattr(a, 'foo', 'bar')
 
-    node = self.parse_and_analyze(test_fn, {'setattr': setattr})
-    node = call_trees.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
+    with self.converted(test_fn, call_trees, {'setattr': setattr}) as result:
       with self.test_session() as sess:
-        # The function has no return value, so we do some tricks to grab the
-        # generated py_func node and ensure its effect only happens at graph
-        # execution.
 
         class Dummy(object):
           pass
 
         a = Dummy()
         result.test_fn(a)
+        py_func_op, = sess.graph.get_operations()
         self.assertFalse(hasattr(a, 'foo'))
-        sess.run(sess.graph.get_operations()[0])
+        sess.run(py_func_op)
         self.assertEquals('bar', a.foo)
 
-  def test_py_func_wrap_known_function(self):
+  def test_py_func_known_function(self):
 
     def test_fn():
       return np.random.binomial(2, 0.5)
 
-    node = self.parse_and_analyze(test_fn, {'np': np})
-    node = call_trees.transform(node, self.ctx)
-
-    with self.compiled(node, dtypes.int64) as result:
-      result.np = np
+    with self.converted(test_fn, call_trees, {'np': np},
+                        dtypes.int64) as result:
       with self.test_session() as sess:
         self.assertTrue(isinstance(result.test_fn(), ops.Tensor))
         self.assertIn(sess.run(result.test_fn()), (0, 1, 2))
@@ -130,22 +121,17 @@ class CallTreesTest(converter_testing.TestCase):
       a = math_ops.add(a, constant_op.constant(1))
       return a
 
-    node = self.parse_and_analyze(
-        test_fn, {
-            'math_ops': math_ops,
-            'constant_op': constant_op
-        },
+    ns = {'math_ops': math_ops, 'constant_op': constant_op}
+    node, ctx = self.prepare(
+        test_fn,
+        ns,
         arg_types=set(((math_ops.__name__,), (constant_op.__name__,))))
-    node = call_trees.transform(node, self.ctx)
+    node = call_trees.transform(node, ctx)
 
-    with self.compiled(node) as result:
-      result.math_ops = math_ops
-      result.constant_op = constant_op
+    with self.compiled(node, ns) as result:
       with self.test_session() as sess:
-        # Not renamed, because the converter doesn't rename the definition
-        # itself (the caller is responsible for that).
         result_tensor = result.test_fn(constant_op.constant(1))
-        self.assertEquals(3, sess.run(result_tensor))
+        self.assertEquals(sess.run(result_tensor), 3)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/conditional_expressions.py b/tensorflow/contrib/autograph/converters/conditional_expressions.py
new file mode 100644
index 0000000000..63f649dfdf
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/conditional_expressions.py
@@ -0,0 +1,129 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Converts the ternary conditional operator."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.core import converter
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import templates
+from tensorflow.contrib.autograph.pyct.static_analysis.annos import NodeAnno
+
+
+class _FunctionDefs(object):
+
+  def __init__(self):
+    self.nodes = []
+
+
+class _Statement(object):
+
+  def __init__(self):
+    self.scope = None
+
+
+class ConditionalExpressionTransformer(converter.Base):
+  """Converts contitional expressions to functional form."""
+
+  def _postprocess_statement(self, node):
+    """Inserts any separate functions that node may use."""
+    replacements = []
+    for def_node in self.state[_FunctionDefs].nodes:
+      replacements.extend(def_node)
+    replacements.append(node)
+    node = replacements
+    # The corresponding enter is called by self.visit_block (see _process_block)
+    self.state[_FunctionDefs].exit()
+    return node, None
+
+  def _create_branch(self, expr, name_stem):
+    scope = self.state[_Statement].scope
+    name = self.ctx.namer.new_symbol(name_stem, scope.referenced)
+    template = """
+      def name():
+        return expr,
+    """
+    node = templates.replace(template, name=name, expr=expr)
+    self.state[_FunctionDefs].nodes.append(node)
+    return name
+
+  def visit_IfExp(self, node):
+    if anno.hasanno(node.test, anno.Basic.QN):
+      name_root = anno.getanno(node.test, anno.Basic.QN).ssf()
+    else:
+      name_root = 'ifexp'
+
+    true_fn_name = self._create_branch(node.body, '%s_true' % name_root)
+    false_fn_name = self._create_branch(node.orelse, '%s_false' % name_root)
+
+    return templates.replace_as_expression(
+        'ag__.utils.run_cond(test, true_fn_name, false_fn_name)',
+        test=node.test,
+        true_fn_name=true_fn_name,
+        false_fn_name=false_fn_name)
+
+  def _process_block(self, scope, block):
+    self.state[_Statement].enter()
+    self.state[_Statement].scope = scope
+    block = self.visit_block(
+        block,
+        before_visit=self.state[_FunctionDefs].enter,
+        after_visit=self._postprocess_statement)
+    self.state[_Statement].exit()
+    return block
+
+  def visit_FunctionDef(self, node):
+    node.args = self.generic_visit(node.args)
+    node.decorator_list = self.visit_block(node.decorator_list)
+    node.body = self._process_block(
+        anno.getanno(node, anno.Static.SCOPE), node.body)
+    return node
+
+  def visit_For(self, node):
+    node.target = self.visit(node.target)
+    node.body = self._process_block(
+        anno.getanno(node, NodeAnno.BODY_SCOPE), node.body)
+    node.orelse = self._process_block(
+        anno.getanno(node, NodeAnno.ORELSE_SCOPE), node.orelse)
+    return node
+
+  def visit_While(self, node):
+    node.test = self.visit(node.test)
+    node.body = self._process_block(
+        anno.getanno(node, NodeAnno.BODY_SCOPE), node.body)
+    node.orelse = self._process_block(
+        anno.getanno(node, NodeAnno.ORELSE_SCOPE), node.orelse)
+    return node
+
+  def visit_If(self, node):
+    node.test = self.visit(node.test)
+    node.body = self._process_block(
+        anno.getanno(node, NodeAnno.BODY_SCOPE), node.body)
+    node.orelse = self._process_block(
+        anno.getanno(node, NodeAnno.ORELSE_SCOPE), node.orelse)
+    return node
+
+  def visit_With(self, node):
+    node.items = self.visit_block(node.items)
+    node.body = self._process_block(
+        anno.getanno(node, NodeAnno.BODY_SCOPE), node.body)
+    return node
+
+
+def transform(node, ctx):
+  node = ConditionalExpressionTransformer(ctx).visit(node)
+  return node
diff --git a/tensorflow/contrib/autograph/converters/conditional_expressions_test.py b/tensorflow/contrib/autograph/converters/conditional_expressions_test.py
new file mode 100644
index 0000000000..95a3108741
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/conditional_expressions_test.py
@@ -0,0 +1,53 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for conditional_expressions module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.converters import conditional_expressions
+from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.python.platform import test
+
+
+class ConditionalExpressionsTest(converter_testing.TestCase):
+
+  def assertTransformedEquivalent(self, test_fn, *inputs):
+    ns = {}
+    with self.converted(test_fn, conditional_expressions, ns) as result:
+      self.assertEqual(test_fn(*inputs), result.test_fn(*inputs))
+
+  def test_basic(self):
+
+    def test_fn(x):
+      return 1 if x else 0
+
+    self.assertTransformedEquivalent(test_fn, 0)
+    self.assertTransformedEquivalent(test_fn, 3)
+
+  def test_nested_orelse(self):
+
+    def test_fn(x):
+      y = x * x if x > 0 else x if x else 1
+      return y
+
+    self.assertTransformedEquivalent(test_fn, -2)
+    self.assertTransformedEquivalent(test_fn, 0)
+    self.assertTransformedEquivalent(test_fn, 2)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/converters/continue_statements.py b/tensorflow/contrib/autograph/converters/continue_statements.py
index 958bde0a58..0476e97c15 100644
--- a/tensorflow/contrib/autograph/converters/continue_statements.py
+++ b/tensorflow/contrib/autograph/converters/continue_statements.py
@@ -37,7 +37,7 @@ class ContinueCanonicalizationTransformer(converter.Base):
   def visit_Continue(self, node):
     self.set_local(CONTINUE_USED, True)
     template = """
-      var_name = True
+      var_name = tf.constant(True)
     """
     return templates.replace(
         template, var_name=self.get_local(CONTROL_VAR_NAME))
@@ -92,7 +92,7 @@ class ContinueCanonicalizationTransformer(converter.Base):
 
     if self.get_local(CONTINUE_USED, False):
       template = """
-        var_name = False
+        var_name = tf.constant(False)
       """
       control_var_init = templates.replace(template, var_name=continue_var)
       nodes = control_var_init + nodes
diff --git a/tensorflow/contrib/autograph/converters/continue_statements_test.py b/tensorflow/contrib/autograph/converters/continue_statements_test.py
index 2ce1837972..37c15211b4 100644
--- a/tensorflow/contrib/autograph/converters/continue_statements_test.py
+++ b/tensorflow/contrib/autograph/converters/continue_statements_test.py
@@ -20,12 +20,19 @@ from __future__ import print_function
 
 from tensorflow.contrib.autograph.converters import continue_statements
 from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.python.eager import context as tfe_ctx
+from tensorflow.python.framework import constant_op
 from tensorflow.python.platform import test
 
 
 class ContinueCanonicalizationTest(converter_testing.TestCase):
 
-  def test_basic_continue(self):
+  def assertTransformedEquivalent(self, test_fn, *inputs):
+    with self.converted(test_fn, continue_statements, {},
+                        constant_op.constant) as result:
+      self.assertEqual(test_fn(*inputs), result.test_fn(*inputs))
+
+  def test_basic(self):
 
     def test_fn(x):
       v = []
@@ -36,17 +43,13 @@ class ContinueCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = continue_statements.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      self.assertEqual(test_fn(0), result.test_fn(0))
-      self.assertEqual(test_fn(1), result.test_fn(1))
-      self.assertEqual(test_fn(2), result.test_fn(2))
-      self.assertEqual(test_fn(3), result.test_fn(3))
-      self.assertEqual(test_fn(4), result.test_fn(4))
+    with tfe_ctx.eager_mode():
+      self.assertTransformedEquivalent(test_fn, 0)
+      self.assertTransformedEquivalent(test_fn, 1)
+      self.assertTransformedEquivalent(test_fn, 3)
+      self.assertTransformedEquivalent(test_fn, 4)
 
-  def test_basic_continue_for_loop(self):
+  def test_for_loop(self):
 
     def test_fn(a):
       v = []
@@ -57,16 +60,13 @@ class ContinueCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = continue_statements.transform(node, self.ctx)
+    with tfe_ctx.eager_mode():
+      self.assertTransformedEquivalent(test_fn, [])
+      self.assertTransformedEquivalent(test_fn, [1])
+      self.assertTransformedEquivalent(test_fn, [2])
+      self.assertTransformedEquivalent(test_fn, [1, 2, 3])
 
-    with self.compiled(node) as result:
-      self.assertEqual(test_fn([]), result.test_fn([]))
-      self.assertEqual(test_fn([1]), result.test_fn([1]))
-      self.assertEqual(test_fn([2]), result.test_fn([2]))
-      self.assertEqual(test_fn([1, 2, 3]), result.test_fn([1, 2, 3]))
-
-  def test_continue_deeply_nested(self):
+  def test_nested(self):
 
     def test_fn(x):
       v = []
@@ -83,15 +83,11 @@ class ContinueCanonicalizationTest(converter_testing.TestCase):
         v.append(x)
       return v, u, w
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = continue_statements.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      self.assertEqual(test_fn(0), result.test_fn(0))
-      self.assertEqual(test_fn(1), result.test_fn(1))
-      self.assertEqual(test_fn(2), result.test_fn(2))
-      self.assertEqual(test_fn(3), result.test_fn(3))
-      self.assertEqual(test_fn(4), result.test_fn(4))
+    with tfe_ctx.eager_mode():
+      self.assertTransformedEquivalent(test_fn, 0)
+      self.assertTransformedEquivalent(test_fn, 1)
+      self.assertTransformedEquivalent(test_fn, 3)
+      self.assertTransformedEquivalent(test_fn, 4)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/control_flow.py b/tensorflow/contrib/autograph/converters/control_flow.py
index f4a8710627..5a5a6ad63a 100644
--- a/tensorflow/contrib/autograph/converters/control_flow.py
+++ b/tensorflow/contrib/autograph/converters/control_flow.py
@@ -25,8 +25,7 @@ from tensorflow.contrib.autograph.pyct import anno
 from tensorflow.contrib.autograph.pyct import ast_util
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import templates
-from tensorflow.contrib.autograph.pyct.static_analysis import cfg
-from tensorflow.contrib.autograph.pyct.static_analysis.annos import NodeAnno
+from tensorflow.contrib.autograph.pyct.static_analysis import annos
 
 
 class SymbolNamer(object):
@@ -47,6 +46,7 @@ class SymbolNamer(object):
 
 class ControlFlowTransformer(converter.Base):
   """Transforms control flow structures like loops an conditionals."""
+
   def _create_cond_branch(self, body_name, aliased_orig_names,
                           aliased_new_names, body, returns):
     if aliased_orig_names:
@@ -90,55 +90,51 @@ class ControlFlowTransformer(converter.Base):
       return templates.replace(
           template, test=test, body_name=body_name, orelse_name=orelse_name)
 
-  def visit_If(self, node):
-    self.generic_visit(node)
+  def _fmt_symbol_list(self, symbol_set):
+    if not symbol_set:
+      return 'no variables'
+    return ', '.join(map(str, symbol_set))
 
-    body_scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
-    orelse_scope = anno.getanno(node, NodeAnno.ORELSE_SCOPE)
-    body_defs = body_scope.created | body_scope.modified
-    orelse_defs = orelse_scope.created | orelse_scope.modified
-    live = anno.getanno(node, 'live_out')
-
-    # We'll need to check if we're closing over variables that are defined
-    # elsewhere in the function
-    # NOTE: we can only detect syntactic closure in the scope
-    # of the code passed in. If the AutoGraph'd function itself closes
-    # over other variables, this analysis won't take that into account.
-    defined = anno.getanno(node, 'defined_in')
-
-    # We only need to return variables that are
-    # - modified by one or both branches
-    # - live (or has a live parent) at the end of the conditional
-    modified = []
-    for def_ in body_defs | orelse_defs:
-      def_with_parents = set((def_,)) | def_.support_set
-      if live & def_with_parents:
-        modified.append(def_)
-
-    # We need to check if live created variables are balanced
-    # in both branches
-    created = live & (body_scope.created | orelse_scope.created)
-
-    # The if statement is illegal if there are variables that are created,
-    # that are also live, but both branches don't create them.
-    if created:
-      if created != (body_scope.created & live):
-        raise ValueError(
-            'The main branch does not create all live symbols that the else '
-            'branch does.')
-      if created != (orelse_scope.created & live):
-        raise ValueError(
-            'The else branch does not create all live symbols that the main '
-            'branch does.')
-
-    # Alias the closure variables inside the conditional functions
-    # to avoid errors caused by the local variables created in the branch
-    # functions.
+  def visit_If(self, node):
+    node = self.generic_visit(node)
+
+    body_scope = anno.getanno(node, annos.NodeAnno.BODY_SCOPE)
+    orelse_scope = anno.getanno(node, annos.NodeAnno.ORELSE_SCOPE)
+    defined_in = anno.getanno(node, anno.Static.DEFINED_VARS_IN)
+    live_out = anno.getanno(node, anno.Static.LIVE_VARS_OUT)
+
+    modified_in_cond = body_scope.modified | orelse_scope.modified
+    returned_from_cond = set()
+    for s in modified_in_cond:
+      if s in live_out:
+        returned_from_cond.add(s)
+      elif s.is_composite():
+        # Special treatment for compound objects: if any of their owner entities
+        # are live, then they are outputs as well.
+        if any(owner in live_out for owner in s.owner_set):
+          returned_from_cond.add(s)
+
+    need_alias_in_body = body_scope.modified & defined_in
+    need_alias_in_orelse = orelse_scope.modified & defined_in
+
+    created_in_body = body_scope.modified & returned_from_cond - defined_in
+    created_in_orelse = orelse_scope.modified & returned_from_cond - defined_in
+
+    if created_in_body != created_in_orelse:
+      raise ValueError(
+          'if statement may not initialize all variables: the true branch'
+          ' creates %s, while the false branch creates %s. Make sure all'
+          ' these variables are initialized either in both'
+          ' branches or before the if statement.' %
+          (self._fmt_symbol_list(created_in_body),
+           self._fmt_symbol_list(created_in_orelse)))
+
+    # Alias the closure variables inside the conditional functions, to allow
+    # the functions access to the respective variables.
     # We will alias variables independently for body and orelse scope,
     # because different branches might write different variables.
-    aliased_body_orig_names = tuple(body_scope.modified - body_scope.created)
-    aliased_orelse_orig_names = tuple(orelse_scope.modified -
-                                      orelse_scope.created)
+    aliased_body_orig_names = tuple(need_alias_in_body)
+    aliased_orelse_orig_names = tuple(need_alias_in_orelse)
     aliased_body_new_names = tuple(
         self.ctx.namer.new_symbol(s.ssf(), body_scope.referenced)
         for s in aliased_body_orig_names)
@@ -153,58 +149,47 @@ class ControlFlowTransformer(converter.Base):
     node_body = ast_util.rename_symbols(node.body, alias_body_map)
     node_orelse = ast_util.rename_symbols(node.orelse, alias_orelse_map)
 
-    if not modified:
+    returned_from_cond = tuple(returned_from_cond)
+    if returned_from_cond:
+      if len(returned_from_cond) == 1:
+        # TODO(mdan): Move this quirk into the operator implementation.
+        cond_results = returned_from_cond[0]
+      else:
+        cond_results = gast.Tuple([s.ast() for s in returned_from_cond], None)
+
+      returned_from_body = tuple(
+          alias_body_map[s] if s in need_alias_in_body else s
+          for s in returned_from_cond)
+      returned_from_orelse = tuple(
+          alias_orelse_map[s] if s in need_alias_in_orelse else s
+          for s in returned_from_cond)
+
+    else:
       # When the cond would return no value, we leave the cond called without
       # results. That in turn should trigger the side effect guards. The
       # branch functions will return a dummy value that ensures cond
       # actually has some return value as well.
-      results = None
-    elif len(modified) == 1:
-      results = modified[0]
-    else:
-      results = gast.Tuple([s.ast() for s in modified], None)
+      cond_results = None
+      # TODO(mdan): This doesn't belong here; it's specific to the operator.
+      returned_from_body = templates.replace_as_expression('tf.constant(1)')
+      returned_from_orelse = templates.replace_as_expression('tf.constant(1)')
 
     body_name = self.ctx.namer.new_symbol('if_true', body_scope.referenced)
     orelse_name = self.ctx.namer.new_symbol('if_false', orelse_scope.referenced)
-    if modified:
-
-      def build_returns(aliased_names, alias_map, scope):
-        """Builds list of return variables for a branch of a conditional."""
-        returns = []
-        for s in modified:
-          if s in aliased_names:
-            returns.append(alias_map[s])
-          else:
-            if s not in scope.created | defined:
-              raise ValueError(
-                  'Attempting to return variable "%s" from the true branch of '
-                  'a conditional, but it was not closed over, or created in '
-                  'this branch.' % str(s))
-            else:
-              returns.append(s)
-        return tuple(returns)
-
-      body_returns = build_returns(aliased_body_orig_names, alias_body_map,
-                                   body_scope)
-      orelse_returns = build_returns(aliased_orelse_orig_names,
-                                     alias_orelse_map, orelse_scope)
-
-    else:
-      body_returns = orelse_returns = templates.replace('tf.ones(())')[0].value
 
     body_def = self._create_cond_branch(
         body_name,
-        aliased_orig_names=tuple(aliased_body_orig_names),
-        aliased_new_names=tuple(aliased_body_new_names),
+        aliased_orig_names=aliased_body_orig_names,
+        aliased_new_names=aliased_body_new_names,
         body=node_body,
-        returns=body_returns)
+        returns=returned_from_body)
     orelse_def = self._create_cond_branch(
         orelse_name,
-        aliased_orig_names=tuple(aliased_orelse_orig_names),
-        aliased_new_names=tuple(aliased_orelse_new_names),
+        aliased_orig_names=aliased_orelse_orig_names,
+        aliased_new_names=aliased_orelse_new_names,
         body=node_orelse,
-        returns=orelse_returns)
-    cond_expr = self._create_cond_expr(results, node.test, body_name,
+        returns=returned_from_orelse)
+    cond_expr = self._create_cond_expr(cond_results, node.test, body_name,
                                        orelse_name)
 
     return body_def + orelse_def + cond_expr
@@ -212,11 +197,11 @@ class ControlFlowTransformer(converter.Base):
   def visit_While(self, node):
     self.generic_visit(node)
 
-    body_scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
+    body_scope = anno.getanno(node, annos.NodeAnno.BODY_SCOPE)
     body_closure = body_scope.modified - body_scope.created
     all_referenced = body_scope.referenced
 
-    cond_scope = anno.getanno(node, NodeAnno.COND_SCOPE)
+    cond_scope = anno.getanno(node, annos.NodeAnno.COND_SCOPE)
     cond_closure = set()
     for s in cond_scope.referenced:
       for root in s.support_set:
@@ -277,7 +262,7 @@ class ControlFlowTransformer(converter.Base):
   def visit_For(self, node):
     self.generic_visit(node)
 
-    body_scope = anno.getanno(node, NodeAnno.BODY_SCOPE)
+    body_scope = anno.getanno(node, annos.NodeAnno.BODY_SCOPE)
     body_closure = body_scope.modified - body_scope.created
     all_referenced = body_scope.referenced
 
@@ -331,7 +316,5 @@ class ControlFlowTransformer(converter.Base):
 
 
 def transform(node, ctx):
-  cfg.run_analyses(node, cfg.Liveness(ctx.info))
-  cfg.run_analyses(node, cfg.Defined(ctx.info))
   node = ControlFlowTransformer(ctx).visit(node)
   return node
diff --git a/tensorflow/contrib/autograph/converters/control_flow_test.py b/tensorflow/contrib/autograph/converters/control_flow_test.py
index 735eb92a0d..ade3501426 100644
--- a/tensorflow/contrib/autograph/converters/control_flow_test.py
+++ b/tensorflow/contrib/autograph/converters/control_flow_test.py
@@ -20,16 +20,23 @@ from __future__ import print_function
 
 from tensorflow.contrib.autograph.converters import control_flow
 from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.contrib.autograph.pyct import transformer
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.platform import test
 
 
 class ControlFlowTest(converter_testing.TestCase):
 
-  def test_simple_while(self):
+  def assertTransformedResult(self, test_fn, inputs, expected):
+    if not isinstance(inputs, tuple):
+      inputs = (inputs,)
+    with self.converted(test_fn, control_flow, {},
+                        constant_op.constant) as result:
+      with self.test_session() as sess:
+        self.assertEqual(sess.run(result.test_fn(*inputs)), expected)
+
+  def test_while_basic(self):
 
     def test_fn(n):
       i = 0
@@ -39,29 +46,18 @@ class ControlFlowTest(converter_testing.TestCase):
         i += 1
       return s, i, n
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      with self.test_session() as sess:
-        self.assertEqual((10, 5, 5),
-                         sess.run(result.test_fn(constant_op.constant(5))))
+    self.assertTransformedResult(test_fn, constant_op.constant(5), (10, 5, 5))
 
-  def test_while_single_var(self):
+  def test_while_single_output(self):
 
     def test_fn(n):
       while n > 0:
         n -= 1
       return n
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
+    self.assertTransformedResult(test_fn, constant_op.constant(5), 0)
 
-    with self.compiled(node) as result:
-      with self.test_session() as sess:
-        self.assertEqual(0, sess.run(result.test_fn(constant_op.constant(5))))
-
-  def test_simple_if(self):
+  def test_if_basic(self):
 
     def test_fn(n):
       a = 0
@@ -72,114 +68,85 @@ class ControlFlowTest(converter_testing.TestCase):
         b = 2 * n
       return a, b
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
+    self.assertTransformedResult(test_fn, constant_op.constant(1), (-1, 0))
+    self.assertTransformedResult(test_fn, constant_op.constant(-1), (0, -2))
+
+  def test_if_complex_outputs(self):
+
+    class TestClass(object):
 
-    with self.compiled(node) as result:
+      def __init__(self, a, b):
+        self.a = a
+        self.b = b
+
+    def test_fn(n, obj):
+      obj.a = 0
+      obj.b = 0
+      if n > 0:
+        obj.a = -n
+      else:
+        obj.b = 2 * n
+      return obj
+
+    with self.converted(test_fn, control_flow, {}) as result:
       with self.test_session() as sess:
-        self.assertEqual((-1, 0),
-                         sess.run(result.test_fn(constant_op.constant(1))))
-        self.assertEqual((0, -2),
-                         sess.run(result.test_fn(constant_op.constant(-1))))
+        res_obj = result.test_fn(constant_op.constant(1), TestClass(0, 0))
+        self.assertEqual(sess.run((res_obj.a, res_obj.b)), (-1, 0))
+        res_obj = result.test_fn(constant_op.constant(-1), TestClass(0, 0))
+        self.assertEqual(sess.run((res_obj.a, res_obj.b)), (0, -2))
 
-  def test_if_single_var(self):
+  def test_if_single_output(self):
 
     def test_fn(n):
       if n > 0:
         n = -n
       return n
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
+    self.assertTransformedResult(test_fn, constant_op.constant(1), -1)
 
-    with self.compiled(node) as result:
-      with self.test_session() as sess:
-        self.assertEqual(-1, sess.run(result.test_fn(constant_op.constant(1))))
-
-  def test_imbalanced_aliasing(self):
+  def test_if_semi(self):
 
     def test_fn(n):
       if n > 0:
         n = 3
       return n
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
-
-    with self.compiled(node, control_flow_ops.cond) as result:
-      with self.test_session() as sess:
-        self.assertEqual(3, sess.run(result.test_fn(constant_op.constant(2))))
-        self.assertEqual(-3, sess.run(result.test_fn(constant_op.constant(-3))))
+    self.assertTransformedResult(test_fn, constant_op.constant(2), 3)
+    self.assertTransformedResult(test_fn, constant_op.constant(-3), -3)
 
-  def test_ignore_unread_variable(self):
+  def test_if_local_var(self):
 
     def test_fn(n):
-      b = 3  # pylint: disable=unused-variable
       if n > 0:
         b = 4
+        n = b + 1
       return n
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
+    self.assertTransformedResult(test_fn, constant_op.constant(1), 5)
+    self.assertTransformedResult(test_fn, constant_op.constant(-1), -1)
 
-    with self.compiled(node, control_flow_ops.cond, array_ops.ones) as result:
-      with self.test_session() as sess:
-        self.assertEqual(3, sess.run(result.test_fn(constant_op.constant(3))))
-        self.assertEqual(-3, sess.run(result.test_fn(constant_op.constant(-3))))
+  def test_if_no_outputs(self):
 
-  def test_handle_temp_variable(self):
+    def test_fn(n):
+      if n > 0:
+        b = 4  # pylint:disable=unused-variable
+      return n
 
-    def test_fn_using_temp(x, y, w):
-      if x < y:
-        z = x + y
-      else:
-        w = 2
-        tmp = w
-        z = x - tmp
-      return z, w
+    # Without side effect guards, the if statement will stage a cond,
+    # but that will be pruned at execution.
+    self.assertTransformedResult(test_fn, constant_op.constant(1), 1)
+    self.assertTransformedResult(test_fn, constant_op.constant(-1), -1)
 
-    node = self.parse_and_analyze(test_fn_using_temp, {})
-    node = control_flow.transform(node, self.ctx)
+  def test_if_imbalanced_outputs(self):
 
-    with self.compiled(node, control_flow_ops.cond, array_ops.ones) as result:
-      with self.test_session() as sess:
-        z, w = sess.run(
-            result.test_fn_using_temp(
-                constant_op.constant(-3), constant_op.constant(3),
-                constant_op.constant(3)))
-        self.assertEqual(0, z)
-        self.assertEqual(3, w)
-        z, w = sess.run(
-            result.test_fn_using_temp(
-                constant_op.constant(3), constant_op.constant(-3),
-                constant_op.constant(3)))
-        self.assertEqual(1, z)
-        self.assertEqual(2, w)
-
-    def test_fn_ignoring_temp(x, y, w):
-      if x < y:
-        z = x + y
-      else:
-        w = 2
-        tmp = w
-        z = x - tmp
-      return z
+    def test_fn(n):
+      if n > 0:
+        b = 4
+      return b
 
-    node = self.parse_and_analyze(test_fn_ignoring_temp, {})
-    node = control_flow.transform(node, self.ctx)
-
-    with self.compiled(node, control_flow_ops.cond, array_ops.ones) as result:
-      with self.test_session() as sess:
-        z = sess.run(
-            result.test_fn_ignoring_temp(
-                constant_op.constant(-3), constant_op.constant(3),
-                constant_op.constant(3)))
-        self.assertEqual(0, z)
-        z = sess.run(
-            result.test_fn_ignoring_temp(
-                constant_op.constant(3), constant_op.constant(-3),
-                constant_op.constant(3)))
-        self.assertEqual(1, z)
+    node, ctx = self.prepare(test_fn, {})
+    with self.assertRaises(transformer.AutographParseError):
+      control_flow.transform(node, ctx)
 
   def test_simple_for(self):
 
@@ -191,22 +158,11 @@ class ControlFlowTest(converter_testing.TestCase):
         s2 += e * e
       return s1, s2
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
+    self.assertTransformedResult(test_fn, constant_op.constant([1, 3]), (4, 10))
+    empty_vector = constant_op.constant([], shape=(0,), dtype=dtypes.int32)
+    self.assertTransformedResult(test_fn, empty_vector, (0, 0))
 
-    with self.compiled(node) as result:
-      with self.test_session() as sess:
-        l = [1, 2, 3]
-        self.assertEqual(
-            test_fn(l), sess.run(result.test_fn(constant_op.constant(l))))
-        l = []
-        self.assertEqual(
-            test_fn(l),
-            sess.run(
-                result.test_fn(
-                    constant_op.constant(l, shape=(0,), dtype=dtypes.int32))))
-
-  def test_for_single_var(self):
+  def test_for_single_output(self):
 
     def test_fn(l):
       s = 0
@@ -214,22 +170,11 @@ class ControlFlowTest(converter_testing.TestCase):
         s += e
       return s
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = control_flow.transform(node, self.ctx)
+    self.assertTransformedResult(test_fn, constant_op.constant([1, 3]), 4)
+    empty_vector = constant_op.constant([], shape=(0,), dtype=dtypes.int32)
+    self.assertTransformedResult(test_fn, empty_vector, 0)
 
-    with self.compiled(node) as result:
-      with self.test_session() as sess:
-        l = [1, 2, 3]
-        self.assertEqual(
-            test_fn(l), sess.run(result.test_fn(constant_op.constant(l))))
-        l = []
-        self.assertEqual(
-            test_fn(l),
-            sess.run(
-                result.test_fn(
-                    constant_op.constant(l, shape=(0,), dtype=dtypes.int32))))
-
-  def test_for_with_iterated_expression(self):
+  def test_for_iterated_expression(self):
 
     eval_count = [0]
 
@@ -243,14 +188,13 @@ class ControlFlowTest(converter_testing.TestCase):
         s += e
       return s
 
-    node = self.parse_and_analyze(test_fn, {'count_evals': count_evals})
-    node = control_flow.transform(node, self.ctx)
+    ns = {'count_evals': count_evals}
+    node, ctx = self.prepare(test_fn, ns)
+    node = control_flow.transform(node, ctx)
 
-    with self.compiled(node) as result:
-      result.count_evals = count_evals
-      self.assertEqual(test_fn(5), result.test_fn(5))
-      # count_evals ran twice, once for test_fn and another for result.test_fn
-      self.assertEqual(eval_count[0], 2)
+    with self.compiled(node, ns) as result:
+      self.assertEqual(result.test_fn(5), 10)
+      self.assertEqual(eval_count[0], 1)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/decorators_test.py b/tensorflow/contrib/autograph/converters/decorators_test.py
index d41c7fde24..095abc5edc 100644
--- a/tensorflow/contrib/autograph/converters/decorators_test.py
+++ b/tensorflow/contrib/autograph/converters/decorators_test.py
@@ -61,13 +61,13 @@ class DecoratorsTest(converter_testing.TestCase):
         'simple_decorator': simple_decorator,
         'converter_testing': converter_testing,
     }
-    node = self.parse_and_analyze(
+    node, ctx = self.prepare(
         f,
         namespace,
         recursive=False,
         autograph_decorators=autograph_decorators)
-    node = decorators.transform(node, self.ctx)
-    import_line = '\n'.join(self.ctx.program.additional_imports)
+    node = decorators.transform(node, ctx)
+    import_line = '\n'.join(ctx.program.additional_imports)
     result, _ = compiler.ast_to_object(node, source_prefix=import_line)
     return getattr(result, f.__name__)
 
@@ -76,11 +76,8 @@ class DecoratorsTest(converter_testing.TestCase):
     def test_fn(a):
       return a
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = decorators.transform(node, self.ctx)
-    result, _ = compiler.ast_to_object(node)
-
-    self.assertEqual(1, result.test_fn(1))
+    with self.converted(test_fn, decorators, {}) as result:
+      self.assertEqual(1, result.test_fn(1))
 
   def test_function(self):
 
@@ -124,7 +121,7 @@ class DecoratorsTest(converter_testing.TestCase):
         return b + 11
       return inner_fn(a)
 
-    # Expected to fail because simple_decorator cannot be imported.
+    # Expected to fail because simple_decorator could not be imported.
     with self.assertRaises(transformer.AutographParseError):
       test_fn(1)
 
diff --git a/tensorflow/contrib/autograph/converters/directives.py b/tensorflow/contrib/autograph/converters/directives.py
new file mode 100644
index 0000000000..77f625bac7
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/directives.py
@@ -0,0 +1,128 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Handles directives.
+
+This converter removes the directive functions from the code and moves the
+information they specify into AST annotations. It is a specialized form of
+static analysis, one that is specific to AutoGraph.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.autograph.core import converter
+from tensorflow.contrib.autograph.lang import directives
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.python.util import tf_inspect
+
+ENCLOSING_LOOP = 'enclosing_loop'
+
+
+def _map_args(call_node, function):
+  """Maps AST call nodes to the actual function's arguments.
+
+  Args:
+    call_node: ast.Call
+    function: Callable[..., Any], the actual function matching call_node
+  Returns:
+    Dict[Text, ast.AST], mapping each of the function's argument names to
+    the respective AST node.
+  Raises:
+      ValueError: if the default arguments are not correctly set
+  """
+  args = call_node.args
+  kwds = {kwd.arg: kwd.value for kwd in call_node.keywords}
+  call_args = tf_inspect.getcallargs(function, *args, **kwds)
+
+  # Keyword arguments not specified in kwds will be mapped to their defaults,
+  # which are Python values. Since we don't currently have a way to transform
+  # those into AST references, we simply remove them. By convention, directives
+  # use UNSPECIFIED as default value for for optional arguments. No other
+  # defaults should be present.
+  unexpected_defaults = []
+  for k in call_args:
+    if (k not in kwds
+        and call_args[k] not in args
+        and call_args[k] is not directives.UNSPECIFIED):
+      unexpected_defaults.append(k)
+  if unexpected_defaults:
+    raise ValueError('Unexpected keyword argument values, %s, for function %s'
+                     % (zip(unexpected_defaults,
+                            [call_args[k] for k in unexpected_defaults]),
+                        function))
+  return {k: v for k, v in call_args.items() if v is not directives.UNSPECIFIED}
+
+
+class DirectivesTransformer(converter.Base):
+  """Parses compiler directives and converts them into AST annotations."""
+
+  def _process_symbol_directive(self, call_node, directive):
+    if len(call_node.args) < 1:
+      raise ValueError('"%s" requires a positional first argument'
+                       ' as the target' % directive.__name__)
+    target = call_node.args[0]
+    defs = anno.getanno(target, anno.Static.ORIG_DEFINITIONS)
+    for def_ in defs:
+      def_.directives[directive] = _map_args(call_node, directive)
+    return call_node
+
+  def _process_statement_directive(self, call_node, directive):
+    if self.local_scope_level < 1:
+      raise ValueError(
+          '"%s" must be used inside a statement' % directive.__name__)
+    target = self.get_local(ENCLOSING_LOOP)
+    node_anno = anno.getanno(target, converter.AgAnno.DIRECTIVES, {})
+    node_anno[directive] = _map_args(call_node, directive)
+    anno.setanno(target, converter.AgAnno.DIRECTIVES, node_anno)
+    return call_node
+
+  def visit_Expr(self, node):
+    if isinstance(node.value, gast.Call):
+      call_node = node.value
+      if anno.hasanno(call_node.func, 'live_val'):
+        live_val = anno.getanno(call_node.func, 'live_val')
+
+        if live_val is directives.set_element_type:
+          call_node = self._process_symbol_directive(call_node, live_val)
+        elif live_val is directives.set_loop_options:
+          call_node = self._process_statement_directive(call_node, live_val)
+        else:
+          return self.generic_visit(node)
+
+        return None  # Directive calls are not output in the generated code.
+    return self.generic_visit(node)
+
+  # TODO(mdan): This will be insufficient for other control flow.
+  # That means that if we ever have a directive that affects things other than
+  # loops, we'll need support for parallel scopes, or have multiple converters.
+  def _track_and_visit_loop(self, node):
+    self.enter_local_scope()
+    self.set_local(ENCLOSING_LOOP, node)
+    node = self.generic_visit(node)
+    self.exit_local_scope()
+    return node
+
+  def visit_While(self, node):
+    return self._track_and_visit_loop(node)
+
+  def visit_For(self, node):
+    return self._track_and_visit_loop(node)
+
+
+def transform(node, ctx):
+  return DirectivesTransformer(ctx).visit(node)
diff --git a/tensorflow/contrib/autograph/converters/directives_test.py b/tensorflow/contrib/autograph/converters/directives_test.py
new file mode 100644
index 0000000000..a2d083b891
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/directives_test.py
@@ -0,0 +1,95 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for directives module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.converters import directives as directives_converter
+from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.contrib.autograph.core.converter import AgAnno
+from tensorflow.contrib.autograph.lang import directives
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import parser
+from tensorflow.python.platform import test
+
+
+class DirectivesTest(converter_testing.TestCase):
+
+  def test_local_target(self):
+
+    def test_fn():
+      l = []
+      string_var = 0
+      directives.set_element_type(l, 'a', string_var)
+
+    node, ctx = self.prepare(test_fn, {'directives': directives})
+    node = directives_converter.transform(node, ctx)
+
+    def_, = anno.getanno(node.body[0].targets[0],
+                         anno.Static.DEFINITIONS)
+    d = def_.directives[directives.set_element_type]
+    self.assertEqual(d['dtype'].s, 'a')
+    self.assertEqual(d['shape'].id, 'string_var')
+
+  def test_argument_target(self):
+
+    def test_fn(a):
+      directives.set_element_type(a, 1, shape=2)
+
+    node, ctx = self.prepare(test_fn, {'directives': directives})
+    node = directives_converter.transform(node, ctx)
+
+    def_, = anno.getanno(node.args.args[0], anno.Static.DEFINITIONS)
+    d = def_.directives[directives.set_element_type]
+    self.assertEqual(d['dtype'].n, 1)
+    self.assertEqual(d['shape'].n, 2)
+
+  def test_loop_target(self):
+
+    def test_fn():
+      a = True
+      while True:
+        directives.set_loop_options(parallel_iterations=10, back_prop=a)
+
+    node, ctx = self.prepare(test_fn, {'directives': directives})
+    node = directives_converter.transform(node, ctx)
+
+    d = anno.getanno(node.body[1], AgAnno.DIRECTIVES)
+    d = d[directives.set_loop_options]
+    self.assertEqual(d['parallel_iterations'].n, 10)
+    self.assertEqual(d['back_prop'].id, 'a')
+    self.assertNotIn('swap_memory', d)
+
+  def test_invalid_default(self):
+
+    def invalid_directive(valid_arg, invalid_default=object()):
+      del valid_arg
+      del invalid_default
+      return
+
+    def call_invalid_directive():
+      invalid_directive(1)
+
+    node, _ = parser.parse_entity(call_invalid_directive)
+    # Find the call to the invalid directive
+    node = node.body[0].body[0].value
+    with self.assertRaisesRegexp(ValueError, 'Unexpected keyword.*'):
+      directives_converter._map_args(node, invalid_directive)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/converters/error_handlers.py b/tensorflow/contrib/autograph/converters/error_handlers.py
new file mode 100644
index 0000000000..1936821394
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/error_handlers.py
@@ -0,0 +1,53 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Wraps function bodies with a try/except to rewrite error tracebacks.
+
+Only adds try/except wrappers to functions that have the anno.Basic.ORIGIN
+annotation because these are the functions originally written by the user.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.core import converter
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import templates
+
+
+class ErrorRewritingTransformer(converter.Base):
+  """Possibly wraps the body of a function in a try/except.
+
+  Only wraps functions that were originally defined by the user, detected by
+  checking for the anno.Basic.ORIGIN annotation.
+  """
+
+  def visit_FunctionDef(self, node):
+    node = self.generic_visit(node)
+
+    if (anno.hasanno(node, anno.Basic.ORIGIN) and
+        len(self.enclosing_entities) <= 1):
+      template = """
+        try:
+          body
+        except:
+          ag__.rewrite_graph_construction_error(ag_source_map__)
+      """
+      node.body = templates.replace(template, body=node.body)
+    return node
+
+
+def transform(node, ctx):
+  return ErrorRewritingTransformer(ctx).visit(node)
diff --git a/tensorflow/contrib/autograph/converters/error_handlers_test.py b/tensorflow/contrib/autograph/converters/error_handlers_test.py
new file mode 100644
index 0000000000..cd74e5f18f
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/error_handlers_test.py
@@ -0,0 +1,57 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for error_handlers module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.converters import error_handlers
+from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.contrib.autograph.core import errors
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import origin_info
+from tensorflow.python.platform import test
+
+
+class ErrorHandlersTest(converter_testing.TestCase):
+
+  def test_basic(self):
+
+    def test_fn():
+      raise ValueError()
+
+    node, ctx = self.prepare(test_fn, {})
+    anno.setanno(node, anno.Basic.ORIGIN,
+                 origin_info.OriginInfo(None, None, None))
+    node = error_handlers.transform(node, ctx)
+    with self.compiled(node, {}) as result:
+      with self.assertRaises(errors.GraphConstructionError):
+        # Here we just assert that the handler works. Its correctness is
+        # verified by errors_test.py.
+        result.test_fn()
+
+  def test_no_origin_annotation(self):
+
+    def test_fn():
+      raise ValueError()
+
+    with self.converted(test_fn, error_handlers, {}) as result:
+      with self.assertRaises(ValueError):
+        result.test_fn()
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/converters/ifexp.py b/tensorflow/contrib/autograph/converters/ifexp.py
deleted file mode 100644
index e996138498..0000000000
--- a/tensorflow/contrib/autograph/converters/ifexp.py
+++ /dev/null
@@ -1,49 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Canonicalizes the ternary conditional operator."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.autograph.core import converter
-from tensorflow.contrib.autograph.pyct import templates
-
-
-class IfExp(converter.Base):
-  """Canonicalizes all IfExp nodes into plain conditionals."""
-
-  def visit_IfExp(self, node):
-    template = """
-        ag__.utils.run_cond(test, lambda: (body,), lambda: (orelse,))
-    """
-    desugared_ifexp = templates.replace_as_expression(
-        template, test=node.test, body=node.body, orelse=node.orelse)
-    return desugared_ifexp
-
-
-def transform(node, ctx):
-  """Desugar IfExp nodes into plain conditionals.
-
-  Args:
-     node: ast.AST, the node to transform
-     ctx: converter.EntityContext
-
-  Returns:
-     new_node: an AST with no IfExp nodes, only conditionals.
-  """
-
-  node = IfExp(ctx).visit(node)
-  return node
diff --git a/tensorflow/contrib/autograph/converters/ifexp_test.py b/tensorflow/contrib/autograph/converters/ifexp_test.py
deleted file mode 100644
index cdd5a2f591..0000000000
--- a/tensorflow/contrib/autograph/converters/ifexp_test.py
+++ /dev/null
@@ -1,106 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for ifexp module."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.autograph import utils
-from tensorflow.contrib.autograph.converters import ifexp
-from tensorflow.contrib.autograph.core import converter_testing
-from tensorflow.python.platform import test
-
-
-class IfExpTest(converter_testing.TestCase):
-
-  def compiled_fn(self, test_fn, *args):
-    node = self.parse_and_analyze(test_fn, {})
-    node = ifexp.transform(node, self.ctx)
-    module = self.compiled(node, *args)
-    return module
-
-  def test_simple(self):
-
-    def test_fn(x):
-      return 1 if x else 0
-
-    with self.compiled_fn(test_fn) as result:
-      result.autograph_util = utils
-      for x in [0, 1]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_fn(self):
-
-    def f(x):
-      return 3 * x
-
-    def test_fn(x):
-      y = f(x * x if x > 0 else x)
-      return y
-
-    with self.compiled_fn(test_fn) as result:
-      result.autograph_util = utils
-      result.f = f
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_exp(self):
-
-    def test_fn(x):
-      return x * x if x > 0 else x
-
-    with self.compiled_fn(test_fn) as result:
-      result.autograph_util = utils
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_nested(self):
-
-    def test_fn(x):
-      return x * x if x > 0 else x if x else 1
-
-    with self.compiled_fn(test_fn) as result:
-      result.autograph_util = utils
-      for x in [-2, 0, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_in_cond(self):
-
-    def test_fn(x):
-      if x > 0:
-        return x * x if x < 5 else x * x * x
-      return -x
-
-    with self.compiled_fn(test_fn) as result:
-      result.autograph_util = utils
-      for x in [-2, 2, 5]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_assign_in_cond(self):
-
-    def test_fn(x):
-      if x > 0:
-        x = -x if x < 5 else x
-      return x
-
-    with self.compiled_fn(test_fn) as result:
-      result.autograph_util = utils
-      for x in [-2, 2, 5]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-
-if __name__ == '__main__':
-  test.main()
diff --git a/tensorflow/contrib/autograph/converters/list_comprehension.py b/tensorflow/contrib/autograph/converters/list_comprehension.py
deleted file mode 100644
index c4a13ee822..0000000000
--- a/tensorflow/contrib/autograph/converters/list_comprehension.py
+++ /dev/null
@@ -1,77 +0,0 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Canonicalizing list comprehensions into for and if statements.
-
-e.g.
-result = [x * x for x in xs]
-
-becomes
-
-result = []
-for x in xs:
-  elt = x * x
-  result.append(elt)
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import gast
-
-from tensorflow.contrib.autograph.core import converter
-from tensorflow.contrib.autograph.pyct import parser
-from tensorflow.contrib.autograph.pyct import templates
-
-
-class ListCompCanonicalizationTransformer(converter.Base):
-  """NodeTransformer to canonicalize list comprehensions."""
-
-  def make_update_list_node(self, list_, elt):
-    return templates.replace('list_.append(elt)', list_=list_, elt=elt)[0]
-
-  def instantiate_list_node(self):
-    return parser.parse_str('[]').body[0].value
-
-  def visit_Assign(self, node):
-    if not isinstance(node.value, gast.ListComp):
-      return node
-    if len(node.targets) > 1:
-      raise ValueError('Only support single assignment.')
-    return self.canonicalize_listcomp(node.targets[0], node.value)
-
-  def canonicalize_listcomp(self, result_node, list_comp_node):
-
-    make_list = templates.replace(
-        'list_ = create_list',
-        list_=result_node,
-        create_list=self.instantiate_list_node())
-    loop_body = self.make_update_list_node(result_node, list_comp_node.elt)
-
-    for gen in reversed(list_comp_node.generators):
-      for gen_if in reversed(gen.ifs):
-        loop_body = templates.replace(
-            'if test: loop_body', test=gen_if, loop_body=loop_body)
-      loop_body = templates.replace(
-          'for target in iter_: loop_body',
-          iter_=gen.iter,
-          target=gen.target,
-          loop_body=loop_body)
-
-    return make_list + loop_body
-
-
-def transform(node, ctx):
-  return ListCompCanonicalizationTransformer(ctx).visit(node)
diff --git a/tensorflow/contrib/autograph/converters/list_comprehensions.py b/tensorflow/contrib/autograph/converters/list_comprehensions.py
new file mode 100644
index 0000000000..ecf4628816
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/list_comprehensions.py
@@ -0,0 +1,82 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Lowers list comprehensions into for and if statements.
+
+Example:
+
+  result = [x * x for x in xs]
+
+becomes
+
+  result = []
+  for x in xs:
+    elt = x * x
+    result.append(elt)
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.autograph.core import converter
+from tensorflow.contrib.autograph.pyct import templates
+
+
+# TODO(mdan): This should covert directly to operator calls.
+
+
+class ListCompTransformer(converter.Base):
+  """Lowers list comprehensions into standard control flow."""
+
+  def visit_Assign(self, node):
+    if not isinstance(node.value, gast.ListComp):
+      return self.generic_visit(node)
+    if len(node.targets) > 1:
+      raise NotImplementedError('multiple assignments')
+
+    target, = node.targets
+    list_comp_node = node.value
+
+    template = """
+      target = []
+    """
+    initialization = templates.replace(template, target=target)
+
+    template = """
+      target.append(elt)
+    """
+    body = templates.replace(template, target=target, elt=list_comp_node.elt)
+
+    for gen in reversed(list_comp_node.generators):
+      for gen_if in reversed(gen.ifs):
+        template = """
+          if test:
+            body
+        """
+        body = templates.replace(template, test=gen_if, body=body)
+      template = """
+        for target in iter_:
+          body
+      """
+      body = templates.replace(
+          template, iter_=gen.iter, target=gen.target, body=body)
+
+    return initialization + body
+
+
+def transform(node, ctx):
+  return ListCompTransformer(ctx).visit(node)
diff --git a/tensorflow/contrib/autograph/converters/list_comprehension_test.py b/tensorflow/contrib/autograph/converters/list_comprehensions_test.py
index 2bbee93412..59b5ce9ca0 100644
--- a/tensorflow/contrib/autograph/converters/list_comprehension_test.py
+++ b/tensorflow/contrib/autograph/converters/list_comprehensions_test.py
@@ -12,33 +12,31 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for list_comprehension module."""
+"""Tests for list_comprehensions module."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.autograph.converters import list_comprehension
+from tensorflow.contrib.autograph.converters import list_comprehensions
 from tensorflow.contrib.autograph.core import converter_testing
 from tensorflow.python.platform import test
 
 
 class ListCompTest(converter_testing.TestCase):
 
+  def assertTransformedEquivalent(self, test_fn, *inputs):
+    with self.converted(test_fn, list_comprehensions, {}) as result:
+      self.assertEqual(test_fn(*inputs), result.test_fn(*inputs))
+
   def test_basic(self):
 
     def test_fn(l):
       s = [e * e for e in l]
       return s
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = list_comprehension.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      l = [1, 2, 3]
-      self.assertEqual(test_fn(l), result.test_fn(l))
-      l = []
-      self.assertEqual(test_fn(l), result.test_fn(l))
+    self.assertTransformedEquivalent(test_fn, [])
+    self.assertTransformedEquivalent(test_fn, [1, 2, 3])
 
   def test_multiple_generators(self):
 
@@ -46,29 +44,17 @@ class ListCompTest(converter_testing.TestCase):
       s = [e * e for sublist in l for e in sublist]
       return s
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = list_comprehension.transform(node, self.ctx)
+    self.assertTransformedEquivalent(test_fn, [])
+    self.assertTransformedEquivalent(test_fn, [[1], [2], [3]])
 
-    with self.compiled(node) as result:
-      l = [[1], [2], [3]]
-      self.assertEqual(test_fn(l), result.test_fn(l))
-      l = []
-      self.assertEqual(test_fn(l), result.test_fn(l))
-
-  def test_conds(self):
+  def test_cond(self):
 
     def test_fn(l):
       s = [e * e for e in l if e > 1]
       return s
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = list_comprehension.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      l = [1, 2, 3]
-      self.assertEqual(test_fn(l), result.test_fn(l))
-      l = []
-      self.assertEqual(test_fn(l), result.test_fn(l))
+    self.assertTransformedEquivalent(test_fn, [])
+    self.assertTransformedEquivalent(test_fn, [1, 2, 3])
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/lists.py b/tensorflow/contrib/autograph/converters/lists.py
index d77a044798..a02fc827b8 100644
--- a/tensorflow/contrib/autograph/converters/lists.py
+++ b/tensorflow/contrib/autograph/converters/lists.py
@@ -33,6 +33,7 @@ from __future__ import print_function
 import gast
 
 from tensorflow.contrib.autograph.core import converter
+from tensorflow.contrib.autograph.lang import directives
 from tensorflow.contrib.autograph.pyct import anno
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import templates
@@ -88,12 +89,12 @@ class ListTransformer(converter.Base):
     scope = anno.getanno(node, NodeAnno.ARGS_SCOPE)
     target_node = node.func.value
 
-    # Attempt to use a related name if can get one. Otherwise use something
+    # Attempt to use a related name if one exists. Otherwise use something
     # generic.
     if anno.hasanno(target_node, anno.Basic.QN):
       target_name = anno.getanno(target_node, anno.Basic.QN).ssf()
     else:
-      target_name = 'list'
+      target_name = 'list_'
     pop_var_name = self.ctx.namer.new_symbol(target_name, scope.referenced)
 
     pop_uses = self.get_local(POP_USES, [])
@@ -104,9 +105,10 @@ class ListTransformer(converter.Base):
 
   def _replace_stack_call(self, node):
     assert len(node.args) == 1
-    dtype = anno.getanno(
+    dtype = self.get_definition_directive(
         node.args[0],
-        'element_type',
+        directives.set_element_type,
+        'dtype',
         default=templates.replace_as_expression('None'))
     template = """
       ag__.list_stack(
@@ -134,7 +136,10 @@ class ListTransformer(converter.Base):
         node = self._replace_append_call(node)
       elif func_name == 'pop' and (len(node.args) <= 1):
         node = self._replace_pop_call(node)
-      elif func_name == 'stack' and (len(node.args) == 1):
+      elif (func_name == 'stack' and (len(node.args) == 1) and
+            (not node.keywords or node.keywords[0].arg == 'strict')):
+        # This avoids false positives with keyword args.
+        # TODO(mdan): handle kwargs properly.
         node = self._replace_stack_call(node)
 
     return node
@@ -146,15 +151,22 @@ class ListTransformer(converter.Base):
       pop_element = original_call_node.args[0]
     else:
       pop_element = parser.parse_expression('None')
+
     # The call will be something like "target.pop()", and the dtype is hooked to
     # target, hence the func.value.
-    dtype = anno.getanno(
+    # TODO(mdan): For lists of lists, this won't work.
+    # The reason why it won't work is because it's unclear how to annotate
+    # the list as a "list of lists with a certain element type" when using
+    # operations like `l.pop().pop()`.
+    dtype = self.get_definition_directive(
         original_call_node.func.value,
-        'element_type',
+        directives.set_element_type,
+        'dtype',
         default=templates.replace_as_expression('None'))
-    shape = anno.getanno(
+    shape = self.get_definition_directive(
         original_call_node.func.value,
-        'element_shape',
+        directives.set_element_type,
+        'shape',
         default=templates.replace_as_expression('None'))
 
     template = """
diff --git a/tensorflow/contrib/autograph/converters/lists_test.py b/tensorflow/contrib/autograph/converters/lists_test.py
index ea04097b28..996e99ee61 100644
--- a/tensorflow/contrib/autograph/converters/lists_test.py
+++ b/tensorflow/contrib/autograph/converters/lists_test.py
@@ -18,9 +18,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.autograph import utils
 from tensorflow.contrib.autograph.converters import lists
 from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.contrib.autograph.lang import directives
+from tensorflow.contrib.autograph.lang import special_functions
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
@@ -28,6 +31,9 @@ from tensorflow.python.ops import list_ops
 from tensorflow.python.platform import test
 
 
+tf = None  # Will be replaced by a mock.
+
+
 class ListTest(converter_testing.TestCase):
 
   def test_empty_list(self):
@@ -35,10 +41,7 @@ class ListTest(converter_testing.TestCase):
     def test_fn():
       return []
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = lists.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
+    with self.converted(test_fn, lists, {}) as result:
       tl = result.test_fn()
       # Empty tensor lists cannot be evaluated or stacked.
       self.assertTrue(isinstance(tl, ops.Tensor))
@@ -49,27 +52,19 @@ class ListTest(converter_testing.TestCase):
     def test_fn():
       return [1, 2, 3]
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = lists.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      with self.test_session() as sess:
-        tl = result.test_fn()
-        r = list_ops.tensor_list_stack(tl, dtypes.int32)
-        self.assertAllEqual(sess.run(r), [1, 2, 3])
+    with self.converted(test_fn, lists, {}) as result:
+      self.assertAllEqual(result.test_fn(), [1, 2, 3])
 
   def test_list_append(self):
 
     def test_fn():
-      l = [1]
+      l = special_functions.tensor_list([1])
       l.append(2)
       l.append(3)
       return l
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = lists.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
+    ns = {'special_functions': special_functions}
+    with self.converted(test_fn, lists, ns) as result:
       with self.test_session() as sess:
         tl = result.test_fn()
         r = list_ops.tensor_list_stack(tl, dtypes.int32)
@@ -78,24 +73,21 @@ class ListTest(converter_testing.TestCase):
   def test_list_pop(self):
 
     def test_fn():
-      l = [1, 2, 3]
-      utils.set_element_type(l, dtypes.int32, ())
+      l = special_functions.tensor_list([1, 2, 3])
       s = l.pop()
       return s, l
 
-    node = self.parse_and_analyze(
-        test_fn,
-        {
-            'utils': utils,
-            'dtypes': dtypes
-        },
-        include_type_analysis=True,
-    )
-    node = lists.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
-      result.utils = utils
-      result.dtypes = dtypes
+    ns = {'special_functions': special_functions}
+    node, ctx = self.prepare(test_fn, ns)
+    def_, = anno.getanno(node.body[0].targets[0],
+                         anno.Static.ORIG_DEFINITIONS)
+    def_.directives[directives.set_element_type] = {
+        'dtype': parser.parse_expression('tf.int32'),
+        'shape': parser.parse_expression('()'),
+    }
+    node = lists.transform(node, ctx)
+
+    with self.compiled(node, ns, dtypes.int32) as result:
       with self.test_session() as sess:
         ts, tl = result.test_fn()
         r = list_ops.tensor_list_stack(tl, dtypes.int32)
@@ -108,10 +100,7 @@ class ListTest(converter_testing.TestCase):
       s = l.pop().pop()
       return s
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = lists.transform(node, self.ctx)
-
-    with self.compiled(node) as result:
+    with self.converted(test_fn, lists, {}) as result:
       test_input = [1, 2, [1, 2, 3]]
       # TODO(mdan): Pass a list of lists of tensor when we fully support that.
       # For now, we just pass a regular Python list of lists just to verify that
@@ -120,29 +109,24 @@ class ListTest(converter_testing.TestCase):
 
   def test_list_stack(self):
 
-    tf = None  # Will be replaced with a mock.
-
     def test_fn():
       l = [1, 2, 3]
-      utils.set_element_type(l, dtypes.int32)
       return tf.stack(l)
 
-    node = self.parse_and_analyze(
-        test_fn,
-        {
-            'utils': utils,
-            'dtypes': dtypes
-        },
-        include_type_analysis=True,
-    )
-    node = lists.transform(node, self.ctx)
-
-    with self.compiled(node, array_ops.stack, dtypes.int32) as result:
-      result.utils = utils
-      result.dtypes = dtypes
+    node, ctx = self.prepare(test_fn, {})
+    def_, = anno.getanno(node.body[0].targets[0],
+                         anno.Static.ORIG_DEFINITIONS)
+    def_.directives[directives.set_element_type] = {
+        'dtype': parser.parse_expression('tf.int32')
+    }
+    node = lists.transform(node, ctx)
+
+    with self.compiled(node, {}, array_ops.stack, dtypes.int32) as result:
       with self.test_session() as sess:
         self.assertAllEqual(sess.run(result.test_fn()), [1, 2, 3])
 
+  # TODO(mdan): Add a test with tf.stack with axis kwarg.
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/converters/logical_expressions_test.py b/tensorflow/contrib/autograph/converters/logical_expressions_test.py
index 48186024a9..ca07de5e8a 100644
--- a/tensorflow/contrib/autograph/converters/logical_expressions_test.py
+++ b/tensorflow/contrib/autograph/converters/logical_expressions_test.py
@@ -31,10 +31,8 @@ class GradientsFunctionTest(converter_testing.TestCase):
     def test_fn(a, b):
       return a == b
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = logical_expressions.transform(node, self.ctx)
-
-    with self.compiled(node, math_ops.equal) as result:
+    with self.converted(test_fn, logical_expressions, {},
+                        math_ops.equal) as result:
       with self.test_session() as sess:
         self.assertTrue(sess.run(result.test_fn(1, 1)))
         self.assertFalse(sess.run(result.test_fn(1, 2)))
@@ -44,11 +42,8 @@ class GradientsFunctionTest(converter_testing.TestCase):
     def test_fn(a, b, c):
       return (a or b) and (a or b or c)
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = logical_expressions.transform(node, self.ctx)
-
-    with self.compiled(node, math_ops.logical_or,
-                       math_ops.logical_and) as result:
+    with self.converted(test_fn, logical_expressions, {}, math_ops.logical_or,
+                        math_ops.logical_and) as result:
       with self.test_session() as sess:
         self.assertTrue(sess.run(result.test_fn(True, False, True)))
 
diff --git a/tensorflow/contrib/autograph/converters/name_scopes_test.py b/tensorflow/contrib/autograph/converters/name_scopes_test.py
index 444d0bcd46..a329b0db70 100644
--- a/tensorflow/contrib/autograph/converters/name_scopes_test.py
+++ b/tensorflow/contrib/autograph/converters/name_scopes_test.py
@@ -31,17 +31,13 @@ class FunctionNameScopeTransformer(converter_testing.TestCase):
 
     def test_fn(l):
       """This should stay here."""
-      a = 5
+      a = 1
       l += a
       return l
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = name_scopes.transform(node, self.ctx)
-
-    with self.compiled(node, ops.name_scope) as result:
+    with self.converted(test_fn, name_scopes, {}, ops.name_scope) as result:
       result_op = result.test_fn(constant_op.constant(1))
       self.assertIn('test_fn/', result_op.op.name)
-
       self.assertEqual('This should stay here.', result.test_fn.__doc__)
 
   def test_long_docstring(self):
@@ -54,13 +50,12 @@ class FunctionNameScopeTransformer(converter_testing.TestCase):
       Returns:
         l
       """
-      return l
-
-    node = self.parse_and_analyze(test_fn, {})
-    node = name_scopes.transform(node, self.ctx)
+      return l + 1
 
-    with self.compiled(node, ops.name_scope) as result:
-      self.assertIn('Multi-line', result.test_fn.__doc__)
+    with self.converted(test_fn, name_scopes, {}, ops.name_scope) as result:
+      result_op = result.test_fn(constant_op.constant(1))
+      self.assertIn('test_fn/', result_op.op.name)
+      self.assertIn('Multi-line docstring.', result.test_fn.__doc__)
       self.assertIn('Returns:', result.test_fn.__doc__)
 
   def test_nested_functions(self):
@@ -68,21 +63,16 @@ class FunctionNameScopeTransformer(converter_testing.TestCase):
     def test_fn(l):
 
       def inner_fn(i):
-        return i ** 2
-
-      l += 4
-      return inner_fn(l)
+        return i + 1
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = name_scopes.transform(node, self.ctx)
+      l += 1
+      return l, inner_fn(l)
 
-    with self.compiled(node, ops.name_scope) as result:
-      result_op = result.test_fn(constant_op.constant(1))
-      first_result_input_name = result_op.op.inputs[0].name
-      second_result_input_name = result_op.op.inputs[1].name
-      self.assertIn('test_fn/', first_result_input_name)
-      self.assertNotIn('inner_fn', first_result_input_name)
-      self.assertIn('test_fn/inner_fn/', second_result_input_name)
+    with self.converted(test_fn, name_scopes, {}, ops.name_scope) as result:
+      first, second = result.test_fn(constant_op.constant(1))
+      self.assertIn('test_fn/', first.op.name)
+      self.assertNotIn('inner_fn', first.op.name)
+      self.assertIn('test_fn/inner_fn/', second.op.name)
 
   def test_method(self):
 
@@ -91,48 +81,20 @@ class FunctionNameScopeTransformer(converter_testing.TestCase):
       def test_fn(self, l):
 
         def inner_fn(i):
-          return i ** 2
-
-        l += 4
-        return inner_fn(l)
+          return i + 1
 
-    # Note that 'TestClass' was needed in the namespace here.
-    node = self.parse_and_analyze(
-        TestClass, {'TestClass': TestClass}, owner_type=TestClass)
-    node = name_scopes.transform(node, self.ctx)
+        l += 1
+        return l, inner_fn(l)
 
-    with self.compiled(node, ops.name_scope) as result:
-      result_op = result.TestClass().test_fn(constant_op.constant(1))
-      first_result_input_name = result_op.op.inputs[0].name
-      second_result_input_name = result_op.op.inputs[1].name
-      self.assertIn('TestClass/test_fn/', first_result_input_name)
-      self.assertNotIn('inner_fn', first_result_input_name)
-      self.assertIn('TestClass/test_fn/inner_fn/', second_result_input_name)
+    ns = {'TestClass': TestClass}
+    node, ctx = self.prepare(TestClass, ns, owner_type=TestClass)
+    node = name_scopes.transform(node, ctx)
 
-  def test_operator(self):
-
-    class TestClass(object):
-
-      def __call__(self, l):
-
-        def inner_fn(i):
-          return i ** 2
-
-        l += 4
-        return inner_fn(l)
-
-    # Note that 'TestClass' was needed in the namespace here.
-    node = self.parse_and_analyze(
-        TestClass.__call__, {'TestClass': TestClass}, owner_type=TestClass)
-    node = name_scopes.transform(node, self.ctx)
-
-    with self.compiled(node, ops.name_scope) as result:
-      result_op = result.__call__(TestClass(), constant_op.constant(1))
-      first_result_input_name = result_op.op.inputs[0].name
-      second_result_input_name = result_op.op.inputs[1].name
-      self.assertIn('call__/', first_result_input_name)
-      self.assertNotIn('inner_fn', first_result_input_name)
-      self.assertIn('call__/inner_fn/', second_result_input_name)
+    with self.compiled(node, {}, ops.name_scope) as result:
+      first, second = result.TestClass().test_fn(constant_op.constant(1))
+      self.assertIn('TestClass/test_fn/', first.op.name)
+      self.assertNotIn('inner_fn', first.op.name)
+      self.assertIn('TestClass/test_fn/inner_fn/', second.op.name)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/single_return.py b/tensorflow/contrib/autograph/converters/return_statements.py
index a351cd81b8..a351cd81b8 100644
--- a/tensorflow/contrib/autograph/converters/single_return.py
+++ b/tensorflow/contrib/autograph/converters/return_statements.py
diff --git a/tensorflow/contrib/autograph/converters/return_statements_test.py b/tensorflow/contrib/autograph/converters/return_statements_test.py
new file mode 100644
index 0000000000..3c7c8c8a25
--- /dev/null
+++ b/tensorflow/contrib/autograph/converters/return_statements_test.py
@@ -0,0 +1,167 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for return_statements module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.converters import return_statements
+from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.python.framework import ops
+from tensorflow.python.platform import test
+
+
+class SingleReturnTest(converter_testing.TestCase):
+
+  def assertTransformedEquivalent(self, test_fn, *inputs):
+    ns = {'ops': ops}
+    with self.converted(test_fn, return_statements, ns) as result:
+      self.assertEqual(test_fn(*inputs), result.test_fn(*inputs))
+
+  def test_straightline(self):
+
+    def test_fn(x):
+      return x * x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+
+  def test_conditional(self):
+
+    def test_fn(x):
+      if x > 0:
+        return x
+      else:
+        return x * x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+
+  def test_missing_orelse(self):
+
+    def test_fn(x):
+      if x > 0:
+        return x
+
+    node, ctx = self.prepare(test_fn, {})
+    with self.assertRaises(ValueError):
+      return_statements.transform(node, ctx)
+
+  def test_missing_orelse_recovrable(self):
+
+    def test_fn(x):
+      if x > 0:
+        return x
+      return x * x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+
+  def test_missing_branch_return_recoverable(self):
+
+    def test_fn(x):
+      if x < 0:
+        x *= x
+      else:
+        return x
+      return x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+
+  def test_conditional_nested(self):
+
+    def test_fn(x):
+      if x > 0:
+        if x < 5:
+          return x
+        else:
+          return x * x
+      else:
+        return x * x * x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+    self.assertTransformedEquivalent(test_fn, 5)
+
+  def test_context_manager(self):
+
+    def test_fn(x):
+      with ops.name_scope(''):
+        return x * x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+
+  def test_context_manager_in_conditional(self):
+
+    def test_fn(x):
+      if x > 0:
+        with ops.name_scope(''):
+          return x * x
+      else:
+        return x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+
+  def text_conditional_in_context_manager(self):
+
+    def test_fn(x):
+      with ops.name_scope(''):
+        if x > 0:
+          return x * x
+        else:
+          return x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+
+  def test_no_return(self):
+
+    def test_fn(x):
+      x *= x
+
+    self.assertTransformedEquivalent(test_fn, 2)
+
+  def test_nested_functions(self):
+
+    def test_fn(x):
+
+      def inner_fn(y):
+        if y > 0:
+          return y * y
+        else:
+          return y
+
+      return inner_fn(x)
+
+    self.assertTransformedEquivalent(test_fn, 2)
+    self.assertTransformedEquivalent(test_fn, -2)
+
+  def test_loop(self):
+
+    def test_fn(x):
+      for _ in range(10):
+        return x
+      return x
+
+    node, ctx = self.prepare(test_fn, {})
+    with self.assertRaises(ValueError):
+      return_statements.transform(node, ctx)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/converters/side_effect_guards_test.py b/tensorflow/contrib/autograph/converters/side_effect_guards_test.py
index a7ad8efed4..bee512abbc 100644
--- a/tensorflow/contrib/autograph/converters/side_effect_guards_test.py
+++ b/tensorflow/contrib/autograph/converters/side_effect_guards_test.py
@@ -25,140 +25,138 @@ from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import state_ops
-from tensorflow.python.ops import variables
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import test
 
 
+tf = None  # Will be replaced by a mock.
+
+
 class SideEffectGuardsTest(converter_testing.TestCase):
 
   def test_side_effect_on_return_only_variable(self):
 
-    tf = None
-
     def test_fn(a):
       tf.assign(a, a + 1)
       return a
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = side_effect_guards.transform(node, self.ctx)
+    node, ctx = self.prepare(test_fn, {})
+    node = side_effect_guards.transform(node, ctx)
 
-    with self.compiled(node, state_ops.assign) as result:
-      self.assertEqual(len(node.body[0].body), 1)
+    self.assertEqual(len(node.body), 1)
+
+    with self.compiled(node, {}, state_ops.assign) as result:
       with self.test_session() as sess:
-        v = variables.Variable(2)
+        v = variable_scope.get_variable('test', initializer=2)
         sess.run(v.initializer)
-        # NOTE: We don't expect the assignment to execute in this case, because
-        # variables cannot be reliably guarded.
-        self.assertEqual(2, sess.run(result.test_fn(v)))
+        sess.run(result.test_fn(v))
+        # TODO(mdan): Add support for this use case.
+        # Right now the variable `a` is not conditioned on the `assign` because
+        # there's no way to add control dependencies to a variable object.
+        self.assertEqual(2, sess.run(v))
 
   def test_side_effect_on_used_variable(self):
 
-    tf = None
-
     def test_fn(a):
       tf.assign(a, a + 1)
       return a + 1
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = side_effect_guards.transform(node, self.ctx)
+    node, ctx = self.prepare(test_fn, {})
+    node = side_effect_guards.transform(node, ctx)
 
-    with self.compiled(node, state_ops.assign) as result:
-      self.assertEqual(len(node.body[0].body), 1)
+    self.assertEqual(len(node.body), 1)
+
+    with self.compiled(node, {}, state_ops.assign) as result:
       with self.test_session() as sess:
-        v = variables.Variable(2)
+        v = variable_scope.get_variable('test', initializer=2)
         sess.run(v.initializer)
-        # NOTE: Unlike test_side_effect_on_return_only_variable, the variable
-        # was used in the local scope and so we could catch the assign's side
-        # effect.
-        self.assertEqual(4, sess.run(result.test_fn(v)))
+        sess.run(result.test_fn(v))
+        # TODO(mdan): Ensure the result of test_fn(v) is also deterministic.
+        # Right now it's 3 or 4 based on whether the read is synchronized.
+        self.assertEqual(3, sess.run(v))
 
   def test_side_effect_on_tensor(self):
 
-    tf = None
-
     def test_fn(a):
       tf.Assert(a > 0, ['expected in throw'])
       return a
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = side_effect_guards.transform(node, self.ctx)
+    node, ctx = self.prepare(test_fn, {})
+    node = side_effect_guards.transform(node, ctx)
 
-    with self.compiled(node, control_flow_ops.Assert) as result:
-      self.assertEqual(len(node.body[0].body), 1)
+    self.assertEqual(len(node.body), 1)
+
+    with self.compiled(node, {}, control_flow_ops.Assert) as result:
       with self.test_session() as sess:
-        # NOTE: In this case we can also capture the side effect because the
-        # argument is a tensor ans we can wrap it inside an identity.
         with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
                                      'expected in throw'):
           sess.run(result.test_fn(constant_op.constant(-1)))
 
   def test_multiline_block(self):
 
-    tf = None
-
     def test_fn(a):
-      tf.assign(a, a + 1)
+      tf.assign_add(a, 1)
       b = a + 1
-      tf.assign(a, b + 1)
-      c = b + 1
-      d = c + 1
-      return d
+      tf.assign_add(a, 1)
+      b += 1
+      return b
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = side_effect_guards.transform(node, self.ctx)
+    node, ctx = self.prepare(test_fn, {})
+    node = side_effect_guards.transform(node, ctx)
 
-    with self.compiled(node, state_ops.assign) as result:
-      self.assertEqual(len(node.body[0].body), 1)
+    self.assertEqual(len(node.body), 1)
+
+    with self.compiled(node, {}, state_ops.assign_add) as result:
       with self.test_session() as sess:
-        v = variables.Variable(2)
+        v = variable_scope.get_variable('test', initializer=2)
         sess.run(v.initializer)
-        self.assertEqual(6, sess.run(result.test_fn(v)))
+        sess.run(result.test_fn(v))
+        # TODO(mdan): Ensure the result of test_fn(v) is also deterministic.
+        self.assertEqual(4, sess.run(v))
 
   def test_multiline_nested_block(self):
 
-    tf = None
-
     def test_fn(a):
       with tf.name_scope('foo'):
         tf.assign(a, a + 1)
         b = a + 1
-        c = b + 1
-        d = c + 1
-      return d
+      return b
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = side_effect_guards.transform(node, self.ctx)
+    node, ctx = self.prepare(test_fn, {})
+    node = side_effect_guards.transform(node, ctx)
 
-    with self.compiled(node, state_ops.assign, ops.name_scope) as result:
-      self.assertEqual(len(node.body[0].body[0].body), 1)
+    self.assertEqual(len(node.body[0].body), 1)
+
+    with self.compiled(node, {}, state_ops.assign, ops.name_scope) as result:
       with self.test_session() as sess:
-        v = variables.Variable(2)
+        v = variable_scope.get_variable('test', initializer=2)
         sess.run(v.initializer)
-        self.assertEqual(6, sess.run(result.test_fn(v)))
+        sess.run(result.test_fn(v))
+        # TODO(mdan): Ensure the result of test_fn(v) is also deterministic.
+        self.assertEqual(3, sess.run(v))
 
   def test_multiline_block_unsafe(self):
 
-    tf = None
-
     def test_fn(a):
       tf.assign(a, a + 1)
       b = a + 1
-      tf.assign(a, a + 1)
+      tf.assign_add(a, 1)
       c = b + 1
-      d = c + 1
-      return d
+      return c
+
+    node, ctx = self.prepare(test_fn, {})
+    node = side_effect_guards.transform(node, ctx)
 
-    node = self.parse_and_analyze(test_fn, {})
-    node = side_effect_guards.transform(node, self.ctx)
+    self.assertEqual(len(node.body), 1)
 
-    with self.compiled(node, state_ops.assign) as result:
-      self.assertEqual(len(node.body[0].body), 1)
+    with self.compiled(node, {}, state_ops.assign,
+                       state_ops.assign_add) as result:
       with self.test_session() as sess:
-        v = variables.Variable(2)
+        v = variable_scope.get_variable('test', initializer=2)
         sess.run(v.initializer)
-        # NOTE: This intentionally highlights the flakiness. The test should be
-        # tightened down once that is solved.
-        self.assertTrue(sess.run(result.test_fn(v)) in (6, 7))
+        sess.run(result.test_fn(v))
+        # TODO(mdan): Ensure the result of test_fn(v) is also deterministic.
+        self.assertEqual(4, sess.run(v))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/converters/single_return_test.py b/tensorflow/contrib/autograph/converters/single_return_test.py
deleted file mode 100644
index 1f0de4310e..0000000000
--- a/tensorflow/contrib/autograph/converters/single_return_test.py
+++ /dev/null
@@ -1,189 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for single_return module."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.autograph.converters import single_return
-from tensorflow.contrib.autograph.core import converter_testing
-from tensorflow.python.framework.ops import name_scope
-from tensorflow.python.platform import test
-
-
-class SingleReturnTest(converter_testing.TestCase):
-
-  def compiled_fn(self, test_fn, *args):
-    node = self.parse_and_analyze(test_fn, {})
-    node = single_return.transform(node, self.ctx)
-    module = self.compiled(node, *args)
-    return module
-
-  def test_noop(self):
-    # Noop
-    def test_fn(x):
-      return x
-
-    with self.compiled_fn(test_fn) as result:
-      self.assertEqual(test_fn(2.0), result.test_fn(2.0))
-
-  def test_return_expression(self):
-    # ANF
-    def test_fn(x):
-      return x * x
-
-    with self.compiled_fn(test_fn) as result:
-      x = 2
-      self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_merge(self):
-    # Simple merge
-    def test_fn(x):
-      if x > 0:
-        return x
-      else:
-        return x * x
-
-    with self.compiled_fn(test_fn) as result:
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_orphan_branch(self):
-
-    def test_fn(x):
-      if x > 0:
-        return x
-
-    with self.assertRaises(ValueError):
-      self.compiled_fn(test_fn)
-
-  def test_lift_body_into_false_branch(self):
-
-    def test_fn(x):
-      if x > 0:
-        return x
-      return x * x
-
-    with self.compiled_fn(test_fn) as result:
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_lift_body_into_true_branch(self):
-
-    def test_fn(x):
-      if x < 0:
-        x *= x
-      else:
-        # TODO(alexbw): linter bug here that requires us suppress this warning.
-        return x  # pylint: disable=undefined-loop-variable
-      return x
-
-    with self.compiled_fn(test_fn) as result:
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_nested_if(self):
-
-    def test_fn(x):
-      if x > 0:
-        if x < 5:
-          return x
-        else:
-          return x * x
-      else:
-        return x * x * x
-
-    with self.compiled_fn(test_fn) as result:
-      for x in [-2, 2, 5]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_context_manager(self):
-
-    def test_fn(x):
-
-      with name_scope(''):
-        return x * x
-
-    with self.compiled_fn(test_fn) as result:
-      result.name_scope = name_scope
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_context_manager_in_conditional(self):
-
-    def test_fn(x):
-      if x > 0:
-        with name_scope(''):
-          return x * x
-      else:
-        return x
-
-    with self.compiled_fn(test_fn, name_scope) as result:
-      result.name_scope = name_scope
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def text_conditional_in_context_manager(self):
-
-    def test_fn(x):
-      with name_scope(''):
-        if x > 0:
-          return x * x
-        else:
-          return x
-
-    with self.compiled_fn(test_fn) as result:
-      result.name_scope = name_scope
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_no_return(self):
-
-    def test_fn(x):
-      x *= x
-
-    with self.compiled_fn(test_fn) as result:
-      self.assertEqual(test_fn(2), result.test_fn(2))
-
-  def test_nested_functiondefs(self):
-
-    def test_fn(x):
-
-      def inner_fn(y):
-        if y > 0:
-          return y * y
-        else:
-          return y
-
-      return inner_fn(x)
-
-    with self.compiled_fn(test_fn) as result:
-      for x in [-2, 2]:
-        self.assertEqual(test_fn(x), result.test_fn(x))
-
-  def test_loop(self):
-
-    def test_fn(x):
-      for _ in range(10):
-        return x
-      return x
-
-    with self.assertRaises(ValueError):
-      self.compiled_fn(test_fn)
-
-
-if __name__ == '__main__':
-  test.main()
diff --git a/tensorflow/contrib/autograph/converters/slices.py b/tensorflow/contrib/autograph/converters/slices.py
index 3f5fc57125..c527f98613 100644
--- a/tensorflow/contrib/autograph/converters/slices.py
+++ b/tensorflow/contrib/autograph/converters/slices.py
@@ -21,7 +21,7 @@ from __future__ import print_function
 import gast
 
 from tensorflow.contrib.autograph.core import converter
-from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.lang import directives
 from tensorflow.contrib.autograph.pyct import templates
 
 
@@ -36,12 +36,14 @@ class SliceTransformer(converter.Base):
   def _process_single_assignment(self, target, value):
     if not isinstance(target, gast.Subscript):
       return None
+    if not isinstance(target.slice, gast.Index):
+      return None
 
     template = """
       target = ag__.set_item(target, key, item)
     """
     return templates.replace(
-        template, target=target.value, key=target.slice, item=value)
+        template, target=target.value, key=target.slice.value, item=value)
 
   def visit_Assign(self, node):
     node = self.generic_visit(node)
@@ -56,17 +58,17 @@ class SliceTransformer(converter.Base):
   def visit_Subscript(self, node):
     node = self.generic_visit(node)
     if not isinstance(node.slice, gast.Index):
-      # TODO(mdan): It might make more sense to wave them through.
-      raise NotImplementedError('non-index slice')
+      return node
 
     if not isinstance(node.ctx, gast.Load):
       # Index writes are handled at a higher level, one at which the rvalue is
       # also available.
       return node
 
-    dtype = anno.getanno(
+    dtype = self.get_definition_directive(
         node.value,
-        'element_type',
+        directives.set_element_type,
+        'dtype',
         default=templates.replace_as_expression('None'))
 
     template = """
@@ -76,7 +78,7 @@ class SliceTransformer(converter.Base):
           opts=ag__.GetItemOpts(element_dtype=dtype))
     """
     return templates.replace_as_expression(
-        template, target=node.value, key=node.slice, dtype=dtype)
+        template, target=node.value, key=node.slice.value, dtype=dtype)
 
 
 def transform(node, ctx):
diff --git a/tensorflow/contrib/autograph/converters/slices_test.py b/tensorflow/contrib/autograph/converters/slices_test.py
index df9a4c8bab..c822d53a4a 100644
--- a/tensorflow/contrib/autograph/converters/slices_test.py
+++ b/tensorflow/contrib/autograph/converters/slices_test.py
@@ -18,9 +18,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.autograph import utils
 from tensorflow.contrib.autograph.converters import slices
 from tensorflow.contrib.autograph.core import converter_testing
+from tensorflow.contrib.autograph.lang import directives
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import parser
+from tensorflow.contrib.autograph.pyct import transformer
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import list_ops
@@ -32,28 +35,42 @@ class SliceTest(converter_testing.TestCase):
   def test_index_access(self):
 
     def test_fn(l):
-      utils.set_element_type(l, dtypes.int32)
       return l[1]
 
-    node = self.parse_and_analyze(
-        test_fn,
-        {
-            'utils': utils,
-            'dtypes': dtypes
-        },
-        include_type_analysis=True,
-    )
-    node = slices.transform(node, self.ctx)
-
-    with self.compiled(node, dtypes.int32) as result:
-      result.utils = utils
-      result.dtypes = dtypes
+    node, ctx = self.prepare(test_fn, {})
+    def_, = anno.getanno(node.args.args[0], anno.Static.DEFINITIONS)
+    def_.directives[directives.set_element_type] = {
+        'dtype': parser.parse_expression('tf.int32')
+    }
+    node = slices.transform(node, ctx)
+
+    with self.compiled(node, {}, dtypes.int32) as result:
       with self.test_session() as sess:
         tl = list_ops.tensor_list_from_tensor(
             [1, 2], element_shape=constant_op.constant([], dtype=dtypes.int32))
         y = result.test_fn(tl)
         self.assertEqual(2, sess.run(y))
 
+  def test_index_access_multiple_definitions(self):
+
+    def test_fn(l):
+      if l:
+        l = []
+      return l[1]
+
+    node, ctx = self.prepare(test_fn, {})
+    def_, = anno.getanno(node.args.args[0], anno.Static.DEFINITIONS)
+    def_.directives[directives.set_element_type] = {
+        'dtype': parser.parse_expression('tf.int32')
+    }
+    def_, = anno.getanno(node.body[0].body[0].targets[0],
+                         anno.Static.DEFINITIONS)
+    def_.directives[directives.set_element_type] = {
+        'dtype': parser.parse_expression('tf.float32')
+    }
+    with self.assertRaises(transformer.AutographParseError):
+      slices.transform(node, ctx)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/core/BUILD b/tensorflow/contrib/autograph/core/BUILD
index 833f9dced8..1873045a92 100644
--- a/tensorflow/contrib/autograph/core/BUILD
+++ b/tensorflow/contrib/autograph/core/BUILD
@@ -19,6 +19,7 @@ py_library(
     srcs = [
         "config.py",
         "converter.py",
+        "errors.py",
         "naming.py",
     ],
     srcs_version = "PY2AND3",
@@ -30,6 +31,31 @@ py_library(
     ],
 )
 
+py_test(
+    name = "errors_test",
+    srcs = ["errors_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":core",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:random_ops",
+    ],
+)
+
+py_test(
+    name = "naming_test",
+    srcs = ["naming_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":core",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
 py_library(
     name = "test_lib",
     srcs = [
@@ -47,13 +73,3 @@ py_library(
         "@six_archive//:six",
     ],
 )
-
-py_test(
-    name = "naming_test",
-    srcs = ["naming_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":core",
-        "//tensorflow/python:client_testlib",
-    ],
-)
diff --git a/tensorflow/contrib/autograph/core/converter.py b/tensorflow/contrib/autograph/core/converter.py
index 54e6aa0f3b..83a80c1f52 100644
--- a/tensorflow/contrib/autograph/core/converter.py
+++ b/tensorflow/contrib/autograph/core/converter.py
@@ -64,15 +64,29 @@ from __future__ import division
 from __future__ import print_function
 
 import collections
+from enum import Enum
+
 
 from tensorflow.contrib.autograph.core import config
 from tensorflow.contrib.autograph.core import naming
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import ast_util
+from tensorflow.contrib.autograph.pyct import cfg
+from tensorflow.contrib.autograph.pyct import compiler
+from tensorflow.contrib.autograph.pyct import qual_names
 from tensorflow.contrib.autograph.pyct import transformer
+from tensorflow.contrib.autograph.pyct.static_analysis import activity
+from tensorflow.contrib.autograph.pyct.static_analysis import live_values
+from tensorflow.contrib.autograph.pyct.static_analysis import liveness
+from tensorflow.contrib.autograph.pyct.static_analysis import reaching_definitions
+from tensorflow.contrib.autograph.pyct.static_analysis import type_info
 
 # TODO(mdan): These contexts can be refactored into first class objects.
 # For example, we could define Program and Entity abstractions that hold on
 # to the actual entity and have conversion methods.
 
+# TODO(mdan): Add a test specific to this converter.
+
 
 class ProgramContext(object):
   """ProgramContext keeps track of converting function hierarchies.
@@ -197,6 +211,46 @@ class Base(transformer.Base):
     self._used = False
     self._ast_depth = 0
 
+  def get_definition_directive(self, node, directive, arg, default):
+    """Returns the unique directive for a symbol, or a default if none exist.
+
+    See lang/directives.py for details on directives.
+
+    Args:
+      node: ast.AST
+      directive: Callable[..., Any]
+      arg: str
+      default: Any
+
+    Raises:
+      ValueError: if conflicting annotations have been found
+    """
+    defs = anno.getanno(node, anno.Static.ORIG_DEFINITIONS, ())
+    if not defs:
+      return default
+
+    # TODO(mdan): Simplify this.
+    arg_values = []
+    for def_ in defs:
+      if (directive not in def_.directives or
+          arg not in def_.directives[directive]):
+        continue
+      arg_value = def_.directives[directive][arg]
+      for prev_value in arg_values:
+        if not ast_util.matches(arg_value, prev_value):
+          qn = anno.getanno(node, anno.Basic.QN)
+          raise ValueError('%s has ambiguous annotations for %s(%s): %s, %s' %
+                           (qn, directive.__name__, arg,
+                            compiler.ast_to_source(arg_value).strip(),
+                            compiler.ast_to_source(prev_value).strip()))
+      arg_values.append(arg_value)
+
+    if not arg_values:
+      return default
+
+    arg_value, = arg_values
+    return arg_value
+
   def visit(self, node):
     if not self._ast_depth:
       if self._used:
@@ -208,3 +262,69 @@ class Base(transformer.Base):
       return super(Base, self).visit(node)
     finally:
       self._ast_depth -= 1
+
+
+class AnnotatedDef(reaching_definitions.Definition):
+
+  def __init__(self):
+    super(AnnotatedDef, self).__init__()
+    self.directives = {}
+
+
+class AgAnno(Enum):
+  """Annotation labels specific to AutoGraph. See anno.py."""
+
+  DIRECTIVES = 'User directives associated with the annotated statement.'
+
+  def __repr__(self):
+    return self.name
+
+
+def standard_analysis(node, context, is_initial=False):
+  """Performs a complete static analysis of the given code.
+
+  Args:
+    node: ast.AST
+    context: converter.EntityContext
+    is_initial: bool, whether this is the initial analysis done on the input
+        source code
+
+  Returns:
+    ast.AST, same as node, with the static analysis annotations added
+  """
+  # TODO(mdan): Clear static analysis here.
+  # TODO(mdan): Consider not running all analyses every time.
+  # TODO(mdan): Don't return a node because it's modified by reference.
+  graphs = cfg.build(node)
+  node = qual_names.resolve(node)
+  node = activity.resolve(node, context.info, None)
+  node = reaching_definitions.resolve(node, context.info, graphs, AnnotatedDef)
+  node = liveness.resolve(node, context.info, graphs)
+  node = live_values.resolve(node, context.info, config.PYTHON_LITERALS)
+  node = type_info.resolve(node, context.info)
+  # This second call allows resolving first-order class attributes.
+  node = live_values.resolve(node, context.info, config.PYTHON_LITERALS)
+  if is_initial:
+    anno.dup(
+        node,
+        {
+            anno.Static.DEFINITIONS: anno.Static.ORIG_DEFINITIONS,
+        },
+    )
+  return node
+
+
+def apply_(node, context, converter_module):
+  """Applies a converter to an AST.
+
+  Args:
+    node: ast.AST
+    context: converter.EntityContext
+    converter_module: converter.Base
+
+  Returns:
+    ast.AST, the result of applying converter to node
+  """
+  node = standard_analysis(node, context)
+  node = converter_module.transform(node, context)
+  return node
diff --git a/tensorflow/contrib/autograph/core/converter_testing.py b/tensorflow/contrib/autograph/core/converter_testing.py
index 0e46aacc12..5ee2c3fffd 100644
--- a/tensorflow/contrib/autograph/core/converter_testing.py
+++ b/tensorflow/contrib/autograph/core/converter_testing.py
@@ -20,19 +20,19 @@ from __future__ import print_function
 
 import contextlib
 import imp
+import sys
+
+import six
 
 from tensorflow.contrib.autograph import operators
 from tensorflow.contrib.autograph import utils
 from tensorflow.contrib.autograph.core import config
 from tensorflow.contrib.autograph.core import converter
+from tensorflow.contrib.autograph.core import errors
 from tensorflow.contrib.autograph.pyct import compiler
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import pretty_printer
-from tensorflow.contrib.autograph.pyct import qual_names
 from tensorflow.contrib.autograph.pyct import transformer
-from tensorflow.contrib.autograph.pyct.static_analysis import activity
-from tensorflow.contrib.autograph.pyct.static_analysis import live_values
-from tensorflow.contrib.autograph.pyct.static_analysis import type_info
 from tensorflow.python.platform import test
 
 
@@ -74,7 +74,17 @@ class TestCase(test.TestCase):
   """Base class for unit tests in this module. Contains relevant utilities."""
 
   @contextlib.contextmanager
-  def compiled(self, node, *symbols):
+  def assertPrints(self, expected_result):
+    try:
+      out_capturer = six.StringIO()
+      sys.stdout = out_capturer
+      yield
+      self.assertEqual(out_capturer.getvalue(), expected_result)
+    finally:
+      sys.stdout = sys.__stdout__
+
+  @contextlib.contextmanager
+  def compiled(self, node, namespace, *symbols):
     source = None
 
     self.dynamic_calls = []
@@ -84,12 +94,17 @@ class TestCase(test.TestCase):
       return 7
 
     try:
-      result, source = compiler.ast_to_object(node)
+      result, source = compiler.ast_to_object(node, include_source_map=True)
+
       result.tf = self.make_fake_mod('fake_tf', *symbols)
       fake_ag = self.make_fake_mod('fake_ag', converted_call)
       fake_ag.__dict__.update(operators.__dict__)
       fake_ag.__dict__['utils'] = utils
+      fake_ag.__dict__['rewrite_graph_construction_error'] = (
+          errors.rewrite_graph_construction_error)
       result.__dict__['ag__'] = fake_ag
+      for k, v in namespace.items():
+        result.__dict__[k] = v
       yield result
     except Exception:  # pylint:disable=broad-except
       if source is None:
@@ -98,6 +113,13 @@ class TestCase(test.TestCase):
         print('Offending compiled code:\n%s' % source)
       raise
 
+  @contextlib.contextmanager
+  def converted(self, entity, converter_module, namespace, *tf_symbols):
+    node, ctx = self.prepare(entity, namespace)
+    node = converter_module.transform(node, ctx)
+    with self.compiled(node, namespace, *tf_symbols) as result:
+      yield result
+
   def make_fake_mod(self, name, *symbols):
     fake_mod = imp.new_module(name)
     for s in symbols:
@@ -114,17 +136,16 @@ class TestCase(test.TestCase):
     for k, v in ns.items():
       setattr(module, k, v)
 
-  def parse_and_analyze(self,
-                        test_fn,
-                        namespace,
-                        namer=None,
-                        arg_types=None,
-                        include_type_analysis=True,
-                        owner_type=None,
-                        recursive=True,
-                        autograph_decorators=()):
+  def prepare(self,
+              test_fn,
+              namespace,
+              namer=None,
+              arg_types=None,
+              owner_type=None,
+              recursive=True,
+              autograph_decorators=()):
     node, source = parser.parse_entity(test_fn)
-
+    node = node.body[0]
     if namer is None:
       namer = FakeNamer()
     program_ctx = converter.ProgramContext(
@@ -141,12 +162,5 @@ class TestCase(test.TestCase):
         arg_types=arg_types,
         owner_type=owner_type)
     ctx = converter.EntityContext(namer, entity_info, program_ctx)
-
-    node = qual_names.resolve(node)
-    node = activity.resolve(node, entity_info)
-    node = live_values.resolve(node, entity_info, {})
-    if include_type_analysis:
-      node = type_info.resolve(node, entity_info)
-      node = live_values.resolve(node, entity_info, {})
-    self.ctx = ctx
-    return node
+    node = converter.standard_analysis(node, ctx, is_initial=True)
+    return node, ctx
diff --git a/tensorflow/contrib/autograph/core/errors.py b/tensorflow/contrib/autograph/core/errors.py
new file mode 100644
index 0000000000..5a57d57e7d
--- /dev/null
+++ b/tensorflow/contrib/autograph/core/errors.py
@@ -0,0 +1,258 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Error rewriting logic.
+
+Contains the functions responsible for rewriting tracebacks of errors raised
+in AutoGraph (AG) code to refer to user written code, so that errors only refer
+to the original user code.
+
+When 'user code' is used in comments it refers to the original source code that
+the user wrote and is converting using AutoGraph.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import contextlib
+import logging
+import sys
+import traceback
+
+from tensorflow.contrib.autograph.pyct import origin_info
+from tensorflow.python.framework import errors_impl
+
+# TODO(mdan): Add a superclass common to all errors.
+
+
+class GraphConstructionError(Exception):
+  """Error for graph construction errors from AutoGraph generated code."""
+
+  def __init__(self, original_error, custom_traceback):
+    self.original_error = original_error
+    self.custom_traceback = custom_traceback
+    super(GraphConstructionError, self).__init__()
+
+  def __str__(self):
+    traceback_str = ''.join(traceback.format_list(self.custom_traceback))
+    return ('Traceback (most recent call last):\n' + traceback_str + '\n' + str(
+        self.original_error) + '\n')
+
+
+class TfRuntimeError(Exception):
+  """Error wrapper for runtime errors raised by AutoGraph generated code."""
+
+  def __init__(self, op_name, op_message, custom_traceback):
+    self.op_name = op_name
+    self.op_message = op_message
+    self.custom_traceback = custom_traceback
+    super(TfRuntimeError, self).__init__()
+
+  def __str__(self):
+    message = '%s\n\nCaused by op %r, defined at:\n' % (self.op_message,
+                                                        self.op_name)
+    return message + ''.join(traceback.format_list(self.custom_traceback))
+
+
+def _rewrite_tb(source_map, tb):
+  """Rewrites code references in a traceback.
+
+  Args:
+    source_map: Dict[origin_info.LineLocation, origin_info.OriginInfo], mapping
+        locations to their origin
+    tb: List[Tuple[Text, Text, Text, Text]], consistent with
+        traceback.extract_tb.
+  Returns:
+    List[Tuple[Text, Text, Text, Text]], the rewritten traceback
+  """
+  new_tb = []
+  for frame in tb:
+    filename, lineno, _, _ = frame
+    loc = origin_info.LineLocation(filename, lineno)
+    origin = source_map.get(loc)
+    if origin is not None:
+      new_tb.append(origin.as_frame())
+    else:
+      new_tb.append(frame)
+  return new_tb
+
+
+# TODO(mdan): rename to raise_*
+def rewrite_graph_construction_error(source_map):
+  """Rewrites errors raised by non-AG APIs inside AG generated code.
+
+  This is called from the except handler inside an AutoGraph generated function
+  (that is, during exception handling). Only rewrites the frames corresponding
+  to the function that this is called from, so each function is responsible
+  to call this to have its own frames rewritten.
+
+  This function always raises an error.
+
+  Args:
+    source_map: Dict[origin_info.Location, origin_info.OriginInfo], the source
+        map belonging to the calling function
+
+  Raises:
+    GraphConstructionError: The rewritten underlying error.
+    Exception: The underlying error, if it could not be rewritten.
+  """
+  error_info = sys.exc_info()
+  _, original_error, e_traceback = error_info
+  assert original_error is not None
+  try:
+    current_traceback = _cut_traceback_loops(source_map,
+                                             traceback.extract_tb(e_traceback))
+    if isinstance(original_error, GraphConstructionError):
+      # TODO(mdan): This is incomplete.
+      # The error might have bubbled through a non-converted function.
+      previous_traceback = original_error.custom_traceback
+      cleaned_traceback = [current_traceback[0]] + previous_traceback
+    else:
+      cleaned_traceback = current_traceback
+
+    cleaned_traceback = _rewrite_tb(source_map, cleaned_traceback)
+
+    if isinstance(original_error, GraphConstructionError):
+      original_error.custom_traceback = cleaned_traceback
+      new_error = original_error
+    else:
+      new_error = GraphConstructionError(original_error, cleaned_traceback)
+  except Exception:
+    logging.exception('Error while rewriting AutoGraph error:')
+    # TODO(mdan): Should reraise here, removing the top frame as well.
+    raise original_error
+  else:
+    raise new_error
+  finally:
+    # Addresses warning https://docs.python.org/2/library/sys.html#sys.exc_info.
+    del e_traceback
+
+
+def _cut_traceback_loops(source_map, original_traceback):
+  """Check for cases where we leave a user method and re-enter it.
+
+  This is done by looking at the function names when the filenames are from any
+  files the user code is in.  If we find a case where we return to a user method
+  after leaving it then we cut out the frames in between because we assume this
+  means these in between frames are from internal AutoGraph code that shouldn't
+  be included.
+
+  An example of this is:
+
+   File "file1.py", line 57, in my_func
+     ...
+   File "control_flow_ops.py", line 231, in cond
+     ...
+   File "control_flow_ops.py", line 1039, in inner_cond
+     ...
+   File "file1.py", line 68, in my_func
+     ...
+
+  Where we would remove the control_flow_ops.py frames because we re-enter
+  my_func in file1.py.
+
+  The source map keys are (file_path, line_number) so get the set of all user
+  file_paths.
+
+  Args:
+    source_map: Dict[origin_info.LineLocation, origin_info.OriginInfo], mapping
+      locations to their origin
+    original_traceback: List[Tuple[Text, Text, Text, Text]], consistent with
+      traceback.extract_tb.
+
+  Returns:
+    List[Tuple[Text, Text, Text, Text]], the traceback with any loops removed.
+  """
+  all_user_files = set(loc.filename for loc in source_map)
+  cleaned_traceback = []
+  last_user_frame_index = None
+  last_user_user_file_path = None
+  # TODO(mdan): Simplify this logic.
+  for fi, frame in enumerate(original_traceback):
+    frame_file_path, lineno, _, _ = frame
+    src_map_key = origin_info.LineLocation(frame_file_path, lineno)
+    if frame_file_path in all_user_files:
+      if src_map_key in source_map:
+        if (last_user_frame_index is not None and
+            last_user_user_file_path == frame_file_path):
+          cleaned_traceback = cleaned_traceback[:last_user_frame_index]
+      last_user_frame_index = fi
+      last_user_user_file_path = frame_file_path
+    cleaned_traceback.append(frame)
+  return cleaned_traceback
+
+
+# TODO(mdan): This should be consistent with rewrite_graph_construction_error
+# Both should either raise or return.
+def rewrite_tf_runtime_error(error, source_map):
+  """Rewrites TensorFlow runtime errors raised by ops created in AG code.
+
+  Args:
+    error: tf.OpError
+    source_map: Dict[origin_info.LineLocation, origin_info.OriginInfo]
+
+  Returns:
+    TfRuntimeError, the rewritten underlying error.
+  """
+  try:
+    cleaned_traceback = _cut_traceback_loops(source_map, error.op.traceback)
+    # cleaned_traceback = error.op.traceback
+    cleaned_traceback = _rewrite_tb(source_map, cleaned_traceback)
+
+    op_name = error.op.name
+    op_message = error.message
+    rewritten_error = TfRuntimeError(op_name, op_message, cleaned_traceback)
+    return rewritten_error
+  except Exception:  # pylint: disable=broad-except
+    logging.exception('Error while rewriting AutoGraph error:')
+    return error
+
+
+# TODO(znado): Add arg to enable different levels of error rewriting.
+@contextlib.contextmanager
+def improved_errors(converted_function):
+  """Context manager that rewrites runtime errors.
+
+  This context manager will rewrite runtime errors so that their traceback
+  is relative to the original code before conversion.
+
+  Use with the output of to_graph, and wrap the execution of respective ops.
+  Example:
+
+    converted_my_func = ag.to_graph(my_func)
+    ops = converted_my_func(...)
+
+    with ag.improved_errors(converted_my_func):
+      sess.run(ops)
+
+  Args:
+    converted_function: Callable[..., Any], the output of a to_graph call
+
+  Yields:
+    None
+
+  Raises:
+    TfRuntimeError: if any OpError originates in the converted code, it will
+        be wrapped into a TfRuntimeError
+    ValueError: If converted_function is not generated by AutoGraph
+  """
+  if (getattr(converted_function, 'ag_source_map', None) is None or
+      not isinstance(converted_function.ag_source_map, dict)):
+    raise ValueError(
+        'converted_function must be the result of an autograph.to_graph call')
+  try:
+    yield
+  except errors_impl.OpError as e:
+    raise rewrite_tf_runtime_error(e, converted_function.ag_source_map)
diff --git a/tensorflow/contrib/autograph/core/errors_test.py b/tensorflow/contrib/autograph/core/errors_test.py
new file mode 100644
index 0000000000..c0e2c74e47
--- /dev/null
+++ b/tensorflow/contrib/autograph/core/errors_test.py
@@ -0,0 +1,104 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for errors module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.core import errors
+from tensorflow.contrib.autograph.pyct import origin_info
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors as tf_errors
+from tensorflow.python.ops import array_ops
+from tensorflow.python.platform import test
+from tensorflow.python.util import tf_inspect
+
+
+def zero_div():
+  x = array_ops.constant(10, dtype=dtypes.int32)
+  return x // 0
+
+
+def zero_div_caller():
+  return zero_div()
+
+
+class RuntimeErrorsTest(test.TestCase):
+
+  def fake_origin(self, function, line_offset):
+    _, lineno = tf_inspect.getsourcelines(function)
+    filename = tf_inspect.getsourcefile(function)
+    lineno += line_offset
+    loc = origin_info.LineLocation(filename, lineno)
+    origin = origin_info.OriginInfo(loc, 'test_function_name', 'test_code')
+    return loc, origin
+
+  def test_improved_errors_basic(self):
+    loc, origin = self.fake_origin(zero_div, 2)
+    zero_div_caller.ag_source_map = {loc: origin}
+
+    ops = zero_div_caller()
+    with self.assertRaises(errors.TfRuntimeError) as cm:
+      with errors.improved_errors(zero_div_caller):
+        with self.test_session() as sess:
+          sess.run(ops)
+
+    for frame in cm.exception.custom_traceback:
+      _, _, function_name, _ = frame
+      self.assertNotEqual('zero_div', function_name)
+    self.assertIn(origin.as_frame(), set(cm.exception.custom_traceback))
+
+  def test_improved_errors_no_matching_lineno(self):
+    loc, origin = self.fake_origin(zero_div, -1)
+    zero_div_caller.ag_source_map = {loc: origin}
+
+    ops = zero_div_caller()
+    with self.assertRaises(errors.TfRuntimeError) as cm:
+      with errors.improved_errors(zero_div_caller):
+        with self.test_session() as sess:
+          sess.run(ops)
+
+    all_function_names = set()
+    for frame in cm.exception.custom_traceback:
+      _, _, function_name, _ = frame
+      all_function_names.add(function_name)
+      self.assertNotEqual('test_function_name', function_name)
+    self.assertIn('zero_div', all_function_names)
+
+  def test_improved_errors_failures(self):
+    loc, _ = self.fake_origin(zero_div, 2)
+    zero_div_caller.ag_source_map = {loc: 'bogus object'}
+
+    ops = zero_div_caller()
+    with self.assertRaises(tf_errors.InvalidArgumentError):
+      with errors.improved_errors(zero_div_caller):
+        with self.test_session() as sess:
+          sess.run(ops)
+
+  def test_improved_errors_validation(self):
+    with self.assertRaisesRegexp(
+        ValueError,
+        'converted_function must be the result of an autograph.to_graph call'):
+      errors.improved_errors(zero_div).__enter__()
+    with self.assertRaisesRegexp(
+        ValueError,
+        'converted_function must be the result of an autograph.to_graph call'):
+      zero_div_caller.ag_source_map = 'not a dict'
+      errors.improved_errors(zero_div_caller).__enter__()
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/examples/integration_tests/BUILD b/tensorflow/contrib/autograph/examples/integration_tests/BUILD
new file mode 100644
index 0000000000..6c281485b4
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/integration_tests/BUILD
@@ -0,0 +1,54 @@
+licenses(["notice"])  # Apache 2.0
+
+exports_files(["LICENSE"])
+
+load("//tensorflow:tensorflow.bzl", "py_test")
+
+filegroup(
+    name = "all_files",
+    srcs = glob(
+        ["**/*"],
+        exclude = [
+            "**/METADATA",
+            "**/OWNERS",
+        ],
+    ),
+    visibility = ["//tensorflow:__subpackages__"],
+)
+
+py_test(
+    name = "errors_test",
+    srcs = [
+        "errors_test.py",
+    ],
+    srcs_version = "PY2AND3",
+    tags = ["no_windows"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow:tensorflow_py",
+    ],
+)
+
+py_test(
+    name = "keras_test",
+    srcs = [
+        "keras_test.py",
+    ],
+    srcs_version = "PY2AND3",
+    tags = ["no_windows"],
+    deps = [
+        "//tensorflow:tensorflow_py",
+    ],
+)
+
+py_test(
+    name = "list_literals_test",
+    srcs = [
+        "list_literals_test.py",
+    ],
+    srcs_version = "PY2AND3",
+    tags = ["no_windows"],
+    deps = [
+        "//tensorflow:tensorflow_py",
+    ],
+)
diff --git a/tensorflow/contrib/autograph/examples/integration_tests/errors_test.py b/tensorflow/contrib/autograph/examples/integration_tests/errors_test.py
new file mode 100644
index 0000000000..f4b9159942
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/integration_tests/errors_test.py
@@ -0,0 +1,162 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Error traceback rewriting integration tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+from tensorflow.contrib import autograph as ag
+from tensorflow.python.util import tf_inspect
+
+
+class ErrorsTest(tf.test.TestCase):
+
+  def test_graph_construction_error_rewriting_call_tree(self):
+
+    def innermost(x):
+      if x > 0:
+        return tf.random_normal((2, 3), mean=0.0, dtype=tf.int32)
+      return tf.zeros((2, 3))
+
+    def inner_caller():
+      return innermost(1.0)
+
+    def caller():
+      return inner_caller()
+
+    with self.assertRaises(ag.GraphConstructionError) as error:
+      graph = ag.to_graph(caller)
+      graph()
+    expected = error.exception
+    custom_traceback = expected.custom_traceback
+    found_correct_filename = False
+    num_innermost_names = 0
+    num_inner_caller_names = 0
+    num_caller_names = 0
+    ag_output_filename = tf_inspect.getsourcefile(graph)
+    for frame in custom_traceback:
+      filename, _, fn_name, _ = frame
+      self.assertFalse('control_flow_ops.py' in filename)
+      self.assertFalse(ag_output_filename in filename)
+      found_correct_filename |= __file__ in filename
+      self.assertNotEqual('tf__test_fn', fn_name)
+      num_innermost_names += int('innermost' == fn_name)
+      self.assertNotEqual('tf__inner_caller', fn_name)
+      num_inner_caller_names += int('inner_caller' == fn_name)
+      self.assertNotEqual('tf__caller', fn_name)
+      num_caller_names += int('caller' == fn_name)
+    self.assertTrue(found_correct_filename)
+    self.assertEqual(num_innermost_names, 1)
+    self.assertEqual(num_inner_caller_names, 1)
+    self.assertEqual(num_caller_names, 1)
+
+  def test_graph_construction_error_rewriting_class(self):
+
+    class TestClass(object):
+
+      def test_fn(self):
+        return tf.random_normal((2, 3), mean=0.0, dtype=tf.int32)
+
+      def inner_caller(self):
+        return self.test_fn()
+
+      def caller(self):
+        return self.inner_caller()
+
+    # Note we expect a TypeError here because the traceback will not be
+    # rewritten for classes.
+    with self.assertRaises(TypeError):
+      graph = ag.to_graph(TestClass)
+      graph().caller()
+
+  def test_runtime_error_rewriting(self):
+
+    def g(x, s):
+      while tf.reduce_sum(x) > s:
+        x //= 0
+      return x
+
+    def test_fn(x):
+      return g(x, 10)
+
+    compiled_fn = ag.to_graph(test_fn)
+
+    with self.assertRaises(ag.TfRuntimeError) as error:
+      with self.test_session() as sess:
+        x = compiled_fn(tf.constant([4, 8]))
+        with ag.improved_errors(compiled_fn):
+          sess.run(x)
+    expected = error.exception
+    custom_traceback = expected.custom_traceback
+    found_correct_filename = False
+    num_test_fn_frames = 0
+    num_g_frames = 0
+    ag_output_filename = tf_inspect.getsourcefile(compiled_fn)
+    for frame in custom_traceback:
+      filename, _, fn_name, source_code = frame
+      self.assertFalse(ag_output_filename in filename)
+      self.assertFalse('control_flow_ops.py' in filename)
+      self.assertFalse('ag__.' in fn_name)
+      self.assertFalse('tf__g' in fn_name)
+      self.assertFalse('tf__test_fn' in fn_name)
+      found_correct_filename |= __file__ in filename
+      num_test_fn_frames += int('test_fn' == fn_name and
+                                'return g(x, 10)' in source_code)
+      # This makes sure that the code is correctly rewritten from "x_1 //= 0" to
+      # "x //= 0".
+      num_g_frames += int('g' == fn_name and 'x //= 0' in source_code)
+    self.assertTrue(found_correct_filename)
+    self.assertEqual(num_test_fn_frames, 1)
+    self.assertEqual(num_g_frames, 1)
+
+  def test_runtime_error_rewriting_nested(self):
+
+    def test_fn(x):
+
+      def g(y):
+        return y**2 // 0
+
+      s = 0
+      for xi in x:
+        s += g(xi)
+      return s
+
+    compiled_fn = ag.to_graph(test_fn)
+
+    # TODO(b/111408261): Nested functions currently do not rewrite correctly,
+    # when they do we should change this test to check for the same traceback
+    # properties as the other tests.  This should throw a runtime error with a
+    # frame with "g" as the function name but because we don't yet add
+    # try/except blocks to inner functions the name is "tf__g".
+    with self.assertRaises(ag.TfRuntimeError) as error:
+      with self.test_session() as sess:
+        x = compiled_fn(tf.constant([4, 8]))
+        with ag.improved_errors(compiled_fn):
+          sess.run(x)
+    expected = error.exception
+    custom_traceback = expected.custom_traceback
+    num_tf_g_frames = 0
+    for frame in custom_traceback:
+      _, _, fn_name, _ = frame
+      self.assertNotEqual('g', fn_name)
+      num_tf_g_frames += int('tf__g' == fn_name)
+    self.assertEqual(num_tf_g_frames, 1)
+
+
+if __name__ == '__main__':
+  tf.test.main()
diff --git a/tensorflow/contrib/autograph/examples/integration_tests/keras_test.py b/tensorflow/contrib/autograph/examples/integration_tests/keras_test.py
new file mode 100644
index 0000000000..7e7ef5a3e2
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/integration_tests/keras_test.py
@@ -0,0 +1,103 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Keras integration tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+from tensorflow.contrib import autograph
+
+
+class MinimalKeras(tf.keras.Model):
+
+  def call(self, x):
+    return x * 3
+
+
+class ModelWithStaticConditional(object):
+
+  def __init__(self, initial):
+    self.initial = initial
+    if self.initial:
+      self.h = 15
+
+  @autograph.convert()
+  def call(self):
+    x = 10
+    if self.initial:
+      x += self.h
+    return x
+
+
+class BasicBlock(tf.keras.Model):
+
+  def __init__(self):
+    super(BasicBlock, self).__init__()
+    self.conv1 = tf.keras.layers.Conv2D(8, 3)
+    self.pool = tf.keras.layers.GlobalAveragePooling2D()
+    self.dense = tf.keras.layers.Dense(3)
+
+  def call(self, x):
+    x = self.conv1(x)
+    x = self.pool(x)
+    x = self.dense(x)
+    return x
+
+
+class CompoundModel(tf.keras.Model):
+
+  def __init__(self):
+    super(CompoundModel, self).__init__()
+    self.block = BasicBlock()
+
+  @autograph.convert(recursive=True)
+  def call(self, x):
+    x = self.block(x)  # pylint: disable=not-callable
+    return x
+
+
+class KerasTest(tf.test.TestCase):
+
+  def test_basic(self):
+    MinimalKeras()
+
+  def test_conditional_attributes_False(self):
+    model = ModelWithStaticConditional(False)
+    self.assertEqual(model.call(), 10)
+
+  def test_conditional_attributes_True(self):
+    model = ModelWithStaticConditional(True)
+    self.assertEqual(model.call(), 25)
+
+  def test_recursive_true(self):
+    with self.assertRaisesRegexp(NotImplementedError,
+                                 'Object conversion is not yet supported.'):
+      with tf.Graph().as_default():
+        model = CompoundModel()
+        model.build(tf.TensorShape((None, 10, 10, 1)))
+        init = tf.global_variables_initializer()
+
+        with tf.Session() as sess:
+          sess.run(init)
+          sample_input = tf.random_uniform((1, 10, 10, 1))
+          output = model(sample_input)  # pylint: disable=not-callable
+          self.assertEqual(sess.run(output).shape, (1, 3))
+
+
+if __name__ == '__main__':
+  tf.test.main()
diff --git a/tensorflow/contrib/autograph/core/annos.py b/tensorflow/contrib/autograph/examples/integration_tests/list_literals_test.py
index b8937ce36a..680b6dbaf0 100644
--- a/tensorflow/contrib/autograph/core/annos.py
+++ b/tensorflow/contrib/autograph/examples/integration_tests/list_literals_test.py
@@ -12,28 +12,30 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Annotations specific to AutoGraph."""
+"""Tests of functions that use list literals."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from enum import Enum
+import tensorflow as tf
 
+from tensorflow.contrib import autograph as ag
 
-class NoValue(Enum):
 
-  def __repr__(self):
-    return self.name
+def list_used_as_tuple():
+  return tf.constant([1, 2, 3])
 
 
-class NodeAnno(NoValue):
-  """Additional annotations used by AutoGraph converters.
+class ListLiteralsTest(tf.test.TestCase):
 
-  These are in addition to the basic annotations declared in pyct/anno.py and
-  pyct/static_analysis/annos.py.
-  """
+  def test_basic(self):
+    converted = ag.to_graph(list_used_as_tuple)
+    result = converted()
 
-  # The directives collection - see directives.py
-  DIRECTIVES = (
-      'Dict depicting static directive calls. See the directives converter.')
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(result), [1, 2, 3])
+
+
+if __name__ == '__main__':
+  tf.test.main()
diff --git a/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_collatz_speed_test.ipynb b/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_collatz_speed_test.ipynb
new file mode 100644
index 0000000000..c10a5741f6
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_collatz_speed_test.ipynb
@@ -0,0 +1,299 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "aQkTGc-d8I1k"
+      },
+      "source": [
+        "This notebook runs a basic speed test for a simple algorithm that implements the process described in Collatz Conjecture.\n",
+        "\n",
+        "https://en.wikipedia.org/wiki/Collatz_conjecture"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "x5ChBlH09jk_"
+      },
+      "source": [
+        "### Imports"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "X-QAUpWdPxUh"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install -U -q tf-nightly"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "wiKQu3w05eCa"
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "from matplotlib import pyplot as plt\n",
+        "import tensorflow as tf\n",
+        "from tensorflow.contrib import autograph as ag\n",
+        "from tensorflow.python.eager import context"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "_cRFTcwT9mnn"
+      },
+      "source": [
+        "### Plotting helpers"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "ww7rc0GQ9pMu"
+      },
+      "outputs": [],
+      "source": [
+        "def plot_results(counts, times, title):\n",
+        "  plt.plot(counts, np.array(times) * 1000., 'o')\n",
+        "  plt.ylabel('Time (milliseconds)')\n",
+        "  plt.xlabel('Collatz counter')\n",
+        "  plt.title(title)\n",
+        "  plt.ylim(0, 30)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "ESZGw9s9-Y5_"
+      },
+      "source": [
+        "### Collatz function definition"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "qeunWm9m-dT7"
+      },
+      "outputs": [],
+      "source": [
+        "def collatz(a):\n",
+        "  count = 0\n",
+        "  while a \u003e 1.1:\n",
+        "    if a % 2 \u003c 0.1:\n",
+        "      a //= 2\n",
+        "    else:\n",
+        "      a = 3 * a + 1\n",
+        "    count += 1\n",
+        "  return count\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "nnFmPDvScsDo"
+      },
+      "source": [
+        "# AutoGraph"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 301
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 9153,
+          "status": "ok",
+          "timestamp": 1531757473651,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "6fU4vlxYcsDe",
+        "outputId": "11b50f28-aced-4506-a743-4b749e9645c3"
+      },
+      "outputs": [
+        {
+          "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEcCAYAAAAydkhNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XtcVGXCB/DfGRBUQA0ZURQvyIspm1reQkxNSPICgoqW\npWZu1vbmjZJV3Jc+axappVLu7guV25rU5g3wlq3iBd1wXHSN3hXy9ZaCgoOIIKAzMOf9g5dZkTkz\nB5i7v+9fzJlzzjzPHD2/Oc/znOcIoiiKICIiMkBh6wIQEZH9YkgQEZEkhgQREUliSBARkSSGBBER\nSWJIEBGRJIYEkQNYsWIFkpOTbV0MegQxJMipzJ49G8OHD4dWq5W9zeOPP45r164163O2bt2KqKgo\nDB48GKNGjcKcOXOwf//+5haXyO4xJMhpFBUV4fTp0xAEAYcPH5a9nSAIzfqc9957D1999RVWrFiB\nU6dO4fjx41iyZAmOHz8uuQ3vWSVHxZAgp5GRkYHBgwdj6tSpSE9P1y+fPXs2duzYoX+dnp6OWbNm\nAQBefvlliKKIqKgoPPXUU/juu+8AANu2bcP48eMxYsQIvPnmm7h58yYA4PLly/jmm2+wYcMGhISE\nwM3NDYIg4KmnnkJSUlKjz9ywYQNefPFFDB48GIWFhdi1axcmTpyIp556Cs899xy+/fZb/fqnTp3C\nmDFjkJKSgqeffhphYWHYs2dPo/rduXMHr7/+Op566inMnDmz2Vc/RC3BkCCnkZmZiaioKEyePBkn\nTpxAWVmZ5LoNVw9bt24FAOzevRtnzpzBhAkTkJOTg/Xr1+OTTz7BiRMn4Ofnh7i4OACASqVCt27d\nMGDAAJPl2bNnD1avXo0zZ86gW7du6Ny5M1JTU3HmzBkkJSUhKSkJ+fn5+vVLS0tRXl6O48eP48MP\nP0RiYiKuXLmif3/fvn1YuHAhcnNz4e/vj40bN7bkayJqFoYEOYXc3Fxcv34dEyZMQHBwMHr27Nnk\nl7hce/fuxfTp0/H444+jTZs2iIuLw9mzZ3H9+nXcvn0bSqWy0fpjxozBsGHDMHDgQNy4cUO/PCYm\nBn379oVCoYCrqyvGjBmDHj16AACGDh2K0NBQ5Obm6tcXBAFLlixBmzZtMGzYMIwZM0Z/ZQMA48eP\nx69+9SsoFApERkY2ChgiS2FIkFPIzMzEqFGj0LFjRwDApEmTkJGR0aJ93bx5E35+fvrX7du3R6dO\nnVBSUoJOnTrpm54aHDt2DCdPnoRWq23U99C1a9cm682cORMjRozAsGHDkJ2djdu3b+vf79ChA9zd\n3fWv/fz8Gn2Wj4+P/u927dqhurq6RfUjag5XWxeAqLXu37+P7777DjqdDqNGjQIAaDQaVFZWoqCg\nAO3bt8e9e/f066vVaqP769KlC65fv65/XV1djfLycvj6+qJTp05YvXo1/vWvfyE4OLjRdg93Tj/Y\nIa7RaLB48WKsW7cOYWFhUCgU+M///M9G21RUVODevXto27YtAODGjRsICgpq5rdBZF68kiCHd/Dg\nQbi4uOC7775DZmYmMjMz8d1332Ho0KHIzMxE//798be//Q337t3DL7/8gp07dzba3sfHp1En8OTJ\nk7Fr1y4UFBRAo9Fg/fr1GDRoEPz8/NCnTx/MnDkTcXFx+OGHH3D//n3odDqcOXPG6CgprVYLrVaL\nxx57DAqFAseOHcPf//73RuuIoohPPvkEWq0Wubm5OHr0KCZMmGDeL4uomXglQQ4vIyMD06ZNg6+v\nb6Pls2bNwvvvv4+9e/fip59+QmhoKPr164fIyEjk5OTo11u4cCHi4+Oh0WiwatUqPP/881i8eDEW\nLlyIiooKPPnkk1i/fr1+/cTERGzduhVJSUm4du0avLy80Lt3b2zcuFHfTPVwYHh4eGDlypVYvHgx\ntFotnn32WYSFhTVaR6lUomPHjnjmmWfQvn17rFq1Cr179zbzt0XUPIIlHzqk0Wjw0ksvQavVoq6u\nDhEREXjrrbdQWFiIuLg43LlzB8HBwVi7di1cXZlX9Og6deoU4uPjcfToUVsXhagRizY3ubm5YcuW\nLcjIyEBGRgays7Px448/4qOPPsK8efPw/fffw8vLq9EYdiIish8W75No164dgPqritraWgiCAJVK\nhYiICAD1wwQPHjxo6WIQEVELWDwkdDodoqOjERoaitDQUPj7+6NDhw5QKOo/umvXrk2GFBI9aoYP\nH86mJrJLFg8JhUKhb2rKy8vDxYsXm6zT3LlziIjIOqw2BNbT0xPDhg3Djz/+iIqKCuh0OgBAcXEx\nunTpYnJ7TpBGRGR9Fh1SVFZWhjZt2sDLywv37t1DTk4OFixYgBEjRuDAgQOYOHEi0tPTmwwFNEQQ\nBKjVlZYsrk0plV6sn4Ny5roBrJ+jUyq9WrW9RUNCrVZj+fLl0Ol00Ol0mDhxIsaMGYOAgADExcUh\nOTkZ/fv3x/Tp0y1ZDCIiaiGL3idhbs6e9qyfY3LmugGsn6Nr7ZUEp+UgIiJJDAkiIpLEkCAiIkkM\nCSIiksSQICIiSQwJIiKSxJAgIiJJDAkiIpLEkCAiIkkMCSIiksSQICIiSQwJIiKSxJAgIiJJDAki\nIpLEkCAiIkkMCSIiksSQICIiSQwJIiKSxJAgIiJJDAkiIpLEkCAiIkkMCSIiksSQICIiSQwJIiKS\nxJAgIiJJDAkiIpLEkCAiIkkMCSIikuRqyZ0XFxcjPj4epaWlcHFxwYwZMzB79mxs2rQJ27ZtQ+fO\nnQEAS5cuxejRoy1ZFCIiagGLhoSLiwtWrFiB/v37o6qqClOnTsXIkSMBAPPmzcO8efMs+fFERNRK\nFg0JpVIJpVIJAPDw8EDfvn1x8+ZNAIAoipb8aCIiMgOr9UkUFhaioKAAAwcOBACkpaVhypQpWLly\nJSorK61VDCIiagarhERVVRUWLVqEhIQEeHh4YNasWTh06BAyMzPh4+ODpKQkaxSDiIiaSRAt3O5T\nW1uL119/HaNHj8bcuXObvF9UVIQ33ngDe/bssWQxiIioBSzaJwEACQkJCAwMbBQQarVa31dx8OBB\nBAUFydqXWu28zVJKpRfr56CcuW4A6+folEqvVm1v0ZA4ffo09uzZg6CgIERHR0MQBCxduhR79+5F\nfn4+FAoFunfvjlWrVlmyGERE1EIWDYkhQ4YgPz+/yXLeE0FE5Bh4xzUREUliSBARkSSGBBERSWJI\nEBGRJIYEERFJYkgQEZEkhgQREUliSBARkSSGBBERSWJIEBGRJIYEERFJYkgQEZEkhgQREUliSBAR\nkSSGBBERSWJIEBGRJIYEERFJYkgQEZEk2Y8vvXfvHtRqNdzd3dGlSxdLlomIiOyE0ZDQ6XTIyMjA\n9u3bUVBQAE9PT2g0Gri6uiI8PByvvPIK+vTpY62yEhGRlRkNiRdffBGDBw/GihUrEBwcDBcXFwDA\nrVu3cPz4cSQmJuKFF17ApEmTrFJYIiKyLkEURVHqzbKyMnh7exvdgZx1zEWtrrTK59iCUunF+jko\nZ64bwPo5OqXSq1XbG+24NnTyv3XrFs6ePWt0HSIicg6yRjfNmjULlZWVqKioQHR0NFauXIk1a9ZY\numxERGRjskKiuroaXl5eOHLkCCIjI7Fnzx6cOHHC0mUjIiIbkxUSGo0GAKBSqTBy5EgoFAp9JzYR\nETkvWSExfPhwREREIDc3F8OHD0dFRQUUCt6HR0Tk7GTdTPfuu++ioKAA/v7+cHNzw927d7F69WpL\nl42IiGzMaEhcuHBB/3ebNm1QXFysf+3m5ma5UhERkV0wGhILFiyAIAgQRRE3btyAp6cnAODu3bvo\n1q0bDh8+bHTnxcXFiI+PR2lpKVxcXBAbG4s5c+bgzp07WLp0KYqKitCjRw9s3LgRXl6tG8tLRETm\nZzQkGkJg9erVGDJkCCZMmAAAOHDgAM6dO2dy5y4uLlixYgX69++PqqoqTJ06FaGhodi1axdCQkLw\n2muvITU1FSkpKXjnnXfMUB0iIjInWb3PeXl5+oAAgOeffx4nT540uZ1SqUT//v0BAB4eHujbty9K\nSkqQlZWFmJgYAEBMTAwOHTrUkrITEZGFyQqJmpoa5Obm6l/n5uaipqamWR9UWFiIgoICDBo0CLdu\n3YKPjw+A+iC5fft2s/ZFRETWIXt0U1xcHNq1awcAuH//Pj7++GPZH1JVVYVFixYhISEBHh4eEASh\nRYVt7Rwk9o71c1zOXDeA9XuUyQqJoUOH4tChQ7h8+TJEUURAQIDs0U21tbVYtGgRpkyZgvDwcABA\n586dUVpaCh8fH6jVatnzPzn7JFysn2Ny5roBrJ+js+gEfw+qq6uDm5sbXF1dcfXq1UbDY41JSEhA\nYGAg5s6dq182btw47Nq1CwCQnp6OsLCwZhabiIisQdaVRFpaGj766CN06tRJ31QkCAKysrKMbnf6\n9Gns2bMHQUFBiI6OhiAIWLp0KV577TUsWbIEO3fuhJ+fH5KTk1tfEyIiMjtZIbF582bs3bsX3bt3\nb9bOhwwZgvz8fIPvffnll83aFxERWZ+s5ialUtnsgCAiIscn60pi5MiRWLt2LSZNmgR3d3f98sDA\nQIsVjIiIbE9WSGRkZACov9O6gZw+CSIicmyyQsLUHE1EROScZIUEUD8jrEqlAgA8/fTT6Nu3r8UK\nRURE9kFWx3VGRgZeeeUV5OfnIz8/H/PmzcPu3bstXTYiIrIx2UNg09PToVQqAQBqtRrz589HVFSU\nRQtHRES2JfuO64aAePhvIiJyXrJComfPnvjkk09QUlKCmzdvYtOmTfD397d02YiIyMZkhcTvf/97\nXL58GVFRUYiKisKlS5ewatUqS5eNiIhsTFafROfOnbFhwwZLl4WIiOyMrCuJ1NRUlJeX61/fvn0b\nn3/+ucUKRURE9kFWSOzbtw+dOnXSv37sscewd+9eixWKiIjsg6yQEEWxybK6ujqzF4aIiOyLrJDo\n3bs3/vznP0MUReh0OmzevBk9e/a0dNmIiMjGZIXEypUrceTIEQwcOBCDBw/GsWPHkJiYaOmyERGR\njcka3eTr64stW7aguroaANC+fXuLFoqIiOyD7D6J7du3449//CPat2+PwsJCnDlzxtJlIyIiG5MV\nEklJSTh58iQOHToEAPDw8MAHH3xg0YIREZHtyQoJlUqFjz76CG3btgVQPwT2/v37Fi0YERHZnqyQ\ncHd3hyAI+tc6nc5iBSIiIvshq+M6KCgIu3fvhiiKKCwsRGpqKoYMGWLpshERkY3JupJYvnw5Tp06\nBbVajdjYWNTV1WHZsmWWLhsREdmYrCsJT09PrF692tJlISIiOyPrSmL//v24e/cuACA5ORnz58/H\n//zP/1i0YEREZHuyQuJPf/oTPD09kZeXhxMnTiA6OppXFkREjwBZIeHqWt8q9fe//x2xsbGIjIzk\nEFgiokeArJAQBAG7d+/Gvn37EBISAgDQarUWLRgREdmerJD43e9+hwMHDiA2Nhb+/v64cuUKRowY\nYXK7hIQEjBw5EpGRkfplmzZtwujRoxETE4OYmBhkZ2e3vPRERGRRgmjoYRFmkpubCw8PD8THx2PP\nnj0A6kPCw8MD8+bNa/b+1OpKcxfRbiiVXqyfg3LmugGsn6NTKr1atb3RIbB/+ctfMHfuXKxdu9bg\n+/Hx8UZ3PnToUBQVFTVZbsFcIiIiMzIaEu7u7gDMPzV4WloaMjMz8atf/QrLly+Hl1frko6IiCzD\nos1NAFBUVIQ33nhD39xUVlaGxx57DIIgYMOGDVCr1ZxRlojIThm9kkhLSzO68UsvvdTsD/T29tb/\nPWPGDLzxxhuyt3X2dkPWzzE5c90A1s/RWbRPwhx3VT98oaJWq6FUKgEABw8eRFBQUKs/g4iILMNo\nSCQlJbVq52+//TZUKhXKy8sxduxYLFy4ECqVCvn5+VAoFOjevTtWrVrVqs8gIiLLMRoSx44dM7rx\nmDFjjL7/8ccfN1k2bdo0GcUiIiJ7YDQkPv/8c8n3BEEwGRJEROTYjIbEV199Za1yEBGRHTIaEteu\nXYO/vz8uXLhg8P3AwECLFIqIiOyD0ZBYvXo1UlJSsGDBgibvCYKArKwsixWMiIhsz2hIpKSkAAAO\nHz5slcIQEZF9kfX4UgCoqalBcXEx6urq9MvY3ERE5NxkhcSWLVuwYcMGdOzYEQpF/ezibG4iInJ+\nskLiL3/5Cw4cOABfX19Ll4eIiOyIrIcOde3alQFBRPQIknUlsXDhQqxcuRJjxozRTx8OmL7jmoiI\nHJuskDhy5AiOHDmCK1euNOqTYEgQETk3WSFx8OBBHD58GG3btrV0eYiIyI7I6pPw9/eHq6vs0bJE\nROQkZJ35e/Xqhblz5yI8PBxubm765S156BARETkOWSGh1WrRs2dPnD9/3tLlISIiOyIrJFr78CEi\nInJMRvskTD2+VKPR4OLFi2YtEBER2Q+TE/zV1NRg8uTJGDRoEHx8fHD//n1cvnwZx48fx7Fjx7B8\n+XL07dvXWuUlIiIrMhoSn376KfLy8vDtt9/iD3/4A4qLi9GuXTsEBQUhPDwcaWlp8PT0tFZZiYjI\nykz2SQwcOBADBw60RlmIiMjOyLpPgoiIHk0MCSIiksSQICIiSQwJIiKSJCskbt26hXfeeUc/DUdB\nQQG++eYbixaMiIhsT1ZI/O53v8OQIUNQUVEBAAgICMDXX39t0YIREZHtyQqJkpISvPjii3BxcQEA\nuLm56Z8rQUREzkvWmf7hacIrKiogiqJFCkRERPZDVkiMHz8eiYmJqKqqwq5du/Dqq69i2rRpJrdL\nSEjAyJEjERkZqV92584dvPrqq4iIiMD8+fNRWVnZ8tITEZFFyQqJX//61xg6dCiCg4Nx7NgxzJ49\nG3PnzjW53dSpU/HFF180WpaamoqQkBB8//33GDFiBFJSUlpWciIisjjZj5uLiopCVFRUs3Y+dOhQ\nFBUVNVqWlZWFrVu3AgBiYmIwe/ZsvPPOO83aLxERWYeskLh16xa2bt2Kq1evora2Vr88OTm52R9Y\nVlYGHx8fAIBSqcTt27ebvQ8iIrIOWSHx5ptvYsCAAQgJCdGPcLIFpdLLZp9tDayf43LmugGs36NM\nVkjU1NTg3XffNcsHdu7cGaWlpfDx8YFarYa3t7fsbdVq5+3kViq9WD8H5cx1A1g/R9faAJTVcT1o\n0CD8/PPPLfqAh4fKjhs3Drt27QIApKenIywsrEX7JSIiy5N1JfHCCy/g5ZdfRteuXeHu7q5fvmPH\nDqPbvf3221CpVCgvL8fYsWOxcOFCLFiwAIsXL8bOnTvh5+fXon4NIiKyDlkhsWzZMrzxxhsYMGBA\ns/okPv74Y4PLv/zyS9n7ICIi25EVEu7u7pg/f76ly0JERHZGVp/EM888g+zsbEuXhYiI7IysK4lt\n27YhNTUVHh4ecHNzgyiKEAQBOTk5li4fERHZkKyQ2Llzp6XLQUREdkhWSHTv3t3S5SAiIjtkNCSW\nLVuGdevWYdq0aRAEocn7pobAEhGRYzMaEg0zvf72t7+1SmGIiMi+GA2Jr7/+Gh988AGGDx9urfIQ\nEZEdMToENj8/31rlICIiO8QHVRMRkSSjzU3nz59HSEhIk+W8T4KI6NFgNCR69+6N1NRUa5WFiIjs\njNGQcHNz4z0SRESPMKN9Em3atLFWOYiIyA4ZDYlt27ZZqxxERGSHOLqJiIgkMSSIiEgSQ4KIiCQx\nJIiISBJDgoiIJDEkiIhIEkOCiIgkMSSIiEgSQ4KIiCTJesY1EdGjTnWuBPtyruB6aTX8fNpjUkhv\njBjga+tiWRxDgojIBNW5EqTs/pf+daG6Sv/a2YOCIUFEVuHIv8T35VyRWP6Lw9ShpRgSRGRxjv5L\n/HpptcHlN25VWbkk1mezkBg3bhw8PT2hUCjg6uqKHTt22KooRGRhjv5L3M+nPQrVTQOhW2cPG5TG\numwWEoIg4KuvvkLHjh1tVQQishJH/yU+KaR3oyuhfy/v1ei1IzepSbFZSIiiCJ1OZ6uPJ3I69nyC\nsuYvcUt8Dw3b78v5BTduVaFbZw9MCunVaL+O3qQmxaZXEvPnz4cgCJg5cyZmzJhhq6IQOTx7P0HJ\n/SXeWpb8HkYM8DW6D0dvUpNis5D461//CqVSibKyMsybNw8BAQEYOnSorYpDZHXm/MVr7ycoOb/E\nzcGW34OjN6lJsVlIKJVKAIC3tzeee+45/PTTTyZDQqn0skbRbIb1c1zNrVv2PwsN/uLt0KEtRj/Z\no9F627P+F1dLKtHT1wuxYf/R6P0G129Jn6DM8b2bYx+Tx3hh8pjAZm0jt/4NWvo9mKN+Pbt64cqN\niibL/X29HPrfvk1CoqamBjqdDh4eHqiursaJEyfw1ltvmdxOra60QulsQ6n0Yv0cVEvq9s33BRLL\nf0b/HvWDOR5uOrlyowLrtp5GRcW9Jr+K/TpLt/m39nu31bFrTv0btOR7MFf9Iob5G2xSixjmb9N/\n+60NKJuERGlpKd566y0IgoC6ujpERkZi1KhRtigKkU3IaZpoTtOJsTZ/e+7QNqYlTUfW6vswxFpN\natZmk5Dw9/dHZmamLT6ayCKaeyI2NdpHda7E4PuA4TZuqRMUALvu0DbGVJAa+85tdaI21bntiHjH\nNVErSfUvANInYlO//A2916Cjp5vB5YZOUIlfqAyuay8d2sYYC1JTo5jsvW6OhFOF0yNLda4EiV+o\n8Os1R5D4hQqqcyUt2s/2rP81uHxfzi+S24wY4IvXo4LRQ+kJF4WAHkpPvB4VjBEDfCWbWRqUVdyX\nXVZHHnEzKaS3xPJeRpuiyLx4JUGPJHOOp79aYrhT0tSJWOoXr9SJ/UFyrwQceToJY01Hn+05Z3Ab\nRwg/R8OQoEeSOcfT9/Q1PPSxpSdiqRP7gwrVd/HrNUdM9n/YsiPXHKSC1JHDz9GwuYkeSVK/1otK\n7za7CSo27D8MLm/piViqmeVhOlHUXwFJldNYs5YjM9YURebFKwl6JEn9EhVF6JfLbYIa/WQPVFTc\nM9uImobtth+5gLLK+7K2MXYF5IwdubYexfQoYUhQqznSOPyGshaVym+7bugMNVZHS5yI5QYE0LQt\n3pGOSUs5Y/jZI4YEtZjqXEmTX7v2NA7/4RNlv56PIet0YZP1BAF4zNNd8qRcVHrX6vcaSPWZtHFR\nQFvXdPbkB9vi7X2yP3IsDAlqEVNj+c05Dr8lv4oNnSilOoMf83JHWYX0r3ZXheETsyXvNZDqM6mV\nmF7/wbZ4e5/sjxwLQ4JaxNRYfnMNRZTzq9hQiJgq34OMBQQgfWK25HBLqT6T7j6e/3+fgHRbvCPf\nG0H2hyHhpCzdJm1qLL+5hiJuP3LB8PKjFzBigK9kiAiCWT4er0cFY1/OFbMPtzR1fIwNXTXVFs/h\noWRODAknZI02aVNj+Zs7FFHqpCnVT9Dw61/qikGqicgQby/D/RE9lJ7678uc9xrIOT6tGb3j6PdG\nkH1hSDgha7RJS52IvDu4I3ZsYLM+x9hJ09R2zW27DxvSAz9fLTc6CV6DhvfMPdxS7vFp6egdDg8l\nc2JIOCFrtEmb80Rk7KQp9SsfqD+xu0g0K8lpuzf0eVLrmnO4pbWOD0OBzIEh4YRMzZ7Zmr4KS/R1\nSJ00C9V3YaproU40vFxO2/2DrHlSZZ8BORKGhBOSagrq17NTq/oqpJqFUnf/C92VHvrASDt4Htln\ni6CtE+EiAO3atkH1/Vr4dTYcKsb6NyQyoIk2LgroRNEhmlbYZ0COhCHhhKSaglrTV6E6V4LN+wzP\nvCni34Fx4qcb+NflMv17dSJwt0YLQDqUpE6azaETRXwW/2yr9mEt7DMgR8KQcFKGmk9aOr2yqRvn\nHvRgQEgx1EELGO48lsvRmmrYZ0COgiHhhKT6DVraFt6cG9PkkHr8ptT9CA28vdwBwfDNb2yqIbIM\nThXuZBp+9Reqq5pMJd3S6ZXlPASnOaRCydQU2bHPBuKjN0OdcuprInvFKwknY6zfYdX84fq/m9MW\nLuchOA2C+3ibbHKSCqUH2+qLSu/CVaFAnU4Hv/8fzvrgjWYMBSLrYEg4GVNj8FtygpXqWDZ0Y9q/\nRzddh7ZOBxdBQLu2rqi5XysrlBgARPaFIWFhDz6/QCEIqNPVD+r09nJH7LPNuzNZDkuMwW/uaJyX\nngvCS88FNVqmVHpBrTb8LGgisl8MiYcYegbBz1dvm7x5zFBnMdB4xE6d+O9R/2WV9y0yx7+lxuDz\nFz7Ro4kh8QBTzyCQGucvdZOZt5e7yc809xz/HINPRObEkHiA3KGeD5/YpbaT8/hJS8zxz1/9RGQu\nHAL7ALlDPR8+sbdmiKij3QRGRI8WhsQD/Hzay1rv4RO71HZympt4ExgR2TObhUR2djaef/55RERE\nIDU1tUX7UJ0rQeIXKvx6zREkfqGC6lxJq8pk6mauf6/X66HXhreLfTZQf+OXQgBcFP+e09Tby503\ngRGR3bNJn4ROp8N7772HL7/8El26dMH06dMRFhaGvn37Sm4zZdnuRrOIWuLpa4Y6ffv17GTwXgBT\n2z188xcRkSOySUjk5eWhV69e6N69OwBg0qRJyMrKMhoSOp3YKAgs9fS11jwNjGFARM7GJs1NJSUl\n6Natm/61r68vbt68KXv7fTm/WOXpXkREjzqbhIQoyn2UjGE3blVJdhZztBARkfnYpLmpa9euuH79\nuv51SUkJunTpInt7f18vxIb9B9ZtPd3kvRcj+kGp9DJLOa3NUcstlzPXz5nrBrB+jzJBbO3P+hao\nq6vD888/jy+//BJKpRKxsbFYv3690T4JIiKyPptcSbi4uOC//uu/8Oqrr0IURUyfPp0BQURkh2xy\nJUFERI6Bd1wTEZEkhgQREUliSBARkSS7DwlzzPFkb8aNG4eoqChER0dj+vTpAIA7d+7g1VdfRURE\nBObPn4/KSsd5iltCQgJGjhyJyMhI/TJj9Vm9ejXGjx+PKVOmID8/3xZFbhZD9du0aRNGjx6NmJgY\nxMTEIDs7W/9eSkoKxo8fjwkTJuDEiRO2KLJsxcXFmDNnDiZOnIjIyEhs2bIFgPMcv4fr99VXXwFw\nnuOn0WiMkXxQAAAKiUlEQVQQGxuL6OhoREZGYtOmTQCAwsJCzJgxAxEREYiLi0Ntba1+/aVLl2L8\n+PGYOXNmo1sRJIl2rK6uTgwPDxcLCwtFjUYjRkVFiRcuXLB1sVpt3LhxYnl5eaNla9euFVNTU0VR\nFMWUlBRx3bp1tihai/zjH/8Qz507J06ePFm/TKo+R48eFV977TVRFEXx7NmzYmxsrPUL3EyG6vfp\np5+KmzdvbrLuhQsXxClTpoharVa8du2aGB4eLup0OmsWt1lu3rwpnjt3ThRFUbx79644fvx48cKF\nC05z/KTq5yzHTxRFsbq6WhRFUaytrRVjY2PFs2fPiosXLxb3798viqIoJiYmit98840oiqKYlpYm\nvvvuu6IoiuK+ffvEJUuWmNy/XV9JPDjHU5s2bfRzPDk6URSh0+kaLcvKykJMTAwAICYmBocOHbJF\n0Vpk6NCh6NChQ6NlD9en4bhlZWUhOjoaADBo0CBUVlaitLTUugVuJkP1AwzPHJCVlYWJEyfC1dUV\nPXr0QK9evZCXl2eNYraIUqlE//79AQAeHh7o27cvSkpKnOb4GapfwxRAznD8AKBdu3YA6q8Samtr\nIQgCVCoVIiIiADQ+nzx4XCMiIpCTk2Ny/3YdEq2d48leCYKA+fPnY9q0adi+fTsA4NatW/Dx8QFQ\n/w/79u3btixiq5WVlTWqT1lZGQDg5s2b6Nq1q349X19flJS0bop3W0lLS8OUKVOwcuVKfXOMoX+z\njlK/wsJCFBQUYNCgQU3+PTrD8Wuo38CBAwE4z/HT6XSIjo5GaGgoQkND4e/vjw4dOkChqD+9d+3a\nVV+HB4+fi4sLOnTogPLycqP7t+uQMJT0zuCvf/0rdu3ahc8++wxpaWnIzc2FIAimN3QCho6pI9Z9\n1qxZOHToEDIzM+Hj44MPP/wQgOPWr6qqCosWLUJCQgI8PDwky+ws9XOm46dQKJCRkYHs7Gzk5eXh\n4sWLTdZpqMPD9RNF0WT97DokWjvHk71SKpUAAG9vb4SHhyMvLw+dO3fWX7ar1Wp4e3vbsoitJlUf\nX19fFBcX69crLi52yGPq7e2t/881Y8YMfZNE165dcePGDf16jlC/2tpaLFq0CFOmTEF4eDgA5zp+\nhurnTMevgaenJ4YNG4Yff/wRFRUV+ibtB+vw4PGrq6vD3bt30bFjR6P7teuQeOKJJ3D16lUUFRVB\no9Fg3759CAsLs3WxWqWmpgZVVfXTmVdXV+PEiRMICgrCuHHjsGvXLgBAenq6w9Xz4V8oUvUJCwtD\nRkYGAODs2bPo0KGDvlnDnj1cP7Varf/74MGDCAoKAlBf7/3790Oj0eDatWu4evWqvnnDXiUkJCAw\nMBBz587VL3Om42eofs5y/MrKyvRNZffu3UNOTg4CAwMxYsQIHDhwAEDj4zdu3Dikp6cDAA4cOICn\nn37a5GfY/bQc2dnZeP/99/VzPC1YsMDWRWqVa9eu4a233oIgCKirq0NkZCQWLFiA8vJyLFmyBDdu\n3ICfnx+Sk5MNdpbao7fffhsqlQrl5eXw8fHBwoULER4ejsWLFxusz6pVq3D8+HG0a9cOSUlJCA4O\ntnENjDNUP5VKhfz8fCgUCnTv3h2rVq3SnyxTUlKwY8cOuLq6YuXKlRg1apSNayDt9OnTePnllxEU\nFARBECAIApYuXYqBAwdK/nt0pOMnVb+9e/c6xfH7+eefsXz5cuh0Ouh0OkycOBG/+c1vcO3aNcTF\nxaGiogL9+/fHunXr0KZNG2g0Gixbtgz5+fno1KkT1q9fjx49ehj9DLsPCSIish27bm4iIiLbYkgQ\nEZEkhgQREUliSBARkSSGBBERSWJIEBGRJIYE2b3a2lokJycjIiICkZGRmDRpEtasWYO6ujqj261Y\nsQJpaWkA6qeGXrt2rcnPOnToEH766SezlNsSioqKsG3bNlsXgx4hDAmye8uXL8fFixeRkZGBPXv2\nYPfu3QgICIBGozH7Z2VlZdn1rJ+FhYX49ttvW7StqVAlMsTV1gUgMuaXX35BVlaW/g5foH72ytjY\nWAD1M2CuW7dO/3CYUaNGIT4+3uikZefPn8fvf/971NTUQKPRYMaMGZgzZw5OnDiBw4cPIycnBzt2\n7MArr7yCwsJCHDx4EIIgQKPR4NKlS/jHP/4BT0/PRvv85z//iXXr1qGqqgqCICA+Ph4jR45EXl4e\nPvjgA9TU1KBdu3ZYuXIlnnjiCZw6dQpr1qzBzp07AaDR61OnTuGDDz7AwIEDcfbsWSgUCqxfvx4B\nAQF47733UFRUhJiYGPTs2RPJycm4dOkSkpKSUF5eDq1Wizlz5mDq1KkAgMcffxzLli3D0aNHMWzY\nMCxatMjsx4icnFmeekFkIfv37xejo6Ml3//666/FefPmibW1taJWqxXnzp2rf8DK8uXLxa1bt4qi\nWP+QoDVr1oiiKIpVVVWiRqPR/z1x4kTx4sWLTbZ52LJly8QPP/ywyfLy8nIxNDRUPHv2rCiKoqjT\n6cSKigpRo9GIY8eOFXNyckRRFMUffvhBHDt2rKjVakWVSiVOmzZNv48HX6tUKjE4OFjMz88XRVEU\n//SnP4nvvPNOk/VEsf5BMzExMeKlS5dEUax/sE5ERIT+db9+/cTPP/9c8vsjMoVXEmTXRBOzxuTk\n5CAmJgYuLi4AgKlTp+LQoUN44YUXJLepqanBu+++i4KCAigUCqjVahQUFCAgIEBym40bN6Kmpga/\n/e1vm7x39uxZBAYGYtCgQQDqp2X28vLC+fPn4ebmpp9ELSQkBG5ubrh8+bLJevfp0wePP/44gPqH\n+xw9etTgeleuXMGlS5cQFxen/660Wi0uXryIPn36AID+IUFELcGQILsWHByMK1euoLKyEl5eXk3e\nFw3Mh29qfvz169dDqVRi7dq1+gdAGevf2LlzJ06ePKl//rOhMshd3lBeFxeXRk8nvH//fqP13N3d\n9X+7uLjon1FsaH/e3t76mT0fJggC2rdvb/A9IjnYcU12rVevXhg3bhwSExP1U6zX1dVhy5YtqKmp\nwciRI5Geno7a2lpotVpkZGQgNDTU6D4rKyvRrVs3CIKA8+fPIzc3V/+eh4cH7t69q3/9ww8/4LPP\nPsMf//hHuLm5Gdzfk08+iQsXLuDHH38EUN9PUlFRgYCAAGi1Wpw6dQoAcPLkSdTW1qJ3797o0aMH\nCgsLUVlZCVEUsW/fPlnfh6enp35qaKD+iqNt27bIzMzUL7t06ZL+uzJ1JUZkCq8kyO6tWbMGn376\nKaZOnQo3NzeIoojRo0fDzc0NM2fOxNWrV/XP7X3mmWf0ndpSfvOb3yA+Ph67d+9Gz549MWzYMP17\nU6ZMwYoVK3DgwAG88sor2LlzJ2pqajB//nz9VUBaWlqjX+cdO3bEpk2bkJSUhOrqari4uCA+Ph4h\nISH45JNPsHr1an3H9aeffgpXV1f4+vpi3rx5iImJgb+/P5544glcuHDB5HfRr18/9OnTB5GRkQgI\nCEBycjL++7//G++//z42b96Muro6+Pj4YOPGjQDs/6lqZP84VTgREUlicxMREUliSBARkSSGBBER\nSWJIEBGRJIYEERFJYkgQEZEkhgQREUliSBARkaT/AzLfG+oMx+5pAAAAAElFTkSuQmCC\n",
+            "text/plain": [
+              "\u003cmatplotlib.figure.Figure at 0x7fc3b259add0\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        }
+      ],
+      "source": [
+        "counts = []\n",
+        "times = []\n",
+        "for n in np.logspace(0, 7, 50):\n",
+        "\n",
+        "  with tf.Graph().as_default():\n",
+        "    tf_collatz = ag.to_graph(collatz)\n",
+        "    count = tf_collatz(tf.constant(n, dtype=tf.float32))\n",
+        "    with tf.Session() as sess:\n",
+        "      count_value = sess.run(count)\n",
+        "\n",
+        "      res = %timeit -n10 -r1 -o -q sess.run(count)\n",
+        "      counts.append(count_value)\n",
+        "      times.append(res.best)\n",
+        "      \n",
+        "plot_results(counts, times, 'AutoGraph')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "RRENYzLRF_f3"
+      },
+      "source": [
+        "# Eager"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 301
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 5003,
+          "status": "ok",
+          "timestamp": 1531757478713,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "dhDf8LLdF_f-",
+        "outputId": "3de0a5a5-7a11-4b41-8ab0-e4e21ce8d59b"
+      },
+      "outputs": [
+        {
+          "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEcCAYAAAAydkhNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XtYVWW+B/Dv2hshBdSQHaighhwas7Qeb6GFDjIyI3LZ\nGphdJLLMzqSlKaPQsTPm5KhZkZ7moKOnManGK17wsUfIS87QNj2jnEnIg5cQEtyAyDWBvdf5g4d9\nBPbaLGCvfeP7+QvW2mvt38tGvq71vut9BVEURRAREZmhsncBRETkuBgSREQkiSFBRESSGBJERCSJ\nIUFERJIYEkREJIkhQUREktzsXQCRvYWHh6OiogJqtRqiKEIQBMyePRtvv/22vUsjsjuGBBGA9PR0\nPPHEE3Z5b4PBALVabZf3JuoMbzcRATA38cCNGzeQmJiISZMmITQ0FMuXL0dtba1p//fffw+tVotx\n48bhjTfewNKlS5GWlmbaf+LECcTFxWHChAmYN28efvjhB9O+8PBwbNu2DTExMXj88cdhNBqVbSBR\nNzEkiCSIoohFixbhb3/7G44ePYqysjJs3rwZANDU1ITFixdjzpw5OHv2LGbNmoXjx4+bjv3++++R\nmpqKd999F2fPnsXcuXPx2muvoampyfSao0ePYtu2bTh37hxUKv5TJMfE30wiAL/97W8xceJETJgw\nARMnTsSePXswbNgwhIaGws3NDffffz8SExPx3XffAQAuXLgAg8GA559/Hmq1Gr/61a8wZswY0/n2\n7NmDZ555Bo8++igEQUBcXBzc3d1x8eJF02vmz58PPz8/uLu727y9RHKxT4IIwCeffNKhT6KyshJr\n167FuXPnUF9fD4PBgIEDBwIA9Ho9/Pz82rx+8ODBpq9/+uknHDx4ELt27QLQclXS3NyMW7dumV7j\n7++vVHOIrIYhQQTzfRKbNm2CIAg4cuQI+vfvj+zsbKxduxYAoNFoUFZW1ub1N2/exLBhwwC0BMCi\nRYvw6quvKl88kYJ4u4lIQl1dHTw9PeHl5YWysjJs377dtO+xxx6DWq1GRkYGDAYDsrOzkZeXZ9qf\nkJCAL7/80rStvr4ep06dQn19vc3bQdQTvJIgAvDaa69BpVKZnpOYMmUK3nzzTaxYsQLjx4/H8OHD\nERsbi08//RQA0KdPH2zevBmpqanYtGkTwsLCEB4ebupfeOSRR/Duu+9izZo1KCoqgoeHB8aNG4cJ\nEyYAAARBsFdTibpEUHLRocbGRjz33HNoamqCwWBAZGQkXn/9dRQXF2PZsmW4c+cORo8ejQ0bNsDN\njXlFzi0hIQHz5s2DVqu1dylEVqPo7SZ3d3fs3LkTmZmZyMzMxOnTp3Hx4kW8//77SEpKwldffQVv\nb2/s3btXyTKIFPHdd9+hvLwcBoMBBw4cwOXLl/HUU0/Zuywiq1K8T6Jv374AWq4qmpubIQgCdDod\nIiMjAQBarbbN+HIiZ3Ht2jXExsZi/Pjx+PTTT/Hxxx/D19fX3mURWZXi93iMRiNmz56NoqIiPPfc\ncwgMDET//v1NDw/5+/u3GRZI5CwSEhKQkJBg7zKIFKX4lYRKpTLdasrLy8OVK1c6vIadeEREjslm\nQ2C9vLwwYcIEXLx4EdXV1aa5akpLS/HAAw90eryC/etERCRB0dtNlZWV6NOnD7y9vfHzzz8jNzcX\nCxcuxKRJk3Ds2DHMnDkTBw4cwPTp0zs9lyAI0OtrlCzXrjQab7bPSbly2wC2z9lpNN49Ol7RkNDr\n9Vi5ciWMRiOMRiNmzpyJqVOnIigoCMuWLUNaWhpGjRqFp59+WskyiIiomxR9TsLaXD3t2T7n5Mpt\nA9g+Z9fTKwlOy0FERJIYEkREJIkhQUREkhgSREQkiSFBRESSGBJERCSJIUFERJIYEkREJIkhQURE\nkhgSREQkiSFBRESSGBJERCSJIUFERJIYEkREJIkhQUREkhgSREQkiSFBRESSGBJERCSJIUFERJIY\nEkREJIkhQUREkhgSREQkiSFBRESSGBJERCSJIUFERJIYEkREJIkhQUREkhgSREQkyU3Jk5eWliI5\nORnl5eVQq9VISEjACy+8gC1btmD37t0YNGgQAGDp0qUICwtTshQiIuoGRUNCrVZj1apVGDVqFOrq\n6jB79mxMnjwZAJCUlISkpCQl356IiHpI0ZDQaDTQaDQAAE9PT4wcORK3bt0CAIiiqORbExGRFdis\nT6K4uBgFBQUYM2YMACAjIwOxsbFITU1FTU2NrcogIqIusElI1NXVYcmSJUhJSYGnpyeeffZZZGdn\n4+DBg/D19cW6detsUQYREXWRICp836e5uRmvvvoqwsLCkJiY2GF/SUkJFi1ahMOHDytZBhERdYOi\nfRIAkJKSguDg4DYBodfrTX0Vx48fR0hIiKxz6fWue1tKo/Fm+5yUK7cNYPucnUbj3aPjFQ2J8+fP\n4/DhwwgJCUFcXBwEQcDSpUtx5MgR5OfnQ6VSYejQoVizZo2SZRARUTcpGhLjxo1Dfn5+h+18JoKI\nyDnwiWsiIpLEkCAiIkkMCSIiksSQICIiSQwJIiKSxJAgIiJJDAkiIpLEkCAiIkkMCSIiksSQICIi\nSQwJIiKSxJAgIiJJDAkiIpLEkCAiIkkMCSIiksSQICIiSQwJIiKSxJAgIiJJspcv/fnnn6HX6+Hh\n4YEHHnhAyZqIiMhBWAwJo9GIzMxM7NmzBwUFBfDy8kJjYyPc3NwQERGBF198EQ8++KCtaiUiIhuz\nGBLz5s3DY489hlWrVmH06NFQq9UAgIqKCnzzzTdYvXo1nnnmGURFRdmkWCIisi1BFEVRamdlZSV8\nfHwsnkDOa6xFr6+xyfvYg0bjzfY5KVduG8D2OTuNxrtHx1vsuDb3x7+iogIXLlyw+BoiInINskY3\nPfvss6ipqUF1dTXi4uKQmpqK9evXK10bERHZmayQqK+vh7e3N06cOIHo6GgcPnwYZ86cUbo2IiKy\nM1kh0djYCADQ6XSYPHkyVCqVqRObiIhcl6yQmDhxIiIjI3Hu3DlMnDgR1dXVUKn4HB4RkauT9TDd\nO++8g4KCAgQGBsLd3R21tbVYu3at0rUREZGdWQyJwsJC09d9+vRBaWmp6Xt3d3flqiIiIodgMSQW\nLlwIQRAgiiJu3rwJLy8vAEBtbS0GDx6Mr7/+2uLJS0tLkZycjPLycqjVasTHx2P+/Pm4c+cOli5d\nipKSEgQEBOCjjz6Ct3fPxvISEZH1WQyJ1hBYu3Ytxo0bh9/85jcAgGPHjuHSpUudnlytVmPVqlUY\nNWoU6urqMHv2bEyZMgX79+9HaGgoXnnlFWzduhXp6elYvny5FZpDRETWJKv3OS8vzxQQAPDrX/8a\n3377bafHaTQajBo1CgDg6emJkSNHoqysDDk5OdBqtQAArVaL7Ozs7tROREQKkxUSDQ0NOHfunOn7\nc+fOoaGhoUtvVFxcjIKCAowdOxYVFRXw9fUF0BIkt2/f7tK5iIjINmSPblq2bBn69u0LALh79y42\nbdok+03q6uqwZMkSpKSkwNPTE4IgdKvYns5B4ujYPuflym0D2D65Tv+jGHty/hdFZTUY5ueN+On/\ngrDHA6xybnuRFRLjx49HdnY2rl27BlEUERQUJHt0U3NzM5YsWYLY2FhEREQAAAYNGoTy8nL4+vpC\nr9fLnv/J1SfhYvuckyu3DWD75NJdKkP6oe9N31+/WY2Nu86juvpnTHrYr8fn7y5FJ/i7l8FggLu7\nO9zc3FBUVNRmeKwlKSkpCA4ORmJiomlbeHg49u/fDwA4cOAApk+f3sWyiYgcS1budYntP9q0DmuT\ndSWRkZGB999/HwMHDjTdKhIEATk5ORaPO3/+PA4fPoyQkBDExcVBEAQsXboUr7zyCt58803s27cP\nQ4YMQVpaWs9bQkRkRz+V15vdfrOizsaVWJeskNixYweOHDmCoUOHdunk48aNQ35+vtl9n376aZfO\nRUTkyIb49kOxvmMgDB7kaYdqrEfW7SaNRtPlgCAi6k2iQkdIbB9u20KsTNaVxOTJk7FhwwZERUXB\nw8PDtD04OFixwoiIHInuUhmycq/jp/J6DPHth6jQEW06pFu/zsr9ETcr6jB4kCeiQofbtdPaGmSF\nRGZmJoCWJ61byemTICJyBe1HLhXr60zftw8KZw+F9mSFRGdzNBERuTJLI5dcLRTakxUSQMuMsDqd\nDgDwxBNPYOTIkYoVRUTUXZ3dFuoOVx25JIesjuvMzEy8+OKLyM/PR35+PpKSknDo0CGlayMi6pLW\n20LF+joYRdF0W0h3qaxH5x3i28/sdmcfuSSH7CGwBw4cgEajAQDo9XosWLAAMTExihZHRNQVSt0W\nigod0aZP4v+3O/fIJTlk325qDYj2XxMROQqlbgu56sglOWSFxLBhw/Dxxx9j7ty5EAQBu3fvRmBg\noNK1ERF1iZIPtLniyCU5ZPVJ/P73v8e1a9cQExODmJgYXL16FWvWrFG6NiKiLnHVB9rsSdaVxKBB\ng/Dhhx8qXQsRUY/05ttCSpEVElu3bkVCQgIGDhwIALh9+zb27duHl19+WdHiiIi6qrfeFlKKrNtN\nWVlZpoAAgPvvvx9HjhxRrCgiInIMskJCFMUO2wwGg9WLISIixyIrJEaMGIH/+q//giiKMBqN2LFj\nB4YNG6Z0bUREZGeyQiI1NRUnTpzAmDFj8Nhjj+HUqVNYvXq10rUREZGdyeq49vPzw86dO1Ff3/Kg\nSr9+5h9RJyIi1yK7T2LPnj345JNP0K9fPxQXF+O///u/la6NiIjsTFZIrFu3Dt9++y2ys7MBAJ6e\nnnjvvfcULYyIiOxP1u0mnU6HzMxMaLVaAC1DYO/evatoYUREligxJTh1JCskPDw8IAiC6Xuj0ahY\nQUREnZG7Uhz1nKzbTSEhITh06BBEUURxcTH+/d//HePGjVO6NiIisyxNCU7WJSskVq5cibNnz0Kv\n1yM+Ph4GgwErVqxQujYiIrN680pxtibrdpOXlxfWrl2rdC1ERLIoOSU4tSXrSuLo0aOora0FAKSl\npWHBggX45z//qWhhRES6S2VYvV2Hl9efwOrtOtMypJwS3HZkhcSf/vQneHl5IS8vD2fOnEFcXByv\nLIhIUZbWq570sB9ejRmNAI0X1CoBARovvBozmp3WCpB1u8nNreVlf/vb3xAfH4/o6Gjs2LFD0cKI\nqHfrbL1qTgluG7KuJARBwKFDh5CVlYXQ0FAAQFNTk6KFEVHvxs5pxyArJN5++20cO3YM8fHxCAwM\nxPXr1zFp0qROj0tJScHkyZMRHR1t2rZlyxaEhYVBq9VCq9Xi9OnT3a+eiFzWEF/zc8Sxc9q2BNHc\nYhFWcu7cOXh6eiI5ORmHDx8G0BISnp6eSEpK6vL59Poaa5foMDQab7bPSbly2wD7ta/9A3OtrN33\n0Bs+v56w2Cfxl7/8BYmJidiwYYPZ/cnJyRZPPn78eJSUlHTYrmAuEZGD6uo0Glyv2jFYDAkPDw8A\n1p8aPCMjAwcPHsQjjzyClStXwtu7Z0lHRI6tu9NosHPa/hS93QQAJSUlWLRokel2U2VlJe6//34I\ngoAPP/wQer2eM8oSubjF75/A9ZvVHbaPGNwfm5f/0g4VkVwWryQyMjIsHvzcc891+Q19fHxMXyck\nJGDRokWyj3X1+4Zsn3Ny5bYB1mlfUan542+U1dj9Z9cbPr+esBgS1niquv2Fil6vh0ajAQAcP34c\nISEhPX4PInJsnEbDeVkMiXXr1vXo5G+99RZ0Oh2qqqowbdo0LF68GDqdDvn5+VCpVBg6dCjWrFnT\no/cgIscXFTrC7EglTqPh+CyGxKlTpywePHXqVIv7N23a1GHbnDlzZJRFRK6EI5Wcl8WQ+POf/yy5\nTxCETkOCiKgVRyo5J4sh8dlnn9mqDiIickAWQ+LGjRsIDAxEYWGh2f3BwcGKFEVERI7BYkisXbsW\n6enpWLhwYYd9giAgJydHscKIiMj+LIZEeno6AODrr7+2STFERORYZK0nAQANDQ0oLS2FwWAwbePt\nJiLH1NV5koikyAqJnTt34sMPP8SAAQOgUrXMLs7bTUSOqbvzJBGZIysk/vKXv+DYsWPw8+MvGJGj\n62xFN6KukLXokL+/PwOCyElIrehWrK/F6u066C6V2bgicmayriQWL16M1NRUTJ061TR9OND5E9dE\nJM3a/Qat5zNamNiZt56oq2SFxIkTJ3DixAlcv369TZ8EQ4Koe6zdbyC1ipsU3noiuWSFxPHjx/H1\n11/jvvvuU7oeol6hJ/0G5q5ApM4n5WZFxxlZicyRFRKBgYFwc5M9WpaIOiHVb9DZH2+pKxBB6Nr7\nc4pukkvWX/7hw4cjMTERERERcHd3N23vzqJDRNT5+gpS/RVSVwxuKhWaDMYO2328PVBZc7fDdk7R\nTXLJCommpiYMGzYMly9fVroeol7B0voKlvorpK5Amo0dAwIA4n/Z8sArp+im7pIVEj1dfIiI2rK0\nvsLq7Tqzx6Qf+h591AKMho77hvp6ISp0uGQYMBSouzpdvvSRRx6R3N/Y2IgbN25g5MiRVi+MyNVJ\nra8gdbUAAE0G88NbWwOBYUDW1ukEfw0NDZg1axbGjh0LX19f3L17F9euXcM333yDU6dOYeXKlQwJ\nIiuS6q+4Vx+1CkZR5O0jUpzFkNi8eTPy8vLw17/+Ff/xH/+B0tJS9O3bFyEhIYiIiEBGRga8vLxs\nVStRryDVX3EvoyhiW/IvbVQR9Wad9kmMGTMGY8aMsUUtRISW21B7ThSaHZXUikNYyVZkzd1ERLbV\nOipJCoewkq3wCTkiO+hs3qZJD/uhsOQOcs4Xdzh2+rgA9kGQzTAkiGxM7rxNz/0qBMFDB/AZB7Ir\nhgSRjXVl3iYOayV7k9UnUVFRgeXLl5um4SgoKMAXX3yhaGFErqq78zYR2YOskHj77bcxbtw4VFdX\nAwCCgoLw+eefK1oYkasa4tvP7HaOWCJHJCskysrKMG/ePKjVagCAu7u7aV0JIuqaqNAREts5Yokc\nj6w+ifbThFdXV0O0sPoVUW/TlVXmLM3bRORoZIXEjBkzsHr1atTV1WH//v34/PPPMWfOnE6PS0lJ\nwcmTJzFo0CAcPnwYAHDnzh0sXboUJSUlCAgIwEcffQRvb++etYLIjk7/o7jLq8yxQ5qchax7Ri+/\n/DLGjx+P0aNH49SpU3jhhReQmJjY6XGzZ8/G9u3b22zbunUrQkND8dVXX2HSpElIT0/vXuVEDmJP\nzv+a3Z6V+6ONKyGyPtlDYGNiYhATE9Olk48fPx4lJSVttuXk5GDXrl0AAK1WixdeeAHLly/v0nmJ\nHElRWY3Z7RytRK5AVkhUVFRg165dKCoqQnNzs2l7Wlpal9+wsrISvr6+AACNRoPbt293+RxE1tCV\nfgRLrx3m543rN6s7HMPRSuQKZIXEv/7rv+Lhhx9GaGioaYSTPWg0rt13wfbZjlQ/wtbD32O4f3/E\nT/8XhD0eYPG1/fvfh7DHAxA//V+wcdf5Du8xL/Ihh2pzT7hKO6S4evt6QlZINDQ04J133rHKGw4a\nNAjl5eXw9fWFXq+Hj4+P7GP1evOX9a5Ao/Fm+2zoi68KzG4XReD6zWps3HUe2w/+E/G/DJZ8QvqL\nr37AqIABCHs8ANXVP3cYrTQqYIBDtbm7HO2zs7be0L6ekBUSY8eOxQ8//ICHHnqoy2/QfqhseHg4\n9u/fj4ULF+LAgQOYPn16l89J1FOWVn9rVVlzF+mHvocgmN9/b58DRyuRq5IVEs888wyef/55+Pv7\nw8PDw7R97969Fo976623oNPpUFVVhWnTpmHx4sVYuHAh3njjDezbtw9DhgzpVr8GUU/JWf2tlZtK\nhSaDscN29jlQbyArJFasWIFFixbh4Ycf7lKfxKZNm8xu//TTT2Wfg8iaWjugS8rljzxqNnYMCIBP\nSFPvICskPDw8sGDBAqVrIeoWuaOU2k/RLddQXy9EhQ7nE9LUK8kKiaeeegqnT59GWFiY0vUQdYnc\ntRkA6Sm6AzQtIbDnZCEqqzsuGdoaCAwF6o1khcTu3buxdetWeHp6wt3dHaIoQhAE5ObmKl0fkUWW\n1mZo3d96hSF1i+lmRZ0pBFquSnjFQNRKVkjs27dP6TqIukVqlFJJeW2HKwwp93ZA84qBqC1ZITF0\n6FCl6yDqFqlRSlIjksxhBzSRNIshsWLFCmzcuBFz5syBYGaweGdDYImUcG9H9UAvd7OvkRqRJAgt\nHdG8nUQkj8WQaJ3p9Xe/+51NiiEyp30oVNb8f+dy69c+3h64U9do+sOflXvd7BXGUF8vrFkw0UaV\nEzk/iyHx+eef47333sPEifxHRfbRfvTSvQFxr3739cH7v53SZpu54a68tUTUNRZDIj8/31Z1EJm1\n50ShrNe1n5abq78RWYfs9SSIlNT+gbiHht2PH4puS145tGduigyOVCLqOYshcfnyZYSGhnbYzuck\nyJrMPRAnd16lVryNRKQMiyExYsQIbN261Va1UC8l9UCcHH3UKrwUNYpXDEQKsRgS7u7ufEaCFCdn\n2m4pDAgiZaks7ezTp4+t6qBebIhvP9mv7aNWQSW0zLf0asxoBgSRwixeSezevdtWdVAvFhU6Qtbs\nrAwFItvj6Cayu9Y//FKzsPr090D8tGAGBJEdMCTI6syt7zBrquV1djkLK5FjEsT2i1A7MFdfrNzZ\n2mcuDADzTzq3zJnkKbkgkDNzxs+uK9g+56bRWP4PWmd4JUHdIrXYj4+3h9nXi6LlBYGIyDFZHN1E\nJEXq2QY5T0i3LghERI6PIUHd0pNnG9rPs0REjou3m0iW9v0P/e5zQ21DU7fOZW6eJSJyTAwJ6pS5\n/oee4DxLRM6DIUGd6sncSq1UAjDE14tDWomcDEOCOtWd/of2K8XNmhrs0sMMiVwVQ4I6NcS3n+xb\nTHw6msi1MCSoU3LmVgrQ8FYSkStiSFCnTHMrnSg0+xwEJ94jcl12C4nw8HB4eXlBpVLBzc0Ne/fu\ntVcpJAPnViLqnewWEoIg4LPPPsOAAQPsVUKvZG6+pa78kee60US9i91CQhRFGI1Ge719ryQ13xLA\nuZSIyDy7XkksWLAAgiBg7ty5SEhIsFcpLkfqakHqeYes3B8ZEkRklt1C4ssvv4RGo0FlZSWSkpIQ\nFBSE8ePH26sclyF1tVBYckfyeQfOpUREUhxiPYktW7bA09MTSUlJ9i7F6S1+/wSu36w2u893YF+U\nVzV02D5icH9sXv5LpUsjIidklyuJhoYGGI1GeHp6or6+HmfOnMHrr7/e6XGu/MSutRY+KSqVPodU\nH1DkhEDFf7auvLCLK7cNYPucnVMuOlReXo7XX38dgiDAYDAgOjoaTz75pD1KcTmWno6+U9uIV2NG\ncwgrEclml5AIDAzEwYMH7fHWLs/S09GDB3lyCCsRdQkXHXIxkx72w/RxAWb3cYpuIuoqTsvhgp77\nVQiChw7gbSUi6jGGhIvibSUisgbebiIiIkkMCSIiksSQICIiSeyTsKGezsBKRGRrDAkbyTh+GTnn\ni03fcwZWInIGDAmF6S6VSa7oBnAGViJybAwJBbWfkdUczsBKRI6MIWEl5vobpNZvuNfgQZ5Kl0ZE\n1G0MCSuQWsNBEDo/llNlEJEj4xBYK5C6YnBTWf7xTh8XwP4IInJovJKwAqkV35ol1m/w8fZA/C+D\nGRBE5PAYElYgtYbDUF8vRIUO50R7ROS0GBKdkPMAnNQaDq2BwFAgImfFkLBAqkMaaPsAXOvXvGIg\nIlfDkGjn3isHtUS/s7kH4HjFQESuiCFxj/ZXDkaD+dfxATgi6i04BPYech5+A/gAHBH1HgyJe0gN\nZW2PD8ARUW/B2033kBrK2ketglEU2SFNRL0OQ+IeUkNZX4oaxWAgol6JIXEPDmUlImqLIdEOh7IS\nEf0/dlwTEZEkp76S4JrRRETKctqQkDtlBhERdZ/dbjedPn0av/71rxEZGYmtW7d2+XipB9+ycn/s\nWWFERGRil5AwGo149913sX37dhw5cgRZWVm4cuVKl84h9eAbp8wgIrIeu4REXl4ehg8fjqFDh6JP\nnz6IiopCTk6OxWNiVxzC6u066C6VAWh58M0cTplBRGQ9dgmJsrIyDB482PS9n58fbt26ZfEYo1E0\n9TvoLpUhKnSE2ddxygwiIuuxS0iIotij41un6n41ZjQCNF5QqwQEaLzwasxodloTEVmRXUY3+fv7\n46effjJ9X1ZWhgceeED28Tcr6qDReGPWVG/MmhqsRIl2odF427sERbly+1y5bQDb15vZJSQeffRR\nFBUVoaSkBBqNBllZWfjggw8sHnN4U6yNqiMiolZ2CQm1Wo1/+7d/w0svvQRRFPH0009j5MiR9iiF\niIgsEMSedhAQEZHL4txNREQkiSFBRESSGBJERCTJ4UOip3M8OaLw8HDExMQgLi4OTz/9NADgzp07\neOmllxAZGYkFCxagpqbGzlXKl5KSgsmTJyM6Otq0zVJ71q5dixkzZiA2Nhb5+fn2KLlLzLVvy5Yt\nCAsLg1arhVarxenTp0370tPTMWPGDPzmN7/BmTNn7FGybKWlpZg/fz5mzpyJ6Oho7Ny5E4DrfH7t\n2/fZZ58BcJ3Pr7GxEfHx8YiLi0N0dDS2bNkCACguLkZCQgIiIyOxbNkyNDc3m16/dOlSzJgxA3Pn\nzm3zKIIk0YEZDAYxIiJCLC4uFhsbG8WYmBixsLDQ3mX1WHh4uFhVVdVm24YNG8StW7eKoiiK6enp\n4saNG+1RWrd899134qVLl8RZs2aZtkm15+TJk+Irr7wiiqIoXrhwQYyPj7d9wV1krn2bN28Wd+zY\n0eG1hYWFYmxsrNjU1CTeuHFDjIiIEI1Goy3L7ZJbt26Jly5dEkVRFGtra8UZM2aIhYWFLvP5SbXP\nVT4/URTF+vp6URRFsbm5WYyPjxcvXLggvvHGG+LRo0dFURTF1atXi1988YUoiqKYkZEhvvPOO6Io\nimJWVpb45ptvdnp+h76S6M4cT85AFEUYjcY223JycqDVagEAWq0W2dnZ9iitW8aPH4/+/fu32da+\nPa2fW04mDIcbAAAJLUlEQVRODuLi4gAAY8eORU1NDcrLy21bcBeZax9gfuaAnJwczJw5E25ubggI\nCMDw4cORl5dnizK7RaPRYNSoUQAAT09PjBw5EmVlZS7z+ZlrX+sUQK7w+QFA3759AbRcJTQ3N0MQ\nBOh0OkRGRgJo+/fk3s81MjISubm5nZ7foUOiO3M8OQNBELBgwQLMmTMHe/bsAQBUVFTA19cXQMsv\n9u3bt+1ZYo9VVla2aU9lZSUA4NatW/D39ze9zs/PD2VlZXapsacyMjIQGxuL1NRU0+0Yc7+zztK+\n4uJiFBQUYOzYsR1+H13h82tt35gxYwC4zudnNBoRFxeHKVOmYMqUKQgMDET//v2hUrX8eff39ze1\n4d7PT61Wo3///qiqqrJ4focOCXNJ7wq+/PJL7N+/H9u2bUNGRgbOnTsHQRDsXZZNmPtMnbHtzz77\nLLKzs3Hw4EH4+vrij3/8IwDnbV9dXR2WLFmClJQUeHp6StbsKu1zpc9PpVIhMzMTp0+fRl5entll\nF1rb0L59oih22j6HDomezvHkqDQaDQDAx8cHERERyMvLw6BBg0yX7Xq9Hj4+PvYsscek2uPn54fS\n0lLT60pLS53yM/Xx8TH940pISDDdkvD398fNmzdNr3OG9jU3N2PJkiWIjY1FREQEANf6/My1z5U+\nv1ZeXl6YMGECLl68iOrqatMt7XvbcO/nZzAYUFtbiwEDBlg8r0OHxL1zPDU2NiIrKwvTp0+3d1k9\n0tDQgLq6loWR6uvrcebMGYSEhCA8PBz79+8HABw4cMDp2tn+fyhS7Zk+fToyMzMBABcuXED//v1N\ntzUcWfv26fV609fHjx9HSEgIgJZ2Hz16FI2Njbhx4waKiopMtzccVUpKCoKDg5GYmGja5kqfn7n2\nucrnV1lZabpV9vPPPyM3NxfBwcGYNGkSjh07BqDt5xceHo4DBw4AAI4dO4Ynnnii0/dw+Gk5Tp8+\njT/84Q+mOZ4WLlxo75J65MaNG3j99dchCAIMBgOio6OxcOFCVFVV4c0338TNmzcxZMgQpKWlme0s\ndURvvfUWdDodqqqq4Ovri8WLFyMiIgJvvPGG2fasWbMG33zzDfr27Yt169Zh9OjRdm6BZebap9Pp\nkJ+fD5VKhaFDh2LNmjWmP5bp6enYu3cv3NzckJqaiieffNLOLZB2/vx5PP/88wgJCYEgCBAEAUuX\nLsWYMWMkfx+d6fOTat+RI0dc4vP74YcfsHLlShiNRhiNRsycOROvvfYabty4gWXLlqG6uhqjRo3C\nxo0b0adPHzQ2NmLFihXIz8/HwIED8cEHHyAgIMDiezh8SBARkf049O0mIiKyL4YEERFJYkgQEZEk\nhgQREUliSBARkSSGBBERSWJIkMNrbm5GWloaIiMjER0djaioKKxfvx4Gg8HicatWrUJGRgaAlqmh\nN2zY0Ol7ZWdn43/+53+sUrcSSkpKsHv3bnuXQb0IQ4Ic3sqVK3HlyhVkZmbi8OHDOHToEIKCgtDY\n2Gj198rJyXHoWT+Li4vx17/+tVvHdhaqROa42bsAIkt+/PFH5OTkmJ7wBVpmr4yPjwfQMgPmxo0b\nTYvDPPnkk0hOTrY4adnly5fx+9//Hg0NDWhsbERCQgLmz5+PM2fO4Ouvv0Zubi727t2LF198EcXF\nxTh+/DgEQUBjYyOuXr2K7777Dl5eXm3O+Y9//AMbN25EXV0dBEFAcnIyJk+ejLy8PLz33ntoaGhA\n3759kZqaikcffRRnz57F+vXrsW/fPgBo8/3Zs2fx3nvvYcyYMbhw4QJUKhU++OADBAUF4d1330VJ\nSQm0Wi2GDRuGtLQ0XL16FevWrUNVVRWampowf/58zJ49GwDwi1/8AitWrMDJkycxYcIELFmyxOqf\nEbk4q6x6QaSQo0ePinFxcZL7P//8czEpKUlsbm4Wm5qaxMTERNMCKytXrhR37dolimLLIkHr168X\nRVEU6+rqxMbGRtPXM2fOFK9cudLhmPZWrFgh/vGPf+ywvaqqSpwyZYp44cIFURRF0Wg0itXV1WJj\nY6M4bdo0MTc3VxRFUfz73/8uTps2TWxqahJ1Op04Z84c0znu/V6n04mjR48W8/PzRVEUxT/96U/i\n8uXLO7xOFFsWmtFqteLVq1dFUWxZWCcyMtL0/UMPPST++c9/lvz5EXWGVxLk0MROZo3Jzc2FVquF\nWq0GAMyePRvZ2dl45plnJI9paGjAO++8g4KCAqhUKuj1ehQUFCAoKEjymI8++ggNDQ343e9+12Hf\nhQsXEBwcjLFjxwJomZbZ29sbly9fhru7u2kStdDQULi7u+PatWudtvvBBx/EL37xCwAti/ucPHnS\n7OuuX7+Oq1evYtmyZaafVVNTE65cuYIHH3wQAEyLBBF1B0OCHNro0aNx/fp11NTUwNvbu8N+0cx8\n+J3Nj//BBx9Ao9Fgw4YNpgWgLPVv7Nu3D99++61p/WdzNcjd3lqvWq1uszrh3bt327zOw8PD9LVa\nrTatUWzufD4+PqaZPdsTBAH9+vUzu49IDnZck0MbPnw4wsPDsXr1atMU6waDATt37kRDQwMmT56M\nAwcOoLm5GU1NTcjMzMSUKVMsnrOmpgaDBw+GIAi4fPkyzp07Z9rn6emJ2tpa0/d///vfsW3bNnzy\nySdwd3c3e77HH38chYWFuHjxIoCWfpLq6moEBQWhqakJZ8+eBQB8++23aG5uxogRIxAQEIDi4mLU\n1NRAFEVkZWXJ+nl4eXmZpoYGWq447rvvPhw8eNC07erVq6afVWdXYkSd4ZUEObz169dj8+bNmD17\nNtzd3SGKIsLCwuDu7o65c+eiqKjItG7vU089ZerUlvLaa68hOTkZhw4dwrBhwzBhwgTTvtjYWKxa\ntQrHjh3Diy++iH379qGhoQELFiwwXQVkZGS0+d/5gAEDsGXLFqxbtw719fVQq9VITk5GaGgoPv74\nY6xdu9bUcb1582a4ubnBz88PSUlJ0Gq1CAwMxKOPPorCwsJOfxYPPfQQHnzwQURHRyMoKAhpaWn4\nz//8T/zhD3/Ajh07YDAY4Ovri48++giA46+qRo6PU4UTEZEk3m4iIiJJDAkiIpLEkCAiIkkMCSIi\nksSQICIiSQwJIiKSxJAgIiJJDAkiIpL0f3zF2/hGE4QYAAAAAElFTkSuQmCC\n",
+            "text/plain": [
+              "\u003cmatplotlib.figure.Figure at 0x7fc3af690a50\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        }
+      ],
+      "source": [
+        "with context.eager_mode():\n",
+        "\n",
+        "  counts = []\n",
+        "  times = []  \n",
+        "  for n in np.logspace(0, 7, 50):\n",
+        "\n",
+        "    n_tensor = tf.constant(n, dtype=tf.float32)\n",
+        "    count = collatz(n_tensor)\n",
+        "\n",
+        "    res = %timeit -n10 -r1 -o -q collatz(n_tensor)\n",
+        "    times.append(res.best)\n",
+        "    counts.append(count)\n",
+        "      \n",
+        "plot_results(counts, times, 'Eager')\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [
+        "x5ChBlH09jk_",
+        "_cRFTcwT9mnn"
+      ],
+      "default_view": {},
+      "last_runtime": {
+        "build_target": "",
+        "kind": "local"
+      },
+      "name": "Autograph vs. Eager Collatz speed test",
+      "provenance": [
+        {
+          "file_id": "0B8bm7KvwJklpMUQtbnVpYkdJUjRtOTRyWVVfSEhpRl9HYm5n",
+          "timestamp": 1531512047714
+        }
+      ],
+      "version": "0.3.2",
+      "views": {}
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_mnist_speed_test.ipynb b/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_mnist_speed_test.ipynb
new file mode 100644
index 0000000000..952ec091fb
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/notebooks/ag_vs_eager_mnist_speed_test.ipynb
@@ -0,0 +1,652 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "etTmZVFN8fYO"
+      },
+      "source": [
+        "This notebook runs a basic speed test for a short training loop of a neural network training on the MNIST dataset."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "eqOvRhOz8SWs"
+      },
+      "source": [
+        "### Imports"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "nHY0tntRizGb"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install -U -q tf-nightly"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "Pa2qpEmoVOGe"
+      },
+      "outputs": [],
+      "source": [
+        "import gzip\n",
+        "import os\n",
+        "import shutil\n",
+        "import time\n",
+        "\n",
+        "import numpy as np\n",
+        "import six\n",
+        "from six.moves import urllib\n",
+        "import tensorflow as tf\n",
+        "\n",
+        "from tensorflow.contrib import autograph as ag\n",
+        "from tensorflow.contrib.eager.python import tfe\n",
+        "from tensorflow.python.eager import context\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "PZWxEJFM9A7b"
+      },
+      "source": [
+        "### Testing boilerplate"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "kfZk9EFZ5TeQ"
+      },
+      "outputs": [],
+      "source": [
+        "# Test-only parameters. Test checks successful completion not correctness. \n",
+        "burn_ins = 1\n",
+        "trials = 1\n",
+        "max_steps = 2\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "k0GKbZBJ9Gt9"
+      },
+      "source": [
+        "### Speed test configuration"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "gWXV8WHn43iZ"
+      },
+      "outputs": [],
+      "source": [
+        "#@test {\"skip\": true} \n",
+        "burn_ins = 3\n",
+        "trials = 10\n",
+        "max_steps = 500\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "kZV_3pGy8033"
+      },
+      "source": [
+        "### Data source setup"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "YfnHJbBOBKae"
+      },
+      "outputs": [],
+      "source": [
+        "def download(directory, filename):\n",
+        "  filepath = os.path.join(directory, filename)\n",
+        "  if tf.gfile.Exists(filepath):\n",
+        "    return filepath\n",
+        "  if not tf.gfile.Exists(directory):\n",
+        "    tf.gfile.MakeDirs(directory)\n",
+        "  url = 'https://storage.googleapis.com/cvdf-datasets/mnist/' + filename + '.gz'\n",
+        "  zipped_filepath = filepath + '.gz'\n",
+        "  print('Downloading %s to %s' % (url, zipped_filepath))\n",
+        "  urllib.request.urlretrieve(url, zipped_filepath)\n",
+        "  with gzip.open(zipped_filepath, 'rb') as f_in, open(filepath, 'wb') as f_out:\n",
+        "    shutil.copyfileobj(f_in, f_out)\n",
+        "  os.remove(zipped_filepath)\n",
+        "  return filepath\n",
+        "\n",
+        "\n",
+        "def dataset(directory, images_file, labels_file):\n",
+        "  images_file = download(directory, images_file)\n",
+        "  labels_file = download(directory, labels_file)\n",
+        "\n",
+        "  def decode_image(image):\n",
+        "    # Normalize from [0, 255] to [0.0, 1.0]\n",
+        "    image = tf.decode_raw(image, tf.uint8)\n",
+        "    image = tf.cast(image, tf.float32)\n",
+        "    image = tf.reshape(image, [784])\n",
+        "    return image / 255.0\n",
+        "\n",
+        "  def decode_label(label):\n",
+        "    label = tf.decode_raw(label, tf.uint8)\n",
+        "    label = tf.reshape(label, [])\n",
+        "    return tf.to_int32(label)\n",
+        "\n",
+        "  images = tf.data.FixedLengthRecordDataset(\n",
+        "      images_file, 28 * 28, header_bytes=16).map(decode_image)\n",
+        "  labels = tf.data.FixedLengthRecordDataset(\n",
+        "      labels_file, 1, header_bytes=8).map(decode_label)\n",
+        "  return tf.data.Dataset.zip((images, labels))\n",
+        "\n",
+        "\n",
+        "def mnist_train(directory):\n",
+        "  return dataset(directory, 'train-images-idx3-ubyte',\n",
+        "                 'train-labels-idx1-ubyte')\n",
+        "\n",
+        "def mnist_test(directory):\n",
+        "  return dataset(directory, 't10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte')\n",
+        "\n",
+        "def setup_mnist_data(is_training, hp, batch_size):\n",
+        "  if is_training:\n",
+        "    ds = mnist_train('/tmp/autograph_mnist_data')\n",
+        "    ds = ds.cache()\n",
+        "    ds = ds.shuffle(batch_size * 10)\n",
+        "  else:\n",
+        "    ds = mnist_test('/tmp/autograph_mnist_data')\n",
+        "    ds = ds.cache()\n",
+        "  ds = ds.repeat()\n",
+        "  ds = ds.batch(batch_size)\n",
+        "  return ds\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "qzkZyZcS9THu"
+      },
+      "source": [
+        "### Keras model definition"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "x_MU13boiok2"
+      },
+      "outputs": [],
+      "source": [
+        "def mlp_model(input_shape):\n",
+        "  model = tf.keras.Sequential((\n",
+        "      tf.keras.layers.Dense(100, activation='relu', input_shape=input_shape),\n",
+        "      tf.keras.layers.Dense(100, activation='relu'),\n",
+        "      tf.keras.layers.Dense(10, activation='softmax')))\n",
+        "  model.build()\n",
+        "  return model\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "DXt4GoTxtvn2"
+      },
+      "source": [
+        "# AutoGraph"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "W51sfbONiz_5"
+      },
+      "outputs": [],
+      "source": [
+        "def predict(m, x, y):\n",
+        "  y_p = m(x)\n",
+        "  losses = tf.keras.losses.categorical_crossentropy(y, y_p)\n",
+        "  l = tf.reduce_mean(losses)\n",
+        "  accuracies = tf.keras.metrics.categorical_accuracy(y, y_p)\n",
+        "  accuracy = tf.reduce_mean(accuracies)\n",
+        "  return l, accuracy\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "CsAD0ajbi9iZ"
+      },
+      "outputs": [],
+      "source": [
+        "def fit(m, x, y, opt):\n",
+        "  l, accuracy = predict(m, x, y)\n",
+        "  opt.minimize(l)\n",
+        "  return l, accuracy\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "RVw57HdTjPzi"
+      },
+      "outputs": [],
+      "source": [
+        "def get_next_batch(ds):\n",
+        "  itr = ds.make_one_shot_iterator()\n",
+        "  image, label = itr.get_next()\n",
+        "  x = tf.to_float(tf.reshape(image, (-1, 28 * 28)))\n",
+        "  y = tf.one_hot(tf.squeeze(label), 10)\n",
+        "  return x, y\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "UUI0566FjZPx"
+      },
+      "outputs": [],
+      "source": [
+        "def train(train_ds, test_ds, hp):\n",
+        "  m = mlp_model((28 * 28,))\n",
+        "  opt = tf.train.MomentumOptimizer(hp.learning_rate, 0.9)\n",
+        "\n",
+        "  train_losses = []\n",
+        "  test_losses = []\n",
+        "  train_accuracies = []\n",
+        "  test_accuracies = []\n",
+        "  ag.set_element_type(train_losses, tf.float32)\n",
+        "  ag.set_element_type(test_losses, tf.float32)\n",
+        "  ag.set_element_type(train_accuracies, tf.float32)\n",
+        "  ag.set_element_type(test_accuracies, tf.float32)\n",
+        "\n",
+        "  i = tf.constant(0)\n",
+        "  while i \u003c hp.max_steps:\n",
+        "    train_x, train_y = get_next_batch(train_ds)\n",
+        "    test_x, test_y = get_next_batch(test_ds)\n",
+        "    step_train_loss, step_train_accuracy = fit(m, train_x, train_y, opt)\n",
+        "    step_test_loss, step_test_accuracy = predict(m, test_x, test_y)\n",
+        "\n",
+        "    train_losses.append(step_train_loss)\n",
+        "    test_losses.append(step_test_loss)\n",
+        "    train_accuracies.append(step_train_accuracy)\n",
+        "    test_accuracies.append(step_test_accuracy)\n",
+        "\n",
+        "    i += 1\n",
+        "  return (ag.stack(train_losses), ag.stack(test_losses),\n",
+        "          ag.stack(train_accuracies), ag.stack(test_accuracies))\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 215
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 12156,
+          "status": "ok",
+          "timestamp": 1531752050611,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "K1m8TwOKjdNd",
+        "outputId": "bd5746f2-bf91-44aa-9eff-38eb11ced33f"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "('Duration:', 0.6226680278778076)\n",
+            "('Duration:', 0.6082069873809814)\n",
+            "('Duration:', 0.6223258972167969)\n",
+            "('Duration:', 0.6176440715789795)\n",
+            "('Duration:', 0.6309840679168701)\n",
+            "('Duration:', 0.6180410385131836)\n",
+            "('Duration:', 0.6219630241394043)\n",
+            "('Duration:', 0.6183009147644043)\n",
+            "('Duration:', 0.6176400184631348)\n",
+            "('Duration:', 0.6476900577545166)\n",
+            "('Mean duration:', 0.62254641056060789, '+/-', 0.0099792188690656976)\n"
+          ]
+        }
+      ],
+      "source": [
+        "#@test {\"timeout\": 90}\n",
+        "with tf.Graph().as_default():\n",
+        "  hp = tf.contrib.training.HParams(\n",
+        "      learning_rate=0.05,\n",
+        "      max_steps=max_steps,\n",
+        "  )\n",
+        "  train_ds = setup_mnist_data(True, hp, 500)\n",
+        "  test_ds = setup_mnist_data(False, hp, 100)\n",
+        "  tf_train = ag.to_graph(train)\n",
+        "  losses = tf_train(train_ds, test_ds, hp)\n",
+        "\n",
+        "  with tf.Session() as sess:\n",
+        "    durations = []\n",
+        "    for t in range(burn_ins + trials):\n",
+        "      sess.run(tf.global_variables_initializer())\n",
+        "\n",
+        "      start = time.time()\n",
+        "      (train_losses, test_losses, train_accuracies,\n",
+        "       test_accuracies) = sess.run(losses)\n",
+        "\n",
+        "      if t \u003c burn_ins:\n",
+        "        continue\n",
+        "\n",
+        "      duration = time.time() - start\n",
+        "      durations.append(duration)\n",
+        "      print('Duration:', duration)\n",
+        "\n",
+        "    print('Mean duration:', np.mean(durations), '+/-', np.std(durations))\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "A06kdgtZtlce"
+      },
+      "source": [
+        "# Eager"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "hBKOKGrWty4e"
+      },
+      "outputs": [],
+      "source": [
+        "def predict(m, x, y):\n",
+        "  y_p = m(x)\n",
+        "  losses = tf.keras.losses.categorical_crossentropy(tf.cast(y, tf.float32), y_p)\n",
+        "  l = tf.reduce_mean(losses)\n",
+        "  accuracies = tf.keras.metrics.categorical_accuracy(y, y_p)\n",
+        "  accuracy = tf.reduce_mean(accuracies)\n",
+        "  return l, accuracy\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "HCgTZ0MTt6vt"
+      },
+      "outputs": [],
+      "source": [
+        "def train(ds, hp):\n",
+        "  m = mlp_model((28 * 28,))\n",
+        "  opt = tf.train.MomentumOptimizer(hp.learning_rate, 0.9)\n",
+        "\n",
+        "  train_losses = []\n",
+        "  test_losses = []\n",
+        "  train_accuracies = []\n",
+        "  test_accuracies = []\n",
+        "\n",
+        "  i = 0\n",
+        "  train_test_itr = tfe.Iterator(ds)\n",
+        "  for (train_x, train_y), (test_x, test_y) in train_test_itr:\n",
+        "    train_x = tf.to_float(tf.reshape(train_x, (-1, 28 * 28)))\n",
+        "    train_y = tf.one_hot(tf.squeeze(train_y), 10)\n",
+        "    test_x = tf.to_float(tf.reshape(test_x, (-1, 28 * 28)))\n",
+        "    test_y = tf.one_hot(tf.squeeze(test_y), 10)\n",
+        "\n",
+        "    if i \u003e hp.max_steps:\n",
+        "      break\n",
+        "\n",
+        "    with tf.GradientTape() as tape:\n",
+        "      step_train_loss, step_train_accuracy = predict(m, train_x, train_y)\n",
+        "    grad = tape.gradient(step_train_loss, m.variables)\n",
+        "    opt.apply_gradients(zip(grad, m.variables))\n",
+        "    step_test_loss, step_test_accuracy = predict(m, test_x, test_y)\n",
+        "\n",
+        "    train_losses.append(step_train_loss)\n",
+        "    test_losses.append(step_test_loss)\n",
+        "    train_accuracies.append(step_train_accuracy)\n",
+        "    test_accuracies.append(step_test_accuracy)\n",
+        "\n",
+        "    i += 1\n",
+        "  return train_losses, test_losses, train_accuracies, test_accuracies\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 215
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 52499,
+          "status": "ok",
+          "timestamp": 1531752103279,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "plv_yrn_t8Dy",
+        "outputId": "55d5ab3d-252d-48ba-8fb4-20ec3c3e6d00"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "('Duration:', 3.9973549842834473)\n",
+            "('Duration:', 4.018772125244141)\n",
+            "('Duration:', 3.9740989208221436)\n",
+            "('Duration:', 3.9922947883605957)\n",
+            "('Duration:', 3.9795801639556885)\n",
+            "('Duration:', 3.966722011566162)\n",
+            "('Duration:', 3.986541986465454)\n",
+            "('Duration:', 3.992305040359497)\n",
+            "('Duration:', 4.012261867523193)\n",
+            "('Duration:', 4.004716157913208)\n",
+            "('Mean duration:', 3.9924648046493529, '+/-', 0.015681688635624851)\n"
+          ]
+        }
+      ],
+      "source": [
+        "#@test {\"timeout\": 90}\n",
+        "with context.eager_mode():\n",
+        "  durations = []\n",
+        "  for t in range(burn_ins + trials):\n",
+        "    hp = tf.contrib.training.HParams(\n",
+        "        learning_rate=0.05,\n",
+        "        max_steps=max_steps,\n",
+        "    )\n",
+        "    train_ds = setup_mnist_data(True, hp, 500)\n",
+        "    test_ds = setup_mnist_data(False, hp, 100)\n",
+        "    ds = tf.data.Dataset.zip((train_ds, test_ds))\n",
+        "    start = time.time()\n",
+        "    (train_losses, test_losses, train_accuracies,\n",
+        "     test_accuracies) = train(ds, hp)\n",
+        "    \n",
+        "    train_losses[-1].numpy()\n",
+        "    test_losses[-1].numpy()\n",
+        "    train_accuracies[-1].numpy()\n",
+        "    test_accuracies[-1].numpy()\n",
+        "\n",
+        "    if t \u003c burn_ins:\n",
+        "      continue\n",
+        "\n",
+        "    duration = time.time() - start\n",
+        "    durations.append(duration)\n",
+        "    print('Duration:', duration)\n",
+        "\n",
+        "  print('Mean duration:', np.mean(durations), '+/-', np.std(durations))\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [
+        "eqOvRhOz8SWs",
+        "PZWxEJFM9A7b",
+        "kZV_3pGy8033"
+      ],
+      "default_view": {},
+      "name": "Autograph vs. Eager MNIST speed test",
+      "provenance": [
+        {
+          "file_id": "1tAQW5tHUgAc8M4-iwwJm6Xs6dV9nEqtD",
+          "timestamp": 1530297010607
+        },
+        {
+          "file_id": "18dCjshrmHiPTIe1CNsL8tnpdGkuXgpM9",
+          "timestamp": 1530289467317
+        },
+        {
+          "file_id": "1DcfimonWU11tmyivKBGVrbpAl3BIOaRG",
+          "timestamp": 1522272821237
+        },
+        {
+          "file_id": "1wCZUh73zTNs1jzzYjqoxMIdaBWCdKJ2K",
+          "timestamp": 1522238054357
+        },
+        {
+          "file_id": "1_HpC-RrmIv4lNaqeoslUeWaX8zH5IXaJ",
+          "timestamp": 1521743157199
+        },
+        {
+          "file_id": "1mjO2fQ2F9hxpAzw2mnrrUkcgfb7xSGW-",
+          "timestamp": 1520522344607
+        }
+      ],
+      "version": "0.3.2",
+      "views": {}
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/autograph/examples/notebooks/algorithms.ipynb b/tensorflow/contrib/autograph/examples/notebooks/algorithms.ipynb
new file mode 100644
index 0000000000..bf824e2760
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/notebooks/algorithms.ipynb
@@ -0,0 +1,1512 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "b9R-4ezU3NH0"
+      },
+      "source": [
+        "## AutoGraph: examples of simple algorithms\n",
+        "\n",
+        "This notebook shows how you can use AutoGraph to compile simple algorithms and run them in TensorFlow.\n",
+        "\n",
+        "It requires the nightly build of TensorFlow, which is installed below."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "TuWj26KWz1fZ"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install -U -q tf-nightly"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "3kudk1elq0Gh"
+      },
+      "source": [
+        "### Fibonacci numbers\n",
+        "\n",
+        "https://en.wikipedia.org/wiki/Fibonacci_number"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 197
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 7512,
+          "status": "ok",
+          "timestamp": 1532101577266,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "H7olFlMXqrHe",
+        "outputId": "472dbfe0-9449-4f93-e908-1a0785188a92"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "0 :  1\n",
+            "1 :  2\n",
+            "2 :  3\n",
+            "3 :  5\n",
+            "4 :  8\n",
+            "5 :  13\n",
+            "6 :  21\n",
+            "7 :  34\n",
+            "8 :  55\n",
+            "9 :  89\n"
+          ]
+        }
+      ],
+      "source": [
+        "import tensorflow as tf\n",
+        "from tensorflow.contrib import autograph as ag\n",
+        "\n",
+        "\n",
+        "def fib(n):\n",
+        "  f1 = 0\n",
+        "  f2 = 1\n",
+        "  for i in range(n):\n",
+        "    tmp = f2\n",
+        "    f2 = f2 + f1\n",
+        "    f1 = tmp\n",
+        "    print(i, ': ', f2)\n",
+        "  return f2\n",
+        "\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  final_fib = ag.to_graph(fib)(tf.constant(10))\n",
+        "  with tf.Session() as sess:\n",
+        "    sess.run(final_fib)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "p8zZyj-tq4K3"
+      },
+      "source": [
+        "#### Generated code"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 541
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 103,
+          "status": "ok",
+          "timestamp": 1532101577412,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "UeWjK8rHq6Cj",
+        "outputId": "73ece895-12fb-489a-e52c-032945d7ed7a"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "from __future__ import print_function\n",
+            "import tensorflow as tf\n",
+            "\n",
+            "def tf__fib(n):\n",
+            "  try:\n",
+            "    with tf.name_scope('fib'):\n",
+            "      f1 = 0\n",
+            "      f2 = 1\n",
+            "\n",
+            "      def extra_test(f1_1, f2_1):\n",
+            "        with tf.name_scope('extra_test'):\n",
+            "          return True\n",
+            "\n",
+            "      def loop_body(i, f1_1, f2_1):\n",
+            "        with tf.name_scope('loop_body'):\n",
+            "          tmp = f2_1\n",
+            "          f2_1 = f2_1 + f1_1\n",
+            "          f1_1 = tmp\n",
+            "          with ag__.utils.control_dependency_on_returns(ag__.utils.\n",
+            "              dynamic_print(i, ': ', f2_1)):\n",
+            "            f2, i_1 = ag__.utils.alias_tensors(f2_1, i)\n",
+            "            return f1_1, f2\n",
+            "      f1, f2 = ag__.for_stmt(ag__.utils.dynamic_builtin(range, n),\n",
+            "          extra_test, loop_body, (f1, f2))\n",
+            "      return f2\n",
+            "  except:\n",
+            "    ag__.rewrite_graph_construction_error(ag_source_map__)\n",
+            "\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(ag.to_code(fib))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "eIfVy6ZTrFEH"
+      },
+      "source": [
+        "### Fizz Buzz\n",
+        "\n",
+        "https://en.wikipedia.org/wiki/Fizz_buzz"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 125
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 233,
+          "status": "ok",
+          "timestamp": 1532101577681,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "33CAheYsrEQ7",
+        "outputId": "82a493ee-15b5-419d-8c9c-5f4159090a05"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Buzz\n",
+            "11\n",
+            "Fizz\n",
+            "13\n",
+            "14\n",
+            "FizzBuzz\n"
+          ]
+        }
+      ],
+      "source": [
+        "import tensorflow as tf\n",
+        "from tensorflow.contrib import autograph as ag\n",
+        "\n",
+        "def fizzbuzz(i, n):\n",
+        "  while i \u003c n:\n",
+        "    msg = ''\n",
+        "    if i % 3 == 0:\n",
+        "      msg += 'Fizz'\n",
+        "    if i % 5 == 0:\n",
+        "      msg += 'Buzz'\n",
+        "    if msg == '':\n",
+        "      msg = tf.as_string(i)\n",
+        "    print(msg)\n",
+        "    i += 1\n",
+        "  return i\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  final_i = ag.to_graph(fizzbuzz)(tf.constant(10), tf.constant(16))\n",
+        "  with tf.Session() as sess:\n",
+        "    sess.run(final_i)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "Lkq3DBGOv3fA"
+      },
+      "source": [
+        "#### Generated code"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 1081
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 289,
+          "status": "ok",
+          "timestamp": 1532101578003,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "bBhFIIaZrxvx",
+        "outputId": "d076a7ea-e643-4689-f90a-57f5d086dedc"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "from __future__ import print_function\n",
+            "import tensorflow as tf\n",
+            "\n",
+            "def tf__fizzbuzz(i, n):\n",
+            "  try:\n",
+            "    with tf.name_scope('fizzbuzz'):\n",
+            "\n",
+            "      def loop_test(i_1):\n",
+            "        with tf.name_scope('loop_test'):\n",
+            "          return tf.less(i_1, n)\n",
+            "\n",
+            "      def loop_body(i_1):\n",
+            "        with tf.name_scope('loop_body'):\n",
+            "          msg = ''\n",
+            "\n",
+            "          def if_true():\n",
+            "            with tf.name_scope('if_true'):\n",
+            "              msg_1, = msg,\n",
+            "              msg_1 += 'Fizz'\n",
+            "              return msg_1,\n",
+            "\n",
+            "          def if_false():\n",
+            "            with tf.name_scope('if_false'):\n",
+            "              return msg,\n",
+            "          msg = ag__.utils.run_cond(tf.equal(i_1 % 3, 0), if_true, if_false)\n",
+            "\n",
+            "          def if_true_1():\n",
+            "            with tf.name_scope('if_true_1'):\n",
+            "              msg_2, = msg,\n",
+            "              msg_2 += 'Buzz'\n",
+            "              return msg_2,\n",
+            "\n",
+            "          def if_false_1():\n",
+            "            with tf.name_scope('if_false_1'):\n",
+            "              return msg,\n",
+            "          msg = ag__.utils.run_cond(tf.equal(i_1 % 5, 0), if_true_1, if_false_1\n",
+            "              )\n",
+            "\n",
+            "          def if_true_2():\n",
+            "            with tf.name_scope('if_true_2'):\n",
+            "              msg_3, = msg,\n",
+            "              msg_3 = tf.as_string(i_1)\n",
+            "              return msg_3,\n",
+            "\n",
+            "          def if_false_2():\n",
+            "            with tf.name_scope('if_false_2'):\n",
+            "              return msg,\n",
+            "          msg = ag__.utils.run_cond(tf.equal(msg, ''), if_true_2, if_false_2)\n",
+            "          with ag__.utils.control_dependency_on_returns(ag__.utils.\n",
+            "              dynamic_print(msg)):\n",
+            "            msg_4 = ag__.utils.alias_tensors(msg)\n",
+            "            i_1 += 1\n",
+            "            return i_1,\n",
+            "      i = ag__.while_stmt(loop_test, loop_body, (i,), (tf, n, ag__, i))\n",
+            "      return i\n",
+            "  except:\n",
+            "    ag__.rewrite_graph_construction_error(ag_source_map__)\n",
+            "\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(ag.to_code(fizzbuzz))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "BNRtprSvwJgk"
+      },
+      "source": [
+        "### Conway's Game of Life\n",
+        "\n",
+        "https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "r8_0ioEuAI-a"
+      },
+      "source": [
+        "#### Testing boilerplate"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "7moIlf8VABkl"
+      },
+      "outputs": [],
+      "source": [
+        "NUM_STEPS = 1"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "QlEvfIQPAYF5"
+      },
+      "source": [
+        "#### Game of Life for AutoGraph"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "5pCK2qQSAAK4"
+      },
+      "outputs": [],
+      "source": [
+        "#@test {\"skip\": true} \n",
+        "NUM_STEPS = 100"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 308
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 14892,
+          "status": "ok",
+          "timestamp": 1532101593030,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "hC3qMqryPDHS",
+        "outputId": "8405c0e9-e518-41d6-f5bc-e78df6474169"
+      },
+      "outputs": [
+        {
+          "data": {
+            "text/html": [
+              "\u003cvideo width=\"432.0\" height=\"288.0\" controls autoplay loop\u003e\n",
+              "  \u003csource type=\"video/mp4\" src=\"data:video/mp4;base64,AAAAHGZ0eXBNNFYgAAACAGlzb21pc28yYXZjMQAAAAhmcmVlAACZUm1kYXQAAAKuBgX//6rcRem9\n",
+              "5tlIt5Ys2CDZI+7veDI2NCAtIGNvcmUgMTQ4IHIyNzk1IGFhYTlhYTggLSBILjI2NC9NUEVHLTQg\n",
+              "QVZDIGNvZGVjIC0gQ29weWxlZnQgMjAwMy0yMDE3IC0gaHR0cDovL3d3dy52aWRlb2xhbi5vcmcv\n",
+              "eDI2NC5odG1sIC0gb3B0aW9uczogY2FiYWM9MSByZWY9MyBkZWJsb2NrPTE6MDowIGFuYWx5c2U9\n",
+              "MHgzOjB4MTEzIG1lPWhleCBzdWJtZT03IHBzeT0xIHBzeV9yZD0xLjAwOjAuMDAgbWl4ZWRfcmVm\n",
+              "PTEgbWVfcmFuZ2U9MTYgY2hyb21hX21lPTEgdHJlbGxpcz0xIDh4OGRjdD0xIGNxbT0wIGRlYWR6\n",
+              "b25lPTIxLDExIGZhc3RfcHNraXA9MSBjaHJvbWFfcXBfb2Zmc2V0PS0yIHRocmVhZHM9OSBsb29r\n",
+              "YWhlYWRfdGhyZWFkcz0xIHNsaWNlZF90aHJlYWRzPTAgbnI9MCBkZWNpbWF0ZT0xIGludGVybGFj\n",
+              "ZWQ9MCBibHVyYXlfY29tcGF0PTAgY29uc3RyYWluZWRfaW50cmE9MCBiZnJhbWVzPTMgYl9weXJh\n",
+              "bWlkPTIgYl9hZGFwdD0xIGJfYmlhcz0wIGRpcmVjdD0xIHdlaWdodGI9MSBvcGVuX2dvcD0wIHdl\n",
+              "aWdodHA9MiBrZXlpbnQ9MjUwIGtleWludF9taW49MTAgc2NlbmVjdXQ9NDAgaW50cmFfcmVmcmVz\n",
+              "aD0wIHJjX2xvb2thaGVhZD00MCByYz1jcmYgbWJ0cmVlPTEgY3JmPTIzLjAgcWNvbXA9MC42MCBx\n",
+              "cG1pbj0wIHFwbWF4PTY5IHFwc3RlcD00IGlwX3JhdGlvPTEuNDAgYXE9MToxLjAwAIAAAAPQZYiE\n",
+              "ABH//veIHzLLafk613IR560urR9Q7kZxXqS9/iAAAAMAFpyZZ6/h5MpYA5/oqv4s2qPbYpW3jfK6\n",
+              "zQ6q7WMrNj7Hy8jZzmBpfHCwAAO1W4riBNsrapcCk+5V1W0XkkFULR4Qe+H3uGA2HgNW0zFAAUgt\n",
+              "W4tdpXv2OEg0Vuy5W5l/xGRmEGKDyeXyrM0S6q/1EKbad0x2mcHseUqNmeOGLy1N3b376XZKZcPY\n",
+              "IXC5F2332tNMj8CwOQiXM9PiCLyCVfZ3rQSkKBTZErkpS5kXUyoJG3FdIqLjRFKEapbUjcW64HIo\n",
+              "BeIbtRyWV9FyZfcTakx2KW3eB4ZI//MDykSe8CRgN76uBEqZFXwO63wmUREhHOb5AdaLV3xyGl/I\n",
+              "RV70rU/3t9t1aq5mFD3hy1aLTAV2U7nG072dyX87F7NgCxZHT2kFxu44fxf6gqVzE3PEbGr5fx9x\n",
+              "7TKXtmY53VP8UaeCd2HJiZ/sd165SutTnfiWvaLuCnmmXGF0AGqbj9S19kgOhTubZIJBydTTqQOV\n",
+              "YRlxbgKn2nzvunv9+NDG0/2ikyyp73W15QClmjyt8dUeynoN8CwtEQ59DdrAPZe4ARZTwWAfsRXw\n",
+              "1vcZ6Gr1nCNWllQw5IyZyxQtXrfc5p4wjPvGaltciG7d3FG1SGk6HDsZy5i/PsnkjRXLUvGbzYp2\n",
+              "2gs7ZSGfSJbEifctcMGeSqhOOYORKy6f/9omoieCVEEkniBXwWZ/eImb3nxF7SFIaBjgG2j9w5ut\n",
+              "BY6zSuQ5zRCdajzJ1loNO0havI8mp5yViAeAlLKYCxeK0Lha1FskL67W1YsARZVZ5EkhqAYEeTNI\n",
+              "M38Og48OXmj6QBN7c1b9uDUTacYEXO88ZQ1gCIREIMnm2Fgkir8pN4gtSeQ12sfOVz5x5KX7sa95\n",
+              "L4LyFQPDrFZcDBr4PWLeEEv8yzk0cYHE97GmAlA6WQ0HlWsS42cnXefvTPXnx4vcq8pbEo/slAuH\n",
+              "IBsrJEN1+aMCc9FNxwUPVbZVaWVjwLY0qh+mNWEaiNGRmacDXrYWw0NjqMPiLiFHacY5oGELRgym\n",
+              "S2mSo6zhsD1wKQ3EUQtwrjKPiDYc/HCqhkVwoWKUdI8xTS60kn4f5UqB0L77Yevh/wt7AnvQKQAq\n",
+              "QAEEevggRl1uigbOBTtscnYRnAj0edW4QExAzdo+RwLWXTzW/l3cBWTrh3ORzZQlxJ8jQTvPLB+f\n",
+              "bLazJZWFQQDcWhuhQ3gYcP1ruNwIroINRIr8px0UOgAhnk6CllxMN6gA5S0YPhFVFKd3n0AAAC9f\n",
+              "vYgISQAAAltBmiRsQR/+tSqC8p1IAOZemTPutEfx0mzK8zG8tdIxonBsDpoLZ+NnIOp4qK6idP1s\n",
+              "vbGvZz/zHM86Bg3q0yx2atmtgoo/Trt3YRy3se4HTjou+tCi7oJt2d7A8vEhVDu33JNJx+WCOgP0\n",
+              "03nVdg9lBs15v/0w7qMc3zqqJXCOy/Whl9aRhcaeOEWcD7uK6mCV8a6MpDJ959xBRfv2i/qFOFbL\n",
+              "Grs58WiGJcq4MQJI+rVWuFN50oiqBgiunfUrRmdviPYpNN11V9pwcOJwssWfIE3agnor/RC7vfLY\n",
+              "YoXzaJjtWLEL92OOaHLZT0j555xfb4FZcoJee+RXovB9IaoDdYRusngtBXPMUvnO+g2Z5Qdo9P8q\n",
+              "Zb8ItBAeHT8IBZAD/Z2nEA6qbxqOBSBtQNW6ZFYLtCTIoP/bLjCDHgtZk3cf+N1CpXs15pUIYWDW\n",
+              "elZtlTkM4w4EJlLdjLZyQPAeaBx/qoLmKyTKAEhm0hU8EcTq00f6fwkWgz2J6GTGtL/vJXgC8u4o\n",
+              "nTnf+Ou7sVJGVaouXxrzx+yGVHEcp/eV4gaFA95rInngQAOZWbA3558nK61JBPZl3NjEv5B9r9pg\n",
+              "2+SYY3wBAUeu2fgAB2+yYGw82pkoJJKpzYWORs6i1vn3GEgUTcwlYsdJcraYC5SnGvqSZhX7KM72\n",
+              "uE1e9bkpvpVyG/mkACn5R4jwX3xc2utCjjZgM101rirIF/7VfDtmJsSTDes+UVhbSr3SeMSI9ixJ\n",
+              "+fVuFZ5bnQPoRIfPc+Erw+K99JiGN+HE98/eq4pPlMY9oCfVPSdNyOAAAAFfQZ5CeId/AUuqOi5D\n",
+              "jlKfxuJGZZ1+rVyomjOIykvxtsjsuCiGtElbraCSFWcn3aIYWLrF3fPovVLcOnroBkiRMsdf5yJA\n",
+              "F87MQuoKeTaGOrxojCCCS64RiHrqNsE+7mfRRUDuB4sAEHFQHxBorgTukPSvrdFr5QDq+BhZj/6H\n",
+              "KN+IutwFWKX3ZX9pO3sI8My78TgRY5AA6FEcT91WcvnMypB/OWXzK6M8fYuhVVWipAZigjVOYhcF\n",
+              "9i6GweQFX9AV9EUQOp2qFbkrT5jceBRFLX6j4JUQ781/UGTekv1fcpCmzlpNpp8GdSeWxRL4gasp\n",
+              "F5uO5KW63rlhYccBo1cFwIN8txHNnwyQNiP00XC0PWDRZfaWSxsACRWrISow71IyUfcL7JNhjTII\n",
+              "rwDYATS0xZ9ep8siFC3JTxg1eNaroYfeI4tbkRHok47Vk+CUOQPuagVBtFMOOcy2OUbw8AWlAAAA\n",
+              "ugGeYXRDfwHM79ghzBo9nMnzfQPPIuvorxBb6AC8F4fYGD/t93kNSKNSEuhUXq9FKGtxnCkxN880\n",
+              "BPb/uTbjLTQVyPNuYlGl/gTlyLcVA/cDoLrl5TvaR/AcSLFE7C/t3kLx0STNibmdAf4TsHWKSblH\n",
+              "VWB4X7oQHrrDdhwIivRgUZf7f63j2XaGB+cbp5aHCCwJoovY51YTqsZZTz70FlSnypPHQBNzif7h\n",
+              "uvZkXhtEzpu9rYMo3YECkgAAAXIBnmNqQ38BDchAitLfY16mYQAQlVmv7062W8KLpIS1/zhS50Ib\n",
+              "b3ERigmkZKZMPaCsAi+zsLcku/gHGHnVZpuCZMFs72gmyuL4JFo6VjWcr5FtBvzIgD26rBNvP73P\n",
+              "nJjl3JImmFHiKjNez/gG3zTuYyCACuJCEYXyuEmzCM13hdCPHKg5GZtso0Z1qk6T1k2oiqF/3RIn\n",
+              "kyjRWuxBlHHmJ46TXULiUY14G+RAGoXI+u/G6muNclld2bq+6Zztuy+5ynaDWNNjuN1Ag9KUIx2F\n",
+              "XwNdepmp52/rOvISNPbMJ0U26OvqplXi+qHTbg8MLpUSIGCY8w9FZ5woLAENgvgu9M79yGlL20e7\n",
+              "ypJ4RMBqHYDpEz6Z+SSjXD8LsJ7VKlwo22A5Yukp1vTp6HHA35nV+PXK09DuRWKKdQUzmXVihF51\n",
+              "/+bB0PEFdoNxGdbbM7WveaCJN8XI7JgQWvw2nPlHX8M5QyPGSJ2HEexumoFrABvRAAAB70GaaEmo\n",
+              "QWiZTAgj//61KoCPNGHq/MxnjqmxxQAEHvTwibmyMZGX3ES9Abh1tMR+/DjR+6dnqRr/VxCl6gEP\n",
+              "wJ/5EYCYfGaGmQYsLOeM3v2SZjdvqQBwrwKk5A/63kFm8fc3QCLe93Mldv3KWXHdFT7/mudSntDc\n",
+              "vJwStG4jgi5LKlWdSrVaAxOmElsF+zWNzaCIQ1dOiZqi3JKj64hOeq1XIWyGvRvh6OLKBpB4rL6W\n",
+              "ugf7H/IPbSQuF5jWV7zL5LhxWiTiI+kAZTUMfO2YOLzmhCUSN9GAmNzgY4D2awYB4V4QTDjI7kdQ\n",
+              "tL+3Pmfl1HVilu7nC9CzQSvWIosiwv4btyHTL7IPT2gusybyNfW8QO133L6KbDhhXSDWUtcIFCgn\n",
+              "QUm36C9hvgGjorpKYr5VnErpJX6fRJm76fFYs8/nt763alyqdcSrqaTOLaf/72Wkkmlwbq3nLOIw\n",
+              "ADFDkkAPwzaM811K11iK/3HaYRT3nEhjJQFk5v4WBXwIVLAZeKdtC8YoGN9K6isN142fOG3s6fm4\n",
+              "J1nMtOEZHIwep8In4slLmHh39qBzhGZO3igiVpgz7u+JMBeFkVHe72vduBjIy+1dqvxL/TPics3s\n",
+              "+alwfTMNQKave1qW+5Uj8jZQTjcLAtKvzoako9VMIOfQUQAAAQpBnoZFESw7/wC9ZU4P+UeGsidW\n",
+              "4n5tFkXmtxppYvKQ+WGj/x3AAdl6+9c9x7N2b/yJykTvVggfpMnFUWtxla4sr1ouwANom+Uf4IBJ\n",
+              "/zXPovndpGdy98nJbZxFU4rrWpr8aI4YmRX65+IGTn756CZWwXKY5DyMgKnDcCtk0HEuoHgdGhh7\n",
+              "1PG8+nue+pE9pBHqiBNWAjPd90qfMtABmMShLoXtUObqYbqXhJvVjjFhKdPS03IF24fu9Z0ax15V\n",
+              "DnkiLmgyOCvJmcdIX70L2ZEECd/hxrSq9JUVjC41OX0F/ayI6GtkPMUuZ2xWkMFo5rqOAo7v0Zlk\n",
+              "ke/79TjeY13FNiowqcbhMwfDuwAAATIBnqV0Q38BDXNpg2t4nJdhAA5ru/5Co2KbB/AnQt7fa959\n",
+              "0crOQgtTxL36jtVyKPmfuQMYuWbJ/7bYTEV8sEjceHvN6B0CSEZzVCjaPLzOQJZMQpQ4K4WKPlGc\n",
+              "lnEwYAC9Dsejj7Fbk2RyCFiJinyU2HOscjUR6fW2jRsAFpVq/PtZDVPvesPG3AqooVaKHp9Ex+Da\n",
+              "AH0OvccSugyDKsRBAEiYR8645aXxbFSzraQsELDsIIr6HRN8F3lUNVBvzNO3mxBhq4th/kgZSjjJ\n",
+              "JZrYmg3UfIUO/jn4xs2XQ9Pa7Uy5K3JhuIQwAOUKDmAMC0p6fgz2on4ceyEcfiCGDPZpPyL3391F\n",
+              "dXID0ctPQ1a+Hk7UcAc9gSDL8CZKz59YyO0ACPjfAKV3Y2dbTAKdWBsUU0EAAAFEAZ6nakN/AItk\n",
+              "aaqbMCcBE0iEIDnEBfRZN0neHQxaz5DPSzK0ZSL640q0AA5jkP0YAYAumNCN0MxJYpWFoQ9r43H0\n",
+              "i9SZLdv1UbgpG3aX6KESZW7AgdlevaBngH/w8xYsqWx5t90zzi7x9VyRYpIAD+XTrxvgBoFILNCs\n",
+              "gd+zDA9uvbAPlLMwG/qFltlwvLokMt344erv3a/C/ySOwZHFzpakInpJ7MQHkmKi1KHZB5KrfqwF\n",
+              "FnglZJwWbe7LtVojTdwQnAksziDNlEWCkMQQJwziY1KYtlXMNX8mZ3MtYR1KNf/CNin7/ys9ZQyx\n",
+              "4Zlk//H5KDc/8O2+JaxH20CAaAABxgSxo+yJal1LnRHYfOQ1TygNueW/rPAA37g/6fLS7mbYKz7k\n",
+              "dsiSiy1mAV7n/qq81UHJPShQSXK+E4Y5XKuXEWG4AAAB8UGarEmoQWyZTAgj//61KoAW7kO9JCjl\n",
+              "XSE6nAngAJVxWWFl/YDS0gZ32xjwUFed4hmI6rj18z16nS3Mz1iMmFblrtaE4zGXS046COODiIwH\n",
+              "QG5lRmcBExMKlnynQruQtA8n/NitzdP/ysLrucGyp5nKV+XyJURULfxk4kwNp0a5TFlJ1fusOOJm\n",
+              "y0hvsvEg+d4Jz3anvWT6M9n5A84CGucNifV+WlN9gI9gs3qSoCZdU/gglcFYM5u8YchzhQFyMKxn\n",
+              "kpfWK2LU7aaZHt6xLbqjuv74523K9/dtrrsFq/LySiv1P9Wk6/6d5RC72z4cyaUq6hMMn4IWWRo0\n",
+              "zJIM1/lSYsWxt5/M1Mkv00Rt8OZvmLxuFfd1BIVlANlpgZ39RYhqqzU6v1HwaW0EudelFBGhr5mf\n",
+              "GaDE05Z8ywp5rN4Qq4D4GNAGD/qgEjtaDDf4ZBAD/TAHBwxfNjm2nPAdbbbIuWSkkv8NK6EMlKqH\n",
+              "mOktd+CB3P6Szd1+HPnUsyQ3659r3XLnoi0cvM4usfW+BgxqT0mgHSgn/F6ajdTNM+a8xJQnT036\n",
+              "7195r0uF5vwi7PIviCQ2E4Vs4Wx80/8tBDEJS4qOY1YJ5aNV1OV82fB3HOimLHd2vU/d4Cv7OBh8\n",
+              "k3gNFcjeBGh+3lQcDCLZrG1mAAAA3kGeykUVLDv/AGVBMHxAlJYGEpFnv2bb0ADrwvVKxe7+SIJI\n",
+              "g0dPJdL0s9Hd2mGX7rpdIiUH9ZgtnBO+m3uPNae/YtN3u2p0kkCez2KiPNqgSoEcHM+ePgq7afkq\n",
+              "0HHTSZl/+QbjsyfbI/0lv1mLAJUd3u7VZPPHSdXK3vwLfAwOe3Nid72slU892DijWVvanzM1IzDQ\n",
+              "XfN6x6GH2qfaLrHePrJTJxXC/RSxcAol7x2JJ5OA8VjN8jXu0yKirBiYqgcdFf9odG8j4bRmE2wD\n",
+              "MG0SKuGrJfd91b6B7hbRUwAAAPYBnul0Q38Ahz7YAbwPIqnkAA5sEIcKo2/sVUP0LEeFOLjKjaet\n",
+              "5YFAjDbL5BIdGqWouG/H8ozoec2ZpUbIZu0ELtG5yXc/5opSZlnqbOpqdTQkLs6gr9dv5GbFvVjS\n",
+              "Os1j9FIMQsdc8pttosNtygWB8gLxr65El6umAZE5CVU9Mc8Xxg/tenmTduGK9Cd7qRDiu1sLYR2f\n",
+              "or3KBMo8ebz5q5EmWucvREbYSziQIIycIwJg9OG+aH+ZUEQbjbfHfaiX7yoxGJGP78aNOHP7GvC+\n",
+              "JwM6DxnSyowUBAqkW8ckgrhet8gYYrt8MIe1MPJQB6sv8hHuAXkAAAFWAZ7rakN/AI9XvmYGr0rf\n",
+              "QEvrPPTQWEAA5ru3wBCXPJiC8OaE25OBvVl2wRXqp61wQU4HxGJCAxkSOz+G3Yzvg36uCK8bPZTq\n",
+              "avaOG/H9WxjsuwAl/bIYJdnyD151CiUZ34aErVIixKJ53oKrLeHr3xLgxuH+y3w5uH5lQRsL0Pmp\n",
+              "0jQItTBkKwlPywxFk55pROuYZWi/h/N19QaFlF7WPobUElLlr+nCH+pVt1nW9/YwVGz/cO8zwmWe\n",
+              "Fb0OnFji7CYSsi9ScC3a50GjUP7IpaY5NAHv33V57bkO/BD6dnreymTbSmQdcj7PAJkvz610fMqn\n",
+              "mDGTMB31oxAIE5eWeH7mBZouSgmtxEamul7sYaTPe7mP6FqNCz0h6wLot/zAFwx9/D2+XB0x8mmS\n",
+              "b086o+gqkoYoHQeQm2Sb3MU1Bz0KHDGo9jCmsBmecxs3oNHV4KaIoLKAAAABrEGa8EmoQWyZTAgj\n",
+              "//61KoAcdmk2P6doyaR4wEHxsIcmssCD5f+3/v8PGtlbWZ+A0oGGFPTAdgmU2TFbrRxlmwUCouNe\n",
+              "8freV7blHDodFImzwP3saA3AZT6NUl7vDGH/tw5n9y8rP4XGnhEXBHK+6jIhoAYc6G1CDX0mqczJ\n",
+              "7tbei5I0YSkDjza4rJSbAF6cRoJQH3s2Q+ggBQR0BfH6N3QlPVwd9YFvP6++J+XrbNU56Pxu6Wey\n",
+              "51asar4AaARXHregTXL4xn/VNt8Ppk2xD3/1jXAVXdqMlS0tYGM/TtrcuTC63Lx21RQtklG6k0xA\n",
+              "eWm6W0oL0KTvxuyegpC2ySp5v6zpSEYvzWR4IYirfT0RYU+jLtX0t4M/L/0k8xOLTHbouoUPD6DN\n",
+              "dYYLYlVX5noJzjCAVCiS21OCcIKqWD/YiU/+dTZpdFFNdHEa/MPvUEq7cJD7ANJ0YUweepq2Eqdh\n",
+              "57SC4Tpg6jyEnFgMaHQLSz1nJNh4lxM1TPouGZ9bmQdDr9WY+nwzRBa+ZLnaqBSYKWSKEs/TNtNZ\n",
+              "ev7d+EnJUf9G9CAmmiSDlRAvAAAAz0GfDkUVLDv/AGU2nAwHHyQlvUxuENDSO8vXFIAPilnMlQWb\n",
+              "nTHwb8wkIo6JKOaIP9blrrNXcWeeQDVprB1Bn//+nbSDHls1apJcUyMHUmojA58P91gutTiF40zp\n",
+              "fDaF096G01gcvpH5Za4+DfUvxQpt/wH5PntJzggww1tLhP1NyH5U2TTgrnA/BevK2aCa9xCuCVgA\n",
+              "JJZF4uqHE//COeWbJ6LIFJPoadxAxbrAcxPQQHMzEG5G5S3Yfd+YJBLrdO35JvVrsUTYO4AfvJeC\n",
+              "zwAAAe8Bny10Q38Aj03WPPyvISnWAC7KM5WfLH925SBeAKcvJaYOa5WZCzX9H5nU/7qAFTCgAnl3\n",
+              "rAoSnKk1337XDAnLfPYAAOSIcqQwF++e4HouwNVAWCEsVyl7Y6DnBaBT2mD1H8560KoMvm3kKNNC\n",
+              "oxFCc4BdAIXk45JUbGFNGYAjCbBbJInMjwa41HA404yKnJG7rNXdBctnsSL/36UoXvVx3J2tGX84\n",
+              "+FHk7e72CsAyB49ajd62idmFQji9Jj1GaiqtCIjWs5o6Mz8s5QfrvipNYYD0YZ7gBBGm4AEz17d8\n",
+              "isscgsp4QI2odbuEJDq1nfJbW6+1HGcN1XfDC1Xfa5IptM5UYHm5zIT4rSPBIDE6l8/NhVxlFP21\n",
+              "JPQ0DZxnZFvxIBznQbqkhaGZjMafgFoRzC9Nl17x+K6e75RlplRZtXaUIbjAUFBJIQPkoIrT6/O9\n",
+              "NtkAmnl8qqUC1RktW/RjiJqOyRTTITHqNKvKy/0gb88xEvvGPgzcSs2KpkbHJWmCGIlSWEkuqcCE\n",
+              "jBn3Y8XOQxMUxEYeLPJ/9s/F2fT5NAnko+RFlv75fWLekZZP2s17yJ5ccFGhZyrkGX6u7xXK7N8G\n",
+              "Qlz8qfOHvgMQrlB8p4j7qtnPgBPf8mcsM295CuAZxkK+sut074W+0hM24VMAAADaAZ8vakN/AI9G\n",
+              "UrhSy/Rrhc/LGXguupji5cAHC2DVoxU1gWUkKeMT366GcmuxH5O8lBZJeHl8r2KNT0EaVARyW7pN\n",
+              "L4uNsKKl/WAzLJ1OZWTQf4NaAfodQGO9KzZS0j6oGvr/urKiQwbP44Tv//glYQyyCFeq+8nnrHBj\n",
+              "aACu2w1otySh0DYMX412uY6EYcx3GtQaRpNPiKQniWdVV2KH48fVxDy0uLS0SmCZEAWLVNvtWqO+\n",
+              "q2OwCBr1m50s0i8eRTlSP9xoKtxWC4ZqL77eAW3kYEBJOAywYUAAAAH6QZs0SahBbJlMCCP//rUq\n",
+              "gBY3NzYDjVIwwAKbp/vtZn3NtK6t0V/4sA0MV4ijJVoTZ+e36T0E9eQ0LOyzsqR0ULZJUDRy41oM\n",
+              "RdsBwM4wyEJC67daWmuDEXKhZo862uqAH8A0QJ5u5RKBPFpngChYYJdWzP3onEWImG8Yryy/SXt0\n",
+              "jQ5te76AagLius72bzwZ4AZfLm/04ID6oXhPwqkf1cNsu4/kIt7oCOETiL+lzwHLEnEsdPSz3DxD\n",
+              "uLGkH8o6jHofDxEXcB6cOS43aUxGKPYPtHCj2gw6RzcRoX5lD5mwqtoCTxk6N8TxyipSUyNnbA2b\n",
+              "G5NuBUVLHTce3QKY3SdkbyH/wzdOpT3YHUE+FYQwMKCF6SMyMBxp2gI9k4yUZYljUiekF2XIFkfv\n",
+              "TFy1RUmikOycLKkTYTreTarsMD5JfjZ2FJWrroj/YX+uNeGtKNZl9Zyt+k8u4Htq1bPYEjCrLHds\n",
+              "qeIuFWmvxTYEQblStjDXmWfITtxy8KvOgn9iV+KlidrnVhlE7Dz30fuHXxxFZvIzhgU9uv6sSC7T\n",
+              "vZuGMsKGBGTYmSe0P9hLI2VyM/8GUWwG/AITiU4a7OVDjUNRPaiIEt8jt2oImPIY8qcrJ82CVd+P\n",
+              "mSjoppoeHUTHmeo+koGqjhwT7ueVHNT5VZ4yuGKEDdFfEIkAAAEMQZ9SRRUsO/8AYrbCELHs5dcg\n",
+              "AyOPuRHZUWtdXLx9XaNQixO/8Cc4Q2MgEa/wKETsHiR8C1XOv7rI3JB0rg46JfjEArbHaTHmANKo\n",
+              "+czcI/sIduYNFOE3TvObMh/KtGpZSdF+qnDDtY8zD+7RQUdzmkG5zeDj3u4Vq+f3qnKCwgbU+U0R\n",
+              "dQR9Q60wXqL03p/iYVxkI8jJqvkECuxT7efJI+5rmzyP1yn+WKY2EsjjB7bwwVfe6RxBmzR9Ed/9\n",
+              "CA95ILUJxNg4HsmCO2Ko+MqZAH3wMlG18kUm2ogL3cKIkVXogjofyKhbsSpKLpFFk71DzB6NrY/3\n",
+              "HfknWM2yn9yeQB/joufGEf/bvMAS8QAAAN4Bn3F0Q38Ado97WJWiqN4XS53kTA5YWsnJBdebpf+9\n",
+              "lcN5zPySAC6fH/XzBsBKbxdm4pTiPFVrmGXyhaRiB6dxtlwj8MyI40Do8AXHq41BAunk4K4PTgzR\n",
+              "rFycWqaL549wB2C5jNCLXlq6Tuytik3ijlMSkx9noeIG2Lc83eWkRkQieksQSO4xI1tzzkdqaNhG\n",
+              "ExZARu3MauZwrBopslb/ZLdR5ZS0G6p8o9DD5cphJjxJoSV/70/0Gr+woS8Zj0JpVvvpygE5bXQp\n",
+              "/YBCqjmq4uOCyt9SvCzPelUEwXEAAAGyAZ9zakN/AHZ6+HiwE6fxvgA5rqP9zmI+FShvhJS43N4N\n",
+              "sc5a7qq0DK7DHadXkQxf+APmeqLrIGM9X5aCQgeyxdoAlcQoyNsm6ol85w5z6JV8A3YntmCae+s8\n",
+              "+8/Yheg1ctJWrSharoeypUyemQeq9Rm5cIkSOS9Ej0hbIHyFhPQW6K3SawgMNVKQ0s1BpJvXDQSY\n",
+              "x3jIEdIgEtwe7zce/DjcO3RNN3g+SlPoM7cl0qJbM44NIDG9JGXcwVrY/YKNrpChX0yegP2ZHDI1\n",
+              "MzOs5eWP/2l5loJrLid2mK4Qhw6EGFrIadsV8rSjzgHRNuzJ4U3JdubidEobU0ehkU0P6MYRK/XM\n",
+              "58mVywGbsw6LPu56h1S4w3zHGYMd1zPKOsnCUhaRfrSZTxvjerNQ22prVPqBstk4JgHdnSScrwGw\n",
+              "eQcqvIw7gKhonPDKM4fJtO4n2EsI5Cd0iGMjmgPw/PU3FL8ZP3QbYLMwZ81Wd7BLLBDf+ngKiFIe\n",
+              "it4neyhhaE/a71b8TxeM/ZrgH9+D76dlgPI1ZJW6CCVyIs6Y5gK2plkcgRYa0MwWF+1A6zPtBEgA\n",
+              "LOAAAAIIQZt4SahBbJlMCCP//rUqgBY9we30eRuAA2kMf/9/gX2SHKs8Uq31+W7Vx4LugxILnhMT\n",
+              "6icG5WQzdpL8yjIXjBq99nVaYweUdJE3LrdOpsVxNJ3kODVBkposYOoRuOMi/SNhcjrJwShp6ljG\n",
+              "Qs7tSeRJSYDkvm+SI2ckjbManbEesw6wo2ZffuryaLuWkU9SNALC+2QbPJD4bFy7sTmB9+6VOdMm\n",
+              "rnLvYN4ZyAJz7OhQG85P+JnxdgXgvSv66sWBs05p3vOE+53H+HQCMTLVgvoYmHNTIYtZ5CIln4hA\n",
+              "GrjLg53unVVQTiYlSzZrRE2vmtsqac+v6CrcbtgC4HktflvPTsvgqWNHri9NWa+EuXgx/AgGkZVJ\n",
+              "r1n6gAd3jtjLtv6YvbPiBBo2AhBUxCbYyroAjcvjwUBtRjXTdDEvdYfItmTKA7W3+KvVi/PCtod6\n",
+              "/3gOoaA7zRdO+8+MHlGl/c2xzQhj2O1n8eJkOu+NcsBkpmxyosDi11EOEaiQ6vfnOvH9MSM+7D/v\n",
+              "k91SLlwv/nF+5eDPHSLZQIoFUjHjwVoSGCdOLqmIe6tsfTERCeAhC+1bhRhe0612KIL6izjolsR2\n",
+              "nUgrl1o39HqnKAVqQ/HguEezLTgmGW27Df2kp4E1wRl/EQgEcsMfBPga1ndY4uHPYq84ArNCWk+c\n",
+              "YwxlHAPVC3PK3Zp2kQAAAWFBn5ZFFSw7/wBXFVHDEfqz5TAg6AmqzzGCl9B1ICKhB+tKz4Y9Km1L\n",
+              "/vZyZ1OR5rO815FlrTgGoncUDKVNjpKrVerCm+HleHb1b4FhYQG8B61zGq10uLuoQHIyL4Cv2/mm\n",
+              "s5Mi7ZftErBt64oWYphUyh0Hmn9dYYheGFzLdE9gvqcAEGJDyLZq+nfiK0Px8pHIgaIfsEdSUYcC\n",
+              "8Otyxta0EKY+Dm2m8AtQ8jjuDmkSHm/uLhgf1uCnztOKFhkR+ydRCeR9tnIlTfiv3gJbsPT8swjP\n",
+              "0OUm6yT8LhwwCJU0AGI9hN0/kTkz+NeSHjSPaBx26MAfS2Y5NEtva844h4B/RttjqxMsNDiDrfB4\n",
+              "5xn/Cl/3XrcF40eivyUSC+FHzx3M4BoLQLOKf7iz8hKiUrqRGVkGToUMxkr5192x9xCjbuvLRMd8\n",
+              "9Pel4WIOhSi52xuSf1eEhC5VVAp4lHpZmHCbgAAAAaABn7V0Q38AdnTaV3jxqK844c19uepGJJSA\n",
+              "C7DQuTz6pWfCzxcMbX5JwHItpyM9y3YT46z61a7h5Lyukp+nSKoO0zQhT0EB/u6ILUCNvVbb/89X\n",
+              "7TVI5UN6EFwYYfi4uoFmqb+5Cd0J/+d2405yTsK/f6WH/T+vNB1DYWrW67ctgHOgMHAWDLG9mitl\n",
+              "16bXmPVSi2sWzpWYg3147nlnaD00aZHqQlrMPzYTLLFwWHOLNqCoWpNLMMEevc8AnQWeykk9VNTU\n",
+              "NXzAXhrKDXl1tLQTxZG7GX3K9cQyeUnjfH3rMBGDD2zCLGXrMfPVl9EJ/F5M49Rjn38sXUf2JvF8\n",
+              "D9r9tV1APCHN27+egfFIMDg9OhrQMtjAe3WEfpYS7pl5yHh7ZZ2CedEo/Wf/ygYTAQFI72AaUTrV\n",
+              "n47d9OSqAdYs7lkgV0864auRyPQeTKK1Sp3ADeIFS134VGBNG1VnrfyZuznYkI2r0FVkGFrAXpUu\n",
+              "ZJmyKqqILhJ1OTBM8C0VBV2QXBYa2aSn2jj9t40/wJJWc9IGAVR0vj/u+wFocjwf4QAAAZYBn7dq\n",
+              "Q38AeUc/pR5QUuADgu7/kKjYlIf8yn+MfKKvFMJ4eRJz/DRqteBIBJsZW3T3phi3NzuSw0zOvEhr\n",
+              "CHz7xEUteyaR+fa6YCBeiCtangbUerW/UGoCobzV/74XB/lXH53NcEw+6x9o3/ZgwG/7l4psK3P0\n",
+              "EqSwtCrcKAAv8Wi0Z88mFp3Sp19shMF41mqYa8pNsyefrruQONS60LHg/1GySbrTeTWW74lCDwnt\n",
+              "BGXpwghp/QF087PP7hxkE8lvu8APh5F1FTiOCBSvJFm6yFC/tz24gmveLoV4Rq/qtYWRE09VDCDH\n",
+              "yjftToPMsyi4DoCtXsPRk5Jxr9Mn6xDxGjfz8uMmOKJ15ejPi/Sx9cR1QrBsU9dhcYifdB+c0AMF\n",
+              "PolB3N4pBZAASP6m7EzaTer6yZ2sIKcQdlGt9xsZ0SHtS2313gpdJkLEVrHpO5/BTcfUTTcK1+bC\n",
+              "PwRYX+iIyInP1m6htprdy84ySZ5IaGCpRKFxMCf5w22wXyyon+dlMPKACguyEPTCCZQ2MqEuC+sa\n",
+              "uB/hAAABxUGbvEmoQWyZTAgj//61KoAXgR9s4tVmwJ9HTza3s57iAAoQf/wjqzjlXnP+29f12EfR\n",
+              "S7B+4I2epG2qM/uoQ7VlrfXFlhjyX/aTq0n55QXAKa2xUKolKsuMfmZFFc6+GP96b13JiSidvPgt\n",
+              "2SSGnq9Yw4MfceFmgOaZRcwoMnpdb0UpI73YdP+DfypKyrkDqKWcBc/BGhrH8+XdnpCNDXfg5rMl\n",
+              "b0uFlQ11yUxnDYOfRwLbdjJA6FYddawSEVorFtY7jkSQx+OUBUgWkKC9rhKB+uV/yqQsvbuFiyYV\n",
+              "MviBpsZgSSN0TOC5JedQ5H38ENVBLjXnWZD9PQyueLoT4qwtI+7lodFSnBG3zboWdj6P7XDbgKT/\n",
+              "zKkFObUjwhstiQtohzxd5AXhBH3DQqNv6mRzuMxFDcTEo5ut/0/1HrPGOF4R3sJ/eQT+YnYseqvc\n",
+              "0m5njpgI3qkLmn8efBB4q3zWGpHCxBwC84HKjuugMICuXfcJHKn0aWkn65aEjT8AdxDWE09InGyo\n",
+              "EM1wsU0JgJ/qq/6MdHWfQW6+bt5xWlpYJ4axi9wZc3Aoz+Rixn8UVM2e/bd31+W37ucz9udquxnL\n",
+              "2JdNUAAAARlBn9pFFSw7/wBZVXkLa/7xg9HEtDOpc+GkSv0gCD3x6eQNkROUaCyL6QH8m/0USPLW\n",
+              "nllgC+uXg2X8kUpaUiErsLvwKd9y+trtKwV7xlvkAn0JqEnToCvptE1Sb8eF86DTi2ywy7WE/imn\n",
+              "jNBYQny1cV38ScnZp/V3phWQAYBG3kUdNNuj/FyVB7DgbQbTLK48AO5nLYv8B3LvBNBfBJ+ym1yg\n",
+              "YJXKwjm8kt8xUjO2UGKeggZOs7YHWr5Fj8OX4jV/B3/cMzP+f6YyrayA/80F6f9vgrbTlhWdlFQ8\n",
+              "QtrHKjmrl874OSSPJYH5wfQfF/1NrQd6soxjmSWYI9/FqOPoy6ujUPxQvg1fUda+wK31Cv8gD96H\n",
+              "LPqpgQAAAXkBn/l0Q38AeBaU9hYCjxV6lA176iBcJKIHTfhwkqkAB+a0LmdvcgdK3vyEsSkCI+8U\n",
+              "up3OQ4OQId/B45+Mf5P4Fc2VsfnQAACxyzNkvgEEYwZk+TyOR6/VZmeFNYMrBdqc2NNBlh56ISK/\n",
+              "h5V9lagvsX7yv0p9Hk6RXo3uoMgKhKOv/QgBAqhUvAKDw4DS7G31tehd/myRMmCPxIJ79bZsQe2/\n",
+              "iq7Nquzc/VDpPXFZHPvOmiyfyrt6Fxc2jLHZJGpvacPTIeLJiSaBxgRTEKBr/xXaKQjc5nLhlwgc\n",
+              "HSz1WRlyOsXOkob3rY8KoGVETaaIvHEl7sVHsV3QN7iR2rIGzf6YHv+c3l8OW1b7tAMShtcCLifl\n",
+              "8k1OtS8Z5o7MNTObuLXIONSPGo1fC97qRzqHFEfMZntEMqsFjjWPM6JduvRiAv8p/h0kRdcTeRox\n",
+              "t4PEdFJikYgCJgtFa00LDpNvd6Vv6MImiivCAgL9L7zEaNCr8p/p5ZiDugAAAO8Bn/tqQ38AfAnX\n",
+              "r+Rl0wYAC9kEZglKr0YEZPxbFiynbDVLyUoB5/4mwbggJCKqWcWLXkOc702XkfuMANGy7OD7QUCV\n",
+              "nopFHkp77AuzGvvM2JQndhYVkdbX30/kmHQDID1DcpthKQBbzUjm7wgAOqbulxKDc1OUw1plN1OA\n",
+              "iXs8Ju+zQDtZelKPfekDEF5iPA8IQMn3LLocZ168PVHW73hdmgfMFTsqduJxZ1oiezDuUBPUKdNQ\n",
+              "1lGg5KUsS5A9iNuo+n1shJKCmk20FfXGeNEywAjYeaq4bao/dd8nZn//htlIayY083IymAgdHbKW\n",
+              "UQAAAW1Bm/5JqEFsmUwUTBH//rUqgBbB5O6qXkABRezeefAxp9PjwxeDBuTTFSUNk2voPSz0T3Lj\n",
+              "1K/LmQtEI6YkskJKgxvIXHGf8LHTV/h2Mg/qV3IQ4zvBygOQs98iZyR5jgV+hQ58R6xIcus/6y5a\n",
+              "HrkViRrv8Sk7So3LYWmfkLzyR6vcCKhF/sCJsY8RS8BK5OOGU2Ll4Qs1n4jPQwTLDELf8SF2+07z\n",
+              "zB5hexERnOHmWZ9THKXS8j6NXPrj2p32k0gvmlI4b/Of9evEX9mDBp5GtQHOvTswQ/VYUajAUXz4\n",
+              "5w6EHuB/k+FBz9pe+B69syJ2X5MYn7Qi9rKpCl2kZv4uAWXuNo7oIaU7hr6elcFz53tdL9AEjCAb\n",
+              "BlT3p448134hjvo9lj95CHF5teK1w+R310Gc3NQ0eeJcsiYD2EoVrHHjVDF/m8I8JtTUFdJ3xm+G\n",
+              "muADOcIpcqYbeqyKWwHmgvRze+DMQbkLo4AlgQAAAR4Bnh1qQ38AfBSmnoPKZzTuFWeZOcrkeWeU\n",
+              "yVIALsozlefbqRZf6f7w7fkPoFSkdlxkJJsnO6qzfbc/Kotbm2yeFrIQw5yspszQL8gAAvMHKSnw\n",
+              "f4CTQ2vfLY55MADj1baDD7LZtn0UK1Eh1HnwXobc+mdHd/JEl/a2Tszf/EZ9+J7oMl+BYsjWKwNY\n",
+              "vOv5flnnPLcex/hWFIF4n+hpBybvasl5hI9mV0CeAAyAclftj8N9n7hadcpM/TOVmHbSkJ3cr/k+\n",
+              "StSwI8gY9k3tmbMSZc42caMpFr6YdNCCIj52zmNBccPNFxW+UT/4qCqtX1gc2j7obKDaWzC1yj1A\n",
+              "td8/VAjqVn+FzuuEokhhvubRT3RCdxeWnBTCG0CxwC7gAAACMkGaAknhClJlMCCP//61KoAXgkIw\n",
+              "VJpvAgAqN7f+5rJJcY8tkjj7p4LozjswOy2dTydK33mOBGS+NojRzBOlwt3ro+/vdQIUTIVrXKwh\n",
+              "2SrHPCPJXQoCjJUPkRODCmqbZeBHsv1r7iIOZPpX66HYYhWgPLvPzAb/Nqu9nQqKoyphhNy32+S5\n",
+              "qAFvjRKLSjPAx7GoKGUNMbYduhsBsrvVTwhrV8uWAls2mxYggJzVuRUZSL9cSt+tjl44BXjlbo1a\n",
+              "I7ybNHG97GCzcbSNcg0RA+iqwDsdnrZCO0zsNdWK1qVmER0PsSf0dicSrZwIcxZWy6JbkwQn5TnO\n",
+              "kAah3wAs6pJvW+a5ZiJHl6sVlU3yCOlrECAESqWu0YR75WfiMXgesBOuXGGNsC3icmPYNzM93us1\n",
+              "7GQTI6RmmFHGo+B2yAB2YJiK1YN/T0ltUuXfFAvL4UdHgEXOVIqVj+S+YpITMKy740IvYQ5zuZPD\n",
+              "ahdXF7HIU7xE0W12w+6qkuyZwxUMXLXdgx6svudMor1GNfDCdymcKIidhuuXh7vdQrgbivH7usVC\n",
+              "zjMqgjGahkW1YlmytCooEIoULx5ux9DK360iAi4u/nAomESdiosanRfQ9jQdJSpo4rurLfeCLF1Z\n",
+              "XsQAQRTcezHlxp1tz3A3WsYMA9urPBB8pUlDdB63MfZDCBphVx/Ddv1AMvPXFEPu18oREsV3BdKx\n",
+              "e3lxLWWpytzF3zXttYGgBb90j9DgRGE1uaAWyEAAAAEiQZ4gRTRMO/8AWVV6uU/hFqUNYqrP23yu\n",
+              "FpB+ECoAQNVnJ92i7ZF1i7u1D6K4L4gxm2RaiGsRDmf2iYWEjO8yGHAqwpcDep1/+H221WMh98AE\n",
+              "VV9Ferf+hy0D7Zu5rX4Hp3s1TpcNcEBIKPHVSHIzaZKKfPXkqE/ga/eepp8Bzdc39OW6g91hVVvf\n",
+              "WJxrnf77rapWbmivuJFfeO9u+RRykk/agdEi5E/5a475KGQprA2yl390PNrCvoamPyXbETwtbYAQ\n",
+              "pF9uDZkHdN/NQ1P4rz+zQLJx21eQsP9WBLswpDFYg9BjPw+3VrVEzeid2j5wJBlq+56Hw+Ex6fI6\n",
+              "1O0GbWSAC5/5Zg+kGX0Yx7/We9PseMWGwXWIVwqI7oHPEnK6wUkAAADgAZ5fdEN/AHk02mburIzA\n",
+              "1V5U+8CauxZABexQ9zxvy3GIkNn2+19EyZqnRm0DMMsXP4ZwiY8vW/qdBTlATfbmIFDxCTzt76+L\n",
+              "X3WaNfG+rqTfzj6gLFFHl5IJDtQmIC9KAmTgQM0Lp8TEDdYJnPYGFybq0Xdyl74+130DteV0SYTD\n",
+              "hgB6230zJvCx8ZW04pZHmYvtJ1LZAxF3BAWKPXcstkh7/Er8zYdPblR7K6t0r3b/sIHpME53VRBk\n",
+              "ggj1uN/p+iN4KwToxjP8kZ1opB7xpkyOQpicygiGnwjU7EpZpywAAAF2AZ5BakN/AIdka2Wer/IA\n",
+              "EJVZr+9KNmiS7zXHA/5uJU6D0CbJOrsLPWcfwAUCZZjhlCsnAlgzrrGOONmuxU3En1TfTKb/7Pu5\n",
+              "1R8PfIYkV/dZFitvMyRPMvzwXX1OcxtjbhM+M0LCh6zNEWJFi2Pi95t8cspIknD4iXNUblA3oEFp\n",
+              "VGuXt+8S3Upf64YqAxWADhb5zxXL+O/gnWiyawM9fyRrYcExecMkEiv5MHRsJs8Euzdps1vwxzNA\n",
+              "Zu4bu6ic2K2ueNja78qXGaHz7xLoPIVJv/T4KAuseyOhznfFtKf0Ey0eSBVK9qutGGF83lfe5Wtv\n",
+              "xb73lHTKLAyiyJassoDHBSQLAcUPb4nB6xWNr9G9gWtqEIp4Or9tKJzZIZ1tnIKZFZGb0ELAlV2+\n",
+              "pKKDz5nW+syHi871Soc3HtgomT3Y1cp83yQG1GdKkcJPkU1uJVzsVPzbXbSU7/z2Q7cikc4seN2D\n",
+              "ryQ1l58HjUs0ikCXV/V/CDkAAAH6QZpGSahBaJlMCCP//rUqgBbmS0XBN5gNQAaCJTjyhVwVkMwl\n",
+              "GF6KXnd0XUyzqjFCJEv0D2xQiJu8if6sKo6qHl+BP/MZw8ss5OKq407INzCjWOsjf2HTKyC5fNLK\n",
+              "wiJv+PzieOozn64ZK7RRud2QUaDe0kuhk4uCClSYQBImrxmWeEf/X9zH3+ilYhfoZigVm0IoMiuu\n",
+              "YX1ERVdg0Ld9E6wxbYMiQAGJU1qeeTwc8vb3w3kiJheTA2PNXtrJ98RwtpnhN6QxMe1dw+aQWI7S\n",
+              "j0oQ9iNx73N93RuNVRxXj/57S9VltjA0RTZBjLvYS81QDA3fBgaNHNzOBZ7dztz/rTxxOpumjTTw\n",
+              "x9FgnvlMsjx7FYPKUcXD5quVKd8lwTlOiGVI7X1HEv3Hh4EvpYVt6azhUBI1qGunVb3X1lyMhWJ9\n",
+              "p3muqcicwInEt+BuHY92HoNXaaJJbbQmNX5s3QJbI28Pg4gc2gaUF4SQRcBgM8uwcYUzxEkBS06L\n",
+              "0moZm8bwMsLYCLj3fgXOyFudpfg6jkYPDeVK811WbzEz8Hcd42XVL0EwE3bwDc+i2I4+NERo6J6l\n",
+              "d4d7nOIvqUuorZnDPtlYcfSWgBqdP0tQHvFb4Sv9QUCBvXlH2IEiNzo/daaHVtbFRNZ3cag2HOiP\n",
+              "lMxyt8xYJMnG7di2JiwAAAD7QZ5kRREsO/8AVwwP3fRRACC0tQoY45xe6yfL8KMHlR1wbd4HcPUC\n",
+              "+4PcnqOzdoNv80ufRyOopFYryJahX+qWFUVKK+nDtdvegTv/PqvENcT8ykEwwQ7z2oNUdaMITYi5\n",
+              "4tC5YA9FaLSBorMGx3aocAbiF8065MBqyaTkiW7FtGRHVSPubGixAl7hiQRoBoEipfCxkE/EBoII\n",
+              "omSCNrFRyjd8oY66cDfZt+iBI44uLDeP6eHMEpBALsV0FY7iWjBLaYO1t2PsklOb93SAExoyIX1I\n",
+              "TiPXiUgrCYe7dgepAF31BCnOuxiIAPWKLDHZLhGOJBLqdemk1EZoKCEAAAE5AZ6DdEN/AIteG4cJ\n",
+              "hGXgWAAHNd3/IaNiUh/zKhTXYgf+UKkbUvWJoLo7whMXByWkvy3MotNcPaSHeaKS5vKy/hBJIgk5\n",
+              "CWcdsbd5QzFHyjOIZiaEAA1AziqRPTDRRVYKhcrm181rAlAdaYmvKZAOu92pmI39/PSQjhiMouSe\n",
+              "XVT3pg0s+/zN7WMQCHqTmey2TTctwD0YnAH9CK4EMAw1jPCCTXgop9epuL/iXjup2S+LS3pGE3iO\n",
+              "oIHon+1ERGRC2Vp3b2QAstSXzK/2zI+bVnxf0PhgKqa/NeuEaF2SBGZ/TyqGPDnQfJRorCp1s+mw\n",
+              "tm/3aVbjKRTXeSwl+OCfF6rMqjf/Zw8/4yrjLNmiyOgD8OWqATkM50NFqOShrrTCaHdcxgVW70ss\n",
+              "cCXKxvzAUCe+4nK4C3zP8QAAAWMBnoVqQ38Ai2Rc7ISR6q0L0pberS7nbElvP1eAuajd6ehFPCEk\n",
+              "va4007gA4DkP0YAYAumNCN0kma3A2DvFPa+NTDmrilkXNhiNVTFRLzynsy8rdgQPBH6k5DFr/4eZ\n",
+              "jmJjfYPWB5+2eEYYc9uJ5Ni70hsVFfV+T8zp+ZkLZnd2wv7AZ7A8baF9R5O9oQlCkoVPxkDHTrmt\n",
+              "rElQhX8Fi0yj2+BVP5O9UNPGQU0+M3KYUTg9yTBG2cCw6Drt49/5M/86NN03F5R9JS9KGOfJjIlA\n",
+              "koCavGpTFqq7OYU0RM3ilfXBmxvL5QoIK28Uvs71J3h/IvKmg4v/14n3/eoSpqNUCC77ty2SgAAi\n",
+              "rxQNIHz2GF/lpTynlwsORrYNT1lJMVud8AAQb+/SaHWQXmhJ+8cZTt8XuMgG/t/hdF6GqyG0A/Pn\n",
+              "hWRq+asN+zBaeyQUWZrjl8ry0h3WPkAZksFb/gV7ABWxAAAB/0GaikmoQWyZTAgj//61KoAWw9mB\n",
+              "34Nmlq4DQoTYIkneVdOFHxDDrFwsv7yxZXXwNkGuLMduj7QGT/7lr2bNfzApMJfo9/ffM5g789Cz\n",
+              "1Mn0zxePHMHBL6IHHRVXWyqDMhVLYnQ9xFtc1jml18If/8STBCOf+AZjMnARcFmX1IwLt/ziVSoN\n",
+              "e4GPKKZqfZWytoW7461OuaeZ9dvtxrCL+W45zobgR5vOrVM+Opl+w/eFlupHlgpQBWgJcPy8sZC4\n",
+              "/O9laiYA63xx6M701UUvGFsRI+RM6anXyjKc7TVrmZ/YQKRjqB6Mejs2G1mTDkBn7T2ZURI2vZ3u\n",
+              "VXRNsQnGYDxRUokS3YRHs9LEF/gxKSdLEEiHDqcoIHyS2FPM+cIJRSvB7sxIA3hgfN/O4qDK6VO+\n",
+              "t71oi1H0Bkz1ugONnVTpQr+WeMS5AtXXNBMXU+ycO0+R9eRe9BwSk0V6tHm/HJ45oIYvyWTj3yZa\n",
+              "JQ6q+o4isbf26PsTbuSAcvQoMnzEXJkqElGJ8Z3rZtdkIzQW0DDnXeNRbj2wQmuUNBknMsWOw2/t\n",
+              "fD8BErzYLXI65PwTY+6R5c6RWYzF9HNMLBaO1c6cI4yEu1DMKtZW5FrmVuc6hg7VnWxgAgOdFKFA\n",
+              "QvmmcrbHsqCH4rkez1y5GoMlxeOuW5WKa/JdcefAflYgakEAAAEQQZ6oRRUsO/8AZUEtmg0dqwLy\n",
+              "ubLYtABfXw0ri+bvSnwBqWW9hB3/jYP94x5LyZNY560IvuBe5T4EX3/71Gbqj7BS5SJLQ7X1JK0z\n",
+              "I9iR6McwRU2BDEhu+2JQm1RA2fBVxnzCyNr1JVnfyyuumlkNzE8n1UgnkIbS/FMxc8DghB7zqZzK\n",
+              "rkagW0hHwSjNf+LJf3DnbXyvnzmB1lcv8Z9QlsnPKDef2giSgbZeTNWRMfeu91kckRy0SSKkaYVK\n",
+              "KUUpf450Vl2TzPLRaNhk7Du1IJzIJRf9supxssXD9v31LAVibgyznyLU/cS57Vr8KEXG+WpKysV+\n",
+              "6iQmQ/hCoRg82drzuniAPltxm8MMUZwVMGAAAAEzAZ7HdEN/AHUKF3WsfCAA7NAZyuGlRySXJzA8\n",
+              "WtPYIqCp+udF6BaVoG3w794kSqeP3syNbVlr+uFhruNMOOzTsNGrbATFZMl9DU6mhIXZ1HEAskmI\n",
+              "VVSgXlz4sVX35JqYrDPP8r9Bsg/O9tAp7LnTMjWlqOdgOPhHpyqf/hmokPsCwqtKfsDhxP/tmX60\n",
+              "fhM4KsfvpygzK8jmUmY/GDBCISRQeW6U8uaq8guf+cvy+sP09JLJ4HsULhIsm6kyYO04HBdOFUDr\n",
+              "/8IzlOKX3w/FCxhimlJIduY8iySAFQmALOuag1Ry1Z3p7NpGIGhZp/q5hzsMAsH2jpHXQPdtFNFH\n",
+              "4VkqDlRDeGqieCr6gwu3hPQQfF9yauq4qf5R+bfPha9tZ3XjpRO4eqNaj2xEQrcb5cIJOAAAAUsB\n",
+              "nslqQ38Aj1e+ZhXsJE07lvgA5ryx/X3Tt1hQ2T/wP93u+Km2fQtCsS47kHT/v+BMMbdxEWzwYvcd\n",
+              "d3NYalS7o/aUthPBRfYGmx2hUIQijLOXN4leC3SONeoCputIRor3Lgsy985K8UL4nvf1+pFmRQg0\n",
+              "eJgJ9ubt7jVqU4S6enDDZ82+hYwxDWOROomkxsOv8nlizRgAHHE1n42Dq5sLIu8oVYp/4M1h4rCy\n",
+              "m7AmDrR9dbHlpV6pqPLshIJSKr7R6XCF5H/mgt+78ttEoS2XxbrmVQj6DQtTzcYF1gqzE9DaiXTc\n",
+              "rKcf1aBAFclenBiNHhbAMEE20Br4FIkr51a0ynzJocMgaUhstOH+7gKJGCsTPkykOiVzQeIGOfi6\n",
+              "AmLkbzIds0NOnV21ExFbxIFAMu1BymG8Kjwvo1cLb7372R2f+Qt5Z8LjmGrBAAABxUGazkmoQWyZ\n",
+              "TAgj//61KoAWP/AeMmkxh4qDG8hcZFMZjYIY//v8PGtlbWZ+A0oGGFPTAdgmU2TFbrR0QmwUCouN\n",
+              "e8fq+V7LhZ4IhSGjAEZXRALCc6lvXQaVk4Hy29vGup69bTfpCSIWWGXFW7WfQjL50GRbZZRZHQ2m\n",
+              "pjAJ2N9/bloCCNQEfrVxCeDkKfJqKlRpIdnOUaiQpsnEysqkLqMfxaCLAtiv1vFXcLPLizzlMPs7\n",
+              "NIiiAuhD4+CMokPsODEut5yq6fM1zRym2P9iids6rfyvN0EtWlvUXkAIdmS8HfE5DlX5rtipWZ2i\n",
+              "d9rb+tQcwCfWN6erokI6tARQJu2c+ZSF/sI7qofDkfNVCHii2Msza0cnJEbLkEfdF+gBET2KrdRv\n",
+              "E5mgO+6ICEAI6O/h7r7DxvTQ9Wxzo3mHNo6898yojVZYUAEyiEUBn5+alz6XfA0d5GcOXFRjv906\n",
+              "SVSt5h/ZyjXd+HmcrubYPlDuxhjCrkqyrKcbhfJHp/Mq+DI065H9OXdNO/+uDSHvPcKkibqiAVhI\n",
+              "DqTA+NZM5+PbtXMsqU6iKpSzqr3AN5mBITP84n9JoTkmCR2U/+5h8eajZc3UcAAAAOdBnuxFFSw7\n",
+              "/wBlSP3uCsGGoV8bqfG+TF6JTvUuRSAD4pZzJUFnxrFOJYnshFJtjPOw7rAcguf7FPJIlPqbN5qs\n",
+              "fqCPl7TU74m2w4/OJHMnDpS1+crxo620hZORUqqaN/UeMSuSm/KKx2/MSsIgkvOy0fYS1MAD67Fk\n",
+              "Z5FUhBYQOPZatG+Xc3Icj+kvLjp5v9fX+nJsaNN4CCl0quEK1R//8eZO87p6DKKxlnRfV62uCNE9\n",
+              "o2MWYwf9qwHYbtyqG6I4xWPTngQnrsOmiw1Sy0bIvHiKKw6nsCsKdLVPqCFU/q5rppy8Ah4AAAIT\n",
+              "AZ8LdEN/AI9CIO0JMMhrV/0AB0HLuqwUdobO4BdVbPV1Ioua5WZC0IWTaPE/7qAFTCgAnl3rAoSn\n",
+              "Kk1336t4zGyyPYAAOSIcqQwF8zee7dn7XFk1tvgy6W/qOMTmkEiEdwceoRsnhNmrNp/TK9OoMIUg\n",
+              "ShyIuwXG8nP6tDCpAEYSuvpzo5kchXf9jICMUEGqQZjLulIdzbNUEecLTDRk1r3gpdToPPcXdXTM\n",
+              "AElxf3acmkXSo1kx4tBmKJrXm4kNQ2oDIaqLOc1dGZ+ccoProxsI+jQiCldj17rGF1/E4alcIa3L\n",
+              "dIofRLGOPkev2msNj9eN+tELiQktxoUq9fKnDsRx9Nbc5IkysRYA/KsIu02gpfPyisLPQwjLSjpr\n",
+              "jTxnZViCfPC6UCMSLVKUvso8AB0eV8Q+lldoHmqd+EeBeeJOkPU3vuU/GQacMWsLnKmVt/65Nw0r\n",
+              "y1AnL9+YKkDmvNgpqgQANfZvj5NhddHche/p4la1cXWhY3W/jmtWxMTkOC4tX16bao5sNwcVWRvt\n",
+              "UHjkDIOIXB+3akBV5Lzaef6YjjT1MeUeFh/FB0tOMV3Bhvdw35krP/ItZ1RF5hRCk1oYqz0ykGZW\n",
+              "YkciBlvCsweWM2wXwX55h7SZHtxiKM3rO4Aff+TOWGbe8hXaapPE+4wKof+j5KoQ530gP62KsQIG\n",
+              "BV49pf0LYkAEd7yVzO9dhYYFAAAA+QGfDWpDfwCPWoxxjdaiaFtca/OwfG9dSAC6jYuqYuZmzKSC\n",
+              "kzbTtnf9idy9v7frgKuFjQymibohZCHRXBQdujo9Laqcw233I4Za+//Mdf06kxHe/IBTsCsxcSfV\n",
+              "ksVUEdqCe9dEwWwg//4Ee8Le2gLXqz21e4jiFyBOjP5GsM1hpupcfwZtr5Mo/ou28BY4QZExXJ0H\n",
+              "FzCqK0jKq6c//ut1tsd+kiOyZUVGRAFVkS8bi0vvjrj3zga9Zaa6Mt7yQii43DdcrobbVIWdc0QI\n",
+              "3+rsc8fgmOnJ+GJGdWYzpFLd5zMjS5ofw5IMBt0GmHVcG82Z6YQkqKJHzQAAAe9BmxJJqEFsmUwI\n",
+              "I//+tSqAFjc3NgONUfiwAKbp/vtZn3NtK6t0V/4sA0MV4unWIJlE1N72EjQeUPmvxOpceaVXIrAK\n",
+              "21oMRdsBwM4wyEJDPiji6fXmMlmmsCvOtr78Aj8gA+xKnVDFjoVlH7PPNvnMo0iZJruZeFy1B4T9\n",
+              "/2iVnlLy1r3LZhoykeyNXqaKEANWeqYl2HjpH92g+fHSONko5D2m4SRKJwFWFllUBg2RTQ3etVYS\n",
+              "PdQGNCLeaZwhH8zjnIe5Vuu46VBC79Le/PF0x5A18FileZQS8Adcvcamp8leUQ9dML537b7ARaSt\n",
+              "9Lyu3Sdke9BouNe3+hTyxzxAi1Setn//aNMjVtdKZIT0wLvPIMCsfe3gvhpNMtez9cWJYRUO4qU0\n",
+              "Dlg6h/pUIog+BzidDDvn6SZ9WUgEXhGZOFeOBYowQfwTGI3ac1V8O93aTpJwa/om7scQbOrwAjjK\n",
+              "gaYt9yqViBt3FWYRIoJJGYqmGJkf0tLvcymA+Hyayho8kg3J33tLzi7Gkd8xVzsn0AbjvoJ9u5le\n",
+              "OKsB4L1kcStddnytXouu9GStBCQSRLPeb+iGeZTwQ5uYY8D5fTAcb3C6Ob+B7IWRbbytzq93Kz0y\n",
+              "yYvbeUq1qJCNW3/zJeXeH+8yV69x5FRyM+55j6UAAAEdQZ8wRRUsO/8AYsUcQvOGOSSADI46r94B\n",
+              "/W+PEO3biH5wUahFid/4E5wZcJb1S+5KPsyD0qQEL2HibG5BPsDLysut2eDJfU6ijjP6zrYmNEWR\n",
+              "huQfgh9NsMVuoggiphkYt9ccXxVhYHn++9K8YAnkm28Kzp0jUWHgD2VeIoDjCfJPNnBqH+CERm3s\n",
+              "nubUQ9LmttVf/+MNJAJgtOFW5A6IBAcBpJtd5kPS+zJ8VxzguhOiD6Pf/zfgjMDUsehmT57QUanw\n",
+              "gbdNgBf1mSXZw3Czfs4swXmaj+42V39PQblTRJ5hVxxBfyBMHdtD+eP+pUlQP8pBAAnf3v75+Q0T\n",
+              "L19oeS5dx79IIwiodA3vtFf2KOiU2gODZqY3kJGizWNAAAAA3AGfT3RDfwB2j3tYlaKo3hdLneRM\n",
+              "Dlhayh8NourV4B4kYRi+kgAOdUf8hAGAI5XCPTeroAwXn8G2yGEphnv3FPeZqmLNmvgLgUkPciaQ\n",
+              "A3x0WVLvMk+lZn6cJdklOXHEnjNKsClw6wU0RbMDBk1zQUzYb/75rZ2h0N0KqL096XGATDutyhUZ\n",
+              "RVkyTgfbEgHdPAmzdroStgpcOUEN4xVVZX2E+XrryGs2/tIi+iUaglsBszkGSHUeEuoEpHc8PRHH\n",
+              "tDc+6s5rO2oABm+Gux/PUd+4yoXEBbF4DtdMIooAAAHGAZ9RakN/AHaNgkMVTymoPnXABzXUf7nM\n",
+              "R8KlDfCSlxubwbY5y13VVoGV2GO0t+vExf+APmeqLrIGM9X5aCQgGSaQJX4OQoECqyNRzFZQDLhW\n",
+              "KA4dfYJp7oYRPF8AMOzGYqm7AO7w7FtM2J0yD1XqM3LrKYS1dGZTAzMM0YXyhFuS7+8HWwRTCnl1\n",
+              "B1MtLMYaA8qvJY/AATH13D2takXBcx78I1sCsI+P57X6Q2Nh62/bggQuV3uhAAN0tyrIgbNQYVBH\n",
+              "gFwoUmXrxaEApAv0P2E40tM9SJDDcZe8DyE7ljCyxGjQA+gKJHzTkZCCQsmlxDg5It6wsdQ6cusN\n",
+              "DyWnlyoq3MMo7ugMYcm1YMEY73l36Y/R5wo4wUzuNvV2tJ3rSYBCfXsVjc5o1oA8OllKUpgpBG5u\n",
+              "9AavXOqCqjA07sUF9WlQ9JPrhiXa9bThYRp0lNBazKKlKwsBPK9zJ1/OayuptCCUOtFLyDYWpp2k\n",
+              "qNXWH8r0IpnJjxnQFcNmI3LKk+rH0vqX+48vd2BUqTcJ4rwX4e+V6oU1+lJyU8fmS4Kj/iQFUx5A\n",
+              "ntiGKLVWwqfkoYN2YexrEPVBTpKi81wf61aU8NAxYQAAAjdBm1ZJqEFsmUwII//+tSqAFj3B7fR5\n",
+              "G4ADaQx//3+BfZIcqzxSrotcVc8CLm7cBBc8JifUTg3KyGbsl0UtvUGR3t77PRffuzjjVfcKeiAp\n",
+              "EmDpLoqmMXTQU5wmHksjapt36fasfEiGyN1dOKyOI9nT0TFFL0pzQSss7Ux5GajOaQUF29zSIoeo\n",
+              "7hOusjWiFyZylISVuEBU8nCgDYn9P601XpFko2u3FAuYp/svCLJOzc9W7b14FY05eVZdhfmiv0Wm\n",
+              "d+i5ZPIv9mhB+8Cb50V0LQeFfsyfPeAABtfp/HIPaN+amWONE9vQ2YbC1JsqKljPbi6Vrd258gHB\n",
+              "PNyXvESqATfkK1Gnk0AWxo7XFr5y0Ce95pJr1n6gAd91M5RV5lL/XAgE7sYG4524aA+cXAa2XPdd\n",
+              "1BugfbN6YGWbktwAoVIXoUq7TnrmhBrw2FHa1aE9uMJerl9x/Rs847iKP+iuBUD2VIUOVa/G9Po0\n",
+              "ksPo1bHVIsITIKnrhXV1NabDgHAc5kIv+PJk6IroGA19oMw2I1d4rGiaYQZE9dmK1VRARJ9VXDBJ\n",
+              "Vlz3aoQhCyQZvwzvxWhVA1iU1RO1TWnJsppajNeO4Vg4/b+BSviIvrSwwqmjaRr8iuCpVTgz+ZJ6\n",
+              "95zLiSdnoIFqQJA1Hz4YR/KIOmAfhTTnHcdDelso1m8Bx2oHlzAOiYwR4NhSSRD6EhhCU2kXf5vn\n",
+              "vYdShk1Y3/pp+Wd9yZwIwTneJB0AoI0bbmfrtbbWj1oAAAFQQZ90RRUsO/8AVxVRwqizyog1fzvw\n",
+              "w3oFk0s5kH60rPhj0qbUv+9nJnU5H1hbksC+yivmpdt3FAylOp/Re8NoooEKQr4q7MX/kjNCB5zj\n",
+              "aCmG5E3TxVGWGCYMCsdEF1I+HuXX2a3wLCwf1iqCfznNMRG46GE6nIgxc91oY/zfMduLLCzyb8AQ\n",
+              "b20W2eRODsXd4+7XC1RndLreJ7Km543AdL1iUo99hYdoASXjyWRNv6wvJrmyFngIDlQOrLluZf/9\n",
+              "T8Y21pcggXpfTtvdj+B+3lZv29AFHkL2xGPZvyL4UyVUgb3U1DWd/iySeGzlK1IbRNu7obP1czi4\n",
+              "Rchm1nI/pS+cSuamJbhlQHIreF0u2/zcrSGkuOpbObSfAY//5j6RVfcQovw5wL1RQN0tcA1GtFxu\n",
+              "ZpovaLthGUkeOPh8iV5bEpupJR1R79Ew1sEkTDugAAABwQGfk3RDfwB2dNpntdq7wHtHkfExb8Mi\n",
+              "4AOIW+6weDVD4WeLhja/JOA5FtORnuW7CfHWfWrXcPJWyNJJfpx2maEKeggtR3RVEAdA1a1truYO\n",
+              "N3PBvt2C5hri51AyWveiUQtRNh8OhcT8b+NVPo5dLHlfN2wr8ZipKDuUP3k1md+EiPqVCrK5TuMQ\n",
+              "knvfHHEV8fXqrrFiHhWYrAGbSJdOrXgrQTN4JDv0LMwXs1Nl1nmEdfSgT5BF3DohYi4r2xGfiJcJ\n",
+              "KMZ1oPHaRBjgxhu40ZP5HqUG5rQWHD92UCH/Terh0cf4e0554mxHgDF9CBXD2Ey6LaV8LB9Jb9nA\n",
+              "f7tFFMQRIVaLiP+uig+B5OoeaCY5+GdEeHuY+ZE9jNToZ4yOUwNfysZaXJBrtfqEkQosI3EYRZQA\n",
+              "COu9BHjZjXsKjEmWe9Jj9yWusbXq4WMANyEJEPNSeDcqy2nLsc2OqSE4CgyCqy8blbRZqycUiZt/\n",
+              "3NpFflI5dk/7eeQ8Uo727U5FhceNm/3Tv/0N3CZNlPGV4f+3/HHJknpIjibzMw4AkTq3Lkxy1XZ+\n",
+              "FA9yAR3cZ0/eN1EscyudULe5dTvs1EvlYMWBAAABtgGflWpDfwB5Rz+lHWcxYALocP/IVGxKQ/5l\n",
+              "P8Y+UVeKYTw8iTn+GjVV8vbhgCZ5cI/70wvHdrfJYaZZyRIawh8+61+/vwo8HAkEyAQL0QVrU8Db\n",
+              "Z7+ORIRATWUQyS/LIyP8q4/O5rf7OuybqgrrJ5JQm3dvb5EYgnYLHCULt4xtpfvTsT5gEynxu9HL\n",
+              "Km20sO4q1oqcF4MPx2dj7xETa3veUfVJqfvwop/9NWsmPrdhY/wz7rinYt2HcWm7+ulSBZtWIRv3\n",
+              "yMRoNM+lyCvZDr0PaN2HfwYWOYr/NgyLM3qvI6TujkJkGWBIPuiFK/SHsSPx7iAMcrZ3CQvQC1rq\n",
+              "psLEx1Lx0vtWsdQAcjEYe6l7VHqUFbgcjcHAYPQIIgi8NauIxLhxUOQnkJo1mXO/e5w2N9AAHA22\n",
+              "RlXXsFU92TGe3GmYdLlI4OC3IklyabPhxs95veQzY6n0a2BnyANXxWrQG1vVVVAYgtb88NEdo6By\n",
+              "gCh1aEE1VpUTP0of4shaZpNk/2gd6T34r4uIClLqdADAAdaA4/epPc357p2Ro8OkrT9okATGaQDM\n",
+              "AYBiPC2kAQBkyn5ImAAAAdBBm5pJqEFsmUwII//+tSqAF4In0o7iUdIU6DQAMu59v/f4eNbK2my3\n",
+              "LFfU4bVvmOXvurgANJp+yhdNshfKZWyf1yiq02eNo25TtXkBg+c9UZquU5KtxkSr2wTyRJb5fWbg\n",
+              "+NL8Fosje7XYkSxYEiB3sVwPhHSvNWh2d4v6fN1lP9qvuUnfb1Bn+TdruqmJdM2vx9efbO5Th2CP\n",
+              "KiH3jeuRzoCzSIUG7cY38FVzT4nUIJdz+2KjjjJ0E7ZNKQ6lROaPqjFN4utrXaZfqGFX2nWmlL+h\n",
+              "PxS7plcEcSC1oWpbRWphWgodqD5c2VmFV0yO9NkxWYeDoEeaPVORAB/gqWAbIHdoZVHMBBV6fLyv\n",
+              "D3u5FppjGB4tzB+WC5jnXJKg0Sk3SkInESay6cwWUVJt/G4Tfg6wbMdEkCvCKlRosg/RTpp5P6wR\n",
+              "Z2iZfctuN2EQi36vtriULh4PVI/bw9ZXWlyhMpAYPlW3C1NvZrlJMNaSqGSSnh5cJMfrxHquXcAN\n",
+              "CTgojRhZ3tMe14Ny/HV3UfnpEJgrqxN8KZxlRpYS28Q96uqEu6NBBsBIIz0ei/Mg1x57c0aguL4j\n",
+              "dVBDXATm12Zi0uXfiRBRiIror0O2CDrlUQAAAPNBn7hFFSw7/wBgSQL3wIE2Tv5B6OJXPcoXMcSb\n",
+              "cE8qv/1v/uy5HaAJNUQCTSWlcVovOwe/GLZOdN2BNEgb1OlzNEinzyASzg3GuZ9zFeyJHe/zvxXW\n",
+              "qHgQlhmuH8QdE1M1s5tXy5mwAyoAiCrzupaN60ez6jWL/yRvGdGiPt3qJJLeMG60zAMKa7QhUJFJ\n",
+              "FMWUFrcLW6iQXx7VTZR7Qo0gz/aCe+BxT2h34J4bdpQTH59SHjOd2X4DMr2kpW5buE3EQBEKSUD8\n",
+              "yEiNy7MVRtsZHXt1V4Pb6TljTGXtC9pzGwEXtgadiRP8dhtDjxgpVN3IyoEAAAFOAZ/XdEN/AHkx\n",
+              "u7J3fsEfo6cXtbkNOd4swcOB3voAJyKHu0c0/MGiiYXv+2wca3XUwSOEG+s8df2rHPxj/J/Armyt\n",
+              "j86AAAWOWZsl8AgjGF9fWv1mQf9jrWNuA4APvfeLBFbZJZm7otp6Fc0DFqB0XCbEvLTkRU5ySc7e\n",
+              "Y4CD3ziWyxgWkLgxNxAV0V3rzOqUGhFxcTbBCJI75knYyulzgB9+SazwgLVSR2N8nND844Y7GLCN\n",
+              "0aeRWZgNIAWJkPPhP1VnSRo1jOpV+axgAXL8ExpNwIvLk+O8lekZ0/1o7sI+uJ46XyI2SuA6uJHd\n",
+              "bwUKNMI2qDKAM6f4kKlJLSQWqzXAi8hAQzI017i25Vpi5npQJ4TsJeyOHRvmO1wY5ZnIEZHyhgB4\n",
+              "IoLWrdA5opbAou9XxH6m1F6osqepeJLd97Dr7+5BqWzoHoOLhOxNwAAAAQ4Bn9lqQ38Ah1fDGltb\n",
+              "SoFNBABy4LNe514R+dnaDTYn5E46OmsRrJgYyAm1lSXdflAXI1+CFQXE0A4eKb0poyZSLaaXfRBJ\n",
+              "r/tA3jW8xYt/UxFDszVrqnPHP/Ny6pw3mJ+pwWr+YYAHxNaLyZj85nxRNPFMUkOr96iCB+MslYrg\n",
+              "cr/vUoZCrrFka9nw08yFJlyN4Ky9KHUYJOXDrBIiz8KQQaHFalCe3rENKk9raHLB9E2PdI37xydW\n",
+              "9R3Ktqa3KW5rMJCOoArO2/3trkkCh+/FDlbsei4VdbDQ32DjCaAkDFjCyuqOJNsi8nSI2KDSRFCB\n",
+              "83l81kCObhPemVMTlMBQzSDvOtDFUtuVwHtirD8AAAFqQZvcSahBbJlMFEwR//61KoAWweTusUEY\n",
+              "AFR7WLigAceU/KgvW9LBBRTRioW652v1Xpv5tYMFhkRmmlUca4/8lM9NJwOZFgbdLq3dhRjr1SQ+\n",
+              "iitgTnIKVe77qt/yWy3INzcVxffYfGucVy2ypyvLSUZVvVzu37Ufe4d1uKQAC1EE3Wwzkx7sEK4N\n",
+              "QwJyCdTZZnLiyrlEXcLAMbB36CvMtmCiaP8XPpa1U2RaJxnBB9qYeP0+JCORflaC8m/hyWfMppd0\n",
+              "XeCFuAYTEakC9vO4HVF02QH4GZZigg7j7bXnvstEtP5QgYZViZcOoAaQGKtWm3PCHoS8mKWfCUk8\n",
+              "ZLC6z2a10V0U2DavVH2m02W1Lc4/2WzrwUTHr66DOaP+urnPdabeHdXruv1HJ087InGSipJtxGko\n",
+              "4rppNbdlP4z6g2o/ksCKcSZ76uS1diKM/39wzVYDu1tkCD1lomve9NoQwUToKqCn30PDqMAAAAEr\n",
+              "AZ/7akN/AIdka2XuDkeawxOj/BZhZtP+kNbRABb4RmWT8vSOMSH2HVKuz5/n3pn38gQM6YQqY5bV\n",
+              "v8KsLMWKt//3BpX7BUiSjA/GsXEpiGachc2o+KqjjRfujy3SLc+TvzNfgePwT9w0Jj9Y8j6ORxA7\n",
+              "13x9/iM5Lx1s2OQQyRluiOYKxXDE9QjNulPCcMLJFKpvAfnZmzl0pzzHw/ANcBEDhABHQ9ftCkUs\n",
+              "Q4pQOQF20mJ1++bXoRcUz/lR79ACwohpzpGuaQCknCVhUL3lnnyQzloB0PAIRq1VnOd+y8D18t8/\n",
+              "IEva3L9FTrRi90eT/2pNxjMaqrOmFzrhjd2kmSd3YBlll+A3KrjDn/HtXx8SDjztM7Km7BEd2LVO\n",
+              "U1pVGn0+C8gCov9gxoEAAAIMQZvgSeEKUmUwII///rUqgBet471BV4xl2QAFRvb+6Uilj9hVaCt9\n",
+              "oXOXB19FM5G4bNDJAOl9w7HrxMOF2dPOUf977Rp9NoBObCR9cN42Ht77Y+l36qfp5SrWPFz3DG9k\n",
+              "Uks1s5yfRvMME5RxPYk9+qohbe5TR7z2WNWBJjaTvhnu4485WU3BaTyIbA4BRRdj0/JwsbCXRVZy\n",
+              "OMmFdXnFdxhNGZ5JMCQy+ip435WTv8KevLzG3OUTxX5d8x0gaiQZdaPwNC9GVrgmtqTc0z7He5Hx\n",
+              "p/UnXiE+WgHU095CwXga4AbeOtQbj0tjxKUoS9sAoJ5fyTlHv9FnU0ujgUuoA3Kj0ma5qF69zgnv\n",
+              "MTXEIqf8zuYuInk435YB6s5Aa1W77q49/ZLR70JdKU9F42nWnuaGIFvaX8JNp0NTGvA0s1VSOWIl\n",
+              "YVdpY6hSPbDqLYXO/LE7X1D3sWpexh+/kcA2B6pYDzx14bD7OD1f9pMDWxIrW6BpNH75M54gOMY1\n",
+              "SxoTsfh6KVoyFK4Yqd6lPKCLY4O17tm0vzqLEva8zNeuM7b2yHKwMHpqK8FV5yaEer9Zd+uSgIqd\n",
+              "eftECExc0GDPrda1mDLPyRR8iDjZRvRS/EElnceTaWiUEonB934ThxItQqnJINdKSyNdNwx44Jgq\n",
+              "H9/Zh55FLA3sdVDr+1aesKMfNmYnbwaje7GN0y0AAAENQZ4eRTRMO/8AYEUc98FD5/CYkGD6VZTK\n",
+              "7qaMD8JeD5Yvz1s+LaCSFWcn3aLtkXWLu76WBTjEp2boTz2lISGgYIiIhTqGBdSAvn4GaApcqQ2+\n",
+              "sy0LjwIg9aZXDdjP9AWFTV1H8wY3dWCf+Rn8X8p7dsAFRxXZ4015PG0t6STtIq5DOqARSPJ32oCq\n",
+              "OenP2L2rQhT0bU7kBXZqDOvuedMFko4K8dbR3EOKtstAjt1gHGNubjQIVeNhJsdrdMtXEY7juX3P\n",
+              "NuPteAILXrR8S3R5mIOtuZ+vWEUdS+Inr7FnZsbQiIv9i7KDzU2m3LJLNdjmArFBBLgFXYHDvQmL\n",
+              "9VT51Mb8gx1TyNar/CPWDggAAADyAZ49dEN/AInJdfYNr4ilmYSAMFB4GADpypoeWWXE3q20mGL8\n",
+              "wfGmH6ZgcbtTXJWZn5/uB2IPeQFG/rqNYZ/bmIUcKhccFRuPa9wOgu4Qnm9oi81y+ChWQK1KoKDK\n",
+              "TWWDeg/SDhV8w/q9dFY0rcekgnjPKbKFgzK+IO7hoMF7vhpMoVCqvwMtBaesBfF4bzxIufyftMba\n",
+              "VRaJWuZpM22/FtH8FxujQ6EjGNr9PHZg3rsxXbkYHRqZvH6RGypNdfKRL4serPMKtCeuCWEKaj1Z\n",
+              "h+pr+ULdNvwpLLHfA3OCu3Ql8v/sLDD/O1LVB9ug+l/wHpAAAAGVAZ4/akN/AInJdjcgUcZACEqh\n",
+              "GvWiTtr19IbQdv8WE1dBOa+lNipi00vM+C9W8F7IDH0aaS+KKFaekfOwUNG520lVemVKNYbjnPl7\n",
+              "LimE+s4N2NJ5SYT5+XRMb+vTvKCkG/By5wQO/WbZo9HorEm10+Tu4CVIj+2Ky5hDZl+kA6mkBK7E\n",
+              "3LwAW+4rGYiO9JH1BLFQj0ZOJq0ybrdVynOYOw8TudsCI+I3fiT5nmYCkIO1N7h++s67fASBLfgP\n",
+              "CYo7yLNwfifRM3ay+JhoRmwX5tGJ8l9w676Zo1wDaqZ0Q5guAYSxSJk2jHShR6LxlZmIVJnq7S00\n",
+              "iBOM0mxomzMhjpxeX6zqy/aA2SEREi4ulxZsEvlIWhLQ5YFv6LMkVEh9RITRQOsKGEls7Y4eSRWc\n",
+              "f23FGWOVxL2MZUmPGVh++Xygx19XCiXwoatt/s2T7zGfLkQ2IBiMKXoeDb7yiR4q+0v6UjACWT2H\n",
+              "kOIRMpG/B4KQPsfMRT0Rk3cAwV9dNnKm4XTlo9P9TmyT71B/Greq+KvhEBDxAAACJkGaJEmoQWiZ\n",
+              "TAgj//61KoAW5ktFwTkgtAAhBassVgP2a7WSOTniW7GlpUC5YARIimzpboyDKn/53KIxVBS+A0NS\n",
+              "3NuuWMzq53zfHvhoSdYO4dYooBUDN2VkLpVK3v3kQo1FoE02X3cyV2j6ziOTJORgWGzqU5k0XKJO\n",
+              "1VCPDS1gJclQYem5NlGAENmSiR9I8XvNQLGvpLGF/2+aU31xCZzIPp4tUxyLu/gVqq+6L5DezfDz\n",
+              "gPP3+vv4JFttE5Nyc7LysmCaQfUhi6zPymHmdLjs3bZdma4hV61UMMsGBNZfYf2GUkV1dVZ9kkfz\n",
+              "RyUYJPFdwjA5S++T8sc03o81MYXnXYkO9hGiG6RRLRRV2fPSgGhghnaqxRhYVQiuVS0ENIpjxqqc\n",
+              "KBEaAMs1VoaLKEOrNhZ8yB1VLLV9KSiM7/prkkNKRuNLp0WeTv2eHtXhIdAfhKb+ic7Pb48CqpOl\n",
+              "FnnbgphlxDaS1dplrA4VxMNzEL/27xNMQzhuRvnSDNb60j/kSJHw5x2JG6G/VwCoVAfFrZll45AB\n",
+              "Puajv4y9+7flMd/pR8Rg9UAn+cey+vNCcCbbn7FNSWq2hl9cymk4fwW6iqBgiFEQ7YZtyDoNCyYz\n",
+              "KAnW0gvHCg+5n6+qxC+xDS291Y4JfSW927ZZudU0tXxvupwcKf6fDXxz/bqsOMvxj6Y81+e6Dezh\n",
+              "B2/8nCpk1Qc7N5s0JoStEQ8+K2ir0vIXayhFQIgAAAEeQZ5CRREsO/8AZTZTJbuKD3PiQhYpzA/Q\n",
+              "3Iqsld8XUz3sHppFsAHZevvXPBLN2cIUd+YCbEEH6MplVFEcbuDDV0dnlBcrCNrbp3+CAOdBsr6h\n",
+              "0YfLGDPxHlFlUCi4qTS1o0TT2Jzkq8/O+TU7SSImG1EjEmOGpKvxjn7KxERq2Pbd/0y1sNHk5hiQ\n",
+              "eJwHwc7Z19aIrWes4h3UYQqHeU6kfCpUHVgnGubU2A0Xjg0UrouNSumFogz0StLk4fuhL5slF3Bb\n",
+              "3NpP7YhgiVLV0FNM21/pfbXvRQFzmliOaZuScgePqa02nvOdEHEpGVRPLCGL/tvzSkZqhXResmQg\n",
+              "1qZ/TxlvqjWYqPRThBIk2nP66jbd6NLagdWz1BtbrwB3TQAAAVkBnmF0Q38Ajz7dDL7wKLyRAA5r\n",
+              "u/5Co2KbB/AnQg3XvWeaImUuto8KuobiZ5Rpi0jf/+r5lFprj/mYxpQ5OwqjQqFG0eXwqi1D6M23\n",
+              "HLH/3LvgYXkbAAGr9uWkQaEU+TeJ38WNXodDC29t8Y0uYEpwNzyC6FqtgkCyDYDpd/nESpdVRRJh\n",
+              "15SV0TP88AKwZsT7yWH2r5gpJv8AhXnnWmKJ/WMwiS/2+Kf3ikj614P+BDohXhMYGO4GSZ19EkRI\n",
+              "RjwO1zoy3Umd4iOMuBBPzevAs74sU7IUdkUF24rNAstoyqnAUgY510L3SgPXbZmJYMv+tRpT7ZuM\n",
+              "oLxE5ACIQ+eHStmGZgh2P1nvrIaZRiBxoWZ1B+DDOtu5OZpc7LbajGP/oy8HbEFyJIcGXHGB5VXY\n",
+              "HnskMmabuu5xyFIJcVaqbGg3TlqrbBE29OX6xO7K38oavU/okVlIM+AAAAGEAZ5jakN/AIdXv9ZL\n",
+              "/wCpeCQF0zyG8897iu+TVNq8xXl3pE8eXm424VBKoADmOQ/RgBgC6Y0IzpqUKPVKwCZafdEIuhUv\n",
+              "zhgtxewRpr3F4VdMy9NUqqvPfGroLPxDW64Af18RtCEv8t7amX9ezvEWK8AgZjHjHXeVi2k8dp4r\n",
+              "TuMjdngEOGe6y0V0qXE0vJudyGSblaiStnW6rV0e34JxbdN3Qbajy6ozlLfOkq7Wqx1iLXxa4foY\n",
+              "IPBIjzxdye8gOjZW7bP0axd+wppVHkXrrvuxUf9dp18AanJIIFv6MCm6ujRO2wyu4ZfSbZp/KVFm\n",
+              "xvxpBAJyjKSdCoPxWylEDyms9NAmwAADmUiy6WUOIsiAC130X9MRKfeLHi3miJh/YDGeINuX+P+e\n",
+              "NWBXxp3RqAzo1eISPcPztmgXUHCSN2VRpnCOFQoF4yyryK4v7s2U4a7V5e2sVJBhb7kguiVFACK3\n",
+              "rbLSCnWI4OCs6u017nghnGW3Juq0rF80iqmo5QCt19S62wAAAkZBmmhJqEFsmUwII//+tSqAFu/w\n",
+              "HjJpMYeKfGxaFh4NwH9VzFzipiNnWLhZf3lim8qQP0NcWviT9hCfSjxxrnYEE59yPQn7u6+tCr/u\n",
+              "vn8/iyWB73TxWIDTyqwOWzo0R8Wj7McP4QWP8yE0svd//Wkug5+3cHmcpP/ONbeBn+TAQ0VzErlc\n",
+              "2hXFLnmGW7EB004qvGi/S7JfG21T+V5Sx9Nre0PuomioWltV0uJSYiMg18UwZktQhoyeO+qpPgky\n",
+              "U9/xX6NUrUyAfCz03v4wSV58lpzV7BxftApX8ZGWBx2zWQV/YeOCEWbmbHqvN18Jd5FxK1iHRqe+\n",
+              "nBGg6SyBQEQQfCMxCo37AXM212ulRN9X2fE3P9HkhvkaOxQZ5AElyFJ4BlaM9J8bcUgOX6NS6Cqb\n",
+              "n7IHMcCIPjAIJ36atWVr0EheDYyrwatT/sRxqfSoF0RgoVqtGqstMXZF7XACu2N9LDV5Ss0B+mSl\n",
+              "kJJqGxc50wazbtpofP341QOLrRCoQigLO2IFkJyqTpln4FgoWIMbx8x6cKkFmIESXv7mZEx6LOrL\n",
+              "ggZa/EdzllkBPCO/+zBjmey1Y55MrbMpoidNDpdQ6yZ4UDU0ai3HtghNjtrUaVDC+dCrSCASLB02\n",
+              "bO819PX27qwUTWW1MCrVhUzQkUkht4Xa4bdnUW7zTudPa++EPxUMVY36vPDJoCGilCgIXzTOV6S9\n",
+              "OVTh4+OA6S/XkcoA6ZjbQLERX5kZSQMoFJs4bPot93titzpDSKAhc1QMx6eKK6Ol2IEAAAEkQZ6G\n",
+              "RRUsO/8AZUEFdKFRxHYcrgnLV1IJewAc5dAL6/Pr5YWcZb4ejev9b/lpY1ea5Xk1AlTe44c3rPkF\n",
+              "DXI6yAdEC7kxPh5StAse03AARSF2nro+Dr5bfPJyYF/ERJ9NScPmUIVihvTCsyh5qmuoAH9P7eCu\n",
+              "Y8rdH1hF/pTSa+Z1tzZc8gwGtgV/YsMtlWLs3VbLWxt2KTDW5Y2b0HA6zgNn25rXu72r6iiN5aw7\n",
+              "sjFipq/8rjgHE9K0EK2Opn+0SPK2Rbo28aoNdC9V8VxW1CpMNxKjFOs8YmQmJE6Qtkw+Uo5mh3ic\n",
+              "7Ng6Xje5wAF7a8Iyr8DMIwvMZnnVp6ilQ1B/LSGEPncviRIHH8w83Grtt0CsL1L2isuyMboY11N9\n",
+              "lxQPpwAAAUABnqV0Q38Aiz6zZgMl5b2XXQAXQ9yHCqNv7FVD9CxHdTnw5pqRTLAoFiba5ss3lqXG\n",
+              "QCf4/o32jzmzNKjZDN2ghdo3OS7n/NFKTMs4yX0NTqaEhdnVRvrbcGvcKo0NYMgzE8UNwneueU22\n",
+              "1vpuKbOkae4P82iS9XSi8TlOPcF8mmD+n9qfVTXzL4r0M/s5xxZempvnxqhz38EgmSM/Zw7kEyiv\n",
+              "giyuP/YjNhFl3FVcOSLiQTCj+F0nLUE7lia+UkuO/YNBXwUKZKD8Add8BG6ZTC4bD/RSktc7uv8w\n",
+              "NB82AXgnpuELTB2xZFOLAYJncjo03/3uAK678Cl8cw8fzlbnSpp5eUkHacCUtAY9LPrz/OMf2bA9\n",
+              "vBE2eUwrxz/W0Sg0tjzkUrpnJSF+xYsA2fgRolT6A0NA++mVN8PJVhaGzQAAAX4BnqdqQ38Aj1eg\n",
+              "HO2BrhbSJp3bjAA7Lyx/X3Tt1hQ2T/wP93u+Km2fQtCsS47kHT/v6cxSu0EEWzwOVr17m7uMIt8s\n",
+              "rOS2NL0s+wNbNsQiUhFGWcubxLdtukca9QFTdaQjRXuW15l7gz2QnuVPe/r9SLMinrQ8TAT7c4JB\n",
+              "GrUpwbYY2wvPKUw4NOIKdjGz2TGxM02Yhqm+YQD7nu+MPeXg/5dBf+XeKfPK+RchTbfnRfx28pUm\n",
+              "+MUq+ynmpWVmmfO3TbD8gZCbZRUeK4LOH5lP3nvVvkbZlQVhN5vPlxxNouZsDfsmprxmWrHzH3vb\n",
+              "E+c7VsDA88L9wCH+ZmQGzxFjyOQ8cz4P9rsZSuU8vQS1h6fmk4XXUosrmweEGKJT/Sv5qb0OG8e9\n",
+              "voRxFaPrroiqkALWSnA5n4zcQMwfY/xXX1aR5rslt9ItB406qJIsbsrkl8pXUe2CwOVm9B72bhd1\n",
+              "lqsCRNktqyPMF/Ek4JsxscPvDjbSqbQZL+uT8zjgAAAB5EGarEmoQWyZTAgj//61KoAZQB+OVG5p\n",
+              "SZHABUb2//v8PGtlbWZ+A0oGGFPTAdgmU2TFbsuJ6mwUCouNe8f1I2ythN04JSJ5lx+ik6KpnC91\n",
+              "1FD3eD5Jit+kJIg5holbnldcijL50GRMV+Tt0L65TPBxqSAUdrQu+eLUTHPpJCL4CV5RJau8pEIv\n",
+              "uK3a7QA/UMQ/nrDjeZ6jqf1BF3JjbyaeIc5drvnYbR6lQ0gBIzp/QRU9xrHm8FESnIe42aooWDJ9\n",
+              "bVMccs59QBQd45WisW0MXV7NFtyepgfK7biPJN57MDsWL2A4LYHAXH6f6In3GVsSrYQ2HUKGlxpv\n",
+              "Yf/Xvk0pBnHsuIEsslXTjxwTTzuRb2YT7QCJp6yHiUVL67n8RfvHMNoHfUzP4rVgPSXcPL8FOP2d\n",
+              "F8GxovHNOmsOSUyc+t9OZXQFF+4FJNSN23FsgARohBEJ3c1u0ax3ACLYlwfCd3/U1mT29ftZkWMR\n",
+              "uj01t9v2AGHvgKM29X2Vs/ALzLNDd2OM9z+AC4TlcpgcRujIhnjHf17Je/8RMBqJCZtdfrFmz6AW\n",
+              "Z/aNIv/p/WX6adpvStFWxoDAnf+Tai9COS20TO4GHDviQkpMo6tbNTk4tiYWsmvBNq5u/aO08r2y\n",
+              "Bs1eH2kAAAD6QZ7KRRUsO/8AZUj9pUTz7rNMoHjJ4gSsLw2wABNFEVCVBZ8at73oa3C8UmeDMVba\n",
+              "M3uHP8p2EFDXTkl9EiChbxZZgpuvefKfc50lYhoTJ/7H62X0Z9NX2I7S32WT1XJeJtD32zfVBu3K\n",
+              "VmE+30x6+W2pKnyMM0ZejDKLq8WyIyi+9rC0QVVyU0N739nDCyt6aqRfMfSdljqTnwOmgDB5pHyK\n",
+              "U8Nf/BZxnIET5uBVX/VcS4bjmT9sCYYwmAz5vBy8cv5J53FYPh0/wF7kP2myhm8SfTnmNtpTej0y\n",
+              "JjLbrdGSBUAu+lwbCsr/YdOCYrxvvrklZP4j4s5VlQAAAgYBnul0Q38Aiz6zZf6skuDOogA4jl3V\n",
+              "YKO0NncAuqtob34dJ/eVmQtCFk2jxP+6gBUwoAJ5d6wKEpypNd+AlIf83kNIAAC8trXyGAv3zzzV\n",
+              "tAa7kzCHOXS39Rxic+qZEHcHH0Hx0iIZnH1UNeoS6dQYQqolDkQpOXG8nP6tDCpAEYSQsJzo5kch\n",
+              "Xf9jICMUCBjMQXeVS1i3FdA07mrKCBowVzEdee9WvqvXV7KuMTufiL0hA8BHvtD6VFvEZ6eiqgvN\n",
+              "8RNM5cYXQ2i+4Lx4R2QlAIN1NNxqM8GvSjSh/rgipqY8DwHJh8p9Jbu0Zs+w86pgxJN8m/cvWxRZ\n",
+              "yFAtI7sBhDbJnNXx83ll0o93YVJhxi0TxWXPf6PlHZeEyvr6QOF2VVafQjsZUg34P/p6tj3lkAer\n",
+              "aZouLIrbfbTrpoGdtXuXR2qC418s780GZsUBVTlvppC7dgGYqQzB5daoV61BoiIg6tQyG20Yk/Ib\n",
+              "TtwSJmeU5Eiu/zRo0bpbU2jgV79WVCB/SVzxsmoD1jJEhzN1FHxsbajOijl9Vp76GofsezNr+37n\n",
+              "UWWhPPzCk1rCLQgaI34ekcMUWq/vBK2WDe7wKACe/5M5UglN5Ct9Orsd3SfYPc0336usW56marFA\n",
+              "xW2XgVLc1GludnoFyQrT+oASHSl68jJc1j3I4WTIeU/p+eW8RtUF4AAAAR4BnutqQ38Ai1egJmdK\n",
+              "YqnGBlYUAF9obzNVJ+s4Wyt0Rq0YuZmzKSClvCu/741bUzMW9+2RqBxHf8xROd9WCD2DFO6m3iiG\n",
+              "ZOgLMC6WQsGlrWDKBATBQkW8M70y/ztO1ZzNQj1ow5FREW75+T8qWeYnaEkP0sDPfhS/8A++EHpT\n",
+              "ONUZpoNHugOpCj8EFvE/MnQhkWbqDB+V4zYJeD+V1h9PGTTPeM5Ykyq4ZMi+8E5Gka9dd2CFXMaQ\n",
+              "M99mRo+FOH0+y87A4U4JusoMgrnGwBHn7tNdR1Jgk+wKYqmIwBj2jGPnQFJXhHhE3ZkpIjaeakM2\n",
+              "8MH5c8xC359KRjK1nfiZHGSkxS98YPps7lGGiAJ2WdM/l0XaVpItX1VPHy/wAAACGUGa8EmoQWyZ\n",
+              "TAgj//61KoAWNzc2A41R+LAApun++OIZUz7EikV/szjfxvYPLx+f9K2/F/he8DHawkBMdV2wRLxA\n",
+              "t50GIuRUSWE/39Xo4nAQqkjDTJdufKMgNIx0erMAcY2QA5ejjVo1tlzncJOxCqGpuGwA+5/4IKyu\n",
+              "bmTzdPecTw0ZdpVPq5j/sb/uUTmyS5oriK2QJUn4uMhurpWU0pM90BFHxmx/55iJQnC/E4AiRjGv\n",
+              "TSfvy9eol7L6q3/AmWDGKQmta5h6TQecJSS7keMMTmFMkcgh+dQEUTFbphGIZpTz6vxfkWPPyqpQ\n",
+              "VmS0gectGBeLssajkGiu1ivhXeMUvGnpqjpc6XSD8FJ8sVdfwdsse9JozsVq/t5YFq5+AnEYcopl\n",
+              "mlIiLVwif6/glDa/FvPVZyUrYuYY9L3TA7eEHe1IcHWSOPxpnafEFBrVGoeZPrbfymiVcHOQ/3CX\n",
+              "aGrpVwdWrmOHr8jLuajUxWOW37ajHobcyT1hYWMxRTx80fZmsfvsrNw/Nztdx7LidHGE8jPZ4gQZ\n",
+              "DABlByR/bof6mTmjqkfbsR1PCXy4RDNnn9nCnaSnb8pCApsF6YsDTv0+UmVzx2ZPSdm2LhZIqOim\n",
+              "mhiXHWt+ZE1dnYkLwTdsgNYEeAUTjY5XG25CAykSMfKGwGWeeOwqKmLAqTmb7mCXXxxpy4+bbELo\n",
+              "RAxOLFOR7z+Rlt4VIVMH4QAAASRBnw5FFSw7/wBiyP2mEJvZyVx6ACpM7CM8ZBKHKR5j7ndOem+L\n",
+              "X5lQTliSlHrc19blDxI+BarmPxVVRFr/CorqLGvI+vHNUfF9L5rOth1seL+LchCRD6bYXJMlctoQ\n",
+              "KBnrSfN8OsFA3rCX0rxhgXIKgdEDuCNRYd4XCiw0AyO8VPwgQ3UKQOwN4T9AdwOVZht3xWSjlGSY\n",
+              "LTfR+DOcni9vpFUI/V99yTFNeriW/Ezi0Mmb4Xp+UrrTAn+/oqePQryHATZ97i1I4TzdZJ6ol421\n",
+              "ZZiGDIa6I2z+mz36WJISXYfn5PcaqZon5evy7wkHdXdLSXQuyy6RoW3UMK1kv4eYGMx6MEUBV881\n",
+              "1DxJ4Az2tfQhJ60iq3lK6xGARpoGTWiGA3pBAAABAwGfLXRDfwCHPtdry+v+2nyY2Sk+gF5YW5HN\n",
+              "XoAL6QRR4alJgXnPRJGLu1H/XzBsCOVwj2OHZ7/Befz18ioG7PdTUWTo/DFmzXwFwKSHq5MESJ/K\n",
+              "+czoaBaMU0SilMUvvgF9NaNkzEcYOJjCpUUkl+lvc9iWY7aNcNT0YkO2YuPLl1ZJa6XpXyzgvJfC\n",
+              "YABMMMlHP4hWdgac8C4JyYJle4OEiXwhanMhhDIkpZpmZqqPP6iXGzuSTb+0ZDMJHqoDGqJmkb8S\n",
+              "IJuvyZGNE4panvJTPVd9f7g4/aXxMPm3Cn3wfT3mTthI056NzanOEWKjM1qGy4olpTOi0cV3zUKu\n",
+              "VGl1k7sAAAHXAZ8vakN/AInJcXImIY9AsY+/nZAB2XUf7nMR8KlDfCSlxubwbY5yyAvaK6FdhjtI\n",
+              "iTEMX/gD5nqi6yBjPV+WgerMVdQiwmsTWCh4ZDRMTEvRNiTK06p6H4BM93iWfwAaKh8Gz9Gaukwy\n",
+              "InHLEZ0yD1XqM2twrrM9K/zMIWUOeN0Z6Qpdges4mCaPjYBUMA0KTxEuHmES85gUYlt0s0Ks9Nu+\n",
+              "2hfyb2t0rmyvRs70WgBBgYrdeTZMCwmoCbRHPK4oxsSlCang/p1gu/DmbjnwYRln/v7ufz7R3gdP\n",
+              "Fr7XrHKEZc+f98DBxQMF82PBbmDGtLAQXHwptz6g5mqHfaJhvvgj78jkqTGrQ4WXMBaKzHGNvGYe\n",
+              "XIR0bHtcMMQd0uz0UHs+NS8bhlZ93PGBn0DI4S7X4qFOiND2PCIg5ogjbfFqU4Kuh5oLH4L3vi2E\n",
+              "bzWP7DaofhwjMqjCqAvZAgznNJDsvnJzQxJ6Pqjj2ny04t1drdQRUisSLN+PcLenLQZbe401Xg2H\n",
+              "yhW845ouHrITGSqb9EOEeoN97gj42PjsdYRMVLRDVvCV2BOAqdLbEmICPHZnyy75qPsejK7duPuc\n",
+              "fJ9rEnjynB/HxYz7zf/RM6xyYbzIoc3AAAACEkGbNEmoQWyZTAgj//61KoAbj1lLPyvb6PAZgAh9\n",
+              "7f/9/gX2SHKs8Uq31kdycpXc3bf6XPCYn1E4Nyshm7SbxYTXwR3t77AgzFtBuE6fBgZeY48yXmAW\n",
+              "rqOr3iMlgArjVOjemrjz47grY/T9rKmhvhaqPi8pvZTzkzZCl+tV6nzXVbBFw15yZW9xk2z611V7\n",
+              "GITjv5GH4Oi/06B5IbjEMVKEcRpvt893HwIyUBXniM9I90uh0TBxOedvsxxE2iLZsr/m/GNXryb+\n",
+              "9as6btju6GU5FfXHAHKy97PxI2Rac5Rx/FoPiuKEecRx7EQrDfRmlggPPP63oMY4jkBeTzC7Drwp\n",
+              "8ik2Z4rhoAMWlcRPfXCI56oe4Jt09oRInuaD3ww9/jGDjhHIXGbNYM/s5UG1XuYLCqaLxESIyPG/\n",
+              "eNnETthXX/QZDvDCFX3YINANkqDvHlUQ+vcUvksaWF/g1aVcMu45c8BoP1coWBAVWVE6iyDMwfYl\n",
+              "RYTcnNfp26mpOfqiSJnYH+AFj0qGJttgeZBuJCzdV4F5EDreo0WWAiq/0jdXljJ+ZxDij/UazQOM\n",
+              "0ct15Q7rTOqLKy+lpOVa/koSWj06e8eyy0wY1FBSVaROGYbDgXze1QzYiVyP6+WTk1fjz+Do+J+/\n",
+              "TxVlHJsfUOz0tbPJ3R4cSjRVigTxPg9VAYynpzzMlIr0/pCOGd4XYyl3SGTwAAABOUGfUkUVLDv/\n",
+              "AGU2ltMhgssRVFnYDYHdfwUIOpARUIP1pWfDHpU2pf97OTOpyP7SrW+j72yMHgCy10/KQJvVenOE\n",
+              "eMrSHUfyq6lVIsdEDgl0M+/NXx5VMpg+IZB+I7xozsY2f0ARjiAjA8ZSqG32YEqaGwpGp+vfKL3P\n",
+              "hav1CfnyaUmopPCa0Y5ww/PZN4YINPOwE+Gg36kaKP/ME/B0d8v00CzvLXmI8pIa3TqrGIa7PF4X\n",
+              "8miGO6oXkRH45ag0gFdgkGj+BD1PvtIptIkuqTa5jzG/NewDN9cCfws/hjc474K6NoCTyr++7Tth\n",
+              "LSIM60DcVje0csuhEMwOmCNob99l/AJp/9hMVsVsEaxUNsWBZFMKnZoLJU/ljkNlTtF1zcUwJoZD\n",
+              "oLTT6FmWVzlFnyfjiJdVIqMAAYsAAAIPAZ9xdEN/AI8+s1VkrBucudR5tN1L4cUDsugAOgW+6weD\n",
+              "VD4WeLhja/JOA5FtORnuW7CfHWfWrXcPJlwit0rQdaNL8wYmpMOBxVMKErdopYTnWfb0EZST9ZFP\n",
+              "kGeAI5wBNyE7pmk7U/hz6/Uncd5yONsvInzdtLdlFGIUuwPsZsiC4nxcPKJ4ER73zqMcPC62dMwB\n",
+              "YeP2JTSzcWxmsY8AuUeSUMff3wugzCWo2dZWIqj8MEevc9dnI6e4RX4rfqOmeKfJ7QFxuPllAOzz\n",
+              "FkyERujhdmr2mdRExctZgI01tg+iF/NwBCqP+hQ0BZaq12BgDPwBcWyuj8PXGo/75aroqbic3atK\n",
+              "78lcQoP6TccBH3q4TpJbdFKZCXZFrS7Hh71ZQxzuADlZ8DDRzGHyvFJs8+7LX0Z3SVEeli/7hzNR\n",
+              "3en2BovQV52x/rwTox00ojUHS89/I6QK5rr9xZ5z1Evdog7ewBETCofR8FQPxE+2X576ofb9SYpa\n",
+              "RU+FFWJ4WPQBj/u1ljXdmoINHOgs90YcpGG37DHSgRaxKh3h9samVWdsr/7ZPH7Krx9nfE8zJoXc\n",
+              "5Frf0sUOO22BhUTf6MatKarbA54SuNAmIi3ejRZKQJ4XCjhpsLBrmw33yy9Nk6OT0LCi0ELysL29\n",
+              "OvbOK/J+/iRz4bP6v+/3ppYXG9MzSEeggmS96wm6yOsevJy9wrAAAAHWAZ9zakN/AIdXwVSZADwX\n",
+              "ZeAC6HD/yFRsSkP+ZT/GPlFXimE8PIk5/ho1VfL2NNL2pqViOd6YYnwc7ksNMs5IkNYQ+fdC2XMm\n",
+              "GpZcBQdS+anJcAkZpOHFxqdIo1pLhI3h3bcsWXXBd+BTXZhbA2JSmhm8EWBGqSBNaO0U3Qcdcea5\n",
+              "428f3xthr08dSK0oFN+HNErgBuKfL3JZNShDHaW66u0MaG1B/cF2Go8z1F6LGKUAmsy0D/C2CM25\n",
+              "q38c827dgYTnZjZnTFxlPuxm+JuWvYpOeWyy3J/wjV/USVL+4BKz61/Ccy+EH/JkQUqRmUOtvYei\n",
+              "XxTdexyug9nI6kyTGc2H3hy0C3uFxKKFKo9PfiwDCQWhQ1+vZIsII4FYexn+pQbkz5kmdlWKB5Lx\n",
+              "ONpNVggWvIuTYEFI34NTLTOf285YYkebB68ywIJ5f1uX/OXMZ5RxH3gjNZ8mKLNX9suvs06qOt/Q\n",
+              "e2ZfZ7Orgt/l3O7GLxwWvzugIsO88I1KhpZhgYDdYZ//1lVBcwG/tKVYjF1obqjtyFctY9LPGIag\n",
+              "318ehZmIvkhW9djj90e+pnWknudbQDv3Os17s3l7qFADdqSGqYyGaSU47a6O12HCRSwmepV1bewA\n",
+              "AAIrQZt4SahBbJlMCCH//qpVAC8LE+AX+ndLRI9AAL65x3/f4eNbK2tvWi3seP5qm31GHdf4edmk\n",
+              "0/ZKv9BuxjUGH/qoYxXDUlaWZFHb65x0lomfbckqRBtklU+1LGTmYtvnPAbKnUSAh/jTBATZpFND\n",
+              "l6V6ofQ5PTBcFjOWwgI6YqalXUkmqnN6g77O4xvodhM7XQWhsA44ADmvatn61wvReF9d9MqoCN9N\n",
+              "Twpkx2kbbrSoHJrSyqidCsv+e2gnLoWDEdLGn/42++dseweQBj40iKRQ7paDrpDRwTZVjGQJ+52c\n",
+              "gaUSUp5A/cAn4FgESmp/sZ0NpfD9/7ZAmCbSUfPUar6ndxZ3XG2DXWcNFu473rzFQZNpJnXg/Pfh\n",
+              "QCQDuu/iX2Vi2NjGs1QVI3BReUxvD8Z/YeLy6w0jDh9dcJGJdKoNjb9Epdy5r0lFeFb9L8AWhdEd\n",
+              "sGreMPdTiMRlq+JOqjdogseyQTcuDo5iesxIsb0dhY+P9VqSJtTxyPO42dn6TXPZDgt1vROlp+Ic\n",
+              "VTutbib7FY5U+jSckVQsLzLRwDuIoa+HpEcHjzuwHMaHrKVljgiPeRI3Afdpqx3nHgy0MFCOhGEr\n",
+              "Jkw+Dadh5qrWjCGOX2K5HPLV0E5qw7krTDhpWX8sTsYsIqvxr/V2EjIFiKwnheBvunmhlbHNUKTl\n",
+              "ykWRC9Afa8QE+vO8sLJHYNqVh5kOrsn0+NP1Mm4JPbYiahSDJa4o8TJzkXFBAAABAkGflkUVLDv/\n",
+              "AGBJAvfAgTZO/kHo4lc9yaSVZkgaxkXEQAgySaAqoJy8U1XmJXFaLzsHv4KqZnckX0gP1AYFUr5X\n",
+              "3Zof5zltHp7OQG87KhkyMuJLOz4diYjf3ctsH2KA3/S29L1hP4qjZ9kfgNEsjrH/nSlX3ikiiFcQ\n",
+              "/2mu5vwlzQMTIUj5/0pAslvbULpI2rwxcgfjtpeW3qe/Q0sCZXyJ3L7VhEaeyKZo/ALUAi114xdn\n",
+              "Gao6fyKpZhWohGCsI53i8XO3Y7Dq+aD4ONx4A265BL770fTZiNNw+oM7dwTK1vcPMdOTVjz4fi6j\n",
+              "bCMBPzMCGM7CsAz7OQTIKiUTlOi8YAAAAakBn7V0Q38AeTG7snd+wR+ioRwfka+slSBm7w4HiigA\n",
+              "mYoe7RzT8waKJhe/5/xyHdk2lI4Qb6yur2vWdYx/k/gVzZWx+dAAALHLM2W5kE06MD+/WY8W9vMg\n",
+              "jgsWx+NCob+sUo3r0m3kC7Z6vE5pa/kp8NVK1XizBU/gSaY6/S/NP+nzZeAUHhvnb6LPnQnTmhI7\n",
+              "+CLAa1UiK6P+lwPbKP0S0Q5RWiopmhls/AKTmwxXB+WRWyrrFglLMCCi/H7yBlZCPn3f1nUi1WXW\n",
+              "txmtCNftDVTPLfu3fbw+YSszpG0LQoe/d+Hn14JtNEXcVveVKgdRtrJ2SZSzkDZoD5uTokEopKbG\n",
+              "geSmsxJSe6mDenK/tstnSjFiozTKWgyJb1mTK9iBWStV+uPeceDypkgatRgkwgz17Zgn457UL8xo\n",
+              "RIb3Rzvhn1PaM6KKHv4wQMqvpqRXKRm+SScKgBhgUzc706tHx+sk3QXrFbfmTj3VwEqpASdMV8SQ\n",
+              "Rc7Pl7VdiwexHM38nPcgZguGyvH4NF1CZay1mT9d+wee9MfU3VHZJgMp057sUGFJIJZNmQAAASYB\n",
+              "n7dqQ38Ah1fDGltbSoFNBABy4LNfpqaOuQiA03rsvInHR01iNZMDGQE2sq9jRvjWYcCsjv8TgHDx\n",
+              "TelM9UgK8aIkbW5xZBO7YH31DMzHB/HcoCKmBUni45/7i/CIo8gF1pGPr0DAA7wV6D09MIgWLTIz\n",
+              "u2RlgzWHXLOhQSqpesq6gEgghz4eO+szzJWiaji2cgnbFYV7gS1iXMpBIisJc8i3U9gywhFgtGxt\n",
+              "IPW/7TiYEwGOLwxyjZX1HkROuSI8lAAdZBpungwbYVpPKSngzu3PnOIcBqes7c29MHD8jRPn7Zrt\n",
+              "720E/jZ4jB2yT62h5AEs+TCYeJmiY6lwGwXm58hIVqeMFafCwAYhd3vDCtfE6mymrvYwtLYQ0YeE\n",
+              "Ebj2MbA5+zEAAAFwQZu6SahBbJlMFEwR//61KoAWx89GABUe1i4OfaowcQHQyqHCv9PnwkHOB5jh\n",
+              "ZaY1nqaJvfgMHLxnx0HRU319XsFiIgZ3fycxZ7MoTbod+V6rFy2y2Qtld8RvCt0Ug4PVQuLFLU9x\n",
+              "N6gbeWntqj92UVkXYHO8rtnoyHbc5vkyDRwK85+1rEknOmV2fCPAJQWJQHZKzqn/akJ6R91HlWya\n",
+              "u/8GgP8q7KTtX0XyZMALsB3jT/UhmW5AlGIwNHeW1rtDiMG/Xy+69i+m2kTOjww4y5o0/8WfwLLR\n",
+              "RKlhEE1LYjJQjoy3+hNy7YguxzdtR0GOg0UsPQLFZIBnnCwGmFharg9MSkzKoZck80tBnNzVcu5F\n",
+              "Ot8W+bdDLv2E/9UTXci1RXlM26z5jearPa/9d/CciU6kElsImbzJ5J2YpzVs+pvW89XbvAJMExZq\n",
+              "wXD26iUkefzti1p2cc2CbM5qN5CGCTCmR13du1Y9J/JQwXkxhEAAAAFiAZ/ZakN/AHwUpp6Dymc0\n",
+              "2L536BR5shJlFypABdlGcrzfdaw/6f5GB/atQKmEnLjISTsAvG6zfbdBMs7bm2yeFrIQxXuK81kC\n",
+              "9pAAAXcBlvswH72knWeKBsU0Ht1g5h3YcKtQv4e82ah693wXobc+mdHgPA3TBKIFWUv/iM+/E90G\n",
+              "S/NmTeZC+lgt/zT/+HMt/QSFK9C1+AMdH9l6Wmy5eJzA8pumBNuqAArwclv8LW1AC9Ryj7J7dIqZ\n",
+              "2nhKIYQ08cavMFAGExrDHt7RiTs4Auer+jpijDT1MWhCFcQjNZn9nbOp1MdYUZ3batlHR94YKH39\n",
+              "SB9iaEe1H+vDrSDRsP3b0PfVLevCUtQQ7tTMju5YxLigI0SkXHby6oMGwH35DOmYdZ/QEHihEbbH\n",
+              "ljlaWypqm6TR7b/zNBCPoaZiHS0IlbTr/gzMbXxGasP7GssB89XtUV2jZihKJYcij8456L2VAAAC\n",
+              "WkGb3knhClJlMCCH//6qVQAvW48vGhnpxPcAFRvWsRQfCH0ZQNKlkI/Fmy/VFBZqjdqwlFWyRDRU\n",
+              "ATa/x8nSCThm/LYIboN0iejGj3Uchm8nyLv3P3+HOOnCw7+XGsyycSpaT/SKI8hu4RwjrdDxqaYn\n",
+              "k6pZ6qjZtX+IZ04XS8X44piBkZKHHklQnddyez3eJG0JjT0fN5b/c72jAD+sOeXlR6iPKkSUzu0o\n",
+              "3ha2oHN6UEDmISbP1cbB3piI/SHrisHlFNjIuHiEdkqSzG95tlcEE5RmJMFHyIZtmV+VUnHUg//H\n",
+              "WOVjyT0+oFlaS4c8th8dtoQJgchjo9u+OPpSDxEJgWI6zeeh28ogNTGzlwRqjfRSsrTItvjA1MD/\n",
+              "oBFhKLk5Gm5LLSkMpDHu9T5I2IaoH3PKDFRJp5FswrHAqK+C6EMiKJRw3UfQ++e71IzTL0xpDNJL\n",
+              "z6AeitOHT7WHH1q0lcaxtRKIXyzlri2FOeAU+zEh7DbcM3wvbzCPYrbD4ePmP1flYALif0DM+F20\n",
+              "woqO1ciEp6KvfcdLwkVhOi6HukmunTXGsruYaqjkaLT2QlUIMJVPTAaXGvEAsJSG/0vfsDXKkk6Z\n",
+              "sB3ElNrSO3yHej1aIEgW5xnCNisEQsWn6TKnOYGilPN4ZN8EB64V0F8PWNB9Aq0baX+T8kKesmFw\n",
+              "2y/668NRP8ypn4s+0TEew3V5nLH+An+XxWolypflMoVnWhEhG2W+IIgxfWfPuSgDmqBKtSemnfnO\n",
+              "mj2z1HJ4yEmqNoBjJwYnWfK8e0PHHb381Mk1zGGJOgWAAAABUEGf/EU0TDv/AFlVerlP4Rak+BQA\n",
+              "rfH1MAekqKZtO9rI3YpPu0XbIusXd4D2mikBBjNWCs5ZCx1/nIkAW78LpHSyCScRX686DgqeELvg\n",
+              "+6gjEvz9oPv/Q5SyPMBeMNrb/QJ3ato+Qw19nLJWjl0bduh+HilMsrklIYKHCWBaC/dNC4s7Xl/r\n",
+              "RCzM7ZJuRKmUY/D5sEAdr/H6TIVmiD0u2jiehC8y8Gw6flB5fdlWyz5ArpMes88RS9cHH1n4Dp5A\n",
+              "9YiKoxa6XsjMVtwy/Q1CE1CcjEE8nX1x2wi3FF+AiuFwqQsSRlHtfUsVksDBdXLvE8zjbyOIuIMV\n",
+              "pnJU22cEHHqRAVAAAQz/a8I3JUwtCYefKDlHQuITIdlhxtkj1S9/MOKY0At1R1tnioLMWN7HUVCo\n",
+              "b6XS9uoGwS6oOJgKcTFbR1vNa4wchWq0XCPds0DBwQAAAPYBnht0Q38AeTSjvudgsbkOLNHOwJSE\n",
+              "7MIAOT4Tae/DlzyAOhFcKHSt+XmND2K3krM1WAe1ksxoXOx8R5ib25iI4yoXHAvjcPvcDoLvQIYy\n",
+              "rfzkEj8FCsgVqTty2M7mcrrsvBMmGI/tSEAq1Wpq/wSUg2I4oZj0GjiChzewD+uw3YnWAi/Ntf5Y\n",
+              "Cv2dU9qEo9e3jPCavhxnj6HVQyqcvxekJ6cEcAGQvRh8PwiQyys4LYMz+Th6jmnZO6zDQlY1h459\n",
+              "aXiX/1NPDVjhvbOibPxdXy1nW8ZFN/ZpmMtUtTAz4mvuGfLCJYTZv8r0n1cztBPRieehovEAAAGy\n",
+              "AZ4dakN/AHwTrqiSAEDVZr7cfUIfCi6SEtf6z4BBmn/qEvCbGFYoG0hJzipIIEfgPxGLOPb5hgYo\n",
+              "3EqlxYfhyi3ADlPB0rSvUe/2K1c1bOHHkBdbN7v2fRCe6cTgBUViIyBzKbW8+YVzs1NjLsftvDLF\n",
+              "Jws+AVbFUOsz2XZO6+tJqS4okplORVfI8Zh8pjE7ly6+HI7Omo301kEp6VZks8VHiVKJOuTRsuFe\n",
+              "1lak9cDIgZS7IV3MkEjdmu8V6wPVTOui5KhgRegdKpe7dvKwiZROacSHUyEpgoiQ49NAkgd9ICSC\n",
+              "nOG96XtcVUK5qLGXI1ECEXtJcuaFVMtCmmOBBiFL8jC1MpHbxQ+4k2qRSUjP3JvFi0NfrsxeXbrH\n",
+              "Ebg5vBmNpJE6T+wdC73c70xC+Mtp+wYFzu5kfTKcL8d+Nzu4GlIr338e6SWwNSpXRGjfdLp9o3Ic\n",
+              "2PzMtQmrlpbEeUDp1vnkaZoqSF5M9xanIk/zohgoPX5++NN/ebYvr56WROjUeIUdsOf6nrJlmboT\n",
+              "DZEat6r4aY15lVCgiz4Mpb/mqSazxzrszmdRYRxGsW8DnzAAAAHfQZoCSahBaJlMCHf//qmWALFy\n",
+              "5oM61QiAB+cxK4+jNCOHXw6RALujtnWF0llKsvjvaSIz+44BdTBn8Dqmduydu0Ab2yYLL8rBa9BR\n",
+              "bM/WBrO6FCt4pfpaT57HiAbORTevnWHgnUCdwsiqbddvhjkiuJYbgCMD0kEP1SURu/b2Z5hWsq5s\n",
+              "eIdJwlVUmffx/GFsHH2OVg2kldaudIzyWEsMXsnZccvZ4+1TTMECSDKdUtlhUW9AAgPUraaePKP1\n",
+              "hatMAsKbsEP5g1nzjTlmyHjs7FjRbwjKng4/qsqVQ+s9Z8Le9mq44VPerxrlkKxdRgf8PQXTEpxP\n",
+              "gMR8UP9I/vRSJBbzTafYsMhPytfC8ESUe9ySga0pNZKSvC+bN1h7zO9OEjqF3rsnXJU2SZN7NAbS\n",
+              "01WCPkWQIdWN39TZ8BwhuM2E1/XfXA9OxCI/7PAG40Z8M1rKVJPTY+iwZnIQA6cEF3rnJVasn/JZ\n",
+              "rircnzzi1JQr5NiwthCEkD02k7GAoyHtF8lIKArvw+GqH7Ox1Tpd6DhPPJm2hmyijeFH6E+9UCJk\n",
+              "Iiolc9K3UW1rmUlHlF/p9jHAvsiiJUpuG/KCfna2LEYj9yn6P2oNlWfqq5P2HNtctaJeVRZv9Qb/\n",
+              "mNVjyjAAAAErQZ4gRREsO/8AZUEtk8LzOoS4AAhIFC88oI10PfUAs3UxxCOOtSzHREgn4/jgVfHt\n",
+              "0r483Tf2Y8D+zGlycQw2lUV6Nidlo0k0sASUCm4dEwF8Hb0+IzseFE0dYexJdLqvhcI7IIUIH6RG\n",
+              "uv8cjTXFD8CTksvYGpGc+uBYXhlwc3/jHhNGtm8G24uHniey+Zy/NtEpSl5dub3bE324kx+/N1gF\n",
+              "sU/CxkQF6UQWvd6Br4nL+i2L6udCLqM/JAVJhScc01UR/bE+NX2i3upx0qofgxfWL8unNZ/BP9Vc\n",
+              "CvVXAtxPw+0JopAnWMlwtBFG9wd+oP4zOIJ88u/VEvyZQd0JJP1Y3qhYk13Deyiv0C1r6ci1z7CQ\n",
+              "UwYqgUT64pT/hlIvHeCzEZxqH+WbUbEAAAGYAZ5fdEN/AIteE+hbrZmAAHNd3/IVGxTYP4E6C+Wr\n",
+              "63le3xAHjzqOqEil1tIAAUY3LvF62/277H30QskV8sEjceHvPe7bE0mfZ44avBY2gS0AAAMByRDk\n",
+              "EKOyh31Y2H0mdsy+zcGsPrGm3pHtO2riBcgILxHO0F5398HG90hK8UgtDUfp9CQyPOvDSyEU4WTb\n",
+              "6/WT9Z3aca6tb4C53W6p8Geyjq/mwbvNpnCVbbqIcx1ZT2+dencovmeYmPlI7jrhk6KwLYEd+5gO\n",
+              "J2YeKk4iWai6BsaO9+Tb5P52jBVHcSZ+Vws5QhTxkBSpdHlWJRcbh50V4ViVltwUN//XNx+jx2bk\n",
+              "KsfglI41FGmS2xAJtr8ZhKDk1VRRL2tGsNB5nztuRXCFd8q4MIuVVWGjim0ntcxZ/R18mzJZN+sI\n",
+              "qKUvfsxoaeZp+oIaU1hLeXzgcHEe+3/6emdZeJWoDNhUqhkfWzWzVZbEzUKpDBS9AbVIA5KR27LD\n",
+              "3HEfRMw9yt8eYILg7m/Rm2ubtU8u6V2QuxVXq1OHry5oY2TAAAABvQGeQWpDfwCPV5unds/RGF4o\n",
+              "aWlq+XwTSVpG+igacFOApaqyNJIXSXT4q7gA4DkP0YAYAumNCN0MwD7HSEeIsv3Q3L9kZ2RagxvU\n",
+              "jle4yQq6Zl5W7AgdlZnaBngH/w8xYsqWx5t90zzi7s9VyRY9jaNshfxuJAZcRgFILNTmQNCPoCtl\n",
+              "wyo5Ht91VCy2qSby6JDLeTD096PzM4KOK7/I+amuefuT0S/QnDNs952oi11JV2mbadqtKDqJE9x4\n",
+              "nX/OjU9PBP1uhsFLNkjsz6ZHlTOcsZvWUxabbw0HBNFuLXWIYqtAYdWN7c/QUoqY2IlVBR//v+NN\n",
+              "Bxf/rxPv+9QlTTeUOAVhzyU/kQACorW+VEL2KFNUPF85LUxlbSGEYQv/98/fAQAu6hKRw3yoJoPy\n",
+              "tyr7S7Za9gGurMYseuvuasNoB+fPCmp37VWgm4yNZQ0LM+8CPtaQgShVMs2/RIG2cXksHuYVqEB7\n",
+              "PJtzP2tl8EYDen8RohIb2UO5d/Xdc8aoi/Nu4IzGq8ApuZIxjC5J9bUYtMDEDA6eChGKPjb20vqg\n",
+              "2PRBI2fSXJrcSROGTC4m+VsF+VagO1LnjrakndEAAAHtQZpDSahBbJlMCG///qeEAVH55ayIAL6z\n",
+              "9D9Go2JR/VsPgULYIy+HM1JNQWUio64eqKV59gHDbxQ77xKGvVi/RlMeepNHF+Cplpp4rKqgivaK\n",
+              "14o0jVVjKwdzXmYfm8QJck76NrSj9rXzMi3Th9DbQ5HQHvlFr1+Ft6fGVXaubVoF+Bx3J4nvsWO+\n",
+              "FhXDphKaWh9geM/3PqX1TK4zqhRL2wKgDCWdLvIi2s2e48RSWR1zksj0SjkMINJfgjA7wVj0dW8Z\n",
+              "NZGlcRPjgkoSgpomI+x9/l7dJ5fHEj4WOkMQMTJnj+KOqaXfgtXbhBachZ0Av1Z6rh+qw/iObJOy\n",
+              "7q2gUdlftEWI7In7KZjqqg18Bg+z35wI2FmknOyXdEiDAPaFiRrhqkKOLfgLssw1BdohiuTGWlKn\n",
+              "NvPL4EzIbAUeS+0qv5cFdXvRjnn1zOMYTMpyN1CZYg4pqjj8mGtGdm1F7w0Xo4Mnm3hRmvZyyOaW\n",
+              "yf38s1SCwyOkhQcwJhrAAebvkxMWrAUWrTq9K9PdCUqFbMVB9+93aovoux8zBfM/WLangtLLXd/D\n",
+              "T9TcgY0eosWGZeAhQk2sxNC3bgvMT328AT2T2XCg2nG4jsOakPWfscwbc0zKfItj/1eXvyR2tk+K\n",
+              "fpgdg9dJ/OdcXINTUAAAB95tb292AAAAbG12aGQAAAAAAAAAAAAAAAAAAAPoAAAnEAABAAABAAAA\n",
+              "AAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAA\n",
+              "AAAAAAAAAAAAAAAAAAACAAAHCHRyYWsAAABcdGtoZAAAAAMAAAAAAAAAAAAAAAEAAAAAAAAnEAAA\n",
+              "AAAAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAEAAAAABsAAAASAA\n",
+              "AAAAACRlZHRzAAAAHGVsc3QAAAAAAAAAAQAAJxAAAAgAAAEAAAAABoBtZGlhAAAAIG1kaGQAAAAA\n",
+              "AAAAAAAAAAAAACgAAAGQAFXEAAAAAAAtaGRscgAAAAAAAAAAdmlkZQAAAAAAAAAAAAAAAFZpZGVv\n",
+              "SGFuZGxlcgAAAAYrbWluZgAAABR2bWhkAAAAAQAAAAAAAAAAAAAAJGRpbmYAAAAcZHJlZgAAAAAA\n",
+              "AAABAAAADHVybCAAAAABAAAF63N0YmwAAACzc3RzZAAAAAAAAAABAAAAo2F2YzEAAAAAAAAAAQAA\n",
+              "AAAAAAAAAAAAAAAAAAABsAEgAEgAAABIAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n",
+              "AAAAAAAAAAAY//8AAAAxYXZjQwFkABX/4QAYZ2QAFazZQbCWhAAAAwAEAAADAFA8WLZYAQAGaOvj\n",
+              "yyLAAAAAHHV1aWRraEDyXyRPxbo5pRvPAyPzAAAAAAAAABhzdHRzAAAAAAAAAAEAAABkAAAEAAAA\n",
+              "ABRzdHNzAAAAAAAAAAEAAAABAAADMGN0dHMAAAAAAAAAZAAAAAEAAAgAAAAAAQAAFAAAAAABAAAI\n",
+              "AAAAAAEAAAAAAAAAAQAABAAAAAABAAAUAAAAAAEAAAgAAAAAAQAAAAAAAAABAAAEAAAAAAEAABQA\n",
+              "AAAAAQAACAAAAAABAAAAAAAAAAEAAAQAAAAAAQAAFAAAAAABAAAIAAAAAAEAAAAAAAAAAQAABAAA\n",
+              "AAABAAAUAAAAAAEAAAgAAAAAAQAAAAAAAAABAAAEAAAAAAEAABQAAAAAAQAACAAAAAABAAAAAAAA\n",
+              "AAEAAAQAAAAAAQAAFAAAAAABAAAIAAAAAAEAAAAAAAAAAQAABAAAAAABAAAMAAAAAAEAAAQAAAAA\n",
+              "AQAAFAAAAAABAAAIAAAAAAEAAAAAAAAAAQAABAAAAAABAAAUAAAAAAEAAAgAAAAAAQAAAAAAAAAB\n",
+              "AAAEAAAAAAEAABQAAAAAAQAACAAAAAABAAAAAAAAAAEAAAQAAAAAAQAAFAAAAAABAAAIAAAAAAEA\n",
+              "AAAAAAAAAQAABAAAAAABAAAUAAAAAAEAAAgAAAAAAQAAAAAAAAABAAAEAAAAAAEAABQAAAAAAQAA\n",
+              "CAAAAAABAAAAAAAAAAEAAAQAAAAAAQAAFAAAAAABAAAIAAAAAAEAAAAAAAAAAQAABAAAAAABAAAM\n",
+              "AAAAAAEAAAQAAAAAAQAAFAAAAAABAAAIAAAAAAEAAAAAAAAAAQAABAAAAAABAAAUAAAAAAEAAAgA\n",
+              "AAAAAQAAAAAAAAABAAAEAAAAAAEAABQAAAAAAQAACAAAAAABAAAAAAAAAAEAAAQAAAAAAQAAFAAA\n",
+              "AAABAAAIAAAAAAEAAAAAAAAAAQAABAAAAAABAAAUAAAAAAEAAAgAAAAAAQAAAAAAAAABAAAEAAAA\n",
+              "AAEAABQAAAAAAQAACAAAAAABAAAAAAAAAAEAAAQAAAAAAQAAFAAAAAABAAAIAAAAAAEAAAAAAAAA\n",
+              "AQAABAAAAAABAAAMAAAAAAEAAAQAAAAAAQAAFAAAAAABAAAIAAAAAAEAAAAAAAAAAQAABAAAAAAB\n",
+              "AAAUAAAAAAEAAAgAAAAAAQAAAAAAAAABAAAEAAAAAAEAAAgAAAAAHHN0c2MAAAAAAAAAAQAAAAEA\n",
+              "AABkAAAAAQAAAaRzdHN6AAAAAAAAAAAAAABkAAAGhgAAAl8AAAFjAAAAvgAAAXYAAAHzAAABDgAA\n",
+              "ATYAAAFIAAAB9QAAAOIAAAD6AAABWgAAAbAAAADTAAAB8wAAAN4AAAH+AAABEAAAAOIAAAG2AAAC\n",
+              "DAAAAWUAAAGkAAABmgAAAckAAAEdAAABfQAAAPMAAAFxAAABIgAAAjYAAAEmAAAA5AAAAXoAAAH+\n",
+              "AAAA/wAAAT0AAAFnAAACAwAAARQAAAE3AAABTwAAAckAAADrAAACFwAAAP0AAAHzAAABIQAAAOAA\n",
+              "AAHKAAACOwAAAVQAAAHFAAABugAAAdQAAAD3AAABUgAAARIAAAFuAAABLwAAAhAAAAERAAAA9gAA\n",
+              "AZkAAAIqAAABIgAAAV0AAAGIAAACSgAAASgAAAFEAAABggAAAegAAAD+AAACCgAAASIAAAIdAAAB\n",
+              "KAAAAQcAAAHbAAACFgAAAT0AAAITAAAB2gAAAi8AAAEGAAABrQAAASoAAAF0AAABZgAAAl4AAAFU\n",
+              "AAAA+gAAAbYAAAHjAAABLwAAAZwAAAHBAAAB8QAAABRzdGNvAAAAAAAAAAEAAAAsAAAAYnVkdGEA\n",
+              "AABabWV0YQAAAAAAAAAhaGRscgAAAAAAAAAAbWRpcmFwcGwAAAAAAAAAAAAAAAAtaWxzdAAAACWp\n",
+              "dG9vAAAAHWRhdGEAAAABAAAAAExhdmY1Ny44My4xMDA=\n",
+              "\"\u003e\n",
+              "  Your browser does not support the video tag.\n",
+              "\u003c/video\u003e"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.HTML at 0x7f84b2253b50\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        }
+      ],
+      "source": [
+        "import time\n",
+        "import traceback\n",
+        "\n",
+        "from matplotlib import pyplot as plt\n",
+        "from matplotlib import animation as anim\n",
+        "import tensorflow as tf\n",
+        "from tensorflow.contrib import autograph as ag\n",
+        "from IPython import display\n",
+        "\n",
+        "\n",
+        "@ag.do_not_convert(ag.RunMode.PY_FUNC)\n",
+        "def render(boards):\n",
+        "  fig = plt.figure()\n",
+        "\n",
+        "  ims = []\n",
+        "  for b in boards:\n",
+        "    im = plt.imshow(b, interpolation='none')\n",
+        "    im.axes.get_xaxis().set_visible(False)\n",
+        "    im.axes.get_yaxis().set_visible(False)\n",
+        "    ims.append([im])\n",
+        "\n",
+        "  try:\n",
+        "    ani = anim.ArtistAnimation(\n",
+        "        fig, ims, interval=100, blit=True, repeat_delay=5000)\n",
+        "    plt.close()\n",
+        "\n",
+        "    display.display(display.HTML(ani.to_html5_video()))\n",
+        "  except RuntimeError:\n",
+        "    print('Coult not render animation:')\n",
+        "    traceback.print_exc()\n",
+        "\n",
+        "\n",
+        "def gol_episode(board):\n",
+        "  directions = tf.constant(\n",
+        "      ((-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)))\n",
+        "\n",
+        "  new_board = []\n",
+        "  ag.set_element_type(new_board, tf.int32)\n",
+        "\n",
+        "  for i in range(len(board)):\n",
+        "    for j in range(len(board[i])):\n",
+        "      num_neighbors = 0\n",
+        "      for d in directions:\n",
+        "        ni = i + d[0]\n",
+        "        nj = j + d[1]\n",
+        "        if ni \u003e= 0 and nj \u003e= 0 and ni \u003c len(board) and nj \u003c len(board[i]):\n",
+        "          num_neighbors += board[ni][nj]\n",
+        "      \n",
+        "      new_cell = 0\n",
+        "      if num_neighbors == 2:\n",
+        "        new_cell = board[i][j]\n",
+        "      elif num_neighbors == 3:\n",
+        "        new_cell = 1\n",
+        "      \n",
+        "      new_board.append(new_cell)\n",
+        "  final_board = ag.stack(new_board)\n",
+        "  final_board = tf.reshape(final_board, board.shape)\n",
+        "  return final_board\n",
+        "  \n",
+        "\n",
+        "def gol(initial_board):\n",
+        "  board = initial_board\n",
+        "  boards = []\n",
+        "  ag.set_element_type(boards, tf.int32)\n",
+        "  # We are being explicit about tensor constants to ensure the loop\n",
+        "  # is not unrolled in the graph. This may change in the future.\n",
+        "  for i in range(tf.constant(NUM_STEPS)):\n",
+        "    board = gol_episode(board)\n",
+        "    boards.append(board)\n",
+        "  boards = ag.stack(boards)\n",
+        "  render(boards)\n",
+        "  return tf.no_op()\n",
+        " \n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  # Gosper glider gun\n",
+        "  # Adapted from http://www.cplusplus.com/forum/lounge/75168/\n",
+        "  _ = 0\n",
+        "  initial_board = tf.constant((\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,1,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,1,_,1,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,_,1,1,_,_,_,_,_,_,1,1,_,_,_,_,_,_,_,_,_,_,_,_,1,1,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,1,_,_,_,1,_,_,_,_,1,1,_,_,_,_,_,_,_,_,_,_,_,_,1,1,_ ),\n",
+        "      ( _,1,1,_,_,_,_,_,_,_,_,1,_,_,_,_,_,1,_,_,_,1,1,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,1,1,_,_,_,_,_,_,_,_,1,_,_,_,1,_,1,1,_,_,_,_,1,_,1,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,1,_,_,_,_,_,1,_,_,_,_,_,_,_,1,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,1,_,_,_,1,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,_,1,1,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "      ( _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ ),\n",
+        "  ))\n",
+        "  initial_board = tf.pad(initial_board, ((0, 20), (0, 10)))\n",
+        "  \n",
+        "  tf_gol = ag.to_graph(gol)\n",
+        "  game_ops = tf_gol(initial_board)\n",
+        "  with tf.Session() as sess:\n",
+        "    sess.run(game_ops)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "7NgrSPCZxs3h"
+      },
+      "source": [
+        "#### Generated code"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 9,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 2323
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 753,
+          "status": "ok",
+          "timestamp": 1532101593840,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "hIGYeX0Cxs3i",
+        "outputId": "e0b62eb1-3e12-4e53-dc54-8a3fa56d823d"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "from __future__ import print_function\n",
+            "import tensorflow as tf\n",
+            "\n",
+            "def tf__gol_episode(board):\n",
+            "  try:\n",
+            "    with tf.name_scope('gol_episode'):\n",
+            "      directions = tf.constant(((-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1),\n",
+            "          (1, -1), (1, 0), (1, 1)))\n",
+            "      new_board = ag__.new_list([])\n",
+            "\n",
+            "      def extra_test_2(new_board_2):\n",
+            "        with tf.name_scope('extra_test_2'):\n",
+            "          return True\n",
+            "\n",
+            "      def loop_body_2(i, new_board_2):\n",
+            "        with tf.name_scope('loop_body_2'):\n",
+            "\n",
+            "          def extra_test_1(new_board_1):\n",
+            "            with tf.name_scope('extra_test_1'):\n",
+            "              return True\n",
+            "\n",
+            "          def loop_body_1(j, new_board_1):\n",
+            "            with tf.name_scope('loop_body_1'):\n",
+            "              num_neighbors = 0\n",
+            "\n",
+            "              def extra_test(num_neighbors_2):\n",
+            "                with tf.name_scope('extra_test'):\n",
+            "                  return True\n",
+            "\n",
+            "              def loop_body(d, num_neighbors_2):\n",
+            "                with tf.name_scope('loop_body'):\n",
+            "                  ni = i + ag__.get_item(d, (0), opts=ag__.GetItemOpts(\n",
+            "                      element_dtype=None))\n",
+            "                  nj = j + ag__.get_item(d, (1), opts=ag__.GetItemOpts(\n",
+            "                      element_dtype=None))\n",
+            "\n",
+            "                  def if_true():\n",
+            "                    with tf.name_scope('if_true'):\n",
+            "                      num_neighbors_1, = num_neighbors_2,\n",
+            "                      num_neighbors_1 += ag__.get_item(ag__.get_item(board,\n",
+            "                          (ni), opts=ag__.GetItemOpts(element_dtype=None)),\n",
+            "                          (nj), opts=ag__.GetItemOpts(element_dtype=None))\n",
+            "                      return num_neighbors_1,\n",
+            "\n",
+            "                  def if_false():\n",
+            "                    with tf.name_scope('if_false'):\n",
+            "                      return num_neighbors_2,\n",
+            "                  num_neighbors_2 = ag__.utils.run_cond(tf.logical_and(tf.\n",
+            "                      greater_equal(ni, 0), tf.logical_and(tf.greater_equal\n",
+            "                      (nj, 0), tf.logical_and(tf.less(ni, ag__.utils.\n",
+            "                      dynamic_builtin(len, board)), tf.less(nj, ag__.utils.\n",
+            "                      dynamic_builtin(len, ag__.get_item(board, (i), opts=\n",
+            "                      ag__.GetItemOpts(element_dtype=None))))))), if_true,\n",
+            "                      if_false)\n",
+            "                  return num_neighbors_2,\n",
+            "              num_neighbors = ag__.for_stmt(directions, extra_test,\n",
+            "                  loop_body, (num_neighbors,))\n",
+            "              new_cell = 0\n",
+            "\n",
+            "              def if_true_2():\n",
+            "                with tf.name_scope('if_true_2'):\n",
+            "                  new_cell_2, = new_cell,\n",
+            "                  new_cell_2 = ag__.get_item(ag__.get_item(board, (i), opts\n",
+            "                      =ag__.GetItemOpts(element_dtype=None)), (j), opts=\n",
+            "                      ag__.GetItemOpts(element_dtype=None))\n",
+            "                  return new_cell_2,\n",
+            "\n",
+            "              def if_false_2():\n",
+            "                with tf.name_scope('if_false_2'):\n",
+            "                  new_cell_3, = new_cell,\n",
+            "\n",
+            "                  def if_true_1():\n",
+            "                    with tf.name_scope('if_true_1'):\n",
+            "                      new_cell_1, = new_cell_3,\n",
+            "                      new_cell_1 = 1\n",
+            "                      return new_cell_1,\n",
+            "\n",
+            "                  def if_false_1():\n",
+            "                    with tf.name_scope('if_false_1'):\n",
+            "                      return new_cell_3,\n",
+            "                  new_cell_3 = ag__.utils.run_cond(tf.equal(num_neighbors, \n",
+            "                      3), if_true_1, if_false_1)\n",
+            "                  return new_cell_3,\n",
+            "              new_cell = ag__.utils.run_cond(tf.equal(num_neighbors, 2),\n",
+            "                  if_true_2, if_false_2)\n",
+            "              new_board_1 = ag__.list_append(new_board_1, new_cell)\n",
+            "              return new_board_1,\n",
+            "          new_board_2 = ag__.for_stmt(ag__.utils.dynamic_builtin(range,\n",
+            "              ag__.utils.dynamic_builtin(len, ag__.get_item(board, (i),\n",
+            "              opts=ag__.GetItemOpts(element_dtype=None)))), extra_test_1,\n",
+            "              loop_body_1, (new_board_2,))\n",
+            "          return new_board_2,\n",
+            "      new_board = ag__.for_stmt(ag__.utils.dynamic_builtin(range, ag__.\n",
+            "          utils.dynamic_builtin(len, board)), extra_test_2, loop_body_2, (\n",
+            "          new_board,))\n",
+            "      final_board = ag__.list_stack(new_board, opts=ag__.ListStackOpts(\n",
+            "          element_dtype=tf.int32, original_call=ag.stack))\n",
+            "      final_board = tf.reshape(final_board, board.shape)\n",
+            "      return final_board\n",
+            "  except:\n",
+            "    ag__.rewrite_graph_construction_error(ag_source_map__)\n",
+            "\n",
+            "def tf__gol(initial_board):\n",
+            "  try:\n",
+            "    with tf.name_scope('gol'):\n",
+            "      board = initial_board\n",
+            "      boards = ag__.new_list([])\n",
+            "\n",
+            "      def extra_test(board_1, boards_1):\n",
+            "        with tf.name_scope('extra_test'):\n",
+            "          return True\n",
+            "\n",
+            "      def loop_body(i, board_1, boards_1):\n",
+            "        with tf.name_scope('loop_body'):\n",
+            "          board_1 = tf__gol_episode(board_1)\n",
+            "          boards_1 = ag__.list_append(boards_1, board_1)\n",
+            "          return board_1, boards_1\n",
+            "      board, boards = ag__.for_stmt(ag__.utils.dynamic_builtin(range, tf.\n",
+            "          constant(NUM_STEPS)), extra_test, loop_body, (board, boards))\n",
+            "      boards = ag__.list_stack(boards, opts=ag__.ListStackOpts(\n",
+            "          element_dtype=tf.int32, original_call=ag.stack))\n",
+            "      with ag__.utils.control_dependency_on_returns(render(boards)):\n",
+            "        boards_2 = ag__.utils.alias_tensors(boards)\n",
+            "        return tf.no_op()\n",
+            "  except:\n",
+            "    ag__.rewrite_graph_construction_error(ag_source_map__)\n",
+            "\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(ag.to_code(gol))"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [
+        "p8zZyj-tq4K3",
+        "Lkq3DBGOv3fA",
+        "r8_0ioEuAI-a",
+        "7NgrSPCZxs3h"
+      ],
+      "default_view": {},
+      "last_runtime": {
+        "build_target": "",
+        "kind": "local"
+      },
+      "name": "Simple algorithms using AutoGraph",
+      "provenance": [
+        {
+          "file_id": "19q8KdVF8Cb_fDd13i-WDOG_6n_QGNW5-",
+          "timestamp": 1528465909719
+        }
+      ],
+      "version": "0.3.2",
+      "views": {}
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/autograph/examples/notebooks/dev_summit_2018_demo.ipynb b/tensorflow/contrib/autograph/examples/notebooks/dev_summit_2018_demo.ipynb
index 0702273fac..7e9cc54d4c 100644
--- a/tensorflow/contrib/autograph/examples/notebooks/dev_summit_2018_demo.ipynb
+++ b/tensorflow/contrib/autograph/examples/notebooks/dev_summit_2018_demo.ipynb
@@ -1,49 +1,20 @@
 {
-  "nbformat": 4,
-  "nbformat_minor": 0,
-  "metadata": {
-    "colab": {
-      "name": "Dev Summit 2018 - Autograph",
-      "version": "0.3.2",
-      "views": {},
-      "default_view": {},
-      "provenance": [
-        {
-          "file_id": "1wCZUh73zTNs1jzzYjqoxMIdaBWCdKJ2K",
-          "timestamp": 1522238054357
-        },
-        {
-          "file_id": "1_HpC-RrmIv4lNaqeoslUeWaX8zH5IXaJ",
-          "timestamp": 1521743157199
-        },
-        {
-          "file_id": "1mjO2fQ2F9hxpAzw2mnrrUkcgfb7xSGW-",
-          "timestamp": 1520522344607
-        }
-      ],
-      "collapsed_sections": []
-    },
-    "kernelspec": {
-      "name": "python2",
-      "display_name": "Python 2"
-    }
-  },
   "cells": [
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "g7nGs4mzVUHP",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "g7nGs4mzVUHP"
       },
-      "cell_type": "markdown",
       "source": [
-        "# Experimental: TF Autograph\n",
+        "# Experimental: TF AutoGraph\n",
         "**TensorFlow Dev Summit, 2018.**\n",
         "\n",
-        "This interactive notebook demonstrates **autograph**, an experimental source-code transformation library to automatically convert TF.Eager and Python code to TensorFlow graphs.\n",
+        "This interactive notebook demonstrates **AutoGraph**, an experimental source-code transformation library to automatically convert Python, TensorFlow and NumPy code to TensorFlow graphs.\n",
         "\n",
         "**Note: this is pre-alpha software!** The notebook works best with Python 2, for now.\n",
         "\n",
-        "> ![alt text](https://lh3.googleusercontent.com/QOvy0clmg7siaVKzwmSPAjicWWNQ0OeyaB16plDjSJMf35WD3vLjF6mz4CGrhSHw60HnlZPJjkyDCBzw5XOI0oBGSewyYw=s688)\n",
+        "\u003e ![alt text](https://lh3.googleusercontent.com/QOvy0clmg7siaVKzwmSPAjicWWNQ0OeyaB16plDjSJMf35WD3vLjF6mz4CGrhSHw60HnlZPJjkyDCBzw5XOI0oBGSewyYw=s688)\n",
         "\n",
         "### Table of Contents\n",
         "1. _Write Eager code that is fast and scalable._\n",
@@ -53,37 +24,39 @@
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "uFcgBENZqkB2",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "uFcgBENZqkB2"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "# Install TensorFlow; note that Colab notebooks run remotely, on virtual\n",
         "# instances provided by Google.\n",
         "!pip install -U -q tf-nightly"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "Pa2qpEmoVOGe",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "Pa2qpEmoVOGe"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "import os\n",
         "import time\n",
@@ -96,170 +69,172 @@
         "import six\n",
         "\n",
         "from google.colab import widgets"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "ZVKfj5ttVkqz",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "ZVKfj5ttVkqz"
       },
-      "cell_type": "markdown",
       "source": [
         "# 1. Write Eager code that is fast and scalable\n",
         "\n",
         "TF.Eager gives you more flexibility while coding, but at the cost of losing the benefits of TensorFlow graphs. For example, Eager does not currently support distributed training, exporting models, and a variety of memory and computation optimizations.\n",
         "\n",
-        "Autograph gives you the best of both worlds: write your code in an Eager style, and we will automatically transform it into the equivalent TF graph code. The graph code can be executed eagerly (as a single op), included as part of a larger graph, or exported."
+        "AutoGraph gives you the best of both worlds: you can write your code in an Eager style, and we will automatically transform it into the equivalent TF graph code. The graph code can be executed eagerly (as a single op), included as part of a larger graph, or exported."
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "snaZRFdWd9ym",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "snaZRFdWd9ym"
       },
-      "cell_type": "markdown",
       "source": [
-        "For example, autograph can convert a function like this:"
+        "For example, AutoGraph can convert a function like this:"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "9__n8cSIeDnD",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "9__n8cSIeDnD"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def g(x):\n",
-        "  if x > 0:\n",
+        "  if x \u003e 0:\n",
         "    x = x * x\n",
         "  else:\n",
         "    x = 0\n",
         "  return x"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "gq0eQcuReHET",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "gq0eQcuReHET"
       },
-      "cell_type": "markdown",
       "source": [
         "... into a TF graph-building function:"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "sELSn599ePUF",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
-          "height": 413
+          "height": 431
         },
-        "outputId": "bb0c7216-1ca3-4da1-d1fb-589902cdcd1a",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 69,
           "status": "ok",
-          "timestamp": 1522345737505,
-          "user_tz": 240,
-          "elapsed": 243,
+          "timestamp": 1531750911837,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "sELSn599ePUF",
+        "outputId": "2858bde5-ae05-4c32-be01-7770ac914f02"
       },
-      "cell_type": "code",
-      "source": [
-        "print(autograph.to_code(g))"
-      ],
-      "execution_count": 0,
       "outputs": [
         {
+          "name": "stdout",
           "output_type": "stream",
           "text": [
             "from __future__ import print_function\n",
             "import tensorflow as tf\n",
-            "from tensorflow.contrib.autograph.impl import api as autograph_api\n",
-            "from tensorflow.contrib.autograph import utils as autograph_utils\n",
             "\n",
             "def tf__g(x):\n",
-            "  with tf.name_scope('g'):\n",
+            "  try:\n",
+            "    with tf.name_scope('g'):\n",
             "\n",
-            "    def if_true():\n",
-            "      with tf.name_scope('if_true'):\n",
-            "        x_1, = x,\n",
-            "        x_1 = x_1 * x_1\n",
-            "        return x_1,\n",
+            "      def if_true():\n",
+            "        with tf.name_scope('if_true'):\n",
+            "          x_1, = x,\n",
+            "          x_1 = x_1 * x_1\n",
+            "          return x_1,\n",
             "\n",
-            "    def if_false():\n",
-            "      with tf.name_scope('if_false'):\n",
-            "        x_1, = x,\n",
-            "        x_1 = 0\n",
-            "        return x_1,\n",
-            "    x = autograph_utils.run_cond(tf.greater(x, 0), if_true, if_false)\n",
-            "    return x\n",
+            "      def if_false():\n",
+            "        with tf.name_scope('if_false'):\n",
+            "          x_2, = x,\n",
+            "          x_2 = 0\n",
+            "          return x_2,\n",
+            "      x = ag__.utils.run_cond(tf.greater(x, 0), if_true, if_false)\n",
+            "      return x\n",
+            "  except:\n",
+            "    ag__.rewrite_graph_construction_error(ag_source_map__)\n",
             "\n"
-          ],
-          "name": "stdout"
+          ]
         }
+      ],
+      "source": [
+        "print(autograph.to_code(g))"
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "j74n-8hEe6dk",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "j74n-8hEe6dk"
       },
-      "cell_type": "markdown",
       "source": [
         "You can then use the converted function as you would any regular TF op -- you can pass `Tensor` arguments and it will return `Tensor`s:"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "AkVaY0-dfEbH",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
           "height": 53
         },
-        "outputId": "4ffe3757-c44d-424c-c2a8-7ddc973bfcce",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 83,
           "status": "ok",
-          "timestamp": 1522345737841,
-          "user_tz": 240,
-          "elapsed": 257,
+          "timestamp": 1531750911965,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "AkVaY0-dfEbH",
+        "outputId": "f04541ad-b1d3-4663-bf27-4d902648283d"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "g(9) = 81\n",
+            "tf_g(9) = 81\n"
+          ]
+        }
+      ],
       "source": [
         "tf_g = autograph.to_graph(g)\n",
         "\n",
@@ -272,77 +247,72 @@
         "\n",
         "  print('g(9) = %s' % g(9))\n",
         "  print('tf_g(9) = %s' % tf_g_result)"
-      ],
-      "execution_count": 0,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "text": [
-            "g(9) = 81\n",
-            "tf_g(9) = 81\n"
-          ],
-          "name": "stdout"
-        }
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "trrHQBM1VnD0",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "trrHQBM1VnD0"
       },
-      "cell_type": "markdown",
       "source": [
         "# 2. Case study: complex control flow\n",
         "\n",
-        "Autograph can convert a large chunk of the Python language into graph-equivalent code, and we're adding new supported language features all the time. In this section, we'll give you a taste of some of the functionality in autograph.\n",
-        "Autograph will automatically convert most Python control flow statements into their correct graph equivalent.\n",
+        "Autograph can convert a large subset of the Python language into graph-equivalent code, and we're adding new supported language features all the time. In this section, we'll give you a taste of some of the functionality in AutoGraph.\n",
+        "AutoGraph will automatically convert most Python control flow statements into their graph equivalent.\n",
         "  "
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "u0YG3DPgZxoW",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "u0YG3DPgZxoW"
       },
-      "cell_type": "markdown",
       "source": [
         "We support common statements like `while`, `for`, `if`, `break`, `return` and more. You can even nest them as much as you like. Imagine trying to write the graph version of this code by hand:"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "xJYDzOcrZ8pI",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
           "height": 35
         },
-        "outputId": "6c244ee4-b141-4ad6-eefa-cfffa71f33c6",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 169,
           "status": "ok",
-          "timestamp": 1522345738402,
-          "user_tz": 240,
-          "elapsed": 483,
+          "timestamp": 1531750912183,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "xJYDzOcrZ8pI",
+        "outputId": "f392b475-bf87-4d90-919d-44f895ee9fc7"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Sum of even numbers: 42\n"
+          ]
+        }
+      ],
       "source": [
         "def sum_even(numbers):\n",
         "  s = 0\n",
         "  for n in numbers:\n",
-        "    if n % 2 > 0:\n",
+        "    if n % 2 \u003e 0:\n",
         "      continue\n",
         "    s += n\n",
         "  return s\n",
@@ -358,77 +328,74 @@
         "  \n",
         "# Uncomment the line below to print the generated graph code\n",
         "# print(autograph.to_code(sum_even))"
-      ],
-      "execution_count": 0,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "text": [
-            "Sum of even numbers: 42\n"
-          ],
-          "name": "stdout"
-        }
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "_YXo4KOcbKrn",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "_YXo4KOcbKrn"
       },
-      "cell_type": "markdown",
       "source": [
         "Try replacing the `continue` in the above code with `break` -- Autograph supports that as well!"
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "xHmC0rBIavW_",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "xHmC0rBIavW_"
       },
-      "cell_type": "markdown",
       "source": [
         "The Python code above is much more readable than the matching graph code. Autograph takes care of tediously converting every piece of Python code into the matching TensorFlow graph version for you, so that you can quickly write maintainable code, but still benefit from the optimizations and deployment benefits of graphs."
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "UEHWGpBXbS7g",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "UEHWGpBXbS7g"
       },
-      "cell_type": "markdown",
       "source": [
         "Let's try some other useful Python constructs, like `print` and `assert`. We automatically convert Python `assert` statements into the equivalent `tf.Assert` code.  "
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "qUU57xlEbauI",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
           "height": 53
         },
-        "outputId": "add3db4a-2077-4dd5-f7a7-a5b5a4529c26",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 56,
           "status": "ok",
-          "timestamp": 1522345738697,
-          "user_tz": 240,
-          "elapsed": 253,
+          "timestamp": 1531750912292,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "qUU57xlEbauI",
+        "outputId": "c9cd536a-4a95-4eb0-98c0-aafce5d79580"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Got error message: assertion failed: [Do not pass zero!]\n",
+            "\t [[{{node f/Assert/Assert}} = Assert[T=[DT_STRING], summarize=3, _device=\"/job:localhost/replica:0/task:0/device:CPU:0\"](f/NotEqual, f/Assert/Assert/data_0)]]\n"
+          ]
+        }
+      ],
       "source": [
         "def f(x):\n",
         "  assert x != 0, 'Do not pass zero!'\n",
@@ -444,61 +411,35 @@
         "      \n",
         "# Uncomment the line below to print the generated graph code\n",
         "# print(autograph.to_code(f))"
-      ],
-      "execution_count": 0,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "text": [
-            "Got error message: assertion failed: [Do not pass zero!]\n",
-            "\t [[Node: f/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device=\"/job:localhost/replica:0/task:0/device:CPU:0\"](f/NotEqual, f/Assert/Assert/data_0)]]\n"
-          ],
-          "name": "stdout"
-        }
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "w5hBZaVJbck4",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "w5hBZaVJbck4"
       },
-      "cell_type": "markdown",
       "source": [
         "You can also use `print` functions in-graph:"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "6NdzRKLEboRv",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          },
-          "output_extras": [
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
-          "height": 35
-        },
-        "outputId": "fb82dfc3-790f-4127-87f6-361805be9e9b",
-        "executionInfo": {
-          "status": "ok",
-          "timestamp": 1522345739013,
-          "user_tz": 240,
-          "elapsed": 247,
-          "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "6NdzRKLEboRv"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def print_sign(n):\n",
-        "  if n >= 0:\n",
+        "  if n \u003e= 0:\n",
         "    print(n, 'is positive!')\n",
         "  else:\n",
         "    print(n, 'is negative!')\n",
@@ -512,62 +453,58 @@
         "    \n",
         "# Uncomment the line below to print the generated graph code\n",
         "# print(autograph.to_code(print_sign))"
-      ],
-      "execution_count": 0,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "text": [
-            "1 is positive!\n"
-          ],
-          "name": "stdout"
-        }
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "9u_Z3i3AivLA",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "9u_Z3i3AivLA"
       },
-      "cell_type": "markdown",
       "source": [
-        "We can convert lists to TensorArray, so appending to lists also works, with a few modifications:"
+        "Appending to lists also works, with a few modifications:"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "MjhCQJVuiTNR",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
           "height": 35
         },
-        "outputId": "dc320b87-595b-4392-d29c-994486fd8a0a",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 148,
           "status": "ok",
-          "timestamp": 1522345744470,
-          "user_tz": 240,
-          "elapsed": 5391,
+          "timestamp": 1531750912595,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "MjhCQJVuiTNR",
+        "outputId": "96bf9131-c7c1-4359-ee82-9c38575e7ab4"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "[0 1 2 3 4]\n"
+          ]
+        }
+      ],
       "source": [
         "def f(n):\n",
         "  numbers = []\n",
         "  # We ask you to tell us about the element dtype.\n",
-        "  autograph.utils.set_element_type(numbers, tf.int32)\n",
+        "  autograph.set_element_type(numbers, tf.int32)\n",
         "  for i in range(n):\n",
         "    numbers.append(i)\n",
         "  return autograph.stack(numbers) # Stack the list so that it can be used as a Tensor\n",
@@ -580,65 +517,62 @@
         "    \n",
         "# Uncomment the line below to print the generated graph code\n",
         "# print(autograph.to_code(f))"
-      ],
-      "execution_count": 0,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "text": [
-            "[0 1 2 3 4]\n"
-          ],
-          "name": "stdout"
-        }
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "UdG8ZFrkTAF2",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "UdG8ZFrkTAF2"
       },
-      "cell_type": "markdown",
       "source": [
         "And all of these functionalities, and more, can be composed into more complicated code:\n"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "DVs6wt8NKaGQ",
-        "colab_type": "code",
+        "cellView": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
           "height": 53
         },
-        "cellView": "code",
-        "outputId": "0a4b8d08-8f65-4bbc-85ba-dc4c60563519",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 555,
           "status": "ok",
-          "timestamp": 1522345745186,
-          "user_tz": 240,
-          "elapsed": 658,
+          "timestamp": 1531750913176,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "DVs6wt8NKaGQ",
+        "outputId": "8729229c-4f08-4640-d3a1-0d3f9c697a87"
       },
-      "cell_type": "code",
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "The prime numbers less than 50 are:\n",
+            "[ 2  3  5  7 11 13 17 19 23 29 31 37 41 43 47]\n"
+          ]
+        }
+      ],
       "source": [
         "def print_primes(n):\n",
         "  \"\"\"Returns all the prime numbers less than n.\"\"\"\n",
-        "  assert n > 0\n",
+        "  assert n \u003e 0\n",
         "  \n",
         "  primes = []\n",
-        "  autograph.utils.set_element_type(primes, tf.int32)\n",
+        "  autograph.set_element_type(primes, tf.int32)\n",
         "  for i in range(2, n):\n",
         "    is_prime = True\n",
         "    for k in range(2, i):\n",
@@ -663,45 +597,36 @@
         "    \n",
         "# Uncomment the line below to print the generated graph code\n",
         "# print(autograph.to_code(print_primes))"
-      ],
-      "execution_count": 0,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "text": [
-            "The prime numbers less than 50 are:\n",
-            "[ 2  3  5  7 11 13 17 19 23 29 31 37 41 43 47]\n"
-          ],
-          "name": "stdout"
-        }
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "JQ8kQT99VqDk",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "JQ8kQT99VqDk"
       },
-      "cell_type": "markdown",
       "source": [
         "# 3. Case study: training MNIST with Keras\n",
         "\n",
-        "As we've seen, writing control flow in Autograph is easy. So running a training loop in graph should be easy as well!\n",
+        "As we've seen, writing control flow in AutoGraph is easy. So running a training loop in graph should be easy as well!\n",
         "\n",
         "Here, we show an example of such a training loop for a simple Keras model that trains on MNIST."
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "0CrtGWgwuLJr",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "0CrtGWgwuLJr"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "import gzip\n",
         "import shutil\n",
@@ -754,66 +679,67 @@
         "\n",
         "def mnist_test(directory):\n",
         "  return dataset(directory, 't10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte')"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "2zu1U9Nqir6L",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "2zu1U9Nqir6L"
       },
-      "cell_type": "markdown",
       "source": [
         "First, we'll define a small three-layer neural network using the Keras API"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "x_MU13boiok2",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "x_MU13boiok2"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def mlp_model(input_shape):\n",
-        "  model = tf.keras.Sequential([\n",
+        "  model = tf.keras.Sequential((\n",
         "      tf.keras.layers.Dense(100, activation='relu', input_shape=input_shape),\n",
         "      tf.keras.layers.Dense(100, activation='relu'),\n",
-        "      tf.keras.layers.Dense(10, activation='softmax')])\n",
+        "      tf.keras.layers.Dense(10, activation='softmax'),\n",
+        "  ))\n",
         "  model.build()\n",
         "  return model"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "Wuqg3H8mi0Xj",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "Wuqg3H8mi0Xj"
       },
-      "cell_type": "markdown",
       "source": [
         "Let's connect the model definition (here abbreviated as `m`) to a loss function, so that we can train our model."
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "W51sfbONiz_5",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "W51sfbONiz_5"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def predict(m, x, y):\n",
         "  y_p = m(x)\n",
@@ -822,63 +748,63 @@
         "  accuracies = tf.keras.metrics.categorical_accuracy(y, y_p)\n",
         "  accuracy = tf.reduce_mean(accuracies)\n",
         "  return l, accuracy"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "035tNWQki9tr",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "035tNWQki9tr"
       },
-      "cell_type": "markdown",
       "source": [
         "Now the final piece of the problem specification (before loading data, and clicking everything together) is backpropagating the loss through the model, and optimizing the weights using the gradient."
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "CsAD0ajbi9iZ",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "CsAD0ajbi9iZ"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def fit(m, x, y, opt):\n",
         "  l, accuracy = predict(m, x, y)\n",
         "  opt.minimize(l)\n",
         "  return l, accuracy"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "PcVRIacKjSwb",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "PcVRIacKjSwb"
       },
-      "cell_type": "markdown",
       "source": [
         "These are some utility functions to download data and generate batches for training"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "RVw57HdTjPzi",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "RVw57HdTjPzi"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def setup_mnist_data(is_training, hp, batch_size):\n",
         "  if is_training:\n",
@@ -896,16 +822,14 @@
         "  x = tf.to_float(tf.reshape(image, (-1, 28 * 28)))\n",
         "  y = tf.one_hot(tf.squeeze(label), 10)\n",
         "  return x, y"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "2zEJH5XNjgFz",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "2zEJH5XNjgFz"
       },
-      "cell_type": "markdown",
       "source": [
         "This function specifies the main training loop. We instantiate the model (using the code above), instantiate an optimizer (here we'll use SGD with momentum, nothing too fancy), and we'll instantiate some lists to keep track of training and test loss and accuracy over time.\n",
         "\n",
@@ -913,33 +837,35 @@
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "UUI0566FjZPx",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "UUI0566FjZPx"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def train(train_ds, test_ds, hp):\n",
         "  m = mlp_model((28 * 28,))\n",
         "  opt = tf.train.MomentumOptimizer(hp.learning_rate, 0.9)\n",
+        "\n",
         "  train_losses = []\n",
-        "  train_losses = autograph.utils.set_element_type(train_losses, tf.float32)\n",
+        "  autograph.set_element_type(train_losses, tf.float32)\n",
         "  test_losses = []\n",
-        "  test_losses = autograph.utils.set_element_type(test_losses, tf.float32)\n",
+        "  autograph.set_element_type(test_losses, tf.float32)\n",
         "  train_accuracies = []\n",
-        "  train_accuracies = autograph.utils.set_element_type(train_accuracies,\n",
-        "                                                      tf.float32)\n",
+        "  autograph.set_element_type(train_accuracies, tf.float32)\n",
         "  test_accuracies = []\n",
-        "  test_accuracies = autograph.utils.set_element_type(test_accuracies,\n",
-        "                                                     tf.float32)\n",
-        "  i = tf.constant(0)\n",
-        "  while i < hp.max_steps:\n",
+        "  autograph.set_element_type(test_accuracies, tf.float32)\n",
+        "\n",
+        "  i = 0\n",
+        "  while i \u003c hp.max_steps:\n",
         "    train_x, train_y = get_next_batch(train_ds)\n",
         "    test_x, test_y = get_next_batch(test_ds)\n",
         "    step_train_loss, step_train_accuracy = fit(m, train_x, train_y, opt)\n",
@@ -956,173 +882,144 @@
         "  return (autograph.stack(train_losses), autograph.stack(test_losses),\n",
         "          autograph.stack(train_accuracies),\n",
         "          autograph.stack(test_accuracies))"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "cYiUQ1ppkHzk",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "cYiUQ1ppkHzk"
       },
-      "cell_type": "markdown",
       "source": [
         "Everything is ready to go, let's train the model and plot its performance!"
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "K1m8TwOKjdNd",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {},
-            {},
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
-          "height": 988
+          "height": 585
         },
-        "outputId": "f9d3eef3-5bea-45c1-ddf9-4edee73e4436",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 17094,
           "status": "ok",
-          "timestamp": 1522345800262,
-          "user_tz": 240,
-          "elapsed": 52391,
+          "timestamp": 1531750930585,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "K1m8TwOKjdNd",
+        "outputId": "9f63da19-c3bf-498b-cf00-29090bf3b4f0"
       },
-      "cell_type": "code",
-      "source": [
-        "with tf.Graph().as_default():\n",
-        "  hp = tf.contrib.training.HParams(\n",
-        "      learning_rate=0.05,\n",
-        "      max_steps=500,\n",
-        "  )\n",
-        "  train_ds = setup_mnist_data(True, hp, 50)\n",
-        "  test_ds = setup_mnist_data(False, hp, 1000)\n",
-        "  tf_train = autograph.to_graph(train)\n",
-        "  (train_losses, test_losses, train_accuracies,\n",
-        "   test_accuracies) = tf_train(train_ds, test_ds, hp)\n",
-        "\n",
-        "  with tf.Session() as sess:\n",
-        "    sess.run(tf.global_variables_initializer())\n",
-        "    (train_losses, test_losses, train_accuracies,\n",
-        "     test_accuracies) = sess.run([train_losses, test_losses, train_accuracies,\n",
-        "                                  test_accuracies])\n",
-        "    plt.title('MNIST train/test losses')\n",
-        "    plt.plot(train_losses, label='train loss')\n",
-        "    plt.plot(test_losses, label='test loss')\n",
-        "    plt.legend()\n",
-        "    plt.xlabel('Training step')\n",
-        "    plt.ylabel('Loss')\n",
-        "    plt.show()\n",
-        "    plt.title('MNIST train/test accuracies')\n",
-        "    plt.plot(train_accuracies, label='train accuracy')\n",
-        "    plt.plot(test_accuracies, label='test accuracy')\n",
-        "    plt.legend(loc='lower right')\n",
-        "    plt.xlabel('Training step')\n",
-        "    plt.ylabel('Accuracy')\n",
-        "    plt.show()"
-      ],
-      "execution_count": 0,
       "outputs": [
         {
-          "output_type": "stream",
-          "text": [
-            "Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz to /tmp/autograph_mnist_data/train-images-idx3-ubyte.gz\n",
-            "Downloading https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz to /tmp/autograph_mnist_data/train-labels-idx1-ubyte.gz\n",
-            "Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz to /tmp/autograph_mnist_data/t10k-images-idx3-ubyte.gz\n",
-            "Downloading https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz to /tmp/autograph_mnist_data/t10k-labels-idx1-ubyte.gz\n",
-            "Step 0 train loss: 2.244329 test loss: 2.2499208 train accuracy: 0.12 test accuracy: 0.161\n",
-            "Step 50 train loss: 0.64771986 test loss: 0.56013924 train accuracy: 0.82 test accuracy: 0.836\n",
-            "Step 100 train loss: 0.49011207 test loss: 0.42143965 train accuracy: 0.84 test accuracy: 0.879\n",
-            "Step 150 train loss: 0.3768609 test loss: 0.39319593 train accuracy: 0.88 test accuracy: 0.883\n",
-            "Step 200 train loss: 0.36007702 test loss: 0.37089333 train accuracy: 0.9 test accuracy: 0.881\n",
-            "Step 250 train loss: 0.182115 test loss: 0.28543878 train accuracy: 0.94 test accuracy: 0.915\n",
-            "Step 300 train loss: 0.2119576 test loss: 0.22305593 train accuracy: 0.92 test accuracy: 0.93\n",
-            "Step 350 train loss: 0.12932214 test loss: 0.29057172 train accuracy: 0.96 test accuracy: 0.906\n",
-            "Step 400 train loss: 0.22937602 test loss: 0.2200287 train accuracy: 0.92 test accuracy: 0.925\n",
-            "Step 450 train loss: 0.23444137 test loss: 0.19857481 train accuracy: 0.94 test accuracy: 0.94\n"
-          ],
-          "name": "stdout"
-        },
-        {
-          "output_type": "display_data",
           "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAe8AAAFnCAYAAACPasF4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzs3XmAFNW9Pvynlt5mYdhmQMHggnGN\nS9zCD0ElKug1edUY9ZoQTYze3GuiRk1uYjRqRHNj4n5NrhKjiUYlbihGQFRUFDSoKIvgICAO6+xL\n711V5/2jlq7qZaZnpnumZ3g+/zjTXV1dXSP91PecU+dIQggBIiIiGjLkwT4AIiIi6h2GNxER0RDD\n8CYiIhpiGN5ERERDDMObiIhoiGF4ExERDTEMb6JeOOigg3DllVdmPf6rX/0KBx10kGe766+/3rPN\ne++9h9mzZwMAtm3bhkMPPdR57osvvsCPfvQjzJw5EzNnzsTZZ5+NV199FQBw0003YdasWZg1axYO\nO+wwnHLKKc7v4XDY8x7JZBLz58/v9edavXo1Lr300oK2XbBgAebMmdPn97J19/rZs2fjhRde6PO+\niYY7hjdRL3366aee0Ewmk1izZk3WditXrsQnn3xS0D6vu+46TJs2DYsXL8bixYtxyy234LrrrsPO\nnTtxyy23YNGiRVi0aBHGjRuH3//+987vVVVVnv188sknfQrUI444Ag8//HBB2y5fvhxTpkzp83vZ\n+vt6oj0Zw5uol0444QQsWbLE+f3tt9/GV77ylaztrrnmGtx+++0F7bO+vh5HHnmk8/uRRx6JxYsX\nY/z48QUfV3NzM3784x/jo48+wkUXXQTAbAF48MEHMXPmTOi6jlWrVuHcc8/FrFmzcOaZZ2L58uUA\nzFaB0047DQBw//334ze/+Q2uuOIKfP3rX8d5552HxsZG533ee+89HHzwwVnv9cEHH+Bb3/oWTjvt\nNJx//vloaGgAAOzevRsXX3wxzjzzTJx66qm4++67cx5rPu+99x7OOecczJo1C9/+9redC6Vc++3u\ncSEE/vd//xczZ87EKaecgjlz5kDXdQDAwoULcdZZZ+GMM87AN77xDbz33nsFn3eiwcDwJuqlM844\nAy+99JLz+z//+U/MmjUr53ZCCCxatKjHfU6fPh1XXnkl/va3v2HTpk0AgHHjxkGSpIKPa+zYsbjm\nmmtw1FFH4YknnnAeF0Jg8eLFUBQFv/71r3HppZdi0aJFuPzyy3HTTTfl3NeiRYtw/fXX49VXX8WY\nMWPw7LPPAgA2bdqE2tpaTJgwwfNe4XAY//mf/4lrrrkGS5Yswfe+9z1cddVVAIBHH30Uxx13HF5+\n+WUsWLAADQ0NMAwj57FmikQiuOqqq3DDDTdg0aJF+OEPf4jrrrsOhmHk3G9jY2Pex1944QUsWrQI\nzzzzDJYsWYKGhgY8+eSTAIBbbrkFDz74IBYuXIibbroJr7/+esHnnWgwMLyJeun444/Hxo0b0dLS\nglgshlWrVmHKlCk5t73++uvxhz/8AYlEott9/v73v8d3vvMdLFiwAGeddRZmzJjhBEt/nXzyyc7P\n8+fPxxlnnAEAOOaYY5zqONOxxx6LCRMmQJIkHHLIIdi5cycAYMWKFTk/6wcffIBx48Zh6tSpAICz\nzjoLX3zxBXbs2IExY8bg7bffxvvvvw+/34+77roLdXV1BR376tWrMX78eBxzzDEAgJkzZ6KtrQ3b\nt2/Pu998jy9duhTf+ta3UF1dDVVV8e1vfxuvvPIKAGDMmDF46qmnsH37dhx77LH45S9/WdjJJRok\n6mAfANFQoygKTj/9dCxcuBCjR4/GiSeeCFXN/U/psMMOw3HHHYdHHnkERx99dN59BgIBXHrppbj0\n0kvR2dmJRYsW4fbbb8fEiRMxbdq0fh3vyJEjnZ8XLFiAv/3tb4hEIjAMA/mWNqiurnZ+VhTFaV5+\n5513cMkll2Rt39nZiYaGBk8LhN/vR2trKy655BIYhoFbbrkFjY2N+M53voOf/OQnBR17a2srRowY\nkXVsLS0tefeb7/Guri48/PDDmDdvHgBA13WMHj0aAPCnP/0Jf/rTn3Duuedir732wvXXX4/jjz++\noGMkGgwMb6I+OPPMM3H33Xdj1KhRPfbZ/vSnP8W5556LiRMn5ny+tbUV69evd6rWESNG4Pzzz8ey\nZctQX1/f7/C27d69GzfccAOefvppHHLIIfj8888xc+bMgl+vaRrWrFmT8yKkrq4O+++/P5577rmc\nr7388stx+eWXY8uWLbjsssucSronY8aMQXt7u/O7EAIdHR0YM2YMVFXNud+pU6fmfLyurg4zZszA\nd7/73az3+dKXvoTf/va3MAwD8+fPx7XXXotly5YVeGaIBh6bzYn64Oijj0ZjYyM2btzYY4VWV1eH\n73znO7j//vtzPh+Px3HllVd6wmLr1q34+OOPceyxx/bquFRVRTgczllRt7a2oqKiAvvvvz80TXMq\n0EgkUtC+V69ejYMOOgh+vz/rvY488kg0NTXh448/BgA0NDTgZz/7GYQQ+PWvf4133nkHgBmSY8eO\nhSRJ3R6r7YgjjkBzczNWrVoFwBxfMH78eEycODHvfvM9/vWvfx0vvPACYrEYAOCpp57C888/j9bW\nVnz/+99HOByGLMs48sgjezXWgGgwsPIm6gNJknDaaachFotBlnu+Bv7BD36Ap59+Oudze++9N/70\npz/hvvvuw5w5cyCEQFVVFX75y196RqAX4phjjsEf/vAHTJs2DW+++abnuYMPPhjTp0/HzJkzMWbM\nGPziF7/Ahx9+iNmzZ+O///u/e9y3fYtYvve67777cOuttyISicDn8+Gqq66CJEm48MIL8etf/xq3\n3norhBCYMWMGpkyZgh07dnheryhK1ntWVFTgnnvuwa233opoNIrRo0fjrrvu6na/I0eOzPk4AGzc\nuBHnnHMOADPYb7vtNowePRrTpk3Dt771LSiKAp/Ph9tuu61X551ooElcz5uIiGhoYbM5ERHREMPw\nJiIiGmIY3kREREMMw5uIiGiIYXgTERENMUPmVrGmpq6i7m/UqAq0tUWLus89Ec9j//Ec9h/PYXHw\nPPZfsc9hbW11zsf32MpbVbPvKaXe43nsP57D/uM5LA6ex/4bqHO4x4Y3ERHRUMXwJiIiGmIY3kRE\nREMMw5uIiGiIYXgTERENMQxvIiKiIYbhTURENMQwvImIaNh6443XCt723nvvxI4d23vc7sMP38cN\nN/y8P4fVbwxvIiIalnbu3IFXX11c8PZXXXUt9t57QgmPqHiGzPSoREREvXHXXb/D+vXr8Mgjc2EY\nBnbs2I6dO3fgnnv+iN/+9jdoampELBbDD35wOaZOnYYf//hyXHPNz7F06WuIRML44out2L59G668\n8lpMmTI153u89toSzJv3dyiKgoMOOgS33XYL6us34M47fwefzwe/349bbvktdu7cnvVYdXXuqU8L\nsceGd0c4gfc3NOLYg+sG+1CIiIa9f7z+GVZuaCzqPo87uA7nz5ic9/l///fZeO65f+D7378MDz/8\nIDQthT/+8c9oa2vF8cd/DWeccRa2b9+GG2/8BaZOneZ5bWPjbvzhD/fh3XeX44UXns0Z3tFoFA89\n9AAeeeQJVFRU4Oc//yneffddvPzyyzjnnPMwa9a/4YMPVqK1tQUvv7wg6zGGdx9ceecbaO2M46ZL\njsOk8X0/gURENDQccshhAIDq6hFYv34dXnzxOUiSjM7OjqxtjzjiKABAXV0dwuFwzv01NHyBiRO/\nhIqKCgDA0Ucfg/Xr1+PEE0/CH/7wP2ho+AJf//ppmDRp35yP9cceGd5b23YiPOFNSMnD0dwRZ3gT\nEZXY+TMmd1slDwSfzwcAWLJkETo7O/HAA39GZ2cnfvjD2VnbKkp6gREhRM79SZL3OU1LQZJCOPbY\n4/HnP/8Ny5cvw5w5N+PHP74652Nf/eqxff4se2R4f7ztCyjVbTBG70RLZ3ywD4eIiEpAlmXoup71\neHt7O/baa2/Isow333wdqVSqT/vfZ59J2LbtC0SjEVRUVGLVqg9x1VU/xrPPzsOUKSfi9NPPgBAC\n9fUbsGXLpqzHGN69dPykA7G4CZArO9DSwfAmIhqOJk3aD59+ugH33XcnKiurnMdPPnkGfvGLa/DJ\nJ2vxb//2TdTV1eGRR+b2ev+hUAhXXHEVrr32J5AkGUcccRSOPfZY7NzZghtv/AWqqqrg8/lw/fU3\nob7+06zH+kMS+doDykxTU1dR93fjit+ipTOCQyLn4yfnHlHUfe9Jamuri/632dPwHPYfz2Fx8Dz2\nX7HPYW1t7m7dPfY+7y+P2Q+SL4mmcOtgHwoREVGv7LHhPbFmPACgLdk2yEdCRETUO3tseI8JjQIA\nxBFGPKkN8tEQEREVbs8N74rRAADJH+egNSIiGlL22PAeW2FW3pI/xtvFiIhoSNljw3uME96svImI\naGjZY8M75AvCLwcg+eNoZuVNRDQs9WZJUNtHH32ItjbvnUjlsAyo2x4b3gAwMlDDypuIaJjq7ZKg\ntn/+88Ws8C43e+QMa7a6ijFojDWiqSt7UnoiIhra3EuCXnDBRbj99lvQ1dUFXddx9dU/w+TJB+Lx\nxx/Fm28uhSzLmDp1Gg455FAsW/YGtmzZjDlz7sD48eOz9pu5DOjVV1/nLANaWRkCIJdkGVC3PTy8\nxwItQKfWPtiHQkQ0rD332UtY1bimqPs8uu4rOHfyWXmfdy8J+uijf8YJJ/w/fOMbZ2PLls24994/\n4J57/oinnnoc8+cvgqIomD//WRx33NcwefKXcc01P88Z3LmWAf3ww/fx1ltLcc4552H27AuxaNHr\nJVkG1G2PDu/a0FgAQAysvImIhrM1a1ajvb0Nixe/DABIJMzu0pNP/jquvvq/cNpps3D66bN63E+u\nZUDr6zc4S362tOzClCknlWQZULc9OrzrKszwTildMISALEmDfERERMPTuZPP6rZKLjWfT8VPf/oz\nHH64dy2L6677JbZu/Ryvv74EP/nJf+Chh/7a7X5yLQMaCAScJT/XrFlZsmVA3fboAWt25Y1gFNE4\nZ1kjIhpO3EuCHnro4XjrrTcAAFu2bMZTTz2OcDiMRx6Zi0mT9sX3v38ZqqtrEI1G8i4lCniXAQWA\nVas+xEEHHYpnn52Hzs4OfPOb38QFF1yE+voNzmOnn36G81ix7NGV96hgDSQhQw5EEYmnUBXyDfYh\nERFRkbiXBP3hD3+E2267Gf/1Xz+EYRi4+urrUFVVhfb2Nlx22fcQClXg8MOPwIgRNTjqqK/ihhv+\nG7/97Z3Yf/8DPPvMtQzokUcehVgsihtv/AVGjaoBIJdkGVC3PXZJUHvZtp++fgvicQM/O+pa7L/3\niKK+x56ASwj2H89h//EcFgfPY/9xSdABEpCCkNQUIvHUYB8KERFRQfb48A4qIUiqhs4oJ2ohIqKh\nYY8P7wo1BABoj4YH+UiIiIgKs8eHd6XPvFevIxEZ5CMhIiIqzB4f3iMClQCAjjjDm4iIhoY9PrxH\nVZgj+Xa1tw3ykRARERVmjw/v0RXm7WE7OjrQHk4M8tEQERH1bI8P70qfOWBNUpNYvallkI+GiIio\nZwxvn9nnDSWFpvbY4B4MERFRAUo6Peodd9yBDz74AJqm4T/+4z9w+umnO88tX74cd911FxRFwfTp\n03HFFVeU8lDysm8Vk9QUWjvZbE5EROWvZOH97rvvYuPGjZg3bx7a2tpwzjnneMJ7zpw5ePjhhzFu\n3Dh897vfxcyZMzF58uRSHU5eITVo/qBoaOviRC1ERFT+Shbexx13HI44wlx6bcSIEYjFYtB1HYqi\noKGhATU1Ndhrr70AACeddBJWrFgxKOHtV/wAAJ9foK2NlTcREZW/koW3oijOYuXPPPMMpk+fDkVR\nAABNTU0YPXq0s+3o0aPR0NDQ7f5GjaqAqipFPcba2mqM1M3K2+8XaI8kMXZsFSSu690r+SbOp8Lx\nHPYfz2Fx8Dz230Ccw5IvCfrqq6/imWeewV/+8pd+7aetLVqkIzLZK78IISBLMiTFQCKpY+u2NlQG\nuTRoobgKUf/xHPYfz2Fx8Dz237BYVWzZsmX4v//7P8ydOxfV1ekDqKurQ3Nzs/P77t27UVdXV8pD\nyUuSJPhlP2TVXHi9jYPWiIiozJUsvLu6unDHHXfgwQcfxMiRIz3PTZw4EeFwGNu2bYOmaVi6dCmm\nTp1aqkPpkV/xAbIZ3h2R5KAdBxERUSFK1mz+8ssvo62tDVdffbXz2AknnICDDjoIp512Gm6++WZc\ne+21AIAzzzwT++23X6kOpUd+xY9kyhxpHo5xXW8iIipvJQvvCy64ABdccEHe54877jjMmzevVG/f\nKwHFjw6YS4IyvImIqNzt8TOsAYBf9kMXZmhHGN5ERFTmGN4w+7wNGIBksPImIqKyx/BGeqIWyDrC\ncYY3ERGVN4Y3zD5vAGZ4s/ImIqIyx/CG2ecNAKrPYJ83ERGVPYY3rPu8AYRCEitvIiIqewxvpPu8\nQyEgHNMG+WiIiIi6x/BGus87GABiCQ26YQzyEREREeXH8Ea68g5YS3tH46y+iYiofDG8Afhls89b\nVc2KO57UB/NwiIiIusXwRrryllUBwGw6JyIiKlcMbwABJQAAzrKgrLyJiKicMbwBhFQzvCXVrLjj\nSVbeRERUvhjeAIKKNVJNNu/xjiVYeRMRUflieAMIWpW3IZkVd4yVNxERlTGGN4CQGgIAGJJZecdZ\neRMRURljeAMIWgPWdCQBsM+biIjKG8MbgCqrUCQFmhXe7PMmIqJyxvAGIEkSgmoAKWGFNytvIiIq\nYwxvS1AJImkkAABxTtJCRERljOFtCaoBJHQrvDlJCxERlTGGtyWkBpHQk1BkNpsTEVF5Y3hbgkoQ\nAgKBoOCtYkREVNYY3hZ7opZgCIiyz5uIiMoYw9sSVM0pUitCQCSWGuSjISIiyo/hbQlZ85sHQwJJ\nzUAixaZzIiIqTwxvi115B4IGAFbfRERUvhjeFrvP2+c3wzvM8CYiojLF8LbYzeYqw5uIiMocw9ti\nV96yz+zrZngTEVG5YnhbglblLavmbWIMbyIiKlcMb4tdeUNheBMRUXljeFtC1mhzQzJDm+FNRETl\niuFtCTK8iYhoiGB4W+w+b3tNb85vTkRE5YrhbfHJKmRJdtb0TunGIB8RERFRbgxviyRJCClBZ01v\nneFNRERliuHtElQDiGlxKLLEypuIiMoWw9slqAYR1xJQFRmaJgb7cIiIiHJieLsErWZzRQE0Vt5E\nRFSmGN4uITUAAQHVLxjeRERUthjeLj7Fb/5XNRjeRERUthjeLn7ZBwCQVYGUzj5vIiIqTwxvF5+s\nAgAU1YCm9b/ybutK4MEX16G5I9bvfREREdkY3i4+xay8FaU4fd5PvFqP9z7Zjb8u3NDvfREREdkY\n3i4+u9ncZ0ArQrN5PKl7/ktERFQMDG8Xu89bUQwYQsAw2O9NRETlh+HtYjebS4rZZM5Z1oiIqBwx\nvF2c0eayGdq8XYyIiMoRw9vF7vO2K+9i9HsTEREVG8PbxWk2tyvvItwuRkREVGwlDe/6+nqceuqp\nePzxx7OemzFjBi666CLMnj0bs2fPxu7du0t5KAWxK2/I5ujwfjebC1buRERUfGqpdhyNRnHrrbdi\nypQpebeZO3cuKisrS3UIvebPCG8OWCMionJUssrb7/dj7ty5qKurK9VbFF1Ws3mxwlsqzm6IiIiA\nElbeqqpCVbvf/U033YTt27fjmGOOwbXXXgtJGtyUs6dHFZLdbM5mbyIiKj8lC++eXHnllZg2bRpq\nampwxRVXYPHixZg1a1be7UeNqoCqKkU9htraas/vcf9IAIBqLi6Gqqpg1ja94fObp9enKv3aT7kb\nzp9toPAc9h/PYXHwPPbfQJzDQQvvs88+2/l5+vTpqK+v7za829qiRX3/2tpqNDV1eR4Lx1IAgJSW\nBAA0t4TRVBPo83ukkpq1Pz3rvYaLXOeReofnsP94DouD57H/in0O810IDMqtYl1dXbj00kuRTJoh\nuXLlShx44IGDcSge9mhzQ+KANSIiKl8lq7zXrl2L3/3ud9i+fTtUVcXixYsxY8YMTJw4Eaeddhqm\nT5+OCy64AIFAAIceemi3VfdA8St2n7dZMevs8yYiojJUsvA+/PDD8dhjj+V9/uKLL8bFF19cqrfv\nE6fyBitvIiIqX5xhzUWRFEiQYMCsvDnDGhERlSOGt4skSfApPqfy7uk+7x3hXXjsk38grsUH4vCI\niIgADOJo83Lll33QhTVKvIc+7/s+eghdyTDGVdTi9H1PGYjDIyIiYuWdKagEkDQSAAC9m8p7W2MY\nXckwACBpJAfk2IiIiACGd5bairGIGREEv/oqtic3593ulfcbnJ8lzn9KREQDiOGdYXyFORe7pGpY\nrb2af0N3i/ogT+tKRER7FoZ3hnGV6YVUVPjzbifAe8CJiGhwMLwzjK+oTf8iCquoZTabExHRAGJ4\nZxhfOc75OYEINEPLvaGn8GZ4ExHRwGF4Z6j2V+EHX/4h9I4xgCTQGm/r8TXs8iYiooHE8M5h/5pJ\nMLpGAQCaYq05t/GMV8tTebNXnIiISoHhnYOqSBApc7BaLJV7KVLhSmbeKkZERAOJ4Z2DqsiAYU4+\nl8g7AYsnvYmIiAYMwzsHVZEhDAUAkNBzh3chzeZERESlwPDOQVUkQDfDO5knvHuD4U5ERMXE8M5B\nkiQo1pot21o6cm8kvNsTERENFIZ3Hgp8AICV9TuwsyWS9TxHkhMR0WBheOdhhzdkHZ2R7pvO2SxO\nREQDieGdhyqlwzsX4bpXzBD5lw4lIiIqNoZ3HnZ4S0qe6VFd3EFORERUagzvPHyKz5yIRdaR1Lqv\nrA2w8iYiooHD8M5DlRXAUCApOpKp7KZzd7HNZnMiIhpIDO88fKp1r7esI5nqofJmszkREQ0ghnce\n5ixrKiRFQ0LLUXm7f2blTUREA4jhnYeqyAVX3nqe8GZBTkREpcDwzkOWJXN+c1lHIpljxLn7VjEO\nWCMiogHE8M7DMIQ5YE0WSGiprOe9zeY9lNicw4WIiIqI4Z2HYQhAN+c3j2mJ7rdlnzcREQ0ghnce\nuiGcZUFjqXj2Bp5bxdi5TUREA6eg8F67di2WLl0KALj77rtx8cUX4/333y/pgQ023RAQyQAAIKKH\nu92Wfd5ERDSQCgrvOXPmYL/99sP777+PNWvW4MYbb8R9991X6mMbVIYhIBIhAEBMdGU9z1vFiIho\nsBQU3oFAAPvuuy9ee+01nH/++Zg8eTJkeXi3uJuVdzfh7VmYhM3mREQ0cApK4FgshoULF+LVV1/F\niSeeiPb2dnR2dpb62AaVIQREMggASCLXet7pwK7f1pZzxDkXLCEiolIoKLyvueYaLFiwAD/96U9R\nVVWFxx57DJdcckmJD21w6a5m86Scq8873VS+uy2CpvZY9hZ2djPDiYioiNRCNvra176Gww8/HFVV\nVWhubsaUKVPw1a9+tdTHNqgMwwAMFUJToSvRrOfdlTckkQ5q9zZW5c0KnIiIiqmgyvvWW2/FwoUL\n0d7ejgsvvBCPP/44br755hIf2uD60rhqAIBIhKCrkawAzry3O3ezub1taY6RiIj2TAWF9yeffIJv\nf/vbWLhwIc455xzcc8892Lp1a6mPbVBdcsbB+N7Mg+DTqwFZR0cyo49fSieyxMqbiIgGUEHhbYfP\nG2+8gRkzZgAAkslk6Y6qDFQGfTj56AkIiBEAgMZok+d54b63WxI5VyGxA53ZTURExVRQeO+33344\n88wzEYlEcMghh2D+/Pmoqakp9bGVhZAwP+fOsDe8vROziJwBzcqbiIhKoaABa3PmzEF9fT0OOOAA\nAMDkyZNxxx13lPTAykW1MgotALZ37fY8nll557rXO+lrARSJfd5ERFRUBYV3PB7H66+/jnvvvReS\nJOGoo47C5MmTS31sZWGkbzSAXM3mmaPNvQm9qf1ztO31OvwVYyBaTy71YRIR0R6koGbzG2+8EeFw\nGBdeeCHOP/98NDc344Ybbij1sZWFmmAVhACi1uIkH3zaiBfe3gJkNJvruje869s2AQCUmhb2eRMR\nUVEVVHk3Nzfjrrvucn4/5ZRTMHv27JIdVDmpCKpAVIJm6ACAB55fCwA4cLLrukcS6EqGcdt7D+Oc\nyf+GQ8cchNZ4KwBApHzs8yYioqIqeHrUWCw9g1g0GkUi0f0a18NFZVAFhAxN1z2Pp9y/S8DqjlXY\nEdmFBz5+GADQEm8DAIhkiH3eRERUVAVV3hdccAHOOOMMHH744QCAdevW4aqrrirpgZWLiqAPEBI0\n4Q3vpK65fhMQGQndaoe3prLyJiKioioovM877zxMnToV69atgyRJuPHGG/HYY4+V+tjKgll5S9AN\n74xqKS0d3lLGgDUhhFN5QzYY3kREVFQFhTcA7LXXXthrr72c31evXl2SAyo3duWti+6azYUnoBN6\n0pk+VZJ1DlgjIqKi6vOi3HtKNVkZVCGEbC5U4pIy3GEunAFtABDX4+mnFH2POVdERDQw+hzekiQV\n8zjKVkVQBSBBhze8tYzKO2GkB/DFtHR4S7LOAWtERFRU3Tabn3TSSTlDWgiBtra2kh1UOamw+ryF\n8PZdp3QNAdd2yTzhzT5vIiIqtm7D+4knnhio4yhbiixDEjIMpKC5J2KR3TOsGXkrb7DPm4iIiqzb\n8J4wYcJAHUdZkyUJAgZSWrqpXJJdt4pJQMod3qmoazsDBpjeRERUPH3u8y5EfX09Tj31VDz++ONZ\nzy1fvhznnXceLrjgAjzwwAOlPIx+kyADEEhprn5v1R3eAkmRDu+2RIfn9ULSQEREVCwlC+9oNIpb\nb70VU6ZMyfn8nDlzcP/99+PJJ5/EO++8g88++6xUh9JviiRDSAaSrvD2VN4QSBnp9c2d8DbM0ysk\n721mRERE/VGy8Pb7/Zg7dy7q6uqynmtoaEBNTQ322msvyLKMk046CStWrCjVofSbLCkABOJJVwgr\n3klaUsIV3vF28wctCAAQYOVNRETFU7LwVlUVwWAw53NNTU0YPXq08/vo0aPR1NSUc9tyoMgyJFmg\nI5JuGpdUb+Wtwd1sboV3yhwf+2w3AAAgAElEQVSPzsqbiIiKqeAZ1gbbqFEVUFWlqPusra0uaDtV\nMU+TUFzXOlblLTQVkj/puQu8PWk1m1vhDVkv+L2GouH82QYKz2H/8RwWB89j/w3EORyU8K6rq0Nz\nc7Pz++7du3M2r7u1tUW7fb63amur0dTUVdC2spABCdi2M31vu2SHt+5zqvB9qvZGQ3gHuhJh87lU\nABIAQ9IKfq+hpjfnkXLjOew/nsPi4Hnsv2Kfw3wXAiUdbZ7PxIkTEQ6HsW3bNmiahqVLl2Lq1KmD\ncSgFUWTzNHVE0/3akDXz/m3NvP6pwlhMm+AdnCecZnP2eRMRUfGUrPJeu3Ytfve732H79u1QVRWL\nFy/GjBkzMHHiRJx22mm4+eabce211wIAzjzzTOy3336lOpR+UxUF0IHOqGvaU1UDdBWQzHu4fQhA\nldOn0y/7ENet5nb2eRMRURGVLLwPP/zwbpcNPe644zBv3rxSvX1RqbIV3rH0oDQoGoSuOn3fighA\nkdN98j7Fh6iumE0bDG8iIiqiQWk2H2pUK5S7oq7R5opZeUtOePuhSunwViU1fZ+3zGZzIiIqHoZ3\nAXyKGcrhuN3nLZzK2x6spghvs7kqqxCaFeYyK28iIioehncB/NatYl0xK7xlA5IkzD5v2A/5PM3m\nqqxA6NbvisaVxYiIqGgY3gXwWfeX68KqoJ3bxNzh7Tebyi2qpMIwrN9lnUuTEBFR0TC8C1Dh91k/\nmRHszGvuCm9J+Jy+cQBmFW5V3pJVeXdGk7j/2dXY1hgekOMmIqLhieFdAL/PCm/JmkdNsSpwwzXj\nm65mNJur6cpcMdf0/ufyrVi1sRn3Pbt6AI6aiIiGK4Z3ARTJOk2SgCSlK293szkMNavZ3A53STYr\nb3s98GSKA9iIiKjvGN4FUOxbwCSByqAPvoDVg+2qvCXd22yuuprNoegw2OlNRERFMmQWJhlMslV5\nS5JAZciHsGrAACB0BYmNR0EZ2QRZqvbcKqZIKgAZQpedypuIiKgYWHkXIN1sbuDEr4yHL2D1fRsq\njLbxSG35CoRhB7b9GsXZxu7zdkjSwBw4ERENSwzvAshWEP/7qZNx5tcmweczk9i5jxuArouMZnPV\n2UbKvM+bVTgREfUDw7sAduU9fmwIkiRB8dmVtyu8hchoNndV3jL7vImIqHgY3gWQrSVBDWGGtqxa\no8Vdo811XXjmNreb0IWuAIoGwzDSO2SzORER9QPDuwB2Fa1b4S1Z93m7m80NQzgD2wCkg1xXIUlA\nyuDiJEREVBwcbV4AO5S/6NyGz9o3A0rKfMJwVd6GAclVUctOs7n537iWXguceuetj3dgQm0lDti7\nZrAPhYioLDC8C2D3eS/e+joAQLZWGfMMWDPSk7CYr/Fuc+fqe3AELhqQ4x1OYgkNjy7cAAD4yy9m\nDPLREBGVBzabF0B29WUDgJCyJ2nRDYHHXql3frfDW1LNKj2hJyBghntnJIkHnlsDg6POe6TpRs8b\nERHtYRjeBVAk72kSMMy7vQxvn/fGhnbXa8zntN2T0tsgXZl/UN+Enc2REh0xERENZwzvAmSGNwAr\nuNN93PGk7unztkebG51jobWMN3+Gd05znfeP9YhniIgoG8O7ALKsZD+oe4cLRGIpz+8KXK+xKnQD\n3hHnDO+esWeBiCgbw7sAco7KWxgZ/eAwB1c5r5HdK45Z94lL3srbYHj3iOdoePvX+t247I6l2N0a\nHexDIRpSGN4FyN9s7hV2Vd+K69TaQS/YbN5rDO/h7c8vfQLdEFi2eudgHwrRkMLwLkDmaHMATjXt\n5g7j5vZk+glhbtssbYLkT1cYKY6k7hFH5A9v/PMS9Q3DuwC5Ku9RVaFuX7P4vW3pX6yg362uQ/Co\nt5yHUymGd08Y3kRE2RjeBcjV5z1+VBWqQj4AQCjQw1w3OZrYAVbehWCzORFRNoZ3AZQczeaqrDrL\nfFZX+Lp9vcjRxA4AyZSe83FKY3jvGbhWD1HvMLwLIOf4ZnEv/1kdyg7vQyeNxrUXHoWvHTouu/KW\nNQACb3e8jHe2vwcAWPDOFsxd8ElRj3s4YHYTEWVjeBcgksq+jcW9/Gd1hT/r+ZHVfhy272jzOeE9\nzZI/DskfxxfJT/HEp88CAJ5ftgUr1u0q8pEPnLWbW7BibfGPn5U3EVE2hncBJo3YBwDw5VGTncfM\nZnPz5xGV6fAWmlmFj6kYCQCQ5exmcykQg+RPrzJmrxMOwGmKz2f+ss34+LPmPnyK0rrrHx9j7kvF\nbznggDUiomwM7wJU+6vwwIw7cOa+pzqPqa5Z1/yq7Axei6+ZisTGo3DUhP0BABUBNavZXArEIAVi\nzu+NkXQYdxdWndEkXnznc9z7zOr+faAS6unio7fKObxffGeLs+IZEdFAYnj3guIKbFVWPfNu71NX\nBQCoCYzAdbPOwKTx1QCAiqAv655wSUlB8qfD+4GP/+KsEa7p3YR3JJn3uXKR1Io7gr6cm83nL9uC\ntz7eMdiHMaSV8bUZUVljePeCu59blVQ4y2ZIwMGTRgEAamtCzs8AUBlSs/q8IQlP5d2aaIW692YA\ngN7N7WPhaCrvc+Wi2CPoyzm8iYgGC8O7F7Iqbye7JZxxwpfwzan74rJvHOp5TWXQlzUPOmTDCe+v\n7zPd3F9tAyBrWZV3Uk9i4ZZX0Z7oQGe0/CvvRLHD23U6/ufvH6KxPZZ/40HCC4y+4y1iRH3D8O4F\n9/3e7j5vSQJURcbZ0/ZH7UjvzGuVOZrNIRmQ/HEoIoBzDzwLB4a+AknVIPnj0DIq73/Uv4CXtryC\nFzctQke4/MM7WeRZ49x93vUN7Zj32sai7r8YONlO37HZnKhvGN69oHbT551PZUjN7vOWDEi+JFQj\nCAAwdOt5SUBzVXHtiQ6s2LnS+b0jo897xY6VWLBpUS8/RWkVu/IWGVVtOS7mknnBRURUagzvXvBU\n3pKCQtK7MuiDENnN5lBSkI0AAEDT0o+7+7zvWzU3/RJJdgashQLm/h7f8DQWbX0dulE+M7UVu887\nM6zLsYk6VeRBekREPWF490L2aHMzSLrrtzNvFcucpCUBSQIk3QzvlDUOTZIM6Fafd2ckieZYi/Oa\nqBZzKu+qjBndolr59AMnSthsnuv3cqAxvPuNfd9EvcPw7gXPaHO5h8VILLIsZd/n7TMnaJE0c3IX\n3S5WJQOaYQbBtX98C7rQceDIAwAAsVQMHZEEAMCvKp77qbuS4d5/mCJyH0vxR5tn/l4e4a27Dox9\n3n0nCup8IqJMDO9eUFyBrcqq606xHsoG4X1e8pshbM/GJgzreUkgkdTxxqrt0GWzyq5UK+BX/Ihq\nMcQTZjDqhkBcT8/QFklF+vyZisHdtF30Pu/MyrtMwlvT0sfR3b35ROXglZUNWL+1bbAPg4qI4d0L\n7nW9PQPWemzyywhvnxnMwqq8Dd16Xjbw7Fub8bfFn0JSzI5wvxxEhRpCTIs5FZ4hBLqS6cAO55h7\nfSC5w9tdeacMDXd/+Ces2LEy18sKkt1s3uddFZW72uaANSpniZSOp17biN8/uWqwD4WKiOHdC1kD\n1iw9ZXfdqFDOx42kWXk74S0Z2N5kNoFLqtkRHrDCO6rFnYFRhiEQTrnDe5Arb1d4ufu8t3Y24LP2\nLXh8w9N933eZjjZ3BzYHrFE5K5fWKiouhncvSK5RNYprkpae3Pz94zC+60QkPj3G83gqYTbD233e\nkiTgXApY06X65QBCaghxLY6UZm5oCOFpKh/sZnMtT+WtGVquzXsl84unXAasuQepsc+7H6w/Z+bY\nBiLqHsO7j3yyAvf0qN0J+lVMCnwZRkcthKv/OxaVYRgCuqvytkmqGXw+KYAKXxACAklh9pUbRmaz\n+WBX3q4+by0d3rmWUu2tzLDOvO97sLgvWDjavP/K5aJsOOK5HZ4Y3n2UOT1qTxTFOtVGeluR8iMS\nT0HX0n3e6RdYlbcUQIVaYT1vPmYIIJxKjzAPJ7NDMvMfbCKl4911u5zqvZjczebJZPrnLtcxLnrv\niz7tO/N7Rx/gLyJNN7BkZQOice+88u7AZp93/7Fpt3TKpauJiovh3UfmwiSmQu5R9dnh7VqkROgq\nWjsTcPLUGm0OpPu8VRFASA1ab2pW45l93pnN5l/s7sIPf7cUb3603XnsuTc346EFn2D+si0FfT63\ndVta8doH2/I+7xlt7ro4CLtuYVv4Xu/fF8jRbD7AX0TPv7UZT762EX9fUu953N1Uzmbz/mN4l065\ntFZRcTG8+6jQ+7xtimIlvHuFMV3BLY+uRDhid3obTsVsh7cCPypUc8CbZFXjhiE8TdI7Irs8s6wt\nX7sLAPDU6585jzU0dgEANu3o7NVxA8Cd8z7C35fU521+y9fn7b7/3JD7tiJa1gxrA/w9tHFbBwCg\ntTPheZwD1orD/nOyabd0mN3DE8O7j1RZ6dWiCnblLQz7vxKc028FuiS5m83NKlsRfgTUgPVYesBa\nXDPv8z669itoT3Rgbct656V2FSP7EqhvMwPcp5qj4/vTbJ6vOvKMNk+6wtvVIiAUb/gV/J5Z93kP\nbFC2dZnHPbI64Hnc22wuEI2nsO7z1gE9tuGEAVM6bNUYnhjefaRIhU2PalNVO6itjQ1X5S7Sk7TY\n7CpbNgLpJnopfatYzArv0yedAgB4a9sK57VOv/A+a3DvqofwcdNa+K33T/ajSszXt5tvkhZ35d3X\n8M5s8hvoUcntYfO4R1T4PY+nXIP0NM3AnfM+xp1PfYT6hvYBPb6hzv6nM9AXZW5CCCz9cBt2tQ7u\nfAmlwlaN4Ynh3UeqrOLkoycAAA760qgCtvc2m8vCHd7Wn8E1YE3yJyAMCbLwwWc10UtyepKWmBaD\nLBQ89sIuHDhyf2xo24hdkd3m7uzAC5lBMn/Ty/D5zPdI9WPu8XwDX9yjzd39v+5BdYbct+VMM9/S\nEAKabmTNvFYq9mfOfD8tY5KWLTvN7oimMlxvvJw5zeaD2POweWcnHnulHr+a++7gHUQJsfIenhje\nfeSTFXzntC/jjh9NwWH7ju5xe6fytprNZeSqvN3hHYNIhqDrrv512Wo2N4CYFofQfdi8vRMnjDfv\nH/+0bZP5vN1vnjJHqTdGmyGpZgWZ7EezuZ5jGlAhhGeeb/e0oRHXKHhD6Wt4Zw9Yu/z3b2DO3z7o\n0/56w90FkDkoTcszYI0LbPRNb6vD9zc04sEX1xWlqozEzC6q4VqgsvIenhjefaTKKmRJwtiRuWdP\ny9o+Y7S5e7S63Q9uN5srqg7Jn4RIhKDpBnyKtYqY5K6844BuTtEaUioBAM+/vRHN7bF0FaOkB4nF\nVXOFMvfgqriW7hMvRGbl3RZvx/ee+ylWtb0PqGY4u0MtYbgCW+1bs3n2gDXzd7vSLaXWrvT88cmM\nFgv3eXT/LGekdyyhIZbo/2Q1A03TDWzd1TVg79fb6vCP89fivU92Y3cBTd2vf7itV5/FMATunPeR\n526NoYyV9/DE8O4j91SphVCt0eb2JC2q5FrW0x6wZjWLT5xo7lskQkhpRlazOWCGt6GZj8swt4+m\n4nhx+efpK20lHRqblXcANenp835iwzO4d9VD+KhpbUGfQc+oPjd1fI6ElsCyliUIHvkG4Is74W0I\nA5qhQYHVV9zHyjuzz3sgFwGx108Huq+8NU/l7Q3vK+5+C1fc/VaJjrB0HlrwCW55dOWA9eH3tTp0\n5k/Io7E9hsdfqcctj+afXz+ztaSxPYZ1W1rx10Wf9umYyg2ze3gqaXjffvvtuOCCC3DhhRdi9erV\nnudmzJiBiy66CLNnz8bs2bOxe/fuUh5K0Vz+le/hrP1metb2LoRdeUtWda1KKn71PWu61IxmcyVo\n9puKRAU03Ug3m9vN6rIBXegwUnZ4p5/fvKPTudIWcgp1FWMBAElEoY773FMl2iPUN1rN7T3pbrIH\nSTGgjGx0giypm8EXRJV1Avp2q1jSSHpaEIq95Gh3uqLp982cRU3zDFhzDTTsY7N5S6wVN7xzOza0\nbuzbDors/Q2NAFBQZVsMfa0OMy8oMxXy/0vmn2y4dX2w8h6eShbe//rXv7B161bMmzcPt912G267\n7basbebOnYvHHnsMjz32GMaNG1eqQymqI2sPxxn7fb3Xr3Oaza3wliDjgL1roMiS0w/ujDb3m1+Y\nduWdsjPErrytMBO6VZFb64VLio4dzRGzipEMQNYxOjAKFx30LfP5jACt9JnN7YVOr6plfAnYI95t\nysgmZxR2QrdmiDMqrffuW+W9UnseoWNegz20qbezRdW3fYZH1j2BVB/mWe+KuirvjLECqTxzm/e1\ngnz1izfRlmjH3DWPFbS9EAJPLKnHui2lvT2tKuTreaMi6Gu+9HSPfUE5LHX765DHPu/hqWThvWLF\nCpx66qkAgAMOOAAdHR0Ih8M9vGr4UuzR5s7tZVbftyJD2KPN7T5t1aq8k0GkdAML3ramFrUGrNnL\nhcIKbwjF83wkrjkBH1KDOGj0gZ7nbVU+c0BbOJk7vNvi7Xh03ZOQrIuJzCrHvtf8lJFnw4hXQK5u\ncyrUlNXfrYgghC47y6D2VhhmOMk1zX16/b2rHsL7uz/CxwV2Dbh1uirvzJDwDNJzN6FrfWz+tVpy\nNFHYRcb2pghe/WAb7pz3UZ/erzvukfWZF2yl4q4OOyNJrN3ckndbz/H11I3ShzJ6uGXdnlh5G0Lg\nd3//EP9c8flgH0rJ9G6asF5obm7GYYcd5vw+evRoNDU1oaqqynnspptuwvbt23HMMcfg2muvzeov\ndBs1qgKq2rum6p7U1lYXdX/dGdlsNT/azeaKitraavhUGYmUt887FJKBlFlZ+/wqWtpTwCjXJC5W\neAvdrIpGjrA+hxXOmiGchU1GVY/AXnXWrWzW/u3PXREIAl1AzIjmPBfLPnkbK3evQuAIGfH3T0f1\niJBnu+QXZrhVh6ogkgHIwSg0w0BtbTVi7eaAMkX2QST9gJrM+R7vbVuFUcEafHns/p7Ho/EUKoLp\nqk8Zux1GR61nm978/Xyh3v+9U64vPUOSPK/3B9LHJrv6XYMVfmc7dytBT+8dCJj/FDVDK+g4O+Lp\nC7Fi/3/c1pluUQm5Pk8pqT7FeZ9fPLQEja1R3Hftydhv75qsbSOx9EVVVXXQeV2u44y7rrnyfY6a\n1phnm5he+N9tKGgKpy+cC/k8w+Ezh6NJfNrQjk8b2nHJN78y4O8/IP9mSv4Olsz7ZK+88kpMmzYN\nNTU1uOKKK7B48WLMmjUr7+vb2orb91ZbW42mpoEbTdvVZX1BWAEqdKCpqQuyLOWYpMX6YhYyOrvi\nUCUFCddrneZva8BaW6s5ktsO/65Iup9Y0hR0tVnPWzO0NTZ2QpIkdMbMintXuAlf7GyCX/Z5+vI7\nwzFnv1IwjJaWCJpC5nsmkjpefOdTqOOARFQ4rQApI4mmpi7s6jAHOukpCdD8kIKRrPNtCAN3vvMQ\nAOCBGXc4j2/Y2oY7nlyF807e32yokAClphkpyfBML9ubv19LR5dn+22NYUgSMKG2yrPdZ9s7cO/T\nH+Pq849EY0u6RSIWT3le3+EKuIireb2tPepsF0+mq+jujvWtxmVY9Nkbzu/23ydTMqXj3U9247iD\n69DWnv73UKz/j1es24VdLVEctl/61sfWtuiA/DuJu85vo9XPvnFLC6p82Y2Dja576Ztawmiq9uf9\n99zSkm7ty/c5OjLOZVNzz68ZSlpb0/8f9/R5Bvp7sVQiroWEBvrzFPsc5rsQKFmzeV1dHZqb002d\njY2NqK1NV05nn302xowZA1VVMX36dNTX1+fazbDh3ELk6vMGkNHnbQ1YU+1FjmVougG/Yo3Ylg0E\nfIrTbG73ecPwNpvHk5qzTYUagk/2eZ5/7q3NeGfNTqfPOqEncd1bv8b1Cx/CZ9s7nGN2z58uBSOe\npuKuWNJpAZAMn3MsQtagG4bTbA5dgdB8kBQdCc3bdJ7Qc98+ttIaLLVw5RanA1JSNchVfR/5HE15\nJ0/59V/+hRsf/lfWds8s/QyRuIZnlm5yms1HVPg8zea6YeCL3el/nO4+b/e98O6R/d01Xc5bu8Dz\ne0TLfaG6YPnneHThBjz56saSNO3OXfAJFiz/HM0d6XM1UPO25+qXzddk7668e1qOtZAxEpnvM9xW\n4ervx+mIJPHBp03FOZgBMtz+hrmULLynTp2KxYsXAwDWrVuHuro6p8m8q6sLl156KZJJ88t85cqV\nOPDAA0t1KGUhff+vNe847D5vKV1NWuEtK1Z1bshIaQYCavo+74BPTt8CZjWb6xrM6t0K51hCd6rz\nkBqCIiuQhAzJev6fK7bi4X+uzxpwFg5twd3/SPehulcFU0Y1oiWRHhxlGMK5QJCFz6m8JUWDpgkk\nrNHmwlAgUubFR1vcezUaTaXf372wiv1FLqvmY0I3L07kEd5+UPMiQcsK5lw6k7mvhA1hYHc0/cVk\nT6aj6Qa6oklUhXwI+BVPiK1YuxtrXQPFtDyD19yz2el5phAzRPbjLbHcg9B2tZih/vmu0t7j3tKR\n/ruUMrzdrXG5Lm7yjST3hHcPo80LGayV+d7D7YvffQ76MjPh//z9Qzzw/JohNfVvrgmlhpuShfdX\nv/pVHHbYYbjwwgsxZ84c3HTTTXjuueewZMkSVFdXY/r06c5tZKNHj+62yXw4kK0Ba3a/tV15T6yt\nAmA1nctWVW6FNwwFKV0gqKbv8w74FSek7cldNN2AJBSn2Tye0JyAt5cTlaB41wuHQEJPoNJeKxyA\n0FTPVbq78lZrt+PvDQ8imdJR39BuTlpih7er8oaiIaUbSOr2iHgZ0Mzwrm/x3pIW1dKh61772/4y\ntS8OjC6zGVcOeQc8aprAw2sfw8+W3YRIKuoJccMQePHtLc5kOJ3J3IG3cMur+M27v3fudbfvCtB0\nga5oCtUVPvhUb3hv3tHh2UfKM2DNtba5PUJd0pHQcg9Ei2vZrQ/5Rv/b/w/phijpl1OLq0uglMud\nunMkZ3jnCdGwK7x7Or5CgjgrvHvY519eXo///r/lPe63XLg/X19Gntu3Cw6lqX/zXSwPJyXt877u\nuus8vx988MHOzxdffDEuvvjiUr59WXEqbzugrdHml5xxMPbfewdeiSsw7AFp9n+FDE0zYOj23Oe6\n2Wwup8MdsKoj4Qp1pG/NspcTlYXqHW2uaBAQ2K9mknO/t4hXpudgR+4Q+cvL6/Gv9Y046/9NgqRo\nELoC3YCn8tZ1A0nDWr5UV2AkzGOYt+kZHD/hSAStVdJirvDuSHRiZMAcnGR/v8jWoDsRr4DQFUhB\n7/GkdANrms1j//mymwEAPzjsIhwz7ij8a/1uzH97C0LHCEABOhPp4HdXa+/sMCfvWLV7DY6qPdwJ\n70RKRziWwoSxlYgndU94B/2utdxlAy0j/gUlPgJSMIIW3Q/AHHyXTBmAZCB45FuYv6kT3z3sW1nn\nM7P1AzAHreVi37FgGKLHirM/8lXeiZSOrmgSY2sKm1WwJ+4gyZWx+YI3Ek+fn54uYgoZaZ35Pj0F\n/turd1rbGVDk8p/nyhPehkAP89rkNZRaJIbSsfZV+f+fN0yMqQlaP3mbzasr/Pi3KftClc1wkkc2\nIpwKQ7Kq8ZRuIJGy/keUzD5vJ9ytyjulGea93q5wloLm1fLY0Bjzd6E4zeZAetWyoBLElV/5sfmg\nrHtmrAqnIhgd9C668q/1Zn/0xoYOc1CcrkI3BISRWXlbzea6DL3xSxBJM7Cjrv5cd3gv3rrUqdad\nudntixFdhYhXWp/JfZtQdoDtipjHZ37BC2cZ1Q5X5e2e6tReS317s9msbs+E12atJmZW3rInxGLW\nQLTbLjsBoTFtSFR/Dv8Bq+GbsAmrxItOU3hS0yH545D8CWzsyD0NbVw3gzKoBHDKPicCQM570qOp\nmPP31Q0BrciVhbs5tdm1drm7JeGOJ1bh539a4Zl5rj/0HirCfBcoUdd0sz1V3oWFd+ZtgIV98Wd2\nKQgh8PFnzWU3Ha773PYn1IbSLWdsNqeiGVUdwG2XneBUywq8k18okgJJ1RD48ofYEdllzaomIaUZ\nSKYEhCFDkg34fa6QFq5lPo2McA5EASFhTMhscpZyVN4AsGZjJ3738GdQjRAgG051J4RAOBVBlTWR\nS6akpluVt2p++en2RDEaNF044W1oMiBk6O3mYEU7oAHvILKPm9bipS3mGIms6V11FUas0hz17k+/\nRtMMyJL3f2FP8Lmmh+1IdDrv51621J4AJ2m9zl533V6UpLrSD58qw3AtwBK3ngv61Zwz7dmfMakZ\nkHxmOLfEWz2f3WZX3idNnIqJVXtnfwaYs9XNee9ObAq+CkBYK6sV98vJHUSeytsVjvZ88u5m6/5w\nh0GuUMn3GfU83RQ5t+1L5a27jyv//jOPb/naXbj3mdX466INPb7nQHJ/hP5c8w2lanYoXWj0FcN7\nAO01phIHpk6D3joOU+r+n+e5zBHGftkHVTFHmyeSulllywZURXaazYWr2VwYCiRfylkARA5GIeuh\n9LzoIqMyt5qk7XlzzIsD3ak8E3oSmqGhUq1AcrN5n6Q9hzoAJDTdDEddhaYJT5/3X15ej664GZS6\nZq+mZr7WDnXAW3kDwEeNa8xNnT5vb+UNAFIo3XSe0DSnYjyq1jzGlGEHp56ezAaAgMD7uz92nks/\nYU9ba430z2hTHFFhhjeQDji7sgr6FShq9rdh0khCNwxsbwxD8iec998VzZ4C2J7oJqQGnb9VKiPk\nP2hcjY5kJ8LyLsjVbVafd3Er77hrBTXPimnWZ3avsFasL3F3tZ85h735Prk/o2dq2j40my98dytW\nb2rJu02+VfIyZVbe9iDGTdtLv2hOb3gGBvbjNoVi/z9XSkPpQqOvGN4D7IpZU3H1cT/AtMO+1O12\nqqzCp5qVdyJlhndNtYq6kaH0wDO72Vw3zIFhAEJfXQqoCUj+BJRUFYQQ+OPzaxCNCUiygNPsbM8X\n7p5iVU734dn93RVqJTsCVIYAACAASURBVPTmCdA7R8OADsBuEk5Bks3Qjic1T5/3Z9s6sGqTGVS6\nZlXyVmVu94UDQDSjv7cl3obGaHO6/9OpvBWIlNns7p5mNZyMQkDgyNrD8c39Z5rnwqpaw7GU83q9\nrQ5CAG82LEdjW8QTRPY99kKyBt9l3F49wmo2B9Jf1vGEBglAwK84I+LdknoKTy/dhKde/wzwpZug\nd4R3ZW1rV95BNQjVuqVPM7zhvXLXh87PytjtVp934V9OWzq+wJ/XPo5wMpJ3tHE8zxzg9mduaDSv\n8tQJG/HytpcKfu/u9NRsnm+kuztce2w2z9hvNK7h6Tc24Z6nP05v002fd+b+3ecvc8rcZmtA11in\ni6w86D20cBS8nyE09Vyxu5XKEcN7gPl9Cg6eNKrb2eQAwCer8Cmy1WyuQ5FVCDkJQ04BkvWl4fR5\n604VDgDKSPP+eilZiVhCw/ufNmXdCy75zdCwQ1FYfeZ25b0zYgbNCL81QYDzeiu8EXFeH4lrnsob\nSFfYuqZ4Xp9wVd6fN5mVynXH/Bhn7GtOpdueaE9/Qcqu+9l17/EDQJc1rWuVr8IJPrvyjsRS6dHq\nsSroLXthV2wXfvXss3jh7S3pE23tLwnzizfzy7raVXl/vH0zVu/YjFhSRzCgQJYkyEqu8E5imTWo\nya68ge7DO6QE0pV3RrN5S7wN1b4qyEKFXNmRNWBtV2Q33tuZf33zf9TPx6rG1bjx5b/i9sdzbxfP\n009rn4+GRnNMgG/CJqxu/xDhWApPLKnvVxO6O0dyZUrmMqw276IwvWs2D8ey++u7azbPvIBwH1Pm\nc01Wd8OoEYFuj2mgGT3cklfwfoZQNbsn9HkP2Axr1DuqYt5fHEtqSGoGAlAQTnXhXflvgGwu4iKE\nq9lcl5wFFeRqs0lQSoUQtkfmWkHv+9IGpD4/DFLAXrnMHDls6GZzvH070spdq8ztIxMAtDmVM2Qd\nMFTEpU4oMEeoR42Uq/K2mrqtCwwtZVW2OZrN12zdBXUsMMJf5dzSFtPi6S8J2T52NT0JnSss7TnZ\nK32V8Cne4AvHNE+fubbjAKhjd0Ie0YpVG9OTB9n3w8eMMO54/37ElNEA9nKe19ROrK94BsrY/fBk\nwyLz/RJnOyPOJVflbUSrIFeEkdCTqAgoiCU0p88bAHZEssM77qq884V3OBnBmNAoSMlKdIR2Q0fK\nEzi3vncnAKAmMAL71UxCwJ7Uxz4uawBdomIbNm34ctYxAN5mczc7HKMJLf33ADD3pbVYs6kNmiHw\nvZkH5XxtT/L1ecuSBEOIvCuCefq8ezlgzT1ffa73BrxVW+b+3bPmuS/0hBDOQL5yC7nM0eZ9NZQC\nkc3mNGh8soqAT0VXxPyySfc3Cydw7C/7ZMqAcC2bKVeYVZIwZIStLys7PNW6bZBrmiFb4W3Ezfu8\nDatvWlENGMLA6uZPEMIIvPCKNWGIYc+/bo149pnNqCJe4am81boGyKN2Oc3Qeko2mxFzhLd7xLs7\nvJ2Kxj2TnD0Lnavytu9Dr/RVOLPI2f3F4VjK2b/QVAgtPdGNh7WNhhS2djag0f+x5+k2YxeSUgT+\n/dMLm8STOkKBdDcBACTWHwe9dbzzGYP281blXaFU5q68dbvPO2Teiw/vrWIpQ0Ncj6PaV4VqqRaS\nBBiBzpxNyvd/NBd3rLwPQgi8/uE2ayY2gdZ4m3ksqubchZApka/Z3AooTRee127aaf5/YfSjedId\nJO4+b7v1J6nluaDoplk7U+aXeFeOkfLdNptrmeGt53zOfftavhYDIQSefXOTZxbDgeAZbd6Ppu+B\nWqSmGPaE+7wZ3mXi9Emn4OBRBzr3OvtkFUG/4vzDE3L6S8eerGREyAy8aELzNM/KlWZ4G7qcbtbU\nXaOiJQEpEDOraWsCFfteclk2oBk6UkYKWjQEZ35Su9ncqnxl64vciFeiPZxIr3AGwDfhMwgrZFMp\nCaOqA1AlMzyT1rSpumFO8iKEOUNb0BXedsUl5HSfd2azPwBENKvyViucCxk7+CKxFKCmK3e4lk1N\nnwcjPSFOPkp2c3IslUDQnx5dDwBC8zvvsaO1AxV2ePviEJoPtYFx6Eh2eia+AbwD1pZ+YIb79tb0\ngCd7lrsqfyWqYN72JwKdeQcP7Yo2YmP7Jjz+Sj2WvN+AjmSXZzIcuTJ7MFVcS2Bly/KsVeeAdEBp\nugHZdZ99NGVdlAT7vmSonmcglWKHd54Q9FbehQ9YE0KgM9q7ZvPsyts1sM89IY/r4iffRcfnu7rw\nzxVbcftj+bs4SsGd17kGBvbEPb/AUFGs1oZyxvAuE//fAWfgJ0dfZt0iBqiy2WxuS0npL2A7qGsq\nrfCOa5B82TN1CUN2+vjcfeKSrEMKRK0mc8nZFgBk1XACUNcl1768fd72hCkiXoH2cDIdrjCb0u3K\nW+gKVEXGiKDZPG/fLhWOpszPofmh6cKpvONaHAnNACCQ8rUDwgxGu9nefTucHUqVvgrzVjtIzoC4\ncCzlDG4TKX+Oyl3Af9BKz2fPJQVr/vdPj4HeYYYnanYh4LcvqlyD6qxz8MTrG8zKXElBCkZhRKsw\n2mfeKpdZfcdc4b16o1khN7anJ5SxZ56r9lVBMsygFJLebRW0rvlT5+ftHebAwUlV5gBJKZQ9Teyz\nG1/EB13L4Nvn06zn7PBKaYZ5+6HFvmjpzz3NIk+zuT1oMpmnP7uvfd66Yc6alykz4LuvvHM3m7tb\nLqJGFz5sXJ31Pok8XROl1t8Ba3Z4r9ncUvKpeYuluwuw7vx10QZnEp5yx/AuM4p137JPVhH0eatl\nN2HIGGmFdyyhQW8dl7Uvs/K2vmxEOoilQBSSqkEkXTNluSpbu8/VHinuft4OTykQMydesSZnCfpV\nnDzCmkFMNiAk3ZqaVIJPlVFTYb5XJGmGVWs4BikQgxGvQDJleJrNkykdck0z9EA7KhP7mHO4Z1T+\nABC1Ku8qfyUkSYJPVhFJxLHgnS1WeNvN5n4AMoQhpcPfl4AywgxLvWW8s09ZT48U3m+vaucWPpEM\nQsTMufn9B6xG20irerIH1bmqe8jmjGxyVbvZzN01GmN85t9nc8fnnr+RHd6bGiLOHPbuPm+7X7/K\nX+lZtz39hZT9ZRxOpPvZd0fN/v0vjzBnN5QrO7NGnO+KmhPb5Ap2O7xSuuFtclfSa8c/99YmvPbB\ntqzX9iTfaPN05e0Nu2ff3ISXln+OsNTsdH/0ps9b13NX3lpGuOVbqx0AYnmazd2tBLvHvYSH1z6O\n19au97z2b1/8H/wHZy+GUyhDCM/FQ8Gv62cVav89GhrD+M2j72PTjg7c98xqRON9v3ArplxjI/py\nwZJM6Xjzox34y8vre964DDC8y4w9baoqKZ7KO4sho6bKbPKOJjSkPj8MX459A0YsPamK0CWn8naW\nEQUgWc3u9qxn9v4As9nZvlXJXXlnjVZXXCPMAVQEVewd2MfaRoMOzemHVhUZFX4zFCNWsOzobIYk\nCYh4BZKajpCSEd6VZr9gZXQ/81hzNJt3psyFEsYEzYrYp/iwszWM55dtMf/B+lyVN2BeaNjH75rn\nXa4I43jpAnNbpB8/cOJIp5lbaH7rIsDU5WswH5fTg+Ls1gF17834rKUBcrV5cWCER2Kczzw3G9q8\nM63t6mqF0GX88dkNCPnM/bv7vO1b9qp9Va7WAyNdfarZYbR5d5vzc7u1GEyNOhpGIgg51JX1ZaZI\n9oWZAXWfDVDqvjB/l1zN5prhad2RrM8djafw0vKt+PuSeuyM7Ma8T5/POV97Lkae+7ztqYTXbmnF\nR5+lBxf+c8VWvLBuGT6vfhnq3uY8+T32eXtmFzN6rLx1XXQ72txTeWu5K2+7p+mJ1z51LpQMYaAj\n1QZlRO5FZwpx3zOr8V93vdVta4emG57lMM337t993plTwP7+yVX46LNmvPnR9l7vq9iefXMTfnTn\nm9jZ4p06ubtBh24bt7Wjrcv8/zVfyLd0xEs6HXFfMbzLjF15GxCe8E6PJbcYCmoqrfCOpwChoEau\n9TRf64bkDFjzfPFat4l5mrqtn9uqV6fvv3Y1J2eFp6x7Xl8RUOFTFXMOckWH4QlvCRX/P3vfGW9H\nVa/9TN/19H5OzknvIR0SEjpEulIFiShYLyI2BEQR9PpD5aJX5d5XQbHAtYAIypULWABpIXRIg5De\nc0pO3XXKej+sMmv2npOQkJAE5vlAOHvKXrNm9jzr356/wfTMWax0xyDt5EUKSdiOhxjTYM+5eRSY\nJjgdgxEYq+w2H3D7YGkmKkxqERuqESB33eR9z/k51OD4GZztI2F5FehIjxDu/mRMxylzRvgxascA\nsf34rgGLndIG8VhnODZGNZaFPvlpkZvgZSphKnG0pVqwrm+DiGl7xENPoQskT5vT8LwDuXXqoBTz\nlhdQPO6rsAXKmMqRmBc/EwCwTYqZ9xeY7CuJg2QroJhF9GT7AxKnQo42MQijeQN0Rt5xU4fteMg5\nebylPwa10idSYXnnfCL58Su348mtS/D0tucwHAghIg8j2DDD30f+/Cf3Bd3PWgO18DU2Fsfx8Pra\nnmFL1uRzOR4JlXYtzXrfXZ33cAlroXFuhYhEtqCG/b4RAReWkRvHlOJbv34Bn//RUwGyGS488Xah\nlogfcC/Du9Uudnd4aMlGAMCK9cFFUVAlL/yaO3uz+O7/vIyb734RQDjJ9/Tn8dWfPov/vPe1sm0H\nGxF5H2Lgcp8e8QJu85OSl+CyKR8RWeeEqEjFDWiqItxXhq4G4rfE8RPWvLxvkYsabznWy+uwY9vx\n5zUPBT4D4Mufqi4AAqjB2vKERevS4eo0EU3xyds0NCTMIHl35ujLl1reXiDmXXRcEVsnpCRhTvXd\nxUNuH+pitejpz+P7v30Zjh20qDXTptYwczcTT2rqwv51drbD3dXC+qYbgOohldBx2xePRW1lDBk7\nA0MxAaIGLG9DYeSt+AI1gfkC/GQ3x4DjemhPt8EhDr551xMAgK5cD4jiwcumEDM1DGR4jbwjXLfC\n8jZTUtzft7x5XH989RjUaC1iO8cga4X6qwfXw8vSmv2/L1+BL972NJ5bSePv/YX+wHGinaylwXE9\nPLVlCfr1jZClCbjl3ZcpAIoL64h/iYXGa10rUIqubA8Gi0P4zSNv4KofP4Wt3ZlhNbdLX7YD2SJW\n71oLrWETVOba91gI47W1PfjRH1/Dd38d7o4W51IdvLTzFQxk/UUst4qD3+1hlf009DYa/y+zvAsS\nebthbnP/M0X1hFUnt9flHqF9xe6M561d9HmRPQHDLYzeLrRS5aJDEKW6GYFF2zBW86ad9J70MC3/\nsORH3tt+1cbesm0HGxF5H2KQyduSyLs53YA5jTNgKiwm66mIWzoqkqZI7DE0FfUVPkm7riLI29ky\nDsUNkwFIwiEy2Uj//0bvWwDoAsHfTv/fHLNMxHJlyzwRM6DrKrW8VQdEcaEpPB6uIcXc5isHX8fD\n6/+B3iJzKRcSKNgutnXmoCoqNnX3omh7Qq5UfAdhMWtOiEYBLhxk+k1c87MleHNzHwaGXMhtTxW9\n6LvMAboA4W5zVodOHJal7hIYGvMU6P6POGNnYaksN0Amb7Yw8VAU4QNSQt6KXqTKdVDgekR0U4Pq\nghCC7Sx5jeTSKBRdZLJ+jTx/6QvL20j6fd+lmDe3vFNGCiopDy2IVquOKch7xY4NAIA7HlwJQgj6\nCiWlSxqXf6WWt0tCrErNRUXSxMBQEUosCzVGX3KqomJd/4aApekRD//x4m345Yrf4cnXaDLQ+m0D\nw8a8S8l73dYB/PjV22GOXClkfVHiiXpdcq/L4C9xo/0N3Lf+TxhM+fFM/rIujXFvJstgtKwXf3O8\n2rUcAwU/L6DUba5YGcSP/Jv/5ap/H4ekKgO5MmRf8HZ6cpcuSDj25DbfuGMA/3X/soDrfTjyPpRy\nuEuH+HZi3p0lLU7DKjhMQyv77FBBRN6HGHi3MZd4Abd5OkGJQybvmKkFpBh1XUVrbYX4m3gaeofY\nKp9ocDtHiAYn/Bz+viGPQpjbHIA1eSn9n5KYNz1GBzQXRHVE85WYqSNp+eP86/q/+brmtonnV+7E\nt3/zIlxbw2CBveTYGF23xDvAPuelal2d0o+LaDSBTKUdxVy1ECBcriInn58vWlzPg6XSfXVDJu8M\nYiodu6gVB+ApLC9AsUXSXqAcDzSWbjHCdlxPiKcomgvXI9gyRInMy6XYi1AR96efuXeHbE7eKSGB\nC9XzrT5meadNSu6EoKScboiGDYgKkqXPhpyYtq2/F45EzqpnsnI6D6ahwnY9IfIiQzc8NFbHA+1n\nTxt5MmY3TAfgl8ABtAFNxsliTd86keCn6wrk05JhyAYA1m4rr4tWNV8NcHcQOvkshGEnt4ltnHxl\nK01uHSvvs7p3LX6+7C68YD8ItXon9Kb1cFwPHgsDFG0XetPGkkF6tIwS/n0EEBDuKRsv8fZIzm8n\nbC1n4e+N5X3lfzyOl1d34bkV5Tr85eMoP9ef1/zfbtX+9hV7Gnep5e0GLO/wY3mf8mSMl5mW73co\nl5lF5H2IgVvepJS848wFzcmbqIgZQfI2NEVYhAAATxUPKIXix39RalmHrDBD3OoyFATd5lTpTYei\nuVAU1gwFQNzUUBEP9oC2XZ6lreI1Fssjju5b1oxcbVt+80iWM3P9J9W0fz2uCkUliM/5BxQzD6J4\nAVc37bxGaDy9pDOb6xKYjFw1g373uv4NsD0HMS3BxufPnY0C8k4BLmxhvYfNkaHQc7oegcoFDVUX\nRdtFX44nDkpa2GyBMsAWXZt6ekA8BU7RD4koiivkTLmLO2kkqTEqhwYA5NwMVI/OPfdCFIlv+e0c\npLHCZHYU8svno4KwzHvdEfK8A5LL18vTcyUTCpK8xpuPQU+IOZRlcLn17xFPJPFpqor/emCZf97d\nSHgOZotlOR+q7ore67sDf4nzygoS8/MBuFXtegSKlYU17Um83hOMsfMXf3+BHpdVemGNewVG+5so\n2g4efHo9rvrxU1i5cRfUip7AsUrA8vYTqoazvPNOAf/+3K349crf7/aa9qTbrVZ24U3mPQPefsxb\nnvdk3F+YD7eYKP246Nr4+6YncNeqe3Y7vr2F63n42h1LcO9j4W11AZQ6YgJW9HCaCNt76LuxtoL+\n/sLc6283Uc0jBKs29r6jxi97i4i8DzHwhDW3JObNLW9D5a5XD5apo7bSJ0VdV8vIuxQyAQXIhoQ8\nCnsgd/m7EjFK3rL1yWO0MVNDMhaU7LSJLb6DJxEpngHD8jCiIeWXAknhQSK3PWX/5vNAR2MaJ8xs\nDYxXYdnqPMnMMjV/PjSnrDOb43rCbc47hf34lTsAUPUzOugY7E1UCtT2CljeQ12w3mBN4FwyTGbN\nP7NsOx54YpMYe9HxROa9fJ9UaIDioj9TxKadg+jNDQKOibVbB0RCG1RPlCxx8t64Nc+6z0neBcVD\nkRRgEDZ+fq3En1SejY5CEqZdA5PF8hXNgc403Tlx0fmk280YEd4WbnnHtLjwLshKev1539LnBNfd\nnwsmj9VtwD82/QuATzAXnzyOnstxxaLW3jyOnch5W+QtkvGYkp6iEtF5T7a89Za1UONZPLL1kcDx\nolQupMd63inikefpPX3hra1Q48GMZyhyzFsi72Es78c2P4nOXDde3PkqVvS8OSxp7l6m1IM14SX8\nz9q7sXGAVkS83WzzjTv9+/R2Er5K8wGyUmfEMG/NviKTc9DVl8fGnYPIOTlsGiwvS1R3Y3kPN/4u\nFs/mW12XAEYe8SMfwWObnsTTr2/H7//xVuixpXjxjU6ahf9WePjmQCAi70MMgZh3wG1OicVQfOvN\nKnWba6qwfADfsm6o9gl+WPIOURIjw7jNOUbUVfqHqwomtlcFysc42cRMXciJcnDxEz6GuKVjYmsD\nHGLjcxeNRnUFHWdBTiJ2Zbc3V3BTkU4YaKyOB65HJOUxy7syYYpriM96TKjQnbuQkoHjEphsMdLX\n9Dh6cr2iZGt2zZHivM6OUXD7a2ATGy/upPrvXBa11G0OACZbbK3fPuhnzGsOirYrVMrkudVVHVA9\nPL9qJ2761QtQDBq394h0P1TXLxdi5L1lexH5okv34d4JVmGQGeT3UQFxNbjwJ7UvT4nZKRiIW5oY\nLzQbpk7H1S/FefkCSTdcn7wNej5L8cm74PrWZU/Od3trjLxL1dOMjlV4YM1DcDwXhAAT26tw1GRa\nG19winCJC7evDs72MaKigTeMCUPfUAGPv7JVkJDcjc4cuRJqRXfA8pZDQNLFis5hgYQzBqphH1zA\naH3t0gLDldzmEnlb4eQtJ/r9v9fuxLLulaH7DVceV3RtaHV+WODxzc8AKPdqEELws9d/jf9dG1yo\n9PTnWRc8EiDm4cgvb7t4+LmNWL2ZlmxmbT+GvCvfF3rMvoCX5xUdF//96p34/gs/KRM7KvXWBGr3\nh1ns8NACfw4c1xNVDH9a81f86onnsKnTv++7s8K3sERBfr/fDUTkfYhBVcOzzbmVoTGZUUV1ETM0\n1ErkPba1UsiE0pPQ40c2paXPJHeYbJmXan5Lx7O9yzaPb6kVC4yBrI3KlIXjjmgX23lTkpipiZcc\nh6uxlxnLJm+pTeDoFkqSd6+6V4xHdpvLMWthgbsa0gkDDdWJwPWUlsNVJM3A9ei1NN5cnaIuccfz\nhCeBqDbuXf0AvcaqMWhPjQheuEv3W9e3EaYSA8nR+f33y+eXzRG3vAH4Cxtmeedt+sKvTib8/TUD\niurhjU19gOJC0VwQx4TrefA8iJg4J29uUaowaRmT7DYXjVmkBZur0wx5Bp4QZ+cNxC1dkLeiOeLe\n9hcly7tAnzdV84TbnBOXqcZD3ea9WXo8IUy6Vy+KEkYK/9nryXK3ugKTkXPe4wsxQ1wDNCcQ81ZT\nvbjnzT8LBb8f3PMq7n70TSxdxWK3hpSAVbMT1sQXBQm6HvGrGmRIuQWDdjl5F11byMNyD4ilxv3K\nDtUT3gWZvLmGQSl6MoMwSAIN8ToAwxPgcKpyj21+MqDBz/NKZPL9xV9XYe2urVjWvRKPbHxMfH7/\nk2tx9zPPIT7zceitawLqdjIxquke4bnY3p3BH59Yi+/9lraslWV4d2a7Qse4L8ixDP9C0cP6Aerp\nKG3y44sJ2VjbtwHLi0/AGE1DII7r4anXt+FP/1obOIaHRGQJYPkdEZu6BErCf/YzuxGl6WFW/HCS\nvgcCEXkfYjihbSEA4JSO42GZ5daAcFUzy7u1jr4oJnVUY1RzhbAeAQh3LHe5A74rm26XasI7R8Dp\nbIPT2eZvl9zQJJeCN1SJRNHvuGVqJs4/kVoZU0ZS17HIqAZQLPjkHbf0gIAMjAIjW7pPU20Ccxpn\nYFrdJKzr34AhbTs7h3TxcsyaK615GtIJk1qBsopcSUb9rPH1qEun/HMxxboYUzVzXSL01wFfxtXQ\nDJhG8GfCSSTjZGEp/uKptT6FUpiaLITj66sXbQ95FhNorvGTDGO6IVnOvshMNu/QlzBbwMiWNyGA\n6rG+6l65d0K27ImniVp2wE+kKuboPbKYWA50h1U7EAzZQ1DsONA5GvYW1pVMc0SiD0+aMxCDxa5X\nJm/umvcGqJiOVrsNg3JrTql0atsQJVtVVYVlXfAYKTAvCl3EObAMSU9/1HI8ufVZPLLhn3A9V5RM\n9bA2nUqImI3sNlfCyrcUV7yMB/dgeXMPSEz1PUCK6olqD368O1gFNZ4RLm0ZWTuHQk7DB0efAcDv\nA5ArOPjt31eL/Rw3PKntjV1BFy8PXcge7KGcjd89/1TZsX99diNyBiVEo3VtoFc5J38lNgRr0guw\nJtM6/lK1uqyUUf+/ax8WvyEZL+54Bc9L/elLUXSL5fr/kuXN8asVv8OPX75dhMe4Vfzwhn/ihy//\nP2zxVkKv2waoDhzPw6/+7w08tGRjwAshpH+55e2RMg+kKpP3btrfdrPnbLgGPwcCEXkfYphcOwE/\nOf67mNVwBCyj/PYYnGBUDzFTQ1XKwq1XHI0vf5hm+fK4LQBBvgH3ouwelC1vosHeMBVeVs5WD24v\nrJyPKUnfhWxqBj588nh89zPzMGMctRZiElnlmfEbs6jbvLBsIYobJ4rtinR+njRycvvx9OsUjyXE\nlMfdrSnPCsubeNTytgzNT3aDH1fk1xC3NMwd7y88eDcxUzOggCa1aFKHXMI8Dbyvugw5NGCqscC2\ns0afiqq+2eLvQHvOgHyqi6JXBCEKWup8z0jcNH0vCCccx0Qmb9MsbE+lMe+Cr3QH1wAhCLjNdU3y\nTngl91+aJy7/6hQMxE0NMS3OzmvT5iu6DZe4UPKViO+a5hOo6kiWN53LJ1/y+8bLMW+e8ObsGAli\nGzDa30Rfwbcqq6sly3Dlb2BNfRquPgRNVaEqCoqk3PImqitCSfLcPrrxMVz1xNegN9FSL9cjwoPh\nZYOLq0DCWgi5Q/WEBSqTN/GYfKtr+6ED9jwljARkHf3BnA3Xc7FhYBNMLyUWMLe8eBs2D/ou7oJt\nU8liR0c2xyxCRn6PLN0kyc8SbM9tx5cfvwlPrH8hMNwEy81wulqhQRM6/4E4t+KhR6WJX2rp619a\n5PXYdBHVP1TwNQXYgpiXBZaSmWx5bx7ahqU7yrPOf7Xy9/jNyj+Iv4u2G9B8v/Wl/8Y1T92EXfle\n/HrFH7Az0yme9VIZ1NV9a6HX00UQJ+A1fesD+6iJwYDbnH+XJ2nYc0+G63riPnLIev6lynUyIvKO\nAADQVPYjUspdedzyVlRPuNJrKmJCwjBgeTOrNpDMEaKqxlGZNANxW0teCDA0VfrkbmoGFEVBY7Xv\n9pXJm7+EYqbGXKBKILNazlavTtPjWpK+znhpkhx3hauJId8t7lLL2zTUACnpsWLgHKauBRc2jNhM\nzYSmqXC8oOXNyVtX9fLYqpQ3YKlWYNOpI09ERW6cv13zr5d7PbTa7XijawOKjg14KtobZcvbpN4F\nkDLL2/OISNrjpMFnvQAAIABJREFULwlFt0EcAx4jb3gaFIVlC3P3eWkSIRfaAc1GB2huQNzS/fun\nOYiZmp87UIhTS5yoILaBIjJSzLsI4mpYsqwbdz9MXZM524/r8nixl6mE09kORSEYcCh5X/Ghqaiu\nCU6vmhhCwaRuV8NQy8ibXoODdFJ+PoOWqNEuNVlhiwsu7AJQFz63vF3J8k5qKcQd2kRGUT3YbJ4H\n7SHEtBgaN58HZ/toAIBDbL8Gmn1Hykj4vyvFQ6Ho4q3e9cg5eaSdVnj9dWIMXUO+KtiWHuZKdw0M\nZei4RJMdRhp6y1rE5vwdD+/6LYrI4c8rngxc85Cdpde1fip01RALKNntrbe+BcegnhAufyyseOn3\ns9T5E17dvAFf+q9nxCLHigWJqbQ3Oq9l93rpb3ht34bAdtlb4HgObNfG13/+HP7th/8Sn29l5ZM3\nPPtdvLDzZfxr6xLkmOVdCHNJMw8cH2NNrDqwWUkMBkrF1vZuxuceuwbLu94Qn8ltb6EHr4mrJAI0\ncY4QgnvefACvdvnhCdvx0McSE4frQX8gEJH3IYzm2gROmNWKL15whPgskE0eAqOEcDVVCcgbkuEs\nbwC1lbEAoafiQWICgNYaP0lNjudyWJLb3M821/06TEk0hYu4AEBVih4XsFRLM+Cl97OaYGVWsuWt\ny+QddJsbuio0vGWYqgFNU6h2t7SY4Mk3pmqUkbec9CeTsxibNN+xEMtbjWXxt/7f04Q1TxPudtNQ\nYeq+Z0V0RXNMZPIOtSCY5e1fqA04BmzH893mAGJxlLnNP3bqBEHufFvOy9JnytMRs3RhvampPjqn\njLy9giU8EKQYR8YblFzGRX9O2Hf15XwrLONkaEzZMcR+3EqLWRo8zd83zix/T6X3z9BUOGD3UnwH\n/d5kXAqTlHTVcwerpG1sIWeb8DIVYpz8he95HqAX4RXi+FjHlTAddqzqsg531PJOm0nYDoSHpugV\nxQKAex8qrKT4XXHPx7Iu6vKOF5vhDVXD3joGAPDcm742+NZeupghjo6BQWZpMsubX6Wa3hUoAyxk\ng7+/jJ1hc6RAgyFCF778bT/05vVAMQEvkxZiQaVqfRxLNvtlc+NHVOH8U/zcDzW1C/IP8ub/eQld\ng9TFbO8YgYQex/qSJjy25xPj7ct+gy8/eQN2ubQpztbuTKjVammmkKQNI0au9Oc4dCxyxjtA3d6y\nbsCDq/8BAPjjW38Wn/FjHdcLvEMAQElI5J230Vvow5Nbl+Dny+4Sn+8azIuZiCzvCACoxfzRRRNw\nxBh/tT4hdQTcwSoU3pwdekwpucdMLaiQJJM3CZKZkDdlkBOpONpqq6T9yxcSlaZvRfKXWMyULT//\n/LpE3tzy1lTNJ9mSxUVx3RHwWMIUb0nKY96moQlXOOCX9nDi0jU1IBwiX4OuKtjUOYTfPbpOfM7L\no/RQ8vYXKLy0SoYWIG95MVOSw6C6UIkuLNiKhCnlNDjCCiCOQd3mhJM3LwVzoai0tj5XcFDgbnMA\nlim7zTUoACxDCyTNAUCB5JDQaC5CwtIRN+j86rU70KmsgWKypKd8TMwDKcThEgdEp5nJil70xXDY\ngi3LuscRQtDv7PLbz7Lvz7t0e6ezCTuTVNr0kxMvw1ktF9Ahs/71pqHCUYKVA2JRKB5PAhhFjEx3\n4OjmuXRqpC588iKosHIezGKNyDsAaLKiYhQB2/QXSACgUMvbIx6G7AzSZgq27Sc2Op4jkTf9jsp4\nWlLCo9v6cixbv0jnmeTpwAekBc6OPhZGcA08vIS6yGWyo99B5X7zrx1L50hx8M+XttA+5ZkiuocG\nxBzpii5i5tzw1iq7oShAYeN4EMeEogDZYkHEkkvj/tttP8FLU5WA0Iw1+XmYE18AJ/A1W/rx6rrt\nbJ4NjKrsQHd+FwaKfqWCHMte2fMmrftnv+MbfrEUP/uLb81yFN2i0DRwPSJCFv7AWNWJ68sJK0RF\n7oVF9JqsbMBtvnEbnfNdhV6YE16A3voWc6F71I3O3iFpvQLENpnbnC0M8g5eWeu3C/WIhy2dQ3h+\nVac/3ihhLcJwiGtxFFfNg9dfH7q9lLwtUyuxvOWENXr7501uxPc/O5+2/pMs7/GtJf5MMMuCn1sr\nt7wbEv5Cg1tIPGv5e5+Zh7PmjQ0dK7e8AYiM5dIMYJJP4SOTzg1+oWR5u92tZePxX8R+9q0MQzWk\nemH/+3gs2ND0snri0fX+NRoh5C3PtxJI6C8JA6gudMUQZXQVSRNtKRqX16q6AuSbZZa3r89OAuSe\nLTjCbQ4Ahkl8kvc0aBpLAJOS5gACm+TAeSNmakjr/uIrhz5R1uQWLDEPXoH1ZlcGac9yzRPhEL5Y\n4q1fu3O7YKMgLF5ueXsKJYo3B/3yqE1bbdz10Dq2ncdXM7Br3gKgCNLjC4BYjL2U9SIUBUjoSVw0\n4TwoTgyKbqOphu4vXP/FGEBUmAodP7f+O6uepIsgT4Preb4YDot5d2W74REP9fE62K4nFp02sSWl\nO2Z5mwkpt4Fu4yEEl1Vf8DnK2QUUbRe/fGgVVm1hjXpcXWyX8wb4dRLHFGI7ikoT2VZv7sMdf10B\nWymI+VWhS25ztsBgdegkmxZz2DkwhKLtQW9eC60mqKrW7/o1y6qqBKRhAdAOaZK1nnPZ78s10Jyg\nZX49OT80kA35/cmu+tfWlau6FdyicJtD8VhIyQfPc+GLqEwxA89mioKuCkVzgqV10m9Qq+wRTXgc\nhzDLm97Hj7Z/Bl6mIuClsl0Pv3/CL9+7c/n/4DtLbsMDT/niMZHlHWFYqHu4Y2aJNWwZWlD3N1Aq\nRh/kptoE6qvi0FQ1QO6TO2rF/08fU4svXzjdj8ejNL5OURvzCZ94Kl08MJd5Q3UCR030CVa4iAGk\nErIrmrfwLL/YMXV+TJx386pImNA1Bc7WsdQqIeUxfuIRVFoVpaeDqRmi5EgJqXU3VCMgvfjZD07B\n5R+YIf7Ww8hb2t+TpEcntNUFd9RtqIqOuGR5z2+eCxAFesNmP6Pe1ZHJ2zR2yaw6a+ozfptXx0Au\nHyTvbGIjI2h6nzVNoeTLXtqJlEvdpooHz6afxS0dST2Jwlv0+lzFFpa3V4gL0RauVpbxBkUSk8hl\nYN+/dscu/Owvy/HEm5ScSYaFW3jZGrdwTN+789yrg4J8XOY2R7IXUF2MVuaCFBOB73hsgJYUcosx\nriawoycLt6hDt1xRiREgb0BkxOftIjziIR9jFmM+IfIKAFoOt2pjL/64lNbzt6aaqQyqwlu32uVu\n81hSkD/XyOdVBbativtJPy/gsZe34ull27F5Fy2Rq02m/aQ/j7vNFYgcCFsqeWT3dzBrY8122mKX\ne0Dyeep2J4SAe43VWAbEU0AKcXGN37l7KdZvH4AxgvUzcHTkXliEpNMIBzZ4GZ+mKgErmkP+zdhM\nuY84hig5zEreroxdImID3+1N57A8abDgFkTCGkqSyVJGUhCrIyzvrK+q6OqA7gT7juvhSWe268F1\nCfNuqMgXJE8Zu8ai7QbG8GrXcmjpXvEbEfu8S4jI+z2GcLe5VPIVEvMOWJYSudek/Bfr/KlNmDra\nJ3MAAUEYDpnc4WkBlzk9p2+5VyV88RiZ8MQChBGVSGarS6IuLnkDPA26ptDEKkUBoIAUEtAhuarZ\nS8ojBIs6TkBV/wwa72OQyZk37pARqJsHVZKrkhYBCb085i27zT2pfj5VojKnKABcFQ3VCUwbXYu5\nExtQHauC5VZBiQ8FMuppqZhfh6omhkQmLHENDOVtFGzfbd6XXI7p09l99VToqsK6zrH5GPMMVNZb\nmj8TBduFoijwhmhoxEYOipmHQhTAlmLezPLut/tgxIPkzc/Vn83h+VWd+PsKSt6lljePLTpMMCb/\n+kLs7CmIuHavthHfff5HUHU6B5br51qIlynJQzFzIt5tKQlk8g6Ia8BVCjBYtUYpefNcjbybD1iD\nzrYxrByPC9FQ8n19K81gbk01U8ubu80JJe+EpQuXdtKK+d4PI0jeXPdAdPDzCsiKen3672lzx4rf\nnS2XWqksROKYoGI7qng+sgUnQJwAkMl6ICBwPMePeceyIIUEAFXyDri49wnfclR0ByAqFM8MzLWm\nKqHlcmD3UYkPQq3sogtqT4NC6Dhkb1fGLre8lVjWJz+jnFjzbkGUivE5cnc14gOpy2CqlvjMcT04\nnoO8mxeqisTVoaiOmGN6fSXfwd3ujkcXAJoNuAZts+zySgJ/n7LjpTkChkmqO0CIyPsww5566IZa\n3oGYd3mdNycbz/MCCWvyQqBUfrB0eygUglhJrbpsrVt6+PG8QQh/iR49tQkfP20irr5oBkzNpCtu\nNv50wixrSiCTN38RVqUs6KqO6sKkQHtUUzVEOdCYmjZ888jrMK5q9LDXaGhqoJZ9XHMdPnDkCNx0\n2Vz/slUFhVVzoearsKDZF27hgh4yKhMJ6JqKL104HfOnUq+CiQR9YRh+0h0tFSOB8h7enAWOgf6h\nYHY9APSxzm10kaMGLG8A0OtYwhT7bOG0ZurZYQRAyTEPnSQAKNA1BQumNWFEFQ3Z7Mr3wUqweHKJ\n5a0YBVimImWrJ6Brip/YptlQFF+qlQghGVVoxW8Z2ibmwHN8qV23zw8ZKUZRWN4WErSch32HbrJY\nrsVkMJnHIGXRf9/Y0oVfPkL7NDudbSDFOAtNcPJmI0pQi7Ml2URj3sxtTsnbRdzSka70UBmjrV1L\nyZ+7r4u8RxD7DRb1XRhwugHFhcZEgyqsBGrScRAiuc0VOW4vJe0x0ti4Y1C4r0lJ7kHRsyl560W6\nwGChB+Fh01ykkiE04AYXWSqzvHUvjuKGSbC3jaLbmSWqN9MFDsnR3vSKS8chk3e2pH4bAPSGLYjN\noNnmhuWTYFKpggIFBafot2FlY/EKcShOnC7CJLc5j6kHLG/NoUTMUGrdK6oHKJS4HZewcj0ahvIX\nOPQ7MnlbkLfT3QJnRzubA7o9bmmR5R1heBT3QN56iaVoGcGENeLJ2xXpv1wmskSqkyGsLWCY5Q0A\nY6voD5vYFsa0BF3VMtEamoFvf+JI3HrF0SXnpeTI5V1jpoZjp7eIuHiVVcmuRRMNW2QYEnlfcfZ0\nfP7caRjTSo/RNZW9YPh1aSJO1VSTQGOqBhWmbJkH57M0/p00E/jwiePQ3ugfo6kKvMFaJDedgOqY\n/7nc7IGjpabc2rcU+oJVOem4GnIFF7miCy3mZ1VziyWQxyBZ+l051vCFuc0NPRgWEYlVnobPnD0F\nNRUxen+IBuJqKHg5KLotXsS6ruITZ0zGNefT+9WT3wU9zsnbEucCAK1yF5qmv+W77l0dLbVJsVDQ\n67Yj1rLJF3MJkZYFAM9gCnBF+twkYjq83iZUDbIKDL0o9NKTajWyeceP+zJLTjGZNeZpuOXf5iPN\nyHv11h68vtFPsgJo8ppX4vZWrBwsNYZ7/rYJBP4C1CU0YU2zCsg4Q+iobGGeJhWEKFA1Rt6eDVM1\nYLOsZrHAqejBC7gPWsNmaJXsGow4aiuo0EvOLhey4fFuIkkFr9vZi9gUKpzCFy5y3NzziEgMEwtX\nISTjoChJ2Y6qYhnlJeENVSXoLw4grqThdnb4izUtaBUX3jgyMA5ZMjUjZYIHcmMAQPEEeRc3TMIM\n71xYmomiWxAxb1GD7eooFF0YqinKHh2X+Cp2IrFRh6J5yBSkeWTkrQ42wB1gZWWqC9vxqKdDs0Fc\ng3lwuOVNv38wa4v5cLvaUB1jybvsGY9behTzjjA89mR5lwovWKaGie30ITthVmvoS5KngHgEQQlR\nibiUEPIu7fTEccX0T6Bi2wkgmUqcefTIYcdqqDra6lOoqYiVfS6j1Hr33dYkKNTBj5dUz2pSScwc\n71tqhqYG6n0BX7iBZ30nDD9cUGZ5l2Sex8JKxbgbnhDELR1nzO/AvMmNWDituWxfSyuPmXPyVrhl\nzd2sRRfElPpCM3KXQx2Vdf6LWGQrexp0VaVubzlhx2KuVtfXntfZfSaOgSFniMqzshc5d5vH9TgS\nehy7cr2iJI+/zD+4YIw4f6eyRri947qJz35oqug0BwBoXYmCW4ACJZDMJ5frOBq1evM5Rt5snKpD\nCXj2tCTMup3wCnGkvUYqYcleuq/iL4DqUPJm46urjPueE83P6OcvfM8jovWqxvu6aw5yWQVLWJtM\ngy1aHTh0MZ2gNdod6XamSgfAU6FoLJud2DA1U2RNBxfQ/iKNz21N2gI8Ddtz23HHsrsgMvqlccLT\nxMKoM+tnO8ulcIBP3rwlqli4Sm5z3oa3Wm3CN46/KuCh4ffC1bLwiIdxDS04YVYrxrXUse22P5eA\nOG57Fx1vgLyZZXzVjE/jqhmfDswBvRd8gRLDUMaFpZk0YU3EvNn8cfJWTJFQ5rgeduWpp8kX86H/\nDhX8MSi6DS+TRmbVLH8Bwo7f5qyHogDeUGXAbS5yC3K2mA/iGJg7voWek2kimHpkeUfYDfbU67fU\nhZyMGWitT+G2Lx6DxaeMLy9XAkRrPyrmIFnGEonK33v6yJNRH68NxH5lWJqJq886ATd8bI7I+JXB\nm6+UeglKt/NyH8sILjgqmeWt6DbSyZBac0XOXA9+h6YpActbBpf7TOp+LH5PlndY9yQeo+eqcecd\nNwafPnsKmmuTqDQqA/uGldvFWekWfzEeN82vr1WkF7/O483SgqxdnVZ2PuL6lrdcIy7I3/W158e3\nV2H+lEY0pqtEaRBPaNOlhUtNrBo9+V7U1zOyZy/CU+YEdeCJ4oB4KmaOa0RTTQLfuuyowPa8W4Cl\nmaiuKF8EAUBRoyV7OUbeosENK9dzk53wFAfurkbs7M0hm7eF29RGHlrtdroAkcSBYixPQdHcMnf0\n8nW70DfAXMUa70jmBBZIwuOkuFTVLk5Jo6OizV9oen5M2iU2DCk8M5yXAQDqE3WoqYgJ1/1rXcuR\nJf2wJlBJUd/y1oU7l3sv7O0j4Q0wi5aR8zPbn4dLiBAb4QtX/syYY19FhtDx1+ktSFspGLoKj7e5\nZZamrdHjm5J1+OiiCWirpgaBYmVhjnsZWsUudk56n555lWaqb+rpFdfG3ea5IR1JI/heUHQbnsEs\nZ9tAf6YIS7NQCIl5wzGQLzq+qJLqoug6uH/NX+k1MhU7fo0i1q7QOm6xuJcaBdmOh60OFW5xu1tD\nLe+hrB2o8EjH6Hvig8eOwHc/PQ+WoUUx7wjDY97kJswaX4+vLZ4Vup1nVNdZ9Zg2uhZnL6Qu7GSM\nJWaFkTezvUvbBcqiJvKmM0Yvwk3zrw0mp5WgtjKGUc3h5M7JebiYOd8u9MdLkt7SJn0BKbqDdLyc\nvGXLu1RIxtBUv+SoBJwYZMtbLyFXTmAfnXQhxlaNwsiKkqYlAM6Y34HT5rXjU2dNLtv2hWlXBWr0\nwzL2E5rk1lc0jGzyCb+m+xhfI54n+kj3tEFvxw+O/ffgCT1VinlL99RgbnfPz3jXVBWfOmsK6lL+\nvXOKLAFLWrjUxqphezZ67R7qvuS92y3//GkjhYq0CsXTce6xNI9A04KLy135PliaiY+dOhFnzO8o\nmwvCwgCdPVRHnN8jTmI7MszqtC1s685Qy1tWA+WhBdvCSbOobn/CYG1Nx7wu+otzwn9pdZcIJ2ga\nK8nTnMACSSgPcnI26QJjRLoVlsmS+jwNHmhfe9uz0dMnJToNoxxYvXURLM2kSoeyVCk2+vMhW95C\n598RcyD2Y8f/c9OT6NPXQIlTD0ap5a1oHvQxNJuee5Fk8halWCo9vi5OibE6Sc9jtKyHVs3ugfQc\ncm8N11bo7suhM0sJ/Sf3vFn221fTPSC1G0CKFrxsBdZvH0BPn4OcUxAiLVzbgdgx5G0XGgwxxgIZ\nQme2G9PrpooWvdzyzhTZ74Qt1OIaj/v7mgeO6yGDXpCiBZJL0wx1fs/ZImkoZ0uuewOVTGggFgcq\nUxZMQ0XRdvdoYO0vhJs+EQ5ZWKaGK88tt644UkYS35p/HSrMVHhMOqS1J3/Zlbb+k6340pZ77wS+\n5R1O/pogb/riLiXvuJThHeY214gpvqd0gUHJV8XE+Cx0NFQFtnELf3duc+5Wntc8B/Oa54SOP27p\nuOD4saHbUrFYwPIPu0dJPQk4/vbqtH+9llcJe/0UWJOfB1GZFSBZhZahIaZbSOoJP8bo6dBUBTFL\ng9vTgqJuw+zw5SHh6mVd32TLyCkyYpeItyZO44V9hX5U6JXg7RsURUFh9UxY41+B7dlIWAZqrKQI\njWglnouck0M6UYdpo2sxbXQtHlqyEYWVR6FuyhoMEhazJ4Bjq+hoTQjPByfvXqaRbqoxbO/JImbq\nouc44MtbLpzSjounUNnadExanNWzpL1Aq1z6Ha45CKj1rCpAmmON11mzeD57oSeNhL/wJCo8uEhY\nOlzVLbG2gwsYTt5xRp7phBH4neaJn+XtDXBi8suYeLlVIJ9B+v+COgjVGoKXjwsPguyB4z9z/rsy\nNBW9/QRWo+/9Kap0DPUsVl2TKM/VCHw/I/8iyeP1tT348f8+g9gR6+EO1ACuIQiZQ6vugqIAxS3j\nAU9HvujCLChQzaLoC6ym+kEIdWsXii7i4HF5FzZT4RvoKxeEGsxnAZjQ0rS6okKvQhcQ0DywHQ8O\n8ZUCs3lb/K54eCJrboNVxWrfPRWVcbqIzjssVBUbgDbiDeSKp5XNzYFAZHm/B1EXrxk2mexblx9V\n9hmnZbIbgi61yt8JRFx+mFOqSnncXkZCcmvL7U55rJlnm4dZtfzlf3TNSTh7zKmh35PYjdv8ncLU\ntYALN8z7kNb9a7JUU7jhAZoMN29C0NqX3eomW4BUxZi1ThSAKNBUBcmYgfqqONydI6ES2UrSAhYz\nAD+jHxAvYpng5Xr+ilgSs8bXi0XloglzUUmakXcLyDm5gJiPripCHpQjVhL394aqMdM8xf/A1QEo\naK5Jipp8txict7pUBTp7cxjMFuF2t+KktuMB+PKWDekKUXWRNMt/G7LkbYJtH4qvg9FOFzky2QkJ\nYJ6Mp9iIaVbwuXV05L0cYpVDVMSmpLJDBifvGHvuUnED8o8jD2r1FlbPFORbEaf7ajXbpQ575RoO\nAPWsKUYRhCWr3fjxuThnQfnikqvrmYYq7jl3mxdUujyrZ5Z3Q0U5eZeqNxJPRc7JY9naHiEA43bS\nZ3XZup7gHHDPgOwV83QoCi+1I1CT/SC5FFRCyZ9b3takpbAVapWv3iDVkrt8AcF6rTdSQZZRFvOI\nSZZ10XHhoCg8BnLuBPds8GeBjRhxg3fQo+SdSa6F0bwBm3qD7UoPFCLyfp9hREMKN867BjcvuEF8\nxt08Lvt3XP+5uOGoqwPH7U/y5pa1S8KTO9QSy7s05t2UpOpN9bHaQO05fzlzyzuMGPnLP6xjG/+e\nZIjlffHJ4zB+RFVACW5fQL9fETHN/kJ5b2ceFgAASw+St6oquPC4oDu+QrIkeQ9s/pKloQefMK5f\nPBvzpjQibUrk7Oplcyxn3E8f2YxPnDEJx83wBXZ4xj9A5+vKc6dhFksMvPCEsRjZQL8/5+QDSXma\npsLZOg7FdVP9awxJ2otp/vg4cTbX+Za36ypI6v51N1VVwiME67cPQFVUnNi+AIBfTiff0zE1I/zs\neAauugbQOefQG1g3L4mYDFWjiyJmeXuqLeLoHPa2MSDwUGxaRj/gmvNmubdJMQsgRKEd5cDIW/N/\nG0VlqGwMHfXU82GOXAU11ReYJ/n7AKDIiJfY1LXb0ZTG+NagZgPgaxYQ4ru9eYJWERkYqiEWdfXp\n8pBYaSIeHAMFL49ETIfCeoDzkM+ytUHyVpmSn+w14Za8Yuap7oHmwstUwjI1arm7vuVcTG1mx0jN\nhQIxawIt3Y8RqTbUWLWB8Sqai6ydp78VtmjpzxQDx1tSCaCzg4Z3+D1/bPNTeHzz037mPQn3KO5v\nROT9PkRDog6VVvnKmbvGLSWFpmRDYFtIXtY+QxXkHX5SlcfaWcy71FoZXdmBzx7xcXxp9hXB47ik\nNCPv0pp3gFs1VIq0FMJtHmJ5nzJnBK67ZFawZn4foCgKrr5oBmbWURWz0sQdAEiYMVHrbGkmkjFd\nkLKmBUkLAJKmFONn19DMFjgcfOlVmbLw6bOmoDImddjytLJEx3qplOeoiW1YMK1ZzB2AwPOTCLsG\naQ5ly5vfS/klHUbeimv6nhNO3jVJcbzreUhLY2itoZ6ATN5BIqajwkoHqiHkMVZYaehvLkL+9YXi\nM/k+xIzdh5scj8BQLKiJQSjxAXhKOXl7fY1I6Wm4Zl/g+NLKCQFXF2JKybgRVC5TWaxXImdu9QF+\nXH+4RLiiTscwsq4ON1xKQz1hiZIJk96zwWxRsjqZ2xw5VJgp8ZzIz5x8DYCfsEk8DXllANvct0SN\nNo9Db9hRrtYG0JJD/qyLOTviaRH+IPkkYqaGfNHBUM6fI0/3NQ9Kx2N0rGJzRFATr0KMe5mE5e0K\nsR5O/rmCFPPWHDom1QOxTdibJtFxSc/tfW89KMJYils+twcCEXm/j/GlC6ejrT6JY46gJQ+cvGWC\nGsvqoxtq4uUn2EfwOHRYpjbgW+Ya++2kQmq5p9VNLluAcLe5RuiPKszyPml2G7568UyMaPDJ6+On\nTUR7QwpjWqk1EbS8939ayOSRNbjsiPNxycTzsajjxLLtluRaNzUqQiMatygKNFULvDiC5E3nroy8\nSxwnSSNoeZeiPu6Tt0zEHBVSA5rQBUiAvP2xcs+HHDoI08h3XCI8LPwl3taQFDK6iZiBasn676j3\n3fiJmA5VUZHQ/WssXfAYugqST6Hw5mwU101FXaU/3ngIecvEWSi6mKgfDUVzoTdtgIci4iELkLRR\nCaIEO7uVhif880slmpoaIG9P9ZOkOJRA1QDvXS/FsaUua45OiW/e+A7RwS6szDNlMNlbqVaeyt8S\n2MghLXljShd79BqY5rzJa8jpta8k//TVANl5t3aXS6USxwCIhoZqdi8kmeOWDno9Zx45HjGTajP0\nZf2ySRKKQnq/AAAbb0lEQVQbCJxfHo+i29CbN7BrTIjx8Xtijl6OdblV9CBXFzkAckKbodN7Egif\nlNxzy6I/suaaYEXJgUJE3u9jTBtdi29/4ihhhXLXuKym9qULp+Nri2dhTMv+eyC55T0cefPt6YSO\nb19+JCrfpqv6iDHUHdZWzVyKIdZFzNQxqaM68PI5dnoLbrr8SBh6iOUdco79AV3VcXTLkaFjNAzV\n713NSJeTN19Y8aoCUzUQtyTVOmF5S33R6ZkCf6UMyfIOJW/frZowysm7UnqRlxJj6THyNXLrkkus\nAggo1vE6esfxfCEPRkS1FTGcMa8DC6Y14XPnTEW15Sccjm32a/m5cE9ausbSaxC1+P31cLvbxLMD\nhJO3vMAp2C7aY+MB0HI7opAyyxsAKqQmL9yKa2sIL1NEqbWmlv825PvUW/Sbhvglf/52N6RxkRyO\naU+34YNjTgsQZMqS480avEwaWsUuaPVbQBQvcDwAjKroQEqpFp2+xjbV4rR57SK0YW8eL/bV0n1C\nOnU48FBGQzUTKUr6IaW8QRu3jKqvQ8LSkc07yG0ZAQyy6xS96/05mDLC9x7yTPWkkfS9H9K+y7LP\n0jE4BkawBQ4kt7mha9QL4egY1ZzGly+cXrbodPUcTM1EOvHOQmtvFxF5RxDglrfspo5bOsa1VQ13\nyD5B3UPMW1jm8IZ/2YXg46dNxFXnHYGFU6hs4R7lW4eBoRnCZbuv53gnMHVNvMy5vCTPOOfZ2jUx\npg6lKIhJ8WruNi9VsCq1vNvSkmBMyAtVJtQwy1te1ISRe3wYy1tkrEtWolySyMvRbMcT18jjoYqi\nIBEz8IkzJqO5NonqmL+gbK3zn9EjxtJrr5LIPVGywCgNxUwf689XmEuYuJqwVfNFF0nTAnE1EVMP\nI++URN7cyjv32NE4//gxZfsSV99ziZFENos6ThL/z/UQZOudZCqRe/HkQGy/QiJfRVGwqOMEJGzf\nQ5OWyRsKiutoAqLesLnseAC4es7nME+7UCwARjdX4YLjx4pyPrenFflXj5PGT3uNDwfujeGWt73N\nn6ch0Bh52kxhVEsFXI+gp9dD88CxwXOwRe8lp4zHF844DpW5CQAANUkt85SRRJznHYQtJFzd98rx\nksGqbrg1a2jioWvgklMmYOro2jLvQ09uV6gH5kAhIu8IAsfPpAlJs8aHtxvdXxhTORJAmHVIMb1u\nCh1P24K9Om/M1DFjXB3SZhKWZgaSrvYWPEZ6INzme4IpWd5claqmgvc7py8MTmxFtwhLiqNazHug\nqzqumvFptPfRspVSWhhd6ddUf2NxeQWCjHgIecsItbyHiXnLXp20Qq1dXv8L+Ja37bpoSNDnkBBg\nzsRgDgYAVPMFDILiOVzJri3V4o+x1PKWyPu4GS1oqfWvIWUl8K3510IfkhY4ro6PnEItyUVzR9De\n6LYpuqrFJaW9tnrqrm9IBpvoANSDcvq8Dnz9yC/jxBHHBM4vo3YoqONAXE2QCQBMrh+Lr5TkfJSF\nPzxdlNQBCP09aNLzLSc+AsBXPngs4GqC+NIhxxMoYlw8h+Wy0yYKWWRSjAnLXHZpA0DlllNwztgz\n/HOxPIiKBPME9jYF8hIA6k2Z2O7f95baCmiuf2/5d+SLDlRFRYfH+ruzkreUZHmHClY5BkYIqWMF\nbj99RovJbfQjVy+rfvHnItwDc6AQ1XlHEDh9XgcWTmt+227qfcWF4z+ICdVjMatxeuj2SbXj8b2F\n3wyWK+0FNFXDNXM+H3AN7y0Sehx9hf6DYnkbuubXm7JabRHzZqRTKxGX/DIxpSz6CTVjYbkZAD1l\n7D0i5WeOj24O96x8ceZnsHFwS5m7tBRhZYmyNR6WkAYA7eZErCg8g+aUb/0J8nY8zG2cgXV9GxDL\njMTZx00qO77GCo77psvmYiBbFHM1qmoEsCV8DHweZ42vx8dOnRgcu6WjLl6LOSNH47lupn3u6Zg/\npQknzaZCL6+s7mJSpdQzInsqvrZ4Nrr6cuhVN4nvT5oWvvrJo0TYoCXVhIUtR+GxzU/R87s6IHHC\nCGUatrxYg/icf9APSohGVYKJi6RE2lh87vj3Jox8PdZ61elpQoKFX266bC5Wb+7D5JE1UFdUwovv\nYseXPwfHTm/GP5dyOWBK3o01CXz90jlYuWEX/vPe10DsGHXtl1xDtVGHk9tnYkzlKNz94t+xsZMu\ndqaMqsH0TbUY0ZjCX5esByGK8C6kzBTGj/AXXifPbsPmt6rQyYVYuDgMlzw2LZCiJRZZKTOJmDK8\n5U1cHUdPbcIf/klbpBbfnIPYnL/BNQbE9pgxvOt/uGf9QCCyvCMIKIpywIkboC7Go5pn79aqTUuZ\nrfuCpmQjUua+kT9A3c5pI1VWc/5uwNJV1roRqGSJYaUxb9EUAcGyt9KSLz6DpIS9Dc3AhOqxGF9V\n7sLlGFc9Bie3Hzfsdo6w8IdcBx6WkAYAU5Nz8NkjPo6zR/v19kdNpkQ+sb0auqrjkknn47w5c0Q+\nggzZbQ4A7Y1pTB3lx65HVUqysiXPkio66ZW7qrkVP76+zf/Q1f0sZQCmqQUy5mXhoLilo70xHcgb\nOG56G1rqgs9jXPYGuDpOnOV/XzpBVdZ4YlmYlRhIOvR8adLW+iTOOYY1B5LqpsPCHzWDM+H2NqCt\nOE/McXtjGiczmVut6Lv+wyz3uso4KhL02r2S52DyyBpceMJYv1lKyTV091PCHVXZjqnG8aLne1NN\nAl+4YDqV2iWqyI8wVB2WZiIVN/CJMybhyxdOR3tjGpMa/C6AHz91Mlrrk2KR1TdUCCRHUstbCx0P\nAMAxkIobuP6js3Hy7DYACohtwVN5A52g5X3JxPOFpxAID58cKESWd4QIIfjY5ItQcIvvaAGxr9B1\nFW5nG2yjgCvPOx8AUJP21a+AYDKWXH5klpK3SJ0t/56rZn66/MO9wHnjzsIDax7CxJpxZdsaE37o\nZTjytkwd0+qCNevnHDMacyc2BKoBhgOPaZdm1nPwpD7e31lGW30SmzuHUFc1/MtWvi7iagGXv6Vr\nAZd02Eu7OdmI5mQjtmd2oqmy3LshW84nzhyB8SP8fToa0wAUmEocBZIVVutpR7ULyeOEEYcChS7M\nJCI6YWYrTpzVhhNnt+Hz/1WEmuqDSozQZ/mCo+bi0aWNWPyhCaFzoBerwIVd08N4ssZVjcZLna/5\n1QESqtIWyFZekkUt81HNVP50guT+lhdn3PuSZImH3kAN1FgWtudn4C+QmvzMbZyJf215BgBNPj12\nuh8uaapJYHl3HGqKJr8ljSRivN+BY+L85o/h3rcegJryLWuAVtmMba3EkZMace+W5diapS4U4hgB\nsaKjW47ElNqJeK2b9q1/N2PeEXlHiBACUzOHVak70DCYhKuzdRzqE9R6a29M4ZxjRokOaTweXB+v\nDVjbZoj4DDCsmN07wokjjgnGbSXIRCFaNZYgbKyqqgTaq+4Opmbg5gXfGHZxAAC5l06iL9sPBD//\n6AcmoK0hxayr0rHTfyulxjtfuXB2YB/TUAPkHQ/pLqcoCq6Z83m81rUC0+unlm3XVA03zbsWf1n3\nMOY1B88/aSQlNiPXgEJsA5QEJRdNU8X9VhUVcT2GrJMrkTv1O7DpJIbCiqNRlQ4nlTEtlbjinOHl\nluOZDgwZW2FU9pZpP3AsnnQBptVNxuyQMFgypkvtR6llfvyMFpxz7KhABYvrlmfX88WS21/ni+WE\nYGTFCDQmGtCSKs+h+dAxo5B5eRJeGKKqZykjAVPKjxhd3Q4vlxbkXZr1P7atEvW91YK8DcUs03pI\nmynoqg7HcyLLO0KE9zOSMQOXfmBCwPpUFAVnLRgl/q6NV+PauVehxqoWtdMAy1QPwbvUKyGARR0n\n4G8bH0d7upwggXIvwb6gcpjOdhy3f+kUhDlPYqaO044qb4RSio9Nvggv7HwFExpaA5/HTC0QTx7u\npW1qJuY2zRz2/PWJWnxy6uKyz6tSFlrrkujcUgN97Aa/R3XJjeTiIh111Th/8Ww8+vwmHD2Fkpii\nKHBcD4CCUU27n6fhoCkGiqtnY+KY6mFzH3Z3jRPaq1D7Vgp96BYKZJapBcIbANA7SGPSFSG9Cnij\nkXFVo8u2AfQ6bzjqK6GeBUPXcMmcU/DCE4/T79aswH6WqQXaKB83pTyMJHdPTJrhXRJrrCp05roD\nuQ8HGhF5R4hwCIJn/u8OYaSol3Tt8t9T7z57nz36VBzVNCvUnQr4mfEHEqX913eHay6eiQeeWheY\n+yObZuHIpvIOfg3VCZw4bQyeHqB61/EDYHGdd/wY/PTPWRQ3TII3SC3x0kXYpJrxWLVrNU4ffRLG\n1ldibFu4FT1hxL6Ve/L5U/YxPUpTVZw//Rj8YvkGuF10XsMWmJUpujiZPKqmbBscE1+c/CWMqKsu\n38awu/CWoer4ztHXI+fky/bTVQWktxVesg+jnWNw6YfKEyPlBWKlVU7eAK3+6Mx1v6sJaxF5R4jw\nHkLpy4n/fRAMbyiKMixxA8O7+A8WJnZU42sds/e8I8MHpx8JrO2GSzyMZuWP+xMzxtZhbGsVVm30\nPQSliYeXT/kICm4xkMAYhpHDtOfdEy47bRLufvRNXHRSeV7D28XMhmn45lFfxdeefx1AeF+Bs44e\nicqkhWOOaA58fvnpk/Dy6i6MaWh6R9LE1bEqhFG/rqtQMrUoLF+A6qnhYYFKSU2wKhHufeClm2Hh\nkwOFiLwjRHg/4GCw9x5Qmhl/uCFhxHHxxPMO6Hc01iSwamOv+LvU8k4YiVBteY5vXDoHb27uxbi2\nfVNIbKlL4tpLyj0Pe4vGZD14NnxYuMTQNZEhLmPhEc1YWELo+xO6poqSjPgwuvNT6yahVZuADVuK\n+MAJ4eWtnLwjt3mECBH2Cj/43AK4XnnSz26SzQ869kfM+72OxupgedecCeHW4XAY3VKB0S37ZnUf\nKBxKizZFAbhBP5znPWkkcN2xl9Me4lY4ZY6pot6RltSBW2iUIiLvCBHeA6geJpt4yqgavPRmF2aN\nrwvdfjBxqLnND0U0VvtW9e1XH79XMfxDFeYwCmXvJj555iS8sbGPlX3tObSkKsqwxA0A46vH4gfH\n/ntkeUeIEGH/4NjpLRjVVPG26qbfbZjvASI60JgyqhqTOqpx9NSm9wRxA76lezBx9NRmHD2VWsn7\nK6fz3SRuICLvCBHe01AVBR1N+67xfiDwqbMmY/32gVDVtAhBGLqGr148fKnZ4YTT53Xg2eXbUfUu\nqDjuDfzQ0qEYXBoeB3Qpd/PNN+PDH/4wLrroIrz++uuBbc8++yzOP/98fPjDH8Z///d/H8hhRIgQ\n4RDC/ClN+MjJ4/e8Y4T3FM4/fgx+eOXCQBOZQwEXn0wz6WVltsMBB8zyfv7557Fx40bcc889WLt2\nLa6//nrcc889Yvt3vvMd3HnnnWhsbMTixYvxgQ98AGPHjj1Qw4kQIUKECBHKILvQDyccsCXQkiVL\ncPLJJwMAxowZg/7+fgwNDQEANm/ejMrKSjQ3N0NVVRx33HFYsmTJgRpKhAgRIkSI8J7CAbO8u7u7\nMWWK322lpqYGXV1dSKVS6OrqQk1NTWDb5s2bd3u+6uoE9P0cI6uvP7RigYcronl854jm8J0jmsP9\ng2ge3znejTl81xLWSjV59xa9vdn9NBKK+vo0uroG9+s534+I5vGdI5rDd45oDvcPonl859jfczjc\nQuCAuc0bGhrQ3d0t/u7s7ER9fX3otp07d6KhYe/EByJEiBAhQoT3Kw4YeS9YsACPPvooAGDFihVo\naGhAKkVrTdva2jA0NIQtW7bAcRw8/vjjWLBgwYEaSoQIESJEiPCewgFzm8+aNQtTpkzBRRddBEVR\ncOONN+L+++9HOp3GKaecgptuuglf+cpXAACnn346Ro0atYczRogQIUKECBEAQCHvNBj9LmF/x2Gi\n2M7+QTSP7xzRHL5zRHO4fxDN4zvHYR/zjhAhQoQIESIcGETkHSFChAgRIhxmiMg7QoQIESJEOMwQ\nkXeECBEiRIhwmCEi7wgRIkSIEOEww2GTbR4hQoQIESJEoIgs7wgRIkSIEOEwQ0TeESJEiBAhwmGG\niLwjRIgQIUKEwwwReUeIECFChAiHGSLyjhAhQoQIEQ4zROQdIUKECBEiHGY4YF3FDmXcfPPNeO21\n16AoCq6//nocccQRB3tIhzRWr16NK664Ah//+MexePFibN++Hddccw1c10V9fT3+4z/+A6Zp4sEH\nH8RvfvMbqKqKCy+8EBdccMHBHvohg1tuuQUvvfQSHMfBZz7zGUybNi2aw71ALpfDddddh56eHhQK\nBVxxxRWYOHFiNIf7iHw+jzPPPBNXXHEF5s+fH83jXmDp0qX4whe+gHHjxgEAxo8fj09+8pPv/hyS\n9xmWLl1KPv3pTxNCCFmzZg258MILD/KIDm1kMhmyePFi8o1vfIPcfffdhBBCrrvuOvJ///d/hBBC\nfvCDH5Df/va3JJPJkEWLFpGBgQGSy+XIGWecQXp7ew/m0A8ZLFmyhHzyk58khBCya9cuctxxx0Vz\nuJd46KGHyB133EEIIWTLli1k0aJF0Ry+A/zwhz8k5557LvnTn/4UzeNe4rnnniOf//znA58djDl8\n37nNlyxZgpNPPhkAMGbMGPT392NoaOggj+rQhWma+PnPf46Ghgbx2dKlS3HSSScBAE444QQsWbIE\nr732GqZNm4Z0Oo1YLIZZs2bh5ZdfPljDPqQwd+5c/PjHPwYAVFRUIJfLRXO4lzj99NPxqU99CgCw\nfft2NDY2RnO4j1i7di3WrFmD448/HkD0e94fOBhz+L4j7+7ublRXV4u/a2pq0NXVdRBHdGhD13XE\nYrHAZ7lcDqZpAgBqa2vR1dWF7u5u1NTUiH2iefWhaRoSiQQA4L777sOxxx4bzeE+4qKLLsLVV1+N\n66+/PprDfcT3v/99XHfddeLvaB73HmvWrMFnP/tZXHzxxXjmmWcOyhy+L2PeMkikDvuOMNz8RfNa\njn/84x+477778Mtf/hKLFi0Sn0dz+Pbxhz/8AatWrcJXv/rVwPxEc/j28Oc//xkzZszAiBEjQrdH\n87hnjBw5EldeeSVOO+00bN68GZdeeilc1xXb3605fN+Rd0NDA7q7u8XfnZ2dqK+vP4gjOvyQSCSQ\nz+cRi8Wwc+dONDQ0hM7rjBkzDuIoDy089dRT+NnPfoZf/OIXSKfT0RzuJZYvX47a2lo0Nzdj0qRJ\ncF0XyWQymsO9xBNPPIHNmzfjiSeewI4dO2CaZvQs7iUaGxtx+umnAwDa29tRV1eHZcuWvetz+L5z\nmy9YsACPPvooAGDFihVoaGhAKpU6yKM6vHD00UeLOfzb3/6GY445BtOnT8eyZcswMDCATCaDl19+\nGXPmzDnIIz00MDg4iFtuuQW33347qqqqAERzuLd48cUX8ctf/hIADX1ls9loDvcBP/rRj/CnP/0J\n9957Ly644AJcccUV0TzuJR588EHceeedAICuri709PTg3HPPfdfn8H3ZVezWW2/Fiy++CEVRcOON\nN2LixIkHe0iHLJYvX47vf//72Lp1K3RdR2NjI2699VZcd911KBQKaGlpwXe/+10YhoFHHnkEd955\nJxRFweLFi3H22Wcf7OEfErjnnntw2223YdSoUeKz733ve/jGN74RzeHbRD6fx9e//nVs374d+Xwe\nV155JaZOnYprr702msN9xG233YbW1lYsXLgwmse9wNDQEK6++moMDAzAtm1ceeWVmDRp0rs+h+9L\n8o4QIUKECBEOZ7zv3OYRIkSIECHC4Y6IvCNEiBAhQoTDDBF5R4gQIUKECIcZIvKOECFChAgRDjNE\n5B0hQoQIESIcZnjfibREiHC44ZZbbsGyZctQKBSwcuVKzJw5EwBw3nnn4UMf+tDbOscdd9yB8ePH\nCz3rMHz0ox/Fr3/9a2iatj+GHcDOnTuxbt06zJ8/f7+fO0KE9yOiUrEIEQ4TbNmyBR/5yEfw5JNP\nHuyh7DUefPBBrF27Fl/60pcO9lAiRHhPILK8I0Q4jHHbbbdhy5Yt2LZtG6699lrk83nceuutME0T\n+XweN954I6ZMmYLrrrsOs2fPxvz58/Fv//ZvWLhwIV5//XVkMhncfvvtaGxsxIQJE7BixQr89Kc/\nRV9fH3bs2IGNGzfiqKOOwg033IBCoYBrr70WW7duRVNTEzRNw4IFCwI9ijOZDL7yla9gYGAAjuPg\nhBNOwJlnnokf/ehHIISgqqoKl1xyCb797W9j48aNyGQyOPPMM3H55Zfj/vvvx9///ncoioKdO3di\n9OjRuPnmm2EYxkGc4QgRDk1EMe8IEQ5zbNmyBXfddRemTp2Kvr4+3HTTTbjrrrtw6aWX4vbbby/b\nf+3atTj33HPx29/+FpMmTcLDDz9cts/KlSvxk5/8BPfddx/uv/9+9Pf348EHH4TjOPjjH/+Ib37z\nm3jmmWfKjnv22WfhOA5+97vf4Q9/+AMSiQRaW1txzjnn4Oyzz8Zll12Gu+66Cw0NDbj77rvxxz/+\nEQ899BDeeOMNAMCyZctw66234r777sO2bdsOSy9DhAjvBiLLO0KEwxzTp0+HoigAgLq6Otxyyy0o\nFAoYHBxEZWVl2f7V1dUYN24cAKClpQV9fX1l+8yePRuapkHTNFRXV6O/vx+rVq3CkUceCQCor6/H\n7Nmzy46bNWsWfvKTn+ALX/gCjjvuOFxwwQVQ1aCNsHTpUuzYsQMvvPACAKBYLGLTpk3ieN4+debM\nmVi7dq3okxwhQgQfEXlHiHCYQ3YrX3PNNfjWt76F+fPn4/HHHxfNPGSUJqSFpb2E7eN5XoCIS0kZ\noL2M//KXv+CVV17BP//5T5x33nl44IEHAvuYponPfe5zOPXUUwOf33///fA8b7fjihAhAkXkNo8Q\n4T2E7u5ujBs3Dq7r4pFHHkGxWNxv5x49ejReeeUVAEBPTw9eeun/t3eHOAoDYRTHHyGYJlwAMAjg\nAFROSC0STCWCIJCYBhwOwxEqegIkuqLBbRN0LQaBxkBZsdkaDJutmeb/05PJ517eZCbz9bYmSRLF\ncazhcKggCOQ4jm63m2q1mh6Ph6SfVv97VJ/nuXa7XdH+z+ez7ve7Xq+X0jTVYDAobX6gSmjeQIUs\nFgvNZjO1Wi3N53MFQaAoikrZezqdKo5j+b6vTqcj13XfGnq329V6vVYYhqrX6zLGqN1uy3VdrVYr\nNRoNLZdLZVkm3/f1fD7leV7xVWq/39dms9HlclGv15MxppTZgarhqRiAj1yvV6VpqvF4rDzPNZlM\ntN1ui3fn/3U4HHQ6nbTf70vZD6gymjeAjzSbTR2Px+J/4tFoVFpwA/gbmjcAAJbhwhoAAJYhvAEA\nsAzhDQCAZQhvAAAsQ3gDAGAZwhsAAMt8AxJ5C+54P8QOAAAAAElFTkSuQmCC\n",
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEcCAYAAADUX4MJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzsvXeAVNXd//++ZdrONsqyNBUECxZQRBHUoKLoE+lP0F+i\nxMT4tRDFWBKVxG7UJPaK8mBBE40lQBAVFAQE6bAU6WWBZXvf6bec3x+3zu7M7iw7w+4Onxd/MDO3\nnXtn9rzPp5zP4RhjDARBEATRAnx7N4AgCILoHJBgEARBEAlBgkEQBEEkBAkGQRAEkRAkGARBEERC\nkGAQBEEQCUGCQRDtzLp16zBq1KiE9n399dfxxz/+sc3nIYhjgQSDaHeuvPJKnHvuuaitrY36fMKE\nCTjzzDNRXFwMAHjooYdw5plnYtu2beY+hw8fxplnnmm+nzp1Kj7//HPz/cyZMzF69GgMHToUl19+\nOe677z4AwNixYzF06FAMHToUZ511FgYPHozzzz8fQ4cOxTvvvJPK240Jx3FJ2bc15yGI1iK2dwMI\nAgD69u2LhQsX4sYbbwQA7NmzB+FwOKoD5DgOubm5ePnllzF79uyoz2Mxd+5cLFiwAB988AH69u2L\nqqoqLF26FADw5ZdfmvtNnToVEydOxP/+7/+m4tYIIm0gC4PoEEyYMAFz584138+dOxeTJk1qst+k\nSZOwe/dubNiwocVzbt++HZdeein69u0LAOjWrRumTJkSc9+WCh68/vrruOeee/DHP/4RQ4cOxfjx\n41FYWIh33nkHI0eOxBVXXIEff/zR3L+8vBx33nknhg8fjmuuuQafffaZuS0cDuOhhx7CRRddhLFj\nx0ZZTMax06dPx4gRI3DVVVfhww8/bPFeY7F//35MnToVF154IcaNG2eKJQAsX74c1113HYYOHYpR\no0bhvffeAwDU1NTgjjvuwIUXXojhw4fjpptuOqZrE+kJCQbRIRgyZAj8fj8OHDgAVVXxzTffYPz4\n8U06crfbjTvuuAMvvvhiQuecN28eZs+eje3bt0NV1Ta1cdmyZZg0aRI2bNiAQYMG4Xe/+x0YY/jh\nhx8wbdo0PPLII+a+9913H3r37o2VK1filVdewYsvvog1a9YAAF577TUUFRVhyZIlmD17NubNm2ce\nxxjDHXfcgUGDBmHlypV4//33MWfOHKxatapVbZVlGXfeeScuu+wyrF69Gn/+85/xwAMPoLCwEADw\n5z//GU899RQ2bdqEL7/8EhdffDEA4L333kPPnj2xdu1a/Pjjj7j33nvb9MyI9IIEg+gwTJgwAfPm\nzcOqVatw6qmnokePHjH3u/7661FSUoIffvih2fONHz8ejzzyCFatWoWpU6di5MiRbYpPDBs2DCNH\njgTP87j22mtRU1OD2267DYIg4Oc//zmKi4vh8/lQUlKCzZs344EHHoDD4cCZZ56JKVOmYP78+QCA\nb775BnfeeSeysrKQn5+PqVOnmtfYunUramtrceedd0IQBPTt2xdTpkzBwoULW9XWgoICBAIB3Hbb\nbRBFERdffDGuuOIK0xXndDqxb98++Hw+ZGVlYdCgQQAAURRRUVGBoqIiCIKACy644JifF5F+kGAQ\nHYbx48fjyy+/xNy5czFhwoS4+zmdTkybNg2vvPJKi66ksWPH4t1338WGDRvwxBNP4NVXX231aN2g\nW7du5mu3240uXbqY8RO32w3GGPx+PyoqKpCTkwOPx2Pu37t3b5SXlwPQXE49e/aM2mZQXFyMsrIy\nXHTRRbjoootw4YUX4u2330Z1dXWr2lpeXo5evXpFfWZvw6uvvoply5bhyiuvxNSpU1FQUAAAuPXW\nW3HyySfjlltuwdVXX90uCQBEx4UEg+gw9O7dG3369MGKFSswZsyYZvedPHkyGhoa8O233yZ0bkEQ\ncM011+CMM87A3r17k9HcuPTo0QN1dXUIBALmZyUlJabFlJeXh5KSEnObkQUGAL169ULfvn2xbt06\nrFu3DuvXr8fGjRsxc+bMVrfBfg3jOkYbzjnnHLz55ptYvXo1Ro8ejT/84Q8AgIyMDDz44IP47rvv\nMHPmTLz//vumK40gSDCIDsUzzzyDDz74AG63u9n9BEHAXXfdhVmzZsXdZ+7cuVi+fDn8fj8YY1i+\nfDn279+PwYMHJ7vZUfTs2RPnn38+XnzxRUQiEezatQuff/45xo8fDwD4n//5H7z99tuor69HaWkp\nPvroI/PYwYMHIzMzE7NmzUI4HIaiKNi7d2+TwHhLDBkyBBkZGZg1axZkWcbatWuxbNkyjB07FpIk\nYcGCBfD5fBAEAV6vF4IgANDiNIcPHwagiYcgCOY2gqC0WqLdsafFnnTSSXG3NWbs2LF455130NDQ\nEHP/zMxMzJw5EwcOHICiKOjduzcef/xxDB06NOFrtAb7eV544QU89thjuOyyy5CTk4N77rkHI0aM\nAADcddddeOyxxzB69Gjk5+dj8uTJmDNnDgCA53nMnDkTzz33HEaPHg1JktC/f3/cc889rWqLw+HA\nW2+9hccffxxvv/02evbsib///e/o168fJEnC/Pnz8fTTT0NRFPTv3x/PP/88AKCwsBBPPvkkampq\nkJOTgxtvvBEXXnhhUp4P0fnhUrmAUmlpKf70pz+hsrISgiBgypQp+PWvfx21z7p16zBt2jSzo7j6\n6qsxbdq0VDWJIAiCOEZSamEIgoCHH34YgwYNgt/vx+TJk3HJJZdgwIABUfsNGzas1T5agiAI4viS\n0hhGXl6ema7n9XoxYMAAM0uDIAiC6Fwct6B3UVERdu3aFTPgWFBQgIkTJ+K2227Dvn37jleTCIIg\niFaQ0hiGgd/vx9SpUzFt2jRcddVVTbbxPA+Px4Ply5fjmWeewaJFi1LdJIIgCKKVpNzCkGUZ06dP\nx4QJE5qIBaC5qowJTqNGjYIkSU2qljbmOGgcQRAE0YiUp9XOmDEDAwcOxM033xxze2VlJbp37w5A\nK4sAALm5uc2ek+M4VFQ0NLvPiUJeXhY9Cx16Fhb0LCzoWVjk5WW16fiUCsbGjRuxYMECnH766Zg4\ncSI4jsO9996L4uJicByHG264AYsWLcLHH38MURThdrvx0ksvpbJJBEEQxDFyXGIYqYBGDBo0erKg\nZ2FBz8KCnoVFWy0MKg1CEARBJAQJBkEQBJEQJBgEQRBEQpBgEARBEAlBgkEQBEEkBAkGQRCEDZ/P\nh7lzPz+mY//0pz/A7/clvP+7776DTz75qOUdOwgkGARBEDYaGuoxd+5nMbepqtrssX//+8vwejNT\n0awOQadcQOmR715AqDIHd1w8CdleZ3s3hyCINGLmzNdRXHwUt9xyI4YNG44RIy7Be+/NQrdu3bFv\n3x58+OGnePjhB1BRUY5IJIwpU36JceMmAgCmTBmP2bM/RCAQwAMPTMe5556H7du3IC8vH8899wKc\nzvj91d69u/H8888hHA6jT58+ePjhx5CZmYnPPvsE8+f/B6Iool+//nj88b9i8+aNePXVF/RFuzi8\n8casqDXkU0WnFIzdVfsADvhg0SDcPTm1y20SBNF+fLp0H9bvatuSCILAQVGs+ckXntkD1185MO7+\nd955NwoLD+Ddd/8JANi8eSN27tyBDz/8FD179gQAzJjxGLKyshAOh/H//t+vMWrUlcjOzgZgrbpY\nVHQETzzxLB588M949NGHsWzZUowZc23c6z799OO4774HMWTIeZg9+2289947uPvu+/DPf36Azz9f\nAFEUTXfXJ598hPvvfwjnnDMYoVCoWSFKJp3aJVUfab5IIUEQRDI466yzTbEAgE8//Rd+85tf4fbb\nf4vy8nIUFR3Wt1jC1KtXbwwYoAnTGWecidLS4rjn9/t98Pt9GDLkPADAtddeh4KCzQCAgQNPw+OP\n/xmLF38NntfWVz/33CF49dUX8fnnn6ChoR48f3y68k5pYUQOnQnnKbvAvDXt3RSCIFLI9VcObNYa\nSIRklAZxu93m682bN2LTpg1455334XQ6cffdtyMSiTQ5xj7q53kh5j524lVp+sc/XkFBwSasXLkc\n77//f/joo89w002/wciRl2H16pW4/fbf4uWX38TJJ59yjHeXOJ3SwmD+HACA5CDBIAgiuWRkZCAQ\nCMTd7vf7kJWVBafTiUOHCvHTT9tj7teaMn1ebyays7OxdWsBAGDRoq9w3nlDAQBlZaU4//wLcOed\n0+H3+xAMBnD0aBFOPXUAbrzxZpxxxiAcPlyY+A22gU5pYagBrYCWH1Xt3BKCINKN7OwcnHvuENx8\n8/+H4cNHYsSIS6K2Dx8+EvPmfYHf/OZXOPnkU3DOOefatloxDC0gnTgzZjyO559/FuFwGL1798GM\nGY9BlmU8+eQj8Pv9ABhuuOFGeL2ZmDXrLWzatAGCIKBfv1Nx8cWXtHj+ZNApq9Xe/MQi+PsthuCQ\n8dpVT7X6i0knqBKnBT0LC3oWFvQsLE7IarXvPzoGQiQHTIigLlLf3s0hCII4IeiUgsFxHFyqppSV\nwep2bg1BEMSJQacUDADwIBsAUBGgOAZBEMTxoNMKRqagZUrtrYif20wQBEEkj04rGJeeoeVmbz96\npJ1bQhAEcWLQaQVjxOn9AAaEWOKVIQmCIIhjp9MKhsAL4FQnFK752ZMEQRCtoS3lzQHg008/Rjgc\njrnt7rtvx+7du4753O1NpxUMABCYE4yPQFaaLzlMEASRKM2VN0+Ezz77GOFwKIkt6jh0ypneBg7O\nBUkMoCEgoUuWq72bQxBEGtC4vPm0adPxr399iO+//xaSJONnP7sct9xyG0KhEB599CFUVJRDVVXc\nfPOtqK6uRGVlBe6++w7k5ubilVfeinudb7/9Bh999D4A4OKLL8Gdd94NVVXx3HNPYffunQA4XHfd\neFx//S9jljhvDzq1YLh4D4JcFap9fhIMgkhD/rPvS2wu39amcwg8B0W1Clqc3+NcTB44Nu7+jcub\nr1+/BkVFhzFr1hwwxvDgg/dhy5YC1NZWo3v3PPz97y8DAAIBPzIyvPj3vz/Ga6+9rZc7j01lZSVm\nznwd7733T2RmZuHee3+PlSuXIy8vHxUV5fjgg08AwCxnHqvEeXvQqV1SHkFbMKTCR7O9CYJIDevW\nrcX69etwyy034pZbbsThw4dQVHQYp546EBs2rMPMma9jy5YCZGR49SMY7GXOY7Fr108YOnQYsrNz\nwPM8rr76WhQUbEbv3n1QUlKMl19+HmvXrjbPGavEeXvQqS2MDIcHCAM17ai4BEGkjskDxzZrDSRC\nW2tJMcYwdepvMH78pCbbZs/+CKtXr8Lbb7+Oiy66GL/5za0JnzNWGb+srCy8//7HWLt2Nf7zn0+x\ndOm3ePjhR2OWOD9ea2DY6dQWRpZTU9/qABUWIwgiOTQubz58+MVYuPC/CAaDAIDKygrU1NSgsrIS\nLpcLY8Zci1/+8ibs2bNbP96rV5eNz1lnnYMtWzajvr4OiqLgu+8W4bzzhqKurhaqqmDUqCtw6613\nYu9e7ZyxSpy3B53awsh2e4EGoC7U/JdDEASRKI3Lm0+bNh2FhYW4447fAtAE5ZFHnkJR0RG88cYr\n4HkOoujAAw88DAAYP34iHnhgOrp3z2sS9DYqa3fr1h233/573H337QCAESMuxaWX/gz79u3FM888\nAcZUcByHO+64O26J8/agU5Y3B4CKigYs3rcK8w/PxymRS/Cnaye0d5PaBSrdbEHPwoKehQU9C4sT\nsry5QY+sLgCABpl+DARBEKmmUwtG36x8AECA0VKtBEEQqaZTC0ZXTxdAFRAWKK2WIAgi1XRqweA5\nHqKcBdXZAEVV2rs5BEEQaU2nFgwAcLMccLyKcl9tezeFIAgiren0gpHt1KL+e0vL2rklBEEQ6U2n\nF4zeOV0BAHtKy9u5JQRBEOlNSgWjtLQUv/71r/Hzn/8c48aNw5w5c2Lu9/TTT2PMmDGYMGECdu7c\n2apr9M/LAwAcqa5sc3sJgiCI+KR0prcgCHj44YcxaNAg+P1+TJ48GZdccgkGDBhg7rN8+XIcPnwY\nixcvxpYtW/DYY4/h008/TfgaeZm5AID6CNWTIgiCSCUptTDy8vIwaNAgAIDX68WAAQNQXh7tOlqy\nZAkmTpwIABgyZAgaGhpQWZm4tZDt1KbIB1U/1M45aZ0gCKJTcNxiGEVFRdi1axcGDx4c9Xl5eTl6\n9uxpvs/Pz0dZWeIBbCPozYQwGgJSchpLEARBNOG4FB/0+/2YPn06ZsyYAa/XG7UtVikro0BXcxg1\nUbqqGdoxjjCYwLe5Vkpn5ES853jQs7CgZ2FBzyI5pFwwZFnG9OnTMWHCBFx11VVNtufn56O0tNR8\nX1paih49erR4XnsxMSc8CDnC2H+oBrnuTl2At9VQYTULehYW9Cws6FlYdPjigzNmzMDAgQNx8803\nx9w+evRozJs3DwBQUFCA7OxsdO/evVXX8IpecI4I9hyhyXsEQRCpIqXD8Y0bN2LBggU4/fTTMXHi\nRHAch3vvvRfFxcXgOA433HADRo0aheXLl+Pqq6+Gx+PBs88+2+rr5GXmokauxNKNhzDh0n7IcDtS\ncDcEQRAnNikVjAsuuCCheRWPPvpom66T49LMLFUIo6o+TIJBEASRAjr9TG/AypSCI4x6f6R9G0MQ\nBJGmpIVgZOlzMThHBHX+cDu3hiAIIj1JC8EwLAzOEUIdWRgEQRApIS0EIy+jGwCA9/hR5yPBIAiC\nSAVpIRi9vb3AgQOfW47qAOVbEwRBpIK0EAy36AIDA+8O4oC4or2bQxAEkZakhWAAwPCeFwAAgq7i\ndm4JQRBEepI2gvGL08YBAHjJ28KeBEEQxLGQNoKR4cgAwplgnNzeTSEIgkhL0kYwAIBXBTCeBIMg\nCCIVpJVgcMwB8ApUprZ3UwiCINKOtBIMXi+NFVFoISWCIIhkk16CwbSig2GFJu8RBEEkm7QSDEG3\nMMIK1ZMiCIJINmkmGJqFEZRC7dwSgiCI9COtBEPkNMHwR0gwCIIgkk2aCYYTABCUyCVFEASRbNJM\nMDQLI0AuKYIgiKSTVoLh5HXBkMnCIAiCSDZpJRgO3SUVIguDIAgi6aSVYLh4FwAgKJNgEARBJJv0\nEgzBAwAISMF2bglBEET6kVaC4RbcAICgTIJBEASRbNJLMETNwggq5JIiCIJINmklGB7dwggpZGEQ\nBEEkm7QSDJdDBFMEqiVFEASRAtJKMESBA5MdCKvkkiIIgkg2aSUYDpEHFAcijASDIAgi2aSVYIgC\nDyaLkFiEVt0jCIJIMmklGE6HACh6iXOavEcQBJFU0kowsjwOMFkvcS4F2rk1BEEQ6UV6CUaGA0zS\n6kn5JH87t4YgCCK9SDPBcAKyLhgRXzu3hiAIIr1IK8FwOwVwClkYBEEQqSCtBIPjOHjEDACAL0KC\nQRAEkUxSKhgzZszAyJEjMW7cuJjb161bh2HDhmHSpEmYNGkS3nzzzTZf0yt4AQANErmkCIIgkomY\nypNPnjwZU6dOxZ/+9Ke4+wwbNgwzZ85M2jWznF7UAqgPk2AQBEEkk5RaGMOGDUN2dnYqL9GEHHcW\nAKAu1HBcr0sQBJHutHsMo6CgABMnTsRtt92Gffv2tfl8mW43mCKgLkKCQRAEkUxS6pJqibPPPhvf\nf/89PB4Pli9fjt///vdYtGhRm87pdTvA6jNQI1aDMQaO45LUWoIgiBObdhUMr9drvh41ahSeeOIJ\n1NbWIjc3t8Vj8/KyYn/eNROsPANSRgMcWQxdPMfXJdYexHsWJyL0LCzoWVjQs0gOKRcMxljcbZWV\nlejevTsAYOvWrQCQkFgAQEVFbJcTUxSooQwIAHYeKcRpXU5tXYM7GXl5WXGfxYkGPQsLehYW9Cws\n2iqcKRWM+++/H2vXrkVtbS0uv/xy3H333ZAkCRzH4YYbbsCiRYvw8ccfQxRFuN1uvPTSS22+ZoZL\nBAtplktFsDLtBYMgCOJ4kVLBeOGFF5rdfuONN+LGG29M6jW9btGsJ0UFCAmCIJJHu2dJJZsMtwNg\n2m3JqtLOrSEIgkgf0k4wPG7RFAyFyQkdwxhrNtZCEARBpKFgeN0imNo6C+Optc/jlc1vp7JZBEEQ\nnZ52TatNBR6nCDBt7oWcoIVRFqhAWaAilc0iCILo9KSdhcHzHDwOLeitUAyDIAgiaaSdYABAhtsF\nAJDVxCwMgiAIomXSUjCy3JqFISVgYahMTXVzCIIg0oI0FQw3ACAsSy3uS24rgiCIxEhLwcj2aC6p\nkNSyYMiMBIMgCCIR0lIwMj2ahREhC4MgCCJpJCQYX331FXw+bQW7V155Bb/73e+wffv2lDasLeRk\naBZGRG456J1o6i1BEMSJTkKC8dZbbyEzMxNbt27FypUrMXHiRDz99NOpbtsxk5OhWxhK6ywMCoAT\nBEHEJyHBEEVtft+qVaswZcoUjBs3DuFwOKUNawtZXieYykFKIK3WHsMg9xRBEER8EhIMjuPw3//+\nFwsXLsSIESMAAFICAeX2IlMvQJhIaRC7SCgUACcIgohLQoLxl7/8Bd988w2mTJmCk046CYWFhRg+\nfHiq23bMeN0ioPIJCYA9hqGQS4ogCCIuCdWSGjp0KN58803zfb9+/fDII4+krFFtxevRLIxEBIMs\nDIIgiMRIyMJ47rnn0NDQAFmW8atf/QrnnXce5s+fn+q2HTNupwAwHmoiFoZKMQyCIIhESEgwfvzx\nR2RlZWHlypXIz8/HokWL8O6776a6bccMx3HgwENFAhYGIwuDIAgiEVo1cW/9+vW4+uqrkZ+fD47j\nUtWmpMBDAEPLMQl7gUKyMAiCIOKTkGB069YNf/nLX/DVV1/hkksugSzLUJSO3bkKnADGqVBbWEkv\n2sKgoDdBEEQ8EhKMF154AQMHDsRLL72EnJwclJaW4re//W2q29YmBE4AOBWhcPPCJlPQmyAIIiES\nEoyuXbvipptugtfrxb59+9CzZ09Mnjw51W1rEyIvgOMZGgLNTzC0i0SiS7oSBEGciCSUVrtt2zZM\nnz4dTqcTjDHIsozXXnsNZ599dqrbd8w4BAcAoMYfQn5Xb9z9yMIgCIJIjIQE469//SueeeYZc5b3\nmjVr8NRTT+GTTz5JaePaglMQAQWo9Yea3U+xTdxLJA2XIAjiRCUhl1QwGDTFAgAuvvhiBIPBlDUq\nGTj1+ld1gebbGT0Pg4LeBEEQ8UhIMDweD9asWWO+X7duHTweT8oalQzcorZMa12gJQvDFsOgUucE\nQRBxScglNWPGDNxzzz1wOvW1siUJr776akob1lY8DkMwAs3uFx3DIAuDIAgiHgkJxuDBg7F48WIc\nPHgQjDH0798fY8aMwbJly1LcvGOniycbqAXqwr5m91No4h5BEERCJCQYAOBwOHD66aeb71kLE+La\nm64ZWQCA+hYEQ6bSIARBEAlxzGt6d/TSINkuTTBCavMuKYWKDxIEQSREsxbGvn374m6TE1gvuz3J\ncmYCACJoIYZBFgZBEERCNCsYt912W9xtLpcr6Y1JJlkOTTBkNJ8lJavWyoEU9CYIgohPs4KxdOnS\n49WOpJPp1GZ3K0LzpUHCSsR8TRYGQRBEfI45htHRyXRogsGEMBRVhSSrqGloKh4h2fqMYhgEQRDx\nSVvBEHkRHBPBCRJCEQUvf7YF97+xCrW+aNE4UFpjviYLgyAIIj5pKxgAIDAR4BWEIwp2HtKEoSEg\nRe3jC1ulQ+yLKREEQRDRpFQwZsyYgZEjR2LcuHFx93n66acxZswYTJgwATt37kzq9QXOAfAqghHL\nchD4RunAvLUtKDcfICcIgjiRSalgTJ48GbNnz467ffny5Th8+DAWL16MJ598Eo899lhSry9wIjhB\nwaqtJRC6lsB5xjqE5UjUPpxoWRUkGARBEPFJqWAMGzYM2dnZcbcvWbIEEydOBAAMGTIEDQ0NqKys\nTNr1Rc4B8Aq+WXcYzoFbIORUY1/DXgDAtxuO4I7nlwG8AjXsBgAE5I5dgZcgCKI9adcYRnl5OXr2\n7Gm+z8/PR1lZWdLO7+Qd4HgVgFXGpCpUDQD4+Lu9iMgqwMuA5AIYcNRXgtpwXdKuTxAEkU60q2DE\nqkeVzJIjDl5bdc8epygLldgupoLjGZgigmMOVIdq8OdVf03a9QmCINKJhIsPpoL8/HyUlpaa70tL\nS9GjR4+Ejs3Ly2pxnwyXG5AAzmW5mioj5fDk8HCduR5SST/tQ0UAVN6Uz0TO3ZHobO1NJfQsLOhZ\nWNCzSA4pF4zmqtqOHj0a//znP/Hzn/8cBQUFyM7ORvfu3RM6b0VFQ4v78EzQ/s+w9q0JV2PJzjXg\ns6vgyq7S2qiKYLyVbpvIuTsKeXlZnaq9qYSehQU9Cwt6FhZtFc6UCsb999+PtWvXora2Fpdffjnu\nvvtuSJIEjuNwww03YNSoUVi+fDmuvvpqeDwePPvss0m9vkvQFlHiXFYBQgaGf+76PHpHRQD42HWk\n/FIAHtENnkvrKSsEQRAtklLBeOGFF1rc59FHH03Z9Z2GYDi1dFk1kBVlbRgwVYh5fGWwGk+u+Qem\nnD4el/UZEXMfgiCIE4W0Hja7BC3obcQwlOr82DvGEYxSfxkUpqDEX56S9hEEQXQm0lswRK0EO+fU\nBaO2B/p5Tmu6oyqALxze5OMGyQ8ACMvNV7wlCII4EUhrwXCLRgxDn8EtO9DV0TQLi6k8uIYe6J99\nSlSswhfRlncNKyQYbUVVGRSV1hshiM5MmguGbmHoAW0mO83MqShUHqrKIPA8VKaamV0+3cIIkWC0\nmfvfWIXpr6xs72YQBNEG2nUeRqrxOJzWG1XQXE9cDMFgAhTGIOjbVKZC4ARTMMjCaDt1/kjLOxEE\n0aFJawvD47CWkeUUTTyatTB0wTDWxfBFdAsjRgxj/9E6vDF3G8IRWkODIIgTg7QWjAyn23zNq5p4\n8FxTo4rZXFKATTBMC6Pp6PivH27Ext0VWLmtpMk2giCIdCStBcMreszXAtMD4LEsDMZDsVkYsmpY\nGC0HvRWFArkEQZwYpLVguEXLwjAEg48VtlG1x8A3dklJ2gzx5oLe8QufEARBpBdpLRgem2A4OD2e\nwZresjHDQ/F7AAAgAElEQVTTW9HDEXuO1EBSZYQULR1XVmUoKsUqCII4sTlhBEPUBSNm0FsXkcMl\nWsziixV74dfjFwaUKUUQxIlOWguGyFvuJ5cuGDFjGLpLqs6nVazNyXKgIRItGDQXgyCIE520Fgw7\nTkGzNjjEF4yIpEUkvB6hiYURK7WWIAjiROKEEQw3r1sYMQoNMt3qYLpwhCXZzJDyihkArBTbpscm\nvakEQRAdkhNGMFy6haEqMZaA1YUCTNsWliWz8ODJ2X0BANWhmqhDhG7FcA9bBL+anmuAl9UEUOtL\nvlXV3IJaBEF0bE4YwRD1dNqv1xxtutEUDO3/kCybFsXJWbEFw3HqVnA8Q6G0LUUtbl8efnsN7nt9\nVdLPS3pBEJ2XE0YwOF63LNQYt8yiLYyILCMoaym1fTJ7AQCqQ7XRh4S1SYFBRks/tgaVFIMgOi1p\nLxhy2ckAgGxeXys8xjwMQBcTm2AoqgwA6JGhHdfYwjAEI8Dqk9zi9IZcUgTReUl7wegdvgjBddeg\nb9cu+icxYhgGuphEFBmyPtvbI7qR5chsIhjGefxqHVRG5UGawy4SKukFQXRa0rq8OQA89KsLUFUf\nQqhRVVnGAK6RdjDdwpAUGbKqvRZ5EV3dXXDUVwyVqeYCS5ygWSAyIjjqK8VJWb1TfCfHj2RbAXY3\nlEqKQRCdlrS3MFxOAb27e8Hb7jS48UqENl7VdGfdwuA4hoisTeITOAFO5oXMFDToqbbaBtl8uat6\nT0ra3l4kO85gPx25pAii85L2gmHA280JxQmoMYwr3cIApyKiaIIg8gJ27NUC4Ha3FCfIYLIIDjzW\nlW5Kq44w2Sup2q2KjmxgbNxdgcNllMRAEPE4cQSDbyZ2YWAExDkGSRcMgRPBIlqA2xAMSVYAQQYL\nZyCf749ifymK/aUpaXd7kGy3kV1LO2qWlKyoeGPuNjz+3vr2bgpBdFjSPoZhIMQQjOCmK6PTbG0W\nRlh3SfEcb2ZEVYdqMe+HA/jvqoPwXKRAVQR4oQXTG5cS6cwku1O3n491UBND6aDtIoiOxIljYTSO\ncAOAHO2aMkqDgGOIyDI4cJAkBiZpa2k0RHz476pCK36hiDAeoWwrf76zeg8KKran4jaOC8mPYXR8\nl1Q6uRQJIlWcOIIRw8JwORvVlTItDIaIIkHkRS27Sq8/JTNNKIwMKaaI4Fj0sq4A8HrB/2HWtjnJ\nvoXjht0llYyOVO0EQe9kx20IIh05cQQjhoXhaSIYepYUr0JSZIi8gFBENoVE1ifzRVkYrKmFYWDM\nFu9s2AUjGa6aqLTajioYHbRdBNGROHEEI4aF4XE1CuHYLAxZVSBwAgJh2YxzGIFwu4VxqFSLXZhi\nYqM23DkLE9o1QlHa3pGyTpAlRfNDCKJlTmjBcDtjC4bgrYfMFPgDCv46ZyMYixYMiFpAHLIT/oBm\nWRgzw+2zvjutYERZGG331US5pFrRMW/YVY7f/W0pSqpSn1BAFgZBtMyJIxgxYt4ZrmiXlOrPQaYj\nE3yXMiiiD4oSXbAwohoWhiYYTHaYLimj9lRYiZjn++7QcqhMBWMMSzcVobQ6kNR7ShX2zlNOhoVx\njC6pd7/aCcaAZZuL29yGlkimhaGoKpZtPor6QKTlnQmiE3HCCEastFp3Y5eUKmJkrwut92YV29gW\nBlMcZmaVYWGEbHGLXTV7sbN6Dw6U1OOjxXvwl1lrk3ErKSdeDONQaQO+WL6/1aNxtY1ZUrES3Foi\nLDWNKTVHMgXjhy0lmLNoN96c23kz5QgiFieMYHAxg95Np6FkOb3WG0MwDFFQNaHgTJeUPehtWBjR\niw4V+0oRCGnbVCh4o2A2NpVvPfYbOQ7YO3hFsVxSywuOYuHqQyipap2ldKwuqWPtwn/YWow7X1iO\nzXsrEj4mmS6pitogAOBAMVUyJtKLE0YwYloYjbOkAHgcGdYb1TiGB2OApGdCGYLBZKdtlT5NFEK6\nYAicdu4Sf5lZH5f31mFH9W7M3v5RW28npcSzMCKyJh6hSNMAf3Mcq0vqWBXj2/VFAICVW0sSPiap\nMW/zp0ZxESK9OGEEI1bQWxCafpYheszXzL52hsrb0mptMQzd+vAFNVdUSNYEY8wpV0DkRZT4y6zl\nNljT69U0hBFppfsk1dhFQra/1q2NcOTY3T0dNbaczJnenP6F2+91b1EtZi/cYT7D9mDVthJ8vmx/\nu12f6Pyc0ILBxVgbwy4YUYstMR6SEfQ2XFK2eRiBiPaZYWFkODzIEbvgaH05Xvz3Fu043upoGWOo\nrg/h/jdW4c15HcvXbe/o7C4pST42wYhXSyoQkrBxd0XcyXxMH6G3NobBmQLdijYmUzBiXP/ZjzZh\n1bZSbNlXlbTrtJbZC3fiqzWH2u36RGIEQhLmfLML5bprsyNx4ghGgr1OhsMuGBxGnN0TYy48CVB5\nax6GKIGpHKAKtpRbXTD0oLdbcKG8SoLMJJiuCcHqaINyEAdLtMqoW/e3XycSi6gYRpSFob1ubUA5\nnkvqrXnb8cbcbVi/qxwAsPtwDZ58fz3qfNFxoFjC3hzHECNPWgxj/9E6LFwdv1OWlPa3JimFuGPz\n5epDWFZQjDf/s629m9KElAvGihUrcO211+Kaa67BO++802T73LlzMWLECEyaNAmTJk3C559/nuom\nIcMl4vSTcmP2LN6oGAaPrAwHHCIPxmwuKVECZAcAzoxzGGJiWBhu0Q2oAjieAZw+UrZZGIX1R9Bg\nS7sMyiGsOroWSowZ45v2lmHBmn1tueVWERXDUJq6pEKtzUCKCnpbr38q1Kr/Fldq8yx2FNagsLTB\nFNLGIYBV20pQVpOa1ORkdaJ//XCj+Zp10BgGTVLs2BhJMnX+jpeWndJqtaqq4qmnnsL777+PHj16\n4Be/+AVGjx6NAQMGRO133XXX4S9/+UsqmxLFq3+4DDzH4bPvm3bCjV1STgevWSeqzSXFK2BG0ULT\nwtA60bAew3AJLr04IbRSIrIzatGlN7bMxmnCRQC6AgDm7luIVcVrURWqwfgB10a16Z0d70LIqsGo\nwFPIznA1afPhsgYcrfBjxDk9W/8wYqDEmbgnHWMMI56FwXGa28b4LCJr5/WHpOgTcEBRhQ+zF+4E\nALz70JXNX9B0CSXeMaakWm2sU3aAvlpRGMSm+R5EB8HwnneAn0oTUmphbN26Faeccgr69OkDh8OB\n6667DkuWLGmy3/EuSGe6p2JYGA7eYb5mjIfLIUAUeIDxCMu64vNWQUJTMHQx8ellzr0OD5i+j2lZ\n6P93dWsl0asjWtqnxyWgMqi5pbZX7WzSJiFLG4nvKyuLeT+Pv7ces77c0bSjPUZYnIl78jHGMKLK\nm9teG9+DoUkRSXvhD+pJBbZzGKOu+NdQ8fyqt7GudJMVdNa3lQcq8fKmmagKNl6X3SIVy7J3UL1I\nyuz9YyEQknCwhFKNW4QzkiY6wq8lmpQKRllZGXr16mW+z8/PR3l5eZP9Fi9ejAkTJuCee+5BaWnq\nFiK6dvjJ+OVVp8XdfuGZPaLna6gcnKIuGCoPcCq6ZLnA8aqZHWVO3NNdSUY5kFxXDqDooqILhSEc\nkwZeBwAI6gKU5XGaIlIeqIzbvgOVzT8bo8NtK/FKg5hZUq2OYcQ+t5GIYFoY+nl9IQm+oCV+HGIn\nLdgpD1RgXVEBPtjxSZNtH+38DHtrD+CLvf+Ne/xx8+u34jLBsIxguPUpzC11NO219seTH2zAUx9s\nMOepELExfukdUC9S65JKRCGvvPJKjB07Fg6HA5988gkefPBBfPDBBy0el5eX1er2/P7686Pee3X3\njihw+Ntdl6Ffr2w4HQLcQgZCiuYr79Y1AxFJBSvlwfEq8rt5cIhXwQwxMFJleRV5eVnwq35wHIeT\n8/NNC8N0RelB7755eQAAf1hzXzkcAjiH9qwkVcL+0F5c1Oc88Hy0nleEapq97+wcD/K6eWNuKyyp\nxyNv/4iHfn0hzj61W7PPqaja+oP2ZrrNaxrfJifwLT7/3YeqwXEcTj+5C6oCVuefneMxjxV4DhIA\nt9uhfSZo9/vlj4fw5Y9W4Dgjw4luXa37inXtBsGyHhwO7TxOp4i8vCwonL4YliP+76akzpqh35rf\nViAkweMSY04MjXWuzCztee46VI1Fqw9h2i+GwCHGHreNu38+AGDBCxMSaosvEMEvH/kaYy/tj9sn\nDY66tj2dN7eLF12z3QmdM5mU12i/KyYIx/T32xaO9/XagsejeTk4jutw7U6pYPTs2RPFxVYdoLKy\nMvTo0SNqn5ycHPP19ddfj+effz6hc1dUtH3t5WBQG+EzBnTxiKir1UQii89BSAmAc4YQDkmaC0bl\nwXGAIOqdfyOXVEiSUFHRgEpfNbIdWSgr85kxDI5XwABwvHZsxMfAgYfC6YHysIQan2Wqv/jjLNx1\n3q0Y1PV0RGy1qUrrK5rct/0HVVJWDyGOu+H9/25HbUMYr/17M566dXizz6XaFliuqQmY1wyFdSuq\nPhTVjj1HarFs81H89ueDzM7vgVd/AKDFG2qqY5/P6GN9/jAqKhrQ0Cg7yiAQjKC21jrHrP9swbkD\numFAb+u3c7TassxM11lYRkVFAyKS9pyliBL3d2O/55Z+W2t+KkVOpgsOkcczH27EL0efhqsvPCnm\nvo3PtftgFQ4eqcG8lQcBAKf3ycawM7W/CZWxmNl8if7W9xdr1u2XKw/i9kmDo46zWyrl5Q1Qwslx\nXx4LtbUBVFQ4j9v18vKyktJfHC9CumtZVdWkt7utApRSl9S5556Lw4cP4+jRo4hEIli4cCFGjx4d\ntU9FhVW+YcmSJRg4cGAqmxTFWf20gPMVQ/tEfZ7j1DoizhWEUxS0CX66MDhdWmdkWg+qVXyQMYa6\ncD1yXTnarGjTwlCi/j941A+m8KaLKhRRcKAsOrW2Pqz9UHy2pV+DsERlR2E1bnluKXYVVpuf2V1S\nQTmIz/bMR0PEpzWTGXMaWk46jSo+aA96c34ArIlL6rl/bsKaHWUo2NfUnSYratxaUlYMI9ol1aQ9\nKovK1vrvqkL8dc7GqH0CsmUVMS76PEYFYaWZQEVr5mG8s2AH/vHxZmzao/12P16yF8s2H03o2K/W\nHDLFArC+j6/XHsKtf/selbq7Jl7cpzliVTMwsD/b9ophdAa2H6jCii2pL3bZHJwZw2jXZsQkpRaG\nIAh45JFHcMstt4Axhl/84hcYMGAAXn31VZx77rm44oor8OGHH2Lp0qUQRRE5OTl49tlnU9mkKAad\n0gUv/P4S5GZGj3Z6ZuZhn38XOF6By8EjIvOmMBRlaCPnxhaGwhT4JD9kpsDDe3GkrMF0WxmWhRHL\neG/hPrjPFgBBQddsF6rrw3ApQXDMiQGu83CArTNX9/NL1sjX7zqChogPWc5MfLxkLwDgw+WrAUcI\nkNxRncJXB7/DsqJVOOorwR+G3mF2ynwCQwSmMkCMACpvdtQH6gohnfEtxOL+CEeiXVqc2w+hSxlU\n9awm56ppCMfNkrJiGNp7o/RIYxSFtdjJ2Z+TKuiuD92JVhfSRLesPn7QO9EYhr0dWR4rQWLOot24\n/Pw+sQ5pFpdT+0I++16bgb1lfxVGX9A3KsgvySqcjpbTmuwDhsaps2Hbs23v9cuPpZjk8eLFT7VJ\ntj8b0rvd2mDGMIz/GcPhMh/65Hm1eGo7klLBAICf/exn+NnPfhb12fTp083X9913H+67775UNyMu\nXbKapqlOPP1qLN+5B1JJfzjO4iEKvDlBz8fpFpEhGODAmCYYxmh++x4/Nh/aAqGbkVarB70NS0MR\nwFTNwuia5UZ1fRicIIHJTuzcE4HrNGteh3FOJjnAOSSsKdmAq0+5XAsK8zL2ur+B53wguO7aqJG/\nsWTsoYYi/b0uGLYMjHjWhqKq8AxdCjXshqKeAwDYXa11aI7eBxEqvihqf9fZP4ITFBRHDgKITu39\nfN9cfW6J1pnaR/JxLQxBgpBbAaWqFwAOispa7OQCNsFo6FIAFA0BGKCoCoKKH+CA2kj8DB1FjR7R\nx3s2stx0XkpbaDwp0RBXvy3oH4ooCQmGZBOFmobo1R6jLYz2FYyOOHJujKqyFhMtUkajWmTrd5Vj\n5vyfcOXQPrhpzBnt0yadE2amd2vwONyI7DsfzJ8LABB5DpwjehKNkR2lveGhMsWsVMv02AVrlCUF\n3hb/UAWAV3WfPwNEOao2VUTVrlcf0gRDLusHMA4bywoAaB1K4zbZR+heUZuAaMRAmGlhcFh86HvM\nWPV0lLvLjk/Wrsm7QmZpEJG3OqzGabWGEMpoOtFob8Mu7Pb9BHDaeaJcUk2ypLR9nP23wTlgK4S8\nIwA0AWtRMGwuqbCnBJxX8+f7pID5B6jy4ZgrIwLRa3o3Z21INpEoScL6Jo1Fx7i0L2QXjMQypYx5\nLABQ2qiisN36SMYqiolSWh1oMgGtvQUrEaQ41u7xoHEtst1HagEA63Y2zTA93pBgxOGZ2y7GpMv6\n47STciEIPDhno1RA1TbiU3koULG7qCp6m/4/7w4AnArOFdRFhNMFQ4HLof3PcQxQHOYxRqmRyqDW\n8alBL5i/C474iiGpsjY/QrT9IYqRqFFkWLW2qUw1O2qe4zB//9eojzRgdfH6mPdeI1nxFKP4oGHp\nAEBIit2BSUwTTCvXniHMQlChgvPolpIhDkoEkZNWg8+pMMXM6PB4fd4J79XOoygsZicXDMtm525Y\nGMPyz9MeR14RGLRYjp36SHQQsSZUi/21hdFus2Y6NPszLqlsWTBueW4pNu6O/4feeIEqNY6FkQj2\nTs5+vLbNOsfxSiFmjGHGO2tw72sroz5XErTMUj0PISwp2FFYbT4P+/Xs4nu8MWuR6e+NZnUEVx4J\nRhx6ds3AuEv6g+c4OAQOnKsZwWA8GBSs31MStc2wNMT8w3CctFsbsVf1wvCztJRbjlchijAtBfsK\nfuV1flTUBrFgk7Z2BgtmQY1o5zOsBrMIIgBnv59QHCwy39sXclpyeAVUPSbCc0A3txbs31a5I+a9\n19kEw+ioa8OWOyckWokKhrABgE9uwM5DNXjqgw3aB7b28Rna8cYf5+6afVAzy+E6Y6M54jRGwcw+\nQx7QXVJNO5nfv7QCr3+h1dvx68Iw/lRtljznCAGMNRGMunC0W+qx1X/Di5vehF+2xUD0Symq0qRU\ni6SoAKdC7HkQxbVWTCTbGz/rZ86i3XG3Nb4v08I4BsGwWxGSrKLWlnUWbgcLI968oHirODYEIman\nXV4bxO/+9j1WbYsuUb+vqM7MBmsrc77Zjec/KcCan0qbtPd4WBjltUE88OYq7C2qjfrcKl4ZLWSJ\nJKykGhKMBBAEHkp1tG+eRQkGh7As4XBFbfQ22z5Cdy2LRqnqhT7dvRh0UncAgCgycE6tc2cRt+nG\nWrPzKB6cuRp8Rj2YIoCFMsxyJIYY2F1SQtcyfF8zD/uOan9MQZtgzNv/FeoytE6L5zm4BK1zO1h/\nOGo/QAtu7w/9ZL43Rlp1YWtkLuVr1XUDIQlPrPmH+Xm1VInVRwrM95zNAuIztOMNwbB3xEfFjTjS\nUGyN6owZ8oarq5kYhpGZZVgYOa5s/YLasQH9/pik3XNdIwvDiPUEFMuCUvXJbw+seBSvFcyK2l+S\nVYh99sJx8m6gj/accjKdZipvLLo1M+ehqUvKsDAsKy5Rl5TdXbZ43SHc9/oqFJZqAtkeWVLxKg/E\niv3sOVKLe15dif+sOAAAWLNd68SNcjAGz3y0sUmGXCzqfGHUNMRO0zbYul/77Rws1n4TgXB0okGq\nWbDqIKrrw5g5/6eoz824lv6TtwSj9dc4WuHD58v2J+07J8FIgByvE9LBcxAptGUBKZYYaAFs1YpV\nGBZGyAsW0YLqnD5/g6kC3E4BXpfWiQgOm2CE3QAz4h4qwCngPH6ogSyYbiwADSFdMMRGMQNexfwf\ntD84Y10OA5nXOlSO40x/v8pU7KiyRr+MMby48S1Uy5YLxReQsLNqD/bVaUFvpa4buIw6bDlUhLte\n/gE1YWt0dCS0HxsjX4PTrQm7BcRn1gK8bJbgsLuGKt3b8fGuL6yRp25lGfenKGqLo+KAHESGwwOR\n10rOc4IClVkuKTWoTfybtW0OygNNV+Lzyz4Yf6EqYwjIQURUCXtrD5hpuYDWkQi52vGcU3vGWR5H\nVPpxY5oXjNguqZYsjFjuGskmCkfKtOdbU6+10e5ikY9TDMEXjC0YscR/2wHNqv16zeGkXPve11fh\n/jdWNbuPEUMzEkKOt2Bk6ll2TZ6TEXMzBUP7P9GK23aemrMBX605hI27E199sjlIMBIgv2sGHvv1\nCPx53HXWh3pw+qW7LwWLeMA5w+Yo+srzTtb2YTxCWy+LPpkqwO0U4eK1Ea8gKmZ8hEU85nnBK4Ao\ngeMYWETrcAzrwxcOAWBaJwwgvGeotl1yIcOt/QgbWw6GEPE8h5AcAqf/+2zPfEiKBEVVsXD9XjMV\nlelVeOsCIawqXqt/xkOp1WapL9tfgHhwDv3adgsjsw6us9aYHWLjWIKTt7l09OM4dwDgFCgqMztk\nvksphPzCJtf0SwFkOrVAP8e0uJCiqKZgsGCmue+yoqYdyfd1/4F76BKAU6GqDA229hX7rJIskqxq\n7QIA2QEOgNspQpaZfm/MDPAbZGVo34nj1K1wnBztBmw82pbMVQ2tDn7TngoU7LXmuDw6ex2mv/JD\nk3uwJz3U+bRnaIjDdxssd+Xxqlbrj1P/K5aFITRKgEgWzcVBGmfpBW3ttT/LzXsrzPkxycSr/602\nFifr+7EGMEDiFgZjDDsLqxGKyKabraVabIlCgpEgp/TMQj+9pAegFSn87f+ciRyvE0q5NstX6KGN\njjJdthGlKkIN299rFoZT0Kf/Cyo4l80lZRYsVK2ihYZPX9+29WAZ+NwKCF21YoQs5IUayAQnSvC6\ntX2DSgheMQMPXHCXdgpO74Q5hpAShlzfBVJlTzRIPtSG6/HV6kOYu0brzOSyk6HWafdaFwiA5/TM\nrf2DwfzapMafiovMjlGp64bIviHmLRqussZZXHyGDypjqAnV4utCrQilGtKqAwucNafBsEw4UYLn\nwm8R5GrNUanrtAI4T9llrnpoEJACyNTXY+f0DDRZZQhKIf06lmB0cWnZb/b4i3Y9GRAjUBlQbwvy\n76zeY7tOSLMmAUCMgEErLaMyBklS4Th5FzwXLgbfpRTiSbugTXRUAV6G2L0YYs/oEbSsMESUCBwD\nCsB5a82OKmxLLFi3sxyvfmGtA19U4YM/pE0U3by3An94bSWq6kJRnZzhglIUFTUNYRSWWgJ4vGIY\ngTguqVgWRpMU1iS565uL/5hZemosC0M7rrIuiNe+2IYH3159zG2org/h9f9sQ3mj0vweV+xZDcbz\nYY0sjETXhdm6vwr/+KQgytWVLCEmwThGpk0cjMv0yT0PTbgaALS1LwBkuT3RO0vWXA+m8nA7BTh0\nwdiozrO5pBpZGKaLS/9MtzCWFhwC77UCf0xyaIFiQTLrKIXkENyiG909WoBb5YxKu/ofsSzqa3oA\nYSWMo5V+0zLQhEs7T00giCMVesC6vqs1GVGQo1xwSnUvXOwZq20zBEO3FOQqK/7DGPCZrQhgeMcI\nvb36WuiCas1X0QnxNXonZ/3oxV4HIHTT4kKSIiGiSqZggIngeAWKwkz3GwtYgmEUioyVVszxShML\nY8XR1WY6bnXIcsFxDq3NxmSqyrogxJ5aDSzXaQVw9CoE5/Fh9U+lZsaXdpNWR6ooKjaUbYHYrRTu\ns9eYHX04gUKSEUnFa19sQ70/ghVbiqMyoczzq6xJxlQy/NmL1x/B9gPNL/zVGgujsbulpa4x0Tkw\nDXHcYgDA8wycK2B20AFbuRQjHmTEkuz9bWFpfdRaNi3xr+/2YtOeCnzwTXTyQ7y4XGMLsLUxjCPl\n2mDHvjBbsgw3EoxjxO7yObVHHpyC5VLJz8mE02E9WjPrBzBdUkbgWoEEzhnSOmhVNK0IIbcSYm+t\nhARTBW3tCHswWD+nQ83U0nEVBzgOCEohHK30IygH4RHd5voeKq9nYhnrkSsOK4iuhBGKKFHBd6Md\nlfV+lNQYCxoJWhsBbU6JLhiGiDiYdi2tI2Xgs7UfrFw8AEqdJlyyKkfFBKDPPTHmsLgztG1yVU8z\n/lNcXY/NeysAm8Xi6H0QzgHbAFiikOnMwIotxZAlDuAVyKrNJSW5ENl9AQCY14o5D0VQwBhDQ0Tb\nluPMRnWoBosOfQ8AqI1YmVG8JwDnoDWQnFp5lkdmr2tyOk7P9LILPOeyRpolVQFsO2C5mzbzX6Cg\nfJs114VTwGXoqdVqdCVae1B5wY+F2HbAKBOjwtF/G/icCpTXBPHOAm2kaYxo2zoPIiwp+GTJXrz4\n6ZZmR66+QMQaVNiIlSUVq6wJn10F8DIW/FioXVeWwLm076W5GIO9w20sllHt6LYP7iErUMNrIh8M\nW22V4gh2MCzjr3M24tMYa+nEw6jj1bjN9vRi+/ca18JIUDBifSfJckOSYLSS+4ZOw+DuZ2NI3tnm\nZxzHobueqgoAORkZeHX6ZfjddYO0D6LmbAhwOQUInPUZ5wrYgujWr0LsZqXpCjxvncdmfXRvGA6A\n01JyAazacQRPfPkJwkoEHtENgRfgFJxgumWh8LovVhHNa+4trsK2I0Vw9tdcUizianQt/Yet8qb4\ncYJiWQL6viLTXW9iBJzHByGnGkpdNy1+oAuNpESQ49QzmcKZ2v2qIsJ6qrA3Wxe0iBuRQ/rzE2Ts\nKKwB74ox74FXTMEoLAri/a93aRYZr0BWmCnsTHFA1WNBxrViWxgyFJuF8asz/xcCJ2BbxU6tVlgk\nuryIkFWLqowt+sEx3B+6YHA2weAzGsx9V24rwcb9tnRovhabyreas/Yd/XbAfY42XyUsKVEj68bB\nUmPlQj6rBmLeUbjO2IgFPxaiqEJfo8V9bILhC0pRKx3a/eH7iuKnuG4J/gDPsG8BRzjKqoll4TRO\nGdUBXG0AACAASURBVC1R9sN15no4T92GuXrm1OJD38M95AfwXUpRVF/S5BwG9ooHDXql5IgiYfb2\nj7CzynIvylnac68QdkNVWVSBRimOBdMQiEB116ImmHhRQEMMGmui/bu0W2OqKRjG/CRtv+bSarcf\nrDLvO5Y4JGtOCwlGKxmQ2w+3D745yqIAgG4eSzCcghNOh2CO6KIsDKa5pK7rP8b8iBOU6DTdxqgC\nnCJvm59gCUbEGHTLxjbJHNn3d2mi5hE8UFy14Lx1OJS1WGuGKpjn++KHPVEdGot4rJnsvKKvMMgB\n4CAw2xyJRllhX63UgsOcI2LGIVRfLgDOtEIiTMKuI1r7QruG6seL5qi/78n6XIxAllXt1yitos+F\nUX1WlVoICvy6NXDoqCEO2tK4ZTU+lNbVWc9Hb0NIDmPzngrUhzXTfdzJ43CaMNw8X3lNED/sKAQA\ndPd0g4Nz4VBFDZYXFKNe1s7X3WG52VzQ3F1mMNyGZWFYLinnqdvhOtcKvBvZVgYO3mH63oXuWiE8\nMf8wnl/xEd7+cru5X60vjltEjD2qNrJyYsUwGGPYWbUHSw+vaLLtyffX4+G315hCYff1Hy5rwJqd\nxZAVFXW+6LphRdDmyAi5ZVGxhFgWRuNRcbWi/Zb4HMv6MuYNuU4rwMvbXjMXHWuMPYXYsDCe+Wou\nNpVvxZtb3zW3CbL2vQWcxXhj0cqoCgZGsLix66vSXwf3OatxOGdhzGvHwhDoxnEa+3Owx3tMC6PR\n/cQTjK37K/Hiv7fgrXnabyOWNiQrz4EEI0nYLQxXIzGxp+ACgNspINPpxbhTr7HtE7+sF1MFnHda\nd2t+Aq+YgdeI3tf07qIFcjlRMjvy+Qsi+GFrMer8WkfrPtsWuJOtWeWcIJsBdqUhFyycYbuWNlHN\neJ/tdWvioQuJ0T7thQAmi5pLqpGYGBZGfTCA0toG/Tjt56fKvFkKJcRrI3g1mNVkAp8xW1wqOg1y\neV+z7XVhv3VP9mtyKg5X1gJMqzZsuOB2HKnAa//ZhrV7tNIjny85gp8O6B06r+CNudvMLK5sZxYU\nSQAnyPh+81EE9LIpl+Zei1OyTjKvA6Dp5E6j7YIE3h2AGrTWi+dt4mK4Ag3CShgRSdFcEJJ2T0Ju\nBUqFHdhSaQW/lx9dYQbW7fEd3tYOzuU3S9JogsFQK0V3tC99ugW/+9v3eH3L/+GLfV/CF4m2vCrr\nAnCduwKf7JoHILpU+vyCDfiw5GX88V+f497XV2LpJqtqL6enR/NZtVHHxJrpbc1jYfj20DL4mR4r\nsv1d5Dpzo46xx5PsRFkYumCUBLV2uQUrAYVj1t/lT8HVUccZFkYTwQhoAwZFaJSFqBOQAiitDuDl\nz7Y0qTzcuMO3p2LbRbixBRavirNBlb6WixGzUE2LxroeBb07GD0yojOoAFvVyUZi4HZqP9QshxWI\nRTMWxjmn5GHcyH4xXVL6GkxwGX8IeufPFB4Ah/e+2hXl+zeQy0+K7pANwSg7Obo9hktK/+PPyXAB\nqhgd9LYJourPBp/hM+cqGGJiWBhLtxy2soxsM+IlFobY8wB8rBo8eM2NZVoY+ig9s0ZbA9yfY2uf\njJqAXnZEFwxr4SpN1DgmAuDQv4fW4dQFtM66rF7vcGSneQ+cIOvxnDB4JsIjujWxE2R4XCIkpnVA\nua4c3HXerdrt81qasxEEz/QPtGacCzJ4XeiU2vwm3wOXUQehizbvxUiPNmJKOV5nk98ObDXMdsmr\n4ehVCPf5S+GyDQbswuUe8gOcp23W2uVxQMg/hK/r5mBd6Sb4ghIqa4PmHAiDssbzVBwR8J4ANlZr\nMZoo102u5i6K9NkA99Al+HanlW4tqppA8hn15loqQPTI+l+7Psc3hUvMDlroWoJ5+79CNbRsMuO7\nNObG2JHi1AWzz9iuD4Qwc/1HELtrLqwcwVZpmbfNE8poiBKMQCQIX8TfpIJybSh+ActN5Vvxxx8e\nxxNz/4ut+6uwYqt2TaP/53kOBeXb8O/d88BYdLmbYEiGrKhY81NplHvqQHE99uhuP9mW1LC35gD+\ntesLKKrSRIhCakBzC9s+TkahTOA4VKs9UejltToDI2X27P5dMbBvDnqc0RubGw6a242smkynJRjN\nuaT69chFZobDCi732W8GkcMh7Ufn5LSAM++t1zp42/mYypkZXAAwKOdsbLIF2DleAThj/oXuRtM7\nJtcZG8Fk69ouhwCmCOAzfHD029Gk7XLJAAg51RC66nMXDDExr2V3ZfFR2xwn70GY5SLT6YWf8dGC\nxqngM+t0V5XDFkuR8e/lO+E8xSbM9vsSFPD6SHLMsH74oASmxcIEW0kW0Qjm6+a/IwQnMsBxnGY1\neRR4XDwa5AjAAR6HC27RBQ4c6vijcJ3lg1KjLw5Wn49Tc06x2m6Ufom4oFTnm+nQAIOYd9Rsg1qb\nBzAtxhKWFGRlOBBq7F4y53hYHQDnkMA5bEvaNrJ0hBxNELxuh/m9fLjzU8gNXyOy7zwA0Vl9ZYEK\nDMjthyUbi7CjsNoUQkCb7GkXDHsRTk6UEcksQllNAHm5HiicdhznDkRlIBmdl0/yY1WxJkKXsf+n\nP/9GI2H9O919uBZ1IT+YIkAqOg3OU3aZKdONsXf8FeEybD9guf/s7WC8BMY4qHXdIORWwu+zLKvl\ngU/x1cpaBNddA3tc0V5aRlEVCLainKuOavOVkLcfqOhhDgzNET8PzNr+IQDg2n5XRnXitb4IPl6y\nF9/bLDQAeHrOBvO1PWj+8uaZAICzup0Bvy1lPBiWsZZ9As9QCcr+oWAKoNb2SNpERLIwkkSvTEsw\njHkLToeAGTddgFN7Wu6qHrkec0SQZROMU/O7xD232+HS4iH2UiM5WkZMSP9bzuP6AbILYs9CcKIU\n1YkbqasGP+3X/fpGp+uIxO3EAd3Npb93OQXzOMOtkuG0pQ0bM9sNq8ZMCbbiL11zdNcR49Et22Va\nLwAQVANwCa6oY3i3D5zbD45X9Vnv0efjmri/LOuI4xWAieA5Dh6XA1BEy5VmCIbkjC5HwqmAIwIH\n00bITI8PFVXVotrvB1M5uByi+T0D2sREo2MNBUS49ew0TpDBGWm0iojIgcHWvBxeMcUrvPsCaAkA\nDoSVMMKSApcTUUJg3C/QNO6hfWi4xmIXRfR6RG1yKLSOn8+sg5CnB9xtAfsVR3/E90dW4p/f7sHm\nvZVR82nW7i1sNAksenTr8zE8/PYarN5eAlUfwXO8ioKqzeY+isowf//X+GjnZ+ZnZufJR1sNxiDg\nHx9vRqWvQRN3Q0SKLMuoMliF8oAW77ALRkiJFpWGUNDswFVeAhQRql9LwqhnVrzEcInxuRXgnNbz\ntE84/XL38qhzu0Xtd8tn1kHIPwSfXpGBqQyctxZlmZZwheSQaWnxORX4v6VrmoiF7SmAzy2HxJp6\nCurD9VHJD5V1ISjQ3gsDNsF1+iYA8et6tRayMJJEpiP2WtoA4BasDvW5O6zO236M8WMDADWUEeXj\ndotO8BzX1D0BmCN4t+CCw5cPKfuwlhoatM7NAtlQ6rtAyNbjA4rW0ZnFEXseMjv6Jp2u2SjtmHNP\n7YbdpdGjwAyHC8afUeM2mi4po3JvVg04XgWTNZdZbpYLDbZ5CRKLWM/COMbbAOdA3dWhOKKuY3eN\n9evRBQfqbBaP7rrjVDdEgYPLoWea6Z1uhOkjcVs8B7wCzhEGxwG8rHWuiqRtq/b74OQVPQlBvy9b\n7MCwJP535CB4jHsQ5P+/vTMPr6LK8/631rvl3pt9D1khJEAgAcIWdmQTJGkWhRe1WxRFWxRwQXrU\nntHWmcYHp/vpx8exfbrtxWec0R573ufFcXoGX0VfEW1axBZwWFQSIAkhZM9dquq8f5yqU1X3XiAo\niCT1+QdS66lTt36/81vO77AEAKLfR+tJBu9qpsrOsHZCPvbM/dEQrVnl1kujdKRDSG4znxdm3EPr\nCYJP6mT3giLFVVY2FF6SR2Lns316jS2rUmjsPonG7pMAP4+6Hy0Wxnst/w/DBLNEjnWftX0ff3EK\nsHhc32l/E5BmAVE3FFXBn/Q0ZQNjBMydJ2BPrx21pXwrxDz28T3/AABQT1SiZkQGDKupX6XfEYnI\nAE+gIIrX/u8x1FXlQOOiIKoIEnXrx/YDoHEeA9eIv4BEJYQ+ngPALP0PAH86/SamFVYj2RXEZ1+0\nQ+bNb1guPISmaAaAcmiEwD3qA1jzqvpVY+kADa5yWhur/8OF9gcWopCKPgNUEWJmE5TmQrx3oBJ1\nVTnskLZQO3r7zcFmbyix1ZVojs7XwbEwLiO3jVqDtRWr4rbHBcF1rBaGWzSPCf91KrIj49jfXsk+\n4rahj84FgYdILItBxQp8y7lMoFpiD8aIlQniGEFg3Ke2IhMen/3H55XNQGJlfoZtH2LuJeV8gQ6t\nlSkgr0sCHxNjMRWsOXrlPb0xbTfdVUYW1aIpZbhr2Sh2bcMlBU2EKPCQJUGvEkyPjyIEmXMB4G0T\nEg1hzOmCRInw7F6Gu08U4z8do8+mVRSy4CqnB71pmyVb2zneuk93BSoCzhquEd1S0HqDCB80srgM\nhUGVgtKWixJ3JbuX26uAEzSo58z3YMR2fB7pvAI+UZzLmHhpVSZfRj/F7tC/xD0zQxf41gWtzOvR\nfbGjfoAuGMbJ5sRHdg6bx0EAga4Zw+IaMCbVmQJeGHYQn4TMkf8XZ6gVEvmqki5Cxit488MT+JsX\n99K5SYrI+j6il+ePrSLASVHW7/2qPSHgz6cP4L/+3IifvXYAB0/YYz/tfZ147vVPcbYr3hrsV0K0\nbIulb901u2wWlpjRCDGtGWImtQI5bxd+9QYtxmik5Z/oakJbxKz91hEyFZoJGdBE0IHgKIzLyPis\ncZiSMyFuO88JCY6mglHk6Y/VI1mFvYhUyfzgPYbLRxMR+nQa1G5rtggVqqLAsTgGEB8TIYawAphA\nNeIVNoyPsTdo20w0ARyoHzys2T94aymUZdOG29P6mIURG7wVWLtjS4hYra04Yhan4t294EX6MYwp\nzqJ+Y0MR6rW4NJWn/SPy9HyjbDofhgS97WxCogroQlDTrS6j76Sig3pCgZBwohnn6odP8kLgBfMZ\nBMUsPBmbxaUrO7ZGivFcutDo1Vd31HoDccv9GnEKEvGYmT+CAneSns4c8qH/41n6OVTo+twSIEbg\nhh/jhRvYOQDAiQlcXEaJllilYH1mfaARbRyuX4ee09qjB2rbzNGwoZxCoEJ3Zv5UjEmnc23CSgTy\nyA/Z+ZEv9Tk4hsIQFBrEVU3XrFEdIOEETMM9Z9QzUyQ9ecFUQLTvRWaBRbWw7ZlslxMj0DQN3Rq1\n9JRmGqP6n45jaGmn76IzRli3dnfhz5+fscV8DPqVEBRVs/32OTFqm+BpxBUNSJi6SOn6NvT5jnQc\nx1fBnWwA0R1O3BeOhXENcb4qkxzHsUwpm8IAkOo2zUyfvu+Bm8ZhTd14WyE9A1HgIXNm2mZsKq/N\nOlHjLQwDluranYolefXmDo2H1y0mXLbS6zKVkUcWY6wZe1mT2O0CzyGpc7Rtn0s4v8Jgqxhqpjst\nJZd+ZC5RhiQKTFm6hlM3lhrlIQg8C9hTIURAhAibbMgEsqufCS0lIlF/uP48gr+DChNNQDDB+he8\nKwS/TGMsIi9C4kXwvk5zBr0S605T6WjW1l8iOEEDQNBJ6MhR6w3aLCoA4APt+r4AzeQCVSYun24p\nhj1A1A2tJ8DOccs8OCkMGR5oUaOWma4wdMGlfFUJF9HnlRgKw514ZUYIUXBiFEJvJpTTpbRqs369\n3sAh1nZDmRjtCGnU+gjIfrgFOsiJkAh4t+lKU9vyqJvUSEKwVihg82ki6O6L4Me/j587wgpYWtyB\nRBV0a5SwthhVEgAgrIXAwVSQysnhiJ4u0q8XxeftxxHiu6C05SB6YiQIAXojvTQOZ7lXibfc9nei\n2e4hJQRVJXHK2FYRQE5sDfZG+0BAQIjlW9Sv0xOhfcsSMAA6V0nrwtuNF67eOxAchfEtYA2OxmLU\nP/LGKIwMrxko9+gun8qiVMwdn4/a4QVx1xF5Dh7eku0SY2EE3JaYBps3kaBdFuFV4LPcR+Ph0yd+\nzR1mX6PdWgbFJQuILYUCIP6jMUqgCDwy1QpEjlWxXdaYT13WjITnWe/Rqa/V7RJkyBIPrSPTdoqq\n6BaGJACKTOMT/nZwHEE0bM+sEoJn2WS5aJhHR3fY/mECyEsNskmZPxi1xrYvYEmVLg0WgxMVljbL\n+tbmTlPs/WVZ1pfz9FDfe9TNLLTcLBe23TIWQuAcDdZG3eZgQ1DBu43yLh7WT5yg4YHVVYiQMDie\nQIIHasS0cgBTKah9SVCbS+g2MQLe3w4huQ2C6mGVio1Kxka1ZD90a1iRqFXCaeCDbbS6cVuePd4E\noF9feyQgm8ouooXNwYD+PoxFxgBAyGw076G/q7AaxpHTbQgVvQ0AiDaOgHKWWjRs5C4amXAyJE42\n+1ZXQD7Zg1HDMlkbkv0uJqi1sAuwxHi+7KJtUM9lAeAAVaJl8I0MJDEKLezBkgK9phqz0CxlhCTq\nau6N9OsWhm6hnSzV+9SqMOh5N5Wstr0rowqB2poPtbnYdq8e3RWodadAOZPH+r1d/hyvHvl3fFMc\nhfEtkJdEf8RjM0bH7TMsDLfooi4TnaJMM188yWVXJiWZMXECUMHrEUylIPEinrl7KsaU0OvwmtUl\nZZYhCR+eyASAfR/g4k2LhRAeyUm0HfWli/HszCcROlCHaONw5LoKzXMkIa4UCkAtFhvEtDACXpmu\nBWJcw+KSWla6ALnuYeZpTOjGW0cuQYZLFEAiHuYyMNpAYxg8lDN0wp9r2BEAQJeRJWkR2oK+RGxP\nN/DBwRZz3ghrn2ldTMgahxJ+PPvbGpdaVW5aaETjLMrO4l4SFHbv5CSZKYZ5tdmQ3SqIItvbxysQ\nfL0Ap0HrplaoVzLjJWYBSSPuRd/7sFwP2iI0pdZDUqBEeXYOADZXROtPQl+v/nsQo+D0kv1ZkXGI\nHKmG1pcEcATJfpkpjGx3nv5cIjghimBArwLbka6P4O3W0alOah0F5CTmumtKfSMmRsfRd6wPNIz0\nY63Pb1EYUXzRaVb/JRE3KzLJSWHw/rMQ0/T0bkWCZJTQFxQ2CVRWk1i9NQgKAl6ZlaAhETfrf06M\nsNG7UUyUKCJCagjhqAo+0AbeFQJUAZnBII2b68rKmOXv66zAoQP0fby57zhaO/qZwtB6kvX0bdou\nztcBIaUVhAATc8aA0wSmFA6dpXWsSNQFrd/IxqP7enWFYc0mg6BAwYUXkxoojsL4Fkh1p+Dv6x7D\n7aPXxu0zBIwsyCxv+4l1tSjLM2MIse4qq1Ay6Asr8ImmgJ9YnoPUgNucvGONIVhy57WuNFbKnGIq\nD06zWwrpQSqYeI6n9alCSVBOl7J2A/a0W8AapJYQ+mSGZTttgyhwtFS3Ygphq4XhcYnwy5YMNEPo\nRjzQQubzcuAgCRJy032oHp6OCaWmdURUEQLPQxYFaJ3pVGD79NURozK7QviwPf5EVBH//t4XUFqG\n2awMq8IAAAHm3wHdJQUAGZ40UxmrElyyiCdvn2S+CykCjifI8Adw45wyOjlTVyZzJmYjrIbN2JOx\ndK/SiLcaqQtGC1NhkSSbAk8TdJeUIdQs5VBO9dMRcpKWgUiEp7Emyyx6EpUBRbYIySgTRElCgM6W\nD3vAccC9KyvY+i9FQdrXdD6LgtxMe+wn1sIw4i9BVwAe68zrGBcM0QRbDEMLeaC25bPf1LmeXvzH\nXjNIThTJJuDFLEspeSLArSefcLzKhLhXS4NsJCiICqKqBjFJXx2yz2/ri96IRRgDgCIhrPUjElUh\nDTvM3lPQ6wKnSbqA1yCX6bXGoh7WFz3hPpwJtUIqOKK/Lxkk5NNTogmkAlrZluMAt0tEwO2DKNP+\n+/cj/wWoEtS2XNbHctkBCKmn0aN0mX1h6XcFA6+ueyEchfEt4ZeTErqmDJcUVRj0BWemUEG4bvRa\nLCmeb5scBMTMENcJeCX4JGvWlV3J8Kop1OIC4uGYcuw6GiFmmwmf0G8PmAvBALDXvALsEwiVeCtH\n4HlMr8qlAslou+DC/SvH4o4lleA4DgGX5XmNaxMe4QPTWbaIJEjgOA48z+He5VUYV2h1pwkQBU6P\nv3BmCitAZ3kbh/WZwt7WXk1E1OYys/cDT8y/rcqc53hz8StFgqpqyE33MSEuF1IhU5CejAW1w+Bx\ni6zv2kPnQECQ4be3CaAzigGwa/tc9P2JGY1QjOKSMZbJrz57Gaf08hj97X5EIhqdk6KnJfPufmj9\nSRg5LBmV+XROESdGzPiHYMR6aJ8E/BxdD4RwyPYnY9a4XL1iMkEwzZ75Zfw7rzYbnLsHQkYTBM2N\nTG9GwgQHI1gPjdYE41x0Do4R9F08kbphQmrETEuOuOhgQFeUvByBqCtytZNa2Sw2JqjwpOjVAfqC\nOHUmRAcEQhSn2nrpipFRN6C4zHIzYoSt+24qQgkqVOw+0MQC1NHGEeB5DjxxUYUhmgFvviPPzJQT\nFPsSBRE3tLCXPq8cYuO26CnqHvRJXmhSH4TMr6DyIag9AZCI1/atyWWf4AuiL1+rSKYVLijQuPOn\nK18KjsK4yhiL+filJDZSFwX6a6nJrMKi4nlx5/gkc2RdNzYX96+swrQxObYJdMYo2Ai4c+r54xux\nghIApo3JRllekNXI4sRInMJ49NYJmFOTh+oR6Wwbx3F214LFmkm03e+VML48A//4w5mWtrtQVZqG\nKaNpgb9kt2VGfIzbIsVFLTGZtygjAMkuUwEZLin2p8UymVJRgBEFetaZItvjFVZLyaLs3DEWn6BZ\nFYa9L41ArtaTjPJhKXHXBQCvaFhunF6sEfiwmU64Ks1Kwy/un2GLE7E26QojoK+/wid1oVdoocrC\niE/pM6e/6mpEe+QsSFTG4S/68HljB+1L0eLGCnuQnuzBD+brylGMMjcIc9vo/dCn9FOFokjweiTc\nsnAkND1773/wLr0ey2jTkwYkDby3GxwH5ChVcAkyunrtgkwLeYGoG5nJHowtobEF99h3bf2W7NXf\nn2XiY7SxHADHLEZXcjdLj40co4t7ifpvhE86B9HTD6JIaDsLLJtabCpPMQIihaD20PdoXE/K/QLH\nO7+09QH7Tej127SwG1oX/RYkuMHJYXNNmDN5UBXebm3pbZd68gDFBaL/LqXSA+AFDUTlkROh5WKM\n9VjkokO2e9sGYRZIxLRmeF8HVC5qq5D9dXEUxlVmWm4tvl+5GuWpZSjM9iMvw3fBMsaAXWE8fMtE\nVJWmQxR4uCXLaD6m2JgxpwBAvMJIkHW17vpK8DzHsrU4Vz8CMQqjOCeAtfPLael1C7bgpW0mMGdZ\n4J6ek5mi+2CtzxxTHSLda5kFH9P2FDcVUrFzXZJdZuoxUQWmhAEzPREAJpTms7LfAMeCnDzHo7LQ\ndNVZLTSPZL9Xrs+sXBuIcRdGvhoJEpUw0lVL54ggPmXZrSuMmhEZmFkyDi7ehY9a6Mxoj+iB1y3i\nb2+rxTBSYzvPUBheyW4hcqpl5r3lXXRGOi0uOIDjCHhXP3PdkKgMTSPwy0m0/Iv/HBN4xj0MIdQX\npdlkRJFoZhyA22uXQiAy+jU9vVQVMboklQn6kBKCKOuz7BUZR5o68L8/OGJruzEq9rpFdITtpdON\nZ/G4JJqRxZsTH3m9ijLpC0DrS4LqPwWS1MbaAQBpoHEtKfc4iEDb3tMfxeiSNKT5/OBFBYvqqJIy\nrG4jeQAAQoSWJmHKWBfW1HWnmNYDAJ8eT5RLP2FtiCrUqiME4LzdZu2xvjL9nvR3KfjPgfN2IuD2\n4cHV1QCA1n5zFjq9d0xsy0L0VIm+3DPdJ+UdB/GcYwkG3wRHYVxlZEHGxOxq8ByP7y8aiR//YOJF\nz0lxJ2Pl8GXYMv5u+7VkgZWdOBemPvrKIipsq4vy2XFG/MDjElGSG7C7aGIo8NOApkeSUD0iPtie\nkPNU3r11YTmCxshfz5PPTDaFtzECCql2X3a2pU6XITSeuXsq/v7OyUyhSTEKI9Vtr2wqWCwM6yz4\nJMlnsz4Ml4ZHcCPoNa+Z7DUtB3dM2u+SCebaKF6LMgcAtaUIoY/nYFp5iem600SED9WyY9J0K04U\neNx8XSUKg6Y7zbA+slK8eHjuTfh+5Wrz4npbjbXMDXjVFAzKqVJInEWBxKz+CIBNliNRF8JRFSIv\nguvIB+8KQUg5oz+XBwGfzEa0Lx9+FZwUBYnKdAY9gNqKbGQlmckNRJFQlO3HjrtnAQDeP/0R5k2l\n7qHjjf14+vd/gdKab6tHZQT9PS4RTT2nbM/F0psFDhyvlzZJp242v5EFSMzEBgiKTcCnCFkYnlwG\nTg4jTPoARURuuk/vQw8kt4KRZW57PxEeoU+m256J/V+1KAxRQW5yEE+so++1NjgbRBHB+2g8ROZd\ntMAi4aGeyQfv7oOoZ34ZvycjC83AL/tYSfqCpFzbPkPx2+ZX6Rjl/+OKnl4gXX2gOArjOwTHcXGj\n9fMxq2AaSoJFtm1uSWCZM8aodU5NPratHY/66aXmgfoo/b4VVchJ8wKqBKGtDBVCTAorgMXF12F2\nQR22zbyDZkCdh4fXVONHN9NsIbUzsWLRNIKyZOp/NjJsMlLMEdyDE36IMemVmJpbazsv22dJk9U/\nghS/C5kpXqYwYl1SAi+gIf8maCEPtO5UiJb5I2qXmYHmO4/CUImKJI+pMFycKYRjlZMk8phTMB0i\nLyLHZ0/ppXA2C+p7M0qw1KJkioL2NGlrIUt3zKhwQtY4TM6egNrsGty3YizuvGEU0j1p4L+YxALs\ngmY5R5Uw1jeV/TkqP4cJoegXMVl7UReum0DbInXnsc1EFeCWJTyxrhar6uis8rMhfSEpRbYp46DF\nFUhUEa3n+pnSA4C3mvRZ2Cy+ISP8V7N9rNSNLGBa7iRb8wxlIvI8KyjJe6k147aU5bAtiWwRxpkc\nvgAAFtxJREFU8KLAI91jWqsF6SnYciNNcy0JFiGqKdjXoseHrNcIe01LzSqg9Wu7KmgBxaxAEHkZ\n1MIszciB0mJm6vkkD1vRT23T0131+AYb+SsyIkfNWJnVcrxr7A+QKZjXY0kiSoLBGYvL2U11NdGx\nl4ijMAYRLklA9MtRiDaVob50EQBaUrksP2gTikb8QBJ55iKSzoxCTdr4uGvKgoQVw29ga4Ofj/Jh\nKSjVM7u0zgxET5aiyluHm+ePYMeoGsGkbF2p6EI74DU/wAJ/Hu6q+n6c6Wyr06WPFg0BnHoelxQA\njEorR/jATJB+P7xuy8dicc/5ZR8k0bJuQBd9zpAaZusVGGuwGyNhI25i5XtlS/DszCfhERMnEFgd\nc0umFmFpbTn7Oy9m9GhVOt6Y63Ech5srV+HWypswtiwdkyqpciFdGQgfnIwcbTRSQxVI8ZsCLyCb\nQjw3OYWlWmvdabbFqB5YMYnFc1xRM8OLKBJ8bhF+r4yZJTU2F5zEuVj2HGBXGOlJSbhuYgGrZmDF\nNlK3JB4Ygxm3LGLliGUYlzHG3GfUPhN4/GjyJtv1rKnYVrebVcBLAm+LbWX6A6yfjBU0P2r5S/w1\nwLE4H5s5DqA4RtFb331mssfWt92WQlKGK9Fg0cQy9n/N4i61Zj0mu4IYhrHmSUb/kfhBnNG3WmeG\nrSpEW/s3D3w7CmMQ4ZIFQBOhnCpLKLhMU1VflIfnMFUPLC+ZWmRzD31TlJPDUe4ej9k1+Vh/QyUC\nPhkTK7JQmVaOB8bfg++PXoX1SysvGq8xmJozEZWp5Vh3fSVumFbEtpsuqXjT3Mg6AwC/165QNo3d\niDtG3wyP6LEpU8OdUZlWztYrCPpkyIKA0P6ZSGqahYnZ1XH34jguYRbcTXOHQxJ5jCy0VyPmOR4z\n86diTsF0SDECNccikD1SYgUUiyjyIH1BjOCn4oeLpuMnG6axfUGX6U4LyH4MyzLjLJolQ84q7GVB\nBukz/67Q2y8LMlaOWMa2zxtXauu/ZIsy3bC0GqW5QXAch3UxKeW2YK3NzWO4pARIvIiiQEHcPlHg\nkO3LtGULWgcZomYpkWMZVQsCh6ClfdZvpDhQaMbXABDF7r5RWwtsbQAAn5qNGcEl7G+rRZAacGFq\neTH7Oxw2r11XUQTrEGJ4bjqeXj+ZXt+iMGLffapo/i58kg8TR9KBhW0FSgAjcvQkFMIjcthirV9g\nkbaB4lSrHURcyGUEAOHDtXopCl1hCFSQPb9lJmRJQGfv5cnVZugW8eTKbEyuNH/sxcFCFMcP0i/I\n/6pYmXA7C3rz8RaG12X+vA03zMYVVTh2shNlafkAqHKwWV+qBOHQfKy7czr6iglOn+3FzQvK8S+7\njgCKC66o74Iz92OZP7EA8yfGz8wHgFUj6hNutwpJcYCZLRuXV+EP7xzDwknDEPDJyMgwlYTVOgzI\nfvgyLLP+LQLKOodEFHmo57LAJ3XSkicWhZtvsYh8MTGboMWasQrxmswq/IvkM+s+2Xzv1oQH+n9D\n2VvbZMYwaP+nulPQHaUuKTp5sRcuWcDT62bhRx/sjrsPz3M2C8Mq4CVBQoY3jZVIJxEXinMCKM0N\n4PjpLvzo5tlY/+vfQOtJxuiSVBz+6hyWTivCud4e7NZj81ZrkOM4rJo+Gvve/QN9HsENoxwjITy8\nfBL6tG4QjYfEi5CM9FuL8pRjftMLa4vxJz1hbNPy8Th4UMNHh1sRPjgFj24owzP7fgEAaD5jqV1F\neD1BQLvwMtADxLEwBhEu+SI/CE2wuWOIvtCvrCsawz1kdWd8HQwBWT4s+SJHfnMyPelYWDgHM/On\nxu2zpqL69WcbV5aO5TNLbcfZFAYAF+eFW3QhNeDGI2vHIz8jCafO0s89mPTNA4cXQ+RFXDdsFoB4\nd9X5KMkN4MHV1XGZbACQ6gli1Yh6ZHrSURwchorCFEwbnY07llSymFehv8Am4JfPLIHSRu9dmVJp\nu55X8uDGEfVIknwoCgyz7Svw03MkXrQLe1jWvdd4WMvSrJk3nC13y3vpxDPjt2i9htXCAOyJDR7Z\nFLRBj0UhWq0XYs+ei7XCc7xm7GjNrFF4aHU11lw3An9zywRwHAe1dRhIXwCFWX688OBsFOcEkOIz\nrZzYZA2rS2ndItO11tsfZen0RsxGZoM9jrmR+hR7xV+XJGBTzQbMyp+GAn8eJlZkQuA53LGkEmmW\n2MzyGcNt5xluViNu+E1wLIxBxMUsjFnjcvH2fjPzJLZEN8dx+NnGujgBeqncNHc4GqaXXFyBXQY4\njsPS0oXn3WcQ65KyYk25BZBwzkN3H7W+aisSBbUvP8tKF2FR8bzzlsa/FFySgJlZU21Kdd0SqgQq\ni+ohSEvhkz22/hpdnIZfbVmMzvB0c20PCzPyp2J63pQ4l2JxsBBPTfsbcBwXF7BPd6fiq65GWiLe\nQpJHwuphy7Fj33PoPkWVeWYqFbb5fovCVI15SnpKttda0ZkqBlUltjZpFrcaIQRZlnO8Me0zrpfm\nTsG88YWIZerobLz/12akBszz3LKA6MlSSHnH4jKZrO3IS0nB8PwuHGnqRG9/FHmBTJzsb2QLZFnL\nAkWOjkNwxP9gYdHcuDaUJRezxJH0oAe/fGg2ezaDuqocVgZ9XFk6XMHJOID/w9xq3wRHYQwiLqYw\nblk4Emvnl6OxtQefN3YgOzU+ZnEhwXpJbfkWlMWlYLikEiHFKM7Z1Xlxx2xcUYX9R9owZVR23L4r\nAcdxl0VZANbRazwXs5is8Y9Yzhd/ssZCrFRnVmFf6ydx271uEQX+dPy07m9x5wc0iypLz57z25Yx\npuLKr7/L2QV12NW4GwVJuZB66TtU9USFm8ob8Lu3Dphr1INaGLIl1hUbjJ+ZPxUCL2B2QV3C9n9/\n0UhUD09H9XBT6QR8MpSTZfBHCzB+9riE5wF0lvniyYX42WsHMG9CATwZSfiwZR/bb/0NPnvnXEji\nfHuixkXgOA4NZdfHpc5WFKVAVZOx973ZQPT838BAcRTGIOJiCgOgftzCbD8Ks88vCAYjfu/5Pxbr\n8pX/eG9dQrfO6OI0jC5Oi9t+LTCQ38W3wdiMUZiQNQ4dZ2R8atlupJJLotnOjGTTXfS3Ux7Gm4c/\nxFu99L2kJ5vK5KfTfwwOHP5jD11kyBhoT8+bgvTp5XjtnWP44nS3vo/uTPekoa3/LOy5azQetrRk\nwXnbLwo8xpfbLUyfW8LT66cgySslVKC3j74ZRzqOISAnYWyZn8ULVS0NxYFClKdQi4rjOFw/pRA5\nad6v7facN2xm3Dae4yDKvC3V+JvgKIxBhEsWsKF+NBudOZhcyMIw1qj2uMSEyuJaxyV/N0KVPMez\ncvDaTILb/4Eu0+pOYI1a3aLpnjRMzpiKt7APxTl268WwwhItaFVRlIpHi1Jx29+/BcBUJptrNuC9\nU3sxKbsm7pyvQ1YCS92gOnMMqjPN+IVh7Qm8gAcm3GM7Nja2djkQeA6CMLBMxIFwxRXG7t278dRT\nT4EQguXLl2P9+vW2/ZFIBA8//DA+++wzpKSk4Nlnn0Vu7sACfQ7xGKl2DpQlU4vw2RftbC2PRPSF\n9bURLsEFcC0x0Mmg3yY8x+HJ2yfh4JftKMk1lcC9y8fYStwYlOUHsfnGsRienziRYiBC0fDyB10B\nXF983ddq97UGz3PQNHLxAwfIFf1CNE3DE088gZdeegmZmZlYsWIF5s6di9JSU5O+9tprCAaD+NOf\n/oQ33ngD27dvx7PPPnslm+UwhPjejBJ8b0bJBY8xXDaJYjrXMo/eOoEF67+L5Kb7WGkOA2t8IJYL\nuQQvpBR9Hgm9/dG45IahgFsW0NN/eSrVAlc4rfbAgQMoLCxEXl4eJEnC9ddfj127dtmO2bVrFxoa\nGgAACxYswJ49e65kkxwc4lg5uwzXTSjA+htGXfzga4jinACqStMvfuAgIJFLyuAnd03FxJGZmFOT\nf95jBhuP3joBs6vzUDMigxX4HJ5/iZOfEnBFLYyWlhbk5JiLwGdlZeHTTz+1HdPa2orsbJp5IggC\nAoEAOjo6kJx85XP4HRwAGrtYPW/4xQ90+M7i8+iT+hIojtL8ZGyoj1/tcjBTnBNg8Z7RxWnYfONY\nlOR8xxVGbIntgRxDCBlwuQgHBwcHAJhQnonjE7pQNybn4gcPQS5Xht8VVRjZ2dk4dcqcKNbS0oLM\nzMy4Y5qbm5GVlQVVVdHT04Ng8OKa0Fr6YKjj9IWJ0xcmQ60v7lsdXzzTYKj1xZXiisYwxowZgxMn\nTuDkyZOIRCLYuXMn5s61z16cPXs2Xn/9dQDAm2++icmTJ1/JJjk4ODg4fE04MhC/0Tdg9+7d+MlP\nfgJCCFasWIH169fj5z//OcaMGYPZs2cjEongwQcfxKFDh5CcnIwdO3YgP3/oBKccHBwcrhWuuMJw\ncHBwcBgcfPdm9Dg4ODg4fCdxFIaDg4ODw4BwFIaDg4ODw4C45hTG7t27sXDhQixYsAAvvPDC1W7O\nFWfbtm2YOnUqli5dyrZ1dnbitttuw4IFC7Bu3Tp0WxYMfvLJJzF//nwsW7YMhw4duhpNviI0Nzfj\nlltuweLFi7F06VL89re/BTA0+yISiWDlypWor6/H0qVL8Ytf0JXWmpqasGrVKixYsACbN2+Goijs\n+E2bNmH+/Pm48cYbbanugwVN09DQ0IC77roLwNDtizlz5uCGG25AfX09VqxYAeAyfyPkGkJVVTJv\n3jzS1NREIpEIueGGG8jRo0evdrOuKB999BE5ePAgWbJkCdv205/+lLzwwguEEEL+6Z/+iWzfvp0Q\nQsjbb79N7rjjDkIIIfv37ycrV6789ht8hWhtbSUHDx4khBDS09ND5s+fT44ePTok+4IQQvr6+ggh\nhCiKQlauXEn2799P7rvvPvLGG28QQgh57LHHyD//8z8TQgh5+eWXyeOPP04IIWTnzp3k/vvvvypt\nvpL8+te/Jlu2bCF33nknIYQM2b6YM2cO6ejosG27nN/INWVhDKQ21WBjwoQJCATsJZ2t9bcaGhpY\nH+zatQv19XSd6LFjx6K7uxttbW3fboOvEBkZGaioqAAA+Hw+lJaWoqWlZUj2BQB4PLQ+UCQSgaIo\n4DgOe/fuxYIFdD2HhoYG/Pd//zeAwV+vrbm5Ge+88w5WrjTXff/ggw+GZF8QQqBp9hUNL+c3ck0p\njES1qVpbW69ii64O7e3tSE+nReUyMjLQ3t4OwF6XC6D909LSclXaeCVpamrC4cOHMXbsWJw9e3ZI\n9oWmaaivr8e0adMwbdo0FBQUIBAIgNertmZnZ7PnPV+9tsHCU089hYceeoiVFDp37hyCweCQ7AuO\n47Bu3TosX74cr776KgBc1m/kmloAgDhTRi5Iov4ZbHW5ent7sXHjRmzbtg0+n++8zzfY+4Lnefzx\nj39ET08P7rnnHhw7dizuGON5Y/uCDKJ6bW+//TbS09NRUVGBvXv3AqDPF/vMQ6EvAOCVV15hSuG2\n225DcXHxZf1GrimFMZDaVEOBtLQ0tLW1IT09HWfOnEFqaioAOkJobm5mxzU3Nw+q/lEUBRs3bsSy\nZcswb948AEO3LwySkpIwceJEfPLJJ+jq6oKmaeB53va8Rl9car22a4G//OUveOutt/DOO+8gHA6j\nt7cXTz31FLq7u4dcXwDUggCA1NRUzJs3DwcOHLis38g15ZIaSG2qwUjsSGDOnDn4t3/7NwDA66+/\nzvpg7ty5+OMf/wgA2L9/PwKBADNFBwPbtm1DWVkZbr31VrZtKPZFe3s7y3QJhULYs2cPysrKMGnS\nJLz55psA7H0xZ86cQVuvbfPmzXj77bexa9cu7NixA5MmTcIzzzwzJPuiv78fvb29AIC+vj689957\nGDFixGX9Rq650iCJalMNZrZs2YK9e/eio6MD6enpuPfeezFv3jzcd999OH36NHJzc/Gzn/2MBcb/\n7u/+Du+++y48Hg+efvppjBo1OBYF2rdvH9auXYsRI0aA4zhwHIdNmzahqqoK999//5Dqi88//xxb\nt26FpmnQNA2LFy/Ghg0b0NjYiM2bN6OrqwsVFRXYvn07JEkaMvXaPvzwQ/zqV7/C888/PyT7orGx\nET/84Q/BcRxUVcXSpUuxfv16dHR0XLZv5JpTGA4ODg4OV4dryiXl4ODg4HD1cBSGg4ODg8OAcBSG\ng4ODg8OAcBSGg4ODg8OAcBSGg4ODg8OAcBSGg4ODg8OAcBSGwzXNqlWr0NDQgOuvvx6jRo1CQ0MD\nGhoasG3btku+1u233z6gctePPPII9u/f/3Wae0kcPHgQ//mf/3nF7+PgMFCceRgOg4KTJ09ixYoV\nF6w+apSKuFZ49dVXsWfPHuzYseNqN8XBAcA1VkvKweFS2LNnD7Zv345x48bh4MGDuOeee9De3o6X\nX36ZLaizdetW1NbWAgBmzpyJl156CcXFxVizZg2qq6vx8ccfo7W1FUuWLMH9998PAFizZg3uvvtu\n1NXV4cEHH0RSUhKOHTuGlpYW1NTU4OmnnwZAa/M89NBDOHfuHAoKCqCqKubMmYMbb7zR1s62tjZs\n2bIF586dAwDU1dXh9ttvx3PPPYe+vj40NDRg0qRJ2Lp1Kz7++GPs2LED/f39AICNGzdixowZOHHi\nBNasWYMlS5Zg3759iEQiePzxx1FTU/Ot9LXDEOGbLNbh4PBdoampiUyePNm27f333yeVlZXk008/\nZdusi8scPXqUzJo1i/09Y8YMcvz4cUIIIatXryZbtmwhhBDS1dVFamtrSVNTE9v37rvvEkIIeeCB\nB8jatWtJNBol4XCYLFy4kOzdu5cQQsiGDRvIL3/5S0IIIY2NjaS6upq88sorcW1/8cUXyWOPPcb+\n7urqIoQQ8q//+q9k8+bNtrbX19eTs2fPEkIIaW5uJjNmzCA9PT3kq6++IuXl5WTnzp3s2WfNmkUU\nRRl4Jzo4XATHwnAY1JSUlGD06NHs7y+//BI///nP0draCkEQ0Nraio6ODiQnJ8edu2jRIgCA3+9H\ncXExTpw4gby8vLjjrrvuOogi/ZQqKytx4sQJ1NbWYu/evXjyyScBAPn5+cySiWXcuHH4/e9/j2ee\neQYTJ05EXV1dwuP27duHpqYmrFu3jhWkFAQBjY2N8Hq98Hg8WLx4MQBgypQpEAQBX375JUpLSwfa\nXQ4OF8RRGA6DGp/PZ/t706ZNePzxxzFz5kxomoaqqiqEw+GE57pcLvZ/nuehquolHTfQdRbGjx+P\n119/He+//z7+8Ic/4MUXX8Tvfve7uOMIIRg1ahReeumluH0nTpyI26Zp2qBa68Hh6nPtRAAdHC4C\nGUD+Rk9PD6tO+sorr5xXCVwOamtrWVnpkydP4sMPP0x4XFNTE5KSkrB48WJs3boVf/3rXwHQtS6M\nMuYAUFNTg6NHj+LPf/4z23bgwAH2//7+frzxxhsA6BKlAFBYWHh5H8phSONYGA6DhoGMprdt24b1\n69cjJycHkyZNgt/vT3h+7LXOt+9Cxz366KN4+OGHsXPnTpSUlKCmpsZ2P4M9e/bgt7/9LQRBACEE\nTzzxBABg2rRp+M1vfoP6+npMnjwZW7duxXPPPYft27eju7sb0WgUBQUFeP755wEA6enpOHLkCFau\nXIlIJIIdO3ZAEISL9omDw0Bx0modHK4Q4XAYkiSB53m0tLRg5cqVePnll1FQUHDZ72VkSb333nuX\n/doODgaOheHgcIU4fvw4HnnkERBCoGkaNm3adEWUhYPDt4VjYTg4ODg4DAgn6O3g4ODgMCAcheHg\n4ODgMCAcheHg4ODgMCAcheHg4ODgMCAcheHg4ODgMCAcheHg4ODgMCD+P4xSKOOE0RxSAAAAAElF\nTkSuQmCC\n",
             "text/plain": [
-              "<matplotlib.figure.Figure at 0x7f72fab5e290>"
+              "\u003cmatplotlib.figure.Figure at 0x7f97f1e98d90\u003e"
             ]
           },
           "metadata": {
             "tags": []
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAe8AAAFnCAYAAACPasF4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzsvXe8XVWZ///e5dTba3pCQiAJCSWE\nIJGmoSSgjsg4gmCb4Tf+dCwURUdEQXGs41gYFQvDiIyIiKIIJIAgEBJCgJBKertpt59z76m7fv9Y\nu55zboiQBCL783rllXt2WXvttfden6et55Fs27aJECFChAgRIhw1kF/vDkSIECFChAgR/jZE5B0h\nQoQIESIcZYjIO0KECBEiRDjKEJF3hAgRIkSIcJQhIu8IESJEiBDhKENE3hEiRIgQIcJRhoi8I7yp\nMW3aND796U9Xbf/iF7/ItGnTQsfdcMMNoWOWL1/OBz/4QQB2797NCSec4O3btWsXH/vYx1iwYAEL\nFizgkksu4bHHHgPgpptuYuHChSxcuJCZM2fy9re/3fudy+VC19A0jfvvv/9vvq/Vq1dz1VVXHdSx\nDzzwAF/72tde9bVcvNbz3wi46667+P73v/96dyNChFeE+np3IEKE1xsbN24kl8tRX18PCBJas2ZN\n1XErVqxg/fr1IZIeCZ/97Gd597vfzW233QbAqlWr+PCHP8zDDz/MV77yFe+4+fPn8+1vf5vTTjut\nZjvr16/n/vvv55JLLvmb7umkk07i9ttvP6hjly5dyvnnn/+qr+XitZ7/RsAHPvCB17sLESIcFCLN\nO8KbHm95y1t49NFHvd9LlizhxBNPrDruuuuu4+tf//pBtblp0yZOPvlk7/fJJ5/M4sWLGT169EH3\nq6+vj09+8pO89NJLXHHFFYCwAPz0pz9lwYIFmKbJypUrufTSS1m4cCEXX3wxS5cuBYRV4IILLgDg\n1ltv5atf/Sqf+MQnOO+883jve99LT0+Pd53ly5czffr0qmu98MIL/OM//iMXXHAB73vf++jq6gKg\nu7ubD3/4w1x88cWcf/75fO9736vZ18p7ueqqq1i4cCHz58/njjvu8PatXbuWSy+9lAULFvCBD3zA\nu85I26dNm8b+/fu9893fy5cv5/LLL+fqq6/mM5/5DAD33nsvF110ERdeeCFXXnkle/bsAcC2bb7x\njW8wf/58FixYwC9+8QtvrL74xS8CsH///pD15MknnwTAMAy++MUvsmDBAi644AI++clPVllMIkQ4\n3IjIO8KbHhdddBF//vOfvd8PPvggCxcurHmcbdssWrToFds855xz+PSnP82dd97J1q1bARg1ahSS\nJB10v9rb27nuuus45ZRT+PWvf+1tt22bxYsXoygKX/7yl7nqqqtYtGgRH/3oR7nppptqtrVo0SJu\nuOEGHnvsMdra2rjvvvsA2Lp1Kx0dHYwbNy50rVwux8c//nGuu+46Hn30UT70oQ9x9dVXA/C///u/\nzJ07l4ceeogHHniArq4uLMuq2VcXP/nJTxg/fjyLFi3il7/8Jd/97nfZt28fIISiq6++msWLF3P+\n+edzyy23HHD7gbB+/Xouv/xyvvvd79Lf389Xv/pV7rjjDh555BEmTpzIj3/8YwD+9Kc/sXr1ahYv\nXsx9993HXXfdxerVq0Ntff7zn2f69OksXryYn/3sZ3zuc59jcHCQJUuWsHv3bhYtWsQjjzzC1KlT\nWbly5Sv2LUKEQ4mIvCO86XH66aezefNm+vv7KRaLrFy5knnz5tU89oYbbuA///M/KZfLB2zzO9/5\nDldeeSUPPPAA73znO5k/fz533333Ienv2972Nu/v+++/n4suugiAOXPmeNppJU477TTGjRuHJEnM\nmDHDI85ly5bVvNcXXniBUaNGceaZZwLwzne+k127drF3717a2tpYsmQJzz//PPF4nP/6r/+is7Pz\ngH2+8cYb+dKXvgTAhAkT6OjoYPfu3Wzfvp3BwUHOPfdcQJitb7311hG3vxKSyaR3P21tbbzwwgue\nteO0007zxuepp55iwYIFxGIx6uvreeihh0LWlkKhwPLly/nIRz4CwKRJk5gzZw5PPvkkra2tbN26\nlUcffZRiscg111zD2Wef/Yp9ixDhUCLyeUd400NRFC688EIefvhhWltbOeuss1DV2p/GzJkzmTt3\nLnfccQezZ88esc1EIsFVV13FVVddxdDQEIsWLeLrX/8648ePf80TfXNzs/f3Aw88wJ133kk+n8ey\nLEYqVdDQ0OD9rSgKpmkC8Mwzz3gEFcTQ0BBdXV0hC0Q8HmdgYICPfOQjWJbFV77yFXp6erjyyiv5\n1Kc+dcA+r1mzxtO2ZVmmt7cXy7IYHBwM9U1VVVRVHXH7K6Gpqcn72zRNfvjDH/L4449jmib5fJ7J\nkycDMDg4SGNjo3dsOp0OtTM8PIxt21x++eXetkKhwBlnnMFJJ53EjTfeyK9+9Ss+//nPM3/+fG66\n6aZQexEiHG5E5B0hAnDxxRfzve99j5aWlpo+2yCuvfZaLr30UsaPH19z/8DAAC+//LKntTY2NvK+\n972Pp59+mk2bNh0yLa27u5sbb7yRe++9lxkzZrBjxw4WLFhw0OcbhsGaNWtqCiGdnZ1MmTKF3//+\n9zXP/ehHP8pHP/pRtm/fzr/+678yZ86cA17r+uuv58Mf/jDvf//7kSTJG4OWlhYymQyWZSHLMrqu\n093dPeL28ePHI8uyJ3xks9kRr/nQQw/x+OOPc9ddd9Ha2spvf/tbHnjgAe+6g4OD3rF9fX0kk0nv\nd1tbG4qicN9991FXV1fVtrs6IJPJcMMNN3D77bdz7bXXHnAMIkQ4lIjM5hEiALNnz6anp4fNmzdz\n+umnH/DYzs5OrrzyyhHNuKVSiU9/+tM8/fTT3radO3eyatWqEaPKR4KqquRyuZoa9cDAAOl0milT\npmAYBvfccw8A+Xz+oNpevXo106ZNIx6PV13r5JNPpre3l1WrVgHQ1dXF9ddfj23bfPnLX+aZZ54B\nYOLEibS3tyNJ0gH72t/fz6xZs5AkiT/84Q8Ui0UKhQLHHHMMo0eP5pFHHgHgd7/7HV/+8pdH3A7Q\n0dHBhg0bALjvvvuQ5drTWH9/P+PGjaO1tZXBwUEefvhhb2zmz5/Pgw8+iKZpFAoFrrjiCjZt2hQa\n93PPPZff/OY3ABSLRb7whS+wb98+7rvvPn70ox8BwgoyZcqUgxrvCBEOJSLyjhABkCSJCy64gLe+\n9a0jkkEQ//Iv/4Ku6zX3jR07lp/85CdeVPiFF17Itddeyxe+8IVQBPrBYM6cOfT09HD22Wd72qaL\n6dOnc84557BgwQIuu+wy5s+fzymnnOKtPX8lLF26NOTvDl4rFovxwx/+kFtuuYWLLrqIT3ziEyxc\nuBBJkrj88sv53ve+50W4z549m3nz5h2wr1dffTWf+MQneNe73kWhUOCyyy7jS1/6El1dXfzgBz/g\ntttu48ILL+TPf/4zN998M5Ik1dwOwvJx88038+53v5tUKuUt8avEO9/5TjKZDBdccAGf+cxnuOaa\na9i/fz/f/OY3ufjiiznrrLO48MILec973sN73/teTj311ND5N998MytWrGDhwoW85z3vYcKECYwZ\nM4bzzjuPdevWceGFF3LRRRexZcsW/vmf//mgxjxChEMFKarnHSFChAgRIhxdiDTvCBEiRIgQ4ShD\nRN4RIkSIECHCUYaIvCNEiBAhQoSjDBF5R4gQIUKECEcZIvKOECFChAgRjjIcNUlaenuHD2l7LS1p\nBgcLh7TNNyOicXztiMbwtSMaw0ODaBxfOw71GHZ0NNTc/qbVvFVVeb278HeBaBxfO6IxfO2IxvDQ\nIBrH144jNYZvWvKOECFChAgRjlZE5B0hQoQIESIcZYjIO0KECBEiRDjKEJF3hAgRIkSIcJQhIu8I\nESJEiBDhKENE3hEiRIgQIcJRhoi8I0SIECFChKMMEXlHiBAhQoQIRxkOK3lv2rSJ888/n7vuuqtq\n39KlS3nve9/LZZddxo9+9KPD2Y0IESJEiBDh7wqHjbwLhQK33HIL8+bNq7n/a1/7Grfeeit33303\nzzzzDFu2bDlcXYkQIUKECBH+rnDYyDsej/Pzn/+czs7Oqn1dXV00NTUxZswYZFnm3HPPZdmyZYer\nKxEivGmhGxZL1+6jWDZe76542NuXZ822/te7G0cNXtjYy879wyxduw/Lsl/v7rxq9GWKrN8x8Hp3\nA4D9AwVWbekDoKyZPPdyN7Y98tjmSzovbOw54DFHGoetMImqqqhq7eZ7e3tpbW31fre2ttLV1XXA\n9lpa0oc8Z+xICd8j/G2IxvG143CN4d2PbOTXizdw3twc11x+6mG5xt+Kf/nm4wDc/+13oSiHTn/4\ne3wP9/Tm+NEf1ni/48k4F8075rBe83CNo/vcf3XzQpobEoflGn9rX+79+jv4+d0vsmzNPmRV4aK3\nTq55/I9/8SzPv9zNdVecytvnTHjF9o/Eu3jUVBU71JVuOjoaDnmlsjcjonF87TicY7hhu9BwN+wY\neMM9p737syTjh2YK+nt9D7dWaKobt/dz2tS2w3a9IzGOXXsz6K3pw3qNg0V3zzArN/YAsGnnAKcd\n117zuA3Oc3hh/X5mTWw+YJuHegzfUFXFOjs76evr8353d3fXNK9HiBDhtcE180lIr3NPqqEZ1uvd\nhTc8SroZ+m2aR/+YvZFcOJZtY5jiG1EPYAVqrheWgsHh8hHp18HgdSHv8ePHk8vl2L17N4Zh8MQT\nT3DmmWe+Hl2JEOHvGq6LTnrjcTdGRN6viHIFeRtHsc/bRb6kv95d8GBaticQqcrIH0mLY+bP5N44\n5H3YzOZr167lW9/6Fnv27EFVVRYvXsz8+fMZP348F1xwATfffDOf+cxnALj44ouZPLm2ryFChAiv\nHW9E8tYj8n5FaHp4jEzz6CfvQun11byDQWeWZeP+UuSRddn6VAyAzAE072x5iKZE4yHp48HgsJH3\nrFmz+NWvfjXi/rlz53LPPfccrstHiPCGwf6BAo3pOOmk+Nx6MkXSCdWbEGqhe6BAQzpGOukf0z1Y\noLk+QSJWHbiZzZUxLZvWxmRou+Wazd+A7H0kzOYDQyUUWaKp/rUHSFm2TVd3jgmj6pEliZ7BAk11\nCRLx8PMoayZ9QyXGtde9pusVSjq7e3OhbYPDJbJ5jaa6uLetN1MkGVdoSMcrm6BYNtiyJ8u49rqq\ndwOEANWXLTKmrbqvA0Ml4jGF/mzJu+dK2LZNV0+Ose11ntnZtm329OUZ116H5IxTXeBdz5cM9vbl\n6WxJeedYts2W3VniMZljRjfSkynSmI6FYiJ2dQ8zpq2OmFqbZGudUwslzbdmmJZV9bdl2WzenSGV\nUEknVOJxxfuOhgo6fZkidakYqYR/naV7V/B/G+7lIye8n4s7zjng9Q8VjpqAtQgRjkaUNZMbfvYs\njXVxvv+pswD499uWIQG3//v8mufohsXNd6xg9nHtfPQfZgLQny1x48+X8455k7jk7ClV51z7388A\n8D8VbbpKhvw6cLdpmWzL7mBq85SawsOR0Lw/++OlQPW4vBosfm4X9z6xlcvPO45Tj2vn33/6LBOm\nZ7A7N3LdnH+jOdEEwDf/70V2dg/z7Y/No7059aqvd/MdK+jLlkLbNuzKcO2tS0L3c8Mf7sYup/nF\nxy6vauOexzezZNdKmuoVvnvlZVX7n1i5h3v+spmvXHU64zvqve2mZXljB3DlBcdz3pzxVeev2TbA\n9+9dxZknjuaqd5wAwOMv7uH/Ht3E+88/jtOmdfLvP32W9iZfcFi5uZdfLd7I208dxwcvnAbAqi19\n3HqfiKq//v2z+c7dK5k+sZnPXSFWSGzcNci3fr2SudM7+fgls6r6kc2V+ffbljF1XBM3fHBOjdH0\nEdT8g0vvXFJ/YVMvP7l/rbddSuZomLUSuXkaVqaTz922jHHtddzy/73FO+avu5cAsKJ7JRefeGTI\nO0qPGiHCYYRmiAlhKK8BYDj+tQMZP4tlg7Juhvxre/pymJb9N/vcfBPhkWfvezf/ie+v/CkrulfW\n3K8bZs3tbyT8cevDfGHJLWimxsrNIsh21ZY+9vYXIFair/FZ+kuDdA3v8c7Z2S0ijQcOMrhJt2qb\nkSuJuxZ2De0hPmkDieNfrLm/qzdH4riXKI15ofY1MkVsoKsnrOFXmrbXjrAuf9veLADPrNnvbXtx\nUy8AK17uYSivIaWHyE+7H7mpx2lLRG4/8aI/Zv2Be3VzAGzYlfHb3LsRKV5gxYaemv3oHiwCsGVP\ntuZ+0zIxLfG+BX3uZoC8NSe+oC9bDJ0rN/Wjy3kxxoo4d09fPnRM2RDPOqkcuSVwEXlHiHAYURlf\ndDDapghSssnGtzGsiUm1NyMmt6DPc1XvWnoLB0524pL366F5P71HJF7al+/2tlkBf+PR4PN+ZOcT\nDGnD9BbD45zJlVEa/WVcRaOaaJUDBEC52DW8m2v+egNL9jxb+wDJIjZlNXJzd2iz+1yf6FpywPZ7\nC/6qHsuuHm83IK43EyasSvIeyVRdy6Li3rdhWpiWTWzsVtHGpA0j9jN4vf4KoaW/OMCSwu+Jz3hu\nxPMrz6nEf734E7723HeBcLR7Tisi1WWIH/8Cw8ZwVV8AJFXz/pbragsHZVMck1CqXReHCxF5R4hw\nGFG5tEc/iKU+Zd1Ead9LpvU5/mfdrwF/cnU1hf35Hn625k7+47n/8jQGoCoDl6d315hkNw5sYU3f\n+oO+l1eLxri/TjVI2G/0pWLuhAygW+EI6d5MESnuE0ZBD5MfgHIQEtOjO/8KwIPbH625X2ndj9q+\nl8TxYeuFu7xpW3YnALZR7QEtaQZFxSfvWn0cibzzAQKT6rIMpNbXJH+5xj2qTuCXYdqifcVpq0Yf\na12vp6Ivq3qFCVtOjEzQwf6XtDD5dg3vYcfQLnoKfWim7l9L0fn+y98mOfNZlOZetiUfreoLgBQr\nB/7WqIWyeeSj0CPyjvC64I2UZvBwwqwgU10/OPKWG4RWtze3D/AnJ3epUG9RTMq6pYcmm0rh4EBW\n8x++9DNuW/2/r9ifkfBiz2q+8dz3KRrVpBCc6IPEFyTv16p5W7bFrSt/7hFg9X4bpW0vcsv+mvtf\nCbuGdnt/l4zw5NybKSLFAuRtVCeRMi2bIW2Y32/+MzktX7U/eI2JDeOq+g4gN4nnbNvhB1jWTWzb\nJlN2TMuKUUWufZkScr2vKeZr9NGNZu/LhImx4JiWpbosyZnL2Bd/gd3De6vOryWfuEuuTMuirJtI\nqmjLNqsDNF3BsxAwZe9zTNKphAgEfMkhb9saWRjakFvtPefKe1m+z3cZ5PScZzavJGJNyZLT86G+\nACEhDTV8zu83/5lfrL0LzXnHa1lgDhci8n6TwPUvWrZdMYGaNY97pW2vBat61/LJJz7PtuyOmvst\n2+Kl3rUU9dIBr23VEAAOVV9/uf433LT0m6/6fLcfQfI2TCtEriMJMJpmIsUFIbYmW4Cg2VycP1jy\n/YHByaaSED2zeeC3bdvkdJ9MKv3ouiGIwbKtmtqWi9vX3sXu3F5W9673zgHxXILm/JAGG+hfppzh\nzvX30F3oxbQsBofL2LZd9QxFX6rHqrfYz4bBzdy/9aGq4yzbxjBMYsesIzaxtrm2ss18SQ+ZTHcN\n++RdOSn3ZkpIcX/cCjUEGNO0+c7z/81fup7imb3Lvevphskftz7MZ578Mn0lIaTJkh+xXiwb7O4f\nRG7sQ2l0yLscDnzTdJMhLYdhi/5KEhR1v49lzWT7/iGkhE/Y+RoCxLDdR+yYtfQMD4W254o6iZlL\nSc70a04MlAZDxzyzdzm7zZeRm3tInb6IPbl92LbtRZAXk7tZMfCMr3lXQB2zleuXfImCXgwJoK5F\nJp2IYds2e/Ou8CXh2pJ0w2JgqESuqDNYyrIz/gyJ414C/JgDwzK4e+PveaFntX9fWt5/xnJ1v7Kl\noWqzeby25m1ZNn/peoqVgfaPJHlH0eZvAnQPFvjCT5/l4jMm8fLOQbbvG+J//n0+Dy7bwX1PbuPm\nf57LxFEN/PWlPdy5aCPXv382MyYJ0vjNXzbzyIouvvWxeXS8hsjZIO7fIibbv3Y9w5SmY6r2L937\nHHdv/D2N+jF0r5zOj649J7QsA2BTV4Zv/t+LfPySWcydLrLzPfp8F3c/tpkbPjgHPdVNS6KZ0XWv\nLnPfc/tFAJBhGajy3/aZvLxzkO/cvZIPLZzGceP9VIr5khEycRumTUyt1iZKuomUEGSwfVeZNW39\nXhCNaxbvCfgyQ5p3gBwf2LaY4eQQ0IYkSZR1k49/90nOOGEU557lB9Zc96On+MAFM5h/6niG8hrX\n3LqEc04eiz3xRV7u38R/nHUjsQOMgaZb/P//+SRnnTiGf3nHDD73k6VkpN0kRCBxyKToE7PFnwZv\nB6A50cSG5Z1s2JUhEVcoayafuewUZk5u5cWe1WzdrPDw091V0dvd+XDw0n/d8xLb9g3xb+85ke/+\n5iWuuGgikmKCbGFaJorsE+SqLX384Herue59JzNrShvb9w1xyy+fB+CbH5tHZ3OKVV27vON/8dBq\nxqnT/WsnXkJp9f3QtUzSmfKgR3j3P72de34Dn79iNt/69UpSpz8ROvalbfvZ2TlMc32cz922DOnY\n5SSm+wKQVKHxLVu3n98/v5LkTH9btpynLp6mpBlc/+Ol5EsGiVk+mQxr1Zp3X/ol1NR+8oqBbvhR\n0kPlAnJdmNAHy2F/76833AdAfLLQqH+y5EFSPafQ0SKeUXncc7yUAzlZ+x5iEzZj2LBpcAu5cnXf\nDNPi9kWrKDrmckm2QBZC4pduX06PE6TWcEwXeJ+5ze0PvsykUQ3stzZXxRIM6znyJdFfqYZQ8Z3f\nPUeb7FtB5MZ+5PQw2BJIdugeilr1+bWEuMOFSPN+E2CjE7X50LM72b5PfJCWbXPfk9sAPzr0waXC\nf7Z07T7v3EdWiIIxm7p8Te+1wtXmRlp77EbuZhFmuv6hamn2ryvFMfc+4ZeSve+vIjBm+cbd/PdL\nv+CW5f/5mvsa1BoPFktWi34/tGxnyOddKOkhzbsye5aLkmYguf492eLhZ3d6y1hcTb7HMZvHlXhI\nU3DJ0bAMFu34CwPNKwDxvIediPdn13fTlfMjfVEML3p2l6O1PLVqN893v0TeKJAp1Q7ScbEvI/Yv\nWbMPy7YZGCqHTI1lI0jezrNP+pO1bulelHDZuc+la/ezYWAzt6+9i8cHBUm8vCus+e2vIO91OwYp\nlk1+/dgG1NHbWbRGmFslSZivg1i0XBDzn5buAMS6eq/d/gLL973Attxmb5tml0NLBOzOTeJ/UwgE\ntczmPaVe729TEtaRxc+NUIBJ0dm0O0NXTw7dsFCawgFykmqA5L87f1ixKqQVA2RLeedehCY7arRN\nLOW/vzm9uo+u8UFp2093xo84HypWa+lBzTtoNZJi4t4yWYOd3TnvGVYimbb454unM2/mKN4+2yfI\nn6/9FXvG/L7KvVHWTZZt3hHaJsU0hgs6PYNFL9+BlvYtJA2NYoy27xtClqvzIeS0PANDzvtYg7yF\nZu5bshLTxfcjmXHn+v67nCtWZ4orRWbzCIcSlZGiUrzAQCHrmbfcCdUNsKn000LtwJRXCzenkSzV\nfv08U63j56t15ZST8CS47MM1t9nqa5N+g6biVxOIogdyJQfHslAyQj5vbQTyHtZySJK7QNsMCS8e\neRd8Yqg1BtlymKzcyF8Xe3P+RCkpJprmulWcyzb4wlqmXE3ewTEKEpebgSpE3gEBSDMsUDXkQKT2\nYDHnBWC5SMQVT4iT04JUKpPT7A1EsQ+X/D70JlcTm7iR4lh/nfJAKSx8uglz+lIr+dnqX4aEqoHC\nEHe+fA92zH+PJMW3mtiYge2Oz7aW5q0NBI4TRDE40lI/xaA3U/RiG1yhIISA1hcb5QsBVkEEBA47\nhNubKSI39DM0cTGmFCDvcjUhm5Lfn64B35ozXK6+n6DmXWt5m6aJ96w0AnnbisbZJ43lX981k7NO\nGlO1X2kJC2NlKUd8imOSdueCWJk9TuKaGcc0ITf1hPz6l54nhILebBHd9L8LN2ZgWM95Yyyp1fcg\nqToZR8gNCktSsQXbkpDrhjyXVqYQHs/GeEOkeUc4tIhVJNxPnvIUNy3/ukfq7oTvEnStmsG1siu9\nWnjBOCO8fqZDDF6QTsW1C3qRVervUNp3UyxXTxSGUjs46GAR/AC1V6F5uzm7K8k7XzIOSvMeChCv\npBj0Z/0J1jQtbNv2JlLN1MgXq33Kg+UwWelGONZhMKhNK4bXF89H3uATT7YGebtL2AAKZoA43Ykx\noKGEzeYWyROXED/Gj3LPFKsrMCVisqfpuYFKlQUt9hd88t7W7U/87uQaxEDRHw/TMulveB65foBy\n0xZW9a1DM/yJfqhUYwJWDIYdTasQ8C3bloRqJ8g774zhPV+bQS2gPTvrg0dapy8pOn2ZkhfbYBuB\n4C6XuFS/j5Ll77cKIrmKaxbvzRRDhKZaooJXvkLz3jS4FSvhH7c34z/zXMDEbmbbkGyZTDDOooal\nwRVkhgvV38yY5Hh0S/e+p2DSFu/8eFhrjU3Y4AluFB33k6qxs1tsax6TJTFNuLdkW4yHkhTj25sp\nUTQDz0lLOPeV9yPTlWrNWVI133IQ0MylvTPBjCHFyyROehqAgYL/DZzVeiGtyRaKRumIBeNG5P1m\nQ0CajKsVmrdyZDXvkczmXiCRM2lVLrfaNbybItmQ9haEJudqbg9ix9CuUDRxELkAMb0as7k7gcdU\nudpsHiDQkTSUIT3Qf8UMBVaZlk1eL2AENJ+hgJbktl+pLZtWOFguUwoICLJBWQ8njwlOpBkt7PuE\nsHBQNH1hyU0sEgzyqQxYq4zyHa6hESZiCrudSHsMYbKsDCQKmnF39PmWCKxqrTWoea8f2EivuoHE\nCc+BLO47GImdr0HekmJ4VoVcYLy1DXNRiFN0iNHVzmPHrmJDYZV/vqPlDeVqvE+2BIpBT7ZAr5sg\nJBCZHdNF/Elw3EzE3+ZQC9aQKBE6rPmad5D8k5YgviB59xT6+MHKn3r3D9CdC0SmOwKKOdSKtuUU\nVCsVGsOagVmOcDGU10LzjLZsY2voAAAgAElEQVRtFi0J0Qc3ULJWamC5gryDEfbGUKM3Bjv2i/dR\nSfnHj5VEPIIuFVBkib5MMWzCdsZzqJwj4zyDWj5vggKSs9/oHYdeSHrjLzlj1p0V42V0T6TdmEZK\nTWLaZkjjP5yIyPvvEAOlQTQzaEoNkETg5Yx55C32u2ZzzSpXSemHMsmHa3IdSZu3bLe/Yn/l8ifX\nZOx+XJU+tqLtE9NI0dI/X/Mrfvly7dz6w4Go3FdjNvfIW5FCVox8yXCehY0yagd7ctVLbwAKhk/e\nUkVErGnZVcQ8FCC/2uQttO6gmX446ANWzCrNO6g51zKbBzX3vOFf39e8S9imgm0qlMxqn3cQtZaa\nxWOyPz6KDtgVS+L00Du6dzAgyFnV01qwv7XKoxYD1oOc5vfX1WpRDE+wcTXvuvxUrFwrshX3rDWu\nEKS2+W4J28bT8mwIERtA0m5Ckm36snl6B4uOUO2/N0nTIe9gwJfzHWtbT8Z2hJu8JvrQmy2FiClh\nC7N60KKU06sF3IGCP0Yll7wHRoMZQzHTDGnDXpayWm4C1zIwVNA9rdUcGIXZN576mMidviWzXRxb\n49sXAl8wsMA/xsoJ8pcSBU/zjiXEOOp7jmVS8ngAslqWtqYkvZli2IK2XUT2DRQDgmhgjLRts5x7\nEGOcSqj+flMV30dIKLTodSL0bSNGb6ZIWhWBevkaY3M4EEWbH4XY1T3ML/68nk9eeiKdLeGi9nm9\nwJeWfoPx9WP5wunX8H+PbOIvL/oaZnACiFVq3g5Db2m+l889bdO89VLv2B/9YS3nzxnPFRccf1B9\nfOCZ7by8c5Dr3z/b+1DveXwz2ZxGLqWBAiDxg3tXsdkpSPCpfzyJyWMasQhr3pWlI7tdf6/zcXUP\nFkKBa3nTn4SeXtPFI8v3ceOHTvMi1otGkUw5S4MzET350h4ef3EPLQ0Jpk9soXNyteb98PKdrHi5\nhxs/dNorWiFcYUOpMJs//uJu9vUXkFI54pM2cHfXBo4ZfQ0dHdNC5xfMvC9WKz7hqopMMbGXb6y4\nN3R8vkLzfu7lbh7ZvAncVNWySV+2xH/+5iVQyySmr6Bo+WSlxk1PAHIzuAXJ+4k1W2nMdHHh3An8\n/qmtDA6VGTfL13SDxPenZ3aI8+MlbC2JpOrsGxxC003iMSUsSOL4CZ3+j++oZ3dvDqV1L08Ul1G2\nXQ1JRBkPFzRu+eXzDAyVuPyiseLWJBnLtugeGgScSHTJH3NbjyPFNJ7esI1ZiX5mTWmrGVQUFCCW\nb9hLYgZItoK2+VSSJz8VIkPN0kgACScVpmzF0S2DsqlVuUJsSwJLDWt5FRpfPpNAaRVBcXv6ZEa1\npukPHJM0WxlmK/Gpqyi91Iytpfz2jJj4B+zoHeCWX66gN1MiMcrC/WpiOLWoy4M8u24/v35sM23j\nstDqdlJEUvfkMlzzvb/S2ZRkz2CWeDNgim9GNtPY2GTKQ7SlWnjmZT8S37+voNbqru0W53em2wG4\nc/09nNwxq3a8i2yKNtzgMMdaUVpzJraewDZU1NE76Fk3FmjwrmFmOmlKNEFJBLt2NI9l3fYBVm7d\nBzEorT4Lu1SHjEJ/UVhrmuvj5J0xHNf/DrYMlGHKWi/4rqkuTlmvuIf++fR0OMl0VIP+vAZpMUZ9\nmRId44UrIK8XSODniT9ciDTvoxD//fs17O7Nc/+S7VX7smUhDe52tJYgcYP/UUE1eXuk5Ex++/rD\n2vdjL4i2Hti6iJuWfjNkuq3EH57ezoZdmdBktvi5Lp5d3+1V7zEsg1Vb+ymUDTI5jXXbhfZkOaTq\nkXeFGd9dJuVOYNv2DbFuh29CzZm+VnnnY2vZ11/w2ga8NchlS5DDLxdtpKsnx+qt/fz2iS0hs7mr\nzdz7xFZ27B9mYFhM/I/tepIvLf1GTbO6YYj+xlQ51Hd3PIPEuLXGWveiJTRZ24gJE51kkUooJOMK\nQx3LveMkXZBVIUBGmmFy2x/XMRQ0dct+pLrascf3IzqIxy3vOXmacUwTpk5bwlSKvLBR+JT/vHQn\nz6zd7yWPUWWVklXh/5QNpJiOrSWxTRXd0ti0W5hcS3p4vNpTbRhogM2oVnE/8amrKdhhbV+KldnX\nX2D7viGyeY2N+4VmO8FJbtJfEM9/zrQOkulAycfhZmxLwlIK/OB3Ivgp6Au1ikIjdCO1g+M1Sj8Z\nu5wS5tsa5OvmsVY1oRWu69/gCUFuxrPyunnYhhr2V1eQt2vilhQD07LpaEqScuSQs8fOo8mY6B3b\n0C76Kak6tiWDrXiad1e+i+37hsgVdU8rbUk0M8Y6EXO4ha58F89u3k6uqLN70DeB1xtCECrbBbbu\nzrJsXbfXx6ljBOm675rrLtm63/fne/0P3CMO8br7zh53BhMbxmNjM1QerhKgZEMoIak5j3uBeZKq\nYVsSdrEejDj67uORZBu5LksqoVK2xHc0a2In582ayrFNk9kwuJlp08XzH8yL91wkh5EYk5jAgN6L\nlMgzujXtPceGRJpPvWc2tiV5yk1bU9IXnB3yTlvtTFBO8PqWLfrfaaGkM7vzJE5sn0FnXTtHAhF5\nH4UYcgJC6pPVfiP7gCUvCJnN3UxIru9VqTRlSbVNzot2Pk5faaAqorkWatbudYSDsiHuo9NZu+ul\nAK2INh9Z8xb34i6BclG2AxOxc7/BW3Ozk2mmVm1Wl02yAZPycCk8ybhc/IctDzJQGqyZdcowLZAN\nCsmuqr7Hj3+exPTnvd+1AuLKtiBDu5T2+pSIKSiyhGwEAn1KQrovB8hINyykZA65MbBGOKC9h9Jo\n6qItJRYgb9MCbKRYmeZEI7aeQIqXqtJn7sntJ6HEmdQwAc0uhd6Vjsni2tZwC5gKyCb5ongPKrN8\n1cXS4n1QjFCZSxcJSbwbUkwjmw8kRTHFxDyrbTqqpNDPDuIxiX+7ZBbHT/K1HqtUJywA8ZJnBXH9\ntbGhiRyfOA2A/ny1sCNZCiCBWaE5O/uTagJFlogPC3J9dt/zvrAqmyT0Nuxio/C31iB/M9sqtErX\nv+1sb29OolllpjQdw+XT30NCrqO8+RQAzjhFVC5D1T2N2y6lMTPtKI0DyM3i25Ad8rzm1I+RUJKY\nvULI2Ws6Firn2zH6RzPJnOe0GXgXnb5ceubxpBMqaOJdcZMDlZx3Ttt6EqUX345VrAuR97hRTh4B\nh/iS8RjHNYtqeHkjXxWVPaV9lD+8DQNCaFUMMGOeddF2+iCpGsm44rXxrxefQjKhcsGkc0UDDb1M\n7KzHkp3+OO/8xLiwcClt+xjVmvb6m5QTzD6uQ5i9nW1j2tJV1gNVkZkxfpTTB92L1bDNGLppM731\nOD520j8TV0Yu9XsoEZH3UQg3pWFDuka6wVcIsAp+YO5yD9eXqCgyECAb+cDZygyrOjBDMzVRLEEO\ntx3KmuWQd8kh77FO3WOXICqXihkBn7duGV6gkrf8JlS9yaZsB5f4iD4G/es9gexfwSUvUqJA6rRH\nWbTjL962XLmCvKtyh9fI8mZaxI9byZ66p9haeDm0T2nuC/2u5VPXKGDbYDlZtSTFEOStSEiaX3fZ\ndLRGzfKfeX+5j8TMZUiq4S83CvrNAwFKdllMikrM94drugmqjiTbNMQbsEsppHiJTD6Q7U6y6Cn2\nMrZuNC1JQSZBa4LWuB3bkjB6JmBbigjGGhSknXeimG1TYczwOYK8Ee9lMmVWvXMtsrOkKFYmGxDS\nipYg77H1Yzix/QSM2BAtHWUkSfKIxb1HW0tCrOwJGG4msobisXTUif5nQwF8jvbs+DhtUw2Rr/ve\nJeQE8ZiCVaynM93OjuwuQd6ShSTbWIZ/vhCgrND5Vq4Fu9jgkYvSIN7rlqYYNjZJ1dHsFckjLtcq\nIyl6IChNwth/DBAonOFcI6UmUWQJa0jYyPP0e+MNYPaOp0FtAFsKPUM1bnrnx2Iylkve5QyWbVGS\nRTu2HgdkQdJObAKAGjP8sUMQn/usl+xZ7uVkd3FSu59tJnHcS6RmPYuk6khmjNaGROBaQEwnHlMo\nODEPKUX0rTMlNN5sOYuqytiyLtwWtqC5MbFjkWwFdew28vWbkRQD25KIqWIcG2PNSIkCclOvqG8e\n8Hm7z8H1a8tNfdhjnRUThloVVHskEJH3UYxaUeFBM26tJQth8hbHFsoOwcmSZ+6CEaIxAyjVIJ4l\ne5fzu81/Iu6UKHQTHlQm+we8oLr6VIzm+rgXqeznwq4OWOst9PmEqRiA7ZVelOoyJE56GjsogDj3\nIAX81K7mDWHylOurE9G4ZnMXlZp0Lf+pYfpJNoaMAye3qWV216WiiLB2J+eA5m1LgQxteYe8bf8e\n9pZ3ICkmetfxGN2TgLDmHXz+linGRFZNz/pSMjSSJz0FQL1aj1VOIUli+ZUXSZ7MY9kWY+pGe+lb\ng9HphprHLtWBkRBaqwQ9Q4Jsi6azpGr/MSQK47wJXU6UeEL/XxLTnwtZB1plx7edyntCK0DRsa40\nJxqZ2ijiMFIteeceAuRddDRvyRcw3ICidCzFqAZB3iUr8Jwd8rZ0550xYk6kcfC9EwlyknGFsm7S\nnmwjbxTIlYre+YYhuwMi/ne/rQpSsDVBCLGJGyFWoqlRXNclJUWWQBcEVrByoh+q7hEj+OlTvWVy\nAdO+KsvYWgoZGSvmm91BmHx1A5JyGjk9jNK2l8TMZ5AdQSKlJokpMmbJ1byz/HHrwxjNwuftWg2E\ni8dGaXfW5scCPnkH7rNetm8Fd738W4KYN2Yu7516iX8/ySFQdFRJCEiiLdcXrpGIyRSMIkkl4WXO\na3LqqWfKQyKHhaO5e5kiTJVYYTSSbLFOe1ospzNVYoo4/8KxFyFJoI7ewdi2tDf/ueMcU2XqnMC7\n2Litfl/N2EEVHDrUiMj7KEN/IUPylCeQW/bXXCccJCOj1gtVg7xdYrVtO+QTr6V5u/5qqC7WAHjL\nJJTGAZANr22fvG1PA3LN5om4Qkdziv6hEoZp+ZHyjoYeLIPZEyBeSRZtuZp3/NhVyE7mLjcgxtMw\nAm0EyTtoqQhOhi4KevgeNcMK+fprJWUIErxsB9s8sLAFIg7AUHJYpTS25ZyrGCTiCrIsYztadHnT\nqZ42plt+H12t08o3CpO1c76LEHk7yT0kxcS0bAzTIqsNeoFCti152rmUKLK/wmffnGyixZkw3XSu\nSBaWpHt+WDdCtzebc8bL0byNGLph0eFoS7HxIpuZXJ8N9bfDnoptg9wUWAoGDNvid0uiBVsT10qk\nxHlFowSWjLZtltBuXXOrI2C4qTjr4knGNAvhQx29E6VjlzceAGXNyXtQSiPJlne+uz+pCGIp6xYt\nSeH3HigP+uTtnO8SnEsGleZYs38Mck6YY+Vkgfp6cZ6vecvYegJsWJ9ZizJqJ5IEiuW7GWwthW0H\nnoOsE1fiKLLiLAGVSMuNSIkCx09o8oPLzBjDeY3ZTWcgqQbxY1cj1w1jJ7NOH5IidqMk+rJjaCeP\n7XrSfxCGK4CIMY5PWRsao+A35Uac10JKTXJK58zQNkm2ScpJLzmPa2lQO/YwNGoJBb1ISvXT5SbV\nBEklSaacJaZIQrM2VRJxcb5hWpT2jQ9dwzZjnvvw2JaJIj4hVqalIeG9h+51Fdm3HoTa0OO159rD\njIi8jzI8vnOZSBRw3Es10xAGyeBHf1hbtT84eWu2+LusmTy8fKfIchUMOlGq28+X/PaD5kkXwQpS\nUsqv4OOlHJRsQbrgVeJJxAR52zbc9sd1dGcdE6ZD8oWSwc/+tI7t+4ZYvlVIvHaAmAbcDGSBicI1\nobkfYFk3uOuRDdz57OPszPjpX3/1WKAkZg0ff7DYA8AdD7/MMxt2+PsDWt7zG3q4/+ltXoY1gBUv\nB7JG1bBkDBULWJbN/zz0Mt/9zUrW7u0CyRZBOs49SorQvFVZwpZ1UnIdVqbTm1SCfmQ3iMc2Yz75\nB4Uwl0D2noDpaeZim6abFAMa6JT6qb5Glyiw28ls5b5DxbzM48sdM6yreTt+U9fE6fZxy0AXS9fu\n8zVcI0ZXT47lS2JYxXqkej/gUJIgZTfz2TmfADOOlWsRVhGnbbmhnyFpPzNaj6cp0YBWEtey4jl+\nt/lPDJYzKHoDZt94QAqR992PbWbzfiG8NSTqGN/a4l03Ptl5FxzXgusxsYvChy4lnVgKR4BKqkkS\nMZmybnoWiKyWDWhsYdLxfMoBUhjVmgYkFIe8pXiJtMNHSVX0W5Yc06/zPONOXeykEqg1YMvCwuCQ\ntyXrYc0dyA6oSDGd3WN+i5zyg62GCjpnj52HbVbTQVJJEFcV8jlQibNreE9ovyuY6Hum+hslv5JY\ncL16LeKb0Xo8lxx7MZIk0ZRoqNqfjqX9zHqm6iXsKSf3M1jOkI6F6y00JxrZm9/PQMOLYi4zVS+W\n4p7Ht1AeaGFy9p3e8dZwi5dpsi6pYhtxJFVHVWTf8uhp3lLoHqxcE+WNp4KexDBtNnVl+PkD6w95\nIaeREJH3UQYvwtGSR9C8fXJdvbWvan9wOYcZINpHV3RhmFY4KrZm1R0/GKyW5h3URCXVr9Lk19AN\nrNX1yFtm1mThk3txU6+fwcohnadW7eXZ9d3c8svneXHnDsDRLBHLSTzLgeFrIp3pDq8PAC9s7OXJ\nrmdZXlgUShm5bmcgM1egb66/2fXLu9jTm+dXj/vJN4JLjH58/1r+9MyOUCYwSxJ/nzdnfDga10Hf\ncJ6+TJElq/exbscgK7YJ4cQq1vtai2z4ZnPZICY596m7+Zb9PnqWF1MNkH+15l3YO9Zr31svr1ue\nyVnfNY2J6SkhzXt3r/PsHRLq2qvRtdsJ7nK1UvcenWdh9ApNRx23lb+8sMeL5G9K1lPWTdZvz2L2\nj64al3pzFJObJqEbFtZwC5IEckoIdW5ynrdPOFvcS05MY7uNjSLeApDxScMn7yKPPt/FcKmIbcP0\n8e00pfzJ2DYVLjhtgvfeubGK9YrQqpV0nqb6uDdeKVVohZpmehaIgVLGF5Zcn7kTeCincsSOXYXa\nIVZttNbVMXl0g9NH8b6NHS15edBd8vVQIfx1NjaGfttaCjlRIjZlFSYaKZf8HfK2HdO7jS0sHDbU\nxVNccf5xjG2vJ2ZWL29SZMVZlSJR6vOjqMubT0HvOs57zu3pZibEnWWkqk5RFXPP5I4OLxVqXQ3N\n+x+mLOSCSW8T/ayxfGx0U6OnOYPvv3aRVivJ2zGdpzYiyRa2odJQkRBmzqQp3t9mpsPLQJlMqCTk\nJHJcFwKPa4Gq4bcHMHomYGU7aUzHMEyLp1ftZdm6/fRnj0x+84i8jzJ4ZGHEvIQQQYQCoCo0ye99\n8kzGjvJfZMM2mDymkUmjGiiUDQzDqliPWi0cZEv+MqNaPu9g6khJMTzy9uoDy7XIW+GMmaOZM80h\nXOe6biajYPUeOT0szLmOyTc4oU0d3eH9PaqCvLsHi6GsX36DZu2/HXPgLv1lnt+/MnxOILCndi7j\ngHnc6d+0Cc186rJpVUdqZjkkhA3oTgnIYr0n8UuqCNBRFAkUHQXXzxj0AYoJzrUE2IbqTToN9YHP\nXNHF0idLEe3bEpYsyLikGb7mbsQo66YnxMjJgle8xBUWTC0WIka3L+75AP/90X9gUuMElPpBhu0e\n9sbEWH78nbO5/v2zAbDyTVXjIluOS8AwPdLxVg441291TNXZrF1V79pSfWuEbz1wBQwDhRjzZo5B\nkiTU3aeKZUKKiT12LRPGiOuVyqKm9LX/cBYA8+c1M3/2uEAwWIJETMEGGmMueQ/6AW+u5u1o7kpr\nN2rbPuQ6IYTc8E9v83IPWI5PefrxKe+7cjVvL6eMHo7GnzVugvf3tz8+jwmdghzV9n0YUtkjb1fz\ndp+Vi3Qsxa1Xn8Ox45qIqTIzx4r2bEOl05jBW8ecDvhLSs0+EX9QrzZgDY5G6fdzPnz742+lNS2+\nydiY7QzYuzm+ZSpffN+5/MvFM4BqzbshVs/ExrAZ28qFBZJxLS2hnPZSxZyUrqHNB2FrqVBFwpnH\ntPD2U8czPjlZXG+oDdW5P1mSOG5MBzYW31/zQz/4L0DeDTFfwDH7x5KMKzTVJzBMS9R4l6Ct6dBU\nX3wlROR9lMElC3dyrUTIh1rxosdUORSJbdg6MVUmnVTRdKeggHJgzXsoQN7lGpp3KFtWgLx9zdvv\nk0fejmTtfaTudR3ydgPzpGQOuW4IK9vmE1egv2ogAdKYOmfpiUMm/dlSyKzuJVEIEHZQcLED2ZTu\nWH936B6DwVmVZnWlfTdKu798zG1TUYTJOwjbktCscMrUYVMEuNmlOo+0pFiZZFxBloVA4+ZxxlKw\nLRkppnnpJstWwIXg3INhB56pE8ErGEEiIdWhSeKZarqFZrmFMWLifdATyCjIyYDP2yFRvaSCGcM2\nFc9c6xKrazaPqTJtyRaQID9KVGhS+o9lcvNEjxRcK0pojB0TsW5YHmkpDYNI8aInILgTaX8mXPEL\nwFQC5K255F1EqsuKwCzbJ8J0cRJG9zEAPLN/GT3SJm98VUVmVLodWZJZ2buGPnmrFxOQiiW9dzet\nOMVB9Kz/jjvjbzlL+jwyQBBZc6LJS0lslcWzzpQy3mqKSh+xvP2tnNJxovd7Zsdx3t8xVWFWy6zQ\n8UmPvMU4G/smc0LsLG99eqXw3ZRwnoNkc4w1jytnvFcc5xatGWrj7JaFvHvM+wHoqCCphrgjPIze\nSVxOcOnUd4b2B8n7golv4wunX0MlyhtOp7R2nve7M91BIjYyTbkCigvTDs95drE+RN5u8NtFoy6l\n+OLbwYyhBoJZ61RxDz3FXuRkwUmyI85RFZn6eB0fPuFyJmbeAbZMQzqGqsjohk1vtkRrQ7KqENTh\nQkTebzDolsEL3atGXPJVMv3JtbbPO1A4voJ8Y6rirSEGQLZEBKVTYSlb0MKm3Rqad1+gwENNzTto\nNlcM8k4ke9Eh76CJ17CdJTfOB5WIK0h1Wc8n7loO3OVZSpvwVZt943yTslJtogY/o5N7P2U9LJgk\nqHP6GPQHB9ZDl/2JqXISDd5DZWrP+JS1xKesCbTpkLcsoVNhTrNUdFsLZR3TLFdzjnvFFKR4mURM\n8QOeLH+JkK3HQfXJW7c1Z3mM4pnNTcIJQkLEJTWiSQWQRIpUL3LdUB3BSyItNSKlh4jPWYSUzHkC\nUbkozKlWoQEplUNp24OccM+Pe/fdFHdIIZHHzHQwpnwasiT7BXMMv7b4jBZhnVBNJ5LesDxBQB29\nk+QpTwrLhy15/s7eTMl7Z+aOOhWAMaXT/HE2Y9iGitLc65fRDFilEjEFK+dr/xa+5qwqMnElzsJj\nzmNYy/F88REUZy11OuYHU6WoR5UUitKQJxDalkIqoYAR9zK9uVAlX5sDMHSFhBJnsJxlq5NCdErT\nJIKQyg28a8qF3u+JjX5lrrgqc+boed56cPDJ0hUQsFSmpWbzDqeNyY1+8hcARXIEViksCfnr6yVa\n9anETfE8O5rDxOmSN8C5o89hQsPY0H41UBP+H45d6AsLQVgqdqGJy6dcwTWzP8bcUbOrqskFMXfU\n7NDv9x1/ifftg1jn71aQA0g6wlZSjXvvnRog2/p4WJMXFj4xfm5g2+mjT0XVxfuSTsaIKRKGKQJn\nK8fkcCIi7zcYFu94nP9Z93/cv/Xhmvu9IDFbqql5B3OaV5KvLNuUrTC5xlWZdDIG2BhNO/xoVaiK\nNpcb+1jc/cfqvgQQIrMKn7c6ertXHxfAQiz18j5OtRSuUSyHydsNGDKHW3yTskNoqiJj4pN3a7KV\nhJIIpYN1NSZzqIVOyzFhBwQc19xp5ZrQd87wtlea+4Jthgs01Fiap7h542VPsDGHm/nQ8R/ANhVM\nWw/lHNesssiFbckhzTsekz1BwF0/DIARR1I16p01/yaaFyTkBqwFyRtVJyb5ZFkvi0lIbhhk7eAa\nj7xtM0beqaLVoDoR5bKN2tnlkVCx4JiF841IEsSPXYM6QQRTeZYRSQpN0ma2jQ4nKU88oFG17lnI\nl97yWT4y/QOUN84hVRJJRXTDCsUyiPHQUOwEsiRjWlaoZOqxzZP477d/i3GcGDonKIyBsxzPQTyu\nYA2OQt80h45Um3+Q5QsYFx9zPh+Y8T5/V66JZEz1a0obNu2pdqz4MI0Nzn2ZKumEeBZuJjcXHzxB\ntOUSgmHatCRb6C8OsGlwK82JJi8ILvhadaTamdQwgYXHnIcs++MXU2WSCVUkxnFwaufJ4hoBzTKm\nysyfcDafnfMJPjDjn0J9mtYqgs7M3rApOzPsv++9mZIXhNpWURmsMemblMc3d3IgjFQO2MVJ7TM4\nrmUKkiQRj/vvu7Z9JlY5yWzlHXz5LZ9leutxofPG1o/m+jmf8n7bxbqQ5u0+r2CKYzVQddHVvF2Y\nw63e30GN2nUDphNqiPzbm4+MyRwi8j6ieOz5Lrp6wqkphwoaDyz1g5y2O8kLdgyJZSuPrOjinsc3\ne8uhvCQNstCUlq3bz12PbOS3T2xhqKCFfd4V5Js3CuGkIgHNW27qJT55HWpnIA96KEDGJhYo4wjV\nAWuWbYfK5EmKEYo2V8eJ7E5mthUz60ySkuV9nHvl1aH2RE1rvw61p7kYcZ/YHD92XVL1lr5JG9/G\njq4yacXPmAR45KdvO9ExHVdq3k7U9daTwIxTeukcZDPBkBZ+ZkENasPgZm596o/c+9ctxBM1stsF\nNG+3kIax5zjmjj0RLAXN0vnLCr82s47uCCaSuE9bglhZWCVc4UMPrO/V40iK5UUoIxu+VcJdR+ya\n6yUTSbZIyP6k26AKv3HsmHX8pe8BhmNOX4yY9+wanWPASTiiashWjGLJoqUhUdNnbet+bEWQvO1S\n2iPvYKnaJI2MruskEVOxsh1YTlSxHtC8XcjJAoolnv/9T2/HtGxithCwOlLtSJJUpa25goxVrMPM\ntnGccoa3TxwrYWY7mBYkA0vxtFZJkpg35jRa4+K9NbonElMV7zovbupF1uqRFJN0Y9k737VqeX57\nQNtyMjNahb/YNWnbNqT0jnwAACAASURBVExrOZaSWaZgFJnaPLlm8Q5FVvjc3E/xrikLKrZLwrxs\nJLC1BAoqJ7YLAVRRwiQPMLlpkhfU6eLE9hOIbTsXfdf00PZgVbvebJGCM1e11CdCxzUHyLsj3Uot\n/MeZX+Rrb72h5r4g4oHnlwz8bfZOoLzqbYxPTmZUXW0BIRiBbpdTNc3mSlCgCZJ3haBuF/0I+CDJ\nu0pJXSoW2t4RkfffH/b05vj1Y5u56X+eC23/5cMb+MNT2/ijk6fc9dmokkJfpshv/rKZxc+JZTa6\nqfumV4e871y0kcdf3MOi5bt4fkNPyOddGdzh1om2yk6QkWx6Pu9ahelD/uBE0VtD7aLSbL5++wAl\no+RPtorOcN5J0lLWQRITsbb5VH8Nsmx6H+cwvdiWRGnVOZhZ5+OXA+StaiKBhy37EbxOn+rTMXRL\nx9bjFLJJfvC71aTUdCi5hr+EJ4ahy1X36Js7nWIMRh2y1iisCZIFWMSnP4fS0uMtWQHYYDzDw8/u\n8pa+ubBtKeTzdjNC2UYMWZbEGnDZ4PHnffK2EMk3Jo6qByTQ48LnHTCbG5r/2bpaaSJtihSwiu6T\ntvMcOjqcNe+uoBPQLprjzc44OlYBxe+jm9K0PuaTr6Tqjt88TqFk0NaYrE3eAW3ZM5sjfPluLedY\nYFJWHRLz/LPOulndsJDNMEkAyGaSkmbw4DIh7F7UcQVXTv8nprUI7TFoKhX37rgjSmm0jXOZnvTN\n6u77Z9swJu2n6cSWQxM7wPunXIm2YwZm/1hiqkzKuc79T29n5y4np3Z6nTjdUkgnVc49ZayXZAXC\na59PmSpMvP9w5jGcMdrv07njzwx0vur2PbQ42cckSfJIpLT2TK4c93FPuw1mF4yrI5ugAS47ay7Y\nMmec4I/DhXP9wLjeTNEjrvGdgqynTxTvUFOAvNuStcm7OdHkrYmvhfEd4t0MCl+1zOYH8oMDTJFP\nQ983GZBFeteKtoKat+dWIOxDNzPtmAP+OAQ17wWnC5fD204ZGyJvNxvckUBUVewIoViuvfZv/4CY\nLF3Tn0veiqzSE8gnPVzQ6Q/UL0a2KGtmyHReLBuUpaDmHSbkYUeDtMtpSJRANompCnXJcO7lifHj\n2aVtCpO/G6S07xhGG7Pon/DnqoC1fFlDUkyxbjemISkG/UMlLMumZJSRZJvpbZP5+GfO56tPbKef\nHpAt74PSEZnF7HLaXx8qW77ZPKb564fLKSRkSAhLREdTim5LCwWaqXZSRKzLplgj6yWmUCmXbVER\nyCHsKWMbSU2qZ1sOT7CoS6pCY0oBqoYkWSL5DAifbkX0uhtjEJfjDL94BvETlgc0b9kb/1s+IqKX\nZTuGpYhc4u4MLSkGthHnuHHNfPby2Vz/2FKkZE5MHE77Wjkwm7sJJGI6LQ0xioqF5Wb0slQScoJU\nyuCmj8zllj8+AAjTopsfqjXRChUp6t1o9JyjeadUn4ileAlUDb2QwrJt0kmVn3ziHfz48TgbSi+i\nNGRoUBspBsgqpHlrqZqatxfxK0tIkh+kqBsWsRqEI5kJT7iYPKaBK+efSl+fbyFprwim0ndNI3Hc\nS+h7BbknAqbYoJY3OqTNSSGTKMC4pk7MHuGLjqkyHQHTsV1hGscU39aHFkyjdetuFu0Sgkaw1vak\n0Q386NpzhGUFeNv4M2lPtYX93QcoV/Ctj83zAh49Td2Ih4g0SE6V91OJd501hVMmt4a01ffNn8q7\nz5rMt3+9kr39ec+d0tqQ4NZrziYVF8cGY0Nqrek+GHz5I3PRdCtErkGzubftAH5wgOPUuazrEgpR\nkLxdn/dImncwT4W2KRA3QVggPPeUsZw+YxTppMpTq/wA1WT8yFFqpHkfIVg1UpUCVaYxw3KLhMih\nYhCFkkFf0c/JLSt+SktX8ivr1oE1b6fghuf/U0zH5+1XParPT+WczvnO/mCqVD/pQn3MCc6p8Hkv\nyQo/va0lhd9WFVWSBoZLXtnIpkQ9qiLTXu8kvohp3sdZtovexGa7NZklV/O2QdWwveAmmTq5Ednx\ng7c3J0WQn+l/1JYernYkMi4JE2nZ4V3h47dJxhUM1zfsCACphIrlraUuh0zwthFnsiYKIXjJLZzx\nPnPs6SiWWKftEroiS2TKQ0hIjKpvce6gVhIVE0yVeFymPhUThUEUC1s2sJVgoJjTDyeoTZcLtLW4\n5nKfHBpiDWS1Idqakshp8fyntvqaVGstDckQZnt3kp7deiqzW+eIcUgNI8m2l9WsLqkSjym0SZPR\nd8yEgQl8aMpVBNXFUGCSLdf0eYeIXJFFwiBElbRa0buSmfDM+lPGNFV9R5WBQ9bgaN7ffg22YyUI\nam5BIvdWKQT6EkRdYAKPq3LIxxn0N4MTsJZUkSSJ9nTAOmGE1x2nEiqyJCFJEv90/Lt5+4Szqu53\nJKiKHCLaWvcUJKr4K5C3JElV7cnOto7mJLphsddZMphOxqhLxjyiDa7jrmXyPxioilxlNamteR+Y\nvINCyiuZzYPHnth+AnVqmium/2NVm3WBQlCSJHn9DL67ifiRo9SIvI8QauUZr7Xd07wlhd6MT475\nUrXm7aIuJV4iTTdDRSrCmrfN5sw28Zdjcg6bzZ3lN6UpXuIKKWg293Ihq6QSKkk1GdK8dctgW2GD\nc7AFZswj/L5Myat85Urnk5pEQJJclyURU9AtQ0RKuyZ3JxmDu9YbVRdm4YD/s0FpEfV3FZ2OppQg\n74DmrZXcQLhAZivHZFly5CK1bT9Kx26x3MPSPHJXFYlEXMEoOwJATAv5usHGGhjDhORkp9604Res\nUBNiQgsUtFBkiaw2RH28zsvFbLqme1dIkiyRWMJUfFOuQ85lCuiKIF+9GCAMxyeXNftobnL8pwGz\nbGO8kbxeYG+xy8vHPGOUr9U1JeqIyeIeZdstpOFkbnPMo+lEgg/NfC9WOemZ122nIlk66QpbNnax\nAXXvKbSkwlHESSXBuPix6LunoioyTfV+JLoLNaAdKrLkFXoQmncN8jYSnvm2crKH2r7HukCyjrBZ\n1m+/MR7O8hXsl/gd9h8Hr2NraYrPBXzRAZ93Q9zXhG0zTN6viFfBg0HNVKkIbHu1cO91pxO3U1dJ\nskqck9pnVvnjXytqEXWlUHWg/alEtQl+pIC1hng93z7nZs4c+5aqNmu9ZxAm/1cSKg4lIvI+QhiB\nu6vglsNUZYW+rK9590gb+O2m+/0DA8TqlgYtaWaIUIOat9zcy7J9ItLbKjnLpGJlYorsmM2dNddy\niua0Y/JSqoO9MEVO6sZ4AwOlQTQnA1le9zOvGb0TsE0VJSau35sp8v/Yu/P4qMqzf/yfs81MJpls\nkAAJ+yabICgo4i5Qt69WWxUXcKlaRVu1daFUpbUPuFT9Wbva1trqQ12hllddeLpp1YLWlcUVtAjI\nkkD2zHaW3x9nmXMmM5mQZCYZ5vP+h8xkZnLmJMx1rvu+7uuOWevL7eA9uXqMeVyl+/CrDx7Gqzv+\nbZ6npJ7Y/knrzKBmNUZxFy/ZhVRCoB0DyvxQDc0zbG533rKDrjkkbZ6rcLsrWJTvRWNwM3a173Iy\nd0kU4VckaBFX8PZUrsdR3xSBz96yUo45vxO/5IMkCOb6Z8mcKxdFc7cjuwMUAGiqfYFiPq+4OLGk\nx+nnbL3fiNaGmGiNnESCieYeVrOaFr0egZB1fK7gXR4wA+nKj58xHx8NYGBxYs4x4JfNoXMAJdoQ\nCLGg01TEzmz9PslsEqO5A5V5UWF/gNt75IiC0GGeWBAEnFJ9DtQvx6KqPODMwbqzMzkp8/YOm4uY\nV3YhYlumIbLpaMS3j4PSNNK5uEgOIgBQXtJx7tGdOaWbUxUEAQtGXYzohzM7HFcyRRZR2mFnP8HZ\nIcuI+Z2LG89FQYoe+r3NfUHiHjbvjeAdjWnmErqkQCUIAr459RKcMvLkbv+MVFLOb2e4oHFfdLmH\nsv0phs2TL9DSCaYY4QAS9RoAg/dByT1s/t6n9dANsxfuftd2llu/bEJMtTJcw0BdY9jJAPeXJ/aA\n1ttLrAIq8zXNDy8De/WtqI/sT/xQV/C1h5dlQYbeWG0uMSpuhqJ4h819QgClwSLo4SDEUKPzGu7M\ne2d9G0q1oYjpcTyx/jWs/2C3s7etumeY9fqJIri6pjBiVqFdsbWOcnRlrbmOdsBufNGyHau2/MU8\nUCt428PmghL3NOZwF0KVyFaLVCWC0lJ7eU7iP09Ts13oZm1VKCUqsdvaBGCH+SErVdShrug980nW\n/2NZMiuW7QsdqWq7N/OW4tjXHIEMa3jWmuMHzEzTzLytD3YljrgRQVxXUe4aQraXfCnDzKYgpSF7\nIwvZmUqwq5TbjVZExCYYutnD2mn5GPfDiPuwH19ik/oP8xQ0JqqI7S077SmX2JbEOmDAXPs/sMgM\n3pE2H9o3HO08xh42d9bhC4lhUfu47OBk/32bc9YdPwztAJuuGtcdJCVR8BSs+WQRgwODoe0fAqO9\nFOquMdBVn7MbXjDFvvbuzMo5Blfm7Q48/qQ51YmV46C3mFXlyRciboospXyvV0y5GPH35gGaL2Xm\nndziMxv8nmJAd/DufnAZ6JqKSHXBlC2ZsuxMz3FPz9gXAuky784Up/g7AwBZTrwWg/dByJ15P7Rq\nA155dyfuXvmOp9HK8sfeRn2zOTcc0+PY1xxFZWkAJQEFQsRd9GP9J7IztiIFUtUObCsyd/s5acDp\nAOBtB2oF4UvGLwIMEXpbGUR/GIZoNfiQzbaZPsmHoF+GVl8LQdQhVe72PB+agoaWKN57y/xD/vN7\nr+PXaz7Axm3m4+zgamgydMEMmvubo4gLVvC2Mm9REJ0Mz3OeUvTrdg9ZuzPvErnEeZ8lQSvwuTJv\nLe7q2Caaeyy7s57wl0M9VePun2suvZGgt1ZgQukkSKFGSFWJZXSxz6bCMIC4nZl7Mm+/uYey9f7E\nYDPaNHsLy0TWO7bYWspTuQcQVZSUJC5A7A+BgUHz8a/sfx5hcb815SGgusIOgmaTFA0xRIw2xHeM\nxfTBiTXqdvAGzPXtRpv5evaccNAvozpoBqq2Fsks7LOCS0u7+Tu3i3wG+BJroO3gbVfXjq01f85h\nYwc6owLuoFhZGoAAYGhVx9854B16lCUhkXlrZuadnDFqmp5YrpNuODPpQzlV4RLQ8QPXPUfaWYGX\nnbFVliay/AGl5haVMsy/02CK4J343XXN6CHm//3Dxg3M8MhERul+T6LYu5k3kH4IORvsn+WTRQyz\nKtyryjpvhuK+6PLMSSuJkbVU3+/KcSSTPXPeuQverDbPESczKWmAPPQTfLjTu7ymTdgLsbzOmeON\najFEYqq5jlY30KaJEACE6meiWbaWFllV1MGA7GzacP74ryKyx9zooXaIjMtPnAXdMPBKXQPW7/3M\nyXy11lKIZXUIi/sQ9I81s1PV3NtWFAV8e958/PKjTzF1qoh3/55ocFLiC6IZifWPdr/o3U1WW0+7\nGMfOOiUVMVWHjihEeCtSJ9YOxsdNiZaR5vPNDz17pACAOd/tt+daXQ1GrOB95IwifN76mfVzXWug\nnUYuGgTr/BiajOKAbA25CmZBmL9jsxnJNSw4o2w2Pmr+wNmJKbLhWHO/agDtLQJQYgV915y3JEWc\nJVRicTNaVfPnuzPvb596PJa/vBX75E8hKDFUlgWxwzpGe8574UmH4uebXI1rrO5XQwYU47yTxiIU\n9OGdLwdgXcM/UOoL4ZRDvo5h1SFcEDYb5OwT/us89fCRI3HWCWbryTsunYn9zebWh9VNZlCwh8JP\nnjEUf39nBzTdQHFAdoYd506agj98bG7KcsnJhyEkDMCU0WbWfszUIRhUUYTRNWaf7GWXzkSFK6hV\nlRfh9kuPwODK1FXInjlvSUQsrsEwDKfaPDljVDU9MSef5kP1vmuPxusbduGZl825fk+Rmiu4JQc0\nd5CXU2Tw9187B63huJN1L7t0JprazIs+ewcrO4ja2Zq7u9hti7xVzJnMnjIYA8sCGF2ToiNZkvsW\nH43G1ljSnHfXC9Y6M6DU3BfdMHIbvAM+GcsunYnykB+KJODLfe2oTXMRaHNfdCkpRlnSFay53X/t\nHLz18V488Tdzu9p0GXqqi4NcYPDOEbswTSzdB6m0Ac2tewAkrh63la6Fuyg3psUQi5vLqEQB2Cuo\nKJaLIDQMhVi1y1xcJOowAAQUGYJsZn2TB0zA3z7ZZ3bv8oedtZhavVWQ5jOvnu250jZhPwRBgCDH\nYaiJhgMTBtdC/FhE3JpntTPvimAJmvfHzbXWmuQUpe1vbzH/muxqcSdwxs0GNIpVze5aQlJRFAK8\nsdvJrGPbJiIweb35GnIMorVlpN6ayFxDivke3t3/Nt7d/7Z5pyvzdobQ5Rj8483vG9EihII+TB0z\nEOs273aCfak+BOMGV2P9f8zzJEuCk50FjUrobaUQi5s9xwgATc0wg7ccd4oI/ZIPoiA4PbvF4ia0\nqOZzy1xz3n6fhOpQGfaFzfqD8jLRXLalJ4bNq0PeavD4DrOJSHFAdrLYE8dNw4mY5nmcX5FQWQpU\nxkc79w0vH4RqK3sqtiqFAWDW4MPx7KsfQ9s/BJNHVngyQ3e2NW5AotBt+qihngsxURBwyPBEtfWI\nwR23dxw5OH3w6ThsbjhD58mZt98nQdONRJerNMOZpUGf50PeHdB8nmHljnP0smQeQ6oP9oqQ31lf\nDQChoA+hoLeRjP1+3BcCK+bcDkkUUaIcWMFa8rntTFmJH2VJ8/2pmrR0hyyJqAz5sa85mnYIOVvc\nf0/2KE9n5DRLwVIWrKW4QAPM3/PIFH/Hydw1BRw2Pwg5w+ZWT+WInmKHKxd7yZdfkcwPJ1GDIvoQ\njWuQkpYYybIASUlsU1jXGIERC6BdS6x7tVtzhvzmB669WUMM7TAMwwrePkSsHbxkUUaFvxx14X0Q\ny/dCHmAPi7u2WlQVZ/mUvduYMyft7GGsojUchVhsZuYhV+FOyrWg9rB7WzmiH1vLk5QoxFAj9EgR\nEHd1B/OluPp27fhlX0BIFXsgKHFIrYOh7jSDn7OUyPp9+EQ/Lp9yEcT95m5DdsEaYFZdq9aOSoYu\neLL7BmsBgFi6z5mXtzd+QDwAI+aHEGzGjlZzyL0maSlSib1LkRxDaYld7Z0YNncXOk0UToTeYI6q\ndDXzce+6lG7tbUD2I/blKECXMLC8KG27R3exXbHcvXW86aQqWItZ65cVyRu8Az4JqmZkHDYHktY4\np8mQUs2P2wVv7u1dD4Q9kuD+PZX5Qx365OdCb2XeQOJiLpeZd3d4Mu8U1eBdybzNx2U+X+6Lg1R/\nS9nC4J0jTsGatYGCs/uTLWlLQ7tHud9ndmkSJA2KYG4Dam9q4ARvSXSGtQNyAPWNYQhqAO1qO1Td\nvD+shiEKIoKKGbTsIdKI0WbuYiQYgKZ4CuiqigagOdYC//h3nPvawq41yZriFGm1WNXmRofMW0Wj\n8hnE4hZUxMd4AkiqYOLeYcoZQg81QJDjHdbRFrn28lWsYUnPPLp1DGKRtca8bZIzn+tklFa2LMIq\nHrP+I8qS4BS6tEfi0PbVmIFb9cFd6mq0l0JvLYNUXg+p2mxp65f8zsWa3h6C6I9g475NKJKLMCxU\n63kPIcVV+e/XneO2P2R8kmvNtpgYdTiQzOeiCeeiSC7C5AET0j4mZm0vW1bs82Qi7vXSgiDg4gnn\n4uyxp3d7HW86SoqlYnbzEZ8ieoJOQJGg6ZmHzYH0WZV7Pa6U4jF2Zt3Y0vlFdjr2h36uM9RUpG4U\nZ6VjX8wV+/v+fXXGezHoyox9nS8VS9aF2J2x8U22MHhnQTiqoqXduyuY0yXMyvTiRlJ3LtVbgOEE\nb8Xa9UtUAV1GLK5Bttbl2vPjih28NRkCBNQ1heEXzMBoN2Zpj4dRJAcgSaK5VCfuh2EIaFWb8ceP\nVgEAtP2DPB9WA4OuTRosEVenOMOqKPcrIiL2Bh3OnLcVOBUVMdnMumvh3bIwOXifNOxYs2DKZjVZ\nEcvMPa6T23C650EXTVqAG2dcA3XXqMTx6e75bxHlYiLrtT+c7WAfMMzXtocYJUl0/qO3RVRA9SH+\n3ymIb0/sya3IImCIiG01h6ztna38kh/2SgB7HXZYi2B8+egOGzKU+q3aASXmbCBiaHLK5TElUuLi\npegAMp+ja2bivuN+2GlbSltxkeL5MEquDp9dMxNzhx/f5Z/dZUnLxlQtfebt90nQNHPY3C4sTCdd\n5uS+v7Pg3dDazeCdIvPuK+5h855edOVL5q2kec+pMu/OCtbc2/Wm09MLou5i8M6C6x96Ddc/9Jrn\nPrt6Fk7w9gZ3Q9CgR4LmGlZRcTbZ8CsSSopkCJKOPfti5iYMgt061GroIgnOhhRtERXhqIZia3/h\npqg51xpWwwhamar5QWj2zd7eth0fNXyKQfIIaPW1GDwgEVA9OyxZhg8yX3dAqd8pShs62Oc0QnEy\nb6dtpwpVtIbsFe/8kXsI8VuHXdlh/99xQ8xWlfb/PfcmAYD3P+DQkhqMLR8FGCnmvGF1RLOqdodV\nlzgZROyzqYhvH4ehhrk8ys4AZVFwisbs4VmtvhbaPnP4fNSQUqdHtxEtcvrFA4Bf9jkdLY32xEhA\nqsy3LJAI3s7fhC55AtL/G30Kjhx8OAJS9pbq2PPcQyqDnkrsQTnaaMHdrEiWBOiG4azEUBTJO2yu\nmHPebRHVHJXqJCBJXVjDW2o1jXFn92NqzIu5IQO6N8ztVyQU+eU++2B3S3Vx0l2DrL+T0mJfhkf2\nrXS/d/v3ka63eTK7F7zYyd9YV9eJ97b+ffmUp+xCG8MwnA+WRPA2/9XQMXhDC0DdNQbV46LYGfkC\nsLbLnDV5IF54C04wcipXreCtSCIgxWFEfE5L1XJ/KRoANESbMArmnLddLFUe8mPP/nZz/tgXRZEc\nwHdnX463yxow3bUcZXz5mMTxqTK+MuJknHTUZLz7aT3iqo6nt5hrzysrJWyPxc1kU1NwzNQhEMsF\n/CeyCZKswfBFYBhCh0zbfXt8xRgIgoAbzp0Gnyxi9/52zJo4CDe/9ifnfert3jluWRKwaOL52NL4\neYcLjbISH5paDRiGGfyrS8px+lEjURr0YfaUwSgOKLj27EPx8z9thLprDIQae7g8kXnbRU32nuTj\nhpbhzGNGYV9TBDPGV+FnqzbA3GFcgN40EKK1I5tn2Nx1wTFj0FQkq7CaqMiDvsCGfebPOe+YyZ6i\no1NGmu1qX3rjC+e+AaW9u2/wLRdMx0dfNGDK6AGIxTV8/YQxB1Qo1VPupZR2sLH/litCfs8oix3I\nW9oT+5ink/yhe+uF0xGNe7OpMTVluPTUCc4GGwBw8hFD4fdJmDHeu/NWV10wd5wzrN/XejN4H35I\nFRbOH4+jJg/utdfMhuRs+vuLDkdTa+Iz1/130dn5GTE4hEtPnYBDhqcfteqrCzQG7yxStcSmCppm\nz3mbHxya4A3eEPREYxIkdtzy+yRnq0l7DbOdeTt7RUsCdCEOQwtiZ521UUdxJT4PA/sjDeZuZLrq\nZN5V5QEzeFsXEjXFQ1CsFOG4ad4sa3jpUAwOVmN3+15EP5yFY4+ag1DQh+Om1WD9B4lK7WDQgICw\ntbm9gLOPHY09cRn/ec8cNheUKBD3IVDi/aAt9lQrm+996hgzCE8YYfX/1v3QxXarUKxjRe+RVYfj\nyCGHdzj35SV+8z+rIQCCgepQGfyKhLlHJPp6H35IFYJ+Ge1R1cmU7Yst93CsnXkfOnoAJo9MVH+7\nq5zj28cDhoihlRVQRNmpcTDCJdAjRThhzHTPHL1znEWJC5KdrealwNETRnV4HODNEFJ1EOuJytIA\njp4yBIBZiX3aUSMyPKN3uTNve5jX3rSnqjzgyYrt77dFVFRXdF44l5xVpbsYOW5ajee2KAgd7jsQ\n44ZmnqLIld4M3pIo4sQZQzM/sI8lz0PbIympZJpKyPR30JWitmzo+zGdg5j7Cl/Tra+tYXNnj2Xz\nljlfavfztpc7WTtuOXt0W8HSZ+/HbDdOkTSn4GxHnVn1XVtqZtANkUa0W/PRRdY+t1X2jkuy+bo1\nJemvom8+4jpEP5wJI1zq3bQ+oCSGyJUwRH8EmpUZK7LobK0nyCoEXxRGzO/Zlxfo2s5DJa3mHLPe\n2DED6uxDKRS06wLMoFDiSz386QzJG4bntuyZ844797l55v00H+LbJmGYPsN6Qet+Q0R0w3E4b/xZ\nKX9+kc/nFA8CZuFdukpud/FVLqtac8Gdedvnede+xI5x7mFz9+890/RBbwaufCX1g6H7XMvlUHa6\nTaeyrfB+qzlkL7sCEsPmguBu2WmxN9+wMm87wxZE1QreiblQAAiIAc9rGFYWb2gydlrBe0SlOV+8\nP9Jo7kcNIGgFVLvNYeyzQzEiNAynjpyb9j0E5IDTKtL9HyIYkJ3g/Z+ItZtYuGPwhq/dXI8eD3To\nhdyV4F0ePgSRjXMQ+++UDt/r7EMped/idEt07Kvu5P9/dntUIJF5JweCVEU79lW49+VStwwFzGVP\n0Q+OcgJ4mb/jDlm2rhTP5Bv7nRquM2af50TmXeQ59+7fe6bCqT76XO1XCvECJpdD2U5ilmMcNu9F\numF45lI8mbfmLVhzb7cJwargtoKz0SHztrqLWXPefsneDMMM3ppgXQioMnZY2/UNG1AJn6hgQ/1m\nJxjYa4ZDRebws948ELfMPK/L7y8580bS7kh24xdFFlFkWM1gfFZns5i/Q1WwLMo4YegcDAqmn1eU\nZbFDoZqts0KT5Cvv9MHb/Dd52FwUEsHbnitLfs1Uy4DsD8p0u8h1PE4RRqzIXG5WuRdiJzsu2Mv4\netJoo78RBAGGYSRl3lbw3tcOvyIhFFSSNjFxZ96dz3l39fdwMGPwzi7nsz3HDp5PgT62e387rrjn\nn/j7267+1/HEsqrkgjVB6ph5G9awuW7vNuWLwO9zZ97mtZZTdWy9hu7KvJvaYigt9iHgk+GTzCD9\nft0mAMCospEA9OeBHQAAIABJREFUgPJQ9ypFk5tcOHtuW/RwCLIkmPv/Wpm37rOat8T9Kfv+njv+\nLBw39Oi0P7OzZRzJnbHcyoq9c8IBOXWBlz13XGQdWyITTPS/brcadSRn+ikzbyl1Jp+JfcEW19MX\nOdmvOXxQ560h84m9JMtd4S675rQHlgc6jES4f++ZMu+DbXqhOwrxHOTygqWvLqaZefeStz7aCwBY\n+ddPnPvcm444QyuiO/M2AAiJPautYFgSrwX8m6AM+wQ++WRnztvOvINyEIh3zLxrKspR7h/gVMi2\nurbpBIDRZWYR0qSRlTh99ghMH9e1Stprzz4UX9a3ej4EKkJ+nHroNLypbcWYkrF4+8P9MNpKofgT\nFfHmkjd7NzJft1oHdjZ3lep7t1wwHe9vrcfXjh8DUQT+ZT9WTP2zrz17Cl5Y/wVOn20VaLlesqqi\nCKOGhPD5LnP0IPkDIdV8qzNsbkVaWRJxwdxxad8DAFx51hSsa9iBrZFdnuHjZGcdMwqqpuOsY1IX\ntOWj75w/Df/3n+04+fBEEdSx02rQ0h6Hbhg4ekqiHuPCueMQ8MnY+mWip26m4D24MohTjhyOKaMq\nO33cwcyvSDhzzshO29MebIr8svmeh6R/zxfPH9/lTUk6M2N8FU6cXotjpw3p8WsdCAbvA7S18b+o\nC9fjqCHezQXswCYEWiGW1UPbMwLRlJm3GagF0XA2FnGG0q3g7YsOQoV/KBqKdwBSvMOcd5EcgBAX\nnNakLaq5DehXpo/F7JpEj+sLJ3wNL3z+NzRGm5znAeaQ8NeOTywDy+TwQ6pw+CEdA/3Xjp6Mq6uO\nwsdb67B+rbkft/sqNCgH0BSzh/SVbgZvMem22XMaSD1sPmFEhVOpfv5J4/Avc5dMKGLq4dXqiiAu\nPTWx/lpAYthbFARcd85UfPfnr6c8lmCKLlPJAf74aTU4cXpth8e5nXncGEzYfiZ+uWEfFhxydtrH\nBQMyFn7lkLTfz0dDBhTjklO869/H1pbh21/vuKzOXimwbXeLc1+mYXNBEHDeiWN74Ujz21ePHZ35\nQQeZTO/5pF6qmpclsU/+XzJ4H6AH3vkFAGDW4Bmebln2XHdgqtmcJdJa4Q3emrdgDYCZfetyIhu3\nhs1jcQ2KHgREQBOirszbqjZXJBTFi9BqBe9P2z+CKIiYPND7ITin5kjMqTkSb+95DwNTNFzpLe4P\nUPeSnqASRFMssZtXd7bLSw6YPlmCqplDy501TrAdWzsbr+5ch9HWlEEmiepz89+yksQUQ3Kmn7pg\nzTts3tWGVhWBciyddWPXHlzg3Bdt7o0/iAoJ//K7SdVVZ04ZAJKnWARRSxo2TypYgznsbcQDEKwm\nJPYcciSmQdB9ZvAWI4iq1lIxe523LCIoF6FNaoXgb8OeyJeYWDnes4mF2+GDDuvRe83Ep4hmP2rd\n8GTe7mpyo7uZd9J8kt8nOXPQXWn1eN74s/DVMachIHdvXbS3mYP3WFIOm9tz3vbwd+FNN2ad5ClY\n40cYFSYWrHWTmlRY1CGQGELSsLm9zjuRedubeiSGzc3gFo1rEKyGJHEjirC1TttemuWTRQSVIkCO\nQRpgNvaYOWh6z99UNwmC4HyIuueQPOuVXZttHIjkbPdAd0USBfGAAnfyum+3mKp5bqfaijIx523/\nfEbv3uYtWOvfG2QQZQuDdzfFda3zBwg6/vi3T7Hps30AXJm36FoTaFecJw2bb9vdgi92mvPcUSOM\ntri53tVu0qLIEkqUIATRgFS1A7IgY2rV5J6/qR6wP0QVxTtsbjNUxbOTU1clD5tne79cZ847xfda\n2uOe26kL1rpXbU5dx8ybiMG725Izb7ufucMKyA88/T6AdMPmKgZVFGFghbWHtWvplb0dZtQIO3tx\n25m3IotOYBT9EQwtHppoitJHjpo0CANKAzh8fLVzX1BJtAOVoXSrjaA7eI8cHMKF88b37EAzSZEo\nf+/iGZgwvByzJ3v34lZkEbMmVnsKopKHzZl4976JIyowqDKISSMrUFHau21iifIFL1u7SdW9WVgs\nufuVNY9tf3inK1i7Y9FMPP3uK3izHYAuosgvIRzVnK0129VE8LaboiiyiGI90XQk5Ov7db9nHjMK\nZyYtYXIPm/vl7q0td+/zfPslR2R9swdnnbfr1zRuaDluuXBGx8cKAq4+y+z89vQ/twBwVZsbicdQ\n7xo3tBx3XXVUXx8GUZ9i5t1NquEdNjdbV7rms63MuzJkZsTJvc0Bs1GLTxFRVGT9GgzRqdy2s+zW\nWBva4+3wiT5nWN0niyj3JdYvhtIUqvU1d8FadyrNgUTBmiSaLUaz3Xwh0XGte+Pe9pJBnfVqRJRF\nWc28V6xYgffffx+CIGDp0qWYOjWxdnPlypVYs2YNRFHElClT8P3vfz+bh9LrkofN46ruZNsAnK8H\nWMN6qea8BVmFJIoIBqzgrZutIOubIs6weVu8De1qGAEpALs1hSKLKJMSwbs0zaYbfc09593duWq7\nOMkO2tnecEBI7pd6gOSkJi3MvIkoG7KWeb/55pvYtm0bnnrqKSxfvhzLly93vtfa2opHHnkEK1eu\nxBNPPIGtW7fivffey9ahZEVyG8u4qnn7lVvBu9Rqv6m72qMaqnnNJPniePHzv6MdjQDMOe9ia39i\nSfdBgIDWeBva42FnO0/ALFgr8yeCd1mgf3ZOch9z8qYkXWVn3nZGm+3t9xLD5t2M3slLBhm7iSgL\nsvZJuG7dOsyda+5WNWbMGDQ1NaG11exzrSgKFEVBe3s7VFVFOBxGWVn6/Vb7Ql1jGI+t/djZDjJZ\nqszbvVOY0/LUCgLujUnsrFoYsAN/+XwtXtn5uvVY0Wk6IUkiipUgGqNNiGgRTxaryKI3ePeDOe9U\nZDExsNPtzFtK7K8N5K5Pc7eLxa0n9tU2gURUGLI2bF5fX4/JkxPLlyorK1FXV4eSkhL4/X5ce+21\nmDt3Lvx+P04//XSMGtV5v+aKiiBkuXeXCVVVpZ8rXrHyHWzZ3oiyUABXnNVxO8rikOJ5viCJiXXb\ngJN5y4qEqqoQJFkCYEAQAD3uAwLtHY+nrBhl1hy5IokYXl6DD+o+BQBUliSC9eDqEAJFiYA9fNAg\nVA3su3nvdOfRCNYC7wB6OIjSEn+n5zudygpzSkCWRef5AZ+E8cMruvV6mVx46kT86JE3cN68Q7r1\n+iWhAKqqQrj6nKn45aoNOHXO6C69TjbeS6HhOewdPI89l4tzmLNqc/cwZGtrKx5++GG89NJLKCkp\nwSWXXIKPPvoIEyZMSPv8hoaOwa4nqqpCqKtrSfv9fY1h69/2lI/b19CCOiVxf2tbzDNsPn54CB/s\nBMLhOOrqWtAeiSWK1TQZhi4msnPLVacfin+tM3+uKAoY5B+ED2AGb9lINKNoaQ5DiyZ+dVq72Ol7\nyabOzqMAH04oPh8vvl0HjDO6dYzhtqj1WnCe/7MbjoMgICvveVRVMX57y4kQRaFbr9/cHEZdXQtm\njhuIw7v4Opn+FikznsPewfPYc719DtNdCGRt2Ly6uhr19fXO7b1796KqytzcYuvWrRg2bBgqKyvh\n8/lwxBFHYNOmTdk6lG6xLzbSjdJ2GDbX9KRtPs3MW7NeJ2q0QvBHrBcXALXjdZMsys7wuiyJqA3V\nON9zL7tSJNFTCFWi9M9hcwCo8g0GNB/8Svf+1Ox13u4qc9GqPM+W3hqaL8StGIkoN7IWvOfMmYO1\na9cCADZv3ozq6mqUlJhBpra2Flu3bkUkYgazTZs2YeTIkdk6lG4xUqzTdY8edChYi2uA7B42N7Nq\nOxjvqFqDwNRXrRcSYcQ7NpdQRBmqtaRMEgUMK0kEb/dabrt/+IiQuctSd/t254IdfANK9wZ5ZDm3\nc91ERPkga8PmM2bMwOTJk7FgwQIIgoBly5Zh9erVCIVCmDdvHr7xjW9g0aJFkCQJ06dPxxFHHJH5\nRXPIvdRH1VX88aNVmO3aBlQ1OmbeYlFiqMQQzO87Veae1xZgtJVBLDYff+GEr+HThs9RVTQQmlYH\nwCxYqykZjONqZ0MWZcypmYUnsN45JgD47uGLu70eOVfsgjNfN1qjAole6WKWq8x7C+vUiCgXsjrn\nfdNNN3luu+e0FyxYgAULFmTzx/eIu8nGxvoP8cbut/HG7red76tJvc1jqg6xsinxfGgQBQGaYeCL\nvc3eFzdE6K3lQPUOAImtO4FEm1VZFCAKIs7vZH9nScxun+/eYGfe3a02l6zny3mSeff3iykiOjjk\nRzrTBxKZd+pGG8lz3jEtBiHYAr3NrArXoEIUBei6gR/8YZ33yboIvS310rhZE83+2ccfVpPy+/mm\nImQO6Q8o7V7v9UTm3b+D9xGHmPUcIwf3zzX3RHRwYW/zDARBgF/s2Jc7ntzbXG6EIBjQWiogBJuh\nG6qzx7Wn8xoAASLu/8ZXsOaLCMaVj/Z878hJgzC2tgyVKTZc+NkNxyY6teWJMbVluPvq2RhY1r3g\nbQ+79/fg/c2zJuO8ligGlhVlfjARUQ8xeKfhDJsLqbt6JQ+bq4K5lE2PBiHpkpN5R2Kad/03zD2m\ny0sCWDTp/JQ/e0CaQJevexdXl3c/oLl7m/dnkigycBNRznDYPI3EUjEBmqF3+H6HLUFFa9vOmB/Q\nRWiGBkkUsLehvUPmDZ2nvavsXuH9PfMmIsolRpE03FXDWlKWDXirzQ3DgC5Za7jjPhi6BNWIQxIF\nGAY6ZN727mCUmZ1550vBGhFRLjCKpJEp845riYC8e387oJidwIy4HzBEqIaayBalpODP4N1lSp7M\neRMR5RKjSBruOW/N6Dzz/vmfNkFQYgCAUn8I0BKZNwAIHQrWGIi6yqdIkCUBRT6WZxAR2Ri803A3\nadFTDZu75rwjMRWCEoUiKvjhJbNRO6AUcT0OwT67ycPmev9fn91fyJKI755/GM4/aWxfHwoRUb/B\ndCYDM/NOVbCWCOiabkDyx1DmC6G02I/yYDF2RXRIkpW+JxesaflZNd5XDhle0deHQETUrzDzTkN3\nNWlJOeftWuet6ToMKWoOmQPwS+YabdGa6xaS57w1XjMREVH3MXin47RHFVLPebuGzXUhBggGQtbu\nXgEreAv2RiVi0rC51rHpCxERUVcxeKdhrxRLV7AW01QnOzdEs9K8WDG37fRbu3wJaTNvDpsTEVH3\nMXhnIKYpWPt8dyN+9Zy5B7kmmpXmQTt4S1ZmbQftDpk3h82JiKj7GLy7INWcN0QNb31sbt9pWMG7\nWDaDtzNszsybiIiygME7A90wUg6bu7umGZJZvNZh2FxUARgQ/OGkF2XmTURE3cfgnYFupMm8reBt\nGAYMKXnY3NoRTFQhDdoGsbjZ05iFTVqIiKgnGLwzMAwj5Zy3ORRuQDcMCLKdeZu7StnD5pBUSKX7\nAQCXTlqQk+MlIqKDH4N3BuaweYrMGwAkFf/e+SaU2q0AgGDSnLchqBCKWmDEFVQFB+bkeImI6ODH\n4J2BYaRYKmavAZdUPPnpaufu5DlvTYpADIRhREKQRc5zExFR72DwzkDXOxasybDntL33FyctFYvI\n9eY3wqEO+38TERF1F4N3BobRcT9vUU/MaeuRIud+RTSXgNnD5mFxHwBAiIYwpHgQJMOH+M4xEFiv\nRkREPcDgnYFhGNCT5rwF3cysBUmFEU0Eb8GKyvawuV1ULmh++CQfZukLoe4cl/2DJiKigxqDdwYp\nC9bsJiuSCgjmBPjo8Hzn285SMYtgPV4wmHITEVHPMXin8NH+TyH4zMYqqQrWjLgVjK3gbRgCSvUa\n5/uKKENxFagJujeYExER9QSDd5KWWCt++t5v4J/2CgAz8/7vnibPY1S79kxSIQgGYAiQRG9WHfKF\nnK9FnbuIERFR72HwTtIWbwcAp6hM1XTsaWjzPMauX7MzbxgCxKQzGfKVOF+LOnuZExFR72HwThLT\nY57bcVV35rVtumadNsnsXW4Gb++pLHUFbwHmELoB7+sQERF1B4N3koga8dyOxXVAMAvWoh8dgaJw\nLbR6c37bnXlLSeu/Qkpi2Dz5e0RERD3B4J2kPSl4x1XNybz15gEI7T0aajQAABCUqBO8haQzWepP\nBG8hKXgn3yYiIjoQ7NmZJBz3bt8Zs4bNDQMABLRH44Dqg6EqEAJt1lruFAVrimvOW2SwJiKi3sPM\nO0lY6zhsLgg6YJinqj1ilprr4WIIgTAEUYNhCB0CtGepGDNtIiLqRQzeSbyZt4G4pjtD4wAQjpql\n5kakGIJgQPBFAUPskHlLouR8bX+L5WpERNQbGLyTeDJvwUAsrgGCDlmUMLqmFLo5fg4jXJx4nCFA\nTMquJw+YABgCYtsmdPgeERFRTzB4JwnHXcFbVJ05b1EQ4VcS2bQeDSYel2LYPOQrwYTGi6DtGclh\ncyIi6lUM3knCqmvYXNSdanMR3uAN3fV1ig5rAKwit8SwORERUW9g8E4Sdi0VEyTVWectQoLf5w7Y\n7lPXMfMG4AyxC4zeRETUixi8k3gzbw2abkBwhs1dp8u9Q1iKOW8gEbyd2M2KNSIi6gUM3kncTVoE\n0W5ibgZvn2vY3NATpy7VUjEAmDKyEgBw2NiBnvs5BU5ERD3BJi1JYpqrt7lkB28doiBBkdJn3qnm\nvOfOHIZDhldgWHVJh+8RERF1F4N3kqh7Y5KkzFv2BG9vIE+VeYuCgBGDQx3uJyIi6gkGbxfDMBDX\n4s7txLC5DkmQkoK3O1innvMmIiLKBs55u8R11bttp5TIvCVBhCy5h8q9WXiqYfNkrFcjIqLewODt\nYs93S4JZmCaIGiCqEATAJ/o9mbe7YC3dsHk6zNGJiKgnGLxdolbwDohW9zRRg+Azq89LlNABF6wl\nG1xpvu7omrLeOWAiIipInPN2iVvFakVSEG1aCyCp5sYjAEqVEGQxdcGakWadd7KTZtSiOCBj+riB\nGR9LRESUDoO3i5N5C2aGLEgaBMXMvEt9pZCTsm33110ZNpclEXMOHdJ7B0xERAWJw+YuMavS3O8M\nm6vOsHmZrzT9UrE07VGJiIiyIWPw3rp1ay6Oo1+IWcPmPqMIgJV5W8Pm5f4yyHLP5ryJiIh6Q8bg\n/e1vfxsXXHABVq1ahXA4nOnhec0eNldgBm9zztvMvCuKyrwFa8jc25yIiCgbMs55P//88/jkk0/w\n4osvYuHChZg4cSLOPfdcTJ06NRfHl1N2gxZBV2BoIgRJBXwRGLqIkFKMqBRJ/URD5LA5ERHlTJfm\nvMePH4/rr78eS5YswdatW7F48WJcdNFF+O9//5vlw8stO/OGIQG6DLG4GWKgHdq+wVCUpA5rbhw2\nJyKiHMqYee/cuRN/+tOf8Je//AVjx47F1VdfjWOPPRYbN27EzTffjGeeeSYXx5kT9py3oUkwNAmC\nYt6v7hwHRfL2Nvd0WwOYeRMRUc5kDN4LFy7E17/+dfzhD3/AoEGDnPunTp2aceh8xYoVeP/99yEI\nApYuXep5/K5du/Cd73wH8XgckyZNwp133tmDt9E77A5rhi4CmnlqDEOAEQtAlrztURXZvc5b5Jw3\nERHlTMZh8zVr1mDkyJFO4H7iiSfQ1tYGALj99tvTPu/NN9/Etm3b8NRTT2H58uVYvny55/t33303\nLr/8cjz77LOQJAlffvllT95Hr7CXihmqBEO39u6OKwAESJLgqTZP7rbGYXMiIsqVjMH7e9/7Hurr\n653bkUgEt9xyS8YXXrduHebOnQsAGDNmDJqamtDa2goA0HUdb7/9Nk466SQAwLJly1BTU9OtN9Cb\n7DlvXXNl3roMSTSryd0BO3nZGIfNiYgoVzIG78bGRixatMi5fdlll6G5uTnjC9fX16OiosK5XVlZ\nibq6OgDA/v37UVxcjLvuugsXXHAB7r///u4ce6+z57x1VYSzFEyTnEDtnvNOzrwZvImIKFcyznnH\n43Fs3boVY8aMAQBs2rQJ8Xg8w7M6MgzD8/WePXuwaNEi1NbW4qqrrsLLL7+ME044Ie3zKyqCkGXp\ngH9uZ6qqQp7bwqfmMUqiD7D28jZ0CQFFQlVVCEUlifddFFDgXMIYAgYOKO7weoWiUN93b+I57Dme\nw97B89hzuTiHGYP39773PSxevBgtLS3QNA2VlZW49957M75wdXW1Z7h97969qKqqAgBUVFSgpqYG\nw4cPBwDMnj0bn376aafBu6GhPePPPBBVVSHU1bV47mtpN39GuN2A4Lf28tYlSKKAuroWxOJa4sGu\nixEYApoa2+EvwOQ71XmkA8Nz2HM8h72D57HnevscprsQyDhsPm3aNKxduxbPP/881q5dixdffLFL\nmfecOXOwdu1aAMDmzZtRXV2NkpISAIAsyxg2bJizTnzz5s0YNWpUV99L1tjV5mocTuYNXXIqy93z\n3JJnqRg7rBERUe5kzLxbW1vx5z//GQ0NDQDMYfRVq1bhtdde6/R5M2bMwOTJk7FgwQIIgoBly5Zh\n9erVCIVCmDdvHpYuXYolS5bAMAyMHz/eKV7rS1E9BlmUoaqAPedtqDJ8VtB2B2jJ9bVhCDBARESU\nGxmD9w033ICamhq89tpr+MpXvoLXX38dP/jBD7r04jfddJPn9oQJE5yvR4wYgSeeeOLAjjbL4loc\nPlFBXNUhfjEdyvCPEd5+CJSqjgMU7gI1QTAYvImIKGcyDptHo1HceeedqK2txa233orHHnsML774\nYi6OLeeiWgw+yYeYqkNRy1C291hA9UNJUSgnJbdKNRi+iYgoNzIG73g8jvb2dui6joaGBpSXl2P7\n9u25OLaci2kx+CQz81ZkCbpuBmR3NzWbuymLJAuoCAVydpxERFTYMg6bn3XWWXj66adx7rnn4rTT\nTkNlZSVGjBiRi2PLuZgeQ7lYigZVQzCgQNV0AHDmvN3cwfvsY0alDPBERETZkDF42wVngLmka9++\nfZg4cWLWDyzXDMNATIvDJ/kQ13Qosoj2iAogc+bNGW8iIsqljOmiu7vaoEGDMGnSJCeYH0xUXYUB\nwwzeqg6fLELVzcxbSbEVqOgJ3kRERLmTMfOeOHEifvKTn2D69OlQFMW5f/bs2Vk9sFyLWq1RFVGB\nqhlQZBGaZs15KykK1kT3rmIM30RElDsZg/eHH34IAHjrrbec+wRBOOiCt92gRbY28VZkyZnzTpV5\ne4fN9RwcIRERkSlj8H788cdzcRx9zt4ONBG8RSd4y3IiUFeVB1DXGPEOmzPzJiKiHMoYvC+88MKU\nc9wrV67MygH1leTM2yeLUK1hc9k1RL78yqMQi2t4+p9bnftYsEZERLnUpQ5rtng8jvXr1yMYDGb1\noPqCvZe3CHN+293HXHb1MZcl0dkaVI8UQQyEUSQX5fBIiYio0GUM3rNmzfLcnjNnDq688sqsHVBf\nienmsLmExLC5rUM3Nfs5Hx+B4NCdOPb4g2v+n4iI+reMwTu5m9quXbvw+eefZ+2A+krMybzNU+Ju\nzCKLqZbGGTCixZD3TIFPUlJ8n4iIKDsyBu9LLrnE+VoQBJSUlOC6667L6kH1BSd4GzIA3ZN5y510\nTzv4VrwTEVF/lzF4/+Mf/4Cu6xCtoq14PO5Z732wiOnu4B3zbEYipxk2JyIi6gsZo9LatWuxePFi\n5/ZFF12El156KasH1RfsgjUY5ilxr+2WUgybc3UYERH1lYzB+9FHH8WPf/xj5/bvfvc7PProo1k9\nqL4Qt9Z5Q7fmvBV3wVr6wfGDsVUsERH1bxmDt2EYCIVCzu2SkpKDMmBFtCgAQLCCtzvzdq/ztjHx\nJiKivpJxznvKlCm44YYbMGvWLBiGgVdffRVTpkzJxbHllB287cxbUdzrvDnnTURE/UfG4H3bbbdh\nzZo12LBhAwRBwJlnnolTTjklF8eWU1HVCt6anXm7C9YOvpEGIiLKXxmDdzgchqIouP322wEATzzx\nBMLhMIqLi7N+cLlkZ96GZgZt91KxUNDX4fFVZQEAQG3VwXUeiIio/8s4Hnzrrbeivr7euR2JRHDL\nLbdk9aD6gp1566oZvH2yiOVXHolLTjkEIwaHOjz+lCOH44KTx+HKMybl9DiJiIgyBu/GxkYsWrTI\nuX3ZZZehubk5qwfVFyJaBIqoQNPM24osYsiAYhx/WG3KxyuyhHkzh6XMyomIiLIpY/COx+PYujWx\ng9bGjRsRj8ezelB9IaJFEZD8iKvWHt6ddFUjIiLqSxnnvL/3ve9h8eLFaGlpga7rqKiowL333puL\nY8upqBpFQPYjxuBNRET9XMYINW3aNKxduxarVq3CkiVLUF1djWuuuSYXx5ZTyZm3z9UelYiIqD/J\nmHm/9957WL16NV544QXouo4f/ehHmD9/fi6OLWd0Q0dUi8Ev+xHXmHkTEVH/ljZC/eY3v8Fpp52G\nG2+8EZWVlVi1ahWGDx+O008//aDbmMTuax6Q/IjHzYo1Bm8iIuqv0mbeDz74IMaOHYs77rgDRx11\nFICDt4931FrjHZADaGPmTURE/Vza4P3yyy/jT3/6E5YtWwZd13H22WcflFXmABCx1nj7JbNgTRBS\n7yRGRETUH6RNL6uqqnDVVVdh7dq1WLFiBb744gvs3LkTV199NV555ZVcHmPWOZm3VbDmk6WDdpSB\niIjyX5fGhmfOnIm7774br776Kk444QT8/Oc/z/Zx5VRYjQCAWbCm6hwyJyKifu2AolRJSQkWLFiA\np59+OlvH0ye8mbfG4E1ERP0aoxSA9ngYAFAkB9DcHkdx4OCqpiciooMLgzeAdtUM3qLuRzSmoao8\n0MdHRERElB6DN4D2eDsAIBoxT0dVeVFfHg4REVGnGLwBtFmZd6SNwZuIiPo/Bm8kMu+WVvM2gzcR\nEfVnDN5IzHk3NZnd1TjnTURE/RmDN4C2eDsUUUFLmxm8K0L+Pj4iIiKi9Bi8YQ6bFytBRGPmpiQ+\nhduBEhFR/8XgDXPYPCgXIRLX4FNEiGyNSkRE/VjBB2/d0BFWIwgqRYjFNfiZdRMRUT9X8ME7rEZg\nwECxHEQ6NoGoAAAYmElEQVSUwZuIiPIAg7ddad6so6E5Cr+PwZuIiPq3gg/eMc3co3zL9jYYADNv\nIiLq9wo+eMd1M3gbunkqGLyJiKi/Y/DWVfMLBm8iIsoTDN5W5g3dDNo+peBPCRER9XMFH6ni1pw3\nDPNUBFiwRkRE/RyDtzPnbWfeDN5ERNS/MXhzzpuIiPIMg7cz583gTURE+YHBW/MOmzN4ExFRf5fV\n4L1ixQqcf/75WLBgATZs2JDyMffffz8WLlyYzcPolDNsbhWsiSI3JSEiov4ta8H7zTffxLZt2/DU\nU09h+fLlWL58eYfHbNmyBf/5z3+ydQhdkrxUTNP0PjwaIiKizLIWvNetW4e5c+cCAMaMGYOmpia0\ntrZ6HnP33XfjxhtvzNYhdEksqcOapht9eThEREQZZS1419fXo6KiwrldWVmJuro65/bq1asxa9Ys\n1NbWZusQukR1qs3NzLu4SOnDoyEiIspMztUPMoxERtvY2IjVq1fj0UcfxZ49e7r0/IqKIGS5d4vJ\nqqpCkD63bugi5h85Al89aTwkznsfkKqqUF8fQt7jOew5nsPewfPYc7k4h1kL3tXV1aivr3du7927\nF1VVVQCA9evXY//+/bjooosQi8XwxRdfYMWKFVi6dGna12toaO/V46uqCqGurgXN7ebrGrqEuTNq\nsH9fa4Znkpt9Hqn7eA57juewd/A89lxvn8N0FwJZGzafM2cO1q5dCwDYvHkzqqurUVJSAgA45ZRT\n8MILL+Dpp5/Gz372M0yePLnTwJ1NqqvaXBILfuUcERHlgaxl3jNmzMDkyZOxYMECCIKAZcuWYfXq\n1QiFQpg3b162fuwBi7matMgSh8uJiKj/y+qc90033eS5PWHChA6PGTp0KB5//PFsHkannI1JdAmy\nxMybiIj6v4KPVqquWg1aBBaqERFRXij44B3T4xAMs4qdmTcREeWDgo9WcSt4CwJboxIRUX5g8NZU\nCKw0JyKiPFLwESuuxwFDYqU5ERHlDQZvPW4tEyv4U0FERHmioCOWYRiIaXFAl1lpTkREeaOgg7dq\naDBgwGCDFiIiyiMFHbxjWsz8QpcgcdiciIjyREFHLDt4G5rEYXMiIsobhR28rb7mhsaCNSIiyh8F\nHbHszFtn5k1ERHmkwIM3M28iIso/BR2xYnoi82a1ORER5YvCDt4sWCMiojxU4MHb3stb5FIxIiLK\nGwUdsRLrvGXOeRMRUd4o6IjlLBXTRQ6bExFR3ijs4O3qsMaCNSIiyhcM3gCgSdzPm4iI8kZBR6zE\nsLkERSnoU0FERHmkoCOWe9hcYcEaERHliYKOWFFnqZgEHzNvIiLKEwUdseJWhzWDmTcREeWRgo5Y\nMVfmrchS3x4MERFRFxV08I46c94iFLmgTwUREeWRgo5Yqq5CggRAgI/Bm4iI8kRBRyzVUCEKMgAw\n8yYiorxR0BErrschwpzrZvAmIqJ8UdARK66pruDNgjUiIsoPBR28VUOFYJingJk3ERHli4KOWKqu\nQrAybxasERFRvijoiBXXVQgG57yJiCi/FGzEMgzDzLw5bE5ERHmmYCOWqqvmF8y8iYgozxRsxIpr\nVvDW7cyb1eZERJQfCjd423t5W8PmLFgjIqJ8UbARy868DZ1z3kRElF8KNmLFrMwbmghBACRR6NsD\nIiIi6qKCDd6qlXnrugBFFiEIDN5ERJQfCjZ423t5G5oIRSrY00BERHmoYKOWXbCmaQJ8CivNiYgo\nfxRu8LaHzTWBmTcREeWVgo1acd0O3iIUpWBPAxER5aGCjVpxa85bUwXIYsGeBiIiykMFG7Xcw+ay\nzEpzIiLKH4UbvK2CNV0TITHzJiKiPFKwUcvpbW6IkCVm3kRElD8KN3jbvc11ETKrzYmIKI8UbNSy\nm7TAENkalYiI8krBBm9nP29dhMTMm4iI8kjBRq2Ys6uYBJmZNxER5ZGCDd5x97A5C9aIiCiPyNl8\n8RUrVuD999+HIAhYunQppk6d6nxv/fr1eOCBByCKIkaNGoXly5dDzOGSrYgaNb/QJBasERFRXsla\n1HrzzTexbds2PPXUU1i+fDmWL1/u+f4dd9yBhx56CE8++STa2trw6quvZutQUgqrEQCAocssWCMi\norySteC9bt06zJ07FwAwZswYNDU1obW11fn+6tWrMXjwYABAZWUlGhoasnUoKUXiZvCGJjPzJiKi\nvJK1qFVfX4+KigrndmVlJerq6pzbJSUlAIC9e/fi9ddfx/HHH5+tQ0kprEYhQLCqzZl5ExFR/sjq\nnLebYRgd7tu3bx+uvvpqLFu2zBPoU6moCEKWe2/f7XA8Ap/kRzsElJYEUFUV6rXXLjQ8dz3Hc9hz\nPIe9g+ex53JxDrMWvKurq1FfX+/c3rt3L6qqqpzbra2tuPLKK3HDDTfgmGOOyfh6DQ3tvXp8YTUC\nBQoAIBqNo66upVdfv1BUVYV47nqI57DneA57B89jz/X2OUx3IZC1YfM5c+Zg7dq1AIDNmzejurra\nGSoHgLvvvhuXXHIJjjvuuGwdQqci8QgU0QcALFgjIqK8krXMe8aMGZg8eTIWLFgAQRCwbNkyrF69\nGqFQCMcccwyee+45bNu2Dc8++ywA4IwzzsD555+frcPpIKxGUSGXAgAL1oiIKK9kdc77pptu8tye\nMGGC8/WmTZuy+aM7FddVqLoKQTffPoM3EVHfevnlv+OEE07u0mN/8pP7ce65C1BTU5vlo+q/CjJq\nRa0GLbv2xgBw2JyIqC/t2vUl/va3tV1+/PXXf7egAzeQw2rz/iSimcHb0MzqdS4VIyLqOw88cA8+\n/HAzHn30N9B1HV9+uRO7dn2JBx/8Be66607U1e1FOBzG5ZdfhTlzjsV1112F73znFvzzn39HW1sr\nvvhiG3bu3IFvf/u7mD17jvO6qqpi+fIfdHj+J598hPvvvweiKGDKlGm49trrU95n/5zRo8di1aqn\n0NjYiOnTD8eTT/4v2tvbcd11N+Ldd9/Gyy//HbquY/bsObj11u+ipaUFd955G9ra2lBSUoI77vgf\nXH75Rfj9759AMBjEhg3v4cknV2LFih93+5wVZPCOWsEb9rB5DtuyEhH1Z0//Ywv+89HeXn3NmROq\ncd5JY9N+/4ILFmL16qdx2WVX4pFHHoaqxvGLX/wWDQ37MWvWUTj11DOwc+cO3H77EsyZc6znuXv3\n7sF99z2E9ev/jT//eZUneLe0NKd8/oMP3oebb16KsWPH4Uc/ugO7d+9KeV86W7duwRNPrIbP58O7\n776NX/zitxBFEeeddxauvfabeOKJxzFr1myce+4CPPXUSrzzzls47rgT8dpr/8L8+afgtddewbx5\nX+nROS3I4G33NTc08+0z8yYi6j8mTpwMAAiFSvHhh5uxZs1qCIKI5uamDo+dOvUwAObyZHcXz86e\n/8UX2zB27DgAwO2335n2vnTGjh0Hn89crRQIBHDddVdBkiQ0NjaisbERn3zyEa644hoAwPnnXwQA\nqKmpxW9/+0vMn38K3n33bXzjG1cf+IlxKczgrSU2JQFYsEZEZDvvpLGdZsm5oChmD46//vUlNDc3\n4+c//y2am5txxRULOzxWkhLNu5KbgaV7fqpNsFLdJwiJxE5V1Q7Ht3v3Ljz11Er87ncrEQwGsXDh\nedZrSTAM3fNaY8eOw759+/Dhh5sxatQY+P3+zk9CBgUZtSL2piR25s2CNSKiPiOKIjRN63B/Y2Mj\nhgypgSiKeOWVfyAejx/Q66Z7/siRo7B5s7ni6a677sR///t5yvuKi4uxb5/ZbGzjxvdTvn5FRQWC\nwSA+/vgj7N69G/F4HBMnTsLbb/8HAPDcc6vw4ot/AQCcdNI8PPDAPZg375QDeh+pFGTwthlx88qH\nmTcRUd8ZMWIUPv74Izz00P2e+0844ST8+9+v4vrrr0FRURGqq6vx6KO/6fLrpnv+9dffhJ/97P/D\nNdd8A6FQKUaOHJXyvjPPPAf3338vbr75egwcWNXh9ceNG4+ioiCuueZy/P3v/4ezzjoHP/zhD3Hu\nuRdg06YNuO66q/Dvf7+G448/EQBw8snzsHfvXhx++MyenTAAgpGq6Xg/1Jvt5uJaHNf89hnojdWA\nIeKWC6ZjwojOe6tTamyn2HM8hz3Hc9g7eB57rrNz+Pzza7B79y584xvfPKDXS6Ug57wVSYHeMNi5\nzcybiIiy6Z57/gdffrkTd911X6+8XkEG72SsNiciomy69dbbevX1CjLl1HXvTAEL1oiIKJ8UZPCO\nxr1VjRw2JyKifFKQUSvWIXgz8yYiovxRkME7OfOW2B6ViIjySEFGrWjc2/mGmTcRUd96+eW/H/Bz\n3nvvHTQ07M/C0fR/hRm8Y0mZN+e8iYj6zIFuCWp7/vk1BRu8C3KpWMdhc2beRER9xb0l6PnnX4gV\nK36IlpYWaJqGG264GWPHjsP//u/v8cor/4Qoipgz51hMnDgJr776Mj7//DP8z//ci8GDzd4dfbEN\n6OWXX+VsAxqLReD3F2VlG1A3Bm+w2pyIyLZ6y1/w7t6Nvfqa06sPxTljz0j7ffeWoL///W9x5JFH\n4//9v6/i888/w09+ch8efPAXePLJ/8Vzz70ESZLw3HOrMHPmURg7djy+851bnMAN9M02oOeff6Gz\nDejixVfiZz/7VVa2AXVj8AabtBAR9RcbN25AY2MD1q59AQAQjZobSZ1wwsm44YbFmDfvFMyfn35j\nj77YBrS5uTkn24C6FWTwrgz54ZNF6IYBVTMgCgzeREQAcM7YMzrNkrNNUWTceOPNmDJlquf+m276\nHrZt+y/+8Y+/4lvf+iZ+/es/pHz+wbwNqOfYe+2V8sghwyvw1IrT8fBNJ+DXN5/Q14dDRFTQ3FuC\nTpo0Bf/618sAgM8//wxPPvm/aG1txaOP/gYjRozEZZddiVCoDO3tbSm3Ej2YtwH1nLNefbU8Iksi\nBEHgfDcRUR9zbwn69a+fj507t2Px4itwzz3/g8MOm4GSkhI0NjbgyisX4dvfvhqTJ09BaWkZDjts\nBm677VZ89tlW57X6YhvQ+++/x9kGdOHChVnbBtStILcEBbj1XW/heew5nsOe4znsHTyPPZd8Druz\nDWjy66VSkHPeRERE2dbb24C6MXgTERFlQW9vA+rGCV8iIqI8w+BNRESUZxi8iYiI8gyDNxERUZ5h\n8CYiIsozDN5ERER5hsGbiIgozzB4ExER5Zm8aY9KREREJmbeREREeYbBm4iIKM8weBMREeUZBm8i\nIqI8w+BNRESUZxi8iYiI8kxB7ue9YsUKvP/++xAEAUuXLsXUqVP7+pD6tU8++QSLFy/GpZdeiosv\nvhi7du3CLbfcAk3TUFVVhR//+Mfw+XxYs2YN/vCHP0AURZx33nk499xz+/rQ+417770Xb7/9NlRV\nxTe/+U0ceuihPIcHIBwOY8mSJdi3bx+i0SgWL16MCRMm8Bx2UyQSwRlnnIHFixdj9uzZPI8H4I03\n3sD111+PcePGAQDGjx+PK664Ivfn0Cgwb7zxhnHVVVcZhmEYW7ZsMc4777w+PqL+ra2tzbj44ouN\n2267zXj88ccNwzCMJUuWGC+88IJhGIZx//33GytXrjTa2tqM+fPnG83NzUY4HDZOP/10o6GhoS8P\nvd9Yt26dccUVVxiGYRj79+83jj/+eJ7DA/T8888bv/71rw3DMIwdO3YY8+fP5znsgQceeMA455xz\njFWrVvE8HqD169cb3/rWtzz39cU5LLhh83Xr1mHu3LkAgDFjxqCpqQmtra19fFT9l8/nw29+8xtU\nV1c7973xxhs4+eSTAQAnnngi1q1bh/fffx+HHnooQqEQAoEAZsyYgXfeeaevDrtfmTlzJn7yk58A\nAEpLSxEOh3kOD9Bpp52GK6+8EgCwa9cuDBo0iOewm7Zu3YotW7bghBNOAMD/z72hL85hwQXv+vp6\nVFRUOLcrKytRV1fXh0fUv8myjEAg4LkvHA7D5/MBAAYMGIC6ujrU19ejsrLSeQzPa4IkSQgGgwCA\nZ599FscddxzPYTctWLAAN910E5YuXcpz2E333HMPlixZ4tzmeTxwW7ZswdVXX40LLrgAr7/+ep+c\nw4Kc83Yz2B22R9KdP57Xjv72t7/h2Wefxe9+9zvMnz/fuZ/nsOuefPJJfPjhh7j55ps954fnsGue\ne+45HHbYYRg2bFjK7/M8ZjZy5Ehcd911OPXUU7F9+3YsWrQImqY538/VOSy44F1dXY36+nrn9t69\ne1FVVdWHR5R/gsEgIpEIAoEA9uzZg+rq6pTn9bDDDuvDo+xfXn31VfzqV7/Cb3/7W4RCIZ7DA7Rp\n0yYMGDAAQ4YMwcSJE6FpGoqLi3kOD9DLL7+M7du34+WXX8bu3bvh8/n4t3iABg0ahNNOOw0AMHz4\ncAwcOBAbN27M+TksuGHzOXPmYO3atQCAzZs3o7q6GiUlJX18VPnl6KOPds7h//3f/+HYY4/FtGnT\nsHHjRjQ3N6OtrQ3vvPMOjjjiiD4+0v6hpaUF9957Lx5++GGUl5cD4Dk8UG+99RZ+97vfATCnvtrb\n23kOu+HBBx/EqlWr8PTTT+Pcc8/F4sWLeR4P0Jo1a/DII48AAOrq6rBv3z6cc845OT+HBbmr2H33\n3Ye33noLgiBg2bJlmDBhQl8fUr+1adMm3HPPPdi5cydkWcagQYNw3333YcmSJYhGo6ipqcFdd90F\nRVHw0ksv4ZFHHoEgCLj44otx5pln9vXh9wtPPfUUfvrTn2LUqFHOfXfffTduu+02nsMuikQi+P73\nv49du3YhEonguuuuw5QpU3DrrbfyHHbTT3/6U9TW1uKYY47heTwAra2tuOmmm9Dc3Ix4PI7rrrsO\nEydOzPk5LMjgTURElM8KbticiIgo3zF4ExER5RkGbyIiojzD4E1ERJRnGLyJiIjyTME1aSHKN/fe\ney82btyIaDSKDz74ANOnTwcAfO1rX8NXv/rVLr3Gr3/9a4wfP97pZ53KwoUL8fvf/x6SJPXGYXvs\n2bMHn332GWbPnt3rr01UiLhUjChP7NixAxdeeCH+9a9/9fWhHLA1a9Zg69atuPHGG/v6UIgOCsy8\nifLYT3/6U+zYsQNffvklbr31VkQiEdx3333w+XyIRCJYtmwZJk+ejCVLluDwww/H7Nmzcc011+CY\nY47Bhg0b0NbWhocffhiDBg3CIYccgs2bN+OXv/wlGhsbsXv3bmzbtg1HHnkkbr/9dkSjUdx6663Y\nuXMnBg8eDEmSMGfOHM8exW1tbfjud7+L5uZmqKqKE088EWeccQYefPBBGIaB8vJyXHTRRbjzzjux\nbds2tLW14YwzzsDll1+O1atX469//SsEQcCePXswevRorFixAoqi9OEZJuqfOOdNlOd27NiBxx57\nDFOmTEFjYyN+8IMf4LHHHsOiRYvw8MMPd3j81q1bcc4552DlypWYOHEiXnzxxQ6P+eCDD/DQQw/h\n2WefxerVq9HU1IQ1a9ZAVVU888wzuOOOO/D66693eN6///1vqKqKP/7xj3jyyScRDAZRW1uLs88+\nG2eeeSYuu+wyPPbYY6iursbjjz+OZ555Bs8//zw++ugjAMDGjRv///bu2CW1MIzj+NcONQQRQi3W\nYnBsjDoSBFKNOVaEo0M4REO4HGyrKQin5ob+gDBaoiVyECEipakhWkKkQKFoiERPd5DOzYxLlysX\njvw+4+F5X97tx/PyHh7S6TSHh4eUy2VP3jKI/A/qvEU8bmJiAp/PB8DQ0BC7u7u8vb3x8vLC4OBg\nW73f78c0TQACgQBPT09tNZZlYRgGhmHg9/t5fn7m5uaG6elpAIaHh7Esq23d1NQUe3t7bGxsMDc3\nx8rKCj09rT3CxcUFDw8PXF5eAlCr1bi/v3fXf4xPnZyc5O7uzp2TLCK/KbxFPO7ztbJt22xvbzMz\nM8P5+bk7zOOzrw/Svnv28l2N4zgtQfw1lKE5y/j4+JhiscjZ2RnLy8scHR211PT19bG+vs7CwkLL\n90wmg+M4fzyXiDTp2lyki1QqFUzTpNFocHp6Sq1W69jeY2NjFItFAKrVKldXV201uVyObDaLZVnY\ntk1/fz/VahWfz0e9XgeaXf3HVb3jOOzs7Ljd//X1Na+vr7y/v1MoFBgfH+/Y+UW6iTpvkS6SSCSI\nx+MEAgFWV1exbZuDg4OO7L20tEQ2myUWizE6Oko4HG7r0IPBIKlUiv39fQzDIBKJMDIyQjgcJplM\n0tvby9raGre3t8RiMRqNBvPz8+6o1FAoxObmJqVSCdM0iUQiHTm7SLfRr2Ii8iOPj48UCgWi0SiO\n47C4uMjW1pb73/m/ymQy5PN50ul0R/YT6WbqvEXkRwYGBjg5OXHnE8/OznYsuEXk76jzFhER8Rg9\nWBMREfEYhbeIiIjHKLxFREQ8RuEtIiLiMQpvERERj1F4i4iIeMwvRph4T/csGFUAAAAASUVORK5C\nYII=\n",
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEcCAYAAADUX4MJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzsvXeAHMWZ/v/pNGlnc5S0ymmFUE6WEAgQ2UJkGxtjsMEG\nbDD+YnNwZ3PnH+fD2GcwnDFHMBmcwETLIiMJ5YByzqvd1eY0eTr9/uie7p7dlRACHQ7z/LM73dVV\n1dXd71NvqLcE0zRNcsghhxxyyOFjIH7eHcghhxxyyOHvAznCyCGHHHLI4ZiQI4wccsghhxyOCTnC\nyCGHHHLI4ZiQI4wccsghhxyOCTnCyCGHHHLI4ZiQI4wccviUWL16NXPmzDmmsg899BC33377Ce5R\nDjmcGOQII4dPjTPPPJNx48bR2dmZdfyiiy6ipqaGhoYGAO68805qamrYvHmzU6a2tpaamhrn99VX\nX81LL73k/H7kkUeYO3cukydP5vTTT+e2224DYN68eUyePJnJkydz0kknMX78eCZNmsTkyZN57LHH\nTuTt9glBEE5I2Rxy+FuC/Hl3IId/DFRXV7NgwQKuuuoqAHbt2kUqlcoSjoIgUFRUxAMPPMATTzyR\ndbwvvPLKK7zxxhs888wzVFdX09bWxvvvvw/AX/7yF6fc1VdfzcUXX8xll112Im7tnwa6riNJ0ufd\njRz+hpHTMHL4THDRRRfxyiuvOL9feeUVLrnkkl7lLrnkEnbu3MnatWs/ts4tW7Ywe/ZsqqurASgt\nLeWKK67os+zHJSx46KGHuPXWW7n99tuZPHky8+fP58CBAzz22GPMmjWLM844g+XLlzvlm5ubuemm\nm5gxYwbnnnsuL774onMulUpx5513Mn36dObNm5elMWWu/d73vsfMmTM566yzeO655z72XgG6u7u5\n8cYbmTlzJjNmzODGG2+kqanJOd/V1cW//uu/cuqppzJjxgxuvvlm59y7777LxRdfzJQpUzjnnHNY\nunQpYGl/K1asyBqHjEmsvr6empoaXnrpJc444wyuvfZaAG699VZmz57NtGnTuPrqq9mzZ0/Wvd97\n772ceeaZTJ06lauuuopUKsUNN9zACy+8kHU/8+fP57333jume8/h7wM5wsjhM8GECROIxWLs27cP\nwzB48803mT9/fi9BHggEuPHGG7n//vuPqc5XX32VJ554gi1btmAYxqfq46JFi7jkkktYu3YtY8aM\n4brrrsM0TT788EO+853vcNdddzllb7vtNvr378/SpUt58MEHuf/++1m5ciUAv/71r6mrq+O9997j\niSee4NVXX3WuM02TG2+8kTFjxrB06VKefvppnn32WZYtW/ax/TMMg8suu4zFixfzwQcfEAgEuPvu\nu53zt99+O6lUioULF7J8+XJHwG/atIk777yTO+64g3Xr1vH8888zYMCAI7bTU6Nbu3YtCxcudLS+\nOXPm8M4777B8+XJOOukkfvjDHzpl7733XrZt28Yf//hH1qxZw+23344oilx88cW89tprTrkdO3bQ\n3Nx8zL6dHP4+kCOMHD4zXHTRRbz66qssW7aMYcOGUVFR0We5L33pSxw+fJgPP/zwqPXNnz+fu+66\ni2XLlnH11Vcza9asT+WfmDp1KrNmzUIURc477zw6Ojr49re/jSRJXHDBBTQ0NBCNRjl8+DDr16/n\nhz/8IYqiUFNTwxVXXOEIxDfffJObbrqJ/Px8Kisrufrqq502Nm3aRGdnJzfddBOSJFFdXc0VV1zB\nggULPrZ/RUVFnH322fh8PkKhEDfccIOjiTU3N7N06VLuvvtuwuEwkiQxdepUAF566SUuv/xyZs6c\nCUBFRQVDhw49pjERBIFbbrmFQCCAz+cD4NJLLyUYDKIoCt/97nfZsWMH0WgU0zR5+eWX+fGPf0x5\neTmCIDBx4kQURWHu3LkcPHiQ2tpaAF577TUuuOACZDln9f5HQu5p5vCZYf78+Xzta1+jrq6Oiy66\n6IjlfD4f3/nOd3jwwQe57777jlrnvHnzmDdvHrqu8+677/KDH/yAsWPHcsopp3zi/pWWljr/BwIB\niouLndl2IBDANE1isRgtLS0UFhYSDAad8v3792fr1q2AJbyrqqqyzmXQ0NBAU1MT06dPByyNwzAM\npk2b9rH9SyaT3HPPPSxdupTu7m5M0yQej2OaJo2NjRQWFhIOh3td19jY+Klm8t57MQyD+++/n7fe\neouOjg4EQUAQBDo6Okin06TTaQYOHNirDp/Px/nnn8/rr7/Od7/7XRYsWMCvf/3r4+5TDn+byGkY\nOXxm6N+/PwMGDGDJkiWcc845Ry176aWXEolEeOedd46pbkmSOPfccxk9ejS7d+/+LLp7RFRUVNDV\n1UU8HneOHT582NGYysvLOXz4sHMuEwUG0K9fP6qrq1m9ejWrV69mzZo1rFu3jkceeeRj233yySc5\ncOAAL730EmvXrnV8AqZp0q9fP7q6uohGo72uq6qq4tChQ33WGQqFSCaTzu+WlpZeZbwmqjfeeIMP\nPviAZ555hrVr1/L+++87ZsXi4mL8fr+jRfTExRdfzOuvv86KFSsIBoNMmDDhY+85h78v5Agjh88U\n99xzD8888wyBQOCo5SRJ4uabb+bxxx8/YplXXnmFxYsXE4vFME2TxYsXs3fvXsaPH/9ZdzsLVVVV\nTJo0ifvvv590Os2OHTt46aWXmD9/PgDnn38+jz76KN3d3TQ2NvL88887144fP55wOMzjjz9OKpVC\n13V2797dyzHeF2KxGIFAgHA4TGdnZ9YMvby8nNNOO42f/OQndHd3o2maY666/PLLefnll1m5ciWm\nadLU1MS+ffsAqKmpYcGCBWiaxubNm3nrrbey2uzpY4rFYvh8PgoKCojH49x3330OoQiCwKWXXsq9\n995Lc3MzhmGwYcMGVFUFYOLEiQiCwL333ntUDTOHv1/kCCOHTw3vDHXgwIGMHTu2z3M9MW/ePCoq\nKnqF3mYQDod55JFHOPPMM5k2bRr33XcfP/nJT5g8efIR2/808NZz3333UVdXx6mnnsr3vvc9br31\nVsdHcPPNN9O/f3/mzp3L9ddfz8UXX+xcJ4oijzzyCDt27GDu3LnMmjWLu+66q0/NoCeuueYaEokE\nM2bM4Morr+xlZvrFL36BLMucf/75nHLKKTz77LOARVL33HMP99xzD1OmTOHrX/+6owHdeuut1NbW\nMn36dH7zm99w4YUXHvGewdIS+vXrx2mnnca8efOYNGlS1vk77riDUaNGcfnllzNjxgzuu+++LNK5\n+OKL2b17t0OuOfxjQTiRGyj927/9G4sWLaK0tJQ33nijzzI//elPWbJkCcFgkHvvvZcxY8acqO7k\nkEMOJxivvvoqL774Yq8Q2xz+MXBCNYxLL700a4FWTyxevJja2lrefvtt7r77bv7jP/7jRHYnhxxy\nOIFIJBL8/ve/58tf/vLn3ZUcThBOKGFMnTqVgoKCI55/7733HHV+woQJRCIRWltbT2SXcsghhxOA\npUuXMmvWLMrLy5k3b97n3Z0cThA+17DanuGJlZWVNDU1UVZW9jn2KocccvikmD17NuvXr/+8u5HD\nCcbn6vTuy32SS8yWQw455PC3ic9Vw6isrKSxsdH53djYeMTVwV6Yppkjlj6w+1AHtz2whLOmDeLW\nKycdsdyFP7BWLL/+y/m5cQSu+veFdMfSnDdzCN+9vO+1A5kxe+JHZ1NREvq/7N5x48ofLSAYUHjq\nrqOvifm/RmYsM3juJ+dRlO//VHXe98I6Fn1UR2VJiN/+6Ow+2/vpjbOYMLL8U7Xz94LMPT/6r3N5\n8vWtrNrayOCqfB66/cxPVe8JJ4yjBWHNnTuXF154gQsuuIANGzZQUFBwTOYoQRBoaYl8lt38u0V5\neb4zFoebrL+xeOqYxudwYzeK/I8TWe0di08Cv2KNQWdX4mOv37anBWFoyXH17/8S5eX5pFQD01T/\n5r+Vg3UdqGV5n6qOZMpaC2IYxhHvt+sYnu8/Glpbo6TTGgC6/ulyscEJJowf/OAHrFq1is7OTk4/\n/XRuueUWVFVFEAS+/OUvM2fOHBYvXszZZ59NMBjkZz/72Ynszj88kpkXwzi2SGlVM/6hCON4EfBZ\nn0EipX1s2cb2OGP/DgjDNE003cAwzL95jTyaUP9vGjpxKwj+ZqHrn+09n1DC+Lg8QQD//u//fiK7\n8HeFV5bsQ5FF5s0aclzXJ9M6AFqPl0TTDR59fSunnNyPiSNdDU7tMeMwTJPH39jG5FHlTKv5eNPg\nkbB6exMbdrdy/YUnIX4Ggmr7wQ4+3NjANy4Y84kJ7u3Vtai6wRdnDgFg4cqDxFMal80Z7pQJ+qw9\nIHYe6uRXf9rI9fPGkB/y9VlfY1u8z+PHCtM0eXrhDkZWFzF7fL9jvm7jnlbW7mzmG+ePQRSPPqYv\nLdpLSZGVB8swTVTNwKdIJFIaD7+yme64ypTR5Zw7bRCPvLaFc6cPomZwcVYde+q7+MvyA5w7fRDv\nravj+nljHGI9XpimydKGlQihbsy4Gz0ZswnDNE1eeGcXowYWMX1MZda1ndEUzyzcwVfPHkV5UZBn\n39zBkH4FnDahP0fC+l0trNja2Ov42wc/YEndCn404zaCspWRIJnWeGLBdubNHMLgqnyn7CtL9iHL\nIudOG8hv/7KNc6YPQhDg7dWH+OYXxyAKAk/+dTtnTBrAqIFFR+zLB+vr2X6gHd0wue6LJxHwSTy5\ncAszavpRVhTkpUV7+cYFYwgHlaOOoWGaPPfWTsYMLu41Ru3dSZ57aydfO2e0c+wXv19Pfsiq87Og\njlzywb8hvLH8AMBxE0bKJoyequfO2k7W7Wxh3c4WnrzTtWFqWna5uuYoq7Y1sWpbE9PuPH5b5yOv\nWUn65s8eStVnYO//799b0TcnDyth1snHLmQB/vC+tZdDhjBeXLQXIIswAn7rM0imdTbva2PBioNc\nOXdkVj2CYE1Qu2Kp47qHDJJpnQ83HebDTYc/EWE8+NImAM6ZNoiBFb0TEHrx15UHs9tUdXyKxKHm\nKFsPdADQHUtTkh9g4942Nu5ty3ovAH7+wkfohsmmvW0ArNhSzBmTq4+5v31hS9t2/rDzFQKjAyTW\nn+4cz2gY0YTK+x/V8/5H9VnCcEf7bl5eu5Y9eytpj2zmrmumsmhDA2xoOCJh7O7Yy0MLt2AmXOGv\n2xrGa3sXAlAfPcyIIiur73vr6li3s4Wt+9t5+DZ3hX3mmxxVXcjanS0E/DLLNh3GBEYMKCQUkFm9\nby8fRRfxmytvQBazReob+94iko7y7lsuIb+5+iCD+iusDz7HmlWDKY9PpaE1RuHivXz9vBqOhuaO\nBIs3NLB4Q0MvwvjtX7axo7YT5QN3/5JoQiWaUBH8cUw+/beYs0f8A8HRMHqYpOJHMLX01DAy139W\n+KyTCLR19y2sNUPrs62jte89F/Rnf+R9jYMsiUc890lgfMoxMY7R3OhFMq2zs30Pf2n4MwhW/6MJ\nFekomkrGrCnkdYKkEgpkz3y70xHePvgBunHs47Hy8DrrHymddTyaUDFMgx1te+hrHvzrDY9TL68H\nOUVje5y0mmnT4L9W3c/Le/6SVd4wDR5Y/yiBccuy6lM1g7Tumr/SejrrnFjYgjFkFRtatvTqQyRu\nXdfUHndqTKQ1UqqO/+RlyFUH2dDcO1/YmwfeY1nDqqx+mCbsi1i5vuSqg6iabtenO/3f3bG3z/e3\nuaNvDVc3dGqL30Cu3oUiZYt1sbCFwIQlJPx1fV77SZAjjL8RaJ+BQyp5BOdW0kMY3pdQ7aFhpNTP\nmjA+m3qCfstk1BntTRgN0UZuXfRvvFu7uNe5tOreX8/xTXvuvafZrOe4eMt8WsI4VptyR7KTrW07\nevUn84w/CZIpjf/Z8Bh7YzsR861913XDxMRECHXjG7GepfUre10nFjcSGLsSZeBOeloWf7n2N7y2\ndyEfNW/qdd2KhjV9Po+D3XZG3VS2hhRNqLy463We3v00UqmV+TcSt4R5d9p1Uguyagl9ezzEgg4a\nYo28V7skS1uOqwn3HvLbnf9VzeBQpN75HVOzha/cfy9SUSsv7sqO4gKIJlUEX5zD7VGHaFXNQNMM\nBNF+pj0GyUtOSO5zkyWRjnRH1m9w39E/7XqNB9Y/yrrmjb364TWJxpJu/S2JNkx/DKX/Pgrzss2p\nUqG1GDoldvWq75MiRxh/IzheQaRqOnsbupw6hGCEpsKlJDUrpfXuuk7W7Gy2CitJNh5yzRVb9reh\n6Qa76zqtvReSfQuj7liaw22xY+pPbVME5DRCsBvdMDnUHKUj0lvQ17VE+WhXizO7OhqKwn7kfnvZ\nl9rE4bYYLZ2uQFh52MrY+vq+N3td53Wm9iTDlGe89R47+Xn71BlN0dQedz7mhJo9O/44dMfTNLS6\nY9fXxKC2KdKLDH+8/B4e3vgknamurPtNqTrN8RY6U9kf//aWvexv7OxTgzkcdVOaC7Lb/2hCQ648\niFTSxO93vuwcb+uy3h2l2kojLxa0O+S7pXU7d3z4/9GWtASxZljvzKHmKA2tMZYcWM/zO17klT0L\n2LinlbSqs35XC3/e9i4dKYusDDF7DOtaoyypt7bHFQKWQFy5rQnTNNnZ7ppXBMW6LqNhSCWufyLz\n/bR0Jlm5y33HMwQJ1thn+g0QVd3nEjHakeyykXSUVdsbWbuj2Tm/uHYVgYlLSBbudrSvnYc62RV3\ntZGE5qaRB9hS56ac9457Q2uMpqSVHNLURYcwDhyO0NAa48N6a0vdPa3ZGoGmG/z10EL8J60ATJra\nrfdibeP6LHI50Nht/aOk8I34CKnEakslu3/Hg5wP428ExzNzBPjv59exYvNhbr9yIsm0hlxeRzJY\nx57O/QwMDONnz3/klA1OWsTjexYB5wHw4gd7WbDccgJ/4/yarFl3Bg3RRn76wROk9kzgkVvOR5Gl\nI/ZFNwx+8tQafKM2IxW1cChaw2//ZM0Ye9rI//2J1QB8/bzRnD7xyNuJAsgSyAP20KqH+NHjpVn1\ntdtCqMTf2+HoJYxkSifPY1ZJpjUK7JmYN6pMHrCLneHFrG5UmV41mcde30pDWxzdMJEqD9A+eAeH\nIoMZmJ9tO09oSVJ6iiJ/YdbxH/5mGZpu8ugPT0eRxSzCeLd2MZXBCn711OGse1IN913oSnXT2elu\n5BRPqfxy3aP0z6vi+5NvBGB72y4e2vxbzHSA6092d//LoDHu7gsu+JKOP+ZQpA653J1xx9UEISXI\n7f+7HDAQApZANVNB0jaJvrxnQZagTWgJuqIp7n56Dbphogzajmwnb3jwlbV8oWYgK3cdJDhpkdsH\nOTsqavPBRoKZva1MS3j+/t3dHPQvYX2LZ5ZtC92uWBowkIotwhBMKWvC9YclW/DXZNpyiVjVDKIe\noo157mOT9h7YCoJu6jz6149AzaToN2kOWxqYWNgKh4cBsC+6C7/hrm6PezSWw20xHv7ravwZ/7Os\ngt2VNTua8U9oQrSXnkiS1XBbd5IfP7GcoL3X1vvrGri8xnAIZdX2w6jFe61ZvpKmuSNOVbnCU9t+\nnzWeO2o7AAGl/16kEpf0eo778SCnYXzO2Nd1gOe2/4l46thnrpqh8dy2P7Gv6wArNlvCpqUraWkY\n9kee0JJEssIVvTNPj/qe0kDUeLnlCbZE1/Zq65FNTyHkdSH320c0cXRSy3y0UpE1o13ZvKJXmc5U\nF49vfg4hYKX7jnn6uKN9Ny9sf7GXXTxhRhBEE1Po3X5H0iKM4kBvwuiMuTPzbEI2Wd+yyTGneM1E\nctVBNCnBxpatmKbJwaYI3bE0CAa+wZaJ6EB3tlO5NdHGv3z4E/5r1f29+q77u5BKGhytRbXbEnwJ\nXtmzgEc2P9Wr33s79zv/d6W6s3xQjYkmYmqcxpgrCFoSrXadSZ7Y9Thi2DV3AFnaiOBLUhS2JNVH\nvJJVzjv7FgJxx8IiyKqjYfQPV2VdE1XjtEdSrs/D7wpNQVGpa445moFzXNJBcN9B7+wbyX0fssgC\nMIwonQdWEImriMXNCIodXSXoJNLudVnt2f/Xr36SSDRKV6o7q+8ZJOjEiBWgHh5i1eFzZ+OZb6on\npEIrIEA9NArINnF1x9IIfvf9y+qToCP4rHOCZCDJ7jsj5nVnXdNqa3sHuw+xtMPdbEzwWd93T7Oa\n1THrfTH17MndsEFH36PmWJAjjM8Z9617mJWH17KtfYdz7OOcxR81b2Jl41ruW/ewcywvoPQgjESW\nDd/7ASBlCzUx3ElajLBLX96rrbakJXxM1Z8l3PtCMmU77aJWyGRnuqNXmUc2PsWGls3I/Syh6J3d\n/3rD4yw/vIZaj50ZIClYH5Ep9m6/w+6fX8peKdyZ6uKJgw+gDLYitrwzUKmsntfrXubJLVYKbs0x\nSZkgWuW60910x9Ik7HvyCkJnKmpjY8tWDNMgriWIqO6+F2lVJ3DycnwjNtEcs4SLrhuI4Q4CE10b\nf0Z4uPW5Zo7udMTtu6RSn7DGLaJGHRu5LGY7pMViV6MQi5pZ3f2+p62ko1n1RGui3Xaqm/hGeyYP\nctrRMCQhW2TE1FiWJpcxKYFFBLGkmkUIercdLeSx6XuFqSBpDKrsOwrM0KN0HlxBJJ7GN3gbpiFg\nJKwFfwnNHXdBSXn+t+oeMP2byEowizAyGoZmaJiCjqkpmGlLqGa+F8GXIDB+qVufl0jsdvRua11O\nTHPvfWf3dnxDtmeNhXeMvO4OFY/PJeya0JDTNLZbdf732oeoVbdl9SMaV4lrvQnDuX+jhzVA+vQa\nRs4k9TeCtMdubpgm0lHWL/S0lYJlDoqnUwglCadMWnDr9M6SBEnD1F0hk0UmR4IufewCq2TGTyBb\nwiClZ9vldUPnUNQyUZmqJeD7cgKrRnY7qhSxZjaiYc1MbbOFZmh02U5Rb8QLQF2kAR0NufIQWtNg\nkmndIWKxwBLeJiYJLen2QUk5H3JXqtv5WCF7jBJq9nh5TRGdqS7HLNXs8T3s7NjDkNIqVN1ALGrO\nul4saEdvtcxyHzVvYkm9q5l1pSOItnYUnPIe2zy32ZnqpCJUTlLLJhzRQ27+UR9lnUNJUZTn46BH\ny0zvOxnfsC00x1qJhVWEvG5Ev0cwejSMuN3WndO+z71rHiCmxok6zl0zezYuq0SjKkKBdX6wOpO9\nyUNQ0IEgq5iaDyGvC/+Y1VltDanKp7ap94ZTzWtXoMbbePBn38McahKuGknzijUopSbJg1sYdOoP\nqV/zDIZQj2molM0cSPFJFqHse+9nbJ/4bbbVr2bfcxvJG1TE/ob1bB24ih/ffbfVgC5jpizzX6xp\nOy2rX8IUUygb0wy+fCxSyI9pxmnc+CeSnXWIqxNUnjEIfypE9+42/vT4U7znf5WioiLU+WEaP9iP\n5JMoP2UQgpzmwOL7GTD9GyhVm9jx65WEhxQTr+vGd9ZYmjavJdl5COQIReNLqDpjKIKSYv2GjTx6\n3wvsbtmHKIsMu2Yi+5/fSMXkcqLJycRUld2/XUf1haMJ2kQrKGnMZLYJSjDFLBPc8SJHGJ8Cj72+\nleqKMBd8YfDHll27o5n31tXx/740AZ/S2w+wfHMTYAmaX/xuPd+/YgKvr9iNIMCX5libSv158V40\n3aB0RG/BrWoGcbPLEXhr9zTQb6ibF0kIel4WMVvD8M4KAXx2qoy31rqmEUSjF2F8uLGBrQfa+fb8\nsTzxl+2I9uRTsGePqulKt39/YhVCqAsyoeO2SSKjYUTSroCIqjF+984u8kMKF8wcjOGLuqqwpIHm\n44/v70aVXVNLysgmjPakO1MT/AmSac1pSwy5kTc/enoRna3WjNtLCm2JLnYeyq7D6V86gW4YPPDi\nJmaMqaRRd/vRmewCe01aoydQYE/3Ps5ltkVOerZGkJkRNnXEeXbVe5AHp/afyYcNK+hOdRNIZZtw\nMuhIdlERKidhE7PW2h9faTOGp6+mJiPYBB4S84kF4hT4fM4MX2+vwIhZ791bm7YxavoUZzast1eC\npCEVtpHUUqRUnX2NbQiKyLLVUZIb57DCFFlt2DNpSSO1cY6nbQXDMKCznOSGOew2/OjmaLSG4Zj2\n7NdMB0husK6RShoRC1sZXFUAG9090zOonD2BZFOCC274Aet5lc51PhLNbQz68jTMhjkYUeh/6lxC\n47dgqDq7H11HwUjL1yQIAjvTFjGl2xMMu2ISgeIyzPc6efbFV2EwmLqMmbSEbv6IPIK+m5FKGoh0\nvUnj2+30O30cTWtWIPoGMmTObfgnLEZPpkltTlP3wQ5GXDeFSmEW3z3lbH665efW/dn3KfhS9t8E\nUkE7qbYEA848jeoL06jpA5TVnIevsgNl2Eb2PrmZxKgkgao2nv3lvdzz01/wh8hLpFIJ9JbhlExp\noWvPFnYWnsaqvRswdYNgZRhTlxAkHd+odSTXneU8Y/XQSIoGtmWZ4I4XOcI4TpimycptTbCt6ZgI\n4+FXLTPDtgMdWautM2iLxsgQxu66LhZtqGeJ8RSmIfIl7sU0TRassGznlw/pHVmk6gYJwRVcta0d\nxPpZAr60wE+XR033mgPAQLRnhULm5bZZ509LNxPIbJ8t6r0I46mFlhlt3qwh2StqbdVX8xBGXUsM\nqbKOjDFEsEkrE6FUH3UFRDQd4911ll3+zMnV2dqRqGHi463Vh5BKDuMbYR1P6T0JwzWHCZJGMq1b\nhCEYCEGXnLrT3UAZYKIM3OVWIBi8unwXYAn3LA0jneJwa5yt+9vZur+dorFNYKdC6vD4Cxq6XZ9A\nl+2c13TDGZ9Qqpq4v85x5m7Z105STyIBs8pO48OGFXSlIwiq3uOZ2fdo15mJiNOaBpNXHEf3R7F8\nVgJmOoAgW/dbKJYTV/bhl9IOqZu6jJkIY2oKCV8zb66qdWamelcpYoF1D3E1wZZ97cTVJAISopKZ\nHRhoug6IzkTE1BS7jky4qfXXMHCseYKoW85eUUQ3PSHO4U5KigVkScA0rUhVpWkcauVmRJsIu9Jd\n4ANT9REor8BXFCDVbOeSalzHzqUbQVNIR1Ko3d3IFbVWqpRoAUIgSaikADlcjq5ECBcNYMee/UiD\nsTSMZAgVe2coAAAgAElEQVQjGUIz66lb9Ti61g5SAtlfweDScvbu66BqwjzARPSlkM0COjtqCQ8p\nwlcUoN3YwMrdE62bMUFrGoSpg1RxyHr37O/QVxBEjp2BEVuBmNdN5PA6utd/CIKK2mmS6ugCOQ+l\nSCQiFqGhQrICvWUoRSfvo+n9j6gfGKH90DpKJvWjguHUbqsiMG4ZgqRbwQ32M9ZaqvEPitOuu+/j\n8SLnwzhO9Ey/cawwj7RAX8wWCM2dthAXrY/Jigyx0BrrnUBN1QxU0bXPIqnUtViC4sJThhIMecxT\nGR+GqBOY/D5SiW3zNi3C0DQDwzSz7bWSSxh7Ovez6NAypLI6lOEbaenwmEQE3YlL1/EIDUCyTUHW\nj8yqdOt8c9wN/Yx6/AD3b3jIcS5a11njJPgS+Ea4TtGeJikvYZAhDN1AGbTD0sJMwa7Hnvn540i2\ncDRVW+OwP24hrwtlwF6nuriazHqKpmchmtdG3p5w+xDXrdmdphvOhyx1DrTqtwV0bVPENhdKiIYP\nRZTpSnWRTGsIHvtzRdCacOzptBZ/ZQgDTYZ0yNIoMuYIu63/N/kmCgQr3Uta6XDMhugKgyrzMbqL\nEf0JyyeUaUtXQLPGIqHHiSbSlilJV/jSmSMITFxMYOJifKOsSKHAycvwjV6LGIoQmLgYuf9+EHQC\n45ZaPht/FLGgzbnu9usHMe2sJue3MmgngqTzSsPvqShREATQO8vpPjgAvbMcMa8bQdSJaDYpaz5E\nyeqff9RHjCnvJNVcx8hvT2XIWTcQrCjC0HR8Q2zbv01oAbEAIxFGEE1SZoqUaj3nYZWlgIDRVUb9\nwu0U10xi+BWXUX3haEzN4LSRYyw3V7iLqgofCAZjq/sxwJM4URAN3k09ab0XyUICYhCtfgSCaODz\nPBdBCFptRYtIdyTorHuHEdePY+T1c8grr0HrtFaoC5LGa13/a9Wn+UD1kxcIkz+8hGjzRiJ12yke\nV4nZWo2ZyEdtsFauC/6EO8nQFQrkEnxS376rT4IcYRwn+lrcdSS0JToITPwAsbAl67jXuS30cEQf\naM+2cze1xwETqayO2m7LKWwa7uPTdANN8URYSBqHbDuwTxadWSzgkJPgSzjmCsg4lU10w6S1K5nt\n2/BoGL/66H95cfdr+IZtQS49TG17b4FuNYBr/pJTiIWtGCnbqZjRMOwxaEm4dURSrkbRmMg2TWSE\nrVdLKPIX9iKMTMy/dY1KMq2hGabjP0jvH2t1K2Bf57H3aq22GcMmE9leTJZBQku6z05Oo/pcYsi0\na5om27QPneMZ56Smm05biYjlx8kQxsHGiDV+ukxKMygJFNOe7LSc3vZzUvQwd07/PmXBUtY1bSCu\nJhyflqnL6HEr/YMYtLQMQUkR0ssYUTSUAsEimqjQ7BCQqcvkh3yOWapN2YNUbI2RqfkcX1Pc7Ka2\nvc0aE122khp2WvZFQUkC1n2Zmg9Ts7UyOY3cf58zjqbmw+ioRK21Yk3rY41Zi+wyaEk24Q+a9jWW\nEURrGYDokzH0JHHDevam6sfUPUYSqRMpKCPKIsm2CPGG7KALQdIsTSqlOSlDkkYcVbfGNuy3828l\n8zBSOr5CGUFWad/QCKbA2LIa8keU0LlvFSm/pVH7NR9SuJrI7iRp22el2d+JEixGSDRhJMLEGyJE\nOzsQ82yys18fMxVET+mIPgnRL5FuDBJr2YnePJBgXn/UaJp4g/Vd63ER0zQZVjCUkmkVtGx/hdCA\nAqSgQn2T6tQHOBqGaYhgikzNP527Z97Za6w/KXImqeNEz7QaALXddeyvTzBr5HAUWcQwTdbvamWX\nsRTBl8I3YgO6foZTPsuM4o0aCUY4rLaR8XRs3NtMR0RFzG/HN2wLhzPWJU8UxKY9bZgFUTAES1BL\nGrV1liaiyFLWLNgRut64bDUAStIS8IZMXXO0F2HEEmqfC+0aOj3CWe5hOpE0MGTE/A4E0URrHYA4\nYK9DJJpmsGxLA+vbXHNQdyoKFPRpt3c0DJsAvzr6Mt7Zv4yo1sq2/W3UHe6iJXWY/V3uoilkjb31\n3fQvy0OQVYxYAUbMcjTkF+qk6tyxUA+NtGZyeKJNMuTWVerY8zNrVvwnrcAUVcsMI2mOZlPb1kZM\nsEjQSAVI+ZNEEylWbWtyxj/SKRMwcQiktjlKoFrDTPtZta2J4tJimuIttGh1jqCRuwdxoD7G2KKx\nLD68hHV1u0lmggt0mURHPkopSOWHEIuaEUSTZERm8YZ65GQZpilRl96NMtg1H+UXKZjddhK+4u3O\ne2dqCoYdAXQgtpe60IcgWOGaq3c0kd49Cd+kd0FWEfwJS3NTfZjpgKUlFbZhJj35izQFENC7SlGA\nDc2b2W1rST1hZhb3ZQhB9SGHFEJVFax78vcU1BTiZwiioZA+WINv8A4aBx4kujRG08OrkcVWAkWD\n7f6bCIL1jI2khKabTnRVVypCLKVRSMhJRGimglSeMYT6hS+jbA4QGlhAyhQp8hcy8NRx7PvrWjb+\n+SEQBUZdWU1Mq6by5MvZ/7s/ASZyno/h10wkXF1Dqm45+xc8R94IEX9pyF33YgduGKkgwUFhgv3C\n7HxoNUqgjGDxUECkIDKRwZcdpv4v2zE0A8HMY8DkGkYVj2Br/21IAZGSSVY+MkO1EwymLcKQ++2z\nfJP2+IV8PvJ9R89BdizIEcZxomfiPt3Q+fna/wFg+66vc+NFJ7N002GeXriD/hNbwQcYEgnPegBv\nDLXgIQwrB46LB1/5iEElJb3CYREM8kM+IvE0O+vbCFRGMOMFVtiepNFt57/xKyKm5PVhZMI0M06x\nUYihbqTSRkfAb2/Zh1LtWWVraxjrdmVrSQDN3RHIiJketnZB0jBVV8CbSVt9t4Xwxr1trGhfhNLf\nDQWNZKI5PGYYPVKElN/pjpMtZNvadRpbU0gFGnc89CEgIA/cgdLPQK0bgVK9B0HS2LCnlQ17mwlO\n0zE0BdNelCVlTFKOTV9xZtWCkrQCADIhlgemwIS3SeopVDsiTAzY8fSyiqiFaEtYZq3/+v1ylLGg\nNVdbZORP8r2H3wfVj2+MhmkIFuHrisfcZIKkYep5vLu2Dt+QBFIFNBa/j8+ORu3qMvn579ZTPcKA\nEvjj7j9jyvYs3RTRuotQALnM1czSCZln3twJgDK8nEhpI2JGjusy4aAbTpoFTcFI52OqCnKFu+pY\nDEb57RvbMUxLQImhCIEJS6wuaAoYMnpbP+sa2+8gdw90TJ5mMoxihhyyEBD4cuVNPL/5dacdof9u\n0HA0iAyJV593CmJBO4KskvxoNGd8fQ5rDlkmp6SQoPqrJxEQQ3SsPM1qt3wHcIARX/4acngHkhxm\nyJzbMFWLgMMnjySUCgHbCSlBQMNMhigcX05hjbvZ0ujpVwKQ5ytm0KUnOcdPH3cm+Wsklqo6pefN\n6hGZlMcPf/Rz/ufljwhOfTdraIecfROyJqLb79qgS6w6UzunYHRZ7QbS+YxWvkno238EQG0Yilbn\n46Sykfxpg/Xe5o8occcd8JsWKYghWwuzAz0CviMvuP0kyJmkjhM9NYxme/EUwOrtlkq/z07Z0RW3\nPhrTELPWA8Q0T+RSH07NDARRo7Y548j0HJd0CsLWiyLmdSGIJka0yI6McV9cRRYxPITRU8MwNcX5\nMDPn6hPZi9MyUVKxPhbvtXt8KkLPWO8+2hJN2THBdcfSjo9CShTjl3xE7Ygp12nXH1oH91lfdwQw\n7HmPTUKZ64yovZgv0yfJ7QOaYglYKZFdRpO5cJplMhF8KQZV5CMoKUxdol9xIaYukVCTpDSjlwYk\nqnl0pSOkdRVDsAlS87k+EZs0BVm1Z36C7SBOO/0XBNOZFRqpIL1gR1jV19rCV/aadATQ/Jjp7DUp\nGQIEMDqzd5zLmKRMtTdhmLZGYESzU5+bhuSkIDHV7Igv2bTqccYea11Ov9gpngpERhinuj8xmTVm\nEN+ddhWTS6xlzo3avqz7dbQ+XxLBH8eI53PV2aO5ft4Yrj9nalYf8hU3Q22GCH1DtyJIumvCsutD\ndgMA8ny2hpHuPe43zLOiP4JiftbxylAFXzt7FP9x7TR8PVwERrSIQRVh9/304AeXTWf2uH4Y8QKM\nuFun939ZEqgqDZHaPg0z7XdCrzd+uJbdj6+l6qxhDM0fglQ3BQyZorCP//z66VntZL6THGF8zujp\nw2iI9g4DzJCDKGUMltmE0Z3yOK+dUNc+nOJStiD0oiDfeoQZk4UeKbY+fiXtCEFJBiTVTQbo+BWs\n8yWhsPsh2W2kNDvssqvUmg3bGkYqbc+OPfDaoTMLpRw7dg+NwNR8Vqih2Nu0JR6YSZG/iC579ud1\nzJaGM05AeywywlfzuStae2hOVeHSrD44JKpbgtBn5qEKcfucq2GcNtZyHKKkGFQVRvAnMdMBqsvD\nln9BT6NqRtaCO61xMELamrZ/cOhDN4eRprjCTk4jD9yBGIw54ZboEoIvbZmQQvYCxcysug/CyNj0\nzXQAI9l3umovQQCOoAHQu0t7FBYI+qS+NQy7jxnzHcDs/jMYFHfNqpnV1hmU5RXafXClp5EMUxTO\nlqaFRjUPnH4PY0tr+GrNZUiiyPjhpVQXZffPebaagmlaa1YEAcx4mJrBxUiiyPQRgykLlCCLMpMq\nxnN6v9Pd6+PZAj4gWfdZErLfJ1l1vpOwL5R131747b1A8noQRnmwFJ8iMbgqH9MT7ZXeOx40n5OO\nRu8qRcKtt6a6wtJeDYnUllMY4B9ivZeeZyeJApUlIYxIKckNZzghv+ef/0V+/L8/Y9bsU7l50nVU\nYKXqLykIUBLOo+DwaaS2zbBuJWb199PuZZJBziR1nPDmBDJNk4Zo741aMoudRMkua4hZSe/aeoR+\nWoV7C1JB0qwgyZ4mKSAUykT72I7PVBCjuwQpvwMxvx2js5LX619yzgmBBMqAvWjNAx1tYES/MtbU\ntme1ldRS4AO1biT+0WudKKmkpiKIJnpnGXpbf3zDN3kcxiZyv/2YJugd5cjlDS7ZZcpoCromZt2n\noKQQ1TxSSYHSQDFN8WYQNQTbOW9qClXFhXR67tNLQJkP3Aq59Ttj6RdDCAh9ajkAQTGPLpoAN9QV\nXabAlw+m1a+64GKEtIqp+qgsCUKHhGqmSau6syq3Mj2eA7X9CA6tBaxEiMpAPP2zSF0qbUSusHwr\n7sI46/n5hm51H6pNGHpXH9sVO2s4BFJbZiGGO/HXrHVs8pBNGKmdUzBTHmJRA4wqqGFXtxUSbSbz\nrDDqPoRkpm+GZ0+Jr9RcxsPbNwMtWWOZQUAMuved6U86QFFpNokZpokiynxnwjezjs+pnoUoiAzM\nH8Db+5ew0dGIBNAUh6CMeIEVzAGIgsh/zPwXTNNEEiUi8TTPYq3OHhQeTO0u3VnAWJIXphsYUFrI\nLl1CUNKIpjVpKA4UAJZ14Cv9b+Cpv+5AGbKNEZXucwhIQSezzpSKCUiiO26Z8GC1YRh6mxU4kVnT\nlN45lSvPG41QVks0HUUUxKy8bFcN/xq/fnkzCVxLgCSJR9xPZu6g05g76DT7/oWsv/5UBUY0yuDO\neezYZ1kxPqudNXMaxnHCq2E89PJmlh305sI3eWt1LRv2WGaqhO7amJdsbOAHv1nGDx9exs5GTwqM\noxBG5pwv4LaZSbFQG7TTPkiuQDZsQZMJE21OWmTmnWlKth0YYOzAKlcjsDWE1qgdhaTLljARdeJJ\njZitTZi67NEi3BBOMRTB6C7FiJTY5/oQ1rpkCUw5TSaSRzaDpFSdPMme+fkTjoYRkP2UBSqsuPxQ\nt3WNrWHEYgJkZqGiq2GYhoBf8hOQ/S5ZeUgLICSFARPfiI1O/itTU5BECUH3I/iTHEpbfhwjUoxf\nkZBQMFDpSHXgG24984BZCAjEunrvlmZqiiOwM2QBONpeJmIo65oMKegKqe3TUBuGec65c7zivDyM\n7jLmyNeS2jrTLWObpExDdOzhXnxp6JepqL2MxPrTMZNhR/CqDUPRGgdz57Tvc8fU7znlja5S9O5i\nrj3pq9b9emar6r5xaI2DLU3UhALJ3r5W9RKGv1fK7SPt6xGQA5w9+HRqSkZy3divY6ZDVBRnk5Bp\nWuG2smffB1EQHeHtTTJZXZ6P0VnhaGahgFWmqiRkmQP9CcTCVoRUPpVhV7vpX1COmQ6S3jWF2cXn\nO8eDinsf3zz5qqy+OyHzpquBC4Jgj69AMmUwp3oWXxx2DuDuJW/1J8+a3HggScIxbUCW2aUv8zez\njsqnF4Lmd+r6LJDTMI4TXsLYcOgAgWKPM1jS+OP7vdMyI2pZi98+OlBL5vtyBGsfZqfMMcVnkAZS\n26chV1o+hiitWAI0EyapYCZsf4Q/AZjEtThiogitYTi+/vsxRZ3+VQqqItEFDC4rwUzYaQWCEaev\nAKP6l9AgKhiiRgroiEUhz1LfC0tLaAIKC0UKggaGqNEMmKmQa4bx234a2TaJ6bITIRKc/L6V7E0A\nn2l9GPFuxel7hoiCcoCg7MfszkMq6CAw5V3XBxLVPFFNacyEPV6GjF+WCEgBYpLttLZJZkBxMeWF\n5ZRVRjncsMddh4IrkAv9BXTq1jM1Yvmoh0YhjxbxC3kkpC7WR62QWb2rlCADgE6MVO+Pe9zgKjZu\nVjENwVmfotYPR++01kQU0p90ohgj6AkB9chSI1JKuTIQv5yiRavP0h5GDypi5dYmGpu1LDt55h6O\ntD+SLIkEfYqTjfULY6vYUduBblQwqDLfycT7pTMMJ6tvLDmKaVVDgGx7uJkOotZamQgQNQomlAP1\nWRqGqAcJBrJFzbFsJBXyy5wzbSBD+xVwsDHCYqx3yegu4dSThvQyczntiQLzZg2hrDDA5FHlmILA\nTsqI0UhC7GDK6GmcMq4fi1f5EPyWGfCU6slZRBgOevrr6eqXJp3G3sV7OXf4ab3a/ebYq3h++4uc\nVDwJIRxkUKU1+fnxNVN5fel+5kzMznA8fngZm/a2MbgqH38fPgZZFKksDvGFkyqtRcJHwNfOGYUg\nwFVnj7Lv3zpumCY3XXwym/a0UlHUhz/sOJAjjOOE1+mdlTAMPLmaTKSyekQ79YaXDARfAqmkCcGU\n0ZN+O1bd7B0JBc7MOUMY86aP5C9rTDd1sZy2BbLghiEaohUWKqvopo6oBQCBkeo57PIvJL9fC/u7\nLRNKcSjsONsy2TIzff32vAk8vHk1TVFLW+mIRyAPJg/vx1nDJnL3ync4eVQeqxvfdrprpv0Y3SWY\nhkigooVo/UhL+NtOVG+4rtLvAAB+wRK2sYgCPpswbNJSBL/luI8VIgZjDlmYhkB7JIUpZaKabHVe\n0jB1GZ8iEZQDCJLtRLcjdvoVFfLtOeN4v7Z3csSM9jG8tD/rmi3C0BqHgO5DlkTyKSNBA43GXkxV\nIb1zCsnBlmTuy+cwdXg1GzfWYkRKHOe+1jDcIc1pNRXUh8vYF3P74kSSYZkZ7vnWDDRjCk+/s5kV\nmvuuDanMZ+XWJg4191jImVmUeAT7gSQKWcJRkUW+deHYXuXOmzGoz+sD/iM4UL3OXc//kh7o5XQ1\njmEZkyAIzla5M06qpHHdaHZ27USrH8k3bh5z1GsvPc3Vyu78+jSW7/LxwPpHOGfw6cz4wjgg25x2\n/qjZ+D0+hnDQoyF5CCPPH+Cn59zUZ5tTKicwpXJCr+PV5WG+c8m4XscHVoT5169NOeI9SJKAKAp8\ne/5Y5k6p5r+eW9dnuZKCALdcNt75ndEwTNN6v6bVVByxjU+KnEnqOOENq80IIjFtxzlnVvKWHsY3\nzLPdo3fXrSorT5NsBDET1voAb8RGFjL1KVabhUHLFKE1WpFD/lEfWZu/2AIZBGQziOCP4x+zyuqj\nZs0mFXsmnyELsGbw6ApGKmjnWXKJK98fxC/50AXLfNSVtMgvpAQJyZaAbI67EWJg29AN2TJlKBGU\nYZusML+MIOuR7hogKFpj195qvZK+wdtRBlob+CiCD0US0Vtck1p6/1jSuyfRGU31zjBqL37zySJB\nOWiNn5J0MuTmS5Y5r6iPlOgZQd4vz90vORMlZJgmJZnNHsBe7CaSyKQf13rPeDOOVL3Ds/+y6X52\niizik+2oqGSIacp8x/4N1kxXEAQUScHXY0/monzL1JPZutbnmDhs34PZt1SWJfFTRc18UgeqpId6\nXXM8W9VeUHUuh1/194raOhaMLB7G7JbxTCw52TmWeV/KxGqK/IX4PTnegh5S/LTb6h4vvCY3fx/5\n546ETD6549nO9+PwT0UYa3c0H/POcQBbD7Q75ZfVr+LB9Y+x6NAyHlz/GCnVE29tE0Y6Ypt1MkK/\n15oETyimLTTDTV/AsGeUYjCWlSJEtGc8gqSBnKbbb4UaFoes8hlBKYatqCKvfdtPHoKiWnWCE32h\nkD0LvnnC9c6MxIgUIShppPI6pIJ2ezcw2UkpoAzZSjRtEUbYZxGGX/JxwEM+Vr9sG3pmEZG9JsBU\nA4693IuS5BiqJWvG2NSH5q0IPkvDiJSSPlhDet/J6C0DMboqSKV1d92Eo6VpmJpHwxCsPToE0UBr\nGkSRYtmqi/wFWe3oHRVkhG2VhzAymkNbV5Iqv0tahm3Gc/ercG1AWtMgpGgV+b48u+5K+xpXewCL\nMBz7sikwrGBYFqH0lagyg4BPptJj4y7Jt94Hvd0itUtGfLHP6yRJ+FSE8UmEF1ihtgGlp4bxyYWZ\npBl07l7/8QWPgJdf+hOplKvdavUj0CNFTAudC1imrAwET7boz5IvdP3Yd9b07rkuHmX/9Z5wNYwc\nYRw3uqIpHn51Cz96fNUxlVc1nfv+sMEp/7udf2ZXxx5e3P0auzr20Kl6NqXxxzFNHD+AQxR9PS8x\n21fR1OheJwSijrlletVkTi+4xDouq/g9+xP0K7EEXc9QSO8aiJ6z0YyJQBEUfPb+CacNmMWY0lFO\nkYxQcyN2rBfvi0MtJ51cUYdUbS0AG5jfD0mUmFHVW6XOxPR7yUnvLkatHU1Bno/UzsluWU1mgDrN\nEaz0CAkFCAhhZJto9KYh6K3V2e1lNAwlZa9lsO7Xp9gaBm5Kc729Etk28hb63N3xUtunkd7t9qsq\nz1Lj8+QQX5w5BIAR1YUUhcKOZpcJLvBubas1DsJI5KEeHENB8yn4MpEwqp/k5lmkdkzP6rsiiRT5\n8537KAhlayleQpgwIjtqKuCTGFLlRjBl9pEw4wXcPv7HzB14GtPHWPcxdog7K5dFkZKCQK/6jxVe\nshk+wHoX+9v5lE4aUuwIuglcSGr3RBRJ7mXGOnlYySdu9+knH0GNt1G37H94+GFrkezvfvcc3/rW\n17n22q/y5JOPAZBMJvmXf/k+3/jGV7nmmitZuHAhL730B1pbW7jllhu59VbLpKS39yO9/QsUBQp4\n+unf8q1vXcOBxffTtOnPAEwaWUY61sbzj/yEa6/9KtdddzUNDVagygsvPMM111zJN77xVR599DcA\n3HLLDezcaUWfdXV1csUV8wFYuPAv3HXXndxxx//jtttuIZFIcOut3+G6667mmmu+wtKl1t4oM0+u\npLtuHQcW/4qDSx5g6YLHiMfjXHHFRYSD1viVF4hcccX8oxKP6DFJfdb4p/FhJNVjZ3bg43eX09zw\nN9GfsNIhZGyits9h6thiNtmLuX1qCWmlHUHWkIUApcUSHYaAronki4VowMk1QbbssGZANcUj6U5b\nAk0IRrN24qosCvOf189gb9ce/njQTcAnKCrhoEI0oeLvQRhCOjOzFcj35dOWbGdcmWsHlkQBvbOc\nfMqJ2CGTGfIaXjSEfKmQiN6F6E+iNQxl9BlW7Pe4spOy9m8A+NcvzaIsXMi6Vh8v77MIRj00GjNW\nREF/H60NFWiNg5GrDmJqlvbgmizcmZR6aBR6exWhYWEUj3r+X9+awdqdLbyyxNK4fvSVmfxqxyLL\nxODJwqrIEtjpHjIRY0a8wJnRF/o9C6aS2TP/fnmVfHvcNQzKH0CRv5DZ4/tRWRwimlBRa2vQmgc6\ncfEZk9SVZ47gD+5eRUii4BAdgJnI1mgAZFnk4hEXsHhjA2r9CAKTJX51y2zASo8+sMJN5zBxRBnX\nnl/D03aW4IBP4tLThjFpZBmSKFKc73cWjQZ9fgRB4LovnsQVp49g+ZbDbD1g+UkkSeC8GYMYNbDo\nmKJweiLgk5EH7kAqaSSWH6BqqDWrrTIMXm1bReUpJqYJO9M6yiCVuLiLx3Z/gH+CvTo56OPt6Cre\n7rFf16SKcVw6Yt4R273pplvYu28vj//2BYJ+mTVrVlJXV8vjjz+LaZrcccdtbNy4gc7OdsrKyvnF\nLx6wxiIoMHWqyR//+Ht+/etHKSjIfg6yJHDZZV/m2muvJxJP88tf3M3y5Uu56eJZfLTgl3ztm9cx\ne/YcVFXFMAxWrlzO0qVLePzxZ/H5fEQivZOBWnDf5a1bN/Pss38kHA5jGAY/+9kvCYVCdHV1csMN\n32D27DlMGGDyx9YVjD3rFmKqzIzRhYRCISZPnsKm9av4+Y0z+fD9vzLo9LlI0pG1vIwyciI0jH8a\nwhCOsiFRX/i4zYISzqY1JigppEQxWo/V0uEwELdCJwdXF3KQ1Yj57ZSZZdZCsbi12tcvBtEAjZSb\nUVP2kySIqUvZ2VptDCjLo6L4JA7qk1m2qd6J9Mns+RASLPu8IsrcOP4b/O/WRkBFECxhmNbTjCwe\n7tTnUyQSKZPx+kWYFftY3vZ+VnsFvgIiiS7MtB+haYwznhUhd9Z77uAzKQ4UMaLKmtUWxz0fph1m\nmZlBqw3DEQJx1EOjUEb3bVM3U0HMVAhZErPiyPuV5hFQ3FTNAyvCBHblEVNSrjlQl/ErIrKc0T7S\nVuJDXXFsw7Loef378D9MKHcdwZXFlmDND1p+ogxZgDXmfkVi3PBS/uCJjpMlIYvo+oIii4SUIOoB\nq62AX3JCUHuGogIMKHeJLeCT8CkSowdZ2oNXQGSISpFFSgsDTsglWEQmCAKjBvbhwzkGeLUFr9lE\nsl7OvssAACAASURBVO9VEAQrd1MmlTnZ35/8KUI8JVEg6Lee2+rVq1izZjXf/OZVmKZJIpGkrq6W\n8eMn8pvfPMgjjzzEzJmzOeusU0kkbN9cH2q/KAqsW7ea3/3uOVKpJJFIhFEjRzJx4mS6OtuYPdva\nr0NRrDFcu3Y1X/zihfjspd35+fm96uyJadNmEA5b74xhGDz66ENs2LAeURRobW2ho6OdDRvWcdbc\ns9mWCoGaxh+w3rl58y7id797jtmz5/D22wu4444fH7Utx8T88cP5ifFPQxifFD0JwycqpD07waV0\n2xYqWInNArKPZIYwZBVREBDkTNK6MvoNHsRBczVSSSO0jUJHdXwOQTFIDEgZSQT7W/RLfuv1TuYh\n2NpFdbg/pw1w4+0VSeG2U77FB6/+GZ+S4ltf+CK/XW9F0fQXxnDR5AmUBIooCRQjmNZaDEEQuOak\nK1ENFcUjMH2ySCJl7fw3INhjNTBWfDxYkSU+z4KjYr8rdMaW1jC8aIjzO19xhWrGz+Bsh6r5SO+y\nzFmyLGaFFRrRAsRwt+MjUGSxVyoW78zdp0j4hSBxOe5ESpmqgiKJ+GSP2c7ug1fI5UtFdGtdWX6D\noyEv2HutBVhC0DsuVjtiVj/7Qk9C+Tj/gN/TRk9HcpZQFrPr9fb7k06eevVBkdAO1aAdquE/7zzz\niOUWbajn2Td30q8qn3/56iS+c7+Vb+rWG77gEPCngWmaXH31tcyff0mvc0888TwrVizj0UcfYteu\nzVxxxdVHrEfXNO6//xc8+eTzlJWV8+STj5FOW0EeR2oXeo+hJEnOam/rehfBoGuefeedN+ns7OSp\np15AFC0TUyqVdgg/U3OG/8eNm0Bj48/ZsOEjDMNg6NBhHA0nUsP4p/Fh6H1klz0avPtX17VEScaz\nBUU0kw7DdmIHFZ8zSxWVNGWFAYdUTE2hUClCNoIIgTiabpA0kgiGveLYFmppM+msp/BLfnyymJX+\n4bwhczllwIzenTVk0tu/wJTKCc5MMi/gY0TRUEoC1uwzbM/sg36JkBKksIfDt7TQ9jvIEiXB3qYT\nE3vmbkieaByyVro6foi+ftuJ5zIC0jtTVmSRoEf4pXZOJbVthpOCWpFEZ9V8xmneU9AGpCCC5K6+\nNhNhEARnbMFNV+FdxPSV6utJrj271/0eCflHIAxJElGU7D7JkpBFTn2h5wrcjyvvHfujOa5TPUyw\n4SP0+0Qi0z9ZErOIsCeZHStCoRDxuJuwc8aML7BgweskEta3aM3UO2htbcXv93POOefxla98jW3b\nttnX5xGL9Q56EdEQBCgoKCQej7No0XtO+YqKSj78cBH8/+3deXxU1f038M+9d2Yy2ReyEjBCEAWM\nAsomNMgiQcKSFKIsVm1Q3BGiCNIifUqr/YHlKTwqlmKlVV7Sal36M6htQUULYl0ALaKCYkggC4Ts\nyyz3PH/cmTuZbDMJmSQz+bxfr76aO3MzOXNk7ne+59zzPQCsVisaGxswdqz2d50T6FVV2he6pKRk\nHD+u/a133/1Xi7/jVFNTg+joGMiyjM8++wTFxdpNIddcMxbvvvsv2B03ljTUu9qakTELv/jFz5CZ\nOddjPzm/HIQEdf1/8z6TYXT0rozqJgHjnY8L9LBvLUqFMfkkyqqrAARr+0xD+7Y/69qh2Ft3CCkD\nTZgWNxifWBxbVzrWBBgagmBVamG122GxW2CUItEAINhkhFkxo1GthxxdqxW6C43HgCtM2HcmAWeh\nZQfNL/JOt994hT7xuiLnKrzz8WnccO1At3Puy74Sez76AZlt7A54X3Ya3vjwe/w4fTAMBoGYE8lI\n6+cakokLjcF31d9DrQ9zK2kAuLKv5uWTw4zux4C2O194iBGzr7sUj2w76Og7GUMHRmHssHiYTQom\npfXHZ9+U4e2PtbuvDAYZE0Yk4lRxNW64Vpvwbn6hHRAdjeLSH/RbcZ3lLNwDhpZhNL1gpQ2Kw/Uj\nB8JkkPGP/zQpid6GfpFmTB2djJIL9YgMNeHAl45V9HbV7ds/oF38w4KNuHHcJahtsKG8ugH9Isw4\nXlDh2N/EFfhWLRyJ4wUViPOwwKpp37dW7mHdbdfi0LESDLs0BufPu/YM6cqAMbh/BDLGDsS1l7d/\nf78zAzIoknv208kyFRERkUhLuxq33bYQ48Zdh3vvXY5Tp07h7rt/CkALKOvWbUBh4Wk8/fQWyLIE\ng8GIX/96AwBg7twsPPzwcsTGxmHLlm149JbROPhlMa4dMRBz5mTj1ltvRlJSfwwb5vp3//Of/x9s\n2vQ4duz4PYxGIzZs+A3GjZuAEye+wdKlt8JkMmL8+IlYtuxeLFq0BOvWPYp33nkL11wzps33MWPG\nTKxenYc777wVQ4ZcjpQUrXbZoEGDceutufjt//t/UIUEUTAE98y/1vE7N2LHjmcxffoMj/20cNpl\nMBpkZE1qPxPpjD4TMOwdDBhNh6TCgo1AowrVUacJySfRqDrKYjvmHAySAbPHXIG97wNh4SrGDkvA\nB582aIvpVAVGgwwDTIChEjZo6WqwYkY1AHOQAaHGYJxvOA85SCuJ7RwCmjUyDc99qU1sR5paHytN\nv9p1335yXBhyM1suakrqF4qlmcNbPO4UHR6E22+8Qj/eMOVBt+dzhs7FocNVsBamICjW/QP/iwmr\nUdFYqd+R5KSViwaCJDOcMz4hZgNuv9G9fUaDjBCzAXfPc90jP2RApB4wjI45jFszXGU0mo+qhAe5\nByfn4rembWotw5BlCbdmXI7PvynzKmBIkoRbZriX8zjwZTEaLPYWGYZzTD9nyhC3x785XYHf7NJq\nGzkvnsMujcGwSz3fOdS0nERrQ0uDkiIwKCmixW2YXRkwZEnCzVMv83hecJMMo6mLmcN47LENbsc5\nOQuRk7PQ7bH+/ZMxdux4/TguLhxlZdWYP/9mzJ9/s/74ZQOicNkAbUj1jjvuxh133N3i7w0YMBBb\ntmxr8fiSJbdhyZLb3B675JJL8ac/vaQfO1/vxhtn48YbXZP5kZFRePbZP7b6/mbOzMS/votCeVUj\nJqa51vwcOfI5rr9+GkJDPe9pERUW1O5n/WIwYLShpt41BilLEiRZdZQB1z54+hyGI8MwyAaYFCOC\nDWZUWbS7Jupt9XoZa5NRhkEKgiQJ2GXt22WoY4evIKOCEGOIXozQVurKDgaEuYJBRBsZRncINgQj\nvOpK1Kv1LdYGRAZFtJr9yJKMX123Fke/rcCfoN3RpLRykfNmYri5pkUcASDM6Brisl+Id5UfaSXD\naHUMv5PXMOeF2K4K/XZGp7aGl5rOJ3h67821ty6jPT0xJBXUZsDoMyPhF8dxyfrd7zbho48O4skn\nt/Rse9CXAkaTPbj3HzmDlIRwpCS2/MYuhMAbJ97BB7YPIIeNhloTjeo6K2CyA6pZ2zcZTSe9HUNS\njrUNEaYIVFq0Mc16W4Ne9MxkUGCStAuW3VAHBUBEkHaRM5sUfW2EWh8CUedaHxAb7PrW2XSSuic4\n747pSOXLaHMUgg1NbkFu5SLqaYiitQtM8zH60CYBo+l6ioSQJsX3HHNMVbUtV5o3L/zmrbYmwYG2\nA0bTi3dHq4h6muNoS2cDzcVoOiTV1MVkGH2B89+i84q1YsWqnmtMM30m1NubFK/Z+dZxvPHh962e\n91Hxp/jn6X2QjFZHZVSgqs4CSCqEKusZhr4VqbPOk+NiHmkKR621DjbVhjpbvV6O2qBISIrSAkGQ\no8rpgOh+UGQJ/WND9Q2Y1NpITBntWk0sSzIWXp6Nm4e2vBOkuzk3u4+NbGXvhHa43XrZygWvMxmG\ncyhh2mhtTsOstFzwd/nAKIQYQ/QsbdwQbf5mUFIrmVonr2FtTYK3J7RJIb6Oftt2ZkfhIZ3LGDob\ncDojMtQEk9G1SND576Z5JkbuenP39J0Mo9mQVFVdy2+ZAHDgzMf6zwYDYAdQWdsAqZ+AJBRXcT/F\nimnXDMC732jrAYyKI8NwLAYrrTsHi90CYXUsvpMkJEVG4kgFYIMFI+PSMOuydMxcriA4yIC3/q39\nK8m6ZjRmDnatvgaAHzW5lbYn/XTWMMwcl4Kkfh27JdKt5EJrAcPTraetPJ+SGI7fPTCp1QvnM3np\nsFhVRDjWMeRdcy++vXASI/pdgYUTbK0Oz3T2M9pehtHWIGjTINGZfQqeWpEOo6HjLX4mL73TmVRn\nBAcZsOme6/Ry448vG49Gq/2ib+vtK3qohFW7+mzAaG1hnipUFNWc0Y8jwhSUAaiu14afTIoBjfq2\nmlaMuiwW755wZhiOW9kM2sX0bK1294yzrpIkAcFG1wRs7ojF2i2pjv8Cd111Gw6e+Q+mX3pdr/1A\nGRTZbeWxtzxlGJ6+ZbeVgUQ0Wdg2Kj4NpxtO49p+Wplqc5M1b0GKCVc6VrV39Vh+cHs1mbz4wHcm\nYISYO/ex7apd1zoivEmpE4Mic/7CC66Pf++LGH0nYNjdO7+2WcAQQuBc3Xk02i0wS2FoEDVw7pVS\n3dAIAwBFcmyPaTVBMjZiQHyYPodhUrTnzAYtQJyt1Uo06HWOmq0JaLp+AQAujbgEl0a0Xk7a3zXN\nMFobjvC0uYs3m78YZAPuGrMEZWVtlWloX2djdHsXYW8+7ryAUlt6X7joSwGjWQH+2gYb7KoKxXFP\n/rbXv8TnZV/ANASIkhJQLGrgqAQAq2prGTDMtQgJUiDJzoDhvgjvrVPawh1nwDAbFZQ3qT/VlzS9\nM6q1DKM3jGl3tAKrU2hwOwHDizGFjlQhpb6hX4QZZRUNCA9ufYOonuTzgLF//348/vjjEEJg/vz5\nWLZsmdvzZ8+exerVq1FdXQ1VVZGXl4fJkyd3eTtau622tsGm1zb65OsyKLGOrVCFNrlrNDqW6jdZ\nawEAsJkgSY6tVx0BI8jgWEltcJ8QXnx9Gi6cicDll0QBF5IAoNUKr4HMLcNoLWB4uGh2xxDd0IFR\nmH1dCkYPbbmlaXuS+oUi+0eD9HpOD908Er/9y2GPv7d68Sh8W1jZar0o6tvumD0c//jPacydeGlP\nN6UFnwYMVVWxYcMG7Ny5E/Hx8ViwYAGmTZuG1FRX0btt27Zh1qxZWLhwIU6ePIk777wT+/bta+dV\nO6fVgFFvRUSIybUK3BEYZKF9iA3O4W5HUHBmGMFKCCwAqi01gKT9jp5hKO4BY2hiIpKHaIHiipjL\nsHrMciSHJnXZ+/IHngKGp3jQHd/BJUnCj9NTPZ/YijkTB+k/jxgUg4HxYThdWtPupOXll0TrQYao\nqZgIs77TYG/j0wHUo0ePIiUlBcnJyTAajcjMzMTevXvdzpEkCTU1WgmDqqoqJCQktPZSF635HAbg\nmvgur3Ls1OYIDJLzVliDM5Boj+sbGtm1eYoaa22TDEP7HXOTDCM18lL0D3Wt1gSAS8IHtJi/CHSe\n5jA8DUn1ghGrDvGz5hJ5zacZRklJCZKSXN+mExIS8MUXX7idc//99yM3NxcvvPACGhoa8Pzzz/uk\nLc3nMABXwCi+4Cho5pjA/vpULUypgKw4Aogji5BaCRh6kHEU12saMG4amtVr73jqTp7u/fc8h+Fn\nfehnzSXylk8DhjeTfvn5+Zg/fz5uv/12HD58GKtWrUJ+fr7H34uL81yDvqngkJYLuyRFQVxcOGzf\nOfZWcFz8nWXHTWatrr/z8eEpcTh9BLhueAr2lR2BMFmR0j8UZwAMHhCLuLhw1BtdpcFTkhIQE9yx\ndnZGR/uiu9VYXcG6aVunXjsQ+z45jauuSEBkWMv/PjdOuBRvHTyFMWlJ6BfZflG+1l6/pxgdE+hG\nk9Kj7ekNfdFbsC+6hk8DRmJiIs6cca1rKCkpQXy8e4XLV155Bc899xwAYOTIkWhsbER5eTliYtov\nxNbR2ycrK+tbPHa2rBplZdU478wwZOdeDY7yHxYLQs1G1Dke7xcWjGcfmoTvq7/HvjKguPw8RgyK\nxJkCQG1UUVZWjfoGV8mKxiqBsprO3ebpLWdhtd6sssJVkrppW5dMG4KbJg+Gpd6CsvqWCylzJg/G\nvOtSoFpsXr3H3tIXNpv276Wx0bt2+0Jv6YvegH3hcrGB06dzGGlpaSgoKEBRUREsFgvy8/Mxbdo0\nt3P69++PAwe0vRpPnjwJi8XiMVh0RmuT3s4hKYujLpHkGJISqgFCAHZhQ2iwUb9LyigbYTIqCHPs\n81BtrYXVsamSwbFwz9xk0tvQw7Wfeou27oKSJMljjaOeqIFERK3z6RVNURSsW7cOubm5EEJgwYIF\nSE1NxdatW5GWloYpU6Zg9erV+PnPf46dO3dClmX8z//8T5e2obSiHmFmY6tzGM7FexbHN0LnXVJQ\nZUBVYBM2hAUbcM6qfft1VkR17vNQY6lBiFFb2e2sJeVcuEcufW2tQfMd04gChc+/AqenpyM9Pd3t\nseXLl+s/p6am4qWXXmr+a13is2/K8NSrXyA4yICMMQNbPF9Tr627cGYYzklvqDIgZNiEHZEhJkj1\n2oK7CMd+FGGOIFFjrdVrSDkDhizJkCAhOaxv3TrbntZKmgeyhJgQnCqu7nCRRqLeLqDHTM5VarfL\n1jfaWpTDBoAaRwFC5/af+qS3UABVhk21Ys7ES1H+5ccoBRDpKCyoyApCDSGottSgzlYPo2xw26ti\n65QnfPiu/E9fyzCW3DAUybGhmOqopEsUKAI6YDTdx7uhlYBRGvI5vjwXAoujUrnzFlmoMoQqw6ra\ncGliBGKLJZSWa3tdOIWZQlFcp9WLuixqsNteFbLE+kBNdWdJ7d4gLFjbgpYo0AT0lc3WZKK7+Q5t\nMDagMeobbDv6vGsOo+mQlKrApmqRpNJShSDF5DY/Ue7YHQ8ALo/unasye4u+lmEQBaqADhhNM4wD\nXxa7PRfUpISPPofhzDCENofhvAOqylKtz184OSe+r44dgWmXuM/RkLu+lmEQBarAHpJqZx/vsHAJ\nztUBFpsKJeYslIhyCFUCIOlDUnbVjhpLLeIj3YvS3X3V7ThR+T0mJ/fe/St6C/YPUWAI6IBhs7e8\nldYpOFi4AobVDtOQIwAASXYEGVWGKlRcaKyAgEC0OdLt9weE98eA8P6+aHbAYYZBFBgCfEiq7QzD\nFOya09DnMJpy1IYqrdP22o4OiuraxvUhnMMgCgyBnWG0MSQlRxfjbJhrzwJLK3dQCUd5kMJqrbRJ\njJkBo7MYMIgCQ4BnGK0PSRmTvnc7Pnu+rsU5wqIVuztZqZ0bzYDRab1hRz0iuniBHTBayTDWLBmN\n/v3Cmj3qfp7JKGP2tcMAACcqTgHgkNTFemB+GtbfPqanm0FEFyGwh6RayTAGxIUhvCwIxU2315bd\nh6TSBvVDapwROAM02LXV4lHNJr2pY0Zd1rGtT4mo9wn4DMOQ9B2MqYfhzCIUWYLavCpcs4DRaLMj\nxuzaPlOChBCDd/sxEBEFqoAOGFa7DcaB38DQrxgwaHWjFEVCjbXG7TypWcCwWlW3gBFiCGa5DyLq\n8wL6KlituDZvks11SEkIhyJLqLI020xFsQNCm5iV6qKxaPplWikQx94WIUZmF0REAT2H0ShX6T9L\nQXVY/9MxsNitqLc1wCAZYLUCksGmD0n1D03E6uuX6xsfRZjC0FDfoO95QUTUlwV0hmEXrm0/JbN2\n62xFYyUAYHjUlbCVpGjPKTZAEgg3hbntkucsNhiscF8DIqLADhiSVf9ZCnIGjAoAQIw5ErBrq7kl\ng3ObVfeES5G051Vw6zQiosAOGHAFDNlchxMV3+O7ygIAQGxIDITq2C/aETCMzQOGrD1vd5Q5JyLq\nywJ6DkN1ZBjCrkAOq8T//Wyb/lxCWAx+lDYQh6q/ajPDMEjasV20XcSQiKivCOgMwxkw1NqIFs9F\nm6NwSZxjMZ7SesBw1o+KbLYXBhFRXxTgGYYNQgCiLhyIuOD2XHRQJIIUbRclSR+SMrqdM/+yOTAb\nzMhImdo9DSYi6sUCNmD885PTsAkLJFWB2hCqPy5Bwk1D58FsMMNs1AKEpGhzFAbHnIWT2WDG/Mvm\ndF+jiYh6sYAdknrpX98Cih2SasSg8MH64z8fl4f0AdcBAMwGxz6tbWQYRETkErABA3BkDnYFa3Mm\n64/FmGP0n4ONQY7zHHMYknuGQURELgE7JAUAUGxQHftaPDZ+FSobK2FSXFlEkJ5haENSzDCIiNoW\nwAFDhSSrUG1a1pAQEoeEEPcS284AoWcYSgB3BxHRRQrcISnHRDbUtoOAHjAcGYZz3QUREbUUsAFD\nMtcDAISl7TpQpmYZRfOV3kRE5BKwAUMO1kqYq3VtL7prPmfRfOEeERG5BG7ACNEChqhvvn+3iyIp\naLr5HjMMIqK2BdwV8u8ffo+TZ6ogBWu76rWXYUiSBKiKtoESmGEQEbUn4K6Qr3/4PQAgaLgNQpVw\n95yr2/8FVWbAICLyQsAOSUFSIUPB2GEJ7Z6mlzgH12EQEbUncAOGrEISXqzcVl1d0LyWFBERuQRu\nwJBUSN68PWYYRERe8XnA2L9/P2bOnImMjAxs37691XP27NmDzMxMzJkzBw8//HCX/F3JywxDbrJY\nj3MYRERt8+kVUlVVbNiwATt37kR8fDwWLFiAadOmITU1VT/nhx9+wI4dO/CXv/wFYWFhKC8v75o/\n7mWGkZoYjZNV2l4ZvK2WiKhtPs0wjh49ipSUFCQnJ8NoNCIzMxN79+51O+evf/0rFi9ejLAwbb1E\nTExMay/VcbI26e2Js2ItwAyDiKg9Pg0YJSUlSEpK0o8TEhJQWlrqds6pU6fw/fffY9GiRVi4cCE+\n+OCDrvnjkgrJi4BhNrgCBjMMIqK2ebxClpSUICGh/VtT2yKaLqNug91uR0FBAXbt2oUzZ85gyZIl\nyM/P1zOOzhGQZAFZ9RwwghRmGERE3vB4hZw/fz5GjRqFxYsXY8KECR168cTERJw5c0Y/LikpQXx8\nvNs5CQkJGDVqFGRZxoABAzBo0CCcOnUKV155ZbuvHRfX9gpuSFqgMshK++cBiC5yPZ8YH6Wt/vYz\nnt5jX8K+cGFfuLAvuobHgLFv3z7s2bMHv/vd77BhwwYsWbIE8+bN8yoDSEtLQ0FBAYqKihAXF4f8\n/Hxs3rzZ7Zzp06cjPz8fWVlZKC8vxw8//ICBAwd6fO2ysuq2n5RU7f+F3P55AFSLK0CcO1fj8e/2\nNnFx4R7fY1/BvnBhX7iwL1wuNnB6DBgmkwlZWVnIysrCZ599hry8PPz2t79FdnY27r33XvTr16/N\n31UUBevWrUNubi6EEFiwYAFSU1OxdetWpKWlYcqUKfjRj36Ef//738jMzISiKHjkkUcQGRnZ6Tck\nSYCQtYCheDPp3WRIioiI2ubVoH1RURF2796NN998ExMmTEBOTg4++ugjLF26FK+//nq7v5ueno70\n9HS3x5YvX+52vGbNGqxZs6aDTW9JCKFVn3VkGLIXGyIFGRgwiIi84fGKevfdd+Obb77BwoUL8eqr\nryI6OhoAMHr0aOzZs8fnDewIu6rNXUgdyDDMzDCIiLziMWDMmzcPM2bMgKK0vPi++eabPmlUZ9nt\njruy9AyjY3dJERFR2zyuw4iMjERdXZ1+XFVVhYMHD/q0UZ3lzDDgyDAMXgQMs6HtLVyJiMjFY8DY\nuHGj2x1RYWFh2Lhxo08b1Vl21XF3lCPDULyYw+CQFBGRdzwGDCGE29oEWZZht9t92qjOcs1haO1T\nOCRFRNRlPAaM0NBQHDlyRD8+cuQIQkJCfNqoztLnMJyT3l4EDO6BQUTkHY9jNqtWrcJ9992HIUOG\nAABOnDiBp556yucN6wzXkJQWOIQX5c0jTOEwK0EYkzjal00jIvJ7HgPGqFGjkJ+fj8OHD0MIgVGj\nRl3Uwjpf0ie99ZXenst8KLKCJ9N/6ZclQYiIupNXC/ciIyMxefJkX7flojVfh+HVFq0AgwURkRc8\nBozjx49j/fr1OH78OCwWi/74V1995dOGdYbdLgDFCslUrz2gBu4OtERE3c1jwPjFL36BFStW4Ikn\nnsCOHTuwa9cuhIaGdkfbOsyuCgSNOAjZrK0bEQwYRERdxuMV1WKxYMKECRBCID4+HitXruy6TY66\nmF1V9WABAAaVi/KIiLqKx4Ahy9opkZGROH78OC5cuICioiKfN6wzVNV9wyazLbaHWkJEFHg8Dkll\nZmbiwoULWLZsGRYtWgRVVVtUm+0tbM0CRlRocA+1hIgo8LQbMFRVxYQJExAdHY309HR8/PHHaGxs\nvMjtU33HbhcQqgxJVjFWmY/Z4y/t6SYREQWMdoekZFnGz372M/3YaDT22mABAFa7DZKsop+cjNsm\nj0NwEPfoJiLqKh7nMFJTU1FYWNgdbbloFrt2269RMvVwS4iIAo/Hr+Dl5eWYO3currnmGrcaUlu2\nbPFpwzqjwd4IADDKDBhERF3Nq0nvzMzM7mjLRWu0OTIM2djDLSEiCjweA0Z2dnZ3tKNLNNgbAAAm\nmSXLiYi6mseAsXz58lZrLfXGISmLygyDiMhXPAaMKVOm6D83NjbinXfeQWpqqk8b1VnOgBHEDIOI\nqMt1eEjqxz/+Me655x6fNehiOO+SMimc9CYi6modrs4nSVKvvc3Womp3SXGfbiKirtehOQwhBL7+\n+mtMmDDB5w3rjEbHbbVmA4sOEhF1tQ7NYSiKgtzcXIwcOdKnjeqsRlXbByPM2Dv3HCci8mcBdVtt\no3AGjN5bvoSIyF95nMNYtGgRKisr9eOKigosWbLEp43qrEahrcMIN/XODZ6IiPyZx4BRV1eHyMhI\n/TgqKgo1NTU+bVRnWVEPoUoIMXIOg4ioq3kMGKqqoq7OtYtdbW0t7Ha7TxvVWVbRANhMMBqUnm4K\nEVHA8TiHMXv2bOTm5mLRokUAgJdeeglz5871ecM6wyo1QNiCoCgtV6YTEdHF8Rgw7rrrLsTHx2Pf\nvn0QQmDhwoXIysrqjrZ1iF21Q5WsELZwGJUOLy8hIiIPvNphKDs7u9ffLVVcVwoAEFYTFAYMF7dU\nkAAAFDhJREFUIqIu5/HK+sADD6CiokI/vnDhAh588EGfNqoz3jq1FwBgP5/EDIOIyAc8XllPnz6N\nqKgo/Tg6OhoFBQU+bVRnlNaVQVKNUCviOYdBROQDHgOG3W53uyvKarXCYrH4tFGdUW9rgKwaIUGC\nIjNgEBF1NY8BY9KkSVi5ciU++eQTfPLJJ8jLy0N6errXf2D//v2YOXMmMjIysH379jbPe/vtt3HF\nFVfgv//9r9ev3VS9rQGSaoSiyK3u30FERBfH46R3Xl4efv/73+M3v/kNAK221Lhx47x6cVVVsWHD\nBuzcuRPx8fFYsGABpk2b1mI/jdraWrz44oudrlElhECDrQGKGgIDh6OIiHzCY4ZhNBpx//334+mn\nn8YNN9yAv//971i7dq1XL3706FGkpKQgOTkZRqMRmZmZ2Lt3b4vztmzZgjvvvBNGY+d2ymu0WyAg\nALsBBk54ExH5RLsZhs1mw759+/C3v/0Nhw8fhs1mw3PPPed1JlBSUoKkpCT9OCEhAV988YXbOV99\n9RWKi4sxefJk7NixoxNvwbWXN+xGZhhERD7S5tfxJ554Atdffz12796N2bNn4/3330dkZGSHho2E\nEB6ff/zxx7FmzRqvf6c1DTYtYAhmGEREPtNmhvHSSy9h1KhRWLZsGcaPHw8AHZ5MTkxMxJkzZ/Tj\nkpISxMfH68e1tbU4ceIEfvKTn0AIgXPnzuHee+/Ftm3bMGLEiHZfOy4uXP/5glSm/WA3IMhkcHuu\nL+hr77c97AsX9oUL+6JrtBkwPvzwQ/zv//4vNm7ciMrKSmRlZXW46GBaWhoKCgpQVFSEuLg45Ofn\nY/PmzfrzYWFhOHjwoH78k5/8BI8++iiGDx/u8bXLyqr1n8+eLwcA2K0KpGbPBbq4uPA+9X7bw75w\nYV+4sC9cLjZwtjl+ExERgSVLluDVV1/F008/jcrKSjQ0NGDJkiXYvXu3Vy+uKArWrVuH3NxczJ49\nG5mZmUhNTcXWrVvx7rvvtjhfkqTODUk5tmZVbQrnMIiIfEQSHbhCW61W/POf/8Rrr72GP/zhD75s\nl0dlZdUoqCrEy9/+HcNiLkP+9/+E5WQaBpmHY+1PrunRtnUnfntyYV+4sC9c2BcuF5theFV80Mlo\nNGLWrFmYNWvWRf3RrrL7m9fwQ9VpfFd5CoA26W2x9c69OoiI/J1f31IUFRTp/oDdgPOVDT3TGCKi\nAOfXASM2OMbtWNgNqG2w9VBriIgCm18HjBbsnVspTkREnvl1wLCr7vMVwm5AanJED7WGiCiwdWjS\nu7exqe7DT4umXIGJIwb0UGuIiAKbX2cYNuGeYQzpH4MQs1/HQCKiXsuvA0bzISmzSemhlhARBT6/\nDhjlNfVux2YTswsiIl/x64BRXe++5oIZBhGR7/h1wLA3m8MIMjJgEBH5SkAFDFlm4UEiIl/x64Ch\nwhUwbGXJPdgSIqLA59ezxKpQIVQZDZ9NBwSzCyIiX/LrgGEXNkDI2v+IiMin/PpKq0JlsCAi6iZ+\nfbUVsEOofv0WiIj8hl9fbbUMg3MXRETdwa/nMATsgDBg2ugBGDs8vqebQ0QU0Pw8YKiAKuOmqakw\nGrhoj4jIl/x6SEpI2qS3ovj12yAi8gt+faUVjrukZInzGEREvua3AUMVKiAJSP77FoiI/IrfXm2d\nu+1JXIdBRNQt/PZqa3NsniSBk91ERN3BbwOGs1Kt7L9vgYjIr/jt1VYfkmKGQUTULfw4YDgzDAYM\nIqLu4LcBwy60DINDUkRE3cNvr7ZWZ4YhMcMgIuoOfhswnHMYCgMGEVG38NuAYbE7h6QYMIiIuoP/\nBgybFQCgyAwYRETdwW8DRqMzYDDDICLqFn4ZMKpqLbDYtCEpAzMMIqJu4Zf7YSx57C0kDq4AYjnp\nTUTUXXyeYezfvx8zZ85ERkYGtm/f3uL5nTt3IjMzE/PmzcNPf/pTnD171qvXLausAwAYZL+MeURE\nfsenAUNVVWzYsAHPPfcc3nzzTeTn5+PkyZNu5wwfPhyvvvoq3njjDcyYMQMbN2707sUlFQADBhFR\nd/FpwDh69ChSUlKQnJwMo9GIzMxM7N271+2csWPHIigoCAAwcuRIlJSUePfishYwjAoDBhFRd/Bp\nwCgpKUFSUpJ+nJCQgNLS0jbPf+WVV5Cenu7di+sZBucwiIi6g0+/ngshvD73jTfewH//+1+88MIL\nXp0vSdprh4cEIy4uvFPtCxR9/f03xb5wYV+4sC+6hk8DRmJiIs6cOaMfl5SUID4+vsV5Bw4cwPbt\n2/Hiiy/CaDR69+KOISnVBpSVVXdJe/1RXFx4n37/TbEvXNgXLuwLl4sNnD4dkkpLS0NBQQGKiopg\nsViQn5+PadOmuZ1z7NgxrF+/Htu2bUN0dLT3L+4YkjJxDoOIqFv49GqrKArWrVuH3NxcCCGwYMEC\npKamYuvWrUhLS8OUKVOwadMm1NfX48EHH4QQAv3798czzzzj+cUl56Q35zCIiLqDz7+ep6ent5jI\nXr58uf7z888/36nXlWRnhuHlEBYREV0UvywNAgBwTHqbDAwYRETdwS8DRnR4EOcwiIi6mV8GDINB\n1u+SCvL2rioiIroo/hkwZFnPMIKYYRARdQv/DBgGSV+4Z2KGQUTULfwyYCiya0jKzElvIqJu4ZcB\nw6BI+pCUmRkGEVG38NOA0WQOw8A5DCKi7uCXAUNRZEiyCqHKMBm40puIqDv4ZcAwKjKg2AC7AqPB\nL98CEZHf8curraJIkIyNENYgBgwiom7il1dbWRGQDDYIaxAUxS/fAhGR3/HPq63SCAAQVhNkSerh\nxhAR9Q1+GTBUgzNgBPVwS4iI+g6/DBh2uR4AINkZMIiIuotfBgxVbgAAGEVwD7eEiKjv8MuAcVoc\nAQAEqRE93BIior7DLwNGg1QJ27n+CFHjeropRER9hl8GDABQq2IQEsSyIERE3cV/A0ZdOIIZMIiI\nuo1/BgwBiPowmE2sI0VE1F38MmAEl6cBQkFYMEubExF1F78MGOvmLsaVg2Iwd+Kgnm4KEVGf4ZeT\nAEMGRCHv5pE93Qwioj7FLzMMIiLqfgwYRETkFQYMIiLyCgMGERF5hQGDiIi8woBBREReYcAgIiKv\nMGAQEZFXGDCIiMgrDBhEROQVBgwiIvKKzwPG/v37MXPmTGRkZGD79u0tnrdYLFi5ciVmzJiBm2++\nGWfOnPF1k4iIqBN8GjBUVcWGDRvw3HPP4c0330R+fj5Onjzpds4rr7yCyMhI/OMf/8Btt92GTZs2\n+bJJRETUST4NGEePHkVKSgqSk5NhNBqRmZmJvXv3up2zd+9eZGdnAwAyMjJw8OBBXzaJiIg6yacB\no6SkBElJSfpxQkICSktL3c4pLS1FYmIiAEBRFERERKCiosKXzSIiok7wacAQQnT4HCEEJEnyVZOI\niKiTfLqBUmJiotskdklJCeLj41ucU1xcjISEBNjtdtTU1CAyMtLja8fFhXd5e/0V+8KFfeHCvnBh\nX3QNn2YYaWlpKCgoQFFRESwWC/Lz8zFt2jS3c6ZMmYLXXnsNAPD2229j/PjxvmwSERF1kiS8GTe6\nCPv378evf/1rCCGwYMECLFu2DFu3bkVaWhqmTJkCi8WCVatW4auvvkJUVBQ2b96MAQMG+LJJRETU\nCT4PGEREFBi40puIiLzCgEFERF5hwCAiIq/4XcDwVJsq0KxduxbXXXcd5syZoz9WWVmJ3NxcZGRk\nYOnSpaiurtaf+9WvfoUZM2Zg3rx5+Oqrr3qiyT5RXFyMW2+9FbNmzcKcOXPw5z//GUDf7AuLxYKc\nnBxkZWVhzpw5eOqppwAAhYWFuOmmm5CRkYG8vDzYbDb9/ECv16aqKrKzs3H33XcD6Lt9MXXqVMyd\nOxdZWVlYsGABgC7+jAg/YrfbxfTp00VhYaGwWCxi7ty54sSJEz3dLJ/6z3/+I44dOyZmz56tP7Zx\n40axfft2IYQQv//978WmTZuEEEK899574s477xRCCHH48GGRk5PT/Q32kdLSUnHs2DEhhBA1NTVi\nxowZ4sSJE32yL4QQoq6uTgghhM1mEzk5OeLw4cPiwQcfFHv27BFCCPHYY4+Jl156SQghxK5du8T6\n9euFEELk5+eLFStW9Eibfen5558XDz30kLjrrruEEKLP9sXUqVNFRUWF22Nd+RnxqwzDm9pUgeba\na69FRESE22NN629lZ2frfbB3715kZWUBAK6++mpUV1fj3Llz3dtgH4mLi8OwYcMAAKGhoUhNTUVJ\nSUmf7AsACA4OBqB9Y7bZbJAkCYcOHUJGRgYArS/+9a9/AQj8em3FxcV4//33kZOToz/20Ucf9cm+\nEEJAVVW3x7ryM+JXAcOb2lR9QXl5OWJjYwFoF9Ly8nIA7nW5AK1/SkpKeqSNvlRYWIjjx4/j6quv\nxvnz5/tkX6iqiqysLEycOBETJ07EwIEDERERAVnWPtKJiYn6+w30em2PP/44HnnkEb2k0IULFxAZ\nGdkn+0KSJCxduhTz58/Hyy+/DABd+hnxaWmQria4ZKRdrfVPoNXlqq2txfLly7F27VqEhoa2+f4C\nvS9kWcbrr7+Ompoa3HfffS22DQBc77d5X4gAqtf23nvvITY2FsOGDcOhQ4cAaO+v+XvuC30BALt3\n79aDQm5uLgYNGtSlnxG/Chje1KbqC/r164dz584hNjYWZWVliImJAaB9QyguLtbPKy4uDqj+sdls\nWL58OebNm4fp06cD6Lt94RQWFoYxY8bgyJEjqKqqgqqqkGXZ7f06+6Kj9dr8wWeffYZ9+/bh/fff\nR2NjI2pra/H444+jurq6z/UFoGUQABATE4Pp06fj6NGjXfoZ8ashKW9qUwWi5t8Epk6dildffRUA\n8Nprr+l9MG3aNLz++usAgMOHDyMiIkJPRQPB2rVrMWTIENx22236Y32xL8rLy/U7XRoaGnDw4EEM\nGTIE48aNw9tvvw3AvS+mTp0asPXa8vLy8N5772Hv3r3YvHkzxo0bhyeffLJP9kV9fT1qa2sBAHV1\ndfjwww8xdOjQLv2M+F1pkNZqUwWyhx56CIcOHUJFRQViY2PxwAMPYPr06XjwwQdx9uxZ9O/fH1u2\nbNEnxn/5y1/igw8+QHBwMJ544gmMGDGih99B1/j0009xyy23YOjQoZAkCZIkYeXKlbjqqquwYsWK\nPtUXX3/9NdasWQNVVaGqKmbNmoV77rkHp0+fRl5eHqqqqjBs2DBs2rQJRqOxz9Rr+/jjj/HHP/4R\nzz77bJ/si9OnT+P++++HJEmw2+2YM2cOli1bhoqKii77jPhdwCAiop7hV0NSRETUcxgwiIjIKwwY\nRETkFQYMIiLyCgMGERF5hQGDiIi8woBBfu2mm25CdnY2MjMzMWLECGRnZyM7Oxtr167t8Gvdcccd\nXpW7fvTRR3H48OHONLdDjh07hnfeecfnf4fIW1yHQQGhqKgICxYsaLf6qLNUhL94+eWXcfDgQWze\nvLmnm0IEwM9qSRF1xMGDB7Fp0yaMHDkSx44dw3333Yfy8nLs2rVL31BnzZo1GDt2LABg8uTJ2Llz\nJwYNGoTFixdj1KhR+Pzzz1FaWorZs2djxYoVAIDFixfj3nvvxaRJk7Bq1SqEhYXh5MmTKCkpwejR\no/HEE08A0GrzPPLII7hw4QIGDhwIu92OqVOn4uabb3Zr57lz5/DQQw/hwoULAIBJkybhjjvuwDPP\nPIO6ujpkZ2dj3LhxWLNmDT7//HNs3rwZ9fX1AIDly5cjPT0dBQUFWLx4MWbPno1PP/0UFosF69ev\nx+jRo7ulr6mPuJjNOoh6i8LCQjF+/Hi3xw4cOCCGDx8uvvjiC/2xppvLnDhxQlx//fX6cXp6uvju\nu++EEEIsWrRIPPTQQ0IIIaqqqsTYsWNFYWGh/twHH3wghBDi4YcfFrfccouwWq2isbFRzJw5Uxw6\ndEgIIcQ999wj/vCHPwghhDh9+rQYNWqU2L17d4u279ixQzz22GP6cVVVlRBCiL/+9a8iLy/Pre1Z\nWVni/PnzQgghiouLRXp6uqipqRE//PCDuPzyy0V+fr7+3q+//nphs9m870QiD5hhUEAbPHgwrrzy\nSv341KlT2Lp1K0pLS6EoCkpLS1FRUYGoqKgWv3vjjTcCAMLDwzFo0CAUFBQgOTm5xXk33HADDAbt\nozR8+HAUFBRg7NixOHToEH71q18BAAYMGKBnMs2NHDkSL774Ip588kmMGTMGkyZNavW8Tz/9FIWF\nhVi6dKlekFJRFJw+fRohISEIDg7GrFmzAAATJkyAoig4deoUUlNTve0uonYxYFBACw0NdTteuXIl\n1q9fj8mTJ0NVVVx11VVobGxs9XeDgoL0n2VZht1u79B53u6zcM011+C1117DgQMH8Le//Q07duzA\nCy+80OI8IQRGjBiBnTt3tniuoKCgxWOqqgbUXg/U8/xnBpDIA+HF/Rs1NTV6ddLdu3e3GQS6wtix\nY/Wy0kVFRfj4449bPa+wsBBhYWGYNWsW1qxZgy+//BKAtteFs4w5AIwePRonTpzAJ598oj929OhR\n/ef6+nrs2bMHgLZFKQCkpKR07ZuiPo0ZBgUMb75Nr127FsuWLUNSUhLGjRuH8PDwVn+/+Wu19Vx7\n561btw6rV69Gfn4+Bg8ejNGjR7v9PaeDBw/iz3/+MxRFgRACGzZsAABMnDgRf/rTn5CVlYXx48dj\nzZo1eOaZZ7Bp0yZUV1fDarVi4MCBePbZZwEAsbGx+Pbbb5GTkwOLxYLNmzdDURSPfULkLd5WS+Qj\njY2NMBqNkGUZJSUlyMnJwa5duzBw4MAu/1vOu6Q+/PDDLn9tIidmGEQ+8t133+HRRx+FEAKqqmLl\nypU+CRZE3YUZBhEReYWT3kRE5BUGDCIi8goDBhEReYUBg4iIvMKAQUREXmHAICIir/x/apbYj523\no60AAAAASUVORK5CYII=\n",
             "text/plain": [
-              "<matplotlib.figure.Figure at 0x7f72f867ef90>"
+              "\u003cmatplotlib.figure.Figure at 0x7f97f1330850\u003e"
             ]
           },
           "metadata": {
             "tags": []
-          }
+          },
+          "output_type": "display_data"
         }
+      ],
+      "source": [
+        "def plot(train, test, label):\n",
+        "    plt.title('MNIST model %s' % label)\n",
+        "    plt.plot(train, label='train %s' % label)\n",
+        "    plt.plot(test, label='test %s' % label)\n",
+        "    plt.legend()\n",
+        "    plt.xlabel('Training step')\n",
+        "    plt.ylabel(label.capitalize())\n",
+        "    plt.show()\n",
+        "  \n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  hp = tf.contrib.training.HParams(\n",
+        "      learning_rate=0.05,\n",
+        "      max_steps=tf.constant(500),\n",
+        "  )\n",
+        "  train_ds = setup_mnist_data(True, hp, 50)\n",
+        "  test_ds = setup_mnist_data(False, hp, 1000)\n",
+        "  tf_train = autograph.to_graph(train)\n",
+        "  all_losses = tf_train(train_ds, test_ds, hp)\n",
+        "\n",
+        "  with tf.Session() as sess:\n",
+        "    sess.run(tf.global_variables_initializer())\n",
+        "    (train_losses, test_losses, train_accuracies,\n",
+        "     test_accuracies) = sess.run(all_losses)\n",
+        "    \n",
+        "    plot(train_losses, test_losses, 'loss')\n",
+        "    plot(train_accuracies, test_accuracies, 'accuracy')"
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "HNqUFL4deCsL",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "HNqUFL4deCsL"
       },
-      "cell_type": "markdown",
       "source": [
         "# 4. Case study: building an RNN\n"
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "YkC1k4HEQ7rw",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "YkC1k4HEQ7rw"
       },
-      "cell_type": "markdown",
       "source": [
         "In this exercise we build and train a model similar to the RNNColorbot model that was used in the main Eager notebook. The model is adapted for converting and training in graph mode."
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "7nkPDl5CTCNb",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "7nkPDl5CTCNb"
       },
-      "cell_type": "markdown",
       "source": [
         "To get started, we load the colorbot dataset. The code is identical to that used in the other exercise and its details are unimportant."
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "A0uREmVXCQEw",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "A0uREmVXCQEw"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def parse(line):\n",
         "  \"\"\"Parses a line from the colors dataset.\n",
@@ -1137,7 +1034,7 @@
         "    A tuple of three tensors (rgb, chars, length), of shapes: (batch_size, 3),\n",
         "    (batch_size, max_sequence_length, 256) and respectively (batch_size).\n",
         "  \"\"\"\n",
-        "  items = tf.string_split([line], \",\").values\n",
+        "  items = tf.string_split(tf.expand_dims(line, 0), \",\").values\n",
         "  rgb = tf.string_to_number(items[1:], out_type=tf.float32) / 255.0\n",
         "  color_name = items[0]\n",
         "  chars = tf.one_hot(tf.decode_raw(color_name, tf.uint8), depth=256)\n",
@@ -1169,23 +1066,21 @@
         "  dataset = dataset.repeat()\n",
         "  if training:\n",
         "    dataset = dataset.shuffle(buffer_size=3000)\n",
-        "  dataset = dataset.padded_batch(batch_size, padded_shapes=([None], [None, None], []))\n",
+        "  dataset = dataset.padded_batch(batch_size, padded_shapes=((None,), (None, None), ()))\n",
         "  return dataset\n",
         "\n",
         "\n",
-        "train_url = \"https://raw.githubusercontent.com/random-forests/tensorflow-workshop/master/extras/colorbot/data/train.csv\"\n",
-        "test_url = \"https://raw.githubusercontent.com/random-forests/tensorflow-workshop/master/extras/colorbot/data/test.csv\"\n",
+        "train_url = \"https://raw.githubusercontent.com/random-forests/tensorflow-workshop/master/archive/extras/colorbot/data/train.csv\"\n",
+        "test_url = \"https://raw.githubusercontent.com/random-forests/tensorflow-workshop/master/archive/extras/colorbot/data/test.csv\"\n",
         "data_dir = \"tmp/rnn/data\""
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "waZ89t3DTUla",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "waZ89t3DTUla"
       },
-      "cell_type": "markdown",
       "source": [
         "Next, we set up the RNNColobot model, which is very similar to the one we used in the main exercise.\n",
         "\n",
@@ -1193,17 +1088,19 @@
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "9v8AJouiC44V",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "9v8AJouiC44V"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def model_components():\n",
         "  lower_cell = tf.contrib.rnn.LSTMBlockCell(256)\n",
@@ -1227,12 +1124,13 @@
         "  Returns:\n",
         "    A Tensor of shape (max_sequence_length, batch_size, output_size).\n",
         "  \"\"\"\n",
-        "  hidden_outputs = []\n",
-        "  autograph.utils.set_element_type(hidden_outputs, tf.float32)\n",
+        "  hidden_outputs = tf.TensorArray(tf.float32, size=0, dynamic_size=True)\n",
         "  state, output = cell.zero_state(batch_size, tf.float32)\n",
+        "  initial_state_shape = state.shape\n",
+        "  initial_output_shape = output.shape\n",
         "  n = tf.shape(chars)[0]\n",
         "  i = 0\n",
-        "  while i < n:\n",
+        "  while i \u003c n:\n",
         "    ch = chars[i]\n",
         "    cell_output, (state, output) = cell.call(ch, (state, output))\n",
         "    hidden_outputs.append(cell_output)\n",
@@ -1261,50 +1159,51 @@
         "    A Tensor of shape (batch_size, 3) - the model predictions.\n",
         "  \"\"\"\n",
         "  (chars, length) = inputs\n",
-        "  chars_time_major = tf.transpose(chars, [1, 0, 2])\n",
+        "  chars_time_major = tf.transpose(chars, (1, 0, 2))\n",
         "  chars_time_major.set_shape((None, batch_size, 256))\n",
         "\n",
         "  hidden_outputs = rnn_layer(chars_time_major, lower_cell, batch_size, training)\n",
         "  final_outputs = rnn_layer(hidden_outputs, upper_cell, batch_size, training)\n",
         "\n",
         "  # Grab just the end-of-sequence from each output.\n",
-        "  indices = tf.stack([length - 1, range(batch_size)], axis=1)\n",
+        "  indices = tf.stack((length - 1, range(batch_size)), axis=1)\n",
         "  sequence_ends = tf.gather_nd(final_outputs, indices)\n",
+        "  sequence_ends.set_shape((batch_size, 128))\n",
         "  return relu_layer(sequence_ends)\n",
         "\n",
         "def loss_fn(labels, predictions):\n",
         "  return tf.reduce_mean((predictions - labels) ** 2)"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "JjK4gXFvFsf4",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "JjK4gXFvFsf4"
       },
-      "cell_type": "markdown",
       "source": [
         "The train and test functions are also similar to the ones used in the Eager notebook. Since the network requires a fixed batch size, we'll train in a single shot, rather than by epoch."
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "ZWQMExk0S6X6",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "ZWQMExk0S6X6"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "def train(optimizer, train_data, lower_cell, upper_cell, relu_layer, batch_size, num_steps):\n",
         "  iterator = train_data.make_one_shot_iterator()\n",
         "  step = 0\n",
-        "  while step < num_steps:\n",
+        "  while step \u003c num_steps:\n",
         "    labels, chars, sequence_length = iterator.get_next()\n",
         "    predictions = model((chars, sequence_length), lower_cell, upper_cell, relu_layer, batch_size, training=True)\n",
         "    loss = loss_fn(labels, predictions)\n",
@@ -1319,7 +1218,7 @@
         "  total_loss = 0.0\n",
         "  iterator = eval_data.make_one_shot_iterator()\n",
         "  step = 0\n",
-        "  while step < num_steps:\n",
+        "  while step \u003c num_steps:\n",
         "    labels, chars, sequence_length = iterator.get_next()\n",
         "    predictions = model((chars, sequence_length), lower_cell, upper_cell, relu_layer, batch_size, training=False)\n",
         "    total_loss += loss_fn(labels, predictions)\n",
@@ -1340,16 +1239,14 @@
         "  # Here, we create a no_op that will drive the execution of all other code in\n",
         "  # this function. Autograph will add the necessary control dependencies.\n",
         "  return tf.no_op()"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "iopcs5hXG2od",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "iopcs5hXG2od"
       },
-      "cell_type": "markdown",
       "source": [
         "Finally, we add code to run inference on a single input, which we'll read from the input.\n",
         "\n",
@@ -1357,17 +1254,19 @@
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "DyU0wnnAFEYj",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
-        }
+        },
+        "colab_type": "code",
+        "id": "DyU0wnnAFEYj"
       },
-      "cell_type": "code",
+      "outputs": [],
       "source": [
         "@autograph.do_not_convert(run_as=autograph.RunMode.PY_FUNC)\n",
         "def draw_prediction(color_name, pred):\n",
@@ -1389,16 +1288,14 @@
         "  draw_prediction(color_name, pred)\n",
         "  # Create an op that will drive the entire function.\n",
         "  return tf.no_op()"
-      ],
-      "execution_count": 0,
-      "outputs": []
+      ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "Nt0Kv5OCHip0",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "Nt0Kv5OCHip0"
       },
-      "cell_type": "markdown",
       "source": [
         "Finally, we put everything together.\n",
         "\n",
@@ -1406,218 +1303,132 @@
       ]
     },
     {
+      "cell_type": "code",
+      "execution_count": 0,
       "metadata": {
-        "id": "-GmWa0GtYWdh",
-        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           },
-          "output_extras": [
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {},
-            {}
-          ],
-          "base_uri": "https://localhost:8080/",
-          "height": 668
+          "height": 415
         },
-        "outputId": "61f4af1d-c81e-44db-9079-1a7b8ed8ce58",
+        "colab_type": "code",
         "executionInfo": {
+          "elapsed": 15536,
           "status": "ok",
-          "timestamp": 1522345877153,
-          "user_tz": 240,
-          "elapsed": 75500,
+          "timestamp": 1531750946373,
           "user": {
-            "displayName": "Dan Moldovan",
-            "photoUrl": "//lh5.googleusercontent.com/-Rneh8xjecyk/AAAAAAAAAAI/AAAAAAAACB4/c5vwsJpbktY/s50-c-k-no/photo.jpg",
-            "userId": "112023154726779574577"
-          }
-        }
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "-GmWa0GtYWdh",
+        "outputId": "2e7a9856-9809-43a3-8b43-3c8514ea43e9"
       },
-      "cell_type": "code",
-      "source": [
-        "def run_input_loop(sess, inference_ops, color_name_placeholder):\n",
-        "  \"\"\"Helper function that reads from input and calls the inference ops in a loop.\"\"\"\n",
-        "\n",
-        "  tb = widgets.TabBar([\"RNN Colorbot\"])\n",
-        "  while True:\n",
-        "    with tb.output_to(0):\n",
-        "      try:\n",
-        "        color_name = six.moves.input(\"Give me a color name (or press 'enter' to exit): \")\n",
-        "      except (EOFError, KeyboardInterrupt):\n",
-        "        break\n",
-        "    if not color_name:\n",
-        "      break\n",
-        "    with tb.output_to(0):\n",
-        "      tb.clear_tab()\n",
-        "      sess.run(inference_ops, {color_name_placeholder: color_name})\n",
-        "      plt.show()\n",
-        "\n",
-        "with tf.Graph().as_default():\n",
-        "  # Read the data.\n",
-        "  batch_size = 64\n",
-        "  train_data = load_dataset(data_dir, train_url, batch_size)\n",
-        "  eval_data = load_dataset(data_dir, test_url, 50, training=False)\n",
-        "  \n",
-        "  # Create the model components.\n",
-        "  lower_cell, upper_cell, relu_layer = model_components()\n",
-        "  # Create the helper placeholder for inference.\n",
-        "  color_name_placeholder = tf.placeholder(tf.string, shape=())\n",
-        "  \n",
-        "  # Compile the train / test code.\n",
-        "  tf_train_model = autograph.to_graph(train_model)\n",
-        "  train_model_ops = tf_train_model(\n",
-        "      train_data, eval_data, batch_size, lower_cell, upper_cell, relu_layer, train_steps=100)\n",
-        "  \n",
-        "  # Compile the inference code.\n",
-        "  tf_inference = autograph.to_graph(inference)\n",
-        "  inference_ops = tf_inference(color_name_placeholder, lower_cell, upper_cell, relu_layer)\n",
-        "  \n",
-        "  with tf.Session() as sess:\n",
-        "    sess.run(tf.global_variables_initializer())\n",
-        "    \n",
-        "    # Run training and testing.\n",
-        "    sess.run(train_model_ops)\n",
-        "     \n",
-        "    # Run the inference loop.\n",
-        "    run_input_loop(sess, inference_ops, color_name_placeholder)"
-      ],
-      "execution_count": 0,
       "outputs": [
         {
+          "name": "stdout",
           "output_type": "stream",
           "text": [
-            "('Successfully downloaded', 'train.csv', 28010L, 'bytes.')\n",
-            "('Successfully downloaded', 'test.csv', 2414L, 'bytes.')\n",
-            "Step 0 train loss 0.37890616\n",
-            "Step 10 train loss 0.18515904\n",
-            "Step 20 train loss 0.0892782\n",
-            "Step 30 train loss 0.07883155\n",
-            "Step 40 train loss 0.08585831\n",
-            "Step 50 train loss 0.09302989\n",
-            "Step 60 train loss 0.089012615\n",
-            "Step 70 train loss 0.07275697\n",
-            "Step 80 train loss 0.06644974\n",
-            "Step 90 train loss 0.0854013\n",
-            "Test loss 0.13216865Colorbot is ready to generate colors!\n",
-            "\n",
+            "Test loss 0.138294\n",
+            "Colorbot is ready to generate colors!\n",
             "\n",
             "\n"
-          ],
-          "name": "stdout"
+          ]
         },
         {
-          "output_type": "display_data",
           "data": {
-            "text/plain": [
-              "<IPython.core.display.HTML object>"
-            ],
             "text/html": [
-              "<link rel=stylesheet type=text/css href='/nbextensions/google.colab/tabbar.css'></link>"
+              "\u003clink rel=stylesheet type=text/css href='/nbextensions/google.colab/tabbar.css'\u003e\u003c/link\u003e"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.HTML at 0x7f97ee42bb90\u003e"
             ]
           },
           "metadata": {
             "tags": [
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
-            "text/plain": [
-              "<IPython.core.display.HTML object>"
-            ],
             "text/html": [
-              "<script src='/nbextensions/google.colab/tabbar_main.min.js'></script>"
+              "\u003cscript src='/nbextensions/google.colab/tabbar_main.min.js'\u003e\u003c/script\u003e"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.HTML at 0x7f97ee42be10\u003e"
             ]
           },
           "metadata": {
             "tags": [
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
-            "text/plain": [
-              "<IPython.core.display.HTML object>"
-            ],
             "text/html": [
-              "<div id=\"id1\"></div>"
+              "\u003cdiv id=\"id1\"\u003e\u003c/div\u003e"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.HTML at 0x7f97ee42bd90\u003e"
             ]
           },
           "metadata": {
             "tags": [
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"b102d936-3379-11e8-ac70-0242ac110002\"] = colab_lib.createTabBar({\"contentBorder\": [\"0px\"], \"borderColor\": [\"#a7a7a7\"], \"tabNames\": [\"RNN Colorbot\"], \"initialSelection\": 0, \"location\": \"top\", \"contentHeight\": [\"initial\"], \"elementId\": \"id1\"});\n",
-              "//# sourceURL=js_e223a56194"
+              "window[\"a6045494-8903-11e8-99f9-c8d3ffb5fbe0\"] = colab_lib.createTabBar({\"location\": \"top\", \"borderColor\": [\"#a7a7a7\"], \"initialSelection\": 0, \"elementId\": \"id1\", \"contentHeight\": [\"initial\"], \"contentBorder\": [\"0px\"], \"tabNames\": [\"RNN Colorbot\"]});\n",
+              "//# sourceURL=js_02f896cbda"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2ab810\u003e"
             ]
           },
           "metadata": {
             "tags": [
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"b103532a-3379-11e8-ac70-0242ac110002\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
-              "//# sourceURL=js_b8c6a821fb"
+              "window[\"a6045495-8903-11e8-99f9-c8d3ffb5fbe0\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
+              "//# sourceURL=js_7e8f9f77a0"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2ab710\u003e"
             ]
           },
           "metadata": {
             "tags": [
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"b105b28c-3379-11e8-ac70-0242ac110002\"] = google.colab.output.getActiveOutputArea();\n",
-              "//# sourceURL=js_44805e254b"
+              "window[\"a6045496-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.getActiveOutputArea();\n",
+              "//# sourceURL=js_5531553c2f"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2ab6d0\u003e"
             ]
           },
           "metadata": {
@@ -1625,17 +1436,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"b106197a-3379-11e8-ac70-0242ac110002\"] = document.querySelector(\"#id1_content_0\");\n",
-              "//# sourceURL=js_a63d3c6c47"
+              "window[\"a6045497-8903-11e8-99f9-c8d3ffb5fbe0\"] = document.querySelector(\"#id1_content_0\");\n",
+              "//# sourceURL=js_d1f809ec17"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2ab990\u003e"
             ]
           },
           "metadata": {
@@ -1643,17 +1454,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"b1069f44-3379-11e8-ac70-0242ac110002\"] = google.colab.output.setActiveOutputArea(window[\"b106197a-3379-11e8-ac70-0242ac110002\"]);\n",
-              "//# sourceURL=js_7e203b8bce"
+              "window[\"a6045498-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.setActiveOutputArea(window[\"a6045497-8903-11e8-99f9-c8d3ffb5fbe0\"]);\n",
+              "//# sourceURL=js_3a3123cadb"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2aba50\u003e"
             ]
           },
           "metadata": {
@@ -1661,17 +1472,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"b1070f38-3379-11e8-ac70-0242ac110002\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
-              "//# sourceURL=js_d53293d4a7"
+              "window[\"a6045499-8903-11e8-99f9-c8d3ffb5fbe0\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
+              "//# sourceURL=js_1a0e1f7d6f"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2ab890\u003e"
             ]
           },
           "metadata": {
@@ -1679,17 +1490,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c6d90d5c-3379-11e8-ac70-0242ac110002\"] = google.colab.output.setActiveOutputArea(window[\"b105b28c-3379-11e8-ac70-0242ac110002\"]);\n",
-              "//# sourceURL=js_3000dc2c05"
+              "window[\"a8e54762-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.setActiveOutputArea(window[\"a6045496-8903-11e8-99f9-c8d3ffb5fbe0\"]);\n",
+              "//# sourceURL=js_6213539615"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2abad0\u003e"
             ]
           },
           "metadata": {
@@ -1697,17 +1508,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c6da872c-3379-11e8-ac70-0242ac110002\"] = google.colab.output.getActiveOutputArea();\n",
-              "//# sourceURL=js_4136f669a3"
+              "window[\"a8e54763-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.getActiveOutputArea();\n",
+              "//# sourceURL=js_0bd7f95c6e"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2ab950\u003e"
             ]
           },
           "metadata": {
@@ -1715,17 +1526,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c6dac868-3379-11e8-ac70-0242ac110002\"] = document.querySelector(\"#id1_content_0\");\n",
-              "//# sourceURL=js_2f70dd9aee"
+              "window[\"a8e54764-8903-11e8-99f9-c8d3ffb5fbe0\"] = document.querySelector(\"#id1_content_0\");\n",
+              "//# sourceURL=js_215f004f6b"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2abb10\u003e"
             ]
           },
           "metadata": {
@@ -1733,17 +1544,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c6db07d8-3379-11e8-ac70-0242ac110002\"] = google.colab.output.setActiveOutputArea(window[\"c6dac868-3379-11e8-ac70-0242ac110002\"]);\n",
-              "//# sourceURL=js_7226726048"
+              "window[\"a8e54765-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.setActiveOutputArea(window[\"a8e54764-8903-11e8-99f9-c8d3ffb5fbe0\"]);\n",
+              "//# sourceURL=js_a06186c8ad"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2aba90\u003e"
             ]
           },
           "metadata": {
@@ -1751,17 +1562,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c6dcc6fe-3379-11e8-ac70-0242ac110002\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
-              "//# sourceURL=js_72e7709865"
+              "window[\"a8e54766-8903-11e8-99f9-c8d3ffb5fbe0\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
+              "//# sourceURL=js_383fbaae67"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ee2abc50\u003e"
             ]
           },
           "metadata": {
@@ -1769,14 +1580,14 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAVQAAAFZCAYAAADHDNdrAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAB9JJREFUeJzt3E1Lle0ax+HTF4jeEAyMBhE0DawI\nwsCH0AIlaGBWNJBo0CDoA0TQhmDXuKAGDioiCA2KlEAlnl05FD9Co8BeaGCQoBDa2jPZsXt4Bvu/\n0+o4Rmvd1zW4rsmP84bFamo0Go0C4H/WvNYHAPhVCCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKDy\nUxgeHq5Dhw7V4OBgPXz4sHp7e+vWrVt15cqVOnnyZN2/f78ajUbdvn27+vr6qqenp65du1YrKytV\nVfXhw4e6cOFC9fX1VV9fX01PT1dV1dzcXHV3d9eDBw/q+PHj9ccff9TExMRaXpWfWOtaHwD+zuvX\nr+vOnTs1MTFRbW1tdf78+dW16enpGh8fr/b29hobG6upqal6/Phxbdy4sS5evFgjIyM1NDRUly5d\nqv3799fw8HC9efOmTp8+XVNTU1VV9enTp2pubq5nz57V5ORk3bhxo44dO7ZW1+UnZkJl3Zudna2D\nBw9WR0dHbdiwoQYHB1fX9u7dW+3t7VVV9fLlyxocHKytW7dWa2trnTp1qp4/f16Li4s1MzNT586d\nq6qqXbt21YEDB1an1OXl5Tpx4kRVVe3Zs6fevXv3Yy/IL8OEyrr3+fPnamtrW/2+ffv21c//+Xxh\nYaHu3r1bjx49qqqqlZWVam9vr4WFhWo0GnXmzJnVvYuLi9XV1VVVVS0tLbVp06aqqmpubq6vX7/+\nX+/Dr0tQWfe2bNlSi4uLq98/fvz43X0dHR3V29tbQ0ND3zxfXl6ulpaWevLkSW3evPmbtbm5ufyB\n+W155Wfd6+zsrJmZmZqfn68vX77U2NjYd/cdOXKkxsfHa2lpqaqqRkdH6+nTp9Xa2lqHDx+u0dHR\nqqpaWlqqy5cv1/v373/YHfg9CCrrXmdnZw0MDNTAwECdPXu2enp6vrvv6NGj1dPTUwMDA9Xf318v\nXryo7u7uqqq6evVqzc7OVn9/fw0MDNTOnTtrx44dP/Ia/Aaa/B8qP4NGo1FNTU1VVfXq1au6efPm\nX06qsFZMqKx78/Pz1dXVVW/fvq1Go1GTk5O1b9++tT4W/BcTKj+FkZGRunfvXjU1NdXu3bvr+vXr\ntW3btrU+FnxDUAFCvPIDhAgqQMi6+WH/kX8eXesjAPytf/3jz79cM6EChAgqQIigAoQIKkCIoAKE\nCCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQI\nKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgq\nQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpA\niKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCI\noAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIig\nAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAC\nhAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKE\nCCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQI\nKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgq\nQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpA\niKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCI\noAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIig\nAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAC\nhAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKE\nCCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQI\nKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgq\nQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpA\niKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkCIoAKECCpAiKAChAgqQIigAoQIKkBI\nU6PRaKz1IQB+BSZUgBBBBQgRVIAQQQUIEVSAEEEFCBFUgBBBBQgRVIAQQQUIEVSAEEEFCBFUgBBB\nBQgRVIAQQQUIEVSAEEEFCBFUgBBBBQgRVIAQQQUIEVSAkH8D1Aj8lNhhe7QAAAAASUVORK5CYII=\n",
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQwAAAENCAYAAAD60Fs2AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAACL9JREFUeJzt3F+IlXUex/Gv2ziiBRGVOQaFd2JBzOg5aiH+IZGoJgmM\n/uhVGIlgFE0QEYHdFQaRGBJ10VX0D5TAi8jKomGmILsYjEAkmBwbRIxKGDV/e7G7w8ouux9jd911\nX6+rcx6e85zveS7e/J7zb0ZrrRVA4A8XewDgf4dgADHBAGKCAcQEA4gJBhATDC6Kp59+urrdbt13\n3301OjpaK1euvNgjERCMS9yaNWtqeHj4Yo9xnq+++qqGh4frs88+q7fffruqqmbMmHGRpyIhGPxH\n/fbbb/XDDz/U9ddfX7NmzbrY43CBBOMS9tRTT9XExERt2bKlBgYG6vXXX69vvvmm7r///up0OrV+\n/foaHR2d3n/Tpk318ssv1wMPPFADAwP18MMP18mTJ6uq6vTp0zU0NFRLly6tTqdTGzZsqBMnTlRV\n1eTkZG3ZsqWWLl1a69atq3feeWf6mDt37qxt27bV0NBQLVmypN5777169tln6+DBgzUwMFA7d+78\nm7kPHz5cmzZtqk6nU3fffXft37+/qqrGx8er0+lM7/fMM8/UrbfeOn1/aGio3nzzzX/tSeR8jUva\n6tWr2/DwcGuttWPHjrVut9sOHDjQWmvtiy++aN1ut504caK11trGjRvb2rVr2/fff9+mpqbaxo0b\n244dO1prrb311lvt0UcfbVNTU+3cuXNtbGys/fLLL6211h566KG2ffv2dvr06Xbo0KG2bNmy6ed8\n5ZVX2k033dQ++uij1lprU1NT7f33328PPvjg9IwjIyNt5cqVrbXWzpw509auXdt2797dzpw504aH\nh1t/f387cuTI9OsZGxtrrbW2bt26dvvtt7fDhw+31lpbtWpVO3To0L/rVNJas8L4P9D+/HOhvXv3\n1qpVq2rFihVVVbV8+fK6+eab69NPP53e9957760bbrihent764477qhDhw5VVVVPT0+dPHmyjhw5\nUjNmzKhFixbV5ZdfXseOHauvv/66nnzyyZo5c2YtXLiwNmzYUHv27Jk+Zn9/f61Zs6aqqnp7e//h\nrAcPHqxTp07VI488Uj09PbVs2bJavXp1ffDBB1VVtWTJkhodHa3jx49XVdW6devqyy+/rPHx8fr1\n119r4cKF/6Kzxt/Tc7EH4D/n6NGjtW/fvvr444+r6k8hOXv2bC1fvnx6n2uuuWb69uzZs+vUqVNV\nVXXPPffUsWPH6oknnqiff/65BgcH6/HHH6/Jycm68sora/bs2dOPmz9/fo2NjU3fnzdvXjzj5ORk\n9fX1nbdt/vz5NTk5WVVVnU6n9u/fX9ddd111u93qdru1Z8+e6u3trcWLF1/A2eD3EIxL3F9/+tDX\n11fr16+v7du3X/Bxenp6auvWrbV169Y6evRobd68uRYsWFC33XZb/fTTT3Xq1KmaM2dOVVVNTEzU\n3Llz/+4M/8zcuXNrYmLivG1Hjx6tBQsWVFVVt9utF198sfr6+qrT6dTAwEA999xz1dvbW91u94Jf\nFxfGJckl7tprr63x8fGqqhocHKz9+/fX559/XufOnaupqakaHR2tH3/88Z8eZ2RkpL777rs6d+5c\nzZkzp3p6euqyyy6refPmVX9/f7300kt1+vTp+vbbb+vdd9+twcHB3zXvLbfcUnPmzKnXXnutzp49\nWyMjI/XJJ5/UnXfeWVVVN954Y82aNav27t1bnU6nrrjiirr66qvrww8/PO8NUf49BOMSt3nz5tq1\na1d1u93at29f7dq1q3bv3l3Lly+v1atX1xtvvDH9Hsc/WgkcP368tm3bVosXL6677rqrli5dOh2F\nHTt21Pj4eK1YsaK2bdtWjz322HmXORdi5syZ9eqrr9aBAwdq2bJl9fzzz9cLL7wwvcKo+tMq46qr\nrpq+1PlLKBYtWvS7npPcjNb8gQ6QscIAYoIBxAQDiAkGEPuv/R7GxN7+iz0C/F/rG/z6b7ZZYQAx\nwQBiggHEBAOICQYQEwwgJhhATDCAmGAAMcEAYoIBxAQDiAkGEBMMICYYQEwwgJhgADHBAGKCAcQE\nA4gJBhATDCAmGEBMMICYYAAxwQBiggHEBAOICQYQEwwgJhhATDCAmGAAMcEAYoIBxAQDiAkGEBMM\nICYYQEwwgJhgADHBAGKCAcQEA4gJBhATDCAmGEBMMICYYAAxwQBiggHEBAOICQYQEwwgJhhATDCA\nmGAAMcEAYoIBxAQDiAkGEBMMICYYQEwwgJhgADHBAGKCAcQEA4gJBhATDCAmGEBMMICYYAAxwQBi\nggHEBAOICQYQEwwgJhhATDCAmGAAMcEAYoIBxAQDiAkGEBMMICYYQEwwgJhgADHBAGKCAcQEA4gJ\nBhATDCAmGEBMMICYYAAxwQBiggHEBAOICQYQEwwgJhhATDCAmGAAMcEAYoIBxAQDiAkGEBMMICYY\nQEwwgJhgADHBAGKCAcQEA4gJBhATDCAmGEBMMICYYAAxwQBiggHEBAOICQYQEwwgJhhATDCAmGAA\nMcEAYoIBxAQDiAkGEBMMICYYQEwwgJhgADHBAGKCAcQEA4gJBhATDCAmGEBMMICYYAAxwQBiggHE\nBAOICQYQEwwgJhhATDCAmGAAMcEAYoIBxAQDiAkGEBMMICYYQEwwgJhgADHBAGKCAcQEA4gJBhAT\nDCAmGEBMMICYYAAxwQBiggHEBAOICQYQEwwgJhhATDCAmGAAMcEAYoIBxAQDiAkGEBMMICYYQEww\ngJhgADHBAGKCAcQEA4gJBhATDCAmGEBMMICYYAAxwQBiggHEBAOICQYQEwwgJhhATDCAmGAAMcEA\nYoIBxAQDiAkGEBMMICYYQEwwgJhgADHBAGKCAcQEA4gJBhATDCAmGEBMMICYYAAxwQBiggHEBAOI\nCQYQEwwgNqO11i72EMD/BisMICYYQEwwgJhgADHBAGKCAcQEA4gJBhATDCAmGEBMMICYYAAxwQBi\nggHEBAOICQYQEwwgJhhATDCAmGAAMcEAYoIBxAQDiP0RoqNMBlokHDIAAAAASUVORK5CYII=\n",
             "text/plain": [
-              "<matplotlib.figure.Figure at 0x7f72f402e850>"
+              "\u003cmatplotlib.figure.Figure at 0x7f97ee42bb90\u003e"
             ]
           },
           "metadata": {
@@ -1785,17 +1596,17 @@
               "outputarea_id1",
               "user_output"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c70592aa-3379-11e8-ac70-0242ac110002\"] = google.colab.output.setActiveOutputArea(window[\"c6da872c-3379-11e8-ac70-0242ac110002\"]);\n",
-              "//# sourceURL=js_25c3aaf79a"
+              "window[\"a8e54767-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.setActiveOutputArea(window[\"a8e54763-8903-11e8-99f9-c8d3ffb5fbe0\"]);\n",
+              "//# sourceURL=js_28bd08ac10"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9efc10\u003e"
             ]
           },
           "metadata": {
@@ -1803,17 +1614,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c70842c0-3379-11e8-ac70-0242ac110002\"] = google.colab.output.getActiveOutputArea();\n",
-              "//# sourceURL=js_984c56b816"
+              "window[\"a8e54768-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.getActiveOutputArea();\n",
+              "//# sourceURL=js_ae2887f57d"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9efb50\u003e"
             ]
           },
           "metadata": {
@@ -1821,17 +1632,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c708dec4-3379-11e8-ac70-0242ac110002\"] = document.querySelector(\"#id1_content_0\");\n",
-              "//# sourceURL=js_e0451a1217"
+              "window[\"a8e54769-8903-11e8-99f9-c8d3ffb5fbe0\"] = document.querySelector(\"#id1_content_0\");\n",
+              "//# sourceURL=js_608805a786"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9ef710\u003e"
             ]
           },
           "metadata": {
@@ -1839,17 +1650,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c7092726-3379-11e8-ac70-0242ac110002\"] = google.colab.output.setActiveOutputArea(window[\"c708dec4-3379-11e8-ac70-0242ac110002\"]);\n",
-              "//# sourceURL=js_7aa23d7385"
+              "window[\"a8e5476a-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.setActiveOutputArea(window[\"a8e54769-8903-11e8-99f9-c8d3ffb5fbe0\"]);\n",
+              "//# sourceURL=js_3d87cf7d0f"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9efa90\u003e"
             ]
           },
           "metadata": {
@@ -1857,17 +1668,17 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c7099044-3379-11e8-ac70-0242ac110002\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
-              "//# sourceURL=js_5722756ddb"
+              "window[\"a8e5476b-8903-11e8-99f9-c8d3ffb5fbe0\"] = window[\"id1\"].setSelectedTabIndex(0);\n",
+              "//# sourceURL=js_5e91101199"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9efa50\u003e"
             ]
           },
           "metadata": {
@@ -1875,24 +1686,149 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "stream",
-          "text": [
-            "Give me a color name (or press 'enter' to exit): \n"
-          ],
-          "name": "stdout"
+          "data": {
+            "text/html": [
+              "\u003cdiv class=id_45185901 style=\"margin-right:10px; display:flex;align-items:center;\"\u003e\u003cspan style=\"margin-right: 3px;\"\u003e\u003c/span\u003e\u003c/div\u003e"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.HTML at 0x7f97ee42bd90\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": [
+              "id1_content_0",
+              "outputarea_id1",
+              "user_output"
+            ]
+          },
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/javascript": [
+              "window[\"a8e5476c-8903-11e8-99f9-c8d3ffb5fbe0\"] = jQuery(\".id_45185901 span\");\n",
+              "//# sourceURL=js_f43052a94e"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9ef750\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": [
+              "id1_content_0",
+              "outputarea_id1",
+              "user_output"
+            ]
+          },
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/javascript": [
+              "window[\"a8e5476d-8903-11e8-99f9-c8d3ffb5fbe0\"] = window[\"a8e5476c-8903-11e8-99f9-c8d3ffb5fbe0\"].text(\"Give me a color name (or press 'enter' to exit): \");\n",
+              "//# sourceURL=js_bfc0fb76ce"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9efb10\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": [
+              "id1_content_0",
+              "outputarea_id1",
+              "user_output"
+            ]
+          },
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/javascript": [
+              "window[\"a9e9b8b0-8903-11e8-99f9-c8d3ffb5fbe0\"] = jQuery(\".id_45185901 input\");\n",
+              "//# sourceURL=js_7f167283fa"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9ef610\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": [
+              "id1_content_0",
+              "outputarea_id1",
+              "user_output"
+            ]
+          },
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/javascript": [
+              "window[\"a9e9b8b1-8903-11e8-99f9-c8d3ffb5fbe0\"] = window[\"a9e9b8b0-8903-11e8-99f9-c8d3ffb5fbe0\"].remove();\n",
+              "//# sourceURL=js_016ae4bf21"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9ef250\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": [
+              "id1_content_0",
+              "outputarea_id1",
+              "user_output"
+            ]
+          },
+          "output_type": "display_data"
         },
         {
-          "output_type": "display_data",
           "data": {
             "application/javascript": [
-              "window[\"c7baac12-3379-11e8-ac70-0242ac110002\"] = google.colab.output.setActiveOutputArea(window[\"c70842c0-3379-11e8-ac70-0242ac110002\"]);\n",
-              "//# sourceURL=js_cdd622e58f"
+              "window[\"a9e9b8b2-8903-11e8-99f9-c8d3ffb5fbe0\"] = jQuery(\".id_45185901 span\");\n",
+              "//# sourceURL=js_e666f179bc"
             ],
             "text/plain": [
-              "<IPython.core.display.Javascript object>"
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9ef550\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": [
+              "id1_content_0",
+              "outputarea_id1",
+              "user_output"
+            ]
+          },
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/javascript": [
+              "window[\"a9e9b8b3-8903-11e8-99f9-c8d3ffb5fbe0\"] = window[\"a9e9b8b2-8903-11e8-99f9-c8d3ffb5fbe0\"].text(\"Give me a color name (or press 'enter' to exit): \");\n",
+              "//# sourceURL=js_cbb9d14aec"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9ef1d0\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": [
+              "id1_content_0",
+              "outputarea_id1",
+              "user_output"
+            ]
+          },
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "application/javascript": [
+              "window[\"a9e9b8b4-8903-11e8-99f9-c8d3ffb5fbe0\"] = google.colab.output.setActiveOutputArea(window[\"a8e54768-8903-11e8-99f9-c8d3ffb5fbe0\"]);\n",
+              "//# sourceURL=js_2967a79665"
+            ],
+            "text/plain": [
+              "\u003cIPython.core.display.Javascript at 0x7f97ea9ef1d0\u003e"
             ]
           },
           "metadata": {
@@ -1900,21 +1836,98 @@
               "id1_content_0",
               "outputarea_id1"
             ]
-          }
+          },
+          "output_type": "display_data"
         }
+      ],
+      "source": [
+        "def run_input_loop(sess, inference_ops, color_name_placeholder):\n",
+        "  \"\"\"Helper function that reads from input and calls the inference ops in a loop.\"\"\"\n",
+        "\n",
+        "  tb = widgets.TabBar([\"RNN Colorbot\"])\n",
+        "  while True:\n",
+        "    with tb.output_to(0):\n",
+        "      try:\n",
+        "        color_name = six.moves.input(\"Give me a color name (or press 'enter' to exit): \")\n",
+        "      except (EOFError, KeyboardInterrupt):\n",
+        "        break\n",
+        "    if not color_name:\n",
+        "      break\n",
+        "    with tb.output_to(0):\n",
+        "      tb.clear_tab()\n",
+        "      sess.run(inference_ops, {color_name_placeholder: color_name})\n",
+        "      plt.show()\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  # Read the data.\n",
+        "  batch_size = 64\n",
+        "  train_data = load_dataset(data_dir, train_url, batch_size)\n",
+        "  eval_data = load_dataset(data_dir, test_url, 50, training=False)\n",
+        "  \n",
+        "  # Create the model components.\n",
+        "  lower_cell, upper_cell, relu_layer = model_components()\n",
+        "  # Create the helper placeholder for inference.\n",
+        "  color_name_placeholder = tf.placeholder(tf.string, shape=())\n",
+        "  \n",
+        "  # Compile the train / test code.\n",
+        "  tf_train_model = autograph.to_graph(train_model)\n",
+        "  train_model_ops = tf_train_model(\n",
+        "      train_data, eval_data, batch_size, lower_cell, upper_cell, relu_layer, train_steps=100)\n",
+        "  \n",
+        "  # Compile the inference code.\n",
+        "  tf_inference = autograph.to_graph(inference)\n",
+        "  inference_ops = tf_inference(color_name_placeholder, lower_cell, upper_cell, relu_layer)\n",
+        "  \n",
+        "  with tf.Session() as sess:\n",
+        "    sess.run(tf.global_variables_initializer())\n",
+        "    \n",
+        "    # Run training and testing.\n",
+        "    sess.run(train_model_ops)\n",
+        "     \n",
+        "    # Run the inference loop.\n",
+        "    run_input_loop(sess, inference_ops, color_name_placeholder)"
       ]
     },
     {
+      "cell_type": "markdown",
       "metadata": {
-        "id": "AHJ2c47U-A5W",
-        "colab_type": "text"
+        "colab_type": "text",
+        "id": "AHJ2c47U-A5W"
       },
-      "cell_type": "markdown",
       "source": [
         "# Where do we go next?\n",
         "\n",
-        "Autograph is available in tensorflow.contrib, but it's still in its early stages. We're excited about the possibilities it brings — write your machine learning code in the flexible Eager style, but still enjoy all the benefits that come with running in graph mode. A beta version will be available soon -- stay tuned!"
+        "AutoGraph is still in its early stages, but is available in [tensorflow.contrib](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/autograph). We're excited about the possibilities it brings. New versions will be available soon — stay tuned!"
       ]
     }
-  ]
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "Dev Summit 2018 - Autograph",
+      "provenance": [
+        {
+          "file_id": "1wCZUh73zTNs1jzzYjqoxMIdaBWCdKJ2K",
+          "timestamp": 1522238054357
+        },
+        {
+          "file_id": "1_HpC-RrmIv4lNaqeoslUeWaX8zH5IXaJ",
+          "timestamp": 1521743157199
+        },
+        {
+          "file_id": "1mjO2fQ2F9hxpAzw2mnrrUkcgfb7xSGW-",
+          "timestamp": 1520522344607
+        }
+      ],
+      "version": "0.3.2",
+      "views": {}
+    },
+    "kernelspec": {
+      "display_name": "Python 2",
+      "name": "python2"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
 }
diff --git a/tensorflow/contrib/autograph/examples/notebooks/graph_vs_ag_vs_eager_sum_speed_test.ipynb b/tensorflow/contrib/autograph/examples/notebooks/graph_vs_ag_vs_eager_sum_speed_test.ipynb
new file mode 100644
index 0000000000..32742bec7e
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/notebooks/graph_vs_ag_vs_eager_sum_speed_test.ipynb
@@ -0,0 +1,519 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "moMkWaT_TTHi"
+      },
+      "source": [
+        "This Colab illustrates the differing overhead* between a custom, vectorized graph operation and a loop over a tensor\n",
+        "that computes the same function. The loop is implemented in TensorFlow Eager mode using Python syntax and control-flow, and using AutoGraph which takes a python function and converts it into graph mode. In AutoGraph the Python loop is converted into a tf.while_loop.\n",
+        "\n",
+        "The actual computation, summing a small number of scalar values, takes very little time to compute, so the graphs below are showing the overhead of the differing approaches. As such, this is more of a \"micro-benchmark\" than a representation of real-world performance of the three approaches.\n",
+        "\n",
+        "*Note the differing scales of the included plots"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "a0X_rfvuav98"
+      },
+      "source": [
+        "### Imports"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "EdxWv4Vn0ync"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install -U -q tf-nightly"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "erq3_S7QsjkU"
+      },
+      "outputs": [],
+      "source": [
+        "from __future__ import absolute_import\n",
+        "from __future__ import division\n",
+        "from __future__ import print_function\n",
+        "\n",
+        "import numpy as np\n",
+        "import tensorflow as tf\n",
+        "import matplotlib.pyplot as plt\n",
+        "import math\n",
+        "import time\n",
+        "import random\n",
+        "from colabtools import adhoc_import\n",
+        "from tensorflow.contrib import autograph as ag\n",
+        "from tensorflow.python.framework import function"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "1JgnsXooa2RP"
+      },
+      "source": [
+        "### Testing boilerplate"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "UyD5LLjVZzny"
+      },
+      "outputs": [],
+      "source": [
+        "# Test-only parameters. Test checks successful completion not correctness. \n",
+        "burn_ins = 1\n",
+        "trials = 1\n",
+        "batches = 2\n",
+        "max_elements = 2"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "4_NBL0RQa8gY"
+      },
+      "source": [
+        "### Speed comparison parameters"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "Yq6daecyiJV5"
+      },
+      "outputs": [],
+      "source": [
+        "#@test {\"skip\": true} \n",
+        "burn_ins = 3 # Batches not counted in the average\n",
+        "trials = 10 # Batches run per vector-size (and averaged)\n",
+        "batches = 1000 # Number of random vectors summed over per trial\n",
+        "max_elements = 100 # Vectors of size 0 to this-1 will be executed and plotted"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "fiR8m13CbKH2"
+      },
+      "source": [
+        "### Random input"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "d8vrTlyNXuxc"
+      },
+      "outputs": [],
+      "source": [
+        "# Construct a random num x 1 tensor\n",
+        "def get_elements(num):\n",
+        "  return tf.random_uniform(shape=(num, 1), maxval=1)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "ILJ6SbF3bXFQ"
+      },
+      "source": [
+        "## Graph mode"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "vovRf597X55n"
+      },
+      "outputs": [],
+      "source": [
+        "def tf_sum(elements):\n",
+        "  # Using custom vectorized op\n",
+        "  return tf.reduce_sum(elements)\n",
+        "\n",
+        "def run_trial(num):\n",
+        "  elements = get_elements(num)\n",
+        "  return tf_sum(elements)\n",
+        "\n",
+        "\n",
+        "\n",
+        "graph_means = []\n",
+        "for num in range(max_elements):\n",
+        "  with tf.Graph().as_default():\n",
+        "    durations = []\n",
+        "    foo = run_trial(num)\n",
+        "  \n",
+        "    with tf.Session() as sess:\n",
+        "      \n",
+        "      for _ in range(burn_ins):\n",
+        "        for _ in range(batches):\n",
+        "          sess.run(foo)\n",
+        "      \n",
+        "      for _ in range(trials):\n",
+        "      \n",
+        "        start = time.time()\n",
+        "        for _ in range(batches):\n",
+        "          sess.run(foo)\n",
+        "      \n",
+        "        duration = time.time() - start\n",
+        "        durations.append(duration)    \n",
+        "      \n",
+        "    graph_means.append(np.mean(durations))  "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 301
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 278,
+          "status": "ok",
+          "timestamp": 1532447361278,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "Jm9Blkyx90Eq",
+        "outputId": "d83cd51f-7e56-4d73-f7df-bb157dee46df"
+      },
+      "outputs": [
+        {
+          "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAa8AAAEcCAYAAABwNTvaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3WdgVFXegPFnZtI7kJ5QQwlCKIGERGroEoSAYZFVEFAR\ngV1XXHvbFWEtK6xlUVgRXMuLiqBSZQUh9E5ChxRIn/ReJjNz3g8hNxkzCUMJkHh+X2DmtnPPnXv+\n95R7ohJCCCRJkiSpGVHf6QRIkiRJ0vWSwUuSJElqdmTwkiRJkpodGbwkSZKkZkcGL0mSJKnZkcFL\nkiRJanaaJHitWLGCV199tSl23WINHz6cAwcONPlxXnzxRd5///0mP87dIikpiUmTJtGvXz++/PLL\nO52cFi0+Pp4HHnjgTifjmsaPH8+RI0du6T5/b/eVObcqX9966y3Wrl17zfWsbmTnffv2RaVSAVBe\nXo6NjQ1qtRqVSsUbb7zBE088cSO7vW5paWmMGDGCs2fPolY3n0rkiy++iLe3N0899dSdTspdoSmv\n46effsqAAQPYsGGD2eVbt27l888/5/z58/Tq1Yv//ve/JsvPnTvHyy+/TGJiIgEBASxevJjAwEBl\n+bvvvsu6detQqVQ88MADPPvssxZv29SmT5/OxIkTiY6Ovi3H++CDD3jsscea9BjDhw9n8eLFhIeH\n3/A+Nm3adAtT1LgVK1bwySefoFKp0Ov16PV67OzsEELg7+/Pxo0bCQwMxN7eHpVKhRACa2trDh8+\nfNvSeCPMlWG3Kl8fffRRpkyZQnR0NFZWDYeoGyopTpw4wfHjxzl+/Di+vr6sWLFC+W78+PE3nOjr\nJYRQLrjUfDXldUxPT6dz584NLndzc2PmzJnMmTOn3rKqqirmz59PVFQUR44cISoqinnz5qHX6wFY\nu3YtO3fuZOPGjfz000/s2rWLb775xqJtm4PruR7Z2dkcOnSIESNGNGGKbo7BYLjtx3ziiSeUsvHv\nf/87ffv25fjx45w4cYKNGzcCoFKp+Omnn5Tv73TguhP5VJeHhwcBAQHs3Lmz0fVu+jFXCFHvR/7R\nRx8pT6BpaWkEBgayfv16hg0bxoABA1i7di2nTp1iwoQJhIaGsmjRIpPt161bx7hx4xgwYACPPfYY\n6enpZo89ffp0APr3709wcDCxsbEIIVi+fDnDhw9n4MCBvPDCC5SUlJjdPj8/n7lz5xISEsKAAQN4\n+OGHlWWBgYGkpKQon+s2Cxw+fJihQ4fy6aefcu+99zJ48GB++eUXdu/ezZgxYxgwYAArVqwwe8xv\nv/2WjRs38umnnxIcHMyTTz6pLDt37hwTJkwgJCSEhQsXotPplGW//vorUVFRhISEMG3aNC5cuGB2\n/wAJCQnMnj2bAQMGcN9997F169YG121sv8OHD2fVqlVMmDCBvn378sorr5Cbm8vjjz9OcHAws2fP\npri4WFn/5MmTPPjgg4SEhBAVFWVyE06fPp3333+fadOmERwczKOPPkpBQYGyDEyvY3JyMtOnT6d/\n//6Eh4ezcOHCBs9hx44djB8/ntDQUGbMmEFiYiIAjzzyCIcOHeKNN94gODiYK1eu1Ns2PDycsWPH\n4uHhUW/Z4cOHMRgMzJgxA2tra6ZPn44QgoMHDwLwww8/MHv2bDw9PfH09GTWrFlKDe/QoUONblvX\nli1b6jW3rVmzhnnz5gGg0+l4++23iYiIYNCgQfztb38z+W388ssvREVF0a9fP0aPHs3evXtZtmwZ\nx44dY9GiRQQHB/Pmm28CcPz4caKjowkJCWHKlCmcOHHC5BotW7aMadOm0adPH1JTU1m/fj0jR44k\nODiYkSNHNvh0vW/fPnr06IGNjQ0AK1eu5M9//rPJOm+++SaLFy8GoKSkhJdffplBgwYxdOhQ/vWv\nf5mUI99++y3jxo0jODiY8ePHc+7cOZ577jkyMjJ48sknCQ4OZtWqVUD965+QkKDsZ/jw4fznP/9R\nfsMGg8GkiT4kJITg4GCCg4Pp27cvgYGBSnnT2L1x9uxZJk+eTL9+/Xj66aeprKw0my+WsPQhobGy\nraac/fbbbxk8eDCDBw9m9erVJtuuXLmSUaNGERYWxtNPP01RUZHJtuvWrSMiIoKZM2cC8NRTTzFo\n0CBCQkKYPn26kq8NlWF181Wn07F48WIGDx7MkCFDWLJkCVVVVUBt+bl69Wql/Fy/fr3JuYaEhLBr\n165rZshNiYiIEPv37zf57sMPPxTPPvusEEKI1NRU0a1bN/H666+LyspKsW/fPhEUFCTmz58v8vLy\nRGZmpggPDxdHjhwRQgjxv//9T4wePVokJiYKg8EgPv74YzF16lSzx05NTRWBgYHCaDQq33333Xdi\n9OjRIjU1VZSVlYkFCxYoafmt9957T7z++uvCYDAIvV4vjh49qiwLDAwUycnJyucXXnhB/Otf/xJC\nCHHo0CFxzz33iOXLlwu9Xi++/fZbERYWJp555hlRVlYmLl26JIKCgkRKSorZ49bdV918nDJlisjO\nzhaFhYXivvvuE2vXrhVCCHH69GkRHh4u4uLihNFoFBs2bBARERFCp9PV23dZWZkYOnSo2LBhgzAa\njeLs2bNiwIABIj4+vt6xr7XfiIgIMXXqVJGbmyu0Wq0IDw8XkyZNEufOnRM6nU7MmDFDfPTRR0II\nITIzM0VoaKiIiYkRQgixf/9+ERoaKvLy8oQQQjz88MNi1KhR4sqVK6KyslI8/PDD4r333mvwOi5c\nuFB88sknQgghKisrxbFjx8zmZWJioujTp4/Yv3+/0Ov14j//+Y8YNWqUqKqqUo773Xffmd22rm+/\n/VZMnz7d5LvVq1eLxx9/3OS7J554QqxevVoIIUS/fv1EbGyssuzUqVMiODjYom3rKi8vF8HBweLK\nlSvKdw888IDYsmWLEEKIN998Uzz55JOiqKhIlJaWirlz54qlS5cKIYSIjY0V/fr1U+5BrVYrEhMT\nzZ57QUGBCAkJET/99JMwGAxi06ZNIiQkRBQUFCjrR0REiPj4eGEwGERxcbEIDg4Wly9fFkIIkZ2d\nrfyOfuvtt98Wb7zxhvI5LS1N9OnTR5SUlAghhDAYDGLgwIFKfj355JPi9ddfFxUVFSI3N1dMmTJF\nfPPNN0IIIbZs2SKGDBkiTp8+LYQQIjk5WaSnpwshqn+TBw4cUI5zresfEREhoqKiRGZmpqisrFS+\n+22ZJYQQS5cuFQ8//LDQ6/WN3hs6nU5ERESIzz//XOj1erFt2zbRo0ePevf0b61fv1788Y9/rPd9\nt27dTMqahjRWttWUswsXLhQVFRXiwoULIiwsTDnP1atXi6lTpwqtVit0Op147bXXxMKFC022ff75\n50V5ebmST99//70oKysTOp1OLFmyREycOFFJS0NlWM3x/vWvf4mpU6eKvLw8kZeXJ6ZOnSref/99\nIURt+fnhhx8KvV4vdu3aJXr37i2KioqUfW3fvl1MmjSp0fy4LR1FKpWK+fPnY2Njw7333ou9vT2R\nkZG0atUKLy8v+vfvz9mzZwH45ptvmDNnDh07dkStVjNnzhzOnz9PRkZGYwFY+f+mTZuYOXMmfn5+\n2Nvbs3DhQrZs2YLRaKy3nZWVFdnZ2aSmpqLRaOjXr5/ZfZpjbW3N3Llz0Wg0jBs3jvz8fB555BHs\n7e3p3LkznTt3brR2ZM6MGTNwd3fHxcWFiIgIzp07B8B3333Hgw8+SFBQECqViqioKGxsbIiNja23\nj19//RV/f3+ioqJQqVR0796d0aNHs23btnrrWrLfhx9+mNatW+Pp6Un//v3p3bs3gYGBWFtbM2rU\nKCWNP/30E8OGDWPw4MFAdY2mZ8+e7N69W9nX5MmTadeuHTY2Ntx3333KtjXq5rmVlRVpaWlotVps\nbGwIDg42m2dbt25l2LBhhIeHo9FoePTRR6moqDCpUdyosrIynJ2dTb5zcnJSnnZ/u9zZ2ZmysjKL\ntq3Lzs6OESNGKLWay5cvk5SUpDTBrVu3jhdffBFnZ2ccHByYM2eOsu66deuIjo5W+oA8PT3p2LGj\n2fPZtWsXHTp04P7770etVhMZGUmnTp349ddflXUmTZpEQEAAarUajUaDRqPh4sWLVFZW4u7uTkBA\ngNl9FxcX4+joqHz29fXlnnvu4ZdffgHgwIEDODg40KtXL3JyctizZw8vvfQStra2tG7dmkceeYTN\nmzcr5/TYY4/Ro0cPANq2bYuPj4+y77q/E0uu/4wZM/Dy8lJqheZs2bKFTZs28eGHH6LRaBq9N2Jj\nY9Hr9cyYMQONRsOYMWPo2bNng/u2xKRJkwgJCSE0NFSpnf6WJWXbn/70J2xtbenatSuTJ09W8vTb\nb7/lL3/5C56enlhbWzN//nx+/vlnZVuVSsWf/vQn7OzslHyaPHky9vb2yvrnz59vsBXLXFrnz59P\nq1ataNWqFQsWLODHH39UlltbWzNv3jw0Gg1Dhw7FwcGBpKQkZbmjo6NJq445NzRg40a0adNG+b+d\nnR3u7u7KZ1tbW+WmT09PZ/Hixbz99ttAbX+IVqs1+QE3JCsrC19fX+Wzn58fer2enJwcPD09TdZ9\n7LHH+PDDD5k9ezYqlYopU6aY7fswx83NTRm0YmdnZ/Yca87JUnW3t7e3Jzs7G6jOkx9//FEZLSeE\nQK/Xk5WVVW8f6enpnDx5ktDQUGVdg8FAVFSU2XWvtd+6abK1ta33ue5127p1q1IQ1uyrbsd63Wtu\nb2/faP4899xz/Otf/yI6OlrplzI3ku2311ulUuHj44NWq21w35ZycHCod7OWlJTg5ORkdnlJSQkO\nDg4WbftbkZGRvPPOO8ybN49NmzYxcuRIbGxsyMvLo7y83OTcjUajUoBnZmYydOhQi87nt3kF1UGm\nbl55e3sr/7e3t2fZsmWsWrWKl156iX79+vHcc8/RqVOnevt2cXGhtLS03jlt3ryZiRMnsmnTJqU/\nPD09Hb1ez6BBg4Daroea+zszM5N27drd0DmZu/51z8mcs2fPsmjRIlavXo2bm5uSxsbuDS8vL5N9\n+Pn5WZTehmzYsIG2bds2uk5jZRtUn3vdc/X19eXSpUvK+SxYsEAZECWEwMrKStkWTPPJaDSydOlS\nfv75Z/Lz81GpVKhUKvLz8xv8DTeWVl9fX5Nyxc3NzWRwlp2dncnvp7S0tN7D32/dtuBlKW9vb558\n8kmLBn7UBI+6PD09TfrI0tLSsLKyMik4azg4OPD888/z/PPPk5CQwPTp0+nVqxdhYWHY29tTXl6u\nrJudnX3Nm6CpeHt7M3fuXItGcfr4+DBgwAClP+BW7deS40ZFRfHGG29c97bmrmObNm2UvtBjx44x\na9YsQkND693gnp6eyg1aIyMj45Zcqy5durBmzRqT7y5evKj00XXu3Jnz588TFBQEVPdZdunSpdFt\n6/ar1jVo0CBefPFFzp8/z+bNm3nppZcAaNWqFfb29mzatKnewxdUX8O6fbN1/TZfPT092b59u8l3\n6enpDBkypMFtBg4cyMCBA9HpdCxbtoxXX32Vr776qt6xunXrZvJkDTB27FjeeecdtFotv/zyizKY\nxcfHB1tbWw4dOmT22nt7e5OcnGzxOd3M9c/Ly2PBggW8/vrrJiNBG7s3jhw5Uu/hKD093eKAe6Ma\nK9syMjIQQpCRkaHUvDMyMpTfjI+PD0uWLKFv37719puWlgaY5u3GjRv59ddf+fzzz/H19aW4uJiQ\nkJDrSmtaWppSU09PTzf7+21IQkLCNUfm3pZmw2s1wdU1bdo0VqxYQXx8PFDdHGGuyQugdevWqNVq\nkx96ZGQka9asITU1ldLSUpYtW0ZkZKTZIdi7du1StnVwcFCaSaB6wMamTZswGo3ExMTc0vdC3N3d\nGyxwzPnDH/7A2rVriYuLA6qbpHbv3m225jJs2DCSkpL48ccf0ev1VFVVcerUKWUQw43u91omTJjA\nzp072bt3L0ajkcrKSg4fPmxRDcjcddy2bZuyrYuLC2q12uw1vO+++9i1axcHDx5Er9ezatUqbG1t\n6dOnj0XpNhqN6HQ69Hq9yf8BQkNDUavVfPHFF+h0OuUpfMCAAQBERUWxZs0atFotWq2WNWvWMHny\n5Ea3DQsLM5uOmuand955h6KiIgYOHAigtAgsWbKEvLw8ALRaLXv37gUgOjqa9evXc/DgQYQQaLVa\n5Vr/9nc2dOhQrly5wubNmzEYDGzZsoXExEQiIiLMpik3N5edO3dSXl6OlZWVco+YM3DgQM6cOWMy\nkKR169aEhITw4osv0rZtW6XG5uHhwcCBA1myZAklJSUIIUhJSVHusSlTpvDZZ59x5swZAJKTk5Vu\nA3d3d1JTU5Vj3Mz1NxgM/OlPf2LChAmMHTvWZFlj90afPn2wsrLiiy++wGAwsH37dk6dOnXN490s\nS8q25cuXU1FRwaVLl1i/fj2RkZEATJ06laVLlyrBLy8vjx07dijb/baMLi0txcbGBhcXF8rKynjv\nvfdMgtu1yrDIyEg+/vhj8vLyyMvLY/ny5UycONHicz1y5IjJQ5U5Nx28zD05XWudxj6PHDmSxx9/\nnKeffpr+/fszYcIE9uzZY3a/dnZ2zJ07l2nTphEaGkpcXBzR0dFMnDiRhx9+mFGjRmFvb88rr7xi\ndvvLly8zc+ZM+vbty7Rp03jooYeUp4uXX36ZnTt3EhISwubNmxk5cuRNnWNd0dHRxMfHExoayoIF\nC665fs+ePVm0aBFvvPEGoaGhjBkzpsH3lhwdHfnss8/YsmWLMurovffeMylULN3v9ZyTt7c3y5cv\nZ8WKFYSHhxMREcFnn32m3BSNbWvuOp46dYopU6YQHBzM/Pnzefnll802zXTs2JF3332XRYsWER4e\nzq5du/jkk0+U90Ou9fv88ccf6dWrF2+88QbHjh2jd+/eygv21tbWLF++nA0bNhAaGsr69etZvny5\nsu8HH3yQiIgIJkyYwIQJE4iIiOAPf/iDRduaExkZyYEDB7jvvvtMCqS//vWvtG/fnj/84Q/079+f\n2bNnc/nyZQB69erFkiVLWLJkCf369WPGjBlKQT9jxgy2bdvGgAEDWLx4MW5ubnzyySesWrWKsLAw\nVq1axYoVK3B1dTWbV0ajkdWrVzNkyBDCwsI4cuQIr7/+utm0t2nThrCwMKWPq8b48eM5cOAA999/\nv8n3b7/9NlVVVURGRhIaGspTTz2lNJOPHTuWuXPn8swzzyjXv7CwEIA5c+awfPlyQkNDWb169Q1d\n/5rvMjMzOX78OJ9//rky2jA4OJjMzMxG7w1ra2s+/PBD1q9fT2hoKNu2bWP06NENXtdrsaQMBSwq\n20JDQxk1ahSzZs3iscceU5rtH3nkEUaMGMHs2bPp168fDz74oBKYzaUhKioKHx8fhgwZwvjx4+vV\n2K5Vhs2bN4+ePXsyYcIEJk6cSM+ePZk7d65FeZCVlUVCQsK1y1xxPdUiSZKkBiQkJPDCCy/w3Xff\n3emk/O6kpaUxcuRIzpw506wmbDDn7bffpl27dkybNq3R9WTwkiRJauaa62xDN+P3cZaSJEktnKXN\njy2FrHlJkiRJzY6seUmSJEnNzl33ntfN0OsN5Odf/zDvlqhVKweZF1fJvKgl86KWzItaHh6NvxB8\nN2pRNS8rK/PvoPweybyoJfOilsyLWjIvmrcWFbwkSZKk3wcZvCSpBSsq1VFcVv8FdUlq7mTwkqQW\n7J9rT/LBurhrryhJzUyLGrAhSVIto1GQnlOKrY3s25FaHlnzkqQWqri8CqMQlFfq0Rvq/z07SWrO\nZPCSpBaqsKT2T9MXl1XdwZRI0q0ng5cktVAFJbUDNeSgDamlkcFLklqoujWvIhm8pBamyYNXTEwM\nY8eOZcyYMaxcubLe8qNHjzJ58mR69OhR76+8QvWfTh8yZAhvvvlmUydVklqUgtI6Na9S2WwotSxN\nGryMRiOLFi1i1apVbNq0ic2bN5OQkGCyjq+vL2+99Va9P1ZX4/333yc0NLQpkylJLZJpn5eseUkt\nS5MGr7i4ONq3b4+fnx/W1tZERkaa/OlpqA5eXbt2NTud/+nTp8nLy2PQoEFNmUxJapEK6/R5FckB\nG1IL06TBS6vV4uPjo3z28vIiKyvLom2FELz99ts899xzyL/aIknXr6BU1ryklqtJX1K+maDz9ddf\nM2zYMLy8vK5rX81xduSmIvOi1u8xL4rL9djbaiivNFCpF0oe/B7zoiEyL5qvJg1e3t7epKenK5+1\nWi2enp4WbXvixAmOHz/O119/TWlpKXq9HkdHRxYuXNjodtnZxTeV5pbCw8NZ5sVVv8e8EEKQV1iB\nn4cjqVkl5BaUkZ1d/LvMi4bIvKjVHIN4kwavoKAgkpOTSUtLw8PDg82bN7N06dIG169bu/rnP/+p\n/H/Dhg2cOXPmmoFLkqRqZVdn1WjlZEthSaUcKi+1OE3a56XRaHj11VeZPXs248ePJzIykoCAAD74\n4AN+/fVXAE6dOsXQoUPZtm0br7/+eoOjDiVJslxBcXV/l6uTDS4ONnLAhtTiqEQLGw0hmwGqySaR\nWr/HvDhzOY/31p5kwsAOJKQVcuZyPp88MxQ/X7ffXV405Pf4u2hIc2w2lDNsSFILVPOOl5uTLc6O\nNoCc31BqWWTwkqQWqOYdL1cnG5ztrwavctnvJbUcMnjdJpU6A2u2nufoecvec5Okm1EzKa+bky0u\njtYAFMkpoqQWRAav2+RkfA4xseks/+E0KzeeobRCFiRS0ym8+oKyq6MNzg41zYay5iW1HPIvKd8m\nlzOLAGjjYsfBM1ouJBfwxIQedG3rdodTdnuUlFdx8lIO+cUV5JfoqKjUM3FQR7xaO9zppLVIBSU6\nVICLow3ODtU1L9nnJbUkMnjdJpczilEBf58dwi9HU/lp32VWbz3PP+aE3emk3RZfbr/A4XOmTaat\nXGyZMqzzHUpRy1ZYUomzgzVWGjUuV2te8l0vqSWRzYa3gVEILmuL8XF3xMHOmgmDOtLZz4WsvDKq\n9IY7nbwmpzcYiUvIpY2LLU//oTcvPhwMQFp26R1OWctVUKrD1ckWoE7Nq+UEr5LyKjnn6e+cDF63\ngTavjEqdgQ7ete9SeLdxQADa/PI7l7Db5GJKARU6A327eBDUqQ1d/N1wdbIhNbvkTietRarQ6anU\nGXB1qq5x1fZ5tYxmw0upBfz5/T38c+1Jsgta/v0jmSeD121wObP6RUiT4NXaEYDM3LI7kqbbKTY+\nF4Dend2V7/w9nMgrqqRMDly55WqGybs5Vte87Gw0WGnULabmde5KvvLvq6sO8cvRFIyyFva7I4PX\nbXA5oyZ4uSjfeV8dqJCZ1/KDV1xCDrY2GpPBKf4e1cE7VTYd3nIFJbVTQwGoVCpcHK1bzFD5mt/M\nlIgArDVqvv7lEv/dduEOp0q63eSAjdvgcmYRKhW09XJSvvNu8/sIXpl5ZWjzy+nX1QNrq9pnJX+P\n6rxIyy753Yy4vF56g5HSCj2uV2fIsFRhae07XjWc7W3IyL27HxTKKqrYeiiZvKJKist1lJRVMaS3\nL8P6+pmsl5Zdgr2tFWND23FvTx+WfHGUg2cyeWhUF6ytNHco9dLtJmteTcxoFFzRFuPn7oitde2N\n5e5qh0atQtvCg1dsfA4AvTq3Mfm+JnjJmlfDvvrfRZ7/ZD9FpdfX3KdMylsn6Dk7WqPTG6mo1N/S\nNN5K2w4ns/nAFQ6cyeR0Yh6XM4v5+UiKyTpVegPavHL8PBxRqVS4OtrQr6snOr2R88kFdyjl0p0g\ng1cTy8gtRVdlNGkyBLDSqPFwsyczr6xFj5pSgleAu8n3Pm0cUKmQgzYaUFpRxf7TmeiqjCSmF13X\ntgVmal41w+VrmhTvNlV6I7tPpuNoZ8U/ngjj44VD6d6+Fdq8MpMX+tNzyjAKQVuP2laMoIDqB6O4\nq32rUjVtfhknLmbf6WQ0GRm8mpgyWMOn/qzN3q0dKK3QU1x+Y30Rd3vQK6uo4lJqIR19XOo1fdlY\na/Bq5UBqduldfx51ZeSWsnzDKfKLmzYIHDidSZXeCNS+4G6pwt/0eUHtcPnrrcXdLkfPZ1FcVsXg\n3r54tXLA1kZDJ9/qB76kOsG75mGnps8UoIu/K/a2GmITcprVb6mprd5yng/XnyIhvfBOJ6VJyODV\nxMwN1qihDNpoZMThtzvj+b9fLtX7Pq+ogqc+2Mv2w8m3KKW33umkPAxGQe+ANmaX+3s4Ul6pNwkE\nCWmFrN1xCYPReLuSeV02xCRy9EI2/zuaUm9ZUZmO9JybbwYVQhATm45GrQJqH4AsVTuvYW3wuttr\nXjuOp6ICIur0b9UEr8SM+sHLr07Ny0qjpkeH1uQUVpDxOxi9a4n84koupVQ3o/58uP5vtSWQwauJ\nXc4sQqNW0dbTsd6yaw3aiE8tZNvhZP53NKVeoRgTm05JeRUHz2pvfaJvQn5xJTmF5RSWVHL8apNF\n3SHyddX2e9U2Ha7deYntR1K4eBf2X+QUlnPs6jntP52J3lAbYIUQfPh9HH9bfZicm3z3KDG9iNTs\nUvp29aCNiy2XM4quq0ZRWKrD0c7KZPCCU03N6y4MXkkZRSSmF9G7szsebvbK9518XQFMmk1r+kjr\n1rygtlk6LkE2HQIcOZ+FADRqFccuZLXI9+Fk8GpCeoOR5KwS/DwczY6Camy4vBCC73cnKJ93nUxT\n/m8wGtkTlwHAFW3xLZvk1yjETTW7nLyUwzP/3sdzHx/g6Y/2cfhcFm5ONrSrM8qyLj9lxGF1gZSZ\nV0ZCWnVBdSn17mvq2Hk8DSGqB9sUleo4lVhbUF5MKSAhrQi9QbD54JWbOs7u2HQAhvb2pYO3C0Vl\nVdfVTFlYUqnMrlGjtuZ19zUb7jyWCsDwfqajCl0dbWjjYkdiem3wTs0uoY2LLQ521ibrKv1eCTm3\nIcV3vyPntKhUED0sACEw21LQ3Mng1YTSc0qp0tcfrFGjsWbDM5fzuJBSQI+OrXF1tGH/qUwqq6qn\nkjqVmEd+cSU21mqE4JbUUsor9Ty7fD///fnG3pcxCsH6mERUKgjr4UVIoCd9u7jz4IguqFQqs9v4\ne9a861WD0HifAAAgAElEQVRd89p/OlNZdjH17qp5VeoMxJxMx8XBmicm9gBg79UHCIBth6qbb53s\nrdkbl0FuYcUNHae8Us/hc1rcXe3o3qGV0ldqadNhld5gdnh9zSwbhXdZzauoTMehc1l4tXbgng6t\n6y3v5OtCSXkV2YUVFJfpKCzRmTQZ1nB1tKGDtzOXUgspq7gzIypv9kXp3zah36icgnIS0osIbNeK\nEf38aeVsy57YjBb3lyxk8GpCjQ3WgOpOdAdbq3o1r+paVyIAU4YFMLi3L2VXCzWAmJPVT+YPDAkA\n4Fxy/k2n9XRSdUDcfTKds5fzrnv7ExdzSM0uYcA9Xsy5vwdPRvXkTw/0IrS7V4PbeLjZY2OtJjW7\nFKMQHDidgZ2NBk83exLSiu6qfq/9pzMoq9QzrK8fAb6utPNyIjY+l8KSStJySolNyKWznysPjuiM\nwXjjta9DZ7XoqowM7u2LWqVSHnwsHbRRaKa/C8DlarPh3RS8Siuq+GFPEnqDkeHBfqjNPOQo/V7p\nhXWaDM3X5HsFtMFgFDf0+71Ze2LTmbd0NxdTbuyhq7xSz5v/Pcqrnx666Vlnjlz9m4ED7vHCSqNm\nZH9/KqsM7L5abrQUMnjdpLKKKtbuuGS2Tfn81WlsOjZQ81KpVHi3cSC7oNykoD52IZsrmcWEdvek\nnZczQ3v7olLBrhPp5BVVEJuQQwdvZ4b19cPaSq0c52bUHVL7xc8XrmvCYCEEG/cloQLGh3eweDu1\nSoWfuyMZuaWcu5xPblEl/QM96d6hFZVVBpK1t3YYfUZuKa+tOsSWRgJLld5ITGw6b311nP/+fIH8\n4kqMQvC/o6lo1CplQMHgXr7VAfeMlp+v1rrGDmjHgHu88Gxlz57Y6mt1PfQGI7+eSEOtUjEoyAeA\n9lenFKsZ+NMQo1EQl5DD59vOA9RrNqyted1cs2FZhf6mA2BiehErN55h4Uf72HUiDRdHGwb29DG7\nbm3wKjI70rCumr7V2NvcdGgUgi0Hr6CrMvLpprOUX+e7dEIIVm0+R0ZuGWWVemJiM669USMOn8tC\no1YR3NUDgKG9/bCz0fDL0RQycku5kJzP0fNZN9w6cLeQM2zcpB/2JvHL0VS0eWU8NaW38n1uYQVH\nzmfh08bBZGaN3/Ju7UBiehE5BRV4tXbAaBRs2JOIWqUianAnANq42tGrUxtiE3L5+pdLCAFD+/hi\nbaWmi78rZy/nU1Sqw+U6Z2KooTcYiU3IpY2LHX27uvPL0VS2HExm4qCOFm1/Mj6H5KwSQrt74utu\nvmBpiJ+HE0kZxayPqa5pDuzpTV5RdQ3wUkoBHX3MB/7rVVSqY9m3seQUVrBuVwKujjYMDKotMCt0\nenYcS+WXo6nKDBUXUwrYfyqD3p3dycwr496e3kpQGHCPF9/svMTO46nkF1fi1dqBPl3cUatUjA/v\nwGdbzrH54BWmj+5mUfqMQvDZ5nOkZFXXXls5Vx/Hyd4ad1c7LmcWI4Qw2wR75nIen289T87VwqiT\nr4vJuQHY2miwsVYrf6SyMWcu53HyYg4TB3fEyb62b6msooq/rT5CaYWev88Kwb3O4ApLZeWX8Y8v\nj2EwCrxa2TOkty/3BvngYGe+KGrv5YxGrSIpvQjd1Wbzhmpe7b2dcXG0ITY+l437krC11mBtrUGv\nN1JRZUBXZSCwfSt6mGmevBnnLuejzS/H0c6KnMIKvv01nkfGBlq8/ZaDVzh+MZvO/q6kaEv45VgK\no0L80ajN1y1yCsrJK640OzONNq+MK9pigjq1Ua6dg50VQ3r7sv1ICi//55Cybr+uHsyfHHSdZ3v3\naPKaV0xMDGPHjmXMmDGsXLmy3vKjR48yefJkevTowfbt25Xvz58/z4MPPsj999/PxIkT2bJlS1Mn\n9bplFZTz6/HqgRSxCbkmo6K2H0nBYBTcN6C92eaQGjX9XhlXmw73nsogI7eMgUHeyjJAmSLn+MVs\nbG00SnNc9/atADh/E02H55PzKa/U07erO5MGd6KVsy2bD1y2aOoqIQQ/7b2MCrh/oGXBri7/q8Eu\nKaMId1c7urR1o0vb6lFmF2/RoI0KnZ7318WRU1jB0D6+ONhasWbreaWJJz6tkL99doTvdydSWWVg\nbGg73pkbzsz7AnG0t1aaYUb1b6vs08nemr5dPMgprMBgFIwJbatc5/CeXni42bEnNp2kjPrNffFp\nhRw4k6k0Dwkh+HL7RQ6e1RLg58LM3xR8HXyq+33MPSnnFJTz8YbTFJRUMqS3D6/PDOGVGf3xM/MQ\n4eJgQ2EjfSpVeiPf7LzEe2tPsuN4Kh//cFoZUSmE4L8/XyCnsILySj2fbjqL0Xj9fTw198VDo7qy\nZE4Y94W1b3T6KxtrDf4eTlzRlnA5sxiNWqWM0v0ttUpFv64elJRXsWFPEmt3xvPFzxf4vx2X2BCT\nyOYDV1i+4dRN9f2cuZxXr1lv5/HqASd/eqAX/h5O7D6ZbvGoxzNJeayPSaSVsy0LJgUxMKj64e3Y\nBfMvF5+8lMNrnx3mra+Om32lpKZrIbS7p8n348LaE9bDi0G9fIgMb88fR3Zh6vDm/bf0mrTmZTQa\nWbRoEWvWrMHT05Po6GhGjBhBQECAso6vry9vvfUWn332mcm29vb2vPPOO7Rr146srCwmT57MkCFD\ncHJquBZzu22IScRgFAzr68euE2n8uDeJp//Qm5LyKmJi02nlbEtYj4b7fMB00EZlOwM/7EnExkqt\n1LpqBHVqQxsXO3KLKhjQ3Qt72+pLF6gEr4JG+5cac+JidTNLcBcP7G2tmDaiC8t/OM0XP1/gmQf7\nNBp8YxNyuaItJiTQ02yBeS1+nrXX896e3qhVKtxd7WntYsul1AKT2sbxi9moVSr6dDE/9N4cg9HI\ne18dIymjiHt7ejNjTDdCAz1Z+m0sH60/xb09vatHYgkYG9qO8fd2UGoBQ9zsCbvHi10n0hDUNuHV\nGNzLhyPns3BxsGZgT2/le41aTdSgTvxn01kWfX6UPp3dGX9vB/KLK9l2+IoyotLaSk2/bh7YWGmI\niU2nracTT0/pja2N6cjUjt7OHD2fxeXMYpPajt5g5OMfz1BWqWfWuEAG9/JtNC+cHayVl8Lr1uCM\nRkFSRhFf/HyB5KwSvFrZ08bVjrOX8/lmZzwPjerKgTOZHD6XRYCfC26Othy7mM22w8mMC2tv8bUo\nLtOxNy6DNi62DO3j2+BAnt/q5OvCFW0xydoS/D0csdI0/Mz94IguhPf0plJnoPJqbcvaSo2tjYaz\nl/PZdiiZnw+nMHlI7f1VXqlnQ0wirVxs6dGhNf6eTmZ/86cTc1n6bSyd/Vx57o99sdKoyS2s4GR8\nDu29neni78rj99/DG2uOsHrrOSYO6kh8aiHxqYU4OVgzc2wg/nV+76cTc1nx0xk0ahXzJvXExdGG\nUf3b8uvxNLYfSTG5n4UQbD5whQ0xiVhZqfFws2P7kRTSckqZO7EH1ho1567ks/dUBlYaFX27eJik\n3cXRhjn397Aov5uLJg1ecXFxtG/fHj+/6lpDZGQkO3bsqBe8gHo/5Pbta28KT09P2rRpQ15eXqPB\nq6lmD9AbjPxn41kc7ayYEtEZe1srkjKKOHRWS3tvZx4e3ZXM3FJOJeYSn1bI2aQ8KqsMRA3u2OiN\nBqbD5bcfTaGgRMf4e9srzUY11GoVYwe045ud8QwPrh1S3MHbGTsbjfJnIqC6U/7AuSw6eTnh1dr8\nU2oNoxCcuJSNk721UuPp182D3gHVzZQ/H07mvgHmC6hKnYFvdsYDcP/ADo0epyF1m4DC6wSALv5u\nHDqrJTOvDJ82jqTnlLJ8w2mMQjDzvkCG9DZfUFfqDOyJS+dCcgEZeWVk5ZehNwgC27kx875AVCoV\n3Tu0ZvqYbqzZep7tR1Jo42LHY+O7061dq3r7s7HWMDq0ndlj3dOhNSOC/Qls71bvVYjwnt44O1jz\n077LnIzP4WR8bT9Mn87udPB25sCZTA6eqX5S9mrtwDNT+9QbAg61f0onKbOI/oG1T9TrdiWQlFFE\neA8vpY+sMc4ONlTpiymr1KPNK+dMUi4XUwtJSCukQlfdJDe4lw/TRnZBCFjyxTF2HEvF0c6K7UdS\nsLPR8Pj9PXCwtSI+vZANMYn06NC6XlBvyK8n0tDpjYwKaXfN+6KuTr4u/HqiuoWjoSbDGtZWajr7\nuZpd1sXfjf2nM/nlaAqjQ9pSU7x/uf0iB85Uj3T9jgScHawZHdKWyN/032692rcZn1bI97sTmDq8\nC7tjq1+fGN7XD5VKRVtPJ6IGd+T73YnKTPf2thqyCspZ9N+jTBvRhUG9fPhhTxJbDl7BSqNi1n3d\nCbj6TptXawd6d3bnZHwO8WmFdPZzpbBUx5fbL3DsQjatXWz50+ReeLjZs3LjGeIScnn5P4eoqNSj\nuzojy5Devg02w7YkTXqGWq0WH5/am8rLy4tTp05d937i4uLQ6/W0a2e+EKnx0GtbmTaiC6NC2ja6\nXg0hBMnaElq52CrvwZizcd9lpenodFIecyb0YMPVPpo/DAtQ+qfe+uo463YlkJ5TiuPVduZr8Wxl\njwpISC/k8Dktzg7WDQaL4cF+DO7lg02dCX41ajVd27oRl5BLXlEFGo2at78+oTT5dW3rxuBePrT1\ndEKjVqFWq2jtbKc83SdlFFFQomNgkLfSxq5SqZg1rjuvrz7M97sS6eLvZrZA+L8dl9DmlTE6pO01\nC5WGuDra0M7TiVbOtni1qg20Xf1dOXRWy6XUQnzaOPLdr/EYhcDGWs3nW89jY6UmrEdtsCutqFL6\nrEquTrdlb6uhracTgR3bEBna1qTAHNLbl0qdgdyiCiYO6qjUZK+HWq3iodFdG1zes1MbenRszfkr\n+ew4noaTfXWhWNMveP/ADlxKLeR0Uh7D+vg22GdpbtDGiUvZbD+SgndrB6aP6WZRLaZmiqjnPt5P\neWXtgBzv1g508XelXzdPetWZDeVP0b1YtOYIP+27DMBj47vjebXm9+i47iz9NpaVG8/w2swQk0mn\nzdFVGdhxLBV7WysG97p2oK2rZtAGgF8DgzUsYWutYVxYe9buuMTPh5OZ2641h85qOXAmk44+zozs\n35azSXnEJuTy/e5EurVtRWf/6t/9lcxizl3Jp4u/K8VlVfx8OIWOPi7EXJ2PMfSe2lrSfQPao1ar\nsLPW0KWtG77ujsTG5/DZ5nP89+cL/Lg3icJSHZ5u9syN6lHvVZpRIW05GZ/DtkPJdGvnxg97Eimv\nNNC1rRvzonoqv5M/P9CLDXsS2XYouTroBbShd2f3BoN3S9OkwetWzDOWlZXFc889xzvvvHPNdVs5\n27J25yU6+LtxbyNNKAaj4NDpDNbviufClXzcnG15ZVYo3drX78i9mJzP5oNX8Gxlz+A+fqzfFc8/\nvjyGEBAc6MmQkOpA4+HhTJ/DKZy8VN1WPXVkV9r513+SN8ejtYPyou7MyHss3q5GSA9v4hJyOZ9W\nxC+Hk8nMK2NYP3/yCiuIi8+pN3zXyd6a52f0p09XT7ZcnTpmWP92eHjUPkF7eMDz00N45ZN9rNx4\nlg+eGaaMWAM4cCqdmNh0Ovq6MDe69039KYoPnx2uzAZQY0AvP77YfpHk7FLSCyqITcilZ0AbHpvQ\nk5c/3senm89hY2dNhc7A8QtZnE7IRVdlwMnemgdHdWP0gPa4u9k1Wqj/cdw9N5zm6+Hp6aL8Tswt\nGxh87YctH3dHkrNKcHd3YvfxVFZuPIuNlZqXZoXS1teywiqwozv7TmXiaGfN4D7+BHfzpGdAm3oj\nE2t4eDjz4sxQ/v7pQQb38WPCsNp39iI8nLmYVsSmfUms3nqBF2eG1KtNGY0C9dVruu3AZYrLqoge\n3uW6f99t2jjhaGdFaYWeHp09TH6n1yt6VDe2H0lmx7FURoZ14MvtF7Cz0fDCI6H4ejgxYRicTcrl\n+Y/28u2ueN57aihqtYrPf74IwEP3dcfd1Z6F78ew8qczGAVEDQ3A39d08MSM8T1NPo/2dKFvdx/+\n+dVRziblMaSPH/On9DZb03Z3d2Ld7gSOX8zm+MVsHO2tmTu5B2PDO5jcIwBzo/vw+OTe9b7/PWjS\n4OXt7U16eu27BVqtFk9Pz0a2MFVSUsLcuXNZuHAhvXr1uub6rz0Wxgsf7eWfXx3jWaPR7BNIfGoh\nqzafRZtfPbS9W1s3LqYW8MK/9zF7XKDJ07yuysA/vzyK0Sh4ZGwg3du3orOPM//ZdJbCEh0TwtuT\nnV37NDxuQDtOXsrG2kpNeHdPk2WN8XS1IyuvDK9W9gR3bmPxdjXaXu3A/vTH0wCMCPbnqWnB5OSU\nkFVQzqGzWopLdRiMAl2VgUPntLy+8iAPjujMvtg0bKzU+Le2r3dcb1dbJgzsyA97k3jn8yPMmXAP\ndjZW5BdX8v7aE1hbqZk9rjsF+bd+Pjk7DTjaWRF3KZv4q4NRJg/uiIuthqem9Oa9tSf56LtYZX1f\nd0cGBnkzrI9fdS1Krycnp3potYeH83Xn6d2mrYcjh3NK+ft/DnDsQjZ2NhrmTOyBk7Xa4nML7+7B\nkL5j0FfolCCkK9eRXd5wc7uvmx3LFgzC3laj5GeN+8Pbk5RWwOGzmbzz+REeHd8dtUpFUkYRq7ec\nI6ugnF4B7oQGevJ9TCIateq67ou6Ovq4cDopDxdbzU1fy7Gh7fj6l0s8/9FedFUGZt4XiDVC2a+H\nkw1h93hx8KyWDTsvck+HVuw5mYa/hyNtW9ujUqmYProrqzafA2BANw+L07RwSm+0+WXVk3IXV1Ba\nbH64emRYez7+4TThPb2JHhaAi4MNeblN9xcYbuaB4E5p0uAVFBREcnIyaWlpeHh4sHnzZpYuXdrg\n+nVralVVVcyfP5+oqChGjx5t0fE6+7vxZFRPPlgXxwfr4njuj31NmrMOn9Py6aZzGI2Cwb18GDug\nHT5tHDmVmMsnP55m5cazxKcV0qNja/zcHdlxLI2M3DJG9vNXRvV1a9eKxY+FUVymqzdUuLO/K9HD\nAnB1tLmuYettPZ04nZRH9LCA6+oLqOHvWftkOqS3D9NG1T4he7rZc/+9HUzWH9rHj482nOLrqxP+\nBnf1aLDZZ/y9HbiQUsDJ+BzmLY3BzckGlUpFaYWe6aO73tAgDUuoVSo6+7kSe3XUVngPb6V5pbOf\nKwun9mZPXAZd/F3p0aE1rV3smiQdd4sO3i4cPpfFsQvZtPV0Yt6knibNrJZQq1S0drEju/L6Rts1\n1H9ibaVm/uQg3lt7kgNnMnGws8LWWsPWQ1cQAtq42HH0fBZHrza5DwryqdeXa6lpI7uQll16S67z\n0D6+bD2UTH5xJcFdPcw2Y06J6MyJSzl8vzuB+FR3jEIwJrSdcl8NDPKpfgfQKK7Zr1yXWq3Cp821\n75ngrh6s+OswpeYq1acSTfw3BGJiYli8eDFCCKKjo5kzZw4ffPABQUFBREREcOrUKRYsWEBRURG2\ntrZ4eHiwceNGfvrpJ1566SW6dOmijI76xz/+QWBg4+9PZGcXExObzpqt51EBfbq4Mzqk7dVO1kTs\nbDTMi+pJz06mM52n55Tywbo4sn7zsrFXawf+Nuvabfo3o6xCT0pWsdkBA5baE5tOdmEFUYM6olar\nrlnbyCuq4IPv40jWljBnwj2E3ePd4LrFZTq2HLxCanYp2rwycgsrCO7qwbxJPS0eMXYjth68wne7\nErC2UvOPOWE3XHC1hJpXSlYJiz4/SngPLx4a1dWk3/N6NEVelJRX8fZXx0m7Onm0u6sds8Z1J7Cd\nGylZJRw+l8UVbTEzxnQzmXj3Tjp+MZtD57N4eGQXk+bwujbtv6y8f9jK2Za354bf0MNlc9Aca15N\nHrxut5ob89iFLDYfuGIyJ1wrZ1v+MqU3bT3NDy6o0Ok5dyWf9JxS0nJKyS+qZOqIzg3OTXg3s6SQ\nqqwycCWzmC7+rtcVhPQGIxq1qkkDF0BqVgmvrz7MxEEdmXAD75DVaAnBC6rz/WYLz6bKi/ziSlb+\ndIa2Xk5MHtIJO5u7f7TbtfKiSm/glU8PkV1QwZSIgAYHUrUEMnjdBer+GIUQJKQVsf1oCuWVemaP\n637DzRbNTUspsEvKq3C0s7qpQNlS8uJWkHlRy5K8iE8tZPfJNP44qusNjUhtLppj8Gq5V4PqId+d\n/V2V4a5S81N3eiJJut1k+XH3apkNuJIkSVKLJoOXJEmS1OzI4CVJkiQ1OzJ4SZIkSc2ODF6SJElS\nsyODlyRJktTsyOAlSZIkNTsyeEmSJEnNjgxekiRJUrMjg5ckSZLU7MjgJUmSJDU7MnhJkiRJzY4M\nXpIkSVKzI4OXJEmS1OzI4CVJkiQ1OzJ4SZIkSc2ODF6SJElSsyODlyRJktTsyOAlSZIkNTtNHrxi\nYmIYO3YsY8aMYeXKlfWWHz16lMmTJ9OjRw+2b99usmzDhg2MGTOGMWPG8MMPPzR1UiVJkqRmwupa\nK6SkpLBu3ToOHTpEZmYmtra2BAYGMmbMGEaPHo2VVcO7MBqNLFq0iDVr1uDp6Ul0dDQjRowgICBA\nWcfX15e33nqLzz77zGTbwsJC/v3vf7NhwwaEEEyePJkRI0bg7Ox8E6crSZIktQSNBq/XXnuNM2fO\nMHbsWP7617/i7u5OZWUlCQkJ7N27l5UrV/K3v/2NPn36mN0+Li6O9u3b4+fnB0BkZCQ7duyoF7wA\nVCqVybZ79+5l4MCBSrAaOHAge/bsYdy4cTd+tpIkSVKL0GjwGjFiBG+88Ua977t168a4ceMoKCgg\nJSWlwe21Wi0+Pj7KZy8vL06dOmVRwsxtq9VqLdpWkiRJatkaDV5Dhw5tdGM3Nzfc3NwaXC6EuLFU\nNbDtb2tn5nh4yGbFGjIvasm8qCXzopbMi+brmn1eAG+99Rbz58/H3t6eGTNmcPbsWf7+978zceLE\nRrfz9vYmPT1d+azVavH09LQoYd7e3hw6dEj5nJmZSVhY2DW3y84utmj/LZ2Hh7PMi6tkXtSSeVFL\n5kWt5hjELRptuH//fpydndm7dy9eXl78/PPP9QZYmBMUFERycjJpaWnodDo2b97MiBEjGly/bm1r\n0KBB7N+/n+LiYgoLC9m/fz+DBg2yJLmSJElSC2dRzavGkSNHGDVqFF5eXhY14Wk0Gl599VVmz56N\nEILo6GgCAgL44IMPCAoKIiIiglOnTrFgwQKKior49ddf+eijj9i4cSOurq7MmzePBx54AJVKxYIF\nC3BxcbnhE5UkSZJaDpWwoGNq1qxZ+Pn5sW/fPn744QccHR2ZNGkSGzduvB1pvC6yGaCabBKpJfOi\nlsyLWjIvarXYZsP33nuPzp07s2zZMlxdXcnMzGTWrFlNnTZJkiRJMsuiZsPWrVszc+ZM5bO/vz/+\n/v5NlSZJkiRJalSjwSssLKzRvq0DBw7c8gRJkiRJ0rU0Gry+//57ANatW0dBQQFTp05FCMH333+P\nl5fXbUmgJEmSJP1Wo8GrZlqnI0eO8OWXXyrfv/LKKzz88MM8/vjjTZs6SZIkSTLDogEbWVlZ5OXl\nKZ/z8vLIzs5uskRJkiRJUmMsGrDxyCOPEBUVxbBhwwDYvXs3TzzxRFOmS5IkSZIaZFHweuihh+jX\nrx9HjhxBCMFDDz1Et27dmjptkiRJkmSWxTNsBAYGEhgY2JRpkSRJkiSLWBS8jh8/zrvvvktKSgoG\ngwEhBCqVSg6VlyRJku4Ii4LXyy+/zLx58+jTpw9qtUVjPCRJkiSpyVgUvOzs7Lj//vubOi2SJEmS\nZBGLqlFDhgxh9+7dTZ0WSZIkSbKIRTWvb775hhUrVuDo6IiNjY3s85IkSZLuKIuCV800UZIkSZJ0\nN7AoePn5+aHX60lKSkKlUtGhQwesrK7r71hKkiRJ0i1jUQQ6deoUf/7zn5UmQ71ez4cffkiPHj2a\nOn2SJEmSVI9FwWvx4sUsWbKE8PBwAA4ePMiiRYtYu3ZtkyZOkiRJksyxaLRheXm5Erig+u98lZeX\nN1miJEmSJKkxFgUve3t7Dh48qHw+fPgw9vb2TZYoSZIkSWqMRc2GL730Ek899RQ2NjYAVFVV8cEH\nH1h0gJiYGJYsWYIQggceeIA5c+aYLNfpdDz//POcOXOGVq1asWzZMnx9fdHr9bzyyiucOXMGo9HI\nxIkT620rSZIk/T5ZFLx69erF9u3bSUpKQghBp06dsLa2vuZ2RqORRYsWsWbNGjw9PYmOjmbEiBEE\nBAQo66xbtw5XV1e2b9/Oli1bePfdd1m2bBnbtm2jqqqKjRs3UlFRwbhx4xg/fjy+vr43fraSJElS\ni2BRs+H+/fupqKiga9eudOvWjfLycoteUI6Li6N9+/b4+flhbW1NZGQkO3bsMFlnx44dTJo0CYAx\nY8YozZMqlYqysjIMBgPl5eXY2Njg5OR0vecnSZIktUAWBa933nnHJHA4OTnxzjvvXHM7rVaLj4+P\n8tnLy4usrCyTdbKysvD29gZAo9Hg7OxMQUEBY8aMwd7enkGDBjF8+HAeffRRXFxcLDopSZIkqWWz\nqNmwZjqoGmq1GoPBYNF217tOzbHi4uLQaDTs27ePgoIC/vjHPxIeHo6/v78lSZYkSZJaMIuCl6Oj\nI7GxsfTu3RuA2NhYHBwcrrmdt7c36enpymetVounp2e9dTIzM/Hy8sJgMFBSUoKrqyubNm1i8ODB\nqNVqWrduTXBwMKdPn75m8PLwcLbklH4XZF7UknlRS+ZFLZkXzZdFwevZZ59l/vz5dO7cGYD4+Hg+\n+uija24XFBREcnIyaWlpeHh4sHnzZpYuXWqyTkREBBs2bKB3795s27aNsLAwAHx8fDh48CATJkyg\nrKyM2NhYZs6cec1jZmcXW3JKLZ6Hh7PMi6tkXtSSeVFL5kWt5hjEVcKStj2gsLCQkydPIoSgb9++\nuLq6WnSAmJgYFi9ejBCC6Oho5syZwwcffEBQUBARERHodDqeffZZzp07h5ubG0uXLsXf35+ysjJe\nfNEYr50AABg9SURBVPFFEhISAHjggQeYNWvWNY8nf4zV5I1ZS+ZFLZkXtWRe1GrRwSspKYmEhARG\njhxJaWkpVVVVuLm5NXX6rpv8MVaTN2YtmRe1ZF7UknlRqzkGL4tGG27YsIEnn3ySf/zjH0B139Vf\n/vKXJk2YJEmSJDXEouD1+eef8/333+PsXB2dO3XqRE5OTpMmTJIkSZIaYlHwsra2xtHR0eQ7jUbT\nJAmSJEmSpGuxKHi5ubkpf4gS4Mcff1ReLJYkSZKk283iiXmfeeYZkpKSGD58OHZ2dnzyySdNnTZJ\nkiRJMsui4NWxY0e+++47Ll++jBCCjh07ymZDSZIk6Y6xqNkwKSkJvV5PQEAAGRkZrFq1isLCwqZO\nmyRJkiSZZVHw+stf/oJarSYlJYXXX3+dlJQUnn/++aZOmyRJkiSZZVHwUqvVWFtbs3v3bqZNm8ai\nRYvIyMho6rRJkiRJklkWBa/Kykq0Wi07d+5U5h60cGIOSZIkSbrlLApejzzyCJGRkTg6OhIUFERK\nSorywrIkSZIk3W4Wz21Yl8FgwGAwYGNj0xRpuilyrrJqct62WjIvasm8qCXzolaLm9vw9OnTZr/X\naDTY2Nig0+mUWd8lSZIk6XZp9D2vFStWUF5ezvjx4+nduzfu7u5UVlaSlJTEnj172L17Ny+88AIB\nAQG3K72SJEmS1Hjw+vDDD4mLi+Obb77h3//+N5mZmdjb29O1a1dGjhzJV199hZOT0+1KqyRJkiQB\nFsyw0atXL3r16nU70iJJkiRJFrFotKEkSZIk3U1k8JIkSZKaHRm8JEmSpGZHBi9JkiSp2bEoeOXm\n5vLXv/6Vhx56CIDz58/zf//3f02aMEmSJElqiEXB65VXXqFfv34UFRUB0KlTJ77++muLDhATE8PY\nsWMZM2YMK1eurLdcp9Px9NNPM3r0aKZOnUp6erqy7Pz58zz44IOMHz+eCRMmoNPpLDqmJEmS1LJZ\nFLy0Wi3Tpk1T/gCljY0NavW1NzUajSxatIhVq1axadMmNm/eXG9GjnXr1uHq6sr27dt55JFHePfd\nd4HqKaiee+453njjDTZt2sQXX3yBtbX19Z6fJEmS1AJZFLysrExfBysqKrJoVvm4uDjat2+Pn58f\n1tbWREZGsmPHDpN1duzYwaRJkwAYM2YMBw8eBGDv3r0EBgbStWtXAFxdXVGpVJYkV5IkSWrhLApe\no0eP5rXXXqO0tJT169cze/ZsHnjggWtup9Vq8fHxUT57eXmRlZVlsk5WVhbe3t5A9ZyJzs7OFBT8\nf3v3HhxVef9x/L1sAlJMgpiQRaS0JraQGqAzKsERIYBZIITsBiIMUsKlpdoBKqFYwck4crXGyUhk\nOhIBKzRMa4HIJRBSgxI6XGy1hZkCRUEn3JJwS5NgypLN8/sjP3YbgrBWNvEkn9df7Nlnz373yzN8\nOGfPPqeKL774AoAZM2aQlpbG6tWrA/1MIiLSxt12hQ2An/70p2zdupXq6mr27NnDT37yE1JTU2/7\nukCOzm4cY4zBZrPh9Xr55JNP2LRpE506dWLq1Kk89NBDvvuJiYhI+xVQeAGMHTuWsWPHfq2dOxyO\nJhdgVFRU0L1792ZjysvLiY6Oxuv1UltbS0REBA6Hg0ceeYSIiAgAnnjiCY4cOXLb8LLi0v7Bol74\nqRd+6oWfemFdAYXXxYsX+f3vf09ZWRn19fW+7StWrLjl6+Lj4ykrK+PMmTNERUVRWFhITk5OkzGJ\niYkUFBTQv39/ioqKfOH0+OOPs3r1aq5evYrdbuevf/0rU6dOvW2tuj9PI92ryE+98FMv/NQLPyuG\neEDh9Ytf/IK4uDgGDRrku+IwEHa7naysLKZPn44xhvHjxxMTE0Nubi7x8fEkJiaSnp7O/PnzSUpK\nomvXrr5wCw8PZ9q0aYwbNw6bzcbQoUMZMmTI//YpRUSkTQnoTspjx45l69atLVHPN6b/STXS/yr9\n1As/9cJPvfCz4pFXQFcb9u/fn3/961/BrkVERCQgAZ02nDhxIpMnT8bhcNCpUyff9o0bNwatMBER\nka8SUHjNnz+fZ555hri4uK/1nZeIiEgwBBRenTp1YsaMGcGuRUREJCABfec1ePBgSktLg12LiIhI\nQAI68nr33XfJy8ujS5cudOzY0bcKxv79+4Ndn4iISDMBhdemTZuCXYeIiEjAAgqvnj17BrsOERGR\ngN0yvObPn092drZvlYsb6VJ5ERFpDbcMr4yMDAB+/etft0gxIiIigbhleG3YsIFly5bx6KOPtlQ9\nIiIit3XLS+WPHj3aUnWIiIgELKDfeYmIiHyb3PK04fHjxxk0aFCz7fqdl4iItKZbhtf3vvc98vLy\nWqoWERGRgNwyvDp27KjfeImIyLfOLb/zCg0Nbak6REREAnbL8Hr33Xdbqg4REZGA6WpDERGxHIWX\niIhYjsJLREQsJ+jhVVpaysiRI3E6nTe97N7j8TB37lySkpKYMGECZ8+ebfL82bNn+fGPf8zbb78d\n7FJFRMQighpeDQ0NLF68mDVr1rB9+3YKCws5ceJEkzEbN24kIiKC4uJiMjIyyM7ObvL8K6+8wpAh\nQ4JZpoiIWExQw+vw4cP07t2bnj17EhoaSnJyMiUlJU3GlJSU4Ha7AXA6nU1W7Xj//ffp1asXsbGx\nwSxTREQsJqjhVVFRQY8ePXyPo6OjqaysbDKmsrISh8MBgN1uJzw8nKqqKurq6li9ejWzZs0KZoki\nImJBAd1J+X9ljPnaY66vm5ibm8vUqVPp3LlzwPsCiIoK+/qFtlHqhZ964ade+KkX1hXU8HI4HE0u\nwKioqKB79+7NxpSXlxMdHY3X66W2tpaIiAgOHz5McXEx2dnZVFdX06FDBzp16sTTTz99y/c8f74m\nKJ/FaqKiwtSL/6de+KkXfuqFnxVDPKjhFR8fT1lZGWfOnCEqKorCwkJycnKajElMTKSgoID+/ftT\nVFREQkICAPn5+b4xK1eupEuXLrcNLhERaR+CGl52u52srCymT5+OMYbx48cTExNDbm4u8fHxJCYm\nkp6ezvz580lKSqJr167Nwk1ERORGNhPol0kWodMAjXRKxE+98FMv/NQLPyueNtQKGyIiYjkKLxER\nsRyFl4iIWI7CS0RELEfhJSIilqPwEhERy1F4iYiI5Si8RETEchReIiJiOQovERGxHIWXiIhYjsJL\nREQsR+ElIiKWo/ASERHLUXiJiIjlKLxERMRyFF4iImI5Ci8REbEchZeIiFiOwktERCxH4SUiIpYT\n9PAqLS1l5MiROJ1O8vLymj3v8XiYO3cuSUlJTJgwgbNnzwKwb98+0tLSGDt2LOPGjePAgQPBLlVE\nRCwiqOHV0NDA4sWLWbNmDdu3b6ewsJATJ040GbNx40YiIiIoLi4mIyOD7OxsALp168aqVavYunUr\nr7zyCs8//3wwSxUREQsJangdPnyY3r1707NnT0JDQ0lOTqakpKTJmJKSEtxuNwBOp5P9+/cD0KdP\nH6KiogB48MEH8Xg8XLt2LZjlioiIRQQ1vCoqKujRo4fvcXR0NJWVlU3GVFZW4nA4ALDb7YSHh1NV\nVdVkTFFREXFxcYSGhgazXBERsYiQYO7cGPO1xxhjsNlsvseffvopOTk5rF27NqD3jIoK+3pFtmHq\nhZ964ade+KkX1hXU8HI4HL4LMKDxSKx79+7NxpSXlxMdHY3X66W2tpaIiAgAysvLmTVrFq+++ir3\n339/QO95/nzNnfsAFhYVFaZe/D/1wk+98FMv/KwY4kE9bRgfH09ZWRlnzpzB4/FQWFjI8OHDm4xJ\nTEykoKAAaDw9mJCQAEB1dTU///nP+dWvfsWAAQOCWaaIiFhMUMPLbreTlZXF9OnTGTNmDMnJycTE\nxJCbm8sHH3wAQHp6OpcvXyYpKYl33nmHefPmAZCfn09ZWRm//e1vcblcuN1uLl26FMxyRUTEImwm\nkC+mLESnARrplIifeuGnXvipF346bSgiItICFF4iImI5Ci8REbEchZeIiFiOwktERCxH4SUiIpaj\n8BIREctReImIiOUovERExHIUXiIiYjkKLxERsRyFl4iIWI7CS0RELEfhJSIilqPwEhERy1F4iYiI\n5Si8RETEchReIiJiOQovERGxHIWXiIhYTtDDq7S0lJEjR+J0OsnLy2v2vMfjYe7cuSQlJTFhwgTO\nnj3re27VqlUkJSUxatQo/vKXvwS7VBERsYighldDQwOLFy9mzZo1bN++ncLCQk6cONFkzMaNG4mI\niKC4uJiMjAyys7MB+Oyzz9i5cyc7duzgrbfe4uWXX8YYE8xyRUTEIoIaXocPH6Z379707NmT0NBQ\nkpOTKSkpaTKmpKQEt9sNgNPp5MCBAwDs3r2b0aNHExISwv3330/v3r05fPhwMMsVERGLCGp4VVRU\n0KNHD9/j6OhoKisrm4yprKzE4XAAYLfbCQsLo6qq6qavraioCGa5IiJiEUENr0BO891sjM1m+8rt\nIiIiIcHcucPhaHIBRkVFBd27d282pry8nOjoaLxeLzU1NUREROBwODh37pxvXHl5ebPX3kxUVNid\n+wAWp174qRd+6oWfemFdQT3yio+Pp6ysjDNnzuDxeCgsLGT48OFNxiQmJlJQUABAUVERCQkJAAwb\nNowdO3bg8Xg4deoUZWVl9OvXL5jlioiIRQT1yMtut5OVlcX06dMxxjB+/HhiYmLIzc0lPj6exMRE\n0tPTmT9/PklJSXTt2pWcnBwAYmNjGTVqFMnJyYSEhPDSSy/ptKGIiABgM7r+XERELEYrbIiIiOUo\nvERExHIUXiIiYjltJrxut4ZiW1ZeXs6UKVMYPXo0KSkprFu3DoB///vfTJ8+HafTyYwZM6ipqWnl\nSltOQ0MDbrebZ555BoDTp0/z1FNP4XQ6yczMpL6+vpUrbBk1NTXMmTPHd/HToUOH2u28+N3vfseY\nMWNISUlh3rx5eDyedjMvFi5cyGOPPUZKSopv263mwZIlS0hKSiI1NZWjR4+2Rsm31SbCK5A1FNsy\nu93OggUL2LFjB3/4wx/Iz8/nxIkT5OXlMWjQIHbt2sXAgQNZtWpVa5faYtatW0dMTIzv8Wuvvca0\nadPYtWsXYWFhbNy4sRWrazlLly5lyJAh7Ny5ky1btvDAAw+0y3lRUVHB+vXr2bx5M9u2bcPr9VJY\nWNhu5kVaWhpr1qxpsu2r5sGePXsoKyujuLiYRYsW8dJLL7VGybfVJsIrkDUU27KoqCj69u0LQJcu\nXYiJiaGioqLJupFut5v333+/NctsMeXl5ezZs4f09HTftgMHDuB0OoHGXvz5z39urfJaTG1tLX/7\n298YN24cACEhIYSFhbXbedHQ0EBdXR319fX85z//oXv37hw8eLBdzIuHH36Y8PDwJttunAfX/80s\nKSnB5XIB0L9/f2pqarhw4ULLFhyANhFegayh2F6cPn2aY8eO0b9/fy5evEhkZCTQGHCXL19u5epa\nxrJly3j++ed9vwu8fPkyERERdOjQON0dDke7mB+nT5/mnnvuYcGCBbjdbrKysqirq2uX8yI6Oppp\n06YxdOhQnnjiCcLCwoiLiyM8PLzdzYvrLl261GQeXLp0CWi63ix8e9eVbRPhpZ+qNbpy5Qpz5sxh\n4cKFdOnSpV3+qPvDDz8kMjKSvn37+uaFMabZHGkPvamvr+fIkSNMmjSJgoICOnfuTF5eXrv47Deq\nrq6mpKSEDz74gL1791JXV0dpaWmzce2xNzeyyrqyQV1ho6UEsoZiW1dfX8+cOXNITU1lxIgRANx7\n771cuHCByMhIzp8/T7du3Vq5yuD75JNP2L17N3v27OHq1atcuXKFZcuWUVNTQ0NDAx06dAh4nUyr\nczgcOBwO4uPjAUhKSuKtt95ql/Ni37599OrVi65duwIwYsQI/v73v1NdXd3u5sV1XzUPoqOjKS8v\n9437tvalTRx5BbKGYlu3cOFCYmNjycjI8G0bNmwYmzdvBqCgoKBd9CQzM5MPP/yQkpIScnJyGDhw\nIK+99hoDBw6kqKgIaD+9iIyMpEePHnz++edA4/d+sbGx7XJe3HfffRw6dIirV69ijOHAgQM8+OCD\n7Wpe3HhE9VXzYPjw4bz33nsA/OMf/yA8PNx3evHbpM0sD1VaWsrSpUt9ayjOnDmztUtqMR9//DGT\nJ0/mBz/4ATabDZvNxty5c+nXrx/PPfcc586d47777mPFihXNvrRtyz766CPWrl3Lm2++yalTp8jM\nzKS6upq+ffuSnZ1NaGhoa5cYdMeOHePFF1+kvr6eXr16sXz5crxeb7ucFytXrqSwsJCQkBDi4uJY\nsmQJ5eXl7WJezJs3j4MHD1JVVUVkZCSzZ89mxIgR/PKXv7zpPFi0aBF79+6lc+fOLF++nB/96Eet\n/AmaazPhJSIi7UebOG0oIiLti8JLREQsR+ElIiKWo/ASERHLUXiJiIjlKLxERMRyFF5iScOGDeOz\nzz5rkfdauXJlk1tlLFiwgPz8/G+83wULFpCSkkJmZuY33tetHDt2jJ07dwb1PURamsJL5DZWrlzJ\ntWvX7ug+L1y4QHFxMdu2bSMnJ+eO7vtGR44c+Z/Dq6Gh4Q5XI3JnKLykTfn888/52c9+Rnp6Oi6X\ny7f8DUCfPn1YtWoV48eP58knn6S4uNj33K5duxg1ahRpaWmsWrWKPn36UFdXx6JFi7DZbEycOBG3\n201tbS0Ax48fJyMjA6fTyQsvvPCV9bz33nukpKSQmprK7NmzuXTpEleuXCEjI4OrV6/idrt55513\nmrxmy5YtzJo1y/fY6/UyePBg3/qdq1ev5qmnniItLY1nn32WixcvAnDt2jV+85vfkJKSgsvlYvbs\n2VRVVfHGG29w4MAB3G43S5cuBRpXpHG73aSmpjJt2jROnToFNK5K4nK5WLJkCRMnTmTv3r3f5K9D\nJHiMiAUlJiaaTz/9tMm2+vp643a7zcmTJ40xxtTW1hqn0+l7/MMf/tDk5+cbY4z5+OOPzeDBg40x\nxly4cME8+uijpqyszBhjzNtvv2369OljvvzyS9/r6urqfO/zwgsvmEmTJhmPx2M8Ho9JTk42+/bt\na1bj8ePHzeOPP24uXLhgjDHm9ddfN88995wxxpjTp0+bhISEm362uro6k5CQYC5fvmyMMWb37t0m\nIyPDGGPMli1bTFZWlm/shg0bzLx584wxxrzxxhtm9uzZpr6+3hhjfK/fvHmzmTNnju81Fy9eNAkJ\nCebEiRPGGGP+9Kc/mfT0dGOMMQcPHjRxcXHm0KFDN61N5NtCR17SZnzxxRecPHmSzMxMXC4XTz/9\nNNeuXWtyV+3Ro0cDMGDAAM6fP4/H4+HQoUM89NBD9OrVC4Dx48c327e5YRW1ESNGEBoaSmhoKHFx\ncZSVlTV7zcGDBxk6dCj33nsvABMnTmTfvn23/Rx33XUXw4cPZ/v27UDjoqnXbyi5e/du9u/fj8vl\nwuVysWHDBs6dOwc03g5mypQp2O12AN8K6jc6dOgQffv25YEHHgBg3LhxHD16lC+//BKA3r17069f\nv9vWKdKa2sQtUUSgMWC6detGQUHBTZ+32Wx06tQJwHcDQq/X2yyYbnx8Mx07dvT92W63N7mg47/3\nc+N9kK6/7+24XC6WL1/OmDFj+Oijj8jOzvbt89lnnyUtLe2m7xeIm9X134+/853vBLQfkdakIy9p\nM77//e9z1113sWXLFt+2kydPcuXKFaD5P+7XHw8YMIB//vOfvu99/vt7MoC7776bmpqar13PoEGD\n2LNnj+87qT/+8Y889thjzd7/Zh5++GFqa2vJycnhySef9IXusGHD2LBhA9XV1QB4PB6OHTsGQGJi\nIuvWrfNdXHL9Dsl3332377u665/36NGjvlulbN68mbi4OIWWWIqOvMSSbDYbU6dOJSQkxHcksW3b\nNt58802WLl3K2rVr8Xq9REZG8vrrr/tec+M+oPGmfC+//DIzZ87knnvuYejQoYSEhNC5c2cApk2b\nxpQpU+jcuTPr168PuMbY2FgyMzOZOnUqHTp0oFevXixatKjZ+38Vl8tFbm4uGzZs8G1LTU2lqqqK\nyZMnY7PZaGhoYNKkSfTp04eZM2eSk5ODy+WiY8eOfPe732XFihUMGjSINWvW4HK5eOSRR3jxxRd5\n9dVXmTdvHl6vl27duvmO7ESsQrdEEQGuXLlCly5dgMYjkU2bNt2R33KJSHDoyEsEWL9+PUVFRXi9\nXrp27crixYtbuyQRuQUdeYmIiOXogg0REbEchZeIiFiOwktERCxH4SUiIpaj8BIREctReImIiOX8\nH4gzFtcS9o9MAAAAAElFTkSuQmCC\n",
+            "text/plain": [
+              "\u003cmatplotlib.figure.Figure at 0x7f47b20dd690\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        }
+      ],
+      "source": [
+        "plt.plot(graph_means)\n",
+        "plt.ylabel('Time (seconds)')\n",
+        "plt.xlabel('Length of vector')\n",
+        "_ = plt.title('Time to sum the elements of 1000 vectors (vectorized TF operation)')\n",
+        "_ = plt.ylim(ymin=0)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "4KZg2WXjbhg5"
+      },
+      "source": [
+        "## AutoGraph"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "UQJBQWbCbinm"
+      },
+      "outputs": [],
+      "source": [
+        "# Sum written using for loop and converted with AutoGraph\n",
+        "def sum_all(elements):\n",
+        "  sum_ = 0.0\n",
+        "  length = len(elements)\n",
+        "  for i in tf.range(length): \n",
+        "    sum_ += elements[i][0]\n",
+        "  return sum_\n",
+        "\n",
+        "def run_trial(num):\n",
+        "  elements = get_elements(num)\n",
+        "  return sum_all(elements)\n",
+        "    \n",
+        "ag_means = []\n",
+        "ag_run_trial = ag.to_graph(run_trial)\n",
+        "\n",
+        "for num in range(max_elements):\n",
+        "  with tf.Graph().as_default():\n",
+        "    durations = []\n",
+        "    foo = ag_run_trial(num)\n",
+        "    with tf.Session() as sess:\n",
+        "      for _ in range(burn_ins):\n",
+        "        for _ in range(batches):\n",
+        "          sess.run(foo)\n",
+        "        \n",
+        "      for _ in range(trials):\n",
+        "        start = time.time()\n",
+        "        for _ in range(batches):\n",
+        "          sess.run(foo)\n",
+        "      \n",
+        "        duration = time.time() - start\n",
+        "        durations.append(duration)\n",
+        "    ag_means.append(np.mean(durations))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 301
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 310,
+          "status": "ok",
+          "timestamp": 1532448438694,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "DLDOmrRW99v5",
+        "outputId": "ae0e0573-39db-4004-a064-efc618dbf867"
+      },
+      "outputs": [
+        {
+          "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEcCAYAAADUX4MJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3XdYVFf++PH3DE1AinTEjgULioggNiTYjcZCoiZqjEmM\n6RuzcVc32exPE7O72dTNmmhi1u+abLJqLFFsib1E7GLDjkgbQUSKyDAz5/eHySARcYwMQ/m8nsfn\nce4999zPPXOHzz23nKtRSimEEEKIu9DaOgAhhBC1gyQMIYQQFpGEIYQQwiKSMIQQQlhEEoYQQgiL\nSMIQQghhkTqZMObPn88bb7xh6zBqlQceeICffvrJ6uuZOXMmH330kdXXU1NcuHCBUaNG0a1bN776\n6itbh1OnnT17ljFjxtg6jGpV2e9Jr9czZMgQcnNzq2x9tTJhdO3alfDwcMLDw2nfvj1dunQxT1uz\nZg3PPPMMc+bMsXoc6enphISEYDKZrL6uqlTf/mjfjTW/xy+++IKoqCgOHDjAhAkTbpu/bt06xo0b\nR1hYGJMmTbpt/smTJxk9ejRhYWGMGTOG5OTkcvPfffddoqKi6NGjB+++++49LWttEydOZNmyZdW2\nvo8//pinnnqqwjgiIyMpLS29p/pCQkK4dOnSPS3z1VdfMWLECMLCwujduzeTJk1i7dq191RHVXF0\ndCQ+Pp7PP/+8yuqslQnj0KFDHDx4kIMHD9K4cWPmz59vnvbggw9WWxxKKTQaDfLsY+1mze8xIyOD\n1q1b33G+p6cnkydPZurUqbfNKy0t5fnnn2fkyJHs27ePkSNH8txzz2EwGAD49ttv2bx5M6tXr+b7\n779n69at/O9//7No2drgXr6P7OxsEhMTiYuLKzc9PT2dAwcOoNFo2Lx58z2tX6PR3FP5OXPmsHjx\nYmbOnMnevXvZsWMHv/vd79ixY8cdl7H2344HH3yQFStW3HOyvJNamTBupZS6rdE/+eQTXnvtNaDs\n6HH58uX069ePqKgovv32W44ePcqIESOIjIy8rTeybNkyhg4dSlRUFE899RQZGRkVrnvixIkARERE\nEB4ezpEjR1BKMW/ePB544AF69erFH//4RwoLCytc/urVq0ybNo3u3bsTFRVV7gj010c3t/YK9u7d\nS0xMDF988QU9e/akT58+/Pjjj2zbto1BgwYRFRXF/PnzK1znkiVLWL16NV988QXh4eE8++yz5nkn\nT55kxIgRdO/enenTp6PX683ztmzZwsiRI+nevTvjx4/n1KlTFdYPcO7cOaZMmUJUVBRDhgxh3bp1\ndyxbWb0PPPAACxcuZMSIEXTt2pXXX3+dK1eu8PTTTxMeHs6UKVMoKCgwlz98+DDjxo2je/fujBw5\nkr1795rnTZw4kY8++ojx48cTHh7Ok08+SV5ennkelP8eU1NTmThxIhEREURHRzN9+vQ7bsOmTZt4\n8MEHiYyMZNKkSZw/fx6Axx9/nMTERGbPnk14eDgXL168bdno6GgGDx6Mr6/vbfP27t2L0Whk0qRJ\nODg4MHHiRJRS7NmzB4CVK1cyZcoU/Pz88PPz44knnmDFihUAJCYmVrrsrdauXXvbqZxFixbx3HPP\nATdPbfztb38jNjaW3r1785e//KXcvvHjjz8ycuRIunXrxsCBA9m5cycffPABBw4cYM6cOYSHh/PW\nW28BcPDgQeLj4+nevTsPP/wwhw4dKvcdffDBB4wfP56wsDDS0tJYvnw5/fv3Jzw8nP79+7NmzZoK\nv4Ndu3bRsWNHHB0dy01fuXIlYWFhjB492tw2t67v1h7QihUrePTRRwGYMGECSilGjBhBeHi4eR9e\nsmQJAwcOJCoqiueee47Lly8DN089fvPNN3zwwQdER0fj6OiIRqMhPDycd955567bOHToUMLDwxkw\nYIA56f+yD8TExDB//nx69OhBXFwcq1evLrcd165d45lnniE8PJyxY8eW+7vh7++Ph4cHR44cqbDd\n7pmq5WJjY9Xu3bvLTfvnP/+pXnvtNaWUUmlpaapdu3bqzTffVCUlJWrXrl0qNDRUPf/88yo3N1dl\nZWWp6OhotW/fPqWUUj/88IMaOHCgOn/+vDIajerTTz9VY8eOrXDdaWlpKiQkRJlMJvO0pUuXqoED\nB6q0tDR1/fp19cILL5hj+bX33ntPvfnmm8poNCqDwaD2799vnhcSEqJSU1PNn//4xz+qDz/8UCml\nVGJiourQoYOaN2+eMhgMasmSJapHjx7q1VdfVdevX1dnzpxRoaGh6tKlSxWu99a6bm3Hhx9+WGVn\nZ6tr166pIUOGqG+//VYppdSxY8dUdHS0SkpKUiaTSa1YsULFxsYqvV5/W93Xr19XMTExasWKFcpk\nMqkTJ06oqKgodfbs2dvWfbd6Y2Nj1dixY9WVK1eUTqdT0dHRatSoUerkyZNKr9erSZMmqU8++UQp\npVRWVpaKjIxU27dvV0optXv3bhUZGalyc3OVUkpNmDBBDRgwQF28eFGVlJSoCRMmqPfee++O3+P0\n6dPVZ599ppRSqqSkRB04cKDCtjx//rwKCwtTu3fvVgaDQX3++edqwIABqrS01LzepUuXVrjsrZYs\nWaImTpxYbtq///1v9fTTT5eb9swzz6h///vfSimlunXrpo4cOWKed/ToURUeHm7RsrcqLi5W4eHh\n6uLFi+ZpY8aMUWvXrlVKKfXWW2+pZ599VuXn56uioiI1bdo09f777yullDpy5Ijq1q2b+Teo0+nU\n+fPnK9z2vLw81b17d/X9998ro9Go1qxZo7p3767y8vLM5WNjY9XZs2eV0WhUBQUFKjw8XKWkpCil\nlMrOzjbvR7/2t7/9Tc2ePfu26QMGDFDffPONOnbsmOrYsaO6cuWKed6v41u+fLl69NFHzZ/btWtX\n7je4e/duFRUVZd7/5syZox577DGllFLffPONeuCBByqM7Va/3sbS0lK1detW82913759qkuXLurE\niRNKqbLf+l//+lel1+vV3r17VVhYmLpw4YJS6ubvKTIyUh09elQZjUb16quvqunTp5db57Rp09Ti\nxYvvGpslan0PwxIajYbnn38eR0dHevbsibOzM8OGDaNRo0b4+/sTERHBiRMnAPjf//7H1KlTadmy\nJVqtlqlTp5KcnExmZuYd61e39HDWrFnD5MmTCQoKwtnZmenTp7N27doKz4/b29uTnZ1NWloadnZ2\ndOvWrcI6K+Lg4MC0adOws7Nj6NChXL16lccffxxnZ2dat25N69atK+0FVGTSpEn4+Pjg7u5ObGws\nJ0+eBGDp0qWMGzeO0NBQNBoNI0eOxNHRscKjli1bttCkSRNGjhyJRqOhffv2DBw4kPXr199W1pJ6\nJ0yYgJeXF35+fkRERNClSxdCQkJwcHBgwIAB5hi///57+vXrR58+fYCbR+6dOnVi27Zt5rpGjx5N\ns2bNcHR0ZMiQIeZlf3Frm9vb25Oeno5Op8PR0ZHw8PAK22zdunX069eP6Oho7OzsePLJJ7lx40a5\nI+ff6vr167i5uZWb1rBhQ3OP9dfz3dzcuH79ukXL3qpBgwbExcWZj95TUlK4cOGC+fTOsmXLmDlz\nJm5ubri4uDB16lRz2WXLlhEfH090dDQAfn5+tGzZssLt2bp1Ky1atGD48OFotVqGDRtGq1at2LJl\ni7nMqFGjCA4ORqvVYmdnh52dHadPn6akpAQfHx+Cg4MrrLugoABXV9dy0/bv309GRgZDhgyhY8eO\nNGvW7Laj83uxZs0a4uPjzfvf9OnTOXz4MBkZGVy9evW2XmJMTAzdu3enc+fO5f5+3LqN9vb2xMTE\n0KRJE+BmL7dXr17s37/fXF6j0fC73/0OBwcHunfvTkxMTLle+8CBA+nUqRNarZbhw4fftl+7urqS\nn5//m7f7VvZVUkst4O3tbf5/gwYN8PHxMX92cnIy/9AyMjJ4++23+dvf/gaUnd/W6XQEBgbedT2X\nL1+mcePG5s9BQUEYDAZycnLw8/MrV/app57in//8J1OmTEGj0fDwww9XeC67Ip6enuZzrA0aNKhw\nG3/ZJkvduryzszPZ2dnAzTZZtWqV+S4fpRQGg8HcHb9VRkYGhw8fJjIy0lzWaDQycuTICsverd5b\nY3Jycrrt863f27p168x/fH6p65c/ZEC579zZ2bnS9pkxYwYffvgh8fHx5usMFd2B8+vvW6PREBgY\niE6nu2PdlnJxcbntD3xhYSENGzascH5hYSEuLi4WLftrw4YN4+9//zvPPfcca9asoX///jg6OpKb\nm0txcXG5bTeZTObkmpWVRUxMjEXb8+u2AmjcuHG5tgoICDD/39nZmQ8++ICFCxcya9YsunXrxowZ\nM2jVqtVtdbu7u1NUVFRu2qpVq+jduzceHh7mbVy5ciWPP/64RfFWFH/Hjh3Nn11cXPD09ESn0+Hp\n6Xnb72Hbtm0YjUY6depU7mDk1m38pdy8efNISUnBZDJx48YN2rVrV27bnJyczJ8bN25cbl1326+L\niopwd3f/Tdv8a/UmYVgqICCAZ5991qKL5xVdFPPz8yt3zSM9PR17e/tyX+ovXFxc+MMf/sAf/vAH\nzp07x8SJE+ncuTM9evTA2dmZ4uJic9ns7OzbdrTqEhAQwLRp03jmmWfuWjYwMJCoqCgWLlxYpfVa\nst6RI0cye/bse162ou/R29vbfG3rwIEDPPHEE0RGRtK0adNy5fz8/Dhz5ky5aZmZmVXyXbVp04ZF\nixaVm3b69GnzNZfWrVuTnJxMaGgocPMaVJs2bSpdtqI7tQB69+7NzJkzSU5OJiEhgVmzZgHQqFEj\nnJ2dWbNmzW0HPHDzO7zTnUS/blc/Pz82btxYblpGRgZ9+/a94zK9evWiV69e6PV6PvjgA9544w2+\n/vrr29bVrl07Vq1aZf5cUlLCunXrMJlM9O7dG7h5I0B+fj6nTp2iXbt2uLi4cOPGDfMyvxwg3cmv\nf9vXr18nLy8Pf39/PD09eeuttzh+/Hi5pAK3ny24dRv1ej0vv/wy7777LnFxcWi1Wp5//vlyy+Tn\n53Pjxg3zgWFmZiZt27atNNZbnT9/nieffNLi8pWpF6ek7nZ651bjx49n/vz5nD17FrjZ1a3odAqA\nl5cXWq2W1NRU87Rhw4axaNEi0tLSKCoq4oMPPmDYsGFotbc39datW83Luri4mLvgcPOi95o1azCZ\nTGzfvp19+/ZZvA134+Pjc0+3Cz7yyCN8++23JCUlATd/KNu2bavwCL1fv35cuHCBVatWYTAYKC0t\n5ejRo+YLwb+13rsZMWIEmzdvZufOnZhMJkpKSti7d69FR/oVfY/r1683L+vu7o5Wq63wOxwyZAhb\nt25lz549GAwGFi5ciJOTE2FhYRbFbTKZ0Ov1GAyGcv8HiIyMRKvVsnjxYvR6vbknFhUVBcDIkSNZ\ntGgROp0OnU7HokWLGD16dKXL9ujRo8I47OzsGDRoEH//+9/Jz8+nV69eAOae79y5c8338+t0Onbu\n3AlAfHw8y5cvZ8+ePSil0Ol05u/61/tZTEwMFy9eJCEhAaPRyNq1azl//jyxsbEVxnTlyhU2b95M\ncXEx9vb25t9IRXr16sXx48fNF+N/+OEH7OzsWLduHatWrWLVqlWsXbuWbt26sXLlSuDmb2zjxo3c\nuHGDixcv8t1335Wr89fxP/jggyxfvpzk5GT0ej3vv/8+Xbp0oXHjxrRs2ZKxY8cyffp0du/eTUlJ\nCSaTiYMHD1Z6t1VpaSmlpaU0atQIrVbLtm3b2LVrV7kySik+/vhjSktL2b9/P1u3bmXIkCF3rPNW\nOp2Oa9eu0aVLF4vK302tTxiW3Pr26zKVfe7fvz9PP/00r7zyChEREYwYMeKOt8U1aNCAadOmMX78\neCIjI0lKSiI+Pp6HHnqICRMmMGDAAJydnXn99dcrXD4lJYXJkyfTtWtXxo8fz2OPPUb37t0B+NOf\n/sTmzZvp3r07CQkJ9O/f/7628Vbx8fGcPXuWyMhIXnjhhbuW79SpE3PmzGH27NlERkYyaNCg2+44\n+YWrqytffvkla9eupU+fPvTp04f33nuv3F01ltZ7L9sUEBDAvHnzmD9/PtHR0cTGxvLll1+aDxYq\nW7ai7/Ho0aM8/PDDhIeH8/zzz/OnP/2JoKCg25Zt2bIl7777LnPmzCE6OpqtW7fy2WefYW9vf9f1\nws3TJp07d2b27NkcOHCALl26mB86dXBwYN68eaxYsYLIyEiWL1/OvHnzzHWPGzeO2NhYRowYwYgR\nI4iNjeWRRx6xaNmKDBs2jJ9++okhQ4aUS46///3vad68OY888ggRERFMmTKFlJQUADp37szcuXOZ\nO3cu3bp1Y9KkSebz9ZMmTWL9+vVERUXx9ttv4+npyWeffcbChQvp0aMHCxcuZP78+eZTRr9uK5PJ\nxL///W/69u1Ljx492LdvH2+++WaFsXt7e9OjRw82bdoE3Lw7asyYMfj7++Pt7W3+99hjj7F69WpM\nJhOTJ0/GwcGBXr16MXPmTIYPH16uzhdffJEZM2YQGRnJ+vXriY6O5uWXX+bFF1+kT58+pKWl8f77\n75vL//nPf2bixIm88847REVFERMTw8cff8yHH35oPhX36210dXXlT3/6Ey+//DKRkZGsXbv2tluD\nfX198fDwoE+fPsyYMYPZs2fTokWLO36Pt1q9ejWjRo3CwcHBovJ3o1H3cvh9j2bNmsXWrVvx9va+\n48WmxMRE3nnnHQwGA40aNWLx4sXWCkcIUYedO3eOP/7xjyxdutTWoVSZvXv3MmPGDLZu3XrPy+r1\nekaOHMlXX32Fl5dXlcRj1YSxf/9+XF1dmTFjRoUJo6CggHHjxvHll1/i7+9Pbm5ulW2YEELUdveT\nMKzBqqekIiIiKr06v3r1agYOHIi/vz+AJAshhKjBbHoNIyUlhWvXrjFx4kTGjBljvhglhBDi5s0L\nNaV3ATa+rdZoNHLixAn+7//+j+vXrzNu3Di6du1K8+bNbRmWEEKICtg0Yfj7+9OoUSOcnJxwcnIi\nIiKC5OTkuyaMXx6mE0IIUX2snjAqu6YeFxfHW2+9hdFoRK/Xk5SUxBNPPHHXOjUaDdnZBXctVx/4\n+rpJW/xM2qKMtEUZaYsyvr5udy9UCasmjFdffZXExETy8vLo168fL774IqWlpWg0GsaOHUtwcDC9\ne/dmxIgRaLVaHnnkkUqHghZCCGE7Vr2t1prkiOEmOXoqI21RRtqijLRFmfvtYdT6J72FEEJUD0kY\nQgghLCIJQwghhEUkYQghhLCIJAwhhBAWkYQhhBB1VFXfBCsJQwgh6qD07EJe+HAHe05kVVmdkjCE\nEKIOWrnjAsUlBlwbVM3Lk0AShhBC1GqLN5zi201nyp1+StUVcOB0Nq0au9OpZdW9NsKmgw8KIYT4\n7c6mX2PLoXQA/Bs5ExveBIBVOy8AMLJ3yyodqFUShhBC1FIb9qYC4Oig5ZtNZ2gR6I5Wo+HQmRyC\ng9zpWIW9C5BTUkIIUStdvnqdg6eyaR7gxgujQjEaFZ+uPMaSLWcBGNmnVZW/BkJ6GEIIUcMopZi3\n8hhXC0ro0MKLTi29aNXYHXu7smP8jfsuoYDBkc3o1MqbB3u2YPXuFHKu3aBNEw86NG9U5XFJwhBC\niBrmXEY+B05lA3A+I581u1Nwc3HgyWHt6RzsQ2FxKTuTMvF2dyIixBeAh3q35Gz6NU5evGqV3gVI\nwhBCiBpn19FMAJ4b2Qk7Ow3HLuSy40gmHy5NYlh0c+y0GvQGEwO6N8NOe7PXodVqeGlMZzKuFNEy\n0N0qcUnCEEIIG7mYVcD5jGv06xpk7hHoS43sPamjkZsT4W190Wo1dG3jS9/Ojfl05TESfroIgLOT\nPX06B5arz8nRzmrJAuSitxBC2IS+1Mi/Vhxl8cbTJJ7UmacfOpNDcYmRnp0C0GrLTis1D3Djz5O7\nE9725imouG5BODtV7zG/9DCEEMIGNuxNJefaDQCWbD5Ll2AfnJ3szaejenYKuG0Zlwb2PD+qE5cu\nF9LEt2G1xgvSwxBCiGqXm3+DhD0XcXdxYGD3puQV6lm9K4WrBSUcT8klOMidQG/XCpfVaDQ083cr\n1/uoLlZNGLNmzaJnz54MHz680nJJSUl06NCBjRs3WjMcIYSoEZZtO4e+1MSYmGBG922Fj0cDfth/\nieXbzqEU9AoNvHslNmDVhDF69GgWLlxYaRmTycR7771Hnz59rBmKEELUCGfTrrHnuI7mAW706hyI\no4Md4/u3wWhS7DqWhb2dlsgQP1uHWSGrJoyIiAjc3Su/Yr948WIGDRqEl1fVPsIuhBA1jUkpvtl0\nGoBH+7dB+/OdUWGtfegc7A1AeFsfXKpwhNmqZNNrGDqdjh9//JHx48fbMgwhhKgWe45ncSGzgMj2\nfrRp4mmertFomDCgLV2CvRkW3cJ2Ad6FTe+Smjt3Lq+99pr5/uOqfjuUEELUFCWlRr7bdh4Hey0P\n92t923wfT2defriLDSKznE0TxrFjx3jllVdQSnH16lW2b9+Ovb09cXFxd13W19etGiKsHaQtykhb\nlJG2KFMT2uLbH05xtaCEh+PaENLa19bh/CZWTxiV9Ro2bdpk/v/MmTOJjY21KFkAZGcX3HdsdYGv\nr5u0xc+kLcpIW5SpCW1xtaCEpZtO4+7iQL/OgTaL534Tp1UTxquvvkpiYiJ5eXn069ePF198kdLS\nUjQaDWPHjrXmqoUQosZYseM8+lIT4+PaVPvT2VXJqpG/9957Fpd95513rBiJEEJUv5JSI9sOpbMr\nKZMmvq706dzY1iHdl9qb6oQQwoYMRhMnUq4S0swTRwe7cvOKSwxsOZTOhr2pFFwvxdFBy4SB7Wzy\ndHZVkoQhhBC/wfJt51m/N5UALxeeHNae4CAPlFLsP5XNf388zbVCPc5OdjzYszkDIpri5uJo65Dv\nmyQMIYS4R5lXivhh/yWcnezR5V5n7lcH6N+tKZm5RRw7n4u9nZbhPVswKLJpjX0I77eQhCGEEPfo\n201nMZoUU4aG4ObiyJcJJ/lh/yUAOrb0YsLAtvg3crFxlFVPEoYQQtyDI2dzOHr+Cu2bNyK8rS8a\njYb/NyWSH/ZfIsDLhW7tfK3yetSaQBKGEEJYyGA08e2mM2g1Gsb3b2NODE6OdjzYs4Vtg6sG8j4M\nIYSwgMFoYvm28+iuFhPbNcgmLzCyNelhCCFEJZRSHDqTw9Kt59DlXsfD1ZGH+rS0dVg2IQlDCCF+\n5WpBCeczrnE+I58TF69yMasArUZDv65BPNS7JQ2d686dT/dCEoYQQvys1GDi281n2HIw3TxNA3Rt\n40N8v+A7vja1vpCEIYSol86k5eHsaE9jX1e0Gg05ecXMW3mMlKwCGvu4Et3Rn1aB7rQIdK/V4z9V\nJWkFIUS9s+NIBv9elwxAQ2cH2jTx4PSlPIpuGOjVKYAJg9rh9KvhPoQkDCFEPXMq9Sr/2XAK1wb2\ndGntw6nUqxw6k4O9nZbJQ0Lo0zmwzj5Hcb8kYQgh6g3d1et8svwoAM+PCiWkeSOUUuRcu4GTgx3u\nrrV/vCdrkoQhhKgXim6U8tHSJIpuGJg8JISQ5o2Am+/T9vV0tnF0tYM8uCeEqPOMJhOfrTxGVu51\nBkc2o2+X2v1eCluRhCGEqPOWbjnH8ZSrdA72Jr5fsK3DqbUkYQgh6rQf915k475LBHq78MyIjrX+\nJUa2JAlDCFGrXS0oYfPBNIpulN4270xaHv9aloRrA3teiu8sz1PcJ6u23qxZs9i6dSve3t6sXr36\ntvmrV6/m888/R6PR4OLiwl/+8hfatWtnzZCEEHVIid7I+0sOk55dxIrt5xkW3YK4bkHorhazds9F\nEk/o0Gg0TBvZqU6+n6K6aZRSylqV79+/H1dXV2bMmFFhwjh8+DDBwcG4ubmxfft2PvnkE5YsWWJR\n3dnZBVUdbq3k6+smbfEzaYsy9aEtlFIsWH2CxBM62jdvxMWsAq6XGGjo7EBh8c3eRhNfV6aM6EQL\n3/o9pMcvfH3d7mt5q/YwIiIiSE9Pv+P8sLCwcv/X6XTWDEcIUYf8eCCNxBM6goPceeWRLpSUGkn4\n6SJbDqUTHOTOsOgWdAn2xs/Pvc4nz+pSY07oLV26lL59+9o6DCFELXD6Uh5LNp/F3cWB50aGYm+n\nxd5OyyOxrXkktrWtw6uzakTC2LNnD8uXL+e///2vxcvcb9eqLpG2KCNtUaautsW5tDzmrTyGAv44\nOZK2rXzuukxdbYvqZvOEkZyczJ///Ge++OILPDw8LF5Oupg31Ydz1ZaStihTV9vibPo1PlhyhBsl\nBh4fEkKAu9Ndt7OutsVvUaOvYcDNC1N3kpGRwUsvvcTf//53mjVrZu1QhBC1WPLFq3y0LIlSg4mn\nhncgumOArUOqd6yaMF599VUSExPJy8ujX79+vPjii5SWlqLRaBg7dizz5s3j2rVr/L//9/9QSmFv\nb8+yZcusGZIQopZJyy7kx/1p7D6WiVLw7MhOdGvna+uw6iWr3lZrTdLFvEm622WkLcrU9rYwmkwk\nnb3CjwfSOHnxKgA+Hg2YNLgdnVp631Ndtb0tqlKNPyUlhBCWulpQwq6jmWw9nE5ufgkA7Zs3on+3\nJnRp7SPDetiYJAwhhM2YTIp9yZc5kZLL6Ut56K4WA+DkYEds1yBiuwbRxK+hjaMUv5CEIYSwCX2p\nkc9Xn+DA6WwAGjjaEdrKm87B3vTsFCDjPtVA8o0IIapdwXU9//zuKGfTrxHSzJNHHmhNMz83OeVU\nw0nCEEJUq8t5xXyw5Ai63Ov06ODPE0Pb42AvA2fXBpIwhBDVJj2niH98e4hrhXqG9mjO6JhWaDXS\nq6gtJGEIIapFqq6Af3x7mMLiUsY90JqBkfKwbm0jCUMIYTUmk6KguJSLWQUs+P44xSUGHh/cjpiw\nIFuHJn4DSRhCiCp3IiWX/2w4RU7eDUw/Pxus1WhkSI9aThKGEKJKHTiVzfzvjwHQqrE7Hg0d8XR1\nIrydL+2bN7JxdOJ+SMIQQlRqx5EMDp3JoW9YYzoHe1d6kXrHkQwWrU/G0d6OF8eE0qGFVzVGKqxN\nEoYQ4o5+EQoNAAAgAElEQVRKDSaWbj1HYXEph8/mEOjtQv9uTWjo4oi+1Ii+1Ej+9VKuFtzgyrUb\nHE+5imsDe155JIxWjd1tHb6oYpIwhBB3dOhMNoXFpUS298PeTkviCR2LN56+Y3n/Rs68MDqUIF8Z\nzqMukoQhhLijHUmZAIzo1ZLGPq6M7tuKg6ez0Wg0ONprcXDQ4ubiiJebE54NnWQ4jzpOvl0hRIVy\n8oo5cSGX1kEeNPZxBcDLvQH9I5raODJhK/I8vhCiQjuPZqKAPl0CbR2KqCEkYQghbmMyKXYezaSB\nox3dQ/xsHY6oISRhCCFuczwll9z8EiLb+9PAUc5ci5skYQghbrP9SAYAfbs0tnEkoiaxasKYNWsW\nPXv2ZPjw4Xcs89ZbbzFw4EAeeughTp48ac1whBB3kXmliM9WHePgqWyCfF1pGXh/74AWdYtV+5qj\nR49m4sSJzJgxo8L527ZtIzU1lY0bN3LkyBHefPNNlixZYs2QhKi3SvRGPlt1DEcHO8Ja+xAa7I1r\nA3uuXLtBSlYBh87ksOdEFkpBc383Jg8JQSNDj4tbWDVhREREkJ6efsf5mzZtYuTIkQB06dKFgoIC\ncnJy8PHxsWZYQtRLq3encOTcFQD2JV9Gq9Hg7GRH0Q2DuUwTX1ce6t2K8LY+kizEbWx6Nevy5csE\nBJSNXOnv749Op5OEIUQVy7xSxIa9qXi7N+D50Z04fiGXw2dzKLheSocWXrQIcKNloDttm3nKC43E\nHdk0Yaifhz2+laVHNb6+cm71F9IWZaQtyvzSFkopPvouCaNJMW1MZ7p3CqR7aP16H4XsF1XDpgnD\n39+frKws8+esrCz8/Cy75zs7u8BaYdUqvr5u0hY/k7Yoc2tb7D2p48iZHDoHe9PKz7XetZHsF2Xu\nN3Fa/bbainoRv4iLi2PlypUAHD58GHd3dzkdJUQVKiwu5dtNZ7C30/Jo/zZyXULcF6v2MF599VUS\nExPJy8ujX79+vPjii5SWlqLRaBg7diwxMTFs27aNAQMG4OzszDvvvGPNcISos0r0Rq6XGCg1mjAY\nTBy9mMeWfakcu3AFg1ExolcL/Bq52DpMUctpVGVdgBpMupg3SXe7TH1rC6UUpy/lseVQOgdOZWM0\n3f5TbuLrSlQHfwZFNsPern4+p1vf9ovK3O8pKXnmX4ha6PSlPBZvOEV6ThEAgd4uNPN3w95Og4Od\nlqaBHoQ0cSfQ29XGkYq6RBKGELXMoTPZfLryOCaTIrK9H7Fdg2jb1LPc9Qk5qhbWIAlDiFpk19FM\n/r02GXt7DS+N6UynVt62DknUI5IwhKihDEYTmw+kkX3tBiaT4nqJgcQTOlwb2PO7h7sQHORh6xBF\nPSMJQ4ga6rtt59iw91K5aY3cnHjlkS40kXdmCxuQhCFEDXToTDYb9l7C38uFaSM6Ym+vxU6rwdvd\nCQd7O1uHJ+qpuyaMS5cusWzZMhITE8nKysLJyYmQkBAGDRrEwIEDsbeXnCNEVcrJK2bhmpM42Gt5\nbmQnmvpJb0LUDJX+tf/zn//M8ePHGTx4ML///e/x8fGhpKSEc+fOsXPnThYsWMBf/vIXwsLCqite\nIeo0g9HEp6uOc73EwOQhIZIsRI1SacKIi4tj9uzZt01v164dQ4cOJS8vj0uXLlWwpBDiXhmMJhYm\nnORCZj7RHQPo0znQ1iEJUU6lCSMmJqbShT09PfH09KzSgISoj0oNRj5deZzDZ3MIDnJn4qC2Mu6T\nqHEsGivgr3/9KwUFBRgMBh599FHCwsJYtWqVtWMTol64oTfw4dIkDp/NoUOLRvx+bFcaOMq1QVHz\nWJQwdu/ejZubGzt37sTf358NGzbw5ZdfWjs2Ieq8VF0Bf/3qICcvXqVrGx9eju+Mk6PcBSVqpns6\njNm3bx8DBgzA399fustC3IdSg5Hvd6Wwbk8qJqWICWvMhIFtsdPWzwECRe1gUcLw9vbm9ddfZ9eu\nXUydOhWDwYDRaLR2bELUOfpSI3tO6Fi35yK6q8V4uzfg8cHtZIgPUStYlDDee+89vv/+e+Lj4/Hw\n8CAtLY0nnnjC2rEJUWcUFpeycV8qWw9lUFhcilajoX9EE0b3bSXXK0StYdGe6uXlxeTJk82fmzRp\nQpMmTawVkxA12rXCErRaDW4ujhaV1129zgdLjnD5ajGuDewZFt2c2K5BeLk3sHKkQlStShPGc889\nx7Rp0+jcufNt8woLC/nuu+9o0KABY8eOtVqAQtQkJXojbyzcyw29gZ6dAhkU2bTSd06cS7/GR8uS\nKCwuZUiPZozo1RInB7moLWqnShPGSy+9xHvvvUdKSgqdO3fG29ubkpISzp8/T3p6OuPGjWP8+PHV\nFasQNrc3WUdhcSmODlq2H8lgx5EMQoO96djSi3ZNPWni2xC9wYgut5jzGdf43+azlBpNTBrcjn5h\nQbYOX4j7UmnCCAkJ4fPPPyczM5O9e/ei0+lwcnJi8ODBdOvWDUdHy7rkQtQVO45kogFmPxlFalYB\n6xIvknTuCknnrgDgaK9FbzCZyzs6aHlpTGe6tPaxUcRCVB2LrmEEBgby0EMP/aYVbN++nblz56KU\nYsyYMUydOrXc/MzMTP7whz9QUFCAyWRi+vTpd33CXAhbSM8p4mz6NTq29MLP0xk/T2e6tfMl+9oN\nzlzK49SlPFIyC3B3dcDfy4WARi6EBnsT4OVi69CFqBIWJYwrV67wzjvvkJmZyddff01ycjKHDh26\n6+kok8nEnDlzWLRoEX5+fsTHxxMXF0dwcLC5zKeffsrQoUMZN24c586d4+mnn2bz5s33t1VCWMGO\nIxkA9O3S2DxNo9GYk0evUBn7SdRtFj0l9Prrr9OtWzfy8/MBaNWqFf/973/vulxSUhLNmzcnKCgI\nBwcHhg0bxqZNm8qV0Wg0FBYWApCfn4+/v/+9boMQVldqMLH7WBYNnR3o2kZOL4n6yaKEodPpGD9+\nPHZ2N+/ucHR0RGvBE6k6nY7AwLKjLn9/fy5fvlyuzAsvvMCqVauIiYlh2rRpvPHGG/cSvxDV4vDZ\nHAqLS+kVGoC9nTyNLeoni05J/folSfn5+Sil7rqcJWUSEhIYM2YMkydP5vDhw7z22mskJCTcdTlf\nX7e7lqkvpC3KWKst9iw/CsBD/drUmvauLXFWB2mLqmFRwhg4cCB//vOfKSoqYvny5fz3v/9lzJgx\nd10uICCAjIwM82edToefn1+5MsuWLWPhwoUAhIWFUVJSQm5uLl5eXpXWnZ1dYEnodZ6vr5u0xc+s\n1RapugIOn86mdRMPGmhrx74n+0UZaYsy95s4LepbP/XUU0RERNCxY0e2bdvGxIkTefzxx++6XGho\nKKmpqaSnp6PX60lISCAuLq5cmcaNG7N7924Azp07h16vv2uyEKI6KKXYcjCNtxcfQAEDIpraOiQh\nbEqjLDlvdB+2b9/O22+/jVKK+Ph4pk6dyscff0xoaCixsbGcO3eO119/nevXr6PVapkxYwbR0dF3\nrVeOGG6So6cyVdkW+UV6Fq1L5vDZHFwb2DN5SHu6tfOtkrqrg+wXZaQtytxvD8OihHHlyhW++uor\nUlNTMRgM5ukfffTRfa38fsgOcJP8GMpURVvkX9ezYW8qmw+kU1JqJKSZJ08P70gjN6cqirJ6yH5R\nRtqizP0mDIuuYTz33HN06NCB6Oho851SQtQl6dmF7EjKZNvhDEpKjXg0dCS+XzCxXYPQauXdL0KA\nhQmjuLiYN99809qxCFGtSkqNbD+cwe5jWVzU3TwC9WjoyJiYVsSENcbBXg6OhLiVRQmjS5cunDp1\ninbt2lk7HiGqRfLFqyxal8zlvGLstBq6BHvTMzSQsNbekiiEuAOLEsa4ceOYMGECAQEBODmVnctd\ntmyZ1QIToqqUGkwUXNejN5go0RvZejidbYcz0GhgUGRThkQ1x91VBtIU4m4sShivvfYa06ZNo0OH\nDnINQ9Qql/OK+etXB8gr1Jeb3sTXlSeGtqdloLuNIhOi9rEoYTg5OfHkk09aOxYhqpS+1Mi85UfJ\nK9TTtY0Pbi4OONrb4e/lQkxYYxniQ4h7ZFHC6NOnD9u3b6dv377WjkeIKqGUYvHGU6ReLqRvl8ZM\nHhJi65CEqPUsShhLlixhwYIFuLq64ujoiFIKjUbDTz/9ZO34hPhNth/JYNfRLJoHuPHYgDa2DkeI\nOsGihPHdd99ZOw4hqkROXjE/ndCxetcFXBvY8/yoTnLXkxBVxKKEERQk7yIWNVvSuRx+WHKE4+d/\nflWqg5ZnHuqIj4ezjSMTou6oNGG89tprvPvuu4wZMwaN5vanXeW2WmFrJqX4fucFvt+VAkBIM0+i\nOwbQrZ0fLg0sOh4SQlio0l/ULy87+sMf/lAtwQhxL4pLDCxMOMnB09n4eDTgjSd74OYodz4JYS2V\nJoxfXskaGRlZLcEIYYkSvZHEkzrWJ6aSlXudkGaePDuyE62CPGSQOSGsSPrsotbIKywh4aeL7D6W\nSXGJEY0G+ndrwiMPtJZnKoSoBpUmjNOnT1f4bgq5rVZUtwOnsvm/9ckUFpfi2dCRARFN6dulMV7u\nDWwdmhD1RqUJo0WLFixYsKC6YhHiNsUlBr758Qw7j2biYK/lsQFt5SltIWyk0oTh6Ogot9QKmzAY\nTWw7nMGa3SlcK9LT3N+NqSM6EOjtauvQhKi3Kk0YDg4O1RWHEMDN22R/OpbFqp0XyLl2AycHOx7q\n3ZJh0c2lVyGEjVWaMJYsWVJdcQhBenYh/7fhFGfTrmFvp2Vg96YM7SFDjwtRU1j9Lqnt27czd+5c\nlFKMGTOGqVOn3lZm7dq1/Otf/0Kr1dKuXTv+8Y9/WDssUYPkF+n5Yf8l1iemYjQpurXzZdwDbfD2\nkAvaQtQkVk0YJpOJOXPmsGjRIvz8/IiPjycuLo7g4GBzmYsXL/LFF1/wv//9j4YNG5Kbm2vNkISN\nGIwmVu9KQW8w4uxkj4uTPVfyb3Ai5SqXLhcC4O3uxGMD2hHWxsfG0QohKmLVhJGUlETz5s3NF86H\nDRvGpk2byiWMJUuW8Oijj9KwYUMAvLy8rBmSsJENe1NZvTvltun2dlo6tGhEp5be9OvamAaO8miQ\nEDWVVX+dOp2OwMBA82d/f3+OHj1arkxKSgoA48ePRynF888/T58+fawZlqhmV67dYPWuFNxdHHhh\ndGdKDEau3zDg2sCe1kEeODrIaLJC1AZWTRhKqbuWMRqNpKam8vXXX5ORkcFjjz1GQkKCucdxJ76+\nblUVZq1X09vi84ST6A0mnn+4C9Fdm1h1XTW9LaqTtEUZaYuqYdWEERAQQEZGhvmzTqfDz8+vXBl/\nf3+6du2KVqulSZMmtGzZkpSUFDp16lRp3TJm0E2+vm41ui2Onr/CT0czadPEg07NPK0aa01vi+ok\nbVFG2qLM/SZOq97YHhoaSmpqKunp6ej1ehISEoiLiytXpn///uzZsweA3NxcLl68SNOmTa0Zlqgm\npQYTX/9wGq1Gw4SB7SocIl8IUXtYtYdhZ2fHG2+8wZQpU1BKER8fT3BwMB9//DGhoaHExsbSp08f\ndu3axbBhw7Czs2PGjBl4eHhYMyxhRUopUnWFHDydzcHT2Vy+Wkz/iCY09av8FKMQoubTKEsuNNRA\n0sW8qSZ1t3OuFfOvFce4mHUzHns7LWGtvXliaHucnax/91NNagtbk7YoI21R5n5PSck9jKJKnEnL\n41/Lj5J/vZSubXyI7hhAp1ZecpusEHWI/JrFfTEpxc6kTBZvOIVS8NiAtjwQHiTXK4SogyRhiHtm\nMilOpl7l4Kmb1ymuFelxcbLn2VGd6NhCHrwUoq6ShCHuSeaVIr5Yc5ILmTdf39vQ2YHenQMZFt0c\n/0YuNo5OCGFNkjCERUxKselAGsu2nqPUYKJ7iB+xXYNo09QDO60MOy5EfSAJQ9xVzrVivkw4SXJq\nHg2dHXj6wQ5EhPjdfUEhRJ0iCUPckVKKHUmZfLvpDDf0RsJa+/D44HZ4NHSydWhCCBuQhFHPGYwm\ntBoNWm35u5oycopYsuUsSeeu4Oxkx5PD2tOzU4Dc/SREPSYJox7LL9Lz1n/2U1JqJCLEj6j2/rg2\nsGf17hT2nbyMAjq2aMQTQ9vj5S4vMxKivpOEUU+ZlOLzNSduvjfb0Y4tB9PZcjDdPL+ZX0OG92pJ\neFsf6VUIIQBJGPXW2p8ucvxCLqGtvHlxTCinUvNIPKEjr6iE2K5BhLWWRCGEKE8SRj10+lIeK3ac\np5GbE0892B57Oy0dW3rRsaU8dCeEuDNJGPWIUoqz6df4bNUxNGh4ZkRH3FwcbR2WEKKWkIRRD5Qa\njOw/lc0P+y6R8vNIsg/3C6ZtU08bRyaEqE0kYdRRpy/lcfB0NmfTr3ExqwCjSaEBurbxYWD3prRr\n1sjWIQohahlJGHXQ0fNX+HDJERRgp9XQzL8hIc0aEdM1CD9PZ1uHJ4SopSRh1DGX84pZ8P1x7Oy0\nPPtQRzq09MLJwc7WYQkh6gBJGHVIid7IJ98dpeiGgSeGhNC1ra+tQxJC1CEyzGgdoZRi0fpk0rIL\n6dc1iD5dGts6JCFEHWP1hLF9+3YGDx7MoEGDWLBgwR3LrV+/npCQEI4fP27tkOqU4hIDWw+n8+aX\n+0g8oSO4sTuP9m9j67CEEHWQVU9JmUwm5syZw6JFi/Dz8yM+Pp64uDiCg4PLlSsqKuKrr74iLCzM\nmuHUKSaTImHPRTbsTeX6DQN2Wg0RIX481r8N9nbScRRCVD2rJoykpCSaN29OUFAQAMOGDWPTpk23\nJYyPPvqIp59+mi+++MKa4dQZeYUlLPj+OMmpeXi6OdG/WxNiwoJo5CbDjgshrMeqh6I6nY7AwEDz\nZ39/fy5fvlyuzMmTJ8nKyiImJsaaodQZxy5c4S9f7iU5NY+w1j7Mm/EAI/u0kmQhhLA6q/YwlFJ3\nnT937lz+9re/WbzML3x93e4rttomJTOfxWtPsvdEFvZ2Gp5+qBPD+7RCo9HI8B63qG/7RWWkLcpI\nW1QNqyaMgIAAMjIyzJ91Oh1+fmWv9iwqKuLs2bNMnDgRpRQ5OTk899xzfPrpp3Ts2LHSurOzC6wW\nd02Sqitgw95U9hzXoYC2TTwY178NLQLcyckpxNfXrd60xd1IW5SRtigjbVHmfhOnVRNGaGgoqamp\npKen4+vrS0JCAu+//755fsOGDfnpp5/MnydOnMjMmTPp0KGDNcOq8fSlRvaevMzWw+mcz8gHbr6f\nYnRMMKGtvGTYcSGETVg1YdjZ2fHGG28wZcoUlFLEx8cTHBzMxx9/TGhoKLGxseXKazQai09J1UUm\nk2LXsUxWbD9PXqEeDdA52JuYsMZ0ae2DVhKFEMKGNKqW/oWua13Mkym5fLv5LJcuF+JoryUuogmx\nXYPw8ah87CfpbpeRtigjbVFG2qJMjT4lJe5OX2pkyZazbD6Yjgbo1SmAUX1byTu0hRA1jiQMG0q7\nXMj874+TnlNEkI8rTz7YnhYB7rYOSwghKiQJwwZMSrHpQBpLt5zDYDTxQHgQj8S2xlFGlRVC1GCS\nMKpZbv4Nvlx7khMpV2no7MATQzvStY2MKiuEqPkkYVQTfamR3ceyWLb1HNdLDHQO9uaJISF4NJQn\ntIUQtYMkDCvLyStm86F0dhzJoOiGAUcHLZMGtyOmS2N5nkIIUatIwrASk0mxLvEiK3dcwGhSuLk4\n8GDP5vQLC5I7oIQQtZIkDCu4cu0GX6w5walLeXg0dCQ+JpjI9n442MtFbSFE7SUJowoppdhzQsdX\nG09TXGIgvK0vjw9uJ4MDCiHqBEkYVSS/SM9/Npzi4OlsnBzsmDwkhD6dA+U6hRCizpCEcZ+UUuxL\nvsxXG09TWFxK26aeTBnWHj/Pyof0EEKI2kYSxn1Iu1zIf388TXJqHg72WsbFtaF/RBMZJFAIUSdJ\nwvgNCotLWbnjPFsOpaMUdAn2ZlxcG/y9XGwdmhBCWI0kjHtgMim2HU5n+fbzFN0wEODlwvj+bQht\n5W3r0IQQwuokYVjoVOpV/vvjGS5dLqSBox2PxLamf0QT7O2s+lp0IYSoMSRh3EVu/g2WbDnL3pOX\nAegdGsiYmFYypIcQot6RhHEHJXojG/elkrDnIvpSEy0D3Xh0QFuCG3vYOjQhhLAJSRi/YjCa2JmU\nyaqdF7hWpMfNxYHH+relV+dAuftJCFGvWT1hbN++nblz56KUYsyYMUydOrXc/EWLFrF06VLs7e3x\n8vJi7ty5BAYGWjus2xiMJhJP6Fjz00V0uddxdNDyYM8WDI5shksDyatCCGHVv4Qmk4k5c+awaNEi\n/Pz8iI+PJy4ujuDgYHOZDh06sHz5cpycnPjmm2/4+9//zgcffGDNsMopNZjYmZTBusRUcq7dwE6r\noV/XIEb0aoGnXKcQQggzqyaMpKQkmjdvTlBQEADDhg1j06ZN5RJGZGSk+f9hYWGsXr3amiGVk3ml\niM9WHefS5UIc7LXEhTdhcFQzvD1kNFkhhPg1qyYMnU5X7vSSv78/R48evWP5ZcuW0bdvX2uGZLbr\naCZfbTxNSamRPp0DGd1X7nwSQojKWDVhKKUsLrtq1SqOHz/O4sWLqzyOkxev8p/1yZQaTTg72qPR\naEjLLsTZyY5pD3Uksr1/la9TCCHqGqsmjICAADIyMsyfdTodfn5+t5XbvXs3CxYs4KuvvsLBwcGi\nun193Swqd+R0Nh8tS8JkMuHl4Uz+dT3Xbxjo2Mqb343rSoC3q2UbU4NZ2hb1gbRFGWmLMtIWVcOq\nCSM0NJTU1FTS09Px9fUlISGB999/v1yZEydO8Oabb7Jw4UIaNWpkcd3Z2QV3LXPswhX++d1RlFK8\nMDqUzsE+wM2ej0ajAZPJonpqMl9ft1q/DVVF2qKMtEUZaYsy95s4rZow7OzseOONN5gyZQpKKeLj\n4wkODubjjz8mNDSU2NhY3n33XYqLi3n55ZdRStG4cWPmzZt33+s+fCaHeSuPAfDimM7lxnuSd1QI\nIcS906h7udBQg2RnF1BSamRnUiYdWjQi8OdTS0op1iemsmzrOezttbw4JpROLevu4IBy9FRG2qKM\ntEUZaYsyNbqHYW3fbTvHj/vT0AARIX4MimzGjwcusee4jkZuTrwwOpSWge62DlMIIeqEWpswUnUF\nbDqQho9HA1ydHdiXfJl9yTcHCAxu7M7zo0PlwTshhKhCtTJhmEyKxRtPoRRMGtyOji28OHYhlw17\nU/Fv5MK4uDY42Muw40IIUZVqZcL4cV8q59LziQjxM1+fCG3lLS8yEkIIK6qVh+GL1pzAydGO8XFt\nbB2KEELUG7UyYRRc1zOyd0sauck1CiGEqC61MmE82Lslcd2a2DoMIYSoV2plwnhmVGd5l7YQQlQz\n+asrhBDCIpIwhBBCWEQShhBCCItIwhBCCGERSRhCCCEsIglDCCGERSRhCCGEsIgkDCGEEBaRhCGE\nEMIikjCEEEJYRBKGEEIIi1g9YWzfvp3BgwczaNAgFixYcNt8vV7PK6+8wsCBAxk7diwZGRnWDkkI\nIcRvYNWEYTKZmDNnDgsXLmTNmjUkJCRw7ty5cmWWLVuGh4cHGzdu5PHHH+fdd9+1ZkhCCCF+I6sm\njKSkJJo3b05QUBAODg4MGzaMTZs2lSuzadMmRo0aBcCgQYP46aefrBmSEEKI38iqCUOn0xEYGGj+\n7O/vz+XLl8uVuXz5MgEBAQDY2dnh7u5OXl6eNcMSQgjxG1g1YSil7rmMUgqNRmOtkIQQQvxG9tas\nPCAgoNxFbJ1Oh5+f321lsrKy8Pf3x2g0UlhYiIeHx13r9vV1q/J4aytpizLSFmWkLcpIW1QNq/Yw\nQkNDSU1NJT09Hb1eT0JCAnFxceXKxMbGsmLFCgDWr19Pjx49rBmSEEKI30ijLDlvdB+2b9/O22+/\njVKK+Ph4pk6dyscff0xoaCixsbHo9Xpee+01Tp48iaenJ++//z5Nmsj7uoUQoqaxesIQQghRN8iT\n3kIIISwiCUMIIYRFJGEIIYSwSK1LGHcbm6ouy8rKYtKkSQwdOpThw4fzn//8B4Br164xZcoUBg0a\nxJNPPklBQYGNI60eJpOJUaNGMW3aNADS0tJ45JFHGDRoENOnT8dgMNg4wupTUFDASy+9xJAhQxg2\nbBhHjhypl/vFokWLePDBBxk+fDivvvoqer2+Xu0Xs2bNomfPngwfPtw8rbL94K233mLgwIE89NBD\nnDx58q7116qEYcnYVHWZnZ0dM2fOZO3atXz77bd8/fXXnDt3jgULFhAdHc2GDRuIiopi/vz5tg61\nWvznP/8hODjY/Pkf//gHTzzxBBs2bMDNzY1ly5bZMLrq9fbbbxMTE8O6detYtWoVrVq1qnf7hU6n\nY/HixSxfvpzVq1djNBpJSEioV/vF6NGjWbhwYblpd9oPtm3bRmpqKhs3bmT27Nm8+eabd62/ViUM\nS8amqst8fX1p3749AK6urgQHB6PT6cqNxzVq1Ch+/PFHW4ZZLbKysti2bRsPP/ywedqePXsYNGgQ\ncLMdfvjhB1uFV60KCwvZv38/Y8aMAcDe3h43N7d6uV+YTCaKi4sxGAzcuHEDPz8/EhMT681+ERER\ngbu7e7lpv94PfvmbuWnTJkaOHAlAly5dKCgoICcnp9L6a1XCsGRsqvoiLS2N5ORkunTpwpUrV/Dx\n8QFuJpWrV6/aODrrmzt3LjNmzDAPI3P16lU8PDzQam/u0gEBAfVm30hLS6NRo0bMnDmTUaNG8cYb\nb1BcXFzv9gt/f3+eeOIJ+vXrR9++fXFzc6NDhw64u7vXy/3iF7m5ueX2g9zcXKD8OH5ws/10Ol2l\nddWqhCGPjNxUVFTESy+9xKxZs3B1da13Y29t3boVHx8f2rdvb94nlFK37R/1pV0MBgMnTpzg0Ucf\nZcWKFTg7O7NgwYJ6s/2/yM/PZ9OmTWzZsoUdO3ZQXFzM9u3bbytX39rlTir6e3q3trHqWFJVzZKx\nqcpAbb4AAAdwSURBVOo6g8HASy+9xEMPPUT//v0B8Pb2JicnBx8fH7Kzs/Hy8rJxlNZ18OBBNm/e\nzLZt2ygpKaGoqIi5c+dSUFCAyWRCq9WSlZVVb/aNgIAAAgICCA0NBWDgwIF8/vnn9W6/2L17N02b\nNsXT0xOA/v37c+jQIfLz8+vlfvGLO+0H/v7+ZGVlmctZ0ja1qodhydhUdd2sWbNo3bo1jz/+uHna\nAw88wPLlywFYsWJFnW+T6dOns3XrVjZt2sT7779PVFQU//jHP4iKimL9+vVA/WiHX/j4+BAYGMiF\nCxeAm9dyWrduXe/2i8aNG3PkyBFKSkpQSrFnzx7atGlT7/aLX/cc7rQfxMXFsXLlSgAOHz6Mu7u7\n+dTVndS6oUEqGpuqvjhw4AATJkygbdu2aDQaNBoNr7zyCp07d+Z3v/sdmZmZNG7cmI8++ui2C191\n1d69e/nyyy/57LPPuHTpEtOnTyc/P5/27dvz7rvv4uDgYOsQq0VycjJ/+tOfMBgMNG3alHfeeQej\n0Vjv9otPPvmEhIQE7O3t6dChA2+99RZZWVn1Zr949dVXSUxMJC8vDx8fH1588UX69+/Pyy+/XOF+\nMHv2bHbs2IGzszPvvPMOHTt2rLT+WpcwhBBC2EatOiUlhBDCdiRhCCGEsIgkDCGEEBaRhCGEEMIi\nkjCEEEJYRBKGEEIIi0jCEDXaAw88wNmzZ6tlXZ988km5oa9nzpzJ119/fd/1zpw5k+HDhzN9+vT7\nrqsyycnJrFu3zqrrEPWbJAwhfvbJJ59QWlpapXXm5OTw/9u7v5AmuziA49/ln7S8KOvWoghaI8KL\nihkJWon0R/Y8S2NYOL1IEFqE3gjRRZZEBcPyJqE/lDSIyBp2UV4IEVgGXeyiDKMVFnSRltTmaPr4\ney/Eh3KL9vYG7+vb73O182znnN/DYL+dHfY7fX199Pb2EgwGf+vYcz1//vyXE8b09PRvjkb9H2nC\nUPPS69evOXjwIDU1NRiGYZc+AHA6nXR1dVFdXU1FRQV9fX32c/fv32fnzp14vV66urpwOp0kEgna\n2tpwOBz4fD5M0yQWiwEwPDyM3++nsrKS1tbWH8Zz584dqqqq8Hg8BAIBPn78SDwex+/38/XrV0zT\n5OrVq9/1CYfDHDp0yG5blkVpaaldL+3ixYvs27cPr9dLU1MTY2NjAExOTnL69GmqqqowDINAIMD4\n+DidnZ08fvwY0zRpb28HZiojmKaJx+OhoaGBt2/fAjP/kDcMg5MnT+Lz+Xj48OE/eTvUn0KU+g8r\nLy+Xly9ffndtampKTNOUaDQqIiKxWEwqKyvt9tq1a+X69esiIvL06VMpLS0VEZHR0VHZvHmzjIyM\niIjIlStXxOl0ysTEhN0vkUjY87S2tkptba0kk0lJJpOye/duGRgYSIlxeHhYtm7dKqOjoyIi0tHR\nIUeOHBERkXfv3onb7U57b4lEQtxut3z69ElERPr7+8Xv94uISDgclmPHjtmvDYVC0tLSIiIinZ2d\nEggEZGpqSkTE7t/T0yOHDx+2+4yNjYnb7ZZXr16JiMjNmzelpqZGREQGBwfF5XJJJBJJG5tS6egK\nQ807b968IRqN0tzcjGEY7N+/n8nJye9OX9y1axcAxcXFfPjwgWQySSQSYf369RQVFQFQXV2dMrbM\nqZSzY8cOcnJyyMnJweVyMTIyktJncHCQsrIyli1bBoDP52NgYOCn95GXl8f27du5e/cuMFMYbvYQ\npP7+fh49eoRhGBiGQSgU4v3798BMefe6ujqysrIA7Oqsc0UiEdatW8fq1asB2Lt3L0NDQ0xMTACw\ncuVKNmzY8NM4lZo1r8qbKwUzH+qFhYXcvn077fMOh4OFCxcC2AfnWJaVkgzmttPJzc21H2dlZaU9\nD1pEUs4RmJ33ZwzD4NSpU+zZs4cnT55w9uxZe8ympia8Xm/a+TKRLq5v24sWLcpoHKVm6QpDzTur\nVq0iLy+PcDhsX4tGo8TjcSD1A3W2XVxczLNnz+zf8b/d9wAoKCjgy5cvfzuekpISHjx4YO8x3Lhx\ngy1btqTMn87GjRuJxWIEg0EqKirsRLdt2zZCoRCfP38GIJlM8uLFCwDKy8u5du2avUE/e5JeQUGB\nvfcye79DQ0N22fOenh5cLpcmCvXLdIWh/tMcDgf19fVkZ2fb35h7e3u5cOEC7e3tXL58GcuyWL58\nOR0dHXafuWPAzEEyx48fp7GxkaVLl1JWVkZ2djb5+fkANDQ0UFdXR35+Pt3d3RnHuGbNGpqbm6mv\nr2fBggUUFRXR1taWMv+PGIbB+fPnCYVC9jWPx8P4+DgHDhzA4XAwPT1NbW0tTqeTxsZGgsEghmGQ\nm5vLihUrOHfuHCUlJVy6dAnDMNi0aRNHjx7lzJkztLS0YFkWhYWF9gpGqV+h5c3VHyUej7N48WJg\n5hv3rVu3fst/LZT6E+gKQ/1Ruru7uXfvHpZlsWTJEk6cOPFvh6TUvKErDKWUUhnRTW+llFIZ0YSh\nlFIqI5owlFJKZUQThlJKqYxowlBKKZURTRhKKaUy8hf8CwfjbzhfpQAAAABJRU5ErkJggg==\n",
+            "text/plain": [
+              "\u003cmatplotlib.figure.Figure at 0x7f47b218dbd0\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        }
+      ],
+      "source": [
+        "plt.plot(ag_means)\n",
+        "plt.ylabel('Time(s)')\n",
+        "plt.xlabel('Length of vector')\n",
+        "_ = plt.title('Time to sum the elements of 1000 vectors (AutoGraph)')\n",
+        "_ = plt.ylim(ymin=0)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "d7IAJ6Bwbk9t"
+      },
+      "source": [
+        "## Eager"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "XMu5-12yoOzY"
+      },
+      "outputs": [],
+      "source": [
+        "from tensorflow.python.eager import context"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "_vt9MzpyjQ4T"
+      },
+      "outputs": [],
+      "source": [
+        "# Sum written using for loop and run with tf.eager\n",
+        "def sum_all(elements):\n",
+        "  sum_ = 0.0\n",
+        "  length = elements.shape[0]\n",
+        "  for i in tf.range(length): \n",
+        "    sum_ += elements[i][0]\n",
+        "  return sum_\n",
+        "\n",
+        "eager_means = []\n",
+        "for num in range(max_elements):\n",
+        "  with context.eager_mode():\n",
+        "    durations = []\n",
+        "    for i in range(trials + burn_ins):\n",
+        "      \n",
+        "      start = time.time()\n",
+        "      for _ in range(batches):\n",
+        "        run_trial(num)\n",
+        "      \n",
+        "      if i \u003c burn_ins:\n",
+        "        continue\n",
+        "      \n",
+        "      duration = time.time() - start\n",
+        "      durations.append(duration)\n",
+        "    eager_means.append(np.mean(durations))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          },
+          "height": 301
+        },
+        "colab_type": "code",
+        "executionInfo": {
+          "elapsed": 422,
+          "status": "ok",
+          "timestamp": 1532460024499,
+          "user": {
+            "displayName": "",
+            "photoUrl": "",
+            "userId": ""
+          },
+          "user_tz": 240
+        },
+        "id": "5gHVdMlD-A-T",
+        "outputId": "3b581cb7-7ef9-489c-92f1-3e52c0c2dc8a"
+      },
+      "outputs": [
+        {
+          "data": {
+            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEcCAYAAAAydkhNAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3XlclNX+wPHPsC+CILviCiruhAiiqaAimqnglpZLWpkt\ntmh50+rXvZZ2q9veq5ulZYtlaZp7mQuaorjklkoqiwgKsojsy8yc3x9cBxFUVGBYvu+/mGfmPM93\nzjzMd85zznOORimlEEIIISphYuwAhBBC1F2SJIQQQtyQJAkhhBA3JElCCCHEDUmSEEIIcUOSJIQQ\nQtyQJAlg8eLFvPrqq8YOo14ZOHAge/furfHjzJs3jw8//LDGj1NXxMfHExERQc+ePfnuu++MHU6D\ndvbsWcaMGWPsMCo1btw4YmNjjR0G0EiSxD333IOfnx9+fn506tSJHj16GLZt2LCBxx9/nNdff73G\n40hOTsbHxwe9Xl/jx6pOje2L+lZq8nNcsmQJgYGBHDp0iEmTJlV4fvPmzUyYMAFfX1+mTJlS4flT\np04xevRofH19GTNmDDExMeWef+eddwgMDKR379688847t1W2pk2ePJlVq1bV2vE++ugjHn30UcPj\ngQMH0qNHD/z8/AzfD2+88UatxXOtRx55pM78zzWKJHH48GH+/PNP/vzzT5o3b87ixYsN2+6///5a\ni0MphUajQe5frN9q8nO8cOEC3t7eN3zewcGBhx9+mBkzZlR4rqSkhKeeeorw8HAOHDhAeHg4Tz75\nJFqtFoAVK1awfft21q9fz7p164iMjOTHH3+sUtn64HY+j7S0NKKjoxk0aFC57YsXL+bPP/80fD+8\n8sor1R3mTel0OqA0YUVHR5Oenl6rx69Mo0gS11JKVTiZPvnkE1588UWg7Ffi6tWrCQ4OJjAwkBUr\nVnD8+HFGjhxJQEBAhVbHqlWruO+++wgMDOTRRx/lwoULlR578uTJAPj7++Pn58fRo0dRSvHpp58y\ncOBA+vbty0svvURubm6l5S9fvszMmTPp1asXgYGB5X5p+vj4cP78ecPja3/979+/nwEDBrBkyRL6\n9OlDv3792Lp1Kzt37iQsLIzAwEAWL15c6TF/+ukn1q9fz5IlS/Dz8+OJJ54wPHfq1ClGjhxJr169\nmD17NsXFxYbnduzYQXh4OL169WLixIn8/fffle4fIDY2lunTpxMYGMiwYcPYvHnzDV97s/0OHDiQ\npUuXMnLkSO655x5eeeUVMjIyeOyxx/Dz82P69Onk5OQYXn/kyBEmTJhAr169CA8PZ//+/YbnJk+e\nzIcffsjEiRPx8/PjkUceISsry/AclP8cExMTmTx5Mv7+/gQFBTF79uwbvodt27Zx//33ExAQwJQp\nU4iLiwNg6tSpREdHs2DBAvz8/Dh37lyFskFBQQwdOhQXF5cKz+3fvx+dTseUKVMwNzdn8uTJKKXY\nt28fAL/88gvTp0/H1dUVV1dXpk2bxpo1awCIjo6+adlrbdq0qcJlmmXLlvHkk08CUFxczFtvvUVI\nSAj33nsv//znP8udG1u3biU8PJyePXsyZMgQdu/ezfvvv8+hQ4d4/fXXy/2C//PPPxk7diy9evVi\n3LhxHD58uNxn9P777zNx4kR8fX1JSkpi9erVDB48GD8/PwYPHsyGDRsq/Qz27NlDly5dsLCwKLf9\nRonm/PnzTJ06lcDAQIKCgnjhhRfK/Z+eOHHCcJnw2Wef5fnnny/XErjVefvFF18Yzlu9Xo+FhQVd\nunRh9+7dlcZTq1QjExISoqKiospt+/jjj9WLL76olFIqKSlJdezYUb322muqqKhI7dmzR3Xr1k09\n9dRTKjMzU6WkpKigoCB14MABpZRSv//+uxoyZIiKi4tTOp1O/fe//1UPPPBApcdOSkpSPj4+Sq/X\nG7atXLlSDRkyRCUlJan8/Hz19NNPG2K53rvvvqtee+01pdPplFarVQcPHjQ85+PjoxITEw2PX3rp\nJfXBBx8opZSKjo5WnTt3Vp9++qnSarXqp59+Ur1791Zz5sxR+fn56syZM6pbt27q/PnzlR732n1d\nW4/jxo1TaWlp6sqVK2rYsGFqxYoVSiml/vrrLxUUFKSOHTum9Hq9WrNmjQoJCVHFxcUV9p2fn68G\nDBig1qxZo/R6vTp58qQKDAxUZ8+erXDsW+03JCREPfDAAyojI0OlpqaqoKAgFRERoU6dOqWKi4vV\nlClT1CeffKKUUiolJUUFBASoXbt2KaWUioqKUgEBASozM1MppdSkSZNUaGioOnfunCoqKlKTJk1S\n77777g0/x9mzZ6vPPvtMKaVUUVGROnToUKV1GRcXp3x9fVVUVJTSarXqiy++UKGhoaqkpMRw3JUr\nV1Za9lo//fSTmjx5crltX331lXrsscfKbXv88cfVV199pZRSqmfPnuro0aOG544fP678/PyqVPZa\nBQUFys/PT507d86wbcyYMWrTpk1KKaXeeOMN9cQTT6js7GyVl5enZs6cqd577z2llFJHjx5VPXv2\nNPwPpqamqri4uErfe1ZWlurVq5dat26d0ul0asOGDapXr14qKyvL8PqQkBB19uxZpdPpVE5OjvLz\n81MJCQlKKaXS0tIM59H13nrrLbVgwYJy2yr7brjq3LlzKioqSpWUlKjMzEw1adIktWjRIqWUUsXF\nxSokJER9++23SqvVqi1btqguXbrc1nkbHh6uUlJSVFFRkeGYr7/+uvr3v/9daTy1qdG1JKpCo9Hw\n1FNPYWFhQZ8+fbC2tmb48OE4Ojri5uaGv78/J0+eBODHH39kxowZtG3bFhMTE2bMmEFMTAwXL168\n4f7VNb9WNmzYwMMPP0yLFi2wtrZm9uzZbNq0qdLr3WZmZqSlpZGUlISpqSk9e/asdJ+VMTc3Z+bM\nmZiamnLfffdx+fJlpk6dirW1Nd7e3nh7e9/0135lpkyZgrOzM/b29oSEhHDq1CkAVq5cyYQJE+jW\nrRsajYbw8HAsLCw4evRohX3s2LEDT09PwsPD0Wg0dOrUiSFDhvDrr79WeG1V9jtp0iSaNWuGq6sr\n/v7+9OjRAx8fH8zNzQkNDTXEuG7dOoKDg+nXrx9Q+gu9a9eu7Ny507Cv0aNH06pVKywsLBg2bJih\n7FXX1rmZmRnJycmkpqZiYWGBn59fpXW2efNmgoODCQoKwtTUlEceeYTCwsJyv5DvVH5+PnZ2duW2\nNWnSxPCL9/rn7ezsyM/Pr1LZa1lZWTFo0CDDr/SEhATi4+MNl25WrVrFvHnzsLOzw8bGhhkzZhhe\nu2rVKsaOHUtQUBAArq6utG3bttL3ExkZSZs2bRgxYgQmJiYMHz6cdu3asWPHDsNrIiIi8PLywsTE\nBFNTU0xNTTl9+jRFRUU4Ozvj5eVV6b5zcnKwtbWtsP2pp54iICCAXr16ERAQwMqVKwFo1aoVQUFB\nmJmZ4ejoyNSpUzlw4ABQ2iLV6XRMmjQJU1NTQkND6d69u2GfVTlvp0yZgpubW7mWja2tLdnZ2ZXG\nX5vMjB1AXeXk5GT428rKCmdnZ8NjS0tLwz/XhQsXWLhwIW+99RZQdr06NTUVDw+PWx7n0qVLNG/e\n3PC4RYsWaLVa0tPTcXV1LffaRx99lI8//pjp06ej0WgYN25cpdemK+Pg4IBGozG8n8re49X3VFXX\nlre2tiYtLQ0orZO1a9caRucopdBqtVy6dKnCPi5cuMCRI0cICAgwvFan0xEeHl7pa2+132tjsrS0\nrPD42s9t8+bNhi+cq/u6+uUFlPvMra2tb1o/c+fO5YMPPmDs2LGGfoPKRs5c/3lrNBo8PDxITU29\n4b6rysbGpsKXem5uLk2aNKn0+dzcXGxsbKpU9nrDhw/n7bff5sknn2TDhg0MHjwYCwsLMjMzKSgo\nKPfe9Xq9IaGmpKQwYMCAKr2f6+sKoHnz5uXqyt3d3fC3tbU177//PkuXLmX+/Pn07NmTuXPn0q5d\nuwr7tre3Jy8vr8L2Tz/9lN69e1fYnpmZyRtvvMHBgwfJz89Hp9Ph4OAAlPZvuLm5lXv9tf/7VTlv\nr30fV+Xl5WFvb19he22TJHGX3N3deeKJJ6rUAX71S/parq6u5fowkpOTMTMzK/cFdZWNjQ3/+Mc/\n+Mc//kFsbCyTJ0+me/fu9O7dG2trawoKCgyvTUtLq/TEqw3u7u7MnDmTxx9//Jav9fDwIDAwkKVL\nl1brfqty3PDwcBYsWHDbZSv7HJ2cnAx9VYcOHWLatGkEBATQsmXLcq9zdXXlzJkz5bZdvHixWj6r\n9u3bs2zZsnLbTp8+behD8fb2JiYmhm7dugGlfUrt27e/adnKRlgB3HvvvcybN4+YmBg2btzI/Pnz\nAXB0dMTa2poNGzZU+JEDpZ/htX1n17q+Xl1dXdmyZUu5bRcuXKB///43LNO3b1/69u1LcXEx77//\nPq+++irLly+vcKyOHTuydu3aCttv1CJ/99130Wg0bNiwAXt7e7Zu3WroN3FxcamQ5C9evEirVq0M\n7/lOztu4uDhGjhx5W2VqglxuqsStLt1ca+LEiSxevJizZ88Cpc3Yyi6VADRr1gwTExMSExMN24YP\nH86yZctISkoiLy+P999/n+HDh2NiUvGjiYyMNJS1sbExNK+htON6w4YN6PV6du3aZWgKVwdnZ+cb\n/mNXZvz48axYsYJjx44BpZcydu7cWekv8eDgYOLj41m7di1arZaSkhKOHz9u6My90/3eysiRI9m+\nfTu7d+9Gr9dTVFTE/v37q/SLvrLP8ddffzWUtbe3x8TEpNLPcNiwYURGRrJv3z60Wi1Lly7F0tIS\nX1/fKsWt1+spLi5Gq9WW+xsgICAAExMTvv32W4qLiw2/XAMDAwEIDw9n2bJlpKamkpqayrJlyxg9\nevRNy1b2qxrA1NSUsLAw3n77bbKzs+nbty+AoYW7aNEiMjMzAUhNTTV0wI4dO5bVq1ezb98+lFKk\npqYaPuvrz7MBAwZw7tw5Nm7ciE6nY9OmTcTFxRESElJpTBkZGWzfvp2CggLMzMwM/yOV6du3LydO\nnCjXoX4zeXl52Nra0qRJE1JTU8v9qPH19cXU1JTly5ej0+nYunWr4RyFOztvi4uLOXHihKFejanR\nJYnKfgXe6jU3ezx48GAee+wxnn/+efz9/Rk5ciR//PFHpfu1srJi5syZTJw4kYCAAI4dO8bYsWMZ\nNWoUkyZNIjQ0FGtr6xsOu0tISODhhx/mnnvuYeLEiTz00EP06tULgJdffpnt27fTq1cvNm7cyODB\ng+/qPV5r7NixnD17loCAAJ5++ulbvr5r1668/vrrLFiwgICAAMLCwgyjaK5na2vLl19+yaZNm+jX\nrx/9+vXj3XffrfSf91b7vZ335O7uzqeffsrixYsJCgoiJCSEL7/80vAD4WZlK/scjx8/zrhx4/Dz\n8+Opp57i5ZdfpkWLFhXKtm3blnfeeYfXX3+doKAgIiMj+eyzzzAzM7vlcQHWrl1L9+7dWbBgAYcO\nHaJHjx6GG0HNzc359NNPWbNmDQEBAaxevZpPP/3UsO8JEyYQEhLCyJEjGTlyJCEhIYwfP75KZSsz\nfPhw9u7dy7Bhw8olxBdeeIHWrVszfvx4/P39mT59OgkJCQB0796dRYsWsWjRInr27MmUKVMM/XdT\npkzh119/JTAwkIULF+Lg4MBnn33G0qVL6d27N0uXLmXx4sU0bdq00rrS6/V89dVX9O/fn969e3Pg\nwAFee+21SmN3cnKid+/ebN26tdz2J554wnBPlZ+fH7NmzQLg6aef5q+//sLf35+ZM2cSFhZmKGNu\nbs7HH3/MypUr6dWrFxs2bGDgwIGG/oXbPW+hdARcYGBgpaPYaptG3c7P5ts0f/58IiMjcXJyYv36\n9QDExMTw2muvUVRUhJmZGa+99pqh+SuEELUlNjaWl156ydA5XZ3Gjx/PxIkTiYiIuKPyDzzwAAsX\nLrzpPTO1pUaTxMGDB7G1tWXu3LmGJPHII48wbdo07r33Xnbu3MmSJUv49ttvayoEIYSocQcOHKBt\n27Y4Ojqybt06/vWvf7F169ZK+xbrmxrtuPb39yc5ObncNo1GY7ihKScnp8KoACGEqG/i4+N57rnn\nyM/Pp1WrVnz00UcNIkFADbckoHS0zsyZMw0tidjYWB599FHDnc8rVqyo0lBRIYQQta/WO65/+OEH\nXn75ZSIjI5k3b55h6JwQQoi6p9aTxC+//GIYeTN06NByQ8VupoYbPEIIUe9l5xVz/Gx6tX5f1vjN\ndNcH6+bmxv79+wkICGDv3r20adOmSvvRaDSkpeXc+oWNgIuLndTF/0hdlJG6KNPY6iL9SgFb9p9n\n17ELFJfo+ee0XrRyK51mxcXF7halb65Gk8ScOXOIjo4mKyuL4OBgZs2axeuvv84bb7yBXq/H0tKy\nVtZxEEKIhijjSiG/7I5j71+p6JWimb0lw4Jb09K18ulU7kSNd1xXp8b0y+BmGtuvpJuRuigjdVGm\noddFXmEJG/eeY+vBJLQ6Pc2dbRkW2IrAzm6YmZbvRajTLQkhhBDV61RCJv9de4LcghKa2VsS0a8d\nQV3cMTG59WwSd0KShBBC1BORR5JZvuU0AGODvRjc0xML88rnp6oukiSEEKKO0+n1/LQ9lt8PnqeJ\ntTlPj+5Gh5YOtXJsSRJCCFFHXczIY/fxi+z9K4Ws3GI8nGx4dlwPXB2say0GSRJCCFHHFBRp+WL9\nSY6cTQfA2tKMgX4tGN3fCxur2v3aliQhhBB1SGZ2IR+sPEZSWi7enk0Z5OfJPe2da7zv4UYkSQgh\nRB1x/lIuH6w8yuWcIkL8WvDQ4A41NmqpqiRJCCGEESmlOJt8hT3HU4g+mUpRiY7xId6EBbSs0iJp\nNU2ShBBCGMmhvy+xMjKWS5dL16d3tLNk+vBO9PKpuD64sUiSEEKIWqbXK9b8EcfGvecwNzMhqIsb\nfbp50KmVo9EvL11PkoQQQtSi/MISFq87yfG4DFwdrZk1uhstXKpvrqXqJklCCCFqScaVQt798Qgp\nmfl0bdeMx0d2wdbK3Nhh3ZQkCSGEqAUX0vN498cjXM4pYkivlowP8a5zl5YqI0lCCCFqWNyFbD5Y\neZTcghLGhXgxLLC1sUOqMkkSQghRA5RSJKTksPvYRfb8dZESrZ5pw3zo16O5sUO7LZIkhBCimh2L\nzWBV5FmS0vIAcGhiweQRHbmng4uRI7t9NZok5s+fT2RkJE5OTqxfv96w/dtvv2X58uWYm5szYMAA\nXnjhhZoMQwghas3Wg+f5YdsZTDQaenZ0oV93D7q0bYapicmtC9dBNZokRo8ezeTJk5k7d65hW3R0\nNDt27GDDhg2YmZmRmZlZkyEIIUSt0OsVK7afYevBJOxtLXh2bHfaetgbO6y7VqNJwt/fn+Tk5HLb\nfvjhBx577DHMzEoP3axZs5oMQQghaoRerzgWm0Fyei5pWYWcS83hXEoOzZ1teW5sd5xrcTrvmlTr\nfRIJCQkcPHiQ999/H0tLS+bOnUu3bt1qOwwhhLhjWp2epRtPEX0ytdz27l5OzBjRGZs6fu/D7aj1\nJKHT6cjOzuann37i2LFjPPfcc2zbtq1KZe92Qe+GROqijNRFGamLMjVVF0UlOv799QEOnkqlU5tm\njB3UHrdmNrg52mBl2fDGAtX6O3J3d2fIkCEAdO/eHRMTEy5fvoyjo+Mty6al5dR0ePWCi4ud1MX/\nSF2UkbooU1N1UVCk5eOfjxGTmEWXts14OqIblhal6zzkZBdQF2v/bpNljXe3K6XKPR48eDB79+4F\nID4+Hq1WW6UEIYQQxhR/MZvXvz5ITGIWPTu68MyY7oYE0ZDVaEtizpw5REdHk5WVRXBwMLNmzWLM\nmDHMmzePESNGYG5uzltvvVWTIQghxF3R6fVs3HuO9XsS0OkVQ3q1ZFyIV70d0nq7NOr6n/p1mDSl\nS8llhTJSF2WkLspUV12kXylg8boTxCZn42hnySPDO9G5Tf0akXm3l5saXi+LEEJUgyNn01m64SR5\nhVoCOrkyOaxjnZ+xtSZIkhBCiGsUlehY+0c8v+5PxMzUhKlDO9K/R/M6sZSoMUiSEEIIILeghO2H\nkth6KIncghLcmtnwxKgutHJr3MOKJUkIIRo1vV6xYW8Cm/ado7hEj62VGff3acOwwFZYN8D7Hm6X\n1IAQotG6klfM5+tOcOrcZRyaWDCmf2v69fDAykK+Gq+SmhBCNEp/J17ms7UnuJJXjK+3M4/c36lR\ndkzfiiQJIUSjUqLV88vuOH6NTkSDhnEhXgwNaNVoO6ZvRZKEEKLRSEzNYcmGkySl5eHc1IrHRnSm\nvaeDscOq0yRJCCEavBKtno17E9i49xw6vWKAb3PGh3hLx3QVSA0JIRq00+ez+PrXGC5m5ONoZ8nU\noT5093Iydlj1hiQJIUSDlH6lgLV/xLPnrxQ0wCA/T0YPaCeth9sktSWEaFCu5Baxenc8m6NKJ+Tz\ndLFlylAfvFs0NXZo9ZIkCSFEvafT6zkRn8nu4ykcOZOGVqdwcbAi/N52BHZ2w8RERi7dKUkSQoh6\nq6hYx/Y/k9hy8DxXcosBaO5sS/gAL3zbNcPMtHFM512TJEkIIeqdohIdO/5MZnP0OXLyS7C2NCPE\nrwX3dvOgjbsdrq72Mm16NZEkIYSoV1Iz8/lw1TFSMvOxtjRlZN82DOnVEhu5W7pG1GhbbP78+fTp\n04cRI0ZUeG7p0qX4+PiQlZVVkyEIIRqQUwmZvPHNQVIy8xnU05O3ZvYhvF87SRA1qEaTxOjRo1m6\ndGmF7SkpKURFRdG8efOaPLwQooEo0erYevA87/10lMJiHdPu8+Gh0A40sZbkUNNq9HKTv78/ycnJ\nFbYvWrSIuXPn8sQTT9Tk4YUQ9ZhSiuNxGUSfvMSRs2kUFOloYm3O06O70aGlTKVRW2q9T2L79u14\neHjQsWPH2j60EKIeWRUZy+boRACc7K0Y4NuCQX6eODW1MnJkjUutJonCwkI+++wzvvzyS8M2pVSV\ny9/tgt4NidRFGamLMg2lLk7EZfDr/kQ8nG2Z/aAfHVs53vYsrQ2lLoytVpNEYmIiycnJjBo1CqUU\nqampjBkzhpUrV+LkdOu5VGRIWykXFzupi/+RuijTUOqiqFjHe98fAgXThvrgZGNOenrube2jodRF\ndbjbZFnjSeLalkKHDh3Ys2eP4fHAgQNZs2YNTZvK7fJCiFKrdsZy6XIBQwNa4e0p3w3GVqOjm+bM\nmcOECROIj48nODiYn3/+udzzGo3mti43CSEatphzl9l2KAkPJxvC+7U1djiCGm5JvPvuuzd9ftu2\nbTV5eCFEPXL4TBpfbjyFRgOPDO+MhbmpsUMSyB3XQggj0+r0rIqMZcuB85ibmTD9vk60a25v7LDE\n/0iSEEIYTWJqDl//+jfxF7Nxb2bDE+FdaenaxNhhiWtIkhBC1LrcghLW7Ioj8kgySkFQFzcmh3XE\nykK+kuoa+USEELXm6uytG/cmkFeoxcPJhomD29O1rSwnWldJkhBC1LgSrY7IIxfYtPccV/KKsbY0\nY8JAbwb29JQ1H+o4SRJCiBqhV4qzSVfYdzKVA6dSySvUYmlhyv19WhMW0Apbmbm1XpAkIYSodhcz\n8vho1TFSLxcA0NTWgvt6tyYsoCV2NhZGjk7cDkkSQohqdTmniPd+PEJGdhG9u7jRt6sHnVo7yjrT\n9ZQkCSFEtckvLOG9n0oTRET/dozo08bYIYm7JD1GQohqUVSi46NVx0hOy2OQnyf3B7U2dkiiGkhL\nQghxx/ILtRw9m87hs+n8FZdBYbGOXj6uTBzc/ran9hZ1kyQJIcQdOZt8hU9+PkZ2fgkALg5WDOrp\nyci+baX/oQGRJCGEuG37TqTw5aYY9HrF/X1aE9jJjebOttJ6aIAkSQghqkwpxdrd8azbk4C1pSlP\nhHeTu6UbOEkSQogqUUqx/PfTbP8zGeemVjw7rgctnG2NHZaoYZIkhBC3pFeK5VtOs+NwMp4uTXhh\ngi/2tnJTXGNQo0li/vz5REZG4uTkxPr16wF4++232bFjBxYWFrRq1Yo333yTJk1kamAh6iq9Uny3\n5TSRh5Np6VqaIOSu6cZDo2pw/dCDBw9ia2vL3LlzDUkiKiqK3r17Y2Jiwn/+8x80Gg1z5syp0v5k\nYfNSssh7GamLMtVVF4XFWtbtTiD2whWy84rJzi+moEhHK9cmvDDxHppY1/05l+S8KOPiYndX5Wu0\nJeHv709ycnK5bX369DH87evry2+//VaTIQghbkNs8hW+WH+SS1kFaDRgZ22Ok70VLVya8FBoh3qR\nIET1MmqfxKpVqxg+fLgxQxBCAHq9Yn1UAuv3JKCUYlhgK8L7tcPcTCZlaOyMliT++9//Ym5uzogR\nI6pc5m6bTQ2J1EUZqYsyd1IX+YUlvPPdIQ6eSsXZwZrZE/3o5u1cA9HVLjkvqodRksSaNWvYuXMn\n33zzzW2Vk2uMpeR6axmpizJ3UheZ2YV8uOoY5y/l0qVtM2aO6oKtlXm9r1M5L8rU6T4JKB1bfa1d\nu3axZMkSvvvuOywsZISEEMZyLiWHD1cdJSu3mGDf5jwY2kFWiRMV1GiSmDNnDtHR0WRlZREcHMys\nWbNYvHgxJSUlTJ8+HYAePXrwz3/+sybDEEJc5+/Ey3y46hhFxTrGh3gTFtBSptQQlarRIbDVTZqP\npaQpXUbqokxV6+LI2XT++8tf6PWKx0Z0JqCTWy1EV7vkvChT5y83CSGMK7eghKycIrLzi0lMzWVV\nZCxmphqeGdudbu1k3iVxc5IkhGjA/jh6gW9++xudvuyCgY2lGc+O6057TwcjRibqC0kSQjRQx2Iz\n+PrXv7GxMqNXJ1fsbSywszGnWzsnXBysjR2eqCckSQjRACWkZPPfX/7C1FTDs2O749WiqbFDEvWU\njHcTooFJSsvlg5XHKC7RMWNEF0kQ4q5IS0KIBuDvxMus3h3PgRMppF4uAOCh0A707Ohi5MhEfSdJ\nQoh6rKhEx4/bzxJ5uHQiTUsLU+5p70xAJzcCOze8oa2i9kmSEKKeSrqUy2frTnAhPQ9PF1seH90d\nN3tLuWtXb7olAAAgAElEQVRaVCtJEkLUMyVaPb/tT2TdngS0Oj2D/DwZF+JFi+YOcgOZqHaSJISo\nR04mZPLdltOkZOZjb2PO1GFduKe99DuImiNJQoh6oLBYy/LfT7PneAoaYKBfC0b3b4eNlSwCJGqW\nJAkh6rjE1Bz++8tfpF4uoLWbHVOHdaSNu72xwxKNhCQJIeoopRTb/0zmx+1n0OoUQwNaMXpAO+mY\nFrVKkoQQdZBSip93xrFp3zmaWJvz6P2d6O5V/1eLE/WPJAkh6hi9UqzYeoath5Jwc7TmhQn34NTU\nythhiUbqlkni/PnzrFq1iujoaFJSUrC0tMTHx4ewsDCGDBmCmdmNdzF//nwiIyNxcnJi/fr1AFy5\ncoXnn3+e5ORkPD09+eCDD7Czk7VoReNUotVzObeIgkItCoVSsPNIMruOXqSFiy0vPOBL0yaWxg5T\nNGI3XXTo//7v/zhx4gRDhw7lnnvuwdnZmaKiImJjY9m9ezcnT57kn//8J76+vpWWP3jwILa2tsyd\nO9eQJN555x0cHBx47LHH+Pzzz8nOzuaFF16oUrAyBryULKhSpj7WRVpWAd/8GsP5S7lk55dU+prW\nbnbMmeBLE+uqj16qj3VRU6QuytTookODBg1iwYIFFbZ37NiR++67j6ysLM6fP3/D8v7+/iQnJ5fb\ntm3bNr777jsAIiIimDx5cpWThBD1XWzyFT76+Rg5+SW4OlrTwqUJjnaW2FiZYaLRoNGAjZU5g/xa\nyPBWUSfcNEkMGDDgpoUdHBxwcLi9hUsyMzNxdi7tgHNxceHy5cu3VV6I+upAzCWWbDiJTqeYPKQD\nIX6exg5JiFuq0li6f//73+Tk5KDVannwwQfx9fVl7dq1NR2bEPVadl4xO48k882vMfxr2YHS9R1M\nNDw7rrskCFFvVGl0U1RUFC+99BKRkZG4ubnx/vvvM2PGDEaNGnXbB3RyciI9PR1nZ2fS0tJo1qxZ\nlcve7bW1hkTqokxdrIviEh3zv9hHSkY+AGamJnRq04wnx/agjUfN3QhXF+vCWKQuqsdtDYE9cOAA\noaGhuLm5odFoqlTm+n7xgQMHsnr1ambMmMGaNWsYNGhQlY8vHVGlpFOuTF2ti1+jE0nJyCeoiztD\nerWkhYut4Sa4moq3rtaFMUhdlLnbZFmly01OTk688sorbNq0ib59+6LVatHpdLcsN2fOHCZMmEB8\nfDzBwcH8/PPPzJgxg6ioKMLCwti7dy8zZsy4qzcgRF2TW1DChqgEbK3MeDC0Pa3d7eQuaVFvVakl\n8e6777Ju3TrGjh1L06ZNSUpKYtq0aVUqV5lly5bdVpBC1CcbohLIL9LywEBvbGWEkqjnqpQkmjVr\nxsMPP2x47OnpiaendLwJcb20rAK2/5mEc1MrBkrntGgAbtoGfvLJJzl27Filz+Xm5vL111/z448/\n1khgQtRHa3bFodUpRvdvh7mZXGIS9d9NWxLPPPMM7777LgkJCXTv3h0nJyeKioqIi4sjOTmZCRMm\nMHHixNqKVYg6Sa9XHI1NZ+vBJE6du0xrdzsCZH1p0UDcNEn4+PjwxRdfcPHiRfbv309qaiqWlpYM\nHTqUnj17YmFhUVtxClEnnU26wpINJ7mUVQBAp9aOPBTaAZMqjv4Toq6rUp+Eh4fHHd0TIURDlltQ\nwqe/HCc7r4T+PTwY3LMlnq5NjB2WENWqShdNMzIyeOGFF3jooYcAiImJ4YcffqjRwISo677b8jdZ\nucWE92vLw8M6SYIQDVKVksQrr7xCz549yc7OBqBdu3Z8//33NRqYEHXZvpMp7D91Ce8WTRnWu5Wx\nwxGixlQpSaSmpjJx4kRMTU0BsLCwwMRERm6Ixikzu5DvfjuNpbkpj97fCVP5XxANWJXO7usXFsrO\nzq4w3YYQjUFqZj6frD5OfpGWCYO8cXW0MXZIQtSoKnVcDxkyhP/7v/8jLy+P1atX8/333zNmzJia\njk2IOkOr0/Pb/kTW7k5Aq9PTp6s7/Xs0N3ZYQtS4KiWJRx99lHXr1pGdnc3OnTuZPHmyjHYSjYJS\nimOxGfy8M46ktFzsbS14KLQD/h1dqjzJpRD1WZVngR05ciQjR46syViEqDZanf6uJtXT6vT8FZfJ\nuj3xJKSUziZ6b3cPmY9JNDpVShIZGRl89913JCYmotVqDds//PDDGgtMiDuVmJrDv5Yd4KmIbvh1\ncKlSGaUUv/wRz7HYDC7nFBrWntYAvXxcGdG3DZ4uMsRVND5VShJPPvkknTt3JigoyDDCSYi66kzS\nFZSCY7EZVU4S6/YksD4qAXMzE5rZWdLc2Ra3ZjYM6ukpyUE0alVKEgUFBbz22ms1HYsQ1eLS5dIp\nMs6lVm3RmX0nU1i7Ox7npla8MsUfe1uZbkaIq6qUJHr06MHff/9Nx44dazoeIe5a2v/mUUpOy63Q\nN/H7gfOsj0qgh7cTfbt6YGZmwpcbY7C2NOXZsd0lQQhxnSoliQkTJjBp0iTc3d2xtLQ0bF+1atUd\nH3jZsmWsWrUKjUZDhw4dePPNN2XCQFEtrk62p9UpLqTn0cqtbPnGvSdSyC0oYc/xFPYcTwFAo4Fn\nRvWghVxWEqKCKiWJF198kZkzZ9K5c+dq6ZNITU3l22+/ZfPmzVhYWPDcc8+xadMmwsPD73rfonHT\nK2VoSUDpJaerSaKwWEtiai5eze0ZG+zF7uMXOR6bQXj/dnRt52SskIWo06qUJCwtLXnkkUeq9cB6\nvZ6CggJMTEwoLCzE1dW1WvcvGqcrucWUaPU42VuRkV1IYmqu4bnYC9nolaJDSwc6tnKkYytHI0Yq\nRP1QpYHk/fr1Y9euXdV2UDc3N6ZNm0ZwcDD9+/fHzs6OPn36VNv+ReN16XI+APd0cMZEoynXeX3m\nfBYA7T0djBKbEPVRlVoSP/30E59//jm2trZYWFiglEKj0bB37947Omh2djbbtm1jx44d2NnZ8cwz\nz7B+/XpGjBhxR/sT4qqr/RGeLk3wcLLhfGoueqUw0Wg4k3QFAG/PpsYMUYh6pUpJ4ueff67Wg0ZF\nRdGyZUscHEp/0YWGhnL48OFbJgkXF7ubPt+YSF2UubYu8or1ALRv04xzl3JJPpRECRrcm9kSdzGb\nVu52tG3VzFih1jg5L8pIXVSPKiWJFi1aVOtBmzdvztGjRykqKsLCwoJ9+/bRrVu3W5ZLS6vauPeG\nzsXFTurif66vi4Tk0ktKFoCbgzUAh0+l4OZoQ1GxjnYe9g227uS8KCN1UeZuk+VNk8SLL77IO++8\nw5gxYyqdzOxOh8B2796dsLAwwsPDMTMzo3PnzowfP/6O9iXEtdKyCjAz1eBoZ0lrt9IhrYmpuVzJ\nLQagvVxqEuK23DRJXLp0CYB//OMf1X7gp59+mqeffrra9ysat7SsQpyaWmNioqGla+kvqHMpOVhb\nlp7qHaTTWojbctMkcXW50oCAgFoJRoi7kV+oJbeghLYe9gDYWJnh6mBNYmoOJiYamtlb4tTUyshR\nClG/yLqLosG4ehOd6//6IgBauduRV6glJ79EWhFC3IGbtiROnz5NUFBQhe13OwRWiJpwdfiri2NZ\nkmjt1oSDMaWXTaU/Qojbd9Mk0aZNGz7//PPaikWI25KTX4xdcdn6JldvpLu2JdH6mnmb2reUloQQ\nt+umScLCwqLah78KUR2KS3S8siQab08HZo0uHT6dVklL4uq8TbZWZjR3tq39QIWo526aJMzNZZlG\nUTfFJGaRk1/C4dNpnEvJobW7nWEdCZdrOqftbS3o29UdFwdrTGRNaiFu2007rn/66afaikOI23Is\nNt3w9+8HzwOlLQlHO0sszMvPVPzI/Z0ZeW/bWo1PiIZCRjeJekcpxbHYDKwtzWjh0oTok6mkXykg\nM7sIl2v6I4QQd0+ShKh3Lmbkk36lkC5tmzGqfzt0esWqyFgU4OIg90EIUZ0kSYh651hsBgA9vJwI\n6dkSWysz9p8qHebqKi0JIaqVJAlR71ztj+jazgkrSzMG+JaNwLt2ZJMQ4u5JkhD1Sn6hljNJV2jr\nYUdT29I10Qf6tcDUpHTkkquDjTHDE6LBkSQh6pWTCZno9Ipu16xJ3czeij5d3bG2NMPDSZKEENWp\nSutJCFFXGPojvJ3LbZ8ytCMPDGxvmO1VCFE95D9K1Bt6pTgel4G9jTmt3csvpGJqYoKNlTSMhahu\n8l8l6o0T8ZlcySumWzsnuXtaiFpitJZETk4OL7/8MmfOnMHExIRFixbRo0cPY4Uj6jCtTs+GqAQ2\nRJ1DAwR1dTd2SEI0GkZLEgsXLmTAgAF89NFHaLVaCgsLjRWKqMMupOexdONJ4i/m4GRvySPDO+PT\n2tHYYQnRaBglSeTm5nLw4EH+/e9/lwZhZkaTJk2MEYqoo/ILtazbE8+2Q0no9IqgLu48FNoBGyvp\nRhOiNhnlPy4pKQlHR0fmzZtHTEwMXbt25eWXX8bKSqZUaOwKirTsO5nK2j/iyM4vwbmpFRMHteee\nDi7GDk2IRkmjlFK1fdC//vqLBx54gBUrVtCtWzcWLlyInZ0dzzzzTG2HIuoApRSH/05j+8Hz7P3r\nIsUlOqwsTBk3qAPhA7wqzOoqhKg9RmlJuLu74+7uTrdupYvFhIWFsWTJkluWS0vLqenQ6gUXF7sG\nUxdZuUV8ufEUf8VnAuDmaE1QV3f6dW+Oo50lV7Lyb1q+IdXF3ZK6KCN1UcbFxe7WL7oJoyQJZ2dn\nPDw8iI+Pp23btuzbtw8vLy9jhCKM6PDpNL7aHENuQQld2zVjVN+2tGtuj0aGtwpRZxitF/CVV17h\nhRdeQKvV0rJlS958801jhSJqWVGxjhXbz7DzyAXMzUx4KLQDA/1aSHIQog4yWpLw8fHh559/Ntbh\nhZEkpGSzeN1JUjPz8XRpwuMjO9PCRUa2CVFXyXhCUSu0Oj2/7U/klz/i0ekVYQEtGd3fC3Mzuelf\niLpMkoSoUVeXGv1x+1lSMvNp2sSCR4d3pkvbZsYOTQhRBZIkRI1JTM1hVWQsf8VnotFAiF8LIvq1\no4m1ubFDE0JUkSQJUe0SU3NYtyeBP0+nAdCljSMPDGqPp/Q9CFHvSJIQ1SY9q4CVkbEciCldb9qr\nuT2j7m1Ll7bNZOSSEPWUJAlx1wqKtGzad47f9p9Hq9PT1sOO8H7t6CrJQYh6T5KEuGNKKfaeSGFl\nZCxXcotxtLNkbLAXgZ3dZL0HIRoISRLijpxLyWH576c5m3wFczMTRvZtw7DA1lhayDxLQjQkkiTE\nbSko0rJmVxzbDiWhgJ4dXHhgoDfODtbGDk0IUQMkSYgqOxabwbe/xZCRXYRbMxsmDelAlzZyv4MQ\nDZkkCXFLl3OK+GnHWaJPpmJqouH+Pm0Y0ac15mZyaUmIhk6ShLghrU7P1oNJrN0TT1GxjrYe9kwb\n5oOnq9zvIERjIUlCVKCU4sjZdH7eGceF9DyaWJszYag3/Xo0l1FLQjQykiREOafOXWb1zlhiL2Sj\n0cAA3+aMGeAlU2kI0UhJkhBA6Qpxy7ec5tD/ptLo2cGF8P7taOFsa+TIhBDGJEmikVNK8cexi/y4\n/SwFRVraezZlwqD2tPWwN3ZoQog6wKhJQq/XM2bMGNzc3Pjss8+MGUqjdC4lhxXbzvD3+SysLEyZ\nHNaRAb7S7yCEKGPUJPHNN9/g5eVFbm6uMcNodDKzC1mzK46ov1JQgK+3M5OGdKCZvZWxQxNC1DFG\nSxIpKSns3LmTmTNn8tVXXxkrjEalqETHb9GJbNp3jmKtHk+XJjwwyFtuiBNC3JDRksSiRYuYO3cu\nOTk5xgqh0VBKcSDmEj/tOEtmdhFNbS14KLQdfbt5YGIil5aEEDdmlCQRGRmJs7MznTp1Ijo6usrl\nXFzsajCq+qUqdaGUYv+JFFZsPc3Z81mYmZowdmB7xg1qj41VwxnSKudFGamLMlIX1UOjlFK1fdD3\n3nuPdevWYWpqSlFREXl5eYSGhvL222/ftFxamrQ6oPTkv1VdHDmTzpo/4jh/KRcN4O/jypgB7XB1\ntKmdIGtJVeqisZC6KCN1UeZuk6VRksS19u/fz5dfflml0U3yoZe62T/A5Zwilv9+mj9Pp6HRQGAn\nN4b3adNg73eQL4MyUhdlpC7K3G2SkPskGgi9Uuw8coFVkWcpKNLRoaUDU8I60ryBJgchRO0wepII\nCAggICDA2GHUa38nXuaHbWdITM3F2tKMqUM7yjxLQohqYfQkIe7cpawCVu44y6G/S6fSCOrixthg\nbxztLI0cmRCioZAkUQ/lF5awKjKWLQcS0eoUXi3smTioA+2ay1QaQojqJUmiHlFKEfVXCqt3xXE5\npwhHO0vGhXgR2MkNjVxaEkLUAEkS9URyeh7f/vY3p89nYWFuyqh72zI0sBWW5rI6nBCi5kiSqOOK\ninWsj0rgt/2J6PSKe9o78/QD96DR6owdmhCiEZAkUUcppYg+mcrKyFgu5xThZG/FQ6Ed8G3vjIuj\njYwBF0LUCkkSddC5lByW/36as8lXMDM14f4+bRjeuzWWFnJpSQhRuyRJ1CG5BSWs3hXHzsPJKKBn\nRxceCPHG2cHa2KEJIRopSRJ1gFanZ/exi6zeFUduQQkeTjY8FNqBzjKFtxDCyCRJGJFWpyfqrxQ2\nRCWQfqUQSwtTxod4M9jfEzNTE2OHJ4QQkiSM5fCZNH7Yeob0K4WYmZowuKcn9wW1xqGJ3C0thKg7\nJEnUsit5xXz/+2kOxFzC1ETDoJ6e3Ne7tUylIYSokyRJ1BK9XvHHsQusiowlr1CLVwt7Hh7WqcFO\n4S2EaBgkSdSC0+ez+H7raRJTc7G0MOWh0A6E+LWQWVqFEHWeJIkalJZVwM87Y9l/6hIAQV3cGRvs\nJZeWhBD1hiSJGpBXWMLGqHNsPXQerU7R1sOeBwe3x6tFU2OHJoQQt8UoSSIlJYW5c+eSnp6Oqakp\n48aNY8qUKcYIpVoppfjj2EVW7jhLXqEWJ3tLxgzwIqCzm1xaEkLUS0ZJEqampsybN49OnTqRl5fH\n6NGj6du3L15eXsYIp1qkXs7n680xxCRmYWVhyrhgLwb7e2JuJlNpCCHqL6MkCRcXF1xcXACwtbXF\ny8uLS5cu1cskUVCk5fcD59m47xwlWj2+3s5MGtKBZvZWxg5NCCHumtH7JJKSkoiJiaF79+7GDuW2\nlGh17Dh8gY17E8jJL8He1oJH7++Af0cXWQBICNFgaJRSylgHz8vLY/LkyTz55JMMHjzYWGHclqIS\nHb9Hn+PnHWdJzyrA2tKMiGBvRvVvh42VubHDE0KIamW0JKHVann88cfp378/U6dOrVIZY66hUFyi\nY/ufyfy6P5HsvGIszEwI8WvBfb1bY2djUauxuLjYyXoS/yN1UUbqoozURRkXF7u7Km+0y03z58/H\n29u7ygnCmOIvZrNkw0kuZuRjZWHK8KDWhPq3xN62dpODEELUNqMkiUOHDrF+/Xo6dOhAeHg4Go2G\n559/nv79+xsjnBvS6vSs25PApr3n0CvF4J6ejOrXFlu5rCSEaCSMkiR69uzJqVOnjHHoKjubdIVv\nfoshKS0PJ3srpg/vRKfWjsYOSwghapXRRzfVNbkFJayKPMuuoxcB6N+jOQ8M9MbaUqpKCNH4yDff\nNf48ncayzTHkFpTg6WLL5LCOtPd0MHZYQghhNJIkKL0h7oetZ9h9/CLmZiaMC/Ei1L+lrA4nhGj0\nGn2SOJmQybLNMaRfKaSVWxMeG9FF1ngQQoj/abRJIuNKIT9uP8PBv9PQAMODWjPq3rbSehBCiGs0\nuiRRotXz6/5ENkYlUKzV49XCnkmhHWntfnc3nAghREPUqJLEsdh0vt96hkuXC7C3tWBymBdBXd1l\nGm8hhLiBRpEk0rIKWLHtDIfPpGOi0RDq35JR97bFxqpRvH0hhLhjDfpbsqhEx+Z959i0LxGtTk+H\nlg5MCu2Ap2sTY4cmhBD1QoNMEkopDv2dxo/bz5KRXYhDEwvGD/QmsJObTOMthBC3ocEliYSUbFZs\nPcPppCuYmmgYFtiK+/u0kTumhRDiDjSYb84recX8HBnLnuMXUcA97Z0ZH+KNWzMbY4cmhBD1Vr1P\nElqdnu2Hkli7J56CIh2eLrZMGNSezm2aGTs0IYSo9+p1kvg78TLfbjnNhfQ8bK3MmDSkAwN8m2Nq\nIjfECSFEdaiXSSK3oISfdpxl97GLaIBg3+ZE9G9X6yvECSFEQ2e0JLFr1y4WLVqEUooxY8YwY8aM\nW5ZRShF9MpXvt54ht6CElq5NmDrUh3bN7WshYiGEaHyMkiT0ej2vv/46y5Ytw9XVlbFjxzJo0CC8\nvLxuWCbjSgEfrTrG0dgMLMxNGB/iTWgvT7m0JIQQNcgoSeLYsWO0bt2aFi1aADB8+HC2bdt20yTx\n1NvbySvU0qm1I1OH+eDqYF1b4QohRKNllCSRmpqKh4eH4bGbmxvHjx+/aRm9gqlDO9K/R3O5IU4I\nIWqJUZKEUuq2yyx9JZTCvKIaiEYIIcSNGCVJuLu7c+HCBcPj1NRUXF1db1rGzsZCRi9dw8VFpja/\nSuqijNRFGamL6mGUXt9u3bqRmJhIcnIyxcXFbNy4kUGDBhkjFCGEEDdhlJaEqakpr776KtOnT0cp\nxdixY2/aaS2EEMI4NOpOOgiEEEI0CnKTgRBCiBuSJCGEEOKGJEkIIYS4oTqfJHbt2sXQoUMJCwvj\n888/N3Y4tSolJYUpU6Zw3333MWLECL755hsArly5wvTp0wkLC+ORRx4hJyfHyJHWHr1eT0REBDNn\nzgQgKSmJ8ePHExYWxuzZs9FqtUaOsHbk5OTwzDPPMGzYMIYPH87Ro0cb7XmxbNky7r//fkaMGMGc\nOXMoLi5uNOfF/Pnz6dOnDyNGjDBsu9l58MYbbzBkyBBGjRrFqVOnqnSMOp0krs7xtHTpUjZs2MDG\njRuJjY01dli1xtTUlHnz5rFp0yZWrFjB8uXLiY2N5fPPPycoKIjffvuNwMBAFi9ebOxQa80333xT\nbiTcf/7zH6ZNm8Zvv/2GnZ0dq1atMmJ0tWfhwoUMGDCAzZs3s3btWtq1a9coz4vU1FS+/fZbVq9e\nzfr169HpdGzcuLHRnBejR49m6dKl5bbd6DzYuXMniYmJbNmyhQULFvDaa69V6Rh1OklcO8eTubm5\nYY6nxsLFxYVOnToBYGtri5eXF6mpqWzbto2IiAgAIiIi2Lp1qzHDrDUpKSns3LmTcePGGbbt27eP\nsLAwoLQufv/9d2OFV2tyc3M5ePAgY8aMAcDMzAw7O7tGe17o9XoKCgrQarUUFhbi6upKdHR0ozgv\n/P39sbcvPwv29efB1e/Mbdu2ER4eDkCPHj3IyckhPT39lseo00misjmeLl26ZMSIjCcpKYmYmBh6\n9OhBRkYGzs7OQGkiuXz5spGjqx2LFi1i7ty5hrm7Ll++TNOmTTH530zA7u7ujeL8SEpKwtHRkXnz\n5hEREcGrr75KQUFBozwv3NzcmDZtGsHBwfTv3x87Ozs6d+6Mvb19ozsvrsrMzCx3HmRmZgJw6dIl\n3N3dDa9zc3MjNTX1lvur00lCbuEolZeXxzPPPMP8+fOxtbVtlBMcRkZG4uzsTKdOnQznhVKqwjnS\nGOpGq9Vy8uRJHnzwQdasWYO1tTWff/55o3jv18vOzmbbtm3s2LGDP/74g4KCAnbt2lXhdY2xbq5X\n2fdpVeqlTq9MdydzPDU0Wq2WZ555hlGjRjF48GAAnJycSE9Px9nZmbS0NJo1a/jref/5559s376d\nnTt3UlRURF5eHosWLSInJwe9Xo+JiQkpKSmN4vxwd3fH3d2dbt26ATBkyBC++OKLRnleREVF0bJl\nSxwcHAAYPHgwhw8fJjs7u9GdF1fd6Dxwc3MjJSXF8Lqq1kudbknIHE+loxe8vb2ZOnWqYdvAgQNZ\nvXo1AGvWrGkUdTJ79mwiIyPZtm0b7733HoGBgfznP/8hMDCQX3/9FWg8deHs7IyHhwfx8fFAab+M\nt7d3ozwvmjdvztGjRykqKkIpxb59+2jfvn2jOi+ubyHc6DwYNGgQv/zyCwBHjhzB3t7ecFnqZur8\ntBy7du1i4cKFhjmeqrLMaUNx6NAhJk2aRIcOHdBoNGg0Gp5//nm6d+/Oc889x8WLF2nevDkffvhh\nhc6rhmz//v18+eWXfPbZZ5w/f57Zs2eTnZ1Np06deOeddzA3Nzd2iDUuJiaGl19+Ga1WS8uWLXnz\nzTfR6XSN8rz45JNP2LhxI2ZmZnTu3Jk33niDlJSURnFezJkzh+joaLKysnB2dmbWrFkMHjyYZ599\nttLzYMGCBfzxxx9YW1vz5ptv0qVLl1seo84nCSGEEMZTpy83CSGEMC5JEkIIIW5IkoQQQogbkiQh\nhBDihiRJCCGEuCFJEkIIIW5IkoSocwYOHMjZs2dr5ViffPJJuWmk582bx/Lly+96v/PmzWPEiBHM\nnj37rvd1MzExMWzevLlGjyEaN0kSolH75JNPKCkpqdZ9pqens2XLFtavX897771Xrfu+3smTJ+84\nSej1+mqORjREkiREvREfH89jjz3GuHHjCA8PN0w9AODj48PixYsZO3YsoaGhbNmyxfDcb7/9xrBh\nwxg9ejSLFy/Gx8eHgoICFixYgEajYcKECURERJCbmwvA6dOnmTp1KmFhYbz00ks3jOeXX35hxIgR\njBo1ilmzZpGZmUleXh5Tp06lqKiIiIgIvv7663Jl1q5dy9NPP214rNPp6Nevn2GOsiVLljB+/HhG\njx7NE088QUZGBgAlJSW89dZbjBgxgvDwcGbNmkVWVhYff/wx+/btIyIigoULFwKlsxREREQwatQo\npk2bxvnz54HSO9XDw8N54403mDBhAn/88cfdfByisVBC1DEhISHqzJkz5bZptVoVERGh4uLilFJK\n5ebmqrCwMMPjjh07quXLlyullDp06JDq16+fUkqp9PR0FRAQoBITE5VSSn311VfKx8dH5efnG8oV\nFGMkQHAAAAOxSURBVBQYjvPSSy+pBx98UBUXF6vi4mI1fPhwFRUVVSHG06dPq3vvvVelp6crpZT6\n4IMP1HPPPaeUUiopKUn17t270vdWUFCgevfurS5fvqyUUmr79u1q6tSpSiml1q5dq1599VXDa7//\n/ns1Z84cpZRSH3/8sZo1a5bSarVKKWUov3r1avXMM88YymRkZKjevXur2NhYpZRSK1euVOPGjVNK\nKRUdHa06d+6sjh49WmlsQlRGWhKiXkhISCAuLo7Zs2cTHh7OQw89RElJSbmVCu+77z4AfH19SUtL\no7i4mKNHj9K1a1datmwJwNixYyvsW103M83gwYMxNzfH3Nyczp07k5iYWKFMdHQ0wcHBODk5ATBh\nwgSioqJu+T6srKwYNGgQGzZsAEonYLu6eND27dvZu3cv4eHhhIeH8/3333Px4kWgdKr0KVOmYGpq\nCmCY9fR6R48epVOnTrRr1w6AMWPGcOrUKfLz8wFo3bo13bt3v2WcQlxVp6cKF+IqpRTNmjVjzZo1\nlT6v0WiwtLQEMCw2o9PpKiSA6x9XxsLCwvC3qalppesjK6UqzMV/9bi3Eh4ezptvvsn999/P/v37\neeeddwz7fOKJJxg9enSlx6uKyuK69rGNjU2V9iPEVdKSEPVC27ZtsbKyYu3atYZtcXFx5OXlARW/\nRK8+9vX15cSJE4br8tf2YwA0adKk3ELxVRUUFMTOnTsNfQY//vgjffr0qXD8yvj7+5Obm8t7771H\naGioIbkNHDiQ77//nuzsbACKi4uJiYkBICQkhG+++cbQyX511bkmTZoY+lKuvt9Tp04ZphFfvXo1\nnTt3luQg7pi0JESdo9FoePjhhzEzMzP8Ml6/fj2fffYZCxcu5Msvv0Sn0+Hs7MwHH3xgKHP9PqB0\nAZZ//etfzJgxA0dHR4KDgzEzM8Pa2hqAadOmMWXKFKytrfn222+rHKO3tzezZ8/m4YcfxsTEhJYt\nW7JgwYIKx7+R8PBwPvroo/9v5w5xGASiIAwPBoMhHADNBRCcgtUEzQWQSByChAOgSHB4joVBLqlo\ngnumadK0/T/51LrZyeat1nW9Z2VZ6jgO1XWtIAh0XZeqqlKWZWqaRuM4yjmnMAyVpqmmaVJRFJrn\nWc455Xmurus0DIPatpX3XkmS3E0FeAVfhePnneepKIokPW/W27a9ZRcC+Ac0Cfy8ZVm077u894rj\nWH3ff/pIwNegSQAATDxcAwBMhAQAwERIAABMhAQAwERIAABMhAQAwPQAVSnSA55bZkwAAAAASUVO\nRK5CYII=\n",
+            "text/plain": [
+              "\u003cmatplotlib.figure.Figure at 0x7f47b8e3bd90\u003e"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "output_type": "display_data"
+        }
+      ],
+      "source": [
+        "plt.plot(eager_means)\n",
+        "plt.ylabel('Time(s)')\n",
+        "plt.xlabel('Length of vector')\n",
+        "_ = plt.title('Time to sum the elements of 1000 vectors (Eager)')\n",
+        "_ = plt.ylim(ymin=0)"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "Autograph vs. Eager vs Graph sum",
+      "provenance": [
+        {
+          "file_id": "1olZkm32B7n7pQwlIAXR0_w8fZhRHCtkX",
+          "timestamp": 1531755808890
+        }
+      ],
+      "version": "0.3.2",
+      "views": {}
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/autograph/examples/notebooks/workshop.ipynb b/tensorflow/contrib/autograph/examples/notebooks/workshop.ipynb
new file mode 100644
index 0000000000..e7dfb13e15
--- /dev/null
+++ b/tensorflow/contrib/autograph/examples/notebooks/workshop.ipynb
@@ -0,0 +1,1129 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "u3B7Uh50lozN"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install -U -q tf-nightly"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "qWUV0FYjDSKj"
+      },
+      "outputs": [],
+      "source": [
+        "import tensorflow as tf\n",
+        "from tensorflow.contrib import autograph\n",
+        "\n",
+        "import matplotlib.pyplot as plt"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "kGXS3UWBBNoc"
+      },
+      "source": [
+        "# 1. AutoGraph writes graph code for you\n",
+        "\n",
+        "[AutoGraph](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/README.md) helps you write complicated graph code using just plain Python -- behind the scenes, AutoGraph automatically transforms your code into the equivalent TF graph code. We support a large chunk of the Python language, which is growing. [Please see this document for what we currently support, and what we're working on](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/autograph/LIMITATIONS.md).\n",
+        "\n",
+        "Here's a quick example of how it works:\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "aA3gOodCBkOw"
+      },
+      "outputs": [],
+      "source": [
+        "# Autograph can convert functions like this...\n",
+        "def g(x):\n",
+        "  if x \u003e 0:\n",
+        "    x = x * x\n",
+        "  else:\n",
+        "    x = 0.0\n",
+        "  return x\n",
+        "\n",
+        "# ...into graph-building functions like this:\n",
+        "def tf_g(x):\n",
+        "  with tf.name_scope('g'):\n",
+        "\n",
+        "    def if_true():\n",
+        "      with tf.name_scope('if_true'):\n",
+        "        x_1, = x,\n",
+        "        x_1 = x_1 * x_1\n",
+        "        return x_1,\n",
+        "\n",
+        "    def if_false():\n",
+        "      with tf.name_scope('if_false'):\n",
+        "        x_1, = x,\n",
+        "        x_1 = 0.0\n",
+        "        return x_1,\n",
+        "\n",
+        "    x = autograph_utils.run_cond(tf.greater(x, 0), if_true, if_false)\n",
+        "    return x"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "I1RtBvoKBxq5"
+      },
+      "outputs": [],
+      "source": [
+        "# You can run your plain-Python code in graph mode,\n",
+        "# and get the same results out, but with all the benfits of graphs:\n",
+        "print('Original value: %2.2f' % g(9.0))\n",
+        "\n",
+        "# Generate a graph-version of g and call it:\n",
+        "tf_g = autograph.to_graph(g)\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  # The result works like a regular op: takes tensors in, returns tensors.\n",
+        "  # You can inspect the graph using tf.get_default_graph().as_graph_def()\n",
+        "  g_ops = tf_g(tf.constant(9.0))\n",
+        "  with tf.Session() as sess:\n",
+        "    print('Autograph value: %2.2f\\n' % sess.run(g_ops))\n",
+        "\n",
+        "\n",
+        "# You can view, debug and tweak the generated code:\n",
+        "print(autograph.to_code(g))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "m-jWmsCmByyw"
+      },
+      "source": [
+        "#### Automatically converting complex control flow\n",
+        "\n",
+        "AutoGraph can convert a large chunk of the Python language into equivalent graph-construction code, and we're adding new supported language features all the time. In this section, we'll give you a taste of some of the functionality in AutoGraph.\n",
+        "AutoGraph will automatically convert most Python control flow statements into their correct graph equivalent.  \n",
+        "  \n",
+        "We support common statements like `while`, `for`, `if`, `break`, `return` and more. You can even nest them as much as you like. Imagine trying to write the graph version of this code by hand:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "toxKBOXbB1ro"
+      },
+      "outputs": [],
+      "source": [
+        "# Continue in a loop\n",
+        "def f(l):\n",
+        "  s = 0\n",
+        "  for c in l:\n",
+        "    if c % 2 \u003e 0:\n",
+        "      continue\n",
+        "    s += c\n",
+        "  return s\n",
+        "\n",
+        "print('Original value: %d' % f([10,12,15,20]))\n",
+        "\n",
+        "tf_f = autograph.to_graph(f)\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session():\n",
+        "    print('Graph value: %d\\n\\n' % tf_f(tf.constant([10,12,15,20])).eval())\n",
+        "\n",
+        "print(autograph.to_code(f))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "FUJJ-WTdCGeq"
+      },
+      "source": [
+        "Try replacing the `continue` in the above code with `break` -- AutoGraph supports that as well!  \n",
+        "  \n",
+        "Let's try some other useful Python constructs, like `print` and `assert`. We automatically convert Python `assert` statements into the equivalent `tf.Assert` code.  "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "IAOgh62zCPZ4"
+      },
+      "outputs": [],
+      "source": [
+        "def f(x):\n",
+        "  assert x != 0, 'Do not pass zero!'\n",
+        "  return x * x\n",
+        "\n",
+        "tf_f = autograph.to_graph(f)\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session():\n",
+        "    try:\n",
+        "      print(tf_f(tf.constant(0)).eval())\n",
+        "    except tf.errors.InvalidArgumentError as e:\n",
+        "      print('Got error message:\\n%s' % e.message)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "KRu8iIPBCQr5"
+      },
+      "source": [
+        "You can also use plain Python `print` functions in in-graph"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "ySTsuxnqCTQi"
+      },
+      "outputs": [],
+      "source": [
+        "def f(n):\n",
+        "  if n \u003e= 0:\n",
+        "    while n \u003c 5:\n",
+        "      n += 1\n",
+        "      print(n)\n",
+        "  return n\n",
+        "\n",
+        "tf_f = autograph.to_graph(f)\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session():\n",
+        "    tf_f(tf.constant(0)).eval()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "NqF0GT-VCVFh"
+      },
+      "source": [
+        "Appending to lists in loops also works (we create a tensor list ops behind the scenes)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "ABX070KwCczR"
+      },
+      "outputs": [],
+      "source": [
+        "def f(n):\n",
+        "  z = []\n",
+        "  # We ask you to tell us the element dtype of the list\n",
+        "  autograph.set_element_type(z, tf.int32)\n",
+        "  for i in range(n):\n",
+        "    z.append(i)\n",
+        "  # when you're done with the list, stack it\n",
+        "  # (this is just like np.stack)\n",
+        "  return autograph.stack(z)\n",
+        "\n",
+        "tf_f = autograph.to_graph(f)\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session():\n",
+        "    print(tf_f(tf.constant(3)).eval())\n",
+        "\n",
+        "print('\\n\\n'+autograph.to_code(f))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "iu5IF7n2Df7C"
+      },
+      "outputs": [],
+      "source": [
+        "def fizzbuzz(num):\n",
+        "  if num % 3 == 0 and num % 5 == 0:\n",
+        "      print('FizzBuzz')\n",
+        "  elif num % 3 == 0:\n",
+        "      print('Fizz')\n",
+        "  elif num % 5 == 0:\n",
+        "      print('Buzz')\n",
+        "  else:\n",
+        "      print(num)\n",
+        "  return num"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "EExAjWuwDPpR"
+      },
+      "outputs": [],
+      "source": [
+        "tf_g = autograph.to_graph(fizzbuzz)\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  # The result works like a regular op: takes tensors in, returns tensors.\n",
+        "  # You can inspect the graph using tf.get_default_graph().as_graph_def()\n",
+        "  g_ops = tf_g(tf.constant(15))\n",
+        "  with tf.Session() as sess:\n",
+        "    sess.run(g_ops)    \n",
+        "  \n",
+        "# You can view, debug and tweak the generated code:\n",
+        "print('\\n')\n",
+        "print(autograph.to_code(fizzbuzz))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "SzpKGzVpBkph"
+      },
+      "source": [
+        "# De-graphify Exercises\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "8k23dxcSmmXq"
+      },
+      "source": [
+        "#### Easy print statements"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "dE1Vsmp-mlpK"
+      },
+      "outputs": [],
+      "source": [
+        "# See what happens when you turn AutoGraph off.\n",
+        "# Do you see the type or the value of x when you print it?\n",
+        "\n",
+        "# @autograph.convert()\n",
+        "def square_log(x):\n",
+        "  x = x * x\n",
+        "  print('Squared value of x =', x)\n",
+        "  return x\n",
+        "\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    print(sess.run(square_log(tf.constant(4))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "_R-Q7BbxmkBF"
+      },
+      "source": [
+        "#### Convert the TensorFlow code into Python code for AutoGraph"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "SwA11tO-yCvg"
+      },
+      "outputs": [],
+      "source": [
+        "def square_if_positive(x):\n",
+        "  x = tf.cond(tf.greater(x, 0), lambda: x * x, lambda: x)\n",
+        "  return x\n",
+        "\n",
+        "with tf.Session() as sess:\n",
+        "  print(sess.run(square_if_positive(tf.constant(4))))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "GPmx4CNhyPI_"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def square_if_positive(x):\n",
+        "\n",
+        "  pass # TODO: fill it in!\n",
+        "\n",
+        "\n",
+        "with tf.Session() as sess:\n",
+        "  print(sess.run(square_if_positive(tf.constant(4))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "qqsjik-QyA9R"
+      },
+      "source": [
+        "#### Uncollapse to see answer"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "DaSmaWUEvMRv"
+      },
+      "outputs": [],
+      "source": [
+        "# Simple cond\n",
+        "@autograph.convert()\n",
+        "def square_if_positive(x):\n",
+        "  if x \u003e 0:\n",
+        "    x = x * x\n",
+        "  return x\n",
+        "\n",
+        "with tf.Graph().as_default():  \n",
+        "  with tf.Session() as sess:\n",
+        "    print(sess.run(square_if_positive(tf.constant(4))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "qj7am2I_xvTJ"
+      },
+      "source": [
+        "#### Nested If statement"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "4yyNOf-Twr6s"
+      },
+      "outputs": [],
+      "source": [
+        "def nearest_odd_square(x):\n",
+        "\n",
+        "    def if_positive():\n",
+        "      x1 = x * x\n",
+        "      x1 = tf.cond(tf.equal(x1 % 2, 0), lambda: x1 + 1, lambda: x1)\n",
+        "      return x1,\n",
+        "\n",
+        "    x = tf.cond(tf.greater(x, 0), if_positive, lambda: x)\n",
+        "    return x\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    print(sess.run(nearest_odd_square(tf.constant(4))))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "hqmh5b2VyU9w"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def nearest_odd_square(x):\n",
+        "\n",
+        "  pass # TODO: fill it in!\n",
+        "\n",
+        "\n",
+        "with tf.Session() as sess:\n",
+        "  print(sess.run(nearest_odd_square(tf.constant(4))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "b9AXIkNLxp6J"
+      },
+      "source": [
+        "#### Uncollapse to reveal answer"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "8RlCVEpNxD91"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def nearest_odd_square(x):\n",
+        "  if x \u003e 0:\n",
+        "    x = x * x\n",
+        "    if x % 2 == 0:\n",
+        "      x = x + 1\n",
+        "  return x\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    print(sess.run(nearest_odd_square(tf.constant(4))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "jXAxjeBr1qWK"
+      },
+      "source": [
+        "#### Convert a while loop"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "kWkv7anlxoee"
+      },
+      "outputs": [],
+      "source": [
+        "# Convert a while loop\n",
+        "def square_until_stop(x, y):\n",
+        "  x = tf.while_loop(lambda x: tf.less(x, y), lambda x: x * x, [x])\n",
+        "  return x\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    print(sess.run(square_until_stop(tf.constant(4), tf.constant(100))))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "zVUsc1eA1u2K"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def square_until_stop(x, y):\n",
+        "\n",
+        "  pass # TODO: fill it in!\n",
+        "\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    print(sess.run(square_until_stop(tf.constant(4), tf.constant(100))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "L2psuzPI02S9"
+      },
+      "source": [
+        "#### Uncollapse for the answer\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "ucmZyQVL03bF"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def square_until_stop(x, y):\n",
+        "  while x \u003c y:\n",
+        "    x = x * x\n",
+        "  return x\n",
+        "\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    print(sess.run(square_until_stop(tf.constant(4), tf.constant(100))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "FXB0Zbwl13PY"
+      },
+      "source": [
+        "#### Nested loop and conditional"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "clGymxdf15Ig"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def argwhere_cumsum(x, threshold):\n",
+        "  current_sum = 0.0\n",
+        "  idx = 0\n",
+        "\n",
+        "  for i in range(len(x)):\n",
+        "    idx = i\n",
+        "    if current_sum \u003e= threshold:\n",
+        "      break\n",
+        "    current_sum += x[i]\n",
+        "  return idx\n",
+        "\n",
+        "n = 10\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    idx = argwhere_cumsum(tf.ones(n), tf.constant(float(n / 2)))\n",
+        "    print(sess.run(idx))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "i7PF-uId9lp5"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def argwhere_cumsum(x, threshold):\n",
+        "\n",
+        "  pass # TODO: fill it in!\n",
+        "\n",
+        "\n",
+        "n = 10\n",
+        "with tf.Graph().as_default():\n",
+        "  with tf.Session() as sess:\n",
+        "    idx = argwhere_cumsum(tf.ones(n), tf.constant(float(n / 2)))\n",
+        "    print(sess.run(idx))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "weKFXAb615Vp"
+      },
+      "source": [
+        "#### Uncollapse to see answer"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "1sjaFcL717Ig"
+      },
+      "outputs": [],
+      "source": [
+        "@autograph.convert()\n",
+        "def argwhere_cumsum(x, threshold):\n",
+        "  current_sum = 0.0\n",
+        "  idx = 0\n",
+        "  for i in range(len(x)):\n",
+        "    idx = i\n",
+        "    if current_sum \u003e= threshold:\n",
+        "      break\n",
+        "    current_sum += x[i]\n",
+        "  return idx\n",
+        "\n",
+        "n = 10\n",
+        "with tf.Graph().as_default():  \n",
+        "  with tf.Session() as sess:\n",
+        "    idx = argwhere_cumsum(tf.ones(n), tf.constant(float(n / 2)))\n",
+        "    print(sess.run(idx))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "4LfnJjm0Bm0B"
+      },
+      "source": [
+        "# 3. Training MNIST in-graph\n",
+        "\n",
+        "Writing control flow in AutoGraph is easy, so running a training loop in a TensorFlow graph should be easy as well!  \n",
+        "\n",
+        "Here, we show an example of training a simple Keras model on MNIST, where the entire training process -- loading batches, calculating gradients, updating parameters, calculating validation accuracy, and repeating until convergence -- is done in-graph."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "Em5dzSUOtLRP"
+      },
+      "source": [
+        "#### Download data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "xqoxumv0ssQW"
+      },
+      "outputs": [],
+      "source": [
+        "import gzip\n",
+        "import os\n",
+        "import shutil\n",
+        "\n",
+        "from six.moves import urllib\n",
+        "\n",
+        "\n",
+        "def download(directory, filename):\n",
+        "  filepath = os.path.join(directory, filename)\n",
+        "  if tf.gfile.Exists(filepath):\n",
+        "    return filepath\n",
+        "  if not tf.gfile.Exists(directory):\n",
+        "    tf.gfile.MakeDirs(directory)\n",
+        "  url = 'https://storage.googleapis.com/cvdf-datasets/mnist/' + filename + '.gz'\n",
+        "  zipped_filepath = filepath + '.gz'\n",
+        "  print('Downloading %s to %s' % (url, zipped_filepath))\n",
+        "  urllib.request.urlretrieve(url, zipped_filepath)\n",
+        "  with gzip.open(zipped_filepath, 'rb') as f_in, open(filepath, 'wb') as f_out:\n",
+        "    shutil.copyfileobj(f_in, f_out)\n",
+        "  os.remove(zipped_filepath)\n",
+        "  return filepath\n",
+        "\n",
+        "\n",
+        "def dataset(directory, images_file, labels_file):\n",
+        "  images_file = download(directory, images_file)\n",
+        "  labels_file = download(directory, labels_file)\n",
+        "\n",
+        "  def decode_image(image):\n",
+        "    # Normalize from [0, 255] to [0.0, 1.0]\n",
+        "    image = tf.decode_raw(image, tf.uint8)\n",
+        "    image = tf.cast(image, tf.float32)\n",
+        "    image = tf.reshape(image, [784])\n",
+        "    return image / 255.0\n",
+        "\n",
+        "  def decode_label(label):\n",
+        "    label = tf.decode_raw(label, tf.uint8)\n",
+        "    label = tf.reshape(label, [])\n",
+        "    return tf.to_int32(label)\n",
+        "\n",
+        "  images = tf.data.FixedLengthRecordDataset(\n",
+        "      images_file, 28 * 28, header_bytes=16).map(decode_image)\n",
+        "  labels = tf.data.FixedLengthRecordDataset(\n",
+        "      labels_file, 1, header_bytes=8).map(decode_label)\n",
+        "  return tf.data.Dataset.zip((images, labels))\n",
+        "\n",
+        "\n",
+        "def mnist_train(directory):\n",
+        "  return dataset(directory, 'train-images-idx3-ubyte',\n",
+        "                 'train-labels-idx1-ubyte')\n",
+        "\n",
+        "def mnist_test(directory):\n",
+        "  return dataset(directory, 't10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "znmy4l8ntMvW"
+      },
+      "source": [
+        "#### Define the model"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "Pe-erWQdBoC5"
+      },
+      "outputs": [],
+      "source": [
+        "def mlp_model(input_shape):\n",
+        "  model = tf.keras.Sequential((\n",
+        "      tf.keras.layers.Dense(100, activation='relu', input_shape=input_shape),\n",
+        "      tf.keras.layers.Dense(100, activation='relu'),\n",
+        "      tf.keras.layers.Dense(10, activation='softmax')))\n",
+        "  model.build()\n",
+        "  return model\n",
+        "\n",
+        "\n",
+        "def predict(m, x, y):\n",
+        "  y_p = m(x)\n",
+        "  losses = tf.keras.losses.categorical_crossentropy(y, y_p)\n",
+        "  l = tf.reduce_mean(losses)\n",
+        "  accuracies = tf.keras.metrics.categorical_accuracy(y, y_p)\n",
+        "  accuracy = tf.reduce_mean(accuracies)\n",
+        "  return l, accuracy\n",
+        "\n",
+        "\n",
+        "def fit(m, x, y, opt):\n",
+        "  l, accuracy = predict(m, x, y)\n",
+        "  opt.minimize(l)\n",
+        "  return l, accuracy\n",
+        "\n",
+        "\n",
+        "def setup_mnist_data(is_training, hp, batch_size):\n",
+        "  if is_training:\n",
+        "    ds = mnist_train('/tmp/autograph_mnist_data')\n",
+        "    ds = ds.shuffle(batch_size * 10)\n",
+        "  else:\n",
+        "    ds = mnist_test('/tmp/autograph_mnist_data')\n",
+        "  ds = ds.repeat()\n",
+        "  ds = ds.batch(batch_size)\n",
+        "  return ds\n",
+        "\n",
+        "\n",
+        "def get_next_batch(ds):\n",
+        "  itr = ds.make_one_shot_iterator()\n",
+        "  image, label = itr.get_next()\n",
+        "  x = tf.to_float(tf.reshape(image, (-1, 28 * 28)))\n",
+        "  y = tf.one_hot(tf.squeeze(label), 10)\n",
+        "  return x, y"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "oeYV6mKnJGMr"
+      },
+      "source": [
+        "#### Define the training loop"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "3xtg_MMhJETd"
+      },
+      "outputs": [],
+      "source": [
+        "def train(train_ds, test_ds, hp):\n",
+        "  m = mlp_model((28 * 28,))\n",
+        "  opt = tf.train.MomentumOptimizer(hp.learning_rate, 0.9)\n",
+        "\n",
+        "  # We'd like to save our losses to a list. In order for AutoGraph\n",
+        "  # to convert these lists into their graph equivalent,\n",
+        "  # we need to specify the element type of the lists.\n",
+        "  train_losses = []\n",
+        "  test_losses = []\n",
+        "  train_accuracies = []\n",
+        "  test_accuracies = []\n",
+        "  autograph.set_element_type(train_losses, tf.float32)\n",
+        "  autograph.set_element_type(test_losses, tf.float32)\n",
+        "  autograph.set_element_type(train_accuracies, tf.float32)\n",
+        "  autograph.set_element_type(test_accuracies, tf.float32)\n",
+        "\n",
+        "  # This entire training loop will be run in-graph.\n",
+        "  i = tf.constant(0)\n",
+        "  while i \u003c hp.max_steps:\n",
+        "    train_x, train_y = get_next_batch(train_ds)\n",
+        "    test_x, test_y = get_next_batch(test_ds)\n",
+        "\n",
+        "    step_train_loss, step_train_accuracy = fit(m, train_x, train_y, opt)\n",
+        "    step_test_loss, step_test_accuracy = predict(m, test_x, test_y)\n",
+        "\n",
+        "    if i % (hp.max_steps // 10) == 0:\n",
+        "      print('Step', i, 'train loss:', step_train_loss, 'test loss:',\n",
+        "            step_test_loss, 'train accuracy:', step_train_accuracy,\n",
+        "            'test accuracy:', step_test_accuracy)\n",
+        "\n",
+        "    train_losses.append(step_train_loss)\n",
+        "    test_losses.append(step_test_loss)\n",
+        "    train_accuracies.append(step_train_accuracy)\n",
+        "    test_accuracies.append(step_test_accuracy)\n",
+        "\n",
+        "    i += 1\n",
+        "\n",
+        "  # We've recorded our loss values and accuracies\n",
+        "  # to a list in a graph with AutoGraph's help.\n",
+        "  # In order to return the values as a Tensor,\n",
+        "  # we need to stack them before returning them.\n",
+        "  return (\n",
+        "      autograph.stack(train_losses),\n",
+        "      autograph.stack(test_losses),\n",
+        "      autograph.stack(train_accuracies),\n",
+        "      autograph.stack(test_accuracies),\n",
+        "  )"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "HYh6MSZyJOag"
+      },
+      "outputs": [],
+      "source": [
+        "with tf.Graph().as_default():\n",
+        "  hp = tf.contrib.training.HParams(\n",
+        "      learning_rate=0.05,\n",
+        "      max_steps=500,\n",
+        "  )\n",
+        "  train_ds = setup_mnist_data(True, hp, 50)\n",
+        "  test_ds = setup_mnist_data(False, hp, 1000)\n",
+        "  tf_train = autograph.to_graph(train)\n",
+        "  loss_tensors = tf_train(train_ds, test_ds, hp)\n",
+        "\n",
+        "  with tf.Session() as sess:\n",
+        "    sess.run(tf.global_variables_initializer())\n",
+        "    (\n",
+        "        train_losses,\n",
+        "        test_losses,\n",
+        "        train_accuracies,\n",
+        "        test_accuracies\n",
+        "    ) = sess.run(loss_tensors)\n",
+        "\n",
+        "    plt.title('MNIST train/test losses')\n",
+        "    plt.plot(train_losses, label='train loss')\n",
+        "    plt.plot(test_losses, label='test loss')\n",
+        "    plt.legend()\n",
+        "    plt.xlabel('Training step')\n",
+        "    plt.ylabel('Loss')\n",
+        "    plt.show()\n",
+        "    plt.title('MNIST train/test accuracies')\n",
+        "    plt.plot(train_accuracies, label='train accuracy')\n",
+        "    plt.plot(test_accuracies, label='test accuracy')\n",
+        "    plt.legend(loc='lower right')\n",
+        "    plt.xlabel('Training step')\n",
+        "    plt.ylabel('Accuracy')\n",
+        "    plt.show()"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [
+        "qqsjik-QyA9R",
+        "b9AXIkNLxp6J",
+        "L2psuzPI02S9",
+        "weKFXAb615Vp",
+        "Em5dzSUOtLRP"
+      ],
+      "default_view": {},
+      "name": "AutoGraph Workshop.ipynb",
+      "provenance": [
+        {
+          "file_id": "1kE2gz_zuwdYySL4K2HQSz13uLCYi-fYP",
+          "timestamp": 1530563781803
+        }
+      ],
+      "version": "0.3.2",
+      "views": {}
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/autograph/impl/api.py b/tensorflow/contrib/autograph/impl/api.py
index c7401c7df1..4729c735c6 100644
--- a/tensorflow/contrib/autograph/impl/api.py
+++ b/tensorflow/contrib/autograph/impl/api.py
@@ -23,7 +23,6 @@ from functools import wraps
 from enum import Enum
 
 # pylint:disable=g-bad-import-order
-import gast
 import six
 # pylint:enable=g-bad-import-order
 
@@ -69,7 +68,8 @@ def convert(recursive=False, verbose=False, arg_types=None):
 
     @wraps(f)
     def wrapper(*args, **kwargs):
-      return converted_call(f, recursive, verbose, arg_types, *args, **kwargs)
+      return converted_call(f, recursive, verbose, True, arg_types, *args,
+                            **kwargs)
 
     wrapper = tf_decorator.make_decorator(f, wrapper)
 
@@ -99,6 +99,7 @@ def do_not_convert(run_as=RunMode.GRAPH, return_dtypes=None):
   Returns:
     A decorator that wraps the original function.
   """
+
   def decorator(f):
     """Decorator implementation."""
 
@@ -109,8 +110,7 @@ def do_not_convert(run_as=RunMode.GRAPH, return_dtypes=None):
     @wraps(f)
     def py_func_wrapper(*args, **kwargs):
       if kwargs:
-        raise NotImplementedError(
-            'RunMode.PY_FUNC does not yet support kwargs')
+        raise NotImplementedError('RunMode.PY_FUNC does not yet support kwargs')
       # TODO(mdan): Add support for kwargs.
       return py_func.wrap_py_func(
           f, return_dtypes, args, kwargs, use_dummy_return=not return_dtypes)
@@ -130,12 +130,12 @@ def do_not_convert(run_as=RunMode.GRAPH, return_dtypes=None):
   return decorator
 
 
-def converted_call(f, recursive, verbose, arg_types, *args, **kwargs):
+def converted_call(f, recursive, verbose, force_conversion, arg_types, *args,
+                   **kwargs):
   """Compiles a function call inline."""
   # TODO(mdan): This needs cleanup.
   # In particular, we may want to avoid renaming functions altogether.
-
-  if conversion.is_whitelisted_for_graph(f):
+  if not force_conversion and conversion.is_whitelisted_for_graph(f):
     return f(*args, **kwargs)
 
   unknown_arg_value = object()  # Sentinel for arguments of unknown value
@@ -231,7 +231,10 @@ def to_graph(e,
 
   Returns:
     A function with a signature identical to `o`, but which when executed it
-  creates TF a graph that has the same functionality as the original entity.
+    creates TF a graph that has the same functionality as the original entity.
+  Raises:
+    ValueError: If the converted function defines or refers to symbol names that
+    are reserved for AutoGraph.
   """
   program_ctx = converter.ProgramContext(
       recursive=recursive,
@@ -242,24 +245,41 @@ def to_graph(e,
   _, name, namespace = conversion.entity_to_graph(e, program_ctx, arg_values,
                                                   arg_types)
 
-  module = gast.Module([])
+  nodes = []
   for dep in reversed(program_ctx.dependency_cache.values()):
-    module.body.append(dep)
-  compiled_node, compiled_src = compiler.ast_to_object(
-      module, source_prefix=program_ctx.required_imports)
+    nodes.extend(dep)
+  compiled_module, compiled_src = compiler.ast_to_object(
+      nodes,
+      source_prefix=program_ctx.required_imports,
+      include_source_map=True)
 
   # The compiled code should see everything the entry entity saw.
   # TODO(mdan): This might not work well if the call tree spans modules?
   for key, val in namespace.items():
     # Avoid overwriting entities that have been transformed.
-    if key not in compiled_node.__dict__:
-      compiled_node.__dict__[key] = val
-  compiled_fn = getattr(compiled_node, name)
+    if key not in compiled_module.__dict__:
+      compiled_module.__dict__[key] = val
+  compiled = getattr(compiled_module, name)
+
+  # Need this so the source_mapping attribute is available for the context
+  # manager to access for runtime errors.
+  #
+  # Note that compiler.ast_to_object attaches the source map 'ag_source_map__'
+  # symbol to the compiled module.
+  # TODO(mdan): Record this statically in the generated code.
+  # TODO(mdan): Rename this attribute to 'autograph_info__'
+  source_map_attribute_name = 'ag_source_map'
+  if getattr(compiled, source_map_attribute_name, None) is not None:
+    raise ValueError('cannot convert %s because is has an attribute '
+                     '"%s", which is reserved for AutoGraph.' %
+                     (compiled, source_map_attribute_name))
+  setattr(compiled, source_map_attribute_name,
+          compiled_module.__dict__['ag_source_map__'])
 
   if verbose:
     logging.info('Compiled output of %s:\n\n%s\n', e, compiled_src)
 
-  return compiled_fn
+  return compiled
 
 
 def to_code(e,
diff --git a/tensorflow/contrib/autograph/impl/api_test.py b/tensorflow/contrib/autograph/impl/api_test.py
index 9943093332..803fde9089 100644
--- a/tensorflow/contrib/autograph/impl/api_test.py
+++ b/tensorflow/contrib/autograph/impl/api_test.py
@@ -183,8 +183,8 @@ class ApiTest(test.TestCase):
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
         while tf.reduce_sum(x) > s:
-          x //= api.converted_call(self.called_member, False, False, {}, self,
-                                   a)
+          x //= api.converted_call(self.called_member, False, False, False, {},
+                                   self, a)
         return x
 
     tc = TestClass()
@@ -195,7 +195,7 @@ class ApiTest(test.TestCase):
       self.assertListEqual([0, 1], sess.run(x).tolist())
 
   def test_converted_call_builtin(self):
-    x = api.converted_call(range, False, False, {}, 3)
+    x = api.converted_call(range, False, False, False, {}, 3)
     self.assertEqual((0, 1, 2), tuple(x))
 
   def test_converted_call_function(self):
@@ -206,8 +206,8 @@ class ApiTest(test.TestCase):
       return x
 
     with self.test_session() as sess:
-      x = api.converted_call(
-          test_fn, False, False, {}, constant_op.constant(-1))
+      x = api.converted_call(test_fn, False, False, False, {},
+                             constant_op.constant(-1))
       self.assertEqual(1, sess.run(x))
 
   def test_converted_call_method(self):
@@ -224,7 +224,7 @@ class ApiTest(test.TestCase):
 
     with self.test_session() as sess:
       tc = TestClass(constant_op.constant(-1))
-      x = api.converted_call(tc.test_method, False, False, {}, tc)
+      x = api.converted_call(tc.test_method, False, False, False, {}, tc)
       self.assertEqual(1, sess.run(x))
 
   def test_converted_call_method_by_class(self):
@@ -241,7 +241,7 @@ class ApiTest(test.TestCase):
 
     with self.test_session() as sess:
       tc = TestClass(constant_op.constant(-1))
-      x = api.converted_call(TestClass.test_method, False, False, {}, tc)
+      x = api.converted_call(TestClass.test_method, False, False, False, {}, tc)
       self.assertEqual(1, sess.run(x))
 
   def test_converted_call_callable_object(self):
@@ -258,7 +258,7 @@ class ApiTest(test.TestCase):
 
     with self.test_session() as sess:
       tc = TestClass(constant_op.constant(-1))
-      x = api.converted_call(tc, False, False, {})
+      x = api.converted_call(tc, False, False, False, {})
       self.assertEqual(1, sess.run(x))
 
   def test_converted_call_constructor(self):
@@ -274,12 +274,27 @@ class ApiTest(test.TestCase):
         return self.x
 
     with self.test_session() as sess:
-      tc = api.converted_call(
-          TestClass, False, False, {}, constant_op.constant(-1))
+      tc = api.converted_call(TestClass, False, False, False, {},
+                              constant_op.constant(-1))
       # tc is now a converted object.
       x = tc.test_method()
       self.assertEqual(1, sess.run(x))
 
+  def test_converted_call_already_converted(self):
+
+    def f(x):
+      return x == 0
+
+    with self.test_session() as sess:
+      x = api.converted_call(f, False, False, False, {},
+                             constant_op.constant(0))
+      self.assertTrue(sess.run(x))
+
+      converted_f = api.to_graph(f)
+      x = api.converted_call(converted_f, False, False, False, {},
+                             constant_op.constant(0))
+      self.assertTrue(sess.run(x))
+
   def test_to_graph_basic(self):
 
     def test_fn(x, s):
@@ -305,6 +320,13 @@ class ApiTest(test.TestCase):
     # Just check that it is parseable Python code.
     self.assertIsNotNone(parser.parse_str(compiled_code))
 
+  def test_source_map_attribute_present(self):
+
+    def test_fn(y):
+      return y**2
+
+    self.assertTrue(hasattr(api.to_graph(test_fn), 'ag_source_map'))
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/impl/conversion.py b/tensorflow/contrib/autograph/impl/conversion.py
index 776d19f672..fc8a976d3f 100644
--- a/tensorflow/contrib/autograph/impl/conversion.py
+++ b/tensorflow/contrib/autograph/impl/conversion.py
@@ -28,26 +28,28 @@ from tensorflow.contrib.autograph.converters import asserts
 from tensorflow.contrib.autograph.converters import break_statements
 from tensorflow.contrib.autograph.converters import builtin_functions
 from tensorflow.contrib.autograph.converters import call_trees
+from tensorflow.contrib.autograph.converters import conditional_expressions
 from tensorflow.contrib.autograph.converters import continue_statements
 from tensorflow.contrib.autograph.converters import control_flow
 from tensorflow.contrib.autograph.converters import decorators
-from tensorflow.contrib.autograph.converters import ifexp
+from tensorflow.contrib.autograph.converters import directives
+from tensorflow.contrib.autograph.converters import error_handlers
 from tensorflow.contrib.autograph.converters import lists
 from tensorflow.contrib.autograph.converters import logical_expressions
 from tensorflow.contrib.autograph.converters import name_scopes
+from tensorflow.contrib.autograph.converters import return_statements
 from tensorflow.contrib.autograph.converters import side_effect_guards
-from tensorflow.contrib.autograph.converters import single_return
 from tensorflow.contrib.autograph.converters import slices
 from tensorflow.contrib.autograph.core import config
 from tensorflow.contrib.autograph.core import converter
+from tensorflow.contrib.autograph.core import errors
 from tensorflow.contrib.autograph.pyct import ast_util
 from tensorflow.contrib.autograph.pyct import inspect_utils
+from tensorflow.contrib.autograph.pyct import origin_info
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import qual_names
+from tensorflow.contrib.autograph.pyct import templates
 from tensorflow.contrib.autograph.pyct import transformer
-from tensorflow.contrib.autograph.pyct.static_analysis import activity
-from tensorflow.contrib.autograph.pyct.static_analysis import live_values
-from tensorflow.contrib.autograph.pyct.static_analysis import type_info
 from tensorflow.python.util import tf_inspect
 
 
@@ -69,6 +71,8 @@ def is_whitelisted_for_graph(o):
   for prefix, in config.DEFAULT_UNCOMPILED_MODULES:
     if m.__name__.startswith(prefix):
       return True
+  if hasattr(o, 'autograph_info__'):
+    return True
   return False
 
 
@@ -114,12 +118,32 @@ def entity_to_graph(o, program_ctx, arg_values, arg_types):
       node, name, ns = function_to_graph(o, program_ctx, arg_values, arg_types)
   elif tf_inspect.ismethod(o):
     node, name, ns = function_to_graph(o, program_ctx, arg_values, arg_types)
+  # TODO(mdan,yashkatariya): Remove when object conversion is implemented.
+  elif hasattr(o, '__class__'):
+    raise NotImplementedError(
+        'Object conversion is not yet supported. If you are '
+        'trying to convert code that uses an existing object, '
+        'try including the creation of that object in the '
+        'conversion. For example, instead of converting the method '
+        'of a class, try converting the entire class instead. '
+        'See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/'
+        'contrib/autograph/README.md#using-the-functional-api '
+        'for more information.')
   else:
     raise ValueError(
         'Entity "%s" has unsupported type "%s". Only functions and classes are '
         'supported for now.' % (o, type(o)))
 
+  # TODO(mdan): This is temporary. it should be created using a converter.
+  # TODO(mdan): The attribute should be added with a helper, not directly.
+  # The helper can ensure there are no collisions.
+  template = '''
+      entity.autograph_info__ = {}
+  '''
+  node.extend(templates.replace(template, entity=name))
+
   program_ctx.add_to_cache(o, node)
+
   if program_ctx.recursive:
     while True:
       candidate = None
@@ -157,26 +181,27 @@ def class_to_graph(c, program_ctx):
         program_ctx=program_ctx,
         arg_values={},
         arg_types={'self': (c.__name__, c)},
-        owner_type=c)
+        owner_type=c,
+        rewrite_errors=False)
     if class_namespace is None:
       class_namespace = namespace
     else:
       class_namespace.update(namespace)
-    converted_members[m] = node
+    converted_members[m] = node[0]
   namer = program_ctx.new_namer(class_namespace)
   class_name = namer.compiled_class_name(c.__name__, c)
 
   # TODO(mdan): This needs to be explained more thoroughly.
-  # Process any base classes: if the sueprclass if of a whitelisted type, an
+  # Process any base classes: if the superclass if of a whitelisted type, an
   # absolute import line is generated. Otherwise, it is marked for conversion
   # (as a side effect of the call to namer.compiled_class_name() followed by
   # program_ctx.update_name_map(namer)).
   output_nodes = []
   renames = {}
-  bases = []
+  base_names = []
   for base in c.__bases__:
     if isinstance(object, base):
-      bases.append('object')
+      base_names.append('object')
       continue
     if is_whitelisted_for_graph(base):
       alias = namer.new_symbol(base.__name__, ())
@@ -188,28 +213,28 @@ def class_to_graph(c, program_ctx):
     else:
       # This will trigger a conversion into a class with this name.
       alias = namer.compiled_class_name(base.__name__, base)
-    bases.append(alias)
+    base_names.append(alias)
     renames[qual_names.QN(base.__name__)] = qual_names.QN(alias)
   program_ctx.update_name_map(namer)
 
   # Generate the definition of the converted class.
-  output_nodes.append(
-      gast.ClassDef(
-          class_name,
-          bases=bases,
-          keywords=[],
-          body=list(converted_members.values()),
-          decorator_list=[]))
-  node = gast.Module(output_nodes)
-
+  bases = [gast.Name(n, gast.Load(), None) for n in base_names]
+  class_def = gast.ClassDef(
+      class_name,
+      bases=bases,
+      keywords=[],
+      body=list(converted_members.values()),
+      decorator_list=[])
   # Make a final pass to replace references to the class or its base classes.
   # Most commonly, this occurs when making super().__init__() calls.
   # TODO(mdan): Making direct references to superclass' superclass will fail.
-  node = qual_names.resolve(node)
+  class_def = qual_names.resolve(class_def)
   renames[qual_names.QN(c.__name__)] = qual_names.QN(class_name)
-  node = ast_util.rename_symbols(node, renames)
+  class_def = ast_util.rename_symbols(class_def, renames)
 
-  return node, class_name, class_namespace
+  output_nodes.append(class_def)
+
+  return output_nodes, class_name, class_namespace
 
 
 def _add_reserved_symbol(namespace, name, entity):
@@ -231,6 +256,8 @@ def _add_self_references(namespace, autograph_module):
     ag_internal = imp.new_module('autograph')
     ag_internal.converted_call = autograph_module.converted_call
     ag_internal.utils = utils
+    ag_internal.rewrite_graph_construction_error = (
+        errors.rewrite_graph_construction_error)
     # TODO(mdan): Add safeguards against name clashes.
     # We don't want to create a submodule because we want the operators to be
     # accessible as ag__.<operator>
@@ -239,11 +266,17 @@ def _add_self_references(namespace, autograph_module):
   _add_reserved_symbol(namespace, 'ag__', ag_internal)
 
 
-def function_to_graph(f, program_ctx, arg_values, arg_types, owner_type=None):
+def function_to_graph(f,
+                      program_ctx,
+                      arg_values,
+                      arg_types,
+                      owner_type=None,
+                      rewrite_errors=True):
   """Specialization of `entity_to_graph` for callable functions."""
+
   node, source = parser.parse_entity(f)
   node = node.body[0]
-
+  origin_info.resolve(node, source, f)
   namespace = inspect_utils.getnamespace(f)
   _add_self_references(namespace, program_ctx.autograph_module)
   namer = program_ctx.new_namer(namespace)
@@ -256,38 +289,29 @@ def function_to_graph(f, program_ctx, arg_values, arg_types, owner_type=None):
       arg_types=arg_types,
       owner_type=owner_type)
   context = converter.EntityContext(namer, entity_info, program_ctx)
-  node = node_to_graph(node, context)
+  node = node_to_graph(node, context, rewrite_errors=rewrite_errors)
 
-  # TODO(mdan): This somewhat duplicates the call rename logic in call_treest.py
+  # TODO(mdan): This somewhat duplicates the call rename logic in call_trees.py
   new_name, did_rename = namer.compiled_function_name(f.__name__, f, owner_type)
   if not did_rename:
     new_name = f.__name__
     if node.name != f.__name__:
       raise NotImplementedError('Strange corner case. Send us offending code!')
-
   node.name = new_name
+
   program_ctx.update_name_map(namer)
   # TODO(mdan): Use this at compilation.
 
-  return node, new_name, namespace
-
-
-def _apply_transformer(node, context, converter_module):
-  # TODO(mdan): Clear static analysis here.
-  node = qual_names.resolve(node)
-  node = activity.resolve(node, context.info, None)
-  node = live_values.resolve(node, context.info, config.PYTHON_LITERALS)
-  node = type_info.resolve(node, context.info)
-  node = converter_module.transform(node, context)
-  return node
+  return [node], new_name, namespace
 
 
-def node_to_graph(node, context):
+def node_to_graph(node, context, rewrite_errors=True):
   """Convert Python code to equivalent TF graph mode code.
 
   Args:
     node: AST, the code to convert.
     context: converter.EntityContext
+    rewrite_errors: Boolean, whether or not to rewrite the error traceback.
 
   Returns:
     A tuple (node, deps):
@@ -295,28 +319,33 @@ def node_to_graph(node, context):
         * deps: A set of strings, the fully qualified names of entity
             dependencies that this node has.
   """
-  # TODO(mdan): Verify arguments for correctness.
+  # TODO(mdan): Insert list_comprehensions somewhere.
 
-  node = _apply_transformer(node, context, ifexp)
+  node = converter.standard_analysis(node, context, is_initial=True)
   # Past this point, line numbers are no longer accurate so we ignore the
   # source.
   # TODO(mdan): Is it feasible to reconstruct intermediate source code?
   context.info.source_code = None
-  node = _apply_transformer(node, context, decorators)
-  node = _apply_transformer(node, context, break_statements)
-  node = _apply_transformer(node, context, asserts)
+
+  node = converter.apply_(node, context, decorators)
+  node = converter.apply_(node, context, directives)
+  node = converter.apply_(node, context, break_statements)
+  node = converter.apply_(node, context, asserts)
   # Note: sequencing continue canonicalization before for loop one avoids
   # dealing with the extra loop increment operation that the for
   # canonicalization creates.
-  node = _apply_transformer(node, context, continue_statements)
+  node = converter.apply_(node, context, continue_statements)
   context.info.namespace['len'] = len
-  node = _apply_transformer(node, context, single_return)
-  node = _apply_transformer(node, context, lists)
-  node = _apply_transformer(node, context, slices)
-  node = _apply_transformer(node, context, builtin_functions)
-  node = _apply_transformer(node, context, call_trees)
-  node = _apply_transformer(node, context, control_flow)
-  node = _apply_transformer(node, context, logical_expressions)
-  node = _apply_transformer(node, context, side_effect_guards)
-  node = _apply_transformer(node, context, name_scopes)
+  node = converter.apply_(node, context, return_statements)
+  node = converter.apply_(node, context, lists)
+  node = converter.apply_(node, context, slices)
+  node = converter.apply_(node, context, builtin_functions)
+  node = converter.apply_(node, context, call_trees)
+  node = converter.apply_(node, context, control_flow)
+  node = converter.apply_(node, context, conditional_expressions)
+  node = converter.apply_(node, context, logical_expressions)
+  node = converter.apply_(node, context, side_effect_guards)
+  node = converter.apply_(node, context, name_scopes)
+  if rewrite_errors:
+    node = converter.apply_(node, context, error_handlers)
   return node
diff --git a/tensorflow/contrib/autograph/impl/conversion_test.py b/tensorflow/contrib/autograph/impl/conversion_test.py
index f5279298af..86432573a7 100644
--- a/tensorflow/contrib/autograph/impl/conversion_test.py
+++ b/tensorflow/contrib/autograph/impl/conversion_test.py
@@ -50,7 +50,7 @@ class ConversionTest(test.TestCase):
     self.assertTrue(conversion.is_whitelisted_for_graph(constant_op.constant))
 
   def test_entity_to_graph_unsupported_types(self):
-    with self.assertRaises(ValueError):
+    with self.assertRaises(NotImplementedError):
       program_ctx = self._simple_program_ctx()
       conversion.entity_to_graph('dummy', program_ctx, None, None)
 
@@ -60,10 +60,11 @@ class ConversionTest(test.TestCase):
       return a + b
 
     program_ctx = self._simple_program_ctx()
-    ast, name, ns = conversion.entity_to_graph(f, program_ctx, None, None)
-    self.assertTrue(isinstance(ast, gast.FunctionDef), ast)
+    nodes, name, ns = conversion.entity_to_graph(f, program_ctx, None, None)
+    fn_node, _ = nodes
+    self.assertIsInstance(fn_node, gast.FunctionDef)
     self.assertEqual('tf__f', name)
-    self.assertTrue(ns['b'] is b)
+    self.assertIs(ns['b'], b)
 
   def test_entity_to_graph_call_tree(self):
 
@@ -78,12 +79,11 @@ class ConversionTest(test.TestCase):
 
     self.assertTrue(f in program_ctx.dependency_cache)
     self.assertTrue(g in program_ctx.dependency_cache)
-    self.assertEqual('tf__f', program_ctx.dependency_cache[f].name)
-    # need the extra .body[0] in order to step past the with tf.name_scope('f')
-    # that is added automatically
-    self.assertEqual(
-        'tf__g', program_ctx.dependency_cache[f].body[0].body[0].value.func.id)
-    self.assertEqual('tf__g', program_ctx.dependency_cache[g].name)
+    f_node = program_ctx.dependency_cache[f][0]
+    g_node = program_ctx.dependency_cache[g][0]
+    self.assertEqual('tf__f', f_node.name)
+    self.assertEqual('tf__g', f_node.body[0].body[0].body[0].value.func.id)
+    self.assertEqual('tf__g', g_node.name)
 
   def test_entity_to_graph_class_hierarchy(self):
 
@@ -115,10 +115,12 @@ class ConversionTest(test.TestCase):
 
     self.assertTrue(TestBase in program_ctx.dependency_cache)
     self.assertTrue(TestSubclass in program_ctx.dependency_cache)
+    # The returned nodes will include:
+    # <import nodes>, <class node>, <assignment node>
     self.assertEqual('TfTestBase',
-                     program_ctx.dependency_cache[TestBase].body[-1].name)
+                     program_ctx.dependency_cache[TestBase][-2].name)
     self.assertEqual('TfTestSubclass',
-                     program_ctx.dependency_cache[TestSubclass].body[-1].name)
+                     program_ctx.dependency_cache[TestSubclass][-2].name)
 
   def test_entity_to_graph_class_hierarchy_whitelisted(self):
 
@@ -137,10 +139,11 @@ class ConversionTest(test.TestCase):
     self.assertTrue(TestSubclass in program_ctx.dependency_cache)
     self.assertFalse(training.Model in program_ctx.dependency_cache)
     self.assertEqual(
-        'Model',
-        program_ctx.dependency_cache[TestSubclass].body[0].names[0].name)
+        'Model', program_ctx.dependency_cache[TestSubclass][0].names[0].name)
+    # The returned nodes will include:
+    # <import nodes>, <class node>, <assignment node>
     self.assertEqual('TfTestSubclass',
-                     program_ctx.dependency_cache[TestSubclass].body[-1].name)
+                     program_ctx.dependency_cache[TestSubclass][-2].name)
 
   def test_entity_to_graph_lambda(self):
     f = lambda a: a
diff --git a/tensorflow/contrib/autograph/lang/special_functions.py b/tensorflow/contrib/autograph/lang/special_functions.py
index 11135295a7..6149cbbd6c 100644
--- a/tensorflow/contrib/autograph/lang/special_functions.py
+++ b/tensorflow/contrib/autograph/lang/special_functions.py
@@ -26,6 +26,43 @@ from __future__ import print_function
 from tensorflow.contrib.autograph.operators import data_structures
 
 
+def tensor_list(elements,
+                element_dtype=None,
+                element_shape=None,
+                use_tensor_array=False):
+  """Creates an tensor list and populates it with the given elements.
+
+  This function provides a more uniform access to tensor lists and tensor
+  arrays, and allows optional initialization.
+
+  Note: this function is a simplified wrapper. If you need greater control,
+  it is recommended to use the underlying implementation directly.
+
+  Args:
+    elements: Iterable[tf.Tensor, ...], the elements to initially fill the list
+        with
+    element_dtype: Optional[tf.DType], data type for the elements in the list;
+        required if the list is empty
+    element_shape: Optional[tf.TensorShape], shape for the elements in the list;
+        required if the list is empty
+    use_tensor_array: bool, whether to use the more compatible but restrictive
+        tf.TensorArray implementation
+  Returns:
+    Union[tf.Tensor, tf.TensorArray], the new list.
+  Raises:
+    ValueError: for invalid arguments
+  """
+  if not (elements or (element_dtype and element_shape)):
+    raise ValueError(
+        'element_dtype and element_shape are required for empty lists')
+  if use_tensor_array:
+    return data_structures.tf_tensor_array_new(elements, element_dtype,
+                                               element_shape)
+  else:
+    return data_structures.tf_tensor_list_new(elements, element_dtype,
+                                              element_shape)
+
+
 def stack(list_or_tensor, element_dtype=None, strict=True):
   """Stacks the input, if it admits the notion of stacking.
 
diff --git a/tensorflow/contrib/autograph/lang/special_functions_test.py b/tensorflow/contrib/autograph/lang/special_functions_test.py
index a49cb64075..db492cc5c6 100644
--- a/tensorflow/contrib/autograph/lang/special_functions_test.py
+++ b/tensorflow/contrib/autograph/lang/special_functions_test.py
@@ -28,7 +28,23 @@ from tensorflow.python.platform import test
 
 class SpecialFunctionsTest(test.TestCase):
 
-  def test_basic(self):
+  def test_tensor_list_from_elements(self):
+    elements = [constant_op.constant([1, 2]), constant_op.constant([3, 4])]
+
+    l = special_functions.tensor_list(elements)
+    sl = list_ops.tensor_list_stack(l, element_dtype=dtypes.int32)
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(sl), [[1, 2], [3, 4]])
+
+  def test_tensor_list_array_from_elements(self):
+    elements = [constant_op.constant([1, 2]), constant_op.constant([3, 4])]
+
+    l = special_functions.tensor_list(elements, use_tensor_array=True)
+    sl = l.stack()
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(sl), [[1, 2], [3, 4]])
+
+  def test_stack(self):
     self.assertEqual(special_functions.stack(1, strict=False), 1)
     self.assertListEqual(
         special_functions.stack([1, 2, 3], strict=False), [1, 2, 3])
diff --git a/tensorflow/contrib/autograph/operators/__init__.py b/tensorflow/contrib/autograph/operators/__init__.py
index c900fd6af2..392cb60bcc 100644
--- a/tensorflow/contrib/autograph/operators/__init__.py
+++ b/tensorflow/contrib/autograph/operators/__init__.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""This module implements operators that we overload.
+"""This module implements operators that AutoGraph overloads.
 
 Note that "operator" is used loosely here, and includes control structures like
 conditionals and loops, implemented in functional form, using for example
diff --git a/tensorflow/contrib/autograph/operators/data_structures.py b/tensorflow/contrib/autograph/operators/data_structures.py
index 06d8727b0f..cc0a3c3544 100644
--- a/tensorflow/contrib/autograph/operators/data_structures.py
+++ b/tensorflow/contrib/autograph/operators/data_structures.py
@@ -28,7 +28,6 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import list_ops
 from tensorflow.python.ops import tensor_array_ops
-from tensorflow.python.ops import variables
 
 
 # TODO(mdan): Once control flow supports objects, repackage as a class.
@@ -48,29 +47,101 @@ def new_list(iterable=None):
   else:
     elements = ()
 
-  # TODO(mdan): Extend these criteria.
-  if any(isinstance(el, variables.Variable) for el in elements):
+  if elements:
+    # When the list contains elements, it is assumed to be a "Python" lvalue
+    # list.
     return _py_list_new(elements)
-  return _tf_tensor_list_new(elements)
+  return tf_tensor_list_new(elements)
 
 
-def _tf_tensor_list_new(elements):
+def tf_tensor_array_new(elements, element_dtype=None, element_shape=None):
   """Overload of new_list that stages a Tensor list creation."""
   elements = tuple(ops.convert_to_tensor(el) for el in elements)
+
+  all_dtypes = set(el.dtype for el in elements)
+  if len(all_dtypes) == 1:
+    inferred_dtype, = tuple(all_dtypes)
+    if element_dtype is not None and element_dtype != inferred_dtype:
+      raise ValueError(
+          'incompatible dtype; specified: {}, inferred from {}: {}'.format(
+              element_dtype, elements, inferred_dtype))
+  elif len(all_dtypes) > 1:
+    raise ValueError(
+        'TensorArray requires all elements to have the same dtype:'
+        ' {}'.format(elements))
+  else:
+    if element_dtype is None:
+      raise ValueError('dtype is required to create an empty TensorArray')
+
+  all_shapes = set(tuple(el.shape.as_list()) for el in elements)
+  if len(all_shapes) == 1:
+    inferred_shape, = tuple(all_shapes)
+    if element_shape is not None and element_shape != inferred_shape:
+      raise ValueError(
+          'incompatible shape; specified: {}, inferred from {}: {}'.format(
+              element_shape, elements, inferred_shape))
+  elif len(all_shapes) > 1:
+    raise ValueError(
+        'TensorArray requires all elements to have the same shape:'
+        ' {}'.format(elements))
+    # TODO(mdan): We may want to allow different shapes with infer_shape=False.
+  else:
+    inferred_shape = None
+
+  if element_dtype is None:
+    element_dtype = inferred_dtype
+  if element_shape is None:
+    element_shape = inferred_shape
+
+  l = tensor_array_ops.TensorArray(
+      dtype=element_dtype,
+      size=len(elements),
+      dynamic_size=True,
+      infer_shape=(element_shape is None),
+      element_shape=element_shape)
+  for i, el in enumerate(elements):
+    l = l.write(i, el)
+  return l
+
+
+def tf_tensor_list_new(elements, element_dtype=None, element_shape=None):
+  """Overload of new_list that stages a Tensor list creation."""
+  elements = tuple(ops.convert_to_tensor(el) for el in elements)
+
   all_dtypes = set(el.dtype for el in elements)
   if len(all_dtypes) == 1:
-    element_dtype = tuple(all_dtypes)[0]
+    inferred_dtype = tuple(all_dtypes)[0]
+    if element_dtype is not None and element_dtype != inferred_dtype:
+      raise ValueError(
+          'incompatible dtype; specified: {}, inferred from {}: {}'.format(
+              element_dtype, elements, inferred_dtype))
   else:
     # Heterogeneous lists are ok.
-    element_dtype = dtypes.variant
+    if element_dtype is not None:
+      raise ValueError(
+          'specified dtype {} is inconsistent with that of elements {}'.format(
+              element_dtype, elements))
+    inferred_dtype = dtypes.variant
 
-  # TODO(mdan): This may fail for elements of variable shapes.
   all_shapes = set(tuple(el.shape.as_list()) for el in elements)
   if len(all_shapes) == 1:
-    element_shape = array_ops.shape(elements[0])
+    inferred_shape = array_ops.shape(elements[0])
+    if element_shape is not None and element_shape != inferred_shape:
+      raise ValueError(
+          'incompatible shape; specified: {}, inferred from {}: {}'.format(
+              element_shape, elements, inferred_shape))
   else:
     # Heterogeneous lists are ok.
-    element_shape = constant_op.constant(-1)  # unknown shape, by convention
+    if element_shape is not None:
+      raise ValueError(
+          'specified shape {} is inconsistent with that of elements {}'.format(
+              element_shape, elements))
+    inferred_shape = constant_op.constant(-1)  # unknown shape, by convention
+
+  if element_dtype is None:
+    element_dtype = inferred_dtype
+  if element_shape is None:
+    element_shape = inferred_shape
 
   l = list_ops.empty_tensor_list(
       element_shape=element_shape, element_dtype=element_dtype)
diff --git a/tensorflow/contrib/autograph/operators/data_structures_test.py b/tensorflow/contrib/autograph/operators/data_structures_test.py
index 8bbb52d6c1..7ea11a839b 100644
--- a/tensorflow/contrib/autograph/operators/data_structures_test.py
+++ b/tensorflow/contrib/autograph/operators/data_structures_test.py
@@ -37,10 +37,51 @@ class ListTest(test.TestCase):
 
   def test_new_list_tensor(self):
     l = data_structures.new_list([3, 4, 5])
+    self.assertAllEqual(l, [3, 4, 5])
+
+  def test_tf_tensor_list_new(self):
+    l = data_structures.tf_tensor_list_new([3, 4, 5])
     t = list_ops.tensor_list_stack(l, element_dtype=dtypes.int32)
     with self.test_session() as sess:
       self.assertAllEqual(sess.run(t), [3, 4, 5])
 
+  def test_tf_tensor_list_new_illegal_input(self):
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_list_new([3, 4.0])
+    # TODO(mdan): It might make more sense to type cast in this case.
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_list_new([3, 4], element_dtype=dtypes.float32)
+    # Tensor lists do support heterogeneous lists.
+    self.assertIsNot(data_structures.tf_tensor_list_new([3, [4, 5]]), None)
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_list_new([3, 4], element_shape=(2,))
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_list_new([], element_shape=(2,))
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_list_new([], element_dtype=dtypes.float32)
+
+  def test_tf_tensor_array_new(self):
+    l = data_structures.tf_tensor_array_new([3, 4, 5])
+    t = l.stack()
+    with self.test_session() as sess:
+      self.assertAllEqual(sess.run(t), [3, 4, 5])
+
+  def test_tf_tensor_array_new_illegal_input(self):
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_array_new([3, 4.0])
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_array_new([3, 4], element_dtype=dtypes.float32)
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_array_new([3, [4, 5]])
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_array_new([3, 4], element_shape=(2,))
+    with self.assertRaises(ValueError):
+      data_structures.tf_tensor_array_new([], element_shape=(2,))
+    # TAs can infer the shape.
+    self.assertIsNot(
+        data_structures.tf_tensor_array_new([], element_dtype=dtypes.float32),
+        None)
+
   def test_append_tensor_list(self):
     l = data_structures.new_list()
     x = constant_op.constant([1, 2, 3])
diff --git a/tensorflow/contrib/autograph/pyct/BUILD b/tensorflow/contrib/autograph/pyct/BUILD
index a49a4ed05c..ddadc6b96e 100644
--- a/tensorflow/contrib/autograph/pyct/BUILD
+++ b/tensorflow/contrib/autograph/pyct/BUILD
@@ -25,6 +25,7 @@ py_library(
         "cfg.py",
         "compiler.py",
         "inspect_utils.py",
+        "origin_info.py",
         "parser.py",
         "pretty_printer.py",
         "qual_names.py",
@@ -99,6 +100,16 @@ py_test(
 )
 
 py_test(
+    name = "origin_info_test",
+    srcs = ["origin_info_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":pyct",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
     name = "parser_test",
     srcs = ["parser_test.py"],
     srcs_version = "PY2AND3",
diff --git a/tensorflow/contrib/autograph/pyct/anno.py b/tensorflow/contrib/autograph/pyct/anno.py
index ae861627fd..1a52110ef3 100644
--- a/tensorflow/contrib/autograph/pyct/anno.py
+++ b/tensorflow/contrib/autograph/pyct/anno.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Handling annotations on AST nodes.
+"""AST node annotation support.
 
 Adapted from Tangent.
 """
@@ -21,37 +21,90 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from enum import Enum
+import enum
 
+# pylint:disable=g-bad-import-order
+import gast
+# pylint:enable=g-bad-import-order
 
-class NoValue(Enum):
+
+# TODO(mdan): Shorten the names.
+# These names are heavily used, and anno.blaa
+# TODO(mdan): Replace the attr-dict mechanism with a more typed solution.
+
+
+class NoValue(enum.Enum):
 
   def __repr__(self):
     return self.name
 
 
 class Basic(NoValue):
-  """Container for annotation keys.
+  """Container for basic annotation keys.
 
   The enum values are used strictly for documentation purposes.
   """
 
-  QN = 'Qualified name, as it appeared in the code.'
+  QN = 'Qualified name, as it appeared in the code. See qual_names.py.'
   SKIP_PROCESSING = (
       'This node should be preserved as is and not processed any further.')
   INDENT_BLOCK_REMAINDER = (
-      'When a node is annotated with this, the remainder of the block should '
-      'be indented below it. The annotation contains a tuple '
-      '(new_body, name_map), where `new_body` is the new indented block and '
-      '`name_map` allows renaming symbols.')
+      'When a node is annotated with this, the remainder of the block should'
+      ' be indented below it. The annotation contains a tuple'
+      ' (new_body, name_map), where `new_body` is the new indented block and'
+      ' `name_map` allows renaming symbols.')
+  ORIGIN = ('Information about the source code that converted code originated'
+            ' from. See origin_information.py.')
+
+
+class Static(NoValue):
+  """Container for static analysis annotation keys.
+
+  The enum values are used strictly for documentation purposes.
+  """
+
+  # Deprecated - use reaching definitions instead.
+  # Symbols
+  # These flags are boolean.
+  IS_LOCAL = 'Symbol is local to the function scope being analyzed.'
+  IS_PARAM = 'Symbol is a parameter to the function being analyzed.'
+
+  # Scopes
+  # Scopes are represented by objects of type activity.Scope.
+  SCOPE = 'The scope for the annotated node. See activity.py.'
+  # TODO(mdan): Drop these in favor of accessing the child's SCOPE.
+  ARGS_SCOPE = 'The scope for the argument list of a function call.'
+  COND_SCOPE = 'The scope for the test node of a conditional statement.'
+  BODY_SCOPE = (
+      'The scope for the main body of a statement (True branch for if '
+      'statements, main body for loops).')
+  ORELSE_SCOPE = (
+      'The scope for the orelse body of a statement (False branch for if '
+      'statements, orelse body for loops).')
+
+  # Static analysis annotations.
+  DEFINITIONS = (
+      'Reaching definition information. See reaching_definitions.py.')
+  ORIG_DEFINITIONS = (
+      'The value of DEFINITIONS that applied to the original code before any'
+      ' conversion.')
+  DEFINED_VARS_IN = (
+      'Symbols defined when entering the node. See reaching_definitions.py.')
+  LIVE_VARS_OUT = ('Symbols live when exiting the node. See liveness.py.')
 
 
 FAIL = object()
 
 
+def keys(node, field_name='___pyct_anno'):
+  if not hasattr(node, field_name):
+    return frozenset()
+  return frozenset(getattr(node, field_name).keys())
+
+
 def getanno(node, key, default=FAIL, field_name='___pyct_anno'):
-  if (default is FAIL or
-      (hasattr(node, field_name) and (key in getattr(node, field_name)))):
+  if (default is FAIL or (hasattr(node, field_name) and
+                          (key in getattr(node, field_name)))):
     return getattr(node, field_name)[key]
   else:
     return default
@@ -86,3 +139,19 @@ def copyanno(from_node, to_node, key, field_name='___pyct_anno'):
         key,
         getanno(from_node, key, field_name=field_name),
         field_name=field_name)
+
+
+def dup(node, copy_map, field_name='___pyct_anno'):
+  """Recursively copies annotations in an AST tree.
+
+  Args:
+    node: ast.AST
+    copy_map: Dict[Hashable, Hashable], maps a source anno key to a destination
+        key. All annotations with the source key will be copied to identical
+        annotations with the destination key.
+    field_name: str
+  """
+  for n in gast.walk(node):
+    for k in copy_map:
+      if hasanno(n, k, field_name):
+        setanno(n, copy_map[k], getanno(n, k, field_name), field_name)
diff --git a/tensorflow/contrib/autograph/pyct/anno_test.py b/tensorflow/contrib/autograph/pyct/anno_test.py
index f2c0c8cf05..5ef4da61a3 100644
--- a/tensorflow/contrib/autograph/pyct/anno_test.py
+++ b/tensorflow/contrib/autograph/pyct/anno_test.py
@@ -32,22 +32,27 @@ class AnnoTest(test.TestCase):
   def test_basic(self):
     node = ast.Name()
 
+    self.assertEqual(anno.keys(node), set())
     self.assertFalse(anno.hasanno(node, 'foo'))
     with self.assertRaises(AttributeError):
       anno.getanno(node, 'foo')
 
     anno.setanno(node, 'foo', 3)
+
+    self.assertEqual(anno.keys(node), {'foo'})
     self.assertTrue(anno.hasanno(node, 'foo'))
     self.assertEqual(anno.getanno(node, 'foo'), 3)
     self.assertEqual(anno.getanno(node, 'bar', default=7), 7)
 
     anno.delanno(node, 'foo')
+
+    self.assertEqual(anno.keys(node), set())
     self.assertFalse(anno.hasanno(node, 'foo'))
     with self.assertRaises(AttributeError):
       anno.getanno(node, 'foo')
     self.assertIsNone(anno.getanno(node, 'foo', default=None))
 
-  def test_copyanno(self):
+  def test_copy(self):
     node_1 = ast.Name()
     anno.setanno(node_1, 'foo', 3)
 
@@ -58,6 +63,22 @@ class AnnoTest(test.TestCase):
     self.assertTrue(anno.hasanno(node_2, 'foo'))
     self.assertFalse(anno.hasanno(node_2, 'bar'))
 
+  def test_duplicate(self):
+    node = ast.If(
+        test=ast.Num(1),
+        body=[ast.Expr(ast.Name('bar', ast.Load()))],
+        orelse=[])
+    anno.setanno(node, 'spam', 1)
+    anno.setanno(node, 'ham', 1)
+    anno.setanno(node.body[0], 'ham', 1)
+
+    anno.dup(node, {'spam': 'eggs'})
+
+    self.assertTrue(anno.hasanno(node, 'spam'))
+    self.assertTrue(anno.hasanno(node, 'ham'))
+    self.assertTrue(anno.hasanno(node, 'eggs'))
+    self.assertFalse(anno.hasanno(node.body[0], 'eggs'))
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/pyct/ast_util.py b/tensorflow/contrib/autograph/pyct/ast_util.py
index c4f82d1170..d7453b0781 100644
--- a/tensorflow/contrib/autograph/pyct/ast_util.py
+++ b/tensorflow/contrib/autograph/pyct/ast_util.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Copy an AST tree, discarding annotations."""
+"""AST manipulation utilities."""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -26,47 +26,53 @@ from tensorflow.contrib.autograph.pyct import anno
 from tensorflow.contrib.autograph.pyct import parser
 
 
-class CleanCopier(gast.NodeVisitor):
-  """Copies AST nodes.
+class CleanCopier(object):
+  """NodeTransformer-like visitor that copies an AST."""
 
-  The copied nodes will ignore almost all fields that are prefixed by '__'.
-  Exceptions make some annotations.
-  """
+  def __init__(self, preserve_annos):
+    super(CleanCopier, self).__init__()
+    self.preserve_annos = preserve_annos
 
-  # TODO(mdan): Parametrize which annotations get carried over.
+  def copy(self, node):
+    """Returns a deep copy of node (excluding some fields, see copy_clean)."""
+
+    if isinstance(node, list):
+      return [self.copy(n) for n in node]
+    elif isinstance(node, tuple):
+      return tuple(self.copy(n) for n in node)
+    elif not isinstance(node, (gast.AST, ast.AST)):
+      # Assuming everything that's not an AST, list or tuple is a value type
+      # and may simply be assigned.
+      return node
+
+    assert isinstance(node, (gast.AST, ast.AST))
 
-  def generic_visit(self, node):
     new_fields = {}
     for f in node._fields:
-      if f.startswith('__'):
-        continue
-      if not hasattr(node, f):
-        continue
-      v = getattr(node, f)
-      if isinstance(v, list):
-        v = [self.generic_visit(n) for n in v]
-      elif isinstance(v, tuple):
-        v = tuple(self.generic_visit(n) for n in v)
-      elif isinstance(v, (gast.AST, ast.AST)):
-        v = self.generic_visit(v)
-      else:
-        # Assume everything else is a value type.
-        pass
-      new_fields[f] = v
+      if not f.startswith('__') and hasattr(node, f):
+        new_fields[f] = self.copy(getattr(node, f))
     new_node = type(node)(**new_fields)
-    if anno.hasanno(node, anno.Basic.SKIP_PROCESSING):
-      anno.setanno(new_node, anno.Basic.SKIP_PROCESSING, True)
+
+    if self.preserve_annos:
+      for k in self.preserve_annos:
+        anno.copyanno(node, new_node, k)
     return new_node
 
 
-def copy_clean(node):
-  copier = CleanCopier()
-  if isinstance(node, list):
-    return [copier.visit(n) for n in node]
-  elif isinstance(node, tuple):
-    return tuple(copier.visit(n) for n in node)
-  else:
-    return copier.visit(node)
+def copy_clean(node, preserve_annos=None):
+  """Creates a deep copy of an AST.
+
+  The copy will not include fields that are prefixed by '__', with the
+  exception of user-specified annotations.
+
+  Args:
+    node: ast.AST
+    preserve_annos: Optional[Set[Hashable]], annotation keys to include in the
+        copy
+  Returns:
+    ast.AST
+  """
+  return CleanCopier(preserve_annos).copy(node)
 
 
 class SymbolRenamer(gast.NodeTransformer):
@@ -78,7 +84,11 @@ class SymbolRenamer(gast.NodeTransformer):
   def _process(self, node):
     qn = anno.getanno(node, anno.Basic.QN)
     if qn in self.name_map:
-      return gast.Name(str(self.name_map[qn]), node.ctx, None)
+      new_node = gast.Name(str(self.name_map[qn]), node.ctx, None)
+      # All annotations get carried over.
+      for k in anno.keys(node):
+        anno.copyanno(node, new_node, k)
+      return new_node
     return self.generic_visit(node)
 
   def visit_Name(self, node):
@@ -92,6 +102,7 @@ class SymbolRenamer(gast.NodeTransformer):
 
 
 def rename_symbols(node, name_map):
+  """Renames symbols in an AST. Requires qual_names annotations."""
   renamer = SymbolRenamer(name_map)
   if isinstance(node, list):
     return [renamer.visit(n) for n in node]
@@ -101,6 +112,7 @@ def rename_symbols(node, name_map):
 
 
 def keywords_to_dict(keywords):
+  """Converts a list of ast.keyword objects to a dict."""
   keys = []
   values = []
   for kw in keywords:
@@ -110,10 +122,7 @@ def keywords_to_dict(keywords):
 
 
 class PatternMatcher(gast.NodeVisitor):
-  """Matches a node against a pattern represented by a node.
-
-  The pattern may contain wildcards represented by the symbol '_'.
-  """
+  """Matches a node against a pattern represented by a node."""
 
   def __init__(self, pattern):
     self.pattern = pattern
@@ -177,9 +186,128 @@ class PatternMatcher(gast.NodeVisitor):
 
 
 def matches(node, pattern):
+  """Basic pattern matcher for AST.
+
+  The pattern may contain wildcards represented by the symbol '_'. A node
+  matches a pattern if for every node in the tree, either there is a node of
+  the same type in pattern, or a Name node with id='_'.
+
+  Args:
+    node: ast.AST
+    pattern: ast.AST
+  Returns:
+    bool
+  """
   if isinstance(pattern, str):
     pattern = parser.parse_expression(pattern)
   matcher = PatternMatcher(pattern)
   matcher.visit(node)
   return matcher.matches
 
+
+# TODO(mdan): Once we have error tracing, we may be able to just go to SSA.
+def apply_to_single_assignments(targets, values, apply_fn):
+  """Applies a function to each individual assignment.
+
+  This function can process a possibly-unpacked (e.g. a, b = c, d) assignment.
+  It tries to break down the unpacking if possible. In effect, it has the same
+  effect as passing the assigned values in SSA form to apply_fn.
+
+  Examples:
+
+  The following will result in apply_fn(a, c), apply_fn(b, d):
+
+      a, b = c, d
+
+  The following will result in apply_fn(a, c[0]), apply_fn(b, c[1]):
+
+      a, b = c
+
+  The following will result in apply_fn(a, (b, c)):
+
+      a = b, c
+
+  It uses the visitor pattern to allow subclasses to process single
+  assignments individually.
+
+  Args:
+    targets: Union[List[ast.AST, ...], Tuple[ast.AST, ...], ast.AST, should be
+        used with the targets field of an ast.Assign node
+    values: ast.AST
+    apply_fn: Callable[[ast.AST, ast.AST], None], called with the
+        respective nodes of each single assignment
+  """
+  if not isinstance(targets, (list, tuple)):
+    targets = (targets,)
+  for target in targets:
+    if isinstance(target, (gast.Tuple, gast.List)):
+      for i in range(len(target.elts)):
+        target_el = target.elts[i]
+        if isinstance(values, (gast.Tuple, gast.List)):
+          value_el = values.elts[i]
+        else:
+          idx = parser.parse_expression(str(i))
+          value_el = gast.Subscript(values, gast.Index(idx), ctx=gast.Load())
+        apply_to_single_assignments(target_el, value_el, apply_fn)
+    else:
+      apply_fn(target, values)
+
+
+def parallel_walk(node, other):
+  """Walks two ASTs in parallel.
+
+  The two trees must have identical structure.
+
+  Args:
+    node: Union[ast.AST, Iterable[ast.AST]]
+    other: Union[ast.AST, Iterable[ast.AST]]
+  Yields:
+    Tuple[ast.AST, ast.AST]
+  Raises:
+    ValueError: if the two trees don't have identical structure.
+  """
+  if isinstance(node, (list, tuple)):
+    node_stack = list(node)
+  else:
+    node_stack = [node]
+
+  if isinstance(other, (list, tuple)):
+    other_stack = list(other)
+  else:
+    other_stack = [other]
+
+  while node_stack and other_stack:
+    assert len(node_stack) == len(other_stack)
+    n = node_stack.pop()
+    o = other_stack.pop()
+
+    if (not isinstance(n, (ast.AST, gast.AST)) or
+        not isinstance(o, (ast.AST, gast.AST)) or
+        n.__class__.__name__ != o.__class__.__name__):
+      raise ValueError('inconsistent nodes: {} and {}'.format(n, o))
+
+    yield n, o
+
+    for f in n._fields:
+      n_child = getattr(n, f, None)
+      o_child = getattr(o, f, None)
+      if f.startswith('__') or n_child is None or o_child is None:
+        continue
+
+      if isinstance(n_child, (list, tuple)):
+        if (not isinstance(o_child, (list, tuple)) or
+            len(n_child) != len(o_child)):
+          raise ValueError(
+              'inconsistent values for field {}: {} and {}'.format(
+                  f, n_child, o_child))
+        node_stack.extend(n_child)
+        other_stack.extend(o_child)
+
+      elif isinstance(n_child, (gast.AST, ast.AST)):
+        node_stack.append(n_child)
+        other_stack.append(o_child)
+
+      elif n_child != o_child:
+        raise ValueError(
+            'inconsistent values for field {}: {} and {}'.format(
+                f, n_child, o_child))
diff --git a/tensorflow/contrib/autograph/pyct/ast_util_test.py b/tensorflow/contrib/autograph/pyct/ast_util_test.py
index 3afa04a506..2293c89720 100644
--- a/tensorflow/contrib/autograph/pyct/ast_util_test.py
+++ b/tensorflow/contrib/autograph/pyct/ast_util_test.py
@@ -19,7 +19,10 @@ from __future__ import division
 from __future__ import print_function
 
 import ast
+import collections
+import textwrap
 
+from tensorflow.contrib.autograph.pyct import anno
 from tensorflow.contrib.autograph.pyct import ast_util
 from tensorflow.contrib.autograph.pyct import compiler
 from tensorflow.contrib.autograph.pyct import parser
@@ -29,62 +32,75 @@ from tensorflow.python.platform import test
 
 class AstUtilTest(test.TestCase):
 
-  def test_rename_symbols(self):
-    node = ast.Tuple([
-        ast.Name('a', ast.Load()),
-        ast.Name('b', ast.Load()),
-        ast.Attribute(ast.Name('b', None), 'c', ast.Store()),
-        ast.Attribute(
-            ast.Attribute(ast.Name('b', None), 'c', ast.Load()), 'd', None)
-    ], None)
+  def setUp(self):
+    super(AstUtilTest, self).setUp()
+    self._invocation_counts = collections.defaultdict(lambda: 0)
+
+  def test_rename_symbols_basic(self):
+    node = parser.parse_str('a + b')
     node = qual_names.resolve(node)
+
     node = ast_util.rename_symbols(
-        node, {
-            qual_names.QN('a'):
-                qual_names.QN('renamed_a'),
-            qual_names.QN(qual_names.QN('b'), attr='c'):
-                qual_names.QN('renamed_b_c'),
-        })
-
-    self.assertEqual(node.elts[0].id, 'renamed_a')
-    self.assertTrue(isinstance(node.elts[0].ctx, ast.Load))
-    self.assertEqual(node.elts[1].id, 'b')
-    self.assertEqual(node.elts[2].id, 'renamed_b_c')
-    self.assertTrue(isinstance(node.elts[2].ctx, ast.Store))
-    self.assertEqual(node.elts[3].value.id, 'renamed_b_c')
-    self.assertTrue(isinstance(node.elts[3].value.ctx, ast.Load))
+        node, {qual_names.QN('a'): qual_names.QN('renamed_a')})
+
+    self.assertIsInstance(node.body[0].value.left.id, str)
+    source = compiler.ast_to_source(node)
+    self.assertEqual(source.strip(), 'renamed_a + b')
+
+  def test_rename_symbols_attributes(self):
+    node = parser.parse_str('b.c = b.c.d')
+    node = qual_names.resolve(node)
+
+    node = ast_util.rename_symbols(
+        node, {qual_names.from_str('b.c'): qual_names.QN('renamed_b_c')})
+
+    source = compiler.ast_to_source(node)
+    self.assertEqual(source.strip(), 'renamed_b_c = renamed_b_c.d')
+
+  def test_rename_symbols_annotations(self):
+    node = parser.parse_str('a[i]')
+    node = qual_names.resolve(node)
+    anno.setanno(node, 'foo', 'bar')
+    orig_anno = anno.getanno(node, 'foo')
+
+    node = ast_util.rename_symbols(node,
+                                   {qual_names.QN('a'): qual_names.QN('b')})
+
+    self.assertIs(anno.getanno(node, 'foo'), orig_anno)
 
   def test_copy_clean(self):
-    ret = ast.Return(
-        ast.BinOp(
-            op=ast.Add(),
-            left=ast.Name(id='a', ctx=ast.Load()),
-            right=ast.Num(1)))
-    setattr(ret, '__foo', 'bar')
-    node = ast.FunctionDef(
-        name='f',
-        args=ast.arguments(
-            args=[ast.Name(id='a', ctx=ast.Param())],
-            vararg=None,
-            kwarg=None,
-            defaults=[]),
-        body=[ret],
-        decorator_list=[],
-        returns=None)
+    node = parser.parse_str(
+        textwrap.dedent("""
+      def f(a):
+        return a + 1
+    """))
+    setattr(node.body[0], '__foo', 'bar')
     new_node = ast_util.copy_clean(node)
-    self.assertFalse(node is new_node)
-    self.assertFalse(ret is new_node.body[0])
+    self.assertIsNot(new_node, node)
+    self.assertIsNot(new_node.body[0], node.body[0])
     self.assertFalse(hasattr(new_node.body[0], '__foo'))
 
+  def test_copy_clean_preserves_annotations(self):
+    node = parser.parse_str(
+        textwrap.dedent("""
+      def f(a):
+        return a + 1
+    """))
+    anno.setanno(node.body[0], 'foo', 'bar')
+    anno.setanno(node.body[0], 'baz', 1)
+    new_node = ast_util.copy_clean(node, preserve_annos={'foo'})
+    self.assertEqual(anno.getanno(new_node.body[0], 'foo'), 'bar')
+    self.assertFalse(anno.hasanno(new_node.body[0], 'baz'))
+
   def test_keywords_to_dict(self):
     keywords = parser.parse_expression('f(a=b, c=1, d=\'e\')').keywords
     d = ast_util.keywords_to_dict(keywords)
     # Make sure we generate a usable dict node by attaching it to a variable and
     # compiling everything.
-    output = parser.parse_str('b = 3')
-    output.body += (ast.Assign([ast.Name(id='d', ctx=ast.Store())], d),)
-    result, _ = compiler.ast_to_object(output)
-    self.assertDictEqual(result.d, {'a': 3, 'c': 1, 'd': 'e'})
+    node = parser.parse_str('def f(b): pass').body[0]
+    node.body.append(ast.Return(d))
+    result, _ = compiler.ast_to_object(node)
+    self.assertDictEqual(result.f(3), {'a': 3, 'c': 1, 'd': 'e'})
 
   def assertMatch(self, target_str, pattern_str):
     node = parser.parse_expression(target_str)
@@ -113,6 +129,68 @@ class AstUtilTest(test.TestCase):
     self.assertNoMatch('super(Foo, self).__init__()',
                        'super(Bar, _).__init__(_)')
 
+  def _mock_apply_fn(self, target, source):
+    target = compiler.ast_to_source(target)
+    source = compiler.ast_to_source(source)
+    self._invocation_counts[(target.strip(), source.strip())] += 1
+
+  def test_apply_to_single_assignments_dynamic_unpack(self):
+    node = parser.parse_str('a, b, c = d')
+    node = node.body[0]
+    ast_util.apply_to_single_assignments(node.targets, node.value,
+                                         self._mock_apply_fn)
+    self.assertDictEqual(self._invocation_counts, {
+        ('a', 'd[0]'): 1,
+        ('b', 'd[1]'): 1,
+        ('c', 'd[2]'): 1,
+    })
+
+  def test_apply_to_single_assignments_static_unpack(self):
+    node = parser.parse_str('a, b, c = d, e, f')
+    node = node.body[0]
+    ast_util.apply_to_single_assignments(node.targets, node.value,
+                                         self._mock_apply_fn)
+    self.assertDictEqual(self._invocation_counts, {
+        ('a', 'd'): 1,
+        ('b', 'e'): 1,
+        ('c', 'f'): 1,
+    })
+
+  def test_parallel_walk(self):
+    node = parser.parse_str(
+        textwrap.dedent("""
+      def f(a):
+        return a + 1
+    """))
+    for child_a, child_b in ast_util.parallel_walk(node, node):
+      self.assertEqual(child_a, child_b)
+
+  def test_parallel_walk_inconsistent_trees(self):
+    node_1 = parser.parse_str(
+        textwrap.dedent("""
+      def f(a):
+        return a + 1
+    """))
+    node_2 = parser.parse_str(
+        textwrap.dedent("""
+      def f(a):
+        return a + (a * 2)
+    """))
+    node_3 = parser.parse_str(
+        textwrap.dedent("""
+      def f(a):
+        return a + 2
+    """))
+    with self.assertRaises(ValueError):
+      for _ in ast_util.parallel_walk(node_1, node_2):
+        pass
+    # There is not particular reason to reject trees that differ only in the
+    # value of a constant.
+    # TODO(mdan): This should probably be allowed.
+    with self.assertRaises(ValueError):
+      for _ in ast_util.parallel_walk(node_1, node_3):
+        pass
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/pyct/cfg.py b/tensorflow/contrib/autograph/pyct/cfg.py
index 666328781f..ba51dcf285 100644
--- a/tensorflow/contrib/autograph/pyct/cfg.py
+++ b/tensorflow/contrib/autograph/pyct/cfg.py
@@ -64,11 +64,17 @@ class Node(object):
     self.prev = frozenset(self.prev)
 
   def __repr__(self):
+    if isinstance(self.ast_node, gast.FunctionDef):
+      return 'def %s' % self.ast_node.name
+    elif isinstance(self.ast_node, gast.withitem):
+      return compiler.ast_to_source(self.ast_node.context_expr).strip()
     return compiler.ast_to_source(self.ast_node).strip()
 
 
 class Graph(
-    collections.namedtuple('Graph', ['entry', 'exit', 'error', 'index'])):
+    collections.namedtuple(
+        'Graph',
+        ['entry', 'exit', 'error', 'index', 'stmt_prev', 'stmt_next'])):
   """A Control Flow Graph.
 
   The CFG maintains an index to allow looking up a CFG node by the AST node to
@@ -82,6 +88,11 @@ class Graph(
   because these are shared, and wiring them would create a reverse path from
   normal control flow into the error nodes, which we want to avoid.
 
+  The graph also maintains edges corresponding to higher level statements
+  like for-else loops. A node is considered successor of a statement if there
+  is an edge from a node that is lexically a child of that statement to a node
+  that is not. Statement predecessors are analogously defined.
+
   Attributes:
     entry: Node, the entry node
     exit: FrozenSet[Node, ...], the exit nodes
@@ -89,6 +100,10 @@ class Graph(
         error (errors propagated from function calls are not accounted)
     index: Dict[ast.Node, Node], mapping AST nodes to the respective CFG
         node
+    stmt_prev: Dict[ast.Node, FrozenSet[Node, ...]], mapping statement AST
+        nodes to their predecessor CFG nodes
+    stmt_next: Dict[ast.Node, FrozenSet[Node, ...]], mapping statement AST
+        nodes to their successor CFG nodes
   """
 
   def __repr__(self):
@@ -96,9 +111,8 @@ class Graph(
     for node in self.index.values():
       result += '  %s [label="%s"];\n' % (id(node), node)
     for node in self.index.values():
-      if node.next:
-        result += '  %s -> {%s};\n' % (id(node), ', '.join(
-            repr(id(n)) for n in node.next))
+      for next_ in node.next:
+        result += '  %s -> %s;\n' % (id(node), id(next_))
     result += '}'
     return result
 
@@ -108,6 +122,8 @@ class _WalkMode(Enum):
   REVERSE = 2
 
 
+# TODO(mdan): Rename to DataFlowAnalyzer.
+# TODO(mdan): Consider specializations that use gen/kill/transfer abstractions.
 class GraphVisitor(object):
   """Base class for a CFG visitors.
 
@@ -130,26 +146,22 @@ class GraphVisitor(object):
     out: Dict[Node, Any], stores node-keyed state during a visit
   """
 
-  def reset(self):
-    self.in_ = {
-        node: self.init_state(node) for node in self.graph.index.values()
-    }
-    self.out = {
-        node: self.init_state(node) for node in self.graph.index.values()
-    }
+  def __init__(self, graph):
+    self.graph = graph
+    self.reset()
 
   def init_state(self, node):
     """State initialization function. Optional to overload.
 
     An in/out state slot will be created for each node in the graph. Subclasses
-    may overload this to control what that is initialized to.
+    must overload this to control what that is initialized to.
 
     Args:
       node: Node
     """
-    del node
-    return None
+    raise NotImplementedError('Subclasses must implement this.')
 
+  # TODO(mdan): Rename to flow?
   def visit_node(self, node):
     """Visitor function.
 
@@ -161,6 +173,14 @@ class GraphVisitor(object):
     """
     raise NotImplementedError('Subclasses must implement this.')
 
+  def reset(self):
+    self.in_ = {
+        node: self.init_state(node) for node in self.graph.index.values()
+    }
+    self.out = {
+        node: self.init_state(node) for node in self.graph.index.values()
+    }
+
   def _visit_internal(self, mode):
     """Visits the CFG, depth-first."""
     assert mode in (_WalkMode.FORWARD, _WalkMode.REVERSE)
@@ -169,7 +189,6 @@ class GraphVisitor(object):
     elif mode == _WalkMode.REVERSE:
       open_ = list(self.graph.exit)
     closed = set()
-    self.reset()
 
     while open_:
       node = open_.pop(0)
@@ -186,12 +205,10 @@ class GraphVisitor(object):
         if should_revisit or next_ not in closed:
           open_.append(next_)
 
-  def visit_forward(self, graph):
-    self.graph = graph
+  def visit_forward(self):
     self._visit_internal(_WalkMode.FORWARD)
 
-  def visit_reverse(self, graph):
-    self.graph = graph
+  def visit_reverse(self):
     self._visit_internal(_WalkMode.REVERSE)
 
 
@@ -244,8 +261,16 @@ class GraphBuilder(object):
     # TODO(mdan): Too many primitives. Use classes.
     self.leaves = set()
 
+    # Note: This mechanism requires that nodes are added in lexical order (top
+    # to bottom, depth first).
+    self.active_stmts = set()
+    self.owners = {}  # type: Set[any]
+    self.forward_edges = set()  # type: Tuple[Node, Node] # (from, to)
+
     self.finally_sections = {}
-    self.finally_section_subgraphs = {}  # Values are [begin_node, exit_nodes]
+    # Dict values represent (entry, exits)
+    self.finally_section_subgraphs = {
+    }  # type: Dict[ast.AST, Tuple[Node, Set[Node]]]
     # Whether the guard section can be reached from the statement that precedes
     # it.
     self.finally_section_has_direct_flow = {}
@@ -275,6 +300,7 @@ class GraphBuilder(object):
     if isinstance(first, Node):
       first.next.add(second)
       second.prev.add(first)
+      self.forward_edges.add((first, second))
     else:
       for node in first:
         self._connect_nodes(node, second)
@@ -285,6 +311,7 @@ class GraphBuilder(object):
       raise ValueError('%s added twice' % ast_node)
     node = Node(next_=set(), prev=set(), ast_node=ast_node)
     self.node_index[ast_node] = node
+    self.owners[node] = frozenset(self.active_stmts)
 
     if self.head is None:
       self.head = node
@@ -299,6 +326,25 @@ class GraphBuilder(object):
 
     return node
 
+  def begin_statement(self, stmt):
+    """Marks the beginning of a statement.
+
+    Args:
+      stmt: Hashable, a key by which the statement can be identified in
+          the CFG's stmt_prev and stmt_next attributes
+    """
+    self.active_stmts.add(stmt)
+
+  def end_statement(self, stmt):
+    """Marks the end of a statement.
+
+    Args:
+      stmt: Hashable, a key by which the statement can be identified in
+          the CFG's stmt_prev and stmt_next attributes; must match a key
+          previously passed to begin_statement.
+    """
+    self.active_stmts.remove(stmt)
+
   def add_ordinary_node(self, ast_node):
     """Grows the graph by adding an ordinary CFG node.
 
@@ -505,11 +551,35 @@ class GraphBuilder(object):
     for node in self.node_index.values():
       node.freeze()
 
+    # Build the statement edges.
+    stmt_next = {}
+    stmt_prev = {}
+    for node, _ in self.forward_edges:
+      for stmt in self.owners[node]:
+        if stmt not in stmt_next:
+          stmt_next[stmt] = set()
+        if stmt not in stmt_prev:
+          stmt_prev[stmt] = set()
+    for first, second in self.forward_edges:
+      stmts_exited = self.owners[first] - self.owners[second]
+      for stmt in stmts_exited:
+        stmt_next[stmt].add(second)
+      stmts_entered = self.owners[second] - self.owners[first]
+      for stmt in stmts_entered:
+        stmt_prev[stmt].add(first)
+    for stmt in stmt_next:
+      stmt_next[stmt] = frozenset(stmt_next[stmt])
+    for stmt in stmt_prev:
+      stmt_prev[stmt] = frozenset(stmt_prev[stmt])
+
+    # Construct the final graph object.
     result = Graph(
         entry=self.head,
         exit=self.leaves,
         error=self.errors,
-        index=self.node_index)
+        index=self.node_index,
+        stmt_prev=stmt_prev,
+        stmt_next=stmt_next)
 
     # Reset the state.
     self.reset()
@@ -523,8 +593,6 @@ class AstToCfg(gast.NodeVisitor):
   A separate CFG will be constructed for each function.
   """
 
-  # TODO(mdan): Figure out how to deal with closures.
-
   def __init__(self):
     super(AstToCfg, self).__init__()
 
@@ -577,6 +645,13 @@ class AstToCfg(gast.NodeVisitor):
     self.builder.add_continue_node(node, try_node, guards)
 
   def visit_FunctionDef(self, node):
+    # We also keep the FunctionDef node in the CFG. This allows us to determine
+    # things like reaching definitions via closure. Note that the function body
+    # will be stored in a separate graph, because function definitions are not
+    # the same as function calls.
+    if self.builder is not None:
+      self.builder.add_ordinary_node(node)
+
     self.builder_stack.append(self.builder)
     self.builder = GraphBuilder(node)
 
@@ -622,7 +697,7 @@ class AstToCfg(gast.NodeVisitor):
     )
     if try_node is None:
       raise ValueError('%s that is not enclosed by any FunctionDef' % node)
-    self.builder.add_error_node(node, try_node, guards)
+    self.builder.add_error_node(node, guards)
 
   def visit_Assert(self, node):
     # Ignoring the effect of exceptions.
@@ -637,6 +712,7 @@ class AstToCfg(gast.NodeVisitor):
     # targets of jump statements like break/continue/etc. Since there is no
     # statement that can interrupt a conditional, we don't need to track their
     # lexical scope. That may change in the future.
+    self.builder.begin_statement(node)
 
     self.builder.enter_cond_section(node)
     self._process_basic_statement(node.test)
@@ -650,8 +726,10 @@ class AstToCfg(gast.NodeVisitor):
       self.visit(stmt)
 
     self.builder.exit_cond_section(node)
+    self.builder.end_statement(node)
 
   def visit_While(self, node):
+    self.builder.begin_statement(node)
     self._enter_lexical_scope(node)
 
     self.builder.enter_section(node)
@@ -670,8 +748,10 @@ class AstToCfg(gast.NodeVisitor):
       self.visit(stmt)
 
     self.builder.exit_section(node)
+    self.builder.end_statement(node)
 
   def visit_For(self, node):
+    self.builder.begin_statement(node)
     self._enter_lexical_scope(node)
 
     self.builder.enter_section(node)
@@ -693,6 +773,7 @@ class AstToCfg(gast.NodeVisitor):
       self.visit(stmt)
 
     self.builder.exit_section(node)
+    self.builder.end_statement(node)
 
   def visit_Break(self, node):
     self._process_exit_statement(node, gast.While, gast.For)
@@ -722,12 +803,13 @@ class AstToCfg(gast.NodeVisitor):
 
   def visit_With(self, node):
     # TODO(mdan): Mark the context manager's exit call as exit guard.
-    self._process_basic_statement(node.items)
+    for item in node.items:
+      self._process_basic_statement(item)
     for stmt in node.body:
       self.visit(stmt)
 
 
 def build(node):
-  builder = AstToCfg()
-  builder.visit(node)
-  return builder.cfgs
+  visitor = AstToCfg()
+  visitor.visit(node)
+  return visitor.cfgs
diff --git a/tensorflow/contrib/autograph/pyct/cfg_test.py b/tensorflow/contrib/autograph/pyct/cfg_test.py
index 00afadd521..9d0a85d615 100644
--- a/tensorflow/contrib/autograph/pyct/cfg_test.py
+++ b/tensorflow/contrib/autograph/pyct/cfg_test.py
@@ -25,9 +25,13 @@ from tensorflow.python.platform import test
 
 class CountingVisitor(cfg.GraphVisitor):
 
-  def __init__(self):
+  def __init__(self, graph):
+    super(CountingVisitor, self).__init__(graph)
     self.counts = {}
 
+  def init_state(self, _):
+    return None
+
   def visit_node(self, node):
     self.counts[node.ast_node] = self.counts.get(node.ast_node, 0) + 1
     return False  # visit only once
@@ -51,8 +55,8 @@ class GraphVisitorTest(test.TestCase):
 
     graphs, node = self._build_cfg(test_fn)
     graph, = graphs.values()
-    visitor = CountingVisitor()
-    visitor.visit_forward(graph)
+    visitor = CountingVisitor(graph)
+    visitor.visit_forward()
     fn_node = node.body[0]
 
     self.assertEqual(visitor.counts[fn_node.args], 1)
@@ -74,8 +78,8 @@ class GraphVisitorTest(test.TestCase):
 
     graphs, node = self._build_cfg(test_fn)
     graph, = graphs.values()
-    visitor = CountingVisitor()
-    visitor.visit_reverse(graph)
+    visitor = CountingVisitor(graph)
+    visitor.visit_reverse()
     fn_node = node.body[0]
 
     self.assertEqual(visitor.counts[fn_node.args], 1)
@@ -94,7 +98,7 @@ class AstToCfgTest(test.TestCase):
     return cfgs
 
   def _repr_set(self, node_set):
-    return set(repr(n) for n in node_set)
+    return frozenset(repr(n) for n in node_set)
 
   def _as_set(self, elements):
     if elements is None:
@@ -110,14 +114,35 @@ class AstToCfgTest(test.TestCase):
       matched = False
       for cfg_node in graph.index.values():
         if repr(cfg_node) == node_repr:
-          if (self._as_set(prev) == set(map(repr, cfg_node.prev)) and
-              self._as_set(next_) == set(map(repr, cfg_node.next))):
+          if (self._as_set(prev) == frozenset(map(repr, cfg_node.prev)) and
+              self._as_set(next_) == frozenset(map(repr, cfg_node.next))):
             matched = True
             break
       if not matched:
         self.fail(
             'match failed for node "%s" in graph:\n%s' % (node_repr, graph))
 
+  def assertStatementEdges(self, graph, edges):
+    """Tests whether the CFG contains the specified statement edges."""
+    for prev_node_reprs, node_repr, next_node_reprs in edges:
+      matched = False
+      partial_matches = []
+      self.assertSetEqual(
+          frozenset(graph.stmt_next.keys()), frozenset(graph.stmt_prev.keys()))
+      for stmt_ast_node in graph.stmt_next:
+        ast_repr = '%s:%s' % (stmt_ast_node.__class__.__name__,
+                              stmt_ast_node.lineno)
+        if ast_repr == node_repr:
+          actual_next = frozenset(map(repr, graph.stmt_next[stmt_ast_node]))
+          actual_prev = frozenset(map(repr, graph.stmt_prev[stmt_ast_node]))
+          partial_matches.append((actual_prev, node_repr, actual_next))
+          if (self._as_set(prev_node_reprs) == actual_prev and
+              self._as_set(next_node_reprs) == actual_next):
+            matched = True
+            break
+      if not matched:
+        self.fail('edges mismatch for %s: %s' % (node_repr, partial_matches))
+
   def test_straightline(self):
 
     def test_fn(a):
@@ -171,7 +196,7 @@ class AstToCfgTest(test.TestCase):
         ),
     )
 
-  def test_branch_straightline(self):
+  def test_if_straightline(self):
 
     def test_fn(a):
       if a > 0:
@@ -189,6 +214,10 @@ class AstToCfgTest(test.TestCase):
             ('(a > 0)', 'a += -1', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'If:2', None),),
+    )
 
   def test_branch_nested(self):
 
@@ -219,6 +248,14 @@ class AstToCfgTest(test.TestCase):
             ('(a > 2)', 'a = 4', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'If:2', None),
+            ('(a > 0)', 'If:3', None),
+            ('(a > 0)', 'If:8', None),
+        ),
+    )
 
   def test_branch_straightline_semi(self):
 
@@ -236,6 +273,10 @@ class AstToCfgTest(test.TestCase):
             ('(a > 0)', 'a = 1', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'If:2', None),),
+    )
 
   def test_branch_return(self):
 
@@ -257,6 +298,10 @@ class AstToCfgTest(test.TestCase):
             ('a = 1', 'a = 2', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'If:2', 'a = 2'),),
+    )
 
   def test_branch_return_minimal(self):
 
@@ -273,6 +318,10 @@ class AstToCfgTest(test.TestCase):
             ('(a > 0)', 'return', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'If:2', None),),
+    )
 
   def test_while_straightline(self):
 
@@ -291,6 +340,10 @@ class AstToCfgTest(test.TestCase):
             ('(a > 0)', 'a = 2', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'While:2', 'a = 2'),),
+    )
 
   def test_while_else_straightline(self):
 
@@ -312,6 +365,10 @@ class AstToCfgTest(test.TestCase):
             ('a = 2', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'While:2', 'a = 3'),),
+    )
 
   def test_while_else_continue(self):
 
@@ -339,6 +396,13 @@ class AstToCfgTest(test.TestCase):
             ('a = 2', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'While:2', 'a = 3'),
+            ('(a > 0)', 'If:3', ('a = 1', '(a > 0)')),
+        ),
+    )
 
   def test_while_else_break(self):
 
@@ -364,6 +428,13 @@ class AstToCfgTest(test.TestCase):
             (('break', 'a = 2'), 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'While:2', 'a = 3'),
+            ('(a > 0)', 'If:3', ('a = 1', 'a = 3')),
+        ),
+    )
 
   def test_while_else_return(self):
 
@@ -389,6 +460,13 @@ class AstToCfgTest(test.TestCase):
             ('a = 2', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'While:2', 'a = 3'),
+            ('(a > 0)', 'If:3', 'a = 1'),
+        ),
+    )
 
   def test_while_nested_straightline(self):
 
@@ -411,6 +489,13 @@ class AstToCfgTest(test.TestCase):
             ('(a > 0)', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'While:2', 'a = 3'),
+            ('(a > 0)', 'While:3', 'a = 2'),
+        ),
+    )
 
   def test_while_nested_continue(self):
 
@@ -437,6 +522,14 @@ class AstToCfgTest(test.TestCase):
             ('(a > 0)', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'While:2', 'a = 3'),
+            ('(a > 0)', 'While:3', 'a = 2'),
+            ('(a > 1)', 'If:4', ('a = 1', '(a > 1)')),
+        ),
+    )
 
   def test_while_nested_break(self):
 
@@ -451,16 +544,21 @@ class AstToCfgTest(test.TestCase):
 
     graph, = self._build_cfg(test_fn).values()
 
-    self.assertGraphMatches(
+    self.assertGraphMatches(graph, (
+        (('a', 'a = 2'), '(a > 0)', ('(a > 1)', 'a = 3')),
+        (('(a > 0)', 'a = 1'), '(a > 1)', ('(a > 2)', 'a = 2')),
+        ('(a > 1)', '(a > 2)', ('break', 'a = 1')),
+        ('(a > 2)', 'break', 'a = 2'),
+        ('(a > 2)', 'a = 1', '(a > 1)'),
+        (('(a > 1)', 'break'), 'a = 2', '(a > 0)'),
+        ('(a > 0)', 'a = 3', None),
+    ))
+    self.assertStatementEdges(
         graph,
         (
-            (('a', 'a = 2'), '(a > 0)', ('(a > 1)', 'a = 3')),
-            (('(a > 0)', 'a = 1'), '(a > 1)', ('(a > 2)', 'a = 2')),
-            ('(a > 1)', '(a > 2)', ('break', 'a = 1')),
-            ('(a > 2)', 'break', 'a = 2'),
-            ('(a > 2)', 'a = 1', '(a > 1)'),
-            (('(a > 1)', 'break'), 'a = 2', '(a > 0)'),
-            ('(a > 0)', 'a = 3', None),
+            ('a', 'While:2', 'a = 3'),
+            ('(a > 0)', 'While:3', 'a = 2'),
+            ('(a > 1)', 'If:4', ('a = 1', 'a = 2')),
         ),
     )
 
@@ -481,6 +579,10 @@ class AstToCfgTest(test.TestCase):
             ('range(0, a)', 'a = 2', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'For:2', 'a = 2'),),
+    )
 
   def test_for_else_straightline(self):
 
@@ -502,6 +604,10 @@ class AstToCfgTest(test.TestCase):
             ('a = 2', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (('a', 'For:2', 'a = 3'),),
+    )
 
   def test_for_else_continue(self):
 
@@ -530,6 +636,13 @@ class AstToCfgTest(test.TestCase):
             ('a = 2', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'For:2', 'a = 3'),
+            ('range(0, a)', 'If:3', ('a = 1', 'range(0, a)')),
+        ),
+    )
 
   def test_for_else_break(self):
 
@@ -555,6 +668,13 @@ class AstToCfgTest(test.TestCase):
             (('break', 'a = 2'), 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'For:2', 'a = 3'),
+            ('range(0, a)', 'If:3', ('a = 1', 'a = 3')),
+        ),
+    )
 
   def test_for_else_return(self):
 
@@ -580,6 +700,13 @@ class AstToCfgTest(test.TestCase):
             ('a = 2', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'For:2', 'a = 3'),
+            ('range(0, a)', 'If:3', 'a = 1'),
+        ),
+    )
 
   def test_for_nested_straightline(self):
 
@@ -602,6 +729,13 @@ class AstToCfgTest(test.TestCase):
             ('range(0, a)', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'For:2', 'a = 3'),
+            ('range(0, a)', 'For:3', 'a = 2'),
+        ),
+    )
 
   def test_for_nested_continue(self):
 
@@ -629,6 +763,14 @@ class AstToCfgTest(test.TestCase):
             ('range(0, a)', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'For:2', 'a = 3'),
+            ('range(0, a)', 'For:3', 'a = 2'),
+            ('range(1, a)', 'If:4', ('b += 1', 'range(1, a)')),
+        ),
+    )
 
   def test_for_nested_break(self):
 
@@ -655,6 +797,14 @@ class AstToCfgTest(test.TestCase):
             ('range(0, a)', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('a', 'For:2', 'a = 3'),
+            ('range(0, a)', 'For:3', 'a = 2'),
+            ('range(1, a)', 'If:4', ('b += 1', 'a = 2')),
+        ),
+    )
 
   def test_complex(self):
 
@@ -704,6 +854,17 @@ class AstToCfgTest(test.TestCase):
             ('range(1, a)', 'a = 3', None),
         ),
     )
+    self.assertStatementEdges(
+        graph,
+        (
+            ('b = 0', 'While:3', 'range(1, a)'),
+            ('(a > 0)', 'For:4', 'a = 2'),
+            ('range(0, a)', 'If:5', ('(a > 3)', 'a = 2')),
+            ('(a > 2)', 'If:7', ('b += 1', 'a = 2', 'range(0, a)')),
+            ('(a > 3)', 'If:8', ('a = 2', 'range(0, a)')),
+            ('(a > 0)', 'For:17', 'a = 3'),
+        ),
+    )
 
   def test_finally_straightline(self):
 
@@ -785,6 +946,24 @@ class AstToCfgTest(test.TestCase):
         ),
     )
 
+  def test_with_straightline(self):
+
+    def test_fn(a):
+      with max(a) as b:
+        a = 0
+        return b
+
+    graph, = self._build_cfg(test_fn).values()
+
+    self.assertGraphMatches(
+        graph,
+        (
+            ('a', 'max(a)', 'a = 0'),
+            ('max(a)', 'a = 0', 'return b'),
+            ('a = 0', 'return b', None),
+        ),
+    )
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/pyct/common_transformers/BUILD b/tensorflow/contrib/autograph/pyct/common_transformers/BUILD
index ca1441cf6f..a0938b3e5f 100644
--- a/tensorflow/contrib/autograph/pyct/common_transformers/BUILD
+++ b/tensorflow/contrib/autograph/pyct/common_transformers/BUILD
@@ -24,6 +24,7 @@ py_library(
     deps = [
         "//tensorflow/contrib/autograph/pyct",
         "@gast_archive//:gast",
+        "@six_archive//:six",
     ],
 )
 
diff --git a/tensorflow/contrib/autograph/pyct/common_transformers/anf.py b/tensorflow/contrib/autograph/pyct/common_transformers/anf.py
index cc039986c2..e42f679cfe 100644
--- a/tensorflow/contrib/autograph/pyct/common_transformers/anf.py
+++ b/tensorflow/contrib/autograph/pyct/common_transformers/anf.py
@@ -12,12 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Conversion to A-normal form."""
+"""Conversion to A-normal form.
+
+The general idea of A-normal form is that every intermediate value is
+explicitly named with a variable.  For more, see
+https://en.wikipedia.org/wiki/A-normal_form.
+
+The specific converters used here are based on Python AST semantics as
+documented at https://greentreesnakes.readthedocs.io/en/latest/.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import gast
+import six
+
+from tensorflow.contrib.autograph.pyct import templates
 from tensorflow.contrib.autograph.pyct import transformer
 
 
@@ -32,26 +44,375 @@ class DummyGensym(object):
     #   * the symbols generated so far
     self._idx = 0
 
-  def new_name(self, stem):
+  def new_name(self, stem='tmp'):
     self._idx += 1
     return stem + '_' + str(1000 + self._idx)
 
 
 class AnfTransformer(transformer.Base):
-  """Performs the actual conversion."""
+  """Performs the conversion to A-normal form (ANF)."""
 
-  # TODO(mdan): Link to a reference.
-  # TODO(mdan): Implement.
+  # The algorithm is a postorder recursive tree walk.  Any given node A may, in
+  # general, require creation of a series B of Assign statements, which compute
+  # and explicitly name the intermediate values needed to compute the value of
+  # A.  If A was already a statement, it can be replaced with the sequence B +
+  # [A].  If A was an expression, B needs to be propagated up the tree until a
+  # statement is encountered.  Since the `ast.NodeTransformer` framework makes
+  # no provision for subtraversals returning side information, this class
+  # accumulates the sequence B in an instance variable.
 
-  def __init__(self, entity_info):
-    """Creates a transformer.
+  # The only other subtlety is that some Python statements (like `if`) have both
+  # expression fields (`test`) and statement list fields (`body` and `orelse`).
+  # Any additional assignments needed to name all the intermediate values in the
+  # `test` can be prepended to the `if` node, but assignments produced by
+  # processing the `body` and the `orelse` need to be kept together with them,
+  # and not accidentally lifted out of the `if`.
+
+  def __init__(self, entity_info, gensym_source=None):
+    """Creates an ANF transformer.
 
     Args:
       entity_info: transformer.EntityInfo
+      gensym_source: An optional object with the same interface as `DummyGensym`
+        for generating unique names
     """
     super(AnfTransformer, self).__init__(entity_info)
-    self._gensym = DummyGensym(entity_info)
+    if gensym_source is None:
+      self._gensym = DummyGensym(entity_info)
+    else:
+      self._gensym = gensym_source(entity_info)
+    self._pending_statements = []
+
+  def _consume_pending_statements(self):
+    ans = self._pending_statements
+    self._pending_statements = []
+    return ans
+
+  def _add_pending_statement(self, stmt):
+    self._pending_statements.append(stmt)
+
+  _trivial_nodes = (
+      # Non-nodes that show up as AST fields
+      bool, six.string_types,
+      # Leaf nodes that are already in A-normal form
+      gast.expr_context, gast.Name, gast.Num, gast.Str, gast.Bytes,
+      gast.NameConstant, gast.Ellipsis,
+      # Binary operators
+      gast.Add, gast.Sub, gast.Mult, gast.Div, gast.Mod, gast.Pow, gast.LShift,
+      gast.RShift, gast.BitOr, gast.BitXor, gast.BitAnd, gast.FloorDiv,
+      # Unary operators
+      gast.Invert, gast.Not, gast.UAdd, gast.USub,
+      # Comparison operators
+      gast.Eq, gast.NotEq, gast.Lt, gast.LtE, gast.Gt, gast.GtE,
+      gast.Is, gast.IsNot, gast.In, gast.NotIn,
+  )
+
+  def _is_node_trivial(self, node):
+    if node is None:
+      return True
+    elif isinstance(node, self._trivial_nodes):
+      return True
+    elif isinstance(node, gast.keyword):
+      return self._is_node_trivial(node.value)
+    elif isinstance(node, (gast.Starred, gast.withitem, gast.slice)):
+      return self._are_children_trivial(node)
+    return False
+
+  def _are_children_trivial(self, node):
+    for field in node._fields:
+      if not field.startswith('__'):
+        if not self._is_node_trivial(getattr(node, field)):
+          return False
+    return True
+
+  def _ensure_node_is_trivial(self, node):
+    if node is None:
+      return node
+    elif isinstance(node, self._trivial_nodes):
+      return node
+    elif isinstance(node, list):
+      # If something's field was actually a list, e.g., variadic arguments.
+      return [self._ensure_node_is_trivial(n) for n in node]
+    elif isinstance(node, gast.keyword):
+      node.value = self._ensure_node_is_trivial(node.value)
+      return node
+    elif isinstance(node, (gast.Starred, gast.withitem, gast.slice)):
+      return self._ensure_fields_trivial(node)
+    elif isinstance(node, gast.expr):
+      temp_name = self._gensym.new_name()
+      temp_assign = templates.replace(
+          'temp_name = expr', temp_name=temp_name, expr=node)[0]
+      self._add_pending_statement(temp_assign)
+      answer = templates.replace('temp_name', temp_name=temp_name)[0]
+      return answer
+    else:
+      raise ValueError('Do not know how to treat {}'.format(node))
+
+  def _ensure_fields_trivial(self, node):
+    for field in node._fields:
+      if field.startswith('__'):
+        continue
+      setattr(node, field, self._ensure_node_is_trivial(getattr(node, field)))
+    return node
+
+  def _visit_strict_statement(self, node, trivialize_children=True):
+    assert not self._pending_statements
+    node = self.generic_visit(node)
+    if trivialize_children:
+      self._ensure_fields_trivial(node)
+    results = self._consume_pending_statements()
+    results.append(node)
+    return results
+
+  def _visit_strict_expression(self, node):
+    node = self.generic_visit(node)
+    self._ensure_fields_trivial(node)
+    return node
+
+  # Note on code order: These are listed in the same order as the grammar
+  # elements on https://github.com/serge-sans-paille/gast
+
+  # FunctionDef, AsyncFunctionDef, and ClassDef should be correct by default.
+
+  def visit_Return(self, node):
+    return self._visit_strict_statement(node)
+
+  def visit_Delete(self, node):
+    return self._visit_strict_statement(node, trivialize_children=False)
+
+  def visit_Assign(self, node):
+    return self._visit_strict_statement(node, trivialize_children=False)
+
+  def visit_AugAssign(self, node):
+    return self._visit_strict_statement(node, trivialize_children=False)
+
+  def visit_Print(self, node):
+    return self._visit_strict_statement(node)
+
+  def visit_For(self, node):
+    assert not self._pending_statements
+    # It's important to visit node.iter first, because any statements created
+    # thereby need to live outside the body.
+    self.visit(node.iter)
+    node.iter = self._ensure_node_is_trivial(node.iter)
+    iter_stmts = self._consume_pending_statements()
+    # This generic_visit will revisit node.iter, but that is both correct and
+    # cheap because by this point node.iter is trivial.
+    node = self.generic_visit(node)
+    assert not self._pending_statements
+    iter_stmts.append(node)
+    return iter_stmts
+
+  def visit_AsyncFor(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial AsyncFor nodes not supported yet '
+             '(need to think through the semantics).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_While(self, node):
+    if not self._is_node_trivial(node.test):
+      msg = ('While with nontrivial test not supported yet '
+             '(need to avoid precomputing the test).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_If(self, node):
+    assert not self._pending_statements
+    # It's important to visit node.test first, because any statements created
+    # thereby need to live outside the body.
+    self.visit(node.test)
+    node.test = self._ensure_node_is_trivial(node.test)
+    condition_stmts = self._consume_pending_statements()
+    # This generic_visit will revisit node.test, but that is both correct and
+    # cheap because by this point node.test is trivial.
+    node = self.generic_visit(node)
+    assert not self._pending_statements
+    condition_stmts.append(node)
+    return condition_stmts
+
+  def visit_With(self, node):
+    assert not self._pending_statements
+    # It's important to visit node.items first, because any statements created
+    # thereby need to live outside the body.
+    for item in node.items:
+      self.visit(item)
+    node.items = [self._ensure_node_is_trivial(n) for n in node.items]
+    contexts_stmts = self._consume_pending_statements()
+    # This generic_visit will revisit node.items, but that is both correct and
+    # cheap because by this point node.items is trivial.
+    node = self.generic_visit(node)
+    assert not self._pending_statements
+    contexts_stmts.append(node)
+    return contexts_stmts
+
+  def visit_AsyncWith(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial AsyncWith nodes not supported yet '
+             '(need to think through the semantics).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_Raise(self, node):
+    return self._visit_strict_statement(node)
+
+  # Try should be correct by default.
+
+  def visit_Assert(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial Assert nodes not supported yet '
+             '(need to avoid computing the test when assertions are off, and '
+             'avoid computing the irritant when the assertion does not fire).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  # Import and ImportFrom should be correct by default.
+
+  def visit_Exec(self, node):
+    return self._visit_strict_statement(node)
+
+  # Global and Nonlocal should be correct by default.
+
+  def visit_Expr(self, node):
+    return self._visit_strict_statement(node, trivialize_children=False)
+
+  # Pass, Break, and Continue should be correct by default.
+
+  def visit_BoolOp(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial BoolOp nodes not supported yet '
+             '(need to preserve short-circuiting semantics).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_BinOp(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_UnaryOp(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_Lambda(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial Lambda nodes not supported '
+             '(cannot insert statements into lambda bodies).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_IfExp(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial IfExp nodes not supported yet '
+             '(need to convert to If statement, to evaluate branches lazily '
+             'and insert statements into them).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_Dict(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_Set(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_ListComp(self, node):
+    msg = ('ListComp nodes not supported '
+           '(need to convert to a form that tolerates '
+           'assignment statements in clause bodies).')
+    raise ValueError(msg)
+
+  def visit_SetComp(self, node):
+    msg = ('SetComp nodes not supported '
+           '(need to convert to a form that tolerates '
+           'assignment statements in clause bodies).')
+    raise ValueError(msg)
+
+  def visit_DictComp(self, node):
+    msg = ('DictComp nodes not supported '
+           '(need to convert to a form that tolerates '
+           'assignment statements in clause bodies).')
+    raise ValueError(msg)
+
+  def visit_GeneratorExp(self, node):
+    msg = ('GeneratorExp nodes not supported '
+           '(need to convert to a form that tolerates '
+           'assignment statements in clause bodies).')
+    raise ValueError(msg)
+
+  def visit_Await(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial Await nodes not supported yet '
+             '(need to think through the semantics).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_Yield(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_YieldFrom(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial YieldFrom nodes not supported yet '
+             '(need to unit-test them in Python 2).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_Compare(self, node):
+    if len(node.ops) > 1:
+      msg = ('Multi-ary compare nodes not supported yet '
+             '(need to preserve short-circuiting semantics).')
+      raise ValueError(msg)
+    return self._visit_strict_expression(node)
+
+  def visit_Call(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_Repr(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial Repr nodes not supported yet '
+             '(need to research their syntax and semantics).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_FormattedValue(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial FormattedValue nodes not supported yet '
+             '(need to unit-test them in Python 2).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_JoinedStr(self, node):
+    if not self._are_children_trivial(node):
+      msg = ('Nontrivial JoinedStr nodes not supported yet '
+             '(need to unit-test them in Python 2).')
+      raise ValueError(msg)
+    return self.generic_visit(node)
+
+  def visit_Attribute(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_Subscript(self, node):
+    return self._visit_strict_expression(node)
+
+  # Starred and Name are correct by default, because the right thing to do is to
+  # just recur.
+
+  def visit_List(self, node):
+    return self._visit_strict_expression(node)
+
+  def visit_Tuple(self, node):
+    return self._visit_strict_expression(node)
+
+
+def transform(node, entity_info, gensym_source=None):
+  """Converts the given node to A-normal form (ANF).
+
+  The general idea of A-normal form: https://en.wikipedia.org/wiki/A-normal_form
 
+  The specific converters used here are based on Python AST semantics as
+  documented at https://greentreesnakes.readthedocs.io/en/latest/.
 
-def transform(node, entity_info):
-  return AnfTransformer(entity_info).visit(node)
+  Args:
+    node: The node to transform.
+    entity_info: transformer.EntityInfo.  TODO(mdan): What information does this
+      argument provide?
+    gensym_source: An optional object with the same interface as `DummyGensym`
+      for generating unique names.
+  """
+  return AnfTransformer(entity_info, gensym_source=gensym_source).visit(node)
diff --git a/tensorflow/contrib/autograph/pyct/common_transformers/anf_test.py b/tensorflow/contrib/autograph/pyct/common_transformers/anf_test.py
index 81983a5ecb..951974820c 100644
--- a/tensorflow/contrib/autograph/pyct/common_transformers/anf_test.py
+++ b/tensorflow/contrib/autograph/pyct/common_transformers/anf_test.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import textwrap
+
 from tensorflow.contrib.autograph.pyct import compiler
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import transformer
@@ -25,6 +27,22 @@ from tensorflow.contrib.autograph.pyct.common_transformers import anf
 from tensorflow.python.platform import test
 
 
+class DummyGensym(object):
+  """A dumb gensym that suffixes a stem by sequential numbers from 1000."""
+
+  def __init__(self, entity_info):
+    del entity_info
+    # A proper implementation needs to account for:
+    #   * entity_info.namespace
+    #   * all the symbols defined in the AST
+    #   * the symbols generated so far
+    self._idx = 0
+
+  def new_name(self, stem='tmp'):
+    self._idx += 1
+    return stem + '_' + str(1000 + self._idx)
+
+
 class AnfTransformerTest(test.TestCase):
 
   def _simple_source_info(self):
@@ -37,17 +55,349 @@ class AnfTransformerTest(test.TestCase):
         owner_type=None)
 
   def test_basic(self):
-
     def test_function():
       a = 0
       return a
-
     node, _ = parser.parse_entity(test_function)
-    node = anf.transform(node, self._simple_source_info())
+    node = anf.transform(node.body[0], self._simple_source_info())
     result, _ = compiler.ast_to_object(node)
-
     self.assertEqual(test_function(), result.test_function())
 
+  def assert_same_ast(self, expected_node, node, msg=None):
+    expected_source = compiler.ast_to_source(expected_node, indentation='  ')
+    expected_str = textwrap.dedent(expected_source).strip()
+    got_source = compiler.ast_to_source(node, indentation='  ')
+    got_str = textwrap.dedent(got_source).strip()
+    self.assertEqual(expected_str, got_str, msg=msg)
+
+  def assert_body_anfs_as_expected(self, expected_fn, test_fn):
+    # Testing the code bodies only.  Wrapping them in functions so the
+    # syntax highlights nicely, but Python doesn't try to execute the
+    # statements.
+    exp_node, _ = parser.parse_entity(expected_fn)
+    node, _ = parser.parse_entity(test_fn)
+    node = anf.transform(
+        node, self._simple_source_info(), gensym_source=DummyGensym)
+    exp_name = exp_node.body[0].name
+    # Ignoring the function names in the result because they can't be
+    # the same (because both functions have to exist in the same scope
+    # at the same time).
+    node.body[0].name = exp_name
+    self.assert_same_ast(exp_node, node)
+    # Check that ANF is idempotent
+    node_repeated = anf.transform(
+        node, self._simple_source_info(), gensym_source=DummyGensym)
+    self.assert_same_ast(node_repeated, node)
+
+  def test_binop_basic(self):
+
+    def test_function(x, y, z):
+      a = x + y + z
+      return a
+
+    def expected_result(x, y, z):
+      tmp_1001 = x + y
+      a = tmp_1001 + z
+      return a
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_if_basic(self):
+
+    def test_function(a, b, c, e, f, g):
+      if a + b + c:
+        d = e + f + g
+        return d
+
+    def expected_result(a, b, c, e, f, g):
+      tmp_1001 = a + b
+      tmp_1002 = tmp_1001 + c
+      if tmp_1002:
+        tmp_1003 = e + f
+        d = tmp_1003 + g
+        return d
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_nested_binop_and_return(self):
+
+    def test_function(b, c, d, e):
+      return (2 * b + c) + (d + e)
+
+    def expected_result(b, c, d, e):
+      tmp_1001 = 2 * b
+      tmp_1002 = tmp_1001 + c
+      tmp_1003 = d + e
+      tmp_1004 = tmp_1002 + tmp_1003
+      return tmp_1004
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_function_call_and_expr(self):
+
+    def test_function(call_something, a, b, y, z, c, d, e, f, g, h, i):
+      call_something(a + b, y * z, kwarg=c + d, *(e + f), **(g + h + i))
+
+    def expected_result(call_something, a, b, y, z, c, d, e, f, g, h, i):
+      tmp_1001 = g + h
+      tmp_1002 = a + b
+      tmp_1003 = y * z
+      tmp_1004 = e + f
+      tmp_1005 = c + d
+      tmp_1006 = tmp_1001 + i
+      call_something(tmp_1002, tmp_1003, kwarg=tmp_1005, *tmp_1004, **tmp_1006)
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_with_and_print(self):
+
+    def test_function(a, b, c):
+      with a + b + c as d:
+        print(2 * d + 1)
+
+    def expected_result(a, b, c):
+      tmp_1001 = a + b
+      tmp_1002 = tmp_1001 + c
+      with tmp_1002 as d:
+        tmp_1003 = 2 * d
+        tmp_1004 = tmp_1003 + 1
+        print(tmp_1004)
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_local_definition_and_binary_compare(self):
+
+    def test_function():
+      def foo(a, b):
+        return 2 * a < b
+      return foo
+
+    def expected_result():
+      def foo(a, b):
+        tmp_1001 = 2 * a
+        tmp_1002 = tmp_1001 < b
+        return tmp_1002
+      return foo
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_list_literal(self):
+
+    def test_function(a, b, c, d, e, f):
+      return [a + b, c + d, e + f]
+
+    def expected_result(a, b, c, d, e, f):
+      tmp_1001 = a + b
+      tmp_1002 = c + d
+      tmp_1003 = e + f
+      tmp_1004 = [tmp_1001, tmp_1002, tmp_1003]
+      return tmp_1004
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_tuple_literal_and_unary(self):
+
+    def test_function(a, b, c, d, e, f):
+      return (a + b, -(c + d), e + f)
+
+    def expected_result(a, b, c, d, e, f):
+      tmp_1001 = c + d
+      tmp_1002 = a + b
+      tmp_1003 = -tmp_1001
+      tmp_1004 = e + f
+      tmp_1005 = (tmp_1002, tmp_1003, tmp_1004)
+      return tmp_1005
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_set_literal(self):
+
+    def test_function(a, b, c, d, e, f):
+      return set(a + b, c + d, e + f)
+
+    def expected_result(a, b, c, d, e, f):
+      tmp_1001 = a + b
+      tmp_1002 = c + d
+      tmp_1003 = e + f
+      tmp_1004 = set(tmp_1001, tmp_1002, tmp_1003)
+      return tmp_1004
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_dict_literal_and_repr(self):
+
+    def test_function(foo, bar, baz):
+      return repr({foo + bar + baz: 7 | 8})
+
+    def expected_result(foo, bar, baz):
+      tmp_1001 = foo + bar
+      tmp_1002 = tmp_1001 + baz
+      tmp_1003 = 7 | 8
+      tmp_1004 = {tmp_1002: tmp_1003}
+      tmp_1005 = repr(tmp_1004)
+      return tmp_1005
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_field_read_and_write(self):
+
+    def test_function(a, d):
+      a.b.c = d.e.f + 3
+
+    def expected_result(a, d):
+      tmp_1001 = a.b
+      tmp_1002 = d.e
+      tmp_1003 = tmp_1002.f
+      tmp_1001.c = tmp_1003 + 3
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_subscript_read_and_write(self):
+
+    def test_function(a, b, c, d, e, f):
+      a[b][c] = d[e][f] + 3
+
+    def expected_result(a, b, c, d, e, f):
+      tmp_1001 = a[b]
+      tmp_1002 = d[e]
+      tmp_1003 = tmp_1002[f]
+      tmp_1001[c] = tmp_1003 + 3
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_augassign_and_delete(self):
+
+    def test_function(a, x, y, z):
+      a += x + y + z
+      del a
+      del z[y][x]
+
+    def expected_result(a, x, y, z):
+      tmp_1001 = x + y
+      a += tmp_1001 + z
+      del a
+      tmp_1002 = z[y]
+      del tmp_1002[x]
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_raise_yield_and_raise(self):
+
+    def test_function(a, c, some_computed, exception):
+      yield a ** c
+      raise some_computed('complicated' + exception)
+
+    def expected_result(a, c, some_computed, exception):
+      tmp_1001 = a ** c
+      yield tmp_1001
+      tmp_1002 = 'complicated' + exception
+      tmp_1003 = some_computed(tmp_1002)
+      raise tmp_1003
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_with_and_if_with_expressions(self):
+
+    def test_function(foo, bar, function, quux, quozzle, w, x, y, z):
+      with foo + bar:
+        function(x + y)
+      if quux + quozzle:
+        function(z / w)
+
+    def expected_result(foo, bar, function, quux, quozzle, w, x, y, z):
+      tmp_1001 = foo + bar
+      with tmp_1001:
+        tmp_1002 = x + y
+        function(tmp_1002)
+      tmp_1003 = quux + quozzle
+      if tmp_1003:
+        tmp_1004 = z / w
+        function(tmp_1004)
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_exec(self):
+
+    def test_function():
+      # The point is to test A-normal form conversion of exec
+      # pylint: disable=exec-used
+      exec('computed' + 5 + 'stuff', globals(), locals())
+
+    def expected_result():
+      # pylint: disable=exec-used
+      tmp_1001 = 'computed' + 5
+      tmp_1002 = tmp_1001 + 'stuff'
+      tmp_1003 = globals()
+      tmp_1004 = locals()
+      exec(tmp_1002, tmp_1003, tmp_1004)
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_simple_while_and_assert(self):
+
+    def test_function(foo, quux):
+      while foo:
+        assert quux
+        foo = foo + 1 * 3
+
+    def expected_result(foo, quux):
+      while foo:
+        assert quux
+        tmp_1001 = 1 * 3
+        foo = foo + tmp_1001
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  def test_for(self):
+
+    def test_function(compute, something, complicated, foo):
+      for foo in compute(something + complicated):
+        bar = foo + 1 * 3
+      return bar
+
+    def expected_result(compute, something, complicated, foo):
+      tmp_1001 = something + complicated
+      tmp_1002 = compute(tmp_1001)
+      for foo in tmp_1002:
+        tmp_1003 = 1 * 3
+        bar = foo + tmp_1003
+      return bar
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
+  # This test collects several examples where the definition of A-normal form
+  # implemented by this transformer is questionable.  Mostly it's here to spell
+  # out what the definition is in these cases.
+  def test_controversial(self):
+
+    def test_function(b, c, d, f):
+      a = c + d
+      a.b = c + d
+      a[b] = c + d
+      a += c + d
+      a, b = c
+      a, b = c, d
+      a = f(c)
+      a = f(c + d)
+      a[b + d] = f.e(c + d)
+
+    def expected_result(b, c, d, f):
+      a = c + d
+      a.b = c + d  # Should be a.b = tmp?  (Definitely not tmp = c + d)
+      a[b] = c + d  # Should be a[b] = tmp?  (Definitely not tmp = c + d)
+      a += c + d  # Should be a += tmp?  (Definitely not tmp = c + d)
+      a, b = c  # Should be a = c[0], b = c[1]?  Or not?
+      a, b = c, d  # Should be a = c, b = d?  Or not?
+      a = f(c)
+      tmp_1001 = c + d
+      a = f(tmp_1001)
+      tmp_1002 = b + d
+      tmp_1003 = f.e
+      tmp_1004 = c + d
+      a[tmp_1002] = tmp_1003(tmp_1004)  # Or should be a[tmp1] = tmp2?
+
+    self.assert_body_anfs_as_expected(expected_result, test_function)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/pyct/compiler.py b/tensorflow/contrib/autograph/pyct/compiler.py
index 24c4517afa..f9cee10962 100644
--- a/tensorflow/contrib/autograph/pyct/compiler.py
+++ b/tensorflow/contrib/autograph/pyct/compiler.py
@@ -30,46 +30,112 @@ import tempfile
 import astor
 import gast
 
+from tensorflow.contrib.autograph.pyct import origin_info
+
 
 def ast_to_source(node, indentation='  '):
-  """Return the source code of given AST."""
-  if isinstance(node, gast.AST):
-    node = gast.gast_to_ast(node)
+  """Return the source code of given AST.
+
+  Args:
+    node: The code to compile, as an AST object.
+    indentation: The string to use for indentation.
+
+  Returns:
+    code: The source code generated from the AST object
+    source_mapping: A mapping between the user and AutoGraph generated code.
+  """
+  if not isinstance(node, (list, tuple)):
+    node = (node,)
   generator = astor.codegen.SourceGenerator(indentation, False,
                                             astor.string_repr.pretty_string)
-  generator.visit(node)
-  generator.result.append('\n')
+
+  for n in node:
+    if isinstance(n, gast.AST):
+      n = gast.gast_to_ast(n)
+    generator.visit(n)
+    generator.result.append('\n')
+
   # In some versions of Python, literals may appear as actual values. This
   # ensures everything is string.
   code = map(str, generator.result)
-  return astor.source_repr.pretty_source(code).lstrip()
+  code = astor.source_repr.pretty_source(code).lstrip()
 
+  return code
 
-def ast_to_object(
-    node, indentation='  ', source_prefix=None, delete_on_exit=True):
+
+def ast_to_object(nodes,
+                  indentation='  ',
+                  include_source_map=False,
+                  source_prefix=None,
+                  delete_on_exit=True):
   """Return the Python objects represented by given AST.
 
   Compiling the AST code this way ensures that the source code is readable by
   e.g. `pdb` or `inspect`.
 
   Args:
-    node: The code to compile, as an AST object.
-    indentation: The string to use for indentation.
-    source_prefix: Optional string to print as-is into the source file.
-    delete_on_exit: Whether to delete the temporary file used for compilation
-        on exit.
+    nodes: Union[ast.AST, Iterable[ast.AST]], the code to compile, as an AST
+        object.
+    indentation: Text, the string to use for indentation.
+    include_source_map: bool, whether to attach a source map to the compiled
+        object. Also see origin_info.py.
+    source_prefix: Optional[Text], string to print as-is into the source file.
+    delete_on_exit: bool, whether to delete the temporary file used for
+        compilation on exit.
 
   Returns:
-    A module object containing the compiled source code.
+    compiled_nodes: A module object containing the compiled source code.
+    source: The source code of the compiled object
+  Raises:
+    ValueError: If ag_source_map__ is already in the namespace of the compiled
+    nodes.
   """
-  source = ast_to_source(node, indentation)
+  if not isinstance(nodes, (list, tuple)):
+    nodes = (nodes,)
+
+  source = ast_to_source(nodes, indentation=indentation)
+
+  if source_prefix:
+    source = source_prefix + '\n' + source
 
   with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
     module_name = os.path.basename(f.name[:-3])
-    if source_prefix:
-      f.write(source_prefix)
-      f.write('\n')
     f.write(source)
+
+    if isinstance(nodes, (list, tuple)):
+      indices = range(-len(nodes), 0)
+    else:
+      indices = (-1,)
+
+    if include_source_map:
+      source_map = origin_info.source_map(nodes, source, f.name, indices)
+
+  # TODO(mdan): Try flush() and delete=False instead.
   if delete_on_exit:
     atexit.register(lambda: os.remove(f.name))
-  return imp.load_source(module_name, f.name), source
+  compiled_nodes = imp.load_source(module_name, f.name)
+
+  # TODO(znado): Clean this up so we don't need to attach it to the namespace.
+  # TODO(znado): This does not work for classes because their methods share a
+  # namespace.
+  # This attaches the source map which is needed for error handling.  Note that
+  # api.to_graph copies this source map into an attribute of the function.
+  #
+  # We need this so the ag_source_map__ variable is available to the call to
+  # rewrite_graph_construction_error in the except block inside each function
+  # that handles graph construction errors.
+  #
+  # We cannot get the rewritten function name until it is too late so templating
+  # is hard, and this cleanly fixes the
+  # issues encountered with nested functions because this is attached to the
+  # outermost one.
+  if include_source_map:
+    # TODO(mdan): This name should be decided by the caller.
+    source_map_name = 'ag_source_map__'
+    if source_map_name in compiled_nodes.__dict__:
+      raise ValueError('cannot convert %s because is has namespace attribute '
+                       '"%s", which is reserved for AutoGraph.' %
+                       (compiled_nodes, source_map_name))
+    compiled_nodes.__dict__[source_map_name] = source_map
+
+  return compiled_nodes, source
diff --git a/tensorflow/contrib/autograph/pyct/compiler_test.py b/tensorflow/contrib/autograph/pyct/compiler_test.py
index 98cdc1506b..cf783da6a3 100644
--- a/tensorflow/contrib/autograph/pyct/compiler_test.py
+++ b/tensorflow/contrib/autograph/pyct/compiler_test.py
@@ -59,14 +59,14 @@ class CompilerTest(test.TestCase):
                 value=gast.Str('c'))
         ])
 
+    source = compiler.ast_to_source(node, indentation='  ')
     self.assertEqual(
         textwrap.dedent("""
             if 1:
               a = b
             else:
               a = 'c'
-        """).strip(),
-        compiler.ast_to_source(node, indentation='  ').strip())
+        """).strip(), source.strip())
 
   def test_ast_to_object(self):
     node = gast.FunctionDef(
diff --git a/tensorflow/contrib/autograph/pyct/origin_info.py b/tensorflow/contrib/autograph/pyct/origin_info.py
new file mode 100644
index 0000000000..9f98e48a6a
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/origin_info.py
@@ -0,0 +1,173 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Container for origin source code information before AutoGraph compilation."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+
+import gast
+
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import ast_util
+from tensorflow.contrib.autograph.pyct import parser
+from tensorflow.python.util import tf_inspect
+
+
+class LineLocation(
+    collections.namedtuple('LineLocation', ('filename', 'lineno'))):
+  """Similar to Location, but without column information.
+
+  Attributes:
+    filename: Text
+    lineno: int, 1-based
+  """
+  pass
+
+
+class Location(
+    collections.namedtuple('Location', ('filename', 'lineno', 'col_offset'))):
+  """Encodes code location information.
+
+  Attributes:
+    filename: Text
+    lineno: int, 1-based
+    col_offset: int
+  """
+
+  @property
+  def line_loc(self):
+    return LineLocation(self.filename, self.lineno)
+
+
+class OriginInfo(
+    collections.namedtuple(
+        'OriginInfo',
+        ('loc', 'function_name', 'source_code_line'))):
+  """Container for information about the source code before conversion.
+
+  Attributes:
+    loc: Location
+    function_name: Optional[Text]
+    source_code_line: Text
+  """
+
+  def as_frame(self):
+    """Returns a 4-tuple consistent with the return of traceback.extract_tb."""
+    return (self.loc.filename, self.loc.lineno, self.function_name,
+            self.source_code_line)
+
+
+# TODO(mdan): This source map should be a class - easier to refer to.
+def source_map(nodes, code, filename, indices_in_code):
+  """Creates a source map between an annotated AST and the code it compiles to.
+
+  Args:
+    nodes: Iterable[ast.AST, ...]
+    code: Text
+    filename: Optional[Text]
+    indices_in_code: Union[int, Iterable[int, ...]], the positions at which
+        nodes appear in code. The parser always returns a module when parsing
+        code. This argument indicates the position in that module's body at
+        which the corresponding of node should appear.
+
+  Returns:
+    Dict[CodeLocation, OriginInfo], mapping locations in code to locations
+    indicated by origin annotations in node.
+  """
+  reparsed_nodes = parser.parse_str(code)
+  reparsed_nodes = [reparsed_nodes.body[i] for i in indices_in_code]
+
+  resolve(reparsed_nodes, code)
+  result = {}
+
+  for before, after in ast_util.parallel_walk(nodes, reparsed_nodes):
+    # Note: generated code might not be mapped back to its origin.
+    # TODO(mdan): Generated code should always be mapped to something.
+    origin_info = anno.getanno(before, anno.Basic.ORIGIN, default=None)
+    final_info = anno.getanno(after, anno.Basic.ORIGIN, default=None)
+    if origin_info is None or final_info is None:
+      continue
+
+    line_loc = LineLocation(filename, final_info.loc.lineno)
+
+    existing_origin = result.get(line_loc)
+    if existing_origin is not None:
+      # Overlaps may exist because of child nodes, but almost never to
+      # different line locations. Exception make decorated functions, where
+      # both lines are mapped to the same line in the AST.
+
+      # Line overlaps: keep bottom node.
+      if existing_origin.loc.line_loc == origin_info.loc.line_loc:
+        if existing_origin.loc.lineno >= origin_info.loc.lineno:
+          continue
+
+      # In case of overlaps, keep the leftmost node.
+      if existing_origin.loc.col_offset <= origin_info.loc.col_offset:
+        continue
+
+    result[line_loc] = origin_info
+
+  return result
+
+
+# TODO(znado): Consider refactoring this into a Visitor.
+# TODO(mdan): Does this work correctly with inner functions?
+def resolve(nodes, source, function=None):
+  """Adds an origin information to all nodes inside the body of function.
+
+  Args:
+    nodes: Union[ast.AST, Iterable[ast.AST, ...]]
+    source: Text, the source code string for the function whose body nodes will
+      be annotated.
+    function: Callable, the function that will have all nodes inside of it
+      annotation with an OriginInfo annotation with key anno.Basic.ORIGIN.  If
+      it is None then only the line numbers and column offset will be set in the
+      annotation, with the rest of the information being None.
+
+  Returns:
+    A tuple of the AST node for function and a String containing its source
+    code.
+  """
+  if not isinstance(nodes, (list, tuple)):
+    nodes = (nodes,)
+
+  if function:
+    _, function_lineno = tf_inspect.getsourcelines(function)
+    function_filepath = tf_inspect.getsourcefile(function)
+  else:
+    function_lineno = None
+    function_filepath = None
+
+  source_lines = source.split('\n')
+  for node in nodes:
+    for n in gast.walk(node):
+      if not hasattr(n, 'lineno'):
+        continue
+
+      lineno_in_body = n.lineno
+
+      source_code_line = source_lines[lineno_in_body - 1]
+      if function:
+        source_lineno = function_lineno + lineno_in_body - 1
+        function_name = function.__name__
+      else:
+        source_lineno = lineno_in_body
+        function_name = None
+
+      location = Location(function_filepath, source_lineno, n.col_offset)
+      origin = OriginInfo(location, function_name, source_code_line)
+      anno.setanno(n, anno.Basic.ORIGIN, origin)
diff --git a/tensorflow/contrib/autograph/pyct/origin_info_test.py b/tensorflow/contrib/autograph/pyct/origin_info_test.py
new file mode 100644
index 0000000000..6d7d8b1622
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/origin_info_test.py
@@ -0,0 +1,101 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for origin_info module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import compiler
+from tensorflow.contrib.autograph.pyct import origin_info
+from tensorflow.contrib.autograph.pyct import parser
+from tensorflow.python.platform import test
+
+
+class OriginInfoTest(test.TestCase):
+
+  def test_source_map(self):
+
+    def test_fn(x):
+      if x > 0:
+        x += 1
+      return x
+
+    node, source = parser.parse_entity(test_fn)
+    fn_node = node.body[0]
+    origin_info.resolve(fn_node, source)
+
+    # Insert a traced line.
+    new_node = parser.parse_str('x = abs(x)').body[0]
+    anno.copyanno(fn_node.body[0], new_node, anno.Basic.ORIGIN)
+    fn_node.body.insert(0, new_node)
+
+    # Insert an untraced line.
+    fn_node.body.insert(0, parser.parse_str('x = 0').body[0])
+
+    modified_source = compiler.ast_to_source(fn_node)
+
+    source_map = origin_info.source_map(fn_node, modified_source,
+                                        'test_filename', [0])
+
+    loc = origin_info.LineLocation('test_filename', 1)
+    origin = source_map[loc]
+    self.assertEqual(origin.source_code_line, 'def test_fn(x):')
+    self.assertEqual(origin.loc.lineno, 1)
+
+    # The untraced line, inserted second.
+    loc = origin_info.LineLocation('test_filename', 2)
+    self.assertFalse(loc in source_map)
+
+    # The traced line, inserted first.
+    loc = origin_info.LineLocation('test_filename', 3)
+    origin = source_map[loc]
+    self.assertEqual(origin.source_code_line, '  if x > 0:')
+    self.assertEqual(origin.loc.lineno, 2)
+
+    loc = origin_info.LineLocation('test_filename', 4)
+    origin = source_map[loc]
+    self.assertEqual(origin.source_code_line, '  if x > 0:')
+    self.assertEqual(origin.loc.lineno, 2)
+
+  def test_resolve(self):
+
+    def test_fn(x):
+      """Docstring."""
+      return x  # comment
+
+    node, source = parser.parse_entity(test_fn)
+    fn_node = node.body[0]
+    origin_info.resolve(fn_node, source)
+
+    origin = anno.getanno(fn_node, anno.Basic.ORIGIN)
+    self.assertEqual(origin.loc.lineno, 1)
+    self.assertEqual(origin.loc.col_offset, 0)
+    self.assertEqual(origin.source_code_line, 'def test_fn(x):')
+
+    origin = anno.getanno(fn_node.body[0], anno.Basic.ORIGIN)
+    self.assertEqual(origin.loc.lineno, 2)
+    self.assertEqual(origin.loc.col_offset, 2)
+    self.assertEqual(origin.source_code_line, '  """Docstring."""')
+
+    origin = anno.getanno(fn_node.body[1], anno.Basic.ORIGIN)
+    self.assertEqual(origin.loc.lineno, 3)
+    self.assertEqual(origin.loc.col_offset, 2)
+    self.assertEqual(origin.source_code_line, '  return x  # comment')
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/pyct/parser.py b/tensorflow/contrib/autograph/pyct/parser.py
index c961efa892..112ed46a1e 100644
--- a/tensorflow/contrib/autograph/pyct/parser.py
+++ b/tensorflow/contrib/autograph/pyct/parser.py
@@ -37,6 +37,7 @@ def parse_entity(entity):
 
 def parse_str(src):
   """Returns the AST of given piece of code."""
+  # TODO(mdan): This should exclude the module things are autowrapped in.
   return gast.parse(src)
 
 
diff --git a/tensorflow/contrib/autograph/pyct/qual_names.py b/tensorflow/contrib/autograph/pyct/qual_names.py
index da07013cf4..fb81404edc 100644
--- a/tensorflow/contrib/autograph/pyct/qual_names.py
+++ b/tensorflow/contrib/autograph/pyct/qual_names.py
@@ -30,6 +30,7 @@ import collections
 import gast
 
 from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import parser
 
 
 class Symbol(collections.namedtuple('Symbol', ['name'])):
@@ -89,7 +90,8 @@ class QN(object):
       if not isinstance(base, (str, StringLiteral, NumberLiteral)):
         # TODO(mdan): Require Symbol instead of string.
         raise ValueError(
-            'For simple QNs, base must be a string or a Literal object.')
+            'for simple QNs, base must be a string or a Literal object;'
+            ' got instead "%s"' % type(base))
       assert '.' not in base and '[' not in base and ']' not in base
       self._parent = None
       self.qn = (base,)
@@ -113,6 +115,22 @@ class QN(object):
     return self._parent
 
   @property
+  def owner_set(self):
+    """Returns all the symbols (simple or composite) that own this QN.
+
+    In other words, if this symbol was modified, the symbols in the owner set
+    may also be affected.
+
+    Examples:
+      'a.b[c.d]' has two owners, 'a' and 'a.b'
+    """
+    owners = set()
+    if self.has_attr() or self.has_subscript():
+      owners.add(self.parent)
+      owners.update(self.parent.owner_set)
+    return owners
+
+  @property
   def support_set(self):
     """Returns the set of simple symbols that this QN relies on.
 
@@ -122,7 +140,7 @@ class QN(object):
 
     Examples:
       'a.b' has only one support symbol, 'a'
-      'a[i]' has two roots, 'a' and 'i'
+      'a[i]' has two support symbols, 'a' and 'i'
     """
     # TODO(mdan): This might be the set of Name nodes in the AST. Track those?
     roots = set()
@@ -231,3 +249,9 @@ class QnResolver(gast.NodeTransformer):
 
 def resolve(node):
   return QnResolver().visit(node)
+
+
+def from_str(qn_str):
+  node = parser.parse_expression(qn_str)
+  node = resolve(node)
+  return anno.getanno(node, anno.Basic.QN)
diff --git a/tensorflow/contrib/autograph/pyct/qual_names_test.py b/tensorflow/contrib/autograph/pyct/qual_names_test.py
index 264afd508c..c793c2bb39 100644
--- a/tensorflow/contrib/autograph/pyct/qual_names_test.py
+++ b/tensorflow/contrib/autograph/pyct/qual_names_test.py
@@ -30,6 +30,15 @@ from tensorflow.python.platform import test
 
 class QNTest(test.TestCase):
 
+  def test_from_str(self):
+    a = QN('a')
+    b = QN('b')
+    a_dot_b = QN(a, attr='b')
+    a_sub_b = QN(a, subscript=b)
+    self.assertEqual(qual_names.from_str('a.b'), a_dot_b)
+    self.assertEqual(qual_names.from_str('a'), a)
+    self.assertEqual(qual_names.from_str('a[b]'), a_sub_b)
+
   def test_basic(self):
     a = QN('a')
     self.assertEqual(a.qn, ('a',))
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/BUILD b/tensorflow/contrib/autograph/pyct/static_analysis/BUILD
index bcf2dacec2..92eacba3fd 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/BUILD
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/BUILD
@@ -19,8 +19,9 @@ py_library(
     srcs = [
         "activity.py",
         "annos.py",
-        "cfg.py",
         "live_values.py",
+        "liveness.py",
+        "reaching_definitions.py",
         "type_info.py",
     ],
     srcs_version = "PY2AND3",
@@ -28,6 +29,7 @@ py_library(
     deps = [
         "//tensorflow/contrib/autograph/pyct",
         "//tensorflow/contrib/autograph/utils",
+        "//tensorflow/python:util",
         "@gast_archive//:gast",
     ],
 )
@@ -46,23 +48,32 @@ py_test(
 )
 
 py_test(
-    name = "cfg_test",
-    srcs = ["cfg_test.py"],
+    name = "live_values_test",
+    srcs = ["live_values_test.py"],
     srcs_version = "PY2AND3",
     tags = ["no_windows"],
     deps = [
         ":static_analysis",
         "//tensorflow/contrib/autograph/pyct",
         "//tensorflow/python:client_testlib",
-        "@gast_archive//:gast",
     ],
 )
 
 py_test(
-    name = "live_values_test",
-    srcs = ["live_values_test.py"],
+    name = "liveness_test",
+    srcs = ["liveness_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":static_analysis",
+        "//tensorflow/contrib/autograph/pyct",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
+    name = "reaching_definitions_test",
+    srcs = ["reaching_definitions_test.py"],
     srcs_version = "PY2AND3",
-    tags = ["no_windows"],
     deps = [
         ":static_analysis",
         "//tensorflow/contrib/autograph/pyct",
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/__init__.py b/tensorflow/contrib/autograph/pyct/static_analysis/__init__.py
index c325e19f28..9a82de735d 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/__init__.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/__init__.py
@@ -18,10 +18,14 @@ This module contains utilities to help annotate AST nodes with as much runtime
 information as can be possibly extracted without actually executing the code,
 under that assumption that the context in which the code will run is known.
 
-Note: It's a fair bet that this analysis cannot be reused across contexts
-without re-running it. In most cases, the context usually means referenced
-modules, which should be static enough to allow reuse, but that is not being
-reliably verified.
+Overall, the different analyses have the functions listed below:
+
+ * activity: inventories symbols read, written to, params, etc. at different
+     levels
+ * liveness, reaching_definitions: dataflow analyses based on the program's CFG
+     and using the symbol information gathered by activity analysis
+ * live_values, type_info: type and value inference based on dataflow
+     analysis and context information
 """
 
 from __future__ import absolute_import
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/activity.py b/tensorflow/contrib/autograph/pyct/static_analysis/activity.py
index 4d7b0cbb7b..a0182da9d1 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/activity.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/activity.py
@@ -12,7 +12,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Activity analysis."""
+"""Activity analysis.
+
+Requires qualified name annotations (see qual_names.py).
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -59,9 +62,10 @@ class Scope(object):
     self.parent = parent
     self.add_unknown_symbols = add_unknown_symbols
     self.modified = set()
+    # TODO(mdan): Completely remove this.
     self.created = set()
     self.used = set()
-    self.params = set()
+    self.params = {}
     self.returned = set()
 
   # TODO(mdan): Rename to `locals`
@@ -106,37 +110,23 @@ class Scope(object):
     self.modified |= other.modified
     self.created |= other.created
     self.used |= other.used
-    self.params |= other.params
+    self.params.update(other.params)
     self.returned |= other.returned
 
   def has(self, name):
-    if name in self.modified or name in self.params:
+    if name in self.modified:
       return True
     elif self.parent is not None:
       return self.parent.has(name)
     return False
 
-  def is_modified_since_entry(self, name):
-    if name in self.modified:
-      return True
-    elif self.parent is not None and not self.isolated:
-      return self.parent.is_modified_since_entry(name)
-    return False
-
-  def is_param(self, name):
-    if name in self.params:
-      return True
-    elif self.parent is not None and not self.isolated:
-      return self.parent.is_param(name)
-    return False
-
   def mark_read(self, name):
     self.used.add(name)
     if self.parent is not None and name not in self.created:
       self.parent.mark_read(name)
 
-  def mark_param(self, name):
-    self.params.add(name)
+  def mark_param(self, name, owner):
+    self.params[name] = owner
 
   def mark_creation(self, name, writes_create_symbol=False):
     """Mark a qualified name as created."""
@@ -226,37 +216,56 @@ class ActivityAnalyzer(transformer.Base):
     elif isinstance(node.ctx, gast.Param):
       # Param contexts appear in function defs, so they have the meaning of
       # defining a variable.
-      # TODO(mdan): This may be incorrect with nested functions.
-      # For nested functions, we'll have to add the notion of hiding args from
-      # the parent scope, not writing to them.
-      self.scope.mark_creation(qn)
-      self.scope.mark_param(qn)
+      self.scope.mark_write(qn)
+      self.scope.mark_param(qn, self.enclosing_entities[-1])
     else:
       raise ValueError('Unknown context %s for node %s.' % (type(node.ctx), qn))
 
     anno.setanno(node, NodeAnno.IS_LOCAL, self.scope.has(qn))
-    anno.setanno(node, NodeAnno.IS_MODIFIED_SINCE_ENTRY,
-                 self.scope.is_modified_since_entry(qn))
-    anno.setanno(node, NodeAnno.IS_PARAM, self.scope.is_param(qn))
 
     if self._in_return_statement:
       self.scope.mark_returned(qn)
 
+  def _enter_scope(self, isolated):
+    self.scope = Scope(self.scope, isolated=isolated)
+
+  def _exit_scope(self):
+    self.scope = self.scope.parent
+
+  def _process_statement(self, node):
+    self._enter_scope(False)
+    node = self.generic_visit(node)
+    anno.setanno(node, anno.Static.SCOPE, self.scope)
+    self._exit_scope()
+    return node
+
+  def visit_Expr(self, node):
+    return self._process_statement(node)
+
+  def visit_Return(self, node):
+    self._in_return_statement = True
+    node = self._process_statement(node)
+    self._in_return_statement = False
+    return node
+
+  def visit_Assign(self, node):
+    return self._process_statement(node)
+
   def visit_AugAssign(self, node):
     # Special rules for AugAssign. In Assign, the target is only written,
     # but in AugAssig (e.g. a += b), the target is both read and written.
     self._in_aug_assign = True
-    self.generic_visit(node)
+    node = self._process_statement(node)
     self._in_aug_assign = False
     return node
 
   def visit_Name(self, node):
-    self.generic_visit(node)
+    node = self.generic_visit(node)
     self._track_symbol(node)
     return node
 
   def visit_Attribute(self, node):
-    self.generic_visit(node)
+    node = self.generic_visit(node)
     if self._in_constructor and self._node_sets_self_attribute(node):
       self._track_symbol(
           node, composite_writes_alter_parent=True, writes_create_symbol=True)
@@ -265,44 +274,38 @@ class ActivityAnalyzer(transformer.Base):
     return node
 
   def visit_Subscript(self, node):
-    self.generic_visit(node)
+    node = self.generic_visit(node)
     # Subscript writes (e.g. a[b] = "value") are considered to modify
     # both the element itself (a[b]) and its parent (a).
-    self._track_symbol(node, composite_writes_alter_parent=True)
+    self._track_symbol(node)
     return node
 
   def visit_Print(self, node):
-    current_scope = self.scope
-    args_scope = Scope(current_scope)
-    self.scope = args_scope
-    for n in node.values:
-      self.visit(n)
-    anno.setanno(node, NodeAnno.ARGS_SCOPE, args_scope)
-    self.scope = current_scope
+    self._enter_scope(False)
+    node.values = self.visit_block(node.values)
+    anno.setanno(node, anno.Static.SCOPE, self.scope)
+    anno.setanno(node, NodeAnno.ARGS_SCOPE, self.scope)
+    self._exit_scope()
     return node
 
+  def visit_Assert(self, node):
+    return self._process_statement(node)
+
   def visit_Call(self, node):
-    current_scope = self.scope
-    args_scope = Scope(current_scope, isolated=False)
-    self.scope = args_scope
-    for n in node.args:
-      self.visit(n)
+    self._enter_scope(False)
+    node.args = self.visit_block(node.args)
+    node.keywords = self.visit_block(node.keywords)
     # TODO(mdan): Account starargs, kwargs
-    for n in node.keywords:
-      self.visit(n)
-    anno.setanno(node, NodeAnno.ARGS_SCOPE, args_scope)
-    self.scope = current_scope
-    self.visit(node.func)
+    anno.setanno(node, NodeAnno.ARGS_SCOPE, self.scope)
+    self._exit_scope()
+    node.func = self.visit(node.func)
     return node
 
   def _process_block_node(self, node, block, scope_name):
-    current_scope = self.scope
-    block_scope = Scope(current_scope, isolated=False)
-    self.scope = block_scope
-    for n in block:
-      self.visit(n)
-    anno.setanno(node, scope_name, block_scope)
-    self.scope = current_scope
+    self._enter_scope(False)
+    block = self.visit_block(block)
+    anno.setanno(node, scope_name, self.scope)
+    self._exit_scope()
     return node
 
   def _process_parallel_blocks(self, parent, children):
@@ -321,94 +324,75 @@ class ActivityAnalyzer(transformer.Base):
       self.scope.merge_from(after_child)
     return parent
 
+  def visit_arguments(self, node):
+    return self._process_statement(node)
+
   def visit_FunctionDef(self, node):
-    if self.scope:
-      qn = qual_names.QN(node.name)
-      self.scope.mark_write(qn)
-    current_scope = self.scope
-    body_scope = Scope(current_scope, isolated=True)
-    self.scope = body_scope
-    self.generic_visit(node)
-    anno.setanno(node, NodeAnno.BODY_SCOPE, body_scope)
-    self.scope = current_scope
+    # The FunctionDef node itself has a Scope object that tracks the creation
+    # of its name, along with the usage of any decorator accompany it.
+    self._enter_scope(False)
+    node.decorator_list = self.visit_block(node.decorator_list)
+    self.scope.mark_write(qual_names.QN(node.name))
+    anno.setanno(node, anno.Static.SCOPE, self.scope)
+    self._exit_scope()
+
+    # A separate Scope tracks the actual function definition.
+    self._enter_scope(True)
+    node.args = self.visit(node.args)
+
+    # Track the body separately. This is for compatibility reasons, it may not
+    # be strictly needed.
+    self._enter_scope(False)
+    node.body = self.visit_block(node.body)
+    anno.setanno(node, NodeAnno.BODY_SCOPE, self.scope)
+    self._exit_scope()
+
+    self._exit_scope()
     return node
 
   def visit_With(self, node):
-    current_scope = self.scope
-    with_scope = Scope(current_scope, isolated=False)
-    self.scope = with_scope
-    self.generic_visit(node)
-    anno.setanno(node, NodeAnno.BODY_SCOPE, with_scope)
-    self.scope = current_scope
+    self._enter_scope(False)
+    node = self.generic_visit(node)
+    anno.setanno(node, NodeAnno.BODY_SCOPE, self.scope)
+    self._exit_scope()
     return node
 
-  def visit_If(self, node):
-    current_scope = self.scope
-    cond_scope = Scope(current_scope, isolated=False)
-    self.scope = cond_scope
-    self.visit(node.test)
-    anno.setanno(node, NodeAnno.COND_SCOPE, cond_scope)
-    self.scope = current_scope
+  def visit_withitem(self, node):
+    return self._process_statement(node)
 
+  def visit_If(self, node):
+    self._enter_scope(False)
+    node.test = self.visit(node.test)
+    anno.setanno(node, NodeAnno.COND_SCOPE, self.scope)
+    anno.setanno(node.test, anno.Static.SCOPE, self.scope)
+    self._exit_scope()
     node = self._process_parallel_blocks(node,
                                          ((node.body, NodeAnno.BODY_SCOPE),
                                           (node.orelse, NodeAnno.ORELSE_SCOPE)))
     return node
 
   def visit_For(self, node):
-    self.visit(node.target)
-    self.visit(node.iter)
+    self._enter_scope(False)
+    node.target = self.visit(node.target)
+    node.iter = self.visit(node.iter)
+    anno.setanno(node.iter, anno.Static.SCOPE, self.scope)
+    self._exit_scope()
     node = self._process_parallel_blocks(node,
                                          ((node.body, NodeAnno.BODY_SCOPE),
                                           (node.orelse, NodeAnno.ORELSE_SCOPE)))
     return node
 
   def visit_While(self, node):
-    current_scope = self.scope
-    cond_scope = Scope(current_scope, isolated=False)
-    self.scope = cond_scope
-    self.visit(node.test)
-    anno.setanno(node, NodeAnno.COND_SCOPE, cond_scope)
-    self.scope = current_scope
-
+    self._enter_scope(False)
+    node.test = self.visit(node.test)
+    anno.setanno(node, NodeAnno.COND_SCOPE, self.scope)
+    anno.setanno(node.test, anno.Static.SCOPE, self.scope)
+    self._exit_scope()
     node = self._process_parallel_blocks(node,
                                          ((node.body, NodeAnno.BODY_SCOPE),
                                           (node.orelse, NodeAnno.ORELSE_SCOPE)))
     return node
 
-  def visit_Return(self, node):
-    self._in_return_statement = True
-    node = self.generic_visit(node)
-    self._in_return_statement = False
-    return node
-
-
-def get_read(node, context):
-  """Return the variable names as QNs (qual_names.py) read by this statement."""
-  analyzer = ActivityAnalyzer(context, None, True)
-  analyzer.visit(node)
-  return analyzer.scope.used
-
-
-def get_updated(node, context):
-  """Return the variable names created or mutated by this statement.
-
-  This function considers assign statements, augmented assign statements, and
-  the targets of for loops, as well as function arguments.
-  For example, `x[0] = 2` will return `x`, `x, y = 3, 4` will return `x` and
-  `y`, `for i in range(x)` will return `i`, etc.
-  Args:
-    node: An AST node
-    context: An EntityContext instance
-
-  Returns:
-    A set of variable names (QNs, see qual_names.py) of all the variables
-    created or mutated.
-  """
-  analyzer = ActivityAnalyzer(context, None, True)
-  analyzer.visit(node)
-  return analyzer.scope.created | analyzer.scope.modified
-
 
 def resolve(node, context, parent_scope=None):
   return ActivityAnalyzer(context, parent_scope).visit(node)
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/activity_test.py b/tensorflow/contrib/autograph/pyct/static_analysis/activity_test.py
index bc22be0a27..e940516190 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/activity_test.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/activity_test.py
@@ -52,18 +52,18 @@ class ScopeTest(test.TestCase):
     other = activity.Scope(None)
     other.copy_from(scope)
 
-    self.assertTrue(QN('foo') in other.created)
+    self.assertTrue(QN('foo') in other.modified)
 
     scope.mark_write(QN('bar'))
     scope.copy_from(other)
 
-    self.assertFalse(QN('bar') in scope.created)
+    self.assertFalse(QN('bar') in scope.modified)
 
     scope.mark_write(QN('bar'))
     scope.merge_from(other)
 
-    self.assertTrue(QN('bar') in scope.created)
-    self.assertFalse(QN('bar') in other.created)
+    self.assertTrue(QN('bar') in scope.modified)
+    self.assertFalse(QN('bar') in other.modified)
 
   def test_copy_of(self):
     scope = activity.Scope(None)
@@ -157,7 +157,8 @@ class ActivityAnalyzerTest(test.TestCase):
     """Assert the scope contains specific used, modified & created variables."""
     self.assertSymbolSetsAre(used, scope.used, 'read')
     self.assertSymbolSetsAre(modified, scope.modified, 'modified')
-    self.assertSymbolSetsAre(created, scope.created, 'created')
+    # Created is deprecated, we're no longer verifying it.
+    # self.assertSymbolSetsAre(created, scope.created, 'created')
 
   def test_print_statement(self):
 
@@ -215,12 +216,6 @@ class ActivityAnalyzerTest(test.TestCase):
         (),
         (),
     )
-    self.assertScopeIsRmc(
-        anno.getanno(call_node, NodeAnno.ARGS_SCOPE).parent,
-        ('a', 'a.b', 'a.c', 'a.d', 'foo'),
-        ('a.c',),
-        ('a',),
-    )
 
   def test_call_args_subscripts(self):
 
@@ -241,12 +236,6 @@ class ActivityAnalyzerTest(test.TestCase):
         (),
         (),
     )
-    self.assertScopeIsRmc(
-        anno.getanno(call_node, NodeAnno.ARGS_SCOPE).parent,
-        ('a', 'a[0]', 'a[b]', 'a[c]', 'b', 'c', 'foo'),
-        ('b', 'c'),
-        ('a', 'b', 'c'),
-    )
 
   def test_while(self):
 
@@ -362,20 +351,20 @@ class ActivityAnalyzerTest(test.TestCase):
     self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.BODY_SCOPE),
         ('a', 'b', 'c', 'a[c]'),
-        ('a', 'a[b]', 'd'),
+        ('a[b]', 'd'),
         ('d',),
     )
     # TODO(mdan): Should subscript writes (a[0] = 1) be considered to read "a"?
     self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.ORELSE_SCOPE),
         ('a', 'e'),
-        ('a', 'a[0]', 'd'),
+        ('a[0]', 'd'),
         ('d',),
     )
     self.assertScopeIsRmc(
         anno.getanno(if_node, NodeAnno.ORELSE_SCOPE).parent,
         ('a', 'b', 'c', 'd', 'e', 'a[c]'),
-        ('a', 'd', 'a[b]', 'a[0]'),
+        ('d', 'a[b]', 'a[0]'),
         ('a', 'b', 'c', 'd', 'e'),
     )
 
@@ -416,10 +405,6 @@ class ActivityAnalyzerTest(test.TestCase):
     fn_def_node = node.body[0].body[0]
 
     self.assertScopeIsRmc(
-        anno.getanno(fn_def_node,
-                     NodeAnno.BODY_SCOPE).parent, ('b', 'i', 'f', 'c', 'a'),
-        ('f', 'b', 'c', 'i'), ('f', 'a', 'b', 'c', 'i'))
-    self.assertScopeIsRmc(
         anno.getanno(fn_def_node, NodeAnno.BODY_SCOPE), ('x', 'y'), ('y',), (
             'x',
             'y',
@@ -452,7 +437,7 @@ class ActivityAnalyzerTest(test.TestCase):
     self.assertScopeIsRmc(
         anno.getanno(fn_node, NodeAnno.BODY_SCOPE),
         ('a', 'a[0]'),
-        ('a', 'a[0]'),
+        ('a[0]',),
         ('a',),
     )
 
@@ -518,47 +503,6 @@ class ActivityAnalyzerTest(test.TestCase):
         anno.getanno(fn_node, NodeAnno.BODY_SCOPE), ('b',), (('')),
         (('a', 'b')))
 
-  def test_get_read(self):
-
-    def test_fn(x, y):
-      z = test_fn(x, y)
-      return z
-
-    node, ctx = self._parse_and_analyze(test_fn)
-    node = node.body[0].body[0]
-    read_vars = activity.get_read(node, ctx)
-    self.assertEqual(read_vars, set(map(qual_names.QN, ('test_fn', 'x', 'y'))))
-
-    def test_fn2(x, y, z):
-      z += test_fn2(x, y, z)
-      return z
-
-    node, ctx = self._parse_and_analyze(test_fn2)
-    node = node.body[0].body[0]
-    read_vars = activity.get_read(node, ctx)
-    self.assertEqual(read_vars,
-                     set(map(qual_names.QN, ('test_fn2', 'x', 'y', 'z'))))
-
-  def test_get_updated(self):
-
-    def test_fn(x, y):
-      z = test_fn(x, y)
-      return z
-
-    node, ctx = self._parse_and_analyze(test_fn)
-    node = node.body[0].body[0]
-    updated_vars = activity.get_updated(node, ctx)
-    self.assertEqual(updated_vars, set(map(qual_names.QN, ('z'))))
-
-    def test_fn2(x, y, z):
-      z += test_fn2(x, y, z)
-      return z
-
-    node, ctx = self._parse_and_analyze(test_fn2)
-    node = node.body[0].body[0]
-    updated_vars = activity.get_updated(node, ctx)
-    self.assertEqual(updated_vars, set(map(qual_names.QN, ('z'))))
-
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/annos.py b/tensorflow/contrib/autograph/pyct/static_analysis/annos.py
index b929b35b79..5eefecf278 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/annos.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/annos.py
@@ -21,6 +21,9 @@ from __future__ import print_function
 from enum import Enum
 
 
+# TODO(mdan): Remove.
+
+
 class NoValue(Enum):
 
   def __repr__(self):
@@ -50,10 +53,3 @@ class NodeAnno(NoValue):
   ORELSE_SCOPE = (
       'The scope for the orelse body of a statement (False branch for if '
       'statements, orelse body for loops).')
-
-  # Type and Value annotations
-  # Type annotations are represented by objects of type type_info.Type.
-  STATIC_INFO = (
-      'The type or value information that should be asserted about the entity '
-      'referenced by the symbol holding this annotation, irrespective of the '
-      'execution context.')
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/cfg.py b/tensorflow/contrib/autograph/pyct/static_analysis/cfg.py
deleted file mode 100644
index 4acc4ed66a..0000000000
--- a/tensorflow/contrib/autograph/pyct/static_analysis/cfg.py
+++ /dev/null
@@ -1,446 +0,0 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Control flow graph analysis.
-
-Given a Python AST we construct a control flow graph, with edges both to the
-next and previous statements (so it can easily walk the graph both ways). Its
-nodes contain the AST of the statements. It can then perform forward or backward
-analysis on this CFG.
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from collections import namedtuple
-import functools
-import operator
-
-import gast
-
-from tensorflow.contrib.autograph.pyct import anno
-from tensorflow.contrib.autograph.pyct.static_analysis import activity
-
-
-class CfgNode(object):
-  """A node in the CFG."""
-  __slots__ = ['next', 'value', 'prev']
-
-  def __init__(self, value):
-    self.next = set()
-    self.prev = set()
-    self.value = value
-
-
-class Cfg(namedtuple('Cfg', ['entry', 'exit'])):
-  """A Control Flow Graph.
-
-  Each statement is represented as a node. For control flow statements such
-  as conditionals and loops the conditional itself is a node which either
-  branches or cycles, respectively.
-  Attributes:
-    entry: The entry node, which contains the `gast.arguments` node of the
-        function definition.
-    exit: The exit node. This node is special because it has no value (i.e. no
-        corresponding AST node). This is because Python functions can have
-        multiple return statements.
-  """
-  pass
-
-
-class CfgBuilder(gast.NodeVisitor):
-  """Construct a control flow graph.
-
-  Construct a CFG starting from a FunctionDef node.
-  Usage:
-    cfg_obj = CfgBuilder().build_cfg(fndef_node)
-  """
-
-  def __init__(self):
-    # The current leaves of the CFG
-    self.current_leaves = []
-    # TODO(alexbw): generalize to break, return, continue, yield, etc.
-    # A stack of lists, tracking continue statements
-    self.continue_ = []
-    # A stack of lists tracking break nodes
-    self.break_ = []
-
-  def set_current_leaves(self, cfg_node):
-    """Link this cfg_node to the current leaves.
-
-    This is the central function for building the CFG. It links the current
-    head cfg_nodes to the passed cfg_node. It then resets the head to the
-    passed cfg_node.
-
-    Args:
-      cfg_node: A CfgNode instance.
-    """
-    for head in self.current_leaves:
-      head.next.add(cfg_node)
-      # While we're linking the CFG forward, add backlinks
-      cfg_node.prev.add(head)
-    self.current_leaves = [cfg_node]
-
-  def build_cfg(self, node):
-    """Build a CFG for a function.
-
-    Implementation of building a CFG for dataflow analysis. See, e.g.:
-    https://www.seas.harvard.edu/courses/cs252/2011sp/slides/Lec02-Dataflow.pdf
-
-    Args:
-      node: A function definition the body of which to analyze.
-    Returns:
-      A CFG object.
-    Raises:
-      TypeError: If the input is not a function definition.
-    """
-    if not isinstance(node, gast.FunctionDef):
-      raise TypeError('input must be a function definition')
-    entry_cfg_node = CfgNode(node.args)
-    self.current_leaves = [entry_cfg_node]
-    self.visit_statements(node.body)
-    exit_cfg_node = CfgNode(None)
-    self.set_current_leaves(exit_cfg_node)
-    return Cfg(entry_cfg_node, exit_cfg_node)
-
-  def visit_statements(self, nodes):
-    for node in nodes:
-      # Check for control flow
-      if isinstance(node, (gast.For, gast.While, gast.If, gast.Try, gast.Break,
-                           gast.Continue, gast.With)):
-        self.visit(node)
-      else:
-        expr = CfgNode(node)
-        self.set_current_leaves(expr)
-
-  def generic_visit(self, node):
-    raise ValueError('unknown control flow')
-
-  def visit_If(self, node):
-    # TODO(alexbw): change this to use immutable tuples instead of lists
-    # The current head will hold the conditional
-    test = CfgNode(node.test)
-    self.set_current_leaves(test)
-    # Handle the body
-    self.visit_statements(node.body)
-    body_exit = self.current_leaves
-    self.current_leaves = [test]
-    # Handle the orelse
-    self.visit_statements(node.orelse)
-    self.current_leaves.extend(body_exit)
-
-  def visit_While(self, node):
-    test = CfgNode(node.test)
-    self.set_current_leaves(test)
-    # Start a new level of nesting
-    self.break_.append([])
-    self.continue_.append([])
-    # Handle the body
-    self.visit_statements(node.body)
-    body_exit = self.current_leaves
-    self.current_leaves.extend(self.continue_.pop())
-    self.set_current_leaves(test)
-    # Handle the orelse
-    self.visit_statements(node.orelse)
-    # The break statements and the test go to the next node
-    self.current_leaves.extend(self.break_.pop())
-    # Body and orelse statements can reach out of the loop
-    self.current_leaves.extend(body_exit)
-
-  def visit_For(self, node):
-    iter_ = CfgNode(node.iter)
-    self.set_current_leaves(iter_)
-    self.break_.append([])
-    self.continue_.append([])
-    self.visit_statements(node.body)
-    body_exit = self.current_leaves
-    self.current_leaves.extend(self.continue_.pop())
-    self.set_current_leaves(iter_)
-    # Handle the orelse
-    self.visit_statements(node.orelse)
-    # The break statements and the test go to the next node
-    self.current_leaves.extend(self.break_.pop())
-    # Body and orelse statements can reach out of the loop
-    self.current_leaves.extend(body_exit)
-
-  def visit_Break(self, node):
-    self.break_[-1].extend(self.current_leaves)
-    self.current_leaves[:] = []
-
-  def visit_Continue(self, node):
-    self.continue_[-1].extend(self.current_leaves)
-    self.current_leaves[:] = []
-
-  def visit_Try(self, node):
-    self.visit_statements(node.body)
-    body = self.current_leaves
-    handlers = []
-    for handler in node.handlers:
-      self.current_leaves = body[:]
-      self.visit_statements(handler.body)
-      handlers.extend(self.current_leaves)
-    self.current_leaves = body
-    self.visit_statements(node.orelse)
-    self.current_leaves = handlers + self.current_leaves
-    self.visit_statements(node.finalbody)
-
-  def visit_With(self, node):
-    for item in node.items:
-      self.set_current_leaves(CfgNode(item))
-    self.visit_statements(node.body)
-
-
-# TODO(alexbw): once CFG analysis occurs at a block level,
-# this extra class will not be necessary
-class PropagateAnalysis(gast.NodeVisitor):
-  """Port analysis annotations from statements to their enclosing blocks."""
-
-  def __init__(self, analysis):
-    self.transfer_fn = analysis.transfer_fn
-    self.in_label = analysis.in_label
-    self.out_label = analysis.out_label
-    super(PropagateAnalysis, self).__init__()
-
-  def visit_If(self, node):
-    # Depth-first.
-    self.generic_visit(node)
-    incoming = anno.getanno(node.body[0], self.in_label)
-    incoming |= anno.getanno(node.test, self.in_label)
-    outgoing = anno.getanno(node.body[-1], self.out_label)
-    outgoing |= anno.getanno(node.test, self.out_label)
-    if node.orelse:
-      orelse_outgoing = anno.getanno(node.orelse[-1], self.out_label)
-      outgoing = self.transfer_fn(outgoing, orelse_outgoing)
-    anno.setanno(node, self.in_label, incoming)
-    anno.setanno(node, self.out_label, outgoing)
-
-  def visit_For(self, node):
-    self.generic_visit(node)
-    incoming = set(anno.getanno(node.body[0], self.in_label))
-    incoming -= set((anno.getanno(node.target, anno.Basic.QN),))
-    outgoing = anno.getanno(node.body[-1], self.out_label)
-    if node.orelse:
-      orelse_outgoing = anno.getanno(node.orelse[-1], self.out_label)
-      outgoing = self.transfer_fn(outgoing, orelse_outgoing)
-    anno.setanno(node, self.in_label, frozenset(incoming))
-    anno.setanno(node, self.out_label, outgoing)
-
-  def visit_While(self, node):
-    self.generic_visit(node)
-    incoming = anno.getanno(node.body[0], self.in_label)
-    incoming |= anno.getanno(node.test, self.in_label)
-    outgoing = anno.getanno(node.body[-1], self.out_label)
-    if node.orelse:
-      orelse_outgoing = anno.getanno(node.orelse[-1], self.out_label)
-      outgoing = self.transfer_fn(outgoing, orelse_outgoing)
-    anno.setanno(node, self.in_label, incoming)
-    anno.setanno(node, self.out_label, outgoing)
-
-  def visit_With(self, node):
-    self.generic_visit(node)
-    incoming = anno.getanno(node.body[0], self.in_label)
-    for item in node.items:
-      incoming |= anno.getanno(item, self.in_label)
-    outgoing = anno.getanno(node.body[-1], self.out_label)
-    anno.setanno(node, self.in_label, incoming)
-    anno.setanno(node, self.out_label, outgoing)
-
-
-# TODO(alexbw): Abstract the CFG walking machinery into a superclass
-# which is parameterized on which fields it selects when walking.
-# TODO(alexbw): Abstract the application of dataflow analysis
-class Forward(object):
-  """Forward analysis on CFG.
-
-  Args:
-    label: A name for this analysis e.g. 'active' for activity analysis. The AST
-      nodes in the CFG will be given annotations 'name_in', 'name_out',
-      'name_gen' and 'name_kill' which contain the incoming values, outgoing
-      values, values generated by the statement, and values deleted by the
-      statement respectively.
-    transfer_fn: Either the AND or OR operator. If the AND operator is used it
-      turns into forward must analysis (i.e. a value will only be carried
-      forward if it appears on all incoming paths). The OR operator means that
-      forward may analysis is done (i.e. the union of incoming values will be
-      taken).
-  """
-
-  def __init__(self, label, source_info, transfer_fn=operator.or_):
-    self.transfer_fn = transfer_fn
-    self.source_info = source_info
-    self.out_label = label + '_out'
-    self.in_label = label + '_in'
-    self.gen_label = label + '_gen'
-    self.kill_label = label + '_kill'
-
-  # TODO(alexbw): see if we can simplify by visiting breadth-first
-  def visit(self, node):
-    """Depth-first walking the CFG, applying dataflow info propagation."""
-    # node.value is None only for the exit CfgNode.
-    if not node.value:
-      return
-
-    if anno.hasanno(node.value, self.out_label):
-      before = hash(anno.getanno(node.value, self.out_label))
-    else:
-      before = None
-    preds = [
-        anno.getanno(pred.value, self.out_label)
-        for pred in node.prev
-        if anno.hasanno(pred.value, self.out_label)
-    ]
-    if preds:
-      incoming = functools.reduce(self.transfer_fn, preds[1:], preds[0])
-    else:
-      incoming = frozenset()
-    anno.setanno(node.value, self.in_label, incoming)
-    gen, kill = self.get_gen_kill(node, incoming)
-    anno.setanno(node.value, self.gen_label, gen)
-    anno.setanno(node.value, self.kill_label, kill)
-    anno.setanno(node.value, self.out_label, (incoming - kill) | gen)
-
-    if hash(anno.getanno(node.value, self.out_label)) != before:
-      for succ in node.next:
-        self.visit(succ)
-
-  def get_gen_kill(self, cfg_node, incoming):
-    """Calculate Gen and Kill properties of a CFG node in dataflow analysis.
-
-    A function which takes the CFG node as well as a set of incoming
-    values. It must return a set of newly generated values by the statement as
-    well as a set of deleted (killed) values.
-
-    Args:
-      cfg_node: A CfgNode instance.
-      incoming:
-    """
-    raise NotImplementedError()
-
-
-class Backward(Forward):
-  """Backward analysis on CFG."""
-
-  def visit(self, cfg_node):
-    # cfg_node.value is None for the exit node, which will be visited only once
-    if not cfg_node.value:
-      for pred in cfg_node.prev:
-        self.visit(pred)
-      return
-
-    if anno.hasanno(cfg_node.value, self.in_label):
-      before = hash(anno.getanno(cfg_node.value, self.in_label))
-    else:
-      before = None
-    succs = [
-        anno.getanno(succ.value, self.in_label)
-        for succ in cfg_node.next
-        if anno.hasanno(succ.value, self.in_label)
-    ]
-    if succs:
-      incoming = functools.reduce(self.transfer_fn, succs[1:], succs[0])
-    else:
-      incoming = frozenset()
-    anno.setanno(cfg_node.value, self.out_label, incoming)
-    gen, kill = self.get_gen_kill(cfg_node, incoming)
-    anno.setanno(cfg_node.value, self.gen_label, gen)
-    anno.setanno(cfg_node.value, self.kill_label, kill)
-    anno.setanno(cfg_node.value, self.in_label, (incoming - kill) | gen)
-    if hash(anno.getanno(cfg_node.value, self.in_label)) != before:
-      for pred in cfg_node.prev:
-        self.visit(pred)
-
-
-def run_analyses(node, analyses):
-  """Perform dataflow analysis on all functions within an AST.
-
-  Args:
-    node: An AST node on which to run dataflow analysis.
-    analyses: Either an instance of the Forward or Backward dataflow analysis
-      class, or a list or tuple of them.
-
-  Returns:
-    node: The node, but now with annotations on the AST nodes containing the
-    results of the dataflow analyses.
-  """
-  if not isinstance(analyses, (tuple, list)):
-    analyses = (analyses,)
-  for analysis in analyses:
-    if not isinstance(analysis, (Forward, Backward)):
-      raise TypeError('not a valid forward analysis object')
-
-  for child_node in gast.walk(node):
-    if isinstance(child_node, gast.FunctionDef):
-      cfg_obj = CfgBuilder().build_cfg(child_node)
-      for analysis in analyses:
-        if isinstance(analysis, Backward):
-          analysis.visit(cfg_obj.exit)
-        elif isinstance(analysis, Forward):
-          analysis.visit(cfg_obj.entry)
-  for analysis in analyses:
-    PropagateAnalysis(analysis).visit(node)
-  return node
-
-
-class Liveness(Backward):
-  """Perform a liveness analysis.
-
-  Each statement is annotated with a set of variables that may be used
-  later in the program.
-  """
-
-  def __init__(self, source_info):
-    super(Liveness, self).__init__('live', source_info)
-
-  def get_gen_kill(self, node, _):
-    # A variable's parents are live if it is live
-    # e.g. x is live if x.y is live. This means gen needs to return
-    # all parents of a variable (if it's an Attribute or Subscript).
-    # This doesn't apply to kill (e.g. del x.y doesn't affect liveness of x)
-    gen = activity.get_read(node.value, self.source_info)
-    gen = functools.reduce(lambda left, right: left | right.support_set, gen,
-                           gen)
-    kill = activity.get_updated(node.value, self.source_info)
-    return gen, kill
-
-
-class ReachingDefinitions(Forward):
-  """Perform reaching definition analysis.
-
-  Each statement is annotated with a set of (variable, definition) pairs.
-  """
-
-  def __init__(self, source_info):
-    super(ReachingDefinitions, self).__init__('definitions', source_info)
-
-  def get_gen_kill(self, node, incoming):
-    definitions = activity.get_updated(node.value, self.source_info)
-    gen = frozenset((id_, node.value) for id_ in definitions)
-    kill = frozenset(def_ for def_ in incoming if def_[0] in definitions)
-    return gen, kill
-
-
-class Defined(Forward):
-  """Perform defined variable analysis.
-
-  Each statement is annotated with a set of variables which are guaranteed to
-  be defined at that point.
-  """
-
-  def __init__(self, source_info):
-    super(Defined, self).__init__(
-        'defined', source_info, transfer_fn=operator.and_)
-
-  def get_gen_kill(self, node, _):
-    gen = activity.get_updated(node.value, self.source_info)
-    return gen, frozenset()
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/cfg_test.py b/tensorflow/contrib/autograph/pyct/static_analysis/cfg_test.py
deleted file mode 100644
index 428ebbedca..0000000000
--- a/tensorflow/contrib/autograph/pyct/static_analysis/cfg_test.py
+++ /dev/null
@@ -1,303 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for cfg module."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import functools
-
-import gast
-
-from tensorflow.contrib.autograph.pyct import anno
-from tensorflow.contrib.autograph.pyct import parser
-from tensorflow.contrib.autograph.pyct import qual_names
-from tensorflow.contrib.autograph.pyct import transformer
-from tensorflow.contrib.autograph.pyct.static_analysis import cfg
-from tensorflow.python.platform import test
-
-
-class CFGTest(test.TestCase):
-
-  def _parse_and_analyze(self, test_fn):
-    node, source = parser.parse_entity(test_fn)
-    entity_info = transformer.EntityInfo(
-        source_code=source,
-        source_file=None,
-        namespace={},
-        arg_values=None,
-        arg_types=None,
-        owner_type=None)
-    node = qual_names.resolve(node)
-    return node, entity_info
-
-  def _check_anno_matches(self, node, anno_name, var_names):
-    if isinstance(var_names, str):
-      var_names = (var_names,)
-    qual_vars = set()
-    for var_name in var_names:
-      if isinstance(var_name, str):
-        if '[' in var_name or ']' in var_name:
-          raise ValueError('Annotation matching not supported with subscript.')
-        if '.' not in var_name:
-          qual_vars.add(qual_names.QN(var_name))
-        else:
-          attrs = var_name.split('.')
-          this_qn = functools.reduce(qual_names.QN, attrs[1:],
-                                     qual_names.QN(attrs[0]))
-          qual_vars.add(this_qn)
-    self.assertEqual(anno.getanno(node, anno_name), qual_vars)
-
-  def test_reaching(self):
-
-    def f(x):
-      print(x)
-      while True:
-        x = x
-        x = x
-      return x
-
-    node, ctx = self._parse_and_analyze(f)
-    cfg.run_analyses(node, cfg.ReachingDefinitions(ctx))
-    body = node.body[0].body
-    # Only the argument reaches the expression
-    def_in = anno.getanno(body[0], 'definitions_in')
-    # One element, x, from arguments
-    self.assertEqual(set(type(d[1]) for d in def_in), set((gast.arguments,)))
-
-    while_body = body[1].body
-    def_in = anno.getanno(while_body[0], 'definitions_in')
-    # One definition, two possible sources.
-    # - One from an assignment (if the loop is entered)
-    # - The other from the arguments (if loop is not entered)
-    self.assertEqual(
-        set(type(d[1]) for d in def_in), set((gast.arguments, gast.Assign)))
-
-    def_in = anno.getanno(while_body[1], 'definitions_in')
-    # If we've reached this line, the only reaching definition of x is the
-    # Assign node in previous line
-    self.assertEqual(set(type(d[1]) for d in def_in), set((gast.Assign,)))
-
-    def_in = anno.getanno(body[2], 'definitions_in')
-    # Same situation as while_body[0]
-    self.assertEqual(
-        set(type(d[1]) for d in def_in), set((gast.arguments, gast.Assign)))
-
-  def test_defined(self):
-
-    def f(x):
-      if x:
-        y = 2  # pylint: disable=unused-variable
-      return x
-
-    node, ctx = self._parse_and_analyze(f)
-    cfg.run_analyses(node, cfg.Defined(ctx))
-    body = node.body[0].body
-    # only x is for sure defined at the end
-    self._check_anno_matches(body[1], 'defined_in', 'x')
-    # at the end of the if body both x and y are defined
-    if_body = body[0].body
-    self._check_anno_matches(if_body[0], 'defined_out', ('x', 'y'))
-
-  def _get_live_annotated_fnbody(self, f):
-    node, ctx = self._parse_and_analyze(f)
-    cfg.run_analyses(node, cfg.Liveness(ctx))
-    body = node.body[0].body
-    return body
-
-  def test_live_straightline(self):
-
-    def f1(x):
-      a = g(x)  # pylint: disable=undefined-variable
-      b = h(a)  # pylint: disable=undefined-variable, unused-variable
-      return x
-
-    body = self._get_live_annotated_fnbody(f1)
-    self._check_anno_matches(body[1], 'live_in', ('a', 'h', 'x'))
-    self._check_anno_matches(body[2], 'live_in', ('x'))
-    self._check_anno_matches(body[0], 'live_in', ('g', 'h', 'x'))
-    self._check_anno_matches(body[2], 'live_out', ())
-
-  def test_live_stacked_conds_with_else(self):
-
-    def f2(x, a):  # pylint: disable=unused-argument
-      if a > 0:  # x should not be live
-        x = 0
-      if a > 1:
-        x = 1
-      else:
-        x = 2
-
-    body = self._get_live_annotated_fnbody(f2)
-    self._check_anno_matches(body[0], 'live_in', ('a'))
-    self._check_anno_matches(body[1], 'live_in', ('a'))
-
-  def test_live_stacked_conds(self):
-
-    def f3(x, a):
-      if a > 0:  # x and a should be live
-        x = 0
-      if a > 1:  # x and a should be live_in
-        x = 1
-      return x  # x should be live
-
-    body = self._get_live_annotated_fnbody(f3)
-    self._check_anno_matches(body[0], 'live_in', ('a', 'x'))
-    self._check_anno_matches(body[1], 'live_in', ('a', 'x'))
-    self._check_anno_matches(body[2], 'live_in', ('x'))
-
-  def test_live_possibly_unused_cond(self):
-
-    def f4(x, a):
-      if a > 0:  # x should be live
-        x = 0
-      x += 1
-
-    body = self._get_live_annotated_fnbody(f4)
-    self._check_anno_matches(body[0], 'live_in', ('x', 'a'))
-    self._check_anno_matches(body[1], 'live_in', ('x'))
-
-  def test_live_attribute_in_cond(self):
-
-    def f5(x, a):
-      if a > 0:  # x.y should be live
-        x.y = 0
-      return x.y
-
-    body = self._get_live_annotated_fnbody(f5)
-    self._check_anno_matches(body[0], 'live_in', ('x', 'x.y', 'a'))
-
-  def test_live_noop(self):
-
-    def f6(x):
-      return x  # should this cause x.* to be live?
-
-    body = self._get_live_annotated_fnbody(f6)
-    self._check_anno_matches(body[0], 'live_in', ('x'))
-
-  def test_live_loop(self):
-
-    def f7(x, n):
-      for i in range(n):
-        x += i
-      return x
-
-    body = self._get_live_annotated_fnbody(f7)
-    self._check_anno_matches(body[0], 'live_in', ('x', 'n', 'range'))
-    self._check_anno_matches(body[1], 'live_in', ('x'))
-
-  def test_live_context_manager(self):
-
-    def f8(x, f):
-      with f:
-        x += 1
-
-    body = self._get_live_annotated_fnbody(f8)
-    self._check_anno_matches(body[0], 'live_in', ('f', 'x'))
-
-  def test_node_equality(self):
-    node_a = gast.parse('y = x').body[0]
-    node_b = gast.parse('y = x').body[0]
-    self.assertNotEqual(node_a, node_b)
-
-  def test_nested_functions_defined(self):
-
-    def f(x):
-      y = x * 2
-
-      def g(z):
-        return z + y
-
-      return g(x)
-
-    node, ctx = self._parse_and_analyze(f)
-    cfg.run_analyses(node, cfg.Defined(ctx))
-
-    body = node.body[0].body
-    self.assertEqual(
-        anno.getanno(body[2], 'defined_in'),
-        frozenset(map(qual_names.QN, ('g', 'x', 'y'))))
-
-    # TODO(alexbw): CFG analysis doesn't currently cross FunctionDef boundaries.
-    # NOTE: 'z' is easy to find, but 'y' is  not identified as
-    # defined, because CFG analysis is applied with each function separately.
-    # fndef_body = body[1].body
-    # self.assertEqual(
-    #     anno.getanno(fndef_body[0], 'defined_in'),
-    #     frozenset(map(qual_names.QN, ('z', 'y'))))
-
-  def test_nested_functions_dont_leak_definitions(self):
-
-    def f(x):
-      print(x)
-
-      def g():
-        y = 2
-        return y
-
-      return g()  # y is not defined here
-
-    node, ctx = self._parse_and_analyze(f)
-    cfg.run_analyses(node, cfg.Defined(ctx))
-    body = node.body[0].body
-    self.assertEqual(
-        anno.getanno(body[2], 'defined_in'),
-        frozenset(map(qual_names.QN, ('x', 'g'))))
-
-  def test_loop_else(self):
-
-    # Disabling useless-else-on-loop error, because 'break' and 'continue'
-    # canonicalization are a separate analysis pass, and here we test
-    # the CFG analysis in isolation.
-    def for_orelse(x):
-      y = 0
-      for i in range(len(x)):
-        x += i
-      else:  # pylint: disable=useless-else-on-loop
-        y = 1
-      return x, y
-
-    def while_orelse(x, i):
-      y = 0
-      while x < 10:
-        x += i
-      else:  # pylint: disable=useless-else-on-loop
-        y = 1
-      return x, y
-
-    for f in (for_orelse, while_orelse):
-      node, ctx = self._parse_and_analyze(f)
-      cfg.run_analyses(node, cfg.ReachingDefinitions(ctx))
-      body = node.body[0].body
-      return_node = body[-1]
-      reaching_defs = anno.getanno(return_node, 'definitions_in')
-
-      # Y could be defined by Assign(Num(0)) or Assign(Num(1))
-      # X could be defined as an argument or an AugAssign.
-      y_defs = [node for var, node in reaching_defs if str(var) == 'y']
-      x_defs = [node for var, node in reaching_defs if str(var) == 'x']
-
-      self.assertEqual(set((gast.Assign,)), set(type(def_) for def_ in y_defs))
-      self.assertEqual(set((0, 1)), set(def_.value.n for def_ in y_defs))
-      self.assertEqual(len(y_defs), 2)
-      self.assertEqual(
-          set((gast.arguments, gast.AugAssign)),
-          set(type(def_) for def_ in x_defs))
-      self.assertEqual(len(x_defs), 2)
-
-
-if __name__ == '__main__':
-  test.main()
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/live_values.py b/tensorflow/contrib/autograph/pyct/static_analysis/live_values.py
index 9ccb98f79a..2d8f922a45 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/live_values.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/live_values.py
@@ -16,7 +16,7 @@
 
 Live values are extracted from the known execution context.
 
-Requires activity analysis annotations.
+Requires activity and reaching definitions analyses.
 """
 
 from __future__ import absolute_import
@@ -45,14 +45,12 @@ class LiveValueResolver(transformer.Base):
   def visit_Name(self, node):
     self.generic_visit(node)
     if isinstance(node.ctx, gast.Load):
-      assert anno.hasanno(node, NodeAnno.IS_LOCAL), node
-      symbol_is_local = anno.getanno(node, NodeAnno.IS_LOCAL)
-      assert anno.hasanno(node, NodeAnno.IS_MODIFIED_SINCE_ENTRY), node
-      symbol_is_modified = anno.getanno(node, NodeAnno.IS_MODIFIED_SINCE_ENTRY)
-      assert anno.hasanno(node, NodeAnno.IS_PARAM), node
-      symbol_is_param = anno.getanno(node, NodeAnno.IS_PARAM)
-
-      if not symbol_is_local and not symbol_is_param:
+      defs = anno.getanno(node, anno.Static.DEFINITIONS, ())
+
+      is_defined = bool(defs)
+      has_single_def = len(defs) == 1
+
+      if not is_defined:
         if node.id in self.literals:
           anno.setanno(node, 'live_val', self.literals[node.id])
         elif node.id in self.entity_info.namespace:
@@ -79,11 +77,13 @@ class LiveValueResolver(transformer.Base):
         # TODO(mdan): Attempt to trace its value through the local chain.
         # TODO(mdan): Use type annotations as fallback.
 
-      if not symbol_is_modified:
-        if node.id in self.entity_info.arg_values:
-          obj = self.entity_info.arg_values[node.id]
-          anno.setanno(node, 'live_val', obj)
-          anno.setanno(node, 'fqn', (obj.__class__.__name__,))
+      if has_single_def:
+        def_, = defs
+        if def_.param_of is self.enclosing_entities[0]:
+          if node.id in self.entity_info.arg_values:
+            obj = self.entity_info.arg_values[node.id]
+            anno.setanno(node, 'live_val', obj)
+            anno.setanno(node, 'fqn', (obj.__class__.__name__,))
     return node
 
   def visit_Attribute(self, node):
@@ -91,12 +91,20 @@ class LiveValueResolver(transformer.Base):
     if anno.hasanno(node.value, 'live_val'):
       assert anno.hasanno(node.value, 'fqn')
       parent_object = anno.getanno(node.value, 'live_val')
-      if not hasattr(parent_object, node.attr):
-        raise AttributeError('%s has no attribute %s' % (parent_object,
-                                                         node.attr))
+
       anno.setanno(node, 'parent_type', type(parent_object))
-      anno.setanno(node, 'live_val', getattr(parent_object, node.attr))
       anno.setanno(node, 'fqn', anno.getanno(node.value, 'fqn') + (node.attr,))
+      if hasattr(parent_object, node.attr):
+        # This can happen when the attribute's creation and use depend on the
+        # same static condition, for example:
+        #
+        #  if cond:
+        #    foo.bar = baz
+        #  if cond:
+        #    x = foo.bar
+        #
+        anno.setanno(node, 'live_val', getattr(parent_object, node.attr))
+
     # TODO(mdan): Investigate the role built-in annotations can play here.
     elif anno.hasanno(node.value, 'type'):
       parent_type = anno.getanno(node.value, 'type')
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/live_values_test.py b/tensorflow/contrib/autograph/pyct/static_analysis/live_values_test.py
index 38af792777..fe3051179c 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/live_values_test.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/live_values_test.py
@@ -21,11 +21,13 @@ from __future__ import print_function
 import six
 
 from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import cfg
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import qual_names
 from tensorflow.contrib.autograph.pyct import transformer
 from tensorflow.contrib.autograph.pyct.static_analysis import activity
 from tensorflow.contrib.autograph.pyct.static_analysis import live_values
+from tensorflow.contrib.autograph.pyct.static_analysis import reaching_definitions
 from tensorflow.contrib.autograph.pyct.static_analysis import type_info
 from tensorflow.python.framework import constant_op
 from tensorflow.python.platform import test
@@ -48,7 +50,10 @@ class LiveValuesResolverTest(test.TestCase):
         arg_types=arg_types,
         owner_type=None)
     node = qual_names.resolve(node)
+    graphs = cfg.build(node)
     node = activity.resolve(node, entity_info)
+    node = reaching_definitions.resolve(node, entity_info, graphs,
+                                        reaching_definitions.Definition)
     node = live_values.resolve(node, entity_info, literals)
     node = type_info.resolve(node, entity_info)
     node = live_values.resolve(node, entity_info, literals)
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/liveness.py b/tensorflow/contrib/autograph/pyct/static_analysis/liveness.py
new file mode 100644
index 0000000000..bf29d868a2
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/liveness.py
@@ -0,0 +1,200 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Live variable analysis.
+
+This analysis attaches a set containing the live symbols that are live at the
+exit of control flow statements.
+
+Requires activity analysis.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import cfg
+from tensorflow.contrib.autograph.pyct import transformer
+from tensorflow.contrib.autograph.pyct.static_analysis import annos
+
+
+class Analyzer(cfg.GraphVisitor):
+  """CFG visitor that performs liveness analysis at statement level."""
+
+  def __init__(self, graph):
+    super(Analyzer, self).__init__(graph)
+    # This allows communicating that nodes generate extra symbols,
+    # e.g. those that a function definition closes over.
+    self.extra_gen = {}
+
+  def init_state(self, _):
+    return set()
+
+  def visit_node(self, node):
+    prev_live_in = self.in_[node]
+
+    if anno.hasanno(node.ast_node, anno.Static.SCOPE):
+      node_scope = anno.getanno(node.ast_node, anno.Static.SCOPE)
+
+      gen = node_scope.used | self.extra_gen.get(node.ast_node, frozenset())
+      # TODO(mdan): verify whether composites' parents need to be added.
+      # E.g. if x.y is live whether x needs to be added. Theoretically the
+      # activity analysis should have both so that wouldn't be needed.
+      kill = node_scope.modified
+
+      live_out = set()
+      for n in node.next:
+        live_out |= self.in_[n]
+      live_in = gen | (live_out - kill)
+
+    else:
+      # Nodes that don't have a scope annotation are assumed not to touch any
+      # symbols.
+      # This Name node below is a literal name, e.g. False
+      assert isinstance(node.ast_node,
+                        (gast.Name, gast.Continue, gast.Break)), type(
+                            node.ast_node)
+      live_in = prev_live_in
+      live_out = live_in
+
+    self.in_[node] = live_in
+    self.out[node] = live_out
+
+    # TODO(mdan): Move this to the superclass?
+    return prev_live_in != live_in
+
+
+class WholeTreeAnalyzer(transformer.Base):
+  """Runs liveness analysis on each of the functions defined in the AST.
+
+  If a function defined other local functions, those will have separate CFGs.
+  However, dataflow analysis needs to tie up these CFGs to properly emulate the
+  effect of closures. In the case of liveness, the parent function's live
+  variables must account for the variables that are live at the entry of each
+  subfunction. For example:
+
+    def foo():
+      # baz is live here
+      def bar():
+        print(baz)
+
+  This analyzer runs liveness analysis on each individual function, accounting
+  for the effect above.
+  """
+
+  def __init__(self, source_info, graphs):
+    super(WholeTreeAnalyzer, self).__init__(source_info)
+    self.graphs = graphs
+    self.current_analyzer = None
+    self.analyzers = {}
+
+  def visit_FunctionDef(self, node):
+    parent_analyzer = self.current_analyzer
+    subgraph = self.graphs[node]
+
+    # Postorder tree processing makes this a bit complicated:
+    #  1. construct an analyzer object and put it on stack
+    #  2. recursively walk the subtree; this will initialize the analyzer's
+    #     in_ state properly (done in a block below)
+    #  3. run the final analysis
+    analyzer = Analyzer(subgraph)
+    self.current_analyzer = analyzer
+    node = self.generic_visit(node)
+    analyzer.visit_reverse()
+
+    if parent_analyzer is not None:
+      # Wire the state between the two subgraphs' analyzers.
+      child_in_state = analyzer.in_[subgraph.entry]
+      # Exception: symbols modified in the child function are local to it
+      body_scope = anno.getanno(node, annos.NodeAnno.BODY_SCOPE)
+      for qn in body_scope.modified:
+        # Note: a function modifying the symbol doesn't make that symbol
+        # live at the function's entry. In fact when that happens it is
+        # probably a case of undefined assignment, like this:
+        #
+        #   bar = 0
+        #   def foo():
+        #     print(bar)  # bar is undefined here!
+        #     bar = 1
+        #
+        # Hence we use discard and not remove below.
+        child_in_state.discard(qn)
+      parent_analyzer.extra_gen[node] = frozenset(child_in_state,)
+
+    self.analyzers[node] = analyzer
+    self.current_analyzer = parent_analyzer
+    return node
+
+  def visit_nonlocal(self, node):
+    raise NotImplementedError()
+
+  def visit_global(self, node):
+    raise NotImplementedError()
+
+
+class Annotator(transformer.Base):
+  """AST visitor that annotates each control flow block with live symbols."""
+
+  # Note: additional nodes may be added as needed.
+
+  def __init__(self, source_info, cross_function_analyzer):
+    super(Annotator, self).__init__(source_info)
+    self.cross_function_analyzer = cross_function_analyzer
+    self.current_analyzer = None
+
+  def visit_FunctionDef(self, node):
+    parent_analyzer = self.current_analyzer
+    self.current_analyzer = self.cross_function_analyzer.analyzers[node]
+
+    node = self.generic_visit(node)
+    self.current_analyzer = parent_analyzer
+    return node
+
+  def _aggregate_successors_live_in(self, node):
+    successors = self.current_analyzer.graph.stmt_next[node]
+    node_live_out = set()
+    for s in successors:
+      node_live_out.update(self.current_analyzer.in_[s])
+    anno.setanno(node, anno.Static.LIVE_VARS_OUT, frozenset(node_live_out))
+    node = self.generic_visit(node)
+    return node
+
+  def visit_If(self, node):
+    return self._aggregate_successors_live_in(node)
+
+  def visit_For(self, node):
+    return self._aggregate_successors_live_in(node)
+
+  def visit_While(self, node):
+    return self._aggregate_successors_live_in(node)
+
+
+def resolve(node, source_info, graphs):
+  """Resolves the live symbols at the exit of control flow statements.
+
+  Args:
+    node: ast.AST
+    source_info: transformer.SourceInfo
+    graphs: Dict[ast.FunctionDef, cfg.Graph]
+  Returns:
+    ast.AST
+  """
+  cross_function_analyzer = WholeTreeAnalyzer(source_info, graphs)
+  node = cross_function_analyzer.visit(node)
+  visitor = Annotator(source_info, cross_function_analyzer)
+  node = visitor.visit(node)
+  return node
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/liveness_test.py b/tensorflow/contrib/autograph/pyct/static_analysis/liveness_test.py
new file mode 100644
index 0000000000..d53adb28af
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/liveness_test.py
@@ -0,0 +1,149 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for liveness module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import cfg
+from tensorflow.contrib.autograph.pyct import parser
+from tensorflow.contrib.autograph.pyct import qual_names
+from tensorflow.contrib.autograph.pyct import transformer
+from tensorflow.contrib.autograph.pyct.static_analysis import activity
+from tensorflow.contrib.autograph.pyct.static_analysis import liveness
+from tensorflow.python.platform import test
+
+
+class LivenessTest(test.TestCase):
+
+  def _parse_and_analyze(self, test_fn):
+    node, source = parser.parse_entity(test_fn)
+    entity_info = transformer.EntityInfo(
+        source_code=source,
+        source_file=None,
+        namespace={},
+        arg_values=None,
+        arg_types=None,
+        owner_type=None)
+    node = qual_names.resolve(node)
+    node = activity.resolve(node, entity_info)
+    graphs = cfg.build(node)
+    liveness.resolve(node, entity_info, graphs)
+    return node
+
+  def assertHasLiveOut(self, node, expected):
+    live_out = anno.getanno(node, anno.Static.LIVE_VARS_OUT)
+    live_out_str = set(str(v) for v in live_out)
+    if not expected:
+      expected = ()
+    if not isinstance(expected, tuple):
+      expected = (expected,)
+    self.assertSetEqual(live_out_str, set(expected))
+
+  def test_stacked_if(self):
+
+    def test_fn(x, a):
+      if a > 0:
+        x = 0
+      if a > 1:
+        x = 1
+      return x
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasLiveOut(fn_body[0], ('a', 'x'))
+    self.assertHasLiveOut(fn_body[1], 'x')
+
+  def test_stacked_if_else(self):
+
+    def test_fn(x, a):
+      if a > 0:
+        x = 0
+      if a > 1:
+        x = 1
+      else:
+        x = 2
+      return x
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasLiveOut(fn_body[0], 'a')
+    self.assertHasLiveOut(fn_body[1], 'x')
+
+  def test_for_basic(self):
+
+    def test_fn(x, a):
+      for i in range(a):
+        x += i
+      return x
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasLiveOut(fn_body[0], 'x')
+
+  def test_attributes(self):
+
+    def test_fn(x, a):
+      if a > 0:
+        x.y = 0
+      return x.y
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasLiveOut(fn_body[0], ('x.y', 'x'))
+
+  def test_nested_functions(self):
+
+    def test_fn(a, b):
+      if b:
+        a = []
+
+      def foo():
+        return a
+
+      foo()
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasLiveOut(fn_body[0], 'a')
+
+  def test_nested_functions_isolation(self):
+
+    def test_fn(b):
+      if b:
+        a = 0  # pylint:disable=unused-variable
+
+      def child():
+        max(a)  # pylint:disable=used-before-assignment
+        a = 1
+        return a
+
+      child()
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasLiveOut(fn_body[0], 'max')
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/reaching_definitions.py b/tensorflow/contrib/autograph/pyct/static_analysis/reaching_definitions.py
new file mode 100644
index 0000000000..9a84f1231c
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/reaching_definitions.py
@@ -0,0 +1,301 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Reaching definition analysis.
+
+This analysis attaches a set of a Definition objects to each symbol, one
+for each distinct definition that may reach it. The Definition objects are
+mutable and may be used by subsequent analyses to further annotate data like
+static type and value information.
+The analysis also attaches the set of the symbols defined at the entry of
+control flow statements.
+
+Requires activity analysis.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import cfg
+from tensorflow.contrib.autograph.pyct import transformer
+from tensorflow.contrib.autograph.pyct.static_analysis import annos
+
+
+class Definition(object):
+  """Definition objects describe a unique definition of a variable.
+
+  Subclasses of this may be used by passing an appropriate factory fuction to
+  resolve.
+
+  Attributes:
+    param_of: Optional[ast.AST]
+  """
+
+  def __init__(self):
+    self.param_of = None
+
+  def __repr__(self):
+    return '%s[%d]' % (self.__class__.__name__, id(self))
+
+
+class _NodeState(object):
+  """Abstraction for the state of the CFG walk for reaching definition analysis.
+
+  This is a value type. Only implements the strictly necessary operators.
+
+  Attributes:
+    value: Dict[qual_names.QN, Set[Definition, ...]], the defined symbols and
+        their possible definitions
+  """
+
+  def __init__(self, init_from=None):
+    if init_from:
+      if isinstance(init_from, _NodeState):
+        self.value = {
+            s: set(other_infos) for s, other_infos in init_from.value.items()
+        }
+      elif isinstance(init_from, dict):
+        self.value = {s: set((init_from[s],)) for s in init_from}
+      else:
+        assert False, init_from
+    else:
+      self.value = {}
+
+  def __eq__(self, other):
+    if frozenset(self.value.keys()) != frozenset(other.value.keys()):
+      return False
+    ret = all(self.value[s] == other.value[s] for s in self.value)
+    return ret
+
+  def __ne__(self, other):
+    return not self.__eq__(other)
+
+  def __or__(self, other):
+    assert isinstance(other, _NodeState)
+    result = _NodeState(self)
+    for s, other_infos in other.value.items():
+      if s in result.value:
+        result.value[s].update(other_infos)
+      else:
+        result.value[s] = set(other_infos)
+    return result
+
+  def __sub__(self, other):
+    assert isinstance(other, set)
+    result = _NodeState(self)
+    for s in other:
+      result.value.pop(s, None)
+    return result
+
+  def __repr__(self):
+    return 'NodeState[%s]=%s' % (id(self), repr(self.value))
+
+
+class Analyzer(cfg.GraphVisitor):
+  """CFG visitor that determines reaching definitions at statement level."""
+
+  def __init__(self, graph, definition_factory):
+    self._definition_factory = definition_factory
+    super(Analyzer, self).__init__(graph)
+    # This allows communicating that nodes have extra reaching definitions,
+    # e.g. those that a function closes over.
+    self.extra_in = {}
+
+    self.gen_map = {}
+
+  def init_state(self, _):
+    return _NodeState()
+
+  def visit_node(self, node):
+    prev_defs_out = self.out[node]
+
+    defs_in = _NodeState(self.extra_in.get(node.ast_node, None))
+    for n in node.prev:
+      defs_in |= self.out[n]
+
+    if anno.hasanno(node.ast_node, anno.Static.SCOPE):
+      node_scope = anno.getanno(node.ast_node, anno.Static.SCOPE)
+      # The definition objects created by each node must be singletons because
+      # their ids are used in equality checks.
+      if node not in self.gen_map:
+        node_symbols = {}
+        for s in node_scope.modified:
+          def_ = self._definition_factory()
+          if s in node_scope.params:
+            def_.param_of = node_scope.params[s]
+          node_symbols[s] = def_
+        self.gen_map[node] = _NodeState(node_symbols)
+
+      gen = self.gen_map[node]
+      kill = node_scope.modified
+      defs_out = gen | (defs_in - kill)
+
+    else:
+      # Nodes that don't have a scope annotation are assumed not to touch any
+      # symbols.
+      # This Name node below is a literal name, e.g. False
+      # This can also happen if activity.py forgot to annotate the node with a
+      # scope object.
+      assert isinstance(
+          node.ast_node,
+          (gast.Name, gast.Break, gast.Continue, gast.Raise)), (node.ast_node,
+                                                                node)
+      defs_out = defs_in
+
+    self.in_[node] = defs_in
+    self.out[node] = defs_out
+
+    # TODO(mdan): Move this to the superclass?
+    return prev_defs_out != defs_out
+
+
+class TreeAnnotator(transformer.Base):
+  """AST visitor that annotates each symbol name with its reaching definitions.
+
+  Simultaneously, the visitor runs the dataflow analysis on each function node,
+  accounting for the effect of closures. For example:
+
+    def foo():
+      bar = 1
+      def baz():
+        # bar = 1 reaches here
+  """
+
+  def __init__(self, source_info, graphs, definition_factory):
+    super(TreeAnnotator, self).__init__(source_info)
+    self.definition_factory = definition_factory
+    self.graphs = graphs
+    self.current_analyzer = None
+    self.current_cfg_node = None
+
+  def visit_FunctionDef(self, node):
+    parent_analyzer = self.current_analyzer
+    subgraph = self.graphs[node]
+
+    # Preorder tree processing:
+    #  1. if this is a child function, the parent was already analyzed and it
+    #     has the proper state value for the subgraph's entry
+    #  2. analyze the current function body
+    #  2. recursively walk the subtree; child functions will be processed
+    analyzer = Analyzer(subgraph, self.definition_factory)
+    if parent_analyzer is not None:
+      # Wire the state between the two subgraphs' analyzers.
+      parent_out_state = parent_analyzer.out[parent_analyzer.graph.index[node]]
+      # Exception: symbols modified in the child function are local to it
+      body_scope = anno.getanno(node, annos.NodeAnno.BODY_SCOPE)
+      parent_out_state -= body_scope.modified
+      analyzer.extra_in[node.args] = parent_out_state
+
+    # Complete the analysis for the local function and annotate its body.
+    analyzer.visit_forward()
+
+    # Recursively process any remaining subfunctions.
+    self.current_analyzer = analyzer
+    # Note: not visiting name, decorator_list and returns because they don't
+    # apply to this anlysis.
+    # TODO(mdan): Should we still process the function name?
+    node.args = self.visit(node.args)
+    node.body = self.visit_block(node.body)
+    self.current_analyzer = parent_analyzer
+
+    return node
+
+  def visit_nonlocal(self, node):
+    raise NotImplementedError()
+
+  def visit_global(self, node):
+    raise NotImplementedError()
+
+  def visit_Name(self, node):
+    if self.current_analyzer is None:
+      # Names may appear outside function defs - for example in class
+      # definitions.
+      return node
+
+    analyzer = self.current_analyzer
+    cfg_node = self.current_cfg_node
+
+    assert cfg_node is not None, 'name node outside of any statement?'
+
+    qn = anno.getanno(node, anno.Basic.QN)
+    if isinstance(node.ctx, gast.Load):
+      anno.setanno(node, anno.Static.DEFINITIONS,
+                   tuple(analyzer.in_[cfg_node].value.get(qn, ())))
+    else:
+      anno.setanno(node, anno.Static.DEFINITIONS,
+                   tuple(analyzer.out[cfg_node].value.get(qn, ())))
+
+    return node
+
+  def _aggregate_predecessors_defined_in(self, node):
+    preds = self.current_analyzer.graph.stmt_prev[node]
+    node_defined_in = set()
+    for p in preds:
+      node_defined_in |= set(self.current_analyzer.out[p].value.keys())
+    anno.setanno(node, anno.Static.DEFINED_VARS_IN, frozenset(node_defined_in))
+
+  def visit_If(self, node):
+    self._aggregate_predecessors_defined_in(node)
+    return self.generic_visit(node)
+
+  def visit_For(self, node):
+    self._aggregate_predecessors_defined_in(node)
+
+    # Manually accounting for the shortcoming described in
+    # cfg.AstToCfg.visit_For.
+    parent = self.current_cfg_node
+    self.current_cfg_node = self.current_analyzer.graph.index[node.iter]
+    node.target = self.visit(node.target)
+    self.current_cfg_node = parent
+
+    node.iter = self.visit(node.iter)
+    node.body = self.visit_block(node.body)
+    node.orelse = self.visit_block(node.orelse)
+
+    return node
+
+  def visit_While(self, node):
+    self._aggregate_predecessors_defined_in(node)
+    return self.generic_visit(node)
+
+  def visit(self, node):
+    parent = self.current_cfg_node
+
+    if (self.current_analyzer is not None and
+        node in self.current_analyzer.graph.index):
+      self.current_cfg_node = self.current_analyzer.graph.index[node]
+    node = super(TreeAnnotator, self).visit(node)
+
+    self.current_cfg_node = parent
+    return node
+
+
+def resolve(node, source_info, graphs, definition_factory):
+  """Resolves reaching definitions for each symbol.
+
+  Args:
+    node: ast.AST
+    source_info: transformer.SourceInfo
+    graphs: Dict[ast.FunctionDef, cfg.Graph]
+    definition_factory: Callable[[], Definition]
+  Returns:
+    ast.AST
+  """
+  visitor = TreeAnnotator(source_info, graphs, definition_factory)
+  node = visitor.visit(node)
+  return node
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/reaching_definitions_test.py b/tensorflow/contrib/autograph/pyct/static_analysis/reaching_definitions_test.py
new file mode 100644
index 0000000000..243fe804b2
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/reaching_definitions_test.py
@@ -0,0 +1,263 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for reaching_definitions module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import cfg
+from tensorflow.contrib.autograph.pyct import parser
+from tensorflow.contrib.autograph.pyct import qual_names
+from tensorflow.contrib.autograph.pyct import transformer
+from tensorflow.contrib.autograph.pyct.static_analysis import activity
+from tensorflow.contrib.autograph.pyct.static_analysis import reaching_definitions
+from tensorflow.python.platform import test
+
+
+class DefinitionInfoTest(test.TestCase):
+
+  def _parse_and_analyze(self, test_fn):
+    node, source = parser.parse_entity(test_fn)
+    entity_info = transformer.EntityInfo(
+        source_code=source,
+        source_file=None,
+        namespace={},
+        arg_values=None,
+        arg_types=None,
+        owner_type=None)
+    node = qual_names.resolve(node)
+    node = activity.resolve(node, entity_info)
+    graphs = cfg.build(node)
+    node = reaching_definitions.resolve(node, entity_info, graphs,
+                                        reaching_definitions.Definition)
+    return node
+
+  def assertHasDefs(self, node, num):
+    defs = anno.getanno(node, anno.Static.DEFINITIONS)
+    self.assertEqual(len(defs), num)
+    for r in defs:
+      self.assertIsInstance(r, reaching_definitions.Definition)
+
+  def assertHasDefinedIn(self, node, expected):
+    defined_in = anno.getanno(node, anno.Static.DEFINED_VARS_IN)
+    defined_in_str = set(str(v) for v in defined_in)
+    if not expected:
+      expected = ()
+    if not isinstance(expected, tuple):
+      expected = (expected,)
+    self.assertSetEqual(defined_in_str, set(expected))
+
+  def assertSameDef(self, first, second):
+    self.assertHasDefs(first, 1)
+    self.assertHasDefs(second, 1)
+    self.assertIs(
+        anno.getanno(first, anno.Static.DEFINITIONS)[0],
+        anno.getanno(second, anno.Static.DEFINITIONS)[0])
+
+  def assertNotSameDef(self, first, second):
+    self.assertHasDefs(first, 1)
+    self.assertHasDefs(second, 1)
+    self.assertIsNot(
+        anno.getanno(first, anno.Static.DEFINITIONS)[0],
+        anno.getanno(second, anno.Static.DEFINITIONS)[0])
+
+  def test_conditional(self):
+
+    def test_fn(a, b):
+      a = []
+      if b:
+        a = []
+      return a
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasDefs(fn_body[0].targets[0], 1)
+    self.assertHasDefs(fn_body[1].test, 1)
+    self.assertHasDefs(fn_body[1].body[0].targets[0], 1)
+    self.assertHasDefs(fn_body[2].value, 2)
+
+    self.assertHasDefinedIn(fn_body[1], ('a', 'b'))
+
+  def test_while(self):
+
+    def test_fn(a):
+      max(a)
+      while True:
+        a = a
+        a = a
+      return a
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasDefs(fn_body[0].value.args[0], 1)
+    self.assertHasDefs(fn_body[1].body[0].targets[0], 1)
+    self.assertHasDefs(fn_body[1].body[1].targets[0], 1)
+    self.assertHasDefs(fn_body[1].body[1].value, 1)
+    # The loop does have an invariant test, but the CFG doesn't know that.
+    self.assertHasDefs(fn_body[1].body[0].value, 2)
+    self.assertHasDefs(fn_body[2].value, 2)
+
+  def test_while_else(self):
+
+    def test_fn(x, i):
+      y = 0
+      while x:
+        x += i
+        if i:
+          break
+      else:
+        y = 1
+      return x, y
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasDefs(fn_body[0].targets[0], 1)
+    self.assertHasDefs(fn_body[1].test, 2)
+    self.assertHasDefs(fn_body[1].body[0].target, 1)
+    self.assertHasDefs(fn_body[1].body[1].test, 1)
+    self.assertHasDefs(fn_body[1].orelse[0].targets[0], 1)
+    self.assertHasDefs(fn_body[2].value.elts[0], 2)
+    self.assertHasDefs(fn_body[2].value.elts[1], 2)
+
+  def test_for_else(self):
+
+    def test_fn(x, i):
+      y = 0
+      for i in x:
+        x += i
+        if i:
+          break
+        else:
+          continue
+      else:
+        y = 1
+      return x, y
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasDefs(fn_body[0].targets[0], 1)
+    self.assertHasDefs(fn_body[1].target, 1)
+    self.assertHasDefs(fn_body[1].body[0].target, 1)
+    self.assertHasDefs(fn_body[1].body[1].test, 1)
+    self.assertHasDefs(fn_body[1].orelse[0].targets[0], 1)
+    self.assertHasDefs(fn_body[2].value.elts[0], 2)
+    self.assertHasDefs(fn_body[2].value.elts[1], 2)
+
+  def test_nested_functions(self):
+
+    def test_fn(a, b):
+      a = []
+      if b:
+        a = []
+
+        def foo():
+          return a
+
+        foo()
+
+      return a
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+    def_of_a_in_if = fn_body[1].body[0].targets[0]
+
+    self.assertHasDefs(fn_body[0].targets[0], 1)
+    self.assertHasDefs(fn_body[1].test, 1)
+    self.assertHasDefs(def_of_a_in_if, 1)
+    self.assertHasDefs(fn_body[2].value, 2)
+
+    inner_fn_body = fn_body[1].body[1].body
+    self.assertSameDef(inner_fn_body[0].value, def_of_a_in_if)
+
+  def test_nested_functions_isolation(self):
+
+    def test_fn(a):
+      a = 0
+
+      def child():
+        a = 1
+        return a
+
+      child()
+      return a
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    parent_return = fn_body[3]
+    child_return = fn_body[1].body[1]
+    # The assignment `a = 1` makes `a` local to `child`.
+    self.assertNotSameDef(parent_return.value, child_return.value)
+
+  def test_function_call_in_with(self):
+
+    def foo(_):
+      pass
+
+    def test_fn(a):
+      with foo(a):
+        return a
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    self.assertHasDefs(fn_body[0].items[0].context_expr.func, 0)
+    self.assertHasDefs(fn_body[0].items[0].context_expr.args[0], 1)
+
+  def test_mutation_subscript(self):
+
+    def test_fn(a):
+      l = []
+      l[0] = a
+      return l
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    creation = fn_body[0].targets[0]
+    mutation = fn_body[1].targets[0].value
+    use = fn_body[2].value
+    self.assertSameDef(creation, mutation)
+    self.assertSameDef(creation, use)
+
+  def test_replacement(self):
+
+    def foo(a):
+      return a
+
+    def test_fn(a):
+      a = foo(a)
+      return a
+
+    node = self._parse_and_analyze(test_fn)
+    fn_body = node.body[0].body
+
+    param = node.body[0].args.args[0]
+    source = fn_body[0].value.args[0]
+    target = fn_body[0].targets[0]
+    retval = fn_body[1].value
+    self.assertSameDef(param, source)
+    self.assertNotSameDef(source, target)
+    self.assertSameDef(target, retval)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py b/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py
index a229c288a8..835d5199fa 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/type_info.py
@@ -43,9 +43,8 @@ from __future__ import print_function
 
 import gast
 
-from tensorflow.contrib.autograph import utils
 from tensorflow.contrib.autograph.pyct import anno
-from tensorflow.contrib.autograph.pyct import parser
+from tensorflow.contrib.autograph.pyct import ast_util
 from tensorflow.contrib.autograph.pyct import transformer
 from tensorflow.python.util import tf_inspect
 
@@ -166,7 +165,6 @@ class TypeInfoResolver(transformer.Base):
         definition = self.scope.getval(qn)
         anno.copyanno(definition, node, 'type')
         anno.copyanno(definition, node, 'type_fqn')
-        anno.setanno(node, 'definition', definition)
 
         # TODO(mdan): Remove this when the directives module is in.
         anno.copyanno(definition, node, 'element_type')
@@ -198,52 +196,18 @@ class TypeInfoResolver(transformer.Base):
   def visit_With(self, node):
     for item in node.items:
       if item.optional_vars is not None:
-        self.apply_to_single_assignments((item.optional_vars,),
-                                         item.context_expr,
-                                         self._process_variable_assignment)
+        ast_util.apply_to_single_assignments((item.optional_vars,),
+                                             item.context_expr,
+                                             self._process_variable_assignment)
     self.generic_visit(node)
     return node
 
   def visit_Assign(self, node):
     self.generic_visit(node)
-    self.apply_to_single_assignments(
-        node.targets, node.value, self._process_variable_assignment)
+    ast_util.apply_to_single_assignments(node.targets, node.value,
+                                         self._process_variable_assignment)
     return node
 
-  # TODO(mdan): Remove as soon as the new directives module is ready.
-  def visit_Call(self, node):
-    if anno.hasanno(node.func, 'live_val'):
-      # Symbols targeted by the "set_type" marker function are assigned the data
-      # type that it specified.
-      if anno.getanno(node.func, 'live_val') is utils.set_element_type:
-
-        if len(node.args) < 2 or len(node.args) > 3:
-          raise ValueError('"%s" must have either two or three parameters'
-                           % self.context.type_annotation_func)
-        if len(node.args) == 2:
-          target_arg, type_arg = node.args
-          shape_arg = parser.parse_expression('None')
-        else:
-          target_arg, type_arg, shape_arg = node.args
-        if not anno.hasanno(target_arg, anno.Basic.QN):
-          raise ValueError('the first argument of "%s" must by a symbol' %
-                           utils.set_element_type)
-        # TODO(mdan): This is vulnerable to symbol renaming.
-        element_type = type_arg
-        element_shape = shape_arg
-
-        target_symbol = anno.getanno(target_arg, anno.Basic.QN)
-        # Find the definition of this symbol and annotate it with the given
-        # data type. That in turn will cause future uses of the symbol
-        # to receive the same type annotation.
-        definition = self.scope.getval(target_symbol)
-        anno.setanno(node, 'element_type', element_type)
-        anno.setanno(node, 'element_shape', element_shape)
-        anno.setanno(definition, 'element_type', element_type)
-        anno.setanno(definition, 'element_shape', element_shape)
-        # TODO(mdan): Should we update references between definition and here?
-    return self.generic_visit(node)
-
 
 def resolve(node, context):
   return TypeInfoResolver(context).visit(node)
diff --git a/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py b/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py
index 32b1148ab2..404311ba24 100644
--- a/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py
+++ b/tensorflow/contrib/autograph/pyct/static_analysis/type_info_test.py
@@ -19,11 +19,13 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.autograph.pyct import anno
+from tensorflow.contrib.autograph.pyct import cfg
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import qual_names
 from tensorflow.contrib.autograph.pyct import transformer
 from tensorflow.contrib.autograph.pyct.static_analysis import activity
 from tensorflow.contrib.autograph.pyct.static_analysis import live_values
+from tensorflow.contrib.autograph.pyct.static_analysis import reaching_definitions
 from tensorflow.contrib.autograph.pyct.static_analysis import type_info
 from tensorflow.python.client import session
 from tensorflow.python.platform import test
@@ -69,7 +71,10 @@ class TypeInfoResolverTest(test.TestCase):
         arg_types=arg_types,
         owner_type=None)
     node = qual_names.resolve(node)
+    graphs = cfg.build(node)
     node = activity.resolve(node, entity_info)
+    node = reaching_definitions.resolve(node, entity_info, graphs,
+                                        reaching_definitions.Definition)
     node = live_values.resolve(node, entity_info, {})
     node = type_info.resolve(node, entity_info)
     node = live_values.resolve(node, entity_info, {})
diff --git a/tensorflow/contrib/autograph/pyct/templates.py b/tensorflow/contrib/autograph/pyct/templates.py
index 9c479ebc2f..5831d57ceb 100644
--- a/tensorflow/contrib/autograph/pyct/templates.py
+++ b/tensorflow/contrib/autograph/pyct/templates.py
@@ -26,6 +26,7 @@ import textwrap
 
 import gast
 
+from tensorflow.contrib.autograph.pyct import anno
 from tensorflow.contrib.autograph.pyct import ast_util
 from tensorflow.contrib.autograph.pyct import parser
 from tensorflow.contrib.autograph.pyct import qual_names
@@ -43,39 +44,65 @@ class ReplaceTransformer(gast.NodeTransformer):
     """
     self.replacements = replacements
     self.in_replacements = False
+    self.preserved_annos = {
+        anno.Basic.ORIGIN,
+        anno.Basic.SKIP_PROCESSING,
+        anno.Static.ORIG_DEFINITIONS,
+    }
+
+  def _prepare_replacement(self, replaced, key):
+    """Prepares a replacement AST that's safe to swap in for a node.
+
+    Args:
+      replaced: ast.AST, the node being replaced
+      key: Hashable, the key of the replacement AST
+    Returns:
+      ast.AST, the replacement AST
+    """
+    repl = self.replacements[key]
+
+    new_nodes = ast_util.copy_clean(repl, preserve_annos=self.preserved_annos)
+    if isinstance(new_nodes, gast.AST):
+      new_nodes = [new_nodes]
+
+    return new_nodes
 
   def visit_Expr(self, node):
-    if (isinstance(node.value, gast.Name) and
-        node.value.id in self.replacements):
-      return self.visit(node.value)
-    self.generic_visit(node)
-    return node
+    # When replacing a placeholder with an entire statement, the replacement
+    # must stand on its own and not be wrapped in an Expr.
+    new_value = self.visit(node.value)
+    if new_value is node.value:
+      return node
+    return new_value
 
   def visit_keyword(self, node):
-    if node.arg in self.replacements:
-      repl = self.replacements[node.arg]
-      if isinstance(repl, gast.keyword):
-        return repl
-      elif (isinstance(repl, (list, tuple)) and repl and
-            all(isinstance(r, gast.keyword) for r in repl)):
-        return repl
-      # TODO(mdan): We may allow replacing with a string as well.
-      # For example, if one wanted to replace foo with bar in foo=baz, then
-      # we could allow changing just node arg, so that we end up with bar=baz.
-      raise ValueError(
-          'a keyword argument may only be replaced by another keyword or a '
-          'non-empty list of keywords. Found: %s' % repl)
-    return self.generic_visit(node)
+    if node.arg not in self.replacements:
+      return self.generic_visit(node)
+
+    repl = self._prepare_replacement(node, node.arg)
+    if isinstance(repl, gast.keyword):
+      return repl
+    elif (repl and isinstance(repl, (list, tuple)) and
+          all(isinstance(r, gast.keyword) for r in repl)):
+      return repl
+    # TODO(mdan): We may allow replacing with a string as well.
+    # For example, if one wanted to replace foo with bar in foo=baz, then
+    # we could allow changing just node arg, so that we end up with bar=baz.
+    raise ValueError(
+        'a keyword argument may only be replaced by another keyword or a '
+        'non-empty list of keywords. Found: %s' % repl)
 
   def visit_FunctionDef(self, node):
     node = self.generic_visit(node)
-    if node.name in self.replacements:
-      repl = self.replacements[node.name]
-      if not isinstance(repl, (gast.Name, ast.Name)):
-        raise ValueError(
-            'a function name can only be replaced by a Name node. Found: %s' %
-            repl)
-      node.name = repl.id
+    if node.name not in self.replacements:
+      return node
+
+    repl = self.replacements[node.name]
+    if not isinstance(repl, (gast.Name, ast.Name)):
+      raise ValueError(
+          'a function name can only be replaced by a Name node. Found: %s' %
+          repl)
+    node.name = repl.id
     return node
 
   def _check_has_context(self, node):
@@ -113,8 +140,8 @@ class ReplaceTransformer(gast.NodeTransformer):
 
   def _set_inner_child_context(self, node, ctx):
     if isinstance(node, gast.Attribute):
-      self._set_inner_child_context(node.value, ctx)
-      node.ctx = gast.Load()
+      self._set_inner_child_context(node.value, gast.Load())
+      node.ctx = ctx
     elif isinstance(node, gast.Tuple):
       for e in node.elts:
         self._set_inner_child_context(e, ctx)
@@ -148,6 +175,7 @@ class ReplaceTransformer(gast.NodeTransformer):
     node = self.generic_visit(node)
     if node.attr not in self.replacements:
       return node
+
     repl = self.replacements[node.attr]
     if not isinstance(repl, gast.Name):
       raise ValueError(
@@ -159,9 +187,7 @@ class ReplaceTransformer(gast.NodeTransformer):
     if node.id not in self.replacements:
       return node
 
-    new_nodes = ast_util.copy_clean(self.replacements[node.id])
-    if isinstance(new_nodes, gast.AST):
-      new_nodes = [new_nodes]
+    new_nodes = self._prepare_replacement(node, node.id)
 
     # Preserve the target context.
     for n in new_nodes:
@@ -182,7 +208,7 @@ class ReplaceTransformer(gast.NodeTransformer):
 
 
 def _convert_to_ast(n):
-  """Convert from a known data type to AST."""
+  """Converts from a known data type to AST."""
   if isinstance(n, str):
     # Note: the node will receive the ctx value from the template, see
     # ReplaceTransformer.visit_Name.
@@ -197,7 +223,7 @@ def _convert_to_ast(n):
 
 
 def replace(template, **replacements):
-  """Replace placeholders in a Python template.
+  """Replaces placeholders in a Python template.
 
   AST Name and Tuple nodes always receive the context that inferred from
   the template. However, when replacing more complex nodes (that can potentially
diff --git a/tensorflow/contrib/autograph/pyct/templates_test.py b/tensorflow/contrib/autograph/pyct/templates_test.py
index a01f8bf04c..77e8ff62fd 100644
--- a/tensorflow/contrib/autograph/pyct/templates_test.py
+++ b/tensorflow/contrib/autograph/pyct/templates_test.py
@@ -97,6 +97,19 @@ class TemplatesTest(test.TestCase):
     with self.assertRaises(ValueError):
       templates.replace(template, foo=1)
 
+  def test_replace_attribute_context(self):
+    template = """
+      def test_fn(foo):
+        foo = 0
+    """
+
+    node = templates.replace(
+        template,
+        foo=parser.parse_expression('a.b.c'))[0]
+    self.assertIsInstance(node.body[0].targets[0].ctx, gast.Store)
+    self.assertIsInstance(node.body[0].targets[0].value.ctx, gast.Load)
+    self.assertIsInstance(node.body[0].targets[0].value.value.ctx, gast.Load)
+
   def test_replace_call_keyword(self):
     template = """
       def test_fn():
@@ -151,17 +164,13 @@ class TemplatesTest(test.TestCase):
     self.assertEqual(node.func.id, 'bar')
     self.assertEqual(node.func.args[0].id, 'baz')
 
-  def replace_as_expression_restrictions(self):
+  def test_replace_as_expression_restrictions(self):
     template = """
       foo(a)
       bar(b)
     """
     with self.assertRaises(ValueError):
       templates.replace_as_expression(template)
-    with self.assertRaises(ValueError):
-      templates.replace('')
-    with self.assertRaises(ValueError):
-      templates.replace('a = b')
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/autograph/pyct/testing/BUILD b/tensorflow/contrib/autograph/pyct/testing/BUILD
new file mode 100644
index 0000000000..957db356f7
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/testing/BUILD
@@ -0,0 +1,43 @@
+licenses(["notice"])  # Apache 2.0
+
+load("//tensorflow:tensorflow.bzl", "py_test")
+
+filegroup(
+    name = "all_files",
+    srcs = glob(
+        ["**/*"],
+        exclude = [
+            "**/METADATA",
+            "**/OWNERS",
+        ],
+    ),
+    visibility = ["//tensorflow:__subpackages__"],
+)
+
+py_library(
+    name = "testing",
+    srcs = [
+        "codegen.py",
+    ],
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/contrib/autograph/pyct",
+        "//tensorflow/contrib/autograph/utils",
+        "@gast_archive//:gast",
+    ],
+)
+
+py_test(
+    name = "codegen_test",
+    size = "large",
+    srcs = ["codegen_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_windows"],
+    deps = [
+        ":testing",
+        "//tensorflow/contrib/autograph/pyct",
+        "//tensorflow/python:client_testlib",
+        "@gast_archive//:gast",
+    ],
+)
diff --git a/tensorflow/contrib/autograph/pyct/testing/codegen.py b/tensorflow/contrib/autograph/pyct/testing/codegen.py
new file mode 100644
index 0000000000..279e7c09dc
--- /dev/null
+++ b/tensorflow/contrib/autograph/pyct/testing/codegen.py
@@ -0,0 +1,234 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Random code generation for testing/fuzzing."""
+# pylint: disable=invalid-name
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import random
+import string
+
+import gast
+import numpy as np
+
+from tensorflow.contrib.autograph.pyct import templates
+
+
+class NodeSampler(object):
+  sample_map = None
+
+  def sample(self):
+    nodes, magnitudes = zip(*self.sample_map.items())
+    return np.random.choice(
+        nodes, p=np.array(magnitudes, dtype='float32') / np.sum(magnitudes))
+
+
+class StatementSampler(NodeSampler):
+  sample_map = dict((
+      (gast.Assign, 10),
+      (gast.Print, 1),
+      (gast.If, 2),
+      (gast.While, 2),
+      (gast.For, 0),
+  ))
+
+
+class ExpressionSampler(NodeSampler):
+  sample_map = dict((
+      (gast.UnaryOp, 1),
+      (gast.BinOp, 8),
+      (gast.Name, 1),
+      (gast.Call, 0),
+  ))
+
+
+class CompareSampler(NodeSampler):
+  sample_map = dict((
+      (gast.Eq, 1),
+      (gast.NotEq, 1),
+      (gast.Lt, 1),
+      (gast.LtE, 1),
+      (gast.Gt, 1),
+      (gast.GtE, 1),
+      (gast.Is, 1),
+      (gast.IsNot, 1),
+  ))
+
+
+class BinaryOpSampler(NodeSampler):
+  sample_map = dict((
+      (gast.Add, 1),
+      (gast.Sub, 1),
+      (gast.Mult, 1),
+      (gast.Div, 1),
+      (gast.FloorDiv, 1),
+      (gast.Mod, 1),
+      (gast.Pow, 1),
+  ))
+
+
+class UnaryOpSampler(NodeSampler):
+  sample_map = dict(((gast.USub, 1), (gast.UAdd, 0)))
+
+
+class NameSampler(NodeSampler):
+  sample_map = dict((
+      ('new', 1),
+      ('existing', 1),
+  ))
+
+
+N_CONTROLFLOW_STATEMENTS = 10
+N_FUNCTIONDEF_STATEMENTS = 10
+
+
+class CodeGenerator(object):
+  """Generate random syntactically-valid Python ASTs."""
+
+  def __init__(self, max_depth=3, depth=0):
+    self.max_depth = max_depth
+    self.depth = depth
+
+  def generate_statement(self):
+    """Generate a statement node, dispatching to the correct class method."""
+    desired_node = StatementSampler().sample()
+    self.depth += 1
+
+    # Enforce some constraints on generating statements.
+    # E.g., if statements need at least 3 readable variables.
+    # If we fail to satisfy our constraints, draw another sample.
+    if desired_node in (gast.While, gast.For, gast.If):
+      if self.depth > self.max_depth:
+        return self.generate_statement()
+
+    # Go get the generator method and run it
+    method = 'generate_' + desired_node.__name__
+    visitor = getattr(self, method)
+    node = visitor()
+    self.depth -= 1
+    return node
+
+  def sample_node_list(self, low, high, generator):
+    """Generate a list of statements of random length.
+
+    Args:
+      low: Fewest number of statements to generate.
+      high: Highest number of statements to generate.
+      generator: Function to call to generate nodes.
+
+    Returns:
+      A list of statements.
+    """
+    statements = []
+    for _ in range(np.random.randint(low, high)):
+      statements.append(generator())
+    return statements
+
+  def generate_Name(self, ctx=gast.Load()):
+    variable_name = '_' + ''.join(
+        random.choice(string.ascii_lowercase) for _ in range(4))
+    return gast.Name(variable_name, ctx=ctx, annotation=None)
+
+  def generate_BinOp(self):
+    # TODO(alexbw): convert to generate_expression when we get to limit
+    # expression depth.
+    op = BinaryOpSampler().sample()()
+    return gast.BinOp(self.generate_Name(), op, self.generate_Name())
+
+  def generate_Compare(self):
+    op = CompareSampler().sample()()
+    return gast.Compare(self.generate_Name(), [op], [self.generate_Name()])
+
+  def generate_UnaryOp(self):
+    operand = self.generate_Name()
+    op = UnaryOpSampler().sample()()
+    return gast.UnaryOp(op, operand)
+
+  def generate_expression(self):
+    desired_node = ExpressionSampler().sample()
+    # Go get the generator method and run it
+    method = 'generate_' + desired_node.__name__
+    generator = getattr(self, method)
+    return generator()
+
+  def generate_Assign(self):
+    """Generate an Assign node."""
+    # Generate left-hand side
+    target_node = self.generate_Name(gast.Store())
+    # Generate right-hand side
+    value_node = self.generate_expression()
+    # Put it all together
+    node = gast.Assign(targets=[target_node], value=value_node)
+    return node
+
+  def generate_If(self):
+    """Generate an If node."""
+    test = self.generate_Compare()
+
+    # Generate true branch statements
+    body = self.sample_node_list(
+        low=1,
+        high=N_CONTROLFLOW_STATEMENTS // 2,
+        generator=self.generate_statement)
+
+    # Generate false branch statements
+    orelse = self.sample_node_list(
+        low=1,
+        high=N_CONTROLFLOW_STATEMENTS // 2,
+        generator=self.generate_statement)
+
+    node = gast.If(test, body, orelse)
+    return node
+
+  def generate_While(self):
+    """Generate a While node."""
+
+    test = self.generate_Compare()
+    body = self.sample_node_list(
+        low=1, high=N_CONTROLFLOW_STATEMENTS, generator=self.generate_statement)
+    orelse = []  # not generating else statements
+
+    node = gast.While(test, body, orelse)
+    return node
+
+  def generate_Call(self):
+    raise NotImplementedError
+
+  def generate_Return(self):
+    return gast.Return(self.generate_expression())
+
+  def generate_Print(self):
+    return templates.replace('print(x)', x=self.generate_expression())[0]
+
+  def generate_FunctionDef(self):
+    """Generate a FunctionDef node."""
+
+    # Generate the arguments, register them as available
+    arg_vars = self.sample_node_list(
+        low=2, high=10, generator=lambda: self.generate_Name(gast.Param()))
+    args = gast.arguments(arg_vars, None, [], [], None, [])
+
+    # Generate the function body
+    body = self.sample_node_list(
+        low=1, high=N_FUNCTIONDEF_STATEMENTS, generator=self.generate_statement)
+    body.append(self.generate_Return())
+    fn_name = self.generate_Name().id
+    node = gast.FunctionDef(fn_name, args, body, (), None)
+    return node
+
+
+def generate_random_functiondef():
+  return CodeGenerator().generate_FunctionDef()
diff --git a/tensorflow/contrib/autograph/utils/type_hints.py b/tensorflow/contrib/autograph/pyct/testing/codegen_test.py
index aeb9e54561..255c3b2a2e 100644
--- a/tensorflow/contrib/autograph/utils/type_hints.py
+++ b/tensorflow/contrib/autograph/pyct/testing/codegen_test.py
@@ -12,30 +12,29 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""No-op utilities that provide static type hints.
-
-These are used when the data type is not known at creation, for instance in the
-case of empty lists.
-"""
+"""Tests for type_info module."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
+from tensorflow.contrib.autograph.pyct import compiler
+from tensorflow.contrib.autograph.pyct.testing import codegen
+from tensorflow.python.platform import test
+
+
+class CodeGenTest(test.TestCase):
 
-def set_element_type(entity, dtype, shape=None):
-  """Indicates that the entity is expected hold items of specified type.
+  def test_codegen_gens(self):
+    np.random.seed(0)
+    for _ in range(1000):
+      node = codegen.generate_random_functiondef()
+      fn = compiler.ast_to_object(node)
+      self.assertIsNotNone(
+          fn, 'Generated invalid AST that could not convert to source.')
 
-  This function is a no-op. Its presence merely marks the data type of its
-  argument. The staged TensorFlow ops will reflect and assert this data type.
 
-  Args:
-    entity: A Tensor or TensorArray.
-    dtype: TensorFlow dtype value to assert for entity.
-    shape: Optional shape to assert for entity.
-  Returns:
-    The value of entity, unchanged.
-  """
-  del dtype
-  del shape
-  return entity
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/autograph/pyct/transformer.py b/tensorflow/contrib/autograph/pyct/transformer.py
index 7655811830..969ca12244 100644
--- a/tensorflow/contrib/autograph/pyct/transformer.py
+++ b/tensorflow/contrib/autograph/pyct/transformer.py
@@ -59,6 +59,103 @@ class EntityInfo(object):
     self.owner_type = owner_type
 
 
+class _StateStack(object):
+  """Typed stack abstraction.
+
+  This class provides syntactic sugar for a stack of objects of known
+  type. It allows accessing attributes of the object at the top of the stack
+  directly against this object, which allows for very terse syntax.
+
+  For example, this code:
+
+    stack = _StateStack(Foo)
+    stack.enter()
+    stack.bar
+
+  Is equivalent to:
+
+    stack = []
+    stack.append(Foo())
+    foo = stack[-1]
+    foo.bar
+
+  See _State for more on how this is used.
+
+  Attributes:
+    type: Any, the type of objects that this stack holds
+    level: int, the current stack depth
+    value: Any, the instance of the object at the top of the stack
+  """
+
+  def __init__(self, type_):
+    # Because we override __setattr__, we need to attach these attributes using
+    # the superclass' setattr.
+    object.__setattr__(self, 'type', type_)
+    object.__setattr__(self, '_stack', [])
+    self.enter()
+
+  def enter(self):
+    self._stack.append(self.type())
+
+  def exit(self):
+    return self._stack.pop()
+
+  @property
+  def level(self):
+    return len(self._stack)
+
+  @property
+  def value(self):
+    return self._stack[-1]
+
+  def __getattr__(self, key):
+    return getattr(self._stack[-1], key)
+
+  def __setattr__(self, key, value):
+    setattr(self._stack[-1], key, value)
+
+
+class _State(object):
+  """Supporting class for nested scope variable space for converter.Base.
+
+  This structure offers syntactic sugar over a dict of stacks of objects
+  of known type. These structures are useful to keep state during AST walks.
+  Multiple different scopes can be tracked in parallel. For example:
+
+    s = _State()
+
+    s[foo].enter()
+    s[bar].enter()  # this will not affect s[foo]
+
+  Element access has special semantics:
+    * keys are a data type
+    * element values are _StateStack(type=key) objects
+    * missing elements are automatically added, similarly to defaultdict
+
+  For example, the following block :
+
+    _State s
+    s[Foo]
+
+  Is equivalent to:
+
+    s = {}
+    if Foo not in s:
+      s[Foo] = Foo()
+    s[Foo]
+
+  See Base for how it's used.
+  """
+
+  def __init__(self):
+    self._value = {}
+
+  def __getitem__(self, key):
+    if key not in self._value:
+      self._value[key] = _StateStack(key)
+    return self._value[key]
+
+
 class Base(gast.NodeTransformer):
   """Base class for general-purpose code transformers transformers.
 
@@ -71,6 +168,27 @@ class Base(gast.NodeTransformer):
   (possibly nested) scopes, use enter/exit_local_scope and set/get_local.
   You must call enter/exit_local_scope manually, but the transformer detects
   when they are not properly paired.
+
+  The transformer allows keeping state across calls to visit_* that is local to
+  arbitrary nodes and their descendants, using the self.state attribute.
+  Multiple independent scopes are allowed and automatically constructed.
+
+  For example, to keep track of the If node that encloses any Name node, one can
+  write:
+
+    class FooType(object):
+
+      def __init__(self):
+        self.foo_property = None
+
+    class DummyTransformer(Base):
+
+      def visit_If(self, node):
+        self.state[FooType].enter()
+        self.state[FooType].foo_property = node
+
+      def visit_Name(self, node):
+        self.state[FooType].foo_property  # will hold the innermost enclosing if
   """
 
   # TODO(mdan): Document all extra features.
@@ -92,6 +210,12 @@ class Base(gast.NodeTransformer):
     self._local_scope_state = []
     self.enter_local_scope()
 
+    # Allows scoping of local variables to keep state across calls to visit_*
+    # methods. Multiple scope hierchies may exist and are keyed by tag. A scope
+    # is valid at one or more nodes and all its children. Scopes created in
+    # child nodes supersede their parent. Scopes are isolated from one another.
+    self.state = _State()
+
   @property
   def enclosing_entities(self):
     return tuple(self._enclosing_entities)
@@ -101,7 +225,9 @@ class Base(gast.NodeTransformer):
     return len(self._local_scope_state)
 
   def enter_local_scope(self, inherit=None):
-    """Marks entry into a new local scope.
+    """Deprecated. Use self.state instead.
+
+    Marks entry into a new local scope.
 
     Args:
       inherit: Optional enumerable of variable names to copy from the
@@ -116,7 +242,9 @@ class Base(gast.NodeTransformer):
     self._local_scope_state.append(scope_entered)
 
   def exit_local_scope(self, keep=None):
-    """Marks exit from the current local scope.
+    """Deprecated. Use self.state instead.
+
+    Marks exit from the current local scope.
 
     Args:
       keep: Optional enumerable of variable names to copy into the
@@ -133,9 +261,11 @@ class Base(gast.NodeTransformer):
     return scope_left
 
   def set_local(self, name, value):
+    """Deprecated. Use self.state instead."""
     self._local_scope_state[-1][name] = value
 
   def get_local(self, name, default=None):
+    """Deprecated. Use self.state instead."""
     return self._local_scope_state[-1].get(name, default)
 
   def debug_print(self, node):
@@ -216,7 +346,7 @@ class Base(gast.NodeTransformer):
         node_destination = new_destination
     return results
 
-  # TODO(mdan): Once we have error tracing, we may be able to just go to SSA.
+  # TODO(mdan): Remove.
   def apply_to_single_assignments(self, targets, values, apply_fn):
     """Applies a function to each individual assignment.
 
@@ -266,19 +396,38 @@ class Base(gast.NodeTransformer):
 
   def _get_source(self, node):
     try:
-      return compiler.ast_to_source(node)
-    except AssertionError:
+      source, _ = compiler.ast_to_source(node)
+      return source
+    # pylint: disable=broad-except
+    # This function is used for error reporting.  If an exception occurs here,
+    # it should be suppressed, in favor of emitting as informative a message
+    # about the original error as possible.
+    except Exception:
       return '<could not convert AST to source>'
 
   def visit(self, node):
+    if not isinstance(node, gast.AST):
+      # This is not that uncommon a mistake: various node bodies are lists, for
+      # example, posing a land mine for transformers that need to recursively
+      # call `visit`.  The error needs to be raised before the exception handler
+      # below is installed, because said handler will mess up if `node` is not,
+      # in fact, a node.
+      msg = (
+          'invalid value for "node": expected "ast.AST", got "{}"; to'
+          ' visit lists of nodes, use "visit_block" instead').format(type(node))
+      raise ValueError(msg)
+
     source_code = self.entity_info.source_code
     source_file = self.entity_info.source_file
     did_enter_function = False
     local_scope_size_at_entry = len(self._local_scope_state)
+    processing_expr_node = False
 
     try:
       if isinstance(node, (gast.FunctionDef, gast.ClassDef, gast.Lambda)):
         did_enter_function = True
+      elif isinstance(node, gast.Expr):
+        processing_expr_node = True
 
       if did_enter_function:
         self._enclosing_entities.append(node)
@@ -287,9 +436,23 @@ class Base(gast.NodeTransformer):
         self._lineno = node.lineno
         self._col_offset = node.col_offset
 
+      if processing_expr_node:
+        entry_expr_value = node.value
+
       if not anno.hasanno(node, anno.Basic.SKIP_PROCESSING):
         result = super(Base, self).visit(node)
 
+      # Adjust for consistency: replacing the value of an Expr with
+      # an Assign node removes the need for the Expr node.
+      if processing_expr_node:
+        if isinstance(result, gast.Expr) and result.value != entry_expr_value:
+          # When the replacement is a list, it is assumed that the list came
+          # from a template that contained a number of statements, which
+          # themselves are standalone and don't require an enclosing Expr.
+          if isinstance(result.value,
+                        (list, tuple, gast.Assign, gast.AugAssign)):
+            result = result.value
+
       # On exception, the local scope integrity is not guaranteed.
       if did_enter_function:
         self._enclosing_entities.pop()
diff --git a/tensorflow/contrib/autograph/pyct/transformer_test.py b/tensorflow/contrib/autograph/pyct/transformer_test.py
index baf04653ae..a37e922a1d 100644
--- a/tensorflow/contrib/autograph/pyct/transformer_test.py
+++ b/tensorflow/contrib/autograph/pyct/transformer_test.py
@@ -93,6 +93,83 @@ class TransformerTest(test.TestCase):
                       inner_function, lambda_node),
                      anno.getanno(lambda_expr, 'enclosing_entities'))
 
+  def assertSameAnno(self, first, second, key):
+    self.assertIs(anno.getanno(first, key), anno.getanno(second, key))
+
+  def assertDifferentAnno(self, first, second, key):
+    self.assertIsNot(anno.getanno(first, key), anno.getanno(second, key))
+
+  def test_state_tracking(self):
+
+    class LoopState(object):
+      pass
+
+    class CondState(object):
+      pass
+
+    class TestTransformer(transformer.Base):
+
+      def visit(self, node):
+        anno.setanno(node, 'loop_state', self.state[LoopState].value)
+        anno.setanno(node, 'cond_state', self.state[CondState].value)
+        return super(TestTransformer, self).visit(node)
+
+      def visit_While(self, node):
+        self.state[LoopState].enter()
+        node = self.generic_visit(node)
+        self.state[LoopState].exit()
+        return node
+
+      def visit_If(self, node):
+        self.state[CondState].enter()
+        node = self.generic_visit(node)
+        self.state[CondState].exit()
+        return node
+
+    tr = TestTransformer(self._simple_source_info())
+
+    def test_function(a):
+      a = 1
+      while a:
+        _ = 'a'
+        if a > 2:
+          _ = 'b'
+          while True:
+            raise '1'
+        if a > 3:
+          _ = 'c'
+          while True:
+            raise '1'
+
+    node, _ = parser.parse_entity(test_function)
+    node = tr.visit(node)
+
+    fn_body = node.body[0].body
+    outer_while_body = fn_body[1].body
+    self.assertSameAnno(fn_body[0], outer_while_body[0], 'cond_state')
+    self.assertDifferentAnno(fn_body[0], outer_while_body[0], 'loop_state')
+
+    first_if_body = outer_while_body[1].body
+    self.assertDifferentAnno(outer_while_body[0], first_if_body[0],
+                             'cond_state')
+    self.assertSameAnno(outer_while_body[0], first_if_body[0], 'loop_state')
+
+    first_inner_while_body = first_if_body[1].body
+    self.assertSameAnno(first_if_body[0], first_inner_while_body[0],
+                        'cond_state')
+    self.assertDifferentAnno(first_if_body[0], first_inner_while_body[0],
+                             'loop_state')
+
+    second_if_body = outer_while_body[2].body
+    self.assertDifferentAnno(first_if_body[0], second_if_body[0], 'cond_state')
+    self.assertSameAnno(first_if_body[0], second_if_body[0], 'loop_state')
+
+    second_inner_while_body = second_if_body[1].body
+    self.assertDifferentAnno(first_inner_while_body[0],
+                             second_inner_while_body[0], 'cond_state')
+    self.assertDifferentAnno(first_inner_while_body[0],
+                             second_inner_while_body[0], 'loop_state')
+
   def test_local_scope_info_stack(self):
 
     class TestTransformer(transformer.Base):
@@ -205,6 +282,88 @@ class TransformerTest(test.TestCase):
     self.assertTrue(isinstance(node.body[1].body[0], gast.Assign))
     self.assertTrue(isinstance(node.body[1].body[1], gast.Return))
 
+  def test_robust_error_on_list_visit(self):
+
+    class BrokenTransformer(transformer.Base):
+
+      def visit_If(self, node):
+        # This is broken because visit expects a single node, not a list, and
+        # the body of an if is a list.
+        # Importantly, the default error handling in visit also expects a single
+        # node.  Therefore, mistakes like this need to trigger a type error
+        # before the visit called here installs its error handler.
+        # That type error can then be caught by the enclosing call to visit,
+        # and correctly blame the If node.
+        self.visit(node.body)
+        return node
+
+    def test_function(x):
+      if x > 0:
+        return x
+
+    tr = BrokenTransformer(self._simple_source_info())
+
+    node, _ = parser.parse_entity(test_function)
+    with self.assertRaises(transformer.AutographParseError) as cm:
+      node = tr.visit(node)
+    obtained_message = str(cm.exception)
+    expected_message = r'expected "ast.AST", got "\<(type|class) \'list\'\>"'
+    self.assertRegexpMatches(obtained_message, expected_message)
+    # The exception should point at the if statement, not any place else.  Could
+    # also check the stack trace.
+    self.assertTrue(
+        'Occurred at node:\nIf' in obtained_message, obtained_message)
+    self.assertTrue(
+        'Occurred at node:\nFunctionDef' not in obtained_message,
+        obtained_message)
+    self.assertTrue(
+        'Occurred at node:\nReturn' not in obtained_message, obtained_message)
+
+  def test_robust_error_on_ast_corruption(self):
+    # A child class should not be able to be so broken that it causes the error
+    # handling in `transformer.Base` to raise an exception.  Why not?  Because
+    # then the original error location is dropped, and an error handler higher
+    # up in the call stack gives misleading information.
+
+    # Here we test that the error handling in `visit` completes, and blames the
+    # correct original exception, even if the AST gets corrupted.
+
+    class NotANode(object):
+      pass
+
+    class BrokenTransformer(transformer.Base):
+
+      def visit_If(self, node):
+        node.body = NotANode()
+        raise ValueError('I blew up')
+
+    def test_function(x):
+      if x > 0:
+        return x
+
+    tr = BrokenTransformer(self._simple_source_info())
+
+    node, _ = parser.parse_entity(test_function)
+    with self.assertRaises(transformer.AutographParseError) as cm:
+      node = tr.visit(node)
+    obtained_message = str(cm.exception)
+    # The message should reference the exception actually raised, not anything
+    # from the exception handler.
+    expected_substring = 'I blew up'
+    self.assertTrue(expected_substring in obtained_message, obtained_message)
+    # Expect the exception to have failed to parse the corrupted AST
+    self.assertTrue(
+        '<could not convert AST to source>' in obtained_message,
+        obtained_message)
+    # The exception should point at the if statement, not any place else.  Could
+    # also check the stack trace.
+    self.assertTrue(
+        'Occurred at node:\nIf' in obtained_message, obtained_message)
+    self.assertTrue(
+        'Occurred at node:\nFunctionDef' not in obtained_message,
+        obtained_message)
+    self.assertTrue(
+        'Occurred at node:\nReturn' not in obtained_message, obtained_message)
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/autograph/utils/BUILD b/tensorflow/contrib/autograph/utils/BUILD
index d82c17bf2a..d2b399f19b 100644
--- a/tensorflow/contrib/autograph/utils/BUILD
+++ b/tensorflow/contrib/autograph/utils/BUILD
@@ -28,7 +28,6 @@ py_library(
         "tensor_list.py",
         "testing.py",
         "type_check.py",
-        "type_hints.py",
     ],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:__subpackages__"],
diff --git a/tensorflow/contrib/autograph/utils/__init__.py b/tensorflow/contrib/autograph/utils/__init__.py
index 817d4126d1..57b5f74741 100644
--- a/tensorflow/contrib/autograph/utils/__init__.py
+++ b/tensorflow/contrib/autograph/utils/__init__.py
@@ -30,4 +30,3 @@ from tensorflow.contrib.autograph.utils.py_func import wrap_py_func
 from tensorflow.contrib.autograph.utils.tensor_list import dynamic_list_append
 from tensorflow.contrib.autograph.utils.testing import fake_tf
 from tensorflow.contrib.autograph.utils.type_check import is_tensor
-from tensorflow.contrib.autograph.utils.type_hints import set_element_type
diff --git a/tensorflow/contrib/autograph/utils/builtins.py b/tensorflow/contrib/autograph/utils/builtins.py
index 998087e056..ccbe5fc954 100644
--- a/tensorflow/contrib/autograph/utils/builtins.py
+++ b/tensorflow/contrib/autograph/utils/builtins.py
@@ -27,6 +27,7 @@ from tensorflow.contrib.autograph.utils import type_check
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import list_ops
 from tensorflow.python.ops import logging_ops
 from tensorflow.python.ops import math_ops
 
@@ -50,15 +51,22 @@ def dynamic_builtin(f, *args, **kwargs):
 
 def dynamic_len(list_or_tensor):
   """Implementation of len using dynamic dispatch."""
-  if tensor_util.is_tensor(list_or_tensor):
+  if _is_tensor_list(list_or_tensor):
+    return list_ops.tensor_list_length(list_or_tensor)
+  elif tensor_util.is_tensor(list_or_tensor):
     shape = list_or_tensor.shape
-    if not shape:
+    if not shape.ndims:
       raise ValueError(
           'len requires non-zero rank for tensor "%s"' % list_or_tensor)
     return array_ops.shape(list_or_tensor)[0]
   return len(list_or_tensor)
 
 
+def _is_tensor_list(list_or_tensor):
+  return (tensor_util.is_tensor(list_or_tensor)
+          and list_or_tensor.dtype == dtypes.variant)
+
+
 def dynamic_int(num_or_tensor, **kwargs):
   """Implementation of int() using dynamic dispatch."""
   if tensor_util.is_tensor(num_or_tensor):
diff --git a/tensorflow/contrib/autograph/utils/builtins_test.py b/tensorflow/contrib/autograph/utils/builtins_test.py
index 0c2312178a..b4821f36fc 100644
--- a/tensorflow/contrib/autograph/utils/builtins_test.py
+++ b/tensorflow/contrib/autograph/utils/builtins_test.py
@@ -33,7 +33,8 @@ class BuiltinsTest(test.TestCase):
   def test_dynamic_len_tf_scalar(self):
     a = constant_op.constant(1)
 
-    with self.assertRaises(ValueError):
+    with self.assertRaisesRegexp(ValueError,
+                                 'len requires non-zero rank for tensor.*'):
       with self.test_session() as sess:
         sess.run(builtins.dynamic_builtin(len, a))
 
diff --git a/tensorflow/contrib/batching/python/ops/batch_ops.py b/tensorflow/contrib/batching/python/ops/batch_ops.py
index 47b80bdf4a..55faad983f 100644
--- a/tensorflow/contrib/batching/python/ops/batch_ops.py
+++ b/tensorflow/contrib/batching/python/ops/batch_ops.py
@@ -58,8 +58,6 @@ def batch_function(num_batch_threads,
                    max_batch_size,
                    batch_timeout_micros,
                    allowed_batch_sizes=None,
-                   grad_timeout_micros=60 * 1000 * 1000,
-                   unbatch_timeout_micros=60 * 1000 * 1000,
                    max_enqueued_batches=10):
   """Batches the computation done by the decorated function.
 
@@ -94,10 +92,6 @@ def batch_function(num_batch_threads,
      does nothing. Otherwise, supplies a list of batch sizes, causing the op
      to pad batches up to one of those sizes. The entries must increase
      monotonically, and the final entry must equal max_batch_size.
-    grad_timeout_micros: The timeout to use for the gradient. See the
-     documentation of the unbatch op for more details. Defaults to 60s.
-    unbatch_timeout_micros: The timeout to use for unbatching. See the
-     documentation of the unbatch op for more details. Defaults to 60s.
     max_enqueued_batches: The maximum depth of the batch queue. Defaults to 10.
 
   Returns:
diff --git a/tensorflow/contrib/bigtable/BUILD b/tensorflow/contrib/bigtable/BUILD
index 5c15d21e35..71538e0770 100644
--- a/tensorflow/contrib/bigtable/BUILD
+++ b/tensorflow/contrib/bigtable/BUILD
@@ -31,6 +31,7 @@ tf_custom_op_py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":bigtable_ops",
+        "//tensorflow/contrib/data/python/ops:interleave_ops",
         "//tensorflow/contrib/util:util_py",
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:platform",
@@ -39,18 +40,24 @@ tf_custom_op_py_library(
     ],
 )
 
+KERNEL_FILES = [
+    "kernels/bigtable_kernels.cc",
+    "kernels/bigtable_lookup_dataset_op.cc",
+    "kernels/bigtable_prefix_key_dataset_op.cc",
+    "kernels/bigtable_range_key_dataset_op.cc",
+    "kernels/bigtable_sample_keys_dataset_op.cc",
+    "kernels/bigtable_sample_key_pairs_dataset_op.cc",
+    "kernels/bigtable_scan_dataset_op.cc",
+]
+
 tf_custom_op_library(
     name = "python/ops/_bigtable.so",
-    srcs = [
-        "kernels/bigtable_kernels.cc",
-        "kernels/bigtable_lookup_dataset_op.cc",
-        "kernels/bigtable_prefix_key_dataset_op.cc",
-        "kernels/bigtable_range_key_dataset_op.cc",
-        "kernels/bigtable_scan_dataset_op.cc",
+    srcs = KERNEL_FILES + [
         "ops/bigtable_ops.cc",
     ],
     deps = [
         ":bigtable_lib_cc",
+        ":bigtable_range_helpers",
         "@com_github_googlecloudplatform_google_cloud_cpp//google/cloud/bigtable:bigtable_client",
     ],
 )
@@ -69,15 +76,10 @@ tf_gen_op_libs(
 
 tf_kernel_library(
     name = "bigtable_kernels",
-    srcs = [
-        "kernels/bigtable_kernels.cc",
-        "kernels/bigtable_lookup_dataset_op.cc",
-        "kernels/bigtable_prefix_key_dataset_op.cc",
-        "kernels/bigtable_range_key_dataset_op.cc",
-        "kernels/bigtable_scan_dataset_op.cc",
-    ],
+    srcs = KERNEL_FILES,
     deps = [
         ":bigtable_lib_cc",
+        ":bigtable_range_helpers",
         "//tensorflow/core:framework_headers_lib",
         "//third_party/eigen3",
         "@com_github_googlecloudplatform_google_cloud_cpp//google/cloud/bigtable:bigtable_client",
@@ -97,6 +99,15 @@ cc_library(
 )
 
 cc_library(
+    name = "bigtable_range_helpers",
+    srcs = ["kernels/bigtable_range_helpers.cc"],
+    hdrs = ["kernels/bigtable_range_helpers.h"],
+    deps = [
+        "//tensorflow/core:framework_headers_lib",
+    ],
+)
+
+cc_library(
     name = "bigtable_test_client",
     srcs = ["kernels/test_kernels/bigtable_test_client.cc"],
     hdrs = ["kernels/test_kernels/bigtable_test_client.h"],
@@ -120,6 +131,17 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "bigtable_range_helpers_test",
+    size = "small",
+    srcs = ["kernels/bigtable_range_helpers_test.cc"],
+    deps = [
+        ":bigtable_range_helpers",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
 tf_gen_op_wrapper_py(
     name = "bigtable_test_ops",
     deps = [":bigtable_test_ops_op_lib"],
@@ -168,11 +190,6 @@ tf_custom_op_py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":bigtable_test_ops",
-        # "//tensorflow/contrib/util:util_py",
-        # "//tensorflow/python:framework_for_generated_wrappers",
-        # "//tensorflow/python:platform",
-        # "//tensorflow/python:util",
-        # "//tensorflow/python/data",
     ],
 )
 
diff --git a/tensorflow/contrib/bigtable/README.md b/tensorflow/contrib/bigtable/README.md
index ef3c60069e..88a3909de4 100644
--- a/tensorflow/contrib/bigtable/README.md
+++ b/tensorflow/contrib/bigtable/README.md
@@ -1,10 +1,347 @@
 # Bigtable #
 
-[Google Cloud Bigtable](https://cloud.google.com/bigtable/) is a high
+[Cloud Bigtable](https://cloud.google.com/bigtable/) is a high
 performance storage system that can store and serve training data. This contrib
 package contains an experimental integration with TensorFlow.
 
 > **Status: Highly experimental.** The current implementation is very much in
 > flux. Please use at your own risk! :-)
 
-<!-- TODO(saeta): Document usage / methods / etc. -->
+The TensorFlow integration with Cloud Bigtable is optimized for common
+TensorFlow usage and workloads. It is currently optimized for reading from Cloud
+Bigtable at high speed, in particular to feed modern accelerators. For
+general-purpose Cloud Bigtable
+APIs, see the [official Cloud Bigtable client library documentation][clientdoc].
+
+[clientdoc]:  https://cloud.google.com/bigtable/docs/reference/libraries
+
+## Sample Use
+
+There are three main reading styles supported by the `BigtableTable` class:
+
+ 1. **Reading keys**: Read only the row keys in a table. Keys are returned in
+    sorted order from the table. Most key reading operations retrieve all keys
+    in a contiguous range, however the `sample_keys` operation skips keys, and
+    operates on the whole table (and not a contiguous subset).
+ 2. **Retrieving a row's values**: Given a row key, look up the data associated
+    with a defined set of columns. This operation takes advantage of Cloud
+    Bigtable's low-latency and excellent support for random access.
+ 3. **Scanning ranges**: Given a contiguous range of rows retrieve both the row
+    key and the data associated with a fixed set of columns. This operation
+    takes advantage of Cloud Bigtable's high throughput scans, and is the most
+    efficient way to read data.
+
+When using the Cloud Bigtable API, the workflow is:
+
+ 1. Create a `BigtableClient` object.
+ 2. Use the `BigtableClient` to create `BigtableTable` objects corresponding to
+    each table in the Cloud Bigtable instance you would like to access.
+ 3. Call methods on the `BigtableTable` object to create `tf.data.Dataset`s to
+    retrieve data.
+
+The following is an example for how to read all row keys with the prefix
+`train-`.
+
+```python
+import tensorflow as tf
+
+GCP_PROJECT_ID = '<FILL_ME_IN>'
+BIGTABLE_INSTANCE_ID = '<FILL_ME_IN>'
+BIGTABLE_TABLE_NAME = '<FILL_ME_IN>'
+PREFIX = 'train-'
+
+def main():
+  client = tf.contrib.cloud.BigtableClient(GCP_PROJECT_ID, BIGTABLE_INSTANCE_ID)
+  table = client.table(BIGTABLE_TABLE_NAME)
+  dataset = table.keys_by_prefix_dataset(PREFIX)
+  iterator = dataset.make_initializable_iterator()
+  get_next_op = iterator.get_next()
+
+  with tf.Session() as sess:
+    print('Initializing the iterator.')
+    sess.run(iterator.initializer)
+    print('Retrieving rows:')
+    row_index = 0
+    while True:
+      try:
+        row_key = sess.run(get_next_op)
+        print('Row key %d: %s' % (row_index, row_key))
+        row_index += 1
+      except tf.errors.OutOfRangeError:
+        print('Finished reading data!')
+        break
+
+if __name__ == '__main__':
+  main()
+
+```
+
+### Reading row keys
+
+Read only the row keys in a table. Keys are returned in sorted order from the
+table. Most key reading operations retrieve all keys in a contiguous range,
+however the `sample_keys` operation skips keys, and operates on the whole table
+(and not a contiguous subset).
+
+There are 3 methods to retrieve row keys:
+
+ - `table.keys_by_range_dataset(start, end)`: Retrieve row keys starting with
+   `start`, and ending with `end`. The range is "half-open", and thus it
+   includes `start` if `start` is present in the table. It does not include
+   `end`.
+ - `table.keys_by_prefix_dataset(prefix)`: Retrieves all row keys that start
+   with `prefix`. It includes the row key `prefix` if present in the table.
+ - `table.sample_keys()`: Retrieves a sampling of keys from the underlying
+   table. This is often useful in conjunction with parallel scans.
+
+### Reading cell values given a row key
+
+Given a dataset producing row keys, you can use the `table.lookup_columns`
+transformation to retrieve values. Example:
+
+```python
+key_dataset = tf.data.Dataset.from_tensor_slices([
+    'row_key_1',
+    'other_row_key',
+    'final_row_key',
+])
+values_dataset = key_dataset.apply(
+  table.lookup_columns(('my_column_family', 'column_name'),
+                       ('other_cf', 'col')))
+training_data = values_dataset.map(my_parsing_function)  # ...
+```
+
+### Scanning ranges
+Given a contiguous range of rows retrieve both the row key and the data
+associated with a fixed set of columns. Scanning is the most efficient way to
+retrieve data from Cloud Bigtable and is thus a very common API for high
+performance data pipelines. To construct a scanning `tf.data.Dataset` from a
+`BigtableTable` object, call one of the following methods:
+
+ - `table.scan_prefix(prefix, ...)`
+ - `table.scan_range(start, end, ...)`
+ - `table.parallel_scan_prefix(prefix, ...)`
+ - `table.parallel_scan_range(start, end, ...)`
+
+Aside from the specification of the contiguous range of rows, they all take the
+following arguments:
+
+ - `probability`: (Optional.) A float between 0 (exclusive) and 1 (inclusive).
+      A non-1 value indicates to probabilistically sample rows with the
+      provided probability.
+ - `columns`: The columns to read. (See below.)
+ - `**kwargs`: The columns to read. (See below.)
+
+In addition the two parallel operations accept the following optional argument:
+`num_parallel_scans` which configures the number of parallel Cloud Bigtable scan
+operations to run. A reasonable default is automatically chosen for small
+Cloud Bigtable clusters. If you have a large cluster, or an extremely demanding
+workload, you can tune this value to optimize performance.
+
+#### Specifying columns to read when scanning
+
+All of the scan operations allow you to specify the column family and columns
+in the same ways.
+
+##### Using `columns`
+
+The first way to specify the data to read is via the `columns` parameter. The
+value should be a tuple (or list of tuples) of strings. The first string in the
+tuple is the column family, and the second string in the tuple is the column
+qualifier.
+
+##### Using `**kwargs`
+
+The second way to specify the data to read is via the `**kwargs` parameter,
+which you can use to specify keyword arguments corresponding to the columns that
+you want to read. The keyword to use is the column family name, and the argument
+value should be either a string, or a tuple of strings, specifying the column
+qualifiers (column names).
+
+Although using `**kwargs` has the advantage of requiring less typing, it is not
+future-proof in all cases. (If we add a new parameter to the scan functions that
+has the same name as your column family, your code will break.)
+
+##### Examples
+
+Below are two equivalent snippets for how to specify which columns to read:
+
+```python
+ds1 = table.scan_range("row_start", "row_end", columns=[("cfa", "c1"),
+                                                        ("cfa", "c2"),
+                                                        ("cfb", "c3")])
+ds2 = table.scan_range("row_start", "row_end", cfa=["c1", "c2"], cfb="c3")
+```
+
+In this example, we are reading 3 columns from a total of 2 column families.
+From the `cfa` column family, we are reading columns `c1`, and `c2`. From the
+second column family (`cfb`), we are reading `c3`. Both `ds1` and `ds2` will
+output elements of the following types (`tf.string`, `tf.string`, `tf.string`,
+`tf.string`). The first `tf.string` is the row key, the second `tf.string` is
+the latest data in cell `cfa:c1`, the third corresponds to `cfa:c2`, and the
+final one is `cfb:c3`.
+
+#### Determinism when scanning
+
+While the non-parallel scan operations are fully deterministic, the parallel
+scan operations are not. If you would like to scan in parallel without losing
+determinism, you can build up the `parallel_interleave` yourself. As an example,
+say we wanted to scan all rows between `training_data_00000`, and
+`training_data_90000`, we can use the following code snippet:
+
+```python
+table = # ...
+columns = [('cf1', 'col1'), ('cf1', 'col2')]
+NUM_PARALLEL_READS = # ...
+ds = tf.data.Dataset.range(9).shuffle(10)
+def interleave_fn(index):
+  # Given a starting index, create 2 strings to be the start and end
+  start_idx = index
+  end_idx = index + 1
+  start_idx_str = tf.as_string(start_idx * 10000, width=5, fill='0')
+  end_idx_str = tf.as_string(end_idx * 10000, width=5, fill='0')
+  start = tf.string_join(['training_data_', start_idx_str])
+  end = tf.string_join(['training_data_', end_idx_str])
+  return table.scan_range(start_idx, end_idx, columns=columns)
+ds = ds.apply(tf.contrib.data.parallel_interleave(
+    interleave_fn, cycle_length=NUM_PARALLEL_READS, prefetch_input_elements=1))
+```
+
+> Note: you should divide up the key range into more sub-ranges for increased
+> parallelism.
+
+## Writing to Cloud Bigtable
+
+In order to simplify getting started, this package provides basic support for
+writing data into Cloud Bigtable.
+
+> Note: The implementation is not optimized for performance! Please consider
+> using alternative frameworks such as Apache Beam / Cloud Dataflow for
+> production workloads.
+
+Below is an example for how to write a trivial dataset into Cloud Bigtable.
+
+```python
+import tensorflow as tf
+
+GCP_PROJECT_ID = '<FILL_ME_IN>'
+BIGTABLE_INSTANCE_ID = '<FILL_ME_IN>'
+BIGTABLE_TABLE_NAME = '<FILL_ME_IN>'
+COLUMN_FAMILY = '<FILL_ME_IN>'
+COLUMN_QUALIFIER = '<FILL_ME_IN>'
+
+def make_dataset():
+  """Makes a dataset to write to Cloud Bigtable."""
+  return tf.data.Dataset.from_tensor_slices([
+      'training_data_1',
+      'training_data_2',
+      'training_data_3',
+  ])
+
+def make_row_key_dataset():
+  """Makes a dataset of strings used for row keys.
+
+  The strings are of the form: `fake-data-` followed by a sequential counter.
+  For example, this dataset would contain the following elements:
+
+   - fake-data-00000001
+   - fake-data-00000002
+   - ...
+   - fake-data-23498103
+  """
+  counter_dataset = tf.contrib.data.Counter()
+  width = 8
+  row_key_prefix = 'fake-data-'
+  ds = counter_dataset.map(lambda index: tf.as_string(index,
+                                                      width=width,
+                                                      fill='0'))
+  ds = ds.map(lambda idx_str: tf.string_join([row_key_prefix, idx_str]))
+  return ds
+
+
+def main():
+  client = tf.contrib.cloud.BigtableClient(GCP_PROJECT_ID, BIGTABLE_INSTANCE_ID)
+  table = client.table(BIGTABLE_TABLE_NAME)
+  dataset = make_dataset()
+  index_dataset = make_row_key_dataset()
+  aggregate_dataset = tf.data.Dataset.zip((index_dataset, dataset))
+  write_op = table.write(aggregate_dataset, column_families=[COLUMN_FAMILY],
+                         columns=[COLUMN_QUALIFIER])
+
+  with tf.Session() as sess:
+    print('Starting transfer.')
+    sess.run(write_op)
+    print('Transfer complete.')
+
+if __name__ == '__main__':
+  main()
+```
+
+## Sample applications and architectures
+
+While most machine learning applications are well suited by a high performance
+distributed file system, there are certain applications where using Cloud
+Bigtable works extremely well.
+
+### Perfect Shuffling
+
+Normally, training data is stored in flat files, and a combination of
+(1) `tf.data.Dataset.interleave` (or `parallel_interleave`), (2)
+`tf.data.Dataset.shuffle`, and (3) writing the data in an unsorted order in the
+data files in the first place, provides enough randomization to ensure models
+train efficiently. However, if you would like perfect shuffling, you can use
+Cloud Bigtable's low-latency random access capabilities. Create a
+`tf.data.Dataset` that generates the keys in a perfectly random order (or read
+all the keys into memory and use a shuffle buffer sized to fit all of them for a
+perfect random shuffle using `tf.data.Dataset.shuffle`), and then use
+`lookup_columns` to retrieve the training data.
+
+### Distributed Reinforcement Learning
+
+Sophisticated reinforcement learning algorithms are commonly trained across a
+distributed cluster. (See [IMPALA by DeepMind][impala].) One part of the cluster
+runs self-play, while the other part of the cluster learns a new version of the
+model based on the training data generated by self-play. The new model version
+is then distributed to the self-play half of the cluster, and new training data
+is generated to continue the cycle.
+
+In such a configuration, because there is value in training on the freshest
+examples, a storage service like Cloud Bigtable can be used to store and
+serve the generated training data. When using Cloud Bigtable, there is no need
+to aggregate the examples into large batch files, but the examples can instead
+be written as soon as they are generated, and then retrieved at high speed.
+
+[impala]: https://arxiv.org/abs/1802.01561
+
+## Common Gotchas!
+
+### gRPC Certificates
+
+If you encounter a log line that includes the following:
+
+```
+"description":"Failed to load file", [...],
+"filename":"/usr/share/grpc/roots.pem"
+```
+
+you likely need to copy the [gRPC `roots.pem` file][grpcPem] to
+`/usr/share/grpc/roots.pem` on your local machine.
+
+[grpcPem]: https://github.com/grpc/grpc/blob/master/etc/roots.pem
+
+### Permission denied errors
+
+The TensorFlow Cloud Bigtable client will search for credentials to use in the
+process's environment. It will use the first credentials it finds if multiple
+are available.
+
+ - **Compute Engine**: When running on Compute Engine, the client will often use
+   the service account from the virtual machine's metadata service. Be sure to
+   authorize your Compute Engine VM to have access to the Cloud Bigtable service
+   when creating your VM, or [update the VM's scopes][update-vm-scopes] on a
+   running VM if you run into this issue.
+ - **Cloud TPU**: Your Cloud TPUs run with the designated Cloud TPU service
+   account dedicated to your GCP project. Ensure the service account has been
+   authorized via the Cloud Console to access your Cloud Bigtable instances.
+
+[update-vm-scopes]: https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#changeserviceaccountandscopes
diff --git a/tensorflow/contrib/bigtable/__init__.py b/tensorflow/contrib/bigtable/__init__.py
index 7df054637c..b7d89c9842 100644
--- a/tensorflow/contrib/bigtable/__init__.py
+++ b/tensorflow/contrib/bigtable/__init__.py
@@ -18,7 +18,7 @@ This contrib package allows TensorFlow to interface directly with Cloud Bigtable
 for high-speed data loading.
 
 @@BigtableClient
-@@BigTable
+@@BigtableTable
 
 """
 
@@ -26,14 +26,14 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.bigtable.python.ops.bigtable_api import BigTable
 from tensorflow.contrib.bigtable.python.ops.bigtable_api import BigtableClient
+from tensorflow.contrib.bigtable.python.ops.bigtable_api import BigtableTable
 
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
-    'BigTable',
     'BigtableClient',
+    'BigtableTable',
 ]
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_kernels.cc b/tensorflow/contrib/bigtable/kernels/bigtable_kernels.cc
index 8a7309e870..a6755a3496 100644
--- a/tensorflow/contrib/bigtable/kernels/bigtable_kernels.cc
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_kernels.cc
@@ -31,6 +31,26 @@ class BigtableClientOp : public OpKernel {
                 errors::InvalidArgument("project_id must be non-empty"));
     OP_REQUIRES(ctx, !instance_id_.empty(),
                 errors::InvalidArgument("instance_id must be non-empty"));
+
+    OP_REQUIRES_OK(
+        ctx, ctx->GetAttr("connection_pool_size", &connection_pool_size_));
+    // If left unset by the client code, set it to a default of 100. Note: the
+    // cloud-cpp default of 4 concurrent connections is far too low for high
+    // performance streaming.
+    if (connection_pool_size_ == -1) {
+      connection_pool_size_ = 100;
+    }
+
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("max_receive_message_size",
+                                     &max_receive_message_size_));
+    // If left unset by the client code, set it to a default of 100. Note: the
+    // cloud-cpp default of 4 concurrent connections is far too low for high
+    // performance streaming.
+    if (max_receive_message_size_ == -1) {
+      max_receive_message_size_ = 1 << 24;  // 16 MBytes
+    }
+    OP_REQUIRES(ctx, max_receive_message_size_ > 0,
+                errors::InvalidArgument("connection_pool_size must be > 0"));
   }
 
   ~BigtableClientOp() override {
@@ -56,10 +76,18 @@ class BigtableClientOp : public OpKernel {
               cinfo_.container(), cinfo_.name(), &resource,
               [this, ctx](
                   BigtableClientResource** ret) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+                auto client_options =
+                    google::cloud::bigtable::ClientOptions()
+                        .set_connection_pool_size(connection_pool_size_)
+                        .set_data_endpoint("batch-bigtable.googleapis.com");
+                auto channel_args = client_options.channel_arguments();
+                channel_args.SetMaxReceiveMessageSize(
+                    max_receive_message_size_);
+                channel_args.SetUserAgentPrefix("tensorflow");
+                client_options.set_channel_arguments(channel_args);
                 std::shared_ptr<google::cloud::bigtable::DataClient> client =
                     google::cloud::bigtable::CreateDefaultDataClient(
-                        project_id_, instance_id_,
-                        google::cloud::bigtable::ClientOptions());
+                        project_id_, instance_id_, std::move(client_options));
                 *ret = new BigtableClientResource(project_id_, instance_id_,
                                                   std::move(client));
                 return Status::OK();
@@ -75,6 +103,8 @@ class BigtableClientOp : public OpKernel {
  private:
   string project_id_;
   string instance_id_;
+  int64 connection_pool_size_;
+  int32 max_receive_message_size_;
 
   mutex mu_;
   ContainerInfo cinfo_ GUARDED_BY(mu_);
@@ -200,7 +230,6 @@ class ToBigtableOp : public AsyncOpKernel {
       OP_REQUIRES_ASYNC(ctx, timestamp_int >= -1,
                         errors::InvalidArgument("timestamp must be >= -1"),
                         done);
-      std::chrono::milliseconds timestamp(timestamp_int);
 
       BigtableTableResource* resource;
       OP_REQUIRES_OK_ASYNC(
@@ -221,7 +250,7 @@ class ToBigtableOp : public AsyncOpKernel {
             OP_REQUIRES_OK_ASYNC(
                 ctx,
                 CreateMutation(std::move(components), column_families, columns,
-                               timestamp, &mutation),
+                               timestamp_int, &mutation),
                 done);
           }
           components.clear();
@@ -229,6 +258,12 @@ class ToBigtableOp : public AsyncOpKernel {
         grpc::Status mutation_status;
         std::vector<::google::cloud::bigtable::FailedMutation> failures =
             resource->table().BulkApply(std::move(mutation), mutation_status);
+        if (!mutation_status.ok()) {
+          LOG(ERROR) << "Failure applying mutation: "
+                     << mutation_status.error_code() << " - "
+                     << mutation_status.error_message() << " ("
+                     << mutation_status.error_details() << ").";
+        }
         if (!failures.empty()) {
           for (const auto& failure : failures) {
             LOG(ERROR) << "Failure applying mutation on row ("
@@ -241,7 +276,7 @@ class ToBigtableOp : public AsyncOpKernel {
         }
         OP_REQUIRES_ASYNC(
             ctx, failures.empty() && mutation_status.ok(),
-            errors::Unknown("Failure while writing to BigTable: ",
+            errors::Unknown("Failure while writing to Cloud Bigtable: ",
                             mutation_status.error_code(), " - ",
                             mutation_status.error_message(), " (",
                             mutation_status.error_details(),
@@ -270,7 +305,7 @@ class ToBigtableOp : public AsyncOpKernel {
 
   Status CreateMutation(
       std::vector<Tensor> tensors, const std::vector<string>& column_families,
-      const std::vector<string>& columns, std::chrono::milliseconds timestamp,
+      const std::vector<string>& columns, int64 timestamp_int,
       ::google::cloud::bigtable::BulkMutation* bulk_mutation) {
     if (tensors.size() != column_families.size() + 1) {
       return errors::InvalidArgument(
@@ -278,13 +313,20 @@ class ToBigtableOp : public AsyncOpKernel {
     }
     ::google::cloud::bigtable::SingleRowMutation mutation(
         std::move(tensors[0].scalar<string>()()));
+    std::chrono::milliseconds timestamp(timestamp_int);
     for (size_t i = 1; i < tensors.size(); ++i) {
       if (!TensorShapeUtils::IsScalar(tensors[i].shape())) {
         return errors::Internal("Output tensor ", i, " was not a scalar");
       }
-      mutation.emplace_back(::google::cloud::bigtable::SetCell(
-          column_families[i - 1], columns[i - 1], timestamp,
-          std::move(tensors[i].scalar<string>()())));
+      if (timestamp_int == -1) {
+        mutation.emplace_back(::google::cloud::bigtable::SetCell(
+            column_families[i - 1], columns[i - 1],
+            std::move(tensors[i].scalar<string>()())));
+      } else {
+        mutation.emplace_back(::google::cloud::bigtable::SetCell(
+            column_families[i - 1], columns[i - 1], timestamp,
+            std::move(tensors[i].scalar<string>()())));
+      }
     }
     bulk_mutation->emplace_back(std::move(mutation));
     return Status::OK();
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_lib.cc b/tensorflow/contrib/bigtable/kernels/bigtable_lib.cc
index 2514575f30..67bf14c176 100644
--- a/tensorflow/contrib/bigtable/kernels/bigtable_lib.cc
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_lib.cc
@@ -27,10 +27,10 @@ Status GrpcStatusToTfStatus(const ::grpc::Status& status) {
       status.error_code() == ::grpc::StatusCode::OUT_OF_RANGE) {
     grpc_code = ::grpc::StatusCode::INTERNAL;
   }
-  return Status(
-      static_cast<::tensorflow::error::Code>(status.error_code()),
-      strings::StrCat("Error reading from BigTable: ", status.error_message(),
-                      " (Details: ", status.error_details(), ")"));
+  return Status(static_cast<::tensorflow::error::Code>(status.error_code()),
+                strings::StrCat("Error reading from Cloud Bigtable: ",
+                                status.error_message(),
+                                " (Details: ", status.error_details(), ")"));
 }
 
 string RegexFromStringSet(const std::vector<string>& strs) {
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_lib.h b/tensorflow/contrib/bigtable/kernels/bigtable_lib.h
index 12d8256dea..a2a5df1037 100644
--- a/tensorflow/contrib/bigtable/kernels/bigtable_lib.h
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_lib.h
@@ -58,7 +58,8 @@ class BigtableTableResource : public ResourceBase {
   BigtableTableResource(BigtableClientResource* client, string table_name)
       : client_(client),
         table_name_(std::move(table_name)),
-        table_(client->get_client(), table_name_) {
+        table_(client->get_client(), table_name_,
+               google::cloud::bigtable::AlwaysRetryMutationPolicy()) {
     client_->Ref();
   }
 
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.cc b/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.cc
new file mode 100644
index 0000000000..51965f6214
--- /dev/null
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.cc
@@ -0,0 +1,68 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.h"
+
+#include "tensorflow/core/platform/logging.h"
+
+namespace tensorflow {
+
+namespace {
+
+string MakePrefixEndKey(const string& prefix) {
+  string end = prefix;
+  while (true) {
+    if (end.empty()) {
+      return end;
+    }
+    ++end[end.size() - 1];
+    if (end[end.size() - 1] == 0) {
+      // Handle wraparound case.
+      end = end.substr(0, end.size() - 1);
+    } else {
+      return end;
+    }
+  }
+}
+
+}  // namespace
+
+/* static */ MultiModeKeyRange MultiModeKeyRange::FromPrefix(string prefix) {
+  string end = MakePrefixEndKey(prefix);
+  VLOG(1) << "Creating MultiModeKeyRange from Prefix: " << prefix
+          << ", with end key: " << end;
+  return MultiModeKeyRange(std::move(prefix), std::move(end));
+}
+
+/* static */ MultiModeKeyRange MultiModeKeyRange::FromRange(string begin,
+                                                            string end) {
+  return MultiModeKeyRange(std::move(begin), std::move(end));
+}
+
+const string& MultiModeKeyRange::begin_key() const { return begin_; }
+
+const string& MultiModeKeyRange::end_key() const { return end_; }
+
+bool MultiModeKeyRange::contains_key(StringPiece key) const {
+  if (StringPiece(begin_) > key) {
+    return false;
+  }
+  if (StringPiece(end_) <= key && !end_.empty()) {
+    return false;
+  }
+  return true;
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.h b/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.h
new file mode 100644
index 0000000000..44c628e366
--- /dev/null
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.h
@@ -0,0 +1,67 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_BIGTABLE_KERNELS_BIGTABLE_RANGE_HELPERS_H_
+#define TENSORFLOW_CONTRIB_BIGTABLE_KERNELS_BIGTABLE_RANGE_HELPERS_H_
+
+#include <string>
+
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+// Represents a continuous range of keys defined by either a prefix or a range.
+//
+// Ranges are represented as "half-open", where the beginning key is included
+// in the range, and the end_key is the first excluded key after the range.
+//
+// The range of keys can be specified either by a key prefix, or by an explicit
+// begin key and end key. All methods on this class are valid no matter which
+// way the range was specified.
+//
+// Example:
+//   MultiModeKeyRange range = MultiModeKeyRange::FromPrefix("myPrefix");
+//   if (range.contains_key("myPrefixedKey")) {
+//     LOG(INFO) << "range from " << range.begin_key() << " to "
+//               << range.end_key() << "contains \"myPrefixedKey\"";
+//   }
+//   if (!range.contains_key("randomKey")) {
+//     LOG(INFO) << "range does not contain \"randomKey\"";
+//   }
+//   range = MultiModeKeyRange::FromRange("a_start_key", "z_end_key");
+class MultiModeKeyRange {
+ public:
+  static MultiModeKeyRange FromPrefix(string prefix);
+  static MultiModeKeyRange FromRange(string begin, string end);
+
+  // The first valid key in the range.
+  const string& begin_key() const;
+  // The first invalid key after the valid range.
+  const string& end_key() const;
+  // Returns true if the provided key is a part of the range, false otherwise.
+  bool contains_key(StringPiece key) const;
+
+ private:
+  MultiModeKeyRange(string begin, string end)
+      : begin_(std::move(begin)), end_(std::move(end)) {}
+
+  const string begin_;
+  const string end_;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CONTRIB_BIGTABLE_KERNELS_BIGTABLE_RANGE_HELPERS_H_
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers_test.cc b/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers_test.cc
new file mode 100644
index 0000000000..1bfc547271
--- /dev/null
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_range_helpers_test.cc
@@ -0,0 +1,107 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace {
+
+TEST(MultiModeKeyRangeTest, SimplePrefix) {
+  MultiModeKeyRange r = MultiModeKeyRange::FromPrefix("prefix");
+  EXPECT_EQ("prefix", r.begin_key());
+  EXPECT_EQ("prefiy", r.end_key());
+  EXPECT_TRUE(r.contains_key("prefixed_key"));
+  EXPECT_FALSE(r.contains_key("not-prefixed-key"));
+  EXPECT_FALSE(r.contains_key("prefi"));
+  EXPECT_FALSE(r.contains_key("prefiy"));
+  EXPECT_FALSE(r.contains_key("early"));
+  EXPECT_FALSE(r.contains_key(""));
+}
+
+TEST(MultiModeKeyRangeTest, Range) {
+  MultiModeKeyRange r = MultiModeKeyRange::FromRange("a", "b");
+  EXPECT_EQ("a", r.begin_key());
+  EXPECT_EQ("b", r.end_key());
+  EXPECT_TRUE(r.contains_key("a"));
+  EXPECT_TRUE(r.contains_key("ab"));
+  EXPECT_FALSE(r.contains_key("b"));
+  EXPECT_FALSE(r.contains_key("bc"));
+  EXPECT_FALSE(r.contains_key("A"));
+  EXPECT_FALSE(r.contains_key("B"));
+  EXPECT_FALSE(r.contains_key(""));
+}
+
+TEST(MultiModeKeyRangeTest, InvertedRange) {
+  MultiModeKeyRange r = MultiModeKeyRange::FromRange("b", "a");
+  EXPECT_FALSE(r.contains_key("a"));
+  EXPECT_FALSE(r.contains_key("b"));
+  EXPECT_FALSE(r.contains_key(""));
+}
+
+TEST(MultiModeKeyRangeTest, EmptyPrefix) {
+  MultiModeKeyRange r = MultiModeKeyRange::FromPrefix("");
+  EXPECT_EQ("", r.begin_key());
+  EXPECT_EQ("", r.end_key());
+  EXPECT_TRUE(r.contains_key(""));
+  EXPECT_TRUE(r.contains_key("a"));
+  EXPECT_TRUE(r.contains_key("z"));
+  EXPECT_TRUE(r.contains_key("A"));
+  EXPECT_TRUE(r.contains_key("ZZZZZZ"));
+}
+
+TEST(MultiModeKeyRangeTest, HalfRange) {
+  MultiModeKeyRange r = MultiModeKeyRange::FromRange("start", "");
+  EXPECT_EQ("start", r.begin_key());
+  EXPECT_EQ("", r.end_key());
+  EXPECT_TRUE(r.contains_key("start"));
+  EXPECT_TRUE(r.contains_key("starting"));
+  EXPECT_TRUE(r.contains_key("z-end"));
+  EXPECT_FALSE(r.contains_key(""));
+  EXPECT_FALSE(r.contains_key("early"));
+}
+
+TEST(MultiModeKeyRangeTest, PrefixWrapAround) {
+  string prefix = "abc\xff";
+  MultiModeKeyRange r = MultiModeKeyRange::FromPrefix(prefix);
+  EXPECT_EQ(prefix, r.begin_key());
+  EXPECT_EQ("abd", r.end_key());
+
+  EXPECT_TRUE(r.contains_key("abc\xff\x07"));
+  EXPECT_TRUE(r.contains_key("abc\xff\x15"));
+  EXPECT_TRUE(r.contains_key("abc\xff\x61"));
+  EXPECT_TRUE(r.contains_key("abc\xff\xff"));
+  EXPECT_FALSE(r.contains_key("abc\0"));
+  EXPECT_FALSE(r.contains_key("abd"));
+}
+
+TEST(MultiModeKeyRangeTest, PrefixSignedWrapAround) {
+  string prefix = "abc\x7f";
+  MultiModeKeyRange r = MultiModeKeyRange::FromPrefix(prefix);
+  EXPECT_EQ(prefix, r.begin_key());
+  EXPECT_EQ("abc\x80", r.end_key());
+
+  EXPECT_TRUE(r.contains_key("abc\x7f\x07"));
+  EXPECT_TRUE(r.contains_key("abc\x7f\x15"));
+  EXPECT_TRUE(r.contains_key("abc\x7f\x61"));
+  EXPECT_TRUE(r.contains_key("abc\x7f\xff"));
+  EXPECT_FALSE(r.contains_key("abc\0"));
+  EXPECT_FALSE(r.contains_key("abc\x01"));
+  EXPECT_FALSE(r.contains_key("abd"));
+  EXPECT_FALSE(r.contains_key("ab\x80"));
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_sample_key_pairs_dataset_op.cc b/tensorflow/contrib/bigtable/kernels/bigtable_sample_key_pairs_dataset_op.cc
new file mode 100644
index 0000000000..a1a63a975a
--- /dev/null
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_sample_key_pairs_dataset_op.cc
@@ -0,0 +1,200 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/bigtable/kernels/bigtable_lib.h"
+#include "tensorflow/contrib/bigtable/kernels/bigtable_range_helpers.h"
+#include "tensorflow/core/framework/op_kernel.h"
+
+namespace tensorflow {
+namespace {
+
+class BigtableSampleKeyPairsDatasetOp : public DatasetOpKernel {
+ public:
+  using DatasetOpKernel::DatasetOpKernel;
+
+  void MakeDataset(OpKernelContext* ctx, DatasetBase** output) override {
+    string prefix;
+    OP_REQUIRES_OK(ctx, ParseScalarArgument<string>(ctx, "prefix", &prefix));
+
+    string start_key;
+    OP_REQUIRES_OK(ctx,
+                   ParseScalarArgument<string>(ctx, "start_key", &start_key));
+    string end_key;
+    OP_REQUIRES_OK(ctx, ParseScalarArgument<string>(ctx, "end_key", &end_key));
+
+    BigtableTableResource* resource;
+    OP_REQUIRES_OK(ctx,
+                   LookupResource(ctx, HandleFromInput(ctx, 0), &resource));
+
+    OP_REQUIRES(ctx, prefix.empty() || start_key.empty(),
+                errors::InvalidArgument(
+                    "Only one of prefix and start_key can be provided"));
+    if (!prefix.empty()) {
+      OP_REQUIRES(ctx, end_key.empty(),
+                  errors::InvalidArgument(
+                      "If prefix is specified, end_key must be empty."));
+    }
+
+    *output = new Dataset(ctx, resource, std::move(prefix),
+                          std::move(start_key), std::move(end_key));
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    explicit Dataset(OpKernelContext* ctx, BigtableTableResource* table,
+                     string prefix, string start_key, string end_key)
+        : GraphDatasetBase(ctx),
+          table_(table),
+          key_range_(MakeMultiModeKeyRange(
+              std::move(prefix), std::move(start_key), std::move(end_key))) {
+      table_->Ref();
+    }
+
+    ~Dataset() override { table_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(new Iterator(
+          {this, strings::StrCat(prefix, "::BigtableSampleKeyPairsDataset")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      static DataTypeVector* dtypes =
+          new DataTypeVector({DT_STRING, DT_STRING});
+      return *dtypes;
+    }
+
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      static std::vector<PartialTensorShape>* shapes =
+          new std::vector<PartialTensorShape>({{}, {}});
+      return *shapes;
+    }
+
+    string DebugString() const override {
+      return "BigtableSampleKeyPairsDatasetOp::Dataset";
+    }
+
+   private:
+    static MultiModeKeyRange MakeMultiModeKeyRange(string prefix,
+                                                   string start_key,
+                                                   string end_key) {
+      if (!start_key.empty()) {
+        return MultiModeKeyRange::FromRange(std::move(start_key),
+                                            std::move(end_key));
+      }
+      return MultiModeKeyRange::FromPrefix(std::move(prefix));
+    }
+
+    BigtableTableResource& table() const { return *table_; }
+
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      // Computes split points (`keys_`) to use when scanning the table.
+      //
+      // Initialize first retrieves the sample keys from the table (`row_keys`),
+      // as these often form good split points within the table. We then iterate
+      // over them, and copy them to `keys_` if they fall within the requested
+      // range to scan (`dataset()->key_range_`). Because the requested range
+      // might start between elements of the sampled keys list, care is taken to
+      // ensure we don't accidentally miss any subsets of the requested range by
+      // including `begin_key()` and `end_key()` as appropriate.
+      Status Initialize(IteratorContext* ctx) override {
+        grpc::Status status;
+        std::vector<google::cloud::bigtable::RowKeySample> row_keys =
+            dataset()->table().table().SampleRows(status);
+        if (!status.ok()) {
+          return GrpcStatusToTfStatus(status);
+        }
+
+        for (size_t i = 0; i < row_keys.size(); ++i) {
+          string row_key(row_keys[i].row_key);
+          if (dataset()->key_range_.contains_key(row_key)) {
+            // First key: check to see if we need to add the begin_key.
+            if (keys_.empty() && dataset()->key_range_.begin_key() != row_key) {
+              keys_.push_back(dataset()->key_range_.begin_key());
+            }
+            keys_.push_back(std::move(row_key));
+          } else if (!keys_.empty()) {
+            // If !keys_.empty(), then we have found at least one element of
+            // `row_keys` that is within our requested range
+            // (`dataset()->key_range_`). Because `row_keys` is sorted, if we
+            // have found an element that's not within our key range, then we
+            // are after our requested range (ranges are contiguous) and can end
+            // iteration early.
+            break;
+          }
+        }
+
+        // Handle the case where we skip over the selected range entirely.
+        if (keys_.empty()) {
+          keys_.push_back(dataset()->key_range_.begin_key());
+        }
+
+        // Last key: check to see if we need to add the end_key.
+        if (keys_.back() != dataset()->key_range_.end_key()) {
+          keys_.push_back(dataset()->key_range_.end_key());
+        }
+        return Status::OK();
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        mutex_lock l(mu_);
+        if (index_ > keys_.size() - 2) {
+          *end_of_sequence = true;
+          return Status::OK();
+        }
+
+        *end_of_sequence = false;
+        out_tensors->emplace_back(ctx->allocator({}), DT_STRING,
+                                  TensorShape({}));
+        out_tensors->back().scalar<string>()() = keys_[index_];
+
+        out_tensors->emplace_back(ctx->allocator({}), DT_STRING,
+                                  TensorShape({}));
+        out_tensors->back().scalar<string>()() = keys_[index_ + 1];
+        ++index_;
+
+        return Status::OK();
+      }
+
+     private:
+      mutex mu_;
+      size_t index_ GUARDED_BY(mu_) = 0;
+      // Note: we store the keys_ on the iterator instead of the dataset
+      // because we want to re-sample the row keys in case there have been
+      // tablet rebalancing operations since the dataset was created.
+      //
+      // Note: keys_ is readonly after Initialize, and thus does not need a
+      // guarding lock.
+      std::vector<string> keys_;
+    };
+
+    BigtableTableResource* const table_;
+    const MultiModeKeyRange key_range_;
+  };
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("BigtableSampleKeyPairsDataset").Device(DEVICE_CPU),
+    BigtableSampleKeyPairsDatasetOp);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/bigtable/kernels/bigtable_sample_keys_dataset_op.cc b/tensorflow/contrib/bigtable/kernels/bigtable_sample_keys_dataset_op.cc
new file mode 100644
index 0000000000..a5a47cfe2d
--- /dev/null
+++ b/tensorflow/contrib/bigtable/kernels/bigtable_sample_keys_dataset_op.cc
@@ -0,0 +1,113 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/bigtable/kernels/bigtable_lib.h"
+#include "tensorflow/core/framework/op_kernel.h"
+
+namespace tensorflow {
+namespace {
+
+class BigtableSampleKeysDatasetOp : public DatasetOpKernel {
+ public:
+  using DatasetOpKernel::DatasetOpKernel;
+
+  void MakeDataset(OpKernelContext* ctx, DatasetBase** output) override {
+    BigtableTableResource* resource;
+    OP_REQUIRES_OK(ctx,
+                   LookupResource(ctx, HandleFromInput(ctx, 0), &resource));
+    *output = new Dataset(ctx, resource);
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    explicit Dataset(OpKernelContext* ctx, BigtableTableResource* table)
+        : GraphDatasetBase(ctx), table_(table) {
+      table_->Ref();
+    }
+
+    ~Dataset() override { table_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(new Iterator(
+          {this, strings::StrCat(prefix, "::BigtableSampleKeysDataset")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      static DataTypeVector* dtypes = new DataTypeVector({DT_STRING});
+      return *dtypes;
+    }
+
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      static std::vector<PartialTensorShape>* shapes =
+          new std::vector<PartialTensorShape>({{}});
+      return *shapes;
+    }
+
+    string DebugString() const override {
+      return "BigtableRangeKeyDatasetOp::Dataset";
+    }
+
+    BigtableTableResource* table() const { return table_; }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        ::grpc::Status status;
+        row_keys_ = dataset()->table()->table().SampleRows(status);
+        if (!status.ok()) {
+          row_keys_.clear();
+          return GrpcStatusToTfStatus(status);
+        }
+        return Status::OK();
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        mutex_lock l(mu_);
+        if (index_ < row_keys_.size()) {
+          out_tensors->emplace_back(ctx->allocator({}), DT_STRING,
+                                    TensorShape({}));
+          out_tensors->back().scalar<string>()() =
+              string(row_keys_[index_].row_key);
+          *end_of_sequence = false;
+          index_++;
+        } else {
+          *end_of_sequence = true;
+        }
+        return Status::OK();
+      }
+
+     private:
+      mutex mu_;
+      size_t index_ = 0;
+      std::vector<::google::cloud::bigtable::RowKeySample> row_keys_;
+    };
+
+    BigtableTableResource* const table_;
+  };
+};
+
+REGISTER_KERNEL_BUILDER(Name("BigtableSampleKeysDataset").Device(DEVICE_CPU),
+                        BigtableSampleKeysDatasetOp);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client.cc b/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client.cc
index 0f107f169c..f083ce6f44 100644
--- a/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client.cc
+++ b/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client.cc
@@ -28,6 +28,8 @@ namespace {
 void UpdateRow(const ::google::bigtable::v2::Mutation& mut,
                std::map<string, string>* row) {
   if (mut.has_set_cell()) {
+    CHECK(mut.set_cell().timestamp_micros() >= -1)
+        << "Timestamp_micros: " << mut.set_cell().timestamp_micros();
     auto col =
         strings::Printf("%s:%s", mut.set_cell().family_name().c_str(),
                         string(mut.set_cell().column_qualifier()).c_str());
@@ -61,24 +63,29 @@ class SampleRowKeysResponse : public grpc::ClientReaderInterface<
 
   bool NextMessageSize(uint32_t* sz) override {
     mutex_lock l(mu_);
-    if (sent_first_message_) {
-      return false;
+    mutex_lock l2(client_->mu_);
+    if (num_messages_sent_ * 2 < client_->table_.rows.size()) {
+      *sz = 10000;  // A sufficiently high enough value to not worry about.
+      return true;
     }
-    *sz = 10000;  // A sufficiently high enough value to not worry about.
-    return true;
+    return false;
   }
 
   bool Read(google::bigtable::v2::SampleRowKeysResponse* resp) override {
+    // Send every other key from the table.
     mutex_lock l(mu_);
-    if (sent_first_message_) {
-      return false;
-    }
-    sent_first_message_ = true;
-
     mutex_lock l2(client_->mu_);
     *resp = google::bigtable::v2::SampleRowKeysResponse();
-    resp->set_row_key(client_->table_.rows.begin()->first);
-    resp->set_offset_bytes(0);
+    auto itr = client_->table_.rows.begin();
+    for (uint64 i = 0; i < 2 * num_messages_sent_; ++i) {
+      ++itr;
+      if (itr == client_->table_.rows.end()) {
+        return false;
+      }
+    }
+    resp->set_row_key(itr->first);
+    resp->set_offset_bytes(100 * num_messages_sent_);
+    num_messages_sent_++;
     return true;
   }
 
@@ -88,7 +95,7 @@ class SampleRowKeysResponse : public grpc::ClientReaderInterface<
 
  private:
   mutex mu_;
-  bool sent_first_message_ GUARDED_BY(mu_) = false;
+  int64 num_messages_sent_ GUARDED_BY(mu_) = 0;
   BigtableTestClient* client_;  // Not owned.
 };
 
diff --git a/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client_test.cc b/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client_test.cc
index d6b3964719..32611e2590 100644
--- a/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client_test.cc
+++ b/tensorflow/contrib/bigtable/kernels/test_kernels/bigtable_test_client_test.cc
@@ -286,5 +286,60 @@ TEST(BigtableTestClientTest, RowKeys) {
   EXPECT_TRUE(rows.Finish().ok());
 }
 
+TEST(BigtableTestClientTest, SampleKeys) {
+  std::shared_ptr<::google::cloud::bigtable::DataClient> client_ptr =
+      std::make_shared<BigtableTestClient>();
+  ::google::cloud::bigtable::noex::Table table(client_ptr, "test_table");
+
+  WriteCell("r1", "f1", "c1", "v1", &table);
+  WriteCell("r2", "f1", "c1", "v2", &table);
+  WriteCell("r3", "f1", "c1", "v3", &table);
+  WriteCell("r4", "f1", "c1", "v4", &table);
+  WriteCell("r5", "f1", "c1", "v5", &table);
+
+  grpc::Status status;
+  auto resp = table.SampleRows(status);
+  EXPECT_TRUE(status.ok());
+  EXPECT_EQ(3, resp.size());
+  EXPECT_EQ("r1", string(resp[0].row_key));
+  EXPECT_EQ(0, resp[0].offset_bytes);
+  EXPECT_EQ("r3", string(resp[1].row_key));
+  EXPECT_EQ(100, resp[1].offset_bytes);
+  EXPECT_EQ("r5", string(resp[2].row_key));
+  EXPECT_EQ(200, resp[2].offset_bytes);
+}
+
+TEST(BigtableTestClientTest, SampleKeysShort) {
+  std::shared_ptr<::google::cloud::bigtable::DataClient> client_ptr =
+      std::make_shared<BigtableTestClient>();
+  ::google::cloud::bigtable::noex::Table table(client_ptr, "test_table");
+
+  WriteCell("r1", "f1", "c1", "v1", &table);
+
+  grpc::Status status;
+  auto resp = table.SampleRows(status);
+  EXPECT_TRUE(status.ok());
+  EXPECT_EQ(1, resp.size());
+  EXPECT_EQ("r1", string(resp[0].row_key));
+}
+
+TEST(BigtableTestClientTest, SampleKeysEvenNumber) {
+  std::shared_ptr<::google::cloud::bigtable::DataClient> client_ptr =
+      std::make_shared<BigtableTestClient>();
+  ::google::cloud::bigtable::noex::Table table(client_ptr, "test_table");
+
+  WriteCell("r1", "f1", "c1", "v1", &table);
+  WriteCell("r2", "f1", "c1", "v2", &table);
+  WriteCell("r3", "f1", "c1", "v3", &table);
+  WriteCell("r4", "f1", "c1", "v4", &table);
+
+  grpc::Status status;
+  auto resp = table.SampleRows(status);
+  EXPECT_TRUE(status.ok());
+  EXPECT_EQ(2, resp.size());
+  EXPECT_EQ("r1", string(resp[0].row_key));
+  EXPECT_EQ("r3", string(resp[1].row_key));
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/bigtable/ops/bigtable_ops.cc b/tensorflow/contrib/bigtable/ops/bigtable_ops.cc
index 17ecc3dcb2..416b719e30 100644
--- a/tensorflow/contrib/bigtable/ops/bigtable_ops.cc
+++ b/tensorflow/contrib/bigtable/ops/bigtable_ops.cc
@@ -22,6 +22,8 @@ namespace tensorflow {
 REGISTER_OP("BigtableClient")
     .Attr("project_id: string")
     .Attr("instance_id: string")
+    .Attr("connection_pool_size: int")
+    .Attr("max_receive_message_size: int = -1")
     .Attr("container: string = ''")
     .Attr("shared_name: string = ''")
     .Output("client: resource")
@@ -70,6 +72,23 @@ REGISTER_OP("BigtableRangeKeyDataset")
                       // stateful to inhibit constant folding.
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("BigtableSampleKeysDataset")
+    .Input("table: resource")
+    .Output("handle: variant")
+    .SetIsStateful()  // TODO(b/65524810): Source dataset ops must be marked
+                      // stateful to inhibit constant folding.
+    .SetShapeFn(shape_inference::ScalarShape);
+
+REGISTER_OP("BigtableSampleKeyPairsDataset")
+    .Input("table: resource")
+    .Input("prefix: string")
+    .Input("start_key: string")
+    .Input("end_key: string")
+    .Output("handle: variant")
+    .SetIsStateful()  // TODO(b/65524810): Source dataset ops must be marked
+                      // stateful to inhibit constant folding.
+    .SetShapeFn(shape_inference::ScalarShape);
+
 // TODO(saeta): Support continuing despite bad data (e.g. empty string, or
 // skip incomplete row.)
 REGISTER_OP("BigtableScanDataset")
diff --git a/tensorflow/contrib/bigtable/python/kernel_tests/bigtable_ops_test.py b/tensorflow/contrib/bigtable/python/kernel_tests/bigtable_ops_test.py
index d33a66f2df..e36f7f32c6 100644
--- a/tensorflow/contrib/bigtable/python/kernel_tests/bigtable_ops_test.py
+++ b/tensorflow/contrib/bigtable/python/kernel_tests/bigtable_ops_test.py
@@ -21,8 +21,10 @@ from __future__ import print_function
 from tensorflow.contrib import bigtable
 from tensorflow.contrib.bigtable.ops import gen_bigtable_ops
 from tensorflow.contrib.bigtable.ops import gen_bigtable_test_ops
+from tensorflow.contrib.bigtable.python.ops import bigtable_api
 from tensorflow.contrib.util import loader
 from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import errors
 from tensorflow.python.platform import resource_loader
 from tensorflow.python.platform import test
 from tensorflow.python.util import compat
@@ -31,6 +33,10 @@ _bigtable_so = loader.load_op_library(
     resource_loader.get_path_to_datafile("_bigtable_test.so"))
 
 
+def _ListOfTuplesOfStringsToBytes(values):
+  return [(compat.as_bytes(i[0]), compat.as_bytes(i[1])) for i in values]
+
+
 class BigtableOpsTest(test.TestCase):
   COMMON_ROW_KEYS = ["r1", "r2", "r3"]
   COMMON_VALUES = ["v1", "v2", "v3"]
@@ -38,7 +44,7 @@ class BigtableOpsTest(test.TestCase):
   def setUp(self):
     self._client = gen_bigtable_test_ops.bigtable_test_client()
     table = gen_bigtable_ops.bigtable_table(self._client, "testtable")
-    self._table = bigtable.BigTable("testtable", None, table)
+    self._table = bigtable.BigtableTable("testtable", None, table)
 
   def _makeSimpleDataset(self):
     output_rows = dataset_ops.Dataset.from_tensor_slices(self.COMMON_ROW_KEYS)
@@ -99,12 +105,18 @@ class BigtableOpsTest(test.TestCase):
   def testScanPrefixListCol(self):
     self.runScanTest(self._table.scan_prefix("r", cf1=["c1"]))
 
+  def testScanPrefixTupleCol(self):
+    self.runScanTest(self._table.scan_prefix("r", columns=("cf1", "c1")))
+
   def testScanRangeStringCol(self):
     self.runScanTest(self._table.scan_range("r1", "r4", cf1="c1"))
 
   def testScanRangeListCol(self):
     self.runScanTest(self._table.scan_range("r1", "r4", cf1=["c1"]))
 
+  def testScanRangeTupleCol(self):
+    self.runScanTest(self._table.scan_range("r1", "r4", columns=("cf1", "c1")))
+
   def testLookup(self):
     ds = self._table.keys_by_prefix_dataset("r")
     ds = ds.apply(self._table.lookup_columns(cf1="c1"))
@@ -127,6 +139,134 @@ class BigtableOpsTest(test.TestCase):
             "Unequal values at step %d: want: %s, got: %s" %
             (i, compat.as_bytes(elem[1]), compat.as_bytes(output[1])))
 
+  def testSampleKeys(self):
+    ds = self._table.sample_keys()
+    itr = ds.make_initializable_iterator()
+    n = itr.get_next()
+    expected_key = self.COMMON_ROW_KEYS[0]
+    with self.test_session() as sess:
+      self._writeCommonValues(sess)
+      sess.run(itr.initializer)
+      output = sess.run(n)
+      self.assertEqual(
+          compat.as_bytes(self.COMMON_ROW_KEYS[0]), compat.as_bytes(output),
+          "Unequal keys: want: %s, got: %s" % (compat.as_bytes(
+              self.COMMON_ROW_KEYS[0]), compat.as_bytes(output)))
+      output = sess.run(n)
+      self.assertEqual(
+          compat.as_bytes(self.COMMON_ROW_KEYS[2]), compat.as_bytes(output),
+          "Unequal keys: want: %s, got: %s" % (compat.as_bytes(
+              self.COMMON_ROW_KEYS[2]), compat.as_bytes(output)))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(n)
+
+  def runSampleKeyPairsTest(self, ds, expected_key_pairs):
+    itr = ds.make_initializable_iterator()
+    n = itr.get_next()
+    with self.test_session() as sess:
+      self._writeCommonValues(sess)
+      sess.run(itr.initializer)
+      for i, elems in enumerate(expected_key_pairs):
+        output = sess.run(n)
+        self.assertEqual(
+            compat.as_bytes(elems[0]), compat.as_bytes(output[0]),
+            "Unequal key pair (first element) at step %d; want: %s, got %s" %
+            (i, compat.as_bytes(elems[0]), compat.as_bytes(output[0])))
+        self.assertEqual(
+            compat.as_bytes(elems[1]), compat.as_bytes(output[1]),
+            "Unequal key pair (second element) at step %d; want: %s, got %s" %
+            (i, compat.as_bytes(elems[1]), compat.as_bytes(output[1])))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(n)
+
+  def testSampleKeyPairsSimplePrefix(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="r", start="", end="")
+    expected_key_pairs = [("r", "r1"), ("r1", "r3"), ("r3", "s")]
+    self.runSampleKeyPairsTest(ds, expected_key_pairs)
+
+  def testSampleKeyPairsSimpleRange(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="", start="r1", end="r3")
+    expected_key_pairs = [("r1", "r3")]
+    self.runSampleKeyPairsTest(ds, expected_key_pairs)
+
+  def testSampleKeyPairsSkipRangePrefix(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="r2", start="", end="")
+    expected_key_pairs = [("r2", "r3")]
+    self.runSampleKeyPairsTest(ds, expected_key_pairs)
+
+  def testSampleKeyPairsSkipRangeRange(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="", start="r2", end="r3")
+    expected_key_pairs = [("r2", "r3")]
+    self.runSampleKeyPairsTest(ds, expected_key_pairs)
+
+  def testSampleKeyPairsOffsetRanges(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="", start="r2", end="r4")
+    expected_key_pairs = [("r2", "r3"), ("r3", "r4")]
+    self.runSampleKeyPairsTest(ds, expected_key_pairs)
+
+  def testSampleKeyPairEverything(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="", start="", end="")
+    expected_key_pairs = [("", "r1"), ("r1", "r3"), ("r3", "")]
+    self.runSampleKeyPairsTest(ds, expected_key_pairs)
+
+  def testSampleKeyPairsPrefixAndStartKey(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="r", start="r1", end="")
+    itr = ds.make_initializable_iterator()
+    with self.test_session() as sess:
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(itr.initializer)
+
+  def testSampleKeyPairsPrefixAndEndKey(self):
+    ds = bigtable_api._BigtableSampleKeyPairsDataset(
+        self._table, prefix="r", start="", end="r3")
+    itr = ds.make_initializable_iterator()
+    with self.test_session() as sess:
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(itr.initializer)
+
+  def testParallelScanPrefix(self):
+    ds = self._table.parallel_scan_prefix(prefix="r", cf1="c1")
+    itr = ds.make_initializable_iterator()
+    n = itr.get_next()
+    with self.test_session() as sess:
+      self._writeCommonValues(sess)
+      sess.run(itr.initializer)
+      expected_values = list(zip(self.COMMON_ROW_KEYS, self.COMMON_VALUES))
+      actual_values = []
+      for _ in range(len(expected_values)):
+        output = sess.run(n)
+        actual_values.append(output)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(n)
+      self.assertItemsEqual(
+          _ListOfTuplesOfStringsToBytes(expected_values),
+          _ListOfTuplesOfStringsToBytes(actual_values))
+
+  def testParallelScanRange(self):
+    ds = self._table.parallel_scan_range(start="r1", end="r4", cf1="c1")
+    itr = ds.make_initializable_iterator()
+    n = itr.get_next()
+    with self.test_session() as sess:
+      self._writeCommonValues(sess)
+      sess.run(itr.initializer)
+      expected_values = list(zip(self.COMMON_ROW_KEYS, self.COMMON_VALUES))
+      actual_values = []
+      for _ in range(len(expected_values)):
+        output = sess.run(n)
+        actual_values.append(output)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(n)
+      self.assertItemsEqual(
+          _ListOfTuplesOfStringsToBytes(expected_values),
+          _ListOfTuplesOfStringsToBytes(actual_values))
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/bigtable/python/ops/bigtable_api.py b/tensorflow/contrib/bigtable/python/ops/bigtable_api.py
index a54e020ed7..e6ef513c40 100644
--- a/tensorflow/contrib/bigtable/python/ops/bigtable_api.py
+++ b/tensorflow/contrib/bigtable/python/ops/bigtable_api.py
@@ -12,15 +12,15 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""The Python API for TensorFlow's Bigtable integration.
+"""The Python API for TensorFlow's Cloud Bigtable integration.
 
 TensorFlow has support for reading from and writing to Cloud Bigtable. To use
-the Bigtable TensorFlow integration, first create a BigtableClient (which
-configures your connection to Cloud Bigtable), and then open a Table. The Table
-object then allows you to create numerous @{tf.data.Dataset}s to read data, or
-write a @{tf.data.Dataset} object to the underlying Bigtable Table.
+TensorFlow + Cloud Bigtable integration, first create a BigtableClient to
+configure your connection to Cloud Bigtable, and then create a BigtableTable
+object to allow you to create numerous @{tf.data.Dataset}s to read data, or
+write a @{tf.data.Dataset} object to the underlying Cloud Bigtable table.
 
-For background on Google Cloud Bigtable, see: https://cloud.google.com/bigtable.
+For background on Cloud Bigtable, see: https://cloud.google.com/bigtable .
 """
 
 from __future__ import absolute_import
@@ -28,8 +28,10 @@ from __future__ import division
 from __future__ import print_function
 
 from six import iteritems
+from six import string_types
 
 from tensorflow.contrib.bigtable.ops import gen_bigtable_ops
+from tensorflow.contrib.data.python.ops import interleave_ops
 from tensorflow.contrib.util import loader
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
@@ -46,22 +48,53 @@ class BigtableClient(object):
   """BigtableClient is the entrypoint for interacting with Cloud Bigtable in TF.
 
   BigtableClient encapsulates a connection to Cloud Bigtable, and exposes the
-  `table` method to open a Bigtable Table.
+  `table` method to open a Bigtable table.
   """
 
-  def __init__(self, project_id, instance_id):
+  def __init__(self,
+               project_id,
+               instance_id,
+               connection_pool_size=None,
+               max_receive_message_size=None):
     """Creates a BigtableClient that can be used to open connections to tables.
 
     Args:
       project_id: A string representing the GCP project id to connect to.
       instance_id: A string representing the Bigtable instance to connect to.
+      connection_pool_size: (Optional.) A number representing the number of
+        concurrent connections to the Cloud Bigtable service to make.
+      max_receive_message_size: (Optional.) The maximum bytes received in a
+        single gRPC response.
+
+    Raises:
+      ValueError: if the arguments are invalid (e.g. wrong type, or out of
+        expected ranges (e.g. negative).)
     """
+    if not isinstance(project_id, str):
+      raise ValueError("`project_id` must be a string")
     self._project_id = project_id
+
+    if not isinstance(instance_id, str):
+      raise ValueError("`instance_id` must be a string")
     self._instance_id = instance_id
-    self._resource = gen_bigtable_ops.bigtable_client(project_id, instance_id)
+
+    if connection_pool_size is None:
+      connection_pool_size = -1
+    elif connection_pool_size < 1:
+      raise ValueError("`connection_pool_size` must be positive")
+
+    if max_receive_message_size is None:
+      max_receive_message_size = -1
+    elif max_receive_message_size < 1:
+      raise ValueError("`max_receive_message_size` must be positive")
+
+    self._connection_pool_size = connection_pool_size
+
+    self._resource = gen_bigtable_ops.bigtable_client(
+        project_id, instance_id, connection_pool_size, max_receive_message_size)
 
   def table(self, name, snapshot=None):
-    """Opens a table and returns a `BigTable` object.
+    """Opens a table and returns a `tf.contrib.bigtable.BigtableTable` object.
 
     Args:
       name: A `tf.string` `tf.Tensor` name of the table to open.
@@ -69,19 +102,20 @@ class BigtableClient(object):
         request the creation of a snapshot. (Note: currently unimplemented.)
 
     Returns:
-      A `BigTable` python object representing the operations available on the
-      table.
+      A `tf.contrib.bigtable.BigtableTable` Python object representing the
+      operations available on the table.
     """
     # TODO(saeta): Implement snapshot functionality.
     table = gen_bigtable_ops.bigtable_table(self._resource, name)
-    return BigTable(name, snapshot, table)
+    return BigtableTable(name, snapshot, table)
 
 
-class BigTable(object):
-  """BigTable is the entrypoint for reading and writing data in Cloud Bigtable.
+class BigtableTable(object):
+  """BigtableTable is the entrypoint for reading and writing data in Cloud
+  Bigtable.
 
-  This BigTable class is the python representation of the Cloud Bigtable table
-  within TensorFlow. Methods on this class allow data to be read from and
+  This BigtableTable class is the Python representation of the Cloud Bigtable
+  table within TensorFlow. Methods on this class allow data to be read from and
   written to the Cloud Bigtable service in flexible and high performance
   manners.
   """
@@ -99,7 +133,8 @@ class BigTable(object):
     """Retrieves the values of columns for a dataset of keys.
 
     Example usage:
-    ```
+
+    ```python
     table = bigtable_client.table("my_table")
     key_dataset = table.get_keys_prefix("imagenet")
     images = key_dataset.apply(table.lookup_columns(("cf1", "image"),
@@ -110,7 +145,8 @@ class BigTable(object):
 
     Alternatively, you can use keyword arguments to specify the columns to
     capture. Example (same as above, rewritten):
-    ```
+
+    ```python
     table = bigtable_client.table("my_table")
     key_dataset = table.get_keys_prefix("imagenet")
     images = key_dataset.apply(table.lookup_columns(
@@ -118,15 +154,17 @@ class BigTable(object):
     training_data = images.map(parse_and_crop, num_parallel_calls=64).batch(128)
     ```
 
-    Note: certain kwargs keys are reserved, and thus some column families cannot
-    be identified using the kwargs syntax. Instead, please use the args syntax.
-    This list includes:
+    Note: certain `kwargs` keys are reserved, and thus, some column families
+    cannot be identified using the `kwargs` syntax. Instead, please use the
+    `args` syntax. This list includes:
+
       - 'name'
-    This list can change at any time.
+
+    Note: this list can change at any time.
 
     Args:
       *args: A list of tuples containing (column family, column name) pairs.
-      **kwargs: Column families and
+      **kwargs: Column families (keys) and column qualifiers (values).
 
     Returns:
       A function that can be passed to `tf.data.Dataset.apply` to retrieve the
@@ -186,6 +224,18 @@ class BigTable(object):
     """
     return _BigtablePrefixKeyDataset(self, prefix)
 
+  def sample_keys(self):
+    """Retrieves a sampling of row keys from the Bigtable table.
+
+    This dataset is most often used in conjunction with
+    @{tf.contrib.data.parallel_interleave} to construct a set of ranges for
+    scanning in parallel.
+
+    Returns:
+      A @{tf.data.Dataset} returning string row keys.
+    """
+    return _BigtableSampleKeysDataset(self)
+
   def scan_prefix(self, prefix, probability=None, columns=None, **kwargs):
     """Retrieves row (including values) from the Bigtable service.
 
@@ -208,9 +258,11 @@ class BigTable(object):
     Note: only the latest value of a cell will be retrieved.
 
     Args:
-      prefix: The prefix all row keys muat match to be retrieved for prefix-
+      prefix: The prefix all row keys must match to be retrieved for prefix-
         based scans.
-      probability: Probabilistically sample rows.
+      probability: (Optional.) A float between 0 (exclusive) and 1 (inclusive).
+        A non-1 value indicates to probabilistically sample rows with the
+        provided probability.
       columns: The columns to read. Note: most commonly, they are expressed as
         kwargs. Use the columns value if you are using column families that are
         reserved. The value of columns and kwargs are merged. Columns is a list
@@ -225,26 +277,8 @@ class BigTable(object):
     Raises:
       ValueError: If the configured probability is unexpected.
     """
-    if probability is None:
-      probability = 1.0
-    if isinstance(probability, float) and (probability <= 0.0 or
-                                           probability > 1.0):
-      raise ValueError("probability must be in the range (0, 1].")
-
-    normalized = columns
-    if normalized is None:
-      normalized = []
-    if isinstance(normalized, tuple):
-      normalized = list(normalized)
-    for key, value in iteritems(kwargs):
-      if key == "name":
-        continue
-      if isinstance(value, str):
-        normalized.append((key, value))
-        continue
-      for col in value:
-        normalized.append((key, col))
-
+    probability = _normalize_probability(probability)
+    normalized = _normalize_columns(columns, kwargs)
     return _BigtableScanDataset(self, prefix, "", "", normalized, probability)
 
   def scan_range(self, start, end, probability=None, columns=None, **kwargs):
@@ -271,7 +305,9 @@ class BigTable(object):
     Args:
       start: The start of the range when scanning by range.
       end: (Optional.) The end of the range when scanning by range.
-      probability: Probabilistically sample rows.
+      probability: (Optional.) A float between 0 (exclusive) and 1 (inclusive).
+        A non-1 value indicates to probabilistically sample rows with the
+        provided probability.
       columns: The columns to read. Note: most commonly, they are expressed as
         kwargs. Use the columns value if you are using column families that are
         reserved. The value of columns and kwargs are merged. Columns is a list
@@ -286,27 +322,129 @@ class BigTable(object):
     Raises:
       ValueError: If the configured probability is unexpected.
     """
-    if probability is None:
-      probability = 1.0
-    if isinstance(probability, float) and (probability <= 0.0 or
-                                           probability > 1.0):
-      raise ValueError("probability must be in the range (0, 1].")
+    probability = _normalize_probability(probability)
+    normalized = _normalize_columns(columns, kwargs)
+    return _BigtableScanDataset(self, "", start, end, normalized, probability)
 
-    normalized = columns
-    if normalized is None:
-      normalized = []
-    if isinstance(normalized, tuple):
-      normalized = list(normalized)
-    for key, value in iteritems(kwargs):
-      if key == "name":
-        continue
-      if isinstance(value, str):
-        normalized.append((key, value))
-        continue
-      for col in value:
-        normalized.append((key, col))
+  def parallel_scan_prefix(self,
+                           prefix,
+                           num_parallel_scans=None,
+                           probability=None,
+                           columns=None,
+                           **kwargs):
+    """Retrieves row (including values) from the Bigtable service at high speed.
 
-    return _BigtableScanDataset(self, "", start, end, normalized, probability)
+    Rows with row-key prefixed by `prefix` will be retrieved. This method is
+    similar to `scan_prefix`, but by constrast performs multiple sub-scans in
+    parallel in order to achieve higher performance.
+
+    Note: The dataset produced by this method is not deterministic!
+
+    Specifying the columns to retrieve for each row is done by either using
+    kwargs or in the columns parameter. To retrieve values of the columns "c1",
+    and "c2" from the column family "cfa", and the value of the column "c3"
+    from column family "cfb", the following datasets (`ds1`, and `ds2`) are
+    equivalent:
+
+    ```
+    table = # ...
+    ds1 = table.parallel_scan_prefix("row_prefix", columns=[("cfa", "c1"),
+                                                            ("cfa", "c2"),
+                                                            ("cfb", "c3")])
+    ds2 = table.parallel_scan_prefix("row_prefix", cfa=["c1", "c2"], cfb="c3")
+    ```
+
+    Note: only the latest value of a cell will be retrieved.
+
+    Args:
+      prefix: The prefix all row keys must match to be retrieved for prefix-
+        based scans.
+      num_parallel_scans: (Optional.) The number of concurrent scans against the
+        Cloud Bigtable instance.
+      probability: (Optional.) A float between 0 (exclusive) and 1 (inclusive).
+        A non-1 value indicates to probabilistically sample rows with the
+        provided probability.
+      columns: The columns to read. Note: most commonly, they are expressed as
+        kwargs. Use the columns value if you are using column families that are
+        reserved. The value of columns and kwargs are merged. Columns is a list
+        of tuples of strings ("column_family", "column_qualifier").
+      **kwargs: The column families and columns to read. Keys are treated as
+        column_families, and values can be either lists of strings, or strings
+        that are treated as the column qualifier (column name).
+
+    Returns:
+      A @{tf.data.Dataset} returning the row keys and the cell contents.
+
+    Raises:
+      ValueError: If the configured probability is unexpected.
+    """
+    probability = _normalize_probability(probability)
+    normalized = _normalize_columns(columns, kwargs)
+    ds = _BigtableSampleKeyPairsDataset(self, prefix, "", "")
+    return self._make_parallel_scan_dataset(ds, num_parallel_scans, probability,
+                                            normalized)
+
+  def parallel_scan_range(self,
+                          start,
+                          end,
+                          num_parallel_scans=None,
+                          probability=None,
+                          columns=None,
+                          **kwargs):
+    """Retrieves rows (including values) from the Bigtable service.
+
+    Rows with row-keys between `start` and `end` will be retrieved. This method
+    is similar to `scan_range`, but by constrast performs multiple sub-scans in
+    parallel in order to achieve higher performance.
+
+    Note: The dataset produced by this method is not deterministic!
+
+    Specifying the columns to retrieve for each row is done by either using
+    kwargs or in the columns parameter. To retrieve values of the columns "c1",
+    and "c2" from the column family "cfa", and the value of the column "c3"
+    from column family "cfb", the following datasets (`ds1`, and `ds2`) are
+    equivalent:
+
+    ```
+    table = # ...
+    ds1 = table.parallel_scan_range("row_start",
+                                    "row_end",
+                                    columns=[("cfa", "c1"),
+                                             ("cfa", "c2"),
+                                             ("cfb", "c3")])
+    ds2 = table.parallel_scan_range("row_start", "row_end",
+                                    cfa=["c1", "c2"], cfb="c3")
+    ```
+
+    Note: only the latest value of a cell will be retrieved.
+
+    Args:
+      start: The start of the range when scanning by range.
+      end: (Optional.) The end of the range when scanning by range.
+      num_parallel_scans: (Optional.) The number of concurrent scans against the
+        Cloud Bigtable instance.
+      probability: (Optional.) A float between 0 (exclusive) and 1 (inclusive).
+        A non-1 value indicates to probabilistically sample rows with the
+        provided probability.
+      columns: The columns to read. Note: most commonly, they are expressed as
+        kwargs. Use the columns value if you are using column families that are
+        reserved. The value of columns and kwargs are merged. Columns is a list
+        of tuples of strings ("column_family", "column_qualifier").
+      **kwargs: The column families and columns to read. Keys are treated as
+        column_families, and values can be either lists of strings, or strings
+        that are treated as the column qualifier (column name).
+
+    Returns:
+      A @{tf.data.Dataset} returning the row keys and the cell contents.
+
+    Raises:
+      ValueError: If the configured probability is unexpected.
+    """
+    probability = _normalize_probability(probability)
+    normalized = _normalize_columns(columns, kwargs)
+    ds = _BigtableSampleKeyPairsDataset(self, "", start, end)
+    return self._make_parallel_scan_dataset(ds, num_parallel_scans, probability,
+                                            normalized)
 
   def write(self, dataset, column_families, columns, timestamp=None):
     """Writes a dataset to the table.
@@ -353,6 +491,89 @@ class BigTable(object):
         columns,
         timestamp)
 
+  def _make_parallel_scan_dataset(self, ds, num_parallel_scans,
+                                  normalized_probability, normalized_columns):
+    """Builds a parallel dataset from a given range.
+
+    Args:
+      ds: A `_BigtableSampleKeyPairsDataset` returning ranges of keys to use.
+      num_parallel_scans: The number of concurrent parallel scans to use.
+      normalized_probability: A number between 0 and 1 for the keep probability.
+      normalized_columns: The column families and column qualifiers to retrieve.
+
+    Returns:
+      A @{tf.data.Dataset} representing the result of the parallel scan.
+    """
+    if num_parallel_scans is None:
+      num_parallel_scans = 50
+
+    ds = ds.shuffle(buffer_size=10000)  # TODO(saeta): Make configurable.
+
+    def _interleave_fn(start, end):
+      return _BigtableScanDataset(
+          self,
+          prefix="",
+          start=start,
+          end=end,
+          normalized=normalized_columns,
+          probability=normalized_probability)
+
+    # Note prefetch_input_elements must be set in order to avoid rpc timeouts.
+    ds = ds.apply(
+        interleave_ops.parallel_interleave(
+            _interleave_fn,
+            cycle_length=num_parallel_scans,
+            sloppy=True,
+            prefetch_input_elements=1))
+    return ds
+
+
+def _normalize_probability(probability):
+  if probability is None:
+    probability = 1.0
+  if isinstance(probability, float) and (probability <= 0.0 or
+                                         probability > 1.0):
+    raise ValueError("probability must be in the range (0, 1].")
+  return probability
+
+
+def _normalize_columns(columns, provided_kwargs):
+  """Converts arguments (columns, and kwargs dict) to C++ representation.
+
+  Args:
+    columns: a datastructure containing the column families and qualifier to
+      retrieve. Valid types include (1) None, (2) list of tuples, (3) a tuple of
+      strings.
+    provided_kwargs: a dictionary containing the column families and qualifiers
+      to retrieve
+
+  Returns:
+    A list of pairs of column family+qualifier to retrieve.
+
+  Raises:
+    ValueError: If there are no cells to retrieve or the columns are in an
+      incorrect format.
+  """
+  normalized = columns
+  if normalized is None:
+    normalized = []
+  if isinstance(normalized, tuple):
+    if len(normalized) == 2:
+      normalized = [normalized]
+    else:
+      raise ValueError("columns was a tuple of inappropriate length")
+  for key, value in iteritems(provided_kwargs):
+    if key == "name":
+      continue
+    if isinstance(value, string_types):
+      normalized.append((key, value))
+      continue
+    for col in value:
+      normalized.append((key, col))
+  if not normalized:
+    raise ValueError("At least one column + column family must be specified.")
+  return normalized
+
 
 class _BigtableKeyDataset(dataset_ops.Dataset):
   """_BigtableKeyDataset is an abstract class representing the keys of a table.
@@ -410,6 +631,20 @@ class _BigtableRangeKeyDataset(_BigtableKeyDataset):
         end_key=self._end)
 
 
+class _BigtableSampleKeysDataset(_BigtableKeyDataset):
+  """_BigtableSampleKeysDataset represents a sampling of row keys.
+  """
+
+  # TODO(saeta): Expose the data size offsets into the keys.
+
+  def __init__(self, table):
+    super(_BigtableSampleKeysDataset, self).__init__(table)
+
+  def _as_variant_tensor(self):
+    return gen_bigtable_ops.bigtable_sample_keys_dataset(
+        table=self._table._resource)  # pylint: disable=protected-access
+
+
 class _BigtableLookupDataset(dataset_ops.Dataset):
   """_BigtableLookupDataset represents a dataset that retrieves values for keys.
   """
@@ -478,3 +713,34 @@ class _BigtableScanDataset(dataset_ops.Dataset):
         column_families=self._column_families,
         columns=self._columns,
         probability=self._probability)
+
+
+class _BigtableSampleKeyPairsDataset(dataset_ops.Dataset):
+  """_BigtableSampleKeyPairsDataset returns key pairs from a Bigtable table.
+  """
+
+  def __init__(self, table, prefix, start, end):
+    self._table = table
+    self._prefix = prefix
+    self._start = start
+    self._end = end
+
+  @property
+  def output_classes(self):
+    return (ops.Tensor, ops.Tensor)
+
+  @property
+  def output_shapes(self):
+    return (tensor_shape.TensorShape([]), tensor_shape.TensorShape([]))
+
+  @property
+  def output_types(self):
+    return (dtypes.string, dtypes.string)
+
+  def _as_variant_tensor(self):
+    # pylint: disable=protected-access
+    return gen_bigtable_ops.bigtable_sample_key_pairs_dataset(
+        table=self._table._resource,
+        prefix=self._prefix,
+        start_key=self._start,
+        end_key=self._end)
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/BUILD b/tensorflow/contrib/boosted_trees/estimator_batch/BUILD
index ef0e80cd09..5fcb19a47a 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/BUILD
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/BUILD
@@ -147,6 +147,7 @@ py_library(
     deps = [
         ":distillation_loss",
         ":estimator_utils",
+        ":model",
         ":trainer_hooks",
         "//tensorflow/contrib/boosted_trees:gbdt_batch",
         "//tensorflow/contrib/boosted_trees:model_ops_py",
@@ -190,7 +191,7 @@ py_test(
 
 py_test(
     name = "estimator_test",
-    size = "medium",
+    size = "large",
     srcs = ["estimator_test.py"],
     srcs_version = "PY2AND3",
     tags = [
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py b/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
index 62f1f4122b..78232fa0a6 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
@@ -32,6 +32,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.platform import gfile
 from tensorflow.python.saved_model import loader as saved_model_loader
 from tensorflow.python.saved_model import tag_constants
+from tensorflow.python.util import compat
 
 _SPARSE_FLOAT_FEATURE_NAME_TEMPLATE = "%s_%d"
 
@@ -88,10 +89,12 @@ def make_custom_export_strategy(name,
             len(sparse_float_indices), len(sparse_int_indices))
         sorted_by_importance = sorted(
             feature_importances.items(), key=lambda x: -x[1])
-        assets_dir = os.path.join(result_dir, "assets.extra")
+        assets_dir = os.path.join(
+            compat.as_bytes(result_dir), compat.as_bytes("assets.extra"))
         gfile.MakeDirs(assets_dir)
-        with gfile.GFile(os.path.join(assets_dir, "feature_importances"),
-                         "w") as f:
+        with gfile.GFile(os.path.join(
+            compat.as_bytes(assets_dir),
+            compat.as_bytes("feature_importances")), "w") as f:
           f.write("\n".join("%s, %f" % (k, v) for k, v in sorted_by_importance))
     return result_dir
 
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
index 7eb429b636..194a5c8754 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator.py
@@ -26,6 +26,7 @@ from __future__ import print_function
 import six
 
 from tensorflow.contrib import layers
+from tensorflow.contrib.boosted_trees.estimator_batch import model
 from tensorflow.contrib.boosted_trees.estimator_batch import distillation_loss
 from tensorflow.contrib.boosted_trees.estimator_batch import estimator_utils
 from tensorflow.contrib.boosted_trees.estimator_batch import trainer_hooks
@@ -34,6 +35,7 @@ from tensorflow.contrib.boosted_trees.python.training.functions import gbdt_batc
 from tensorflow.contrib.layers.python.layers import optimizers
 from tensorflow.contrib.learn.python.learn.estimators import estimator
 from tensorflow.contrib.learn.python.learn.estimators import head as head_lib
+from tensorflow.python.estimator import estimator as core_estimator
 from tensorflow.contrib.learn.python.learn.estimators import model_fn
 from tensorflow.python.feature_column import feature_column as feature_column_lib
 from tensorflow.python.framework import ops
@@ -62,27 +64,30 @@ def _add_hidden_layer_summary(value, tag):
   summary.histogram("%s_activation" % tag, value)
 
 
-def _dnn_tree_combined_model_fn(features,
-                                labels,
-                                mode,
-                                head,
-                                dnn_hidden_units,
-                                dnn_feature_columns,
-                                tree_learner_config,
-                                num_trees,
-                                tree_examples_per_layer,
-                                config=None,
-                                dnn_optimizer="Adagrad",
-                                dnn_activation_fn=nn.relu,
-                                dnn_dropout=None,
-                                dnn_input_layer_partitioner=None,
-                                dnn_input_layer_to_tree=True,
-                                dnn_steps_to_train=10000,
-                                predict_with_tree_only=False,
-                                tree_feature_columns=None,
-                                tree_center_bias=False,
-                                dnn_to_tree_distillation_param=None,
-                                use_core_versions=False):
+def _dnn_tree_combined_model_fn(
+    features,
+    labels,
+    mode,
+    head,
+    dnn_hidden_units,
+    dnn_feature_columns,
+    tree_learner_config,
+    num_trees,
+    tree_examples_per_layer,
+    config=None,
+    dnn_optimizer="Adagrad",
+    dnn_activation_fn=nn.relu,
+    dnn_dropout=None,
+    dnn_input_layer_partitioner=None,
+    dnn_input_layer_to_tree=True,
+    dnn_steps_to_train=10000,
+    predict_with_tree_only=False,
+    tree_feature_columns=None,
+    tree_center_bias=False,
+    dnn_to_tree_distillation_param=None,
+    use_core_versions=False,
+    output_type=model.ModelBuilderOutputType.MODEL_FN_OPS,
+    override_global_step_value=None):
   """DNN and GBDT combined model_fn.
 
   Args:
@@ -131,6 +136,12 @@ def _dnn_tree_combined_model_fn(features,
       will be set to True.
     use_core_versions: Whether feature columns and loss are from the core (as
       opposed to contrib) version of tensorflow.
+    output_type: Whether to return ModelFnOps (old interface) or EstimatorSpec
+      (new interface).
+    override_global_step_value: If after the training is done, global step
+      value must be reset to this value. This is particularly useful for hyper
+      parameter tuning, which can't recognize early stopping due to the number
+      of trees. If None, no override of global step will happen.
 
   Returns:
     A `ModelFnOps` object.
@@ -156,6 +167,10 @@ def _dnn_tree_combined_model_fn(features,
       partitioned_variables.min_max_variable_partitioner(
           max_partitions=config.num_ps_replicas, min_slice_size=64 << 20))
 
+  if (output_type == model.ModelBuilderOutputType.ESTIMATOR_SPEC and
+      not use_core_versions):
+    raise ValueError("You must use core versions with Estimator Spec")
+
   with variable_scope.variable_scope(
       dnn_parent_scope,
       values=tuple(six.itervalues(features)),
@@ -235,7 +250,8 @@ def _dnn_tree_combined_model_fn(features,
       learner_config=tree_learner_config,
       feature_columns=tree_feature_columns,
       logits_dimension=head.logits_dimension,
-      features=tree_features)
+      features=tree_features,
+      use_core_columns=use_core_versions)
 
   with ops.name_scope("gbdt"):
     predictions_dict = gbdt_model.predict(mode)
@@ -284,63 +300,98 @@ def _dnn_tree_combined_model_fn(features,
     del loss
     return control_flow_ops.no_op()
 
-  if use_core_versions:
-    model_fn_ops = head.create_estimator_spec(
-        features=features,
-        mode=mode,
-        labels=labels,
-        train_op_fn=_no_train_op_fn,
-        logits=tree_train_logits)
-    dnn_train_op = head.create_estimator_spec(
-        features=features,
-        mode=mode,
-        labels=labels,
-        train_op_fn=_dnn_train_op_fn,
-        logits=dnn_logits)
-    dnn_train_op = estimator_utils.estimator_spec_to_model_fn_ops(
-        dnn_train_op).train_op
+  if tree_center_bias:
+    num_trees += 1
+  finalized_trees, attempted_trees = gbdt_model.get_number_of_trees_tensor()
 
-    tree_train_op = head.create_estimator_spec(
-        features=tree_features,
-        mode=mode,
-        labels=labels,
-        train_op_fn=_tree_train_op_fn,
-        logits=tree_train_logits)
-    tree_train_op = estimator_utils.estimator_spec_to_model_fn_ops(
-        tree_train_op).train_op
+  if output_type == model.ModelBuilderOutputType.MODEL_FN_OPS:
+    if use_core_versions:
+      model_fn_ops = head.create_estimator_spec(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_no_train_op_fn,
+          logits=tree_train_logits)
+      dnn_train_op = head.create_estimator_spec(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_dnn_train_op_fn,
+          logits=dnn_logits)
+      dnn_train_op = estimator_utils.estimator_spec_to_model_fn_ops(
+          dnn_train_op).train_op
 
-    model_fn_ops = estimator_utils.estimator_spec_to_model_fn_ops(model_fn_ops)
-  else:
-    model_fn_ops = head.create_model_fn_ops(
+      tree_train_op = head.create_estimator_spec(
+          features=tree_features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_tree_train_op_fn,
+          logits=tree_train_logits)
+      tree_train_op = estimator_utils.estimator_spec_to_model_fn_ops(
+          tree_train_op).train_op
+
+      model_fn_ops = estimator_utils.estimator_spec_to_model_fn_ops(
+          model_fn_ops)
+    else:
+      model_fn_ops = head.create_model_fn_ops(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_no_train_op_fn,
+          logits=tree_train_logits)
+      dnn_train_op = head.create_model_fn_ops(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_dnn_train_op_fn,
+          logits=dnn_logits).train_op
+      tree_train_op = head.create_model_fn_ops(
+          features=tree_features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_tree_train_op_fn,
+          logits=tree_train_logits).train_op
+
+    # Add the hooks
+    model_fn_ops.training_hooks.extend([
+        trainer_hooks.SwitchTrainOp(dnn_train_op, dnn_steps_to_train,
+                                    tree_train_op),
+        trainer_hooks.StopAfterNTrees(num_trees, attempted_trees,
+                                      finalized_trees,
+                                      override_global_step_value)
+    ])
+    return model_fn_ops
+
+  elif output_type == model.ModelBuilderOutputType.ESTIMATOR_SPEC:
+    fusion_spec = head.create_estimator_spec(
         features=features,
         mode=mode,
         labels=labels,
         train_op_fn=_no_train_op_fn,
         logits=tree_train_logits)
-    dnn_train_op = head.create_model_fn_ops(
+    dnn_spec = head.create_estimator_spec(
         features=features,
         mode=mode,
         labels=labels,
         train_op_fn=_dnn_train_op_fn,
-        logits=dnn_logits).train_op
-    tree_train_op = head.create_model_fn_ops(
+        logits=dnn_logits)
+    tree_spec = head.create_estimator_spec(
         features=tree_features,
         mode=mode,
         labels=labels,
         train_op_fn=_tree_train_op_fn,
-        logits=tree_train_logits).train_op
-
-  if tree_center_bias:
-    num_trees += 1
-  finalized_trees, attempted_trees = gbdt_model.get_number_of_trees_tensor()
-
-  model_fn_ops.training_hooks.extend([
-      trainer_hooks.SwitchTrainOp(dnn_train_op, dnn_steps_to_train,
-                                  tree_train_op),
-      trainer_hooks.StopAfterNTrees(num_trees, attempted_trees, finalized_trees)
-  ])
+        logits=tree_train_logits)
 
-  return model_fn_ops
+    training_hooks = [
+        trainer_hooks.SwitchTrainOp(dnn_spec.train_op, dnn_steps_to_train,
+                                    tree_spec.train_op),
+        trainer_hooks.StopAfterNTrees(num_trees, attempted_trees,
+                                      finalized_trees,
+                                      override_global_step_value)
+    ]
+    fusion_spec = fusion_spec._replace(training_hooks=training_hooks +
+                                       list(fusion_spec.training_hooks))
+    return fusion_spec
 
 
 class DNNBoostedTreeCombinedClassifier(estimator.Estimator):
@@ -369,7 +420,8 @@ class DNNBoostedTreeCombinedClassifier(estimator.Estimator):
                tree_feature_columns=None,
                tree_center_bias=False,
                dnn_to_tree_distillation_param=None,
-               use_core_versions=False):
+               use_core_versions=False,
+               override_global_step_value=None):
     """Initializes a DNNBoostedTreeCombinedClassifier instance.
 
     Args:
@@ -425,6 +477,10 @@ class DNNBoostedTreeCombinedClassifier(estimator.Estimator):
         will be set to True.
       use_core_versions: Whether feature columns and loss are from the core (as
         opposed to contrib) version of tensorflow.
+      override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This is particularly useful for hyper
+        parameter tuning, which can't recognize early stopping due to the number
+        of trees. If None, no override of global step will happen.
     """
     head = head_lib.multi_class_head(
         n_classes=n_classes,
@@ -455,7 +511,8 @@ class DNNBoostedTreeCombinedClassifier(estimator.Estimator):
           tree_feature_columns=tree_feature_columns,
           tree_center_bias=tree_center_bias,
           dnn_to_tree_distillation_param=dnn_to_tree_distillation_param,
-          use_core_versions=use_core_versions)
+          use_core_versions=use_core_versions,
+          override_global_step_value=override_global_step_value)
 
     super(DNNBoostedTreeCombinedClassifier, self).__init__(
         model_fn=_model_fn,
@@ -489,7 +546,8 @@ class DNNBoostedTreeCombinedRegressor(estimator.Estimator):
                tree_feature_columns=None,
                tree_center_bias=False,
                dnn_to_tree_distillation_param=None,
-               use_core_versions=False):
+               use_core_versions=False,
+               override_global_step_value=None):
     """Initializes a DNNBoostedTreeCombinedRegressor instance.
 
     Args:
@@ -545,6 +603,10 @@ class DNNBoostedTreeCombinedRegressor(estimator.Estimator):
         will be set to True.
       use_core_versions: Whether feature columns and loss are from the core (as
         opposed to contrib) version of tensorflow.
+      override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This is particularly useful for hyper
+        parameter tuning, which can't recognize early stopping due to the number
+        of trees. If None, no override of global step will happen.
     """
     head = head_lib.regression_head(
         label_name=label_name,
@@ -580,7 +642,8 @@ class DNNBoostedTreeCombinedRegressor(estimator.Estimator):
           tree_feature_columns=tree_feature_columns,
           tree_center_bias=tree_center_bias,
           dnn_to_tree_distillation_param=dnn_to_tree_distillation_param,
-          use_core_versions=use_core_versions)
+          use_core_versions=use_core_versions,
+          override_global_step_value=override_global_step_value)
 
     super(DNNBoostedTreeCombinedRegressor, self).__init__(
         model_fn=_model_fn,
@@ -615,7 +678,8 @@ class DNNBoostedTreeCombinedEstimator(estimator.Estimator):
                tree_feature_columns=None,
                tree_center_bias=False,
                dnn_to_tree_distillation_param=None,
-               use_core_versions=False):
+               use_core_versions=False,
+               override_global_step_value=None):
     """Initializes a DNNBoostedTreeCombinedEstimator instance.
 
     Args:
@@ -666,6 +730,10 @@ class DNNBoostedTreeCombinedEstimator(estimator.Estimator):
         will be set to True.
       use_core_versions: Whether feature columns and loss are from the core (as
         opposed to contrib) version of tensorflow.
+      override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This is particularly useful for hyper
+        parameter tuning, which can't recognize early stopping due to the number
+        of trees. If None, no override of global step will happen.
     """
 
     def _model_fn(features, labels, mode, config):
@@ -690,10 +758,109 @@ class DNNBoostedTreeCombinedEstimator(estimator.Estimator):
           tree_feature_columns=tree_feature_columns,
           tree_center_bias=tree_center_bias,
           dnn_to_tree_distillation_param=dnn_to_tree_distillation_param,
-          use_core_versions=use_core_versions)
+          use_core_versions=use_core_versions,
+          override_global_step_value=override_global_step_value)
 
     super(DNNBoostedTreeCombinedEstimator, self).__init__(
         model_fn=_model_fn,
         model_dir=model_dir,
         config=config,
         feature_engineering_fn=feature_engineering_fn)
+
+
+class CoreDNNBoostedTreeCombinedEstimator(core_estimator.Estimator):
+  """Initializes a core version of DNNBoostedTreeCombinedEstimator.
+
+    Args:
+      dnn_hidden_units: List of hidden units per layer for DNN.
+      dnn_feature_columns: An iterable containing all the feature columns
+        used by the model's DNN.
+      tree_learner_config: A config for the tree learner.
+      num_trees: Number of trees to grow model to after training DNN.
+      tree_examples_per_layer: Number of examples to accumulate before
+        growing the tree a layer. This value has a big impact on model
+        quality and should be set equal to the number of examples in
+        training dataset if possible. It can also be a function that computes
+        the number of examples based on the depth of the layer that's
+        being built.
+      head: `Head` instance.
+      model_dir: Directory for model exports.
+      config: `RunConfig` of the estimator.
+      dnn_optimizer: string, `Optimizer` object, or callable that defines the
+        optimizer to use for training the DNN. If `None`, will use the Adagrad
+        optimizer with default learning rate.
+      dnn_activation_fn: Activation function applied to each layer of the DNN.
+        If `None`, will use `tf.nn.relu`.
+      dnn_dropout: When not `None`, the probability to drop out a given
+        unit in the DNN.
+      dnn_input_layer_partitioner: Partitioner for input layer of the DNN.
+        Defaults to `min_max_variable_partitioner` with `min_slice_size`
+        64 << 20.
+      dnn_input_layer_to_tree: Whether to provide the DNN's input layer
+      as a feature to the tree.
+      dnn_steps_to_train: Number of steps to train dnn for before switching
+        to gbdt.
+      predict_with_tree_only: Whether to use only the tree model output as the
+        final prediction.
+      tree_feature_columns: An iterable containing all the feature columns
+        used by the model's boosted trees. If dnn_input_layer_to_tree is
+        set to True, these features are in addition to dnn_feature_columns.
+      tree_center_bias: Whether a separate tree should be created for
+        first fitting the bias.
+      dnn_to_tree_distillation_param: A Tuple of (float, loss_fn), where the
+        float defines the weight of the distillation loss, and the loss_fn, for
+        computing distillation loss, takes dnn_logits, tree_logits and weight
+        tensor. If the entire tuple is None, no distillation will be applied. If
+        only the loss_fn is None, we will take the sigmoid/softmax cross entropy
+        loss be default. When distillation is applied, `predict_with_tree_only`
+        will be set to True.
+    """
+
+  def __init__(self,
+               dnn_hidden_units,
+               dnn_feature_columns,
+               tree_learner_config,
+               num_trees,
+               tree_examples_per_layer,
+               head,
+               model_dir=None,
+               config=None,
+               dnn_optimizer="Adagrad",
+               dnn_activation_fn=nn.relu,
+               dnn_dropout=None,
+               dnn_input_layer_partitioner=None,
+               dnn_input_layer_to_tree=True,
+               dnn_steps_to_train=10000,
+               predict_with_tree_only=False,
+               tree_feature_columns=None,
+               tree_center_bias=False,
+               dnn_to_tree_distillation_param=None):
+
+    def _model_fn(features, labels, mode, config):
+      return _dnn_tree_combined_model_fn(
+          features=features,
+          labels=labels,
+          mode=mode,
+          head=head,
+          dnn_hidden_units=dnn_hidden_units,
+          dnn_feature_columns=dnn_feature_columns,
+          tree_learner_config=tree_learner_config,
+          num_trees=num_trees,
+          tree_examples_per_layer=tree_examples_per_layer,
+          config=config,
+          dnn_optimizer=dnn_optimizer,
+          dnn_activation_fn=dnn_activation_fn,
+          dnn_dropout=dnn_dropout,
+          dnn_input_layer_partitioner=dnn_input_layer_partitioner,
+          dnn_input_layer_to_tree=dnn_input_layer_to_tree,
+          dnn_steps_to_train=dnn_steps_to_train,
+          predict_with_tree_only=predict_with_tree_only,
+          tree_feature_columns=tree_feature_columns,
+          tree_center_bias=tree_center_bias,
+          dnn_to_tree_distillation_param=dnn_to_tree_distillation_param,
+          output_type=model.ModelBuilderOutputType.ESTIMATOR_SPEC,
+          use_core_versions=True,
+          override_global_step_value=None)
+
+    super(CoreDNNBoostedTreeCombinedEstimator, self).__init__(
+        model_fn=_model_fn, model_dir=model_dir, config=config)
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py
index 9b7acfa664..839eedd3a8 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/dnn_tree_combined_estimator_test.py
@@ -28,10 +28,11 @@ from tensorflow.python.estimator.canned import head as head_lib
 from tensorflow.python.feature_column import feature_column_lib as core_feature_column
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.platform import googletest
-
+from tensorflow.python.training import checkpoint_utils
 
 def _train_input_fn():
   features = {
@@ -156,5 +157,72 @@ class DNNBoostedTreeCombinedTest(test_util.TensorFlowTestCase):
     classifier.evaluate(input_fn=_eval_input_fn, steps=1)
 
 
+class CoreDNNBoostedTreeCombinedTest(test_util.TensorFlowTestCase):
+
+  def _assert_checkpoint(self, model_dir, global_step):
+    reader = checkpoint_utils.load_checkpoint(model_dir)
+    self.assertEqual(global_step, reader.get_tensor(ops.GraphKeys.GLOBAL_STEP))
+
+  def testTrainEvaluateInferDoesNotThrowErrorWithNoDnnInput(self):
+    head_fn = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
+        loss_reduction=losses.Reduction.SUM_OVER_NONZERO_WEIGHTS)
+
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 3
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    est = estimator.CoreDNNBoostedTreeCombinedEstimator(
+        head=head_fn,
+        dnn_hidden_units=[1],
+        dnn_feature_columns=[core_feature_column.numeric_column("x")],
+        tree_learner_config=learner_config,
+        num_trees=1,
+        tree_examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        dnn_steps_to_train=10,
+        dnn_input_layer_to_tree=False,
+        tree_feature_columns=[core_feature_column.numeric_column("x")])
+
+    # Train for a few steps.
+    est.train(input_fn=_train_input_fn, steps=1000)
+    # 10 steps for dnn, 3  for 1 tree of depth 3 + 1 after the tree finished
+    self._assert_checkpoint(est.model_dir, global_step=14)
+    res = est.evaluate(input_fn=_eval_input_fn, steps=1)
+    self.assertLess(0.5, res["auc"])
+    est.predict(input_fn=_eval_input_fn)
+
+  def testTrainEvaluateInferDoesNotThrowErrorWithDnnInput(self):
+    head_fn = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
+        loss_reduction=losses.Reduction.SUM_OVER_NONZERO_WEIGHTS)
+
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 3
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    est = estimator.CoreDNNBoostedTreeCombinedEstimator(
+        head=head_fn,
+        dnn_hidden_units=[1],
+        dnn_feature_columns=[core_feature_column.numeric_column("x")],
+        tree_learner_config=learner_config,
+        num_trees=1,
+        tree_examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        dnn_steps_to_train=10,
+        dnn_input_layer_to_tree=True,
+        tree_feature_columns=[])
+
+    # Train for a few steps.
+    est.train(input_fn=_train_input_fn, steps=1000)
+    res = est.evaluate(input_fn=_eval_input_fn, steps=1)
+    self.assertLess(0.5, res["auc"])
+    est.predict(input_fn=_eval_input_fn)
+
+
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py b/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
index 9c36c30221..870ce2442b 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/estimator.py
@@ -22,7 +22,16 @@ from tensorflow.contrib.boosted_trees.estimator_batch import model
 from tensorflow.contrib.boosted_trees.python.utils import losses
 from tensorflow.contrib.learn.python.learn.estimators import estimator
 from tensorflow.contrib.learn.python.learn.estimators import head as head_lib
+from tensorflow.python.estimator.canned import head as core_head_lib
+from tensorflow.python.estimator import estimator as core_estimator
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.losses import losses as core_losses
+
+
+# ================== Old estimator interface===================================
+# The estimators below were designed for old feature columns and old estimator
+# interface. They can be used with new feature columns and losses by setting
+# use_core_libs = True.
 
 
 class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
@@ -42,7 +51,8 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
                logits_modifier_function=None,
                center_bias=True,
                use_core_libs=False,
-               output_leaf_index=False):
+               output_leaf_index=False,
+               override_global_step_value=None):
     """Initializes a GradientBoostedDecisionTreeClassifier estimator instance.
 
     Args:
@@ -76,6 +86,14 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
         for result_dict in result_iter:
           # access leaf index list by result_dict["leaf_index"]
           # which contains one leaf index per tree
+      override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This should be used to reset global
+        step to a number > number of steps used to train the current ensemble.
+        For example, the usual way is to train a number of trees and set a very
+        large number of training steps. When the training is done (number of
+        trees were trained), this parameter can be used to set the global step
+        to a large value, making it look like that number of training steps ran.
+        If None, no override of global step will happen.
 
     Raises:
       ValueError: If learner_config is not valid.
@@ -116,6 +134,7 @@ class GradientBoostedDecisionTreeClassifier(estimator.Estimator):
             'logits_modifier_function': logits_modifier_function,
             'use_core_libs': use_core_libs,
             'output_leaf_index': output_leaf_index,
+            'override_global_step_value': override_global_step_value
         },
         model_dir=model_dir,
         config=config,
@@ -139,7 +158,8 @@ class GradientBoostedDecisionTreeRegressor(estimator.Estimator):
                logits_modifier_function=None,
                center_bias=True,
                use_core_libs=False,
-               output_leaf_index=False):
+               output_leaf_index=False,
+               override_global_step_value=None):
     """Initializes a GradientBoostedDecisionTreeRegressor estimator instance.
 
     Args:
@@ -173,6 +193,14 @@ class GradientBoostedDecisionTreeRegressor(estimator.Estimator):
         for example_prediction_result in result_dict:
           # access leaf index list by example_prediction_result["leaf_index"]
           # which contains one leaf index per tree
+      override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This should be used to reset global
+        step to a number > number of steps used to train the current ensemble.
+        For example, the usual way is to train a number of trees and set a very
+        large number of training steps. When the training is done (number of
+        trees were trained), this parameter can be used to set the global step
+        to a large value, making it look like that number of training steps ran.
+        If None, no override of global step will happen.
     """
     head = head_lib.regression_head(
         label_name=label_name,
@@ -196,6 +224,7 @@ class GradientBoostedDecisionTreeRegressor(estimator.Estimator):
             'center_bias': center_bias,
             'use_core_libs': use_core_libs,
             'output_leaf_index': False,
+            'override_global_step_value': override_global_step_value
         },
         model_dir=model_dir,
         config=config,
@@ -221,7 +250,8 @@ class GradientBoostedDecisionTreeEstimator(estimator.Estimator):
                logits_modifier_function=None,
                center_bias=True,
                use_core_libs=False,
-               output_leaf_index=False):
+               output_leaf_index=False,
+               override_global_step_value=None):
     """Initializes a GradientBoostedDecisionTreeEstimator estimator instance.
 
     Args:
@@ -251,6 +281,14 @@ class GradientBoostedDecisionTreeEstimator(estimator.Estimator):
         for example_prediction_result in result_dict:
           # access leaf index list by example_prediction_result["leaf_index"]
           # which contains one leaf index per tree
+      override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This should be used to reset global
+        step to a number > number of steps used to train the current ensemble.
+        For example, the usual way is to train a number of trees and set a very
+        large number of training steps. When the training is done (number of
+        trees were trained), this parameter can be used to set the global step
+        to a large value, making it look like that number of training steps ran.
+        If None, no override of global step will happen.
     """
     super(GradientBoostedDecisionTreeEstimator, self).__init__(
         model_fn=model.model_builder,
@@ -265,7 +303,282 @@ class GradientBoostedDecisionTreeEstimator(estimator.Estimator):
             'center_bias': center_bias,
             'use_core_libs': use_core_libs,
             'output_leaf_index': False,
+            'override_global_step_value': override_global_step_value
+        },
+        model_dir=model_dir,
+        config=config,
+        feature_engineering_fn=feature_engineering_fn)
+
+
+class GradientBoostedDecisionTreeRanker(estimator.Estimator):
+  """A ranking estimator using gradient boosted decision trees."""
+
+  def __init__(self,
+               learner_config,
+               examples_per_layer,
+               head,
+               ranking_model_pair_keys,
+               num_trees=None,
+               feature_columns=None,
+               weight_column_name=None,
+               model_dir=None,
+               config=None,
+               label_keys=None,
+               feature_engineering_fn=None,
+               logits_modifier_function=None,
+               center_bias=False,
+               use_core_libs=False,
+               output_leaf_index=False,
+               override_global_step_value=None):
+    """Initializes a GradientBoostedDecisionTreeRanker instance.
+
+    This is an estimator that can be trained off the pairwise data and can be
+    used for inference on non-paired data. This is essentially LambdaMart.
+    Args:
+      learner_config: A config for the learner.
+      examples_per_layer: Number of examples to accumulate before growing a
+        layer. It can also be a function that computes the number of examples
+        based on the depth of the layer that's being built.
+      head: `Head` instance.
+      ranking_model_pair_keys: Keys to distinguish between features
+        for left and right part of the training pairs for ranking. For example,
+        for an Example with features "a.f1" and "b.f1", the keys would be
+        ("a", "b").
+      num_trees: An int, number of trees to build.
+      feature_columns: A list of feature columns.
+      weight_column_name: Name of the column for weights, or None if not
+        weighted.
+      model_dir: Directory for model exports, etc.
+      config: `RunConfig` object to configure the runtime settings.
+      label_keys: Optional list of strings with size `[n_classes]` defining the
+        label vocabulary. Only supported for `n_classes` > 2.
+      feature_engineering_fn: Feature engineering function. Takes features and
+        labels which are the output of `input_fn` and returns features and
+        labels which will be fed into the model.
+      logits_modifier_function: A modifier function for the logits.
+      center_bias: Whether a separate tree should be created for first fitting
+        the bias.
+      use_core_libs: Whether feature columns and loss are from the core (as
+        opposed to contrib) version of tensorflow.
+      output_leaf_index: whether to output leaf indices along with predictions
+        during inference. The leaf node indexes are available in predictions
+        dict by the key 'leaf_index'. It is a Tensor of rank 2 and its shape is
+        [batch_size, num_trees].
+        For example,
+        result_iter = classifier.predict(...)
+        for result_dict in result_iter:
+          # access leaf index list by result_dict["leaf_index"]
+          # which contains one leaf index per tree
+      override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This should be used to reset global
+        step to a number > number of steps used to train the current ensemble.
+        For example, the usual way is to train a number of trees and set a very
+        large number of training steps. When the training is done (number of
+        trees were trained), this parameter can be used to set the global step
+        to a large value, making it look like that number of training steps ran.
+        If None, no override of global step will happen.
+    Raises:
+      ValueError: If learner_config is not valid.
+    """
+    super(GradientBoostedDecisionTreeRanker, self).__init__(
+        model_fn=model.ranking_model_builder,
+        params={
+            'head': head,
+            'n_classes': 2,
+            'feature_columns': feature_columns,
+            'learner_config': learner_config,
+            'num_trees': num_trees,
+            'weight_column_name': weight_column_name,
+            'examples_per_layer': examples_per_layer,
+            'center_bias': center_bias,
+            'logits_modifier_function': logits_modifier_function,
+            'use_core_libs': use_core_libs,
+            'output_leaf_index': output_leaf_index,
+            'ranking_model_pair_keys': ranking_model_pair_keys,
+            'override_global_step_value': override_global_step_value
         },
         model_dir=model_dir,
         config=config,
         feature_engineering_fn=feature_engineering_fn)
+
+# ================== New Estimator interface===================================
+# The estimators below use new core Estimator interface and must be used with
+# new feature columns and heads.
+
+# For multiclass classification, use the following head since it uses loss
+# that is twice differentiable.
+def core_multiclass_head(n_classes):
+  """Core head for multiclass problems."""
+
+  def loss_fn(labels, logits):
+    result = losses.per_example_maxent_loss(
+        labels=labels, logits=logits, weights=None, num_classes=n_classes)
+    return result[0]
+
+  # pylint:disable=protected-access
+  head_fn = core_head_lib._multi_class_head_with_softmax_cross_entropy_loss(
+      n_classes=n_classes,
+      loss_fn=loss_fn,
+      loss_reduction=core_losses.Reduction.SUM_OVER_NONZERO_WEIGHTS)
+  # pylint:enable=protected-access
+
+  return head_fn
+
+
+class CoreGradientBoostedDecisionTreeEstimator(core_estimator.Estimator):
+  """An estimator using gradient boosted decision trees.
+
+  Useful for training with user specified `Head`.
+  """
+
+  def __init__(self,
+               learner_config,
+               examples_per_layer,
+               head,
+               num_trees=None,
+               feature_columns=None,
+               weight_column_name=None,
+               model_dir=None,
+               config=None,
+               label_keys=None,
+               feature_engineering_fn=None,
+               logits_modifier_function=None,
+               center_bias=True,
+               output_leaf_index=False):
+    """Initializes a core version of GradientBoostedDecisionTreeEstimator.
+
+    Args:
+      learner_config: A config for the learner.
+      examples_per_layer: Number of examples to accumulate before growing a
+        layer. It can also be a function that computes the number of examples
+        based on the depth of the layer that's being built.
+      head: `Head` instance.
+      num_trees: An int, number of trees to build.
+      feature_columns: A list of feature columns.
+      weight_column_name: Name of the column for weights, or None if not
+        weighted.
+      model_dir: Directory for model exports, etc.
+      config: `RunConfig` object to configure the runtime settings.
+      label_keys: Optional list of strings with size `[n_classes]` defining the
+        label vocabulary. Only supported for `n_classes` > 2.
+      feature_engineering_fn: Feature engineering function. Takes features and
+        labels which are the output of `input_fn` and returns features and
+        labels which will be fed into the model.
+      logits_modifier_function: A modifier function for the logits.
+      center_bias: Whether a separate tree should be created for first fitting
+        the bias.
+      output_leaf_index: whether to output leaf indices along with predictions
+        during inference. The leaf node indexes are available in predictions
+        dict by the key 'leaf_index'. For example,
+        result_dict = classifier.predict(...)
+        for example_prediction_result in result_dict:
+          # access leaf index list by example_prediction_result["leaf_index"]
+          # which contains one leaf index per tree
+    """
+
+    def _model_fn(features, labels, mode, config):
+      return model.model_builder(
+          features=features,
+          labels=labels,
+          mode=mode,
+          config=config,
+          params={
+              'head': head,
+              'feature_columns': feature_columns,
+              'learner_config': learner_config,
+              'num_trees': num_trees,
+              'weight_column_name': weight_column_name,
+              'examples_per_layer': examples_per_layer,
+              'center_bias': center_bias,
+              'logits_modifier_function': logits_modifier_function,
+              'use_core_libs': True,
+              'output_leaf_index': output_leaf_index,
+              'override_global_step_value': None
+          },
+          output_type=model.ModelBuilderOutputType.ESTIMATOR_SPEC)
+
+    super(CoreGradientBoostedDecisionTreeEstimator, self).__init__(
+        model_fn=_model_fn, model_dir=model_dir, config=config)
+
+
+class CoreGradientBoostedDecisionTreeRanker(core_estimator.Estimator):
+  """A ranking estimator using gradient boosted decision trees."""
+
+  def __init__(self,
+               learner_config,
+               examples_per_layer,
+               head,
+               ranking_model_pair_keys,
+               num_trees=None,
+               feature_columns=None,
+               weight_column_name=None,
+               model_dir=None,
+               config=None,
+               label_keys=None,
+               logits_modifier_function=None,
+               center_bias=False,
+               output_leaf_index=False):
+    """Initializes a GradientBoostedDecisionTreeRanker instance.
+
+    This is an estimator that can be trained off the pairwise data and can be
+    used for inference on non-paired data. This is essentially LambdaMart.
+    Args:
+      learner_config: A config for the learner.
+      examples_per_layer: Number of examples to accumulate before growing a
+        layer. It can also be a function that computes the number of examples
+        based on the depth of the layer that's being built.
+      head: `Head` instance.
+      ranking_model_pair_keys: Keys to distinguish between features
+        for left and right part of the training pairs for ranking. For example,
+        for an Example with features "a.f1" and "b.f1", the keys would be
+        ("a", "b").
+      num_trees: An int, number of trees to build.
+      feature_columns: A list of feature columns.
+      weight_column_name: Name of the column for weights, or None if not
+        weighted.
+      model_dir: Directory for model exports, etc.
+      config: `RunConfig` object to configure the runtime settings.
+      label_keys: Optional list of strings with size `[n_classes]` defining the
+        label vocabulary. Only supported for `n_classes` > 2.
+      logits_modifier_function: A modifier function for the logits.
+      center_bias: Whether a separate tree should be created for first fitting
+        the bias.
+      output_leaf_index: whether to output leaf indices along with predictions
+        during inference. The leaf node indexes are available in predictions
+        dict by the key 'leaf_index'. It is a Tensor of rank 2 and its shape is
+        [batch_size, num_trees].
+        For example,
+        result_iter = classifier.predict(...)
+        for result_dict in result_iter:
+          # access leaf index list by result_dict["leaf_index"]
+          # which contains one leaf index per tree
+
+    Raises:
+      ValueError: If learner_config is not valid.
+    """
+
+    def _model_fn(features, labels, mode, config):
+      return model.ranking_model_builder(
+          features=features,
+          labels=labels,
+          mode=mode,
+          config=config,
+          params={
+              'head': head,
+              'n_classes': 2,
+              'feature_columns': feature_columns,
+              'learner_config': learner_config,
+              'num_trees': num_trees,
+              'weight_column_name': weight_column_name,
+              'examples_per_layer': examples_per_layer,
+              'center_bias': center_bias,
+              'logits_modifier_function': logits_modifier_function,
+              'use_core_libs': True,
+              'output_leaf_index': output_leaf_index,
+              'ranking_model_pair_keys': ranking_model_pair_keys,
+              'override_global_step_value': None
+          },
+          output_type=model.ModelBuilderOutputType.ESTIMATOR_SPEC)
+
+    super(CoreGradientBoostedDecisionTreeRanker, self).__init__(
+        model_fn=_model_fn, model_dir=model_dir, config=config)
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py b/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py
index 75ef1b0500..68d710d713 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/estimator_test.py
@@ -25,10 +25,12 @@ from tensorflow.python.estimator.canned import head as head_lib
 from tensorflow.python.feature_column import feature_column_lib as core_feature_column
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import googletest
+from tensorflow.python.training import checkpoint_utils
 
 
 def _train_input_fn():
@@ -37,18 +39,50 @@ def _train_input_fn():
   return features, label
 
 
+def _multiclass_train_input_fn():
+  features = {
+      "x": constant_op.constant([[2.], [1.], [1.], [5.], [3.5], [4.6], [3.5]])
+  }
+  label = constant_op.constant(
+      [[1], [0], [0], [2], [2], [0], [1]], dtype=dtypes.int32)
+  return features, label
+
+
+def _ranking_train_input_fn():
+  features = {
+      "a.f1": constant_op.constant([[3.], [0.3], [1.]]),
+      "a.f2": constant_op.constant([[0.1], [3.], [1.]]),
+      "b.f1": constant_op.constant([[13.], [0.4], [5.]]),
+      "b.f2": constant_op.constant([[1.], [3.], [0.01]]),
+  }
+  label = constant_op.constant([[0], [0], [1]], dtype=dtypes.int32)
+  return features, label
+
+
 def _eval_input_fn():
   features = {"x": constant_op.constant([[1.], [2.], [2.]])}
   label = constant_op.constant([[0], [1], [1]], dtype=dtypes.int32)
   return features, label
 
 
+def _infer_ranking_train_input_fn():
+  features = {
+      "f1": constant_op.constant([[3.], [2], [1.]]),
+      "f2": constant_op.constant([[0.1], [3.], [1.]])
+  }
+  return features, None
+
+
 class BoostedTreeEstimatorTest(test_util.TensorFlowTestCase):
 
   def setUp(self):
     self._export_dir_base = tempfile.mkdtemp() + "export/"
     gfile.MkDir(self._export_dir_base)
 
+  def _assert_checkpoint(self, model_dir, global_step):
+    reader = checkpoint_utils.load_checkpoint(model_dir)
+    self.assertEqual(global_step, reader.get_tensor(ops.GraphKeys.GLOBAL_STEP))
+
   def testFitAndEvaluateDontThrowException(self):
     learner_config = learner_pb2.LearnerConfig()
     learner_config.num_classes = 2
@@ -155,6 +189,290 @@ class BoostedTreeEstimatorTest(test_util.TensorFlowTestCase):
     regressor.evaluate(input_fn=_eval_input_fn, steps=1)
     regressor.export(self._export_dir_base)
 
+  def testRankingDontThrowExceptionForForEstimator(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 1
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    head_fn = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
+        loss_reduction=losses.Reduction.SUM_OVER_NONZERO_WEIGHTS)
+
+    model = estimator.GradientBoostedDecisionTreeRanker(
+        head=head_fn,
+        learner_config=learner_config,
+        num_trees=1,
+        examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        use_core_libs=True,
+        feature_columns=[
+            core_feature_column.numeric_column("f1"),
+            core_feature_column.numeric_column("f2")
+        ],
+        ranking_model_pair_keys=("a", "b"))
+
+    model.fit(input_fn=_ranking_train_input_fn, steps=1000)
+    model.evaluate(input_fn=_ranking_train_input_fn, steps=1)
+    model.predict(input_fn=_infer_ranking_train_input_fn)
+
+  def testDoesNotOverrideGlobalSteps(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 2
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.GradientBoostedDecisionTreeClassifier(
+        learner_config=learner_config,
+        num_trees=1,
+        examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[contrib_feature_column.real_valued_column("x")],
+        output_leaf_index=False)
+
+    classifier.fit(input_fn=_train_input_fn, steps=15)
+    # When no override of global steps, 5 steps were used.
+    self._assert_checkpoint(classifier.model_dir, global_step=5)
+
+  def testOverridesGlobalSteps(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 2
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.GradientBoostedDecisionTreeClassifier(
+        learner_config=learner_config,
+        num_trees=1,
+        examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[contrib_feature_column.real_valued_column("x")],
+        output_leaf_index=False,
+        override_global_step_value=10000000)
+
+    classifier.fit(input_fn=_train_input_fn, steps=15)
+    self._assert_checkpoint(classifier.model_dir, global_step=10000000)
+
+  def testFitAndEvaluateMultiClassTreePerClassDontThrowException(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 3
+    learner_config.constraints.max_tree_depth = 1
+    learner_config.multi_class_strategy = (
+        learner_pb2.LearnerConfig.TREE_PER_CLASS)
+
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.GradientBoostedDecisionTreeClassifier(
+        learner_config=learner_config,
+        n_classes=learner_config.num_classes,
+        num_trees=1,
+        examples_per_layer=7,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[contrib_feature_column.real_valued_column("x")])
+
+    classifier.fit(input_fn=_multiclass_train_input_fn, steps=100)
+    classifier.evaluate(input_fn=_eval_input_fn, steps=1)
+    classifier.export(self._export_dir_base)
+    result_iter = classifier.predict(input_fn=_eval_input_fn)
+    for prediction_dict in result_iter:
+      self.assertTrue("classes" in prediction_dict)
+
+  def testFitAndEvaluateMultiClassDiagonalDontThrowException(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 3
+    learner_config.constraints.max_tree_depth = 1
+    learner_config.multi_class_strategy = (
+        learner_pb2.LearnerConfig.DIAGONAL_HESSIAN)
+
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.GradientBoostedDecisionTreeClassifier(
+        learner_config=learner_config,
+        n_classes=learner_config.num_classes,
+        num_trees=1,
+        examples_per_layer=7,
+        model_dir=model_dir,
+        config=config,
+        center_bias=False,
+        feature_columns=[contrib_feature_column.real_valued_column("x")])
+
+    classifier.fit(input_fn=_multiclass_train_input_fn, steps=100)
+    classifier.evaluate(input_fn=_eval_input_fn, steps=1)
+    classifier.export(self._export_dir_base)
+    result_iter = classifier.predict(input_fn=_eval_input_fn)
+    for prediction_dict in result_iter:
+      self.assertTrue("classes" in prediction_dict)
+
+  def testFitAndEvaluateMultiClassFullDontThrowException(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 3
+    learner_config.constraints.max_tree_depth = 1
+    learner_config.multi_class_strategy = (
+        learner_pb2.LearnerConfig.FULL_HESSIAN)
+
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.GradientBoostedDecisionTreeClassifier(
+        learner_config=learner_config,
+        n_classes=learner_config.num_classes,
+        num_trees=1,
+        examples_per_layer=7,
+        model_dir=model_dir,
+        config=config,
+        center_bias=False,
+        feature_columns=[contrib_feature_column.real_valued_column("x")])
+
+    classifier.fit(input_fn=_multiclass_train_input_fn, steps=100)
+    classifier.evaluate(input_fn=_eval_input_fn, steps=1)
+    classifier.export(self._export_dir_base)
+    result_iter = classifier.predict(input_fn=_eval_input_fn)
+    for prediction_dict in result_iter:
+      self.assertTrue("classes" in prediction_dict)
+
+
+class CoreGradientBoostedDecisionTreeEstimators(test_util.TensorFlowTestCase):
+
+  def testTrainEvaluateInferDoesNotThrowError(self):
+    head_fn = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
+        loss_reduction=losses.Reduction.SUM_OVER_NONZERO_WEIGHTS)
+
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 1
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    est = estimator.CoreGradientBoostedDecisionTreeEstimator(
+        head=head_fn,
+        learner_config=learner_config,
+        num_trees=1,
+        examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[core_feature_column.numeric_column("x")])
+
+    # Train for a few steps.
+    est.train(input_fn=_train_input_fn, steps=1000)
+    est.evaluate(input_fn=_eval_input_fn, steps=1)
+    est.predict(input_fn=_eval_input_fn)
+
+  def testRankingDontThrowExceptionForForEstimator(self):
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = 2
+    learner_config.constraints.max_tree_depth = 1
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    head_fn = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(
+        loss_reduction=losses.Reduction.SUM_OVER_NONZERO_WEIGHTS)
+
+    est = estimator.CoreGradientBoostedDecisionTreeRanker(
+        head=head_fn,
+        learner_config=learner_config,
+        num_trees=1,
+        examples_per_layer=3,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[
+            core_feature_column.numeric_column("f1"),
+            core_feature_column.numeric_column("f2")
+        ],
+        ranking_model_pair_keys=("a", "b"))
+
+    # Train for a few steps.
+    est.train(input_fn=_ranking_train_input_fn, steps=1000)
+    est.evaluate(input_fn=_ranking_train_input_fn, steps=1)
+    est.predict(input_fn=_infer_ranking_train_input_fn)
+
+  def testFitAndEvaluateMultiClassTreePerClasssDontThrowException(self):
+    n_classes = 3
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = n_classes
+    learner_config.constraints.max_tree_depth = 1
+    learner_config.multi_class_strategy = (
+        learner_pb2.LearnerConfig.TREE_PER_CLASS)
+
+    head_fn = estimator.core_multiclass_head(n_classes=n_classes)
+
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.CoreGradientBoostedDecisionTreeEstimator(
+        learner_config=learner_config,
+        head=head_fn,
+        num_trees=1,
+        center_bias=False,
+        examples_per_layer=7,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[core_feature_column.numeric_column("x")])
+
+    classifier.train(input_fn=_multiclass_train_input_fn, steps=100)
+    classifier.evaluate(input_fn=_multiclass_train_input_fn, steps=1)
+    classifier.predict(input_fn=_eval_input_fn)
+
+  def testFitAndEvaluateMultiClassDiagonalDontThrowException(self):
+    n_classes = 3
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = n_classes
+    learner_config.constraints.max_tree_depth = 1
+    learner_config.multi_class_strategy = (
+        learner_pb2.LearnerConfig.DIAGONAL_HESSIAN)
+
+    head_fn = estimator.core_multiclass_head(n_classes=n_classes)
+
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.CoreGradientBoostedDecisionTreeEstimator(
+        learner_config=learner_config,
+        head=head_fn,
+        num_trees=1,
+        center_bias=False,
+        examples_per_layer=7,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[core_feature_column.numeric_column("x")])
+
+    classifier.train(input_fn=_multiclass_train_input_fn, steps=100)
+    classifier.evaluate(input_fn=_multiclass_train_input_fn, steps=1)
+    classifier.predict(input_fn=_eval_input_fn)
+
+  def testFitAndEvaluateMultiClassFullDontThrowException(self):
+    n_classes = 3
+    learner_config = learner_pb2.LearnerConfig()
+    learner_config.num_classes = n_classes
+    learner_config.constraints.max_tree_depth = 1
+    learner_config.multi_class_strategy = (
+        learner_pb2.LearnerConfig.FULL_HESSIAN)
+
+    head_fn = estimator.core_multiclass_head(n_classes=n_classes)
+
+    model_dir = tempfile.mkdtemp()
+    config = run_config.RunConfig()
+
+    classifier = estimator.CoreGradientBoostedDecisionTreeEstimator(
+        learner_config=learner_config,
+        head=head_fn,
+        num_trees=1,
+        center_bias=False,
+        examples_per_layer=7,
+        model_dir=model_dir,
+        config=config,
+        feature_columns=[core_feature_column.numeric_column("x")])
+
+    classifier.train(input_fn=_multiclass_train_input_fn, steps=100)
+    classifier.evaluate(input_fn=_multiclass_train_input_fn, steps=1)
+    classifier.predict(input_fn=_eval_input_fn)
+
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/model.py b/tensorflow/contrib/boosted_trees/estimator_batch/model.py
index 1ee8911989..04b46c3483 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/model.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/model.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import copy
 
+from tensorflow.contrib import learn
 from tensorflow.contrib.boosted_trees.estimator_batch import estimator_utils
 from tensorflow.contrib.boosted_trees.estimator_batch import trainer_hooks
 from tensorflow.contrib.boosted_trees.python.ops import model_ops
@@ -28,8 +29,17 @@ from tensorflow.python.framework import ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.training import training_util
 
+class ModelBuilderOutputType(object):
+  MODEL_FN_OPS = 0
+  ESTIMATOR_SPEC = 1
 
-def model_builder(features, labels, mode, params, config):
+
+def model_builder(features,
+                  labels,
+                  mode,
+                  params,
+                  config,
+                  output_type=ModelBuilderOutputType.MODEL_FN_OPS):
   """Multi-machine batch gradient descent tree model.
 
   Args:
@@ -48,7 +58,13 @@ def model_builder(features, labels, mode, params, config):
       * weight_column_name: The name of weight column.
       * center_bias: Whether a separate tree should be created for first fitting
           the bias.
+      * override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This is particularly useful for hyper
+        parameter tuning, which can't recognize early stopping due to the number
+        of trees. If None, no override of global step will happen.
     config: `RunConfig` of the estimator.
+    output_type: Whether to return ModelFnOps (old interface) or EstimatorSpec
+      (new interface).
 
   Returns:
     A `ModelFnOps` object.
@@ -64,6 +80,7 @@ def model_builder(features, labels, mode, params, config):
   use_core_libs = params["use_core_libs"]
   logits_modifier_function = params["logits_modifier_function"]
   output_leaf_index = params["output_leaf_index"]
+  override_global_step_value = params.get("override_global_step_value", None)
 
   if features is None:
     raise ValueError("At least one feature must be specified.")
@@ -115,29 +132,271 @@ def model_builder(features, labels, mode, params, config):
         return update_op
 
   create_estimator_spec_op = getattr(head, "create_estimator_spec", None)
-  if use_core_libs and callable(create_estimator_spec_op):
-    model_fn_ops = head.create_estimator_spec(
+
+  training_hooks = []
+  if num_trees:
+    if center_bias:
+      num_trees += 1
+
+    finalized_trees, attempted_trees = gbdt_model.get_number_of_trees_tensor()
+    training_hooks.append(
+        trainer_hooks.StopAfterNTrees(num_trees, attempted_trees,
+                                      finalized_trees,
+                                      override_global_step_value))
+
+  if output_type == ModelBuilderOutputType.MODEL_FN_OPS:
+    if use_core_libs and callable(create_estimator_spec_op):
+      model_fn_ops = head.create_estimator_spec(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_train_op_fn,
+          logits=logits)
+      model_fn_ops = estimator_utils.estimator_spec_to_model_fn_ops(
+          model_fn_ops)
+    else:
+      model_fn_ops = head.create_model_fn_ops(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_train_op_fn,
+          logits=logits)
+
+    if output_leaf_index and gbdt_batch.LEAF_INDEX in predictions_dict:
+      model_fn_ops.predictions[gbdt_batch.LEAF_INDEX] = predictions_dict[
+          gbdt_batch.LEAF_INDEX]
+
+    model_fn_ops.training_hooks.extend(training_hooks)
+    return model_fn_ops
+  elif output_type == ModelBuilderOutputType.ESTIMATOR_SPEC:
+    assert callable(create_estimator_spec_op)
+    estimator_spec = head.create_estimator_spec(
         features=features,
         mode=mode,
         labels=labels,
         train_op_fn=_train_op_fn,
         logits=logits)
-    model_fn_ops = estimator_utils.estimator_spec_to_model_fn_ops(model_fn_ops)
+
+    estimator_spec = estimator_spec._replace(
+        training_hooks=training_hooks + list(estimator_spec.training_hooks))
+    return estimator_spec
+
+  return model_fn_ops
+
+
+def ranking_model_builder(features,
+                          labels,
+                          mode,
+                          params,
+                          config,
+                          output_type=ModelBuilderOutputType.MODEL_FN_OPS):
+  """Multi-machine batch gradient descent tree model for ranking.
+
+  Args:
+    features: `Tensor` or `dict` of `Tensor` objects.
+    labels: Labels used to train on.
+    mode: Mode we are in. (TRAIN/EVAL/INFER)
+    params: A dict of hyperparameters.
+      The following hyperparameters are expected:
+      * head: A `Head` instance.
+      * learner_config: A config for the learner.
+      * feature_columns: An iterable containing all the feature columns used by
+          the model.
+      * examples_per_layer: Number of examples to accumulate before growing a
+          layer. It can also be a function that computes the number of examples
+          based on the depth of the layer that's being built.
+      * weight_column_name: The name of weight column.
+      * center_bias: Whether a separate tree should be created for first fitting
+          the bias.
+      * ranking_model_pair_keys (Optional): Keys to distinguish between features
+        for left and right part of the training pairs for ranking. For example,
+        for an Example with features "a.f1" and "b.f1", the keys would be
+        ("a", "b").
+      * override_global_step_value: If after the training is done, global step
+        value must be reset to this value. This is particularly useful for hyper
+        parameter tuning, which can't recognize early stopping due to the number
+        of trees. If None, no override of global step will happen.
+    config: `RunConfig` of the estimator.
+    output_type: Whether to return ModelFnOps (old interface) or EstimatorSpec
+      (new interface).
+
+
+  Returns:
+    A `ModelFnOps` object.
+  Raises:
+    ValueError: if inputs are not valid.
+  """
+  head = params["head"]
+  learner_config = params["learner_config"]
+  examples_per_layer = params["examples_per_layer"]
+  feature_columns = params["feature_columns"]
+  weight_column_name = params["weight_column_name"]
+  num_trees = params["num_trees"]
+  use_core_libs = params["use_core_libs"]
+  logits_modifier_function = params["logits_modifier_function"]
+  output_leaf_index = params["output_leaf_index"]
+  ranking_model_pair_keys = params["ranking_model_pair_keys"]
+  override_global_step_value = params.get("override_global_step_value", None)
+
+  if features is None:
+    raise ValueError("At least one feature must be specified.")
+
+  if config is None:
+    raise ValueError("Missing estimator RunConfig.")
+
+  center_bias = params["center_bias"]
+
+  if isinstance(features, ops.Tensor):
+    features = {features.name: features}
+
+  # Make a shallow copy of features to ensure downstream usage
+  # is unaffected by modifications in the model function.
+  training_features = copy.copy(features)
+  training_features.pop(weight_column_name, None)
+  global_step = training_util.get_global_step()
+  with ops.device(global_step.device):
+    ensemble_handle = model_ops.tree_ensemble_variable(
+        stamp_token=0,
+        tree_ensemble_config="",  # Initialize an empty ensemble.
+        name="ensemble_model")
+
+  # Extract the features.
+  if mode == learn.ModeKeys.TRAIN or mode == learn.ModeKeys.EVAL:
+    # For ranking pairwise training, we extract two sets of features.
+    if len(ranking_model_pair_keys) != 2:
+      raise ValueError("You must provide keys for ranking.")
+    left_pair_key = ranking_model_pair_keys[0]
+    right_pair_key = ranking_model_pair_keys[1]
+    if left_pair_key is None or right_pair_key is None:
+      raise ValueError("Both pair keys should be provided for ranking.")
+
+    features_1 = {}
+    features_2 = {}
+    for name in training_features:
+      feature = training_features[name]
+      new_name = name[2:]
+      if name.startswith(left_pair_key + "."):
+        features_1[new_name] = feature
+      else:
+        assert name.startswith(right_pair_key + ".")
+        features_2[new_name] = feature
+
+    main_features = features_1
+    supplementary_features = features_2
   else:
-    model_fn_ops = head.create_model_fn_ops(
+    # For non-ranking or inference ranking, we have only 1 set of features.
+    main_features = training_features
+
+  # Create GBDT model.
+  gbdt_model_main = gbdt_batch.GradientBoostedDecisionTreeModel(
+      is_chief=config.is_chief,
+      num_ps_replicas=config.num_ps_replicas,
+      ensemble_handle=ensemble_handle,
+      center_bias=center_bias,
+      examples_per_layer=examples_per_layer,
+      learner_config=learner_config,
+      feature_columns=feature_columns,
+      logits_dimension=head.logits_dimension,
+      features=main_features,
+      use_core_columns=use_core_libs,
+      output_leaf_index=output_leaf_index)
+
+  with ops.name_scope("gbdt", "gbdt_optimizer"):
+    # Logits for inference.
+    if mode == learn.ModeKeys.INFER:
+      predictions_dict = gbdt_model_main.predict(mode)
+      logits = predictions_dict[gbdt_batch.PREDICTIONS]
+      if logits_modifier_function:
+        logits = logits_modifier_function(logits, features, mode)
+    else:
+      gbdt_model_supplementary = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=config.is_chief,
+          num_ps_replicas=config.num_ps_replicas,
+          ensemble_handle=ensemble_handle,
+          center_bias=center_bias,
+          examples_per_layer=examples_per_layer,
+          learner_config=learner_config,
+          feature_columns=feature_columns,
+          logits_dimension=head.logits_dimension,
+          features=supplementary_features,
+          use_core_columns=use_core_libs,
+          output_leaf_index=output_leaf_index)
+
+      # Logits for train and eval.
+      if not supplementary_features:
+        raise ValueError("Features for ranking must be specified.")
+
+      predictions_dict_1 = gbdt_model_main.predict(mode)
+      predictions_1 = predictions_dict_1[gbdt_batch.PREDICTIONS]
+
+      predictions_dict_2 = gbdt_model_supplementary.predict(mode)
+      predictions_2 = predictions_dict_2[gbdt_batch.PREDICTIONS]
+
+      logits = predictions_1 - predictions_2
+      if logits_modifier_function:
+        logits = logits_modifier_function(logits, features, mode)
+
+      predictions_dict = predictions_dict_1
+      predictions_dict[gbdt_batch.PREDICTIONS] = logits
+
+    def _train_op_fn(loss):
+      """Returns the op to optimize the loss."""
+      update_op = gbdt_model_main.train(loss, predictions_dict, labels)
+      with ops.control_dependencies(
+          [update_op]), (ops.colocate_with(global_step)):
+        update_op = state_ops.assign_add(global_step, 1).op
+        return update_op
+
+  create_estimator_spec_op = getattr(head, "create_estimator_spec", None)
+
+  training_hooks = []
+  if num_trees:
+    if center_bias:
+      num_trees += 1
+
+    finalized_trees, attempted_trees = (
+        gbdt_model_main.get_number_of_trees_tensor())
+    training_hooks.append(
+        trainer_hooks.StopAfterNTrees(num_trees, attempted_trees,
+                                      finalized_trees,
+                                      override_global_step_value))
+
+  if output_type == ModelBuilderOutputType.MODEL_FN_OPS:
+    if use_core_libs and callable(create_estimator_spec_op):
+      model_fn_ops = head.create_estimator_spec(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_train_op_fn,
+          logits=logits)
+      model_fn_ops = estimator_utils.estimator_spec_to_model_fn_ops(
+          model_fn_ops)
+    else:
+      model_fn_ops = head.create_model_fn_ops(
+          features=features,
+          mode=mode,
+          labels=labels,
+          train_op_fn=_train_op_fn,
+          logits=logits)
+
+    if output_leaf_index and gbdt_batch.LEAF_INDEX in predictions_dict:
+      model_fn_ops.predictions[gbdt_batch.LEAF_INDEX] = predictions_dict[
+          gbdt_batch.LEAF_INDEX]
+
+    model_fn_ops.training_hooks.extend(training_hooks)
+    return model_fn_ops
+
+  elif output_type == ModelBuilderOutputType.ESTIMATOR_SPEC:
+    assert callable(create_estimator_spec_op)
+    estimator_spec = head.create_estimator_spec(
         features=features,
         mode=mode,
         labels=labels,
         train_op_fn=_train_op_fn,
         logits=logits)
-  if output_leaf_index and gbdt_batch.LEAF_INDEX in predictions_dict:
-    model_fn_ops.predictions[gbdt_batch.LEAF_INDEX] = predictions_dict[
-        gbdt_batch.LEAF_INDEX]
-  if num_trees:
-    if center_bias:
-      num_trees += 1
-    finalized_trees, attempted_trees = gbdt_model.get_number_of_trees_tensor()
-    model_fn_ops.training_hooks.append(
-        trainer_hooks.StopAfterNTrees(num_trees, attempted_trees,
-                                      finalized_trees))
+
+    estimator_spec = estimator_spec._replace(
+        training_hooks=training_hooks + list(estimator_spec.training_hooks))
+    return estimator_spec
+
   return model_fn_ops
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/trainer_hooks.py b/tensorflow/contrib/boosted_trees/estimator_batch/trainer_hooks.py
index 2e4151cac4..f137ada355 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/trainer_hooks.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/trainer_hooks.py
@@ -25,6 +25,7 @@ from tensorflow.contrib.learn.python.learn.session_run_hook import SessionRunArg
 from tensorflow.core.framework.summary_pb2 import Summary
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import state_ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import training_util
 from tensorflow.python.training.summary_io import SummaryWriterCache
@@ -150,12 +151,23 @@ class FeedFnHook(session_run_hook.SessionRunHook):
 class StopAfterNTrees(session_run_hook.SessionRunHook):
   """Stop training after building N full trees."""
 
-  def __init__(self, n, num_attempted_trees_tensor, num_finalized_trees_tensor):
+  def __init__(self, n, num_attempted_trees_tensor, num_finalized_trees_tensor,
+               override_global_step_value=None):
     self._num_trees = n
     # num_attempted_trees_tensor and num_finalized_trees_tensor are both
     # tensors.
     self._num_attempted_trees_tensor = num_attempted_trees_tensor
     self._num_finalized_trees_tensor = num_finalized_trees_tensor
+    self._override_global_step_value = override_global_step_value
+
+  def begin(self):
+    self._global_step_tensor = training_util.get_global_step()
+    if self._global_step_tensor is None:
+      raise RuntimeError("Global step should be created.")
+
+    if self._override_global_step_value is not None:
+      self._override_global_step_op = state_ops.assign(
+          self._global_step_tensor, self._override_global_step_value)
 
   def before_run(self, run_context):
     del run_context  # unused by StopTrainingAfterNTrees.
@@ -175,6 +187,9 @@ class StopAfterNTrees(session_run_hook.SessionRunHook):
         num_attempted_trees > 2 * self._num_trees):
       logging.info("Requesting stop since we have reached %d trees.",
                    num_finalized_trees)
+      if self._override_global_step_value is not None:
+        logging.info("Overriding global steps value.")
+        run_context.session.run(self._override_global_step_op)
       run_context.request_stop()
 
 
diff --git a/tensorflow/contrib/boosted_trees/examples/boston.py b/tensorflow/contrib/boosted_trees/examples/boston.py
index e9dbdb0fd7..54c4ff059e 100644
--- a/tensorflow/contrib/boosted_trees/examples/boston.py
+++ b/tensorflow/contrib/boosted_trees/examples/boston.py
@@ -45,6 +45,7 @@ from tensorflow.contrib.boosted_trees.estimator_batch.estimator import GradientB
 from tensorflow.contrib.boosted_trees.proto import learner_pb2
 from tensorflow.contrib.layers.python.layers import feature_column
 from tensorflow.contrib.learn import learn_runner
+from tensorflow.python.util import compat
 
 _BOSTON_NUM_FEATURES = 13
 
@@ -79,7 +80,8 @@ def _convert_fn(dtec, sorted_feature_names, num_dense, num_sparse_float,
                 num_sparse_int, export_dir, unused_eval_result):
   universal_format = custom_export_strategy.convert_to_universal_format(
       dtec, sorted_feature_names, num_dense, num_sparse_float, num_sparse_int)
-  with tf.gfile.GFile(os.path.join(export_dir, "tree_proto"), "w") as f:
+  with tf.gfile.GFile(os.path.join(
+      compat.as_bytes(export_dir), compat.as_bytes("tree_proto")), "w") as f:
     f.write(str(universal_format))
 
 
diff --git a/tensorflow/contrib/boosted_trees/kernels/quantile_ops.cc b/tensorflow/contrib/boosted_trees/kernels/quantile_ops.cc
index 0b28f81e7c..5b4be2f258 100644
--- a/tensorflow/contrib/boosted_trees/kernels/quantile_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/quantile_ops.cc
@@ -241,6 +241,11 @@ class CreateQuantileAccumulatorOp : public OpKernel {
     // other exceptions. If one already exists, it unrefs the new one.
     const Tensor* stamp_token_t;
     OP_REQUIRES_OK(context, context->input(kStampTokenName, &stamp_token_t));
+    // An epsilon value of zero could cause perfoamance issues and is therefore,
+    // disallowed.
+    OP_REQUIRES(
+        context, epsilon_ > 0,
+        errors::InvalidArgument("An epsilon value of zero is not allowed."));
     auto result = new QuantileStreamResource(epsilon_, num_quantiles_,
                                              max_elements_, generate_quantiles_,
                                              stamp_token_t->scalar<int64>()());
diff --git a/tensorflow/contrib/boosted_trees/kernels/training_ops.cc b/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
index 1bfeed3066..6d9a6ee5a0 100644
--- a/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
@@ -372,12 +372,18 @@ class GrowTreeEnsembleOp : public OpKernel {
       return;
     }
 
+    // Get the max tree depth.
+    const Tensor* max_tree_depth_t;
+    OP_REQUIRES_OK(context,
+                   context->input("max_tree_depth", &max_tree_depth_t));
+    const int32 max_tree_depth = max_tree_depth_t->scalar<int32>()();
+
     // Update and retrieve the growable tree.
     // If the tree is fully built and dropout was applied, it also adjusts the
     // weights of dropped and the last tree.
     boosted_trees::trees::DecisionTreeConfig* const tree_config =
         UpdateAndRetrieveGrowableTree(ensemble_resource, learning_rate,
-                                      dropout_seed);
+                                      dropout_seed, max_tree_depth);
 
     // Split tree nodes.
     for (auto& split_entry : best_splits) {
@@ -494,7 +500,8 @@ class GrowTreeEnsembleOp : public OpKernel {
   boosted_trees::trees::DecisionTreeConfig* UpdateAndRetrieveGrowableTree(
       boosted_trees::models::DecisionTreeEnsembleResource* const
           ensemble_resource,
-      const float learning_rate, const uint64 dropout_seed) {
+      const float learning_rate, const uint64 dropout_seed,
+      const int32 max_tree_depth) {
     const auto num_trees = ensemble_resource->num_trees();
     if (num_trees <= 0 ||
         ensemble_resource->LastTreeMetadata()->is_finalized()) {
@@ -506,8 +513,7 @@ class GrowTreeEnsembleOp : public OpKernel {
       tree_config->add_nodes()->mutable_leaf();
       boosted_trees::trees::DecisionTreeMetadata* const tree_metadata =
           ensemble_resource->LastTreeMetadata();
-      tree_metadata->set_is_finalized(
-          learner_config_.constraints().max_tree_depth() <= 1);
+      tree_metadata->set_is_finalized(max_tree_depth <= 1);
       tree_metadata->set_num_tree_weight_updates(1);
     } else {
       // The growable tree is by definition the last tree in the ensemble.
@@ -518,8 +524,7 @@ class GrowTreeEnsembleOp : public OpKernel {
               << num_trees - 1 << " of ensemble of " << num_trees << " trees.";
       // Update growable tree metadata.
       tree_metadata->set_num_layers_grown(new_num_layers);
-      tree_metadata->set_is_finalized(
-          new_num_layers >= learner_config_.constraints().max_tree_depth());
+      tree_metadata->set_is_finalized(new_num_layers >= max_tree_depth);
     }
     UpdateTreeWeightsIfDropout(ensemble_resource, dropout_seed);
     return ensemble_resource->LastTree();
diff --git a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h
index c120dd8a6c..f19e5116f5 100644
--- a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h
+++ b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_stream.h
@@ -58,6 +58,8 @@ namespace quantiles {
 // Compute: O(n * log(1/eps * log(eps * n))).
 // Memory: O(1/eps * log^2(eps * n)) <- for one worker streaming through the
 //                                      entire dataset.
+// An epsilon value of zero would make the algorithm extremely inefficent and
+// therefore, is disallowed.
 template <typename ValueType, typename WeightType,
           typename CompareFn = std::less<ValueType>>
 class WeightedQuantilesStream {
@@ -69,6 +71,9 @@ class WeightedQuantilesStream {
 
   explicit WeightedQuantilesStream(double eps, int64 max_elements)
       : eps_(eps), buffer_(1LL, 2LL), finalized_(false) {
+    // See the class documentation. An epsilon value of zero could cause
+    // perfoamance issues.
+    QCHECK(eps > 0) << "An epsilon value of zero is not allowed.";
     std::tie(max_levels_, block_size_) = GetQuantileSpecs(eps, max_elements);
     buffer_ = Buffer(block_size_, max_elements);
     summary_levels_.reserve(max_levels_);
diff --git a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h
index a7e7bfc13c..69bb8fd4ad 100644
--- a/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h
+++ b/tensorflow/contrib/boosted_trees/lib/quantiles/weighted_quantiles_summary.h
@@ -51,7 +51,7 @@ class WeightedQuantilesSummary {
     SummaryEntry() {
       memset(this, 0, sizeof(*this));
 
-      value = 0;
+      value = ValueType();
       weight = 0;
       min_rank = 0;
       max_rank = 0;
diff --git a/tensorflow/contrib/boosted_trees/lib/utils/batch_features.cc b/tensorflow/contrib/boosted_trees/lib/utils/batch_features.cc
index 35b059f349..4fab2b0b7d 100644
--- a/tensorflow/contrib/boosted_trees/lib/utils/batch_features.cc
+++ b/tensorflow/contrib/boosted_trees/lib/utils/batch_features.cc
@@ -16,6 +16,7 @@
 #include "tensorflow/contrib/boosted_trees/lib/utils/batch_features.h"
 #include "tensorflow/contrib/boosted_trees/lib/utils/macros.h"
 #include "tensorflow/contrib/boosted_trees/lib/utils/tensor_utils.h"
+#include "tensorflow/core/lib/core/errors.h"
 
 namespace tensorflow {
 namespace boosted_trees {
@@ -96,9 +97,11 @@ Status BatchFeatures::Initialize(
             "Sparse float feature shape incompatible with batch size."));
     auto tensor_shape = TensorShape({shape_flat(0), shape_flat(1)});
     auto order_dims = sparse::SparseTensor::VarDimArray({0, 1});
-    sparse_float_feature_columns_.emplace_back(sparse_float_feature_indices,
-                                               sparse_float_feature_values,
-                                               tensor_shape, order_dims);
+    sparse::SparseTensor sparse_tensor;
+    TF_RETURN_IF_ERROR(sparse::SparseTensor::Create(
+        sparse_float_feature_indices, sparse_float_feature_values, tensor_shape,
+        order_dims, &sparse_tensor));
+    sparse_float_feature_columns_.push_back(std::move(sparse_tensor));
   }
 
   // Read sparse int features.
@@ -136,9 +139,11 @@ Status BatchFeatures::Initialize(
             "Sparse int feature shape incompatible with batch size."));
     auto tensor_shape = TensorShape({shape_flat(0), shape_flat(1)});
     auto order_dims = sparse::SparseTensor::VarDimArray({0, 1});
-    sparse_int_feature_columns_.emplace_back(sparse_int_feature_indices,
-                                             sparse_int_feature_values,
-                                             tensor_shape, order_dims);
+    sparse::SparseTensor sparse_tensor;
+    TF_RETURN_IF_ERROR(sparse::SparseTensor::Create(
+        sparse_int_feature_indices, sparse_int_feature_values, tensor_shape,
+        order_dims, &sparse_tensor));
+    sparse_int_feature_columns_.push_back(std::move(sparse_tensor));
   }
   return Status::OK();
 }
diff --git a/tensorflow/contrib/boosted_trees/lib/utils/examples_iterable_test.cc b/tensorflow/contrib/boosted_trees/lib/utils/examples_iterable_test.cc
index d8a6088648..30c37435fe 100644
--- a/tensorflow/contrib/boosted_trees/lib/utils/examples_iterable_test.cc
+++ b/tensorflow/contrib/boosted_trees/lib/utils/examples_iterable_test.cc
@@ -43,27 +43,35 @@ TEST_F(ExamplesIterableTest, Iterate) {
       test::AsTensor<int64>({0, 0, 2, 0, 3, 0, 4, 0}, {4, 2});
   auto sparse_float_values1 = test::AsTensor<float>({-3.0f, 0.0f, 5.0f, 0.0f});
   auto sparse_float_shape1 = TensorShape({8, 1});
-  sparse::SparseTensor sparse_float_tensor1(
-      sparse_float_indices1, sparse_float_values1, sparse_float_shape1);
+  sparse::SparseTensor sparse_float_tensor1;
+  TF_ASSERT_OK(
+      sparse::SparseTensor::Create(sparse_float_indices1, sparse_float_values1,
+                                   sparse_float_shape1, &sparse_float_tensor1));
   auto sparse_float_indices2 = test::AsTensor<int64>(
       {0, 1, 1, 0, 2, 1, 3, 0, 4, 1, 5, 0, 5, 1, 7, 0}, {8, 2});
   auto sparse_float_values2 =
       test::AsTensor<float>({1.f, 4.0f, 3.f, 7.0f, 4.3f, 9.0f, 0.8f, -4.0f});
   auto sparse_float_shape2 = TensorShape({8, 2});
-  sparse::SparseTensor sparse_float_tensor2(
-      sparse_float_indices2, sparse_float_values2, sparse_float_shape2);
+  sparse::SparseTensor sparse_float_tensor2;
+  TF_ASSERT_OK(
+      sparse::SparseTensor::Create(sparse_float_indices2, sparse_float_values2,
+                                   sparse_float_shape2, &sparse_float_tensor2));
   auto sparse_int_indices1 =
       test::AsTensor<int64>({0, 0, 0, 1, 1, 0, 3, 0, 3, 1, 7, 0}, {6, 2});
   auto sparse_int_values1 = test::AsTensor<int64>({1, 8, 0, 2, 0, 5});
   auto sparse_int_shape1 = TensorShape({8, 2});
-  sparse::SparseTensor sparse_int_tensor1(
-      sparse_int_indices1, sparse_int_values1, sparse_int_shape1);
+  sparse::SparseTensor sparse_int_tensor1;
+  TF_ASSERT_OK(
+      sparse::SparseTensor::Create(sparse_int_indices1, sparse_int_values1,
+                                   sparse_int_shape1, &sparse_int_tensor1));
   auto sparse_int_indices2 =
       test::AsTensor<int64>({1, 0, 2, 0, 3, 0, 4, 0}, {4, 2});
   auto sparse_int_values2 = test::AsTensor<int64>({7, 13, 4, 0});
   auto sparse_int_shape2 = TensorShape({8, 1});
-  sparse::SparseTensor sparse_int_tensor2(
-      sparse_int_indices2, sparse_int_values2, sparse_int_shape2);
+  sparse::SparseTensor sparse_int_tensor2;
+  TF_ASSERT_OK(
+      sparse::SparseTensor::Create(sparse_int_indices2, sparse_int_values2,
+                                   sparse_int_shape2, &sparse_int_tensor2));
 
   auto validate_example_features = [](int64 example_idx,
                                       const Example& example) {
diff --git a/tensorflow/contrib/boosted_trees/ops/training_ops.cc b/tensorflow/contrib/boosted_trees/ops/training_ops.cc
index f63c199ad6..22ac9edb72 100644
--- a/tensorflow/contrib/boosted_trees/ops/training_ops.cc
+++ b/tensorflow/contrib/boosted_trees/ops/training_ops.cc
@@ -56,6 +56,7 @@ REGISTER_OP("GrowTreeEnsemble")
     .Input("next_stamp_token: int64")
     .Input("learning_rate: float")
     .Input("dropout_seed: int64")
+    .Input("max_tree_depth: int32")
     .Input("partition_ids: num_handlers * int32")
     .Input("gains: num_handlers * float")
     .Input("splits: num_handlers * string")
@@ -67,6 +68,8 @@ REGISTER_OP("GrowTreeEnsemble")
       TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &unused_input));
       // Dropout seed.
       TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 0, &unused_input));
+      // Maximum tree depth.
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 0, &unused_input));
       return Status::OK();
     })
     .Doc(R"doc(
diff --git a/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py b/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
index 3e524efbea..e39e1de8d1 100644
--- a/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
+++ b/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
@@ -296,7 +296,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
           growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE,
           # Dropout does not change anything here, tree is not finalized.
-          dropout_probability=0.5).SerializeToString()
+          dropout_probability=0.5)
 
       # Prepare handler inputs.
       # Note that handlers 1 & 3 have the same gain but different splits.
@@ -321,9 +321,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           ],
           gains=[handler1_gains, handler2_gains, handler3_gains],
           splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the simpler split from handler 1 to be chosen.
@@ -443,7 +444,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
           growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE,
           # Dropout does not change anything here - tree is not finalized.
-          dropout_probability=0.5).SerializeToString()
+          dropout_probability=0.5)
 
       # Prepare handler inputs.
       # Handler 1 only has a candidate for partition 1, handler 2 has candidates
@@ -472,9 +473,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           ],
           gains=[handler1_gains, handler2_gains, handler3_gains],
           splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the split for partition 1 to be chosen from handler 1 and
@@ -632,8 +634,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           max_depth=1,
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE).SerializeToString(
-          )
+          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
 
       # Prepare handler inputs.
       handler1_partitions = np.array([0], dtype=np.int32)
@@ -657,9 +658,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           ],
           gains=[handler1_gains, handler2_gains, handler3_gains],
           splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect a new tree to be added with the split from handler 1.
@@ -773,8 +775,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           max_depth=1,
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE).SerializeToString(
-          )
+          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
 
       # Prepare handler inputs.
       # All handlers have negative gain.
@@ -794,9 +795,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           partition_ids=[handler1_partitions, handler2_partitions],
           gains=[handler1_gains, handler2_gains],
           splits=[handler1_split, handler2_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the ensemble to be empty.
@@ -839,8 +841,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           max_depth=1,
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.POST_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE).SerializeToString(
-          )
+          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
 
       # Prepare handler inputs.
       # Note that handlers 1 & 3 have the same gain but different splits.
@@ -865,9 +866,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           ],
           gains=[handler1_gains, handler2_gains, handler3_gains],
           splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the simpler split from handler 1 to be chosen.
@@ -946,8 +948,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           max_depth=2,
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.POST_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE).SerializeToString(
-          )
+          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
 
       # Prepare handler inputs.
       # All handlers have negative gain.
@@ -967,9 +968,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           partition_ids=[handler1_partitions, handler2_partitions],
           gains=[handler1_gains, handler2_gains],
           splits=[handler1_split, handler2_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the split from handler 2 to be chosen despite the negative gain.
@@ -1048,9 +1050,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           partition_ids=[handler1_partitions],
           gains=[handler1_gains],
           splits=[handler1_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the ensemble to be empty as post-pruning will prune
@@ -1094,8 +1097,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           max_depth=2,
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.POST_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE).SerializeToString(
-          )
+          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
 
       # Prepare handler inputs.
       # Second handler has positive gain.
@@ -1115,9 +1117,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           partition_ids=[handler1_partitions, handler2_partitions],
           gains=[handler1_gains, handler2_gains],
           splits=[handler1_split, handler2_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the split from handler 2 to be chosen despite the negative gain.
@@ -1194,9 +1197,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           partition_ids=[handler1_partitions],
           gains=[handler1_gains],
           splits=[handler1_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the negative gain split of partition 1 to be pruned and the
@@ -1335,7 +1339,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
           growing_mode=learner_pb2.LearnerConfig.LAYER_BY_LAYER,
           # Dropout will have no effect, since the tree will not be fully grown.
-          dropout_probability=1.0).SerializeToString()
+          dropout_probability=1.0)
 
       # Prepare handler inputs.
       # Handler 1 only has a candidate for partition 1, handler 2 has candidates
@@ -1364,9 +1368,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           ],
           gains=[handler1_gains, handler2_gains, handler3_gains],
           splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect the split for partition 1 to be chosen from handler 1 and
@@ -1543,7 +1548,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
           growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE,
-          dropout_probability=1.0).SerializeToString()
+          dropout_probability=1.0)
 
       # Prepare handler inputs.
       handler1_partitions = np.array([0], dtype=np.int32)
@@ -1567,9 +1572,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           ],
           gains=[handler1_gains, handler2_gains, handler3_gains],
           splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       # Expect a new tree to be added with the split from handler 1.
@@ -1669,7 +1675,6 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
 
       learner_config.constraints.max_number_of_unique_feature_columns = 3
-      learner_config = learner_config.SerializeToString()
       # Prepare handler inputs.
       handler1_partitions = np.array([0], dtype=np.int32)
       handler1_gains = np.array([7.62], dtype=np.float32)
@@ -1692,9 +1697,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           ],
           gains=[handler1_gains, handler2_gains, handler3_gains],
           splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
+          learner_config=learner_config.SerializeToString(),
           dropout_seed=123,
-          center_bias=True)
+          center_bias=True,
+          max_tree_depth=learner_config.constraints.max_tree_depth)
       session.run(grow_op)
 
       _, serialized = session.run(
diff --git a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
index 1ee7f2395e..ba5ef700c5 100644
--- a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
+++ b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
@@ -287,7 +287,8 @@ class GradientBoostedDecisionTreeModel(object):
                loss_reduction=losses.Reduction.SUM_OVER_NONZERO_WEIGHTS,
                feature_columns=None,
                use_core_columns=False,
-               output_leaf_index=False):
+               output_leaf_index=False,
+               output_leaf_index_modes=None):
     """Construct a new GradientBoostedDecisionTreeModel function.
 
     Args:
@@ -307,6 +308,9 @@ class GradientBoostedDecisionTreeModel(object):
         used.
       output_leaf_index: A boolean variable indicating whether to output leaf
         index into predictions dictionary.
+      output_leaf_index_modes: A list of modes from (TRAIN, EVAL, INFER) which
+        dictates when leaf indices will be outputted. By default, leaf indices
+        are only outputted in INFER mode.
 
     Raises:
       ValueError: if inputs are not valid.
@@ -349,6 +353,9 @@ class GradientBoostedDecisionTreeModel(object):
       self._gradient_shape = tensor_shape.scalar()
       self._hessian_shape = tensor_shape.scalar()
     else:
+      if center_bias:
+        raise ValueError("Center bias should be False for multiclass.")
+
       self._gradient_shape = tensor_shape.TensorShape([logits_dimension])
       if (learner_config.multi_class_strategy ==
           learner_pb2.LearnerConfig.FULL_HESSIAN):
@@ -376,6 +383,8 @@ class GradientBoostedDecisionTreeModel(object):
     self._learner_config = learner_config
     self._feature_columns = feature_columns
     self._learner_config_serialized = learner_config.SerializeToString()
+    self._max_tree_depth = variables.Variable(
+        initial_value=self._learner_config.constraints.max_tree_depth)
     self._attempted_trees = variables.Variable(
         initial_value=array_ops.zeros([], dtypes.int64),
         trainable=False,
@@ -404,7 +413,16 @@ class GradientBoostedDecisionTreeModel(object):
         self._learner_config.multi_class_strategy ==
         learner_pb2.LearnerConfig.TREE_PER_CLASS and
         learner_config.num_classes == 2)
+
+    if output_leaf_index_modes is None:
+      output_leaf_index_modes = [learn.ModeKeys.INFER]
+    elif not all(
+        mode in (learn.ModeKeys.TRAIN, learn.ModeKeys.EVAL,
+                 learn.ModeKeys.INFER) for mode in output_leaf_index_modes):
+      raise ValueError("output_leaf_index_modes should only contain ModeKeys.")
+
     self._output_leaf_index = output_leaf_index
+    self._output_leaf_index_modes = output_leaf_index_modes
 
   def _predict_and_return_dict(self, ensemble_handle, ensemble_stamp, mode):
     """Runs prediction and returns a dictionary of the prediction results.
@@ -435,8 +453,7 @@ class GradientBoostedDecisionTreeModel(object):
     # the right stamp.
     with ops.control_dependencies(ensemble_stats):
       leaf_index = None
-      # Only used in infer (predict), not used in train and eval.
-      if self._output_leaf_index and mode == learn.ModeKeys.INFER:
+      if self._output_leaf_index and mode in self._output_leaf_index_modes:
         predictions, _, leaf_index = (
             prediction_ops).gradient_trees_prediction_verbose(
                 ensemble_handle,
@@ -508,9 +525,6 @@ class GradientBoostedDecisionTreeModel(object):
     if not input_deps:
       raise ValueError("No input tensors for prediction.")
 
-    if any(i.device != input_deps[0].device for i in input_deps):
-      raise ValueError("All input tensors should be on the same device.")
-
     # Get most current model stamp.
     ensemble_stamp = model_ops.tree_ensemble_stamp_token(self._ensemble_handle)
 
@@ -1042,7 +1056,8 @@ class GradientBoostedDecisionTreeModel(object):
             splits=split_info_list,
             learner_config=self._learner_config_serialized,
             dropout_seed=dropout_seed,
-            center_bias=self._center_bias)
+            center_bias=self._center_bias,
+            max_tree_depth=self._max_tree_depth)
 
       def _grow_ensemble_not_ready_fn():
         # Don't grow the ensemble, just update the stamp.
@@ -1056,7 +1071,8 @@ class GradientBoostedDecisionTreeModel(object):
             splits=[],
             learner_config=self._learner_config_serialized,
             dropout_seed=dropout_seed,
-            center_bias=self._center_bias)
+            center_bias=self._center_bias,
+            max_tree_depth=self._max_tree_depth)
 
       def _grow_ensemble_fn():
         # Conditionally grow an ensemble depending on whether the splits
@@ -1096,6 +1112,9 @@ class GradientBoostedDecisionTreeModel(object):
   def get_number_of_trees_tensor(self):
     return self._finalized_trees, self._attempted_trees
 
+  def get_max_tree_depth(self):
+    return self._max_tree_depth
+
   def train(self, loss, predictions_dict, labels):
     """Updates the accumalator stats and grows the ensemble.
 
diff --git a/tensorflow/contrib/boosted_trees/python/utils/losses.py b/tensorflow/contrib/boosted_trees/python/utils/losses.py
index ab7ac2aba6..b5ebaf1999 100644
--- a/tensorflow/contrib/boosted_trees/python/utils/losses.py
+++ b/tensorflow/contrib/boosted_trees/python/utils/losses.py
@@ -23,6 +23,12 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
+from tensorflow.python.ops.losses import losses
+
+
+def per_example_squared_hinge_loss(labels, weights, predictions):
+  loss = losses.hinge_loss(labels=labels, logits=predictions, weights=weights)
+  return math_ops.square(loss), control_flow_ops.no_op()
 
 
 def per_example_logistic_loss(labels, weights, predictions):
@@ -126,7 +132,7 @@ def per_example_squared_loss(labels, weights, predictions):
 
 
 def per_example_exp_loss(labels, weights, predictions, name=None, eps=0.1):
-  """Exponential loss given labels, example weights and predictions.
+  """Trimmed exponential loss given labels, example weights and predictions.
 
   Note that this is only for binary classification.
   If logistic loss tries to make sure that the classifier is certain of its
@@ -211,3 +217,62 @@ def per_example_exp_loss(labels, weights, predictions, name=None, eps=0.1):
   unweighted_loss = exp_with_logits(
       name=name, eps=eps, labels=labels, logits=predictions)
   return unweighted_loss * weights, control_flow_ops.no_op()
+
+
+def per_example_full_exp_loss(labels, weights, predictions, name=None):
+  """Full exponential loss given labels, example weights and predictions.
+
+  Note that this is only for binary classification.
+  The loss returns is exp(-targets*logits), where targets are converted to -1
+  and 1.
+
+  Args:
+    labels: Rank 2 (N, D) tensor of per-example labels.
+    weights: Rank 2 (N, 1) tensor of per-example weights.
+    predictions: Rank 2 (N, D) tensor of per-example predictions.
+    name: A name for the operation (optional).
+
+  Returns:
+    loss: A Rank 2 (N, 1) tensor of per-example exp loss
+    update_op: An update operation to update the loss's internal state.
+  """
+
+  def full_exp_with_logits(name, labels=None, logits=None):
+    """Computes exponential loss given `logits`.
+
+    Args:
+      name: A name for the operation (optional).
+      labels: A `Tensor` of the same type and shape as `logits`.
+      logits: A `Tensor` of type `float32` or `float64`.
+
+    Returns:
+      A `Tensor` of the same shape as `logits` with the componentwise
+      exponential losses.
+
+    Raises:
+      ValueError: If `logits` and `labels` do not have the same shape.
+    """
+    with ops.name_scope(name, "exp_loss", [logits, labels]) as name:
+      logits = ops.convert_to_tensor(logits, name="logits")
+      labels = ops.convert_to_tensor(labels, name="labels")
+      try:
+        labels.get_shape().merge_with(logits.get_shape())
+      except ValueError:
+        raise ValueError("logits and labels must have the same shape (%s vs %s)"
+                         % (logits.get_shape(), labels.get_shape()))
+
+    # Default threshold of 0 to switch between classes
+    zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
+    ones = array_ops.ones_like(logits, dtype=logits.dtype)
+    neg_ones = -array_ops.ones_like(logits, dtype=logits.dtype)
+
+    # Convert labels to 1 and -1
+    cond_labels = (labels > zeros)
+    labels_converted = array_ops.where(cond_labels, ones, neg_ones)
+
+    return math_ops.exp(-1.0 * logits * labels_converted)
+
+  labels = math_ops.to_float(labels)
+  unweighted_loss = full_exp_with_logits(
+      name=name, labels=labels, logits=predictions)
+  return unweighted_loss * weights, control_flow_ops.no_op()
diff --git a/tensorflow/contrib/checkpoint/python/containers.py b/tensorflow/contrib/checkpoint/python/containers.py
index 4d3d531299..242c1e8ba4 100644
--- a/tensorflow/contrib/checkpoint/python/containers.py
+++ b/tensorflow/contrib/checkpoint/python/containers.py
@@ -35,9 +35,9 @@ class UniqueNameTracker(data_structures.CheckpointableDataStructure):
       self.slotdeps = tf.contrib.checkpoint.UniqueNameTracker()
       slotdeps = self.slotdeps
       slots = []
-      slots.append(slotdeps.track(tfe.Variable(3.), "x"))  # Named "x"
-      slots.append(slotdeps.track(tfe.Variable(4.), "y"))
-      slots.append(slotdeps.track(tfe.Variable(5.), "x"))  # Named "x_1"
+      slots.append(slotdeps.track(tf.Variable(3.), "x"))  # Named "x"
+      slots.append(slotdeps.track(tf.Variable(4.), "y"))
+      slots.append(slotdeps.track(tf.Variable(5.), "x"))  # Named "x_1"
   ```
   """
 
diff --git a/tensorflow/contrib/cloud/README.md b/tensorflow/contrib/cloud/README.md
index 134ce057f4..a80d8965f3 100644
--- a/tensorflow/contrib/cloud/README.md
+++ b/tensorflow/contrib/cloud/README.md
@@ -1,8 +1,8 @@
 # Cloud #
 
-## BigTable ##
+## Cloud Bigtable ##
 
-[Google Cloud BigTable](https://cloud.google.com/bigtable/) is a high
+[Google Cloud Bigtable](https://cloud.google.com/bigtable/) is a high
 performance storage system that can store and serve training data. This contrib
 package contains an experimental integration with TensorFlow.
 
diff --git a/tensorflow/contrib/cloud/__init__.py b/tensorflow/contrib/cloud/__init__.py
index af81106a68..8efd259946 100644
--- a/tensorflow/contrib/cloud/__init__.py
+++ b/tensorflow/contrib/cloud/__init__.py
@@ -25,8 +25,8 @@ from tensorflow.contrib.cloud.python.ops.bigquery_reader_ops import *
 from tensorflow.contrib.cloud.python.ops.gcs_config_ops import *
 
 if os.name != 'nt':
-  from tensorflow.contrib.bigtable.python.ops.bigtable_api import BigTable
   from tensorflow.contrib.bigtable.python.ops.bigtable_api import BigtableClient
+  from tensorflow.contrib.bigtable.python.ops.bigtable_api import BigtableTable
 
 del os
 
@@ -34,8 +34,8 @@ from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
     'BigQueryReader',
-    'BigTable',
     'BigtableClient',
+    'BigtableTable',
     'BlockCacheParams',
     'configure_colab_session',
     'configure_gcs',
diff --git a/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.cc b/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.cc
index 1bfd27305d..58fadffce3 100644
--- a/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.cc
+++ b/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.cc
@@ -85,7 +85,7 @@ Status BigQueryTableAccessor::New(
     int64 timestamp_millis, int64 row_buffer_size, const string& end_point,
     const std::vector<string>& columns, const BigQueryTablePartition& partition,
     std::unique_ptr<AuthProvider> auth_provider,
-    std::unique_ptr<HttpRequest::Factory> http_request_factory,
+    std::shared_ptr<HttpRequest::Factory> http_request_factory,
     std::unique_ptr<BigQueryTableAccessor>* accessor) {
   if (timestamp_millis <= 0) {
     return errors::InvalidArgument(
@@ -94,29 +94,19 @@ Status BigQueryTableAccessor::New(
   const string& big_query_end_point =
       end_point.empty() ? kBigQueryEndPoint : end_point;
   if (auth_provider == nullptr && http_request_factory == nullptr) {
-    accessor->reset(new BigQueryTableAccessor(
-        project_id, dataset_id, table_id, timestamp_millis, row_buffer_size,
-        big_query_end_point, columns, partition));
-  } else {
-    accessor->reset(new BigQueryTableAccessor(
-        project_id, dataset_id, table_id, timestamp_millis, row_buffer_size,
-        big_query_end_point, columns, partition, std::move(auth_provider),
-        std::move(http_request_factory)));
+    http_request_factory = std::make_shared<CurlHttpRequest::Factory>();
+    auto compute_engine_metadata_client =
+        std::make_shared<ComputeEngineMetadataClient>(http_request_factory);
+    auth_provider = std::unique_ptr<AuthProvider>(
+        new GoogleAuthProvider(compute_engine_metadata_client));
   }
-  return (*accessor)->ReadSchema();
-}
 
-BigQueryTableAccessor::BigQueryTableAccessor(
-    const string& project_id, const string& dataset_id, const string& table_id,
-    int64 timestamp_millis, int64 row_buffer_size, const string& end_point,
-    const std::vector<string>& columns, const BigQueryTablePartition& partition)
-    : BigQueryTableAccessor(
-          project_id, dataset_id, table_id, timestamp_millis, row_buffer_size,
-          end_point, columns, partition,
-          std::unique_ptr<AuthProvider>(new GoogleAuthProvider()),
-          std::unique_ptr<HttpRequest::Factory>(
-              new CurlHttpRequest::Factory())) {
-  row_buffer_.resize(row_buffer_size);
+  accessor->reset(new BigQueryTableAccessor(
+      project_id, dataset_id, table_id, timestamp_millis, row_buffer_size,
+      big_query_end_point, columns, partition, std::move(auth_provider),
+      std::move(http_request_factory)));
+
+  return (*accessor)->ReadSchema();
 }
 
 BigQueryTableAccessor::BigQueryTableAccessor(
@@ -124,7 +114,7 @@ BigQueryTableAccessor::BigQueryTableAccessor(
     int64 timestamp_millis, int64 row_buffer_size, const string& end_point,
     const std::vector<string>& columns, const BigQueryTablePartition& partition,
     std::unique_ptr<AuthProvider> auth_provider,
-    std::unique_ptr<HttpRequest::Factory> http_request_factory)
+    std::shared_ptr<HttpRequest::Factory> http_request_factory)
     : project_id_(project_id),
       dataset_id_(dataset_id),
       table_id_(table_id),
diff --git a/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.h b/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.h
index b349063715..1af43a3e10 100644
--- a/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.h
+++ b/tensorflow/contrib/cloud/kernels/bigquery_table_accessor.h
@@ -109,24 +109,17 @@ class BigQueryTableAccessor {
                     const std::vector<string>& columns,
                     const BigQueryTablePartition& partition,
                     std::unique_ptr<AuthProvider> auth_provider,
-                    std::unique_ptr<HttpRequest::Factory> http_request_factory,
+                    std::shared_ptr<HttpRequest::Factory> http_request_factory,
                     std::unique_ptr<BigQueryTableAccessor>* accessor);
 
   /// \brief Constructs an object for a given table and partition.
-  BigQueryTableAccessor(const string& project_id, const string& dataset_id,
-                        const string& table_id, int64 timestamp_millis,
-                        int64 row_buffer_size, const string& end_point,
-                        const std::vector<string>& columns,
-                        const BigQueryTablePartition& partition);
-
-  /// Used for unit testing.
   BigQueryTableAccessor(
       const string& project_id, const string& dataset_id,
       const string& table_id, int64 timestamp_millis, int64 row_buffer_size,
       const string& end_point, const std::vector<string>& columns,
       const BigQueryTablePartition& partition,
       std::unique_ptr<AuthProvider> auth_provider,
-      std::unique_ptr<HttpRequest::Factory> http_request_factory);
+      std::shared_ptr<HttpRequest::Factory> http_request_factory);
 
   /// \brief Parses column values for a given row.
   Status ParseColumnValues(const Json::Value& value,
@@ -199,7 +192,7 @@ class BigQueryTableAccessor {
   SchemaNode schema_root_;
 
   std::unique_ptr<AuthProvider> auth_provider_;
-  std::unique_ptr<HttpRequest::Factory> http_request_factory_;
+  std::shared_ptr<HttpRequest::Factory> http_request_factory_;
 
   TF_DISALLOW_COPY_AND_ASSIGN(BigQueryTableAccessor);
 };
diff --git a/tensorflow/contrib/cluster_resolver/BUILD b/tensorflow/contrib/cluster_resolver/BUILD
index c239e6f8f9..707f621184 100644
--- a/tensorflow/contrib/cluster_resolver/BUILD
+++ b/tensorflow/contrib/cluster_resolver/BUILD
@@ -12,6 +12,15 @@ licenses(["notice"])  # Apache 2.0
 
 py_library(
     name = "cluster_resolver_pip",
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [
+        ":cluster_resolver_py",
+    ],
+)
+
+py_library(
+    name = "cluster_resolver_py",
     srcs = [
         "__init__.py",
         "python/training/__init__.py",
@@ -19,7 +28,7 @@ py_library(
     srcs_version = "PY2AND3",
     visibility = ["//visibility:public"],
     deps = [
-        ":cluster_resolver_py",
+        ":base_cluster_resolver_py",
         ":gce_cluster_resolver_py",
         ":tpu_cluster_resolver_py",
         "//tensorflow/python:util",
@@ -27,7 +36,7 @@ py_library(
 )
 
 py_library(
-    name = "cluster_resolver_py",
+    name = "base_cluster_resolver_py",
     srcs = ["python/training/cluster_resolver.py"],
     srcs_version = "PY2AND3",
     deps = [
@@ -40,7 +49,7 @@ py_library(
     srcs = ["python/training/gce_cluster_resolver.py"],
     srcs_version = "PY2AND3",
     deps = [
-        ":cluster_resolver_py",
+        ":base_cluster_resolver_py",
         "//tensorflow/python:training",
     ],
 )
@@ -50,13 +59,13 @@ py_library(
     srcs = ["python/training/tpu_cluster_resolver.py"],
     srcs_version = "PY2AND3",
     deps = [
-        ":cluster_resolver_py",
+        ":base_cluster_resolver_py",
         "//tensorflow/python:training",
     ],
 )
 
 tf_py_test(
-    name = "cluster_resolver_py_test",
+    name = "base_cluster_resolver_py_test",
     srcs = ["python/training/cluster_resolver_test.py"],
     additional_deps = [
         ":cluster_resolver_py",
diff --git a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
index 8f521ffee4..1ab150d74a 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
@@ -148,6 +148,9 @@ class TPUClusterResolver(ClusterResolver):
       else:
         tpu = self._envVarFallback()
 
+    if tpu is None:
+      raise ValueError('Please provide a TPU Name to connect to.')
+
     self._tpu = compat.as_bytes(tpu)  # self._tpu is always bytes
     self._job_name = job_name
     self._credentials = credentials
@@ -259,11 +262,11 @@ class TPUClusterResolver(ClusterResolver):
 
       if 'state' in response and response['state'] != 'READY':
         raise RuntimeError('TPU "%s" is not yet ready; state: "%s"' %
-                           (self._tpu, response['state']))
+                           (compat.as_text(self._tpu), response['state']))
 
       if 'health' in response and response['health'] != 'HEALTHY':
-        raise RuntimeError('TPU "%s" is unhealthy: "%s"' % (self._tpu,
-                                                            response['health']))
+        raise RuntimeError('TPU "%s" is unhealthy: "%s"' %
+                           (compat.as_text(self._tpu), response['health']))
 
       if 'networkEndpoints' in response:
         worker_list = [
diff --git a/tensorflow/contrib/cmake/CMakeLists.txt b/tensorflow/contrib/cmake/CMakeLists.txt
index a0a5b0e00c..f6c928e2be 100644
--- a/tensorflow/contrib/cmake/CMakeLists.txt
+++ b/tensorflow/contrib/cmake/CMakeLists.txt
@@ -145,26 +145,41 @@ if(WIN32)
       # temporary fix for #18241
       add_definitions(-DEIGEN_DEFAULT_DENSE_INDEX_TYPE=std::int64_t)
   endif()
-  add_definitions(-DNOMINMAX -D_WIN32_WINNT=0x0A00 -DLANG_CXX11)
-  add_definitions(-DWIN32 -DOS_WIN -D_MBCS -DWIN32_LEAN_AND_MEAN -DNOGDI -DPLATFORM_WINDOWS)
+  add_definitions(-DNOMINMAX -D_WIN32_WINNT=0x0A00)
+  add_definitions(-DWIN32_LEAN_AND_MEAN -DNOGDI -DPLATFORM_WINDOWS)
   add_definitions(-DTENSORFLOW_USE_EIGEN_THREADPOOL -DEIGEN_HAS_C99_MATH)
   add_definitions(-DTF_COMPILE_LIBRARY)
-  add_definitions(/bigobj /nologo /EHsc /GF /MP /Gm-)
+  add_compile_options(/bigobj /GF /MP /Gm-)
   # Suppress warnings to reduce build log size.
-  add_definitions(/wd4267 /wd4244 /wd4800 /wd4503 /wd4554 /wd4996 /wd4348 /wd4018)
-  add_definitions(/wd4099 /wd4146 /wd4267 /wd4305 /wd4307)
-  add_definitions(/wd4715 /wd4722 /wd4723 /wd4838 /wd4309 /wd4334)
-  add_definitions(/wd4003 /wd4244 /wd4267 /wd4503 /wd4506 /wd4800 /wd4996)
+  add_compile_options(/wd4267 /wd4244 /wd4800 /wd4503 /wd4554 /wd4996 /wd4348 /wd4018)
+  add_compile_options(/wd4099 /wd4146 /wd4267 /wd4305 /wd4307)
+  add_compile_options(/wd4715 /wd4722 /wd4723 /wd4838 /wd4309 /wd4334)
+  add_compile_options(/wd4003 /wd4244 /wd4267 /wd4503 /wd4506 /wd4800 /wd4996)
   # Suppress linker warnings.
   set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} /ignore:4049 /ignore:4197 /ignore:4217 /ignore:4221")
   set(CMAKE_MODULE_LINKER_FLAGS "${CMAKE_MODULE_LINKER_FLAGS} /ignore:4049 /ignore:4197 /ignore:4217 /ignore:4221")
   set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /ignore:4049 /ignore:4197 /ignore:4217 /ignore:4221")
-  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /MP")
   set(CMAKE_CXX_FLAGS_DEBUG "/D_DEBUG /MDd /Ob2")
   set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /D_ITERATOR_DEBUG_LEVEL=0")
   set(CMAKE_CXX_FLAGS_MINSIZEREL "${CMAKE_CXX_FLAGS_MINSIZEREL} /D_ITERATOR_DEBUG_LEVEL=0")
   set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} /D_ITERATOR_DEBUG_LEVEL=0")
 
+  set(compiler_flags
+    CMAKE_CXX_FLAGS
+    CMAKE_CXX_FLAGS_DEBUG
+    CMAKE_CXX_FLAGS_RELEASE
+    CMAKE_C_FLAGS
+    CMAKE_C_FLAGS_DEBUG
+    CMAKE_C_FLAGS_RELEASE
+  )
+  # No exception
+  foreach(flag ${compiler_flags})
+    string(REPLACE "/EHsc" "/EHs-c-" ${flag} "${${flag}}")
+  endforeach()
+  add_definitions(/D_HAS_EXCEPTIONS=0)
+  # Suppress 'noexcept used with no exception handling mode specified' warning
+  add_compile_options(/wd4577)
+
   # Try to avoid flaky failures due to failed generation of generate.stamp files.
   set(CMAKE_SUPPRESS_REGENERATION ON)
 endif()
@@ -379,16 +394,20 @@ if (tensorflow_ENABLE_GPU)
 
   # by default we assume compute cabability 3.5 and 5.2. If you change this change it in
   # CUDA_NVCC_FLAGS and cuda_config.h below
-  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_30,code=\"sm_30,compute_30\";-gencode arch=compute_35,code=\"sm_35,compute_35\";-gencode arch=compute_52,code=\"sm_52,compute_52\")
+  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_37,code=\"sm_37,compute_37\")
+  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_52,code=\"sm_52,compute_52\")
+  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_60,code=\"sm_60,compute_60\")
+  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_61,code=\"sm_61,compute_61\")
+  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_70,code=\"sm_70,compute_70\")
   set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};--include-path ${PROJECT_BINARY_DIR}/$\{build_configuration\};--expt-relaxed-constexpr)
   set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-ftz=true)  # Flush denormals to zero
   set(CUDA_INCLUDE ${CUDA_TOOLKIT_TARGET_DIR} ${CUDA_TOOLKIT_TARGET_DIR}/extras/CUPTI/include)
   include_directories(${CUDA_INCLUDE})
   if (WIN32)
-    add_definitions(-DGOOGLE_CUDA=1 -DTF_EXTRA_CUDA_CAPABILITIES=3.0,3.5,5.2)
+    add_definitions(-DGOOGLE_CUDA=1 -DTF_EXTRA_CUDA_CAPABILITIES=3.7,5.2,6.0,6.1,7.0)
   else (WIN32)
-    # Without these double quotes, cmake in Linux makes it "-DTF_EXTRA_CUDA_CAPABILITIES=3.0, -D3.5, -D5.2" for cc, which incurs build breaks
-    add_definitions(-DGOOGLE_CUDA=1 -D"TF_EXTRA_CUDA_CAPABILITIES=3.0,3.5,5.2")
+    # Without these double quotes, cmake in Linux makes it "-DTF_EXTRA_CUDA_CAPABILITIES=3.7, -D5.2, ..." for cc, which incurs build breaks
+    add_definitions(-DGOOGLE_CUDA=1 -D"TF_EXTRA_CUDA_CAPABILITIES=3.7,5.2,6.0,6.1,7.0")
   endif (WIN32)
 
   if (WIN32)
@@ -437,7 +456,7 @@ if (tensorflow_ENABLE_GPU)
   FILE(WRITE ${tensorflow_source_dir}/third_party/gpus/cuda/cuda_config.h
     "#ifndef CUDA_CUDA_CONFIG_H_\n"
     "#define CUDA_CUDA_CONFIG_H_\n"
-    "#define TF_CUDA_CAPABILITIES CudaVersion(\"3.0\"),CudaVersion(\"3.5\"),CudaVersion(\"5.2\")\n"
+    "#define TF_CUDA_CAPABILITIES CudaVersion(\"3.7\"),CudaVersion(\"5.2\"),CudaVersion(\"6.0\"),CudaVersion(\"6.1\"),CudaVersion(\"7.0\")\n"
     "#define TF_CUDA_VERSION \"64_${short_CUDA_VER}\"\n"
     "#define TF_CUDNN_VERSION \"64_${tensorflow_CUDNN_VERSION}\"\n"
     "#define TF_CUDA_TOOLKIT_PATH \"${CUDA_TOOLKIT_ROOT_DIR}\"\n"
@@ -452,7 +471,6 @@ if (tensorflow_ENABLE_GPU)
     ${CUDA_TOOLKIT_TARGET_DIR}/include/cuComplex.h
     ${CUDA_TOOLKIT_TARGET_DIR}/include/cublas_v2.h
     ${CUDA_TOOLKIT_TARGET_DIR}/include/cusolverDn.h
-    ${CUDA_TOOLKIT_TARGET_DIR}/include/cuda_fp16.h
     ${CUDA_TOOLKIT_TARGET_DIR}/include/device_functions.h
     ${CUDA_TOOLKIT_TARGET_DIR}/include/cufft.h
     ${CUDA_TOOLKIT_TARGET_DIR}/include/curand.h
diff --git a/tensorflow/contrib/cmake/external/highwayhash.cmake b/tensorflow/contrib/cmake/external/highwayhash.cmake
index a6e8a38d8c..7d260b85f2 100644
--- a/tensorflow/contrib/cmake/external/highwayhash.cmake
+++ b/tensorflow/contrib/cmake/external/highwayhash.cmake
@@ -20,14 +20,6 @@ set(highwayhash_TAG be5edafc2e1a455768e260ccd68ae7317b6690ee)
 set(highwayhash_BUILD ${CMAKE_CURRENT_BINARY_DIR}/highwayhash/src/highwayhash)
 set(highwayhash_INSTALL ${CMAKE_CURRENT_BINARY_DIR}/highwayhash/install)
 
-# put highwayhash includes in the directory where they are expected
-add_custom_target(highwayhash_create_destination_dir
-    COMMAND ${CMAKE_COMMAND} -E make_directory ${highwayhash_INCLUDE_DIR}/highwayhash
-    DEPENDS highwayhash)
-
-add_custom_target(highwayhash_copy_headers_to_destination
-    DEPENDS highwayhash_create_destination_dir)
-
 if(WIN32)
   set(highwayhash_HEADERS "${highwayhash_BUILD}/highwayhash/*.h")
   set(highwayhash_STATIC_LIBRARIES ${highwayhash_INSTALL}/lib/highwayhash.lib)
@@ -36,6 +28,20 @@ else()
   set(highwayhash_STATIC_LIBRARIES ${highwayhash_INSTALL}/lib/libhighwayhash.a)
 endif()
 
+set(highwayhash_HEADERS
+    "${highwayhash_INSTALL}/include/code_annotation.h"
+    "${highwayhash_INSTALL}/include/highway_tree_hash.h"
+    "${highwayhash_INSTALL}/include/scalar_highway_tree_hash.h"
+    "${highwayhash_INSTALL}/include/scalar_sip_tree_hash.h"
+    "${highwayhash_INSTALL}/include/sip_hash.h"
+    "${highwayhash_INSTALL}/include/sip_tree_hash.h"
+    "${highwayhash_INSTALL}/include/sse41_highway_tree_hash.h"
+    "${highwayhash_INSTALL}/include/state_helpers.h"
+    "${highwayhash_INSTALL}/include/types.h"
+    "${highwayhash_INSTALL}/include/vec.h"
+    "${highwayhash_INSTALL}/include/vec2.h"
+)
+
 ExternalProject_Add(highwayhash
     PREFIX highwayhash
     GIT_REPOSITORY ${highwayhash_URL}
@@ -50,5 +56,15 @@ ExternalProject_Add(highwayhash
         -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
         -DCMAKE_INSTALL_PREFIX:STRING=${highwayhash_INSTALL})
 
-add_custom_command(TARGET highwayhash_copy_headers_to_destination PRE_BUILD
-    COMMAND ${CMAKE_COMMAND} -E copy_directory ${highwayhash_INSTALL}/include/ ${highwayhash_INCLUDE_DIR}/highwayhash)
+# put highwayhash includes in the directory where they are expected
+add_custom_target(highwayhash_create_destination_dir
+    COMMAND ${CMAKE_COMMAND} -E make_directory ${highwayhash_INCLUDE_DIR}/highwayhash
+    DEPENDS highwayhash)
+
+add_custom_target(highwayhash_copy_headers_to_destination
+    DEPENDS highwayhash_create_destination_dir)
+
+foreach(header_file ${highwayhash_HEADERS})
+  add_custom_command(TARGET highwayhash_copy_headers_to_destination PRE_BUILD
+      COMMAND ${CMAKE_COMMAND} -E copy_if_different ${header_file} ${highwayhash_INCLUDE_DIR}/highwayhash/)
+endforeach()
diff --git a/tensorflow/contrib/cmake/external/nsync.cmake b/tensorflow/contrib/cmake/external/nsync.cmake
index eba3bcfc79..1d638e6402 100644
--- a/tensorflow/contrib/cmake/external/nsync.cmake
+++ b/tensorflow/contrib/cmake/external/nsync.cmake
@@ -20,14 +20,6 @@ set(nsync_TAG 1.20.0)
 set(nsync_BUILD ${CMAKE_CURRENT_BINARY_DIR}/nsync/src/nsync)
 set(nsync_INSTALL ${CMAKE_CURRENT_BINARY_DIR}/nsync/install)
 
-# put nsync includes in the directory where they are expected
-add_custom_target(nsync_create_destination_dir
-    COMMAND ${CMAKE_COMMAND} -E make_directory ${nsync_INCLUDE_DIR}
-    DEPENDS nsync)
-
-add_custom_target(nsync_copy_headers_to_destination
-    DEPENDS nsync_create_destination_dir)
-
 if(WIN32)
   set(nsync_HEADERS "${nsync_BUILD}/public/*.h")
   set(nsync_STATIC_LIBRARIES ${nsync_INSTALL}/lib/nsync.lib)
@@ -49,7 +41,35 @@ ExternalProject_Add(nsync
         -DCMAKE_BUILD_TYPE:STRING=Release
         -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
         -DCMAKE_INSTALL_PREFIX:STRING=${nsync_INSTALL}
-	-DNSYNC_LANGUAGE:STRING=c++11)
+    -DNSYNC_LANGUAGE:STRING=c++11)
+
+set(nsync_HEADERS
+    "${nsync_INSTALL}/include/nsync.h"
+    "${nsync_INSTALL}/include/nsync_atomic.h"
+    "${nsync_INSTALL}/include/nsync_counter.h"
+    "${nsync_INSTALL}/include/nsync_cpp.h"
+    "${nsync_INSTALL}/include/nsync_cv.h"
+    "${nsync_INSTALL}/include/nsync_debug.h"
+    "${nsync_INSTALL}/include/nsync_mu.h"
+    "${nsync_INSTALL}/include/nsync_mu_wait.h"
+    "${nsync_INSTALL}/include/nsync_note.h"
+    "${nsync_INSTALL}/include/nsync_once.h"
+    "${nsync_INSTALL}/include/nsync_time.h"
+    "${nsync_INSTALL}/include/nsync_time_internal.h"
+    "${nsync_INSTALL}/include/nsync_waiter.h"
+)
+                                                            
+# put nsync includes in the directory where they are expected
+add_custom_target(nsync_create_destination_dir
+    COMMAND ${CMAKE_COMMAND} -E make_directory ${nsync_INCLUDE_DIR}
+    DEPENDS nsync)
+
+add_custom_target(nsync_copy_headers_to_destination
+    DEPENDS nsync_create_destination_dir)
+
+foreach(header_file ${nsync_HEADERS})
+  add_custom_command(TARGET nsync_copy_headers_to_destination PRE_BUILD
+      COMMAND ${CMAKE_COMMAND} -E copy_if_different ${header_file} ${nsync_INCLUDE_DIR}/)
+endforeach()
+
 
-add_custom_command(TARGET nsync_copy_headers_to_destination PRE_BUILD
-    COMMAND ${CMAKE_COMMAND} -E copy_directory ${nsync_INSTALL}/include/ ${nsync_INCLUDE_DIR}/)
diff --git a/tensorflow/contrib/cmake/python_modules.txt b/tensorflow/contrib/cmake/python_modules.txt
index a5eba5a8c9..75e00f3267 100644
--- a/tensorflow/contrib/cmake/python_modules.txt
+++ b/tensorflow/contrib/cmake/python_modules.txt
@@ -14,6 +14,7 @@ tensorflow/examples/tutorials
 tensorflow/examples/tutorials/mnist
 tensorflow/python
 tensorflow/python/client
+tensorflow/python/compat
 tensorflow/python/data
 tensorflow/python/data/ops
 tensorflow/python/data/util
@@ -61,6 +62,8 @@ tensorflow/python/saved_model
 tensorflow/python/summary
 tensorflow/python/summary/writer
 tensorflow/python/tools
+tensorflow/python/tools/api
+tensorflow/python/tools/api/generator
 tensorflow/python/training
 tensorflow/python/training/checkpointable
 tensorflow/python/user_ops
@@ -68,7 +71,6 @@ tensorflow/python/util
 tensorflow/python/util/protobuf
 tensorflow/tools
 tensorflow/tools/api
-tensorflow/tools/api/generator
 tensorflow/tools/graph_transforms
 tensorflow/contrib
 tensorflow/contrib/all_reduce
diff --git a/tensorflow/contrib/cmake/tf_core_framework.cmake b/tensorflow/contrib/cmake/tf_core_framework.cmake
index 872b016d2b..067c299a71 100644
--- a/tensorflow/contrib/cmake/tf_core_framework.cmake
+++ b/tensorflow/contrib/cmake/tf_core_framework.cmake
@@ -49,48 +49,43 @@ function(RELATIVE_PROTOBUF_GENERATE_CPP SRCS HDRS ROOT_DIR)
   set(${HDRS} ${${HDRS}} PARENT_SCOPE)
 endfunction()
 
-function(RELATIVE_PROTOBUF_GENERATE_GRPC_CPP SRCS HDRS ROOT_DIR)
-  if(NOT ARGN)
-    message(SEND_ERROR "Error: RELATIVE_PROTOBUF_GENERATE_GRPC_CPP() called without any proto files")
-    return()
-  endif()
-
-  set(${SRCS})
-  set(${HDRS})
-  foreach(FIL ${ARGN})
-    set(ABS_FIL ${ROOT_DIR}/${FIL})
-    get_filename_component(FIL_WE ${FIL} NAME_WE)
-    get_filename_component(FIL_DIR ${ABS_FIL} PATH)
-    file(RELATIVE_PATH REL_DIR ${ROOT_DIR} ${FIL_DIR})
-
-    list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.cc")
-    list(APPEND ${HDRS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.h")
-    list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc")
-    list(APPEND ${HDRS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h")
-
-    # We adust the path of the gRPC code generation accordingly.
-    if(WIN32)
-      set(GRPC_PROTOC_PLUGIN_PATH ${GRPC_BUILD}/Release/grpc_cpp_plugin.exe)
-    else()
-      set(GRPC_PROTOC_PLUGIN_PATH ${GRPC_BUILD}/grpc_cpp_plugin)
+if(NOT WIN32)
+  function(RELATIVE_PROTOBUF_GENERATE_GRPC_CPP SRCS HDRS ROOT_DIR)
+    if(NOT ARGN)
+      message(SEND_ERROR "Error: RELATIVE_PROTOBUF_GENERATE_GRPC_CPP() called without any proto files")
+      return()
     endif()
 
-    add_custom_command(
-      OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.cc"
-             "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.h"
-             "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc"
-             "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h"
-      COMMAND ${PROTOBUF_PROTOC_EXECUTABLE}
-      ARGS --grpc_out ${CMAKE_CURRENT_BINARY_DIR} --cpp_out ${CMAKE_CURRENT_BINARY_DIR} --plugin=protoc-gen-grpc=${GRPC_PROTOC_PLUGIN_PATH} -I ${ROOT_DIR} ${ABS_FIL} -I ${PROTOBUF_INCLUDE_DIRS}
-      DEPENDS ${ABS_FIL} protobuf grpc
-      COMMENT "Running C++ protocol buffer grpc compiler on ${FIL}"
-      VERBATIM )
-  endforeach()
-
-  set_source_files_properties(${${SRCS}} ${${HDRS}} PROPERTIES GENERATED TRUE)
-  set(${SRCS} ${${SRCS}} PARENT_SCOPE)
-  set(${HDRS} ${${HDRS}} PARENT_SCOPE)
-endfunction()
+    set(${SRCS})
+    set(${HDRS})
+    foreach(FIL ${ARGN})
+      set(ABS_FIL ${ROOT_DIR}/${FIL})
+      get_filename_component(FIL_WE ${FIL} NAME_WE)
+      get_filename_component(FIL_DIR ${ABS_FIL} PATH)
+      file(RELATIVE_PATH REL_DIR ${ROOT_DIR} ${FIL_DIR})
+
+      list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.cc")
+      list(APPEND ${HDRS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.h")
+      list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc")
+      list(APPEND ${HDRS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h")
+
+      add_custom_command(
+        OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.cc"
+               "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.grpc.pb.h"
+               "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc"
+               "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h"
+        COMMAND ${PROTOBUF_PROTOC_EXECUTABLE}
+        ARGS --grpc_out ${CMAKE_CURRENT_BINARY_DIR} --cpp_out ${CMAKE_CURRENT_BINARY_DIR} --plugin protoc-gen-grpc=${GRPC_BUILD}/grpc_cpp_plugin -I ${ROOT_DIR} ${ABS_FIL} -I ${PROTOBUF_INCLUDE_DIRS}
+        DEPENDS ${ABS_FIL} protobuf grpc
+        COMMENT "Running C++ protocol buffer grpc compiler on ${FIL}"
+        VERBATIM )
+    endforeach()
+
+    set_source_files_properties(${${SRCS}} ${${HDRS}} PROPERTIES GENERATED TRUE)
+    set(${SRCS} ${${SRCS}} PARENT_SCOPE)
+    set(${HDRS} ${${HDRS}} PARENT_SCOPE)
+  endfunction()
+endif()
 
 function(RELATIVE_PROTOBUF_TEXT_GENERATE_CPP SRCS HDRS ROOT_DIR)
   if(NOT ARGN)
@@ -180,14 +175,17 @@ RELATIVE_PROTOBUF_TEXT_GENERATE_CPP(PROTO_TEXT_SRCS PROTO_TEXT_HDRS
     ${tensorflow_source_dir} ${tf_proto_text_srcs}
 )
 
-file(GLOB_RECURSE tf_protos_grpc_cc_srcs RELATIVE ${tensorflow_source_dir}
-    "${tensorflow_source_dir}/tensorflow/core/debug/*.proto"
-    "${tensorflow_source_dir}/tensorflow/core/protobuf/master_service.proto"
-)
-RELATIVE_PROTOBUF_GENERATE_GRPC_CPP(PROTO_GRPC_SRCS PROTO_GRPC_HDRS
-    ${tensorflow_source_dir} ${tf_protos_grpc_cc_srcs}
-)
-add_library(tf_protos_cc ${PROTO_GRPC_SRCS} ${PROTO_GRPC_HDRS} ${PROTO_SRCS} ${PROTO_HDRS})
+if(WIN32)
+  add_library(tf_protos_cc ${PROTO_SRCS} ${PROTO_HDRS})
+else()
+  file(GLOB_RECURSE tf_protos_grpc_cc_srcs RELATIVE ${tensorflow_source_dir}
+      "${tensorflow_source_dir}/tensorflow/core/debug/*.proto"
+  )
+  RELATIVE_PROTOBUF_GENERATE_GRPC_CPP(PROTO_GRPC_SRCS PROTO_GRPC_HDRS
+      ${tensorflow_source_dir} ${tf_protos_grpc_cc_srcs}
+  )
+  add_library(tf_protos_cc ${PROTO_GRPC_SRCS} ${PROTO_GRPC_HDRS} ${PROTO_SRCS} ${PROTO_HDRS})
+endif()
 
 ########################################################
 # tf_core_lib library
diff --git a/tensorflow/contrib/cmake/tf_core_kernels.cmake b/tensorflow/contrib/cmake/tf_core_kernels.cmake
index 844f62649d..7b892ba248 100644
--- a/tensorflow/contrib/cmake/tf_core_kernels.cmake
+++ b/tensorflow/contrib/cmake/tf_core_kernels.cmake
@@ -68,6 +68,7 @@ if(tensorflow_BUILD_CONTRIB_KERNELS)
       "${tensorflow_source_dir}/tensorflow/contrib/coder/kernels/range_coder_ops.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/coder/kernels/range_coder_ops_util.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/coder/ops/coder_ops.cc"
+      "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/assert_next_dataset_op.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/csv_dataset_op.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/directed_interleave_dataset_op.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/ignore_errors_dataset_op.cc"
diff --git a/tensorflow/contrib/cmake/tf_python.cmake b/tensorflow/contrib/cmake/tf_python.cmake
index e3b59001bc..32b185f07b 100755
--- a/tensorflow/contrib/cmake/tf_python.cmake
+++ b/tensorflow/contrib/cmake/tf_python.cmake
@@ -736,8 +736,8 @@ endif()
 # Generate API __init__.py files.
 ########################################################
 
-# Parse tensorflow/tools/api/generator/BUILD to get list of generated files.
-FILE(READ ${tensorflow_source_dir}/tensorflow/tools/api/generator/api_gen.bzl api_generator_BUILD_text)
+# Parse tensorflow/python/tools/api/generator/BUILD to get list of generated files.
+FILE(READ ${tensorflow_source_dir}/tensorflow/python/tools/api/generator/api_gen.bzl api_generator_BUILD_text)
 STRING(REGEX MATCH "# BEGIN GENERATED FILES.*# END GENERATED FILES" api_init_files_text ${api_generator_BUILD_text})
 string(REPLACE "# BEGIN GENERATED FILES" "" api_init_files_text ${api_init_files_text})
 string(REPLACE "# END GENERATED FILES" "" api_init_files_text ${api_init_files_text})
@@ -781,7 +781,7 @@ if (tensorflow_ENABLE_MKL_SUPPORT)
 
           # Run create_python_api.py to generate API init files.
           COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CMAKE_CURRENT_BINARY_DIR}/tf_python PATH=${PY_RUNTIME_ENV} ${PYTHON_EXECUTABLE}
-                  "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/tools/api/generator/create_python_api.py"
+                  "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/tools/api/generator/create_python_api.py"
                   "--root_init_template=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/api_template.__init__.py"
                   "--apidir=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow"
                   "--package=tensorflow.python"
@@ -803,7 +803,7 @@ else (tensorflow_ENABLE_MKL_SUPPORT)
 
           # Run create_python_api.py to generate API init files.
           COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CMAKE_CURRENT_BINARY_DIR}/tf_python ${PYTHON_EXECUTABLE}
-                  "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/tools/api/generator/create_python_api.py"
+                  "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/tools/api/generator/create_python_api.py"
                   "--root_init_template=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/api_template.__init__.py"
                   "--apidir=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow"
                   "--package=tensorflow.python"
@@ -824,8 +824,8 @@ add_dependencies(tf_python_api tf_python_ops)
 # Generate API __init__.py files for tf.estimator.
 ########################################################
 
-# Parse tensorflow/tools/api/generator/BUILD to get list of generated files.
-FILE(READ ${tensorflow_source_dir}/tensorflow/tools/api/generator/api_gen.bzl api_generator_BUILD_text)
+# Parse tensorflow/python/tools/api/generator/BUILD to get list of generated files.
+FILE(READ ${tensorflow_source_dir}/tensorflow/python/tools/api/generator/api_gen.bzl api_generator_BUILD_text)
 STRING(REGEX MATCH "# BEGIN GENERATED ESTIMATOR FILES.*# END GENERATED ESTIMATOR FILES" api_init_files_text ${api_generator_BUILD_text})
 string(REPLACE "# BEGIN GENERATED ESTIMATOR FILES" "" api_init_files_text ${api_init_files_text})
 string(REPLACE "# END GENERATED ESTIMATOR FILES" "" api_init_files_text ${api_init_files_text})
@@ -849,10 +849,11 @@ add_custom_command(
 
       # Run create_python_api.py to generate API init files.
       COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CMAKE_CURRENT_BINARY_DIR}/tf_python ${PYTHON_EXECUTABLE}
-              "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/tools/api/generator/create_python_api.py"
+              "${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/tools/api/generator/create_python_api.py"
               "--apidir=${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/python/estimator/api"
               "--package=tensorflow.python.estimator"
               "--apiname=estimator"
+	      "--output_package=tensorflow.python.estimator.api"
               "${estimator_api_init_list_file}"
 
       COMMENT "Generating __init__.py files for Python API."
diff --git a/tensorflow/contrib/cmake/tf_tests.cmake b/tensorflow/contrib/cmake/tf_tests.cmake
index eb9482dc25..b2330c4e34 100644
--- a/tensorflow/contrib/cmake/tf_tests.cmake
+++ b/tensorflow/contrib/cmake/tf_tests.cmake
@@ -193,6 +193,7 @@ if (tensorflow_BUILD_PYTHON_TESTS)
     # flaky test
     "${tensorflow_source_dir}/tensorflow/python/profiler/internal/run_metadata_test.py"
     "${tensorflow_source_dir}/tensorflow/python/profiler/model_analyzer_test.py"
+    "${tensorflow_source_dir}/tensorflow/python/data/kernel_tests/map_dataset_op_test.py"
     # Fails because uses data dependencies with bazel
     "${tensorflow_source_dir}/tensorflow/python/saved_model/saved_model_test.py"
     "${tensorflow_source_dir}/tensorflow/contrib/image/python/kernel_tests/sparse_image_warp_test.py"
@@ -216,7 +217,8 @@ if (tensorflow_BUILD_PYTHON_TESTS)
     ${tensorflow_source_dir}/tensorflow/python/kernel_tests/duplicate_op_test.py
     ${tensorflow_source_dir}/tensorflow/python/kernel_tests/invalid_op_test.py
     ${tensorflow_source_dir}/tensorflow/python/kernel_tests/ackermann_test.py
-
+    # Tests too large to run.
+    ${tensorflow_source_dir}/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py
   )
   if (WIN32)
     set(tf_test_src_py_exclude
diff --git a/tensorflow/contrib/copy_graph/python/util/copy_elements.py b/tensorflow/contrib/copy_graph/python/util/copy_elements.py
index a0dd3881a8..6c9ab6aeb8 100644
--- a/tensorflow/contrib/copy_graph/python/util/copy_elements.py
+++ b/tensorflow/contrib/copy_graph/python/util/copy_elements.py
@@ -18,7 +18,7 @@ These functions allow for recursive copying of elements (ops and variables)
 from one graph to another. The copied elements are initialized inside a
 user-specified scope in the other graph. There are separate functions to
 copy ops and variables.
-There is also a function to retrive the copied version of an op from the
+There is also a function to retrieve the copied version of an op from the
 first graph inside a scope in the second graph.
 
 @@copy_op_to_graph
@@ -77,7 +77,7 @@ def copy_variable_to_graph(org_instance, to_graph, scope=''):
       else:
         collections.append(scope + '/' + name)
 
-  #See if its trainable.
+  #See if it's trainable.
   trainable = (
       org_instance in org_instance.graph.get_collection(
           ops.GraphKeys.TRAINABLE_VARIABLES))
@@ -162,7 +162,7 @@ def copy_op_to_graph(org_instance, to_graph, variables, scope=''):
 
   if isinstance(org_instance, ops.Tensor):
 
-    #If its a Tensor, it is one of the outputs of the underlying
+    #If it's a Tensor, it is one of the outputs of the underlying
     #op. Therefore, copy the op itself and return the appropriate
     #output.
     op = org_instance.op
@@ -219,8 +219,10 @@ def copy_op_to_graph(org_instance, to_graph, variables, scope=''):
                            op_def)
     #Use Graph's hidden methods to add the op
     to_graph._record_op_seen_by_control_dependencies(new_op)
-    for device_function in reversed(to_graph._device_function_stack):
+    # pylint: disable=protected-access
+    for device_function in to_graph._device_functions_outer_to_inner:
       new_op._set_device(device_function(new_op))
+    # pylint: enable=protected-access
 
     return new_op
 
diff --git a/tensorflow/contrib/crf/__init__.py b/tensorflow/contrib/crf/__init__.py
index 046c509626..615e62b16f 100644
--- a/tensorflow/contrib/crf/__init__.py
+++ b/tensorflow/contrib/crf/__init__.py
@@ -20,6 +20,7 @@ See the @{$python/contrib.crf} guide.
 @@crf_decode
 @@crf_log_likelihood
 @@crf_log_norm
+@@crf_multitag_sequence_score
 @@crf_sequence_score
 @@crf_unary_score
 @@CrfDecodeBackwardRnnCell
@@ -36,6 +37,7 @@ from tensorflow.contrib.crf.python.ops.crf import crf_binary_score
 from tensorflow.contrib.crf.python.ops.crf import crf_decode
 from tensorflow.contrib.crf.python.ops.crf import crf_log_likelihood
 from tensorflow.contrib.crf.python.ops.crf import crf_log_norm
+from tensorflow.contrib.crf.python.ops.crf import crf_multitag_sequence_score
 from tensorflow.contrib.crf.python.ops.crf import crf_sequence_score
 from tensorflow.contrib.crf.python.ops.crf import crf_unary_score
 from tensorflow.contrib.crf.python.ops.crf import CrfDecodeBackwardRnnCell
diff --git a/tensorflow/contrib/crf/python/kernel_tests/crf_test.py b/tensorflow/contrib/crf/python/kernel_tests/crf_test.py
index 74f2ec22ff..f56a973f6f 100644
--- a/tensorflow/contrib/crf/python/kernel_tests/crf_test.py
+++ b/tensorflow/contrib/crf/python/kernel_tests/crf_test.py
@@ -31,6 +31,15 @@ from tensorflow.python.platform import test
 
 class CrfTest(test.TestCase):
 
+  def calculateSequenceScore(self, inputs, transition_params, tag_indices,
+                             sequence_lengths):
+    expected_unary_score = sum(
+        inputs[i][tag_indices[i]] for i in range(sequence_lengths))
+    expected_binary_score = sum(
+        transition_params[tag_indices[i], tag_indices[i + 1]]
+        for i in range(sequence_lengths - 1))
+    return expected_unary_score + expected_binary_score
+
   def testCrfSequenceScore(self):
     transition_params = np.array(
         [[-3, 5, -2], [3, 4, 1], [1, 2, 1]], dtype=np.float32)
@@ -60,14 +69,55 @@ class CrfTest(test.TestCase):
             transition_params=constant_op.constant(transition_params))
         sequence_score = array_ops.squeeze(sequence_score, [0])
         tf_sequence_score = sess.run(sequence_score)
-        expected_unary_score = sum(inputs[i][tag_indices[i]]
-                                   for i in range(sequence_lengths))
-        expected_binary_score = sum(
-            transition_params[tag_indices[i], tag_indices[i + 1]]
-            for i in range(sequence_lengths - 1))
-        expected_sequence_score = expected_unary_score + expected_binary_score
+        expected_sequence_score = self.calculateSequenceScore(
+            inputs, transition_params, tag_indices, sequence_lengths)
         self.assertAllClose(tf_sequence_score, expected_sequence_score)
 
+  def testCrfMultiTagSequenceScore(self):
+    transition_params = np.array(
+        [[-3, 5, -2], [3, 4, 1], [1, 2, 1]], dtype=np.float32)
+    # Test both the length-1 and regular cases.
+    sequence_lengths_list = [
+        np.array(3, dtype=np.int32),
+        np.array(1, dtype=np.int32)
+    ]
+    inputs_list = [
+        np.array([[4, 5, -3], [3, -1, 3], [-1, 2, 1], [0, 0, 0]],
+                 dtype=np.float32),
+        np.array([[4, 5, -3]],
+                 dtype=np.float32),
+    ]
+    tag_bitmap_list = [
+        np.array(
+            [[True, True, False], [True, False, True], [False, True, True],
+             [True, False, True]],
+            dtype=np.bool),
+        np.array([[True, True, False]], dtype=np.bool)
+    ]
+    for sequence_lengths, inputs, tag_bitmap in zip(
+        sequence_lengths_list, inputs_list, tag_bitmap_list):
+      with self.test_session() as sess:
+        sequence_score = crf.crf_multitag_sequence_score(
+            inputs=array_ops.expand_dims(inputs, 0),
+            tag_bitmap=array_ops.expand_dims(tag_bitmap, 0),
+            sequence_lengths=array_ops.expand_dims(sequence_lengths, 0),
+            transition_params=constant_op.constant(transition_params))
+        sequence_score = array_ops.squeeze(sequence_score, [0])
+        tf_sum_sequence_score = sess.run(sequence_score)
+        all_indices_list = [
+            single_index_bitmap.nonzero()[0]
+            for single_index_bitmap in tag_bitmap[:sequence_lengths]
+        ]
+        expected_sequence_scores = [
+            self.calculateSequenceScore(inputs, transition_params, indices,
+                                        sequence_lengths)
+            for indices in itertools.product(*all_indices_list)
+        ]
+        expected_log_sum_exp_sequence_scores = np.logaddexp.reduce(
+            expected_sequence_scores)
+        self.assertAllClose(tf_sum_sequence_score,
+                            expected_log_sum_exp_sequence_scores)
+
   def testCrfUnaryScore(self):
     inputs = np.array(
         [[4, 5, -3], [3, -1, 3], [-1, 2, 1], [0, 0, 0]], dtype=np.float32)
diff --git a/tensorflow/contrib/crf/python/ops/crf.py b/tensorflow/contrib/crf/python/ops/crf.py
index 2d2cbdc199..8a7ff61bc8 100644
--- a/tensorflow/contrib/crf/python/ops/crf.py
+++ b/tensorflow/contrib/crf/python/ops/crf.py
@@ -67,7 +67,7 @@ __all__ = [
     "crf_sequence_score", "crf_log_norm", "crf_log_likelihood",
     "crf_unary_score", "crf_binary_score", "CrfForwardRnnCell",
     "viterbi_decode", "crf_decode", "CrfDecodeForwardRnnCell",
-    "CrfDecodeBackwardRnnCell"
+    "CrfDecodeBackwardRnnCell", "crf_multitag_sequence_score"
 ]
 
 
@@ -114,6 +114,56 @@ def crf_sequence_score(inputs, tag_indices, sequence_lengths,
       false_fn=_multi_seq_fn)
 
 
+def crf_multitag_sequence_score(inputs, tag_bitmap, sequence_lengths,
+                                transition_params):
+  """Computes the unnormalized score of all tag sequences matching tag_bitmap.
+
+  tag_bitmap enables more than one tag to be considered correct at each time
+  step. This is useful when an observed output at a given time step is
+  consistent with more than one tag, and thus the log likelihood of that
+  observation must take into account all possible consistent tags.
+
+  Using one-hot vectors in tag_bitmap gives results identical to
+  crf_sequence_score.
+
+  Args:
+    inputs: A [batch_size, max_seq_len, num_tags] tensor of unary potentials
+        to use as input to the CRF layer.
+    tag_bitmap: A [batch_size, max_seq_len, num_tags] boolean tensor
+        representing all active tags at each index for which to calculate the
+        unnormalized score.
+    sequence_lengths: A [batch_size] vector of true sequence lengths.
+    transition_params: A [num_tags, num_tags] transition matrix.
+  Returns:
+    sequence_scores: A [batch_size] vector of unnormalized sequence scores.
+  """
+
+  # If max_seq_len is 1, we skip the score calculation and simply gather the
+  # unary potentials of all active tags.
+  def _single_seq_fn():
+    filtered_inputs = array_ops.where(
+        tag_bitmap, inputs,
+        array_ops.fill(array_ops.shape(inputs), float("-inf")))
+    return math_ops.reduce_logsumexp(
+        filtered_inputs, axis=[1, 2], keepdims=False)
+
+  def _multi_seq_fn():
+    # Compute the logsumexp of all scores of sequences matching the given tags.
+    filtered_inputs = array_ops.where(
+        tag_bitmap, inputs,
+        array_ops.fill(array_ops.shape(inputs), float("-inf")))
+    return crf_log_norm(
+        inputs=filtered_inputs,
+        sequence_lengths=sequence_lengths,
+        transition_params=transition_params)
+
+  return utils.smart_cond(
+      pred=math_ops.equal(inputs.shape[1].value or array_ops.shape(inputs)[1],
+                          1),
+      true_fn=_single_seq_fn,
+      false_fn=_multi_seq_fn)
+
+
 def crf_log_norm(inputs, sequence_lengths, transition_params):
   """Computes the normalization for a CRF.
 
diff --git a/tensorflow/contrib/data/__init__.py b/tensorflow/contrib/data/__init__.py
index 156538b4e0..7878e46e88 100644
--- a/tensorflow/contrib/data/__init__.py
+++ b/tensorflow/contrib/data/__init__.py
@@ -34,6 +34,7 @@ See @{$guide/datasets$Importing Data} for an overview.
 @@batch_and_drop_remainder
 @@bucket_by_sequence_length
 @@choose_from_datasets
+@@copy_to_device
 @@dense_to_sparse_batch
 @@enumerate_dataset
 
@@ -51,6 +52,7 @@ See @{$guide/datasets$Importing Data} for an overview.
 @@prefetch_to_device
 @@read_batch_features
 @@rejection_resample
+@@reduce_dataset
 @@sample_from_datasets
 @@scan
 @@shuffle_and_repeat
@@ -76,6 +78,7 @@ from tensorflow.contrib.data.python.ops.counter import Counter
 from tensorflow.contrib.data.python.ops.enumerate_ops import enumerate_dataset
 from tensorflow.contrib.data.python.ops.error_ops import ignore_errors
 from tensorflow.contrib.data.python.ops.get_single_element import get_single_element
+from tensorflow.contrib.data.python.ops.get_single_element import reduce_dataset
 from tensorflow.contrib.data.python.ops.grouping import bucket_by_sequence_length
 from tensorflow.contrib.data.python.ops.grouping import group_by_reducer
 from tensorflow.contrib.data.python.ops.grouping import group_by_window
@@ -86,6 +89,7 @@ from tensorflow.contrib.data.python.ops.interleave_ops import sample_from_datase
 from tensorflow.contrib.data.python.ops.interleave_ops import sloppy_interleave
 from tensorflow.contrib.data.python.ops.iterator_ops import CheckpointInputPipelineHook
 from tensorflow.contrib.data.python.ops.iterator_ops import make_saveable_from_iterator
+from tensorflow.contrib.data.python.ops.prefetching_ops import copy_to_device
 from tensorflow.contrib.data.python.ops.prefetching_ops import prefetch_to_device
 from tensorflow.contrib.data.python.ops.random_ops import RandomDataset
 from tensorflow.contrib.data.python.ops.readers import CsvDataset
diff --git a/tensorflow/contrib/data/kernels/BUILD b/tensorflow/contrib/data/kernels/BUILD
index 7b69e10441..566cbb246a 100644
--- a/tensorflow/contrib/data/kernels/BUILD
+++ b/tensorflow/contrib/data/kernels/BUILD
@@ -71,8 +71,19 @@ cc_library(
 )
 
 cc_library(
+    name = "assert_next_dataset_op",
+    srcs = ["assert_next_dataset_op.cc"],
+    deps = [
+        "//tensorflow/core:framework_headers_lib",
+        "//third_party/eigen3",
+        "@protobuf_archive//:protobuf_headers",
+    ],
+)
+
+cc_library(
     name = "dataset_kernels",
     deps = [
+        ":assert_next_dataset_op",
         ":csv_dataset_op",
         ":directed_interleave_dataset_op",
         ":ignore_errors_dataset_op",
diff --git a/tensorflow/contrib/data/kernels/assert_next_dataset_op.cc b/tensorflow/contrib/data/kernels/assert_next_dataset_op.cc
new file mode 100644
index 0000000000..95b8e1f7fd
--- /dev/null
+++ b/tensorflow/contrib/data/kernels/assert_next_dataset_op.cc
@@ -0,0 +1,152 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <map>
+
+#include "tensorflow/core/framework/dataset.h"
+#include "tensorflow/core/framework/partial_tensor_shape.h"
+#include "tensorflow/core/framework/tensor.h"
+
+namespace tensorflow {
+namespace {
+
+// See documentation in ../ops/dataset_ops.cc for a high-level
+// description of the following op.
+class AssertNextDatasetOp : public UnaryDatasetOpKernel {
+ public:
+  explicit AssertNextDatasetOp(OpKernelConstruction* ctx)
+      : UnaryDatasetOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+  }
+
+ protected:
+  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+                   DatasetBase** output) override {
+    std::vector<string> transformations;
+    OP_REQUIRES_OK(ctx, ParseVectorArgument<string>(ctx, "transformations",
+                                                    &transformations));
+    *output =
+        new Dataset(ctx, input, transformations, output_types_, output_shapes_);
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    Dataset(OpKernelContext* ctx, const DatasetBase* input,
+            const std::vector<string>& transformations,
+            const DataTypeVector& output_types,
+            const std::vector<PartialTensorShape>& output_shapes)
+        : GraphDatasetBase(ctx),
+          input_(input),
+          transformations_(transformations),
+          output_types_(output_types),
+          output_shapes_(output_shapes) {
+      input_->Ref();
+    }
+
+    ~Dataset() override { input_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(
+          new Iterator({this, strings::StrCat(prefix, "::Assert")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      return output_types_;
+    }
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      return output_shapes_;
+    }
+
+    string DebugString() const override {
+      return "AssertNextDatasetOp::Dataset";
+    }
+
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      Node* input_graph_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
+      Node* transformations_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddVector(transformations_, &transformations_node));
+      TF_RETURN_IF_ERROR(b->AddDataset(
+          this, {input_graph_node, transformations_node}, output));
+      return Status::OK();
+    }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        std::vector<string> tokens =
+            str_util::Split(prefix(), ':', str_util::SkipEmpty());
+        if (dataset()->transformations_.size() > tokens.size() - 2) {
+          return errors::InvalidArgument(
+              "Asserted next ", dataset()->transformations_.size(),
+              " transformations but encountered only ", tokens.size() - 2, ".");
+        }
+        int n = tokens.size();
+        for (size_t i = 0; i < dataset()->transformations_.size(); ++i) {
+          if (dataset()->transformations_[i] != tokens[n - 2 - i]) {
+            return errors::InvalidArgument(
+                "Asserted ", dataset()->transformations_[i],
+                " transformation at offset ", i, " but encountered ",
+                tokens[n - 2 - i], " transformation instead.");
+          }
+        }
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        return input_impl_->GetNext(ctx, out_tensors, end_of_sequence);
+      }
+
+     protected:
+      Status SaveInternal(IteratorStateWriter* writer) override {
+        TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+        return Status::OK();
+      }
+
+      Status RestoreInternal(IteratorContext* ctx,
+                             IteratorStateReader* reader) override {
+        TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+        return Status::OK();
+      }
+
+     private:
+      std::unique_ptr<IteratorBase> input_impl_;
+    };
+
+    const DatasetBase* input_;
+    const std::vector<string> transformations_;
+    const DataTypeVector output_types_;
+    const std::vector<PartialTensorShape> output_shapes_;
+  };
+
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("AssertNextDataset").Device(DEVICE_CPU),
+                        AssertNextDatasetOp);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/data/kernels/csv_dataset_op.cc b/tensorflow/contrib/data/kernels/csv_dataset_op.cc
index 4657807785..f7e3ed886c 100644
--- a/tensorflow/contrib/data/kernels/csv_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/csv_dataset_op.cc
@@ -18,7 +18,10 @@ limitations under the License.
 #include "tensorflow/core/framework/dataset.h"
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/shape_inference.h"
+#include "tensorflow/core/lib/io/inputstream_interface.h"
 #include "tensorflow/core/lib/io/random_inputstream.h"
+#include "tensorflow/core/lib/io/zlib_compression_options.h"
+#include "tensorflow/core/lib/io/zlib_inputstream.h"
 
 namespace tensorflow {
 namespace {
@@ -37,6 +40,10 @@ class CSVDatasetOp : public DatasetOpKernel {
         ctx, filenames_tensor->dims() <= 1,
         errors::InvalidArgument("`filenames` must be a scalar or a vector."));
 
+    string compression_type;
+    OP_REQUIRES_OK(ctx, ParseScalarArgument<string>(ctx, "compression_type",
+                                                    &compression_type));
+
     OpInputList record_defaults_list;
     OP_REQUIRES_OK(ctx,
                    ctx->input_list("record_defaults", &record_defaults_list));
@@ -86,6 +93,19 @@ class CSVDatasetOp : public DatasetOpKernel {
       filenames.push_back(filenames_tensor->flat<string>()(i));
     }
 
+    io::ZlibCompressionOptions zlib_compression_options =
+        io::ZlibCompressionOptions::DEFAULT();
+    if (compression_type == "ZLIB") {
+      zlib_compression_options = io::ZlibCompressionOptions::DEFAULT();
+    } else if (compression_type == "GZIP") {
+      zlib_compression_options = io::ZlibCompressionOptions::GZIP();
+    } else {
+      OP_REQUIRES(ctx, compression_type.empty(),
+                  errors::InvalidArgument(
+                      "Unsupported compression_type: ", compression_type, "."));
+    }
+    zlib_compression_options.input_buffer_size = buffer_size;
+
     std::vector<int64> select_cols;
     select_cols.reserve(select_cols_tensor->NumElements());
     for (int i = 0; i < select_cols_tensor->NumElements(); ++i) {
@@ -103,7 +123,8 @@ class CSVDatasetOp : public DatasetOpKernel {
         ctx, select_cols.empty() || select_cols.front() >= 0,
         errors::InvalidArgument("select_cols should be non-negative indices"));
 
-    *output = new Dataset(ctx, std::move(filenames), header, buffer_size,
+    *output = new Dataset(ctx, std::move(filenames), header,
+                          std::move(compression_type), zlib_compression_options,
                           output_types_, output_shapes_,
                           std::move(record_defaults), std::move(select_cols),
                           use_quote_delim, delim[0], std::move(na_value));
@@ -113,21 +134,24 @@ class CSVDatasetOp : public DatasetOpKernel {
   class Dataset : public GraphDatasetBase {
    public:
     Dataset(OpKernelContext* ctx, std::vector<string> filenames, bool header,
-            int64 buffer_size, const DataTypeVector& output_types,
+            string compression_type, io::ZlibCompressionOptions options,
+            const DataTypeVector& output_types,
             const std::vector<PartialTensorShape>& output_shapes,
             std::vector<Tensor> record_defaults, std::vector<int64> select_cols,
             bool use_quote_delim, char delim, string na_value)
         : GraphDatasetBase(ctx),
           filenames_(std::move(filenames)),
           header_(header),
-          buffer_size_(buffer_size),
           out_type_(output_types),
           output_shapes_(output_shapes),
           record_defaults_(std::move(record_defaults)),
           select_cols_(std::move(select_cols)),
           use_quote_delim_(use_quote_delim),
           delim_(delim),
-          na_value_(std::move(na_value)) {}
+          na_value_(std::move(na_value)),
+          use_compression_(!compression_type.empty()),
+          compression_type_(std::move(compression_type)),
+          options_(options) {}
 
     std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
@@ -146,10 +170,45 @@ class CSVDatasetOp : public DatasetOpKernel {
    protected:
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
                               Node** output) const override {
-      // TODO(rachelim): Implement this
-      std::vector<Node*> input_tensors;
-      TF_RETURN_IF_ERROR(b->AddDataset(this, input_tensors, output));
-      return errors::Unimplemented("CSVDataset: AsGraphDefInternal");
+      Node* filenames = nullptr;
+      Node* compression_type = nullptr;
+      Node* buffer_size = nullptr;
+      Node* header = nullptr;
+      Node* delim = nullptr;
+      Node* use_quote_delim = nullptr;
+      Node* na_value = nullptr;
+      Node* select_cols = nullptr;
+
+      std::vector<Node*> record_defaults;
+      record_defaults.reserve(record_defaults_.size());
+      for (const Tensor& t : record_defaults_) {
+        Node* node;
+        TF_RETURN_IF_ERROR(b->AddTensor(t, &node));
+        record_defaults.emplace_back(node);
+      }
+
+      TF_RETURN_IF_ERROR(b->AddVector(filenames_, &filenames));
+      TF_RETURN_IF_ERROR(b->AddScalar(compression_type_, &compression_type));
+      TF_RETURN_IF_ERROR(
+          b->AddScalar(options_.input_buffer_size, &buffer_size));
+      TF_RETURN_IF_ERROR(b->AddScalar(header_, &header));
+
+      string delim_string(1, delim_);
+      TF_RETURN_IF_ERROR(b->AddScalar(delim_string, &delim));
+      TF_RETURN_IF_ERROR(b->AddScalar(use_quote_delim_, &use_quote_delim));
+      TF_RETURN_IF_ERROR(b->AddScalar(na_value_, &na_value));
+      TF_RETURN_IF_ERROR(b->AddVector(select_cols_, &select_cols));
+
+      TF_RETURN_IF_ERROR(b->AddDataset(
+          this,
+          {std::make_pair(0, filenames), std::make_pair(1, compression_type),
+           std::make_pair(2, buffer_size), std::make_pair(3, header),
+           std::make_pair(4, delim), std::make_pair(5, use_quote_delim),
+           std::make_pair(6, na_value),
+           std::make_pair(7, select_cols)},      // Single tensor inputs
+          {std::make_pair(8, record_defaults)},  // Tensor list inputs
+          {}, output));
+      return Status::OK();
     }
 
    private:
@@ -201,14 +260,58 @@ class CSVDatasetOp : public DatasetOpKernel {
      protected:
       Status SaveInternal(IteratorStateWriter* writer) override {
         mutex_lock l(mu_);
-        // TODO(rachelim): Implement save
-        return errors::Unimplemented("CSVDataset: SaveInternal");
+        TF_RETURN_IF_ERROR(writer->WriteScalar(full_name("current_file_index"),
+                                               current_file_index_));
+        // `input_stream_` is empty if
+        // 1. GetNext has not been called even once.
+        // 2. All files have been read and the iterator has been exhausted.
+        if (input_stream_ && num_buffer_reads_ > 0) {
+          TF_RETURN_IF_ERROR(writer->WriteScalar(full_name("pos"), pos_));
+          // If num_buffer_reads_ == 0, the buffer hasn't been filled even once.
+          TF_RETURN_IF_ERROR(writer->WriteScalar(full_name("num_buffer_reads"),
+                                                 num_buffer_reads_));
+        }
+        return Status::OK();
       }
+
       Status RestoreInternal(IteratorContext* ctx,
                              IteratorStateReader* reader) override {
         mutex_lock l(mu_);
-        // TODO(rachelim): Implement restore
-        return errors::Unimplemented("CSVDataset: RestoreInternal");
+        ResetStreamsLocked();
+        int64 current_file_index;
+        TF_RETURN_IF_ERROR(reader->ReadScalar(full_name("current_file_index"),
+                                              &current_file_index));
+        current_file_index_ = size_t(current_file_index);
+        // The keys "pos" and "num_buffer_reads" are written only if
+        // the iterator was saved with an open, partially read file.
+        if (reader->Contains(full_name("pos"))) {
+          int64 pos, num_buffer_reads;
+          TF_RETURN_IF_ERROR(reader->ReadScalar(full_name("pos"), &pos));
+          TF_RETURN_IF_ERROR(reader->ReadScalar(full_name("num_buffer_reads"),
+                                                &num_buffer_reads));
+
+          TF_RETURN_IF_ERROR(SetupStreamsLocked(ctx->env()));
+
+          num_buffer_reads_ = size_t(num_buffer_reads - 1);
+
+          // Restores the most recently held buffer
+          Status s = input_stream_->SkipNBytes(
+              num_buffer_reads_ * dataset()->options_.input_buffer_size);
+          if (!s.ok() && !errors::IsOutOfRange(s)) {
+            // We might get out of range error here if the size of the file
+            // is not an exact multiple of the buffer size, and the last buffer
+            // read is < buffer_size. This is valid and we do not surface the
+            // error.
+            return s;
+          }
+
+          Status s2 = FillBuffer(&buffer_);
+          if (!s2.ok() && !errors::IsOutOfRange(s2)) {
+            return s2;
+          }
+          pos_ = size_t(pos);
+        }
+        return Status::OK();
       }
 
      private:
@@ -510,7 +613,9 @@ class CSVDatasetOp : public DatasetOpKernel {
 
       Status FillBuffer(string* result) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
         result->clear();
-        Status s = input_stream_->ReadNBytes(dataset()->buffer_size_, result);
+        ++num_buffer_reads_;
+        Status s = input_stream_->ReadNBytes(
+            dataset()->options_.input_buffer_size, result);
 
         if (errors::IsOutOfRange(s) && !result->empty()) {
           // Ignore OutOfRange error when ReadNBytes read < N bytes.
@@ -675,10 +780,20 @@ class CSVDatasetOp : public DatasetOpKernel {
         // Actually move on to next file.
         TF_RETURN_IF_ERROR(env->NewRandomAccessFile(
             dataset()->filenames_[current_file_index_], &file_));
-        input_stream_.reset(
-            new io::RandomAccessInputStream(file_.get(), false));
+        random_access_input_stream_ =
+            std::make_shared<io::RandomAccessInputStream>(file_.get(), false);
+
+        if (dataset()->use_compression_) {
+          input_stream_ = std::make_shared<io::ZlibInputStream>(
+              random_access_input_stream_.get(),
+              dataset()->options_.input_buffer_size,
+              dataset()->options_.input_buffer_size, dataset()->options_);
+        } else {
+          input_stream_ = random_access_input_stream_;
+        }
         buffer_.clear();
         pos_ = 0;
+        num_buffer_reads_ = 0;
         if (dataset()->header_) {
           // Read one line, but don't include it. Pass nullptrs as dummy
           // pointers to objects that shouldn't be invoked anyway
@@ -704,8 +819,10 @@ class CSVDatasetOp : public DatasetOpKernel {
       string buffer_ GUARDED_BY(mu_);  // Maintain our own buffer
       size_t pos_ GUARDED_BY(
           mu_);  // Index into the buffer must be maintained between iters
-      std::unique_ptr<io::RandomAccessInputStream> input_stream_
+      size_t num_buffer_reads_ GUARDED_BY(mu_);
+      std::shared_ptr<io::RandomAccessInputStream> random_access_input_stream_
           GUARDED_BY(mu_);
+      std::shared_ptr<io::InputStreamInterface> input_stream_ GUARDED_BY(mu_);
       size_t current_file_index_ GUARDED_BY(mu_) = 0;
       std::unique_ptr<RandomAccessFile> file_
           GUARDED_BY(mu_);  // must outlive input_stream_
@@ -713,7 +830,6 @@ class CSVDatasetOp : public DatasetOpKernel {
 
     const std::vector<string> filenames_;
     const bool header_;
-    const int64 buffer_size_;
     const DataTypeVector out_type_;
     const std::vector<PartialTensorShape> output_shapes_;
     const std::vector<Tensor> record_defaults_;
@@ -721,6 +837,9 @@ class CSVDatasetOp : public DatasetOpKernel {
     const bool use_quote_delim_;
     const char delim_;
     const string na_value_;
+    const bool use_compression_;
+    const string compression_type_;
+    const io::ZlibCompressionOptions options_;
   };  // class Dataset
 
   DataTypeVector output_types_;
diff --git a/tensorflow/contrib/data/kernels/prefetching_kernels.cc b/tensorflow/contrib/data/kernels/prefetching_kernels.cc
index b3d464d716..32f03ca683 100644
--- a/tensorflow/contrib/data/kernels/prefetching_kernels.cc
+++ b/tensorflow/contrib/data/kernels/prefetching_kernels.cc
@@ -15,6 +15,7 @@ limitations under the License.
 #include <deque>
 
 #include "tensorflow/core/common_runtime/process_function_library_runtime.h"
+#include "tensorflow/core/framework/dataset.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/resource_op_kernel.h"
@@ -23,6 +24,7 @@ limitations under the License.
 #include "tensorflow/core/util/device_name_utils.h"
 
 namespace tensorflow {
+namespace {
 
 struct BufferElement {
   // The producer sets `status` if getting the input element fails.
@@ -473,4 +475,466 @@ class IteratorGetDeviceOp : public OpKernel {
 REGISTER_KERNEL_BUILDER(Name("IteratorGetDevice").Device(DEVICE_CPU),
                         IteratorGetDeviceOp);
 
+Status VerifyTypesMatch(const DataTypeVector& expected,
+                        const DataTypeVector& received) {
+  if (expected.size() != received.size()) {
+    return errors::InvalidArgument(
+        "Number of components does not match: expected ", expected.size(),
+        " types but got ", received.size(), ".");
+  }
+  for (size_t i = 0; i < expected.size(); ++i) {
+    if (expected[i] != received[i]) {
+      return errors::InvalidArgument("Data type mismatch at component ", i,
+                                     ": expected ", DataTypeString(expected[i]),
+                                     " but got ", DataTypeString(received[i]),
+                                     ".");
+    }
+  }
+  return Status::OK();
+}
+
+Status VerifyShapesCompatible(const std::vector<PartialTensorShape>& expected,
+                              const std::vector<PartialTensorShape>& received) {
+  if (expected.size() != received.size()) {
+    return errors::InvalidArgument(
+        "Number of components does not match: expected ", expected.size(),
+        " shapes but got ", received.size(), ".");
+  }
+  for (size_t i = 0; i < expected.size(); ++i) {
+    if (!expected[i].IsCompatibleWith(received[i])) {
+      return errors::InvalidArgument("Incompatible shapes at component ", i,
+                                     ": expected ", expected[i].DebugString(),
+                                     " but got ", received[i].DebugString(),
+                                     ".");
+    }
+  }
+
+  return Status::OK();
+}
+
+string SanitizeThreadSuffix(string suffix) {
+  string clean;
+  for (int i = 0; i < suffix.size(); ++i) {
+    const char ch = suffix[i];
+    if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
+        (ch >= '0' && ch <= '9') || ch == '_' || ch == '-') {
+      clean += ch;
+    } else {
+      clean += '_';
+    }
+  }
+  return clean;
+}
+
+class MultiDeviceIterator : public ResourceBase {
+ public:
+  MultiDeviceIterator(const DataTypeVector& output_types,
+                      const std::vector<PartialTensorShape>& output_shapes,
+                      const std::vector<string>& devices,
+                      std::unique_ptr<FunctionLibraryDefinition> flib_def,
+                      std::unique_ptr<ProcessFunctionLibraryRuntime> pflr,
+                      FunctionLibraryRuntime* lib)
+      : output_types_(output_types),
+        output_shapes_(output_shapes),
+        devices_(devices),
+        flib_def_(std::move(flib_def)),
+        pflr_(std::move(pflr)),
+        lib_(lib) {
+    buffer_.resize(devices_.size());
+  }
+
+  string DebugString() override {
+    return strings::StrCat("MultiDeviceIterator");
+  }
+
+  Status Init(std::unique_ptr<IteratorBase> iterator, int64* incarnation_id) {
+    mutex_lock l(mu_);
+    if (iterator) {
+      TF_RETURN_IF_ERROR(
+          VerifyTypesMatch(output_types_, iterator->output_dtypes()));
+      TF_RETURN_IF_ERROR(
+          VerifyShapesCompatible(output_shapes_, iterator->output_shapes()));
+    }
+    host_iterator_.reset(iterator.release());
+    incarnation_id_++;
+    *incarnation_id = incarnation_id_;
+    max_buffer_size_ = 0;
+    num_elements_ = 0;
+    buffer_.clear();
+    buffer_.resize(devices_.size());
+    return Status::OK();
+  }
+
+  Status GetNextFromShard(IteratorContext* ctx, int shard_num,
+                          int64 incarnation_id,
+                          std::vector<Tensor>* out_tensors,
+                          bool* end_of_sequence) {
+    // TODO(rohanj): This might potentially strand elements in other shards.
+    // Opportunity to do smarter locking semantics.
+    mutex_lock l(mu_);
+    // Make sure we're in the right incarnation.
+    if (incarnation_id != incarnation_id_) {
+      return errors::InvalidArgument(
+          "Current incarnation: ", incarnation_id_,
+          "; Supplied incarnation: ", incarnation_id);
+    }
+    // Then look it up in the buffer.
+    if (!buffer_[shard_num].empty()) {
+      const HostBufferElement& elem = buffer_[shard_num].front();
+      *out_tensors = elem.value;
+      *end_of_sequence = elem.end_of_sequence;
+      Status s = elem.status;
+      buffer_[shard_num].pop_front();
+      return s;
+    }
+    std::shared_ptr<IteratorBase> captured_iterator(host_iterator_);
+    if (captured_iterator) {
+      if (lib_ != nullptr) {
+        ctx->set_lib(lib_);
+      }
+      while (true) {
+        HostBufferElement elem;
+        elem.status =
+            captured_iterator->GetNext(ctx, &elem.value, &elem.end_of_sequence);
+        int buffer_index = num_elements_ % devices_.size();
+        num_elements_++;
+        if (buffer_index == shard_num) {
+          out_tensors->swap(elem.value);
+          *end_of_sequence = elem.end_of_sequence;
+          return elem.status;
+        } else {
+          buffer_[buffer_index].push_back(std::move(elem));
+          // TODO(rohanj): Put an upper bound to buffer size.
+          if (buffer_[buffer_index].size() > max_buffer_size_) {
+            max_buffer_size_ = buffer_[buffer_index].size();
+            VLOG(1) << "MultiDeviceIterator: Max buffer size increased to: "
+                    << max_buffer_size_;
+          }
+        }
+      }
+    } else {
+      return errors::FailedPrecondition("Iterator not initialized");
+    }
+    return Status::OK();
+  }
+
+  const DataTypeVector& output_types() const { return output_types_; }
+
+  const std::vector<PartialTensorShape>& output_shapes() const {
+    return output_shapes_;
+  }
+
+  std::shared_ptr<const FunctionLibraryDefinition> function_library() {
+    tf_shared_lock l(mu_);
+    return lib_def_;
+  }
+
+ private:
+  struct HostBufferElement {
+    Status status;
+    bool end_of_sequence;
+    std::vector<Tensor> value;
+  };
+
+  mutex mu_;
+  const DataTypeVector output_types_;
+  const std::vector<PartialTensorShape> output_shapes_;
+  const std::vector<string> devices_;
+  int64 num_elements_ GUARDED_BY(mu_) = 0;
+  int64 max_buffer_size_ GUARDED_BY(mu_) = 0;
+  int64 incarnation_id_ GUARDED_BY(mu_) = 0;
+  std::vector<std::deque<HostBufferElement>> buffer_ GUARDED_BY(mu_);
+  std::unique_ptr<FunctionLibraryDefinition> flib_def_;
+  std::unique_ptr<ProcessFunctionLibraryRuntime> pflr_;
+  FunctionLibraryRuntime* lib_ = nullptr;  // not owned.
+  std::shared_ptr<IteratorBase> host_iterator_;
+  std::shared_ptr<const FunctionLibraryDefinition> lib_def_ GUARDED_BY(mu_);
+};
+
+// Just creates a MultiDeviceIterator and returns it.
+class MultiDeviceIteratorHandleOp : public OpKernel {
+ public:
+  explicit MultiDeviceIteratorHandleOp(OpKernelConstruction* ctx)
+      : OpKernel(ctx), graph_def_version_(ctx->graph_def_version()) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("shared_name", &name_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("container", &container_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("devices", &devices_));
+  }
+
+  // The resource is deleted from the resource manager only when it is private
+  // to kernel.
+  ~MultiDeviceIteratorHandleOp() override {
+    if (resource_ != nullptr) {
+      resource_->Unref();
+      if (cinfo_.resource_is_private_to_kernel()) {
+        if (!cinfo_.resource_manager()
+                 ->template Delete<MultiDeviceIterator>(cinfo_.container(),
+                                                        cinfo_.name())
+                 .ok()) {
+          // Do nothing; the resource can have been deleted by session resets.
+        }
+      }
+    }
+  }
+
+  void Compute(OpKernelContext* context) override LOCKS_EXCLUDED(mu_) {
+    {
+      mutex_lock l(mu_);
+      if (resource_ == nullptr) {
+        FunctionLibraryRuntime* lib;
+        std::unique_ptr<FunctionLibraryDefinition> flib_def(nullptr);
+        std::unique_ptr<ProcessFunctionLibraryRuntime> pflr(nullptr);
+        OP_REQUIRES_OK(context, context->function_library()->Clone(
+                                    &flib_def, &pflr, &lib));
+        ResourceMgr* mgr = context->resource_manager();
+        OP_REQUIRES_OK(context, cinfo_.Init(mgr, def()));
+
+        MultiDeviceIterator* resource;
+        OP_REQUIRES_OK(
+            context,
+            mgr->LookupOrCreate<MultiDeviceIterator>(
+                cinfo_.container(), cinfo_.name(), &resource,
+                [this, lib, &flib_def, &pflr](MultiDeviceIterator** ret)
+                    EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+                      *ret = new MultiDeviceIterator(
+                          output_types_, output_shapes_, devices_,
+                          std::move(flib_def), std::move(pflr), lib);
+                      return Status::OK();
+                    }));
+
+        Status s = VerifyResource(resource);
+        if (TF_PREDICT_FALSE(!s.ok())) {
+          resource->Unref();
+          context->SetStatus(s);
+          return;
+        }
+
+        resource_ = resource;
+      }
+    }
+    OP_REQUIRES_OK(context, MakeResourceHandleToOutput(
+                                context, 0, cinfo_.container(), cinfo_.name(),
+                                MakeTypeIndex<MultiDeviceIterator>()));
+  }
+
+ private:
+  // During the first Compute(), resource is either created or looked up using
+  // shared_name. In the latter case, the resource found should be verified if
+  // it is compatible with this op's configuration. The verification may fail in
+  // cases such as two graphs asking queues of the same shared name to have
+  // inconsistent capacities.
+  Status VerifyResource(MultiDeviceIterator* resource) {
+    TF_RETURN_IF_ERROR(
+        VerifyTypesMatch(output_types_, resource->output_types()));
+    TF_RETURN_IF_ERROR(
+        VerifyShapesCompatible(output_shapes_, resource->output_shapes()));
+    return Status::OK();
+  }
+
+  mutex mu_;
+  ContainerInfo cinfo_;  // Written once under mu_ then constant afterwards.
+  MultiDeviceIterator* resource_ GUARDED_BY(mu_) = nullptr;
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+  const int graph_def_version_;
+  string name_;
+  string container_;
+  std::vector<string> devices_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("MultiDeviceIterator").Device(DEVICE_CPU),
+                        MultiDeviceIteratorHandleOp);
+
+// Calls init on the MultiDeviceIterator.
+class MultiDeviceIteratorInitOp : public OpKernel {
+ public:
+  explicit MultiDeviceIteratorInitOp(OpKernelConstruction* ctx)
+      : OpKernel(ctx) {}
+
+  void Compute(OpKernelContext* ctx) override {
+    DatasetBase* dataset;
+    OP_REQUIRES_OK(ctx, GetDatasetFromVariantTensor(ctx->input(0), &dataset));
+    MultiDeviceIterator* resource;
+    OP_REQUIRES_OK(ctx,
+                   LookupResource(ctx, HandleFromInput(ctx, 1), &resource));
+    core::ScopedUnref unref(resource);
+
+    IteratorContext iter_ctx = dataset::MakeIteratorContext(ctx);
+    std::unique_ptr<IteratorBase> iterator;
+    OP_REQUIRES_OK(ctx,
+                   dataset->MakeIterator(&iter_ctx, "Iterator", &iterator));
+    int64 incarnation_id;
+    OP_REQUIRES_OK(ctx, resource->Init(std::move(iterator), &incarnation_id));
+    Tensor tensor_incarnation_id(DT_INT64, TensorShape({}));
+    tensor_incarnation_id.scalar<int64>()() = incarnation_id;
+    OP_REQUIRES_OK(ctx,
+                   ctx->set_output("incarnation_id", tensor_incarnation_id));
+  }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MultiDeviceIteratorInit").Device(DEVICE_CPU),
+                        MultiDeviceIteratorInitOp);
+
+// Calls GetNextFromShard(shard) and returns a vector of Tensors as output.
+// TODO(rohanj): Implement using BackgroundWorker that Derek built?
+class MultiDeviceIteratorGetNextFromShardOp : public AsyncOpKernel {
+ public:
+  explicit MultiDeviceIteratorGetNextFromShardOp(OpKernelConstruction* ctx)
+      : AsyncOpKernel(ctx),
+        thread_pool_(new thread::ThreadPool(
+            ctx->env(), ThreadOptions(),
+            strings::StrCat("multi_device_iterator_get_next_thread_",
+                            SanitizeThreadSuffix(name())),
+            1 /* num_threads */, false /* low_latency_hint */)) {}
+
+  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
+    const Tensor* tensor_shard_num;
+    OP_REQUIRES_OK_ASYNC(ctx, ctx->input("shard_num", &tensor_shard_num), done);
+    int32 shard_num = tensor_shard_num->scalar<int32>()();
+
+    const Tensor* tensor_incarnation_id;
+    OP_REQUIRES_OK_ASYNC(
+        ctx, ctx->input("incarnation_id", &tensor_incarnation_id), done);
+    int64 incarnation_id = tensor_incarnation_id->scalar<int64>()();
+
+    MultiDeviceIterator* iterator;
+    OP_REQUIRES_OK_ASYNC(
+        ctx, LookupResource(ctx, HandleFromInput(ctx, 0), &iterator), done);
+    thread_pool_->Schedule(std::bind(
+        [ctx, iterator, shard_num, incarnation_id](DoneCallback done) {
+          std::vector<Tensor> components;
+          bool end_of_sequence = false;
+
+          IteratorContext::Params params;
+          params.env = ctx->env();
+          params.runner = *(ctx->runner());
+          params.function_library = iterator->function_library();
+          DeviceBase* device = ctx->function_library()->device();
+          params.allocator_getter = [device](AllocatorAttributes attrs) {
+            return device->GetAllocator(attrs);
+          };
+          IteratorContext iter_ctx(std::move(params));
+
+          Status s =
+              iterator->GetNextFromShard(&iter_ctx, shard_num, incarnation_id,
+                                         &components, &end_of_sequence);
+          iterator->Unref();
+
+          if (!s.ok()) {
+            ctx->SetStatus(s);
+          } else if (end_of_sequence) {
+            ctx->SetStatus(errors::OutOfRange("End of sequence"));
+          } else {
+            for (int i = 0; i < components.size(); ++i) {
+              // TODO(mrry): Check that the shapes match the shape attrs.
+              ctx->set_output(i, components[i]);
+            }
+          }
+          done();
+        },
+        std::move(done)));
+  }
+
+ private:
+  std::unique_ptr<thread::ThreadPool> thread_pool_;
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("MultiDeviceIteratorGetNextFromShard").Device(DEVICE_CPU),
+    MultiDeviceIteratorGetNextFromShardOp);
+
+class MultiDeviceIteratorToStringHandleOp : public OpKernel {
+ public:
+  explicit MultiDeviceIteratorToStringHandleOp(OpKernelConstruction* ctx)
+      : OpKernel(ctx) {}
+
+  void Compute(OpKernelContext* ctx) override {
+    const Tensor& resource_handle_t = ctx->input(0);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(resource_handle_t.shape()),
+                errors::InvalidArgument("resource_handle must be a scalar"));
+
+    // Validate that the handle corresponds to a real resource, and
+    // that it is an MultiDeviceIterator.
+    MultiDeviceIterator* resource;
+    OP_REQUIRES_OK(ctx,
+                   LookupResource(ctx, HandleFromInput(ctx, 0), &resource));
+    resource->Unref();
+
+    Tensor* string_handle_t;
+    OP_REQUIRES_OK(ctx,
+                   ctx->allocate_output(0, TensorShape({}), &string_handle_t));
+    string_handle_t->scalar<string>()() =
+        resource_handle_t.scalar<ResourceHandle>()().SerializeAsString();
+  }
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("MultiDeviceIteratorToStringHandle").Device(DEVICE_CPU),
+    MultiDeviceIteratorToStringHandleOp);
+
+class MultiDeviceIteratorFromStringHandleOp : public OpKernel {
+ public:
+  explicit MultiDeviceIteratorFromStringHandleOp(OpKernelConstruction* ctx)
+      : OpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+    OP_REQUIRES(
+        ctx,
+        output_types_.empty() || output_shapes_.empty() ||
+            output_types_.size() == output_shapes_.size(),
+        errors::InvalidArgument("If both 'output_types' and 'output_shapes' "
+                                "are set, they must have the same length."));
+  }
+
+  void Compute(OpKernelContext* ctx) override {
+    const Tensor& string_handle_t = ctx->input(0);
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(string_handle_t.shape()),
+                errors::InvalidArgument("string_handle must be a scalar"));
+
+    ResourceHandle resource_handle;
+    OP_REQUIRES(
+        ctx,
+        resource_handle.ParseFromString(string_handle_t.scalar<string>()()),
+        errors::InvalidArgument(
+            "Could not parse string_handle as a valid ResourceHandle"));
+
+    OP_REQUIRES(
+        ctx, resource_handle.device() == ctx->device()->attributes().name(),
+        errors::InvalidArgument("Attempted create an iterator on device \"",
+                                ctx->device()->attributes().name(),
+                                "\" from handle defined on device \"",
+                                resource_handle.device(), "\""));
+
+    // Validate that the handle corresponds to a real resource, and
+    // that it is an MultiDeviceIterator.
+    MultiDeviceIterator* resource;
+    OP_REQUIRES_OK(ctx, LookupResource(ctx, resource_handle, &resource));
+    core::ScopedUnref unref_iterator(resource);
+    if (!output_types_.empty()) {
+      OP_REQUIRES_OK(ctx,
+                     VerifyTypesMatch(output_types_, resource->output_types()));
+    }
+    if (!output_shapes_.empty()) {
+      OP_REQUIRES_OK(ctx, VerifyShapesCompatible(output_shapes_,
+                                                 resource->output_shapes()));
+    }
+
+    Tensor* resource_handle_t;
+    OP_REQUIRES_OK(
+        ctx, ctx->allocate_output(0, TensorShape({}), &resource_handle_t));
+    resource_handle_t->scalar<ResourceHandle>()() = resource_handle;
+  }
+
+ private:
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("MultiDeviceIteratorFromStringHandle").Device(DEVICE_CPU),
+    MultiDeviceIteratorFromStringHandleOp);
+
+}  // anonymous namespace
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/data/ops/dataset_ops.cc b/tensorflow/contrib/data/ops/dataset_ops.cc
index 8413fcaf87..66a7c7fdcd 100644
--- a/tensorflow/contrib/data/ops/dataset_ops.cc
+++ b/tensorflow/contrib/data/ops/dataset_ops.cc
@@ -36,6 +36,7 @@ data_input_datasets: `N` datasets with the same type that will be interleaved
 
 REGISTER_OP("CSVDataset")
     .Input("filenames: string")
+    .Input("compression_type: string")
     .Input("buffer_size: int64")
     .Input("header: bool")
     .Input("field_delim: string")
@@ -52,17 +53,18 @@ REGISTER_OP("CSVDataset")
       shape_inference::ShapeHandle unused;
       // `filenames` must be a scalar or a vector.
       TF_RETURN_IF_ERROR(c->WithRankAtMost(c->input(0), 1, &unused));
-      // `buffer_size`, `header`, `field_delim`, `use_quote_delim`,
-      // `na_value` must be scalars
+      // `compression_type`, `buffer_size`, `header`, `field_delim`,
+      // `use_quote_delim`, `na_value` must be scalars
       TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
       TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &unused));
       TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &unused));
       TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 0, &unused));
       TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 0, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 0, &unused));
       // `select_cols` must be a vector
-      TF_RETURN_IF_ERROR(c->WithRank(c->input(6), 1, &unused));
-      // `record_defaults` must be a list of scalars...?
-      for (size_t i = 7; i < c->num_inputs(); ++i) {
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(7), 1, &unused));
+      // `record_defaults` must be lists of scalars
+      for (size_t i = 8; i < c->num_inputs(); ++i) {
         TF_RETURN_IF_ERROR(c->WithRank(c->input(i), 1, &unused));
       }
       return shape_inference::ScalarShape(c);
@@ -143,6 +145,80 @@ Resets the FunctionBufferingResource.
 function_buffer_resource: The FunctionBufferingResource handle.
 )doc");
 
+REGISTER_OP("MultiDeviceIterator")
+    .Output("handle: resource")
+    .Attr("devices: list(string) >= 1")
+    .Attr("shared_name: string")
+    .Attr("container: string")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .Doc(R"doc(
+Creates a MultiDeviceIterator resource.
+
+handle: Handle to the resource created.
+devices: A list of devices the iterator works across.
+shared_name: If non-empty, this resource will be shared under the given name
+  across multiple sessions.
+container: If non-empty, this resource is placed in the given container.
+  Otherwise, a default container is used.
+output_types: The type list for the return values.
+output_shapes: The list of shapes being produced.
+)doc");
+
+REGISTER_OP("MultiDeviceIteratorInit")
+    .Input("dataset: variant")
+    .Input("multi_device_iterator: resource")
+    .Output("incarnation_id: int64")
+    .Doc(R"doc(
+Initializes the multi device iterator with the given dataset.
+incarnation_id: An int64 indicating which incarnation of the MultiDeviceIterator
+  is running.
+dataset: Dataset to be iterated upon.
+multi_device_iterator: A MultiDeviceIteratorResource.
+)doc");
+
+REGISTER_OP("MultiDeviceIteratorGetNextFromShard")
+    .Input("multi_device_iterator: resource")
+    .Input("shard_num: int32")
+    .Input("incarnation_id: int64")
+    .Output("components: output_types")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .Doc(R"doc(
+Gets next element for the provided shard number.
+
+multi_device_iterator: A MultiDeviceIterator resource.
+shard_num: Integer representing which shard to fetch data for.
+incarnation_id: Which incarnation of the MultiDeviceIterator is running.
+components: Result of the get_next on the dataset.
+output_types: The type list for the return values.
+output_shapes: The list of shapes being produced.
+)doc");
+
+REGISTER_OP("MultiDeviceIteratorToStringHandle")
+    .Input("multi_device_iterator: resource")
+    .Output("string_handle: string")
+    .Doc(R"doc(
+Produces a string handle for the given MultiDeviceIterator.
+
+multi_device_iterator: A MultiDeviceIterator resource.
+string_handle: A string representing the resource.
+)doc");
+
+REGISTER_OP("MultiDeviceIteratorFromStringHandle")
+    .Input("string_handle: string")
+    .Output("multi_device_iterator: resource")
+    .Attr("output_types: list(type) >= 0 = []")
+    .Attr("output_shapes: list(shape) >= 0 = []")
+    .Doc(R"doc(
+Generates a MultiDeviceIterator resource from its provided string handle.
+
+string_handle: String representing the resource.
+multi_device_iterator: A MultiDeviceIterator resource.
+output_types: The type list for the return values.
+output_shapes: The list of shapes being produced.
+)doc");
+
 REGISTER_OP("ThreadPoolDataset")
     .Input("input_dataset: variant")
     .Input("thread_pool: resource")
@@ -175,4 +251,17 @@ display_name: A human-readable name for the threads that may be visible in
   some visualizations.
 )doc");
 
+REGISTER_OP("AssertNextDataset")
+    .Input("input_dataset: variant")
+    .Input("transformations: string")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle unused;
+      // transformations should be a vector.
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &unused));
+      return shape_inference::ScalarShape(c);
+    });
+
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index c9435eadcd..00f5b74c33 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -60,7 +60,7 @@ py_test(
 
 py_test(
     name = "csv_dataset_op_test",
-    size = "small",
+    size = "medium",
     srcs = ["csv_dataset_op_test.py"],
     srcs_version = "PY2AND3",
     tags = ["no_pip"],
@@ -121,6 +121,7 @@ py_test(
     srcs = ["get_single_element_test.py"],
     deps = [
         "//tensorflow/contrib/data/python/ops:get_single_element",
+        "//tensorflow/contrib/data/python/ops:grouping",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
@@ -128,6 +129,7 @@ py_test(
         "//tensorflow/python:errors",
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python/data/ops:dataset_ops",
+        "@absl_py//absl/testing:parameterized",
     ],
 )
 
@@ -190,6 +192,7 @@ py_test(
     deps = [
         "//tensorflow/contrib/data/python/ops:batching",
         "//tensorflow/contrib/data/python/ops:error_ops",
+        "//tensorflow/contrib/data/python/ops:optimization",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:errors",
@@ -207,11 +210,13 @@ py_test(
     srcs = ["optimize_dataset_op_test.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":stats_dataset_test_base",
         "//tensorflow/contrib/data/python/ops:optimization",
-        "//tensorflow/core:protos_all_py",
+        "//tensorflow/contrib/data/python/ops:stats_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:errors",
         "//tensorflow/python/data/ops:dataset_ops",
+        "@absl_py//absl/testing:parameterized",
     ],
 )
 
@@ -229,9 +234,16 @@ cuda_py_test(
         "//tensorflow/python:framework_test_lib",
         "//tensorflow/python:function",
         "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python/compat:compat",
         "//tensorflow/python/data/ops:dataset_ops",
         "//tensorflow/python/data/ops:iterator_ops",
     ],
+    tags = [
+        "manual",
+        "no_oss",
+        "no_windows_gpu",
+        "notap",
+    ],
 )
 
 py_test(
@@ -378,6 +390,7 @@ py_test(
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python/data/ops:dataset_ops",
         "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
     ],
 )
 
@@ -420,8 +433,8 @@ py_test(
     tags = ["no_pip"],
     deps = [
         ":reader_dataset_ops_test_base",
+        ":stats_dataset_test_base",
         "//tensorflow/contrib/data/python/ops:stats_ops",
-        "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:errors",
@@ -431,6 +444,16 @@ py_test(
     ],
 )
 
+py_library(
+    name = "stats_dataset_test_base",
+    srcs = ["stats_dataset_test_base.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
 py_test(
     name = "threadpool_dataset_ops_test",
     size = "small",
@@ -467,6 +490,28 @@ py_test(
 )
 
 py_test(
+    name = "window_dataset_op_test",
+    size = "medium",
+    srcs = ["window_dataset_op_test.py"],
+    srcs_version = "PY2AND3",
+    tags = [
+        "no_pip",
+    ],
+    deps = [
+        "//tensorflow/contrib/data/python/ops:batching",
+        "//tensorflow/contrib/data/python/ops:grouping",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:sparse_tensor",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
+    ],
+)
+
+py_test(
     name = "writer_ops_test",
     size = "small",
     srcs = ["writer_ops_test.py"],
diff --git a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
index af97fbf87a..42adfd17f0 100644
--- a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
@@ -293,7 +293,7 @@ class BatchDatasetTest(test.TestCase, parameterized.TestCase):
               ph2: np.arange(8).astype(np.int32)
           })
       with self.assertRaises(errors.InvalidArgumentError):
-        print(sess.run(next_element))
+        sess.run(next_element)
 
       # No 0th dimension (i.e. scalar value) for one component.
       sess.run(
@@ -303,7 +303,7 @@ class BatchDatasetTest(test.TestCase, parameterized.TestCase):
               ph2: 7
           })
       with self.assertRaises(errors.InvalidArgumentError):
-        print(sess.run(next_element))
+        sess.run(next_element)
 
   def testBatchAndDropRemainder(self):
     components = (np.arange(7),
diff --git a/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py
index df115175f5..2a0e64caeb 100644
--- a/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/csv_dataset_op_test.py
@@ -18,10 +18,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import gzip
 import os
 import string
 import tempfile
 import time
+import zlib
 
 import numpy as np
 
@@ -62,18 +64,29 @@ class CsvDatasetOpTest(test.TestCase):
         op2 = sess.run(next2)
         self.assertAllEqual(op1, op2)
 
-  def setup_files(self, inputs, linebreak='\n'):
+  def _setup_files(self, inputs, linebreak='\n', compression_type=None):
     filenames = []
     for i, ip in enumerate(inputs):
       fn = os.path.join(self.get_temp_dir(), 'temp_%d.csv' % i)
-      with open(fn, 'wb') as f:
-        f.write(linebreak.join(ip).encode('utf-8'))
+      contents = linebreak.join(ip).encode('utf-8')
+      if compression_type is None:
+        with open(fn, 'wb') as f:
+          f.write(contents)
+      elif compression_type == 'GZIP':
+        with gzip.GzipFile(fn, 'wb') as f:
+          f.write(contents)
+      elif compression_type == 'ZLIB':
+        contents = zlib.compress(contents)
+        with open(fn, 'wb') as f:
+          f.write(contents)
+      else:
+        raise ValueError('Unsupported compression_type', compression_type)
       filenames.append(fn)
     return filenames
 
   def _make_test_datasets(self, inputs, **kwargs):
     # Test by comparing its output to what we could get with map->decode_csv
-    filenames = self.setup_files(inputs)
+    filenames = self._setup_files(inputs)
     dataset_expected = core_readers.TextLineDataset(filenames)
     dataset_expected = dataset_expected.map(
         lambda l: parsing_ops.decode_csv(l, **kwargs))
@@ -112,15 +125,18 @@ class CsvDatasetOpTest(test.TestCase):
           except errors.OutOfRangeError:
             break
 
-  def _test_dataset(self,
-                    inputs,
-                    expected_output=None,
-                    expected_err_re=None,
-                    linebreak='\n',
-                    **kwargs):
+  def _test_dataset(
+      self,
+      inputs,
+      expected_output=None,
+      expected_err_re=None,
+      linebreak='\n',
+      compression_type=None,  # Used for both setup and parsing
+      **kwargs):
     """Checks that elements produced by CsvDataset match expected output."""
     # Convert str type because py3 tf strings are bytestrings
-    filenames = self.setup_files(inputs, linebreak)
+    filenames = self._setup_files(inputs, linebreak, compression_type)
+    kwargs['compression_type'] = compression_type
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
         dataset = readers.CsvDataset(filenames, **kwargs)
@@ -174,7 +190,7 @@ class CsvDatasetOpTest(test.TestCase):
   def testCsvDataset_ignoreErrWithUnescapedQuotes(self):
     record_defaults = [['']] * 3
     inputs = [['1,"2"3",4', '1,"2"3",4",5,5', 'a,b,"c"d"', 'e,f,g']]
-    filenames = self.setup_files(inputs)
+    filenames = self._setup_files(inputs)
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
         dataset = readers.CsvDataset(filenames, record_defaults=record_defaults)
@@ -184,7 +200,7 @@ class CsvDatasetOpTest(test.TestCase):
   def testCsvDataset_ignoreErrWithUnquotedQuotes(self):
     record_defaults = [['']] * 3
     inputs = [['1,2"3,4', 'a,b,c"d', '9,8"7,6,5', 'e,f,g']]
-    filenames = self.setup_files(inputs)
+    filenames = self._setup_files(inputs)
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
         dataset = readers.CsvDataset(filenames, record_defaults=record_defaults)
@@ -355,7 +371,7 @@ class CsvDatasetOpTest(test.TestCase):
         '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19',
         '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19'
     ]]
-    file_path = self.setup_files(data)
+    file_path = self._setup_files(data)
 
     with ops.Graph().as_default() as g:
       ds = readers.make_csv_dataset(
@@ -432,14 +448,29 @@ class CsvDatasetOpTest(test.TestCase):
         record_defaults=record_defaults,
         buffer_size=0)
 
-  def testCsvDataset_withBufferSize(self):
+  def _test_dataset_on_buffer_sizes(self,
+                                    inputs,
+                                    expected,
+                                    linebreak,
+                                    record_defaults,
+                                    compression_type=None,
+                                    num_sizes_to_test=20):
+    # Testing reading with a range of buffer sizes that should all work.
+    for i in list(range(1, 1 + num_sizes_to_test)) + [None]:
+      self._test_dataset(
+          inputs,
+          expected,
+          linebreak=linebreak,
+          compression_type=compression_type,
+          record_defaults=record_defaults,
+          buffer_size=i)
+
+  def testCsvDataset_withLF(self):
     record_defaults = [['NA']] * 3
     inputs = [['abc,def,ghi', '0,1,2', ',,']]
     expected = [['abc', 'def', 'ghi'], ['0', '1', '2'], ['NA', 'NA', 'NA']]
-    for i in range(20):
-      # Test a range of buffer sizes that should all work
-      self._test_dataset(
-          inputs, expected, record_defaults=record_defaults, buffer_size=i + 1)
+    self._test_dataset_on_buffer_sizes(
+        inputs, expected, linebreak='\n', record_defaults=record_defaults)
 
   def testCsvDataset_withCR(self):
     # Test that when the line separator is '\r', parsing works with all buffer
@@ -447,14 +478,8 @@ class CsvDatasetOpTest(test.TestCase):
     record_defaults = [['NA']] * 3
     inputs = [['abc,def,ghi', '0,1,2', ',,']]
     expected = [['abc', 'def', 'ghi'], ['0', '1', '2'], ['NA', 'NA', 'NA']]
-    for i in range(20):
-      # Test a range of buffer sizes that should all work
-      self._test_dataset(
-          inputs,
-          expected,
-          linebreak='\r',
-          record_defaults=record_defaults,
-          buffer_size=i + 1)
+    self._test_dataset_on_buffer_sizes(
+        inputs, expected, linebreak='\r', record_defaults=record_defaults)
 
   def testCsvDataset_withCRLF(self):
     # Test that when the line separator is '\r\n', parsing works with all buffer
@@ -462,29 +487,15 @@ class CsvDatasetOpTest(test.TestCase):
     record_defaults = [['NA']] * 3
     inputs = [['abc,def,ghi', '0,1,2', ',,']]
     expected = [['abc', 'def', 'ghi'], ['0', '1', '2'], ['NA', 'NA', 'NA']]
-    for i in range(20):
-      # Test a range of buffer sizes that should all work
-      self._test_dataset(
-          inputs,
-          expected,
-          linebreak='\r\n',
-          record_defaults=record_defaults,
-          buffer_size=i + 1)
+    self._test_dataset_on_buffer_sizes(
+        inputs, expected, linebreak='\r\n', record_defaults=record_defaults)
 
   def testCsvDataset_withBufferSizeAndQuoted(self):
     record_defaults = [['NA']] * 3
     inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
     expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
                 ['NA', 'NA', 'NA']]
-    for i in range(20):
-      # Test a range of buffer sizes that should all work
-      self._test_dataset(
-          inputs,
-          expected,
-          linebreak='\n',
-          record_defaults=record_defaults,
-          buffer_size=i + 1)
-    self._test_dataset(
+    self._test_dataset_on_buffer_sizes(
         inputs, expected, linebreak='\n', record_defaults=record_defaults)
 
   def testCsvDataset_withCRAndQuoted(self):
@@ -494,15 +505,7 @@ class CsvDatasetOpTest(test.TestCase):
     inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
     expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
                 ['NA', 'NA', 'NA']]
-    for i in range(20):
-      # Test a range of buffer sizes that should all work
-      self._test_dataset(
-          inputs,
-          expected,
-          linebreak='\r',
-          record_defaults=record_defaults,
-          buffer_size=i + 1)
-    self._test_dataset(
+    self._test_dataset_on_buffer_sizes(
         inputs, expected, linebreak='\r', record_defaults=record_defaults)
 
   def testCsvDataset_withCRLFAndQuoted(self):
@@ -512,17 +515,33 @@ class CsvDatasetOpTest(test.TestCase):
     inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
     expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
                 ['NA', 'NA', 'NA']]
-    for i in range(20):
-      # Test a range of buffer sizes that should all work
-      self._test_dataset(
-          inputs,
-          expected,
-          linebreak='\r\n',
-          record_defaults=record_defaults,
-          buffer_size=i + 1)
-    self._test_dataset(
+    self._test_dataset_on_buffer_sizes(
         inputs, expected, linebreak='\r\n', record_defaults=record_defaults)
 
+  def testCsvDataset_withGzipCompressionType(self):
+    record_defaults = [['NA']] * 3
+    inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
+    expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
+                ['NA', 'NA', 'NA']]
+    self._test_dataset_on_buffer_sizes(
+        inputs,
+        expected,
+        linebreak='\r\n',
+        compression_type='GZIP',
+        record_defaults=record_defaults)
+
+  def testCsvDataset_withZlibCompressionType(self):
+    record_defaults = [['NA']] * 3
+    inputs = [['"\n\n\n","\r\r\r","abc"', '"0","1","2"', '"","",""']]
+    expected = [['\n\n\n', '\r\r\r', 'abc'], ['0', '1', '2'],
+                ['NA', 'NA', 'NA']]
+    self._test_dataset_on_buffer_sizes(
+        inputs,
+        expected,
+        linebreak='\r\n',
+        compression_type='ZLIB',
+        record_defaults=record_defaults)
+
 
 class CsvDatasetBenchmark(test.Benchmark):
   """Benchmarks for the various ways of creating a dataset from CSV files.
diff --git a/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py b/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py
index 87b7c6ddb7..e6883d53e0 100644
--- a/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/get_single_element_test.py
@@ -17,9 +17,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from absl.testing import parameterized
+import numpy as np
+
 from tensorflow.contrib.data.python.ops import get_single_element
+from tensorflow.contrib.data.python.ops import grouping
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import sparse_tensor
@@ -27,40 +30,69 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import test
 
 
-class GetSingleElementTest(test.TestCase):
+class GetSingleElementTest(test.TestCase, parameterized.TestCase):
 
-  def testGetSingleElement(self):
-    skip_value = array_ops.placeholder(dtypes.int64, shape=[])
-    take_value = array_ops.placeholder_with_default(
-        constant_op.constant(1, dtype=dtypes.int64), shape=[])
+  @parameterized.named_parameters(
+      ("Zero", 0, 1),
+      ("Five", 5, 1),
+      ("Ten", 10, 1),
+      ("Empty", 100, 1, errors.InvalidArgumentError, "Dataset was empty."),
+      ("MoreThanOne", 0, 2, errors.InvalidArgumentError,
+       "Dataset had more than one element."),
+  )
+  def testGetSingleElement(self, skip, take, error=None, error_msg=None):
+    skip_t = array_ops.placeholder(dtypes.int64, shape=[])
+    take_t = array_ops.placeholder(dtypes.int64, shape=[])
 
     def make_sparse(x):
       x_1d = array_ops.reshape(x, [1])
       x_2d = array_ops.reshape(x, [1, 1])
       return sparse_tensor.SparseTensor(x_2d, x_1d, x_1d)
 
-    dataset = (dataset_ops.Dataset.range(100)
-               .skip(skip_value)
-               .map(lambda x: (x * x, make_sparse(x)))
-               .take(take_value))
-
+    dataset = dataset_ops.Dataset.range(100).skip(skip_t).map(
+        lambda x: (x * x, make_sparse(x))).take(take_t)
     element = get_single_element.get_single_element(dataset)
 
     with self.test_session() as sess:
-      for x in [0, 5, 10]:
-        dense_val, sparse_val = sess.run(element, feed_dict={skip_value: x})
-        self.assertEqual(x * x, dense_val)
-        self.assertAllEqual([[x]], sparse_val.indices)
-        self.assertAllEqual([x], sparse_val.values)
-        self.assertAllEqual([x], sparse_val.dense_shape)
-
-      with self.assertRaisesRegexp(errors.InvalidArgumentError,
-                                   "Dataset was empty."):
-        sess.run(element, feed_dict={skip_value: 100})
-
-      with self.assertRaisesRegexp(errors.InvalidArgumentError,
-                                   "Dataset had more than one element."):
-        sess.run(element, feed_dict={skip_value: 0, take_value: 2})
+      if error is None:
+        dense_val, sparse_val = sess.run(
+            element, feed_dict={
+                skip_t: skip,
+                take_t: take
+            })
+        self.assertEqual(skip * skip, dense_val)
+        self.assertAllEqual([[skip]], sparse_val.indices)
+        self.assertAllEqual([skip], sparse_val.values)
+        self.assertAllEqual([skip], sparse_val.dense_shape)
+      else:
+        with self.assertRaisesRegexp(error, error_msg):
+          sess.run(element, feed_dict={skip_t: skip, take_t: take})
+
+  @parameterized.named_parameters(
+      ("SumZero", 0),
+      ("SumOne", 1),
+      ("SumFive", 5),
+      ("SumTen", 10),
+  )
+  def testReduceDataset(self, stop):
+    def init_fn(_):
+      return np.int64(0)
+
+    def reduce_fn(state, value):
+      return state + value
+
+    def finalize_fn(state):
+      return state
+
+    sum_reducer = grouping.Reducer(init_fn, reduce_fn, finalize_fn)
+
+    stop_t = array_ops.placeholder(dtypes.int64, shape=[])
+    dataset = dataset_ops.Dataset.range(stop_t)
+    element = get_single_element.reduce_dataset(dataset, sum_reducer)
+
+    with self.test_session() as sess:
+      value = sess.run(element, feed_dict={stop_t: stop})
+      self.assertEqual(stop * (stop - 1) / 2, value)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/data/python/kernel_tests/iterator_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/iterator_ops_test.py
index 30a993b1f7..77148aceec 100644
--- a/tensorflow/contrib/data/python/kernel_tests/iterator_ops_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/iterator_ops_test.py
@@ -28,6 +28,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.training import training_util
 
@@ -55,7 +56,7 @@ class CheckpointInputPipelineHookTest(test.TestCase):
   def _read_vars(self, model_dir):
     """Returns (global_step, latest_feature)."""
     with ops.Graph().as_default() as g:
-      ckpt_path = saver_lib.latest_checkpoint(model_dir)
+      ckpt_path = checkpoint_management.latest_checkpoint(model_dir)
       meta_filename = ckpt_path + '.meta'
       saver_lib.import_meta_graph(meta_filename)
       saver = saver_lib.Saver()
diff --git a/tensorflow/contrib/data/python/kernel_tests/map_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/map_dataset_op_test.py
index a075dfd8b5..48adc98e9a 100644
--- a/tensorflow/contrib/data/python/kernel_tests/map_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/map_dataset_op_test.py
@@ -17,6 +17,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import hashlib
 import itertools
 import os
 import time
@@ -25,6 +26,7 @@ import numpy as np
 
 from tensorflow.contrib.data.python.ops import batching
 from tensorflow.contrib.data.python.ops import error_ops
+from tensorflow.contrib.data.python.ops import optimization
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.client import session
 from tensorflow.python.data.ops import dataset_ops
@@ -32,9 +34,12 @@ from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import io_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.platform import test
 from tensorflow.python.util import compat
 
+_NUMPY_RANDOM_SEED = 42
+
 
 class MapDatasetTest(test.TestCase):
 
@@ -78,15 +83,17 @@ class MapDatasetTest(test.TestCase):
     def write_string_to_file(value, filename):
       with open(filename, "w") as f:
         f.write(value)
-    filenames = [os.path.join(self.get_temp_dir(), "file_%d.txt" % i)
-                 for i in range(5)]
+
+    filenames = [
+        os.path.join(self.get_temp_dir(), "file_%d.txt" % i) for i in range(5)
+    ]
     for filename in filenames:
       write_string_to_file(filename, filename)
 
     dataset = (
         dataset_ops.Dataset.from_tensor_slices(filenames).map(
-            io_ops.read_file, num_parallel_calls=2).prefetch(2).apply(
-                error_ops.ignore_errors()))
+            io_ops.read_file,
+            num_parallel_calls=2).prefetch(2).apply(error_ops.ignore_errors()))
     iterator = dataset.make_initializable_iterator()
     init_op = iterator.initializer
     get_next = iterator.get_next()
@@ -142,80 +149,164 @@ class MapDatasetTest(test.TestCase):
 
 class MapDatasetBenchmark(test.Benchmark):
 
+  # The purpose of this benchmark is to compare the performance of chaining vs
+  # fusing of the map and batch transformations across various configurations.
+  #
+  # NOTE: It is recommended to build the benchmark with
+  # `-c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-gmlt`
+  # and execute it on a machine with at least 32 CPU cores.
   def benchmarkMapAndBatch(self):
-    small = itertools.product([1, 4], [1, 4], [1, 4], [16, 64], [100])
-    large = itertools.product([16, 64], [16, 64], [16, 64], [256, 1024], [10])
-
-    num_iters = 100
-
-    def benchmark(series):
-
-      for num_calls, inter_op, element_size, batch_size, num_steps in series:
-        dataset = dataset_ops.Dataset.from_tensors(
-            np.random.randint(100, size=element_size)).repeat().map(
-                lambda x: x,
-                num_parallel_calls=num_calls).batch(batch_size=batch_size)
-        iterator = dataset.make_one_shot_iterator()
-        get_next = iterator.get_next()
 
-        fused_dataset = dataset_ops.Dataset.from_tensors(
-            np.random.randint(100, size=element_size)).repeat(None).apply(
-                batching.map_and_batch(
-                    lambda x: x,
-                    num_parallel_calls=num_calls,
-                    batch_size=batch_size))
-        fused_iterator = fused_dataset.make_one_shot_iterator()
-        fused_get_next = fused_iterator.get_next()
+    # Sequential pipeline configurations.
+    seq_elem_size_series = itertools.product([1], [1], [1, 2, 4, 8], [16])
+    seq_batch_size_series = itertools.product([1], [1], [1], [8, 16, 32, 64])
+
+    # Parallel pipeline configuration.
+    par_elem_size_series = itertools.product([32], [32], [1, 2, 4, 8], [256])
+    par_batch_size_series = itertools.product([32], [32], [1],
+                                              [128, 256, 512, 1024])
+    par_num_calls_series = itertools.product([8, 16, 32, 64], [32], [1], [512])
+    par_inter_op_series = itertools.product([32], [8, 16, 32, 64], [1], [512])
+
+    def name(method, label, num_calls, inter_op, element_size, batch_size):
+      return ("%s_id_%s_num_calls_%d_inter_op_%d_elem_size_%d_batch_size_%d" % (
+          method,
+          hashlib.sha1(label).hexdigest(),
+          num_calls,
+          inter_op,
+          element_size,
+          batch_size,
+      ))
+
+    def benchmark(label, series):
+
+      print("%s:" % label)
+      for num_calls, inter_op, element_size, batch_size in series:
+
+        num_iters = 1024 // (
+            (element_size * batch_size) // min(num_calls, inter_op))
+        k = 1024 * 1024
+        dataset = dataset_ops.Dataset.from_tensors((np.random.rand(
+            element_size, 4 * k), np.random.rand(4 * k, 1))).repeat()
+
+        chained_dataset = dataset.map(
+            math_ops.matmul,
+            num_parallel_calls=num_calls).batch(batch_size=batch_size)
+        chained_iterator = chained_dataset.make_one_shot_iterator()
+        chained_get_next = chained_iterator.get_next()
 
-        fused_deltas = []
+        chained_deltas = []
         with session.Session(
             config=config_pb2.ConfigProto(
-                inter_op_parallelism_threads=inter_op)) as sess:
-
+                inter_op_parallelism_threads=inter_op,
+                use_per_session_threads=True)) as sess:
           for _ in range(5):
-            sess.run(fused_get_next)
+            sess.run(chained_get_next.op)
           for _ in range(num_iters):
             start = time.time()
-            for _ in range(num_steps):
-              sess.run(fused_get_next)
+            sess.run(chained_get_next.op)
             end = time.time()
-            fused_deltas.append(end - start)
+            chained_deltas.append(end - start)
 
-        chained_deltas = []
+        fused_dataset = dataset = dataset.apply(
+            batching.map_and_batch(
+                math_ops.matmul,
+                num_parallel_calls=num_calls,
+                batch_size=batch_size))
+        fused_iterator = fused_dataset.make_one_shot_iterator()
+        fused_get_next = fused_iterator.get_next()
+
+        fused_deltas = []
         with session.Session(
             config=config_pb2.ConfigProto(
-                inter_op_parallelism_threads=inter_op)) as sess:
+                inter_op_parallelism_threads=inter_op,
+                use_per_session_threads=True)) as sess:
+
           for _ in range(5):
-            sess.run(get_next)
+            sess.run(fused_get_next.op)
           for _ in range(num_iters):
             start = time.time()
-            for _ in range(num_steps):
-              sess.run(get_next)
+            sess.run(fused_get_next.op)
             end = time.time()
-            chained_deltas.append(end - start)
+            fused_deltas.append(end - start)
 
-        chained_wall_time = np.median(chained_deltas) / num_iters
-        fused_wall_time = np.median(fused_deltas) / num_iters
         print(
             "batch size: %d, num parallel calls: %d, inter-op parallelism: %d, "
-            "element size: %d, chained wall time: %f, fused wall time: %f" %
-            (batch_size, num_calls, inter_op, element_size, chained_wall_time,
-             fused_wall_time))
+            "element size: %d, num iters: %d\nchained wall time: %f (median), "
+            "%f (mean), %f (stddev), %f (min), %f (max)\n  fused wall time: "
+            "%f (median), %f (mean), %f (stddev), %f (min), %f (max)\n    "
+            "chained/fused:    %.2fx (median),    %.2fx (mean)" %
+            (batch_size, num_calls, inter_op, element_size, num_iters,
+             np.median(chained_deltas), np.mean(chained_deltas),
+             np.std(chained_deltas), np.min(chained_deltas),
+             np.max(chained_deltas), np.median(fused_deltas),
+             np.mean(fused_deltas), np.std(fused_deltas), np.min(fused_deltas),
+             np.max(fused_deltas),
+             np.median(chained_deltas) / np.median(fused_deltas),
+             np.mean(chained_deltas) / np.mean(fused_deltas)))
 
         self.report_benchmark(
             iters=num_iters,
-            wall_time=chained_wall_time,
-            name="chained_batch_size_%d_num_calls_%d_inter_op_%d_elem_size_%d"
-            % (batch_size, num_calls, inter_op, element_size))
+            wall_time=np.median(chained_deltas),
+            name=name("chained", label, num_calls, inter_op, element_size,
+                      batch_size))
 
         self.report_benchmark(
             iters=num_iters,
-            wall_time=fused_wall_time,
-            name="fused_batch_size_%d_num_calls_%d_inter_op_%d_elem_size_%d"
-            % (batch_size, num_calls, inter_op, element_size))
+            wall_time=np.median(fused_deltas),
+            name=name("fused", label, num_calls, inter_op, element_size,
+                      batch_size))
+
+      print("")
+
+    np.random.seed(_NUMPY_RANDOM_SEED)
+    benchmark("Sequential element size evaluation", seq_elem_size_series)
+    benchmark("Sequential batch size evaluation", seq_batch_size_series)
+    benchmark("Parallel element size evaluation", par_elem_size_series)
+    benchmark("Parallel batch size evaluation", par_batch_size_series)
+    benchmark("Transformation parallelism evaluation", par_num_calls_series)
+    benchmark("Threadpool size evaluation", par_inter_op_series)
+
+  # This benchmark compares the performance of pipeline with multiple chained
+  # maps with and without map fusion.
+  def benchmarkChainOfMaps(self):
+    chain_lengths = [0, 1, 2, 5, 10, 20, 50]
+    for chain_length in chain_lengths:
+      self._benchmarkChainOfMaps(chain_length, False)
+      self._benchmarkChainOfMaps(chain_length, True)
+
+  def _benchmarkChainOfMaps(self, chain_length, optimize_dataset):
+    with ops.Graph().as_default():
+      dataset = dataset_ops.Dataset.from_tensors(0).repeat(None)
+      for _ in range(chain_length):
+        dataset = dataset.map(lambda x: x)
+      if optimize_dataset:
+        dataset = dataset.apply(optimization.optimize(["map_fusion"]))
+
+      iterator = dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+      with session.Session() as sess:
+        for _ in range(5):
+          sess.run(next_element.op)
+        deltas = []
+        for _ in range(100):
+          start = time.time()
+          for _ in range(100):
+            sess.run(next_element.op)
+          end = time.time()
+          deltas.append(end - start)
+
+        median_wall_time = np.median(deltas) / 100
+        opt_mark = "opt" if optimize_dataset else "no-opt"
+        print("Map dataset {} chain length: {} Median wall time: {}".format(
+            opt_mark, chain_length, median_wall_time))
+        self.report_benchmark(
+            iters=1000,
+            wall_time=median_wall_time,
+            name="benchmark_map_dataset_chain_latency_{}_{}".format(
+                opt_mark, chain_length))
 
-    benchmark(small)
-    benchmark(large)
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/optimize_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/optimize_dataset_op_test.py
index e35be8a23f..2427935c73 100644
--- a/tensorflow/contrib/data/python/kernel_tests/optimize_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/optimize_dataset_op_test.py
@@ -17,60 +17,179 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from absl.testing import parameterized
+
+from tensorflow.contrib.data.python.kernel_tests import stats_dataset_test_base
 from tensorflow.contrib.data.python.ops import optimization
-from tensorflow.core.framework import graph_pb2
+from tensorflow.contrib.data.python.ops import stats_ops
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import errors
 from tensorflow.python.platform import test
 
 
-class OptimizeDatasetTest(test.TestCase):
+class OptimizeDatasetTest(test.TestCase, parameterized.TestCase):
+
+  def testAssertSuffix(self):
+    dataset = dataset_ops.Dataset.from_tensors(0).apply(
+        optimization.assert_next(["Map"])).map(lambda x: x)
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      self.assertEqual(0, sess.run(get_next))
+
+  def testAssertSuffixInvalid(self):
+    dataset = dataset_ops.Dataset.from_tensors(0).apply(
+        optimization.assert_next(["Whoops"])).map(lambda x: x)
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      with self.assertRaisesRegexp(
+          errors.InvalidArgumentError,
+          "Asserted Whoops transformation at offset 0 but encountered "
+          "Map transformation instead."
+      ):
+        sess.run(get_next)
+
+  def testAssertSuffixShort(self):
+    dataset = dataset_ops.Dataset.from_tensors(0).apply(
+        optimization.assert_next(["Map", "Whoops"])).map(lambda x: x)
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      with self.assertRaisesRegexp(
+          errors.InvalidArgumentError,
+          "Asserted next 2 transformations but encountered only 1."):
+        sess.run(get_next)
 
   def testDefaultOptimizations(self):
-    dataset = dataset_ops.Dataset.range(10).map(lambda x: x * x).batch(
-        10).apply(optimization.optimize())
+    dataset = dataset_ops.Dataset.range(10).apply(
+        optimization.assert_next(
+            ["Map", "Batch"])).map(lambda x: x * x).batch(10).apply(
+                optimization.optimize())
     iterator = dataset.make_one_shot_iterator()
     get_next = iterator.get_next()
 
     with self.test_session() as sess:
-      graph = graph_pb2.GraphDef().FromString(
-          sess.run(dataset._as_serialized_graph()))
-      self.assertTrue(
-          all([node.op != "MapAndBatchDatasetV2" for node in graph.node]))
       self.assertAllEqual([x * x for x in range(10)], sess.run(get_next))
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
   def testEmptyOptimizations(self):
-    dataset = dataset_ops.Dataset.range(10).map(lambda x: x * x).batch(
-        10).apply(optimization.optimize([]))
+    dataset = dataset_ops.Dataset.range(10).apply(
+        optimization.assert_next(
+            ["Map", "Batch"])).map(lambda x: x * x).batch(10).apply(
+                optimization.optimize([]))
     iterator = dataset.make_one_shot_iterator()
     get_next = iterator.get_next()
 
     with self.test_session() as sess:
-      graph = graph_pb2.GraphDef().FromString(
-          sess.run(dataset._as_serialized_graph()))
-      self.assertTrue(
-          all([node.op != "MapAndBatchDatasetV2" for node in graph.node]))
       self.assertAllEqual([x * x for x in range(10)], sess.run(get_next))
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
   def testOptimization(self):
-    dataset = dataset_ops.Dataset.range(10).map(lambda x: x * x).batch(
-        10).apply(optimization.optimize(["map_and_batch_fusion"]))
+    dataset = dataset_ops.Dataset.range(10).apply(
+        optimization.assert_next(
+            ["MapAndBatch"])).map(lambda x: x * x).batch(10).apply(
+                optimization.optimize(["map_and_batch_fusion"]))
     iterator = dataset.make_one_shot_iterator()
     get_next = iterator.get_next()
 
     with self.test_session() as sess:
-      graph = graph_pb2.GraphDef().FromString(
-          sess.run(dataset._as_serialized_graph()))
-      self.assertTrue(
-          any([node.op == "MapAndBatchDatasetV2" for node in graph.node]))
       self.assertAllEqual([x * x for x in range(10)], sess.run(get_next))
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testFunctionLibraryDefinitionModification(self):
+    dataset = dataset_ops.Dataset.from_tensors(0).map(lambda x: x).apply(
+        optimization.optimize(["_test_only_function_rename"]))
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      with self.assertRaisesRegexp(errors.NotFoundError,
+                                   "Function .* is not defined."):
+        sess.run(get_next)
+
+  @staticmethod
+  def map_functions():
+    identity = lambda x: x
+    increment = lambda x: x + 1
+
+    def increment_and_square(x):
+      y = x + 1
+      return y * y
+
+    functions = [identity, increment, increment_and_square]
+    tests = []
+
+    for fun1 in functions:
+      for fun2 in functions:
+        tests.append(([fun1, fun2],))
+        for fun3 in functions:
+          tests.append(([fun1, fun2, fun3],))
+
+    swap = lambda x, n: (n, x)
+    tests.append(([lambda x: (x, 42), swap],))
+    tests.append(([lambda x: (x, 42), swap, swap],))
+    return tuple(tests)
+
+  @parameterized.parameters(*map_functions.__func__())
+  def testMapFusion(self, functions):
+    dataset = dataset_ops.Dataset.range(5).apply(
+        optimization.assert_next(["Map", "Prefetch"]))
+    for function in functions:
+      dataset = dataset.map(function)
+
+    dataset = dataset.prefetch(0).apply(optimization.optimize(["map_fusion"]))
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next()
+    with self.test_session() as sess:
+      for x in range(5):
+        result = sess.run(get_next)
+        r = x
+        for function in functions:
+          if isinstance(r, tuple):
+            r = function(*r)  # Pass tuple as multiple arguments.
+          else:
+            r = function(r)
+        self.assertAllEqual(r, result)
+
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+
+class OptimizeStatsDatasetTest(stats_dataset_test_base.StatsDatasetTestBase):
+
+  def testLatencyStatsOptimization(self):
+
+    stats_aggregator = stats_ops.StatsAggregator()
+    dataset = dataset_ops.Dataset.from_tensors(1).apply(
+        optimization.assert_next(
+            ["LatencyStats", "Map", "LatencyStats", "Prefetch",
+             "LatencyStats"])).map(lambda x: x * x).prefetch(1).apply(
+                 optimization.optimize(["latency_all_edges"])).apply(
+                     stats_ops.set_stats_aggregator(stats_aggregator))
+    iterator = dataset.make_initializable_iterator()
+    get_next = iterator.get_next()
+    summary_t = stats_aggregator.get_summary()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      self.assertEqual(1 * 1, sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+      summary_str = sess.run(summary_t)
+      self._assertSummaryHasCount(summary_str,
+                                  "record_latency_TensorDataset/_1", 1)
+      self._assertSummaryHasCount(summary_str, "record_latency_MapDataset/_4",
+                                  1)
+      self._assertSummaryHasCount(summary_str,
+                                  "record_latency_PrefetchDataset/_6", 1)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/prefetching_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/prefetching_ops_test.py
index 40a8e46676..2da6131e8e 100644
--- a/tensorflow/contrib/data/python/kernel_tests/prefetching_ops_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/prefetching_ops_test.py
@@ -21,6 +21,7 @@ import threading
 
 from tensorflow.contrib.data.python.ops import prefetching_ops
 from tensorflow.core.protobuf import config_pb2
+from tensorflow.python.compat import compat
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.framework import constant_op
@@ -30,6 +31,7 @@ from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import test
 
@@ -86,8 +88,7 @@ class PrefetchingKernelsOpsTest(test.TestCase):
     return (prefetch_op, reset_op, destroy_op)
 
   def _prefetch_fn_helper_one_shot(self, buffer_name, device0, device1):
-    worker_config = config_pb2.ConfigProto()
-    worker_config.device_count["CPU"] = 2
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
 
     ds, ds_iterator = self._create_ds_and_iterator(device0, initializable=False)
     prefetch_op, _, destroy_op = self._create_ops(ds, ds_iterator, buffer_name,
@@ -126,8 +127,7 @@ class PrefetchingKernelsOpsTest(test.TestCase):
                                       "/job:localhost/replica:0/task:0/gpu:0")
 
   def testReinitialization(self):
-    worker_config = config_pb2.ConfigProto()
-    worker_config.device_count["CPU"] = 2
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
 
     device0 = "/job:localhost/replica:0/task:0/cpu:0"
     device1 = "/job:localhost/replica:0/task:0/cpu:1"
@@ -167,8 +167,7 @@ class PrefetchingKernelsOpsTest(test.TestCase):
       sess.run(destroy_op)
 
   def testReinitializationOutOfRange(self):
-    worker_config = config_pb2.ConfigProto()
-    worker_config.device_count["CPU"] = 2
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
 
     device0 = "/job:localhost/replica:0/task:0/cpu:0"
     device1 = "/job:localhost/replica:0/task:0/cpu:1"
@@ -271,8 +270,7 @@ class PrefetchToDeviceTest(test.TestCase):
     self.assertEqual(dtypes.int64, next_element.dtype)
     self.assertEqual([], next_element.shape)
 
-    worker_config = config_pb2.ConfigProto()
-    worker_config.device_count["CPU"] = 2
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
     with self.test_session(config=worker_config) as sess:
       for i in range(10):
         self.assertEqual(i, sess.run(next_element))
@@ -332,8 +330,7 @@ class PrefetchToDeviceTest(test.TestCase):
     self.assertEqual(dtypes.int64, next_element["a"].dtype)
     self.assertEqual([], next_element["a"].shape)
 
-    worker_config = config_pb2.ConfigProto()
-    worker_config.device_count["CPU"] = 2
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
     with self.test_session(config=worker_config) as sess:
       for i in range(10):
         self.assertEqual({"a": i}, sess.run(next_element))
@@ -366,8 +363,7 @@ class PrefetchToDeviceTest(test.TestCase):
     next_element = iterator.get_next()
     self.assertEqual(dtypes.int64, next_element.dtype)
 
-    worker_config = config_pb2.ConfigProto()
-    worker_config.device_count["CPU"] = 2
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
     with self.test_session(config=worker_config) as sess:
       for i in range(10):
         actual = sess.run(next_element)
@@ -417,8 +413,7 @@ class PrefetchToDeviceTest(test.TestCase):
     self.assertEqual(dtypes.int64, next_element.dtype)
     self.assertEqual([], next_element.shape)
 
-    worker_config = config_pb2.ConfigProto()
-    worker_config.device_count["CPU"] = 2
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
     with self.test_session(config=worker_config) as sess:
       sess.run(iterator.initializer)
       for i in range(5):
@@ -451,5 +446,617 @@ class PrefetchToDeviceTest(test.TestCase):
         sess.run(next_element)
 
 
+class CopyToDeviceTest(test.TestCase):
+
+  def testCopyToDevice(self):
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1"))
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element.dtype)
+    self.assertEqual([], next_element.shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceInt32(self):
+    host_dataset = dataset_ops.Dataset.from_tensors([0, 1, 2, 3])
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1"))
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int32, next_element.dtype)
+    self.assertEqual((4,), next_element.shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      self.assertAllEqual([0, 1, 2, 3], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToSameDevice(self):
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:0"))
+
+    with ops.device("/cpu:0"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element.dtype)
+    self.assertEqual([], next_element.shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceWithPrefetch(self):
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1")).prefetch(1)
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element.dtype)
+    self.assertEqual([], next_element.shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyDictToDevice(self):
+    host_dataset = dataset_ops.Dataset.range(10).map(lambda x: {"a": x})
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1"))
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element["a"].dtype)
+    self.assertEqual([], next_element["a"].shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      for i in range(10):
+        self.assertEqual({"a": i}, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyDictToDeviceWithPrefetch(self):
+    host_dataset = dataset_ops.Dataset.range(10).map(lambda x: {"a": x})
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1")).prefetch(1)
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element["a"].dtype)
+    self.assertEqual([], next_element["a"].shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      for i in range(10):
+        self.assertEqual({"a": i}, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopySparseTensorsToDevice(self):
+
+    def make_tensor(i):
+      return sparse_tensor.SparseTensorValue(
+          indices=[[0, 0]], values=(i * [1]), dense_shape=[2, 2])
+
+    host_dataset = dataset_ops.Dataset.range(10).map(make_tensor)
+
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1"))
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element.dtype)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      for i in range(10):
+        actual = sess.run(next_element)
+        self.assertAllEqual([i], actual.values)
+        self.assertAllEqual([[0, 0]], actual.indices)
+        self.assertAllEqual([2, 2], actual.dense_shape)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopySparseTensorsToDeviceWithPrefetch(self):
+
+    def make_tensor(i):
+      return sparse_tensor.SparseTensorValue(
+          indices=[[0, 0]], values=(i * [1]), dense_shape=[2, 2])
+
+    host_dataset = dataset_ops.Dataset.range(10).map(make_tensor)
+
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1")).prefetch(1)
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_one_shot_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element.dtype)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      for i in range(10):
+        actual = sess.run(next_element)
+        self.assertAllEqual([i], actual.values)
+        self.assertAllEqual([[0, 0]], actual.indices)
+        self.assertAllEqual([2, 2], actual.dense_shape)
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpu(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0"))
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpuWithPrefetch(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0")).prefetch(1)
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpuInt32(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.from_tensors([0, 1, 2, 3])
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0"))
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      self.assertAllEqual([0, 1, 2, 3], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpuInt32AndPrefetch(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.from_tensors([0, 1, 2, 3])
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0")).prefetch(1)
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      self.assertAllEqual([0, 1, 2, 3], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpuStrings(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.from_tensors(["a", "b", "c"])
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0"))
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      self.assertAllEqual([b"a", b"b", b"c"], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpuStringsAndPrefetch(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.from_tensors(["a", "b", "c"])
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0"))
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      self.assertAllEqual([b"a", b"b", b"c"], sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDevicePingPongCPUGPU(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    with compat.forward_compatibility_horizon(2018, 8, 4):
+      host_dataset = dataset_ops.Dataset.range(10)
+      device_dataset = host_dataset.apply(
+          prefetching_ops.copy_to_device("/gpu:0", source_device="/cpu:0"))
+      back_to_cpu_dataset = device_dataset.apply(
+          prefetching_ops.copy_to_device("/cpu:0", source_device="/gpu:0"))
+
+      with ops.device("/cpu:0"):
+        iterator = back_to_cpu_dataset.make_initializable_iterator()
+        next_element = iterator.get_next()
+
+      with self.test_session() as sess:
+        sess.run(iterator.initializer)
+        for i in range(10):
+          self.assertEqual(i, sess.run(next_element))
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(next_element)
+
+  def testCopyToDeviceWithReInit(self):
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1"))
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element.dtype)
+    self.assertEqual([], next_element.shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      sess.run(iterator.initializer)
+      for i in range(5):
+        self.assertEqual(i, sess.run(next_element))
+      sess.run(iterator.initializer)
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceWithReInitAndPrefetch(self):
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/cpu:1")).prefetch(1)
+
+    with ops.device("/cpu:1"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    self.assertEqual(host_dataset.output_types, device_dataset.output_types)
+    self.assertEqual(host_dataset.output_types, iterator.output_types)
+    self.assertEqual(host_dataset.output_shapes, device_dataset.output_shapes)
+    self.assertEqual(host_dataset.output_shapes, iterator.output_shapes)
+    self.assertEqual(host_dataset.output_classes, device_dataset.output_classes)
+    self.assertEqual(host_dataset.output_classes, iterator.output_classes)
+
+    self.assertEqual(dtypes.int64, next_element.dtype)
+    self.assertEqual([], next_element.shape)
+
+    worker_config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=worker_config) as sess:
+      sess.run(iterator.initializer)
+      for i in range(5):
+        self.assertEqual(i, sess.run(next_element))
+      sess.run(iterator.initializer)
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpuWithReInit(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0"))
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      for i in range(5):
+        self.assertEqual(i, sess.run(next_element))
+      sess.run(iterator.initializer)
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testCopyToDeviceGpuWithReInitAndPrefetch(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    host_dataset = dataset_ops.Dataset.range(10)
+    device_dataset = host_dataset.apply(
+        prefetching_ops.copy_to_device("/gpu:0")).prefetch(1)
+
+    with ops.device("/gpu:0"):
+      iterator = device_dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer)
+      for i in range(5):
+        self.assertEqual(i, sess.run(next_element))
+      sess.run(iterator.initializer)
+      for i in range(10):
+        self.assertEqual(i, sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+
+class MultiDeviceIteratorTest(test.TestCase):
+
+  def testBasic(self):
+    dataset = dataset_ops.Dataset.range(10)
+    multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+        dataset, ["/cpu:1", "/cpu:2"])
+    elem_on_1, elem_on_2 = multi_device_iterator.get_next()
+
+    config = config_pb2.ConfigProto(device_count={"CPU": 3})
+    with self.test_session(config=config) as sess:
+      sess.run(multi_device_iterator.initializer)
+      for i in range(0, 10, 2):
+        self.assertEqual(i, sess.run(elem_on_1))
+        self.assertEqual(i + 1, sess.run(elem_on_2))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(elem_on_1)
+        sess.run(elem_on_2)
+
+  def testOneOnSameDevice(self):
+    with ops.device("/cpu:0"):
+      dataset = dataset_ops.Dataset.range(10)
+    multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+        dataset, ["/cpu:0", "/cpu:1"])
+    elem_on_1, elem_on_2 = multi_device_iterator.get_next()
+
+    config = config_pb2.ConfigProto(device_count={"CPU": 2})
+    with self.test_session(config=config) as sess:
+      sess.run(multi_device_iterator.initializer)
+      for i in range(0, 10, 2):
+        self.assertEqual(i, sess.run(elem_on_1))
+        self.assertEqual(i + 1, sess.run(elem_on_2))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(elem_on_1)
+        sess.run(elem_on_2)
+
+  def testRepeatDevices(self):
+    with ops.device("/cpu:0"):
+      dataset = dataset_ops.Dataset.range(20)
+    multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+        dataset, ["/cpu:1", "/cpu:2", "/cpu:1", "/cpu:2"])
+    elements = multi_device_iterator.get_next()
+    elem_on_1, elem_on_2, elem_on_3, elem_on_4 = elements
+
+    config = config_pb2.ConfigProto(device_count={"CPU": 3})
+    with self.test_session(config=config) as sess:
+      sess.run(multi_device_iterator.initializer)
+      for i in range(0, 20, 4):
+        self.assertEqual(i, sess.run(elem_on_1))
+        self.assertEqual(i + 1, sess.run(elem_on_2))
+        self.assertEqual(i + 2, sess.run(elem_on_3))
+        self.assertEqual(i + 3, sess.run(elem_on_4))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(elem_on_1)
+        sess.run(elem_on_2)
+        sess.run(elem_on_3)
+        sess.run(elem_on_4)
+
+  def testNotFullyDivisible(self):
+    dataset = dataset_ops.Dataset.range(9)
+    multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+        dataset, ["/cpu:1", "/cpu:2"])
+    elem_on_1, elem_on_2 = multi_device_iterator.get_next()
+
+    config = config_pb2.ConfigProto(device_count={"CPU": 3})
+    with self.test_session(config=config) as sess:
+      sess.run(multi_device_iterator.initializer)
+      for i in range(0, 8, 2):
+        self.assertEqual(i, sess.run(elem_on_1))
+        self.assertEqual(i + 1, sess.run(elem_on_2))
+      self.assertEqual(8, sess.run(elem_on_1))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(elem_on_1)
+        sess.run(elem_on_2)
+
+  def testUneven(self):
+    dataset = dataset_ops.Dataset.range(10)
+    multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+        dataset, ["/cpu:1", "/cpu:2"])
+    elem_on_1, elem_on_2 = multi_device_iterator.get_next()
+
+    config = config_pb2.ConfigProto(device_count={"CPU": 3})
+    with self.test_session(config=config) as sess:
+      sess.run(multi_device_iterator.initializer)
+      for i in range(0, 10, 2):
+        self.assertEqual(i, sess.run(elem_on_1))
+      for i in range(0, 10, 2):
+        self.assertEqual(i + 1, sess.run(elem_on_2))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(elem_on_1)
+        sess.run(elem_on_2)
+
+  def testMultipleInitializations(self):
+    with ops.device("/cpu:0"):
+      epoch = array_ops.placeholder(dtypes.int64, shape=[])
+      dataset1 = dataset_ops.Dataset.from_tensors(epoch).repeat(1000)
+      dataset2 = dataset_ops.Dataset.range(1000)
+      dataset = dataset_ops.Dataset.zip((dataset1, dataset2))
+    multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+        dataset, ["/cpu:1", "/cpu:2"], prefetch_buffer_size=4)
+    elem_on_1, elem_on_2 = multi_device_iterator.get_next()
+    init_op = multi_device_iterator.initializer
+
+    config = config_pb2.ConfigProto(device_count={"CPU": 3})
+    with self.test_session(config=config) as sess:
+      for i in range(1000):
+        sess.run(init_op, feed_dict={epoch: i})
+        self.assertEqual([(i, 0), (i, 1)], sess.run([elem_on_1, elem_on_2]))
+
+  def testBasicGpu(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    with compat.forward_compatibility_horizon(2018, 8, 4):
+      dataset = dataset_ops.Dataset.range(10)
+      multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+          dataset, ["/cpu:1", "/gpu:0"])
+      elem_on_1, elem_on_2 = multi_device_iterator.get_next()
+
+      config = config_pb2.ConfigProto(device_count={"CPU": 2, "GPU": 1})
+      with self.test_session(config=config) as sess:
+        sess.run(multi_device_iterator.initializer)
+        for i in range(0, 10, 2):
+          self.assertEqual(i, sess.run(elem_on_1))
+          self.assertEqual(i + 1, sess.run(elem_on_2))
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(elem_on_1)
+          sess.run(elem_on_2)
+
+  def testUnevenGpu(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPU available")
+
+    with compat.forward_compatibility_horizon(2018, 8, 4):
+      dataset = dataset_ops.Dataset.range(10)
+      multi_device_iterator = prefetching_ops.MultiDeviceIterator(
+          dataset, ["/cpu:1", "/gpu:0"])
+      elem_on_1, elem_on_2 = multi_device_iterator.get_next()
+
+      config = config_pb2.ConfigProto(device_count={"CPU": 2, "GPU": 1})
+      with self.test_session(config=config) as sess:
+        sess.run(multi_device_iterator.initializer)
+        for i in range(0, 10, 2):
+          self.assertEqual(i, sess.run(elem_on_1))
+        for i in range(0, 10, 2):
+          self.assertEqual(i + 1, sess.run(elem_on_2))
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(elem_on_1)
+          sess.run(elem_on_2)
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
index 9df403ef50..15b342d30f 100644
--- a/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/reader_dataset_ops_test.py
@@ -17,13 +17,16 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import gzip
 import os
+import zlib
 
 import numpy as np
 
 from tensorflow.contrib.data.python.kernel_tests import reader_dataset_ops_test_base
 from tensorflow.contrib.data.python.ops import readers
 from tensorflow.python.data.ops import readers as core_readers
+from tensorflow.python.data.util import nest
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
@@ -170,276 +173,383 @@ class ReadBatchFeaturesTest(
       for num_epochs in [1, 10]:
         with ops.Graph().as_default():
           # Basic test: read from file 0.
-          self.outputs = self.make_batch_feature(
+          outputs = self.make_batch_feature(
               filenames=self.test_filenames[0],
               num_epochs=num_epochs,
               batch_size=batch_size,
               drop_final_batch=True).make_one_shot_iterator().get_next()
-          for _, tensor in self.outputs.items():
+          for _, tensor in outputs.items():
             if isinstance(tensor, ops.Tensor):  # Guard against SparseTensor.
               self.assertEqual(tensor.shape[0], batch_size)
 
+  def testIndefiniteRepeatShapeInference(self):
+    dataset = self.make_batch_feature(
+        filenames=self.test_filenames[0], num_epochs=None, batch_size=32)
+    for shape, clazz in zip(nest.flatten(dataset.output_shapes),
+                            nest.flatten(dataset.output_classes)):
+      if issubclass(clazz, ops.Tensor):
+        self.assertEqual(32, shape[0])
+
 
 class MakeCsvDatasetTest(test.TestCase):
 
-  COLUMN_TYPES = [
-      dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64, dtypes.string
-  ]
-  COLUMNS = ["col%d" % i for i in range(len(COLUMN_TYPES))]
-  DEFAULT_VALS = [[], [], [], [], ["NULL"]]
-  DEFAULTS = [
-      constant_op.constant([], dtype=dtypes.int32),
-      constant_op.constant([], dtype=dtypes.int64),
-      constant_op.constant([], dtype=dtypes.float32),
-      constant_op.constant([], dtype=dtypes.float64),
-      constant_op.constant(["NULL"], dtype=dtypes.string)
-  ]
-  LABEL = COLUMNS[0]
-
-  def setUp(self):
-    super(MakeCsvDatasetTest, self).setUp()
-    self._num_files = 2
-    self._num_records = 11
-    self._test_filenames = self._create_files()
-
-  def _csv_values(self, fileno, recordno):
-    return [
-        fileno,
-        recordno,
-        fileno * recordno * 0.5,
-        fileno * recordno + 0.5,
-        "record %d" % recordno if recordno % 2 == 1 else "",
-    ]
+  def _make_csv_dataset(self, filenames, batch_size, num_epochs=1, **kwargs):
+    return readers.make_csv_dataset(
+        filenames, batch_size=batch_size, num_epochs=num_epochs, **kwargs)
 
-  def _write_file(self, filename, rows):
-    for i in range(len(rows)):
-      if isinstance(rows[i], list):
-        rows[i] = ",".join(str(v) if v is not None else "" for v in rows[i])
-    fn = os.path.join(self.get_temp_dir(), filename)
-    f = open(fn, "w")
-    f.write("\n".join(rows))
-    f.close()
-    return fn
-
-  def _create_file(self, fileno, header=True):
-    rows = []
-    if header:
-      rows.append(self.COLUMNS)
-    for recno in range(self._num_records):
-      rows.append(self._csv_values(fileno, recno))
-    return self._write_file("csv_file%d.csv" % fileno, rows)
-
-  def _create_files(self):
+  def _setup_files(self, inputs, linebreak="\n", compression_type=None):
     filenames = []
-    for i in range(self._num_files):
-      filenames.append(self._create_file(i))
+    for i, ip in enumerate(inputs):
+      fn = os.path.join(self.get_temp_dir(), "temp_%d.csv" % i)
+      contents = linebreak.join(ip).encode("utf-8")
+      if compression_type is None:
+        with open(fn, "wb") as f:
+          f.write(contents)
+      elif compression_type == "GZIP":
+        with gzip.GzipFile(fn, "wb") as f:
+          f.write(contents)
+      elif compression_type == "ZLIB":
+        contents = zlib.compress(contents)
+        with open(fn, "wb") as f:
+          f.write(contents)
+      else:
+        raise ValueError("Unsupported compression_type", compression_type)
+      filenames.append(fn)
     return filenames
 
-  def _make_csv_dataset(
-      self,
-      filenames,
-      defaults,
-      column_names=COLUMNS,
-      label_name=LABEL,
-      select_cols=None,
-      batch_size=1,
-      num_epochs=1,
-      shuffle=False,
-      shuffle_seed=None,
-      header=True,
-      na_value="",
-  ):
-    return readers.make_csv_dataset(
-        filenames,
-        batch_size=batch_size,
-        column_names=column_names,
-        column_defaults=defaults,
-        label_name=label_name,
-        num_epochs=num_epochs,
-        shuffle=shuffle,
-        shuffle_seed=shuffle_seed,
-        header=header,
-        na_value=na_value,
-        select_columns=select_cols,
-    )
-
-  def _next_actual_batch(self, file_indices, batch_size, num_epochs, defaults):
-    features = {col: list() for col in self.COLUMNS}
+  def _next_expected_batch(self, expected_output, expected_keys, batch_size,
+                           num_epochs):
+    features = {k: [] for k in expected_keys}
     for _ in range(num_epochs):
-      for i in file_indices:
-        for j in range(self._num_records):
-          values = self._csv_values(i, j)
-          for n, v in enumerate(values):
-            if v == "":  # pylint: disable=g-explicit-bool-comparison
-              values[n] = defaults[n][0]
-          values[-1] = values[-1].encode("utf-8")
-
-          # Regroup lists by column instead of row
-          for n, col in enumerate(self.COLUMNS):
-            features[col].append(values[n])
-          if len(list(features.values())[0]) == batch_size:
-            yield features
-            features = {col: list() for col in self.COLUMNS}
-
-  def _run_actual_batch(self, outputs, sess):
-    features, labels = sess.run(outputs)
-    batch = [features[k] for k in self.COLUMNS if k != self.LABEL]
-    batch.append(labels)
-    return batch
-
-  def _verify_records(
+      for values in expected_output:
+        for n, key in enumerate(expected_keys):
+          features[key].append(values[n])
+        if len(features[expected_keys[0]]) == batch_size:
+          yield features
+          features = {k: [] for k in expected_keys}
+    if features[expected_keys[0]]:  # Leftover from the last batch
+      yield features
+
+  def _verify_output(
       self,
       sess,
       dataset,
-      file_indices,
-      defaults=tuple(DEFAULT_VALS),
-      label_name=LABEL,
-      batch_size=1,
-      num_epochs=1,
+      batch_size,
+      num_epochs,
+      label_name,
+      expected_output,
+      expected_keys,
   ):
-    iterator = dataset.make_one_shot_iterator()
-    get_next = iterator.get_next()
+    nxt = dataset.make_one_shot_iterator().get_next()
 
-    for expected_features in self._next_actual_batch(file_indices, batch_size,
-                                                     num_epochs, defaults):
-      actual_features = sess.run(get_next)
+    for expected_features in self._next_expected_batch(
+        expected_output,
+        expected_keys,
+        batch_size,
+        num_epochs,
+    ):
+      actual_features = sess.run(nxt)
 
       if label_name is not None:
         expected_labels = expected_features.pop(label_name)
-        # Compare labels
         self.assertAllEqual(expected_labels, actual_features[1])
-        actual_features = actual_features[0]  # Extract features dict from tuple
+        actual_features = actual_features[0]
 
       for k in expected_features.keys():
         # Compare features
         self.assertAllEqual(expected_features[k], actual_features[k])
 
     with self.assertRaises(errors.OutOfRangeError):
-      sess.run(get_next)
-
-  def testMakeCSVDataset(self):
-    defaults = self.DEFAULTS
-
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        # Basic test: read from file 0.
-        dataset = self._make_csv_dataset(self._test_filenames[0], defaults)
-        self._verify_records(sess, dataset, [0])
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        # Basic test: read from file 1.
-        dataset = self._make_csv_dataset(self._test_filenames[1], defaults)
-        self._verify_records(sess, dataset, [1])
+      sess.run(nxt)
+
+  def _test_dataset(self,
+                    inputs,
+                    expected_output,
+                    expected_keys,
+                    batch_size=1,
+                    num_epochs=1,
+                    label_name=None,
+                    **kwargs):
+    """Checks that elements produced by CsvDataset match expected output."""
+    # Convert str type because py3 tf strings are bytestrings
+    filenames = self._setup_files(
+        inputs, compression_type=kwargs.get("compression_type", None))
     with ops.Graph().as_default() as g:
       with self.test_session(graph=g) as sess:
-        # Read from both files.
-        dataset = self._make_csv_dataset(self._test_filenames, defaults)
-        self._verify_records(sess, dataset, range(self._num_files))
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        # Read from both files. Exercise the `batch` and `num_epochs` parameters
-        # of make_csv_dataset and make sure they work.
         dataset = self._make_csv_dataset(
-            self._test_filenames, defaults, batch_size=2, num_epochs=10)
-        self._verify_records(
-            sess, dataset, range(self._num_files), batch_size=2, num_epochs=10)
+            filenames,
+            batch_size=batch_size,
+            num_epochs=num_epochs,
+            label_name=label_name,
+            **kwargs)
+        self._verify_output(sess, dataset, batch_size, num_epochs, label_name,
+                            expected_output, expected_keys)
+
+  def testMakeCSVDataset(self):
+    """Tests making a CSV dataset with keys and defaults provided."""
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    expected_output = [[0, 1, 2, 3, b"4"], [5, 6, 7, 8, b"9"],
+                       [10, 11, 12, 13, b"14"], [15, 16, 17, 18, b"19"]]
+    label = "col0"
+
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        column_names=column_names,
+        label_name=label,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        column_defaults=record_defaults,
+    )
+
+  def testMakeCSVDataset_withBatchSizeAndEpochs(self):
+    """Tests making a CSV dataset with keys and defaults provided."""
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    expected_output = [[0, 1, 2, 3, b"4"], [5, 6, 7, 8, b"9"],
+                       [10, 11, 12, 13, b"14"], [15, 16, 17, 18, b"19"]]
+    label = "col0"
+
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        column_names=column_names,
+        label_name=label,
+        batch_size=3,
+        num_epochs=10,
+        shuffle=False,
+        header=True,
+        column_defaults=record_defaults,
+    )
 
-  def testMakeCSVDataset_withBadColumns(self):
+  def testMakeCSVDataset_withCompressionType(self):
+    """Tests `compression_type` argument."""
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    expected_output = [[0, 1, 2, 3, b"4"], [5, 6, 7, 8, b"9"],
+                       [10, 11, 12, 13, b"14"], [15, 16, 17, 18, b"19"]]
+    label = "col0"
+
+    for compression_type in ("GZIP", "ZLIB"):
+      self._test_dataset(
+          inputs,
+          expected_output=expected_output,
+          expected_keys=column_names,
+          column_names=column_names,
+          label_name=label,
+          batch_size=1,
+          num_epochs=1,
+          shuffle=False,
+          header=True,
+          column_defaults=record_defaults,
+          compression_type=compression_type,
+      )
+
+  def testMakeCSVDataset_withBadInputs(self):
     """Tests that exception is raised when input is malformed.
     """
-    dupe_columns = self.COLUMNS[:-1] + self.COLUMNS[:1]
-    defaults = self.DEFAULTS
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    filenames = self._setup_files(inputs)
 
     # Duplicate column names
     with self.assertRaises(ValueError):
       self._make_csv_dataset(
-          self._test_filenames, defaults, column_names=dupe_columns)
+          filenames,
+          batch_size=1,
+          column_defaults=record_defaults,
+          label_name="col0",
+          column_names=column_names * 2)
 
     # Label key not one of column names
     with self.assertRaises(ValueError):
       self._make_csv_dataset(
-          self._test_filenames, defaults, label_name="not_a_real_label")
+          filenames,
+          batch_size=1,
+          column_defaults=record_defaults,
+          label_name="not_a_real_label",
+          column_names=column_names)
 
   def testMakeCSVDataset_withNoLabel(self):
-    """Tests that CSV datasets can be created when no label is specified.
-    """
-    defaults = self.DEFAULTS
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        # Read from both files. Make sure this works with no label key supplied.
-        dataset = self._make_csv_dataset(
-            self._test_filenames,
-            defaults,
-            batch_size=2,
-            num_epochs=10,
-            label_name=None)
-        self._verify_records(
-            sess,
-            dataset,
-            range(self._num_files),
-            batch_size=2,
-            num_epochs=10,
-            label_name=None)
+    """Tests making a CSV dataset with no label provided."""
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    expected_output = [[0, 1, 2, 3, b"4"], [5, 6, 7, 8, b"9"],
+                       [10, 11, 12, 13, b"14"], [15, 16, 17, 18, b"19"]]
+
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        column_names=column_names,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        column_defaults=record_defaults,
+    )
 
   def testMakeCSVDataset_withNoHeader(self):
     """Tests that datasets can be created from CSV files with no header line.
     """
-    defaults = self.DEFAULTS
-    file_without_header = self._create_file(
-        len(self._test_filenames), header=False)
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            file_without_header,
-            defaults,
-            batch_size=2,
-            num_epochs=10,
-            header=False,
-        )
-        self._verify_records(
-            sess,
-            dataset,
-            [len(self._test_filenames)],
-            batch_size=2,
-            num_epochs=10,
-        )
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [["0,1,2,3,4", "5,6,7,8,9"], ["10,11,12,13,14", "15,16,17,18,19"]]
+    expected_output = [[0, 1, 2, 3, b"4"], [5, 6, 7, 8, b"9"],
+                       [10, 11, 12, 13, b"14"], [15, 16, 17, 18, b"19"]]
+    label = "col0"
+
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        column_names=column_names,
+        label_name=label,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=False,
+        column_defaults=record_defaults,
+    )
 
   def testMakeCSVDataset_withTypes(self):
     """Tests that defaults can be a dtype instead of a Tensor for required vals.
     """
-    defaults = [d for d in self.COLUMN_TYPES[:-1]]
-    defaults.append(constant_op.constant(["NULL"], dtype=dtypes.string))
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(self._test_filenames, defaults)
-        self._verify_records(sess, dataset, range(self._num_files))
+    record_defaults = [
+        dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64,
+        dtypes.string
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x[0] for x in column_names), "0,1,2,3,4", "5,6,7,8,9"],
+              [
+                  ",".join(x[0] for x in column_names), "10,11,12,13,14",
+                  "15,16,17,18,19"
+              ]]
+    expected_output = [[0, 1, 2, 3, b"4"], [5, 6, 7, 8, b"9"],
+                       [10, 11, 12, 13, b"14"], [15, 16, 17, 18, b"19"]]
+    label = "col0"
+
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        column_names=column_names,
+        label_name=label,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        column_defaults=record_defaults,
+    )
 
   def testMakeCSVDataset_withNoColNames(self):
     """Tests that datasets can be created when column names are not specified.
 
     In that case, we should infer the column names from the header lines.
     """
-    defaults = self.DEFAULTS
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        # Read from both files. Exercise the `batch` and `num_epochs` parameters
-        # of make_csv_dataset and make sure they work.
-        dataset = self._make_csv_dataset(
-            self._test_filenames,
-            defaults,
-            column_names=None,
-            batch_size=2,
-            num_epochs=10)
-        self._verify_records(
-            sess, dataset, range(self._num_files), batch_size=2, num_epochs=10)
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    expected_output = [[0, 1, 2, 3, b"4"], [5, 6, 7, 8, b"9"],
+                       [10, 11, 12, 13, b"14"], [15, 16, 17, 18, b"19"]]
+    label = "col0"
+
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        label_name=label,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        column_defaults=record_defaults,
+    )
 
   def testMakeCSVDataset_withTypeInferenceMismatch(self):
     # Test that error is thrown when num fields doesn't match columns
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    filenames = self._setup_files(inputs)
     with self.assertRaises(ValueError):
       self._make_csv_dataset(
-          self._test_filenames,
-          column_names=self.COLUMNS + ["extra_name"],
-          defaults=None,
+          filenames,
+          column_names=column_names + ["extra_name"],
+          column_defaults=None,
           batch_size=2,
           num_epochs=10)
 
@@ -448,197 +558,215 @@ class MakeCsvDatasetTest(test.TestCase):
 
     In that case, we should infer the types from the first N records.
     """
-    # Test that it works with standard test files (with header, etc)
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            self._test_filenames, defaults=None, batch_size=2, num_epochs=10)
-        self._verify_records(
-            sess,
-            dataset,
-            range(self._num_files),
-            batch_size=2,
-            num_epochs=10,
-            defaults=[[], [], [], [], [""]])
-
-  def testMakeCSVDataset_withTypeInferenceTricky(self):
-    # Test on a deliberately tricky file (type changes as we read more rows, and
-    # there are null values)
-    fn = os.path.join(self.get_temp_dir(), "file.csv")
-    expected_dtypes = [
-        dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float32,
-        dtypes.string, dtypes.string
-    ]
-    col_names = ["col%d" % i for i in range(len(expected_dtypes))]
-    rows = [[None, None, None, "NAN", "",
-             "a"], [1, 2**31 + 1, 2**64, 123, "NAN", ""],
-            ['"123"', 2, 2**64, 123.4, "NAN", '"cd,efg"']]
-    expected = [[0, 0, 0, 0, "", "a"], [1, 2**31 + 1, 2**64, 123, "", ""],
-                [123, 2, 2**64, 123.4, "", "cd,efg"]]
-    for row in expected:
-      row[-1] = row[-1].encode("utf-8")  # py3 expects byte strings
-      row[-2] = row[-2].encode("utf-8")  # py3 expects byte strings
-    self._write_file("file.csv", [col_names] + rows)
+    column_names = ["col%d" % i for i in range(5)]
+    str_int32_max = str(2**33)
+    inputs = [[
+        ",".join(x for x in column_names),
+        "0,%s,2.0,3e50,rabbit" % str_int32_max
+    ]]
+    expected_output = [[0, 2**33, 2.0, 3e50, b"rabbit"]]
+    label = "col0"
 
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            fn,
-            defaults=None,
-            column_names=None,
-            label_name=None,
-            na_value="NAN",
-        )
-        features = dataset.make_one_shot_iterator().get_next()
-        # Check that types match
-        for i in range(len(expected_dtypes)):
-          print(features["col%d" % i].dtype, expected_dtypes[i])
-          assert features["col%d" % i].dtype == expected_dtypes[i]
-        for i in range(len(rows)):
-          assert sess.run(features) == dict(zip(col_names, expected[i]))
-
-  def testMakeCSVDataset_withTypeInferenceAllTypes(self):
-    # Test that we make the correct inference for all types with fallthrough
-    fn = os.path.join(self.get_temp_dir(), "file.csv")
-    expected_dtypes = [
-        dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64,
-        dtypes.string, dtypes.string
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        column_names=column_names,
+        label_name=label,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+    )
+
+  def testMakeCSVDataset_withTypeInferenceFallthrough(self):
+    """Tests that datasets can be created when no defaults are specified.
+
+    Tests on a deliberately tricky file.
+    """
+    column_names = ["col%d" % i for i in range(5)]
+    str_int32_max = str(2**33)
+    inputs = [[
+        ",".join(x for x in column_names),
+        ",,,,",
+        "0,0,0.0,0.0,0.0",
+        "0,%s,2.0,3e50,rabbit" % str_int32_max,
+        ",,,,",
+    ]]
+    expected_output = [[0, 0, 0, 0, b""], [0, 0, 0, 0, b"0.0"],
+                       [0, 2**33, 2.0, 3e50, b"rabbit"], [0, 0, 0, 0, b""]]
+    label = "col0"
+
+    self._test_dataset(
+        inputs,
+        expected_output=expected_output,
+        expected_keys=column_names,
+        column_names=column_names,
+        label_name=label,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+    )
+
+  def testMakeCSVDataset_withSelectCols(self):
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
     ]
-    col_names = ["col%d" % i for i in range(len(expected_dtypes))]
-    rows = [[1, 2**31 + 1, 1.0, 4e40, "abc", ""]]
-    expected = [[
-        1, 2**31 + 1, 1.0, 4e40, "abc".encode("utf-8"), "".encode("utf-8")
+    column_names = ["col%d" % i for i in range(5)]
+    str_int32_max = str(2**33)
+    inputs = [[
+        ",".join(x for x in column_names),
+        "0,%s,2.0,3e50,rabbit" % str_int32_max
     ]]
-    self._write_file("file.csv", [col_names] + rows)
+    expected_output = [[0, 2**33, 2.0, 3e50, b"rabbit"]]
 
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            fn,
-            defaults=None,
-            column_names=None,
-            label_name=None,
-            na_value="NAN",
-        )
-        features = dataset.make_one_shot_iterator().get_next()
-        # Check that types match
-        for i in range(len(expected_dtypes)):
-          self.assertAllEqual(features["col%d" % i].dtype, expected_dtypes[i])
-        for i in range(len(rows)):
-          self.assertAllEqual(
-              sess.run(features), dict(zip(col_names, expected[i])))
+    select_cols = [1, 3, 4]
+    self._test_dataset(
+        inputs,
+        expected_output=[[x[i] for i in select_cols] for x in expected_output],
+        expected_keys=[column_names[i] for i in select_cols],
+        column_names=column_names,
+        column_defaults=[record_defaults[i] for i in select_cols],
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        select_columns=select_cols,
+    )
+
+    # Can still do inference without provided defaults
+    self._test_dataset(
+        inputs,
+        expected_output=[[x[i] for i in select_cols] for x in expected_output],
+        expected_keys=[column_names[i] for i in select_cols],
+        column_names=column_names,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        select_columns=select_cols,
+    )
+
+    # Can still do column name inference
+    self._test_dataset(
+        inputs,
+        expected_output=[[x[i] for i in select_cols] for x in expected_output],
+        expected_keys=[column_names[i] for i in select_cols],
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        select_columns=select_cols,
+    )
+
+    # Can specify column names instead of indices
+    self._test_dataset(
+        inputs,
+        expected_output=[[x[i] for i in select_cols] for x in expected_output],
+        expected_keys=[column_names[i] for i in select_cols],
+        column_names=column_names,
+        batch_size=1,
+        num_epochs=1,
+        shuffle=False,
+        header=True,
+        select_columns=[column_names[i] for i in select_cols],
+    )
 
   def testMakeCSVDataset_withSelectColsError(self):
-    data = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
-    col_names = ["col%d" % i for i in range(5)]
-    fn = self._write_file("file.csv", [col_names] + data)
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+    column_names = ["col%d" % i for i in range(5)]
+    str_int32_max = str(2**33)
+    inputs = [[
+        ",".join(x for x in column_names),
+        "0,%s,2.0,3e50,rabbit" % str_int32_max
+    ]]
+
+    select_cols = [1, 3, 4]
+    filenames = self._setup_files(inputs)
+
     with self.assertRaises(ValueError):
       # Mismatch in number of defaults and number of columns selected,
       # should raise an error
       self._make_csv_dataset(
-          fn,
-          defaults=[[0]] * 5,
-          column_names=col_names,
-          label_name=None,
-          select_cols=[1, 3])
+          filenames,
+          batch_size=1,
+          column_defaults=record_defaults,
+          column_names=column_names,
+          select_columns=select_cols)
+
     with self.assertRaises(ValueError):
       # Invalid column name should raise an error
       self._make_csv_dataset(
-          fn,
-          defaults=[[0]],
-          column_names=col_names,
+          filenames,
+          batch_size=1,
+          column_defaults=[[0]],
+          column_names=column_names,
           label_name=None,
-          select_cols=["invalid_col_name"])
-
-  def testMakeCSVDataset_withSelectCols(self):
-    data = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
-    col_names = ["col%d" % i for i in range(5)]
-    fn = self._write_file("file.csv", [col_names] + data)
-    # If select_cols is specified, should only yield a subset of columns
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            fn,
-            defaults=[[0], [0]],
-            column_names=col_names,
-            label_name=None,
-            select_cols=[1, 3])
-        expected = [[1, 3], [6, 8]]
-        features = dataset.make_one_shot_iterator().get_next()
-        for i in range(len(data)):
-          self.assertAllEqual(
-              sess.run(features),
-              dict(zip([col_names[1], col_names[3]], expected[i])))
-    # Can still do default inference with select_cols
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            fn,
-            defaults=None,
-            column_names=col_names,
-            label_name=None,
-            select_cols=[1, 3])
-        expected = [[1, 3], [6, 8]]
-        features = dataset.make_one_shot_iterator().get_next()
-        for i in range(len(data)):
-          self.assertAllEqual(
-              sess.run(features),
-              dict(zip([col_names[1], col_names[3]], expected[i])))
-    # Can still do column name inference
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            fn,
-            defaults=None,
-            column_names=None,
-            label_name=None,
-            select_cols=[1, 3])
-        expected = [[1, 3], [6, 8]]
-        features = dataset.make_one_shot_iterator().get_next()
-        for i in range(len(data)):
-          self.assertAllEqual(
-              sess.run(features),
-              dict(zip([col_names[1], col_names[3]], expected[i])))
-    # Can specify column names instead of indices
-    with ops.Graph().as_default() as g:
-      with self.test_session(graph=g) as sess:
-        dataset = self._make_csv_dataset(
-            fn,
-            defaults=None,
-            column_names=None,
-            label_name=None,
-            select_cols=[col_names[1], col_names[3]])
-        expected = [[1, 3], [6, 8]]
-        features = dataset.make_one_shot_iterator().get_next()
-        for i in range(len(data)):
-          self.assertAllEqual(
-              sess.run(features),
-              dict(zip([col_names[1], col_names[3]], expected[i])))
+          select_columns=["invalid_col_name"])
 
   def testMakeCSVDataset_withShuffle(self):
-    total_records = self._num_files * self._num_records
-    defaults = self.DEFAULTS
+    record_defaults = [
+        constant_op.constant([], dtypes.int32),
+        constant_op.constant([], dtypes.int64),
+        constant_op.constant([], dtypes.float32),
+        constant_op.constant([], dtypes.float64),
+        constant_op.constant([], dtypes.string)
+    ]
+
+    def str_series(st):
+      return ",".join(str(i) for i in range(st, st + 5))
+
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [
+        [",".join(x for x in column_names)
+        ] + [str_series(5 * i) for i in range(15)],
+        [",".join(x for x in column_names)] +
+        [str_series(5 * i) for i in range(15, 20)],
+    ]
+
+    filenames = self._setup_files(inputs)
+
+    total_records = 20
     for batch_size in [1, 2]:
       with ops.Graph().as_default() as g:
         with self.test_session(graph=g) as sess:
           # Test that shuffling with the same seed produces the same result
           dataset1 = self._make_csv_dataset(
-              self._test_filenames,
-              defaults,
+              filenames,
+              column_defaults=record_defaults,
+              column_names=column_names,
               batch_size=batch_size,
+              header=True,
               shuffle=True,
-              shuffle_seed=5)
+              shuffle_seed=5,
+              num_epochs=2,
+          )
           dataset2 = self._make_csv_dataset(
-              self._test_filenames,
-              defaults,
+              filenames,
+              column_defaults=record_defaults,
+              column_names=column_names,
               batch_size=batch_size,
+              header=True,
               shuffle=True,
-              shuffle_seed=5)
+              shuffle_seed=5,
+              num_epochs=2,
+          )
           outputs1 = dataset1.make_one_shot_iterator().get_next()
           outputs2 = dataset2.make_one_shot_iterator().get_next()
           for _ in range(total_records // batch_size):
-            batch1 = self._run_actual_batch(outputs1, sess)
-            batch2 = self._run_actual_batch(outputs2, sess)
+            batch1 = nest.flatten(sess.run(outputs1))
+            batch2 = nest.flatten(sess.run(outputs2))
             for i in range(len(batch1)):
               self.assertAllEqual(batch1[i], batch2[i])
 
@@ -646,27 +774,45 @@ class MakeCsvDatasetTest(test.TestCase):
         with self.test_session(graph=g) as sess:
           # Test that shuffling with a different seed produces different results
           dataset1 = self._make_csv_dataset(
-              self._test_filenames,
-              defaults,
+              filenames,
+              column_defaults=record_defaults,
+              column_names=column_names,
               batch_size=batch_size,
+              header=True,
               shuffle=True,
-              shuffle_seed=5)
+              shuffle_seed=5,
+              num_epochs=2,
+          )
           dataset2 = self._make_csv_dataset(
-              self._test_filenames,
-              defaults,
+              filenames,
+              column_defaults=record_defaults,
+              column_names=column_names,
               batch_size=batch_size,
+              header=True,
               shuffle=True,
-              shuffle_seed=6)
+              shuffle_seed=6,
+              num_epochs=2,
+          )
           outputs1 = dataset1.make_one_shot_iterator().get_next()
           outputs2 = dataset2.make_one_shot_iterator().get_next()
           all_equal = False
           for _ in range(total_records // batch_size):
-            batch1 = self._run_actual_batch(outputs1, sess)
-            batch2 = self._run_actual_batch(outputs2, sess)
+            batch1 = nest.flatten(sess.run(outputs1))
+            batch2 = nest.flatten(sess.run(outputs2))
             for i in range(len(batch1)):
               all_equal = all_equal and np.array_equal(batch1[i], batch2[i])
           self.assertFalse(all_equal)
 
+  def testIndefiniteRepeatShapeInference(self):
+    column_names = ["col%d" % i for i in range(5)]
+    inputs = [[",".join(x for x in column_names), "0,1,2,3,4", "5,6,7,8,9"], [
+        ",".join(x for x in column_names), "10,11,12,13,14", "15,16,17,18,19"
+    ]]
+    filenames = self._setup_files(inputs)
+    dataset = self._make_csv_dataset(filenames, batch_size=32, num_epochs=None)
+    for shape in nest.flatten(dataset.output_shapes):
+      self.assertEqual(32, shape[0])
+
 
 class MakeTFRecordDatasetTest(
     reader_dataset_ops_test_base.TFRecordDatasetTestBase):
@@ -874,6 +1020,12 @@ class MakeTFRecordDatasetTest(
           self._shuffle_test(batch_size, num_epochs, num_parallel_reads,
                              seed=21345)
 
+  def testIndefiniteRepeatShapeInference(self):
+    dataset = readers.make_tf_record_dataset(
+        file_pattern=self.test_filenames, num_epochs=None, batch_size=32)
+    for shape in nest.flatten(dataset.output_shapes):
+      self.assertEqual(32, shape[0])
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/serialization/BUILD b/tensorflow/contrib/data/python/kernel_tests/serialization/BUILD
index 686788522a..3c3f23f9a9 100644
--- a/tensorflow/contrib/data/python/kernel_tests/serialization/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/serialization/BUILD
@@ -73,6 +73,20 @@ py_test(
 )
 
 py_test(
+    name = "csv_dataset_serialization_test",
+    size = "small",
+    srcs = ["csv_dataset_serialization_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_pip"],
+    deps = [
+        ":dataset_serialization_test_base",
+        "//tensorflow/contrib/data/python/ops:readers",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_ops",
+    ],
+)
+
+py_test(
     name = "dataset_constructor_serialization_test",
     size = "medium",
     srcs = ["dataset_constructor_serialization_test.py"],
diff --git a/tensorflow/contrib/data/python/kernel_tests/serialization/csv_dataset_serialization_test.py b/tensorflow/contrib/data/python/kernel_tests/serialization/csv_dataset_serialization_test.py
new file mode 100644
index 0000000000..247f2046ea
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/serialization/csv_dataset_serialization_test.py
@@ -0,0 +1,73 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the CsvDataset serialization."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gzip
+import os
+
+from tensorflow.contrib.data.python.kernel_tests.serialization import dataset_serialization_test_base
+from tensorflow.contrib.data.python.ops import readers
+from tensorflow.python.platform import test
+
+
+class CsvDatasetSerializationTest(
+    dataset_serialization_test_base.DatasetSerializationTestBase):
+
+  def setUp(self):
+    self._num_cols = 7
+    self._num_rows = 10
+    self._num_epochs = 14
+    self._num_outputs = self._num_rows * self._num_epochs
+
+    inputs = [
+        ",".join(str(self._num_cols * j + i)
+                 for i in range(self._num_cols))
+        for j in range(self._num_rows)
+    ]
+    contents = "\n".join(inputs).encode("utf-8")
+
+    self._filename = os.path.join(self.get_temp_dir(), "file.csv")
+    self._compressed = os.path.join(self.get_temp_dir(),
+                                    "comp.csv")  # GZip compressed
+
+    with open(self._filename, "wb") as f:
+      f.write(contents)
+    with gzip.GzipFile(self._compressed, "wb") as f:
+      f.write(contents)
+
+  def ds_func(self, **kwargs):
+    compression_type = kwargs.get("compression_type", None)
+    if compression_type == "GZIP":
+      filename = self._compressed
+    elif compression_type is None:
+      filename = self._filename
+    else:
+      raise ValueError("Invalid compression type:", compression_type)
+
+    return readers.CsvDataset(filename, **kwargs).repeat(self._num_epochs)
+
+  def testSerializationCore(self):
+    defs = [[0]] * self._num_cols
+    self.run_core_tests(
+        lambda: self.ds_func(record_defaults=defs, buffer_size=2),
+        lambda: self.ds_func(record_defaults=defs, buffer_size=12),
+        self._num_outputs)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/serialization/dataset_serialization_test_base.py b/tensorflow/contrib/data/python/kernel_tests/serialization/dataset_serialization_test_base.py
index 393f08850b..3ed4dfb729 100644
--- a/tensorflow/contrib/data/python/kernel_tests/serialization/dataset_serialization_test_base.py
+++ b/tensorflow/contrib/data/python/kernel_tests/serialization/dataset_serialization_test_base.py
@@ -32,6 +32,7 @@ from tensorflow.python.ops import lookup_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.util import nest
 
@@ -655,7 +656,7 @@ class DatasetSerializationTestBase(test.TestCase):
     return os.path.join(self.get_temp_dir(), "iterator")
 
   def _latest_ckpt(self):
-    return saver_lib.latest_checkpoint(self.get_temp_dir())
+    return checkpoint_management.latest_checkpoint(self.get_temp_dir())
 
   def _save(self, sess, saver):
     saver.save(sess, self._ckpt_path())
diff --git a/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py
index 5590a4bf78..8b2f846494 100644
--- a/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py
@@ -17,6 +17,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from absl.testing import parameterized
 import numpy as np
 
 from tensorflow.contrib.data.python.ops import sliding
@@ -29,28 +30,45 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.platform import test
 
 
-class SlideDatasetTest(test.TestCase):
-
-  def testSlideDataset(self):
-    """Test an dataset that maps a TF function across its input elements."""
+class SlideDatasetTest(test.TestCase, parameterized.TestCase):
+
+  @parameterized.parameters(
+      (20, 14, 7, 1),
+      (20, 17, 9, 1),
+      (20, 14, 14, 1),
+      (20, 10, 14, 1),
+      (20, 14, 19, 1),
+      (20, 4, 1, 2),
+      (20, 2, 1, 6),
+      (20, 4, 7, 2),
+      (20, 2, 7, 6),
+      (1, 10, 4, 1),
+      (0, 10, 4, 1),
+  )
+  def testSlideDataset(self, count, window_size, window_shift, window_stride):
+    """Tests a dataset that slides a window its input elements."""
     components = (np.arange(7),
                   np.array([[1, 2, 3]]) * np.arange(7)[:, np.newaxis],
                   np.array(37.0) * np.arange(7))
 
-    count = array_ops.placeholder(dtypes.int64, shape=[])
-    window_size = array_ops.placeholder(dtypes.int64, shape=[])
-    stride = array_ops.placeholder(dtypes.int64, shape=[])
+    count_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_size_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_shift_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_stride_t = array_ops.placeholder(dtypes.int64, shape=[])
 
     def _map_fn(x, y, z):
       return math_ops.square(x), math_ops.square(y), math_ops.square(z)
 
     # The pipeline is TensorSliceDataset -> MapDataset(square_3) ->
-    # RepeatDataset(count) -> _SlideDataset(window_size, stride).
-    iterator = (dataset_ops.Dataset.from_tensor_slices(components)
-                .map(_map_fn)
-                .repeat(count)
-                .apply(sliding.sliding_window_batch(window_size, stride))
-                .make_initializable_iterator())
+    # RepeatDataset(count) ->
+    # _SlideDataset(window_size, window_shift, window_stride).
+    iterator = (
+        dataset_ops.Dataset.from_tensor_slices(components).map(_map_fn)
+        .repeat(count).apply(
+            sliding.sliding_window_batch(
+                window_size=window_size_t,
+                window_shift=window_shift_t,
+                window_stride=window_stride_t)).make_initializable_iterator())
     init_op = iterator.initializer
     get_next = iterator.get_next()
 
@@ -58,90 +76,126 @@ class SlideDatasetTest(test.TestCase):
                      [t.shape.as_list() for t in get_next])
 
     with self.test_session() as sess:
-      # stride < window_size.
-      # Slide over a finite input, where the window_size divides the
-      # total number of elements.
-      sess.run(init_op, feed_dict={count: 20, window_size: 14, stride: 7})
-      # Same formula with convolution layer.
-      num_batches = (20 * 7 - 14) // 7 + 1
-      for i in range(num_batches):
-        result = sess.run(get_next)
-        for component, result_component in zip(components, result):
-          for j in range(14):
-            self.assertAllEqual(component[(i*7 + j) % 7]**2,
-                                result_component[j])
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
-      # Slide over a finite input, where the window_size does not
-      # divide the total number of elements.
-      sess.run(init_op, feed_dict={count: 20, window_size: 17, stride: 9})
-      num_batches = (20 * 7 - 17) // 9 + 1
+      sess.run(
+          init_op,
+          feed_dict={
+              count_t: count,
+              window_size_t: window_size,
+              window_shift_t: window_shift,
+              window_stride_t: window_stride
+          })
+      num_batches = (count * 7 - (
+          (window_size - 1) * window_stride + 1)) // window_shift + 1
       for i in range(num_batches):
         result = sess.run(get_next)
         for component, result_component in zip(components, result):
-          for j in range(17):
-            self.assertAllEqual(component[(i*9 + j) % 7]**2,
-                                result_component[j])
+          for j in range(window_size):
+            self.assertAllEqual(
+                component[(i * window_shift + j * window_stride) % 7]**2,
+                result_component[j])
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
-      # stride == window_size.
-      sess.run(init_op, feed_dict={count: 20, window_size: 14, stride: 14})
-      num_batches = 20 * 7 // 14
-      for i in range(num_batches):
-        result = sess.run(get_next)
-        for component, result_component in zip(components, result):
-          for j in range(14):
-            self.assertAllEqual(component[(i*14 + j) % 7]**2,
-                                result_component[j])
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
+  @parameterized.parameters(
+      (20, 14, 7, 1),
+      (20, 17, 9, 1),
+      (20, 14, 14, 1),
+      (20, 10, 14, 1),
+      (20, 14, 19, 1),
+      (20, 4, 1, 2),
+      (20, 2, 1, 6),
+      (20, 4, 7, 2),
+      (20, 2, 7, 6),
+      (1, 10, 4, 1),
+      (0, 10, 4, 1),
+  )
+  def testSlideDatasetDeprecated(self, count, window_size, stride,
+                                 window_stride):
+    """Tests a dataset that slides a window its input elements."""
+    components = (np.arange(7),
+                  np.array([[1, 2, 3]]) * np.arange(7)[:, np.newaxis],
+                  np.array(37.0) * np.arange(7))
 
-      # stride > window_size.
-      sess.run(init_op, feed_dict={count: 20, window_size: 10, stride: 14})
-      num_batches = 20 * 7 // 14
-      for i in range(num_batches):
-        result = sess.run(get_next)
-        for component, result_component in zip(components, result):
-          for j in range(10):
-            self.assertAllEqual(component[(i*14 + j) % 7]**2,
-                                result_component[j])
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
-      # Drop the last batch which is smaller than window_size.
-      sess.run(init_op, feed_dict={count: 20, window_size: 14, stride: 19})
-      num_batches = (20 * 7 - 7) // 19  # = 19 * 7 // 19
-      for i in range(num_batches):
-        result = sess.run(get_next)
-        for component, result_component in zip(components, result):
-          for j in range(14):
-            self.assertAllEqual(component[(i*19 + j) % 7]**2,
-                                result_component[j])
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
+    count_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_size_t = array_ops.placeholder(dtypes.int64, shape=[])
+    stride_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_stride_t = array_ops.placeholder(dtypes.int64, shape=[])
 
-      # Slide over a finite input, which is less than window_size,
-      # should fail straight away.
-      sess.run(init_op, feed_dict={count: 1, window_size: 10, stride: 4})
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
+    def _map_fn(x, y, z):
+      return math_ops.square(x), math_ops.square(y), math_ops.square(z)
 
-      sess.run(init_op, feed_dict={count: 1, window_size: 10, stride: 8})
-      with self.assertRaises(errors.OutOfRangeError):
-        sess.run(get_next)
+    # The pipeline is TensorSliceDataset -> MapDataset(square_3) ->
+    # RepeatDataset(count) -> _SlideDataset(window_size, stride, window_stride).
+    iterator = (
+        dataset_ops.Dataset.from_tensor_slices(components).map(_map_fn)
+        .repeat(count).apply(
+            sliding.sliding_window_batch(
+                window_size=window_size_t,
+                stride=stride_t,
+                window_stride=window_stride_t)).make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
 
-      # Slide over an empty input should fail straight away.
-      sess.run(init_op, feed_dict={count: 0, window_size: 8, stride: 4})
+    self.assertEqual([[None] + list(c.shape[1:]) for c in components],
+                     [t.shape.as_list() for t in get_next])
+
+    with self.test_session() as sess:
+      sess.run(
+          init_op,
+          feed_dict={
+              count_t: count,
+              window_size_t: window_size,
+              stride_t: stride,
+              window_stride_t: window_stride
+          })
+      num_batches = (count * 7 - (
+          (window_size - 1) * window_stride + 1)) // stride + 1
+      for i in range(num_batches):
+        result = sess.run(get_next)
+        for component, result_component in zip(components, result):
+          for j in range(window_size):
+            self.assertAllEqual(
+                component[(i * stride + j * window_stride) % 7]**2,
+                result_component[j])
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
-      # Empty window_size should be an initialization time error.
-      with self.assertRaises(errors.InvalidArgumentError):
-        sess.run(init_op, feed_dict={count: 14, window_size: 0, stride: 0})
+  @parameterized.parameters(
+      (14, 0, 3, 1),
+      (14, 3, 0, 1),
+      (14, 3, 3, 0),
+  )
+  def testSlideDatasetInvalid(self, count, window_size, window_shift,
+                              window_stride):
+    count_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_size_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_shift_t = array_ops.placeholder(dtypes.int64, shape=[])
+    window_stride_t = array_ops.placeholder(dtypes.int64, shape=[])
+
+    iterator = (
+        dataset_ops.Dataset.range(10).map(lambda x: x).repeat(count_t).apply(
+            sliding.sliding_window_batch(
+                window_size=window_size_t,
+                window_shift=window_shift_t,
+                window_stride=window_stride_t)).make_initializable_iterator())
+    init_op = iterator.initializer
 
-      # Invalid stride should be an initialization time error.
+    with self.test_session() as sess:
       with self.assertRaises(errors.InvalidArgumentError):
-        sess.run(init_op, feed_dict={count: 14, window_size: 3, stride: 0})
+        sess.run(
+            init_op,
+            feed_dict={
+                count_t: count,
+                window_size_t: window_size,
+                window_shift_t: window_shift,
+                window_stride_t: window_stride
+            })
+
+  def testSlideDatasetValueError(self):
+    with self.assertRaises(ValueError):
+      dataset_ops.Dataset.range(10).map(lambda x: x).apply(
+          sliding.sliding_window_batch(
+              window_size=1, stride=1, window_shift=1, window_stride=1))
 
   def assertSparseValuesEqual(self, a, b):
     self.assertAllEqual(a.indices, b.indices)
@@ -155,7 +209,8 @@ class SlideDatasetTest(test.TestCase):
           indices=[[0]], values=(i * [1]), dense_shape=[1])
 
     iterator = dataset_ops.Dataset.range(10).map(_sparse).apply(
-        sliding.sliding_window_batch(5, 3)).make_initializable_iterator()
+        sliding.sliding_window_batch(
+            window_size=5, window_shift=3)).make_initializable_iterator()
     init_op = iterator.initializer
     get_next = iterator.get_next()
 
@@ -183,7 +238,8 @@ class SlideDatasetTest(test.TestCase):
           dense_shape=[i])
 
     iterator = dataset_ops.Dataset.range(10).map(_sparse).apply(
-        sliding.sliding_window_batch(5, 3)).make_initializable_iterator()
+        sliding.sliding_window_batch(
+            window_size=5, window_shift=3)).make_initializable_iterator()
     init_op = iterator.initializer
     get_next = iterator.get_next()
 
@@ -213,11 +269,11 @@ class SlideDatasetTest(test.TestCase):
       return sparse_tensor.SparseTensorValue(
           indices=[[0]], values=(i * [1]), dense_shape=[1])
 
-    iterator = (dataset_ops.Dataset.range(10)
-                .map(_sparse)
-                .apply(sliding.sliding_window_batch(4, 2))
-                .apply(sliding.sliding_window_batch(3, 1))
-                .make_initializable_iterator())
+    iterator = (
+        dataset_ops.Dataset.range(10).map(_sparse).apply(
+            sliding.sliding_window_batch(window_size=4, window_shift=2)).apply(
+                sliding.sliding_window_batch(window_size=3, window_shift=1))
+        .make_initializable_iterator())
     init_op = iterator.initializer
     get_next = iterator.get_next()
 
@@ -226,9 +282,9 @@ class SlideDatasetTest(test.TestCase):
       # Slide: 1st batch.
       actual = sess.run(get_next)
       expected = sparse_tensor.SparseTensorValue(
-          indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0],
-                   [1, 0, 0], [1, 1, 0], [1, 2, 0], [1, 3, 0],
-                   [2, 0, 0], [2, 1, 0], [2, 2, 0], [2, 3, 0]],
+          indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0], [1, 0, 0],
+                   [1, 1, 0], [1, 2, 0], [1, 3, 0], [2, 0, 0], [2, 1, 0],
+                   [2, 2, 0], [2, 3, 0]],
           values=[0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7],
           dense_shape=[3, 4, 1])
       self.assertTrue(sparse_tensor.is_sparse(actual))
@@ -236,9 +292,9 @@ class SlideDatasetTest(test.TestCase):
       # Slide: 2nd batch.
       actual = sess.run(get_next)
       expected = sparse_tensor.SparseTensorValue(
-          indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0],
-                   [1, 0, 0], [1, 1, 0], [1, 2, 0], [1, 3, 0],
-                   [2, 0, 0], [2, 1, 0], [2, 2, 0], [2, 3, 0]],
+          indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0], [1, 0, 0],
+                   [1, 1, 0], [1, 2, 0], [1, 3, 0], [2, 0, 0], [2, 1, 0],
+                   [2, 2, 0], [2, 3, 0]],
           values=[2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9],
           dense_shape=[3, 4, 1])
       self.assertTrue(sparse_tensor.is_sparse(actual))
@@ -253,10 +309,11 @@ class SlideDatasetTest(test.TestCase):
       yield [4.0, 5.0, 6.0]
       yield [7.0, 8.0, 9.0, 10.0]
 
-    iterator = (dataset_ops.Dataset.from_generator(generator, dtypes.float32,
-                                                   output_shapes=[None])
-                .apply(sliding.sliding_window_batch(3, 1))
-                .make_initializable_iterator())
+    iterator = (
+        dataset_ops.Dataset.from_generator(
+            generator, dtypes.float32, output_shapes=[None]).apply(
+                sliding.sliding_window_batch(window_size=3, window_shift=1))
+        .make_initializable_iterator())
     next_element = iterator.get_next()
 
     with self.test_session() as sess:
diff --git a/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py
index b4945685c1..a41d21f8c1 100644
--- a/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/stats_dataset_ops_test.py
@@ -20,8 +20,8 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.data.python.kernel_tests import reader_dataset_ops_test_base
+from tensorflow.contrib.data.python.kernel_tests import stats_dataset_test_base
 from tensorflow.contrib.data.python.ops import stats_ops
-from tensorflow.core.framework import summary_pb2
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
@@ -29,28 +29,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import test
 
 
-class StatsDatasetTestBase(test.TestCase):
-
-  def _assertSummaryHasCount(self, summary_str, tag, expected_value):
-    summary_proto = summary_pb2.Summary()
-    summary_proto.ParseFromString(summary_str)
-    for value in summary_proto.value:
-      if tag == value.tag:
-        self.assertEqual(expected_value, value.histo.num)
-        return
-    self.fail("Expected tag %r not found in summary %r" % (tag, summary_proto))
-
-  def _assertSummaryHasSum(self, summary_str, tag, expected_value):
-    summary_proto = summary_pb2.Summary()
-    summary_proto.ParseFromString(summary_str)
-    for value in summary_proto.value:
-      if tag == value.tag:
-        self.assertEqual(expected_value, value.histo.sum)
-        return
-    self.fail("Expected tag %r not found in summary %r" % (tag, summary_proto))
-
-
-class StatsDatasetTest(StatsDatasetTestBase):
+class StatsDatasetTest(stats_dataset_test_base.StatsDatasetTestBase):
 
   def testBytesProduced(self):
     stats_aggregator = stats_ops.StatsAggregator()
@@ -197,7 +176,7 @@ class StatsDatasetTest(StatsDatasetTestBase):
 
 
 class FeatureStatsDatasetTest(
-    StatsDatasetTestBase,
+    stats_dataset_test_base.StatsDatasetTestBase,
     reader_dataset_ops_test_base.ReadBatchFeaturesTestBase):
 
   def testFeaturesStats(self):
diff --git a/tensorflow/contrib/data/python/kernel_tests/stats_dataset_test_base.py b/tensorflow/contrib/data/python/kernel_tests/stats_dataset_test_base.py
new file mode 100644
index 0000000000..9a13acf8f0
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/stats_dataset_test_base.py
@@ -0,0 +1,44 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Base class for testing the input pipeline statistics gathering ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+from tensorflow.core.framework import summary_pb2
+from tensorflow.python.platform import test
+
+
+class StatsDatasetTestBase(test.TestCase):
+  """Base class for testing statistics gathered in `StatsAggregator`."""
+
+  def _assertSummaryHasCount(self, summary_str, tag, expected_value):
+    summary_proto = summary_pb2.Summary()
+    summary_proto.ParseFromString(summary_str)
+    for value in summary_proto.value:
+      if tag == value.tag:
+        self.assertEqual(expected_value, value.histo.num)
+        return
+    self.fail("Expected tag %r not found in summary %r" % (tag, summary_proto))
+
+  def _assertSummaryHasSum(self, summary_str, tag, expected_value):
+    summary_proto = summary_pb2.Summary()
+    summary_proto.ParseFromString(summary_str)
+    for value in summary_proto.value:
+      if tag == value.tag:
+        self.assertEqual(expected_value, value.histo.sum)
+        return
+    self.fail("Expected tag %r not found in summary %r" % (tag, summary_proto))
diff --git a/tensorflow/contrib/data/python/kernel_tests/window_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/window_dataset_op_test.py
new file mode 100644
index 0000000000..33d95d6754
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/window_dataset_op_test.py
@@ -0,0 +1,523 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental input pipeline ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from absl.testing import parameterized
+import numpy as np
+
+from tensorflow.contrib.data.python.ops import batching
+from tensorflow.contrib.data.python.ops import grouping
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import sparse_ops
+from tensorflow.python.platform import test
+
+
+class WindowDatasetTest(test.TestCase, parameterized.TestCase):
+
+  def _structuredDataset(self, structure, shape, dtype):
+    if structure is None:
+      return dataset_ops.Dataset.from_tensors(
+          array_ops.zeros(shape, dtype=dtype))
+    else:
+      return dataset_ops.Dataset.zip(
+          tuple([
+              self._structuredDataset(substructure, shape, dtype)
+              for substructure in structure
+          ]))
+
+  def _structuredElement(self, structure, shape, dtype):
+    if structure is None:
+      return array_ops.zeros(shape, dtype=dtype)
+    else:
+      return tuple([
+          self._structuredElement(substructure, shape, dtype)
+          for substructure in structure
+      ])
+
+  def _assertEqual(self, xs, ys):
+    self.assertEqual(type(xs), type(ys))
+    if isinstance(xs, tuple) and isinstance(ys, tuple):
+      self.assertEqual(len(xs), len(ys))
+      for x, y in zip(xs, ys):
+        self._assertEqual(x, y)
+    elif isinstance(xs, np.ndarray) and isinstance(ys, np.ndarray):
+      self.assertAllEqual(xs, ys)
+    else:
+      self.assertEqual(xs, ys)
+
+  @parameterized.parameters(
+      (None, np.int32([]), dtypes.bool),
+      (None, np.int32([]), dtypes.int32),
+      (None, np.int32([]), dtypes.float32),
+      (None, np.int32([]), dtypes.string),
+      (None, np.int32([2]), dtypes.int32),
+      (None, np.int32([2, 2]), dtypes.int32),
+      ((None, None, None), np.int32([]), dtypes.int32),
+      ((None, (None, None)), np.int32([]), dtypes.int32),
+  )
+  def testWindowDatasetFlatMap(self, structure, shape, dtype):
+    """Tests windowing by chaining it with flat map.
+
+    Args:
+      structure: the input structure
+      shape: the input shape
+      dtype: the input data type
+    """
+
+    def fn(*args):
+      if len(args) == 1 and not isinstance(args[0], tuple):
+        return args[0]
+      return dataset_ops.Dataset.zip(
+          tuple([fn(*arg) if isinstance(arg, tuple) else arg for arg in args]))
+
+    dataset = self._structuredDataset(structure, shape, dtype).apply(
+        grouping.window_dataset(5)).flat_map(fn)
+    get_next = dataset.make_one_shot_iterator().get_next()
+    with self.test_session() as sess:
+      expected = sess.run(self._structuredElement(structure, shape, dtype))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  @parameterized.parameters(
+      (None, np.int32([]), dtypes.bool),
+      (None, np.int32([]), dtypes.int32),
+      (None, np.int32([]), dtypes.float32),
+      (None, np.int32([]), dtypes.string),
+      (None, np.int32([2]), dtypes.int32),
+      (None, np.int32([2, 2]), dtypes.int32),
+      ((None, None, None), np.int32([]), dtypes.int32),
+      ((None, (None, None)), np.int32([]), dtypes.int32),
+  )
+  def testWindowDatasetBatchDense(self, structure, shape, dtype):
+    """Tests batching of dense tensor windows.
+
+    Args:
+      structure: the input structure
+      shape: the input shape
+      dtype: the input data type
+    """
+
+    def fn(*args):
+      if len(args) == 1 and not isinstance(args[0], tuple):
+        return batching.batch_window(args[0])
+
+      return tuple([
+          fn(*arg) if isinstance(arg, tuple) else batching.batch_window(arg)
+          for arg in args
+      ])
+
+    dataset = self._structuredDataset(structure, shape, dtype).repeat(5).apply(
+        grouping.window_dataset(5)).apply(grouping._map_x_dataset(fn))
+    get_next = dataset.make_one_shot_iterator().get_next()
+    with self.test_session() as sess:
+      expected = sess.run(
+          self._structuredElement(structure, np.concatenate(
+              ([5], shape), axis=0), dtype))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  @parameterized.parameters(
+      (np.int32([]),),
+      (np.int32([1]),),
+      (np.int32([1, 2, 3]),),
+  )
+  def testWindowDatasetBatchDenseDynamicShape(self, shape):
+    """Tests batching of dynamically shaped dense tensor windows.
+
+    Args:
+      shape: the input shape
+    """
+
+    shape_t = array_ops.placeholder(dtypes.int32)
+    dataset = dataset_ops.Dataset.from_tensors(
+        array_ops.zeros(shape_t)).repeat(5).apply(
+            grouping.window_dataset(5)).apply(
+                grouping._map_x_dataset(batching.batch_window))
+    iterator = dataset.make_initializable_iterator()
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+    with self.test_session() as sess:
+      sess.run(init_op, {shape_t: shape})
+      expected = sess.run(
+          self._structuredElement(None, np.concatenate(([5], shape), axis=0),
+                                  dtypes.int32))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  def _make_dense_to_sparse_fn(self, is_scalar):
+
+    def dense_to_sparse_scalar(tensor):
+      indices = [[]]
+      values = array_ops.expand_dims(tensor, 0)
+      shape = []
+      return sparse_tensor.SparseTensorValue(indices, values, shape)
+
+    def dense_to_sparse_non_scalar(tensor):
+      indices = array_ops.where(array_ops.ones_like(tensor, dtype=dtypes.bool))
+      values = array_ops.gather_nd(tensor, indices)
+      shape = array_ops.shape(tensor, out_type=dtypes.int64)
+      return sparse_tensor.SparseTensorValue(indices, values, shape)
+
+    if is_scalar:
+      return dense_to_sparse_scalar
+    return dense_to_sparse_non_scalar
+
+  def _structuredSparseDataset(self, structure, shape, dtype):
+    dense_to_sparse = self._make_dense_to_sparse_fn(len(shape) == 0)  # pylint: disable=g-explicit-length-test
+    if structure is None:
+      return dataset_ops.Dataset.from_tensors(
+          dense_to_sparse(array_ops.zeros(shape, dtype=dtype)))
+    else:
+      return dataset_ops.Dataset.zip(
+          tuple([
+              self._structuredSparseDataset(substructure, shape, dtype)
+              for substructure in structure
+          ]))
+
+  def _structuredSparseElement(self, structure, shape, dtype):
+    dense_to_sparse = self._make_dense_to_sparse_fn(len(shape) == 0)  # pylint: disable=g-explicit-length-test
+    if structure is None:
+      return dense_to_sparse(array_ops.zeros(shape, dtype=dtype))
+    else:
+      return tuple([
+          self._structuredSparseElement(substructure, shape, dtype)
+          for substructure in structure
+      ])
+
+  @parameterized.parameters(
+      (None, np.int32([]), dtypes.bool),
+      (None, np.int32([]), dtypes.int32),
+      (None, np.int32([]), dtypes.float32),
+      (None, np.int32([]), dtypes.string),
+      (None, np.int32([2]), dtypes.int32),
+      (None, np.int32([2, 2]), dtypes.int32),
+      ((None, None, None), np.int32([]), dtypes.int32),
+      ((None, (None, None)), np.int32([]), dtypes.int32),
+  )
+  def testWindowDatasetBatchSparse(self, structure, shape, dtype):
+    """Tests batching of sparse tensor windows.
+
+    Args:
+      structure: the input structure
+      shape: the input shape
+      dtype: the input data type
+    """
+
+    def fn(*args):
+      if len(args) == 1 and not isinstance(args[0], tuple):
+        return batching.batch_window(args[0])
+
+      return tuple([
+          fn(*arg) if isinstance(arg, tuple) else batching.batch_window(arg)
+          for arg in args
+      ])
+
+    dataset = self._structuredSparseDataset(
+        structure, shape, dtype).repeat(5).apply(
+            grouping.window_dataset(5)).apply(grouping._map_x_dataset(fn))
+    get_next = dataset.make_one_shot_iterator().get_next()
+    with self.test_session() as sess:
+      expected = sess.run(
+          self._structuredSparseElement(structure,
+                                        np.concatenate(([5], shape), axis=0),
+                                        dtype))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  @parameterized.parameters(
+      (np.int32([]),),
+      (np.int32([1]),),
+      (np.int32([1, 2, 3]),),
+  )
+  def testWindowDatasetBatchSparseDynamicShape(self, shape):
+    """Tests batching of dynamically shaped sparse tensor windows.
+
+    Args:
+      shape: the input shape
+    """
+
+    shape_t = array_ops.placeholder(dtypes.int32)
+    dataset = dataset_ops.Dataset.from_tensors(array_ops.zeros(shape_t)).map(
+        self._make_dense_to_sparse_fn(len(shape) == 0)).repeat(5).apply(  # pylint: disable=g-explicit-length-test
+            grouping.window_dataset(5)).apply(
+                grouping._map_x_dataset(batching.batch_window))
+    iterator = dataset.make_initializable_iterator()
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+    with self.test_session() as sess:
+      sess.run(init_op, {shape_t: shape})
+      expected = sess.run(
+          self._structuredSparseElement(None,
+                                        np.concatenate(([5], shape), axis=0),
+                                        dtypes.int32))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  def _structuredRaggedDataset(self, structure, shapes, dtype):
+
+    if structure is None:
+      return dataset_ops.Dataset.from_tensor_slices(shapes).map(
+          lambda shape: array_ops.zeros(shape, dtype=dtype))
+    else:
+      return dataset_ops.Dataset.zip(
+          tuple([
+              self._structuredRaggedDataset(substructure, shapes, dtype)
+              for substructure in structure
+          ]))
+
+  @parameterized.parameters(
+      (None, np.int32([[1], [2], [3]]), dtypes.bool, [-1]),
+      (None, np.int32([[1], [2], [3]]), dtypes.int32, [-1]),
+      (None, np.int32([[1], [2], [3]]), dtypes.float32, [-1]),
+      (None, np.int32([[1], [2], [3]]), dtypes.string, [-1]),
+      (None, np.int32([[1, 3], [2, 2], [3, 1]]), dtypes.int32, [-1, -1]),
+      (None, np.int32([[3, 1, 3], [1, 3, 1]]), dtypes.int32, [-1, -1, -1]),
+      ((None, None, None), np.int32([[1], [2], [3]]), dtypes.int32, [-1]),
+      ((None, (None, None)), np.int32([[1], [2], [3]]), dtypes.int32, [-1]),
+      (None, np.int32([[1], [2], [3]]), dtypes.int32, [-1]),
+      (None, np.int32([[1], [2], [3]]), dtypes.int32, np.int32([10])),
+  )
+  def testWindowDatasetPaddedBatchDense(self, structure, shapes, dtype,
+                                        padded_shape):
+    """Tests padded batching of dense tensor windows.
+
+    Args:
+      structure: the input structure
+      shapes: the input shapes
+      dtype: the input data type
+      padded_shape: the shape to pad the output to
+    """
+
+    def fn(*args):
+      if len(args) == 1 and not isinstance(args[0], tuple):
+        return batching.padded_batch_window(args[0], padded_shape)
+
+      return tuple([
+          fn(*arg) if isinstance(arg, tuple) else batching.padded_batch_window(
+              arg, padded_shape) for arg in args
+      ])
+
+    dataset = self._structuredRaggedDataset(structure, shapes, dtype).apply(
+        grouping.window_dataset(len(shapes))).apply(
+            grouping._map_x_dataset(fn))
+    get_next = dataset.make_one_shot_iterator().get_next()
+    with self.test_session() as sess:
+      expected_shape = np.maximum(np.amax(shapes, axis=0), padded_shape)
+      expected = sess.run(
+          self._structuredElement(
+              structure,
+              np.concatenate((np.int32([len(shapes)]), expected_shape)), dtype))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  @parameterized.parameters(
+      (np.int32([[1], [2], [3]]), [-1]),
+      (np.int32([[1, 3], [2, 2], [3, 1]]), [-1, -1]),
+      (np.int32([[3, 1, 3], [1, 3, 1]]), [-1, -1, -1]),
+  )
+  def testWindowDatasetPaddedBatchDenseDynamicShape(self, shapes, padded_shape):
+    """Tests padded batching of dynamically shaped dense tensor windows.
+
+    Args:
+      shapes: the input shapes
+      padded_shape: the shape to pad the output to
+    """
+
+    shapes_t = array_ops.placeholder(dtypes.int32)
+    dataset = dataset_ops.Dataset.from_tensor_slices(shapes_t).map(
+        lambda shape: array_ops.zeros(shape, dtype=dtypes.int32)).apply(
+            grouping.window_dataset(len(shapes))).apply(
+                grouping._map_x_dataset(
+                    lambda x: batching.padded_batch_window(x, padded_shape)))
+    iterator = dataset.make_initializable_iterator()
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+    with self.test_session() as sess:
+      sess.run(init_op, {shapes_t: shapes})
+      expected_shape = np.maximum(np.amax(shapes, axis=0), padded_shape)
+      expected = sess.run(
+          self._structuredElement(
+              None, np.concatenate((np.int32([len(shapes)]), expected_shape)),
+              dtypes.int32))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  @parameterized.parameters(
+      (np.int32([[1]]), np.int32([0])),
+      (np.int32([[10], [20]]), np.int32([15])),
+  )
+  def testWindowDatasetPaddedBatchDenseInvalid(self, shapes, padded_shape):
+    """Tests invalid padded batching of dense tensor windows.
+
+    Args:
+      shapes: the input shapes
+      padded_shape: the shape to pad the output to
+    """
+
+    dataset = dataset_ops.Dataset.from_tensor_slices(shapes).map(
+        lambda shape: array_ops.zeros(shape, dtype=dtypes.int32)).apply(
+            grouping.window_dataset(len(shapes))).apply(
+                grouping._map_x_dataset(
+                    lambda x: batching.padded_batch_window(x, padded_shape)))
+    get_next = dataset.make_one_shot_iterator().get_next()
+    with self.test_session() as sess:
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(get_next)
+
+  def _structuredRaggedSparseDataset(self, structure, shapes, dtype):
+
+    def map_fn(shape):
+      dense_to_sparse = self._make_dense_to_sparse_fn(False)
+      return dense_to_sparse(array_ops.zeros(shape, dtype=dtype))
+
+    if structure is None:
+      return dataset_ops.Dataset.from_tensor_slices(shapes).map(map_fn)
+    else:
+      return dataset_ops.Dataset.zip(
+          tuple([
+              self._structuredRaggedSparseDataset(substructure, shapes, dtype)
+              for substructure in structure
+          ]))
+
+  def _structuredRaggedSparseElement(self, structure, shapes, dtype,
+                                     padded_shape):
+    if structure is None:
+      dense_shape = np.maximum(np.amax(shapes, axis=0), padded_shape)
+      values = []
+      for shape in shapes:
+        dense_to_sparse = self._make_dense_to_sparse_fn(len(shape) == 0)  # pylint: disable=g-explicit-length-test
+        sparse = dense_to_sparse(array_ops.zeros(shape, dtype=dtype))
+        padded_sparse = sparse_tensor.SparseTensor(sparse.indices,
+                                                   sparse.values, dense_shape)
+        reshaped_sparse = sparse_ops.sparse_reshape(
+            padded_sparse,
+            array_ops.concat([np.array([1], dtype=np.int64), dense_shape], 0))
+        values.append(reshaped_sparse)
+      return sparse_ops.sparse_concat(0, values)
+    else:
+      return tuple([
+          self._structuredRaggedSparseElement(substructure, shapes, dtype,
+                                              padded_shape)
+          for substructure in structure
+      ])
+
+  @parameterized.parameters(
+      (None, np.int64([[1], [2], [3]]), dtypes.bool, [-1]),
+      (None, np.int64([[1], [2], [3]]), dtypes.int32, [-1]),
+      (None, np.int64([[1], [2], [3]]), dtypes.float32, [-1]),
+      (None, np.int64([[1], [2], [3]]), dtypes.string, [-1]),
+      (None, np.int64([[1, 3], [2, 2], [3, 1]]), dtypes.int32, [-1, -1]),
+      (None, np.int64([[1, 3, 1], [3, 1, 3]]), dtypes.int32, [-1, -1, -1]),
+      ((None, None, None), np.int64([[1], [2], [3]]), dtypes.int32, [-1]),
+      ((None, (None, None)), np.int64([[1], [2], [3]]), dtypes.int32, [-1]),
+      (None, np.int64([[1], [2], [3]]), dtypes.int32, [-1]),
+      (None, np.int64([[1], [2], [3]]), dtypes.int32, np.int64([10])),
+  )
+  def testWindowDatasetPaddedBatchSparse(self, structure, shapes, dtype,
+                                         padded_shape):
+    """Tests padded batching of sparse tensor windows.
+
+    Args:
+      structure: the input structure
+      shapes: the input shapes
+      dtype: the input data type
+      padded_shape: the shape to pad the output to
+    """
+
+    def fn(*args):
+      if len(args) == 1 and not isinstance(args[0], tuple):
+        return batching.padded_batch_window(args[0], padded_shape)
+
+      return tuple([
+          fn(*arg) if isinstance(arg, tuple) else batching.padded_batch_window(
+              arg, padded_shape) for arg in args
+      ])
+
+    dataset = self._structuredRaggedSparseDataset(
+        structure, shapes, dtype).apply(grouping.window_dataset(
+            len(shapes))).apply(grouping._map_x_dataset(fn))
+    get_next = dataset.make_one_shot_iterator().get_next()
+    with self.test_session() as sess:
+      expected = sess.run(
+          self._structuredRaggedSparseElement(structure, shapes, dtype,
+                                              padded_shape))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  @parameterized.parameters(
+      (np.int64([[1], [2], [3]]), [-1]),
+      (np.int64([[1, 3], [2, 2], [3, 1]]), [-1, -1]),
+      (np.int64([[3, 1, 3], [1, 3, 1]]), [-1, -1, -1]),
+  )
+  def testWindowDatasetPaddedBatchSparseDynamicShape(self, shapes,
+                                                     padded_shape):
+    """Tests padded batching of dynamically shaped sparse tensor windows.
+
+    Args:
+      shapes: the input shapes
+      padded_shape: the shape to pad the output to
+    """
+
+    shapes_t = array_ops.placeholder(dtypes.int32)
+    dataset = dataset_ops.Dataset.from_tensor_slices(shapes_t).map(
+        lambda shape: array_ops.zeros(shape, dtype=dtypes.int32)).map(
+            self._make_dense_to_sparse_fn(False)
+        ).apply(grouping.window_dataset(len(shapes))).apply(
+            grouping._map_x_dataset(
+                lambda x: batching.padded_batch_window(x, padded_shape)))
+    iterator = dataset.make_initializable_iterator()
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+    with self.test_session() as sess:
+      sess.run(init_op, {shapes_t: shapes})
+      expected = sess.run(
+          self._structuredRaggedSparseElement(None, shapes, dtypes.int32,
+                                              padded_shape))
+      actual = sess.run(get_next)
+      self._assertEqual(expected, actual)
+
+  @parameterized.parameters(
+      (np.int64([[1]]), [0]),
+      (np.int64([[10], [20]]), [15]),
+  )
+  def testWindowDatasetPaddedBatchSparseInvalid(self, shapes, padded_shape):
+    """Tests invalid padded batching of sparse tensor windows.
+
+    Args:
+      shapes: the input shapes
+      padded_shape: the shape to pad the output to
+    """
+
+    dataset = dataset_ops.Dataset.from_tensor_slices(shapes).map(
+        lambda shape: array_ops.zeros(shape, dtype=dtypes.int32)).map(
+            self._make_dense_to_sparse_fn(False)
+        ).apply(grouping.window_dataset(len(shapes))).apply(
+            grouping._map_x_dataset(
+                lambda x: batching.padded_batch_window(x, padded_shape)))
+    get_next = dataset.make_one_shot_iterator().get_next()
+    with self.test_session() as sess:
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(get_next)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/data/python/ops/BUILD b/tensorflow/contrib/data/python/ops/BUILD
index 0240814562..1ad021ea03 100644
--- a/tensorflow/contrib/data/python/ops/BUILD
+++ b/tensorflow/contrib/data/python/ops/BUILD
@@ -28,10 +28,12 @@ py_library(
     srcs = ["get_single_element.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":grouping",
         "//tensorflow/python:dataset_ops_gen",
         "//tensorflow/python/data/ops:dataset_ops",
         "//tensorflow/python/data/util:nest",
         "//tensorflow/python/data/util:sparse",
+        "//third_party/py/numpy",
     ],
 )
 
@@ -115,6 +117,8 @@ py_library(
     srcs = ["batching.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":get_single_element",
+        ":grouping",
         "//tensorflow/contrib/framework:framework_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:dataset_ops_gen",
@@ -127,6 +131,7 @@ py_library(
         "//tensorflow/python/data/util:convert",
         "//tensorflow/python/data/util:nest",
         "//tensorflow/python/data/util:sparse",
+        "//third_party/py/numpy",
     ],
 )
 
diff --git a/tensorflow/contrib/data/python/ops/batching.py b/tensorflow/contrib/data/python/ops/batching.py
index 7350d595f5..42fc20ec01 100644
--- a/tensorflow/contrib/data/python/ops/batching.py
+++ b/tensorflow/contrib/data/python/ops/batching.py
@@ -17,22 +17,135 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
+from tensorflow.contrib.data.python.ops import get_single_element
+from tensorflow.contrib.data.python.ops import grouping
 from tensorflow.contrib.framework import with_shape
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import convert
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import gen_array_ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import sparse_ops
 from tensorflow.python.util import deprecation
 
 
+def batch_window(dataset):
+  """Batches a window of tensors.
+
+  Args:
+    dataset: the input dataset.
+
+  Returns:
+    A `Tensor` representing the batch of the entire input dataset.
+  """
+  if isinstance(dataset.output_classes, tuple):
+    raise TypeError("Input dataset expected to have a single component")
+  if dataset.output_classes is ops.Tensor:
+    return _batch_dense_window(dataset)
+  elif dataset.output_classes is sparse_tensor.SparseTensor:
+    return _batch_sparse_window(dataset)
+  else:
+    raise TypeError("Unsupported dataset type: %s" % dataset.output_classes)
+
+
+def _batch_dense_window(dataset):
+  """Batches a window of dense tensors."""
+
+  def key_fn(_):
+    return np.int64(0)
+
+  def shape_init_fn(_):
+    return array_ops.shape(first_element)
+
+  def shape_reduce_fn(state, value):
+    check_ops.assert_equal(state, array_ops.shape(value))
+    return state
+
+  def finalize_fn(state):
+    return state
+
+  if dataset.output_shapes.is_fully_defined():
+    shape = dataset.output_shapes
+  else:
+    first_element = get_single_element.get_single_element(dataset.take(1))
+    shape_reducer = grouping.Reducer(shape_init_fn, shape_reduce_fn,
+                                     finalize_fn)
+    shape = get_single_element.get_single_element(
+        dataset.apply(grouping.group_by_reducer(key_fn, shape_reducer)))
+
+  def batch_init_fn(_):
+    batch_shape = array_ops.concat([[0], shape], 0)
+    return gen_array_ops.empty(batch_shape, dtype=dataset.output_types)
+
+  def batch_reduce_fn(state, value):
+    return array_ops.concat([state, [value]], 0)
+
+  batch_reducer = grouping.Reducer(batch_init_fn, batch_reduce_fn, finalize_fn)
+  return get_single_element.get_single_element(
+      dataset.apply(grouping.group_by_reducer(key_fn, batch_reducer)))
+
+
+def _batch_sparse_window(dataset):
+  """Batches a window of sparse tensors."""
+
+  def key_fn(_):
+    return np.int64(0)
+
+  def shape_init_fn(_):
+    return first_element.dense_shape
+
+  def shape_reduce_fn(state, value):
+    check_ops.assert_equal(state, value.dense_shape)
+    return state
+
+  def finalize_fn(state):
+    return state
+
+  if dataset.output_shapes.is_fully_defined():
+    shape = dataset.output_shapes
+  else:
+    first_element = get_single_element.get_single_element(dataset.take(1))
+    shape_reducer = grouping.Reducer(shape_init_fn, shape_reduce_fn,
+                                     finalize_fn)
+    shape = get_single_element.get_single_element(
+        dataset.apply(grouping.group_by_reducer(key_fn, shape_reducer)))
+
+  def batch_init_fn(_):
+    indices_shape = array_ops.concat([[0], [array_ops.size(shape) + 1]], 0)
+    return sparse_tensor.SparseTensor(
+        indices=gen_array_ops.empty(indices_shape, dtype=dtypes.int64),
+        values=constant_op.constant([], shape=[0], dtype=dataset.output_types),
+        dense_shape=array_ops.concat(
+            [np.array([0], dtype=np.int64),
+             math_ops.cast(shape, dtypes.int64)], 0))
+
+  def batch_reduce_fn(state, value):
+    return sparse_ops.sparse_concat(0, [state, value])
+
+  def reshape_fn(value):
+    return sparse_ops.sparse_reshape(
+        value,
+        array_ops.concat([np.array([1], dtype=np.int64), value.dense_shape], 0))
+
+  batch_reducer = grouping.Reducer(batch_init_fn, batch_reduce_fn, finalize_fn)
+  return get_single_element.get_single_element(
+      dataset.map(reshape_fn).apply(
+          grouping.group_by_reducer(key_fn, batch_reducer)))
+
+
 def dense_to_sparse_batch(batch_size, row_shape):
   """A transformation that batches ragged elements into `tf.SparseTensor`s.
 
@@ -82,6 +195,157 @@ def dense_to_sparse_batch(batch_size, row_shape):
   return _apply_fn
 
 
+def padded_batch_window(dataset, padded_shape, padding_value=None):
+  """Batches a window of tensors with padding.
+
+  Args:
+    dataset: the input dataset.
+    padded_shape: (Optional.) `tf.TensorShape` or `tf.int64` vector tensor-like
+      object representing the shape to which the input elements should be padded
+      prior to batching. Any unknown dimensions (e.g. `tf.Dimension(None)` in a
+      `tf.TensorShape` or `-1` in a tensor-like object) will be padded to the
+      maximum size of that dimension in each batch.
+    padding_value: (Optional.) A scalar-shaped `tf.Tensor`, representing the
+      padding value to use. Defaults are `0` for numeric types and the empty
+      string for string types. If `dataset` contains `tf.SparseTensor`, this
+      value is ignored.
+
+  Returns:
+    A `Tensor` representing the batch of the entire input dataset.
+
+  Raises:
+    ValueError: if invalid arguments are provided.
+  """
+  if not issubclass(dataset.output_classes,
+                    (ops.Tensor, sparse_tensor.SparseTensor)):
+    raise TypeError("Input dataset expected to have a single tensor component")
+  if issubclass(dataset.output_classes, (ops.Tensor)):
+    return _padded_batch_dense_window(dataset, padded_shape, padding_value)
+  elif issubclass(dataset.output_classes, (sparse_tensor.SparseTensor)):
+    if padding_value is not None:
+      raise ValueError("Padding value not allowed for sparse tensors")
+    return _padded_batch_sparse_window(dataset, padded_shape)
+  else:
+    raise TypeError("Unsupported dataset type: %s" % dataset.output_classes)
+
+
+def _padded_batch_dense_window(dataset, padded_shape, padding_value=None):
+  """Batches a window of dense tensors with padding."""
+
+  padded_shape = math_ops.cast(
+      convert.partial_shape_to_tensor(padded_shape), dtypes.int32)
+
+  def key_fn(_):
+    return np.int64(0)
+
+  def max_init_fn(_):
+    return padded_shape
+
+  def max_reduce_fn(state, value):
+    """Computes the maximum shape to pad to."""
+    condition = math_ops.reduce_all(
+        math_ops.logical_or(
+            math_ops.less_equal(array_ops.shape(value), padded_shape),
+            math_ops.equal(padded_shape, -1)))
+    assert_op = control_flow_ops.Assert(condition, [
+        "Actual shape greater than padded shape: ",
+        array_ops.shape(value), padded_shape
+    ])
+    with ops.control_dependencies([assert_op]):
+      return math_ops.maximum(state, array_ops.shape(value))
+
+  def finalize_fn(state):
+    return state
+
+  # Compute the padded shape.
+  max_reducer = grouping.Reducer(max_init_fn, max_reduce_fn, finalize_fn)
+  padded_shape = get_single_element.get_single_element(
+      dataset.apply(grouping.group_by_reducer(key_fn, max_reducer)))
+
+  if padding_value is None:
+    if dataset.output_types == dtypes.string:
+      padding_value = ""
+    elif dataset.output_types == dtypes.bool:
+      padding_value = False
+    elif dataset.output_types == dtypes.variant:
+      raise TypeError("Unable to create padding for field of type 'variant'")
+    else:
+      padding_value = 0
+
+  def batch_init_fn(_):
+    return array_ops.fill(
+        array_ops.concat([np.array([0], dtype=np.int32), padded_shape], 0),
+        constant_op.constant(padding_value, dtype=dataset.output_types))
+
+  def batch_reduce_fn(state, value):
+    return array_ops.concat([state, [value]], 0)
+
+  def pad_fn(value):
+    shape = array_ops.shape(value)
+    left = array_ops.zeros_like(shape)
+    right = padded_shape - shape
+    return array_ops.pad(
+        value, array_ops.stack([left, right], 1), constant_values=padding_value)
+
+  batch_reducer = grouping.Reducer(batch_init_fn, batch_reduce_fn, finalize_fn)
+  return get_single_element.get_single_element(
+      dataset.map(pad_fn).apply(
+          grouping.group_by_reducer(key_fn, batch_reducer)))
+
+
+def _padded_batch_sparse_window(dataset, padded_shape):
+  """Batches a window of sparse tensors with padding."""
+
+  def key_fn(_):
+    return np.int64(0)
+
+  def max_init_fn(_):
+    return convert.partial_shape_to_tensor(padded_shape)
+
+  def max_reduce_fn(state, value):
+    """Computes the maximum shape to pad to."""
+    condition = math_ops.reduce_all(
+        math_ops.logical_or(
+            math_ops.less_equal(value.dense_shape, padded_shape),
+            math_ops.equal(padded_shape, -1)))
+    assert_op = control_flow_ops.Assert(condition, [
+        "Actual shape greater than padded shape: ", value.dense_shape,
+        padded_shape
+    ])
+    with ops.control_dependencies([assert_op]):
+      return math_ops.maximum(state, value.dense_shape)
+
+  def finalize_fn(state):
+    return state
+
+  # Compute the padded shape.
+  max_reducer = grouping.Reducer(max_init_fn, max_reduce_fn, finalize_fn)
+  padded_shape = get_single_element.get_single_element(
+      dataset.apply(grouping.group_by_reducer(key_fn, max_reducer)))
+
+  def batch_init_fn(_):
+    indices_shape = array_ops.concat([[0], [array_ops.size(padded_shape) + 1]],
+                                     0)
+    return sparse_tensor.SparseTensor(
+        indices=gen_array_ops.empty(indices_shape, dtype=dtypes.int64),
+        values=constant_op.constant([], shape=[0], dtype=dataset.output_types),
+        dense_shape=array_ops.concat(
+            [np.array([0], dtype=np.int64), padded_shape], 0))
+
+  def batch_reduce_fn(state, value):
+    padded_value = sparse_tensor.SparseTensor(
+        indices=value.indices, values=value.values, dense_shape=padded_shape)
+    reshaped_value = sparse_ops.sparse_reshape(
+        padded_value,
+        array_ops.concat(
+            [np.array([1], dtype=np.int64), padded_value.dense_shape], 0))
+    return sparse_ops.sparse_concat(0, [state, reshaped_value])
+
+  reducer = grouping.Reducer(batch_init_fn, batch_reduce_fn, finalize_fn)
+  return get_single_element.get_single_element(
+      dataset.apply(grouping.group_by_reducer(key_fn, reducer)))
+
+
 class _UnbatchDataset(dataset_ops.Dataset):
   """A dataset that splits the elements of its input into multiple elements."""
 
@@ -251,10 +515,7 @@ def batch_and_drop_remainder(batch_size):
 
   def _apply_fn(dataset):
     """Function from `Dataset` to `Dataset` that applies the transformation."""
-    # TODO(jsimsa): Switch to using `batch(..., drop_remainder=True)` any time
-    # after 6/30/2018.
-    batched = dataset.batch(batch_size)
-    return _filter_irregular_batches(batch_size)(batched)
+    return dataset.batch(batch_size, drop_remainder=True)
 
   return _apply_fn
 
@@ -289,11 +550,9 @@ def padded_batch_and_drop_remainder(batch_size,
 
   def _apply_fn(dataset):
     """Function from `Dataset` to `Dataset` that applies the transformation."""
-    # TODO(jsimsa): Switch to using `padded_batch(..., drop_remainder=True)`
-    # any time after 6/30/2018.
-    batched = dataset.padded_batch(
-        batch_size, padded_shapes=padded_shapes, padding_values=padding_values)
-    return _filter_irregular_batches(batch_size)(batched)
+    return dataset.padded_batch(
+        batch_size, padded_shapes=padded_shapes, padding_values=padding_values,
+        drop_remainder=True)
 
   return _apply_fn
 
diff --git a/tensorflow/contrib/data/python/ops/get_single_element.py b/tensorflow/contrib/data/python/ops/get_single_element.py
index 0f4cd8e20c..ef9284456e 100644
--- a/tensorflow/contrib/data/python/ops/get_single_element.py
+++ b/tensorflow/contrib/data/python/ops/get_single_element.py
@@ -17,6 +17,9 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
+from tensorflow.contrib.data.python.ops import grouping
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
@@ -68,3 +71,30 @@ def get_single_element(dataset):
   return sparse.deserialize_sparse_tensors(
       nested_ret, dataset.output_types, dataset.output_shapes,
       dataset.output_classes)
+
+
+def reduce_dataset(dataset, reducer):
+  """Returns the result of reducing the `dataset` using `reducer`.
+
+  Args:
+    dataset: A @{tf.data.Dataset} object.
+    reducer: A @{tf.contrib.data.Reducer} object representing the reduce logic.
+
+  Returns:
+    A nested structure of @{tf.Tensor} objects, corresponding to the result
+    of reducing `dataset` using `reducer`.
+
+  Raises:
+    TypeError: if `dataset` is not a `tf.data.Dataset` object.
+  """
+  if not isinstance(dataset, dataset_ops.Dataset):
+    raise TypeError("`dataset` must be a `tf.data.Dataset` object.")
+
+  # The sentinel dataset is used in case the reduced dataset is empty.
+  sentinel_dataset = dataset_ops.Dataset.from_tensors(
+      reducer.finalize_func(reducer.init_func(np.int64(0))))
+  reduced_dataset = dataset.apply(
+      grouping.group_by_reducer(lambda x: np.int64(0), reducer))
+
+  return get_single_element(
+      reduced_dataset.concatenate(sentinel_dataset).take(1))
diff --git a/tensorflow/contrib/data/python/ops/grouping.py b/tensorflow/contrib/data/python/ops/grouping.py
index 5d9640a768..bd8d398c58 100644
--- a/tensorflow/contrib/data/python/ops/grouping.py
+++ b/tensorflow/contrib/data/python/ops/grouping.py
@@ -227,6 +227,50 @@ def bucket_by_sequence_length(element_length_func,
     return _apply_fn
 
 
+def _map_x_dataset(map_func):
+  """A transformation that maps `map_func` across its input.
+
+  This transformation is similar to `tf.data.Dataset.map`, but in addition to
+  supporting dense and sparse tensor inputs, it also supports dataset inputs.
+
+  Args:
+    map_func: A function mapping a nested structure of tensors and/or datasets
+      (having shapes and types defined by `self.output_shapes` and
+     `self.output_types`) to another nested structure of tensors and/or
+     datasets.
+
+  Returns:
+    Dataset: A `Dataset`.
+  """
+
+  def _apply_fn(dataset):
+    """Function from `Dataset` to `Dataset` that applies the transformation."""
+    return _MapXDataset(dataset, map_func)
+
+  return _apply_fn
+
+
+def window_dataset(window_size):
+  """A transformation that creates window datasets from the input dataset.
+
+  The resulting datasets will contain `window_size` elements (or
+  `N % window_size` for the last dataset if `window_size` does not divide the
+  number of input elements `N` evenly).
+
+  Args:
+    window_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
+      consecutive elements of the input dataset to combine into a window.
+
+  Returns:
+    Dataset: A `Dataset`.
+  """
+
+  def _apply_fn(dataset):
+    return _WindowDataset(dataset, window_size)
+
+  return _apply_fn
+
+
 class _GroupByReducerDataset(dataset_ops.Dataset):
   """A `Dataset` that groups its input and performs a reduction."""
 
@@ -468,3 +512,85 @@ class Reducer(object):
   @property
   def finalize_func(self):
     return self._finalize_func
+
+
+class _MapXDataset(dataset_ops.Dataset):
+  """A `Dataset` that maps a function over elements in its input."""
+
+  def __init__(self, input_dataset, map_func):
+    """See `map_x_dataset()` for details."""
+    super(_MapXDataset, self).__init__()
+    self._input_dataset = input_dataset
+
+    wrapped_func = dataset_ops.StructuredFunctionWrapper(
+        map_func,
+        "tf.contrib.data.map_x_dataset()",
+        input_dataset,
+        experimental_nested_dataset_support=True)
+    self._output_classes = wrapped_func.output_classes
+    self._output_shapes = wrapped_func.output_shapes
+    self._output_types = wrapped_func.output_types
+    self._map_func = wrapped_func.function
+
+  def _as_variant_tensor(self):
+    input_t = self._input_dataset._as_variant_tensor()  # pylint: disable=protected-access
+    return gen_dataset_ops.map_dataset(
+        input_t,
+        self._map_func.captured_inputs,
+        f=self._map_func,
+        **dataset_ops.flat_structure(self))
+
+  @property
+  def output_classes(self):
+    return self._output_classes
+
+  @property
+  def output_shapes(self):
+    return self._output_shapes
+
+  @property
+  def output_types(self):
+    return self._output_types
+
+
+class _WindowDataset(dataset_ops.Dataset):
+  """A dataset that creates window datasets from the input elements."""
+
+  def __init__(self, input_dataset, window_size):
+    """See `window_dataset()` for more details."""
+    super(_WindowDataset, self).__init__()
+    self._input_dataset = input_dataset
+    self._window_size = ops.convert_to_tensor(
+        window_size, dtype=dtypes.int64, name="window_size")
+    self._output_classes = nest.pack_sequence_as(
+        input_dataset.output_classes,
+        [
+            dataset_ops._NestedDatasetComponent(  # pylint: disable=protected-access
+                output_classes=output_class,
+                output_shapes=output_shape,
+                output_types=output_type)
+            for output_class, output_shape, output_type in zip(
+                nest.flatten(input_dataset.output_classes),
+                nest.flatten(input_dataset.output_shapes),
+                nest.flatten(input_dataset.output_types))
+        ])
+    self._output_shapes = self._output_classes
+    self._output_types = self._output_classes
+
+  def _as_variant_tensor(self):
+    return gen_dataset_ops.window_dataset(
+        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+        self._window_size,
+        **dataset_ops.flat_structure(self))
+
+  @property
+  def output_classes(self):
+    return self._output_classes
+
+  @property
+  def output_shapes(self):
+    return self._output_shapes
+
+  @property
+  def output_types(self):
+    return self._output_types
diff --git a/tensorflow/contrib/data/python/ops/iterator_ops.py b/tensorflow/contrib/data/python/ops/iterator_ops.py
index 0d71be6601..d2c1d0d362 100644
--- a/tensorflow/contrib/data/python/ops/iterator_ops.py
+++ b/tensorflow/contrib/data/python/ops/iterator_ops.py
@@ -20,6 +20,7 @@ from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.training import session_run_hook
 
@@ -206,7 +207,7 @@ class CheckpointInputPipelineHook(session_run_hook.SessionRunHook):
 
     # Check if there is an existing checkpoint. If so, restore from it.
     # pylint: disable=protected-access
-    latest_checkpoint_path = saver_lib.latest_checkpoint(
+    latest_checkpoint_path = checkpoint_management.latest_checkpoint(
         self._checkpoint_saver_hook._checkpoint_dir,
         latest_filename=self._latest_filename)
     if latest_checkpoint_path:
diff --git a/tensorflow/contrib/data/python/ops/optimization.py b/tensorflow/contrib/data/python/ops/optimization.py
index cf89657226..018c5115e1 100644
--- a/tensorflow/contrib/data/python/ops/optimization.py
+++ b/tensorflow/contrib/data/python/ops/optimization.py
@@ -18,12 +18,34 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.data.python.ops import contrib_op_loader  # pylint: disable=unused-import
+from tensorflow.contrib.data.python.ops import gen_dataset_ops as contrib_gen_dataset_ops
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_dataset_ops
 
 
+# TODO(jsimsa): Support RE matching for both individual transformation (e.g. to
+# account for indexing) and transformation sequence.
+def assert_next(transformations):
+  """A transformation that asserts which transformations happen next.
+
+  Args:
+    transformations: A `tf.string` vector `tf.Tensor` identifying the
+      transformations that are expected to happen next.
+
+  Returns:
+    A `Dataset` transformation function, which can be passed to
+    @{tf.data.Dataset.apply}.
+  """
+
+  def _apply_fn(dataset):
+    """Function from `Dataset` to `Dataset` that applies the transformation."""
+    return _AssertNextDataset(dataset, transformations)
+
+  return _apply_fn
+
+
 def optimize(optimizations=None):
   """A transformation that applies optimizations.
 
@@ -44,6 +66,37 @@ def optimize(optimizations=None):
   return _apply_fn
 
 
+class _AssertNextDataset(dataset_ops.Dataset):
+  """A `Dataset` that asserts which transformations happen next."""
+
+  def __init__(self, input_dataset, transformations):
+    """See `assert_next()` for details."""
+    super(_AssertNextDataset, self).__init__()
+    self._input_dataset = input_dataset
+    if transformations is None:
+      raise ValueError("At least one transformation should be specified")
+    self._transformations = ops.convert_to_tensor(
+        transformations, dtype=dtypes.string, name="transformations")
+
+  def _as_variant_tensor(self):
+    return contrib_gen_dataset_ops.assert_next_dataset(
+        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+        self._transformations,
+        **dataset_ops.flat_structure(self))
+
+  @property
+  def output_classes(self):
+    return self._input_dataset.output_classes
+
+  @property
+  def output_shapes(self):
+    return self._input_dataset.output_shapes
+
+  @property
+  def output_types(self):
+    return self._input_dataset.output_types
+
+
 class _OptimizeDataset(dataset_ops.Dataset):
   """A `Dataset` that acts as an identity, and applies optimizations."""
 
diff --git a/tensorflow/contrib/data/python/ops/prefetching_ops.py b/tensorflow/contrib/data/python/ops/prefetching_ops.py
index 21fc17102e..0edd7c9fe9 100644
--- a/tensorflow/contrib/data/python/ops/prefetching_ops.py
+++ b/tensorflow/contrib/data/python/ops/prefetching_ops.py
@@ -26,10 +26,15 @@ from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
 from tensorflow.python.eager import context
+from tensorflow.python.framework import device as framework_device
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import functional_ops
 from tensorflow.python.ops import gen_dataset_ops as core_gen_dataset_ops
+from tensorflow.python.ops import resource_variable_ops
 
 
 def function_buffering_resource(string_arg,
@@ -345,3 +350,348 @@ def prefetch_to_device(device, buffer_size=None):
     return _PrefetchToDeviceDataset(dataset, device, buffer_size)
 
   return _apply_fn
+
+
+def copy_to_device(target_device, source_device="/cpu:0"):
+  """A transformation that copies dataset elements to the given `target_device`.
+
+  Args:
+    target_device: The name of a device to which elements will be copied.
+    source_device: The original device on which `input_dataset` will be placed.
+
+  Returns:
+    A `Dataset` transformation function, which can be passed to
+    @{tf.data.Dataset.apply}.
+  """
+
+  def _apply_fn(dataset):
+    return _CopyToDeviceDataset(
+        dataset, target_device=target_device, source_device=source_device)
+
+  return _apply_fn
+
+
+# TODO(rohanj): Use the _input_hostmem attr on the RemoteCall ops to indicate
+# all inputs to the Op are in host memory, thereby avoiding some unnecessary
+# Sends and Recvs.
+class _CopyToDeviceDataset(dataset_ops.Dataset):
+  """A `Dataset` that copies elements to another device."""
+
+  def __init__(self, input_dataset, target_device, source_device="/cpu:0"):
+    """Constructs a _CopyToDeviceDataset.
+
+    Args:
+      input_dataset: `Dataset` to be copied
+      target_device: The name of the device to which elements would be copied.
+      source_device: Device where input_dataset would be placed.
+    """
+    self._input_dataset = input_dataset
+    self._target_device = target_device
+    spec = framework_device.DeviceSpec().from_string(self._target_device)
+    self._is_gpu_target = (spec.device_type == "GPU")
+    self._source_device_string = source_device
+    self._source_device = ops.convert_to_tensor(source_device)
+
+    self._flat_output_shapes = nest.flatten(
+        sparse.as_dense_shapes(self._input_dataset.output_shapes,
+                               self._input_dataset.output_classes))
+    self._flat_output_types = nest.flatten(
+        sparse.as_dense_types(self._input_dataset.output_types,
+                              self._input_dataset.output_classes))
+
+    @function.Defun()
+    def _init_func():
+      """Creates an iterator for the input dataset.
+
+      Returns:
+        A `string` tensor that encapsulates the iterator created.
+      """
+      # pylint: disable=protected-access
+      ds_variant = self._input_dataset._as_variant_tensor()
+      resource = core_gen_dataset_ops.anonymous_iterator(
+          output_types=self._flat_output_types,
+          output_shapes=self._flat_output_shapes)
+      with ops.control_dependencies(
+          [core_gen_dataset_ops.make_iterator(ds_variant, resource)]):
+        return core_gen_dataset_ops.iterator_to_string_handle(resource)
+
+    @function.Defun()
+    def _remote_init_func():
+      return functional_ops.remote_call(
+          target=self._source_device,
+          args=_init_func.captured_inputs,
+          Tout=[dtypes.string],
+          f=_init_func)
+
+    self._init_func = _remote_init_func
+    self._init_captured_args = _remote_init_func.captured_inputs
+
+    @function.Defun(dtypes.string)
+    def _next_func(string_handle):
+      """Calls get_next for created iterator.
+
+      Args:
+        string_handle: An iterator string handle created by _init_func
+      Returns:
+        The elements generated from `input_dataset`
+      """
+      with ops.device(self._source_device_string):
+        iterator = iterator_ops.Iterator.from_string_handle(
+            string_handle, self.output_types, self.output_shapes,
+            self.output_classes)
+      ret = iterator.get_next()
+      return nest.flatten(sparse.serialize_sparse_tensors(ret))
+
+    @function.Defun(dtypes.string)
+    def _remote_next_func(string_handle):
+      return functional_ops.remote_call(
+          target=self._source_device,
+          args=[string_handle] + _next_func.captured_inputs,
+          Tout=self._flat_output_types,
+          f=_next_func)
+
+    self._next_func = _remote_next_func
+    self._next_captured_args = _remote_next_func.captured_inputs
+
+    @function.Defun(dtypes.string)
+    def _finalize_func(string_handle):
+      """Destroys the iterator resource created.
+
+      Args:
+        string_handle: An iterator string handle created by _init_func
+      Returns:
+        Tensor constant 0
+      """
+      iterator_resource = core_gen_dataset_ops.iterator_from_string_handle_v2(
+          string_handle,
+          output_types=self._flat_output_types,
+          output_shapes=self._flat_output_shapes)
+      with ops.control_dependencies([
+          resource_variable_ops.destroy_resource_op(
+              iterator_resource, ignore_lookup_error=True)]):
+        return array_ops.constant(0, dtypes.int64)
+
+    @function.Defun(dtypes.string)
+    def _remote_finalize_func(string_handle):
+      return functional_ops.remote_call(
+          target=self._source_device,
+          args=[string_handle] + _finalize_func.captured_inputs,
+          Tout=[dtypes.int64],
+          f=_finalize_func)
+
+    self._finalize_func = _remote_finalize_func
+    self._finalize_captured_args = _remote_finalize_func.captured_inputs
+
+    g = ops.get_default_graph()
+    _remote_init_func.add_to_graph(g)
+    _remote_next_func.add_to_graph(g)
+    _remote_finalize_func.add_to_graph(g)
+    # pylint: enable=protected-scope
+
+  # The one_shot_iterator implementation needs a 0 arg _make_dataset function
+  # that thereby captures all the inputs required to create the dataset. Since
+  # there are strings that are inputs to the GeneratorDataset which can't be
+  # placed on a GPU, this fails for the GPU case. Therefore, disabling it for
+  # GPU
+  def make_one_shot_iterator(self):
+    if self._is_gpu_target:
+      raise ValueError("Cannot create a one shot iterator when using "
+                       "`tf.contrib.data.copy_to_device()` on GPU. Please use "
+                       "`Dataset.make_initializable_iterator()` instead.")
+    else:
+      return super(_CopyToDeviceDataset, self).make_one_shot_iterator()
+
+  def _as_variant_tensor(self):
+    with ops.device(self._target_device):
+      return core_gen_dataset_ops.generator_dataset(
+          self._init_captured_args,
+          self._next_captured_args,
+          self._finalize_captured_args,
+          init_func=self._init_func,
+          next_func=self._next_func,
+          finalize_func=self._finalize_func,
+          output_types=self._flat_output_types,
+          output_shapes=self._flat_output_shapes)
+
+  @property
+  def output_types(self):
+    return self._input_dataset.output_types
+
+  @property
+  def output_shapes(self):
+    return self._input_dataset.output_shapes
+
+  @property
+  def output_classes(self):
+    return self._input_dataset.output_classes
+
+
+class _PerDeviceGenerator(dataset_ops.Dataset):
+  """A `dummy` generator dataset."""
+
+  def __init__(self, shard_num, multi_device_iterator_resource, incarnation_id,
+               source_device, target_device, output_shapes, output_types,
+               output_classes):
+    self._target_device = target_device
+    self._output_types = output_types
+    self._output_shapes = output_shapes
+    self._output_classes = output_classes
+    self._flat_output_shapes = nest.flatten(
+        sparse.as_dense_shapes(self._output_shapes, self._output_classes))
+    self._flat_output_types = nest.flatten(
+        sparse.as_dense_types(self._output_types, self._output_classes))
+
+    multi_device_iterator_string_handle = (
+        gen_dataset_ops.multi_device_iterator_to_string_handle(
+            multi_device_iterator_resource))
+
+    @function.Defun()
+    def _init_func():
+      return multi_device_iterator_string_handle
+
+    @function.Defun()
+    def _remote_init_func():
+      return functional_ops.remote_call(
+          target=source_device,
+          args=_init_func.captured_inputs,
+          Tout=[dtypes.string],
+          f=_init_func)
+
+    self._init_func = _remote_init_func
+    self._init_captured_args = _remote_init_func.captured_inputs
+
+    @function.Defun(dtypes.string)
+    def _next_func(string_handle):
+      multi_device_iterator = (
+          gen_dataset_ops.multi_device_iterator_from_string_handle(
+              string_handle=string_handle,
+              output_types=self._flat_output_types,
+              output_shapes=self._flat_output_shapes))
+      return gen_dataset_ops.multi_device_iterator_get_next_from_shard(
+          multi_device_iterator=multi_device_iterator,
+          shard_num=shard_num,
+          incarnation_id=incarnation_id,
+          output_types=self._flat_output_types,
+          output_shapes=self._flat_output_shapes)
+
+    @function.Defun(dtypes.string)
+    def _remote_next_func(string_handle):
+      return functional_ops.remote_call(
+          target=source_device,
+          args=[string_handle] + _next_func.captured_inputs,
+          Tout=self._flat_output_types,
+          f=_next_func)
+
+    self._next_func = _remote_next_func
+    self._next_captured_args = _remote_next_func.captured_inputs
+
+    @function.Defun(dtypes.string)
+    def _finalize_func(unused_string_handle):
+      return array_ops.constant(0, dtypes.int64)
+
+    @function.Defun(dtypes.string)
+    def _remote_finalize_func(string_handle):
+      return functional_ops.remote_call(
+          target=source_device,
+          args=[string_handle] + _finalize_func.captured_inputs,
+          Tout=[dtypes.int64],
+          f=_finalize_func)
+
+    self._finalize_func = _remote_finalize_func
+    self._finalize_captured_args = _remote_finalize_func.captured_inputs
+
+  def _as_variant_tensor(self):
+    with ops.device(self._target_device):
+      return core_gen_dataset_ops.generator_dataset(
+          self._init_captured_args,
+          self._next_captured_args,
+          self._finalize_captured_args,
+          init_func=self._init_func,
+          next_func=self._next_func,
+          finalize_func=self._finalize_func,
+          output_types=self._flat_output_types,
+          output_shapes=self._flat_output_shapes)
+
+  @property
+  def output_types(self):
+    return self._output_types
+
+  @property
+  def output_shapes(self):
+    return self._output_shapes
+
+  @property
+  def output_classes(self):
+    return self._output_classes
+
+
+class MultiDeviceIterator(object):
+  """An iterator over multiple devices."""
+
+  def __init__(self,
+               dataset,
+               devices,
+               prefetch_buffer_size=1,
+               source_device="/cpu:0"):
+    self._dataset = dataset
+    self._devices = devices
+    self._source_device = source_device
+    self._source_device_tensor = ops.convert_to_tensor(source_device)
+
+    self._flat_output_shapes = nest.flatten(
+        sparse.as_dense_shapes(self._dataset.output_shapes,
+                               self._dataset.output_classes))
+    self._flat_output_types = nest.flatten(
+        sparse.as_dense_types(self._dataset.output_types,
+                              self._dataset.output_classes))
+
+    # Create the MultiDeviceIterator.
+    with ops.device(self._source_device):
+      self._multi_device_iterator_resource = (
+          gen_dataset_ops.multi_device_iterator(
+              devices=self._devices,
+              shared_name="",
+              container="",
+              output_types=self._flat_output_types,
+              output_shapes=self._flat_output_shapes))
+
+      # The incarnation ID is used to ensure consistency between the per-device
+      # iterators and the multi-device iterator.
+      self._incarnation_id = gen_dataset_ops.multi_device_iterator_init(
+          self._dataset._as_variant_tensor(),  # pylint: disable=protected-access
+          self._multi_device_iterator_resource)
+
+    # TODO(rohanj): Explore the possibility of the MultiDeviceIterator to
+    # initialize the device side of the pipeline. This would allow the
+    # MultiDeviceIterator to choose, for example, to move some transformations
+    # into the device side from its input. It might be useful in rewriting.
+    # Create the per device iterators.
+    self._device_iterators = []
+    i = 0
+    for device in self._devices:
+      ds = _PerDeviceGenerator(
+          i, self._multi_device_iterator_resource, self._incarnation_id,
+          self._source_device_tensor, device, self._dataset.output_shapes,
+          self._dataset.output_types, self._dataset.output_classes)
+      ds = ds.prefetch(prefetch_buffer_size)
+      with ops.device(device):
+        self._device_iterators.append(ds.make_initializable_iterator())
+      i += 1
+
+    device_iterator_initializers = [
+        iterator.initializer for iterator in self._device_iterators
+    ]
+    self._initializer = control_flow_ops.group(*device_iterator_initializers)
+
+  def get_next(self):
+    result = []
+    i = 0
+    for device in self._devices:
+      with ops.device(device):
+        result.append(self._device_iterators[i].get_next())
+      i += 1
+    return result
+
+  @property
+  def initializer(self):
+    return self._initializer
diff --git a/tensorflow/contrib/data/python/ops/readers.py b/tensorflow/contrib/data/python/ops/readers.py
index 83095c7ba1..14d69f8d5b 100644
--- a/tensorflow/contrib/data/python/ops/readers.py
+++ b/tensorflow/contrib/data/python/ops/readers.py
@@ -286,11 +286,14 @@ def make_tf_record_dataset(
   dataset = _maybe_shuffle_and_repeat(
       dataset, num_epochs, shuffle, shuffle_buffer_size, shuffle_seed)
 
+  # NOTE(mrry): We set `drop_final_batch=True` when `num_epochs is None` to
+  # improve the shape inference, because it makes the batch dimension static.
+  # It is safe to do this because in that case we are repeating the input
+  # indefinitely, and all batches will be full-sized.
+  drop_final_batch = drop_final_batch or num_epochs is None
+
   if parser_fn is None:
-    if drop_final_batch:
-      dataset = dataset.apply(batching.batch_and_drop_remainder(batch_size))
-    else:
-      dataset = dataset.batch(batch_size)
+    dataset = dataset.batch(batch_size, drop_remainder=drop_final_batch)
   else:
     # TODO(josh11b): if num_parallel_parser_calls is None, use some function
     # of num cores instead of map_and_batch's default behavior of one batch.
@@ -326,6 +329,7 @@ def make_csv_dataset(
     num_parallel_parser_calls=2,
     sloppy=False,
     num_rows_for_inference=100,
+    compression_type=None,
 ):
   """Reads CSV files into a dataset.
 
@@ -399,6 +403,8 @@ def make_csv_dataset(
     num_rows_for_inference: Number of rows of a file to use for type inference
       if record_defaults is not provided. If None, reads all the rows of all
       the files. Defaults to 100.
+    compression_type: (Optional.) A `tf.string` scalar evaluating to one of
+      `""` (no compression), `"ZLIB"`, or `"GZIP"`. Defaults to no compression.
 
   Returns:
     A dataset, where each element is a (features, labels) tuple that corresponds
@@ -461,7 +467,9 @@ def make_csv_dataset(
         use_quote_delim=use_quote_delim,
         na_value=na_value,
         select_cols=select_columns,
-        header=header)
+        header=header,
+        compression_type=compression_type,
+    )
 
   def map_fn(*columns):
     """Organizes columns into a features dictionary.
@@ -488,8 +496,13 @@ def make_csv_dataset(
       dataset, num_epochs, shuffle, shuffle_buffer_size, shuffle_seed)
 
   # Apply batch before map for perf, because map has high overhead relative
-  # to the size of the computation in each map
-  dataset = dataset.batch(batch_size=batch_size)
+  # to the size of the computation in each map.
+  # NOTE(mrry): We set `drop_remainder=True` when `num_epochs is None` to
+  # improve the shape inference, because it makes the batch dimension static.
+  # It is safe to do this because in that case we are repeating the input
+  # indefinitely, and all batches will be full-sized.
+  dataset = dataset.batch(batch_size=batch_size,
+                          drop_remainder=num_epochs is None)
   dataset = dataset.map(map_fn, num_parallel_calls=num_parallel_parser_calls)
   dataset = dataset.prefetch(prefetch_buffer_size)
 
@@ -505,6 +518,7 @@ class CsvDataset(dataset_ops.Dataset):
   def __init__(self,
                filenames,
                record_defaults,
+               compression_type=None,
                buffer_size=None,
                header=False,
                field_delim=",",
@@ -540,11 +554,11 @@ class CsvDataset(dataset_ops.Dataset):
 
     The expected output of its iterations is:
     ```python
-    next = dataset.make_one_shot_iterator().get_next()
+    next_element = dataset.make_one_shot_iterator().get_next()
     with tf.Session() as sess:
       while True:
         try:
-          print(sess.run(nxt))
+          print(sess.run(next_element))
         except tf.errors.OutOfRangeError:
           break
 
@@ -562,6 +576,9 @@ class CsvDataset(dataset_ops.Dataset):
         both this and `select_columns` are specified, these must have the same
         lengths, and `column_defaults` is assumed to be sorted in order of
         increasing column index.
+      compression_type: (Optional.) A `tf.string` scalar evaluating to one of
+        `""` (no compression), `"ZLIB"`, or `"GZIP"`. Defaults to no
+        compression.
       buffer_size: (Optional.) A `tf.int64` scalar denoting the number of bytes
         to buffer while reading files. Defaults to 4MB.
       header: (Optional.) A `tf.bool` scalar indicating whether the CSV file(s)
@@ -581,6 +598,11 @@ class CsvDataset(dataset_ops.Dataset):
     super(CsvDataset, self).__init__()
     self._filenames = ops.convert_to_tensor(
         filenames, dtype=dtypes.string, name="filenames")
+    self._compression_type = convert.optional_param_to_tensor(
+        "compression_type",
+        compression_type,
+        argument_default="",
+        argument_dtype=dtypes.string)
     record_defaults = [
         constant_op.constant([], dtype=x) if x in _ACCEPTABLE_CSV_TYPES else x
         for x in record_defaults
@@ -621,6 +643,7 @@ class CsvDataset(dataset_ops.Dataset):
         use_quote_delim=self._use_quote_delim,
         na_value=self._na_value,
         select_cols=self._select_cols,
+        compression_type=self._compression_type,
     )
 
   @property
@@ -757,10 +780,12 @@ def make_batched_features_dataset(file_pattern,
 
   dataset = dataset.apply(stats_ops.feature_stats("record_stats"))
 
-  if drop_final_batch:
-    dataset = dataset.apply(batching.batch_and_drop_remainder(batch_size))
-  else:
-    dataset = dataset.batch(batch_size)
+  # NOTE(mrry): We set `drop_remainder=True` when `num_epochs is None` to
+  # improve the shape inference, because it makes the batch dimension static.
+  # It is safe to do this because in that case we are repeating the input
+  # indefinitely, and all batches will be full-sized.
+  dataset = dataset.batch(
+      batch_size, drop_remainder=drop_final_batch or num_epochs is None)
 
   # Parse `Example` tensors to a dictionary of `Feature` tensors.
   dataset = dataset.map(
diff --git a/tensorflow/contrib/data/python/ops/sliding.py b/tensorflow/contrib/data/python/ops/sliding.py
index 3f3c5ca17c..e9dd74530a 100644
--- a/tensorflow/contrib/data/python/ops/sliding.py
+++ b/tensorflow/contrib/data/python/ops/sliding.py
@@ -23,25 +23,29 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import gen_dataset_ops
+from tensorflow.python.util import deprecation
 
 
 class _SlideDataset(dataset_ops.Dataset):
   """A `Dataset` that passes a sliding window over its input."""
 
-  def __init__(self, input_dataset, window_size, stride=1):
+  def __init__(self, input_dataset, window_size, window_shift, window_stride):
     """See `sliding_window_batch` for details."""
     super(_SlideDataset, self).__init__()
     self._input_dataset = input_dataset
     self._window_size = ops.convert_to_tensor(
-        window_size, dtype=dtypes.int64, name="window_size")
-    self._stride = ops.convert_to_tensor(
-        stride, dtype=dtypes.int64, name="stride")
+        window_size, dtype=dtypes.int64, name="window_stride")
+    self._window_stride = ops.convert_to_tensor(
+        window_stride, dtype=dtypes.int64, name="window_stride")
+    self._window_shift = ops.convert_to_tensor(
+        window_shift, dtype=dtypes.int64, name="window_shift")
 
   def _as_variant_tensor(self):
     return gen_dataset_ops.slide_dataset(
         self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
         window_size=self._window_size,
-        stride=self._stride,
+        window_shift=self._window_shift,
+        window_stride=self._window_stride,
         **dataset_ops.flat_structure(self))
 
   @property
@@ -61,38 +65,63 @@ class _SlideDataset(dataset_ops.Dataset):
     return self._input_dataset.output_types
 
 
-def sliding_window_batch(window_size, stride=1):
-  """A sliding window with size of `window_size` and step of `stride`.
+@deprecation.deprecated_args(
+    None, "stride is deprecated, use window_shift instead", "stride")
+def sliding_window_batch(window_size,
+                         stride=None,
+                         window_shift=None,
+                         window_stride=1):
+  """A sliding window over a dataset.
 
-  This transformation passes a sliding window over this dataset. The
-  window size is `window_size` and step size is `stride`. If the left
-  elements cannot fill up the sliding window, this transformation will
-  drop the final smaller element. For example:
+  This transformation passes a sliding window over this dataset. The window size
+  is `window_size`, the stride of the input elements is `window_stride`, and the
+  shift between consecutive windows is `window_shift`. If the remaining elements
+  cannot fill up the sliding window, this transformation will drop the final
+  smaller element. For example:
 
   ```python
   # NOTE: The following examples use `{ ... }` to represent the
   # contents of a dataset.
   a = { [1], [2], [3], [4], [5], [6] }
 
-  a.apply(tf.contrib.data.sliding_window_batch(window_size=3, stride=2)) ==
-  {
-      [[1], [2], [3]],
-      [[3], [4], [5]],
-  }
+  a.apply(sliding_window_batch(window_size=3)) ==
+  { [[1], [2], [3]], [[2], [3], [4]], [[3], [4], [5]], [[4], [5], [6]] }
+
+  a.apply(sliding_window_batch(window_size=3, window_shift=2)) ==
+  { [[1], [2], [3]], [[3], [4], [5]] }
+
+  a.apply(sliding_window_batch(window_size=3, window_stride=2)) ==
+  { [[1], [3], [5]], [[2], [4], [6]] }
   ```
 
   Args:
     window_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
-      elements in the sliding window.
+      elements in the sliding window. It must be positive.
     stride: (Optional.) A `tf.int64` scalar `tf.Tensor`, representing the
-      steps moving the sliding window forward for one iteration. The default
-      is `1`. It must be positive.
+      forward shift of the sliding window in each iteration. The default is `1`.
+      It must be positive. Deprecated alias for `window_shift`.
+    window_shift: (Optional.) A `tf.int64` scalar `tf.Tensor`, representing the
+      forward shift of the sliding window in each iteration. The default is `1`.
+      It must be positive.
+    window_stride: (Optional.) A `tf.int64` scalar `tf.Tensor`, representing the
+      stride of the input elements in the sliding window. The default is `1`.
+      It must be positive.
 
   Returns:
     A `Dataset` transformation function, which can be passed to
     @{tf.data.Dataset.apply}.
+
+  Raises:
+    ValueError: if invalid arguments are provided.
   """
+  if stride is None and window_shift is None:
+    window_shift = 1
+  elif stride is not None and window_shift is None:
+    window_shift = stride
+  elif stride is not None and window_shift is not None:
+    raise ValueError("Cannot specify both `stride` and `window_shift`")
+
   def _apply_fn(dataset):
-    return _SlideDataset(dataset, window_size, stride)
+    return _SlideDataset(dataset, window_size, window_shift, window_stride)
 
   return _apply_fn
diff --git a/tensorflow/contrib/distribute/BUILD b/tensorflow/contrib/distribute/BUILD
index 74b2cd90a1..1126f76f58 100644
--- a/tensorflow/contrib/distribute/BUILD
+++ b/tensorflow/contrib/distribute/BUILD
@@ -30,6 +30,7 @@ py_library(
         "//tensorflow/contrib/distribute/python:monitor",
         "//tensorflow/contrib/distribute/python:one_device_strategy",
         "//tensorflow/contrib/distribute/python:step_fn",
+        "//tensorflow/contrib/distribute/python:tpu_strategy",
         "//tensorflow/python:training",
         "//tensorflow/python:util",
     ],
diff --git a/tensorflow/contrib/distribute/README.md b/tensorflow/contrib/distribute/README.md
index 44a4481021..2f5dd10550 100644
--- a/tensorflow/contrib/distribute/README.md
+++ b/tensorflow/contrib/distribute/README.md
@@ -116,8 +116,6 @@ in the input function gives a solid boost in performance. When using
 ## Caveats
 This feature is in early stages and there are a lot of improvements forthcoming:
 
-* Metrics are not yet supported during distributed training. They are still
-supported during the evaluation.
 * Summaries are only computed in the first tower in `MirroredStrategy`.
 * Evaluation is not yet distributed.
 * Eager support is in the works; performance can be more challenging with eager
diff --git a/tensorflow/contrib/distribute/__init__.py b/tensorflow/contrib/distribute/__init__.py
index 76711baf3a..2e2c3be853 100644
--- a/tensorflow/contrib/distribute/__init__.py
+++ b/tensorflow/contrib/distribute/__init__.py
@@ -24,6 +24,7 @@ from tensorflow.contrib.distribute.python.mirrored_strategy import MirroredStrat
 from tensorflow.contrib.distribute.python.monitor import Monitor
 from tensorflow.contrib.distribute.python.one_device_strategy import OneDeviceStrategy
 from tensorflow.contrib.distribute.python.step_fn import *
+from tensorflow.contrib.distribute.python.tpu_strategy import TPUStrategy
 from tensorflow.python.training.distribute import *
 
 from tensorflow.python.util.all_util import remove_undocumented
@@ -41,6 +42,7 @@ _allowed_symbols = [
     'StandardInputStep',
     'StandardSingleLossStep',
     'TowerContext',
+    'TPUStrategy',
     'get_cross_tower_context',
     'get_distribution_strategy',
     'get_loss_reduction',
diff --git a/tensorflow/contrib/distribute/python/BUILD b/tensorflow/contrib/distribute/python/BUILD
index eba0dd0ea3..f6cc1dcc02 100644
--- a/tensorflow/contrib/distribute/python/BUILD
+++ b/tensorflow/contrib/distribute/python/BUILD
@@ -101,6 +101,23 @@ py_library(
 )
 
 py_library(
+    name = "parameter_server_strategy",
+    srcs = ["parameter_server_strategy.py"],
+    visibility = ["//tensorflow:internal"],
+    deps = [
+        ":cross_tower_ops",
+        ":mirrored_strategy",
+        ":values",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:training",
+        "//tensorflow/python:util",
+    ],
+)
+
+py_library(
     name = "one_device_strategy",
     srcs = ["one_device_strategy.py"],
     visibility = ["//tensorflow:internal"],
@@ -207,6 +224,35 @@ py_test(
     ],
 )
 
+py_test(
+    name = "parameter_server_strategy_test",
+    srcs = ["parameter_server_strategy_test.py"],
+    srcs_version = "PY2AND3",
+    tags = [
+        "no_pip",
+    ],
+    deps = [
+        ":combinations",
+        ":multi_worker_test_base",
+        ":parameter_server_strategy",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:gradients",
+        "//tensorflow/python:layers",
+        "//tensorflow/python:session",
+        "//tensorflow/python:training",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python:variables",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/estimator:run_config",
+        "@absl_py//absl/testing:parameterized",
+    ],
+)
+
 cuda_py_test(
     name = "mirrored_strategy_multigpu_test",
     srcs = ["mirrored_strategy_multigpu_test.py"],
@@ -247,11 +293,11 @@ py_library(
     ],
     deps = [
         "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:client_testlib",
         "//tensorflow/python:distributed_framework_test_lib",
-        "//tensorflow/python:platform",
         "//tensorflow/python:session",
-        "//tensorflow/python:training",
-        "//tensorflow/python/eager:test",
+        "//tensorflow/python/estimator:run_config",
+        "//third_party/py/numpy",
     ],
 )
 
@@ -587,6 +633,7 @@ cuda_py_test(
     ],
     tags = [
         "multi_and_single_gpu",
+        "no_windows_gpu",
         "notsan",
     ],
 )
@@ -609,3 +656,40 @@ cuda_py_test(
         "no_pip",
     ],
 )
+
+cuda_py_test(
+    name = "warm_starting_util_test",
+    size = "medium",
+    srcs = ["warm_starting_util_test.py"],
+    additional_deps = [
+        ":combinations",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:training",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python:variables",
+    ],
+    tags = [
+        "multi_and_single_gpu",
+        "no_pip",
+    ],
+)
+
+cuda_py_test(
+    name = "checkpoint_utils_test",
+    size = "medium",
+    srcs = ["checkpoint_utils_test.py"],
+    additional_deps = [
+        ":combinations",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:checkpoint_utils_test",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:training",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python:variables",
+    ],
+    tags = [
+        "multi_and_single_gpu",
+        "no_pip",
+    ],
+)
diff --git a/tensorflow/contrib/distribute/python/checkpoint_utils_test.py b/tensorflow/contrib/distribute/python/checkpoint_utils_test.py
new file mode 100644
index 0000000000..bcb977f640
--- /dev/null
+++ b/tensorflow/contrib/distribute/python/checkpoint_utils_test.py
@@ -0,0 +1,78 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for checkpoint_utils.init_from_checkpoint with Distribution Strategy.
+
+These tests are located here instead of as part of
+`python.training.CheckpointsTest` because they need access to distribution
+strategies which are only present in contrib right now.
+TODO(priyag): Move the tests to core `python.training.CheckpointsTest` when
+distribution strategy moves out of contrib.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from absl.testing import parameterized
+
+from tensorflow.contrib.distribute.python import combinations
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import checkpoint_utils
+from tensorflow.python.training import checkpoint_utils_test
+
+
+class CheckpointUtilsWithDistributionStrategyTest(
+    test.TestCase, parameterized.TestCase):
+
+  @combinations.generate(combinations.combine(
+      distribution=[combinations.default_strategy,
+                    combinations.one_device_strategy,
+                    combinations.mirrored_strategy_with_gpu_and_cpu,
+                    combinations.mirrored_strategy_with_two_gpus],
+      in_tower_mode=[True, False],
+      mode=["graph"]))
+  def testInitFromCheckpoint(self, distribution, in_tower_mode):
+    checkpoint_dir = self.get_temp_dir()
+    with self.test_session() as session:
+      v1_value, v2_value, _, _ = checkpoint_utils_test._create_checkpoints(
+          session, checkpoint_dir)
+
+    def init_and_verify(g):
+      v1 = variable_scope.get_variable("new_var1", [1, 10])
+      v2 = variable_scope.get_variable(
+          "new_var2", [10, 10],
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          aggregation=variable_scope.VariableAggregation.MEAN)
+      checkpoint_utils.init_from_checkpoint(checkpoint_dir, {
+          "var1": "new_var1",
+          "var2": "new_var2"
+      })
+      with self.test_session(graph=g) as session:
+        session.run(variables.global_variables_initializer())
+        self.assertAllEqual(v1_value, self.evaluate(v1))
+        self.assertAllEqual(v2_value, self.evaluate(v2))
+
+    with ops.Graph().as_default() as g, distribution.scope():
+      if in_tower_mode:
+        distribution.call_for_each_tower(init_and_verify, g)
+      else:
+        init_and_verify(g)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distribute/python/combinations.py b/tensorflow/contrib/distribute/python/combinations.py
index 9a8ea4aa48..52f73ddb03 100644
--- a/tensorflow/contrib/distribute/python/combinations.py
+++ b/tensorflow/contrib/distribute/python/combinations.py
@@ -144,7 +144,7 @@ def _augment_with_special_arguments(test_method):
     """A wrapped test method that treats some arguments in a special way."""
     mode = kwargs.pop("mode", "graph")
 
-    distribution = kwargs.pop("distribution", None)
+    distribution = kwargs.get("distribution", None)
     required_tpu = kwargs.pop("required_tpu", False)
     required_gpus = kwargs.pop("required_gpus", None)
 
@@ -153,7 +153,6 @@ def _augment_with_special_arguments(test_method):
           "Do not use `required_gpus` and `distribution` together.")
       assert required_tpu is False, (
           "Do not use `required_tpu` and `distribution` together.")
-      kwargs["distribution"] = distribution.strategy
       required_gpus = distribution.required_gpus
       required_tpu = distribution.required_tpu
 
@@ -189,9 +188,13 @@ def _augment_with_special_arguments(test_method):
 
     if mode == "eager":
       with ops.Graph().as_default(), context.eager_mode():
+        if distribution:
+          kwargs_to_pass["distribution"] = distribution.strategy
         test_method(**kwargs_to_pass)
     elif mode == "graph":
       with ops.Graph().as_default(), context.graph_mode():
+        if distribution:
+          kwargs_to_pass["distribution"] = distribution.strategy
         test_method(**kwargs_to_pass)
     else:
       raise ValueError(
diff --git a/tensorflow/contrib/distribute/python/cross_tower_ops.py b/tensorflow/contrib/distribute/python/cross_tower_ops.py
index b0baf0dad1..b6037d2133 100644
--- a/tensorflow/contrib/distribute/python/cross_tower_ops.py
+++ b/tensorflow/contrib/distribute/python/cross_tower_ops.py
@@ -28,18 +28,37 @@ from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import device_util
 
 
+def check_destinations(destinations):
+  """Checks whether `destinations` is not None and not empty.
+
+  Args:
+    destinations: a DistributedValues, Variable, string or a list of strings.
+
+  Returns:
+    Boolean indicating whether `destinations` is not None and not empty.
+  """
+  # Calling bool() on a ResourceVariable is not allowed.
+  if isinstance(destinations, resource_variable_ops.ResourceVariable):
+    return bool(destinations.device)
+  return bool(destinations)
+
+
 def validate_destinations(destinations):
-  if not isinstance(destinations,
-                    (value_lib.DistributedValues, six.string_types, list)):
+  if not isinstance(
+      destinations,
+      (value_lib.DistributedValues, resource_variable_ops.ResourceVariable,
+       six.string_types, list)):
     raise ValueError("destinations must be one of a `DistributedValues` object,"
-                     " a device string, a list of device strings or None")
+                     " a tf.Variable object, a device string, a list of device "
+                     "strings or None")
 
-  if not destinations:
+  if not check_destinations(destinations):
     raise ValueError("destinations can not be empty")
 
 
@@ -59,6 +78,8 @@ def _validate_value_destination_pairs(value_destination_pairs):
 def get_devices_from(destinations):
   if isinstance(destinations, value_lib.DistributedValues):
     return list(destinations.devices)
+  elif isinstance(destinations, resource_variable_ops.ResourceVariable):
+    return [destinations.device]
   elif isinstance(destinations, six.string_types):
     return [device_util.resolve(destinations)]
   else:
@@ -225,7 +246,10 @@ class ReductionToOneDeviceCrossTowerOps(CrossTowerOps):
     super(ReductionToOneDeviceCrossTowerOps, self).__init__()
 
   def _reduce(self, aggregation, per_device_value, destinations):
-    devices = get_devices_from(destinations or per_device_value)
+    if check_destinations(destinations):
+      devices = get_devices_from(destinations)
+    else:
+      devices = get_devices_from(per_device_value)
     reduce_to_device = self.reduce_to_device or devices[0]
     reduced = _simple_reduce(per_device_value, reduce_to_device,
                              self.accumulation_fn, aggregation)
@@ -508,7 +532,10 @@ class AllReduceCrossTowerOps(CrossTowerOps):
             logging.WARN,
             "Efficient allreduce is not supported for IndexedSlices.", 10)
 
-      devices = get_devices_from(destinations or per_device_value)
+      if check_destinations(destinations):
+        devices = get_devices_from(destinations)
+      else:
+        devices = get_devices_from(per_device_value)
       reduce_to_device = devices[0]
       reduced = _simple_reduce(per_device_value, reduce_to_device,
                                math_ops.add_n, aggregation)
diff --git a/tensorflow/contrib/distribute/python/estimator_integration_test.py b/tensorflow/contrib/distribute/python/estimator_integration_test.py
index 34410a6470..a0bb144b7c 100644
--- a/tensorflow/contrib/distribute/python/estimator_integration_test.py
+++ b/tensorflow/contrib/distribute/python/estimator_integration_test.py
@@ -96,7 +96,8 @@ class DNNLinearCombinedClassifierIntegrationTest(test.TestCase,
         # TODO(isaprykin): Work around the colocate_with error.
         dnn_optimizer=adagrad.AdagradOptimizer(0.001),
         linear_optimizer=adagrad.AdagradOptimizer(0.001),
-        config=run_config.RunConfig(train_distribute=distribution))
+        config=run_config.RunConfig(
+            train_distribute=distribution, eval_distribute=distribution))
 
     num_steps = 10
     estimator.train(train_input_fn, steps=num_steps)
diff --git a/tensorflow/contrib/distribute/python/examples/simple_estimator_example.py b/tensorflow/contrib/distribute/python/examples/simple_estimator_example.py
index 00c25c7a24..44a69ed23a 100644
--- a/tensorflow/contrib/distribute/python/examples/simple_estimator_example.py
+++ b/tensorflow/contrib/distribute/python/examples/simple_estimator_example.py
@@ -59,7 +59,8 @@ def build_model_fn_optimizer():
 def main(_):
   distribution = tf.contrib.distribute.MirroredStrategy(
       ["/device:GPU:0", "/device:GPU:1"])
-  config = tf.estimator.RunConfig(train_distribute=distribution)
+  config = tf.estimator.RunConfig(train_distribute=distribution,
+                                  eval_distribute=distribution)
 
   def input_fn():
     features = tf.data.Dataset.from_tensors([[1.]]).repeat(10)
@@ -70,7 +71,7 @@ def main(_):
       model_fn=build_model_fn_optimizer(), config=config)
   estimator.train(input_fn=input_fn, steps=10)
 
-  eval_result = estimator.evaluate(input_fn=input_fn)
+  eval_result = estimator.evaluate(input_fn=input_fn, steps=10)
   print("Eval result: {}".format(eval_result))
 
   def predict_input_fn():
diff --git a/tensorflow/contrib/distribute/python/examples/simple_tfkeras_example.py b/tensorflow/contrib/distribute/python/examples/simple_tfkeras_example.py
index 2b05884b9b..518ec9c423 100644
--- a/tensorflow/contrib/distribute/python/examples/simple_tfkeras_example.py
+++ b/tensorflow/contrib/distribute/python/examples/simple_tfkeras_example.py
@@ -57,7 +57,8 @@ def main(args):
   # tf.Estimator that utilizes the DistributionStrategy.
   strategy = tf.contrib.distribute.MirroredStrategy(
       ['/device:GPU:0', '/device:GPU:1'])
-  config = tf.estimator.RunConfig(train_distribute=strategy)
+  config = tf.estimator.RunConfig(
+      train_distribute=strategy, eval_distribute=strategy)
   keras_estimator = tf.keras.estimator.model_to_estimator(
       keras_model=model, config=config, model_dir=model_dir)
 
diff --git a/tensorflow/contrib/distribute/python/keras_test.py b/tensorflow/contrib/distribute/python/keras_test.py
index 75ecd90dcf..fbdb376fcc 100644
--- a/tensorflow/contrib/distribute/python/keras_test.py
+++ b/tensorflow/contrib/distribute/python/keras_test.py
@@ -12,33 +12,40 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for Keras Sequential and Functional models."""
+"""Tests for tf.keras models using DistributionStrategy."""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 import os
-
 import numpy as np
 
 from tensorflow.contrib.distribute.python import mirrored_strategy
+from tensorflow.contrib.distribute.python import values
 from tensorflow.python import keras
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.estimator import keras as keras_lib
 from tensorflow.python.estimator import run_config as run_config_lib
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import test_util
 from tensorflow.python.keras import testing_utils
+from tensorflow.python.keras.engine import distributed_training_utils
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import gradient_descent
 from tensorflow.python.training import rmsprop
 
+
 _RANDOM_SEED = 1337
 _TRAIN_SIZE = 200
 _INPUT_SIZE = (10,)
 _NUM_CLASS = 2
 
 
+# TODO(anjalisridhar): Add a decorator that will allow us to run these tests as
+# part of the tf.keras unit tests suite.
 def simple_sequential_model():
   model = keras.models.Sequential()
   model.add(keras.layers.Dense(16, activation='relu', input_shape=_INPUT_SIZE))
@@ -84,7 +91,7 @@ def get_ds_test_input_fn():
   return dataset
 
 
-class TestKerasDistributionStrategy(test_util.TensorFlowTestCase):
+class TestEstimatorDistributionStrategy(test_util.TensorFlowTestCase):
 
   def setUp(self):
     self._base_dir = os.path.join(self.get_temp_dir(),
@@ -107,7 +114,8 @@ class TestKerasDistributionStrategy(test_util.TensorFlowTestCase):
         optimizer=rmsprop.RMSPropOptimizer(learning_rate=0.01))
     config = run_config_lib.RunConfig(tf_random_seed=_RANDOM_SEED,
                                       model_dir=self._base_dir,
-                                      train_distribute=dist)
+                                      train_distribute=dist,
+                                      eval_distribute=dist)
     with self.test_session():
       est_keras = keras_lib.model_to_estimator(
           keras_model=keras_model, config=config)
@@ -144,5 +152,367 @@ class TestKerasDistributionStrategy(test_util.TensorFlowTestCase):
     writer_cache.FileWriterCache.clear()
     gfile.DeleteRecursively(self._config.model_dir)
 
+
+class TestWithDistributionStrategy(test.TestCase):
+
+  def test_validating_dataset_input_tensors_with_shape_mismatch(self):
+    with self.test_session():
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:0',
+                                                     '/device:CPU:0'])
+      a = constant_op.constant([1, 2], shape=(1, 2))
+      b = constant_op.constant([[1, 2], [1, 2]], shape=(2, 2))
+      x = values.DistributedValues({'/device:CPU:0': a, '/device:GPU:0': b})
+      y = values.DistributedValues({'/device:CPU:0': a, '/device:GPU:0': a})
+      with strategy.scope():
+        # Removed device and input tensor shape details from the error message
+        # since the order of the device and the corresponding input tensor shape
+        # is not deterministic over different runs.
+        with self.assertRaisesRegexp(ValueError,
+                                     'Input tensor shapes do not match for '
+                                     'distributed tensor inputs '
+                                     'DistributedValues:.+'):
+          distributed_training_utils.validate_distributed_dataset_inputs(
+              strategy, x, y)
+
+  def test_validating_dataset_input_tensors_with_dtype_mismatch(self):
+    with self.test_session():
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:0',
+                                                     '/device:CPU:0'])
+      a = constant_op.constant([1, 2], shape=(1, 2), dtype=dtypes.int32)
+      b = constant_op.constant([1, 2], shape=(1, 2), dtype=dtypes.float64)
+      x = values.DistributedValues({'/device:CPU:0': a, '/device:GPU:0': b})
+      y = values.DistributedValues({'/device:CPU:0': a, '/device:GPU:0': a})
+      with strategy.scope():
+        # Removed device and input tensor dtype details from the error message
+        # since the order of the device and the corresponding input tensor dtype
+        # is not deterministic over different runs.
+        with self.assertRaisesRegexp(ValueError,
+                                     'Input tensor dtypes do not match for '
+                                     'distributed tensor inputs '
+                                     'DistributedValues:.+'):
+          distributed_training_utils.validate_distributed_dataset_inputs(
+              strategy, x, y)
+
+  def test_calling_model_on_same_dataset(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = gradient_descent.GradientDescentOptimizer(0.001)
+      loss = 'mse'
+      metrics = ['mae']
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:1',
+                                                     '/device:GPU:0'])
+      model.compile(optimizer, loss, metrics=metrics, distribute=strategy)
+
+      inputs = np.zeros((10, 3), dtype=np.float32)
+      targets = np.zeros((10, 4), dtype=np.float32)
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      # Call fit with validation data
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                validation_data=dataset, validation_steps=2)
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                validation_data=dataset, validation_steps=2)
+      model.predict(dataset, steps=2)
+
+  def test_fit_eval_and_predict_methods_on_dataset(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = gradient_descent.GradientDescentOptimizer(0.001)
+      loss = 'mse'
+      metrics = ['mae']
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:0',
+                                                     '/device:CPU:0'])
+
+      model.compile(optimizer, loss, metrics=metrics, distribute=strategy)
+
+      inputs = np.zeros((10, 3), dtype=np.float32)
+      targets = np.zeros((10, 4), dtype=np.float32)
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=1)
+      model.evaluate(dataset, steps=2, verbose=1)
+      model.predict(dataset, steps=2)
+      # Test with validation data
+      model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                validation_data=dataset, validation_steps=2)
+
+  def test_unsupported_features(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = gradient_descent.GradientDescentOptimizer(0.001)
+      loss = 'mse'
+      metrics = ['mae']
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:1',
+                                                     '/device:GPU:0'])
+
+      model.compile(optimizer, loss, metrics=metrics, distribute=strategy)
+
+      inputs = np.zeros((10, 3), dtype=np.float32)
+      targets = np.zeros((10, 4), dtype=np.float32)
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      # Test with validation split
+      with self.assertRaisesRegexp(
+          ValueError, '`validation_split` argument is not supported '
+                      'when input `x` is a dataset or a dataset iterator'):
+        model.fit(dataset,
+                  epochs=1, steps_per_epoch=2, verbose=0,
+                  validation_split=0.5, validation_steps=2)
+
+      # Test with sample weight.
+      sample_weight = np.random.random((10,))
+      with self.assertRaisesRegexp(
+          ValueError, 'sample_weight is currently not supported when using '
+                      'DistributionStrategy.'):
+        model.fit(
+            dataset,
+            epochs=1,
+            steps_per_epoch=2,
+            verbose=0,
+            sample_weight=sample_weight)
+
+      # Test with not specifying the `steps` argument.
+      with self.assertRaisesRegexp(
+          ValueError, 'you should specify the `steps_per_epoch` argument'):
+        model.fit(dataset, epochs=1, verbose=0)
+      with self.assertRaisesRegexp(ValueError,
+                                   'you should specify the `steps` argument'):
+        model.evaluate(dataset, verbose=0)
+
+      with self.assertRaisesRegexp(ValueError,
+                                   'you should specify the `steps` argument'):
+        model.predict(dataset, verbose=0)
+
+  def test_calling_with_unsupported_predefined_callbacks(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = gradient_descent.GradientDescentOptimizer(0.001)
+      loss = 'mse'
+      metrics = ['mae']
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:1',
+                                                     '/device:GPU:0'])
+      model.compile(optimizer, loss, metrics=metrics, distribute=strategy)
+
+      inputs = np.zeros((10, 3), dtype=np.float32)
+      targets = np.zeros((10, 4), dtype=np.float32)
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      def schedule(_):
+        return 0.001
+      with self.assertRaisesRegexp(ValueError,
+                                   'LearningRateScheduler callback is not '
+                                   'supported with DistributionStrategy.'):
+        model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                  callbacks=[keras.callbacks.LearningRateScheduler(schedule)])
+
+      with self.assertRaisesRegexp(ValueError,
+                                   'ReduceLROnPlateau callback is not '
+                                   'supported with DistributionStrategy.'):
+        model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                  callbacks=[keras.callbacks.ReduceLROnPlateau()])
+      with self.assertRaisesRegexp(ValueError,
+                                   'histogram_freq in the TensorBoard callback '
+                                   'is not supported when using '
+                                   'DistributionStrategy.'):
+        model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0,
+                  callbacks=[keras.callbacks.TensorBoard(histogram_freq=10)])
+
+  def test_dataset_input_shape_validation(self):
+    with self.test_session():
+      x = keras.layers.Input(shape=(3,), name='input')
+      y = keras.layers.Dense(4, name='dense')(x)
+      model = keras.Model(x, y)
+
+      optimizer = rmsprop.RMSPropOptimizer(learning_rate=0.001)
+      loss = 'mse'
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:1',
+                                                     '/device:GPU:0'])
+
+      model.compile(optimizer, loss, distribute=strategy)
+
+      # User forgets to batch the dataset
+      inputs = np.zeros((10, 3), dtype=np.float32)
+      targets = np.zeros((10, 4), dtype=np.float32)
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+
+      with self.assertRaisesRegexp(ValueError,
+                                   'expected input to have 2 dimensions'):
+        model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0)
+
+      # Wrong input shape
+      inputs = np.zeros((10, 5), dtype=np.float32)
+      targets = np.zeros((10, 4), dtype=np.float32)
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+
+      with self.assertRaisesRegexp(ValueError,
+                                   'expected input to have shape'):
+        model.fit(dataset, epochs=1, steps_per_epoch=2, verbose=0)
+
+  def test_learning_phase_value(self):
+    # TODO(anjalisridhar): Modify this test to use Lambdas since we can compare
+    # meaningful values. Currently we don't pass the learning phase if the
+    # Lambda layer uses the learning phase.
+    with self.test_session():
+      x = keras.layers.Input(shape=(16,), name='input')
+      y = keras.layers.Dense(16)(x)
+      z = keras.layers.Dropout(0.9999)(y)
+      model = keras.Model(x, z)
+
+      optimizer = gradient_descent.GradientDescentOptimizer(0.005)
+      loss = 'mse'
+      metrics = ['acc']
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:0',
+                                                     '/device:CPU:0'])
+
+      model.compile(optimizer, loss, metrics=metrics, distribute=strategy)
+
+      inputs = np.random.rand(10, 16)
+      targets = np.ones((10, 16), dtype=np.float32)
+      dataset = dataset_ops.Dataset.from_tensor_slices((inputs, targets))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(8)
+
+      hist = model.fit(dataset, epochs=5, steps_per_epoch=20, verbose=1)
+      self.assertEqual(hist.history['acc'][0], 1)
+
+      evaluate_output = model.evaluate(dataset, steps=20)
+      self.assertEqual(evaluate_output[1], 0)
+
+      predict_output = model.predict(dataset, steps=1)
+      self.assertNotEqual(np.mean(predict_output), 0)
+
+
+class LossMaskingWithDistributionStrategyTest(test.TestCase):
+
+  def test_masking(self):
+    with self.test_session():
+      np.random.seed(1337)
+      x = np.array([[[1], [1]], [[0], [0]]])
+      model = keras.models.Sequential()
+      model.add(keras.layers.Masking(mask_value=0, input_shape=(2, 1)))
+      model.add(
+          keras.layers.TimeDistributed(
+              keras.layers.Dense(1, kernel_initializer='one')))
+      strategy = mirrored_strategy.MirroredStrategy(['/device:GPU:1',
+                                                     '/device:GPU:0'])
+
+      model.compile(loss='mse',
+                    optimizer=gradient_descent.GradientDescentOptimizer(0.01),
+                    distribute=strategy)
+      y = np.array([[[1], [1]], [[1], [1]]])
+      dataset = dataset_ops.Dataset.from_tensor_slices((x, y))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(10)
+      hist = model.fit(x=dataset, epochs=1, steps_per_epoch=2)
+      self.assertEqual(hist.history['loss'][0], 0)
+
+
+class NormalizationLayerWithDistributionStrategyTest(test.TestCase):
+
+  def test_batchnorm_correctness(self):
+    with self.test_session():
+      model = keras.models.Sequential()
+      norm = keras.layers.BatchNormalization(input_shape=(10,), momentum=0.8)
+      model.add(norm)
+      strategy = mirrored_strategy.MirroredStrategy(['/device:CPU:0',
+                                                     '/device:GPU:0'])
+      model.compile(loss='mse',
+                    optimizer=gradient_descent.GradientDescentOptimizer(0.01),
+                    distribute=strategy)
+
+      # centered on 5.0, variance 10.0
+      x = np.random.normal(loc=5.0, scale=10.0, size=(1000, 10))
+      dataset = dataset_ops.Dataset.from_tensor_slices((x, x))
+      dataset = dataset.repeat(100)
+      dataset = dataset.batch(32)
+
+      model.fit(dataset, epochs=4, verbose=0, steps_per_epoch=10)
+      out = model.predict(dataset, steps=2)
+      out -= keras.backend.eval(norm.beta)
+      out /= keras.backend.eval(norm.gamma)
+      np.testing.assert_allclose(out.mean(), 0.0, atol=1e-1)
+      np.testing.assert_allclose(out.std(), 1.0, atol=1e-1)
+
+
+class CorrectnessWithDistributionStrategyTest(test.TestCase):
+
+  def test_correctness(self):
+    with self.test_session():
+      keras.backend.set_image_data_format('channels_last')
+      num_samples = 10000
+      x_train = np.random.rand(num_samples, 1)
+      y_train = 3 * x_train
+      x_train = x_train.astype('float32')
+      y_train = y_train.astype('float32')
+
+      model = keras.Sequential()
+      model.add(keras.layers.Dense(1, input_shape=(1,)))
+
+      # With DistributionStrategy
+      dataset_with = dataset_ops.Dataset.from_tensor_slices((x_train, y_train))
+      dataset_with = dataset_with.batch(32)
+      strategy = mirrored_strategy.MirroredStrategy(devices=['/device:CPU:0',
+                                                             '/device:GPU:0'],
+                                                    prefetch_on_device=False)
+
+      model.compile(loss=keras.losses.mean_squared_error,
+                    optimizer=gradient_descent.GradientDescentOptimizer(0.5),
+                    distribute=strategy)
+      model.fit(x=dataset_with, epochs=1, steps_per_epoch=310)
+      wts_with_ds = model.get_weights()
+
+      x_predict = [[1], [2], [3], [4]]
+      predict_dataset_with = dataset_ops.Dataset.from_tensor_slices((x_predict,
+                                                                     x_predict))
+      predict_dataset_with = predict_dataset_with.batch(2)
+      predict_with_ds = model.predict(predict_dataset_with, steps=1)
+      predict_with_ds = np.reshape(predict_with_ds, (4, 1))
+
+      # Without DistributionStrategy
+      dataset_without = dataset_ops.Dataset.from_tensor_slices((x_train,
+                                                                y_train))
+      dataset_without = dataset_without.batch(64)
+
+      model.compile(loss=keras.losses.mean_squared_error,
+                    optimizer=gradient_descent.GradientDescentOptimizer(0.5))
+      model.fit(x=dataset_without, epochs=1, steps_per_epoch=310)
+      wts_without_ds = model.get_weights()
+
+      x_predict = [[1], [2], [3], [4]]
+      predict_dataset_without = dataset_ops.Dataset.from_tensor_slices((
+          x_predict, x_predict))
+      predict_dataset_without = predict_dataset_without.batch(4)
+      predict_without_ds = model.predict(predict_dataset_without, steps=1)
+
+      # Verify that the weights are the same within some limits of tolerance.
+      np.testing.assert_allclose(wts_with_ds[0], wts_without_ds[0], rtol=1e-3)
+      # Verify that the predicted outputs are the same within some limits of
+      # tolerance.
+      np.testing.assert_allclose(predict_with_ds, predict_without_ds, rtol=1e-3)
+
+
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/distribute/python/metrics_v1_test.py b/tensorflow/contrib/distribute/python/metrics_v1_test.py
index 6c6bf14309..2f3d6bdd3f 100644
--- a/tensorflow/contrib/distribute/python/metrics_v1_test.py
+++ b/tensorflow/contrib/distribute/python/metrics_v1_test.py
@@ -19,7 +19,6 @@ from __future__ import print_function
 
 from absl.testing import parameterized
 
-from tensorflow.contrib.data.python.ops import batching
 from tensorflow.contrib.distribute.python import combinations
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.eager import test
@@ -183,7 +182,7 @@ class MetricsV1Test(test.TestCase, parameterized.TestCase):
     def _dataset_fn():
       dataset = dataset_ops.Dataset.range(1000).map(math_ops.to_float)
       # Want to produce a fixed, known shape, so drop remainder when batching.
-      dataset = dataset.apply(batching.batch_and_drop_remainder(4))
+      dataset = dataset.batch(4, drop_remainder=True)
       return dataset
 
     def _expected_fn(num_batches):
diff --git a/tensorflow/contrib/distribute/python/mirrored_strategy.py b/tensorflow/contrib/distribute/python/mirrored_strategy.py
index dcbc6b0878..eb2d102012 100644
--- a/tensorflow/contrib/distribute/python/mirrored_strategy.py
+++ b/tensorflow/contrib/distribute/python/mirrored_strategy.py
@@ -20,7 +20,6 @@ from __future__ import print_function
 
 import contextlib
 import threading
-import six
 
 from tensorflow.contrib.distribute.python import cross_tower_ops as cross_tower_ops_lib
 from tensorflow.contrib.distribute.python import shared_variable_creator
@@ -60,6 +59,156 @@ class _RequestedStop(Exception):
   pass
 
 
+# Make _call_for_each_tower and _reduce_non_distributed_value not members of
+# MirroredStrategy so that they are generally not allowed to use anything
+# specific to MirroredStrategy and thus can be shared with other distribution
+# strategies.
+
+
+# TODO(yuefengz): maybe create a common class for those who need to call this
+# _call_for_each_tower.
+def _call_for_each_tower(distribution, fn, *args, **kwargs):
+  """Run `fn` in separate threads, once per tower/worker device.
+
+  Args:
+    distribution: the DistributionStrategy object.
+    fn: function to run (will be run once per device, each in its own thread).
+    *args: positional arguments for `fn`
+    **kwargs: keyword arguments for `fn`.
+        `"run_concurrently"`: Boolean indicating whether executions of `fn`
+           can be run concurrently (under eager execution only), defaults to
+           `True`.
+
+  Returns:
+    Merged return value of `fn` across all towers.
+
+  Raises:
+    RuntimeError: If fn() calls get_tower_context().merge_call() a different
+        number of times from the available devices.
+  """
+  run_concurrently = kwargs.pop("run_concurrently", True)
+  if not context.executing_eagerly():
+    # Lots of TF library code isn't thread-safe in graph mode, and
+    # there is little to be gained by turning on multithreading when
+    # constructing a graph.
+    run_concurrently = False
+    # Needed for per-thread device, etc. contexts in graph mode.
+    ops.get_default_graph().switch_to_thread_local()
+  elif run_concurrently is None:
+    run_concurrently = True
+
+  coord = coordinator.Coordinator(clean_stop_exception_types=(_RequestedStop,))
+
+  shared_variable_store = {}
+
+  # TODO(isaprykin): Create these threads once instead of during every run()
+  # call.
+  threads = []
+  for index, d in enumerate(distribution.worker_devices):
+    variable_creator_fn = shared_variable_creator.make_fn(
+        shared_variable_store, index)
+    t = MirroredStrategy._MirroredTowerThread(  # pylint: disable=protected-access
+        distribution, coord, d, variable_creator_fn, fn,
+        *values.select_device(d, args), **values.select_device(d, kwargs))
+    threads.append(t)
+
+  for t in threads:
+    t.start()
+
+  # When `fn` starts `should_run` event is set on _MirroredTowerThread
+  # (`MTT`) threads. The execution waits until
+  # `MTT.has_paused` is set, which indicates that either `fn` is
+  # complete or a `get_tower_context().merge_call()` is called.  If `fn` is
+  # complete, then `MTT.done` is set to True.  Otherwise, arguments
+  # of `get_tower_context().merge_call` from all paused threads are grouped
+  # and the `merge_fn` is performed.  Results of the
+  # `get_tower_context().merge_call` are then set to `MTT.merge_result`.
+  # Each such `get_tower_context().merge_call` call returns the
+  # `MTT.merge_result` for that thread when `MTT.should_run` event
+  # is reset again. Execution of `fn` resumes.
+
+  try:
+    with coord.stop_on_exception():
+      all_done = False
+      while not all_done and not coord.should_stop():
+        done = []
+        if run_concurrently:
+          for t in threads:
+            t.should_run.set()
+          for t in threads:
+            t.has_paused.wait()
+            t.has_paused.clear()
+            if coord.should_stop():
+              return None
+            done.append(t.done)
+        else:
+          for t in threads:
+            t.should_run.set()
+            t.has_paused.wait()
+            t.has_paused.clear()
+            if coord.should_stop():
+              return None
+            done.append(t.done)
+        if coord.should_stop():
+          return None
+        all_done = all(done)
+        if not all_done:
+          if any(done):
+            raise RuntimeError("Some towers made a different number of "
+                               "tower_context().merge_call() calls.")
+          # get_tower_context().merge_call() case
+          merge_args = values.regroup({t.device: t.merge_args for t in threads})
+          merge_kwargs = values.regroup(
+              {t.device: t.merge_kwargs for t in threads})
+          # We capture the name_scope of the MTT when we call merge_fn
+          # to ensure that if we have opened a name scope in the MTT,
+          # it will be respected when executing the merge function. We only
+          # capture the name_scope from the first MTT and assume it is
+          # the same for all other MTTs.
+          mtt_captured_name_scope = threads[0].captured_name_scope
+          with ops.name_scope(mtt_captured_name_scope):
+            merge_result = threads[0].merge_fn(distribution, *merge_args,
+                                               **merge_kwargs)
+          for t in threads:
+            t.merge_result = values.select_device(t.device, merge_result)
+  finally:
+    for t in threads:
+      t.should_run.set()
+    coord.join(threads)
+
+  return values.regroup({t.device: t.main_result for t in threads})
+
+
+def _reduce_non_distributed_value(distribution, aggregation, value,
+                                  destinations):
+  """Reduce a non-DistributedValue `value` to `destinations`."""
+  if isinstance(value, values.DistributedValues):
+    raise ValueError("You are passing a `DistributedValue` to "
+                     "`_reduce_non_distributed_value`, which is not allowed.")
+
+  if value == 0:
+    return 0
+  if aggregation == variable_scope.VariableAggregation.MEAN:
+    return distribution.broadcast(value, destinations)
+
+  cross_tower_ops_lib.validate_destinations(destinations)
+  if (len(distribution.worker_devices) != 1 or
+      not cross_tower_ops_lib.check_destinations(destinations)):
+    raise ValueError("A non-DistributedValues value cannot be reduced with the "
+                     "given aggregation.")
+  # TODO(anjalisridhar): Moves these methods to a device utility file?
+  devices = cross_tower_ops_lib.get_devices_from(destinations)
+  if len(devices) == 1:
+    with ops.device(devices[0]):
+      return array_ops.identity(value)
+  else:
+    value_updates = {}
+    for d in devices:
+      with ops.device(d):
+        value_updates[d] = array_ops.identity(value)
+    return values.Mirrored(value_updates)
+
+
 class MirroredStrategy(distribute_lib.DistributionStrategy):
   """Mirrors vars to distribute across multiple devices on a single machine.
 
@@ -198,116 +347,7 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
                                                  self._devices)
 
   def _call_for_each_tower(self, fn, *args, **kwargs):
-    """Run `fn` in separate threads, once per tower/worker device.
-
-    Args:
-      fn: function to run (will be run once per device, each in its own thread).
-      *args: positional arguments for `fn`
-      **kwargs: keyword arguments for `fn`.
-          `"run_concurrently"`: Boolean indicating whether executions of `fn`
-             can be run concurrently (under eager execution only), defaults to
-             `True`.
-
-    Returns:
-      Merged return value of `fn` across all towers.
-
-    Raises:
-      RuntimeError: If fn() calls get_tower_context().merge_call() a different
-          number of times for when called for different devices.
-    """
-    run_concurrently = kwargs.pop("run_concurrently", True)
-    if not context.executing_eagerly():
-      # Lots of TF library code isn't thread-safe in graph mode, and
-      # there is little to be gained by turning on multithreading when
-      # constructing a graph.
-      run_concurrently = False
-      # Needed for per-thread device, etc. contexts in graph mode.
-      ops.get_default_graph().switch_to_thread_local()
-    elif run_concurrently is None:
-      run_concurrently = True
-
-    coord = coordinator.Coordinator(
-        clean_stop_exception_types=(_RequestedStop,))
-
-    shared_variable_store = {}
-
-    # TODO(isaprykin): Create these threads once instead of during every run()
-    # call.
-    threads = []
-    for index, d in enumerate(self._devices):
-      variable_creator_fn = shared_variable_creator.make_fn(
-          shared_variable_store, index)
-      t = MirroredStrategy._MirroredTowerThread(
-          self, coord, d, variable_creator_fn, fn,
-          *values.select_device(d, args), **values.select_device(d, kwargs))
-      threads.append(t)
-
-    for t in threads:
-      t.start()
-
-    # When `fn` starts `should_run` event is set on _MirroredTowerThread
-    # (`MTT`) threads. The execution waits until
-    # `MTT.has_paused` is set, which indicates that either `fn` is
-    # complete or a `get_tower_context().merge_call()` is called.  If `fn` is
-    # complete, then `MTT.done` is set to True.  Otherwise, arguments
-    # of `get_tower_context().merge_call` from all paused threads are grouped
-    # and the `merge_fn` is performed.  Results of the
-    # `get_tower_context().merge_call` are then set to `MTT.merge_result`.
-    # Each such `get_tower_context().merge_call` call returns the
-    # `MTT.merge_result` for that thread when `MTT.should_run` event
-    # is reset again. Execution of `fn` resumes.
-
-    try:
-      with coord.stop_on_exception():
-        all_done = False
-        while not all_done and not coord.should_stop():
-          done = []
-          if run_concurrently:
-            for t in threads:
-              t.should_run.set()
-            for t in threads:
-              t.has_paused.wait()
-              t.has_paused.clear()
-              if coord.should_stop():
-                return None
-              done.append(t.done)
-          else:
-            for t in threads:
-              t.should_run.set()
-              t.has_paused.wait()
-              t.has_paused.clear()
-              if coord.should_stop():
-                return None
-              done.append(t.done)
-          if coord.should_stop():
-            return None
-          all_done = all(done)
-          if not all_done:
-            if any(done):
-              raise RuntimeError("Some towers made a different number of "
-                                 "tower_context().merge_call() calls.")
-            # get_tower_context().merge_call() case
-            merge_args = values.regroup(
-                {t.device: t.merge_args for t in threads})
-            merge_kwargs = values.regroup(
-                {t.device: t.merge_kwargs for t in threads})
-            # We capture the name_scope of the MTT when we call merge_fn
-            # to ensure that if we have opened a name scope in the MTT,
-            # it will be respected when executing the merge function. We only
-            # capture the name_scope from the first MTT and assume it is
-            # the same for all other MTTs.
-            mtt_captured_name_scope = threads[0].captured_name_scope
-            with ops.name_scope(mtt_captured_name_scope):
-              merge_result = threads[0].merge_fn(
-                  self, *merge_args, **merge_kwargs)
-            for t in threads:
-              t.merge_result = values.select_device(t.device, merge_result)
-    finally:
-      for t in threads:
-        t.should_run.set()
-      coord.join(threads)
-
-    return values.regroup({t.device: t.main_result for t in threads})
+    return _call_for_each_tower(self, fn, *args, **kwargs)
 
   def map(self, map_over, fn, *args, **kwargs):
     # TODO(josh11b): In eager mode, use one thread per device.
@@ -337,29 +377,9 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
 
   def _reduce(self, aggregation, value, destinations):
     assert not isinstance(value, values.Mirrored)
-    if not isinstance(value, values.PerDevice):
-      if value == 0:
-        return 0
-      if aggregation == variable_scope.VariableAggregation.MEAN:
-        return self._broadcast(value, destinations)
-
-      cross_tower_ops_lib.validate_destinations(destinations)
-      if len(self._devices) == 1:
-        if destinations:
-          # TODO(anjalisridhar): Moves these methods to a device utility file?
-          devices = cross_tower_ops_lib.get_devices_from(destinations)
-          if len(devices) == 1:
-            with ops.device(devices[0]):
-              return array_ops.identity(value)
-          else:
-            value_updates = {}
-            for d in devices:
-              with ops.device(d):
-                value_updates[d] = array_ops.identity(value)
-            return values.Mirrored(value_updates)
-      raise ValueError("A non PerDevice value cannot be reduced with the given "
-                       "aggregation.")
-
+    if not isinstance(value, values.DistributedValues):
+      return _reduce_non_distributed_value(self, aggregation, value,
+                                           destinations)
     return self._get_cross_tower_ops().reduce(
         aggregation, value, destinations=destinations)
 
@@ -433,15 +453,8 @@ class MirroredStrategy(distribute_lib.DistributionStrategy):
   def _get_devices_from(self, colocate_with=None):
     if colocate_with is None:
       return self._devices
-    elif isinstance(colocate_with, values.DistributedValues):
-      # pylint: disable=protected-access
-      return list(colocate_with._index.keys())
-    elif isinstance(colocate_with, six.string_types):
-      return [device_util.resolve(colocate_with)]
-    elif isinstance(colocate_with, list):
-      return [device_util.resolve(d) for d in colocate_with]
     else:
-      return colocate_with
+      return cross_tower_ops_lib.get_devices_from(colocate_with)
 
   class _MirroredTowerThread(threading.Thread):
     """A thread that runs() a function on a device."""
diff --git a/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py b/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py
index b597bce035..aab7119901 100644
--- a/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py
+++ b/tensorflow/contrib/distribute/python/mirrored_strategy_multigpu_test.py
@@ -491,13 +491,14 @@ class MirroredStrategyVariableCreationTest(test.TestCase):
     components_mean = {}
 
     def model_fn(device_id):
-      tower_context = distribute_lib.get_tower_context()
-      with tower_context.tower_local_var_scope(
-          variable_scope.VariableAggregation.SUM):
-        v_sum = variable_scope.variable(1.0)
-      with tower_context.tower_local_var_scope(
-          variable_scope.VariableAggregation.MEAN):
-        v_mean = variable_scope.variable(4.0)
+      v_sum = variable_scope.variable(
+          1.0,
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          aggregation=variable_scope.VariableAggregation.SUM)
+      v_mean = variable_scope.variable(
+          4.0,
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          aggregation=variable_scope.VariableAggregation.MEAN)
       self.assertTrue(isinstance(v_sum, values.TowerLocalVariable))
       self.assertTrue(isinstance(v_mean, values.TowerLocalVariable))
       updates = [v_sum.assign_add(2.0 + device_id),
@@ -700,10 +701,10 @@ class MirroredStrategyVariableCreationTest(test.TestCase):
     with context.graph_mode():
 
       def model_fn():
-        tower_context = distribute_lib.get_tower_context()
-        with tower_context.tower_local_var_scope(
-            variable_scope.VariableAggregation.SUM):
-          v_sum = variable_scope.variable(1.0)
+        v_sum = variable_scope.variable(
+            1.0,
+            synchronization=variable_scope.VariableSynchronization.ON_READ,
+            aggregation=variable_scope.VariableAggregation.SUM)
         self.assertTrue(isinstance(v_sum, values.TowerLocalVariable))
         return v_sum
 
@@ -791,8 +792,8 @@ class MirroredVariableUpdateTest(test.TestCase):
         return mirrored_var.assign(5.0)
 
       with self.assertRaisesRegexp(
-          ValueError, "A non PerDevice value cannot be reduced with the given "
-          "aggregation."):
+          ValueError, "A non-DistributedValues value cannot be reduced with "
+          "the given aggregation."):
         self.evaluate(dist.unwrap(dist.call_for_each_tower(model_fn)))
 
   @test_util.run_in_graph_and_eager_modes(config=config)
@@ -922,5 +923,118 @@ class MirroredVariableUpdateTest(test.TestCase):
       self.assertEquals(4.5, self.evaluate(mirrored_var))
 
 
+class MirroredAndTowerLocalVariableInitializerTest(test.TestCase):
+  config = config_pb2.ConfigProto()
+  config.allow_soft_placement = True
+
+  def testAssignMirroredVarInitializer(self):
+    # This test is not eager compatible since in eager variables are initialized
+    # upon construction instead of once the initialization op is run.
+    with context.graph_mode():
+      def var_fn():
+        v = variable_scope.variable(1.0, name="foo")
+        return v
+
+      dist = mirrored_strategy.MirroredStrategy(
+          ["/device:GPU:0", "/device:CPU:0"])
+
+      with dist.scope():
+        mirrored_var = dist.call_for_each_tower(var_fn)
+        self.assertIsInstance(mirrored_var, values.MirroredVariable)
+        self.assertFalse(self.evaluate(mirrored_var.is_initialized()))
+        self.evaluate(mirrored_var.initializer)
+        self.assertTrue(self.evaluate(mirrored_var.is_initialized()))
+
+  def testAssignTowerLocalVarInitializer(self):
+    # This test is not eager compatible since in eager variables are initialized
+    # upon construction instead of once the initialization op is run.
+    with context.graph_mode():
+      def model_fn():
+        v_sum = variable_scope.variable(
+            1.0,
+            synchronization=variable_scope.VariableSynchronization.ON_READ,
+            aggregation=variable_scope.VariableAggregation.SUM)
+        self.assertTrue(isinstance(v_sum, values.TowerLocalVariable))
+        return v_sum
+
+      dist = mirrored_strategy.MirroredStrategy(
+          ["/device:GPU:0", "/device:CPU:0"])
+
+      with dist.scope():
+        tower_local_var = dist.call_for_each_tower(model_fn)
+        self.assertTrue(isinstance(tower_local_var, values.TowerLocalVariable))
+        self.assertFalse(self.evaluate(tower_local_var.is_initialized()))
+        self.evaluate(tower_local_var.initializer)
+        self.assertTrue(self.evaluate(tower_local_var.is_initialized()))
+
+
+class TowerLocalVariableAssignTest(test.TestCase):
+  config = config_pb2.ConfigProto()
+  config.allow_soft_placement = True
+
+  def _skip_eager_if_gpus_less_than(self, num_gpus):
+    if context.num_gpus() < num_gpus and context.executing_eagerly():
+      self.skipTest("Enough GPUs not available for this test in eager mode.")
+
+  @test_util.run_in_graph_and_eager_modes(config=config)
+  def testAssignTowerLocalVarSumAggregation(self):
+    self._skip_eager_if_gpus_less_than(1)
+    def model_fn():
+      v_sum = variable_scope.variable(
+          1.0,
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          aggregation=variable_scope.VariableAggregation.SUM)
+      return v_sum
+
+    dist = mirrored_strategy.MirroredStrategy(
+        ["/device:GPU:0", "/device:CPU:0"])
+
+    with dist.scope():
+      tower_local_var = dist.call_for_each_tower(model_fn,
+                                                 run_concurrently=False)
+      self.assertTrue(isinstance(tower_local_var, values.TowerLocalVariable))
+      self.evaluate(variables.global_variables_initializer())
+      # Each tower has a value of 1.0 assigned to it in tower context.
+      # When we read the value using `read_var` we should see the SUM of each of
+      # values on each of the towers.
+      self.assertEqual(2.0, self.evaluate(dist.read_var(tower_local_var)))
+      # Assigning 6.0 in cross tower context will assign a value of
+      # 6.0/num_towers to each tower.
+      tlv_ops = tower_local_var.assign(6.0)
+      self.evaluate(tlv_ops)
+      # On reading the tower local var we should get the assigned value back.
+      # The value on all the towers are added before being returned by
+      # `read_var`.
+      self.assertEqual(6.0, self.evaluate(dist.read_var(tower_local_var)))
+
+  @test_util.run_in_graph_and_eager_modes(config=config)
+  def testAssignTowerLocalVarMeanAggregation(self):
+    self._skip_eager_if_gpus_less_than(1)
+    def model_fn():
+      v_sum = variable_scope.variable(
+          1.0,
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          aggregation=variable_scope.VariableAggregation.MEAN)
+      return v_sum
+
+    dist = mirrored_strategy.MirroredStrategy(
+        ["/device:GPU:0", "/device:CPU:0"])
+
+    with dist.scope():
+      tower_local_var = dist.call_for_each_tower(model_fn,
+                                                 run_concurrently=False)
+      self.assertTrue(isinstance(tower_local_var, values.TowerLocalVariable))
+      self.evaluate(variables.global_variables_initializer())
+      # Each tower has a value of 1.0 assigned to it in tower context.
+      # When we read the value using `read_var` we should see the MEAN of values
+      # on all towers which is the value assigned in tower context.
+      self.assertEqual(1.0, self.evaluate(dist.read_var(tower_local_var)))
+      tlv_ops = tower_local_var.assign(6.0)
+      self.evaluate(tlv_ops)
+      # On reading the tower local var we should get the MEAN of all values
+      # which is equal to the value assigned.
+      self.assertEqual(6.0, self.evaluate(dist.read_var(tower_local_var)))
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/distribute/python/multi_worker_strategy.py b/tensorflow/contrib/distribute/python/multi_worker_strategy.py
index 0f21a42732..cbfe5df61d 100644
--- a/tensorflow/contrib/distribute/python/multi_worker_strategy.py
+++ b/tensorflow/contrib/distribute/python/multi_worker_strategy.py
@@ -46,7 +46,7 @@ class MultiWorkerMirroredStrategy(MirroredStrategy):
     * **In-graph replication**: the `client` creates a single `tf.Graph` that
     specifies tasks for devices on all workers. The `client` then creates a
     client session which will talk to the `master` service of a `worker`. Then
-    the `master` will parition the graph and distribute the work to all
+    the `master` will partition the graph and distribute the work to all
     participating workers.
     * **Worker**: A `worker` is a TensorFlow `task` that usually maps to one
     physical machine. We will have multiple `worker`s with different `task`
diff --git a/tensorflow/contrib/distribute/python/multi_worker_test_base.py b/tensorflow/contrib/distribute/python/multi_worker_test_base.py
index f659be5f42..2063e57178 100644
--- a/tensorflow/contrib/distribute/python/multi_worker_test_base.py
+++ b/tensorflow/contrib/distribute/python/multi_worker_test_base.py
@@ -20,35 +20,62 @@ from __future__ import print_function
 
 import contextlib
 import copy
+import threading
+import numpy as np
 
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.client import session
-from tensorflow.python.eager import test
+from tensorflow.python.estimator import run_config
+from tensorflow.python.platform import test
 from tensorflow.python.framework import test_util
 
 
+def create_in_process_cluster(num_workers, num_ps):
+  """Create an in-process cluster that consists of only standard server."""
+  # Leave some memory for cuda runtime.
+  gpu_mem_frac = 0.7 / num_workers
+  worker_config = config_pb2.ConfigProto()
+  worker_config.gpu_options.per_process_gpu_memory_fraction = gpu_mem_frac
+
+  ps_config = config_pb2.ConfigProto()
+  ps_config.device_count['GPU'] = 0
+
+  # Create in-process servers. Once an in-process tensorflow server is created,
+  # there is no way to terminate it. So we create one cluster per test process.
+  # We could've started the server in another process, we could then kill that
+  # process to terminate the server. The reasons why we don't want multiple
+  # processes are
+  # 1) it is more difficult to manage these processes;
+  # 2) there is something global in CUDA such that if we initialize CUDA in the
+  # parent process, the child process cannot initialize it again and thus cannot
+  # use GPUs (https://stackoverflow.com/questions/22950047).
+  return test_util.create_local_cluster(
+      num_workers,
+      num_ps=num_ps,
+      worker_config=worker_config,
+      ps_config=ps_config,
+      protocol='grpc')
+
+
 class MultiWorkerTestBase(test.TestCase):
   """Base class for testing multi node strategy and dataset."""
 
   @classmethod
   def setUpClass(cls):
     """Create a local cluster with 2 workers."""
-    num_workers = 2
-    # Leave some memory for cuda runtime.
-    gpu_mem_frac = 0.7 / num_workers
-    default_config = config_pb2.ConfigProto()
-    default_config.gpu_options.per_process_gpu_memory_fraction = gpu_mem_frac
-
-    # The local cluster takes some portion of the local GPUs and there is no way
-    # for the cluster to terminate unless using multiple processes. Therefore,
-    # we have to only create only one cluster throughout a test process.
-    workers, _ = test_util.create_local_cluster(
-        num_workers, num_ps=0, worker_config=default_config)
-    cls._master_target = workers[0].target
+    cls._workers, cls._ps = create_in_process_cluster(num_workers=2, num_ps=0)
+
+  def setUp(self):
+    # We only cache the session in one test because another test may have a
+    # different session config or master target.
+    self._thread_local = threading.local()
+    self._thread_local.cached_session = None
+    self._result = 0
+    self._lock = threading.Lock()
 
   @contextlib.contextmanager
-  def test_session(self, graph=None, config=None):
+  def test_session(self, graph=None, config=None, target=None):
     """Create a test session with master target set to the testing cluster.
 
     This overrides the base class' method, removes arguments that are not needed
@@ -78,13 +105,46 @@ class MultiWorkerTestBase(test.TestCase):
         rewriter_config_pb2.RewriterConfig.OFF)
 
     if graph is None:
-      if self._cached_session is None:  # pylint: disable=access-member-before-definition
-        self._cached_session = session.Session(
-            graph=None, config=config, target=self._master_target)
-      sess = self._cached_session
+      if getattr(self._thread_local, 'cached_session', None) is None:
+        self._thread_local.cached_session = session.Session(
+            graph=None, config=config, target=target or self._workers[0].target)
+      sess = self._thread_local.cached_session
       with sess.graph.as_default(), sess.as_default():
         yield sess
     else:
       with session.Session(
-          graph=graph, config=config, target=self._master_target) as sess:
+          graph=graph, config=config, target=target or
+          self._workers[0].target) as sess:
         yield sess
+
+  def _run_client(self, client_fn, task_type, task_id, num_gpus, *args,
+                  **kwargs):
+    result = client_fn(task_type, task_id, num_gpus, *args, **kwargs)
+    if np.all(result):
+      with self._lock:
+        self._result += 1
+
+  def _run_between_graph_clients(self, client_fn, cluster_spec, num_gpus, *args,
+                                 **kwargs):
+    """Runs several clients for between-graph replication.
+
+    Args:
+      client_fn: a function that needs to accept `task_type`, `task_id`,
+        `num_gpus` and returns True if it succeeds.
+      cluster_spec: a dict specifying jobs in a cluster.
+      num_gpus: number of GPUs per worker.
+      *args: will be passed to `client_fn`.
+      **kwargs: will be passed to `client_fn`.
+    """
+    threads = []
+    for task_type in [run_config.TaskType.CHIEF, run_config.TaskType.WORKER]:
+      for task_id in range(len(cluster_spec.get(task_type, []))):
+        t = threading.Thread(
+            target=self._run_client,
+            args=(client_fn, task_type, task_id, num_gpus) + args,
+            kwargs=kwargs)
+        t.start()
+        threads.append(t)
+    for t in threads:
+      t.join()
+    self.assertEqual(self._result, len(threads))
diff --git a/tensorflow/contrib/distribute/python/parameter_server_strategy.py b/tensorflow/contrib/distribute/python/parameter_server_strategy.py
new file mode 100644
index 0000000000..9bcf6f8bac
--- /dev/null
+++ b/tensorflow/contrib/distribute/python/parameter_server_strategy.py
@@ -0,0 +1,355 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Classes implementing a multi-worker ps DistributionStrategy."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+import os
+
+from tensorflow.contrib.distribute.python import cross_tower_ops as cross_tower_ops_lib
+from tensorflow.contrib.distribute.python import mirrored_strategy
+from tensorflow.contrib.distribute.python import values
+from tensorflow.core.protobuf import cluster_pb2
+from tensorflow.python.framework import device as tf_device
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.training import device_setter
+from tensorflow.python.training import device_util
+from tensorflow.python.training import distribute as distribute_lib
+from tensorflow.python.training import server_lib
+from tensorflow.python.util import nest
+
+_LOCAL_CPU = "/device:CPU:0"
+_LOCAL_GPU_0 = "/device:GPU:0"
+
+
+def _normalize_cluster_spec(cluster_spec):
+  """Makes `cluster_spec` into a `ClusterSpec` object."""
+  if isinstance(cluster_spec, (dict, cluster_pb2.ClusterDef)):
+    return server_lib.ClusterSpec(cluster_spec)
+  elif not isinstance(cluster_spec, server_lib.ClusterSpec):
+    raise ValueError(
+        "`cluster_spec' should be dict or a `tf.train.ClusterSpec` or a "
+        "`tf.train.ClusterDef` object")
+  return cluster_spec
+
+
+# TODO(yuefengz): maybe cache variables on local CPU.
+# TODO(yuefengz): we may want to set session options to disallow communication
+# between workers.
+class ParameterServerStrategy(distribute_lib.DistributionStrategy):
+  """A parameter server DistributionStrategy.
+
+  This strategy class works for both local training and between-graph replicated
+  training for multiple workers. If `cluster_spec` is specified, either passed
+  in to __init__() method or parsed from the
+  ["TF_CONFIG" environment
+  variable](https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig),
+  variables and updates to those variables are assigned to parameter servers and
+  other operations are assigned to workers. If `cluster_spec` is not set, it
+  becomes local training where variables are assigned to local CPU or the only
+  GPU. When each worker has more than one GPU, operations will be replicated on
+  these GPUs. In both cases, operations are replicated but variables are not and
+  these workers share a common view for which paramater server a variable is
+  assigned to.
+
+  This class assumes between-graph replication will be used and works on a graph
+  for a particular worker.
+
+  It is expected to call `call_for_each_tower(fn, *args, **kwargs)` for any
+  operations which potentially can be replicated across towers (i.e. multiple
+  GPUs) even if there is only CPU or one GPU. When defining the `fn`, extra
+  caution needs to be taken:
+
+  1) Always use @{tf.get_variable} instead of @{tf.Variable} which is not able
+  to refer to the same variable on different towers.
+
+  2) It is generally not recommended to open a device scope under the strategy's
+  scope. A device scope (i.e. calling @{tf.device}) will be merged with or
+  override the device for operations but will not change the device for
+  variables.
+
+  3) It is also not recommended to open a colocation scope (i.e. calling
+  @{tf.colocate_with}) under the strategy's scope. For colocating variables,
+  use `distribution.colocate_vars_with` instead. Colocation of ops will possibly
+  create conflicts of device assignement.
+  """
+
+  def __init__(self,
+               num_gpus_per_worker=0,
+               cluster_spec=None,
+               task_type=None,
+               task_id=None):
+    """Initiailizes this strategy.
+
+    Args:
+      num_gpus_per_worker: number of local GPUs or GPUs per worker.
+      cluster_spec: a dict, ClusterDef or ClusterSpec object specifying the
+        cluster configurations.
+      task_type: the current task type.
+      task_id: the current task id.
+    """
+    super(ParameterServerStrategy, self).__init__()
+    self._num_gpus_per_worker = num_gpus_per_worker
+    if cluster_spec:
+      cluster_spec = _normalize_cluster_spec(cluster_spec)
+    self._cluster_spec = cluster_spec
+
+    # We typically don't need to do all-reduce in this strategy.
+    self._cross_tower_ops = (
+        cross_tower_ops_lib.ReductionToOneDeviceCrossTowerOps(
+            reduce_to_device=_LOCAL_CPU))
+
+    self._initialize_devices(num_gpus_per_worker, cluster_spec, task_type,
+                             task_id)
+
+  def _initialize_devices(self, num_gpus_per_worker, cluster_spec, task_type,
+                          task_id):
+    """Initialize internal devices.
+
+    It creates variable devices and compute devices. Variables and operations
+    will be assigned to them respectively. We have one compute device per tower.
+    The variable device is a device function or device string. The default
+    variable device assigns variables to parameter servers in a round-robin
+    fashion.
+
+    Args:
+      num_gpus_per_worker: number of local GPUs or GPUs per worker.
+      cluster_spec: a dict, ClusterDef or ClusterSpec object specifying the
+        cluster configurations.
+      task_type: the current task type.
+      task_id: the current task id.
+
+    Raises:
+      ValueError: if the cluster_spec doesn't have ps jobs.
+    """
+    self._task_type = task_type or "worker"
+    self._task_id = task_id or 0
+    self._worker_device = "/job:%s/task:%d" % (self._task_type, self._task_id)
+
+    # TODO(yuefengz): maybe clearer to split it into two classes, one for
+    # the distribuetd case and one for the local case, once we have the factory
+    # class/method.
+
+    # Define compute devices which is a list of device strings and one for each
+    # tower. When there are GPUs, replicate operations on these GPUs. Otherwise,
+    # place operations on CPU.
+    if cluster_spec is None:
+      # Local mode.
+      if num_gpus_per_worker > 0:
+        self._compute_devices = list(
+            map("/device:GPU:{}".format, range(num_gpus_per_worker)))
+      else:
+        self._compute_devices = [_LOCAL_CPU]
+    else:
+      # Distributed mode.
+      if num_gpus_per_worker > 0:
+        self._compute_devices = [
+            "%s/device:GPU:%d" % (self._worker_device, i)
+            for i in range(num_gpus_per_worker)
+        ]
+      else:
+        self._compute_devices = [self._worker_device]
+
+    self._compute_devices = list(
+        map(device_util.resolve, self._compute_devices))
+    self._canonical_compute_device_set = set(self._compute_devices)
+
+    # Define variable device which is a device string in the local case and a
+    # device function in the distributed case. It is used to open a device scope
+    # where varibles are defined.
+    # The `_parameter_devices` is needed for the `parameter_devices` property
+    # and is a list of all variable devices.
+    if cluster_spec is None:
+      # Local mode. If there is only one GPU, put everything on that GPU.
+      # Otherwise, place variables on CPU.
+      if num_gpus_per_worker == 1:
+        assert len(list(self._compute_devices)) == 1
+        self._variable_device = _LOCAL_GPU_0
+        self._parameter_devices = [_LOCAL_GPU_0]
+      else:
+        self._variable_device = _LOCAL_CPU
+        self._parameter_devices = [_LOCAL_CPU]
+    else:
+      # Distributed mode. Place variables on ps jobs in a round-robin fashion.
+      # Note that devices returned from `replica_device_setter` are not
+      # canonical and therefore we don't canonicalize all variable devices to
+      # make them consistent.
+      # TODO(yuefengz): support passing a strategy object to control variable
+      # assignment.
+      # TODO(yuefengz): merge the logic of replica_device_setter into this
+      # class.
+      num_ps_replicas = len(cluster_spec.as_dict().get("ps", []))
+      if num_ps_replicas == 0:
+        raise ValueError("The cluster spec needs to have `ps` jobs.")
+      self._variable_device = device_setter.replica_device_setter(
+          ps_tasks=num_ps_replicas,
+          worker_device=self._worker_device,
+          merge_devices=True,
+          cluster=cluster_spec)
+
+      # Parameter devices are all tasks of the "ps" job.
+      self._parameter_devices = map("/job:ps/task:{}".format,
+                                    range(num_ps_replicas))
+
+    # Define the default device in cross-tower mode. In the distributed case, we
+    # set the default device to the corresponding worker to prevent these ops
+    # from being placed on other workers.
+    if cluster_spec is None:
+      self._default_device = None
+    else:
+      self._default_device = self._worker_device
+
+  def distribute_dataset(self, dataset_fn):
+    """Distributes the dataset to each local GPU."""
+    return values.PerDeviceDataset(
+        self._call_dataset_fn(dataset_fn), self._compute_devices, True)
+
+  def _broadcast(self, tensor, destinations):
+    if not cross_tower_ops_lib.check_destinations(destinations):
+      destinations = self._compute_devices
+    return self._cross_tower_ops.broadcast(tensor, destinations)
+
+  # TODO(yuefengz): not all ops in device_setter.STANDARD_PS_OPS will go through
+  # this creator, such as "MutableHashTable".
+  def _create_variable(self, next_creator, *args, **kwargs):
+    if "colocate_with" in kwargs:
+      with ops.device(None):
+        with ops.colocate_with(kwargs["colocate_with"]):
+          return next_creator(*args, **kwargs)
+
+    with ops.colocate_with(None, ignore_existing=True):
+      with ops.device(self._variable_device):
+        return next_creator(*args, **kwargs)
+
+  def _call_for_each_tower(self, fn, *args, **kwargs):
+    # pylint: disable=protected-access
+    return mirrored_strategy._call_for_each_tower(self, fn, *args, **kwargs)
+
+  def _verify_destinations_not_different_worker(self, destinations):
+    if destinations is None:
+      return
+    for d in cross_tower_ops_lib.get_devices_from(destinations):
+      d_spec = tf_device.DeviceSpec.from_string(d)
+      if d_spec.job == self._task_type and d_spec.task != self._task_id:
+        raise ValueError(
+            "Cannot reduce to another worker: %r, current worker is %r" %
+            (d, self._worker_device))
+
+  def _reduce(self, aggregation, value, destinations):
+    self._verify_destinations_not_different_worker(destinations)
+    if not isinstance(value, values.DistributedValues):
+      # pylint: disable=protected-access
+      return mirrored_strategy._reduce_non_distributed_value(
+          self, aggregation, value, destinations)
+
+    return self._cross_tower_ops.reduce(
+        aggregation, value, destinations=destinations)
+
+  def _batch_reduce(self, aggregation, value_destination_pairs):
+    for _, destinations in value_destination_pairs:
+      self._verify_destinations_not_different_worker(destinations)
+    return self._cross_tower_ops.batch_reduce(aggregation,
+                                              value_destination_pairs)
+
+  def _select_single_value(self, structured):
+    """Select any single values in `structured`."""
+
+    def _select_fn(x):  # pylint: disable=g-missing-docstring
+      if isinstance(x, values.Mirrored):
+        if len(x.devices) == 1:
+          return list(x._index.values())[0]  # pylint: disable=protected-access
+        else:
+          raise ValueError(
+              "You cannot update variable with a Mirrored object with multiple "
+              "components %r when using ParameterServerStrategy. You must "
+              "specify a single value or a Mirrored with a single value." % x)
+      elif isinstance(x, values.PerDevice):
+        raise ValueError(
+            "You cannot update variable with a PerDevice object %r when using "
+            "ParameterServerStrategy. You must specify a single value or a "
+            "Mirrored with a single value" % x)
+      else:
+        return x
+
+    return nest.map_structure(_select_fn, structured)
+
+  def _update(self, var, fn, *args, **kwargs):
+    if not isinstance(var, resource_variable_ops.ResourceVariable):
+      raise ValueError(
+          "You can not update `var` %r. It must be a Variable." % var)
+    with ops.colocate_with(var), distribute_lib.UpdateContext(var.device):
+      return fn(var, *self._select_single_value(args),
+                **self._select_single_value(kwargs))
+
+  # TODO(yuefengz): does it need to call _select_single_value?
+  def _update_non_slot(self, colocate_with, fn, *args, **kwargs):
+    with ops.device(
+        colocate_with.device), distribute_lib.UpdateContext(colocate_with):
+      return fn(*args, **kwargs)
+
+  def _unwrap(self, val):
+    if isinstance(val, values.DistributedValues):
+      # Return in a deterministic order.
+      if set(val.devices) == self._canonical_compute_device_set:
+        return [val.get(device=d) for d in self._compute_devices]
+      return [val.get(device=d) for d in sorted(val.devices)]
+    return [val]
+
+  def read_var(self, var):
+    # No need to distinguish between normal variables and tower-local variables.
+    return array_ops.identity(var)
+
+  def configure(self, session_config=None):
+    del session_config
+
+    # Use TF_CONFIG to get the cluster spec and the current job.
+    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
+    cluster_spec = _normalize_cluster_spec(tf_config.get("cluster", {}))
+
+    task_env = tf_config.get("task", {})
+    if task_env:
+      task_type = task_env.get("type", "worker")
+      task_id = int(task_env.get("index", "0"))
+    else:
+      task_type = "worker"
+      task_id = None
+
+    # Set the devices if cluster_spec is defined in TF_CONFIG but not passed in
+    # the constructor.
+    if not self._cluster_spec and cluster_spec:
+      self._cluster_spec = cluster_spec
+      self._initialize_devices(self._num_gpus_per_worker, cluster_spec,
+                               task_type, task_id)
+
+  @property
+  def num_towers(self):
+    return len(self._compute_devices)
+
+  @property
+  def worker_devices(self):
+    # Make a copy to prevent users from accidentally mutating our copy.
+    return list(self._compute_devices)
+
+  @property
+  def parameter_devices(self):
+    return list(self._parameter_devices)
+
+  def non_slot_devices(self, var_list):
+    return min(var_list, key=lambda x: x.name)
diff --git a/tensorflow/contrib/distribute/python/parameter_server_strategy_test.py b/tensorflow/contrib/distribute/python/parameter_server_strategy_test.py
new file mode 100644
index 0000000000..cf29c0ed91
--- /dev/null
+++ b/tensorflow/contrib/distribute/python/parameter_server_strategy_test.py
@@ -0,0 +1,430 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for ParameterServerStrategy."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+import threading
+from absl.testing import parameterized
+
+from tensorflow.contrib.distribute.python import combinations
+from tensorflow.contrib.distribute.python import multi_worker_test_base
+from tensorflow.contrib.distribute.python import parameter_server_strategy
+from tensorflow.python.eager import context
+from tensorflow.python.estimator import run_config
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.layers import core
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import gradients
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import device_util
+from tensorflow.python.training import distribute as distribute_lib
+
+
+class ParameterServerStrategyTest(multi_worker_test_base.MultiWorkerTestBase,
+                                  parameterized.TestCase):
+
+  @classmethod
+  def setUpClass(cls):
+    cls._workers, cls._ps = multi_worker_test_base.create_in_process_cluster(
+        num_workers=3, num_ps=2)
+    cls._cluster_spec = {
+        run_config.TaskType.WORKER: [
+            'fake_worker_0', 'fake_worker_1', 'fake_worker_2'
+        ],
+        run_config.TaskType.PS: ['fake_ps_0', 'fake_ps_1']
+    }
+
+  def setUp(self):
+    self._result = 0
+    self._lock = threading.Lock()
+    self._init_condition = threading.Condition()
+    self._init_reached = 0
+    self._finish_condition = threading.Condition()
+    self._finish_reached = 0
+    super(ParameterServerStrategyTest, self).setUp()
+
+  def _get_test_objects(self, task_type, task_id, num_gpus):
+    distribution = parameter_server_strategy.ParameterServerStrategy(
+        num_gpus_per_worker=num_gpus)
+    if not task_type:
+      return distribution, ''
+
+    tf_config = {
+        'cluster': self._cluster_spec,
+        'task': {
+            'type': task_type,
+            'index': task_id
+        }
+    }
+    with self._lock:
+      # Accessing environment variables should be protected by locks because
+      # environment variables are shared by all threads.
+      with test.mock.patch.dict('os.environ',
+                                {'TF_CONFIG': json.dumps(tf_config)}):
+        distribution.configure()
+    return distribution, self._workers[task_id].target
+
+  def _test_device_assignment_distributed(self, task_type, task_id, num_gpus):
+    worker_device = '/job:%s/replica:0/task:%d' % (task_type, task_id)
+    d, _ = self._get_test_objects(task_type, task_id, num_gpus)
+    with ops.Graph().as_default(), \
+         self.test_session(target=self._workers[0].target) as sess, \
+         d.scope():
+
+      # Define a variable outside the call_for_each_tower scope. This is not
+      # recommended.
+      n = variable_scope.get_variable('n', initializer=10.0)
+      self.assertEqual(n.device, '/job:ps/task:0')
+
+      def model_fn():
+        if num_gpus == 0:
+          last_part_device = 'device:CPU:0'
+        else:
+          last_part_device = (
+              'device:GPU:%d' % distribute_lib.get_tower_context().tower_id)
+
+        a = constant_op.constant(1.0)
+        b = constant_op.constant(2.0)
+        c = a + b
+        self.assertEqual(a.device, worker_device + '/' + last_part_device)
+        self.assertEqual(b.device, worker_device + '/' + last_part_device)
+        self.assertEqual(c.device, worker_device + '/' + last_part_device)
+
+        # The device scope is ignored for variables but not for normal ops.
+        with ops.device('/job:worker/task:0'):
+          x = variable_scope.get_variable('x', initializer=10.0)
+          x_add = x.assign_add(c)
+          e = a + c
+        # The variable x is on the task 1 since the device_function has been
+        # called once before the model_fn.
+        self.assertEqual(x.device, '/job:ps/task:1')
+        self.assertEqual(x_add.device, x.device)
+        self.assertEqual(e.device,
+                         '/job:worker/replica:0/task:0/%s' % last_part_device)
+
+        # The colocate_vars_with can override the distribution's device.
+        with d.colocate_vars_with(x):
+          y = variable_scope.get_variable('y', initializer=20.0)
+        y_add = y.assign_add(x_add)
+        self.assertEqual(y.device, '/job:ps/task:1')
+        self.assertEqual(y_add.device, y.device)
+        self.assertEqual(y.device, x.device)
+
+        z = variable_scope.get_variable('z', initializer=10.0)
+        self.assertEqual(z.device, '/job:ps/task:0')
+        self.assertNotEqual(z.device, x.device)
+
+        with ops.control_dependencies([y_add]):
+          z_add = z.assign_add(y)
+        with ops.control_dependencies([z_add]):
+          f = z + c
+        self.assertEqual(f.device, worker_device + '/' + last_part_device)
+
+        # The device scope would merge with the default worker device.
+        with ops.device('/CPU:1'):
+          g = e + 1.0
+        self.assertEqual(g.device, worker_device + '/device:CPU:1')
+
+        # Ths ops.colocate_with will be ignored when defining a variale but not
+        # for a normal tensor.
+        with ops.colocate_with(x):
+          u = variable_scope.get_variable('u', initializer=30.0)
+          v = variable_scope.get_variable('v', initializer=30.0)
+          h = f + 1.0
+        self.assertIn('/job:ps/', u.device)
+        self.assertIn('/job:ps/', v.device)
+        # u and v are on different parameter servers.
+        self.assertTrue(u.device != x.device or v.device != x.device)
+        self.assertTrue(u.device == x.device or v.device == x.device)
+        # Here h is not on one worker. Note h.device is canonical while x.device
+        # is not but.
+        self.assertIn('/job:ps/', h.device)
+        return y_add, z_add, f
+
+      y, z, f = d.call_for_each_tower(model_fn)
+      self.assertNotEqual(y, None)
+      self.assertNotEqual(z, None)
+      self.assertNotEqual(f, None)
+
+      if context.num_gpus() >= 1 and num_gpus <= 1:
+        variables.global_variables_initializer().run()
+        y_val, z_val, f_val = sess.run([y, z, f])
+        self.assertEqual(y_val, 33.0)
+        self.assertEqual(z_val, 43.0)
+        self.assertEqual(f_val, 46.0)
+
+  @combinations.generate(
+      combinations.combine(mode=['graph'], num_gpus=[0, 1, 2]))
+  def testDeviceAssignmentDistributed(self, num_gpus):
+    self._test_device_assignment_distributed('worker', 1, num_gpus)
+
+  def _test_device_assignment_local(self,
+                                    d,
+                                    compute_device='CPU',
+                                    variable_device='CPU',
+                                    num_gpus=0):
+    with ops.Graph().as_default(), \
+         self.test_session(target=self._workers[0].target) as sess, \
+         d.scope():
+
+      def model_fn():
+        if 'CPU' in compute_device:
+          tower_compute_device = '/device:CPU:0'
+        else:
+          tower_compute_device = (
+              '/device:GPU:%d' % distribute_lib.get_tower_context().tower_id)
+        tower_compute_device = device_util.canonicalize(tower_compute_device)
+
+        if 'CPU' in variable_device:
+          tower_variable_device = '/device:CPU:0'
+        else:
+          tower_variable_device = (
+              '/device:GPU:%d' % distribute_lib.get_tower_context().tower_id)
+        tower_variable_device = device_util.canonicalize(tower_variable_device)
+
+        a = constant_op.constant(1.0)
+        b = constant_op.constant(2.0)
+        c = a + b
+        self.assertEqual(a.device, tower_compute_device)
+        self.assertEqual(b.device, tower_compute_device)
+        self.assertEqual(c.device, tower_compute_device)
+
+        # The device scope is ignored for variables but not for normal ops.
+        with ops.device('/device:GPU:2'):
+          x = variable_scope.get_variable('x', initializer=10.0)
+          x_add = x.assign_add(c)
+          e = a + c
+        self.assertEqual(
+            device_util.canonicalize(x.device), tower_variable_device)
+        self.assertEqual(x_add.device, x.device)
+        self.assertEqual(e.device, device_util.canonicalize('/device:GPU:2'))
+
+        # The colocate_vars_with can override the distribution's device.
+        with d.colocate_vars_with(x):
+          y = variable_scope.get_variable('y', initializer=20.0)
+        y_add = y.assign_add(x_add)
+        self.assertEqual(
+            device_util.canonicalize(y.device), tower_variable_device)
+        self.assertEqual(y_add.device, y.device)
+        self.assertEqual(y.device, x.device)
+
+        z = variable_scope.get_variable('z', initializer=10.0)
+        self.assertEqual(
+            device_util.canonicalize(z.device), tower_variable_device)
+
+        with ops.control_dependencies([y_add]):
+          z_add = z.assign_add(y)
+        with ops.control_dependencies([z_add]):
+          f = z + c
+        self.assertEqual(f.device, tower_compute_device)
+
+        # The device scope would merge with the default worker device.
+        with ops.device('/CPU:1'):
+          g = e + 1.0
+        self.assertEqual(g.device, device_util.canonicalize('/device:CPU:1'))
+
+        # Ths ops.colocate_with will be ignored when defining a variale but not
+        # for a normal tensor.
+        with ops.colocate_with(x):
+          u = variable_scope.get_variable('u', initializer=30.0)
+          h = f + 1.0
+        self.assertEqual(
+            device_util.canonicalize(u.device), tower_variable_device)
+        self.assertEqual(device_util.canonicalize(x.device), h.device)
+        return y_add, z_add, f
+
+      y, z, f = d.call_for_each_tower(model_fn)
+      self.assertNotEqual(y, None)
+      self.assertNotEqual(z, None)
+      self.assertNotEqual(f, None)
+
+      if context.num_gpus() >= 1 and num_gpus <= 1:
+        variables.global_variables_initializer().run()
+        y_val, z_val, f_val = sess.run([y, z, f])
+        self.assertEqual(y_val, 33.0)
+        self.assertEqual(z_val, 43.0)
+        self.assertEqual(f_val, 46.0)
+
+  def testDeviceAssignmentLocalCPU(self):
+    distribution = parameter_server_strategy.ParameterServerStrategy(
+        num_gpus_per_worker=0)
+    self._test_device_assignment_local(
+        distribution, compute_device='CPU', variable_device='CPU', num_gpus=0)
+
+  def testDeviceAssignmentLocalOneGPU(self):
+    distribution = parameter_server_strategy.ParameterServerStrategy(
+        num_gpus_per_worker=1)
+    self._test_device_assignment_local(
+        distribution, compute_device='GPU', variable_device='GPU', num_gpus=1)
+
+  def testDeviceAssignmentLocalTwoGPUs(self):
+    distribution = parameter_server_strategy.ParameterServerStrategy(
+        num_gpus_per_worker=2)
+    self._test_device_assignment_local(
+        distribution, compute_device='GPU', variable_device='CPU', num_gpus=2)
+
+  def _test_simple_increment(self, task_type, task_id, num_gpus):
+    d, master_target = self._get_test_objects(task_type, task_id, num_gpus)
+    if hasattr(d, '_cluster_spec') and d._cluster_spec:
+      num_workers = len(d._cluster_spec.as_dict().get('worker',
+                                                      ['dummy_worker']))
+    else:
+      num_workers = 1
+    with ops.Graph().as_default(), \
+         self.test_session(target=master_target) as sess, \
+         d.scope():
+
+      def model_fn():
+        x = variable_scope.get_variable('x', initializer=10.0)
+        y = variable_scope.get_variable('y', initializer=20.0)
+
+        x_add = x.assign_add(1.0, use_locking=True)
+        y_add = y.assign_add(1.0, use_locking=True)
+
+        train_op = control_flow_ops.group([x_add, y_add])
+        return x, y, train_op
+
+      x, y, train_op = d.call_for_each_tower(model_fn)
+      train_op = d.group(d.unwrap(train_op))
+
+      if context.num_gpus() < d._num_gpus_per_worker:
+        return True
+
+      if task_id == 0:
+        variables.global_variables_initializer().run()
+
+      # Workers waiting for chief worker's initializing variables.
+      self._init_condition.acquire()
+      self._init_reached += 1
+      while self._init_reached != num_workers:
+        self._init_condition.wait()
+      self._init_condition.notify_all()
+      self._init_condition.release()
+
+      sess.run(train_op)
+
+      # Wait for other workers to finish training.
+      self._finish_condition.acquire()
+      self._finish_reached += 1
+      while self._finish_reached != num_workers:
+        self._finish_condition.wait()
+      self._finish_condition.notify_all()
+      self._finish_condition.release()
+
+      x_val, y_val = sess.run([x, y])
+      self.assertEqual(x_val, 10.0 + 1.0 * num_workers * d.num_towers)
+      self.assertEqual(y_val, 20.0 + 1.0 * num_workers * d.num_towers)
+      return (x_val == 10.0 + 1.0 * num_workers * d.num_towers and
+              y_val == 20.0 + 1.0 * num_workers * d.num_towers)
+
+  def _test_minimize_loss_graph(self, task_type, task_id, num_gpus):
+    d, master_target = self._get_test_objects(task_type, task_id, num_gpus)
+    with ops.Graph().as_default(), \
+         self.test_session(target=master_target) as sess, \
+         d.scope():
+      l = core.Dense(1, use_bias=False)
+
+      def loss_fn(x):
+        y = array_ops.reshape(l(x), []) - constant_op.constant(1.)
+        return y * y
+
+      # TODO(yuefengz, apassos): eager.backprop.implicit_grad is not safe for
+      # multiple graphs (b/111216820).
+      def grad_fn(x):
+        loss = loss_fn(x)
+        var_list = (
+            variables.trainable_variables() + ops.get_collection(
+                ops.GraphKeys.TRAINABLE_RESOURCE_VARIABLES))
+        grads = gradients.gradients(loss, var_list)
+        ret = list(zip(grads, var_list))
+        return ret
+
+      def update(v, g):
+        return v.assign_sub(0.05 * g, use_locking=True)
+
+      one = d.broadcast(constant_op.constant([[1.]]))
+
+      def step():
+        """Perform one optimization step."""
+        # Run forward & backward to get gradients, variables list.
+        g_v = d.call_for_each_tower(grad_fn, one)
+        # Update the variables using the gradients and the update() function.
+        before_list = []
+        after_list = []
+        for g, v in g_v:
+          fetched = d.read_var(v)
+          before_list.append(fetched)
+          with ops.control_dependencies([fetched]):
+            # TODO(yuefengz): support non-Mirrored variable as destinations.
+            g = d.reduce(
+                variable_scope.VariableAggregation.SUM, g, destinations=v)
+            with ops.control_dependencies(d.unwrap(d.update(v, update, g))):
+              after_list.append(d.read_var(v))
+        return before_list, after_list
+
+      before_out, after_out = step()
+
+      if context.num_gpus() < d._num_gpus_per_worker:
+        return True
+
+      if task_id == 0:
+        variables.global_variables_initializer().run()
+
+      # Workers waiting for chief worker's initializing variables.
+      self._init_condition.acquire()
+      self._init_reached += 1
+      while self._init_reached != 3:
+        self._init_condition.wait()
+      self._init_condition.notify_all()
+      self._init_condition.release()
+
+      for i in range(10):
+        b, a = sess.run((before_out, after_out))
+        if i == 0:
+          before, = b
+        after, = a
+
+      error_before = abs(before - 1)
+      error_after = abs(after - 1)
+      # Error should go down
+      self.assertLess(error_after, error_before)
+      return error_after < error_before
+
+  def testSimpleBetweenGraph(self):
+    self._run_between_graph_clients(self._test_simple_increment,
+                                    self._cluster_spec, 0)
+
+  @combinations.generate(
+      combinations.combine(mode=['graph'], num_gpus=[0, 1, 2]))
+  def testLocalSimpleIncrement(self, num_gpus):
+    self._test_simple_increment(None, 0, num_gpus)
+
+  @combinations.generate(
+      combinations.combine(mode=['graph'], num_gpus=[0, 1, 2]))
+  def testMinimizeLossGraph(self, num_gpus):
+    self._run_between_graph_clients(self._test_minimize_loss_graph,
+                                    self._cluster_spec, num_gpus)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/distribute/python/tpu_strategy.py b/tensorflow/contrib/distribute/python/tpu_strategy.py
index bc53898539..f5497e0b21 100644
--- a/tensorflow/contrib/distribute/python/tpu_strategy.py
+++ b/tensorflow/contrib/distribute/python/tpu_strategy.py
@@ -21,15 +21,19 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib import tpu
+from tensorflow.contrib.distribute.python import cross_tower_ops as cross_tower_ops_lib
 from tensorflow.contrib.distribute.python import one_device_strategy
 from tensorflow.contrib.distribute.python import values
 from tensorflow.contrib.tpu.python.ops import tpu_ops
+from tensorflow.contrib.tpu.python.tpu import tpu
+from tensorflow.contrib.tpu.python.tpu import training_loop
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variable_scope as vs
+from tensorflow.python.training import device_util
 from tensorflow.python.util import nest
 
 
@@ -39,11 +43,11 @@ class TPUStrategy(one_device_strategy.OneDeviceStrategy):
   def __init__(self, num_cores_per_host=2):
     # TODO(isaprykin): Generalize the defaults.  They are currently tailored for
     # the unit test.
-    super(TPUStrategy, self).__init__('/cpu:0')
+    super(TPUStrategy, self).__init__('/device:CPU:0')
     # TODO(isaprykin): Auto-detect number of cores and hosts.
     self._num_cores_per_host = num_cores_per_host
     # TODO(priyag): This should not be hardcoded here.
-    self._host = '/task:0/device:CPU:0'
+    self._host = '/device:CPU:0'
 
   def distribute_dataset(self, dataset_fn):
     # TODO(priyag): Perhaps distribute across cores here.
@@ -54,7 +58,7 @@ class TPUStrategy(one_device_strategy.OneDeviceStrategy):
   # a mechanism to infer the outputs of `fn`. Pending b/110550782.
   def _run_steps_on_dataset(self, fn, iterator, iterations,
                             initial_loop_values=None):
-    # Enqueue ops
+
     shapes = nest.flatten(iterator.output_shapes)
     if any([not s.is_fully_defined() for s in shapes]):
       raise ValueError(
@@ -93,9 +97,8 @@ class TPUStrategy(one_device_strategy.OneDeviceStrategy):
           [constant_op.constant(0)],
           parallel_iterations=1)
 
-    # Dequeue ops
     def dequeue_fn():
-      dequeued = tpu.infeed_dequeue_tuple(dtypes=types, shapes=shapes)
+      dequeued = tpu_ops.infeed_dequeue_tuple(dtypes=types, shapes=shapes)
       return nest.pack_sequence_as(iterator.output_shapes, dequeued)
 
     # Wrap `fn` for repeat.
@@ -110,17 +113,14 @@ class TPUStrategy(one_device_strategy.OneDeviceStrategy):
       with ops.control_dependencies([fn_result]):
         return array_ops.identity(ctx.last_step_outputs)
 
-    # Repeat
     # TODO(sourabhbajaj): The input to while loop should be based on the output
     # type of the step_fn
     def iterate_on_tpu():
-      return tpu.repeat(iterations, run_fn, [initial_loop_values])
+      return training_loop.repeat(iterations, run_fn, [initial_loop_values])
 
-    # Re-write and distribute computation.
-    # TODO(sourabhbajaj): Convert the output to PerDevice variable and
-    # implement support for that in reduce.
-    last_step_tensor_outputs = tpu.batch_parallel(
-        iterate_on_tpu, [], num_shards=self._num_cores_per_host)
+    replicate_inputs = [[]] * self._num_cores_per_host
+    outputs = tpu.replicate(iterate_on_tpu, replicate_inputs)
+    last_step_tensor_outputs = [list(x) for x in zip(*outputs)]
 
     # Take index [0] of last_step_tensor_outputs as we wrapped
     # initial_loop_values in a list in the `repeat` call.
@@ -139,11 +139,32 @@ class TPUStrategy(one_device_strategy.OneDeviceStrategy):
     return [tpu.shutdown_system()]
 
   def _reduce(self, aggregation, value, destinations):
-    del destinations  # TPU is graph mode only.  Rely on implicit Send/Recv.
+    graph = ops.get_default_graph()
+    context = graph._get_control_flow_context()  # pylint: disable=protected-access
+    # If we're inside the ReplicateContext, reduction should be done using
+    # CrossReplicaSum while outside we can directly use an add_n op.
+    while context:
+      if isinstance(context, tpu.TPUReplicateContext):
+        if aggregation == vs.VariableAggregation.MEAN:
+          # TODO(jhseu):  Revisit once we support model-parallelism.
+          value *= (1. / self._num_cores_per_host)
+        return tpu_ops.cross_replica_sum(value)
+      context = context.outer_context
+
+    # Validate that the destination is same as the host device
+    # Note we don't do this when in replicate context as the reduction is
+    # performed on the TPU device itself.
+    devices = cross_tower_ops_lib.get_devices_from(destinations)
+    if len(devices) == 1:
+      assert device_util.canonicalize(devices[0]) == device_util.canonicalize(
+          self._host)
+    else:
+      raise ValueError('Multiple devices are not supported for TPUStrategy')
+
+    output = math_ops.add_n(value)
     if aggregation == vs.VariableAggregation.MEAN:
-      # TODO(jhseu):  Revisit once we support model-parallelism.
-      value *= (1. / self._num_cores_per_host)
-    return tpu_ops.cross_replica_sum(value)
+      return output * (1. / len(value))
+    return output
 
   @property
   def num_towers(self):
diff --git a/tensorflow/contrib/distribute/python/values.py b/tensorflow/contrib/distribute/python/values.py
index b36ac563d2..f4da91a8ac 100644
--- a/tensorflow/contrib/distribute/python/values.py
+++ b/tensorflow/contrib/distribute/python/values.py
@@ -30,10 +30,10 @@ from tensorflow.contrib.distribute.python import prefetching_ops_v2
 from tensorflow.python.eager import context
 from tensorflow.python.framework import device as tf_device
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.training import device_util
 from tensorflow.python.training import distribute as distribute_lib
@@ -78,6 +78,13 @@ class DistributedValues(object):
   def devices(self):
     return list(self._index.keys())
 
+  @property
+  def is_tensor_like(self):
+    for v in self._index.values():
+      if not tensor_util.is_tensor(v):
+        return False
+    return True
+
   def __str__(self):
     return "%s:%s" % (self.__class__.__name__, self._index)
 
@@ -197,11 +204,54 @@ class DistributedVariable(DistributedDelegate):
     # to the container without introducing a reference cycle.
     for v in six.itervalues(index):
       v._distributed_container = weakref.ref(self)  # pylint: disable=protected-access
+    # tf.keras keeps track of variables initialized using this attribute. When
+    # tf.keras gets the default session, it initializes all uninitialized vars.
+    # We need to make _keras_initialized a member of DistributedVariable because
+    # without this it will use `__getattr__` which will delegate to a component
+    # variable.
+    self._keras_initialized = False
+    # Typically, a `DistributedVariable`'s initializer is composed of the
+    # initializers of the components variables. However, in some cases, such as
+    # when restoring from a checkpoint, we may set the _initializer_op
+    # property on the entire `DistributedVariable`.
+    self._initializer_op = None
     super(DistributedVariable, self).__init__(index)
 
+  def is_initialized(self, name=None):
+    """Identifies if all the component variables are initialized.
+
+    Args:
+      name: Name of the final `logical_and` op.
+
+    Returns:
+      The op that evaluates to True or False depending on if all the
+      component variables are initialized.
+    """
+    # We have to cast the self._index.values() to a `list` because when we
+    # use `model_to_estimator` to run tf.keras models, self._index.values() is
+    # of type `dict_values` and not `list`.
+    values_list = list(self._index.values())
+    result = values_list[0].is_initialized()
+    # We iterate through the list of values except the last one to allow us to
+    # name the final `logical_and` op the same name that is passed by the user
+    # to the `is_initialized` op. For distributed variables, the
+    # `is_initialized` op is a `logical_and` op.
+    for v in values_list[1:-1]:
+      result = math_ops.logical_and(result, v.is_initialized())
+    result = math_ops.logical_and(result, values_list[-1].is_initialized(),
+                                  name=name)
+    return result
+
   @property
   def initializer(self):
-    return control_flow_ops.group([v.initializer for v in self._index.values()])
+    if self._initializer_op:
+      init_op = self._initializer_op
+    else:
+      # return grouped ops of all the var initializations of component values of
+      # the mirrored variable
+      init_op = control_flow_ops.group(
+          [v.initializer for v in self._index.values()])
+    return init_op
 
   @property
   def graph(self):
@@ -244,6 +294,9 @@ class DistributedVariable(DistributedDelegate):
                               self._primary_var.op.type)
     return self.get().op
 
+  def read_value(self):
+    return distribute_lib.get_distribution_strategy().read_var(self)
+
   def _should_act_as_resource_variable(self):
     """Pass resource_variable_ops.is_resource_variable check."""
     pass
@@ -320,6 +373,7 @@ class MirroredVariable(DistributedVariable, Mirrored,
       return distribute_lib.get_distribution_strategy().update(
           self, f, *args, **kwargs)
     else:
+      _assert_tower_context()
       # We are calling an assign function on the mirrored variable in tower
       # context.
       # We reduce the value we want to assign/add/sub. More details about how we
@@ -330,23 +384,27 @@ class MirroredVariable(DistributedVariable, Mirrored,
         raise ValueError("You must specify an aggregation method to update a "
                          "MirroredVariable in Tower Context.")
 
-      def merge_fn(strategy, value):
+      def merge_fn(strategy, value, *other_args, **other_kwargs):
         return strategy.update(
             self, f,
             strategy.reduce(
-                aggregation=self._aggregation, value=value, destinations=self))
+                aggregation=self._aggregation, value=value, destinations=self),
+            *other_args, **other_kwargs)
 
       return distribute_lib.get_tower_context().merge_call(merge_fn, *args,
                                                            **kwargs)
 
   def assign_sub(self, *args, **kwargs):
-    return self._assign_func(f=state_ops.assign_sub, *args, **kwargs)
+    assign_sub_fn = lambda var, *a, **kw: var.assign_sub(*a, **kw)
+    return self._assign_func(f=assign_sub_fn, *args, **kwargs)
 
   def assign_add(self, *args, **kwargs):
-    return self._assign_func(f=state_ops.assign_add, *args, **kwargs)
+    assign_add_fn = lambda var, *a, **kw: var.assign_add(*a, **kw)
+    return self._assign_func(f=assign_add_fn, *args, **kwargs)
 
   def assign(self, *args, **kwargs):
-    return self._assign_func(f=state_ops.assign, *args, **kwargs)
+    assign_fn = lambda var, *a, **kw: var.assign(*a, **kw)
+    return self._assign_func(f=assign_fn, *args, **kwargs)
 
   @property
   def aggregation(self):
@@ -412,14 +470,7 @@ class _TowerLocalSaveable(saver.BaseSaverBuilder.SaveableObject):
   def restore(self, restored_tensors, restored_shapes):
     """Restore the same value into all variables."""
     tensor, = restored_tensors
-    # To preserve the sum across save and restore, we have to divide the
-    # total across all devices when restoring a variable that was summed
-    # when saving.
-    if self._tower_local_variable.aggregation == vs.VariableAggregation.SUM:
-      tensor *= 1. / len(self._tower_local_variable.devices)
-    return control_flow_ops.group([
-        _assign_on_device(d, v, tensor)
-        for d, v in six.iteritems(self._tower_local_variable._index)])  # pylint: disable=protected-access
+    return self._tower_local_variable.assign(tensor)
 
 
 def _assert_tower_context():
@@ -446,8 +497,19 @@ class TowerLocalVariable(DistributedVariable, PerDevice,
     return self.get().assign_add(*args, **kwargs)
 
   def assign(self, *args, **kwargs):
-    _assert_tower_context()
-    return self.get().assign(*args, **kwargs)
+    if distribute_lib.get_cross_tower_context():
+      # To preserve the sum across save and restore, we have to divide the
+      # total across all devices when restoring a variable that was summed
+      # when saving.
+      tensor = args[0]
+      if self._aggregation == vs.VariableAggregation.SUM:
+        tensor *= 1. / len(self.devices)
+      return control_flow_ops.group(
+          [_assign_on_device(d, v, tensor)
+           for d, v in six.iteritems(self._index)])
+    else:
+      _assert_tower_context()
+      return self.get().assign(*args, **kwargs)
 
   @property
   def aggregation(self):
diff --git a/tensorflow/contrib/distribute/python/values_test.py b/tensorflow/contrib/distribute/python/values_test.py
index 8e44f2fea1..91a43d4999 100644
--- a/tensorflow/contrib/distribute/python/values_test.py
+++ b/tensorflow/contrib/distribute/python/values_test.py
@@ -32,6 +32,7 @@ from tensorflow.python.estimator import model_fn as model_fn_lib
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import random_ops
@@ -79,6 +80,30 @@ class DistributedValuesTest(test.TestCase):
     with self.assertRaises(AssertionError):
       v = values.DistributedValues({"/device:cpu:0": 42})
 
+  def testIsTensorLike(self):
+    with context.graph_mode(), \
+         ops.Graph().as_default(), \
+         ops.device("/device:CPU:0"):
+      one = constant_op.constant(1)
+      two = constant_op.constant(2)
+      v = values.DistributedValues({"/device:CPU:0": one, "/device:GPU:0": two})
+      self.assertEqual(two, v.get("/device:GPU:0"))
+      self.assertEqual(one, v.get())
+      self.assertTrue(v.is_tensor_like)
+      self.assertTrue(tensor_util.is_tensor(v))
+
+  def testIsTensorLikeWithAConstant(self):
+    with context.graph_mode(), \
+         ops.Graph().as_default(), \
+         ops.device("/device:CPU:0"):
+      one = constant_op.constant(1)
+      two = 2.0
+      v = values.DistributedValues({"/device:CPU:0": one, "/device:GPU:0": two})
+      self.assertEqual(two, v.get("/device:GPU:0"))
+      self.assertEqual(one, v.get())
+      self.assertFalse(v.is_tensor_like)
+      self.assertFalse(tensor_util.is_tensor(v))
+
 
 class DistributedDelegateTest(test.TestCase):
 
diff --git a/tensorflow/contrib/distribute/python/warm_starting_util_test.py b/tensorflow/contrib/distribute/python/warm_starting_util_test.py
new file mode 100644
index 0000000000..d8bacdb338
--- /dev/null
+++ b/tensorflow/contrib/distribute/python/warm_starting_util_test.py
@@ -0,0 +1,97 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for warm_starting_util with Distribution Strategy.
+
+These tests are located here instead of as part of `WarmStartingUtilTest`
+because they need access to distribution strategies which are only present in
+contrib right now.
+TODO(priyag): Move the tests to core `WarmStartingUtilTest` when distribution
+strategy moves out of contrib.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+from absl.testing import parameterized
+
+from tensorflow.contrib.distribute.python import combinations
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import saver as saver_lib
+from tensorflow.python.training import warm_starting_util as ws_util
+
+
+class WarmStartingUtilWithDistributionStrategyTest(
+    test.TestCase, parameterized.TestCase):
+
+  @combinations.generate(combinations.combine(
+      distribution=[combinations.default_strategy,
+                    combinations.one_device_strategy,
+                    combinations.mirrored_strategy_with_gpu_and_cpu,
+                    combinations.mirrored_strategy_with_two_gpus],
+      save_with_distribution=[True, False],
+      restore_with_distribution=[True, False],
+      mode=["graph"]))
+  def testWarmStart(self, distribution, save_with_distribution,
+                    restore_with_distribution):
+
+    var_name = "v"
+    original_value = [[1., 2.], [3., 4.]]
+
+    # Create variable and save checkpoint from which to warm-start.
+    def create_var(g):
+      with self.test_session(graph=g) as sess:
+        var = variable_scope.get_variable(var_name, initializer=original_value)
+        sess.run(variables.global_variables_initializer())
+        saver = saver_lib.Saver()
+        ckpt_prefix = os.path.join(self.get_temp_dir(), "model")
+        saver.save(sess, ckpt_prefix, global_step=0)
+        return var, sess.run(var)
+
+    if save_with_distribution:
+      with ops.Graph().as_default() as g, distribution.scope():
+        _, prev_init_val = create_var(g)
+    else:
+      with ops.Graph().as_default() as g:
+        _, prev_init_val = create_var(g)
+
+    # Verify we initialized the values correctly.
+    self.assertAllEqual(original_value, prev_init_val)
+
+    def warm_start(g):
+      with self.test_session(graph=g) as sess:
+        # Initialize with zeros.
+        var = variable_scope.get_variable(
+            var_name, initializer=[[0., 0.], [0., 0.]])
+        ws_util.warm_start(self.get_temp_dir())
+        sess.run(variables.global_variables_initializer())
+        # Verify weights were correctly warm-started to previous values.
+        self.assertAllEqual(original_value, self.evaluate(var))
+
+    # Warm start in a new graph.
+    if restore_with_distribution:
+      with ops.Graph().as_default() as g, distribution.scope():
+        warm_start(g)
+    else:
+      with ops.Graph().as_default() as g:
+        warm_start(g)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
index b8f2a4b2c7..296e66f2b2 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/masked_autoregressive.py
@@ -514,9 +514,8 @@ def masked_autoregressive_default_template(
        Masked Autoencoder for Distribution Estimation. In _International
        Conference on Machine Learning_, 2015. https://arxiv.org/abs/1502.03509
   """
-
-  with ops.name_scope(name, "masked_autoregressive_default_template",
-                      values=[log_scale_min_clip, log_scale_max_clip]):
+  name = name or "masked_autoregressive_default_template"
+  with ops.name_scope(name, values=[log_scale_min_clip, log_scale_max_clip]):
     def _fn(x):
       """MADE parameterized via `masked_autoregressive_default_template`."""
       # TODO(b/67594795): Better support of dynamic shape.
@@ -552,8 +551,7 @@ def masked_autoregressive_default_template(
                     else _clip_by_value_preserve_grad)
       log_scale = which_clip(log_scale, log_scale_min_clip, log_scale_max_clip)
       return shift, log_scale
-    return template_ops.make_template(
-        "masked_autoregressive_default_template", _fn)
+    return template_ops.make_template(name, _fn)
 
 
 @deprecation.deprecated(
diff --git a/tensorflow/contrib/distributions/python/ops/quantized_distribution.py b/tensorflow/contrib/distributions/python/ops/quantized_distribution.py
index ef3bdfa75f..18a0f754e6 100644
--- a/tensorflow/contrib/distributions/python/ops/quantized_distribution.py
+++ b/tensorflow/contrib/distributions/python/ops/quantized_distribution.py
@@ -326,6 +326,21 @@ class QuantizedDistribution(distributions.Distribution):
         graph_parents=graph_parents,
         name=name)
 
+  @property
+  def distribution(self):
+    """Base distribution, p(x)."""
+    return self._dist
+
+  @property
+  def low(self):
+    """Lowest value that quantization returns."""
+    return self._low
+
+  @property
+  def high(self):
+    """Highest value that quantization returns."""
+    return self._high
+
   def _batch_shape_tensor(self):
     return self.distribution.batch_shape_tensor()
 
@@ -569,8 +584,3 @@ class QuantizedDistribution(distributions.Distribution):
       dependencies = [distribution_util.assert_integer_form(
           value, message="value has non-integer components.")]
       return control_flow_ops.with_dependencies(dependencies, value)
-
-  @property
-  def distribution(self):
-    """Base distribution, p(x)."""
-    return self._dist
diff --git a/tensorflow/contrib/eager/python/datasets.py b/tensorflow/contrib/eager/python/datasets.py
index 58c548d798..16844e0d68 100644
--- a/tensorflow/contrib/eager/python/datasets.py
+++ b/tensorflow/contrib/eager/python/datasets.py
@@ -18,35 +18,13 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import threading
-
 from tensorflow.contrib.data.python.ops import prefetching_ops
 from tensorflow.python.data.ops import iterator_ops
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
 from tensorflow.python.eager import context
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
-from tensorflow.python.ops import gen_dataset_ops
-from tensorflow.python.ops import resource_variable_ops
-from tensorflow.python.training.checkpointable import base as checkpointable
-from tensorflow.python.training.saver import BaseSaverBuilder
-
-_uid_counter = 0
-_uid_lock = threading.Lock()
-
-
-def _generate_shared_name(prefix):
-  with _uid_lock:
-    global _uid_counter
-    uid = _uid_counter
-    _uid_counter += 1
-  return "{}{}".format(prefix, uid)
 
 
-class Iterator(iterator_ops.EagerIterator, checkpointable.CheckpointableBase):
+class Iterator(iterator_ops.EagerIterator):
   """An iterator producing tf.Tensor objects from a tf.data.Dataset.
 
   NOTE: Unlike the iterator created by the
@@ -80,38 +58,18 @@ class Iterator(iterator_ops.EagerIterator, checkpointable.CheckpointableBase):
           "`tf.contrib.eager.Iterator`. Use `for ... in dataset:` to iterate "
           "over the dataset instead.")
 
-    super(Iterator, self).__init__(dataset)
     if not context.context().device_spec.device_type:
       is_remote_device = False
     else:
       is_remote_device = context.context().device_spec.device_type != "CPU"
-    self._buffer_resource_handle = None
     if is_remote_device:
-      with ops.device("/device:CPU:0"):
-        iter_string_handle = gen_dataset_ops.iterator_to_string_handle(
-            self._resource)
-
-        @function.Defun(dtypes.string)
-        def remote_fn(h):
-          remote_iterator = iterator_ops.Iterator.from_string_handle(
-              h, self.output_types, self.output_shapes, self.output_classes)
-          return remote_iterator.get_next()
-
-        remote_fn.add_to_graph(None)
-        target = constant_op.constant("/device:CPU:0")
-      with ops.device(self._device):
-        self._buffer_resource_handle = prefetching_ops.function_buffering_resource(  # pylint: disable=line-too-long
-            string_arg=iter_string_handle,
-            output_types=self._flat_output_types,
-            f=remote_fn,
-            target_device=target,
-            buffer_size=10,
-            container="",
-            shared_name=_generate_shared_name(
-                "contrib_eager_iterator_function_buffer_resource"))
-        self._buffer_resource_deleter = resource_variable_ops.EagerResourceDeleter(  # pylint: disable=line-too-long
-            handle=self._buffer_resource_handle,
-            handle_device=self._device)
+      with ops.device(None):
+        # Let the placer figure out where to place the various functions etc.
+        # created by the CopyToDeviceDataset.
+        dataset = dataset.apply(prefetching_ops.copy_to_device(
+            context.context().device_name))
+        dataset = dataset.prefetch(1)
+    super(Iterator, self).__init__(dataset)
 
   def _next_internal(self):
     """Returns a nested structure of `tf.Tensor`s containing the next element.
@@ -120,40 +78,4 @@ class Iterator(iterator_ops.EagerIterator, checkpointable.CheckpointableBase):
     # that there is no more data to iterate over.
     # TODO(b/77291417): Fix
     with context.execution_mode(context.SYNC):
-      if self._buffer_resource_handle is not None:
-        with ops.device(self._device):
-          ret = prefetching_ops.function_buffering_resource_get_next(
-              function_buffer_resource=self._buffer_resource_handle,
-              output_types=self._flat_output_types)
-        return sparse.deserialize_sparse_tensors(
-            nest.pack_sequence_as(self._output_types, ret), self._output_types,
-            self._output_shapes, self._output_classes)
-      else:
-        return super(Iterator, self)._next_internal()
-
-  # TODO(shivaniagrawal): Expose checkpointable stateful objects from dataset
-  # attributes(potential).
-
-  class _Saveable(BaseSaverBuilder.SaveableObject):
-    """SaveableObject for saving/restoring iterator state."""
-
-    def __init__(self, iterator_resource, name):
-      serialized_iterator = gen_dataset_ops.serialize_iterator(
-          iterator_resource)
-      specs = [
-          BaseSaverBuilder.SaveSpec(serialized_iterator, "", name + "_STATE")
-      ]
-      # pylint: disable=protected-access
-      super(Iterator._Saveable, self).__init__(iterator_resource, specs, name)
-
-    def restore(self, restored_tensors, restored_shapes):
-      with ops.colocate_with(self.op):
-        return gen_dataset_ops.deserialize_iterator(self.op,
-                                                    restored_tensors[0])
-
-  def _gather_saveables_for_checkpoint(self):
-
-    def _saveable_factory(name):
-      return self._Saveable(self._resource, name)
-
-    return {"ITERATOR": _saveable_factory}
+      return super(Iterator, self)._next_internal()
diff --git a/tensorflow/contrib/eager/python/datasets_test.py b/tensorflow/contrib/eager/python/datasets_test.py
index 68bec9aee8..a753d77580 100644
--- a/tensorflow/contrib/eager/python/datasets_test.py
+++ b/tensorflow/contrib/eager/python/datasets_test.py
@@ -37,6 +37,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import script_ops
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training.checkpointable import util as checkpointable_utils
 
 
@@ -193,6 +194,20 @@ class IteratorTest(test.TestCase):
       x = math_ops.add(x, x)
     self.assertAllEqual([0., 2.], x.numpy())
 
+  def testGpuTensor(self):
+    ds = Dataset.from_tensors([0., 1.])
+    with ops.device(test.gpu_device_name()):
+      for x in ds:
+        y = math_ops.add(x, x)
+    self.assertAllEqual([0., 2.], y.numpy())
+
+  def testGpuDefinedDataset(self):
+    with ops.device(test.gpu_device_name()):
+      ds = Dataset.from_tensors([0., 1.])
+      for x in ds:
+        y = math_ops.add(x, x)
+    self.assertAllEqual([0., 2.], y.numpy())
+
   def testTensorsExplicitPrefetchToDevice(self):
     ds = Dataset.from_tensor_slices([0., 1.])
     ds = ds.apply(prefetching_ops.prefetch_to_device(test.gpu_device_name()))
@@ -292,6 +307,19 @@ class IteratorTest(test.TestCase):
     checkpoint.restore(save_path)
     self.assertEqual(2, iterator.get_next().numpy())
 
+  def testRestoreInReconstructedIterator(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, 'ckpt')
+    dataset = Dataset.range(10)
+    for i in range(5):
+      iterator = datasets.Iterator(dataset)
+      checkpoint = checkpointable_utils.Checkpoint(iterator=iterator)
+      checkpoint.restore(checkpoint_management.latest_checkpoint(
+          checkpoint_directory))
+      for j in range(2):
+        self.assertEqual(i * 2 + j, iterator.get_next().numpy())
+      checkpoint.save(file_prefix=checkpoint_prefix)
+
 
 class DatasetConstructorBenchmark(test.Benchmark):
 
diff --git a/tensorflow/contrib/eager/python/examples/BUILD b/tensorflow/contrib/eager/python/examples/BUILD
index 12155a459c..6f02c90368 100644
--- a/tensorflow/contrib/eager/python/examples/BUILD
+++ b/tensorflow/contrib/eager/python/examples/BUILD
@@ -15,8 +15,6 @@ py_library(
         "//tensorflow/contrib/eager/python/examples/revnet:config",
         "//tensorflow/contrib/eager/python/examples/rnn_colorbot",
         "//tensorflow/contrib/eager/python/examples/rnn_ptb",
-        "//tensorflow/contrib/eager/python/examples/sagan",
-        "//tensorflow/contrib/eager/python/examples/sagan:config",
         "//tensorflow/contrib/eager/python/examples/spinn:data",
     ],
 )
diff --git a/tensorflow/contrib/eager/python/examples/densenet/BUILD b/tensorflow/contrib/eager/python/examples/densenet/BUILD
new file mode 100644
index 0000000000..2dc196f550
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/densenet/BUILD
@@ -0,0 +1,48 @@
+licenses(["notice"])  # Apache 2.0
+
+package(default_visibility = ["//tensorflow:internal"])
+
+load("//tensorflow:tensorflow.bzl", "cuda_py_test")
+
+py_binary(
+    name = "densenet",
+    srcs = ["densenet.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow:tensorflow_py",
+        "//tensorflow/contrib/eager/python:tfe",
+    ],
+)
+
+cuda_py_test(
+    name = "densenet_test",
+    size = "large",
+    srcs = ["densenet_test.py"],
+    additional_deps = [
+        ":densenet",
+        "//tensorflow/contrib/eager/python:tfe",
+        "//tensorflow:tensorflow_py",
+    ],
+    tags = [
+        "no_pip",
+        "optonly",
+    ],
+)
+
+cuda_py_test(
+    name = "densenet_graph_test",
+    size = "large",
+    srcs = ["densenet_graph_test.py"],
+    additional_deps = [
+        ":densenet",
+        "//third_party/py/numpy",
+        "//tensorflow:tensorflow_py",
+    ],
+    tags = [
+        "no_pip",
+        "noasan",
+        "nomsan",
+        "notsan",
+        "optonly",
+    ],
+)
diff --git a/tensorflow/contrib/eager/python/examples/densenet/densenet.py b/tensorflow/contrib/eager/python/examples/densenet/densenet.py
new file mode 100644
index 0000000000..6de4e69400
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/densenet/densenet.py
@@ -0,0 +1,296 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Densely Connected Convolutional Networks.
+
+Reference [
+Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993)
+
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+l2 = tf.keras.regularizers.l2
+
+
+class ConvBlock(tf.keras.Model):
+  """Convolutional Block consisting of (batchnorm->relu->conv).
+
+  Arguments:
+    num_filters: number of filters passed to a convolutional layer.
+    data_format: "channels_first" or "channels_last"
+    bottleneck: if True, then a 1x1 Conv is performed followed by 3x3 Conv.
+    weight_decay: weight decay
+    dropout_rate: dropout rate.
+  """
+
+  def __init__(self, num_filters, data_format, bottleneck, weight_decay=1e-4,
+               dropout_rate=0):
+    super(ConvBlock, self).__init__()
+    self.bottleneck = bottleneck
+
+    axis = -1 if data_format == "channels_last" else 1
+    inter_filter = num_filters * 4
+    # don't forget to set use_bias=False when using batchnorm
+    self.conv2 = tf.keras.layers.Conv2D(num_filters,
+                                        (3, 3),
+                                        padding="same",
+                                        use_bias=False,
+                                        data_format=data_format,
+                                        kernel_initializer="he_normal",
+                                        kernel_regularizer=l2(weight_decay))
+    self.batchnorm1 = tf.keras.layers.BatchNormalization(axis=axis)
+    self.dropout = tf.keras.layers.Dropout(dropout_rate)
+
+    if self.bottleneck:
+      self.conv1 = tf.keras.layers.Conv2D(inter_filter,
+                                          (1, 1),
+                                          padding="same",
+                                          use_bias=False,
+                                          data_format=data_format,
+                                          kernel_initializer="he_normal",
+                                          kernel_regularizer=l2(weight_decay))
+      self.batchnorm2 = tf.keras.layers.BatchNormalization(axis=axis)
+
+  def call(self, x, training=True):
+    output = self.batchnorm1(x, training=training)
+
+    if self.bottleneck:
+      output = self.conv1(tf.nn.relu(output))
+      output = self.batchnorm2(output, training=training)
+
+    output = self.conv2(tf.nn.relu(output))
+    output = self.dropout(output, training=training)
+
+    return output
+
+
+class TransitionBlock(tf.keras.Model):
+  """Transition Block to reduce the number of features.
+
+  Arguments:
+    num_filters: number of filters passed to a convolutional layer.
+    data_format: "channels_first" or "channels_last"
+    weight_decay: weight decay
+    dropout_rate: dropout rate.
+  """
+
+  def __init__(self, num_filters, data_format,
+               weight_decay=1e-4, dropout_rate=0):
+    super(TransitionBlock, self).__init__()
+    axis = -1 if data_format == "channels_last" else 1
+
+    self.batchnorm = tf.keras.layers.BatchNormalization(axis=axis)
+    self.conv = tf.keras.layers.Conv2D(num_filters,
+                                       (1, 1),
+                                       padding="same",
+                                       use_bias=False,
+                                       data_format=data_format,
+                                       kernel_initializer="he_normal",
+                                       kernel_regularizer=l2(weight_decay))
+    self.avg_pool = tf.keras.layers.AveragePooling2D(data_format=data_format)
+
+  def call(self, x, training=True):
+    output = self.batchnorm(x, training=training)
+    output = self.conv(tf.nn.relu(output))
+    output = self.avg_pool(output)
+    return output
+
+
+class DenseBlock(tf.keras.Model):
+  """Dense Block consisting of ConvBlocks where each block's
+  output is concatenated with its input.
+
+  Arguments:
+    num_layers: Number of layers in each block.
+    growth_rate: number of filters to add per conv block.
+    data_format: "channels_first" or "channels_last"
+    bottleneck: boolean, that decides which part of ConvBlock to call.
+    weight_decay: weight decay
+    dropout_rate: dropout rate.
+  """
+
+  def __init__(self, num_layers, growth_rate, data_format, bottleneck,
+               weight_decay=1e-4, dropout_rate=0):
+    super(DenseBlock, self).__init__()
+    self.num_layers = num_layers
+    self.axis = -1 if data_format == "channels_last" else 1
+
+    self.blocks = []
+    for _ in range(int(self.num_layers)):
+      self.blocks.append(ConvBlock(growth_rate,
+                                   data_format,
+                                   bottleneck,
+                                   weight_decay,
+                                   dropout_rate))
+
+  def call(self, x, training=True):
+    for i in range(int(self.num_layers)):
+      output = self.blocks[i](x, training=training)
+      x = tf.concat([x, output], axis=self.axis)
+
+    return x
+
+
+class DenseNet(tf.keras.Model):
+  """Creating the Densenet Architecture.
+
+  Arguments:
+    depth_of_model: number of layers in the model.
+    growth_rate: number of filters to add per conv block.
+    num_of_blocks: number of dense blocks.
+    output_classes: number of output classes.
+    num_layers_in_each_block: number of layers in each block.
+                              If -1, then we calculate this by (depth-3)/4.
+                              If positive integer, then the it is used as the
+                                number of layers per block.
+                              If list or tuple, then this list is used directly.
+    data_format: "channels_first" or "channels_last"
+    bottleneck: boolean, to decide which part of conv block to call.
+    compression: reducing the number of inputs(filters) to the transition block.
+    weight_decay: weight decay
+    rate: dropout rate.
+    pool_initial: If True add a 7x7 conv with stride 2 followed by 3x3 maxpool
+                  else, do a 3x3 conv with stride 1.
+    include_top: If true, GlobalAveragePooling Layer and Dense layer are
+                 included.
+  """
+
+  def __init__(self, depth_of_model, growth_rate, num_of_blocks,
+               output_classes, num_layers_in_each_block, data_format,
+               bottleneck=True, compression=0.5, weight_decay=1e-4,
+               dropout_rate=0, pool_initial=False, include_top=True):
+    super(DenseNet, self).__init__()
+    self.depth_of_model = depth_of_model
+    self.growth_rate = growth_rate
+    self.num_of_blocks = num_of_blocks
+    self.output_classes = output_classes
+    self.num_layers_in_each_block = num_layers_in_each_block
+    self.data_format = data_format
+    self.bottleneck = bottleneck
+    self.compression = compression
+    self.weight_decay = weight_decay
+    self.dropout_rate = dropout_rate
+    self.pool_initial = pool_initial
+    self.include_top = include_top
+
+    # deciding on number of layers in each block
+    if isinstance(self.num_layers_in_each_block, list) or isinstance(
+        self.num_layers_in_each_block, tuple):
+      self.num_layers_in_each_block = list(self.num_layers_in_each_block)
+    else:
+      if self.num_layers_in_each_block == -1:
+        if self.num_of_blocks != 3:
+          raise ValueError(
+              "Number of blocks must be 3 if num_layers_in_each_block is -1")
+        if (self.depth_of_model - 4) % 3 == 0:
+          num_layers = (self.depth_of_model - 4) / 3
+          if self.bottleneck:
+            num_layers //= 2
+          self.num_layers_in_each_block = [num_layers] * self.num_of_blocks
+        else:
+          raise ValueError("Depth must be 3N+4 if num_layer_in_each_block=-1")
+      else:
+        self.num_layers_in_each_block = [
+            self.num_layers_in_each_block] * self.num_of_blocks
+
+    axis = -1 if self.data_format == "channels_last" else 1
+
+    # setting the filters and stride of the initial covn layer.
+    if self.pool_initial:
+      init_filters = (7, 7)
+      stride = (2, 2)
+    else:
+      init_filters = (3, 3)
+      stride = (1, 1)
+
+    self.num_filters = 2 * self.growth_rate
+
+    # first conv and pool layer
+    self.conv1 = tf.keras.layers.Conv2D(self.num_filters,
+                                        init_filters,
+                                        strides=stride,
+                                        padding="same",
+                                        use_bias=False,
+                                        data_format=self.data_format,
+                                        kernel_initializer="he_normal",
+                                        kernel_regularizer=l2(
+                                            self.weight_decay))
+    if self.pool_initial:
+      self.pool1 = tf.keras.layers.MaxPooling2D(pool_size=(3, 3),
+                                                strides=(2, 2),
+                                                padding="same",
+                                                data_format=self.data_format)
+      self.batchnorm1 = tf.keras.layers.BatchNormalization(axis=axis)
+
+    self.batchnorm2 = tf.keras.layers.BatchNormalization(axis=axis)
+
+    # last pooling and fc layer
+    if self.include_top:
+      self.last_pool = tf.keras.layers.GlobalAveragePooling2D(
+          data_format=self.data_format)
+      self.classifier = tf.keras.layers.Dense(self.output_classes)
+
+    # calculating the number of filters after each block
+    num_filters_after_each_block = [self.num_filters]
+    for i in range(1, self.num_of_blocks):
+      temp_num_filters = num_filters_after_each_block[i-1] + (
+          self.growth_rate * self.num_layers_in_each_block[i-1])
+      # using compression to reduce the number of inputs to the
+      # transition block
+      temp_num_filters = int(temp_num_filters * compression)
+      num_filters_after_each_block.append(temp_num_filters)
+
+    # dense block initialization
+    self.dense_blocks = []
+    self.transition_blocks = []
+    for i in range(self.num_of_blocks):
+      self.dense_blocks.append(DenseBlock(self.num_layers_in_each_block[i],
+                                          self.growth_rate,
+                                          self.data_format,
+                                          self.bottleneck,
+                                          self.weight_decay,
+                                          self.dropout_rate))
+      if i+1 < self.num_of_blocks:
+        self.transition_blocks.append(
+            TransitionBlock(num_filters_after_each_block[i+1],
+                            self.data_format,
+                            self.weight_decay,
+                            self.dropout_rate))
+
+  def call(self, x, training=True):
+    output = self.conv1(x)
+
+    if self.pool_initial:
+      output = self.batchnorm1(output, training=training)
+      output = tf.nn.relu(output)
+      output = self.pool1(output)
+
+    for i in range(self.num_of_blocks - 1):
+      output = self.dense_blocks[i](output, training=training)
+      output = self.transition_blocks[i](output, training=training)
+
+    output = self.dense_blocks[
+        self.num_of_blocks - 1](output, training=training)
+    output = self.batchnorm2(output, training=training)
+    output = tf.nn.relu(output)
+
+    if self.include_top:
+      output = self.last_pool(output)
+      output = self.classifier(output)
+
+    return output
diff --git a/tensorflow/contrib/eager/python/examples/densenet/densenet_graph_test.py b/tensorflow/contrib/eager/python/examples/densenet/densenet_graph_test.py
new file mode 100644
index 0000000000..4b3cb624bc
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/densenet/densenet_graph_test.py
@@ -0,0 +1,151 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests and Benchmarks for Densenet model under graph execution."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import time
+import numpy as np
+import tensorflow as tf
+
+from tensorflow.contrib.eager.python.examples.densenet import densenet
+
+
+def data_format():
+  return 'channels_first' if tf.test.is_gpu_available() else 'channels_last'
+
+
+def image_shape(batch_size):
+  if data_format() == 'channels_first':
+    return [batch_size, 3, 224, 224]
+  return [batch_size, 224, 224, 3]
+
+
+def random_batch(batch_size):
+  images = np.random.rand(*image_shape(batch_size)).astype(np.float32)
+  num_classes = 1000
+  labels = np.random.randint(
+      low=0, high=num_classes, size=[batch_size]).astype(np.int32)
+  one_hot = np.zeros((batch_size, num_classes)).astype(np.float32)
+  one_hot[np.arange(batch_size), labels] = 1.
+  return images, one_hot
+
+
+class DensenetGraphTest(tf.test.TestCase):
+
+  def testApply(self):
+    depth = 7
+    growth_rate = 2
+    num_blocks = 3
+    output_classes = 10
+    num_layers_in_each_block = -1
+    batch_size = 1
+    with tf.Graph().as_default():
+      images = tf.placeholder(tf.float32, image_shape(None))
+      model = densenet.DenseNet(depth, growth_rate, num_blocks,
+                                output_classes, num_layers_in_each_block,
+                                data_format(), bottleneck=True, compression=0.5,
+                                weight_decay=1e-4, dropout_rate=0,
+                                pool_initial=False, include_top=True)
+      predictions = model(images, training=False)
+
+      init = tf.global_variables_initializer()
+
+      with tf.Session() as sess:
+        sess.run(init)
+        np_images, _ = random_batch(batch_size)
+        out = sess.run(predictions, feed_dict={images: np_images})
+        self.assertAllEqual([batch_size, output_classes], out.shape)
+
+
+class DensenetBenchmark(tf.test.Benchmark):
+
+  def __init__(self):
+    self.depth = 121
+    self.growth_rate = 32
+    self.num_blocks = 4
+    self.output_classes = 1000
+    self.num_layers_in_each_block = [6, 12, 24, 16]
+
+  def _report(self, label, start, num_iters, batch_size):
+    avg_time = (time.time() - start) / num_iters
+    dev = 'gpu' if tf.test.is_gpu_available() else 'cpu'
+    name = 'graph_%s_%s_batch_%d_%s' % (label, dev, batch_size, data_format())
+    extras = {'examples_per_sec': batch_size / avg_time}
+    self.report_benchmark(
+        iters=num_iters, wall_time=avg_time, name=name, extras=extras)
+
+  def benchmark_graph_apply(self):
+    with tf.Graph().as_default():
+      images = tf.placeholder(tf.float32, image_shape(None))
+      model = densenet.DenseNet(self.depth, self.growth_rate, self.num_blocks,
+                                self.output_classes,
+                                self.num_layers_in_each_block, data_format(),
+                                bottleneck=True, compression=0.5,
+                                weight_decay=1e-4, dropout_rate=0,
+                                pool_initial=True, include_top=True)
+      predictions = model(images, training=False)
+
+      init = tf.global_variables_initializer()
+
+      batch_size = 64
+      with tf.Session() as sess:
+        sess.run(init)
+        np_images, _ = random_batch(batch_size)
+        num_burn, num_iters = (3, 30)
+        for _ in range(num_burn):
+          sess.run(predictions, feed_dict={images: np_images})
+        start = time.time()
+        for _ in range(num_iters):
+          sess.run(predictions, feed_dict={images: np_images})
+        self._report('apply', start, num_iters, batch_size)
+
+  def benchmark_graph_train(self):
+    for batch_size in [16, 32, 64]:
+      with tf.Graph().as_default():
+        np_images, np_labels = random_batch(batch_size)
+        dataset = tf.data.Dataset.from_tensors((np_images, np_labels)).repeat()
+        (images, labels) = dataset.make_one_shot_iterator().get_next()
+
+        model = densenet.DenseNet(self.depth, self.growth_rate, self.num_blocks,
+                                  self.output_classes,
+                                  self.num_layers_in_each_block, data_format(),
+                                  bottleneck=True, compression=0.5,
+                                  weight_decay=1e-4, dropout_rate=0,
+                                  pool_initial=True, include_top=True)
+        logits = model(images, training=True)
+        cross_ent = tf.losses.softmax_cross_entropy(
+            logits=logits, onehot_labels=labels)
+        regularization = tf.add_n(model.losses)
+        loss = cross_ent + regularization
+        optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
+        train_op = optimizer.minimize(loss)
+
+        init = tf.global_variables_initializer()
+        with tf.Session() as sess:
+          sess.run(init)
+          (num_burn, num_iters) = (5, 10)
+          for _ in range(num_burn):
+            sess.run(train_op)
+          start = time.time()
+          for _ in range(num_iters):
+            sess.run(train_op)
+          self._report('train', start, num_iters, batch_size)
+
+
+if __name__ == '__main__':
+  tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/densenet/densenet_test.py b/tensorflow/contrib/eager/python/examples/densenet/densenet_test.py
new file mode 100644
index 0000000000..0736ed02b7
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/densenet/densenet_test.py
@@ -0,0 +1,350 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests and Benchmarks for Densenet model."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gc
+import time
+import tensorflow as tf
+import tensorflow.contrib.eager as tfe
+
+from tensorflow.contrib.eager.python.examples.densenet import densenet
+from tensorflow.python.client import device_lib
+
+
+class DensenetTest(tf.test.TestCase):
+
+  def test_bottleneck_true(self):
+    depth = 7
+    growth_rate = 2
+    num_blocks = 3
+    output_classes = 10
+    num_layers_in_each_block = -1
+    batch_size = 1
+    data_format = ('channels_first') if tf.test.is_gpu_available() else (
+        'channels_last')
+
+    model = densenet.DenseNet(depth, growth_rate, num_blocks,
+                              output_classes, num_layers_in_each_block,
+                              data_format, bottleneck=True, compression=0.5,
+                              weight_decay=1e-4, dropout_rate=0,
+                              pool_initial=False, include_top=True)
+
+    if data_format == 'channels_last':
+      rand_input = tf.random_uniform((batch_size, 32, 32, 3))
+    else:
+      rand_input = tf.random_uniform((batch_size, 3, 32, 32))
+    output_shape = model(rand_input).shape
+    self.assertEqual(output_shape, (batch_size, output_classes))
+
+  def test_bottleneck_false(self):
+    depth = 7
+    growth_rate = 2
+    num_blocks = 3
+    output_classes = 10
+    num_layers_in_each_block = -1
+    batch_size = 1
+    data_format = ('channels_first') if tf.test.is_gpu_available() else (
+        'channels_last')
+
+    model = densenet.DenseNet(depth, growth_rate, num_blocks,
+                              output_classes, num_layers_in_each_block,
+                              data_format, bottleneck=False, compression=0.5,
+                              weight_decay=1e-4, dropout_rate=0,
+                              pool_initial=False, include_top=True)
+
+    if data_format == 'channels_last':
+      rand_input = tf.random_uniform((batch_size, 32, 32, 3))
+    else:
+      rand_input = tf.random_uniform((batch_size, 3, 32, 32))
+    output_shape = model(rand_input).shape
+    self.assertEqual(output_shape, (batch_size, output_classes))
+
+  def test_pool_initial_true(self):
+    depth = 7
+    growth_rate = 2
+    num_blocks = 4
+    output_classes = 10
+    num_layers_in_each_block = [1, 2, 2, 1]
+    batch_size = 1
+    data_format = ('channels_first') if tf.test.is_gpu_available() else (
+        'channels_last')
+
+    model = densenet.DenseNet(depth, growth_rate, num_blocks,
+                              output_classes, num_layers_in_each_block,
+                              data_format, bottleneck=True, compression=0.5,
+                              weight_decay=1e-4, dropout_rate=0,
+                              pool_initial=True, include_top=True)
+
+    if data_format == 'channels_last':
+      rand_input = tf.random_uniform((batch_size, 32, 32, 3))
+    else:
+      rand_input = tf.random_uniform((batch_size, 3, 32, 32))
+    output_shape = model(rand_input).shape
+    self.assertEqual(output_shape, (batch_size, output_classes))
+
+  def test_regularization(self):
+    if tf.test.is_gpu_available():
+      rand_input = tf.random_uniform((10, 3, 32, 32))
+      data_format = 'channels_first'
+    else:
+      rand_input = tf.random_uniform((10, 32, 32, 3))
+      data_format = 'channels_last'
+    weight_decay = 1e-4
+
+    conv = tf.keras.layers.Conv2D(
+        3, (3, 3),
+        padding='same',
+        use_bias=False,
+        data_format=data_format,
+        kernel_regularizer=tf.keras.regularizers.l2(weight_decay))
+    optimizer = tf.train.GradientDescentOptimizer(0.1)
+    conv(rand_input)  # Initialize the variables in the layer
+
+    def compute_true_l2(vs, wd):
+      return tf.reduce_sum(tf.square(vs)) * wd
+
+    true_l2 = compute_true_l2(conv.variables, weight_decay)
+    keras_l2 = tf.add_n(conv.losses)
+    self.assertAllClose(true_l2, keras_l2)
+
+    with tf.GradientTape() as tape_true, tf.GradientTape() as tape_keras:
+      loss = tf.reduce_sum(conv(rand_input))
+      loss_with_true_l2 = loss + compute_true_l2(conv.variables, weight_decay)
+      loss_with_keras_l2 = loss + tf.add_n(conv.losses)
+
+    true_grads = tape_true.gradient(loss_with_true_l2, conv.variables)
+    keras_grads = tape_keras.gradient(loss_with_keras_l2, conv.variables)
+    self.assertAllClose(true_grads, keras_grads)
+
+    optimizer.apply_gradients(zip(keras_grads, conv.variables))
+    keras_l2_after_update = tf.add_n(conv.losses)
+    self.assertNotAllClose(keras_l2, keras_l2_after_update)
+
+
+def compute_gradients(model, images, labels):
+  with tf.GradientTape() as tape:
+    logits = model(images, training=True)
+    cross_ent = tf.losses.softmax_cross_entropy(
+        logits=logits, onehot_labels=labels)
+    regularization = tf.add_n(model.losses)
+    loss = cross_ent + regularization
+    tf.contrib.summary.scalar(name='loss', tensor=loss)
+  return tape.gradient(loss, model.variables)
+
+
+def apply_gradients(model, optimizer, gradients):
+  optimizer.apply_gradients(zip(gradients, model.variables))
+
+
+def device_and_data_format():
+  return ('/gpu:0',
+          'channels_first') if tf.test.is_gpu_available() else ('/cpu:0',
+                                                                'channels_last')
+
+
+def random_batch(batch_size, data_format):
+  shape = (3, 224, 224) if data_format == 'channels_first' else (224, 224, 3)
+  shape = (batch_size,) + shape
+
+  num_classes = 1000
+  images = tf.random_uniform(shape)
+  labels = tf.random_uniform(
+      [batch_size], minval=0, maxval=num_classes, dtype=tf.int32)
+  one_hot = tf.one_hot(labels, num_classes)
+
+  return images, one_hot
+
+
+class MockIterator(object):
+
+  def __init__(self, tensors):
+    self._tensors = [tf.identity(x) for x in tensors]
+
+  def next(self):
+    return self._tensors
+
+
+class DensenetBenchmark(tf.test.Benchmark):
+
+  def __init__(self):
+    self.depth = 121
+    self.growth_rate = 32
+    self.num_blocks = 4
+    self.output_classes = 1000
+    self.num_layers_in_each_block = [6, 12, 24, 16]
+
+  def _train_batch_sizes(self):
+    """Choose batch sizes based on GPU capability."""
+    for device in device_lib.list_local_devices():
+      if tf.DeviceSpec.from_string(device.name).device_type == 'GPU':
+        if 'K20' in device.physical_device_desc:
+          return (16,)
+        if 'P100' in device.physical_device_desc:
+          return (16, 32, 64)
+
+      if tf.DeviceSpec.from_string(device.name).device_type == 'TPU':
+        return (32,)
+    return (16, 32)
+
+  def _report(self, label, start, num_iters, device, batch_size, data_format):
+    avg_time = (time.time() - start) / num_iters
+    dev = tf.DeviceSpec.from_string(device).device_type.lower()
+    name = '%s_%s_batch_%d_%s' % (label, dev, batch_size, data_format)
+    extras = {'examples_per_sec': batch_size / avg_time}
+    self.report_benchmark(
+        iters=num_iters, wall_time=avg_time, name=name, extras=extras)
+
+  def _force_device_sync(self):
+    # If this function is called in the context of a non-CPU device
+    # (e.g., inside a 'with tf.device("/gpu:0")' block)
+    # then this will force a copy from CPU->NON_CPU_DEVICE->CPU,
+    # which forces a sync. This is a roundabout way, yes.
+    tf.constant(1.).cpu()
+
+  def _benchmark_eager_apply(self, label, device_and_format, defun=False,
+                             execution_mode=None, compiled=False):
+    with tfe.execution_mode(execution_mode):
+      device, data_format = device_and_format
+      model = densenet.DenseNet(self.depth, self.growth_rate, self.num_blocks,
+                                self.output_classes,
+                                self.num_layers_in_each_block, data_format,
+                                bottleneck=True, compression=0.5,
+                                weight_decay=1e-4, dropout_rate=0,
+                                pool_initial=True, include_top=True)
+      if defun:
+        model.call = tfe.defun(model.call, compiled=compiled)
+      batch_size = 64
+      num_burn = 5
+      num_iters = 30
+      with tf.device(device):
+        images, _ = random_batch(batch_size, data_format)
+        for _ in xrange(num_burn):
+          model(images, training=False).cpu()
+        if execution_mode:
+          tfe.async_wait()
+        gc.collect()
+        start = time.time()
+        for _ in xrange(num_iters):
+          model(images, training=False).cpu()
+        if execution_mode:
+          tfe.async_wait()
+        self._report(label, start, num_iters, device, batch_size, data_format)
+
+  def benchmark_eager_apply_sync(self):
+    self._benchmark_eager_apply('eager_apply', device_and_data_format(),
+                                defun=False)
+
+  def benchmark_eager_apply_async(self):
+    self._benchmark_eager_apply(
+        'eager_apply_async', device_and_data_format(), defun=False,
+        execution_mode=tfe.ASYNC)
+
+  def benchmark_eager_apply_with_defun(self):
+    self._benchmark_eager_apply('eager_apply_with_defun',
+                                device_and_data_format(), defun=True)
+
+  def _benchmark_eager_train(self,
+                             label,
+                             make_iterator,
+                             device_and_format,
+                             defun=False,
+                             execution_mode=None,
+                             compiled=False):
+    with tfe.execution_mode(execution_mode):
+      device, data_format = device_and_format
+      for batch_size in self._train_batch_sizes():
+        (images, labels) = random_batch(batch_size, data_format)
+        model = densenet.DenseNet(self.depth, self.growth_rate, self.num_blocks,
+                                  self.output_classes,
+                                  self.num_layers_in_each_block, data_format,
+                                  bottleneck=True, compression=0.5,
+                                  weight_decay=1e-4, dropout_rate=0,
+                                  pool_initial=True, include_top=True)
+        optimizer = tf.train.GradientDescentOptimizer(0.1)
+        apply_grads = apply_gradients
+        if defun:
+          model.call = tfe.defun(model.call, compiled=compiled)
+          apply_grads = tfe.defun(apply_gradients, compiled=compiled)
+
+        num_burn = 3
+        num_iters = 10
+        with tf.device(device):
+          iterator = make_iterator((images, labels))
+          for _ in xrange(num_burn):
+            (images, labels) = iterator.next()
+            apply_grads(model, optimizer,
+                        compute_gradients(model, images, labels))
+          if execution_mode:
+            tfe.async_wait()
+          self._force_device_sync()
+          gc.collect()
+
+          start = time.time()
+          for _ in xrange(num_iters):
+            (images, labels) = iterator.next()
+            apply_grads(model, optimizer,
+                        compute_gradients(model, images, labels))
+          if execution_mode:
+            tfe.async_wait()
+          self._force_device_sync()
+          self._report(label, start, num_iters, device, batch_size, data_format)
+
+  def benchmark_eager_train_sync(self):
+    self._benchmark_eager_train('eager_train', MockIterator,
+                                device_and_data_format(), defun=False)
+
+  def benchmark_eager_train_async(self):
+    self._benchmark_eager_train(
+        'eager_train_async',
+        MockIterator,
+        device_and_data_format(),
+        defun=False,
+        execution_mode=tfe.ASYNC)
+
+  def benchmark_eager_train_with_defun(self):
+    self._benchmark_eager_train(
+        'eager_train_with_defun', MockIterator,
+        device_and_data_format(), defun=True)
+
+  def benchmark_eager_train_datasets(self):
+
+    def make_iterator(tensors):
+      with tf.device('/device:CPU:0'):
+        ds = tf.data.Dataset.from_tensors(tensors).repeat()
+      return tfe.Iterator(ds)
+
+    self._benchmark_eager_train(
+        'eager_train_dataset', make_iterator,
+        device_and_data_format(), defun=False)
+
+  def benchmark_eager_train_datasets_with_defun(self):
+
+    def make_iterator(tensors):
+      with tf.device('/device:CPU:0'):
+        ds = tf.data.Dataset.from_tensors(tensors).repeat()
+      return tfe.Iterator(ds)
+
+    self._benchmark_eager_train(
+        'eager_train_dataset_with_defun', make_iterator,
+        device_and_data_format(), defun=True)
+
+
+if __name__ == '__main__':
+  tf.enable_eager_execution()
+  tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/gan/mnist.py b/tensorflow/contrib/eager/python/examples/gan/mnist.py
index cc9cf53410..9a42179299 100644
--- a/tensorflow/contrib/eager/python/examples/gan/mnist.py
+++ b/tensorflow/contrib/eager/python/examples/gan/mnist.py
@@ -29,7 +29,6 @@ import time
 
 import tensorflow as tf
 
-import tensorflow.contrib.eager as tfe
 from tensorflow.examples.tutorials.mnist import input_data
 
 layers = tf.keras.layers
@@ -214,7 +213,7 @@ def train_one_epoch(generator, discriminator, generator_optimizer,
 
   total_generator_loss = 0.0
   total_discriminator_loss = 0.0
-  for (batch_index, images) in enumerate(tfe.Iterator(dataset)):
+  for (batch_index, images) in enumerate(dataset):
     with tf.device('/cpu:0'):
       tf.assign_add(step_counter, 1)
 
@@ -227,7 +226,10 @@ def train_one_epoch(generator, discriminator, generator_optimizer,
           maxval=1.,
           seed=batch_index)
 
-      with tf.GradientTape(persistent=True) as g:
+      # we can use 2 tapes or a single persistent tape.
+      # Using two tapes is memory efficient since intermediate tensors can be
+      # released between the two .gradient() calls below
+      with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
         generated_images = generator(noise)
         tf.contrib.summary.image(
             'generated_images',
@@ -243,9 +245,10 @@ def train_one_epoch(generator, discriminator, generator_optimizer,
         generator_loss_val = generator_loss(discriminator_gen_outputs)
         total_generator_loss += generator_loss_val
 
-      generator_grad = g.gradient(generator_loss_val, generator.variables)
-      discriminator_grad = g.gradient(discriminator_loss_val,
-                                      discriminator.variables)
+      generator_grad = gen_tape.gradient(generator_loss_val,
+                                         generator.variables)
+      discriminator_grad = disc_tape.gradient(discriminator_loss_val,
+                                              discriminator.variables)
 
       generator_optimizer.apply_gradients(
           zip(generator_grad, generator.variables))
@@ -261,7 +264,7 @@ def train_one_epoch(generator, discriminator, generator_optimizer,
 
 def main(_):
   (device, data_format) = ('/gpu:0', 'channels_first')
-  if FLAGS.no_gpu or tfe.num_gpus() <= 0:
+  if FLAGS.no_gpu or tf.contrib.eager.num_gpus() <= 0:
     (device, data_format) = ('/cpu:0', 'channels_last')
   print('Using device %s, and data format %s.' % (device, data_format))
 
@@ -287,7 +290,7 @@ def main(_):
   latest_cpkt = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
   if latest_cpkt:
     print('Using latest checkpoint at ' + latest_cpkt)
-  checkpoint = tfe.Checkpoint(**model_objects)
+  checkpoint = tf.train.Checkpoint(**model_objects)
   # Restore variables on creation if a checkpoint exists.
   checkpoint.restore(latest_cpkt)
 
diff --git a/tensorflow/contrib/eager/python/examples/generative_examples/dcgan.ipynb b/tensorflow/contrib/eager/python/examples/generative_examples/dcgan.ipynb
new file mode 100644
index 0000000000..44ff43a111
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/generative_examples/dcgan.ipynb
@@ -0,0 +1,733 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "0TD5ZrvEMbhZ"
+      },
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors.\n",
+        "\n",
+        "Licensed under the Apache License, Version 2.0 (the \"License\").\n",
+        "\n",
+        "# DCGAN: An example with tf.keras and eager\n",
+        "\n",
+        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\u003ctd\u003e\n",
+        "\u003ca target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/generative_examples/dcgan.ipynb\"\u003e\n",
+        "    \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e  \n",
+        "\u003c/td\u003e\u003ctd\u003e\n",
+        "\u003ca target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/generative_examples/dcgan.ipynb\"\u003e\u003cimg width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\u003c/td\u003e\u003c/table\u003e"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "ITZuApL56Mny"
+      },
+      "source": [
+        "This notebook demonstrates how to generate images of handwritten digits using [tf.keras](https://www.tensorflow.org/programmers_guide/keras) and [eager execution](https://www.tensorflow.org/programmers_guide/eager). To do so, we use Deep Convolutional Generative Adverserial Networks ([DCGAN](https://arxiv.org/pdf/1511.06434.pdf)).\n",
+        "\n",
+        "This model takes about ~30 seconds per epoch (using tf.contrib.eager.defun to create graph functions) to train on a single Tesla K80 on Colab, as of July 2018.\n",
+        "\n",
+        "Below is the output generated after training the generator and discriminator models for 150 epochs.\n",
+        "\n",
+        "![sample output](https://tensorflow.org/images/gan/dcgan.gif)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "u_2z-B3piVsw"
+      },
+      "outputs": [],
+      "source": [
+        "# to generate gifs\n",
+        "!pip install imageio"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "e1_Y75QXJS6h"
+      },
+      "source": [
+        "## Import TensorFlow and enable eager execution"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "YfIk2es3hJEd"
+      },
+      "outputs": [],
+      "source": [
+        "from __future__ import absolute_import, division, print_function\n",
+        "\n",
+        "# Import TensorFlow \u003e= 1.9 and enable eager execution\n",
+        "import tensorflow as tf\n",
+        "tf.enable_eager_execution()\n",
+        "\n",
+        "import os\n",
+        "import time\n",
+        "import numpy as np\n",
+        "import glob\n",
+        "import matplotlib.pyplot as plt\n",
+        "import PIL\n",
+        "import imageio\n",
+        "from IPython import display"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "iYn4MdZnKCey"
+      },
+      "source": [
+        "## Load the dataset\n",
+        "\n",
+        "We are going to use the MNIST dataset to train the generator and the discriminator. The generator will then generate handwritten digits."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "a4fYMGxGhrna"
+      },
+      "outputs": [],
+      "source": [
+        "(train_images, train_labels), (_, _) = tf.keras.datasets.mnist.load_data()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "NFC2ghIdiZYE"
+      },
+      "outputs": [],
+      "source": [
+        "train_images = train_images.reshape(train_images.shape[0], 28, 28, 1).astype('float32')\n",
+        "# We are normalizing the images to the range of [-1, 1]\n",
+        "train_images = (train_images - 127.5) / 127.5"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "S4PIDhoDLbsZ"
+      },
+      "outputs": [],
+      "source": [
+        "BUFFER_SIZE = 60000\n",
+        "BATCH_SIZE = 256"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "PIGN6ouoQxt3"
+      },
+      "source": [
+        "## Use tf.data to create batches and shuffle the dataset"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "-yKCCQOoJ7cn"
+      },
+      "outputs": [],
+      "source": [
+        "train_dataset = tf.data.Dataset.from_tensor_slices(train_images).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "THY-sZMiQ4UV"
+      },
+      "source": [
+        "## Write the generator and discriminator models\n",
+        "\n",
+        "* **Generator** \n",
+        "  * It is responsible for **creating convincing images that are good enough to fool the discriminator**.\n",
+        "  * It consists of Conv2DTranspose (Upsampling) layers. We start with a fully connected layer and upsample the image 2 times so as to reach the desired image size (mnist image size) which is (28, 28, 1). \n",
+        "  * We use **leaky relu** activation except for the **last layer** which uses **tanh** activation.\n",
+        "  \n",
+        "* **Discriminator**\n",
+        "  * **The discriminator is responsible for classifying the fake images from the real images.**\n",
+        "  * In other words, the discriminator is given generated images (from the generator) and the real MNIST images. The job of the discriminator is to classify these images into fake (generated) and real (MNIST images).\n",
+        "  * **Basically the generator should be good enough to fool the discriminator that the generated images are real**."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "VGLbvBEmjK0a"
+      },
+      "outputs": [],
+      "source": [
+        "class Generator(tf.keras.Model):\n",
+        "  def __init__(self):\n",
+        "    super(Generator, self).__init__()\n",
+        "    self.fc1 = tf.keras.layers.Dense(7*7*64, use_bias=False)\n",
+        "    self.batchnorm1 = tf.keras.layers.BatchNormalization()\n",
+        "    \n",
+        "    self.conv1 = tf.keras.layers.Conv2DTranspose(64, (5, 5), strides=(1, 1), padding='same', use_bias=False)\n",
+        "    self.batchnorm2 = tf.keras.layers.BatchNormalization()\n",
+        "    \n",
+        "    self.conv2 = tf.keras.layers.Conv2DTranspose(32, (5, 5), strides=(2, 2), padding='same', use_bias=False)\n",
+        "    self.batchnorm3 = tf.keras.layers.BatchNormalization()\n",
+        "    \n",
+        "    self.conv3 = tf.keras.layers.Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False)\n",
+        "\n",
+        "  def call(self, x, training=True):\n",
+        "    x = self.fc1(x)\n",
+        "    x = self.batchnorm1(x, training=training)\n",
+        "    x = tf.nn.relu(x)\n",
+        "\n",
+        "    x = tf.reshape(x, shape=(-1, 7, 7, 64))\n",
+        "\n",
+        "    x = self.conv1(x)\n",
+        "    x = self.batchnorm2(x, training=training)\n",
+        "    x = tf.nn.relu(x)\n",
+        "\n",
+        "    x = self.conv2(x)\n",
+        "    x = self.batchnorm3(x, training=training)\n",
+        "    x = tf.nn.relu(x)\n",
+        "\n",
+        "    x = tf.nn.tanh(self.conv3(x))  \n",
+        "    return x"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "bkOfJxk5j5Hi"
+      },
+      "outputs": [],
+      "source": [
+        "class Discriminator(tf.keras.Model):\n",
+        "  def __init__(self):\n",
+        "    super(Discriminator, self).__init__()\n",
+        "    self.conv1 = tf.keras.layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same')\n",
+        "    self.conv2 = tf.keras.layers.Conv2D(128, (5, 5), strides=(2, 2), padding='same')\n",
+        "    self.dropout = tf.keras.layers.Dropout(0.3)\n",
+        "    self.flatten = tf.keras.layers.Flatten()\n",
+        "    self.fc1 = tf.keras.layers.Dense(1)\n",
+        "\n",
+        "  def call(self, x, training=True):\n",
+        "    x = tf.nn.leaky_relu(self.conv1(x))\n",
+        "    x = self.dropout(x, training=training)\n",
+        "    x = tf.nn.leaky_relu(self.conv2(x))\n",
+        "    x = self.dropout(x, training=training)\n",
+        "    x = self.flatten(x)\n",
+        "    x = self.fc1(x)\n",
+        "    return x"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "gDkA05NE6QMs"
+      },
+      "outputs": [],
+      "source": [
+        "generator = Generator()\n",
+        "discriminator = Discriminator()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "k1HpMSLImuRi"
+      },
+      "outputs": [],
+      "source": [
+        "# Defun gives 10 secs/epoch performance boost\n",
+        "generator.call = tf.contrib.eager.defun(generator.call)\n",
+        "discriminator.call = tf.contrib.eager.defun(discriminator.call)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "0FMYgY_mPfTi"
+      },
+      "source": [
+        "## Define the loss functions and the optimizer\n",
+        "\n",
+        "* **Discriminator loss**\n",
+        "  * The discriminator loss function takes 2 inputs; **real images, generated images**\n",
+        "  * real_loss is a sigmoid cross entropy loss of the **real images** and an **array of ones (since these are the real images)**\n",
+        "  * generated_loss is a sigmoid cross entropy loss of the **generated images** and an **array of zeros (since these are the fake images)**\n",
+        "  * Then the total_loss is the sum of real_loss and the generated_loss\n",
+        "  \n",
+        "* **Generator loss**\n",
+        "  * It is a sigmoid cross entropy loss of the generated images and an **array of ones**\n",
+        "  \n",
+        "\n",
+        "* The discriminator and the generator optimizers are different since we will train them separately."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "wkMNfBWlT-PV"
+      },
+      "outputs": [],
+      "source": [
+        "def discriminator_loss(real_output, generated_output):\n",
+        "    # [1,1,...,1] with real output since it is true and we want\n",
+        "    # our generated examples to look like it\n",
+        "    real_loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=tf.ones_like(real_output), logits=real_output)\n",
+        "\n",
+        "    # [0,0,...,0] with generated images since they are fake\n",
+        "    generated_loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=tf.zeros_like(generated_output), logits=generated_output)\n",
+        "\n",
+        "    total_loss = real_loss + generated_loss\n",
+        "\n",
+        "    return total_loss"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "90BIcCKcDMxz"
+      },
+      "outputs": [],
+      "source": [
+        "def generator_loss(generated_output):\n",
+        "    return tf.losses.sigmoid_cross_entropy(tf.ones_like(generated_output), generated_output)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "iWCn_PVdEJZ7"
+      },
+      "outputs": [],
+      "source": [
+        "discriminator_optimizer = tf.train.AdamOptimizer(1e-4)\n",
+        "generator_optimizer = tf.train.AdamOptimizer(1e-4)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "Rw1fkAczTQYh"
+      },
+      "source": [
+        "## Training\n",
+        "\n",
+        "* We start by iterating over the dataset\n",
+        "* The generator is given **noise as an input** which when passed through the generator model will output a image looking like a handwritten digit\n",
+        "* The discriminator is given the **real MNIST images as well as the generated images (from the generator)**.\n",
+        "* Next, we calculate the generator and the discriminator loss.\n",
+        "* Then, we calculate the gradients of loss with respect to both the generator and the discriminator variables (inputs) and apply those to the optimizer.\n",
+        "\n",
+        "## Generate Images\n",
+        "\n",
+        "* After training, its time to generate some images!\n",
+        "* We start by creating noise array as an input to the generator\n",
+        "* The generator will then convert the noise into handwritten images.\n",
+        "* Last step is to plot the predictions and **voila!**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "NS2GWywBbAWo"
+      },
+      "outputs": [],
+      "source": [
+        "EPOCHS = 150\n",
+        "noise_dim = 100\n",
+        "num_examples_to_generate = 100\n",
+        "\n",
+        "# keeping the random vector constant for generation (prediction) so\n",
+        "# it will be easier to see the improvement of the gan.\n",
+        "random_vector_for_generation = tf.random_normal([num_examples_to_generate,\n",
+        "                                                 noise_dim])"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "RmdVsmvhPxyy"
+      },
+      "outputs": [],
+      "source": [
+        "def generate_and_save_images(model, epoch, test_input):\n",
+        "  # make sure the training parameter is set to False because we\n",
+        "  # don't want to train the batchnorm layer when doing inference.\n",
+        "  predictions = model(test_input, training=False)\n",
+        "\n",
+        "  fig = plt.figure(figsize=(10,10))\n",
+        "  \n",
+        "  for i in range(predictions.shape[0]):\n",
+        "      plt.subplot(10, 10, i+1)\n",
+        "      plt.imshow(predictions[i, :, :, 0] * 127.5 + 127.5, cmap='gray')\n",
+        "      plt.axis('off')\n",
+        "        \n",
+        "  # tight_layout minimizes the overlap between 2 sub-plots\n",
+        "  plt.tight_layout()\n",
+        "  plt.savefig('image_at_epoch_{:04d}.png'.format(epoch))\n",
+        "  plt.show()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "2M7LmLtGEMQJ"
+      },
+      "outputs": [],
+      "source": [
+        "def train(dataset, epochs, noise_dim):  \n",
+        "  for epoch in range(epochs):\n",
+        "    start = time.time()\n",
+        "    \n",
+        "    for images in dataset:\n",
+        "      # generating noise from a uniform distribution\n",
+        "      noise = tf.random_normal([BATCH_SIZE, noise_dim])\n",
+        "      \n",
+        "      with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:\n",
+        "        generated_images = generator(noise, training=True)\n",
+        "      \n",
+        "        real_output = discriminator(images, training=True)\n",
+        "        generated_output = discriminator(generated_images, training=True)\n",
+        "        \n",
+        "        gen_loss = generator_loss(generated_output)\n",
+        "        disc_loss = discriminator_loss(real_output, generated_output)\n",
+        "        \n",
+        "      gradients_of_generator = gen_tape.gradient(gen_loss, generator.variables)\n",
+        "      gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.variables)\n",
+        "      \n",
+        "      generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.variables))\n",
+        "      discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.variables))\n",
+        "\n",
+        "      \n",
+        "    if epoch % 10 == 0:\n",
+        "      display.clear_output(wait=True)\n",
+        "      generate_and_save_images(generator,\n",
+        "                               epoch + 1,\n",
+        "                               random_vector_for_generation)\n",
+        "\n",
+        "    print ('Time taken for epoch {} is {} sec'.format(epoch + 1,\n",
+        "                                                      time.time()-start))\n",
+        "  # generating after the final epoch\n",
+        "  generate_and_save_images(generator,\n",
+        "                           epochs,\n",
+        "                           random_vector_for_generation)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "Ly3UN0SLLY2l"
+      },
+      "outputs": [],
+      "source": [
+        "train(train_dataset, EPOCHS, noise_dim)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "P4M_vIbUi7c0"
+      },
+      "source": [
+        "# Display an image using the epoch number"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "WfO5wCdclHGL"
+      },
+      "outputs": [],
+      "source": [
+        "def display_image(epoch_no):\n",
+        "  plt.figure(figsize=(15,15))\n",
+        "  plt.imshow(np.array(PIL.Image.open('image_at_epoch_{:04d}.png'.format(epoch_no))))\n",
+        "  plt.axis('off')"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "5x3q9_Oe5q0A"
+      },
+      "outputs": [],
+      "source": [
+        "display_image(EPOCHS)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "NywiH3nL8guF"
+      },
+      "source": [
+        "## Generate a GIF of all the saved images."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "xmO0Dmu2WICn"
+      },
+      "source": [
+        "\u003c!-- TODO(markdaoust): Remove the hack when Ipython version is updated --\u003e\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "IGKQgENQ8lEI"
+      },
+      "outputs": [],
+      "source": [
+        "with imageio.get_writer('dcgan.gif', mode='I') as writer:\n",
+        "  filenames = glob.glob('image*.png')\n",
+        "  filenames = sorted(filenames)\n",
+        "  for filename in filenames:\n",
+        "    image = imageio.imread(filename)\n",
+        "    writer.append_data(image)\n",
+        "  # this is a hack to display the gif inside the notebook\n",
+        "  os.system('mv dcgan.gif dcgan.gif.png')"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "uV0yiKpzNP1b"
+      },
+      "outputs": [],
+      "source": [
+        "display.Image(filename=\"dcgan.gif.png\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "4UJjSnIMOzOJ"
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "dcgan.ipynb",
+      "private_outputs": true,
+      "provenance": [
+        {
+          "file_id": "1eb0NOTQapkYs3X0v-zL1x5_LFKgDISnp",
+          "timestamp": 1527173385672
+        }
+      ],
+      "toc_visible": true,
+      "version": "0.3.2",
+      "views": {}
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/eager/python/examples/generative_examples/image_captioning_with_attention.ipynb b/tensorflow/contrib/eager/python/examples/generative_examples/image_captioning_with_attention.ipynb
new file mode 100644
index 0000000000..1a5a186e7a
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/generative_examples/image_captioning_with_attention.ipynb
@@ -0,0 +1,1184 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "image_captioning_with_attention.ipynb",
+      "version": "0.3.2",
+      "views": {},
+      "default_view": {},
+      "provenance": [
+        {
+          "file_id": "1HI8OK2sMjcx9CTWVn0122QAHOuXaOaMg",
+          "timestamp": 1530222436922
+        }
+      ],
+      "private_outputs": true,
+      "collapsed_sections": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "metadata": {
+        "id": "K2s1A9eLRPEj",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors.\n",
+        "\n",
+        "Licensed under the Apache License, Version 2.0 (the \"License\").\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Cffg2i257iMS",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Image Captioning with Attention\n",
+        "\n",
+        "<table class=\"tfo-notebook-buttons\" align=\"left\"><td>\n",
+        "<a target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/generative_examples/image_captioning_with_attention.ipynb\">\n",
+        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>  \n",
+        "</td><td>\n",
+        "<a target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/generative_examples/image_captioning_with_attention.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a></td></table>"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "QASbY_HGo4Lq",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Image captioning is the task of generating a caption for an image. Given an image like this:\n",
+        "\n",
+        "![Man Surfing](https://tensorflow.org/images/surf.jpg) \n",
+        "\n",
+        "[Image Source](https://commons.wikimedia.org/wiki/Surfing#/media/File:Surfing_in_Hawaii.jpg), License: Public Domain\n",
+        "\n",
+        "Our goal is generate a caption, such as \"a surfer riding on a wave\". Here, we'll use an attention based model. This enables us to see which parts of the image the model focuses on as it generates a caption.\n",
+        "\n",
+        "![Prediction](https://tensorflow.org/images/imcap_prediction.png)\n",
+        "\n",
+        "This model architecture below is similar to [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044). \n",
+        "\n",
+        "The code uses [tf.keras](https://www.tensorflow.org/programmers_guide/keras) and [eager execution](https://www.tensorflow.org/programmers_guide/eager), which you can learn more about in the linked guides.\n",
+        "\n",
+        "This notebook is an end-to-end example. If you run it, it will download the  [MS-COCO](http://cocodataset.org/#home) dataset, preprocess and cache a subset of the images using Inception V3, train an encoder-decoder model, and use it to generate captions on new images.\n",
+        "\n",
+        "The code requires TensorFlow version >=1.9. If you're running this in [Colab]()\n",
+        "\n",
+        "In this example, we're training on a relatively small amount of data as an example. On a single P100 GPU, this example will take about ~2 hours to train. We train on the first 30,000 captions (corresponding to about ~20,000 images depending on shuffling, as there are multiple captions per image in the dataset)\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "U8l4RJ0XRPEm",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# Import TensorFlow and enable eager execution\n",
+        "# This code requires TensorFlow version >=1.9\n",
+        "import tensorflow as tf\n",
+        "tf.enable_eager_execution()\n",
+        "\n",
+        "# We'll generate plots of attention in order to see which parts of an image\n",
+        "# our model focuses on during captioning\n",
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "# Scikit-learn includes many helpful utilities\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "from sklearn.utils import shuffle\n",
+        "\n",
+        "import re\n",
+        "import numpy as np\n",
+        "import os\n",
+        "import time\n",
+        "import json\n",
+        "from glob import glob\n",
+        "from PIL import Image\n",
+        "import pickle"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "b6qbGw8MRPE5",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Download and prepare the MS-COCO dataset\n",
+        "\n",
+        "We will use the [MS-COCO dataset](http://cocodataset.org/#home) to train our model. This dataset contains >82,000 images, each of which has been annotated with at least 5 different captions. The code code below will download and extract the dataset automatically.  \n",
+        "\n",
+        "**Caution: large download ahead**. We'll use the training set, it's a 13GB file."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "krQuPYTtRPE7",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "annotation_zip = tf.keras.utils.get_file('captions.zip', \n",
+        "                                          cache_subdir=os.path.abspath('.'),\n",
+        "                                          origin = 'http://images.cocodataset.org/annotations/annotations_trainval2014.zip',\n",
+        "                                          extract = True)\n",
+        "annotation_file = os.path.dirname(annotation_zip)+'/annotations/captions_train2014.json'\n",
+        "\n",
+        "name_of_zip = 'train2014.zip'\n",
+        "if not os.path.exists(os.path.abspath('.') + '/' + name_of_zip):\n",
+        "  image_zip = tf.keras.utils.get_file(name_of_zip, \n",
+        "                                      cache_subdir=os.path.abspath('.'),\n",
+        "                                      origin = 'http://images.cocodataset.org/zips/train2014.zip',\n",
+        "                                      extract = True)\n",
+        "  PATH = os.path.dirname(image_zip)+'/train2014/'\n",
+        "else:\n",
+        "  PATH = os.path.abspath('.')+'/train2014/'"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "aANEzb5WwSzg",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Optionally, limit the size of the training set for faster training\n",
+        "For this example, we'll select a subset of 30,000 captions and use these and the corresponding images to train our model. As always, captioning quality will improve if you choose to use more data."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "4G3b8x8_RPFD",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# read the json file\n",
+        "with open(annotation_file, 'r') as f:\n",
+        "    annotations = json.load(f)\n",
+        "\n",
+        "# storing the captions and the image name in vectors\n",
+        "all_captions = []\n",
+        "all_img_name_vector = []\n",
+        "\n",
+        "for annot in annotations['annotations']:\n",
+        "    caption = '<start> ' + annot['caption'] + ' <end>'\n",
+        "    image_id = annot['image_id']\n",
+        "    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)\n",
+        "    \n",
+        "    all_img_name_vector.append(full_coco_image_path)\n",
+        "    all_captions.append(caption)\n",
+        "\n",
+        "# shuffling the captions and image_names together\n",
+        "# setting a random state\n",
+        "train_captions, img_name_vector = shuffle(all_captions,\n",
+        "                                          all_img_name_vector,\n",
+        "                                          random_state=1)\n",
+        "\n",
+        "# selecting the first 30000 captions from the shuffled set\n",
+        "num_examples = 30000\n",
+        "train_captions = train_captions[:num_examples]\n",
+        "img_name_vector = img_name_vector[:num_examples]"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "mPBMgK34RPFL",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "len(train_captions), len(all_captions)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "8cSW4u-ORPFQ",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Preprocess the images using InceptionV3\n",
+        "Next, we will use InceptionV3 (pretrained on Imagenet) to classify each image. We will extract features from the last convolutional layer. \n",
+        "\n",
+        "First, we will need to convert the images into the format inceptionV3 expects by:\n",
+        "* Resizing the image to (299, 299)\n",
+        "* Using the [preprocess_input](https://www.tensorflow.org/api_docs/python/tf/keras/applications/inception_v3/preprocess_input) method to place the pixels in the range of -1 to 1 (to match the format of the images used to train InceptionV3)."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "zXR0217aRPFR",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "def load_image(image_path):\n",
+        "    img = tf.read_file(image_path)\n",
+        "    img = tf.image.decode_jpeg(img, channels=3)\n",
+        "    img = tf.image.resize_images(img, (299, 299))\n",
+        "    img = tf.keras.applications.inception_v3.preprocess_input(img)\n",
+        "    return img, image_path"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "MDvIu4sXRPFV",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Initialize InceptionV3 and load the pretrained Imagenet weights\n",
+        "\n",
+        "To do so, we'll create a tf.keras model where the output layer is the last convolutional layer in the InceptionV3 architecture. \n",
+        "* Each image is forwarded through the network and the vector that we get at the end is stored in a dictionary (image_name --> feature_vector). \n",
+        "* We use the last convolutional layer because we are using attention in this example. The shape of the output of this layer is ```8x8x2048```. \n",
+        "* We avoid doing this during training so it does not become a bottleneck. \n",
+        "* After all the images are passed through the network, we pickle the dictionary and save it to disk."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "RD3vW4SsRPFW",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "image_model = tf.keras.applications.InceptionV3(include_top=False, \n",
+        "                                                weights='imagenet')\n",
+        "new_input = image_model.input\n",
+        "hidden_layer = image_model.layers[-1].output\n",
+        "\n",
+        "image_features_extract_model = tf.keras.Model(new_input, hidden_layer)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "rERqlR3WRPGO",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Caching the features extracted from InceptionV3\n",
+        "\n",
+        "We will pre-process each image with InceptionV3 and cache the output to disk. Caching the output in RAM would be faster but memory intensive, requiring 8 \\* 8 \\* 2048 floats per image. At the time of writing, this would exceed the memory limitations of Colab (although these may change, an instance appears to have about 12GB of memory currently). \n",
+        "\n",
+        "Performance could be improved with a more sophisticated caching strategy (e.g., by sharding the images to reduce random access disk I/O) at the cost of more code.\n",
+        "\n",
+        "This will take about 10 minutes to run in Colab with a GPU. If you'd like to see a progress bar, you could: install [tqdm](https://github.com/tqdm/tqdm) (```!pip install tqdm```), then change this line: \n",
+        "\n",
+        "```for img, path in image_dataset:``` \n",
+        "\n",
+        "to:\n",
+        "\n",
+        "```for img, path in tqdm(image_dataset):```."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Dx_fvbVgRPGQ",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# getting the unique images\n",
+        "encode_train = sorted(set(img_name_vector))\n",
+        "\n",
+        "# feel free to change the batch_size according to your system configuration\n",
+        "image_dataset = tf.data.Dataset.from_tensor_slices(\n",
+        "                                encode_train).map(load_image).batch(16)\n",
+        "\n",
+        "for img, path in image_dataset:\n",
+        "  batch_features = image_features_extract_model(img)\n",
+        "  batch_features = tf.reshape(batch_features, \n",
+        "                              (batch_features.shape[0], -1, batch_features.shape[3]))\n",
+        "\n",
+        "  for bf, p in zip(batch_features, path):\n",
+        "    path_of_feature = p.numpy().decode(\"utf-8\")\n",
+        "    np.save(path_of_feature, bf.numpy())"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "nyqH3zFwRPFi",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Preprocess and tokenize the captions\n",
+        "\n",
+        "* First, we'll tokenize the captions (e.g., by splitting on spaces). This will give us a  vocabulary of all the unique words in the data (e.g., \"surfing\", \"football\", etc).\n",
+        "* Next, we'll limit the vocabulary size to the top 5,000 words to save memory. We'll replace all other words with the token \"UNK\" (for unknown).\n",
+        "* Finally, we create a word --> index mapping and vice-versa.\n",
+        "* We will then pad all sequences to the be same length as the longest one. "
+      ]
+    },
+    {
+      "metadata": {
+        "id": "HZfK8RhQRPFj",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# This will find the maximum length of any caption in our dataset\n",
+        "def calc_max_length(tensor):\n",
+        "    return max(len(t) for t in tensor)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "oJGE34aiRPFo",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# The steps above is a general process of dealing with text processing\n",
+        "\n",
+        "# choosing the top 5000 words from the vocabulary\n",
+        "top_k = 5000\n",
+        "tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k, \n",
+        "                                                  oov_token=\"<unk>\", \n",
+        "                                                  filters='!\"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')\n",
+        "tokenizer.fit_on_texts(train_captions)\n",
+        "train_seqs = tokenizer.texts_to_sequences(train_captions)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "8Q44tNQVRPFt",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "tokenizer.word_index = {key:value for key, value in tokenizer.word_index.items() if value <= top_k}\n",
+        "# putting <unk> token in the word2idx dictionary\n",
+        "tokenizer.word_index[tokenizer.oov_token] = top_k + 1\n",
+        "tokenizer.word_index['<pad>'] = 0"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "0fpJb5ojRPFv",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# creating the tokenized vectors\n",
+        "train_seqs = tokenizer.texts_to_sequences(train_captions)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "olQArbgbRPF1",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# creating a reverse mapping (index -> word)\n",
+        "index_word = {value:key for key, value in tokenizer.word_index.items()}"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "AidglIZVRPF4",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# padding each vector to the max_length of the captions\n",
+        "# if the max_length parameter is not provided, pad_sequences calculates that automatically\n",
+        "cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "gL0wkttkRPGA",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# calculating the max_length \n",
+        "# used to store the attention weights\n",
+        "max_length = calc_max_length(train_seqs)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "M3CD75nDpvTI",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Split the data into training and testing"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "iS7DDMszRPGF",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# Create training and validation sets using 80-20 split\n",
+        "img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector, \n",
+        "                                                                    cap_vector, \n",
+        "                                                                    test_size=0.2, \n",
+        "                                                                    random_state=0)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "XmViPkRFRPGH",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "len(img_name_train), len(cap_train), len(img_name_val), len(cap_val)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "uEWM9xrYcg45",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Our images and captions are ready! Next, let's create a tf.data dataset to use for training our model.\n",
+        "\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Q3TnZ1ToRPGV",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# feel free to change these parameters according to your system's configuration\n",
+        "\n",
+        "BATCH_SIZE = 64\n",
+        "BUFFER_SIZE = 1000\n",
+        "embedding_dim = 256\n",
+        "units = 512\n",
+        "vocab_size = len(tokenizer.word_index)\n",
+        "# shape of the vector extracted from InceptionV3 is (64, 2048)\n",
+        "# these two variables represent that\n",
+        "features_shape = 2048\n",
+        "attention_features_shape = 64"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "SmZS2N0bXG3T",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# loading the numpy files \n",
+        "def map_func(img_name, cap):\n",
+        "    img_tensor = np.load(img_name.decode('utf-8')+'.npy')\n",
+        "    return img_tensor, cap"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "FDF_Nm3tRPGZ",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))\n",
+        "\n",
+        "# using map to load the numpy files in parallel\n",
+        "# NOTE: Be sure to set num_parallel_calls to the number of CPU cores you have\n",
+        "# https://www.tensorflow.org/api_docs/python/tf/py_func\n",
+        "dataset = dataset.map(lambda item1, item2: tf.py_func(\n",
+        "          map_func, [item1, item2], [tf.float32, tf.int32]), num_parallel_calls=8)\n",
+        "\n",
+        "# shuffling and batching\n",
+        "dataset = dataset.shuffle(BUFFER_SIZE)\n",
+        "# https://www.tensorflow.org/api_docs/python/tf/contrib/data/batch_and_drop_remainder\n",
+        "dataset = dataset.batch(BATCH_SIZE)\n",
+        "dataset = dataset.prefetch(1)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "nrvoDphgRPGd",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Model\n",
+        "\n",
+        "Fun fact, the decoder below is identical to the one in the example for [Neural Machine Translation with Attention]( https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb).\n",
+        "\n",
+        "The model architecture is inspired by the [Show, Attend and Tell](https://arxiv.org/pdf/1502.03044.pdf) paper.\n",
+        "\n",
+        "* In this example, we extract the features from the lower convolutional layer of InceptionV3 giving us a vector of shape (8, 8, 2048). \n",
+        "* We squash that to a shape of (64, 2048).\n",
+        "* This vector is then passed through the CNN Encoder(which consists of a single Fully connected layer).\n",
+        "* The RNN(here GRU) attends over the image to predict the next word."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "AAppCGLKRPGd",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "def gru(units):\n",
+        "  # If you have a GPU, we recommend using the CuDNNGRU layer (it provides a \n",
+        "  # significant speedup).\n",
+        "  if tf.test.is_gpu_available():\n",
+        "    return tf.keras.layers.CuDNNGRU(units, \n",
+        "                                    return_sequences=True, \n",
+        "                                    return_state=True, \n",
+        "                                    recurrent_initializer='glorot_uniform')\n",
+        "  else:\n",
+        "    return tf.keras.layers.GRU(units, \n",
+        "                               return_sequences=True, \n",
+        "                               return_state=True, \n",
+        "                               recurrent_activation='sigmoid', \n",
+        "                               recurrent_initializer='glorot_uniform')"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "ja2LFTMSdeV3",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "class BahdanauAttention(tf.keras.Model):\n",
+        "  def __init__(self, units):\n",
+        "    super(BahdanauAttention, self).__init__()\n",
+        "    self.W1 = tf.keras.layers.Dense(units)\n",
+        "    self.W2 = tf.keras.layers.Dense(units)\n",
+        "    self.V = tf.keras.layers.Dense(1)\n",
+        "  \n",
+        "  def call(self, features, hidden):\n",
+        "    # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)\n",
+        "    \n",
+        "    # hidden shape == (batch_size, hidden_size)\n",
+        "    # hidden_with_time_axis shape == (batch_size, 1, hidden_size)\n",
+        "    hidden_with_time_axis = tf.expand_dims(hidden, 1)\n",
+        "    \n",
+        "    # score shape == (batch_size, 64, hidden_size)\n",
+        "    score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))\n",
+        "    \n",
+        "    # attention_weights shape == (batch_size, 64, 1)\n",
+        "    # we get 1 at the last axis because we are applying score to self.V\n",
+        "    attention_weights = tf.nn.softmax(self.V(score), axis=1)\n",
+        "    \n",
+        "    # context_vector shape after sum == (batch_size, hidden_size)\n",
+        "    context_vector = attention_weights * features\n",
+        "    context_vector = tf.reduce_sum(context_vector, axis=1)\n",
+        "    \n",
+        "    return context_vector, attention_weights"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "AZ7R1RxHRPGf",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "class CNN_Encoder(tf.keras.Model):\n",
+        "    # Since we have already extracted the features and dumped it using pickle\n",
+        "    # This encoder passes those features through a Fully connected layer\n",
+        "    def __init__(self, embedding_dim):\n",
+        "        super(CNN_Encoder, self).__init__()\n",
+        "        # shape after fc == (batch_size, 64, embedding_dim)\n",
+        "        self.fc = tf.keras.layers.Dense(embedding_dim)\n",
+        "        \n",
+        "    def call(self, x):\n",
+        "        x = self.fc(x)\n",
+        "        x = tf.nn.relu(x)\n",
+        "        return x"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "V9UbGQmERPGi",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "class RNN_Decoder(tf.keras.Model):\n",
+        "  def __init__(self, embedding_dim, units, vocab_size):\n",
+        "    super(RNN_Decoder, self).__init__()\n",
+        "    self.units = units\n",
+        "\n",
+        "    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)\n",
+        "    self.gru = gru(self.units)\n",
+        "    self.fc1 = tf.keras.layers.Dense(self.units)\n",
+        "    self.fc2 = tf.keras.layers.Dense(vocab_size)\n",
+        "    \n",
+        "    self.attention = BahdanauAttention(self.units)\n",
+        "        \n",
+        "  def call(self, x, features, hidden):\n",
+        "    # defining attention as a separate model\n",
+        "    context_vector, attention_weights = self.attention(features, hidden)\n",
+        "    \n",
+        "    # x shape after passing through embedding == (batch_size, 1, embedding_dim)\n",
+        "    x = self.embedding(x)\n",
+        "    \n",
+        "    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)\n",
+        "    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)\n",
+        "    \n",
+        "    # passing the concatenated vector to the GRU\n",
+        "    output, state = self.gru(x)\n",
+        "    \n",
+        "    # shape == (batch_size, max_length, hidden_size)\n",
+        "    x = self.fc1(output)\n",
+        "    \n",
+        "    # x shape == (batch_size * max_length, hidden_size)\n",
+        "    x = tf.reshape(x, (-1, x.shape[2]))\n",
+        "    \n",
+        "    # output shape == (batch_size * max_length, vocab)\n",
+        "    x = self.fc2(x)\n",
+        "\n",
+        "    return x, state, attention_weights\n",
+        "\n",
+        "  def reset_state(self, batch_size):\n",
+        "    return tf.zeros((batch_size, self.units))"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "Qs_Sr03wRPGk",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "encoder = CNN_Encoder(embedding_dim)\n",
+        "decoder = RNN_Decoder(embedding_dim, units, vocab_size)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "-bYN7xA0RPGl",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "optimizer = tf.train.AdamOptimizer()\n",
+        "\n",
+        "# We are masking the loss calculated for padding\n",
+        "def loss_function(real, pred):\n",
+        "    mask = 1 - np.equal(real, 0)\n",
+        "    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask\n",
+        "    return tf.reduce_mean(loss_)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "PHod7t72RPGn",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Training\n",
+        "\n",
+        "* We extract the features stored in the respective `.npy` files and then pass those features through the encoder.\n",
+        "* The encoder output, hidden state(initialized to 0) and the decoder input (which is the start token) is passed to the decoder.\n",
+        "* The decoder returns the predictions and the decoder hidden state.\n",
+        "* The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.\n",
+        "* Use teacher forcing to decide the next input to the decoder.\n",
+        "* Teacher forcing is the technique where the target word is passed as the next input to the decoder.\n",
+        "* The final step is to calculate the gradients and apply it to the optimizer and backpropagate.\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Vt4WZ5mhJE-E",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# adding this in a separate cell because if you run the training cell \n",
+        "# many times, the loss_plot array will be reset\n",
+        "loss_plot = []"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "UlA4VIQpRPGo",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "EPOCHS = 20\n",
+        "\n",
+        "for epoch in range(EPOCHS):\n",
+        "    start = time.time()\n",
+        "    total_loss = 0\n",
+        "    \n",
+        "    for (batch, (img_tensor, target)) in enumerate(dataset):\n",
+        "        loss = 0\n",
+        "        \n",
+        "        # initializing the hidden state for each batch\n",
+        "        # because the captions are not related from image to image\n",
+        "        hidden = decoder.reset_state(batch_size=target.shape[0])\n",
+        "\n",
+        "        dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)\n",
+        "        \n",
+        "        with tf.GradientTape() as tape:\n",
+        "            features = encoder(img_tensor)\n",
+        "            \n",
+        "            for i in range(1, target.shape[1]):\n",
+        "                # passing the features through the decoder\n",
+        "                predictions, hidden, _ = decoder(dec_input, features, hidden)\n",
+        "\n",
+        "                loss += loss_function(target[:, i], predictions)\n",
+        "                \n",
+        "                # using teacher forcing\n",
+        "                dec_input = tf.expand_dims(target[:, i], 1)\n",
+        "        \n",
+        "        total_loss += (loss / int(target.shape[1]))\n",
+        "        \n",
+        "        variables = encoder.variables + decoder.variables\n",
+        "        \n",
+        "        gradients = tape.gradient(loss, variables) \n",
+        "        \n",
+        "        optimizer.apply_gradients(zip(gradients, variables), tf.train.get_or_create_global_step())\n",
+        "        \n",
+        "        if batch % 100 == 0:\n",
+        "            print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, \n",
+        "                                                          batch, \n",
+        "                                                          loss.numpy() / int(target.shape[1])))\n",
+        "    # storing the epoch end loss value to plot later\n",
+        "    loss_plot.append(total_loss / len(cap_vector))\n",
+        "    \n",
+        "    print ('Epoch {} Loss {:.6f}'.format(epoch + 1, \n",
+        "                                         total_loss/len(cap_vector)))\n",
+        "    print ('Time taken for 1 epoch {} sec\\n'.format(time.time() - start))"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "1Wm83G-ZBPcC",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "plt.plot(loss_plot)\n",
+        "plt.xlabel('Epochs')\n",
+        "plt.ylabel('Loss')\n",
+        "plt.title('Loss Plot')\n",
+        "plt.show()"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "xGvOcLQKghXN",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Caption!\n",
+        "\n",
+        "* The evaluate function is similar to the training loop, except we don't use teacher forcing here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.\n",
+        "* Stop predicting when the model predicts the end token.\n",
+        "* And store the attention weights for every time step."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "RCWpDtyNRPGs",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "def evaluate(image):\n",
+        "    attention_plot = np.zeros((max_length, attention_features_shape))\n",
+        "\n",
+        "    hidden = decoder.reset_state(batch_size=1)\n",
+        "\n",
+        "    temp_input = tf.expand_dims(load_image(image)[0], 0)\n",
+        "    img_tensor_val = image_features_extract_model(temp_input)\n",
+        "    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))\n",
+        "\n",
+        "    features = encoder(img_tensor_val)\n",
+        "\n",
+        "    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)\n",
+        "    result = []\n",
+        "\n",
+        "    for i in range(max_length):\n",
+        "        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)\n",
+        "\n",
+        "        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()\n",
+        "\n",
+        "        predicted_id = tf.multinomial(tf.exp(predictions), num_samples=1)[0][0].numpy()\n",
+        "        result.append(index_word[predicted_id])\n",
+        "\n",
+        "        if index_word[predicted_id] == '<end>':\n",
+        "            return result, attention_plot\n",
+        "\n",
+        "        dec_input = tf.expand_dims([predicted_id], 0)\n",
+        "\n",
+        "    attention_plot = attention_plot[:len(result), :]\n",
+        "    return result, attention_plot"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "fD_y7PD6RPGt",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "def plot_attention(image, result, attention_plot):\n",
+        "    temp_image = np.array(Image.open(image))\n",
+        "\n",
+        "    fig = plt.figure(figsize=(10, 10))\n",
+        "    \n",
+        "    len_result = len(result)\n",
+        "    for l in range(len_result):\n",
+        "        temp_att = np.resize(attention_plot[l], (8, 8))\n",
+        "        ax = fig.add_subplot(len_result//2, len_result//2, l+1)\n",
+        "        ax.set_title(result[l])\n",
+        "        img = ax.imshow(temp_image)\n",
+        "        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())\n",
+        "\n",
+        "    plt.tight_layout()\n",
+        "    plt.show()"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "io7ws3ReRPGv",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# captions on the validation set\n",
+        "rid = np.random.randint(0, len(img_name_val))\n",
+        "image = img_name_val[rid]\n",
+        "real_caption = ' '.join([index_word[i] for i in cap_val[rid] if i not in [0]])\n",
+        "result, attention_plot = evaluate(image)\n",
+        "\n",
+        "print ('Real Caption:', real_caption)\n",
+        "print ('Prediction Caption:', ' '.join(result))\n",
+        "plot_attention(image, result, attention_plot)\n",
+        "# opening the image\n",
+        "Image.open(img_name_val[rid])"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "Rprk3HEvZuxb",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Try it on your own images\n",
+        "For fun, below we've provided a method you can use to caption your own images with the model we've just trained. Keep in mind, it was trained on a relatively small amount of data, and your images may be different from the training data (so be prepared for weird results!)\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "9Psd1quzaAWg",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "image_url = 'https://tensorflow.org/images/surf.jpg'\n",
+        "image_extension = image_url[-4:]\n",
+        "image_path = tf.keras.utils.get_file('image'+image_extension, \n",
+        "                                     origin=image_url)\n",
+        "\n",
+        "result, attention_plot = evaluate(image_path)\n",
+        "print ('Prediction Caption:', ' '.join(result))\n",
+        "plot_attention(image_path, result, attention_plot)\n",
+        "# opening the image\n",
+        "Image.open(image_path)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "VJZXyJco6uLO",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Next steps\n",
+        "\n",
+        "Congrats! You've just trained an image captioning model with attention. Next, we recommend taking a look at this example [Neural Machine Translation with Attention]( https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb). It uses a similar architecture to translate between Spanish and English sentences. You can also experiment with training the code in this notebook on a different dataset."
+      ]
+    }
+  ]
+}
diff --git a/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb b/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb
new file mode 100644
index 0000000000..b173f856c6
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb
@@ -0,0 +1,689 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "hcD2nPQvPOFM"
+      },
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors.\n",
+        "\n",
+        "Licensed under the Apache License, Version 2.0 (the \"License\").\n",
+        "\n",
+        "# Text Generation using a RNN\n",
+        "\n",
+        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\u003ctd\u003e\n",
+        "\u003ca target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb\"\u003e\n",
+        "    \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e  \n",
+        "\u003c/td\u003e\u003ctd\u003e\n",
+        "\u003ca target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb\"\u003e\u003cimg width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on Github\u003c/a\u003e\u003c/td\u003e\u003c/table\u003e"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "BwpJ5IffzRG6"
+      },
+      "source": [
+        "This notebook demonstrates how to generate text using an RNN using [tf.keras](https://www.tensorflow.org/programmers_guide/keras) and [eager execution](https://www.tensorflow.org/programmers_guide/eager). If you like, you can write a similar [model](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb) using less code. Here, we show a lower-level impementation that's useful to understand as prework before diving in to deeper examples in a similar, like [Neural Machine Translation with Attention](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb).\n",
+        "\n",
+        "This notebook is an end-to-end example. When you run it, it will download a dataset of Shakespeare's writing. We'll use a collection of plays, borrowed from Andrej Karpathy's excellent [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).  The notebook will train a model, and use it to generate sample output.\n",
+        "  \n",
+        "Here is the output(with start string='w') after training a single layer GRU for 30 epochs with the default settings below:\n",
+        "\n",
+        "```\n",
+        "were to the death of him\n",
+        "And nothing of the field in the view of hell,\n",
+        "When I said, banish him, I will not burn thee that would live.\n",
+        "\n",
+        "HENRY BOLINGBROKE:\n",
+        "My gracious uncle--\n",
+        "\n",
+        "DUKE OF YORK:\n",
+        "As much disgraced to the court, the gods them speak,\n",
+        "And now in peace himself excuse thee in the world.\n",
+        "\n",
+        "HORTENSIO:\n",
+        "Madam, 'tis not the cause of the counterfeit of the earth,\n",
+        "And leave me to the sun that set them on the earth\n",
+        "And leave the world and are revenged for thee.\n",
+        "\n",
+        "GLOUCESTER:\n",
+        "I would they were talking with the very name of means\n",
+        "To make a puppet of a guest, and therefore, good Grumio,\n",
+        "Nor arm'd to prison, o' the clouds, of the whole field,\n",
+        "With the admire\n",
+        "With the feeding of thy chair, and we have heard it so,\n",
+        "I thank you, sir, he is a visor friendship with your silly your bed.\n",
+        "\n",
+        "SAMPSON:\n",
+        "I do desire to live, I pray: some stand of the minds, make thee remedies\n",
+        "With the enemies of my soul.\n",
+        "\n",
+        "MENENIUS:\n",
+        "I'll keep the cause of my mistress.\n",
+        "\n",
+        "POLIXENES:\n",
+        "My brother Marcius!\n",
+        "\n",
+        "Second Servant:\n",
+        "Will't ple\n",
+        "```\n",
+        "\n",
+        "Of course, while some of the sentences are grammatical, most do not make sense. But, consider:\n",
+        "\n",
+        "* Our model is character based (when we began training, it did not yet know how to spell a valid English word, or that words were even a unit of text).\n",
+        "\n",
+        "* The structure of the output resembles a play (blocks begin with a speaker name, in all caps similar to the original text). Sentences generally end with a period. If you look at the text from a distance (or don't read the invididual words too closely, it appears as if it's an excerpt from a play).\n",
+        "\n",
+        "As a next step, you can experiment training the model on a different dataset - any large text file(ASCII) will do, and you can modify a single line of code below to make that change. Have fun!\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "R3p22DBDsaCA"
+      },
+      "source": [
+        "## Install unidecode library\n",
+        "A helpful library to convert unicode to ASCII."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "wZ6LOM12wKGH"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install unidecode"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "WGyKZj3bzf9p"
+      },
+      "source": [
+        "## Import tensorflow and enable eager execution."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "yG_n40gFzf9s"
+      },
+      "outputs": [],
+      "source": [
+        "# Import TensorFlow \u003e= 1.9 and enable eager execution\n",
+        "import tensorflow as tf\n",
+        "\n",
+        "# Note: Once you enable eager execution, it cannot be disabled. \n",
+        "tf.enable_eager_execution()\n",
+        "\n",
+        "import numpy as np\n",
+        "import re\n",
+        "import random\n",
+        "import unidecode\n",
+        "import time"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "EHDoRoc5PKWz"
+      },
+      "source": [
+        "## Download the dataset\n",
+        "\n",
+        "In this example, we will use the [shakespeare dataset](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt). You can use any other dataset that you like.\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "pD_55cOxLkAb"
+      },
+      "outputs": [],
+      "source": [
+        "path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "UHjdCjDuSvX_"
+      },
+      "source": [
+        "## Read the dataset\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "-E5JvY3wzf94"
+      },
+      "outputs": [],
+      "source": [
+        "text = unidecode.unidecode(open(path_to_file).read())\n",
+        "# length of text is the number of characters in it\n",
+        "print (len(text))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "Il9ww98izf-D"
+      },
+      "source": [
+        "Creating dictionaries to map from characters to their indices and vice-versa, which will be used to vectorize the inputs"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "IalZLbvOzf-F"
+      },
+      "outputs": [],
+      "source": [
+        "# unique contains all the unique characters in the file\n",
+        "unique = sorted(set(text))\n",
+        "\n",
+        "# creating a mapping from unique characters to indices\n",
+        "char2idx = {u:i for i, u in enumerate(unique)}\n",
+        "idx2char = {i:u for i, u in enumerate(unique)}"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "1v_qUYfAzf-I"
+      },
+      "outputs": [],
+      "source": [
+        "# setting the maximum length sentence we want for a single input in characters\n",
+        "max_length = 100\n",
+        "\n",
+        "# length of the vocabulary in chars\n",
+        "vocab_size = len(unique)\n",
+        "\n",
+        "# the embedding dimension \n",
+        "embedding_dim = 256\n",
+        "\n",
+        "# number of RNN (here GRU) units\n",
+        "units = 1024\n",
+        "\n",
+        "# batch size \n",
+        "BATCH_SIZE = 64\n",
+        "\n",
+        "# buffer size to shuffle our dataset\n",
+        "BUFFER_SIZE = 10000"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "LFjSVAlWzf-N"
+      },
+      "source": [
+        "## Creating the input and output tensors\n",
+        "\n",
+        "Vectorizing the input and the target text because our model cannot understand strings only numbers.\n",
+        "\n",
+        "But first, we need to create the input and output vectors.\n",
+        "Remember the max_length we set above, we will use it here. We are creating **max_length** chunks of input, where each input vector is all the characters in that chunk except the last and the target vector is all the characters in that chunk except the first.\n",
+        "\n",
+        "For example, consider that the string = 'tensorflow' and the max_length is 9\n",
+        "\n",
+        "So, the `input = 'tensorflo'` and `output = 'ensorflow'`\n",
+        "\n",
+        "After creating the vectors, we convert each character into numbers using the **char2idx** dictionary we created above."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "0UHJDA39zf-O"
+      },
+      "outputs": [],
+      "source": [
+        "input_text = []\n",
+        "target_text = []\n",
+        "\n",
+        "for f in range(0, len(text)-max_length, max_length):\n",
+        "    inps = text[f:f+max_length]\n",
+        "    targ = text[f+1:f+1+max_length]\n",
+        "\n",
+        "    input_text.append([char2idx[i] for i in inps])\n",
+        "    target_text.append([char2idx[t] for t in targ])\n",
+        "    \n",
+        "print (np.array(input_text).shape)\n",
+        "print (np.array(target_text).shape)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "MJdfPmdqzf-R"
+      },
+      "source": [
+        "## Creating batches and shuffling them using tf.data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "p2pGotuNzf-S"
+      },
+      "outputs": [],
+      "source": [
+        "dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)\n",
+        "dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(BATCH_SIZE))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "m8gPwEjRzf-Z"
+      },
+      "source": [
+        "## Creating the model\n",
+        "\n",
+        "We use the Model Subclassing API which gives us full flexibility to create the model and change it however we like. We use 3 layers to define our model.\n",
+        "\n",
+        "* Embedding layer\n",
+        "* GRU layer (you can use an LSTM layer here)\n",
+        "* Fully connected layer"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "P3KTiiInzf-a"
+      },
+      "outputs": [],
+      "source": [
+        "class Model(tf.keras.Model):\n",
+        "  def __init__(self, vocab_size, embedding_dim, units, batch_size):\n",
+        "    super(Model, self).__init__()\n",
+        "    self.units = units\n",
+        "    self.batch_sz = batch_size\n",
+        "\n",
+        "    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)\n",
+        "\n",
+        "    if tf.test.is_gpu_available():\n",
+        "      self.gru = tf.keras.layers.CuDNNGRU(self.units, \n",
+        "                                          return_sequences=True, \n",
+        "                                          return_state=True, \n",
+        "                                          recurrent_initializer='glorot_uniform')\n",
+        "    else:\n",
+        "      self.gru = tf.keras.layers.GRU(self.units, \n",
+        "                                     return_sequences=True, \n",
+        "                                     return_state=True, \n",
+        "                                     recurrent_activation='sigmoid', \n",
+        "                                     recurrent_initializer='glorot_uniform')\n",
+        "\n",
+        "    self.fc = tf.keras.layers.Dense(vocab_size)\n",
+        "        \n",
+        "  def call(self, x, hidden):\n",
+        "    x = self.embedding(x)\n",
+        "\n",
+        "    # output shape == (batch_size, max_length, hidden_size) \n",
+        "    # states shape == (batch_size, hidden_size)\n",
+        "\n",
+        "    # states variable to preserve the state of the model\n",
+        "    # this will be used to pass at every step to the model while training\n",
+        "    output, states = self.gru(x, initial_state=hidden)\n",
+        "\n",
+        "\n",
+        "    # reshaping the output so that we can pass it to the Dense layer\n",
+        "    # after reshaping the shape is (batch_size * max_length, hidden_size)\n",
+        "    output = tf.reshape(output, (-1, output.shape[2]))\n",
+        "\n",
+        "    # The dense layer will output predictions for every time_steps(max_length)\n",
+        "    # output shape after the dense layer == (max_length * batch_size, vocab_size)\n",
+        "    x = self.fc(output)\n",
+        "\n",
+        "    return x, states"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "trpqTWyvk0nr"
+      },
+      "source": [
+        "## Call the model and set the optimizer and the loss function"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "7t2XrzEOzf-e"
+      },
+      "outputs": [],
+      "source": [
+        "model = Model(vocab_size, embedding_dim, units, BATCH_SIZE)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "dkjWIATszf-h"
+      },
+      "outputs": [],
+      "source": [
+        "optimizer = tf.train.AdamOptimizer()\n",
+        "\n",
+        "# using sparse_softmax_cross_entropy so that we don't have to create one-hot vectors\n",
+        "def loss_function(real, preds):\n",
+        "    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "lPrP0XMUzf-p"
+      },
+      "source": [
+        "## Train the model\n",
+        "\n",
+        "Here we will use a custom training loop with the help of GradientTape()\n",
+        "\n",
+        "* We initialize the hidden state of the model with zeros and shape == (batch_size, number of rnn units). We do this by calling the function defined while creating the model.\n",
+        "\n",
+        "* Next, we iterate over the dataset(batch by batch) and calculate the **predictions and the hidden states** associated with that input.\n",
+        "\n",
+        "* There are a lot of interesting things happening here.\n",
+        "  * The model gets hidden state(initialized with 0), lets call that **H0** and the first batch of input, lets call that **I0**.\n",
+        "  * The model then returns the predictions **P1** and **H1**.\n",
+        "  * For the next batch of input, the model receives **I1** and **H1**.\n",
+        "  * The interesting thing here is that we pass **H1** to the model with **I1** which is how the model learns. The context learned from batch to batch is contained in the **hidden state**.\n",
+        "  * We continue doing this until the dataset is exhausted and then we start a new epoch and repeat this.\n",
+        "\n",
+        "* After calculating the predictions, we calculate the **loss** using the loss function defined above. Then we calculate the gradients of the loss with respect to the model variables(input)\n",
+        "\n",
+        "* Finally, we take a step in that direction with the help of the optimizer using the apply_gradients function.\n",
+        "\n",
+        "Note:- If you are running this notebook in Colab which has a **Tesla K80 GPU** it takes about 23 seconds per epoch.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "d4tSNwymzf-q"
+      },
+      "outputs": [],
+      "source": [
+        "# Training step\n",
+        "\n",
+        "EPOCHS = 30\n",
+        "\n",
+        "for epoch in range(EPOCHS):\n",
+        "    start = time.time()\n",
+        "    \n",
+        "    # initializing the hidden state at the start of every epoch\n",
+        "    hidden = model.reset_states()\n",
+        "    \n",
+        "    for (batch, (inp, target)) in enumerate(dataset):\n",
+        "          with tf.GradientTape() as tape:\n",
+        "              # feeding the hidden state back into the model\n",
+        "              # This is the interesting step\n",
+        "              predictions, hidden = model(inp, hidden)\n",
+        "              \n",
+        "              # reshaping the target because that's how the \n",
+        "              # loss function expects it\n",
+        "              target = tf.reshape(target, (-1,))\n",
+        "              loss = loss_function(target, predictions)\n",
+        "              \n",
+        "          grads = tape.gradient(loss, model.variables)\n",
+        "          optimizer.apply_gradients(zip(grads, model.variables), global_step=tf.train.get_or_create_global_step())\n",
+        "\n",
+        "          if batch % 100 == 0:\n",
+        "              print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch+1,\n",
+        "                                                            batch,\n",
+        "                                                            loss))\n",
+        "    \n",
+        "    print ('Epoch {} Loss {:.4f}'.format(epoch+1, loss))\n",
+        "    print('Time taken for 1 epoch {} sec\\n'.format(time.time() - start))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "DjGz1tDkzf-u"
+      },
+      "source": [
+        "## Predicting using our trained model\n",
+        "\n",
+        "The below code block is used to generated the text\n",
+        "\n",
+        "* We start by choosing a start string and initializing the hidden state and setting the number of characters we want to generate.\n",
+        "\n",
+        "* We get predictions using the start_string and the hidden state\n",
+        "\n",
+        "* Then we use a multinomial distribution to calculate the index of the predicted word. **We use this predicted word as our next input to the model**\n",
+        "\n",
+        "* **The hidden state returned by the model is fed back into the model so that it now has more context rather than just one word.** After we predict the next word, the modified hidden states are again fed back into the model, which is how it learns as it gets more context from the previously predicted words.\n",
+        "\n",
+        "* If you see the predictions, the model knows when to capitalize, make paragraphs and the text follows a shakespeare style of writing which is pretty awesome!"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "WvuwZBX5Ogfd"
+      },
+      "outputs": [],
+      "source": [
+        "# Evaluation step(generating text using the model learned)\n",
+        "\n",
+        "# number of characters to generate\n",
+        "num_generate = 1000\n",
+        "\n",
+        "# You can change the start string to experiment\n",
+        "start_string = 'Q'\n",
+        "# converting our start string to numbers(vectorizing!) \n",
+        "input_eval = [char2idx[s] for s in start_string]\n",
+        "input_eval = tf.expand_dims(input_eval, 0)\n",
+        "\n",
+        "# empty string to store our results\n",
+        "text_generated = ''\n",
+        "\n",
+        "# low temperatures results in more predictable text.\n",
+        "# higher temperatures results in more surprising text\n",
+        "# experiment to find the best setting\n",
+        "temperature = 1.0\n",
+        "\n",
+        "# hidden state shape == (batch_size, number of rnn units); here batch size == 1\n",
+        "hidden = [tf.zeros((1, units))]\n",
+        "for i in range(num_generate):\n",
+        "    predictions, hidden = model(input_eval, hidden)\n",
+        "\n",
+        "    # using a multinomial distribution to predict the word returned by the model\n",
+        "    predictions = predictions / temperature\n",
+        "    predicted_id = tf.multinomial(tf.exp(predictions), num_samples=1)[0][0].numpy()\n",
+        "    \n",
+        "    # We pass the predicted word as the next input to the model\n",
+        "    # along with the previous hidden state\n",
+        "    input_eval = tf.expand_dims([predicted_id], 0)\n",
+        "    \n",
+        "    text_generated += idx2char[predicted_id]\n",
+        "\n",
+        "print (start_string + text_generated)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "AM2Uma_-yVIq"
+      },
+      "source": [
+        "## Next steps\n",
+        "\n",
+        "* Change the start string to a different character, or the start of a sentence.\n",
+        "* Experiment with training on a different, or with different parameters. [Project  Gutenberg](http://www.gutenberg.org/ebooks/100), for example, contains a large collection of books.\n",
+        "* Experiment with the temperature parameter.\n",
+        "* Add another RNN layer.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "gtEd86sX5cB2"
+      },
+      "outputs": [],
+      "source": [
+        ""
+      ]
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "text_generation.ipynb",
+      "private_outputs": true,
+      "provenance": [],
+      "toc_visible": true,
+      "version": "0.3.2",
+      "views": {}
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/README.md b/tensorflow/contrib/eager/python/examples/l2hmc/README.md
new file mode 100644
index 0000000000..f171806e37
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/README.md
@@ -0,0 +1,53 @@
+# L2HMC with TensorFlow eager execution
+
+This folder contains an implementation of [L2HMC](https://arxiv.org/pdf/1711.09268.pdf)  adapted from the released implementation by the authors. The presented implementation runs in both eager and graph mode.
+With eager execution enabled, longer sample chains can be handled compared to graph mode, since no graph is explicitly stored. Moreover, with eager execution enabled, there is no need to use a `tf.while_loop`.
+
+## What is L2HMC?
+L2HMC is an adaptive Markov Chain Monte Carlo (MCMC) algorithm that learns a non-volume preserving transformation
+for a Hamiltonian Monte Carlo (HMC) sampling algorithm. More specifically, the non-volume preserving
+transformation is learned with neural nets instantiated within Normalizing Flows
+(real-NVPs).
+
+##  Content
+
+- `l2hmc.py`: Dynamics definitions and example energy functions,
+including the 2D strongly correlated Gaussian and the rough well energy function,
+- `l2hmc_test.py`: Unit tests and benchmarks for training a sampler on the energy functions in both eager and graph mode.
+- `neural_nets.py`: The neural net for learning the kernel on the 2D strongly correlated example.
+- `main.py`: Run to train a samplers on 2D energy landscapes.
+
+## To run
+- Make sure you have installed TensorFlow 1.9+ or the latest `tf-nightly` or `tf-nightly-gpu` pip package.
+- Execute the command
+
+```bash
+python main.py --train_dir ${PWD}/dump --use_defun
+```
+
+Specifying the optional argument `train_dir` will store event files for
+tensorboard and a plot of sampled chain from the trained sampler.
+
+Specifying the optional argument `use_defun` will let the program use compiled
+graphs when running specific sections and improve the overall speed.
+
+## Boosting Performance with `tfe.defun`
+Currently, some models may experience increased overhead with eager execution enabled.
+To improve performance, we could wrap certain functions with the decorator `@tfe.defun`.
+For example, we could wrap the function that does the sampling step:
+
+```python
+@tfe.defun
+def apply_transition(old_sample):
+  new_sample = ...
+  return new_sample
+```
+
+We could also explicitly wrap the desired function with `tfe.defun`:
+
+```python
+apply_transition = tfe.defun(apply_transition)
+```
+
+## Reference
+Generalizing Hamiltonian Monte Carlo with Neural Networks. Levy, Daniel, Hoffman, Matthew D, and Sohl-Dickstein, Jascha. International Conference on Learning Representations (ICLR), 2018.
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc.py b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc.py
index 729d8525fa..14b8324e48 100644
--- a/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc.py
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc.py
@@ -32,20 +32,28 @@ from tensorflow.contrib.eager.python.examples.l2hmc import neural_nets
 
 
 class Dynamics(tf.keras.Model):
-  """Dynamics engine of naive L2HMC sampler.
-
-  Args:
-    x_dim: dimensionality of observed data
-    loglikelihood_fn: log-likelihood function of conditional probability
-    n_steps: number of leapfrog steps within each transition
-    eps: initial value learnable scale of step size
-  """
-
-  def __init__(self, x_dim, loglikelihood_fn, n_steps=25, eps=.1):
+  """Dynamics engine of naive L2HMC sampler."""
+
+  def __init__(self,
+               x_dim,
+               minus_loglikelihood_fn,
+               n_steps=25,
+               eps=.1,
+               np_seed=1):
+    """Initialization.
+
+    Args:
+      x_dim: dimensionality of observed data
+      minus_loglikelihood_fn: log-likelihood function of conditional probability
+      n_steps: number of leapfrog steps within each transition
+      eps: initial value learnable scale of step size
+      np_seed: Random seed for numpy; used to control sampled masks.
+    """
     super(Dynamics, self).__init__()
 
+    npr.seed(np_seed)
     self.x_dim = x_dim
-    self.potential = loglikelihood_fn
+    self.potential = minus_loglikelihood_fn
     self.n_steps = n_steps
 
     self._construct_time()
@@ -54,7 +62,7 @@ class Dynamics(tf.keras.Model):
     self.position_fn = neural_nets.GenericNet(x_dim, factor=2.)
     self.momentum_fn = neural_nets.GenericNet(x_dim, factor=1.)
 
-    self.eps = tfe.Variable(
+    self.eps = tf.Variable(
         initial_value=eps, name="eps", dtype=tf.float32, trainable=True)
 
   def apply_transition(self, position):
@@ -68,8 +76,8 @@ class Dynamics(tf.keras.Model):
         position, forward=False)
 
     # Decide direction uniformly
-    forward_mask = tf.cast(
-        tf.random_uniform(shape=[tf.shape(position)[0]]) > .5, tf.float32)
+    batch_size = tf.shape(position)[0]
+    forward_mask = tf.cast(tf.random_uniform((batch_size,)) > .5, tf.float32)
     backward_mask = 1. - forward_mask
 
     # Obtain proposed states
@@ -108,7 +116,6 @@ class Dynamics(tf.keras.Model):
       position_post, momentum_post, logdet = lf_fn(position_post, momentum_post,
                                                    i)
       sumlogdet += logdet
-
     accept_prob = self._compute_accept_prob(position, momentum, position_post,
                                             momentum_post, sumlogdet)
 
@@ -125,17 +132,17 @@ class Dynamics(tf.keras.Model):
     sumlogdet += logdet
 
     position, logdet = self._update_position_forward(position, momentum, t,
-                                                     mask)
+                                                     mask, mask_inv)
     sumlogdet += logdet
 
     position, logdet = self._update_position_forward(position, momentum, t,
-                                                     mask_inv)
+                                                     mask_inv, mask)
     sumlogdet += logdet
 
     momentum, logdet = self._update_momentum_forward(position, momentum, t)
     sumlogdet += logdet
 
-    return position, momentum, tf.reduce_sum(sumlogdet, axis=1)
+    return position, momentum, sumlogdet
 
   def _backward_lf(self, position, momentum, i):
     """One backward augmented leapfrog step. See Appendix A in paper."""
@@ -149,17 +156,17 @@ class Dynamics(tf.keras.Model):
     sumlogdet += logdet
 
     position, logdet = self._update_position_backward(position, momentum, t,
-                                                      mask)
+                                                      mask_inv, mask)
     sumlogdet += logdet
 
     position, logdet = self._update_position_backward(position, momentum, t,
-                                                      mask_inv)
+                                                      mask, mask_inv)
     sumlogdet += logdet
 
     momentum, logdet = self._update_momentum_backward(position, momentum, t)
     sumlogdet += logdet
 
-    return position, momentum, tf.reduce_sum(sumlogdet, axis=1)
+    return position, momentum, sumlogdet
 
   def _update_momentum_forward(self, position, momentum, t):
     """Update v in the forward leapfrog step."""
@@ -172,12 +179,11 @@ class Dynamics(tf.keras.Model):
         momentum * tf.exp(scale) -
         .5 * self.eps * (tf.exp(transformed) * grad - translation))
 
-    return momentum, scale
+    return momentum, tf.reduce_sum(scale, axis=1)
 
-  def _update_position_forward(self, position, momentum, t, mask):
+  def _update_position_forward(self, position, momentum, t, mask, mask_inv):
     """Update x in the forward leapfrog step."""
 
-    mask_inv = 1. - mask
     scale, translation, transformed = self.position_fn(
         [momentum, mask * position, t])
     scale *= self.eps
@@ -186,8 +192,7 @@ class Dynamics(tf.keras.Model):
         mask * position +
         mask_inv * (position * tf.exp(scale) + self.eps *
                     (tf.exp(transformed) * momentum + translation)))
-
-    return position, mask_inv * scale
+    return position, tf.reduce_sum(mask_inv * scale, axis=1)
 
   def _update_momentum_backward(self, position, momentum, t):
     """Update v in the backward leapfrog step. Inverting the forward update."""
@@ -200,21 +205,20 @@ class Dynamics(tf.keras.Model):
         tf.exp(scale) * (momentum + .5 * self.eps *
                          (tf.exp(transformed) * grad - translation)))
 
-    return momentum, scale
+    return momentum, tf.reduce_sum(scale, axis=1)
 
-  def _update_position_backward(self, position, momentum, t, mask):
+  def _update_position_backward(self, position, momentum, t, mask, mask_inv):
     """Update x in the backward leapfrog step. Inverting the forward update."""
 
-    mask_inv = 1. - mask
     scale, translation, transformed = self.position_fn(
-        [momentum, mask_inv * position, t])
+        [momentum, mask * position, t])
     scale *= -self.eps
     transformed *= self.eps
     position = (
-        mask_inv * position + mask * tf.exp(scale) *
-        (position - self.eps * tf.exp(transformed) * momentum + translation))
+        mask * position + mask_inv * tf.exp(scale) *
+        (position - self.eps * (tf.exp(transformed) * momentum + translation)))
 
-    return position, mask * scale
+    return position, tf.reduce_sum(mask_inv * scale, axis=1)
 
   def _compute_accept_prob(self, position, momentum, position_post,
                            momentum_post, sumlogdet):
@@ -222,8 +226,10 @@ class Dynamics(tf.keras.Model):
 
     old_hamil = self.hamiltonian(position, momentum)
     new_hamil = self.hamiltonian(position_post, momentum_post)
+    prob = tf.exp(tf.minimum(old_hamil - new_hamil + sumlogdet, 0.))
 
-    return tf.exp(tf.minimum(old_hamil - new_hamil + sumlogdet, 0.))
+    # Ensure numerical stability as well as correct gradients
+    return tf.where(tf.is_finite(prob), prob, tf.zeros_like(prob))
 
   def _construct_time(self):
     """Convert leapfrog step index into sinusoidal time."""
@@ -248,6 +254,8 @@ class Dynamics(tf.keras.Model):
 
     self.masks = []
     for _ in range(self.n_steps):
+      # Need to use npr here because tf would generated different random
+      # values across different `sess.run`
       idx = npr.permutation(np.arange(self.x_dim))[:self.x_dim // 2]
       mask = np.zeros((self.x_dim,))
       mask[idx] = 1.
@@ -273,19 +281,15 @@ class Dynamics(tf.keras.Model):
   def grad_potential(self, position, check_numerics=True):
     """Get gradient of potential function at current location."""
 
-    if not tf.executing_eagerly():
-      # TODO(lxuechen): Change this to tfe.gradients_function when it works
-      grad = tf.gradients(self.potential(position), position)[0]
-    else:
+    if tf.executing_eagerly():
       grad = tfe.gradients_function(self.potential)(position)[0]
-
-    if check_numerics:
-      return tf.check_numerics(grad, message="gradient of potential")
+    else:
+      grad = tf.gradients(self.potential(position), position)[0]
 
     return grad
 
 
-# Examples of unnormalized log density/probabilities
+# Examples of unnormalized log densities
 def get_scg_energy_fn():
   """Get energy function for 2d strongly correlated Gaussian."""
 
@@ -295,32 +299,53 @@ def get_scg_energy_fn():
   sigma_inv = tf.matrix_inverse(sigma)
 
   def energy(x):
-    """Unnormalized log density/energy of 2d strongly correlated Gaussian."""
+    """Unnormalized minus log density of 2d strongly correlated Gaussian."""
 
     xmmu = x - mu
     return .5 * tf.diag_part(
         tf.matmul(tf.matmul(xmmu, sigma_inv), tf.transpose(xmmu)))
 
-  return energy
+  return energy, mu, sigma
 
 
-def get_multivariate_gaussian_energy_fn(x_dim=2):
-  """Get energy function for 2d strongly correlated Gaussian."""
-
-  mu = tf.random_normal(shape=[x_dim])
-  # Lower triangularize and positive diagonal
-  l = tf.sigmoid(
-      tf.matrix_band_part(tf.random_normal(shape=[x_dim, x_dim]), -1, 0))
-  # Exploit Cholesky decomposition
-  sigma = tf.matmul(l, tf.transpose(l))
-  sigma *= 100.  # Small covariance causes extreme numerical instability
-  sigma_inv = tf.matrix_inverse(sigma)
+def get_rw_energy_fn():
+  """Get energy function for rough well distribution."""
+  # For small eta, the density underlying the rough-well energy is very close to
+  # a unit Gaussian; however, the gradient is greatly affected by the small
+  # cosine perturbations
+  eta = 1e-2
+  mu = tf.constant([0., 0.])
+  sigma = tf.constant([[1., 0.], [0., 1.]])
 
   def energy(x):
-    """Unnormalized log density/energy of 2d strongly correlated Gaussian."""
+    ip = tf.reduce_sum(x**2., axis=1)
+    return .5 * ip + eta * tf.reduce_sum(tf.cos(x / eta), axis=1)
 
-    xmmu = x - mu
-    return .5 * tf.diag_part(
-        tf.matmul(tf.matmul(xmmu, sigma_inv), tf.transpose(xmmu)))
+  return energy, mu, sigma
+
+
+# Loss function
+def compute_loss(dynamics, x, scale=.1, eps=1e-4):
+  """Compute loss defined in equation (8)."""
+
+  z = tf.random_normal(tf.shape(x))  # Auxiliary variable
+  x_, _, x_accept_prob, x_out = dynamics.apply_transition(x)
+  z_, _, z_accept_prob, _ = dynamics.apply_transition(z)
+
+  # Add eps for numerical stability; following released impl
+  x_loss = tf.reduce_sum((x - x_)**2, axis=1) * x_accept_prob + eps
+  z_loss = tf.reduce_sum((z - z_)**2, axis=1) * z_accept_prob + eps
+
+  loss = tf.reduce_mean(
+      (1. / x_loss + 1. / z_loss) * scale - (x_loss + z_loss) / scale, axis=0)
+
+  return loss, x_out, x_accept_prob
+
+
+def loss_and_grads(dynamics, x, loss_fn=compute_loss):
+  """Obtain loss value and gradients."""
+  with tf.GradientTape() as tape:
+    loss_val, out, accept_prob = loss_fn(dynamics, x)
+  grads = tape.gradient(loss_val, dynamics.trainable_variables)
 
-  return energy
+  return loss_val, grads, out, accept_prob
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc_test.py b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc_test.py
index e33b4cae4c..9557479885 100644
--- a/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc_test.py
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/l2hmc_test.py
@@ -37,63 +37,37 @@ def get_default_hparams():
       n_warmup_iters=3)
 
 
-# Relevant functions for benchmarking
-def compute_loss(dynamics, x, scale=.1, eps=1e-4):
-  """Compute loss defined in equation (8)."""
-
-  z = tf.random_normal(tf.shape(x))
-  x_, _, x_accept_prob, x_out = dynamics.apply_transition(x)
-  z_, _, z_accept_prob, _ = dynamics.apply_transition(z)
-
-  # Add eps for numerical stability; following released impl
-  x_loss = tf.reduce_sum((x - x_)**2, axis=1) * x_accept_prob + eps
-  z_loss = tf.reduce_sum((z - z_)**2, axis=1) * z_accept_prob + eps
-
-  loss = tf.reduce_mean(
-      (1. / x_loss + 1. / z_loss) * scale - (x_loss + z_loss) / scale, axis=0)
-
-  return loss, x_out
-
-
-def loss_and_grads(dynamics, x, loss_fn=compute_loss):
-  """Obtain loss value and gradients."""
-
-  with tf.GradientTape() as tape:
-    loss_val, x_out = loss_fn(dynamics, x)
-  grads = tape.gradient(loss_val, dynamics.variables)
-
-  return loss_val, grads, x_out
-
-
-def warmup(dynamics, optimizer, n_iters=1, n_samples=200, loss_fn=compute_loss):
+def warmup(dynamics,
+           optimizer,
+           n_iters=1,
+           n_samples=200,
+           loss_fn=l2hmc.compute_loss):
   """Warmup optimization to reduce overhead."""
 
   samples = tf.random_normal(
       shape=[n_samples, dynamics.x_dim], dtype=tf.float32)
 
   for _ in range(n_iters):
-    _, grads, samples = loss_and_grads(dynamics, samples, loss_fn=loss_fn)
+    _, grads, samples, _ = l2hmc.loss_and_grads(
+        dynamics, samples, loss_fn=loss_fn)
     optimizer.apply_gradients(zip(grads, dynamics.variables))
 
 
 def fit(dynamics,
         samples,
         optimizer,
-        loss_fn=compute_loss,
+        loss_fn=l2hmc.compute_loss,
         n_iters=5000,
         verbose=True,
-        logdir=None,
-        decay_lr=True):
+        logdir=None):
   """Fit L2HMC sampler with given log-likelihood function."""
 
   if logdir:
     summary_writer = tf.contrib.summary.create_file_writer(logdir)
 
   for i in range(n_iters):
-    loss, grads, samples = loss_and_grads(dynamics, samples, loss_fn=loss_fn)
-    # TODO(lxuechen): Proper learning rate decay
-    if decay_lr:
-      grads = [grad * .96**(i // 1000) for grad in grads]
+    loss, grads, samples, _ = l2hmc.loss_and_grads(
+        dynamics, samples, loss_fn=loss_fn)
     optimizer.apply_gradients(zip(grads, dynamics.variables))
     if verbose:
       print("Iteration %d: loss %.4f" % (i, loss))
@@ -112,9 +86,10 @@ class L2hmcTest(tf.test.TestCase):
 
     # Eager mode testing
     hparams = get_default_hparams()
+    energy_fn, _, _ = l2hmc.get_scg_energy_fn()
     dynamics = l2hmc.Dynamics(
         x_dim=hparams.x_dim,
-        loglikelihood_fn=l2hmc.get_scg_energy_fn(),
+        minus_loglikelihood_fn=energy_fn,
         n_steps=hparams.n_steps,
         eps=hparams.eps)
     samples = tf.random_normal(shape=[hparams.n_samples, hparams.x_dim])
@@ -127,9 +102,10 @@ class L2hmcTest(tf.test.TestCase):
 
     # Graph mode testing
     with tf.Graph().as_default():
+      energy_fn, _, _ = l2hmc.get_scg_energy_fn()
       dynamics = l2hmc.Dynamics(
           x_dim=hparams.x_dim,
-          loglikelihood_fn=l2hmc.get_scg_energy_fn(),
+          minus_loglikelihood_fn=energy_fn,
           n_steps=hparams.n_steps,
           eps=hparams.eps)
       x = tf.placeholder(tf.float32, shape=[None, hparams.x_dim])
@@ -150,32 +126,20 @@ class L2hmcTest(tf.test.TestCase):
 class L2hmcBenchmark(tf.test.Benchmark):
   """Eager and graph benchmarks for l2hmc."""
 
-  def _get_energy_fn(self):
-    """Get specific energy function according to FLAGS."""
-
-    if FLAGS.energy_fn == "scg":
-      energy_fn = l2hmc.get_scg_energy_fn()
-    elif FLAGS.energy_fn == "multivariate_gaussian":
-      energy_fn = l2hmc.get_multivariate_gaussian_energy_fn(x_dim=FLAGS.x_dim)
-    else:
-      raise ValueError("No such energy function %s" % FLAGS.energy_fn)
-
-    return energy_fn
-
   def benchmark_graph(self):
     """Benchmark Graph performance."""
 
     hparams = get_default_hparams()
     tf.reset_default_graph()
     with tf.Graph().as_default():
-      energy_fn = self._get_energy_fn()
+      energy_fn, _, _ = l2hmc.get_scg_energy_fn()
       dynamics = l2hmc.Dynamics(
           x_dim=hparams.x_dim,
-          loglikelihood_fn=energy_fn,
+          minus_loglikelihood_fn=energy_fn,
           n_steps=hparams.n_steps,
           eps=hparams.eps)
       x = tf.placeholder(tf.float32, shape=[None, hparams.x_dim])
-      loss, x_out = compute_loss(dynamics, x)
+      loss, x_out, _ = l2hmc.compute_loss(dynamics, x)
 
       global_step = tf.Variable(0., name="global_step", trainable=False)
       learning_rate = tf.train.exponential_decay(
@@ -183,7 +147,11 @@ class L2hmcBenchmark(tf.test.Benchmark):
       optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
       train_op = optimizer.minimize(loss, global_step=global_step)
 
-      with tf.Session() as sess:
+      # Single thread; fairer comparison against eager
+      session_conf = tf.ConfigProto(
+          intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
+
+      with tf.Session(config=session_conf) as sess:
         sess.run(tf.global_variables_initializer())
 
         # Warmup to reduce initialization effect when timing
@@ -218,14 +186,14 @@ class L2hmcBenchmark(tf.test.Benchmark):
     """Benchmark Eager performance."""
 
     hparams = get_default_hparams()
-    energy_fn = self._get_energy_fn()
+    energy_fn, _, _ = l2hmc.get_scg_energy_fn()
     dynamics = l2hmc.Dynamics(
         x_dim=hparams.x_dim,
-        loglikelihood_fn=energy_fn,
+        minus_loglikelihood_fn=energy_fn,
         n_steps=hparams.n_steps,
         eps=hparams.eps)
     optimizer = tf.train.AdamOptimizer(learning_rate=hparams.learning_rate)
-    loss_fn = tfe.defun(compute_loss) if defun else compute_loss
+    loss_fn = tfe.defun(l2hmc.compute_loss) if defun else l2hmc.compute_loss
 
     # Warmup to reduce initialization effect when timing
     warmup(dynamics, optimizer, n_iters=hparams.n_warmup_iters, loss_fn=loss_fn)
@@ -234,12 +202,7 @@ class L2hmcBenchmark(tf.test.Benchmark):
     samples = tf.random_normal(
         shape=[hparams.n_samples, hparams.x_dim], dtype=tf.float32)
     start_time = time.time()
-    fit(dynamics,
-        samples,
-        optimizer,
-        loss_fn=loss_fn,
-        n_iters=hparams.n_iters,
-        decay_lr=True)
+    fit(dynamics, samples, optimizer, loss_fn=loss_fn, n_iters=hparams.n_iters)
     wall_time = time.time() - start_time
     examples_per_sec = hparams.n_samples / wall_time
 
@@ -251,14 +214,8 @@ class L2hmcBenchmark(tf.test.Benchmark):
         wall_time=wall_time)
 
     del dynamics
-    del loss_fn
 
 
 if __name__ == "__main__":
-  tf.flags.DEFINE_string("energy_fn", "scg",
-                         ("The energy function/unnormalized log-probability. "
-                          "Either be `scg` or `multivariate_gaussian`"))
-  tf.flags.DEFINE_integer("x_dim", 2, "Dimensionality of observation space.")
-  FLAGS = tf.flags.FLAGS
   tf.enable_eager_execution()
   tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/main.py b/tensorflow/contrib/eager/python/examples/l2hmc/main.py
new file mode 100644
index 0000000000..45e1f98429
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/main.py
@@ -0,0 +1,235 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""L2HMC on simple Gaussian mixture model with TensorFlow eager."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import sys
+
+from absl import flags
+import numpy as np
+import tensorflow as tf
+from tensorflow.contrib.eager.python.examples.l2hmc import l2hmc
+try:
+  import matplotlib.pyplot as plt  # pylint: disable=g-import-not-at-top
+  HAS_MATPLOTLIB = True
+except ImportError:
+  HAS_MATPLOTLIB = False
+tfe = tf.contrib.eager
+
+
+def main(_):
+  tf.enable_eager_execution()
+  global_step = tf.train.get_or_create_global_step()
+  global_step.assign(1)
+
+  energy_fn, mean, covar = {
+      "scg": l2hmc.get_scg_energy_fn(),
+      "rw": l2hmc.get_rw_energy_fn()
+  }[FLAGS.energy_fn]
+
+  x_dim = 2
+  train_iters = 5000
+  eval_iters = 2000
+  eps = 0.1
+  n_steps = 10  # Chain length
+  n_samples = 200
+  record_loss_every = 100
+
+  dynamics = l2hmc.Dynamics(
+      x_dim=x_dim, minus_loglikelihood_fn=energy_fn, n_steps=n_steps, eps=eps)
+  learning_rate = tf.train.exponential_decay(
+      1e-3, global_step, 1000, 0.96, staircase=True)
+  optimizer = tf.train.AdamOptimizer(learning_rate)
+  checkpointer = tf.train.Checkpoint(
+      optimizer=optimizer, dynamics=dynamics, global_step=global_step)
+
+  if FLAGS.train_dir:
+    summary_writer = tf.contrib.summary.create_file_writer(FLAGS.train_dir)
+    if FLAGS.restore:
+      latest_path = tf.train.latest_checkpoint(FLAGS.train_dir)
+      checkpointer.restore(latest_path)
+      print("Restored latest checkpoint at path:\"{}\" ".format(latest_path))
+      sys.stdout.flush()
+
+  if not FLAGS.restore:
+    # Training
+    if FLAGS.use_defun:
+      # Use `tfe.deun` to boost performance when there are lots of small ops
+      loss_fn = tfe.defun(l2hmc.compute_loss)
+    else:
+      loss_fn = l2hmc.compute_loss
+
+    samples = tf.random_normal(shape=[n_samples, x_dim])
+    for i in range(1, train_iters + 1):
+      loss, samples, accept_prob = train_one_iter(
+          dynamics,
+          samples,
+          optimizer,
+          loss_fn=loss_fn,
+          global_step=global_step)
+
+      if i % record_loss_every == 0:
+        print("Iteration {}, loss {:.4f}, x_accept_prob {:.4f}".format(
+            i, loss.numpy(),
+            accept_prob.numpy().mean()))
+        if FLAGS.train_dir:
+          with summary_writer.as_default():
+            with tf.contrib.summary.always_record_summaries():
+              tf.contrib.summary.scalar("Training loss", loss, step=global_step)
+    print("Training complete.")
+    sys.stdout.flush()
+
+    if FLAGS.train_dir:
+      saved_path = checkpointer.save(
+          file_prefix=os.path.join(FLAGS.train_dir, "ckpt"))
+      print("Saved checkpoint at path: \"{}\" ".format(saved_path))
+      sys.stdout.flush()
+
+  # Evaluation
+  if FLAGS.use_defun:
+    # Use tfe.deun to boost performance when there are lots of small ops
+    apply_transition = tfe.defun(dynamics.apply_transition)
+  else:
+    apply_transition = dynamics.apply_transition
+
+  samples = tf.random_normal(shape=[n_samples, x_dim])
+  samples_history = []
+  for i in range(eval_iters):
+    samples_history.append(samples.numpy())
+    _, _, _, samples = apply_transition(samples)
+  samples_history = np.array(samples_history)
+  print("Sampling complete.")
+  sys.stdout.flush()
+
+  # Mean and covariance of target distribution
+  mean = mean.numpy()
+  covar = covar.numpy()
+  ac_spectrum = compute_ac_spectrum(samples_history, mean, covar)
+  print("First 25 entries of the auto-correlation spectrum: {}".format(
+      ac_spectrum[:25]))
+  ess = compute_ess(ac_spectrum)
+  print("Effective sample size per Metropolis-Hastings step: {}".format(ess))
+  sys.stdout.flush()
+
+  if FLAGS.train_dir:
+    # Plot autocorrelation spectrum in tensorboard
+    plot_step = tfe.Variable(1, trainable=False, dtype=tf.int64)
+
+    for ac in ac_spectrum:
+      with summary_writer.as_default():
+        with tf.contrib.summary.always_record_summaries():
+          tf.contrib.summary.scalar("Autocorrelation", ac, step=plot_step)
+      plot_step.assign(plot_step + n_steps)
+
+    if HAS_MATPLOTLIB:
+      # Choose a single chain and plot the trajectory
+      single_chain = samples_history[:, 0, :]
+      xs = single_chain[:100, 0]
+      ys = single_chain[:100, 1]
+      plt.figure()
+      plt.plot(xs, ys, color="orange", marker="o", alpha=0.6)  # Trained chain
+      plt.savefig(os.path.join(FLAGS.train_dir, "single_chain.png"))
+
+
+def train_one_iter(dynamics,
+                   x,
+                   optimizer,
+                   loss_fn=l2hmc.compute_loss,
+                   global_step=None):
+  """Train the sampler for one iteration."""
+  loss, grads, out, accept_prob = l2hmc.loss_and_grads(
+      dynamics, x, loss_fn=loss_fn)
+  optimizer.apply_gradients(
+      zip(grads, dynamics.trainable_variables), global_step=global_step)
+
+  return loss, out, accept_prob
+
+
+def compute_ac_spectrum(samples_history, target_mean, target_covar):
+  """Compute autocorrelation spectrum.
+
+  Follows equation 15 from the L2HMC paper.
+
+  Args:
+    samples_history: Numpy array of shape [T, B, D], where T is the total
+        number of time steps, B is the batch size, and D is the dimensionality
+        of sample space.
+    target_mean: 1D Numpy array of the mean of target(true) distribution.
+    target_covar: 2D Numpy array representing a symmetric matrix for variance.
+  Returns:
+    Autocorrelation spectrum, Numpy array of shape [T-1].
+  """
+
+  # Using numpy here since eager is a bit slow due to the loop
+  time_steps = samples_history.shape[0]
+  trace = np.trace(target_covar)
+
+  rhos = []
+  for t in range(time_steps - 1):
+    rho_t = 0.
+    for tau in range(time_steps - t):
+      v_tau = samples_history[tau, :, :] - target_mean
+      v_tau_plus_t = samples_history[tau + t, :, :] - target_mean
+      # Take dot product over observation dims and take mean over batch dims
+      rho_t += np.mean(np.sum(v_tau * v_tau_plus_t, axis=1))
+
+    rho_t /= trace * (time_steps - t)
+    rhos.append(rho_t)
+
+  return np.array(rhos)
+
+
+def compute_ess(ac_spectrum):
+  """Compute the effective sample size based on autocorrelation spectrum.
+
+  This follows equation 16 from the L2HMC paper.
+
+  Args:
+    ac_spectrum: Autocorrelation spectrum
+  Returns:
+    The effective sample size
+  """
+  # Cutoff from the first value less than 0.05
+  cutoff = np.argmax(ac_spectrum[1:] < .05)
+  if cutoff == 0:
+    cutoff = len(ac_spectrum)
+  ess = 1. / (1. + 2. * np.sum(ac_spectrum[1:cutoff]))
+  return ess
+
+
+if __name__ == "__main__":
+  flags.DEFINE_string(
+      "train_dir",
+      default=None,
+      help="[Optional] Directory to store the training information")
+  flags.DEFINE_boolean(
+      "restore",
+      default=False,
+      help="[Optional] Restore the latest checkpoint from `train_dir` if True")
+  flags.DEFINE_boolean(
+      "use_defun",
+      default=False,
+      help="[Optional] Use `tfe.defun` to boost performance")
+  flags.DEFINE_string(
+      "energy_fn",
+      default="scg",
+      help="[Optional] The energy function used for experimentation"
+      "Other options include `rw`")
+  FLAGS = flags.FLAGS
+  tf.app.run(main)
diff --git a/tensorflow/contrib/eager/python/examples/l2hmc/neural_nets.py b/tensorflow/contrib/eager/python/examples/l2hmc/neural_nets.py
index e230ad5e25..68e0bc3123 100644
--- a/tensorflow/contrib/eager/python/examples/l2hmc/neural_nets.py
+++ b/tensorflow/contrib/eager/python/examples/l2hmc/neural_nets.py
@@ -25,7 +25,6 @@ from __future__ import division
 from __future__ import print_function
 
 import tensorflow as tf
-import tensorflow.contrib.eager as tfe
 
 
 class GenericNet(tf.keras.Model):
@@ -47,13 +46,13 @@ class GenericNet(tf.keras.Model):
 
     # Scale
     self.scale_layer = _custom_dense(x_dim, .001)
-    self.coeff_scale = tfe.Variable(
+    self.coeff_scale = tf.Variable(
         initial_value=tf.zeros([1, x_dim]), name='coeff_scale', trainable=True)
     # Translation
     self.translation_layer = _custom_dense(x_dim, factor=.001)
     # Transformation
     self.transformation_layer = _custom_dense(x_dim, .001)
-    self.coeff_transformation = tfe.Variable(
+    self.coeff_transformation = tf.Variable(
         initial_value=tf.zeros([1, x_dim]),
         name='coeff_transformation',
         trainable=True)
diff --git a/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb b/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb
index 54ebcad8e9..1ab1b71bd0 100644
--- a/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb
+++ b/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb
@@ -41,11 +41,11 @@
         "\n",
         "# Neural Machine Translation with Attention\n",
         "\n",
-        "<table align=\"left\"><td>\n",
-        "<a target=\"_blank\"  href=\"https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb\">\n",
+        "<table class=\"tfo-notebook-buttons\" align=\"left\"><td>\n",
+        "<a target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb\">\n",
         "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>  \n",
         "</td><td>\n",
-        "<a target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on Github</a></td></table>"
+        "<a target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a></td></table>"
       ]
     },
     {
@@ -383,6 +383,7 @@
       "source": [
         "BUFFER_SIZE = len(input_tensor_train)\n",
         "BATCH_SIZE = 64\n",
+        "N_BATCH = BUFFER_SIZE//BATCH_SIZE\n",
         "embedding_dim = 256\n",
         "units = 1024\n",
         "vocab_inp_size = len(inp_lang.word2idx)\n",
@@ -677,21 +678,23 @@
         "                # using teacher forcing\n",
         "                dec_input = tf.expand_dims(targ[:, t], 1)\n",
         "        \n",
-        "        total_loss += (loss / int(targ.shape[1]))\n",
+        "        batch_loss = (loss / int(targ.shape[1]))\n",
+        "        \n",
+        "        total_loss += batch_loss\n",
         "        \n",
         "        variables = encoder.variables + decoder.variables\n",
         "        \n",
         "        gradients = tape.gradient(loss, variables)\n",
-        "      \n",
+        "        \n",
         "        optimizer.apply_gradients(zip(gradients, variables), tf.train.get_or_create_global_step())\n",
-        "\n",
+        "        \n",
         "        if batch % 100 == 0:\n",
         "            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,\n",
         "                                                         batch,\n",
-        "                                                         loss.numpy() / int(targ.shape[1])))\n",
+        "                                                         batch_loss.numpy()))\n",
         "    \n",
         "    print('Epoch {} Loss {:.4f}'.format(epoch + 1,\n",
-        "                                        total_loss/len(input_tensor)))\n",
+        "                                        total_loss / N_BATCH))\n",
         "    print('Time taken for 1 epoch {} sec\\n'.format(time.time() - start))"
       ],
       "execution_count": 0,
@@ -906,4 +909,4 @@
       ]
     }
   ]
-}
-\ No newline at end of file
+}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
deleted file mode 100644
index 9c1af9c208..0000000000
--- a/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
+++ /dev/null
@@ -1,323 +0,0 @@
-{
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "vDJ4XzMqodTy"
-      },
-      "source": [
-        "# Automatic Differentiation\n",
-        "\n",
-        "In the previous tutorial we introduced `Tensor`s and operations on them. In this tutorial we will cover [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), a key technique for optimizing machine learning models."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "GQJysDM__Qb0"
-      },
-      "source": [
-        "## Setup\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "OiMPZStlibBv"
-      },
-      "outputs": [],
-      "source": [
-        "import tensorflow as tf\n",
-        "tf.enable_eager_execution()\n",
-        "\n",
-        "tfe = tf.contrib.eager # Shorthand for some symbols"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "1CLWJl0QliB0"
-      },
-      "source": [
-        "## Derivatives of a function\n",
-        "\n",
-        "TensorFlow provides APIs for automatic differentiation - computing the derivative of a function. The way that more closely mimics the math is to encapsulate the computation in a Python function, say `f`, and use `tfe.gradients_function` to create a function that computes the derivatives of `f` with respect to its arguments. If you're familiar with [autograd](https://github.com/HIPS/autograd) for differentiating numpy functions, this will be familiar. For example: "
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "9FViq92UX7P8"
-      },
-      "outputs": [],
-      "source": [
-        "from math import pi\n",
-        "\n",
-        "def f(x):\n",
-        "  return tf.square(tf.sin(x))\n",
-        "\n",
-        "assert f(pi/2).numpy() == 1.0\n",
-        "\n",
-        "\n",
-        "# grad_f will return a list of derivatives of f\n",
-        "# with respect to its arguments. Since f() has a single argument,\n",
-        "# grad_f will return a list with a single element.\n",
-        "grad_f = tfe.gradients_function(f)\n",
-        "assert tf.abs(grad_f(pi/2)[0]).numpy() \u003c 1e-7"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "v9fPs8RyopCf"
-      },
-      "source": [
-        "### Higher-order gradients\n",
-        "\n",
-        "The same API can be used to differentiate as many times as you like:\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 276
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 730,
-          "status": "ok",
-          "timestamp": 1527005655565,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "3D0ZvnGYo0rW",
-        "outputId": "e23f8cc6-6813-4944-f20f-825b8a03c2ff"
-      },
-      "outputs": [
-        {
-          "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXYAAAEDCAYAAAAhsS8XAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzsnXd0HNX5sJ/ZXrTq3ZLV3IvcDdgGGwOm2WCbHhJa6C2B\nUBISQioBfoQPkjhACA4QCIQSDITQbGMbsHHvVbZ6s7q0vc18f4xmJVltJa0q+5zDOXhn9s7dqzvv\nfe/briBJkkSYMGHChBkxqAa7A2HChAkTJrSEBXuYMGHCjDDCgj1MmDBhRhhhwR4mTJgwI4ywYA8T\nJkyYEUZYsIcJEybMCCNkgl0URVasWMHtt98eqibDhAkTJkwvCJlgf+2118jJyQlVc2HChAkTppeE\nRLBXVlayceNGrrjiilA0FyZMmDBh+kBIBPvjjz/OQw89hCAIoWguTJgwYcL0gT4L9g0bNhAfH8/E\niRMJVycIEyZMmMFH6GutmGeeeYYPP/wQtVqN2+3Gbrdz3nnn8dRTT3X6HUmSwtp9CKittvH8UxsQ\nxZY/4aXXTGfa7PRB7NXAU1dj5y9PrIfmYUgeFcnya2aQmBI5uB0bYE5WNPHS/9uE6JcHYukVucw8\nPWOQezXw7NhcyCfvH0Bqfi+uumkO4ycnD3KvBpY+C/bWbNu2jdWrV/PCCy90e291tTVUj+03EhIs\nQ7qfWzfls2tzMTNPH01UrJEv/3eU5LRIVnx/5mB3rUP6azw3fnaMQ7vLOX1RNrVVNvIOVZGeFcPS\nq6YNmT6GmlP7KYoi/3ltF9WVNhacO4btXxfi9fi5+Mpc0jJjhkw/+5t9O0r5Zu1xDEYtpy/KZuOn\nR4mOM3HlTbNRqTo3UAynv3swhOPYhymSJJF3sAqtTs35l05mQm4K6VkxVJY2UVdtH+zuDRgOu4ej\n+yqIjDYwbW4a514yiYTkCMqKGnC7vIPdvQFjz9YSqittjJuSxNTZaVywcgoAX3xwCL9PHOTeDRyH\ndpej0ai47PqZTJyWwoTcFOprHBzdf3KwuzaghFSwz507NyhtPUzfOVnehLXRRdbYeLQ6DQATp6UC\ncGhv+WB2bUA5sLMMv19i2pz0gEaWNS4BUZQoOlE3yL0bGDxuHzu+LsRk1jH/nDEApI6OZtL0VFxO\nLyfLmwa5hwNDU4OT+loHozJiiIw2AjB7QSYajYrtXxXg9foHuYcDR1hjH6bkHawCYOzkxMBnmWPj\nMJq1HDtwEt93YBJ7PT4O7CrDYNQwPrfFhpo1Lh6AgmPVg9W1AaWyrBG/X2JCbjIGozbweXqWbIIp\nLawfrK4NKMX58kI+Oic28FmERc/UOWnYbR7yDn53tPawYB+GiKLI8SNVGEzaNvZTtVrFhKkpuF0+\n8o+OfKGWd7gKt8vHlJmj0GrVgc9j4kxExRopzq/7Tixw5cWNAKSkR7f5PHV0NIIApUXfEcHevEMb\nnR3b5vPxU5IAqChpHPA+DRZhwT4MKS2sx+XwMmZCYjuH0MRpKQAc3lsxGF0bUJQXNWdiYpvPBUEg\ne1w8Pq9IyXdAW60oaUAQ5Gig1uj0GhJTIqkqb8Lj9g1S7wYGn89PWXE90XGmgBlGITrWhN6gobIs\nLNjDDGE6MsMoRMUYSUiOoLKsacQ7zaoqrGh1amLiTO2uZY1LAKDgWM1Ad2tA8Xr9VFVYSUi2oNNr\n2l1Py4xBkqC8pGEQejdwVJQ04vOKZJyirYO80CeNiqSpwYXD5h6E3g08YcE+zJAkiZKCOswWHUmp\nHcdpJyRbEEWJupqRGx3jdvloqHWQmGLpMCciMcWCOUJH0fEaRHHkLnBV5U2IokRKelSH10dlyOaZ\nkW5nD5hhcuI6vJ48Sh6fyrLvhiM5LNiHGQ67B6fDS2JyZKdJXgnJcqxrdeXQj8vtLcpv6ywJSRAE\nMsfF43L6RvTLXF4sa+Kn2tcVkkdFodGoKCvqu8b+zjtv8f3vX8Fvf/ton9sKNUX5tWi0KlLSOl7g\nFDPVSJ4LrWm/dwszpKk5aQMgLimi03u+C4JdCeFLTOk8YSMlLYqDu8qpOWkjtRPBN9wpb/YzpHai\nsas1KlLSoygpqMdhc2OK0Pf6WWvWvMsf//hnkpNTet1Gf9BY76Sxzknm2DjUmo51VXlnBye/I3b2\nsMY+zKitkgV7fGLngj02wYxKLVBdaRuobg04VRWyYO/MHAUQlyCPkTJmIw2/X+RkeRNxCWb0Bm2n\n943KaA577IPW/vTTf6C8vIyHH76ft99+s9ft9AeKUzQto/MMW61OQ1xiBFWV1hHve4Kwxj7sUDT2\n+C40drVaRVyCmdpqG36/iFo9stZvSZKoKrditugwWzrXQKNijahUwojLxH17/XF25VXj8fhx+Hzo\nGh1s/+vmTu8X/SJ2RA59egTDxhMd3jNnQiJXLh7TaRsPPPAztm79lj//+UUiI4dWDZ76Zl9SXBfK\nDshmqZqTNqpPWgM295HKyHrjvwPUVNnQGzRERHa9pU5ItiD6pREn1ADsVjcOu6fbIl9qtYqYeBN1\nNfYRWXlU0Ty7W7hVzddFf181VYlApbUhhDLHYxPMXd6XnCbPl5PfATt7WGMfRng9PhrrnM2JJ11X\nx5Tt7BVUn7QGbO4jhZPliuO0+98VlxBBbZWdpgYnUTHtwyKHI1cuHsNdV83g1b9upuhELdf/cG63\ntvN/v7ydpgYnN99xxoirrFpX48Bk1rXJuu2IlsiYRqYxsiughjX2YURts2bSlRlGocWBOvLsy4p9\nPZiyvLGJshZXWzXydi71tXYMJm1QDtHYeBM+r4itaWTFcXs9PqyNLmLiu1+0IyL1mCN0VJY2jcgd\nXGvCgn0YEbCvd2NLBIiNN6NSCdSMwMiYqoqeaeww8hyoPq9fFmixwe1CouPkBa6+ti8L3NDT9Otq\nHED3ZhiQQ2ATUyNx2D3YbZ7+7tqgEhbsw4hgHKcKao2K2AQztVWyA3WkIIoS1ZVWYuJNHWZankpc\n8wtfO8J8DbLfAKI7yLrtiNhmjba+WRD2hnfe+YDIyKHldAzY1+O7F+xAIEu5sa734zAcCAv2YURt\nlQ2VWgj6ZU5ItuD3S4GogZFAU4MTr8dPQlJwfgNThA6DUTPinMi11fIi31E5hY5Q5kx97cgSaMrc\nDkZjB7luDEBDnbPf+jQUCAv2YYIoitRW24mNNwcdvjgS7eyN9fILGR1r7OZOGUEQiE2IoLFeXhBG\nCjXNpqXoYE0xMSYEoa+mmKGHUjYjJi44wR7VPG/CGnuYIUFDnRO/TwzKDKOg3DuS7MuNzZpWVJAC\nDVrMMSOpdk5AsAepsas1KiJjjNTXOEaU47Cuxo7ZokdvCC7AL6yxhxlS9MRxqqBotU0NI2cSN9bL\nmlZUTHAaO7Qkrijmi5GAYpazRBmC/k5snBm3y4fTMTKODHS7vNitnqDNMAAGoxaDUUNDWGMPMxRQ\nbMTdZde1Rm/QojdoaGxw9Ve3BhzFFNMTwa68+HUjJORRkiRqquxExciZtcESHXCgjoxxCETEBBHq\n2JqoWBNNDc4RFVRwKn0W7B6PhyuuuILly5ezbNky/vKXv4SiX2FOQdG6eyLQlPubGpyI4sjYfjfW\nOzGatEFFxCgoERMjJTLGYfPgcfuCdpwqxI4wB2rAcRpkRIxCdIwRSQJr48hReE6lz4Jdp9Px2muv\nsWbNGtasWcOmTZvYt29fKPoWphVNDU7UGhWmCF2PvhcZbUT0S9itwz8xxe8XsTa6Ag6wYNHq1ETF\nGKkbIaYYRTAHa19XiGkWgL0NeWxdtvebb77ijTdeDfq7lZUVfPHFp0Hd+/jjv2bjxvXd3te6lMCa\nNe/x2Wf/C6r9qICdXR6HTz75L9XVLUdJPvnk7ykqKgyqraFKSEoKGI3yi+bxePD5RvYRXINFU4OL\nyChDj9PBFQ2/sd7ZI3vsUMTa6EKS6FVpgMgYIyX5TjxuX4+0/aGIIpCCTU5SUByHvY2MObVs7/z5\nZ7a7x+/3o1ar231eXl7GF198xnnnXdCrZ3eE4gyPjDawfPllQX9PGQfFEf+//33EzJlTSUrKAODh\nh38esj4OFiGZ4aIosnLlSoqLi7n22mvJzc0NRbNhmnG7vLhdvnZnWgZDZLQszGVTTudlTYcDgYiY\nHpqjACKjlHFw9SiyaCjS0EuNXatTY4nU98oU07ps78UXX4LFYuHIkUPcd99DPP74r7FYIsnLO8r4\n8ROZP/9MnnvuaQRBQKvV8OyzL/Dii6soKirkppuu5YILlnLllde0af+ZZ55k9+6dpKSktonaOXr0\nCH/+8zO4XC6ioqL5+c8fIzY2jnvuuQ3RFUt1XSGxH5Zht9sxmUycccYCfve7x3jpJXk3UVlZwcMP\n38+rr77JK6/8nW+++QqHw4lWSmTS9HvYsGEdR44c5sEHH0Sj0fL886t54IF7ufvu+zh8+ADl5eXc\neee9gKzZHz16hB//+AE+//wT3nnnLfx+H5MmTeEnP/npkKrBExLBrlKpWLNmDTabjTvvvJPjx48z\nZkznJUDD9IymZufnqYf0BoMiBEdCZExDLyJiFJQFztroHPaC/Vv315ROK+RP+ZsRCnomTJxjPfh8\nInnfbGgjiGYkTmXlmKWdfu/Usr2ffPLfNt8vLS3mT396AYCHH76Pn/zkp0yZkktEhIamJg+33343\nb731Ok8++f/atb1x45eUlpbwz3++TU1NDd///hUsXXopPp+PZ599iieeeIaoqGjWrfuCF19cxc9+\n9kskScLpsHPdlT9l6VXTWL36bwBkZGTi9/uoqCgnJSWVdes+55xzzgPgssuu4oYbbsbn9XPj9+9k\n566t3P/Idbz33tv88pe/ICGhbWGwRYvO5fbbbwwI9nXrPuf6639IUVEh69Z9zgsvrEatVvPHPz7J\n559/wvnnX9Sjv0V/EtI9aUREBHPnzuWrr77qVrAnJAyPioNDoZ/VzdUMU9OiO+1PZ58b9HLFO5fD\nNyR+S1/64HHKCUaZ2fE9bidttLxb8fukbr87FMapK9xuH6oIAU0v6uyrNSp8PhEkUKtbBLPJqOv2\nd6tUEBdnJjragsViwNj8HYNBy8KFSwPfP/30uTz//HMsW7aMJUuWkJSURHS0CZ1O0+Ezjh07wIoV\nl5KQYCEhwcK8eWcQGWnEZquhoCCfBx+8F0mSEEWRxMREEhIsCAhkpE4nMTmShAQLZrMes9lAQoKF\npUsvZuvWTdxyyy1s2rSeZ599loQEC7t2bebll1/G6XRSVV9FWXkaCQkWtFo1ktQyL7RaNTExJsaO\nTSczM4OKigJGjx5NeXkpixcv4I033uD48WPccceNSJKE2+0mLS15SM2bPgv2uro6tFotFosFl8vF\nli1buPXWW7v9XnX10C9OlZBgGRL9LC2WDyJWaYQO+9NVPyVJQqNVUV1pHfTf0tfxPFkhn5QjIva4\nHalZhlWUNnb53aHyN+8Mr9dPbN5YZo45gwvPn9Lj7x/aU87GT49x9sUTmDA1uc217n63KErU1trw\netVYrS6cTg/V1VZcLi8+X8vcXLHiGqZNm8uWLV9z5ZVX8swzq2hocODx+Dp8htPpwWZzB6653V6a\nmpzU1dnIysrm+edXt+uny+VFY9Gh0amorrZit7uRJDXV1VZOO+0sHn30p8yaNQ+/X8JojKGsrJZf\n/erXrF79OvHxCTx8/2+wNjgpL6vH6/W3+f1er5/6egfV1VYWLDibd99dQ0ZGJvPnL6S62orV6mTJ\nkou47ba7ejR+oSDYxaPPUTHV1dVcd911XHrppVxxxRUsWLCAhQsX9rXZMK1QzCi9McUIgkBktJHG\nBuewzzhsqHNiMut65fxsbYoZziip8PGJPQvxU1Ac6LZ+DPUrKyslOzuHa6+9nilTplBcXIjJZMZu\n79hpO23aTNau/RxRFKmpqWHXrp0AjB6dSX19AwcO7AfA5/NRUJAPtBwy0tE7MWpUGmq1ilde+TuL\nF8tmGI/HgyBAZGQUDoeD44W7geY5ZTJhs3UcMbVw4WK++mpDG5POrFlz2bBhHfX1ssLV1NREZWVl\nr8aqv+izxj5+/Hjef//9UPQlTCcoSTmW6N5FtcihfnacDi8mc8/CJYcKfr+IrcnV6yPN9AY59r1p\nmMcuK6nwSjninqIIdmtTb8YhOHv+O++8ya5dO1Cr1YwfP47TT58PgFqt4cYbv8eFFy5r4zxduPBs\ndu3azvXXX016egYzZswCQKPR8LvfPcmzz/4fNpsNUfRz5ZXXkJWVjd8vtfk9p7J48RKef/5P3HLL\nnYBsJl62bAXXXXcVKSmpZGeNw14vv1sXXbSMxx57DK1Wx/PPr27jO7BYLGRmZlNcXMiECZMAyMzM\n4pZb7uT+++9CFCW0Wi333/8QycnJHfZlMBCkQVLjhvJ2V2GobMtff/5b/H6R6++e1+H17vq5ef0J\n9m4rYcX3Z5CcNnhlV/synvW1dt56aTsTpiZz9sUTetXGO//YQUOtg5t/cmanEQxD5W/eGbu/Lebb\nDflcddOcwCEiPcHn8/PS018xKiOaS66Z3g89bEt/jeen/zlAwbEarr9nXq+UleL8Wj5+ez9zFmQy\ne0HmkP+7KwyYKSZM/6JoqpG91NahVSz7MI6MCZQS6GFyUmssUQZ8PhGnffgesqBkS0b38pg/jUaN\nyawb9lmX1kYXGo0Ko6nr4/A6I1AMrH5kZOGeSliwD3FsTW4kCSKjei/QomLkRUERjsORvsSwKyj2\n2OFsjrE1m1D6Mg4RUfrmeTV8fS7WRheWXiTsKUREGlCpBJrqh+9c6IqwYB/iBBynoRBoI0Fj78OB\n1C3JWsP3ZbY2udHp1d0e3NwVkVEGRFEatsfDuV0+3C5fnzKpVSoBc4QOm3X4zoWuCAv2IU5LclLv\nJ7GinQxrjT0g2Hs/DgHH4TBd4CRJwtroIiKyb6UhlO/3Z2RMf6LsWvpaIiMi0oDd6hmRVR7Dgn2I\n05dQRwWVSsASbRjW205rkwuDSYtW1/tAroDGPkwFmsftw+vxY4nU96kdRSAO13FQ+t3bKDGFiCh5\nHEdCgbxTCQv2IU6LYO/bJI6KNuJyyjVnhhuSJGFvchNhCZFAG6amGGujLIAi+qipBmLZexXyOPhY\nlV1sCDR2kP1YI42wYB/iNDXI3v++xp8PZzu72+XD5xOJ6KOmqtGoMUcM34gQJfbc0kdTjPL94TYO\nu3fv5KGH7gv0uzNTzD333MbRo0e6bU/Z+diaXPzpT39i587tverX22+/idvdsjg89NCPsdsHt0R0\nWLAPYSRJoqnBiSW6995/BWXbaRuG207lRY6w9L3ssCXagK3JNSztqoqG3dcFztI8F4abYAcQBLoV\n7MGiaOyNDU7uvfdeZs2a06t23nnnTdzulrF86qlnMZsHt9Dc8C5MPcJxu3x43H5S0ntvX1dQzBjD\n0Z6oLEZ9FWggh41WljZht7r75LcYDBRTTF8FmlanwWDU9Eiwu1wufvnLn1JdXYUoilx//c0sXnxu\np2V1y8pK+b//exybrQlJEvjtb58gNXUUq1Y9x9atmxEEFddddxPnnHMeu3fvZPXqvxEVFU1BwQkm\nTJjIo4/+FoBvv93Mn//8DNHRMYwdOx6ApkYnGq0qEBnkdrt5/PFfU1RUSEZGBh5PS7TP9u3f8vLL\nf8Pr9TJqVBqPPPIYBoOBK664hLMXXcAXm7/Er1vGV9veZNas09HrDfzvfx/xm9/8AZB3Cf/+9xs8\n8cQzPP30Exw9egi3282iRedw00238u67b1FTU80999xOdHQ0zz33PFdccQkvv/xP3njjNZKTU1ix\n4nIAVq/+G2azmauuupZ//euffPnlF3i9Ps46axE33dR9fa2eEBbsQxjlxeurLRFaBPtwtCfam0In\n2C2tQh6Hm2BXNHb/hv+yY9WePu065tg8iH6J/IffBsAyew4JV1zd6f1bt24mPj6Bp556FgCHw95l\nWd1f//oXXHfdjaxYsZTy8jpEUWTjxvWcOJHHa6/9m/r6Om6++TpmzJgJQF7eMV5//R3i4uK4444f\nsn//XsaPn8hTT/2eP//5RUaNSuOXv/wZ0D6Gfc2adzEajbzyyr84ceI4N910LQCNjQ28+upqnnvu\nr+j1Bt5441Xeeut1brjhZvk3R5pZMu8uRqfHcrSkVB6XOafx9NN/wO12odcbWLfuCxYvXgLAbbfd\nhcViQRRFfvSjO8jPP87ll1/Nv//9ZqCcsYzcr3PPXcJzz/0xINjXr1/LM8/8me3bv6W0tJiXXnoN\nSZJ4+OH72bt3D9OmhS4TOCzYhzC2EAo087DW2BUTRN8XuMCBG43D7+ARa5MLlUpAq1XTVxe4oBKQ\n/CKSJJs3uiM7ewyrVj3HCy/8hTPOWMC0adPJzz9Bfv4J7rvvruayuhLx8Qk4HA5qaqpZsEAuBqjV\nypr1vn17OPfc8wGIiYllxoxZHD58CJPJxKRJk4mPjwdgzJhxVFRUYDAYSU0dxahRaQAsWXIhH3zw\nH3kXm9ayKO/Zs5srmhelnJwxjBkzDoCDBw9QWJjPHXf8EEmS8Pl8TJkyLfC9JUvO57//ymuj7KjV\nak477Qy+/vorFi1azJYtX3PXXT8CYN26z/jwwzX4/X7q6mopKCggO3sMIDX/pyD//9ix42loaKC2\ntob6+noiIyNJTEzinXfeYvv2bdx007VyXXmni9LS4rBg/66gCGFzH6NBWrcxHCMhrMoCF4JxaHEi\nD79xsDW6MVv0JF55NQl33dKn2ibfrDvOvu2lrLxuJkmp3Z/MlZ4+mpdffp0tW77hxRf/wty5p3PW\nWYvIzs5pV1bX4ei4iuOpma6t/60IfwC1WoXf3/HS5fPKu5RTzVGtfVBKu5IkMWfO6Tz22O86bMto\nNBIRaWj3TixefB7/+c/bREZamDhxMkajkYqKct566w1efvmfmM0RPP74r/F4uleSzj77HL78ci21\ntbWcc86SQL9+8IMbuOSSFd1+v7eEnadDmIBtOQQCTa2WD8Iejs5TW5MbQQCzpe+VKZXdj32YmaT8\nPhGH3ROyc2t7GvJYU1ODXq9nyZILuOaa73Ps2NFOy+qaTGYSE5P46qsNAHi9XtxuF9OmzWTdui8Q\nRZH6+nr27dvDpEmTO31mRkYmlZUVlJeXAbB27Wf4mmuntx6H6dNn8PnnnwCQn3+cEyfyAJg8eSr7\n9++lrEw2s7jdLkpKits8IyJSj8ftlw8faWbGjFkcO3aUDz9cEyjVa7fbMRqNmExm6upq+fbbzYH7\nuypJvHjxeaxb9zkbN67n7LPPAeC0007n448/xOl0No9tdaAEcKgIa+xDmFBq7CAvEDVVNiRJGlLn\nM3aHvcmFKUKHStV3PSSwcxlmC5xijuprcpKCEvIYbJJSfv5xVq16DpVKQKPR8sADP+uyrO4vfvFr\n/u//HueVV15CENT89rdPsHDh2Rw8uI8bbrgGQVBx5533EhMTS2FhQZtnKXNTp9Px4IOP8OCDPyI6\nOobc3OmcrKiT+99KsC9ffjmPP/5rbrjhe4wdO45Jk+QDSKKjo3nkkcf41a8ewePxIggCt9xyB+np\no1Hs4Ip5T1kwQD7qc968BXzyycf84he/BmDMmLGMHTueH/zgKlJTR5Gb22LSueSS5TzwwL3Exyfw\n3HPP07q8cVZWNg6Hg4SEJGJj4wCYM+d0iooKuf32GwEwmUw8+uhviYkJnWkwXLa3Cwa7lOcH/9pD\neXEDtz54FuoujkELtp99LXXaV3oznqIo8dLTm0hIsbDyBzND0o9X/vQNOr2G7912Wkj6OBCUFtbz\n0Vt7mTUvg7lnZfW5nzUnrbzzj51MmZnKmUvGhbCnbQn1eH79RR77d5Zx+Q2zSEju+1F0u7YUsXVj\nAVf/cC4xCb2vQzRQhMv2jgDsVjdGs7ZLod4ThmPIo8PuQRSlkJijFMwWPXbr8KpuGKr6KAqBujnD\nLJY9lKGvcjtKlNTwS9zrirBgH6JIkoTN2vc0+tZERA6/kMdQJeW0JsKix+cTh1V5hUCSVojGQT5R\nSh1wTA8X7FY3KrXQp+qWrVHGczifVdARYcE+RHG7fPh9Ysjs69CqNsYwKlVqDziQQ6OpwvAM/VQW\n41Bp7CDb2a2NrmG1c7Fb3Zgj9CHzEQV8DcO48mlH9FmwV1ZWct1113HRRRexbNkyXnvttVD06zuP\nLYQhfgrDWaCFUmMfjg5UpU5MSOdDpB6vx4/X4+/+5iGAKMqRQaGIjlIwRegQhJGnsfc5KkatVvOz\nn/2MiRMnYrfbWblyJfPnzycnJycU/fvOEuqIGBie2afWfjDFBBY42zAah0YXRpMWjVYdsjbNES0L\nvU4/9APkHHYvktTS71AghwHrh/VZBR3RZ409ISGBiRMnAmA2m8nJyaGqqqrPHfuuE8oYdgVThKzp\nDCfB3qKxh84EEXAiD5NxCPhbQjgGMPwWuP5QdkAOIW1qdCGKw8ck1R0htbGXlpZy5MgRcnNzQ9ls\nv2I/sA9nfv5gd6Md/TGJ1WpV83Fg7V9kT2UFjsOHQvasUKE4y3p7aHFHKFv51uMgSRKi14vo9SL5\nhpZT1enwIvqlkO5aAMzNC73d2lI0S3S7se3Zjd/WtuyszWbj/fffDfxbKaHbEU8++XuKigq7fX5X\nbbRGKcMbeCeC0NhffvnFoMvwRkQakEQJR/MC9/bbb+JyOrHu2IansmJIlOHtKSHbf9ntdu69914e\neeQRzGZzt/cHG4/Z3xT+8xU8dfWkLL2IjB9ci1rfdtIMVj+V1OnRmXHExoduPKNiTVSWNRIfFwGS\nSPlHH1P15QYchUUApF9zFaOvvrL3HQ9RPxUcNg9R0UYSE7tPew+WSItcVsDr8ZOQYEH0ejn8uz/Q\nsGcvxwFUKjK+/z3SLuu/lO+eUOlpBCA+IaLN+PV1bqamRcv/I8lt+Z1ODj3zJE2HDiOo1URNyyV1\n6UXEzJqJ293IRx/9h1tvlZNqoqNN6PWaDvvw9NNPtPm3co8oim2SzLpqozVarZqYGBP2etlhmjIq\nqsvviKLIT3/6QPcD0ExisoXjh6vQqNUkJFh49+03mJF/HOn4CZIvWMI//vFy0G0NFUIi2H0+H/fe\ney+XXnoGQVm+AAAgAElEQVQp5557blDfGSpJIEm33U3l6r9R8dHH1Gzbwagf/wRdQiIwuMkqNVXy\nc90eb7d96Ek/DUYNol+iuKgW19frqHn3bVCrMU+bjqesjJI3/43D4SFu2aV9/g196SfIafQ2q5vU\n9KiQ/x10ejX1tQ6qq61UvfUGDXv2ohuVhikhDmt+AUX/fANfXDLmyVNC+tzeUFoip5urNEJgHEIx\nN33N1SGrKps4WVpD2XPP4Dx2FOOEiYgOBw27dtOwZy8Zj/2Gx//2V4qLi1m27BJmzz6NM86YT0ND\nE7fddme7Urv33HMbd999H+PHT2DJkrO46qpr2bbtW+6++8fY7fY2ZXg9Hl+733FqGV673Ul9vYP6\nCh8V1cf42aOrEVRSuzK8F198Cdu3b2XlyivZunUz8+efGVQZ3sYGG3GWCZQUTeTdF/8fVSdP8ugX\nnxEdHcOq85exaNHZg16GVyHYxTwkgv2RRx5hzJgxXH/99aFobkAxZmeT8cvfUPOfd2hY+wU1775N\n6h13D3a3sFvdGIyhdZZBS9hgQ1k19o8+QB1hIePXv0MTFYW3tpbS/3uC2g/eR9DpiD3/wpA+u6co\ntt9Q25ahJUnJumsnDWu/QJeSyuhHHiUpLZ6SbXspfuL3VP79RTIe+y2a6OiQP78nKONgajZBbF5/\ngsK8GsQ+HhaimJSP7KvEsXsHWXlHiZg9h5RbbkdQq7Ht3kX5qj9R9cY/uf32uykszGf16jcAWUB2\nVGp36tRpbZ7hdDrJyRnDD394Gx6Ph6uvXtGuDO+pdFaGt7qqhgN5a3nxpb+RkBTdrgyvTqdn1aqX\nALnMMARXhvfEkZM89PC9HDt8mNOLi3lPq+WPj/6G1IVnN4dVDn4Z3p7SZxv7zp07+eijj/j2229Z\nvnw5K1asYNOmTaHo24Ch0ulIuOp76DOzsO3cgfuUQkEDTX8kJykoNvvyz9Yjud3EX34lmqgoALRx\ncaQ9+DDqqGhqP3i/nZ11oOmPUEeFCIset8tH+auvIOh0pNx+F6pmM5whK5uEK67Cb7VS8dILSOLg\nnrak2MAVm3ioUELBRb+Ir64Oc+40Um6+DUEtKxMRM2ZinjET57Gj2Hbvavd9pdSuIAiBUrunotFo\nWLhwMQBFRYXtyvB2xJ49uwPXWpfhPX7iCI22kzz007u48cbv8emnH3Py5MnA95SCXa1pXYbX7/ez\nZcvXnHmmXE543brPuOmm7/PL395Do/Ukx3ftQLTZEIwmLDNntYqVb1+G9/jxvEAZ3m3btgbK8N50\n07UUFxdRWjq4MqTPGvusWbM4fPhwKPoyqAiCQPzyFZQ9+wy1H35A6l33DFpfPG4fPm9ok5MUApl2\nReUk5owhct78Nte1cfHELDmfmnf+TeNXm4i98KKQ9yFY+iPrVEFxwDk9kPW9a9GPGtXmevQ55+E4\nchj7nt3Y9+4hYkZo6tT0BsWpp8yHeYtzuPSq6SExT73+/Ld4GxsZW7uDhB8/jqBpKxISr7qGwoMH\nqPv4w3YLXDCldnU6Xa+SiToqw+tyekhLnsA//vFih98xGjs+OKW7MrySX8Odt/4Ya1UtKrMZVSft\nwOCV4e0p4czTVpgmT8WQnYNt905cxUWD1g8lWsPcLwJN1vpcmggSr/0BQgcVE6POPAtBr6dh/dpB\njRCx2xRNNfTjYDLJWqkvbhSR889sd10QBOIvXQlA49eDuwPtL40dwKgRcUtajJNz0aWktruujU8g\n9qKlaB0ObLW1PW6/dVZrR2V4O6KzMrwW4yiq6gq6LMPbEd2V4XW6rZRXH8EraIg9/0LM5oghV4a3\np4QFeysEQSDuUnnVrf1wzaD1w94PMewK2qZqAPyJ6RhGZ3R4j9pkJmr+Anz1dR1uwQeK/opbBlDX\nyMJFGJ/b4eIGoE9PR5+ZhX3/PnwNDSHvQ7DYbW40GlW/JBFprDVIggrDWZ0HPcScfwGRlkhydDqu\nu+5q/vrXP7W7p7WG3dn/63Q6Hnro5zz44I+4665bSOlgIQG5DK/D4eCGG77Hm2++zqRJU/B6fKgF\nI8vOv5lf/eoRrr/+Gm677SaKAwpY57sCpQzv1q1bmDdPXsRbl+F96onfkBSdgU+tJ3rxOYEyvD/6\n0R3t2u6sDO95553P7bffyPXXX82jjz6M0+notD8DQbhs7ylIkkTJE7/HdeI4s/72PFbVwJ+LeWhv\nORs/OcbZF09gwtTkbu/vSYTEybfe5D8FSSTGaLns9vaaqoLnZCWFP/8phpwxjP7ZL4Lue6j6CfDZ\n+wfJP1rNdXefEXKtffsfVrFDmMycuUnMXjyx0z42bFhP1euvEb/ycmIvWhrSPgTLK3/+Bp2ubZnh\nkETFNNTz6ZNvURI1kcuun0liSuchpZWvrKbp602kPfAwpgkTO73vVEIVWVZfY+etv29n4rQUFl04\nvs/ttWl7/Vo+3tSI0xTLzQ8uGtJnFYTL9vYSQRCInLcAgLqt2walD/Z+qBMDIIki9p3b0IsunGLX\n2p8uKRlz7jRcJ47jzD8R0n4Ei8Mun5xkNIXWBOEuL0dVIv8mp6/rqCPL3NMRtFoav/lqUIpl+f0i\nTrs3kDUcShq+XI/eKzvIFbNXZ0SedjoA1m1bQ96PYLDb+m/3Ztu1E53fgU8Uhk3dnO4IC/YOiJg+\nAwSB2i3fDsrzbf1kgnDmHcNXX49Rr8Jh93QrqKIXy9tza6tjwAYSu9WDyaxDpQqtBtX09Sb0Pnvg\nGV2hNpmImDUb78mTOPOOhbQfweC094+fQZIkmrZuwaCSfSjdFYYzjp+AOioK687tg+J3USKkQlkA\nDMBvteI8dpSICNkRPFzKK3RHWLB3gCYqCuOYsTQdPoKvqWnAn99iYw/tJLZukxeqiPhI/H4Jj7vr\nF9Q0YSIqgwH7/n0Drq1KkoTD7gm5pir5fDRt+Qa9SYtaLQRV4TFqwVkANH61MaR9CYYWB3Jox8FT\nUY6vpoao0SmAnOHbFYJKhWX2XES7HfuhgyHtSzD0lyPdtncPiCIxaXJSYncL/XAhLNg7IWLGTJAk\n7Ht2D/izbVY3Or0arS50zjLJ58O6cwfqqCgik2SnT3eTWNBoME2egre6Gm9l+xjl/sTjluvRm0L8\nIjvzjuG3Womae7qcpBSEhmYcPwFNfDz23bsGXFvtLweyfd9eAGInjmnznK6wzJVt/IqCMJD0V0CB\nbfdOABInZAItoaXDnbBg74SIGbOAlj/8QOKweUKumdgPHUS02bDMnoup+eVw2LufxObmTEJbsyAY\nKPorxM9+8IDcbm4uZoseh82Dv5sMTkEQME/JRXS5cBUMbME4RZMO9c7Fvm8vCALxM+SSCcEscIbs\nHLTxCdh270Z0D6wA7I8FTnS5cBw8gG5UGjHpSfJzutm5DBfCgr0TtAkJmLMycRw+hN85cLWa/c1H\ntoX6Rbbtkhcoy9zTWqr6BTGJzVOnyvfu3xfS/nSHsuiEWmN3HDyAoNFgHDs+ICQUO3ZXmCdPBhhw\nM0TAaRjCcfA77DiP52HIysIQG41OrwlqLgiCgGXuaUhu14BXArXb3KjVAnpD6Hax9gP7kXw+ImbM\nDJykFLaxfweIPf00JJ8P+/6B01Zb6oKEVrA7jxxGZTJhyMoOtN2dXRVAExWNPjNLNmEM4ALXHxq7\nr6kJd0kxxrHjUOn1LQePBGGGMI6fCCoVjmaNf6DoD03VcfAgiGJgN2a26II+VcvUXBTNcWSABbvV\ng9kSuiPxoGU3HjFzVuDIwWDeieFAWLB3QdzpcwGwD2CSjqNZezSZQ/cie2uq8dZUYxw3HkGlajk5\nJ0jtxDw1F/x+HIcGTqgFxiGEgt1xWNa2TZNk4WQyB7/AqU0mDNk5uAry8XeSldgf2PvBFKPY1825\nzYI9Qq6b4/N2H+pnyM5B0GpxHDkSsv50h9/ffCReCHctkt+Pfd9eNHFx6NNHN5+jGjbFfCcwZWSg\njo7GcfTIgEWFOPohCkJ5CZXEkp4INICIZgFg3zdw5pieHKoQLIq2bWo2q/Rk5wLIJXwlaUC1VbtN\nPrZOG6Iqn5IoYj+wD3VUNPrmzOOWk5S6HweVVotxzFg8pSX4rAMTMRYI+QzhrsVdXITodGKeMhVB\nEFCpBExmXdh5+l1AEARM48bjb2rC26qKXH/SH84yx1G5SJsi2I1mbY+0E31GJmpLJPb9ewes0mGo\nw/wkScJ+8CBqiwV9Wnpz280CLQgnMoBpkrwgOAbQzu6whfbwZldhAX6rFfPU3IBZo+UkpeDGwdg8\nj5xHj4asX13RktcRwnfimNx347gJgc9MEXrstu7zO4YDYcHeDcaxcvqy89jATGJFyChadV+RJAnn\nkcOoLRZ0qXIFQ5VKhdEUvHYiqFSYp0zF39SEp6wsJP3qDiXr1BCirFNPeRn+xgZMkyYHasP0VGM3\nZGahMhqxHzwwIC+/z+vH7fKFdtfSvCgpTnHo+dmnioLgODIwVV0d/RDDrrzPxrHjAp+ZI3T4fWK3\n+R3DgbBg7wbjOFmwO/IGRrAHJnGItp3eqpNytun4CW2KXZkidEFlnyooL4DzeF5I+tUdoc46DZhh\nJrWciNRTk5SgVmOaOAlfTQ3eATiwvT+Sk5S/n6KwyO03C/Ygk3MMGZkIegPOgRLsIfa3SKKIM+8Y\n2oQEtLGxgc+VMOCRkKQUFuzdoEtJQRURMWAae8AUEyKNXdGqTi3cZI7Q4fOKeNzB1cYwjh0LDIxg\n74+sUyVMUQlbBNDpNWi0qh5FQgSiQgbAkRyIkArRIi+JIq4Tx9EmJaGJbCn4pZg4gtXY5XDRcXgq\nK/A19H952lC/E56yMkSHo83iBq1MUiPAzh4W7N0gqFQYx47DV1uLt7am35/nsHnQaENXotXZiWBX\n4sODSVIC0CY3L3An+l+whzrrVBJFXMfz0CYno4mOaXPNZNYFbWMHMI1vti/n9f84hNqR7ikvQ3Q6\nMeaMbfO5orH3xHFomiDbph1H+z865tSjAfuKsvtWduMKph7kdwx1woI9CEwBO3v/F4Gy290hsyVK\nkoTjyBHU0dFok9qW/+2xGUIQMOaMwVdT0+9aWqhj2D3lZYguF8bsMe2umSL0uBxeRDE4k5Q2KUle\n4PKPh6RvXRHqyKCAGWZMW8FuNMsFsHq0c5kwSf7OAJye5rSHVmMP2NfHnaqx93yBG6qEBXsQKBPA\n2c92dlFsLtEaqi1nRTl+axOm8RPbJXa0bL+Df5kVgdDf5phQZ50qZYcNOe0FuzlChySB09GDBS47\nR17gGvv38I1Ql6pV/m6GUwS77EzXYg8iA1dBP3o0KpMJ59H+F+x2m6f5oJG+h3xKkoTz2FFZ2UlI\naHOtJToorLED8MgjjzBv3jyWLVsWiuaGHPr0dFQGQyBEqr9w2r1A6JxErhOyVqnYx1ujJED1RDsZ\nKMEeao3ddUIW7MacnHbXerpzATlJB8DVz3XqQ+08dR0/jspsRpfc/vAWU4QuqNIKCoJKhTFnDN7q\n6n6vgKr4W0KRdeo9eRJ/UxOmcePbtWfqYeLeUCYkgn3lypW8/PLLoWhqSCKo1RjGjMVbWYmvsbHf\nnhNq779SsEoRRK3paagfgD4zE0GjwXm8f80QIR+H/BOoDIZAuGdrejMOxmbN33mifwW70idjCHZw\nvoYGOfs4Z0yHRwGazDo8bj/eILJPFQxZ2QD9WhhNFCWcdk/ozTCnOE4BjCYtKpUwIsoKhESwz549\nm8jIzo/VGgmYAuaY/rOzh7rgkzM/H0GnQz8qrd21nhQCU1BpdegzMuWsvX6s7hdK27LfbsdTUY4h\nK7tjgdbDJCUAQ1YWCEJgR9RfOOweDEYtanXfX9PO7OsKyjj0RGs3ZCuCvf8WOJfTiySFbpFX3l/j\nuHHtrgmCIIcBjwCNPfSn445QAtvvgnwss+cAYPPa2V65G5WgIjMyndSIFLSq4IZUkiRKqmwcKqzH\n6/Oj16pxVcs1SEKhnYhuN56yUoxjxiKo29smjQETRM8msXHMGFwnjuMqyA9E2tS7GjhQewS7186M\nxFySTAndtNKCy+OjqNJKQYUVm9NLXJSB6pPyGZmhMEEoQqejXUvrZ/RES1MZjOhSR+EqKkTy+RA0\nGlw+F2W2SmpddVg9NqbGTySxB+NQ3eCk+KSVmkYXjXYPybEmbFY3lsj+ta8rKHPObvMQGR3cOb+G\nTEWwFwQ+a/JYOVhzBLVKjU6tY5pxLALB/4ayGjvHShpweXx4vCKG5jyLUNVOchacQGU0ouvkIG1T\nhI6aShuSJA3ps0+7Y9AEe7CHsg42Sj995qmUCgL+smI0ESJrDn/G+vxvcPtbBIJereOHs65mUdYZ\nnbbncvt4d30ea7cXU9voanMtFRiFig0HK7GkRzNtbPCC4dTxbDxYDJJEzKTxnY61KUKH2+Xr0d9C\nNWsa9Z99iqqiGPuUFJ7f/k8K6ksC1z/K/4zxcdlcPuVipiVP6rSftY1O3vz8KGu3FeM/JSJlAgIR\nCHyxr4JLzxpDQkzvDxR3VpYCkDRzKrEd/E7RKz9b8kuBvgUzHo1TJnLys1JM9joKLV6e3foyTW5b\n4PoHJ/7HOTkLuHzyxUQbOt7NSpLEoYI63t9wnG2HKmmdKyYAs1FRWu9k8+EqLpqXiVbTdoHuyd+t\nvCgfQaMhbfZU1Pr2QjIxSW5Lq1YF326ChbKUZNyFBcTGGvmycAtv7H0fu7elCqjmoIZrc5dz4biz\nUQkd7zx8fpEvthbxxbZi8kraOqQjgfGo2Ha8moRJSSyYntprgeuz2zlWWUlU7lQSk6La/5wECzGx\nZqrKrUSY9CEvGT2QDJpgD8XJ5f3NqSes65JTsB4/zs8+e4I6dwMx+miWZi3BrDVT2FTCjpO7+eu2\n1yisKueirPPaTEBJktiTV8O/1h6jtsmN2aDh9MlJ5GbHYTHpcHv9HPy2GFu5lX2FdWx9YTNn5qZw\n9TljMXYT097RSfB1u+WEHCk5vdOxNpq0NDW4evS38CXIduqSndt5Ub0Zl8/FxNhxTImbiElrZGvF\nTo7WHucPm1Zx69TrmBrfItwTEixUVDby4TcFfLatBK9PJCnWxPQxcWSlRBJl1lHX5GbfF3l4PH4+\n2JTPf78uYMVZ2Vxw2mhUvXiha/fLBbs8cakd/k63V3ZY19bYqa62djiWHZI6GoAvv3iP12MLUQkq\nFqbNJ9mUiFqlYm3RRj4/vomvCrfx4xm3k2ZpqyHaXV5Wf3yY3XlybkRWSiRzJiQSH2Ug0qyjsKSB\nE5sK8UgSf//gAO9/mcfV54xj1viEwFgG+3cT3W5s+QUYMjKpa/IA7XcnIvKqUlHeSHxK8AuGdnQW\nrq1b+MM7v2efUIlBrWdZ9gVEaE04vE6+LPuKV/e8y9aivVw/+WoidW3bLqux8/f/HqKo0oogQG5O\nHLPGJWAx6dBpVRzeW0HV4WqqrW6een0H//06hmvPG0dKnDnoPiooNeRVqe3fCWU8NVp58SkuriMu\nIaLHz+hvgl10QybYR0LhnO7QjE7HU1GOVFXDBdPO56LMc1GrZC3qtJRZLEybx1/3ruZ/hWupdzdy\n7YTLEQQBUZR4Y+0xvtxVhlolcPEZGSw9IxO9rq0GdnJfJTbgnqum8eaXJ/hqXwWHCuu5fflkclLb\naxhdoURsKHbQjjBF6KmtsuP1+II+hk9jiUSKjcZZkI97ZgLXTbqKuckzA9fnJs8krz6fv+59mb/v\n/ye35t7A5DjZP9Fk9/D/3t7L4aJ6Yix6li/IYt7UZNStbN+SJLH/02MkJ0bww9mjeG/jCd7dcIJj\nJQ3cvHQSEUZt0GMgiSKu/BNok5JQR3T8khqMvXOYGZtNO1VH9hC5KI2bp/6A7KjMwPXTk2ezsWwz\n7+V9xPP7/sFDs+8hSi9r7gUVTTy/5gA1jS7GpUez8qxsxqZFtVEEIlUCJyhk/oxR5Khh3c4yVr2/\nn0vmZ3LJgqwe9dVdUgx+f9dzQTHN9cDGDqDPysK6dQuewgJy58zmqvHLida3zNWLpy7iua//wcHa\nI/xt32vcN/P2wDuzbmcp/15/HJ9fZP6UZFYuzCHmlNBOZ7mVqsPVfO+C8aw7Ws3+/FoeW72dW5dN\nYvaExB711VUom4wMWZ2PnzIOTrsHgt8wDzlC4jz9yU9+wtVXX01BQQGLFi3ivffeC0WzQwqv38s2\nXSUAC8VMlmYtCUxQhWRzIg/MvovRllFsqdjOloodeLx+Vr2/ny93lZGWEMGvb5rLZQtz2gl1kF8q\ntVpgfGYsj14/m6XzMqizunj6zT0cKqzrUX9dBfmoLZFoYuM6vcds7rkDtdZZzwmLG6Nb5Oa0S9oI\ndYWxMdncnnsjgiDw0v5XKWgspqzGzk+e28jhonpmjI3ndzefxpnTUtsIdWjJOjVb9MyfmsKvbpzL\n5KxY9p2o5TevbKemIfjDPjyVFXKmZQeJSQqCIGDsRbnWfK0Nl05gVK3Iw3N+3EaoA6hVahann8ml\n2RfS4G7kxX2v4vF72Hm0isf/uZPaRheXzM/koWtmMC49up15QVlooqMMXLV4LI/dMJuEaAMfflPI\nX98/gKsHhapcRYUAGDK6EGi98DUA7NLLO46JNjO3TP1BG6EOEG2I5I7cG5mdNJ2CpiI+yP8ESZJ4\nb+MJ3vjiGEa9mrtXTuWHSye1E+rQ4sxNSbLw4ytyuXP5FNRqgefXHGDtjpJ293dFQLBndqXs9G4c\nhhohEex//OMf+frrrzlw4AAbNmzgsssuC0WzQ4qPC77ggFGO1811xXRq54vUWbhl6nUY1HrezfuQ\nJ975ht15NUzMiOGn184kNb7zLaTdJod1CYKARq1i5Vk53L1iKn5R5Nl39rEnL7iSBr6GBnx1dRiy\ns7u0R5osPZvEkiTx5tH3qIyRp022tXPn5vjYMdw85Qd4RR+vHnybJ/+1g8paB8vmZXLXyqmdmpdO\nTaOPNOu478ppLJ2XSU2ji6fe3E1NY3DCPbBr6SB+vTXmCB32HhREs3nsvHbk31TGa7FYvZjdnX/v\nvIxFnJ48myJrCX/Z9i9e+OAgGo2K+66axvIzszstcnZqyOeohAgevX4OE0ZHs+tYNb//xza8vuBC\nE92FhYBcfrkzeqOx76s+yIeu3fhVkNOk79SGLggC14xfSaIpnnXFm1i1di0fbykiMcbIo9fPZua4\nzlXj1rH8giAwe0IiP/3eTCLNOv61No/3NgYfkeMqKGhWdmI7vUcJKuhJstZQJJx5GgQn7VWsL/kK\nX1IcqFS4iwq6vD/WEMOKMctw+92UGzczd1Ii9105DVMX5zVKUnO87ikOmxnjEvjRFdNQqWDV+/vZ\nd6K22/4G4tezOtdMAMzmniVkbKvcxeG6YwGNx11U1OX9U+InMit+FtWuKlxRedxxWS4rzsru0lau\nCJbWsdsqQWDlWdmsODNLFu7/Ck64K9Ea3Y2DyaxD9Eu4Xd1rwZIk8caRd2n0WIkeI0cFKZpgRwiC\nwDUTVhKvTeaE8xCaqHruv3IaU7I630lBx4WvIoxa7r9qOtPHxLMnr5oXPjiIr5uDuEHW2AW9ocPE\nJIWeFkRz+py8ceRdVFodmrQ0vKWliN7Ov2vQGPjh5O+jktQckjaQkqziZ9fOJD6qa8d4R+WbM5It\n/PwHs0iKMfLxliI+3VrcbX99TU346moxZGV1qewoCoUzrLGPbCRJ4p28D/FLflZOvBR9Wjru4mIk\nX+dCQJQkDuw04a9PQB1Vx4QZTWi6iUV2OeV6JR3F607OjOX+K6ejUgk8/8EBik927TQLVrD3ZNvZ\n5LHybt6H6NU6zjvjGvk5XQg0gEa7h6Nbk5G8OvTp+czO7d7x4+iiLsiy+Vksbxbu/+/tvTi6EcTu\n4iJQqzuM429NT8Zhb81B9tUcZGx0NhNzF7Y8pwsKym1U7pPNIElTCsgZ1X3OR2dJWhq1ijuWT2ba\n2Hh259Xwj/8d7nKnIbrdchz/6NEdxvG3xmTWBa2xf160AZvXzgWZ5xA1Zjz4/biLuxawhw77cBWN\nQ9B4mTCnhqggok4cNg9GU/vyzfHRRh64egbRETre/vI4Ww5UdtmOMle72rVA730NQ42wYO+GvTUH\nOVx3jImx45iWMAVDZhaSz4e7vPMDJ9798gTbDlUxyn0GBrWeTwq/aBMW2RHdnZw0Lj2aW5ZOwuPx\n8+w7e6lrcnV4H7QW7F072QICLYhJvOb4/3D4nFyacxEJcaloE5PkOO5OhIrXJ7Lq/f1U1/qZrJ+P\niI+Xd/272+d0V6L1kvlZLJmTTkWtg+c/OIC/kxOdJJ8Pd0kx+lFpCJquHcPBVroUJZGP8j9DQDYt\nKDZrxYbdETUNTv7yn/34bdGMi5hMtfsk31bs6PI50PU4aDVqfn7jaeSkRrLl4En+u7nz57uL5bBX\nfWb3DldThB6n3dNtQbQ6Vz1flnxFtD6KxekLMGQpOR6dL/Q7j1bx7/XHMTtyiNPHsa1qBycd1V0+\np7vyzXFRBu6/ajomvYbV/zvMwS78UO4gHKcARlNYsI94vH4v7+V9hFpQc8XYSxAEAUPzC9LZJN5y\nsJJPtxWTEmfivhWncXb6mdi8djaVbu7yWQFbYhfJSbMnJHLl4jE02Dw89+4+3B2kf0uShKuwAG1S\nMmpT1yFhwdZJqXLUsK1yF6nmZM4cdToAhsxMRLsdX017u78kSbzxxVGOlzYyd2Iid5x1PuNixrC7\n4gDHG7rW8oMpJ3Dl2WOYlhPHwYI63lrbcfanp6ICyefDkJnZ5fOgbXJOV+w4uYdK+0lOS5lFkjkR\nTXQ06sjITk1STreP597bh9Xh5drzxnL9tOXoVFo+PPEpTl/nCzO0ONI7K99s1Gu457Jc4iL1vP9V\nAbuPdSwkXUWKwzCzy+eBPA6SJO8eu+Kj/M/wij4uyb4AnVrXUlqgsOPSAkWVVv720SF0WjX3XT6D\n5WMvDCySXeH1+PF5xS7nQlpCBPdenosgwAtrDlDdiXM9GMcpgFqjQm/QhJ2nI5mvirZT56rnrLQz\nSN51YmQAACAASURBVDLLoVXKit/RJC6qtPLqJ0cw6tXcc1kuEUYti9PPxKgxsLZ4Iy5f5xqhI8ia\n00vmpLNoeiolVTZe+7T9IdvemmpEpxNDN1tOCH7b+VnReiQkLsg8J+AgU7a0rg78Det3lbFpbwUZ\nSRZuvGgiKpWKZdnny20Vru/yWcEcqqBSCdx6yWRGJZhZt6uUTXvL292jaNHKgc1dEczOxS/6+Tj/\nc9SCmosyzwVk+7l+dCa+ulr81rbmMUmS+Pt/D1FWbeecWWmcPTONaH0USzIWY/XaWF+8qcs+OZr9\nLV3ZgyPNOu65LBedVsXf/nuI0mpbu3sCAi2I+dCShdv5PC2xlrG9cjdpEanMSZ4BgDYxEUFv6NAU\nY3V4WPX+frw+kdsumUxGsoUZCVPJsKSzu2ofRU2dR7Z0ZZZrzbj0aK49bxx2l49V/9nfTuGRJAlX\nQQGa2Lg2B4x0hnK62HAmLNg7QZREPjwiv8jnjl4Y+FyXOgpBpwts7RRsTi+r3t+Pxydy89JJJMea\nADBpjUFp7cEWvhIEgWvOHUd28zb8y91tTUKK9qgfPbrb36jRqtHp1V1O4hpnHdsqd5FkSmRGYss5\nmYqgcDVHXCjklTbw5to8Ik1a7rlsKnqtHNaZHZXB5MRxHKo7SnFTaafPC/ZlNuo1/OiyXMwGDa9/\nfoyiyraC1V0s90s/OrPLdiC4sgJbKrZT46pjfuppxBlboioMGfLC4TrFzv7p1uJANNTV57SEWy4e\nfSZmjYmNZZvxdGKeCzjSgygtMTrJwg8vnoTb42fVf/bjPCUM0l1UhMpgQJuY1G1bxiAW+k8L5UV+\n+ZiLAou8oFJhGD0aT0V5mxpCoiTx9Bs7qWkO7Zw+Nl6+XxC4NOdCAD488Wmnz+rJwe4Lp49i4fRU\niqtsvHqKwuOrq8NvberWDKNgMssZ2X7fwBzc3h+EBXsn7Ks5RLn1JHOTZ7aJzRXUavTpo3GXlSF6\n5IknShIvfXQoMIFnnFIKYHH6AowaY7PW3vEWvCfHf2k1Ku5cPoUIo5Y31+ZxpJVtUXHkBaOhKc/r\n6kX+vOhLREnkgszFbcLZFE3Y3cq+3OTw8MIHB5GQuGP5FGIjDW3aWjHxAgA+K/qy0+c57B50ejUa\nbfe1t+Ojjdy8dBI+v8hf1+zH4WoxIbiKikClQp/eteMUujdJ+UU/nxauR6vSckHm4jbXlJ1L63E4\nWlzPuxtPEB2h47ZLJreJ1derdZyZdgZ2r6NTW3tXjvSOmDMhkQtPG83Jeif/+KRFqIkuJ57KCvSj\nM7p1nEL3C1yNs5a91QcYbRnFhJi2NWf0ozNAknCXtSzaH35dwK4jVUzJjm2XVDU+dgzjonM4Up9H\nqbX9jgtaFcULsk7M984dR05qJN8ePMmGVgpPixkmSMHeA9/TUCUs2DtAkiQ+L/oSAaGNtq5gyMgA\nUcRdKk/iT74tYn9+LZOz2k9gAKPGyDnpZ2L3Odhcsb3DZ/a0VG1spIHbL52MKEk8+c8d2Jrtoorm\nqE/vXmMHWai5HF78HYTN1bsa+LZiB4nGeGYlTmtzTW0yoU1KDjhQlcWt3upm5VnZjB8d0669qUkT\nyLCks7f6AJX2kx32x9HDEq3TxsSzdF4G1Q0u/v5fOUJEEkXcJcXoUkeh0nbfVncF0fbWHKTe3cAZ\nKbMD2aMKp2rsjTY3z39wEAGB2y+dQmQHv2Vh2jw0Kg3rSr5ClNqPe7C7ltasaM5e3XGkivW7ypr7\nJDtOgxVoxm58DV+WfI2ExOL0s9qZiJQdorJjPFhQx0ffFJIYY+TWZZM7DHFdPPpMud3Srzt8Xkeh\nr12h1ai4Q1F41uUFdnGKshOMWQ5GRmRMWLB3QF7DCYqaSpgzahrJ5vZpywFttaSIYyUNvL+pgBiL\nnluWTeo0RvvMUWegUWnYVLq5y5fZaAo+ZX5SZizLF2RR0+Dk7/89hF8UcRcVoYmL6zSF/lSUhcTl\naO8w21S2Bb/k57yMRe2ybEHeFYgOB97qaj7eXMjBgjpyc+K48PSOXyBBEDg/82wkJD4v2tDuut8v\n4nL0/ASp5QuymZgRw57jNXy2rQRPZQWSxxP0rkWtVmHo4gShDSXfALAwbX67a5rYOFQREbiLChFF\niRc/PEiT3cPli3IYlx7dYXuROgunJc9s1oAPtrvem8ObNWoVt186BYtJy1vr8sgvbwoqMak1gRju\nDsbB4ZWVkmh9FDMTc9tdNzSbvNwlRdRb3fzto4OoVAIPXzen0zIQk+MmkGiMZ0flbqye9v6B3pz5\nGhtpaN7FSc27OF/LLjZowa4cQhMW7COKdc2OrUsnLunwuiLYrScKeOED+bT62y6ZTKSp8wkYoTMz\nO3E61c5aDte1P4HIYfc0F/rv2Z/k4jMymT4ugX0nalm34SB+a1PQmgl0blf1ij42l2/DrDExO2lG\nh99VIi3yt+9nzdcFxEbquXlp54sbwNT4SSQa49lZtReb197mmtOhnCDVs6p6ijM1yqzjvY0nKN4j\nH9emzwh+HMzmjk8QKrGWcaKxgImx4zpc5AVBwDA6A+//Z++9oyS560PfT3WOk3ty3JyjNiqsJAQS\nCiRjHgbDRRhjHDg8Xb/jc1+wr6/TxX6PCxiuMRgso4vBZIQQKGu1knalzTnvTs6xezqHqvdHdfX0\nzHRPV3XXzG6P+nMO54jpqq7f/vpX39/3942jo/zqlYtc7pli++oaHtzdsuDz3tVyDwICL/W8Ns8B\nnm+jkUq3lc8+thFRlPjGL87jV8JeVUTEwMLRQW8OHCWaiHJv850ZN3lLQ4Ncvri7i2/98gLTwRgf\nuX8VazKc3BQMgoF7W+4iLiV4vf/IvM+12NjT2bKymkf2yae4J399iXBvD6bKKoxudQW0SqaYZch4\naIIL41foKGtldXXmI6y1sQmMRgbOX2HKH+VDB1Zk1c7SOdCyH4BDfW/O+yzfLjEGg8Cffmwn5S4L\nxw+eBtRrJpDdvnxq5Cz+WIC9jXdgMWbWuJQN5PQbZzAIAn/4/k05i3QZBAN3Ne0lLsbn2ZgLaVpc\n7rTw2ffJpqmzb+QxD65kB6HobOfjweRvdW8GbV1BmYdTh85QU27j04/M7zE7lzpnLZtq1tPl66Fr\nTmRIPqYYhY0dVTx2ZzvjvjCjF6/JjlOPumJZNocFQZgv0BJigoN9b2I1WrizcU/GewWTCUtzC6He\nPq71TLBzjYcHdub2b+yp34ndZONQ/xFi4uy5L2QePnB3B2tbKrh0sYfE1JSqYAIFRw7TXDFQEuxz\nODxwFAmJu5Lx2pkQTCZC5R5c02NsX1HFQ3vULZpWdzMdZW1cGL/CaHCmNEAsliAaSeTdJabCbeVz\n79tIXVj+zkRt5iYCmZhJzpn9Mh/qO4KAwN2N2WvLm5plrbQiMMZv37eKlU3qKlDubbgDs8HE6/1v\nzTJL5auhKaxvq+T9d3VQ4RtBQsDctLDWnI5ycvGnNTKejvo5PnyaWnsNG6rnt1JTiNfKpYwbouP8\n4Qc24bSpM6cdaJI3+jcH3p7190Ln4X13drCp2YUjMEmgok6V4xRkJcHumF8Q7ezYRaYiXvY27MJh\nzl4CIFBei0FMsMYa5vGH16mqm24zWdnfuJvpqJ+Tw2dmfabFkT4Xo8HAH7x/Ix2CbGcPVOSOClIo\n2diXGQkxwZuDR7Gb7OyY4yxM5/zNca7HnJilBJ/YVampTviB5v1ISBzqnwl9DGl0EmVibWslO8rk\nF/IHF0JZMzLnkmkR90730+nrZn31GjyO7DVNfnFsiCmTi6b4FA/snN9PNBtOs4OdtdsYC41zZWIm\nwUirsywTj+xppSE2yZiljGeOZY62yIQjJdhnopYODxwlLsY50Hxn1gJXsbjIDy7K9+yqiNHRoL5F\n5NqqVVTbqjgxfJpQfCaxphBNFWQB/cntZRiQuBiyc6l7UvW9mWK4lY3nrizaOsDIZJBDI7IA/vB6\nKw6VmxvIG5yAwBsDb836e9BfWK/TCpeVRzrkMT3fk8AXVCeoS6aYZcaZsQtMR/3srd+Z1fwwPBHk\nn5++wIhdFniG4eylBTKxvXYzZRa3XNI3IduU83GWZaJ8eoSIxcGZ4Rg/Oaiu6l0mU8yhPtneqWiU\nmXjr4hDPvd2D112DNRpE1Nip/u5m+USUblstVKABJMZGMSViTLk8PHO4S3VFzJR9OdlvVZREDg8c\nxWIws6dhfmlihe+/dJXzkxA3WamYHtE0VoNg4M7G3UTFGMeGTqX+rkcTa9PYIAAjtiq59rvKcscO\np4V4TCSajIcfD01weeIaK8rbaHRlLiIWiSX4p5+fp9con9hck5kjnrJRba9iXdVqbnq7GUxGSyUS\nIuFQrOAuRmU+OSP3pljGN35+XlXRNKvNJNfoLwn25cGb/UnNpCmzZhIMx/jqT84SjMTZcfc2gJyF\nj+ZiMpjY23AHoXiI06Pn5O/N01mWTsLvJz4+TvmqFdRVO3n+aC+vn82tsc7VTkLxMMeHT1Ftq8xq\nfugemubffn0Zm8XI6js2AvMTdHLR5m6hxd3E2bGLTIbldmip7NsCBFqkV/491u/ZjNlk4F9+dYHB\n8UCOu2bmwZ8U7NcmbzIWnmB77Rbspszmh4On+3nt9ACtdW6cHe3ERoY1N/ne27ALg2DgjYG3U05U\nPZpYK+ty54Ht+EMxvp4hIzMTc9fD4cFjSEhZbeuiJPHtZy7SM+Jn3a4NcvXTXm3vBMD+xt3y8waO\nAoX5W9KJ9HZjcDhZtbGdK71TfO+FqznLM880tS4J9qJnJDjG5clrrKrooN453x4nihL//MsLDE0E\neXB3C3fcK0eKaBVoAPsa5GbYR5LOQz00VeVlcrS384UPyxmZTz13Jecx3GY3z3KYnRw5Q1SMsa9h\nd0bzw4QvzNd/dpZoXOSzj22kZu2qWc9XiyAI3N20FwmJI8nYfj02OGUc9RtW86n3riMUSfA/fniG\nqRyOsJQpxidfd3hQFjCKwJnL2RtjfO/5q7jsZrm+fFurnKDTp635Q7nVzZaajfT7B1NO1KA/e+Er\ntUR65cqW++/blsrI/PavLuYs8JV+gkuICY4MHMNusmUMcQT46cEbnLg6yrrWCn7noY1Y6hsI9/Qg\nqTQFKmyp2YDL7OTtoRPExLgu74QYDhEbGcHa2spnHt1Ia62LQ2cGePlE9sxnBSVxr1g7w5UEexJF\nuNzVON9pKkoS//aby5y/OcHmFdX89r2rMNrtmGvr5BK+Gn/8WkcNqyo6uDp5nbHQuC6mmHBaEkZ9\nlYM/+ZCc/v8/f3aOgbHsGqviMFM0pCMDxxEQ2Nuwc961/lCM//GjM4z7IvzWgRVsW12TSoTKVbo2\nEztrt2IxmHlr8ASiJBIMROXa2xra380lPUFr38Z6Pnh3B+O+MF/58Zl56fbppNvYg7Egp0fPU+fw\nsHJOZySQW9v90y/OYzQKfOHDW/BU2LG2JHMbNJ7gYMZ2/cbAW8RjCaKReEFrQUomz1kbGzGYzXz8\n3WtY21LBiSujPPX8lQXXa7rGfmH8Mt6oj11127EY54/ntdP9/ObtHuqqHPzRBzdjMhqwtrUhRcLE\nRrSZY0wGE3sadhKIBTk7ekGnTb5PTtBqacVqkes3lTkt/ODlaxy9tPD4lBr9UQ2dqm4nSoId2Z76\n9uAJ7CYbWz2bZn0mSRLfe+Eqb5wbpL3ezR+8b2OqNrS1pQUxGCA+kbv5xVz2N8ia4JHB4/os4jkZ\np2tbK/nUe9cRjMT5hx+con8B4a5oJ0OBYTp93ayrWk2lbXb4Zjga58s/OsPAWID37Grh4WQSkqmq\nCoPTSaRXm6YKcvOFHbVbGQ9PcH3qplx72zm/9rYWIr09mKpmErQe3d/OPVsb6Rn28z9/fo5INLM5\nIt0Uc3T4FHExzr6GXfMiO/pH/Xz1x2eIxUU+976NqUggm5J5mYcZQnaiVnJy5CyTPjlRpxDBHh0a\nQopGU2vBZDTw+d/aQmudrLH+7FDmKozpzw0GoryZNItkMsO8fnaAp567gtNm4n//7S2pMFdbARuc\n8k4cHjiqi8Ye7p1dN6m63MYXPrwFq9nIvzxzMWtFTCj+FnklwQ5cmriKN+pjZ922WU5TUZT4wUvX\nOHiqn5Zal1z7Oa0LUioDNQ9tdXvtZmxGK28NHk/VAS/UFCPHLM/UqblzcwMff/cafIEo//D9k/SN\nzM/uA7C7LMSiCd7slU1DiqlIwReM8qUfnqZz0Medm+r5yP2rUgJPEASsLa3ERoZJhNT3I1XY23AH\nMLPBFTIHce8UCa93VsyyIAh84sE1bFtVw8WuSf6/H55KlV9Ix2I1YTAK+H0RDg8cxSAY2DPn1HKj\n38sX//0kvmCM333PWrantXSzNDSC0ZiXYDcIBvbU7ySaiHKm/zKgjzkqPVHNYTPxnz+yLdV16Iev\nXEPMoLkr8z/pnebixBVa3U00u2eHzx483c+Tv76Mw2bi//joduoqHanPlLkP5zEP9c5aVpZ3cHny\nGmNTXnk8eig7afPQ0VDGEx/Zislo4J9+cT6rcz1XeYXbHV0E+6FDh3jooYd48MEH+da3vqXHVy4p\niq17X1LIAATCMf76X9/mpRN9NNY4+dOPbpuXfKMkwITz0E4sRgs767YxFfEy4Z1esPZ2LhKRCNHB\nQawt87vkvGtnM598cC3TwRh///2TnL0xfyErNeBP9VzAaXKwxbMx9dnAWIC/+e5xbvT72Luxjk89\nvG5eeKcyD1GN9mWAVRUdeOzVnB68KNfeLmhzk58/t06O0WDgjz64ib0b67jR7+Pv//0k497ZxdgE\nQcDhtOD1Buj3D7K5ej1llplMxbM3xvh//+MUoUiC33tkPfdtnx3eKZhMWBubiPT1IiXU9SJNZ09y\n7V3ol7OSC5qHLPWCypwW/vSj22iodvD80V65xO2cE4wiSPvGhxElkb1pm3xCFPnF6zd56rkruB1m\n/uxjO2irn53NaU3mNuSzwQHsa5Sf1z0qO/4Lm4ceBLMZS33DrL+vbq7gCx/egtEg8LWfnuVXh7vm\n+R6cRR7yWLBgF0WRv/7rv+Y73/kOv/rVr3j22We5cUN9g9lbTSAW5NzoBeqddbS55UV5Y8DLX/3b\nMY5fGmZjRxX/5eM7MpYLSBU+ykNjB9ifXMTT06FUE+t8CPb0yl1yWjIn5Ny7vYnfe2Q9kViCr/z4\nLN9/8eqsRsj25CKOhBLsqt+O2WAiIYocPNXP3/6vmbKrv//ohlmVChUUAZKPI1kQBFlrj8jfq4um\nmqEAmslo4DOPbuCBO5rpHwvwF//6NgdP9c/SWh1Oi6yhSTMCJhiO893nLvOVH59FkuBPPrSZOzc3\nzPt+5blSLEZ0WJt9GaDGXsWaipWMTckRQos1DzXldv6vT+xkXWsFp66N8bf/6wRXe6dSnyuCdGzK\ni0kwckfdtuT/D/H3/36KX77ZRXWZjT/72A5aaufXIzK6XJiqqvMW7Ns9m7EYLYwm5yHfkE8pHic6\n0I+lqRnBOD/BaV1bJX/2sR1UuK387NBN/vt3j+JNc7Ar85CpzEQxkJ+KmMbZs2dpa2ujqUnWYB55\n5BFefvllVuboDH+7cGz4FHEpwd76ndwY8PGrw12phtEfeWAN79nRlNXmayqvwFhenpd9GeSQv3pH\nHVLEgLUy/58ikOzmtFBFxzs3N9BS6+Kbv7zASyf6OHVtjPt2NHHXlobUIjbFrGyr3s6JKyM8/UYn\nfaMBrBYjv//oBvZtyt4I2VqAfRnktPJXzsjOaz00VVuWeTAIAr/zrtU0e1z88JXrPPX8FY5cGOL+\nHc1sXVWN3WkGUaDcUEGdqY3nj/bw/NEepvxRmj1OHn94/YIJSNbWVjgsR6RYG9Vn/yrsbbiD586f\nBPKfB0mSiPT2YK7xYHQ4Ml7jtMlNsb//4lUOnh7gi/9+kl3rannPrhbaG9wYTQKJMGz2bGRiUuSn\np65w5PwQkViC3etr+eSDaxdMQLK2thI4fYq4dwo86uqzKNhMVnZ4tjB8TijIkR4dHJA7aC1QVmJF\nYxn/9VO7+Oenz/PW+SFOXh7hwLYmHtzdoqo2/e1MwYJ9eHiYhoYZDaauro5z584V+rVLxuHXbuK2\n1fLTX0SIhk4AsKa5nA/cvYK772hldHThxtHWllaC58+R8PtVV1RUEASBXVU76ZQgYgzm/W8I3OyS\nx5KjNkprnZu/+NQufn7oJgdP9/OTgzf4+aGbNDsEagFLqIIvfvs6kgQCcNeWBj50zwoqciSJWOrl\nAlD5OMwAKm0VtFrlscfN+dfnCPf2YLDbMdXUZL1GEATu2drI5hXVfO+FK5y6Nsa1Pi9mk4GV9ghu\nrIhDTfyXf5ZzGkxGgQ/e3cF797blbEg+43PpgT3ZSzFkY1vtZl6NyzZ2m4Yqn+nEp6ZITE9jX71m\nwetMRgOffGgd+zc38IOXrnHs8gjHLo9gtRhZb4hgilk5fzzB4SHZgVpdZuV337OG/Zvqc54srS2y\nYI/09sIq9WUdFPY27OTZ2GWwJPJ2pCvm0VzlqxXz1MkbE/zwxSu8eLyXF4/3Umkzsgq41jfIPopD\nSU2nYMGeb5ynR+NOvliU9zVisVThrK6mbUMZ79ndxuZVM4Ih1zgDa1cRPH8O2/QYFR2Zj+gLsT+4\nk05OMMl43nMyeLMTwWikactaDJbcmt7nP7qDx9+/mVeO9/DayT68kWvgr0OYqmJ9exWbV9Zw59ZG\nOhrV1X4BGGxvI9DVTXWFDYM5u1DK9m9cX76Wq/gYZgCPJ3Ps+EIkwmGuDg9TtnEDtbW50/o9Hjd/\n9bkauod8vHlmgDfODBCMD+CmkcRoLVtX13Dn1ib2b26gXGX2Y9yxnj5AGh7I+7f0mGqJAn7HOOs8\nC6+nTM+Y6L4KQNW61arG4PG42bOliWMXhzhxeYSzN4eJ+idwBMqxhl3csb6Sh/a2cceGeowqhaxh\n01omngHT+FDWcS5Edc0WXox3EbIFcFeYsZltuW+aw/SY/Oy6LesoU/H8h+vKeffuVl4+1suxi8Pc\nnOwkOi4QIXbbyCotFCzY6+vrGRiYyXAcHh6mtjZ3NblcmvBS4XY5KBMcfOKTM45TZWwejzvnOMUa\n+eUbOXeZWEO75uf7RuQ42QlxnDOd17KmbWdDEkUC3d2Y6xsY90YA9RrvvnW17F3r4YsHj8BwHfva\nV/LgYzPt77T8RoaGJqTrNxg4dzWrlrTQfNojZYCPs5PnGRrOXP99IUI3roMkYahv1DRuh1Hg3Tua\n2L3RzZd+fhTGG/n9d+1gzUY5SS0aijIaUn8cN9d4mL5xk5ERX14+E1vcSVgI8XLnm7Q5s5/Ass3l\n+DlZ449X1WmahxV1Lvl/66Z58RdhhEAl//UTu1MmoYnxzBFVmYiVy9FCE5ev0Yz2dz0WjSMkjMRM\nYV64eDjl79DC1JVrIAiEnFVEVDzf43EzNRlk56pqdq6q5t8vXeTwwDH+eNunbxtZBeo3yYKdp5s3\nb6anp4f+/n6i0SjPPvss73rXuwr92iXD6ZKTc/I9eaQch3nalxUbXswS4a2hzK3SFiI2MoIYDmsq\nS5pOr7+fgbhc7yYRzj/LrpAIIYBIQN7gvMIklyauar9/AYehGo4NnyJqliNlCnGYWVtaSUxPk/BO\n5b44A2JIQLLEOTt2nmBMe/io1m5BczkyeCxlDss3httUXYPBbi/4nYibI6mINS2k/Ax1dRhs2rX9\nSCLKyZEzVNrKWVe1OvcNtyEFC3aj0cif//mf8+lPf5pHH32URx55pGgcpyB73RMFZJjJHdqteduX\nlZfHZIWjQydJiNpC5RSBls1hmIu3Bk8gGuIYjIU5itK7SuVD+sv81tAJzfcXItglSeLI4HEkc2zW\nWPIhFcedx3qQJEmO5XdZiIlxToyc1vwdkd4ejC43psrsDS6yMRme4vLENdxuuTZOvvOQym0YHiYR\nztzjdyGUd6LM7eCGt5ORYPZEokzEx8cQQ6G834nTI+cIJyLsadiZtarn7Y4uo77nnnt4/vnneeGF\nF/jsZz+rx1cuGWo61C+EYDBgbW6RO7THtH+H8vKsbmhjOurn4sQVTfcXItBiYpzjQ6dwW1w4XbaC\nsuysTc0gCPlvcIEoJrOB2rIazo1eIBDT5kyO9PSA0Sg3QdFIl6+HocAwq+rlzamgeSigxEIkHEcU\nJWoqKhAQNGuriWSbQmtLa15moLcGTyAhsaJWbpBR8AYnSQS7ta8H5bntHvm31DoPhZ7elAYwe+vv\nyHHl7Utxbkc6okdYk7W1FUSRaL/6+t8KyrH/jhbZtq2kcatFrfc/E+fGLhKIB9lVvx2nq7CiRwab\nDXNdHZFe7bVzYKb29r6GO4hLCY4Pq9dWpUSCSF8v1sYmBJN2t5FSUXBfm1yeVw+NPZ/QT+W55W4H\nG6rX0u3rZcA/pPr+mYxT7WtBlETeGjyGxWBmbf0KoHCTFID/Zqfme5WNdVVdG3aTnbcHj2s6yabe\niTzMUWOhCa5O3ZAT5xboRXC7844X7Hp0S0lpaXmYIZTnrqxrodXdxIXxy0xFvKrvj/T2YPXUaA61\nhJkyxXc27sbutCBJEM6Qbq8WW2sbYihEbEzb0VkUJULBKA6XlV11OzAIBt5MK2Obi+jQEFIslteL\nHI6HOT5yhipbJRtqV2O1mQpaC6bKKowud14ae3oxOKWsw9z2gQtRiGC/PtWZKlNcWe6aNZ58UHwu\ngc4uzfcq819WZmdX3Ta80WlNJ9lCNPa3FW29QbvD9naiJNiz9PzUQiGOQ7n9lwmTycj+xt2pgmRq\niHu9JLxTODsy92ZdiLHQOJcnr7GyvJ16Z50uRY9mzBDa5iEciiFJ8m9RbnWzuWYD/f5BeqZzl1eV\nn9clP19D82qFkyNniSai7G24A4NgwOW2FiTYBUHA2tpKbHSURDB3Hfh00ovBba5Zj9Ps4O2hE6q1\n1ZlSAtrnQaluuq9hly6nWKV2TiAfjT2tAJgSEXNk4Jjq+yM9PRjLyzGVqw/XBbmD2uHBY1iNw/az\nyAAAIABJREFUFrZ7Nue+4TamJNh1qAlhaWzKu8FAeu3tO+q2YTaYOTxwdFYv0GwoJwRnR7vm5x5O\nvihK5T5dTi55OlDnNthQxvRG/1tZ70lH2VBteQi0wwPHEBBSdYJcZTbCwRgJFZ12sjErUUkDMxq7\nFZPBxO76HfhjAc6MXVB1f7inB8FiwVKvLWQ2FA9xauQcHns1qyo6sCeTowra4JK1c4Ld3Zpr56QL\n9hZXE02uBs6NX8IXzR12mPD7iU+M56WtX5y4wlTEy676HdhMhXVuutWUBLsOGrvBYsFS30Ckt1dT\ng4FU+6/kGOReq1sYC09wbTJ7aVWFcHdSsK9coWm8CTHBW4PHsJvsbE82ULiVGvvcssXrq1ZTZavk\n+PBpQvHcURWRnm4QhKy1crIxmFamuMomR5G43PILHQ7mb5KaqSFU2DwovQFe7zuS9R4FMRYjOjiA\ntblZdfNqhbcGTxATY+xv2I0gCBiNBmx287ym1lqxtrUhRqNEhwY13Rf0y450s8WIIAjsb1B/kk1F\nieVhlns9qUjcnaEnQ7FREuw6VXGztrbKDQZG1duXQ0nh4XDOZGoq2qrSwWchlKO3a4U2wX5+/BLe\n6DS767enyhTrobGbysowVlRoPrnMbTSS3gv0+PCphW6diVmu1R6zrPgY0rskKYK9kHlImeY0nlzm\ntoOrd9aypmIlV6duMBRYuLBYdKAfEgnNZhhJkni9/wgmwTgrEShTU2utKPMQ6dY+D+lF8XbXb8ds\nMPN6/5GcJ9l87eujgXEujl+ho6x1XpniYuQdL9hNJiMWa2EOM8jPgTpjgpg59q0ob6PeUcupkXN4\nIwsfPSM93Rhdbiw12rz3mRooOJNp84U2FrC1thGfnCQ+rb65daZGI/uUXqD9CztR42NjiMFgqtGF\nWsLxMEcGj1NucbOlZkPq704dBLu5tg7BastbY7enbfR3N8s1Z17PYZbK13F6ZfI6w8FRttduxW2Z\nccA7nBaikQRxFX1Ss2FtawcgnPSBqCEVy59WBM1hdrC7fjvj4UkujF9e8P5wlpLFuXj55htyb9em\n4tfWoSTYAXRpXGtTFnFXl+p7UppqmkATBIEDzXeSkBK80Z/9CJ4IBOSY5bY2TTHLI8HRlGbS5Jqp\nRTLjMCvw+J2HGSJTa8ByaxmbazbQ5x9I9QLNhCI0tEbEvDV0gnAizN1N+zAZZkIkXW7brDHlg2Aw\nYG1J5jZE1X9PwB9JOdIVttZspMzi5u2hE0QS2b8rX8fpoeQaO9A8u2iZLj6X5hbZ96RBY1cc6XPL\n9R5ovhOAg71vLnh/pLtbDr1VUdZEISEmePnmYewmOzuz9HYtNkqCHXkRh0M6Ocw0LOJsLfH2NOzE\nbrJzqP8IsURmW2+mLjlqeLVX1kzua7l71t9TDrMCN7h8en9mm4d7mmRh82rv61nvjeQRsyxKIq/1\nvYlJMHLXHA3NVVa4xg7JVnnJ3qNqCQWiqYQ5BaNBjpYKxcOcWCC2P9zTI/sZmptVP28yPMXZ0Qu0\nuBppL5ut4ephojRYrdibGjU1t86k7AA0uRpYVSF3VxoKjGS8VwyHiQ4NYm1t0+RnOD16Dm/Yx976\nnRl7uxYjJcGOPkX1jQ4H5to6wt1dquOvszWxthot3NW4B38skDVRJ9zdBYBNQ4ifPxbgyOBxqmyV\nbJvT29VoNGBzmAno4GsArSappAliTqnatZWraHY1cnLkLGOhiYz3ztRGUX/0vjRxjZHgGDvrts0y\nP0CaYC/UcagxQkh2pMczNpa4q3EPAgIH+97MuLYkUSTS24uloUFVdU+FN/rfQkLinub98059ejWa\ncK1ckWxunVkYz2WhXqeK1n4oy0k20tsjN5xJnp7VIEkSL/a8hoDAPc3aSy3frpQEO/o5UG1tbXJz\n67HMfRTnEsiiqQIcaN6PQTDwat8bGV/mGYHWrnp8b/S/TUyMcV/znRmrJzqdloJfZHONB4PDkYrY\nUUMwEMXuNGOYo2UJgsC7Wu9BQuKVLFp7uKcHU2UVJnfuUr0KB/veAODepKBIJ2WKKXiD09YPN7TA\nWqi0VbCzbiv9/kHOj1+a93lsZBgpEtZkhgnHw7ze/xYOkz3VJSkdvRpNOJOOfbV29oUE+9aajZRb\nynh78HjGaKl8lJ0rk9fpne5nT/N2ah2e3DcUCSXBjj72REhzFiUXWC5CSU3VmaHed6Wtgu2ezfT7\nB7k2Nb/VYKS7G4PdPqt59ULExDiv9b2JzWhjX2PmeucOl+wwixXgMBMEAVtbO7HhIRJBdfVeFmpi\nvbN2K5XWCo4MHMUfm53wIzevntKkrQ/4h7g4foUV5e20ls03WzidFgRBB5NUY5Pc3FqlSWohgQbw\nnrb7AHi+65V5G324S04CsmlIVDvUf4RAPMj9LXdnND/oEQYMssYO6k2UC82D0WDkQPN+wolIRlv7\njGBvVz2+F7pfBeD969+j+p5ioCTY0U+w2zQK9kAggsEgYLVlrm9yX8tdAPxmzssshsNEh4dkW6JK\nx+nx4dP4otNy+QBT5rBAvY7fyganRluNRePEogkcWZpZGA1G7mu5i6gY4/W+2ZEh+djXn+18AYD3\ntN2b8XPBIMz0Pi0AwWTC2tSsurl1LsHe5Gpgc80GOn098zZ6xWFva1Mn2COJKC/3HMJmtKXMG3PR\n6xSrJM+p3uCy2NgVDjTvx2ly8HLvoXlljSPd3QhWG+Y6dQla3b5erkxeZ23lKlZW5Vfm+HalJNjR\nJzkH0h2oXaquV7JOswnnjvI2NlSt5erkdS5PXEv9PdIrN69Wm4QRS8T4deeLmAQj97ZkfpFhZh4K\nFWqK5hjuzJ1OHgwosfzZbcPKZnSw7w3CaUfwlIamUmPv8fVxevQ87WWtbKpen/U6R4EF0RSsrW1y\nc+vB3MXhsvlb0nmw7X4Anu96ddbfw12dsuNU5Ty80f8W/liA+1ruxGG2Z7xGL43d5HTKvqcedb6n\nXPNgM9l4oPUAoXiIV5MmNQAxEiE6OICttVW14/TF7oPAzGloOVES7OinsRudTswejyoHaqZ43Uy8\nb+V7AXj6xq9TyRlhjbVRXus/zER4kgPNd6YyLDOhxNMXHPrZnhTs3SoE+5xyAhm/z2Tj/pa78ccC\nPN89I9RmTBDqErSeufk8AI+teHDBk47DaSURF4lG8jdJyePqmDXOhcgWGZROR3kraytXcXnyGlfH\n5MxkKZEg0tONpbEJgzV3Gnw0EePFnoNYjZZ5kVHpWG0mDEZBl2bO1tY2xECA+MR4zmuV9ZDJiaxw\nT/N+XGYnr/a+ntLatTpOu329nB49T6u7ibWVq1TdU0yUBDv6aewgmyHEQID4+MIO1Eg4jpiQcgr2\nFncjd9Rto9c/wMmRs/K93eodp/5YgOe6XsZhsvNQ+/0LXjtz/C4sIsRUVS1XOFQR069GoAE80HqA\nSmsFr/S+zlhoAkmSCHfexFhRgakid1OJ61OdXJy4wpqKlTm74ug1D6kNrjN3eYhcphiFhzveDcC/\nnvwhoiQSHRpEikZTz8rFq72vMx31c6D5TpxmR9brBEE2Sekh2BVnphqHeiAQxeYwY1ygcbjNZE1q\n7WFe6T2U/O6u5LPacz5DlER+dPVpJCQ+uOqRvGrX3+6UBDtgs5sRhMJty6Dezp7LlpjOYysexCgY\neebm88TFOOGebtXFnp7rfJlQPMx729+FY4EXGfQ7uQiCgLW9ndjYKAn/wr0y1ZggACxGCx9Y9TBx\nMc7Prz9LfHKShNerSlsXJZFfXP81AI+tfDDn9XqZIaxNzQgmkzqTlMr1sKqig931O7g52cOhviMz\np5b29pzPGA6O8uuul3CbXTzQeiDn9Ypg18MkBRBRc3LxR3HmWAsga+1ui4sXe15jKDCcMn+q0djf\nGjxOl6+HnbVbWbMMtXUoCXZgRjsp1LYMaY7DHNrJjKaa+/hcY6/m7qa9jIXGefbys0T7+7C1tee0\nJfb7BznUf4QaWxV3N+/P+Rw9Ty6KoMm5wanUVEGOkFlR3s7p0XN0npdjme0qBPvzXa/S6etmR+0W\nVpS357xeL1+DYDJhbWsn0tebMwM1FIgiCLKSkYsPrXoUp8XBMzefw3dD7g9rzeE4FSWRf7/0E+Ji\nnI+s/cCC2rqCw2VBTEhEwvm1jVRIKTs5NrhYNJF0pOdeC1ajhY+u+SBxMc5TF39EuKsLwWrNqewE\nY0GevvEbLEYLH1z1iOp/Q7FREuxJHAU2tVZIFYDKqbHnti2n89iKB6m113DhzKuy4zRH4a9ALMi3\nzn6XhJTgw2veh9mQu7OQXho7gK09Gb+cQ0tTa4oBeQP+8OrHEBA4d+ol+Tk5BHunt5tfd71IhbWc\nj679kJqhF9wuMR1be4ecgZqjMJrib1FjFnBbXHx8ywcJJyIMXz0jtwTMUdnyzYG3ueHtZKtnk+pa\n44rSESgwWcvocmGuqyfcdXPBDFTF9KVG2QHYVruZXXU76J/sITI4ILcEXEDZkSSJH1/7Jf5YgIfb\nH6DSVqHtH1JEFCTYn3vuOR599FHWr1/PhQvqakbfrjicFuJxkVi0MIeZ0eXCVFOT04G6UHJSJmwm\nG7+36XdpnJDHF2/OrpmIksi/XfgBY+EJHmq7n81pRa4WwmwxYjIb9NXYcwl2laYYhbayFt634iEq\nRmQTj9CcvRJfOB7m3y78AEmS+E8bPqpKS4UZwVKojR3SI4Sy29klSSLojy7oMJzL/Sv2s8rVimPE\nR6imbMGWgDemuvj59Wexm2z8b2s+oNqmrOcGZ1+xEjEUWrCEb0CDeVLhI2veR7vfgiBJRBqqFrz2\nmZvPc3ToJK3uplQo8XKlIMG+Zs0avv71r7NrV3G3kQL9Mu1A1lZFv3/BEr4zyUnqF3Gzu5GdIbmS\n43+E3s7YvT0hJvjZtV9xceIKG6rX8sgK9YkXejrMTBWVGMsrcjpQlSbWFqv6XqUPtNxDw6TERJmR\n73U9k7HD0Hhokq+c+iZj4Qne3XYvaypXqv5+XU8uyRPFQoI9GkkQj4ua1oJBMPCJqvswiXDdHeLZ\nzhczXnd54hpfP/0vxMQ4v7vutym3qs/Q1cskBaROmOGb2edB2UDU2NgVHGYHDxnWAfBi4irnxi5m\nvO5g75s83/0KHns1f7T192YVfluOFCTYV6xYQXt7e8Hmi9sBPe3L9lWyQyZ841rWawIabMsKkiTh\nGJwk6rJxnXH++7Gv8ubA28QSMSRJotPbzd8f/0de7XsDj72axzf8DgZB20+smKREsfDf1NbeTnxy\ngrh3Kus1akI+5xIfGcYUjROsr+T06Dn+7uiXOTt6QY6UiYc5N3aRvz/+VXqn+9nXsItHO7RlFerl\nPAW5hK/B4Vjw5JIyy6k0QSiYBuSNPVBXwW+6XuKpiz+k0ys3E58IT/JKzyG+ceZfEZH47OZPsq1W\nW7s3p1OfujkAthXyxhq+OT+LWkFLQEE65UNyiejBWgvfPPtdXuh+lfHQJJIk0Tc9wJMXvs9Prv0S\nt8XFn2z7zLz6QMuR5b1taSC1iHXQ0uwrZcEeun6dsn2ZE4JSha80CLX4xAQJr5eqHTt5fOPd/MeV\nn/H9yz/l+5d/ikEwpOLc72zczftXPpwzCiYTDqc11dRaq8Cdi629g8CZ04S7unBtnV+PRBQlQoEo\ndU3qtUiYMe9s2v4uhhoDHB44xjfPfReb0Uo4IQsho2Dkd9Z+iDsb92gOZzOaDHJTax0EuyAI2No7\nCF68QMLvz9h0PJDH6Q1InYYevPPjdE78hreHTvD20AncZhfTMdlUZTGY+YMtn8oZ4pkJXcOAm5oR\nzGbCndkFeyCPDU6SJELXr2EsL+f37v5j/vncv/H0jd/w9I3f4DQ5CMTlshaNznr+04aPUmPX1rug\nWMkp2B9//HHGMhS1euKJJ7j//oXjohfC43Hnfe9iUN8oCxdBmj22fMYpVmygz2Ih1n0z6/3RcByH\n00J9vfqGu2NXzwFQvXkDWzfdza6Ojfzowq+YCE4RSUSxGM18eOPDrPdof4kVqmuc3LwyitVsKvg3\nMm3byPjTP8cw1IvnATkZJv07/dMRJAkqq5yanjU9JJfCbb1jO19Ys5rf8j3ED889Q59vkFpnNR5H\nNfd27GNVdXte4/Z43JRV2Jn2hnVZp8GN6whevIB1apjKjoZ5nw/2eAGoayjT9LxY900MFgsb9uzm\nHw17OTt8iYOdR7gwcpXtDRvZ0bCZXU1bqXLk5yS0W+UInXhMLGgelHuHV6/Cd/kKVS4TRvv8jFcx\nLp8SW1orqax2qvruyOgoiakpqvbuYf2qjaxs/L95vfsoNya6uTnZTXtVM4+tfTfbGzbm3OBvN5lU\nCDkF+5NPPrkoDx4dzd2YdimJJ731I8PTqbF5PO68x2ltayd4/RpDPSMZF7HPG8JVZtP0/aOnzgOQ\nqGtO3mfmtzs+OG+chcytYJQXf3/fJEZLYUFTiepGEATGz5zH8eD0vHGODcv/bTQZNI158uIVMBoJ\nuqoJj05jxcUn1/zO7IvE/OZBGaPVZmJ0KMbgwBQm8/xKmFoQa5sAGD59gXjzfFv/0IAs2EVJUj3m\nSrtAsKcX+9p1jE/K2ZdNplY+vroV0vb1RABGA/mtB1GUEASYnAjkvabSf3NjcxtcvETf8XM41s0v\n6TAxLhd5C0diqp83ffQMAIaW9uQ9RvbX7GN/zewSvGNjC+dTFPKuLyVqNx/dwh2L3c7u1Cm0S8G2\nchUksyPnEo8liEYSmk0doZs3wGDQVL1OK3ral40OB9bmFsKdNxFj8xuGaAl1VJDicSK9PVhbWjGY\nc8d858tSOlDzsS37Ll8BScK+Kv/TWS4MSkG06cLnANLs7FnmIdVBSsNGGrpxHZgxf5aQKUiwv/TS\nSxw4cIAzZ87wuc99js985jN6jWvJ0VOgAakXLpxceOnkLdB6urE2NauqCZIvelX1U7CvXo0Ui2Us\njKY11BFk+7oUj2PX2MBbK3rOg6miAlNVNaEb1zPGcSvKRKbyzdnwXZCjP+yr1xQ8voXQqyAazETG\nhLI4UIP++R2kchG6cT2ZCLa8qjMWSkHO0wceeIAHHnhAr7HcUowmAza7WZfQLgDbSlk7CV2fHxmT\nj0CL9PUixWIprWex0H2DW72WqVdeJnTtKuzbMeuzlNPQrX4eglfkZsb2Net0GV829HQcAtjXrmX6\nyGGiA/1yL9A0gn456zS9iXUufJcugyBgX7nI68FlZXTITzQSx2or7IRkqqzCWFFB+OYNJEmaZfNO\nxEUi4Tg1deojVsRIhEhPN7aOFRjMy6OlnV6UMk/TcLosuoR2AZjcZZjr6uRFPEdLyycRQ9FycmWc\nFopTx9hlkDV2QBbsc8hHUw1dvSJ/75q1OowuO8qY9BLsjrXyRhRMjj+dgD+C3WGZ10EqG2Isiv/a\ndaytbRhsmcvu6oWe60EQBOwdK0l4vfMqPeZzig13dYIolswwGSgJ9jQcbqvcQShaWG0MBfuKVXK2\n3eDsbDslo1GTQFM01UW0qQLYHMkOQjpkXYKcqGT2eAhdn2+GCE5re5mleJzQtatYGpswlWkLkdSK\ncnIJ6DQP9qRgV35HBSXrVJNA60yao1Yv7lqAxTjBJTf6K7M3uFSoo1P9O6GYOW2LfGopRkqCPQ29\ntVVbMlEpNCdRSUvhK5CbFQcvX8JUVY25tk6XsWXDYBCwOy26vcgA9lVrEIMBgr19s/4e8EcwGAVV\nha9Arr8jRaPY1y6utg76m2LMNR5MlVWErlyZZa/OJ+s0nDTvLbZ9HcDp1i9JCcCxXi5vEbw0O0M0\nmEcsf8lxmp2SYE9D7+O3suDC1+YIdo2mmEhPD2IggGPDhiWpHe10yZUu9Yp0UgSQ7+Lslzngj+J0\nWVX/mxRtVzFrLCZ6a6qCIGBfu5aEf5rowExHpXyyToNXZbOWfdXiC/bUyUWnebA0NWN0uwlcujBr\nfWk1xUiiSOjGdUzV1arq8b/TKAn2NGZqY+ijnVgam+RFfPH87EWs0XmqaDeKtrPYOF3WlDNLD5Tj\nt+/ijBlCFCWC/ogmDW2pHKdAMuzOoFt0EMxsSKErl1J/05p1Koki4RvXsDU2YCpXn9yWL3qfXASD\nAce69SSmpoilFQTT+k5EeroR/f4leyeKjZJgTyNlitEpblcwGHBs3ETC6yXa15v6e9AvF74yW9TF\n6wYvyZUzHeuWSLAnj9+BaX02OHN9A0aXG9/FGYEWDkaRJPWaqhSPE7p+DUtD46Lb1xWcLqu+Jqk1\n8x2oWjX2aH8fYihE2frsPVv1xKljpUsFx/qNAATSzDFaywkEzstZ2M5N2urfvFMoCfY0UuVaddLY\nYWbhKQsRwO+PqDZBiLGoLNCampdEQwP9fQ2CIGBfs4bo2BjRoaFZ36021DHc3YUUiaSckEuBw2kh\nFNSnIBqAubYWU2XlLDu7Vo1dOb2VbVwawa6EYOql7EBmO7tWG3vg/DkQhNQmUWI2JcGeht4CDcCx\ncRMIQkqwJ+Ii4WAspRXnInzjBlI0uqRHTr01dgDnlq0A+M+ckr9bY6ijEua4FPZ1BYfLgiRBKKjn\nBreOxLQvFSml1d/iP30KBIHKnTtyX6wDBoMBu9Osq0nK7PHIkVKXLyEl5JLLWk6xiUCA8I3r2Fas\nxOhUV1PmnUZJsKeRqsmuo8ZucpdhbWsndP0aYjg0I9BUaqpLbV8H/TrnpOPcvFXe4M6clr97WqOm\nelk24yx2/Ho6etuXIS2e/bL8u2rZ4BJ+P6Hr17CtWImlYum6/zhdVgL+iK5lQxzrNyCGQqkG14FA\nRHUHqeClCyBJODdv0W08y42SYE/DaJS1Ez01dkiaYxIJgpcupR291WmqwUsXwGDAsQQhfgrKpqPn\nPJjKy3GvWU3o+jUSfr8mm2oiECB4+RLW1rYlM0dBWv0gHU8ujqRpzn/yhPzdGrJOA+fPgihmLIG8\nmDhcFuKxwruLzfrOpAkleOkCoigSCsRK9nUdKQn2OSyGdpJuZ1eEhBpTTCIYINzZKadML3KGYTqu\nRTDFAFTt3gWiSODc2Rmbqop58J8+CYkE7juWtlNXyiSl48nFXFWFbeUqQlcuE/f5CGrIOvWflk87\nzq3bdRuPGmYK5OnoSF6XPLlcukgoEEs+J/fpTZIkAufPYXS5sbaW6sNkoyTY5+BcBO3E1rECg8NB\n4MI5/Elh6VIj0E6evCVHTovVhNFk0NUkBVC1+w5AtrPPmCByv8z+48cAcO1cWsGu/EZ+nTc4985d\nIElMnzyhOutUiscJnj+L2ePB0pi9z+ti4FgkE6WtYwWhq1fwDcvlBdTMQ7S/j8TUFI6NmxZsXP1O\npzQzc1gM+7JgNOLYsJH42BjTQxOAOk3Vd+RNAMr27Mtxpb4IgiAnKekYCQFgb2nB7PEQPH+OgC+C\n2WLM2es0EQwQuHgBa0srlrrFzbqdy4wTWd95cN0hb3BTx0+qzjoNXrmMGA7j3Lp9SZLU0tGz92k6\n7n37QRQZPymH86oxTwbOlcwwaigJ9jnoHcuu4Eoen729Q7Oek43Y+BihK5exr1mL2ePRdSxqcLqt\nBANREon5ZWbzRRAEnFu3I4bDBLxBVRqa/9QpSCRwLbEZBtJ8DTpr7OaqamwrVjJ1U85tUGNbDiSj\niVzbltYMA/pnZCuU7doDRiMTV+X67K6yhedBkiR8bx0GoxHHpk26jmW5URLsc9C7NoaCa+cdGBxO\npsenEYTcx07fW0cAKNu7X9dxqEV5mUM6hrmBLJhEDISjkioNzX9CNsMstX0dwGQyYrObdDfFgLwe\nIkbZb5Jrk5dEEf/p0xgcjkUvApeJmdr0+s6D0e3GuXkLfp86v1P4+jWi/X24tu/E5F6aJLVipSTY\n57BYx06DxULZnXcRESzYzCzoLJMkCd+RNxFMpluiqcLiRMaAXJ0yXlkLgMO+cMxyIhggcOE81pYW\nLHX1uo5DLU63VXeNHeSNShHsuTT2wNkzxCfGcW3fiWBa+v7zi3WKBSjbt5+ISW66nsvvNHXwFQAq\n7r1P93EsN0qCfQ56t8hLp/yee4kYnViiC/dfjHR1EhsawrV9B0aHQ/dxqGExQv0ABJMJy54DAJgm\nBhe8dvrYMdkMs8RO03ScbiuxaIJoRJ+6OQrm6hrE2mYArGSfY0mSmPj1rwCofM9Duo5BLQ6XFUEA\n/3RY9+92btlGxCr38XQ4sod8xqd9+E8cx9LQuKTZx8VKSbDPYTGSUhSkihpEgxGzf4LIQH/W6xSn\nqXvfrTHDwOKE+ikIa2THl3TzEmI08zyLkQgTv3oawWymbP9duo9BLYsV+gkgJRtbx46+kfWa0LWr\nhG/ewLltO9amJt3HoAaDQcDhshLw6T8HBrOZmKMKczxE5NrlrNf53ngdKR6n/MB9S+48LkZKgn0O\n9mSjCb1NEEDKlmiNB/AefDXjNdGhQbyvH8JYXoFzw61zEC3m8Tuc/EqzfwLf4cxCbfKlF4hPTlL5\n7gcxV1XpPga1KCeXxbCzx9zVAMRPHSHS25PxmolfPwtA1Xsf0f35WnCVWQn49auboyBJEiEs2OIB\nxn72E6T4/JORJIp4XzuIYLFQtv/WKTvFREGC/R/+4R9473vfy/vf/34+//nP4/cvbGIoBpTO7Ho7\nT2FG+7WbJbxvHCLS2zvrc0kUGXryO0ixGLUf+/gtsacqLKbGrmyaNqJMPv+bVL0QhbjPx+RvnsXo\nclP50MO6P18Li1E3R8E/HUEQwBIPMfqTH837PNLbQ/D8Wexr1t7yZhIutxVRlHR3pkfCcRIJCWeZ\njUh3FxPP/XreNVMvv0hsbBT37r0YHaXaMGooSLDfddddPPvsszz99NO0tbXxzW9+U69x3VIcLquu\njSYUFOHg2b0NKRql/2tfJj41lfp88sXnCd+4jnvXbjmJ5RaSciIvgkBTNovqbZuIjY4y+sMfzGqb\nN/7M04jhMFXve/8t8zEoLKZgD/giuMpsODdsIHjh/KwKoHHvFEPffRK49do6LF6yljJpmH0fAAAa\nGklEQVSvVWtXYKyoYPyZp4mklbj2nznN6I/+A2N5OdXve7+uz17OFCTY9+/fn4ru2LZtG0PJkqzF\njtNlIREXCQVjun6vsohrNq+j5kMfJj4xQf/XvkLg/Dkmnv8N47/4GUa3m9qPfULX5+aDEuq3GCYp\nZR4aH3svloZGpl55iYGvfYXg1Sv0f+0reF99GXNdHRX33Kv7s7WSEmg6z0MiIRLwR3G5rdR8+CMg\nCAz809cY/fF/ELx0kZ6//SsiXZ2U7b8zVV/mVuJMxpj7dbazKxuFu8pJ3Sc/BYkEg//yTbyHXmP6\nxDEGv/UNBLOZpj/5Auaqal2fvZzR7az/k5/8hEceufWahR64ymwA+KZCGC36uSHSC4BVvPcRosPD\n+N58nf6vfEm+QBCo/cSnMLrduj2zEBwuK36f/pEQQX8Um92EraaKlv/z/2Hwm/9E4NxZAufOAnIr\nvdqPfeKWmqIUUmGfOgs0xTnvKrNia22j/vd+n7Gf/pjJ559j8vnnAKj+4G9R9fCjt4WzcLGcyOm1\nk1ybtlF+zwG8h15j+KknU9c0/OGfYOtYoetzlzs535zHH3+csbGxeX9/4oknuP/++wH4xje+gdls\n5rHHHlP9YI/n9hBemahvLOP8yX68UyHWbtQvfjoWkW3JbR3VWG1map74Y3qb5DR5Z1srrlUrsdXn\n97zFmM/KagcTowHKy+w5U//V4vG4CQailFfak2N2U/fXf0H3975P4GYnTR98P+Vbt9xSYZY+l5Ik\nYbYYiYRius5xKOmU9tSV4fG48Tz2IB0P3sfwiy8x+tobNH3wfVTv26t6nItNJCg7NRNxUfNzF7pe\nTMjmzqaWSjweNzX/+fP4H32IYE8Pwd4+XCtX4jlwd/4D12mcxUbOt/XJJ59c8POf//znvPbaazz1\n1FOaHjw6Oq3p+qVEMMpCxTcZ0nWck+MBzBYjvukwJGOCHe95FAAJmAam83iex+NelPlUmh50dY5T\nWV24rdvjcTPQP0UkHMdqM80as/PhD+AEYsDY2K1zwmeaS4fLwtSUvmuht2cSAKNZmPW9pt1307D7\nbkQWfkcW6zfPRizp4B4dntb03FzjHB2SP4snEjPXVTVgqGrAtW2PfM0S/DuXej7zRe3mU5Cd4dCh\nQ3z729/mG9/4BhaL+qbEtzvKsdM7FdL1ewMamzffapyL0CpwOmnaUcxdxYDLbSUcjJGI61c3J6Ch\nyuftgMNpwWAQdDfF+DWUsS6hnoLO13/zN39DLBbj05/+NABbt27lL//yL/UY1y1FKUbkndRPsMfj\nCcKhONW1Lt2+c7FZjIgQxWbvLi8ewZ6ejVxWoU9dfH9qgysOgSYnKVkWJSrGZjdhNqtr7F5CHQUJ\n9hdeeEGvcdxWKCnUemrsWhpL3C4ojkM9X+Zpb1JTLRKBBmkRIdN6CnZlHopng3O5rQwP+BBFCYNB\nHx+IPKfFMwfFQinzNAMGg4DLbcWno8ZejEdOd1Lo6Bnipphi3MUk0Bahbo5/OoLJbMBqu/WRP2px\nlVmRJHRrbB2NxIlFE0VjjiomSoI9C64yG9O+sG71yFM2VZV9HW8HFG1y2qtfyGNRmmIWySTlcltv\ni1BGtSjzoFcIbDEqO8VCSbBnwVWe1E50SkyZidctHuep1WbCYjWltGw9mPZGVNWjv53Q2yQVi8n+\nlmIywwC43PJ49drgis2BXEyUBHsWUtqqTkJN0XqLSVMFKCu3Me0N61Zewe8L43RbMRqLZ+nNJOfo\nu8kXm0Bz6Zx9qrbBRgntFM/btcS4dV7ExSrYXeVW4jGRcKjw8gpiQiQwHSk6TdWuc6hfSqAVkQMZ\n0kwxemvsRTYPxUBJsGfBlXIc6qOx+7xhLFYjVlv2ZgK3I8pGpMcG5/OGkaSZTbNYUJp769Vowl+E\nDmSYEcC6bXAlG/uiURLsWdDz2ClJEtPeMGXl+oTKLSXuVN2cwoWaEj7qKrJTC8gbXGA6qkuS0kyo\nY3EJNLtDPrnodYpN+Z2KKKCgWCgJ9iwojiI9NPZwKEY8JhadGQbSNXYdBHsyfLTYNHYgFb+uh8/F\nX6Q2doNB55PLdASL1ahbHaISM5QEexasNhNWm4lpHbSTYrWvw8yY9Qh59E4GgeJKylFwV+h3cim2\nrNN0nGU2gv4ooljYyUWSJDnkswjXQjFQEuwLUF5h10VTTQn2Isyw01ewh2Z9ZzFRVj5TyrlQ/NMR\nrDYTZkvxaaoutz5hwJFwnGgkUco6XSRKgn0ByirtRCMJIuHCOtQrWl4xCjRZABl1MUEUsynGrZhi\nCtzgZE01UnRmGAW9fE/KWtCrREOJ2ZQE+wKUJxddoTZFRRiUFaFgFwQBV5lVN429WDXVMp1MMak0\n+iLc3CDNmV7gelBOPiWNfXEoCfYFKK9MCvYCtZNitrGDvCEVenKRJAnvVKho58DhtGA0GZj2FmaK\nmYlhL855KEu+E4XWUVI2yJLGvjiUBPsCpDT2As0QPm84lZ5fjLh0iIwJh2JFrakKgoC73Fawxp4K\ndSxSU0x5pbwWCq18OqOxlwT7YlAS7AugaCeFRMYoMezFqqmCPsdvRaAVW1JOOmXlNiLheEEnF8W2\nrJwGiw1XmQ1B0FFjL+L34namJNgXQA+NPZTsvFPMtsRULHsBgl0xRxVzeJvyGxZijil2wW40GnCX\n23TR2F1lVoymkghaDEqzugDu8qR2UsDxu9jt66BPyGOqDnt5cZogANzJzOFC1oMSy1+sgh3ksYcC\nMaKR/E4uibiI3xcpaeuLSEmwL4DRaKCswo53In/tRLElLgvBXsDJxZ/snFTM86BHZIx3MoTdaS5a\nfwvM2MXznQdlHZXs64tHSbDnoKLKTjgUy7u64XLQ2O0OczIiJH9fQzE2sZ7LzMklv40+kRCZ9oYp\nr3ToOawlRzlt5NsTuBTquPgUpDZ89atf5eWXX8ZgMFBdXc0Xv/hFPB6PXmO7LSivcsCNCbyTIWx2\n7ZUZZ2LYi1c7EQQBd4Gx7FMTQSxWE3ZHcVW3TCelqeY5D9PJ6pbFbIaBdI09T8E+mXwninwebmcK\n0tg/85nP8Mtf/pJf/OIX3HvvvXz961/Xa1y3DRVV8uKbGg/mdf+Mxl68tmWQtVUlZFEroijhnQxR\nU+sqqlZwc0nVD8rTBKGY9IpesCshjwVr7MU9D7czBQl2p9OZ+u9QKITBsPwsOxVV8rF5ajI/we7z\nhrHZzUWZbZmOklKfz8s87Q0jJiSqa525L77NcZfbknXltXeUUtaQoiwUKwVr7KnkpJIpZrEoWNp8\n+ctf5umnn8btdvPUU0/pMabbivKkYM/HgSpJEn5vmOpal97DWnIqq+V5mBwPUFOn7d+jnHZqlsE8\nlFXYGBv2EwxENdcR9y2T+ihmsxGny5J3LLtvKoTZYszLtFlCHTkF++OPP87Y2Ni8vz/xxBPcf//9\nPPHEEzzxxBN861vf4nvf+x6f//znVT3Y43FrH+0toL2jGrPFiN8X0TxmnzdEIiFRU+ta9H/vYn9/\n+4oa3uQ60VBC87OuXxgBWJJ50IOFxljXUM7NK2MYMWj+twT9sgN+5eparLbCT3C3ci6ra130dE5Q\nWenAZDIueG36OCVJwucNU1XjpLa2bLGHqYliWJtqybm6nnzySVVf9Oijj/IHf/AHqgX76Oi0qutu\nJR6Pm7ExP+UVdsZH/YyM+DTZiHs7JwCwuyyL+u/1eNyLPp8Gs/zv7uuZ1Pysvp5JAKprF3+chZJr\nLk0W2dzY0zWOzaVN4xwdnsbhtOCbDkGB07AUv/lCOJwWkODm9bHUaS4Tc8cZDESJRRM4Fvmd0Mqt\nnk+1qN18CjKKd3d3p/775ZdfZsWKFYV83W1LeZWdeEzU3OtxYjQAQLWn+G3LTpcFs8XI5HhA871T\n40EEAapqijvMD/KPZU8kRPy+cNE7ThXyLQZWcpwuDQWdB7/0pS/R2dmJwWCgsbGR//bf/pte47qt\nSDlQJ0Ka4rAnxmQhWFlT/IJdEAQqaxyMDfkRRVGTo3xyIoi73JbzyF4MKGtB6wbnm1oeoY4KqVh2\njQ7UkuN0aShIsP/jP/6jXuO4rSlXQh4ngjS3V6q+b2IsgMEgLJuXubLaycjANN7J8ILH73TCoRjh\nYIy6huVhv3SX2zBbjIyPaBPsqVICRR4Ro1Be0thva5ZffOIiUJFHZIwkSUyOBamodmA0Lo9pTkXG\njKkXalMTyRA/lRvB7Y4gCFTXOpmaCBKPq4/pXy4x7AqKxq1VY1fWTrGHfN7uLA+Js8ikkpQ0xLL7\nfRFi0QRVy8AMo1BZo5gh1M+DEuqobI7LgSqPC0mCyTH186AIwOUi2K02M1abSXNew9iwH4vVVNQl\nNoqBkmBXgdVmxuYwa9LYFcfpcnAYKlRWy5uUFvuysgksF40dZpzhym+shuWmsYN8gvNNhojH1J1c\nYtEEUxMhauqKOwO5GCgJdpVUVNnxTYVIJERV1yuO06plEBGj4C63YTQZNGmqisau1iZfDCgJZ+Oj\nftX3eCdDOFyWos9ATsdT70aSYFzlBqfM13JIVLvdKQl2lVRUOpAk9WFuyykiRsFgEKiosjM1HlSd\nUj81EcRqMy2rLEPFvKbWgRoJx5n2qnc4Fws19bJDfHRIXfz32LAs2Ks1Zi6X0E5JsKskPTJGDROj\nAYwmw7Lz/lfWOInHRVWVHhMJEd9UmIpqx7I6elttJtxlVtWmGEXw1TbcXpmWheKplwW0WsE+PlLS\n2JeKkmBXiWJSGVOxiEVRYmo8SGW1A4Nh+Qg0SK8Zk3uD802FEEWJymXkOFWoqnURDEQJBaM5rx0Z\n9AFQu0xCPhUqqx2YTAZNGrvBIKSc8CUWj5JgV0ldo6xtDfX7cl477Q0Rj4vLKiJGIeVAVWFnV65Z\nTo5TBcWBqsYcMzwgrxllDS0XDAYD1XUuJsdyh36Kosj4aICqGueyCf+9nSnNsErsDgsVVXaGB3yI\n4sL25YlRWaAtJ8epwkzIo3qBphzZlxNqHaiSJDEyMI3TbcHpLu6a/Jnw1LkRRSnnBuedCJGIiyX7\n+hJREuwaqG8qJxZN5EzQmXGcLj9NtbzSjsEgpOylCzHQO4XBIFDXWL4EI1taUiGPOQRaYDpCMBBd\ndvZ1BbV29jHFvl4S7EtCSbBroK5ZMcd4F7wuFeq4DE0xRqOB2sYyxob9RMLZ+8DGonHGhvx46t2Y\nLcVfI2Yu5VV2jEYhZ6jfyKDiOF1e9nUFj8rIGCUipuQ4XRpKgl0D9U2y5jnUl93OLkkSw33eZZ1d\n19xWgSTBQM9U1muG+mWTVUPL8tPWQbYvV9Y4mRgLLGiaW672dYXKGtmBOja08AkuFepYEuxLQkmw\na6Cy2oHFalpQY58cDzLti9C6onJZhfiloxRC6+uazHrNYK88R40tFUsypltBtcdJIi4u6G9QNHZF\ns11uGAwGqmtdTIwFsjpQJUlibMSPu9ymS4ORErkpCXYNCIJAfVMZvqkwwUDmMLeeG+MAtKyoXsqh\nLSm1jWWYzAb6urNr7AO98mf1zctTYwdobJM3uO7r4xk/F0WJ0aFpKmtkhWC54ql3IYpS1rj+oD9K\nOBgr2deXkJJg10h9k3ykHs4S9thzU+6a1LqiasnGtNQYjQYaWyuYGg/iz9B8JB5PMDLgo6bOtaw1\ntPZV1QgCdF6d3zoS5HIKsWhi2TpOFXLZ2ZV3Yrmao25HSoJdI3WKnT2DOSYaiTPY68VT75Jbhy1j\nmpPaan8Gc8zIwDSJxPK1ryvY7GYaWysYGZzOuMEp9vXl6jhV8CT/fdlMc9cuDgOwan3tko3pnU5J\nsGukrtGNIGROVOrrmkQUJVqXsRlGoSkp2Pu657/Mg0kzzHK2ryt0rKkBoCuD1t6f7PW63DXVqhon\nVR4nXdfG55kofd4Q/d1T1DeXL9tggtuRkmDXiNliorrWxeigj3BodrhfygyzcvmaYRSqa53YHGb6\nuybnFQQbSDpOl7vGDtCxWhbsN6+Ozvq7fzrCjUujVFTZl71tWRAENmxrQBQlrpwbmvXZhdMDAKze\nUNLWlxJdBPt3vvMd1q1bx9RUdmfacmLNxjoSCYmTR2aaeUuSRM/NcWx207K3qYL8Mje3VRDwR2cV\nRgsGogz1e6msdmB3LG9zFICrzEZtg5uBnqlZG/25432IosTWPS3LNjoqnTUb6zCaDFw6Mzhroz9/\nsh+DQWDlOs8tHN07j4IF+9DQEIcPH6axsVGP8RQFm3Y04S6zcu5Ef6rK4cRogMB0lJaOqmVX+Csb\nTcmwx7PH+1N/e/2Fa8RjIhu3v3PWQ8eaGiRpJjomEo5z8fQADqeFNRvrbvHolgarzcyqdR68k7Lp\nBeTQ38E+Ly0dle+ITf52omDB/nd/93f82Z/9mR5jKRqMJgO77ulATEgcfb0T72SIF56+CEB78mj+\nTmD1+lqqPE4unhrg3Ik+blwe4eaVUeqby9m0s+lWD2/J6Fgja6PnT8kb/cUzA0QjCTbf0YTJtPyy\nbrOxYZu8mV86M5A0ywwCsGrDO2Nzu50oKBbtlVdeoaGhgbVr1+o1nqJhzca6/7+9u4tpMkvjAP6v\ntIDDOKaK06DD6CwOG4gFRhPdgURtbeSjVlFRboymDUZvrCB+hKJGA8aAqJekxAjRZDTK2myI0Wym\nWiEIIsYFN6Q6bHAcjAVRMhSj9OvZC9dO2NJqzOgp5fndnSYn+acfT09P3/c56Or4DY/+PYBfe19g\n7I0H6Uu/mVI/OWXRUuQVKPH3c/fQ+nMvZNFSREmnQZX31ymx/fCOfPYXSPxOjt/6hvGT+Q6ipNMg\ni46aUr9aAEAx7yvI47/Af+zP8fiXFng8Psiio/Dd95F/MUG4eW9h1+v1GBoK/Me/uLgYZrMZZ8+e\n9T/2oafqRAKJRIK/rfwLrl56ALfLixU5yf4Vy1QyY2Yscjcq8Y+f/gXXmAc/qpIi6uDqD5W3KQ29\nPQPobP0Vvw+/RvrSRMTERs6pUR9CIpFg8Y/z0fLPX/DVzFjMmhOHH5Z+G1HHAU4WEvrIavzo0SPo\n9XrExsa+7Y8yMACFQoHLly9j9mz+hmaMMVE+urD/P7VaDYvFgpkzI/8SN8YYC2d/2nXsEolkSm3F\nMMZYuPrTVuyMMcbCA995yhhjEYYLO2OMRRgu7IwxFmGEFXa73Y7CwkLk5+ejoKAADx48EBXlvc6f\nP4+cnBzodDrU1NSIjhNUuPfsqa6uRm5uLtatW4ddu3ZhdPT9B2J/Ts3NzcjJyUF2djbq6upEx5mQ\nw+HA1q1bkZeXB51Oh3PnzomOFJTP58P69euxc+dO0VGCcjqdMBqNyM3NhVarRVdXl+hIE2poaMCa\nNWug0+lQWloKl2vig378SBCDwUAtLS1ERGSz2WjLli2iooTU3t5Oer2e3G43ERG9ePFCcKKJPXv2\njAwGA6lUKhoeHhYdZ0Ktra3k9XqJiOjEiRNUU1MjONEfvF4vaTQa6u/vJ5fLRWvXrqXe3l7RsQIM\nDg5ST08PERGNjo7S6tWrwzInEVF9fT2VlpbSjh07REcJ6sCBA9TY2EhERG63m5xOp+BEgRwOB6nV\nahobGyMiot27d5PFYgk5R9iKXSKRwOl8e+KK0+mEQhGe/SQuXLiA7du3Qyp9e/fcrFnh2ZJ3MvTs\nyczMxLRpb99yGRkZcDgc75nx+XR3d2P+/PmYN28eZDIZtFotrFar6FgB5syZg5SUFABAXFwckpKS\nMDg4KDhVIIfDgVu3bmHTpk2iowQ1OjqKzs5ObNy4EQAglUrx5Zfh2WLZ5/Ph9evX8Hg8ePPmDb7+\nOnQbZGH3+paVlaGoqAhVVVUgIly8eFFUlJAeP36Mzs5OnD59GjExMdi/fz+USqXoWONMxp49jY2N\n0Gq1omP4DQwMICEhwT9WKBRhvT0IAP39/bDb7UhLSxMdJcC7hca7xVs46u/vh1wuR1lZGex2OxYt\nWoTy8nLExobXgSAKhQJ6vR4rV67E9OnTkZWVhczMzJBzPmlhD9ZnpqSkBLdv30Z5eTk0Gg2uX78O\nk8mE+vr6TxknqFD9cLxeL0ZGRnDp0iV0d3ejuLhYyEpusvTsCfWaq9VqAEBtbS1kMhl0Ot3njheU\nyOfsY7x69QpGoxEmkwlxcXGi44xjs9kQHx+PlJQU3LlzR3ScoDweD3p6enD48GEolUocO3YMdXV1\nMBqNoqONMzIyAqvVips3b2LGjBkwGo1oamoK/fn55BtEQSxZsmTcePHixYKShFZUVEQdHR3+sUaj\noZcvXwpMNN7Dhw8pMzOT1Go1qVQqSk1NJZVKRUNDQ6KjTejKlStUWFjo3y8MF/fv3yeDweAfm81m\nMpvNAhMF53a7yWAwUENDg+goEzp58iStWLGC1Go1ZWVlUUZGBu3bt090rADPnz8ntVrtH9+9ezcs\n/w+4du0alZeX+8cWi4WOHj0aco6wPXaFQoGOjg4AQFtbGxYsWCAqSkgajQZtbW0AgL6+Png8Hsjl\ncsGp/pCcnIzW1lZYrVbcuHEDCoUCFoslLBuxNTc348yZM6itrUV0dHgdvKBUKvHkyRM8ffoULpcL\nV69exapVq0THmpDJZMLChQuxbds20VEmtGfPHthsNlitVpw6dQrLli1DdXW16FgB4uPjkZCQgL6+\nPgBAe3s7kpKSBKcKNHfuXHR1dWFsbAxE9EE5he2xV1RUoLKyEj6fDzExMaioqBAVJaQNGzbAZDJB\np9NBJpOhqqpKdKSQwrlnT2VlJdxuNwwGAwAgPT0dR44cERvqf6KionDo0CEYDAYQEQoKCsLyQ37v\n3j00NTUhOTkZ+fn5kEgkKCkpwfLly0VHm5QOHjyIvXv3wuPxIDExEcePHxcdKUBaWhqys7ORn58P\nqVSK1NRUbN68OeQc7hXDGGMRhu88ZYyxCMOFnTHGIgwXdsYYizBc2BljLMJwYWeMsQjDhZ0xxiIM\nF3bGGIswXNgZYyzC/Be68EGj7hfMcwAAAABJRU5ErkJggg==\n",
-            "text/plain": [
-              "\u003cmatplotlib.figure.Figure at 0x7f385e198650\u003e"
-            ]
-          },
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "display_data"
-        }
-      ],
-      "source": [
-        "def f(x):\n",
-        "  return tf.square(tf.sin(x))\n",
-        "\n",
-        "def grad(f):\n",
-        "  return lambda x: tfe.gradients_function(f)(x)[0]\n",
-        "\n",
-        "x = tf.lin_space(-2*pi, 2*pi, 100)  # 100 points between -2π and +2π\n",
-        "\n",
-        "import matplotlib.pyplot as plt\n",
-        "\n",
-        "plt.plot(x, f(x), label=\"f\")\n",
-        "plt.plot(x, grad(f)(x), label=\"first derivative\")\n",
-        "plt.plot(x, grad(grad(f))(x), label=\"second derivative\")\n",
-        "plt.plot(x, grad(grad(grad(f)))(x), label=\"third derivative\")\n",
-        "plt.legend()\n",
-        "plt.show()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "-39gouo7mtgu"
-      },
-      "source": [
-        "## Gradient tapes\n",
-        "\n",
-        "Every differentiable TensorFlow operation has an associated gradient function. For example, the gradient function of `tf.square(x)` would be a function that returns `2.0 * x`.  To compute the gradient of a user-defined function (like `f(x)` in the example above), TensorFlow first \"records\" all the operations applied to compute the output of the function. We call this record a \"tape\". It then uses that tape and the gradients functions associated with each primitive operation to compute the gradients of the user-defined function using [reverse mode differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation).\n",
-        "\n",
-        "Since operations are recorded as they are executed, Python control flow (using `if`s and `while`s for example) is naturally handled:\n",
-        "\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "MH0UfjympWf7"
-      },
-      "outputs": [],
-      "source": [
-        "def f(x, y):\n",
-        "  output = 1\n",
-        "  for i in range(y):\n",
-        "    output = tf.multiply(output, x)\n",
-        "  return output\n",
-        "\n",
-        "def g(x, y):\n",
-        "  # Return the gradient of `f` with respect to it's first parameter\n",
-        "  return tfe.gradients_function(f)(x, y)[0]\n",
-        "\n",
-        "assert f(3.0, 2).numpy() == 9.0   # f(x, 2) is essentially x * x\n",
-        "assert g(3.0, 2).numpy() == 6.0   # And its gradient will be 2 * x\n",
-        "assert f(4.0, 3).numpy() == 64.0  # f(x, 3) is essentially x * x * x\n",
-        "assert g(4.0, 3).numpy() == 48.0  # And its gradient will be 3 * x * x"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "aNmR5-jhpX2t"
-      },
-      "source": [
-        "At times it may be inconvenient to encapsulate computation of interest into a function. For example, if you want the gradient of the output with respect to intermediate values computed in the function. In such cases, the slightly more verbose but explicit [tf.GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape) context is useful. All computation inside the context of a `tf.GradientTape` is \"recorded\".\n",
-        "\n",
-        "For example:"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "bAFeIE8EuVIq"
-      },
-      "outputs": [],
-      "source": [
-        "x = tf.ones((2, 2))\n",
-        "  \n",
-        "# TODO(b/78880779): Remove the 'persistent=True' argument and use\n",
-        "# a single t.gradient() call when the bug is resolved.\n",
-        "with tf.GradientTape(persistent=True) as t:\n",
-        "  # TODO(ashankar): Explain with \"watch\" argument better?\n",
-        "  t.watch(x)\n",
-        "  y = tf.reduce_sum(x)\n",
-        "  z = tf.multiply(y, y)\n",
-        "\n",
-        "# Use the same tape to compute the derivative of z with respect to the\n",
-        "# intermediate value y.\n",
-        "dz_dy = t.gradient(z, y)\n",
-        "assert dz_dy.numpy() == 8.0\n",
-        "\n",
-        "# Derivative of z with respect to the original input tensor x\n",
-        "dz_dx = t.gradient(z, x)\n",
-        "for i in [0, 1]:\n",
-        "  for j in [0, 1]:\n",
-        "    assert dz_dx[i][j].numpy() == 8.0"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "DK05KXrAAld3"
-      },
-      "source": [
-        "### Higher-order gradients\n",
-        "\n",
-        "Operations inside of the `GradientTape` context manager are recorded for automatic differentiation. If gradients are computed in that context, then the gradient computation is recorded as well. As a result, the exact same API works for higher-order gradients as well. For example:"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "cPQgthZ7ugRJ"
-      },
-      "outputs": [],
-      "source": [
-        "# TODO(ashankar): Should we use the persistent tape here instead? Follow up on Tom and Alex's discussion\n",
-        "\n",
-        "x = tf.constant(1.0)  # Convert the Python 1.0 to a Tensor object\n",
-        "\n",
-        "with tf.GradientTape() as t:\n",
-        "  with tf.GradientTape() as t2:\n",
-        "    t2.watch(x)\n",
-        "    y = x * x * x\n",
-        "  # Compute the gradient inside the 't' context manager\n",
-        "  # which means the gradient computation is differentiable as well.\n",
-        "  dy_dx = t2.gradient(y, x)\n",
-        "d2y_dx2 = t.gradient(dy_dx, x)\n",
-        "\n",
-        "assert dy_dx.numpy() == 3.0\n",
-        "assert d2y_dx2.numpy() == 6.0"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "4U1KKzUpNl58"
-      },
-      "source": [
-        "## Next Steps\n",
-        "\n",
-        "In this tutorial we covered gradient computation in TensorFlow. With that we have enough of the primitives required to build an train neural networks, which we will cover in the [next tutorial](https://github.com/tensorflow/models/tree/master/official/contrib/eager/python/examples/notebooks/3_neural_networks.ipynb)."
-      ]
-    }
-  ],
-  "metadata": {
-    "colab": {
-      "collapsed_sections": [],
-      "default_view": {},
-      "name": "Automatic Differentiation",
-      "provenance": [],
-      "version": "0.3.2",
-      "views": {}
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
-}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb
deleted file mode 100644
index d268cbcd91..0000000000
--- a/tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb
+++ /dev/null
@@ -1,209 +0,0 @@
-{
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "U9i2Dsh-ziXr"
-      },
-      "source": [
-        "# Eager Execution Tutorial: Importing Data\n",
-        "\n",
-        "This notebook demonstrates the use of the [`tf.data.Dataset` API](https://www.tensorflow.org/guide/datasets) to build pipelines to feed data to your program. It covers:\n",
-        "\n",
-        "* Creating a `Dataset`.\n",
-        "* Iteration over a `Dataset` with eager execution enabled.\n",
-        "\n",
-        "We recommend using the `Dataset`s API for building performant, complex input pipelines from simple, re-usable pieces that will feed your model's training or evaluation loops.\n",
-        "\n",
-        "If you're familiar with TensorFlow graphs, the API for constructing the `Dataset` object remains exactly the same when eager execution is enabled, but the process of iterating over elements of the dataset is slightly simpler.\n",
-        "You can use Python iteration over the `tf.data.Dataset` object and do not need to explicitly create an `tf.data.Iterator` object.\n",
-        "As a result, the discussion on iterators in the [TensorFlow Guide](https://www.tensorflow.org/guide/datasets) is not relevant when eager execution is enabled."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "z1JcS5iBXMRO"
-      },
-      "source": [
-        "# Setup: Enable eager execution\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "cellView": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "RlIWhyeLoYnG"
-      },
-      "outputs": [],
-      "source": [
-        "# Import TensorFlow.\n",
-        "import tensorflow as tf\n",
-        "\n",
-        "# Enable eager execution\n",
-        "tf.enable_eager_execution()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "H9UySOPLXdaw"
-      },
-      "source": [
-        "# Step 1: Create a source `Dataset`\n",
-        "\n",
-        "Create a _source_ dataset using one of the factory functions like [`Dataset.from_tensors`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensors), [`Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) or using objects that read from files like [`TextLineDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset) or [`TFRecordDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset). See the [TensorFlow Guide](https://www.tensorflow.org/guide/datasets#reading_input_data) for more information."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "cellView": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "WPTUfGq6kJ5w"
-      },
-      "outputs": [],
-      "source": [
-        "ds_tensors = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])\n",
-        "\n",
-        "# Create a CSV file\n",
-        "import tempfile\n",
-        "_, filename = tempfile.mkstemp()\n",
-        "with open(filename, 'w') as f:\n",
-        "  f.write(\"\"\"Line 1\n",
-        "Line 2\n",
-        "Line 3\n",
-        "  \"\"\")\n",
-        "ds_file = tf.data.TextLineDataset(filename)\n"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "twBfWd5xyu_d"
-      },
-      "source": [
-        "# Step 2: Apply transformations\n",
-        "\n",
-        "Use the transformations functions like [`map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map), [`batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch), [`shuffle`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) etc. to apply transformations to the records of the dataset. See the [API documentation for `tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) for details."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "cellView": "code",
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "ngUe237Wt48W"
-      },
-      "outputs": [],
-      "source": [
-        "ds_tensors = ds_tensors.map(tf.square).shuffle(2).batch(2)\n",
-        "ds_file = ds_file.batch(2)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "IDY4WsYRhP81"
-      },
-      "source": [
-        "# Step 3: Iterate\n",
-        "\n",
-        "When eager execution is enabled `Dataset` objects support iteration.\n",
-        "If you're familiar with the use of `Dataset`s in TensorFlow graphs, note that there is no need for calls to `Dataset.make_one_shot_iterator()` or `get_next()` calls."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "base_uri": "https://localhost:8080/",
-          "height": 153
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 388,
-          "status": "ok",
-          "timestamp": 1525154629129,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "lCUWzso6mbqR",
-        "outputId": "8e4b0298-d27d-4ac7-e26a-ef94af0594ec"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Elements of ds_tensors:\n",
-            "tf.Tensor([1 9], shape=(2,), dtype=int32)\n",
-            "tf.Tensor([16 25], shape=(2,), dtype=int32)\n",
-            "tf.Tensor([ 4 36], shape=(2,), dtype=int32)\n",
-            "\n",
-            "Elements in ds_file:\n",
-            "tf.Tensor(['Line 1' 'Line 2'], shape=(2,), dtype=string)\n",
-            "tf.Tensor(['Line 3' '  '], shape=(2,), dtype=string)\n"
-          ]
-        }
-      ],
-      "source": [
-        "print('Elements of ds_tensors:')\n",
-        "for x in ds_tensors:\n",
-        "  print(x)\n",
-        "\n",
-        "print('\\nElements in ds_file:')\n",
-        "for x in ds_file:\n",
-        "  print(x)"
-      ]
-    }
-  ],
-  "metadata": {
-    "colab": {
-      "collapsed_sections": [],
-      "default_view": {},
-      "name": "Eager Execution Tutorial: Importing Data",
-      "provenance": [],
-      "version": "0.3.2",
-      "views": {}
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
-}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb
deleted file mode 100644
index 84f1d031d4..0000000000
--- a/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb
+++ /dev/null
@@ -1,485 +0,0 @@
-{
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "k2o3TTG4TFpt"
-      },
-      "source": [
-        "# Training Models\n",
-        "\n",
-        "In the previous tutorial we covered the TensorFlow APIs for automatic differentiation, a basic building block for machine learning.\n",
-        "In this tutorial we will use the TensorFlow primitives introduced in the prior tutorials to do some simple machine learning.\n",
-        "\n",
-        "TensorFlow also includes a higher-level neural networks API (`tf.keras`) which provides useful abstractions to reduce boilerplate. We strongly recommend those higher level APIs for people working with neural networks. However, in this short tutorial we cover neural network training from first principles to establish a strong foundation."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "3LXMVuV0VhDr"
-      },
-      "source": [
-        "## Setup"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "PJ64L90aVir3"
-      },
-      "outputs": [],
-      "source": [
-        "import tensorflow as tf\n",
-        "tf.enable_eager_execution()\n",
-        "tfe = tf.contrib.eager # Shorthand for some symbols"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "eMAWbDJFVmMk"
-      },
-      "source": [
-        "## Variables\n",
-        "\n",
-        "Tensors in TensorFlow are immutable stateless objects. Machine learning models, however, need to have changing state: as your model trains, the same code to compute predictions should behave differently over time (hopefully with a lower loss!). To represent this state which needs to change over the course of your computation, you can choose to rely on the fact that Python is a stateful programming language:\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "VkJwtLS_Jbn8"
-      },
-      "outputs": [],
-      "source": [
-        "# Using python state\n",
-        "x = tf.zeros([10, 10])\n",
-        "x += 2  # This is equivalent to x = x + 2, which does not mutate the original\n",
-        "        # value of x\n",
-        "print(x)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "wfneTXy7JcUz"
-      },
-      "source": [
-        "TensorFlow, however, has stateful operations built in, and these are often more pleasant to use than low-level Python representations of your state. To represent weights in a model, for example, it's often convenient and efficient to use TensorFlow variables.\n",
-        "\n",
-        "A Variable is an object which stores a value and, when used in a TensorFlow computation, will implicitly read from this stored value. There are operations (`tf.assign_sub`, `tf.scatter_update`, etc) which manipulate the value stored in a TensorFlow variable."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "itxmrMil6DQi"
-      },
-      "outputs": [],
-      "source": [
-        "v = tfe.Variable(1.0)\n",
-        "assert v.numpy() == 1.0\n",
-        "\n",
-        "# Re-assign the value\n",
-        "v.assign(3.0)\n",
-        "assert v.numpy() == 3.0\n",
-        "\n",
-        "# Use `v` in a TensorFlow operation like tf.square() and reassign\n",
-        "v.assign(tf.square(v))\n",
-        "assert v.numpy() == 9.0"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "-paSaeq1JzwC"
-      },
-      "source": [
-        "Computations using Variables are automatically traced when computing gradients. For Variables representing embeddings TensorFlow will do sparse updates by default, which are more computation and memory efficient.\n",
-        "\n",
-        "Using Variables is also a way to quickly let a reader of your code know that this piece of state is mutable."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "BMiFcDzE7Qu3"
-      },
-      "source": [
-        "## Example: Fitting a linear model\n",
-        "\n",
-        "Let's now put the few concepts we have so far ---`Tensor`, `GradientTape`, `Variable` --- to build and train a simple model. This typically involves a few steps:\n",
-        "\n",
-        "1. Define the model.\n",
-        "2. Define a loss function.\n",
-        "3. Obtain training data.\n",
-        "4. Run through the training data and use an \"optimizer\" to adjust the variables to fit the data.\n",
-        "\n",
-        "In this tutorial, we'll walk through a trivial example of a simple linear model: `f(x) = x * W + b`, which has two variables - `W` and `b`. Furthermore, we'll synthesize data such that a well trained model would have `W = 3.0` and `b = 2.0`."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "gFzH64Jn9PIm"
-      },
-      "source": [
-        "### Define the model\n",
-        "\n",
-        "Let's define a simple class to encapsulate the variables and the computation."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "_WRu7Pze7wk8"
-      },
-      "outputs": [],
-      "source": [
-        "class Model(object):\n",
-        "  def __init__(self):\n",
-        "    # Initialize variable to (5.0, 0.0)\n",
-        "    # In practice, these should be initialized to random values.\n",
-        "    self.W = tfe.Variable(5.0)\n",
-        "    self.b = tfe.Variable(0.0)\n",
-        "    \n",
-        "  def __call__(self, x):\n",
-        "    return self.W * x + self.b\n",
-        "  \n",
-        "model = Model()\n",
-        "\n",
-        "assert model(3.0).numpy() == 15.0"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "xa6j_yXa-j79"
-      },
-      "source": [
-        "### Define a loss function\n",
-        "\n",
-        "A loss function measures how well the output of a model for a given input matches the desired output. Let's use the standard L2 loss."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "Y0ysUFGY924U"
-      },
-      "outputs": [],
-      "source": [
-        "def loss(predicted_y, desired_y):\n",
-        "  return tf.reduce_mean(tf.square(predicted_y - desired_y))"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "qutT_fkl_CBc"
-      },
-      "source": [
-        "### Obtain training data\n",
-        "\n",
-        "Let's synthesize the training data with some noise."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "gxPTb-kt_N5m"
-      },
-      "outputs": [],
-      "source": [
-        "TRUE_W = 3.0\n",
-        "TRUE_b = 2.0\n",
-        "NUM_EXAMPLES = 1000\n",
-        "\n",
-        "inputs  = tf.random_normal(shape=[NUM_EXAMPLES])\n",
-        "noise   = tf.random_normal(shape=[NUM_EXAMPLES])\n",
-        "outputs = inputs * TRUE_W + TRUE_b + noise"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "-50nq-wPBsAW"
-      },
-      "source": [
-        "Before we train the model let's visualize where the model stands right now. We'll plot the model's predictions in red and the training data in blue."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 293
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 1210,
-          "status": "ok",
-          "timestamp": 1527005898290,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "_eb83LtrB4nt",
-        "outputId": "3873f508-72fb-41e7-a7f5-3f513deefe38"
-      },
-      "outputs": [
-        {
-          "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAEDCAYAAAA2k7/eAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnXlgU1X2xz/pAhRautCWUsCwWVlcUHHGBUFQcSg7uM8P\nFLUICo4VpygObihI3UdmUHBB0IGZQbEgFNGCqKgMolV2pKylCy1pukDp+n5/3LxmaUsDTUjSns8/\nbZKXd09C+b7zvvfccw2apmkIgiAITR4/TwcgCIIgnB9E8AVBEJoJIviCIAjNBBF8QRCEZoIIviAI\nQjNBBF8QBKGZENDYE+Tk5JCUlER+fj7+/v7cdtttTJgwgcLCQhITEzl27BidOnXijTfeICQkxBUx\nC4IgCOeAobF1+Hl5eeTn59OrVy9OnjzJ2LFj+ec//8mnn35KWFgYCQkJLFy4kKKiIh5//HFXxS0I\ngiCcJY22dKKioujVqxcAbdq0oXv37uTm5pKWlsaYMWMAGDNmDF999VVjhxIEQRAagUs9/MzMTPbs\n2cNll13GiRMniIyMBNRFoaCgwJVDCYIgCGeJywT/5MmTPPLII8ycOZM2bdpgMBhcdWpBEATBBbhE\n8CsrK3nkkUcYNWoUN910EwDt2rUjPz8fUD5/REREg+eRtj6CIAjuo9FVOgAzZ86kR48e3HPPPTXP\nDR48mE8//ZRJkyaxcuVKbrzxxgbPYzAYyMsrdkVIbiUqKkTidCESp2vxhTh9IUbwrTidodGCv23b\nNlavXk1cXByjR4/GYDCQmJhIQkICjz76KJ988gmxsbG8+eabjR1KEARBaASNFvwrr7yS3bt31/na\n4sWLG3t6QRAEwUXISltBEIRmggi+IAhCM0EEXxAEoZkggi8IgtBMEMEXBEFoJojgC4IgNBNE8AVB\nEJoJIviCIAjNBBF8QRCEZoIIviAIQjNBBF8QBKGZIIIvCILQTBDBFwRBaCaI4AuCIDQTRPAFQRCa\nCSL4giAIzQQRfEEQhLOk0GTi84R7+XbIDXyecA+FBSZPh+QULtnTVhAEoTnx7YzHuDflUwyAlv4z\nizEwfNFiT4fVIJLhC4IgnCWhhw9hsPxusDz2BVwi+DNnzuTaa69lxIgRNc/Nnz+fAQMGMGbMGMaM\nGcM333zjiqEEQRA8TqHRiGb5XQMKjV08GI3zuMTSGTt2LOPHjycpKcnu+YkTJzJx4kRXDCEIguA1\nXJ/8OosxEHr4EIXGLlyf/JqnQ3IKlwh+v379OHbsWK3nNU2r42hBEATfJjQ8wic8e0fc6uF//PHH\njBo1iqeeeori4mJ3DiUIgiA0gNsE/+677+arr74iJSWFyMhI5s6d666hBEEQXMLRjAwW9u3FWmN7\nFvbtxeGMDE+H5FLcVpYZERFR8/vtt9/O5MmTnXpfVFSIu0JyKRKna5E4XYsvxOmNMb53xQhmZh1T\n5Zalx5h3ww08cfSop8NyGS4TfEe/Pi8vj6ioKAC+/PJL4uLinDpPXp73Wz9RUSESpwuROF2LL8Tp\nTTEWmkx8O+MxQg8fIjory67cMtZk8po4z4SzF0+XCP706dPZsmULZrOZG264gWnTprFlyxZ2796N\nn58fHTt25Pnnn3fFUIIgCC7FdhHVXFSZpcHyM8vGqWgKuETwX3311VrPjRs3zhWnFgRBcCu2i6ju\nBp4JDKR7QACZ4RH839dfezAy1yMrbQVBaNbYLqK6AOgaP4L4w7lMSt+NsXt3T4bmcqSXjiAIzRpf\nXUR1LojgC4LQrPHVRVTnglg6giA0WXy1jbG7kAxfEIQmi6+2MXYXIviCIDQZbGvqC41Ggg9k+GQb\nY3chgi8IQpPBMaOfE9vRrq7eV9oYuwsRfEEQfB49s/dbn2qX0V8QEcHiq/7odAWOyWRmxoyNHD7c\nFqOxkPffHwX4uzv884YIviAIPk2hycS/B1/HJVnH2In9StnK7heelWc/Y8ZGUlLGAwbS0zWmTFnO\n/PnD3RK3JxDBFwTBJzmakUHquOFE5mTTpbqaAcD1wDygQ1AQ1UOGnnVN/eHDbcHmHuHgwWDXBu1h\nRPAFQfBJUscNt3a2BJYDdwG9gRNDhp5TNY7RWEh6uvUeoWvXEhdG7HlE8AVB8BkKTSY2PDqVgB+/\nI8ZstvPrg1HCvz22I3ec42rZ5OTBwFKLh1/EggUjqapyTezegAi+IAhejz4pq23aQBuzmWHAAuz9\n+t8CA8mPH8Edya8RGn5uXS7tu7w3vS1aRfAFQfBq9ElZR/vmbuBZwGgwkN0hlqEr19C5a7dGjdXU\nJ22ltYIgCF7NtzMe4xKL2IPVvrkAuAgwjBzDpPTdjRZ7aPqTtiL4giB4NaGHD1GC1WDR7ZvktqFs\nbH85r2eMJiHhUwoKzI0ey2gstBtJJm0FQRDciF5u2anARGZ4BC169eIBlI3TBsuk7MbNPJ60Sdkv\nuQa279CApSxaNOasx7NdbNWhw0mGDn2P7OxImbQVBEFwF/rEbNba1cysqKjZSPyF6mo+GzWW0MOH\nOGHsUjMp62i/HD7cttZK2eTkwYSHh51xXEffftSopaxffyMAERHes/euKxDBFwTBK9D74HwO9u0R\nCs3E11FT71gzbzQW1RJvZ7L+ui4cTRWXCP7MmTP5+uuvadeuHatXrwagsLCQxMREjh07RqdOnXjj\njTcICXFuZ3VBEJo+u7Zt48vRQ+ladpqDBgNRrVtjAIqxL7fMrKfE0rFmPjl5EHfcsY2zFe+6LhxN\nFZcI/tixYxk/fjxJSUk1zy1cuJBrrrmGhIQEFi5cyDvvvMPjjz/uiuEEQWgCfDkmntllp5XMahpP\nnzyJBsQDy4BC/MhoFcawxcvqfH94eFit7P1cxLuuC0dTxSWC369fP44dO2b3XFpaGh999BEAY8aM\nYfz48SL4giCwa9s20sbG0+10qZ110x14JSwMf8LZZO7HKt6G0+Hs/8dSFi3q69S5z0W867pwNMS5\nzBV4A27z8E0mE5GRkQBERUVRUFDgrqEEQfAQZyN8+qTs6VUruUjTOIS9dZMNxAwczN8Pjyc9fXTN\n+87GUz8X8T4XzmWuwBvwuknbqCjf8PklTtcicbqW8xXn1Kmf2wlfy5bL+fe/76p1nPnECVbc1J/e\nmZmUAEOBpajOltHAAYOBsBtvZMz7i1g3ZZ2dLRMXV4qfXxUPPZTKwYPBdO1azIIF8UREnJ+Muq7v\nMisrHNu5gqyscJ/423Cb4Ldr1478/HwiIyPJy8sjIsK53ha+UAIVFeUbpVoSp2uROGuzb18QtsK3\nb18QeXnFtTL/+PIVzMjMtGuN0BUYDsxqFcRfjuQCUFEFs2dfT1mZ1ZaZPXsQ99+/qubCsnWrRlnZ\nUubNG+R2W6W+7zI21oTt/UlsbIFH/zacvdi4TPA1+65DDB48mE8//ZRJkyaxcuVKbrzxRlcNJQiC\nl1DfJKlueRgwcUH6FPBbb+fXtwF+A7a0CuLmVevszlmXLVNX6aQnbRVfneh1ieBPnz6dLVu2YDab\nueGGG5g2bRqTJk3iL3/5C5988gmxsbG8+eabrhhKEAQvoi7hKzSZCN34D17hUfIoYS4VLKu29+t3\nderEHWnfOd3Vsq4Liyfr58/XXIGrcYngv/rqq3U+v3jxYlecXhAEL8VW+ApNJj57YAKl337NDSgp\n7mD5GY+yccotO1FNfn8RuXknSUhY6ZQlU9eFJSlpQ7Opn3cVXjdpKwiCb/LtjMeI/fZr7sKayb9k\n+RkG3AkstuxEFRYRwr33fe60JaNfWPS5gTvu2Far742v2CqeRARfEASnqasMs9BUwOJxCVyanU4+\nUIgSeAOqffFLQPuwMAwDB3N98muYTGamTv2c9evB1pLJyPBvMOOvr++NyWQmKcn36uLPNyL4giA4\nja3g/pqeT+DqP3BJ9UHmY83ql6E2J9GAn4Hc9pezLCqRblRzHX4251iGrbNvMh1mx44nOVPGX59v\n76t18ecbEXxBEJxGF1wDJ5jM5fyjOrNWs7NC4G3gd/zZenUS3/74ol0LY6toK2c/KKiCIUPgwIE4\nsrLOPAnrOHl7/PguDhzowaZNucDnqE488U26AVpjEMEXBKEW9a2g7djhGMb0eG5mHa3R6mx2to6r\nWcUPwCrC9uRbXjEDqaxfD+HhO4GBNe8oKytl69Z8evUKtjuTPglr36++nPbtnyY39yrgJFlZUxg7\ndgFm85PY3mMYjZVn/BzNFRF8QRBqoSySEcA60tPD2bp1CY/cW0nv1GfpDeQBuWDX7Kwc2A2s4l+W\nV04C+ZbfU4E7KS01UFqqERT0DKWlLYGZVFcbyMrSqK5+gVGjate2O9o1YWGvACNrYi0o6ITtPUZY\n2GmSk2+u873N3eoRwReEZsDZZrrKElkH3Ik/WxmSNYaCOdX0B0qAB4BVwNNAJ9QFIB9Y1vZeKNoO\n/Aj8iWuuWU6LFktZvx5KS62iXFraDyXS1ucKC411irGjbw/tsL0TCA8/Smmp9fHAgQE1n6059bp3\nBhF8QWgG1JXpnqk1gdFYyK/pp4nnEv7ATiKBUGCA5edyIALoDOwliNfIJCzsM7744g/MmfOz5Zyr\nSU4eTnh4GAkJn5KSYmv8nLT8tBXuzDpjd/Ttr7mmmhYtrHcCM2eOYs6cule9Nqde984ggi8ITRDH\njP7AgTY01JrgxImX+emnYsrKuhKifcxf2EAw0BdqGp6lAnehWiMUA7tpxRtsB8Ixm1vx7LPf0aJF\na8s41nYrtgunjh/fRVbWFEs8y/DzKyYm5gQrV1ptGltqL7q6pdbdyaJFRiff27xr9UXwBaEJ4ijm\nsbFzcJwQdbQ7Nm8uAm0ioxjFzeykEHgCa06+HNCnVf8H/MKlrGUssMvyTDybNy+kqOhBHD1z2xW5\nBQVXMmvWOvbtC8JorCQ5Of6M9lJj2hj4agsEdyGCLwhNEEcxj4jowlVXnbk1gb92kkR6MM/yzCrs\nnfM2wK/Atxh4mXuB14A1qJ6X6hynToXSkGceHh7Gv/99l090Hm1qiOALQhNEedcFqInXlvz++06O\nHGmFn18nOnSoAuztjuh2+7k07Q36Y5XrEuzLLb8DXmYDMAhQ1TLV1aUUFS1BOfoltG5toqiocZ65\nlFK6DxF8QfBRMjIOM27cKgoKOhEefpSVK0fRtavyspOTB7N16wKyslR9elnZGMrKlgHxpKauZfPm\nLwgOziW8bTsuP/4SxvTddMZe5IcCs4COwC78mc8e4D+Wo4rp1CmW7t0rSUmZgC7w/fq9xZ49cy0x\nZTJzZm1fXm+toCyd2oIupZTuQwRfEFzI+cxOx41bVSPopaUaI0c+w9VX9yArK5zYWBMREUa7lasQ\nhFoDO4PiIhN9i/7ENVk/0Qb4G/AKcDvKq28DbAPKgHXczKqaupyLgRGAxv79s3jvvTuxnRQtLw+0\ni2nOnKW1JlQbEnQppXQfIviC4EKczU6dvTDUd5zJZCYnR8O2nUBeXsuasdUuTHOxN2X2AH3wYz/j\nuYgYNHoDpZYjwlBVOCGWM5qBv/Moyqu3LacEMHD6tCrBtP18Q4akoZorpALBbNqUQ0GB2e6z2Qt6\nIZs25TJkSFrN55NSSvchgi8ILsTZ7NTZC4PjcWVl7wGQlpZLdfVMbNsJaFqE3dgFBbG0ajWL06fb\no0Q4GH/SmcooWgNXo8wZ/Qy3AWuBTGB/y1DSus7B//dDVFUtAyqBY8Bky/mV+Dt+PiXWa8HSJNls\nHk5S0lK71saHDlUCHwPDgLWYzY+Tnm79HqSU0n2I4AuCC3E2O63vwmCb0cfE5PH99/arUX/80Q+z\neSLUallWTnCwieJi69iqdcFsYmJe4FTO10xhA3HAPuBFrEK/BJiLMmwOGAL5vPsLxPVpz6fJg3n0\n0dWkpgLkoMR+Hcrw2QU8SEzMJ3YtjWfOvJJNm/6H2Xzmjpb6pC+0q3WslFK6DxF8QXAhzmanHTpk\nk57+L5SBUkSHDrZ7waoeNtAeVd9uFfGiohzL744ty/IpKTlOy5azKC/vhqYFoaZdDbSqzOMeNjCX\nusstw1E9cF7yv4viqo9hv4Hd+zVSU5+ksjIEg8FAy5a5lJXNQ9PiwLLQKiZmPgZDJCkp92N7pzJw\noL/dqlr9oud4kevS5UKMxsI6jxXcgwi+ILgQ57PTQLDbG+o9TCazpc1viuX1AcD1wDwgBmhBdXWU\n5fjrUHl5NKqTzd1o2mbKyu5CtTK7k0BWkMjtdM2HFtRfbvk9MI+HoeoKm6MKKS9vA1wClHD69KXA\nBJt3LeP06WNkZ3fA8U7l3/++krouenXd/Yh9c35xu+APHjyY4OBg/Pz8CAgIYMWKFe4eUhC8nuzs\nSGyFMjs7khkzNmI2P255vgBVUdMH8ENJdjzqYvAekAHMwX4dbIjl8XX48xBTeJvLLM9uwb7c8kng\nAiCdIBaRALyB/YYka1G1O/r5P8T+viAEaFeniNd30bMV97i4UmbPHiT2zXnG7YJvMBhYunQpoaGh\n7h5KEHwG+4VRbTh+fCdVVRdhFdV1wAzL4+G0aJFEefkBVG/KAqAH9gIcDBThzxb+j6uJRTnt+j1E\nf+ApoCfKfd9LIPN4CUjEKub6VuOlqEla2/PnYX9fUMw111Rb2hA7l6HbintUVIistPUAbhd8TdOo\nrq529zCC4HHOpgbfcWFUVtYITKZZwDisjQysgtuyZSTl5UlYBVffHlx/vJcW/EoiH2EE2qLuC/Qz\nhANGlNjPYxgwGnXXsAw4hf1W48ss77I9fy7qziIAP78sbrklnDfeGC4Zuo9xXjL8+++/H4PBwB13\n3MHtt9/u7iEFwSMkJq4hNbUt4E96egDl5Z/z4Yf/V+ex4eFhREf3tlsYdfp0F5RfHw38jvLvwwGN\n4uJg7DPuzsALQCcCWMOdfEJHqJmYreuSsA94jf0og8d2/mA2yh5S7REgwTKOyvZbtjxAWdlUoAug\nMWKErHz1Vdwu+MuXLycqKgqTycTEiRPp1q0b/fr1q/f4qKgQd4fkEiRO1+LNcZ44Yeahh1I5eDCY\nrl2LWbAgnoiIsFrHfPVVNqA6RYLGjz++WvO5Tpww88ADKWzapAF5DBgQRocO2PnfaguRGTaP56E8\n/BIgG3v5Pgp0pxWf8hc+IRi4FPtLQk9Urm4GdhDCAn4BuqPyfNsjL0c1QHseqEB1vDcAd9Kp0zx+\n/fVxpkxJZd++X8jP38vhw0amTl1d5/dwNnjzv7ktvhKnM7hd8KOiogCIiIjg5ptvZvv27WcUfF/w\n9XzFf5Q4XUNCwqqaUsmtW4P57rt/sHHjBMLDw2r62eTktKO6ugu2QlpcHMK+fUctG4Cssus5k5Ky\nBPgNeBWIRAl4H+yFuDd6GwNYgLVBcQkBHGIqM7kQ5ei3pnb1zS6gCEjmJ9Qq226Wcxc5HKkvv7rC\n8rt1nLCwzlRV+TN//nASElaSnj6DzEx9Edi5Z/re/m+u40txOoNbBb+0tJTq6mratGnDqVOn+O67\n75g6dao7hxSEs8IZ3z0jwx94B1XXspOsrF4MGrSEjRsn2PWzUatHrUJaWdmKQYOWEh3d27K61FbM\nNVT3Guvyp8DAX6ioGIO9ZBuAdJSFcyd+7Gc0/ehOUU0bYw1l7tyLtQ/O98AxgviI7aisvrvlqF7A\nXtQU7oVAS9RkrS781cDdNWfu3n1pzfcgPW58H7cKfn5+PlOnTsVgMFBVVcWIESPo37+/O4cUhLPC\nmRYHJtNhVCHjcvQtQbKyNJKSllJQYFuHPgyVscehWo/dR1bWf8nK8kdNehage/JwANs+OFBJUFB7\n2rV7kZycSNS0613AZtQdwAECuZ1EVnARqqmZ7eWjI+oeIBz4BQMvk4Ta+1XP6kNRtf0VwHPAEWAp\nEAXMx8+vkN69+9K5cxHwHtnZkbJdYBPErYLfuXNnUlJS3DmEIDQKZ7LWdu3iLJOr9hOnq1ZVomm7\nsWb1eulxgeXnRqADavJ1OJCEqoTRd4PNRU3QLgBKKSp63tJL/hmU4P8XmI4BE4MZSD920hvV0cYP\ne1PGhDJqnuIOVOb+PKp/zjKgHFWRU2zzGb5HZfnqDDExc9mwwdrKWL/zueOObTV3PrJIyveRlbZC\ns8aZrLVbt5Ns365qz21lVrUviEOJahCqQUEgyjL5K9a+MwuAKajFSvYNz2AkyqefZxnNgLJdDgNt\n8eN1pjKDi6gkFHUPEYqa2l2GtbNlJvAmM1C1OXrzhDCUPbPaMsYCwsJ2YzYPx/Hi1a5dnN1nru/O\nR6pzfBsRfKFZo2etGRmtMZn2kZFhJCHhUzsv33qMH/v3P83p00bUQqRhwHpU24NiVLZ+P8qqWYeq\naTegxHYJyj5xXK2q/x6MkvA2qJLMSIJYwiNs4VpqbyLeFdiBWob1Bd1YxR2oOwioPX2rHgcGmtiy\nZQJJSUvZtCnHIvzqmG7dTtl9L+LXN01E8IVmjb5w6J57PmbHji5kZYWwY0cOP/zwHuXlFwD5XHNN\nMG+8MYJZs75jx47nsbY+eB3ohxL7oSj/Xq+jz0ZZKmGWn0eAVjiuVlVoKKPmYcBAC/L4Cw/QAlUh\nb9s8Qd9E/DCQg4G5bEU1MzuFaocQgrJwnkDdfRxH1c8vY8CAkJrPW1BgJimpfntG/PqmiQi+0OQ4\n212nTCYzX32VhbWG/l8cP/4Mutilpi6jRYuNZGWFY93cIwNV6a7L8XxUhYttHf0ylKWi96XRPfVi\nlOtegrJgwlGZfSEB/I9HeYCXUFOqtvcDbVDSvhmYxzxURq8Bn6CqbabYjP0CMBbdVoqN3cE//zm+\n5jM3tEJW/PqmiQi+0OQ42z1RZ8zYSEVFP6zyGoKj9ZKSchz4BiW5T6KyedvVqi+gJktt31cIvI+q\njLH11FehBD8Y/QLhzxbGE04HrF1yjlG7q2UJLfgHPwArLec5iTJ42tuN3bZtB6677hNLtY2Z5OTx\ntS56Z7owSsuEpokIvtDkcPSfMzL87TbpePLJK5k792ebTUayUTX2+i5MjguTTKiKmj4o774QVSrp\n2Opgj8P7TqKmWG1ragpQ9fVRwDEMHORmRtGXHXRB1ebonW3uRuX/eqOFDYRyqtfrxBauo23bzhQV\n7aBduzgOHy6nqGgnaq5AjT1oUIsGBVs2C29+iOALTQ6r/1wIrGXv3kPs2KGqY9LTNdasmUVl5eya\n15V4ZwMXofZvLUJZNsrDV4+fw75LDdiLeybKdHkS1aYsFHjA8vM9VK+aXqhFVOpcLXmTKXSnDfZe\n/RKsPStPAj8Bb/MhsbGZpG+6tdbn7dv3LYqKpqAvuwoK+onk5IRaxzkiE7PNDz9PByAIrsRkMlNe\nXkFY2AcEBLwCDKWiwr7LTGVlZ8vjVNRkaxFqknMsSozboiY6W6BE+yrss/k+KL9cL4FcjppwDUDZ\nQS1R+XmY5fho1H+1/6F8/+W04s88yqNcBfzB4ewRqPqeA0ApLXibn4AJhIZ2r/Mzq5LKcJTFNJKe\nPS8/45yFjtFYiLrEgEzMNg8kwxe8nrOZhH300S9Yt05tuWetbdFw3A5Q/QxGTWr2xl5y+6Hq4/X3\nV1PbqgkDLkbZKDp6q4IdDsf/BDwGrCKA9fyFJfRE4xDKvsHh6L3Ad0AyM7Dtf1lYmFHnZ7auE1DH\nXXjh6TN9nTXIxGzzQwRf8HrOxmv+8UfbLvB6bcsAVHVMIcq6Kbc8PoaycRzr1k/avL8Y5bnvRmX9\nB1CLqqC21/+75Zi7gVmoydTjwP34cZw7uI8LqKpZLfsAyux5DGsPnP8BvxDLWraj7gqWoyZ9Aykp\naUtBgbnWxc5RuBcsGElVVcPfq0zMNj9E8AWv5+y8Zj1710V4p817v0cJcg9U1h2E6lLZBiW90ZZj\nJlse630oo4CHULZJAaqRWm/LuZdg7SMfClwLfInK9A1AJa14iGmspSWq4YFt82Mj8E/LmX/AwFv8\nRHT0ajiul4BqqN2n/CkqaklS0sZaIu0o3BERvtHhUTj/iOALXo+zi4BMJjMtWxZjbTlcSfv2p+jQ\noYqYmFOsWxeNmjgNQVXbPIG1cuYN1H+HauADVNOx6Vjl+VUgFtXorA/KytmL/cbeT6IuAN2Bv+HP\nVhKYRDBVhKHW49ree8SiMv1iYBVhFF74MqN676CkJJy0tGVAlkMMS2RiVWgUIviC1+Ho2c+ceSWO\nXrPtMR06ZAOB/PCDH2ZzT/SOM4GBc7jiigt4440refTRNahM3LZ2XpffdcCzNs/PcXjdgLqABKP6\n4kRZXg/HWk+Ti6rq6QQYaM10pvI6ccAh1KXAcQeqXagcftfVf+XzVbNqPv+QIWmoLQhXO8QQjtFo\nbvwXLDRbRPAFr8Pesy9g69YFREf3tpuwTUhYaXPMv7AX8uXAXVRUXEpqan9++WW+peVwFUpiQdkx\noGrsq7AX1hhqL3tqgbXR2WzUHMCtKBtHb5u8kAC+ZzjzuAhqvPo4y1nuxtp4YR/wHZEcaD+Jrz+c\nbPf5rXc09s3aYmN3kJw8HkE4V0TwBa/D3rNfR1bWk2RlWSds580bxKZNucBnqLr2AOBDy/GjUR0r\n56JWn75ETs5L2Fe5t8Bq59S1+2suKovXNwjMBfoC/0JZOlGoOv3lqHmAUYCBEN5iCjsJwbbxMDxt\n+WlEratNAl7hCoYOfYCvLRuB22Jt1uaPyTSXdu3i6NbtVJ2rZQXhbBDBF7wG3aZRu0Ppq17bYJt9\n792rcdllb1NW9kfURGknVL2LrdeeidqnNQJrWwMsP09TO6PvgrU12V7L43hUnX4R9nbPMlRWfxKV\nvz+PH/uJJ5o+VDAX1SvT9uzdUReAjqgan9dYQlBQFR9+OK7O70GqZwR3IYIveA22Vg5otG37EuXl\nJzl9Wm8ZUMC+fXuprn4Re4G3ldeLUNOhQ1HevGPVTjFqktb2Ob2RgV4FvxtlEd2Ftbe8fv5y1F3E\nt0B/gunNFPbQBWtdTrHD2fcAJ4C5PI1a2KURGvqi6744QXASEXzBa3AsvywpiaK6uhxYCORjMBRS\nXd0fewFuR+3e7yFY+9G/i/1WIUUoF/0pVO69H2XLrEb5+WGoCdqXUP89CrHtUaNkPRQD2VxPF66h\niDhUBX6X0UVuAAAgAElEQVRLyxHxWHtiHgR+IJhv+NYS0xLgd/r0EWtGOP9IawXB45w4YSYhYSUH\nDuzFdql/dfUhVOVLS+AhNK071kVSYF3s9CyqNn4Z1lYJuhV0G9bSS/189wDXAPcB/ijhH2455/2o\n7cCfQOXl01F2zyrURSKHAJ5kEg9yDUX0Rjn8D6LuJf6Gala8C7WI6oer/8qivbsIC/sSVc4ZCEzn\nxIm62yQIgjtxe4b/zTffMGfOHDRNY9y4cUyaNMndQwpegG3ZZExMHgZDJdnZHepsjfDQQ6kWK8ex\nX/x0rJt+L0fVzt+OypL1TUNOowTeH+XNL0CJ/SFUZh6GyvR160evrNmCEnRQUv0qtdsi2/aoAX/2\ncieP0Qkl246LqK6wRP078D3h7G8/hc/evIvw8DAGDowmJcW6w5T0rRE8gVsFv7q6mtmzZ7N48WKi\no6O59dZbufHGG+neXbKbpo6jH6+EfDTp6Rrl5e/QokXrmjr7I0d0K0fvF78ENXG6DjWRql8AilCZ\nfABqQrYUJdIXUbsscwLwIsryse1cqS+gMqIyeX3B1KWoUk1be2h/zWMD+dxOEp1Qa2mzqb2Iapcl\nor/zHPA05GrMmbOURYuM0rdG8ArcKvi//fYbRqORjh07AjBs2DDS0tJE8JsBGRn+WCtfirGVx82b\n8ygq6g74k54eQIcOv6AmQnWhPWY51rZ0ch5KmF8GrkZZO9NRG4E4ZubBKHHvbvndtsGZXrnTEmuZ\nZXeUXD+OtavNVmASBhYxhme4kZyada/hqBoix0VUR4BlrETdbahY9JWxUnkjeANuFfzc3Fw6dOhQ\n87h9+/Zs377dnUMKHka3cvbu3U99PeSLiqqwzchPnnyRsLBXMJs7oCpkOlN7pWs0ag9Z2wqd5aiL\nQ0vs5fc3lGUzHXVn8aHl+TxUM7OHgR9Qwr4AlZf/EVv7BvJoxTKmMZN5DiPehSoYnYOq9P8dSOav\nQDLWuxn1WcW6EbwJtwq+pmkNH+RAVFSIGyJxPRJn3Uyd+rnFyvkMW8E2GNqiaUtQAh2DdW/YYIqK\nqhk6tA2pqX6orQIN1M6hg1Btix07YZajBFvvn3MUlbFnoCyhLOy3F1mG6pXzrOW5EahGadZiSj/2\n8Sce4BKUfeM4Iqj7h2LgZ+C+las5tKyYgwdX07GjCU2rICtrNV27lrBgwUgiIs7/34ov/H36Qozg\nO3E6g1sFPyYmhqysrJrHubm5REdHn/E9vtDlLyrKN7oReiLOffuCUNKod3pUQqtpLVFTnX1Q2fda\nrFn+cNLSnqBt21YUFenyOgwl4hEosR9qeY/tRWAr1lYJlUCZ5fl01DKnO1HzAbaSHYK6IDjePagW\nyi1ZwZ9ZSRRK7B0bJ+9EXbIOA/P4KzCPqsX1t2uuqjr/f9O+8PfpCzGCb8XpDG4V/EsuuYQjR45w\n7NgxoqKiWLNmDa+99po7hxRsUOWOq5zaOMRVqD4wBajVrh+ibJTTqHLIO1HSeT3wH2xFt7z8QgwG\n6ySpyqFjUcuWdGtoKMoaCkNV1kSiBL81+mbg1sVYRcBbqAzfceGVfZ8cP78f8av+HxN5kWiUQXQZ\nSuyHYu/qlwDbCWAZe1AXDqSDpeAzuFXw/f39mTVrFvfddx+apnHrrbfKhO15xFrueP42qU5OHszW\nrQvIyrLtJvMSyh/XbZxAVAXMIpS9UwSUU1b2V1Stew+sE7ftUNXtF6JEvhR1Z6B78Ccs4ziutu1v\nGddoOacR5d/HoCqBVJ+cmBgThTmnmcrrNVO8rbGK/TqsG5OUAIb7JnHqxLWQ0s0ynvj0gu/g9jr8\nAQMGMGDAAHcPI9TBwYPBOL9xyJlxdpvB8PAwoqN7k5VlK8DtgV9RkqnbOONQojsCdVF4EXVR6Im6\nIPwN6wVjBipTvxh157Ac1YuyBEgE3qb2att1KMG3nW69HVXW+RtgwJ9D9Mx5mb5Qk9lXAL9YzqqL\n/fdArrErr//8ExVVgRQUmJESS8EXkdYKTZiuXYvZurXhjUOcwXGbQb2WPiOjNSbTXiIiutC9eyXJ\nyYOJicnDXoBboeri11HbT9d/jwIWo5oRnLa8pxRVNtkVtSh8JKoLpm255nLUBWW25Ryhlvd84zBW\nBaq0cwYQTkve5EFeJghVgW9bxT8Pa9u0X4G+H3zMjcNGEGbZSUpKLAVfRQS/CbNgQTxlZWfORPXM\nvS7hts3gHfvc/PBDMWbzg+gymZX1Pjt2BLFmzVpURfoLqJbCe1GLnlRFTm0/HcvvIVgbmC1Bib6+\n4YgJlYNrqDsAx7qZwyg//3eU7/8ZyhKy9sDx89tD9+4XcOD3KQzj31yE2tMqHzUlbHvGCNQ9QC6w\ns/f/MX2YbR2/IPguIvhNmIiIhjNRxxWxWVnL2bFjZK1NRxy3GSwpsb0AFKJE/lkqKx27wFdYfgaj\nJmuXo7L3LagFSgtQ3vpfLOcyoKpt9K0D9bJJtayp9sYkO1Fi74eaatVQ62BNGAwLMBgKCAgopLz8\nSTJ/f5XH+DctsC/UdOyGvxdY3fmv9L7iYj4Su0ZoQojgN3McM3clzKvsNh3ZsuUFIiO70bLlLMrK\nugIFVFaWohqSrUMJdBeH8/RC9YzvgzJJ/FENyu5CyeoW1MpWfd1qqOW9GsqnL0RV46iWxMHBwbRu\nncHJkyGcPDkLuBJ1FzAFa0Y/E1sZ17TOaFoorQK2MLG8KxEUcrHlXbaRhqIWUYUDu6KiefS7//FE\neIQLvl1B8C5E8Jsp1s1Gcqg94Wm/yjUnpx05OX7AH1AZ9RSsbvdc6l4odQRrqeQIm2MvRpVq9gJS\nULtP9QdmWc5/EjVluhblxa8FWtO2bQGXXRZJaupk9L48+lixsVmUlMTY1PAb0Dcab8F7TDr1Fn1Q\nS7JKLKPbRhqGarV22YrV3DZgoAu+XUHwTkTwmxG2lTbHj+8kK+shlOwtIyTkFOXlBygr80ctWtJ3\nnApFudlTsIq33mCgN9YLg75QKhrlpV+OfR6t7yk7EiXYek2+vvq1o+U1nWLgH+hZe1aWRnb2U8BS\nlGe/gKCgIMLDs4mIMFJdfYCiIpvaevZzEy25iHIuR80QBAKnLL+/iJrizQUyW7Zk8jdb6Ny1G4LQ\nlBHBb0bY+/WjgPctrxRQXNwG5YPbNv3VO0teQG3bR29yZrtQqhglpzEoJ9w2j24DVGP1823PV4i6\nSNger0u09ThNuxp1UVCWTXi4ucZ6ggJiY+cSHd2bnN8/5s8nV9ADlc3bVuC8iqrSH44ylHq/9TZT\n7ri7MV+rIPgMIvhNGMeVthkZAdgLbQFK0O+zPLbfzi8oKJLQ0AxycsB+Zep2AgK+oby8M0pC26EE\nXpU8qvLKu7GuUf0NdRFohVoEFYSSXF2GQ1H5ti7HJSg7Zz72F4GdqBYIYfj5taekRP8cAOFEtA1j\n4P4JtDhZXNPw7ANq32f8AmwA/mgptxSE5oIIfhPmgQdSSElR1S7p6RrR0S9gK6ABAcFUVtq2Frbv\nHBMenkV09CXk5NyAEu8yoAXV1X+mvPxfqIlafU3qfJTYg8r030bZNDstj5+ynONFVEbvKO6foTJ6\n2wtBEQbDU5bM/iQwGVXeeSctWhykqKhnTbxBPMWf9syhB8qmsZ3yrVXT87fnmPlIoku+Y0HwJUTw\nmzCbNtlPvppMkSi/3AAcoqoqFNiBVWSHoiZXewO7aN26DXv2bANyUPZNN5QL/jFK7HeiRPtt7Gvs\ny7BfHDUHqxVkQElxLPbibsDffzdVVXrXSwNt215ARUUwpaW23n4psbFzadu2M3v2DMOfF3iAp4lE\nTfmWoNbTrkXdY4xCFYgagX1Az7feZqRYOEIzRQTfx6ivxUFdzzvWo1RXF6AmX5cBT6BpytYJCHgG\n6EBl5WGUtbINuICMjN/RtBmo0kt9kdW/UBuRLMde1Gdh3SzcfkMSgyHC0iq72CaeocTEvMjp07EY\nDCauvroNYCQ19YGacw4atJStW49SWmr9DLGxOaSnTyMh4VP279nCgzxNR2qvvT2NMpbyLaMa3nqb\nv4rQC80cEXwfo74WB5s2VWI2twRuID09FFjK1Ve3JDX1JZS1coyIiALy8x0nTcMJDu6C2TwRtcI1\nEHgMNUmql17G2hyvi7njxGtfVG/6LNRCKqtI33ijgV275pKV1cVyvjhiY/ewceM9hIeH1bSgLSgw\n06KF/cpgs7mQMWPmUlDQifDwTFauHMnRjAx6bnyEKyiqKcB0XHt7CDUzkN82lAlfbpIKHEFABN/n\naKjFgV4yefhwW7p00YBpNa/17fsOO3a8YKmpt9opRUU5qOy8LepPwlY+9SZluoAXYW2LYOuOH0Jd\nWKpQ3S5fom3bKAYNakFy8jAAkpI2cvhwT4uYj6/VfK2uHjXh4WGkp0+refzh669wdO7zRKNaIDiu\nHNCAHwEzEP3W20yXrF4QahDB9zEcWxyoCpnaJZNGYxHHjkXYvZafH0OfPhXk5LRCTZqGAK2orn4I\nlQ8/i6qksfXWT6ImVZfTqlUZoaEZ5OYuQS2YeslyjmKUp1+NukNQq2mvu+49OwFvTMOxXdu28Wn8\nYDprGiGWT51nGd22Z/1m4GhQax7/+nvJ6gXBARF8H0H36A8caENs7BxLk7MqysurSE21XgDCwvYw\ncGABTz55Bbfeuhpb8TYai9i06TQwFaugv4+qfAFlyVyCtY/8LtS+sGHAnUREzKWg4EJUnxtFQMBz\nVFY+jaqLWYtqoaA2B8/OjnTJZ1/98VL2JD7MGyhhn24T/VxUfVBHVGfLuLfe5nHJ6gWhTkTwfQTH\nJmeXXvoe0IKjR1sTGzuXdu3i6NbtFMnJdxIeHkZCwkoyMyej574xMb9SXh5JUVE7lH0TjxLyAlQd\n/nKs1TS6NdQH+CcBAZFERBwnK2sq6uKgoQt8dXVny/kqsDY8U6tnjcZKwPle+nXxzovPU/LmK/S0\njOLY2bIT6rK0u20od4lXLwhnRATfwzgrho7e/Y8/+mE2Wy8AV11l3c3KZDKzaVMlqi7+LgCOH99h\n6UNjK+h3Uv8kbAVwjKFDI/jww7sZMiSN48f15z9E7Vg1nerqcMv5PrR7f1jYaZKTbwZqTzQ7s/NW\nWspnbEmYQBuse8sOpfZWJzuB61es5o/SA0cQGkQE38M4K4a1vXt9az8A+92sZszYiNlchbJWQoAi\nqqvD7I4PCqrA3382JSX+qBW2O7H37gNRxY7v2Yy/FvssXu+pY8DP7xjV1db4Bg4MqLlwOV6sGtp5\n6z8LF3D0bzO4Cvu2CMtRl6VZqLqhA35+jFi3kd59Lz/j+QRBUIjgexhnxVDV1VtLFsvL29h590Zj\nUc3dwvr1oNab2u4rOwfb3HjIEPjii3KsneGvB55BbczdApVPG2p8+OTkwWza9CVms2MBJIDGLbdE\n1Cqp1HG8WNW389aWDRv45s7RdEXtcWVfza9G247aB6vlW28zQ7x6QTgr3Cb48+fP5z//+Q/t2rUD\nIDExUfa2rQNnxdCxZLGumvWkJFuf374vjiqvXEZY2GkGDgwgOXkQ69dX2RwTDlyFqrixdrLU4wkP\nD2PgQH9SUmwXQe0gOrqamJh8gHptKceLVV07b+kWzlUood9B7f2xvgcKWrfmwY1SgSMI54JbM/yJ\nEycyceJEdw7h8zgjhnWhXwD0rP6OO7ZZetvrXWRKsG5Q0gb4BX//Mq65xkhy8gjCw8MID8+yW8Wq\nO+V610nHeGrHOr5mgjgl5X7qs6XOtAfsrm3bWDlyCK0qKojCauH0x9pBPwzVmu1SaYsgCI3CrYKv\nVmoKZ+JMYujMhO6jj37BunVKbK37wd4DDMVgeBlNe9Hy2giqql4lNXUKLVooQV65cpRlFWsHNO0A\nXbv2IC5udZ2LovRY580bVBNTUtIGkpMHn7VHD8q++e6uMcRpGq1QbdG+xv5+oyewBwiZ+wp/u39S\ng+cUBOHMuFXwP/74Y1JSUrj44ot54oknCAkJcedwTQZd6Otql+B4cfjxRz9sxTYg4DQXX/yZpea+\nu4PnrpqS6YLctavRbhWrM9Q1yWw0ak7ZUjqrP17KvsSHa/bK0uvpY7G3cHYAwdNncKeIvSC4hEYJ\n/sSJE8nPz6/1fGJiInfffTcPP/wwBoOB119/nblz5zJnzpwGzxkV5RsXBXfFeeKEmZtu+pjMTH17\nQGs1TFZWeK1xDYYT2MpkSEgxv/zyIACjRy+289z1n3FxpWcV/4kTZh56KJWDB4PZv78a2wtMVlY4\n69Zdz5Qpyzl4MJiuXUtYsGAkERG1z3/49995+7rr0PLyuBb7GYYY1KaFy1BtEQ4FBjLhhx+49Mor\nnY7zfNDc/z5diS/ECL4TpzM0SvA/+OADp467/fbbmTx5slPH5uUVNyak84Le7MsdJCSsIjPTdutA\na7uE2NiCWuNefXUbUlP1LpXFXHGFH6NHL7HYQBUMHfoemZlhnDixj4gII927L2X27EHk5RU7vQYg\nIWGVzWSw/d61sbEFVFX5M3/+8Jrjq6pq/ztu2bCBNXeOph2qyfJOVF2QXsV/AHVZy42MYsSaL7nN\nMinrTX8P7vx3dyW+EKcvxAi+FaczuM3SycvLIyoqCoAvv/ySuLg4dw3VpFB2i307ML1dgu0Eqi7W\nmZnRxMbutWm10NbOchk1ailpabcAt9Qay9k1APYe/TDCwl6hS5cLnZpkLjSZ+PD2MZz67Rcisd9A\nUe+8vxXVxrjrW2/zkEzKCoLbcJvgv/zyy+zevRs/Pz86duzI888/766hmhQxMXnArVhbIvzGpk33\n1Mq8HVst6CtthwxJw9kJVGcnW+1LR0MZOLA9ixbd2OBnKTSZePe6fgSeyKcDEIf9fUsUkI5aQnaD\nbDcoCG7HbYKfnJzsrlM3aQyGSlS/GmXRXH55O6daLehi7Wxd/9kce7alo4UmEx/eOoLAHdvpgloC\n1prabYx/B4au38TAmwf4xG2zIPg6stLWy8jO7oCavtQff1bncfWJta04x8WVMnt2/eLsrJCfqXTU\nkS0bNrDqztG0R7VAsF3nexfWNsbfA/1XrJa2CIJwHhHB9zIam3XbinNDE05nI+QNUWgy8cn/3U7Z\nT/8jGrVm19a+CQcWoBZRHbj8Sh5Y/gmh4REuGVsQBOcQwT+POFMV8+STV7J1q76l31FmzhxV57lc\nKdaNpdBk4tUr+hB56iRdgQyUjWNr32QBtGzFdau/kKxeEDyECP55xJmqmLlzfyYr60nAQGmpxpw5\nS1m0yOiJcJ1CX0TVDvsKnGewt286zn1FFlAJgodp8oJfV1ataZzzhhyNwZmqmHNpU+AJ0lI+Iz1h\nAj1QmyN2wN7CuQDV1fJXoK9U4AiCV9DkBb+urBo46w05XIEz/vzZVNl4gkKTifWJD3MkdY1da4SZ\n2Fs4+1FCP12EXhC8hiYv+PVnzOc/i3amKuZcu2eeD45mZLDwun6EVVfVqqmPRO2EG4US+/C/PSdZ\nvSB4GU1e8OvOmM+u2ZercGai1ZsmY3UKTSa+eHgSpWnrCUNtObgT1XxZb41QBJQBmZf2JfG/n0kF\njiB4IU1e8OvPmL0zi/Y2jmZk8J8Bf2RuRTnLgenozZZVa4RoVEYfcNUfeOCj/4jQC4IX0+QFv76M\n2duyaG9k17ZtpA4dRFfq3ua8N/CjwY/79hwQoRcEH6DJC75w9hzNyODzUX/C73guc4FXULZNMbW3\nHBz6xUYRe0HwEUTwBTv0rL43qtfNEdTq2GUooX8JaAvkRbfn9tVfyN6yguBDiOALgEXoRw7hgooK\nLgGGoerrXwKmAGtRk7SFLVpy7efrZbWsIPggIvjNHL0C50Taeru6erXHFrRHZfeHUZ0tbxOhFwSf\nRQTfDTi7k5SnSUv5jG8SJhAFGKlrjy3YB1RFRnHXmi/FvhEEH0cEvwHqEu+GthNzdicpT1JoMvFz\nwgQ6A0+gsnjbCdm9wGZUVi/2jSA0DUTwG6Au8f7sswlnfI8398PRWyPkfLWei4BAVKTxKBvnJJCN\n2nLwZulXLwhNCj9PB+DtnIt4G42FqDwZvKkfztGMDN699CJOpK7huYoKgoBjqEjDgDtRi6hCbhzC\ntL2H+OOAgZ4MVxAEFyMZfgOcSzMzb+yHU2gy8emga7m2vAwT1qx+CfA00BXYHxjI7d9tFa9eEJoo\njRL8devWMX/+fDIyMlixYgV9+vSpee2dd97hk08+wd/fn6eeeor+/fs3OlhPcC7i7S39cMwnTvDf\ne+6hcPO3VBYXM1vTMAAfY83qpwFzAgOpvOkW7ntjviyiEoQmTKMEPy4ujvnz5/P000/bPZ+RkUFq\naipr164lJyeHiRMnsn79egwGQz1n8l68RbzPlqMZGSy89goiNI0eqEVU24FLUTX2r6I6XO5v2Yp7\nf9sjQi8IzYBGCX63burWX9M0u+fT0tKIj48nICCATp06YTQa+e2337jssssaM5zgJLp9E6VpdrtQ\nPY0S/FDADJyIiua2z9eL2AtCM8EtHn5ubi59+/atedy+fXtyc3PdMZTgQKHJxL8HX0e306VUYF9b\n3xVYDByL7ci9GzeL0AtCM6NBwZ84cSL5+fm1nk9MTGTw4MF1vscx4wectnMaqnH3FrwtTvOJE6Q8\n8ACZa9Yws6LCzqvXM/x9QI9Ro3j4/fcJi/Ausfe277M+JE7X4Qsxgu/E6QwNCv4HH3xw1ieNiYkh\nOzu75nFOTg7R0dFOvTcvr/isxzvfREWFeE2cu7Zt48sx8XQ9Xcpx7FfMDgNeADqixF7fW7aiyru+\nZ2/6Ps+ExOk6fCFG8K04ncFldfi2Wf3gwYNZu3Yt5eXlHD16lCNHjnDppZe6aijBhi/HxDP7dCn3\no1bM7sW6AiAU8IvtyIC9h5h+vEi2HBSEZk6jPPyvvvqK2bNnU1BQwOTJk+nZsyfvvvsuPXr0YOjQ\noQwbNoyAgACeeeYZn6zQ8WaOZmSQOm443U6X2vn03VFtEsqBnE6duCPtO/HqBUEAwKDVZbh7EF+5\nffJUnIUmE9/OeIyDa1fzXEUFy1BdLXWffhYQGhZG62uu488fLaGiKtAjcZ4NvnTbLHG6Bl+IEXwr\nTmeQlbY+gi702qYNtDSb6YZ9D5xS4ECrIG5eta6m/01YhG/8sQqCcH4QwfcRvp3xGPemfGqXydv2\nwJkT25G/pO/2ZIiCIHg5IvhejJ7Vhx4+hHbogJ1X3xO1G1V7Pz+yYzowdOUazwUqCIJPIILvxdhm\n9Y419dlhYcQMHMz1ya/JpKwgCE4hgu9l7Nq2jS9G/wljWRm5wALgblRN/SthYXTv0o1CYxfGiNAL\ngnCWiOB7GV+OiefFsrKaTH4ZkIry6SMHDub6RYs9GZ4gCD6MCL6X0a3stJ1XHwKYgoJYPGQo1ye/\n5sHIBEHwdUTwPYztxGyh0cjuwBZo5dYMvxioHjKU4ZLZC4LQSETwPYxduWX6z/x94CCe+vF7jGVl\nHDcYaHX9QMZIZi8IggsQwfcwoYcP2Vk4nQsLuftonidDEgShiSKbmJ9HCk0mPk+4l2+H3MDnCfdQ\nWGCi0Gi02e4cCo1dPBihIAhNGcnwzyOO9s1iDFyf/DqLMVg8/C4yMSsIgtsQwT+PONo3oYcPERoe\nIROygiCcF8TSOY+IfSMIgieRDN8NOJZaXp/8OqHhEWLfCILgUUTw3UBdXv3wRYvFvhEEwaOIpeMG\n6vLqBUEQPI0IvhsQr14QBG9ELB03IF69IAjeSKMEf926dcyfP5+MjAxWrFhBnz59ADh27Bjx8fF0\n69YNgMsuu4xnn3220cH6CuLVC4LgjTRK8OPi4pg/fz5PP/10rdcuuOACVq5c2ZjTC4IgCC6kUYKv\nZ/CapjVwpCAIguBp3DZpm5mZydixYxk/fjw//fSTu4YRBEEQnKTBDH/ixInk5+fXej4xMZHBgwfX\n+Z7o6Gi+/vprQkND2blzJw8//DBr1qyhTZs2DQYUFRXiRNjnD/OJE6Q+9BDBBw9S3LUr8QsWAN4X\nZ31InK5F4nQdvhAj+E6cztCg4H/wwQdnfdLAwEBCQ0MB6NOnD507d+bQoUM1k7pnIi+v+KzHcyef\nJ0yyLqLaupXFZZVM/OwTr4uzLqKiQiROFyJxug5fiBF8K05ncJmlY+vjm0wmqqurATh69ChHjhyh\nc+fOrhrqvCKLqARBaCo0atL2q6++Yvbs2RQUFDB58mR69uzJu+++y08//cTf//53AgIC8PPz4/nn\nn6dt27auivm8Umg0oqX/XLPloCyiEgTBV2mU4N90003cdNNNtZ4fMmQIQ4YMacypvQZZRCUIQlNB\nVto2gCyiEgShqSC9dARBEJoJzVLw69pbVhAEoanTLC2d+vrVC4IgNGWaZYYvpZaCIDRHmqXgS796\nQRCaI03e0qlrf1kptRQEoTnS5AW/Pr9ePHtBEJobTd7SEb9eEARB0eQFX/x6QRAERZO3dMSvFwRB\nUDR5wZfWCIIgCIomb+kIgiAIChF8QRCEZoIIviAIQjNBBF8QBKGZIIIvCILQTBDBFwRBaCY0SvCT\nk5MZOnQoo0aNYtq0aZSUlNS89s477zBkyBCGDh3Kd9991+hABUEQhMbRKMHv378/a9asISUlBaPR\nyDvvvAPA/v37SU1NZe3atSxatIjnnnsOTdMaOJsgCILgThol+Ndeey1+fuoUffv2JScnB4ANGzYQ\nHx9PQEAAnTp1wmg08ttvvzU+WkEQBOGccZmHv2LFCgYOHAhAbm4uHTp0qHmtffv25ObmumooQRAE\n4RxosLXCxIkTyc/Pr/V8YmIigwcPBmDBggUEBgYyfPhwgDrtG4PBUOs5QRAE4fzRoOB/8MEHZ3x9\n5cqVbNq0iSVLltQ8FxMTQ3Z2ds3jnJwcoqOjnQooKirEqeM8jcTpWiRO1+ILcfpCjOA7cTpDoyyd\nb775hnfffZcFCxbQokWLmucHDx7M2rVrKS8v5+jRoxw5coRLL7200cEKgiAI545Ba0T5zJAhQ6io\nqIMzjrUAAATvSURBVCAsLAyAyy67jGeffRZQZZkrVqwgICCAp556iv79+7skYEEQBOHcaJTgC4Ig\nCL6DrLQVBEFoJojgC4IgNBNE8AVBEJoJXiv47733Hj179sRsNns6lDp58803GTlyJKNHj+b+++8n\nLy/P0yHVyZn6HXkT69atY/jw4fTq1YudO3d6Ohw7vvnmG/70pz9xyy23sHDhQk+HUy8zZ87k2muv\nZcSIEZ4OpV5ycnKYMGEC8fHxjBgxwq6c25soLy/ntttuY/To0YwYMYL58+d7OqR6qa6uZsyYMUye\nPLnhgzUvJDs7W7vvvvu0QYMGaQUFBZ4Op05KSkpqfl+yZIn29NNPezCa+tm8ebNWVVWlaZqmvfzy\ny9orr7zi4YjqJiMjQzt48KA2fvx4bceOHZ4Op4aqqirtpptu0jIzM7Xy8nJt5MiR2v79+z0dVp1s\n3bpV27VrlzZ8+HBPh1Ivx48f13bt2qVpmvo/NGTIEK/9Pk+dOqVpmqZVVlZqt912m/brr796OKK6\n+eCDD7Tp06drDz74YIPHemWGP2fOHJKSkjwdxhlp06ZNze+lpaU1PYW8jfr6HXkb3bp1o0uXLl7X\nZO+3337DaDTSsWNHAgMDGTZsGGlpaZ4Oq0769etH27ZtPR3GGYmKiqJXr16A+j/UvXt3jh8/7uGo\n6iYoKAhQ2X5lZaWHo6mbnJwcNm3axG233ebU8Q2utD3fbNiwgQ4dOnDRRRd5OpQGef3110lJSSEk\nJMRrb01tWbFiBcOGDfN0GD5FXX2htm/f7sGImg6ZmZns2bPHaxdlVldXM3bsWI4cOcKf//xnr4xT\nT46Li4udOt4jgl9ff55HH32Ud955h/fff7/mOU9mfA31EUpMTCQxMZGFCxfy0UcfMW3aNA9EeXb9\njjzp7zoTp7fhbXccTYWTJ0/yyCOPMHPmTLu7ZW/Cz8+Pzz77jJKSEh566CH2799Pjx49PB1WDV9/\n/TWRkZH06tWLLVu2OPUejwh+ff159u3bx7Fjxxg1ahSappGbm8u4ceP473//S7t27c5zlA33EdIZ\nPnw4Dz74oMcE/1z6HXkCZ79PbyImJoasrKyax7m5uU73hRLqprKykkceeYRRo0Zx0003eTqcBgkO\nDuYPf/gD3377rVcJ/s8//8yGDRvYtGkTZWVlnDx5kqSkJJKTk+t9j1cZz3FxcWzevJm0tDQ2bNhA\n+/btWblypUfEviEOHz5c83taWhrdunXzYDT1U1+/I2/Gm7LqSy65hCNHjnDs2DHKy8tZs2YNN954\no6fDqhdv+u7qY+bMmfTo0YN77rnH06HUi8lkqrFJTp8+zQ8//OB1/8cfe+wxvv76a9LS0njttdf4\n4x//eEaxBy/08G0xGAxe+wf86quvcvDgQfz8/IiNjeW5557zdEh18sILL1BRUcF9990H2Pc78ia+\n+uorZs+eTUFBAZMnT6Znz568++67ng4Lf39/Zs2axX333Yemadx66610797d02HVyfTp09myZQtm\ns5kbbriBadOmMW7cOE+HZce2bdtYvXo1cXFxjB49GoPBQGJiIgMGDPB0aHbk5eXxxBNPUF1dTXV1\nNfHx8TX7ffgy0ktHEAShmeBVlo4gCILgPkTwBUEQmgki+IIgCM0EEXxBEIRmggi+IAhCM0EEXxAE\noZkggi8IgtBMEMEXBEFoJvw//5K32R/vBHAAAAAASUVORK5CYII=\n",
-            "text/plain": [
-              "\u003cmatplotlib.figure.Figure at 0x7f5be3c99f50\u003e"
-            ]
-          },
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "display_data"
-        },
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Current loss:  9.48636\n"
-          ]
-        }
-      ],
-      "source": [
-        "import matplotlib.pyplot as plt\n",
-        "\n",
-        "plt.scatter(inputs, outputs, c='b')\n",
-        "plt.scatter(inputs, model(inputs), c='r')\n",
-        "plt.show()\n",
-        "\n",
-        "print('Current loss: '),\n",
-        "print(loss(model(inputs), outputs).numpy())"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "sSDP-yeq_4jE"
-      },
-      "source": [
-        "### Define a training loop\n",
-        "\n",
-        "We now have our network and our training data. Let's train it, i.e., use the training data to update the model's variables (`W` and `b`) so that the loss goes down using [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent). There are many variants of the gradient descent scheme that are captured in `tf.train.Optimizer` implementations. We'd highly recommend using those implementations, but in the spirit of building from first principles, in this particular example we will implement the basic math ourselves."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "MBIACgdnA55X"
-      },
-      "outputs": [],
-      "source": [
-        "def train(model, inputs, outputs, learning_rate):\n",
-        "  with tf.GradientTape() as t:\n",
-        "    current_loss = loss(model(inputs), outputs)\n",
-        "  dW, db = t.gradient(current_loss, [model.W, model.b])\n",
-        "  model.W.assign_sub(learning_rate * dW)\n",
-        "  model.b.assign_sub(learning_rate * db)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "RwWPaJryD2aN"
-      },
-      "source": [
-        "Finally, let's repeatedly run through the training data and see how `W` and `b` evolve."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 446
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 569,
-          "status": "ok",
-          "timestamp": 1527005915434,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "XdfkR223D9dW",
-        "outputId": "c43591ae-d5ac-4f2b-a8e7-bfce607e0919"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Epoch  0: W=5.00 b=0.00, loss=9.48636\n",
-            "Epoch  1: W=4.58 b=0.42, loss=6.28101\n",
-            "Epoch  2: W=4.24 b=0.76, loss=4.29357\n",
-            "Epoch  3: W=3.98 b=1.02, loss=3.06128\n",
-            "Epoch  4: W=3.78 b=1.23, loss=2.29721\n",
-            "Epoch  5: W=3.61 b=1.39, loss=1.82345\n",
-            "Epoch  6: W=3.49 b=1.52, loss=1.52970\n",
-            "Epoch  7: W=3.38 b=1.62, loss=1.34756\n",
-            "Epoch  8: W=3.30 b=1.70, loss=1.23463\n",
-            "Epoch  9: W=3.24 b=1.76, loss=1.16460\n"
-          ]
-        },
-        {
-          "data": {
-            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW0AAAEDCAYAAAD+/1UIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl4VOXdPvD7zJZ9XwmELQkQIAELsiTsi6xiEBGXAiIW\nbV8WBY2K0tLa4lbsr283qxURtIoioAi8SpFNg6whi0FJKAoJBgLZt5k5c87vj5OZLIRkgEnOGXJ/\nritXJsmZyT0sN1+enPOMIMuyDCIicgs6tQMQEZHzWNpERG6EpU1E5EZY2kREboSlTUTkRljaRERu\nxODMQePGjYOvry90Oh0MBgM2b97c1rmIiKgZTpW2IAjYuHEjAgIC2joPERG1wKnlEVmWIUlSW2ch\nIqJWCM5cETl+/HgEBARAEATMmTMH9957b3tkIyKiJpxaHvnggw8QFhaG4uJiLFiwAD179sTgwYPb\nOhsRETXh1PJIWFgYACA4OBgTJ05EVlZWi8fL3t6AIADdugFvvglYrTeflIiIWl8eqampgSRJ8PHx\nQXV1NR5++GEsXrwYI0aMuPadCgtRvfoFeL2zDkJtLWxdu6NqRSrMs+8DDE4N9y4XFuaHoqIKVb73\ntTCTc7SYCdBmLmZyjlYzOaPVSfvy5ct44IEHkJKSgjlz5mDcuHEtFzYAREai6oWXUHwkA9WPPApd\n4QX4L/sVgpMGwWPTvwFRdCocERE15tQPIm9Ew3/FdBcK4P3ntfB89x0IVivEmFhUP/kMzCmzAL2+\nLb79VbT6LysztU6LmQBt5mIm52g1kzPa5YpIKaozKl9+DcWHT6Jm7gLof/wB/r98BEGjh8Fj28cA\nTyckInJKu17GLnWJRuXaP6P40AnUPDgP+jN58F+0AEFjhsO0fRvLm4ioFarsPSJ1647KP/0VxWnH\nUTvnAehPf4+AhfMQNG4ETDs/A/hiOkREzVJ1wyipR09U/OV1lHx9FLX3zIH+uxwEPPQAAieMgunz\nXSxvIqImNLHLny0mDhV/fxMlB4+g9u57YMjORMDcOQicNAamPV+wvImI6miitO1scb1Q8fo6lOz/\nBrUzZsJ4Mh0B99+DwKnjYdy7h+VNRNftL395DR999IHj4+XLl2DVqlWOj//61/+HDz/8txrRboim\nStvO1iceFf96B8V702CeNgPG48cQOGcmAu+cBOOBfSxvInJa//6JyM7OAKBsfldWVorc3FzH17Oz\nM5GQMECteNdNk6VtZ+vXH+Vvv4uSPQdhnjwVxiPfIPCeGQhImQpj2ldqxyMiN5CQMBBZWZkAgLNn\nz6Bnzxj4+PigsrISVqsVP/74A+Liequc0nnqXFN+ncSEASjf8AEMJ0/A+9UX4bH7c5hSpsIycjSq\nnloJcdhwtSMSkRN8Vj8Pj+3bXPqY5jtTULX699f8emhoKPR6Ay5duoisrEz075+I6uoyZGdnwsfH\nBzExsTCotL3GjdD0pN2UOPBnKH/vI5Ts2gPL2PEwHdyPoBmTEDD7LhiOHlY7HhFpVGJiIrKyMpCd\nrZT2gAEDkJWVgaws91oaAdxk0m5KHHQ7yjZtheHIYfi8sgam/Xth2r8X5vETUZ26EuJtg9SOSETN\nqFr9+xan4rbSr18isrIy8d//KssjHh4y/vnPf8HX1wfTpt3V7nluhltN2k2JQ4aibPMnKP1kFyzJ\nI+GxZzeCJo2F/8/vhSHzpNrxiEgjEhIGIC3tIPz9/SEIAgICAlBZWYHs7Cz075+gdrzr4talbWcd\nnoyyrTtQuuUzWIYlweOL/0PQhFHwn3c/9HU/gCCijismJhbl5WXo3z+x0ef8/Pzg7+9er33bLrv8\ntStZhvHAPvi8/AcYjx0BAJin3wWPXz+Hom69lRdn0Ait7jTGTM7RYi5mco5WMznjlpi0GxEEWEeP\nRemO3Sj9YAusPxsEj88+AYYMQeCEUfBc/xaEinK1UxIR3ZBbr7TtBAHWcRNQuutLlG7aCsycCUNO\nNvxSn0BIQm/4Ll8CQ/pxXqhDRG7l1i1tO0GAdex4YMsWFJ88hapnV0EKCYHXu+8gaNJYBI4fCc+3\n/wWhvEztpERErbr1S7sBKSIS1U88heKjmSj9YAvM02bAcOpb+D29HCGJveH7xGIYThzj9E1EmtWh\nSttBp4N13ASUv/2uMn2v/DWk0DB4vbcBQZPHIWjcCHiue5PTNxFpTscs7QakiEhUP/4kio9koHTT\nVpinzYD++1Pwe2aFMn0//j8wHD/K6ZuINKHDl7aDTgfr2PHK9J2eg8rnfgMpNBxe/96IoCnjETQ2\nmdM3kZsqLPwJ8+bNUTuGS7C0myFFRKJm2QoUHzmpTN/T74L+9HfK9J3QC77LfgXDsSOcvonciKCh\nazRuBku7Jfbpe91GXEk/hcrnV0MKj4DX++8iaOoEZfp+6w0IZaVqJyWiVoiiiD/8YTXmz78fy5Yt\ng9lsVjvSDbn1roi8BpddASVJMB7YB6+N62Ha9RkEUYTs5QXzXXejZu5DEAcPcfqqS61elcVMztFi\nLq1nWr3aA9u3u3afujvvFLF6dcsFXFj4E2bPnoF//GMd+vdPwJ/+9CI6dYrGfff93KVZbkbHvSKy\nrel0sI4Zh/K3NuDKye9Q+fxvIYVHwPOD9xA0bSKCxiTB861/cvom0piIiEjH5lAzZsxAZmaGyolu\njFtuzaoVcng4apY+gZrFy2A8uB+eG9fDY+d2+D37FHx/92uYZ8xEzbwF1zV9E93KVq82tzoVt5Wm\na9ru+leSk7Yr6HSwjh6Lin+9o0zfq34HKSISnpv+XTd9D4fnv16HUFqidlKiDquw8Cd8+202AGDH\njh1ITByocqIbw9J2MTk8HDVLHkfxN+ko3fwpau+6G/q8XPitTEVIYm/4LXkMhiOHeeYJUTvr3r0H\ndu36DPPn34+ysjKkpNyjdqQbwh9EtgOhqAiem/4Nz41vw3D2vwAAsU88DAsfxpVREyH16KlKruZo\n/QdZWqLFXMzkHK1mcgYn7XYgh4WhZvEylBw6gdKPt6M25W7oz+QBTz2FkKEDETR6OLxf/gMMWRmc\nwImoRfxBZHvS6WAdORrWkaNReeUKQr/eA/Omj2A6sA8+a1+Gz9qXYYvuCvOUabBMvRPWIcMAN3qV\naCJqe2wElcghIcDChSifcS+EygoYv/wPPHZ+BtPuz+H9xj/g/cY/IAUHwzxpKixTpsMyeizg5aV2\nbCJSGUtbA2RfP1hmzIRlxkzAYoHx64NKgf/fDni9/y683n8Xsrc3LGMnwDx1OiwTJ0EODFI7NhGp\ngKWtNSYTrGPHKy/c8PJaGE4cg8euHTDt3A6PHZ/CY8enkA0GWJNGKgU+ZRqkTlFqpyaidsLS1jKd\nDuLgIRAHD0HV86uhP/09PHZ9BtPO7TAd2AvTgb3AMytg/dkgmKdMh2XqnbDF9VI7NRG1IZ494i4E\nAbbefVD9+JMo/WI/rqTnoOLFV2EZOQaGjJPw/cNvEZw8GEFJg+Dz+9XKHuCSpHZqItVVVlZi69bN\nbfb406dPQGVlJQDgypXLGDnydmRlZTT4+kSUl7vuxcSdLm1JkjBz5kw89thjLvvmdOOkzl1Qu/BR\nlH38Ka7knEH5X/8J89Q7oS/Ih/f/voagKeMRPDAevqlPwLjvS8BiUTsykSoqKsqxdetHzX5NcsFg\n07dvArKzMwEA2dmZ6NWrD7KylI/PnfsRgYFB8Pf3v+nvY+d0aW/YsAExMTEu+8bkOnJQMMz33o/y\n9e/h8qmzKHvnfdTOeQCCuRZe699C4L0pCOkbA79fPgLT9m1A3VRA1BG8/vpfceFCAR5++EH8/e//\ni/T045g3bx5++9vnMX/+fVe9QML777+Lt99+EwBQUJCPFSuW4pFH5mHx4kU4d+7Hqx4/ISHRUdpZ\nWZmYM+dBfPttfYknJCS69Pk4taZdWFiI/fv347HHHsPbb7/t0gDkYt7esEyZBsuUaYAowvhNGky7\nPoPHzs/g+fGH8Pz4Q8geHrCMGQfLlOkw3zEFcmio2qmpAwke1L/Zzxcfz3bJ8U398pdL8MMP/8W6\nde8BANLTjyMrKwsbNnyIyMhIFBb+dM0XSHjllTVITV2Jzp27ICcnG2vXvoQ///kfjY7p3z8R69e/\nBQA4depbPPLIY/joo38DUEo8IWGAUzmd5VRpr1mzBqmpqaio0NZln9QKgwHWEaNgHTEKVb9/GYas\nDOUslF074PH5Lnh8vgu+Oh2sQ4fDMnU6zFOmA2HN/wUhupUkJiYiMjKyxWNqamqQnZ2BVauehn23\nD1EUrzqub99+yM39HrW1tbDZbPD09ERUVGcUFOQjOzsD99/v2j27Wy3tffv2ITQ0FPHx8Th8+LDT\nD+zsdfTtqcNnGj9SeVv7CpCbC3zyCYStW2E6lAbToa/hu+pZICEBYWPGAGPGAKNGARqZwrX4ewdo\nM5fmMzWzxAAAYde68/Ue34TFUg69XufIEBjoDS8vL8fHklQNQajPaDQCOp0JwcHeCAgIwPbtn7by\nHfzQvXs37N//OQYMSEBYmB+GDBmMrKxjKC8vw6Br/E/hRrVa2idOnMCXX36J/fv3w2w2o6qqCqmp\nqXjllVdavJ8WN2NhpgYCI4H5jwLzH4Vw8SI8Pt8Jj53bYUr7CsjKAv7yFwCAGN8X1uHJsCSPhHVY\nMuQwZ/+quI4Wf+8AbeZipqvV1sqoqKh0ZCgtrQZQ31GSZMLly1dw5kwBPD09sXv3HgwbloSaGhkR\nEZ3w4YdbMXbsBABAXl4uYmPjrvoeffr0w7p1b2PhwkdRVFSBbt164YUXViE+vp/Tz93Zf2xbLe3l\ny5dj+fLlAIAjR45g3bp1rRY2uRc5IgK18xagdt4ChPmbUPrFPhi/Pghj2tcwHjsMw6kceK1TfjAj\n9u4D6/BkWJNHwjJ8BOTwcJXTE7XM3z8ACQkDMH/+fRg6NAnDhyc3+rrBYMCCBY9g0aL5iIrqjG7d\nuju+9utfv4A//vElvPPOOthsIsaPv6PZ0k5IGIDNmzehXz/llXF69+6DoqIizJgx0+XP57q2ZrWX\n9uuvv97qsfzXvnVukcligSH9BEyHvlKK/OhhCNXVji+Lcb1gHT4C1uQRsCaNgBTR8jqhSzJphBZz\nMZNztJrJGdxPW0VumclqheHkCRgPfQ3T1wdhOHIYuqr6UwjFmFhYk0Y43lxxib0Wf50AbeZiJudo\nNZMzeBk7XR+jEeLtQyHePhQ1S5crJZ55UllKSTsI4+Fv4LVxPbw2rgcA2Lr3UNbD65ZUpM5d1M1P\n5OZY2nRzjEaIg26HOOh21Cx5HBBFGLIylBI/9BWMh9Lg9d4GeL23AQBg69odluQR9SUe3VXlJ0Dk\nXlja5FoGA8TbBkG8bRBq/mcpYLPB8G0WjF9/VV/iddvNAoAtuiusSSNgsS+ndO3mvi+TTdQOWNrU\ntvR6iIkDISYORM0vFwM2G/Q538KUdtAxjXtu+jc8NylXkNk6d3Gsh1uSRkDq3kPlJ0CkLSxtal96\nPWwJiahJSETNo/8DSBL0p3Ial/hHH8Dzow8AALZOUcCY0fDqkwAxcQDEhETI/gEqPwki9bC0SV06\nHWz9+qOmX3/U/OKXSol//x2MaV/BlKYsqeD99+GL9x13EXv0VKb3hAF1RT5Aefk2og6ApU3aotPB\nFt8Xtvi+qF24CJBlhJUWonx/GgyZGcpb1kl4frIF+GSL4262LtH1JZ44AGLiwDY5Z5zcT2VlJXbv\n/j/MnHlPm32PNWt+i+TkkRg9elybfQ87ljZpmyAAvXrBHNQJ5pRZyudkGbr8844CN2RmwJhxEh67\nPoPHrs8cd7WFR9SXeMJAiIkDIHWJ5g86Oxj7ftpNS1uSJOh07vc6MCxtcj+CACm6KyzRXWGZdqfj\n07qLhTBknmwwkWfA4z9fwOM/XziOkYKCHAVuf7N17wm44V9edzVokE+znz9+vMolxzfVcD9tvV4P\nLy9vREVF4ttvc/Dqq39Gaurj2LBhEwBlL+3a2hosWPALFBTk47XXXkFZWSk8PT2Rmvocunbtds3v\nc/ToYXz44fsoKSnG4sVPIClphFP5rhdLm24ZUkQkLBMnwzJxsuNzwpUrMGTVl7gh82T962va7+fr\nBzEh0bE+LiYOhC02DjDwr8etoOF+2unpx5Ga+gTWrn0VRqPfTe+l3VBh4U/429/eRH7+eSxd+hg2\nbdoGo9Ho8ufDP5V0S5NDQmAdMw7WMfVrjUJ5GQzZWfVTeVYGjIcPwXTo6/r7eXlB7NvfsT4uJg6A\n2DseMJnUeBq3FGcn5Bs9vjV9+/ZDVFRUi5exO7uXdkPjxk0EAHTpEo2oqM748ccfmt1c6maxtKnD\nkf0DHOeCO1RVwZCT3WAiz4AhIx3G40fr72c0QozvpxR4/0Rg2CAIIZ2VnQ65Tu42PD09Hbf1ej1s\ntvrXibRYzAAAWZbg5+fveLUbZzSd2K81wd8sljYRAPj4OPZUcTCbYfgup9FZK4Zvs2HMPOk4JBSA\n5B8AW2wsbDFxsMX1glj33tajJ+Dh0f7PhRrx9vZGdd3OlE33xwsKCkZpaQnKy8vh6emJtLSvMGxY\nEry9fdCpUxT27v1Pq3tp2+3d+x9MnjwNFy4U4MKFghbXv28GS5voWjw8IA64DeKA2+o/Z7VCn3sa\nhqwM+F/4EeaMbOjP5MKQlQnjieON7i7rdJC6doMYG+codFtsHMTYXsqLSXA6bxcN99M2mTwQHBzs\n+Jor9tK2i47uhsWLF6GkpBhPPbWyTdazAW7Nqipmco4WMwFNcokidOd+hOFMLvS5udCfyVXKPS8X\nustFV93XMZ3H1he5LTbupqdzLf5aMZNzuDUrUXsyGCD1jIGlZwzQ4OwVABBKS6DPy4U+LxeGuvf6\nvNOtT+f2Iq9bcuF0TgBLm6jNyYFBEAcPgTh4CMwNvyCK0J/7oa7E86DPO11X7KeVc8sbnF8O1E3n\nccpSixjXS1lyccF0Ts7bsGEd9u79DwRBgCzLEAQBY8dOwNy5C9otA5dHVMRMztFiJqBtc101neee\nVpZczv4XgtXa6FjHdB7XCx59eqEyJBK26GhIXaJh69IVcmioqhO6Fn//tJrJGZy0iTTIqem8bu3c\nUFfoHrs/B3Z/Dt+mj+XlBVvnLkqJR3eFFN0VtrpCl6KjIUV2AvT6dnx2dDNY2kTuxGCArWcsbD1j\ngTumNPqSUFKM0MorKMv8Dvr8c9Dln4f+/Pm69z/CkJfb7EPKBgOkqM6wdbFP59GOYpeio2HrHM3l\nFw1haRPdIuSgYKBXN1iir3FaWmUl9PnnlUI/fx76/PPQ5Z9zFLvx0NcQrrFaaguPUAo8uiukLg0K\nvW5al32d+6893TyWNlFH4esLW5942PrEN/91sxm6CwV1ZX4e+vPn6m+fOwdDxkkYjx9r9q5SYKBS\n4F2i69bT64sdiX0A2YNLMC7C0iYihYcHpB49IfXo2fzXbTboLhbWTen1yy/224b/5kHIzmz2rqF6\nPaTQMEjhEZAiIhq/D49s9DG8vdvwSbo/ljYROUevhxTVGVJUZ4hDh139dVmGUFzcYPlFKXPv4iKI\n5wuUrXPP5ELIymjx20h+/pDCwyFFRCrvHcVu/1wEpIhIyMHBHXJLXZY2EbmGIEAOCYEYEgI0uPTf\nO8wPpQ1OrxMqK6C7dBG6ixfr3hdCd+lS3fv6z+v/e+aaa+xA3Q9Qw8KbTO0RjlJvWPJosEmUu2Np\nE1G7kn39YPP1U86AaYnVCt2Vy1eVeePCvwjD96cgZKS3+FBSQGD91B4RAXTtAm9PX0jBIZCCgyEH\nBUMKCoYcEgIpKFjTJc/SJiJtMhohRXZSziNviSxDqChvMq03md7r3gy5px13a/71cOoe0ttbKfSg\nukIPaVDswcH1X6u7LQcHQ/bxbZeLmFjaROTeBAGyfwBs/gHKKw61xGKB7nIRQsQqlJ45D11JMYSS\nYuiuXGl0Wygpga6kGIYzeRCqnXsRBtlobDSty0H1hS4FBSsTfXDj4pcDAq97XZ6lTUQdh8kEKaoz\nEOYHa9dezt3HbFYKvbgYuuIrSrHbbxcX15e9/eOfLsBwKseph5Z1OsiBgZCCQ4AG/wtoCUubiKgl\nHh7KEk1kJ9icvY8oQigtVQq9boq/ZvHX3XYWS5uIyNUMBsihobCFhgJOvkxkmJMP3fFOciQicmMs\nbSIiN8LSJiJyIyxtIiI30uoPIi0WCx588EFYrVbYbDZMmjQJixcvbo9sRETURKulbTKZsGHDBnh5\necFms+H+++/HqFGjkJiY2B75iIioAaeWR7y8vAAoU7coim0aiIiIrs2p0pYkCSkpKUhOTkZycjKn\nbCIilTh1cY1Op8O2bdtQWVmJX/3qV8jLy0NsbAs7dHXvjmDp6i0Vi49nN3t48KD+zX7epcfrhKsy\nqZoHuCqT6nmaZNJEngaZNJPH7tyPmsrD42+N41tzXVdE+vr6YsiQITh48GDLpQ1Ar7t6t6trvkR8\nM8e2xfFNM6mdp2kmLeRpmEkreeyZtJSnxfuolMd+/FX3UznPVffVQJ5GH2skj7MEWW5hl3EAxcXF\nMBqN8PPzQ21tLRYuXIhFixZh9OjRLT5wUYNNz7UgLMyPmZzATM7TYi5mco5WMzmj1Um7qKgIzzzz\nDCRJgiRJmDp1aquFTUREbaPV0u7duze2bt3aHlmIiKgVvCKSiMiNsLSJiNwIS5uIyI2wtImI3AhL\nm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uI\nyI2wtImI3AhLm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0iYjcCEubiMiN\nsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0iYjcCEubiMiNsLSJiNwIS5uIyI2wtImI3AhLm4jIjbC0\niYjciKG1AwoLC5GamorLly9Dr9dj9uzZmDdvXntkIyKiJlotbb1ej2effRbx8fGoqqrC3XffjeTk\nZMTExLRHPiIiaqDV5ZGwsDDEx8cDAHx8fBATE4NLly61eTAiIrrada1p5+fn47vvvkNiYmJb5SEi\noha0ujxiV1VVhaVLl2LlypXw8fFp8dju3QFJuvqY48ermj1+0KDmH8+Vx+t0V2dSMw+AqzKpnadp\nJi3kaZhJK3nszp1r9tOq5eHxt8bxrXGqtEVRxNKlS3HXXXdhwoQJTj2wTnf1EB8W5neNY5t/DFcf\n3zST2nmaZtJCnoaZtJLHnklLeVq6j1p57Mc3vZ/aeZre1kKehh9rJY+zBFmW5dYOSk1NRVBQEJ59\n9lmnH7ioqOKGArWVsDA/ZnICMzlPi7mYyTlazeSMVte0jx8/ju3bt+Obb75BSkoKZs6ciQMHDtx0\nQCIiun6tLo8MGjQIp06dao8sRETUCl4RSUTkRljaRERuhKVNRORGWNpERG6EpU1E5EacviKSiIiu\nnyQBZWVASYmA4uL6t5ISwfG5khIBn37q3OOxtImInGSxoFHRNizgpkWsfAyUlgqQJMFlGVjaRNTh\nyDJQWYmrCvdaU7D981VVzpWvXi8jKEhGaKiMuDgJQUEygoOVt6Ag1L2XHe+DgmQAvk49NkubiG4Z\nNTVAUZGAixcFXLqkw6VLyu2iIuVj5fMCLl8GLBbnLhv39lZKtUcPqVHR1pdw4/INCZHh5wcIrhuu\nG2FpE5GmSZIyEV+6JDhK2F7IDd8uXtShvLzlpvTwkBERIWPgQMDfX2yxfO23vbza6Yk6iaVNRKqo\nqUGjwm1cwvVTcVGRAFFsuYxDQiR07izhtttkhIfLiIiQEB5uvy3X3Zbg769MwMqGUTXt9Exdi6VN\nRC4likBhoYD8fB3OnxdQUQGcPevRYEpWStn5qVhCeLjUqIAblnJYmAyjsZ2enAawtInoutTUABcu\nCDh/Xof8fB3y8+23laK+cEGAzda0kE2OW9eaiusnYuVzbbku7M5Y2kTUSFkZGpVw49sCLl9u/po8\nQZARGSnjZz+T0KWL/U1GfLwnPD2rEBGhnE3RkabitsDSJupAZFlZR25Ywsq0XH+7oqL58dZolNG5\ns4z4eBFdusjo0kVCdLTkuB0VJcNkuvp+YWGeKCqS2viZdRwsbaJbiNUKnDvXtJDrlzIKCgSYzc2X\nso+P3KiEu3SxfywhOlpZtmjppdeofbC0idyMLAM//SQgL0+H3FwdzpzRIS9PeV9QAEhS8xdphIZK\niI9X1pPrC7m+mAMDuYbsDljaRBpVXQ2cOaOUccNyzsvTobr66naNiJCQlARERFgbTczR0TI6d5bg\n7a3CkyCXY2kTqcg+Nefm1heyfWrOz796LcLTU0bPnhJiYxu/xcQoZ1so5x/XqvBMqL2wtInagX1q\nbljK9um5uak5MlLCyJEiYmIal3OXLlxX7uhY2kQuIsvK+csNJ2b7W0HBtafmuDjJUc72277O7R1E\nHRBLm+g6WSzA99/rcOkScOKEqdH03NzU3KmTMjU3XMqIi5PQuTOnZrp+LG2iFtTUADk5OmRm6pGV\npbw/dUoHq9Vezh4AAC+va681c2omV2JpE9WprASys/XIzKwv6dOndY0uyfbwkJGQIKF/fxsGDzYh\nIqIasbGcmqn9sLSpQyopAbKylIJW3utx5kzj1vX2ljF4sA2JiRISEpT3cXGS4zLssDATiopsKqSn\njoylTbe8S5cEx9KGvaTPnWtc0AEBMkaOFJGQICEx0YbERBt69uT0TNrD0qZbhv3sjYblnJmpQ2Fh\n4+YNDZUwbpyIxESbo6S7dpV5NSC5BZY2uSVZBn74QXAUs30N+sqVxgUdFSVh8mRrgwlaQmQkC5rc\nF0ubNM9mA06f1jUq56ws/VWb6HfrJiEpyepYg05IkBAWJquUmqhtsLRJc8xmID1dj6+/1iMtTY/j\nx4Hqah/H1wVBeYXrCRPqp+f+/W0IDFQxNFE7YWmT6mprgRMnlIJOS9Pj2DE9amvrp+h+/YCEBKtj\nDbpfPxvPfaYOi6VN7a6mBjh+vL6kjx/XO/Z4FgQZfftKSE62YfhwG4YPF9G7NzdBIrJjaVObq64G\njh2rL+lQSYIFAAANpklEQVQTJ/SwWOpLun9/CUlJNiQl2TBsmIigIJUDE2kYS5tcrqrq6pK2X/at\n09WXdHKyiKFDuRZNdD1Y2nTTKiuBo0ftJW1AeroOolhf0omJ9klaKemAAJUDE7kxljZdt8pK4MgR\n+9kdBmRk1Je0Xi9jwAAJw4crk/SQITb4+6scmOgW0mppr1y5Evv27UNISAi2b9/eHplIYyoqgMOH\n6yfpjIz6TZT0ehkDB0pIShKRnGzDkCE8s4OoLbVa2nfffTfmzp2L1NTU9shDGlBeDnzzjVLQaWnK\nFYeSpJS0wSDjttskJCeLGD6cJU3U3lot7cGDB6OgoKA9spBKZBk4eVKHXbsMOHgQSE/3dZS00ajs\ndGc/u+P2223w8WnlAYmozXBNu4OyWoFDh/TYtcuAXbsMuHBB2bPDaARuv92G5GSlpAcPtvFVvIk0\npM1KOyzMr60e+oZ19EzV1cAXXwBbtwLbtyt7SgNAYCAwdy6QkgJMmgT4+BigtX/Ptfh7B2gzFzM5\nR4uZnNFmfzOLiira6qFvSFiYX4fMVFICfPGFATt3GrBvnwE1NcqyR2SkhAULREydKiIpyebY2N/H\np2P+Ot0ILeZiJudoNZMznCptWeZOae7kwgUBu3YpRZ2Wpnec6REba8PUqUpRDxwocYN/IjfUammv\nWLEChw8fRmlpKcaMGYMlS5Zg1qxZ7ZGNrkNurg47dypFnZ6ud3z+ttuUop4yRUSvXpKKCYnIFVot\n7bVr17ZHDrpOkqSc8WEv6rw8paj1euVls+xFHRXF/yUR3Uq09dMmapHVCqSl1Z/x8dNPyvqGl5eM\nKVOsmDpVxB13cMMlolsZS1vjqquBvXuVaXr3bgNKS5X16cBAGffeqxT1mDEiT8sj6iBY2hpUUgJ8\n/rlS1Pv315/xERUlYdYspaiHDas/44OIOg6WtkYUFAiOZY+GZ3z06lX/g8SBAyW+IC1RB8fSVtGp\nU8C775qwc6cBJ0/Wn/Hxs5/ZT82zIjaWP0gkonos7XZWWgp89JER775rxKlTAOABg0HGqFH1Z3x0\n6sSiJqLmsbTbgSwDR4/qsGGDCZ9+akBtrQCjUcbMmcCECTWYOFHkq7cQkVNY2m2orEyZqjduNOLU\nKWX5o0cPCXPnmjFnjoi+fX1RVCSqnJKI3AlL28VkGTh2TIeNG0345BPlzA+jUcZdd1kxd64VI0bY\nePk4Ed0wlraLlJUBmzcbsWFD/VTdrZuEuXMtuP9+K8LCuE5NRDePpX0TZBk4cUJZq962TZmqDQYZ\nM2YoU/XIkZyqici1WNo3oLy8fqrOyWk8Vd93nxXh4ZyqiahtsLSdJMtAeroOGzYYsW2bEdXVylQ9\nfboV8+ZZMWoUp2oianss7VZUVChT9caNRmRnK1N11671U3VEBKdqImo/LO1m2F/oduNGI7ZsUaZq\nvV7GtGnKVD16NKdqIlIHS7uBysr6qTorq36q/vnPlTNAOFUTkdpY2gAyMpS16o8/rp+qp05Vpuox\nYzhVE5F2dNjSrqwEtmxRzgDJzFSm6i5dJCxdasEDD1gRGcmpmoi0p8OVdmamDu+8o6xVV1UpU/Xk\nyVbMn69M1Xp9649BRKSWDlHalZXAtm3A3//u7dgCtXNnCYsXK1M1d9UjIndxS5d2eTnwxhsmvP66\nCeXlgE6nw+TJylr12LGcqonI/dySpV1ZCbz1lgl/+5sJpaUCQkIkrF4tICWliq9OTkRu7ZYq7epq\nYN06I/72NxOuXNEhMFDGc8+ZsXChBT16+KGoiIVNRO7tlijt2lpgwwYj/vxnE4qKdPD3l5Gaasai\nRRb4+6udjojIddy6tM1m4N13lbIuLNTBx0fG8uVmPPaYha8EQ0S3JLcsbasVeP99I/70JxMKCnTw\n9paxZIkZv/qVFSEhXAIholuXW5W2KAIffWTA2rUeOHdOB09PGY89ZsGSJRa+yAARdQhuUdo2G7Bl\niwF//KMHzp7VwWSS8cgjFixbZuF+IETUoWi6tCUJ+PRTA1591YTcXD2MRhkPPWTB449beOoeEXVI\nmixtSQJ27lTK+tQpPfR6GT//uVLWXbuyrImo49JUacsy8MUXerz8sgeys/XQ6WTMmWPF8uVm9OjB\nsiYi0kRpyzKwd69S1unpegiCjLvvtuLJJ82IjWVZExHZqVrasgwcPKiU9dGjykYgM2ZY8eSTFvTp\nI6kZjYhIk1Qr7UOH9HjpJRMOHVIiTJlixVNPWdC/P8uaiOha2r20jx7V4aWXPHDwoPKtJ04UkZpq\nxoABLGsiotY49UJaBw4cwOTJkzFp0iS88cYbN/SNTpzQ4b77vDBtmg8OHjRgzBgRu3ZV4b33aljY\nREROanXSliQJL7zwAtavX4/w8HDcc889GD9+PGJiYpz6BllZOrzyigc+/1z5ViNGiEhNtWDYMNvN\nJSci6oBaLe3MzEx069YNnTt3BgBMmzYNe/bsabW0c3J0ePVVE3bsMAIAhg4V8fTTFowYwbImIrpR\nrZb2xYsX0alTJ8fHERERyMrKavE+990HfPihN2RZwKBBNjz9tBmjR9sgCDcfmIioI2u1tGX5+s+T\n3rQJGDBAwtNPmzF+PMuaiMhVWi3tyMhIXLhwwfHxxYsXER4e3uJ9lJ7XA/C+yXiuFRbmp3aEqzCT\nc7SYCdBmLmZyjhYzOaPVs0cSEhJw7tw5FBQUwGKxYMeOHRg/fnx7ZCMioiZanbT1ej1WrVqFhx9+\nGLIs45577nH6zBEiInItQb6RRWsiIlKFUxfXEBGRNrC0iYjcCEubiMiNuHTDqAMHDmDNmjWQZRmz\nZs3CokWLXPnwN2TlypXYt28fQkJCsH37drXjAAAKCwuRmpqKy5cvQ6/XY/bs2Zg3b56qmSwWCx58\n8EFYrVbYbDZMmjQJixcvVjWTnSRJmDVrFiIiIvD666+rHQfjxo2Dr68vdDodDAYDNm/erHYkVFRU\n4LnnnkNubi50Oh3WrFmDAQMGqJrp7NmzeOKJJyAIAmRZxvnz57Fs2TLV/6yvX78emzdvhiAI6NWr\nF1588UWYTCZVM73zzjuOP0et9oHsIjabTZ4wYYKcn58vWywWecaMGXJeXp6rHv6GHT16VM7JyZGn\nT5+udhSHS5cuyTk5ObIsy3JlZaV8xx13aOLXqrq6WpZlWRZFUZ49e7ackZGhciLF22+/La9YsUJ+\n9NFH1Y4iy7Isjxs3Ti4tLVU7RiNPP/20vHnzZlmWZdlqtcoVFRUqJ2rMZrPJycnJ8oULF1TNUVhY\nKI8bN042m82yLMvysmXL5K1bt6qa6fTp0/L06dNls9ksi6IoP/TQQ/KPP/54zeNdtjzScI8So9Ho\n2KNEbYMHD4a/v7/aMRoJCwtDfHw8AMDHxwcxMTG4dOmSyqkALy8vAMrULYqiymkUhYWF2L9/P2bP\nnq12FAdZliFJ2tmZsrKyEseOHcOsWbMAAAaDAb6+viqnaiwtLQ1du3ZttCWGWiRJQk1NDURRRG1t\nbasXC7a1M2fOYODAgTCZTNDr9bj99tuxe/fuax7vstJubo8SLRSR1uXn5+O7775DYmKi2lEgSRJS\nUlKQnJyM5ORkTWRas2YNUlNTIWhoLwRBELBw4ULMmjULH374odpxkJ+fj6CgIDz77LOYOXMmVq1a\nhdraWrVjNbJz505MmzZN7RiIiIjAggULMGbMGIwaNQp+fn5ISkpSNVNcXByOHj2KsrIy1NTU4MCB\nA/jpp5+uebzLSlvm6d7XraqqCkuXLsXKlSvh4+OjdhzodDps27YNBw4cQEZGBvLy8lTNs2/fPoSG\nhiI+Pl5Tf74++OADbNmyBW+++Sbee+89HDt2TNU8oigiJycHDzzwALZu3QpPT88b3ve+LVitVnz5\n5ZeYMmWK2lFQXl6OPXv2YO/evTh48CCqq6tV/1lXTEwMfvGLX2DBggVYtGgR+vTpA4Ph2j9udFlp\n38geJR2ZKIpYunQp7rrrLkyYMEHtOI34+vpiyJAhOHjwoKo5Tpw4gS+//BLjx4/HihUrcPjwYaSm\npqqaCVCWtwAgODgYEydObHXXy7YWGRmJyMhIJCQkAAAmTZqEnJwcVTM1dODAAfTr1w/BwcFqR0Fa\nWhqio6MRGBgIvV6PiRMnIj09Xe1YmDVrFrZs2YKNGzciICAA3bp1u+axLittLe9RoqUpzW7lypWI\njY3F/Pnz1Y4CACguLkZFRQUAoLa2FocOHULPnj1VzbR8+XLs27cPe/bswWuvvYahQ4filVdeUTVT\nTU0NqqqqAADV1dX46quvEBcXp2qm0NBQdOrUCWfPngUAfPPNN5raamLHjh2YPn262jEAAFFRUcjI\nyIDZbIYsy5r5tSouLgYAXLhwAbt3727x18tlp/xpdY8S+4RWWlqKMWPGYMmSJY4f2Kjl+PHj2L59\nO3r16oWUlBQIgoAnnngCo0aNUi1TUVERnnnmGUiSBEmSMHXqVIwePVq1PFp1+fJlLF68GIIgwGaz\n4c4778SIESPUjoXnn38eTz75JERRRHR0NF588UW1IwFQBoC0tDT87ne/UzsKACAxMRGTJk1CSkoK\nDAYD+vbti3vvvVftWFiyZAnKyspgMBjwm9/8Bn5+196BkHuPEBG5EV4RSUTkRljaRERuhKVNRORG\nWNpERG6EpU1E5EZY2kREboSlTUTkRljaRERu5P8D+7Wym3BFpegAAAAASUVORK5CYII=\n",
-            "text/plain": [
-              "\u003cmatplotlib.figure.Figure at 0x7f5be4b8ec50\u003e"
-            ]
-          },
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "display_data"
-        }
-      ],
-      "source": [
-        "model = Model()\n",
-        "\n",
-        "# Collect the history of W-values and b-values to plot later\n",
-        "Ws, bs = [], []\n",
-        "epochs = range(10)\n",
-        "for epoch in epochs:\n",
-        "  Ws.append(model.W.numpy())\n",
-        "  bs.append(model.b.numpy())\n",
-        "  current_loss = loss(model(inputs), outputs)\n",
-        "\n",
-        "  train(model, inputs, outputs, learning_rate=0.1)\n",
-        "  print('Epoch %2d: W=%1.2f b=%1.2f, loss=%2.5f' %\n",
-        "        (epoch, Ws[-1], bs[-1], current_loss))\n",
-        "\n",
-        "# Let's plot it all\n",
-        "plt.plot(epochs, Ws, 'r',\n",
-        "         epochs, bs, 'b')\n",
-        "plt.plot([TRUE_W] * len(epochs), 'r--',\n",
-        "         [TRUE_b] * len(epochs), 'b--')\n",
-        "plt.legend(['W', 'b', 'true W', 'true_b'])\n",
-        "plt.show()\n",
-        "  "
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "vPnIVuaSJwWz"
-      },
-      "source": [
-        "## Next Steps\n",
-        "\n",
-        "In this tutorial we covered `Variable`s and built and trained a simple linear model using the TensorFlow primitives discussed so far.\n",
-        "\n",
-        "In theory, this is pretty much all you need to use TensorFlow for your machine learning research.\n",
-        "In practice, particularly for neural networks, the higher level APIs like `tf.keras` will be much more convenient since it provides higher level building blocks (called \"layers\"), utilities to save and restore state, a suite of loss functions, a suite of optimization strategies etc. \n",
-        "\n",
-        "The [next tutorial](TODO) will cover these higher level APIs."
-      ]
-    }
-  ],
-  "metadata": {
-    "colab": {
-      "collapsed_sections": [],
-      "default_view": {},
-      "name": "Training Models",
-      "provenance": [],
-      "version": "0.3.2",
-      "views": {}
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
-}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb
deleted file mode 100644
index 5749f22ac5..0000000000
--- a/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb
+++ /dev/null
@@ -1,551 +0,0 @@
-{
-  "cells": [
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "pwX7Fii1rwsJ"
-      },
-      "outputs": [],
-      "source": [
-        "import tensorflow as tf\n",
-        "tf.enable_eager_execution()\n",
-        "tfe = tf.contrib.eager\n"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "UEu3q4jmpKVT"
-      },
-      "source": [
-        "# High level API\n",
-        "\n",
-        "We recommend using `tf.keras` as a high-level API for building neural networks. That said, most TensorFlow APIs are usable with eager execution.\n",
-        "\n"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "zSFfVVjkrrsI"
-      },
-      "source": [
-        "## Layers: common sets of useful operations\n",
-        "\n",
-        "Most of the time when writing code for machine learning models you want to operate at a higher level of abstraction than individual operations and manipulation of individual variables.\n",
-        "\n",
-        "Many machine learning models are expressible as the composition and stacking of relatively simple layers, and TensorFlow provides both a set of many common layers as a well as easy ways for you to write your own application-specific layers either from scratch or as the composition of existing layers.\n",
-        "\n",
-        "TensorFlow includes the full [Keras](https://keras.io) API in the tf.keras package, and the Keras layers are very useful when building your own models.\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          }
-        },
-        "colab_type": "code",
-        "id": "8PyXlPl-4TzQ"
-      },
-      "outputs": [],
-      "source": [
-        "# In the tf.keras.layers package, layers are objects. To construct a layer,\n",
-        "# simply construct the object. Most layers take as a first argument the number\n",
-        "# of output dimensions / channels.\n",
-        "layer = tf.keras.layers.Dense(100)\n",
-        "# The number of input dimensions is often unnecessary, as it can be inferred\n",
-        "# the first time the layer is used, but it can be provided if you want to \n",
-        "# specify it manually, which is useful in some complex models.\n",
-        "layer = tf.keras.layers.Dense(10, input_shape=(None, 5))"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "Fn69xxPO5Psr"
-      },
-      "source": [
-        "The full list of pre-existing layers can be seen in [the documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers). It includes Dense (a fully-connected layer),\n",
-        "Conv2D, LSTM, BatchNormalization, Dropout, and many others."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 3,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 204
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 244,
-          "status": "ok",
-          "timestamp": 1527783641557,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "E3XKNknP5Mhb",
-        "outputId": "c5d52434-d980-4488-efa7-5660819d0207"
-      },
-      "outputs": [
-        {
-          "data": {
-            "text/plain": [
-              "\u003ctf.Tensor: id=30, shape=(10, 10), dtype=float32, numpy=\n",
-              "array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],\n",
-              "       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)\u003e"
-            ]
-          },
-          "execution_count": 3,
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "execute_result"
-        }
-      ],
-      "source": [
-        "# To use a layer, simply call it.\n",
-        "layer(tf.zeros([10, 5]))"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 4,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 221
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 320,
-          "status": "ok",
-          "timestamp": 1527783642457,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "Wt_Nsv-L5t2s",
-        "outputId": "f0d96dce-0128-4080-bfe2-0ee6fbc0ad90"
-      },
-      "outputs": [
-        {
-          "data": {
-            "text/plain": [
-              "[\u003ctf.Variable 'dense_1/kernel:0' shape=(5, 10) dtype=float32, numpy=\n",
-              " array([[ 0.43788117, -0.62099844, -0.30525017, -0.59352523,  0.1783089 ,\n",
-              "          0.47078604, -0.23620895, -0.30482283,  0.01366901, -0.1288507 ],\n",
-              "        [ 0.18407935, -0.56550485,  0.54180616, -0.42254075,  0.3702994 ,\n",
-              "          0.36705834, -0.29678228,  0.36660975,  0.36717761,  0.46269661],\n",
-              "        [ 0.1709305 , -0.11529458,  0.32710236,  0.46300393, -0.62802851,\n",
-              "          0.51641601,  0.39624029,  0.26918125, -0.25196898,  0.21353298],\n",
-              "        [ 0.35752094,  0.44161648,  0.61500639, -0.12653333,  0.41629118,\n",
-              "          0.36193585,  0.066082  , -0.59253877,  0.47318751,  0.17115968],\n",
-              "        [-0.22554061, -0.17727301,  0.5525015 ,  0.3678053 , -0.00454676,\n",
-              "          0.24066836, -0.53640735,  0.13792562, -0.10727292,  0.59708995]], dtype=float32)\u003e,\n",
-              " \u003ctf.Variable 'dense_1/bias:0' shape=(10,) dtype=float32, numpy=array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)\u003e]"
-            ]
-          },
-          "execution_count": 4,
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "execute_result"
-        }
-      ],
-      "source": [
-        "# Layers have many useful methods. For example, you can inspect all variables\n",
-        "# in a layer by calling layer.variables. In this case a fully-connected layer\n",
-        "# will have variables for weights and biases.\n",
-        "layer.variables"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 5,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 221
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 226,
-          "status": "ok",
-          "timestamp": 1527783643252,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "6ilvKjz8_4MQ",
-        "outputId": "f647fced-c2d7-41a3-c237-242036784665"
-      },
-      "outputs": [
-        {
-          "data": {
-            "text/plain": [
-              "(\u003ctf.Variable 'dense_1/kernel:0' shape=(5, 10) dtype=float32, numpy=\n",
-              " array([[ 0.43788117, -0.62099844, -0.30525017, -0.59352523,  0.1783089 ,\n",
-              "          0.47078604, -0.23620895, -0.30482283,  0.01366901, -0.1288507 ],\n",
-              "        [ 0.18407935, -0.56550485,  0.54180616, -0.42254075,  0.3702994 ,\n",
-              "          0.36705834, -0.29678228,  0.36660975,  0.36717761,  0.46269661],\n",
-              "        [ 0.1709305 , -0.11529458,  0.32710236,  0.46300393, -0.62802851,\n",
-              "          0.51641601,  0.39624029,  0.26918125, -0.25196898,  0.21353298],\n",
-              "        [ 0.35752094,  0.44161648,  0.61500639, -0.12653333,  0.41629118,\n",
-              "          0.36193585,  0.066082  , -0.59253877,  0.47318751,  0.17115968],\n",
-              "        [-0.22554061, -0.17727301,  0.5525015 ,  0.3678053 , -0.00454676,\n",
-              "          0.24066836, -0.53640735,  0.13792562, -0.10727292,  0.59708995]], dtype=float32)\u003e,\n",
-              " \u003ctf.Variable 'dense_1/bias:0' shape=(10,) dtype=float32, numpy=array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)\u003e)"
-            ]
-          },
-          "execution_count": 5,
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "execute_result"
-        }
-      ],
-      "source": [
-        "# The variables are also accessible through nice accessors\n",
-        "layer.kernel, layer.bias"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "O0kDbE54-5VS"
-      },
-      "source": [
-        "## Implementing custom layers\n",
-        "The best way to implement your own layer is extending the tf.keras.Layer class and implementing:\n",
-        "  *  `__init__` , where you can do all input-independent initialization\n",
-        "  * `build`, where you know the shapes of the input tensors and can do the rest of the initialization\n",
-        "  * `call`, where you do the forward computation\n",
-        "\n",
-        "Note that you don't have to wait until `build` is called to create your variables, you can also create them in `__init__`. However, the advantage of creating them in `build` is that it enables late variable creation based on the shape of the inputs the layer will operate on. On the other hand, creating variables in `__init__` would mean that shapes required to create the variables will need to be explicitly specified."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 7,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 391
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 251,
-          "status": "ok",
-          "timestamp": 1527783661512,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "5Byl3n1k5kIy",
-        "outputId": "6e7f9285-649a-4132-82ce-73ea92f15862"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "tf.Tensor(\n",
-            "[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n",
-            " [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]], shape=(10, 10), dtype=float32)\n",
-            "[\u003ctf.Variable 'my_dense_layer_1/kernel:0' shape=(5, 10) dtype=float32, numpy=\n",
-            "array([[-0.4011991 ,  0.22458655, -0.33237562, -0.25117266,  0.33528614,\n",
-            "        -0.01392961,  0.58580834, -0.16346583,  0.28465688, -0.47191954],\n",
-            "       [-0.52922136,  0.22416979, -0.58209574, -0.60914612,  0.05226624,\n",
-            "        -0.18325993,  0.5591442 , -0.24718609,  0.37148207,  0.40475875],\n",
-            "       [ 0.16912812, -0.47618777, -0.38989353,  0.30105609, -0.08085585,\n",
-            "         0.44758242,  0.545829  ,  0.51421839,  0.11063248,  0.20159996],\n",
-            "       [ 0.34073615, -0.59835428,  0.06498981, -0.44489855, -0.34302285,\n",
-            "         0.20969599,  0.35527444, -0.03173476, -0.22227573,  0.09303057],\n",
-            "       [ 0.41764337, -0.06435019, -0.52509922, -0.39957345,  0.56811184,\n",
-            "         0.23481232, -0.61666459,  0.31144124, -0.11532354, -0.42421889]], dtype=float32)\u003e]\n"
-          ]
-        }
-      ],
-      "source": [
-        "class MyDenseLayer(tf.keras.layers.Layer):\n",
-        "  def __init__(self, num_outputs):\n",
-        "    super(MyDenseLayer, self).__init__()\n",
-        "    self.num_outputs = num_outputs\n",
-        "    \n",
-        "  def build(self, input_shape):\n",
-        "    self.kernel = self.add_variable(\"kernel\", \n",
-        "                                    shape=[input_shape[-1].value, \n",
-        "                                           self.num_outputs])\n",
-        "    \n",
-        "  def call(self, input):\n",
-        "    return tf.matmul(input, self.kernel)\n",
-        "  \n",
-        "layer = MyDenseLayer(10)\n",
-        "print(layer(tf.zeros([10, 5])))\n",
-        "print(layer.variables)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "tk8E2vY0-z4Z"
-      },
-      "source": [
-        "Note that you don't have to wait until `build` is called to create your variables, you can also create them in `__init__`.\n",
-        "\n",
-        "Overall code is easier to read and maintain if it uses standard layers whenever possible, as other readers will be familiar with the behavior of standard layers. If you want to use a layer which is not present in tf.keras.layers or tf.contrib.layers, consider filing a [github issue](http://github.com/tensorflow/tensorflow/issues/new) or, even better, sending us a pull request!"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "Qhg4KlbKrs3G"
-      },
-      "source": [
-        "## Models: composing layers\n",
-        "\n",
-        "Many interesting layer-like things in machine learning models are implemented by composing existing layers. For example, each residual block in a resnet is a composition of convolutions, batch normalizations, and a shortcut.\n",
-        "\n",
-        "The main class used when creating a layer-like thing which contains other layers is tf.keras.Model. Implementing one is done by inheriting from tf.keras.Model."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 9,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "height": 190
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 420,
-          "status": "ok",
-          "timestamp": 1527783698512,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "N30DTXiRASlb",
-        "outputId": "a8b23a8e-5cf9-4bbf-f93b-6c763d74e2b3"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "tf.Tensor(\n",
-            "[[[[ 0.  0.  0.]\n",
-            "   [ 0.  0.  0.]\n",
-            "   [ 0.  0.  0.]]\n",
-            "\n",
-            "  [[ 0.  0.  0.]\n",
-            "   [ 0.  0.  0.]\n",
-            "   [ 0.  0.  0.]]]], shape=(1, 2, 3, 3), dtype=float32)\n",
-            "['resnet_identity_block_1/conv2d_3/kernel:0', 'resnet_identity_block_1/conv2d_3/bias:0', 'resnet_identity_block_1/batch_normalization_3/gamma:0', 'resnet_identity_block_1/batch_normalization_3/beta:0', 'resnet_identity_block_1/conv2d_4/kernel:0', 'resnet_identity_block_1/conv2d_4/bias:0', 'resnet_identity_block_1/batch_normalization_4/gamma:0', 'resnet_identity_block_1/batch_normalization_4/beta:0', 'resnet_identity_block_1/conv2d_5/kernel:0', 'resnet_identity_block_1/conv2d_5/bias:0', 'resnet_identity_block_1/batch_normalization_5/gamma:0', 'resnet_identity_block_1/batch_normalization_5/beta:0', 'resnet_identity_block_1/batch_normalization_3/moving_mean:0', 'resnet_identity_block_1/batch_normalization_3/moving_variance:0', 'resnet_identity_block_1/batch_normalization_4/moving_mean:0', 'resnet_identity_block_1/batch_normalization_4/moving_variance:0', 'resnet_identity_block_1/batch_normalization_5/moving_mean:0', 'resnet_identity_block_1/batch_normalization_5/moving_variance:0']\n"
-          ]
-        }
-      ],
-      "source": [
-        "class ResnetIdentityBlock(tf.keras.Model):\n",
-        "  def __init__(self, kernel_size, filters):\n",
-        "    super(ResnetIdentityBlock, self).__init__(name='')\n",
-        "    filters1, filters2, filters3 = filters\n",
-        "\n",
-        "    self.conv2a = tf.keras.layers.Conv2D(filters1, (1, 1))\n",
-        "    self.bn2a = tf.keras.layers.BatchNormalization()\n",
-        "\n",
-        "    self.conv2b = tf.keras.layers.Conv2D(filters2, kernel_size, padding='same')\n",
-        "    self.bn2b = tf.keras.layers.BatchNormalization()\n",
-        "\n",
-        "    self.conv2c = tf.keras.layers.Conv2D(filters3, (1, 1))\n",
-        "    self.bn2c = tf.keras.layers.BatchNormalization()\n",
-        "\n",
-        "  def call(self, input_tensor, training=False):\n",
-        "    x = self.conv2a(input_tensor)\n",
-        "    x = self.bn2a(x, training=training)\n",
-        "    x = tf.nn.relu(x)\n",
-        "\n",
-        "    x = self.conv2b(x)\n",
-        "    x = self.bn2b(x, training=training)\n",
-        "    x = tf.nn.relu(x)\n",
-        "\n",
-        "    x = self.conv2c(x)\n",
-        "    x = self.bn2c(x, training=training)\n",
-        "\n",
-        "    x += input_tensor\n",
-        "    return tf.nn.relu(x)\n",
-        "\n",
-        "    \n",
-        "block = ResnetIdentityBlock(1, [1, 2, 3])\n",
-        "print(block(tf.zeros([1, 2, 3, 3])))\n",
-        "print([x.name for x in block.variables])"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "wYfucVw65PMj"
-      },
-      "source": [
-        "Much of the time, however, models which compose many layers simply call one layer after the other. This can be done in very little code using tf.keras.Sequential"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 0,
-      "metadata": {
-        "colab": {
-          "autoexec": {
-            "startup": false,
-            "wait_interval": 0
-          },
-          "base_uri": "https://localhost:8080/",
-          "height": 153
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 361,
-          "status": "ok",
-          "timestamp": 1526674830777,
-          "user": {
-            "displayName": "Alexandre Passos",
-            "photoUrl": "//lh4.googleusercontent.com/-kmTTWXEgAPw/AAAAAAAAAAI/AAAAAAAAAC0/q_DoOzKGwds/s50-c-k-no/photo.jpg",
-            "userId": "108023195365833072773"
-          },
-          "user_tz": 420
-        },
-        "id": "L9frk7Ur4uvJ",
-        "outputId": "882e9076-b6d9-4380-bb1e-7c6b57d54c39"
-      },
-      "outputs": [
-        {
-          "data": {
-            "text/plain": [
-              "\u003ctf.Tensor: id=1423, shape=(1, 2, 3, 3), dtype=float32, numpy=\n",
-              "array([[[[0., 0., 0.],\n",
-              "         [0., 0., 0.],\n",
-              "         [0., 0., 0.]],\n",
-              "\n",
-              "        [[0., 0., 0.],\n",
-              "         [0., 0., 0.],\n",
-              "         [0., 0., 0.]]]], dtype=float32)\u003e"
-            ]
-          },
-          "execution_count": 26,
-          "metadata": {
-            "tags": []
-          },
-          "output_type": "execute_result"
-        }
-      ],
-      "source": [
-        " my_seq = tf.keras.Sequential([tf.keras.layers.Conv2D(1, (1, 1)),\n",
-        "                               tf.keras.layers.BatchNormalization(),\n",
-        "                               tf.keras.layers.Conv2D(2, 1, \n",
-        "                                                      padding='same'),\n",
-        "                               tf.keras.layers.BatchNormalization(),\n",
-        "                               tf.keras.layers.Conv2D(3, (1, 1)),\n",
-        "                               tf.keras.layers.BatchNormalization()])\n",
-        "my_seq(tf.zeros([1, 2, 3, 3]))"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "colab_type": "text",
-        "id": "c5YwYcnuK-wc"
-      },
-      "source": [
-        "# Next steps\n",
-        "\n",
-        "Now you can go back to the previous notebook and adapt the linear regression example to use layers and models to be better structured."
-      ]
-    }
-  ],
-  "metadata": {
-    "colab": {
-      "collapsed_sections": [],
-      "default_view": {},
-      "name": "4 - High level API - TensorFlow Eager.ipynb",
-      "provenance": [],
-      "version": "0.3.2",
-      "views": {}
-    },
-    "kernelspec": {
-      "display_name": "Python 3",
-      "name": "python3"
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
-}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/README.md b/tensorflow/contrib/eager/python/examples/notebooks/README.md
new file mode 100644
index 0000000000..0d5ed84894
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/notebooks/README.md
@@ -0,0 +1,11 @@
+## Research and experimentation
+
+Eager execution provides an imperative, define-by-run interface for advanced
+operations. Write custom layers, forward passes, and training loops with auto
+differentiation. Start with these notebooks, then read the
+[eager execution guide](https://www.tensorflow.org/guide/eager).
+
+1. [Eager execution basics](./eager_basics.ipynb)
+2. [Automatic differentiation and gradient tapes](./automatic_differentiation.ipynb)
+3. [Custom training: basics](./custom_training.ipynb)
+4. [Custom layers](./custom_layers.ipynb)
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb
new file mode 100644
index 0000000000..51b7ffc4de
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb
@@ -0,0 +1,366 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "t09eeeR5prIJ"
+      },
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "cellView": "form",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "GCCk8_dHpuNf"
+      },
+      "outputs": [],
+      "source": [
+        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+        "# you may not use this file except in compliance with the License.\n",
+        "# You may obtain a copy of the License at\n",
+        "#\n",
+        "# https://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing, software\n",
+        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+        "# See the License for the specific language governing permissions and\n",
+        "# limitations under the License."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "xh8WkEwWpnm7"
+      },
+      "source": [
+        "# Automatic differentiation and gradient tape"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "idv0bPeCp325"
+      },
+      "source": [
+        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\u003ctd\u003e\n",
+        "\u003ca target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb\"\u003e\n",
+        "    \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
+        "\u003c/td\u003e\u003ctd\u003e\n",
+        "\u003ca target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb\"\u003e\u003cimg width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\u003c/td\u003e\u003c/table\u003e"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "vDJ4XzMqodTy"
+      },
+      "source": [
+        "In the previous tutorial we introduced `Tensor`s and operations on them. In this tutorial we will cover [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), a key technique for optimizing machine learning models."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "GQJysDM__Qb0"
+      },
+      "source": [
+        "## Setup\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "OiMPZStlibBv"
+      },
+      "outputs": [],
+      "source": [
+        "import tensorflow as tf\n",
+        "tf.enable_eager_execution()\n",
+        "\n",
+        "tfe = tf.contrib.eager # Shorthand for some symbols"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "1CLWJl0QliB0"
+      },
+      "source": [
+        "## Derivatives of a function\n",
+        "\n",
+        "TensorFlow provides APIs for automatic differentiation - computing the derivative of a function. The way that more closely mimics the math is to encapsulate the computation in a Python function, say `f`, and use `tfe.gradients_function` to create a function that computes the derivatives of `f` with respect to its arguments. If you're familiar with [autograd](https://github.com/HIPS/autograd) for differentiating numpy functions, this will be familiar. For example: "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "9FViq92UX7P8"
+      },
+      "outputs": [],
+      "source": [
+        "from math import pi\n",
+        "\n",
+        "def f(x):\n",
+        "  return tf.square(tf.sin(x))\n",
+        "\n",
+        "assert f(pi/2).numpy() == 1.0\n",
+        "\n",
+        "\n",
+        "# grad_f will return a list of derivatives of f\n",
+        "# with respect to its arguments. Since f() has a single argument,\n",
+        "# grad_f will return a list with a single element.\n",
+        "grad_f = tfe.gradients_function(f)\n",
+        "assert tf.abs(grad_f(pi/2)[0]).numpy() \u003c 1e-7"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "v9fPs8RyopCf"
+      },
+      "source": [
+        "### Higher-order gradients\n",
+        "\n",
+        "The same API can be used to differentiate as many times as you like:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "3D0ZvnGYo0rW"
+      },
+      "outputs": [],
+      "source": [
+        "def f(x):\n",
+        "  return tf.square(tf.sin(x))\n",
+        "\n",
+        "def grad(f):\n",
+        "  return lambda x: tfe.gradients_function(f)(x)[0]\n",
+        "\n",
+        "x = tf.lin_space(-2*pi, 2*pi, 100)  # 100 points between -2π and +2π\n",
+        "\n",
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "plt.plot(x, f(x), label=\"f\")\n",
+        "plt.plot(x, grad(f)(x), label=\"first derivative\")\n",
+        "plt.plot(x, grad(grad(f))(x), label=\"second derivative\")\n",
+        "plt.plot(x, grad(grad(grad(f)))(x), label=\"third derivative\")\n",
+        "plt.legend()\n",
+        "plt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "-39gouo7mtgu"
+      },
+      "source": [
+        "## Gradient tapes\n",
+        "\n",
+        "Every differentiable TensorFlow operation has an associated gradient function. For example, the gradient function of `tf.square(x)` would be a function that returns `2.0 * x`.  To compute the gradient of a user-defined function (like `f(x)` in the example above), TensorFlow first \"records\" all the operations applied to compute the output of the function. We call this record a \"tape\". It then uses that tape and the gradients functions associated with each primitive operation to compute the gradients of the user-defined function using [reverse mode differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation).\n",
+        "\n",
+        "Since operations are recorded as they are executed, Python control flow (using `if`s and `while`s for example) is naturally handled:\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "MH0UfjympWf7"
+      },
+      "outputs": [],
+      "source": [
+        "def f(x, y):\n",
+        "  output = 1\n",
+        "  # Must use range(int(y)) instead of range(y) in Python 3 when\n",
+        "  # using TensorFlow 1.10 and earlier. Can use range(y) in 1.11+\n",
+        "  for i in range(int(y)):\n",
+        "    output = tf.multiply(output, x)\n",
+        "  return output\n",
+        "\n",
+        "def g(x, y):\n",
+        "  # Return the gradient of `f` with respect to it's first parameter\n",
+        "  return tfe.gradients_function(f)(x, y)[0]\n",
+        "\n",
+        "assert f(3.0, 2).numpy() == 9.0   # f(x, 2) is essentially x * x\n",
+        "assert g(3.0, 2).numpy() == 6.0   # And its gradient will be 2 * x\n",
+        "assert f(4.0, 3).numpy() == 64.0  # f(x, 3) is essentially x * x * x\n",
+        "assert g(4.0, 3).numpy() == 48.0  # And its gradient will be 3 * x * x"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "aNmR5-jhpX2t"
+      },
+      "source": [
+        "At times it may be inconvenient to encapsulate computation of interest into a function. For example, if you want the gradient of the output with respect to intermediate values computed in the function. In such cases, the slightly more verbose but explicit [tf.GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape) context is useful. All computation inside the context of a `tf.GradientTape` is \"recorded\".\n",
+        "\n",
+        "For example:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "bAFeIE8EuVIq"
+      },
+      "outputs": [],
+      "source": [
+        "x = tf.ones((2, 2))\n",
+        "  \n",
+        "# TODO(b/78880779): Remove the 'persistent=True' argument and use\n",
+        "# a single t.gradient() call when the bug is resolved.\n",
+        "with tf.GradientTape(persistent=True) as t:\n",
+        "  # TODO(ashankar): Explain with \"watch\" argument better?\n",
+        "  t.watch(x)\n",
+        "  y = tf.reduce_sum(x)\n",
+        "  z = tf.multiply(y, y)\n",
+        "\n",
+        "# Use the same tape to compute the derivative of z with respect to the\n",
+        "# intermediate value y.\n",
+        "dz_dy = t.gradient(z, y)\n",
+        "assert dz_dy.numpy() == 8.0\n",
+        "\n",
+        "# Derivative of z with respect to the original input tensor x\n",
+        "dz_dx = t.gradient(z, x)\n",
+        "for i in [0, 1]:\n",
+        "  for j in [0, 1]:\n",
+        "    assert dz_dx[i][j].numpy() == 8.0"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "DK05KXrAAld3"
+      },
+      "source": [
+        "### Higher-order gradients\n",
+        "\n",
+        "Operations inside of the `GradientTape` context manager are recorded for automatic differentiation. If gradients are computed in that context, then the gradient computation is recorded as well. As a result, the exact same API works for higher-order gradients as well. For example:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 0,
+      "metadata": {
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "colab_type": "code",
+        "id": "cPQgthZ7ugRJ"
+      },
+      "outputs": [],
+      "source": [
+        "# TODO(ashankar): Should we use the persistent tape here instead? Follow up on Tom and Alex's discussion\n",
+        "\n",
+        "x = tf.constant(1.0)  # Convert the Python 1.0 to a Tensor object\n",
+        "\n",
+        "with tf.GradientTape() as t:\n",
+        "  with tf.GradientTape() as t2:\n",
+        "    t2.watch(x)\n",
+        "    y = x * x * x\n",
+        "  # Compute the gradient inside the 't' context manager\n",
+        "  # which means the gradient computation is differentiable as well.\n",
+        "  dy_dx = t2.gradient(y, x)\n",
+        "d2y_dx2 = t.gradient(dy_dx, x)\n",
+        "\n",
+        "assert dy_dx.numpy() == 3.0\n",
+        "assert d2y_dx2.numpy() == 6.0"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "colab_type": "text",
+        "id": "4U1KKzUpNl58"
+      },
+      "source": [
+        "## Next Steps\n",
+        "\n",
+        "In this tutorial we covered gradient computation in TensorFlow. With that we have enough of the primitives required to build an train neural networks, which we will cover in the [next tutorial](https://github.com/tensorflow/models/tree/master/official/contrib/eager/python/examples/notebooks/3_neural_networks.ipynb)."
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [],
+      "default_view": {},
+      "name": "automatic_differentiation.ipynb",
+      "private_outputs": true,
+      "provenance": [],
+      "toc_visible": true,
+      "version": "0.3.2",
+      "views": {}
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb
new file mode 100644
index 0000000000..a0bbbb6123
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb
@@ -0,0 +1,399 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "custom_layers.ipynb",
+      "version": "0.3.2",
+      "views": {},
+      "default_view": {},
+      "provenance": [],
+      "private_outputs": true,
+      "collapsed_sections": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    }
+  },
+  "cells": [
+    {
+      "metadata": {
+        "id": "tDnwEv8FtJm7",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "JlknJBWQtKkI",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "cellView": "form"
+      },
+      "cell_type": "code",
+      "source": [
+        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+        "# you may not use this file except in compliance with the License.\n",
+        "# You may obtain a copy of the License at\n",
+        "#\n",
+        "# https://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing, software\n",
+        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+        "# See the License for the specific language governing permissions and\n",
+        "# limitations under the License."
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "60RdWsg1tETW",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Custom layers"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "BcJg7Enms86w",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "<table class=\"tfo-notebook-buttons\" align=\"left\"><td>\n",
+        "<a target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb\">\n",
+        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
+        "</td><td>\n",
+        "<a target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a></td></table>"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "UEu3q4jmpKVT",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "We recommend using `tf.keras` as a high-level API for building neural networks. That said, most TensorFlow APIs are usable with eager execution.\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "pwX7Fii1rwsJ",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "import tensorflow as tf\n",
+        "tfe = tf.contrib.eager\n",
+        "\n",
+        "tf.enable_eager_execution()"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "zSFfVVjkrrsI",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Layers: common sets of useful operations\n",
+        "\n",
+        "Most of the time when writing code for machine learning models you want to operate at a higher level of abstraction than individual operations and manipulation of individual variables.\n",
+        "\n",
+        "Many machine learning models are expressible as the composition and stacking of relatively simple layers, and TensorFlow provides both a set of many common layers as a well as easy ways for you to write your own application-specific layers either from scratch or as the composition of existing layers.\n",
+        "\n",
+        "TensorFlow includes the full [Keras](https://keras.io) API in the tf.keras package, and the Keras layers are very useful when building your own models.\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "8PyXlPl-4TzQ",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# In the tf.keras.layers package, layers are objects. To construct a layer,\n",
+        "# simply construct the object. Most layers take as a first argument the number\n",
+        "# of output dimensions / channels.\n",
+        "layer = tf.keras.layers.Dense(100)\n",
+        "# The number of input dimensions is often unnecessary, as it can be inferred\n",
+        "# the first time the layer is used, but it can be provided if you want to \n",
+        "# specify it manually, which is useful in some complex models.\n",
+        "layer = tf.keras.layers.Dense(10, input_shape=(None, 5))"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "Fn69xxPO5Psr",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "The full list of pre-existing layers can be seen in [the documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers). It includes Dense (a fully-connected layer),\n",
+        "Conv2D, LSTM, BatchNormalization, Dropout, and many others."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "E3XKNknP5Mhb",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# To use a layer, simply call it.\n",
+        "layer(tf.zeros([10, 5]))"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "Wt_Nsv-L5t2s",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# Layers have many useful methods. For example, you can inspect all variables\n",
+        "# in a layer by calling layer.variables. In this case a fully-connected layer\n",
+        "# will have variables for weights and biases.\n",
+        "layer.variables"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "6ilvKjz8_4MQ",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# The variables are also accessible through nice accessors\n",
+        "layer.kernel, layer.bias"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "O0kDbE54-5VS",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Implementing custom layers\n",
+        "The best way to implement your own layer is extending the tf.keras.Layer class and implementing:\n",
+        "  *  `__init__` , where you can do all input-independent initialization\n",
+        "  * `build`, where you know the shapes of the input tensors and can do the rest of the initialization\n",
+        "  * `call`, where you do the forward computation\n",
+        "\n",
+        "Note that you don't have to wait until `build` is called to create your variables, you can also create them in `__init__`. However, the advantage of creating them in `build` is that it enables late variable creation based on the shape of the inputs the layer will operate on. On the other hand, creating variables in `__init__` would mean that shapes required to create the variables will need to be explicitly specified."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "5Byl3n1k5kIy",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "class MyDenseLayer(tf.keras.layers.Layer):\n",
+        "  def __init__(self, num_outputs):\n",
+        "    super(MyDenseLayer, self).__init__()\n",
+        "    self.num_outputs = num_outputs\n",
+        "    \n",
+        "  def build(self, input_shape):\n",
+        "    self.kernel = self.add_variable(\"kernel\", \n",
+        "                                    shape=[input_shape[-1].value, \n",
+        "                                           self.num_outputs])\n",
+        "    \n",
+        "  def call(self, input):\n",
+        "    return tf.matmul(input, self.kernel)\n",
+        "  \n",
+        "layer = MyDenseLayer(10)\n",
+        "print(layer(tf.zeros([10, 5])))\n",
+        "print(layer.variables)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "tk8E2vY0-z4Z",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Note that you don't have to wait until `build` is called to create your variables, you can also create them in `__init__`.\n",
+        "\n",
+        "Overall code is easier to read and maintain if it uses standard layers whenever possible, as other readers will be familiar with the behavior of standard layers. If you want to use a layer which is not present in tf.keras.layers or tf.contrib.layers, consider filing a [github issue](http://github.com/tensorflow/tensorflow/issues/new) or, even better, sending us a pull request!"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Qhg4KlbKrs3G",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Models: composing layers\n",
+        "\n",
+        "Many interesting layer-like things in machine learning models are implemented by composing existing layers. For example, each residual block in a resnet is a composition of convolutions, batch normalizations, and a shortcut.\n",
+        "\n",
+        "The main class used when creating a layer-like thing which contains other layers is tf.keras.Model. Implementing one is done by inheriting from tf.keras.Model."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "N30DTXiRASlb",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "class ResnetIdentityBlock(tf.keras.Model):\n",
+        "  def __init__(self, kernel_size, filters):\n",
+        "    super(ResnetIdentityBlock, self).__init__(name='')\n",
+        "    filters1, filters2, filters3 = filters\n",
+        "\n",
+        "    self.conv2a = tf.keras.layers.Conv2D(filters1, (1, 1))\n",
+        "    self.bn2a = tf.keras.layers.BatchNormalization()\n",
+        "\n",
+        "    self.conv2b = tf.keras.layers.Conv2D(filters2, kernel_size, padding='same')\n",
+        "    self.bn2b = tf.keras.layers.BatchNormalization()\n",
+        "\n",
+        "    self.conv2c = tf.keras.layers.Conv2D(filters3, (1, 1))\n",
+        "    self.bn2c = tf.keras.layers.BatchNormalization()\n",
+        "\n",
+        "  def call(self, input_tensor, training=False):\n",
+        "    x = self.conv2a(input_tensor)\n",
+        "    x = self.bn2a(x, training=training)\n",
+        "    x = tf.nn.relu(x)\n",
+        "\n",
+        "    x = self.conv2b(x)\n",
+        "    x = self.bn2b(x, training=training)\n",
+        "    x = tf.nn.relu(x)\n",
+        "\n",
+        "    x = self.conv2c(x)\n",
+        "    x = self.bn2c(x, training=training)\n",
+        "\n",
+        "    x += input_tensor\n",
+        "    return tf.nn.relu(x)\n",
+        "\n",
+        "    \n",
+        "block = ResnetIdentityBlock(1, [1, 2, 3])\n",
+        "print(block(tf.zeros([1, 2, 3, 3])))\n",
+        "print([x.name for x in block.variables])"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "wYfucVw65PMj",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Much of the time, however, models which compose many layers simply call one layer after the other. This can be done in very little code using tf.keras.Sequential"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "L9frk7Ur4uvJ",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        " my_seq = tf.keras.Sequential([tf.keras.layers.Conv2D(1, (1, 1)),\n",
+        "                               tf.keras.layers.BatchNormalization(),\n",
+        "                               tf.keras.layers.Conv2D(2, 1, \n",
+        "                                                      padding='same'),\n",
+        "                               tf.keras.layers.BatchNormalization(),\n",
+        "                               tf.keras.layers.Conv2D(3, (1, 1)),\n",
+        "                               tf.keras.layers.BatchNormalization()])\n",
+        "my_seq(tf.zeros([1, 2, 3, 3]))"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "c5YwYcnuK-wc",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Next steps\n",
+        "\n",
+        "Now you can go back to the previous notebook and adapt the linear regression example to use layers and models to be better structured."
+      ]
+    }
+  ]
+}
+\ No newline at end of file
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb
new file mode 100644
index 0000000000..5f1b48fa0d
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb
@@ -0,0 +1,477 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "Custom training: basics",
+      "version": "0.3.2",
+      "views": {},
+      "default_view": {},
+      "provenance": [],
+      "private_outputs": true,
+      "collapsed_sections": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
+  "cells": [
+    {
+      "metadata": {
+        "id": "5rmpybwysXGV",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "m8y3rGtQsYP2",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "cellView": "form"
+      },
+      "cell_type": "code",
+      "source": [
+        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+        "# you may not use this file except in compliance with the License.\n",
+        "# You may obtain a copy of the License at\n",
+        "#\n",
+        "# https://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing, software\n",
+        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+        "# See the License for the specific language governing permissions and\n",
+        "# limitations under the License."
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "hrXv0rU9sIma",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Custom training: basics"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "7S0BwJ_8sLu7",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "<table class=\"tfo-notebook-buttons\" align=\"left\"><td>\n",
+        "<a target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb\">\n",
+        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
+        "</td><td>\n",
+        "<a target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a></td></table>"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "k2o3TTG4TFpt",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "In the previous tutorial we covered the TensorFlow APIs for automatic differentiation, a basic building block for machine learning.\n",
+        "In this tutorial we will use the TensorFlow primitives introduced in the prior tutorials to do some simple machine learning.\n",
+        "\n",
+        "TensorFlow also includes a higher-level neural networks API (`tf.keras`) which provides useful abstractions to reduce boilerplate. We strongly recommend those higher level APIs for people working with neural networks. However, in this short tutorial we cover neural network training from first principles to establish a strong foundation."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "3LXMVuV0VhDr",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Setup"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "PJ64L90aVir3",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "import tensorflow as tf\n",
+        "\n",
+        "tf.enable_eager_execution()"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "eMAWbDJFVmMk",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Variables\n",
+        "\n",
+        "Tensors in TensorFlow are immutable stateless objects. Machine learning models, however, need to have changing state: as your model trains, the same code to compute predictions should behave differently over time (hopefully with a lower loss!). To represent this state which needs to change over the course of your computation, you can choose to rely on the fact that Python is a stateful programming language:\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "VkJwtLS_Jbn8",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "# Using python state\n",
+        "x = tf.zeros([10, 10])\n",
+        "x += 2  # This is equivalent to x = x + 2, which does not mutate the original\n",
+        "        # value of x\n",
+        "print(x)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "wfneTXy7JcUz",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "TensorFlow, however, has stateful operations built in, and these are often more pleasant to use than low-level Python representations of your state. To represent weights in a model, for example, it's often convenient and efficient to use TensorFlow variables.\n",
+        "\n",
+        "A Variable is an object which stores a value and, when used in a TensorFlow computation, will implicitly read from this stored value. There are operations (`tf.assign_sub`, `tf.scatter_update`, etc) which manipulate the value stored in a TensorFlow variable."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "itxmrMil6DQi",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "v = tf.Variable(1.0)\n",
+        "assert v.numpy() == 1.0\n",
+        "\n",
+        "# Re-assign the value\n",
+        "v.assign(3.0)\n",
+        "assert v.numpy() == 3.0\n",
+        "\n",
+        "# Use `v` in a TensorFlow operation like tf.square() and reassign\n",
+        "v.assign(tf.square(v))\n",
+        "assert v.numpy() == 9.0"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "-paSaeq1JzwC",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Computations using Variables are automatically traced when computing gradients. For Variables representing embeddings TensorFlow will do sparse updates by default, which are more computation and memory efficient.\n",
+        "\n",
+        "Using Variables is also a way to quickly let a reader of your code know that this piece of state is mutable."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "BMiFcDzE7Qu3",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Example: Fitting a linear model\n",
+        "\n",
+        "Let's now put the few concepts we have so far ---`Tensor`, `GradientTape`, `Variable` --- to build and train a simple model. This typically involves a few steps:\n",
+        "\n",
+        "1. Define the model.\n",
+        "2. Define a loss function.\n",
+        "3. Obtain training data.\n",
+        "4. Run through the training data and use an \"optimizer\" to adjust the variables to fit the data.\n",
+        "\n",
+        "In this tutorial, we'll walk through a trivial example of a simple linear model: `f(x) = x * W + b`, which has two variables - `W` and `b`. Furthermore, we'll synthesize data such that a well trained model would have `W = 3.0` and `b = 2.0`."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "gFzH64Jn9PIm",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Define the model\n",
+        "\n",
+        "Let's define a simple class to encapsulate the variables and the computation."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "_WRu7Pze7wk8",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "class Model(object):\n",
+        "  def __init__(self):\n",
+        "    # Initialize variable to (5.0, 0.0)\n",
+        "    # In practice, these should be initialized to random values.\n",
+        "    self.W = tf.Variable(5.0)\n",
+        "    self.b = tf.Variable(0.0)\n",
+        "    \n",
+        "  def __call__(self, x):\n",
+        "    return self.W * x + self.b\n",
+        "  \n",
+        "model = Model()\n",
+        "\n",
+        "assert model(3.0).numpy() == 15.0"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "xa6j_yXa-j79",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Define a loss function\n",
+        "\n",
+        "A loss function measures how well the output of a model for a given input matches the desired output. Let's use the standard L2 loss."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Y0ysUFGY924U",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "def loss(predicted_y, desired_y):\n",
+        "  return tf.reduce_mean(tf.square(predicted_y - desired_y))"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "qutT_fkl_CBc",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Obtain training data\n",
+        "\n",
+        "Let's synthesize the training data with some noise."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "gxPTb-kt_N5m",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "TRUE_W = 3.0\n",
+        "TRUE_b = 2.0\n",
+        "NUM_EXAMPLES = 1000\n",
+        "\n",
+        "inputs  = tf.random_normal(shape=[NUM_EXAMPLES])\n",
+        "noise   = tf.random_normal(shape=[NUM_EXAMPLES])\n",
+        "outputs = inputs * TRUE_W + TRUE_b + noise"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "-50nq-wPBsAW",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Before we train the model let's visualize where the model stands right now. We'll plot the model's predictions in red and the training data in blue."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "_eb83LtrB4nt",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "import matplotlib.pyplot as plt\n",
+        "\n",
+        "plt.scatter(inputs, outputs, c='b')\n",
+        "plt.scatter(inputs, model(inputs), c='r')\n",
+        "plt.show()\n",
+        "\n",
+        "print('Current loss: '),\n",
+        "print(loss(model(inputs), outputs).numpy())"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "sSDP-yeq_4jE",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Define a training loop\n",
+        "\n",
+        "We now have our network and our training data. Let's train it, i.e., use the training data to update the model's variables (`W` and `b`) so that the loss goes down using [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent). There are many variants of the gradient descent scheme that are captured in `tf.train.Optimizer` implementations. We'd highly recommend using those implementations, but in the spirit of building from first principles, in this particular example we will implement the basic math ourselves."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "MBIACgdnA55X",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "def train(model, inputs, outputs, learning_rate):\n",
+        "  with tf.GradientTape() as t:\n",
+        "    current_loss = loss(model(inputs), outputs)\n",
+        "  dW, db = t.gradient(current_loss, [model.W, model.b])\n",
+        "  model.W.assign_sub(learning_rate * dW)\n",
+        "  model.b.assign_sub(learning_rate * db)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "RwWPaJryD2aN",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Finally, let's repeatedly run through the training data and see how `W` and `b` evolve."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "XdfkR223D9dW",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "model = Model()\n",
+        "\n",
+        "# Collect the history of W-values and b-values to plot later\n",
+        "Ws, bs = [], []\n",
+        "epochs = range(10)\n",
+        "for epoch in epochs:\n",
+        "  Ws.append(model.W.numpy())\n",
+        "  bs.append(model.b.numpy())\n",
+        "  current_loss = loss(model(inputs), outputs)\n",
+        "\n",
+        "  train(model, inputs, outputs, learning_rate=0.1)\n",
+        "  print('Epoch %2d: W=%1.2f b=%1.2f, loss=%2.5f' %\n",
+        "        (epoch, Ws[-1], bs[-1], current_loss))\n",
+        "\n",
+        "# Let's plot it all\n",
+        "plt.plot(epochs, Ws, 'r',\n",
+        "         epochs, bs, 'b')\n",
+        "plt.plot([TRUE_W] * len(epochs), 'r--',\n",
+        "         [TRUE_b] * len(epochs), 'b--')\n",
+        "plt.legend(['W', 'b', 'true W', 'true_b'])\n",
+        "plt.show()\n",
+        "  "
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "vPnIVuaSJwWz",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Next Steps\n",
+        "\n",
+        "In this tutorial we covered `Variable`s and built and trained a simple linear model using the TensorFlow primitives discussed so far.\n",
+        "\n",
+        "In theory, this is pretty much all you need to use TensorFlow for your machine learning research.\n",
+        "In practice, particularly for neural networks, the higher level APIs like `tf.keras` will be much more convenient since it provides higher level building blocks (called \"layers\"), utilities to save and restore state, a suite of loss functions, a suite of optimization strategies etc. \n",
+        "\n",
+        "The [next tutorial](TODO) will cover these higher level APIs."
+      ]
+    }
+  ]
+}
+\ No newline at end of file
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb
index 51d10a7784..f1e13de5de 100644
--- a/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb
+++ b/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb
@@ -1,27 +1,107 @@
 {
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "eager_basics.ipynb",
+      "version": "0.3.2",
+      "views": {},
+      "default_view": {},
+      "provenance": [],
+      "private_outputs": true,
+      "collapsed_sections": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
   "cells": [
     {
+      "metadata": {
+        "id": "iPpI7RaYoZuE",
+        "colab_type": "text"
+      },
       "cell_type": "markdown",
+      "source": [
+        "##### Copyright 2018 The TensorFlow Authors."
+      ]
+    },
+    {
       "metadata": {
-        "colab_type": "text",
-        "id": "U9i2Dsh-ziXr"
+        "id": "hro2InpHobKk",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        },
+        "cellView": "form"
       },
+      "cell_type": "code",
+      "source": [
+        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+        "# you may not use this file except in compliance with the License.\n",
+        "# You may obtain a copy of the License at\n",
+        "#\n",
+        "# https://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing, software\n",
+        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+        "# See the License for the specific language governing permissions and\n",
+        "# limitations under the License."
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "U9i2Dsh-ziXr",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Eager execution basics"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "Hndw-YcxoOJK",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "<table class=\"tfo-notebook-buttons\" align=\"left\"><td>\n",
+        "<a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb\">\n",
+        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
+        "</td><td>\n",
+        "<a target=\"_blank\" href=\"https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb\"><img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a></td></table>"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "6sILUVbHoSgH",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
       "source": [
-        "# An introduction to TensorFlow\n",
-        "\n",
         "This is an introductory tutorial for using TensorFlow. It will cover:\n",
         "\n",
         "* Importing required packages\n",
         "* Creating and using Tensors\n",
-        "* Using GPU acceleration\n"
+        "* Using GPU acceleration\n",
+        "* Datasets"
       ]
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "z1JcS5iBXMRO"
+        "id": "z1JcS5iBXMRO",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "## Import TensorFlow\n",
         "\n",
@@ -30,32 +110,32 @@
       ]
     },
     {
-      "cell_type": "code",
-      "execution_count": 0,
       "metadata": {
-        "cellView": "code",
+        "id": "RlIWhyeLoYnG",
+        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
           }
         },
-        "colab_type": "code",
-        "id": "RlIWhyeLoYnG"
+        "cellView": "code"
       },
-      "outputs": [],
+      "cell_type": "code",
       "source": [
         "import tensorflow as tf\n",
         "\n",
         "tf.enable_eager_execution()"
-      ]
+      ],
+      "execution_count": 0,
+      "outputs": []
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "H9UySOPLXdaw"
+        "id": "H9UySOPLXdaw",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "## Tensors\n",
         "\n",
@@ -63,46 +143,18 @@
       ]
     },
     {
-      "cell_type": "code",
-      "execution_count": 0,
       "metadata": {
-        "cellView": "code",
+        "id": "ngUe237Wt48W",
+        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          },
-          "height": 125
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 320,
-          "status": "ok",
-          "timestamp": 1526420535530,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
+          }
         },
-        "id": "ngUe237Wt48W",
-        "outputId": "b1a1cd60-4eb3-443d-cd6b-68406390784e"
+        "cellView": "code"
       },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "tf.Tensor(3, shape=(), dtype=int32)\n",
-            "tf.Tensor([4 6], shape=(2,), dtype=int32)\n",
-            "tf.Tensor(25, shape=(), dtype=int32)\n",
-            "tf.Tensor(6, shape=(), dtype=int32)\n",
-            "tf.Tensor(aGVsbG8gd29ybGQ, shape=(), dtype=string)\n",
-            "tf.Tensor(13, shape=(), dtype=int32)\n"
-          ]
-        }
-      ],
+      "cell_type": "code",
       "source": [
         "print(tf.add(1, 2))\n",
         "print(tf.add([1, 2], [3, 4]))\n",
@@ -112,66 +164,46 @@
         "\n",
         "# Operator overloading is also supported\n",
         "print(tf.square(2) + tf.square(3))"
-      ]
+      ],
+      "execution_count": 0,
+      "outputs": []
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "IDY4WsYRhP81"
+        "id": "IDY4WsYRhP81",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "Each Tensor has a shape and a datatype"
       ]
     },
     {
-      "cell_type": "code",
-      "execution_count": 0,
       "metadata": {
+        "id": "srYWH1MdJNG7",
+        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          },
-          "height": 53
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 215,
-          "status": "ok",
-          "timestamp": 1526420538162,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "srYWH1MdJNG7",
-        "outputId": "5e4ac41c-5115-4e50-eba0-42e249c16561"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "(1, 2)\n",
-            "\u003cdtype: 'int32'\u003e\n"
-          ]
+          }
         }
-      ],
+      },
+      "cell_type": "code",
       "source": [
         "x = tf.matmul([[1]], [[2, 3]])\n",
         "print(x.shape)\n",
         "print(x.dtype)"
-      ]
+      ],
+      "execution_count": 0,
+      "outputs": []
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "eBPw8e8vrsom"
+        "id": "eBPw8e8vrsom",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "The most obvious differences between NumPy arrays and TensorFlow Tensors are:\n",
         "\n",
@@ -180,11 +212,11 @@
       ]
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "Dwi1tdW3JBw6"
+        "id": "Dwi1tdW3JBw6",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "### NumPy Compatibility\n",
         "\n",
@@ -197,52 +229,17 @@
       ]
     },
     {
-      "cell_type": "code",
-      "execution_count": 0,
       "metadata": {
+        "id": "lCUWzso6mbqR",
+        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          },
-          "height": 251
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 238,
-          "status": "ok",
-          "timestamp": 1526420540562,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "lCUWzso6mbqR",
-        "outputId": "fd0a22bc-8249-49dd-fcbd-63161cc47e46"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "TensorFlow operations convert numpy arrays to Tensors automatically\n",
-            "tf.Tensor(\n",
-            "[[ 42.  42.  42.]\n",
-            " [ 42.  42.  42.]\n",
-            " [ 42.  42.  42.]], shape=(3, 3), dtype=float64)\n",
-            "And NumPy operations convert Tensors to numpy arrays automatically\n",
-            "[[ 43.  43.  43.]\n",
-            " [ 43.  43.  43.]\n",
-            " [ 43.  43.  43.]]\n",
-            "The .numpy() method explicitly converts a Tensor to a numpy array\n",
-            "[[ 42.  42.  42.]\n",
-            " [ 42.  42.  42.]\n",
-            " [ 42.  42.  42.]]\n"
-          ]
+          }
         }
-      ],
+      },
+      "cell_type": "code",
       "source": [
         "import numpy as np\n",
         "\n",
@@ -258,14 +255,16 @@
         "\n",
         "print(\"The .numpy() method explicitly converts a Tensor to a numpy array\")\n",
         "print(tensor.numpy())"
-      ]
+      ],
+      "execution_count": 0,
+      "outputs": []
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "PBNP8yTRfu_X"
+        "id": "PBNP8yTRfu_X",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "## GPU acceleration\n",
         "\n",
@@ -273,42 +272,18 @@
       ]
     },
     {
-      "cell_type": "code",
-      "execution_count": 0,
       "metadata": {
-        "cellView": "code",
+        "id": "3Twf_Rw-gQFM",
+        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          },
-          "height": 53
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 340,
-          "status": "ok",
-          "timestamp": 1526420543562,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
+          }
         },
-        "id": "3Twf_Rw-gQFM",
-        "outputId": "2239ae2b-adf3-4895-b1f3-464cf5361d1b"
+        "cellView": "code"
       },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Is there a GPU available:  False\n",
-            "Is the Tensor on GPU #0:   False\n"
-          ]
-        }
-      ],
+      "cell_type": "code",
       "source": [
         "x = tf.random_uniform([3, 3])\n",
         "\n",
@@ -317,26 +292,28 @@
         "\n",
         "print(\"Is the Tensor on GPU #0:  \"),\n",
         "print(x.device.endswith('GPU:0'))"
-      ]
+      ],
+      "execution_count": 0,
+      "outputs": []
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "vpgYzgVXW2Ud"
+        "id": "vpgYzgVXW2Ud",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "### Device Names\n",
         "\n",
-        "The `Tensor.device` property provides a fully qualified string name of the device hosting the contents of the Tensor. This name encodes a bunch of details, such as an identifier of the network address of the host on which this program is executing and the device within that host. This is required for distributed execution of TensorFlow programs, but we'll skip that for now. The string will end with `GPU:\u003cN\u003e` if the tensor is placed on the `N`-th tensor on the host."
+        "The `Tensor.device` property provides a fully qualified string name of the device hosting the contents of the Tensor. This name encodes a bunch of details, such as an identifier of the network address of the host on which this program is executing and the device within that host. This is required for distributed execution of TensorFlow programs, but we'll skip that for now. The string will end with `GPU:<N>` if the tensor is placed on the `N`-th tensor on the host."
       ]
     },
     {
-      "cell_type": "markdown",
       "metadata": {
-        "colab_type": "text",
-        "id": "ZWZQCimzuqyP"
+        "id": "ZWZQCimzuqyP",
+        "colab_type": "text"
       },
+      "cell_type": "markdown",
       "source": [
         "\n",
         "\n",
@@ -346,41 +323,17 @@
       ]
     },
     {
-      "cell_type": "code",
-      "execution_count": 0,
       "metadata": {
+        "id": "RjkNZTuauy-Q",
+        "colab_type": "code",
         "colab": {
           "autoexec": {
             "startup": false,
             "wait_interval": 0
-          },
-          "height": 53
-        },
-        "colab_type": "code",
-        "executionInfo": {
-          "elapsed": 1762,
-          "status": "ok",
-          "timestamp": 1526420547562,
-          "user": {
-            "displayName": "",
-            "photoUrl": "",
-            "userId": ""
-          },
-          "user_tz": 420
-        },
-        "id": "RjkNZTuauy-Q",
-        "outputId": "2e613293-ccac-4db2-b793-8ceb5b5adcfd"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "On CPU:\n",
-            "10 loops, best of 3: 35.8 ms per loop\n"
-          ]
+          }
         }
-      ],
+      },
+      "cell_type": "code",
       "source": [
         "def time_matmul(x):\n",
         "  %timeit tf.matmul(x, x)\n",
@@ -398,32 +351,141 @@
         "    x = tf.random_uniform([1000, 1000])\n",
         "    assert x.device.endswith(\"GPU:0\")\n",
         "    time_matmul(x)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "o1K4dlhhHtQj",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "## Datasets\n",
+        "\n",
+        "This section demonstrates the use of the [`tf.data.Dataset` API](https://www.tensorflow.org/guide/datasets) to build pipelines to feed data to your model. It covers:\n",
+        "\n",
+        "* Creating a `Dataset`.\n",
+        "* Iteration over a `Dataset` with eager execution enabled.\n",
+        "\n",
+        "We recommend using the `Dataset`s API for building performant, complex input pipelines from simple, re-usable pieces that will feed your model's training or evaluation loops.\n",
+        "\n",
+        "If you're familiar with TensorFlow graphs, the API for constructing the `Dataset` object remains exactly the same when eager execution is enabled, but the process of iterating over elements of the dataset is slightly simpler.\n",
+        "You can use Python iteration over the `tf.data.Dataset` object and do not need to explicitly create an `tf.data.Iterator` object.\n",
+        "As a result, the discussion on iterators in the [TensorFlow Guide](https://www.tensorflow.org/guide/datasets) is not relevant when eager execution is enabled."
       ]
     },
     {
+      "metadata": {
+        "id": "zI0fmOynH-Ne",
+        "colab_type": "text"
+      },
       "cell_type": "markdown",
+      "source": [
+        "### Create a source `Dataset`\n",
+        "\n",
+        "Create a _source_ dataset using one of the factory functions like [`Dataset.from_tensors`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensors), [`Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) or using objects that read from files like [`TextLineDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset) or [`TFRecordDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset). See the [TensorFlow Guide](https://www.tensorflow.org/guide/datasets#reading_input_data) for more information."
+      ]
+    },
+    {
       "metadata": {
-        "colab_type": "text",
-        "id": "YEOJTNiOvnpQ"
+        "id": "F04fVOHQIBiG",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
       },
+      "cell_type": "code",
       "source": [
-        "## Next Steps\n",
+        "ds_tensors = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])\n",
         "\n",
-        "In this tutorial we covered the most fundamental concepts in TensorFlow - `Tensor`s, operations, and devices.\n",
-        "In [the next tutorial](https://github.com/tensorflow/models/tree/master/official/contrib/eager/python/examples/notebooks/2_gradients.ipynb) we will cover automatic differentiation - a building block required for training many machine learning models like neural networks."
+        "# Create a CSV file\n",
+        "import tempfile\n",
+        "_, filename = tempfile.mkstemp()\n",
+        "\n",
+        "with open(filename, 'w') as f:\n",
+        "  f.write(\"\"\"Line 1\n",
+        "Line 2\n",
+        "Line 3\n",
+        "  \"\"\")\n",
+        "\n",
+        "ds_file = tf.data.TextLineDataset(filename)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "vbxIhC-5IPdf",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Apply transformations\n",
+        "\n",
+        "Use the transformations functions like [`map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map), [`batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch), [`shuffle`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) etc. to apply transformations to the records of the dataset. See the [API documentation for `tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) for details."
       ]
+    },
+    {
+      "metadata": {
+        "id": "uXSDZWE-ISsd",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "ds_tensors = ds_tensors.map(tf.square).shuffle(2).batch(2)\n",
+        "\n",
+        "ds_file = ds_file.batch(2)"
+      ],
+      "execution_count": 0,
+      "outputs": []
+    },
+    {
+      "metadata": {
+        "id": "A8X1GNfoIZKJ",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "### Iterate\n",
+        "\n",
+        "When eager execution is enabled `Dataset` objects support iteration.\n",
+        "If you're familiar with the use of `Dataset`s in TensorFlow graphs, note that there is no need for calls to `Dataset.make_one_shot_iterator()` or `get_next()` calls."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "ws-WKRk5Ic6-",
+        "colab_type": "code",
+        "colab": {
+          "autoexec": {
+            "startup": false,
+            "wait_interval": 0
+          }
+        }
+      },
+      "cell_type": "code",
+      "source": [
+        "print('Elements of ds_tensors:')\n",
+        "for x in ds_tensors:\n",
+        "  print(x)\n",
+        "\n",
+        "print('\\nElements in ds_file:')\n",
+        "for x in ds_file:\n",
+        "  print(x)"
+      ],
+      "execution_count": 0,
+      "outputs": []
     }
-  ],
-  "metadata": {
-    "colab": {
-      "collapsed_sections": [],
-      "default_view": {},
-      "name": "TensorFlow: An introduction",
-      "provenance": [],
-      "version": "0.3.2",
-      "views": {}
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
-}
+  ]
+}
+\ No newline at end of file
diff --git a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
index b14ef1df8f..07d8788882 100644
--- a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
+++ b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
@@ -29,6 +29,7 @@ import tensorflow.contrib.eager as tfe
 from tensorflow.contrib.eager.python.examples.resnet50 import resnet50
 from tensorflow.contrib.summary import summary_test_util
 from tensorflow.python.client import device_lib
+from tensorflow.python.eager import tape
 
 
 def device_and_data_format():
@@ -49,13 +50,21 @@ def random_batch(batch_size, data_format):
   return images, one_hot
 
 
-def compute_gradients(model, images, labels):
-  with tf.GradientTape() as tape:
+def compute_gradients(model, images, labels, num_replicas=1):
+  with tf.GradientTape() as grad_tape:
     logits = model(images, training=True)
     loss = tf.losses.softmax_cross_entropy(
         logits=logits, onehot_labels=labels)
     tf.contrib.summary.scalar(name='loss', tensor=loss)
-  return tape.gradient(loss, model.variables)
+    if num_replicas != 1:
+      loss /= num_replicas
+
+  # TODO(b/110991947): We can mistakenly trace the gradient call in
+  # multi-threaded environment. Explicitly disable recording until
+  # this is fixed.
+  with tape.stop_recording():
+    grads = grad_tape.gradient(loss, model.variables)
+  return grads
 
 
 def apply_gradients(model, optimizer, gradients):
@@ -188,11 +197,14 @@ class ResNet50Benchmarks(tf.test.Benchmark):
         return (32,)
     return (16, 32)
 
-  def _report(self, label, start, num_iters, device, batch_size, data_format):
+  def _report(self, label, start, num_iters, device, batch_size, data_format,
+              num_replicas=1):
     avg_time = (time.time() - start) / num_iters
     dev = tf.DeviceSpec.from_string(device).device_type.lower()
-    name = '%s_%s_batch_%d_%s' % (label, dev, batch_size, data_format)
-    extras = {'examples_per_sec': batch_size / avg_time}
+    replica_str = '' if num_replicas == 1 else 'replicas_%d_' % num_replicas
+    name = '%s_%s_batch_%d_%s%s' % (label, dev, batch_size,
+                                    replica_str, data_format)
+    extras = {'examples_per_sec': (num_replicas * batch_size) / avg_time}
     self.report_benchmark(
         iters=num_iters, wall_time=avg_time, name=name, extras=extras)
 
diff --git a/tensorflow/contrib/eager/python/examples/revnet/BUILD b/tensorflow/contrib/eager/python/examples/revnet/BUILD
index 81c9facfb5..4f0d46b1ba 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/BUILD
+++ b/tensorflow/contrib/eager/python/examples/revnet/BUILD
@@ -43,6 +43,27 @@ py_library(
     ],
 )
 
+py_library(
+    name = "resnet_preprocessing",
+    srcs = ["resnet_preprocessing.py"],
+    srcs_version = "PY2AND3",
+    tags = ["local"],
+    deps = [
+        "//tensorflow:tensorflow_py",
+    ],
+)
+
+py_library(
+    name = "imagenet_input",
+    srcs = ["imagenet_input.py"],
+    srcs_version = "PY2AND3",
+    tags = ["local"],
+    deps = [
+        ":resnet_preprocessing",
+        "//tensorflow:tensorflow_py",
+    ],
+)
+
 # Tests
 cuda_py_test(
     name = "ops_test",
@@ -78,7 +99,7 @@ cuda_py_test(
         "//tensorflow:tensorflow_py",
     ],
     tags = [
-        "no_pip",
+        "no_pip",  # depends on blocks_test, which is not available in pip package
         "optonly",
     ],
 )
@@ -113,3 +134,39 @@ py_binary(
         "//tensorflow:tensorflow_py",
     ],
 )
+
+py_binary(
+    name = "main_estimator",
+    srcs = ["main_estimator.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":cifar_input",
+        ":main",
+        ":revnet",
+        "//tensorflow:tensorflow_py",
+    ],
+)
+
+py_library(
+    name = "main_estimator_lib",
+    srcs = ["main_estimator.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":cifar_input",
+        ":main",
+        ":revnet",
+        "//tensorflow:tensorflow_py",
+    ],
+)
+
+py_library(
+    name = "main_estimator_tpu_lib",
+    srcs = ["main_estimator_tpu.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":cifar_input",
+        ":main",
+        ":revnet",
+        "//tensorflow:tensorflow_py",
+    ],
+)
diff --git a/tensorflow/contrib/eager/python/examples/revnet/README.md b/tensorflow/contrib/eager/python/examples/revnet/README.md
index 21fc44febc..2875d0ffb3 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/README.md
+++ b/tensorflow/contrib/eager/python/examples/revnet/README.md
@@ -1,18 +1,21 @@
 # RevNet with TensorFlow eager execution
 
-This folder contains an TensorFlow eager implementation of the [Reversible Residual Network](https://arxiv.org/pdf/1707.04585.pdf) adapted from the released implementation by the authors. The presented implementation can be ran both in eager and graph mode. The code is considerably simplified with `tf.GradientTape`. Moreover, we reduce the step of reconstructing the outputs. This saves us from using `tf.stop_gradient` and makes the model run faster.
+This folder contains a TensorFlow eager implementation of the [Reversible Residual Network](https://arxiv.org/pdf/1707.04585.pdf) adapted from the released implementation by the authors. The presented implementation can be ran both in eager and graph mode. The code is considerably simplified with `tf.GradientTape`. Moreover, we reduce the step of reconstructing the outputs. This saves us from using `tf.stop_gradient` and makes the model run faster.
 
 ##  Content
 
 - `revnet.py`: The RevNet model.
 - `blocks.py`: The relevant reversible blocks.
+- `ops.py`: Auxiliary downsampling operation.
 - `cifar_tfrecords.py`: Script to generate the TFRecords for both CIFAR-10 and CIFAR-100.
 - `cifar_input.py`: Script to read from TFRecords and generate dataset objects with the `tf.data` API.
 - `config.py`: Configuration file for network architectures and training hyperparameters.
 - `main.py`: Main training and evaluation script.
-- `ops.py`: Auxiliary downsampling operation.
+- `main_estimator.py`: Script to train RevNet models on CIFAR-10 and CIFAR-100 with the `tf.estimator` API.
+- `main_estimator_tpu.py`: Script to train RevNet models on ImageNet with TPU estimators on Cloud TPUs.
+- `resnet_preprocessing.py`, `imagenet_input.py`: Boilerplate to read ImageNet data from TFRecords.
 
-## To run
+## Train on CIFAR-10/CIFAR-100
 - Make sure you have installed TensorFlow 1.9+ or the latest `tf-nightly`
 or `tf-nightly-gpu` pip package in order to access the eager execution feature.
 
@@ -24,7 +27,7 @@ python cifar_tfrecords.py --data_dir ${PWD}/cifar
 to download the cifar dataset and convert them
 to TFRecords. This produces TFRecord files for both CIFAR-10 and CIFAR-100.
 
-- To train a model run
+- To train a model, run
 
 ```bash
 python main.py --data_dir ${PWD}/cifar
@@ -34,8 +37,63 @@ python main.py --data_dir ${PWD}/cifar
   - `train_dir`: Directory to store eventfiles and checkpoints.
   - `restore`: Restore the latest checkpoint.
   - `validate`: Use validation set for training monitoring.
-  - `manual_grad`: Use the manually defined gradient map given by the authors.
-  - `dataset`: Use either `cifar-10` or `cifar-100`
+  - `dataset`: Use either `cifar-10` or `cifar-100`.
+  - `config`: RevNet configuration.
+  - `use_defun`: Use `tfe.defun` to boost performance.
+
+- To train a model with estimators in graph-mode, run
+
+```bash
+python main_estimator.py --data_dir ${PWD}/cifar
+```
+
+- Optional arguments for `main.py` include
+  - `model_dir`: Directory to store eventfiles and checkpoints.
+  - `dataset`: Use either `cifar-10` or `cifar-100`.
+  - `config`: RevNet configuration.
+  - `export`: Export the model for serving if True.
+
+## Speed up with `tfe.defun`
+Even though the speed difference between pure eager execution and graph-mode execution is noticeable,
+the difference between fully "defunned" model training and graph-mode
+training is negligible.
+
+## Train on ImageNet with Cloud TPUs
+The standard way to train models on Cloud TPUs is via TPU estimators and graph-mode
+execution. Models built with the `tf.keras` API are fully compatible with TPU estimators.
+
+### Setup a Google Cloud project
+
+Follow the instructions at the [Quickstart Guide](https://cloud.google.com/tpu/docs/quickstart)
+to get a GCE VM with access to Cloud TPU.
+
+To run this model, you will need:
+
+* A GCE VM instance with an associated Cloud TPU resource
+* A GCS bucket to store your training checkpoints
+* (Optional): The ImageNet training and validation data preprocessed into
+  TFRecord format, and stored in GCS.
+
+### Format the data
+
+The data is expected to be formatted in TFRecord format, as generated by [this
+script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py).
+
+If you do not have ImageNet dataset prepared, you can use a randomly generated
+fake dataset to test the model. It is located at
+`gs://cloud-tpu-test-datasets/fake_imagenet`.
+
+### Start training
+
+Train the model by executing the following command (substituting the appropriate
+values):
+
+```bash
+python main_estimator_tpu.py \
+  --tpu=$TPU_NAME \
+  --data_dir=$DATA_DIR \
+  --model_dir=$MODEL_DIR
+```
 
 ## Performance
 - With the current implementation, RevNet-38 achieves >92% on CIFAR-10 and >71% on CIFAR-100.
diff --git a/tensorflow/contrib/eager/python/examples/revnet/blocks.py b/tensorflow/contrib/eager/python/examples/revnet/blocks.py
index 306096e9f8..f61354bc38 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/blocks.py
+++ b/tensorflow/contrib/eager/python/examples/revnet/blocks.py
@@ -24,6 +24,9 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import functools
+import operator
+
 import tensorflow as tf
 from tensorflow.contrib.eager.python.examples.revnet import ops
 
@@ -45,7 +48,7 @@ class RevBlock(tf.keras.Model):
                bottleneck=False,
                fused=True,
                dtype=tf.float32):
-    """Initialize RevBlock.
+    """Initialization.
 
     Args:
       n_res: number of residual blocks
@@ -88,49 +91,27 @@ class RevBlock(tf.keras.Model):
       h = block(h, training=training)
     return h
 
-  def backward_grads_and_vars(self, x, y, dy, training=True):
+  def backward_grads(self, x, y, dy, training=True):
     """Apply reversible block backward to outputs."""
 
     grads_all = []
-    vars_all = []
-
     for i in reversed(range(len(self.blocks))):
       block = self.blocks[i]
       if i == 0:
         # First block usually contains downsampling that can't be reversed
-        with tf.GradientTape() as tape:
-          x = tf.identity(x)
-          tape.watch(x)
-          y = block(x, training=training)
-
-        grads_combined = tape.gradient(
-            y, [x] + block.trainable_variables, output_gradients=dy)
-        dy = grads_combined[0]
-        grads_all += grads_combined[1:]
-        vars_all += block.trainable_variables
+        dy, grads = block.backward_grads_with_downsample(
+            x, y, dy, training=True)
       else:
-        y, dy, grads, vars_ = block.backward_grads_and_vars(
-            y, dy, training=training)
-        grads_all += grads
-        vars_all += vars_
+        y, dy, grads = block.backward_grads(y, dy, training=training)
+      grads_all = grads + grads_all
 
-    return dy, grads_all, vars_all
+    return dy, grads_all
 
 
 class _Residual(tf.keras.Model):
   """Single residual block contained in a _RevBlock. Each `_Residual` object has
   two _ResidualInner objects, corresponding to the `F` and `G` functions in the
   paper.
-
-  Args:
-    filters: output filter size
-    strides: length 2 list/tuple of integers for height and width strides
-    input_shape: length 3 list/tuple of integers
-    batch_norm_first: whether to apply activation and batch norm before conv
-    data_format: tensor data format, "NCHW"/"NHWC",
-    bottleneck: use bottleneck residual if True
-    fused: use fused batch normalization if True
-    dtype: float16, float32, or float64
   """
 
   def __init__(self,
@@ -142,6 +123,18 @@ class _Residual(tf.keras.Model):
                bottleneck=False,
                fused=True,
                dtype=tf.float32):
+    """Initialization.
+
+    Args:
+      filters: output filter size
+      strides: length 2 list/tuple of integers for height and width strides
+      input_shape: length 3 list/tuple of integers
+      batch_norm_first: whether to apply activation and batch norm before conv
+      data_format: tensor data format, "NCHW"/"NHWC",
+      bottleneck: use bottleneck residual if True
+      fused: use fused batch normalization if True
+      dtype: float16, float32, or float64
+    """
     super(_Residual, self).__init__()
 
     self.filters = filters
@@ -174,10 +167,9 @@ class _Residual(tf.keras.Model):
         fused=fused,
         dtype=dtype)
 
-  def call(self, x, training=True, concat=True):
+  def call(self, x, training=True):
     """Apply residual block to inputs."""
-
-    x1, x2 = tf.split(x, num_or_size_splits=2, axis=self.axis)
+    x1, x2 = x
     f_x2 = self.f(x2, training=training)
     x1_down = ops.downsample(
         x1, self.filters // 2, self.strides, axis=self.axis)
@@ -186,172 +178,327 @@ class _Residual(tf.keras.Model):
     y1 = f_x2 + x1_down
     g_y1 = self.g(y1, training=training)
     y2 = g_y1 + x2_down
-    if not concat:  # For correct backward grads
-      return y1, y2
 
-    return tf.concat([y1, y2], axis=self.axis)
+    return y1, y2
 
-  def backward_grads_and_vars(self, y, dy, training=True):
+  def backward_grads(self, y, dy, training=True):
     """Manually compute backward gradients given input and output grads."""
-    dy1, dy2 = tf.split(dy, num_or_size_splits=2, axis=self.axis)
-
-    with tf.GradientTape(persistent=True) as tape:
-      y = tf.identity(y)
-      tape.watch(y)
-      y1, y2 = tf.split(y, num_or_size_splits=2, axis=self.axis)
-      z1 = y1
-      gz1 = self.g(z1, training=training)
-      x2 = y2 - gz1
+    dy1, dy2 = dy
+    y1, y2 = y
+
+    with tf.GradientTape() as gtape:
+      gtape.watch(y1)
+      gy1 = self.g(y1, training=training)
+    grads_combined = gtape.gradient(
+        gy1, [y1] + self.g.trainable_variables, output_gradients=dy2)
+    dg = grads_combined[1:]
+    dx1 = dy1 + grads_combined[0]
+    # This doesn't affect eager execution, but improves memory efficiency with
+    # graphs
+    with tf.control_dependencies(dg + [dx1]):
+      x2 = y2 - gy1
+
+    with tf.GradientTape() as ftape:
+      ftape.watch(x2)
       fx2 = self.f(x2, training=training)
-      x1 = z1 - fx2
+    grads_combined = ftape.gradient(
+        fx2, [x2] + self.f.trainable_variables, output_gradients=dx1)
+    df = grads_combined[1:]
+    dx2 = dy2 + grads_combined[0]
+    # Same behavior as above
+    with tf.control_dependencies(df + [dx2]):
+      x1 = y1 - fx2
 
-    grads_combined = tape.gradient(
-        gz1, [z1] + self.g.trainable_variables, output_gradients=dy2)
-    dz1 = dy1 + grads_combined[0]
+    x = x1, x2
+    dx = dx1, dx2
+    grads = df + dg
+
+    return x, dx, grads
+
+  def backward_grads_with_downsample(self, x, y, dy, training=True):
+    """Manually compute backward gradients given input and output grads."""
+    # Splitting this from `backward_grads` for better readability
+    x1, x2 = x
+    y1, _ = y
+    dy1, dy2 = dy
+
+    with tf.GradientTape() as gtape:
+      gtape.watch(y1)
+      gy1 = self.g(y1, training=training)
+    grads_combined = gtape.gradient(
+        gy1, [y1] + self.g.trainable_variables, output_gradients=dy2)
     dg = grads_combined[1:]
-    dx1 = dz1
+    dz1 = dy1 + grads_combined[0]
 
-    grads_combined = tape.gradient(
+    # dx1 need one more step to backprop through downsample
+    with tf.GradientTape() as x1tape:
+      x1tape.watch(x1)
+      z1 = ops.downsample(x1, self.filters // 2, self.strides, axis=self.axis)
+    dx1 = x1tape.gradient(z1, x1, output_gradients=dz1)
+
+    with tf.GradientTape() as ftape:
+      ftape.watch(x2)
+      fx2 = self.f(x2, training=training)
+    grads_combined = ftape.gradient(
         fx2, [x2] + self.f.trainable_variables, output_gradients=dz1)
-    dx2 = dy2 + grads_combined[0]
-    df = grads_combined[1:]
+    dx2, df = grads_combined[0], grads_combined[1:]
 
-    del tape
+    # dx2 need one more step to backprop through downsample
+    with tf.GradientTape() as x2tape:
+      x2tape.watch(x2)
+      z2 = ops.downsample(x2, self.filters // 2, self.strides, axis=self.axis)
+    dx2 += x2tape.gradient(z2, x2, output_gradients=dy2)
 
+    dx = dx1, dx2
     grads = df + dg
-    vars_ = self.f.trainable_variables + self.g.trainable_variables
 
-    x = tf.concat([x1, x2], axis=self.axis)
-    dx = tf.concat([dx1, dx2], axis=self.axis)
+    return dx, grads
 
-    return x, dx, grads, vars_
 
-
-def _BottleneckResidualInner(filters,
-                             strides,
-                             input_shape,
-                             batch_norm_first=True,
-                             data_format="channels_first",
-                             fused=True,
-                             dtype=tf.float32):
+# Ideally, the following should be wrapped in `tf.keras.Sequential`, however
+# there are subtle issues with its placeholder insertion policy and batch norm
+class _BottleneckResidualInner(tf.keras.Model):
   """Single bottleneck residual inner function contained in _Resdual.
 
   Corresponds to the `F`/`G` functions in the paper.
   Suitable for training on ImageNet dataset.
-
-  Args:
-    filters: output filter size
-    strides: length 2 list/tuple of integers for height and width strides
-    input_shape: length 3 list/tuple of integers
-    batch_norm_first: whether to apply activation and batch norm before conv
-    data_format: tensor data format, "NCHW"/"NHWC"
-    fused: use fused batch normalization if True
-    dtype: float16, float32, or float64
-
-  Returns:
-    A keras model
   """
 
-  axis = 1 if data_format == "channels_first" else 3
-  model = tf.keras.Sequential()
-  if batch_norm_first:
-    model.add(
-        tf.keras.layers.BatchNormalization(
-            axis=axis, input_shape=input_shape, fused=fused, dtype=dtype))
-    model.add(tf.keras.layers.Activation("relu"))
-  model.add(
-      tf.keras.layers.Conv2D(
-          filters=filters // 4,
-          kernel_size=1,
-          strides=strides,
-          input_shape=input_shape,
-          data_format=data_format,
-          use_bias=False,
-          padding="SAME",
-          dtype=dtype))
-
-  model.add(
-      tf.keras.layers.BatchNormalization(axis=axis, fused=fused, dtype=dtype))
-  model.add(tf.keras.layers.Activation("relu"))
-  model.add(
-      tf.keras.layers.Conv2D(
-          filters=filters // 4,
-          kernel_size=3,
-          strides=(1, 1),
-          data_format=data_format,
-          use_bias=False,
-          padding="SAME",
-          dtype=dtype))
-
-  model.add(
-      tf.keras.layers.BatchNormalization(axis=axis, fused=fused, dtype=dtype))
-  model.add(tf.keras.layers.Activation("relu"))
-  model.add(
-      tf.keras.layers.Conv2D(
-          filters=filters,
-          kernel_size=1,
-          strides=(1, 1),
-          data_format=data_format,
-          use_bias=False,
-          padding="SAME",
-          dtype=dtype))
+  def __init__(self,
+               filters,
+               strides,
+               input_shape,
+               batch_norm_first=True,
+               data_format="channels_first",
+               fused=True,
+               dtype=tf.float32):
+    """Initialization.
+
+    Args:
+      filters: output filter size
+      strides: length 2 list/tuple of integers for height and width strides
+      input_shape: length 3 list/tuple of integers
+      batch_norm_first: whether to apply activation and batch norm before conv
+      data_format: tensor data format, "NCHW"/"NHWC"
+      fused: use fused batch normalization if True
+      dtype: float16, float32, or float64
+    """
+    super(_BottleneckResidualInner, self).__init__()
+    axis = 1 if data_format == "channels_first" else 3
+    if batch_norm_first:
+      self.batch_norm_0 = tf.keras.layers.BatchNormalization(
+          axis=axis, input_shape=input_shape, fused=fused, dtype=dtype)
+    self.conv2d_1 = tf.keras.layers.Conv2D(
+        filters=filters // 4,
+        kernel_size=1,
+        strides=strides,
+        input_shape=input_shape,
+        data_format=data_format,
+        use_bias=False,
+        padding="SAME",
+        dtype=dtype)
+
+    self.batch_norm_1 = tf.keras.layers.BatchNormalization(
+        axis=axis, fused=fused, dtype=dtype)
+    self.conv2d_2 = tf.keras.layers.Conv2D(
+        filters=filters // 4,
+        kernel_size=3,
+        strides=(1, 1),
+        data_format=data_format,
+        use_bias=False,
+        padding="SAME",
+        dtype=dtype)
+
+    self.batch_norm_2 = tf.keras.layers.BatchNormalization(
+        axis=axis, fused=fused, dtype=dtype)
+    self.conv2d_3 = tf.keras.layers.Conv2D(
+        filters=filters,
+        kernel_size=1,
+        strides=(1, 1),
+        data_format=data_format,
+        use_bias=False,
+        padding="SAME",
+        dtype=dtype)
+
+    self.batch_norm_first = batch_norm_first
+
+  def call(self, x, training=True):
+    net = x
+    if self.batch_norm_first:
+      net = self.batch_norm_0(net, training=training)
+      net = tf.nn.relu(net)
+    net = self.conv2d_1(net)
+
+    net = self.batch_norm_1(net, training=training)
+    net = tf.nn.relu(net)
+    net = self.conv2d_2(net)
 
-  return model
+    net = self.batch_norm_2(net, training=training)
+    net = tf.nn.relu(net)
+    net = self.conv2d_3(net)
 
+    return net
 
-def _ResidualInner(filters,
-                   strides,
-                   input_shape,
-                   batch_norm_first=True,
-                   data_format="channels_first",
-                   fused=True,
-                   dtype=tf.float32):
+
+class _ResidualInner(tf.keras.Model):
   """Single residual inner function contained in _ResdualBlock.
 
   Corresponds to the `F`/`G` functions in the paper.
-
-  Args:
-    filters: output filter size
-    strides: length 2 list/tuple of integers for height and width strides
-    input_shape: length 3 list/tuple of integers
-    batch_norm_first: whether to apply activation and batch norm before conv
-    data_format: tensor data format, "NCHW"/"NHWC"
-    fused: use fused batch normalization if True
-    dtype: float16, float32, or float64
-
-  Returns:
-    A keras model
   """
 
-  axis = 1 if data_format == "channels_first" else 3
-  model = tf.keras.Sequential()
-  if batch_norm_first:
-    model.add(
-        tf.keras.layers.BatchNormalization(
-            axis=axis, input_shape=input_shape, fused=fused, dtype=dtype))
-    model.add(tf.keras.layers.Activation("relu"))
-  model.add(
-      tf.keras.layers.Conv2D(
-          filters=filters,
-          kernel_size=3,
-          strides=strides,
-          input_shape=input_shape,
-          data_format=data_format,
-          use_bias=False,
-          padding="SAME",
-          dtype=dtype))
-
-  model.add(
-      tf.keras.layers.BatchNormalization(axis=axis, fused=fused, dtype=dtype))
-  model.add(tf.keras.layers.Activation("relu"))
-  model.add(
-      tf.keras.layers.Conv2D(
-          filters=filters,
-          kernel_size=3,
-          strides=(1, 1),
-          data_format=data_format,
-          use_bias=False,
+  def __init__(self,
+               filters,
+               strides,
+               input_shape,
+               batch_norm_first=True,
+               data_format="channels_first",
+               fused=True,
+               dtype=tf.float32):
+    """Initialization.
+
+    Args:
+      filters: output filter size
+      strides: length 2 list/tuple of integers for height and width strides
+      input_shape: length 3 list/tuple of integers
+      batch_norm_first: whether to apply activation and batch norm before conv
+      data_format: tensor data format, "NCHW"/"NHWC"
+      fused: use fused batch normalization if True
+      dtype: float16, float32, or float64
+    """
+    super(_ResidualInner, self).__init__()
+    axis = 1 if data_format == "channels_first" else 3
+    if batch_norm_first:
+      self.batch_norm_0 = tf.keras.layers.BatchNormalization(
+          axis=axis, input_shape=input_shape, fused=fused, dtype=dtype)
+    self.conv2d_1 = tf.keras.layers.Conv2D(
+        filters=filters,
+        kernel_size=3,
+        strides=strides,
+        input_shape=input_shape,
+        data_format=data_format,
+        use_bias=False,
+        padding="SAME",
+        dtype=dtype)
+
+    self.batch_norm_1 = tf.keras.layers.BatchNormalization(
+        axis=axis, fused=fused, dtype=dtype)
+    self.conv2d_2 = tf.keras.layers.Conv2D(
+        filters=filters,
+        kernel_size=3,
+        strides=(1, 1),
+        data_format=data_format,
+        use_bias=False,
+        padding="SAME",
+        dtype=dtype)
+
+    self.batch_norm_first = batch_norm_first
+
+  def call(self, x, training=True):
+    net = x
+    if self.batch_norm_first:
+      net = self.batch_norm_0(net, training=training)
+      net = tf.nn.relu(net)
+    net = self.conv2d_1(net)
+
+    net = self.batch_norm_1(net, training=training)
+    net = tf.nn.relu(net)
+    net = self.conv2d_2(net)
+
+    return net
+
+
+class InitBlock(tf.keras.Model):
+  """Initial block of RevNet."""
+
+  def __init__(self, config):
+    """Initialization.
+
+    Args:
+      config: tf.contrib.training.HParams object; specifies hyperparameters
+    """
+    super(InitBlock, self).__init__()
+    self.config = config
+    self.axis = 1 if self.config.data_format == "channels_first" else 3
+    self.conv2d = tf.keras.layers.Conv2D(
+        filters=self.config.init_filters,
+        kernel_size=self.config.init_kernel,
+        strides=(self.config.init_stride, self.config.init_stride),
+        data_format=self.config.data_format,
+        use_bias=False,
+        padding="SAME",
+        input_shape=self.config.input_shape,
+        dtype=self.config.dtype)
+    self.batch_norm = tf.keras.layers.BatchNormalization(
+        axis=self.axis, fused=self.config.fused, dtype=self.config.dtype)
+    self.activation = tf.keras.layers.Activation("relu")
+
+    if self.config.init_max_pool:
+      self.max_pool = tf.keras.layers.MaxPooling2D(
+          pool_size=(3, 3),
+          strides=(2, 2),
           padding="SAME",
-          dtype=dtype))
+          data_format=self.config.data_format,
+          dtype=self.config.dtype)
+
+  def call(self, x, training=True):
+    net = x
+    net = self.conv2d(net)
+    net = self.batch_norm(net, training=training)
+    net = self.activation(net)
+
+    if self.config.init_max_pool:
+      net = self.max_pool(net)
+
+    return tf.split(net, num_or_size_splits=2, axis=self.axis)
 
-  return model
+
+class FinalBlock(tf.keras.Model):
+  """Final block of RevNet."""
+
+  def __init__(self, config):
+    """Initialization.
+
+    Args:
+      config: tf.contrib.training.HParams object; specifies hyperparameters
+
+    Raises:
+      ValueError: Unsupported data format
+    """
+    super(FinalBlock, self).__init__()
+    self.config = config
+    self.axis = 1 if self.config.data_format == "channels_first" else 3
+
+    f = self.config.filters[-1]  # Number of filters
+    r = functools.reduce(operator.mul, self.config.strides, 1)  # Reduce ratio
+    r *= self.config.init_stride
+    if self.config.init_max_pool:
+      r *= 2
+
+    if self.config.data_format == "channels_first":
+      w, h = self.config.input_shape[1], self.config.input_shape[2]
+      input_shape = (f, w // r, h // r)
+    elif self.config.data_format == "channels_last":
+      w, h = self.config.input_shape[0], self.config.input_shape[1]
+      input_shape = (w // r, h // r, f)
+    else:
+      raise ValueError("Data format should be either `channels_first`"
+                       " or `channels_last`")
+    self.batch_norm = tf.keras.layers.BatchNormalization(
+        axis=self.axis,
+        input_shape=input_shape,
+        fused=self.config.fused,
+        dtype=self.config.dtype)
+    self.activation = tf.keras.layers.Activation("relu")
+    self.global_avg_pool = tf.keras.layers.GlobalAveragePooling2D(
+        data_format=self.config.data_format, dtype=self.config.dtype)
+    self.dense = tf.keras.layers.Dense(
+        self.config.n_classes, dtype=self.config.dtype)
+
+  def call(self, x, training=True):
+    net = tf.concat(x, axis=self.axis)
+    net = self.batch_norm(net, training=training)
+    net = self.activation(net)
+    net = self.global_avg_pool(net)
+    net = self.dense(net)
+
+    return net
diff --git a/tensorflow/contrib/eager/python/examples/revnet/blocks_test.py b/tensorflow/contrib/eager/python/examples/revnet/blocks_test.py
index d74785c8fe..fda9020ddf 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/blocks_test.py
+++ b/tensorflow/contrib/eager/python/examples/revnet/blocks_test.py
@@ -116,70 +116,13 @@ def _validate_block_call_channels_first(block_factory, test):
 
 class RevBlockTest(tf.test.TestCase):
 
-  def test_call_channels_first(self):
-    """Test `call` function with `channels_first` data format."""
-    if not tf.test.is_gpu_available():
-      self.skipTest("GPU not available")
-
-    with tf.device("/gpu:0"):  # Default NCHW format
-      input_shape = (128, 8, 8)
-      data_shape = (16,) + input_shape
-      x = tf.random_normal(shape=data_shape)
-
-      # Stride of 1
-      block = blocks.RevBlock(
-          n_res=3, filters=128, strides=(1, 1), input_shape=input_shape)
-      y_tr, y_ev = block(x, training=True), block(x, training=False)
-      self.assertEqual(y_tr.shape, y_ev.shape)
-      self.assertEqual(y_ev.shape, (16, 128, 8, 8))
-      self.assertNotAllClose(y_tr, y_ev)
-
-      # Stride of 2
-      block = blocks.RevBlock(
-          n_res=3, filters=128, strides=(2, 2), input_shape=input_shape)
-      y_tr, y_ev = block(x, training=True), block(x, training=False)
-      self.assertEqual(y_tr.shape, y_ev.shape)
-      self.assertEqual(y_ev.shape, [16, 128, 4, 4])
-      self.assertNotAllClose(y_tr, y_ev)
-
-  def test_call_channels_last(self):
-    """Test `call` function with `channels_last` data format."""
-    with tf.device("/cpu:0"):  # NHWC format
-      input_shape = (8, 8, 128)
-      data_shape = (16,) + input_shape
-      x = tf.random_normal(shape=data_shape)
-
-      # Stride 1
-      block = blocks.RevBlock(
-          n_res=3,
-          filters=128,
-          strides=(1, 1),
-          input_shape=input_shape,
-          data_format="channels_last")
-      y_tr, y_ev = block(x, training=True), block(x, training=False)
-      self.assertEqual(y_tr.shape, y_ev.shape)
-      self.assertEqual(y_ev.shape, (16, 8, 8, 128))
-      self.assertNotAllClose(y_tr, y_ev)
-
-      # Stride of 2
-      block = blocks.RevBlock(
-          n_res=3,
-          filters=128,
-          strides=(2, 2),
-          input_shape=input_shape,
-          data_format="channels_last")
-      y_tr, y_ev = block(x, training=True), block(x, training=False)
-      self.assertEqual(y_tr.shape, y_ev.shape)
-      self.assertEqual(y_ev.shape, (16, 4, 4, 128))
-      self.assertNotAllClose(y_tr, y_ev)
-
   def _check_grad_angle(self, grads, grads_true, atol=1e0):
     """Check the angle between two list of vectors are all close."""
     for g1, g2 in zip(grads, grads_true):
       degree = compute_degree(g1, g2)
       self.assertLessEqual(degree, atol)
 
-  def test_backward_grads_and_vars_channels_first(self):
+  def test_backward_grads_channels_first(self):
     """Test `backward` function with `channels_first` data format."""
     if not tf.test.is_gpu_available():
       self.skipTest("GPU not available")
@@ -190,6 +133,7 @@ class RevBlockTest(tf.test.TestCase):
       data_shape = (16,) + input_shape
       x = tf.random_normal(shape=data_shape, dtype=tf.float64)
       dy = tf.random_normal(shape=data_shape, dtype=tf.float64)
+      dy1, dy2 = tf.split(dy, num_or_size_splits=2, axis=1)
       block = blocks.RevBlock(
           n_res=3,
           filters=128,
@@ -199,9 +143,14 @@ class RevBlockTest(tf.test.TestCase):
           dtype=tf.float64)
       with tf.GradientTape() as tape:
         tape.watch(x)
-        y = block(x, training=True)
+        x1, x2 = tf.split(x, num_or_size_splits=2, axis=1)
+        y1, y2 = block((x1, x2), training=True)
+        y = tf.concat((y1, y2), axis=1)
       # Compute grads from reconstruction
-      dx, dw, vars_ = block.backward_grads_and_vars(x, y, dy, training=True)
+      (dx1, dx2), dw = block.backward_grads(
+          x=(x1, x2), y=(y1, y2), dy=(dy1, dy2), training=True)
+      dx = tf.concat((dx1, dx2), axis=1)
+      vars_ = block.trainable_variables
       # Compute true grads
       grads = tape.gradient(y, [x] + vars_, output_gradients=dy)
       dx_true, dw_true = grads[0], grads[1:]
@@ -213,6 +162,7 @@ class RevBlockTest(tf.test.TestCase):
       # Stride 2
       x = tf.random_normal(shape=data_shape, dtype=tf.float64)
       dy = tf.random_normal(shape=(16, 128, 4, 4), dtype=tf.float64)
+      dy1, dy2 = tf.split(dy, num_or_size_splits=2, axis=1)
       block = blocks.RevBlock(
           n_res=3,
           filters=128,
@@ -222,9 +172,14 @@ class RevBlockTest(tf.test.TestCase):
           dtype=tf.float64)
       with tf.GradientTape() as tape:
         tape.watch(x)
-        y = block(x, training=True)
+        x1, x2 = tf.split(x, num_or_size_splits=2, axis=1)
+        y1, y2 = block((x1, x2), training=True)
+        y = tf.concat((y1, y2), axis=1)
       # Compute grads from reconstruction
-      dx, dw, vars_ = block.backward_grads_and_vars(x, y, dy, training=True)
+      (dx1, dx2), dw = block.backward_grads(
+          x=(x1, x2), y=(y1, y2), dy=(dy1, dy2), training=True)
+      dx = tf.concat((dx1, dx2), axis=1)
+      vars_ = block.trainable_variables
       # Compute true grads
       grads = tape.gradient(y, [x] + vars_, output_gradients=dy)
       dx_true, dw_true = grads[0], grads[1:]
@@ -236,16 +191,7 @@ class RevBlockTest(tf.test.TestCase):
 
 class _ResidualTest(tf.test.TestCase):
 
-  def test_call(self):
-    """Test `call` function.
-
-    Varying downsampling and data format options.
-    """
-
-    _validate_block_call_channels_first(blocks._Residual, self)
-    _validate_block_call_channels_last(blocks._Residual, self)
-
-  def test_backward_grads_and_vars_channels_first(self):
+  def test_backward_grads_channels_first(self):
     """Test `backward_grads` function with `channels_first` data format."""
     if not tf.test.is_gpu_available():
       self.skipTest("GPU not available")
@@ -256,6 +202,7 @@ class _ResidualTest(tf.test.TestCase):
       # Use double precision for testing
       x_true = tf.random_normal(shape=data_shape, dtype=tf.float64)
       dy = tf.random_normal(shape=data_shape, dtype=tf.float64)
+      dy1, dy2 = tf.split(dy, num_or_size_splits=2, axis=1)
       residual = blocks._Residual(
           filters=128,
           strides=(1, 1),
@@ -264,16 +211,19 @@ class _ResidualTest(tf.test.TestCase):
           dtype=tf.float64)
 
       with tf.GradientTape() as tape:
-        x_true = tf.identity(x_true)
         tape.watch(x_true)
-        y = residual(x_true, training=True)
+        x1_true, x2_true = tf.split(x_true, num_or_size_splits=2, axis=1)
+        y1, y2 = residual((x1_true, x2_true), training=True)
+        y = tf.concat((y1, y2), axis=1)
 
       # Gradients computed due to reversibility
-      x, dx, dw, vars_ = residual.backward_grads_and_vars(
-          y, dy=dy, training=True)
-
+      (x1, x2), (dx1, dx2), dw = residual.backward_grads(
+          y=(y1, y2), dy=(dy1, dy2), training=True)
+      x = tf.concat((x1, x2), axis=1)
+      dx = tf.concat((dx1, dx2), axis=1)
       # True gradients computed by the tape
-      grads = tape.gradient(y, [x_true] + vars_, output_gradients=dy)
+      grads = tape.gradient(
+          y, [x_true] + residual.trainable_variables, output_gradients=dy)
       dx_true, dw_true = grads[0], grads[1:]
 
       self.assertAllClose(x_true, x)
diff --git a/tensorflow/contrib/eager/python/examples/revnet/cifar_input.py b/tensorflow/contrib/eager/python/examples/revnet/cifar_input.py
index b6d4c35bfd..e9672f13e1 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/cifar_input.py
+++ b/tensorflow/contrib/eager/python/examples/revnet/cifar_input.py
@@ -111,6 +111,6 @@ def get_ds_from_tfrecords(data_dir,
     }[split]
     dataset = dataset.shuffle(size)
 
-  dataset = dataset.batch(batch_size)
+  dataset = dataset.batch(batch_size, drop_remainder=True)
 
   return dataset
diff --git a/tensorflow/contrib/eager/python/examples/revnet/config.py b/tensorflow/contrib/eager/python/examples/revnet/config.py
index 3d93fa955a..29f1db0e03 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/config.py
+++ b/tensorflow/contrib/eager/python/examples/revnet/config.py
@@ -27,17 +27,17 @@ from __future__ import division
 from __future__ import print_function
 
 import tensorflow as tf
-tfe = tf.contrib.eager
 
 
 def get_hparams_cifar_38():
   """RevNet-38 configurations for CIFAR-10/CIFAR-100."""
 
   config = tf.contrib.training.HParams()
+  config.add_hparam("num_train_images", 50000)
+  config.add_hparam("num_eval_images", 10000)
   config.add_hparam("init_filters", 32)
   config.add_hparam("init_kernel", 3)
   config.add_hparam("init_stride", 1)
-  config.add_hparam("n_classes", 10)
   config.add_hparam("n_rev_blocks", 3)
   config.add_hparam("n_res", [3, 3, 3])
   config.add_hparam("filters", [32, 64, 112])
@@ -46,7 +46,7 @@ def get_hparams_cifar_38():
   config.add_hparam("bottleneck", False)
   config.add_hparam("fused", True)
   config.add_hparam("init_max_pool", False)
-  if tfe.num_gpus() > 0:
+  if tf.test.is_gpu_available():
     config.add_hparam("input_shape", (3, 32, 32))
     config.add_hparam("data_format", "channels_first")
   else:
@@ -68,9 +68,22 @@ def get_hparams_cifar_38():
   config.add_hparam("div255", True)
   # This is imprecise, when training with validation set,
   # we only have 40k images in training data
-  config.add_hparam("iters_per_epoch", 50000 // config.batch_size)
+  config.add_hparam("iters_per_epoch",
+                    config.num_train_images // config.batch_size)
   config.add_hparam("epochs", config.max_train_iter // config.iters_per_epoch)
 
+  # Customized TPU hyperparameters due to differing batch size caused by
+  # TPU architecture specifics
+  # Suggested batch sizes to reduce overhead from excessive tensor padding
+  # https://cloud.google.com/tpu/docs/troubleshooting
+  config.add_hparam("tpu_batch_size", 1024)
+  config.add_hparam("tpu_eval_batch_size", 1024)
+  config.add_hparam("tpu_iters_per_epoch",
+                    config.num_train_images // config.tpu_batch_size)
+  config.add_hparam("tpu_epochs",
+                    config.max_train_iter // config.tpu_iters_per_epoch)
+  config.add_hparam("tpu_eval_steps",
+                    config.num_eval_images // config.tpu_eval_batch_size)
   return config
 
 
@@ -98,15 +111,18 @@ def get_hparams_imagenet_56():
   """RevNet-56 configurations for ImageNet."""
 
   config = tf.contrib.training.HParams()
+  config.add_hparam("n_classes", 1000)
+  config.add_hparam("dataset", "ImageNet")
+  config.add_hparam("num_train_images", 1281167)
+  config.add_hparam("num_eval_images", 50000)
   config.add_hparam("init_filters", 128)
   config.add_hparam("init_kernel", 7)
   config.add_hparam("init_stride", 2)
-  config.add_hparam("n_classes", 1000)
   config.add_hparam("n_rev_blocks", 4)
   config.add_hparam("n_res", [2, 2, 2, 2])
   config.add_hparam("filters", [128, 256, 512, 832])
   config.add_hparam("strides", [1, 2, 2, 2])
-  config.add_hparam("batch_size", 16)
+  config.add_hparam("batch_size", 256)
   config.add_hparam("bottleneck", True)
   config.add_hparam("fused", True)
   config.add_hparam("init_max_pool", True)
@@ -116,6 +132,9 @@ def get_hparams_imagenet_56():
   else:
     config.add_hparam("input_shape", (224, 224, 3))
     config.add_hparam("data_format", "channels_last")
+  # Due to bottleneck residual blocks
+  filters = [f * 4 for f in config.filters]
+  config.filters = filters
 
   # Training details
   config.add_hparam("weight_decay", 1e-4)
@@ -125,16 +144,32 @@ def get_hparams_imagenet_56():
   config.add_hparam("max_train_iter", 600000)
   config.add_hparam("seed", 1234)
   config.add_hparam("shuffle", True)
-  config.add_hparam("log_every", 50)
-  config.add_hparam("save_every", 50)
+  config.add_hparam("log_every", 500)
+  config.add_hparam("save_every", 500)
   config.add_hparam("dtype", tf.float32)
-  config.add_hparam("eval_batch_size", 1000)
+  config.add_hparam("eval_batch_size", 256)
   config.add_hparam("div255", True)
-  # TODO(lxuechen): Update this according to ImageNet data
-  config.add_hparam("iters_per_epoch", 50000 // config.batch_size)
+  config.add_hparam("iters_per_epoch",
+                    config.num_train_images // config.batch_size)
   config.add_hparam("epochs", config.max_train_iter // config.iters_per_epoch)
-  # Due to bottleneck residual blocks
-  filters = [f * 4 for f in config.filters]
-  config.filters = filters
+
+  # Customized TPU hyperparameters due to differing batch size caused by
+  # TPU architecture specifics
+  # Suggested batch sizes to reduce overhead from excessive tensor padding
+  # https://cloud.google.com/tpu/docs/troubleshooting
+  config.add_hparam("tpu_batch_size", 1024)
+  config.add_hparam("tpu_eval_batch_size", 1024)
+  config.add_hparam("tpu_iters_per_epoch",
+                    config.num_train_images // config.tpu_batch_size)
+  config.add_hparam("tpu_epochs",
+                    config.max_train_iter // config.tpu_iters_per_epoch)
+  config.add_hparam("tpu_eval_steps",
+                    config.num_eval_images // config.tpu_eval_batch_size)
+  return config
+
+
+def get_hparams_imagenet_104():
+  config = get_hparams_imagenet_56()
+  config.n_res = [2, 2, 11, 2]
 
   return config
diff --git a/tensorflow/contrib/eager/python/examples/revnet/imagenet_input.py b/tensorflow/contrib/eager/python/examples/revnet/imagenet_input.py
new file mode 100644
index 0000000000..34a9984b0e
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/revnet/imagenet_input.py
@@ -0,0 +1,229 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Efficient ImageNet input pipeline using tf.data.Dataset."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import functools
+import os
+
+import tensorflow as tf
+
+from tensorflow.contrib.eager.python.examples.revnet import resnet_preprocessing
+
+
+def image_serving_input_fn():
+  """Serving input fn for raw images."""
+
+  def _preprocess_image(image_bytes):
+    """Preprocess a single raw image."""
+    image = resnet_preprocessing.preprocess_image(
+        image_bytes=image_bytes, is_training=False)
+    return image
+
+  image_bytes_list = tf.placeholder(
+      shape=[None],
+      dtype=tf.string,
+  )
+  images = tf.map_fn(
+      _preprocess_image, image_bytes_list, back_prop=False, dtype=tf.float32)
+  return tf.estimator.export.ServingInputReceiver(
+      images, {'image_bytes': image_bytes_list})
+
+
+class ImageNetInput(object):
+  """Generates ImageNet input_fn for training or evaluation.
+
+  The training data is assumed to be in TFRecord format with keys as specified
+  in the dataset_parser below, sharded across 1024 files, named sequentially:
+      train-00000-of-01024
+      train-00001-of-01024
+      ...
+      train-01023-of-01024
+
+  The validation data is in the same format but sharded in 128 files.
+
+  The format of the data required is created by the script at:
+      https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py
+
+  Args:
+    is_training: `bool` for whether the input is for training
+    data_dir: `str` for the directory of the training and validation data;
+        if 'null' (the literal string 'null', not None), then construct a null
+        pipeline, consisting of empty images.
+    use_bfloat16: If True, use bfloat16 precision; else use float32.
+    transpose_input: 'bool' for whether to use the double transpose trick
+    num_cores: `int` for the number of TPU cores
+  """
+
+  def __init__(self, is_training,
+               use_bfloat16,
+               data_dir,
+               num_cores=8,
+               num_parallel_calls=64,
+               image_size=224,
+               transpose_input=False,
+               cache=False):
+    self.image_preprocessing_fn = resnet_preprocessing.preprocess_image
+    self.is_training = is_training
+    self.use_bfloat16 = use_bfloat16
+    self.data_dir = data_dir
+    self.num_cores = num_cores
+    self.num_parallel_calls = num_parallel_calls
+    if self.data_dir == 'null' or self.data_dir == '':
+      self.data_dir = None
+    self.transpose_input = transpose_input
+    self.image_size = image_size
+    self.cache = cache
+
+  def set_shapes(self, batch_size, images, labels):
+    """Statically set the batch_size dimension."""
+    if self.transpose_input:
+      images.set_shape(images.get_shape().merge_with(
+          tf.TensorShape([None, None, None, batch_size])))
+      labels.set_shape(labels.get_shape().merge_with(
+          tf.TensorShape([batch_size])))
+    else:
+      images.set_shape(images.get_shape().merge_with(
+          tf.TensorShape([batch_size, None, None, None])))
+      labels.set_shape(labels.get_shape().merge_with(
+          tf.TensorShape([batch_size])))
+
+    return images, labels
+
+  def dataset_parser(self, value):
+    """Parse an ImageNet record from a serialized string Tensor."""
+    keys_to_features = {
+        'image/encoded': tf.FixedLenFeature((), tf.string, ''),
+        'image/format': tf.FixedLenFeature((), tf.string, 'jpeg'),
+        'image/class/label': tf.FixedLenFeature([], tf.int64, -1),
+        'image/class/text': tf.FixedLenFeature([], tf.string, ''),
+        'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/class/label': tf.VarLenFeature(dtype=tf.int64),
+    }
+
+    parsed = tf.parse_single_example(value, keys_to_features)
+    image_bytes = tf.reshape(parsed['image/encoded'], shape=[])
+
+    image = self.image_preprocessing_fn(
+        image_bytes=image_bytes,
+        is_training=self.is_training,
+        image_size=self.image_size,
+        use_bfloat16=self.use_bfloat16)
+
+    # Subtract one so that labels are in [0, 1000).
+    label = tf.cast(
+        tf.reshape(parsed['image/class/label'], shape=[]), dtype=tf.int32) - 1
+
+    return image, label
+
+  def input_fn(self, params):
+    """Input function which provides a single batch for train or eval.
+
+    Args:
+      params: `dict` of parameters passed from the `TPUEstimator`.
+          `params['batch_size']` is always provided and should be used as the
+          effective batch size.
+
+    Returns:
+      A `tf.data.Dataset` object.
+    """
+    if self.data_dir is None:
+      tf.logging.info('Using fake input.')
+      return self.input_fn_null(params)
+
+    # Retrieves the batch size for the current shard. The # of shards is
+    # computed according to the input pipeline deployment. See
+    # tf.contrib.tpu.RunConfig for details.
+    batch_size = params['batch_size']
+
+    # Shuffle the filenames to ensure better randomization.
+    file_pattern = os.path.join(
+        self.data_dir, 'train-*' if self.is_training else 'validation-*')
+    dataset = tf.data.Dataset.list_files(file_pattern, shuffle=self.is_training)
+
+    if self.is_training and not self.cache:
+      dataset = dataset.repeat()
+
+    def fetch_dataset(filename):
+      buffer_size = 8 * 1024 * 1024  # 8 MiB per file
+      dataset = tf.data.TFRecordDataset(filename, buffer_size=buffer_size)
+      return dataset
+
+    # Read the data from disk in parallel
+    dataset = dataset.apply(
+        tf.contrib.data.parallel_interleave(
+            fetch_dataset, cycle_length=self.num_parallel_calls, sloppy=True))
+    if self.cache:
+      dataset = dataset.cache().apply(
+          tf.contrib.data.shuffle_and_repeat(1024 * 16))
+    else:
+      dataset = dataset.shuffle(1024)
+
+    # Use the fused map-and-batch operation.
+    #
+    # For XLA, we must used fixed shapes. Because we repeat the source training
+    # dataset indefinitely, we can use `drop_remainder=True` to get fixed-size
+    # batches without dropping any training examples.
+    #
+    # When evaluating, `drop_remainder=True` prevents accidentally evaluating
+    # the same image twice by dropping the final batch if it is less than a full
+    # batch size. As long as this validation is done with consistent batch size,
+    # exactly the same images will be used.
+    dataset = dataset.apply(
+        tf.contrib.data.map_and_batch(
+            self.dataset_parser, batch_size=batch_size,
+            num_parallel_batches=self.num_cores, drop_remainder=True))
+
+    # Transpose for performance on TPU
+    if self.transpose_input:
+      dataset = dataset.map(
+          lambda images, labels: (tf.transpose(images, [1, 2, 3, 0]), labels),
+          num_parallel_calls=self.num_cores)
+
+    # Assign static batch size dimension
+    dataset = dataset.map(functools.partial(self.set_shapes, batch_size))
+
+    # Prefetch overlaps in-feed with training
+    dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
+    return dataset
+
+  def input_fn_null(self, params):
+    """Input function which provides null (black) images."""
+    batch_size = params['batch_size']
+    dataset = tf.data.Dataset.range(1).repeat().map(self._get_null_input)
+    dataset = dataset.prefetch(batch_size)
+
+    dataset = dataset.batch(batch_size, drop_remainder=True)
+    if self.transpose_input:
+      dataset = dataset.map(
+          lambda images, labels: (tf.transpose(images, [1, 2, 3, 0]), labels),
+          num_parallel_calls=8)
+
+    dataset = dataset.map(functools.partial(self.set_shapes, batch_size))
+
+    dataset = dataset.prefetch(32)     # Prefetch overlaps in-feed with training
+    tf.logging.info('Input dataset: %s', str(dataset))
+    return dataset
+
+  def _get_null_input(self, _):
+    null_image = tf.zeros([224, 224, 3], tf.bfloat16
+                          if self.use_bfloat16 else tf.float32)
+    return (null_image, tf.constant(0, tf.int32))
diff --git a/tensorflow/contrib/eager/python/examples/revnet/main.py b/tensorflow/contrib/eager/python/examples/revnet/main.py
index e2f43b03f9..b702e91f92 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/main.py
+++ b/tensorflow/contrib/eager/python/examples/revnet/main.py
@@ -29,10 +29,18 @@ from tensorflow.contrib.eager.python.examples.revnet import revnet
 tfe = tf.contrib.eager
 
 
+def apply_gradients(optimizer, grads, vars_, global_step=None):
+  """Functional style apply_grads for `tfe.defun`."""
+  optimizer.apply_gradients(zip(grads, vars_), global_step=global_step)
+
+
 def main(_):
   """Eager execution workflow with RevNet trained on CIFAR-10."""
-  config = get_config()
-  ds_train, ds_train_one_shot, ds_validation, ds_test = get_datasets(config)
+  tf.enable_eager_execution()
+
+  config = get_config(config_name=FLAGS.config, dataset=FLAGS.dataset)
+  ds_train, ds_train_one_shot, ds_validation, ds_test = get_datasets(
+      data_dir=FLAGS.data_dir, config=config)
   model = revnet.RevNet(config=config)
   global_step = tf.train.get_or_create_global_step()  # Ensure correct summary
   global_step.assign(1)
@@ -43,6 +51,14 @@ def main(_):
   checkpointer = tf.train.Checkpoint(
       optimizer=optimizer, model=model, optimizer_step=global_step)
 
+  if FLAGS.use_defun:
+    model.call = tfe.defun(model.call)
+    model.compute_gradients = tfe.defun(model.compute_gradients)
+    model.get_moving_stats = tfe.defun(model.get_moving_stats)
+    model.restore_moving_stats = tfe.defun(model.restore_moving_stats)
+    global apply_gradients  # pylint:disable=global-variable-undefined
+    apply_gradients = tfe.defun(apply_gradients)
+
   if FLAGS.train_dir:
     summary_writer = tf.contrib.summary.create_file_writer(FLAGS.train_dir)
     if FLAGS.restore:
@@ -52,46 +68,37 @@ def main(_):
             "with global_step: {}".format(latest_path, global_step.numpy()))
       sys.stdout.flush()
 
-  if FLAGS.manual_grad:
-    print("Using manual gradients.")
-  else:
-    print("Not using manual gradients.")
-  sys.stdout.flush()
-
   for x, y in ds_train:
     train_one_iter(model, x, y, optimizer, global_step=global_step)
 
     if global_step.numpy() % config.log_every == 0:
-      it_train = ds_train_one_shot.make_one_shot_iterator()
       it_test = ds_test.make_one_shot_iterator()
-      acc_train, loss_train = evaluate(model, it_train)
       acc_test, loss_test = evaluate(model, it_test)
 
       if FLAGS.validate:
+        it_train = ds_train_one_shot.make_one_shot_iterator()
         it_validation = ds_validation.make_one_shot_iterator()
+        acc_train, loss_train = evaluate(model, it_train)
         acc_validation, loss_validation = evaluate(model, it_validation)
         print("Iter {}, "
               "training set accuracy {:.4f}, loss {:.4f}; "
-              "validation set accuracy {:.4f}, loss {:4.f}"
+              "validation set accuracy {:.4f}, loss {:.4f}; "
               "test accuracy {:.4f}, loss {:.4f}".format(
                   global_step.numpy(), acc_train, loss_train, acc_validation,
                   loss_validation, acc_test, loss_test))
       else:
-        print("Iter {}, "
-              "training set accuracy {:.4f}, loss {:.4f}; "
-              "test accuracy {:.4f}, loss {:.4f}".format(
-                  global_step.numpy(), acc_train, loss_train, acc_test,
-                  loss_test))
+        print("Iter {}, test accuracy {:.4f}, loss {:.4f}".format(
+            global_step.numpy(), acc_test, loss_test))
       sys.stdout.flush()
 
       if FLAGS.train_dir:
         with summary_writer.as_default():
           with tf.contrib.summary.always_record_summaries():
-            tf.contrib.summary.scalar("Training accuracy", acc_train)
             tf.contrib.summary.scalar("Test accuracy", acc_test)
-            tf.contrib.summary.scalar("Training loss", loss_train)
             tf.contrib.summary.scalar("Test loss", loss_test)
             if FLAGS.validate:
+              tf.contrib.summary.scalar("Training accuracy", acc_train)
+              tf.contrib.summary.scalar("Training loss", loss_train)
               tf.contrib.summary.scalar("Validation accuracy", acc_validation)
               tf.contrib.summary.scalar("Validation loss", loss_validation)
 
@@ -103,34 +110,38 @@ def main(_):
       sys.stdout.flush()
 
 
-def get_config():
+def get_config(config_name="revnet-38", dataset="cifar-10"):
   """Return configuration."""
-  print("Config: {}".format(FLAGS.config))
+  print("Config: {}".format(config_name))
   sys.stdout.flush()
   config = {
       "revnet-38": config_.get_hparams_cifar_38(),
       "revnet-110": config_.get_hparams_cifar_110(),
       "revnet-164": config_.get_hparams_cifar_164(),
-  }[FLAGS.config]
+  }[config_name]
 
-  if FLAGS.dataset == "cifar-100":
-    config.n_classes = 100
+  if dataset == "cifar-10":
+    config.add_hparam("n_classes", 10)
+    config.add_hparam("dataset", "cifar-10")
+  else:
+    config.add_hparam("n_classes", 100)
+    config.add_hparam("dataset", "cifar-100")
 
   return config
 
 
-def get_datasets(config):
+def get_datasets(data_dir, config):
   """Return dataset."""
-  if FLAGS.data_dir is None:
+  if data_dir is None:
     raise ValueError("No supplied data directory")
-  if not os.path.exists(FLAGS.data_dir):
-    raise ValueError("Data directory {} does not exist".format(FLAGS.data_dir))
-  if FLAGS.dataset not in ["cifar-10", "cifar-100"]:
-    raise ValueError("Unknown dataset {}".format(FLAGS.dataset))
+  if not os.path.exists(data_dir):
+    raise ValueError("Data directory {} does not exist".format(data_dir))
+  if config.dataset not in ["cifar-10", "cifar-100"]:
+    raise ValueError("Unknown dataset {}".format(config.dataset))
 
-  print("Training on {} dataset.".format(FLAGS.dataset))
+  print("Training on {} dataset.".format(config.dataset))
   sys.stdout.flush()
-  data_dir = os.path.join(FLAGS.data_dir, FLAGS.dataset)
+  data_dir = os.path.join(data_dir, config.dataset)
   if FLAGS.validate:
     # 40k Training set
     ds_train = cifar_input.get_ds_from_tfrecords(
@@ -168,7 +179,7 @@ def get_datasets(config):
         prefetch=config.batch_size)
     ds_validation = None
 
-  # Always compute loss and accuracy on whole training and test set
+  # Always compute loss and accuracy on whole test set
   ds_train_one_shot = cifar_input.get_ds_from_tfrecords(
       data_dir=data_dir,
       split="train_all",
@@ -196,19 +207,15 @@ def get_datasets(config):
 
 def train_one_iter(model, inputs, labels, optimizer, global_step=None):
   """Train for one iteration."""
-  if FLAGS.manual_grad:
-    grads, vars_, loss = model.compute_gradients(inputs, labels, training=True)
-    optimizer.apply_gradients(zip(grads, vars_), global_step=global_step)
-  else:  # For correctness validation
-    with tf.GradientTape() as tape:
-      logits, _ = model(inputs, training=True)
-      loss = model.compute_loss(logits=logits, labels=labels)
-      tf.logging.info("Logits are placed on device: {}".format(logits.device))
-    grads = tape.gradient(loss, model.trainable_variables)
-    optimizer.apply_gradients(
-        zip(grads, model.trainable_variables), global_step=global_step)
+  logits, saved_hiddens = model(inputs, training=True)
+  values = model.get_moving_stats()
+  grads, loss = model.compute_gradients(saved_hiddens, labels)
+  # Restore moving averages when executing eagerly to avoid updating twice
+  model.restore_moving_stats(values)
+  apply_gradients(
+      optimizer, grads, model.trainable_variables, global_step=global_step)
 
-  return loss.numpy()
+  return logits, loss
 
 
 def evaluate(model, iterator):
@@ -241,16 +248,18 @@ if __name__ == "__main__":
       "validate",
       default=False,
       help="[Optional] Use the validation set or not for hyperparameter search")
-  flags.DEFINE_boolean(
-      "manual_grad",
-      default=False,
-      help="[Optional] Use manual gradient graph to save memory")
   flags.DEFINE_string(
       "dataset",
       default="cifar-10",
       help="[Optional] The dataset used; either `cifar-10` or `cifar-100`")
   flags.DEFINE_string(
-      "config", default="revnet-38", help="[Optional] Architecture of network.")
+      "config",
+      default="revnet-38",
+      help="[Optional] Architecture of network. "
+      "Other options include `revnet-110` and `revnet-164`")
+  flags.DEFINE_boolean(
+      "use_defun",
+      default=False,
+      help="[Optional] Use `tfe.defun` to boost performance.")
   FLAGS = flags.FLAGS
-  tf.enable_eager_execution()
   tf.app.run(main)
diff --git a/tensorflow/contrib/eager/python/examples/revnet/main_estimator.py b/tensorflow/contrib/eager/python/examples/revnet/main_estimator.py
new file mode 100644
index 0000000000..3a17eb30da
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/revnet/main_estimator.py
@@ -0,0 +1,200 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Estimator workflow with RevNet train on CIFAR-10."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+from absl import flags
+import tensorflow as tf
+from tensorflow.contrib.eager.python.examples.revnet import cifar_input
+from tensorflow.contrib.eager.python.examples.revnet import main as main_
+from tensorflow.contrib.eager.python.examples.revnet import revnet
+
+
+def model_fn(features, labels, mode, params):
+  """Function specifying the model that is required by the `tf.estimator` API.
+
+  Args:
+    features: Input images
+    labels: Labels of images
+    mode: One of `ModeKeys.TRAIN`, `ModeKeys.EVAL` or 'ModeKeys.PREDICT'
+    params: A dictionary of extra parameter that might be passed
+
+  Returns:
+    An instance of `tf.estimator.EstimatorSpec`
+  """
+
+  inputs = features
+  if isinstance(inputs, dict):
+    inputs = features["image"]
+
+  config = params["config"]
+  model = revnet.RevNet(config=config)
+
+  if mode == tf.estimator.ModeKeys.TRAIN:
+    global_step = tf.train.get_or_create_global_step()
+    learning_rate = tf.train.piecewise_constant(
+        global_step, config.lr_decay_steps, config.lr_list)
+    optimizer = tf.train.MomentumOptimizer(
+        learning_rate, momentum=config.momentum)
+    logits, saved_hidden = model(inputs, training=True)
+    grads, loss = model.compute_gradients(saved_hidden, labels, training=True)
+    with tf.control_dependencies(model.get_updates_for(inputs)):
+      train_op = optimizer.apply_gradients(
+          zip(grads, model.trainable_variables), global_step=global_step)
+
+    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
+  else:
+    logits, _ = model(inputs, training=False)
+    predictions = tf.argmax(logits, axis=1)
+    probabilities = tf.nn.softmax(logits)
+
+    if mode == tf.estimator.ModeKeys.EVAL:
+      loss = model.compute_loss(labels=labels, logits=logits)
+      return tf.estimator.EstimatorSpec(
+          mode=mode,
+          loss=loss,
+          eval_metric_ops={
+              "accuracy":
+                  tf.metrics.accuracy(labels=labels, predictions=predictions)
+          })
+
+    else:  # mode == tf.estimator.ModeKeys.PREDICT
+      result = {
+          "classes": predictions,
+          "probabilities": probabilities,
+      }
+
+      return tf.estimator.EstimatorSpec(
+          mode=mode,
+          predictions=predictions,
+          export_outputs={
+              "classify": tf.estimator.export.PredictOutput(result)
+          })
+
+
+def get_input_fn(config, data_dir, split):
+  """Get the input function that is required by the `tf.estimator` API.
+
+  Args:
+    config: Customized hyperparameters
+    data_dir: Directory where the data is stored
+    split: One of `train`, `validation`, `train_all`, and `test`
+
+  Returns:
+    Input function required by the `tf.estimator` API
+  """
+
+  data_dir = os.path.join(data_dir, config.dataset)
+  # Fix split-dependent hyperparameters
+  if split == "train_all" or split == "train":
+    data_aug = True
+    batch_size = config.batch_size
+    epochs = config.epochs
+    shuffle = True
+    prefetch = config.batch_size
+  else:
+    data_aug = False
+    batch_size = config.eval_batch_size
+    epochs = 1
+    shuffle = False
+    prefetch = config.eval_batch_size
+
+  def input_fn():
+    """Input function required by the `tf.estimator.Estimator` API."""
+    return cifar_input.get_ds_from_tfrecords(
+        data_dir=data_dir,
+        split=split,
+        data_aug=data_aug,
+        batch_size=batch_size,
+        epochs=epochs,
+        shuffle=shuffle,
+        prefetch=prefetch,
+        data_format=config.data_format)
+
+  return input_fn
+
+
+def main(_):
+  tf.logging.set_verbosity(tf.logging.INFO)
+
+  # RevNet specific configuration
+  config = main_.get_config(config_name=FLAGS.config, dataset=FLAGS.dataset)
+
+  # Estimator specific configuration
+  run_config = tf.estimator.RunConfig(
+      model_dir=FLAGS.model_dir,  # Directory for storing checkpoints
+      tf_random_seed=config.seed,
+      save_summary_steps=config.log_every,
+      save_checkpoints_steps=config.log_every,
+      session_config=None,  # Using default
+      keep_checkpoint_max=100,
+      keep_checkpoint_every_n_hours=10000,  # Using default
+      log_step_count_steps=config.log_every,
+      train_distribute=None  # Default not use distribution strategy
+  )
+
+  # Construct estimator
+  revnet_estimator = tf.estimator.Estimator(
+      model_fn=model_fn,
+      model_dir=FLAGS.model_dir,
+      config=run_config,
+      params={"config": config})
+
+  # Construct input functions
+  train_input_fn = get_input_fn(
+      config=config, data_dir=FLAGS.data_dir, split="train_all")
+  eval_input_fn = get_input_fn(
+      config=config, data_dir=FLAGS.data_dir, split="test")
+
+  # Train and evaluate estimator
+  revnet_estimator.train(input_fn=train_input_fn)
+  revnet_estimator.evaluate(input_fn=eval_input_fn)
+
+  if FLAGS.export:
+    input_shape = (None,) + config.input_shape
+    inputs = tf.placeholder(tf.float32, shape=input_shape)
+    input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
+        "image": inputs
+    })
+    revnet_estimator.export_savedmodel(FLAGS.model_dir, input_fn)
+
+
+if __name__ == "__main__":
+  flags.DEFINE_string(
+      "data_dir", default=None, help="Directory to load tfrecords")
+  flags.DEFINE_string(
+      "model_dir",
+      default=None,
+      help="[Optional] Directory to store the training information")
+  flags.DEFINE_string(
+      "dataset",
+      default="cifar-10",
+      help="[Optional] The dataset used; either `cifar-10` or `cifar-100`")
+  flags.DEFINE_boolean(
+      "export",
+      default=False,
+      help="[Optional] Export the model for serving if True")
+  flags.DEFINE_string(
+      "config",
+      default="revnet-38",
+      help="[Optional] Architecture of network. "
+      "Other options include `revnet-110` and `revnet-164`")
+  FLAGS = flags.FLAGS
+  tf.app.run()
diff --git a/tensorflow/contrib/eager/python/examples/revnet/main_estimator_tpu.py b/tensorflow/contrib/eager/python/examples/revnet/main_estimator_tpu.py
new file mode 100644
index 0000000000..8520cf5b71
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/revnet/main_estimator_tpu.py
@@ -0,0 +1,394 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Cloud TPU Estimator workflow with RevNet train on ImageNet."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import time
+
+from absl import flags
+import tensorflow as tf
+from tensorflow.contrib import summary
+from tensorflow.contrib.eager.python.examples.revnet import config as config_
+from tensorflow.contrib.eager.python.examples.revnet import imagenet_input
+from tensorflow.contrib.eager.python.examples.revnet import revnet
+from tensorflow.contrib.training.python.training import evaluation
+from tensorflow.python.estimator import estimator
+
+MEAN_RGB = [0.485, 0.456, 0.406]
+STDDEV_RGB = [0.229, 0.224, 0.225]
+
+
+def _host_call_fn(gs, loss, lr):
+  """Training host call.
+
+  Creates scalar summaries for training metrics.
+
+  This function is executed on the CPU and should not directly reference
+  any Tensors in the rest of the `model_fn`. To pass Tensors from the
+  model to the `metric_fn`, provide as part of the `host_call`. See
+  https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimatorSpec
+  for more information.
+
+  Arguments should match the list of `Tensor` objects passed as the second
+  element in the tuple passed to `host_call`.
+
+  Args:
+    gs: `Tensor with shape `[batch]` for the global_step
+    loss: `Tensor` with shape `[batch]` for the training loss.
+    lr: `Tensor` with shape `[batch]` for the learning_rate.
+
+  Returns:
+    List of summary ops to run on the CPU host.
+  """
+  # Host call fns are executed FLAGS.iterations_per_loop times after one
+  # TPU loop is finished, setting max_queue value to the same as number of
+  # iterations will make the summary writer only flush the data to storage
+  # once per loop.
+  gs = gs[0]
+  with summary.create_file_writer(
+      FLAGS.model_dir, max_queue=FLAGS.iterations_per_loop).as_default():
+    with summary.always_record_summaries():
+      summary.scalar("loss", loss[0], step=gs)
+      summary.scalar("learning_rate", lr[0], step=gs)
+      return summary.all_summary_ops()
+
+
+def _metric_fn(labels, logits):
+  """Evaluation metric function. Evaluates accuracy.
+
+  This function is executed on the CPU and should not directly reference
+  any Tensors in the rest of the `model_fn`. To pass Tensors from the model
+  to the `metric_fn`, provide as part of the `eval_metrics`. See
+  https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimatorSpec
+  for more information.
+
+  Arguments should match the list of `Tensor` objects passed as the second
+  element in the tuple passed to `eval_metrics`.
+
+  Args:
+    labels: `Tensor` with shape `[batch]`.
+    logits: `Tensor` with shape `[batch, num_classes]`.
+
+  Returns:
+    A dict of the metrics to return from evaluation.
+  """
+  predictions = tf.argmax(logits, axis=1)
+  top_1_accuracy = tf.metrics.accuracy(labels, predictions)
+  in_top_5 = tf.cast(tf.nn.in_top_k(logits, labels, 5), tf.float32)
+  top_5_accuracy = tf.metrics.mean(in_top_5)
+
+  return {
+      "top_1_accuracy": top_1_accuracy,
+      "top_5_accuracy": top_5_accuracy,
+  }
+
+
+def model_fn(features, labels, mode, params):
+  """Model function required by the `tf.contrib.tpu.TPUEstimator` API.
+
+  Args:
+    features: Input images
+    labels: Labels of images
+    mode: One of `ModeKeys.TRAIN`, `ModeKeys.EVAL` or 'ModeKeys.PREDICT'
+    params: A dictionary of extra parameter that might be passed
+
+  Returns:
+    An instance of `tf.contrib.tpu.TPUEstimatorSpec`
+  """
+  revnet_config = params["revnet_config"]
+  model = revnet.RevNet(config=revnet_config)
+
+  inputs = features
+  if isinstance(inputs, dict):
+    inputs = features["image"]
+
+  if revnet_config.data_format == "channels_first":
+    assert not FLAGS.transpose_input  # channels_first only for GPU
+    inputs = tf.transpose(inputs, [0, 3, 1, 2])
+
+  if FLAGS.transpose_input and mode != tf.estimator.ModeKeys.PREDICT:
+    inputs = tf.transpose(inputs, [3, 0, 1, 2])  # HWCN to NHWC
+
+  # Normalize the image to zero mean and unit variance.
+  inputs -= tf.constant(MEAN_RGB, shape=[1, 1, 3], dtype=inputs.dtype)
+  inputs /= tf.constant(STDDEV_RGB, shape=[1, 1, 3], dtype=inputs.dtype)
+
+  if mode == tf.estimator.ModeKeys.TRAIN:
+    global_step = tf.train.get_or_create_global_step()
+    learning_rate = tf.train.piecewise_constant(
+        global_step, revnet_config.lr_decay_steps, revnet_config.lr_list)
+    optimizer = tf.train.MomentumOptimizer(learning_rate,
+                                           revnet_config.momentum)
+    if FLAGS.use_tpu:
+      optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
+
+    logits, saved_hidden = model(inputs, training=True)
+    grads, loss = model.compute_gradients(saved_hidden, labels, training=True)
+    with tf.control_dependencies(model.get_updates_for(inputs)):
+      train_op = optimizer.apply_gradients(
+          zip(grads, model.trainable_variables), global_step=global_step)
+    if not FLAGS.skip_host_call:
+      # To log the loss, current learning rate, and epoch for Tensorboard, the
+      # summary op needs to be run on the host CPU via host_call. host_call
+      # expects [batch_size, ...] Tensors, thus reshape to introduce a batch
+      # dimension. These Tensors are implicitly concatenated to
+      # [params['batch_size']].
+      gs_t = tf.reshape(global_step, [1])
+      loss_t = tf.reshape(loss, [1])
+      lr_t = tf.reshape(learning_rate, [1])
+      host_call = (_host_call_fn, [gs_t, loss_t, lr_t])
+
+    return tf.contrib.tpu.TPUEstimatorSpec(
+        mode=mode, loss=loss, train_op=train_op, host_call=host_call)
+
+  elif mode == tf.estimator.ModeKeys.EVAL:
+    logits, _ = model(inputs, training=False)
+    loss = model.compute_loss(labels=labels, logits=logits)
+
+    return tf.contrib.tpu.TPUEstimatorSpec(
+        mode=mode, loss=loss, eval_metrics=(_metric_fn, [labels, logits]))
+
+  else:  # Predict or export
+    logits, _ = model(inputs, training=False)
+    predictions = {
+        "classes": tf.argmax(logits, axis=1),
+        "probabilities": tf.nn.softmax(logits),
+    }
+
+    return tf.contrib.tpu.TPUEstimatorSpec(
+        mode=mode,
+        predictions=predictions,
+        export_outputs={
+            "classify": tf.estimator.export.PredictOutput(predictions)
+        })
+
+
+def main(_):
+  tf.logging.set_verbosity(tf.logging.INFO)
+
+  # RevNet specific configuration
+  revnet_config = {
+      "revnet-56": config_.get_hparams_imagenet_56(),
+      "revnet-104": config_.get_hparams_imagenet_104()
+  }[FLAGS.revnet_config]
+
+  if FLAGS.use_tpu:
+    revnet_config.data_format = "channels_last"
+
+  tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
+      FLAGS.tpu, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
+
+  # Estimator specific configuration
+  config = tf.contrib.tpu.RunConfig(
+      cluster=tpu_cluster_resolver,
+      model_dir=FLAGS.model_dir,
+      session_config=tf.ConfigProto(
+          allow_soft_placement=True, log_device_placement=True),
+      tpu_config=tf.contrib.tpu.TPUConfig(
+          iterations_per_loop=FLAGS.iterations_per_loop,
+          num_shards=FLAGS.num_shards,
+          per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.
+          PER_HOST_V2),
+  )
+
+  # Input pipelines are slightly different (with regards to shuffling and
+  # preprocessing) between training and evaluation.
+  imagenet_train, imagenet_eval = [
+      imagenet_input.ImageNetInput(
+          is_training=is_training,
+          data_dir=FLAGS.data_dir,
+          transpose_input=FLAGS.transpose_input,
+          use_bfloat16=False) for is_training in [True, False]
+  ]
+
+  revnet_classifier = tf.contrib.tpu.TPUEstimator(
+      model_fn=model_fn,
+      use_tpu=FLAGS.use_tpu,
+      train_batch_size=revnet_config.tpu_batch_size,
+      eval_batch_size=revnet_config.tpu_eval_batch_size,
+      config=config,
+      export_to_tpu=False,
+      params={"revnet_config": revnet_config})
+
+  steps_per_epoch = revnet_config.tpu_iters_per_epoch
+  eval_steps = revnet_config.tpu_eval_steps
+
+  # pylint: disable=protected-access
+  if FLAGS.mode == "eval":
+    # Run evaluation when there's a new checkpoint
+    for ckpt in evaluation.checkpoints_iterator(
+        FLAGS.model_dir, timeout=FLAGS.eval_timeout):
+      tf.logging.info("Starting to evaluate.")
+      try:
+        start_timestamp = time.time()  # This time will include compilation time
+        eval_results = revnet_classifier.evaluate(
+            input_fn=imagenet_eval.input_fn,
+            steps=eval_steps,
+            checkpoint_path=ckpt)
+        elapsed_time = int(time.time() - start_timestamp)
+        tf.logging.info("Eval results: %s. Elapsed seconds: %d" %
+                        (eval_results, elapsed_time))
+
+        # Terminate eval job when final checkpoint is reached
+        current_step = int(os.path.basename(ckpt).split("-")[1])
+        if current_step >= revnet_config.max_train_iter:
+          tf.logging.info(
+              "Evaluation finished after training step %d" % current_step)
+          break
+
+      except tf.errors.NotFoundError:
+        # Since the coordinator is on a different job than the TPU worker,
+        # sometimes the TPU worker does not finish initializing until long after
+        # the CPU job tells it to start evaluating. In this case, the checkpoint
+        # file could have been deleted already.
+        tf.logging.info(
+            "Checkpoint %s no longer exists, skipping checkpoint" % ckpt)
+
+  else:  # FLAGS.mode == 'train' or FLAGS.mode == 'train_and_eval'
+    current_step = estimator._load_global_step_from_checkpoint_dir(
+        FLAGS.model_dir)
+
+    tf.logging.info(
+        "Training for %d steps (%.2f epochs in total). Current"
+        " step %d." % (revnet_config.max_train_iter,
+                       revnet_config.max_train_iter / steps_per_epoch,
+                       current_step))
+
+    start_timestamp = time.time()  # This time will include compilation time
+
+    if FLAGS.mode == "train":
+      revnet_classifier.train(
+          input_fn=imagenet_train.input_fn,
+          max_steps=revnet_config.max_train_iter)
+
+    else:
+      assert FLAGS.mode == "train_and_eval"
+      while current_step < revnet_config.max_train_iter:
+        # Train for up to steps_per_eval number of steps.
+        # At the end of training, a checkpoint will be written to --model_dir.
+        next_checkpoint = min(current_step + FLAGS.steps_per_eval,
+                              revnet_config.max_train_iter)
+        revnet_classifier.train(
+            input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
+        current_step = next_checkpoint
+
+        tf.logging.info("Finished training up to step %d. Elapsed seconds %d." %
+                        (next_checkpoint, int(time.time() - start_timestamp)))
+
+        # Evaluate the model on the most recent model in --model_dir.
+        # Since evaluation happens in batches of --eval_batch_size, some images
+        # may be excluded modulo the batch size. As long as the batch size is
+        # consistent, the evaluated images are also consistent.
+        tf.logging.info("Starting to evaluate.")
+        eval_results = revnet_classifier.evaluate(
+            input_fn=imagenet_eval.input_fn, steps=eval_steps)
+        tf.logging.info("Eval results: %s" % eval_results)
+
+        elapsed_time = int(time.time() - start_timestamp)
+        tf.logging.info("Finished training up to step %d. Elapsed seconds %d." %
+                        (revnet_config.max_train_iter, elapsed_time))
+
+    if FLAGS.export_dir is not None:
+      # The guide to serve an exported TensorFlow model is at:
+      #    https://www.tensorflow.org/serving/serving_basic
+      tf.logging.info("Starting to export model.")
+      revnet_classifier.export_savedmodel(
+          export_dir_base=FLAGS.export_dir,
+          serving_input_receiver_fn=imagenet_input.image_serving_input_fn)
+
+
+if __name__ == "__main__":
+  # Cloud TPU Cluster Resolver flags
+  flags.DEFINE_string(
+      "tpu",
+      default=None,
+      help="The Cloud TPU to use for training. This should be either the name "
+      "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
+      "url.")
+  flags.DEFINE_string(
+      "tpu_zone",
+      default=None,
+      help="[Optional] GCE zone where the Cloud TPU is located in. If not "
+      "specified, we will attempt to automatically detect the GCE project from "
+      "metadata.")
+  flags.DEFINE_string(
+      "gcp_project",
+      default=None,
+      help="[Optional] Project name for the Cloud TPU-enabled project. If not "
+      "specified, we will attempt to automatically detect the GCE project from "
+      "metadata.")
+
+  # Model specific parameters
+  flags.DEFINE_string(
+      "data_dir", default=None, help="Directory to load tfrecords")
+  flags.DEFINE_string(
+      "model_dir",
+      default=None,
+      help="[Optional] Directory to store the model information")
+  flags.DEFINE_string(
+      "revnet_config",
+      default="revnet-56",
+      help="[Optional] Architecture of network. "
+      "Other options include `revnet-104`")
+  flags.DEFINE_boolean(
+      "use_tpu", default=True, help="[Optional] Whether to use TPU")
+  flags.DEFINE_integer(
+      "num_shards", default=8, help="Number of shards (TPU chips).")
+  flags.DEFINE_integer(
+      "iterations_per_loop",
+      default=100,
+      help=(
+          "Number of steps to run on TPU before feeding metrics to the CPU."
+          " If the number of iterations in the loop would exceed the number of"
+          " train steps, the loop will exit before reaching"
+          " --iterations_per_loop. The larger this value is, the higher the"
+          " utilization on the TPU."))
+  flags.DEFINE_integer(
+      "eval_timeout",
+      default=None,
+      help="Maximum seconds between checkpoints before evaluation terminates.")
+  flags.DEFINE_integer(
+      "steps_per_eval",
+      default=5000,
+      help=(
+          "Controls how often evaluation is performed. Since evaluation is"
+          " fairly expensive, it is advised to evaluate as infrequently as"
+          " possible (i.e. up to --train_steps, which evaluates the model only"
+          " after finishing the entire training regime)."))
+  flags.DEFINE_bool(
+      "transpose_input",
+      default=True,
+      help="Use TPU double transpose optimization")
+  flags.DEFINE_string(
+      "export_dir",
+      default=None,
+      help=("The directory where the exported SavedModel will be stored."))
+  flags.DEFINE_bool(
+      "skip_host_call",
+      default=False,
+      help=("Skip the host_call which is executed every training step. This is"
+            " generally used for generating training summaries (train loss,"
+            " learning rate, etc...). When --skip_host_call=false, there could"
+            " be a performance drop if host_call function is slow and cannot"
+            " keep up with the TPU-side computation."))
+  flags.DEFINE_string(
+      "mode",
+      default="train_and_eval",
+      help='One of {"train_and_eval", "train", "eval"}.')
+  FLAGS = flags.FLAGS
+  tf.app.run()
diff --git a/tensorflow/contrib/eager/python/examples/revnet/resnet_preprocessing.py b/tensorflow/contrib/eager/python/examples/revnet/resnet_preprocessing.py
new file mode 100644
index 0000000000..21a1ab85d4
--- /dev/null
+++ b/tensorflow/contrib/eager/python/examples/revnet/resnet_preprocessing.py
@@ -0,0 +1,190 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""ImageNet preprocessing for ResNet."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+IMAGE_SIZE = 224
+CROP_PADDING = 32
+
+
+def distorted_bounding_box_crop(image_bytes,
+                                bbox,
+                                min_object_covered=0.1,
+                                aspect_ratio_range=(0.75, 1.33),
+                                area_range=(0.05, 1.0),
+                                max_attempts=100,
+                                scope=None):
+  """Generates cropped_image using one of the bboxes randomly distorted.
+
+  See `tf.image.sample_distorted_bounding_box` for more documentation.
+
+  Args:
+    image_bytes: `Tensor` of binary image data.
+    bbox: `Tensor` of bounding boxes arranged `[1, num_boxes, coords]`
+        where each coordinate is [0, 1) and the coordinates are arranged
+        as `[ymin, xmin, ymax, xmax]`. If num_boxes is 0 then use the whole
+        image.
+    min_object_covered: An optional `float`. Defaults to `0.1`. The cropped
+        area of the image must contain at least this fraction of any bounding
+        box supplied.
+    aspect_ratio_range: An optional list of `float`s. The cropped area of the
+        image must have an aspect ratio = width / height within this range.
+    area_range: An optional list of `float`s. The cropped area of the image
+        must contain a fraction of the supplied image within in this range.
+    max_attempts: An optional `int`. Number of attempts at generating a cropped
+        region of the image of the specified constraints. After `max_attempts`
+        failures, return the entire image.
+    scope: Optional `str` for name scope.
+  Returns:
+    cropped image `Tensor`
+  """
+  with tf.name_scope(scope, 'distorted_bounding_box_crop', [image_bytes, bbox]):
+    shape = tf.image.extract_jpeg_shape(image_bytes)
+    sample_distorted_bounding_box = tf.image.sample_distorted_bounding_box(
+        shape,
+        bounding_boxes=bbox,
+        min_object_covered=min_object_covered,
+        aspect_ratio_range=aspect_ratio_range,
+        area_range=area_range,
+        max_attempts=max_attempts,
+        use_image_if_no_bounding_boxes=True)
+    bbox_begin, bbox_size, _ = sample_distorted_bounding_box
+
+    # Crop the image to the specified bounding box.
+    offset_y, offset_x, _ = tf.unstack(bbox_begin)
+    target_height, target_width, _ = tf.unstack(bbox_size)
+    crop_window = tf.stack([offset_y, offset_x, target_height, target_width])
+    image = tf.image.decode_and_crop_jpeg(image_bytes, crop_window, channels=3)
+
+    return image
+
+
+def _at_least_x_are_equal(a, b, x):
+  """At least `x` of `a` and `b` `Tensors` are equal."""
+  match = tf.equal(a, b)
+  match = tf.cast(match, tf.int32)
+  return tf.greater_equal(tf.reduce_sum(match), x)
+
+
+def _decode_and_random_crop(image_bytes, image_size):
+  """Make a random crop of image_size."""
+  bbox = tf.constant([0.0, 0.0, 1.0, 1.0], dtype=tf.float32, shape=[1, 1, 4])
+  image = distorted_bounding_box_crop(
+      image_bytes,
+      bbox,
+      min_object_covered=0.1,
+      aspect_ratio_range=(3. / 4, 4. / 3.),
+      area_range=(0.08, 1.0),
+      max_attempts=10,
+      scope=None)
+  original_shape = tf.image.extract_jpeg_shape(image_bytes)
+  bad = _at_least_x_are_equal(original_shape, tf.shape(image), 3)
+
+  image = tf.cond(
+      bad,
+      lambda: _decode_and_center_crop(image_bytes, image_size),
+      lambda: tf.image.resize_bicubic([image],  # pylint: disable=g-long-lambda
+                                      [image_size, image_size])[0])
+
+  return image
+
+
+def _decode_and_center_crop(image_bytes, image_size):
+  """Crops to center of image with padding then scales image_size."""
+  shape = tf.image.extract_jpeg_shape(image_bytes)
+  image_height = shape[0]
+  image_width = shape[1]
+
+  padded_center_crop_size = tf.cast(
+      ((image_size / (image_size + CROP_PADDING)) *
+       tf.cast(tf.minimum(image_height, image_width), tf.float32)),
+      tf.int32)
+
+  offset_height = ((image_height - padded_center_crop_size) + 1) // 2
+  offset_width = ((image_width - padded_center_crop_size) + 1) // 2
+  crop_window = tf.stack([offset_height, offset_width,
+                          padded_center_crop_size, padded_center_crop_size])
+  image = tf.image.decode_and_crop_jpeg(image_bytes, crop_window, channels=3)
+  image = tf.image.resize_bicubic([image], [image_size, image_size])[0]
+
+  return image
+
+
+def _flip(image):
+  """Random horizontal image flip."""
+  image = tf.image.random_flip_left_right(image)
+  return image
+
+
+def preprocess_for_train(image_bytes, use_bfloat16, image_size=IMAGE_SIZE):
+  """Preprocesses the given image for evaluation.
+
+  Args:
+    image_bytes: `Tensor` representing an image binary of arbitrary size.
+    use_bfloat16: `bool` for whether to use bfloat16.
+    image_size: image size.
+
+  Returns:
+    A preprocessed image `Tensor`.
+  """
+  image = _decode_and_random_crop(image_bytes, image_size)
+  image = _flip(image)
+  image = tf.reshape(image, [image_size, image_size, 3])
+  image = tf.image.convert_image_dtype(
+      image, dtype=tf.bfloat16 if use_bfloat16 else tf.float32)
+  return image
+
+
+def preprocess_for_eval(image_bytes, use_bfloat16, image_size=IMAGE_SIZE):
+  """Preprocesses the given image for evaluation.
+
+  Args:
+    image_bytes: `Tensor` representing an image binary of arbitrary size.
+    use_bfloat16: `bool` for whether to use bfloat16.
+    image_size: image size.
+
+  Returns:
+    A preprocessed image `Tensor`.
+  """
+  image = _decode_and_center_crop(image_bytes, image_size)
+  image = tf.reshape(image, [image_size, image_size, 3])
+  image = tf.image.convert_image_dtype(
+      image, dtype=tf.bfloat16 if use_bfloat16 else tf.float32)
+  return image
+
+
+def preprocess_image(image_bytes,
+                     is_training=False,
+                     use_bfloat16=False,
+                     image_size=IMAGE_SIZE):
+  """Preprocesses the given image.
+
+  Args:
+    image_bytes: `Tensor` representing an image binary of arbitrary size.
+    is_training: `bool` for whether the preprocessing is for training.
+    use_bfloat16: `bool` for whether to use bfloat16.
+    image_size: image size.
+
+  Returns:
+    A preprocessed image `Tensor`.
+  """
+  if is_training:
+    return preprocess_for_train(image_bytes, use_bfloat16, image_size)
+  else:
+    return preprocess_for_eval(image_bytes, use_bfloat16, image_size)
diff --git a/tensorflow/contrib/eager/python/examples/revnet/revnet.py b/tensorflow/contrib/eager/python/examples/revnet/revnet.py
index af0d20fa72..1f2cb14972 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/revnet.py
+++ b/tensorflow/contrib/eager/python/examples/revnet/revnet.py
@@ -24,10 +24,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import functools
-import operator
-
-import six
 import tensorflow as tf
 from tensorflow.contrib.eager.python.examples.revnet import blocks
 
@@ -45,71 +41,10 @@ class RevNet(tf.keras.Model):
     self.axis = 1 if config.data_format == "channels_first" else 3
     self.config = config
 
-    self._init_block = self._construct_init_block()
+    self._init_block = blocks.InitBlock(config=self.config)
+    self._final_block = blocks.FinalBlock(config=self.config)
     self._block_list = self._construct_intermediate_blocks()
-    self._final_block = self._construct_final_block()
-
-  def _construct_init_block(self):
-    init_block = tf.keras.Sequential(
-        [
-            tf.keras.layers.Conv2D(
-                filters=self.config.init_filters,
-                kernel_size=self.config.init_kernel,
-                strides=(self.config.init_stride, self.config.init_stride),
-                data_format=self.config.data_format,
-                use_bias=False,
-                padding="SAME",
-                input_shape=self.config.input_shape,
-                dtype=self.config.dtype),
-            tf.keras.layers.BatchNormalization(
-                axis=self.axis,
-                fused=self.config.fused,
-                dtype=self.config.dtype),
-            tf.keras.layers.Activation("relu"),
-        ],
-        name="init")
-    if self.config.init_max_pool:
-      init_block.add(
-          tf.keras.layers.MaxPooling2D(
-              pool_size=(3, 3),
-              strides=(2, 2),
-              padding="SAME",
-              data_format=self.config.data_format,
-              dtype=self.config.dtype))
-    return init_block
-
-  def _construct_final_block(self):
-    f = self.config.filters[-1]  # Number of filters
-    r = functools.reduce(operator.mul, self.config.strides, 1)  # Reduce ratio
-    r *= self.config.init_stride
-    if self.config.init_max_pool:
-      r *= 2
-
-    if self.config.data_format == "channels_first":
-      w, h = self.config.input_shape[1], self.config.input_shape[2]
-      input_shape = (f, w // r, h // r)
-    elif self.config.data_format == "channels_last":
-      w, h = self.config.input_shape[0], self.config.input_shape[1]
-      input_shape = (w // r, h // r, f)
-    else:
-      raise ValueError("Data format should be either `channels_first`"
-                       " or `channels_last`")
-
-    final_block = tf.keras.Sequential(
-        [
-            tf.keras.layers.BatchNormalization(
-                axis=self.axis,
-                input_shape=input_shape,
-                fused=self.config.fused,
-                dtype=self.config.dtype),
-            tf.keras.layers.Activation("relu"),
-            tf.keras.layers.GlobalAveragePooling2D(
-                data_format=self.config.data_format, dtype=self.config.dtype),
-            tf.keras.layers.Dense(
-                self.config.n_classes, dtype=self.config.dtype)
-        ],
-        name="final")
-    return final_block
+    self._moving_average_variables = []
 
   def _construct_intermediate_blocks(self):
     # Precompute input shape after initial block
@@ -193,109 +128,90 @@ class RevNet(tf.keras.Model):
 
     return tf.reduce_mean(cross_ent)
 
-  def compute_gradients(self, inputs, labels, training=True, l2_reg=True):
+  def compute_gradients(self, saved_hidden, labels, training=True, l2_reg=True):
     """Manually computes gradients.
 
-    When eager execution is enabled, this method also SILENTLY updates the
-    running averages of batch normalization when `training` is set to True.
+    This method silently updates the running averages of batch normalization.
 
     Args:
-      inputs: Image tensor, either NHWC or NCHW, conforming to `data_format`
+      saved_hidden: List of hidden states Tensors
       labels: One-hot labels for classification
       training: Use the mini-batch stats in batch norm if set to True
       l2_reg: Apply l2 regularization
 
     Returns:
-      list of tuples each being (grad, var) for optimizer to use
+      A tuple with the first entry being a list of all gradients and the second
+      being the loss
     """
 
-    # Run forward pass to record hidden states; avoid updating running averages
-    vars_and_vals = self.get_moving_stats()
-    _, saved_hidden = self.call(inputs, training=training)
-    self.restore_moving_stats(vars_and_vals)
-
-    grads_all = []
-    vars_all = []
+    def _defunable_pop(l):
+      """Functional style list pop that works with `tfe.defun`."""
+      t, l = l[-1], l[:-1]
+      return t, l
 
-    # Manually backprop through last block
+    # Backprop through last block
     x = saved_hidden[-1]
     with tf.GradientTape() as tape:
-      x = tf.identity(x)
       tape.watch(x)
-      # Running stats updated below
       logits = self._final_block(x, training=training)
       loss = self.compute_loss(logits, labels)
-
     grads_combined = tape.gradient(loss,
                                    [x] + self._final_block.trainable_variables)
-    dy, grads_ = grads_combined[0], grads_combined[1:]
-    grads_all += grads_
-    vars_all += self._final_block.trainable_variables
+    dy, final_grads = grads_combined[0], grads_combined[1:]
 
-    # Manually backprop through intermediate blocks
+    # Backprop through intermediate blocks
+    intermediate_grads = []
     for block in reversed(self._block_list):
-      y = saved_hidden.pop()
+      y, saved_hidden = _defunable_pop(saved_hidden)
       x = saved_hidden[-1]
-      dy, grads, vars_ = block.backward_grads_and_vars(
-          x, y, dy, training=training)
-      grads_all += grads
-      vars_all += vars_
-
-    # Manually backprop through first block
-    saved_hidden.pop()
-    x = saved_hidden.pop()
-    assert not saved_hidden  # Cleared after backprop
+      dy, grads = block.backward_grads(x, y, dy, training=training)
+      intermediate_grads = grads + intermediate_grads
 
+    # Backprop through first block
+    _, saved_hidden = _defunable_pop(saved_hidden)
+    x, saved_hidden = _defunable_pop(saved_hidden)
+    assert not saved_hidden
     with tf.GradientTape() as tape:
-      x = tf.identity(x)
-      # Running stats updated below
       y = self._init_block(x, training=training)
-
-    grads_all += tape.gradient(
+    init_grads = tape.gradient(
         y, self._init_block.trainable_variables, output_gradients=dy)
-    vars_all += self._init_block.trainable_variables
 
-    # Apply weight decay
+    # Ordering match up with `model.trainable_variables`
+    grads_all = init_grads + final_grads + intermediate_grads
     if l2_reg:
-      grads_all = self._apply_weight_decay(grads_all, vars_all)
+      grads_all = self._apply_weight_decay(grads_all)
 
-    return grads_all, vars_all, loss
+    return grads_all, loss
 
-  def _apply_weight_decay(self, grads, vars_):
+  def _apply_weight_decay(self, grads):
     """Update gradients to reflect weight decay."""
-    # Don't decay bias
     return [
         g + self.config.weight_decay * v if v.name.endswith("kernel:0") else g
-        for g, v in zip(grads, vars_)
+        for g, v in zip(grads, self.trainable_variables)
     ]
 
   def get_moving_stats(self):
-    """Get moving averages of batch normalization.
-
-    This is needed to avoid updating the running average twice in one iteration.
-
-    Returns:
-      A dictionary mapping variables for batch normalization moving averages
-      to their current values.
-    """
-    vars_and_vals = {}
-
-    def _is_moving_var(v):
+    """Get moving averages of batch normalization."""
+    device = "/gpu:0" if tf.test.is_gpu_available() else "/cpu:0"
+    with tf.device(device):
+      return [v.read_value() for v in self.moving_average_variables]
+
+  def restore_moving_stats(self, values):
+    """Restore moving averages of batch normalization."""
+    device = "/gpu:0" if tf.test.is_gpu_available() else "/cpu:0"
+    with tf.device(device):
+      for var_, val in zip(self.moving_average_variables, values):
+        var_.assign(val)
+
+  @property
+  def moving_average_variables(self):
+    """Get all variables that are batch norm moving averages."""
+
+    def _is_moving_avg(v):
       n = v.name
       return n.endswith("moving_mean:0") or n.endswith("moving_variance:0")
 
-    for v in filter(_is_moving_var, self.variables):
-      vars_and_vals[v] = v.read_value()
+    if not self._moving_average_variables:
+      self._moving_average_variables = filter(_is_moving_avg, self.variables)
 
-    return vars_and_vals
-
-  def restore_moving_stats(self, vars_and_vals):
-    """Restore moving averages of batch normalization.
-
-    This is needed to avoid updating the running average twice in one iteration.
-
-    Args:
-      vars_and_vals: The dictionary mapping variables to their previous values.
-    """
-    for var_, val in six.iteritems(vars_and_vals):
-      var_.assign(val)
+    return self._moving_average_variables
diff --git a/tensorflow/contrib/eager/python/examples/revnet/revnet_test.py b/tensorflow/contrib/eager/python/examples/revnet/revnet_test.py
index b2ac4b67c9..84b2ddf0de 100644
--- a/tensorflow/contrib/eager/python/examples/revnet/revnet_test.py
+++ b/tensorflow/contrib/eager/python/examples/revnet/revnet_test.py
@@ -31,10 +31,13 @@ tfe = tf.contrib.eager
 
 def train_one_iter(model, inputs, labels, optimizer, global_step=None):
   """Train for one iteration."""
-  grads, vars_, loss = model.compute_gradients(inputs, labels, training=True)
-  optimizer.apply_gradients(zip(grads, vars_), global_step=global_step)
+  logits, saved_hidden = model(inputs)
+  grads, loss = model.compute_gradients(
+      saved_hidden=saved_hidden, labels=labels)
+  optimizer.apply_gradients(
+      zip(grads, model.trainable_variables), global_step=global_step)
 
-  return loss
+  return logits, loss
 
 
 class RevNetTest(tf.test.TestCase):
@@ -42,6 +45,8 @@ class RevNetTest(tf.test.TestCase):
   def setUp(self):
     super(RevNetTest, self).setUp()
     config = config_.get_hparams_cifar_38()
+    config.add_hparam("n_classes", 10)
+    config.add_hparam("dataset", "cifar-10")
     # Reconstruction could cause numerical error, use double precision for tests
     config.dtype = tf.float64
     config.fused = False  # Fused batch norm does not support tf.float64
@@ -93,9 +98,10 @@ class RevNetTest(tf.test.TestCase):
 
   def test_compute_gradients(self):
     """Test `compute_gradients` function."""
-    self.model(self.x, training=False)  # Initialize model
-    grads, vars_, loss = self.model.compute_gradients(
-        inputs=self.x, labels=self.t, training=True, l2_reg=True)
+    _, saved_hidden = self.model(self.x)  # Initialize model
+    grads, loss = self.model.compute_gradients(
+        saved_hidden=saved_hidden, labels=self.t)
+    vars_ = self.model.trainable_variables
     self.assertTrue(isinstance(grads, list))
     self.assertTrue(isinstance(vars_, list))
     self.assertEqual(len(grads), len(vars_))
@@ -104,7 +110,7 @@ class RevNetTest(tf.test.TestCase):
 
     # Compare against the true gradient computed by the tape
     with tf.GradientTape() as tape:
-      logits, _ = self.model(self.x, training=True)
+      logits, _ = self.model(self.x)
       loss_true = self.model.compute_loss(logits=logits, labels=self.t)
     grads_true = tape.gradient(loss_true, vars_)
     self.assertAllClose(loss, loss_true)
@@ -119,7 +125,9 @@ class RevNetTest(tf.test.TestCase):
   def test_compute_gradients_defun(self):
     """Test `compute_gradients` function with defun."""
     compute_gradients = tfe.defun(self.model.compute_gradients)
-    grads, vars_, _ = compute_gradients(self.x, self.t, training=True)
+    _, saved_hidden = self.model(self.x)
+    grads, _ = compute_gradients(saved_hidden=saved_hidden, labels=self.t)
+    vars_ = self.model.trainable_variables
     self.assertTrue(isinstance(grads, list))
     self.assertTrue(isinstance(vars_, list))
     self.assertEqual(len(grads), len(vars_))
@@ -131,6 +139,9 @@ class RevNetTest(tf.test.TestCase):
     """Test model training in graph mode."""
     with tf.Graph().as_default():
       config = config_.get_hparams_cifar_38()
+      config.add_hparam("n_classes", 10)
+      config.add_hparam("dataset", "cifar-10")
+
       x = tf.random_normal(
           shape=(self.config.batch_size,) + self.config.input_shape)
       t = tf.random_uniform(
@@ -138,17 +149,13 @@ class RevNetTest(tf.test.TestCase):
           minval=0,
           maxval=self.config.n_classes,
           dtype=tf.int32)
-      global_step = tfe.Variable(0., trainable=False)
+      global_step = tf.Variable(0., trainable=False)
       model = revnet.RevNet(config=config)
-      model(x)
-      updates = model.get_updates_for(x)
-
-      x_ = tf.identity(x)
-      grads_all, vars_all, _ = model.compute_gradients(x_, t, training=True)
+      _, saved_hidden = model(x)
+      grads, _ = model.compute_gradients(saved_hidden=saved_hidden, labels=t)
       optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
-      with tf.control_dependencies(updates):
-        train_op = optimizer.apply_gradients(
-            zip(grads_all, vars_all), global_step=global_step)
+      train_op = optimizer.apply_gradients(
+          zip(grads, model.trainable_variables), global_step=global_step)
 
       with tf.Session() as sess:
         sess.run(tf.global_variables_initializer())
diff --git a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
index c2340a293a..15776c694e 100644
--- a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
+++ b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
@@ -310,12 +310,12 @@ def main(_):
   with tf.device("/device:GPU:0" if have_gpu else None):
     # Make learning_rate a Variable so it can be included in the checkpoint
     # and we can resume training with the last saved learning_rate.
-    learning_rate = tfe.Variable(20.0, name="learning_rate")
+    learning_rate = tf.Variable(20.0, name="learning_rate")
     model = PTBModel(corpus.vocab_size(), FLAGS.embedding_dim,
                      FLAGS.hidden_dim, FLAGS.num_layers, FLAGS.dropout,
                      use_cudnn_rnn)
     optimizer = tf.train.GradientDescentOptimizer(learning_rate)
-    checkpoint = tfe.Checkpoint(
+    checkpoint = tf.train.Checkpoint(
         learning_rate=learning_rate, model=model,
         # GradientDescentOptimizer has no state to checkpoint, but noting it
         # here lets us swap in an optimizer that does.
diff --git a/tensorflow/contrib/eager/python/examples/sagan/BUILD b/tensorflow/contrib/eager/python/examples/sagan/BUILD
deleted file mode 100644
index b470a41d81..0000000000
--- a/tensorflow/contrib/eager/python/examples/sagan/BUILD
+++ /dev/null
@@ -1,59 +0,0 @@
-licenses(["notice"])  # Apache 2.0
-
-package(default_visibility = ["//tensorflow:internal"])
-
-load("//tensorflow:tensorflow.bzl", "cuda_py_test")
-
-# Model
-py_library(
-    name = "config",
-    srcs = ["config.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        "//tensorflow:tensorflow_py",
-    ],
-)
-
-py_library(
-    name = "ops",
-    srcs = ["ops.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        "//tensorflow:tensorflow_py",
-    ],
-)
-
-py_library(
-    name = "sagan",
-    srcs = ["sagan.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":ops",
-        "//tensorflow:tensorflow_py",
-    ],
-)
-
-# Tests
-cuda_py_test(
-    name = "ops_test",
-    size = "small",
-    srcs = ["ops_test.py"],
-    additional_deps = [
-        ":ops",
-        "//tensorflow:tensorflow_py",
-    ],
-)
-
-cuda_py_test(
-    name = "sagan_test",
-    size = "large",
-    srcs = ["sagan_test.py"],
-    additional_deps = [
-        ":config",
-        ":sagan",
-        "//tensorflow:tensorflow_py",
-    ],
-    tags = [
-        "optonly",
-    ],
-)
diff --git a/tensorflow/contrib/eager/python/examples/sagan/config.py b/tensorflow/contrib/eager/python/examples/sagan/config.py
deleted file mode 100644
index 1967bbd867..0000000000
--- a/tensorflow/contrib/eager/python/examples/sagan/config.py
+++ /dev/null
@@ -1,72 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Self-attention generative adversarial with eager execution.
-
-Configuration in format of tf.contrib.training.HParams.
-Supports default 128x128 ImageNet.
-
-Reference [Self-Attention Generative Adversarial
-Networks](https://arxiv.org/pdf/1805.08318.pdf)
-
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import tensorflow as tf
-tfe = tf.contrib.eager
-
-
-def get_hparams_imagenet():
-  """Configurations to train SAGAN on 128x128 ImageNet dataset."""
-  config = tf.contrib.training.HParams()
-  if tf.test.is_gpu_available():
-    config.add_hparam("image_shape", (3, 128, 128))
-    config.add_hparam("data_format", "channels_first")
-    config.add_hparam("g_init_shape", (512, 4, 4))
-  else:
-    config.add_hparam("image_shape", (128, 128, 3))
-    config.add_hparam("data_format", "channels_first")
-    config.add_hparam("g_init_shape", (4, 4, 512))
-
-  config.add_hparam("latent_dim", 128)
-  config.add_hparam("update_g_once_every", 1)
-  config.add_hparam("batch_size", 64)
-  config.add_hparam("d_init_filters", 32)
-  config.add_hparam("num_upsamples", 5)
-  # (512, 4, 4) -> (3, 128, 128)
-  return config
-
-
-def get_hparams_mock():
-  """Configurations of smaller networks for testing."""
-  config = tf.contrib.training.HParams()
-  if tf.test.is_gpu_available():
-    config.add_hparam("image_shape", (3, 16, 16))
-    config.add_hparam("data_format", "channels_first")
-    config.add_hparam("g_init_shape", (32, 2, 2))
-  else:
-    config.add_hparam("image_shape", (16, 16, 3))
-    config.add_hparam("data_format", "channels_last")
-    config.add_hparam("g_init_shape", (2, 2, 32))
-
-  config.add_hparam("latent_dim", 16)
-  config.add_hparam("update_g_once_every", 1)
-  config.add_hparam("batch_size", 2)
-  config.add_hparam("d_init_filters", 4)
-  config.add_hparam("num_upsamples", 3)
-  # (32, 2, 2) -> (3, 16, 16)
-  return config
diff --git a/tensorflow/contrib/eager/python/examples/sagan/ops.py b/tensorflow/contrib/eager/python/examples/sagan/ops.py
deleted file mode 100644
index 9a03cab1d1..0000000000
--- a/tensorflow/contrib/eager/python/examples/sagan/ops.py
+++ /dev/null
@@ -1,71 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Self-attention generative adversarial with eager execution.
-
-Auxiliary operations.
-
-Reference [Self-Attention Generative Adversarial
-Networks](https://arxiv.org/pdf/1805.08318.pdf)
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import tensorflow as tf
-
-
-def flatten_hw(x, data_format="channels_first"):
-  """Flatten the input tensor across height and width dimensions."""
-  if data_format == "channels_last":
-    x = tf.transpose(x, perm=[0, 3, 1, 2])  # Convert to `channels_first`
-
-  old_shape = tf.shape(x)
-  new_shape = [old_shape[0], old_shape[2] * old_shape[3], old_shape[1]]
-
-  return tf.reshape(x, new_shape)
-
-
-def broaden_hw(x, h, w, c, data_format="channels_first"):
-  """Broaden dimension so that output has height and width."""
-  if data_format == "channels_first":
-    shape = [-1, c, h, w]
-  else:
-    shape = [-1, h, w, c]
-
-  return tf.reshape(x, shape)
-
-
-class BroadenHW(tf.keras.layers.Layer):
-  """Wrapper class so that `broaden_hw` can be used in `tf.keras.Sequential`."""
-
-  def __init__(self, h, w, c, data_format="channels_first"):
-    super(BroadenHW, self).__init__()
-    self.h = h
-    self.w = w
-    self.c = c
-    self.data_format = data_format
-
-  def call(self, x):
-    return broaden_hw(
-        x, h=self.h, w=self.w, c=self.c, data_format=self.data_format)
-
-  def compute_output_shape(self, input_shape):
-    input_shape = tf.TensorShape(input_shape).as_list()
-    if self.data_format == "channels_first":
-      output_shape = (input_shape[0], self.c, self.h, self.w)
-    else:
-      output_shape = (input_shape[0], self.h, self.w, self.c)
-
-    return tf.TensorShape(output_shape)
diff --git a/tensorflow/contrib/eager/python/examples/sagan/ops_test.py b/tensorflow/contrib/eager/python/examples/sagan/ops_test.py
deleted file mode 100644
index 3454985904..0000000000
--- a/tensorflow/contrib/eager/python/examples/sagan/ops_test.py
+++ /dev/null
@@ -1,59 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for auxiliary operations."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import tensorflow as tf
-from tensorflow.contrib.eager.python.examples.sagan import ops
-
-
-class OpsTest(tf.test.TestCase):
-
-  def test_flatten_hw(self):
-    """Test `flatten_hw` function with mock object."""
-
-    batch_size = 1
-    # Default NCHW format
-    if tf.test.is_gpu_available():
-      x = tf.random_normal(shape=(batch_size, 3, 4, 4))
-      y = ops.flatten_hw(x, data_format="channels_first")
-      self.assertEqual(y.shape, (batch_size, 4 * 4, 3))
-
-    # NHWC format
-    x = tf.random_normal(shape=(batch_size, 4, 4, 3))
-    y = ops.flatten_hw(x, data_format="channels_last")
-    self.assertEqual(y.shape, (batch_size, 4 * 4, 3))
-
-  def test_broaden_hw(self):
-    """Test `broaden_hw` function with mock object."""
-
-    batch_size = 1
-    # NHWC format
-    x = tf.random_normal(shape=[batch_size, 4 * 4 * 16])
-    y = ops.broaden_hw(x, h=4, w=4, c=16, data_format="channels_last")
-    self.assertEqual(y.shape, (batch_size, 4, 4, 16))
-
-    # Default NCHW format
-    if tf.test.is_gpu_available():
-      y = ops.broaden_hw(x, h=4, w=4, c=16, data_format="channels_first")
-      self.assertEqual(y.shape, (batch_size, 16, 4, 4))
-
-
-if __name__ == "__main__":
-  tf.enable_eager_execution()
-  tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/sagan/sagan.py b/tensorflow/contrib/eager/python/examples/sagan/sagan.py
deleted file mode 100644
index 561be36c91..0000000000
--- a/tensorflow/contrib/eager/python/examples/sagan/sagan.py
+++ /dev/null
@@ -1,232 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Self-attention generative adversarial with eager execution.
-
-Code for main model.
-
-Reference [Self-Attention Generative Adversarial
-Networks](https://arxiv.org/pdf/1805.08318.pdf)
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-import tensorflow as tf
-from tensorflow.contrib.eager.python.examples.sagan import ops
-tfe = tf.contrib.eager
-
-
-class SelfAttentionModule(tf.keras.Model):
-  """Self-attention module composed of convolutional layers."""
-
-  def __init__(self,
-               attention_features,
-               original_features,
-               data_format="channels_first"):
-    """Initialize the module.
-
-    Args:
-      attention_features: Number of filters for the attention computation.
-      original_features: Number of filters of the original Tensor.
-      data_format: Either 'channels_first' or 'channels_last'
-    """
-    super(SelfAttentionModule, self).__init__()
-    self.data_format = data_format
-    # Matrix multiplication implemented as 2D Convolution
-    self.f = tf.keras.layers.Conv2D(
-        filters=attention_features,
-        kernel_size=1,
-        strides=(1, 1),
-        data_format=data_format)
-    self.g = tf.keras.layers.Conv2D(
-        filters=attention_features,
-        kernel_size=1,
-        strides=(1, 1),
-        data_format=data_format)
-    self.h = tf.keras.layers.Conv2D(
-        filters=original_features,
-        kernel_size=1,
-        strides=(1, 1),
-        data_format=data_format)
-    self.scale = tfe.Variable(0., trainable=True)
-
-  def call(self, x):
-    f = self.f(x)
-    g = self.g(x)
-    h = self.h(x)
-
-    f_flatten = ops.flatten_hw(f, data_format=self.data_format)
-    g_flatten = ops.flatten_hw(g, data_format=self.data_format)
-    h_flatten = ops.flatten_hw(h, data_format=self.data_format)
-
-    s = tf.matmul(g_flatten, f_flatten, transpose_b=True)
-    b = tf.nn.softmax(s, axis=-1)
-    o = tf.matmul(b, h_flatten)
-    y = self.scale * tf.reshape(o, tf.shape(x)) + x
-
-    return y
-
-  def compute_output_shape(self, input_shape):
-    return input_shape
-
-
-class SAGAN(tf.contrib.checkpoint.Checkpointable):
-  """Self-attention generative adversarial network."""
-
-  def __init__(self, config):
-    """Initialize the model.
-
-    Args:
-      config: tf.contrib.training.HParams object; specifies hyperparameters
-    """
-    super(SAGAN, self).__init__()
-    self.config = config
-    self.generator = self._construct_generator()
-    self.discriminator = self._construct_discriminator()
-
-  def _construct_generator(self):
-    """Construct generator."""
-    # TODO(lxuechen): Add spectral normalization for WGAN
-    axis = 1 if self.config.data_format == "channels_first" else 3
-
-    generator = tf.keras.Sequential()
-    generator.add(
-        tf.keras.layers.InputLayer(input_shape=(self.config.latent_dim,)))
-    generator.add(
-        tf.keras.layers.Dense(
-            units=np.prod(self.config.g_init_shape), activation=tf.nn.relu))
-
-    if self.config.data_format == "channels_first":
-      c, h, w = self.config.g_init_shape
-    else:
-      h, w, c = self.config.g_init_shape
-
-    # Reshape to NHWC/NCHW
-    generator.add(
-        ops.BroadenHW(h=h, w=w, c=c, data_format=self.config.data_format))
-
-    filters_list = [c // 2**p for p in range(1, self.config.num_upsamples + 1)]
-    filters_list[-1] = 3  # Standard RGB images
-
-    for filters in filters_list[:len(filters_list) // 2]:
-      generator.add(
-          tf.keras.layers.Conv2DTranspose(
-              filters=filters,
-              kernel_size=4,
-              strides=(2, 2),
-              use_bias=False,
-              padding="SAME",
-              data_format=self.config.data_format))
-      generator.add(tf.keras.layers.BatchNormalization(axis=axis))
-      generator.add(tf.keras.layers.Activation("relu"))
-
-    # pylint: disable=undefined-loop-variable
-    generator.add(
-        SelfAttentionModule(
-            original_features=filters,
-            attention_features=filters // 8,
-            data_format=self.config.data_format))
-    # pylint: enable=undefined-loop-variable
-
-    for filters in filters_list[len(filters_list) // 2:]:
-      generator.add(
-          tf.keras.layers.Conv2DTranspose(
-              filters=filters,
-              kernel_size=4,
-              strides=(2, 2),
-              use_bias=False,
-              padding="SAME",
-              data_format=self.config.data_format))
-      if filters == 3:
-        # Assume Image rescaled to [-1, 1]
-        generator.add(tf.keras.layers.Activation("tanh"))
-      else:
-        generator.add(tf.keras.layers.BatchNormalization(axis=axis))
-        generator.add(tf.keras.layers.Activation("relu"))
-
-    return generator
-
-  def _construct_discriminator(self):
-    """Construct discriminator."""
-    # TODO(lxuechen): Add spectral normalization for WGAN
-    discriminator = tf.keras.Sequential()
-    discriminator.add(
-        tf.keras.layers.InputLayer(input_shape=self.config.image_shape))
-
-    filters_list = [
-        self.config.d_init_filters * 2**p
-        for p in range(self.config.num_upsamples)
-    ]
-
-    for filters in filters_list[:(len(filters_list) + 1) // 2]:
-      discriminator.add(
-          tf.keras.layers.Conv2D(
-              filters=filters,
-              kernel_size=4,
-              strides=(2, 2),
-              padding="SAME",
-              data_format=self.config.data_format))
-      discriminator.add(tf.keras.layers.LeakyReLU(alpha=.1))
-
-    # pylint: disable=undefined-loop-variable
-    discriminator.add(
-        SelfAttentionModule(
-            original_features=filters,
-            attention_features=filters // 8,
-            data_format=self.config.data_format))
-    # pylint: enable=undefined-loop-variable
-
-    for filters in filters_list[(len(filters_list) + 1) // 2:]:
-      discriminator.add(
-          tf.keras.layers.Conv2D(
-              filters=filters,
-              kernel_size=4,
-              strides=(2, 2),
-              padding="SAME",
-              data_format=self.config.data_format))
-      discriminator.add(tf.keras.layers.LeakyReLU(alpha=.1))
-
-    discriminator.add(tf.keras.layers.Flatten())
-    discriminator.add(tf.keras.layers.Dense(units=1))
-
-    return discriminator
-
-  def compute_loss_and_grads(self, real_images, noise, training=True):
-    """Compute loss and gradients for both generator and discriminator."""
-    # TODO(lxuechen): Add gradient penalty for discriminator
-    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
-      real_logits = self.discriminator(real_images, training=training)
-
-      fake_images = self.generator.call(noise, training=training)
-      fake_logits = self.discriminator.call(fake_images)
-
-      g_loss = self.compute_g_loss(fake_logits)
-      d_loss = self.compute_d_loss(fake_logits, real_logits)
-
-    g_grads = g_tape.gradient(g_loss, self.generator.trainable_variables)
-    d_grads = d_tape.gradient(d_loss, self.discriminator.trainable_variables)
-
-    return g_loss, d_loss, g_grads, d_grads
-
-  def compute_g_loss(self, fake_logits):
-    return -tf.reduce_mean(fake_logits)  # Hinge loss
-
-  def compute_d_loss(self, fake_logits, real_logits):
-    # Hinge loss
-    real_loss = tf.reduce_mean(tf.nn.relu(1. - real_logits))
-    fake_loss = tf.reduce_mean(tf.nn.relu(1. + fake_logits))
-    return real_loss + fake_loss
diff --git a/tensorflow/contrib/eager/python/examples/sagan/sagan_test.py b/tensorflow/contrib/eager/python/examples/sagan/sagan_test.py
deleted file mode 100644
index 1834594510..0000000000
--- a/tensorflow/contrib/eager/python/examples/sagan/sagan_test.py
+++ /dev/null
@@ -1,101 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for self-attention generative adversarial network."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import tensorflow as tf
-from tensorflow.contrib.eager.python.examples.sagan import config as config_
-from tensorflow.contrib.eager.python.examples.sagan import sagan
-tfe = tf.contrib.eager
-
-
-class SAGANTest(tf.test.TestCase):
-
-  def setUp(self):
-    super(SAGANTest, self).setUp()
-    config = config_.get_hparams_mock()
-    self.noise_shape = (config.batch_size, config.latent_dim)
-    self.logits_shape = (config.batch_size, 1)
-    self.images_shape = (config.batch_size,) + config.image_shape
-
-    self.model = sagan.SAGAN(config=config)
-    self.noise = tf.random_normal(shape=self.noise_shape)
-    self.real_images = tf.random_normal(shape=self.images_shape)
-    self.config = config
-
-  def tearDown(self):
-    del self.model
-    del self.noise
-    del self.real_images
-    super(SAGANTest, self).tearDown()
-
-  def test_generator_call(self):
-    """Test `generator.__call__` function."""
-    fake_images = self.model.generator(self.noise, training=False)
-    self.assertEqual(fake_images.shape, self.images_shape)
-
-  def test_generator_call_defun(self):
-    """Test `generator.__call__` function with defun."""
-    call_ = tfe.defun(self.model.generator.__call__)
-    fake_images = call_(self.noise, training=False)
-    self.assertEqual(fake_images.shape, self.images_shape)
-
-  def test_discriminator_call(self):
-    """Test `discriminator.__call__` function."""
-    real_logits = self.model.discriminator(self.real_images)
-    self.assertEqual(real_logits.shape, self.logits_shape)
-
-  def test_discriminator_call_defun(self):
-    """Test `discriminator.__call__` function with defun."""
-    call_ = tfe.defun(self.model.discriminator.__call__)
-    real_logits = call_(self.real_images)
-    self.assertEqual(real_logits.shape, self.logits_shape)
-
-  def test_compute_loss_and_grads(self):
-    """Test `compute_loss_and_grads` function."""
-    g_loss, d_loss, g_grads, d_grads = self.model.compute_loss_and_grads(
-        self.real_images, self.noise, training=False)
-    self.assertEqual(g_loss.shape, ())
-    self.assertEqual(d_loss.shape, ())
-    self.assertTrue(isinstance(g_grads, list))
-    self.assertTrue(isinstance(d_grads, list))
-    g_vars = self.model.generator.trainable_variables
-    d_vars = self.model.discriminator.trainable_variables
-
-    self.assertEqual(len(g_grads), len(g_vars))
-    self.assertEqual(len(d_grads), len(d_vars))
-
-  def test_compute_loss_and_grads_defun(self):
-    """Test `compute_loss_and_grads` function with defun."""
-    compute_loss_and_grads = tfe.defun(self.model.compute_loss_and_grads)
-    g_loss, d_loss, g_grads, d_grads = compute_loss_and_grads(
-        self.real_images, self.noise, training=False)
-    self.assertEqual(g_loss.shape, ())
-    self.assertEqual(d_loss.shape, ())
-    self.assertTrue(isinstance(g_grads, list))
-    self.assertTrue(isinstance(d_grads, list))
-    g_vars = self.model.generator.trainable_variables
-    d_vars = self.model.discriminator.trainable_variables
-
-    self.assertEqual(len(g_grads), len(g_vars))
-    self.assertEqual(len(d_grads), len(d_vars))
-
-
-if __name__ == "__main__":
-  tf.enable_eager_execution()
-  tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py b/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py
index 8ac553e0ae..d18a097063 100644
--- a/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py
+++ b/tensorflow/contrib/eager/python/examples/spinn/spinn_test.py
@@ -36,7 +36,7 @@ from third_party.examples.eager.spinn import spinn
 from tensorflow.contrib.summary import summary_test_util
 from tensorflow.python.eager import test
 from tensorflow.python.framework import test_util
-from tensorflow.python.training import saver
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training.checkpointable import util as checkpointable_utils
 # pylint: enable=g-bad-import-order
 
@@ -422,7 +422,7 @@ class SpinnTest(test_util.TensorFlowTestCase):
     # 5. Verify that checkpoints exist and contains all the expected variables.
     self.assertTrue(glob.glob(os.path.join(config.logdir, "ckpt*")))
     object_graph = checkpointable_utils.object_metadata(
-        saver.latest_checkpoint(config.logdir))
+        checkpoint_management.latest_checkpoint(config.logdir))
     ckpt_variable_names = set()
     for node in object_graph.nodes:
       for attribute in node.attributes:
diff --git a/tensorflow/contrib/eager/python/examples/workshop/1_basic.ipynb b/tensorflow/contrib/eager/python/examples/workshop/1_basic.ipynb
index 3e7abe952d..75cb3f8227 100644
--- a/tensorflow/contrib/eager/python/examples/workshop/1_basic.ipynb
+++ b/tensorflow/contrib/eager/python/examples/workshop/1_basic.ipynb
@@ -210,7 +210,7 @@
         "a = tf.constant(0.0)\n",
         "b = tf.constant(1.0)\n",
         "epsilon = tf.constant(0.001)\n",
-        "x = bisecting_line_search(test_f, a, b, epsilon)\n",
+        "x = bisecting_line_search(test_f, a, b, epsilon)\n"
       ],
       "execution_count": 0,
       "outputs": []
@@ -279,4 +279,4 @@
       ]
     }
   ]
-}
-\ No newline at end of file
+}
diff --git a/tensorflow/contrib/eager/python/examples/workshop/2_models.ipynb b/tensorflow/contrib/eager/python/examples/workshop/2_models.ipynb
index 4f1410e00b..f3a65f5aab 100644
--- a/tensorflow/contrib/eager/python/examples/workshop/2_models.ipynb
+++ b/tensorflow/contrib/eager/python/examples/workshop/2_models.ipynb
@@ -69,7 +69,7 @@
       "cell_type": "code",
       "source": [
         "# Creating variables\n",
-        "v = tfe.Variable(1.0)\n",
+        "v = tf.Variable(1.0)\n",
         "v"
       ],
       "execution_count": 2,
diff --git a/tensorflow/contrib/eager/python/metrics_impl.py b/tensorflow/contrib/eager/python/metrics_impl.py
index efa6ba0626..6efafccd6b 100644
--- a/tensorflow/contrib/eager/python/metrics_impl.py
+++ b/tensorflow/contrib/eager/python/metrics_impl.py
@@ -291,8 +291,6 @@ class Metric(checkpointable.CheckpointableBase):
 
 class Mean(Metric):
   """Computes the (weighted) mean of the given values."""
-  # TODO(josh11b): Maybe have a dtype argument that defaults to tf.float64?
-  # Or defaults to type of the input if it is tf.float32, else tf.float64?
 
   def __init__(self, name=None, dtype=dtypes.float64,
                use_global_variables=False):
@@ -377,7 +375,7 @@ class Accuracy(Mean):
         array_ops.shape(labels), array_ops.shape(predictions),
         message="Shapes of labels and predictions are unequal")
     matches = math_ops.equal(labels, predictions)
-    matches = math_ops.cast(matches, dtypes.float64)
+    matches = math_ops.cast(matches, self.dtype)
     super(Accuracy, self).call(matches, weights=weights)
     if weights is None:
       return labels, predictions
@@ -421,7 +419,7 @@ class CategoricalAccuracy(Mean):
     labels = math_ops.argmax(labels, axis=-1)
     predictions = math_ops.argmax(predictions, axis=-1)
     matches = math_ops.equal(labels, predictions)
-    matches = math_ops.cast(matches, dtypes.float64)
+    matches = math_ops.cast(matches, self.dtype)
     super(CategoricalAccuracy, self).call(matches, weights=weights)
     if weights is None:
       return labels, predictions
@@ -472,7 +470,7 @@ class BinaryAccuracy(Mean):
     predictions = ops.convert_to_tensor(predictions)
     predictions = predictions > self.threshold
     matches = math_ops.equal(labels, predictions)
-    matches = math_ops.cast(matches, dtypes.float64)
+    matches = math_ops.cast(matches, self.dtype)
     super(BinaryAccuracy, self).call(matches, weights=weights)
     if weights is None:
       return labels, predictions
@@ -520,7 +518,7 @@ class SparseAccuracy(Mean):
     predictions = math_ops.argmax(predictions, axis=-1)
     labels = math_ops.cast(labels, dtypes.int64)
     matches = math_ops.equal(labels, predictions)
-    matches = math_ops.cast(matches, dtypes.float64)
+    matches = math_ops.cast(matches, self.dtype)
     super(SparseAccuracy, self).call(matches, weights=weights)
     if weights is None:
       return labels, predictions
diff --git a/tensorflow/contrib/eager/python/saver.py b/tensorflow/contrib/eager/python/saver.py
index fdaca90fd1..d709308647 100644
--- a/tensorflow/contrib/eager/python/saver.py
+++ b/tensorflow/contrib/eager/python/saver.py
@@ -125,8 +125,8 @@ class Saver(object):
 
     Args:
       var_list: The list of variables that will be saved and restored. Either a
-        list of `tfe.Variable` objects, or a dictionary mapping names to
-        `tfe.Variable` objects.
+        list of `tf.Variable` objects, or a dictionary mapping names to
+        `tf.Variable` objects.
 
     Raises:
       RuntimeError: if invoked when eager execution has not been enabled.
diff --git a/tensorflow/contrib/eager/python/tfe.py b/tensorflow/contrib/eager/python/tfe.py
index ca6430253b..2f0ab616e4 100644
--- a/tensorflow/contrib/eager/python/tfe.py
+++ b/tensorflow/contrib/eager/python/tfe.py
@@ -34,6 +34,7 @@ To use, at program startup, call `tfe.enable_eager_execution()`.
 
 @@run
 @@enable_eager_execution
+@@enable_remote_eager_execution
 
 @@custom_gradient
 
@@ -114,6 +115,7 @@ from tensorflow.python.eager.execution_callbacks import inf_nan_callback
 from tensorflow.python.eager.execution_callbacks import nan_callback
 from tensorflow.python.eager.execution_callbacks import seterr
 from tensorflow.python.framework.ops import enable_eager_execution
+from tensorflow.python.framework.ops import enable_eager_execution_internal as enable_remote_eager_execution
 from tensorflow.python.framework.ops import eager_run as run
 from tensorflow.python.framework.test_util import run_in_graph_and_eager_modes as run_test_in_graph_and_eager_modes
 from tensorflow.python.framework.test_util import run_all_in_graph_and_eager_modes as run_all_tests_in_graph_and_eager_modes
diff --git a/tensorflow/contrib/eager/python/tfe_test.py b/tensorflow/contrib/eager/python/tfe_test.py
index db50b33af2..4454abfb96 100644
--- a/tensorflow/contrib/eager/python/tfe_test.py
+++ b/tensorflow/contrib/eager/python/tfe_test.py
@@ -27,7 +27,6 @@ from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import numerics
-from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.summary import summary
 from tensorflow.python.summary.writer import writer
@@ -45,12 +44,6 @@ class TFETest(test_util.TensorFlowTestCase):
                                  r'indices = 7 is not in \[0, 3\)'):
       array_ops.gather([0, 1, 2], 7)
 
-  def testVariableError(self):
-    with self.assertRaisesRegexp(
-        RuntimeError,
-        r'Variable not supported when eager execution is enabled'):
-      variables.Variable(initial_value=1.0)
-
   def testGradients(self):
 
     def square(x):
diff --git a/tensorflow/contrib/estimator/BUILD b/tensorflow/contrib/estimator/BUILD
index 30d297a5fb..349f48f7f7 100644
--- a/tensorflow/contrib/estimator/BUILD
+++ b/tensorflow/contrib/estimator/BUILD
@@ -18,6 +18,7 @@ py_library(
         ":boosted_trees",
         ":dnn",
         ":dnn_linear_combined",
+        ":early_stopping",
         ":export",
         ":extenders",
         ":head",
@@ -27,7 +28,8 @@ py_library(
         ":multi_head",
         ":replicate_model_fn",
         ":rnn",
-        "//tensorflow/python:util",
+        ":saved_model_estimator",
+        "//tensorflow:tensorflow_py_no_contrib",
     ],
 )
 
@@ -53,22 +55,10 @@ py_test(
     deps = [
         ":baseline",
         ":head",
-        "//tensorflow/python:check_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:control_flow_ops",
-        "//tensorflow/python:dtypes",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:platform",
-        "//tensorflow/python:session",
-        "//tensorflow/python:summary",
-        "//tensorflow/python:training",
-        "//tensorflow/python:variables",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:export_export",
         "//tensorflow/python/estimator:metric_keys",
         "//tensorflow/python/estimator:numpy_io",
-        "//tensorflow/python/feature_column",
-        "//tensorflow/python/ops/losses",
         "//third_party/py/numpy",
         "@six_archive//:six",
     ],
@@ -95,11 +85,8 @@ py_test(
     ],
     deps = [
         ":boosted_trees",
-        "//tensorflow/python:dtypes",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:training",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:numpy_io",
-        "//tensorflow/python/feature_column",
         "//third_party/py/numpy",
     ],
 )
@@ -109,7 +96,7 @@ py_library(
     srcs = ["python/estimator/dnn.py"],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/python:nn",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:dnn",
     ],
@@ -128,16 +115,11 @@ py_test(
     deps = [
         ":dnn",
         ":head",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:platform",
-        "//tensorflow/python:summary",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:dnn_testing_utils",
         "//tensorflow/python/estimator:export_export",
         "//tensorflow/python/estimator:numpy_io",
         "//tensorflow/python/estimator:prediction_keys",
-        "//tensorflow/python/feature_column",
-        "//tensorflow/python/ops/losses",
         "//third_party/py/numpy",
         "@six_archive//:six",
     ],
@@ -148,7 +130,7 @@ py_library(
     srcs = ["python/estimator/dnn_linear_combined.py"],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/python:nn",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:dnn_linear_combined",
     ],
@@ -167,18 +149,12 @@ py_test(
     deps = [
         ":dnn_linear_combined",
         ":head",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:nn",
-        "//tensorflow/python:platform",
-        "//tensorflow/python:summary",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:dnn_testing_utils",
         "//tensorflow/python/estimator:export_export",
         "//tensorflow/python/estimator:linear_testing_utils",
         "//tensorflow/python/estimator:numpy_io",
         "//tensorflow/python/estimator:prediction_keys",
-        "//tensorflow/python/feature_column",
-        "//tensorflow/python/ops/losses",
         "//third_party/py/numpy",
         "@six_archive//:six",
     ],
@@ -191,10 +167,7 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/python:clip_ops",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:sparse_tensor",
-        "//tensorflow/python:training",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/estimator:util",
@@ -210,18 +183,11 @@ py_test(
     tags = ["notsan"],  # b/62863147
     deps = [
         ":extenders",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/contrib/data/python/ops:dataset_ops",
         "//tensorflow/contrib/predictor",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:constant_op",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:metrics",
-        "//tensorflow/python:sparse_tensor",
-        "//tensorflow/python:training",
-        "//tensorflow/python:variables",
         "//tensorflow/python/estimator:estimator_py",
         "//tensorflow/python/estimator:linear",
-        "//tensorflow/python/feature_column",
         "//third_party/py/numpy",
     ],
 )
@@ -245,21 +211,11 @@ py_test(
     tags = ["notsan"],  # b/62863147
     deps = [
         ":export",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:metrics",
-        "//tensorflow/python:parsing_ops",
-        "//tensorflow/python:session",
-        "//tensorflow/python:state_ops",
-        "//tensorflow/python:training",
-        "//tensorflow/python:util",
-        "//tensorflow/python:variables",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:export_export",
         "//tensorflow/python/estimator:export_output",
         "//tensorflow/python/estimator:model_fn",
-        "//tensorflow/python/saved_model:loader",
-        "//tensorflow/python/saved_model:tag_constants",
     ],
 )
 
@@ -270,25 +226,12 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:check_ops",
-        "//tensorflow/python:dtypes",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:lookup_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:metrics",
-        "//tensorflow/python:nn",
-        "//tensorflow/python:sparse_ops",
-        "//tensorflow/python:sparse_tensor",
-        "//tensorflow/python:summary",
-        "//tensorflow/python:training",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:export_output",
         "//tensorflow/python/estimator:head",
         "//tensorflow/python/estimator:metric_keys",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/estimator:prediction_keys",
-        "//tensorflow/python/ops/losses",
-        "//tensorflow/python/saved_model:signature_constants",
     ],
 )
 
@@ -299,25 +242,10 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":head",
-        "//tensorflow/core:protos_all_py",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:check_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:constant_op",
-        "//tensorflow/python:control_flow_ops",
-        "//tensorflow/python:dtypes",
-        "//tensorflow/python:errors",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:sparse_tensor",
-        "//tensorflow/python:string_ops",
-        "//tensorflow/python:training",
-        "//tensorflow/python:variables",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:metric_keys",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/estimator:prediction_keys",
-        "//tensorflow/python/ops/losses",
-        "//tensorflow/python/saved_model:signature_constants",
         "//third_party/py/numpy",
         "@six_archive//:six",
     ],
@@ -330,8 +258,7 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:training",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:estimator_py",
     ],
 )
@@ -344,10 +271,7 @@ py_test(
     tags = ["notsan"],
     deps = [
         ":hooks",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:training",
-        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:estimator_py",
         "//third_party/py/numpy",
         "@six_archive//:six",
@@ -376,16 +300,11 @@ py_test(
     deps = [
         ":head",
         ":linear",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:platform",
-        "//tensorflow/python:summary",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:export_export",
         "//tensorflow/python/estimator:linear_testing_utils",
         "//tensorflow/python/estimator:numpy_io",
         "//tensorflow/python/estimator:prediction_keys",
-        "//tensorflow/python/feature_column",
-        "//tensorflow/python/ops/losses",
         "//third_party/py/numpy",
         "@six_archive//:six",
     ],
@@ -398,8 +317,7 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:util",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:dnn",
         "//tensorflow/python/estimator:linear",
     ],
@@ -412,9 +330,7 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":logit_fns",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:constant_op",
-        "//tensorflow/python:session",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:model_fn",
     ],
 )
@@ -426,18 +342,11 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:control_flow_ops",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:metrics",
-        "//tensorflow/python:summary",
-        "//tensorflow/python:training",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:export_output",
         "//tensorflow/python/estimator:head",
         "//tensorflow/python/estimator:metric_keys",
         "//tensorflow/python/estimator:model_fn",
-        "//tensorflow/python/saved_model:signature_constants",
         "@six_archive//:six",
     ],
 )
@@ -450,15 +359,10 @@ py_test(
     deps = [
         ":head",
         ":multi_head",
-        "//tensorflow/core:protos_all_py",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:constant_op",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:string_ops",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:metric_keys",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/estimator:prediction_keys",
-        "//tensorflow/python/saved_model:signature_constants",
         "//third_party/py/numpy",
         "@six_archive//:six",
     ],
@@ -471,24 +375,10 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
-        "//tensorflow/core:protos_all_py",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:control_flow_ops",
-        "//tensorflow/python:device",
-        "//tensorflow/python:device_lib",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:platform",
-        "//tensorflow/python:sparse_ops",
-        "//tensorflow/python:sparse_tensor",
-        "//tensorflow/python:state_ops",
-        "//tensorflow/python:training",
-        "//tensorflow/python:util",
-        "//tensorflow/python:variable_scope",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator:export_output",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/estimator:util",
-        "//tensorflow/python/ops/losses",
         "@six_archive//:six",
     ],
 )
@@ -499,6 +389,7 @@ cuda_py_test(
     srcs = ["python/estimator/replicate_model_fn_test.py"],
     additional_deps = [
         "@absl_py//absl/testing:parameterized",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:dnn",
         "//tensorflow/python/estimator:export_export",
@@ -507,21 +398,6 @@ cuda_py_test(
         "//tensorflow/python/estimator:numpy_io",
         "//tensorflow/python/estimator:optimizers",
         "//tensorflow/python/estimator:prediction_keys",
-        "//tensorflow/python/feature_column",
-        "//tensorflow/python/ops/losses",
-        "//tensorflow/python/saved_model:signature_constants",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:control_flow_ops",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:metrics",
-        "//tensorflow/python:platform",
-        "//tensorflow/python:summary",
-        "//tensorflow/python:training",
-        "//tensorflow/python:variable_scope",
-        "//tensorflow/python:variables",
         ":replicate_model_fn",
     ],
     tags = [
@@ -537,22 +413,11 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":extenders",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/contrib/feature_column:feature_column_py",
-        "//tensorflow/python:array_ops",
-        "//tensorflow/python:check_ops",
-        "//tensorflow/python:framework_ops",
-        "//tensorflow/python:init_ops",
-        "//tensorflow/python:layers",
-        "//tensorflow/python:partitioned_variables",
-        "//tensorflow/python:rnn",
-        "//tensorflow/python:rnn_cell",
-        "//tensorflow/python:summary",
-        "//tensorflow/python:training",
-        "//tensorflow/python:variable_scope",
         "//tensorflow/python/estimator",
         "//tensorflow/python/estimator:head",
         "//tensorflow/python/estimator:optimizers",
-        "//tensorflow/python/feature_column",
         "@six_archive//:six",
     ],
 )
@@ -571,22 +436,73 @@ py_test(
     deps = [
         ":head",
         ":rnn",
+        "//tensorflow:tensorflow_py_no_contrib",
         "//tensorflow/contrib/data",
-        "//tensorflow/core:protos_all_py",
-        "//tensorflow/python:check_ops",
+        "//tensorflow/python/estimator:numpy_io",
+        "//tensorflow/python/estimator:parsing_utils",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+    ],
+)
+
+py_library(
+    name = "early_stopping",
+    srcs = ["python/estimator/early_stopping.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow:tensorflow_py_no_contrib",
+        "//tensorflow/python/estimator",
+    ],
+)
+
+py_test(
+    name = "early_stopping_test",
+    srcs = ["python/estimator/early_stopping_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":early_stopping",
+        "//tensorflow:tensorflow_py_no_contrib",
+        "//tensorflow/python/estimator",
+        "@absl_py//absl/testing:parameterized",
+    ],
+)
+
+py_library(
+    name = "saved_model_estimator",
+    srcs = ["python/estimator/saved_model_estimator.py"],
+    deps = [
+        ":export",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:training",
+        "//tensorflow/python/estimator",
+        "//tensorflow/python/estimator:export",
+        "//tensorflow/python/estimator:model_fn",
+        "//tensorflow/python/saved_model",
+    ],
+)
+
+py_test(
+    name = "saved_model_estimator_test",
+    size = "medium",
+    srcs = ["python/estimator/saved_model_estimator_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":export",
+        ":saved_model_estimator",
+        "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
-        "//tensorflow/python:dtypes",
+        "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:framework_ops",
-        "//tensorflow/python:lib",
-        "//tensorflow/python:math_ops",
+        "//tensorflow/python:metrics",
+        "//tensorflow/python:platform",
         "//tensorflow/python:state_ops",
-        "//tensorflow/python:summary",
         "//tensorflow/python:training",
         "//tensorflow/python:variables",
-        "//tensorflow/python/estimator:numpy_io",
-        "//tensorflow/python/estimator:parsing_utils",
-        "//tensorflow/python/feature_column",
-        "//third_party/py/numpy",
-        "@six_archive//:six",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/estimator",
+        "//tensorflow/python/estimator:export_export",
+        "//tensorflow/python/estimator:export_output",
+        "//tensorflow/python/estimator:model_fn",
     ],
 )
diff --git a/tensorflow/contrib/estimator/__init__.py b/tensorflow/contrib/estimator/__init__.py
index 788ac5ca70..e1453ae1d0 100644
--- a/tensorflow/contrib/estimator/__init__.py
+++ b/tensorflow/contrib/estimator/__init__.py
@@ -23,6 +23,7 @@ from tensorflow.contrib.estimator.python.estimator.baseline import *
 from tensorflow.contrib.estimator.python.estimator.boosted_trees import *
 from tensorflow.contrib.estimator.python.estimator.dnn import *
 from tensorflow.contrib.estimator.python.estimator.dnn_linear_combined import *
+from tensorflow.contrib.estimator.python.estimator.early_stopping import *
 from tensorflow.contrib.estimator.python.estimator.export import *
 from tensorflow.contrib.estimator.python.estimator.extenders import *
 from tensorflow.contrib.estimator.python.estimator.head import *
@@ -32,6 +33,8 @@ from tensorflow.contrib.estimator.python.estimator.logit_fns import *
 from tensorflow.contrib.estimator.python.estimator.multi_head import *
 from tensorflow.contrib.estimator.python.estimator.replicate_model_fn import *
 from tensorflow.contrib.estimator.python.estimator.rnn import *
+from tensorflow.contrib.estimator.python.estimator.saved_model_estimator import *
+from tensorflow.python.estimator.export.export import *
 
 from tensorflow.python.util.all_util import remove_undocumented
 # pylint: enable=unused-import,line-too-long,wildcard-import
@@ -63,6 +66,15 @@ _allowed_symbols = [
     'RNNEstimator',
     'export_saved_model_for_mode',
     'export_all_saved_models',
+    'make_early_stopping_hook',
+    'read_eval_metrics',
+    'stop_if_lower_hook',
+    'stop_if_higher_hook',
+    'stop_if_no_increase_hook',
+    'stop_if_no_decrease_hook',
+    'build_raw_supervised_input_receiver_fn',
+    'build_supervised_input_receiver_fn_from_input_fn',
+    'SavedModelEstimator'
 ]
 
 remove_undocumented(__name__, allowed_exception_list=_allowed_symbols)
diff --git a/tensorflow/contrib/estimator/python/estimator/baseline_test.py b/tensorflow/contrib/estimator/python/estimator/baseline_test.py
index d0e3e670f7..505c94e971 100644
--- a/tensorflow/contrib/estimator/python/estimator/baseline_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/baseline_test.py
@@ -113,6 +113,8 @@ class BaselineEstimatorEvaluationTest(test.TestCase):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 18.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -141,6 +143,8 @@ class BaselineEstimatorEvaluationTest(test.TestCase):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 27.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -166,7 +170,9 @@ class BaselineEstimatorEvaluationTest(test.TestCase):
 
     self.assertItemsEqual(
         (metric_keys.MetricKeys.LOSS, metric_keys.MetricKeys.LOSS_MEAN,
-         ops.GraphKeys.GLOBAL_STEP), eval_metrics.keys())
+         metric_keys.MetricKeys.PREDICTION_MEAN,
+         metric_keys.MetricKeys.LABEL_MEAN, ops.GraphKeys.GLOBAL_STEP),
+        eval_metrics.keys())
 
     # Logit is bias which is [46, 58]
     self.assertAlmostEqual(0, eval_metrics[metric_keys.MetricKeys.LOSS])
diff --git a/tensorflow/contrib/estimator/python/estimator/boosted_trees.py b/tensorflow/contrib/estimator/python/estimator/boosted_trees.py
index bd641014e9..7ed77bcce6 100644
--- a/tensorflow/contrib/estimator/python/estimator/boosted_trees.py
+++ b/tensorflow/contrib/estimator/python/estimator/boosted_trees.py
@@ -49,7 +49,9 @@ class _BoostedTreesEstimator(estimator.Estimator):
                l2_regularization=0.,
                tree_complexity=0.,
                min_node_weight=0.,
-               config=None):
+               config=None,
+               center_bias=False,
+               pruning_mode='none'):
     """Initializes a `BoostedTreesEstimator` instance.
 
     Args:
@@ -82,17 +84,35 @@ class _BoostedTreesEstimator(estimator.Estimator):
         considered. The value will be compared with sum(leaf_hessian)/
         (batch_size * n_batches_per_layer).
       config: `RunConfig` object to configure the runtime settings.
+      center_bias: Whether bias centering needs to occur. Bias centering refers
+        to the first node in the very first tree returning the prediction that
+        is aligned with the original labels distribution. For example, for
+        regression problems, the first node will return the mean of the labels.
+        For binary classification problems, it will return a logit for a prior
+        probability of label 1.
+      pruning_mode: one of 'none', 'pre', 'post' to indicate no pruning, pre-
+        pruning (do not split a node if not enough gain is observed) and post
+        pruning (build the tree up to a max depth and then prune branches with
+        negative gain). For pre and post pruning, you MUST provide
+        tree_complexity >0.
+
     """
     # pylint:disable=protected-access
     # HParams for the model.
     tree_hparams = canned_boosted_trees._TreeHParams(
         n_trees, max_depth, learning_rate, l1_regularization, l2_regularization,
-        tree_complexity, min_node_weight)
+        tree_complexity, min_node_weight, center_bias, pruning_mode)
 
     def _model_fn(features, labels, mode, config):
       return canned_boosted_trees._bt_model_fn(
-          features, labels, mode, head, feature_columns, tree_hparams,
-          n_batches_per_layer, config)
+          features,
+          labels,
+          mode,
+          head,
+          feature_columns,
+          tree_hparams,
+          n_batches_per_layer,
+          config=config)
 
     super(_BoostedTreesEstimator, self).__init__(
         model_fn=_model_fn, model_dir=model_dir, config=config)
@@ -114,7 +134,9 @@ def boosted_trees_classifier_train_in_memory(
     tree_complexity=0.,
     min_node_weight=0.,
     config=None,
-    train_hooks=None):
+    train_hooks=None,
+    center_bias=False,
+    pruning_mode='none'):
   """Trains a boosted tree classifier with in memory dataset.
 
   Example:
@@ -186,7 +208,18 @@ def boosted_trees_classifier_train_in_memory(
         considered. The value will be compared with sum(leaf_hessian)/
         (batch_size * n_batches_per_layer).
     config: `RunConfig` object to configure the runtime settings.
-    train_hooks: a list of Hook instances to be passed to estimator.train().
+    train_hooks: a list of Hook instances to be passed to estimator.train()
+    center_bias: Whether bias centering needs to occur. Bias centering refers
+        to the first node in the very first tree returning the prediction that
+        is aligned with the original labels distribution. For example, for
+        regression problems, the first node will return the mean of the labels.
+        For binary classification problems, it will return a logit for a prior
+        probability of label 1.
+    pruning_mode: one of 'none', 'pre', 'post' to indicate no pruning, pre-
+        pruning (do not split a node if not enough gain is observed) and post
+        pruning (build the tree up to a max depth and then prune branches with
+        negative gain). For pre and post pruning, you MUST provide
+        tree_complexity >0.
 
   Returns:
     a `BoostedTreesClassifier` instance created with the given arguments and
@@ -207,7 +240,7 @@ def boosted_trees_classifier_train_in_memory(
   # HParams for the model.
   tree_hparams = canned_boosted_trees._TreeHParams(
       n_trees, max_depth, learning_rate, l1_regularization, l2_regularization,
-      tree_complexity, min_node_weight)
+      tree_complexity, min_node_weight, center_bias, pruning_mode)
 
   def _model_fn(features, labels, mode, config):
     return canned_boosted_trees._bt_model_fn(
@@ -247,7 +280,9 @@ def boosted_trees_regressor_train_in_memory(
     tree_complexity=0.,
     min_node_weight=0.,
     config=None,
-    train_hooks=None):
+    train_hooks=None,
+    center_bias=False,
+    pruning_mode='none'):
   """Trains a boosted tree regressor with in memory dataset.
 
   Example:
@@ -313,6 +348,17 @@ def boosted_trees_regressor_train_in_memory(
         (batch_size * n_batches_per_layer).
     config: `RunConfig` object to configure the runtime settings.
     train_hooks: a list of Hook instances to be passed to estimator.train().
+    center_bias: Whether bias centering needs to occur. Bias centering refers
+        to the first node in the very first tree returning the prediction that
+        is aligned with the original labels distribution. For example, for
+        regression problems, the first node will return the mean of the labels.
+        For binary classification problems, it will return a logit for a prior
+        probability of label 1.
+    pruning_mode: one of 'none', 'pre', 'post' to indicate no pruning, pre-
+        pruning (do not split a node if not enough gain is observed) and post
+        pruning (build the tree up to a max depth and then prune branches with
+        negative gain). For pre and post pruning, you MUST provide
+        tree_complexity >0.
 
   Returns:
     a `BoostedTreesClassifier` instance created with the given arguments and
@@ -332,7 +378,7 @@ def boosted_trees_regressor_train_in_memory(
   # HParams for the model.
   tree_hparams = canned_boosted_trees._TreeHParams(
       n_trees, max_depth, learning_rate, l1_regularization, l2_regularization,
-      tree_complexity, min_node_weight)
+      tree_complexity, min_node_weight, center_bias, pruning_mode)
 
   def _model_fn(features, labels, mode, config):
     return canned_boosted_trees._bt_model_fn(
diff --git a/tensorflow/contrib/estimator/python/estimator/boosted_trees_test.py b/tensorflow/contrib/estimator/python/estimator/boosted_trees_test.py
index 76cbefe5e9..b1581f3750 100644
--- a/tensorflow/contrib/estimator/python/estimator/boosted_trees_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/boosted_trees_test.py
@@ -115,6 +115,70 @@ class BoostedTreesEstimatorTest(test_util.TensorFlowTestCase):
     eval_res = est.evaluate(input_fn=input_fn, steps=1)
     self.assertAllClose(eval_res['average_loss'], 1.008551)
 
+  def testTrainAndEvaluateEstimatorWithCenterBias(self):
+    input_fn = _make_train_input_fn(is_classification=False)
+
+    est = boosted_trees._BoostedTreesEstimator(
+        feature_columns=self._feature_columns,
+        n_batches_per_layer=1,
+        n_trees=2,
+        head=self._head,
+        max_depth=5,
+        center_bias=True)
+
+    # It will stop after 11 steps because of the max depth and num trees.
+    num_steps = 100
+    # Train for a few steps, and validate final checkpoint.
+    est.train(input_fn, steps=num_steps)
+    # 10 steps for training and 2 step for bias centering.
+    self._assert_checkpoint(
+        est.model_dir, global_step=12, finalized_trees=2, attempted_layers=10)
+    eval_res = est.evaluate(input_fn=input_fn, steps=1)
+    self.assertAllClose(eval_res['average_loss'], 0.614642)
+
+  def testTrainAndEvaluateEstimatorWithPrePruning(self):
+    input_fn = _make_train_input_fn(is_classification=False)
+
+    est = boosted_trees._BoostedTreesEstimator(
+        feature_columns=self._feature_columns,
+        n_batches_per_layer=1,
+        n_trees=2,
+        head=self._head,
+        max_depth=5,
+        tree_complexity=0.001,
+        pruning_mode='pre')
+
+    num_steps = 100
+    # Train for a few steps, and validate final checkpoint.
+    est.train(input_fn, steps=num_steps)
+    # We stop actually after 2*depth*n_trees steps (via a hook) because we still
+    # could not grow 2 trees of depth 5 (due to pre-pruning).
+    self._assert_checkpoint(
+        est.model_dir, global_step=21, finalized_trees=0, attempted_layers=21)
+    eval_res = est.evaluate(input_fn=input_fn, steps=1)
+    self.assertAllClose(eval_res['average_loss'], 3.83943)
+
+  def testTrainAndEvaluateEstimatorWithPostPruning(self):
+    input_fn = _make_train_input_fn(is_classification=False)
+
+    est = boosted_trees._BoostedTreesEstimator(
+        feature_columns=self._feature_columns,
+        n_batches_per_layer=1,
+        n_trees=2,
+        head=self._head,
+        max_depth=5,
+        tree_complexity=0.001,
+        pruning_mode='post')
+
+    # It will stop after 10 steps because of the max depth and num trees.
+    num_steps = 100
+    # Train for a few steps, and validate final checkpoint.
+    est.train(input_fn, steps=num_steps)
+    self._assert_checkpoint(
+        est.model_dir, global_step=10, finalized_trees=2, attempted_layers=10)
+    eval_res = est.evaluate(input_fn=input_fn, steps=1)
+    self.assertAllClose(eval_res['average_loss'], 2.37652)
+
   def testInferEstimator(self):
     train_input_fn = _make_train_input_fn(is_classification=False)
     predict_input_fn = numpy_io.numpy_input_fn(
@@ -139,6 +203,33 @@ class BoostedTreesEstimatorTest(test_util.TensorFlowTestCase):
         [[0.571619], [0.262821], [0.124549], [0.956801], [1.769801]],
         [pred['predictions'] for pred in predictions])
 
+  def testInferEstimatorWithCenterBias(self):
+    train_input_fn = _make_train_input_fn(is_classification=False)
+    predict_input_fn = numpy_io.numpy_input_fn(
+        x=FEATURES_DICT, y=None, batch_size=1, num_epochs=1, shuffle=False)
+
+    est = boosted_trees._BoostedTreesEstimator(
+        feature_columns=self._feature_columns,
+        n_batches_per_layer=1,
+        n_trees=1,
+        max_depth=5,
+        center_bias=True,
+        head=self._head)
+
+    # It will stop after 6 steps because of the max depth and num trees (5 for
+    # training and 2 for bias centering).
+    num_steps = 100
+    # Train for a few steps, and validate final checkpoint.
+    est.train(train_input_fn, steps=num_steps)
+    self._assert_checkpoint(
+        est.model_dir, global_step=7, finalized_trees=1, attempted_layers=5)
+    # Validate predictions.
+    predictions = list(est.predict(input_fn=predict_input_fn))
+
+    self.assertAllClose(
+        [[1.634501], [1.325703], [1.187431], [2.019683], [2.832683]],
+        [pred['predictions'] for pred in predictions])
+
   def testBinaryClassifierTrainInMemoryAndEvalAndInfer(self):
     train_input_fn = _make_train_input_fn(is_classification=True)
     predict_input_fn = numpy_io.numpy_input_fn(
@@ -159,14 +250,65 @@ class BoostedTreesEstimatorTest(test_util.TensorFlowTestCase):
     self.assertAllClose([[0], [1], [1], [0], [0]],
                         [pred['class_ids'] for pred in predictions])
 
+  def testBinaryClassifierTrainInMemoryAndEvalAndInferWithCenterBias(self):
+    train_input_fn = _make_train_input_fn(is_classification=True)
+    predict_input_fn = numpy_io.numpy_input_fn(
+        x=FEATURES_DICT, y=None, batch_size=1, num_epochs=1, shuffle=False)
+
+    est = boosted_trees.boosted_trees_classifier_train_in_memory(
+        train_input_fn=train_input_fn,
+        feature_columns=self._feature_columns,
+        n_trees=1,
+        max_depth=5,
+        center_bias=True)
+    # It will stop after 5 steps + 3 for bias, because of the max depth and num
+    # trees.
+    self._assert_checkpoint(
+        est.model_dir, global_step=8, finalized_trees=1, attempted_layers=5)
+
+    # Check evaluate and predict.
+    eval_res = est.evaluate(input_fn=train_input_fn, steps=1)
+    self.assertAllClose(eval_res['accuracy'], 1.0)
+    # Validate predictions.
+    predictions = list(est.predict(input_fn=predict_input_fn))
+    self.assertAllClose([[0], [1], [1], [0], [0]],
+                        [pred['class_ids'] for pred in predictions])
+
+  def testBinaryClassifierTrainInMemoryAndEvalAndInferWithPrePruning(self):
+    train_input_fn = _make_train_input_fn(is_classification=True)
+    predict_input_fn = numpy_io.numpy_input_fn(
+        x=FEATURES_DICT, y=None, batch_size=1, num_epochs=1, shuffle=False)
+
+    est = boosted_trees.boosted_trees_classifier_train_in_memory(
+        train_input_fn=train_input_fn,
+        feature_columns=self._feature_columns,
+        n_trees=1,
+        max_depth=5,
+        pruning_mode='pre',
+        tree_complexity=0.01)
+    # We stop actually after 2*depth*n_trees steps (via a hook) because we still
+    # could not grow 1 trees of depth 5 (due to pre-pruning).
+    self._assert_checkpoint(
+        est.model_dir, global_step=11, finalized_trees=0, attempted_layers=11)
+
+    # Check evaluate and predict.
+    eval_res = est.evaluate(input_fn=train_input_fn, steps=1)
+    self.assertAllClose(eval_res['accuracy'], 1.0)
+    # Validate predictions.
+    predictions = list(est.predict(input_fn=predict_input_fn))
+    self.assertAllClose([[0], [1], [1], [0], [0]],
+                        [pred['class_ids'] for pred in predictions])
+
   def testBinaryClassifierTrainInMemoryWithDataset(self):
     train_input_fn = _make_train_input_fn_dataset(is_classification=True)
     predict_input_fn = numpy_io.numpy_input_fn(
         x=FEATURES_DICT, y=None, batch_size=1, num_epochs=1, shuffle=False)
 
     est = boosted_trees.boosted_trees_classifier_train_in_memory(
-        train_input_fn=train_input_fn, feature_columns=self._feature_columns,
-        n_trees=1, max_depth=5)
+        train_input_fn=train_input_fn,
+        feature_columns=self._feature_columns,
+        n_trees=1,
+        max_depth=5)
     # It will stop after 5 steps because of the max depth and num trees.
     self._assert_checkpoint(
         est.model_dir, global_step=5, finalized_trees=1, attempted_layers=5)
diff --git a/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined.py b/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined.py
index 894a295498..2eef60c39f 100644
--- a/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined.py
+++ b/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined.py
@@ -110,7 +110,8 @@ class DNNLinearCombinedEstimator(estimator.Estimator):
                dnn_activation_fn=nn.relu,
                dnn_dropout=None,
                input_layer_partitioner=None,
-               config=None):
+               config=None,
+               linear_sparse_combiner='sum'):
     """Initializes a DNNLinearCombinedEstimator instance.
 
     Args:
@@ -142,6 +143,11 @@ class DNNLinearCombinedEstimator(estimator.Estimator):
       input_layer_partitioner: Partitioner for input layer. Defaults to
         `min_max_variable_partitioner` with `min_slice_size` 64 << 20.
       config: RunConfig object to configure the runtime settings.
+      linear_sparse_combiner: A string specifying how to reduce the linear model
+        if a categorical column is multivalent.  One of "mean", "sqrtn", and
+        "sum" -- these are effectively different ways to do example-level
+        normalization, which can be useful for bag-of-words features.  For more
+        details, see @{tf.feature_column.linear_model$linear_model}.
 
     Raises:
       ValueError: If both linear_feature_columns and dnn_features_columns are
@@ -169,7 +175,8 @@ class DNNLinearCombinedEstimator(estimator.Estimator):
           dnn_activation_fn=dnn_activation_fn,
           dnn_dropout=dnn_dropout,
           input_layer_partitioner=input_layer_partitioner,
-          config=config)
+          config=config,
+          linear_sparse_combiner=linear_sparse_combiner)
 
     super(DNNLinearCombinedEstimator, self).__init__(
         model_fn=_model_fn, model_dir=model_dir, config=config)
diff --git a/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined_test.py b/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined_test.py
index dd009a6753..51b9ce7005 100644
--- a/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/dnn_linear_combined_test.py
@@ -100,7 +100,8 @@ def _linear_only_estimator_fn(
     weight_column=None,
     optimizer='Ftrl',
     config=None,
-    partitioner=None):
+    partitioner=None,
+    sparse_combiner='sum'):
   return dnn_linear_combined.DNNLinearCombinedEstimator(
       head=head_lib.regression_head(
           weight_column=weight_column, label_dimension=label_dimension,
@@ -110,7 +111,8 @@ def _linear_only_estimator_fn(
       linear_feature_columns=feature_columns,
       linear_optimizer=optimizer,
       input_layer_partitioner=partitioner,
-      config=config)
+      config=config,
+      linear_sparse_combiner=sparse_combiner)
 
 
 class LinearOnlyEstimatorEvaluateTest(
diff --git a/tensorflow/contrib/estimator/python/estimator/early_stopping.py b/tensorflow/contrib/estimator/python/estimator/early_stopping.py
new file mode 100644
index 0000000000..3eab21d5ac
--- /dev/null
+++ b/tensorflow/contrib/estimator/python/estimator/early_stopping.py
@@ -0,0 +1,469 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utilities for early stopping."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import operator
+import os
+
+from tensorflow.python.estimator import estimator as estimator_lib
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import init_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.platform import gfile
+from tensorflow.python.platform import tf_logging
+from tensorflow.python.summary import summary_iterator
+from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.training import session_run_hook
+from tensorflow.python.training import training_util
+
+_EVENT_FILE_GLOB_PATTERN = 'events.out.tfevents.*'
+
+
+def make_early_stopping_hook(estimator,
+                             should_stop_fn,
+                             run_every_secs=60,
+                             run_every_steps=None):
+  """Creates early-stopping hook.
+
+  Returns a `SessionRunHook` that stops training when `should_stop_fn` returns
+  `True`.
+
+  Usage example:
+
+  ```python
+  estimator = ...
+  hook = early_stopping.make_early_stopping_hook(
+      estimator, should_stop_fn=make_stop_fn(...))
+  train_spec = tf.estimator.TrainSpec(..., hooks=[hook])
+  tf.estimator.train_and_evaluate(estimator, train_spec, ...)
+  ```
+
+  Args:
+    estimator: A `tf.estimator.Estimator` instance.
+    should_stop_fn: `callable`, function that takes no arguments and returns a
+      `bool`. If the function returns `True`, stopping will be initiated by the
+      chief.
+    run_every_secs: If specified, calls `should_stop_fn` at an interval of
+      `run_every_secs` seconds. Defaults to 60 seconds. Either this or
+      `run_every_steps` must be set.
+    run_every_steps: If specified, calls `should_stop_fn` every
+      `run_every_steps` steps. Either this or `run_every_secs` must be set.
+
+  Returns:
+    A `SessionRunHook` that periodically executes `should_stop_fn` and initiates
+    early stopping if the function returns `True`.
+
+  Raises:
+    TypeError: If `estimator` is not of type `tf.estimator.Estimator`.
+    ValueError: If both `run_every_secs` and `run_every_steps` are set.
+  """
+  if not isinstance(estimator, estimator_lib.Estimator):
+    raise TypeError('`estimator` must have type `tf.estimator.Estimator`. '
+                    'Got: {}'.format(type(estimator)))
+
+  if run_every_secs is not None and run_every_steps is not None:
+    raise ValueError('Only one of `run_every_secs` and `run_every_steps` must '
+                     'be set.')
+
+  if estimator.config.is_chief:
+    return _StopOnPredicateHook(should_stop_fn, run_every_secs, run_every_steps)
+  else:
+    return _CheckForStoppingHook()
+
+
+def stop_if_higher_hook(estimator,
+                        metric_name,
+                        threshold,
+                        eval_dir=None,
+                        min_steps=0,
+                        run_every_secs=60,
+                        run_every_steps=None):
+  """Creates hook to stop if the given metric is higher than the threshold.
+
+  Usage example:
+
+  ```python
+  estimator = ...
+  # Hook to stop training if accuracy becomes higher than 0.9.
+  hook = early_stopping.stop_if_higher_hook(estimator, "accuracy", 0.9)
+  train_spec = tf.estimator.TrainSpec(..., hooks=[hook])
+  tf.estimator.train_and_evaluate(estimator, train_spec, ...)
+  ```
+
+  Args:
+    estimator: A `tf.estimator.Estimator` instance.
+    metric_name: `str`, metric to track. "loss", "accuracy", etc.
+    threshold: Numeric threshold for the given metric.
+    eval_dir: If set, directory containing summary files with eval metrics. By
+      default, `estimator.eval_dir()` will be used.
+    min_steps: `int`, stop is never requested if global step is less than this
+      value. Defaults to 0.
+    run_every_secs: If specified, calls `should_stop_fn` at an interval of
+      `run_every_secs` seconds. Defaults to 60 seconds. Either this or
+      `run_every_steps` must be set.
+    run_every_steps: If specified, calls `should_stop_fn` every
+      `run_every_steps` steps. Either this or `run_every_secs` must be set.
+
+  Returns:
+    An early-stopping hook of type `SessionRunHook` that periodically checks
+    if the given metric is higher than specified threshold and initiates
+    early stopping if true.
+  """
+  return _stop_if_threshold_crossed_hook(
+      estimator=estimator,
+      metric_name=metric_name,
+      threshold=threshold,
+      higher_is_better=True,
+      eval_dir=eval_dir,
+      min_steps=min_steps,
+      run_every_secs=run_every_secs,
+      run_every_steps=run_every_steps)
+
+
+def stop_if_lower_hook(estimator,
+                       metric_name,
+                       threshold,
+                       eval_dir=None,
+                       min_steps=0,
+                       run_every_secs=60,
+                       run_every_steps=None):
+  """Creates hook to stop if the given metric is lower than the threshold.
+
+  Usage example:
+
+  ```python
+  estimator = ...
+  # Hook to stop training if loss becomes lower than 100.
+  hook = early_stopping.stop_if_lower_hook(estimator, "loss", 100)
+  train_spec = tf.estimator.TrainSpec(..., hooks=[hook])
+  tf.estimator.train_and_evaluate(estimator, train_spec, ...)
+  ```
+
+  Args:
+    estimator: A `tf.estimator.Estimator` instance.
+    metric_name: `str`, metric to track. "loss", "accuracy", etc.
+    threshold: Numeric threshold for the given metric.
+    eval_dir: If set, directory containing summary files with eval metrics. By
+      default, `estimator.eval_dir()` will be used.
+    min_steps: `int`, stop is never requested if global step is less than this
+      value. Defaults to 0.
+    run_every_secs: If specified, calls `should_stop_fn` at an interval of
+      `run_every_secs` seconds. Defaults to 60 seconds. Either this or
+      `run_every_steps` must be set.
+    run_every_steps: If specified, calls `should_stop_fn` every
+      `run_every_steps` steps. Either this or `run_every_secs` must be set.
+
+  Returns:
+    An early-stopping hook of type `SessionRunHook` that periodically checks
+    if the given metric is lower than specified threshold and initiates
+    early stopping if true.
+  """
+  return _stop_if_threshold_crossed_hook(
+      estimator=estimator,
+      metric_name=metric_name,
+      threshold=threshold,
+      higher_is_better=False,
+      eval_dir=eval_dir,
+      min_steps=min_steps,
+      run_every_secs=run_every_secs,
+      run_every_steps=run_every_steps)
+
+
+def stop_if_no_increase_hook(estimator,
+                             metric_name,
+                             max_steps_without_increase,
+                             eval_dir=None,
+                             min_steps=0,
+                             run_every_secs=60,
+                             run_every_steps=None):
+  """Creates hook to stop if metric does not increase within given max steps.
+
+  Usage example:
+
+  ```python
+  estimator = ...
+  # Hook to stop training if accuracy does not increase in over 100000 steps.
+  hook = early_stopping.stop_if_no_increase_hook(estimator, "accuracy", 100000)
+  train_spec = tf.estimator.TrainSpec(..., hooks=[hook])
+  tf.estimator.train_and_evaluate(estimator, train_spec, ...)
+  ```
+
+  Args:
+    estimator: A `tf.estimator.Estimator` instance.
+    metric_name: `str`, metric to track. "loss", "accuracy", etc.
+    max_steps_without_increase: `int`, maximum number of training steps with no
+      increase in the given metric.
+    eval_dir: If set, directory containing summary files with eval metrics. By
+      default, `estimator.eval_dir()` will be used.
+    min_steps: `int`, stop is never requested if global step is less than this
+      value. Defaults to 0.
+    run_every_secs: If specified, calls `should_stop_fn` at an interval of
+      `run_every_secs` seconds. Defaults to 60 seconds. Either this or
+      `run_every_steps` must be set.
+    run_every_steps: If specified, calls `should_stop_fn` every
+      `run_every_steps` steps. Either this or `run_every_secs` must be set.
+
+  Returns:
+    An early-stopping hook of type `SessionRunHook` that periodically checks
+    if the given metric shows no increase over given maximum number of
+    training steps, and initiates early stopping if true.
+  """
+  return _stop_if_no_metric_improvement_hook(
+      estimator=estimator,
+      metric_name=metric_name,
+      max_steps_without_improvement=max_steps_without_increase,
+      higher_is_better=True,
+      eval_dir=eval_dir,
+      min_steps=min_steps,
+      run_every_secs=run_every_secs,
+      run_every_steps=run_every_steps)
+
+
+def stop_if_no_decrease_hook(estimator,
+                             metric_name,
+                             max_steps_without_decrease,
+                             eval_dir=None,
+                             min_steps=0,
+                             run_every_secs=60,
+                             run_every_steps=None):
+  """Creates hook to stop if metric does not decrease within given max steps.
+
+  Usage example:
+
+  ```python
+  estimator = ...
+  # Hook to stop training if loss does not decrease in over 100000 steps.
+  hook = early_stopping.stop_if_no_decrease_hook(estimator, "loss", 100000)
+  train_spec = tf.estimator.TrainSpec(..., hooks=[hook])
+  tf.estimator.train_and_evaluate(estimator, train_spec, ...)
+  ```
+
+  Args:
+    estimator: A `tf.estimator.Estimator` instance.
+    metric_name: `str`, metric to track. "loss", "accuracy", etc.
+    max_steps_without_decrease: `int`, maximum number of training steps with no
+      decrease in the given metric.
+    eval_dir: If set, directory containing summary files with eval metrics. By
+      default, `estimator.eval_dir()` will be used.
+    min_steps: `int`, stop is never requested if global step is less than this
+      value. Defaults to 0.
+    run_every_secs: If specified, calls `should_stop_fn` at an interval of
+      `run_every_secs` seconds. Defaults to 60 seconds. Either this or
+      `run_every_steps` must be set.
+    run_every_steps: If specified, calls `should_stop_fn` every
+      `run_every_steps` steps. Either this or `run_every_secs` must be set.
+
+  Returns:
+    An early-stopping hook of type `SessionRunHook` that periodically checks
+    if the given metric shows no decrease over given maximum number of
+    training steps, and initiates early stopping if true.
+  """
+  return _stop_if_no_metric_improvement_hook(
+      estimator=estimator,
+      metric_name=metric_name,
+      max_steps_without_improvement=max_steps_without_decrease,
+      higher_is_better=False,
+      eval_dir=eval_dir,
+      min_steps=min_steps,
+      run_every_secs=run_every_secs,
+      run_every_steps=run_every_steps)
+
+
+def read_eval_metrics(eval_dir):
+  """Helper to read eval metrics from eval summary files.
+
+  Args:
+    eval_dir: Directory containing summary files with eval metrics.
+
+  Returns:
+    A `dict` with global steps mapping to `dict` of metric names and values.
+  """
+  eval_metrics_dict = {}
+  for event in _summaries(eval_dir):
+    if not event.HasField('summary'):
+      continue
+    metrics = {}
+    for value in event.summary.value:
+      if value.HasField('simple_value'):
+        metrics[value.tag] = value.simple_value
+    if metrics:
+      eval_metrics_dict[event.step] = metrics
+  return eval_metrics_dict
+
+
+def _stop_if_threshold_crossed_hook(estimator, metric_name, threshold,
+                                    higher_is_better, eval_dir, min_steps,
+                                    run_every_secs, run_every_steps):
+  """Creates early-stopping hook to stop training if threshold is crossed."""
+
+  if eval_dir is None:
+    eval_dir = estimator.eval_dir()
+
+  is_lhs_better = operator.gt if higher_is_better else operator.lt
+  greater_or_lesser = 'greater than' if higher_is_better else 'less than'
+
+  def stop_if_threshold_crossed_fn():
+    """Returns `True` if the given metric crosses specified threshold."""
+
+    eval_results = read_eval_metrics(eval_dir)
+
+    for step, metrics in eval_results.items():
+      if step < min_steps:
+        continue
+      val = metrics[metric_name]
+      if is_lhs_better(val, threshold):
+        tf_logging.info(
+            'At step %s, metric "%s" has value %s which is %s the configured '
+            'threshold (%s) for early stopping.', step, metric_name, val,
+            greater_or_lesser, threshold)
+        return True
+    return False
+
+  return make_early_stopping_hook(
+      estimator=estimator,
+      should_stop_fn=stop_if_threshold_crossed_fn,
+      run_every_secs=run_every_secs,
+      run_every_steps=run_every_steps)
+
+
+def _stop_if_no_metric_improvement_hook(
+    estimator, metric_name, max_steps_without_improvement, higher_is_better,
+    eval_dir, min_steps, run_every_secs, run_every_steps):
+  """Returns hook to stop training if given metric shows no improvement."""
+
+  if eval_dir is None:
+    eval_dir = estimator.eval_dir()
+
+  is_lhs_better = operator.gt if higher_is_better else operator.lt
+  increase_or_decrease = 'increase' if higher_is_better else 'decrease'
+
+  def stop_if_no_metric_improvement_fn():
+    """Returns `True` if metric does not improve within max steps."""
+
+    eval_results = read_eval_metrics(eval_dir)
+
+    best_val = None
+    best_val_step = None
+    for step, metrics in eval_results.items():
+      if step < min_steps:
+        continue
+      val = metrics[metric_name]
+      if best_val is None or is_lhs_better(val, best_val):
+        best_val = val
+        best_val_step = step
+      if step - best_val_step >= max_steps_without_improvement:
+        tf_logging.info(
+            'No %s in metric "%s" for %s steps, which is greater than or equal '
+            'to max steps (%s) configured for early stopping.',
+            increase_or_decrease, metric_name, step - best_val_step,
+            max_steps_without_improvement)
+        return True
+    return False
+
+  return make_early_stopping_hook(
+      estimator=estimator,
+      should_stop_fn=stop_if_no_metric_improvement_fn,
+      run_every_secs=run_every_secs,
+      run_every_steps=run_every_steps)
+
+
+def _summaries(eval_dir):
+  """Yields `tensorflow.Event` protos from event files in the eval dir.
+
+  Args:
+    eval_dir: Directory containing summary files with eval metrics.
+
+  Yields:
+    `tensorflow.Event` object read from the event files.
+  """
+  if gfile.Exists(eval_dir):
+    for event_file in gfile.Glob(
+        os.path.join(eval_dir, _EVENT_FILE_GLOB_PATTERN)):
+      for event in summary_iterator.summary_iterator(event_file):
+        yield event
+
+
+def _get_or_create_stop_var():
+  with variable_scope.variable_scope(
+      name_or_scope='signal_early_stopping',
+      values=[],
+      reuse=variable_scope.AUTO_REUSE):
+    return variable_scope.get_variable(
+        name='STOP',
+        shape=[],
+        dtype=dtypes.bool,
+        initializer=init_ops.constant_initializer(False),
+        collections=[ops.GraphKeys.GLOBAL_VARIABLES],
+        trainable=False)
+
+
+class _StopOnPredicateHook(session_run_hook.SessionRunHook):
+  """Hook that requests stop when `should_stop_fn` returns `True`."""
+
+  def __init__(self, should_stop_fn, run_every_secs=60, run_every_steps=None):
+    if not callable(should_stop_fn):
+      raise TypeError('`should_stop_fn` must be callable.')
+
+    self._should_stop_fn = should_stop_fn
+    self._timer = basic_session_run_hooks.SecondOrStepTimer(
+        every_secs=run_every_secs, every_steps=run_every_steps)
+    self._global_step_tensor = None
+    self._stop_var = None
+    self._stop_op = None
+
+  def begin(self):
+    self._global_step_tensor = training_util.get_global_step()
+    self._stop_var = _get_or_create_stop_var()
+    self._stop_op = state_ops.assign(self._stop_var, True)
+
+  def before_run(self, run_context):
+    del run_context
+    return session_run_hook.SessionRunArgs(self._global_step_tensor)
+
+  def after_run(self, run_context, run_values):
+    global_step = run_values.results
+    if self._timer.should_trigger_for_step(global_step):
+      self._timer.update_last_triggered_step(global_step)
+      if self._should_stop_fn():
+        tf_logging.info('Requesting early stopping at global step %d',
+                        global_step)
+        run_context.session.run(self._stop_op)
+        run_context.request_stop()
+
+
+class _CheckForStoppingHook(session_run_hook.SessionRunHook):
+  """Hook that requests stop if stop is requested by `_StopOnPredicateHook`."""
+
+  def __init__(self):
+    self._stop_var = None
+
+  def begin(self):
+    self._stop_var = _get_or_create_stop_var()
+
+  def before_run(self, run_context):
+    del run_context
+    return session_run_hook.SessionRunArgs(self._stop_var)
+
+  def after_run(self, run_context, run_values):
+    should_early_stop = run_values.results
+    if should_early_stop:
+      tf_logging.info('Early stopping requested, suspending run.')
+      run_context.request_stop()
diff --git a/tensorflow/contrib/estimator/python/estimator/early_stopping_test.py b/tensorflow/contrib/estimator/python/estimator/early_stopping_test.py
new file mode 100644
index 0000000000..e4bfd4b446
--- /dev/null
+++ b/tensorflow/contrib/estimator/python/estimator/early_stopping_test.py
@@ -0,0 +1,246 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for early_stopping."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import tempfile
+
+from absl.testing import parameterized
+from tensorflow.contrib.estimator.python.estimator import early_stopping
+from tensorflow.python.estimator import estimator
+from tensorflow.python.estimator import run_config
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.platform import test
+from tensorflow.python.training import monitored_session
+from tensorflow.python.training import training_util
+
+
+class _FakeRunConfig(run_config.RunConfig):
+
+  def __init__(self, is_chief):
+    super(_FakeRunConfig, self).__init__()
+    self._is_chief = is_chief
+
+  @property
+  def is_chief(self):
+    return self._is_chief
+
+
+def _dummy_model_fn(features, labels, params):
+  _, _, _ = features, labels, params
+
+
+class _FakeEstimator(estimator.Estimator):
+  """Fake estimator for testing."""
+
+  def __init__(self, config):
+    super(_FakeEstimator, self).__init__(
+        model_fn=_dummy_model_fn, config=config)
+
+
+def _write_events(eval_dir, params):
+  """Test helper to write events to summary files."""
+  for steps, loss, accuracy in params:
+    estimator._write_dict_to_summary(eval_dir, {
+        'loss': loss,
+        'accuracy': accuracy,
+    }, steps)
+
+
+class ReadEvalMetricsTest(test.TestCase):
+
+  def test_read_eval_metrics(self):
+    eval_dir = tempfile.mkdtemp()
+    _write_events(
+        eval_dir,
+        [
+            # steps, loss, accuracy
+            (1000, 1, 2),
+            (2000, 3, 4),
+            (3000, 5, 6),
+        ])
+    self.assertEqual({
+        1000: {
+            'loss': 1,
+            'accuracy': 2
+        },
+        2000: {
+            'loss': 3,
+            'accuracy': 4
+        },
+        3000: {
+            'loss': 5,
+            'accuracy': 6
+        },
+    }, early_stopping.read_eval_metrics(eval_dir))
+
+  def test_read_eval_metrics_when_no_events(self):
+    eval_dir = tempfile.mkdtemp()
+    self.assertTrue(os.path.exists(eval_dir))
+
+    # No error should be raised when eval directory exists with no event files.
+    self.assertEqual({}, early_stopping.read_eval_metrics(eval_dir))
+
+    os.rmdir(eval_dir)
+    self.assertFalse(os.path.exists(eval_dir))
+
+    # No error should be raised when eval directory does not exist.
+    self.assertEqual({}, early_stopping.read_eval_metrics(eval_dir))
+
+
+class EarlyStoppingHooksTest(test.TestCase, parameterized.TestCase):
+
+  def setUp(self):
+    config = _FakeRunConfig(is_chief=True)
+    self._estimator = _FakeEstimator(config=config)
+    eval_dir = self._estimator.eval_dir()
+    os.makedirs(eval_dir)
+    _write_events(
+        eval_dir,
+        [
+            # steps, loss, accuracy
+            (1000, 0.8, 0.5),
+            (2000, 0.7, 0.6),
+            (3000, 0.4, 0.7),
+            (3500, 0.41, 0.68),
+        ])
+
+  def run_session(self, hooks, should_stop):
+    hooks = hooks if isinstance(hooks, list) else [hooks]
+    with ops.Graph().as_default():
+      training_util.create_global_step()
+      no_op = control_flow_ops.no_op()
+      with monitored_session.SingularMonitoredSession(hooks=hooks) as mon_sess:
+        mon_sess.run(no_op)
+        self.assertEqual(mon_sess.should_stop(), should_stop)
+
+  @parameterized.parameters((0.8, 0, False), (0.6, 4000, False), (0.6, 0, True))
+  def test_stop_if_higher_hook(self, threshold, min_steps, should_stop):
+    self.run_session(
+        early_stopping.stop_if_higher_hook(
+            self._estimator,
+            metric_name='accuracy',
+            threshold=threshold,
+            min_steps=min_steps), should_stop)
+
+  @parameterized.parameters((0.3, 0, False), (0.5, 4000, False), (0.5, 0, True))
+  def test_stop_if_lower_hook(self, threshold, min_steps, should_stop):
+    self.run_session(
+        early_stopping.stop_if_lower_hook(
+            self._estimator,
+            metric_name='loss',
+            threshold=threshold,
+            min_steps=min_steps), should_stop)
+
+  @parameterized.parameters((1500, 0, False), (500, 4000, False),
+                            (500, 0, True))
+  def test_stop_if_no_increase_hook(self, max_steps, min_steps, should_stop):
+    self.run_session(
+        early_stopping.stop_if_no_increase_hook(
+            self._estimator,
+            metric_name='accuracy',
+            max_steps_without_increase=max_steps,
+            min_steps=min_steps), should_stop)
+
+  @parameterized.parameters((1500, 0, False), (500, 4000, False),
+                            (500, 0, True))
+  def test_stop_if_no_decrease_hook(self, max_steps, min_steps, should_stop):
+    self.run_session(
+        early_stopping.stop_if_no_decrease_hook(
+            self._estimator,
+            metric_name='loss',
+            max_steps_without_decrease=max_steps,
+            min_steps=min_steps), should_stop)
+
+  @parameterized.parameters((1500, 0.3, False), (1500, 0.5, True),
+                            (500, 0.3, True))
+  def test_multiple_hooks(self, max_steps, loss_threshold, should_stop):
+    self.run_session([
+        early_stopping.stop_if_no_decrease_hook(
+            self._estimator,
+            metric_name='loss',
+            max_steps_without_decrease=max_steps),
+        early_stopping.stop_if_lower_hook(
+            self._estimator, metric_name='loss', threshold=loss_threshold)
+    ], should_stop)
+
+  @parameterized.parameters(False, True)
+  def test_make_early_stopping_hook(self, should_stop):
+    self.run_session([
+        early_stopping.make_early_stopping_hook(
+            self._estimator, should_stop_fn=lambda: should_stop)
+    ], should_stop)
+
+  def test_make_early_stopping_hook_typeerror(self):
+    with self.assertRaises(TypeError):
+      early_stopping.make_early_stopping_hook(
+          estimator=object(), should_stop_fn=lambda: True)
+
+  def test_make_early_stopping_hook_valueerror(self):
+    with self.assertRaises(ValueError):
+      early_stopping.make_early_stopping_hook(
+          self._estimator,
+          should_stop_fn=lambda: True,
+          run_every_secs=60,
+          run_every_steps=100)
+
+
+class StopOnPredicateHookTest(test.TestCase):
+
+  def test_stop(self):
+    hook = early_stopping._StopOnPredicateHook(
+        should_stop_fn=lambda: False, run_every_secs=0)
+    with ops.Graph().as_default():
+      training_util.create_global_step()
+      no_op = control_flow_ops.no_op()
+      with monitored_session.SingularMonitoredSession(hooks=[hook]) as mon_sess:
+        mon_sess.run(no_op)
+        self.assertFalse(mon_sess.should_stop())
+        self.assertFalse(mon_sess.raw_session().run(hook._stop_var))
+
+    hook = early_stopping._StopOnPredicateHook(
+        should_stop_fn=lambda: True, run_every_secs=0)
+    with ops.Graph().as_default():
+      training_util.create_global_step()
+      no_op = control_flow_ops.no_op()
+      with monitored_session.SingularMonitoredSession(hooks=[hook]) as mon_sess:
+        mon_sess.run(no_op)
+        self.assertTrue(mon_sess.should_stop())
+        self.assertTrue(mon_sess.raw_session().run(hook._stop_var))
+
+
+class CheckForStoppingHookTest(test.TestCase):
+
+  def test_stop(self):
+    hook = early_stopping._CheckForStoppingHook()
+    with ops.Graph().as_default():
+      no_op = control_flow_ops.no_op()
+      assign_op = state_ops.assign(early_stopping._get_or_create_stop_var(),
+                                   True)
+      with monitored_session.SingularMonitoredSession(hooks=[hook]) as mon_sess:
+        mon_sess.run(no_op)
+        self.assertFalse(mon_sess.should_stop())
+        mon_sess.run(assign_op)
+        self.assertTrue(mon_sess.should_stop())
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/estimator/python/estimator/head.py b/tensorflow/contrib/estimator/python/estimator/head.py
index c9d86ef4ab..34f765d565 100644
--- a/tensorflow/contrib/estimator/python/estimator/head.py
+++ b/tensorflow/contrib/estimator/python/estimator/head.py
@@ -943,20 +943,30 @@ class _MultiLabelHead(head_lib._Head):  # pylint:disable=protected-access
         class_probabilities = array_ops.slice(
             probabilities, begin=begin, size=size)
         class_labels = array_ops.slice(labels, begin=begin, size=size)
-        prob_key = keys.PROBABILITY_MEAN_AT_CLASS % class_id
+        if self._label_vocabulary is None:
+          prob_key = keys.PROBABILITY_MEAN_AT_CLASS % class_id
+        else:
+          prob_key = (
+              keys.PROBABILITY_MEAN_AT_NAME % self._label_vocabulary[class_id])
         metric_ops[head_lib._summary_key(self._name, prob_key)] = (  # pylint:disable=protected-access
             head_lib._predictions_mean(  # pylint:disable=protected-access
                 predictions=class_probabilities,
                 weights=weights,
                 name=prob_key))
-        auc_key = keys.AUC_AT_CLASS % class_id
+        if self._label_vocabulary is None:
+          auc_key = keys.AUC_AT_CLASS % class_id
+        else:
+          auc_key = keys.AUC_AT_NAME % self._label_vocabulary[class_id]
         metric_ops[head_lib._summary_key(self._name, auc_key)] = (  # pylint:disable=protected-access
             head_lib._auc(  # pylint:disable=protected-access
                 labels=class_labels,
                 predictions=class_probabilities,
                 weights=weights,
                 name=auc_key))
-        auc_pr_key = keys.AUC_PR_AT_CLASS % class_id
+        if self._label_vocabulary is None:
+          auc_pr_key = keys.AUC_PR_AT_CLASS % class_id
+        else:
+          auc_pr_key = keys.AUC_PR_AT_NAME % self._label_vocabulary[class_id]
         metric_ops[head_lib._summary_key(self._name, auc_pr_key)] = (  # pylint:disable=protected-access
             head_lib._auc(  # pylint:disable=protected-access
                 labels=class_labels,
diff --git a/tensorflow/contrib/estimator/python/estimator/head_test.py b/tensorflow/contrib/estimator/python/estimator/head_test.py
index 7b884402d4..2d367adb47 100644
--- a/tensorflow/contrib/estimator/python/estimator/head_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/head_test.py
@@ -694,12 +694,14 @@ class MultiLabelHead(test.TestCase):
         # this assert tests that the algorithm remains consistent.
         keys.AUC: 0.3333,
         keys.AUC_PR: 0.7639,
-        keys.PROBABILITY_MEAN_AT_CLASS % 0: np.sum(_sigmoid(logits[:, 0])) / 2.,
-        keys.AUC_AT_CLASS % 0: 0.,
-        keys.AUC_PR_AT_CLASS % 0: 1.,
-        keys.PROBABILITY_MEAN_AT_CLASS % 1: np.sum(_sigmoid(logits[:, 1])) / 2.,
-        keys.AUC_AT_CLASS % 1: 1.,
-        keys.AUC_PR_AT_CLASS % 1: 1.,
+        keys.PROBABILITY_MEAN_AT_NAME % 'a':
+            np.sum(_sigmoid(logits[:, 0])) / 2.,
+        keys.AUC_AT_NAME % 'a': 0.,
+        keys.AUC_PR_AT_NAME % 'a': 1.,
+        keys.PROBABILITY_MEAN_AT_NAME % 'b':
+            np.sum(_sigmoid(logits[:, 1])) / 2.,
+        keys.AUC_AT_NAME % 'b': 1.,
+        keys.AUC_PR_AT_NAME % 'b': 1.,
     }
 
     self._test_eval(
diff --git a/tensorflow/contrib/estimator/python/estimator/hooks.py b/tensorflow/contrib/estimator/python/estimator/hooks.py
index ddd6aa442f..caadafdfa6 100644
--- a/tensorflow/contrib/estimator/python/estimator/hooks.py
+++ b/tensorflow/contrib/estimator/python/estimator/hooks.py
@@ -189,7 +189,7 @@ class InMemoryEvaluatorHook(training.SessionRunHook):
         init_fn=feed_variables, copy_from_scaffold=self._scaffold)
 
     with self._graph.as_default():
-      return self._estimator._evaluate_run(
+      self._estimator._evaluate_run(
           checkpoint_path=None,
           scaffold=scaffold,
           update_op=self._update_op,
diff --git a/tensorflow/contrib/estimator/python/estimator/hooks_test.py b/tensorflow/contrib/estimator/python/estimator/hooks_test.py
index 95ae971852..ee88d5ecf5 100644
--- a/tensorflow/contrib/estimator/python/estimator/hooks_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/hooks_test.py
@@ -102,6 +102,7 @@ class InMemoryEvaluatorHookTest(test.TestCase):
     self.assertTrue(os.path.isdir(estimator.eval_dir()))
     step_keyword_to_value = summary_step_keyword_to_value_mapping(
         estimator.eval_dir())
+
     # 4.5 = sum(range(10))/10
     # before training
     self.assertEqual(4.5, step_keyword_to_value[0]['mean_of_features'])
@@ -110,6 +111,7 @@ class InMemoryEvaluatorHookTest(test.TestCase):
     self.assertEqual(4.5, step_keyword_to_value[8]['mean_of_features'])
     # end
     self.assertEqual(4.5, step_keyword_to_value[10]['mean_of_features'])
+    self.assertEqual(set([0, 4, 8, 10]), set(step_keyword_to_value.keys()))
 
   def test_uses_latest_variable_value(self):
 
diff --git a/tensorflow/contrib/estimator/python/estimator/linear.py b/tensorflow/contrib/estimator/python/estimator/linear.py
index b960b16f1b..62a37abefb 100644
--- a/tensorflow/contrib/estimator/python/estimator/linear.py
+++ b/tensorflow/contrib/estimator/python/estimator/linear.py
@@ -99,7 +99,8 @@ class LinearEstimator(estimator.Estimator):
                model_dir=None,
                optimizer='Ftrl',
                config=None,
-               partitioner=None):
+               partitioner=None,
+               sparse_combiner='sum'):
     """Initializes a `LinearEstimator` instance.
 
     Args:
@@ -116,6 +117,11 @@ class LinearEstimator(estimator.Estimator):
         callable. Defaults to FTRL optimizer.
       config: `RunConfig` object to configure the runtime settings.
       partitioner: Optional. Partitioner for input layer.
+      sparse_combiner: A string specifying how to reduce if a categorical column
+        is multivalent.  One of "mean", "sqrtn", and "sum" -- these are
+        effectively different ways to do example-level normalization, which can
+        be useful for bag-of-words features. for more details, see
+        @{tf.feature_column.linear_model$linear_model}.
     """
     def _model_fn(features, labels, mode, config):
       return linear_lib._linear_model_fn(  # pylint: disable=protected-access
@@ -126,6 +132,7 @@ class LinearEstimator(estimator.Estimator):
           feature_columns=tuple(feature_columns or []),
           optimizer=optimizer,
           partitioner=partitioner,
-          config=config)
+          config=config,
+          sparse_combiner=sparse_combiner)
     super(LinearEstimator, self).__init__(
         model_fn=_model_fn, model_dir=model_dir, config=config)
diff --git a/tensorflow/contrib/estimator/python/estimator/saved_model_estimator.py b/tensorflow/contrib/estimator/python/estimator/saved_model_estimator.py
new file mode 100644
index 0000000000..ce98e9987e
--- /dev/null
+++ b/tensorflow/contrib/estimator/python/estimator/saved_model_estimator.py
@@ -0,0 +1,449 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Class that creates an Estimator from a SavedModel."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import six
+
+from tensorflow.python.estimator import estimator as estimator_lib
+from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.estimator.export import export as export_lib
+from tensorflow.python.estimator.export import export_output
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.saved_model import constants
+from tensorflow.python.saved_model import loader_impl
+from tensorflow.python.saved_model import signature_constants
+from tensorflow.python.training import checkpoint_utils
+from tensorflow.python.training import monitored_session
+from tensorflow.python.training import training_util
+
+
+class SavedModelEstimator(estimator_lib.Estimator):
+  """Create an Estimator from a SavedModel.
+
+  Only SavedModels exported with
+  `tf.contrib.estimator.export_all_saved_models()` or
+  `tf.estimator.Estimator.export_savedmodel()` are supported for this class.
+
+  Example with `tf.estimator.DNNClassifier`:
+
+  **Step 1: Create and train DNNClassifier.**
+
+  ```python
+  feature1 = tf.feature_column.embedding_column(
+      tf.feature_column.categorical_column_with_vocabulary_list(
+          key='feature1', vocabulary_list=('green', 'yellow')), dimension=1)
+  feature2 = tf.feature_column.numeric_column(key='feature2', default_value=0.0)
+
+  classifier = tf.estimator.DNNClassifier(
+      hidden_units=[4,2], feature_columns=[feature1, feature2])
+
+  def input_fn():
+    features = {'feature1': tf.constant(['green', 'green', 'yellow']),
+                'feature2': tf.constant([3.5, 4.2, 6.1])}
+    label = tf.constant([1., 0., 0.])
+    return tf.data.Dataset.from_tensors((features, label)).repeat()
+
+  classifier.train(input_fn=input_fn, steps=10)
+  ```
+
+  **Step 2: Export classifier.**
+  First, build functions that specify the expected inputs.
+
+  ```python
+  # During train and evaluation, both the features and labels should be defined.
+  supervised_input_receiver_fn = (
+      tf.contrib.estimator.build_raw_supervised_input_receiver_fn(
+          {'feature1': tf.placeholder(dtype=tf.string, shape=[None]),
+           'feature2': tf.placeholder(dtype=tf.float32, shape=[None])},
+          tf.placeholder(dtype=tf.float32, shape=[None])))
+
+  # During predict mode, expect to receive a `tf.Example` proto, so a parsing
+  # function is used.
+  serving_input_receiver_fn = (
+      tf.estimator.export.build_parsing_serving_input_receiver_fn(
+          tf.feature_column.make_parse_example_spec([feature1, feature2])))
+  ```
+
+  Next, export the model as a SavedModel. A timestamped directory will be
+  created (for example `/tmp/export_all/1234567890`).
+
+  ```python
+  # Option 1: Save all modes (train, eval, predict)
+  export_dir = tf.contrib.estimator.export_all_saved_models(
+      classifier, '/tmp/export_all',
+      {tf.estimator.ModeKeys.TRAIN: supervised_input_receiver_fn,
+       tf.estimator.ModeKeys.EVAL: supervised_input_receiver_fn,
+       tf.estimator.ModeKeys.PREDICT: serving_input_receiver_fn})
+
+  # Option 2: Only export predict mode
+  export_dir = classifier.export_savedmodel(
+      '/tmp/export_predict', serving_input_receiver_fn)
+  ```
+
+  **Step 3: Create a SavedModelEstimator from the exported SavedModel.**
+
+  ```python
+  est = tf.contrib.estimator.SavedModelEstimator(export_dir)
+
+  # If all modes were exported, you can immediately evaluate and predict, or
+  # continue training. Otherwise only predict is available.
+  eval_results = est.evaluate(input_fn=input_fn, steps=1)
+  print(eval_results)
+
+  est.train(input_fn=input_fn, steps=20)
+
+  def predict_input_fn():
+    example = tf.train.Example()
+    example.features.feature['feature1'].bytes_list.value.extend(['yellow'])
+    example.features.feature['feature2'].float_list.value.extend([1.])
+    return {'inputs':tf.constant([example.SerializeToString()])}
+
+  predictions = est.predict(predict_input_fn)
+  print(next(predictions))
+  ```
+  """
+
+  def __init__(self, saved_model_dir, model_dir=None):
+    """Initialize a SavedModelEstimator.
+
+    The SavedModelEstimator loads its model function and variable values from
+    the graphs defined in the SavedModel. There is no option to pass in
+    `RunConfig` or `params` arguments, because the model function graph is
+    defined statically in the SavedModel.
+
+    Args:
+      saved_model_dir: Directory containing SavedModel protobuf and subfolders.
+      model_dir: Directory to save new checkpoints during training.
+
+    Raises:
+      NotImplementedError: If a DistributionStrategy is defined in the config.
+        Unless the SavedModelEstimator is subclassed, this shouldn't happen.
+    """
+    checkpoint = estimator_lib._get_saved_model_ckpt(saved_model_dir)  # pylint: disable=protected-access
+    vars_to_warm_start = [name for name, _ in
+                          checkpoint_utils.list_variables(checkpoint)]
+    warm_start_settings = estimator_lib.WarmStartSettings(
+        ckpt_to_initialize_from=checkpoint,
+        vars_to_warm_start=vars_to_warm_start)
+
+    super(SavedModelEstimator, self).__init__(
+        model_fn=self._model_fn_from_saved_model, model_dir=model_dir,
+        warm_start_from=warm_start_settings)
+    if self._train_distribution or self._eval_distribution:
+      raise NotImplementedError(
+          'SavedModelEstimator currently does not support '
+          'DistributionStrategy.')
+    self.saved_model_dir = saved_model_dir
+    self.saved_model_loader = loader_impl.SavedModelLoader(saved_model_dir)
+    self._available_modes = self._extract_available_modes()
+
+  def _extract_available_modes(self):
+    """Return list of modes found in SavedModel."""
+    available_modes = []
+    logging.info('Checking available modes for SavedModelEstimator.')
+    for mode in [model_fn_lib.ModeKeys.TRAIN, model_fn_lib.ModeKeys.EVAL,
+                 model_fn_lib.ModeKeys.PREDICT]:
+      try:
+        self._get_meta_graph_def_for_mode(mode)
+      except RuntimeError:
+        logging.warning('%s mode not found in SavedModel.' % mode)
+        continue
+
+      if self._get_signature_def_for_mode(mode) is not None:
+        available_modes.append(mode)
+
+    logging.info('Available modes for Estimator: %s' % available_modes)
+    return available_modes
+
+  def _validate_mode(self, mode):
+    """Make sure that mode can be run using the SavedModel."""
+    if mode not in self._available_modes:
+      raise RuntimeError('%s mode is not available in the SavedModel. Use '
+                         'saved_model_cli to check that the Metagraph for this '
+                         'mode has been exported.' % mode)
+
+  def _get_meta_graph_def_for_mode(self, mode):
+    tags = model_fn_lib.EXPORT_TAG_MAP[mode]
+    return self.saved_model_loader.get_meta_graph_def_from_tags(tags)
+
+  def _get_signature_def_for_mode(self, mode):
+    meta_graph_def = self._get_meta_graph_def_for_mode(mode)
+    sig_def_key = (signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
+                   if mode == model_fn_lib.ModeKeys.PREDICT else mode)
+    if sig_def_key not in meta_graph_def.signature_def:
+      logging.warning('Metagraph for mode %s was found, but SignatureDef with'
+                      ' key \"%s\" is missing.' % (mode, sig_def_key))
+      return None
+    return meta_graph_def.signature_def[sig_def_key]
+
+  def _create_and_assert_global_step(self, graph):
+    # Do nothing here. The global step variable will be created/loaded from the
+    # SavedModel. If a global step variable were created here, the result
+    # will be two duplicate global step variables, causing issues during
+    # the warm-start phase.
+    # Due to the global variable being created in the model function, this may
+    # cause issues when running DistributionStrategy. Thus, DistributionStrategy
+    # is not yet supported with SavedModelEstimator.
+    return None
+
+  def _model_fn_from_saved_model(self, features, labels, mode):
+    """Load a SavedModel graph and return an EstimatorSpec."""
+    # TODO(kathywu): Model function loads placeholders from the graph. Calling
+    # export_all_saved_models creates another placeholder for the inputs, on top
+    # of the original placeholders. There should be a way to avoid this.
+    self._validate_mode(mode)
+
+    g = ops.get_default_graph()
+    if  training_util.get_global_step(g) is not None:
+      raise RuntimeError(
+          'Graph must not contain a global step tensor before the SavedModel is'
+          ' loaded. Please make sure that the input function does not create a '
+          'global step.')
+
+    # Extract SignatureDef for information about the input and output tensors.
+    signature_def = self._get_signature_def_for_mode(mode)
+
+    # Generate input map for replacing the inputs in the SavedModel graph with
+    # the provided features and labels.
+    input_map = _generate_input_map(signature_def, features, labels)
+
+    # Create a list of the names of output tensors. When the graph is loaded,
+    # names of the output tensors may be remapped. This ensures that the correct
+    # tensors are returned in the EstimatorSpec.
+    output_tensor_names = [
+        value.name for value in six.itervalues(signature_def.outputs)]
+
+    # Load the graph. `output_tensors` contains output `Tensors` in the same
+    # same order as the `output_tensor_names` list.
+    tags = model_fn_lib.EXPORT_TAG_MAP[mode]
+    _, output_tensors = self.saved_model_loader.load_graph(
+        g, tags, input_map=input_map, return_elements=output_tensor_names)
+
+    # Create a scaffold from the MetaGraphDef that contains ops to initialize
+    # the graph. This should mirror the steps from _add_meta_graph_for_mode(),
+    # which creates a MetaGraphDef from the EstimatorSpec's scaffold.
+    scaffold = monitored_session.Scaffold(
+        local_init_op=loader_impl._get_main_op_tensor(  # pylint: disable=protected-access
+            self._get_meta_graph_def_for_mode(mode)))
+
+    # Ensure that a global step tensor has been created.
+    global_step_tensor = training_util.get_global_step(g)
+    training_util.assert_global_step(global_step_tensor)
+
+    # Extract values to return in the EstimatorSpec.
+    output_map = dict(zip(output_tensor_names, output_tensors))
+    outputs = {key: output_map[value.name]
+               for key, value in six.iteritems(signature_def.outputs)}
+
+    loss, predictions, metrics = _validate_and_extract_outputs(
+        mode, outputs, signature_def.method_name)
+
+    train_op = ops.get_collection(constants.TRAIN_OP_KEY)
+    if len(train_op) > 1:
+      raise RuntimeError('Multiple ops found in the train_op collection.')
+    train_op = None if not train_op else train_op[0]
+
+    _clear_saved_model_collections()
+    return model_fn_lib.EstimatorSpec(
+        scaffold=scaffold,
+        mode=mode,
+        loss=loss,
+        train_op=train_op,
+        predictions=predictions,
+        eval_metric_ops=metrics)
+
+
+def _clear_saved_model_collections():
+  """Clear collections that are expected empty when exporting a SavedModel.
+
+  The SavedModel builder uses these collections to track ops necessary to
+  restore the graph state. These collections are expected to be empty before
+  MetaGraphs are added to the builder.
+  """
+  del ops.get_collection_ref(constants.ASSETS_KEY)[:]
+  del ops.get_collection_ref(constants.LEGACY_INIT_OP_KEY)[:]
+  del ops.get_collection_ref(constants.MAIN_OP_KEY)[:]
+  del ops.get_collection_ref(constants.TRAIN_OP_KEY)[:]
+
+
+def _generate_input_map(signature_def, features, labels):
+  """Return dict mapping an input tensor name to a feature or label tensor.
+
+  Args:
+    signature_def: SignatureDef loaded from SavedModel
+    features: A `Tensor`, `SparseTensor`, or dict of string to `Tensor` or
+      `SparseTensor`, specifying the features to be passed to the model.
+    labels: A `Tensor`, `SparseTensor`, or dict of string to `Tensor` or
+      `SparseTensor`, specifying the labels to be passed to the model. May be
+      `None`.
+
+  Returns:
+    dict mapping string names of inputs to features or labels tensors
+
+  Raises:
+    ValueError: if SignatureDef inputs are not completely mapped by the input
+      features and labels.
+  """
+  # pylint: disable=protected-access
+  if not isinstance(features, dict):
+    features = {export_lib._SINGLE_FEATURE_DEFAULT_NAME: features}
+  if labels is not None and not isinstance(labels, dict):
+    labels = {export_lib._SINGLE_LABEL_DEFAULT_NAME: labels}
+  # pylint: enable=protected-access
+
+  inputs = signature_def.inputs
+  input_map = {}
+  for key, tensor_info in six.iteritems(inputs):
+    input_name = tensor_info.name
+    if ':' in input_name:
+      input_name = input_name[:input_name.find(':')]
+
+    # When tensors are used as control inputs for operations, their names are
+    # prepended with a '^' character in the GraphDef. To handle possible control
+    # flow edge cases, control input names must be included in the input map.
+    control_dependency_name = '^' + input_name
+
+    if key in features:
+      _check_same_dtype_and_shape(features[key], tensor_info, key)
+      input_map[input_name] = input_map[control_dependency_name] = features[key]
+    elif labels is not None and key in labels:
+      _check_same_dtype_and_shape(labels[key], tensor_info, key)
+      input_map[input_name] = input_map[control_dependency_name] = labels[key]
+    else:
+      raise ValueError(
+          'Key \"%s\" not found in features or labels passed in to the model '
+          'function. All required keys: %s' % (key, inputs.keys()))
+
+  return input_map
+
+
+def _check_same_dtype_and_shape(tensor, tensor_info, name):
+  """Validate that tensor has the same properties as the TensorInfo proto.
+
+  Args:
+    tensor: a `Tensor` object.
+    tensor_info: a `TensorInfo` proto.
+    name: Name of the input (to identify Tensor if an error is raised).
+
+  Raises:
+    ValueError: If the tensor shape or dtype don't match the TensorInfo
+  """
+  dtype_error = (tensor.dtype != dtypes.DType(tensor_info.dtype))
+  shape_error = not tensor.shape.is_compatible_with(tensor_info.tensor_shape)
+
+  if dtype_error or shape_error:
+    msg = 'Tensor shape and/or dtype validation failed for input %s:' % name
+    if dtype_error:
+      msg += ('\n\tExpected dtype: %s, Got: %s'
+              % (dtypes.DType(tensor_info.dtype), tensor.dtype))
+    if shape_error:
+      msg += ('\n\tExpected shape: %s, Got: %s'
+              % (tensor_shape.TensorShape(tensor_info.tensor_shape),
+                 tensor.shape))
+
+    raise ValueError(msg)
+
+
+def _extract_eval_metrics(output_dict):
+  """Return a eval metric dict extracted from the output_dict.
+
+  Eval metrics consist of a value tensor and an update op. Both must be in the
+  passed-in tensor dictionary for an eval metric to be added to the returned
+  dictionary.
+
+  Args:
+    output_dict: a dict that maps strings to tensors.
+
+  Returns:
+    dict mapping strings to (value, update_op) tuples.
+  """
+  # pylint: disable=protected-access
+  metric_ops = {}
+  separator_char = export_output._SupervisedOutput._SEPARATOR_CHAR
+
+  for key, tensor in six.iteritems(output_dict):
+    split_key = key.split(separator_char)
+
+    # The metric name may contain the separator character, so recreate its name.
+    metric_name = separator_char.join(split_key[:-1])
+
+    if split_key[0] == export_output._SupervisedOutput.METRICS_NAME:
+      # If the key ends with the value suffix, and there is a corresponding
+      # key ending with the update_op suffix, then add tensors to metrics dict.
+      if split_key[-1] == export_output._SupervisedOutput.METRIC_VALUE_SUFFIX:
+        update_op = ''.join(
+            [metric_name, separator_char,
+             export_output._SupervisedOutput.METRIC_UPDATE_SUFFIX])
+        if update_op in output_dict:
+          update_op_tensor = output_dict[update_op]
+          metric_ops[metric_name] = (tensor, update_op_tensor)
+
+  # pylint: enable=protected-access
+  return metric_ops
+
+
+def _validate_and_extract_outputs(mode, output_dict, method_name):
+  """Extract values from SignatureDef output dictionary.
+
+  Args:
+    mode: One of the modes enumerated in `tf.estimator.ModeKeys`.
+    output_dict: dict of string SignatureDef keys to `Tensor`.
+    method_name: Method name of the SignatureDef as a string.
+
+  Returns:
+    Tuple of (
+      loss: `Tensor` object,
+      predictions: dictionary mapping string keys to `Tensor` objects,
+      metrics: dictionary mapping string keys to a tuple of two `Tensor` objects
+    )
+
+  Raises:
+    RuntimeError: raised if SignatureDef has an invalid method name for the mode
+  """
+  # pylint: disable=protected-access
+  loss, predictions, metrics = None, None, None
+
+  if mode == model_fn_lib.ModeKeys.PREDICT:
+    predictions = output_dict
+  else:
+    # Validate that the SignatureDef's method name matches the expected name for
+    # the given mode.
+    expected_method_name = signature_constants.SUPERVISED_TRAIN_METHOD_NAME
+    if mode == model_fn_lib.ModeKeys.EVAL:
+      expected_method_name = signature_constants.SUPERVISED_EVAL_METHOD_NAME
+    if method_name != expected_method_name:
+      raise RuntimeError(
+          'Invalid SignatureDef method name for mode %s.\n\tExpected: %s\n\t'
+          'Got: %s\nPlease ensure that the SavedModel was exported with '
+          '`tf.contrib.estimator.export_all_saved_models()`.' %
+          (mode, expected_method_name, method_name))
+
+    # Extract loss, metrics and predictions from the output dict.
+    loss = output_dict[export_output._SupervisedOutput.LOSS_NAME]
+    metrics = _extract_eval_metrics(output_dict)
+    predictions = {
+        key: value for key, value in six.iteritems(output_dict)
+        if key.split(export_output._SupervisedOutput._SEPARATOR_CHAR)[0] == (
+            export_output._SupervisedOutput.PREDICTIONS_NAME)}
+
+  # pylint: enable=protected-access
+  return loss, predictions, metrics
diff --git a/tensorflow/contrib/estimator/python/estimator/saved_model_estimator_test.py b/tensorflow/contrib/estimator/python/estimator/saved_model_estimator_test.py
new file mode 100644
index 0000000000..718da1367c
--- /dev/null
+++ b/tensorflow/contrib/estimator/python/estimator/saved_model_estimator_test.py
@@ -0,0 +1,369 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for SavedModelEstimator."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import shutil
+import tempfile
+
+from tensorflow.contrib.estimator.python.estimator import export as contrib_export
+from tensorflow.contrib.estimator.python.estimator import saved_model_estimator
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.estimator import estimator
+from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.estimator.export import export
+from tensorflow.python.estimator.export import export_output
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import metrics as metrics_lib
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.python.training import monitored_session
+from tensorflow.python.training import training
+
+
+def dummy_input_fn():
+  return dataset_ops.Dataset.from_tensors((
+      {'x': constant_op.constant([[1], [-2]], dtype=dtypes.int64)},
+      constant_op.constant([[4], [-3]], dtype=dtypes.float32))).repeat()
+
+
+def dummy_input_fn_features_only():
+  return dataset_ops.Dataset.from_tensors(
+      {'x': constant_op.constant([[5], [6]], dtype=dtypes.int64)}).repeat()
+
+
+def dummy_supervised_receiver_fn():
+  feature_spec = {
+      'x': array_ops.placeholder(
+          dtype=dtypes.int64, shape=(2, 1), name='feature_x'),
+      }
+  label_spec = array_ops.placeholder(
+      dtype=dtypes.float32, shape=[2, 1], name='truth')
+  return export.build_raw_supervised_input_receiver_fn(
+      feature_spec, label_spec)
+
+
+def dummy_serving_receiver_fn():
+  feature_spec = {'x': array_ops.placeholder(
+      dtype=dtypes.int64, shape=(2, 1), name='feature_x'),}
+  return export.build_raw_serving_input_receiver_fn(feature_spec)
+
+
+def model_fn_diff_modes(features, labels, mode):
+  _, _ = features, labels
+  v = variables.Variable(21, name='some_var')
+  train_op = None
+  loss = constant_op.constant(104)
+  if mode == model_fn_lib.ModeKeys.TRAIN:
+    loss = constant_op.constant(105)
+    predictions = constant_op.constant([501])
+    train_op = control_flow_ops.group(
+        state_ops.assign_add(training.get_global_step(), 1),
+        state_ops.assign_add(v, 3))
+  elif mode == model_fn_lib.ModeKeys.EVAL:
+    loss = constant_op.constant(106)
+    predictions = constant_op.constant([502])
+  else:
+    loss = constant_op.constant(107)
+    predictions = constant_op.constant([503])
+  return model_fn_lib.EstimatorSpec(
+      mode,
+      loss=loss,
+      train_op=train_op,
+      eval_metric_ops={
+          'abs_err': metrics_lib.mean_absolute_error(
+              constant_op.constant(0), predictions)},
+      predictions=predictions)
+
+
+class SavedModelEstimatorTest(test.TestCase):
+
+  def setUp(self):
+    self.tmpdirs = []
+
+  def tearDown(self):
+    for tmpdir in self.tmpdirs:
+      # gfile.DeleteRecursively fails in the windows cmake test, so use shutil.
+      shutil.rmtree(tmpdir, ignore_errors=True)
+    self.tmpdirs = []
+
+  def _get_tmp_dir(self):
+    tmpdir = tempfile.mkdtemp()
+    self.tmpdirs.append(tmpdir)
+    return tmpdir
+
+  def _export_estimator(self, train=True, evaluate=True, predict=True,
+                        model_fn=model_fn_diff_modes):
+    est = estimator.Estimator(model_fn, self._get_tmp_dir())
+    est.train(input_fn=dummy_input_fn, steps=10)
+
+    input_receiver_fn_map = {}
+    if train:
+      input_receiver_fn_map[model_fn_lib.ModeKeys.TRAIN] = (
+          dummy_supervised_receiver_fn())
+    if evaluate:
+      input_receiver_fn_map[model_fn_lib.ModeKeys.EVAL] = (
+          dummy_supervised_receiver_fn())
+    if predict:
+      input_receiver_fn_map[model_fn_lib.ModeKeys.PREDICT] = (
+          dummy_serving_receiver_fn())
+
+    export_base_path = self._get_tmp_dir()
+    export_dir = contrib_export.export_all_saved_models(
+        est, export_base_path, input_receiver_fn_map)
+    return export_dir
+
+  def test_load_all_modes(self):
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(), self._get_tmp_dir())
+    sme.train(input_fn=dummy_input_fn, steps=1)
+    sme.train(input_fn=dummy_input_fn, steps=2)
+    self.assertEqual(13, sme.get_variable_value('global_step'))
+    self.assertEqual(60, sme.get_variable_value('some_var'))
+
+    eval_results = sme.evaluate(dummy_input_fn, steps=5)
+
+    self.assertEqual(13, eval_results['global_step'])
+    self.assertEqual(106, eval_results['loss'])
+    self.assertEqual(502, eval_results['metrics/abs_err'])
+
+    predictions = next(sme.predict(dummy_input_fn_features_only))
+    self.assertDictEqual({'output': 503}, predictions)
+
+  def test_load_all_modes_no_train(self):
+    """Ensure that all functions can be used without requiring a ckpt."""
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(), self._get_tmp_dir())
+    eval_results = sme.evaluate(dummy_input_fn, steps=5)
+    self.assertEqual(10, eval_results['global_step'])
+    self.assertEqual(106, eval_results['loss'])
+    self.assertEqual(502, eval_results['metrics/abs_err'])
+
+    predictions = next(sme.predict(dummy_input_fn_features_only))
+    self.assertDictEqual({'output': 503}, predictions)
+
+  def test_partial_exported_estimator(self):
+    sme1 = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(train=False, predict=False), self._get_tmp_dir())
+    sme1.evaluate(dummy_input_fn, steps=5)
+    with self.assertRaisesRegexp(RuntimeError, 'train mode is not available'):
+      sme1.train(input_fn=dummy_input_fn, steps=1)
+    with self.assertRaisesRegexp(RuntimeError, 'infer mode is not available'):
+      next(sme1.predict(dummy_input_fn_features_only))
+
+    sme2 = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(evaluate=False), self._get_tmp_dir())
+    sme2.train(input_fn=dummy_input_fn, steps=1)
+    next(sme2.predict(dummy_input_fn_features_only))
+    with self.assertRaisesRegexp(RuntimeError, 'eval mode is not available'):
+      sme2.evaluate(dummy_input_fn, steps=5)
+
+  def test_with_incorrect_input(self):
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(), self._get_tmp_dir())
+
+    def bad_shape_input_fn():
+      return dataset_ops.Dataset.from_tensors((
+          {'x': constant_op.constant([1, 2], dtype=dtypes.int64)},
+          constant_op.constant([1, 2], dtype=dtypes.float32)))
+
+    with self.assertRaisesRegexp(ValueError, 'Expected shape'):
+      sme.train(bad_shape_input_fn, steps=1)
+
+    def bad_dtype_input_fn():
+      return dataset_ops.Dataset.from_tensors((
+          {'x': constant_op.constant([[1], [1]], dtype=dtypes.int32)},
+          constant_op.constant([[1], [1]], dtype=dtypes.int64)))
+
+    with self.assertRaisesRegexp(ValueError, 'Expected dtype'):
+      sme.train(bad_dtype_input_fn, steps=1)
+
+  def test_input_fn_with_global_step(self):
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(), self._get_tmp_dir())
+
+    def bad_input_fn():
+      training.get_or_create_global_step()
+      return dataset_ops.Dataset.from_tensors((
+          {'x': constant_op.constant([[1], [1]], dtype=dtypes.int64)},
+          constant_op.constant([[1], [1]], dtype=dtypes.float32)))
+
+    with self.assertRaisesRegexp(RuntimeError,
+                                 'Graph must not contain a global step tensor'):
+      sme.train(bad_input_fn, steps=1)
+
+  def test_re_export_saved_model_serving_only(self):
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(), self._get_tmp_dir())
+    sme.train(dummy_input_fn, steps=3)
+    self.assertEqual(13, sme.get_variable_value('global_step'))
+    self.assertEqual(60, sme.get_variable_value('some_var'))
+
+    predictions = next(sme.predict(dummy_input_fn_features_only))
+    self.assertDictEqual({'output': 503}, predictions)
+
+    # Export SavedModel, and test that the variable and prediction values are
+    # the same.
+    sme_export_dir = sme.export_savedmodel(
+        self._get_tmp_dir(), dummy_serving_receiver_fn())
+
+    sme2 = saved_model_estimator.SavedModelEstimator(
+        sme_export_dir, self._get_tmp_dir())
+    self.assertEqual(60, sme.get_variable_value('some_var'))
+    self.assertEqual(13, sme.get_variable_value('global_step'))
+
+    predictions = next(sme2.predict(dummy_input_fn_features_only))
+    self.assertDictEqual({'output': 503}, predictions)
+
+  def test_re_export_saved_model(self):
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(), self._get_tmp_dir())
+    self.assertDictEqual(
+        {'loss': 106, 'metrics/abs_err': 502, 'global_step': 10},
+        sme.evaluate(dummy_input_fn, steps=1))
+
+    sme.train(dummy_input_fn, steps=3)
+    self.assertDictEqual(
+        {'loss': 106, 'metrics/abs_err': 502, 'global_step': 13},
+        sme.evaluate(dummy_input_fn, steps=1))
+    self.assertEqual(60, sme.get_variable_value('some_var'))
+
+    predictions = next(sme.predict(dummy_input_fn_features_only))
+    self.assertDictEqual({'output': 503}, predictions)
+
+    # Export SavedModel for all modes
+    input_receiver_fn_map = {
+        model_fn_lib.ModeKeys.TRAIN: dummy_supervised_receiver_fn(),
+        model_fn_lib.ModeKeys.EVAL: dummy_supervised_receiver_fn(),
+        model_fn_lib.ModeKeys.PREDICT: dummy_serving_receiver_fn()}
+    sme_export_dir = contrib_export.export_all_saved_models(
+        sme, self._get_tmp_dir(), input_receiver_fn_map)
+
+    sme2 = saved_model_estimator.SavedModelEstimator(
+        sme_export_dir, self._get_tmp_dir())
+    self.assertDictEqual(
+        {'loss': 106, 'metrics/abs_err': 502, 'global_step': 13},
+        sme.evaluate(dummy_input_fn, steps=1))
+    self.assertEqual(60, sme.get_variable_value('some_var'))
+
+    sme.train(dummy_input_fn, steps=7)
+    self.assertEqual(20, sme.get_variable_value('global_step'))
+
+    predictions = next(sme2.predict(dummy_input_fn_features_only))
+    self.assertDictEqual({'output': 503}, predictions)
+
+  def test_load_saved_model_from_serving_only(self):
+    def model_fn(features, labels, mode):
+      _, _ = features, labels
+      return model_fn_lib.EstimatorSpec(
+          mode,
+          loss=constant_op.constant([103]),
+          train_op=state_ops.assign_add(training.get_global_step(), 1),
+          predictions=constant_op.constant([502]),
+          export_outputs={'test': export_output.ClassificationOutput(
+              constant_op.constant([[32.]]))})
+
+    est = estimator.Estimator(model_fn, self._get_tmp_dir())
+    est.train(input_fn=dummy_input_fn, steps=10)
+
+    def serving_input_receiver_fn():
+      return export.ServingInputReceiver(
+          {'test-features': constant_op.constant([[1], [1]])},
+          array_ops.placeholder(dtype=dtypes.string))
+
+    export_dir = est.export_savedmodel(
+        self._get_tmp_dir(), serving_input_receiver_fn)
+
+    sme = saved_model_estimator.SavedModelEstimator(
+        export_dir, self._get_tmp_dir())
+
+    def input_fn():
+      return {'inputs': constant_op.constant('someinputstr')}
+
+    prediction = next(sme.predict(input_fn))
+    self.assertDictEqual({'scores': 32}, prediction)
+
+  def test_with_local_init_op(self):
+    def model_fn(features, labels, mode):
+      _, _ = features, labels
+      v = variables.Variable(21, name='some_var')
+      scaffold = monitored_session.Scaffold(
+          local_init_op=state_ops.assign_add(v, -3).op
+      )
+      return model_fn_lib.EstimatorSpec(
+          mode,
+          scaffold=scaffold,
+          train_op=state_ops.assign_add(training.get_global_step(), 1),
+          loss=array_ops.identity(v))
+    export_dir = self._export_estimator(predict=False, model_fn=model_fn)
+    sme = saved_model_estimator.SavedModelEstimator(
+        export_dir, self._get_tmp_dir())
+
+    eval_results1 = sme.evaluate(dummy_input_fn, steps=2)
+    self.assertEqual(15, eval_results1['loss'])
+
+    sme.train(dummy_input_fn, steps=1)
+    self.assertEqual(15, sme.get_variable_value('some_var'))
+
+    eval_results2 = sme.evaluate(dummy_input_fn, steps=5)
+    self.assertEqual(12, eval_results2['loss'])
+
+  def test_with_working_input_fn(self):
+    def model_fn(features, labels, mode):
+      loss = None
+      if labels is not None:
+        loss = labels[0][0] + labels[1][0]
+      return model_fn_lib.EstimatorSpec(
+          mode,
+          loss=loss,
+          train_op=state_ops.assign_add(training.get_global_step(), 1),
+          predictions={'features_0': array_ops.identity([features['x'][0][0]]),
+                       'features_1': array_ops.identity([features['x'][1][0]])})
+
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(model_fn=model_fn), self._get_tmp_dir())
+    eval_results = sme.evaluate(dummy_input_fn, steps=1)
+    self.assertEqual(1, eval_results['loss'])
+
+    predictions = next(sme.predict(dummy_input_fn_features_only))
+    self.assertDictEqual({'features_0': 5, 'features_1': 6}, predictions)
+
+  def test_control_dependency(self):
+    # Control dependencies are saved with "^" appended to the start of the input
+    # name. The input map must include control dependencies as well.
+    def model_fn(features, labels, mode):
+      _ = labels
+      with ops.control_dependencies([features['x']]):
+        loss = features['x'][1][0]
+      return model_fn_lib.EstimatorSpec(
+          mode,
+          loss=loss,
+          train_op=state_ops.assign_add(training.get_global_step(), 1))
+    sme = saved_model_estimator.SavedModelEstimator(
+        self._export_estimator(train=False, predict=False, model_fn=model_fn),
+        self._get_tmp_dir())
+    sme.evaluate(dummy_input_fn, steps=1)  # Should run without error
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/framework/__init__.py b/tensorflow/contrib/framework/__init__.py
index dc49383c5c..918a7e2bc7 100644
--- a/tensorflow/contrib/framework/__init__.py
+++ b/tensorflow/contrib/framework/__init__.py
@@ -133,6 +133,7 @@ _nest_allowed_symbols = [
     'flatten_dict_items',
     'pack_sequence_as',
     'map_structure',
+    'map_structure_with_paths',
     'assert_shallow_structure',
     'flatten_up_to',
     'map_structure_up_to',
diff --git a/tensorflow/contrib/framework/python/framework/checkpoint_utils.py b/tensorflow/contrib/framework/python/framework/checkpoint_utils.py
index 9e356dd965..e7184a01fb 100644
--- a/tensorflow/contrib/framework/python/framework/checkpoint_utils.py
+++ b/tensorflow/contrib/framework/python/framework/checkpoint_utils.py
@@ -27,7 +27,7 @@ from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
-from tensorflow.python.training import saver
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import training as train
 
 __all__ = [
@@ -40,7 +40,7 @@ __all__ = [
 def _get_checkpoint_filename(filepattern):
   """Returns checkpoint filename given directory or specific filepattern."""
   if gfile.IsDirectory(filepattern):
-    return saver.latest_checkpoint(filepattern)
+    return checkpoint_management.latest_checkpoint(filepattern)
   return filepattern
 
 
diff --git a/tensorflow/contrib/framework/python/ops/variables.py b/tensorflow/contrib/framework/python/ops/variables.py
index e8e3180019..322d5c335e 100644
--- a/tensorflow/contrib/framework/python/ops/variables.py
+++ b/tensorflow/contrib/framework/python/ops/variables.py
@@ -34,6 +34,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import resource_loader
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import saver as tf_saver
@@ -199,10 +200,20 @@ def global_variable(initial_value,
 
 
 @contrib_add_arg_scope
-def variable(name, shape=None, dtype=None, initializer=None,
-             regularizer=None, trainable=True, collections=None,
-             caching_device=None, device=None,
-             partitioner=None, custom_getter=None, use_resource=None):
+def variable(name,
+             shape=None,
+             dtype=None,
+             initializer=None,
+             regularizer=None,
+             trainable=True,
+             collections=None,
+             caching_device=None,
+             device=None,
+             partitioner=None,
+             custom_getter=None,
+             use_resource=None,
+             synchronization=variables.VariableSynchronization.AUTO,
+             aggregation=variables.VariableAggregation.NONE):
   """Gets an existing variable with these parameters or creates a new one.
 
   Args:
@@ -228,6 +239,15 @@ def variable(name, shape=None, dtype=None, initializer=None,
     custom_getter: Callable that allows overwriting the internal
       get_variable method and has to have the same signature.
     use_resource: If `True` use a ResourceVariable instead of a Variable.
+    synchronization: Indicates when a distributed a variable will be
+      aggregated. Accepted values are constants defined in the class
+      @{tf.VariableSynchronization}. By default the synchronization is set to
+      `AUTO` and the current `DistributionStrategy` chooses
+      when to synchronize. If `synchronization` is set to `ON_READ`,
+      `trainable` must not be set to `True`.
+    aggregation: Indicates how a distributed variable will be aggregated.
+      Accepted values are constants defined in the class
+      @{tf.VariableAggregation}.
 
   Returns:
     The created or existing variable.
@@ -242,21 +262,36 @@ def variable(name, shape=None, dtype=None, initializer=None,
     getter = functools.partial(custom_getter,
                                reuse=variable_scope.get_variable_scope().reuse)
   with ops.device(device or ''):
-    return getter(name, shape=shape, dtype=dtype,
-                  initializer=initializer,
-                  regularizer=regularizer,
-                  trainable=trainable,
-                  collections=collections,
-                  caching_device=caching_device,
-                  partitioner=partitioner,
-                  use_resource=use_resource)
+    return getter(
+        name,
+        shape=shape,
+        dtype=dtype,
+        initializer=initializer,
+        regularizer=regularizer,
+        trainable=trainable,
+        collections=collections,
+        caching_device=caching_device,
+        partitioner=partitioner,
+        use_resource=use_resource,
+        synchronization=synchronization,
+        aggregation=aggregation)
 
 
 @contrib_add_arg_scope
-def model_variable(name, shape=None, dtype=dtypes.float32, initializer=None,
-                   regularizer=None, trainable=True, collections=None,
-                   caching_device=None, device=None, partitioner=None,
-                   custom_getter=None, use_resource=None):
+def model_variable(name,
+                   shape=None,
+                   dtype=dtypes.float32,
+                   initializer=None,
+                   regularizer=None,
+                   trainable=True,
+                   collections=None,
+                   caching_device=None,
+                   device=None,
+                   partitioner=None,
+                   custom_getter=None,
+                   use_resource=None,
+                   synchronization=variables.VariableSynchronization.AUTO,
+                   aggregation=variables.VariableAggregation.NONE):
   """Gets an existing model variable with these parameters or creates a new one.
 
   Args:
@@ -283,18 +318,36 @@ def model_variable(name, shape=None, dtype=dtypes.float32, initializer=None,
     custom_getter: Callable that allows overwriting the internal
       get_variable method and has to have the same signature.
     use_resource: If `True` use a ResourceVariable instead of a Variable.
+    synchronization: Indicates when a distributed a variable will be
+      aggregated. Accepted values are constants defined in the class
+      @{tf.VariableSynchronization}. By default the synchronization is set to
+      `AUTO` and the current `DistributionStrategy` chooses
+      when to synchronize. If `synchronization` is set to `ON_READ`,
+      `trainable` must not be set to `True`.
+    aggregation: Indicates how a distributed variable will be aggregated.
+      Accepted values are constants defined in the class
+      @{tf.VariableAggregation}.
 
   Returns:
     The created or existing variable.
   """
   collections = list(collections or [])
   collections += [ops.GraphKeys.GLOBAL_VARIABLES, ops.GraphKeys.MODEL_VARIABLES]
-  var = variable(name, shape=shape, dtype=dtype,
-                 initializer=initializer, regularizer=regularizer,
-                 trainable=trainable, collections=collections,
-                 caching_device=caching_device, device=device,
-                 partitioner=partitioner, custom_getter=custom_getter,
-                 use_resource=use_resource)
+  var = variable(
+      name,
+      shape=shape,
+      dtype=dtype,
+      initializer=initializer,
+      regularizer=regularizer,
+      trainable=trainable,
+      collections=collections,
+      caching_device=caching_device,
+      device=device,
+      partitioner=partitioner,
+      custom_getter=custom_getter,
+      use_resource=use_resource,
+      synchronization=synchronization,
+      aggregation=aggregation)
   return var
 
 
diff --git a/tensorflow/contrib/framework/python/ops/variables_test.py b/tensorflow/contrib/framework/python/ops/variables_test.py
index 7e0c7dbec1..3c44630a51 100644
--- a/tensorflow/contrib/framework/python/ops/variables_test.py
+++ b/tensorflow/contrib/framework/python/ops/variables_test.py
@@ -106,8 +106,9 @@ class LocalVariableTest(test.TestCase):
   def testResourceVariable(self):
     a = variables_lib2.local_variable(0)
     b = variables_lib2.local_variable(0, use_resource=True)
-    self.assertEqual(type(a), variables_lib.Variable)
-    self.assertEqual(type(b), resource_variable_ops.ResourceVariable)
+    self.assertTrue(isinstance(a, variables_lib.Variable))
+    self.assertFalse(isinstance(a, resource_variable_ops.ResourceVariable))
+    self.assertTrue(isinstance(b, resource_variable_ops.ResourceVariable))
 
 
 class GlobalVariableTest(test.TestCase):
@@ -176,8 +177,9 @@ class GlobalVariableTest(test.TestCase):
   def testResourceVariable(self):
     a = variables_lib2.global_variable(0)
     b = variables_lib2.global_variable(0, use_resource=True)
-    self.assertEqual(type(a), variables_lib.Variable)
-    self.assertEqual(type(b), resource_variable_ops.ResourceVariable)
+    self.assertTrue(isinstance(a, variables_lib.Variable))
+    self.assertFalse(isinstance(a, resource_variable_ops.ResourceVariable))
+    self.assertTrue(isinstance(b, resource_variable_ops.ResourceVariable))
 
 
 class GlobalStepTest(test.TestCase):
diff --git a/tensorflow/contrib/fused_conv/kernels/fused_conv2d_bias_activation_op.cc b/tensorflow/contrib/fused_conv/kernels/fused_conv2d_bias_activation_op.cc
index 2458f7554a..0ccb4583ab 100644
--- a/tensorflow/contrib/fused_conv/kernels/fused_conv2d_bias_activation_op.cc
+++ b/tensorflow/contrib/fused_conv/kernels/fused_conv2d_bias_activation_op.cc
@@ -135,9 +135,12 @@ class FusedConv2DBiasActivationOp : public OpKernel {
                    context->GetAttr("activation_mode", &activation_mode_str));
     OP_REQUIRES_OK(context, GetActivationModeFromString(activation_mode_str,
                                                         &activation_mode_));
-    OP_REQUIRES(context, activation_mode_ == ActivationMode::RELU,
-                errors::InvalidArgument("Current implementation only supports "
-                                        "RELU as the activation function."));
+    OP_REQUIRES(context,
+                activation_mode_ == ActivationMode::RELU ||
+                    activation_mode_ == ActivationMode::NONE,
+                errors::InvalidArgument(
+                    "Current implementation only supports RELU or NONE "
+                    "as the activation function."));
     cudnn_use_autotune_ = CudnnUseAutotune();
   }
 
@@ -440,6 +443,8 @@ void LaunchFusedConv2DBiasActivationOp<GPUDevice, T, BiasType, ScaleType>::
                                          : dnn::DataLayout::kBatchDepthYX;
   constexpr auto filter_layout = is_int8x4 ? dnn::FilterLayout::kOutputInputYX4
                                            : dnn::FilterLayout::kOutputInputYX;
+  constexpr auto compute_data_format =
+      is_int8x4 ? FORMAT_NCHW_VECT_C : FORMAT_NCHW;
 
   dnn::BatchDescriptor conv_input_desc;
   conv_input_desc.set_count(batch_size)
@@ -526,6 +531,7 @@ void LaunchFusedConv2DBiasActivationOp<GPUDevice, T, BiasType, ScaleType>::
       batch_size,
       conv_input_depth,
       {{conv_input_rows, conv_input_cols}},
+      compute_data_format,
       output_depth,
       {{filter_rows, filter_cols}},
       // TODO(yangzihao): Add support for arbitrary dilations for fused conv.
@@ -538,6 +544,18 @@ void LaunchFusedConv2DBiasActivationOp<GPUDevice, T, BiasType, ScaleType>::
       activation_mode,
   };
 
+  dnn::ActivationMode dnn_activation_mode;
+  switch (activation_mode) {
+    case ActivationMode::NONE:
+      dnn_activation_mode = dnn::ActivationMode::kNone;
+      break;
+    case ActivationMode::RELU:
+      dnn_activation_mode = dnn::ActivationMode::kRelu;
+      break;
+    default:
+      LOG(FATAL) << "Activation mode " << activation_mode << " not supported";
+  }
+
   dnn::AlgorithmConfig algorithm_config;
   if (cudnn_use_autotune && !AutoTuneConvBiasActivation::GetInstance()->Find(
                                 fused_conv_parameters, &algorithm_config)) {
@@ -558,10 +576,9 @@ void LaunchFusedConv2DBiasActivationOp<GPUDevice, T, BiasType, ScaleType>::
               ->ThenFusedConvolveWithAlgorithm(
                   conv_input_desc, conv_input_ptr, conv_input_scale,
                   filter_desc, filter_ptr, conv_desc, side_input_ptr,
-                  side_input_scale, bias_desc, bias_ptr,
-                  dnn::ActivationMode::kRelu, output_desc, &output_ptr,
-                  &scratch_allocator, dnn::AlgorithmConfig(profile_algorithm),
-                  &profile_result)
+                  side_input_scale, bias_desc, bias_ptr, dnn_activation_mode,
+                  output_desc, &output_ptr, &scratch_allocator,
+                  dnn::AlgorithmConfig(profile_algorithm), &profile_result)
               .ok();
       if (cudnn_launch_status) {
         if (profile_result.is_valid()) {
@@ -597,7 +614,7 @@ void LaunchFusedConv2DBiasActivationOp<GPUDevice, T, BiasType, ScaleType>::
           ->ThenFusedConvolveWithAlgorithm(
               conv_input_desc, conv_input_ptr, conv_input_scale, filter_desc,
               filter_ptr, conv_desc, side_input_ptr, side_input_scale,
-              bias_desc, bias_ptr, dnn::ActivationMode::kRelu, output_desc,
+              bias_desc, bias_ptr, dnn_activation_mode, output_desc,
               &output_ptr, &scratch_allocator, algorithm_config,
               /*output_profile_result=*/nullptr)
           .ok();
diff --git a/tensorflow/contrib/fused_conv/kernels/fused_conv_ops_gpu.h b/tensorflow/contrib/fused_conv/kernels/fused_conv_ops_gpu.h
index ba52697679..b9c131a2e9 100644
--- a/tensorflow/contrib/fused_conv/kernels/fused_conv_ops_gpu.h
+++ b/tensorflow/contrib/fused_conv/kernels/fused_conv_ops_gpu.h
@@ -29,13 +29,13 @@ namespace tensorflow {
 class FusedConvParameters : public ConvParameters {
  public:
   FusedConvParameters(int64 batch, int64 in_depths, const SpatialArray& in,
-                      int64 out_depths, const SpatialArray& filter,
-                      const SpatialArray& dilation, const SpatialArray& stride,
-                      const SpatialArray& padding, DataType dtype,
-                      int device_id, bool has_side_input,
+                      TensorFormat data_format, int64 out_depths,
+                      const SpatialArray& filter, const SpatialArray& dilation,
+                      const SpatialArray& stride, const SpatialArray& padding,
+                      DataType dtype, int device_id, bool has_side_input,
                       ActivationMode activation_mode)
-      : ConvParameters(batch, in_depths, in, out_depths, filter, dilation,
-                       stride, padding, dtype, device_id),
+      : ConvParameters(batch, in_depths, in, data_format, out_depths, filter,
+                       dilation, stride, padding, dtype, device_id),
         activation_mode_(activation_mode),
         has_side_input_(has_side_input) {
     hash_code_ = Hash64Combine(hash_code_, has_side_input);
diff --git a/tensorflow/contrib/fused_conv/ops/fused_conv2d_bias_activation_op.cc b/tensorflow/contrib/fused_conv/ops/fused_conv2d_bias_activation_op.cc
index bafd1d5941..410571f378 100644
--- a/tensorflow/contrib/fused_conv/ops/fused_conv2d_bias_activation_op.cc
+++ b/tensorflow/contrib/fused_conv/ops/fused_conv2d_bias_activation_op.cc
@@ -44,7 +44,7 @@ REGISTER_OP("FusedConv2DBiasActivation")
     .Attr(GetPaddingAttrString())
     .Attr("data_format: {'NHWC', 'NCHW', 'NCHW_VECT_C'} = 'NHWC'")
     .Attr("filter_format: {'HWIO', 'OIHW', 'OIHW_VECT_I'} = 'HWIO'")
-    .Attr("activation_mode: {'Relu'} = 'Relu'")
+    .Attr("activation_mode: {'Relu', 'None'} = 'Relu'")
     .Attr("dilations: list(int) = [1, 1, 1, 1]")
     .SetShapeFn([](shape_inference::InferenceContext* c) {
       using shape_inference::ShapeHandle;
@@ -144,7 +144,7 @@ REGISTER_OP("FusedConv2DBiasActivation")
             `qint8 [ output_channels, input_channels / 4,
                      kernel_height, kernel_width, input_channels % 4 ]`
     activation_mode: The activation applied to the output.
-        Currently must be "Relu".
+        Must be "Relu" or "None".
     dilations: 1-D tensor of length 4.  The dilation factor for each dimension
         of `input`. If set to k > 1, there will be k-1 skipped cells between
         each filter element on that dimension. The dimension order is determined
diff --git a/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op.py b/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op.py
index 983b6dc8e5..cdc07b935d 100644
--- a/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op.py
+++ b/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op.py
@@ -66,8 +66,10 @@ def fused_conv2d_bias_activation(conv_input,
         This is optional and defaults to 0.
     side_input: A `Tensor` of the format specified by `data_format`.
         This is useful for implementing ResNet blocks.
-    activation_mode: (optional) currently must be the default "Relu".
-        Note that in qint8 mode, it also clips to 127, so acts like ReluX.
+    activation_mode: (optional) currently supports the default "Relu", or
+        "None" activation function.
+        Note: in qint8 mode, "None" actually clips to the range [-128, 127],
+        while "Relu" clips to the range [0, 127].
     data_format: Specifies the data format.
         Possible values are:
         "NHWC" float [batch, height, width, channels]
diff --git a/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py b/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py
index 4d62ac65ff..0185ef662c 100644
--- a/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py
+++ b/tensorflow/contrib/fused_conv/python/ops/fused_conv2d_bias_activation_op_test.py
@@ -622,7 +622,7 @@ def HwioToOihw(in_tensor):
 
 def SimulateFusedConv2dBiasActivationInt8(conv_input_scale, conv_input, kernel,
                                           padding, strides, side_input_scale,
-                                          side_input, biases):
+                                          side_input, biases, apply_relu):
   """Simulates the int8 fused 2-D convolution op using separate float ops.
 
     The arguments and return values have the same format, meanings and
@@ -636,6 +636,9 @@ def SimulateFusedConv2dBiasActivationInt8(conv_input_scale, conv_input, kernel,
     side_input_scale: A scalar 'float'.
     side_input: A `Tensor` of type `qint8` in NCHW_VECT_C layout.
     biases: A `Tensor` of type `float32` in NCHW layout.
+    apply_relu: A boolean to specify whether to apply "Relu" activation function
+      that clips outputs to the range [0, 127], or "None" activation that clips
+      to the range [-128, 127].
   Returns:
     A `Tensor` of type `qint8` in NCHW_VECT_C layout.
   """
@@ -649,10 +652,12 @@ def SimulateFusedConv2dBiasActivationInt8(conv_input_scale, conv_input, kernel,
   conv_and_side_inputs = conv_result + side_input_scale * NchwVectCToNchw(
       gen_array_ops.dequantize(side_input, -128, 127))
 
-  logit = nn_ops.bias_add(conv_and_side_inputs, biases, data_format="NCHW")
+  output = nn_ops.bias_add(conv_and_side_inputs, biases, data_format="NCHW")
+  if apply_relu:
+    output = nn_ops.relu(output)
 
   result, _, _ = gen_array_ops.quantize_v2(
-      NchwToNchwVectC(nn_ops.relu(logit)), -128, 127, dtypes.qint8)
+      NchwToNchwVectC(output), -128, 127, dtypes.qint8)
   return result
 
 
@@ -795,7 +800,7 @@ class FusedConvInt8Tests(test.TestCase):
       },
   ]
 
-  def runTest(self, test_param):
+  def runTest(self, test_param, apply_relu):
     batch_size = test_param["batch_size"]
     input_channels = test_param["input_channels"]
     output_channels = test_param["output_channels"]
@@ -831,8 +836,8 @@ class FusedConvInt8Tests(test.TestCase):
                                                 vertical_stride, padding_type)
     output_width = CalculateConvolvedOutputDim(input_width, filter_width,
                                                horizontal_stride, padding_type)
-    tf_logging.info("output_height=", output_height, ", output_width=", 
-			                 output_width)
+    tf_logging.info("output_height=", output_height, ", output_width=",
+                    output_width)
 
     side_input, _, _ = gen_array_ops.quantize_v2(
         random_ops.random_uniform(
@@ -858,12 +863,13 @@ class FusedConvInt8Tests(test.TestCase):
         conv_input_scale=conv_input_scale,
         side_input_scale=side_input_scale,
         side_input=side_input,
+        activation_mode="Relu" if apply_relu else "None",
         data_format="NCHW_VECT_C",
         filter_format="OIHW_VECT_I")
 
     expected = SimulateFusedConv2dBiasActivationInt8(
         conv_input_scale, conv_input, kernel, padding_type, strides,
-        side_input_scale, side_input, biases)
+        side_input_scale, side_input, biases, apply_relu)
 
     with self.test_session(use_gpu=True) as sess:
       actual_y, expected_y = sess.run([actual, expected])
@@ -877,8 +883,9 @@ class FusedConvInt8Tests(test.TestCase):
       tf_logging.info("int8 test skipped because not run with --config=cuda or "
                       "no GPUs with compute capability >= 6.1 are available.")
       return
-    for test_param in self._test_params:
-      self.runTest(test_param)
+    for apply_relu in [True, False]:
+      for test_param in self._test_params:
+        self.runTest(test_param, apply_relu)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/gan/BUILD b/tensorflow/contrib/gan/BUILD
index b305f37791..053d4e3e97 100644
--- a/tensorflow/contrib/gan/BUILD
+++ b/tensorflow/contrib/gan/BUILD
@@ -42,9 +42,12 @@ py_library(
         "//tensorflow/contrib/training:training_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:check_ops",
+        "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:init_ops",
+        "//tensorflow/python:random_ops",
         "//tensorflow/python:training",
+        "//tensorflow/python:training_util",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python/ops/distributions",
         "//tensorflow/python/ops/losses",
@@ -54,26 +57,31 @@ py_library(
 py_test(
     name = "train_test",
     srcs = ["python/train_test.py"],
+    shard_count = 50,
     srcs_version = "PY2AND3",
     tags = ["notsan"],
     deps = [
-        ":features",
         ":namedtuples",
+        ":random_tensor_pool",
         ":train",
         "//tensorflow/contrib/framework:framework_py",
+        "//tensorflow/contrib/layers:layers_py",
         "//tensorflow/contrib/slim:learning",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:math_ops",
         "//tensorflow/python:random_ops",
         "//tensorflow/python:random_seed",
         "//tensorflow/python:training",
+        "//tensorflow/python:training_util",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python:variables",
         "//tensorflow/python/ops/distributions",
         "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
     ],
 )
 
@@ -188,10 +196,16 @@ py_test(
     srcs = ["python/losses/python/tuple_losses_test.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":losses_impl",
+        ":namedtuples",
         ":tuple_losses",
+        "//tensorflow/contrib/layers:layers_py",
+        "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:dtypes",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:variable_scope",
         "//tensorflow/python:variables",
         "//third_party/py/numpy",
     ],
@@ -248,12 +262,15 @@ py_library(
 py_test(
     name = "random_tensor_pool_test",
     srcs = ["python/features/python/random_tensor_pool_test.py"],
+    shard_count = 6,
     srcs_version = "PY2AND3",
     deps = [
         ":random_tensor_pool",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
         "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
         "//third_party/py/numpy",
     ],
 )
@@ -344,9 +361,11 @@ py_library(
         "//tensorflow/python:image_ops",
         "//tensorflow/python:linalg_ops",
         "//tensorflow/python:math_ops",
+        "//tensorflow/python:nn",
         "//tensorflow/python:nn_ops",
         "//tensorflow/python:platform",
         "//tensorflow/python:util",
+        "@six_archive//:six",
     ],
 )
 
@@ -470,12 +489,12 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
-        ":head",
         ":namedtuples",
         ":summaries",
         ":train",
         "//tensorflow/contrib/framework:framework_py",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:metrics",
         "//tensorflow/python:util",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python/estimator",
@@ -498,16 +517,19 @@ py_test(
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
-        "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:metrics",
         "//tensorflow/python:parsing_ops",
         "//tensorflow/python:summary",
         "//tensorflow/python:training",
-        "//tensorflow/python/estimator:head",
+        "//tensorflow/python:training_util",
+        "//tensorflow/python:variable_scope",
         "//tensorflow/python/estimator:model_fn",
         "//tensorflow/python/estimator:numpy_io",
         "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
         "@six_archive//:six",
     ],
 )
diff --git a/tensorflow/contrib/gan/python/estimator/python/gan_estimator_impl.py b/tensorflow/contrib/gan/python/estimator/python/gan_estimator_impl.py
index 4092b32004..8e4affb9b4 100644
--- a/tensorflow/contrib/gan/python/estimator/python/gan_estimator_impl.py
+++ b/tensorflow/contrib/gan/python/estimator/python/gan_estimator_impl.py
@@ -24,11 +24,11 @@ import enum
 from tensorflow.contrib.framework.python.ops import variables as variable_lib
 from tensorflow.contrib.gan.python import namedtuples as tfgan_tuples
 from tensorflow.contrib.gan.python import train as tfgan_train
-from tensorflow.contrib.gan.python.estimator.python import head as head_lib
 from tensorflow.contrib.gan.python.eval.python import summaries as tfgan_summaries
 from tensorflow.python.estimator import estimator
 from tensorflow.python.estimator import model_fn as model_fn_lib
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import metrics as metrics_lib
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.util import tf_inspect as inspect
 
@@ -154,94 +154,93 @@ class GANEstimator(estimator.Estimator):
       use_loss_summaries: If `True`, add loss summaries. If `False`, does not.
         If `None`, uses defaults.
       config: `RunConfig` object to configure the runtime settings.
+
+    Raises:
+      ValueError: If loss functions aren't callable.
+      ValueError: If `use_loss_summaries` isn't boolean or `None`.
+      ValueError: If `get_hooks_fn` isn't callable or `None`.
     """
-    # TODO(joelshor): Explicitly validate inputs.
+    if not callable(generator_loss_fn):
+      raise ValueError('generator_loss_fn must be callable.')
+    if not callable(discriminator_loss_fn):
+      raise ValueError('discriminator_loss_fn must be callable.')
+    if use_loss_summaries not in [True, False, None]:
+      raise ValueError('use_loss_summaries must be True, False or None.')
+    if get_hooks_fn is not None and not callable(get_hooks_fn):
+      raise TypeError('get_hooks_fn must be callable.')
 
     def _model_fn(features, labels, mode):
-      gopt = (generator_optimizer() if callable(generator_optimizer) else
-              generator_optimizer)
-      dopt = (discriminator_optimizer() if callable(discriminator_optimizer)
-              else discriminator_optimizer)
-      gan_head = head_lib.gan_head(
-          generator_loss_fn, discriminator_loss_fn, gopt, dopt,
-          use_loss_summaries, get_hooks_fn=get_hooks_fn,
-          get_eval_metric_ops_fn=get_eval_metric_ops_fn)
-      return _gan_model_fn(
-          features, labels, mode, generator_fn, discriminator_fn, gan_head,
+      """GANEstimator model function."""
+      if mode not in [model_fn_lib.ModeKeys.TRAIN, model_fn_lib.ModeKeys.EVAL,
+                      model_fn_lib.ModeKeys.PREDICT]:
+        raise ValueError('Mode not recognized: %s' % mode)
+      real_data = labels  # rename inputs for clarity
+      generator_inputs = features  # rename inputs for clarity
+
+      # Make GANModel, which encapsulates the GAN model architectures.
+      gan_model = _get_gan_model(
+          mode, generator_fn, discriminator_fn, real_data, generator_inputs,
           add_summaries)
 
+      # Make the EstimatorSpec, which incorporates the GANModel, losses, eval
+      # metrics, and optimizers (if required).
+      return _get_estimator_spec(
+          mode, gan_model, generator_loss_fn, discriminator_loss_fn,
+          get_eval_metric_ops_fn, generator_optimizer, discriminator_optimizer,
+          get_hooks_fn)
+
     super(GANEstimator, self).__init__(
         model_fn=_model_fn, model_dir=model_dir, config=config)
 
 
-def _gan_model_fn(
-    features,
-    labels,
-    mode,
-    generator_fn,
-    discriminator_fn,
-    head,
-    add_summaries=None,
-    generator_scope_name='Generator'):
-  """The `model_fn` for the GAN estimator.
-
-  We make the following convention:
-    features -> TFGAN's `generator_inputs`
-    labels -> TFGAN's `real_data`
-
-  Args:
-    features: A dictionary to feed to generator. In the unconditional case,
-      this might be just `noise`. In the conditional GAN case, this
-      might be the generator's conditioning. The `generator_fn` determines
-      what the required keys are.
-    labels: Real data. Can be any structure, as long as `discriminator_fn`
-      can accept it for the first argument.
-    mode: Defines whether this is training, evaluation or prediction.
-      See `ModeKeys`.
-    generator_fn: A python lambda that takes `generator_inputs` as inputs and
-      returns the outputs of the GAN generator.
-    discriminator_fn: A python lambda that takes `real_data`/`generated data`
-      and `generator_inputs`. Outputs a Tensor in the range [-inf, inf].
-    head: A `Head` instance suitable for GANs.
-    add_summaries: `None`, a single `SummaryType`, or a list of `SummaryType`.
-    generator_scope_name: The name of the generator scope. We need this to be
-      the same for GANModels produced by TFGAN's `train.gan_model` and the
-      manually constructed ones for predictions.
-
-  Returns:
-    `ModelFnOps`
-
-  Raises:
-    ValueError: If `labels` isn't `None` during prediction.
-  """
-  real_data = labels
-  generator_inputs = features
-
-  if mode == model_fn_lib.ModeKeys.TRAIN:
-    gan_model = _make_train_gan_model(
-        generator_fn, discriminator_fn, real_data, generator_inputs,
-        generator_scope_name, add_summaries)
-  elif mode == model_fn_lib.ModeKeys.EVAL:
-    gan_model = _make_eval_gan_model(
-        generator_fn, discriminator_fn, real_data, generator_inputs,
-        generator_scope_name, add_summaries)
-  else:
+def _get_gan_model(
+    mode, generator_fn, discriminator_fn, real_data, generator_inputs,
+    add_summaries, generator_scope='Generator'):
+  """Makes the GANModel tuple, which encapsulates the GAN model architecture."""
+  if mode == model_fn_lib.ModeKeys.PREDICT:
     if real_data is not None:
       raise ValueError('`labels` must be `None` when mode is `predict`. '
                        'Instead, found %s' % real_data)
     gan_model = _make_prediction_gan_model(
-        generator_inputs, generator_fn, generator_scope_name)
+        generator_inputs, generator_fn, generator_scope)
+  else:  # model_fn_lib.ModeKeys.TRAIN or model_fn_lib.ModeKeys.EVAL
+    gan_model = _make_gan_model(
+        generator_fn, discriminator_fn, real_data, generator_inputs,
+        generator_scope, add_summaries, mode)
 
-  return head.create_estimator_spec(
-      features=None,
-      mode=mode,
-      logits=gan_model,
-      labels=None)
+  return gan_model
+
+
+def _get_estimator_spec(
+    mode, gan_model, generator_loss_fn, discriminator_loss_fn,
+    get_eval_metric_ops_fn, generator_optimizer, discriminator_optimizer,
+    get_hooks_fn=None):
+  """Get the EstimatorSpec for the current mode."""
+  if mode == model_fn_lib.ModeKeys.PREDICT:
+    estimator_spec = model_fn_lib.EstimatorSpec(
+        mode=mode, predictions=gan_model.generated_data)
+  else:
+    gan_loss = tfgan_tuples.GANLoss(
+        generator_loss=generator_loss_fn(gan_model),
+        discriminator_loss=discriminator_loss_fn(gan_model))
+    if mode == model_fn_lib.ModeKeys.EVAL:
+      estimator_spec = _get_eval_estimator_spec(
+          gan_model, gan_loss, get_eval_metric_ops_fn)
+    else:  # model_fn_lib.ModeKeys.TRAIN:
+      gopt = (generator_optimizer() if callable(generator_optimizer) else
+              generator_optimizer)
+      dopt = (discriminator_optimizer() if callable(discriminator_optimizer)
+              else discriminator_optimizer)
+      get_hooks_fn = get_hooks_fn or tfgan_train.get_sequential_train_hooks()
+      estimator_spec = _get_train_estimator_spec(
+          gan_model, gan_loss, gopt, dopt, get_hooks_fn)
+
+  return estimator_spec
 
 
 def _make_gan_model(generator_fn, discriminator_fn, real_data,
                     generator_inputs, generator_scope, add_summaries, mode):
-  """Make a `GANModel`, and optionally pass in `mode`."""
+  """Construct a `GANModel`, and optionally pass in `mode`."""
   # If network functions have an argument `mode`, pass mode to it.
   if 'mode' in inspect.getargspec(generator_fn).args:
     generator_fn = functools.partial(generator_fn, mode=mode)
@@ -264,22 +263,6 @@ def _make_gan_model(generator_fn, discriminator_fn, real_data,
   return gan_model
 
 
-def _make_train_gan_model(generator_fn, discriminator_fn, real_data,
-                          generator_inputs, generator_scope, add_summaries):
-  """Make a `GANModel` for training."""
-  return _make_gan_model(generator_fn, discriminator_fn, real_data,
-                         generator_inputs, generator_scope, add_summaries,
-                         model_fn_lib.ModeKeys.TRAIN)
-
-
-def _make_eval_gan_model(generator_fn, discriminator_fn, real_data,
-                         generator_inputs, generator_scope, add_summaries):
-  """Make a `GANModel` for evaluation."""
-  return _make_gan_model(generator_fn, discriminator_fn, real_data,
-                         generator_inputs, generator_scope, add_summaries,
-                         model_fn_lib.ModeKeys.EVAL)
-
-
 def _make_prediction_gan_model(generator_inputs, generator_fn, generator_scope):
   """Make a `GANModel` from just the generator."""
   # If `generator_fn` has an argument `mode`, pass mode to it.
@@ -303,3 +286,46 @@ def _make_prediction_gan_model(generator_inputs, generator_fn, generator_scope):
       discriminator_variables=None,
       discriminator_scope=None,
       discriminator_fn=None)
+
+
+def _get_eval_estimator_spec(gan_model, gan_loss, get_eval_metric_ops_fn=None,
+                             name=None):
+  """Return an EstimatorSpec for the eval case."""
+  scalar_loss = gan_loss.generator_loss + gan_loss.discriminator_loss
+  with ops.name_scope(None, 'metrics',
+                      [gan_loss.generator_loss,
+                       gan_loss.discriminator_loss]):
+    def _summary_key(head_name, val):
+      return '%s/%s' % (val, head_name) if head_name else val
+    eval_metric_ops = {
+        _summary_key(name, 'generator_loss'):
+            metrics_lib.mean(gan_loss.generator_loss),
+        _summary_key(name, 'discriminator_loss'):
+            metrics_lib.mean(gan_loss.discriminator_loss)
+    }
+    if get_eval_metric_ops_fn is not None:
+      custom_eval_metric_ops = get_eval_metric_ops_fn(gan_model)
+      if not isinstance(custom_eval_metric_ops, dict):
+        raise TypeError('get_eval_metric_ops_fn must return a dict, '
+                        'received: {}'.format(custom_eval_metric_ops))
+      eval_metric_ops.update(custom_eval_metric_ops)
+  return model_fn_lib.EstimatorSpec(
+      mode=model_fn_lib.ModeKeys.EVAL,
+      predictions=gan_model.generated_data,
+      loss=scalar_loss,
+      eval_metric_ops=eval_metric_ops)
+
+
+def _get_train_estimator_spec(
+    gan_model, gan_loss, generator_optimizer, discriminator_optimizer,
+    get_hooks_fn, train_op_fn=tfgan_train.gan_train_ops):
+  """Return an EstimatorSpec for the train case."""
+  scalar_loss = gan_loss.generator_loss + gan_loss.discriminator_loss
+  train_ops = train_op_fn(gan_model, gan_loss, generator_optimizer,
+                          discriminator_optimizer)
+  training_hooks = get_hooks_fn(train_ops)
+  return model_fn_lib.EstimatorSpec(
+      loss=scalar_loss,
+      mode=model_fn_lib.ModeKeys.TRAIN,
+      train_op=train_ops.global_step_inc_op,
+      training_hooks=training_hooks)
diff --git a/tensorflow/contrib/gan/python/estimator/python/gan_estimator_test.py b/tensorflow/contrib/gan/python/estimator/python/gan_estimator_test.py
index 955482599b..9ac9c6ca9c 100644
--- a/tensorflow/contrib/gan/python/estimator/python/gan_estimator_test.py
+++ b/tensorflow/contrib/gan/python/estimator/python/gan_estimator_test.py
@@ -21,30 +21,30 @@ from __future__ import print_function
 import shutil
 import tempfile
 
+from absl.testing import parameterized
 import numpy as np
 import six
 
 from tensorflow.contrib import layers
-from tensorflow.contrib.gan.python import namedtuples
+from tensorflow.contrib.gan.python import namedtuples as tfgan_tuples
 from tensorflow.contrib.gan.python.estimator.python import gan_estimator_impl as estimator
 from tensorflow.contrib.gan.python.losses.python import tuple_losses as losses
 from tensorflow.contrib.learn.python.learn.learn_io import graph_io
 from tensorflow.core.example import example_pb2
 from tensorflow.core.example import feature_pb2
 from tensorflow.python.estimator import model_fn as model_fn_lib
-from tensorflow.python.estimator.canned import head as head_lib
 from tensorflow.python.estimator.inputs import numpy_io
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_lib
 from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import test
 from tensorflow.python.summary.writer import writer_cache
 from tensorflow.python.training import input as input_lib
 from tensorflow.python.training import learning_rate_decay
-from tensorflow.python.training import monitored_session
 from tensorflow.python.training import training
 from tensorflow.python.training import training_util
 
@@ -60,120 +60,109 @@ def discriminator_fn(data, unused_conditioning, mode):
   return layers.fully_connected(data, 1)
 
 
-def mock_head(testcase, expected_generator_inputs, expected_real_data,
-              generator_scope_name):
-  """Returns a mock head that validates logits values and variable names."""
-  discriminator_scope_name = 'Discriminator'  # comes from TFGAN defaults
-  generator_var_names = set([
-      '%s/fully_connected/weights:0' % generator_scope_name,
-      '%s/fully_connected/biases:0' % generator_scope_name])
-  discriminator_var_names = set([
-      '%s/fully_connected/weights:0' % discriminator_scope_name,
-      '%s/fully_connected/biases:0' % discriminator_scope_name])
-
-  def _create_estimator_spec(features, mode, logits, labels):
-    gan_model = logits  # renaming for clarity
-    is_predict = mode == model_fn_lib.ModeKeys.PREDICT
-    testcase.assertIsNone(features)
-    testcase.assertIsNone(labels)
-    testcase.assertIsInstance(gan_model, namedtuples.GANModel)
-
-    trainable_vars = ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
-    expected_var_names = (generator_var_names if is_predict else
-                          generator_var_names | discriminator_var_names)
-    testcase.assertItemsEqual(expected_var_names,
-                              [var.name for var in trainable_vars])
-
-    assertions = []
-    def _or_none(x):
-      return None if is_predict else x
-    testcase.assertEqual(expected_generator_inputs, gan_model.generator_inputs)
-    # TODO(joelshor): Add check on `generated_data`.
-    testcase.assertItemsEqual(
-        generator_var_names,
-        set([x.name for x in gan_model.generator_variables]))
-    testcase.assertEqual(generator_scope_name, gan_model.generator_scope.name)
-    testcase.assertEqual(_or_none(expected_real_data), gan_model.real_data)
-    # TODO(joelshor): Add check on `discriminator_real_outputs`.
-    # TODO(joelshor): Add check on `discriminator_gen_outputs`.
-    if is_predict:
-      testcase.assertIsNone(gan_model.discriminator_scope)
-    else:
-      testcase.assertEqual(discriminator_scope_name,
-                           gan_model.discriminator_scope.name)
-
-    with ops.control_dependencies(assertions):
-      if mode == model_fn_lib.ModeKeys.TRAIN:
-        return model_fn_lib.EstimatorSpec(
-            mode=mode, loss=array_ops.zeros([]),
-            train_op=control_flow_ops.no_op(), training_hooks=[])
-      elif mode == model_fn_lib.ModeKeys.EVAL:
-        return model_fn_lib.EstimatorSpec(
-            mode=mode, predictions=gan_model.generated_data,
-            loss=array_ops.zeros([]))
-      elif mode == model_fn_lib.ModeKeys.PREDICT:
-        return model_fn_lib.EstimatorSpec(
-            mode=mode, predictions=gan_model.generated_data)
-      else:
-        testcase.fail('Invalid mode: {}'.format(mode))
-
-  head = test.mock.NonCallableMagicMock(spec=head_lib._Head)
-  head.create_estimator_spec = test.mock.MagicMock(
-      wraps=_create_estimator_spec)
-
-  return head
-
-
-class GANModelFnTest(test.TestCase):
-  """Tests that _gan_model_fn passes expected logits to mock head."""
-
-  def setUp(self):
-    self._model_dir = tempfile.mkdtemp()
-
-  def tearDown(self):
-    if self._model_dir:
-      writer_cache.FileWriterCache.clear()
-      shutil.rmtree(self._model_dir)
+class GetGANModelTest(test.TestCase, parameterized.TestCase):
+  """Tests that `GetGANModel` produces the correct model."""
 
-  def _test_logits_helper(self, mode):
-    """Tests that the expected logits are passed to mock head."""
+  @parameterized.named_parameters(
+      ('train', model_fn_lib.ModeKeys.TRAIN),
+      ('eval', model_fn_lib.ModeKeys.EVAL),
+      ('predict', model_fn_lib.ModeKeys.PREDICT))
+  def test_get_gan_model(self, mode):
     with ops.Graph().as_default():
-      training_util.get_or_create_global_step()
-      generator_inputs = {'x': array_ops.zeros([5, 4])}
-      real_data = (None if mode == model_fn_lib.ModeKeys.PREDICT else
-                   array_ops.zeros([5, 4]))
-      generator_scope_name = 'generator'
-      head = mock_head(self,
-                       expected_generator_inputs=generator_inputs,
-                       expected_real_data=real_data,
-                       generator_scope_name=generator_scope_name)
-      estimator_spec = estimator._gan_model_fn(
-          features=generator_inputs,
-          labels=real_data,
-          mode=mode,
-          generator_fn=generator_fn,
-          discriminator_fn=discriminator_fn,
-          generator_scope_name=generator_scope_name,
-          head=head)
-      with monitored_session.MonitoredTrainingSession(
-          checkpoint_dir=self._model_dir) as sess:
-        if mode == model_fn_lib.ModeKeys.TRAIN:
-          sess.run(estimator_spec.train_op)
-        elif mode == model_fn_lib.ModeKeys.EVAL:
-          sess.run(estimator_spec.loss)
-        elif mode == model_fn_lib.ModeKeys.PREDICT:
-          sess.run(estimator_spec.predictions)
-        else:
-          self.fail('Invalid mode: {}'.format(mode))
-
-  def test_logits_predict(self):
-    self._test_logits_helper(model_fn_lib.ModeKeys.PREDICT)
-
-  def test_logits_eval(self):
-    self._test_logits_helper(model_fn_lib.ModeKeys.EVAL)
-
-  def test_logits_train(self):
-    self._test_logits_helper(model_fn_lib.ModeKeys.TRAIN)
+      generator_inputs = {'x': array_ops.ones([3, 4])}
+      real_data = (array_ops.zeros([3, 4]) if
+                   mode != model_fn_lib.ModeKeys.PREDICT else None)
+      gan_model = estimator._get_gan_model(
+          mode, generator_fn, discriminator_fn, real_data, generator_inputs,
+          add_summaries=False)
+
+    self.assertEqual(generator_inputs, gan_model.generator_inputs)
+    self.assertIsNotNone(gan_model.generated_data)
+    self.assertEqual(2, len(gan_model.generator_variables))  # 1 FC layer
+    self.assertIsNotNone(gan_model.generator_fn)
+    if mode == model_fn_lib.ModeKeys.PREDICT:
+      self.assertIsNone(gan_model.real_data)
+      self.assertIsNone(gan_model.discriminator_real_outputs)
+      self.assertIsNone(gan_model.discriminator_gen_outputs)
+      self.assertIsNone(gan_model.discriminator_variables)
+      self.assertIsNone(gan_model.discriminator_scope)
+      self.assertIsNone(gan_model.discriminator_fn)
+    else:
+      self.assertIsNotNone(gan_model.real_data)
+      self.assertIsNotNone(gan_model.discriminator_real_outputs)
+      self.assertIsNotNone(gan_model.discriminator_gen_outputs)
+      self.assertEqual(2, len(gan_model.discriminator_variables))  # 1 FC layer
+      self.assertIsNotNone(gan_model.discriminator_scope)
+      self.assertIsNotNone(gan_model.discriminator_fn)
+
+
+def get_dummy_gan_model():
+  # TODO(joelshor): Find a better way of creating a variable scope.
+  with variable_scope.variable_scope('generator') as gen_scope:
+    gen_var = variable_scope.get_variable('dummy_var', initializer=0.0)
+  with variable_scope.variable_scope('discriminator') as dis_scope:
+    dis_var = variable_scope.get_variable('dummy_var', initializer=0.0)
+  return tfgan_tuples.GANModel(
+      generator_inputs=None,
+      generated_data=array_ops.ones([3, 4]),
+      generator_variables=[gen_var],
+      generator_scope=gen_scope,
+      generator_fn=None,
+      real_data=array_ops.zeros([3, 4]),
+      discriminator_real_outputs=array_ops.ones([1, 2, 3]) * dis_var,
+      discriminator_gen_outputs=array_ops.ones([1, 2, 3]) * gen_var * dis_var,
+      discriminator_variables=[dis_var],
+      discriminator_scope=dis_scope,
+      discriminator_fn=None)
+
+
+def dummy_loss_fn(gan_model):
+  return math_ops.reduce_sum(gan_model.discriminator_real_outputs -
+                             gan_model.discriminator_gen_outputs)
+
+
+def get_metrics(gan_model):
+  return {
+      'mse_custom_metric': metrics_lib.mean_squared_error(
+          gan_model.real_data, gan_model.generated_data)
+  }
+
+
+class GetEstimatorSpecTest(test.TestCase, parameterized.TestCase):
+  """Tests that the EstimatorSpec is constructed appropriately."""
+
+  @classmethod
+  def setUpClass(cls):
+    cls._generator_optimizer = training.GradientDescentOptimizer(1.0)
+    cls._discriminator_optimizer = training.GradientDescentOptimizer(1.0)
+
+  @parameterized.named_parameters(
+      ('train', model_fn_lib.ModeKeys.TRAIN),
+      ('eval', model_fn_lib.ModeKeys.EVAL),
+      ('predict', model_fn_lib.ModeKeys.PREDICT))
+  def test_get_estimator_spec(self, mode):
+    with ops.Graph().as_default():
+      self._gan_model = get_dummy_gan_model()
+      spec = estimator._get_estimator_spec(
+          mode,
+          self._gan_model,
+          generator_loss_fn=dummy_loss_fn,
+          discriminator_loss_fn=dummy_loss_fn,
+          get_eval_metric_ops_fn=get_metrics,
+          generator_optimizer=self._generator_optimizer,
+          discriminator_optimizer=self._discriminator_optimizer)
+
+    self.assertEqual(mode, spec.mode)
+    if mode == model_fn_lib.ModeKeys.PREDICT:
+      self.assertEqual(self._gan_model.generated_data, spec.predictions)
+    elif mode == model_fn_lib.ModeKeys.TRAIN:
+      self.assertShapeEqual(np.array(0), spec.loss)  # must be a scalar
+      self.assertIsNotNone(spec.train_op)
+      self.assertIsNotNone(spec.training_hooks)
+    elif mode == model_fn_lib.ModeKeys.EVAL:
+      self.assertEqual(self._gan_model.generated_data, spec.predictions)
+      self.assertShapeEqual(np.array(0), spec.loss)  # must be a scalar
+      self.assertIsNotNone(spec.eval_metric_ops)
 
 
 # TODO(joelshor): Add pandas test.
@@ -195,12 +184,6 @@ class GANEstimatorIntegrationTest(test.TestCase):
       lr = learning_rate_decay.exponential_decay(1.0, gstep, 10, 0.9)
       return training.GradientDescentOptimizer(lr)
 
-    def get_metrics(gan_model):
-      return {
-          'mse_custom_metric': metrics_lib.mean_squared_error(
-              gan_model.real_data, gan_model.generated_data)
-      }
-
     gopt = make_opt if lr_decay else training.GradientDescentOptimizer(1.0)
     dopt = make_opt if lr_decay else training.GradientDescentOptimizer(1.0)
     est = estimator.GANEstimator(
diff --git a/tensorflow/contrib/gan/python/estimator/python/head_impl.py b/tensorflow/contrib/gan/python/estimator/python/head_impl.py
index d1441e1eb2..1a0ee6dfc4 100644
--- a/tensorflow/contrib/gan/python/estimator/python/head_impl.py
+++ b/tensorflow/contrib/gan/python/estimator/python/head_impl.py
@@ -27,16 +27,21 @@ from tensorflow.python.estimator.canned import head
 from tensorflow.python.estimator.export import export_output
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import metrics as metrics_lib
+from tensorflow.python.util import deprecation
 
 __all__ = [
     'GANHead',
     'gan_head',
 ]
 
+
 def _summary_key(head_name, val):
   return '%s/%s' % (val, head_name) if head_name else val
 
 
+@deprecation.deprecated(
+    None, 'Please use tf.contrib.gan.GANEstimator without explicitly making a '
+    'GANHead.')
 def gan_head(generator_loss_fn, discriminator_loss_fn, generator_optimizer,
              discriminator_optimizer, use_loss_summaries=True,
              get_hooks_fn=tfgan_train.get_sequential_train_hooks(),
@@ -77,6 +82,9 @@ def gan_head(generator_loss_fn, discriminator_loss_fn, generator_optimizer,
 class GANHead(head._Head):  # pylint: disable=protected-access
   """`Head` for a GAN."""
 
+  @deprecation.deprecated(
+      None, 'Please use tf.contrib.gan.GANEstimator without explicitly making '
+      'a GANHead.')
   def __init__(self, generator_loss_fn, discriminator_loss_fn,
                generator_optimizer, discriminator_optimizer,
                use_loss_summaries=True,
@@ -108,7 +116,7 @@ class GANHead(head._Head):  # pylint: disable=protected-access
       raise TypeError('generator_loss_fn must be callable.')
     if not callable(discriminator_loss_fn):
       raise TypeError('discriminator_loss_fn must be callable.')
-    if not use_loss_summaries in [True, False, None]:
+    if use_loss_summaries not in [True, False, None]:
       raise ValueError('use_loss_summaries must be True, False or None.')
     if get_hooks_fn is not None and not callable(get_hooks_fn):
       raise TypeError('get_hooks_fn must be callable.')
diff --git a/tensorflow/contrib/gan/python/estimator/python/head_test.py b/tensorflow/contrib/gan/python/estimator/python/head_test.py
index 5309d87765..8205bc889d 100644
--- a/tensorflow/contrib/gan/python/estimator/python/head_test.py
+++ b/tensorflow/contrib/gan/python/estimator/python/head_test.py
@@ -67,7 +67,7 @@ class GANHeadTest(test.TestCase):
         generator_optimizer=training.GradientDescentOptimizer(1.0),
         discriminator_optimizer=training.GradientDescentOptimizer(1.0),
         get_eval_metric_ops_fn=self.get_metrics)
-    self.assertTrue(isinstance(self.gan_head, head.GANHead))
+    self.assertIsInstance(self.gan_head, head.GANHead)
 
   def get_metrics(self, gan_model):
     self.assertTrue(isinstance(gan_model, tfgan_tuples.GANModel))
diff --git a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
index 9e4ec59e70..ca2d724b49 100644
--- a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
+++ b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
@@ -36,16 +36,15 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import data_flow_ops
 from tensorflow.python.ops import random_ops
+from tensorflow.python.util import nest
 
 __all__ = [
     'tensor_pool',
 ]
 
 
-def _to_tuple(x):
-  if isinstance(x, (list, tuple)):
-    return tuple(x)
-  return (x,)
+def _to_list(x):
+  return [x] if isinstance(x, ops.Tensor) else list(x)
 
 
 def tensor_pool(input_values,
@@ -63,8 +62,8 @@ def tensor_pool(input_values,
   `pool_size` = 0 or `pooling_probability` = 0.
 
   Args:
-    input_values: A `Tensor`, or a list or tuple of `Tensor`s from which to read
-      values to be pooled.
+    input_values: An arbitrarily nested structure of `tf.Tensors`, from which to
+      read values to be pooled.
     pool_size: An integer specifying the maximum size of the pool. Defaults to
       50.
     pooling_probability: A float `Tensor` specifying the probability of getting
@@ -72,9 +71,10 @@ def tensor_pool(input_values,
     name: A string prefix for the name scope for all tensorflow ops.
 
   Returns:
-    A `Tensor`, or a list or tuple of `Tensor`s (according to the type ofx
-    `input_values`) which is with given probability either the `input_values` or
-    a randomly chosen sample that was previously inserted in the pool.
+    A nested structure of `Tensor` objects with the same structure as
+    `input_values`. With the given probability, the Tensor values are either the
+    same as in `input_values` or a randomly chosen sample that was previously
+    inserted in the pool.
 
   Raises:
     ValueError: If `pool_size` is negative.
@@ -86,11 +86,10 @@ def tensor_pool(input_values,
     return input_values
 
   original_input_values = input_values
-  input_values = _to_tuple(input_values)
+  input_values = nest.flatten(input_values)
 
-  with ops.name_scope(
-      '{}_pool_queue'.format(name),
-      values=input_values + (pooling_probability,)):
+  with ops.name_scope('{}_pool_queue'.format(name),
+                      values=input_values + [pooling_probability]):
     pool_queue = data_flow_ops.RandomShuffleQueue(
         capacity=pool_size,
         min_after_dequeue=0,
@@ -112,10 +111,10 @@ def tensor_pool(input_values,
     def _get_input_value_pooled():
       enqueue_op = pool_queue.enqueue(input_values)
       with ops.control_dependencies([enqueue_op]):
-        return tuple(array_ops.identity(v) for v in input_values)
+        return [array_ops.identity(v) for v in input_values]
 
     def _get_random_pool_value_and_enqueue_input():
-      dequeue_values = _to_tuple(pool_queue.dequeue())
+      dequeue_values = _to_list(pool_queue.dequeue())
       with ops.control_dependencies(dequeue_values):
         enqueue_op = pool_queue.enqueue(input_values)
         with ops.control_dependencies([enqueue_op]):
@@ -124,7 +123,7 @@ def tensor_pool(input_values,
           return control_flow_ops.cond(prob, lambda: dequeue_values,
                                        lambda: input_values)
 
-    output_values = _to_tuple(control_flow_ops.cond(
+    output_values = _to_list(control_flow_ops.cond(
         pool_queue.size() < pool_size, _get_input_value_pooled,
         _get_random_pool_value_and_enqueue_input))
 
@@ -132,8 +131,4 @@ def tensor_pool(input_values,
     for input_value, output_value in zip(input_values, output_values):
       output_value.set_shape(input_value.shape)
 
-  if isinstance(original_input_values, list):
-    return list(output_values)
-  elif isinstance(original_input_values, tuple):
-    return output_values
-  return output_values[0]
+  return nest.pack_sequence_as(original_input_values, output_values)
diff --git a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_test.py b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_test.py
index d8cf549cf7..08584dcd65 100644
--- a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_test.py
+++ b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_test.py
@@ -21,7 +21,9 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.gan.python.features.python.random_tensor_pool_impl import tensor_pool
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import test
 
@@ -111,6 +113,23 @@ class TensorPoolTest(test.TestCase):
         self.assertEqual(len(outs), len(input_values))
         self.assertEqual(outs[1] - outs[0], 1)
 
+  def test_pool_preserves_shape(self):
+    t = constant_op.constant(1)
+    input_values = [[t, t, t], (t, t), t]
+    output_values = tensor_pool(input_values, pool_size=5)
+    print('stuff: ', output_values)
+    # Overall shape.
+    self.assertIsInstance(output_values, list)
+    self.assertEqual(3, len(output_values))
+    # Shape of first element.
+    self.assertIsInstance(output_values[0], list)
+    self.assertEqual(3, len(output_values[0]))
+    # Shape of second element.
+    self.assertIsInstance(output_values[1], tuple)
+    self.assertEqual(2, len(output_values[1]))
+    # Shape of third element.
+    self.assertIsInstance(output_values[2], ops.Tensor)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/gan/python/losses/python/losses_impl.py b/tensorflow/contrib/gan/python/losses/python/losses_impl.py
index 1ba3a64167..d389748374 100644
--- a/tensorflow/contrib/gan/python/losses/python/losses_impl.py
+++ b/tensorflow/contrib/gan/python/losses/python/losses_impl.py
@@ -949,6 +949,11 @@ def cycle_consistency_loss(data_x,
   * loss = (loss_x2x + loss_y2y) / 2
   where `loss` is the final result.
 
+  For the L1-norm, we follow the original implementation:
+  https://github.com/junyanz/CycleGAN/blob/master/models/cycle_gan_model.lua
+  we use L1-norm of pixel-wise error normalized by data size such that
+  `cycle_loss_weight` can be specified independent of image size.
+
   See https://arxiv.org/abs/1703.10593 for more details.
 
   Args:
@@ -965,19 +970,12 @@ def cycle_consistency_loss(data_x,
     A scalar `Tensor` of cycle consistency loss.
   """
 
-  def _partial_cycle_consistency_loss(data, reconstructed_data):
-    # Following the original implementation
-    # https://github.com/junyanz/CycleGAN/blob/master/models/cycle_gan_model.lua
-    # use L1-norm of pixel-wise error normalized by data size so that
-    # `cycle_loss_weight` can be specified independent of image size.
-    return math_ops.reduce_mean(math_ops.abs(data - reconstructed_data))
-
   with ops.name_scope(
       scope,
       'cycle_consistency_loss',
       values=[data_x, reconstructed_data_x, data_y, reconstructed_data_y]):
-    loss_x2x = _partial_cycle_consistency_loss(data_x, reconstructed_data_x)
-    loss_y2y = _partial_cycle_consistency_loss(data_y, reconstructed_data_y)
+    loss_x2x = losses.absolute_difference(data_x, reconstructed_data_x)
+    loss_y2y = losses.absolute_difference(data_y, reconstructed_data_y)
     loss = (loss_x2x + loss_y2y) / 2.0
     if add_summaries:
       summary.scalar('cycle_consistency_loss_x2x', loss_x2x)
diff --git a/tensorflow/contrib/gan/python/losses/python/tuple_losses_impl.py b/tensorflow/contrib/gan/python/losses/python/tuple_losses_impl.py
index dcc3f94c2d..221c70c38b 100644
--- a/tensorflow/contrib/gan/python/losses/python/tuple_losses_impl.py
+++ b/tensorflow/contrib/gan/python/losses/python/tuple_losses_impl.py
@@ -80,6 +80,9 @@ __all__ = [
     'mutual_information_penalty',
     'combine_adversarial_loss',
     'cycle_consistency_loss',
+    'stargan_generator_loss_wrapper',
+    'stargan_discriminator_loss_wrapper',
+    'stargan_gradient_penalty_wrapper'
 ]
 
 
@@ -277,3 +280,86 @@ def cycle_consistency_loss(cyclegan_model, scope=None, add_summaries=False):
       cyclegan_model.model_x2y.generator_inputs, cyclegan_model.reconstructed_x,
       cyclegan_model.model_y2x.generator_inputs, cyclegan_model.reconstructed_y,
       scope, add_summaries)
+
+
+def stargan_generator_loss_wrapper(loss_fn):
+  """Convert a generator loss function to take a StarGANModel.
+
+  The new function has the same name as the original one.
+
+  Args:
+    loss_fn: A python function taking Discriminator's real/fake prediction for
+      generated data.
+
+  Returns:
+    A new function that takes a StarGANModel namedtuple and returns the same
+    loss.
+  """
+
+  def new_loss_fn(stargan_model, **kwargs):
+    return loss_fn(
+        stargan_model.discriminator_generated_data_source_predication, **kwargs)
+
+  new_docstring = """The stargan_model version of %s.""" % loss_fn.__name__
+  new_loss_fn.__docstring__ = new_docstring
+  new_loss_fn.__name__ = loss_fn.__name__
+  new_loss_fn.__module__ = loss_fn.__module__
+  return new_loss_fn
+
+
+def stargan_discriminator_loss_wrapper(loss_fn):
+  """Convert a discriminator loss function to take a StarGANModel.
+
+  The new function has the same name as the original one.
+
+  Args:
+    loss_fn: A python function taking Discriminator's real/fake prediction for
+      real data and generated data.
+
+  Returns:
+    A new function that takes a StarGANModel namedtuple and returns the same
+    loss.
+  """
+
+  def new_loss_fn(stargan_model, **kwargs):
+    return loss_fn(
+        stargan_model.discriminator_input_data_source_predication,
+        stargan_model.discriminator_generated_data_source_predication, **kwargs)
+
+  new_docstring = """The stargan_model version of %s.""" % loss_fn.__name__
+  new_loss_fn.__docstring__ = new_docstring
+  new_loss_fn.__name__ = loss_fn.__name__
+  new_loss_fn.__module__ = loss_fn.__module__
+  return new_loss_fn
+
+
+def stargan_gradient_penalty_wrapper(loss_fn):
+  """Convert a gradient penalty function to take a StarGANModel.
+
+  The new function has the same name as the original one.
+
+  Args:
+    loss_fn: A python function taking real_data, generated_data,
+      generator_inputs for Discriminator's condition (i.e. number of domains),
+      discriminator_fn, and discriminator_scope.
+
+  Returns:
+    A new function that takes a StarGANModel namedtuple and returns the same
+    loss.
+  """
+
+  def new_loss_fn(stargan_model, **kwargs):
+    num_domains = stargan_model.input_data_domain_label.shape.as_list()[-1]
+    return loss_fn(
+        real_data=stargan_model.input_data,
+        generated_data=stargan_model.generated_data,
+        generator_inputs=num_domains,
+        discriminator_fn=stargan_model.discriminator_fn,
+        discriminator_scope=stargan_model.discriminator_scope,
+        **kwargs)
+
+  new_docstring = """The stargan_model version of %s.""" % loss_fn.__name__
+  new_loss_fn.__docstring__ = new_docstring
+  new_loss_fn.__name__ = loss_fn.__name__
+  new_loss_fn.__module__ = loss_fn.__module__
+  return new_loss_fn
diff --git a/tensorflow/contrib/gan/python/losses/python/tuple_losses_test.py b/tensorflow/contrib/gan/python/losses/python/tuple_losses_test.py
index aa1ef11172..a559bbfa11 100644
--- a/tensorflow/contrib/gan/python/losses/python/tuple_losses_test.py
+++ b/tensorflow/contrib/gan/python/losses/python/tuple_losses_test.py
@@ -22,10 +22,15 @@ import collections
 
 import numpy as np
 
+from tensorflow.contrib import layers
 from tensorflow.contrib.gan.python import namedtuples
+from tensorflow.contrib.gan.python.losses.python import losses_impl as tfgan_losses_impl
 from tensorflow.contrib.gan.python.losses.python import tuple_losses_impl as tfgan_losses
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
@@ -129,6 +134,9 @@ manual_tests = [
     'mutual_information_penalty',
     'wasserstein_gradient_penalty',
     'cycle_consistency_loss',
+    'stargan_generator_loss_wrapper',
+    'stargan_discriminator_loss_wrapper',
+    'stargan_gradient_penalty_wrapper'
 ]
 
 discriminator_keyword_args = {
@@ -175,6 +183,112 @@ class CycleConsistencyLossTest(test.TestCase):
       self.assertNear(5.0, loss.eval(), 1e-5)
 
 
+class StarGANLossWrapperTest(test.TestCase):
+
+  def setUp(self):
+
+    super(StarGANLossWrapperTest, self).setUp()
+
+    self.input_data = array_ops.ones([1, 2, 2, 3])
+    self.input_data_domain_label = constant_op.constant([[0, 1]])
+    self.generated_data = array_ops.ones([1, 2, 2, 3])
+    self.discriminator_input_data_source_predication = array_ops.ones([1])
+    self.discriminator_generated_data_source_predication = array_ops.ones([1])
+
+    def _discriminator_fn(inputs, num_domains):
+      """Differentiable dummy discriminator for StarGAN."""
+      hidden = layers.flatten(inputs)
+      output_src = math_ops.reduce_mean(hidden, axis=1)
+      output_cls = layers.fully_connected(
+          inputs=hidden,
+          num_outputs=num_domains,
+          activation_fn=None,
+          normalizer_fn=None,
+          biases_initializer=None)
+      return output_src, output_cls
+
+    with variable_scope.variable_scope('discriminator') as dis_scope:
+      pass
+
+    self.model = namedtuples.StarGANModel(
+        input_data=self.input_data,
+        input_data_domain_label=self.input_data_domain_label,
+        generated_data=self.generated_data,
+        generated_data_domain_target=None,
+        reconstructed_data=None,
+        discriminator_input_data_source_predication=self.
+        discriminator_input_data_source_predication,
+        discriminator_generated_data_source_predication=self.
+        discriminator_generated_data_source_predication,
+        discriminator_input_data_domain_predication=None,
+        discriminator_generated_data_domain_predication=None,
+        generator_variables=None,
+        generator_scope=None,
+        generator_fn=None,
+        discriminator_variables=None,
+        discriminator_scope=dis_scope,
+        discriminator_fn=_discriminator_fn)
+
+    self.discriminator_fn = _discriminator_fn
+    self.discriminator_scope = dis_scope
+
+  def test_stargan_generator_loss_wrapper(self):
+    """Test StarGAN generator loss wrapper."""
+    loss_fn = tfgan_losses_impl.wasserstein_generator_loss
+    wrapped_loss_fn = tfgan_losses.stargan_generator_loss_wrapper(loss_fn)
+
+    loss_result_tensor = loss_fn(
+        self.discriminator_generated_data_source_predication)
+    wrapped_loss_result_tensor = wrapped_loss_fn(self.model)
+
+    with self.test_session() as sess:
+      sess.run(variables.global_variables_initializer())
+      loss_result, wrapped_loss_result = sess.run(
+          [loss_result_tensor, wrapped_loss_result_tensor])
+      self.assertAlmostEqual(loss_result, wrapped_loss_result)
+
+  def test_stargan_discriminator_loss_wrapper(self):
+    """Test StarGAN discriminator loss wrapper."""
+    loss_fn = tfgan_losses_impl.wasserstein_discriminator_loss
+    wrapped_loss_fn = tfgan_losses.stargan_discriminator_loss_wrapper(loss_fn)
+
+    loss_result_tensor = loss_fn(
+        self.discriminator_generated_data_source_predication,
+        self.discriminator_generated_data_source_predication)
+    wrapped_loss_result_tensor = wrapped_loss_fn(self.model)
+
+    with self.test_session() as sess:
+      sess.run(variables.global_variables_initializer())
+      loss_result, wrapped_loss_result = sess.run(
+          [loss_result_tensor, wrapped_loss_result_tensor])
+      self.assertAlmostEqual(loss_result, wrapped_loss_result)
+
+  def test_stargan_gradient_penalty_wrapper(self):
+    """Test StaGAN gradient penalty wrapper.
+
+    Notes:
+      The random interpolates are handled by given setting the reconstruction to
+      be the same as the input.
+
+    """
+    loss_fn = tfgan_losses_impl.wasserstein_gradient_penalty
+    wrapped_loss_fn = tfgan_losses.stargan_gradient_penalty_wrapper(loss_fn)
+
+    loss_result_tensor = loss_fn(
+        real_data=self.input_data,
+        generated_data=self.generated_data,
+        generator_inputs=self.input_data_domain_label.shape.as_list()[-1],
+        discriminator_fn=self.discriminator_fn,
+        discriminator_scope=self.discriminator_scope)
+    wrapped_loss_result_tensor = wrapped_loss_fn(self.model)
+
+    with self.test_session() as sess:
+      sess.run(variables.global_variables_initializer())
+      loss_result, wrapped_loss_result = sess.run(
+          [loss_result_tensor, wrapped_loss_result_tensor])
+      self.assertAlmostEqual(loss_result, wrapped_loss_result)
+
+
 if __name__ == '__main__':
   for loss_name in tfgan_losses.__all__:
     if loss_name in manual_tests: continue
diff --git a/tensorflow/contrib/gan/python/namedtuples.py b/tensorflow/contrib/gan/python/namedtuples.py
index 25cfeafeec..a462b68e28 100644
--- a/tensorflow/contrib/gan/python/namedtuples.py
+++ b/tensorflow/contrib/gan/python/namedtuples.py
@@ -25,12 +25,12 @@ from __future__ import print_function
 
 import collections
 
-
 __all__ = [
     'GANModel',
     'InfoGANModel',
     'ACGANModel',
     'CycleGANModel',
+    'StarGANModel',
     'GANLoss',
     'CycleGANLoss',
     'GANTrainOps',
@@ -136,6 +136,54 @@ class CycleGANModel(
   """
 
 
+class StarGANModel(
+    collections.namedtuple('StarGANModel', (
+        'input_data',
+        'input_data_domain_label',
+        'generated_data',
+        'generated_data_domain_target',
+        'reconstructed_data',
+        'discriminator_input_data_source_predication',
+        'discriminator_generated_data_source_predication',
+        'discriminator_input_data_domain_predication',
+        'discriminator_generated_data_domain_predication',
+        'generator_variables',
+        'generator_scope',
+        'generator_fn',
+        'discriminator_variables',
+        'discriminator_scope',
+        'discriminator_fn',
+    ))):
+  """A StarGANModel contains all the pieces needed for StarGAN training.
+
+  Args:
+    input_data: The real images that need to be transferred by the generator.
+    input_data_domain_label: The real domain labels associated with the real
+      images.
+    generated_data: The generated images produced by the generator. It has the
+      same shape as the input_data.
+    generated_data_domain_target: The target domain that the generated images
+      belong to. It has the same shape as the input_data_domain_label.
+    reconstructed_data: The reconstructed images produced by the G(enerator).
+      reconstructed_data = G(G(input_data, generated_data_domain_target),
+      input_data_domain_label).
+    discriminator_input_data_source: The discriminator's output for predicting
+      the source (real/generated) of input_data.
+    discriminator_generated_data_source: The discriminator's output for
+      predicting the source (real/generated) of  generated_data.
+    discriminator_input_data_domain_predication: The discriminator's output for
+      predicting the domain_label for the input_data.
+    discriminator_generated_data_domain_predication: The discriminatorr's output
+      for predicting the domain_target for the generated_data.
+    generator_variables: A list of all generator variables.
+    generator_scope: Variable scope all generator variables live in.
+    generator_fn: The generator function.
+    discriminator_variables: A list of all discriminator variables.
+    discriminator_scope: Variable scope all discriminator variables live in.
+    discriminator_fn: The discriminator function.
+  """
+
+
 class GANLoss(
     collections.namedtuple('GANLoss', (
         'generator_loss',
diff --git a/tensorflow/contrib/gan/python/train.py b/tensorflow/contrib/gan/python/train.py
index 6fa43059f3..03f52d214b 100644
--- a/tensorflow/contrib/gan/python/train.py
+++ b/tensorflow/contrib/gan/python/train.py
@@ -34,15 +34,20 @@ from __future__ import print_function
 from tensorflow.contrib.framework.python.ops import variables as variables_lib
 from tensorflow.contrib.gan.python import losses as tfgan_losses
 from tensorflow.contrib.gan.python import namedtuples
+from tensorflow.contrib.gan.python.losses.python import losses_impl as tfgan_losses_impl
 from tensorflow.contrib.slim.python.slim import learning as slim_learning
 from tensorflow.contrib.training.python.training import training
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import init_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops.distributions import distribution as ds
 from tensorflow.python.ops.losses import losses
+from tensorflow.python.summary import summary
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.training import sync_replicas_optimizer
 from tensorflow.python.training import training_util
@@ -53,6 +58,7 @@ __all__ = [
     'infogan_model',
     'acgan_model',
     'cyclegan_model',
+    'stargan_model',
     'gan_loss',
     'cyclegan_loss',
     'gan_train_ops',
@@ -123,16 +129,9 @@ def gan_model(
   discriminator_variables = variables_lib.get_trainable_variables(dis_scope)
 
   return namedtuples.GANModel(
-      generator_inputs,
-      generated_data,
-      generator_variables,
-      gen_scope,
-      generator_fn,
-      real_data,
-      discriminator_real_outputs,
-      discriminator_gen_outputs,
-      discriminator_variables,
-      dis_scope,
+      generator_inputs, generated_data, generator_variables, gen_scope,
+      generator_fn, real_data, discriminator_real_outputs,
+      discriminator_gen_outputs, discriminator_variables, dis_scope,
       discriminator_fn)
 
 
@@ -201,8 +200,7 @@ def infogan_model(
 
   # Get model-specific variables.
   generator_variables = variables_lib.get_trainable_variables(gen_scope)
-  discriminator_variables = variables_lib.get_trainable_variables(
-      disc_scope)
+  discriminator_variables = variables_lib.get_trainable_variables(disc_scope)
 
   return namedtuples.InfoGANModel(
       generator_inputs,
@@ -279,12 +277,12 @@ def acgan_model(
     generator_inputs = _convert_tensor_or_l_or_d(generator_inputs)
     generated_data = generator_fn(generator_inputs)
   with variable_scope.variable_scope(discriminator_scope) as dis_scope:
-    with ops.name_scope(dis_scope.name+'/generated/'):
+    with ops.name_scope(dis_scope.name + '/generated/'):
       (discriminator_gen_outputs, discriminator_gen_classification_logits
       ) = _validate_acgan_discriminator_outputs(
           discriminator_fn(generated_data, generator_inputs))
   with variable_scope.variable_scope(dis_scope, reuse=True):
-    with ops.name_scope(dis_scope.name+'/real/'):
+    with ops.name_scope(dis_scope.name + '/real/'):
       real_data = ops.convert_to_tensor(real_data)
       (discriminator_real_outputs, discriminator_real_classification_logits
       ) = _validate_acgan_discriminator_outputs(
@@ -297,8 +295,7 @@ def acgan_model(
 
   # Get model-specific variables.
   generator_variables = variables_lib.get_trainable_variables(gen_scope)
-  discriminator_variables = variables_lib.get_trainable_variables(
-      dis_scope)
+  discriminator_variables = variables_lib.get_trainable_variables(dis_scope)
 
   return namedtuples.ACGANModel(
       generator_inputs, generated_data, generator_variables, gen_scope,
@@ -379,6 +376,108 @@ def cyclegan_model(
                                    reconstructed_y)
 
 
+def stargan_model(generator_fn,
+                  discriminator_fn,
+                  input_data,
+                  input_data_domain_label,
+                  generator_scope='Generator',
+                  discriminator_scope='Discriminator'):
+  """Returns a StarGAN model outputs and variables.
+
+  See https://arxiv.org/abs/1711.09020 for more details.
+
+  Args:
+    generator_fn: A python lambda that takes `inputs` and `targets` as inputs
+      and returns 'generated_data' as the transformed version of `input` based
+      on the `target`. `input` has shape (n, h, w, c), `targets` has shape (n,
+      num_domains), and `generated_data` has the same shape as `input`.
+    discriminator_fn: A python lambda that takes `inputs` and `num_domains` as
+      inputs and returns a tuple (`source_prediction`, `domain_prediction`).
+      `source_prediction` represents the source(real/generated) prediction by
+      the discriminator, and `domain_prediction` represents the domain
+      prediction/classification by the discriminator. `source_prediction` has
+      shape (n) and `domain_prediction` has shape (n, num_domains).
+    input_data: Tensor or a list of tensor of shape (n, h, w, c) representing
+      the real input images.
+    input_data_domain_label: Tensor or a list of tensor of shape (batch_size,
+      num_domains) representing the domain label associated with the real
+      images.
+    generator_scope: Optional generator variable scope. Useful if you want to
+      reuse a subgraph that has already been created.
+    discriminator_scope: Optional discriminator variable scope. Useful if you
+      want to reuse a subgraph that has already been created.
+
+  Returns:
+    StarGANModel nametuple return the tensor that are needed to compute the
+    loss.
+
+  Raises:
+    ValueError: If the shape of `input_data_domain_label` is not rank 2 or fully
+    defined in every dimensions.
+  """
+
+  # Convert to tensor.
+  input_data = _convert_tensor_or_l_or_d(input_data)
+  input_data_domain_label = _convert_tensor_or_l_or_d(input_data_domain_label)
+
+  # Convert list of tensor to a single tensor if applicable.
+  if isinstance(input_data, (list, tuple)):
+    input_data = array_ops.concat(
+        [ops.convert_to_tensor(x) for x in input_data], 0)
+  if isinstance(input_data_domain_label, (list, tuple)):
+    input_data_domain_label = array_ops.concat(
+        [ops.convert_to_tensor(x) for x in input_data_domain_label], 0)
+
+  # Get batch_size, num_domains from the labels.
+  input_data_domain_label.shape.assert_has_rank(2)
+  input_data_domain_label.shape.assert_is_fully_defined()
+  batch_size, num_domains = input_data_domain_label.shape.as_list()
+
+  # Transform input_data to random target domains.
+  with variable_scope.variable_scope(generator_scope) as generator_scope:
+    generated_data_domain_target = _generate_stargan_random_domain_target(
+        batch_size, num_domains)
+    generated_data = generator_fn(input_data, generated_data_domain_target)
+
+  # Transform generated_data back to the original input_data domain.
+  with variable_scope.variable_scope(generator_scope, reuse=True):
+    reconstructed_data = generator_fn(generated_data, input_data_domain_label)
+
+  # Predict source and domain for the generated_data using the discriminator.
+  with variable_scope.variable_scope(
+      discriminator_scope) as discriminator_scope:
+    disc_gen_data_source_pred, disc_gen_data_domain_pred = discriminator_fn(
+        generated_data, num_domains)
+
+  # Predict source and domain for the input_data using the discriminator.
+  with variable_scope.variable_scope(discriminator_scope, reuse=True):
+    disc_input_data_source_pred, disc_input_data_domain_pred = discriminator_fn(
+        input_data, num_domains)
+
+  # Collect trainable variables from the neural networks.
+  generator_variables = variables_lib.get_trainable_variables(generator_scope)
+  discriminator_variables = variables_lib.get_trainable_variables(
+      discriminator_scope)
+
+  # Create the StarGANModel namedtuple.
+  return namedtuples.StarGANModel(
+      input_data=input_data,
+      input_data_domain_label=input_data_domain_label,
+      generated_data=generated_data,
+      generated_data_domain_target=generated_data_domain_target,
+      reconstructed_data=reconstructed_data,
+      discriminator_input_data_source_predication=disc_input_data_source_pred,
+      discriminator_generated_data_source_predication=disc_gen_data_source_pred,
+      discriminator_input_data_domain_predication=disc_input_data_domain_pred,
+      discriminator_generated_data_domain_predication=disc_gen_data_domain_pred,
+      generator_variables=generator_variables,
+      generator_scope=generator_scope,
+      generator_fn=generator_fn,
+      discriminator_variables=discriminator_variables,
+      discriminator_scope=discriminator_scope,
+      discriminator_fn=discriminator_fn)
+
+
 def _validate_aux_loss_weight(aux_loss_weight, name='aux_loss_weight'):
   if isinstance(aux_loss_weight, ops.Tensor):
     aux_loss_weight.shape.assert_is_compatible_with([])
@@ -419,33 +518,42 @@ def _tensor_pool_adjusted_model(model, tensor_pool_fn):
   Raises:
     ValueError: If tensor pool does not support the `model`.
   """
-  if tensor_pool_fn is None:
-    return model
-
-  pooled_generated_data, pooled_generator_inputs = tensor_pool_fn(
-      (model.generated_data, model.generator_inputs))
-
   if isinstance(model, namedtuples.GANModel):
+    pooled_generator_inputs, pooled_generated_data = tensor_pool_fn(
+        (model.generator_inputs, model.generated_data))
     with variable_scope.variable_scope(model.discriminator_scope, reuse=True):
       dis_gen_outputs = model.discriminator_fn(pooled_generated_data,
                                                pooled_generator_inputs)
-    return model._replace(discriminator_gen_outputs=dis_gen_outputs)
+    return model._replace(
+        generator_inputs=pooled_generator_inputs,
+        generated_data=pooled_generated_data,
+        discriminator_gen_outputs=dis_gen_outputs)
   elif isinstance(model, namedtuples.ACGANModel):
+    pooled_generator_inputs, pooled_generated_data = tensor_pool_fn(
+        (model.generator_inputs, model.generated_data))
     with variable_scope.variable_scope(model.discriminator_scope, reuse=True):
-      (dis_pooled_gen_outputs,
-       dis_pooled_gen_classification_logits) = model.discriminator_fn(
+      (pooled_discriminator_gen_outputs,
+       pooled_discriminator_gen_classification_logits) = model.discriminator_fn(
            pooled_generated_data, pooled_generator_inputs)
     return model._replace(
-        discriminator_gen_outputs=dis_pooled_gen_outputs,
+        generator_inputs=pooled_generator_inputs,
+        generated_data=pooled_generated_data,
+        discriminator_gen_outputs=pooled_discriminator_gen_outputs,
         discriminator_gen_classification_logits=
-        dis_pooled_gen_classification_logits)
+        pooled_discriminator_gen_classification_logits)
   elif isinstance(model, namedtuples.InfoGANModel):
+    pooled_generator_inputs, pooled_generated_data, pooled_structured_input = (
+        tensor_pool_fn((model.generator_inputs, model.generated_data,
+                        model.structured_generator_inputs)))
     with variable_scope.variable_scope(model.discriminator_scope, reuse=True):
-      (dis_pooled_gen_outputs,
+      (pooled_discriminator_gen_outputs,
        pooled_predicted_distributions) = model.discriminator_and_aux_fn(
            pooled_generated_data, pooled_generator_inputs)
     return model._replace(
-        discriminator_gen_outputs=dis_pooled_gen_outputs,
+        generator_inputs=pooled_generator_inputs,
+        generated_data=pooled_generated_data,
+        structured_generator_inputs=pooled_structured_input,
+        discriminator_gen_outputs=pooled_discriminator_gen_outputs,
         predicted_distributions=pooled_predicted_distributions)
   else:
     raise ValueError('Tensor pool does not support `model`: %s.' % type(model))
@@ -512,8 +620,8 @@ def gan_loss(
       `model` isn't an `InfoGANModel`.
   """
   # Validate arguments.
-  gradient_penalty_weight = _validate_aux_loss_weight(gradient_penalty_weight,
-                                                      'gradient_penalty_weight')
+  gradient_penalty_weight = _validate_aux_loss_weight(
+      gradient_penalty_weight, 'gradient_penalty_weight')
   mutual_information_penalty_weight = _validate_aux_loss_weight(
       mutual_information_penalty_weight, 'infogan_weight')
   aux_cond_generator_weight = _validate_aux_loss_weight(
@@ -537,33 +645,38 @@ def gan_loss(
         'is provided, `model` must be an `ACGANModel`. Instead, was %s.' %
         type(model))
 
+  # Optionally create pooled model.
+  pooled_model = (_tensor_pool_adjusted_model(model, tensor_pool_fn) if
+                  tensor_pool_fn else model)
+
   # Create standard losses.
   gen_loss = generator_loss_fn(model, add_summaries=add_summaries)
-  dis_loss = discriminator_loss_fn(
-      _tensor_pool_adjusted_model(model, tensor_pool_fn),
-      add_summaries=add_summaries)
+  dis_loss = discriminator_loss_fn(pooled_model, add_summaries=add_summaries)
 
   # Add optional extra losses.
   if _use_aux_loss(gradient_penalty_weight):
     gp_loss = tfgan_losses.wasserstein_gradient_penalty(
-        model,
+        pooled_model,
         epsilon=gradient_penalty_epsilon,
         target=gradient_penalty_target,
         one_sided=gradient_penalty_one_sided,
         add_summaries=add_summaries)
     dis_loss += gradient_penalty_weight * gp_loss
   if _use_aux_loss(mutual_information_penalty_weight):
-    info_loss = tfgan_losses.mutual_information_penalty(
+    gen_info_loss = tfgan_losses.mutual_information_penalty(
         model, add_summaries=add_summaries)
-    dis_loss += mutual_information_penalty_weight * info_loss
-    gen_loss += mutual_information_penalty_weight * info_loss
+    dis_info_loss = (gen_info_loss if tensor_pool_fn is None else
+                     tfgan_losses.mutual_information_penalty(
+                         pooled_model, add_summaries=add_summaries))
+    gen_loss += mutual_information_penalty_weight * gen_info_loss
+    dis_loss += mutual_information_penalty_weight * dis_info_loss
   if _use_aux_loss(aux_cond_generator_weight):
     ac_gen_loss = tfgan_losses.acgan_generator_loss(
         model, add_summaries=add_summaries)
     gen_loss += aux_cond_generator_weight * ac_gen_loss
   if _use_aux_loss(aux_cond_discriminator_weight):
     ac_disc_loss = tfgan_losses.acgan_discriminator_loss(
-        model, add_summaries=add_summaries)
+        pooled_model, add_summaries=add_summaries)
     dis_loss += aux_cond_discriminator_weight * ac_disc_loss
   # Gathers auxiliary losses.
   if model.generator_scope:
@@ -631,8 +744,8 @@ def cyclegan_loss(
         generator_loss_fn=generator_loss_fn,
         discriminator_loss_fn=discriminator_loss_fn,
         **kwargs)
-    return partial_loss._replace(
-        generator_loss=partial_loss.generator_loss + aux_loss)
+    return partial_loss._replace(generator_loss=partial_loss.generator_loss +
+                                 aux_loss)
 
   with ops.name_scope('cyclegan_loss_x2y'):
     loss_x2y = _partial_loss(model.model_x2y)
@@ -642,6 +755,130 @@ def cyclegan_loss(
   return namedtuples.CycleGANLoss(loss_x2y, loss_y2x)
 
 
+def stargan_loss(
+    model,
+    generator_loss_fn=tfgan_losses.stargan_generator_loss_wrapper(
+        tfgan_losses_impl.wasserstein_generator_loss),
+    discriminator_loss_fn=tfgan_losses.stargan_discriminator_loss_wrapper(
+        tfgan_losses_impl.wasserstein_discriminator_loss),
+    gradient_penalty_weight=10.0,
+    gradient_penalty_epsilon=1e-10,
+    gradient_penalty_target=1.0,
+    gradient_penalty_one_sided=False,
+    reconstruction_loss_fn=losses.absolute_difference,
+    reconstruction_loss_weight=10.0,
+    classification_loss_fn=losses.softmax_cross_entropy,
+    classification_loss_weight=1.0,
+    classification_one_hot=True,
+    add_summaries=True):
+  """StarGAN Loss.
+
+  The four major part can be found here: http://screen/tMRMBAohDYG.
+
+  Args:
+    model: (StarGAN) Model output of the stargan_model() function call.
+    generator_loss_fn: The loss function on the generator. Takes a
+      `StarGANModel` named tuple.
+    discriminator_loss_fn: The loss function on the discriminator. Takes a
+      `StarGANModel` namedtuple.
+    gradient_penalty_weight: (float) Gradient penalty weight. Default to 10 per
+      the original paper https://arxiv.org/abs/1711.09020. Set to 0 or None to
+      turn off gradient penalty.
+    gradient_penalty_epsilon: (float) A small positive number added for
+      numerical stability when computing the gradient norm.
+    gradient_penalty_target: (float, or tf.float `Tensor`) The target value of
+      gradient norm. Defaults to 1.0.
+    gradient_penalty_one_sided: (bool) If `True`, penalty proposed in
+      https://arxiv.org/abs/1709.08894 is used. Defaults to `False`.
+    reconstruction_loss_fn: The reconstruction loss function. Default to L1-norm
+      and the function must conform to the `tf.losses` API.
+    reconstruction_loss_weight: Reconstruction loss weight. Default to 10.0.
+    classification_loss_fn: The loss function on the discriminator's ability to
+      classify domain of the input. Default to one-hot softmax cross entropy
+      loss, and the function must conform to the `tf.losses` API.
+    classification_loss_weight: (float) Classification loss weight. Default to
+      1.0.
+    classification_one_hot: (bool) If the label is one hot representation.
+      Default to True. If False, classification classification_loss_fn need to
+      be sigmoid cross entropy loss instead.
+    add_summaries: (bool) Add the loss to the summary
+
+  Returns:
+    GANLoss namedtuple where we have generator loss and discriminator loss.
+
+  Raises:
+    ValueError: If input StarGANModel.input_data_domain_label does not have rank
+    2, or dimension 2 is not defined.
+  """
+
+  def _classification_loss_helper(true_labels, predict_logits, scope_name):
+    """Classification Loss Function Helper.
+
+    Args:
+      true_labels: Tensor of shape [batch_size, num_domains] representing the
+        label where each row is an one-hot vector.
+      predict_logits: Tensor of shape [batch_size, num_domains] representing the
+        predicted label logit, which is UNSCALED output from the NN.
+      scope_name: (string) Name scope of the loss component.
+
+    Returns:
+      Single scalar tensor representing the classification loss.
+    """
+
+    with ops.name_scope(scope_name, values=(true_labels, predict_logits)):
+
+      loss = classification_loss_fn(
+          onehot_labels=true_labels, logits=predict_logits)
+
+      if not classification_one_hot:
+        loss = math_ops.reduce_sum(loss, axis=1)
+      loss = math_ops.reduce_mean(loss)
+
+      if add_summaries:
+        summary.scalar(scope_name, loss)
+
+      return loss
+
+  # Check input shape.
+  model.input_data_domain_label.shape.assert_has_rank(2)
+  model.input_data_domain_label.shape[1:].assert_is_fully_defined()
+
+  # Adversarial Loss.
+  generator_loss = generator_loss_fn(model, add_summaries=add_summaries)
+  discriminator_loss = discriminator_loss_fn(model, add_summaries=add_summaries)
+
+  # Gradient Penalty.
+  if _use_aux_loss(gradient_penalty_weight):
+    gradient_penalty_fn = tfgan_losses.stargan_gradient_penalty_wrapper(
+        tfgan_losses_impl.wasserstein_gradient_penalty)
+    discriminator_loss += gradient_penalty_fn(
+        model,
+        epsilon=gradient_penalty_epsilon,
+        target=gradient_penalty_target,
+        one_sided=gradient_penalty_one_sided,
+        add_summaries=add_summaries) * gradient_penalty_weight
+
+  # Reconstruction Loss.
+  reconstruction_loss = reconstruction_loss_fn(model.input_data,
+                                               model.reconstructed_data)
+  generator_loss += reconstruction_loss * reconstruction_loss_weight
+  if add_summaries:
+    summary.scalar('reconstruction_loss', reconstruction_loss)
+
+  # Classification Loss.
+  generator_loss += _classification_loss_helper(
+      true_labels=model.generated_data_domain_target,
+      predict_logits=model.discriminator_generated_data_domain_predication,
+      scope_name='generator_classification_loss') * classification_loss_weight
+  discriminator_loss += _classification_loss_helper(
+      true_labels=model.input_data_domain_label,
+      predict_logits=model.discriminator_input_data_domain_predication,
+      scope_name='discriminator_classification_loss'
+  ) * classification_loss_weight
+
+  return namedtuples.GANLoss(generator_loss, discriminator_loss)
+
+
 def _get_update_ops(kwargs, gen_scope, dis_scope, check_for_unused_ops=True):
   """Gets generator and discriminator update ops.
 
@@ -822,12 +1059,14 @@ def get_sequential_train_hooks(train_steps=namedtuples.GANTrainSteps(1, 1)):
   Returns:
     A function that takes a GANTrainOps tuple and returns a list of hooks.
   """
+
   def get_hooks(train_ops):
     generator_hook = RunTrainOpsHook(train_ops.generator_train_op,
                                      train_steps.generator_train_steps)
     discriminator_hook = RunTrainOpsHook(train_ops.discriminator_train_op,
                                          train_steps.discriminator_train_steps)
     return [generator_hook, discriminator_hook]
+
   return get_hooks
 
 
@@ -881,23 +1120,23 @@ def get_joint_train_hooks(train_steps=namedtuples.GANTrainSteps(1, 1)):
     d_hook = RunTrainOpsHook(d_op, num_d_steps)
 
     return [joint_hook, g_hook, d_hook]
+
   return get_hooks
 
 
 # TODO(joelshor): This function currently returns the global step. Find a
 # good way for it to return the generator, discriminator, and final losses.
-def gan_train(
-    train_ops,
-    logdir,
-    get_hooks_fn=get_sequential_train_hooks(),
-    master='',
-    is_chief=True,
-    scaffold=None,
-    hooks=None,
-    chief_only_hooks=None,
-    save_checkpoint_secs=600,
-    save_summaries_steps=100,
-    config=None):
+def gan_train(train_ops,
+              logdir,
+              get_hooks_fn=get_sequential_train_hooks(),
+              master='',
+              is_chief=True,
+              scaffold=None,
+              hooks=None,
+              chief_only_hooks=None,
+              save_checkpoint_secs=600,
+              save_summaries_steps=100,
+              config=None):
   """A wrapper around `contrib.training.train` that uses GAN hooks.
 
   Args:
@@ -943,8 +1182,7 @@ def gan_train(
       config=config)
 
 
-def get_sequential_train_steps(
-    train_steps=namedtuples.GANTrainSteps(1, 1)):
+def get_sequential_train_steps(train_steps=namedtuples.GANTrainSteps(1, 1)):
   """Returns a thin wrapper around slim.learning.train_step, for GANs.
 
   This function is to provide support for the Supervisor. For new code, please
@@ -1042,3 +1280,19 @@ def _validate_acgan_discriminator_outputs(discriminator_output):
         'A discriminator function for ACGAN must output a tuple '
         'consisting of (discrimination logits, classification logits).')
   return a, b
+
+
+def _generate_stargan_random_domain_target(batch_size, num_domains):
+  """Generate random domain label.
+
+  Args:
+    batch_size: (int) Number of random domain label.
+    num_domains: (int) Number of domains representing with the label.
+
+  Returns:
+    Tensor of shape (batch_size, num_domains) representing random label.
+  """
+  domain_idx = random_ops.random_uniform(
+      [batch_size], minval=0, maxval=num_domains, dtype=dtypes.int32)
+
+  return array_ops.one_hot(domain_idx, num_domains)
diff --git a/tensorflow/contrib/gan/python/train_test.py b/tensorflow/contrib/gan/python/train_test.py
index 3ebbe55d05..58f348034f 100644
--- a/tensorflow/contrib/gan/python/train_test.py
+++ b/tensorflow/contrib/gan/python/train_test.py
@@ -18,8 +18,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from absl.testing import parameterized
 import numpy as np
 
+from tensorflow.contrib import layers
 from tensorflow.contrib.framework.python.ops import variables as variables_lib
 from tensorflow.contrib.gan.python import namedtuples
 from tensorflow.contrib.gan.python import train
@@ -30,6 +32,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
@@ -84,19 +87,59 @@ class InfoGANDiscriminator(object):
 
 
 def acgan_discriminator_model(inputs, _, num_classes=10):
-  return (discriminator_model(inputs, _), array_ops.one_hot(
-      # TODO(haeusser): infer batch size from input
-      random_ops.random_uniform([3], maxval=num_classes, dtype=dtypes.int32),
-      num_classes))
+  return (
+      discriminator_model(inputs, _),
+      array_ops.one_hot(
+          # TODO(haeusser): infer batch size from input
+          random_ops.random_uniform(
+              [3], maxval=num_classes, dtype=dtypes.int32),
+          num_classes))
 
 
 class ACGANDiscriminator(object):
 
   def __call__(self, inputs, _, num_classes=10):
-    return (discriminator_model(inputs, _), array_ops.one_hot(
-        # TODO(haeusser): infer batch size from input
-        random_ops.random_uniform([3], maxval=num_classes, dtype=dtypes.int32),
-        num_classes))
+    return (
+        discriminator_model(inputs, _),
+        array_ops.one_hot(
+            # TODO(haeusser): infer batch size from input
+            random_ops.random_uniform(
+                [3], maxval=num_classes, dtype=dtypes.int32),
+            num_classes))
+
+
+def stargan_generator_model(inputs, _):
+  """Dummy generator for StarGAN."""
+
+  return variable_scope.get_variable('dummy_g', initializer=0.5) * inputs
+
+
+class StarGANGenerator(object):
+
+  def __call__(self, inputs, _):
+    return stargan_generator_model(inputs, _)
+
+
+def stargan_discriminator_model(inputs, num_domains):
+  """Differentiable dummy discriminator for StarGAN."""
+
+  hidden = layers.flatten(inputs)
+
+  output_src = math_ops.reduce_mean(hidden, axis=1)
+
+  output_cls = layers.fully_connected(
+      inputs=hidden,
+      num_outputs=num_domains,
+      activation_fn=None,
+      normalizer_fn=None,
+      biases_initializer=None)
+  return output_src, output_cls
+
+
+class StarGANDiscriminator(object):
+
+  def __call__(self, inputs, num_domains):
+    return stargan_discriminator_model(inputs, num_domains)
 
 
 def get_gan_model():
@@ -122,8 +165,7 @@ def get_gan_model():
 def get_callable_gan_model():
   ganmodel = get_gan_model()
   return ganmodel._replace(
-      generator_fn=Generator(),
-      discriminator_fn=Discriminator())
+      generator_fn=Generator(), discriminator_fn=Discriminator())
 
 
 def create_gan_model():
@@ -242,69 +284,84 @@ def create_callable_cyclegan_model():
       data_y=array_ops.ones([1, 2]))
 
 
-def get_sync_optimizer():
-  return sync_replicas_optimizer.SyncReplicasOptimizer(
-      gradient_descent.GradientDescentOptimizer(learning_rate=1.0),
-      replicas_to_aggregate=1)
+def get_stargan_model():
+  """Similar to get_gan_model()."""
+  # TODO(joelshor): Find a better way of creating a variable scope.
+  with variable_scope.variable_scope('generator') as gen_scope:
+    pass
+  with variable_scope.variable_scope('discriminator') as dis_scope:
+    pass
+  return namedtuples.StarGANModel(
+      input_data=array_ops.ones([1, 2, 2, 3]),
+      input_data_domain_label=array_ops.ones([1, 2]),
+      generated_data=array_ops.ones([1, 2, 2, 3]),
+      generated_data_domain_target=array_ops.ones([1, 2]),
+      reconstructed_data=array_ops.ones([1, 2, 2, 3]),
+      discriminator_input_data_source_predication=array_ops.ones([1]),
+      discriminator_generated_data_source_predication=array_ops.ones([1]),
+      discriminator_input_data_domain_predication=array_ops.ones([1, 2]),
+      discriminator_generated_data_domain_predication=array_ops.ones([1, 2]),
+      generator_variables=None,
+      generator_scope=gen_scope,
+      generator_fn=stargan_generator_model,
+      discriminator_variables=None,
+      discriminator_scope=dis_scope,
+      discriminator_fn=stargan_discriminator_model)
 
 
-def get_tensor_pool_fn(pool_size):
+def get_callable_stargan_model():
+  model = get_stargan_model()
+  return model._replace(
+      generator_fn=StarGANGenerator(), discriminator_fn=StarGANDiscriminator())
 
-  def tensor_pool_fn_impl(input_values):
-    return random_tensor_pool.tensor_pool(input_values, pool_size=pool_size)
 
-  return tensor_pool_fn_impl
+def create_stargan_model():
+  return train.stargan_model(
+      stargan_generator_model, stargan_discriminator_model,
+      array_ops.ones([1, 2, 2, 3]), array_ops.ones([1, 2]))
 
 
-def get_tensor_pool_fn_for_infogan(pool_size):
+def create_callable_stargan_model():
+  return train.stargan_model(StarGANGenerator(), StarGANDiscriminator(),
+                             array_ops.ones([1, 2, 2, 3]),
+                             array_ops.ones([1, 2]))
 
-  def tensor_pool_fn_impl(input_values):
-    generated_data, generator_inputs = input_values
-    output_values = random_tensor_pool.tensor_pool(
-        [generated_data] + generator_inputs, pool_size=pool_size)
-    return output_values[0], output_values[1:]
 
-  return tensor_pool_fn_impl
+def get_sync_optimizer():
+  return sync_replicas_optimizer.SyncReplicasOptimizer(
+      gradient_descent.GradientDescentOptimizer(learning_rate=1.0),
+      replicas_to_aggregate=1)
 
 
-class GANModelTest(test.TestCase):
+class GANModelTest(test.TestCase, parameterized.TestCase):
   """Tests for `gan_model`."""
 
-  def _test_output_type_helper(self, create_fn, tuple_type):
-    self.assertTrue(isinstance(create_fn(), tuple_type))
-
-  def test_output_type_gan(self):
-    self._test_output_type_helper(get_gan_model, namedtuples.GANModel)
-
-  def test_output_type_callable_gan(self):
-    self._test_output_type_helper(get_callable_gan_model, namedtuples.GANModel)
-
-  def test_output_type_infogan(self):
-    self._test_output_type_helper(get_infogan_model, namedtuples.InfoGANModel)
-
-  def test_output_type_callable_infogan(self):
-    self._test_output_type_helper(
-        get_callable_infogan_model, namedtuples.InfoGANModel)
-
-  def test_output_type_acgan(self):
-    self._test_output_type_helper(get_acgan_model, namedtuples.ACGANModel)
-
-  def test_output_type_callable_acgan(self):
-    self._test_output_type_helper(
-        get_callable_acgan_model, namedtuples.ACGANModel)
-
-  def test_output_type_cyclegan(self):
-    self._test_output_type_helper(get_cyclegan_model, namedtuples.CycleGANModel)
-
-  def test_output_type_callable_cyclegan(self):
-    self._test_output_type_helper(get_callable_cyclegan_model,
-                                  namedtuples.CycleGANModel)
+  @parameterized.named_parameters(
+      ('gan', get_gan_model, namedtuples.GANModel),
+      ('callable_gan', get_callable_gan_model, namedtuples.GANModel),
+      ('infogan', get_infogan_model, namedtuples.InfoGANModel),
+      ('callable_infogan', get_callable_infogan_model,
+       namedtuples.InfoGANModel),
+      ('acgan', get_acgan_model, namedtuples.ACGANModel),
+      ('callable_acgan', get_callable_acgan_model, namedtuples.ACGANModel),
+      ('cyclegan', get_cyclegan_model, namedtuples.CycleGANModel),
+      ('callable_cyclegan', get_callable_cyclegan_model,
+       namedtuples.CycleGANModel),
+      ('stargan', get_stargan_model, namedtuples.StarGANModel),
+      ('callabel_stargan', get_callable_stargan_model, namedtuples.StarGANModel)
+  )
+  def test_output_type(self, create_fn, expected_tuple_type):
+    """Test that output type is as expected."""
+    self.assertIsInstance(create_fn(), expected_tuple_type)
 
   def test_no_shape_check(self):
+
     def dummy_generator_model(_):
       return (None, None)
+
     def dummy_discriminator_model(data, conditioning):  # pylint: disable=unused-argument
       return 1
+
     with self.assertRaisesRegexp(AttributeError, 'object has no attribute'):
       train.gan_model(
           dummy_generator_model,
@@ -320,52 +377,182 @@ class GANModelTest(test.TestCase):
         check_shapes=False)
 
 
-class GANLossTest(test.TestCase):
-  """Tests for `gan_loss`."""
+class StarGANModelTest(test.TestCase):
+  """Tests for `stargan_model`."""
+
+  @staticmethod
+  def create_input_and_label_tensor(batch_size, img_size, c_size, num_domains):
+    input_tensor_list = []
+    label_tensor_list = []
+    for _ in range(num_domains):
+      input_tensor_list.append(
+          random_ops.random_uniform((batch_size, img_size, img_size, c_size)))
+      domain_idx = random_ops.random_uniform(
+          [batch_size], minval=0, maxval=num_domains, dtype=dtypes.int32)
+      label_tensor_list.append(array_ops.one_hot(domain_idx, num_domains))
+    return input_tensor_list, label_tensor_list
+
+  def test_generate_stargan_random_domain_target(self):
+    batch_size = 8
+    domain_numbers = 3
+
+    target_tensor = train._generate_stargan_random_domain_target(
+        batch_size, domain_numbers)
+
+    with self.test_session() as sess:
+      targets = sess.run(target_tensor)
+      self.assertTupleEqual((batch_size, domain_numbers), targets.shape)
+      for target in targets:
+        self.assertEqual(1, np.sum(target))
+        self.assertEqual(1, np.max(target))
+
+  def test_stargan_model_output_type(self):
+    batch_size = 2
+    img_size = 16
+    c_size = 3
+    num_domains = 5
+
+    input_tensor, label_tensor = StarGANModelTest.create_input_and_label_tensor(
+        batch_size, img_size, c_size, num_domains)
+    model = train.stargan_model(
+        generator_fn=stargan_generator_model,
+        discriminator_fn=stargan_discriminator_model,
+        input_data=input_tensor,
+        input_data_domain_label=label_tensor)
+
+    self.assertIsInstance(model, namedtuples.StarGANModel)
+    self.assertTrue(isinstance(model.discriminator_variables, list))
+    self.assertTrue(isinstance(model.generator_variables, list))
+    self.assertIsInstance(model.discriminator_scope,
+                          variable_scope.VariableScope)
+    self.assertTrue(model.generator_scope, variable_scope.VariableScope)
+    self.assertTrue(callable(model.discriminator_fn))
+    self.assertTrue(callable(model.generator_fn))
+
+  def test_stargan_model_generator_output(self):
+    batch_size = 2
+    img_size = 16
+    c_size = 3
+    num_domains = 5
+
+    input_tensor, label_tensor = StarGANModelTest.create_input_and_label_tensor(
+        batch_size, img_size, c_size, num_domains)
+    model = train.stargan_model(
+        generator_fn=stargan_generator_model,
+        discriminator_fn=stargan_discriminator_model,
+        input_data=input_tensor,
+        input_data_domain_label=label_tensor)
 
-  # Test output type.
-  def _test_output_type_helper(self, get_gan_model_fn):
-    loss = train.gan_loss(get_gan_model_fn(), add_summaries=True)
-    self.assertTrue(isinstance(loss, namedtuples.GANLoss))
-    self.assertGreater(len(ops.get_collection(ops.GraphKeys.SUMMARIES)), 0)
-
-  def test_output_type_gan(self):
-    self._test_output_type_helper(get_gan_model)
+    with self.test_session(use_gpu=True) as sess:
 
-  def test_output_type_callable_gan(self):
-    self._test_output_type_helper(get_callable_gan_model)
+      sess.run(variables.global_variables_initializer())
 
-  def test_output_type_infogan(self):
-    self._test_output_type_helper(get_infogan_model)
+      input_data, generated_data, reconstructed_data = sess.run(
+          [model.input_data, model.generated_data, model.reconstructed_data])
+      self.assertTupleEqual(
+          (batch_size * num_domains, img_size, img_size, c_size),
+          input_data.shape)
+      self.assertTupleEqual(
+          (batch_size * num_domains, img_size, img_size, c_size),
+          generated_data.shape)
+      self.assertTupleEqual(
+          (batch_size * num_domains, img_size, img_size, c_size),
+          reconstructed_data.shape)
+
+  def test_stargan_model_discriminator_output(self):
+    batch_size = 2
+    img_size = 16
+    c_size = 3
+    num_domains = 5
+
+    input_tensor, label_tensor = StarGANModelTest.create_input_and_label_tensor(
+        batch_size, img_size, c_size, num_domains)
+    model = train.stargan_model(
+        generator_fn=stargan_generator_model,
+        discriminator_fn=stargan_discriminator_model,
+        input_data=input_tensor,
+        input_data_domain_label=label_tensor)
 
-  def test_output_type_callable_infogan(self):
-    self._test_output_type_helper(get_callable_infogan_model)
+    with self.test_session(use_gpu=True) as sess:
 
-  def test_output_type_acgan(self):
-    self._test_output_type_helper(get_acgan_model)
+      sess.run(variables.global_variables_initializer())
 
-  def test_output_type_callable_acgan(self):
-    self._test_output_type_helper(get_callable_acgan_model)
+      disc_input_data_source_pred, disc_gen_data_source_pred = sess.run([
+          model.discriminator_input_data_source_predication,
+          model.discriminator_generated_data_source_predication
+      ])
+      self.assertEqual(1, len(disc_input_data_source_pred.shape))
+      self.assertEqual(batch_size * num_domains,
+                       disc_input_data_source_pred.shape[0])
+      self.assertEqual(1, len(disc_gen_data_source_pred.shape))
+      self.assertEqual(batch_size * num_domains,
+                       disc_gen_data_source_pred.shape[0])
+
+      input_label, disc_input_label, gen_label, disc_gen_label = sess.run([
+          model.input_data_domain_label,
+          model.discriminator_input_data_domain_predication,
+          model.generated_data_domain_target,
+          model.discriminator_generated_data_domain_predication
+      ])
+      self.assertTupleEqual((batch_size * num_domains, num_domains),
+                            input_label.shape)
+      self.assertTupleEqual((batch_size * num_domains, num_domains),
+                            disc_input_label.shape)
+      self.assertTupleEqual((batch_size * num_domains, num_domains),
+                            gen_label.shape)
+      self.assertTupleEqual((batch_size * num_domains, num_domains),
+                            disc_gen_label.shape)
+
+
+class GANLossTest(test.TestCase, parameterized.TestCase):
+  """Tests for `gan_loss`."""
 
-  def test_output_type_cyclegan(self):
-    loss = train.cyclegan_loss(create_cyclegan_model(), add_summaries=True)
-    self.assertIsInstance(loss, namedtuples.CycleGANLoss)
+  @parameterized.named_parameters(
+      ('gan', get_gan_model),
+      ('callable_gan', get_callable_gan_model),
+      ('infogan', get_infogan_model),
+      ('callable_infogan', get_callable_infogan_model),
+      ('acgan', get_acgan_model),
+      ('callable_acgan', get_callable_acgan_model),
+  )
+  def test_output_type(self, get_gan_model_fn):
+    """Test output type."""
+    loss = train.gan_loss(get_gan_model_fn(), add_summaries=True)
+    self.assertIsInstance(loss, namedtuples.GANLoss)
     self.assertGreater(len(ops.get_collection(ops.GraphKeys.SUMMARIES)), 0)
 
-  def test_output_type_callable_cyclegan(self):
-    loss = train.cyclegan_loss(
-        create_callable_cyclegan_model(), add_summaries=True)
+  @parameterized.named_parameters(
+      ('cyclegan', create_cyclegan_model),
+      ('callable_cyclegan', create_callable_cyclegan_model),
+  )
+  def test_cyclegan_output_type(self, get_gan_model_fn):
+    loss = train.cyclegan_loss(get_gan_model_fn(), add_summaries=True)
     self.assertIsInstance(loss, namedtuples.CycleGANLoss)
     self.assertGreater(len(ops.get_collection(ops.GraphKeys.SUMMARIES)), 0)
 
-  # Test gradient penalty option.
-  def _test_grad_penalty_helper(self, create_gan_model_fn, one_sided=False):
+  @parameterized.named_parameters(
+      ('gan', create_gan_model, False),
+      ('gan_one_sided', create_gan_model, True),
+      ('callable_gan', create_callable_gan_model, False),
+      ('callable_gan_one_sided', create_callable_gan_model, True),
+      ('infogan', create_infogan_model, False),
+      ('infogan_one_sided', create_infogan_model, True),
+      ('callable_infogan', create_callable_infogan_model, False),
+      ('callable_infogan_one_sided', create_callable_infogan_model, True),
+      ('acgan', create_acgan_model, False),
+      ('acgan_one_sided', create_acgan_model, True),
+      ('callable_acgan', create_callable_acgan_model, False),
+      ('callable_acgan_one_sided', create_callable_acgan_model, True),
+  )
+  def test_grad_penalty(self, create_gan_model_fn, one_sided):
+    """Test gradient penalty option."""
     model = create_gan_model_fn()
     loss = train.gan_loss(model)
-    loss_gp = train.gan_loss(model,
-                             gradient_penalty_weight=1.0,
-                             gradient_penalty_one_sided=one_sided)
-    self.assertTrue(isinstance(loss_gp, namedtuples.GANLoss))
+    loss_gp = train.gan_loss(
+        model,
+        gradient_penalty_weight=1.0,
+        gradient_penalty_one_sided=one_sided)
+    self.assertIsInstance(loss_gp, namedtuples.GANLoss)
 
     # Check values.
     with self.test_session(use_gpu=True) as sess:
@@ -376,58 +563,28 @@ class GANLossTest(test.TestCase):
           [loss.discriminator_loss, loss_gp.discriminator_loss])
 
     self.assertEqual(loss_gen_np, loss_gen_gp_np)
-    self.assertTrue(loss_dis_np < loss_dis_gp_np)
-
-  def test_grad_penalty_gan(self):
-    self._test_grad_penalty_helper(create_gan_model)
-
-  def test_grad_penalty_callable_gan(self):
-    self._test_grad_penalty_helper(create_callable_gan_model)
-
-  def test_grad_penalty_infogan(self):
-    self._test_grad_penalty_helper(create_infogan_model)
-
-  def test_grad_penalty_callable_infogan(self):
-    self._test_grad_penalty_helper(create_callable_infogan_model)
-
-  def test_grad_penalty_acgan(self):
-    self._test_grad_penalty_helper(create_acgan_model)
-
-  def test_grad_penalty_callable_acgan(self):
-    self._test_grad_penalty_helper(create_callable_acgan_model)
-
-  def test_grad_penalty_one_sided_gan(self):
-    self._test_grad_penalty_helper(create_gan_model, one_sided=True)
-
-  def test_grad_penalty_one_sided_callable_gan(self):
-    self._test_grad_penalty_helper(create_callable_gan_model, one_sided=True)
-
-  def test_grad_penalty_one_sided_infogan(self):
-    self._test_grad_penalty_helper(create_infogan_model, one_sided=True)
-
-  def test_grad_penalty_one_sided_callable_infogan(self):
-    self._test_grad_penalty_helper(
-        create_callable_infogan_model, one_sided=True)
-
-  def test_grad_penalty_one_sided_acgan(self):
-    self._test_grad_penalty_helper(create_acgan_model, one_sided=True)
-
-  def test_grad_penalty_one_sided_callable_acgan(self):
-    self._test_grad_penalty_helper(create_callable_acgan_model, one_sided=True)
-
-  # Test mutual information penalty option.
-  def _test_mutual_info_penalty_helper(self, create_gan_model_fn):
-    train.gan_loss(create_gan_model_fn(),
-                   mutual_information_penalty_weight=constant_op.constant(1.0))
-
-  def test_mutual_info_penalty_infogan(self):
-    self._test_mutual_info_penalty_helper(get_infogan_model)
-
-  def test_mutual_info_penalty_callable_infogan(self):
-    self._test_mutual_info_penalty_helper(get_callable_infogan_model)
-
-  # Test regularization loss.
-  def _test_regularization_helper(self, get_gan_model_fn):
+    self.assertLess(loss_dis_np, loss_dis_gp_np)
+
+  @parameterized.named_parameters(
+      ('infogan', get_infogan_model),
+      ('callable_infogan', get_callable_infogan_model),
+  )
+  def test_mutual_info_penalty(self, create_gan_model_fn):
+    """Test mutual information penalty option."""
+    train.gan_loss(
+        create_gan_model_fn(),
+        mutual_information_penalty_weight=constant_op.constant(1.0))
+
+  @parameterized.named_parameters(
+      ('gan', get_gan_model),
+      ('callable_gan', get_callable_gan_model),
+      ('infogan', get_infogan_model),
+      ('callable_infogan', get_callable_infogan_model),
+      ('acgan', get_acgan_model),
+      ('callable_acgan', get_callable_acgan_model),
+  )
+  def test_regularization_helper(self, get_gan_model_fn):
+    """Test regularization loss."""
     # Evaluate losses without regularization.
     no_reg_loss = train.gan_loss(get_gan_model_fn())
     with self.test_session(use_gpu=True):
@@ -435,11 +592,11 @@ class GANLossTest(test.TestCase):
       no_reg_loss_dis_np = no_reg_loss.discriminator_loss.eval()
 
     with ops.name_scope(get_gan_model_fn().generator_scope.name):
-      ops.add_to_collection(
-          ops.GraphKeys.REGULARIZATION_LOSSES, constant_op.constant(3.0))
+      ops.add_to_collection(ops.GraphKeys.REGULARIZATION_LOSSES,
+                            constant_op.constant(3.0))
     with ops.name_scope(get_gan_model_fn().discriminator_scope.name):
-      ops.add_to_collection(
-          ops.GraphKeys.REGULARIZATION_LOSSES, constant_op.constant(2.0))
+      ops.add_to_collection(ops.GraphKeys.REGULARIZATION_LOSSES,
+                            constant_op.constant(2.0))
 
     # Check that losses now include the correct regularization values.
     reg_loss = train.gan_loss(get_gan_model_fn())
@@ -447,63 +604,47 @@ class GANLossTest(test.TestCase):
       reg_loss_gen_np = reg_loss.generator_loss.eval()
       reg_loss_dis_np = reg_loss.discriminator_loss.eval()
 
-    self.assertTrue(3.0, reg_loss_gen_np - no_reg_loss_gen_np)
-    self.assertTrue(3.0, reg_loss_dis_np - no_reg_loss_dis_np)
-
-  def test_regularization_gan(self):
-    self._test_regularization_helper(get_gan_model)
+    self.assertEqual(3.0, reg_loss_gen_np - no_reg_loss_gen_np)
+    self.assertEqual(2.0, reg_loss_dis_np - no_reg_loss_dis_np)
 
-  def test_regularization_callable_gan(self):
-    self._test_regularization_helper(get_callable_gan_model)
-
-  def test_regularization_infogan(self):
-    self._test_regularization_helper(get_infogan_model)
-
-  def test_regularization_callable_infogan(self):
-    self._test_regularization_helper(get_callable_infogan_model)
-
-  def test_regularization_acgan(self):
-    self._test_regularization_helper(get_acgan_model)
-
-  def test_regularization_callable_acgan(self):
-    self._test_regularization_helper(get_callable_acgan_model)
-
-  # Test that ACGan models work.
-  def _test_acgan_helper(self, create_gan_model_fn):
+  @parameterized.named_parameters(
+      ('notcallable', create_acgan_model),
+      ('callable', create_callable_acgan_model),
+  )
+  def test_acgan(self, create_gan_model_fn):
+    """Test that ACGAN models work."""
     model = create_gan_model_fn()
     loss = train.gan_loss(model)
     loss_ac_gen = train.gan_loss(model, aux_cond_generator_weight=1.0)
     loss_ac_dis = train.gan_loss(model, aux_cond_discriminator_weight=1.0)
-    self.assertTrue(isinstance(loss, namedtuples.GANLoss))
-    self.assertTrue(isinstance(loss_ac_gen, namedtuples.GANLoss))
-    self.assertTrue(isinstance(loss_ac_dis, namedtuples.GANLoss))
+    self.assertIsInstance(loss, namedtuples.GANLoss)
+    self.assertIsInstance(loss_ac_gen, namedtuples.GANLoss)
+    self.assertIsInstance(loss_ac_dis, namedtuples.GANLoss)
 
     # Check values.
     with self.test_session(use_gpu=True) as sess:
       variables.global_variables_initializer().run()
-      loss_gen_np, loss_ac_gen_gen_np, loss_ac_dis_gen_np = sess.run(
-          [loss.generator_loss,
-           loss_ac_gen.generator_loss,
-           loss_ac_dis.generator_loss])
-      loss_dis_np, loss_ac_gen_dis_np, loss_ac_dis_dis_np = sess.run(
-          [loss.discriminator_loss,
-           loss_ac_gen.discriminator_loss,
-           loss_ac_dis.discriminator_loss])
-
-    self.assertTrue(loss_gen_np < loss_dis_np)
+      loss_gen_np, loss_ac_gen_gen_np, loss_ac_dis_gen_np = sess.run([
+          loss.generator_loss, loss_ac_gen.generator_loss,
+          loss_ac_dis.generator_loss
+      ])
+      loss_dis_np, loss_ac_gen_dis_np, loss_ac_dis_dis_np = sess.run([
+          loss.discriminator_loss, loss_ac_gen.discriminator_loss,
+          loss_ac_dis.discriminator_loss
+      ])
+
+    self.assertLess(loss_gen_np, loss_dis_np)
     self.assertTrue(np.isscalar(loss_ac_gen_gen_np))
     self.assertTrue(np.isscalar(loss_ac_dis_gen_np))
     self.assertTrue(np.isscalar(loss_ac_gen_dis_np))
     self.assertTrue(np.isscalar(loss_ac_dis_dis_np))
 
-  def test_acgan(self):
-    self._test_acgan_helper(create_acgan_model)
-
-  def test_callable_acgan(self):
-    self._test_acgan_helper(create_callable_acgan_model)
-
-  # Test that CycleGan models work.
-  def _test_cyclegan_helper(self, create_gan_model_fn):
+  @parameterized.named_parameters(
+      ('notcallable', create_cyclegan_model),
+      ('callable', create_callable_cyclegan_model),
+  )
+  def test_cyclegan(self, create_gan_model_fn):
+    """Test that CycleGan models work."""
     model = create_gan_model_fn()
     loss = train.cyclegan_loss(model)
     self.assertIsInstance(loss, namedtuples.CycleGANLoss)
@@ -524,14 +665,86 @@ class GANLossTest(test.TestCase):
     self.assertTrue(np.isscalar(loss_y2x_gen_np))
     self.assertTrue(np.isscalar(loss_y2x_dis_np))
 
-  def test_cyclegan(self):
-    self._test_cyclegan_helper(create_cyclegan_model)
+  @parameterized.named_parameters(
+      ('notcallable', create_stargan_model),
+      ('callable', create_callable_stargan_model),
+  )
+  def test_stargan(self, create_gan_model_fn):
+
+    model = create_gan_model_fn()
+    model_loss = train.stargan_loss(model)
+
+    self.assertIsInstance(model_loss, namedtuples.GANLoss)
+
+    with self.test_session() as sess:
+
+      sess.run(variables.global_variables_initializer())
+
+      gen_loss, disc_loss = sess.run(
+          [model_loss.generator_loss, model_loss.discriminator_loss])
+
+      self.assertTrue(np.isscalar(gen_loss))
+      self.assertTrue(np.isscalar(disc_loss))
+
+  @parameterized.named_parameters(
+      ('gan', create_gan_model),
+      ('callable_gan', create_callable_gan_model),
+      ('infogan', create_infogan_model),
+      ('callable_infogan', create_callable_infogan_model),
+      ('acgan', create_acgan_model),
+      ('callable_acgan', create_callable_acgan_model),
+  )
+  def test_tensor_pool(self, create_gan_model_fn):
+    """Test tensor pool option."""
+    model = create_gan_model_fn()
+    tensor_pool_fn = lambda x: random_tensor_pool.tensor_pool(x, pool_size=5)
+    loss = train.gan_loss(model, tensor_pool_fn=tensor_pool_fn)
+    self.assertIsInstance(loss, namedtuples.GANLoss)
+
+    # Check values.
+    with self.test_session(use_gpu=True) as sess:
+      variables.global_variables_initializer().run()
+      for _ in range(10):
+        sess.run([loss.generator_loss, loss.discriminator_loss])
+
+  def test_discriminator_only_sees_pool(self):
+    """Checks that discriminator only sees pooled values."""
+    def checker_gen_fn(_):
+      return constant_op.constant(0.0)
+    model = train.gan_model(
+        checker_gen_fn,
+        discriminator_model,
+        real_data=array_ops.zeros([]),
+        generator_inputs=random_ops.random_normal([]))
+    def tensor_pool_fn(_):
+      return (random_ops.random_uniform([]), random_ops.random_uniform([]))
+    def checker_dis_fn(inputs, _):
+      """Discriminator that checks that it only sees pooled Tensors."""
+      self.assertFalse(constant_op.is_constant(inputs))
+      return inputs
+    model = model._replace(
+        discriminator_fn=checker_dis_fn)
+    train.gan_loss(model, tensor_pool_fn=tensor_pool_fn)
+
+  def test_doesnt_crash_when_in_nested_scope(self):
+    with variable_scope.variable_scope('outer_scope'):
+      gan_model = train.gan_model(
+          generator_model,
+          discriminator_model,
+          real_data=array_ops.zeros([1, 2]),
+          generator_inputs=random_ops.random_normal([1, 2]))
+
+      # This should work inside a scope.
+      train.gan_loss(gan_model, gradient_penalty_weight=1.0)
 
-  def test_callable_cyclegan(self):
-    self._test_cyclegan_helper(create_callable_cyclegan_model)
+    # This should also work outside a scope.
+    train.gan_loss(gan_model, gradient_penalty_weight=1.0)
 
-  def _check_tensor_pool_adjusted_model_outputs(self, tensor1, tensor2,
-                                                pool_size):
+
+class TensorPoolAdjusteModelTest(test.TestCase):
+
+  def _check_tensor_pool_adjusted_model_outputs(
+      self, tensor1, tensor2, pool_size):
     history_values = []
     with self.test_session(use_gpu=True) as sess:
       variables.global_variables_initializer().run()
@@ -548,115 +761,66 @@ class GANLossTest(test.TestCase):
           # pool).
           self.assertTrue(any([(v == t2).all() for v in history_values]))
 
-  # Test `_tensor_pool_adjusted_model` for gan model.
-  def test_tensor_pool_adjusted_model_gan(self):
-    model = create_gan_model()
-
-    new_model = train._tensor_pool_adjusted_model(model, None)
+  def _make_new_model_and_check(self, model, pool_size):
+    pool_fn = lambda x: random_tensor_pool.tensor_pool(x, pool_size=pool_size)
+    new_model = train._tensor_pool_adjusted_model(model, pool_fn)
     # 'Generator/dummy_g:0' and 'Discriminator/dummy_d:0'
     self.assertEqual(2, len(ops.get_collection(ops.GraphKeys.VARIABLES)))
-    self.assertIs(new_model.discriminator_gen_outputs,
-                  model.discriminator_gen_outputs)
-
-    pool_size = 5
-    new_model = train._tensor_pool_adjusted_model(
-        model, get_tensor_pool_fn(pool_size=pool_size))
     self.assertIsNot(new_model.discriminator_gen_outputs,
                      model.discriminator_gen_outputs)
+
+    return new_model
+
+  def test_tensor_pool_adjusted_model_gan(self):
+    """Test `_tensor_pool_adjusted_model` for gan model."""
+    pool_size = 5
+    model = create_gan_model()
+    new_model = self._make_new_model_and_check(model, pool_size)
+
     # Check values.
     self._check_tensor_pool_adjusted_model_outputs(
         model.discriminator_gen_outputs, new_model.discriminator_gen_outputs,
         pool_size)
 
-  # Test _tensor_pool_adjusted_model for infogan model.
   def test_tensor_pool_adjusted_model_infogan(self):
+    """Test _tensor_pool_adjusted_model for infogan model."""
+    pool_size = 5
     model = create_infogan_model()
+    new_model = self._make_new_model_and_check(model, pool_size)
 
-    pool_size = 5
-    new_model = train._tensor_pool_adjusted_model(
-        model, get_tensor_pool_fn_for_infogan(pool_size=pool_size))
-    # 'Generator/dummy_g:0' and 'Discriminator/dummy_d:0'
-    self.assertEqual(2, len(ops.get_collection(ops.GraphKeys.VARIABLES)))
-    self.assertIsNot(new_model.discriminator_gen_outputs,
-                     model.discriminator_gen_outputs)
+    # Check values.
     self.assertIsNot(new_model.predicted_distributions,
                      model.predicted_distributions)
-    # Check values.
     self._check_tensor_pool_adjusted_model_outputs(
         model.discriminator_gen_outputs, new_model.discriminator_gen_outputs,
         pool_size)
 
-  # Test _tensor_pool_adjusted_model for acgan model.
   def test_tensor_pool_adjusted_model_acgan(self):
+    """Test _tensor_pool_adjusted_model for acgan model."""
+    pool_size = 5
     model = create_acgan_model()
+    new_model = self._make_new_model_and_check(model, pool_size)
 
-    pool_size = 5
-    new_model = train._tensor_pool_adjusted_model(
-        model, get_tensor_pool_fn(pool_size=pool_size))
-    # 'Generator/dummy_g:0' and 'Discriminator/dummy_d:0'
-    self.assertEqual(2, len(ops.get_collection(ops.GraphKeys.VARIABLES)))
-    self.assertIsNot(new_model.discriminator_gen_outputs,
-                     model.discriminator_gen_outputs)
+    # Check values.
     self.assertIsNot(new_model.discriminator_gen_classification_logits,
                      model.discriminator_gen_classification_logits)
-    # Check values.
     self._check_tensor_pool_adjusted_model_outputs(
         model.discriminator_gen_outputs, new_model.discriminator_gen_outputs,
         pool_size)
 
-  # Test tensor pool.
-  def _test_tensor_pool_helper(self, create_gan_model_fn):
-    model = create_gan_model_fn()
-    if isinstance(model, namedtuples.InfoGANModel):
-      tensor_pool_fn = get_tensor_pool_fn_for_infogan(pool_size=5)
-    else:
-      tensor_pool_fn = get_tensor_pool_fn(pool_size=5)
-    loss = train.gan_loss(model, tensor_pool_fn=tensor_pool_fn)
-    self.assertTrue(isinstance(loss, namedtuples.GANLoss))
-
-    # Check values.
-    with self.test_session(use_gpu=True) as sess:
-      variables.global_variables_initializer().run()
-      for _ in range(10):
-        sess.run([loss.generator_loss, loss.discriminator_loss])
-
-  def test_tensor_pool_gan(self):
-    self._test_tensor_pool_helper(create_gan_model)
-
-  def test_tensor_pool_callable_gan(self):
-    self._test_tensor_pool_helper(create_callable_gan_model)
-
-  def test_tensor_pool_infogan(self):
-    self._test_tensor_pool_helper(create_infogan_model)
-
-  def test_tensor_pool_callable_infogan(self):
-    self._test_tensor_pool_helper(create_callable_infogan_model)
-
-  def test_tensor_pool_acgan(self):
-    self._test_tensor_pool_helper(create_acgan_model)
-
-  def test_tensor_pool_callable_acgan(self):
-    self._test_tensor_pool_helper(create_callable_acgan_model)
-
-  def test_doesnt_crash_when_in_nested_scope(self):
-    with variable_scope.variable_scope('outer_scope'):
-      gan_model = train.gan_model(
-          generator_model,
-          discriminator_model,
-          real_data=array_ops.zeros([1, 2]),
-          generator_inputs=random_ops.random_normal([1, 2]))
-
-      # This should work inside a scope.
-      train.gan_loss(gan_model, gradient_penalty_weight=1.0)
 
-    # This should also work outside a scope.
-    train.gan_loss(gan_model, gradient_penalty_weight=1.0)
-
-
-class GANTrainOpsTest(test.TestCase):
+class GANTrainOpsTest(test.TestCase, parameterized.TestCase):
   """Tests for `gan_train_ops`."""
 
-  def _test_output_type_helper(self, create_gan_model_fn):
+  @parameterized.named_parameters(
+      ('gan', create_gan_model),
+      ('callable_gan', create_callable_gan_model),
+      ('infogan', create_infogan_model),
+      ('callable_infogan', create_callable_infogan_model),
+      ('acgan', create_acgan_model),
+      ('callable_acgan', create_callable_acgan_model),
+  )
+  def test_output_type(self, create_gan_model_fn):
     model = create_gan_model_fn()
     loss = train.gan_loss(model)
 
@@ -670,28 +834,24 @@ class GANTrainOpsTest(test.TestCase):
         summarize_gradients=True,
         colocate_gradients_with_ops=True)
 
-    self.assertTrue(isinstance(train_ops, namedtuples.GANTrainOps))
-
-  def test_output_type_gan(self):
-    self._test_output_type_helper(create_gan_model)
-
-  def test_output_type_callable_gan(self):
-    self._test_output_type_helper(create_callable_gan_model)
-
-  def test_output_type_infogan(self):
-    self._test_output_type_helper(create_infogan_model)
-
-  def test_output_type_callable_infogan(self):
-    self._test_output_type_helper(create_callable_infogan_model)
-
-  def test_output_type_acgan(self):
-    self._test_output_type_helper(create_acgan_model)
-
-  def test_output_type_callable_acgan(self):
-    self._test_output_type_helper(create_callable_acgan_model)
+    self.assertIsInstance(train_ops, namedtuples.GANTrainOps)
 
   # TODO(joelshor): Add a test to check that custom update op is run.
-  def _test_unused_update_ops(self, create_gan_model_fn, provide_update_ops):
+  @parameterized.named_parameters(
+      ('gan', create_gan_model, False),
+      ('gan_provideupdates', create_gan_model, True),
+      ('callable_gan', create_callable_gan_model, False),
+      ('callable_gan_provideupdates', create_callable_gan_model, True),
+      ('infogan', create_infogan_model, False),
+      ('infogan_provideupdates', create_infogan_model, True),
+      ('callable_infogan', create_callable_infogan_model, False),
+      ('callable_infogan_provideupdates', create_callable_infogan_model, True),
+      ('acgan', create_acgan_model, False),
+      ('acgan_provideupdates', create_acgan_model, True),
+      ('callable_acgan', create_callable_acgan_model, False),
+      ('callable_acgan_provideupdates', create_callable_acgan_model, True),
+  )
+  def test_unused_update_ops(self, create_gan_model_fn, provide_update_ops):
     model = create_gan_model_fn()
     loss = train.gan_loss(model)
 
@@ -707,8 +867,11 @@ class GANTrainOpsTest(test.TestCase):
 
     # Add an update op outside the generator and discriminator scopes.
     if provide_update_ops:
-      kwargs = {'update_ops':
-                [constant_op.constant(1.0), gen_update_op, dis_update_op]}
+      kwargs = {
+          'update_ops': [
+              constant_op.constant(1.0), gen_update_op, dis_update_op
+          ]
+      }
     else:
       ops.add_to_collection(ops.GraphKeys.UPDATE_OPS, constant_op.constant(1.0))
       kwargs = {}
@@ -717,8 +880,8 @@ class GANTrainOpsTest(test.TestCase):
     d_opt = gradient_descent.GradientDescentOptimizer(1.0)
 
     with self.assertRaisesRegexp(ValueError, 'There are unused update ops:'):
-      train.gan_train_ops(model, loss, g_opt, d_opt,
-                          check_for_unused_update_ops=True, **kwargs)
+      train.gan_train_ops(
+          model, loss, g_opt, d_opt, check_for_unused_update_ops=True, **kwargs)
     train_ops = train.gan_train_ops(
         model, loss, g_opt, d_opt, check_for_unused_update_ops=False, **kwargs)
 
@@ -735,44 +898,16 @@ class GANTrainOpsTest(test.TestCase):
       self.assertEqual(1, gen_update_count.eval())
       self.assertEqual(1, dis_update_count.eval())
 
-  def test_unused_update_ops_gan(self):
-    self._test_unused_update_ops(create_gan_model, False)
-
-  def test_unused_update_ops_gan_provideupdates(self):
-    self._test_unused_update_ops(create_gan_model, True)
-
-  def test_unused_update_ops_callable_gan(self):
-    self._test_unused_update_ops(create_callable_gan_model, False)
-
-  def test_unused_update_ops_callable_gan_provideupdates(self):
-    self._test_unused_update_ops(create_callable_gan_model, True)
-
-  def test_unused_update_ops_infogan(self):
-    self._test_unused_update_ops(create_infogan_model, False)
-
-  def test_unused_update_ops_infogan_provideupdates(self):
-    self._test_unused_update_ops(create_infogan_model, True)
-
-  def test_unused_update_ops_callable_infogan(self):
-    self._test_unused_update_ops(create_callable_infogan_model, False)
-
-  def test_unused_update_ops_callable_infogan_provideupdates(self):
-    self._test_unused_update_ops(create_callable_infogan_model, True)
-
-  def test_unused_update_ops_acgan(self):
-    self._test_unused_update_ops(create_acgan_model, False)
-
-  def test_unused_update_ops_acgan_provideupdates(self):
-    self._test_unused_update_ops(create_acgan_model, True)
-
-  def test_unused_update_ops_callable_acgan(self):
-    self._test_unused_update_ops(create_callable_acgan_model, False)
-
-  def test_unused_update_ops_callable_acgan_provideupdates(self):
-    self._test_unused_update_ops(create_callable_acgan_model, True)
-
-  def _test_sync_replicas_helper(
-      self, create_gan_model_fn, create_global_step=False):
+  @parameterized.named_parameters(
+      ('gan', create_gan_model, False),
+      ('callable_gan', create_callable_gan_model, False),
+      ('infogan', create_infogan_model, False),
+      ('callable_infogan', create_callable_infogan_model, False),
+      ('acgan', create_acgan_model, False),
+      ('callable_acgan', create_callable_acgan_model, False),
+      ('gan_canbeint32', create_gan_model, True),
+  )
+  def test_sync_replicas(self, create_gan_model_fn, create_global_step):
     model = create_gan_model_fn()
     loss = train.gan_loss(model)
     num_trainable_vars = len(variables_lib.get_trainable_variables())
@@ -785,11 +920,8 @@ class GANTrainOpsTest(test.TestCase):
     g_opt = get_sync_optimizer()
     d_opt = get_sync_optimizer()
     train_ops = train.gan_train_ops(
-        model,
-        loss,
-        generator_optimizer=g_opt,
-        discriminator_optimizer=d_opt)
-    self.assertTrue(isinstance(train_ops, namedtuples.GANTrainOps))
+        model, loss, generator_optimizer=g_opt, discriminator_optimizer=d_opt)
+    self.assertIsInstance(train_ops, namedtuples.GANTrainOps)
     # No new trainable variables should have been added.
     self.assertEqual(num_trainable_vars,
                      len(variables_lib.get_trainable_variables()))
@@ -827,29 +959,8 @@ class GANTrainOpsTest(test.TestCase):
       coord.request_stop()
       coord.join(g_threads + d_threads)
 
-  def test_sync_replicas_gan(self):
-    self._test_sync_replicas_helper(create_gan_model)
-
-  def test_sync_replicas_callable_gan(self):
-    self._test_sync_replicas_helper(create_callable_gan_model)
-
-  def test_sync_replicas_infogan(self):
-    self._test_sync_replicas_helper(create_infogan_model)
 
-  def test_sync_replicas_callable_infogan(self):
-    self._test_sync_replicas_helper(create_callable_infogan_model)
-
-  def test_sync_replicas_acgan(self):
-    self._test_sync_replicas_helper(create_acgan_model)
-
-  def test_sync_replicas_callable_acgan(self):
-    self._test_sync_replicas_helper(create_callable_acgan_model)
-
-  def test_global_step_can_be_int32(self):
-    self._test_sync_replicas_helper(create_gan_model, create_global_step=True)
-
-
-class GANTrainTest(test.TestCase):
+class GANTrainTest(test.TestCase, parameterized.TestCase):
   """Tests for `gan_train`."""
 
   def _gan_train_ops(self, generator_add, discriminator_add):
@@ -860,12 +971,20 @@ class GANTrainTest(test.TestCase):
     # joint training.
     train_ops = namedtuples.GANTrainOps(
         generator_train_op=step.assign_add(generator_add, use_locking=True),
-        discriminator_train_op=step.assign_add(discriminator_add,
-                                               use_locking=True),
+        discriminator_train_op=step.assign_add(
+            discriminator_add, use_locking=True),
         global_step_inc_op=step.assign_add(1))
     return train_ops
 
-  def _test_run_helper(self, create_gan_model_fn):
+  @parameterized.named_parameters(
+      ('gan', create_gan_model),
+      ('callable_gan', create_callable_gan_model),
+      ('infogan', create_infogan_model),
+      ('callable_infogan', create_callable_infogan_model),
+      ('acgan', create_acgan_model),
+      ('callable_acgan', create_callable_acgan_model),
+  )
+  def test_run_helper(self, create_gan_model_fn):
     random_seed.set_random_seed(1234)
     model = create_gan_model_fn()
     loss = train.gan_loss(model)
@@ -881,30 +1000,15 @@ class GANTrainTest(test.TestCase):
     self.assertTrue(np.isscalar(final_step))
     self.assertEqual(2, final_step)
 
-  def test_run_gan(self):
-    self._test_run_helper(create_gan_model)
-
-  def test_run_callable_gan(self):
-    self._test_run_helper(create_callable_gan_model)
-
-  def test_run_infogan(self):
-    self._test_run_helper(create_infogan_model)
-
-  def test_run_callable_infogan(self):
-    self._test_run_helper(create_callable_infogan_model)
-
-  def test_run_acgan(self):
-    self._test_run_helper(create_acgan_model)
-
-  def test_run_callable_acgan(self):
-    self._test_run_helper(create_callable_acgan_model)
-
-  # Test multiple train steps.
-  def _test_multiple_steps_helper(self, get_hooks_fn_fn):
+  @parameterized.named_parameters(
+      ('seq_train_steps', train.get_sequential_train_hooks),
+      ('efficient_seq_train_steps', train.get_joint_train_hooks),
+  )
+  def test_multiple_steps(self, get_hooks_fn_fn):
+    """Test multiple train steps."""
     train_ops = self._gan_train_ops(generator_add=10, discriminator_add=100)
     train_steps = namedtuples.GANTrainSteps(
-        generator_train_steps=3,
-        discriminator_train_steps=4)
+        generator_train_steps=3, discriminator_train_steps=4)
     final_step = train.gan_train(
         train_ops,
         get_hooks_fn=get_hooks_fn_fn(train_steps),
@@ -914,12 +1018,6 @@ class GANTrainTest(test.TestCase):
     self.assertTrue(np.isscalar(final_step))
     self.assertEqual(1 + 3 * 10 + 4 * 100, final_step)
 
-  def test_multiple_steps_seq_train_steps(self):
-    self._test_multiple_steps_helper(train.get_sequential_train_hooks)
-
-  def test_multiple_steps_efficient_seq_train_steps(self):
-    self._test_multiple_steps_helper(train.get_joint_train_hooks)
-
   def test_supervisor_run_gan_model_train_ops_multiple_steps(self):
     step = training_util.create_global_step()
     train_ops = namedtuples.GANTrainOps(
@@ -927,8 +1025,7 @@ class GANTrainTest(test.TestCase):
         discriminator_train_op=constant_op.constant(2.0),
         global_step_inc_op=step.assign_add(1))
     train_steps = namedtuples.GANTrainSteps(
-        generator_train_steps=3,
-        discriminator_train_steps=4)
+        generator_train_steps=3, discriminator_train_steps=4)
 
     final_loss = slim_learning.train(
         train_op=train_ops,
@@ -940,10 +1037,18 @@ class GANTrainTest(test.TestCase):
     self.assertEqual(17.0, final_loss)
 
 
-class PatchGANTest(test.TestCase):
+class PatchGANTest(test.TestCase, parameterized.TestCase):
   """Tests that functions work on PatchGAN style output."""
 
-  def _test_patchgan_helper(self, create_gan_model_fn):
+  @parameterized.named_parameters(
+      ('gan', create_gan_model),
+      ('callable_gan', create_callable_gan_model),
+      ('infogan', create_infogan_model),
+      ('callable_infogan', create_callable_infogan_model),
+      ('acgan', create_acgan_model),
+      ('callable_acgan', create_callable_acgan_model),
+  )
+  def test_patchgan(self, create_gan_model_fn):
     """Ensure that patch-based discriminators work end-to-end."""
     random_seed.set_random_seed(1234)
     model = create_gan_model_fn()
@@ -960,24 +1065,6 @@ class PatchGANTest(test.TestCase):
     self.assertTrue(np.isscalar(final_step))
     self.assertEqual(2, final_step)
 
-  def test_patchgan_gan(self):
-    self._test_patchgan_helper(create_gan_model)
-
-  def test_patchgan_callable_gan(self):
-    self._test_patchgan_helper(create_callable_gan_model)
-
-  def test_patchgan_infogan(self):
-    self._test_patchgan_helper(create_infogan_model)
-
-  def test_patchgan_callable_infogan(self):
-    self._test_patchgan_helper(create_callable_infogan_model)
-
-  def test_patchgan_acgan(self):
-    self._test_patchgan_helper(create_acgan_model)
-
-  def test_patchgan_callable_acgan(self):
-    self._test_patchgan_helper(create_callable_acgan_model)
-
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/gdr/gdr_memory_manager.cc b/tensorflow/contrib/gdr/gdr_memory_manager.cc
index 1435e19109..f3bbf6b4d7 100644
--- a/tensorflow/contrib/gdr/gdr_memory_manager.cc
+++ b/tensorflow/contrib/gdr/gdr_memory_manager.cc
@@ -33,10 +33,11 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/bfc_allocator.h"
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/dma_helper.h"
+#include "tensorflow/core/common_runtime/pool_allocator.h"
+#include "tensorflow/core/common_runtime/process_state.h"
 #if GOOGLE_CUDA
 #include "tensorflow/core/common_runtime/gpu/gpu_process_state.h"
 #include "tensorflow/core/common_runtime/gpu/gpu_util.h"
-#include "tensorflow/core/common_runtime/process_state.h"
 #endif  // GOOGLE_CUDA
 #include "tensorflow/core/framework/allocator_registry.h"
 #include "tensorflow/core/lib/core/status.h"
@@ -182,28 +183,25 @@ class GdrMemoryManager : public RemoteMemoryManager {
   TF_DISALLOW_COPY_AND_ASSIGN(GdrMemoryManager);
 };
 
-// TODO(byronyi): remove this class duplicated from the one in
-// common/runtime/gpu/pool_allocator.h when it is available in common_runtime
-class BasicCPUAllocator : public SubAllocator {
- public:
-  ~BasicCPUAllocator() override {}
-
-  void* Alloc(size_t alignment, size_t num_bytes) override {
-    return port::AlignedMalloc(num_bytes, alignment);
-  }
-  void Free(void* ptr, size_t) override { port::AlignedFree(ptr); }
-};
-
 // TODO(byronyi): remove this class and its registration when the default
-// cpu_allocator() returns visitable allocator
+// cpu_allocator() returns visitable allocator, or cpu_allocator() is no
+// longer in use.
 class BFCRdmaAllocator : public BFCAllocator {
  public:
   BFCRdmaAllocator()
-      : BFCAllocator(new BasicCPUAllocator(), 1LL << 36, true, "cpu_rdma_bfc") {
+      : BFCAllocator(new BasicCPUAllocator(port::kNUMANoAffinity), 1LL << 36,
+                     true, "cpu_rdma_bfc") {}
+};
+class BFCRdmaAllocatorFactory : public AllocatorFactory {
+ public:
+  Allocator* CreateAllocator() override { return new BFCRdmaAllocator; }
+
+  virtual SubAllocator* CreateSubAllocator(int numa_node) {
+    return new BasicCPUAllocator(numa_node);
   }
 };
 
-REGISTER_MEM_ALLOCATOR("BFCRdmaAllocator", 101, BFCRdmaAllocator);
+REGISTER_MEM_ALLOCATOR("BFCRdmaAllocator", 101, BFCRdmaAllocatorFactory);
 
 GdrMemoryManager::GdrMemoryManager(const string& host, const string& port)
     : host_(host),
@@ -276,8 +274,8 @@ Status GdrMemoryManager::Init() {
   Allocator* allocators[] = {
 #if GOOGLE_CUDA
     GPUProcessState::singleton()->GetCUDAHostAllocator(0),
-    ProcessState::singleton()->GetCPUAllocator(0),
 #endif  // GOOGLE_CUDA
+    ProcessState::singleton()->GetCPUAllocator(0),
     cpu_allocator(),
   };
 
diff --git a/tensorflow/contrib/graph_editor/reroute.py b/tensorflow/contrib/graph_editor/reroute.py
index 95c02a64d4..d42e0c01f4 100644
--- a/tensorflow/contrib/graph_editor/reroute.py
+++ b/tensorflow/contrib/graph_editor/reroute.py
@@ -208,9 +208,9 @@ def _reroute_ts(ts0, ts1, mode, can_modify=None, cannot_modify=None):
 def swap_ts(ts0, ts1, can_modify=None, cannot_modify=None):
   """For each tensor's pair, swap the end of (t0,t1).
 
-  B0 B1     B0 B1
-  |  |    =>  X
-  A0 A1     A0 A1
+      B0 B1     B0 B1
+      |  |    =>  X
+      A0 A1     A0 A1
 
   Args:
     ts0: an object convertible to a list of `tf.Tensor`.
@@ -233,9 +233,9 @@ def swap_ts(ts0, ts1, can_modify=None, cannot_modify=None):
 def reroute_ts(ts0, ts1, can_modify=None, cannot_modify=None):
   """For each tensor's pair, replace the end of t1 by the end of t0.
 
-  B0 B1     B0 B1
-  |  |    => |/
-  A0 A1     A0 A1
+      B0 B1     B0 B1
+      |  |    => |/
+      A0 A1     A0 A1
 
   The end of the tensors in ts1 are left dangling.
 
diff --git a/tensorflow/contrib/kafka/ops/kafka_ops.cc b/tensorflow/contrib/kafka/ops/kafka_ops.cc
new file mode 100644
index 0000000000..8cdf16103b
--- /dev/null
+++ b/tensorflow/contrib/kafka/ops/kafka_ops.cc
@@ -0,0 +1,44 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/common_shape_fns.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+
+namespace tensorflow {
+
+REGISTER_OP("KafkaDataset")
+    .Input("topics: string")
+    .Input("servers: string")
+    .Input("group: string")
+    .Input("eof: bool")
+    .Input("timeout: int64")
+    .Output("handle: variant")
+    .SetIsStateful()
+    .SetShapeFn(shape_inference::ScalarShape)
+    .Doc(R"doc(
+Creates a dataset that emits the messages of one or more Kafka topics.
+
+topics: A `tf.string` tensor containing one or more subscriptions,
+  in the format of [topic:partition:offset:length],
+  by default length is -1 for unlimited.
+servers: A list of bootstrap servers.
+group: The consumer group id.
+eof: If True, the kafka reader will stop on EOF.
+timeout: The timeout value for the Kafka Consumer to wait
+  (in millisecond).
+)doc");
+
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/layers/__init__.py b/tensorflow/contrib/layers/__init__.py
index bc33596935..a7b41b714f 100644
--- a/tensorflow/contrib/layers/__init__.py
+++ b/tensorflow/contrib/layers/__init__.py
@@ -121,6 +121,7 @@ from tensorflow.contrib.layers.python.layers import *
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = ['bias_add',
+                    'conv1d',
                     'conv2d',
                     'conv3d',
                     'elu',
diff --git a/tensorflow/contrib/layers/python/layers/layers.py b/tensorflow/contrib/layers/python/layers/layers.py
index beeabd6b65..fa334070ad 100644
--- a/tensorflow/contrib/layers/python/layers/layers.py
+++ b/tensorflow/contrib/layers/python/layers/layers.py
@@ -55,9 +55,9 @@ from tensorflow.python.training import moving_averages
 # TODO(b/28426988): Replace legacy_* fns migrated from slim.
 # TODO(b/28426988): Remove legacy_* when all uses have migrated to new API.
 __all__ = [
-    'avg_pool2d', 'avg_pool3d', 'batch_norm', 'bias_add', 'conv2d', 'conv3d',
-    'conv2d_in_plane', 'conv2d_transpose', 'conv3d_transpose', 'convolution',
-    'convolution1d', 'convolution2d', 'convolution2d_in_plane',
+    'avg_pool2d', 'avg_pool3d', 'batch_norm', 'bias_add', 'conv1d', 'conv2d',
+    'conv3d', 'conv2d_in_plane', 'conv2d_transpose', 'conv3d_transpose',
+    'convolution', 'convolution1d', 'convolution2d', 'convolution2d_in_plane',
     'convolution2d_transpose', 'convolution3d', 'convolution3d_transpose',
     'dense_to_sparse', 'dropout', 'elu', 'flatten', 'fully_connected', 'GDN',
     'gdn', 'images_to_sequence', 'layer_norm', 'linear', 'pool', 'max_pool2d',
@@ -1702,19 +1702,22 @@ def _inner_flatten(inputs, new_rank, output_collections=None, scope=None):
   return utils.collect_named_outputs(output_collections, sc, flattened)
 
 
-def _model_variable_getter(getter,
-                           name,
-                           shape=None,
-                           dtype=None,
-                           initializer=None,
-                           regularizer=None,
-                           trainable=True,
-                           collections=None,
-                           caching_device=None,
-                           partitioner=None,
-                           rename=None,
-                           use_resource=None,
-                           **_):
+def _model_variable_getter(
+    getter,
+    name,
+    shape=None,
+    dtype=None,
+    initializer=None,
+    regularizer=None,
+    trainable=True,
+    collections=None,
+    caching_device=None,
+    partitioner=None,
+    rename=None,
+    use_resource=None,
+    synchronization=tf_variables.VariableSynchronization.AUTO,
+    aggregation=tf_variables.VariableAggregation.NONE,
+    **_):
   """Getter that uses model_variable for compatibility with core layers."""
   short_name = name.split('/')[-1]
   if rename and short_name in rename:
@@ -1732,7 +1735,9 @@ def _model_variable_getter(getter,
       caching_device=caching_device,
       partitioner=partitioner,
       custom_getter=getter,
-      use_resource=use_resource)
+      use_resource=use_resource,
+      synchronization=synchronization,
+      aggregation=aggregation)
 
 
 def _build_variable_getter(rename=None):
@@ -3315,6 +3320,7 @@ relu6 = functools.partial(fully_connected, activation_fn=nn.relu6)
 linear = functools.partial(fully_connected, activation_fn=None)
 
 # Simple alias.
+conv1d = convolution1d
 conv2d = convolution2d
 conv3d = convolution3d
 conv2d_transpose = convolution2d_transpose
diff --git a/tensorflow/contrib/layers/python/layers/rev_block_lib.py b/tensorflow/contrib/layers/python/layers/rev_block_lib.py
index 0e35b1aa8b..dad3da3748 100644
--- a/tensorflow/contrib/layers/python/layers/rev_block_lib.py
+++ b/tensorflow/contrib/layers/python/layers/rev_block_lib.py
@@ -514,15 +514,15 @@ def _recompute_grad(fn, args, use_data_dep=_USE_DEFAULT, tupleize_grads=False):
     original_vars = set(tape.watched_variables())
 
     # Backward pass
-    def grad_fn(*output_grads, **kwargs):
+    def _grad_fn(output_grads, variables=None):
       """Recompute outputs for gradient computation."""
-      variables = []
+      variables = variables or []
       if original_vars:
-        variables = kwargs["variables"]
-      if set(variables) != original_vars:
-        raise ValueError(_WRONG_VARS_ERR)
-      del kwargs
-      inputs = list(args)
+        assert variables, ("Fn created variables but the variables were not "
+                           "passed to the gradient fn.")
+        if set(variables) != original_vars:
+          raise ValueError(_WRONG_VARS_ERR)
+      inputs = [array_ops.identity(x) for x in list(args)]
       # Recompute outputs
       with framework_ops.control_dependencies(output_grads):
         if use_data_dep_:
@@ -538,7 +538,7 @@ def _recompute_grad(fn, args, use_data_dep=_USE_DEFAULT, tupleize_grads=False):
             if original_vars != recompute_vars:
               raise ValueError(_WRONG_VARS_ERR)
 
-      if not (isinstance(outputs, list) or isinstance(outputs, tuple)):
+      if not isinstance(outputs, (list, tuple)):
         outputs = [outputs]
       outputs = list(outputs)
       grads = gradients_impl.gradients(outputs, inputs + variables,
@@ -554,6 +554,16 @@ def _recompute_grad(fn, args, use_data_dep=_USE_DEFAULT, tupleize_grads=False):
       grad_vars = grads[len(inputs):]
       return grad_inputs, grad_vars
 
+    # custom_gradient inspects the signature of the function to determine
+    # whether the user expects variables passed in the grad_fn. If the function
+    # created variables, the grad_fn should accept the "variables" kwarg.
+    if original_vars:
+      def grad_fn(*output_grads, **kwargs):
+        return _grad_fn(output_grads, kwargs["variables"])
+    else:
+      def grad_fn(*output_grads):
+        return _grad_fn(output_grads)
+
     return outputs, grad_fn
 
   return fn_with_recompute(*args)
diff --git a/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py b/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py
index bc09ba8d43..d5971fb9d8 100644
--- a/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py
+++ b/tensorflow/contrib/layers/python/layers/rev_block_lib_test.py
@@ -372,6 +372,26 @@ class RecomputeTest(test.TestCase):
     self.assertEqual(2, len(update_ops))
     self.assertEqual([False, True], kwarg_values)
 
+  def testWithoutVariables(self):
+
+    def concat_n(layer_list, num_inputs):
+      return math_ops.reduce_sum(
+          array_ops.concat([x for x in layer_list[-num_inputs:]], axis=-1),
+          axis=1, keepdims=True)
+
+    @rev_block_lib.recompute_grad
+    def concat_n_wrap(*args):
+      return concat_n(args, 3)
+
+    # DenseNet-style layers
+    layer_list = [random_ops.random_uniform((4, 8))]
+    for _ in range(5):
+      layer_list.append(math_ops.sqrt(concat_n_wrap(*layer_list)))
+
+    grads = gradients_impl.gradients(layer_list[-1], layer_list[0])
+    with self.test_session() as sess:
+      sess.run(grads)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator.py b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
index 7a026a15e4..c1de42782e 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
@@ -72,6 +72,7 @@ from tensorflow.python.saved_model import builder as saved_model_builder
 from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.summary import summary as core_summary
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import device_setter
 from tensorflow.python.training import monitored_session
 from tensorflow.python.training import saver
@@ -891,7 +892,7 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
 
     # Check that model has been trained (if nothing has been set explicitly).
     if not checkpoint_path:
-      latest_path = saver.latest_checkpoint(self._model_dir)
+      latest_path = checkpoint_management.latest_checkpoint(self._model_dir)
       if not latest_path:
         raise NotFittedError(
             "Couldn't find trained model at %s." % self._model_dir)
@@ -956,7 +957,7 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
                    as_iterable=True,
                    iterate_batches=False):
     # Check that model has been trained.
-    checkpoint_path = saver.latest_checkpoint(self._model_dir)
+    checkpoint_path = checkpoint_management.latest_checkpoint(self._model_dir)
     if not checkpoint_path:
       raise NotFittedError(
           "Couldn't find trained model at %s." % self._model_dir)
@@ -1364,7 +1365,7 @@ class Estimator(BaseEstimator):
 
     if not checkpoint_path:
       # Locate the latest checkpoint
-      checkpoint_path = saver.latest_checkpoint(self._model_dir)
+      checkpoint_path = checkpoint_management.latest_checkpoint(self._model_dir)
     if not checkpoint_path:
       raise NotFittedError(
           "Couldn't find trained model at %s." % self._model_dir)
diff --git a/tensorflow/contrib/learn/python/learn/estimators/head.py b/tensorflow/contrib/learn/python/learn/estimators/head.py
index 339c4e0e36..ded93d4a7f 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/head.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/head.py
@@ -563,10 +563,10 @@ def _mean_squared_loss(labels, logits, weights=None):
     labels = ops.convert_to_tensor(labels)
     # To prevent broadcasting inside "-".
     if len(labels.get_shape()) == 1:
-      labels = array_ops.expand_dims(labels, dim=(1,))
+      labels = array_ops.expand_dims(labels, axis=(1,))
     # TODO(zakaria): make sure it does not recreate the broadcast bug.
     if len(logits.get_shape()) == 1:
-      logits = array_ops.expand_dims(logits, dim=(1,))
+      logits = array_ops.expand_dims(logits, axis=(1,))
     logits.get_shape().assert_is_compatible_with(labels.get_shape())
     loss = math_ops.square(logits - math_ops.to_float(labels), name=name)
     return _compute_weighted_loss(loss, weights)
@@ -579,10 +579,10 @@ def _poisson_loss(labels, logits, weights=None):
     labels = ops.convert_to_tensor(labels)
     # To prevent broadcasting inside "-".
     if len(labels.get_shape()) == 1:
-      labels = array_ops.expand_dims(labels, dim=(1,))
+      labels = array_ops.expand_dims(labels, axis=(1,))
     # TODO(zakaria): make sure it does not recreate the broadcast bug.
     if len(logits.get_shape()) == 1:
-      logits = array_ops.expand_dims(logits, dim=(1,))
+      logits = array_ops.expand_dims(logits, axis=(1,))
     logits.get_shape().assert_is_compatible_with(labels.get_shape())
     loss = nn.log_poisson_loss(labels, logits, compute_full_loss=True,
                                name=name)
@@ -797,7 +797,7 @@ def _log_loss_with_two_classes(labels, logits, weights=None):
     # TODO(ptucker): This will break for dynamic shapes.
     # sigmoid_cross_entropy_with_logits requires [batch_size, 1] labels.
     if len(labels.get_shape()) == 1:
-      labels = array_ops.expand_dims(labels, dim=(1,))
+      labels = array_ops.expand_dims(labels, axis=(1,))
     loss = nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits,
                                                 name=name)
     return _compute_weighted_loss(loss, weights)
diff --git a/tensorflow/contrib/learn/python/learn/estimators/run_config.py b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
index 14ee2ba609..c36879e048 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/run_config.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
@@ -240,6 +240,7 @@ class RunConfig(ClusterConfig, core_run_config.RunConfig):
                keep_checkpoint_max=5,
                keep_checkpoint_every_n_hours=10000,
                log_step_count_steps=100,
+               protocol=None,
                evaluation_master='',
                model_dir=None,
                session_config=None):
@@ -289,6 +290,8 @@ class RunConfig(ClusterConfig, core_run_config.RunConfig):
       session_config: a ConfigProto used to set session parameters, or None.
         Note - using this argument, it is easy to provide settings which break
         otherwise perfectly good models. Use with care.
+      protocol: An optional argument which specifies the protocol used when
+        starting server. None means default to grpc.
     """
     # Neither parent class calls super().__init__(), so here we have to
     # manually call their __init__() methods.
@@ -299,6 +302,7 @@ class RunConfig(ClusterConfig, core_run_config.RunConfig):
     # so instead of breaking compatibility with that assumption, we
     # just manually initialize this field:
     self._train_distribute = None
+    self._eval_distribute = None
     self._device_fn = None
 
     gpu_options = config_pb2.GPUOptions(
@@ -313,6 +317,7 @@ class RunConfig(ClusterConfig, core_run_config.RunConfig):
     self._save_summary_steps = save_summary_steps
     self._save_checkpoints_secs = save_checkpoints_secs
     self._log_step_count_steps = log_step_count_steps
+    self._protocol = protocol
     self._session_config = session_config
     if save_checkpoints_secs == RunConfig._USE_DEFAULT:
       if save_checkpoints_steps is None:
diff --git a/tensorflow/contrib/learn/python/learn/experiment.py b/tensorflow/contrib/learn/python/learn/experiment.py
index f8a3709ee5..08e907a608 100644
--- a/tensorflow/contrib/learn/python/learn/experiment.py
+++ b/tensorflow/contrib/learn/python/learn/experiment.py
@@ -41,7 +41,7 @@ from tensorflow.python.estimator import estimator as core_estimator
 from tensorflow.python.framework import ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import basic_session_run_hooks
-from tensorflow.python.training import saver
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import server_lib
 from tensorflow.python.util import compat
 from tensorflow.python.util import function_utils
@@ -95,7 +95,7 @@ class _EvalAndExportListener(basic_session_run_hooks.CheckpointSaverListener):
     # Load and cache the path of the most recent checkpoint to avoid duplicate
     # searches on GCS.
     logging.info("Checking for checkpoint in %s", self._model_dir)
-    latest_path = saver.latest_checkpoint(self._model_dir)
+    latest_path = checkpoint_management.latest_checkpoint(self._model_dir)
 
     if not latest_path:
       logging.warning("Skipping evaluation and export since model has not been "
@@ -516,7 +516,8 @@ class Experiment(object):
       start = time.time()
 
       error_msg = None
-      latest_path = saver.latest_checkpoint(self._estimator.model_dir)
+      latest_path = checkpoint_management.latest_checkpoint(
+          self._estimator.model_dir)
       if not latest_path:
         error_msg = ("Estimator is not fitted yet. "
                      "Will start an evaluation when a checkpoint is ready.")
@@ -778,7 +779,8 @@ class Experiment(object):
           saving_listeners=self._saving_listeners)
 
       logging.info("Evaluating model now.")
-      latest_checkpoint = saver.latest_checkpoint(self._estimator.model_dir)
+      latest_checkpoint = checkpoint_management.latest_checkpoint(
+          self._estimator.model_dir)
       eval_result = self._call_evaluate(
           input_fn=self._eval_input_fn,
           steps=self._eval_steps,
diff --git a/tensorflow/contrib/learn/python/learn/graph_actions_test.py b/tensorflow/contrib/learn/python/learn/graph_actions_test.py
index 0d039d593b..df156da3f4 100644
--- a/tensorflow/contrib/learn/python/learn/graph_actions_test.py
+++ b/tensorflow/contrib/learn/python/learn/graph_actions_test.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.summary import summary
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as saver_lib
 
 
@@ -124,7 +125,7 @@ class GraphActionsTest(test.TestCase):
 
   # TODO(ptucker): Test number and contents of checkpoint files.
   def _assert_ckpt(self, output_dir, expected=True):
-    ckpt_state = saver_lib.get_checkpoint_state(output_dir)
+    ckpt_state = checkpoint_management.get_checkpoint_state(output_dir)
     if expected:
       pattern = '%s/model.ckpt-.*' % output_dir
       primary_ckpt_path = ckpt_state.model_checkpoint_path
@@ -434,7 +435,7 @@ class GraphActionsTrainTest(test.TestCase):
 
   # TODO(ptucker): Test number and contents of checkpoint files.
   def _assert_ckpt(self, output_dir, expected=True):
-    ckpt_state = saver_lib.get_checkpoint_state(output_dir)
+    ckpt_state = checkpoint_management.get_checkpoint_state(output_dir)
     if expected:
       pattern = '%s/model.ckpt-.*' % output_dir
       primary_ckpt_path = ckpt_state.model_checkpoint_path
diff --git a/tensorflow/contrib/learn/python/learn/monitors.py b/tensorflow/contrib/learn/python/learn/monitors.py
index 77f7c73d54..3d691d4340 100644
--- a/tensorflow/contrib/learn/python/learn/monitors.py
+++ b/tensorflow/contrib/learn/python/learn/monitors.py
@@ -51,7 +51,7 @@ from tensorflow.python.estimator import estimator as core_estimator
 from tensorflow.python.framework import ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.summary import summary as core_summary
-from tensorflow.python.training import saver as saver_lib
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.training import training_util
 from tensorflow.python.util import deprecation
@@ -735,7 +735,8 @@ class ValidationMonitor(EveryN):
       return False
     self._last_checkpoint_check_time = current_time
     # Check that we are not running evaluation on the same checkpoint.
-    latest_path = saver_lib.latest_checkpoint(self._estimator.model_dir)
+    latest_path = checkpoint_management.latest_checkpoint(
+        self._estimator.model_dir)
     if latest_path is None:
       logging.debug("Skipping evaluation since model has not been saved yet "
                     "at step %d.", step)
@@ -1059,7 +1060,8 @@ class ExportMonitor(EveryN):
 
   def end(self, session=None):
     super(ExportMonitor, self).end(session=session)
-    latest_path = saver_lib.latest_checkpoint(self._estimator.model_dir)
+    latest_path = checkpoint_management.latest_checkpoint(
+        self._estimator.model_dir)
     if latest_path is None:
       logging.info("Skipping export at the end since model has not been saved "
                    "yet.")
diff --git a/tensorflow/contrib/learn/python/learn/monitors_test.py b/tensorflow/contrib/learn/python/learn/monitors_test.py
index 5c34d0ddb0..ff1da32c21 100644
--- a/tensorflow/contrib/learn/python/learn/monitors_test.py
+++ b/tensorflow/contrib/learn/python/learn/monitors_test.py
@@ -39,9 +39,9 @@ from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.summary import summary
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import gradient_descent
 from tensorflow.python.training import monitored_session
-from tensorflow.python.training import saver
 from tensorflow.python.training import training_util
 
 
@@ -317,7 +317,7 @@ class MonitorsTest(test.TestCase):
         self._run_monitor(monitor)
 
   @test.mock.patch.object(estimators, 'Estimator', autospec=True)
-  @test.mock.patch.object(saver, 'latest_checkpoint')
+  @test.mock.patch.object(checkpoint_management, 'latest_checkpoint')
   def test_validation_monitor_no_ckpt(self, mock_latest_checkpoint,
                                       mock_estimator_class):
     estimator = mock_estimator_class()
@@ -336,7 +336,7 @@ class MonitorsTest(test.TestCase):
       mock_latest_checkpoint.assert_called_with(model_dir)
 
   @test.mock.patch.object(estimators, 'Estimator', autospec=True)
-  @test.mock.patch.object(saver, 'latest_checkpoint')
+  @test.mock.patch.object(checkpoint_management, 'latest_checkpoint')
   def test_validation_monitor_no_early_stopping_rounds(self,
                                                        mock_latest_checkpoint,
                                                        mock_estimator_class):
@@ -356,7 +356,7 @@ class MonitorsTest(test.TestCase):
       self._assert_validation_monitor(monitor)
 
   @test.mock.patch.object(estimators, 'Estimator', autospec=True)
-  @test.mock.patch.object(saver, 'latest_checkpoint')
+  @test.mock.patch.object(checkpoint_management, 'latest_checkpoint')
   def test_validation_monitor_invalid_metric(self, mock_latest_checkpoint,
                                              mock_estimator_class):
     estimator = mock_estimator_class()
@@ -375,7 +375,7 @@ class MonitorsTest(test.TestCase):
         self._run_monitor(monitor, num_epochs=1, num_steps_per_epoch=1)
 
   @test.mock.patch.object(estimators, 'Estimator', autospec=True)
-  @test.mock.patch.object(saver, 'latest_checkpoint')
+  @test.mock.patch.object(checkpoint_management, 'latest_checkpoint')
   def test_validation_monitor(self, mock_latest_checkpoint,
                               mock_estimator_class):
     estimator = mock_estimator_class()
@@ -464,7 +464,7 @@ class MonitorsTest(test.TestCase):
       monitor.epoch_end(epoch=0)
       monitor.end()
 
-  @test.mock.patch.object(saver, 'latest_checkpoint')
+  @test.mock.patch.object(checkpoint_management, 'latest_checkpoint')
   def test_validation_monitor_with_core_estimator(self, mock_latest_checkpoint):
     estimator = test.mock.Mock(spec=core_estimator.Estimator)
     model_dir = 'model/dir'
@@ -495,7 +495,7 @@ class MonitorsTest(test.TestCase):
           expected_best_metrics={'loss': 42.0, 'auc': 0.5})
       monitor.post_step(step=step, session=None)
 
-  @test.mock.patch.object(saver, 'latest_checkpoint')
+  @test.mock.patch.object(checkpoint_management, 'latest_checkpoint')
   def test_validation_monitor_fail_with_core_estimator_and_metrics(
       self, mock_latest_checkpoint):
     estimator = test.mock.Mock(spec=core_estimator.Estimator)
diff --git a/tensorflow/contrib/learn/python/learn/utils/export.py b/tensorflow/contrib/learn/python/learn/utils/export.py
index 3eacac7a3d..0144b93814 100644
--- a/tensorflow/contrib/learn/python/learn/utils/export.py
+++ b/tensorflow/contrib/learn/python/learn/utils/export.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import lookup_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as tf_saver
 from tensorflow.python.training import training_util
 
@@ -298,7 +299,8 @@ def _export_estimator(estimator,
 
   # If checkpoint_path is specified, use the specified checkpoint path.
   checkpoint_path = (checkpoint_path or
-                     tf_saver.latest_checkpoint(estimator._model_dir))
+                     checkpoint_management.latest_checkpoint(
+                         estimator._model_dir))
   with ops.Graph().as_default() as g:
     training_util.create_global_step(g)
 
diff --git a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
index f8106d1e4a..66af6833da 100644
--- a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
+++ b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
@@ -55,7 +55,7 @@ from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.saved_model import signature_def_utils
 from tensorflow.python.summary import summary_iterator
-from tensorflow.python.training import saver
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.util import compat
 from tensorflow.python.util.deprecation import deprecated
 
@@ -714,7 +714,8 @@ def make_best_model_export_strategy(
       #  as soon as contrib is cleaned up and we can thus be sure that
       #  estimator is a tf.estimator.Estimator and not a
       #  tf.contrib.learn.Estimator
-      checkpoint_path = saver.latest_checkpoint(estimator.model_dir)
+      checkpoint_path = checkpoint_management.latest_checkpoint(
+          estimator.model_dir)
     export_checkpoint_path, export_eval_result = best_model_selector.update(
         checkpoint_path, eval_result)
 
diff --git a/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py b/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py
index 5e7b422e3c..e742447208 100644
--- a/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py
+++ b/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py
@@ -625,11 +625,13 @@ def attention_decoder(decoder_inputs,
     v = []
     attention_vec_size = attn_size  # Size of query vectors for attention.
     for a in xrange(num_heads):
-      k = variable_scope.get_variable("AttnW_%d" % a,
-                                      [1, 1, attn_size, attention_vec_size])
+      k = variable_scope.get_variable(
+          "AttnW_%d" % a, [1, 1, attn_size, attention_vec_size],
+          dtype=dtype)
       hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
       v.append(
-          variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
+          variable_scope.get_variable(
+              "AttnV_%d" % a, [attention_vec_size], dtype=dtype))
 
     state = initial_state
 
@@ -647,11 +649,13 @@ def attention_decoder(decoder_inputs,
         with variable_scope.variable_scope("Attention_%d" % a):
           y = Linear(query, attention_vec_size, True)(query)
           y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
+          y = math_ops.cast(y, dtype)
           # Attention mask is a softmax of v^T * tanh(...).
           s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                                   [2, 3])
-          a = nn_ops.softmax(s)
+          a = nn_ops.softmax(math_ops.cast(s, dtype=dtypes.float32))
           # Now calculate the attention-weighted vector d.
+          a = math_ops.cast(a, dtype)
           d = math_ops.reduce_sum(
               array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
           ds.append(array_ops.reshape(d, [-1, attn_size]))
@@ -681,6 +685,7 @@ def attention_decoder(decoder_inputs,
         raise ValueError("Could not infer input size from input: %s" % inp.name)
 
       inputs = [inp] + attns
+      inputs = [math_ops.cast(e, dtype) for e in inputs]
       x = Linear(inputs, input_size, True)(inputs)
       # Run the RNN.
       cell_output, state = cell(x, state)
@@ -693,6 +698,7 @@ def attention_decoder(decoder_inputs,
         attns = attention(state)
 
       with variable_scope.variable_scope("AttnOutputProjection"):
+        cell_output = math_ops.cast(cell_output, dtype)
         inputs = [cell_output] + attns
         output = Linear(inputs, output_size, True)(inputs)
       if loop_function is not None:
diff --git a/tensorflow/contrib/linear_optimizer/BUILD b/tensorflow/contrib/linear_optimizer/BUILD
index 5b89c6cef9..fe0ba19fcb 100644
--- a/tensorflow/contrib/linear_optimizer/BUILD
+++ b/tensorflow/contrib/linear_optimizer/BUILD
@@ -41,6 +41,7 @@ py_test(
     size = "medium",
     srcs = ["python/kernel_tests/sdca_ops_test.py"],
     srcs_version = "PY2AND3",
+    tags = ["no_windows_gpu"],
     deps = [
         ":sdca_ops_py",
         ":sparse_feature_column_py",
diff --git a/tensorflow/contrib/lite/BUILD b/tensorflow/contrib/lite/BUILD
index 73f5c1448d..1e6f1e7da2 100644
--- a/tensorflow/contrib/lite/BUILD
+++ b/tensorflow/contrib/lite/BUILD
@@ -47,6 +47,10 @@ cc_test(
     name = "arena_planner_test",
     size = "small",
     srcs = ["arena_planner_test.cc"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
     deps = [
         ":arena_planner",
         "//tensorflow/contrib/lite/testing:util",
@@ -121,10 +125,22 @@ cc_library(
         "graph_info.cc",
         "interpreter.cc",
         "model.cc",
-        "nnapi_delegate.cc",
         "op_resolver.cc",
         "optional_debug_tools.cc",
-    ],
+    ] + select({
+        "//tensorflow:android": [
+            "nnapi_delegate.cc",
+            "mmap_allocation.cc",
+        ],
+        "//tensorflow:windows": [
+            "nnapi_delegate_disabled.cc",
+            "mmap_allocation_disabled.cc",
+        ],
+        "//conditions:default": [
+            "nnapi_delegate_disabled.cc",
+            "mmap_allocation.cc",
+        ],
+    }),
     hdrs = [
         "allocation.h",
         "context.h",
@@ -146,6 +162,7 @@ cc_library(
         ":memory_planner",
         ":schema_fbs_version",
         ":simple_memory_arena",
+        ":string",
         ":util",
         "//tensorflow/contrib/lite/kernels:eigen_support",
         "//tensorflow/contrib/lite/kernels:gemm_support",
@@ -199,6 +216,7 @@ cc_test(
     name = "graph_info_test",
     size = "small",
     srcs = ["graph_info_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":framework",
         ":string_util",
@@ -243,6 +261,7 @@ cc_test(
     name = "op_resolver_test",
     size = "small",
     srcs = ["op_resolver_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":framework",
         "//tensorflow/contrib/lite/testing:util",
@@ -275,6 +294,7 @@ cc_test(
     name = "util_test",
     size = "small",
     srcs = ["util_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":context",
         ":util",
diff --git a/tensorflow/contrib/lite/Makefile b/tensorflow/contrib/lite/Makefile
index a616138d33..92635ff430 100644
--- a/tensorflow/contrib/lite/Makefile
+++ b/tensorflow/contrib/lite/Makefile
@@ -82,8 +82,9 @@ endif
 
 # Settings for the host compiler.
 CXX := $(CC_PREFIX) ${TARGET_TOOLCHAIN_PREFIX}g++
-CXXFLAGS += --std=c++11 -O3 -DNDEBUG
+CXXFLAGS += -O3 -DNDEBUG
 CCFLAGS := ${CXXFLAGS}
+CXXFLAGS += --std=c++11
 CC := $(CC_PREFIX) ${TARGET_TOOLCHAIN_PREFIX}gcc
 AR := $(CC_PREFIX) ${TARGET_TOOLCHAIN_PREFIX}ar
 CFLAGS :=
@@ -175,8 +176,13 @@ $(wildcard tensorflow/contrib/lite/kernels/test_util.cc) \
 $(MINIMAL_SRCS)
 ifeq ($(BUILD_TYPE),micro)
 CORE_CC_EXCLUDE_SRCS += \
+tensorflow/contrib/lite/mmap_allocation.cc \
 tensorflow/contrib/lite/model.cc \
 tensorflow/contrib/lite/nnapi_delegate.cc
+else
+CORE_CC_EXCLUDE_SRCS += \
+tensorflow/contrib/lite/mmap_allocation_disabled.cc \
+tensorflow/contrib/lite/nnapi_delegate_disabled.cc
 endif
 # Filter out all the excluded files.
 TF_LITE_CC_SRCS := $(filter-out $(CORE_CC_EXCLUDE_SRCS), $(CORE_CC_ALL_SRCS))
diff --git a/tensorflow/contrib/lite/allocation.cc b/tensorflow/contrib/lite/allocation.cc
index c42622ff02..6d834f71c0 100644
--- a/tensorflow/contrib/lite/allocation.cc
+++ b/tensorflow/contrib/lite/allocation.cc
@@ -13,61 +13,21 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include <fcntl.h>
-#ifndef TFLITE_MCU
-#include <sys/mman.h>
-#endif
+#include "tensorflow/contrib/lite/allocation.h"
+
 #include <sys/stat.h>
 #include <sys/types.h>
-#include <unistd.h>
 #include <cassert>
 #include <cstdarg>
 #include <cstdint>
 #include <cstring>
 #include <utility>
 
-#include "tensorflow/contrib/lite/allocation.h"
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/error_reporter.h"
-#ifndef TFLITE_MCU
-#include "tensorflow/contrib/lite/nnapi_delegate.h"
-#endif
 
 namespace tflite {
 
-#ifndef TFLITE_MCU
-MMAPAllocation::MMAPAllocation(const char* filename,
-                               ErrorReporter* error_reporter)
-    : Allocation(error_reporter), mmapped_buffer_(MAP_FAILED) {
-  mmap_fd_ = open(filename, O_RDONLY);
-  if (mmap_fd_ == -1) {
-    error_reporter_->Report("Could not open '%s'.", filename);
-    return;
-  }
-  struct stat sb;
-  fstat(mmap_fd_, &sb);
-  buffer_size_bytes_ = sb.st_size;
-  mmapped_buffer_ =
-      mmap(nullptr, buffer_size_bytes_, PROT_READ, MAP_SHARED, mmap_fd_, 0);
-  if (mmapped_buffer_ == MAP_FAILED) {
-    error_reporter_->Report("Mmap of '%s' failed.", filename);
-    return;
-  }
-}
-
-MMAPAllocation::~MMAPAllocation() {
-  if (valid()) {
-    munmap(const_cast<void*>(mmapped_buffer_), buffer_size_bytes_);
-  }
-  if (mmap_fd_ != -1) close(mmap_fd_);
-}
-
-const void* MMAPAllocation::base() const { return mmapped_buffer_; }
-
-size_t MMAPAllocation::bytes() const { return buffer_size_bytes_; }
-
-bool MMAPAllocation::valid() const { return mmapped_buffer_ != MAP_FAILED; }
-
 FileCopyAllocation::FileCopyAllocation(const char* filename,
                                        ErrorReporter* error_reporter)
     : Allocation(error_reporter) {
@@ -99,7 +59,9 @@ FileCopyAllocation::FileCopyAllocation(const char* filename,
                             filename);
     return;
   }
-  copied_buffer_ = std::move(buffer);
+  // Versions of GCC before 6.2.0 don't support std::move from non-const
+  // char[] to const char[] unique_ptrs.
+  copied_buffer_.reset(const_cast<char const*>(buffer.release()));
 }
 
 FileCopyAllocation::~FileCopyAllocation() {}
@@ -116,7 +78,6 @@ MemoryAllocation::MemoryAllocation(const void* ptr, size_t num_bytes,
   buffer_ = ptr;
   buffer_size_bytes_ = num_bytes;
 }
-#endif
 
 MemoryAllocation::~MemoryAllocation() {}
 
diff --git a/tensorflow/contrib/lite/allocation.h b/tensorflow/contrib/lite/allocation.h
index 68aee2e644..121f3d2646 100644
--- a/tensorflow/contrib/lite/allocation.h
+++ b/tensorflow/contrib/lite/allocation.h
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/error_reporter.h"
 #include "tensorflow/contrib/lite/simple_memory_arena.h"
+#include "tensorflow/contrib/lite/string.h"
 
 namespace tflite {
 
@@ -51,6 +52,8 @@ class MMAPAllocation : public Allocation {
   size_t bytes() const override;
   bool valid() const override;
 
+  static bool IsSupported();
+
  protected:
   // Data required for mmap.
   int mmap_fd_ = -1;  // mmap file descriptor
diff --git a/tensorflow/contrib/lite/arena_planner.cc b/tensorflow/contrib/lite/arena_planner.cc
index 4257e754ad..02442575b3 100644
--- a/tensorflow/contrib/lite/arena_planner.cc
+++ b/tensorflow/contrib/lite/arena_planner.cc
@@ -17,14 +17,6 @@ limitations under the License.
 
 namespace tflite {
 
-namespace {
-
-// Memory allocation tuning
-constexpr const int kDefaultArenaAlignment = 64;
-constexpr const int kDefaultTensorAlignment = 4;
-
-}  // namespace
-
 struct AllocationInfo {
   // The node index requesting this allocation.
   int node;
@@ -36,12 +28,16 @@ struct AllocationInfo {
 
 ArenaPlanner::ArenaPlanner(TfLiteContext* context,
                            std::unique_ptr<GraphInfo> graph_info,
-                           bool preserve_inputs)
+                           bool preserve_inputs, bool preserve_intermediates,
+                           int tensor_alignment)
     : context_(context),
       graph_info_(std::move(graph_info)),
       arena_(kDefaultArenaAlignment),
       persistent_arena_(kDefaultArenaAlignment),
-      preserve_inputs_(preserve_inputs) {}
+      preserve_inputs_(preserve_inputs),
+      preserve_intermediates_(preserve_intermediates),
+      tensor_alignment_(tensor_alignment) {}
+
 ArenaPlanner::~ArenaPlanner() {}
 
 int64_t ArenaPlanner::BasePointer(TfLiteAllocationType type) {
@@ -164,13 +160,15 @@ TfLiteStatus ArenaPlanner::PlanAllocations() {
 
     // Then update the ref-counts of the node's inputs, and if necessary queue
     // them for deallocation.
-    TfLiteIntArray* node_inputs = node.inputs;
-    for (int j = 0; j < node_inputs->size; ++j) {
-      int tensor_index = node_inputs->data[j];
-      if (tensor_index != kOptionalTensor) {
-        refcounts[tensor_index]--;
-        if (refcounts[tensor_index] == 0) {
-          TF_LITE_ENSURE_STATUS(deallocate(i, tensor_index));
+    if (!preserve_intermediates_) {
+      TfLiteIntArray* node_inputs = node.inputs;
+      for (int j = 0; j < node_inputs->size; ++j) {
+        int tensor_index = node_inputs->data[j];
+        if (tensor_index != kOptionalTensor) {
+          refcounts[tensor_index]--;
+          if (refcounts[tensor_index] == 0) {
+            TF_LITE_ENSURE_STATUS(deallocate(i, tensor_index));
+          }
         }
       }
     }
@@ -261,14 +259,12 @@ TfLiteStatus ArenaPlanner::ResolveTensorAllocation(int tensor_index) {
 TfLiteStatus ArenaPlanner::CalculateTensorAllocation(int tensor_index) {
   TfLiteTensor& tensor = *graph_info_->tensor(tensor_index);
   if (tensor.allocation_type == kTfLiteArenaRw) {
-    TF_LITE_ENSURE_STATUS(arena_.Allocate(context_, kDefaultTensorAlignment,
-                                          tensor.bytes,
-                                          &allocs_[tensor_index]));
+    TF_LITE_ENSURE_STATUS(arena_.Allocate(
+        context_, tensor_alignment_, tensor.bytes, &allocs_[tensor_index]));
   }
   if (tensor.allocation_type == kTfLiteArenaRwPersistent) {
-    TF_LITE_ENSURE_STATUS(
-        persistent_arena_.Allocate(context_, kDefaultTensorAlignment,
-                                   tensor.bytes, &allocs_[tensor_index]));
+    TF_LITE_ENSURE_STATUS(persistent_arena_.Allocate(
+        context_, tensor_alignment_, tensor.bytes, &allocs_[tensor_index]));
   }
   return kTfLiteOk;
 }
diff --git a/tensorflow/contrib/lite/arena_planner.h b/tensorflow/contrib/lite/arena_planner.h
index 1d84950e91..55003cf4e9 100644
--- a/tensorflow/contrib/lite/arena_planner.h
+++ b/tensorflow/contrib/lite/arena_planner.h
@@ -25,6 +25,10 @@ limitations under the License.
 
 namespace tflite {
 
+// Memory allocation tuning
+constexpr const int kDefaultArenaAlignment = 64;
+constexpr const int kDefaultTensorAlignment = 64;
+
 struct AllocationInfo;
 
 // A memory planner that makes all the allocations using arenas.
@@ -47,7 +51,8 @@ class ArenaPlanner : public MemoryPlanner {
   // graph will not share memory with any other tensor, effectively preserving
   // them until the end of inference.
   ArenaPlanner(TfLiteContext* context, std::unique_ptr<GraphInfo> graph_info,
-               bool preserve_inputs);
+               bool preserve_inputs, bool preserve_intermediates,
+               int tensor_alignment = kDefaultTensorAlignment);
   ~ArenaPlanner() override;
   ArenaPlanner(const ArenaPlanner&) = delete;
   ArenaPlanner& operator=(const ArenaPlanner&) = delete;
@@ -104,7 +109,17 @@ class ArenaPlanner : public MemoryPlanner {
   // declared as kTfLiteArenaRwPersistent.
   SimpleMemoryArena persistent_arena_;
 
+  // Ensure that the memory self-allocated for inputs is never reused by the
+  // allocator. This allows for example, multiple runs without getting
+  // unpredictable results.
   bool preserve_inputs_;
+
+  // If true, then no overlapping of memory areas is done, meaning intermediates
+  // results can be queried after running (modulo running delegates).
+  bool preserve_intermediates_;
+
+  // Number of bytes that tensor buffers should be aligned to.
+  int tensor_alignment_;
 };
 
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/arena_planner_test.cc b/tensorflow/contrib/lite/arena_planner_test.cc
index f5bd1932f9..7d7c41289c 100644
--- a/tensorflow/contrib/lite/arena_planner_test.cc
+++ b/tensorflow/contrib/lite/arena_planner_test.cc
@@ -24,6 +24,8 @@ limitations under the License.
 namespace tflite {
 namespace {
 
+constexpr const int kTensorAlignment = 4;
+
 // A simple op to be used in tests, as syntactic sugar.
 class TestOp {
  public:
@@ -156,7 +158,7 @@ class ArenaPlannerTest : public ::testing::Test {
     context_.ReportError = ReportError;
     planner_.reset(new ArenaPlanner(
         &context_, std::unique_ptr<GraphInfo>(new TestGraphInfo(graph)),
-        preserve_inputs));
+        preserve_inputs, /*preserve intermediates*/ false, kTensorAlignment));
     CHECK(planner_->ResetAllocations() == kTfLiteOk);
     CHECK(planner_->PlanAllocations() == kTfLiteOk);
   }
@@ -178,8 +180,8 @@ class ArenaPlannerTest : public ::testing::Test {
     const TfLiteTensor& tensor = (*graph_->tensors())[tensor_index];
     int64_t offset = GetOffset(tensor_index) + tensor.bytes;
     // We must make sure the offset is aligned to kDefaultArenaAlignment.
-    if (offset % 4 != 0) {
-      offset += 4 - offset % 4;
+    if (offset % kTensorAlignment != 0) {
+      offset += kTensorAlignment - offset % kTensorAlignment;
     }
     return offset;
   };
diff --git a/tensorflow/contrib/lite/build_def.bzl b/tensorflow/contrib/lite/build_def.bzl
index 5543acc1f5..422584c0ea 100644
--- a/tensorflow/contrib/lite/build_def.bzl
+++ b/tensorflow/contrib/lite/build_def.bzl
@@ -1,193 +1,218 @@
 """Generate Flatbuffer binary from json."""
+
 load(
     "//tensorflow:tensorflow.bzl",
+    "tf_cc_shared_object",
     "tf_cc_test",
 )
 
 def tflite_copts():
-  """Defines compile time flags."""
-  copts = [
-      "-DFARMHASH_NO_CXX_STRING",
-  ] + select({
-          str(Label("//tensorflow:android_arm64")): [
-              "-std=c++11",
-              "-O3",
-          ],
-          str(Label("//tensorflow:android_arm")): [
-              "-mfpu=neon",
-              "-mfloat-abi=softfp",
-              "-std=c++11",
-              "-O3",
-          ],
-          str(Label("//tensorflow:android_x86")): [
-              "-DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK",
-          ],
-          str(Label("//tensorflow:ios_x86_64")): [
-              "-msse4.1",
-          ],
-          "//conditions:default": [],
-  }) + select({
-      str(Label("//tensorflow:with_default_optimizations")): [],
-      "//conditions:default": ["-DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK"],
-  })
+    """Defines compile time flags."""
+    copts = [
+        "-DFARMHASH_NO_CXX_STRING",
+    ] + select({
+        str(Label("//tensorflow:android_arm64")): [
+            "-std=c++11",
+            "-O3",
+        ],
+        str(Label("//tensorflow:android_arm")): [
+            "-mfpu=neon",
+            "-mfloat-abi=softfp",
+            "-std=c++11",
+            "-O3",
+        ],
+        str(Label("//tensorflow:android_x86")): [
+            "-DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK",
+        ],
+        str(Label("//tensorflow:ios_x86_64")): [
+            "-msse4.1",
+        ],
+        str(Label("//tensorflow:windows")): [
+            "/DTF_COMPILE_LIBRARY",
+        ],
+        "//conditions:default": [],
+    }) + select({
+        str(Label("//tensorflow:with_default_optimizations")): [],
+        "//conditions:default": ["-DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK"],
+    })
 
-  return copts
+    return copts
 
 LINKER_SCRIPT = "//tensorflow/contrib/lite/java/src/main/native:version_script.lds"
 
 def tflite_linkopts_unstripped():
-  """Defines linker flags to reduce size of TFLite binary.
+    """Defines linker flags to reduce size of TFLite binary.
 
-     These are useful when trying to investigate the relative size of the
-     symbols in TFLite.
+       These are useful when trying to investigate the relative size of the
+       symbols in TFLite.
 
-  Returns:
-     a select object with proper linkopts
-  """
-  return select({
-      "//tensorflow:android": [
-          "-Wl,--no-export-dynamic", # Only inc syms referenced by dynamic obj.
-          "-Wl,--exclude-libs,ALL",  # Exclude syms in all libs from auto export.
-          "-Wl,--gc-sections", # Eliminate unused code and data.
-          "-Wl,--as-needed", # Don't link unused libs.
-      ],
-      "//tensorflow/contrib/lite:mips": [],
-      "//tensorflow/contrib/lite:mips64": [],
-      "//conditions:default": [
-          "-Wl,--icf=all",  # Identical code folding.
-      ],
-  })
+    Returns:
+       a select object with proper linkopts
+    """
+    return select({
+        "//tensorflow:android": [
+            "-Wl,--no-export-dynamic",  # Only inc syms referenced by dynamic obj.
+            "-Wl,--exclude-libs,ALL",  # Exclude syms in all libs from auto export.
+            "-Wl,--gc-sections",  # Eliminate unused code and data.
+            "-Wl,--as-needed",  # Don't link unused libs.
+        ],
+        "//tensorflow:darwin": [],
+        "//tensorflow/contrib/lite:mips": [],
+        "//tensorflow/contrib/lite:mips64": [],
+        "//conditions:default": [
+            "-Wl,--icf=all",  # Identical code folding.
+        ],
+    })
 
 def tflite_jni_linkopts_unstripped():
-  """Defines linker flags to reduce size of TFLite binary with JNI.
+    """Defines linker flags to reduce size of TFLite binary with JNI.
 
-     These are useful when trying to investigate the relative size of the
-     symbols in TFLite.
+       These are useful when trying to investigate the relative size of the
+       symbols in TFLite.
 
-  Returns:
-     a select object with proper linkopts
-  """
-  return select({
-      "//tensorflow:android": [
-          "-Wl,--gc-sections", # Eliminate unused code and data.
-          "-Wl,--as-needed", # Don't link unused libs.
-      ],
-      "//tensorflow/contrib/lite:mips": [],
-      "//tensorflow/contrib/lite:mips64": [],
-      "//conditions:default": [
-          "-Wl,--icf=all",  # Identical code folding.
-      ],
-  })
+    Returns:
+       a select object with proper linkopts
+    """
+    return select({
+        "//tensorflow:android": [
+            "-Wl,--gc-sections",  # Eliminate unused code and data.
+            "-Wl,--as-needed",  # Don't link unused libs.
+        ],
+        "//tensorflow:darwin": [],
+        "//tensorflow/contrib/lite:mips": [],
+        "//tensorflow/contrib/lite:mips64": [],
+        "//conditions:default": [
+            "-Wl,--icf=all",  # Identical code folding.
+        ],
+    })
 
 def tflite_linkopts():
-  """Defines linker flags to reduce size of TFLite binary."""
-  return tflite_linkopts_unstripped() + select({
-      "//tensorflow:android": [
-          "-s",  # Omit symbol table.
-      ],
-      "//conditions:default": [],
-  })
+    """Defines linker flags to reduce size of TFLite binary."""
+    return tflite_linkopts_unstripped() + select({
+        "//tensorflow:android": [
+            "-s",  # Omit symbol table.
+        ],
+        "//conditions:default": [],
+    })
 
 def tflite_jni_linkopts():
-  """Defines linker flags to reduce size of TFLite binary with JNI."""
-  return tflite_jni_linkopts_unstripped() + select({
-      "//tensorflow:android": [
-          "-s",  # Omit symbol table.
-          "-latomic",  # Required for some uses of ISO C++11 <atomic> in x86.
-      ],
-      "//conditions:default": [],
-  })
+    """Defines linker flags to reduce size of TFLite binary with JNI."""
+    return tflite_jni_linkopts_unstripped() + select({
+        "//tensorflow:android": [
+            "-s",  # Omit symbol table.
+            "-latomic",  # Required for some uses of ISO C++11 <atomic> in x86.
+        ],
+        "//conditions:default": [],
+    })
+
+def tflite_jni_binary(
+        name,
+        copts = tflite_copts(),
+        linkopts = tflite_jni_linkopts(),
+        linkscript = LINKER_SCRIPT,
+        linkshared = 1,
+        linkstatic = 1,
+        deps = []):
+    """Builds a jni binary for TFLite."""
+    linkopts = linkopts + [
+        "-Wl,--version-script",  # Export only jni functions & classes.
+        "$(location {})".format(linkscript),
+    ]
+    native.cc_binary(
+        name = name,
+        copts = copts,
+        linkshared = linkshared,
+        linkstatic = linkstatic,
+        deps = deps + [linkscript],
+        linkopts = linkopts,
+    )
 
-def tflite_jni_binary(name,
-                      copts=tflite_copts(),
-                      linkopts=tflite_jni_linkopts(),
-                      linkscript=LINKER_SCRIPT,
-                      linkshared=1,
-                      linkstatic=1,
-                      deps=[]):
-  """Builds a jni binary for TFLite."""
-  linkopts = linkopts + [
-      "-Wl,--version-script",  # Export only jni functions & classes.
-      "$(location {})".format(linkscript),
-  ]
-  native.cc_binary(
-      name=name,
-      copts=copts,
-      linkshared=linkshared,
-      linkstatic=linkstatic,
-      deps= deps + [linkscript],
-      linkopts=linkopts)
+def tflite_cc_shared_object(
+        name,
+        copts = tflite_copts(),
+        linkopts = [],
+        linkstatic = 1,
+        deps = []):
+    """Builds a shared object for TFLite."""
+    tf_cc_shared_object(
+        name = name,
+        copts = copts,
+        linkstatic = linkstatic,
+        linkopts = linkopts + tflite_jni_linkopts(),
+        framework_so = [],
+        deps = deps,
+    )
 
 def tf_to_tflite(name, src, options, out):
-  """Convert a frozen tensorflow graphdef to TF Lite's flatbuffer.
+    """Convert a frozen tensorflow graphdef to TF Lite's flatbuffer.
 
-  Args:
-    name: Name of rule.
-    src: name of the input graphdef file.
-    options: options passed to TOCO.
-    out: name of the output flatbuffer file.
-  """
+    Args:
+      name: Name of rule.
+      src: name of the input graphdef file.
+      options: options passed to TOCO.
+      out: name of the output flatbuffer file.
+    """
 
-  toco_cmdline = " ".join([
-      "//tensorflow/contrib/lite/toco:toco",
-      "--input_format=TENSORFLOW_GRAPHDEF",
-      "--output_format=TFLITE",
-      ("--input_file=$(location %s)" % src),
-      ("--output_file=$(location %s)" % out),
-  ] + options )
-  native.genrule(
-      name = name,
-      srcs=[src],
-      outs=[out],
-      cmd = toco_cmdline,
-      tools= ["//tensorflow/contrib/lite/toco:toco"],
-  )
+    toco_cmdline = " ".join([
+        "//tensorflow/contrib/lite/toco:toco",
+        "--input_format=TENSORFLOW_GRAPHDEF",
+        "--output_format=TFLITE",
+        ("--input_file=$(location %s)" % src),
+        ("--output_file=$(location %s)" % out),
+    ] + options)
+    native.genrule(
+        name = name,
+        srcs = [src],
+        outs = [out],
+        cmd = toco_cmdline,
+        tools = ["//tensorflow/contrib/lite/toco:toco"],
+    )
 
 def tflite_to_json(name, src, out):
-  """Convert a TF Lite flatbuffer to JSON.
+    """Convert a TF Lite flatbuffer to JSON.
 
-  Args:
-    name: Name of rule.
-    src: name of the input flatbuffer file.
-    out: name of the output JSON file.
-  """
+    Args:
+      name: Name of rule.
+      src: name of the input flatbuffer file.
+      out: name of the output JSON file.
+    """
 
-  flatc = "@flatbuffers//:flatc"
-  schema = "//tensorflow/contrib/lite/schema:schema.fbs"
-  native.genrule(
-      name = name,
-      srcs = [schema, src],
-      outs = [out],
-      cmd = ("TMP=`mktemp`; cp $(location %s) $${TMP}.bin &&"  +
-             "$(location %s) --raw-binary --strict-json -t" +
-             " -o /tmp $(location %s) -- $${TMP}.bin &&" +
-             "cp $${TMP}.json $(location %s)")
-            % (src, flatc, schema, out),
-      tools = [flatc],
-  )
+    flatc = "@flatbuffers//:flatc"
+    schema = "//tensorflow/contrib/lite/schema:schema.fbs"
+    native.genrule(
+        name = name,
+        srcs = [schema, src],
+        outs = [out],
+        cmd = ("TMP=`mktemp`; cp $(location %s) $${TMP}.bin &&" +
+               "$(location %s) --raw-binary --strict-json -t" +
+               " -o /tmp $(location %s) -- $${TMP}.bin &&" +
+               "cp $${TMP}.json $(location %s)") %
+              (src, flatc, schema, out),
+        tools = [flatc],
+    )
 
 def json_to_tflite(name, src, out):
-  """Convert a JSON file to TF Lite's flatbuffer.
+    """Convert a JSON file to TF Lite's flatbuffer.
 
-  Args:
-    name: Name of rule.
-    src: name of the input JSON file.
-    out: name of the output flatbuffer file.
-  """
+    Args:
+      name: Name of rule.
+      src: name of the input JSON file.
+      out: name of the output flatbuffer file.
+    """
 
-  flatc = "@flatbuffers//:flatc"
-  schema = "//tensorflow/contrib/lite/schema:schema_fbs"
-  native.genrule(
-      name = name,
-      srcs = [schema, src],
-      outs = [out],
-      cmd = ("TMP=`mktemp`; cp $(location %s) $${TMP}.json &&"  +
-             "$(location %s) --raw-binary --unknown-json --allow-non-utf8 -b" +
-             " -o /tmp $(location %s) $${TMP}.json &&" +
-             "cp $${TMP}.bin $(location %s)")
-      % (src, flatc, schema, out),
-      tools = [flatc],
-  )
+    flatc = "@flatbuffers//:flatc"
+    schema = "//tensorflow/contrib/lite/schema:schema_fbs"
+    native.genrule(
+        name = name,
+        srcs = [schema, src],
+        outs = [out],
+        cmd = ("TMP=`mktemp`; cp $(location %s) $${TMP}.json &&" +
+               "$(location %s) --raw-binary --unknown-json --allow-non-utf8 -b" +
+               " -o /tmp $(location %s) $${TMP}.json &&" +
+               "cp $${TMP}.bin $(location %s)") %
+              (src, flatc, schema, out),
+        tools = [flatc],
+    )
 
 # This is the master list of generated examples that will be made into tests. A
 # function called make_XXX_tests() must also appear in generate_examples.py.
@@ -195,7 +220,7 @@ def json_to_tflite(name, src, out):
 def generated_test_models():
     return [
         "add",
-        "arg_max",
+        "arg_min_max",
         "avg_pool",
         "batch_to_space_nd",
         "concat",
@@ -222,6 +247,7 @@ def generated_test_models():
         "local_response_norm",
         "log_softmax",
         "log",
+        "logical_or",
         "lstm",
         "max_pool",
         "maximum",
@@ -230,10 +256,14 @@ def generated_test_models():
         "mul",
         "neg",
         "not_equal",
+        "one_hot",
+        "pack",
         "pad",
         "padv2",
-        # "prelu",
+        "prelu",
         "pow",
+        "reduce_max",
+        #"reduce_prod",  # disabled due to b/111823366
         "relu",
         "relu1",
         "relu6",
@@ -257,63 +287,63 @@ def generated_test_models():
         "tile",
         "topk",
         "transpose",
-        "transpose_conv",
+        #"transpose_conv",   # disabled due to b/111213074
         "where",
     ]
 
 def gen_zip_test(name, test_name, **kwargs):
-  """Generate a zipped-example test and its dependent zip files.
+    """Generate a zipped-example test and its dependent zip files.
 
-  Args:
-    name: Resulting cc_test target name
-    test_name: Test targets this model. Comes from the list above.
-    **kwargs: tf_cc_test kwargs.
-  """
-  gen_zipped_test_file(
-      name = "zip_%s" % test_name,
-      file = "%s.zip" % test_name,
-  )
-  tf_cc_test(name, **kwargs)
+    Args:
+      name: Resulting cc_test target name
+      test_name: Test targets this model. Comes from the list above.
+      **kwargs: tf_cc_test kwargs.
+    """
+    gen_zipped_test_file(
+        name = "zip_%s" % test_name,
+        file = "%s.zip" % test_name,
+    )
+    tf_cc_test(name, **kwargs)
 
 def gen_zipped_test_file(name, file):
-  """Generate a zip file of tests by using :generate_examples.
+    """Generate a zip file of tests by using :generate_examples.
 
-  Args:
-    name: Name of output. We will produce "`file`.files" as a target.
-    file: The name of one of the generated_examples targets, e.g. "transpose"
-  """
-  toco = "//tensorflow/contrib/lite/toco:toco"
-  native.genrule(
-      name = file + ".files",
-      cmd = ("$(locations :generate_examples) --toco $(locations %s) " % toco
-             + " --zip_to_output " + file + " $(@D)"),
-      outs = [file],
-      tools = [
-          ":generate_examples",
-          toco,
-      ],
-  )
+    Args:
+      name: Name of output. We will produce "`file`.files" as a target.
+      file: The name of one of the generated_examples targets, e.g. "transpose"
+    """
+    toco = "//tensorflow/contrib/lite/toco:toco"
+    native.genrule(
+        name = file + ".files",
+        cmd = ("$(locations :generate_examples) --toco $(locations %s) " % toco +
+               " --zip_to_output " + file + " $(@D)"),
+        outs = [file],
+        tools = [
+            ":generate_examples",
+            toco,
+        ],
+    )
 
-  native.filegroup(
-      name = name,
-      srcs = [file],
-  )
+    native.filegroup(
+        name = name,
+        srcs = [file],
+    )
 
 def gen_selected_ops(name, model):
-  """Generate the library that includes only used ops.
+    """Generate the library that includes only used ops.
 
-  Args:
-    name: Name of the generated library.
-    model: TFLite model to interpret.
-  """
-  out = name + "_registration.cc"
-  tool = "//tensorflow/contrib/lite/tools:generate_op_registrations"
-  tflite_path = "//tensorflow/contrib/lite"
-  native.genrule(
-      name = name,
-      srcs = [model],
-      outs = [out],
-      cmd = ("$(location %s) --input_model=$(location %s) --output_registration=$(location %s) --tflite_path=%s")
-      % (tool, model, out, tflite_path[2:]),
-      tools = [tool],
-  )
+    Args:
+      name: Name of the generated library.
+      model: TFLite model to interpret.
+    """
+    out = name + "_registration.cc"
+    tool = "//tensorflow/contrib/lite/tools:generate_op_registrations"
+    tflite_path = "//tensorflow/contrib/lite"
+    native.genrule(
+        name = name,
+        srcs = [model],
+        outs = [out],
+        cmd = ("$(location %s) --input_model=$(location %s) --output_registration=$(location %s) --tflite_path=%s") %
+              (tool, model, out, tflite_path[2:]),
+        tools = [tool],
+    )
diff --git a/tensorflow/contrib/lite/build_ios_universal_lib.sh b/tensorflow/contrib/lite/build_ios_universal_lib.sh
index e9531aef19..31df43a175 100755
--- a/tensorflow/contrib/lite/build_ios_universal_lib.sh
+++ b/tensorflow/contrib/lite/build_ios_universal_lib.sh
@@ -21,7 +21,7 @@ cd "$SCRIPT_DIR/../../.."
 
 # Build library for supported architectures and packs them in a fat binary.
 make_library() {
-    for arch in x86_64 i386 armv7 armv7s arm64
+    for arch in x86_64 armv7 armv7s arm64
     do
         make -f tensorflow/contrib/lite/Makefile TARGET=IOS IOS_ARCH=${arch} \
         -j 8 \
@@ -29,7 +29,6 @@ make_library() {
     done
     lipo \
     tensorflow/contrib/lite/gen/lib/ios_x86_64/${1} \
-    tensorflow/contrib/lite/gen/lib/ios_i386/${1} \
     tensorflow/contrib/lite/gen/lib/ios_armv7/${1} \
     tensorflow/contrib/lite/gen/lib/ios_armv7s/${1} \
     tensorflow/contrib/lite/gen/lib/ios_arm64/${1} \
diff --git a/tensorflow/contrib/lite/builtin_op_data.h b/tensorflow/contrib/lite/builtin_op_data.h
index cda889bf50..70178b2faa 100644
--- a/tensorflow/contrib/lite/builtin_op_data.h
+++ b/tensorflow/contrib/lite/builtin_op_data.h
@@ -250,6 +250,10 @@ typedef struct {
 } TfLiteArgMaxParams;
 
 typedef struct {
+  TfLiteType output_type;
+} TfLiteArgMinParams;
+
+typedef struct {
   TfLitePadding padding;
   int stride_width;
   int stride_height;
@@ -263,6 +267,25 @@ typedef struct {
   TfLiteType out_type;
 } TfLiteShapeParams;
 
+typedef struct {
+  // Parameters supported by version 1:
+  float min;
+  float max;
+  int num_bits;
+
+  // Parameters supported by version 2:
+  bool narrow_range;
+} TfLiteFakeQuantParams;
+
+typedef struct {
+  int values_count;
+  int axis;
+} TfLitePackParams;
+
+typedef struct {
+  int axis;
+} TfLiteOneHotParams;
+
 #ifdef __cplusplus
 }  // extern "C"
 #endif  // __cplusplus
diff --git a/tensorflow/contrib/lite/builtin_ops.h b/tensorflow/contrib/lite/builtin_ops.h
index a44e918230..8a8eb98568 100644
--- a/tensorflow/contrib/lite/builtin_ops.h
+++ b/tensorflow/contrib/lite/builtin_ops.h
@@ -104,6 +104,15 @@ typedef enum {
   kTfLiteBuiltinRsqrt = 76,
   kTfLiteBuiltinShape = 77,
   kTfLiteBuiltinPow = 78,
+  kTfLiteBuiltinArgMin = 79,
+  kTfLiteBuiltinFakeQuant = 80,
+  kTfLiteBuiltinReduceProd = 81,
+  kTfLiteBuiltinReduceMax = 82,
+  kTfLiteBuiltinPack = 83,
+  kTfLiteBuiltinLogicalOr = 84,
+  kTfLiteBuiltinOneHot = 85,
+  kTfLiteBuiltinLogicalAnd = 86,
+  kTfLiteBuiltinLogicalNot = 87,
 } TfLiteBuiltinOperator;
 
 #ifdef __cplusplus
diff --git a/tensorflow/contrib/lite/context.h b/tensorflow/contrib/lite/context.h
index 4f260ad40a..5bc20106d3 100644
--- a/tensorflow/contrib/lite/context.h
+++ b/tensorflow/contrib/lite/context.h
@@ -29,6 +29,9 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_LITE_CONTEXT_H_
 #define TENSORFLOW_CONTRIB_LITE_CONTEXT_H_
 
+#if defined(_MSC_VER)
+#include <complex.h>
+#endif
 #include <stdbool.h>
 #include <stdint.h>
 #include <stdlib.h>
@@ -39,6 +42,26 @@ extern "C" {
 
 typedef enum { kTfLiteOk = 0, kTfLiteError = 1 } TfLiteStatus;
 
+// The list of external context types known to TF Lite. This list exists solely
+// to avoid conflicts and to ensure ops can share the external contexts they
+// need. Access to the external contexts is controled by one of the
+// corresponding support files.
+typedef enum {
+  kTfLiteEigenContext = 0,     // include eigen_support.h to use.
+  kTfLiteGemmLowpContext = 1,  // include gemm_support.h to use.
+  kTfLiteMaxExternalContexts = 2
+} TfLiteExternalContextType;
+
+// An external context is a collection of information unrelated to the TF Lite
+// framework, but useful to a subset of the ops. TF Lite knows very little
+// about about the actual contexts, but it keeps a list of them, and is able to
+// refresh them if configurations like the number of recommended threads
+// change.
+typedef struct {
+  TfLiteExternalContextType type;
+  TfLiteStatus (*Refresh)(struct TfLiteContext* context);
+} TfLiteExternalContext;
+
 // Forward declare so GetNode can use this is in Context.
 typedef struct _TfLiteRegistration TfLiteRegistration;
 typedef struct _TfLiteDelegate TfLiteDelegate;
@@ -160,7 +183,11 @@ typedef union {
   uint8_t* uint8;
   bool* b;
   int16_t* i16;
+#if defined(_MSC_VER)
+  _Fcomplex* c64;
+#else
   _Complex float* c64;
+#endif
 } TfLitePtrUnion;
 
 // Memory allocation strategies. kTfLiteMmapRo is for read-only memory-mapped
@@ -339,10 +366,15 @@ typedef struct TfLiteContext {
   // eigen.
   int recommended_num_threads;
 
-  // TODO(ahentz): we should create a more general mechanism for this sort of
-  // library-global objects.
-  void* gemm_context;
-  void* eigen_context;
+  // Access external contexts by type.
+  // WARNING: This is an experimental interface that is subject to change.
+  TfLiteExternalContext* (*GetExternalContext)(struct TfLiteContext*,
+                                               TfLiteExternalContextType);
+  // Set the value of a external context. Does not take ownership of the
+  // pointer.
+  // WARNING: This is an experimental interface that is subject to change.
+  void (*SetExternalContext)(struct TfLiteContext*, TfLiteExternalContextType,
+                             TfLiteExternalContext*);
 } TfLiteContext;
 
 typedef struct _TfLiteRegistration {
@@ -439,6 +471,12 @@ typedef struct _TfLiteDelegate {
 } TfLiteDelegate;
 
 // WARNING: This is an experimental interface that is subject to change.
+//
+// Currently, TfLiteDelegateParams has to be allocated in a way that it's
+// trivially destructable. It will be stored as `builtin_data` field in
+// `TfLiteNode` of the delegate node.
+//
+// See also the `CreateDelegateParams` function in `interpreter.cc` details.
 typedef struct {
   TfLiteDelegate* delegate;
   TfLiteIntArray* nodes_to_replace;
diff --git a/tensorflow/contrib/lite/delegates/eager/BUILD b/tensorflow/contrib/lite/delegates/eager/BUILD
new file mode 100644
index 0000000000..40afa0df41
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/BUILD
@@ -0,0 +1,144 @@
+#
+# This is a TF Lite delegate that is powered by TensorFlow's Eager.
+#
+package(default_visibility = [
+    "//visibility:public",
+])
+
+licenses(["notice"])  # Apache 2.0
+
+cc_library(
+    name = "buffer_map",
+    srcs = ["buffer_map.cc"],
+    hdrs = ["buffer_map.h"],
+    deps = [
+        ":util",
+        "//tensorflow/c:c_api_internal",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:kernel_api",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
+cc_test(
+    name = "buffer_map_test",
+    size = "small",
+    srcs = ["buffer_map_test.cc"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
+    deps = [
+        ":buffer_map",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:util",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+cc_library(
+    name = "delegate_data",
+    srcs = ["delegate_data.cc"],
+    hdrs = ["delegate_data.h"],
+    deps = [
+        ":buffer_map",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/common_runtime/eager:context",
+    ],
+)
+
+cc_test(
+    name = "delegate_data_test",
+    size = "small",
+    srcs = ["delegate_data_test.cc"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
+    deps = [
+        ":delegate_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:util",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+cc_library(
+    name = "kernel",
+    srcs = ["kernel.cc"],
+    hdrs = ["kernel.h"],
+    deps = [
+        ":delegate_data",
+        ":util",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:kernel_api",
+        "//tensorflow/contrib/lite/kernels:kernel_util",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/common_runtime/eager:context",
+        "//tensorflow/core/common_runtime/eager:execute",
+        "//tensorflow/core/common_runtime/eager:tensor_handle",
+        "@flatbuffers",
+    ],
+)
+
+cc_test(
+    name = "kernel_test",
+    size = "small",
+    srcs = ["kernel_test.cc"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
+    deps = [
+        ":delegate_data",
+        ":kernel",
+        ":test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+cc_library(
+    name = "test_util",
+    testonly = True,
+    srcs = ["test_util.cc"],
+    hdrs = ["test_util.h"],
+    deps = [
+        "//tensorflow/c:c_api_internal",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_absl//absl/memory",
+        "@flatbuffers",
+    ],
+)
+
+cc_library(
+    name = "util",
+    srcs = ["util.cc"],
+    hdrs = ["util.h"],
+    deps = [
+        "//tensorflow/c:c_api_internal",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:kernel_api",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_test(
+    name = "util_test",
+    size = "small",
+    srcs = ["util_test.cc"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
+    deps = [
+        ":util",
+        "//tensorflow/contrib/lite/testing:util",
+        "//tensorflow/core:lib",
+        "@com_google_googletest//:gtest",
+    ],
+)
diff --git a/tensorflow/contrib/lite/delegates/eager/buffer_map.cc b/tensorflow/contrib/lite/delegates/eager/buffer_map.cc
new file mode 100644
index 0000000000..e5a19c3997
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/buffer_map.cc
@@ -0,0 +1,111 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/buffer_map.h"
+
+#include "tensorflow/c/c_api_internal.h"
+#include "tensorflow/contrib/lite/delegates/eager/util.h"
+#include "tensorflow/core/framework/allocation_description.pb.h"
+#include "tensorflow/core/framework/log_memory.h"
+
+namespace tflite {
+namespace eager {
+namespace {
+// A tensor buffer that is allocated, deallocated and populated by TF Lite.
+class TfLiteTensorBuffer : public tensorflow::TensorBuffer {
+ public:
+  explicit TfLiteTensorBuffer(const TfLiteTensor* tensor) {
+    len_ = tensor->bytes;
+    // TODO(ahentz): if we can guarantee that TF Lite allocated tensors with
+    // the same alignment as TensorFlow (EIGEN_MAX_ALIGN_BYTES), then we can
+    // potentially eliminate the copy below.
+    data_ =
+        tensorflow::cpu_allocator()->AllocateRaw(EIGEN_MAX_ALIGN_BYTES, len_);
+    if (data_ != nullptr) {
+      if (tensorflow::LogMemory::IsEnabled()) {
+        tensorflow::LogMemory::RecordRawAllocation(
+            "TfLiteTensorBuffer_New",
+            tensorflow::LogMemory::EXTERNAL_TENSOR_ALLOCATION_STEP_ID, len_,
+            data_, tensorflow::cpu_allocator());
+      }
+      std::memcpy(data_, tensor->data.raw, tensor->bytes);
+    }
+  }
+
+  ~TfLiteTensorBuffer() override {
+    if (tensorflow::LogMemory::IsEnabled() && data_ != nullptr) {
+      tensorflow::LogMemory::RecordRawDeallocation(
+          "TfLiteTensorBuffer_Delete",
+          tensorflow::LogMemory::EXTERNAL_TENSOR_ALLOCATION_STEP_ID, data_,
+          tensorflow::cpu_allocator(), false);
+    }
+    tensorflow::cpu_allocator()->DeallocateRaw(data_);
+  }
+
+  void* data() const override { return data_; }
+  size_t size() const override { return len_; }
+
+  TensorBuffer* root_buffer() override { return this; }
+  void FillAllocationDescription(
+      tensorflow::AllocationDescription* proto) const override {
+    tensorflow::int64 rb = size();
+    proto->set_requested_bytes(rb);
+    proto->set_allocator_name(tensorflow::cpu_allocator()->Name());
+  }
+
+  // Prevents input forwarding from mutating this buffer.
+  bool OwnsMemory() const override { return false; }
+
+ private:
+  void* data_;
+  size_t len_;
+};
+}  // namespace
+
+BufferMap::BufferMap() {}
+
+BufferMap::~BufferMap() {}
+
+bool BufferMap::HasTensor(int tensor_index) const {
+  return id_to_tensor_.count(tensor_index) != 0;
+}
+
+tensorflow::Tensor BufferMap::GetTensor(int tensor_index) const {
+  return id_to_tensor_.at(tensor_index);
+}
+
+void BufferMap::SetFromTfLite(int tensor_index, const TfLiteTensor* tensor) {
+  tensorflow::TensorShape shape;
+  int num_dims = tensor->dims->size;
+  for (int i = 0; i < num_dims; ++i) {
+    shape.AddDim(tensor->dims->data[i]);
+  }
+  // TODO(ahentz): we assume this is a new tensor and allocate a new buffer
+  // for it. This is not always the best approach. For example, this might
+  // be a reallocation after resizing tensors. In that case we would be
+  // preferable to somehow reuse the buffer.
+  auto* buf = new TfLiteTensorBuffer(tensor);
+  tensorflow::Tensor t = tensorflow::TensorCApi::MakeTensor(
+      GetTensorFlowDataType(tensor->type), shape, buf);
+  buf->Unref();
+
+  SetFromTensorFlow(tensor_index, std::move(t));
+}
+
+void BufferMap::SetFromTensorFlow(int tensor_index, tensorflow::Tensor tensor) {
+  id_to_tensor_[tensor_index] = std::move(tensor);
+}
+
+}  // namespace eager
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/delegates/eager/buffer_map.h b/tensorflow/contrib/lite/delegates/eager/buffer_map.h
new file mode 100644
index 0000000000..a28329ae7d
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/buffer_map.h
@@ -0,0 +1,61 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_BUFFER_MAP_H_
+#define TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_BUFFER_MAP_H_
+
+#include <map>
+
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/core/framework/tensor.h"
+
+namespace tflite {
+namespace eager {
+
+// Maps a TF Lite tensor index into a TensorFlow tensor.
+//
+// The TF Lite interpreter assigns integer indices to each of its tensors, but
+// the Eager delegate deals in terms of TensorFlow tensors. This class maps
+// from indices to tensors and allows the creation of new tensors to be
+// associated with a given index.
+class BufferMap {
+ public:
+  BufferMap();
+  ~BufferMap();
+
+  // Returns true if the given 'tensor_index' has a corresponding
+  // tensorflow::Tensor.
+  bool HasTensor(int tensor_index) const;
+
+  // Returns the tensorflow::Tensor associated with the given 'tensor_index'.
+  // Precondition: HasTensor() is true.
+  tensorflow::Tensor GetTensor(int tensor_index) const;
+
+  // Associates the given tensorflow::Tensor with the given 'tensor_index'.
+  // Note that tensorflow Tensors share data buffers, so this method is only a
+  // shallow copy.
+  void SetFromTensorFlow(int tensor_index, tensorflow::Tensor tensor);
+
+  // Same as above but creates a new tensorflow::Tensor with a copy of the
+  // given TfLiteTensor's data.
+  void SetFromTfLite(int tensor_index, const TfLiteTensor* tensor);
+
+ private:
+  std::map<int, tensorflow::Tensor> id_to_tensor_;
+};
+
+}  // namespace eager
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_BUFFER_MAP_H_
diff --git a/tensorflow/contrib/lite/delegates/eager/buffer_map_test.cc b/tensorflow/contrib/lite/delegates/eager/buffer_map_test.cc
new file mode 100644
index 0000000000..dcb3f6c941
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/buffer_map_test.cc
@@ -0,0 +1,174 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/buffer_map.h"
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/testing/util.h"
+#include "tensorflow/contrib/lite/util.h"
+
+namespace tflite {
+namespace eager {
+namespace {
+
+using ::testing::ElementsAre;
+
+// A bit of RAII to simplify handling of TfLiteTensors in the tests.
+using UniqueTfLiteTensor =
+    std::unique_ptr<TfLiteTensor, std::function<void(TfLiteTensor*)>>;
+
+template <typename T>
+UniqueTfLiteTensor MakeLiteTensor(const std::vector<int>& shape,
+                                  const std::vector<T>& data) {
+  auto tensor = UniqueTfLiteTensor(new TfLiteTensor, [](TfLiteTensor* t) {
+    TfLiteTensorDataFree(t);
+    TfLiteIntArrayFree(t->dims);
+    delete t;
+  });
+  tensor->allocation_type = kTfLiteDynamic;
+  tensor->type = typeToTfLiteType<T>();
+  tensor->dims = ConvertVectorToTfLiteIntArray(shape);
+  tensor->data.raw = nullptr;
+  TfLiteTensorRealloc(data.size() * sizeof(T), tensor.get());
+  memcpy(tensor->data.raw, data.data(), data.size() * sizeof(T));
+  return tensor;
+}
+
+template <typename T>
+tensorflow::Tensor MakeTensor(const std::vector<int>& shape,
+                              const std::vector<T>& data) {
+  BufferMap buffer_map;  // BufferMap is the easiest way to build the tensor.
+  UniqueTfLiteTensor t1 = MakeLiteTensor<T>(shape, data);
+  buffer_map.SetFromTfLite(0, t1.get());
+  return buffer_map.GetTensor(0);
+}
+
+std::vector<int64> GetTensorShape(const tensorflow::Tensor& t) {
+  std::vector<int64> shape(t.dims());
+  for (int i = 0; i < t.dims(); ++i) {
+    shape[i] = t.dim_size(i);
+  }
+  return shape;
+}
+
+template <typename T>
+std::vector<T> GetTensorData(const tensorflow::Tensor& t) {
+  const T* data = t.flat<T>().data();
+  return std::vector<T>(data, data + t.NumElements());
+}
+
+TEST(BufferMapTest, EmptyBuffer) {
+  BufferMap buffer_map;
+  EXPECT_FALSE(buffer_map.HasTensor(0));
+}
+
+TEST(BufferMapTest, SetFromTfLite) {
+  BufferMap buffer_map;
+
+  UniqueTfLiteTensor t =
+      MakeLiteTensor<float>({1, 2, 1, 3}, {0, 0, 0, 0.123f, 0, 0});
+  buffer_map.SetFromTfLite(0, t.get());
+  ASSERT_TRUE(buffer_map.HasTensor(0));
+
+  EXPECT_THAT(GetTensorData<float>(buffer_map.GetTensor(0)),
+              ElementsAre(0, 0, 0, 0.123f, 0, 0));
+
+  // Also check details of the tensor.
+  tensorflow::Tensor out_tensor = buffer_map.GetTensor(0);
+  ASSERT_EQ(out_tensor.dtype(), tensorflow::DT_FLOAT);
+  ASSERT_EQ(out_tensor.NumElements(), 6);
+  ASSERT_THAT(GetTensorShape(out_tensor), ElementsAre(1, 2, 1, 3));
+}
+
+TEST(BufferMapTest, SetFromTfLiteTwice) {
+  UniqueTfLiteTensor t1 =
+      MakeLiteTensor<float>({1, 2, 1, 3}, {0, 0, 0, 0.123f, 0, 0});
+  UniqueTfLiteTensor t2 =
+      MakeLiteTensor<int>({1, 2, 4}, {0, 0, 0, 3, 0, 0, 1, 2});
+
+  BufferMap buffer_map;
+  buffer_map.SetFromTfLite(0, t1.get());
+  buffer_map.SetFromTfLite(0, t2.get());
+
+  EXPECT_THAT(GetTensorData<int>(buffer_map.GetTensor(0)),
+              ElementsAre(0, 0, 0, 3, 0, 0, 1, 2));
+}
+
+TEST(BufferMapTest, SetFromTensorFlow) {
+  tensorflow::Tensor t1 =
+      MakeTensor<float>({1, 2, 1, 3}, {0, 0, 0, 0.123f, 0, 0});
+
+  BufferMap buffer_map;
+  buffer_map.SetFromTensorFlow(0, t1);
+
+  EXPECT_THAT(GetTensorData<float>(buffer_map.GetTensor(0)),
+              ElementsAre(0, 0, 0, 0.123f, 0, 0));
+
+  // Also check details of the tensor.
+  tensorflow::Tensor out_tensor = buffer_map.GetTensor(0);
+  ASSERT_EQ(out_tensor.dtype(), tensorflow::DT_FLOAT);
+  ASSERT_EQ(out_tensor.NumElements(), 6);
+  ASSERT_THAT(GetTensorShape(out_tensor), ElementsAre(1, 2, 1, 3));
+}
+
+TEST(BufferMapTest, SetFromTensorFlowTwice) {
+  tensorflow::Tensor t1 =
+      MakeTensor<float>({1, 2, 1, 3}, {0, 0, 0, 0.123f, 0, 0});
+  tensorflow::Tensor t2 = MakeTensor<int>({1, 2, 4}, {0, 0, 0, 3, 0, 0, 1, 2});
+  BufferMap buffer_map;
+  buffer_map.SetFromTensorFlow(0, t1);
+  buffer_map.SetFromTensorFlow(0, t2);
+
+  EXPECT_THAT(GetTensorData<int>(buffer_map.GetTensor(0)),
+              ElementsAre(0, 0, 0, 3, 0, 0, 1, 2));
+}
+
+TEST(BufferMapTest, TfLiteOverwritesTensorFlow) {
+  tensorflow::Tensor t1 =
+      MakeTensor<float>({1, 2, 1, 3}, {0, 0, 0, 0.123f, 0, 0});
+  UniqueTfLiteTensor t2 =
+      MakeLiteTensor<int>({1, 2, 4}, {0, 0, 0, 3, 0, 0, 1, 2});
+
+  BufferMap buffer_map;
+  buffer_map.SetFromTensorFlow(0, t1);
+  buffer_map.SetFromTfLite(0, t2.get());
+
+  EXPECT_THAT(GetTensorData<int>(buffer_map.GetTensor(0)),
+              ElementsAre(0, 0, 0, 3, 0, 0, 1, 2));
+}
+
+TEST(BufferMapTest, TensorFlowOverwritesTfLite) {
+  tensorflow::Tensor t1 =
+      MakeTensor<float>({1, 2, 1, 3}, {0, 0, 0, 0.123f, 0, 0});
+  UniqueTfLiteTensor t2 =
+      MakeLiteTensor<int>({1, 2, 4}, {0, 0, 0, 3, 0, 0, 1, 2});
+  BufferMap buffer_map;
+  buffer_map.SetFromTfLite(0, t2.get());
+  buffer_map.SetFromTensorFlow(0, t1);
+
+  EXPECT_THAT(GetTensorData<float>(buffer_map.GetTensor(0)),
+              ElementsAre(0, 0, 0, 0.123f, 0, 0));
+}
+
+}  // namespace
+}  // namespace eager
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/delegates/eager/delegate_data.cc b/tensorflow/contrib/lite/delegates/eager/delegate_data.cc
new file mode 100644
index 0000000000..0fd5c976f8
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/delegate_data.cc
@@ -0,0 +1,47 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/delegate_data.h"
+
+#include "tensorflow/core/common_runtime/device_factory.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tflite {
+namespace eager {
+tensorflow::Status DelegateData::Create(std::unique_ptr<DelegateData>* data) {
+  std::vector<tensorflow::Device*> devices;
+
+  TF_RETURN_IF_ERROR(tensorflow::DeviceFactory::AddDevices(
+      tensorflow::SessionOptions(), "/job:localhost/replica:0/task:0",
+      &devices));
+
+  std::unique_ptr<tensorflow::DeviceMgr> device_mgr(
+      new tensorflow::DeviceMgr(devices));
+  // Note that Rendezvous is ref-counted so it will be automatically deleted.
+  tensorflow::Rendezvous* rendezvous =
+      new tensorflow::IntraProcessRendezvous(device_mgr.get());
+  data->reset(new DelegateData(new tensorflow::EagerContext(
+      tensorflow::SessionOptions(),
+      tensorflow::ContextDevicePlacementPolicy::DEVICE_PLACEMENT_SILENT,
+      /*async=*/false, std::move(device_mgr), rendezvous)));
+  return tensorflow::Status();
+}
+
+DelegateData::DelegateData(tensorflow::EagerContext* eager_context)
+    : eager_context_(eager_context) {}
+
+DelegateData::~DelegateData() {}
+
+}  // namespace eager
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/delegates/eager/delegate_data.h b/tensorflow/contrib/lite/delegates/eager/delegate_data.h
new file mode 100644
index 0000000000..8a0e8ba8bf
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/delegate_data.h
@@ -0,0 +1,48 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_DELEGATE_DATA_H_
+#define TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_DELEGATE_DATA_H_
+
+#include "tensorflow/contrib/lite/delegates/eager/buffer_map.h"
+#include "tensorflow/core/common_runtime/eager/context.h"
+
+namespace tflite {
+namespace eager {
+
+// Data kept by the Eager delegate for the lifetime of an Interpreter.
+class DelegateData {
+ public:
+  // Create a new DelegateData, initialized with a newly-created EagerContext.
+  static tensorflow::Status Create(std::unique_ptr<DelegateData>* data);
+
+  ~DelegateData();
+
+  // The EagerContext that is required for execution of Eager Ops.
+  tensorflow::EagerContext* GetEagerContext() { return eager_context_.get(); }
+
+  // Map from TF Lite tensor index to TensorFlow tensor.
+  BufferMap* GetBufferMap() { return &buffer_map_; }
+
+ private:
+  explicit DelegateData(tensorflow::EagerContext* eager_context);
+
+  std::unique_ptr<tensorflow::EagerContext> eager_context_;
+  BufferMap buffer_map_;
+};
+
+}  // namespace eager
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_DELEGATE_DATA_H_
diff --git a/tensorflow/contrib/lite/delegates/eager/delegate_data_test.cc b/tensorflow/contrib/lite/delegates/eager/delegate_data_test.cc
new file mode 100644
index 0000000000..30251b8f82
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/delegate_data_test.cc
@@ -0,0 +1,44 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/delegate_data.h"
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/testing/util.h"
+
+namespace tflite {
+namespace eager {
+namespace {
+
+TEST(DelegateDataTest, Basic) {
+  std::unique_ptr<DelegateData> data;
+  // We only check for success because it is hard to make initialization fail.
+  // It only happens if we manage to not link the CPU device factory into the
+  // binary.
+  EXPECT_TRUE(DelegateData::Create(&data).ok());
+
+  EXPECT_NE(data->GetEagerContext(), nullptr);
+  EXPECT_NE(data->GetBufferMap(), nullptr);
+}
+
+}  // namespace
+}  // namespace eager
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/delegates/eager/kernel.cc b/tensorflow/contrib/lite/delegates/eager/kernel.cc
new file mode 100644
index 0000000000..1727981807
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/kernel.cc
@@ -0,0 +1,289 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/kernel.h"
+
+#include "third_party/flatbuffers/include/flatbuffers/flexbuffers.h"
+#include "tensorflow/contrib/lite/builtin_ops.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/context_util.h"
+#include "tensorflow/contrib/lite/delegates/eager/delegate_data.h"
+#include "tensorflow/contrib/lite/delegates/eager/util.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/core/common_runtime/eager/context.h"
+#include "tensorflow/core/common_runtime/eager/execute.h"
+#include "tensorflow/core/common_runtime/eager/tensor_handle.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+
+// Note: this is part of TF Lite's Eager delegation code which is to be
+// completed soon.
+
+// This is the TF Lite op that is created by the eager delegate to handle
+// execution of a supported subgraph. The usual flow is that the delegate
+// informs the interpreter of supported nodes in a graph, and each supported
+// subgraph is replaced with one instance of this kernel.
+//
+// The kernel is initialized with TfLiteDelegateParams from which we retrieve
+// the global EagerContext and BufferMap, as well as a list of inputs and
+// outputs to the subgraph. Those are used to build the OpData, with a list of
+// TensorFlow Ops that should be executed in order (which we call an OpNode).
+//
+// For each node included in the subgraph, we query the interpreter and
+// retrieve the associated NodeDef, which is then used to configure the
+// corresponding TensorFlow/Eager Op.
+
+namespace tflite {
+namespace eager {
+namespace kernel {
+
+// Controls the lifetime of tensor handles in a vector.
+class VectorOfHandles {
+ public:
+  explicit VectorOfHandles(int num_elements) : vector_(num_elements, nullptr) {}
+
+  ~VectorOfHandles() {
+    for (auto* handle : vector_) {
+      if (handle) handle->Unref();
+    }
+  }
+
+  tensorflow::gtl::InlinedVector<tensorflow::TensorHandle*, 2>* GetVector() {
+    return &vector_;
+  }
+
+  tensorflow::TensorHandle* GetHandle(int index) { return vector_[index]; }
+
+ private:
+  tensorflow::gtl::InlinedVector<tensorflow::TensorHandle*, 2> vector_;
+};
+
+// Executes the TensorFlow op given by 'op_name', with the attributes specified
+// in 'nodedef'. Inputs and outputs are given as indices into the 'buffer_map'.
+tensorflow::Status ExecuteEagerOp(tensorflow::EagerContext* eager_context,
+                                  BufferMap* buffer_map, const string& op_name,
+                                  const tensorflow::NodeDef& nodedef,
+                                  const std::vector<int>& inputs,
+                                  const std::vector<int>& outputs) {
+  const tensorflow::AttrTypeMap* attr_types;
+  TF_RETURN_WITH_CONTEXT_IF_ERROR(
+      tensorflow::AttrTypeMapForOp(op_name.c_str(), &attr_types),
+      " (while processing attributes of '", op_name, "')");
+
+  tensorflow::EagerOperation op(eager_context, op_name.c_str(), attr_types);
+  for (const auto& attr : nodedef.attr()) {
+    op.MutableAttrs()->Set(attr.first, attr.second);
+  }
+
+  for (int input_index : inputs) {
+    if (!buffer_map->HasTensor(input_index)) {
+      return tensorflow::errors::Internal(
+          "Cannot read from invalid tensor index ", input_index);
+    }
+    auto* handle = new tensorflow::TensorHandle(
+        buffer_map->GetTensor(input_index), nullptr, nullptr, nullptr);
+    op.AddInput(handle);
+    handle->Unref();
+  }
+
+  int num_retvals = outputs.size();
+  VectorOfHandles retvals(num_retvals);
+  TF_RETURN_WITH_CONTEXT_IF_ERROR(
+      EagerExecute(&op, retvals.GetVector(), &num_retvals),
+      " (while executing '", op_name, "' via Eager)");
+
+  if (num_retvals != outputs.size()) {
+    return tensorflow::errors::Internal(
+        "Unexpected number of outputs from EagerExecute");
+  }
+
+  for (int i = 0; i < num_retvals; ++i) {
+    const tensorflow::Tensor* tensor = nullptr;
+    TF_RETURN_IF_ERROR(retvals.GetHandle(i)->Tensor(&tensor));
+    buffer_map->SetFromTensorFlow(outputs[i], *tensor);
+  }
+
+  return tensorflow::Status::OK();
+}
+
+// A single node within the larger 'op'. Note that this kernel executes many
+// TensorFlow ops within a single TF Lite op.
+struct OpNode {
+  // The name of the TensorFlow op to execute.
+  string name;
+  // The corresponding NodeDef, containing the attributes for the op.
+  tensorflow::NodeDef nodedef;
+  // List of inputs, as TF Lite tensor indices.
+  std::vector<int> inputs;
+  // List of outputs, as TF Lite tensor indices.
+  std::vector<int> outputs;
+};
+
+// The Larger 'op', which contains all the nodes in a supported subgraph.
+struct OpData {
+  tensorflow::EagerContext* eager_context;
+  BufferMap* buffer_map;
+  std::vector<OpNode> nodes;
+  std::vector<int> subgraph_inputs;
+  std::vector<int> subgraph_outputs;
+};
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* op_data = new OpData;
+
+  const TfLiteDelegateParams* params =
+      reinterpret_cast<const TfLiteDelegateParams*>(buffer);
+  CHECK(params);
+  CHECK(params->delegate);
+  CHECK(params->delegate->data_);
+  op_data->eager_context =
+      reinterpret_cast<DelegateData*>(params->delegate->data_)
+          ->GetEagerContext();
+  op_data->buffer_map =
+      reinterpret_cast<DelegateData*>(params->delegate->data_)->GetBufferMap();
+
+  CHECK(params->output_tensors);
+  for (auto tensor_index : TfLiteIntArrayView(params->output_tensors)) {
+    op_data->subgraph_outputs.push_back(tensor_index);
+  }
+
+  CHECK(params->input_tensors);
+  for (auto tensor_index : TfLiteIntArrayView(params->input_tensors)) {
+    op_data->subgraph_inputs.push_back(tensor_index);
+  }
+
+  CHECK(params->nodes_to_replace);
+  for (auto node_index : TfLiteIntArrayView(params->nodes_to_replace)) {
+    TfLiteNode* node;
+    TfLiteRegistration* reg;
+    context->GetNodeAndRegistration(context, node_index, &node, &reg);
+
+    op_data->nodes.push_back(OpNode());
+    OpNode& node_data = op_data->nodes.back();
+
+    node_data.name = "";
+    if (node->custom_initial_data) {
+      // The flexbuffer contains a vector where the first elements is the
+      // op name and the second is a serialized NodeDef.
+      const flexbuffers::Vector& v =
+          flexbuffers::GetRoot(
+              reinterpret_cast<const uint8_t*>(node->custom_initial_data),
+              node->custom_initial_data_size)
+              .AsVector();
+
+      node_data.name = v[0].AsString().str();
+      if (!node_data.nodedef.ParseFromString(v[1].AsString().str())) {
+        // We will just leave the nodedef empty and error out in Eval().
+        node_data.nodedef.Clear();
+      }
+    }
+
+    for (auto input_index : TfLiteIntArrayView(node->inputs)) {
+      node_data.inputs.push_back(input_index);
+    }
+    for (auto output_index : TfLiteIntArrayView(node->outputs)) {
+      node_data.outputs.push_back(output_index);
+    }
+  }
+
+  return op_data;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<OpData*>(buffer);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  const auto* op_data = reinterpret_cast<OpData*>(node->user_data);
+  TF_LITE_ENSURE_MSG(
+      context, op_data->eager_context != nullptr,
+      "Failed to initialize eager context. This often happens when a CPU "
+      "device has not been registered, presumably because some symbols from "
+      "tensorflow/core:core_cpu_impl were not linked into the binary.");
+
+  // Whenever we find a constant tensor, insert it in the buffer map.
+  BufferMap* buffer_map = op_data->buffer_map;
+  for (auto tensor_index : op_data->subgraph_inputs) {
+    TfLiteTensor* tensor = &context->tensors[tensor_index];
+    if (IsConstantTensor(tensor)) {
+      if (!buffer_map->HasTensor(tensor_index)) {
+        buffer_map->SetFromTfLite(tensor_index, tensor);
+      }
+    }
+  }
+
+  // All output tensors are allocated by TensorFlow/Eager, so we
+  // mark them as kTfLiteDynamic.
+  for (auto tensor_index : op_data->subgraph_outputs) {
+    SetTensorToDynamic(&context->tensors[tensor_index]);
+  }
+
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const auto* op_data = reinterpret_cast<OpData*>(node->user_data);
+  BufferMap* buffer_map = op_data->buffer_map;
+  tensorflow::EagerContext* eager_context = op_data->eager_context;
+
+  // Insert a tensor in the buffer map for all inputs that are not constant.
+  // Constants were handled in Prepare() already.
+  for (auto tensor_index : op_data->subgraph_inputs) {
+    TfLiteTensor* tensor = &context->tensors[tensor_index];
+    if (!IsConstantTensor(tensor)) {
+      buffer_map->SetFromTfLite(tensor_index, tensor);
+    }
+  }
+
+  // Execute the TensorFlow Ops sequentially.
+  for (const auto& node_data : op_data->nodes) {
+    if (node_data.nodedef.op().empty()) {
+      context->ReportError(context, "Invalid NodeDef in Eager op '%s'",
+                           node_data.name.c_str());
+      return kTfLiteError;
+    }
+    auto status =
+        ExecuteEagerOp(eager_context, buffer_map, node_data.name,
+                       node_data.nodedef, node_data.inputs, node_data.outputs);
+    TF_LITE_ENSURE_OK(context, ConvertStatus(context, status));
+  }
+
+  for (auto tensor_index : op_data->subgraph_outputs) {
+    if (!buffer_map->HasTensor(tensor_index)) {
+      context->ReportError(context, "Cannot write to invalid tensor index %d",
+                           tensor_index);
+      return kTfLiteError;
+    }
+
+    TfLiteTensor* tensor = &context->tensors[tensor_index];
+    TF_LITE_ENSURE_OK(
+        context,
+        CopyShape(context, buffer_map->GetTensor(tensor_index), tensor));
+    tensor->buffer_handle = tensor_index;
+    tensor->data_is_stale = true;
+  }
+
+  return kTfLiteOk;
+}
+
+}  // namespace kernel
+
+TfLiteRegistration GetKernel() {
+  TfLiteRegistration registration{&kernel::Init,    &kernel::Free,
+                                  &kernel::Prepare, &kernel::Eval,
+                                  nullptr,          kTfLiteBuiltinDelegate};
+  return registration;
+}
+
+}  // namespace eager
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/delegates/eager/kernel.h b/tensorflow/contrib/lite/delegates/eager/kernel.h
new file mode 100644
index 0000000000..100672c82d
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/kernel.h
@@ -0,0 +1,34 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_KERNEL_H_
+#define TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_KERNEL_H_
+
+#include "tensorflow/contrib/lite/context.h"
+
+namespace tflite {
+namespace eager {
+
+// Return the registration object used to initialize and execute ops that will
+// be delegated to TensorFlow's Eager runtime. This TF Lite op is created by
+// the eager delegate to handle execution of a supported subgraph. The usual
+// flow is that the delegate informs the interpreter of supported nodes in a
+// graph, and each supported subgraph is replaced with one instance of this
+// kernel.
+TfLiteRegistration GetKernel();
+
+}  // namespace eager
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_KERNEL_H_
diff --git a/tensorflow/contrib/lite/delegates/eager/kernel_test.cc b/tensorflow/contrib/lite/delegates/eager/kernel_test.cc
new file mode 100644
index 0000000000..b7bfbb34e4
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/kernel_test.cc
@@ -0,0 +1,228 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/kernel.h"
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/delegates/eager/delegate_data.h"
+#include "tensorflow/contrib/lite/delegates/eager/test_util.h"
+
+namespace tflite {
+namespace eager {
+namespace {
+
+using ::testing::ContainsRegex;
+using ::testing::ElementsAre;
+
+TfLiteStatus GenericPrepare(TfLiteContext* context, TfLiteDelegate* delegate,
+                            const std::vector<int>& supported_nodes) {
+  TfLiteIntArray* size_and_nodes =
+      ConvertVectorToTfLiteIntArray(supported_nodes);
+  TF_LITE_ENSURE_STATUS(context->ReplaceSubgraphsWithDelegateKernels(
+      context, eager::GetKernel(), size_and_nodes, delegate));
+  TfLiteIntArrayFree(size_and_nodes);
+  return kTfLiteOk;
+}
+
+class KernelTest : public testing::EagerModelTest {
+ public:
+  KernelTest() {
+    CHECK(DelegateData::Create(&delegate_data_).ok());
+    interpreter_.reset(new Interpreter(&error_reporter_));
+  }
+
+  ~KernelTest() override {
+    // The data needs to be released before the interpreter because the
+    // interpreter references the data.
+    delegate_data_.reset();
+    interpreter_.reset();
+  }
+
+  template <typename T>
+  void ConfigureDelegate(T prepare_function) {
+    delegate_.data_ = delegate_data_.get();
+    delegate_.FreeBufferHandle = nullptr;
+    delegate_.Prepare = prepare_function;
+    delegate_.CopyFromBufferHandle = [](TfLiteDelegate* delegate,
+                                        TfLiteBufferHandle buffer_handle,
+                                        void* data, size_t size) {
+      auto* delegate_data = reinterpret_cast<DelegateData*>(delegate->data_);
+      tensorflow::StringPiece values =
+          delegate_data->GetBufferMap()->GetTensor(buffer_handle).tensor_data();
+      memcpy(data, values.data(), values.size());
+      return kTfLiteOk;
+    };
+    CHECK(interpreter_->ModifyGraphWithDelegate(
+              &delegate_, /*allow_dynamic_tensors=*/true) == kTfLiteOk);
+  }
+
+ private:
+  std::unique_ptr<DelegateData> delegate_data_;
+  TfLiteDelegate delegate_;
+};
+
+TEST_F(KernelTest, FullGraph) {
+  // Define the graph.
+  AddTensors(9, {0, 3}, {8}, kTfLiteFloat32, {3});
+
+  AddTfOp(testing::kUnpack, {0}, {1, 2});
+  AddTfOp(testing::kUnpack, {3}, {4, 5});
+  AddTfOp(testing::kAdd, {1, 4}, {6});
+  AddTfOp(testing::kAdd, {2, 5}, {7});
+  AddTfOp(testing::kMul, {6, 7}, {8});
+
+  // Apply Delegate.
+  ConfigureDelegate([](TfLiteContext* context, TfLiteDelegate* delegate) {
+    return GenericPrepare(context, delegate, {0, 1, 2, 3, 4});
+  });
+
+  // Define inputs.
+  SetShape(0, {2, 2, 1});
+  SetValues(0, {1.1f, 2.2f, 3.3f, 4.4f});
+  SetShape(3, {2, 2, 1});
+  SetValues(3, {1.1f, 2.2f, 3.3f, 4.4f});
+
+  ASSERT_TRUE(Invoke());
+
+  ASSERT_THAT(GetShape(8), ElementsAre(2, 1));
+  ASSERT_THAT(GetValues(8), ElementsAre(14.52f, 38.72f));
+}
+
+TEST_F(KernelTest, BadTensorFlowOp) {
+  AddTensors(2, {0}, {1}, kTfLiteFloat32, {3});
+  AddTfOp(testing::kNonExistent, {0}, {1});
+
+  ConfigureDelegate([](TfLiteContext* context, TfLiteDelegate* delegate) {
+    return GenericPrepare(context, delegate, {0});
+  });
+
+  SetShape(0, {2, 2, 1});
+  SetValues(0, {1.1f, 2.2f, 3.3f, 4.4f});
+
+  ASSERT_FALSE(Invoke());
+  ASSERT_THAT(error_reporter().error_messages(),
+              ContainsRegex("while processing attributes of 'NonExistentOp'"));
+}
+
+TEST_F(KernelTest, BadNumberOfOutputs) {
+  AddTensors(3, {0}, {1, 2}, kTfLiteFloat32, {3});
+  AddTfOp(testing::kIdentity, {0}, {1, 2});
+
+  ConfigureDelegate([](TfLiteContext* context, TfLiteDelegate* delegate) {
+    return GenericPrepare(context, delegate, {0});
+  });
+
+  SetShape(0, {2, 2, 1});
+  SetValues(0, {1.1f, 2.2f, 3.3f, 4.4f});
+
+  ASSERT_FALSE(Invoke());
+  ASSERT_THAT(error_reporter().error_messages(),
+              ContainsRegex("Unexpected number of outputs"));
+}
+
+TEST_F(KernelTest, IncompatibleNodeDef) {
+  AddTensors(2, {0}, {1}, kTfLiteFloat32, {3});
+
+  // Cast is a TF op, but we don't add the proper nodedef to it in AddTfOp.
+  AddTfOp(testing::kIncompatibleNodeDef, {0}, {1});
+
+  ConfigureDelegate([](TfLiteContext* context, TfLiteDelegate* delegate) {
+    return GenericPrepare(context, delegate, {0});
+  });
+
+  SetShape(0, {2, 2, 1});
+  SetValues(0, {1.1f, 2.2f, 3.3f, 4.4f});
+
+  ASSERT_FALSE(Invoke());
+  ASSERT_THAT(error_reporter().error_messages(),
+              ContainsRegex("while executing 'Cast' via Eager"));
+}
+
+TEST_F(KernelTest, WrongSetOfNodes) {
+  AddTensors(4, {0}, {3}, kTfLiteFloat32, {3});
+  AddTfOp(testing::kUnpack, {0}, {1, 2});
+  AddTfLiteMulOp({1, 2}, {3});
+
+  // Specify that testing::kMul (#1) is supported when it actually isn't.
+  ConfigureDelegate([](TfLiteContext* context, TfLiteDelegate* delegate) {
+    return GenericPrepare(context, delegate, {0, 1});
+  });
+
+  SetShape(0, {2, 2, 1});
+  SetValues(0, {1.1f, 2.2f, 3.3f, 4.4f});
+
+  ASSERT_FALSE(Invoke());
+  ASSERT_THAT(error_reporter().error_messages(),
+              ContainsRegex("Invalid NodeDef in Eager op"));
+}
+
+TEST_F(KernelTest, MixedGraph) {
+  AddTensors(9, {0, 3}, {8}, kTfLiteFloat32, {3});
+
+  AddTfOp(testing::kUnpack, {0}, {1, 2});
+  AddTfOp(testing::kUnpack, {3}, {4, 5});
+  AddTfOp(testing::kAdd, {1, 4}, {6});
+  AddTfOp(testing::kAdd, {2, 5}, {7});
+  AddTfLiteMulOp({6, 7}, {8});
+
+  ConfigureDelegate([](TfLiteContext* context, TfLiteDelegate* delegate) {
+    return GenericPrepare(context, delegate, {0, 1, 2, 3});
+  });
+
+  SetShape(0, {2, 2, 1});
+  SetValues(0, {1.1f, 2.2f, 3.3f, 4.4f});
+  SetShape(3, {2, 2, 1});
+  SetValues(3, {1.1f, 2.2f, 3.3f, 4.4f});
+
+  ASSERT_TRUE(Invoke());
+
+  ASSERT_THAT(GetShape(8), ElementsAre(2, 1));
+  ASSERT_THAT(GetValues(8), ElementsAre(14.52f, 38.72f));
+}
+
+TEST_F(KernelTest, SplitGraph) {
+  AddTensors(10, {0}, {9}, kTfLiteFloat32, {3});
+
+  AddTfOp(testing::kUnpack, {0}, {1, 2});
+  AddTfOp(testing::kAdd, {1, 2}, {3});
+  AddTfOp(testing::kUnpack, {3}, {4, 5});
+
+  AddTfLiteMulOp({4, 5}, {6});
+
+  AddTfOp(testing::kUnpack, {6}, {7, 8});
+  AddTfOp(testing::kAdd, {7, 8}, {9});
+
+  ConfigureDelegate([](TfLiteContext* context, TfLiteDelegate* delegate) {
+    return GenericPrepare(context, delegate, {0, 1, 2, 4, 5});
+  });
+
+  SetShape(0, {2, 2, 2, 1});
+  SetValues(0, {3.0f, 1.0f, 0.5f, -1.0f, 0.0f, 1.0f, 1.5f, 3.0f});
+
+  ASSERT_TRUE(Invoke());
+
+  ASSERT_THAT(GetShape(9), ElementsAre(1));
+  ASSERT_THAT(GetValues(9), ElementsAre(10.0f));
+}
+
+}  // namespace
+}  // namespace eager
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/delegates/eager/test_util.cc b/tensorflow/contrib/lite/delegates/eager/test_util.cc
new file mode 100644
index 0000000000..9ea802e34b
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/test_util.cc
@@ -0,0 +1,164 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/delegates/eager/test_util.h"
+
+#include "absl/memory/memory.h"
+#include "third_party/flatbuffers/include/flatbuffers/flexbuffers.h"
+
+namespace tflite {
+namespace eager {
+namespace testing {
+
+namespace {
+// These are used as custom_names in TFLite ops, so they need to be static.
+// Names need to start with "Eager" in order to work with the Eager delegate
+// requirements.
+static constexpr char kIdentityName[] = "EagerIdentity";
+static constexpr char kUnpackName[] = "EagerUnpack";
+static constexpr char kAddName[] = "EagerAdd";
+static constexpr char kMulName[] = "EagerMul";
+}  // namespace
+
+bool EagerModelTest::Invoke() { return interpreter_->Invoke() == kTfLiteOk; }
+
+void EagerModelTest::SetValues(int tensor_index,
+                               const std::vector<float>& values) {
+  float* v = interpreter_->typed_tensor<float>(tensor_index);
+  for (float f : values) {
+    *v++ = f;
+  }
+}
+
+std::vector<float> EagerModelTest::GetValues(int tensor_index) {
+  TfLiteTensor* o = interpreter_->tensor(tensor_index);
+  return std::vector<float>(o->data.f, o->data.f + o->bytes / sizeof(float));
+}
+
+void EagerModelTest::SetShape(int tensor_index,
+                              const std::vector<int>& values) {
+  ASSERT_EQ(interpreter_->ResizeInputTensor(tensor_index, values), kTfLiteOk);
+  ASSERT_EQ(interpreter_->AllocateTensors(), kTfLiteOk);
+}
+
+std::vector<int> EagerModelTest::GetShape(int tensor_index) {
+  std::vector<int> result;
+  auto* dims = interpreter_->tensor(tensor_index)->dims;
+  result.reserve(dims->size);
+  for (int i = 0; i < dims->size; ++i) {
+    result.push_back(dims->data[i]);
+  }
+  return result;
+}
+
+void EagerModelTest::AddTensors(int num_tensors, const std::vector<int>& inputs,
+                                const std::vector<int>& outputs,
+                                const TfLiteType& type,
+                                const std::vector<int>& dims) {
+  interpreter_->AddTensors(num_tensors);
+  for (int i = 0; i < num_tensors; ++i) {
+    TfLiteQuantizationParams quant;
+    CHECK_EQ(interpreter_->SetTensorParametersReadWrite(i, type,
+                                                        /*name=*/"",
+                                                        /*dims=*/dims, quant),
+             kTfLiteOk);
+  }
+
+  CHECK_EQ(interpreter_->SetInputs(inputs), kTfLiteOk);
+  CHECK_EQ(interpreter_->SetOutputs(outputs), kTfLiteOk);
+}
+
+void EagerModelTest::AddTfLiteMulOp(const std::vector<int>& inputs,
+                                    const std::vector<int>& outputs) {
+  static TfLiteRegistration reg = {nullptr, nullptr, nullptr, nullptr};
+  reg.builtin_code = BuiltinOperator_MUL;
+  reg.prepare = [](TfLiteContext* context, TfLiteNode* node) {
+    auto* i0 = &context->tensors[node->inputs->data[0]];
+    auto* o = &context->tensors[node->outputs->data[0]];
+    return context->ResizeTensor(context, o, TfLiteIntArrayCopy(i0->dims));
+  };
+  reg.invoke = [](TfLiteContext* context, TfLiteNode* node) {
+    auto* i0 = &context->tensors[node->inputs->data[0]];
+    auto* i1 = &context->tensors[node->inputs->data[1]];
+    auto* o = &context->tensors[node->outputs->data[0]];
+    for (int i = 0; i < o->bytes / sizeof(float); ++i) {
+      o->data.f[i] = i0->data.f[i] * i1->data.f[i];
+    }
+    return kTfLiteOk;
+  };
+
+  CHECK_EQ(interpreter_->AddNodeWithParameters(inputs, outputs, nullptr, 0,
+                                               nullptr, &reg),
+           kTfLiteOk);
+}
+
+void EagerModelTest::AddTfOp(TfOpType op, const std::vector<int>& inputs,
+                             const std::vector<int>& outputs) {
+  auto attr = [](const string& key, const string& value) {
+    return " attr{ key: '" + key + "' value {" + value + "}}";
+  };
+
+  if (op == kUnpack) {
+    string attributes = attr("T", "type: DT_FLOAT") + attr("num", "i: 2") +
+                        attr("axis", "i: 0");
+    AddTfOp(kUnpackName, "Unpack", attributes, inputs, outputs);
+  } else if (op == kIdentity) {
+    string attributes = attr("T", "type: DT_FLOAT");
+    AddTfOp(kIdentityName, "Identity", attributes, inputs, outputs);
+  } else if (op == kAdd) {
+    string attributes = attr("T", "type: DT_FLOAT");
+    AddTfOp(kAddName, "Add", attributes, inputs, outputs);
+  } else if (op == kMul) {
+    string attributes = attr("T", "type: DT_FLOAT");
+    AddTfOp(kMulName, "Mul", attributes, inputs, outputs);
+  } else if (op == kNonExistent) {
+    AddTfOp("NonExistentOp", "NonExistentOp", "", inputs, outputs);
+  } else if (op == kIncompatibleNodeDef) {
+    // "Cast" op is created without attributes - making it incompatible.
+    AddTfOp("EagerCast", "Cast", "", inputs, outputs);
+  }
+}
+
+void EagerModelTest::AddTfOp(const char* tflite_name, const string& tf_name,
+                             const string& nodedef_str,
+                             const std::vector<int>& inputs,
+                             const std::vector<int>& outputs) {
+  static TfLiteRegistration reg = {nullptr, nullptr, nullptr, nullptr};
+  reg.builtin_code = BuiltinOperator_CUSTOM;
+  reg.custom_name = tflite_name;
+
+  tensorflow::NodeDef nodedef;
+  CHECK(tensorflow::protobuf::TextFormat::ParseFromString(
+      nodedef_str + " op: '" + tf_name + "'", &nodedef));
+  string serialized_nodedef;
+  CHECK(nodedef.SerializeToString(&serialized_nodedef));
+  flexbuffers::Builder fbb;
+  fbb.Vector([&]() {
+    fbb.String(nodedef.op());
+    fbb.String(serialized_nodedef);
+  });
+  fbb.Finish();
+
+  flexbuffers_.push_back(fbb.GetBuffer());
+  auto& buffer = flexbuffers_.back();
+  CHECK_EQ(interpreter_->AddNodeWithParameters(
+               inputs, outputs, reinterpret_cast<const char*>(buffer.data()),
+               buffer.size(), nullptr, &reg),
+           kTfLiteOk);
+}
+
+}  // namespace testing
+}  // namespace eager
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/delegates/eager/test_util.h b/tensorflow/contrib/lite/delegates/eager/test_util.h
new file mode 100644
index 0000000000..369009715e
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/test_util.h
@@ -0,0 +1,95 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_TEST_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_TEST_UTIL_H_
+
+#include "tensorflow/c/c_api_internal.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+
+namespace tflite {
+namespace eager {
+namespace testing {
+
+enum TfOpType {
+  kUnpack,
+  kIdentity,
+  kAdd,
+  kMul,
+  // Represents an op that does not exist in TensorFlow.
+  kNonExistent,
+  // Represents an valid TensorFlow op where the NodeDef is incompatible.
+  kIncompatibleNodeDef,
+};
+
+// This class creates models with TF and TFLite ops. In order to use this class
+// to test the Eager delegate, implement a function that calls
+// interpreter->ModifyGraphWithDelegate.
+class EagerModelTest : public ::testing::Test {
+ public:
+  EagerModelTest() {}
+  ~EagerModelTest() {}
+
+  bool Invoke();
+
+  // Sets the tensor's values at the given index.
+  void SetValues(int tensor_index, const std::vector<float>& values);
+
+  // Returns the tensor's values at the given index.
+  std::vector<float> GetValues(int tensor_index);
+
+  // Sets the tensor's shape at the given index.
+  void SetShape(int tensor_index, const std::vector<int>& values);
+
+  // Returns the tensor's shape at the given index.
+  std::vector<int> GetShape(int tensor_index);
+
+  const TestErrorReporter& error_reporter() const { return error_reporter_; }
+
+  // Adds `num_tensor` tensors to the model. `inputs` contains the indices of
+  // the input tensors and `outputs` contains the indices of the output
+  // tensors. All tensors are set to have `type` and `dims`.
+  void AddTensors(int num_tensors, const std::vector<int>& inputs,
+                  const std::vector<int>& outputs, const TfLiteType& type,
+                  const std::vector<int>& dims);
+
+  // Adds a TFLite Mul op. `inputs` contains the indices of the input tensors
+  // and `outputs` contains the indices of the output tensors.
+  void AddTfLiteMulOp(const std::vector<int>& inputs,
+                      const std::vector<int>& outputs);
+
+  // Adds a TensorFlow op. `inputs` contains the indices of the
+  // input tensors and `outputs` contains the indices of the output tensors.
+  // This function is limited to the set of ops defined in TfOpType.
+  void AddTfOp(TfOpType op, const std::vector<int>& inputs,
+               const std::vector<int>& outputs);
+
+ protected:
+  std::unique_ptr<Interpreter> interpreter_;
+  TestErrorReporter error_reporter_;
+
+ private:
+  void AddTfOp(const char* tflite_name, const string& tf_name,
+               const string& nodedef_str, const std::vector<int>& inputs,
+               const std::vector<int>& outputs);
+
+  std::vector<std::vector<uint8_t>> flexbuffers_;
+};
+
+}  // namespace testing
+}  // namespace eager
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_TEST_UTIL_H_
diff --git a/tensorflow/contrib/lite/delegates/eager/util.cc b/tensorflow/contrib/lite/delegates/eager/util.cc
new file mode 100644
index 0000000000..4426c653e6
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/util.cc
@@ -0,0 +1,72 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/util.h"
+
+namespace tflite {
+namespace eager {
+
+TfLiteStatus ConvertStatus(TfLiteContext* context,
+                           const tensorflow::Status& status) {
+  if (!status.ok()) {
+    context->ReportError(context, "%s", status.error_message().c_str());
+    return kTfLiteError;
+  }
+  return kTfLiteOk;
+}
+
+TfLiteStatus CopyShape(TfLiteContext* context, const tensorflow::Tensor& src,
+                       TfLiteTensor* tensor) {
+  int num_dims = src.dims();
+  TfLiteIntArray* shape = TfLiteIntArrayCreate(num_dims);
+  for (int j = 0; j < num_dims; ++j) {
+    // We need to cast from TensorFlow's int64 to TF Lite's int32. Let's
+    // make sure there's no overflow.
+    if (src.dim_size(j) >= std::numeric_limits<int>::max()) {
+      context->ReportError(context,
+                           "Dimension value in TensorFlow shape is larger than "
+                           "supported by TF Lite");
+      TfLiteIntArrayFree(shape);
+      return kTfLiteError;
+    }
+    shape->data[j] = static_cast<int>(src.dim_size(j));
+  }
+  return context->ResizeTensor(context, tensor, shape);
+}
+
+TF_DataType GetTensorFlowDataType(TfLiteType type) {
+  switch (type) {
+    case kTfLiteNoType:
+      return TF_FLOAT;
+    case kTfLiteFloat32:
+      return TF_FLOAT;
+    case kTfLiteInt16:
+      return TF_INT16;
+    case kTfLiteInt32:
+      return TF_INT32;
+    case kTfLiteUInt8:
+      return TF_UINT8;
+    case kTfLiteInt64:
+      return TF_INT64;
+    case kTfLiteComplex64:
+      return TF_COMPLEX64;
+    case kTfLiteString:
+      return TF_STRING;
+    case kTfLiteBool:
+      return TF_BOOL;
+  }
+}
+
+}  // namespace eager
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/delegates/eager/util.h b/tensorflow/contrib/lite/delegates/eager/util.h
new file mode 100644
index 0000000000..a9407be071
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/util.h
@@ -0,0 +1,42 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_UTIL_H_
+
+#include "tensorflow/c/c_api_internal.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tflite {
+namespace eager {
+
+// Converts a tensorflow:Status into a TfLiteStatus. If the original status
+// represented an error, reports it using the given 'context'.
+TfLiteStatus ConvertStatus(TfLiteContext* context,
+                           const tensorflow::Status& status);
+
+// Copies the given shape of the given 'src' into a TF Lite 'tensor'. Logs an
+// error and returns kTfLiteError if the shape can't be converted.
+TfLiteStatus CopyShape(TfLiteContext* context, const tensorflow::Tensor& src,
+                       TfLiteTensor* tensor);
+
+// Returns the TF C API Data type that corresponds to the given TfLiteType.
+TF_DataType GetTensorFlowDataType(TfLiteType type);
+
+}  // namespace eager
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_DELEGATES_EAGER_UTIL_H_
diff --git a/tensorflow/contrib/lite/delegates/eager/util_test.cc b/tensorflow/contrib/lite/delegates/eager/util_test.cc
new file mode 100644
index 0000000000..c4fbf54127
--- /dev/null
+++ b/tensorflow/contrib/lite/delegates/eager/util_test.cc
@@ -0,0 +1,113 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/delegates/eager/util.h"
+
+#include <cstdarg>
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/testing/util.h"
+
+namespace tflite {
+namespace eager {
+namespace {
+
+using tensorflow::DT_FLOAT;
+using tensorflow::Tensor;
+using ::testing::ElementsAre;
+
+struct TestContext : public TfLiteContext {
+  string error;
+  std::vector<int> new_size;
+};
+
+void ReportError(TfLiteContext* context, const char* format, ...) {
+  TestContext* c = static_cast<TestContext*>(context);
+  const size_t kBufferSize = 1024;
+  char temp_buffer[kBufferSize];
+
+  va_list args;
+  va_start(args, format);
+  vsnprintf(temp_buffer, kBufferSize, format, args);
+  va_end(args);
+
+  c->error = temp_buffer;
+}
+
+TfLiteStatus ResizeTensor(TfLiteContext* context, TfLiteTensor* tensor,
+                          TfLiteIntArray* new_size) {
+  TestContext* c = static_cast<TestContext*>(context);
+  c->new_size.clear();
+  for (int i = 0; i < new_size->size; ++i) {
+    c->new_size.push_back(new_size->data[i]);
+  }
+  TfLiteIntArrayFree(new_size);
+  return kTfLiteOk;
+}
+
+TEST(UtilTest, ConvertStatus) {
+  TestContext context;
+  context.ReportError = ReportError;
+
+  EXPECT_EQ(ConvertStatus(&context, tensorflow::errors::Internal("Some Error")),
+            kTfLiteError);
+  EXPECT_EQ(context.error, "Some Error");
+
+  context.error.clear();
+  EXPECT_EQ(ConvertStatus(&context, tensorflow::Status()), kTfLiteOk);
+  EXPECT_TRUE(context.error.empty());
+}
+
+TEST(UtilTest, CopyShape) {
+  TestContext context;
+  context.ReportError = ReportError;
+  context.ResizeTensor = ResizeTensor;
+
+  TfLiteTensor dst;
+
+  EXPECT_EQ(CopyShape(&context, Tensor(), &dst), kTfLiteOk);
+  EXPECT_THAT(context.new_size, ElementsAre(0));
+
+  EXPECT_EQ(CopyShape(&context, Tensor(DT_FLOAT, {1, 2}), &dst), kTfLiteOk);
+  EXPECT_THAT(context.new_size, ElementsAre(1, 2));
+
+  EXPECT_EQ(CopyShape(&context, Tensor(DT_FLOAT, {1LL << 44, 2}), &dst),
+            kTfLiteError);
+  EXPECT_EQ(context.error,
+            "Dimension value in TensorFlow shape is larger than supported by "
+            "TF Lite");
+}
+
+TEST(UtilTest, TypeConversions) {
+  EXPECT_EQ(TF_FLOAT, GetTensorFlowDataType(kTfLiteNoType));
+  EXPECT_EQ(TF_FLOAT, GetTensorFlowDataType(kTfLiteFloat32));
+  EXPECT_EQ(TF_INT16, GetTensorFlowDataType(kTfLiteInt16));
+  EXPECT_EQ(TF_INT32, GetTensorFlowDataType(kTfLiteInt32));
+  EXPECT_EQ(TF_UINT8, GetTensorFlowDataType(kTfLiteUInt8));
+  EXPECT_EQ(TF_INT64, GetTensorFlowDataType(kTfLiteInt64));
+  EXPECT_EQ(TF_COMPLEX64, GetTensorFlowDataType(kTfLiteComplex64));
+  EXPECT_EQ(TF_STRING, GetTensorFlowDataType(kTfLiteString));
+  EXPECT_EQ(TF_BOOL, GetTensorFlowDataType(kTfLiteBool));
+}
+
+}  // namespace
+}  // namespace eager
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/delegates/nnapi/BUILD b/tensorflow/contrib/lite/delegates/nnapi/BUILD
index 35a8f6ca41..091f8fbce7 100644
--- a/tensorflow/contrib/lite/delegates/nnapi/BUILD
+++ b/tensorflow/contrib/lite/delegates/nnapi/BUILD
@@ -22,6 +22,7 @@ tf_cc_test(
     name = "nnapi_delegate_test",
     size = "small",
     srcs = ["nnapi_delegate_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":nnapi_delegate",
         "//tensorflow/contrib/lite:framework",
diff --git a/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.cc b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.cc
index fd798c209e..60855eb8ed 100644
--- a/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.cc
+++ b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate.cc
@@ -436,7 +436,6 @@ class NNAPIDelegateKernel {
         }
         break;
       case kTfLiteBuiltinSqueeze:
-        // Squeeze requires NNAPI1.1.
         if (version == 1 && kAndroidSdkVersion >= kMinSdkVersionForNNAPI11) {
           return [](TfLiteContext* context, NNAPIOpBuilder* builder,
                     TfLiteNode* node) -> ANeuralNetworksOperationType {
@@ -452,6 +451,240 @@ class NNAPIDelegateKernel {
         } else {
           return nullptr;
         }
+      case kTfLiteBuiltinL2Normalization: {
+        auto builtin =
+            reinterpret_cast<TfLiteL2NormParams*>(node->builtin_data);
+        if (builtin->activation != kTfLiteActNone) {
+          // NNAPI does not support activations
+          return nullptr;
+        }
+        return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                  TfLiteNode* node) -> ANeuralNetworksOperationType {
+          return ANEURALNETWORKS_L2_NORMALIZATION;
+        };
+      }
+      case kTfLiteBuiltinLocalResponseNormalization:
+        if (version == 1) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            auto builtin = reinterpret_cast<TfLiteLocalResponseNormParams*>(
+                node->builtin_data);
+            builder->AddScalarInt32Operand(builtin->radius);
+            builder->AddScalarFloat32Operand(builtin->bias);
+            builder->AddScalarFloat32Operand(builtin->alpha);
+            builder->AddScalarFloat32Operand(builtin->beta);
+            return ANEURALNETWORKS_LOCAL_RESPONSE_NORMALIZATION;
+          };
+        } else {
+          // TODO(miaowang): clean-up code and return early in the unsupported
+          // case.
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinLshProjection:
+        if (version == 1) {
+          // NNAPI does not support sparse projection correctly (b/111751836).
+          if (reinterpret_cast<TfLiteLSHProjectionParams*>(node->builtin_data)
+                  ->type == kTfLiteLshProjectionSparse) {
+            return nullptr;
+          }
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            auto builtin = reinterpret_cast<TfLiteLSHProjectionParams*>(
+                node->builtin_data);
+            builder->AddScalarInt32Operand(builtin->type);
+            return ANEURALNETWORKS_LSH_PROJECTION;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinConcatenation:
+        if (version == 1 &&
+            reinterpret_cast<TfLiteConcatenationParams*>(node->builtin_data)
+                    ->activation == kTfLiteActNone) {
+          if (context->tensors[node->inputs->data[0]].type == kTfLiteUInt8) {
+            // NNAPI only support concatenating quantized tensor of the same
+            // scale and offset.
+            auto first_param = context->tensors[node->inputs->data[0]].params;
+            for (int i = 0; i < node->inputs->size; i++) {
+              auto curr_param = context->tensors[node->inputs->data[i]].params;
+              if (curr_param.scale != first_param.scale ||
+                  curr_param.zero_point != first_param.zero_point) {
+                return nullptr;
+              }
+            }
+          }
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            auto builtin = reinterpret_cast<TfLiteConcatenationParams*>(
+                node->builtin_data);
+            builder->AddScalarInt32Operand(builtin->axis);
+            return ANEURALNETWORKS_CONCATENATION;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinDequantize:
+        if (version == 1) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_DEQUANTIZE;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinFloor:
+        if (version == 1) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_FLOOR;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinRelu:
+        if (version == 1) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_RELU;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinReluN1To1:
+        if (version == 1) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_RELU1;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinRelu6:
+        if (version == 1) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_RELU6;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinLogistic:
+        if (version == 1) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_LOGISTIC;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinTanh:
+        // TODO(miaowang): add additional checks for the parameters.
+        if (version == 1 &&
+            context->tensors[node->inputs->data[0]].type == kTfLiteFloat32) {
+          // NNAPI only support float tanh.
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_TANH;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinSub:
+        if (version == 1 && kAndroidSdkVersion >= kMinSdkVersionForNNAPI11 &&
+            context->tensors[node->inputs->data[0]].type == kTfLiteFloat32) {
+          // NNAPI only support float sub.
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            auto builtin =
+                reinterpret_cast<TfLiteSubParams*>(node->builtin_data);
+            builder->AddScalarInt32Operand(builtin->activation);
+            return ANEURALNETWORKS_SUB;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinDiv:
+        if (version == 1 && kAndroidSdkVersion >= kMinSdkVersionForNNAPI11 &&
+            context->tensors[node->inputs->data[0]].type == kTfLiteFloat32) {
+          // NNAPI only support float div.
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            auto builtin =
+                reinterpret_cast<TfLiteDivParams*>(node->builtin_data);
+            builder->AddScalarInt32Operand(builtin->activation);
+            return ANEURALNETWORKS_DIV;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinPad:
+        if (version == 1 && kAndroidSdkVersion >= kMinSdkVersionForNNAPI11 &&
+            node->inputs->size == 2 &&
+            context->tensors[node->inputs->data[0]].type == kTfLiteFloat32) {
+          // NNAPI does not support specifying the padding value.
+          // NNAPI pads physical zero for quantized tensors, so only delegate
+          // float pad to NNAPI.
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_PAD;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinSpaceToBatchNd:
+        if (version == 1 && kAndroidSdkVersion >= kMinSdkVersionForNNAPI11) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_SPACE_TO_BATCH_ND;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinStridedSlice:
+        if (version == 1 && kAndroidSdkVersion >= kMinSdkVersionForNNAPI11) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            auto builtin =
+                reinterpret_cast<TfLiteStridedSliceParams*>(node->builtin_data);
+            builder->AddScalarInt32Operand(builtin->begin_mask);
+            builder->AddScalarInt32Operand(builtin->end_mask);
+            builder->AddScalarInt32Operand(builtin->shrink_axis_mask);
+            return ANEURALNETWORKS_STRIDED_SLICE;
+          };
+        } else {
+          return nullptr;
+        }
+        break;
+      case kTfLiteBuiltinTranspose:
+        // Note that the permutation input tensor value dictates the output
+        // dimensions.
+        // TODO(b/110888333): Support dynamically-sized tensors in delegates.
+        if ((version == 1) &&
+            (kAndroidSdkVersion >= kMinSdkVersionForNNAPI11) &&
+            (node->inputs->size > 1) &&
+            (context->tensors[node->inputs->data[1]].allocation_type ==
+             kTfLiteMmapRo)) {
+          return [](TfLiteContext* context, NNAPIOpBuilder* builder,
+                    TfLiteNode* node) -> ANeuralNetworksOperationType {
+            return ANEURALNETWORKS_TRANSPOSE;
+          };
+        } else {
+          return nullptr;
+        }
         break;
       default:
         return nullptr;
diff --git a/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate_test.cc b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate_test.cc
index aad10c9ce7..b7b159c59f 100644
--- a/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate_test.cc
+++ b/tensorflow/contrib/lite/delegates/nnapi/nnapi_delegate_test.cc
@@ -27,14 +27,20 @@ using ::testing::ElementsAreArray;
 // TODO(b/110368244): figure out how to share the existing tests in kernels/ but
 // with the delegation on. Also, add more unit tests to improve code coverage.
 
-class FloatAddOpModel : public SingleOpModel {
+class SingleOpModelWithNNAPI : public SingleOpModel {
+ public:
+  SingleOpModelWithNNAPI() {
+    this->SetApplyDelegate([](Interpreter* interpreter) {
+      interpreter->ModifyGraphWithDelegate(NnApiDelegate(), false);
+    });
+  }
+};
+
+class FloatAddOpModel : public SingleOpModelWithNNAPI {
  public:
   FloatAddOpModel(const TensorData& input1, const TensorData& input2,
                   const TensorData& output,
                   ActivationFunctionType activation_type) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
     input1_ = AddInput(input1);
     input2_ = AddInput(input2);
     output_ = AddOutput(output);
@@ -76,14 +82,11 @@ TEST(NNAPIDelegate, AddWithRelu) {
   EXPECT_THAT(m.GetOutput(), ElementsAreArray({0.0, 0.4, 1.0, 1.3}));
 }
 
-class FloatMulOpModel : public SingleOpModel {
+class FloatMulOpModel : public SingleOpModelWithNNAPI {
  public:
   FloatMulOpModel(const TensorData& input1, const TensorData& input2,
                   const TensorData& output,
                   ActivationFunctionType activation_type) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
     input1_ = AddInput(input1);
     input2_ = AddInput(input2);
     output_ = AddOutput(output);
@@ -114,15 +117,11 @@ TEST(NNAPIDelegate, MulWithNoActivation) {
               ElementsAreArray(ArrayFloatNear({-0.2, 0.04, 0.21, 0.4})));
 }
 
-class FloatPoolingOpModel : public SingleOpModel {
+class FloatPoolingOpModel : public SingleOpModelWithNNAPI {
  public:
   FloatPoolingOpModel(BuiltinOperator type, const TensorData& input,
                       int filter_width, int filter_height,
                       const TensorData& output) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
-
     input_ = AddInput(input);
     output_ = AddOutput(output);
 
@@ -185,7 +184,7 @@ TEST(NNAPIDelegate, L2PoolWithNoActivation) {
   EXPECT_THAT(m.GetOutput(), ElementsAreArray({3.5, 6.5}));
 }
 
-class BaseConvolutionOpModel : public SingleOpModel {
+class BaseConvolutionOpModel : public SingleOpModelWithNNAPI {
  public:
   BaseConvolutionOpModel(
       const TensorData& input, const TensorData& filter,
@@ -193,10 +192,6 @@ class BaseConvolutionOpModel : public SingleOpModel {
       enum Padding padding = Padding_VALID,
       enum ActivationFunctionType activation = ActivationFunctionType_NONE,
       int dilation_width_factor = 1, int dilation_height_factor = 1) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
-
     input_ = AddInput(input);
     filter_ = AddInput(filter);
 
@@ -344,14 +339,10 @@ TEST(NNAPIDelegate, Conv2DWithNoActivation) {
                              }));
 }
 
-class DepthwiseConvolutionOpModel : public SingleOpModel {
+class DepthwiseConvolutionOpModel : public SingleOpModelWithNNAPI {
  public:
   DepthwiseConvolutionOpModel(const TensorData& input, const TensorData& filter,
                               const TensorData& output) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
-
     input_ = AddInput(input);
     filter_ = AddInput(filter);
 
@@ -426,15 +417,11 @@ TEST(NNAPIDelegate, DepthwiseConv2DWithNoActivation) {
                              }));
 }
 
-class FloatFullyConnectedOpModel : public SingleOpModel {
+class FloatFullyConnectedOpModel : public SingleOpModelWithNNAPI {
  public:
   FloatFullyConnectedOpModel(int units, int batches, const TensorData& input,
                              const TensorData& output = {TensorType_FLOAT32})
       : batches_(batches), units_(units) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
-
     int total_input_size = 1;
     for (int i = 0; i < input.shape.size(); ++i) {
       total_input_size *= input.shape[i];
@@ -515,14 +502,10 @@ TEST(NNAPIDelegate, FullyConnectedSimpleTest) {
   EXPECT_THAT(m.GetOutput(), ElementsAre(24, 25, 26, 58, 59, 60));
 }
 
-class SoftmaxOpModel : public SingleOpModel {
+class SoftmaxOpModel : public SingleOpModelWithNNAPI {
  public:
   SoftmaxOpModel(int batches, int size, float beta)
       : batches_(batches), input_size_(size), beta_(beta) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
-
     input_ = AddInput(TensorType_FLOAT32);
     output_ = AddOutput(TensorType_FLOAT32);
     SetBuiltinOp(BuiltinOperator_SOFTMAX, BuiltinOptions_SoftmaxOptions,
@@ -566,14 +549,10 @@ TEST(NNAPIDelegate, SoftmaxSimpleTest) {
           1e-6)));
 }
 
-class ReshapeOpModel : public SingleOpModel {
+class ReshapeOpModel : public SingleOpModelWithNNAPI {
  public:
   ReshapeOpModel(std::initializer_list<int> input_shape,
                  std::initializer_list<int> new_shape) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
-
     input_ = AddInput(TensorType_FLOAT32);
     new_shape_ = AddInput(TensorType_INT32);
     output_ = AddOutput(TensorType_FLOAT32);
@@ -605,14 +584,10 @@ TEST(NNAPIDelegate, ReshapeSimpleTest) {
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 2, 2}));
 }
 
-class SqueezeOpModel : public SingleOpModel {
+class SqueezeOpModel : public SingleOpModelWithNNAPI {
  public:
   SqueezeOpModel(const TensorData& input, const TensorData& output,
                  std::initializer_list<int> axis) {
-    this->SetApplyDelegate([](Interpreter* interpreter) {
-      interpreter->ModifyGraphWithDelegate(NnApiDelegate());
-    });
-
     input_ = AddInput(input);
     output_ = AddOutput(output);
     SetBuiltinOp(
@@ -666,6 +641,988 @@ TEST(NNAPIDelegate, SqueezeWithAxisTest) {
                         17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0}));
 }
 
+class L2NormOpModel : public SingleOpModelWithNNAPI {
+ public:
+  L2NormOpModel(const TensorData& input, const TensorData& output,
+                ActivationFunctionType activation_type) {
+    input_ = AddInput(input);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_L2_NORMALIZATION, BuiltinOptions_L2NormOptions,
+                 CreateL2NormOptions(builder_, activation_type).Union());
+    BuildInterpreter({GetShape(input_)});
+  }
+
+  void SetInput(std::initializer_list<float> data) {
+    PopulateTensor<float>(input_, data);
+  }
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int input_;
+  int new_shape_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, L2NormSimpleTest) {
+  std::initializer_list<float> data = {-1.1, 0.6, 0.7, 1.2, -0.7, 0.1};
+  L2NormOpModel m({TensorType_FLOAT32, {1, 1, 1, 6}},
+                  {TensorType_FLOAT32, {1, 1, 1, 6}},
+                  ActivationFunctionType_NONE);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 1, 1, 6}));
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray({-0.55, 0.3, 0.35, 0.6, -0.35, 0.05}));
+}
+
+class TransposeSimpleModel : public SingleOpModelWithNNAPI {
+ public:
+  TransposeSimpleModel(std::initializer_list<int> input_shape,
+                       std::initializer_list<int> perm_shape,
+                       std::initializer_list<int> perm) {
+    input_ = AddInput(TensorType_FLOAT32);
+    perm_ = AddConstInput(TensorType_INT32, perm, perm_shape);
+    output_ = AddOutput(TensorType_FLOAT32);
+    SetBuiltinOp(BuiltinOperator_TRANSPOSE, BuiltinOptions_TransposeOptions,
+                 CreateTransposeOptions(builder_).Union());
+    BuildInterpreter({input_shape, perm_shape});
+  }
+
+  void SetInput(std::initializer_list<float> data) {
+    PopulateTensor<float>(input_, data);
+  }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int input_;
+  int perm_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, TransposeSimpleTest) {
+  TransposeSimpleModel m({2, 3, 4}, {3}, {2, 0, 1});
+  m.SetInput({0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  10, 11,
+              12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 2, 3}));
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray({0, 4, 8,  12, 16, 20, 1, 5, 9,  13, 17, 21,
+                                2, 6, 10, 14, 18, 22, 3, 7, 11, 15, 19, 23}));
+}
+
+class FloatSubOpModel : public SingleOpModelWithNNAPI {
+ public:
+  FloatSubOpModel(const TensorData& input1, const TensorData& input2,
+                  const TensorData& output,
+                  ActivationFunctionType activation_type) {
+    input1_ = AddInput(input1);
+    input2_ = AddInput(input2);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_SUB, BuiltinOptions_SubOptions,
+                 CreateMulOptions(builder_, activation_type).Union());
+    BuildInterpreter({GetShape(input1_), GetShape(input2_)});
+  }
+
+  int input1() { return input1_; }
+  int input2() { return input2_; }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ protected:
+  int input1_;
+  int input2_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, SubWithNoActivation) {
+  FloatSubOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+  m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 0.7, 0.8});
+  m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.3, 0.5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear({-2.1, 0.0, 0.4, 0.3})));
+}
+
+class FloatDivOpModel : public SingleOpModelWithNNAPI {
+ public:
+  FloatDivOpModel(const TensorData& input1, const TensorData& input2,
+                  const TensorData& output,
+                  ActivationFunctionType activation_type) {
+    input1_ = AddInput(input1);
+    input2_ = AddInput(input2);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_DIV, BuiltinOptions_DivOptions,
+                 CreateMulOptions(builder_, activation_type).Union());
+    BuildInterpreter({GetShape(input1_), GetShape(input2_)});
+  }
+
+  int input1() { return input1_; }
+  int input2() { return input2_; }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ protected:
+  int input1_;
+  int input2_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, DivWithNoActivation) {
+  FloatDivOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {1, 2, 2, 1}},
+                    {TensorType_FLOAT32, {}}, ActivationFunctionType_NONE);
+  m.PopulateTensor<float>(m.input1(), {-2.0, 0.2, 0.8, 0.8});
+  m.PopulateTensor<float>(m.input2(), {0.1, 0.2, 0.4, 0.2});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({-20, 1, 2, 4})));
+}
+
+class BaseConcatenationOpModel : public SingleOpModelWithNNAPI {
+ public:
+  BaseConcatenationOpModel() {}
+  BaseConcatenationOpModel(const TensorData& input_template, int axis,
+                           int num_inputs) {
+    std::vector<std::vector<int>> all_input_shapes;
+    for (int i = 0; i < num_inputs; ++i) {
+      all_input_shapes.push_back(input_template.shape);
+      AddInput(input_template);
+    }
+    output_ = AddOutput({input_template.type, /*shape=*/{}, input_template.min,
+                         input_template.max});
+    SetBuiltinOp(
+        BuiltinOperator_CONCATENATION, BuiltinOptions_ConcatenationOptions,
+        CreateConcatenationOptions(builder_, axis, ActivationFunctionType_NONE)
+            .Union());
+    BuildInterpreter(all_input_shapes);
+  }
+
+ protected:
+  int output_;
+};
+
+class ConcatenationOpModel : public BaseConcatenationOpModel {
+ public:
+  using BaseConcatenationOpModel::BaseConcatenationOpModel;
+  void SetInput(int index, std::initializer_list<float> data) {
+    PopulateTensor(index, data);
+  }
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+};
+
+TEST(NNAPIDelegate, ConcatenationThreeDimensionalOneInput) {
+  ConcatenationOpModel m0({TensorType_FLOAT32, {2, 1, 2}}, /*axis=*/1,
+                          /*num_inputs=*/1);
+  m0.SetInput(0, {1.0f, 3.0f, 4.0f, 7.0f});
+  m0.Invoke();
+  EXPECT_THAT(m0.GetOutput(), ElementsAreArray({1, 3, 4, 7}));
+}
+
+TEST(NNAPIDelegate, ConcatenationFourInputs) {
+  ConcatenationOpModel m0({TensorType_FLOAT32, {2, 1, 2}}, /*axis=*/2,
+                          /*num_inputs=*/4);
+  m0.SetInput(0, {1.0f, 3.0f, 4.0f, 7.0f});
+  m0.SetInput(1, {1.1f, 3.1f, 4.1f, 7.1f});
+  m0.SetInput(2, {1.2f, 3.2f, 4.2f, 7.2f});
+  m0.SetInput(3, {1.3f, 3.3f, 4.3f, 7.3f});
+  m0.Invoke();
+  EXPECT_THAT(m0.GetOutput(),
+              ElementsAreArray({
+                  1.0f, 3.0f, 1.1f, 3.1f, 1.2f, 3.2f, 1.3f, 3.3f,  //
+                  4.0f, 7.0f, 4.1f, 7.1f, 4.2f, 7.2f, 4.3f, 7.3f,  //
+              }));
+}
+
+class QuantizedConcatenationOpModel : public BaseConcatenationOpModel {
+ public:
+  using BaseConcatenationOpModel::BaseConcatenationOpModel;
+  QuantizedConcatenationOpModel(const std::vector<TensorData>& input_template,
+                                int axis, int num_inputs,
+                                const TensorData& output_template) {
+    std::vector<std::vector<int>> all_input_shapes;
+    CHECK_EQ(input_template.size(), num_inputs);
+    for (int i = 0; i < num_inputs; ++i) {
+      all_input_shapes.push_back(input_template[i].shape);
+      AddInput(input_template[i]);
+    }
+    output_ = AddOutput({output_template.type, /*shape=*/{},
+                         output_template.min, output_template.max});
+    SetBuiltinOp(
+        BuiltinOperator_CONCATENATION, BuiltinOptions_ConcatenationOptions,
+        CreateConcatenationOptions(builder_, axis, ActivationFunctionType_NONE)
+            .Union());
+    BuildInterpreter(all_input_shapes);
+  }
+  void SetInput(int index, std::initializer_list<float> data) {
+    QuantizeAndPopulate<uint8_t>(index, data);
+  }
+  std::vector<uint8_t> GetOutput() { return ExtractVector<uint8_t>(output_); }
+  std::vector<float> GetDequantizedOutput() {
+    return Dequantize<uint8_t>(ExtractVector<uint8_t>(output_),
+                               GetScale(output_), GetZeroPoint(output_));
+  }
+};
+
+TEST(NNAPIDelegate, ConcatenationFourInputsQuantized) {
+  QuantizedConcatenationOpModel m0({TensorType_UINT8, {2, 1, 2}, -12.7, 12.8},
+                                   /*axis=*/2,
+                                   /*num_inputs=*/4);
+
+  m0.SetInput(0, {1.0f, 3.0f, 4.0f, 7.0f});
+  m0.SetInput(1, {1.1f, 3.1f, 4.1f, 7.1f});
+  m0.SetInput(2, {1.2f, 3.2f, 4.2f, 7.2f});
+  m0.SetInput(3, {1.3f, 3.3f, 4.3f, 7.3f});
+  m0.Invoke();
+  EXPECT_THAT(m0.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear({
+                  1.0f, 3.0f, 1.1f, 3.1f, 1.2f, 3.2f, 1.3f, 3.3f,  //
+                  4.0f, 7.0f, 4.1f, 7.1f, 4.2f, 7.2f, 4.3f, 7.3f,  //
+              })));
+  EXPECT_THAT(m0.GetOutput(), ElementsAreArray({
+                                  137, 157, 138, 158, 139, 159, 140, 160,  //
+                                  167, 197, 168, 198, 169, 199, 170, 200,  //
+                              }));
+}
+
+TEST(NNAPIDelegate, ConcatenationFourInputsQuantizedMixedRange) {
+  QuantizedConcatenationOpModel m0({{TensorType_UINT8, {2, 1, 2}, -10.7, 10.8},
+                                    {TensorType_UINT8, {2, 1, 2}, 0, 12.8},
+                                    {TensorType_UINT8, {2, 1, 2}, -11, 11.8},
+                                    {TensorType_UINT8, {2, 1, 2}, 0, 7.4}},
+                                   /*axis=*/2, /*num_inputs=*/4,
+                                   {TensorType_UINT8, {2, 1, 2}, -12.7, 12.8});
+
+  m0.SetInput(0, {1.0f, 3.0f, 4.0f, 7.0f});
+  m0.SetInput(1, {1.1f, 3.1f, 4.1f, 7.1f});
+  m0.SetInput(2, {1.2f, 3.2f, 4.2f, 7.2f});
+  m0.SetInput(3, {1.3f, 3.3f, 4.3f, 7.3f});
+  m0.Invoke();
+  EXPECT_THAT(m0.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear({
+                  1.0f, 3.0f, 1.1f, 3.1f, 1.2f, 3.2f, 1.3f, 3.3f,  //
+                  4.0f, 7.0f, 4.1f, 7.1f, 4.2f, 7.2f, 4.3f, 7.3f,  //
+              })));
+  EXPECT_THAT(m0.GetOutput(), ElementsAreArray({
+                                  137, 157, 138, 158, 139, 159, 140, 160,  //
+                                  167, 197, 168, 198, 169, 199, 170, 200,  //
+                              }));
+}
+
+class DequantizeOpModel : public SingleOpModelWithNNAPI {
+ public:
+  DequantizeOpModel(std::initializer_list<int> shape, float min, float max) {
+    input_ = AddInput({TensorType_UINT8, shape, min, max});
+    output_ = AddOutput({TensorType_FLOAT32, shape});
+    SetBuiltinOp(BuiltinOperator_DEQUANTIZE, BuiltinOptions_DequantizeOptions,
+                 CreateDequantizeOptions(builder_).Union());
+
+    BuildInterpreter({GetShape(input_)});
+  }
+
+  void SetInput(std::initializer_list<uint8_t> data) {
+    PopulateTensor(input_, data);
+  }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ private:
+  int input_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, DequantizeFourDimensional) {
+  DequantizeOpModel m({2, 5}, -63.5, 64);
+
+  m.SetInput({0, 1, 2, 3, 4, 251, 252, 253, 254, 255});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {-63.5, -63, -62.5, -62, -61.5, 62, 62.5, 63, 63.5, 64})));
+}
+
+class FloorOpModel : public SingleOpModelWithNNAPI {
+ public:
+  FloorOpModel(std::initializer_list<int> input_shape, TensorType input_type) {
+    input_ = AddInput(TensorType_FLOAT32);
+    output_ = AddOutput(TensorType_FLOAT32);
+    SetBuiltinOp(BuiltinOperator_FLOOR, BuiltinOptions_NONE, 0);
+    BuildInterpreter({
+        input_shape,
+    });
+  }
+
+  int input() { return input_; }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int input_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, FloorSingleDim) {
+  FloorOpModel model({2}, TensorType_FLOAT32);
+  model.PopulateTensor<float>(model.input(), {8.5, 0.0});
+  model.Invoke();
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({8, 0}));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({2}));
+}
+
+TEST(NNAPIDelegate, FloorMultiDims) {
+  FloorOpModel model({2, 1, 1, 5}, TensorType_FLOAT32);
+  model.PopulateTensor<float>(model.input(), {
+                                                 0.0001,
+                                                 8.0001,
+                                                 0.9999,
+                                                 9.9999,
+                                                 0.5,
+                                                 -0.0001,
+                                                 -8.0001,
+                                                 -0.9999,
+                                                 -9.9999,
+                                                 -0.5,
+                                             });
+  model.Invoke();
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({0, 8, 0, 9, 0, -1, -9, -1, -10, -1}));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({2, 1, 1, 5}));
+}
+
+class LocalResponseNormOpModel : public SingleOpModelWithNNAPI {
+ public:
+  LocalResponseNormOpModel(std::initializer_list<int> input_shape, int radius,
+                           float bias, float alpha, float beta) {
+    input_ = AddInput(TensorType_FLOAT32);
+    output_ = AddOutput(TensorType_FLOAT32);
+    SetBuiltinOp(BuiltinOperator_LOCAL_RESPONSE_NORMALIZATION,
+                 BuiltinOptions_LocalResponseNormalizationOptions,
+                 CreateLocalResponseNormalizationOptions(builder_, radius, bias,
+                                                         alpha, beta)
+                     .Union());
+    BuildInterpreter({input_shape});
+  }
+
+  void SetInput(std::initializer_list<float> data) {
+    PopulateTensor(input_, data);
+  }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ private:
+  int input_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, LocalResponseNormSameAsL2Norm) {
+  LocalResponseNormOpModel m({1, 1, 1, 6}, /*radius=*/20, /*bias=*/0.0,
+                             /*alpha=*/1.0, /*beta=*/0.5);
+  m.SetInput({-1.1, 0.6, 0.7, 1.2, -0.7, 0.1});
+  m.Invoke();
+  // The result is every input divided by 2.
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray(ArrayFloatNear({-0.55, 0.3, 0.35, 0.6, -0.35, 0.05})));
+}
+
+TEST(NNAPIDelegate, LocalResponseNormWithAlpha) {
+  LocalResponseNormOpModel m({1, 1, 1, 6}, /*radius=*/20, /*bias=*/0.0,
+                             /*alpha=*/4.0, /*beta=*/0.5);
+  m.SetInput({-1.1, 0.6, 0.7, 1.2, -0.7, 0.1});
+  m.Invoke();
+  // The result is every input divided by 3.
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear(
+                                 {-0.275, 0.15, 0.175, 0.3, -0.175, 0.025})));
+}
+
+TEST(NNAPIDelegate, LocalResponseNormWithBias) {
+  LocalResponseNormOpModel m({1, 1, 1, 6}, /*radius=*/20, /*bias=*/9.0,
+                             /*alpha=*/4.0, /*beta=*/0.5);
+  m.SetInput({-1.1, 0.6, 0.7, 1.2, -0.7, 0.1});
+  m.Invoke();
+  // The result is every input divided by 5.
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray(ArrayFloatNear({-0.22, 0.12, 0.14, 0.24, -0.14, 0.02})));
+}
+
+TEST(NNAPIDelegate, LocalResponseNormSmallRadius) {
+  LocalResponseNormOpModel m({1, 1, 1, 6}, /*radius=*/2, /*bias=*/9.0,
+                             /*alpha=*/4.0, /*beta=*/0.5);
+  m.SetInput({-1.1, 0.6, 0.7, 1.2, -0.7, 0.1});
+  m.Invoke();
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray(ArrayFloatNear(
+          {-0.264926, 0.125109, 0.140112, 0.267261, -0.161788, 0.0244266})));
+}
+
+class LSHProjectionOpModel : public SingleOpModelWithNNAPI {
+ public:
+  LSHProjectionOpModel(LSHProjectionType type,
+                       std::initializer_list<int> hash_shape,
+                       std::initializer_list<int> input_shape,
+                       std::initializer_list<int> weight_shape) {
+    hash_ = AddInput(TensorType_FLOAT32);
+    input_ = AddInput(TensorType_INT32);
+    if (weight_shape.size() > 0) {
+      weight_ = AddInput(TensorType_FLOAT32);
+    }
+    output_ = AddOutput(TensorType_INT32);
+
+    SetBuiltinOp(BuiltinOperator_LSH_PROJECTION,
+                 BuiltinOptions_LSHProjectionOptions,
+                 CreateLSHProjectionOptions(builder_, type).Union());
+    if (weight_shape.size() > 0) {
+      BuildInterpreter({hash_shape, input_shape, weight_shape});
+    } else {
+      BuildInterpreter({hash_shape, input_shape});
+    }
+
+    output_size_ = 1;
+    for (int i : hash_shape) {
+      output_size_ *= i;
+      if (type == LSHProjectionType_SPARSE) {
+        break;
+      }
+    }
+  }
+  void SetInput(std::initializer_list<int> data) {
+    PopulateTensor(input_, data);
+  }
+
+  void SetHash(std::initializer_list<float> data) {
+    PopulateTensor(hash_, data);
+  }
+
+  void SetWeight(std::initializer_list<float> f) { PopulateTensor(weight_, f); }
+
+  std::vector<int> GetOutput() { return ExtractVector<int>(output_); }
+
+ private:
+  int input_;
+  int hash_;
+  int weight_;
+  int output_;
+
+  int output_size_;
+};
+
+TEST(NNAPIDelegate, LSHProjectionDense1DInputs) {
+  LSHProjectionOpModel m(LSHProjectionType_DENSE, {3, 2}, {5}, {5});
+
+  m.SetInput({12345, 54321, 67890, 9876, -12345678});
+  m.SetHash({0.123, 0.456, -0.321, 1.234, 5.678, -4.321});
+  m.SetWeight({1.0, 1.0, 1.0, 1.0, 1.0});
+
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(), ElementsAre(0, 0, 0, 1, 0, 0));
+}
+
+TEST(NNAPIDelegate, LSHProjectionSparse1DInputs) {
+  LSHProjectionOpModel m(LSHProjectionType_SPARSE, {3, 2}, {5}, {});
+
+  m.SetInput({12345, 54321, 67890, 9876, -12345678});
+  m.SetHash({0.123, 0.456, -0.321, 1.234, 5.678, -4.321});
+
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(), ElementsAre(0 + 0, 4 + 1, 8 + 0));
+}
+
+TEST(NNAPIDelegate, LSHProjectionSparse3DInputs) {
+  LSHProjectionOpModel m(LSHProjectionType_SPARSE, {3, 2}, {5, 2, 2}, {5});
+
+  m.SetInput({1234, 2345, 3456, 1234, 4567, 5678, 6789, 4567, 7891, 8912,
+              9123, 7890, -987, -876, -765, -987, -543, -432, -321, -543});
+  m.SetHash({0.123, 0.456, -0.321, 1.234, 5.678, -4.321});
+  m.SetWeight({0.12, 0.34, 0.56, 0.67, 0.78});
+
+  m.Invoke();
+
+  EXPECT_THAT(m.GetOutput(), ElementsAre(0 + 2, 4 + 1, 8 + 1));
+}
+
+class BaseActivationsOpModel : public SingleOpModelWithNNAPI {
+ public:
+  // Most activations don't take any options, so this constructor works for
+  // them.
+  BaseActivationsOpModel(BuiltinOperator type, TensorData input) {
+    input_ = AddInput(input);
+    if (input.type == TensorType_UINT8) {
+      output_ = AddOutput({input.type, {}, 0, 0, 1. / 256});
+    } else {
+      output_ = AddOutput({input.type, {}});
+    }
+    SetBuiltinOp(type, BuiltinOptions_NONE, 0);
+    BuildInterpreter({GetShape(input_)});
+  }
+
+  BaseActivationsOpModel(BuiltinOperator type, const TensorData& input,
+                         const TensorData& output) {
+    input_ = AddInput(input);
+    output_ = AddOutput(output);
+    SetBuiltinOp(type, BuiltinOptions_NONE, 0);
+    BuildInterpreter({GetShape(input_)});
+  }
+
+ protected:
+  int input_;
+  int output_;
+};
+
+class FloatActivationsOpModel : public BaseActivationsOpModel {
+ public:
+  using BaseActivationsOpModel::BaseActivationsOpModel;
+
+  void SetInput(std::initializer_list<float> data) {
+    PopulateTensor(input_, data);
+  }
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+};
+
+const float kQuantizedTolerance = 2 * (1. / 256);
+
+class QuantizedActivationsOpModel : public BaseActivationsOpModel {
+ public:
+  using BaseActivationsOpModel::BaseActivationsOpModel;
+
+  template <typename T>
+  void SetInput(std::initializer_list<float> data) {
+    QuantizeAndPopulate<T>(input_, data);
+  }
+  template <typename T>
+
+  std::vector<T> GetOutput() {
+    return ExtractVector<T>(output_);
+  }
+  template <typename T>
+  std::vector<float> GetDequantizedOutput() {
+    return Dequantize<T>(ExtractVector<T>(output_), GetScale(output_),
+                         GetZeroPoint(output_));
+  }
+};
+
+TEST(NNAPIDelegate, Relu) {
+  FloatActivationsOpModel m(BuiltinOperator_RELU,
+                            /*input=*/{TensorType_FLOAT32, {1, 2, 4, 1}});
+  m.SetInput({
+      0, -6, 2, 4,   //
+      3, -2, 10, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({
+                                 0, 0, 2, 4,   //
+                                 3, 0, 10, 1,  //
+                             }));
+}
+
+TEST(NNAPIDelegate, Relu1) {
+  FloatActivationsOpModel m(BuiltinOperator_RELU_N1_TO_1,
+                            /*input=*/{TensorType_FLOAT32, {1, 2, 4, 1}});
+  m.SetInput({
+      0.0, -0.6, 0.2, -0.4,  //
+      0.3, -2.0, 1.1, -0.1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({
+                                 0.0, -0.6, 0.2, -0.4,  //
+                                 0.3, -1.0, 1.0, -0.1,  //
+                             }));
+}
+
+TEST(NNAPIDelegate, Relu6) {
+  FloatActivationsOpModel m(BuiltinOperator_RELU6,
+                            /*input=*/{TensorType_FLOAT32, {1, 2, 4, 1}});
+  m.SetInput({
+      0, -6, 2, 4,   //
+      3, -2, 10, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({
+                                 0, 0, 2, 4,  //
+                                 3, 0, 6, 1,  //
+                             }));
+}
+
+TEST(NNAPIDelegate, Tanh) {
+  FloatActivationsOpModel m(BuiltinOperator_TANH,
+                            /*input=*/{TensorType_FLOAT32, {1, 2, 4, 1}});
+  m.SetInput({
+      0, -6, 2, 4,   //
+      3, -2, 10, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({
+                                 0, -0.9999877, 0.9640275, 0.999329,    //
+                                 0.99505475, -0.9640275, 1, 0.7615941,  //
+                             })));
+}
+
+TEST(NNAPIDelegate, LogisticFloat) {
+  FloatActivationsOpModel m(BuiltinOperator_LOGISTIC,
+                            /*input=*/{TensorType_FLOAT32, {1, 2, 4, 1}});
+  m.SetInput({
+      0, -6, 2, 4,   //
+      3, -2, 10, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({
+                                 0.5, 0.002473, 0.880797, 0.982014,       //
+                                 0.952574, 0.119203, 0.999955, 0.731059,  //
+                             })));
+}
+
+TEST(NNAPIDelegate, LogisticQuantized) {
+  QuantizedActivationsOpModel m(
+      BuiltinOperator_LOGISTIC,
+      /*input=*/{TensorType_UINT8, {1, 2, 4, 1}, -10, 10});
+  m.SetInput<uint8_t>({
+      0, -6, 2, 4,   //
+      3, -2, 10, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetDequantizedOutput<uint8_t>(),
+              ElementsAreArray(ArrayFloatNear(
+                  {
+                      0.5, 0.002473, 0.880797, 0.982014,       //
+                      0.952574, 0.119203, 0.999955, 0.731059,  //
+                  },
+                  kQuantizedTolerance)));
+  EXPECT_THAT(m.GetOutput<uint8_t>(),
+              ElementsAreArray({128, 1, 227, 251, 244, 32, 255, 188}));
+}
+
+#if 0
+class ResizeBilinearOpModel : public SingleOpModelWithNNAPI {
+ public:
+  ResizeBilinearOpModel(const TensorData& input,
+                        std::initializer_list<int> size_data = {}) {
+    bool const_size = size_data.size() != 0;
+    input_ = AddInput(input);
+    if (const_size) {
+      size_ = AddConstInput(TensorType_INT32, size_data, {2});
+    } else {
+      size_ = AddInput({TensorType_INT32, {2}});
+    }
+    output_ = AddOutput(input.type);
+    SetBuiltinOp(BuiltinOperator_RESIZE_BILINEAR,
+                 BuiltinOptions_ResizeBilinearOptions,
+                 CreateResizeBilinearOptions(builder_).Union());
+    if (const_size) {
+      BuildInterpreter({GetShape(input_)});
+    } else {
+      BuildInterpreter({GetShape(input_), GetShape(size_)});
+    }
+  }
+
+  template <typename T>
+  void SetInput(std::initializer_list<T> data) {
+    PopulateTensor(input_, data);
+  }
+  void SetSize(std::initializer_list<int> data) { PopulateTensor(size_, data); }
+
+  template <typename T>
+  std::vector<T> GetOutput() {
+    return ExtractVector<T>(output_);
+  }
+
+ private:
+  int input_;
+  int size_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, ResizeBilinearHorizontal) {
+  ResizeBilinearOpModel m({TensorType_FLOAT32, {1, 1, 2, 1}});
+  m.SetInput<float>({3, 6});
+  m.SetSize({1, 3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 5, 6})));
+
+  ResizeBilinearOpModel const_m({TensorType_FLOAT32, {1, 1, 2, 1}}, {1, 3});
+  const_m.SetInput<float>({3, 6});
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 5, 6})));
+}
+
+TEST(NNAPIDelegate, ResizeBilinearVertical) {
+  ResizeBilinearOpModel m({TensorType_FLOAT32, {1, 2, 1, 1}});
+  m.SetInput<float>({3, 9});
+  m.SetSize({3, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 7, 9})));
+
+  ResizeBilinearOpModel const_m({TensorType_FLOAT32, {1, 2, 1, 1}}, {3, 1});
+  const_m.SetInput<float>({3, 9});
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({3, 7, 9})));
+}
+
+TEST(NNAPIDelegate, ResizeBilinearTwoDimensional) {
+  ResizeBilinearOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}});
+  m.SetInput<float>({
+      3, 6,  //
+      9, 12  //
+  });
+  m.SetSize({3, 3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                        3, 5, 6,    //
+                                        7, 9, 10,   //
+                                        9, 11, 12,  //
+                                    })));
+
+  ResizeBilinearOpModel const_m({TensorType_FLOAT32, {1, 2, 2, 1}}, {3, 3});
+  const_m.SetInput<float>({
+      3, 6,  //
+      9, 12  //
+  });
+  const_m.Invoke();
+  EXPECT_THAT(const_m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({
+                                              3, 5, 6,    //
+                                              7, 9, 10,   //
+                                              9, 11, 12,  //
+                                          })));
+}
+#endif
+
+template <typename T>
+class PadOpModel : public SingleOpModelWithNNAPI {
+ public:
+  void SetInput(std::initializer_list<T> data) {
+    PopulateTensor<T>(input_, data);
+  }
+
+  void SetQuantizedInput(std::initializer_list<float> data) {
+    QuantizeAndPopulate<uint8_t>(input_, data);
+  }
+
+  void SetQuantizedPadValue(float data) {
+    QuantizeAndPopulate<uint8_t>(constant_values_, {data});
+  }
+
+  void SetPaddings(std::initializer_list<int> paddings) {
+    PopulateTensor<int>(paddings_, paddings);
+  }
+
+  std::vector<T> GetOutput() { return ExtractVector<T>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+  std::vector<float> GetDequantizedOutput() {
+    return Dequantize<uint8_t>(ExtractVector<uint8_t>(output_),
+                               GetScale(output_), GetZeroPoint(output_));
+  }
+
+ protected:
+  int input_;
+  int output_;
+  int paddings_;
+  int constant_values_;
+};
+
+class PadOpConstModel : public PadOpModel<float> {
+ public:
+  PadOpConstModel(const TensorData& input,
+                  std::initializer_list<int> paddings_shape,
+                  std::initializer_list<int> paddings,
+                  const TensorData& output) {
+    input_ = AddInput(input);
+    paddings_ = AddConstInput(TensorType_INT32, paddings, paddings_shape);
+    output_ = AddOutput(output);
+
+    SetBuiltinOp(BuiltinOperator_PAD, BuiltinOptions_PadOptions,
+                 CreatePadOptions(builder_).Union());
+    BuildInterpreter({input.shape});
+  }
+};
+
+TEST(NNAPIDelegate, PadAdvancedConstTest) {
+  PadOpConstModel m({TensorType_FLOAT32, {1, 2, 3, 1}}, {4, 2},
+                    {0, 0, 0, 2, 1, 3, 0, 0}, {TensorType_FLOAT32});
+  m.SetInput({1, 2, 3, 4, 5, 6});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray({0, 1, 2, 3, 0, 0, 0, 0, 4, 5, 6, 0, 0, 0,
+                                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 4, 7, 1}));
+}
+
+class SpaceToBatchNDOpModel : public SingleOpModelWithNNAPI {
+ public:
+  void SetInput(std::initializer_list<float> data) {
+    PopulateTensor<float>(input_, data);
+  }
+
+  void SetBlockShape(std::initializer_list<int> data) {
+    PopulateTensor<int>(block_shape_, data);
+  }
+
+  void SetPaddings(std::initializer_list<int> data) {
+    PopulateTensor<int>(paddings_, data);
+  }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ protected:
+  int input_;
+  int block_shape_;
+  int paddings_;
+  int output_;
+};
+
+class SpaceToBatchNDOpConstModel : public SpaceToBatchNDOpModel {
+ public:
+  SpaceToBatchNDOpConstModel(std::initializer_list<int> input_shape,
+                             std::initializer_list<int> block_shape,
+                             std::initializer_list<int> paddings) {
+    input_ = AddInput(TensorType_FLOAT32);
+    block_shape_ = AddConstInput(TensorType_INT32, block_shape, {2});
+    paddings_ = AddConstInput(TensorType_INT32, paddings, {2, 2});
+    output_ = AddOutput(TensorType_FLOAT32);
+
+    SetBuiltinOp(BuiltinOperator_SPACE_TO_BATCH_ND,
+                 BuiltinOptions_SpaceToBatchNDOptions,
+                 CreateSpaceToBatchNDOptions(builder_).Union());
+    BuildInterpreter({input_shape});
+  }
+};
+
+TEST(NNAPIDelegate, SpaceToBatchNDSimpleConstTest) {
+  SpaceToBatchNDOpConstModel m({1, 4, 4, 1}, {2, 2}, {0, 0, 0, 0});
+  m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 2, 2, 1}));
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({1, 3, 9, 11, 2, 4, 10, 12, 5, 7,
+                                               13, 15, 6, 8, 14, 16}));
+}
+
+TEST(NNAPIDelegate, SpaceToBatchNDMultipleInputBatchesConstTest) {
+  SpaceToBatchNDOpConstModel m({2, 2, 4, 1}, {2, 2}, {0, 0, 0, 0});
+  m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({8, 1, 2, 1}));
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({1, 3, 9, 11, 2, 4, 10, 12, 5, 7,
+                                               13, 15, 6, 8, 14, 16}));
+}
+
+TEST(NNAPIDelegate, SpaceToBatchNDSimplePaddingConstTest) {
+  SpaceToBatchNDOpConstModel m({1, 5, 2, 1}, {3, 2}, {1, 0, 2, 0});
+  m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 2, 1}));
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({
+                                 0, 0, 0, 5, 0, 0, 0, 6, 0, 1, 0, 7,
+                                 0, 2, 0, 8, 0, 3, 0, 9, 0, 4, 0, 10,
+                             }));
+}
+
+TEST(NNAPIDelegate, SpaceToBatchNDComplexPaddingConstTest) {
+  SpaceToBatchNDOpConstModel m({1, 4, 2, 1}, {3, 2}, {1, 1, 2, 4});
+  m.SetInput({1, 2, 3, 4, 5, 6, 7, 8});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 4, 1}));
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({
+                                 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0,
+                                 0, 1, 0, 0, 0, 7, 0, 0, 0, 2, 0, 0, 0, 8, 0, 0,
+                                 0, 3, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0,
+                             }));
+}
+
+template <typename input_type = float,
+          TensorType tensor_input_type = TensorType_FLOAT32>
+class StridedSliceOpModel : public SingleOpModelWithNNAPI {
+ public:
+  StridedSliceOpModel(std::initializer_list<int> input_shape,
+                      std::initializer_list<int> begin_shape,
+                      std::initializer_list<int> end_shape,
+                      std::initializer_list<int> strides_shape, int begin_mask,
+                      int end_mask, int ellipsis_mask, int new_axis_mask,
+                      int shrink_axis_mask) {
+    input_ = AddInput(tensor_input_type);
+    begin_ = AddInput(TensorType_INT32);
+    end_ = AddInput(TensorType_INT32);
+    strides_ = AddInput(TensorType_INT32);
+    output_ = AddOutput(tensor_input_type);
+    SetBuiltinOp(
+        BuiltinOperator_STRIDED_SLICE, BuiltinOptions_StridedSliceOptions,
+        CreateStridedSliceOptions(builder_, begin_mask, end_mask, ellipsis_mask,
+                                  new_axis_mask, shrink_axis_mask)
+            .Union());
+    BuildInterpreter({input_shape, begin_shape, end_shape, strides_shape});
+  }
+
+  void SetInput(std::initializer_list<input_type> data) {
+    PopulateTensor<input_type>(input_, data);
+  }
+  void SetBegin(std::initializer_list<int32_t> data) {
+    PopulateTensor<int32_t>(begin_, data);
+  }
+  void SetEnd(std::initializer_list<int32_t> data) {
+    PopulateTensor<int32_t>(end_, data);
+  }
+  void SetStrides(std::initializer_list<int32_t> data) {
+    PopulateTensor<int32_t>(strides_, data);
+  }
+
+  std::vector<input_type> GetOutput() {
+    return ExtractVector<input_type>(output_);
+  }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int input_;
+  int begin_;
+  int end_;
+  int strides_;
+  int output_;
+};
+
+TEST(NNAPIDelegate, StridedSliceIn2D) {
+  StridedSliceOpModel<> m({2, 3}, {2}, {2}, {2}, 0, 0, 0, 0, 0);
+  m.SetInput({1, 2, 3, 4, 5, 6});
+  m.SetBegin({1, 0});
+  m.SetEnd({2, 2});
+  m.SetStrides({1, 1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 2}));
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({4, 5}));
+}
+
+TEST(NNAPIDelegate, StridedSliceIn2D_ShrinkAxis_NegativeSlice) {
+  // This is equivalent to tf.range(4)[:, tf.newaxis][-2, -1].
+  StridedSliceOpModel<> m({4, 1}, {2}, {2}, {2}, 0, 0, 0, 0, 3);
+  m.SetInput({0, 1, 2, 3});
+  m.SetBegin({-2, -1});
+  m.SetEnd({-1, 0});
+  m.SetStrides({1, 1});
+
+  m.Invoke();
+  EXPECT_TRUE(m.GetOutputShape().empty());
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({2}));
+}
+
+TEST(NNAPIDelegate, StridedSliceIn2D_ShrinkAxisMask) {
+  StridedSliceOpModel<> m({2, 3}, {2}, {2}, {2}, 0, 0, 0, 0, 3);
+  m.SetInput({1, 2, 3, 4, 5, 6});
+  m.SetBegin({0, 0});
+  m.SetEnd({1, 1});
+  m.SetStrides({1, 1});
+  m.Invoke();
+  EXPECT_TRUE(m.GetOutputShape().empty());
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({1}));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/download_dependencies.sh b/tensorflow/contrib/lite/download_dependencies.sh
index 840015a7fa..8c7df474d5 100755
--- a/tensorflow/contrib/lite/download_dependencies.sh
+++ b/tensorflow/contrib/lite/download_dependencies.sh
@@ -35,7 +35,7 @@ GOOGLETEST_URL="https://github.com/google/googletest/archive/release-1.8.0.tar.g
 ABSL_URL="$(grep -o 'https://github.com/abseil/abseil-cpp/.*tar.gz' "${BZL_FILE_PATH}" | head -n1)"
 NEON_2_SSE_URL="https://github.com/intel/ARM_NEON_2_x86_SSE/archive/master.zip"
 FARMHASH_URL="https://mirror.bazel.build/github.com/google/farmhash/archive/816a4ae622e964763ca0862d9dbd19324a1eaf45.tar.gz"
-FLATBUFFERS_URL="https://github.com/google/flatbuffers/archive/master.zip"
+FLATBUFFERS_URL="https://github.com/google/flatbuffers/archive/v1.8.0.zip"
 FFT2D_URL="https://mirror.bazel.build/www.kurims.kyoto-u.ac.jp/~ooura/fft.tgz"
 
 # TODO(petewarden): Some new code in Eigen triggers a clang bug with iOS arm64,
diff --git a/tensorflow/contrib/lite/examples/android/app/README.md b/tensorflow/contrib/lite/examples/android/app/README.md
new file mode 100644
index 0000000000..cbdeeac879
--- /dev/null
+++ b/tensorflow/contrib/lite/examples/android/app/README.md
@@ -0,0 +1,19 @@
+# TF Lite Android App Example
+
+## Building from Source with Bazel
+
+1. Install [Bazel](https://docs.bazel.build/versions/master/install.html), the Android NDK and SDK. The recommended versions are specified on this [webpage](https://www.tensorflow.org/mobile/tflite/demo_android#build_tensorflow_lite_and_the_demo_app_from_source).
+
+2. Build this demo app with Bazel. The demo needs C++11. We configure the fat_apk_cpu flag to package support for 4 hardware variants. You may replace it with --config=android_arm64 on a 64-bit device and --config=android_arm for 32-bit device:
+
+  ```shell
+  bazel build -c opt --cxxopt='--std=c++11' --fat_apk_cpu=x86,x86_64,arm64-v8a,armeabi-v7a \
+    //tensorflow/contrib/lite/examples/android:tflite_demo
+  ```
+
+3. Install the demo on a
+   [debug-enabled device](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android#install):
+
+  ```shell
+  adb install bazel-bin/tensorflow/contrib/lite/examples/android/tflite_demo.apk
+  ```
diff --git a/tensorflow/contrib/lite/examples/android/app/build.gradle b/tensorflow/contrib/lite/examples/android/app/build.gradle
index 1ffb9dd377..eb7fd705e1 100644
--- a/tensorflow/contrib/lite/examples/android/app/build.gradle
+++ b/tensorflow/contrib/lite/examples/android/app/build.gradle
@@ -51,7 +51,7 @@ apply from: "download-models.gradle"
 
 dependencies {
     compile fileTree(dir: 'libs', include: ['*.jar'])
-    androidTestCompile('com.androidx.test.espresso:espresso-core:2.2.2', {
+    androidTestCompile('androidx.test.espresso:espresso-core:3.1.0-alpha3', {
         exclude group: 'com.android.support', module: 'support-annotations'
     })
     compile 'org.tensorflow:tensorflow-lite:0.0.0-nightly'
diff --git a/tensorflow/contrib/lite/examples/android/app/src/main/assets/pets_labels_list.txt b/tensorflow/contrib/lite/examples/android/app/src/main/assets/pets_labels_list.txt
new file mode 100644
index 0000000000..d581f733e4
--- /dev/null
+++ b/tensorflow/contrib/lite/examples/android/app/src/main/assets/pets_labels_list.txt
@@ -0,0 +1,38 @@
+???
+Abyssinian
+american_bulldog
+american_pit_bull_terrier
+basset_hound
+beagle
+Bengal
+Birman
+Bombay
+boxer
+British_Shorthair
+chihuahua
+Egyptian_Mau
+english_cocker_spaniel
+english_setter
+german_shorthaired
+great_pyrenees
+havanese
+japanese_chin
+keeshond
+leonberger
+Maine_Coon
+miniature_pinscher
+newfoundland
+Persian
+pomeranian
+pug
+Ragdoll
+Russian_Blue
+saint_bernard
+samoyed
+scottish_terrier
+shiba_inu
+Siamese
+Sphynx
+staffordshire_bull_terrier
+wheaten_terrier
+yorkshire_terrier
diff --git a/tensorflow/contrib/lite/examples/label_image/BUILD b/tensorflow/contrib/lite/examples/label_image/BUILD
index c61445114e..fc55a78019 100644
--- a/tensorflow/contrib/lite/examples/label_image/BUILD
+++ b/tensorflow/contrib/lite/examples/label_image/BUILD
@@ -63,6 +63,7 @@ cc_test(
     data = [
         "testdata/grace_hopper.bmp",
     ],
+    tags = ["no_oss"],
     deps = [
         ":bitmap_helpers",
         "@com_google_googletest//:gtest",
diff --git a/tensorflow/contrib/lite/examples/label_image/bitmap_helpers_impl.h b/tensorflow/contrib/lite/examples/label_image/bitmap_helpers_impl.h
index e36218e4f1..6fdcf78b69 100644
--- a/tensorflow/contrib/lite/examples/label_image/bitmap_helpers_impl.h
+++ b/tensorflow/contrib/lite/examples/label_image/bitmap_helpers_impl.h
@@ -16,11 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_LITE_EXAMPLES_LABEL_IMAGE_BITMAP_HELPERS_IMPL_H_
 #define TENSORFLOW_CONTRIB_LITE_EXAMPLES_LABEL_IMAGE_BITMAP_HELPERS_IMPL_H_
 
-#include "tensorflow/contrib/lite/builtin_op_data.h"
-#include "tensorflow/contrib/lite/interpreter.h"
-#include "tensorflow/contrib/lite/kernels/register.h"
-#include "tensorflow/contrib/lite/string_util.h"
-#include "tensorflow/contrib/lite/version.h"
+#include "tensorflow/contrib/lite/examples/label_image/label_image.h"
 
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/interpreter.h"
@@ -28,8 +24,6 @@ limitations under the License.
 #include "tensorflow/contrib/lite/string_util.h"
 #include "tensorflow/contrib/lite/version.h"
 
-#include "tensorflow/contrib/lite/examples/label_image/label_image.h"
-
 namespace tflite {
 namespace label_image {
 
diff --git a/tensorflow/contrib/lite/experimental/c/BUILD b/tensorflow/contrib/lite/experimental/c/BUILD
new file mode 100644
index 0000000000..50f8da66d0
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/c/BUILD
@@ -0,0 +1,59 @@
+package(default_visibility = ["//visibility:private"])
+
+licenses(["notice"])  # Apache 2.0
+
+load(
+    "//tensorflow/contrib/lite:build_def.bzl",
+    "tflite_cc_shared_object",
+    "tflite_copts",
+    "tflite_jni_binary",
+)
+
+tflite_cc_shared_object(
+    name = "libtensorflowlite_c.so",
+    linkopts = select({
+        "//tensorflow:darwin": [
+            "-Wl,-exported_symbols_list",  # This line must be directly followed by the exported_symbols.lds file
+            "$(location //tensorflow/contrib/lite/experimental/c:exported_symbols.lds)",
+            "-Wl,-install_name,@rpath/libtensorflowlite_c.so",
+        ],
+        "//tensorflow:windows": [],
+        "//conditions:default": [
+            "-z defs",
+            "-Wl,--version-script",  #  This line must be directly followed by the version_script.lds file
+            "$(location //tensorflow/contrib/lite/experimental/c:version_script.lds)",
+        ],
+    }),
+    deps = [
+        ":c_api",
+        ":exported_symbols.lds",
+        ":version_script.lds",
+    ],
+)
+
+cc_library(
+    name = "c_api",
+    srcs = ["c_api.cc"],
+    hdrs = ["c_api.h"],
+    copts = tflite_copts(),
+    deps = [
+        "//tensorflow/contrib/lite:context",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:schema_fbs_version",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+    ],
+)
+
+cc_test(
+    name = "c_api_test",
+    size = "small",
+    srcs = ["c_api_test.cc"],
+    data = ["//tensorflow/contrib/lite:testdata/add.bin"],
+    deps = [
+        ":c_api",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:kernel_api",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
+    ],
+)
diff --git a/tensorflow/contrib/lite/experimental/c/c_api.cc b/tensorflow/contrib/lite/experimental/c/c_api.cc
new file mode 100644
index 0000000000..9d29e8b3e0
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/c/c_api.cc
@@ -0,0 +1,122 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/experimental/c/c_api.h"
+
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/model.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif  // __cplusplus
+
+struct _TFL_Interpreter {
+  std::unique_ptr<tflite::Interpreter> impl;
+};
+
+// LINT.IfChange
+
+TFL_Interpreter* TFL_NewInterpreter(const void* model_data,
+                                    int32_t model_size) {
+  auto model = tflite::FlatBufferModel::BuildFromBuffer(
+      static_cast<const char*>(model_data), static_cast<size_t>(model_size));
+  if (!model) {
+    return nullptr;
+  }
+
+  tflite::ops::builtin::BuiltinOpResolver resolver;
+  tflite::InterpreterBuilder builder(*model, resolver);
+  std::unique_ptr<tflite::Interpreter> interpreter_impl;
+  if (builder(&interpreter_impl) != kTfLiteOk) {
+    return nullptr;
+  }
+
+  return new TFL_Interpreter{std::move(interpreter_impl)};
+}
+
+void TFL_DeleteInterpreter(TFL_Interpreter* interpreter) { delete interpreter; }
+
+int32_t TFL_InterpreterGetInputTensorCount(const TFL_Interpreter* interpreter) {
+  return static_cast<int>(interpreter->impl->inputs().size());
+}
+
+TFL_Tensor* TFL_InterpreterGetInputTensor(const TFL_Interpreter* interpreter,
+                                          int32_t input_index) {
+  return interpreter->impl->tensor(interpreter->impl->inputs()[input_index]);
+}
+
+TFL_Status TFL_InterpreterResizeInputTensor(TFL_Interpreter* interpreter,
+                                            int32_t input_index,
+                                            const int* input_dims,
+                                            int32_t input_dims_size) {
+  std::vector<int> dims{input_dims, input_dims + input_dims_size};
+  return interpreter->impl->ResizeInputTensor(
+      interpreter->impl->inputs()[input_index], dims);
+}
+
+TFL_Status TFL_InterpreterAllocateTensors(TFL_Interpreter* interpreter) {
+  return interpreter->impl->AllocateTensors();
+}
+
+TFL_Status TFL_InterpreterInvoke(TFL_Interpreter* interpreter) {
+  return interpreter->impl->Invoke();
+}
+
+int32_t TFL_InterpreterGetOutputTensorCount(
+    const TFL_Interpreter* interpreter) {
+  return static_cast<int>(interpreter->impl->outputs().size());
+}
+
+const TFL_Tensor* TFL_InterpreterGetOutputTensor(
+    const TFL_Interpreter* interpreter, int32_t output_index) {
+  return interpreter->impl->tensor(interpreter->impl->outputs()[output_index]);
+}
+
+TFL_Type TFL_TensorType(const TFL_Tensor* tensor) { return tensor->type; }
+
+int32_t TFL_TensorNumDims(const TFL_Tensor* tensor) {
+  return tensor->dims->size;
+}
+
+int32_t TFL_TensorDim(const TFL_Tensor* tensor, int32_t dim_index) {
+  return tensor->dims->data[dim_index];
+}
+
+size_t TFL_TensorByteSize(const TFL_Tensor* tensor) { return tensor->bytes; }
+
+TFL_Status TFL_TensorCopyFromBuffer(TFL_Tensor* tensor, const void* input_data,
+                                    int32_t input_data_size) {
+  if (tensor->bytes != static_cast<size_t>(input_data_size)) {
+    return kTfLiteError;
+  }
+  memcpy(tensor->data.raw, input_data, input_data_size);
+  return kTfLiteOk;
+}
+
+TFL_Status TFL_TensorCopyToBuffer(const TFL_Tensor* tensor, void* output_data,
+                                  int32_t output_data_size) {
+  if (tensor->bytes != static_cast<size_t>(output_data_size)) {
+    return kTfLiteError;
+  }
+  memcpy(output_data, tensor->data.raw, output_data_size);
+  return kTfLiteOk;
+}
+
+// LINT.ThenChange(//tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts/Interpreter.cs)
+
+#ifdef __cplusplus
+}  // extern "C"
+#endif  // __cplusplus
diff --git a/tensorflow/contrib/lite/experimental/c/c_api.h b/tensorflow/contrib/lite/experimental/c/c_api.h
new file mode 100644
index 0000000000..070f1add13
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/c/c_api.h
@@ -0,0 +1,149 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_C_C_API_H_
+#define TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_C_C_API_H_
+
+#include <stdint.h>
+
+// Eventually the various C APIs defined in context.h will be migrated into
+// the appropriate /c/c_api*.h header. For now, we pull in existing definitions
+// for convenience.
+#include "tensorflow/contrib/lite/context.h"
+
+// --------------------------------------------------------------------------
+// Experimental C API for TensorFlowLite.
+//
+// The API leans towards simplicity and uniformity instead of convenience, as
+// most usage will be by language-specific wrappers.
+//
+// Conventions:
+// * We use the prefix TFL_ for everything in the API.
+
+#ifdef SWIG
+#define TFL_CAPI_EXPORT
+#else
+#if defined(_WIN32)
+#ifdef TF_COMPILE_LIBRARY
+#define TFL_CAPI_EXPORT __declspec(dllexport)
+#else
+#define TFL_CAPI_EXPORT __declspec(dllimport)
+#endif  // TF_COMPILE_LIBRARY
+#else
+#define TFL_CAPI_EXPORT __attribute__((visibility("default")))
+#endif  // _WIN32
+#endif  // SWIG
+
+#ifdef __cplusplus
+extern "C" {
+#endif  // __cplusplus
+
+typedef TfLiteTensor TFL_Tensor;
+typedef TfLiteStatus TFL_Status;
+typedef TfLiteType TFL_Type;
+
+// --------------------------------------------------------------------------
+// TFL_Interpreter provides inference from a provided model.
+typedef struct _TFL_Interpreter TFL_Interpreter;
+
+// Returns an interpreter for the provided model, or null on failure.
+//
+// NOTE: The client *must* explicitly allocate tensors before attempting to
+// access input tensor data or invoke the interpreter.
+TFL_CAPI_EXPORT extern TFL_Interpreter* TFL_NewInterpreter(
+    const void* model_data, int32_t model_size);
+
+// Destroys the interpreter.
+TFL_CAPI_EXPORT extern void TFL_DeleteInterpreter(TFL_Interpreter* interpreter);
+
+// Returns the number of input tensors associated with the model.
+TFL_CAPI_EXPORT extern int TFL_InterpreterGetInputTensorCount(
+    const TFL_Interpreter* interpreter);
+
+// Returns the tensor associated with the input index.
+// REQUIRES: 0 <= input_index < TFL_InterpreterGetInputTensorCount(tensor)
+TFL_CAPI_EXPORT extern TFL_Tensor* TFL_InterpreterGetInputTensor(
+    const TFL_Interpreter* interpreter, int32_t input_index);
+
+// Attempts to resize the specified input tensor.
+// NOTE: After a resize, the client *must* explicitly allocate tensors before
+// attempting to access the resized tensor data or invoke the interpreter.
+// REQUIRES: 0 <= input_index < TFL_InterpreterGetInputTensorCount(tensor)
+TFL_CAPI_EXPORT extern TFL_Status TFL_InterpreterResizeInputTensor(
+    TFL_Interpreter* interpreter, int32_t input_index, const int* input_dims,
+    int32_t input_dims_size);
+
+// Updates allocations for all tensors, resizing dependent tensors using the
+// specified input tensor dimensionality.
+//
+// This is a relatively expensive operation, and need only be called after
+// creating the graph and/or resizing any inputs.
+TFL_CAPI_EXPORT extern TFL_Status TFL_InterpreterAllocateTensors(
+    TFL_Interpreter* interpreter);
+
+// Runs inference for the loaded graph.
+//
+// NOTE: It is possible that the interpreter is not in a ready state to
+// evaluate (e.g., if a ResizeInputTensor() has been performed without a call to
+// AllocateTensors()).
+TFL_CAPI_EXPORT extern TFL_Status TFL_InterpreterInvoke(
+    TFL_Interpreter* interpreter);
+
+// Returns the number of output tensors associated with the model.
+TFL_CAPI_EXPORT extern int32_t TFL_InterpreterGetOutputTensorCount(
+    const TFL_Interpreter* interpreter);
+
+// Returns the tensor associated with the output index.
+// REQUIRES: 0 <= input_index < TFL_InterpreterGetOutputTensorCount(tensor)
+TFL_CAPI_EXPORT extern const TFL_Tensor* TFL_InterpreterGetOutputTensor(
+    const TFL_Interpreter* interpreter, int32_t output_index);
+
+// --------------------------------------------------------------------------
+// TFL_Tensor wraps data associated with a graph tensor.
+//
+// Note that, while the TFL_Tensor struct is not currently opaque, and its
+// fields can be accessed directly, these methods are still convenient for
+// language bindings. In the future the tensor struct will likely be made opaque
+// in the public API.
+
+// Returns the type of a tensor element.
+TFL_CAPI_EXPORT extern TFL_Type TFL_TensorType(const TFL_Tensor* tensor);
+
+// Returns the number of dimensions that the tensor has.
+TFL_CAPI_EXPORT extern int32_t TFL_TensorNumDims(const TFL_Tensor* tensor);
+
+// Returns the length of the tensor in the "dim_index" dimension.
+// REQUIRES: 0 <= dim_index < TFLiteTensorNumDims(tensor)
+TFL_CAPI_EXPORT extern int32_t TFL_TensorDim(const TFL_Tensor* tensor,
+                                             int32_t dim_index);
+
+// Returns the size of the underlying data in bytes.
+TFL_CAPI_EXPORT extern size_t TFL_TensorByteSize(const TFL_Tensor* tensor);
+
+// Copies from the provided input buffer into the tensor's buffer.
+// REQUIRES: input_data_size == TFL_TensorByteSize(tensor)
+TFL_CAPI_EXPORT extern TFL_Status TFL_TensorCopyFromBuffer(
+    TFL_Tensor* tensor, const void* input_data, int32_t input_data_size);
+
+// Copies to the provided output buffer from the tensor's buffer.
+// REQUIRES: output_data_size == TFL_TensorByteSize(tensor)
+TFL_CAPI_EXPORT extern TFL_Status TFL_TensorCopyToBuffer(
+    const TFL_Tensor* output_tensor, void* output_data,
+    int32_t output_data_size);
+
+#ifdef __cplusplus
+}  // extern "C"
+#endif  // __cplusplus
+
+#endif  // TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_C_C_API_H_
diff --git a/tensorflow/contrib/lite/experimental/c/c_api_test.cc b/tensorflow/contrib/lite/experimental/c/c_api_test.cc
new file mode 100644
index 0000000000..bc925e00a6
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/c/c_api_test.cc
@@ -0,0 +1,84 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <array>
+
+#include "tensorflow/contrib/lite/experimental/c/c_api.h"
+
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/allocation.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/testing/util.h"
+
+namespace {
+
+TEST(CApiSimple, Smoke) {
+  tflite::FileCopyAllocation model_file(
+      "tensorflow/contrib/lite/testdata/add.bin",
+      tflite::DefaultErrorReporter());
+
+  TFL_Interpreter* interpreter =
+      TFL_NewInterpreter(model_file.base(), model_file.bytes());
+  ASSERT_NE(interpreter, nullptr);
+  ASSERT_EQ(TFL_InterpreterAllocateTensors(interpreter), kTfLiteOk);
+
+  ASSERT_EQ(TFL_InterpreterGetInputTensorCount(interpreter), 1);
+  ASSERT_EQ(TFL_InterpreterGetOutputTensorCount(interpreter), 1);
+
+  std::array<int, 1> input_dims = {2};
+  ASSERT_EQ(TFL_InterpreterResizeInputTensor(interpreter, 0, input_dims.data(),
+                                             input_dims.size()),
+            kTfLiteOk);
+  ASSERT_EQ(TFL_InterpreterAllocateTensors(interpreter), kTfLiteOk);
+
+  TFL_Tensor* input_tensor = TFL_InterpreterGetInputTensor(interpreter, 0);
+  ASSERT_NE(input_tensor, nullptr);
+  EXPECT_EQ(TFL_TensorType(input_tensor), kTfLiteFloat32);
+  EXPECT_EQ(TFL_TensorNumDims(input_tensor), 1);
+  EXPECT_EQ(TFL_TensorDim(input_tensor, 0), 2);
+  EXPECT_EQ(TFL_TensorByteSize(input_tensor), sizeof(float) * 2);
+
+  std::array<float, 2> input = {1.f, 3.f};
+  ASSERT_EQ(TFL_TensorCopyFromBuffer(input_tensor, input.data(),
+                                     input.size() * sizeof(float)),
+            kTfLiteOk);
+
+  ASSERT_EQ(TFL_InterpreterInvoke(interpreter), kTfLiteOk);
+
+  const TFL_Tensor* output_tensor =
+      TFL_InterpreterGetOutputTensor(interpreter, 0);
+  ASSERT_NE(output_tensor, nullptr);
+  EXPECT_EQ(TFL_TensorType(output_tensor), kTfLiteFloat32);
+  EXPECT_EQ(TFL_TensorNumDims(output_tensor), 1);
+  EXPECT_EQ(TFL_TensorDim(output_tensor, 0), 2);
+  EXPECT_EQ(TFL_TensorByteSize(output_tensor), sizeof(float) * 2);
+
+  std::array<float, 2> output;
+  ASSERT_EQ(TFL_TensorCopyToBuffer(output_tensor, output.data(),
+                                   output.size() * sizeof(float)),
+            kTfLiteOk);
+  EXPECT_EQ(output[0], 3.f);
+  EXPECT_EQ(output[1], 9.f);
+
+  TFL_DeleteInterpreter(interpreter);
+}
+
+}  // namespace
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/experimental/c/exported_symbols.lds b/tensorflow/contrib/lite/experimental/c/exported_symbols.lds
new file mode 100644
index 0000000000..a3ddc6bc8d
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/c/exported_symbols.lds
@@ -0,0 +1 @@
+_TFL_*
diff --git a/tensorflow/contrib/lite/experimental/c/version_script.lds b/tensorflow/contrib/lite/experimental/c/version_script.lds
new file mode 100644
index 0000000000..c0c8a2bca1
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/c/version_script.lds
@@ -0,0 +1,9 @@
+VERS_1.0 {
+  # Export symbols in c_api.h.
+  global:
+    *TFL_*;
+
+  # Hide everything else.
+  local:
+    *;
+};
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/.gitignore b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/.gitignore
new file mode 100644
index 0000000000..c72a5cae9e
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/.gitignore
@@ -0,0 +1,13 @@
+# Unity generated
+Builds/
+Temp/
+Library/
+obj/
+# Visual Studio / MonoDevelop generated
+*.csproj
+*.unityproj
+*.sln
+*.suo
+*.userprefs
+# OS generated
+.DS_Store
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite.meta
new file mode 100644
index 0000000000..ed9337b53e
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite.meta
@@ -0,0 +1,8 @@
+fileFormatVersion: 2
+guid: 71d1b4219b1da4aeaa1cebbec324fc81
+folderAsset: yes
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples.meta
new file mode 100644
index 0000000000..edcce00939
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples.meta
@@ -0,0 +1,8 @@
+fileFormatVersion: 2
+guid: d948aead14abd4c88947c9886d16f774
+folderAsset: yes
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite.meta
new file mode 100644
index 0000000000..36b35516f0
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite.meta
@@ -0,0 +1,8 @@
+fileFormatVersion: 2
+guid: b810b85b794fa48fd93100acf5525e1f
+folderAsset: yes
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes.meta
new file mode 100644
index 0000000000..d4133da49a
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes.meta
@@ -0,0 +1,8 @@
+fileFormatVersion: 2
+guid: 154f4201e2e454d4696fa5834eaa3ad3
+folderAsset: yes
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/HelloTFLite.unity b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/HelloTFLite.unity
new file mode 100644
index 0000000000..9397d8f27a
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/HelloTFLite.unity
@@ -0,0 +1,242 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!29 &1
+OcclusionCullingSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 2
+  m_OcclusionBakeSettings:
+    smallestOccluder: 5
+    smallestHole: 0.25
+    backfaceThreshold: 100
+  m_SceneGUID: 00000000000000000000000000000000
+  m_OcclusionCullingData: {fileID: 0}
+--- !u!104 &2
+RenderSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 8
+  m_Fog: 0
+  m_FogColor: {r: 0.5, g: 0.5, b: 0.5, a: 1}
+  m_FogMode: 3
+  m_FogDensity: 0.01
+  m_LinearFogStart: 0
+  m_LinearFogEnd: 300
+  m_AmbientSkyColor: {r: 0.212, g: 0.227, b: 0.259, a: 1}
+  m_AmbientEquatorColor: {r: 0.114, g: 0.125, b: 0.133, a: 1}
+  m_AmbientGroundColor: {r: 0.047, g: 0.043, b: 0.035, a: 1}
+  m_AmbientIntensity: 1
+  m_AmbientMode: 3
+  m_SubtractiveShadowColor: {r: 0.42, g: 0.478, b: 0.627, a: 1}
+  m_SkyboxMaterial: {fileID: 0}
+  m_HaloStrength: 0.5
+  m_FlareStrength: 1
+  m_FlareFadeSpeed: 3
+  m_HaloTexture: {fileID: 0}
+  m_SpotCookie: {fileID: 10001, guid: 0000000000000000e000000000000000, type: 0}
+  m_DefaultReflectionMode: 0
+  m_DefaultReflectionResolution: 128
+  m_ReflectionBounces: 1
+  m_ReflectionIntensity: 1
+  m_CustomReflection: {fileID: 0}
+  m_Sun: {fileID: 0}
+  m_IndirectSpecularColor: {r: 0, g: 0, b: 0, a: 1}
+--- !u!157 &3
+LightmapSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 11
+  m_GIWorkflowMode: 1
+  m_GISettings:
+    serializedVersion: 2
+    m_BounceScale: 1
+    m_IndirectOutputScale: 1
+    m_AlbedoBoost: 1
+    m_TemporalCoherenceThreshold: 1
+    m_EnvironmentLightingMode: 0
+    m_EnableBakedLightmaps: 0
+    m_EnableRealtimeLightmaps: 0
+  m_LightmapEditorSettings:
+    serializedVersion: 9
+    m_Resolution: 2
+    m_BakeResolution: 40
+    m_TextureWidth: 1024
+    m_TextureHeight: 1024
+    m_AO: 0
+    m_AOMaxDistance: 1
+    m_CompAOExponent: 1
+    m_CompAOExponentDirect: 0
+    m_Padding: 2
+    m_LightmapParameters: {fileID: 0}
+    m_LightmapsBakeMode: 1
+    m_TextureCompression: 1
+    m_FinalGather: 0
+    m_FinalGatherFiltering: 1
+    m_FinalGatherRayCount: 256
+    m_ReflectionCompression: 2
+    m_MixedBakeMode: 2
+    m_BakeBackend: 0
+    m_PVRSampling: 1
+    m_PVRDirectSampleCount: 32
+    m_PVRSampleCount: 500
+    m_PVRBounces: 2
+    m_PVRFilterTypeDirect: 0
+    m_PVRFilterTypeIndirect: 0
+    m_PVRFilterTypeAO: 0
+    m_PVRFilteringMode: 1
+    m_PVRCulling: 1
+    m_PVRFilteringGaussRadiusDirect: 1
+    m_PVRFilteringGaussRadiusIndirect: 5
+    m_PVRFilteringGaussRadiusAO: 2
+    m_PVRFilteringAtrousPositionSigmaDirect: 0.5
+    m_PVRFilteringAtrousPositionSigmaIndirect: 2
+    m_PVRFilteringAtrousPositionSigmaAO: 1
+    m_ShowResolutionOverlay: 1
+  m_LightingDataAsset: {fileID: 0}
+  m_UseShadowmask: 1
+--- !u!196 &4
+NavMeshSettings:
+  serializedVersion: 2
+  m_ObjectHideFlags: 0
+  m_BuildSettings:
+    serializedVersion: 2
+    agentTypeID: 0
+    agentRadius: 0.5
+    agentHeight: 2
+    agentSlope: 45
+    agentClimb: 0.4
+    ledgeDropHeight: 0
+    maxJumpAcrossDistance: 0
+    minRegionArea: 2
+    manualCellSize: 0
+    cellSize: 0.16666667
+    manualTileSize: 0
+    tileSize: 256
+    accuratePlacement: 0
+    debug:
+      m_Flags: 0
+  m_NavMeshData: {fileID: 0}
+--- !u!1 &492081941
+GameObject:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  serializedVersion: 5
+  m_Component:
+  - component: {fileID: 492081945}
+  - component: {fileID: 492081944}
+  - component: {fileID: 492081943}
+  - component: {fileID: 492081942}
+  m_Layer: 0
+  m_Name: Main Camera
+  m_TagString: MainCamera
+  m_Icon: {fileID: 0}
+  m_NavMeshLayer: 0
+  m_StaticEditorFlags: 0
+  m_IsActive: 1
+--- !u!81 &492081942
+AudioListener:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  m_GameObject: {fileID: 492081941}
+  m_Enabled: 1
+--- !u!124 &492081943
+Behaviour:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  m_GameObject: {fileID: 492081941}
+  m_Enabled: 1
+--- !u!20 &492081944
+Camera:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  m_GameObject: {fileID: 492081941}
+  m_Enabled: 1
+  serializedVersion: 2
+  m_ClearFlags: 1
+  m_BackGroundColor: {r: 0.19215687, g: 0.3019608, b: 0.4745098, a: 0}
+  m_NormalizedViewPortRect:
+    serializedVersion: 2
+    x: 0
+    y: 0
+    width: 1
+    height: 1
+  near clip plane: 0.3
+  far clip plane: 1000
+  field of view: 60
+  orthographic: 1
+  orthographic size: 5
+  m_Depth: -1
+  m_CullingMask:
+    serializedVersion: 2
+    m_Bits: 4294967295
+  m_RenderingPath: -1
+  m_TargetTexture: {fileID: 0}
+  m_TargetDisplay: 0
+  m_TargetEye: 3
+  m_HDR: 1
+  m_AllowMSAA: 1
+  m_AllowDynamicResolution: 0
+  m_ForceIntoRT: 0
+  m_OcclusionCulling: 1
+  m_StereoConvergence: 10
+  m_StereoSeparation: 0.022
+--- !u!4 &492081945
+Transform:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  m_GameObject: {fileID: 492081941}
+  m_LocalRotation: {x: 0, y: 0, z: 0, w: 1}
+  m_LocalPosition: {x: 0, y: 0, z: -10}
+  m_LocalScale: {x: 1, y: 1, z: 1}
+  m_Children:
+  - {fileID: 904015944}
+  m_Father: {fileID: 0}
+  m_RootOrder: 0
+  m_LocalEulerAnglesHint: {x: 0, y: 0, z: 0}
+--- !u!1 &904015943
+GameObject:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  serializedVersion: 5
+  m_Component:
+  - component: {fileID: 904015944}
+  - component: {fileID: 904015945}
+  m_Layer: 0
+  m_Name: HelloTFLite
+  m_TagString: Untagged
+  m_Icon: {fileID: 0}
+  m_NavMeshLayer: 0
+  m_StaticEditorFlags: 0
+  m_IsActive: 1
+--- !u!4 &904015944
+Transform:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  m_GameObject: {fileID: 904015943}
+  m_LocalRotation: {x: 0, y: 0, z: 0, w: 1}
+  m_LocalPosition: {x: 0, y: 0, z: 0}
+  m_LocalScale: {x: 1, y: 1, z: 1}
+  m_Children: []
+  m_Father: {fileID: 492081945}
+  m_RootOrder: 0
+  m_LocalEulerAnglesHint: {x: 0, y: 0, z: 0}
+--- !u!114 &904015945
+MonoBehaviour:
+  m_ObjectHideFlags: 0
+  m_PrefabParentObject: {fileID: 0}
+  m_PrefabInternal: {fileID: 0}
+  m_GameObject: {fileID: 904015943}
+  m_Enabled: 1
+  m_EditorHideFlags: 0
+  m_Script: {fileID: 11500000, guid: 899510441e0ca4be0879d3055e467878, type: 3}
+  m_Name: 
+  m_EditorClassIdentifier: 
+  model: {fileID: 4900000, guid: adff4e1dbdba344c199ee4fe7e84457e, type: 3}
+  inputs:
+  - 1
+  - 3
+  - 7
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/HelloTFLite.unity.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/HelloTFLite.unity.meta
new file mode 100644
index 0000000000..e1e13efb66
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/HelloTFLite.unity.meta
@@ -0,0 +1,7 @@
+fileFormatVersion: 2
+guid: f8a8c37a396584bb7b21687f33d6d3f8
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/add.bytes b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/add.bytes
new file mode 100644
index 0000000000..aef0fe3d82
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/add.bytes
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/add.bytes.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/add.bytes.meta
new file mode 100644
index 0000000000..ba24871413
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scenes/add.bytes.meta
@@ -0,0 +1,7 @@
+fileFormatVersion: 2
+guid: adff4e1dbdba344c199ee4fe7e84457e
+TextScriptImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts.meta
new file mode 100644
index 0000000000..28fde68b8b
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts.meta
@@ -0,0 +1,8 @@
+fileFormatVersion: 2
+guid: f7d1e2dec09b64acdb7b8f5aef9fcb44
+folderAsset: yes
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts/HelloTFLite.cs b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts/HelloTFLite.cs
new file mode 100644
index 0000000000..abca814499
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts/HelloTFLite.cs
@@ -0,0 +1,70 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+using System;
+using System.Collections;
+using System.Collections.Generic;
+using System.Linq;
+using TensorFlowLite;
+using UnityEngine;
+
+/// <summary>
+/// Simple example demonstrating use of the experimental C# bindings for TensorFlowLite.
+/// </summary>
+public class HelloTFLite : MonoBehaviour {
+
+  [Tooltip("Configurable TFLite model.")]
+  public TextAsset model;
+
+  [Tooltip("Configurable TFLite input tensor data.")]
+  public float[] inputs;
+
+  private Interpreter interpreter;
+  private float[] outputs;
+
+  void Start () {
+    interpreter = new Interpreter(model.bytes);
+    Debug.LogFormat("InputCount: {0}, OutputCount: {1}",
+                    interpreter.GetInputTensorCount(),
+                    interpreter.GetOutputTensorCount());
+  }
+
+  void Update () {
+    if (inputs == null) {
+      return;
+    }
+
+    if (outputs == null || outputs.Length != inputs.Length) {
+      interpreter.ResizeInputTensor(0, new int[]{inputs.Length});
+      interpreter.AllocateTensors();
+      outputs = new float[inputs.Length];
+    }
+
+    interpreter.SetInputTensorData(0, inputs);
+    interpreter.Invoke();
+    interpreter.GetOutputTensorData(0, outputs);
+
+    Debug.LogFormat("Input: {0}, Output: {1}",
+                    ArrayToString(inputs),
+                    ArrayToString(outputs));
+  }
+
+  void OnDestroy() {
+    interpreter.Dispose();
+  }
+
+   private static string ArrayToString(float[] values) {
+    return string.Join(",", values.Select(x => x.ToString()).ToArray());
+  }
+}
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts/HelloTFLite.cs.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts/HelloTFLite.cs.meta
new file mode 100644
index 0000000000..ba83f45084
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/Examples/HelloTFLite/Scripts/HelloTFLite.cs.meta
@@ -0,0 +1,11 @@
+fileFormatVersion: 2
+guid: 899510441e0ca4be0879d3055e467878
+MonoImporter:
+  externalObjects: {}
+  serializedVersion: 2
+  defaultReferences: []
+  executionOrder: 0
+  icon: {instanceID: 0}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK.meta
new file mode 100644
index 0000000000..bf5ce15c6a
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK.meta
@@ -0,0 +1,8 @@
+fileFormatVersion: 2
+guid: 16dad1655bcdc48f7b325a2a634b9c69
+folderAsset: yes
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts.meta
new file mode 100644
index 0000000000..22ed2c466b
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts.meta
@@ -0,0 +1,8 @@
+fileFormatVersion: 2
+guid: d70863368f8904d509a9b73d3a555914
+folderAsset: yes
+DefaultImporter:
+  externalObjects: {}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts/Interpreter.cs b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts/Interpreter.cs
new file mode 100644
index 0000000000..ab966bae2e
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts/Interpreter.cs
@@ -0,0 +1,145 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+using System;
+using System.Runtime.InteropServices;
+
+using TFL_Interpreter = System.IntPtr;
+using TFL_Tensor = System.IntPtr;
+
+namespace TensorFlowLite
+{
+  /// <summary>
+  /// Simple C# bindings for the experimental TensorFlowLite C API.
+  /// </summary>
+  public class Interpreter : IDisposable
+  {
+    private const string TensorFlowLibrary = "tensorflowlite_c";
+
+    private TFL_Interpreter handle;
+
+    public Interpreter(byte[] modelData) {
+      GCHandle modelDataHandle = GCHandle.Alloc(modelData, GCHandleType.Pinned);
+      IntPtr modelDataPtr = modelDataHandle.AddrOfPinnedObject();
+      handle = TFL_NewInterpreter(modelDataPtr, modelData.Length);
+      if (handle == IntPtr.Zero) throw new Exception("Failed to create TensorFlowLite Interpreter");
+    }
+
+    ~Interpreter() {
+      Dispose();
+    }
+
+    public void Dispose() {
+      if (handle != IntPtr.Zero) TFL_DeleteInterpreter(handle);
+      handle = IntPtr.Zero;
+    }
+
+    public void Invoke() {
+      ThrowIfError(TFL_InterpreterInvoke(handle));
+    }
+
+    public int GetInputTensorCount() {
+      return TFL_InterpreterGetInputTensorCount(handle);
+    }
+
+    public void SetInputTensorData(int inputTensorIndex, Array inputTensorData) {
+      GCHandle tensorDataHandle = GCHandle.Alloc(inputTensorData, GCHandleType.Pinned);
+      IntPtr tensorDataPtr = tensorDataHandle.AddrOfPinnedObject();
+      TFL_Tensor tensor = TFL_InterpreterGetInputTensor(handle, inputTensorIndex);
+      ThrowIfError(TFL_TensorCopyFromBuffer(
+          tensor, tensorDataPtr, Buffer.ByteLength(inputTensorData)));
+    }
+
+    public void ResizeInputTensor(int inputTensorIndex, int[] inputTensorShape) {
+      ThrowIfError(TFL_InterpreterResizeInputTensor(
+          handle, inputTensorIndex, inputTensorShape, inputTensorShape.Length));
+    }
+
+    public void AllocateTensors() {
+      ThrowIfError(TFL_InterpreterAllocateTensors(handle));
+    }
+
+    public int GetOutputTensorCount() {
+      return TFL_InterpreterGetOutputTensorCount(handle);
+    }
+
+    public void GetOutputTensorData(int outputTensorIndex, Array outputTensorData) {
+      GCHandle tensorDataHandle = GCHandle.Alloc(outputTensorData, GCHandleType.Pinned);
+      IntPtr tensorDataPtr = tensorDataHandle.AddrOfPinnedObject();
+      TFL_Tensor tensor = TFL_InterpreterGetOutputTensor(handle, outputTensorIndex);
+      ThrowIfError(TFL_TensorCopyToBuffer(
+          tensor, tensorDataPtr, Buffer.ByteLength(outputTensorData)));
+    }
+
+    private static void ThrowIfError(int resultCode) {
+      if (resultCode != 0) throw new Exception("TensorFlowLite operation failed.");
+    }
+
+    #region Externs
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe TFL_Interpreter TFL_NewInterpreter(
+        IntPtr model_data,
+        int model_size);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe void TFL_DeleteInterpreter(TFL_Interpreter interpreter);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe int TFL_InterpreterGetInputTensorCount(
+        TFL_Interpreter interpreter);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe TFL_Tensor TFL_InterpreterGetInputTensor(
+        TFL_Interpreter interpreter,
+        int input_index);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe int TFL_InterpreterResizeInputTensor(
+        TFL_Interpreter interpreter,
+        int input_index,
+        int[] input_dims,
+        int input_dims_size);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe int TFL_InterpreterAllocateTensors(
+        TFL_Interpreter interpreter);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe int TFL_InterpreterInvoke(TFL_Interpreter interpreter);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe int TFL_InterpreterGetOutputTensorCount(
+        TFL_Interpreter interpreter);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe TFL_Tensor TFL_InterpreterGetOutputTensor(
+        TFL_Interpreter interpreter,
+        int output_index);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe int TFL_TensorCopyFromBuffer(
+        TFL_Tensor tensor,
+        IntPtr input_data,
+        int input_data_size);
+
+    [DllImport (TensorFlowLibrary)]
+    private static extern unsafe int TFL_TensorCopyToBuffer(
+        TFL_Tensor tensor,
+        IntPtr output_data,
+        int output_data_size);
+
+    #endregion
+  }
+}
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts/Interpreter.cs.meta b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts/Interpreter.cs.meta
new file mode 100644
index 0000000000..5ec84ef7f7
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/Assets/TensorFlowLite/SDK/Scripts/Interpreter.cs.meta
@@ -0,0 +1,11 @@
+fileFormatVersion: 2
+guid: 0bbaf59e6ac914ed1b28174fb9008a09
+MonoImporter:
+  externalObjects: {}
+  serializedVersion: 2
+  defaultReferences: []
+  executionOrder: 0
+  icon: {instanceID: 0}
+  userData: 
+  assetBundleName: 
+  assetBundleVariant: 
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/AudioManager.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/AudioManager.asset
new file mode 100644
index 0000000000..da6112576a
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/AudioManager.asset
@@ -0,0 +1,17 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!11 &1
+AudioManager:
+  m_ObjectHideFlags: 0
+  m_Volume: 1
+  Rolloff Scale: 1
+  Doppler Factor: 1
+  Default Speaker Mode: 2
+  m_SampleRate: 0
+  m_DSPBufferSize: 0
+  m_VirtualVoiceCount: 512
+  m_RealVoiceCount: 32
+  m_SpatializerPlugin: 
+  m_AmbisonicDecoderPlugin: 
+  m_DisableAudio: 0
+  m_VirtualizeEffects: 1
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ClusterInputManager.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ClusterInputManager.asset
new file mode 100644
index 0000000000..e7886b266a
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ClusterInputManager.asset
@@ -0,0 +1,6 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!236 &1
+ClusterInputManager:
+  m_ObjectHideFlags: 0
+  m_Inputs: []
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/DynamicsManager.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/DynamicsManager.asset
new file mode 100644
index 0000000000..78992f08c7
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/DynamicsManager.asset
@@ -0,0 +1,29 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!55 &1
+PhysicsManager:
+  m_ObjectHideFlags: 0
+  serializedVersion: 7
+  m_Gravity: {x: 0, y: -9.81, z: 0}
+  m_DefaultMaterial: {fileID: 0}
+  m_BounceThreshold: 2
+  m_SleepThreshold: 0.005
+  m_DefaultContactOffset: 0.01
+  m_DefaultSolverIterations: 6
+  m_DefaultSolverVelocityIterations: 1
+  m_QueriesHitBackfaces: 0
+  m_QueriesHitTriggers: 1
+  m_EnableAdaptiveForce: 0
+  m_ClothInterCollisionDistance: 0
+  m_ClothInterCollisionStiffness: 0
+  m_ContactsGeneration: 1
+  m_LayerCollisionMatrix: ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
+  m_AutoSimulation: 1
+  m_AutoSyncTransforms: 1
+  m_ClothInterCollisionSettingsToggle: 0
+  m_ContactPairsMode: 0
+  m_BroadphaseType: 0
+  m_WorldBounds:
+    m_Center: {x: 0, y: 0, z: 0}
+    m_Extent: {x: 250, y: 250, z: 250}
+  m_WorldSubdivisions: 8
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/EditorBuildSettings.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/EditorBuildSettings.asset
new file mode 100644
index 0000000000..6dc24f7dfd
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/EditorBuildSettings.asset
@@ -0,0 +1,7 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!1045 &1
+EditorBuildSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 2
+  m_Scenes: []
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/EditorSettings.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/EditorSettings.asset
new file mode 100644
index 0000000000..fcd016402f
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/EditorSettings.asset
@@ -0,0 +1,21 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!159 &1
+EditorSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 7
+  m_ExternalVersionControlSupport: Visible Meta Files
+  m_SerializationMode: 2
+  m_LineEndingsForNewScripts: 1
+  m_DefaultBehaviorMode: 1
+  m_SpritePackerMode: 4
+  m_SpritePackerPaddingPower: 1
+  m_EtcTextureCompressorBehavior: 1
+  m_EtcTextureFastCompressor: 1
+  m_EtcTextureNormalCompressor: 2
+  m_EtcTextureBestCompressor: 4
+  m_ProjectGenerationIncludedExtensions: txt;xml;fnt;cd;asmdef;rsp
+  m_ProjectGenerationRootNamespace: 
+  m_UserGeneratedProjectSuffix: 
+  m_CollabEditorSettings:
+    inProgressEnabled: 1
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/GraphicsSettings.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/GraphicsSettings.asset
new file mode 100644
index 0000000000..74d7b532b0
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/GraphicsSettings.asset
@@ -0,0 +1,61 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!30 &1
+GraphicsSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 12
+  m_Deferred:
+    m_Mode: 1
+    m_Shader: {fileID: 69, guid: 0000000000000000f000000000000000, type: 0}
+  m_DeferredReflections:
+    m_Mode: 1
+    m_Shader: {fileID: 74, guid: 0000000000000000f000000000000000, type: 0}
+  m_ScreenSpaceShadows:
+    m_Mode: 1
+    m_Shader: {fileID: 64, guid: 0000000000000000f000000000000000, type: 0}
+  m_LegacyDeferred:
+    m_Mode: 1
+    m_Shader: {fileID: 63, guid: 0000000000000000f000000000000000, type: 0}
+  m_DepthNormals:
+    m_Mode: 1
+    m_Shader: {fileID: 62, guid: 0000000000000000f000000000000000, type: 0}
+  m_MotionVectors:
+    m_Mode: 1
+    m_Shader: {fileID: 75, guid: 0000000000000000f000000000000000, type: 0}
+  m_LightHalo:
+    m_Mode: 1
+    m_Shader: {fileID: 105, guid: 0000000000000000f000000000000000, type: 0}
+  m_LensFlare:
+    m_Mode: 1
+    m_Shader: {fileID: 102, guid: 0000000000000000f000000000000000, type: 0}
+  m_AlwaysIncludedShaders:
+  - {fileID: 7, guid: 0000000000000000f000000000000000, type: 0}
+  - {fileID: 15104, guid: 0000000000000000f000000000000000, type: 0}
+  - {fileID: 15105, guid: 0000000000000000f000000000000000, type: 0}
+  - {fileID: 15106, guid: 0000000000000000f000000000000000, type: 0}
+  - {fileID: 10753, guid: 0000000000000000f000000000000000, type: 0}
+  - {fileID: 10770, guid: 0000000000000000f000000000000000, type: 0}
+  m_PreloadedShaders: []
+  m_SpritesDefaultMaterial: {fileID: 10754, guid: 0000000000000000f000000000000000,
+    type: 0}
+  m_CustomRenderPipeline: {fileID: 0}
+  m_TransparencySortMode: 0
+  m_TransparencySortAxis: {x: 0, y: 0, z: 1}
+  m_DefaultRenderingPath: 1
+  m_DefaultMobileRenderingPath: 1
+  m_TierSettings: []
+  m_LightmapStripping: 0
+  m_FogStripping: 0
+  m_InstancingStripping: 0
+  m_LightmapKeepPlain: 1
+  m_LightmapKeepDirCombined: 1
+  m_LightmapKeepDynamicPlain: 1
+  m_LightmapKeepDynamicDirCombined: 1
+  m_LightmapKeepShadowMask: 1
+  m_LightmapKeepSubtractive: 1
+  m_FogKeepLinear: 1
+  m_FogKeepExp: 1
+  m_FogKeepExp2: 1
+  m_AlbedoSwatchInfos: []
+  m_LightsUseLinearIntensity: 0
+  m_LightsUseColorTemperature: 0
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/InputManager.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/InputManager.asset
new file mode 100644
index 0000000000..17c8f538e2
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/InputManager.asset
@@ -0,0 +1,295 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!13 &1
+InputManager:
+  m_ObjectHideFlags: 0
+  serializedVersion: 2
+  m_Axes:
+  - serializedVersion: 3
+    m_Name: Horizontal
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: left
+    positiveButton: right
+    altNegativeButton: a
+    altPositiveButton: d
+    gravity: 3
+    dead: 0.001
+    sensitivity: 3
+    snap: 1
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Vertical
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: down
+    positiveButton: up
+    altNegativeButton: s
+    altPositiveButton: w
+    gravity: 3
+    dead: 0.001
+    sensitivity: 3
+    snap: 1
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Fire1
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: left ctrl
+    altNegativeButton: 
+    altPositiveButton: mouse 0
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Fire2
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: left alt
+    altNegativeButton: 
+    altPositiveButton: mouse 1
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Fire3
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: left shift
+    altNegativeButton: 
+    altPositiveButton: mouse 2
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Jump
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: space
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Mouse X
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: 
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 0
+    dead: 0
+    sensitivity: 0.1
+    snap: 0
+    invert: 0
+    type: 1
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Mouse Y
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: 
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 0
+    dead: 0
+    sensitivity: 0.1
+    snap: 0
+    invert: 0
+    type: 1
+    axis: 1
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Mouse ScrollWheel
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: 
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 0
+    dead: 0
+    sensitivity: 0.1
+    snap: 0
+    invert: 0
+    type: 1
+    axis: 2
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Horizontal
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: 
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 0
+    dead: 0.19
+    sensitivity: 1
+    snap: 0
+    invert: 0
+    type: 2
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Vertical
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: 
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 0
+    dead: 0.19
+    sensitivity: 1
+    snap: 0
+    invert: 1
+    type: 2
+    axis: 1
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Fire1
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: joystick button 0
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Fire2
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: joystick button 1
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Fire3
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: joystick button 2
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Jump
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: joystick button 3
+    altNegativeButton: 
+    altPositiveButton: 
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Submit
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: return
+    altNegativeButton: 
+    altPositiveButton: joystick button 0
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Submit
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: enter
+    altNegativeButton: 
+    altPositiveButton: space
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
+  - serializedVersion: 3
+    m_Name: Cancel
+    descriptiveName: 
+    descriptiveNegativeName: 
+    negativeButton: 
+    positiveButton: escape
+    altNegativeButton: 
+    altPositiveButton: joystick button 1
+    gravity: 1000
+    dead: 0.001
+    sensitivity: 1000
+    snap: 0
+    invert: 0
+    type: 0
+    axis: 0
+    joyNum: 0
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/NavMeshAreas.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/NavMeshAreas.asset
new file mode 100644
index 0000000000..3b0b7c3d18
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/NavMeshAreas.asset
@@ -0,0 +1,91 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!126 &1
+NavMeshProjectSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 2
+  areas:
+  - name: Walkable
+    cost: 1
+  - name: Not Walkable
+    cost: 1
+  - name: Jump
+    cost: 2
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  - name: 
+    cost: 1
+  m_LastAgentTypeID: -887442657
+  m_Settings:
+  - serializedVersion: 2
+    agentTypeID: 0
+    agentRadius: 0.5
+    agentHeight: 2
+    agentSlope: 45
+    agentClimb: 0.75
+    ledgeDropHeight: 0
+    maxJumpAcrossDistance: 0
+    minRegionArea: 2
+    manualCellSize: 0
+    cellSize: 0.16666667
+    manualTileSize: 0
+    tileSize: 256
+    accuratePlacement: 0
+    debug:
+      m_Flags: 0
+  m_SettingNames:
+  - Humanoid
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/NetworkManager.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/NetworkManager.asset
new file mode 100644
index 0000000000..5dc6a831d9
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/NetworkManager.asset
@@ -0,0 +1,8 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!149 &1
+NetworkManager:
+  m_ObjectHideFlags: 0
+  m_DebugLevel: 0
+  m_Sendrate: 15
+  m_AssetToPrefab: {}
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/Physics2DSettings.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/Physics2DSettings.asset
new file mode 100644
index 0000000000..132ee6bc86
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/Physics2DSettings.asset
@@ -0,0 +1,37 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!19 &1
+Physics2DSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 3
+  m_Gravity: {x: 0, y: -9.81}
+  m_DefaultMaterial: {fileID: 0}
+  m_VelocityIterations: 8
+  m_PositionIterations: 3
+  m_VelocityThreshold: 1
+  m_MaxLinearCorrection: 0.2
+  m_MaxAngularCorrection: 8
+  m_MaxTranslationSpeed: 100
+  m_MaxRotationSpeed: 360
+  m_BaumgarteScale: 0.2
+  m_BaumgarteTimeOfImpactScale: 0.75
+  m_TimeToSleep: 0.5
+  m_LinearSleepTolerance: 0.01
+  m_AngularSleepTolerance: 2
+  m_DefaultContactOffset: 0.01
+  m_AutoSimulation: 1
+  m_QueriesHitTriggers: 1
+  m_QueriesStartInColliders: 1
+  m_ChangeStopsCallbacks: 0
+  m_CallbacksOnDisable: 1
+  m_AutoSyncTransforms: 1
+  m_AlwaysShowColliders: 0
+  m_ShowColliderSleep: 1
+  m_ShowColliderContacts: 0
+  m_ShowColliderAABB: 0
+  m_ContactArrowScale: 0.2
+  m_ColliderAwakeColor: {r: 0.5686275, g: 0.95686275, b: 0.54509807, a: 0.7529412}
+  m_ColliderAsleepColor: {r: 0.5686275, g: 0.95686275, b: 0.54509807, a: 0.36078432}
+  m_ColliderContactColor: {r: 1, g: 0, b: 1, a: 0.6862745}
+  m_ColliderAABBColor: {r: 1, g: 1, b: 0, a: 0.2509804}
+  m_LayerCollisionMatrix: ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ProjectSettings.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ProjectSettings.asset
new file mode 100644
index 0000000000..3fbfab76c1
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ProjectSettings.asset
@@ -0,0 +1,641 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!129 &1
+PlayerSettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 14
+  productGUID: a084943b991dd4597b140f4ce2b41c65
+  AndroidProfiler: 0
+  AndroidFilterTouchesWhenObscured: 0
+  defaultScreenOrientation: 4
+  targetDevice: 2
+  useOnDemandResources: 0
+  accelerometerFrequency: 60
+  companyName: DefaultCompany
+  productName: TensorFlowLitePlugin
+  defaultCursor: {fileID: 0}
+  cursorHotspot: {x: 0, y: 0}
+  m_SplashScreenBackgroundColor: {r: 0.13725491, g: 0.12156863, b: 0.1254902, a: 1}
+  m_ShowUnitySplashScreen: 1
+  m_ShowUnitySplashLogo: 1
+  m_SplashScreenOverlayOpacity: 1
+  m_SplashScreenAnimation: 1
+  m_SplashScreenLogoStyle: 1
+  m_SplashScreenDrawMode: 0
+  m_SplashScreenBackgroundAnimationZoom: 1
+  m_SplashScreenLogoAnimationZoom: 1
+  m_SplashScreenBackgroundLandscapeAspect: 1
+  m_SplashScreenBackgroundPortraitAspect: 1
+  m_SplashScreenBackgroundLandscapeUvs:
+    serializedVersion: 2
+    x: 0
+    y: 0
+    width: 1
+    height: 1
+  m_SplashScreenBackgroundPortraitUvs:
+    serializedVersion: 2
+    x: 0
+    y: 0
+    width: 1
+    height: 1
+  m_SplashScreenLogos: []
+  m_VirtualRealitySplashScreen: {fileID: 0}
+  m_HolographicTrackingLossScreen: {fileID: 0}
+  defaultScreenWidth: 1024
+  defaultScreenHeight: 768
+  defaultScreenWidthWeb: 960
+  defaultScreenHeightWeb: 600
+  m_StereoRenderingPath: 0
+  m_ActiveColorSpace: 0
+  m_MTRendering: 1
+  m_StackTraceTypes: 010000000100000001000000010000000100000001000000
+  iosShowActivityIndicatorOnLoading: -1
+  androidShowActivityIndicatorOnLoading: -1
+  tizenShowActivityIndicatorOnLoading: -1
+  iosAppInBackgroundBehavior: 0
+  displayResolutionDialog: 1
+  iosAllowHTTPDownload: 1
+  allowedAutorotateToPortrait: 1
+  allowedAutorotateToPortraitUpsideDown: 1
+  allowedAutorotateToLandscapeRight: 1
+  allowedAutorotateToLandscapeLeft: 1
+  useOSAutorotation: 1
+  use32BitDisplayBuffer: 1
+  preserveFramebufferAlpha: 0
+  disableDepthAndStencilBuffers: 0
+  androidBlitType: 0
+  defaultIsFullScreen: 1
+  defaultIsNativeResolution: 1
+  macRetinaSupport: 1
+  runInBackground: 0
+  captureSingleScreen: 0
+  muteOtherAudioSources: 0
+  Prepare IOS For Recording: 0
+  Force IOS Speakers When Recording: 0
+  deferSystemGesturesMode: 0
+  hideHomeButton: 0
+  submitAnalytics: 1
+  usePlayerLog: 1
+  bakeCollisionMeshes: 0
+  forceSingleInstance: 0
+  resizableWindow: 0
+  useMacAppStoreValidation: 0
+  macAppStoreCategory: public.app-category.games
+  gpuSkinning: 0
+  graphicsJobs: 0
+  xboxPIXTextureCapture: 0
+  xboxEnableAvatar: 0
+  xboxEnableKinect: 0
+  xboxEnableKinectAutoTracking: 0
+  xboxEnableFitness: 0
+  visibleInBackground: 1
+  allowFullscreenSwitch: 1
+  graphicsJobMode: 0
+  macFullscreenMode: 2
+  d3d11FullscreenMode: 1
+  xboxSpeechDB: 0
+  xboxEnableHeadOrientation: 0
+  xboxEnableGuest: 0
+  xboxEnablePIXSampling: 0
+  metalFramebufferOnly: 0
+  n3dsDisableStereoscopicView: 0
+  n3dsEnableSharedListOpt: 1
+  n3dsEnableVSync: 0
+  xboxOneResolution: 0
+  xboxOneSResolution: 0
+  xboxOneXResolution: 3
+  xboxOneMonoLoggingLevel: 0
+  xboxOneLoggingLevel: 1
+  xboxOneDisableEsram: 0
+  xboxOnePresentImmediateThreshold: 0
+  videoMemoryForVertexBuffers: 0
+  psp2PowerMode: 0
+  psp2AcquireBGM: 1
+  wiiUTVResolution: 0
+  wiiUGamePadMSAA: 1
+  wiiUSupportsNunchuk: 0
+  wiiUSupportsClassicController: 0
+  wiiUSupportsBalanceBoard: 0
+  wiiUSupportsMotionPlus: 0
+  wiiUSupportsProController: 0
+  wiiUAllowScreenCapture: 1
+  wiiUControllerCount: 0
+  m_SupportedAspectRatios:
+    4:3: 1
+    5:4: 1
+    16:10: 1
+    16:9: 1
+    Others: 1
+  bundleVersion: 1.0
+  preloadedAssets: []
+  metroInputSource: 0
+  wsaTransparentSwapchain: 0
+  m_HolographicPauseOnTrackingLoss: 1
+  xboxOneDisableKinectGpuReservation: 0
+  xboxOneEnable7thCore: 0
+  vrSettings:
+    cardboard:
+      depthFormat: 0
+      enableTransitionView: 0
+    daydream:
+      depthFormat: 0
+      useSustainedPerformanceMode: 0
+      enableVideoLayer: 0
+      useProtectedVideoMemory: 0
+      minimumSupportedHeadTracking: 0
+      maximumSupportedHeadTracking: 1
+    hololens:
+      depthFormat: 1
+      depthBufferSharingEnabled: 0
+    oculus:
+      sharedDepthBuffer: 0
+      dashSupport: 0
+  protectGraphicsMemory: 0
+  useHDRDisplay: 0
+  m_ColorGamuts: 00000000
+  targetPixelDensity: 30
+  resolutionScalingMode: 0
+  androidSupportedAspectRatio: 1
+  androidMaxAspectRatio: 2.1
+  applicationIdentifier: {}
+  buildNumber: {}
+  AndroidBundleVersionCode: 1
+  AndroidMinSdkVersion: 16
+  AndroidTargetSdkVersion: 0
+  AndroidPreferredInstallLocation: 1
+  aotOptions: 
+  stripEngineCode: 1
+  iPhoneStrippingLevel: 0
+  iPhoneScriptCallOptimization: 0
+  ForceInternetPermission: 0
+  ForceSDCardPermission: 0
+  CreateWallpaper: 0
+  APKExpansionFiles: 0
+  keepLoadedShadersAlive: 0
+  StripUnusedMeshComponents: 0
+  VertexChannelCompressionMask:
+    serializedVersion: 2
+    m_Bits: 238
+  iPhoneSdkVersion: 988
+  iOSTargetOSVersionString: 7.0
+  tvOSSdkVersion: 0
+  tvOSRequireExtendedGameController: 0
+  tvOSTargetOSVersionString: 9.0
+  uIPrerenderedIcon: 0
+  uIRequiresPersistentWiFi: 0
+  uIRequiresFullScreen: 1
+  uIStatusBarHidden: 1
+  uIExitOnSuspend: 0
+  uIStatusBarStyle: 0
+  iPhoneSplashScreen: {fileID: 0}
+  iPhoneHighResSplashScreen: {fileID: 0}
+  iPhoneTallHighResSplashScreen: {fileID: 0}
+  iPhone47inSplashScreen: {fileID: 0}
+  iPhone55inPortraitSplashScreen: {fileID: 0}
+  iPhone55inLandscapeSplashScreen: {fileID: 0}
+  iPhone58inPortraitSplashScreen: {fileID: 0}
+  iPhone58inLandscapeSplashScreen: {fileID: 0}
+  iPadPortraitSplashScreen: {fileID: 0}
+  iPadHighResPortraitSplashScreen: {fileID: 0}
+  iPadLandscapeSplashScreen: {fileID: 0}
+  iPadHighResLandscapeSplashScreen: {fileID: 0}
+  appleTVSplashScreen: {fileID: 0}
+  appleTVSplashScreen2x: {fileID: 0}
+  tvOSSmallIconLayers: []
+  tvOSSmallIconLayers2x: []
+  tvOSLargeIconLayers: []
+  tvOSTopShelfImageLayers: []
+  tvOSTopShelfImageLayers2x: []
+  tvOSTopShelfImageWideLayers: []
+  tvOSTopShelfImageWideLayers2x: []
+  iOSLaunchScreenType: 0
+  iOSLaunchScreenPortrait: {fileID: 0}
+  iOSLaunchScreenLandscape: {fileID: 0}
+  iOSLaunchScreenBackgroundColor:
+    serializedVersion: 2
+    rgba: 0
+  iOSLaunchScreenFillPct: 100
+  iOSLaunchScreenSize: 100
+  iOSLaunchScreenCustomXibPath: 
+  iOSLaunchScreeniPadType: 0
+  iOSLaunchScreeniPadImage: {fileID: 0}
+  iOSLaunchScreeniPadBackgroundColor:
+    serializedVersion: 2
+    rgba: 0
+  iOSLaunchScreeniPadFillPct: 100
+  iOSLaunchScreeniPadSize: 100
+  iOSLaunchScreeniPadCustomXibPath: 
+  iOSUseLaunchScreenStoryboard: 0
+  iOSLaunchScreenCustomStoryboardPath: 
+  iOSDeviceRequirements: []
+  iOSURLSchemes: []
+  iOSBackgroundModes: 0
+  iOSMetalForceHardShadows: 0
+  metalEditorSupport: 1
+  metalAPIValidation: 1
+  iOSRenderExtraFrameOnPause: 0
+  appleDeveloperTeamID: 
+  iOSManualSigningProvisioningProfileID: 
+  tvOSManualSigningProvisioningProfileID: 
+  appleEnableAutomaticSigning: 0
+  clonedFromGUID: 00000000000000000000000000000000
+  AndroidTargetDevice: 0
+  AndroidSplashScreenScale: 0
+  androidSplashScreen: {fileID: 0}
+  AndroidKeystoreName: 
+  AndroidKeyaliasName: 
+  AndroidTVCompatibility: 1
+  AndroidIsGame: 1
+  AndroidEnableTango: 0
+  androidEnableBanner: 1
+  androidUseLowAccuracyLocation: 0
+  m_AndroidBanners:
+  - width: 320
+    height: 180
+    banner: {fileID: 0}
+  androidGamepadSupportLevel: 0
+  resolutionDialogBanner: {fileID: 0}
+  m_BuildTargetIcons: []
+  m_BuildTargetBatching: []
+  m_BuildTargetGraphicsAPIs: []
+  m_BuildTargetVRSettings: []
+  m_BuildTargetEnableVuforiaSettings: []
+  openGLRequireES31: 0
+  openGLRequireES31AEP: 0
+  m_TemplateCustomTags: {}
+  mobileMTRendering:
+    Android: 1
+    iPhone: 1
+    tvOS: 1
+  m_BuildTargetGroupLightmapEncodingQuality: []
+  wiiUTitleID: 0005000011000000
+  wiiUGroupID: 00010000
+  wiiUCommonSaveSize: 4096
+  wiiUAccountSaveSize: 2048
+  wiiUOlvAccessKey: 0
+  wiiUTinCode: 0
+  wiiUJoinGameId: 0
+  wiiUJoinGameModeMask: 0000000000000000
+  wiiUCommonBossSize: 0
+  wiiUAccountBossSize: 0
+  wiiUAddOnUniqueIDs: []
+  wiiUMainThreadStackSize: 3072
+  wiiULoaderThreadStackSize: 1024
+  wiiUSystemHeapSize: 128
+  wiiUTVStartupScreen: {fileID: 0}
+  wiiUGamePadStartupScreen: {fileID: 0}
+  wiiUDrcBufferDisabled: 0
+  wiiUProfilerLibPath: 
+  playModeTestRunnerEnabled: 0
+  actionOnDotNetUnhandledException: 1
+  enableInternalProfiler: 0
+  logObjCUncaughtExceptions: 1
+  enableCrashReportAPI: 0
+  cameraUsageDescription: 
+  locationUsageDescription: 
+  microphoneUsageDescription: 
+  switchNetLibKey: 
+  switchSocketMemoryPoolSize: 6144
+  switchSocketAllocatorPoolSize: 128
+  switchSocketConcurrencyLimit: 14
+  switchScreenResolutionBehavior: 2
+  switchUseCPUProfiler: 0
+  switchApplicationID: 0x01004b9000490000
+  switchNSODependencies: 
+  switchTitleNames_0: 
+  switchTitleNames_1: 
+  switchTitleNames_2: 
+  switchTitleNames_3: 
+  switchTitleNames_4: 
+  switchTitleNames_5: 
+  switchTitleNames_6: 
+  switchTitleNames_7: 
+  switchTitleNames_8: 
+  switchTitleNames_9: 
+  switchTitleNames_10: 
+  switchTitleNames_11: 
+  switchTitleNames_12: 
+  switchTitleNames_13: 
+  switchTitleNames_14: 
+  switchPublisherNames_0: 
+  switchPublisherNames_1: 
+  switchPublisherNames_2: 
+  switchPublisherNames_3: 
+  switchPublisherNames_4: 
+  switchPublisherNames_5: 
+  switchPublisherNames_6: 
+  switchPublisherNames_7: 
+  switchPublisherNames_8: 
+  switchPublisherNames_9: 
+  switchPublisherNames_10: 
+  switchPublisherNames_11: 
+  switchPublisherNames_12: 
+  switchPublisherNames_13: 
+  switchPublisherNames_14: 
+  switchIcons_0: {fileID: 0}
+  switchIcons_1: {fileID: 0}
+  switchIcons_2: {fileID: 0}
+  switchIcons_3: {fileID: 0}
+  switchIcons_4: {fileID: 0}
+  switchIcons_5: {fileID: 0}
+  switchIcons_6: {fileID: 0}
+  switchIcons_7: {fileID: 0}
+  switchIcons_8: {fileID: 0}
+  switchIcons_9: {fileID: 0}
+  switchIcons_10: {fileID: 0}
+  switchIcons_11: {fileID: 0}
+  switchIcons_12: {fileID: 0}
+  switchIcons_13: {fileID: 0}
+  switchIcons_14: {fileID: 0}
+  switchSmallIcons_0: {fileID: 0}
+  switchSmallIcons_1: {fileID: 0}
+  switchSmallIcons_2: {fileID: 0}
+  switchSmallIcons_3: {fileID: 0}
+  switchSmallIcons_4: {fileID: 0}
+  switchSmallIcons_5: {fileID: 0}
+  switchSmallIcons_6: {fileID: 0}
+  switchSmallIcons_7: {fileID: 0}
+  switchSmallIcons_8: {fileID: 0}
+  switchSmallIcons_9: {fileID: 0}
+  switchSmallIcons_10: {fileID: 0}
+  switchSmallIcons_11: {fileID: 0}
+  switchSmallIcons_12: {fileID: 0}
+  switchSmallIcons_13: {fileID: 0}
+  switchSmallIcons_14: {fileID: 0}
+  switchManualHTML: 
+  switchAccessibleURLs: 
+  switchLegalInformation: 
+  switchMainThreadStackSize: 1048576
+  switchPresenceGroupId: 
+  switchLogoHandling: 0
+  switchReleaseVersion: 0
+  switchDisplayVersion: 1.0.0
+  switchStartupUserAccount: 0
+  switchTouchScreenUsage: 0
+  switchSupportedLanguagesMask: 0
+  switchLogoType: 0
+  switchApplicationErrorCodeCategory: 
+  switchUserAccountSaveDataSize: 0
+  switchUserAccountSaveDataJournalSize: 0
+  switchApplicationAttribute: 0
+  switchCardSpecSize: -1
+  switchCardSpecClock: -1
+  switchRatingsMask: 0
+  switchRatingsInt_0: 0
+  switchRatingsInt_1: 0
+  switchRatingsInt_2: 0
+  switchRatingsInt_3: 0
+  switchRatingsInt_4: 0
+  switchRatingsInt_5: 0
+  switchRatingsInt_6: 0
+  switchRatingsInt_7: 0
+  switchRatingsInt_8: 0
+  switchRatingsInt_9: 0
+  switchRatingsInt_10: 0
+  switchRatingsInt_11: 0
+  switchLocalCommunicationIds_0: 
+  switchLocalCommunicationIds_1: 
+  switchLocalCommunicationIds_2: 
+  switchLocalCommunicationIds_3: 
+  switchLocalCommunicationIds_4: 
+  switchLocalCommunicationIds_5: 
+  switchLocalCommunicationIds_6: 
+  switchLocalCommunicationIds_7: 
+  switchParentalControl: 0
+  switchAllowsScreenshot: 1
+  switchAllowsVideoCapturing: 1
+  switchAllowsRuntimeAddOnContentInstall: 0
+  switchDataLossConfirmation: 0
+  switchSupportedNpadStyles: 3
+  switchSocketConfigEnabled: 0
+  switchTcpInitialSendBufferSize: 32
+  switchTcpInitialReceiveBufferSize: 64
+  switchTcpAutoSendBufferSizeMax: 256
+  switchTcpAutoReceiveBufferSizeMax: 256
+  switchUdpSendBufferSize: 9
+  switchUdpReceiveBufferSize: 42
+  switchSocketBufferEfficiency: 4
+  switchSocketInitializeEnabled: 1
+  switchNetworkInterfaceManagerInitializeEnabled: 1
+  switchPlayerConnectionEnabled: 1
+  ps4NPAgeRating: 12
+  ps4NPTitleSecret: 
+  ps4NPTrophyPackPath: 
+  ps4ParentalLevel: 11
+  ps4ContentID: ED1633-NPXX51362_00-0000000000000000
+  ps4Category: 0
+  ps4MasterVersion: 01.00
+  ps4AppVersion: 01.00
+  ps4AppType: 0
+  ps4ParamSfxPath: 
+  ps4VideoOutPixelFormat: 0
+  ps4VideoOutInitialWidth: 1920
+  ps4VideoOutBaseModeInitialWidth: 1920
+  ps4VideoOutReprojectionRate: 60
+  ps4PronunciationXMLPath: 
+  ps4PronunciationSIGPath: 
+  ps4BackgroundImagePath: 
+  ps4StartupImagePath: 
+  ps4StartupImagesFolder: 
+  ps4IconImagesFolder: 
+  ps4SaveDataImagePath: 
+  ps4SdkOverride: 
+  ps4BGMPath: 
+  ps4ShareFilePath: 
+  ps4ShareOverlayImagePath: 
+  ps4PrivacyGuardImagePath: 
+  ps4NPtitleDatPath: 
+  ps4RemotePlayKeyAssignment: -1
+  ps4RemotePlayKeyMappingDir: 
+  ps4PlayTogetherPlayerCount: 0
+  ps4EnterButtonAssignment: 1
+  ps4ApplicationParam1: 0
+  ps4ApplicationParam2: 0
+  ps4ApplicationParam3: 0
+  ps4ApplicationParam4: 0
+  ps4DownloadDataSize: 0
+  ps4GarlicHeapSize: 2048
+  ps4ProGarlicHeapSize: 2560
+  ps4Passcode: d3hjjul8UhK6ZnQCEBYYQPozR9sQV066
+  ps4pnSessions: 1
+  ps4pnPresence: 1
+  ps4pnFriends: 1
+  ps4pnGameCustomData: 1
+  playerPrefsSupport: 0
+  restrictedAudioUsageRights: 0
+  ps4UseResolutionFallback: 0
+  ps4ReprojectionSupport: 0
+  ps4UseAudio3dBackend: 0
+  ps4SocialScreenEnabled: 0
+  ps4ScriptOptimizationLevel: 0
+  ps4Audio3dVirtualSpeakerCount: 14
+  ps4attribCpuUsage: 0
+  ps4PatchPkgPath: 
+  ps4PatchLatestPkgPath: 
+  ps4PatchChangeinfoPath: 
+  ps4PatchDayOne: 0
+  ps4attribUserManagement: 0
+  ps4attribMoveSupport: 0
+  ps4attrib3DSupport: 0
+  ps4attribShareSupport: 0
+  ps4attribExclusiveVR: 0
+  ps4disableAutoHideSplash: 0
+  ps4videoRecordingFeaturesUsed: 0
+  ps4contentSearchFeaturesUsed: 0
+  ps4attribEyeToEyeDistanceSettingVR: 0
+  ps4IncludedModules: []
+  monoEnv: 
+  psp2Splashimage: {fileID: 0}
+  psp2NPTrophyPackPath: 
+  psp2NPSupportGBMorGJP: 0
+  psp2NPAgeRating: 12
+  psp2NPTitleDatPath: 
+  psp2NPCommsID: 
+  psp2NPCommunicationsID: 
+  psp2NPCommsPassphrase: 
+  psp2NPCommsSig: 
+  psp2ParamSfxPath: 
+  psp2ManualPath: 
+  psp2LiveAreaGatePath: 
+  psp2LiveAreaBackroundPath: 
+  psp2LiveAreaPath: 
+  psp2LiveAreaTrialPath: 
+  psp2PatchChangeInfoPath: 
+  psp2PatchOriginalPackage: 
+  psp2PackagePassword: 3onkgZsAECEn0fzCoWiCtWCKe4l74pE5
+  psp2KeystoneFile: 
+  psp2MemoryExpansionMode: 0
+  psp2DRMType: 0
+  psp2StorageType: 0
+  psp2MediaCapacity: 0
+  psp2DLCConfigPath: 
+  psp2ThumbnailPath: 
+  psp2BackgroundPath: 
+  psp2SoundPath: 
+  psp2TrophyCommId: 
+  psp2TrophyPackagePath: 
+  psp2PackagedResourcesPath: 
+  psp2SaveDataQuota: 10240
+  psp2ParentalLevel: 1
+  psp2ShortTitle: Not Set
+  psp2ContentID: IV0000-ABCD12345_00-0123456789ABCDEF
+  psp2Category: 0
+  psp2MasterVersion: 01.00
+  psp2AppVersion: 01.00
+  psp2TVBootMode: 0
+  psp2EnterButtonAssignment: 2
+  psp2TVDisableEmu: 0
+  psp2AllowTwitterDialog: 1
+  psp2Upgradable: 0
+  psp2HealthWarning: 0
+  psp2UseLibLocation: 0
+  psp2InfoBarOnStartup: 0
+  psp2InfoBarColor: 0
+  psp2ScriptOptimizationLevel: 0
+  psmSplashimage: {fileID: 0}
+  splashScreenBackgroundSourceLandscape: {fileID: 0}
+  splashScreenBackgroundSourcePortrait: {fileID: 0}
+  spritePackerPolicy: 
+  webGLMemorySize: 256
+  webGLExceptionSupport: 1
+  webGLNameFilesAsHashes: 0
+  webGLDataCaching: 0
+  webGLDebugSymbols: 0
+  webGLEmscriptenArgs: 
+  webGLModulesDirectory: 
+  webGLTemplate: APPLICATION:Default
+  webGLAnalyzeBuildSize: 0
+  webGLUseEmbeddedResources: 0
+  webGLUseWasm: 0
+  webGLCompressionFormat: 1
+  scriptingDefineSymbols: {}
+  platformArchitecture: {}
+  scriptingBackend: {}
+  incrementalIl2cppBuild: {}
+  additionalIl2CppArgs: 
+  scriptingRuntimeVersion: 0
+  apiCompatibilityLevelPerPlatform: {}
+  m_RenderingPath: 1
+  m_MobileRenderingPath: 1
+  metroPackageName: TensorFlowLitePlugin
+  metroPackageVersion: 
+  metroCertificatePath: 
+  metroCertificatePassword: 
+  metroCertificateSubject: 
+  metroCertificateIssuer: 
+  metroCertificateNotAfter: 0000000000000000
+  metroApplicationDescription: TensorFlowLitePlugin
+  wsaImages: {}
+  metroTileShortName: 
+  metroCommandLineArgsFile: 
+  metroTileShowName: 0
+  metroMediumTileShowName: 0
+  metroLargeTileShowName: 0
+  metroWideTileShowName: 0
+  metroDefaultTileSize: 1
+  metroTileForegroundText: 2
+  metroTileBackgroundColor: {r: 0.13333334, g: 0.17254902, b: 0.21568628, a: 0}
+  metroSplashScreenBackgroundColor: {r: 0.12941177, g: 0.17254902, b: 0.21568628,
+    a: 1}
+  metroSplashScreenUseBackgroundColor: 0
+  platformCapabilities: {}
+  metroFTAName: 
+  metroFTAFileTypes: []
+  metroProtocolName: 
+  metroCompilationOverrides: 1
+  tizenProductDescription: 
+  tizenProductURL: 
+  tizenSigningProfileName: 
+  tizenGPSPermissions: 0
+  tizenMicrophonePermissions: 0
+  tizenDeploymentTarget: 
+  tizenDeploymentTargetType: -1
+  tizenMinOSVersion: 1
+  n3dsUseExtSaveData: 0
+  n3dsCompressStaticMem: 1
+  n3dsExtSaveDataNumber: 0x12345
+  n3dsStackSize: 131072
+  n3dsTargetPlatform: 2
+  n3dsRegion: 7
+  n3dsMediaSize: 0
+  n3dsLogoStyle: 3
+  n3dsTitle: GameName
+  n3dsProductCode: 
+  n3dsApplicationId: 0xFF3FF
+  XboxOneProductId: 
+  XboxOneUpdateKey: 
+  XboxOneSandboxId: 
+  XboxOneContentId: 
+  XboxOneTitleId: 
+  XboxOneSCId: 
+  XboxOneGameOsOverridePath: 
+  XboxOnePackagingOverridePath: 
+  XboxOneAppManifestOverridePath: 
+  XboxOnePackageEncryption: 0
+  XboxOnePackageUpdateGranularity: 2
+  XboxOneDescription: 
+  XboxOneLanguage:
+  - enus
+  XboxOneCapability: []
+  XboxOneGameRating: {}
+  XboxOneIsContentPackage: 0
+  XboxOneEnableGPUVariability: 0
+  XboxOneSockets: {}
+  XboxOneSplashScreen: {fileID: 0}
+  XboxOneAllowedProductIds: []
+  XboxOnePersistentLocalStorageSize: 0
+  XboxOneXTitleMemory: 8
+  xboxOneScriptCompiler: 0
+  vrEditorSettings:
+    daydream:
+      daydreamIconForeground: {fileID: 0}
+      daydreamIconBackground: {fileID: 0}
+  cloudServicesEnabled: {}
+  facebookSdkVersion: 7.9.4
+  apiCompatibilityLevel: 2
+  cloudProjectId: 
+  projectName: 
+  organizationId: 
+  cloudEnabled: 0
+  enableNativePlatformBackendsForNewInputSystem: 0
+  disableOldInputManagerSupport: 0
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ProjectVersion.txt b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ProjectVersion.txt
new file mode 100644
index 0000000000..4a9cfb61ab
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/ProjectVersion.txt
@@ -0,0 +1 @@
+m_EditorVersion: 2017.4.6f1
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/QualitySettings.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/QualitySettings.asset
new file mode 100644
index 0000000000..05daac3c49
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/QualitySettings.asset
@@ -0,0 +1,191 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!47 &1
+QualitySettings:
+  m_ObjectHideFlags: 0
+  serializedVersion: 5
+  m_CurrentQuality: 5
+  m_QualitySettings:
+  - serializedVersion: 2
+    name: Very Low
+    pixelLightCount: 0
+    shadows: 0
+    shadowResolution: 0
+    shadowProjection: 1
+    shadowCascades: 1
+    shadowDistance: 15
+    shadowNearPlaneOffset: 3
+    shadowCascade2Split: 0.33333334
+    shadowCascade4Split: {x: 0.06666667, y: 0.2, z: 0.46666667}
+    shadowmaskMode: 0
+    blendWeights: 1
+    textureQuality: 1
+    anisotropicTextures: 0
+    antiAliasing: 0
+    softParticles: 0
+    softVegetation: 0
+    realtimeReflectionProbes: 0
+    billboardsFaceCameraPosition: 0
+    vSyncCount: 0
+    lodBias: 0.3
+    maximumLODLevel: 0
+    particleRaycastBudget: 4
+    asyncUploadTimeSlice: 2
+    asyncUploadBufferSize: 4
+    resolutionScalingFixedDPIFactor: 1
+    excludedTargetPlatforms: []
+  - serializedVersion: 2
+    name: Low
+    pixelLightCount: 0
+    shadows: 0
+    shadowResolution: 0
+    shadowProjection: 1
+    shadowCascades: 1
+    shadowDistance: 20
+    shadowNearPlaneOffset: 3
+    shadowCascade2Split: 0.33333334
+    shadowCascade4Split: {x: 0.06666667, y: 0.2, z: 0.46666667}
+    shadowmaskMode: 0
+    blendWeights: 2
+    textureQuality: 0
+    anisotropicTextures: 0
+    antiAliasing: 0
+    softParticles: 0
+    softVegetation: 0
+    realtimeReflectionProbes: 0
+    billboardsFaceCameraPosition: 0
+    vSyncCount: 0
+    lodBias: 0.4
+    maximumLODLevel: 0
+    particleRaycastBudget: 16
+    asyncUploadTimeSlice: 2
+    asyncUploadBufferSize: 4
+    resolutionScalingFixedDPIFactor: 1
+    excludedTargetPlatforms: []
+  - serializedVersion: 2
+    name: Medium
+    pixelLightCount: 1
+    shadows: 1
+    shadowResolution: 0
+    shadowProjection: 1
+    shadowCascades: 1
+    shadowDistance: 20
+    shadowNearPlaneOffset: 3
+    shadowCascade2Split: 0.33333334
+    shadowCascade4Split: {x: 0.06666667, y: 0.2, z: 0.46666667}
+    shadowmaskMode: 0
+    blendWeights: 2
+    textureQuality: 0
+    anisotropicTextures: 1
+    antiAliasing: 0
+    softParticles: 0
+    softVegetation: 0
+    realtimeReflectionProbes: 0
+    billboardsFaceCameraPosition: 0
+    vSyncCount: 1
+    lodBias: 0.7
+    maximumLODLevel: 0
+    particleRaycastBudget: 64
+    asyncUploadTimeSlice: 2
+    asyncUploadBufferSize: 4
+    resolutionScalingFixedDPIFactor: 1
+    excludedTargetPlatforms: []
+  - serializedVersion: 2
+    name: High
+    pixelLightCount: 2
+    shadows: 2
+    shadowResolution: 1
+    shadowProjection: 1
+    shadowCascades: 2
+    shadowDistance: 40
+    shadowNearPlaneOffset: 3
+    shadowCascade2Split: 0.33333334
+    shadowCascade4Split: {x: 0.06666667, y: 0.2, z: 0.46666667}
+    shadowmaskMode: 1
+    blendWeights: 2
+    textureQuality: 0
+    anisotropicTextures: 1
+    antiAliasing: 0
+    softParticles: 0
+    softVegetation: 1
+    realtimeReflectionProbes: 1
+    billboardsFaceCameraPosition: 1
+    vSyncCount: 1
+    lodBias: 1
+    maximumLODLevel: 0
+    particleRaycastBudget: 256
+    asyncUploadTimeSlice: 2
+    asyncUploadBufferSize: 4
+    resolutionScalingFixedDPIFactor: 1
+    excludedTargetPlatforms: []
+  - serializedVersion: 2
+    name: Very High
+    pixelLightCount: 3
+    shadows: 2
+    shadowResolution: 2
+    shadowProjection: 1
+    shadowCascades: 2
+    shadowDistance: 70
+    shadowNearPlaneOffset: 3
+    shadowCascade2Split: 0.33333334
+    shadowCascade4Split: {x: 0.06666667, y: 0.2, z: 0.46666667}
+    shadowmaskMode: 1
+    blendWeights: 4
+    textureQuality: 0
+    anisotropicTextures: 2
+    antiAliasing: 2
+    softParticles: 1
+    softVegetation: 1
+    realtimeReflectionProbes: 1
+    billboardsFaceCameraPosition: 1
+    vSyncCount: 1
+    lodBias: 1.5
+    maximumLODLevel: 0
+    particleRaycastBudget: 1024
+    asyncUploadTimeSlice: 2
+    asyncUploadBufferSize: 4
+    resolutionScalingFixedDPIFactor: 1
+    excludedTargetPlatforms: []
+  - serializedVersion: 2
+    name: Ultra
+    pixelLightCount: 4
+    shadows: 2
+    shadowResolution: 2
+    shadowProjection: 1
+    shadowCascades: 4
+    shadowDistance: 150
+    shadowNearPlaneOffset: 3
+    shadowCascade2Split: 0.33333334
+    shadowCascade4Split: {x: 0.06666667, y: 0.2, z: 0.46666667}
+    shadowmaskMode: 1
+    blendWeights: 4
+    textureQuality: 0
+    anisotropicTextures: 2
+    antiAliasing: 2
+    softParticles: 1
+    softVegetation: 1
+    realtimeReflectionProbes: 1
+    billboardsFaceCameraPosition: 1
+    vSyncCount: 1
+    lodBias: 2
+    maximumLODLevel: 0
+    particleRaycastBudget: 4096
+    asyncUploadTimeSlice: 2
+    asyncUploadBufferSize: 4
+    resolutionScalingFixedDPIFactor: 1
+    excludedTargetPlatforms: []
+  m_PerPlatformDefaultQuality:
+    Android: 2
+    Nintendo 3DS: 5
+    Nintendo Switch: 5
+    PS4: 5
+    PSM: 5
+    PSP2: 2
+    Standalone: 5
+    Tizen: 2
+    WebGL: 3
+    WiiU: 5
+    Windows Store Apps: 5
+    XboxOne: 5
+    iPhone: 2
+    tvOS: 2
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/TagManager.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/TagManager.asset
new file mode 100644
index 0000000000..1c92a7840e
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/TagManager.asset
@@ -0,0 +1,43 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!78 &1
+TagManager:
+  serializedVersion: 2
+  tags: []
+  layers:
+  - Default
+  - TransparentFX
+  - Ignore Raycast
+  - 
+  - Water
+  - UI
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  - 
+  m_SortingLayers:
+  - name: Default
+    uniqueID: 0
+    locked: 0
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/TimeManager.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/TimeManager.asset
new file mode 100644
index 0000000000..558a017e1f
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/TimeManager.asset
@@ -0,0 +1,9 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!5 &1
+TimeManager:
+  m_ObjectHideFlags: 0
+  Fixed Timestep: 0.02
+  Maximum Allowed Timestep: 0.33333334
+  m_TimeScale: 1
+  Maximum Particle Timestep: 0.03
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/UnityConnectSettings.asset b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/UnityConnectSettings.asset
new file mode 100644
index 0000000000..3da14d5baf
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/ProjectSettings/UnityConnectSettings.asset
@@ -0,0 +1,34 @@
+%YAML 1.1
+%TAG !u! tag:unity3d.com,2011:
+--- !u!310 &1
+UnityConnectSettings:
+  m_ObjectHideFlags: 0
+  m_Enabled: 0
+  m_TestMode: 0
+  m_TestEventUrl: 
+  m_TestConfigUrl: 
+  m_TestInitMode: 0
+  CrashReportingSettings:
+    m_EventUrl: https://perf-events.cloud.unity3d.com/api/events/crashes
+    m_NativeEventUrl: https://perf-events.cloud.unity3d.com/symbolicate
+    m_Enabled: 0
+    m_CaptureEditorExceptions: 1
+  UnityPurchasingSettings:
+    m_Enabled: 0
+    m_TestMode: 0
+  UnityAnalyticsSettings:
+    m_Enabled: 0
+    m_InitializeOnStartup: 1
+    m_TestMode: 0
+    m_TestEventUrl: 
+    m_TestConfigUrl: 
+  UnityAdsSettings:
+    m_Enabled: 0
+    m_InitializeOnStartup: 1
+    m_TestMode: 0
+    m_IosGameId: 
+    m_AndroidGameId: 
+    m_GameIds: {}
+    m_GameId: 
+  PerformanceReportingSettings:
+    m_Enabled: 0
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/README.md b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/README.md
new file mode 100644
index 0000000000..c0dcb090b4
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/README.md
@@ -0,0 +1,27 @@
+# TF Lite Experimental Unity Plugin
+
+This directoryy contains an experimental sample Unity (2017) Plugin, based on
+the experimental TF Lite C API. The sample demonstrates running inference within
+Unity by way of a C# `Interpreter` wrapper.
+
+Note that the native TF Lite plugin(s) *must* be built before using the Unity
+Plugin, and placed in Assets/TensorFlowLite/SDK/Plugins/. For the editor (note
+that this has only been tested on Linux; the syntax may differ on Mac/Windows):
+
+```sh
+bazel build -c opt --cxxopt=--std=c++11 \
+  //tensorflow/contrib/lite/experimental/c:libtensorflowlite_c.so
+```
+
+and for Android:
+
+```sh
+bazel build -c opt --cxxopt=--std=c++11 \
+  --crosstool_top=//external:android/crosstool \
+  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
+  --cpu=armeabi-v7a \
+  //tensorflow/contrib/lite/experimental/c:libtensorflowlite_c.so
+```
+
+If you encounter issues with native plugin discovery on Mac ("Darwin")
+platforms, try renaming `libtensorflowlite_c.so` to `tensorflowlite_c.bundle`.
diff --git a/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/UnityPackageManager/manifest.json b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/UnityPackageManager/manifest.json
new file mode 100644
index 0000000000..526aca6057
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/examples/unity/TensorFlowLitePlugin/UnityPackageManager/manifest.json
@@ -0,0 +1,4 @@
+{
+	"dependencies": {
+	}
+}
diff --git a/tensorflow/contrib/lite/experimental/kernels/BUILD b/tensorflow/contrib/lite/experimental/kernels/BUILD
new file mode 100644
index 0000000000..9c06c4ebd9
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/BUILD
@@ -0,0 +1,84 @@
+package(default_visibility = [
+    "//visibility:public",
+])
+
+licenses(["notice"])  # Apache 2.0
+
+load("//tensorflow/contrib/lite:build_def.bzl", "tflite_copts")
+load("//tensorflow:tensorflow.bzl", "tf_cc_test")
+
+# ctc support classes imported directly from TensorFlow.
+cc_library(
+    name = "ctc_utils",
+    hdrs = [
+        "ctc_beam_entry.h",
+        "ctc_beam_scorer.h",
+        "ctc_beam_search.h",
+        "ctc_decoder.h",
+        "ctc_loss_util.h",
+    ],
+    deps = [
+        ":top_n",
+        "//tensorflow/contrib/lite/kernels/internal:types",
+        "//third_party/eigen3",
+    ],
+)
+
+# top_n support classes imported directly from TensorFlow.
+cc_library(
+    name = "top_n",
+    hdrs = [
+        "top_n.h",
+    ],
+    deps = [
+        "//tensorflow/contrib/lite/kernels/internal:types",
+    ],
+)
+
+cc_library(
+    name = "experimental_ops",
+    srcs = [
+        "ctc_beam_search_decoder.cc",
+    ],
+    # Suppress warnings that are introduced by Eigen Tensor.
+    copts = tflite_copts() + [
+        "-Wno-error=reorder",
+    ] + select({
+        "//tensorflow:ios": ["-Wno-error=invalid-partial-specialization"],
+        "//conditions:default": [
+        ],
+    }),
+    deps = [
+        ":ctc_utils",
+        "//tensorflow/contrib/lite:builtin_op_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:string_util",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+        "//tensorflow/contrib/lite/kernels:gemm_support",
+        "//tensorflow/contrib/lite/kernels:kernel_util",
+        "//tensorflow/contrib/lite/kernels:op_macros",
+        "//tensorflow/contrib/lite/kernels/internal:kernel_utils",
+        "//tensorflow/contrib/lite/kernels/internal:optimized",
+        "//tensorflow/contrib/lite/kernels/internal:optimized_base",
+        "//tensorflow/contrib/lite/kernels/internal:quantization_util",
+        "//tensorflow/contrib/lite/kernels/internal:reference",
+        "//tensorflow/contrib/lite/kernels/internal:reference_base",
+        "//tensorflow/contrib/lite/kernels/internal:tensor_utils",
+        "@flatbuffers",
+    ],
+)
+
+tf_cc_test(
+    name = "ctc_beam_search_decoder_test",
+    size = "small",
+    srcs = ["ctc_beam_search_decoder_test.cc"],
+    tags = ["tflite_not_portable_ios"],
+    deps = [
+        ":experimental_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+        "@flatbuffers",
+    ],
+)
diff --git a/tensorflow/contrib/lite/experimental/kernels/ctc_beam_entry.h b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_entry.h
new file mode 100644
index 0000000000..a60ff2a1c5
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_entry.h
@@ -0,0 +1,150 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Copied from tensorflow/core/util/ctc/ctc_beam_entry.h
+// TODO(b/111524997): Remove this file.
+#ifndef TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_ENTRY_H_
+#define TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_ENTRY_H_
+
+#include <algorithm>
+#include <memory>
+#include <unordered_map>
+#include <vector>
+
+#include "third_party/eigen3/Eigen/Core"
+#include "tensorflow/contrib/lite/experimental/kernels/ctc_loss_util.h"
+
+namespace tflite {
+namespace experimental {
+namespace ctc {
+
+// The ctc_beam_search namespace holds several classes meant to be accessed only
+// in case of extending the CTCBeamSearch decoder to allow custom scoring
+// functions.
+//
+// BeamEntry is exposed through template arguments BeamScorer and BeamComparer
+// of CTCBeamSearch (ctc_beam_search.h).
+namespace ctc_beam_search {
+
+struct EmptyBeamState {};
+
+struct BeamProbability {
+  BeamProbability() : total(kLogZero), blank(kLogZero), label(kLogZero) {}
+  void Reset() {
+    total = kLogZero;
+    blank = kLogZero;
+    label = kLogZero;
+  }
+  float total;
+  float blank;
+  float label;
+};
+
+template <class CTCBeamState>
+class BeamRoot;
+
+template <class CTCBeamState = EmptyBeamState>
+struct BeamEntry {
+  // BeamRoot<CTCBeamState>::AddEntry() serves as the factory method.
+  friend BeamEntry<CTCBeamState>* BeamRoot<CTCBeamState>::AddEntry(
+      BeamEntry<CTCBeamState>* p, int l);
+  inline bool Active() const { return newp.total != kLogZero; }
+  // Return the child at the given index, or construct a new one in-place if
+  // none was found.
+  BeamEntry& GetChild(int ind) {
+    auto entry = children.emplace(ind, nullptr);
+    auto& child_entry = entry.first->second;
+    // If this is a new child, populate the BeamEntry<CTCBeamState>*.
+    if (entry.second) {
+      child_entry = beam_root->AddEntry(this, ind);
+    }
+    return *child_entry;
+  }
+  std::vector<int> LabelSeq(bool merge_repeated) const {
+    std::vector<int> labels;
+    int prev_label = -1;
+    const BeamEntry* c = this;
+    while (c->parent != nullptr) {  // Checking c->parent to skip root leaf.
+      if (!merge_repeated || c->label != prev_label) {
+        labels.push_back(c->label);
+      }
+      prev_label = c->label;
+      c = c->parent;
+    }
+    std::reverse(labels.begin(), labels.end());
+    return labels;
+  }
+
+  BeamEntry<CTCBeamState>* parent;
+  int label;
+  // All instances of child BeamEntry are owned by *beam_root.
+  std::unordered_map<int, BeamEntry<CTCBeamState>*> children;
+  BeamProbability oldp;
+  BeamProbability newp;
+  CTCBeamState state;
+
+ private:
+  // Constructor giving parent, label, and the beam_root.
+  // The object pointed to by p cannot be copied and should not be moved,
+  // otherwise parent will become invalid.
+  // This private constructor is only called through the factory method
+  // BeamRoot<CTCBeamState>::AddEntry().
+  BeamEntry(BeamEntry* p, int l, BeamRoot<CTCBeamState>* beam_root)
+      : parent(p), label(l), beam_root(beam_root) {}
+  BeamRoot<CTCBeamState>* beam_root;
+
+  BeamEntry(const BeamEntry&) = delete;
+  void operator=(const BeamEntry&) = delete;
+};
+
+// This class owns all instances of BeamEntry.  This is used to avoid recursive
+// destructor call during destruction.
+template <class CTCBeamState = EmptyBeamState>
+class BeamRoot {
+ public:
+  BeamRoot(BeamEntry<CTCBeamState>* p, int l) { root_entry_ = AddEntry(p, l); }
+  BeamRoot(const BeamRoot&) = delete;
+  BeamRoot& operator=(const BeamRoot&) = delete;
+
+  BeamEntry<CTCBeamState>* AddEntry(BeamEntry<CTCBeamState>* p, int l) {
+    auto* new_entry = new BeamEntry<CTCBeamState>(p, l, this);
+    beam_entries_.emplace_back(new_entry);
+    return new_entry;
+  }
+  BeamEntry<CTCBeamState>* RootEntry() const { return root_entry_; }
+
+ private:
+  BeamEntry<CTCBeamState>* root_entry_ = nullptr;
+  std::vector<std::unique_ptr<BeamEntry<CTCBeamState>>> beam_entries_;
+};
+
+// BeamComparer is the default beam comparer provided in CTCBeamSearch.
+template <class CTCBeamState = EmptyBeamState>
+class BeamComparer {
+ public:
+  virtual ~BeamComparer() {}
+  virtual bool inline operator()(const BeamEntry<CTCBeamState>* a,
+                                 const BeamEntry<CTCBeamState>* b) const {
+    return a->newp.total > b->newp.total;
+  }
+};
+
+}  // namespace ctc_beam_search
+
+}  // namespace ctc
+}  // namespace experimental
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_ENTRY_H_
diff --git a/tensorflow/contrib/lite/experimental/kernels/ctc_beam_scorer.h b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_scorer.h
new file mode 100644
index 0000000000..ec60e26257
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_scorer.h
@@ -0,0 +1,79 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Collection of scoring classes that can be extended and provided to the
+// CTCBeamSearchDecoder to incorporate additional scoring logic (such as a
+// language model).
+//
+// To build a custom scorer extend and implement the pure virtual methods from
+// BeamScorerInterface. The default CTC decoding behavior is implemented
+// through BaseBeamScorer.
+
+// Copied from tensorflow/core/util/ctc/ctc_beam_scorer.h
+// TODO(b/111524997): Remove this file.
+#ifndef TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_SCORER_H_
+#define TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_SCORER_H_
+
+#include "tensorflow/contrib/lite/experimental/kernels/ctc_beam_entry.h"
+
+namespace tflite {
+namespace experimental {
+namespace ctc {
+
+// Base implementation of a beam scorer used by default by the decoder that can
+// be subclassed and provided as an argument to CTCBeamSearchDecoder, if complex
+// scoring is required. Its main purpose is to provide a thin layer for
+// integrating language model scoring easily.
+template <typename CTCBeamState>
+class BaseBeamScorer {
+ public:
+  virtual ~BaseBeamScorer() {}
+  // State initialization.
+  virtual void InitializeState(CTCBeamState* root) const {}
+  // ExpandState is called when expanding a beam to one of its children.
+  // Called at most once per child beam. In the simplest case, no state
+  // expansion is done.
+  virtual void ExpandState(const CTCBeamState& from_state, int from_label,
+                           CTCBeamState* to_state, int to_label) const {}
+  // ExpandStateEnd is called after decoding has finished. Its purpose is to
+  // allow a final scoring of the beam in its current state, before resorting
+  // and retrieving the TopN requested candidates. Called at most once per beam.
+  virtual void ExpandStateEnd(CTCBeamState* state) const {}
+  // GetStateExpansionScore should be an inexpensive method to retrieve the
+  // (cached) expansion score computed within ExpandState. The score is
+  // multiplied (log-addition) with the input score at the current step from
+  // the network.
+  //
+  // The score returned should be a log-probability. In the simplest case, as
+  // there's no state expansion logic, the expansion score is zero.
+  virtual float GetStateExpansionScore(const CTCBeamState& state,
+                                       float previous_score) const {
+    return previous_score;
+  }
+  // GetStateEndExpansionScore should be an inexpensive method to retrieve the
+  // (cached) expansion score computed within ExpandStateEnd. The score is
+  // multiplied (log-addition) with the final probability of the beam.
+  //
+  // The score returned should be a log-probability.
+  virtual float GetStateEndExpansionScore(const CTCBeamState& state) const {
+    return 0;
+  }
+};
+
+}  // namespace ctc
+}  // namespace experimental
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_SCORER_H_
diff --git a/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search.h b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search.h
new file mode 100644
index 0000000000..c658e43092
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search.h
@@ -0,0 +1,420 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Copied from tensorflow/core/util/ctc/ctc_beam_search.h
+// TODO(b/111524997): Remove this file.
+#ifndef TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_SEARCH_H_
+#define TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_SEARCH_H_
+
+#include <algorithm>
+#include <cmath>
+#include <limits>
+#include <memory>
+#include <vector>
+
+#include "third_party/eigen3/Eigen/Core"
+#include "tensorflow/contrib/lite/experimental/kernels/ctc_beam_entry.h"
+#include "tensorflow/contrib/lite/experimental/kernels/ctc_beam_scorer.h"
+#include "tensorflow/contrib/lite/experimental/kernels/ctc_decoder.h"
+#include "tensorflow/contrib/lite/experimental/kernels/ctc_loss_util.h"
+#include "tensorflow/contrib/lite/experimental/kernels/top_n.h"
+#include "tensorflow/contrib/lite/kernels/internal/compatibility.h"
+
+namespace tflite {
+namespace experimental {
+namespace ctc {
+
+template <typename CTCBeamState = ctc_beam_search::EmptyBeamState,
+          typename CTCBeamComparer =
+              ctc_beam_search::BeamComparer<CTCBeamState>>
+class CTCBeamSearchDecoder : public CTCDecoder {
+  // Beam Search
+  //
+  // Example (GravesTh Fig. 7.5):
+  //         a    -
+  //  P = [ 0.3  0.7 ]  t = 0
+  //      [ 0.4  0.6 ]  t = 1
+  //
+  // Then P(l = -) = P(--) = 0.7 * 0.6 = 0.42
+  //      P(l = a) = P(a-) + P(aa) + P(-a) = 0.3*0.4 + ... = 0.58
+  //
+  // In this case, Best Path decoding is suboptimal.
+  //
+  // For Beam Search, we use the following main recurrence relations:
+  //
+  // Relation 1:
+  // ---------------------------------------------------------- Eq. 1
+  //      P(l=abcd @ t=7) = P(l=abc  @ t=6) * P(d @ 7)
+  //                      + P(l=abcd @ t=6) * (P(d @ 7) + P(- @ 7))
+  // where P(l=? @ t=7), ? = a, ab, abc, abcd are all stored and
+  // updated recursively in the beam entry.
+  //
+  // Relation 2:
+  // ---------------------------------------------------------- Eq. 2
+  //      P(l=abc? @ t=3) = P(l=abc @ t=2) * P(? @ 3)
+  // for ? in a, b, d, ..., (not including c or the blank index),
+  // and the recurrence starts from the beam entry for P(l=abc @ t=2).
+  //
+  // For this case, the length of the new sequence equals t+1 (t
+  // starts at 0).  This special case can be calculated as:
+  //   P(l=abc? @ t=3) = P(a @ 0)*P(b @ 1)*P(c @ 2)*P(? @ 3)
+  // but we calculate it recursively for speed purposes.
+  typedef ctc_beam_search::BeamEntry<CTCBeamState> BeamEntry;
+  typedef ctc_beam_search::BeamRoot<CTCBeamState> BeamRoot;
+  typedef ctc_beam_search::BeamProbability BeamProbability;
+
+ public:
+  typedef BaseBeamScorer<CTCBeamState> DefaultBeamScorer;
+
+  // The beam search decoder is constructed specifying the beam_width (number of
+  // candidates to keep at each decoding timestep) and a beam scorer (used for
+  // custom scoring, for example enabling the use of a language model).
+  // The ownership of the scorer remains with the caller. The default
+  // implementation, CTCBeamSearchDecoder<>::DefaultBeamScorer, generates the
+  // standard beam search.
+  CTCBeamSearchDecoder(int num_classes, int beam_width,
+                       BaseBeamScorer<CTCBeamState>* scorer, int batch_size = 1,
+                       bool merge_repeated = false)
+      : CTCDecoder(num_classes, batch_size, merge_repeated),
+        beam_width_(beam_width),
+        leaves_(beam_width),
+        beam_scorer_(scorer) {
+    Reset();
+  }
+
+  ~CTCBeamSearchDecoder() override {}
+
+  // Run the hibernating beam search algorithm on the given input.
+  bool Decode(const CTCDecoder::SequenceLength& seq_len,
+              const std::vector<CTCDecoder::Input>& input,
+              std::vector<CTCDecoder::Output>* output,
+              CTCDecoder::ScoreOutput* scores) override;
+
+  // Calculate the next step of the beam search and update the internal state.
+  template <typename Vector>
+  void Step(const Vector& log_input_t);
+
+  template <typename Vector>
+  float GetTopK(const int K, const Vector& input,
+                std::vector<float>* top_k_logits,
+                std::vector<int>* top_k_indices);
+
+  // Retrieve the beam scorer instance used during decoding.
+  BaseBeamScorer<CTCBeamState>* GetBeamScorer() const { return beam_scorer_; }
+
+  // Set label selection parameters for faster decoding.
+  // See comments for label_selection_size_ and label_selection_margin_.
+  void SetLabelSelectionParameters(int label_selection_size,
+                                   float label_selection_margin) {
+    label_selection_size_ = label_selection_size;
+    label_selection_margin_ = label_selection_margin;
+  }
+
+  // Reset the beam search
+  void Reset();
+
+  // Extract the top n paths at current time step
+  bool TopPaths(int n, std::vector<std::vector<int>>* paths,
+                std::vector<float>* log_probs, bool merge_repeated) const;
+
+ private:
+  int beam_width_;
+
+  // Label selection is designed to avoid possibly very expensive scorer calls,
+  // by pruning the hypotheses based on the input alone.
+  // Label selection size controls how many items in each beam are passed
+  // through to the beam scorer. Only items with top N input scores are
+  // considered.
+  // Label selection margin controls the difference between minimal input score
+  // (versus the best scoring label) for an item to be passed to the beam
+  // scorer. This margin is expressed in terms of log-probability.
+  // Default is to do no label selection.
+  // For more detail: https://research.google.com/pubs/pub44823.html
+  int label_selection_size_ = 0;       // zero means unlimited
+  float label_selection_margin_ = -1;  // -1 means unlimited.
+
+  gtl::TopN<BeamEntry*, CTCBeamComparer> leaves_;
+  std::unique_ptr<BeamRoot> beam_root_;
+  BaseBeamScorer<CTCBeamState>* beam_scorer_;
+
+  CTCBeamSearchDecoder(const CTCBeamSearchDecoder&) = delete;
+  void operator=(const CTCBeamSearchDecoder&) = delete;
+};
+
+template <typename CTCBeamState, typename CTCBeamComparer>
+bool CTCBeamSearchDecoder<CTCBeamState, CTCBeamComparer>::Decode(
+    const CTCDecoder::SequenceLength& seq_len,
+    const std::vector<CTCDecoder::Input>& input,
+    std::vector<CTCDecoder::Output>* output, ScoreOutput* scores) {
+  // Storage for top paths.
+  std::vector<std::vector<int>> beams;
+  std::vector<float> beam_log_probabilities;
+  int top_n = output->size();
+  if (std::any_of(output->begin(), output->end(),
+                  [this](const CTCDecoder::Output& output) -> bool {
+                    return output.size() < this->batch_size_;
+                  })) {
+    return false;
+  }
+  if (scores->rows() < batch_size_ || scores->cols() < top_n) {
+    return false;
+  }
+
+  for (int b = 0; b < batch_size_; ++b) {
+    int seq_len_b = seq_len[b];
+    Reset();
+
+    for (int t = 0; t < seq_len_b; ++t) {
+      // Pass log-probabilities for this example + time.
+      Step(input[t].row(b));
+    }  // for (int t...
+
+    // O(n * log(n))
+    std::unique_ptr<std::vector<BeamEntry*>> branches(leaves_.Extract());
+    leaves_.Reset();
+    for (int i = 0; i < branches->size(); ++i) {
+      BeamEntry* entry = (*branches)[i];
+      beam_scorer_->ExpandStateEnd(&entry->state);
+      entry->newp.total +=
+          beam_scorer_->GetStateEndExpansionScore(entry->state);
+      leaves_.push(entry);
+    }
+
+    bool status =
+        TopPaths(top_n, &beams, &beam_log_probabilities, merge_repeated_);
+    if (!status) {
+      return status;
+    }
+
+    TFLITE_DCHECK_EQ(top_n, beam_log_probabilities.size());
+    TFLITE_DCHECK_EQ(beams.size(), beam_log_probabilities.size());
+
+    for (int i = 0; i < top_n; ++i) {
+      // Copy output to the correct beam + batch
+      (*output)[i][b].swap(beams[i]);
+      (*scores)(b, i) = -beam_log_probabilities[i];
+    }
+  }  // for (int b...
+  return true;
+}
+
+template <typename CTCBeamState, typename CTCBeamComparer>
+template <typename Vector>
+float CTCBeamSearchDecoder<CTCBeamState, CTCBeamComparer>::GetTopK(
+    const int K, const Vector& input, std::vector<float>* top_k_logits,
+    std::vector<int>* top_k_indices) {
+  // Find Top K choices, complexity nk in worst case. The array input is read
+  // just once.
+  TFLITE_DCHECK_EQ(num_classes_, input.size());
+  top_k_logits->clear();
+  top_k_indices->clear();
+  top_k_logits->resize(K, -INFINITY);
+  top_k_indices->resize(K, -1);
+  for (int j = 0; j < num_classes_ - 1; ++j) {
+    const float logit = input(j);
+    if (logit > (*top_k_logits)[K - 1]) {
+      int k = K - 1;
+      while (k > 0 && logit > (*top_k_logits)[k - 1]) {
+        (*top_k_logits)[k] = (*top_k_logits)[k - 1];
+        (*top_k_indices)[k] = (*top_k_indices)[k - 1];
+        k--;
+      }
+      (*top_k_logits)[k] = logit;
+      (*top_k_indices)[k] = j;
+    }
+  }
+  // Return max value which is in 0th index or blank character logit
+  return std::max((*top_k_logits)[0], input(num_classes_ - 1));
+}
+
+template <typename CTCBeamState, typename CTCBeamComparer>
+template <typename Vector>
+void CTCBeamSearchDecoder<CTCBeamState, CTCBeamComparer>::Step(
+    const Vector& raw_input) {
+  std::vector<float> top_k_logits;
+  std::vector<int> top_k_indices;
+  const bool top_k =
+      (label_selection_size_ > 0 && label_selection_size_ < raw_input.size());
+  // Number of character classes to consider in each step.
+  const int max_classes = top_k ? label_selection_size_ : (num_classes_ - 1);
+  // Get max coefficient and remove it from raw_input later.
+  float max_coeff;
+  if (top_k) {
+    max_coeff = GetTopK(label_selection_size_, raw_input, &top_k_logits,
+                        &top_k_indices);
+  } else {
+    max_coeff = raw_input.maxCoeff();
+  }
+  const float label_selection_input_min =
+      (label_selection_margin_ >= 0) ? (max_coeff - label_selection_margin_)
+                                     : -std::numeric_limits<float>::infinity();
+
+  // Extract the beams sorted in decreasing new probability
+  TFLITE_DCHECK_EQ(num_classes_, raw_input.size());
+
+  std::unique_ptr<std::vector<BeamEntry*>> branches(leaves_.Extract());
+  leaves_.Reset();
+
+  for (BeamEntry* b : *branches) {
+    // P(.. @ t) becomes the new P(.. @ t-1)
+    b->oldp = b->newp;
+  }
+
+  for (BeamEntry* b : *branches) {
+    if (b->parent != nullptr) {  // if not the root
+      if (b->parent->Active()) {
+        // If last two sequence characters are identical:
+        //   Plabel(l=acc @ t=6) = (Plabel(l=acc @ t=5)
+        //                          + Pblank(l=ac @ t=5))
+        // else:
+        //   Plabel(l=abc @ t=6) = (Plabel(l=abc @ t=5)
+        //                          + P(l=ab @ t=5))
+        float previous = (b->label == b->parent->label) ? b->parent->oldp.blank
+                                                        : b->parent->oldp.total;
+        b->newp.label =
+            LogSumExp(b->newp.label,
+                      beam_scorer_->GetStateExpansionScore(b->state, previous));
+      }
+      // Plabel(l=abc @ t=6) *= P(c @ 6)
+      b->newp.label += raw_input(b->label) - max_coeff;
+    }
+    // Pblank(l=abc @ t=6) = P(l=abc @ t=5) * P(- @ 6)
+    b->newp.blank = b->oldp.total + raw_input(blank_index_) - max_coeff;
+    // P(l=abc @ t=6) = Plabel(l=abc @ t=6) + Pblank(l=abc @ t=6)
+    b->newp.total = LogSumExp(b->newp.blank, b->newp.label);
+
+    // Push the entry back to the top paths list.
+    // Note, this will always fill leaves back up in sorted order.
+    leaves_.push(b);
+  }
+
+  // we need to resort branches in descending oldp order.
+
+  // branches is in descending oldp order because it was
+  // originally in descending newp order and we copied newp to oldp.
+
+  // Grow new leaves
+  for (BeamEntry* b : *branches) {
+    // A new leaf (represented by its BeamProbability) is a candidate
+    // iff its total probability is nonzero and either the beam list
+    // isn't full, or the lowest probability entry in the beam has a
+    // lower probability than the leaf.
+    auto is_candidate = [this](const BeamProbability& prob) {
+      return (prob.total > kLogZero &&
+              (leaves_.size() < beam_width_ ||
+               prob.total > leaves_.peek_bottom()->newp.total));
+    };
+
+    if (!is_candidate(b->oldp)) {
+      continue;
+    }
+
+    for (int ind = 0; ind < max_classes; ind++) {
+      const int label = top_k ? top_k_indices[ind] : ind;
+      const float logit = top_k ? top_k_logits[ind] : raw_input(ind);
+      // Perform label selection: if input for this label looks very
+      // unpromising, never evaluate it with a scorer.
+      if (logit < label_selection_input_min) {
+        continue;
+      }
+      BeamEntry& c = b->GetChild(label);
+      if (!c.Active()) {
+        //   Pblank(l=abcd @ t=6) = 0
+        c.newp.blank = kLogZero;
+        // If new child label is identical to beam label:
+        //   Plabel(l=abcc @ t=6) = Pblank(l=abc @ t=5) * P(c @ 6)
+        // Otherwise:
+        //   Plabel(l=abcd @ t=6) = P(l=abc @ t=5) * P(d @ 6)
+        beam_scorer_->ExpandState(b->state, b->label, &c.state, c.label);
+        float previous = (c.label == b->label) ? b->oldp.blank : b->oldp.total;
+        c.newp.label = logit - max_coeff +
+                       beam_scorer_->GetStateExpansionScore(c.state, previous);
+        // P(l=abcd @ t=6) = Plabel(l=abcd @ t=6)
+        c.newp.total = c.newp.label;
+
+        if (is_candidate(c.newp)) {
+          // Before adding the new node to the beam, check if the beam
+          // is already at maximum width.
+          if (leaves_.size() == beam_width_) {
+            // Bottom is no longer in the beam search.  Reset
+            // its probability; signal it's no longer in the beam search.
+            BeamEntry* bottom = leaves_.peek_bottom();
+            bottom->newp.Reset();
+          }
+          leaves_.push(&c);
+        } else {
+          // Deactivate child.
+          c.oldp.Reset();
+          c.newp.Reset();
+        }
+      }
+    }
+  }  // for (BeamEntry* b...
+}
+
+template <typename CTCBeamState, typename CTCBeamComparer>
+void CTCBeamSearchDecoder<CTCBeamState, CTCBeamComparer>::Reset() {
+  leaves_.Reset();
+
+  // This beam root, and all of its children, will be in memory until
+  // the next reset.
+  beam_root_.reset(new BeamRoot(nullptr, -1));
+  beam_root_->RootEntry()->newp.total = 0.0;  // ln(1)
+  beam_root_->RootEntry()->newp.blank = 0.0;  // ln(1)
+
+  // Add the root as the initial leaf.
+  leaves_.push(beam_root_->RootEntry());
+
+  // Call initialize state on the root object.
+  beam_scorer_->InitializeState(&beam_root_->RootEntry()->state);
+}
+
+template <typename CTCBeamState, typename CTCBeamComparer>
+bool CTCBeamSearchDecoder<CTCBeamState, CTCBeamComparer>::TopPaths(
+    int n, std::vector<std::vector<int>>* paths, std::vector<float>* log_probs,
+    bool merge_repeated) const {
+  TFLITE_DCHECK(paths);
+  TFLITE_DCHECK(log_probs);
+  paths->clear();
+  log_probs->clear();
+  if (n > beam_width_) {
+    return false;
+  }
+  if (n > leaves_.size()) {
+    return false;
+  }
+
+  gtl::TopN<BeamEntry*, CTCBeamComparer> top_branches(n);
+
+  // O(beam_width_ * log(n)), space complexity is O(n)
+  for (auto it = leaves_.unsorted_begin(); it != leaves_.unsorted_end(); ++it) {
+    top_branches.push(*it);
+  }
+  // O(n * log(n))
+  std::unique_ptr<std::vector<BeamEntry*>> branches(top_branches.Extract());
+
+  for (int i = 0; i < n; ++i) {
+    BeamEntry* e((*branches)[i]);
+    paths->push_back(e->LabelSeq(merge_repeated));
+    log_probs->push_back(e->newp.total);
+  }
+  return true;
+}
+
+}  // namespace ctc
+}  // namespace experimental
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_BEAM_SEARCH_H_
diff --git a/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search_decoder.cc b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search_decoder.cc
new file mode 100644
index 0000000000..834d1ebd66
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search_decoder.cc
@@ -0,0 +1,247 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <vector>
+#include "flatbuffers/flexbuffers.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/experimental/kernels/ctc_beam_search.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace ops {
+namespace experimental {
+namespace ctc_beam_search_decoder {
+
+constexpr int kInputsTensor = 0;
+constexpr int kSequenceLengthTensor = 1;
+
+typedef struct {
+  int beam_width;
+  int top_paths;
+  bool merge_repeated;
+} CTCBeamSearchDecoderParams;
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  TFLITE_CHECK(buffer != nullptr);
+  const uint8_t* buffer_t = reinterpret_cast<const uint8_t*>(buffer);
+  const flexbuffers::Map& m = flexbuffers::GetRoot(buffer_t, length).AsMap();
+
+  CTCBeamSearchDecoderParams* option = new CTCBeamSearchDecoderParams;
+  option->beam_width = m["beam_width"].AsInt32();
+  option->top_paths = m["top_paths"].AsInt32();
+  option->merge_repeated = m["merge_repeated"].AsBool();
+
+  return option;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<CTCBeamSearchDecoderParams*>(buffer);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  const CTCBeamSearchDecoderParams* option =
+      reinterpret_cast<CTCBeamSearchDecoderParams*>(node->user_data);
+  const int top_paths = option->top_paths;
+  TF_LITE_ENSURE(context, option->beam_width >= top_paths);
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
+  // The outputs should be top_paths * 3 + 1.
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 3 * top_paths + 1);
+
+  const TfLiteTensor* inputs = GetInput(context, node, kInputsTensor);
+  TF_LITE_ENSURE_EQ(context, NumDimensions(inputs), 3);
+  // TensorFlow only supports float.
+  TF_LITE_ENSURE_EQ(context, inputs->type, kTfLiteFloat32);
+  const int batch_size = SizeOfDimension(inputs, 1);
+
+  const TfLiteTensor* sequence_length =
+      GetInput(context, node, kSequenceLengthTensor);
+  TF_LITE_ENSURE_EQ(context, NumDimensions(sequence_length), 1);
+  TF_LITE_ENSURE_EQ(context, NumElements(sequence_length), batch_size);
+  // TensorFlow only supports int32.
+  TF_LITE_ENSURE_EQ(context, sequence_length->type, kTfLiteInt32);
+
+  // Resize decoded outputs.
+  // Do not resize indices & values cause we don't know the values yet.
+  for (int i = 0; i < top_paths; ++i) {
+    TfLiteTensor* indices = GetOutput(context, node, i);
+    SetTensorToDynamic(indices);
+    TfLiteTensor* values = GetOutput(context, node, i + top_paths);
+    SetTensorToDynamic(values);
+    TfLiteTensor* output_shape = GetOutput(context, node, i + 2 * top_paths);
+    SetTensorToDynamic(output_shape);
+  }
+
+  // Resize log probability outputs.
+  TfLiteTensor* log_probability_output =
+      GetOutput(context, node, top_paths * 3);
+  TfLiteIntArray* log_probability_output_shape_array = TfLiteIntArrayCreate(2);
+  log_probability_output_shape_array->data[0] = batch_size;
+  log_probability_output_shape_array->data[1] = top_paths;
+  return context->ResizeTensor(context, log_probability_output,
+                               log_probability_output_shape_array);
+}
+
+TfLiteStatus Resize(TfLiteContext* context,
+                    std::initializer_list<int32_t> output_shape,
+                    TfLiteTensor* output) {
+  const int dimensions = output_shape.size();
+  TfLiteIntArray* output_shape_array = TfLiteIntArrayCreate(dimensions);
+  int i = 0;
+  for (const int v : output_shape) {
+    output_shape_array->data[i++] = v;
+  }
+  return context->ResizeTensor(context, output, output_shape_array);
+}
+
+TfLiteStatus StoreAllDecodedSequences(
+    TfLiteContext* context,
+    const std::vector<std::vector<std::vector<int>>>& sequences,
+    TfLiteNode* node, int top_paths) {
+  const int32_t batch_size = sequences.size();
+  std::vector<int32_t> num_entries(top_paths, 0);
+
+  // Calculate num_entries per path
+  for (const auto& batch_s : sequences) {
+    TF_LITE_ENSURE_EQ(context, batch_s.size(), top_paths);
+    for (int p = 0; p < top_paths; ++p) {
+      num_entries[p] += batch_s[p].size();
+    }
+  }
+
+  for (int p = 0; p < top_paths; ++p) {
+    const int32_t p_num = num_entries[p];
+
+    // Resize the decoded outputs.
+    TfLiteTensor* indices = GetOutput(context, node, p);
+    TF_LITE_ENSURE_OK(context, Resize(context, {p_num, 2}, indices));
+
+    TfLiteTensor* values = GetOutput(context, node, p + top_paths);
+    TF_LITE_ENSURE_OK(context, Resize(context, {p_num}, values));
+
+    TfLiteTensor* decoded_shape = GetOutput(context, node, p + 2 * top_paths);
+    TF_LITE_ENSURE_OK(context, Resize(context, {2}, decoded_shape));
+
+    int32_t max_decoded = 0;
+    int32_t offset = 0;
+
+    int32_t* indices_data = GetTensorData<int32_t>(indices);
+    int32_t* values_data = GetTensorData<int32_t>(values);
+    int32_t* decoded_shape_data = GetTensorData<int32_t>(decoded_shape);
+    for (int b = 0; b < batch_size; ++b) {
+      auto& p_batch = sequences[b][p];
+      int32_t num_decoded = p_batch.size();
+      max_decoded = std::max(max_decoded, num_decoded);
+
+      std::copy_n(p_batch.begin(), num_decoded, values_data + offset);
+      for (int32_t t = 0; t < num_decoded; ++t, ++offset) {
+        indices_data[offset * 2] = b;
+        indices_data[offset * 2 + 1] = t;
+      }
+    }
+
+    decoded_shape_data[0] = batch_size;
+    decoded_shape_data[1] = max_decoded;
+  }
+  return kTfLiteOk;
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const TfLiteTensor* inputs = GetInput(context, node, kInputsTensor);
+  const TfLiteTensor* sequence_length =
+      GetInput(context, node, kSequenceLengthTensor);
+  const CTCBeamSearchDecoderParams* option =
+      reinterpret_cast<CTCBeamSearchDecoderParams*>(node->user_data);
+
+  const int max_time = SizeOfDimension(inputs, 0);
+  const int batch_size = SizeOfDimension(inputs, 1);
+  const int num_classes = SizeOfDimension(inputs, 2);
+
+  const int beam_width = option->beam_width;
+  const int top_paths = option->top_paths;
+  const bool merge_repeated = option->merge_repeated;
+
+  // Validate sequence length is less or equal than max time.
+  for (int i = 0; i < batch_size; ++i) {
+    TF_LITE_ENSURE(context,
+                   max_time >= GetTensorData<int32_t>(sequence_length)[i]);
+  }
+
+  // The following logic is implemented like
+  // tensorflow/core/kernels/ctc_decoder_ops.cc
+  std::vector<optimized_ops::TTypes<float>::UnalignedConstMatrix> input_list_t;
+
+  for (std::size_t t = 0; t < max_time; ++t) {
+    input_list_t.emplace_back(
+        GetTensorData<float>(inputs) + t * batch_size * num_classes, batch_size,
+        num_classes);
+  }
+
+  ::tflite::experimental::ctc::CTCBeamSearchDecoder<>::DefaultBeamScorer
+      beam_scorer;
+  ::tflite::experimental::ctc::CTCBeamSearchDecoder<> beam_search(
+      num_classes, beam_width, &beam_scorer, 1 /* batch_size */,
+      merge_repeated);
+
+  // Allocate temporary memory for holding chip operation data.
+  float* input_chip_t_data =
+      static_cast<float*>(malloc(num_classes * sizeof(float)));
+  Eigen::array<Eigen::DenseIndex, 1> dims;
+  dims[0] = num_classes;
+  optimized_ops::TTypes<float>::Flat input_chip_t(input_chip_t_data, dims);
+
+  std::vector<std::vector<std::vector<int>>> best_paths(batch_size);
+  std::vector<float> log_probs;
+
+  TfLiteTensor* log_probabilities = GetOutput(context, node, 3 * top_paths);
+  float* log_probabilities_output = GetTensorData<float>(log_probabilities);
+
+  // Assumption: the blank index is num_classes - 1
+  for (int b = 0; b < batch_size; ++b) {
+    auto& best_paths_b = best_paths[b];
+    best_paths_b.resize(top_paths);
+    for (int t = 0; t < GetTensorData<int32_t>(sequence_length)[b]; ++t) {
+      input_chip_t = input_list_t[t].chip(b, 0);
+      auto input_bi =
+          Eigen::Map<const Eigen::ArrayXf>(input_chip_t.data(), num_classes);
+      beam_search.Step(input_bi);
+    }
+    TF_LITE_ENSURE(context, beam_search.TopPaths(top_paths, &best_paths_b,
+                                                 &log_probs, merge_repeated));
+    beam_search.Reset();
+
+    // Fill in log_probabilities output.
+    for (int bp = 0; bp < top_paths; ++bp) {
+      log_probabilities_output[b * top_paths + bp] = log_probs[bp];
+    }
+  }
+
+  free(input_chip_t_data);
+  return StoreAllDecodedSequences(context, best_paths, node, top_paths);
+}
+
+}  // namespace ctc_beam_search_decoder
+
+TfLiteRegistration* Register_CTC_BEAM_SEARCH_DECODER() {
+  static TfLiteRegistration r = {
+      ctc_beam_search_decoder::Init, ctc_beam_search_decoder::Free,
+      ctc_beam_search_decoder::Prepare, ctc_beam_search_decoder::Eval};
+  return &r;
+}
+
+}  // namespace experimental
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search_decoder_test.cc b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search_decoder_test.cc
new file mode 100644
index 0000000000..9d1e6a562f
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/ctc_beam_search_decoder_test.cc
@@ -0,0 +1,238 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <functional>
+#include <memory>
+#include <vector>
+
+#include <gtest/gtest.h>
+#include "flatbuffers/flexbuffers.h"
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace ops {
+namespace experimental {
+
+using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+
+TfLiteRegistration* Register_CTC_BEAM_SEARCH_DECODER();
+
+namespace {
+
+using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+
+class CTCBeamSearchDecoderOpModel : public SingleOpModel {
+ public:
+  CTCBeamSearchDecoderOpModel(std::initializer_list<int> input_shape,
+                              std::initializer_list<int> sequence_length_shape,
+                              int beam_width, int top_paths,
+                              bool merge_repeated) {
+    inputs_ = AddInput(TensorType_FLOAT32);
+    sequence_length_ = AddInput(TensorType_INT32);
+
+    for (int i = 0; i < top_paths * 3; ++i) {
+      outputs_.push_back(AddOutput(TensorType_INT32));
+    }
+    outputs_.push_back(AddOutput(TensorType_FLOAT32));
+
+    flexbuffers::Builder fbb;
+    fbb.Map([&]() {
+      fbb.Int("beam_width", beam_width);
+      fbb.Int("top_paths", top_paths);
+      fbb.Bool("merge_repeated", merge_repeated);
+    });
+    fbb.Finish();
+    SetCustomOp("CTCBeamSearchDecoder", fbb.GetBuffer(),
+                Register_CTC_BEAM_SEARCH_DECODER);
+    BuildInterpreter({input_shape, sequence_length_shape});
+  }
+
+  int inputs() { return inputs_; }
+
+  int sequence_length() { return sequence_length_; }
+
+  std::vector<std::vector<int>> GetDecodedOutpus() {
+    std::vector<std::vector<int>> outputs;
+    for (int i = 0; i < outputs_.size() - 1; ++i) {
+      outputs.push_back(ExtractVector<int>(outputs_[i]));
+    }
+    return outputs;
+  }
+
+  std::vector<float> GetLogProbabilitiesOutput() {
+    return ExtractVector<float>(outputs_[outputs_.size() - 1]);
+  }
+
+  std::vector<std::vector<int>> GetOutputShapes() {
+    std::vector<std::vector<int>> output_shapes;
+    for (const int output : outputs_) {
+      output_shapes.push_back(GetTensorShape(output));
+    }
+    return output_shapes;
+  }
+
+ private:
+  int inputs_;
+  int sequence_length_;
+  std::vector<int> outputs_;
+};
+
+TEST(CTCBeamSearchTest, SimpleTest) {
+  CTCBeamSearchDecoderOpModel m({2, 1, 2}, {1}, 1, 1, true);
+  m.PopulateTensor<float>(m.inputs(),
+                          {-0.50922557, -1.35512652, -2.55445064, -1.58419356});
+  m.PopulateTensor<int>(m.sequence_length(), {2});
+  m.Invoke();
+
+  // Make sure the output shapes are right.
+  const std::vector<std::vector<int>>& output_shapes = m.GetOutputShapes();
+  EXPECT_EQ(output_shapes.size(), 4);
+  EXPECT_THAT(output_shapes[0], ElementsAre(1, 2));
+  EXPECT_THAT(output_shapes[1], ElementsAre(1));
+  EXPECT_THAT(output_shapes[2], ElementsAre(2));
+  EXPECT_THAT(output_shapes[3], ElementsAre(1, 1));
+
+  // Check decoded outputs.
+  const std::vector<std::vector<int>>& decoded_outputs = m.GetDecodedOutpus();
+  EXPECT_EQ(decoded_outputs.size(), 3);
+  EXPECT_THAT(decoded_outputs[0], ElementsAre(0, 0));
+  EXPECT_THAT(decoded_outputs[1], ElementsAre(0));
+  EXPECT_THAT(decoded_outputs[2], ElementsAre(1, 1));
+  // Check log probabilities output.
+  EXPECT_THAT(m.GetLogProbabilitiesOutput(),
+              ElementsAreArray(ArrayFloatNear({0.32134813})));
+}
+
+TEST(CTCBeamSearchTest, MultiBatchTest) {
+  CTCBeamSearchDecoderOpModel m({3, 3, 3}, {3}, 1, 1, true);
+  m.PopulateTensor<float>(
+      m.inputs(),
+      {-0.63649208, -0.00487571, -0.04249819, -0.67754697, -1.0341399,
+       -2.14717721, -0.77686821, -3.41973774, -0.05151402, -0.21482619,
+       -0.57411168, -1.45039917, -0.73769373, -2.10941739, -0.44818325,
+       -0.25287673, -2.80057302, -0.54748312, -0.73334867, -0.86537719,
+       -0.2065197,  -0.18725838, -1.42770405, -0.86051965, -1.61642301,
+       -2.07275114, -0.9201845});
+  m.PopulateTensor<int>(m.sequence_length(), {3, 3, 3});
+  m.Invoke();
+
+  // Make sure the output shapes are right.
+  const std::vector<std::vector<int>>& output_shapes = m.GetOutputShapes();
+  EXPECT_EQ(output_shapes.size(), 4);
+  EXPECT_THAT(output_shapes[0], ElementsAre(4, 2));
+  EXPECT_THAT(output_shapes[1], ElementsAre(4));
+  EXPECT_THAT(output_shapes[2], ElementsAre(2));
+  EXPECT_THAT(output_shapes[3], ElementsAre(3, 1));
+
+  // Check decoded outputs.
+  const std::vector<std::vector<int>>& decoded_outputs = m.GetDecodedOutpus();
+  EXPECT_EQ(decoded_outputs.size(), 3);
+  EXPECT_THAT(decoded_outputs[0], ElementsAre(0, 0, 0, 1, 1, 0, 2, 0));
+  EXPECT_THAT(decoded_outputs[1], ElementsAre(1, 0, 0, 0));
+  EXPECT_THAT(decoded_outputs[2], ElementsAre(3, 2));
+  // Check log probabilities output.
+  EXPECT_THAT(
+      m.GetLogProbabilitiesOutput(),
+      ElementsAreArray(ArrayFloatNear({0.46403232, 0.49500442, 0.40443572})));
+}
+
+TEST(CTCBeamSearchTest, MultiPathsTest) {
+  CTCBeamSearchDecoderOpModel m({3, 2, 5}, {2}, 3, 2, true);
+  m.PopulateTensor<float>(
+      m.inputs(),
+      {-2.206851,   -0.09542714, -0.2393415,  -3.81866197, -0.27241158,
+       -0.20371124, -0.68236623, -1.1397166,  -0.17422639, -1.85224048,
+       -0.9406037,  -0.32544678, -0.21846784, -0.38377237, -0.33498676,
+       -0.10139782, -0.51886883, -0.21678554, -0.15267063, -1.91164412,
+       -0.31328673, -0.27462716, -0.65975336, -1.53671973, -2.76554225,
+       -0.23920634, -1.2370502,  -4.98751576, -3.12995717, -0.43129368});
+  m.PopulateTensor<int>(m.sequence_length(), {3, 3});
+  m.Invoke();
+
+  // Make sure the output shapes are right.
+  const std::vector<std::vector<int>>& output_shapes = m.GetOutputShapes();
+  EXPECT_EQ(output_shapes.size(), 7);
+  EXPECT_THAT(output_shapes[0], ElementsAre(4, 2));
+  EXPECT_THAT(output_shapes[1], ElementsAre(3, 2));
+  EXPECT_THAT(output_shapes[2], ElementsAre(4));
+  EXPECT_THAT(output_shapes[3], ElementsAre(3));
+  EXPECT_THAT(output_shapes[4], ElementsAre(2));
+  EXPECT_THAT(output_shapes[5], ElementsAre(2));
+  EXPECT_THAT(output_shapes[6], ElementsAre(2, 2));
+
+  // Check decoded outputs.
+  const std::vector<std::vector<int>>& decoded_outputs = m.GetDecodedOutpus();
+  EXPECT_EQ(decoded_outputs.size(), 6);
+  EXPECT_THAT(decoded_outputs[0], ElementsAre(0, 0, 0, 1, 1, 0, 1, 1));
+  EXPECT_THAT(decoded_outputs[1], ElementsAre(0, 0, 0, 1, 1, 0));
+  EXPECT_THAT(decoded_outputs[2], ElementsAre(1, 2, 3, 0));
+  EXPECT_THAT(decoded_outputs[3], ElementsAre(2, 1, 0));
+  EXPECT_THAT(decoded_outputs[4], ElementsAre(2, 2));
+  EXPECT_THAT(decoded_outputs[5], ElementsAre(2, 2));
+  // Check log probabilities output.
+  EXPECT_THAT(m.GetLogProbabilitiesOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {0.91318405, 0.9060272, 1.0780245, 0.64358956})));
+}
+
+TEST(CTCBeamSearchTest, NonEqualSequencesTest) {
+  CTCBeamSearchDecoderOpModel m({3, 3, 4}, {3}, 3, 1, true);
+  m.PopulateTensor<float>(
+      m.inputs(),
+      {-1.26658163, -0.25760023, -0.03917975, -0.63772235, -0.03794756,
+       -0.45063099, -0.27706473, -0.01569179, -0.59940385, -0.35700127,
+       -0.48920721, -1.42635476, -1.3462478,  -0.02565498, -0.30179568,
+       -0.6491698,  -0.55017719, -2.92291466, -0.92522973, -0.47592022,
+       -0.07099135, -0.31575624, -0.86345281, -0.36017021, -0.79208612,
+       -1.75306124, -0.65089224, -0.00912786, -0.42915003, -1.72606203,
+       -1.66337589, -0.70800793, -2.52272352, -0.67329562, -2.49145522,
+       -0.49786342});
+  m.PopulateTensor<int>(m.sequence_length(), {1, 2, 3});
+  m.Invoke();
+
+  // Make sure the output shapes are right.
+  const std::vector<std::vector<int>>& output_shapes = m.GetOutputShapes();
+  EXPECT_EQ(output_shapes.size(), 4);
+  EXPECT_THAT(output_shapes[0], ElementsAre(3, 2));
+  EXPECT_THAT(output_shapes[1], ElementsAre(3));
+  EXPECT_THAT(output_shapes[2], ElementsAre(2));
+  EXPECT_THAT(output_shapes[3], ElementsAre(3, 1));
+
+  // Check decoded outputs.
+  const std::vector<std::vector<int>>& decoded_outputs = m.GetDecodedOutpus();
+  EXPECT_EQ(decoded_outputs.size(), 3);
+  EXPECT_THAT(decoded_outputs[0], ElementsAre(0, 0, 1, 0, 2, 0));
+  EXPECT_THAT(decoded_outputs[1], ElementsAre(2, 0, 1));
+  EXPECT_THAT(decoded_outputs[2], ElementsAre(3, 1));
+  // Check log probabilities output.
+  EXPECT_THAT(m.GetLogProbabilitiesOutput(),
+              ElementsAreArray(ArrayFloatNear({0., 1.0347567, 0.7833005})));
+}
+
+}  // namespace
+}  // namespace experimental
+}  // namespace ops
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/experimental/kernels/ctc_decoder.h b/tensorflow/contrib/lite/experimental/kernels/ctc_decoder.h
new file mode 100644
index 0000000000..596ad4a5f7
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/ctc_decoder.h
@@ -0,0 +1,114 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Copied from tensorflow/core/util/ctc/ctc_decoder.h
+// TODO(b/111524997): Remove this file.
+#ifndef TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_DECODER_H_
+#define TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_DECODER_H_
+
+#include <memory>
+#include <vector>
+
+#include "third_party/eigen3/Eigen/Core"
+
+namespace tflite {
+namespace experimental {
+namespace ctc {
+
+// The CTCDecoder is an abstract interface to be implemented when providing a
+// decoding method on the timestep output of a RNN trained with CTC loss.
+//
+// The two types of decoding available are:
+//   - greedy path, through the CTCGreedyDecoder
+//   - beam search, through the CTCBeamSearchDecoder
+class CTCDecoder {
+ public:
+  typedef Eigen::Map<const Eigen::ArrayXi> SequenceLength;
+  typedef Eigen::Map<const Eigen::MatrixXf> Input;
+  typedef std::vector<std::vector<int>> Output;
+  typedef Eigen::Map<Eigen::MatrixXf> ScoreOutput;
+
+  CTCDecoder(int num_classes, int batch_size, bool merge_repeated)
+      : num_classes_(num_classes),
+        blank_index_(num_classes - 1),
+        batch_size_(batch_size),
+        merge_repeated_(merge_repeated) {}
+
+  virtual ~CTCDecoder() {}
+
+  // Dimensionality of the input/output is expected to be:
+  //  - seq_len[b] - b = 0 to batch_size_
+  //  - input[t].rows(b) - t = 0 to timesteps; b = 0 t batch_size_
+  //  - output.size() specifies the number of beams to be returned.
+  //  - scores(b, i) - b = 0 to batch_size; i = 0 to output.size()
+  virtual bool Decode(const SequenceLength& seq_len,
+                      const std::vector<Input>& input,
+                      std::vector<Output>* output, ScoreOutput* scores) = 0;
+
+  int batch_size() { return batch_size_; }
+  int num_classes() { return num_classes_; }
+
+ protected:
+  int num_classes_;
+  int blank_index_;
+  int batch_size_;
+  bool merge_repeated_;
+};
+
+// CTCGreedyDecoder is an implementation of the simple best path decoding
+// algorithm, selecting at each timestep the most likely class at each timestep.
+class CTCGreedyDecoder : public CTCDecoder {
+ public:
+  CTCGreedyDecoder(int num_classes, int batch_size, bool merge_repeated)
+      : CTCDecoder(num_classes, batch_size, merge_repeated) {}
+
+  bool Decode(const CTCDecoder::SequenceLength& seq_len,
+              const std::vector<CTCDecoder::Input>& input,
+              std::vector<CTCDecoder::Output>* output,
+              CTCDecoder::ScoreOutput* scores) override {
+    if (output->empty() || (*output)[0].size() < batch_size_) {
+      return false;
+    }
+    if (scores->rows() < batch_size_ || scores->cols() == 0) {
+      return false;
+    }
+    // For each batch entry, identify the transitions
+    for (int b = 0; b < batch_size_; ++b) {
+      int seq_len_b = seq_len[b];
+      // Only writing to beam 0
+      std::vector<int>& output_b = (*output)[0][b];
+
+      int prev_class_ix = -1;
+      (*scores)(b, 0) = 0;
+      for (int t = 0; t < seq_len_b; ++t) {
+        auto row = input[t].row(b);
+        int max_class_ix;
+        (*scores)(b, 0) += -row.maxCoeff(&max_class_ix);
+        if (max_class_ix != blank_index_ &&
+            !(merge_repeated_ && max_class_ix == prev_class_ix)) {
+          output_b.push_back(max_class_ix);
+        }
+        prev_class_ix = max_class_ix;
+      }
+    }
+    return true;
+  }
+};
+
+}  // namespace ctc
+}  // namespace experimental
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_DECODER_H_
diff --git a/tensorflow/contrib/lite/experimental/kernels/ctc_loss_util.h b/tensorflow/contrib/lite/experimental/kernels/ctc_loss_util.h
new file mode 100644
index 0000000000..0bae732533
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/ctc_loss_util.h
@@ -0,0 +1,50 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Copied from tensorflow/core/util/ctc/ctc_loss_util.h
+// TODO(b/111524997): Remove this file.
+#ifndef TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_LOSS_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_LOSS_UTIL_H_
+
+#include <cmath>
+#include <limits>
+
+namespace tflite {
+namespace experimental {
+namespace ctc {
+
+const float kLogZero = -std::numeric_limits<float>::infinity();
+
+// Add logarithmic probabilities using:
+// ln(a + b) = ln(a) + ln(1 + exp(ln(b) - ln(a)))
+// The two inputs are assumed to be log probabilities.
+// (GravesTh) Eq. 7.18
+inline float LogSumExp(float log_prob_1, float log_prob_2) {
+  // Always have 'b' be the smaller number to avoid the exponential from
+  // blowing up.
+  if (log_prob_1 == kLogZero && log_prob_2 == kLogZero) {
+    return kLogZero;
+  } else {
+    return (log_prob_1 > log_prob_2)
+               ? log_prob_1 + log1pf(expf(log_prob_2 - log_prob_1))
+               : log_prob_2 + log1pf(expf(log_prob_1 - log_prob_2));
+  }
+}
+
+}  // namespace ctc
+}  // namespace experimental
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_CTC_LOSS_UTIL_H_
diff --git a/tensorflow/contrib/lite/experimental/kernels/top_n.h b/tensorflow/contrib/lite/experimental/kernels/top_n.h
new file mode 100644
index 0000000000..cd2a2f1c80
--- /dev/null
+++ b/tensorflow/contrib/lite/experimental/kernels/top_n.h
@@ -0,0 +1,341 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// This simple class finds the top n elements of an incrementally provided set
+// of elements which you push one at a time.  If the number of elements exceeds
+// n, the lowest elements are incrementally dropped.  At the end you get
+// a vector of the top elements sorted in descending order (through Extract() or
+// ExtractNondestructive()), or a vector of the top elements but not sorted
+// (through ExtractUnsorted() or ExtractUnsortedNondestructive()).
+//
+// The value n is specified in the constructor.  If there are p elements pushed
+// altogether:
+//   The total storage requirements are O(min(n, p)) elements
+//   The running time is O(p * log(min(n, p))) comparisons
+// If n is a constant, the total storage required is a constant and the running
+// time is linear in p.
+//
+// NOTE(zhifengc): There is a way to do this in O(min(n, p)) storage and O(p)
+// runtime. The basic idea is to repeatedly fill up a buffer of 2 * n elements,
+// discarding the lowest n elements whenever the buffer is full using a linear-
+// time median algorithm. This may have better performance when the input
+// sequence is partially sorted.
+//
+// NOTE(zhifengc): This class should be redesigned to avoid reallocating a
+// vector for each Extract.
+
+// Copied from tensorflow/core/lib/gtl/top_n.h
+// TODO(b/111524997): Remove this file.
+#ifndef TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_TOP_N_H_
+#define TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_TOP_N_H_
+
+#include <stddef.h>
+#include <algorithm>
+#include <functional>
+#include <string>
+#include <vector>
+
+#include "tensorflow/contrib/lite/kernels/internal/compatibility.h"
+
+namespace tflite {
+namespace gtl {
+
+// Cmp is an stl binary predicate.  Note that Cmp is the "greater" predicate,
+// not the more commonly used "less" predicate.
+//
+// If you use a "less" predicate here, the TopN will pick out the bottom N
+// elements out of the ones passed to it, and it will return them sorted in
+// ascending order.
+//
+// TopN is rule-of-zero copyable and movable if its members are.
+template <class T, class Cmp = std::greater<T> >
+class TopN {
+ public:
+  // The TopN is in one of the three states:
+  //
+  //  o UNORDERED: this is the state an instance is originally in,
+  //    where the elements are completely orderless.
+  //
+  //  o BOTTOM_KNOWN: in this state, we keep the invariant that there
+  //    is at least one element in it, and the lowest element is at
+  //    position 0. The elements in other positions remain
+  //    unsorted. This state is reached if the state was originally
+  //    UNORDERED and a peek_bottom() function call is invoked.
+  //
+  //  o HEAP_SORTED: in this state, the array is kept as a heap and
+  //    there are exactly (limit_+1) elements in the array. This
+  //    state is reached when at least (limit_+1) elements are
+  //    pushed in.
+  //
+  //  The state transition graph is at follows:
+  //
+  //             peek_bottom()                (limit_+1) elements
+  //  UNORDERED --------------> BOTTOM_KNOWN --------------------> HEAP_SORTED
+  //      |                                                           ^
+  //      |                      (limit_+1) elements                  |
+  //      +-----------------------------------------------------------+
+
+  enum State { UNORDERED, BOTTOM_KNOWN, HEAP_SORTED };
+  using UnsortedIterator = typename std::vector<T>::const_iterator;
+
+  // 'limit' is the maximum number of top results to return.
+  explicit TopN(size_t limit) : TopN(limit, Cmp()) {}
+  TopN(size_t limit, const Cmp &cmp) : limit_(limit), cmp_(cmp) {}
+
+  size_t limit() const { return limit_; }
+
+  // Number of elements currently held by this TopN object.  This
+  // will be no greater than 'limit' passed to the constructor.
+  size_t size() const { return std::min(elements_.size(), limit_); }
+
+  bool empty() const { return size() == 0; }
+
+  // If you know how many elements you will push at the time you create the
+  // TopN object, you can call reserve to preallocate the memory that TopN
+  // will need to process all 'n' pushes.  Calling this method is optional.
+  void reserve(size_t n) { elements_.reserve(std::min(n, limit_ + 1)); }
+
+  // Push 'v'.  If the maximum number of elements was exceeded, drop the
+  // lowest element and return it in 'dropped' (if given). If the maximum is not
+  // exceeded, 'dropped' will remain unchanged. 'dropped' may be omitted or
+  // nullptr, in which case it is not filled in.
+  // Requires: T is CopyAssignable, Swappable
+  void push(const T &v) { push(v, nullptr); }
+  void push(const T &v, T *dropped) { PushInternal(v, dropped); }
+
+  // Move overloads of push.
+  // Requires: T is MoveAssignable, Swappable
+  void push(T &&v) {  // NOLINT(build/c++11)
+    push(std::move(v), nullptr);
+  }
+  void push(T &&v, T *dropped) {  // NOLINT(build/c++11)
+    PushInternal(std::move(v), dropped);
+  }
+
+  // Peeks the bottom result without calling Extract()
+  const T &peek_bottom();
+
+  // Extract the elements as a vector sorted in descending order.  The caller
+  // assumes ownership of the vector and must delete it when done.  This is a
+  // destructive operation.  The only method that can be called immediately
+  // after Extract() is Reset().
+  std::vector<T> *Extract();
+
+  // Similar to Extract(), but makes no guarantees the elements are in sorted
+  // order.  As with Extract(), the caller assumes ownership of the vector and
+  // must delete it when done.  This is a destructive operation.  The only
+  // method that can be called immediately after ExtractUnsorted() is Reset().
+  std::vector<T> *ExtractUnsorted();
+
+  // A non-destructive version of Extract(). Copy the elements in a new vector
+  // sorted in descending order and return it.  The caller assumes ownership of
+  // the new vector and must delete it when done.  After calling
+  // ExtractNondestructive(), the caller can continue to push() new elements.
+  std::vector<T> *ExtractNondestructive() const;
+
+  // A non-destructive version of Extract(). Copy the elements to a given
+  // vector sorted in descending order. After calling
+  // ExtractNondestructive(), the caller can continue to push() new elements.
+  // Note:
+  //  1. The given argument must to be allocated.
+  //  2. Any data contained in the vector prior to the call will be deleted
+  //     from it. After the call the vector will contain only the elements
+  //     from the data structure.
+  void ExtractNondestructive(std::vector<T> *output) const;
+
+  // A non-destructive version of ExtractUnsorted(). Copy the elements in a new
+  // vector and return it, with no guarantees the elements are in sorted order.
+  // The caller assumes ownership of the new vector and must delete it when
+  // done.  After calling ExtractUnsortedNondestructive(), the caller can
+  // continue to push() new elements.
+  std::vector<T> *ExtractUnsortedNondestructive() const;
+
+  // A non-destructive version of ExtractUnsorted(). Copy the elements into
+  // a given vector, with no guarantees the elements are in sorted order.
+  // After calling ExtractUnsortedNondestructive(), the caller can continue
+  // to push() new elements.
+  // Note:
+  //  1. The given argument must to be allocated.
+  //  2. Any data contained in the vector prior to the call will be deleted
+  //     from it. After the call the vector will contain only the elements
+  //     from the data structure.
+  void ExtractUnsortedNondestructive(std::vector<T> *output) const;
+
+  // Return an iterator to the beginning (end) of the container,
+  // with no guarantees about the order of iteration. These iterators are
+  // invalidated by mutation of the data structure.
+  UnsortedIterator unsorted_begin() const { return elements_.begin(); }
+  UnsortedIterator unsorted_end() const { return elements_.begin() + size(); }
+
+  // Accessor for comparator template argument.
+  Cmp *comparator() { return &cmp_; }
+
+  // This removes all elements.  If Extract() or ExtractUnsorted() have been
+  // called, this will put it back in an empty but useable state.
+  void Reset();
+
+ private:
+  template <typename U>
+  void PushInternal(U &&v, T *dropped);  // NOLINT(build/c++11)
+
+  // elements_ can be in one of two states:
+  //   elements_.size() <= limit_:  elements_ is an unsorted vector of elements
+  //      pushed so far.
+  //   elements_.size() > limit_:  The last element of elements_ is unused;
+  //      the other elements of elements_ are an stl heap whose size is exactly
+  //      limit_.  In this case elements_.size() is exactly one greater than
+  //      limit_, but don't use "elements_.size() == limit_ + 1" to check for
+  //      that because you'll get a false positive if limit_ == size_t(-1).
+  std::vector<T> elements_;
+  size_t limit_;  // Maximum number of elements to find
+  Cmp cmp_;       // Greater-than comparison function
+  State state_ = UNORDERED;
+};
+
+// ----------------------------------------------------------------------
+// Implementations of non-inline functions
+
+template <class T, class Cmp>
+template <typename U>
+void TopN<T, Cmp>::PushInternal(U &&v, T *dropped) {  // NOLINT(build/c++11)
+  if (limit_ == 0) {
+    if (dropped) *dropped = std::forward<U>(v);  // NOLINT(build/c++11)
+    return;
+  }
+  if (state_ != HEAP_SORTED) {
+    elements_.push_back(std::forward<U>(v));  // NOLINT(build/c++11)
+    if (state_ == UNORDERED || cmp_(elements_.back(), elements_.front())) {
+      // Easy case: we just pushed the new element back
+    } else {
+      // To maintain the BOTTOM_KNOWN state, we need to make sure that
+      // the element at position 0 is always the smallest. So we put
+      // the new element at position 0 and push the original bottom
+      // element in the back.
+      // Warning: this code is subtle.
+      using std::swap;
+      swap(elements_.front(), elements_.back());
+    }
+    if (elements_.size() == limit_ + 1) {
+      // Transition from unsorted vector to a heap.
+      std::make_heap(elements_.begin(), elements_.end(), cmp_);
+      if (dropped) *dropped = std::move(elements_.front());
+      std::pop_heap(elements_.begin(), elements_.end(), cmp_);
+      state_ = HEAP_SORTED;
+    }
+  } else {
+    // Only insert the new element if it is greater than the least element.
+    if (cmp_(v, elements_.front())) {
+      elements_.back() = std::forward<U>(v);  // NOLINT(build/c++11)
+      std::push_heap(elements_.begin(), elements_.end(), cmp_);
+      if (dropped) *dropped = std::move(elements_.front());
+      std::pop_heap(elements_.begin(), elements_.end(), cmp_);
+    } else {
+      if (dropped) *dropped = std::forward<U>(v);  // NOLINT(build/c++11)
+    }
+  }
+}
+
+template <class T, class Cmp>
+const T &TopN<T, Cmp>::peek_bottom() {
+  TFLITE_DCHECK(!empty());
+  if (state_ == UNORDERED) {
+    // We need to do a linear scan to find out the bottom element
+    int min_candidate = 0;
+    for (size_t i = 1; i < elements_.size(); ++i) {
+      if (cmp_(elements_[min_candidate], elements_[i])) {
+        min_candidate = i;
+      }
+    }
+    // By swapping the element at position 0 and the minimal
+    // element, we transition to the BOTTOM_KNOWN state
+    if (min_candidate != 0) {
+      using std::swap;
+      swap(elements_[0], elements_[min_candidate]);
+    }
+    state_ = BOTTOM_KNOWN;
+  }
+  return elements_.front();
+}
+
+template <class T, class Cmp>
+std::vector<T> *TopN<T, Cmp>::Extract() {
+  auto out = new std::vector<T>;
+  out->swap(elements_);
+  if (state_ != HEAP_SORTED) {
+    std::sort(out->begin(), out->end(), cmp_);
+  } else {
+    out->pop_back();
+    std::sort_heap(out->begin(), out->end(), cmp_);
+  }
+  return out;
+}
+
+template <class T, class Cmp>
+std::vector<T> *TopN<T, Cmp>::ExtractUnsorted() {
+  auto out = new std::vector<T>;
+  out->swap(elements_);
+  if (state_ == HEAP_SORTED) {
+    // Remove the limit_+1'th element.
+    out->pop_back();
+  }
+  return out;
+}
+
+template <class T, class Cmp>
+std::vector<T> *TopN<T, Cmp>::ExtractNondestructive() const {
+  auto out = new std::vector<T>;
+  ExtractNondestructive(out);
+  return out;
+}
+
+template <class T, class Cmp>
+void TopN<T, Cmp>::ExtractNondestructive(std::vector<T> *output) const {
+  TFLITE_DCHECK(output);
+  *output = elements_;
+  if (state_ != HEAP_SORTED) {
+    std::sort(output->begin(), output->end(), cmp_);
+  } else {
+    output->pop_back();
+    std::sort_heap(output->begin(), output->end(), cmp_);
+  }
+}
+
+template <class T, class Cmp>
+std::vector<T> *TopN<T, Cmp>::ExtractUnsortedNondestructive() const {
+  auto elements = new std::vector<T>;
+  ExtractUnsortedNondestructive(elements);
+  return elements;
+}
+
+template <class T, class Cmp>
+void TopN<T, Cmp>::ExtractUnsortedNondestructive(std::vector<T> *output) const {
+  TFLITE_DCHECK(output);
+  *output = elements_;
+  if (state_ == HEAP_SORTED) {
+    // Remove the limit_+1'th element.
+    output->pop_back();
+  }
+}
+
+template <class T, class Cmp>
+void TopN<T, Cmp>::Reset() {
+  elements_.clear();
+  state_ = UNORDERED;
+}
+
+}  // namespace gtl
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_EXPERIMENTAL_KERNELS_TOP_N_H_
diff --git a/tensorflow/contrib/lite/g3doc/README.md b/tensorflow/contrib/lite/g3doc/README.md
new file mode 100644
index 0000000000..e3db478481
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/README.md
@@ -0,0 +1,4 @@
+This is a *work-in-progress* TF Lite subsite for:
+https://www.tensorflow.org/mobile
+
+DO NOT PUBLISH
diff --git a/tensorflow/contrib/lite/g3doc/_book.yaml b/tensorflow/contrib/lite/g3doc/_book.yaml
new file mode 100644
index 0000000000..98abd5743b
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/_book.yaml
@@ -0,0 +1,58 @@
+upper_tabs:
+# Tabs left of dropdown menu
+- include: /_upper_tabs_left.yaml
+# Dropdown menu
+- name: Ecosystem
+  path: /ecosystem
+  is_default: True
+  menu:
+  - include: /ecosystem/_menu_toc.yaml
+  lower_tabs:
+    # Subsite tabs
+    other:
+    - name: Guide
+      contents:
+      - title: Overview
+        path: /mobile/overview
+      - title: Developer Guide
+        path: /mobile/devguide
+      - title: Android Demo App
+        path: /mobile/demo_android
+      - title: iOS Demo App
+        path: /mobile/demo_ios
+      - title: Performance
+        path: /mobile/performance
+      - break: True
+      - title: TensorFlow Lite APIs
+        path: /mobile/apis
+      - title: Custom operators
+        path: /mobile/custom_operators
+      - title: TensorFlow Lite Ops Versioning
+        path: /mobile/ops_versioning
+      - title: TensorFlow Lite Compatibility Guide
+        path: /mobile/tf_ops_compatibility
+      - title: List of Hosted Models
+        path: /mobile/models
+      - title: TensorFlow Lite for iOS
+        path: /mobile/ios
+      - title: TensorFlow Lite for Raspberry Pi
+        path: /mobile/rpi
+
+      - heading: TF Mobile
+        status: deprecated
+      - title: Overview
+        path: /mobile/tfmobile/
+      - title: Building TensorFlow on Android
+        path: /mobile/tfmobile/android_build
+      - title: Building TensorFlow on IOS
+        path: /mobile/tfmobile/ios_build
+      - title: Integrating TensorFlow libraries
+        path: /mobile/tfmobile/linking_libs
+      - title: Preparing models for mobile deployment
+        path: /mobile/tfmobile/prepare_models
+      - title: Optimizing for mobile
+        path: /mobile/tfmobile/optimizing
+
+    - name: API
+      contents:
+      - include: /mobile/api_docs/python/_toc.yaml
diff --git a/tensorflow/contrib/lite/g3doc/_index.yaml b/tensorflow/contrib/lite/g3doc/_index.yaml
new file mode 100644
index 0000000000..9119e49117
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/_index.yaml
@@ -0,0 +1,67 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+description: <!--no description-->
+landing_page:
+  rows:
+  - heading: TensorFlow Lite is a lightweight solution for mobile and embedded devices.
+    items:
+    - description: >
+        TensorFlow Lite is TensorFlow’s lightweight solution for mobile and
+        embedded devices. It enables on-device machine learning inference with
+        low latency and a small binary size. TensorFlow Lite also supports
+        hardware acceleration with the
+        <a href='https://developer.android.com/ndk/guides/neuralnetworks/index.html'>Android Neural Networks API</a>.
+      list:
+      - heading: Key point 1
+        description: >
+          [high-level overview]
+        icon:
+          icon_name: chevron_right
+          foreground: theme
+          background: grey
+      - heading: Key point 2
+        description: >
+          [high-level overview]
+        icon:
+          icon_name: chevron_right
+          foreground: theme
+          background: grey
+      - heading: Key point 3
+        description: >
+          [high-level overview]
+        icon:
+          icon_name: chevron_right
+          foreground: theme
+          background: grey
+    - code_block: |
+        <pre class = "prettyprint">
+        $ toco --input_file=$(pwd)/mobilenet_v1_1.0_224/frozen_graph.pb \
+               --input_format=TENSORFLOW_GRAPHDEF \
+               --output_format=TFLITE \
+               --output_file=/tmp/mobilenet_v1_1.0_224.tflite \
+               --inference_type=FLOAT \
+               --input_type=FLOAT \
+               --input_arrays=input \
+               --output_arrays=MobilenetV1/Predictions/Reshape_1 \
+               --input_shapes=1,224,224,3
+        </pre>
+
+  - classname: devsite-landing-row-cards
+    items:
+    - heading: Using TensorFlow Lite on Android
+      image_path: /ecosystem/images/tf-logo-card-16x9.png
+      path: https://medium.com/tensorflow/using-tensorflow-lite-on-android-9bbc9cb7d69d
+      buttons:
+      - label: Read on TensorFlow blog
+        path: https://medium.com/tensorflow/using-tensorflow-lite-on-android-9bbc9cb7d69d
+    - heading: TensorFlow Lite at the Dev Summit
+      youtube_id: FAMfy7izB6A
+      buttons:
+      - label: Watch the video
+        path: https://www.youtube.com/watch?v=FAMfy7izB6A
+    - heading: TensorFlow Lite on GitHub
+      image_path: /ecosystem/images/github-card-16x9.png
+      path: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite
+      buttons:
+      - label: View on GitHub
+        path: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite
diff --git a/tensorflow/contrib/lite/g3doc/_project.yaml b/tensorflow/contrib/lite/g3doc/_project.yaml
new file mode 100644
index 0000000000..b39666516b
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/_project.yaml
@@ -0,0 +1,10 @@
+name: TensorFlow Lite
+breadcrumb_name: Mobile
+home_url: /mobile/
+parent_project_metadata_path: /_project.yaml
+description: >
+  TensorFlow Lite is a lightweight solution for mobile and embedded devices.
+use_site_branding: True
+hide_from_products_list: True
+content_license: cc3-apache2
+buganizer_id: 316308
diff --git a/tensorflow/contrib/lite/g3doc/api_docs/python/_toc.yaml b/tensorflow/contrib/lite/g3doc/api_docs/python/_toc.yaml
new file mode 100644
index 0000000000..1e1c44c692
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/api_docs/python/_toc.yaml
@@ -0,0 +1,6 @@
+# Automatically generated file; please do not edit
+toc:
+  - title: TensorFlow Lite
+    section:
+    - title: Overview
+      path: /mobile/api_docs/python/
diff --git a/tensorflow/contrib/lite/g3doc/api_docs/python/index.md b/tensorflow/contrib/lite/g3doc/api_docs/python/index.md
new file mode 100644
index 0000000000..70031a3c3d
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/api_docs/python/index.md
@@ -0,0 +1,10 @@
+Project: /mobile/_project.yaml
+Book: /mobile/_book.yaml
+page_type: reference
+<style> table img { max-width: 100%; } </style>
+<script src="/_static/js/managed/mathjax/MathJax.js?config=TeX-AMS-MML_SVG"></script>
+
+<!-- DO NOT EDIT! Automatically generated file. -->
+# All symbols in TensorFlow Lite
+
+TEMP PAGE
diff --git a/tensorflow/contrib/lite/g3doc/apis.md b/tensorflow/contrib/lite/g3doc/apis.md
index a591a353dd..776803da8c 100644
--- a/tensorflow/contrib/lite/g3doc/apis.md
+++ b/tensorflow/contrib/lite/g3doc/apis.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # TensorFlow Lite APIs
 
 TensorFlow Lite provides programming APIs in C++ and Java, and in both cases
@@ -53,6 +56,7 @@ typedef enum {
 ```
 
 Failures can be easily verified with:
+
 ```c++
 if (status != kTfLiteOk) {
   // ... error handling here ...
diff --git a/tensorflow/contrib/lite/g3doc/custom_operators.md b/tensorflow/contrib/lite/g3doc/custom_operators.md
index 972e57f73e..d979353bb3 100644
--- a/tensorflow/contrib/lite/g3doc/custom_operators.md
+++ b/tensorflow/contrib/lite/g3doc/custom_operators.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # How to use custom operators
 
 TensorFlow Lite currently supports a subset of TensorFlow operators. However, it
@@ -89,3 +92,83 @@ builtins.AddCustom("Sin", Register_SIN());
 
 Note that a similar process as above can be followed for supporting for a set of
 operations instead of a single operator.
+
+## Best Practices for writing custom operators
+
+1.  Optimize memory allocations and de-allocations cautiously. It is more
+    efficient to allocate memory in Prepare() instead of Invoke(), and allocate
+    memory before a loop instead of in every iteration. Use temporary tensors
+    data rather than mallocing yourself (see item 2). Use pointers/references
+    instead of copying as much as possible.
+
+2.  If a data structure will persist during the entire operation, we advise
+    pre-allocating the memory using temporary tensors. You may need to use
+    OpData struct to reference the tensor indices in other functions. See
+    example in the
+    [kernel for convolution](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/kernels/conv.cc).
+    A sample code snippet is below
+
+    ```
+    auto* op_data = reinterpret_cast<OpData*>(node->user_data);
+    TfLiteIntArrayFree(node->temporaries);
+    node->temporaries = TfLiteIntArrayCreate(1);
+    node->temporaries->data[0] = op_data->temp_tensor_index;
+    TfLiteTensor* temp_tensor = &context->tensors[op_data->temp_tensor_index];
+    temp_tensor->type =  kTfLiteFloat32;
+    temp_tensor->allocation_type = kTfLiteArenaRw;
+    ```
+
+3.  If it doesn't cost too much wasted memory, prefer using a static fixed size
+    array (or in Resize() pre-allocated std::vector) rather than using a
+    dynamically allocating std::vector every iteration of execution.
+
+4.  Avoid instantiating standard library container templates that don't already
+    exist, because they affect binary size. For example, if you need a std::map
+    in your operation that doesn't exist in other kernels, using a std::vector
+    with direct indexing mapping could work while keeping the binary size small.
+    See what other kernels use to gain insight (or ask).
+
+5.  Check the pointer to the memory returned by malloc. If this pointer is
+    nullptr, no operations should be performed using that pointer. If you
+    malloc() in a function and have an error exit, deallocate memory before you
+    exit.
+
+6.  Use TF_LITE_ENSURE(context, condition) to check for a specific condition.
+    Your code must not leave memory hanging when TF_LITE_ENSURE is done, i.e.,
+    these should be done before any resources are allocated that will leak.
+
+## Special TF Graph Attributes
+
+When Toco convertes a TF graph into TFLite format, it makes some assumption
+about custom operations that might be not correct. In this case, the generated
+graph can be not executable.
+
+It is possible to add aditional information about your custom op output to TF
+graph before it is converted. The following attributes are supported:
+
+-   **_output_quantized** a boolean attribute, true if the operation outputs are
+    quantized
+-   **_output_types** a list of types for output tensors
+-   **_output_shapes** a list of shapes for output tensors
+
+### Setting the Attributes
+
+This is an example how the attributes can be set:
+
+```python
+frozen_graph_def = tf.graph_util.convert_variables_to_constants(...)
+for node in frozen_graph_def.node:
+    if node.op == 'sin':
+      node.attr['_output_types'].list.type.extend([
+          types_pb2.DT_FLOAT,
+      ])
+      node.attr['_output_shapes'].list.shape.extend([
+          tf.TensorShape([10]),
+      ])
+      node.attr['_output_quantized'].b = False
+tflite_model = tf.contrib.lite.toco_convert(
+        frozen_graph_def,...)
+```
+
+**Note:** After the attributes are set, the graph can not be executed by
+Tensorflow, therefore it should be done just before the conversion.
diff --git a/tensorflow/docs_src/mobile/tflite/demo_android.md b/tensorflow/contrib/lite/g3doc/demo_android.md
index fdf0bcf3c1..d79a2696b4 100644
--- a/tensorflow/docs_src/mobile/tflite/demo_android.md
+++ b/tensorflow/contrib/lite/g3doc/demo_android.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Android Demo App
 
 An example Android application using TensorFLow Lite is available
diff --git a/tensorflow/docs_src/mobile/tflite/demo_ios.md b/tensorflow/contrib/lite/g3doc/demo_ios.md
index 3be21da89f..a554898899 100644
--- a/tensorflow/docs_src/mobile/tflite/demo_ios.md
+++ b/tensorflow/contrib/lite/g3doc/demo_ios.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # iOS Demo App
 
 The TensorFlow Lite demo is a camera app that continuously classifies whatever
diff --git a/tensorflow/docs_src/mobile/tflite/devguide.md b/tensorflow/contrib/lite/g3doc/devguide.md
index 4133bc172a..dc9cc98c08 100644
--- a/tensorflow/docs_src/mobile/tflite/devguide.md
+++ b/tensorflow/contrib/lite/g3doc/devguide.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Developer Guide
 
 Using a TensorFlow Lite model in your mobile app requires multiple
@@ -54,10 +57,11 @@ both floating point and quantized inference.
 ### Train a custom model
 
 A developer may choose to train a custom model using Tensorflow (see the
-@{$tutorials} for examples of building and training models). If you have already
-written a model, the first step is to export this to a @{tf.GraphDef} file. This
-is required because some formats do not store the model structure outside the
-code, and we must communicate with other parts of the framework. See
+[TensorFlow tutorials](../../tutorials/) for examples of building and training
+models). If you have already written a model, the first step is to export this
+to a `tf.GraphDef` file. This is required because some formats do not store the
+model structure outside the code, and we must communicate with other parts of the
+framework. See
 [Exporting the Inference Graph](https://github.com/tensorflow/models/blob/master/research/slim/README.md)
 to create .pb file for the custom model.
 
@@ -70,12 +74,12 @@ grow in future Tensorflow Lite releases.
 ## 2. Convert the model format
 
 The model generated (or downloaded) in the previous step is a *standard*
-Tensorflow model and you should now have a .pb or .pbtxt @{tf.GraphDef} file.
+Tensorflow model and you should now have a .pb or .pbtxt `tf.GraphDef` file.
 Models generated with transfer learning (re-training) or custom models must be
 converted—but, we must first freeze the graph to convert the model to the
 Tensorflow Lite format. This process uses several model formats:
 
-* @{tf.GraphDef} (.pb) —A protobuf that represents the TensorFlow training or
+* `tf.GraphDef` (.pb) —A protobuf that represents the TensorFlow training or
   computation graph. It contains operators, tensors, and variables definitions.
 * *CheckPoint* (.ckpt) —Serialized variables from a TensorFlow graph. Since this
   does not contain a graph structure, it cannot be interpreted by itself.
@@ -142,11 +146,11 @@ containing the model architecture. The [frozen_graph.pb](https://storage.googlea
 file used here is available for download. `output_file` is where the TensorFlow
 Lite model will get generated. The `input_type` and `inference_type`
 arguments should be set to `FLOAT`, unless converting a
-@{$performance/quantization$quantized model}. Setting the `input_array`,
-`output_array`, and `input_shape` arguments are not as straightforward. The
-easiest way to find these values is to explore the graph using Tensorboard. Reuse
-the arguments for specifying the output nodes for inference in the
-`freeze_graph` step.
+<a href="https://www.tensorflow.org/performance/quantization">quantized model</a>.
+Setting the `input_array`, `output_array`, and `input_shape` arguments are not as
+straightforward. The easiest way to find these values is to explore the graph
+using Tensorboard. Reuse the arguments for specifying the output nodes for
+inference in the `freeze_graph` step.
 
 It is also possible to use the Tensorflow Optimizing Converter with protobufs
 from either Python or from the command line (see the 
@@ -203,16 +207,16 @@ The open source Android demo app uses the JNI interface and is available
 [on GitHub](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/java/demo/app).
 You can also download a
 [prebuilt APK](http://download.tensorflow.org/deps/tflite/TfLiteCameraDemo.apk).
-See the @{$tflite/demo_android} guide for details.
+See the <a href="../demo_android.md">Android demo</a> guide for details.
 
-The @{$mobile/android_build} guide has instructions for installing TensorFlow on
-Android and setting up `bazel` and Android Studio.
+The <a href="./android_build.md">Android mobile</a> guide has instructions for
+installing TensorFlow on Android and setting up `bazel` and Android Studio.
 
 ### iOS
 
 To integrate a TensorFlow model in an iOS app, see the
 [TensorFlow Lite for iOS](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/g3doc/ios.md)
-guide and @{$tflite/demo_ios} guide.
+guide and <a href="../demo_ios.md">iOS demo</a> guide.
 
 #### Core ML support
 
diff --git a/tensorflow/contrib/lite/g3doc/ios.md b/tensorflow/contrib/lite/g3doc/ios.md
index e0358a444d..d78d373ccf 100644
--- a/tensorflow/contrib/lite/g3doc/ios.md
+++ b/tensorflow/contrib/lite/g3doc/ios.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # TensorFlow Lite for iOS
 
 ## Building
diff --git a/tensorflow/contrib/lite/g3doc/models.md b/tensorflow/contrib/lite/g3doc/models.md
index c1c8ef049f..3292aece0e 100644
--- a/tensorflow/contrib/lite/g3doc/models.md
+++ b/tensorflow/contrib/lite/g3doc/models.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # List of Hosted Models
 
 ## Image classification (Float Models)
@@ -39,22 +42,22 @@ single thread large core.
 
 Model Name               | Paper_Model_Files                                                                                                                                         | Model_Size | Top-1 Accuracy | Top-5 Accuracy | TF Lite Performance
 ------------------------ | :-------------------------------------------------------------------------------------------------------------------------------------------------------: | ---------: | -------------: | -------------: | ------------------:
-Mobilenet_0.25_128_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.25_128_quant.tgz) | 0.5 Mb     | 39.9%          | 65.8%          | 3.7 ms
-Mobilenet_0.25_160_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.25_160_quant.tgz) | 0.5 Mb     | 43.5%          | 69.1%          | 5.5 ms
-Mobilenet_0.25_192_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.25_192_quant.tgz) | 0.5 Mb     | 45.8%          | 71.9%          | 7.9 ms
-Mobilenet_0.25_224_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.25_224_quant.tgz) | 0.5 Mb     | 48.2%          | 73.8%          | 10.4 ms
-Mobilenet_0.50_128_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.5_128_quant.tgz)  | 1.4 Mb     | 54.9%          | 78.9%          | 8.8 ms
-Mobilenet_0.50_160_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.5_160_quant.tgz)  | 1.4 Mb     | 57.7%          | 81.3%          | 13.0 ms
-Mobilenet_0.50_192_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.5_192_quant.tgz)  | 1.4 Mb     | 60.4%          | 83.2%          | 18.3 ms
-Mobilenet_0.50_224_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.5_224_quant.tgz)  | 1.4 Mb     | 62.2%          | 84.5%          | 24.7 ms
-Mobilenet_0.75_128_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.75_128_quant.tgz) | 2.6 Mb     | 59.8%          | 82.8%          | 16.2 ms
-Mobilenet_0.75_160_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.75_160_quant.tgz) | 2.6 Mb     | 63.9%          | 85.5%          | 24.3 ms
-Mobilenet_0.75_192_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.75_192_quant.tgz) | 2.6 Mb     | 66.2%          | 87.1%          | 33.8 ms
-Mobilenet_0.75_224_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_0.75_224_quant.tgz) | 2.6 Mb     | 67.9%          | 88.1%          | 45.4 ms
-Mobilenet_1.0_128_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_128_quant.tgz)  | 4.3 Mb     | 64.0%          | 85.5%          | 24.9 ms
-Mobilenet_1.0_160_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_160_quant.tgz)  | 4.3 Mb     | 67.3%          | 87.7%          | 37.4 ms
-Mobilenet_1.0_192_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_192_quant.tgz)  | 4.3 Mb     | 69.0%          | 88.9%          | 51.9 ms
-Mobilenet_1.0_224_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_224_quant.tgz)  | 4.3 Mb     | 69.7%          | 89.5%          | 70.2 ms
+Mobilenet_0.25_128_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.25_128_quant.tgz) | 0.5 Mb     | 39.7%          | 65.8%          | 3.7 ms
+Mobilenet_0.25_160_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.25_160_quant.tgz) | 0.5 Mb     | 41.9%          | 69.1%          | 5.5 ms
+Mobilenet_0.25_192_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.25_192_quant.tgz) | 0.5 Mb     | 45.3%          | 71.9%          | 7.9 ms
+Mobilenet_0.25_224_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.25_224_quant.tgz) | 0.5 Mb     | 46.4%          | 73.8%          | 10.4 ms
+Mobilenet_0.50_128_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.5_128_quant.tgz)  | 1.4 Mb     | 54.1%          | 78.9%          | 8.8 ms
+Mobilenet_0.50_160_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.5_160_quant.tgz)  | 1.4 Mb     | 57.6%          | 81.3%          | 13.0 ms
+Mobilenet_0.50_192_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.5_192_quant.tgz)  | 1.4 Mb     | 59.1%          | 83.2%          | 18.3 ms
+Mobilenet_0.50_224_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.5_224_quant.tgz)  | 1.4 Mb     | 61.0%          | 84.5%          | 24.7 ms
+Mobilenet_0.75_128_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.75_128_quant.tgz) | 2.6 Mb     | 52.5%          | 82.8%          | 16.2 ms
+Mobilenet_0.75_160_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.75_160_quant.tgz) | 2.6 Mb     | 63.6%          | 85.5%          | 24.3 ms
+Mobilenet_0.75_192_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.75_192_quant.tgz) | 2.6 Mb     | 61.1%          | 87.1%          | 33.8 ms
+Mobilenet_0.75_224_quant | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_0.75_224_quant.tgz) | 2.6 Mb     | 66.7%          | 88.1%          | 45.4 ms
+Mobilenet_1.0_128_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_1.0_128_quant.tgz)  | 4.3 Mb     | 62.7%          | 85.5%          | 24.9 ms
+Mobilenet_1.0_160_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_1.0_160_quant.tgz)  | 4.3 Mb     | 66.6%          | 87.7%          | 37.4 ms
+Mobilenet_1.0_192_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_1.0_192_quant.tgz)  | 4.3 Mb     | 69.2%          | 88.9%          | 51.9 ms
+Mobilenet_1.0_224_quant  | [paper](https://arxiv.org/pdf/1712.05877.pdf), [tflite&pb](http://download.tensorflow.org/models/mobilenet_v1_2018_07_12/mobilenet_v1_1.0_224_quant.tgz)  | 4.3 Mb     | 69.3%          | 89.5%          | 70.2 ms
 
 ## Other models
 
diff --git a/tensorflow/contrib/lite/g3doc/ops_versioning.md b/tensorflow/contrib/lite/g3doc/ops_versioning.md
index bd2f797e6c..b06f4fd3b8 100644
--- a/tensorflow/contrib/lite/g3doc/ops_versioning.md
+++ b/tensorflow/contrib/lite/g3doc/ops_versioning.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # TensorFlow Lite Ops Versioning
 
 This document describes TensorFlow Lite's op versioning schema. Op
diff --git a/tensorflow/docs_src/mobile/tflite/index.md b/tensorflow/contrib/lite/g3doc/overview.md
index 3d1733024e..be60d7941a 100644
--- a/tensorflow/docs_src/mobile/tflite/index.md
+++ b/tensorflow/contrib/lite/g3doc/overview.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Introduction to TensorFlow Lite
 
 TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded
@@ -70,10 +73,9 @@ There are several factors which are fueling interest in this domain:
 We believe the next wave of machine learning applications will have significant
 processing on mobile and embedded devices.
 
-## TensorFlow Lite developer preview highlights
+## TensorFlow Lite highlights
 
-TensorFlow Lite is available as a developer preview and includes the
-following:
+TensorFlow Lite provides:
 
 - A set of core operators, both quantized and float, many of which have been
   tuned for mobile platforms.  These can be used to create and run custom
@@ -129,9 +131,6 @@ following:
 
 - Java and C++ API support
 
-Note: This is a developer release, and it’s likely that there will be changes in
-the API in upcoming versions. We do not guarantee backward or forward
-compatibility with this release.
 
 ## Getting Started
 
@@ -201,9 +200,5 @@ possible performance for a particular model on a particular device.
 
 ## Next Steps
 
-For the developer preview, most of our documentation is on GitHub. Please take a
-look at the [TensorFlow Lite
-repository](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite)
-on GitHub for more information and for code samples, demo applications, and
-more.
-
+The TensorFlow Lite [GitHub repository](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite).
+contains additional docs, code samples, and demo applications.
diff --git a/tensorflow/contrib/lite/g3doc/benchmarks.md b/tensorflow/contrib/lite/g3doc/performance.md
index 29b087bea7..613e9f97c3 100644
--- a/tensorflow/contrib/lite/g3doc/benchmarks.md
+++ b/tensorflow/contrib/lite/g3doc/performance.md
@@ -1,27 +1,26 @@
-# Performance Benchmark numbers
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
 
-This document contains the performance benchmark numbers for running a few well
+# Performance
+
+This document lists TensorFlow Lite performance benchmarks when running well
 known models on some Android and iOS devices.
 
-The benchmark numbers were generated by running the [TFLite benchmark
-binary](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark)
-on Android and running the [iOS benchmark
-app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark/ios)
-on iOS.
+These performance benchmark numbers were generated with the
+[Android TFLite benchmark binary](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark)
+and the [iOS benchmark app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark/ios).
 
-# Android benchmarks
+# Android performance benchmarks
 
-When running Android benchmarks, the CPU affinity is set to use big cores on the
-device to reduce variance (see
-[details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark#reducing-variance-between-runs-on-android)).
+For Android benchmarks, the CPU affinity is set to use big cores on the device to
+reduce variance (see [details](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark#reducing-variance-between-runs-on-android)).
 
-Models are assumed to have been downloaded from the link, unzipped and pushed to
-`/data/local/tmp/tflite_models` folder. The benchmark binary is built according
-to instructions listed
-[here](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark#on-android).
-and is assumed to have been pushed to `/data/local/tmp`.
+It assumes that models were download and unzipped to the
+`/data/local/tmp/tflite_models` directory. The benchmark binary is built
+using [these instructions](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark#on-android)
+and assumed in the `/data/local/tmp` directory.
 
-The following command was used to run the benchmark:
+To run the benchmark:
 
 ```
 adb shell taskset ${CPU_MASK} /data/local/tmp/benchmark_model \
@@ -32,7 +31,7 @@ adb shell taskset ${CPU_MASK} /data/local/tmp/benchmark_model \
   --use_nnapi=false
 ```
 
-where `${GRAPH}` is the name of model and `${CPU_MASK}` is the CPU affinity
+Here, `${GRAPH}` is the name of model and `${CPU_MASK}` is the CPU affinity
 chosen according to the following table:
 
 Device | CPU_MASK |
@@ -62,7 +61,7 @@ Pixel xl | 0c |
   </tr>
   <tr>
     <td rowspan = 2>
-      <a href="http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_224_quant.tgz)">Mobilenet_1.0_224 (quant)</a>
+      <a href="http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/mobilenet_v1_1.0_224_quant.tgz">Mobilenet_1.0_224 (quant)</a>
     </td>
     <td>Pixel 2 </td>
     <td>69.5 ms (0.9 ms)</td>
@@ -120,7 +119,7 @@ Pixel xl | 0c |
 
 # iOS benchmarks
 
-For running iOS benchmarks, the [benchmark
+To run iOS benchmarks, the [benchmark
 app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/benchmark/ios)
 was modified to include the appropriate model and `benchmark_params.json` was
 modified  to set `num_threads` to 1.
diff --git a/tensorflow/contrib/lite/g3doc/rpi.md b/tensorflow/contrib/lite/g3doc/rpi.md
index ab50789307..cdc9172d87 100644
--- a/tensorflow/contrib/lite/g3doc/rpi.md
+++ b/tensorflow/contrib/lite/g3doc/rpi.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # TensorFlow Lite for Raspberry Pi
 
 ## Cross compiling
diff --git a/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md b/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md
index dcd17bbeab..aa65ec9988 100644
--- a/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md
+++ b/tensorflow/contrib/lite/g3doc/tf_ops_compatibility.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # TensorFlow Lite & TensorFlow Compatibility Guide
 
 TensorFlow Lite supports a number of TensorFlow operations used in common
@@ -42,6 +45,7 @@ counterparts:
     *as long as the input tensor is 4D (1 batch + 2 spatial + 1 other) and the
     crops attribute is not used*
 *   [tf.exp](https://www.tensorflow.org/api_docs/python/tf/exp)
+*   [tf.fake_quant*](https://www.tensorflow.org/api_docs/python/tf/fake_quant_with_min_max_args)
 *   [tf.matmul](https://www.tensorflow.org/api_docs/python/tf/matmul) - *as long
     as the second argument is constant and transposition is not used*
 *   [tf.nn.avg_pool](https://www.tensorflow.org/api_docs/python/tf/nn/avg_pool)
@@ -58,6 +62,7 @@ counterparts:
 *   [tf.nn.softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax) -
     *as long as tensors are 2D and axis is the last dimension*
 *   [tf.nn.top_k](https://www.tensorflow.org/api_docs/python/tf/nn/top_k)
+*   [tf.one_hot](https://www.tensorflow.org/api_docs/python/tf/one_hot)
 *   [tf.pad](https://www.tensorflow.org/api_docs/python/tf/pad) - *as long as
     mode and constant_values are not used*
 *   [tf.reduce_mean](https://www.tensorflow.org/api_docs/python/tf/reduce_mean) -
@@ -790,6 +795,54 @@ Outputs {
 }
 ```
 
+**ARG_MAX**
+
+```
+Inputs {
+  0: a tensor
+  1: a tensor
+}
+Outputs {
+  0: A tensor of indices of maximum values.
+}
+```
+
+**ARG_MIN**
+
+```
+Inputs {
+  0: a tensor
+  1: a tensor
+}
+Outputs {
+  0: A tensor of indices of minium values.
+}
+```
+
+**PACK**
+
+```
+Inputs {
+  0: a list of tensors.
+  1: an integer.
+}
+Outputs {
+  0: A tensor of stacked tensors.
+}
+```
+
+**LOGICAL_OR**
+
+```
+Inputs {
+  0: a list of tensors.
+  1: a list of tensors.
+}
+Outputs {
+  0: A tensor of logical_or output tensors.
+}
+```
+
 And these are TensorFlow Lite operations that are present but not ready for
 custom models yet:
 
diff --git a/tensorflow/docs_src/mobile/android_build.md b/tensorflow/contrib/lite/g3doc/tfmobile/android_build.md
index f4b07db459..76e16fc9db 100644
--- a/tensorflow/docs_src/mobile/android_build.md
+++ b/tensorflow/contrib/lite/g3doc/tfmobile/android_build.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Building TensorFlow on Android
 
 To get you started working with TensorFlow on Android, we'll walk through two
@@ -91,7 +94,8 @@ using [ADB](https://developer.android.com/studio/command-line/adb.html). This
 requires some knowledge of build systems and Android developer tools, but we'll
 guide you through the basics here.
 
-- First, follow our instructions for @{$install/install_sources$installing from sources}.
+- First, follow our instructions for
+  <a href="http://www.tensorflow.org/install/install_sources">installing from sources</a>.
   This will also guide you through installing Bazel and cloning the
   TensorFlow code.
 
diff --git a/tensorflow/docs_src/mobile/mobile_intro.md b/tensorflow/contrib/lite/g3doc/tfmobile/index.md
index 241f01d460..bd047bfcec 100644
--- a/tensorflow/docs_src/mobile/mobile_intro.md
+++ b/tensorflow/contrib/lite/g3doc/tfmobile/index.md
@@ -1,4 +1,45 @@
-# Introduction to TensorFlow Mobile
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
+# Overview
+
+TensorFlow was designed to be a good deep learning solution for mobile
+platforms. Currently we have two solutions for deploying machine learning
+applications on mobile and embedded devices: TensorFlow for Mobile and
+<a href="../index.md">TensorFlow Lite</a>.
+
+## TensorFlow Lite versus TensorFlow Mobile
+
+Here are a few of the differences between the two:
+
+- TensorFlow Lite is an evolution of TensorFlow Mobile.  In most cases, apps
+  developed with TensorFlow Lite will have a smaller binary size, fewer
+  dependencies, and better performance.
+
+- TensorFlow Lite is in developer preview, so not all use cases are covered yet.
+  We expect you to use TensorFlow Mobile to cover production cases.
+
+- TensorFlow Lite supports only a limited set of operators, so not all models
+  will work on it by default. TensorFlow for Mobile has a fuller set of
+  supported functionality.
+
+TensorFlow Lite provides better performance and a small binary size on mobile
+platforms as well as the ability to leverage hardware acceleration if available
+on their platforms. In addition, it has many fewer dependencies so it can be
+built and hosted on simpler, more constrained device scenarios. TensorFlow Lite
+also allows targeting accelerators through the [Neural Networks
+API](https://developer.android.com/ndk/guides/neuralnetworks/index.html).
+
+TensorFlow Lite currently has coverage for a limited set of operators. While
+TensorFlow for Mobile supports only a constrained set of ops by default, in
+principle if you use an arbitrary operator in TensorFlow, it can be customized
+to build that kernel. Thus use cases which are not currently supported by
+TensorFlow Lite should continue to use TensorFlow for Mobile. As TensorFlow Lite
+evolves, it will gain additional operators, and the decision will be easier to
+make.
+
+
+## Introduction to TensorFlow Mobile
 
 TensorFlow was designed from the ground up to be a good deep learning solution
 for mobile platforms like Android and iOS. This mobile guide should help you
@@ -38,7 +79,8 @@ speech-driven interface, and many of these require on-device processing. Most of
 the time a user isn’t giving commands, and so streaming audio continuously to a
 remote server would be a waste of bandwidth, since it would mostly be silence or
 background noises. To solve this problem it’s common to have a small neural
-network running on-device @{$tutorials/audio_recognition$listening out for a particular keyword}.
+network running on-device
+[listening out for a particular keyword](../tutorials/sequences/audio_recognition).
 Once that keyword has been spotted, the rest of the
 conversation can be transmitted over to the server for further processing if
 more computing power is needed.
@@ -166,7 +208,7 @@ interesting products possible.
 
 TensorFlow runs on Ubuntu Linux, Windows 10, and OS X. For a list of all
 supported operating systems and instructions to install TensorFlow, see
-@{$install$Installing Tensorflow}.
+<a href="https://www.tensorflow.org/install">Installing Tensorflow</a>.
 
 Note that some of the sample code we provide for mobile TensorFlow requires you
 to compile TensorFlow from source, so you’ll need more than just `pip install`
@@ -240,8 +282,3 @@ results you’ll see. It’s common for an algorithm to get great training accur
 numbers but then fail to be useful within a real application because there’s a
 mismatch between the dataset and real usage. Prototype end-to-end usage as soon
 as possible to create a consistent user experience.
-
-## Next Steps
-
-We suggest you get started by building one of our demos for
-@{$mobile/android_build$Android} or @{$mobile/ios_build$iOS}.
diff --git a/tensorflow/docs_src/mobile/ios_build.md b/tensorflow/contrib/lite/g3doc/tfmobile/ios_build.md
index 4c84a1214a..6223707892 100644
--- a/tensorflow/docs_src/mobile/ios_build.md
+++ b/tensorflow/contrib/lite/g3doc/tfmobile/ios_build.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Building TensorFlow on iOS
 
 ## Using CocoaPods
diff --git a/tensorflow/docs_src/mobile/linking_libs.md b/tensorflow/contrib/lite/g3doc/tfmobile/linking_libs.md
index efef5dd0da..4c2071ed05 100644
--- a/tensorflow/docs_src/mobile/linking_libs.md
+++ b/tensorflow/contrib/lite/g3doc/tfmobile/linking_libs.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Integrating TensorFlow libraries
 
 Once you have made some progress on a model that addresses the problem you’re
@@ -14,11 +17,11 @@ TensorFlow mobile demo apps.
 
 After you've managed to build the examples, you'll probably want to call
 TensorFlow from one of your existing applications. The very easiest way to do
-this is to use the Pod installation steps described
-@{$mobile/ios_build#using_cocoapods$here}, but if you want to build TensorFlow
-from source (for example to customize which operators are included) you'll need
-to break out TensorFlow as a framework, include the right header files, and link
-against the built libraries and dependencies.
+this is to use the Pod installation steps described in
+<a href="./ios_build.md">Building TensorFlow on iOS</a>, but if you want to build
+TensorFlow from source (for example to customize which operators are included)
+you'll need to break out TensorFlow as a framework, include the right header
+files, and link against the built libraries and dependencies.
 
 ### Android
 
@@ -82,10 +85,12 @@ recompile of the core.
 To achieve this capability, TensorFlow uses a registration pattern in a lot of
 places. In the code, it looks like this:
 
-    class MulKernel : OpKernel {
-      Status Compute(OpKernelContext* context) { … }
-    };
-    REGISTER_KERNEL(MulKernel, “Mul”);
+```
+class MulKernel : OpKernel {
+	Status Compute(OpKernelContext* context) { … }
+};
+REGISTER_KERNEL(MulKernel, “Mul”);
+```
 
 This would be in a standalone `.cc` file linked into your application, either
 as part of the main set of kernels or as a separate custom library. The magic
@@ -101,15 +106,17 @@ doesn’t offer a good mechanism for doing this sort of registration, so we have
 to resort to some tricky code. Under the hood, the macro is implemented so that
 it produces something like this:
 
-    class RegisterMul {
-     public:
-      RegisterMul() {
-        global_kernel_registry()->Register(“Mul”, [](){
-          return new MulKernel()
-        });
-      }
-    };
-    RegisterMul g_register_mul;
+```
+class RegisterMul {
+	public:
+		RegisterMul() {
+			global_kernel_registry()->Register(“Mul”, [](){
+				return new MulKernel()
+			});
+	}
+};
+RegisterMul g_register_mul;
+```
 
 This sets up a class `RegisterMul` with a constructor that tells the global
 kernel registry what function to call when somebody asks it how to create a
@@ -176,8 +183,10 @@ have an experimental script at [rename_protobuf.sh](https://github.com/tensorflo
 You need to run this as part of the makefile build, after you’ve downloaded all
 the dependencies:
 
-    tensorflow/contrib/makefile/download_dependencies.sh
-    tensorflow/contrib/makefile/rename_protobuf.sh
+```
+tensorflow/contrib/makefile/download_dependencies.sh
+tensorflow/contrib/makefile/rename_protobuf.sh
+```
 
 ## Calling the TensorFlow API
 
@@ -193,18 +202,20 @@ use case, while on iOS and Raspberry Pi you call directly into the C++ API.
 
 Here’s what a typical Inference Library sequence looks like on Android:
 
-    // Load the model from disk.
-    TensorFlowInferenceInterface inferenceInterface =
-    new TensorFlowInferenceInterface(assetManager, modelFilename);
+```
+// Load the model from disk.
+TensorFlowInferenceInterface inferenceInterface =
+new TensorFlowInferenceInterface(assetManager, modelFilename);
 
-    // Copy the input data into TensorFlow.
-    inferenceInterface.feed(inputName, floatValues, 1, inputSize, inputSize, 3);
+// Copy the input data into TensorFlow.
+inferenceInterface.feed(inputName, floatValues, 1, inputSize, inputSize, 3);
 
-    // Run the inference call.
-    inferenceInterface.run(outputNames, logStats);
+// Run the inference call.
+inferenceInterface.run(outputNames, logStats);
 
-    // Copy the output Tensor back into the output array.
-    inferenceInterface.fetch(outputName, outputs);
+// Copy the output Tensor back into the output array.
+inferenceInterface.fetch(outputName, outputs);
+```
 
 You can find the source of this code in the [Android examples](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/android/src/org/tensorflow/demo/TensorFlowImageClassifier.java#L107).
 
@@ -212,27 +223,29 @@ You can find the source of this code in the [Android examples](https://github.co
 
 Here’s the equivalent code for iOS and Raspberry Pi:
 
-    // Load the model.
-    PortableReadFileToProto(file_path, &tensorflow_graph);
-
-    // Create a session from the model.
-    tensorflow::Status s = session->Create(tensorflow_graph);
-    if (!s.ok()) {
-      LOG(FATAL) << "Could not create TensorFlow Graph: " << s;
-    }
-
-    // Run the model.
-    std::string input_layer = "input";
-    std::string output_layer = "output";
-    std::vector<tensorflow::Tensor> outputs;
-    tensorflow::Status run_status = session->Run({{input_layer, image_tensor}},
+```
+// Load the model.
+PortableReadFileToProto(file_path, &tensorflow_graph);
+
+// Create a session from the model.
+tensorflow::Status s = session->Create(tensorflow_graph);
+if (!s.ok()) {
+    LOG(FATAL) << "Could not create TensorFlow Graph: " << s;
+}
+
+// Run the model.
+std::string input_layer = "input";
+std::string output_layer = "output";
+std::vector<tensorflow::Tensor> outputs;
+tensorflow::Status run_status = session->Run({\{input_layer, image_tensor}},
                                {output_layer}, {}, &outputs);
-    if (!run_status.ok()) {
-      LOG(FATAL) << "Running model failed: " << run_status;
-    }
+if (!run_status.ok()) {
+    LOG(FATAL) << "Running model failed: " << run_status;
+}
 
-    // Access the output data.
-    tensorflow::Tensor* output = &outputs[0];
+// Access the output data.
+tensorflow::Tensor* output = &outputs[0];
+```
 
 This is all based on the
 [iOS sample code](https://www.tensorflow.org/code/tensorflow/examples/ios/simple/RunModelViewController.mm),
diff --git a/tensorflow/docs_src/mobile/optimizing.md b/tensorflow/contrib/lite/g3doc/tfmobile/optimizing.md
index 778e4d3a62..a0192c3541 100644
--- a/tensorflow/docs_src/mobile/optimizing.md
+++ b/tensorflow/contrib/lite/g3doc/tfmobile/optimizing.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Optimizing for mobile
 
 There are some special issues that you have to deal with when you’re trying to
@@ -77,7 +80,7 @@ out of a mobile device's memory faster.
 
 To understand how large your network will be on disk, start by looking at the
 size on disk of your `GraphDef` file after you’ve run `freeze_graph` and
-`strip_unused_nodes` on it (see @{$mobile/prepare_models$Preparing models} for
+`strip_unused_nodes` on it (see <a href="./prepare_models.md">Preparing models</a> for
 more details on these tools), since then it should only contain
 inference-related nodes. To double-check that your results are as expected, run
 the `summarize_graph` tool to see how many parameters are in constants:
@@ -103,7 +106,8 @@ you multiply the number of const parameters by four, you should get something
 that’s close to the size of the file on disk. You can often get away with only
 eight-bits per parameter with very little loss of accuracy in the final result,
 so if your file size is too large you can try using
-@{$performance/quantization$quantize_weights} to transform the parameters down.
+<a href="https://www.tensorflow.org/performance/quantization">quantize_weights</a>
+to transform the parameters down.
 
     bazel build tensorflow/tools/graph_transforms:transform_graph && \
     bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
@@ -292,7 +296,8 @@ run it on a 64-bit ARM device:
 
 You can interpret the results in exactly the same way as the desktop version
 above. If you have any trouble figuring out what the right input and output
-names and types are, take a look at the @{$mobile/prepare_models$Preparing models}
+names and types are, take a look at the
+<a href="./prepare_models">Preparing models</a>
 page for details about detecting these for your model, and look at the
 `summarize_graph` tool which may give you
 helpful information.
diff --git a/tensorflow/docs_src/mobile/prepare_models.md b/tensorflow/contrib/lite/g3doc/tfmobile/prepare_models.md
index 2b84dbb973..6b4e4a92bd 100644
--- a/tensorflow/docs_src/mobile/prepare_models.md
+++ b/tensorflow/contrib/lite/g3doc/tfmobile/prepare_models.md
@@ -1,3 +1,6 @@
+book_path: /mobile/_book.yaml
+project_path: /mobile/_project.yaml
+
 # Preparing models for mobile deployment
 
 The requirements for storing model information during training are very
@@ -255,8 +258,8 @@ The criteria for including ops and types fall into several categories:
 These ops are trimmed by default to optimize for inference on mobile, but it is
 possible to alter some build files to change the default.  After alternating the
 build files, you will need to recompile TensorFlow.  See below for more details
-on how to do this, and also see @{$mobile/optimizing#binary_size$Optimizing} for
-more on reducing your binary size.
+on how to do this, and also see <a href="./optimizing.md">optimizing binary size</a>
+for more on reducing your binary size.
 
 ### Locate the implementation
 
diff --git a/tensorflow/contrib/lite/interpreter.cc b/tensorflow/contrib/lite/interpreter.cc
index 3089a4c568..7a680f5c64 100644
--- a/tensorflow/contrib/lite/interpreter.cc
+++ b/tensorflow/contrib/lite/interpreter.cc
@@ -25,25 +25,28 @@ limitations under the License.
 #include "tensorflow/contrib/lite/context_util.h"
 #include "tensorflow/contrib/lite/error_reporter.h"
 #include "tensorflow/contrib/lite/graph_info.h"
-#ifndef TFLITE_MCU
-#include "tensorflow/contrib/lite/kernels/eigen_support.h"
-#include "tensorflow/contrib/lite/kernels/gemm_support.h"
-#endif
 #include "tensorflow/contrib/lite/memory_planner.h"
-#ifndef TFLITE_MCU
 #include "tensorflow/contrib/lite/nnapi_delegate.h"
-#endif
 #include "tensorflow/contrib/lite/profiling/profiler.h"
 #include "tensorflow/contrib/lite/schema/schema_generated.h"
 #include "tensorflow/contrib/lite/util.h"
 
 namespace tflite {
-#ifdef TFLITE_MCU
-class NNAPIDelegate {};
-#endif
-
 namespace {
 
+TfLiteStatus ReportOpError(TfLiteContext* context, const TfLiteNode& node,
+                           const TfLiteRegistration& registration,
+                           int node_index, const char* message) {
+  context->ReportError(
+      context, "Node number %d (%s) %s.\n", node_index,
+      registration.custom_name
+          ? registration.custom_name
+          : EnumNameBuiltinOperator(
+                static_cast<BuiltinOperator>(registration.builtin_code)),
+      message);
+  return kTfLiteError;
+}
+
 // Stub method which returns kTfLiteError when the function is forbidden.
 // We're registrating this function to several different function to save
 // compiled binary size. Please note the restrictions:
@@ -120,19 +123,22 @@ Interpreter::Interpreter(ErrorReporter* error_reporter)
   context_.AddTensors = AddTensors;
   context_.tensors = nullptr;
   context_.tensors_size = 0;
-  context_.eigen_context = nullptr;
-  context_.gemm_context = nullptr;
   context_.recommended_num_threads = -1;
+  context_.GetExternalContext = GetExternalContext;
+  context_.SetExternalContext = SetExternalContext;
 
   // Invalid to call these these except from TfLiteDelegate
-  SetForbiddenContextFunction(&context_.GetNodeAndRegistration);
-  SetForbiddenContextFunction(&context_.ReplaceSubgraphsWithDelegateKernels);
-  SetForbiddenContextFunction(&context_.GetExecutionPlan);
+  SwitchToKernelContext();
 
   // Reserve some space for the tensors to avoid excessive resizing.
   tensors_.reserve(kTensorsReservedCapacity);
   nodes_and_registration_.reserve(kTensorsReservedCapacity);
   next_execution_plan_index_to_prepare_ = 0;
+
+  for (int i = 0; i < kTfLiteMaxExternalContexts; ++i) {
+    external_contexts_[i] = nullptr;
+  }
+
   UseNNAPI(false);
 }
 
@@ -267,8 +273,9 @@ TfLiteStatus Interpreter::ReplaceSubgraphsWithDelegateKernels(
         int node_index;
 
         TfLiteDelegateParams* params = CreateDelegateParams(delegate, subgraph);
-        AddNodeWithParameters(subgraph.input_tensors, subgraph.output_tensors,
-                              nullptr, 0, params, &registration, &node_index);
+        TF_LITE_ENSURE_STATUS(AddNodeWithParameters(
+            subgraph.input_tensors, subgraph.output_tensors, nullptr, 0, params,
+            &registration, &node_index));
 
         // Initialize the output tensors's delegate-related fields.
         for (int tensor_index : subgraph.output_tensors) {
@@ -290,6 +297,33 @@ TfLiteStatus Interpreter::ReplaceSubgraphsWithDelegateKernels(
   return kTfLiteOk;
 }
 
+TfLiteExternalContext* Interpreter::GetExternalContext(
+    TfLiteExternalContextType type) {
+  if (type >= 0 && type < kTfLiteMaxExternalContexts) {
+    return external_contexts_[type];
+  }
+  return nullptr;
+}
+
+TfLiteExternalContext* Interpreter::GetExternalContext(
+    struct TfLiteContext* context, TfLiteExternalContextType type) {
+  return static_cast<Interpreter*>(context->impl_)->GetExternalContext(type);
+}
+
+void Interpreter::SetExternalContext(TfLiteExternalContextType type,
+                                     TfLiteExternalContext* ctx) {
+  if (type >= 0 && type < kTfLiteMaxExternalContexts) {
+    external_contexts_[type] = ctx;
+  }
+}
+
+void Interpreter::SetExternalContext(struct TfLiteContext* context,
+                                     TfLiteExternalContextType type,
+                                     TfLiteExternalContext* ctx) {
+  return static_cast<Interpreter*>(context->impl_)
+      ->SetExternalContext(type, ctx);
+}
+
 // Gets an TfLiteIntArray* representing the execution plan. The interpreter owns
 // this memory and it is only guaranteed to exist during the invocation of the
 // delegate prepare.
@@ -413,6 +447,13 @@ TfLiteStatus Interpreter::AllocateTensors() {
   TF_LITE_ENSURE_STATUS(PrepareOpsAndTensors());
 
   state_ = kStateInvokable;
+
+  // Reset the variable tensors to zero after (re)allocating the tensors.
+  // Developers shouldn't rely on the side effect of this function to reset
+  // variable tesnsors. They should call `ResetVariableTensorsToZero` directly
+  // instead.
+  ResetVariableTensorsToZero();
+
   return kTfLiteOk;
 }
 
@@ -537,7 +578,8 @@ TfLiteStatus Interpreter::PrepareOpsStartingAt(
         nodes_and_registration_[node_index].second;
     EnsureTensorsVectorCapacity();
     if (OpPrepare(registration, &node) == kTfLiteError) {
-      return kTfLiteError;
+      return ReportOpError(&context_, node, registration, node_index,
+                           "failed to prepare");
     }
 
     *last_execution_plan_index_prepared = execution_plan_index;
@@ -556,7 +598,7 @@ TfLiteStatus Interpreter::PrepareOpsAndTensors() {
   if (!memory_planner_) {
     memory_planner_.reset(new ArenaPlanner(
         &context_, std::unique_ptr<GraphInfo>(new InterpreterInfo(this)),
-        /*preserve_inputs=*/true));
+        /*preserve_inputs=*/true, /*preserve_intermediates*/ false));
     memory_planner_->PlanAllocations();
   }
 
@@ -582,7 +624,6 @@ TfLiteStatus Interpreter::Invoke() {
   }
 
   TfLiteStatus status = kTfLiteOk;
-#ifndef TFLITE_MCU
   if (nnapi_delegate_) {
     if (next_execution_plan_index_to_prepare_ == execution_plan_.size()) {
       TF_LITE_ENSURE_OK(&context_, nnapi_delegate_->Invoke(this));
@@ -596,7 +637,6 @@ TfLiteStatus Interpreter::Invoke() {
       return kTfLiteError;
     }
   }
-#endif
 
   // Invocations are always done in node order.
   // Note that calling Invoke repeatedly will cause the original memory plan to
@@ -637,7 +677,8 @@ TfLiteStatus Interpreter::Invoke() {
     EnsureTensorsVectorCapacity();
     tensor_resized_since_op_invoke_ = false;
     if (OpInvoke(registration, &node) == kTfLiteError) {
-      status = kTfLiteError;
+      status = ReportOpError(&context_, node, registration, node_index,
+                             "failed to invoke");
     }
 
     // Force execution prep for downstream ops if the latest op triggered the
@@ -853,28 +894,39 @@ TfLiteStatus Interpreter::ResizeTensorImpl(TfLiteTensor* tensor,
 }
 
 void Interpreter::UseNNAPI(bool enable) {
-#ifndef TFLITE_MCU
   // TODO(aselle): This is a workaround for finding if NNAPI exists.
   // We also need to make sure getLibraryHandle() is renamed to be NNAPI
   // prefixed.
-  if (!NNAPIExists()) enable = false;
+  if (!NNAPIDelegate::IsSupported()) enable = false;
   if (!enable) {
     nnapi_delegate_.reset();
   } else if (!nnapi_delegate_) {
     nnapi_delegate_.reset(new NNAPIDelegate);
   }
-#endif
 }
 
 void Interpreter::SetNumThreads(int num_threads) {
   context_.recommended_num_threads = num_threads;
 
-  // TODO(ahentz): find a way to avoid this. It causes gemmlowp and eigen to
-  // be required in order to compile the framework.
-#ifndef TFLITE_MCU
-  gemm_support::SetNumThreads(&context_, num_threads);
-  eigen_support::SetNumThreads(&context_, num_threads);
-#endif
+  for (int i = 0; i < kTfLiteMaxExternalContexts; ++i) {
+    auto* c = external_contexts_[i];
+    if (c && c->Refresh) {
+      c->Refresh(&context_);
+    }
+  }
+}
+
+void Interpreter::SwitchToDelegateContext() {
+  context_.GetNodeAndRegistration = GetNodeAndRegistration;
+  context_.ReplaceSubgraphsWithDelegateKernels =
+      ReplaceSubgraphsWithDelegateKernels;
+  context_.GetExecutionPlan = GetExecutionPlan;
+}
+
+void Interpreter::SwitchToKernelContext() {
+  SetForbiddenContextFunction(&context_.GetNodeAndRegistration);
+  SetForbiddenContextFunction(&context_.ReplaceSubgraphsWithDelegateKernels);
+  SetForbiddenContextFunction(&context_.GetExecutionPlan);
 }
 
 TfLiteStatus Interpreter::ModifyGraphWithDelegate(TfLiteDelegate* delegate,
@@ -903,17 +955,12 @@ TfLiteStatus Interpreter::ModifyGraphWithDelegate(TfLiteDelegate* delegate,
 
   // TODO(aselle): Consider if it is worth storing pointers to delegates.
   // Setup additional context interface.
-  context_.GetNodeAndRegistration = GetNodeAndRegistration;
-  context_.ReplaceSubgraphsWithDelegateKernels =
-      ReplaceSubgraphsWithDelegateKernels;
-  context_.GetExecutionPlan = GetExecutionPlan;
+  SwitchToDelegateContext();
 
   TfLiteStatus status = delegate->Prepare(&context_, delegate);
 
   // Remove additional context info.
-  SetForbiddenContextFunction(&context_.GetNodeAndRegistration);
-  SetForbiddenContextFunction(&context_.ReplaceSubgraphsWithDelegateKernels);
-  SetForbiddenContextFunction(&context_.GetExecutionPlan);
+  SwitchToKernelContext();
 
   TF_LITE_ENSURE_OK(&context_, status);
 
diff --git a/tensorflow/contrib/lite/interpreter.h b/tensorflow/contrib/lite/interpreter.h
index 033b8ee5fa..be149a8cc0 100644
--- a/tensorflow/contrib/lite/interpreter.h
+++ b/tensorflow/contrib/lite/interpreter.h
@@ -63,6 +63,10 @@ template <>
 constexpr TfLiteType typeToTfLiteType<std::complex<float>>() {
   return kTfLiteComplex64;
 }
+template <>
+constexpr TfLiteType typeToTfLiteType<string>() {
+  return kTfLiteString;
+}
 
 // Forward declare since NNAPIDelegate uses Interpreter.
 class NNAPIDelegate;
@@ -107,7 +111,7 @@ class Interpreter {
   // processing this model will be forwarded to the error_reporter object.
   //
   // Note, if error_reporter is nullptr, then a default StderrReporter is
-  // used.
+  // used. Ownership of 'error_reporter' remains with the caller.
   explicit Interpreter(ErrorReporter* error_reporter = DefaultErrorReporter());
 
   ~Interpreter();
@@ -410,6 +414,15 @@ class Interpreter {
   }
 
  private:
+  friend class InterpreterTest;
+
+  // Prevent 'context_' from accessing functions that are only available to
+  // delegated kernels.
+  void SwitchToKernelContext();
+
+  // Add delegate-only functions to 'context_'.
+  void SwitchToDelegateContext();
+
   // Give 'op_reg' a chance to initialize itself using the contents of
   // 'buffer'.
   void* OpInit(const TfLiteRegistration& op_reg, const char* buffer,
@@ -496,6 +509,7 @@ class Interpreter {
   // Update the execution graph to replace some of the nodes with stub
   // nodes. Specifically any node index that has `nodes[index]==1` will be
   // slated for replacement with a delegate kernel specified by registration.
+  // Ownership of 'nodes_to_replace' and 'delegate' remains with the caller.
   // WARNING: This is an experimental interface that is subject to change.
   TfLiteStatus ReplaceSubgraphsWithDelegateKernels(
       TfLiteRegistration registration, const TfLiteIntArray* nodes_to_replace,
@@ -522,6 +536,18 @@ class Interpreter {
   static TfLiteStatus GetExecutionPlan(struct TfLiteContext* context,
                                        TfLiteIntArray** execution_plan);
 
+  // Retrieve an existing external context by type.
+  TfLiteExternalContext* GetExternalContext(TfLiteExternalContextType type);
+  static TfLiteExternalContext* GetExternalContext(
+      struct TfLiteContext* context, TfLiteExternalContextType type);
+
+  // Set the value of an external context.
+  void SetExternalContext(TfLiteExternalContextType type,
+                          TfLiteExternalContext* ctx);
+  static void SetExternalContext(struct TfLiteContext* context,
+                                 TfLiteExternalContextType type,
+                                 TfLiteExternalContext* ctx);
+
   // Ensures that `tensors_` has at least `kTensorsCapacityHeadroom` extra
   // capacity. Calling this function may invalidate existing pointers to
   // tensors. After calling this function, adding `kTensorsCapacityHeadroom`
@@ -611,7 +637,10 @@ class Interpreter {
   bool tensor_resized_since_op_invoke_ = false;
 
   // Profiler for this interpreter instance.
-  profiling::Profiler* profiler_;
+  profiling::Profiler* profiler_ = nullptr;
+
+  // List of active external contexts.
+  TfLiteExternalContext* external_contexts_[kTfLiteMaxExternalContexts];
 };
 
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/interpreter_test.cc b/tensorflow/contrib/lite/interpreter_test.cc
index 4f7fb36696..2bf598bad7 100644
--- a/tensorflow/contrib/lite/interpreter_test.cc
+++ b/tensorflow/contrib/lite/interpreter_test.cc
@@ -23,6 +23,15 @@ limitations under the License.
 #include "tensorflow/contrib/lite/testing/util.h"
 
 namespace tflite {
+
+// InterpreterTest is a friend of Interpreter, so it can access context_.
+class InterpreterTest : public ::testing::Test {
+ protected:
+  TfLiteContext* GetInterpreterContext() { return &interpreter_.context_; }
+
+  Interpreter interpreter_;
+};
+
 namespace ops {
 namespace builtin {
 TfLiteRegistration* Register_PADV2();
@@ -48,6 +57,22 @@ TEST(BasicInterpreter, InvokeInvalidModel) {
   ASSERT_EQ(interpreter.Invoke(), kTfLiteOk);
 }
 
+TEST(BasicInterpreter, TestAllocateTensorsResetVariableTensors) {
+  Interpreter interpreter;
+  int tensor_index;
+  ASSERT_EQ(interpreter.AddTensors(1, &tensor_index), kTfLiteOk);
+  constexpr int kTensorSize = 16;
+  interpreter.SetTensorParametersReadWrite(tensor_index, kTfLiteFloat32, "",
+                                           {kTensorSize}, {}, true);
+  interpreter.SetVariables({tensor_index});
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+  TfLiteTensor* tensor = interpreter.tensor(tensor_index);
+  // Ensure that variable tensors are reset to zero.
+  for (int i = 0; i < kTensorSize; ++i) {
+    ASSERT_EQ(tensor->data.f[i], 0.0f);
+  }
+}
+
 // Test size accessor functions.
 TEST(BasicInterpreter, TestSizeFunctions) {
   Interpreter interpreter;
@@ -622,18 +647,6 @@ TEST(BasicInterpreter, AllocateTwice) {
   ASSERT_EQ(old_tensor1_ptr, interpreter.tensor(1)->data.raw);
 }
 
-struct TestErrorReporter : public ErrorReporter {
-  int Report(const char* format, va_list args) override {
-    char buffer[1024];
-    int size = vsnprintf(buffer, sizeof(buffer), format, args);
-    all_reports += buffer;
-    calls++;
-    return size;
-  }
-  int calls = 0;
-  std::string all_reports;
-};
-
 TEST(BasicInterpreter, TestNullErrorReporter) {
   TestErrorReporter reporter;
   Interpreter interpreter;
@@ -643,8 +656,9 @@ TEST(BasicInterpreter, TestCustomErrorReporter) {
   TestErrorReporter reporter;
   Interpreter interpreter(&reporter);
   ASSERT_NE(interpreter.Invoke(), kTfLiteOk);
-  ASSERT_EQ(reporter.all_reports, "Invoke called on model that is not ready.");
-  ASSERT_EQ(reporter.calls, 1);
+  ASSERT_EQ(reporter.error_messages(),
+            "Invoke called on model that is not ready.");
+  ASSERT_EQ(reporter.num_calls(), 1);
 }
 
 TEST(BasicInterpreter, TestUnsupportedDelegateFunctions) {
@@ -780,6 +794,47 @@ TEST(InterpreterTensorsCapacityTest, TestExceedHeadroom) {
   ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
 }
 
+struct TestExternalContext : public TfLiteExternalContext {
+  static const TfLiteExternalContextType kType = kTfLiteGemmLowpContext;
+
+  static TestExternalContext* Get(TfLiteContext* context) {
+    return reinterpret_cast<TestExternalContext*>(
+        context->GetExternalContext(context, kType));
+  }
+
+  static void Set(TfLiteContext* context, TestExternalContext* value) {
+    context->SetExternalContext(context, kType, value);
+  }
+
+  int num_refreshes = 0;
+};
+
+TEST_F(InterpreterTest, GetSetResetExternalContexts) {
+  auto* context = GetInterpreterContext();
+
+  TestExternalContext external_context;
+  external_context.Refresh = [](TfLiteContext* context) {
+    auto* ptr = TestExternalContext::Get(context);
+    if (ptr != nullptr) {
+      ++ptr->num_refreshes;
+    }
+    return kTfLiteOk;
+  };
+
+  EXPECT_EQ(TestExternalContext::Get(context), nullptr);
+  interpreter_.SetNumThreads(4);
+
+  TestExternalContext::Set(context, &external_context);
+  EXPECT_EQ(TestExternalContext::Get(context), &external_context);
+  interpreter_.SetNumThreads(4);
+  interpreter_.SetNumThreads(5);
+  EXPECT_EQ(external_context.num_refreshes, 2);
+
+  TestExternalContext::Set(context, nullptr);
+  EXPECT_EQ(TestExternalContext::Get(context), nullptr);
+  interpreter_.SetNumThreads(4);
+}
+
 // Test fixture that allows playing with execution plans. It creates a two
 // node graph that can be executed in either [0,1] order or [1,0] order.
 // The CopyOp records when it is invoked in the class member run_order_
diff --git a/tensorflow/contrib/lite/java/AndroidManifest.xml b/tensorflow/contrib/lite/java/AndroidManifest.xml
index f705feacbe..b91c6d149a 100644
--- a/tensorflow/contrib/lite/java/AndroidManifest.xml
+++ b/tensorflow/contrib/lite/java/AndroidManifest.xml
@@ -1,7 +1,12 @@
 <?xml version="1.0" encoding="utf-8"?>
 <manifest xmlns:android="http://schemas.android.com/apk/res/android"
-          package="org.tensorflow.lite">
-    <application>
-    </application>
+    package="org.tensorflow.lite">
+
+    <uses-sdk
+        android:minSdkVersion="4"
+        android:targetSdkVersion="19" />
+
+    <application />
+
 </manifest>
 
diff --git a/tensorflow/contrib/lite/java/BUILD b/tensorflow/contrib/lite/java/BUILD
index 593af81a18..098ba7e773 100644
--- a/tensorflow/contrib/lite/java/BUILD
+++ b/tensorflow/contrib/lite/java/BUILD
@@ -69,6 +69,7 @@ java_test(
     size = "small",
     srcs = ["src/test/java/org/tensorflow/lite/TensorFlowLiteTest.java"],
     javacopts = JAVACOPTS,
+    tags = ["no_oss"],
     test_class = "org.tensorflow.lite.TensorFlowLiteTest",
     deps = [
         ":libtensorflowlite_jni.so",
@@ -83,6 +84,7 @@ java_test(
     size = "small",
     srcs = ["src/test/java/org/tensorflow/lite/DataTypeTest.java"],
     javacopts = JAVACOPTS,
+    tags = ["no_oss"],
     test_class = "org.tensorflow.lite.DataTypeTest",
     deps = [
         ":libtensorflowlite_jni.so",
@@ -105,6 +107,7 @@ java_test(
         "src/testdata/with_custom_op.lite",
     ],
     javacopts = JAVACOPTS,
+    tags = ["no_oss"],
     test_class = "org.tensorflow.lite.NativeInterpreterWrapperTest",
     deps = [
         ":libtensorflowlite_jni.so",
@@ -124,6 +127,7 @@ java_test(
         "src/testdata/mobilenet.tflite.bin",
     ],
     javacopts = JAVACOPTS,
+    tags = ["no_oss"],
     test_class = "org.tensorflow.lite.InterpreterTest",
     visibility = ["//visibility:private"],
     deps = [
@@ -142,6 +146,7 @@ java_test(
         "src/testdata/add.bin",
     ],
     javacopts = JAVACOPTS,
+    tags = ["no_oss"],
     test_class = "org.tensorflow.lite.TensorTest",
     deps = [
         ":tensorflowlitelib",
diff --git a/tensorflow/contrib/lite/java/demo/app/build.gradle b/tensorflow/contrib/lite/java/demo/app/build.gradle
index 49868c5a75..92f04c651c 100644
--- a/tensorflow/contrib/lite/java/demo/app/build.gradle
+++ b/tensorflow/contrib/lite/java/demo/app/build.gradle
@@ -44,7 +44,7 @@ repositories {
 
 dependencies {
     compile fileTree(dir: 'libs', include: ['*.jar'])
-    androidTestCompile('com.androidx.test.espresso:espresso-core:2.2.2', {
+    androidTestCompile('androidx.test.espresso:espresso-core:3.1.0-alpha3', {
         exclude group: 'com.android.support', module: 'support-annotations'
     })
     compile 'com.android.support:appcompat-v7:25.2.0'
diff --git a/tensorflow/contrib/lite/java/ovic/BUILD b/tensorflow/contrib/lite/java/ovic/BUILD
index f232b00045..06f46fb923 100644
--- a/tensorflow/contrib/lite/java/ovic/BUILD
+++ b/tensorflow/contrib/lite/java/ovic/BUILD
@@ -18,6 +18,7 @@ java_test(
         "//tensorflow/contrib/lite/java/ovic/src/testdata:ovic_testdata",
     ],
     javacopts = JAVACOPTS,
+    tags = ["no_oss"],
     test_class = "org.tensorflow.ovic.OvicClassifierTest",
     visibility = ["//visibility:public"],
     deps = [
diff --git a/tensorflow/contrib/lite/java/ovic/demo/app/build.gradle b/tensorflow/contrib/lite/java/ovic/demo/app/build.gradle
index 3f32d62e5c..2a08608bbb 100644
--- a/tensorflow/contrib/lite/java/ovic/demo/app/build.gradle
+++ b/tensorflow/contrib/lite/java/ovic/demo/app/build.gradle
@@ -43,7 +43,7 @@ repositories {
 
 dependencies {
     compile fileTree(dir: 'libs', include: ['*.jar'])
-    androidTestCompile('com.androidx.test.espresso:espresso-core:2.2.2', {
+    androidTestCompile('androidx.test.espresso:espresso-core:3.1.0-alpha3', {
         exclude group: 'com.android.support', module: 'support-annotations'
     })
     compile 'com.android.support:appcompat-v7:25.2.0'
diff --git a/tensorflow/contrib/lite/java/ovic/src/test/java/org/tensorflow/ovic/OvicClassifierTest.java b/tensorflow/contrib/lite/java/ovic/src/test/java/org/tensorflow/ovic/OvicClassifierTest.java
index 56f3e7604a..1587c3c56f 100644
--- a/tensorflow/contrib/lite/java/ovic/src/test/java/org/tensorflow/ovic/OvicClassifierTest.java
+++ b/tensorflow/contrib/lite/java/ovic/src/test/java/org/tensorflow/ovic/OvicClassifierTest.java
@@ -127,12 +127,8 @@ public final class OvicClassifierTest {
     try {
       testResult = classifier.classifyByteBuffer(testImage);
       fail();
-    } catch (RuntimeException e) {
-      assertThat(e)
-          .hasMessageThat()
-          .contains(
-              "Failed to get input dimensions. 0-th input should have 49152 bytes, "
-                  + "but found 150528 bytes.");
+    } catch (IllegalArgumentException e) {
+      // Success.
     }
   }
 
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java
index 75334cd96e..94a1ec65d6 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/DataType.java
@@ -27,10 +27,7 @@ enum DataType {
   UINT8(3),
 
   /** 64-bit signed integer. */
-  INT64(4),
-
-  /** A {@link ByteBuffer}. */
-  BYTEBUFFER(999);
+  INT64(4);
 
   private final int value;
 
@@ -69,8 +66,6 @@ enum DataType {
         return 1;
       case INT64:
         return 8;
-      case BYTEBUFFER:
-        return 1;
     }
     throw new IllegalArgumentException(
         "DataType error: DataType " + this + " is not supported yet");
@@ -87,8 +82,6 @@ enum DataType {
         return "byte";
       case INT64:
         return "long";
-      case BYTEBUFFER:
-        return "ByteBuffer";
     }
     throw new IllegalArgumentException(
         "DataType error: DataType " + this + " is not supported yet");
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
index 4e22a68bf2..7002f82677 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
@@ -165,20 +165,7 @@ public final class Interpreter implements AutoCloseable {
     if (wrapper == null) {
       throw new IllegalStateException("Internal error: The Interpreter has already been closed.");
     }
-    Tensor[] tensors = wrapper.run(inputs);
-    if (outputs == null || tensors == null || outputs.size() > tensors.length) {
-      throw new IllegalArgumentException("Output error: Outputs do not match with model outputs.");
-    }
-    final int size = tensors.length;
-    for (Integer idx : outputs.keySet()) {
-      if (idx == null || idx < 0 || idx >= size) {
-        throw new IllegalArgumentException(
-            String.format(
-                "Output error: Invalid index of output %d (should be in range [0, %d))",
-                idx, size));
-      }
-      tensors[idx].copyTo(outputs.get(idx));
-    }
+    wrapper.run(inputs, outputs);
   }
 
   /**
@@ -251,8 +238,10 @@ public final class Interpreter implements AutoCloseable {
   /** Release resources associated with the {@code Interpreter}. */
   @Override
   public void close() {
-    wrapper.close();
-    wrapper = null;
+    if (wrapper != null) {
+      wrapper.close();
+      wrapper = null;
+    }
   }
 
   @Override
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
index 80de88b6a1..767a220f8c 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
@@ -15,10 +15,10 @@ limitations under the License.
 
 package org.tensorflow.lite;
 
-import java.lang.reflect.Array;
 import java.nio.ByteBuffer;
 import java.nio.ByteOrder;
 import java.nio.MappedByteBuffer;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.Map;
 
@@ -40,6 +40,8 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     modelHandle = createModel(modelPath, errorHandle);
     interpreterHandle = createInterpreter(modelHandle, errorHandle, numThreads);
     isMemoryAllocated = true;
+    inputTensors = new Tensor[getInputCount(interpreterHandle)];
+    outputTensors = new Tensor[getOutputCount(interpreterHandle)];
   }
 
   /**
@@ -72,6 +74,8 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     modelHandle = createModelWithBuffer(modelByteBuffer, errorHandle);
     interpreterHandle = createInterpreter(modelHandle, errorHandle, numThreads);
     isMemoryAllocated = true;
+    inputTensors = new Tensor[getInputCount(interpreterHandle)];
+    outputTensors = new Tensor[getOutputCount(interpreterHandle)];
   }
 
   /** Releases resources associated with this {@code NativeInterpreterWrapper}. */
@@ -85,75 +89,63 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     inputsIndexes = null;
     outputsIndexes = null;
     isMemoryAllocated = false;
+    Arrays.fill(inputTensors, null);
+    Arrays.fill(outputTensors, null);
   }
 
   /** Sets inputs, runs model inference and returns outputs. */
-  Tensor[] run(Object[] inputs) {
+  void run(Object[] inputs, Map<Integer, Object> outputs) {
+    inferenceDurationNanoseconds = -1;
     if (inputs == null || inputs.length == 0) {
       throw new IllegalArgumentException("Input error: Inputs should not be null or empty.");
     }
-    int[] dataTypes = new int[inputs.length];
-    Object[] sizes = new Object[inputs.length];
-    int[] numsOfBytes = new int[inputs.length];
+    if (outputs == null || outputs.isEmpty()) {
+      throw new IllegalArgumentException("Input error: Outputs should not be null or empty.");
+    }
+
+    // TODO(b/80431971): Remove implicit resize after deprecating multi-dimensional array inputs.
+    // Rather than forcing an immediate resize + allocation if an input's shape differs, we first
+    // flush all resizes, avoiding redundant allocations.
     for (int i = 0; i < inputs.length; ++i) {
-      DataType dataType = dataTypeOf(inputs[i]);
-      dataTypes[i] = dataType.getNumber();
-      if (dataType == DataType.BYTEBUFFER) {
-        ByteBuffer buffer = (ByteBuffer) inputs[i];
-        if (buffer == null || !buffer.isDirect() || buffer.order() != ByteOrder.nativeOrder()) {
-          throw new IllegalArgumentException(
-              "Input error: ByteBuffer should be a direct ByteBuffer that uses "
-                  + "ByteOrder.nativeOrder().");
-        }
-        numsOfBytes[i] = buffer.limit();
-        sizes[i] = getInputDims(interpreterHandle, i, numsOfBytes[i]);
-      } else if (isNonEmptyArray(inputs[i])) {
-        int[] dims = shapeOf(inputs[i]);
-        sizes[i] = dims;
-        numsOfBytes[i] = dataType.elemByteSize() * numElements(dims);
-      } else {
-        throw new IllegalArgumentException(
-            String.format(
-                "Input error: %d-th element of the %d inputs is not an array or a ByteBuffer.",
-                i, inputs.length));
+      Tensor tensor = getInputTensor(i);
+      int[] newShape = tensor.getInputShapeIfDifferent(inputs[i]);
+      if (newShape != null) {
+        resizeInput(i, newShape);
       }
     }
-    inferenceDurationNanoseconds = -1;
-    long[] outputsHandles =
-        run(
-            interpreterHandle,
-            errorHandle,
-            sizes,
-            dataTypes,
-            numsOfBytes,
-            inputs,
-            this,
-            isMemoryAllocated);
-    if (outputsHandles == null || outputsHandles.length == 0) {
-      throw new IllegalStateException("Internal error: Interpreter has no outputs.");
+
+    if (!isMemoryAllocated) {
+      allocateTensors(interpreterHandle, errorHandle);
+      isMemoryAllocated = true;
+      // Allocation can trigger dynamic resizing of output tensors, so clear the
+      // output tensor cache.
+      Arrays.fill(outputTensors, null);
     }
-    isMemoryAllocated = true;
-    Tensor[] outputs = new Tensor[outputsHandles.length];
-    for (int i = 0; i < outputsHandles.length; ++i) {
-      outputs[i] = Tensor.fromHandle(outputsHandles[i]);
+
+    for (int i = 0; i < inputs.length; ++i) {
+      getInputTensor(i).setTo(inputs[i]);
+    }
+
+    long inferenceStartNanos = System.nanoTime();
+    run(interpreterHandle, errorHandle);
+    long inferenceDurationNanoseconds = System.nanoTime() - inferenceStartNanos;
+
+    for (Map.Entry<Integer, Object> output : outputs.entrySet()) {
+      getOutputTensor(output.getKey()).copyTo(output.getValue());
     }
-    return outputs;
+
+    // Only set if the entire operation succeeds.
+    this.inferenceDurationNanoseconds = inferenceDurationNanoseconds;
   }
 
-  private static native long[] run(
-      long interpreterHandle,
-      long errorHandle,
-      Object[] sizes,
-      int[] dtypes,
-      int[] numsOfBytes,
-      Object[] values,
-      NativeInterpreterWrapper wrapper,
-      boolean memoryAllocated);
+  private static native boolean run(long interpreterHandle, long errorHandle);
 
   /** Resizes dimensions of a specific input. */
   void resizeInput(int idx, int[] dims) {
     if (resizeInput(interpreterHandle, errorHandle, idx, dims)) {
       isMemoryAllocated = false;
+      // Resizing will invalidate the Tensor's shape, so invalidate the Tensor handle.
+      inputTensors[idx] = null;
     }
   }
 
@@ -212,78 +204,6 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     }
   }
 
-  static int numElements(int[] shape) {
-    if (shape == null) {
-      return 0;
-    }
-    int n = 1;
-    for (int i = 0; i < shape.length; i++) {
-      n *= shape[i];
-    }
-    return n;
-  }
-
-  static boolean isNonEmptyArray(Object o) {
-    return (o != null && o.getClass().isArray() && Array.getLength(o) != 0);
-  }
-
-  /** Returns the type of the data. */
-  static DataType dataTypeOf(Object o) {
-    if (o != null) {
-      Class<?> c = o.getClass();
-      while (c.isArray()) {
-        c = c.getComponentType();
-      }
-      if (float.class.equals(c)) {
-        return DataType.FLOAT32;
-      } else if (int.class.equals(c)) {
-        return DataType.INT32;
-      } else if (byte.class.equals(c)) {
-        return DataType.UINT8;
-      } else if (long.class.equals(c)) {
-        return DataType.INT64;
-      } else if (ByteBuffer.class.isInstance(o)) {
-        return DataType.BYTEBUFFER;
-      }
-    }
-    throw new IllegalArgumentException(
-        "DataType error: cannot resolve DataType of " + o.getClass().getName());
-  }
-
-  /** Returns the shape of an object as an int array. */
-  static int[] shapeOf(Object o) {
-    int size = numDimensions(o);
-    int[] dimensions = new int[size];
-    fillShape(o, 0, dimensions);
-    return dimensions;
-  }
-
-  static int numDimensions(Object o) {
-    if (o == null || !o.getClass().isArray()) {
-      return 0;
-    }
-    if (Array.getLength(o) == 0) {
-      throw new IllegalArgumentException("Array lengths cannot be 0.");
-    }
-    return 1 + numDimensions(Array.get(o, 0));
-  }
-
-  static void fillShape(Object o, int dim, int[] shape) {
-    if (shape == null || dim == shape.length) {
-      return;
-    }
-    final int len = Array.getLength(o);
-    if (shape[dim] == 0) {
-      shape[dim] = len;
-    } else if (shape[dim] != len) {
-      throw new IllegalArgumentException(
-          String.format("Mismatched lengths (%d and %d) in dimension %d", shape[dim], len, dim));
-    }
-    for (int i = 0; i < len; ++i) {
-      fillShape(Array.get(o, i), dim + 1, shape);
-    }
-  }
-
   /**
    * Gets the last inference duration in nanoseconds. It returns null if there is no previous
    * inference run or the last inference run failed.
@@ -293,40 +213,55 @@ final class NativeInterpreterWrapper implements AutoCloseable {
   }
 
   /**
-   * Gets the dimensions of an input. It throws IllegalArgumentException if input index is invalid.
+   * Gets the quantization zero point of an output.
+   *
+   * @throws IllegalArgumentException if the output index is invalid.
    */
-  int[] getInputDims(int index) {
-    return getInputDims(interpreterHandle, index, -1);
+  int getOutputQuantizationZeroPoint(int index) {
+    return getOutputQuantizationZeroPoint(interpreterHandle, index);
   }
 
   /**
-   * Gets the dimensions of an input. If numBytes >= 0, it will check whether num of bytes match the
-   * input.
+   * Gets the quantization scale of an output.
+   *
+   * @throws IllegalArgumentException if the output index is invalid.
    */
-  private static native int[] getInputDims(long interpreterHandle, int inputIdx, int numBytes);
-
-  /** Gets the type of an output. It throws IllegalArgumentException if output index is invalid. */
-  String getOutputDataType(int index) {
-    int type = getOutputDataType(interpreterHandle, index);
-    return DataType.fromNumber(type).toStringName();
+  float getOutputQuantizationScale(int index) {
+    return getOutputQuantizationScale(interpreterHandle, index);
   }
 
   /**
-   * Gets the quantization zero point of an output.
+   * Gets the input {@link Tensor} for the provided input index.
    *
-   * @throws IllegalArgumentExeption if the output index is invalid.
+   * @throws IllegalArgumentException if the input index is invalid.
    */
-  int getOutputQuantizationZeroPoint(int index) {
-    return getOutputQuantizationZeroPoint(interpreterHandle, index);
+  Tensor getInputTensor(int index) {
+    if (index < 0 || index >= inputTensors.length) {
+      throw new IllegalArgumentException("Invalid input Tensor index: " + index);
+    }
+    Tensor inputTensor = inputTensors[index];
+    if (inputTensor == null) {
+      inputTensor =
+          inputTensors[index] = Tensor.fromHandle(getInputTensor(interpreterHandle, index));
+    }
+    return inputTensor;
   }
 
   /**
-   * Gets the quantization scale of an output.
+   * Gets the output {@link Tensor} for the provided output index.
    *
-   * @throws IllegalArgumentExeption if the output index is invalid.
+   * @throws IllegalArgumentException if the output index is invalid.
    */
-  float getOutputQuantizationScale(int index) {
-    return getOutputQuantizationScale(interpreterHandle, index);
+  Tensor getOutputTensor(int index) {
+    if (index < 0 || index >= outputTensors.length) {
+      throw new IllegalArgumentException("Invalid output Tensor index: " + index);
+    }
+    Tensor outputTensor = outputTensors[index];
+    if (outputTensor == null) {
+      outputTensor =
+          outputTensors[index] = Tensor.fromHandle(getOutputTensor(interpreterHandle, index));
+    }
+    return outputTensor;
   }
 
   private static native int getOutputDataType(long interpreterHandle, int outputIdx);
@@ -343,18 +278,30 @@ final class NativeInterpreterWrapper implements AutoCloseable {
 
   private long modelHandle;
 
-  private int inputSize;
-
   private long inferenceDurationNanoseconds = -1;
 
   private ByteBuffer modelByteBuffer;
 
+  // Lazily constructed maps of input and output names to input and output Tensor indexes.
   private Map<String, Integer> inputsIndexes;
-
   private Map<String, Integer> outputsIndexes;
 
+  // Lazily constructed and populated arrays of input and output Tensor wrappers.
+  private final Tensor[] inputTensors;
+  private final Tensor[] outputTensors;
+
   private boolean isMemoryAllocated = false;
 
+  private static native long allocateTensors(long interpreterHandle, long errorHandle);
+
+  private static native long getInputTensor(long interpreterHandle, int inputIdx);
+
+  private static native long getOutputTensor(long interpreterHandle, int outputIdx);
+
+  private static native int getInputCount(long interpreterHandle);
+
+  private static native int getOutputCount(long interpreterHandle);
+
   private static native String[] getInputNames(long interpreterHandle);
 
   private static native String[] getOutputNames(long interpreterHandle);
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Tensor.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Tensor.java
index b2a3e04c55..2403570c52 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Tensor.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Tensor.java
@@ -15,6 +15,7 @@ limitations under the License.
 
 package org.tensorflow.lite;
 
+import java.lang.reflect.Array;
 import java.nio.ByteBuffer;
 import java.nio.ByteOrder;
 import java.util.Arrays;
@@ -31,43 +32,179 @@ final class Tensor {
     return new Tensor(nativeHandle);
   }
 
+  /** Returns the {@link DataType} of elements stored in the Tensor. */
+  public DataType dataType() {
+    return dtype;
+  }
+
+  /** Returns the size, in bytes, of the tensor data. */
+  public int numBytes() {
+    return numBytes(nativeHandle);
+  }
+
+  /**
+   * Returns the <a href="https://www.tensorflow.org/resources/dims_types.html#shape">shape</a> of
+   * the Tensor, i.e., the sizes of each dimension.
+   *
+   * @return an array where the i-th element is the size of the i-th dimension of the tensor.
+   */
+  public int[] shape() {
+    return shapeCopy;
+  }
+
+  /**
+   * Copies the contents of the provided {@code src} object to the Tensor.
+   *
+   * <p>The {@code src} should either be a (multi-dimensional) array with a shape matching that of
+   * this tensor, or a {@link ByteByffer} of compatible primitive type with a matching flat size.
+   *
+   * @throws IllegalArgumentException if the tensor is a scalar or if {@code src} is not compatible
+   *     with the tensor (for example, mismatched data types or shapes).
+   */
+  void setTo(Object src) {
+    throwExceptionIfTypeIsIncompatible(src);
+    if (isByteBuffer(src)) {
+      ByteBuffer srcBuffer = (ByteBuffer) src;
+      // For direct ByteBuffer instances we support zero-copy. Note that this assumes the caller
+      // retains ownership of the source buffer until inference has completed.
+      if (srcBuffer.isDirect() && srcBuffer.order() == ByteOrder.nativeOrder()) {
+        writeDirectBuffer(nativeHandle, srcBuffer);
+      } else {
+        buffer().put(srcBuffer);
+      }
+      return;
+    }
+    writeMultiDimensionalArray(nativeHandle, src);
+  }
+
   /**
    * Copies the contents of the tensor to {@code dst} and returns {@code dst}.
    *
    * @param dst the destination buffer, either an explicitly-typed array or a {@link ByteBuffer}.
    * @throws IllegalArgumentException if {@code dst} is not compatible with the tensor (for example,
    *     mismatched data types or shapes).
-   * @throws BufferOverflowException If {@code dst} is a ByteBuffer with insufficient space for the
-   *     data in this tensor.
    */
-  <T> T copyTo(T dst) {
+  Object copyTo(Object dst) {
+    throwExceptionIfTypeIsIncompatible(dst);
     if (dst instanceof ByteBuffer) {
       ByteBuffer dstByteBuffer = (ByteBuffer) dst;
       dstByteBuffer.put(buffer());
       return dst;
     }
-    if (NativeInterpreterWrapper.dataTypeOf(dst) != dtype) {
+    readMultiDimensionalArray(nativeHandle, dst);
+    return dst;
+  }
+
+  /** Returns the provided buffer's shape if specified and different from this Tensor's shape. */
+  // TODO(b/80431971): Remove this method after deprecating multi-dimensional array inputs.
+  int[] getInputShapeIfDifferent(Object input) {
+    // Implicit resizes based on ByteBuffer capacity isn't supported, so short-circuit that path.
+    // The ByteBuffer's size will be validated against this Tensor's size in {@link #setTo(Object)}.
+    if (isByteBuffer(input)) {
+      return null;
+    }
+    int[] inputShape = shapeOf(input);
+    if (Arrays.equals(shapeCopy, inputShape)) {
+      return null;
+    }
+    return inputShape;
+  }
+
+  /** Returns the type of the data. */
+  static DataType dataTypeOf(Object o) {
+    if (o != null) {
+      Class<?> c = o.getClass();
+      while (c.isArray()) {
+        c = c.getComponentType();
+      }
+      if (float.class.equals(c)) {
+        return DataType.FLOAT32;
+      } else if (int.class.equals(c)) {
+        return DataType.INT32;
+      } else if (byte.class.equals(c)) {
+        return DataType.UINT8;
+      } else if (long.class.equals(c)) {
+        return DataType.INT64;
+      }
+    }
+    throw new IllegalArgumentException(
+        "DataType error: cannot resolve DataType of " + o.getClass().getName());
+  }
+
+  /** Returns the shape of an object as an int array. */
+  static int[] shapeOf(Object o) {
+    int size = numDimensions(o);
+    int[] dimensions = new int[size];
+    fillShape(o, 0, dimensions);
+    return dimensions;
+  }
+
+  /** Returns the number of dimensions of a multi-dimensional array, otherwise 0. */
+  static int numDimensions(Object o) {
+    if (o == null || !o.getClass().isArray()) {
+      return 0;
+    }
+    if (Array.getLength(o) == 0) {
+      throw new IllegalArgumentException("Array lengths cannot be 0.");
+    }
+    return 1 + numDimensions(Array.get(o, 0));
+  }
+
+  /** Recursively populates the shape dimensions for a given (multi-dimensional) array. */
+  static void fillShape(Object o, int dim, int[] shape) {
+    if (shape == null || dim == shape.length) {
+      return;
+    }
+    final int len = Array.getLength(o);
+    if (shape[dim] == 0) {
+      shape[dim] = len;
+    } else if (shape[dim] != len) {
+      throw new IllegalArgumentException(
+          String.format("Mismatched lengths (%d and %d) in dimension %d", shape[dim], len, dim));
+    }
+    for (int i = 0; i < len; ++i) {
+      fillShape(Array.get(o, i), dim + 1, shape);
+    }
+  }
+
+  private void throwExceptionIfTypeIsIncompatible(Object o) {
+    if (isByteBuffer(o)) {
+      ByteBuffer oBuffer = (ByteBuffer) o;
+      if (oBuffer.capacity() != numBytes()) {
+        throw new IllegalArgumentException(
+            String.format(
+                "Cannot convert between a TensorFlowLite buffer with %d bytes and a "
+                    + "ByteBuffer with %d bytes.",
+                numBytes(), oBuffer.capacity()));
+      }
+      return;
+    }
+    DataType oType = dataTypeOf(o);
+    if (oType != dtype) {
       throw new IllegalArgumentException(
           String.format(
-              "Output error: Cannot convert an TensorFlowLite tensor with type %s to a Java "
-                  + "object of type %s (which is compatible with the TensorFlowLite type %s)",
-              dtype, dst.getClass().getName(), NativeInterpreterWrapper.dataTypeOf(dst)));
+              "Cannot convert between a TensorFlowLite tensor with type %s and a Java "
+                  + "object of type %s (which is compatible with the TensorFlowLite type %s).",
+              dtype, o.getClass().getName(), oType));
     }
-    int[] dstShape = NativeInterpreterWrapper.shapeOf(dst);
-    if (!Arrays.equals(dstShape, shapeCopy)) {
+
+    int[] oShape = shapeOf(o);
+    if (!Arrays.equals(oShape, shapeCopy)) {
       throw new IllegalArgumentException(
           String.format(
-              "Output error: Shape of output target %s does not match with the shape of the "
-                  + "Tensor %s.",
-              Arrays.toString(dstShape), Arrays.toString(shapeCopy)));
+              "Cannot copy between a TensorFlowLite tensor with shape %s and a Java object "
+                  + "with shape %s.",
+              Arrays.toString(shapeCopy), Arrays.toString(oShape)));
     }
-    readMultiDimensionalArray(nativeHandle, dst);
-    return dst;
   }
 
-  final long nativeHandle;
-  final DataType dtype;
-  final int[] shapeCopy;
+  private static boolean isByteBuffer(Object o) {
+    return o instanceof ByteBuffer;
+  }
+
+  private final long nativeHandle;
+  private final DataType dtype;
+  private final int[] shapeCopy;
 
   private Tensor(long nativeHandle) {
     this.nativeHandle = nativeHandle;
@@ -81,11 +218,17 @@ final class Tensor {
 
   private static native ByteBuffer buffer(long handle);
 
+  private static native void writeDirectBuffer(long handle, ByteBuffer src);
+
   private static native int dtype(long handle);
 
   private static native int[] shape(long handle);
 
-  private static native void readMultiDimensionalArray(long handle, Object value);
+  private static native int numBytes(long handle);
+
+  private static native void readMultiDimensionalArray(long handle, Object dst);
+
+  private static native void writeMultiDimensionalArray(long handle, Object src);
 
   static {
     TensorFlowLite.init();
diff --git a/tensorflow/contrib/lite/java/src/main/native/BUILD b/tensorflow/contrib/lite/java/src/main/native/BUILD
index 4399ed2025..4b4e1c21d8 100644
--- a/tensorflow/contrib/lite/java/src/main/native/BUILD
+++ b/tensorflow/contrib/lite/java/src/main/native/BUILD
@@ -11,7 +11,6 @@ licenses(["notice"])  # Apache 2.0
 cc_library(
     name = "native_framework_only",
     srcs = [
-        "duration_utils_jni.cc",
         "exception_jni.cc",
         "nativeinterpreterwrapper_jni.cc",
         "tensor_jni.cc",
diff --git a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
index 31f7b58fbc..fdcf00a0a0 100644
--- a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
+++ b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
@@ -16,9 +16,6 @@ limitations under the License.
 #include "tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h"
 namespace {
 
-const int kByteBufferValue = 999;
-const int kBufferSize = 256;
-
 tflite::Interpreter* convertLongToInterpreter(JNIEnv* env, jlong handle) {
   if (handle == 0) {
     throwException(env, kIllegalArgumentException,
@@ -62,22 +59,6 @@ std::vector<int> convertJIntArrayToVector(JNIEnv* env, jintArray inputs) {
   return outputs;
 }
 
-bool isByteBuffer(jint data_type) { return data_type == kByteBufferValue; }
-
-TfLiteType resolveDataType(jint data_type) {
-  switch (data_type) {
-    case 1:
-      return kTfLiteFloat32;
-    case 2:
-      return kTfLiteInt32;
-    case 3:
-      return kTfLiteUInt8;
-    case 4:
-      return kTfLiteInt64;
-    default:
-      return kTfLiteNoType;
-  }
-}
 
 int getDataType(TfLiteType data_type) {
   switch (data_type) {
@@ -108,64 +89,6 @@ void printDims(char* buffer, int max_size, int* dims, int num_dims) {
   }
 }
 
-TfLiteStatus checkInputs(JNIEnv* env, tflite::Interpreter* interpreter,
-                         const int input_size, jintArray data_types,
-                         jintArray nums_of_bytes, jobjectArray values,
-                         jobjectArray sizes) {
-  if (input_size != interpreter->inputs().size()) {
-    throwException(env, kIllegalArgumentException,
-                   "Input error: Expected num of inputs is %d but got %d",
-                   interpreter->inputs().size(), input_size);
-    return kTfLiteError;
-  }
-  if (input_size != env->GetArrayLength(data_types) ||
-      input_size != env->GetArrayLength(nums_of_bytes) ||
-      input_size != env->GetArrayLength(values)) {
-    throwException(env, kIllegalArgumentException,
-                   "Internal error: Arrays in arguments should be of the same "
-                   "length, but got %d sizes, %d data_types, %d nums_of_bytes, "
-                   "and %d values",
-                   input_size, env->GetArrayLength(data_types),
-                   env->GetArrayLength(nums_of_bytes),
-                   env->GetArrayLength(values));
-    return kTfLiteError;
-  }
-  for (int i = 0; i < input_size; ++i) {
-    int input_idx = interpreter->inputs()[i];
-    TfLiteTensor* target = interpreter->tensor(input_idx);
-    jintArray dims =
-        static_cast<jintArray>(env->GetObjectArrayElement(sizes, i));
-    int num_dims = static_cast<int>(env->GetArrayLength(dims));
-    if (target->dims->size != num_dims) {
-      throwException(env, kIllegalArgumentException,
-                     "Input error: %d-th input should have %d dimensions, but "
-                     "found %d dimensions",
-                     i, target->dims->size, num_dims);
-      return kTfLiteError;
-    }
-    jint* ptr = env->GetIntArrayElements(dims, nullptr);
-    for (int j = 1; j < num_dims; ++j) {
-      if (target->dims->data[j] != ptr[j]) {
-        std::unique_ptr<char[]> expected_dims(new char[kBufferSize]);
-        std::unique_ptr<char[]> obtained_dims(new char[kBufferSize]);
-        printDims(expected_dims.get(), kBufferSize, target->dims->data,
-                  num_dims);
-        printDims(obtained_dims.get(), kBufferSize, ptr, num_dims);
-        throwException(env, kIllegalArgumentException,
-                       "Input error: %d-th input dimension should be [%s], but "
-                       "found [%s]",
-                       i, expected_dims.get(), obtained_dims.get());
-        env->ReleaseIntArrayElements(dims, ptr, JNI_ABORT);
-        return kTfLiteError;
-      }
-    }
-    env->ReleaseIntArrayElements(dims, ptr, JNI_ABORT);
-    env->DeleteLocalRef(dims);
-    if (env->ExceptionCheck()) return kTfLiteError;
-  }
-  return kTfLiteOk;
-}
-
 // Checks whether there is any difference between dimensions of a tensor and a
 // given dimensions. Returns true if there is difference, else false.
 bool areDimsDifferent(JNIEnv* env, TfLiteTensor* tensor, jintArray dims) {
@@ -188,74 +111,6 @@ bool areDimsDifferent(JNIEnv* env, TfLiteTensor* tensor, jintArray dims) {
   return false;
 }
 
-bool areInputDimensionsTheSame(JNIEnv* env, tflite::Interpreter* interpreter,
-                               int input_size, jobjectArray sizes) {
-  if (interpreter->inputs().size() != input_size) {
-    return false;
-  }
-  for (int i = 0; i < input_size; ++i) {
-    int input_idx = interpreter->inputs()[i];
-    jintArray dims =
-        static_cast<jintArray>(env->GetObjectArrayElement(sizes, i));
-    TfLiteTensor* target = interpreter->tensor(input_idx);
-    if (areDimsDifferent(env, target, dims)) return false;
-    env->DeleteLocalRef(dims);
-    if (env->ExceptionCheck()) return false;
-  }
-  return true;
-}
-
-TfLiteStatus resizeInputs(JNIEnv* env, tflite::Interpreter* interpreter,
-                          int input_size, jobjectArray sizes) {
-  for (int i = 0; i < input_size; ++i) {
-    int input_idx = interpreter->inputs()[i];
-    jintArray dims =
-        static_cast<jintArray>(env->GetObjectArrayElement(sizes, i));
-    TfLiteStatus status = interpreter->ResizeInputTensor(
-        input_idx, convertJIntArrayToVector(env, dims));
-    if (status != kTfLiteOk) {
-      return status;
-    }
-    env->DeleteLocalRef(dims);
-    if (env->ExceptionCheck()) return kTfLiteError;
-  }
-  return kTfLiteOk;
-}
-
-TfLiteStatus setInputs(JNIEnv* env, tflite::Interpreter* interpreter,
-                       int input_size, jintArray data_types,
-                       jintArray nums_of_bytes, jobjectArray values) {
-  jint* data_type = env->GetIntArrayElements(data_types, nullptr);
-  jint* num_bytes = env->GetIntArrayElements(nums_of_bytes, nullptr);
-  for (int i = 0; i < input_size; ++i) {
-    int input_idx = interpreter->inputs()[i];
-    TfLiteTensor* target = interpreter->tensor(input_idx);
-    jobject value = env->GetObjectArrayElement(values, i);
-    bool is_byte_buffer = isByteBuffer(data_type[i]);
-    if (is_byte_buffer) {
-      writeByteBuffer(env, value, &(target->data.raw),
-                      static_cast<int>(num_bytes[i]));
-    } else {
-      TfLiteType type = resolveDataType(data_type[i]);
-      if (type != target->type) {
-        throwException(env, kIllegalArgumentException,
-                       "Input error: DataType (%d) of input data does not "
-                       "match with the DataType (%d) of model inputs.",
-                       type, target->type);
-        return kTfLiteError;
-      }
-      writeMultiDimensionalArray(env, value, target->type, target->dims->size,
-                                 &(target->data.raw),
-                                 static_cast<int>(num_bytes[i]));
-    }
-    env->DeleteLocalRef(value);
-    if (env->ExceptionCheck()) return kTfLiteError;
-  }
-  env->ReleaseIntArrayElements(data_types, data_type, JNI_ABORT);
-  env->ReleaseIntArrayElements(nums_of_bytes, num_bytes, JNI_ABORT);
-  return kTfLiteOk;
-}
-
 // TODO(yichengfan): evaluate the benefit to use tflite verifier.
 bool VerifyModel(const void* buf, size_t len) {
   flatbuffers::Verifier verifier(static_cast<const uint8_t*>(buf), len);
@@ -287,6 +142,64 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputNames(JNIEnv* env,
   return names;
 }
 
+JNIEXPORT void JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_allocateTensors(
+    JNIEnv* env, jclass clazz, jlong handle, jlong error_handle) {
+  tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
+  if (interpreter == nullptr) return;
+  BufferErrorReporter* error_reporter =
+      convertLongToErrorReporter(env, error_handle);
+  if (error_reporter == nullptr) return;
+
+  if (interpreter->AllocateTensors() != kTfLiteOk) {
+    throwException(
+        env, kIllegalStateException,
+        "Internal error: Unexpected failure when preparing tensor allocations:"
+        " %s",
+        error_reporter->CachedErrorMessage());
+  }
+}
+
+JNIEXPORT jlong JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputTensor(JNIEnv* env,
+                                                                 jclass clazz,
+                                                                 jlong handle,
+                                                                 jint index) {
+  tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
+  if (interpreter == nullptr) return 0;
+  return reinterpret_cast<jlong>(
+      interpreter->tensor(interpreter->inputs()[index]));
+}
+
+JNIEXPORT jlong JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getOutputTensor(JNIEnv* env,
+                                                                  jclass clazz,
+                                                                  jlong handle,
+                                                                  jint index) {
+  tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
+  if (interpreter == nullptr) return 0;
+  return reinterpret_cast<jlong>(
+      interpreter->tensor(interpreter->outputs()[index]));
+}
+
+JNIEXPORT jint JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputCount(JNIEnv* env,
+                                                                jclass clazz,
+                                                                jlong handle) {
+  tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
+  if (interpreter == nullptr) return 0;
+  return static_cast<jint>(interpreter->inputs().size());
+}
+
+JNIEXPORT jint JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getOutputCount(JNIEnv* env,
+                                                                 jclass clazz,
+                                                                 jlong handle) {
+  tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
+  if (interpreter == nullptr) return 0;
+  return static_cast<jint>(interpreter->outputs().size());
+}
+
 JNIEXPORT jobjectArray JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_getOutputNames(JNIEnv* env,
                                                                  jclass clazz,
@@ -424,124 +337,32 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createInterpreter(
   // allocates memory
   status = interpreter->AllocateTensors();
   if (status != kTfLiteOk) {
-    throwException(env, kNullPointerException,
-                   "Internal error: Cannot allocate memory for the interpreter:"
-                   " %s",
-                   error_reporter->CachedErrorMessage());
+    throwException(
+        env, kIllegalStateException,
+        "Internal error: Unexpected failure when preparing tensor allocations:"
+        " %s",
+        error_reporter->CachedErrorMessage());
     return 0;
   }
   return reinterpret_cast<jlong>(interpreter.release());
 }
 
 // Sets inputs, runs inference, and returns outputs as long handles.
-JNIEXPORT jlongArray JNICALL
-Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
-    JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
-    jobjectArray sizes, jintArray data_types, jintArray nums_of_bytes,
-    jobjectArray values, jobject wrapper, jboolean memory_allocated) {
+JNIEXPORT void JNICALL Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
+    JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle) {
   tflite::Interpreter* interpreter =
       convertLongToInterpreter(env, interpreter_handle);
-  if (interpreter == nullptr) return nullptr;
+  if (interpreter == nullptr) return;
   BufferErrorReporter* error_reporter =
       convertLongToErrorReporter(env, error_handle);
-  if (error_reporter == nullptr) return nullptr;
-  const int input_size = env->GetArrayLength(sizes);
-  // validates inputs
-  TfLiteStatus status = checkInputs(env, interpreter, input_size, data_types,
-                                    nums_of_bytes, values, sizes);
-  if (status != kTfLiteOk) return nullptr;
-  if (!memory_allocated ||
-      !areInputDimensionsTheSame(env, interpreter, input_size, sizes)) {
-    // resizes inputs
-    status = resizeInputs(env, interpreter, input_size, sizes);
-    if (status != kTfLiteOk) {
-      throwException(env, kNullPointerException,
-                     "Internal error: Can not resize the input: %s",
-                     error_reporter->CachedErrorMessage());
-      return nullptr;
-    }
-    // allocates memory
-    status = interpreter->AllocateTensors();
-    if (status != kTfLiteOk) {
-      throwException(env, kNullPointerException,
-                     "Internal error: Can not allocate memory for the given "
-                     "inputs: %s",
-                     error_reporter->CachedErrorMessage());
-      return nullptr;
-    }
-  }
-  // sets inputs
-  status = setInputs(env, interpreter, input_size, data_types, nums_of_bytes,
-                     values);
-  if (status != kTfLiteOk) return nullptr;
-  timespec beforeInference = ::tflite::getCurrentTime();
-  // runs inference
+  if (error_reporter == nullptr) return;
+
   if (interpreter->Invoke() != kTfLiteOk) {
     throwException(env, kIllegalArgumentException,
                    "Internal error: Failed to run on the given Interpreter: %s",
                    error_reporter->CachedErrorMessage());
-    return nullptr;
-  }
-  timespec afterInference = ::tflite::getCurrentTime();
-  jclass wrapper_clazz = env->GetObjectClass(wrapper);
-  jfieldID fid =
-      env->GetFieldID(wrapper_clazz, "inferenceDurationNanoseconds", "J");
-  if (env->ExceptionCheck()) {
-    env->ExceptionClear();
-  } else if (fid != nullptr) {
-    env->SetLongField(
-        wrapper, fid,
-        ::tflite::timespec_diff_nanoseconds(&beforeInference, &afterInference));
-  }
-  // returns outputs
-  const std::vector<int>& results = interpreter->outputs();
-  if (results.empty()) {
-    throwException(
-        env, kIllegalArgumentException,
-        "Internal error: The Interpreter does not have any outputs.");
-    return nullptr;
-  }
-  jlongArray outputs = env->NewLongArray(results.size());
-  size_t size = results.size();
-  for (int i = 0; i < size; ++i) {
-    TfLiteTensor* source = interpreter->tensor(results[i]);
-    jlong output = reinterpret_cast<jlong>(source);
-    env->SetLongArrayRegion(outputs, i, 1, &output);
-  }
-  return outputs;
-}
-
-JNIEXPORT jintArray JNICALL
-Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
-    JNIEnv* env, jclass clazz, jlong handle, jint input_idx, jint num_bytes) {
-  tflite::Interpreter* interpreter = convertLongToInterpreter(env, handle);
-  if (interpreter == nullptr) return nullptr;
-  const int idx = static_cast<int>(input_idx);
-  if (input_idx < 0 || input_idx >= interpreter->inputs().size()) {
-    throwException(env, kIllegalArgumentException,
-                   "Input error: Out of range: Failed to get %d-th input out of"
-                   " %d inputs",
-                   input_idx, interpreter->inputs().size());
-    return nullptr;
-  }
-  TfLiteTensor* target = interpreter->tensor(interpreter->inputs()[idx]);
-  int size = target->dims->size;
-  if (num_bytes >= 0) {  // verifies num of bytes matches if num_bytes if valid.
-    int expected_num_bytes = elementByteSize(target->type);
-    for (int i = 0; i < size; ++i) {
-      expected_num_bytes *= target->dims->data[i];
-    }
-    if (num_bytes != expected_num_bytes) {
-      throwException(env, kIllegalArgumentException,
-                     "Input error: Failed to get input dimensions. %d-th input "
-                     "should have %d bytes, but found %d bytes.",
-                     idx, expected_num_bytes, num_bytes);
-      return nullptr;
-    }
+    return;
   }
-  jintArray outputs = env->NewIntArray(size);
-  env->SetIntArrayRegion(outputs, 0, size, &(target->dims->data[0]));
-  return outputs;
 }
 
 JNIEXPORT jint JNICALL
diff --git a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
index 128ece4981..618fba480e 100644
--- a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
+++ b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
@@ -29,9 +29,6 @@ limitations under the License.
 namespace tflite {
 // This is to be provided at link-time by a library.
 extern std::unique_ptr<OpResolver> CreateOpResolver();
-extern timespec getCurrentTime();
-extern jlong timespec_diff_nanoseconds(struct timespec* start,
-                                       struct timespec* stop);
 }  // namespace tflite
 
 #ifdef __cplusplus
@@ -40,6 +37,57 @@ extern "C" {
 
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
+ *  Method:    allocateTensors
+ *  Signature: (JJ)V
+ */
+JNIEXPORT void JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_allocateTensors(
+    JNIEnv* env, jclass clazz, jlong handle, jlong error_handle);
+
+/*
+ *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
+ *  Method:    getInputTensor
+ *  Signature: (JI)J
+ */
+JNIEXPORT jlong JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputTensor(JNIEnv* env,
+                                                                 jclass clazz,
+                                                                 jlong handle,
+                                                                 jint index);
+
+/*
+ *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
+ *  Method:    getOutputTensor
+ *  Signature: (JI)J
+ */
+JNIEXPORT jlong JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getOutputTensor(JNIEnv* env,
+                                                                  jclass clazz,
+                                                                  jlong handle,
+                                                                  jint index);
+
+/*
+ *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
+ *  Method:    getInputCount
+ *  Signature: (J)I
+ */
+JNIEXPORT jint JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputCount(JNIEnv* env,
+                                                                jclass clazz,
+                                                                jlong handle);
+
+/*
+ *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
+ *  Method:    getOutputCount
+ *  Signature: (J)I
+ */
+JNIEXPORT jint JNICALL
+Java_org_tensorflow_lite_NativeInterpreterWrapper_getOutputCount(JNIEnv* env,
+                                                                 jclass clazz,
+                                                                 jlong handle);
+
+/*
+ *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
  *  Method:
  *  Signature: (J)[Ljava/lang/Object;
  */
@@ -118,28 +166,11 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createInterpreter(
 
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
- *  Method:
- *  Signature:
- * (JJ[Ljava/lang/Object;[I[I[Ljava/lang/Object;Ljava/lang/Object;Z)[J
- */
-JNIEXPORT jlongArray JNICALL
-Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
-    JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
-    jobjectArray sizes, jintArray data_types, jintArray nums_of_bytes,
-    jobjectArray values, jobject wrapper, jboolean memory_allocated);
-
-/*
- *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
- *  Method:
- *  Signature: (JII)[I
- *
- * Gets input dimensions. If num_bytes is non-negative, it will check whether
- * num_bytes matches num of bytes required by the input, and return null and
- * throw IllegalArgumentException if not.
+ *  Method:    run
+ *  Signature: (JJ)V
  */
-JNIEXPORT jintArray JNICALL
-Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
-    JNIEnv* env, jclass clazz, jlong handle, jint input_idx, jint num_bytes);
+JNIEXPORT void JNICALL Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
+    JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle);
 
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
diff --git a/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc b/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc
index 08b4d04280..7ff96a3172 100644
--- a/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc
+++ b/tensorflow/contrib/lite/java/src/main/native/tensor_jni.cc
@@ -29,6 +29,35 @@ TfLiteTensor* convertLongToTensor(JNIEnv* env, jlong handle) {
   return reinterpret_cast<TfLiteTensor*>(handle);
 }
 
+size_t elementByteSize(TfLiteType data_type) {
+  // The code in this file makes the assumption that the
+  // TensorFlow TF_DataTypes and the Java primitive types
+  // have the same byte sizes. Validate that:
+  switch (data_type) {
+    case kTfLiteFloat32:
+      static_assert(sizeof(jfloat) == 4,
+                    "Interal error: Java float not compatible with "
+                    "kTfLiteFloat");
+      return 4;
+    case kTfLiteInt32:
+      static_assert(sizeof(jint) == 4,
+                    "Interal error: Java int not compatible with kTfLiteInt");
+      return 4;
+    case kTfLiteUInt8:
+      static_assert(sizeof(jbyte) == 1,
+                    "Interal error: Java byte not compatible with "
+                    "kTfLiteUInt8");
+      return 1;
+    case kTfLiteInt64:
+      static_assert(sizeof(jlong) == 8,
+                    "Interal error: Java long not compatible with "
+                    "kTfLiteInt64");
+      return 8;
+    default:
+      return 0;
+  }
+}
+
 size_t writeOneDimensionalArray(JNIEnv* env, jobject object, TfLiteType type,
                                 void* dst, size_t dst_size) {
   jarray array = static_cast<jarray>(object);
@@ -141,48 +170,6 @@ size_t readMultiDimensionalArray(JNIEnv* env, TfLiteType data_type, char* src,
   }
 }
 
-}  // namespace
-
-size_t elementByteSize(TfLiteType data_type) {
-  // The code in this file makes the assumption that the
-  // TensorFlow TF_DataTypes and the Java primitive types
-  // have the same byte sizes. Validate that:
-  switch (data_type) {
-    case kTfLiteFloat32:
-      static_assert(sizeof(jfloat) == 4,
-                    "Interal error: Java float not compatible with "
-                    "kTfLiteFloat");
-      return 4;
-    case kTfLiteInt32:
-      static_assert(sizeof(jint) == 4,
-                    "Interal error: Java int not compatible with kTfLiteInt");
-      return 4;
-    case kTfLiteUInt8:
-      static_assert(sizeof(jbyte) == 1,
-                    "Interal error: Java byte not compatible with "
-                    "kTfLiteUInt8");
-      return 1;
-    case kTfLiteInt64:
-      static_assert(sizeof(jlong) == 8,
-                    "Interal error: Java long not compatible with "
-                    "kTfLiteInt64");
-      return 8;
-    default:
-      return 0;
-  }
-}
-
-size_t writeByteBuffer(JNIEnv* env, jobject object, char** dst, int dst_size) {
-  char* buf = static_cast<char*>(env->GetDirectBufferAddress(object));
-  if (!buf) {
-    throwException(env, kIllegalArgumentException,
-                   "Input ByteBuffer is not a direct buffer");
-    return 0;
-  }
-  *dst = buf;
-  return dst_size;
-}
-
 size_t writeMultiDimensionalArray(JNIEnv* env, jobject src, TfLiteType type,
                                   int dims_left, char** dst, int dst_size) {
   if (dims_left <= 1) {
@@ -203,16 +190,37 @@ size_t writeMultiDimensionalArray(JNIEnv* env, jobject src, TfLiteType type,
   }
 }
 
+}  // namespace
+
 JNIEXPORT jobject JNICALL Java_org_tensorflow_lite_Tensor_buffer(JNIEnv* env,
                                                                  jclass clazz,
                                                                  jlong handle) {
   TfLiteTensor* tensor = convertLongToTensor(env, handle);
   if (tensor == nullptr) return nullptr;
-
+  if (tensor->data.raw == nullptr) {
+    throwException(env, kIllegalArgumentException,
+                   "Internal error: Tensor hasn't been allocated.");
+    return nullptr;
+  }
   return env->NewDirectByteBuffer(static_cast<void*>(tensor->data.raw),
                                   static_cast<jlong>(tensor->bytes));
 }
 
+JNIEXPORT void JNICALL Java_org_tensorflow_lite_Tensor_writeDirectBuffer(
+    JNIEnv* env, jclass clazz, jlong handle, jobject src) {
+  TfLiteTensor* tensor = convertLongToTensor(env, handle);
+  if (tensor == nullptr) return;
+
+  char* src_data_raw = static_cast<char*>(env->GetDirectBufferAddress(src));
+  if (!src_data_raw) {
+    throwException(env, kIllegalArgumentException,
+                   "Input ByteBuffer is not a direct buffer");
+    return;
+  }
+
+  tensor->data.raw = src_data_raw;
+}
+
 JNIEXPORT void JNICALL
 Java_org_tensorflow_lite_Tensor_readMultiDimensionalArray(JNIEnv* env,
                                                           jclass clazz,
@@ -230,6 +238,27 @@ Java_org_tensorflow_lite_Tensor_readMultiDimensionalArray(JNIEnv* env,
                             num_dims, static_cast<jarray>(value));
 }
 
+JNIEXPORT void JNICALL
+Java_org_tensorflow_lite_Tensor_writeMultiDimensionalArray(JNIEnv* env,
+                                                           jclass clazz,
+                                                           jlong handle,
+                                                           jobject src) {
+  TfLiteTensor* tensor = convertLongToTensor(env, handle);
+  if (tensor == nullptr) return;
+  if (tensor->data.raw == nullptr) {
+    throwException(env, kIllegalArgumentException,
+                   "Internal error: Target Tensor hasn't been allocated.");
+    return;
+  }
+  if (tensor->dims->size == 0) {
+    throwException(env, kIllegalArgumentException,
+                   "Internal error: Cannot copy empty/scalar Tensors.");
+    return;
+  }
+  writeMultiDimensionalArray(env, src, tensor->type, tensor->dims->size,
+                             &tensor->data.raw, tensor->bytes);
+}
+
 JNIEXPORT jint JNICALL Java_org_tensorflow_lite_Tensor_dtype(JNIEnv* env,
                                                              jclass clazz,
                                                              jlong handle) {
@@ -247,3 +276,11 @@ Java_org_tensorflow_lite_Tensor_shape(JNIEnv* env, jclass clazz, jlong handle) {
   env->SetIntArrayRegion(result, 0, num_dims, tensor->dims->data);
   return result;
 }
+
+JNIEXPORT jint JNICALL Java_org_tensorflow_lite_Tensor_numBytes(JNIEnv* env,
+                                                                jclass clazz,
+                                                                jlong handle) {
+  const TfLiteTensor* tensor = convertLongToTensor(env, handle);
+  if (tensor == nullptr) return 0;
+  return static_cast<jint>(tensor->bytes);
+}
diff --git a/tensorflow/contrib/lite/java/src/main/native/tensor_jni.h b/tensorflow/contrib/lite/java/src/main/native/tensor_jni.h
index 9ba95d9ac4..06e2546af8 100644
--- a/tensorflow/contrib/lite/java/src/main/native/tensor_jni.h
+++ b/tensorflow/contrib/lite/java/src/main/native/tensor_jni.h
@@ -34,6 +34,14 @@ JNIEXPORT jobject JNICALL Java_org_tensorflow_lite_Tensor_buffer(JNIEnv* env,
 
 /*
  *  Class:     org_tensorflow_lite_Tensor
+ *  Method:    writeDirectBuffer
+ *  Signature: (JLjava/nio/ByteBuffer;)
+ */
+JNIEXPORT void JNICALL Java_org_tensorflow_lite_Tensor_writeDirectBuffer(
+    JNIEnv* env, jclass clazz, jlong handle, jobject src);
+
+/*
+ *  Class:     org_tensorflow_lite_Tensor
  *  Method:    dtype
  *  Signature: (J)I
  */
@@ -52,6 +60,15 @@ JNIEXPORT jintArray JNICALL Java_org_tensorflow_lite_Tensor_shape(JNIEnv* env,
 
 /*
  *  Class:     org_tensorflow_lite_Tensor
+ *  Method:    numBytes
+ *  Signature: (J)I
+ */
+JNIEXPORT jint JNICALL Java_org_tensorflow_lite_Tensor_numBytes(JNIEnv* env,
+                                                                jclass clazz,
+                                                                jlong handle);
+
+/*
+ *  Class:     org_tensorflow_lite_Tensor
  *  Method:    readMultiDimensionalArray
  *  Signature: (JLjava/lang/Object;)
  */
@@ -59,23 +76,18 @@ JNIEXPORT void JNICALL
 Java_org_tensorflow_lite_Tensor_readMultiDimensionalArray(JNIEnv* env,
                                                           jclass clazz,
                                                           jlong handle,
-                                                          jobject value);
+                                                          jobject dst);
 
 /*
- * Finds the size of each data type.
- */
-size_t elementByteSize(TfLiteType data_type);
-
-/*
- * Writes data of a ByteBuffer into dest.
- */
-size_t writeByteBuffer(JNIEnv* env, jobject object, char** dst, int dst_size);
-
-/*
- * Writes a multi-dimensional array into dest.
+ *  Class:     org_tensorflow_lite_Tensor
+ *  Method:    writeMultidimensionalArray
+ *  Signature: (JLjava/lang/Object;)
  */
-size_t writeMultiDimensionalArray(JNIEnv* env, jobject src, TfLiteType type,
-                                  int dims_left, char** dst, int dst_size);
+JNIEXPORT void JNICALL
+Java_org_tensorflow_lite_Tensor_writeMultiDimensionalArray(JNIEnv* env,
+                                                           jclass clazz,
+                                                           jlong handle,
+                                                           jobject src);
 
 #ifdef __cplusplus
 }  // extern "C"
diff --git a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
index e6deadffe2..d66a73db94 100644
--- a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
+++ b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
@@ -221,7 +221,9 @@ public final class InterpreterTest {
       assertThat(e)
           .hasMessageThat()
           .contains(
-              "DataType (2) of input data does not match with the DataType (1) of model inputs.");
+              "Cannot convert between a TensorFlowLite tensor with type "
+                  + "FLOAT32 and a Java object of type [[[[I (which is compatible with the"
+                  + " TensorFlowLite type INT32)");
     }
     interpreter.close();
   }
@@ -241,8 +243,8 @@ public final class InterpreterTest {
       assertThat(e)
           .hasMessageThat()
           .contains(
-              "Cannot convert an TensorFlowLite tensor with type "
-                  + "FLOAT32 to a Java object of type [[[[I (which is compatible with the"
+              "Cannot convert between a TensorFlowLite tensor with type "
+                  + "FLOAT32 and a Java object of type [[[[I (which is compatible with the"
                   + " TensorFlowLite type INT32)");
     }
     interpreter.close();
@@ -329,4 +331,11 @@ public final class InterpreterTest {
     interpreter.close();
     fileChannel.close();
   }
+
+  @Test
+  public void testRedundantClose() throws Exception {
+    Interpreter interpreter = new Interpreter(MODEL_FILE);
+    interpreter.close();
+    interpreter.close();
+  }
 }
diff --git a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
index 029e5853e2..9c4a5acd79 100644
--- a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
+++ b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
@@ -20,6 +20,8 @@ import static org.junit.Assert.fail;
 
 import java.nio.ByteBuffer;
 import java.nio.ByteOrder;
+import java.util.HashMap;
+import java.util.Map;
 import org.junit.Test;
 import org.junit.runner.RunWith;
 import org.junit.runners.JUnit4;
@@ -101,10 +103,10 @@ public final class NativeInterpreterWrapperTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
     float[][][][] parsedOutputs = new float[2][8][8][3];
-    outputs[0].copyTo(parsedOutputs);
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     float[] outputOneD = parsedOutputs[0][0][0];
     float[] expected = {3.69f, -19.62f, 23.43f};
     assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
@@ -119,11 +121,11 @@ public final class NativeInterpreterWrapperTest {
       float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
       float[][][][] fourD = {threeD, threeD};
       Object[] inputs = {fourD};
-      Tensor[] outputs = wrapper.run(inputs);
-      assertThat(outputs).hasLength(1);
       ByteBuffer parsedOutput =
           ByteBuffer.allocateDirect(2 * 8 * 8 * 3 * 4).order(ByteOrder.nativeOrder());
-      outputs[0].copyTo(parsedOutput);
+      Map<Integer, Object> outputs = new HashMap<>();
+      outputs.put(0, parsedOutput);
+      wrapper.run(inputs, outputs);
       float[] outputOneD = {
         parsedOutput.getFloat(0), parsedOutput.getFloat(4), parsedOutput.getFloat(8)
       };
@@ -140,17 +142,16 @@ public final class NativeInterpreterWrapperTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
     float[][][][] parsedOutputs = new float[2][8][8][3];
-    outputs[0].copyTo(parsedOutputs);
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     float[] outputOneD = parsedOutputs[0][0][0];
     float[] expected = {3.69f, -19.62f, 23.43f};
     assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
-    outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
     parsedOutputs = new float[2][8][8][3];
-    outputs[0].copyTo(parsedOutputs);
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     outputOneD = parsedOutputs[0][0][0];
     assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
     wrapper.close();
@@ -164,10 +165,10 @@ public final class NativeInterpreterWrapperTest {
     int[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     int[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
     int[][][][] parsedOutputs = new int[2][4][4][12];
-    outputs[0].copyTo(parsedOutputs);
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     int[] outputOneD = parsedOutputs[0][0][0];
     int[] expected = {3, 7, -4, 3, 7, -4, 3, 7, -4, 3, 7, -4};
     assertThat(outputOneD).isEqualTo(expected);
@@ -182,10 +183,10 @@ public final class NativeInterpreterWrapperTest {
     long[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     long[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
     long[][][][] parsedOutputs = new long[2][4][4][12];
-    outputs[0].copyTo(parsedOutputs);
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     long[] outputOneD = parsedOutputs[0][0][0];
     long[] expected = {-892834092L, 923423L, 2123918239018L, -892834092L, 923423L, 2123918239018L,
                        -892834092L, 923423L, 2123918239018L, -892834092L, 923423L, 2123918239018L};
@@ -203,10 +204,10 @@ public final class NativeInterpreterWrapperTest {
     Object[] inputs = {fourD};
     int[] inputDims = {2, 8, 8, 3};
     wrapper.resizeInput(0, inputDims);
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
     byte[][][][] parsedOutputs = new byte[2][4][4][12];
-    outputs[0].copyTo(parsedOutputs);
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     byte[] outputOneD = parsedOutputs[0][0][0];
     byte[] expected = {(byte) 0xe0, 0x4f, (byte) 0xd0, (byte) 0xe0, 0x4f, (byte) 0xd0,
                        (byte) 0xe0, 0x4f, (byte) 0xd0, (byte) 0xe0, 0x4f, (byte) 0xd0};
@@ -229,13 +230,14 @@ public final class NativeInterpreterWrapperTest {
         }
       }
     }
+    bbuf.rewind();
     Object[] inputs = {bbuf};
     int[] inputDims = {2, 8, 8, 3};
     wrapper.resizeInput(0, inputDims);
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
     byte[][][][] parsedOutputs = new byte[2][4][4][12];
-    outputs[0].copyTo(parsedOutputs);
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     byte[] outputOneD = parsedOutputs[0][0][0];
     byte[] expected = {
       (byte) 0xe0, 0x4f, (byte) 0xd0, (byte) 0xe0, 0x4f, (byte) 0xd0,
@@ -261,21 +263,22 @@ public final class NativeInterpreterWrapperTest {
       }
     }
     Object[] inputs = {bbuf};
+    float[][][][] parsedOutputs = new float[4][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
       assertThat(e)
           .hasMessageThat()
           .contains(
-              "Failed to get input dimensions. 0-th input should have 768 bytes, but found 3072 bytes");
+              "Cannot convert between a TensorFlowLite buffer with 768 bytes and a "
+                  + "ByteBuffer with 3072 bytes.");
     }
     int[] inputDims = {4, 8, 8, 3};
     wrapper.resizeInput(0, inputDims);
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
-    float[][][][] parsedOutputs = new float[4][8][8][3];
-    outputs[0].copyTo(parsedOutputs);
+    wrapper.run(inputs, outputs);
     float[] outputOneD = parsedOutputs[0][0][0];
     float[] expected = {3.69f, -19.62f, 23.43f};
     assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
@@ -288,14 +291,18 @@ public final class NativeInterpreterWrapperTest {
     ByteBuffer bbuf = ByteBuffer.allocateDirect(2 * 7 * 8 * 3);
     bbuf.order(ByteOrder.nativeOrder());
     Object[] inputs = {bbuf};
+    Map<Integer, Object> outputs = new HashMap<>();
+    ByteBuffer parsedOutput = ByteBuffer.allocateDirect(2 * 7 * 8 * 3);
+    outputs.put(0, parsedOutput);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
       assertThat(e)
           .hasMessageThat()
           .contains(
-              "Failed to get input dimensions. 0-th input should have 192 bytes, but found 336 bytes.");
+              "Cannot convert between a TensorFlowLite buffer with 192 bytes and a "
+                  + "ByteBuffer with 336 bytes.");
     }
     wrapper.close();
   }
@@ -308,14 +315,18 @@ public final class NativeInterpreterWrapperTest {
     int[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     int[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
+    int[][][][] parsedOutputs = new int[2][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
       assertThat(e)
           .hasMessageThat()
           .contains(
-              "DataType (2) of input data does not match with the DataType (1) of model inputs.");
+              "Cannot convert between a TensorFlowLite tensor with type FLOAT32 and a Java object "
+                  + "of type [[[[I (which is compatible with the TensorFlowLite type INT32)");
     }
     wrapper.close();
   }
@@ -329,8 +340,11 @@ public final class NativeInterpreterWrapperTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
       assertThat(e).hasMessageThat().contains("Invalid handle to Interpreter.");
@@ -342,7 +356,7 @@ public final class NativeInterpreterWrapperTest {
     NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
     try {
       Object[] inputs = {};
-      wrapper.run(inputs);
+      wrapper.run(inputs, null);
       fail();
     } catch (IllegalArgumentException e) {
       assertThat(e).hasMessageThat().contains("Inputs should not be null or empty.");
@@ -358,11 +372,14 @@ public final class NativeInterpreterWrapperTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD, fourD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
-      assertThat(e).hasMessageThat().contains("Expected num of inputs is 1 but got 2");
+      assertThat(e).hasMessageThat().contains("Invalid input Tensor index: 1");
     }
     wrapper.close();
   }
@@ -374,13 +391,18 @@ public final class NativeInterpreterWrapperTest {
     float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD};
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     Object[] inputs = {threeD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
       assertThat(e)
           .hasMessageThat()
-          .contains("0-th input should have 4 dimensions, but found 3 dimensions");
+          .contains(
+              "Cannot copy between a TensorFlowLite tensor with shape [8, 7, 3] and a "
+                  + "Java object with shape [2, 8, 8, 3].");
     }
     wrapper.close();
   }
@@ -393,92 +415,23 @@ public final class NativeInterpreterWrapperTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
       assertThat(e)
           .hasMessageThat()
-          .contains("0-th input dimension should be [?,8,8,3], but found [?,8,7,3]");
+          .contains(
+              "Cannot copy between a TensorFlowLite tensor with shape [2, 8, 7, 3] and a "
+                  + "Java object with shape [2, 8, 8, 3].");
     }
     wrapper.close();
   }
 
   @Test
-  public void testNumElements() {
-    int[] shape = {2, 3, 4};
-    int num = NativeInterpreterWrapper.numElements(shape);
-    assertThat(num).isEqualTo(24);
-    shape = null;
-    num = NativeInterpreterWrapper.numElements(shape);
-    assertThat(num).isEqualTo(0);
-  }
-
-  @Test
-  public void testIsNonEmtpyArray() {
-    assertThat(NativeInterpreterWrapper.isNonEmptyArray(null)).isFalse();
-    assertThat(NativeInterpreterWrapper.isNonEmptyArray(3.2)).isFalse();
-    int[] emptyArray = {};
-    assertThat(NativeInterpreterWrapper.isNonEmptyArray(emptyArray)).isFalse();
-    int[] validArray = {9, 5, 2, 1};
-    assertThat(NativeInterpreterWrapper.isNonEmptyArray(validArray)).isTrue();
-  }
-
-  @Test
-  public void testDataTypeOf() {
-    float[] testEmtpyArray = {};
-    DataType dataType = NativeInterpreterWrapper.dataTypeOf(testEmtpyArray);
-    assertThat(dataType).isEqualTo(DataType.FLOAT32);
-    float[] testFloatArray = {0.783f, 0.251f};
-    dataType = NativeInterpreterWrapper.dataTypeOf(testFloatArray);
-    assertThat(dataType).isEqualTo(DataType.FLOAT32);
-    float[][] testMultiDimArray = {testFloatArray, testFloatArray, testFloatArray};
-    dataType = NativeInterpreterWrapper.dataTypeOf(testFloatArray);
-    assertThat(dataType).isEqualTo(DataType.FLOAT32);
-    try {
-      double[] testDoubleArray = {0.783, 0.251};
-      NativeInterpreterWrapper.dataTypeOf(testDoubleArray);
-      fail();
-    } catch (IllegalArgumentException e) {
-      assertThat(e).hasMessageThat().contains("cannot resolve DataType of");
-    }
-    try {
-      Float[] testBoxedArray = {0.783f, 0.251f};
-      NativeInterpreterWrapper.dataTypeOf(testBoxedArray);
-      fail();
-    } catch (IllegalArgumentException e) {
-      assertThat(e).hasMessageThat().contains("cannot resolve DataType of [Ljava.lang.Float;");
-    }
-  }
-
-  @Test
-  public void testNumDimensions() {
-    int scalar = 1;
-    assertThat(NativeInterpreterWrapper.numDimensions(scalar)).isEqualTo(0);
-    int[][] array = {{2, 4}, {1, 9}};
-    assertThat(NativeInterpreterWrapper.numDimensions(array)).isEqualTo(2);
-    try {
-      int[] emptyArray = {};
-      NativeInterpreterWrapper.numDimensions(emptyArray);
-      fail();
-    } catch (IllegalArgumentException e) {
-      assertThat(e).hasMessageThat().contains("Array lengths cannot be 0.");
-    }
-  }
-
-  @Test
-  public void testFillShape() {
-    int[][][] array = {{{23}, {14}, {87}}, {{12}, {42}, {31}}};
-    int num = NativeInterpreterWrapper.numDimensions(array);
-    int[] shape = new int[num];
-    NativeInterpreterWrapper.fillShape(array, 0, shape);
-    assertThat(num).isEqualTo(3);
-    assertThat(shape[0]).isEqualTo(2);
-    assertThat(shape[1]).isEqualTo(3);
-    assertThat(shape[2]).isEqualTo(1);
-  }
-
-  @Test
   public void testGetInferenceLatency() {
     NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
     float[] oneD = {1.23f, 6.54f, 7.81f};
@@ -486,8 +439,10 @@ public final class NativeInterpreterWrapperTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
-    Tensor[] outputs = wrapper.run(inputs);
-    assertThat(outputs.length).isEqualTo(1);
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
+    wrapper.run(inputs, outputs);
     assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isGreaterThan(0L);
     wrapper.close();
   }
@@ -507,13 +462,14 @@ public final class NativeInterpreterWrapperTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, parsedOutputs);
     try {
-      wrapper.run(inputs);
+      wrapper.run(inputs, outputs);
       fail();
     } catch (IllegalArgumentException e) {
-      assertThat(e)
-          .hasMessageThat()
-          .contains("0-th input dimension should be [?,8,8,3], but found [?,8,7,3]");
+      // Expected.
     }
     assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isNull();
     wrapper.close();
@@ -523,41 +479,7 @@ public final class NativeInterpreterWrapperTest {
   public void testGetInputDims() {
     NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
     int[] expectedDims = {1, 8, 8, 3};
-    assertThat(wrapper.getInputDims(0)).isEqualTo(expectedDims);
-    wrapper.close();
-  }
-
-  @Test
-  public void testGetInputDimsOutOfRange() {
-    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
-    try {
-      wrapper.getInputDims(-1);
-      fail();
-    } catch (IllegalArgumentException e) {
-      assertThat(e).hasMessageThat().contains("Out of range");
-    }
-    try {
-      wrapper.getInputDims(1);
-      fail();
-    } catch (IllegalArgumentException e) {
-      assertThat(e).hasMessageThat().contains("Out of range");
-    }
-    wrapper.close();
-  }
-
-  @Test
-  public void testGetOutputDataType() {
-    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
-    assertThat(wrapper.getOutputDataType(0)).contains("float");
-    wrapper.close();
-    wrapper = new NativeInterpreterWrapper(LONG_MODEL_PATH);
-    assertThat(wrapper.getOutputDataType(0)).contains("long");
-    wrapper.close();
-    wrapper = new NativeInterpreterWrapper(INT_MODEL_PATH);
-    assertThat(wrapper.getOutputDataType(0)).contains("int");
-    wrapper.close();
-    wrapper = new NativeInterpreterWrapper(BYTE_MODEL_PATH);
-    assertThat(wrapper.getOutputDataType(0)).contains("byte");
+    assertThat(wrapper.getInputTensor(0).shape()).isEqualTo(expectedDims);
     wrapper.close();
   }
 
diff --git a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/TensorTest.java b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/TensorTest.java
index dd9d37eeda..71ef044943 100644
--- a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/TensorTest.java
+++ b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/TensorTest.java
@@ -18,9 +18,10 @@ package org.tensorflow.lite;
 import static com.google.common.truth.Truth.assertThat;
 import static org.junit.Assert.fail;
 
-import java.nio.BufferOverflowException;
 import java.nio.ByteBuffer;
 import java.nio.ByteOrder;
+import java.util.HashMap;
+import java.util.Map;
 import org.junit.After;
 import org.junit.Before;
 import org.junit.Test;
@@ -35,7 +36,7 @@ public final class TensorTest {
       "tensorflow/contrib/lite/java/src/testdata/add.bin";
 
   private NativeInterpreterWrapper wrapper;
-  private long nativeHandle;
+  private Tensor tensor;
 
   @Before
   public void setUp() {
@@ -45,8 +46,10 @@ public final class TensorTest {
     float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
     float[][][][] fourD = {threeD, threeD};
     Object[] inputs = {fourD};
-    Tensor[] outputs = wrapper.run(inputs);
-    nativeHandle = outputs[0].nativeHandle;
+    Map<Integer, Object> outputs = new HashMap<>();
+    outputs.put(0, new float[2][8][8][3]);
+    wrapper.run(inputs, outputs);
+    tensor = wrapper.getOutputTensor(0);
   }
 
   @After
@@ -55,17 +58,16 @@ public final class TensorTest {
   }
 
   @Test
-  public void testFromHandle() throws Exception {
-    Tensor tensor = Tensor.fromHandle(nativeHandle);
+  public void testBasic() throws Exception {
     assertThat(tensor).isNotNull();
     int[] expectedShape = {2, 8, 8, 3};
-    assertThat(tensor.shapeCopy).isEqualTo(expectedShape);
-    assertThat(tensor.dtype).isEqualTo(DataType.FLOAT32);
+    assertThat(tensor.shape()).isEqualTo(expectedShape);
+    assertThat(tensor.dataType()).isEqualTo(DataType.FLOAT32);
+    assertThat(tensor.numBytes()).isEqualTo(2 * 8 * 8 * 3 * 4);
   }
 
   @Test
   public void testCopyTo() {
-    Tensor tensor = Tensor.fromHandle(nativeHandle);
     float[][][][] parsedOutputs = new float[2][8][8][3];
     tensor.copyTo(parsedOutputs);
     float[] outputOneD = parsedOutputs[0][0][0];
@@ -75,7 +77,6 @@ public final class TensorTest {
 
   @Test
   public void testCopyToByteBuffer() {
-    Tensor tensor = Tensor.fromHandle(nativeHandle);
     ByteBuffer parsedOutput =
         ByteBuffer.allocateDirect(2 * 8 * 8 * 3 * 4).order(ByteOrder.nativeOrder());
     tensor.copyTo(parsedOutput);
@@ -89,19 +90,17 @@ public final class TensorTest {
 
   @Test
   public void testCopyToInvalidByteBuffer() {
-    Tensor tensor = Tensor.fromHandle(nativeHandle);
     ByteBuffer parsedOutput = ByteBuffer.allocateDirect(3 * 4).order(ByteOrder.nativeOrder());
     try {
       tensor.copyTo(parsedOutput);
       fail();
-    } catch (BufferOverflowException e) {
+    } catch (IllegalArgumentException e) {
       // Expected.
     }
   }
 
   @Test
   public void testCopyToWrongType() {
-    Tensor tensor = Tensor.fromHandle(nativeHandle);
     int[][][][] parsedOutputs = new int[2][8][8][3];
     try {
       tensor.copyTo(parsedOutputs);
@@ -110,15 +109,13 @@ public final class TensorTest {
       assertThat(e)
           .hasMessageThat()
           .contains(
-              "Cannot convert an TensorFlowLite tensor with type "
-                  + "FLOAT32 to a Java object of type [[[[I (which is compatible with the TensorFlowLite "
-                  + "type INT32)");
+              "Cannot convert between a TensorFlowLite tensor with type FLOAT32 and a Java object "
+                  + "of type [[[[I (which is compatible with the TensorFlowLite type INT32)");
     }
   }
 
   @Test
   public void testCopyToWrongShape() {
-    Tensor tensor = Tensor.fromHandle(nativeHandle);
     float[][][][] parsedOutputs = new float[1][8][8][3];
     try {
       tensor.copyTo(parsedOutputs);
@@ -127,8 +124,104 @@ public final class TensorTest {
       assertThat(e)
           .hasMessageThat()
           .contains(
-              "Shape of output target [1, 8, 8, 3] does not match "
-                  + "with the shape of the Tensor [2, 8, 8, 3].");
+              "Cannot copy between a TensorFlowLite tensor with shape [2, 8, 8, 3] "
+                  + "and a Java object with shape [1, 8, 8, 3].");
+    }
+  }
+
+  @Test
+  public void testSetTo() {
+    float[][][][] input = new float[2][8][8][3];
+    float[][][][] output = new float[2][8][8][3];
+    ByteBuffer inputByteBuffer =
+        ByteBuffer.allocateDirect(2 * 8 * 8 * 3 * 4).order(ByteOrder.nativeOrder());
+
+    input[0][0][0][0] = 2.0f;
+    tensor.setTo(input);
+    tensor.copyTo(output);
+    assertThat(output[0][0][0][0]).isEqualTo(2.0f);
+
+    inputByteBuffer.putFloat(0, 3.0f);
+    tensor.setTo(inputByteBuffer);
+    tensor.copyTo(output);
+    assertThat(output[0][0][0][0]).isEqualTo(3.0f);
+  }
+
+  @Test
+  public void testSetToInvalidByteBuffer() {
+    ByteBuffer input = ByteBuffer.allocateDirect(3 * 4).order(ByteOrder.nativeOrder());
+    try {
+      tensor.setTo(input);
+      fail();
+    } catch (IllegalArgumentException e) {
+      // Success.
+    }
+  }
+
+  @Test
+  public void testGetInputShapeIfDifferent() {
+    ByteBuffer bytBufferInput = ByteBuffer.allocateDirect(3 * 4).order(ByteOrder.nativeOrder());
+    assertThat(tensor.getInputShapeIfDifferent(bytBufferInput)).isNull();
+
+    float[][][][] sameShapeInput = new float[2][8][8][3];
+    assertThat(tensor.getInputShapeIfDifferent(sameShapeInput)).isNull();
+
+    float[][][][] differentShapeInput = new float[1][8][8][3];
+    assertThat(tensor.getInputShapeIfDifferent(differentShapeInput))
+        .isEqualTo(new int[] {1, 8, 8, 3});
+  }
+
+  @Test
+  public void testDataTypeOf() {
+    float[] testEmptyArray = {};
+    DataType dataType = Tensor.dataTypeOf(testEmptyArray);
+    assertThat(dataType).isEqualTo(DataType.FLOAT32);
+    float[] testFloatArray = {0.783f, 0.251f};
+    dataType = Tensor.dataTypeOf(testFloatArray);
+    assertThat(dataType).isEqualTo(DataType.FLOAT32);
+    float[][] testMultiDimArray = {testFloatArray, testFloatArray, testFloatArray};
+    dataType = Tensor.dataTypeOf(testFloatArray);
+    assertThat(dataType).isEqualTo(DataType.FLOAT32);
+    try {
+      double[] testDoubleArray = {0.783, 0.251};
+      Tensor.dataTypeOf(testDoubleArray);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e).hasMessageThat().contains("cannot resolve DataType of");
+    }
+    try {
+      Float[] testBoxedArray = {0.783f, 0.251f};
+      Tensor.dataTypeOf(testBoxedArray);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e).hasMessageThat().contains("cannot resolve DataType of [Ljava.lang.Float;");
     }
   }
+
+  @Test
+  public void testNumDimensions() {
+    int scalar = 1;
+    assertThat(Tensor.numDimensions(scalar)).isEqualTo(0);
+    int[][] array = {{2, 4}, {1, 9}};
+    assertThat(Tensor.numDimensions(array)).isEqualTo(2);
+    try {
+      int[] emptyArray = {};
+      Tensor.numDimensions(emptyArray);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e).hasMessageThat().contains("Array lengths cannot be 0.");
+    }
+  }
+
+  @Test
+  public void testFillShape() {
+    int[][][] array = {{{23}, {14}, {87}}, {{12}, {42}, {31}}};
+    int num = Tensor.numDimensions(array);
+    int[] shape = new int[num];
+    Tensor.fillShape(array, 0, shape);
+    assertThat(num).isEqualTo(3);
+    assertThat(shape[0]).isEqualTo(2);
+    assertThat(shape[1]).isEqualTo(3);
+    assertThat(shape[2]).isEqualTo(1);
+  }
 }
diff --git a/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java b/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
index 3aef0c3bb6..c23521c077 100644
--- a/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
+++ b/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
@@ -58,7 +58,7 @@ public class TestHelper {
    */
   public static int[] getInputDims(Interpreter interpreter, int index) {
     if (interpreter != null && interpreter.wrapper != null) {
-      return interpreter.wrapper.getInputDims(index);
+      return interpreter.wrapper.getInputTensor(index).shape();
     } else {
       throw new IllegalArgumentException(
           "Interpreter has not initialized;" + " Failed to get input dimensions.");
@@ -77,7 +77,7 @@ public class TestHelper {
    */
   public static String getOutputDataType(Interpreter interpreter, int index) {
     if (interpreter != null && interpreter.wrapper != null) {
-      return interpreter.wrapper.getOutputDataType(index);
+      return interpreter.wrapper.getOutputTensor(index).dataType().toStringName();
     } else {
       throw new IllegalArgumentException(
           "Interpreter has not initialized;" + " Failed to get output data type.");
diff --git a/tensorflow/contrib/lite/kernels/BUILD b/tensorflow/contrib/lite/kernels/BUILD
index 27b8a16e15..40adaa4994 100644
--- a/tensorflow/contrib/lite/kernels/BUILD
+++ b/tensorflow/contrib/lite/kernels/BUILD
@@ -8,11 +8,24 @@ load("//tensorflow/contrib/lite:build_def.bzl", "tflite_copts")
 load("//tensorflow/contrib/lite:special_rules.bzl", "tflite_portable_test_suite")
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 
+# Suppress warnings that are introduced by Eigen Tensor.
+EXTRA_EIGEN_COPTS = select({
+    "//tensorflow:ios": [
+        "-Wno-error=invalid-partial-specialization",
+        "-Wno-error=reorder",
+    ],
+    "//tensorflow:windows": [],
+    "//conditions:default": ["-Wno-error=reorder"],
+})
+
 tf_cc_test(
     name = "optional_tensor_test",
     size = "small",
     srcs = ["optional_tensor_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -46,11 +59,12 @@ cc_library(
     hdrs = [
         "eigen_support.h",
     ],
-    copts = tflite_copts(),
+    copts = tflite_copts() + EXTRA_EIGEN_COPTS,
     deps = [
         ":op_macros",
+        "//tensorflow/contrib/lite:arena_planner",
         "//tensorflow/contrib/lite:context",
-        "//third_party/eigen3",
+        "//tensorflow/contrib/lite/kernels/internal:optimized",
     ],
 )
 
@@ -106,7 +120,10 @@ tf_cc_test(
     name = "kernel_util_test",
     size = "small",
     srcs = ["kernel_util_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":kernel_util",
         "//tensorflow/contrib/lite/testing:util",
@@ -118,6 +135,7 @@ tf_cc_test(
     name = "test_util_test",
     size = "small",
     srcs = ["test_util_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":test_util",
         "//tensorflow/contrib/lite/testing:util",
@@ -130,7 +148,7 @@ cc_library(
     srcs = [
         "activations.cc",
         "add.cc",
-        "arg_max.cc",
+        "arg_min_max.cc",
         "audio_spectrogram.cc",
         "basic_rnn.cc",
         "batch_to_space_nd.cc",
@@ -149,18 +167,22 @@ cc_library(
         "embedding_lookup_sparse.cc",
         "exp.cc",
         "expand_dims.cc",
+        "fake_quant.cc",
         "floor.cc",
         "fully_connected.cc",
         "gather.cc",
         "hashtable_lookup.cc",
         "l2norm.cc",
         "local_response_norm.cc",
+        "logical.cc",
         "lsh_projection.cc",
         "lstm.cc",
         "maximum_minimum.cc",
         "mfcc.cc",
         "mul.cc",
         "neg.cc",
+        "one_hot.cc",
+        "pack.cc",
         "pad.cc",
         "pooling.cc",
         "pow.cc",
@@ -191,14 +213,7 @@ cc_library(
         "padding.h",
         "register.h",
     ],
-    # Suppress warnings that are introduced by Eigen Tensor.
-    copts = tflite_copts() + [
-        "-Wno-error=reorder",
-    ] + select({
-        "//tensorflow:ios": ["-Wno-error=invalid-partial-specialization"],
-        "//conditions:default": [
-        ],
-    }),
+    copts = tflite_copts() + EXTRA_EIGEN_COPTS,
     deps = [
         ":activation_functor",
         ":eigen_support",
@@ -225,7 +240,10 @@ tf_cc_test(
     name = "audio_spectrogram_test",
     size = "small",
     srcs = ["audio_spectrogram_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -239,7 +257,10 @@ tf_cc_test(
     name = "mfcc_test",
     size = "small",
     srcs = ["mfcc_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -253,7 +274,10 @@ tf_cc_test(
     name = "detection_postprocess_test",
     size = "small",
     srcs = ["detection_postprocess_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -290,10 +314,11 @@ tf_cc_test(
 )
 
 tf_cc_test(
-    name = "arg_max_test",
+    name = "arg_min_max_test",
     size = "small",
-    srcs = ["arg_max_test.cc"],
+    srcs = ["arg_min_max_test.cc"],
     tags = [
+        "no_oss",
         "tflite_not_portable_ios",
     ],
     deps = [
@@ -308,7 +333,10 @@ tf_cc_test(
     name = "div_test",
     size = "small",
     srcs = ["div_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -321,7 +349,10 @@ tf_cc_test(
     name = "sub_test",
     size = "small",
     srcs = ["sub_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -334,7 +365,10 @@ tf_cc_test(
     name = "transpose_test",
     size = "small",
     srcs = ["transpose_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -349,7 +383,10 @@ tf_cc_test(
     name = "space_to_batch_nd_test",
     size = "small",
     srcs = ["space_to_batch_nd_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -362,7 +399,10 @@ tf_cc_test(
     name = "batch_to_space_nd_test",
     size = "small",
     srcs = ["batch_to_space_nd_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -375,7 +415,10 @@ tf_cc_test(
     name = "cast_test",
     size = "small",
     srcs = ["cast_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -428,7 +471,10 @@ tf_cc_test(
     name = "dequantize_test",
     size = "small",
     srcs = ["dequantize_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -455,7 +501,10 @@ tf_cc_test(
     name = "bidirectional_sequence_lstm_test",
     size = "small",
     srcs = ["bidirectional_sequence_lstm_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -468,7 +517,10 @@ tf_cc_test(
     name = "floor_test",
     size = "small",
     srcs = ["floor_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -481,7 +533,10 @@ tf_cc_test(
     name = "elementwise_test",
     size = "small",
     srcs = ["elementwise_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -494,7 +549,10 @@ tf_cc_test(
     name = "unidirectional_sequence_lstm_test",
     size = "small",
     srcs = ["unidirectional_sequence_lstm_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -508,6 +566,7 @@ tf_cc_test(
     size = "small",
     srcs = ["bidirectional_sequence_rnn_test.cc"],
     tags = [
+        "no_oss",
         "tflite_not_portable",
     ],
     deps = [
@@ -522,7 +581,10 @@ tf_cc_test(
     name = "unidirectional_sequence_rnn_test",
     size = "small",
     srcs = ["unidirectional_sequence_rnn_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -548,7 +610,26 @@ tf_cc_test(
     name = "exp_test",
     size = "small",
     srcs = ["exp_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "fake_quant_test",
+    size = "small",
+    srcs = ["fake_quant_test.cc"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -561,7 +642,10 @@ tf_cc_test(
     name = "maximum_minimum_test",
     size = "small",
     srcs = ["maximum_minimum_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -574,7 +658,10 @@ tf_cc_test(
     name = "reduce_test",
     size = "small",
     srcs = ["reduce_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -600,7 +687,10 @@ tf_cc_test(
     name = "pad_test",
     size = "small",
     srcs = ["pad_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -626,7 +716,10 @@ tf_cc_test(
     name = "gather_test",
     size = "small",
     srcs = ["gather_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -640,7 +733,10 @@ tf_cc_test(
     name = "topk_v2_test",
     size = "small",
     srcs = ["topk_v2_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -761,7 +857,10 @@ tf_cc_test(
     name = "log_softmax_test",
     size = "small",
     srcs = ["log_softmax_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -842,7 +941,10 @@ tf_cc_test(
     name = "split_test",
     size = "small",
     srcs = ["split_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -855,7 +957,10 @@ tf_cc_test(
     name = "squeeze_test",
     size = "small",
     srcs = ["squeeze_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -868,7 +973,10 @@ tf_cc_test(
     name = "strided_slice_test",
     size = "small",
     srcs = ["strided_slice_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -881,7 +989,10 @@ tf_cc_test(
     name = "tile_test",
     size = "small",
     srcs = ["tile_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -898,6 +1009,7 @@ tf_cc_test(
         "comparisons_test.cc",
     ],
     tags = [
+        "no_oss",
         "tflite_not_portable_ios",
     ],
     deps = [
@@ -912,7 +1024,10 @@ tf_cc_test(
     name = "neg_test",
     size = "small",
     srcs = ["neg_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
@@ -928,6 +1043,7 @@ tf_cc_test(
         "select_test.cc",
     ],
     tags = [
+        "no_oss",
         "tflite_not_portable_ios",
     ],
     deps = [
@@ -945,6 +1061,7 @@ tf_cc_test(
         "slice_test.cc",
     ],
     tags = [
+        "no_oss",
         "tflite_not_portable_ios",
     ],
     deps = [
@@ -959,12 +1076,14 @@ tf_cc_test(
     name = "transpose_conv_test",
     size = "small",
     srcs = ["transpose_conv_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite/kernels:test_util",
-        "@com_google_absl//absl/memory",
         "@com_google_googletest//:gtest",
     ],
 )
@@ -973,7 +1092,10 @@ tf_cc_test(
     name = "expand_dims_test",
     size = "small",
     srcs = ["expand_dims_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -987,7 +1109,10 @@ tf_cc_test(
     name = "sparse_to_dense_test",
     size = "small",
     srcs = ["sparse_to_dense_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -1001,7 +1126,10 @@ tf_cc_test(
     name = "shape_test",
     size = "small",
     srcs = ["shape_test.cc"],
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":builtin_ops",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -1015,6 +1143,50 @@ tf_cc_test(
     name = "pow_test",
     size = "small",
     srcs = ["pow_test.cc"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:builtin_op_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "pack_test",
+    size = "small",
+    srcs = ["pack_test.cc"],
+    tags = ["tflite_not_portable_ios"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:builtin_op_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "one_hot_test",
+    size = "small",
+    srcs = ["one_hot_test.cc"],
+    tags = ["tflite_not_portable_ios"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
+tf_cc_test(
+    name = "logical_test",
+    size = "small",
+    srcs = ["logical_test.cc"],
     tags = ["tflite_not_portable_ios"],
     deps = [
         ":builtin_ops",
diff --git a/tensorflow/contrib/lite/kernels/activations.cc b/tensorflow/contrib/lite/kernels/activations.cc
index 99f81c4a8a..6e13b8c667 100644
--- a/tensorflow/contrib/lite/kernels/activations.cc
+++ b/tensorflow/contrib/lite/kernels/activations.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -186,8 +185,8 @@ TfLiteStatus SoftmaxPrepare(TfLiteContext* context, TfLiteNode* node) {
   TfLiteTensor* output = GetOutput(context, node, 0);
   TF_LITE_ENSURE_EQ(context, input->type, output->type);
 
-  TF_LITE_ENSURE(context,
-                 NumDimensions(input) == 2 || NumDimensions(input) == 4);
+  const int num_dims = NumDimensions(input);
+  TF_LITE_ENSURE(context, num_dims == 1 || num_dims == 2 || num_dims == 4);
 
   if (input->type == kTfLiteUInt8) {
     TF_LITE_ENSURE_EQ(context, output->params.zero_point, 0);
@@ -365,13 +364,9 @@ TfLiteStatus SigmoidEval(TfLiteContext* context, TfLiteNode* node) {
   return kTfLiteOk;
 }
 
-// Takes a 2D tensor and perform softmax along the second dimension.
-void Softmax2DFloat(const TfLiteTensor* input, TfLiteTensor* output,
-                    TfLiteSoftmaxParams* params) {
-  const int batch_size = input->dims->data[0];
-  const int input_size = input->dims->data[1];
-  float* in = input->data.f;
-  float* out = output->data.f;
+// Performs softmax along the input of size (input_size * batch_size).
+void Softmax(const float* in, const int input_size, const int batch_size,
+             const float beta, float* out) {
   TF_LITE_ASSERT(input_size > 0);
 
   // For each batch
@@ -385,7 +380,7 @@ void Softmax2DFloat(const TfLiteTensor* input, TfLiteTensor* output,
     // Compute the normalized sum of exps.
     float exp_sum = 0.0;
     for (int i = 0; i < input_size; i++) {
-      out[i] = std::exp((in[i] - max_coeff) * params->beta);
+      out[i] = std::exp((in[i] - max_coeff) * beta);
       exp_sum += out[i];
     }
 
@@ -401,6 +396,33 @@ void Softmax2DFloat(const TfLiteTensor* input, TfLiteTensor* output,
   }
 }
 
+// Takes a 1D tensor and performs softmax along it.
+void Softmax1DFloat(const TfLiteTensor* input, TfLiteTensor* output,
+                    TfLiteSoftmaxParams* params) {
+  const int input_size = input->dims->data[0];
+  Softmax(input->data.f, input_size, 1, params->beta, output->data.f);
+}
+
+// Takes a 2D tensor and perform softmax along the last dimension.
+void Softmax2DFloat(const TfLiteTensor* input, TfLiteTensor* output,
+                    TfLiteSoftmaxParams* params) {
+  const int batch_size = input->dims->data[0];
+  const int input_size = input->dims->data[1];
+  Softmax(input->data.f, input_size, batch_size, params->beta, output->data.f);
+}
+
+void Softmax1DQuantized(const TfLiteTensor* input, TfLiteTensor* output,
+                        TfLiteSoftmaxParams* params, OpData* data) {
+  // TODO(ahentz): this is arguably a dirty trick. Since the implementation
+  // always traverses the last dimension of a 4D tensor, we will pretend our 1D
+  // tensor is 4D in a special way. We will convert a (Y) shape into a (1,
+  // 1, 1, Y) shape.
+  const int input_size = input->dims->data[0];
+  optimized_ops::Softmax(
+      GetTensorData<uint8_t>(input), GetTensorShape({1, 1, 1, input_size}),
+      data->input_multiplier, data->input_left_shift, data->diff_min,
+      GetTensorData<uint8_t>(output), GetTensorShape({1, 1, 1, input_size}));
+}
 void Softmax2DQuantized(const TfLiteTensor* input, TfLiteTensor* output,
                         TfLiteSoftmaxParams* params, OpData* data) {
   // TODO(ahentz): this is arguably a dirty trick. Since the implementation
@@ -443,6 +465,10 @@ TfLiteStatus SoftmaxEval(TfLiteContext* context, TfLiteNode* node) {
   // dimensions.
   switch (input->type) {
     case kTfLiteFloat32: {
+      if (NumDimensions(input) == 1) {
+        Softmax1DFloat(input, output, params);
+        return kTfLiteOk;
+      }
       if (NumDimensions(input) == 2) {
         Softmax2DFloat(input, output, params);
         return kTfLiteOk;
@@ -452,11 +478,15 @@ TfLiteStatus SoftmaxEval(TfLiteContext* context, TfLiteNode* node) {
         return kTfLiteOk;
       }
       context->ReportError(
-          context, "Only 2D and 4D tensors supported currently, got %dD.",
+          context, "Only 1D, 2D and 4D tensors supported currently, got %dD.",
           NumDimensions(input));
       return kTfLiteError;
     }
     case kTfLiteUInt8: {
+      if (NumDimensions(input) == 1) {
+        Softmax1DQuantized(input, output, params, data);
+        return kTfLiteOk;
+      }
       if (NumDimensions(input) == 2) {
         Softmax2DQuantized(input, output, params, data);
         return kTfLiteOk;
diff --git a/tensorflow/contrib/lite/kernels/activations_test.cc b/tensorflow/contrib/lite/kernels/activations_test.cc
index 587e1303da..083cdf78d7 100644
--- a/tensorflow/contrib/lite/kernels/activations_test.cc
+++ b/tensorflow/contrib/lite/kernels/activations_test.cc
@@ -339,6 +339,29 @@ TEST(QuantizedActivationsOpTest, Softmax4D) {
                   kQuantizedTolerance)));
 }
 
+TEST(FloatActivationsOpTest, Softmax1D) {
+  FloatActivationsOpModel m(0.1,
+                            /*input=*/{TensorType_FLOAT32, {8}});
+  m.SetInput({0, -6, 2, 4, 3, -2, 10, 1});
+  m.Invoke();
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray(ArrayFloatNear(
+          {.09752, .05352, .11911, .14548, .13164, .07984, .26509, .10778})));
+}
+
+TEST(QuantizedActivationsOpTest, Softmax1D) {
+  QuantizedActivationsOpModel m(0.1,
+                                /*input=*/{TensorType_UINT8, {8}, -10, 10});
+  m.SetInput<uint8_t>({0, -6, 2, 4, 3, -2, 10, 1});
+  m.Invoke();
+  EXPECT_THAT(
+      m.GetDequantizedOutput<uint8_t>(),
+      ElementsAreArray(ArrayFloatNear({0.09766, 0.05469, 0.12109, 0.14453,
+                                       0.13281, 0.07813, 0.26563, 0.10938},
+                                      kQuantizedTolerance)));
+}
+
 TEST(FloatActivationsOpTest, Softmax2D) {
   FloatActivationsOpModel m(0.1,
                             /*input=*/{TensorType_FLOAT32, {2, 4}});
diff --git a/tensorflow/contrib/lite/kernels/add.cc b/tensorflow/contrib/lite/kernels/add.cc
index f44d531cbf..af9b5c7013 100644
--- a/tensorflow/contrib/lite/kernels/add.cc
+++ b/tensorflow/contrib/lite/kernels/add.cc
@@ -110,15 +110,12 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
 
     QuantizeMultiplierSmallerThanOneExp(
         real_input1_multiplier, &data->input1_multiplier, &data->input1_shift);
-    data->input1_shift *= -1;
 
     QuantizeMultiplierSmallerThanOneExp(
         real_input2_multiplier, &data->input2_multiplier, &data->input2_shift);
-    data->input2_shift *= -1;
 
     QuantizeMultiplierSmallerThanOneExp(
         real_output_multiplier, &data->output_multiplier, &data->output_shift);
-    data->output_shift *= -1;
 
     CalculateActivationRangeUint8(params->activation, output,
                                   &data->output_activation_min,
@@ -152,14 +149,14 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
         CheckedLog2(output->params.scale, &output_scale_log2_rounded);
     TF_LITE_ENSURE(context, output_scale_is_pot);
 
-    data->input1_shift = output_scale_log2_rounded - input1_scale_log2_rounded;
-    data->input2_shift = output_scale_log2_rounded - input2_scale_log2_rounded;
+    data->input1_shift = input1_scale_log2_rounded - output_scale_log2_rounded;
+    data->input2_shift = input2_scale_log2_rounded - output_scale_log2_rounded;
 
     // Shifting of one input is supported. The graph quantization should ensure
     // that the other input matches the output.
     TF_LITE_ENSURE(context, data->input1_shift == 0 || data->input2_shift == 0);
-    TF_LITE_ENSURE(context, data->input1_shift >= 0);
-    TF_LITE_ENSURE(context, data->input2_shift >= 0);
+    TF_LITE_ENSURE(context, data->input1_shift <= 0);
+    TF_LITE_ENSURE(context, data->input2_shift <= 0);
 
     CalculateActivationRangeQuantized(context, params->activation, output,
                                       &data->output_activation_min,
@@ -173,24 +170,27 @@ template <KernelType kernel_type>
 void EvalAdd(TfLiteContext* context, TfLiteNode* node, TfLiteAddParams* params,
              const OpData* data, const TfLiteTensor* input1,
              const TfLiteTensor* input2, TfLiteTensor* output) {
-#define TF_LITE_ADD(type, opname, data_type)                            \
-  data_type output_activation_min, output_activation_max;               \
-  CalculateActivationRange(params->activation, &output_activation_min,  \
-                           &output_activation_max);                     \
-  type::opname(GetTensorData<data_type>(input1), GetTensorDims(input1), \
-               GetTensorData<data_type>(input2), GetTensorDims(input2), \
-               output_activation_min, output_activation_max,            \
-               GetTensorData<data_type>(output), GetTensorDims(output))
+#define TF_LITE_ADD(type, opname, data_type)                             \
+  data_type output_activation_min, output_activation_max;                \
+  CalculateActivationRange(params->activation, &output_activation_min,   \
+                           &output_activation_max);                      \
+  tflite::ArithmeticParams op_params;                                    \
+  SetActivationParams(output_activation_min, output_activation_max,      \
+                      &op_params);                                       \
+  type::opname(op_params, GetTensorShape(input1),                        \
+               GetTensorData<data_type>(input1), GetTensorShape(input2), \
+               GetTensorData<data_type>(input2), GetTensorShape(output), \
+               GetTensorData<data_type>(output))
   if (output->type == kTfLiteInt32) {
     if (kernel_type == kReference) {
       if (data->requires_broadcast) {
-        TF_LITE_ADD(reference_ops, BroadcastAdd, int32_t);
+        TF_LITE_ADD(reference_ops, BroadcastAdd4DSlow, int32_t);
       } else {
         TF_LITE_ADD(reference_ops, Add, int32_t);
       }
     } else {
       if (data->requires_broadcast) {
-        TF_LITE_ADD(optimized_ops, BroadcastAdd, int32_t);
+        TF_LITE_ADD(optimized_ops, BroadcastAdd4DSlow, int32_t);
       } else {
         TF_LITE_ADD(optimized_ops, Add, int32_t);
       }
@@ -198,13 +198,13 @@ void EvalAdd(TfLiteContext* context, TfLiteNode* node, TfLiteAddParams* params,
   } else if (output->type == kTfLiteFloat32) {
     if (kernel_type == kReference) {
       if (data->requires_broadcast) {
-        TF_LITE_ADD(reference_ops, BroadcastAdd, float);
+        TF_LITE_ADD(reference_ops, BroadcastAdd4DSlow, float);
       } else {
         TF_LITE_ADD(reference_ops, Add, float);
       }
     } else {
       if (data->requires_broadcast) {
-        TF_LITE_ADD(optimized_ops, BroadcastAdd, float);
+        TF_LITE_ADD(optimized_ops, BroadcastAdd4DSlow, float);
       } else {
         TF_LITE_ADD(optimized_ops, Add, float);
       }
@@ -220,30 +220,43 @@ TfLiteStatus EvalAddQuantized(TfLiteContext* context, TfLiteNode* node,
                               const TfLiteTensor* input2,
                               TfLiteTensor* output) {
   if (output->type == kTfLiteUInt8) {
-#define TF_LITE_ADD(type, opname)                                              \
-  type::opname(                                                                \
-      data->left_shift, GetTensorData<uint8_t>(input1), GetTensorDims(input1), \
-      data->input1_offset, data->input1_multiplier, data->input1_shift,        \
-      GetTensorData<uint8_t>(input2), GetTensorDims(input2),                   \
-      data->input2_offset, data->input2_multiplier, data->input2_shift,        \
-      data->output_offset, data->output_multiplier, data->output_shift,        \
-      data->output_activation_min, data->output_activation_max,                \
-      GetTensorData<uint8_t>(output), GetTensorDims(output));
+#define TF_LITE_ADD(type, opname)                                      \
+  tflite::ArithmeticParams op_params;                                  \
+  op_params.left_shift = data->left_shift;                             \
+  op_params.input1_offset = data->input1_offset;                       \
+  op_params.input1_multiplier = data->input1_multiplier;               \
+  op_params.input1_shift = data->input1_shift;                         \
+  op_params.input2_offset = data->input2_offset;                       \
+  op_params.input2_multiplier = data->input2_multiplier;               \
+  op_params.input2_shift = data->input2_shift;                         \
+  op_params.output_offset = data->output_offset;                       \
+  op_params.output_multiplier = data->output_multiplier;               \
+  op_params.output_shift = data->output_shift;                         \
+  SetActivationParams(data->output_activation_min,                     \
+                      data->output_activation_max, &op_params);        \
+  type::opname(op_params, GetTensorShape(input1),                      \
+               GetTensorData<uint8_t>(input1), GetTensorShape(input2), \
+               GetTensorData<uint8_t>(input2), GetTensorShape(output), \
+               GetTensorData<uint8_t>(output))
     // The quantized version of Add doesn't support activations, so we
     // always use BroadcastAdd.
     if (kernel_type == kReference) {
-      TF_LITE_ADD(reference_ops, BroadcastAdd);
+      TF_LITE_ADD(reference_ops, BroadcastAdd4DSlow);
     } else {
-      TF_LITE_ADD(optimized_ops, BroadcastAdd);
+      TF_LITE_ADD(optimized_ops, BroadcastAdd4DSlow);
     }
 #undef TF_LITE_ADD
   } else if (output->type == kTfLiteInt16) {
-#define TF_LITE_ADD(type, opname)                                        \
-  type::opname(GetTensorData<int16_t>(input1), GetTensorDims(input1),    \
-               data->input1_shift, GetTensorData<int16_t>(input2),       \
-               GetTensorDims(input2), data->input2_shift,                \
-               data->output_activation_min, data->output_activation_max, \
-               GetTensorData<int16_t>(output), GetTensorDims(output));
+#define TF_LITE_ADD(type, opname)                                      \
+  tflite::ArithmeticParams op_params;                                  \
+  op_params.input1_shift = data->input1_shift;                         \
+  op_params.input2_shift = data->input2_shift;                         \
+  SetActivationParams(data->output_activation_min,                     \
+                      data->output_activation_max, &op_params);        \
+  type::opname(op_params, GetTensorShape(input1),                      \
+               GetTensorData<int16_t>(input1), GetTensorShape(input2), \
+               GetTensorData<int16_t>(input2), GetTensorShape(output), \
+               GetTensorData<int16_t>(output))
     // The quantized version of Add doesn't support activations, so we
     // always use BroadcastAdd.
     if (kernel_type == kReference) {
diff --git a/tensorflow/contrib/lite/kernels/arg_max.cc b/tensorflow/contrib/lite/kernels/arg_min_max.cc
index 26f57e8896..4f30d09030 100644
--- a/tensorflow/contrib/lite/kernels/arg_max.cc
+++ b/tensorflow/contrib/lite/kernels/arg_min_max.cc
@@ -23,7 +23,7 @@ limitations under the License.
 namespace tflite {
 namespace ops {
 namespace builtin {
-namespace arg_max {
+namespace arg_min_max {
 
 constexpr int kInputTensor = 0;
 constexpr int kAxis = 1;
@@ -80,30 +80,39 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   return context->ResizeTensor(context, output, output_size);
 }
 
+template <typename T>
+std::function<bool(T, T)> GetComparefunction(bool is_arg_max) {
+  if (is_arg_max) {
+    return std::greater<T>();
+  } else {
+    return std::less<T>();
+  }
+}
+
 // The current impl actually ignores the axis argument.
 // Only determine the index of the maximum value in the last dimension.
-TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node, bool is_arg_max) {
   const TfLiteTensor* input = GetInput(context, node, kInputTensor);
   const TfLiteTensor* axis = GetInput(context, node, kAxis);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
-#define TF_LITE_ARG_MAX(data_type, axis_type, output_type)                     \
-  optimized_ops::ArgMax(GetTensorData<axis_type>(axis),                        \
-                        GetTensorData<data_type>(input), GetTensorDims(input), \
-                        GetTensorData<output_type>(output),                    \
-                        GetTensorDims(output))
+#define TF_LITE_ARG_MIN_MAX(data_type, axis_type, output_type)         \
+  optimized_ops::ArgMinMax(                                            \
+      GetTensorData<axis_type>(axis), GetTensorData<data_type>(input), \
+      GetTensorDims(input), GetTensorData<output_type>(output),        \
+      GetTensorDims(output), GetComparefunction<data_type>(is_arg_max))
   if (axis->type == kTfLiteInt32) {
     switch (output->type) {
       case kTfLiteInt32: {
         switch (input->type) {
           case kTfLiteFloat32:
-            TF_LITE_ARG_MAX(float, int32_t, int32_t);
+            TF_LITE_ARG_MIN_MAX(float, int32_t, int32_t);
             break;
           case kTfLiteUInt8:
-            TF_LITE_ARG_MAX(uint8_t, int32_t, int32_t);
+            TF_LITE_ARG_MIN_MAX(uint8_t, int32_t, int32_t);
             break;
           case kTfLiteInt32:
-            TF_LITE_ARG_MAX(int32_t, int32_t, int32_t);
+            TF_LITE_ARG_MIN_MAX(int32_t, int32_t, int32_t);
             break;
           default:
             return kTfLiteError;
@@ -112,13 +121,13 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
       case kTfLiteInt64: {
         switch (input->type) {
           case kTfLiteFloat32:
-            TF_LITE_ARG_MAX(float, int32_t, int64_t);
+            TF_LITE_ARG_MIN_MAX(float, int32_t, int64_t);
             break;
           case kTfLiteUInt8:
-            TF_LITE_ARG_MAX(uint8_t, int32_t, int64_t);
+            TF_LITE_ARG_MIN_MAX(uint8_t, int32_t, int64_t);
             break;
           case kTfLiteInt32:
-            TF_LITE_ARG_MAX(int32_t, int32_t, int64_t);
+            TF_LITE_ARG_MIN_MAX(int32_t, int32_t, int64_t);
             break;
           default:
             return kTfLiteError;
@@ -132,13 +141,13 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
       case kTfLiteInt32: {
         switch (input->type) {
           case kTfLiteFloat32:
-            TF_LITE_ARG_MAX(float, int64_t, int32_t);
+            TF_LITE_ARG_MIN_MAX(float, int64_t, int32_t);
             break;
           case kTfLiteUInt8:
-            TF_LITE_ARG_MAX(uint8_t, int64_t, int32_t);
+            TF_LITE_ARG_MIN_MAX(uint8_t, int64_t, int32_t);
             break;
           case kTfLiteInt32:
-            TF_LITE_ARG_MAX(int32_t, int64_t, int32_t);
+            TF_LITE_ARG_MIN_MAX(int32_t, int64_t, int32_t);
             break;
           default:
             return kTfLiteError;
@@ -147,13 +156,13 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
       case kTfLiteInt64: {
         switch (input->type) {
           case kTfLiteFloat32:
-            TF_LITE_ARG_MAX(float, int64_t, int64_t);
+            TF_LITE_ARG_MIN_MAX(float, int64_t, int64_t);
             break;
           case kTfLiteUInt8:
-            TF_LITE_ARG_MAX(uint8_t, int64_t, int64_t);
+            TF_LITE_ARG_MIN_MAX(uint8_t, int64_t, int64_t);
             break;
           case kTfLiteInt32:
-            TF_LITE_ARG_MAX(int32_t, int64_t, int64_t);
+            TF_LITE_ARG_MIN_MAX(int32_t, int64_t, int64_t);
             break;
           default:
             return kTfLiteError;
@@ -163,16 +172,30 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
         return kTfLiteError;
     }
   }
-#undef TF_LITE_ARG_MAX
+#undef TF_LITE_ARG_MIN_MAX
 
   return kTfLiteOk;
 }
 
-}  // namespace arg_max
+TfLiteStatus ArgMinEval(TfLiteContext* context, TfLiteNode* node) {
+  return Eval(context, node, false);
+}
+
+TfLiteStatus ArgMaxEval(TfLiteContext* context, TfLiteNode* node) {
+  return Eval(context, node, true);
+}
+
+}  // namespace arg_min_max
 
 TfLiteRegistration* Register_ARG_MAX() {
-  static TfLiteRegistration r = {nullptr, nullptr, arg_max::Prepare,
-                                 arg_max::Eval};
+  static TfLiteRegistration r = {nullptr, nullptr, arg_min_max::Prepare,
+                                 arg_min_max::ArgMaxEval};
+  return &r;
+}
+
+TfLiteRegistration* Register_ARG_MIN() {
+  static TfLiteRegistration r = {nullptr, nullptr, arg_min_max::Prepare,
+                                 arg_min_max::ArgMinEval};
   return &r;
 }
 
diff --git a/tensorflow/contrib/lite/kernels/arg_max_test.cc b/tensorflow/contrib/lite/kernels/arg_min_max_test.cc
index 31b15fe19a..90e5fdc532 100644
--- a/tensorflow/contrib/lite/kernels/arg_max_test.cc
+++ b/tensorflow/contrib/lite/kernels/arg_min_max_test.cc
@@ -24,16 +24,13 @@ namespace {
 using ::testing::ElementsAreArray;
 
 template <typename T>
-class ArgMaxOpModel : public SingleOpModel {
+class ArgBaseOpModel : public SingleOpModel {
  public:
-  ArgMaxOpModel(std::initializer_list<int> input_shape, TensorType input_type,
-                TensorType output_type, TensorType index_output_type) {
+  ArgBaseOpModel(std::initializer_list<int> input_shape, TensorType input_type,
+                 TensorType output_type, TensorType index_output_type) {
     input_ = AddInput(input_type);
     axis_ = AddInput(TensorType_INT32);
     output_ = AddOutput(output_type);
-    SetBuiltinOp(BuiltinOperator_ARG_MAX, BuiltinOptions_ArgMaxOptions,
-                 CreateArgMaxOptions(builder_, index_output_type).Union());
-    BuildInterpreter({input_shape, {1, 1, 1, 1}});
   }
 
   int input() { return input_; }
@@ -42,12 +39,42 @@ class ArgMaxOpModel : public SingleOpModel {
   std::vector<T> GetOutput() { return ExtractVector<T>(output_); }
   std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
 
- private:
+ protected:
   int input_;
   int axis_;
   int output_;
 };
 
+template <typename T>
+class ArgMaxOpModel : public ArgBaseOpModel<T> {
+ public:
+  ArgMaxOpModel(std::initializer_list<int> input_shape, TensorType input_type,
+                TensorType output_type, TensorType index_output_type)
+      : ArgBaseOpModel<T>(input_shape, input_type, output_type,
+                          index_output_type) {
+    ArgBaseOpModel<T>::SetBuiltinOp(
+        BuiltinOperator_ARG_MAX, BuiltinOptions_ArgMaxOptions,
+        CreateArgMaxOptions(ArgBaseOpModel<T>::builder_, index_output_type)
+            .Union());
+    ArgBaseOpModel<T>::BuildInterpreter({input_shape, {1, 1, 1, 1}});
+  }
+};
+
+template <typename T>
+class ArgMinOpModel : public ArgBaseOpModel<T> {
+ public:
+  ArgMinOpModel(std::initializer_list<int> input_shape, TensorType input_type,
+                TensorType output_type, TensorType index_output_type)
+      : ArgBaseOpModel<T>(input_shape, input_type, output_type,
+                          index_output_type) {
+    ArgBaseOpModel<T>::SetBuiltinOp(
+        BuiltinOperator_ARG_MIN, BuiltinOptions_ArgMinOptions,
+        CreateArgMinOptions(ArgBaseOpModel<T>::builder_, index_output_type)
+            .Union());
+    ArgBaseOpModel<T>::BuildInterpreter({input_shape, {1, 1, 1, 1}});
+  }
+};
+
 TEST(ArgMaxOpTest, GetMaxArgFloat) {
   ArgMaxOpModel<int32_t> model({1, 1, 1, 4}, TensorType_FLOAT32,
                                TensorType_INT32, TensorType_INT32);
@@ -96,6 +123,54 @@ TEST(ArgMaxOpTest, GetMaxArgOutput64) {
   EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 2, 1}));
 }
 
+TEST(ArgMinOpTest, GetMinArgFloat) {
+  ArgMinOpModel<int32_t> model({1, 1, 1, 4}, TensorType_FLOAT32,
+                               TensorType_INT32, TensorType_INT32);
+  model.PopulateTensor<float>(model.input(), {0.1, 0.9, 0.7, 0.3});
+  // Currently only support the last dimension.
+  model.PopulateTensor<int>(model.axis(), {3});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({0}));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 1}));
+}
+
+TEST(ArgMinOpTest, GetMinArgInt) {
+  ArgMinOpModel<int32_t> model({1, 1, 1, 4}, TensorType_INT32, TensorType_INT32,
+                               TensorType_INT32);
+  model.PopulateTensor<int>(model.input(), {1, 9, 7, 3});
+  // Currently only support the last dimension.
+  model.PopulateTensor<int>(model.axis(), {3});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({0}));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 1}));
+}
+
+TEST(ArgMinOpTest, GetMinArgMulDimensions) {
+  ArgMinOpModel<int32_t> model({1, 1, 2, 4}, TensorType_INT32, TensorType_INT32,
+                               TensorType_INT32);
+  model.PopulateTensor<int>(model.input(), {1, 2, 7, 8, 1, 9, 7, 3});
+  // Currently only support the last dimension.
+  model.PopulateTensor<int>(model.axis(), {3});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({0, 0}));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 2, 1}));
+}
+
+TEST(ArgMinOpTest, GetMinArgOutput64) {
+  ArgMinOpModel<int64_t> model({1, 1, 2, 4}, TensorType_INT32, TensorType_INT64,
+                               TensorType_INT64);
+  model.PopulateTensor<int>(model.input(), {10, 2, 7, 8, 1, 9, 7, 3});
+  // Currently only support the last dimension.
+  model.PopulateTensor<int>(model.axis(), {3});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 0}));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 2, 1}));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
index 3425288f02..a11a59aa05 100644
--- a/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
+++ b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
@@ -13,7 +13,6 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -276,27 +275,33 @@ TfLiteStatus CheckLstmTensorDimensions(
 TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
                                         TfLiteNode* node, int n_input,
                                         int n_output, int n_cell) {
-  CheckLstmTensorDimensions(
-      context, node, n_input, n_output, n_cell, kFwInputToInputWeightsTensor,
-      kFwInputToForgetWeightsTensor, kFwInputToCellWeightsTensor,
-      kFwInputToOutputWeightsTensor, kFwRecurrentToInputWeightsTensor,
-      kFwRecurrentToForgetWeightsTensor, kFwRecurrentToCellWeightsTensor,
-      kFwRecurrentToOutputWeightsTensor, kFwCellToInputWeightsTensor,
-      kFwCellToForgetWeightsTensor, kFwCellToOutputWeightsTensor,
-      kFwInputGateBiasTensor, kFwForgetGateBiasTensor, kFwCellGateBiasTensor,
-      kFwOutputGateBiasTensor, kFwProjectionWeightsTensor,
-      kFwProjectionBiasTensor);
-
-  CheckLstmTensorDimensions(
-      context, node, n_input, n_output, n_cell, kBwInputToInputWeightsTensor,
-      kBwInputToForgetWeightsTensor, kBwInputToCellWeightsTensor,
-      kBwInputToOutputWeightsTensor, kBwRecurrentToInputWeightsTensor,
-      kBwRecurrentToForgetWeightsTensor, kBwRecurrentToCellWeightsTensor,
-      kBwRecurrentToOutputWeightsTensor, kBwCellToInputWeightsTensor,
-      kBwCellToForgetWeightsTensor, kBwCellToOutputWeightsTensor,
-      kBwInputGateBiasTensor, kBwForgetGateBiasTensor, kBwCellGateBiasTensor,
-      kBwOutputGateBiasTensor, kBwProjectionWeightsTensor,
-      kBwProjectionBiasTensor);
+  TF_LITE_ENSURE_OK(
+      context,
+      CheckLstmTensorDimensions(
+          context, node, n_input, n_output, n_cell,
+          kFwInputToInputWeightsTensor, kFwInputToForgetWeightsTensor,
+          kFwInputToCellWeightsTensor, kFwInputToOutputWeightsTensor,
+          kFwRecurrentToInputWeightsTensor, kFwRecurrentToForgetWeightsTensor,
+          kFwRecurrentToCellWeightsTensor, kFwRecurrentToOutputWeightsTensor,
+          kFwCellToInputWeightsTensor, kFwCellToForgetWeightsTensor,
+          kFwCellToOutputWeightsTensor, kFwInputGateBiasTensor,
+          kFwForgetGateBiasTensor, kFwCellGateBiasTensor,
+          kFwOutputGateBiasTensor, kFwProjectionWeightsTensor,
+          kFwProjectionBiasTensor));
+
+  TF_LITE_ENSURE_OK(
+      context,
+      CheckLstmTensorDimensions(
+          context, node, n_input, n_output, n_cell,
+          kBwInputToInputWeightsTensor, kBwInputToForgetWeightsTensor,
+          kBwInputToCellWeightsTensor, kBwInputToOutputWeightsTensor,
+          kBwRecurrentToInputWeightsTensor, kBwRecurrentToForgetWeightsTensor,
+          kBwRecurrentToCellWeightsTensor, kBwRecurrentToOutputWeightsTensor,
+          kBwCellToInputWeightsTensor, kBwCellToForgetWeightsTensor,
+          kBwCellToOutputWeightsTensor, kBwInputGateBiasTensor,
+          kBwForgetGateBiasTensor, kBwCellGateBiasTensor,
+          kBwOutputGateBiasTensor, kBwProjectionWeightsTensor,
+          kBwProjectionBiasTensor));
 
   // Check if Forward and Backward tensors match along required dimensions.
   return kTfLiteOk;
@@ -334,7 +339,9 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   const int n_fw_output = fw_recurrent_to_output_weights->dims->data[1];
 
   // Check that input tensor dimensions matches with each other.
-  CheckInputTensorDimensions(context, node, n_input, n_fw_output, n_fw_cell);
+  TF_LITE_ENSURE_OK(
+      context, CheckInputTensorDimensions(context, node, n_input, n_fw_output,
+                                          n_fw_cell));
 
   // Get the pointer to output, state and scratch buffer tensors.
   TfLiteTensor* fw_output = GetOutput(context, node, kFwOutputTensor);
@@ -404,7 +411,9 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   const int n_bw_output = bw_recurrent_to_output_weights->dims->data[1];
 
   // Check that input tensor dimensions matches with each other.
-  CheckInputTensorDimensions(context, node, n_input, n_bw_output, n_bw_cell);
+  TF_LITE_ENSURE_OK(
+      context, CheckInputTensorDimensions(context, node, n_input, n_bw_output,
+                                          n_bw_cell));
 
   // Get the pointer to output, output_state and cell_state buffer tensors.
   TfLiteTensor* bw_output = GetOutput(context, node, kBwOutputTensor);
diff --git a/tensorflow/contrib/lite/kernels/bidirectional_sequence_rnn.cc b/tensorflow/contrib/lite/kernels/bidirectional_sequence_rnn.cc
index aa24c1f34c..517309a226 100644
--- a/tensorflow/contrib/lite/kernels/bidirectional_sequence_rnn.cc
+++ b/tensorflow/contrib/lite/kernels/bidirectional_sequence_rnn.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdlib>
diff --git a/tensorflow/contrib/lite/kernels/concatenation.cc b/tensorflow/contrib/lite/kernels/concatenation.cc
index 45ea8d0049..ad211e9c67 100644
--- a/tensorflow/contrib/lite/kernels/concatenation.cc
+++ b/tensorflow/contrib/lite/kernels/concatenation.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
diff --git a/tensorflow/contrib/lite/kernels/conv.cc b/tensorflow/contrib/lite/kernels/conv.cc
index 0321b2e2a0..04c0263b78 100644
--- a/tensorflow/contrib/lite/kernels/conv.cc
+++ b/tensorflow/contrib/lite/kernels/conv.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <algorithm>
 #include <cassert>
 #include <cmath>
@@ -257,10 +256,10 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
     double real_multiplier = 0.0;
     TF_LITE_ENSURE_STATUS(GetQuantizedConvolutionMultipler(
         context, input, filter, bias, output, &real_multiplier));
-    TF_LITE_ENSURE(context, real_multiplier < 1.0);
-    QuantizeMultiplierSmallerThanOneExp(
-        real_multiplier, &data->output_multiplier, &data->output_shift);
-    data->output_shift *= -1;
+
+    int exponent;
+    QuantizeMultiplier(real_multiplier, &data->output_multiplier, &exponent);
+    data->output_shift = -exponent;
     CalculateActivationRangeUint8(params->activation, output,
                                   &data->output_activation_min,
                                   &data->output_activation_max);
@@ -418,6 +417,7 @@ void EvalFloat(TfLiteContext* context, TfLiteNode* node,
         filter_data = GetTensorData<float>(filter);
       }
       multithreaded_ops::Conv(
+          *eigen_support::GetThreadPoolDevice(context),
           GetTensorData<float>(input), GetTensorDims(input), filter_data,
           GetTensorDims(filter), GetTensorData<float>(bias),
           GetTensorDims(bias), params->stride_width, params->stride_height,
diff --git a/tensorflow/contrib/lite/kernels/conv_test.cc b/tensorflow/contrib/lite/kernels/conv_test.cc
index 0dcfc826fd..24633c2fd7 100644
--- a/tensorflow/contrib/lite/kernels/conv_test.cc
+++ b/tensorflow/contrib/lite/kernels/conv_test.cc
@@ -64,12 +64,6 @@ class BaseConvolutionOpModel : public SingleOpModel {
     }
 
     output_ = AddOutput(output);
-    if (input.type != TensorType_FLOAT32) {
-      // The following is required by quantized inference. It is the unittest's
-      // responsibility to make sure the output scale falls into the correct
-      // range.
-      CHECK_LT(GetScale(input_) * GetScale(filter_), GetScale(output_));
-    }
 
     SetBuiltinOp(BuiltinOperator_CONV_2D, BuiltinOptions_Conv2DOptions,
                  CreateConv2DOptions(
@@ -441,6 +435,44 @@ TEST_P(ConvolutionOpTest, SimpleTestQuantized) {
                              }));
 }
 
+TEST_P(ConvolutionOpTest, SimpleTestQuantizedOutputMultiplierGreaterThan1) {
+  // output_multiplier = 1.0118
+  QuantizedConvolutionOpModel quant_op(
+      GetRegistration(), {TensorType_UINT8, {2, 2, 4, 1}, -128.5, 128},
+      {TensorType_UINT8, {3, 2, 2, 1}, -128.5, 128},
+      {TensorType_UINT8, {}, -127, 128});
+  ConvolutionOpModel float_op(
+      GetRegistration(), {TensorType_FLOAT32, {2, 2, 4, 1}},
+      {TensorType_FLOAT32, {3, 2, 2, 1}}, {TensorType_FLOAT32, {}});
+  std::initializer_list<float> input = {
+      // First batch
+      1, 1, 1, 1,  // row = 1
+      2, 2, 2, 2,  // row = 2
+      // Second batch
+      1, 2, 3, 4,  // row = 1
+      1, 2, 3, 4,  // row = 2
+  };
+  std::initializer_list<float> filter = {
+      1,  2,  3,  4,  // first 2x2 filter
+      -1, 1,  -1, 1,  // second 2x2 filter
+      -1, -1, 1,  1,  // third 2x2 filter
+  };
+  std::initializer_list<float> bias = {1, 2, 3};
+
+  quant_op.SetInput(input);
+  quant_op.SetFilter(filter);
+  quant_op.SetBias(bias);
+  quant_op.Invoke();
+
+  float_op.SetInput(input);
+  float_op.SetFilter(filter);
+  float_op.SetBias(bias);
+  float_op.Invoke();
+
+  EXPECT_THAT(quant_op.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear(float_op.GetOutput(), 1)));
+}
+
 TEST_P(ConvolutionOpTest, SimpleTestQuantizedWithAnisotropicStrides) {
   QuantizedConvolutionOpModel m(GetRegistration(),
                                 {TensorType_UINT8, {1, 3, 6, 1}, -63.5, 64},
diff --git a/tensorflow/contrib/lite/kernels/depthwise_conv.cc b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
index 16e5f1d065..21518156b8 100644
--- a/tensorflow/contrib/lite/kernels/depthwise_conv.cc
+++ b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
diff --git a/tensorflow/contrib/lite/kernels/div.cc b/tensorflow/contrib/lite/kernels/div.cc
index bc5c3783fd..d7420ddd8e 100644
--- a/tensorflow/contrib/lite/kernels/div.cc
+++ b/tensorflow/contrib/lite/kernels/div.cc
@@ -78,29 +78,44 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
 }
 
 template <KernelType kernel_type>
-void EvalFloat(TfLiteContext* context, TfLiteNode* node,
-               TfLiteDivParams* params, const OpData* data,
-               const TfLiteTensor* input1, const TfLiteTensor* input2,
-               TfLiteTensor* output) {
-  float output_activation_min, output_activation_max;
-  CalculateActivationRange(params->activation, &output_activation_min,
-                           &output_activation_max);
-#define TF_LITE_DIV(type, opname)                                   \
-  type::opname(GetTensorData<float>(input1), GetTensorDims(input1), \
-               GetTensorData<float>(input2), GetTensorDims(input2), \
-               output_activation_min, output_activation_max,        \
-               GetTensorData<float>(output), GetTensorDims(output))
-  if (kernel_type == kReference) {
-    if (data->requires_broadcast) {
-      TF_LITE_DIV(reference_ops, BroadcastDiv);
+void EvalDiv(TfLiteContext* context, TfLiteNode* node, TfLiteDivParams* params,
+             const OpData* data, const TfLiteTensor* input1,
+             const TfLiteTensor* input2, TfLiteTensor* output) {
+#define TF_LITE_DIV(type, opname, data_type)                            \
+  data_type output_activation_min, output_activation_max;               \
+  CalculateActivationRange(params->activation, &output_activation_min,  \
+                           &output_activation_max);                     \
+  type::opname(GetTensorData<data_type>(input1), GetTensorDims(input1), \
+               GetTensorData<data_type>(input2), GetTensorDims(input2), \
+               output_activation_min, output_activation_max,            \
+               GetTensorData<data_type>(output), GetTensorDims(output))
+  if (output->type == kTfLiteInt32) {
+    if (kernel_type == kReference) {
+      if (data->requires_broadcast) {
+        TF_LITE_DIV(reference_ops, BroadcastDiv, int32_t);
+      } else {
+        TF_LITE_DIV(reference_ops, Div, int32_t);
+      }
     } else {
-      TF_LITE_DIV(reference_ops, Div);
+      if (data->requires_broadcast) {
+        TF_LITE_DIV(optimized_ops, BroadcastDiv, int32_t);
+      } else {
+        TF_LITE_DIV(optimized_ops, Div, int32_t);
+      }
     }
-  } else {
-    if (data->requires_broadcast) {
-      TF_LITE_DIV(optimized_ops, BroadcastDiv);
+  } else if (output->type == kTfLiteFloat32) {
+    if (kernel_type == kReference) {
+      if (data->requires_broadcast) {
+        TF_LITE_DIV(reference_ops, BroadcastDiv, float);
+      } else {
+        TF_LITE_DIV(reference_ops, Div, float);
+      }
     } else {
-      TF_LITE_DIV(optimized_ops, Div);
+      if (data->requires_broadcast) {
+        TF_LITE_DIV(optimized_ops, BroadcastDiv, float);
+      } else {
+        TF_LITE_DIV(optimized_ops, Div, float);
+      }
     }
   }
 #undef TF_LITE_DIV
@@ -115,11 +130,12 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
-  if (output->type == kTfLiteFloat32) {
-    EvalFloat<kernel_type>(context, node, params, data, input1, input2, output);
+  if (output->type == kTfLiteFloat32 || output->type == kTfLiteInt32) {
+    EvalDiv<kernel_type>(context, node, params, data, input1, input2, output);
   } else {
     context->ReportError(
-        context, "Div only supports FLOAT32 and quantized UINT8 now, got %d.",
+        context,
+        "Div only supports FLOAT32, INT32 and quantized UINT8 now, got %d.",
         output->type);
     return kTfLiteError;
   }
diff --git a/tensorflow/contrib/lite/kernels/div_test.cc b/tensorflow/contrib/lite/kernels/div_test.cc
index 276b8289fb..97aa2fe04e 100644
--- a/tensorflow/contrib/lite/kernels/div_test.cc
+++ b/tensorflow/contrib/lite/kernels/div_test.cc
@@ -52,6 +52,13 @@ class FloatDivOpModel : public BaseDivOpModel {
   std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
 };
 
+class IntegerDivOpModel : public BaseDivOpModel {
+ public:
+  using BaseDivOpModel::BaseDivOpModel;
+
+  std::vector<int32_t> GetOutput() { return ExtractVector<int32_t>(output_); }
+};
+
 TEST(FloatDivOpTest, NoActivation) {
   FloatDivOpModel m({TensorType_FLOAT32, {1, 2, 2, 1}},
                     {TensorType_FLOAT32, {1, 2, 2, 1}},
@@ -75,7 +82,7 @@ TEST(FloatDivOpTest, ActivationRELU_N1_TO_1) {
 }
 
 TEST(FloatDivOpTest, VariousInputShapes) {
-  std::vector<std::initializer_list<int>> test_shapes = {
+  std::vector<std::vector<int>> test_shapes = {
       {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
   for (int i = 0; i < test_shapes.size(); ++i) {
     FloatDivOpModel m({TensorType_FLOAT32, test_shapes[i]},
@@ -92,7 +99,7 @@ TEST(FloatDivOpTest, VariousInputShapes) {
 }
 
 TEST(FloatDivOpTest, WithBroadcast) {
-  std::vector<std::initializer_list<int>> test_shapes = {
+  std::vector<std::vector<int>> test_shapes = {
       {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
   for (int i = 0; i < test_shapes.size(); ++i) {
     FloatDivOpModel m({TensorType_FLOAT32, test_shapes[i]},
@@ -108,6 +115,56 @@ TEST(FloatDivOpTest, WithBroadcast) {
   }
 }
 
+TEST(IntegerDivOpTest, NoActivation) {
+  IntegerDivOpModel m({TensorType_INT32, {1, 2, 2, 1}},
+                      {TensorType_INT32, {1, 2, 2, 1}}, {TensorType_INT32, {}},
+                      ActivationFunctionType_NONE);
+  m.PopulateTensor<int32_t>(m.input1(), {-2, 2, -15, 8});
+  m.PopulateTensor<int32_t>(m.input2(), {5, -2, -3, 5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({0, -1, 5, 1}));
+}
+
+TEST(IntegerDivOpTest, ActivationRELU_N1_TO_1) {
+  IntegerDivOpModel m({TensorType_INT32, {1, 2, 2, 1}},
+                      {TensorType_INT32, {1, 2, 2, 1}}, {TensorType_INT32, {}},
+                      ActivationFunctionType_RELU_N1_TO_1);
+  m.PopulateTensor<int32_t>(m.input1(), {-2, 2, -12, 8});
+  m.PopulateTensor<int32_t>(m.input2(), {1, 2, -15, 5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({-1, 1, 0, 1}));
+}
+
+TEST(IntegerDivOpTest, VariousInputShapes) {
+  std::vector<std::vector<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    IntegerDivOpModel m({TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 3, 8, 11, -20});
+    m.PopulateTensor<int32_t>(m.input2(), {1, 2, 6, 5, -11, -1});
+    m.Invoke();
+    EXPECT_THAT(m.GetOutput(), ElementsAreArray({-20, 1, 0, 1, -1, 20}))
+        << "With shape number " << i;
+  }
+}
+
+TEST(IntegerDivOpTest, WithBroadcast) {
+  std::vector<std::vector<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    IntegerDivOpModel m({TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, {}},  // always a scalar
+                        {TensorType_INT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<int32_t>(m.input1(), {-20, 21, 7, 8, 11, -123});
+    m.PopulateTensor<int32_t>(m.input2(), {3});
+    m.Invoke();
+    EXPECT_THAT(m.GetOutput(), ElementsAreArray({-6, 7, 2, 2, 3, -41}))
+        << "With shape number " << i;
+  }
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/eigen_support.cc b/tensorflow/contrib/lite/kernels/eigen_support.cc
index f1fdb42624..e542ad0765 100644
--- a/tensorflow/contrib/lite/kernels/eigen_support.cc
+++ b/tensorflow/contrib/lite/kernels/eigen_support.cc
@@ -14,31 +14,100 @@ limitations under the License.
 ==============================================================================*/
 #include "tensorflow/contrib/lite/kernels/eigen_support.h"
 
-#include "third_party/eigen3/Eigen/Core"
+#include <utility>
+
+#include "tensorflow/contrib/lite/arena_planner.h"
+#include "tensorflow/contrib/lite/kernels/internal/optimized/eigen_spatial_convolutions.h"
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
 
 namespace tflite {
 namespace eigen_support {
+namespace {
+
+#ifndef EIGEN_DONT_ALIGN
+// Eigen may require buffers to be algiend to 16, 32 or 64 bytes depending on
+// hardware architecture and build configurations.
+// If the static assertion fails, try to increase `kDefaultTensorAlignment` to
+// in `arena_planner.h` to 32 or 64.
+static_assert(
+    kDefaultTensorAlignment % EIGEN_MAX_ALIGN_BYTES == 0,
+    "kDefaultArenaAlignment doesn't comply with Eigen alignment requirement.");
+#endif  // EIGEN_DONT_ALIGN
+
+// We have a single global threadpool for all convolution operations. This means
+// that inferences started from different threads may block each other, but
+// since the underlying resource of CPU cores should be consumed by the
+// operations anyway, it shouldn't affect overall performance.
+class EigenThreadPoolWrapper : public Eigen::ThreadPoolInterface {
+ public:
+  // Takes ownership of 'pool'
+  explicit EigenThreadPoolWrapper(Eigen::ThreadPool* pool) : pool_(pool) {}
+  ~EigenThreadPoolWrapper() override {}
 
-struct RefCountedEigenContext {
+  void Schedule(std::function<void()> fn) override {
+    pool_->Schedule(std::move(fn));
+  }
+  int NumThreads() const override { return pool_->NumThreads(); }
+  int CurrentThreadId() const override { return pool_->CurrentThreadId(); }
+
+ private:
+  std::unique_ptr<Eigen::ThreadPool> pool_;
+};
+
+struct RefCountedEigenContext : public TfLiteExternalContext {
+  std::unique_ptr<Eigen::ThreadPoolInterface> thread_pool_wrapper;
+  std::unique_ptr<Eigen::ThreadPoolDevice> device;
   int num_references = 0;
 };
 
+RefCountedEigenContext* GetEigenContext(TfLiteContext* context) {
+  return reinterpret_cast<RefCountedEigenContext*>(
+      context->GetExternalContext(context, kTfLiteEigenContext));
+}
+
+void InitDevice(TfLiteContext* context, RefCountedEigenContext* ptr) {
+  int num_threads = 4;
+  if (context->recommended_num_threads != -1) {
+    num_threads = context->recommended_num_threads;
+  }
+  ptr->device.reset();  // destroy before we invalidate the thread pool
+  ptr->thread_pool_wrapper.reset(
+      new EigenThreadPoolWrapper(new Eigen::ThreadPool(num_threads)));
+  ptr->device.reset(
+      new Eigen::ThreadPoolDevice(ptr->thread_pool_wrapper.get(), num_threads));
+}
+
+TfLiteStatus Refresh(TfLiteContext* context) {
+  Eigen::setNbThreads(context->recommended_num_threads);
+
+  auto* ptr = GetEigenContext(context);
+  if (ptr != nullptr) {
+    InitDevice(context, ptr);
+  }
+
+  return kTfLiteOk;
+}
+
+}  // namespace
+
 void IncrementUsageCounter(TfLiteContext* context) {
-  auto* ptr = reinterpret_cast<RefCountedEigenContext*>(context->eigen_context);
+  auto* ptr = GetEigenContext(context);
   if (ptr == nullptr) {
     if (context->recommended_num_threads != -1) {
       Eigen::setNbThreads(context->recommended_num_threads);
     }
     ptr = new RefCountedEigenContext;
+    ptr->type = kTfLiteEigenContext;
+    ptr->Refresh = Refresh;
     ptr->num_references = 0;
-    context->eigen_context = ptr;
+    InitDevice(context, ptr);
+    context->SetExternalContext(context, kTfLiteEigenContext, ptr);
   }
   ptr->num_references++;
 }
 
 void DecrementUsageCounter(TfLiteContext* context) {
-  auto* ptr = reinterpret_cast<RefCountedEigenContext*>(context->eigen_context);
+  auto* ptr = GetEigenContext(context);
   if (ptr == nullptr) {
     TF_LITE_FATAL(
         "Call to DecrementUsageCounter() not preceded by "
@@ -46,14 +115,17 @@ void DecrementUsageCounter(TfLiteContext* context) {
   }
   if (--ptr->num_references == 0) {
     delete ptr;
-    context->eigen_context = nullptr;
+    context->SetExternalContext(context, kTfLiteEigenContext, nullptr);
   }
 }
 
-void SetNumThreads(TfLiteContext* context, int num_threads) {
-  IncrementUsageCounter(context);
-  Eigen::setNbThreads(num_threads);
-  DecrementUsageCounter(context);
+const Eigen::ThreadPoolDevice* GetThreadPoolDevice(TfLiteContext* context) {
+  auto* ptr = GetEigenContext(context);
+  if (ptr == nullptr) {
+    TF_LITE_FATAL(
+        "Call to GetFromContext() not preceded by IncrementUsageCounter()");
+  }
+  return ptr->device.get();
 }
 
 }  // namespace eigen_support
diff --git a/tensorflow/contrib/lite/kernels/eigen_support.h b/tensorflow/contrib/lite/kernels/eigen_support.h
index aa8c351fd8..ec77856b10 100644
--- a/tensorflow/contrib/lite/kernels/eigen_support.h
+++ b/tensorflow/contrib/lite/kernels/eigen_support.h
@@ -17,6 +17,10 @@ limitations under the License.
 
 #include "tensorflow/contrib/lite/context.h"
 
+namespace EigenForTFLite {
+class ThreadPoolDevice;
+}
+
 namespace tflite {
 namespace eigen_support {
 
@@ -28,8 +32,8 @@ void IncrementUsageCounter(TfLiteContext* context);
 // usages all temporary Eigen objects will be deleted.
 void DecrementUsageCounter(TfLiteContext* context);
 
-// Set the number of threads that can be used by Eigen.
-void SetNumThreads(TfLiteContext* context, int num_threads);
+const EigenForTFLite::ThreadPoolDevice* GetThreadPoolDevice(
+    TfLiteContext* context);
 
 }  // namespace eigen_support
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/embedding_lookup.cc b/tensorflow/contrib/lite/kernels/embedding_lookup.cc
index 0ba170a4da..b2dff87e62 100644
--- a/tensorflow/contrib/lite/kernels/embedding_lookup.cc
+++ b/tensorflow/contrib/lite/kernels/embedding_lookup.cc
@@ -29,7 +29,6 @@ limitations under the License.
 //   When indices are out of bound, the ops will not succeed.
 //
 
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -112,8 +111,9 @@ TfLiteStatus EvalHybrid(TfLiteContext* context, TfLiteNode* node,
       // TODO(alanchiao): refactor scalar multiply into separate function
       // for ease of adding a neon equivalent if ever necessary.
       for (int j = 0; j < col_size; j++) {
+        const int8_t* value_ptr = reinterpret_cast<int8_t*>(value->data.uint8);
         output->data.f[j + i * col_size] =
-            value->data.uint8[j + idx * col_size] * scaling_factor;
+            value_ptr[j + idx * col_size] * scaling_factor;
       }
     }
   }
diff --git a/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc b/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc
index 04657fd863..4a88d168c6 100644
--- a/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc
+++ b/tensorflow/contrib/lite/kernels/embedding_lookup_test.cc
@@ -107,9 +107,9 @@ TEST(HybridEmbeddingLookupHybridOpTest, Simple2DTest) {
   HybridEmbeddingLookupOpModel m({3}, {3, 8});
   m.SetInput({1, 0, 2});
   m.SetWeight({
-      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
-      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
-      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+      0.00, 0.01,  0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+      1.00, -1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+      2.00, 2.01,  2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
   });
 
   m.Invoke();
@@ -117,9 +117,9 @@ TEST(HybridEmbeddingLookupHybridOpTest, Simple2DTest) {
   EXPECT_THAT(m.GetOutput(),
               ElementsAreArray(ArrayFloatNear(
                   {
-                      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
-                      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
-                      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+                      1.00, -1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+                      0.00, 0.01,  0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+                      2.00, 2.01,  2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
                   },
                   7.41e-03)));
 }
@@ -128,9 +128,9 @@ TEST(HybridEmbeddingLookupHybridOpTest, Simple3DTest) {
   HybridEmbeddingLookupOpModel m({3}, {3, 2, 4});
   m.SetInput({1, 0, 2});
   m.SetWeight({
-      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
-      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
-      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+      0.00, 0.01,  0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+      1.00, -1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+      2.00, 2.01,  2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
   });
 
   m.Invoke();
@@ -138,9 +138,9 @@ TEST(HybridEmbeddingLookupHybridOpTest, Simple3DTest) {
   EXPECT_THAT(m.GetOutput(),
               ElementsAreArray(ArrayFloatNear(
                   {
-                      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
-                      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
-                      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+                      1.00, -1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+                      0.00, 0.01,  0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+                      2.00, 2.01,  2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
                   },
                   7.41e-03)));
 }
@@ -149,9 +149,9 @@ TEST(HybridEmbeddingLookupHybridOpTest, Simple4DTest) {
   HybridEmbeddingLookupOpModel m({3}, {3, 2, 2, 2});
   m.SetInput({1, 0, 2});
   m.SetWeight({
-      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
-      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
-      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+      0.00, 0.01,  0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+      1.00, -1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+      2.00, 2.01,  2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
   });
 
   m.Invoke();
@@ -159,9 +159,9 @@ TEST(HybridEmbeddingLookupHybridOpTest, Simple4DTest) {
   EXPECT_THAT(m.GetOutput(),
               ElementsAreArray(ArrayFloatNear(
                   {
-                      1.00, 1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
-                      0.00, 0.01, 0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
-                      2.00, 2.01, 2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
+                      1.00, -1.01, 1.02, 1.03, 1.10, 1.11, 1.12, 1.13,  // Row 1
+                      0.00, 0.01,  0.02, 0.03, 0.10, 0.11, 0.12, 0.13,  // Row 0
+                      2.00, 2.01,  2.02, 2.03, 2.10, 2.11, 2.12, 2.13,  // Row 2
                   },
                   7.41e-03)));
 }
diff --git a/tensorflow/contrib/lite/kernels/fake_quant.cc b/tensorflow/contrib/lite/kernels/fake_quant.cc
new file mode 100644
index 0000000000..0ef1a50b30
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/fake_quant.cc
@@ -0,0 +1,92 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <string.h>
+#include <vector>
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace fake_quant {
+
+// This file has reference implementation of FakeQuant.
+enum KernelType {
+  kReference,
+};
+
+struct OpContext {
+  OpContext(TfLiteContext* context, TfLiteNode* node) {
+    input = GetInput(context, node, 0);
+    output = GetOutput(context, node, 0);
+  }
+  const TfLiteTensor* input;
+  TfLiteTensor* output;
+};
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  const auto* params =
+      reinterpret_cast<TfLiteFakeQuantParams*>(node->builtin_data);
+
+  if (params->narrow_range) {
+    context->ReportError(
+        context,
+        "narrow_range FakeQuant is not currently supported at runtime. "
+        "narrow_range is only meant to be applied to weights, not activations");
+    return kTfLiteError;
+  }
+
+  OpContext op_context(context, node);
+  TfLiteIntArray* output_dims = TfLiteIntArrayCopy(op_context.input->dims);
+  op_context.output->type = op_context.input->type;
+  return context->ResizeTensor(context, op_context.output, output_dims);
+}
+
+template <KernelType kernel_type>
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  OpContext op_context(context, node);
+
+  const auto* params =
+      reinterpret_cast<TfLiteFakeQuantParams*>(node->builtin_data);
+
+  reference_ops::FakeQuant(GetTensorData<float>(op_context.input),
+                           GetTensorDims(op_context.input), params->min,
+                           params->max, params->num_bits,
+                           GetTensorData<float>(op_context.output),
+                           GetTensorDims(op_context.output));
+
+  return kTfLiteOk;
+}
+
+}  // namespace fake_quant
+
+TfLiteRegistration* Register_FAKE_QUANT_REF() {
+  static TfLiteRegistration r = {nullptr, nullptr, fake_quant::Prepare,
+                                 fake_quant::Eval<fake_quant::kReference>};
+  return &r;
+}
+
+TfLiteRegistration* Register_FAKE_QUANT() { return Register_FAKE_QUANT_REF(); }
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/fake_quant_test.cc b/tensorflow/contrib/lite/kernels/fake_quant_test.cc
new file mode 100644
index 0000000000..11a02f7ed7
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/fake_quant_test.cc
@@ -0,0 +1,112 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class FakeQuantOpModel : public SingleOpModel {
+ public:
+  FakeQuantOpModel(const TensorData& input, const TensorType& output, float min,
+                   float max, int num_bits) {
+    input_ = AddInput(input);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_FAKE_QUANT, BuiltinOptions_FakeQuantOptions,
+                 CreateFakeQuantOptions(builder_, min, max, num_bits).Union());
+    BuildInterpreter({GetShape(input_)});
+  }
+
+  template <class T>
+  void SetInput(std::initializer_list<T> data) {
+    PopulateTensor(input_, data);
+  }
+
+  template <class T>
+  std::vector<T> GetOutput() {
+    return ExtractVector<T>(output_);
+  }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ protected:
+  int input_;
+  int output_;
+};
+
+TEST(FakeQuantOpTest, FloatPositiveRange8Test) {
+  std::initializer_list<float> data = {0.0,  1.0,       0.25,
+                                       0.50, 0.4444444, 0.00001};
+  FakeQuantOpModel m({TensorType_FLOAT32, {3, 1, 2}}, TensorType_FLOAT32, 0.0f,
+                     1.0f, 8);
+  m.SetInput<float>(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 1, 2}));
+  EXPECT_THAT(
+      m.GetOutput<float>(),
+      ElementsAreArray(ArrayFloatNear({0, 1, 0.25098, 0.498039, 0.443137, 0})));
+}
+
+TEST(FakeQuantOpTest, FloatNegativeRange8Test) {
+  std::initializer_list<float> data = {0.0,  -0.9,      0.25,
+                                       0.50, 0.4444444, -0.00001};
+  FakeQuantOpModel m({TensorType_FLOAT32, {3, 1, 2}}, TensorType_FLOAT32, -0.9f,
+                     0.9f, 8);
+  m.SetInput<float>(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 1, 2}));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear(
+                  {0, -0.896471, 0.247059, 0.501176, 0.444706, 0})));
+}
+
+TEST(FakeQuantOpTest, FloatPositiveRange16Test) {
+  std::initializer_list<float> data = {0.0,  1.0,       0.25,
+                                       0.50, 0.4444444, 0.00001};
+  FakeQuantOpModel m({TensorType_FLOAT32, {3, 1, 2}}, TensorType_FLOAT32, 0.0f,
+                     1.0f, 16);
+  m.SetInput<float>(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 1, 2}));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear(
+                  {0, 1, 0.250004, 0.500008, 0.44445, 1.5259e-05})));
+}
+
+TEST(FakeQuantOpTest, FloatNegativeRange16Test) {
+  std::initializer_list<float> data = {0.0,  -0.9,      0.25,
+                                       0.50, 0.4444444, -0.00001};
+  FakeQuantOpModel m({TensorType_FLOAT32, {3, 1, 2}}, TensorType_FLOAT32, -0.9f,
+                     0.9f, 16);
+  m.SetInput<float>(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 1, 2}));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear(
+                  {0, -0.900014, 0.249998, 0.499995, 0.444431, 0})));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/fully_connected.cc b/tensorflow/contrib/lite/kernels/fully_connected.cc
index 3b203dd480..bc370608c0 100644
--- a/tensorflow/contrib/lite/kernels/fully_connected.cc
+++ b/tensorflow/contrib/lite/kernels/fully_connected.cc
@@ -13,7 +13,6 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -71,7 +70,7 @@ void* Init(TfLiteContext* context, const char* buffer, size_t length) {
   // Instead, we allocate a new object to carry information from Prepare() to
   // Eval().
   gemm_support::IncrementUsageCounter(context);
-  auto* op_data = new OpData;
+  auto* op_data = new OpData();
   context->AddTensors(context, 1, &op_data->input_quantized_index);
   return op_data;
 }
diff --git a/tensorflow/contrib/lite/kernels/gemm_support.cc b/tensorflow/contrib/lite/kernels/gemm_support.cc
index 95f45ea768..ed334af2da 100644
--- a/tensorflow/contrib/lite/kernels/gemm_support.cc
+++ b/tensorflow/contrib/lite/kernels/gemm_support.cc
@@ -14,57 +14,70 @@ limitations under the License.
 ==============================================================================*/
 #include "tensorflow/contrib/lite/kernels/gemm_support.h"
 
+#include <memory>
+
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
 
 namespace tflite {
 namespace gemm_support {
+namespace {
 
-struct RefCountedGemmContext {
-  gemmlowp::GemmContext* gemm_context_ = nullptr;
-  int num_references_ = 0;
+struct RefCountedGemmContext : public TfLiteExternalContext {
+  std::unique_ptr<gemmlowp::GemmContext> gemm_context;
+  int num_references = 0;
 };
 
+RefCountedGemmContext* GetGemmLowpContext(TfLiteContext* context) {
+  return reinterpret_cast<RefCountedGemmContext*>(
+      context->GetExternalContext(context, kTfLiteGemmLowpContext));
+}
+
+TfLiteStatus Refresh(TfLiteContext* context) {
+  auto* ptr = GetGemmLowpContext(context);
+  if (ptr != nullptr) {
+    ptr->gemm_context->set_max_num_threads(context->recommended_num_threads);
+  }
+  return kTfLiteOk;
+}
+
+}  // namespace
+
 void IncrementUsageCounter(TfLiteContext* context) {
-  auto* ptr = reinterpret_cast<RefCountedGemmContext*>(context->gemm_context);
+  auto* ptr = GetGemmLowpContext(context);
   if (ptr == nullptr) {
     ptr = new RefCountedGemmContext;
-    ptr->gemm_context_ = new gemmlowp::GemmContext();
+    ptr->type = kTfLiteGemmLowpContext;
+    ptr->Refresh = Refresh;
+    ptr->gemm_context.reset(new gemmlowp::GemmContext());
     if (context->recommended_num_threads != -1) {
-      ptr->gemm_context_->set_max_num_threads(context->recommended_num_threads);
+      ptr->gemm_context->set_max_num_threads(context->recommended_num_threads);
     }
-    ptr->num_references_ = 0;
-    context->gemm_context = ptr;
+    ptr->num_references = 0;
+    context->SetExternalContext(context, kTfLiteGemmLowpContext, ptr);
   }
-  ptr->num_references_++;
+  ptr->num_references++;
 }
 
 void DecrementUsageCounter(TfLiteContext* context) {
-  auto* ptr = reinterpret_cast<RefCountedGemmContext*>(context->gemm_context);
+  auto* ptr = GetGemmLowpContext(context);
   if (ptr == nullptr) {
     TF_LITE_FATAL(
         "Call to DecrementUsageCounter() not preceded by "
         "IncrementUsageCounter()");
   }
-  if (--ptr->num_references_ == 0) {
-    delete ptr->gemm_context_;
+  if (--ptr->num_references == 0) {
     delete ptr;
-    context->gemm_context = nullptr;
+    context->SetExternalContext(context, kTfLiteGemmLowpContext, nullptr);
   }
 }
 
 gemmlowp::GemmContext* GetFromContext(TfLiteContext* context) {
-  auto* ptr = reinterpret_cast<RefCountedGemmContext*>(context->gemm_context);
+  auto* ptr = GetGemmLowpContext(context);
   if (ptr == nullptr) {
     TF_LITE_FATAL(
         "Call to GetFromContext() not preceded by IncrementUsageCounter()");
   }
-  return ptr->gemm_context_;
-}
-
-void SetNumThreads(TfLiteContext* context, int num_threads) {
-  IncrementUsageCounter(context);
-  GetFromContext(context)->set_max_num_threads(num_threads);
-  DecrementUsageCounter(context);
+  return ptr->gemm_context.get();
 }
 
 }  // namespace gemm_support
diff --git a/tensorflow/contrib/lite/kernels/gemm_support.h b/tensorflow/contrib/lite/kernels/gemm_support.h
index f033501cb6..37af772c68 100644
--- a/tensorflow/contrib/lite/kernels/gemm_support.h
+++ b/tensorflow/contrib/lite/kernels/gemm_support.h
@@ -45,9 +45,6 @@ void IncrementUsageCounter(TfLiteContext* context);
 // 'context'. If there are no more usages the GemmContext will be deleted.
 void DecrementUsageCounter(TfLiteContext* context);
 
-// Set the number of threads that can be used by gemmlowp.
-void SetNumThreads(TfLiteContext* context, int num_threads);
-
 }  // namespace gemm_support
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/hashtable_lookup.cc b/tensorflow/contrib/lite/kernels/hashtable_lookup.cc
index 41211d41aa..f37c66acb3 100644
--- a/tensorflow/contrib/lite/kernels/hashtable_lookup.cc
+++ b/tensorflow/contrib/lite/kernels/hashtable_lookup.cc
@@ -31,7 +31,6 @@ limitations under the License.
 //   Each item indicates whether the corresponding lookup has a returned value.
 //   0 for missing key, 1 for found key.
 
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
diff --git a/tensorflow/contrib/lite/kernels/internal/BUILD b/tensorflow/contrib/lite/kernels/internal/BUILD
index 7962fcbc9d..0d424071da 100644
--- a/tensorflow/contrib/lite/kernels/internal/BUILD
+++ b/tensorflow/contrib/lite/kernels/internal/BUILD
@@ -232,6 +232,7 @@ cc_library(
 cc_test(
     name = "tensor_test",
     srcs = ["tensor_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":reference",
         "@com_google_googletest//:gtest",
@@ -260,6 +261,7 @@ cc_library(
 cc_test(
     name = "quantization_util_test",
     srcs = ["quantization_util_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":quantization_util",
         "@com_google_googletest//:gtest",
@@ -479,6 +481,9 @@ cc_library(
         ":darwin": [
             ":neon_tensor_utils",
         ],
+        ":darwin_x86_64": [
+            ":neon_tensor_utils",
+        ],
         "//conditions:default": [
             ":portable_tensor_utils",
         ],
@@ -505,7 +510,10 @@ cc_test(
         "//conditions:default": [],
     }),
     linkstatic = 1,
-    tags = ["tflite_not_portable_ios"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable_ios",
+    ],
     deps = [
         ":tensor_utils",
         "//tensorflow/contrib/lite:builtin_op_data",
@@ -517,6 +525,7 @@ cc_test(
 cc_test(
     name = "depthwiseconv_float_test",
     srcs = ["depthwiseconv_float_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":optimized_base",
         ":reference_base",
@@ -529,6 +538,7 @@ cc_test(
 cc_test(
     name = "depthwiseconv_quantized_test",
     srcs = ["depthwiseconv_quantized_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":optimized_base",
         ":reference_base",
@@ -541,7 +551,10 @@ cc_test(
 cc_test(
     name = "resize_bilinear_test",
     srcs = ["resize_bilinear_test.cc"],
-    tags = ["tflite_not_portable"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
     deps = [
         ":optimized_base",
         ":reference_base",
@@ -557,6 +570,7 @@ cc_test(
     srcs = [
         "softmax_quantized_test.cc",
     ],
+    tags = ["no_oss"],
     deps = [
         ":optimized_base",
         ":quantization_util",
@@ -572,7 +586,10 @@ cc_test(
     srcs = [
         "logsoftmax_quantized_test.cc",
     ],
-    tags = ["tflite_not_portable"],
+    tags = [
+        "no_oss",
+        "tflite_not_portable",
+    ],
     deps = [
         ":optimized_base",
         ":quantization_util",
@@ -585,6 +602,7 @@ cc_test(
 cc_test(
     name = "log_quantized_test",
     srcs = ["log_quantized_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":optimized_base",
         ":reference_base",
@@ -611,6 +629,7 @@ cc_library(
 cc_test(
     name = "batch_to_space_nd_test",
     srcs = ["batch_to_space_nd_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":optimized_base",
         "@com_google_googletest//:gtest_main",
diff --git a/tensorflow/contrib/lite/kernels/internal/common.h b/tensorflow/contrib/lite/kernels/internal/common.h
index b86ca49c11..310a8980e6 100644
--- a/tensorflow/contrib/lite/kernels/internal/common.h
+++ b/tensorflow/contrib/lite/kernels/internal/common.h
@@ -127,6 +127,139 @@ int CountLeadingZeros(T integer_input) {
   return leading_zeros;
 }
 
+// DO NOT USE THIS STRUCT FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING
+// BROADCASTING.
+//
+// NdArrayDesc<N> describes the shape and memory layout of an N-dimensional
+// rectangular array of numbers.
+//
+// NdArrayDesc<N> is basically identical to Dims<N> defined in types.h.
+// However, as Dims<N> is to be deprecated, this class exists as an adaptor
+// to enable simple unoptimized implementations of element-wise broadcasting
+// operations.
+template <int N>
+struct NdArrayDesc {
+  // The "extent" of each dimension. Indices along dimension d must be in the
+  // half-open interval [0, extents[d]).
+  int extents[N];
+
+  // The number of *elements* (not bytes) between consecutive indices of each
+  // dimension.
+  int strides[N];
+};
+
+// DO NOT USE THIS FUNCTION FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING
+// BROADCASTING.
+//
+// Same as Offset(), except takes as NdArrayDesc<N> instead of Dims<N>.
+inline int SubscriptToIndex(const NdArrayDesc<4>& desc, int i0, int i1, int i2,
+                            int i3) {
+  TFLITE_DCHECK(i0 >= 0 && i0 < desc.extents[0]);
+  TFLITE_DCHECK(i1 >= 0 && i1 < desc.extents[1]);
+  TFLITE_DCHECK(i2 >= 0 && i2 < desc.extents[2]);
+  TFLITE_DCHECK(i3 >= 0 && i3 < desc.extents[3]);
+  return i0 * desc.strides[0] + i1 * desc.strides[1] + i2 * desc.strides[2] +
+         i3 * desc.strides[3];
+}
+
+// Given the dimensions of the operands for an element-wise binary broadcast,
+// adjusts them so that they can be directly iterated over with simple loops.
+// Returns the adjusted dims as instances of NdArrayDesc in 'desc0_out' and
+// 'desc1_out'. 'desc0_out' and 'desc1_out' cannot be nullptr.
+//
+// This function assumes that the two input shapes are compatible up to
+// broadcasting and the shorter one has already been prepended with 1s to be the
+// same length. E.g., if shape0 is (1, 16, 16, 64) and shape1 is (1, 64),
+// shape1 must already have been prepended to be (1, 1, 1, 64). Recall that
+// Dims<N> refer to shapes in reverse order. In this case, input0_dims will be
+// (64, 16, 16, 1) and input1_dims will be (64, 1, 1, 1).
+//
+// When two shapes are compatible up to broadcasting, for each dimension d,
+// the input extents are either equal, or one of them is 1.
+//
+// This function performs the following for each dimension d:
+// - If the extents are equal, then do nothing since the loop that walks over
+//   both of the input arrays is correct.
+// - Otherwise, one (and only one) of the extents must be 1. Say extent0 is 1
+//   and extent1 is e1. Then set extent0 to e1 and stride0 *to 0*. This allows
+//   array0 to be referenced *at any index* in dimension d and still access the
+//   same slice.
+template <int N>
+inline void NdArrayDescsForElementwiseBroadcast(const Dims<N>& input0_dims,
+                                                const Dims<N>& input1_dims,
+                                                NdArrayDesc<N>* desc0_out,
+                                                NdArrayDesc<N>* desc1_out) {
+  TFLITE_DCHECK(desc0_out != nullptr);
+  TFLITE_DCHECK(desc1_out != nullptr);
+
+  // Copy dims to desc.
+  for (int i = 0; i < N; ++i) {
+    desc0_out->extents[i] = input0_dims.sizes[i];
+    desc0_out->strides[i] = input0_dims.strides[i];
+    desc1_out->extents[i] = input1_dims.sizes[i];
+    desc1_out->strides[i] = input1_dims.strides[i];
+  }
+
+  // Walk over each dimension. If the extents are equal do nothing.
+  // Otherwise, set the desc with extent 1 to have extent equal to the other and
+  // stride 0.
+  for (int i = 0; i < N; ++i) {
+    const int extent0 = ArraySize(input0_dims, i);
+    const int extent1 = ArraySize(input1_dims, i);
+    if (extent0 != extent1) {
+      if (extent0 == 1) {
+        desc0_out->strides[i] = 0;
+        desc0_out->extents[i] = extent1;
+      } else {
+        TFLITE_DCHECK_EQ(extent1, 1);
+        desc1_out->strides[i] = 0;
+        desc1_out->extents[i] = extent0;
+      }
+    }
+  }
+}
+
+template <int N>
+inline void NdArrayDescsForElementwiseBroadcast(
+    const RuntimeShape& input0_shape, const RuntimeShape& input1_shape,
+    NdArrayDesc<N>* desc0_out, NdArrayDesc<N>* desc1_out) {
+  TFLITE_DCHECK(desc0_out != nullptr);
+  TFLITE_DCHECK(desc1_out != nullptr);
+
+  auto extended_input0_shape = RuntimeShape::ExtendedShape(N, input0_shape);
+  auto extended_input1_shape = RuntimeShape::ExtendedShape(N, input1_shape);
+
+  // Copy dims to desc, calculating strides.
+  int desc0_stride = 1;
+  int desc1_stride = 1;
+  for (int i = N - 1; i >= 0; --i) {
+    desc0_out->extents[i] = extended_input0_shape.Dims(i);
+    desc0_out->strides[i] = desc0_stride;
+    desc0_stride *= extended_input0_shape.Dims(i);
+    desc1_out->extents[i] = extended_input1_shape.Dims(i);
+    desc1_out->strides[i] = desc1_stride;
+    desc1_stride *= extended_input1_shape.Dims(i);
+  }
+
+  // Walk over each dimension. If the extents are equal do nothing.
+  // Otherwise, set the desc with extent 1 to have extent equal to the other and
+  // stride 0.
+  for (int i = 0; i < N; ++i) {
+    const int extent0 = extended_input0_shape.Dims(i);
+    const int extent1 = extended_input1_shape.Dims(i);
+    if (extent0 != extent1) {
+      if (extent0 == 1) {
+        desc0_out->strides[i] = 0;
+        desc0_out->extents[i] = extent1;
+      } else {
+        TFLITE_DCHECK_EQ(extent1, 1);
+        desc1_out->strides[i] = 0;
+        desc1_out->extents[i] = extent0;
+      }
+    }
+  }
+}
+
 }  // namespace tflite
 
 #endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_COMMON_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
index a0e382edb6..200f2f1515 100644
--- a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
@@ -255,14 +255,6 @@ void LstmStep(
                            output_state_ptr);
 }
 
-// TODO(alanchiao): move this to tensor_utils.
-void VectorMultiply(const int8_t* vector, const int v_size, const float scale,
-                    float* result) {
-  for (int i = 0; i < v_size; ++i) {
-    *result++ = scale * *vector++;
-  }
-}
-
 void LstmStep(
     const float* input_ptr_batch, const int8_t* input_to_input_weights_ptr,
     float input_to_input_weights_scale,
@@ -415,8 +407,9 @@ void LstmStep(
   // For each batch and cell: update input gate.
   if (!use_cifg) {
     if (use_peephole && !is_cell_state_all_zeros) {
-      VectorMultiply(cell_to_input_weights_ptr, n_cell,
-                     cell_to_input_weights_scale, recovered_cell_weights);
+      tensor_utils::VectorScalarMultiply(cell_to_input_weights_ptr, n_cell,
+                                         cell_to_input_weights_scale,
+                                         recovered_cell_weights);
       tensor_utils::VectorBatchVectorCwiseProductAccumulate(
           recovered_cell_weights, n_cell, cell_state_ptr, n_batch,
           input_gate_scratch);
@@ -427,8 +420,9 @@ void LstmStep(
 
   // For each batch and cell: update forget gate.
   if (use_peephole && !is_cell_state_all_zeros) {
-    VectorMultiply(cell_to_forget_weights_ptr, n_cell,
-                   cell_to_forget_weights_scale, recovered_cell_weights);
+    tensor_utils::VectorScalarMultiply(cell_to_forget_weights_ptr, n_cell,
+                                       cell_to_forget_weights_scale,
+                                       recovered_cell_weights);
     tensor_utils::VectorBatchVectorCwiseProductAccumulate(
         recovered_cell_weights, n_cell, cell_state_ptr, n_batch,
         forget_gate_scratch);
@@ -459,8 +453,9 @@ void LstmStep(
       tensor_utils::IsZeroVector(cell_state_ptr, n_batch * n_cell);
   // For each batch and cell: update the output gate.
   if (use_peephole && !is_cell_state_all_zeros) {
-    VectorMultiply(cell_to_output_weights_ptr, n_cell,
-                   cell_to_output_weights_scale, recovered_cell_weights);
+    tensor_utils::VectorScalarMultiply(cell_to_output_weights_ptr, n_cell,
+                                       cell_to_output_weights_scale,
+                                       recovered_cell_weights);
     tensor_utils::VectorBatchVectorCwiseProductAccumulate(
         recovered_cell_weights, n_cell, cell_state_ptr, n_batch,
         output_gate_scratch);
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/eigen_tensor_reduced_instantiations_google.h b/tensorflow/contrib/lite/kernels/internal/optimized/eigen_tensor_reduced_instantiations_google.h
index d85e06a5d5..250872c422 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/eigen_tensor_reduced_instantiations_google.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/eigen_tensor_reduced_instantiations_google.h
@@ -33,7 +33,7 @@ limitations under the License.
 #include <functional>
 
 #ifdef _WIN32
-#include <winbase.h>
+#include <windows.h>
 #elif defined(__APPLE__)
 #include <mach/mach_time.h>
 #else
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/legacy_optimized_ops.h b/tensorflow/contrib/lite/kernels/internal/optimized/legacy_optimized_ops.h
index 7816752132..d5503073a7 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/legacy_optimized_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/legacy_optimized_ops.h
@@ -55,15 +55,262 @@ inline void Relu(const float* input_data, const Dims<4>& input_dims,
        DimsToShape(output_dims));
 }
 
+// legacy, for compatibility with old checked-in code
+template <FusedActivationFunctionType Ac>
+void Add(const float* input1_data, const Dims<4>& input1_dims,
+         const float* input2_data, const Dims<4>& input2_dims,
+         float* output_data, const Dims<4>& output_dims) {
+  float output_activation_min, output_activation_max;
+  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
+
+  tflite::ArithmeticParams op_params;
+  op_params.float_activation_min = output_activation_min;
+  op_params.float_activation_max = output_activation_max;
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+inline void Add(int left_shift, const uint8* input1_data,
+                const Dims<4>& input1_dims, int32 input1_offset,
+                int32 input1_multiplier, int input1_shift,
+                const uint8* input2_data, const Dims<4>& input2_dims,
+                int32 input2_offset, int32 input2_multiplier, int input2_shift,
+                int32 output_offset, int32 output_multiplier, int output_shift,
+                int32 output_activation_min, int32 output_activation_max,
+                uint8* output_data, const Dims<4>& output_dims) {
+  constexpr int kReverseShift = -1;
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, 0);
+    TFLITE_DCHECK_EQ(output_activation_max, 255);
+  }
+
+  tflite::ArithmeticParams op_params;
+  op_params.left_shift = left_shift;
+  op_params.input1_offset = input1_offset;
+  op_params.input1_multiplier = input1_multiplier;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_offset = input2_offset;
+  op_params.input2_multiplier = input2_multiplier;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.output_offset = output_offset;
+  op_params.output_multiplier = output_multiplier;
+  op_params.output_shift = kReverseShift * output_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+void Add(const int32* input1_data, const Dims<4>& input1_dims,
+         const int32* input2_data, const Dims<4>& input2_dims,
+         int32* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Add/int32");
+  TFLITE_DCHECK(Ac == FusedActivationFunctionType::kNone);
+
+  tflite::ArithmeticParams op_params;
+  op_params.quantized_activation_min = std::numeric_limits<int32>::min();
+  op_params.quantized_activation_max = std::numeric_limits<int32>::max();
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <typename T>
+void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T output_activation_min, T output_activation_max,
+                  T* output_data, const Dims<4>& output_dims) {
+  tflite::ArithmeticParams op_params;
+  op_params.float_activation_min = output_activation_min;
+  op_params.float_activation_max = output_activation_max;
+  BroadcastAdd4DSlow(op_params, DimsToShape(input1_dims), input1_data,
+                     DimsToShape(input2_dims), input2_data,
+                     DimsToShape(output_dims), output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+inline void BroadcastAdd(int left_shift, const uint8* input1_data,
+                         const Dims<4>& input1_dims, int32 input1_offset,
+                         int32 input1_multiplier, int input1_shift,
+                         const uint8* input2_data, const Dims<4>& input2_dims,
+                         int32 input2_offset, int32 input2_multiplier,
+                         int input2_shift, int32 output_offset,
+                         int32 output_multiplier, int output_shift,
+                         int32 output_activation_min,
+                         int32 output_activation_max, uint8* output_data,
+                         const Dims<4>& output_dims) {
+  constexpr int kReverseShift = -1;
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, 0);
+    TFLITE_DCHECK_EQ(output_activation_max, 255);
+  }
+
+  tflite::ArithmeticParams op_params;
+  op_params.left_shift = left_shift;
+  op_params.input1_offset = input1_offset;
+  op_params.input1_multiplier = input1_multiplier;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_offset = input2_offset;
+  op_params.input2_multiplier = input2_multiplier;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.output_offset = output_offset;
+  op_params.output_multiplier = output_multiplier;
+  op_params.output_shift = kReverseShift * output_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  BroadcastAdd4DSlow(op_params, DimsToShape(input1_dims), input1_data,
+                     DimsToShape(input2_dims), input2_data,
+                     DimsToShape(output_dims), output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+inline void BroadcastAddFivefold(
+    int y0, int y1, int y2, int y3, int y4, int left_shift,
+    const uint8* input1_data, const Dims<4>& input1_dims, int32 input1_offset,
+    int32 input1_multiplier, int input1_shift, const uint8* input2_data,
+    const Dims<4>& input2_dims, int32 input2_offset, int32 input2_multiplier,
+    int input2_shift, int32 output_offset, int32 output_multiplier,
+    int output_shift, int32 output_activation_min, int32 output_activation_max,
+    uint8* output_data, const Dims<4>& output_dims) {
+  constexpr int kReverseShift = -1;
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, 0);
+    TFLITE_DCHECK_EQ(output_activation_max, 255);
+  }
+  tflite::ArithmeticParams op_params;
+  op_params.broadcast_category =
+      tflite::BroadcastableOpCategory::kFirstInputBroadcastsFast;
+  op_params.left_shift = left_shift;
+  op_params.input1_offset = input1_offset;
+  op_params.input1_multiplier = input1_multiplier;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_offset = input2_offset;
+  op_params.input2_multiplier = input2_multiplier;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.output_offset = output_offset;
+  op_params.output_multiplier = output_multiplier;
+  op_params.output_shift = kReverseShift * output_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  op_params.broadcast_shape[4] = y0;
+  op_params.broadcast_shape[3] = y1;
+  op_params.broadcast_shape[2] = y2;
+  op_params.broadcast_shape[1] = y3;
+  op_params.broadcast_shape[0] = y4;
+  BroadcastAddFivefold(op_params, DimsToShape(input1_dims), input1_data,
+                       DimsToShape(input2_dims), input2_data,
+                       DimsToShape(output_dims), output_data);
+}
+
+// legacy, for compatibility with old checked-in code
+template <FusedActivationFunctionType Ac, typename T>
+void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T* output_data, const Dims<4>& output_dims) {
+  T output_activation_min, output_activation_max;
+  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
+
+  BroadcastAdd(input1_data, input1_dims, input2_data, input2_dims,
+               output_activation_min, output_activation_max, output_data,
+               output_dims);
+}
+
+template <FusedActivationFunctionType Ac>
+inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
+                int input1_shift, const int16* input2_data,
+                const Dims<4>& input2_dims, int input2_shift,
+                int16 output_activation_min, int16 output_activation_max,
+                int16* output_data, const Dims<4>& output_dims) {
+  constexpr int kReverseShift = -1;
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, -32768);
+    TFLITE_DCHECK_EQ(output_activation_max, 32767);
+  }
+
+  tflite::ArithmeticParams op_params;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
+                const float* input2_data, const Dims<4>& input2_dims,
+                float* output_data, const Dims<4>& output_dims) {
+  float output_activation_min, output_activation_max;
+  GetActivationMinMax(FusedActivationFunctionType::kNone,
+                      &output_activation_min, &output_activation_max);
+  tflite::ArithmeticParams op_params;
+  op_params.float_activation_min = output_activation_min;
+  op_params.float_activation_max = output_activation_max;
+  Sub(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <typename T>
+void Sub(const T* input1_data, const Dims<4>& input1_dims, const T* input2_data,
+         const Dims<4>& input2_dims, T* output_data,
+         const Dims<4>& output_dims) {
+  T output_activation_min, output_activation_max;
+  GetActivationMinMax(FusedActivationFunctionType::kNone,
+                      &output_activation_min, &output_activation_max);
+  tflite::ArithmeticParams op_params;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  Sub(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
 inline void AveragePool(const float* input_data, const Dims<4>& input_dims,
                         int stride_width, int stride_height, int pad_width,
                         int pad_height, int kwidth, int kheight,
                         float output_activation_min,
                         float output_activation_max, float* output_data,
                         const Dims<4>& output_dims) {
-  AveragePool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-              pad_width, pad_height, kwidth, kheight, output_activation_min,
-              output_activation_max, output_data, DimsToShape(output_dims));
+  tflite::PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = kheight;
+  params.filter_width = kwidth;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.float_activation_min = output_activation_min;
+  params.float_activation_max = output_activation_max;
+  AveragePool(params, DimsToShape(input_dims), input_data,
+              DimsToShape(output_dims), output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -96,10 +343,17 @@ inline void AveragePool(const uint8* input_data, const Dims<4>& input_dims,
                         int32 output_activation_min,
                         int32 output_activation_max, uint8* output_data,
                         const Dims<4>& output_dims) {
-  AveragePool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-              pad_width, pad_height, filter_width, filter_height,
-              output_activation_min, output_activation_max, output_data,
-              DimsToShape(output_dims));
+  tflite::PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = filter_height;
+  params.filter_width = filter_width;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.quantized_activation_min = output_activation_min;
+  params.quantized_activation_max = output_activation_max;
+  AveragePool(params, DimsToShape(input_dims), input_data,
+              DimsToShape(output_dims), output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -140,9 +394,17 @@ inline void MaxPool(const float* input_data, const Dims<4>& input_dims,
                     int pad_height, int kwidth, int kheight,
                     float output_activation_min, float output_activation_max,
                     float* output_data, const Dims<4>& output_dims) {
-  MaxPool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-          pad_width, pad_height, kwidth, kheight, output_activation_min,
-          output_activation_max, output_data, DimsToShape(output_dims));
+  tflite::PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = kheight;
+  params.filter_width = kwidth;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.float_activation_min = output_activation_min;
+  params.float_activation_max = output_activation_max;
+  MaxPool(params, DimsToShape(input_dims), input_data, DimsToShape(output_dims),
+          output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -172,10 +434,17 @@ inline void MaxPool(const uint8* input_data, const Dims<4>& input_dims,
                     int pad_height, int filter_width, int filter_height,
                     int32 output_activation_min, int32 output_activation_max,
                     uint8* output_data, const Dims<4>& output_dims) {
-  MaxPool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-          pad_width, pad_height, filter_width, filter_height,
-          output_activation_min, output_activation_max, output_data,
-          DimsToShape(output_dims));
+  PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = filter_height;
+  params.filter_width = filter_width;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.quantized_activation_min = output_activation_min;
+  params.quantized_activation_max = output_activation_max;
+  MaxPool(params, DimsToShape(input_dims), input_data, DimsToShape(output_dims),
+          output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -215,10 +484,17 @@ inline void L2Pool(const float* input_data, const Dims<4>& input_dims,
                    int pad_height, int filter_width, int filter_height,
                    float output_activation_min, float output_activation_max,
                    float* output_data, const Dims<4>& output_dims) {
-  L2Pool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-         pad_width, pad_height, filter_width, filter_height,
-         output_activation_min, output_activation_max, output_data,
-         DimsToShape(output_dims));
+  PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = filter_height;
+  params.filter_width = filter_width;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.float_activation_min = output_activation_min;
+  params.float_activation_max = output_activation_max;
+  L2Pool(params, DimsToShape(input_dims), input_data, DimsToShape(output_dims),
+         output_data);
 }
 
 // legacy, for compatibility with old checked-in code
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/multithreaded_conv.h b/tensorflow/contrib/lite/kernels/internal/optimized/multithreaded_conv.h
index 27d9224512..4a3545d47a 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/multithreaded_conv.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/multithreaded_conv.h
@@ -35,35 +35,6 @@ limitations under the License.
 namespace tflite {
 namespace multithreaded_ops {
 
-class EigenThreadPoolWrapper : public Eigen::ThreadPoolInterface {
- public:
-  explicit EigenThreadPoolWrapper(Eigen::ThreadPool* pool) : pool_(pool) {}
-  ~EigenThreadPoolWrapper() override {}
-
-  void Schedule(std::function<void()> fn) override {
-    pool_->Schedule(std::move(fn));
-  }
-  int NumThreads() const override { return pool_->NumThreads(); }
-  int CurrentThreadId() const override { return pool_->CurrentThreadId(); }
-
- private:
-  Eigen::ThreadPool* pool_ = nullptr;
-};
-
-// We have a single global threadpool for all convolution operations. This means
-// that inferences started from different threads may block each other, but
-// since the underlying resource of CPU cores should be consumed by the
-// operations anyway, it shouldn't affect overall performance.
-const Eigen::ThreadPoolDevice& GetThreadPoolDevice() {
-  const int thread_count = 4;
-  static Eigen::ThreadPool* tp = new Eigen::ThreadPool(thread_count);
-  static EigenThreadPoolWrapper* thread_pool_wrapper =
-      new EigenThreadPoolWrapper(tp);
-  static Eigen::ThreadPoolDevice* device =
-      new Eigen::ThreadPoolDevice(thread_pool_wrapper, thread_count);
-  return *device;
-}
-
 // Shorthands for the types we need when interfacing with the EigenTensor
 // library.
 typedef Eigen::TensorMap<
@@ -113,14 +84,13 @@ class EigenTensorConvFunctor {
   }
 
  public:
-  void operator()(const T* input_data, T* im2col_buffer, int input_batches,
-                  int input_height, int input_width, int input_depth,
-                  const T* filter_data, int filter_height, int filter_width,
-                  int filter_count, int stride_rows, int stride_cols,
-                  int pad_width, int pad_height, TfLitePadding padding,
-                  T* output_data, int output_height, int output_width) {
-    const Eigen::ThreadPoolDevice& device = GetThreadPoolDevice();
-
+  void operator()(const Eigen::ThreadPoolDevice& device, const T* input_data,
+                  T* im2col_buffer, int input_batches, int input_height,
+                  int input_width, int input_depth, const T* filter_data,
+                  int filter_height, int filter_width, int filter_count,
+                  int stride_rows, int stride_cols, int pad_width,
+                  int pad_height, TfLitePadding padding, T* output_data,
+                  int output_height, int output_width) {
     const bool is_1x1_kernel = (filter_height == 1 && filter_width == 1 &&
                                 stride_rows == 1 && stride_cols == 1);
     if (is_1x1_kernel) {
@@ -162,11 +132,11 @@ class EigenTensorConvFunctor {
   }
 };
 
-inline void Conv(const float* input_data, const Dims<4>& input_dims,
-                 const float* filter_data, const Dims<4>& filter_dims,
-                 const float* bias_data, const Dims<4>& bias_dims,
-                 int stride_width, int stride_height, int pad_width,
-                 int pad_height, TfLitePadding padding,
+inline void Conv(const Eigen::ThreadPoolDevice& device, const float* input_data,
+                 const Dims<4>& input_dims, const float* filter_data,
+                 const Dims<4>& filter_dims, const float* bias_data,
+                 const Dims<4>& bias_dims, int stride_width, int stride_height,
+                 int pad_width, int pad_height, TfLitePadding padding,
                  float output_activation_min, float output_activation_max,
                  float* output_data, const Dims<4>& output_dims,
                  float* im2col_data, const Dims<4>& im2col_dims) {
@@ -180,10 +150,11 @@ inline void Conv(const float* input_data, const Dims<4>& input_dims,
   const int output_height = ArraySize(output_dims, 2);
   const int output_width = ArraySize(output_dims, 1);
   EigenTensorConvFunctor<float> conv_functor;
-  conv_functor(input_data, im2col_data, batches, input_height, input_width,
-               input_depth, filter_data, filter_height, filter_width,
-               output_depth, stride_height, stride_width, pad_height, pad_width,
-               padding, output_data, output_height, output_width);
+  conv_functor(device, input_data, im2col_data, batches, input_height,
+               input_width, input_depth, filter_data, filter_height,
+               filter_width, output_depth, stride_height, stride_width,
+               pad_height, pad_width, padding, output_data, output_height,
+               output_width);
 
   optimized_ops::AddBiasAndEvalActivationFunction(
       bias_data, bias_dims, output_data, output_dims, output_activation_min,
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc
index 5ba7e2af9b..420bc68b43 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.cc
@@ -55,83 +55,33 @@ void NeonMatrixBatchVectorMultiplyAccumulate(const float* matrix, int m_rows,
   const int postamble_start =
       m_cols - (m_cols & (kFloatWeightsPerNeonLane - 1));
 
-  // The arrays used to cache the vector.
-  void* aligned_vector_cache_free = nullptr;
-  float32x4_t* vector_cache_float32x4 =
-      reinterpret_cast<float32x4_t*>(aligned_alloc(
-          sizeof(float32x4_t), (postamble_start >> 2) * sizeof(float32x4_t),
-          &aligned_vector_cache_free));
-
-  const int kUnrollSize = 2;
   for (int b = 0; b < n_batch; b++) {
     float* result_in_batch = result + b * m_rows * result_stride;
     const float* vector_in_batch = vector + b * m_cols;
+    const float* matrix_row = matrix;
 
-    const float* matrix_ptr0 = matrix;
-    // If there is only 1 row, we don't want to assign an illegal pointer.
-    const float* matrix_ptr1 = nullptr;
-    if (m_rows > 1) {
-      matrix_ptr1 = matrix + m_cols;
-    }
-
-    // Cache the vector.
-    for (int c = 0; c < postamble_start; c += kFloatWeightsPerNeonLane) {
-      vector_cache_float32x4[c >> 2] = vld1q_f32(vector_in_batch + c);
-    }
-
-    // Main matrix by vector multiplication loop, which handles two rows of
-    // matrix by vector multiplication.
-    for (int r = 0; r < (m_rows & ~(kUnrollSize - 1)); r += kUnrollSize) {
-      float32x4_t acc0_32x4 = vmovq_n_f32(0.0);
-      float32x4_t acc1_32x4 = vmovq_n_f32(0.0);
+    // Main matrix by vector multiplication loop
+    for (int r = 0; r < m_rows; r++) {
+      float32x4_t acc_32x4 = vmovq_n_f32(0.0);
       for (int c = 0; c < postamble_start; c += kFloatWeightsPerNeonLane) {
-        float32x4_t temp = vector_cache_float32x4[c >> 2];
-        // Load 4 float values from vector1 and vector2 and accumulator.
-        float32x4_t v0_f32x4 = vld1q_f32(matrix_ptr0 + c);
-        float32x4_t v1_f32x4 = vld1q_f32(matrix_ptr1 + c);
-        // Vector multiply-accumulate 4 float
-        acc0_32x4 = vmlaq_f32(acc0_32x4, v0_f32x4, temp);
-        acc1_32x4 = vmlaq_f32(acc1_32x4, v1_f32x4, temp);
+        // Load 4 float values from vector and matrix row.
+        float32x4_t vector_f32x4 = vld1q_f32(vector_in_batch + c);
+        float32x4_t matrix_f32x4 = vld1q_f32(matrix_row + c);
+        // Multiply the vector and matrix row and add to accumulator.
+        acc_32x4 = vmlaq_f32(acc_32x4, matrix_f32x4, vector_f32x4);
       }
       // Add the 4 intermediate sum values to get the final dot-prod value for
       // this column.
       *result_in_batch +=
-          (vgetq_lane_f32(acc0_32x4, 0) + vgetq_lane_f32(acc0_32x4, 1) +
-           vgetq_lane_f32(acc0_32x4, 2) + vgetq_lane_f32(acc0_32x4, 3));
-      *(result_in_batch + result_stride) +=
-          (vgetq_lane_f32(acc1_32x4, 0) + vgetq_lane_f32(acc1_32x4, 1) +
-           vgetq_lane_f32(acc1_32x4, 2) + vgetq_lane_f32(acc1_32x4, 3));
+          (vgetq_lane_f32(acc_32x4, 0) + vgetq_lane_f32(acc_32x4, 1) +
+           vgetq_lane_f32(acc_32x4, 2) + vgetq_lane_f32(acc_32x4, 3));
       for (int c = postamble_start; c < m_cols; c++) {
-        *result_in_batch += matrix_ptr0[c] * vector_in_batch[c];
-        *(result_in_batch + result_stride) +=
-            matrix_ptr1[c] * vector_in_batch[c];
+        *result_in_batch += matrix_row[c] * vector_in_batch[c];
       }
-      matrix_ptr0 += kUnrollSize * m_cols;
-      matrix_ptr1 += kUnrollSize * m_cols;
-      result_in_batch += kUnrollSize * result_stride;
-    }
-    for (int r = (m_rows & ~(kUnrollSize - 1)); r < m_rows; r++) {
-      float32x4_t acc0_32x4 = vmovq_n_f32(0.0);
-      for (int c = 0; c < postamble_start; c += kFloatWeightsPerNeonLane) {
-        float32x4_t temp = vector_cache_float32x4[c >> 2];
-        // Load 4 float values from vector1 and vector2 and accumulator.
-        float32x4_t v0_f32x4 = vld1q_f32(matrix_ptr0 + c);
-        // Vector multiply-accumulate 4 float
-        acc0_32x4 = vmlaq_f32(acc0_32x4, v0_f32x4, temp);
-      }
-      // Add the 4 intermediate sum values to get the final dot-prod value for
-      // this column.
-      *result_in_batch +=
-          (vgetq_lane_f32(acc0_32x4, 0) + vgetq_lane_f32(acc0_32x4, 1) +
-           vgetq_lane_f32(acc0_32x4, 2) + vgetq_lane_f32(acc0_32x4, 3));
-      for (int c = postamble_start; c < m_cols; c++) {
-        *result_in_batch += matrix_ptr0[c] * vector_in_batch[c];
-      }
-      matrix_ptr0 += m_cols;
+      matrix_row += m_cols;
       result_in_batch += result_stride;
     }
   }
-  free(aligned_vector_cache_free);
 }
 
 void NeonMatrixBatchVectorMultiplyAccumulate(
@@ -296,17 +246,6 @@ void NeonVectorBatchVectorCwiseProductAccumulate(const float* vector,
   const int postamble_start =
       v_size - (v_size & (kFloatWeightsPerNeonLane - 1));
 
-  // The arrays used to cache the vector.
-  void* aligned_vector_cache_free = nullptr;
-  float32x4_t* vector_cache_float32x4 =
-      reinterpret_cast<float32x4_t*>(aligned_alloc(
-          sizeof(float32x4_t), (postamble_start >> 2) * sizeof(float32x4_t),
-          &aligned_vector_cache_free));
-
-  for (int v = 0; v < postamble_start; v += kFloatWeightsPerNeonLane) {
-    vector_cache_float32x4[v >> 2] = vld1q_f32(vector + v);
-  }
-
   float* result_ptr = result;
   const float* batch_vector_ptr = batch_vector;
   for (int b = 0; b < n_batch; b++) {
@@ -314,9 +253,9 @@ void NeonVectorBatchVectorCwiseProductAccumulate(const float* vector,
       // Load from memory to vectors.
       float32x4_t result_f32x4 = vld1q_f32(result_ptr + v);
       float32x4_t batch_vector_f32x4 = vld1q_f32(batch_vector_ptr + v);
+      float32x4_t vector_f32x4 = vld1q_f32(vector + v);
       // Multiply-accumulate.
-      result_f32x4 = vmlaq_f32(result_f32x4, batch_vector_f32x4,
-                               vector_cache_float32x4[v >> 2]);
+      result_f32x4 = vmlaq_f32(result_f32x4, batch_vector_f32x4, vector_f32x4);
       // Store.
       vst1q_f32(result_ptr + v, result_f32x4);
     }
@@ -328,7 +267,6 @@ void NeonVectorBatchVectorCwiseProductAccumulate(const float* vector,
     result_ptr += v_size;
     batch_vector_ptr += v_size;
   }
-  free(aligned_vector_cache_free);
 }
 
 void NeonSub1Vector(const float* vector, int v_size, float* result) {
@@ -404,6 +342,77 @@ void NeonClipVector(const float* vector, int v_size, float abs_limit,
   }
 }
 
+void NeonVectorScalarMultiply(const int8_t* vector, const int v_size,
+                              const float scale, float* result) {
+  // Here the assumption is that each buffer is 4-byte aligned.
+  const int kWeightsPerUint32 = 4;
+  TFLITE_CHECK_EQ((intptr_t)(&vector[0]) & (kWeightsPerUint32 - 1), 0);
+  // If v_size is not divisible by kWeightsPerNeonLane, we cannot use the main
+  // vectorized loop, and we need to process sequentially. postamble_start shows
+  // the start index where this should happen.
+  const int kWeightsPerNeonLane = 16;
+  const int postamble_start = v_size - (v_size & (kWeightsPerNeonLane - 1));
+
+  // Create a vector of 4 floats with the scale value.
+  const float32x4_t scale_f32x4 = vdupq_n_f32(scale);
+  int v = 0;
+  for (; v < postamble_start; v += kWeightsPerNeonLane) {
+    // Load int8 values, sixteen at a time.
+    const int8x16_t v_i8x16 = vld1q_s8(vector + v);
+    // Split it into two components of size eight.
+    const int8x8_t v0_i8x8 = vget_low_s8(v_i8x16);
+    const int8x8_t v1_i8x8 = vget_high_s8(v_i8x16);
+    // Convert both components to int16 first.
+    const int16x8_t v0_i16x8 = vmovl_s8(v0_i8x8);
+    const int16x8_t v1_i16x8 = vmovl_s8(v1_i8x8);
+    // Split each of them into two components each.
+    const int16x4_t v0_i16x4 = vget_low_s16(v0_i16x8);
+    const int16x4_t v1_i16x4 = vget_high_s16(v0_i16x8);
+    const int16x4_t v2_i16x4 = vget_low_s16(v1_i16x8);
+    const int16x4_t v3_i16x4 = vget_high_s16(v1_i16x8);
+    // Convert these to int32 and then to float.
+    float32x4_t v0_f32x4 = vcvtq_f32_s32(vmovl_s16(v0_i16x4));
+    float32x4_t v1_f32x4 = vcvtq_f32_s32(vmovl_s16(v1_i16x4));
+    float32x4_t v2_f32x4 = vcvtq_f32_s32(vmovl_s16(v2_i16x4));
+    float32x4_t v3_f32x4 = vcvtq_f32_s32(vmovl_s16(v3_i16x4));
+    // Vector multiply four floats at a time.
+    v0_f32x4 = vmulq_f32(v0_f32x4, scale_f32x4);
+    v1_f32x4 = vmulq_f32(v1_f32x4, scale_f32x4);
+    v2_f32x4 = vmulq_f32(v2_f32x4, scale_f32x4);
+    v3_f32x4 = vmulq_f32(v3_f32x4, scale_f32x4);
+    // Store the results.
+    vst1q_f32(result + v, v0_f32x4);
+    vst1q_f32(result + v + 4, v1_f32x4);
+    vst1q_f32(result + v + 8, v2_f32x4);
+    vst1q_f32(result + v + 12, v3_f32x4);
+  }
+
+  if (v_size - postamble_start >= (kWeightsPerNeonLane >> 1)) {
+    // Load eight int8 values, if there is at least eight remaining.
+    const int8x8_t v_i8x8 = vld1_s8(vector + v);
+    // Convert them to int16 first.
+    const int16x8_t v_i16x8 = vmovl_s8(v_i8x8);
+    // Split it into two components.
+    const int16x4_t v0_i16x4 = vget_low_s16(v_i16x8);
+    const int16x4_t v1_i16x4 = vget_high_s16(v_i16x8);
+    // Convert the components two floats.
+    float32x4_t v0_f32x4 = vcvtq_f32_s32(vmovl_s16(v0_i16x4));
+    float32x4_t v1_f32x4 = vcvtq_f32_s32(vmovl_s16(v1_i16x4));
+    // Vector multiply four floats at a time.
+    v0_f32x4 = vmulq_f32(v0_f32x4, scale_f32x4);
+    v1_f32x4 = vmulq_f32(v1_f32x4, scale_f32x4);
+    // Store the results.
+    vst1q_f32(result + v, v0_f32x4);
+    vst1q_f32(result + v + 4, v1_f32x4);
+    v += (kWeightsPerNeonLane >> 1);
+  }
+
+  // Postamble loop.
+  for (; v < v_size; v++) {
+    result[v] = scale * vector[v];
+  }
+}
+
 void NeonSymmetricQuantizeFloats(const float* values, const int size,
                                  int8_t* quantized_values, float* min,
                                  float* max, float* scaling_factor) {
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h
index 7a5a8fc541..63c89d1eee 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/neon_tensor_utils.h
@@ -105,16 +105,20 @@ bool IsZeroVector(const float* vector, int v_size) {
   return NEON_OR_PORTABLE(IsZeroVector, vector, v_size);
 }
 
+void VectorScalarMultiply(const int8_t* vector, int v_size, float scale,
+                          float* result) {
+  NEON_OR_PORTABLE(VectorScalarMultiply, vector, v_size, scale, result);
+}
 void ClipVector(const float* vector, int v_size, float abs_limit,
                 float* result) {
   NEON_OR_PORTABLE(ClipVector, vector, v_size, abs_limit, result);
 }
 
 void SymmetricQuantizeFloats(const float* values, const int size,
-                             int8_t* quantized_values, float* min, float* max,
-                             float* scaling_factor) {
-  NEON_OR_PORTABLE(SymmetricQuantizeFloats, values, size, quantized_values, min,
-                   max, scaling_factor);
+                             int8_t* quantized_values, float* min_value,
+                             float* max_value, float* scaling_factor) {
+  NEON_OR_PORTABLE(SymmetricQuantizeFloats, values, size, quantized_values,
+                   min_value, max_value, scaling_factor);
 }
 
 void VectorShiftLeft(float* vector, int v_size, float shift_value) {
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
index 8597707b24..ebb2c7a8eb 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
@@ -41,10 +41,13 @@ namespace optimized_ops {
 
 // Unoptimized reference ops:
 using reference_ops::ArgMax;
+using reference_ops::ArgMinMax;
+using reference_ops::BroadcastAdd4DSlow;
 using reference_ops::BroadcastGreater;
 using reference_ops::BroadcastGreaterEqual;
 using reference_ops::BroadcastLess;
 using reference_ops::BroadcastLessEqual;
+using reference_ops::BroadcastSub4DSlow;
 using reference_ops::Concatenation;
 using reference_ops::DepthConcatenation;
 using reference_ops::Dequantize;
@@ -165,6 +168,18 @@ ArrayMap<Scalar> MapAsArrayWithFirstDimAsRows(Scalar* data,
   return ArrayMap<Scalar>(data, rows, cols);
 }
 
+// Copied from tensorflow/core/framework/tensor_types.h
+template <typename T, int NDIMS = 1, typename IndexType = Eigen::DenseIndex>
+struct TTypes {
+  // Rank-1 tensor (vector) of scalar type T.
+  typedef Eigen::TensorMap<Eigen::Tensor<T, 1, Eigen::RowMajor, IndexType>,
+                           Eigen::Aligned>
+      Flat;
+  typedef Eigen::TensorMap<
+      Eigen::Tensor<const T, 2, Eigen::RowMajor, IndexType>>
+      UnalignedConstMatrix;
+};
+
 // TODO(b/62193649): this function is only needed as long
 // as we have the --variable_batch hack.
 template <typename Scalar, int N>
@@ -216,98 +231,6 @@ SaturatingRoundingMultiplyByPOTParam(
       SaturatingRoundingMultiplyByPOTParam(a.raw(), exponent));
 }
 
-// DO NOT USE THIS STRUCT FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING ELEMENT-WISE
-// BROADCASTING.
-//
-// NdArrayDesc<N> describes the shape and memory layout of an N-dimensional
-// rectangular array of numbers.
-//
-// NdArrayDesc<N> is basically identical to Dims<N> defined in types.h.
-// However, as Dims<N> is to be deprecated, this class exists as an adaptor
-// to enable simple unoptimized implementations of element-wise broadcasting
-// operations.
-template <int N>
-struct NdArrayDesc {
-  // The "extent" of each dimension. Indices along dimension d must be in the
-  // half-open interval [0, extents[d]).
-  int extents[N];
-
-  // The number of *elements* (not bytes) between consecutive indices of each
-  // dimension.
-  int strides[N];
-};
-
-// DO NOT USE THIS FUNCTION FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING
-// ELEMENT-WISE BROADCASTING.
-//
-// Same as Offset(), except takes as NdArrayDesc<N> instead of Dims<N>.
-inline int SubscriptToIndex(const NdArrayDesc<4>& desc, int i0, int i1, int i2,
-                            int i3) {
-  TFLITE_DCHECK(i0 >= 0 && i0 < desc.extents[0]);
-  TFLITE_DCHECK(i1 >= 0 && i1 < desc.extents[1]);
-  TFLITE_DCHECK(i2 >= 0 && i2 < desc.extents[2]);
-  TFLITE_DCHECK(i3 >= 0 && i3 < desc.extents[3]);
-  return i0 * desc.strides[0] + i1 * desc.strides[1] + i2 * desc.strides[2] +
-         i3 * desc.strides[3];
-}
-
-// Given the dimensions of the operands for an element-wise binary broadcast,
-// adjusts them so that they can be directly iterated over with simple loops.
-// Returns the adjusted dims as instances of NdArrayDesc in 'desc0_out' and
-// 'desc1_out'. 'desc0_out' and 'desc1_out' cannot be nullptr.
-//
-// This function assumes that the two input shapes are compatible up to
-// broadcasting and the shorter one has already been prepended with 1s to be the
-// same length. E.g., if shape0 is (1, 16, 16, 64) and shape1 is (1, 64),
-// shape1 must already have been prepended to be (1, 1, 1, 64). Recall that
-// Dims<N> refer to shapes in reverse order. In this case, input0_dims will be
-// (64, 16, 16, 1) and input1_dims will be (64, 1, 1, 1).
-//
-// When two shapes are compatible up to broadcasting, for each dimension d,
-// the input extents are either equal, or one of them is 1.
-//
-// This function performs the following for each dimension d:
-// - If the extents are equal, then do nothing since the loop that walks over
-//   both of the input arrays is correct.
-// - Otherwise, one (and only one) of the extents must be 1. Say extent0 is 1
-//   and extent1 is e1. Then set extent0 to e1 and stride0 *to 0*. This allows
-//   array0 to be referenced *at any index* in dimension d and still access the
-//   same slice.
-template <int N>
-inline void NdArrayDescsForElementwiseBroadcast(const Dims<N>& input0_dims,
-                                                const Dims<N>& input1_dims,
-                                                NdArrayDesc<N>* desc0_out,
-                                                NdArrayDesc<N>* desc1_out) {
-  TFLITE_DCHECK(desc0_out != nullptr);
-  TFLITE_DCHECK(desc1_out != nullptr);
-
-  // Copy dims to desc.
-  for (int i = 0; i < N; ++i) {
-    desc0_out->extents[i] = input0_dims.sizes[i];
-    desc0_out->strides[i] = input0_dims.strides[i];
-    desc1_out->extents[i] = input1_dims.sizes[i];
-    desc1_out->strides[i] = input1_dims.strides[i];
-  }
-
-  // Walk over each dimension. If the extents are equal do nothing.
-  // Otherwise, set the desc with extent 1 to have extent equal to the other and
-  // stride 0.
-  for (int i = 0; i < N; ++i) {
-    const int extent0 = ArraySize(input0_dims, i);
-    const int extent1 = ArraySize(input1_dims, i);
-    if (extent0 != extent1) {
-      if (extent0 == 1) {
-        desc0_out->strides[i] = 0;
-        desc0_out->extents[i] = extent1;
-      } else {
-        TFLITE_DCHECK_EQ(extent1, 1);
-        desc1_out->strides[i] = 0;
-        desc1_out->extents[i] = extent0;
-      }
-    }
-  }
-}
-
 inline bool AreSameDims(const Dims<4>& dims1, const Dims<4>& dims2) {
   for (int i = 0; i < 4; i++) {
     if (dims1.sizes[i] != dims2.sizes[i]) {
@@ -1107,10 +1030,10 @@ inline void FullyConnectedAsGEMV(
 struct GemmlowpOutputPipeline {
   typedef gemmlowp::VectorMap<const int32, gemmlowp::VectorShape::Col>
       ColVectorMap;
-  typedef std::tuple<
-      gemmlowp::OutputStageBiasAddition<ColVectorMap>,
-      gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint,
-      gemmlowp::OutputStageClamp, gemmlowp::OutputStageSaturatingCastToUint8>
+  typedef std::tuple<gemmlowp::OutputStageBiasAddition<ColVectorMap>,
+                     gemmlowp::OutputStageScaleInt32ByFixedPointAndExponent,
+                     gemmlowp::OutputStageClamp,
+                     gemmlowp::OutputStageSaturatingCastToUint8>
       Pipeline;
   static Pipeline MakeExp(const int32* bias_data, int output_rows,
                           int32 output_offset, int32 output_multiplier,
@@ -1119,11 +1042,10 @@ struct GemmlowpOutputPipeline {
     ColVectorMap bias_vector(bias_data, output_rows);
     gemmlowp::OutputStageBiasAddition<ColVectorMap> bias_addition_stage;
     bias_addition_stage.bias_vector = bias_vector;
-    gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint
-        quantize_down_stage;
+    gemmlowp::OutputStageScaleInt32ByFixedPointAndExponent quantize_down_stage;
     quantize_down_stage.result_offset_after_shift = output_offset;
     quantize_down_stage.result_fixedpoint_multiplier = output_multiplier;
-    quantize_down_stage.result_shift = -output_left_shift;
+    quantize_down_stage.result_exponent = output_left_shift;
     gemmlowp::OutputStageClamp clamp_stage;
     clamp_stage.min = output_activation_min;
     clamp_stage.max = output_activation_max;
@@ -2477,20 +2399,17 @@ inline void L2Normalization(const uint8* input_data,
   }
 }
 
-inline void Add(const float* input1_data, const Dims<4>& input1_dims,
-                const float* input2_data, const Dims<4>& input2_dims,
-                float output_activation_min, float output_activation_max,
-                float* output_data, const Dims<4>& output_dims) {
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const float* input1_data,
+                const RuntimeShape& input2_shape, const float* input2_data,
+                const RuntimeShape& output_shape, float* output_data) {
   gemmlowp::ScopedProfilingLabel label("Add");
-  TFLITE_DCHECK(IsPackedWithoutStrides(input1_dims));
-  TFLITE_DCHECK(IsPackedWithoutStrides(input2_dims));
-  TFLITE_DCHECK(IsPackedWithoutStrides(output_dims));
 
   int i = 0;
-  const int size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
+  const int size = MatchingFlatSize(input1_shape, input2_shape, output_shape);
 #ifdef USE_NEON
-  const auto activation_min = vdupq_n_f32(output_activation_min);
-  const auto activation_max = vdupq_n_f32(output_activation_max);
+  const auto activation_min = vdupq_n_f32(params.float_activation_min);
+  const auto activation_max = vdupq_n_f32(params.float_activation_max);
   for (; i <= size - 16; i += 16) {
     auto a10 = vld1q_f32(input1_data + i);
     auto a11 = vld1q_f32(input1_data + i + 4);
@@ -2529,29 +2448,26 @@ inline void Add(const float* input1_data, const Dims<4>& input1_dims,
 
   for (; i < size; i++) {
     auto x = input1_data[i] + input2_data[i];
-    output_data[i] = ActivationFunctionWithMinMax(x, output_activation_min,
-                                                  output_activation_max);
+    output_data[i] = ActivationFunctionWithMinMax(
+        x, params.float_activation_min, params.float_activation_max);
   }
 }
 
 // Element-wise add that can often be used for inner loop of broadcast add as
 // well as the non-broadcast add.
-inline void AddElementwise(int size, int left_shift, const uint8* input1_data,
-                           int32 input1_offset, int32 input1_multiplier,
-                           int input1_shift, const uint8* input2_data,
-                           int32 input2_offset, int32 input2_multiplier,
-                           int input2_shift, int32 output_offset,
-                           int32 output_multiplier, int output_shift,
-                           int32 output_activation_min,
-                           int32 output_activation_max, uint8* output_data) {
+inline void AddElementwise(int size, const ArithmeticParams& params,
+                           const uint8* input1_data, const uint8* input2_data,
+                           uint8* output_data) {
   int i = 0;
-  TFLITE_DCHECK_GT(input1_offset, -256);
-  TFLITE_DCHECK_GT(input2_offset, -256);
-  TFLITE_DCHECK_LT(input1_offset, 256);
-  TFLITE_DCHECK_LT(input2_offset, 256);
+  TFLITE_DCHECK_GT(params.input1_offset, -256);
+  TFLITE_DCHECK_GT(params.input2_offset, -256);
+  TFLITE_DCHECK_LT(params.input1_offset, 256);
+  TFLITE_DCHECK_LT(params.input2_offset, 256);
 #ifdef USE_NEON
-  const auto output_activation_min_vector = vdup_n_u8(output_activation_min);
-  const auto output_activation_max_vector = vdup_n_u8(output_activation_max);
+  const auto output_activation_min_vector =
+      vdup_n_u8(params.quantized_activation_min);
+  const auto output_activation_max_vector =
+      vdup_n_u8(params.quantized_activation_max);
   for (; i <= size - 8; i += 8) {
     const auto input1_val_original = vld1_u8(input1_data + i);
     const auto input2_val_original = vld1_u8(input2_data + i);
@@ -2560,9 +2476,9 @@ inline void AddElementwise(int size, int left_shift, const uint8* input1_data,
     const auto input2_val_s16 =
         vreinterpretq_s16_u16(vmovl_u8(input2_val_original));
     const auto input1_val =
-        vaddq_s16(input1_val_s16, vdupq_n_s16(input1_offset));
+        vaddq_s16(input1_val_s16, vdupq_n_s16(params.input1_offset));
     const auto input2_val =
-        vaddq_s16(input2_val_s16, vdupq_n_s16(input2_offset));
+        vaddq_s16(input2_val_s16, vdupq_n_s16(params.input2_offset));
     const auto input1_val_high = vget_high_s16(input1_val);
     const auto input1_val_low = vget_low_s16(input1_val);
     const auto input2_val_high = vget_high_s16(input2_val);
@@ -2571,32 +2487,32 @@ inline void AddElementwise(int size, int left_shift, const uint8* input1_data,
     auto x12 = vmovl_s16(input1_val_high);
     auto x21 = vmovl_s16(input2_val_low);
     auto x22 = vmovl_s16(input2_val_high);
-    const auto left_shift_dup = vdupq_n_s32(left_shift);
+    const auto left_shift_dup = vdupq_n_s32(params.left_shift);
     x11 = vshlq_s32(x11, left_shift_dup);
     x12 = vshlq_s32(x12, left_shift_dup);
     x21 = vshlq_s32(x21, left_shift_dup);
     x22 = vshlq_s32(x22, left_shift_dup);
-    x11 = vqrdmulhq_n_s32(x11, input1_multiplier);
-    x12 = vqrdmulhq_n_s32(x12, input1_multiplier);
-    x21 = vqrdmulhq_n_s32(x21, input2_multiplier);
-    x22 = vqrdmulhq_n_s32(x22, input2_multiplier);
-    const auto input1_shift_dup = vdupq_n_s32(-input1_shift);
-    const auto input2_shift_dup = vdupq_n_s32(-input2_shift);
+    x11 = vqrdmulhq_n_s32(x11, params.input1_multiplier);
+    x12 = vqrdmulhq_n_s32(x12, params.input1_multiplier);
+    x21 = vqrdmulhq_n_s32(x21, params.input2_multiplier);
+    x22 = vqrdmulhq_n_s32(x22, params.input2_multiplier);
+    const auto input1_shift_dup = vdupq_n_s32(params.input1_shift);
+    const auto input2_shift_dup = vdupq_n_s32(params.input2_shift);
     x11 = vshlq_s32(x11, input1_shift_dup);
     x12 = vshlq_s32(x12, input1_shift_dup);
     x21 = vshlq_s32(x21, input2_shift_dup);
     x22 = vshlq_s32(x22, input2_shift_dup);
     auto s1 = vaddq_s32(x11, x21);
     auto s2 = vaddq_s32(x12, x22);
-    s1 = vqrdmulhq_n_s32(s1, output_multiplier);
-    s2 = vqrdmulhq_n_s32(s2, output_multiplier);
+    s1 = vqrdmulhq_n_s32(s1, params.output_multiplier);
+    s2 = vqrdmulhq_n_s32(s2, params.output_multiplier);
     using gemmlowp::RoundingDivideByPOT;
-    s1 = RoundingDivideByPOT(s1, output_shift);
-    s2 = RoundingDivideByPOT(s2, output_shift);
+    s1 = RoundingDivideByPOT(s1, -params.output_shift);
+    s2 = RoundingDivideByPOT(s2, -params.output_shift);
     const auto s1_narrowed = vmovn_s32(s1);
     const auto s2_narrowed = vmovn_s32(s2);
     const auto s = vaddq_s16(vcombine_s16(s1_narrowed, s2_narrowed),
-                             vdupq_n_s16(output_offset));
+                             vdupq_n_s16(params.output_offset));
     const auto clamped =
         vmax_u8(output_activation_min_vector,
                 vmin_u8(output_activation_max_vector, vqmovun_s16(s)));
@@ -2605,101 +2521,74 @@ inline void AddElementwise(int size, int left_shift, const uint8* input1_data,
 #endif  // NEON
 
   for (; i < size; ++i) {
-    const int32 input1_val = input1_offset + input1_data[i];
-    const int32 input2_val = input2_offset + input2_data[i];
-    const int32 shifted_input1_val = input1_val * (1 << left_shift);
-    const int32 shifted_input2_val = input2_val * (1 << left_shift);
+    const int32 input1_val = params.input1_offset + input1_data[i];
+    const int32 input2_val = params.input2_offset + input2_data[i];
+    const int32 shifted_input1_val = input1_val * (1 << params.left_shift);
+    const int32 shifted_input2_val = input2_val * (1 << params.left_shift);
     const int32 scaled_input1_val =
         MultiplyByQuantizedMultiplierSmallerThanOneExp(
-            shifted_input1_val, input1_multiplier,
-            kReverseShift * input1_shift);
+            shifted_input1_val, params.input1_multiplier, params.input1_shift);
     const int32 scaled_input2_val =
         MultiplyByQuantizedMultiplierSmallerThanOneExp(
-            shifted_input2_val, input2_multiplier,
-            kReverseShift * input2_shift);
+            shifted_input2_val, params.input2_multiplier, params.input2_shift);
     const int32 raw_sum = scaled_input1_val + scaled_input2_val;
     const int32 raw_output =
         MultiplyByQuantizedMultiplierSmallerThanOneExp(
-            raw_sum, output_multiplier, kReverseShift * output_shift) +
-        output_offset;
-    const int32 clamped_output = std::min(
-        output_activation_max, std::max(output_activation_min, raw_output));
+            raw_sum, params.output_multiplier, params.output_shift) +
+        params.output_offset;
+    const int32 clamped_output =
+        std::min(params.quantized_activation_max,
+                 std::max(params.quantized_activation_min, raw_output));
     output_data[i] = static_cast<uint8>(clamped_output);
   }
 }
 
-// legacy, for compatibility with old checked-in code
-template <FusedActivationFunctionType Ac>
-void Add(const float* input1_data, const Dims<4>& input1_dims,
-         const float* input2_data, const Dims<4>& input2_dims,
-         float* output_data, const Dims<4>& output_dims) {
-  float output_activation_min, output_activation_max;
-  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
-
-  Add(input1_data, input1_dims, input2_data, input2_dims, output_activation_min,
-      output_activation_max, output_data, output_dims);
-}
-
-template <FusedActivationFunctionType Ac>
-inline void Add(int left_shift, const uint8* input1_data,
-                const Dims<4>& input1_dims, int32 input1_offset,
-                int32 input1_multiplier, int input1_shift,
-                const uint8* input2_data, const Dims<4>& input2_dims,
-                int32 input2_offset, int32 input2_multiplier, int input2_shift,
-                int32 output_offset, int32 output_multiplier, int output_shift,
-                int32 output_activation_min, int32 output_activation_max,
-                uint8* output_data, const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, 0);
-    TFLITE_DCHECK_EQ(output_activation_max, 255);
-  }
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const uint8* input1_data,
+                const RuntimeShape& input2_shape, const uint8* input2_data,
+                const RuntimeShape& output_shape, uint8* output_data) {
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
   gemmlowp::ScopedProfilingLabel label("Add/8bit");
-  const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
-  TFLITE_DCHECK(IsPackedWithoutStrides(input1_dims));
-  TFLITE_DCHECK(IsPackedWithoutStrides(input2_dims));
-  TFLITE_DCHECK(IsPackedWithoutStrides(output_dims));
-
-  TFLITE_DCHECK_GT(input1_offset, -256);
-  TFLITE_DCHECK_GT(input2_offset, -256);
-  TFLITE_DCHECK_LT(input1_offset, 256);
-  TFLITE_DCHECK_LT(input2_offset, 256);
-  AddElementwise(flat_size, left_shift, input1_data, input1_offset,
-                 input1_multiplier, input1_shift, input2_data, input2_offset,
-                 input2_multiplier, input2_shift, output_offset,
-                 output_multiplier, output_shift, output_activation_min,
-                 output_activation_max, output_data);
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, output_shape);
+
+  TFLITE_DCHECK_GT(params.input1_offset, -256);
+  TFLITE_DCHECK_GT(params.input2_offset, -256);
+  TFLITE_DCHECK_LT(params.input1_offset, 256);
+  TFLITE_DCHECK_LT(params.input2_offset, 256);
+  AddElementwise(flat_size, params, input1_data, input2_data, output_data);
 }
 
-inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
-                int input1_shift, const int16* input2_data,
-                const Dims<4>& input2_dims, int input2_shift,
-                int16 output_activation_min, int16 output_activation_max,
-                int16* output_data, const Dims<4>& output_dims) {
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const int16* input1_data,
+                const RuntimeShape& input2_shape, const int16* input2_data,
+                const RuntimeShape& output_shape, int16* output_data) {
   gemmlowp::ScopedProfilingLabel label("Add/Int16");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-
-  const int flat_size = MatchingFlatSize(output_dims, input1_dims, input2_dims);
-
-  TFLITE_DCHECK(input1_shift == 0 || input2_shift == 0);
-  TFLITE_DCHECK_GE(input1_shift, 0);
-  TFLITE_DCHECK_GE(input2_shift, 0);
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
+
+  const int input1_shift = params.input1_shift;
+  const int flat_size =
+      MatchingFlatSize(output_shape, input1_shape, input2_shape);
+  const int16 output_activation_min = params.quantized_activation_min;
+  const int16 output_activation_max = params.quantized_activation_max;
+
+  TFLITE_DCHECK(input1_shift == 0 || params.input2_shift == 0);
+  TFLITE_DCHECK_LE(input1_shift, 0);
+  TFLITE_DCHECK_LE(params.input2_shift, 0);
   const int16* not_shift_input = input1_shift == 0 ? input1_data : input2_data;
   const int16* shift_input = input1_shift == 0 ? input2_data : input1_data;
-  const int input_shift = input1_shift == 0 ? input2_shift : input1_shift;
+  const int input_right_shift =
+      input1_shift == 0 ? -params.input2_shift : -input1_shift;
 
   for (int i = 0; i < flat_size; i++) {
     // F0 uses 0 integer bits, range [-1, 1].
     using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
 
     F0 input_ready_scaled = F0::FromRaw(not_shift_input[i]);
-    F0 scaled_input =
-        F0::FromRaw(gemmlowp::RoundingDivideByPOT(shift_input[i], input_shift));
+    F0 scaled_input = F0::FromRaw(
+        gemmlowp::RoundingDivideByPOT(shift_input[i], input_right_shift));
     F0 result = gemmlowp::SaturatingAdd(scaled_input, input_ready_scaled);
     const int16 raw_output = result.raw();
     const int16 clamped_output = std::min(
@@ -2708,195 +2597,59 @@ inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
   }
 }
 
-inline void Add(const int32* input1_data, const Dims<4>& input1_dims,
-                const int32* input2_data, const Dims<4>& input2_dims,
-                int32 output_activation_min, int32 output_activation_max,
-                int32* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("Add/int32");
-
-  const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
-  for (int i = 0; i < flat_size; ++i) {
-    output_data[i] = ActivationFunctionWithMinMax(
-        input1_data[i] + input2_data[i], output_activation_min,
-        output_activation_max);
-  }
-}
-
-template <FusedActivationFunctionType Ac>
-inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
-                int input1_shift, const int16* input2_data,
-                const Dims<4>& input2_dims, int input2_shift,
-                int16 output_activation_min, int16 output_activation_max,
-                int16* output_data, const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, -32768);
-    TFLITE_DCHECK_EQ(output_activation_max, 32767);
-  }
-
-  Add(input1_data, input1_dims, input1_shift, input2_data, input2_dims,
-      input2_shift, output_activation_min, output_activation_max, output_data,
-      output_dims);
-}
-
-template <FusedActivationFunctionType Ac>
-void Add(const int32* input1_data, const Dims<4>& input1_dims,
-         const int32* input2_data, const Dims<4>& input2_dims,
-         int32* output_data, const Dims<4>& output_dims) {
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const int32* input1_data,
+                const RuntimeShape& input2_shape, const int32* input2_data,
+                const RuntimeShape& output_shape, int32* output_data) {
   gemmlowp::ScopedProfilingLabel label("Add/int32");
-  TFLITE_DCHECK(Ac == FusedActivationFunctionType::kNone);
 
-  auto input1_map = MapAsVector(input1_data, input1_dims);
-  auto input2_map = MapAsVector(input2_data, input2_dims);
-  auto output_map = MapAsVector(output_data, output_dims);
-  if (AreSameDims(input1_dims, input2_dims)) {
+  auto input1_map = MapAsVector(input1_data, input1_shape);
+  auto input2_map = MapAsVector(input2_data, input2_shape);
+  auto output_map = MapAsVector(output_data, output_shape);
+  if (input1_shape == input2_shape) {
     output_map.array() = input1_map.array() + input2_map.array();
-  } else if (FlatSize(input2_dims) == 1) {
+  } else if (input2_shape.FlatSize() == 1) {
     auto scalar = input2_data[0];
     output_map.array() = input1_map.array() + scalar;
-  } else if (FlatSize(input1_dims) == 1) {
+  } else if (input1_shape.FlatSize() == 1) {
     auto scalar = input1_data[0];
     output_map.array() = scalar + input2_map.array();
   } else {
     // Should not come here.
     TFLITE_DCHECK(false);
   }
+  output_map = output_map.cwiseMax(params.quantized_activation_min);
+  output_map = output_map.cwiseMin(params.quantized_activation_max);
 }
 
-// TODO(jiawen): We can implement BroadcastAdd on buffers of arbitrary
-// dimensionality if the runtime code does a single loop over one dimension
-// that handles broadcasting as the base case. The code generator would then
-// generate max(D1, D2) nested for loops.
-// TODO(benoitjacob): BroadcastAdd is intentionally duplicated from
-// reference_ops.h. Once an optimized version is implemented and NdArrayDesc<T>
-// is no longer referenced in this file, move NdArrayDesc<T> from types.h to
-// reference_ops.h.
-template <typename T>
-void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
-                  const T* input2_data, const Dims<4>& input2_dims,
-                  T output_activation_min, T output_activation_max,
-                  T* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastAdd");
-
-  NdArrayDesc<4> desc1;
-  NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
-
-  // In Tensorflow, the dimensions are canonically named (batch_number, row,
-  // col, channel), with extents (batches, height, width, depth), with the
-  // trailing dimension changing most rapidly (channels has the smallest stride,
-  // typically 1 element).
-  //
-  // In generated C code, we store arrays with the dimensions reversed. The
-  // first dimension has smallest stride.
-  //
-  // We name our variables by their Tensorflow convention, but generate C code
-  // nesting loops such that the innermost loop has the smallest stride for the
-  // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          output_data[Offset(output_dims, c, x, y, b)] =
-              ActivationFunctionWithMinMax(
-                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] +
-                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
-                  output_activation_min, output_activation_max);
-        }
-      }
-    }
-  }
-}
-
-// legacy, for compatibility with old checked-in code
-template <FusedActivationFunctionType Ac, typename T>
-void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
-                  const T* input2_data, const Dims<4>& input2_dims,
-                  T* output_data, const Dims<4>& output_dims) {
-  T output_activation_min, output_activation_max;
-  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
-
-  BroadcastAdd(input1_data, input1_dims, input2_data, input2_dims,
-               output_activation_min, output_activation_max, output_data,
-               output_dims);
-}
-
-inline void BroadcastAdd(int left_shift, const uint8* input1_data,
-                         const Dims<4>& input1_dims, int32 input1_offset,
-                         int32 input1_multiplier, int input1_shift,
-                         const uint8* input2_data, const Dims<4>& input2_dims,
-                         int32 input2_offset, int32 input2_multiplier,
-                         int input2_shift, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastAddGeneric/8bit");
-
-  NdArrayDesc<4> desc1;
-  NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
-
-  // In Tensorflow, the dimensions are canonically named (batch_number, row,
-  // col, channel), with extents (batches, height, width, depth), with the
-  // trailing dimension changing most rapidly (channels has the smallest stride,
-  // typically 1 element).
-  //
-  // In generated C code, we store arrays with the dimensions reversed. The
-  // first dimension has smallest stride.
-  //
-  // We name our variables by their Tensorflow convention, but generate C code
-  // nesting loops such that the innermost loop has the smallest stride for the
-  // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          const int32 input1_val =
-              input1_offset + input1_data[SubscriptToIndex(desc1, c, x, y, b)];
-          const int32 input2_val =
-              input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
-          const int32 shifted_input1_val = input1_val * (1 << left_shift);
-          const int32 shifted_input2_val = input2_val * (1 << left_shift);
-          const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input1_val, input1_multiplier,
-                  kReverseShift * input1_shift);
-          const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input2_val, input2_multiplier,
-                  kReverseShift * input2_shift);
-          const int32 raw_sum = scaled_input1_val + scaled_input2_val;
-          const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  raw_sum, output_multiplier, kReverseShift * output_shift) +
-              output_offset;
-          const int32 clamped_output =
-              std::min(output_activation_max,
-                       std::max(output_activation_min, raw_output));
-          output_data[Offset(output_dims, c, x, y, b)] =
-              static_cast<uint8>(clamped_output);
-        }
-      }
-    }
-  }
-}
-
-inline void BroadcastAddFivefold(
-    int y0, int y1, int y2, int y3, int y4, int left_shift,
-    const uint8* input1_data, const Dims<4>& input1_dims, int32 input1_offset,
-    int32 input1_multiplier, int input1_shift, const uint8* input2_data,
-    const Dims<4>& input2_dims, int32 input2_offset, int32 input2_multiplier,
-    int input2_shift, int32 output_offset, int32 output_multiplier,
-    int output_shift, int32 output_activation_min, int32 output_activation_max,
-    uint8* output_data, const Dims<4>& output_dims) {
+inline void BroadcastAddFivefold(const ArithmeticParams& unswitched_params,
+                                 const RuntimeShape& unswitched_input1_shape,
+                                 const uint8* unswitched_input1_data,
+                                 const RuntimeShape& unswitched_input2_shape,
+                                 const uint8* unswitched_input2_data,
+                                 const RuntimeShape& output_shape,
+                                 uint8* output_data) {
   gemmlowp::ScopedProfilingLabel label("BroadcastAddFivefold/8bit");
 
+  ArithmeticParams switched_params = unswitched_params;
+  switched_params.input1_offset = unswitched_params.input2_offset;
+  switched_params.input1_multiplier = unswitched_params.input2_multiplier;
+  switched_params.input1_shift = unswitched_params.input2_shift;
+  switched_params.input2_offset = unswitched_params.input1_offset;
+  switched_params.input2_multiplier = unswitched_params.input1_multiplier;
+  switched_params.input2_shift = unswitched_params.input1_shift;
+
+  const bool use_unswitched =
+      unswitched_params.broadcast_category ==
+      tflite::BroadcastableOpCategory::kFirstInputBroadcastsFast;
+
+  const ArithmeticParams& params =
+      use_unswitched ? unswitched_params : switched_params;
+  const uint8* input1_data =
+      use_unswitched ? unswitched_input1_data : unswitched_input2_data;
+  const uint8* input2_data =
+      use_unswitched ? unswitched_input2_data : unswitched_input1_data;
+
   // Fivefold nested loops. The second input resets its position for each
   // iteration of the second loop. The first input resets its position at the
   // beginning of the fourth loop. The innermost loop is an elementwise add of
@@ -2904,82 +2657,29 @@ inline void BroadcastAddFivefold(
   uint8* output_data_ptr = output_data;
   const uint8* input1_data_ptr = input1_data;
   const uint8* input2_data_reset = input2_data;
-  for (int i4 = 0; i4 < y4; ++i4) {
+  int y0 = params.broadcast_shape[0];
+  int y1 = params.broadcast_shape[1];
+  int y2 = params.broadcast_shape[2];
+  int y3 = params.broadcast_shape[3];
+  int y4 = params.broadcast_shape[4];
+  for (int i0 = 0; i0 < y0; ++i0) {
     const uint8* input2_data_ptr;
-    for (int i3 = 0; i3 < y3; ++i3) {
+    for (int i1 = 0; i1 < y1; ++i1) {
       input2_data_ptr = input2_data_reset;
       for (int i2 = 0; i2 < y2; ++i2) {
-        for (int i1 = 0; i1 < y1; ++i1) {
-          AddElementwise(
-              y0, left_shift, input1_data_ptr, input1_offset, input1_multiplier,
-              input1_shift, input2_data_ptr, input2_offset, input2_multiplier,
-              input2_shift, output_offset, output_multiplier, output_shift,
-              output_activation_min, output_activation_max, output_data_ptr);
-          input2_data_ptr += y0;
-          output_data_ptr += y0;
+        for (int i3 = 0; i3 < y3; ++i3) {
+          AddElementwise(y4, params, input1_data_ptr, input2_data_ptr,
+                         output_data_ptr);
+          input2_data_ptr += y4;
+          output_data_ptr += y4;
         }
-        input1_data_ptr += y0;
+        input1_data_ptr += y4;
       }
     }
     input2_data_reset = input2_data_ptr;
   }
 }
 
-template <FusedActivationFunctionType Ac>
-inline void BroadcastAdd(int left_shift, const uint8* input1_data,
-                         const Dims<4>& input1_dims, int32 input1_offset,
-                         int32 input1_multiplier, int input1_shift,
-                         const uint8* input2_data, const Dims<4>& input2_dims,
-                         int32 input2_offset, int32 input2_multiplier,
-                         int input2_shift, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, 0);
-    TFLITE_DCHECK_EQ(output_activation_max, 255);
-  }
-  BroadcastAdd(left_shift, input1_data, input1_dims, input1_offset,
-               input1_multiplier, input1_shift, input2_data, input2_dims,
-               input2_offset, input2_multiplier, input2_shift, output_offset,
-               output_multiplier, output_shift, output_activation_min,
-               output_activation_max, output_data, output_dims);
-}
-
-template <FusedActivationFunctionType Ac>
-inline void BroadcastAddFivefold(
-    int y0, int y1, int y2, int y3, int y4, int left_shift,
-    const uint8* input1_data, const Dims<4>& input1_dims, int32 input1_offset,
-    int32 input1_multiplier, int input1_shift, const uint8* input2_data,
-    const Dims<4>& input2_dims, int32 input2_offset, int32 input2_multiplier,
-    int input2_shift, int32 output_offset, int32 output_multiplier,
-    int output_shift, int32 output_activation_min, int32 output_activation_max,
-    uint8* output_data, const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, 0);
-    TFLITE_DCHECK_EQ(output_activation_max, 255);
-  }
-  BroadcastAddFivefold(y0, y1, y2, y3, y4, left_shift, input1_data, input1_dims,
-                       input1_offset, input1_multiplier, input1_shift,
-                       input2_data, input2_dims, input2_offset,
-                       input2_multiplier, input2_shift, output_offset,
-                       output_multiplier, output_shift, output_activation_min,
-                       output_activation_max, output_data, output_dims);
-}
-
 inline void Mul(const float* input1_data, const Dims<4>& input1_dims,
                 const float* input2_data, const Dims<4>& input2_dims,
                 float output_activation_min, float output_activation_max,
@@ -3053,6 +2753,20 @@ void Mul(const float* input1_data, const Dims<4>& input1_dims,
       output_activation_max, output_data, output_dims);
 }
 
+inline void Mul(const int32* input1_data, const Dims<4>& input1_dims,
+                const int32* input2_data, const Dims<4>& input2_dims,
+                int32 output_activation_min, int32 output_activation_max,
+                int32* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Mul/int32");
+
+  const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(
+        input1_data[i] * input2_data[i], output_activation_min,
+        output_activation_max);
+  }
+}
+
 template <FusedActivationFunctionType Ac>
 void Mul(const int32* input1_data, const Dims<4>& input1_dims,
          const int32* input2_data, const Dims<4>& input2_dims,
@@ -3290,122 +3004,78 @@ void BroadcastDiv(const T* input1_data, const Dims<4>& input1_dims,
 }
 
 // TODO(aselle): This is not actually optimized yet.
-inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
-                const float* input2_data, const Dims<4>& input2_dims,
-                float output_activation_min, float output_activation_max,
-                float* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("Sub");
-  const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
+inline void SubNonBroadcast(const ArithmeticParams& params,
+                            const RuntimeShape& input1_shape,
+                            const float* input1_data,
+                            const RuntimeShape& input2_shape,
+                            const float* input2_data,
+                            const RuntimeShape& output_shape,
+                            float* output_data) {
+  gemmlowp::ScopedProfilingLabel label("SubNonBroadcast");
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, output_shape);
   for (int i = 0; i < flat_size; ++i) {
     output_data[i] = ActivationFunctionWithMinMax(
-        input1_data[i] - input2_data[i], output_activation_min,
-        output_activation_max);
+        input1_data[i] - input2_data[i], params.float_activation_min,
+        params.float_activation_max);
   }
 }
 
-// TODO(jiawen): We can implement BroadcastSub on buffers of arbitrary
-// dimensionality if the runtime code does a single loop over one dimension
-// that handles broadcasting as the base case. The code generator would then
-// generate max(D1, D2) nested for loops.
-// TODO(benoitjacob): BroadcastSub is intentionally duplicated from
-// reference_ops.h. Once an optimized version is implemented and NdArrayDesc<T>
-// is no longer referenced in this file, move NdArrayDesc<T> from types.h to
-// reference_ops.h.
-template <typename T>
-void BroadcastSub(const T* input1_data, const Dims<4>& input1_dims,
-                  const T* input2_data, const Dims<4>& input2_dims,
-                  T output_activation_min, T output_activation_max,
-                  T* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastSub");
-
-  NdArrayDesc<4> desc1;
-  NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
-
-  // In Tensorflow, the dimensions are canonically named (batch_number, row,
-  // col, channel), with extents (batches, height, width, depth), with the
-  // trailing dimension changing most rapidly (channels has the smallest stride,
-  // typically 1 element).
-  //
-  // In generated C code, we store arrays with the dimensions reversed. The
-  // first dimension has smallest stride.
-  //
-  // We name our variables by their Tensorflow convention, but generate C code
-  // nesting loops such that the innermost loop has the smallest stride for the
-  // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          output_data[Offset(output_dims, c, x, y, b)] =
-              ActivationFunctionWithMinMax(
-                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] -
-                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
-                  output_activation_min, output_activation_max);
-        }
-      }
-    }
+inline void SubWithActivation(const ArithmeticParams& params,
+                              const RuntimeShape& input1_shape,
+                              const int32* input1_data,
+                              const RuntimeShape& input2_shape,
+                              const int32* input2_data,
+                              const RuntimeShape& output_shape,
+                              int32* output_data) {
+  gemmlowp::ScopedProfilingLabel label("SubWithActivation/int32");
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, input2_shape);
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(
+        input1_data[i] - input2_data[i], params.quantized_activation_min,
+        params.quantized_activation_max);
   }
 }
 
-inline void BroadcastSub(int left_shift, const uint8* input1_data,
-                         const Dims<4>& input1_dims, int32 input1_offset,
-                         int32 input1_multiplier, int input1_shift,
-                         const uint8* input2_data, const Dims<4>& input2_dims,
-                         int32 input2_offset, int32 input2_multiplier,
-                         int input2_shift, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastSub/8bit");
+inline void SubWithActivation(const ArithmeticParams& params,
+                              const RuntimeShape& input1_shape,
+                              const float* input1_data,
+                              const RuntimeShape& input2_shape,
+                              const float* input2_data,
+                              const RuntimeShape& output_shape,
+                              float* output_data) {
+  gemmlowp::ScopedProfilingLabel label("SubWithActivation/float");
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, input2_shape);
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(
+        input1_data[i] - input2_data[i], params.float_activation_min,
+        params.float_activation_max);
+  }
+}
 
-  NdArrayDesc<4> desc1;
-  NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+template <typename T>
+void Sub(const ArithmeticParams& params, const RuntimeShape& input1_shape,
+         const T* input1_data, const RuntimeShape& input2_shape,
+         const T* input2_data, const RuntimeShape& output_shape,
+         T* output_data) {
+  gemmlowp::ScopedProfilingLabel label("Sub");
 
-  // In Tensorflow, the dimensions are canonically named (batch_number, row,
-  // col, channel), with extents (batches, height, width, depth), with the
-  // trailing dimension changing most rapidly (channels has the smallest stride,
-  // typically 1 element).
-  //
-  // In generated C code, we store arrays with the dimensions reversed. The
-  // first dimension has smallest stride.
-  //
-  // We name our variables by their Tensorflow convention, but generate C code
-  // nesting loops such that the innermost loop has the smallest stride for the
-  // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          const int32 input1_val =
-              input1_offset + input1_data[SubscriptToIndex(desc1, c, x, y, b)];
-          const int32 input2_val =
-              input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
-          const int32 shifted_input1_val = input1_val * (1 << left_shift);
-          const int32 shifted_input2_val = input2_val * (1 << left_shift);
-          const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input1_val, input1_multiplier,
-                  kReverseShift * input1_shift);
-          const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input2_val, input2_multiplier,
-                  kReverseShift * input2_shift);
-          const int32 raw_sub = scaled_input1_val - scaled_input2_val;
-          const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  raw_sub, output_multiplier, kReverseShift * output_shift) +
-              output_offset;
-          const int32 clamped_output =
-              std::min(output_activation_max,
-                       std::max(output_activation_min, raw_output));
-          output_data[Offset(output_dims, c, x, y, b)] =
-              static_cast<uint8>(clamped_output);
-        }
-      }
-    }
+  auto input1_map = MapAsVector(input1_data, input1_shape);
+  auto input2_map = MapAsVector(input2_data, input2_shape);
+  auto output_map = MapAsVector(output_data, output_shape);
+  if (input1_shape == input2_shape) {
+    output_map.array() = input1_map.array() - input2_map.array();
+  } else if (input1_shape.FlatSize() == 1) {
+    auto scalar = input1_data[0];
+    output_map.array() = scalar - input2_map.array();
+  } else if (input2_shape.FlatSize() == 1) {
+    auto scalar = input2_data[0];
+    output_map.array() = input1_map.array() - scalar;
+  } else {
+    BroadcastSub4DSlow(params, input1_shape, input1_data, input2_shape,
+                       input2_data, output_shape, output_data);
   }
 }
 
@@ -3771,21 +3441,20 @@ inline int NodeOffset(int b, int h, int w, int height, int width) {
   return (b * height + h) * width + w;
 }
 
-inline void AveragePool(const float* input_data,
-                        const RuntimeShape& input_shape, int stride_width,
-                        int stride_height, int pad_width, int pad_height,
-                        int kwidth, int kheight, float output_activation_min,
-                        float output_activation_max, float* output_data,
-                        const RuntimeShape& output_shape) {
+inline void AveragePool(const PoolParams& params,
+                        const RuntimeShape& input_shape,
+                        const float* input_data,
+                        const RuntimeShape& output_shape, float* output_data) {
   gemmlowp::ScopedProfilingLabel label("AveragePool");
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
-  const int depth = MatchingDim(input_shape, 3, output_shape, 3);
   const int input_height = input_shape.Dims(1);
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
 
   // TODO(benoitjacob) make this a proper reference impl without Eigen!
   const auto in_mat = MapAsMatrixWithLastDimAsRows(input_data, input_shape);
@@ -3800,12 +3469,15 @@ inline void AveragePool(const float* input_data,
       for (int w = 0; w < input_width; ++w) {
         // (h_start, h_end) * (w_start, w_end) is the range that the input
         // vector projects to.
-        int hpad = h + pad_height;
-        int wpad = w + pad_width;
-        int h_start =
-            (hpad < kheight) ? 0 : (hpad - kheight) / stride_height + 1;
+        int hpad = h + params.padding_values.height;
+        int wpad = w + params.padding_values.width;
+        int h_start = (hpad < params.filter_height)
+                          ? 0
+                          : (hpad - params.filter_height) / stride_height + 1;
         int h_end = std::min(hpad / stride_height + 1, output_height);
-        int w_start = (wpad < kwidth) ? 0 : (wpad - kwidth) / stride_width + 1;
+        int w_start = (wpad < params.filter_width)
+                          ? 0
+                          : (wpad - params.filter_width) / stride_width + 1;
         int w_end = std::min(wpad / stride_width + 1, output_width);
         // compute elementwise sum
         for (int ph = h_start; ph < h_end; ++ph) {
@@ -3823,29 +3495,21 @@ inline void AveragePool(const float* input_data,
   TFLITE_DCHECK_GT(out_count.minCoeff(), 0);
   out_mat.array().rowwise() /= out_count.transpose().array();
 
-  for (int b = 0; b < batches; ++b) {
-    for (int y = 0; y < output_height; ++y) {
-      for (int x = 0; x < output_width; ++x) {
-        for (int c = 0; c < depth; ++c) {
-          output_data[Offset(output_shape, b, y, x, c)] =
-              ActivationFunctionWithMinMax(
-                  output_data[Offset(output_shape, b, y, x, c)],
-                  output_activation_min, output_activation_max);
-        }
-      }
-    }
+  const int flat_size = output_shape.FlatSize();
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(output_data[i],
+                                                  params.float_activation_min,
+                                                  params.float_activation_max);
   }
 }
 
-inline void AveragePool(const uint8* input_data,
-                        const RuntimeShape& input_shape, int stride_width,
-                        int stride_height, int pad_width, int pad_height,
-                        int filter_width, int filter_height,
-                        int32 output_activation_min,
-                        int32 output_activation_max, uint8* output_data,
-                        const RuntimeShape& output_shape) {
+inline void AveragePool(const PoolParams& params,
+                        const RuntimeShape& input_shape,
+                        const uint8* input_data,
+                        const RuntimeShape& output_shape, uint8* output_data) {
   gemmlowp::ScopedProfilingLabel label("AveragePool/8bit");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
@@ -3854,17 +3518,21 @@ inline void AveragePool(const uint8* input_data,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   for (int batch = 0; batch < batches; ++batch) {
     for (int out_y = 0; out_y < output_height; ++out_y) {
       for (int out_x = 0; out_x < output_width; ++out_x) {
-        const int in_x_origin = (out_x * stride_width) - pad_width;
-        const int in_y_origin = (out_y * stride_height) - pad_height;
+        const int in_x_origin =
+            (out_x * stride_width) - params.padding_values.width;
+        const int in_y_origin =
+            (out_y * stride_height) - params.padding_values.height;
         const int filter_x_start = std::max(0, -in_x_origin);
         const int filter_x_end =
-            std::min(filter_width, input_width - in_x_origin);
+            std::min(params.filter_width, input_width - in_x_origin);
         const int filter_y_start = std::max(0, -in_y_origin);
         const int filter_y_end =
-            std::min(filter_height, input_height - in_y_origin);
+            std::min(params.filter_height, input_height - in_y_origin);
         const int filter_count =
             (filter_x_end - filter_x_start) * (filter_y_end - filter_y_start);
         // 1280 required by Inception v3
@@ -3912,18 +3580,18 @@ inline void AveragePool(const uint8* input_data,
             output_data + Offset(output_shape, batch, out_y, out_x, 0);
         int channel = 0;
 #ifdef USE_NEON
-#define AVGPOOL_DIVIDING_BY(FILTER_COUNT)                              \
-  if (filter_count == FILTER_COUNT) {                                  \
-    for (; channel <= depth - 8; channel += 8) {                       \
-      uint16 buf[8];                                                   \
-      for (int i = 0; i < 8; i++) {                                    \
-        buf[i] = (acc[channel + i] + FILTER_COUNT / 2) / FILTER_COUNT; \
-      }                                                                \
-      uint8x8_t buf8 = vqmovn_u16(vld1q_u16(buf));                     \
-      buf8 = vmin_u8(buf8, vdup_n_u8(output_activation_max));          \
-      buf8 = vmax_u8(buf8, vdup_n_u8(output_activation_min));          \
-      vst1_u8(output_ptr + channel, buf8);                             \
-    }                                                                  \
+#define AVGPOOL_DIVIDING_BY(FILTER_COUNT)                               \
+  if (filter_count == FILTER_COUNT) {                                   \
+    for (; channel <= depth - 8; channel += 8) {                        \
+      uint16 buf[8];                                                    \
+      for (int i = 0; i < 8; i++) {                                     \
+        buf[i] = (acc[channel + i] + FILTER_COUNT / 2) / FILTER_COUNT;  \
+      }                                                                 \
+      uint8x8_t buf8 = vqmovn_u16(vld1q_u16(buf));                      \
+      buf8 = vmin_u8(buf8, vdup_n_u8(params.quantized_activation_max)); \
+      buf8 = vmax_u8(buf8, vdup_n_u8(params.quantized_activation_min)); \
+      vst1_u8(output_ptr + channel, buf8);                              \
+    }                                                                   \
   }
         AVGPOOL_DIVIDING_BY(9)
         AVGPOOL_DIVIDING_BY(15)
@@ -3934,15 +3602,15 @@ inline void AveragePool(const uint8* input_data,
             buf[i] = (acc[channel + i] + filter_count / 2) / filter_count;
           }
           uint8x8_t buf8 = vqmovn_u16(vld1q_u16(buf));
-          buf8 = vmin_u8(buf8, vdup_n_u8(output_activation_max));
-          buf8 = vmax_u8(buf8, vdup_n_u8(output_activation_min));
+          buf8 = vmin_u8(buf8, vdup_n_u8(params.quantized_activation_max));
+          buf8 = vmax_u8(buf8, vdup_n_u8(params.quantized_activation_min));
           vst1_u8(output_ptr + channel, buf8);
         }
 #endif
         for (; channel < depth; ++channel) {
           uint16 a = (acc[channel] + filter_count / 2) / filter_count;
-          a = std::max<uint16>(a, output_activation_min);
-          a = std::min<uint16>(a, output_activation_max);
+          a = std::max<uint16>(a, params.quantized_activation_min);
+          a = std::min<uint16>(a, params.quantized_activation_max);
           output_ptr[channel] = static_cast<uint8>(a);
         }
       }
@@ -3950,20 +3618,19 @@ inline void AveragePool(const uint8* input_data,
   }
 }
 
-inline void MaxPool(const float* input_data, const RuntimeShape& input_shape,
-                    int stride_width, int stride_height, int pad_width,
-                    int pad_height, int kwidth, int kheight,
-                    float output_activation_min, float output_activation_max,
-                    float* output_data, const RuntimeShape& output_shape) {
+inline void MaxPool(const PoolParams& params, const RuntimeShape& input_shape,
+                    const float* input_data, const RuntimeShape& output_shape,
+                    float* output_data) {
   gemmlowp::ScopedProfilingLabel label("MaxPool");
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
-  const int depth = MatchingDim(input_shape, 3, output_shape, 3);
   const int input_height = input_shape.Dims(1);
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
 
   const auto in_mat = MapAsMatrixWithLastDimAsRows(input_data, input_shape);
   auto out_mat = MapAsMatrixWithLastDimAsRows(output_data, output_shape);
@@ -3974,12 +3641,15 @@ inline void MaxPool(const float* input_data, const RuntimeShape& input_shape,
       for (int w = 0; w < input_width; ++w) {
         // (h_start, h_end) * (w_start, w_end) is the range that the input
         // vector projects to.
-        int hpad = h + pad_height;
-        int wpad = w + pad_width;
-        int h_start =
-            (hpad < kheight) ? 0 : (hpad - kheight) / stride_height + 1;
+        int hpad = h + params.padding_values.height;
+        int wpad = w + params.padding_values.width;
+        int h_start = (hpad < params.filter_height)
+                          ? 0
+                          : (hpad - params.filter_height) / stride_height + 1;
         int h_end = std::min(hpad / stride_height + 1, output_height);
-        int w_start = (wpad < kwidth) ? 0 : (wpad - kwidth) / stride_width + 1;
+        int w_start = (wpad < params.filter_width)
+                          ? 0
+                          : (wpad - params.filter_width) / stride_width + 1;
         int w_end = std::min(wpad / stride_width + 1, output_width);
         // compute elementwise sum
         for (int ph = h_start; ph < h_end; ++ph) {
@@ -3994,28 +3664,20 @@ inline void MaxPool(const float* input_data, const RuntimeShape& input_shape,
       }
     }
   }
-
-  for (int b = 0; b < batches; ++b) {
-    for (int y = 0; y < output_height; ++y) {
-      for (int x = 0; x < output_width; ++x) {
-        for (int c = 0; c < depth; ++c) {
-          output_data[Offset(output_shape, b, y, x, c)] =
-              ActivationFunctionWithMinMax(
-                  output_data[Offset(output_shape, b, y, x, c)],
-                  output_activation_min, output_activation_max);
-        }
-      }
-    }
+  const int flat_size = output_shape.FlatSize();
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(output_data[i],
+                                                  params.float_activation_min,
+                                                  params.float_activation_max);
   }
 }
 
-inline void MaxPool(const uint8* input_data, const RuntimeShape& input_shape,
-                    int stride_width, int stride_height, int pad_width,
-                    int pad_height, int filter_width, int filter_height,
-                    int32 output_activation_min, int32 output_activation_max,
-                    uint8* output_data, const RuntimeShape& output_shape) {
+inline void MaxPool(const PoolParams& params, const RuntimeShape& input_shape,
+                    const uint8* input_data, const RuntimeShape& output_shape,
+                    uint8* output_data) {
   gemmlowp::ScopedProfilingLabel label("MaxPool/8bit");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
@@ -4024,17 +3686,21 @@ inline void MaxPool(const uint8* input_data, const RuntimeShape& input_shape,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   for (int batch = 0; batch < batches; ++batch) {
     for (int out_y = 0; out_y < output_height; ++out_y) {
       for (int out_x = 0; out_x < output_width; ++out_x) {
-        const int in_x_origin = (out_x * stride_width) - pad_width;
-        const int in_y_origin = (out_y * stride_height) - pad_height;
+        const int in_x_origin =
+            (out_x * stride_width) - params.padding_values.width;
+        const int in_y_origin =
+            (out_y * stride_height) - params.padding_values.height;
         const int filter_x_start = std::max(0, -in_x_origin);
         const int filter_x_end =
-            std::min(filter_width, input_width - in_x_origin);
+            std::min(params.filter_width, input_width - in_x_origin);
         const int filter_y_start = std::max(0, -in_y_origin);
         const int filter_y_end =
-            std::min(filter_height, input_height - in_y_origin);
+            std::min(params.filter_height, input_height - in_y_origin);
         // 2048 required by Inception v3
         static constexpr int kAccBufferMaxSize = 2048;
         TFLITE_DCHECK_LE(depth, kAccBufferMaxSize);
@@ -4077,21 +3743,21 @@ inline void MaxPool(const uint8* input_data, const RuntimeShape& input_shape,
 #ifdef USE_NEON
         for (; channel <= depth - 16; channel += 16) {
           uint8x16_t a = vld1q_u8(acc + channel);
-          a = vminq_u8(a, vdupq_n_u8(output_activation_max));
-          a = vmaxq_u8(a, vdupq_n_u8(output_activation_min));
+          a = vminq_u8(a, vdupq_n_u8(params.quantized_activation_max));
+          a = vmaxq_u8(a, vdupq_n_u8(params.quantized_activation_min));
           vst1q_u8(output_ptr + channel, a);
         }
         for (; channel <= depth - 8; channel += 8) {
           uint8x8_t a = vld1_u8(acc + channel);
-          a = vmin_u8(a, vdup_n_u8(output_activation_max));
-          a = vmax_u8(a, vdup_n_u8(output_activation_min));
+          a = vmin_u8(a, vdup_n_u8(params.quantized_activation_max));
+          a = vmax_u8(a, vdup_n_u8(params.quantized_activation_min));
           vst1_u8(output_ptr + channel, a);
         }
 #endif
         for (; channel < depth; ++channel) {
           uint8 a = acc[channel];
-          a = std::max<uint8>(a, output_activation_min);
-          a = std::min<uint8>(a, output_activation_max);
+          a = std::max<uint8>(a, params.quantized_activation_min);
+          a = std::min<uint8>(a, params.quantized_activation_max);
           output_ptr[channel] = static_cast<uint8>(a);
         }
       }
@@ -4099,11 +3765,9 @@ inline void MaxPool(const uint8* input_data, const RuntimeShape& input_shape,
   }
 }
 
-inline void L2Pool(const float* input_data, const RuntimeShape& input_shape,
-                   int stride_width, int stride_height, int pad_width,
-                   int pad_height, int filter_width, int filter_height,
-                   float output_activation_min, float output_activation_max,
-                   float* output_data, const RuntimeShape& output_shape) {
+inline void L2Pool(const PoolParams& params, const RuntimeShape& input_shape,
+                   const float* input_data, const RuntimeShape& output_shape,
+                   float* output_data) {
   gemmlowp::ScopedProfilingLabel label("L2Pool");
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
@@ -4112,6 +3776,8 @@ inline void L2Pool(const float* input_data, const RuntimeShape& input_shape,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   // Actually carry out L2 Pool. Code is written in forward mode: we go through
   // the input values once, and write to all the pooled regions that it maps to.
   const auto in_mat = MapAsMatrixWithLastDimAsRows(input_data, input_shape);
@@ -4126,15 +3792,17 @@ inline void L2Pool(const float* input_data, const RuntimeShape& input_shape,
       for (int w = 0; w < input_width; ++w) {
         // (h_start, h_end) * (w_start, w_end) is the range that the input
         // vector projects to.
-        const int hpad = h + pad_height;
-        const int wpad = w + pad_width;
-        const int h_start = (hpad < filter_height)
-                                ? 0
-                                : (hpad - filter_height) / stride_height + 1;
+        const int hpad = h + params.padding_values.height;
+        const int wpad = w + params.padding_values.width;
+        const int h_start =
+            (hpad < params.filter_height)
+                ? 0
+                : (hpad - params.filter_height) / stride_height + 1;
         const int h_end = std::min(hpad / stride_height + 1, output_height);
-        const int w_start = (wpad < filter_width)
-                                ? 0
-                                : (wpad - filter_width) / stride_width + 1;
+        const int w_start =
+            (wpad < params.filter_width)
+                ? 0
+                : (wpad - params.filter_width) / stride_width + 1;
         const int w_end = std::min(wpad / stride_width + 1, output_width);
         // pre-compute square
         const int in_offset = w + input_width * (h + input_height * b);
@@ -4155,6 +3823,13 @@ inline void L2Pool(const float* input_data, const RuntimeShape& input_shape,
   out_count = out_count.array().inverse();
   out_mat =
       (out_mat.array().rowwise() * out_count.transpose().array()).cwiseSqrt();
+
+  const int flat_size = output_shape.FlatSize();
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(output_data[i],
+                                                  params.float_activation_min,
+                                                  params.float_activation_max);
+  }
 }
 
 inline void LocalResponseNormalization(const float* input_data,
@@ -5843,63 +5518,6 @@ inline void Slice(const T* input_data, const Dims<4>& input_dims,
 }
 
 template <typename T>
-void GenericBroadcastSub(const T* input1_data, const Dims<4>& input1_dims,
-                         const T* input2_data, const Dims<4>& input2_dims,
-                         T* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("GenericBroadcastSub");
-
-  NdArrayDesc<4> desc1;
-  NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
-
-  // In Tensorflow, the dimensions are canonically named (batch_number, row,
-  // col, channel), with extents (batches, height, width, depth), with the
-  // trailing dimension changing most rapidly (channels has the smallest stride,
-  // typically 1 element).
-  //
-  // In generated C code, we store arrays with the dimensions reversed. The
-  // first dimension has smallest stride.
-  //
-  // We name our variables by their Tensorflow convention, but generate C code
-  // nesting loops such that the innermost loop has the smallest stride for the
-  // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          output_data[Offset(output_dims, c, x, y, b)] =
-              input1_data[SubscriptToIndex(desc1, c, x, y, b)] -
-              input2_data[SubscriptToIndex(desc2, c, x, y, b)];
-        }
-      }
-    }
-  }
-}
-
-template <typename T>
-void Sub(const T* input1_data, const Dims<4>& input1_dims, const T* input2_data,
-         const Dims<4>& input2_dims, T* output_data,
-         const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("Sub");
-
-  auto input1_map = MapAsVector(input1_data, input1_dims);
-  auto input2_map = MapAsVector(input2_data, input2_dims);
-  auto output_map = MapAsVector(output_data, output_dims);
-  if (AreSameDims(input1_dims, input2_dims)) {
-    output_map.array() = input1_map.array() - input2_map.array();
-  } else if (FlatSize(input1_dims) == 1) {
-    auto scalar = input1_data[0];
-    output_map.array() = scalar - input2_map.array();
-  } else if (FlatSize(input2_dims) == 1) {
-    auto scalar = input2_data[0];
-    output_map.array() = input1_map.array() - scalar;
-  } else {
-    GenericBroadcastSub(input1_data, input1_dims, input2_data, input2_dims,
-                        output_data, output_dims);
-  }
-}
-
-template <typename T>
 void TensorFlowMinimum(const T* input1_data, const Dims<4>& input1_dims,
                        const T* input2_data, T* output_data,
                        const Dims<4>& output_dims) {
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h b/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h
index f14667090f..010b40b901 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/tensor_utils_impl.h
@@ -19,6 +19,10 @@ limitations under the License.
 // structure.
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 
+#if defined(_MSC_VER)
+#define __restrict__ __restrict
+#endif
+
 #ifndef USE_NEON
 #if defined(__ARM_NEON__) || defined(__ARM_NEON)
 #define USE_NEON
@@ -124,6 +128,12 @@ void PortableCopyVector(const float* vector, int v_size, float* result);
 // Fill vector with 0.f.
 void PortableZeroVector(float* vector, int v_size);
 
+// Multiply all elements of vector with a scalar.
+void PortableVectorScalarMultiply(const int8_t* vector, int v_size, float scale,
+                                  float* result);
+void NeonVectorScalarMultiply(const int8_t* vector, int v_size, float scale,
+                              float* result);
+
 // Limit a float input f between +abs_limit and -abs_limit.
 float PortableClip(float f, float abs_limit);
 
diff --git a/tensorflow/contrib/lite/kernels/internal/quantization_util.h b/tensorflow/contrib/lite/kernels/internal/quantization_util.h
index 525857a2e6..9b3f1823dc 100644
--- a/tensorflow/contrib/lite/kernels/internal/quantization_util.h
+++ b/tensorflow/contrib/lite/kernels/internal/quantization_util.h
@@ -28,8 +28,9 @@ namespace tflite {
 // Given the min and max values of a float array, return
 // reasonable quantization parameters to use for this array.
 template <typename T>
-QuantizationParams ChooseQuantizationParams(double rmin, double rmax) {
-  const T qmin = std::numeric_limits<T>::min();
+QuantizationParams ChooseQuantizationParams(double rmin, double rmax,
+                                            bool narrow_range) {
+  const T qmin = std::numeric_limits<T>::min() + (narrow_range ? 1 : 0);
   const T qmax = std::numeric_limits<T>::max();
   const double qmin_double = qmin;
   const double qmax_double = qmax;
@@ -97,6 +98,11 @@ QuantizationParams ChooseQuantizationParams(double rmin, double rmax) {
   return quantization_params;
 }
 
+template <typename T>
+QuantizationParams ChooseQuantizationParams(double rmin, double rmax) {
+  return ChooseQuantizationParams<T>(rmin, rmax, false);
+}
+
 // Converts a floating-point number to an integer. For all inputs x where
 // static_cast<IntOut>(x) is legal according to the C++ standard, the result
 // is identical to that cast (i.e. the result is x with its fractional part
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/legacy_reference_ops.h b/tensorflow/contrib/lite/kernels/internal/reference/legacy_reference_ops.h
index 878b2441b4..bcf5e4e4f6 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/legacy_reference_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/legacy_reference_ops.h
@@ -63,15 +63,257 @@ inline void Relu6(const float* input_data, const Dims<4>& input_dims,
         DimsToShape(output_dims));
 }
 
+template <FusedActivationFunctionType Ac>
+inline void Add(int left_shift, const uint8* input1_data,
+                const Dims<4>& input1_dims, int32 input1_offset,
+                int32 input1_multiplier, int input1_shift,
+                const uint8* input2_data, const Dims<4>& input2_dims,
+                int32 input2_offset, int32 input2_multiplier, int input2_shift,
+                int32 output_offset, int32 output_multiplier, int output_shift,
+                int32 output_activation_min, int32 output_activation_max,
+                uint8* output_data, const Dims<4>& output_dims) {
+  constexpr int kReverseShift = -1;
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, 0);
+    TFLITE_DCHECK_EQ(output_activation_max, 255);
+  }
+
+  tflite::ArithmeticParams op_params;
+  op_params.left_shift = left_shift;
+  op_params.input1_offset = input1_offset;
+  op_params.input1_multiplier = input1_multiplier;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_offset = input2_offset;
+  op_params.input2_multiplier = input2_multiplier;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.output_offset = output_offset;
+  op_params.output_multiplier = output_multiplier;
+  op_params.output_shift = kReverseShift * output_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+void Add(const int32* input1_data, const Dims<4>& input1_dims,
+         const int32* input2_data, const Dims<4>& input2_dims,
+         int32* output_data, const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("Add/int32");
+  TFLITE_DCHECK(Ac == FusedActivationFunctionType::kNone);
+
+  tflite::ArithmeticParams op_params;
+  op_params.quantized_activation_min = std::numeric_limits<int32>::min();
+  op_params.quantized_activation_max = std::numeric_limits<int32>::max();
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+inline void BroadcastAdd(int left_shift, const uint8* input1_data,
+                         const Dims<4>& input1_dims, int32 input1_offset,
+                         int32 input1_multiplier, int input1_shift,
+                         const uint8* input2_data, const Dims<4>& input2_dims,
+                         int32 input2_offset, int32 input2_multiplier,
+                         int input2_shift, int32 output_offset,
+                         int32 output_multiplier, int output_shift,
+                         int32 output_activation_min,
+                         int32 output_activation_max, uint8* output_data,
+                         const Dims<4>& output_dims) {
+  constexpr int kReverseShift = -1;
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, 0);
+    TFLITE_DCHECK_EQ(output_activation_max, 255);
+  }
+
+  tflite::ArithmeticParams op_params;
+  op_params.left_shift = left_shift;
+  op_params.input1_offset = input1_offset;
+  op_params.input1_multiplier = input1_multiplier;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_offset = input2_offset;
+  op_params.input2_multiplier = input2_multiplier;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.output_offset = output_offset;
+  op_params.output_multiplier = output_multiplier;
+  op_params.output_shift = kReverseShift * output_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  BroadcastAdd4DSlow(op_params, DimsToShape(input1_dims), input1_data,
+                     DimsToShape(input2_dims), input2_data,
+                     DimsToShape(output_dims), output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+void Add(const float* input1_data, const Dims<4>& input1_dims,
+         const float* input2_data, const Dims<4>& input2_dims,
+         float* output_data, const Dims<4>& output_dims) {
+  float output_activation_min, output_activation_max;
+  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
+
+  tflite::ArithmeticParams op_params;
+  op_params.float_activation_min = output_activation_min;
+  op_params.float_activation_max = output_activation_max;
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <typename T>
+void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T output_activation_min, T output_activation_max,
+                  T* output_data, const Dims<4>& output_dims) {
+  tflite::ArithmeticParams op_params;
+  op_params.float_activation_min = output_activation_min;
+  op_params.float_activation_max = output_activation_max;
+  BroadcastAdd4DSlow(op_params, DimsToShape(input1_dims), input1_data,
+                     DimsToShape(input2_dims), input2_data,
+                     DimsToShape(output_dims), output_data);
+}
+
+template <FusedActivationFunctionType Ac>
+inline void BroadcastAddFivefold(
+    int y0, int y1, int y2, int y3, int y4, int left_shift,
+    const uint8* input1_data, const Dims<4>& input1_dims, int32 input1_offset,
+    int32 input1_multiplier, int input1_shift, const uint8* input2_data,
+    const Dims<4>& input2_dims, int32 input2_offset, int32 input2_multiplier,
+    int input2_shift, int32 output_offset, int32 output_multiplier,
+    int output_shift, int32 output_activation_min, int32 output_activation_max,
+    uint8* output_data, const Dims<4>& output_dims) {
+  constexpr int kReverseShift = -1;
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, 0);
+    TFLITE_DCHECK_EQ(output_activation_max, 255);
+  }
+  tflite::ArithmeticParams op_params;
+  op_params.broadcast_category =
+      tflite::BroadcastableOpCategory::kFirstInputBroadcastsFast;
+  op_params.left_shift = left_shift;
+  op_params.input1_offset = input1_offset;
+  op_params.input1_multiplier = input1_multiplier;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_offset = input2_offset;
+  op_params.input2_multiplier = input2_multiplier;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.output_offset = output_offset;
+  op_params.output_multiplier = output_multiplier;
+  op_params.output_shift = kReverseShift * output_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  op_params.broadcast_shape[4] = y0;
+  op_params.broadcast_shape[3] = y1;
+  op_params.broadcast_shape[2] = y2;
+  op_params.broadcast_shape[1] = y3;
+  op_params.broadcast_shape[0] = y4;
+  BroadcastAddFivefold(op_params, DimsToShape(input1_dims), input1_data,
+                       DimsToShape(input2_dims), input2_data,
+                       DimsToShape(output_dims), output_data);
+}
+
+// legacy, for compatibility with old checked-in code
+template <FusedActivationFunctionType Ac, typename T>
+void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
+                  const T* input2_data, const Dims<4>& input2_dims,
+                  T* output_data, const Dims<4>& output_dims) {
+  T output_activation_min, output_activation_max;
+  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
+
+  BroadcastAdd(input1_data, input1_dims, input2_data, input2_dims,
+               output_activation_min, output_activation_max, output_data,
+               output_dims);
+}
+
+template <FusedActivationFunctionType Ac>
+inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
+                int input1_shift, const int16* input2_data,
+                const Dims<4>& input2_dims, int input2_shift,
+                int16 output_activation_min, int16 output_activation_max,
+                int16* output_data, const Dims<4>& output_dims) {
+  static_assert(Ac == FusedActivationFunctionType::kNone ||
+                    Ac == FusedActivationFunctionType::kRelu ||
+                    Ac == FusedActivationFunctionType::kRelu6 ||
+                    Ac == FusedActivationFunctionType::kRelu1,
+                "");
+  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+  if (Ac == FusedActivationFunctionType::kNone) {
+    TFLITE_DCHECK_EQ(output_activation_min, -32768);
+    TFLITE_DCHECK_EQ(output_activation_max, 32767);
+  }
+
+  tflite::ArithmeticParams op_params;
+  op_params.input1_shift = kReverseShift * input1_shift;
+  op_params.input2_shift = kReverseShift * input2_shift;
+  op_params.quantized_activation_min = output_activation_min;
+  op_params.quantized_activation_max = output_activation_max;
+  Add(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
+                const float* input2_data, const Dims<4>& input2_dims,
+                float* output_data, const Dims<4>& output_dims) {
+  float output_activation_min, output_activation_max;
+  GetActivationMinMax(FusedActivationFunctionType::kNone,
+                      &output_activation_min, &output_activation_max);
+  tflite::ArithmeticParams op_params;
+  op_params.float_activation_min = output_activation_min;
+  op_params.float_activation_max = output_activation_max;
+  Sub(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
+template <typename T>
+void Sub(const T* input1_data, const Dims<4>& input1_dims, const T* input2_data,
+         const Dims<4>& input2_dims, T* output_data,
+         const Dims<4>& output_dims) {
+  tflite::ArithmeticParams op_params;
+  op_params.quantized_activation_min = std::numeric_limits<T>::min();
+  op_params.quantized_activation_max = std::numeric_limits<T>::max();
+  Sub(op_params, DimsToShape(input1_dims), input1_data,
+      DimsToShape(input2_dims), input2_data, DimsToShape(output_dims),
+      output_data);
+}
+
 inline void AveragePool(const float* input_data, const Dims<4>& input_dims,
                         int stride_width, int stride_height, int pad_width,
                         int pad_height, int kwidth, int kheight,
                         float output_activation_min,
                         float output_activation_max, float* output_data,
                         const Dims<4>& output_dims) {
-  AveragePool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-              pad_width, pad_height, kwidth, kheight, output_activation_min,
-              output_activation_max, output_data, DimsToShape(output_dims));
+  tflite::PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = kheight;
+  params.filter_width = kwidth;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.float_activation_min = output_activation_min;
+  params.float_activation_max = output_activation_max;
+  AveragePool(params, DimsToShape(input_dims), input_data,
+              DimsToShape(output_dims), output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -104,10 +346,17 @@ inline void AveragePool(const uint8* input_data, const Dims<4>& input_dims,
                         int32 output_activation_min,
                         int32 output_activation_max, uint8* output_data,
                         const Dims<4>& output_dims) {
-  AveragePool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-              pad_width, pad_height, filter_width, filter_height,
-              output_activation_min, output_activation_max, output_data,
-              DimsToShape(output_dims));
+  tflite::PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = filter_height;
+  params.filter_width = filter_width;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.quantized_activation_min = output_activation_min;
+  params.quantized_activation_max = output_activation_max;
+  AveragePool(params, DimsToShape(input_dims), input_data,
+              DimsToShape(output_dims), output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -148,9 +397,17 @@ inline void MaxPool(const float* input_data, const Dims<4>& input_dims,
                     int pad_height, int kwidth, int kheight,
                     float output_activation_min, float output_activation_max,
                     float* output_data, const Dims<4>& output_dims) {
-  MaxPool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-          pad_width, pad_height, kwidth, kheight, output_activation_min,
-          output_activation_max, output_data, DimsToShape(output_dims));
+  tflite::PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = kheight;
+  params.filter_width = kwidth;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.float_activation_min = output_activation_min;
+  params.float_activation_max = output_activation_max;
+  MaxPool(params, DimsToShape(input_dims), input_data, DimsToShape(output_dims),
+          output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -180,10 +437,17 @@ inline void MaxPool(const uint8* input_data, const Dims<4>& input_dims,
                     int pad_height, int filter_width, int filter_height,
                     int32 output_activation_min, int32 output_activation_max,
                     uint8* output_data, const Dims<4>& output_dims) {
-  MaxPool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-          pad_width, pad_height, filter_width, filter_height,
-          output_activation_min, output_activation_max, output_data,
-          DimsToShape(output_dims));
+  PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = filter_height;
+  params.filter_width = filter_width;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.quantized_activation_min = output_activation_min;
+  params.quantized_activation_max = output_activation_max;
+  MaxPool(params, DimsToShape(input_dims), input_data, DimsToShape(output_dims),
+          output_data);
 }
 
 // legacy, for compatibility with old checked-in code
@@ -223,10 +487,17 @@ inline void L2Pool(const float* input_data, const Dims<4>& input_dims,
                    int pad_height, int filter_width, int filter_height,
                    float output_activation_min, float output_activation_max,
                    float* output_data, const Dims<4>& output_dims) {
-  L2Pool(input_data, DimsToShape(input_dims), stride_width, stride_height,
-         pad_width, pad_height, filter_width, filter_height,
-         output_activation_min, output_activation_max, output_data,
-         DimsToShape(output_dims));
+  PoolParams params;
+  params.stride_height = stride_height;
+  params.stride_width = stride_width;
+  params.filter_height = filter_height;
+  params.filter_width = filter_width;
+  params.padding_values.height = pad_height;
+  params.padding_values.width = pad_width;
+  params.float_activation_min = output_activation_min;
+  params.float_activation_max = output_activation_max;
+  L2Pool(params, DimsToShape(input_dims), input_data, DimsToShape(output_dims),
+         output_data);
 }
 
 // legacy, for compatibility with old checked-in code
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc
index ccf112c990..6bd88b5596 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 #include <stdlib.h>
 #include <string.h>
+#include <algorithm>
 
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/kernels/activation_functor.h"
@@ -38,14 +39,14 @@ bool PortableIsZeroVector(const float* vector, int v_size) {
 
 void PortableSymmetricQuantizeFloats(const float* values, const int size,
                                      int8_t* quantized_values,
-                                     float* __restrict__ min,
-                                     float* __restrict__ max,
+                                     float* __restrict__ min_value,
+                                     float* __restrict__ max_value,
                                      float* __restrict__ scaling_factor) {
   auto minmax = std::minmax_element(values, values + size);
-  *min = *minmax.first;
-  *max = *minmax.second;
+  *min_value = *minmax.first;
+  *max_value = *minmax.second;
   const int kScale = 127;
-  const float range = std::max(std::abs(*min), std::abs(*max));
+  const float range = std::max(std::abs(*min_value), std::abs(*max_value));
   if (range == 0) {
     memset(quantized_values, 0, size * sizeof(int8_t));
     *scaling_factor = 1;
@@ -195,6 +196,13 @@ void PortableZeroVector(float* vector, int v_size) {
   memset(vector, 0, v_size * sizeof(float));
 }
 
+void PortableVectorScalarMultiply(const int8_t* vector, const int v_size,
+                                  const float scale, float* result) {
+  for (int v = 0; v < v_size; ++v) {
+    *result++ = scale * *vector++;
+  }
+}
+
 void PortableClipVector(const float* vector, int v_size, float abs_limit,
                         float* result) {
   for (int v = 0; v < v_size; v++) {
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h
index d2e1fecd25..a375aaffa6 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/portable_tensor_utils.h
@@ -19,6 +19,10 @@ limitations under the License.
 // structure.
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 
+#if defined(_MSC_VER)
+#define __restrict__ __restrict
+#endif
+
 namespace tflite {
 namespace tensor_utils {
 
@@ -28,8 +32,8 @@ float PortableClip(float f, float abs_limit);
 bool PortableIsZeroVector(const float* vector, int v_size);
 
 void PortableSymmetricQuantizeFloats(const float* values, const int size,
-                                     int8_t* quantized_values, float* min,
-                                     float* max, float* scaling_factor);
+                                     int8_t* quantized_values, float* min_value,
+                                     float* max_value, float* scaling_factor);
 
 // Multiply a matrix by a batch vector, and store results in a batch-size
 // vector.
@@ -96,6 +100,10 @@ void PortableSub1Vector(const float* vector, int v_size, float* result);
 // Fill vector with 0.f.
 void PortableZeroVector(float* vector, int v_size);
 
+// Multiply all elements of vector with a scalar.
+void PortableVectorScalarMultiply(const int8_t* vector, int v_size, float scale,
+                                  float* result);
+
 // Clip elements of a vector using a abs_limit value.
 void PortableClipVector(const float* vector, int v_size, float abs_limit,
                         float* result);
@@ -199,6 +207,12 @@ void ZeroVector(float* vector, int v_size) {
   PortableZeroVector(vector, v_size);
 }
 
+// Multiply all elements of vector with a scalar.
+void VectorScalarMultiply(const int8_t* vector, int v_size, float scale,
+                          float* result) {
+  PortableVectorScalarMultiply(vector, v_size, scale, result);
+}
+
 void ClipVector(const float* vector, int v_size, float abs_limit,
                 float* result) {
   PortableClipVector(vector, v_size, abs_limit, result);
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
index 9357e7407e..3afa336413 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
@@ -158,98 +158,6 @@ SaturatingRoundingMultiplyByPOTParam(
       SaturatingRoundingMultiplyByPOTParam(a.raw(), exponent));
 }
 
-// DO NOT USE THIS STRUCT FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING ELEMENT-WISE
-// BROADCASTING.
-//
-// NdArrayDesc<N> describes the shape and memory layout of an N-dimensional
-// rectangular array of numbers.
-//
-// NdArrayDesc<N> is basically identical to Dims<N> defined in types.h.
-// However, as Dims<N> is to be deprecated, this class exists as an adaptor
-// to enable simple unoptimized implementations of element-wise broadcasting
-// operations.
-template <int N>
-struct NdArrayDesc {
-  // The "extent" of each dimension. Indices along dimension d must be in the
-  // half-open interval [0, extents[d]).
-  int extents[N];
-
-  // The number of *elements* (not bytes) between consecutive indices of each
-  // dimension.
-  int strides[N];
-};
-
-// DO NOT USE THIS FUNCTION FOR NEW FUNCTIONALITY BEYOND IMPLEMENTING
-// ELEMENT-WISE BROADCASTING.
-//
-// Same as Offset(), except takes as NdArrayDesc<N> instead of Dims<N>.
-inline int SubscriptToIndex(const NdArrayDesc<4>& desc, int i0, int i1, int i2,
-                            int i3) {
-  TFLITE_DCHECK(i0 >= 0 && i0 < desc.extents[0]);
-  TFLITE_DCHECK(i1 >= 0 && i1 < desc.extents[1]);
-  TFLITE_DCHECK(i2 >= 0 && i2 < desc.extents[2]);
-  TFLITE_DCHECK(i3 >= 0 && i3 < desc.extents[3]);
-  return i0 * desc.strides[0] + i1 * desc.strides[1] + i2 * desc.strides[2] +
-         i3 * desc.strides[3];
-}
-
-// Given the dimensions of the operands for an element-wise binary broadcast,
-// adjusts them so that they can be directly iterated over with simple loops.
-// Returns the adjusted dims as instances of NdArrayDesc in 'desc0_out' and
-// 'desc1_out'. 'desc0_out' and 'desc1_out' cannot be nullptr.
-//
-// This function assumes that the two input shapes are compatible up to
-// broadcasting and the shorter one has already been prepended with 1s to be the
-// same length. E.g., if shape0 is (1, 16, 16, 64) and shape1 is (1, 64),
-// shape1 must already have been prepended to be (1, 1, 1, 64). Recall that
-// Dims<N> refer to shapes in reverse order. In this case, input0_dims will be
-// (64, 16, 16, 1) and input1_dims will be (64, 1, 1, 1).
-//
-// When two shapes are compatible up to broadcasting, for each dimension d,
-// the input extents are either equal, or one of them is 1.
-//
-// This function performs the following for each dimension d:
-// - If the extents are equal, then do nothing since the loop that walks over
-//   both of the input arrays is correct.
-// - Otherwise, one (and only one) of the extents must be 1. Say extent0 is 1
-//   and extent1 is e1. Then set extent0 to e1 and stride0 *to 0*. This allows
-//   array0 to be referenced *at any index* in dimension d and still access the
-//   same slice.
-template <int N>
-inline void NdArrayDescsForElementwiseBroadcast(const Dims<N>& input0_dims,
-                                                const Dims<N>& input1_dims,
-                                                NdArrayDesc<N>* desc0_out,
-                                                NdArrayDesc<N>* desc1_out) {
-  TFLITE_DCHECK(desc0_out != nullptr);
-  TFLITE_DCHECK(desc1_out != nullptr);
-
-  // Copy dims to desc.
-  for (int i = 0; i < N; ++i) {
-    desc0_out->extents[i] = input0_dims.sizes[i];
-    desc0_out->strides[i] = input0_dims.strides[i];
-    desc1_out->extents[i] = input1_dims.sizes[i];
-    desc1_out->strides[i] = input1_dims.strides[i];
-  }
-
-  // Walk over each dimension. If the extents are equal do nothing.
-  // Otherwise, set the desc with extent 1 to have extent equal to the other and
-  // stride 0.
-  for (int i = 0; i < N; ++i) {
-    const int extent0 = ArraySize(input0_dims, i);
-    const int extent1 = ArraySize(input1_dims, i);
-    if (extent0 != extent1) {
-      if (extent0 == 1) {
-        desc0_out->strides[i] = 0;
-        desc0_out->extents[i] = extent1;
-      } else {
-        TFLITE_DCHECK_EQ(extent1, 1);
-        desc1_out->strides[i] = 0;
-        desc1_out->extents[i] = extent0;
-      }
-    }
-  }
-}
-
 inline void Conv(const float* input_data, const Dims<4>& input_dims,
                  const float* filter_data, const Dims<4>& filter_dims,
                  const float* bias_data, const Dims<4>& bias_dims,
@@ -414,8 +322,8 @@ inline void Conv(const uint8* input_data, const Dims<4>& input_dims,
           if (bias_data) {
             acc += bias_data[Offset(bias_dims, out_channel, 0, 0, 0)];
           }
-          acc = MultiplyByQuantizedMultiplierSmallerThanOneExp(
-              acc, output_multiplier, kReverseShift * output_shift);
+          acc = MultiplyByQuantizedMultiplier(acc, output_multiplier,
+                                              kReverseShift * output_shift);
           acc += output_offset;
           acc = std::max(acc, output_activation_min);
           acc = std::min(acc, output_activation_max);
@@ -1065,114 +973,108 @@ inline void L2Normalization(const uint8* input_data,
 }
 
 template <typename T>
-inline void Add(const T* input1_data, const Dims<4>& input1_dims,
-                const T* input2_data, const Dims<4>& input2_dims,
-                T output_activation_min, T output_activation_max,
-                T* output_data, const Dims<4>& output_dims) {
-  const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const T* input1_data,
+                const RuntimeShape& input2_shape, const T* input2_data,
+                const RuntimeShape& output_shape, T* output_data) {
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, output_shape);
   for (int i = 0; i < flat_size; ++i) {
     output_data[i] = ActivationFunctionWithMinMax(
-        input1_data[i] + input2_data[i], output_activation_min,
-        output_activation_max);
+        input1_data[i] + input2_data[i], params.quantized_activation_min,
+        params.quantized_activation_max);
   }
 }
 
-// legacy, for compatibility with old checked-in code
-template <FusedActivationFunctionType Ac>
-void Add(const float* input1_data, const Dims<4>& input1_dims,
-         const float* input2_data, const Dims<4>& input2_dims,
-         float* output_data, const Dims<4>& output_dims) {
-  float output_activation_min, output_activation_max;
-  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
-
-  Add(input1_data, input1_dims, input2_data, input2_dims, output_activation_min,
-      output_activation_max, output_data, output_dims);
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const float* input1_data,
+                const RuntimeShape& input2_shape, const float* input2_data,
+                const RuntimeShape& output_shape, float* output_data) {
+  const int size = MatchingFlatSize(input1_shape, input2_shape, output_shape);
+  for (int i = 0; i < size; i++) {
+    auto x = input1_data[i] + input2_data[i];
+    output_data[i] = ActivationFunctionWithMinMax(
+        x, params.float_activation_min, params.float_activation_max);
+  }
 }
 
-template <FusedActivationFunctionType Ac>
-inline void Add(int left_shift, const uint8* input1_data,
-                const Dims<4>& input1_dims, int32 input1_offset,
-                int32 input1_multiplier, int input1_shift,
-                const uint8* input2_data, const Dims<4>& input2_dims,
-                int32 input2_offset, int32 input2_multiplier, int input2_shift,
-                int32 output_offset, int32 output_multiplier, int output_shift,
-                int32 output_activation_min, int32 output_activation_max,
-                uint8* output_data, const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, 0);
-    TFLITE_DCHECK_EQ(output_activation_max, 255);
-  }
-  const int batches =
-      MatchingArraySize(input1_dims, 3, input2_dims, 3, output_dims, 3);
-  const int height =
-      MatchingArraySize(input1_dims, 2, input2_dims, 2, output_dims, 2);
-  const int width =
-      MatchingArraySize(input1_dims, 1, input2_dims, 1, output_dims, 1);
-  const int depth =
-      MatchingArraySize(input1_dims, 0, input2_dims, 0, output_dims, 0);
-  for (int b = 0; b < batches; ++b) {
-    for (int y = 0; y < height; ++y) {
-      for (int x = 0; x < width; ++x) {
-        for (int c = 0; c < depth; ++c) {
-          const int32 input1_val =
-              input1_offset + input1_data[Offset(input1_dims, c, x, y, b)];
-          const int32 input2_val =
-              input2_offset + input2_data[Offset(input2_dims, c, x, y, b)];
-          const int32 shifted_input1_val = input1_val * (1 << left_shift);
-          const int32 shifted_input2_val = input2_val * (1 << left_shift);
-          const int32 scaled_input1_val =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input1_val, input1_multiplier,
-                  kReverseShift * input1_shift);
-          const int32 scaled_input2_val =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input2_val, input2_multiplier,
-                  kReverseShift * input2_shift);
-          const int32 raw_sum = scaled_input1_val + scaled_input2_val;
-          const int32 raw_output =
-              MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  raw_sum, output_multiplier, kReverseShift * output_shift) +
-              output_offset;
-          const int32 clamped_output =
-              std::min(output_activation_max,
-                       std::max(output_activation_min, raw_output));
-          output_data[Offset(output_dims, c, x, y, b)] =
-              static_cast<uint8>(clamped_output);
-        }
-      }
-    }
+// Element-wise add that can often be used for inner loop of broadcast add as
+// well as the non-broadcast add.
+inline void AddElementwise(int size, const ArithmeticParams& params,
+                           const uint8* input1_data, const uint8* input2_data,
+                           uint8* output_data) {
+  TFLITE_DCHECK_GT(params.input1_offset, -256);
+  TFLITE_DCHECK_GT(params.input2_offset, -256);
+  TFLITE_DCHECK_LT(params.input1_offset, 256);
+  TFLITE_DCHECK_LT(params.input2_offset, 256);
+
+  for (int i = 0; i < size; ++i) {
+    const int32 input1_val = params.input1_offset + input1_data[i];
+    const int32 input2_val = params.input2_offset + input2_data[i];
+    const int32 shifted_input1_val = input1_val * (1 << params.left_shift);
+    const int32 shifted_input2_val = input2_val * (1 << params.left_shift);
+    const int32 scaled_input1_val =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            shifted_input1_val, params.input1_multiplier, params.input1_shift);
+    const int32 scaled_input2_val =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            shifted_input2_val, params.input2_multiplier, params.input2_shift);
+    const int32 raw_sum = scaled_input1_val + scaled_input2_val;
+    const int32 raw_output =
+        MultiplyByQuantizedMultiplierSmallerThanOneExp(
+            raw_sum, params.output_multiplier, params.output_shift) +
+        params.output_offset;
+    const int32 clamped_output =
+        std::min(params.quantized_activation_max,
+                 std::max(params.quantized_activation_min, raw_output));
+    output_data[i] = static_cast<uint8>(clamped_output);
   }
 }
 
-inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
-                int input1_shift, const int16* input2_data,
-                const Dims<4>& input2_dims, int input2_shift,
-                int16 output_activation_min, int16 output_activation_max,
-                int16* output_data, const Dims<4>& output_dims) {
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-
-  const int flat_size = MatchingFlatSize(output_dims, input1_dims, input2_dims);
-
-  TFLITE_DCHECK(input1_shift == 0 || input2_shift == 0);
-  TFLITE_DCHECK_GE(input1_shift, 0);
-  TFLITE_DCHECK_GE(input2_shift, 0);
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const uint8* input1_data,
+                const RuntimeShape& input2_shape, const uint8* input2_data,
+                const RuntimeShape& output_shape, uint8* output_data) {
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, output_shape);
+
+  TFLITE_DCHECK_GT(params.input1_offset, -256);
+  TFLITE_DCHECK_GT(params.input2_offset, -256);
+  TFLITE_DCHECK_LT(params.input1_offset, 256);
+  TFLITE_DCHECK_LT(params.input2_offset, 256);
+  AddElementwise(flat_size, params, input1_data, input2_data, output_data);
+}
+
+inline void Add(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const int16* input1_data,
+                const RuntimeShape& input2_shape, const int16* input2_data,
+                const RuntimeShape& output_shape, int16* output_data) {
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
+
+  const int input1_shift = params.input1_shift;
+  const int flat_size =
+      MatchingFlatSize(output_shape, input1_shape, input2_shape);
+  const int16 output_activation_min = params.quantized_activation_min;
+  const int16 output_activation_max = params.quantized_activation_max;
+
+  TFLITE_DCHECK(input1_shift == 0 || params.input2_shift == 0);
+  TFLITE_DCHECK_LE(input1_shift, 0);
+  TFLITE_DCHECK_LE(params.input2_shift, 0);
   const int16* not_shift_input = input1_shift == 0 ? input1_data : input2_data;
   const int16* shift_input = input1_shift == 0 ? input2_data : input1_data;
-  const int input_shift = input1_shift == 0 ? input2_shift : input1_shift;
+  const int input_right_shift =
+      input1_shift == 0 ? -params.input2_shift : -input1_shift;
 
   for (int i = 0; i < flat_size; i++) {
     // F0 uses 0 integer bits, range [-1, 1].
     using F0 = gemmlowp::FixedPoint<std::int16_t, 0>;
 
     F0 input_ready_scaled = F0::FromRaw(not_shift_input[i]);
-    F0 scaled_input =
-        F0::FromRaw(gemmlowp::RoundingDivideByPOT(shift_input[i], input_shift));
+    F0 scaled_input = F0::FromRaw(
+        gemmlowp::RoundingDivideByPOT(shift_input[i], input_right_shift));
     F0 result = gemmlowp::SaturatingAdd(scaled_input, input_ready_scaled);
     const int16 raw_output = result.raw();
     const int16 clamped_output = std::min(
@@ -1181,42 +1083,28 @@ inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
   }
 }
 
-template <FusedActivationFunctionType Ac>
-inline void Add(const int16* input1_data, const Dims<4>& input1_dims,
-                int input1_shift, const int16* input2_data,
-                const Dims<4>& input2_dims, int input2_shift,
-                int16 output_activation_min, int16 output_activation_max,
-                int16* output_data, const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, -32768);
-    TFLITE_DCHECK_EQ(output_activation_max, 32767);
-  }
-
-  Add(input1_data, input1_dims, input1_shift, input2_data, input2_dims,
-      input2_shift, output_activation_min, output_activation_max, output_data,
-      output_dims);
-}
-
 // TODO(jiawen): We can implement BroadcastAdd on buffers of arbitrary
 // dimensionality if the runtime code does a single loop over one dimension
 // that handles broadcasting as the base case. The code generator would then
 // generate max(D1, D2) nested for loops.
-template <typename T>
-void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
-                  const T* input2_data, const Dims<4>& input2_dims,
-                  T output_activation_min, T output_activation_max,
-                  T* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastAdd");
-
+// TODO(benoitjacob): BroadcastAdd is intentionally duplicated from
+// reference_ops.h. Once an optimized version is implemented and NdArrayDesc<T>
+// is no longer referenced in this file, move NdArrayDesc<T> from types.h to
+// reference_ops.h.
+inline void BroadcastAdd4DSlow(const ArithmeticParams& params,
+                               const RuntimeShape& input1_shape,
+                               const float* input1_data,
+                               const RuntimeShape& input2_shape,
+                               const float* input2_data,
+                               const RuntimeShape& output_shape,
+                               float* output_data) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastAdd4DSlow/float");
   NdArrayDesc<4> desc1;
   NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
 
   // In Tensorflow, the dimensions are canonically named (batch_number, row,
   // col, channel), with extents (batches, height, width, depth), with the
@@ -1229,49 +1117,77 @@ void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
   // We name our variables by their Tensorflow convention, but generate C code
   // nesting loops such that the innermost loop has the smallest stride for the
   // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          output_data[Offset(output_dims, c, x, y, b)] =
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
               ActivationFunctionWithMinMax(
-                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] +
-                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
-                  output_activation_min, output_activation_max);
+                  input1_data[SubscriptToIndex(desc1, b, y, x, c)] +
+                      input2_data[SubscriptToIndex(desc2, b, y, x, c)],
+                  params.float_activation_min, params.float_activation_max);
         }
       }
     }
   }
 }
 
-// legacy, for compatibility with old checked-in code
-template <FusedActivationFunctionType Ac, typename T>
-void BroadcastAdd(const T* input1_data, const Dims<4>& input1_dims,
-                  const T* input2_data, const Dims<4>& input2_dims,
-                  T* output_data, const Dims<4>& output_dims) {
-  T output_activation_min, output_activation_max;
-  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
+inline void BroadcastAdd4DSlow(const ArithmeticParams& params,
+                               const RuntimeShape& input1_shape,
+                               const int32* input1_data,
+                               const RuntimeShape& input2_shape,
+                               const int32* input2_data,
+                               const RuntimeShape& output_shape,
+                               int32* output_data) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastAdd4DSlow/int32");
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
 
-  BroadcastAdd(input1_data, input1_dims, input2_data, input2_dims,
-               output_activation_min, output_activation_max, output_data,
-               output_dims);
+  // In Tensorflow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their Tensorflow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
+              ActivationFunctionWithMinMax(
+                  input1_data[SubscriptToIndex(desc1, b, y, x, c)] +
+                      input2_data[SubscriptToIndex(desc2, b, y, x, c)],
+                  params.quantized_activation_min,
+                  params.quantized_activation_max);
+        }
+      }
+    }
+  }
 }
 
-inline void BroadcastAdd(int left_shift, const uint8* input1_data,
-                         const Dims<4>& input1_dims, int32 input1_offset,
-                         int32 input1_multiplier, int input1_shift,
-                         const uint8* input2_data, const Dims<4>& input2_dims,
-                         int32 input2_offset, int32 input2_multiplier,
-                         int input2_shift, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastAdd/8bit");
-
+inline void BroadcastAdd4DSlow(const ArithmeticParams& params,
+                               const RuntimeShape& input1_shape,
+                               const uint8* input1_data,
+                               const RuntimeShape& input2_shape,
+                               const uint8* input2_data,
+                               const RuntimeShape& output_shape,
+                               uint8* output_data) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastAdd4DSlow/uint8");
   NdArrayDesc<4> desc1;
   NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
 
   // In Tensorflow, the dimensions are canonically named (batch_number, row,
   // col, channel), with extents (batches, height, width, depth), with the
@@ -1284,33 +1200,37 @@ inline void BroadcastAdd(int left_shift, const uint8* input1_data,
   // We name our variables by their Tensorflow convention, but generate C code
   // nesting loops such that the innermost loop has the smallest stride for the
   // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
           const int32 input1_val =
-              input1_offset + input1_data[SubscriptToIndex(desc1, c, x, y, b)];
+              params.input1_offset +
+              input1_data[SubscriptToIndex(desc1, b, y, x, c)];
           const int32 input2_val =
-              input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
-          const int32 shifted_input1_val = input1_val * (1 << left_shift);
-          const int32 shifted_input2_val = input2_val * (1 << left_shift);
+              params.input2_offset +
+              input2_data[SubscriptToIndex(desc2, b, y, x, c)];
+          const int32 shifted_input1_val =
+              input1_val * (1 << params.left_shift);
+          const int32 shifted_input2_val =
+              input2_val * (1 << params.left_shift);
           const int32 scaled_input1_val =
               MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input1_val, input1_multiplier,
-                  kReverseShift * input1_shift);
+                  shifted_input1_val, params.input1_multiplier,
+                  params.input1_shift);
           const int32 scaled_input2_val =
               MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input2_val, input2_multiplier,
-                  kReverseShift * input2_shift);
+                  shifted_input2_val, params.input2_multiplier,
+                  params.input2_shift);
           const int32 raw_sum = scaled_input1_val + scaled_input2_val;
           const int32 raw_output =
               MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  raw_sum, output_multiplier, kReverseShift * output_shift) +
-              output_offset;
+                  raw_sum, params.output_multiplier, params.output_shift) +
+              params.output_offset;
           const int32 clamped_output =
-              std::min(output_activation_max,
-                       std::max(output_activation_min, raw_output));
-          output_data[Offset(output_dims, c, x, y, b)] =
+              std::min(params.quantized_activation_max,
+                       std::max(params.quantized_activation_min, raw_output));
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
               static_cast<uint8>(clamped_output);
         }
       }
@@ -1318,121 +1238,67 @@ inline void BroadcastAdd(int left_shift, const uint8* input1_data,
   }
 }
 
-inline void BroadcastAddFivefold(
-    int y0, int y1, int y2, int y3, int y4, int left_shift,
-    const uint8* input1_data, const Dims<4>& input1_dims, int32 input1_offset,
-    int32 input1_multiplier, int input1_shift, const uint8* input2_data,
-    const Dims<4>& input2_dims, int32 input2_offset, int32 input2_multiplier,
-    int input2_shift, int32 output_offset, int32 output_multiplier,
-    int output_shift, int32 output_activation_min, int32 output_activation_max,
-    uint8* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastAddFivefold/8bit");
-
-  int sb1 = y0;
-  int sa2 = y0;
-  int sb2 = y0 * y1;
-  int sa3 = y0 * y2;
-  int sa4 = y0 * y2 * y3;
-  int sb4 = y0 * y1 * y2;
-
+inline void BroadcastAddFivefold(const ArithmeticParams& unswitched_params,
+                                 const RuntimeShape& unswitched_input1_shape,
+                                 const uint8* unswitched_input1_data,
+                                 const RuntimeShape& unswitched_input2_shape,
+                                 const uint8* unswitched_input2_data,
+                                 const RuntimeShape& output_shape,
+                                 uint8* output_data) {
+  ArithmeticParams switched_params = unswitched_params;
+  switched_params.input1_offset = unswitched_params.input2_offset;
+  switched_params.input1_multiplier = unswitched_params.input2_multiplier;
+  switched_params.input1_shift = unswitched_params.input2_shift;
+  switched_params.input2_offset = unswitched_params.input1_offset;
+  switched_params.input2_multiplier = unswitched_params.input1_multiplier;
+  switched_params.input2_shift = unswitched_params.input1_shift;
+
+  const bool use_unswitched =
+      unswitched_params.broadcast_category ==
+      tflite::BroadcastableOpCategory::kFirstInputBroadcastsFast;
+
+  const ArithmeticParams& params =
+      use_unswitched ? unswitched_params : switched_params;
+  const uint8* input1_data =
+      use_unswitched ? unswitched_input1_data : unswitched_input2_data;
+  const uint8* input2_data =
+      use_unswitched ? unswitched_input2_data : unswitched_input1_data;
+
+  // Fivefold nested loops. The second input resets its position for each
+  // iteration of the second loop. The first input resets its position at the
+  // beginning of the fourth loop. The innermost loop is an elementwise add of
+  // sections of the arrays.
   uint8* output_data_ptr = output_data;
-  for (int i4 = 0; i4 < y4; ++i4) {
-    for (int i3 = 0; i3 < y3; ++i3) {
+  const uint8* input1_data_ptr = input1_data;
+  const uint8* input2_data_reset = input2_data;
+  int y0 = params.broadcast_shape[0];
+  int y1 = params.broadcast_shape[1];
+  int y2 = params.broadcast_shape[2];
+  int y3 = params.broadcast_shape[3];
+  int y4 = params.broadcast_shape[4];
+  for (int i0 = 0; i0 < y0; ++i0) {
+    const uint8* input2_data_ptr;
+    for (int i1 = 0; i1 < y1; ++i1) {
+      input2_data_ptr = input2_data_reset;
       for (int i2 = 0; i2 < y2; ++i2) {
-        for (int i1 = 0; i1 < y1; ++i1) {
-          for (int i0 = 0; i0 < y0; ++i0) {
-            const int32 input1_val =
-                input1_offset +
-                input1_data[i4 * sa4 + i3 * sa3 + i2 * sa2 + i0];
-            const int32 input2_val =
-                input2_offset +
-                input2_data[i4 * sb4 + i2 * sb2 + i1 * sb1 + i0];
-            const int32 shifted_input1_val = input1_val * (1 << left_shift);
-            const int32 shifted_input2_val = input2_val * (1 << left_shift);
-            const int32 scaled_input1_val =
-                MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                    shifted_input1_val, input1_multiplier,
-                    kReverseShift * input1_shift);
-            const int32 scaled_input2_val =
-                MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                    shifted_input2_val, input2_multiplier,
-                    kReverseShift * input2_shift);
-            const int32 raw_sum = scaled_input1_val + scaled_input2_val;
-            const int32 raw_output =
-                MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                    raw_sum, output_multiplier, kReverseShift * output_shift) +
-                output_offset;
-            const int32 clamped_output =
-                std::min(output_activation_max,
-                         std::max(output_activation_min, raw_output));
-            *output_data_ptr = static_cast<uint8>(clamped_output);
-            ++output_data_ptr;
-          }
+        for (int i3 = 0; i3 < y3; ++i3) {
+          AddElementwise(y4, params, input1_data_ptr, input2_data_ptr,
+                         output_data_ptr);
+          input2_data_ptr += y4;
+          output_data_ptr += y4;
         }
+        input1_data_ptr += y4;
       }
     }
+    input2_data_reset = input2_data_ptr;
   }
 }
 
-template <FusedActivationFunctionType Ac>
-inline void BroadcastAdd(int left_shift, const uint8* input1_data,
-                         const Dims<4>& input1_dims, int32 input1_offset,
-                         int32 input1_multiplier, int input1_shift,
-                         const uint8* input2_data, const Dims<4>& input2_dims,
-                         int32 input2_offset, int32 input2_multiplier,
-                         int input2_shift, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, 0);
-    TFLITE_DCHECK_EQ(output_activation_max, 255);
-  }
-  BroadcastAdd(left_shift, input1_data, input1_dims, input1_offset,
-               input1_multiplier, input1_shift, input2_data, input2_dims,
-               input2_offset, input2_multiplier, input2_shift, output_offset,
-               output_multiplier, output_shift, output_activation_min,
-               output_activation_max, output_data, output_dims);
-}
-
-template <FusedActivationFunctionType Ac>
-inline void BroadcastAddFivefold(
-    int y0, int y1, int y2, int y3, int y4, int left_shift,
-    const uint8* input1_data, const Dims<4>& input1_dims, int32 input1_offset,
-    int32 input1_multiplier, int input1_shift, const uint8* input2_data,
-    const Dims<4>& input2_dims, int32 input2_offset, int32 input2_multiplier,
-    int input2_shift, int32 output_offset, int32 output_multiplier,
-    int output_shift, int32 output_activation_min, int32 output_activation_max,
-    uint8* output_data, const Dims<4>& output_dims) {
-  static_assert(Ac == FusedActivationFunctionType::kNone ||
-                    Ac == FusedActivationFunctionType::kRelu ||
-                    Ac == FusedActivationFunctionType::kRelu6 ||
-                    Ac == FusedActivationFunctionType::kRelu1,
-                "");
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  if (Ac == FusedActivationFunctionType::kNone) {
-    TFLITE_DCHECK_EQ(output_activation_min, 0);
-    TFLITE_DCHECK_EQ(output_activation_max, 255);
-  }
-  BroadcastAddFivefold(y0, y1, y2, y3, y4, left_shift, input1_data, input1_dims,
-                       input1_offset, input1_multiplier, input1_shift,
-                       input2_data, input2_dims, input2_offset,
-                       input2_multiplier, input2_shift, output_offset,
-                       output_multiplier, output_shift, output_activation_min,
-                       output_activation_max, output_data, output_dims);
-}
-
-inline void Mul(const float* input1_data, const Dims<4>& input1_dims,
-                const float* input2_data, const Dims<4>& input2_dims,
-                float output_activation_min, float output_activation_max,
-                float* output_data, const Dims<4>& output_dims) {
+template <typename T>
+inline void Mul(const T* input1_data, const Dims<4>& input1_dims,
+                const T* input2_data, const Dims<4>& input2_dims,
+                T output_activation_min, T output_activation_max,
+                T* output_data, const Dims<4>& output_dims) {
   const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
   for (int i = 0; i < flat_size; ++i) {
     output_data[i] = ActivationFunctionWithMinMax(
@@ -1653,10 +1519,11 @@ void BroadcastDiv(const T* input1_data, const Dims<4>& input1_dims,
   }
 }
 
-inline void Div(const float* input1_data, const Dims<4>& input1_dims,
-                const float* input2_data, const Dims<4>& input2_dims,
-                float output_activation_min, float output_activation_max,
-                float* output_data, const Dims<4>& output_dims) {
+template <typename T>
+inline void Div(const T* input1_data, const Dims<4>& input1_dims,
+                const T* input2_data, const Dims<4>& input2_dims,
+                T output_activation_min, T output_activation_max,
+                T* output_data, const Dims<4>& output_dims) {
   const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
   for (int i = 0; i < flat_size; ++i) {
     output_data[i] = ActivationFunctionWithMinMax(
@@ -1665,15 +1532,35 @@ inline void Div(const float* input1_data, const Dims<4>& input1_dims,
   }
 }
 
-inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
-                const float* input2_data, const Dims<4>& input2_dims,
-                float output_activation_min, float output_activation_max,
-                float* output_data, const Dims<4>& output_dims) {
-  const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
+inline void SubNonBroadcast(const ArithmeticParams& params,
+                            const RuntimeShape& input1_shape,
+                            const float* input1_data,
+                            const RuntimeShape& input2_shape,
+                            const float* input2_data,
+                            const RuntimeShape& output_shape,
+                            float* output_data) {
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, output_shape);
   for (int i = 0; i < flat_size; ++i) {
     output_data[i] = ActivationFunctionWithMinMax(
-        input1_data[i] - input2_data[i], output_activation_min,
-        output_activation_max);
+        input1_data[i] - input2_data[i], params.float_activation_min,
+        params.float_activation_max);
+  }
+}
+
+inline void SubNonBroadcast(const ArithmeticParams& params,
+                            const RuntimeShape& input1_shape,
+                            const int32* input1_data,
+                            const RuntimeShape& input2_shape,
+                            const int32* input2_data,
+                            const RuntimeShape& output_shape,
+                            int32* output_data) {
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, output_shape);
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(
+        input1_data[i] - input2_data[i], params.quantized_activation_min,
+        params.quantized_activation_max);
   }
 }
 
@@ -1681,16 +1568,24 @@ inline void Sub(const float* input1_data, const Dims<4>& input1_dims,
 // dimensionality if the runtime code does a single loop over one dimension
 // that handles broadcasting as the base case. The code generator would then
 // generate max(D1, D2) nested for loops.
-template <typename T>
-void BroadcastSub(const T* input1_data, const Dims<4>& input1_dims,
-                  const T* input2_data, const Dims<4>& input2_dims,
-                  T output_activation_min, T output_activation_max,
-                  T* output_data, const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastSub");
-
+// TODO(benoitjacob): BroadcastSub is intentionally duplicated from
+// reference_ops.h. Once an optimized version is implemented and NdArrayDesc<T>
+// is no longer referenced in this file, move NdArrayDesc<T> from types.h to
+// reference_ops.h.
+inline void BroadcastSub4DSlow(const ArithmeticParams& params,
+                               const RuntimeShape& input1_shape,
+                               const float* input1_data,
+                               const RuntimeShape& input2_shape,
+                               const float* input2_data,
+                               const RuntimeShape& output_shape,
+                               float* output_data) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastAdd4DSlow/float");
   NdArrayDesc<4> desc1;
   NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
 
   // In Tensorflow, the dimensions are canonically named (batch_number, row,
   // col, channel), with extents (batches, height, width, depth), with the
@@ -1703,36 +1598,35 @@ void BroadcastSub(const T* input1_data, const Dims<4>& input1_dims,
   // We name our variables by their Tensorflow convention, but generate C code
   // nesting loops such that the innermost loop has the smallest stride for the
   // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          output_data[Offset(output_dims, c, x, y, b)] =
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
               ActivationFunctionWithMinMax(
-                  input1_data[SubscriptToIndex(desc1, c, x, y, b)] -
-                      input2_data[SubscriptToIndex(desc2, c, x, y, b)],
-                  output_activation_min, output_activation_max);
+                  input1_data[SubscriptToIndex(desc1, b, y, x, c)] -
+                      input2_data[SubscriptToIndex(desc2, b, y, x, c)],
+                  params.float_activation_min, params.float_activation_max);
         }
       }
     }
   }
 }
 
-inline void BroadcastSub(int left_shift, const uint8* input1_data,
-                         const Dims<4>& input1_dims, int32 input1_offset,
-                         int32 input1_multiplier, int input1_shift,
-                         const uint8* input2_data, const Dims<4>& input2_dims,
-                         int32 input2_offset, int32 input2_multiplier,
-                         int input2_shift, int32 output_offset,
-                         int32 output_multiplier, int output_shift,
-                         int32 output_activation_min,
-                         int32 output_activation_max, uint8* output_data,
-                         const Dims<4>& output_dims) {
-  gemmlowp::ScopedProfilingLabel label("BroadcastSub/8bit");
-
+inline void BroadcastSub4DSlow(const ArithmeticParams& params,
+                               const RuntimeShape& input1_shape,
+                               const uint8* input1_data,
+                               const RuntimeShape& input2_shape,
+                               const uint8* input2_data,
+                               const RuntimeShape& output_shape,
+                               uint8* output_data) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastAdd4DSlow/uint8");
   NdArrayDesc<4> desc1;
   NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
 
   // In Tensorflow, the dimensions are canonically named (batch_number, row,
   // col, channel), with extents (batches, height, width, depth), with the
@@ -1745,33 +1639,37 @@ inline void BroadcastSub(int left_shift, const uint8* input1_data,
   // We name our variables by their Tensorflow convention, but generate C code
   // nesting loops such that the innermost loop has the smallest stride for the
   // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
           const int32 input1_val =
-              input1_offset + input1_data[SubscriptToIndex(desc1, c, x, y, b)];
+              params.input1_offset +
+              input1_data[SubscriptToIndex(desc1, b, y, x, c)];
           const int32 input2_val =
-              input2_offset + input2_data[SubscriptToIndex(desc2, c, x, y, b)];
-          const int32 shifted_input1_val = input1_val * (1 << left_shift);
-          const int32 shifted_input2_val = input2_val * (1 << left_shift);
+              params.input2_offset +
+              input2_data[SubscriptToIndex(desc2, b, y, x, c)];
+          const int32 shifted_input1_val =
+              input1_val * (1 << params.left_shift);
+          const int32 shifted_input2_val =
+              input2_val * (1 << params.left_shift);
           const int32 scaled_input1_val =
               MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input1_val, input1_multiplier,
-                  kReverseShift * input1_shift);
+                  shifted_input1_val, params.input1_multiplier,
+                  params.input1_shift);
           const int32 scaled_input2_val =
               MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  shifted_input2_val, input2_multiplier,
-                  kReverseShift * input2_shift);
+                  shifted_input2_val, params.input2_multiplier,
+                  params.input2_shift);
           const int32 raw_sub = scaled_input1_val - scaled_input2_val;
           const int32 raw_output =
               MultiplyByQuantizedMultiplierSmallerThanOneExp(
-                  raw_sub, output_multiplier, kReverseShift * output_shift) +
-              output_offset;
+                  raw_sub, params.output_multiplier, params.output_shift) +
+              params.output_offset;
           const int32 clamped_output =
-              std::min(output_activation_max,
-                       std::max(output_activation_min, raw_output));
-          output_data[Offset(output_dims, c, x, y, b)] =
+              std::min(params.quantized_activation_max,
+                       std::max(params.quantized_activation_min, raw_output));
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
               static_cast<uint8>(clamped_output);
         }
       }
@@ -1779,6 +1677,156 @@ inline void BroadcastSub(int left_shift, const uint8* input1_data,
   }
 }
 
+inline void BroadcastSub4DSlow(const ArithmeticParams& params,
+                               const RuntimeShape& input1_shape,
+                               const int32* input1_data,
+                               const RuntimeShape& input2_shape,
+                               const int32* input2_data,
+                               const RuntimeShape& output_shape,
+                               int32* output_data) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastAdd4DSlow/int32");
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
+
+  // In Tensorflow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their Tensorflow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
+              ActivationFunctionWithMinMax(
+                  input1_data[SubscriptToIndex(desc1, b, y, x, c)] -
+                      input2_data[SubscriptToIndex(desc2, b, y, x, c)],
+                  params.quantized_activation_min,
+                  params.quantized_activation_max);
+        }
+      }
+    }
+  }
+}
+
+template <typename T>
+void BroadcastSub4DSlow(const ArithmeticParams& params,
+                        const RuntimeShape& input1_shape, const T* input1_data,
+                        const RuntimeShape& input2_shape, const T* input2_data,
+                        const RuntimeShape& output_shape, T* output_data) {
+  gemmlowp::ScopedProfilingLabel label("BroadcastAdd4DSlow/templated");
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
+
+  // In Tensorflow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their Tensorflow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
+              ActivationFunctionWithMinMax(
+                  input1_data[SubscriptToIndex(desc1, b, y, x, c)] -
+                      input2_data[SubscriptToIndex(desc2, b, y, x, c)],
+                  params.quantized_activation_min,
+                  params.quantized_activation_max);
+        }
+      }
+    }
+  }
+}
+
+template <typename T>
+void Sub(const ArithmeticParams& params, const RuntimeShape& input1_shape,
+         const T* input1_data, const RuntimeShape& input2_shape,
+         const T* input2_data, const RuntimeShape& output_shape,
+         T* output_data) {
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_shape, input2_shape, &desc1,
+                                      &desc2);
+  RuntimeShape extended_output_shape =
+      RuntimeShape::ExtendedShape(4, output_shape);
+
+  // In Tensorflow, the dimensions are canonically named (batch_number, row,
+  // col, channel), with extents (batches, height, width, depth), with the
+  // trailing dimension changing most rapidly (channels has the smallest stride,
+  // typically 1 element).
+  //
+  // In generated C code, we store arrays with the dimensions reversed. The
+  // first dimension has smallest stride.
+  //
+  // We name our variables by their Tensorflow convention, but generate C code
+  // nesting loops such that the innermost loop has the smallest stride for the
+  // best cache behavior.
+  for (int b = 0; b < extended_output_shape.Dims(0); ++b) {
+    for (int y = 0; y < extended_output_shape.Dims(1); ++y) {
+      for (int x = 0; x < extended_output_shape.Dims(2); ++x) {
+        for (int c = 0; c < extended_output_shape.Dims(3); ++c) {
+          output_data[Offset(extended_output_shape, b, y, x, c)] =
+              input1_data[SubscriptToIndex(desc1, b, y, x, c)] -
+              input2_data[SubscriptToIndex(desc2, b, y, x, c)];
+        }
+      }
+    }
+  }
+}
+
+inline void SubWithActivation(const ArithmeticParams& params,
+                              const RuntimeShape& input1_shape,
+                              const int32* input1_data,
+                              const RuntimeShape& input2_shape,
+                              const int32* input2_data,
+                              const RuntimeShape& output_shape,
+                              int32* output_data) {
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, input2_shape);
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(
+        input1_data[i] - input2_data[i], params.quantized_activation_min,
+        params.quantized_activation_max);
+  }
+}
+
+inline void SubWithActivation(const ArithmeticParams& params,
+                              const RuntimeShape& input1_shape,
+                              const float* input1_data,
+                              const RuntimeShape& input2_shape,
+                              const float* input2_data,
+                              const RuntimeShape& output_shape,
+                              float* output_data) {
+  const int flat_size =
+      MatchingFlatSize(input1_shape, input2_shape, input2_shape);
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = ActivationFunctionWithMinMax(
+        input1_data[i] - input2_data[i], params.float_activation_min,
+        params.float_activation_max);
+  }
+}
+
 template <FusedActivationFunctionType Ac, typename Scalar>
 void Concatenation(int concat_dim, const Scalar* const* input_data,
                    const Dims<4>* const* input_dims, int inputs_count,
@@ -1812,6 +1860,26 @@ void Concatenation(int concat_dim, const Scalar* const* input_data,
   }
 }
 
+template <typename Scalar>
+void Pack(int dim, const Scalar* const* input_data,
+          const Dims<4>* const* input_dims, int inputs_count,
+          Scalar* output_data, const Dims<4>& output_dims) {
+  TFLITE_DCHECK(IsPackedWithoutStrides(output_dims));
+  int outer_size = 1;
+  for (int i = dim + 1; i < 4; i++) {
+    outer_size *= output_dims.sizes[i];
+  }
+  Scalar* output_ptr = output_data;
+  const int copy_size = FlatSize(**input_dims) / outer_size;
+  for (int k = 0; k < outer_size; k++) {
+    for (int i = 0; i < inputs_count; ++i) {
+      memcpy(output_ptr, input_data[i] + k * copy_size,
+             copy_size * sizeof(Scalar));
+      output_ptr += copy_size;
+    }
+  }
+}
+
 // TODO(prabhumk): This is the same as the optimized implementation.
 // TODO(prabhumk): The quantized implementation of concatentation isn't fully
 // quantized as it takes scale as a floating point value. This should be fixed
@@ -2273,13 +2341,10 @@ inline int NodeOffset(int b, int h, int w, int height, int width) {
   return (b * height + h) * width + w;
 }
 
-inline void AveragePool(const float* input_data,
-                        const RuntimeShape& input_shape, int stride_width,
-                        int stride_height, int pad_width, int pad_height,
-                        int filter_width, int filter_height,
-                        float output_activation_min,
-                        float output_activation_max, float* output_data,
-                        const RuntimeShape& output_shape) {
+inline void AveragePool(const PoolParams& params,
+                        const RuntimeShape& input_shape,
+                        const float* input_data,
+                        const RuntimeShape& output_shape, float* output_data) {
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
@@ -2288,20 +2353,24 @@ inline void AveragePool(const float* input_data,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   for (int batch = 0; batch < batches; ++batch) {
     for (int out_y = 0; out_y < output_height; ++out_y) {
       for (int out_x = 0; out_x < output_width; ++out_x) {
         for (int channel = 0; channel < depth; ++channel) {
-          const int in_x_origin = (out_x * stride_width) - pad_width;
-          const int in_y_origin = (out_y * stride_height) - pad_height;
+          const int in_x_origin =
+              (out_x * stride_width) - params.padding_values.width;
+          const int in_y_origin =
+              (out_y * stride_height) - params.padding_values.height;
           // Compute the boundaries of the filter region clamped so as to
           // ensure that the filter window fits in the input array.
           const int filter_x_start = std::max(0, -in_x_origin);
           const int filter_x_end =
-              std::min(filter_width, input_width - in_x_origin);
+              std::min(params.filter_width, input_width - in_x_origin);
           const int filter_y_start = std::max(0, -in_y_origin);
           const int filter_y_end =
-              std::min(filter_height, input_height - in_y_origin);
+              std::min(params.filter_height, input_height - in_y_origin);
           float total = 0.f;
           float filter_count = 0;
           for (int filter_y = filter_y_start; filter_y < filter_y_end;
@@ -2317,22 +2386,20 @@ inline void AveragePool(const float* input_data,
           }
           const float average = total / filter_count;
           output_data[Offset(output_shape, batch, out_y, out_x, channel)] =
-              ActivationFunctionWithMinMax(average, output_activation_min,
-                                           output_activation_max);
+              ActivationFunctionWithMinMax(average, params.float_activation_min,
+                                           params.float_activation_max);
         }
       }
     }
   }
 }
 
-inline void AveragePool(const uint8* input_data,
-                        const RuntimeShape& input_shape, int stride_width,
-                        int stride_height, int pad_width, int pad_height,
-                        int filter_width, int filter_height,
-                        int32 output_activation_min,
-                        int32 output_activation_max, uint8* output_data,
-                        const RuntimeShape& output_shape) {
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
+inline void AveragePool(const PoolParams& params,
+                        const RuntimeShape& input_shape,
+                        const uint8* input_data,
+                        const RuntimeShape& output_shape, uint8* output_data) {
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
@@ -2341,20 +2408,24 @@ inline void AveragePool(const uint8* input_data,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   for (int batch = 0; batch < batches; ++batch) {
     for (int out_y = 0; out_y < output_height; ++out_y) {
       for (int out_x = 0; out_x < output_width; ++out_x) {
         for (int channel = 0; channel < depth; ++channel) {
-          const int in_x_origin = (out_x * stride_width) - pad_width;
-          const int in_y_origin = (out_y * stride_height) - pad_height;
+          const int in_x_origin =
+              (out_x * stride_width) - params.padding_values.width;
+          const int in_y_origin =
+              (out_y * stride_height) - params.padding_values.height;
           // Compute the boundaries of the filter region clamped so as to
           // ensure that the filter window fits in the input array.
           const int filter_x_start = std::max(0, -in_x_origin);
           const int filter_x_end =
-              std::min(filter_width, input_width - in_x_origin);
+              std::min(params.filter_width, input_width - in_x_origin);
           const int filter_y_start = std::max(0, -in_y_origin);
           const int filter_y_end =
-              std::min(filter_height, input_height - in_y_origin);
+              std::min(params.filter_height, input_height - in_y_origin);
           int32 acc = 0;
           int filter_count = 0;
           for (int filter_y = filter_y_start; filter_y < filter_y_end;
@@ -2369,8 +2440,8 @@ inline void AveragePool(const uint8* input_data,
             }
           }
           acc = (acc + filter_count / 2) / filter_count;
-          acc = std::max(acc, output_activation_min);
-          acc = std::min(acc, output_activation_max);
+          acc = std::max(acc, params.quantized_activation_min);
+          acc = std::min(acc, params.quantized_activation_max);
           output_data[Offset(output_shape, batch, out_y, out_x, channel)] =
               static_cast<uint8>(acc);
         }
@@ -2379,11 +2450,9 @@ inline void AveragePool(const uint8* input_data,
   }
 }
 
-inline void L2Pool(const float* input_data, const RuntimeShape& input_shape,
-                   int stride_width, int stride_height, int pad_width,
-                   int pad_height, int filter_width, int filter_height,
-                   float output_activation_min, float output_activation_max,
-                   float* output_data, const RuntimeShape& output_shape) {
+inline void L2Pool(const PoolParams& params, const RuntimeShape& input_shape,
+                   const float* input_data, const RuntimeShape& output_shape,
+                   float* output_data) {
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
@@ -2392,20 +2461,24 @@ inline void L2Pool(const float* input_data, const RuntimeShape& input_shape,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   for (int batch = 0; batch < batches; ++batch) {
     for (int out_y = 0; out_y < output_height; ++out_y) {
       for (int out_x = 0; out_x < output_width; ++out_x) {
         for (int channel = 0; channel < depth; ++channel) {
-          const int in_x_origin = (out_x * stride_width) - pad_width;
-          const int in_y_origin = (out_y * stride_height) - pad_height;
+          const int in_x_origin =
+              (out_x * stride_width) - params.padding_values.width;
+          const int in_y_origin =
+              (out_y * stride_height) - params.padding_values.height;
           // Compute the boundaries of the filter region clamped so as to
           // ensure that the filter window fits in the input array.
           const int filter_x_start = std::max(0, -in_x_origin);
           const int filter_x_end =
-              std::min(filter_width, input_width - in_x_origin);
+              std::min(params.filter_width, input_width - in_x_origin);
           const int filter_y_start = std::max(0, -in_y_origin);
           const int filter_y_end =
-              std::min(filter_height, input_height - in_y_origin);
+              std::min(params.filter_height, input_height - in_y_origin);
           float sum_squares = 0.f;
           int filter_count = 0;
           for (int filter_y = filter_y_start; filter_y < filter_y_end;
@@ -2422,19 +2495,18 @@ inline void L2Pool(const float* input_data, const RuntimeShape& input_shape,
           }
           const float l2pool_result = std::sqrt(sum_squares / filter_count);
           output_data[Offset(output_shape, batch, out_y, out_x, channel)] =
-              ActivationFunctionWithMinMax(l2pool_result, output_activation_min,
-                                           output_activation_max);
+              ActivationFunctionWithMinMax(l2pool_result,
+                                           params.float_activation_min,
+                                           params.float_activation_max);
         }
       }
     }
   }
 }
 
-inline void MaxPool(const float* input_data, const RuntimeShape& input_shape,
-                    int stride_width, int stride_height, int pad_width,
-                    int pad_height, int filter_width, int filter_height,
-                    float output_activation_min, float output_activation_max,
-                    float* output_data, const RuntimeShape& output_shape) {
+inline void MaxPool(const PoolParams& params, const RuntimeShape& input_shape,
+                    const float* input_data, const RuntimeShape& output_shape,
+                    float* output_data) {
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
@@ -2443,20 +2515,24 @@ inline void MaxPool(const float* input_data, const RuntimeShape& input_shape,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   for (int batch = 0; batch < batches; ++batch) {
     for (int out_y = 0; out_y < output_height; ++out_y) {
       for (int out_x = 0; out_x < output_width; ++out_x) {
         for (int channel = 0; channel < depth; ++channel) {
-          const int in_x_origin = (out_x * stride_width) - pad_width;
-          const int in_y_origin = (out_y * stride_height) - pad_height;
+          const int in_x_origin =
+              (out_x * stride_width) - params.padding_values.width;
+          const int in_y_origin =
+              (out_y * stride_height) - params.padding_values.height;
           // Compute the boundaries of the filter region clamped so as to
           // ensure that the filter window fits in the input array.
           const int filter_x_start = std::max(0, -in_x_origin);
           const int filter_x_end =
-              std::min(filter_width, input_width - in_x_origin);
+              std::min(params.filter_width, input_width - in_x_origin);
           const int filter_y_start = std::max(0, -in_y_origin);
           const int filter_y_end =
-              std::min(filter_height, input_height - in_y_origin);
+              std::min(params.filter_height, input_height - in_y_origin);
           float max = std::numeric_limits<float>::lowest();
           for (int filter_y = filter_y_start; filter_y < filter_y_end;
                ++filter_y) {
@@ -2470,22 +2546,21 @@ inline void MaxPool(const float* input_data, const RuntimeShape& input_shape,
             }
           }
           output_data[Offset(output_shape, batch, out_y, out_x, channel)] =
-              ActivationFunctionWithMinMax(max, output_activation_min,
-                                           output_activation_max);
+              ActivationFunctionWithMinMax(max, params.float_activation_min,
+                                           params.float_activation_max);
         }
       }
     }
   }
 }
 
-inline void MaxPool(const uint8* input_data, const RuntimeShape& input_shape,
-                    int stride_width, int stride_height, int pad_width,
-                    int pad_height, int filter_width, int filter_height,
-                    int32 output_activation_min, int32 output_activation_max,
-                    uint8* output_data, const RuntimeShape& output_shape) {
-  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
-  TFLITE_DCHECK_GE(output_activation_min, 0);
-  TFLITE_DCHECK_LE(output_activation_max, 255);
+inline void MaxPool(const PoolParams& params, const RuntimeShape& input_shape,
+                    const uint8* input_data, const RuntimeShape& output_shape,
+                    uint8* output_data) {
+  TFLITE_DCHECK_LE(params.quantized_activation_min,
+                   params.quantized_activation_max);
+  TFLITE_DCHECK_GE(params.quantized_activation_min, 0);
+  TFLITE_DCHECK_LE(params.quantized_activation_max, 255);
   TFLITE_DCHECK_EQ(input_shape.DimensionsCount(), 4);
   TFLITE_DCHECK_EQ(output_shape.DimensionsCount(), 4);
   const int batches = MatchingDim(input_shape, 0, output_shape, 0);
@@ -2494,20 +2569,24 @@ inline void MaxPool(const uint8* input_data, const RuntimeShape& input_shape,
   const int input_width = input_shape.Dims(2);
   const int output_height = output_shape.Dims(1);
   const int output_width = output_shape.Dims(2);
+  const int stride_height = params.stride_height;
+  const int stride_width = params.stride_width;
   for (int batch = 0; batch < batches; ++batch) {
     for (int out_y = 0; out_y < output_height; ++out_y) {
       for (int out_x = 0; out_x < output_width; ++out_x) {
         for (int channel = 0; channel < depth; ++channel) {
-          const int in_x_origin = (out_x * stride_width) - pad_width;
-          const int in_y_origin = (out_y * stride_height) - pad_height;
+          const int in_x_origin =
+              (out_x * stride_width) - params.padding_values.width;
+          const int in_y_origin =
+              (out_y * stride_height) - params.padding_values.height;
           // Compute the boundaries of the filter region clamped so as to
           // ensure that the filter window fits in the input array.
           const int filter_x_start = std::max(0, -in_x_origin);
           const int filter_x_end =
-              std::min(filter_width, input_width - in_x_origin);
+              std::min(params.filter_width, input_width - in_x_origin);
           const int filter_y_start = std::max(0, -in_y_origin);
           const int filter_y_end =
-              std::min(filter_height, input_height - in_y_origin);
+              std::min(params.filter_height, input_height - in_y_origin);
           uint8 max = 0;
           for (int filter_y = filter_y_start; filter_y < filter_y_end;
                ++filter_y) {
@@ -2520,8 +2599,8 @@ inline void MaxPool(const uint8* input_data, const RuntimeShape& input_shape,
                   input_data[Offset(input_shape, batch, in_y, in_x, channel)]);
             }
           }
-          max = std::max<uint8>(max, output_activation_min);
-          max = std::min<uint8>(max, output_activation_max);
+          max = std::max<uint8>(max, params.quantized_activation_min);
+          max = std::min<uint8>(max, params.quantized_activation_max);
           output_data[Offset(output_shape, batch, out_y, out_x, channel)] =
               static_cast<uint8>(max);
         }
@@ -3205,7 +3284,8 @@ inline void SpaceToBatchND(const T* input_data, const Dims<4>& input_dims,
                            const Dims<4>& block_shape_dims,
                            const int32* paddings_data,
                            const Dims<4>& paddings_dims, T* output_data,
-                           const Dims<4>& output_dims) {
+                           const Dims<4>& output_dims,
+                           const int32_t pad_value) {
   const int output_batch_size = ArraySize(output_dims, 3);
   const int output_height = ArraySize(output_dims, 2);
   const int output_width = ArraySize(output_dims, 1);
@@ -3230,7 +3310,7 @@ inline void SpaceToBatchND(const T* input_data, const Dims<4>& input_dims,
                 padding_top + input_height ||
             out_w * block_shape_width + shift_w < padding_left ||
             out_w * block_shape_width + shift_w >= padding_left + input_width) {
-          memset(out, 0, depth * sizeof(T));
+          memset(out, pad_value, depth * sizeof(T));
         } else {
           const T* in =
               input_data +
@@ -3246,6 +3326,17 @@ inline void SpaceToBatchND(const T* input_data, const Dims<4>& input_dims,
 }
 
 template <typename T>
+inline void SpaceToBatchND(const T* input_data, const Dims<4>& input_dims,
+                           const int32* block_shape_data,
+                           const Dims<4>& block_shape_dims,
+                           const int32* paddings_data,
+                           const Dims<4>& paddings_dims, T* output_data,
+                           const Dims<4>& output_dims) {
+  SpaceToBatchND(input_data, input_dims, block_shape_data, block_shape_dims,
+                 paddings_data, paddings_dims, output_data, output_dims, 0);
+}
+
+template <typename T>
 inline void BatchToSpaceND(const T* input_data, const Dims<4>& input_dims,
                            const int32* block_shape_data,
                            const Dims<4>& block_shape_dims,
@@ -3455,9 +3546,9 @@ inline bool Reduce(const In* input_data, const int* input_dims,
                    const int* output_dims, const int input_num_dims,
                    const int output_num_dims, const int* axis,
                    const int num_axis, int* input_iter,
-                   Out reducer(Out current, const In in), Out* output_data) {
+                   Out reducer(const Out current, const In in),
+                   Out* output_data) {
   // Reset input iterator.
-  TFLITE_DCHECK(input_num_dims > 0);
   for (int idx = 0; idx < input_num_dims; ++idx) {
     input_iter[idx] = 0;
   }
@@ -3473,11 +3564,16 @@ inline bool Reduce(const In* input_data, const int* input_dims,
   return true;
 }
 
-inline bool ResolveAxis(const int num_dims, const int* axis, const int num_axis,
-                        int* out_axis, int* out_num_axis) {
+inline bool ResolveAxis(const int num_dims, const int* axis,
+                        const int64_t num_axis, int* out_axis,
+                        int* out_num_axis) {
   *out_num_axis = 0;  // Just in case.
+  // Short-circuit axis resolution for scalars; the axis will go unused.
+  if (num_dims == 0) {
+    return true;
+  }
   // o(n^2) is fine since out_num_axis should be really small, mostly <= 4
-  for (int idx = 0; idx < num_axis; ++idx) {
+  for (int64_t idx = 0; idx < num_axis; ++idx) {
     // Handle negative index.
     int current = axis[idx] < 0 ? (axis[idx] + num_dims) : axis[idx];
     TFLITE_DCHECK(current >= 0 && current < num_dims);
@@ -3503,7 +3599,7 @@ inline bool ReduceSumImpl(const In* input_data, const int* input_dims,
                           const int output_num_dims, const int* axis,
                           const int num_axis, int* input_iter,
                           Out* output_data) {
-  auto reducer = [](Out current, const In in) -> Out {
+  auto reducer = [](const Out current, const In in) -> Out {
     const Out actual_in = static_cast<Out>(in);
     return current + actual_in;
   };
@@ -3512,6 +3608,24 @@ inline bool ReduceSumImpl(const In* input_data, const int* input_dims,
                          output_data);
 }
 
+template <typename T>
+inline bool InitTensorDataForReduce(const int* dims, const int num_dims,
+                                    const T init_value, T* data) {
+  size_t num_elements = 1;
+  for (int idx = 0; idx < num_dims; ++idx) {
+    size_t current = static_cast<size_t>(dims[idx]);
+    // Overflow prevention.
+    if (num_elements > std::numeric_limits<size_t>::max() / current) {
+      return false;
+    }
+    num_elements *= current;
+  }
+  for (size_t idx = 0; idx < num_elements; ++idx) {
+    data[idx] = init_value;
+  }
+  return true;
+}
+
 // Computes the sum of elements across dimensions given in axis.
 template <typename T>
 inline bool Sum(const T* input_data, const int* input_dims,
@@ -3520,17 +3634,9 @@ inline bool Sum(const T* input_data, const int* input_dims,
                 const int* axis, const int num_axis_dimensions, bool keep_dims,
                 int* temp_index, int* resolved_axis) {
   // Reset output data.
-  size_t num_outputs = 1;
-  for (int idx = 0; idx < output_num_dims; ++idx) {
-    size_t current = static_cast<size_t>(output_dims[idx]);
-    // Overflow prevention.
-    if (num_outputs > std::numeric_limits<size_t>::max() / current) {
-      return false;
-    }
-    num_outputs *= current;
-  }
-  for (size_t idx = 0; idx < num_outputs; ++idx) {
-    output_data[idx] = T();
+  if (!InitTensorDataForReduce(output_dims, output_num_dims, static_cast<T>(0),
+                               output_data)) {
+    return false;
   }
 
   // Resolve axis.
@@ -3545,6 +3651,61 @@ inline bool Sum(const T* input_data, const int* input_dims,
                              num_resolved_axis, temp_index, output_data);
 }
 
+// Computes the max of elements across dimensions given in axis.
+template <typename T>
+inline bool ReduceMax(const T* input_data, const int* input_dims,
+                      const int input_num_dims, T* output_data,
+                      const int* output_dims, const int output_num_dims,
+                      const int* axis, const int64_t num_axis_dimensions,
+                      bool keep_dims, int* temp_index, int* resolved_axis) {
+  T init_value = std::numeric_limits<T>::lowest();
+  // Reset output data.
+  if (!InitTensorDataForReduce(output_dims, output_num_dims, init_value,
+                               output_data)) {
+    return false;
+  }
+
+  // Resolve axis.
+  int num_resolved_axis = 0;
+  if (!ResolveAxis(input_num_dims, axis, num_axis_dimensions, resolved_axis,
+                   &num_resolved_axis)) {
+    return false;
+  }
+
+  auto reducer = [](const T current, const T in) -> T {
+    return (in > current) ? in : current;
+  };
+  return Reduce<T, T>(input_data, input_dims, output_dims, input_num_dims,
+                      output_num_dims, resolved_axis, num_resolved_axis,
+                      temp_index, reducer, output_data);
+}
+
+// Computes the prod of elements across dimensions given in axis.
+template <typename T>
+inline bool ReduceProd(const T* input_data, const int* input_dims,
+                       const int input_num_dims, T* output_data,
+                       const int* output_dims, const int output_num_dims,
+                       const int* axis, const int64_t num_axis_dimensions,
+                       bool keep_dims, int* temp_index, int* resolved_axis) {
+  // Reset output data.
+  if (!InitTensorDataForReduce(output_dims, output_num_dims, static_cast<T>(1),
+                               output_data)) {
+    return false;
+  }
+
+  // Resolve axis.
+  int num_resolved_axis = 0;
+  if (!ResolveAxis(input_num_dims, axis, num_axis_dimensions, resolved_axis,
+                   &num_resolved_axis)) {
+    return false;
+  }
+
+  auto reducer = [](const T current, const T in) -> T { return in * current; };
+  return Reduce<T, T>(input_data, input_dims, output_dims, input_num_dims,
+                      output_num_dims, resolved_axis, num_resolved_axis,
+                      temp_index, reducer, output_data);
+}
+
 // Computes the mean of elements across dimensions given in axis.
 // It does so in two stages, first calculates the sum of elements along the axis
 // then divides it by the number of element in axis.
@@ -3637,38 +3798,6 @@ inline void Mean(const T* input_data, const Dims<4>& input_dims,
 }
 
 template <typename T>
-void Sub(const T* input1_data, const Dims<4>& input1_dims, const T* input2_data,
-         const Dims<4>& input2_dims, T* output_data,
-         const Dims<4>& output_dims) {
-  NdArrayDesc<4> desc1;
-  NdArrayDesc<4> desc2;
-  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
-
-  // In Tensorflow, the dimensions are canonically named (batch_number, row,
-  // col, channel), with extents (batches, height, width, depth), with the
-  // trailing dimension changing most rapidly (channels has the smallest stride,
-  // typically 1 element).
-  //
-  // In generated C code, we store arrays with the dimensions reversed. The
-  // first dimension has smallest stride.
-  //
-  // We name our variables by their Tensorflow convention, but generate C code
-  // nesting loops such that the innermost loop has the smallest stride for the
-  // best cache behavior.
-  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
-    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
-      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
-        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
-          output_data[Offset(output_dims, c, x, y, b)] =
-              input1_data[SubscriptToIndex(desc1, c, x, y, b)] -
-              input2_data[SubscriptToIndex(desc2, c, x, y, b)];
-        }
-      }
-    }
-  }
-}
-
-template <typename T>
 void TensorFlowMinimum(const T* input1_data, const Dims<4>& input1_dims,
                        const T* input2_data, T* output_data,
                        const Dims<4>& output_dims) {
@@ -3717,9 +3846,9 @@ void TensorFlowMaximumMinimum(const T* input1_data, const Dims<4>& input1_dims,
   }
 }
 
-template <typename T1, typename T2, typename T3>
-void ArgMax(const T3* axis, const T1* input_data, const Dims<4>& input_dims,
-            T2* output_data, const Dims<4>& output_dims) {
+template <typename T1, typename T2, typename T3, typename Cmp>
+void ArgMinMax(const T3* axis, const T1* input_data, const Dims<4>& input_dims,
+               T2* output_data, const Dims<4>& output_dims, const Cmp& cmp) {
   // The current ArgMax implemention can only determine the index of the maximum
   // value in the last dimension. So the axis argument is ignored.
 
@@ -3732,19 +3861,28 @@ void ArgMax(const T3* axis, const T1* input_data, const Dims<4>& input_dims,
   const int depth = ArraySize(input_dims, 0);
 
   for (int i = 0; i < outer_size; ++i) {
-    auto max_value = input_data[i * depth];
-    int max_index = 0;
+    auto min_max_value = input_data[i * depth];
+    int min_max_index = 0;
     for (int d = 1; d < depth; ++d) {
       const auto& curr_value = input_data[i * depth + d];
-      if (curr_value > max_value) {
-        max_value = curr_value;
-        max_index = d;
+      if (cmp(curr_value, min_max_value)) {
+        min_max_value = curr_value;
+        min_max_index = d;
       }
     }
-    output_data[i] = max_index;
+    output_data[i] = min_max_index;
   }
 }
 
+// TODO(renjieliu): Remove this one.
+template <typename T1, typename T2, typename T3>
+void ArgMax(const T3* axis, const T1* input_data,
+            const tflite::Dims<4>& input_dims, T2* output_data,
+            const tflite::Dims<4>& output_dims) {
+  ArgMinMax(axis, input_data, input_dims, output_data, output_dims,
+            std::greater<T1>());
+}
+
 template <typename T>
 void Transpose(const T* input, const Dims<4>& input_dims, T* output,
                const Dims<4>& output_dims, const int* permuted_axes) {
@@ -4117,6 +4255,38 @@ inline void BroadcastPow(const T* input1_data, const Dims<4>& input1_dims,
   }
 }
 
+inline void Logical(const bool* input1_data, const Dims<4>& input1_dims,
+                    const bool* input2_data, const Dims<4>& input2_dims,
+                    bool* output_data, const Dims<4>& output_dims,
+                    const std::function<bool(bool, bool)>& func) {
+  const int flat_size = MatchingFlatSize(input1_dims, input2_dims, output_dims);
+  for (int i = 0; i < flat_size; ++i) {
+    output_data[i] = func(input1_data[i], input2_data[i]);
+  }
+}
+
+inline void BroadcastLogical(const bool* input1_data,
+                             const Dims<4>& input1_dims,
+                             const bool* input2_data,
+                             const Dims<4>& input2_dims, bool* output_data,
+                             const Dims<4>& output_dims,
+                             const std::function<bool(bool, bool)>& func) {
+  NdArrayDesc<4> desc1;
+  NdArrayDesc<4> desc2;
+  NdArrayDescsForElementwiseBroadcast(input1_dims, input2_dims, &desc1, &desc2);
+  for (int b = 0; b < ArraySize(output_dims, 3); ++b) {
+    for (int y = 0; y < ArraySize(output_dims, 2); ++y) {
+      for (int x = 0; x < ArraySize(output_dims, 1); ++x) {
+        for (int c = 0; c < ArraySize(output_dims, 0); ++c) {
+          output_data[Offset(output_dims, c, x, y, b)] =
+              func(input1_data[SubscriptToIndex(desc1, c, x, y, b)],
+                   input2_data[SubscriptToIndex(desc2, c, x, y, b)]);
+        }
+      }
+    }
+  }
+}
+
 }  // namespace reference_ops
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/internal/spectrogram.cc b/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
index 4eddf7bf0a..20abcb7258 100644
--- a/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
+++ b/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
@@ -43,13 +43,13 @@ bool Spectrogram::Initialize(int window_length, int step_length) {
   return Initialize(window, step_length);
 }
 
-inline int Log2Floor(uint n) {
+inline int Log2Floor(uint32_t n) {
   if (n == 0) return -1;
   int log = 0;
-  uint value = n;
+  uint32_t value = n;
   for (int i = 4; i >= 0; --i) {
     int shift = (1 << i);
-    uint x = value >> shift;
+    uint32_t x = value >> shift;
     if (x != 0) {
       value = x;
       log += shift;
@@ -58,7 +58,7 @@ inline int Log2Floor(uint n) {
   return log;
 }
 
-inline int Log2Ceiling(uint n) {
+inline int Log2Ceiling(uint32_t n) {
   int floor = Log2Floor(n);
   if (n == (n & ~(n - 1)))  // zero or a power of two
     return floor;
@@ -66,7 +66,7 @@ inline int Log2Ceiling(uint n) {
     return floor + 1;
 }
 
-inline uint NextPowerOfTwo(uint value) {
+inline uint32_t NextPowerOfTwo(uint32_t value) {
   int exponent = Log2Ceiling(value);
   // DCHECK_LT(exponent, std::numeric_limits<uint32>::digits);
   return 1 << exponent;
diff --git a/tensorflow/contrib/lite/kernels/internal/tensor_utils.h b/tensorflow/contrib/lite/kernels/internal/tensor_utils.h
index 5160e22307..1ff8cfe39c 100644
--- a/tensorflow/contrib/lite/kernels/internal/tensor_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/tensor_utils.h
@@ -17,6 +17,10 @@ limitations under the License.
 
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 
+#if defined(_MSC_VER)
+#define __restrict__ __restrict
+#endif
+
 namespace tflite {
 namespace tensor_utils {
 
@@ -31,8 +35,8 @@ bool IsZeroVector(const float* vector, int v_size);
 // It also outputs the range (min, max) of the floating point buffer, and the
 // scaling factor used to quantize the values.
 void SymmetricQuantizeFloats(const float* values, const int size,
-                             int8_t* quantized_values, float* min, float* max,
-                             float* scaling_factor);
+                             int8_t* quantized_values, float* min_value,
+                             float* max_value, float* scaling_factor);
 
 // Multiplies a matrix by a "batched" vector (i.e. a matrix with a batch
 // dimension composed by input vectors independent from each other). The result
@@ -124,6 +128,10 @@ void Sub1Vector(const float* vector, int v_size, float* result);
 // Fill vector with 0.f.
 void ZeroVector(float* vector, int v_size);
 
+// Multiply all elements of vector with a scalar.
+void VectorScalarMultiply(const int8_t* vector, int v_size, float scale,
+                          float* result);
+
 // Clip elements of a vector using a abs_limit value.
 void ClipVector(const float* vector, int v_size, float abs_limit,
                 float* result);
diff --git a/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc b/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc
index aa0d49ae4d..372a6efec5 100644
--- a/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc
+++ b/tensorflow/contrib/lite/kernels/internal/tensor_utils_test.cc
@@ -32,6 +32,22 @@ TEST(uKernels, ClipTest) {
                   {0.0, -0.5, 1.0, -1.5, 2.0, -2.0, 2.0, -2.0, 2.0, -2.0})));
 }
 
+TEST(uKernels, VectorScalarMultiply) {
+  constexpr int kVectorSize = 29;
+  static int8_t input[kVectorSize];
+  for (int i = 0; i < 29; ++i) {
+    input[i] = static_cast<int8_t>(i - 14);
+  }
+  const float scale = 0.1f;
+  std::vector<float> output(kVectorSize, 0.0f);
+  VectorScalarMultiply(input, kVectorSize, scale, output.data());
+  EXPECT_THAT(output,
+              ElementsAreArray(ArrayFloatNear(
+                  {-1.4, -1.3, -1.2, -1.1, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5,
+                   -0.4, -0.3, -0.2, -0.1, 0,    0.1,  0.2,  0.3,  0.4,  0.5,
+                   0.6,  0.7,  0.8,  0.9,  1.0,  1.1,  1.2,  1.3,  1.4})));
+}
+
 TEST(uKernels, IsZeroTest) {
   constexpr int kVectorSize = 21;
   static float zeros[kVectorSize] = {0.0};
diff --git a/tensorflow/contrib/lite/kernels/internal/types.h b/tensorflow/contrib/lite/kernels/internal/types.h
index fa2420713f..c44698b677 100644
--- a/tensorflow/contrib/lite/kernels/internal/types.h
+++ b/tensorflow/contrib/lite/kernels/internal/types.h
@@ -23,7 +23,12 @@ limitations under the License.
 namespace tflite {
 
 enum class FusedActivationFunctionType : uint8 { kNone, kRelu6, kRelu1, kRelu };
-enum class PaddingType { kNone, kSame, kValid };
+enum class PaddingType : uint8 { kNone, kSame, kValid };
+
+struct PaddingValues {
+  int8 width;
+  int8 height;
+};
 
 // This enumeration allows for non-default formats for the weights array
 // of a fully-connected operator, allowing the use of special optimized
@@ -114,6 +119,8 @@ class RuntimeShape {
   // larger shapes are separately allocated.
   static constexpr int kMaxSmallSize = 4;
 
+  RuntimeShape& operator=(RuntimeShape const&) = delete;
+
   RuntimeShape() : size_(0) {}
 
   explicit RuntimeShape(int dimensions_count) : size_(dimensions_count) {
@@ -130,6 +137,20 @@ class RuntimeShape {
     BuildFrom(init_list);
   }
 
+  // Avoid using this constructor.  We should be able to delete it when C++17
+  // rolls out.
+  RuntimeShape(RuntimeShape const& other) : size_(other.DimensionsCount()) {
+    if (size_ > kMaxSmallSize) {
+      dims_pointer_ = new int32[size_];
+    }
+    std::memcpy(DimsData(), other.DimsData(), sizeof(int32) * size_);
+  }
+
+  bool operator==(const RuntimeShape& comp) const {
+    return this->size_ == comp.size_ &&
+           std::memcmp(DimsData(), comp.DimsData(), size_ * sizeof(int32)) == 0;
+  }
+
   ~RuntimeShape() {
     if (size_ > kMaxSmallSize) {
       delete[] dims_pointer_;
@@ -186,6 +207,16 @@ class RuntimeShape {
     }
   }
 
+  // This will probably be factored out. Old code made substantial use of 4-D
+  // shapes, and so this function is used to extend smaller shapes. Note that
+  // (a) as Dims<4>-dependent code is eliminated, the reliance on this should be
+  // reduced, and (b) some kernels are stricly 4-D, but then the shapes of their
+  // inputs should already be 4-D, so this function should not be needed.
+  inline static RuntimeShape ExtendedShape(int new_shape_size,
+                                           const RuntimeShape& shape) {
+    return RuntimeShape(new_shape_size, shape, 1);
+  }
+
   inline void BuildFrom(const std::initializer_list<int> init_list) {
     BuildFrom<const std::initializer_list<int>>(init_list);
   }
@@ -203,7 +234,25 @@ class RuntimeShape {
     return buffer_size;
   }
 
+  bool operator!=(const RuntimeShape& comp) const { return !((*this) == comp); }
+
  private:
+  // For use only by ExtendFrom(), written to guarantee (return-value) copy
+  // elision in C++17.
+  // This creates a shape padded to the desired size with the specified value.
+  RuntimeShape(int new_shape_size, const RuntimeShape& shape, int pad_value)
+      : size_(0) {
+    TFLITE_CHECK_GE(new_shape_size, shape.DimensionsCount());
+    TFLITE_CHECK_LE(new_shape_size, kMaxSmallSize);
+    Resize(new_shape_size);
+    const int size_increase = new_shape_size - shape.DimensionsCount();
+    for (int i = 0; i < size_increase; ++i) {
+      SetDim(i, pad_value);
+    }
+    std::memcpy(DimsData() + size_increase, shape.DimsData(),
+                sizeof(int32) * shape.DimensionsCount());
+  }
+
   int32 size_;
   union {
     int32 dims_[kMaxSmallSize];
@@ -229,7 +278,9 @@ inline tflite::Dims<4> ToRuntimeDims(const tflite::RuntimeShape& array_shape) {
 
 // Gets next index to iterate through a multidimensional array.
 inline bool NextIndex(const int num_dims, const int* dims, int* current) {
-  TFLITE_DCHECK_GT(num_dims, 0);
+  if (num_dims == 0) {
+    return false;
+  }
   TFLITE_DCHECK(dims != nullptr);
   TFLITE_DCHECK(current != nullptr);
   int carry = 1;
@@ -256,7 +307,9 @@ inline bool NextIndex(const int num_dims, const int* dims, int* current) {
 inline size_t ReducedOutputOffset(const int num_dims, const int* dims,
                                   const int* index, const int num_axis,
                                   const int* axis) {
-  TFLITE_DCHECK_GT(num_dims, 0);
+  if (num_dims == 0) {
+    return 0;
+  }
   TFLITE_DCHECK(dims != nullptr);
   TFLITE_DCHECK(index != nullptr);
   size_t offset = 0;
@@ -359,6 +412,7 @@ inline int RequiredBufferSizeForDims(const Dims<4>& dims) {
 // arrays.
 inline int MatchingFlatSize(const RuntimeShape& shape,
                             const RuntimeShape& check_shape_0) {
+  TFLITE_DCHECK_EQ(shape.DimensionsCount(), check_shape_0.DimensionsCount());
   const int dims_count = shape.DimensionsCount();
   for (int i = 0; i < dims_count; ++i) {
     TFLITE_DCHECK_EQ(shape.Dims(i), check_shape_0.Dims(i));
@@ -369,6 +423,7 @@ inline int MatchingFlatSize(const RuntimeShape& shape,
 inline int MatchingFlatSize(const RuntimeShape& shape,
                             const RuntimeShape& check_shape_0,
                             const RuntimeShape& check_shape_1) {
+  TFLITE_DCHECK_EQ(shape.DimensionsCount(), check_shape_0.DimensionsCount());
   const int dims_count = shape.DimensionsCount();
   for (int i = 0; i < dims_count; ++i) {
     TFLITE_DCHECK_EQ(shape.Dims(i), check_shape_0.Dims(i));
@@ -380,6 +435,7 @@ inline int MatchingFlatSize(const RuntimeShape& shape,
                             const RuntimeShape& check_shape_0,
                             const RuntimeShape& check_shape_1,
                             const RuntimeShape& check_shape_2) {
+  TFLITE_DCHECK_EQ(shape.DimensionsCount(), check_shape_0.DimensionsCount());
   const int dims_count = shape.DimensionsCount();
   for (int i = 0; i < dims_count; ++i) {
     TFLITE_DCHECK_EQ(shape.Dims(i), check_shape_0.Dims(i));
@@ -392,6 +448,7 @@ inline int MatchingFlatSize(const RuntimeShape& shape,
                             const RuntimeShape& check_shape_1,
                             const RuntimeShape& check_shape_2,
                             const RuntimeShape& check_shape_3) {
+  TFLITE_DCHECK_EQ(shape.DimensionsCount(), check_shape_0.DimensionsCount());
   const int dims_count = shape.DimensionsCount();
   for (int i = 0; i < dims_count; ++i) {
     TFLITE_DCHECK_EQ(shape.Dims(i), check_shape_0.Dims(i));
@@ -588,6 +645,82 @@ void ComputeStrides(Dims<N>* dims) {
   }
 }
 
+struct PoolParams {
+  FusedActivationFunctionType activation;
+  PaddingType padding_type;
+  PaddingValues padding_values;
+  int stride_height;
+  int stride_width;
+  int filter_height;
+  int filter_width;
+  // uint8, etc, activation params.
+  int32 quantized_activation_min;
+  int32 quantized_activation_max;
+  // float activation params.
+  float float_activation_min;
+  float float_activation_max;
+};
+
+enum class BroadcastableOpCategory : uint8 {
+  kNone,
+  kNonBroadcast,               // Matching input shapes.
+  kFirstInputBroadcastsFast,   // Fivefold nested loops.
+  kSecondInputBroadcastsFast,  // Fivefold nested loops.
+  kGenericBroadcast,           // Fall-back.
+};
+
+// For Add, Sub, Mul ops.
+struct ArithmeticParams {
+  // Shape dependent / common to data / op types.
+  BroadcastableOpCategory broadcast_category;
+  // uint8 inference params.
+  int32 input1_offset;
+  int32 input2_offset;
+  int32 output_offset;
+  int32 output_multiplier;
+  int output_shift;
+  // Add / Sub, not Mul, uint8 inference params.
+  int left_shift;
+  int32 input1_multiplier;
+  int input1_shift;
+  int32 input2_multiplier;
+  int input2_shift;
+  // uint8, etc, activation params.
+  int32 quantized_activation_min;
+  int32 quantized_activation_max;
+  // float activation params.
+  float float_activation_min;
+  float float_activation_max;
+
+  // Processed output dimensions.
+  // Let input "a" be the one that broadcasts in the faster-changing dimension.
+  // Then, after coalescing, for shapes {a0, a1, a2, a3, a4} and
+  // {b0, b1, b2, b3, b4},
+  // broadcast_shape[4] = b0 = a0.
+  // broadcast_shape[3] = b1; a1 = 1.
+  // broadcast_shape[2] = b2 = a2.
+  // broadcast_shape[1] = a3; b3 = 1.
+  // broadcast_shape[0] = b4 = a4.
+  int broadcast_shape[5];
+};
+
+template <typename T>
+inline void SetActivationParams(T min, T max, ArithmeticParams* params);
+
+template <>
+inline void SetActivationParams(float min, float max,
+                                ArithmeticParams* params) {
+  params->float_activation_min = min;
+  params->float_activation_max = max;
+}
+
+template <>
+inline void SetActivationParams(int32 min, int32 max,
+                                ArithmeticParams* params) {
+  params->quantized_activation_min = min;
+  params->quantized_activation_max = max;
+}
+
 }  // namespace tflite
 
 #endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_TYPES_H_
diff --git a/tensorflow/contrib/lite/kernels/logical.cc b/tensorflow/contrib/lite/kernels/logical.cc
new file mode 100644
index 0000000000..3dc39bf79a
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/logical.cc
@@ -0,0 +1,121 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace logical {
+namespace {
+
+// Input/output tensor index.
+constexpr int kInputTensor1 = 0;
+constexpr int kInputTensor2 = 1;
+constexpr int kOutputTensor = 0;
+
+// Op data for logical op.
+struct OpData {
+  bool requires_broadcast;
+};
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* data = new OpData;
+  data->requires_broadcast = false;
+  return data;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<OpData*>(buffer);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 2);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  // Reinterprete the opaque data provided by user.
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
+
+  const TfLiteTensor* input1 = GetInput(context, node, kInputTensor1);
+  const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  TF_LITE_ENSURE_EQ(context, input1->type, input2->type);
+
+  const TfLiteType type = input1->type;
+  if (type != kTfLiteBool) {
+    context->ReportError(context, "Logical ops only support bool type.");
+    return kTfLiteError;
+  }
+  output->type = type;
+
+  data->requires_broadcast = !HaveSameShapes(input1, input2);
+
+  TfLiteIntArray* output_size = nullptr;
+  if (data->requires_broadcast) {
+    TF_LITE_ENSURE_OK(context, CalculateShapeForBroadcast(
+                                   context, input1, input2, &output_size));
+  } else {
+    output_size = TfLiteIntArrayCopy(input1->dims);
+  }
+
+  return context->ResizeTensor(context, output, output_size);
+}
+
+TfLiteStatus LogicalImpl(TfLiteContext* context, TfLiteNode* node,
+                         const std::function<bool(bool, bool)>& func) {
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
+
+  const TfLiteTensor* input1 = GetInput(context, node, kInputTensor1);
+  const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+
+  if (data->requires_broadcast) {
+    reference_ops::BroadcastLogical(
+        GetTensorData<bool>(input1), GetTensorDims(input1),
+        GetTensorData<bool>(input2), GetTensorDims(input2),
+        GetTensorData<bool>(output), GetTensorDims(output), func);
+  } else {
+    reference_ops::Logical(GetTensorData<bool>(input1), GetTensorDims(input1),
+                           GetTensorData<bool>(input2), GetTensorDims(input2),
+                           GetTensorData<bool>(output), GetTensorDims(output),
+                           func);
+  }
+
+  return kTfLiteOk;
+}
+
+TfLiteStatus LogicalOrEval(TfLiteContext* context, TfLiteNode* node) {
+  const auto logical_or_func = std::logical_or<bool>();
+  return LogicalImpl(context, node, logical_or_func);
+}
+
+}  // namespace
+}  // namespace logical
+
+TfLiteRegistration* Register_LOGICAL_OR() {
+  // Init, Free, Prepare, Eval are satisfying the Interface required by
+  // TfLiteRegistration.
+  static TfLiteRegistration r = {logical::Init, logical::Free, logical::Prepare,
+                                 logical::LogicalOrEval};
+  return &r;
+}
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/logical_test.cc b/tensorflow/contrib/lite/kernels/logical_test.cc
new file mode 100644
index 0000000000..382008245b
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/logical_test.cc
@@ -0,0 +1,87 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAre;
+
+class LogicalOpModel : public SingleOpModel {
+ public:
+  LogicalOpModel(std::initializer_list<int> input1_shape,
+                 std::initializer_list<int> input2_shape, BuiltinOperator op) {
+    input1_ = AddInput(TensorType_BOOL);
+    input2_ = AddInput(TensorType_BOOL);
+    output_ = AddOutput(TensorType_BOOL);
+    ConfigureBuiltinOp(op);
+    BuildInterpreter({input1_shape, input2_shape});
+  }
+
+  int input1() { return input1_; }
+  int input2() { return input2_; }
+
+  std::vector<bool> GetOutput() { return ExtractVector<bool>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int input1_;
+  int input2_;
+  int output_;
+
+  void ConfigureBuiltinOp(BuiltinOperator op) {
+    switch (op) {
+      case BuiltinOperator_LOGICAL_OR: {
+        SetBuiltinOp(op, BuiltinOptions_LogicalOrOptions,
+                     CreateLogicalOrOptions(builder_).Union());
+        break;
+      }
+      default: { FAIL() << "We shouldn't get here."; }
+    }
+  }
+};
+
+TEST(LogicalTest, LogicalOr) {
+  LogicalOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, BuiltinOperator_LOGICAL_OR);
+  model.PopulateTensor<bool>(model.input1(), {true, false, false, true});
+  model.PopulateTensor<bool>(model.input2(), {true, false, true, false});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, true, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
+
+TEST(LogicalTest, BroadcastLogicalOr) {
+  LogicalOpModel model({1, 1, 1, 4}, {1, 1, 1, 1}, BuiltinOperator_LOGICAL_OR);
+  model.PopulateTensor<bool>(model.input1(), {true, false, false, true});
+  model.PopulateTensor<bool>(model.input2(), {false});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput(), ElementsAre(true, false, false, true));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 1, 1, 4));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/lsh_projection.cc b/tensorflow/contrib/lite/kernels/lsh_projection.cc
index 25d2dc2cdd..69523b02cc 100644
--- a/tensorflow/contrib/lite/kernels/lsh_projection.cc
+++ b/tensorflow/contrib/lite/kernels/lsh_projection.cc
@@ -50,7 +50,6 @@ limitations under the License.
 //     Output.Dim == { Tensor[0].Dim[0] * Tensor[0].Dim[1] }
 //     A flattened tensor represents projected bit vectors.
 
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
diff --git a/tensorflow/contrib/lite/kernels/lstm.cc b/tensorflow/contrib/lite/kernels/lstm.cc
index 3577ae6caa..ba251c451e 100644
--- a/tensorflow/contrib/lite/kernels/lstm.cc
+++ b/tensorflow/contrib/lite/kernels/lstm.cc
@@ -13,7 +13,6 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -97,7 +96,7 @@ constexpr int kCellStateTensor = 1;
 constexpr int kOutputTensor = 2;
 
 void* Init(TfLiteContext* context, const char* buffer, size_t length) {
-  auto* op_data = new OpData;
+  auto* op_data = new OpData();
   op_data->kernel_type = kTfLiteLSTMFullKernel;
   context->AddTensors(context, /*tensors_to_add=*/7,
                       &op_data->scratch_tensor_index);
@@ -306,7 +305,8 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   const int n_output = recurrent_to_output_weights->dims->data[1];
 
   // Check that input tensor dimensions matches with each other.
-  CheckInputTensorDimensions(context, node, n_input, n_output, n_cell);
+  TF_LITE_ENSURE_OK(context, CheckInputTensorDimensions(context, node, n_input,
+                                                        n_output, n_cell));
 
   // Get the pointer to output, activation_state and cell_state tensors.
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
@@ -846,7 +846,7 @@ enum OutputTensor {
 };
 
 void* Init(TfLiteContext* context, const char* buffer, size_t length) {
-  auto* op_data = new OpData;
+  auto* op_data = new OpData();
   op_data->kernel_type = kTfLiteLSTMBasicKernel;
   // `scratch_tensor_index` is unused in this kernel.
   op_data->scratch_tensor_index = -1;
diff --git a/tensorflow/contrib/lite/kernels/lstm_test.cc b/tensorflow/contrib/lite/kernels/lstm_test.cc
index 0b7c56133e..0266f5fe57 100644
--- a/tensorflow/contrib/lite/kernels/lstm_test.cc
+++ b/tensorflow/contrib/lite/kernels/lstm_test.cc
@@ -13,6 +13,9 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 // Unit test for TFLite LSTM op.
+//
+// TODO(alanchiao): add unit test with invalid input dimensions for this and its
+// variants.
 
 #include <memory>
 #include <vector>
diff --git a/tensorflow/contrib/lite/kernels/mul.cc b/tensorflow/contrib/lite/kernels/mul.cc
index 1f72f3a3c7..349f3e6726 100644
--- a/tensorflow/contrib/lite/kernels/mul.cc
+++ b/tensorflow/contrib/lite/kernels/mul.cc
@@ -100,29 +100,44 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
 }
 
 template <KernelType kernel_type>
-void EvalFloat(TfLiteContext* context, TfLiteNode* node,
-               TfLiteMulParams* params, const OpData* data,
-               const TfLiteTensor* input1, const TfLiteTensor* input2,
-               TfLiteTensor* output) {
-  float output_activation_min, output_activation_max;
-  CalculateActivationRange(params->activation, &output_activation_min,
-                           &output_activation_max);
-#define TF_LITE_MUL(type, opname)                                   \
-  type::opname(GetTensorData<float>(input1), GetTensorDims(input1), \
-               GetTensorData<float>(input2), GetTensorDims(input2), \
-               output_activation_min, output_activation_max,        \
-               GetTensorData<float>(output), GetTensorDims(output))
-  if (kernel_type == kReference) {
-    if (data->requires_broadcast) {
-      TF_LITE_MUL(reference_ops, BroadcastMul);
+void EvalMul(TfLiteContext* context, TfLiteNode* node, TfLiteMulParams* params,
+             const OpData* data, const TfLiteTensor* input1,
+             const TfLiteTensor* input2, TfLiteTensor* output) {
+#define TF_LITE_MUL(type, opname, data_type)                            \
+  data_type output_activation_min, output_activation_max;               \
+  CalculateActivationRange(params->activation, &output_activation_min,  \
+                           &output_activation_max);                     \
+  type::opname(GetTensorData<data_type>(input1), GetTensorDims(input1), \
+               GetTensorData<data_type>(input2), GetTensorDims(input2), \
+               output_activation_min, output_activation_max,            \
+               GetTensorData<data_type>(output), GetTensorDims(output))
+  if (output->type == kTfLiteInt32) {
+    if (kernel_type == kReference) {
+      if (data->requires_broadcast) {
+        TF_LITE_MUL(reference_ops, BroadcastMul, int32_t);
+      } else {
+        TF_LITE_MUL(reference_ops, Mul, int32_t);
+      }
     } else {
-      TF_LITE_MUL(reference_ops, Mul);
+      if (data->requires_broadcast) {
+        TF_LITE_MUL(optimized_ops, BroadcastMul, int32_t);
+      } else {
+        TF_LITE_MUL(optimized_ops, Mul, int32_t);
+      }
     }
-  } else {
-    if (data->requires_broadcast) {
-      TF_LITE_MUL(optimized_ops, BroadcastMul);
+  } else if (output->type == kTfLiteFloat32) {
+    if (kernel_type == kReference) {
+      if (data->requires_broadcast) {
+        TF_LITE_MUL(reference_ops, BroadcastMul, float);
+      } else {
+        TF_LITE_MUL(reference_ops, Mul, float);
+      }
     } else {
-      TF_LITE_MUL(optimized_ops, Mul);
+      if (data->requires_broadcast) {
+        TF_LITE_MUL(optimized_ops, BroadcastMul, float);
+      } else {
+        TF_LITE_MUL(optimized_ops, Mul, float);
+      }
     }
   }
 #undef TF_LITE_MUL
@@ -194,17 +209,17 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
-  if (output->type == kTfLiteFloat32) {
-    EvalFloat<kernel_type>(context, node, params, data, input1, input2, output);
+  if (output->type == kTfLiteFloat32 || output->type == kTfLiteInt32) {
+    EvalMul<kernel_type>(context, node, params, data, input1, input2, output);
   } else if (output->type == kTfLiteUInt8 || output->type == kTfLiteInt16) {
     TF_LITE_ENSURE_OK(
         context, EvalQuantized<kernel_type>(context, node, params, data, input1,
                                             input2, output));
   } else {
-    context->ReportError(
-        context,
-        "Mul only supports FLOAT32 and quantized UINT8 and INT16 now, got %d.",
-        output->type);
+    context->ReportError(context,
+                         "Mul only supports FLOAT32, INT32 and quantized UINT8 "
+                         "and INT16 now, got %d.",
+                         output->type);
     return kTfLiteError;
   }
 
diff --git a/tensorflow/contrib/lite/kernels/mul_test.cc b/tensorflow/contrib/lite/kernels/mul_test.cc
index 43d56e50d2..2807550a6b 100644
--- a/tensorflow/contrib/lite/kernels/mul_test.cc
+++ b/tensorflow/contrib/lite/kernels/mul_test.cc
@@ -52,6 +52,13 @@ class FloatMulOpModel : public BaseMulOpModel {
   std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
 };
 
+class IntegerMulOpModel : public BaseMulOpModel {
+ public:
+  using BaseMulOpModel::BaseMulOpModel;
+
+  std::vector<int32_t> GetOutput() { return ExtractVector<int32_t>(output_); }
+};
+
 // For quantized Mul, the error shouldn't exceed (2*step + step^2).
 // The param min=-1.0 & max=1.0 is used in the following tests.
 // The tolerance value is ~0.0157.
@@ -133,6 +140,57 @@ TEST(FloatMulOpTest, WithBroadcast) {
   }
 }
 
+TEST(IntegerMulOpTest, NoActivation) {
+  IntegerMulOpModel m({TensorType_INT32, {1, 2, 2, 1}},
+                      {TensorType_INT32, {1, 2, 2, 1}}, {TensorType_INT32, {}},
+                      ActivationFunctionType_NONE);
+  m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8});
+  m.PopulateTensor<int32_t>(m.input2(), {1, 2, 3, 5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({-20, 4, 21, 40}));
+}
+
+TEST(IntegerMulOpTest, ActivationRELU_N1_TO_1) {
+  IntegerMulOpModel m({TensorType_INT32, {1, 2, 2, 1}},
+                      {TensorType_INT32, {1, 2, 2, 1}}, {TensorType_INT32, {}},
+                      ActivationFunctionType_RELU_N1_TO_1);
+  m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8});
+  m.PopulateTensor<int32_t>(m.input2(), {1, 2, 3, 5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({-1, 1, 1, 1}));
+}
+
+TEST(IntegerMulOpTest, VariousInputShapes) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    IntegerMulOpModel m({TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8, 11, 20});
+    m.PopulateTensor<int32_t>(m.input2(), {1, 2, 3, 5, 11, 1});
+    m.Invoke();
+    EXPECT_THAT(m.GetOutput(), ElementsAreArray({-20, 4, 21, 40, 121, 20}))
+        << "With shape number " << i;
+  }
+}
+
+TEST(IntegerMulOpTest, WithBroadcast) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    IntegerMulOpModel m({TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, {}},  // always a scalar
+                        {TensorType_INT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8, 11, 20});
+    m.PopulateTensor<int32_t>(m.input2(), {1});
+    m.Invoke();
+    EXPECT_THAT(m.GetOutput(),
+                ElementsAreArray(ArrayFloatNear({-20, 2, 7, 8, 11, 20})))
+        << "With shape number " << i;
+  }
+}
+
 TEST(QuantizedMulOpTest, NoActivation) {
   QuantizedMulOpModel m({TensorType_UINT8, {1, 2, 2, 1}, -1.0, 1.0},
                         {TensorType_UINT8, {1, 2, 2, 1}, -1.0, 1.0},
diff --git a/tensorflow/contrib/lite/kernels/one_hot.cc b/tensorflow/contrib/lite/kernels/one_hot.cc
new file mode 100644
index 0000000000..9ff3dca932
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/one_hot.cc
@@ -0,0 +1,199 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace one_hot {
+
+constexpr int kIndicesTensor = 0;
+constexpr int kDepthTensor = 1;
+constexpr int kOnValueTensor = 2;
+constexpr int kOffValueTensor = 3;
+constexpr int kOutputTensor = 0;
+
+// Convenience utility for destructuring a node into the appropriate tensors and
+// data for the op. Note that this destructuring is quite cheap, so we can avoid
+// allocating op-specific, persistent data on the heap.
+struct OneHotContext {
+  OneHotContext(TfLiteContext* context, TfLiteNode* node) {
+    indices = GetInput(context, node, kIndicesTensor);
+    depth = GetInput(context, node, kDepthTensor);
+    on_value = GetInput(context, node, kOnValueTensor);
+    off_value = GetInput(context, node, kOffValueTensor);
+    output = GetOutput(context, node, kOutputTensor);
+
+    const auto* params =
+        reinterpret_cast<TfLiteOneHotParams*>(node->builtin_data);
+    const int indices_dims = indices->dims->size;
+    axis = (params->axis == -1) ? indices_dims : params->axis;
+    output_dims = indices_dims + 1;
+    dtype = on_value->type;
+  }
+
+  const TfLiteTensor* indices;
+  const TfLiteTensor* depth;
+  const TfLiteTensor* on_value;
+  const TfLiteTensor* off_value;
+  TfLiteTensor* output;
+  int axis;
+  int output_dims;
+  TfLiteType dtype;
+};
+
+template <typename T, typename TI>
+void OneHotComputeImpl(const OneHotContext& op_context) {
+  // prefix_dim_size == # of elements before the axis
+  // depth == # of elements per axis
+  // suffix_dim_size == # of elements after the axis
+  int prefix_dim_size = 1;
+  for (int i = 0; i < op_context.axis; ++i) {
+    prefix_dim_size *= op_context.indices->dims->data[i];
+  }
+  const int suffix_dim_size = NumElements(op_context.indices) / prefix_dim_size;
+  const int depth = *op_context.depth->data.i32;
+
+  const T on_value = *GetTensorData<T>(op_context.on_value);
+  const T off_value = *GetTensorData<T>(op_context.off_value);
+
+  // View the indices as a matrix of size:
+  //     prefix_dim_size x suffix_dim_size
+  // View the output as a matrix of size:
+  //     prefix_dim_size x depth x suffix_dim_size
+  // Then the output is:
+  //     output(i, j, k) == (indices(i, k) == j) ? on : off
+  T* output = GetTensorData<T>(op_context.output);
+  const TI* indices = GetTensorData<TI>(op_context.indices);
+  for (int i = 0; i < prefix_dim_size; ++i) {
+    for (int j = 0; j < depth; ++j) {
+      for (int k = 0; k < suffix_dim_size; ++k, ++output) {
+        *output = static_cast<int>(indices[i * suffix_dim_size + k]) == j
+                      ? on_value
+                      : off_value;
+      }
+    }
+  }
+}
+
+template <typename T>
+void OneHotCompute(const OneHotContext& op_context) {
+  if (op_context.indices->type == kTfLiteInt64) {
+    OneHotComputeImpl<T, int64_t>(op_context);
+  } else {
+    OneHotComputeImpl<T, int>(op_context);
+  }
+}
+
+TfLiteStatus ResizeOutputTensor(TfLiteContext* context,
+                                const OneHotContext& op_context) {
+  TF_LITE_ENSURE(context, *op_context.depth->data.i32 >= 0);
+  TfLiteIntArray* output_size = TfLiteIntArrayCreate(op_context.output_dims);
+  for (int i = 0; i < op_context.output_dims; ++i) {
+    if (i < op_context.axis) {
+      output_size->data[i] = op_context.indices->dims->data[i];
+    } else if (i == op_context.axis) {
+      output_size->data[i] = *op_context.depth->data.i32;
+    } else {
+      output_size->data[i] = op_context.indices->dims->data[i - 1];
+    }
+  }
+  return context->ResizeTensor(context, op_context.output, output_size);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 4);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  OneHotContext op_context{context, node};
+  switch (op_context.dtype) {
+    // TODO(b/111744875): Support uint8 and quantization.
+    case kTfLiteFloat32:
+    case kTfLiteInt16:
+    case kTfLiteInt32:
+    case kTfLiteInt64:
+    case kTfLiteBool:
+      op_context.output->type = op_context.dtype;
+      break;
+    default:
+      context->ReportError(context, "Unknown output data type: %d",
+                           op_context.dtype);
+      return kTfLiteError;
+  }
+
+  TF_LITE_ENSURE(context, op_context.indices->type == kTfLiteInt32 ||
+                              op_context.indices->type == kTfLiteInt64);
+  TF_LITE_ENSURE(context, op_context.axis >= 0 &&
+                              op_context.axis < op_context.output_dims);
+  TF_LITE_ENSURE_EQ(context, NumElements(op_context.depth), 1);
+  TF_LITE_ENSURE_EQ(context, NumElements(op_context.on_value), 1);
+  TF_LITE_ENSURE_EQ(context, NumElements(op_context.off_value), 1);
+  TF_LITE_ENSURE_EQ(context, op_context.on_value->type, op_context.dtype);
+  TF_LITE_ENSURE_EQ(context, op_context.off_value->type, op_context.dtype);
+
+  if (!IsConstantTensor(op_context.depth)) {
+    SetTensorToDynamic(op_context.output);
+    return kTfLiteOk;
+  }
+
+  return ResizeOutputTensor(context, op_context);
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  OneHotContext op_context{context, node};
+
+  if (IsDynamicTensor(op_context.output)) {
+    ResizeOutputTensor(context, op_context);
+  }
+
+  switch (op_context.output->type) {
+    case kTfLiteFloat32:
+      OneHotCompute<float>(op_context);
+      break;
+    case kTfLiteInt32:
+      OneHotCompute<int>(op_context);
+      break;
+    case kTfLiteInt64:
+      OneHotCompute<int64_t>(op_context);
+      break;
+    case kTfLiteBool:
+      OneHotCompute<bool>(op_context);
+      break;
+    default:
+      return kTfLiteError;
+  }
+
+  return kTfLiteOk;
+}
+
+}  // namespace one_hot
+
+TfLiteRegistration* Register_ONE_HOT() {
+  static TfLiteRegistration r = {
+      nullptr,
+      nullptr,
+      one_hot::Prepare,
+      one_hot::Eval,
+  };
+  return &r;
+}
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/one_hot_test.cc b/tensorflow/contrib/lite/kernels/one_hot_test.cc
new file mode 100644
index 0000000000..6b604ec7a7
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/one_hot_test.cc
@@ -0,0 +1,182 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <initializer_list>
+
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+template <typename T>
+class OneHotOpModel : public SingleOpModel {
+ public:
+  OneHotOpModel(std::initializer_list<int> input_shape, int depth_value,
+                TensorType dtype, int axis = -1, T on_value = 1,
+                T off_value = 0, TensorType indices_type = TensorType_INT32) {
+    indices_ = AddInput(indices_type);
+    int depth = AddInput(TensorType_INT32);
+    int on = AddInput(dtype);
+    int off = AddInput(dtype);
+    output_ = AddOutput(dtype);
+    SetBuiltinOp(BuiltinOperator_ONE_HOT, BuiltinOptions_OneHotOptions,
+                 CreateOneHotOptions(builder_, axis).Union());
+    BuildInterpreter({input_shape});
+
+    PopulateTensor<int>(depth, {depth_value});
+    PopulateTensor<T>(on, {on_value});
+    PopulateTensor<T>(off, {off_value});
+  }
+
+  template <typename TI>
+  void SetIndices(std::initializer_list<TI> data) {
+    PopulateTensor<TI>(indices_, data);
+  }
+
+  TfLiteStatus InvokeWithResult() { return interpreter_->Invoke(); }
+
+  int32_t GetOutputSize() { return GetTensorSize(output_); }
+  std::vector<T> GetOutput() { return ExtractVector<T>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int indices_;
+  int output_;
+};
+
+TEST(OneHotOpTest, BasicFloat) {
+  const int depth = 3;
+  OneHotOpModel<float> model({3}, depth, TensorType_FLOAT32);
+  model.SetIndices({0, 1, 2});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({3, 3}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({1.f, 0.f, 0.f, 0.f, 1.f, 0.f, 0.f, 0.f, 1.f}));
+}
+
+TEST(OneHotOpTest, BasicInt) {
+  const int depth = 3;
+  OneHotOpModel<int> model({3}, depth, TensorType_INT32);
+  model.SetIndices({0, 1, 2});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({3, 3}));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 0, 0, 0, 1, 0, 0, 0, 1}));
+}
+
+TEST(OneHotOpTest, BasicBool) {
+  const int depth = 3;
+  OneHotOpModel<bool> model({3}, depth, TensorType_BOOL);
+  model.SetIndices({0, 1, 2});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({3, 3}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({true, false, false, false, true, false, false,
+                                false, true}));
+}
+
+TEST(OneHotOpTest, SmallDepth) {
+  const int depth = 1;
+  OneHotOpModel<int> model({3}, depth, TensorType_INT32);
+  model.SetIndices({0, 1, 2});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({3, 1}));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 0, 0}));
+}
+
+TEST(OneHotOpTest, BigDepth) {
+  const int depth = 4;
+  OneHotOpModel<int> model({2}, depth, TensorType_INT32);
+  model.SetIndices({0, 1});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({2, 4}));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 0, 0, 0, 0, 1, 0, 0}));
+}
+
+TEST(OneHotOpTest, OnOffValues) {
+  const int depth = 3;
+  const int axis = -1;
+  const int on = 5;
+  const int off = 0;
+  OneHotOpModel<int> model({4}, depth, TensorType_INT32, axis, on, off);
+  model.SetIndices({0, 2, -1, 1});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({4, 3}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0}));
+}
+
+TEST(OneHotOpTest, ZeroAxis) {
+  const int depth = 3;
+  const int axis = 0;
+  const int on = 5;
+  const int off = 0;
+  OneHotOpModel<int> model({4}, depth, TensorType_INT32, axis, on, off);
+  model.SetIndices({0, 2, -1, 1});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({3, 4}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({5, 0, 0, 0, 0, 0, 0, 5, 0, 5, 0, 0}));
+}
+
+TEST(OneHotOpTest, MultiDimensionalIndices) {
+  const int depth = 3;
+  const int axis = -1;
+  const float on = 2;
+  const float off = 0;
+  OneHotOpModel<float> model({2, 2}, depth, TensorType_FLOAT32, axis, on, off);
+  model.SetIndices({0, 2, 1, -1});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({2, 2, 3}));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0}));
+}
+
+TEST(OneHotOpTest, Int64Indices) {
+  const int depth = 3;
+  const int axis = -1;
+  const int on = 1;
+  const int off = 0;
+  OneHotOpModel<int> model({3}, depth, TensorType_INT32, axis, on, off,
+                           TensorType_INT64);
+  std::initializer_list<int64_t> indices = {0, 1, 2};
+  model.SetIndices(indices);
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({3, 3}));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 0, 0, 0, 1, 0, 0, 0, 1}));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/pack.cc b/tensorflow/contrib/lite/kernels/pack.cc
new file mode 100644
index 0000000000..bb3416f6a6
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/pack.cc
@@ -0,0 +1,131 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace pack {
+namespace {
+
+constexpr int kOutputTensor = 0;
+
+// Op data for pack op.
+struct OpData {
+  int values_count;
+  int axis;
+};
+
+void* Init(TfLiteContext* context, const char* buffer, size_t length) {
+  auto* data = new OpData;
+  data->axis = 0;
+  return data;
+}
+
+void Free(TfLiteContext* context, void* buffer) {
+  delete reinterpret_cast<OpData*>(buffer);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  const OpData* data = reinterpret_cast<OpData*>(node->builtin_data);
+
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), data->values_count);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  const TfLiteTensor* input0 = GetInput(context, node, 0);
+  TF_LITE_ENSURE(context, NumDimensions(input0) < 4);
+  TF_LITE_ENSURE(context, NumDimensions(input0) >= data->axis);
+  // TODO(renjieliu): Support negative axis.
+  TF_LITE_ENSURE(context, data->axis >= 0);
+  if (input0->type != kTfLiteInt32 && input0->type != kTfLiteFloat32) {
+    context->ReportError(context,
+                         "Currently pack only supports int32 and float32.");
+    return kTfLiteError;
+  }
+  // Make sure all inputs have the same shape and type.
+  for (int i = 1; i < data->values_count; ++i) {
+    const TfLiteTensor* input = GetInput(context, node, i);
+    TF_LITE_ENSURE(context, HaveSameShapes(input0, input));
+    TF_LITE_ENSURE_EQ(context, input0->type, input->type);
+  }
+
+  // Resize output. rank R will become rank R + 1
+  const int dimension_size = NumDimensions(input0) + 1;
+  const TfLiteIntArray* input_shape = input0->dims;
+  TfLiteIntArray* output_shape = TfLiteIntArrayCreate(dimension_size);
+  int i = 0;
+  for (int index = 0; index < dimension_size; ++index) {
+    if (index == data->axis) {
+      output_shape->data[index] = data->values_count;
+    } else {
+      output_shape->data[index] = input_shape->data[i++];
+    }
+  }
+
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  TF_LITE_ENSURE_EQ(context, output->type, input0->type);
+
+  return context->ResizeTensor(context, output, output_shape);
+}
+
+template <typename T>
+void PackImpl(TfLiteContext* context, TfLiteNode* node, TfLiteTensor* output,
+              int values_count, int axis) {
+  VectorOfTensors<T> all_inputs(*context, *node->inputs);
+  reference_ops::Pack<T>(RemapDim(NumDimensions(output), axis),
+                         all_inputs.data(), all_inputs.dims(), values_count,
+                         GetTensorData<T>(output), GetTensorDims(output));
+}
+
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  const OpData* data = reinterpret_cast<OpData*>(node->builtin_data);
+
+  TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+  switch (output->type) {
+    case kTfLiteFloat32: {
+      PackImpl<float>(context, node, output, data->values_count, data->axis);
+      break;
+    }
+    case kTfLiteInt32: {
+      PackImpl<int32_t>(context, node, output, data->values_count, data->axis);
+      break;
+    }
+    default: {
+      context->ReportError(context,
+                           "Currently pack only supports int32 and float32.");
+      return kTfLiteError;
+    }
+  }
+
+  return kTfLiteOk;
+}
+
+}  // namespace
+}  // namespace pack
+
+TfLiteRegistration* Register_PACK() {
+  static TfLiteRegistration r = {pack::Init, pack::Free, pack::Prepare,
+                                 pack::Eval};
+  return &r;
+}
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/pack_test.cc b/tensorflow/contrib/lite/kernels/pack_test.cc
new file mode 100644
index 0000000000..485a50ad3a
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/pack_test.cc
@@ -0,0 +1,120 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+
+template <typename T>
+class PackOpModel : public SingleOpModel {
+ public:
+  PackOpModel(const TensorData& input_template, int axis, int values_count) {
+    std::vector<std::vector<int>> all_input_shapes;
+    for (int i = 0; i < values_count; ++i) {
+      all_input_shapes.push_back(input_template.shape);
+      AddInput(input_template);
+    }
+    output_ = AddOutput({input_template.type, /*shape=*/{}, input_template.min,
+                         input_template.max});
+    SetBuiltinOp(BuiltinOperator_PACK, BuiltinOptions_PackOptions,
+                 CreatePackOptions(builder_, values_count, axis).Union());
+    BuildInterpreter(all_input_shapes);
+  }
+
+  void SetInput(int index, std::initializer_list<T> data) {
+    PopulateTensor(index, data);
+  }
+
+  std::vector<T> GetOutput() { return ExtractVector<T>(output_); }
+  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
+
+ private:
+  int output_;
+};
+
+TEST(PackOpTest, FloatThreeInputs) {
+  PackOpModel<float> model({TensorType_FLOAT32, {2}}, 0, 3);
+  model.SetInput(0, {1, 4});
+  model.SetInput(1, {2, 5});
+  model.SetInput(2, {3, 6});
+  model.Invoke();
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(3, 2));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 4, 2, 5, 3, 6}));
+}
+
+TEST(PackOpTest, FloatThreeInputsDifferentAxis) {
+  PackOpModel<float> model({TensorType_FLOAT32, {2}}, 1, 3);
+  model.SetInput(0, {1, 4});
+  model.SetInput(1, {2, 5});
+  model.SetInput(2, {3, 6});
+  model.Invoke();
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(2, 3));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 2, 3, 4, 5, 6}));
+}
+
+TEST(PackOpTest, FloatMultilDimensions) {
+  PackOpModel<float> model({TensorType_FLOAT32, {2, 3}}, 1, 2);
+  model.SetInput(0, {1, 2, 3, 4, 5, 6});
+  model.SetInput(1, {7, 8, 9, 10, 11, 12});
+  model.Invoke();
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(2, 2, 3));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({1, 2, 3, 7, 8, 9, 4, 5, 6, 10, 11, 12}));
+}
+
+TEST(PackOpTest, IntThreeInputs) {
+  PackOpModel<int32_t> model({TensorType_INT32, {2}}, 0, 3);
+  model.SetInput(0, {1, 4});
+  model.SetInput(1, {2, 5});
+  model.SetInput(2, {3, 6});
+  model.Invoke();
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(3, 2));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 4, 2, 5, 3, 6}));
+}
+
+TEST(PackOpTest, IntThreeInputsDifferentAxis) {
+  PackOpModel<int32_t> model({TensorType_INT32, {2}}, 1, 3);
+  model.SetInput(0, {1, 4});
+  model.SetInput(1, {2, 5});
+  model.SetInput(2, {3, 6});
+  model.Invoke();
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(2, 3));
+  EXPECT_THAT(model.GetOutput(), ElementsAreArray({1, 2, 3, 4, 5, 6}));
+}
+
+TEST(PackOpTest, IntMultilDimensions) {
+  PackOpModel<int32_t> model({TensorType_INT32, {2, 3}}, 1, 2);
+  model.SetInput(0, {1, 2, 3, 4, 5, 6});
+  model.SetInput(1, {7, 8, 9, 10, 11, 12});
+  model.Invoke();
+  EXPECT_THAT(model.GetOutputShape(), ElementsAre(2, 2, 3));
+  EXPECT_THAT(model.GetOutput(),
+              ElementsAreArray({1, 2, 3, 7, 8, 9, 4, 5, 6, 10, 11, 12}));
+}
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/pooling.cc b/tensorflow/contrib/lite/kernels/pooling.cc
index 7240fe04cc..29a5be0683 100644
--- a/tensorflow/contrib/lite/kernels/pooling.cc
+++ b/tensorflow/contrib/lite/kernels/pooling.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -126,13 +125,19 @@ void AverageEvalFloat(TfLiteContext* context, TfLiteNode* node,
   float activation_min, activation_max;
   CalculateActivationRange(params->activation, &activation_min,
                            &activation_max);
-#define TF_LITE_AVERAGE_POOL(type)                                      \
-  type::AveragePool(GetTensorData<float>(input), GetTensorShape(input), \
-                    params->stride_width, params->stride_height,        \
-                    data->padding.width, data->padding.height,          \
-                    params->filter_width, params->filter_height,        \
-                    activation_min, activation_max,                     \
-                    GetTensorData<float>(output), GetTensorShape(output))
+#define TF_LITE_AVERAGE_POOL(type)                                       \
+  tflite::PoolParams op_params;                                          \
+  op_params.stride_height = params->stride_height;                       \
+  op_params.stride_width = params->stride_width;                         \
+  op_params.filter_height = params->filter_height;                       \
+  op_params.filter_width = params->filter_width;                         \
+  op_params.padding_values.height = data->padding.height;                \
+  op_params.padding_values.width = data->padding.width;                  \
+  op_params.float_activation_min = activation_min;                       \
+  op_params.float_activation_max = activation_max;                       \
+  type::AveragePool(op_params, GetTensorShape(input),                    \
+                    GetTensorData<float>(input), GetTensorShape(output), \
+                    GetTensorData<float>(output))
   if (kernel_type == kReference) {
     TF_LITE_AVERAGE_POOL(reference_ops);
   } else {
@@ -149,13 +154,19 @@ void AverageEvalQuantized(TfLiteContext* context, TfLiteNode* node,
   int32_t activation_max;
   CalculateActivationRangeUint8(params->activation, output, &activation_min,
                                 &activation_max);
-#define TF_LITE_AVERAGE_POOL(type)                                        \
-  type::AveragePool(GetTensorData<uint8_t>(input), GetTensorShape(input), \
-                    params->stride_width, params->stride_height,          \
-                    data->padding.width, data->padding.height,            \
-                    params->filter_width, params->filter_height,          \
-                    activation_min, activation_max,                       \
-                    GetTensorData<uint8_t>(output), GetTensorShape(output))
+#define TF_LITE_AVERAGE_POOL(type)                                         \
+  tflite::PoolParams op_params;                                            \
+  op_params.stride_height = params->stride_height;                         \
+  op_params.stride_width = params->stride_width;                           \
+  op_params.filter_height = params->filter_height;                         \
+  op_params.filter_width = params->filter_width;                           \
+  op_params.padding_values.height = data->padding.height;                  \
+  op_params.padding_values.width = data->padding.width;                    \
+  op_params.quantized_activation_min = activation_min;                     \
+  op_params.quantized_activation_max = activation_max;                     \
+  type::AveragePool(op_params, GetTensorShape(input),                      \
+                    GetTensorData<uint8_t>(input), GetTensorShape(output), \
+                    GetTensorData<uint8_t>(output))
   if (kernel_type == kReference) {
     TF_LITE_AVERAGE_POOL(reference_ops);
   } else {
@@ -171,13 +182,18 @@ void MaxEvalFloat(TfLiteContext* context, TfLiteNode* node,
   float activation_min, activation_max;
   CalculateActivationRange(params->activation, &activation_min,
                            &activation_max);
-#define TF_LITE_MAX_POOL(type)                                               \
-  type::MaxPool(GetTensorData<float>(input), GetTensorShape(input),          \
-                params->stride_width, params->stride_height,                 \
-                data->padding.width, data->padding.height,                   \
-                params->filter_width, params->filter_height, activation_min, \
-                activation_max, GetTensorData<float>(output),                \
-                GetTensorShape(output))
+#define TF_LITE_MAX_POOL(type)                                                 \
+  tflite::PoolParams op_params;                                                \
+  op_params.stride_height = params->stride_height;                             \
+  op_params.stride_width = params->stride_width;                               \
+  op_params.filter_height = params->filter_height;                             \
+  op_params.filter_width = params->filter_width;                               \
+  op_params.padding_values.height = data->padding.height;                      \
+  op_params.padding_values.width = data->padding.width;                        \
+  op_params.float_activation_min = activation_min;                             \
+  op_params.float_activation_max = activation_max;                             \
+  type::MaxPool(op_params, GetTensorShape(input), GetTensorData<float>(input), \
+                GetTensorShape(output), GetTensorData<float>(output))
   if (kernel_type == kReference) {
     TF_LITE_MAX_POOL(reference_ops);
   } else {
@@ -194,13 +210,19 @@ void MaxEvalQuantized(TfLiteContext* context, TfLiteNode* node,
   int32_t activation_max;
   CalculateActivationRangeUint8(params->activation, output, &activation_min,
                                 &activation_max);
-#define TF_LITE_MAX_POOL(type)                                               \
-  type::MaxPool(GetTensorData<uint8_t>(input), GetTensorShape(input),        \
-                params->stride_width, params->stride_height,                 \
-                data->padding.width, data->padding.height,                   \
-                params->filter_width, params->filter_height, activation_min, \
-                activation_max, GetTensorData<uint8_t>(output),              \
-                GetTensorShape(output))
+#define TF_LITE_MAX_POOL(type)                                         \
+  tflite::PoolParams op_params;                                        \
+  op_params.stride_height = params->stride_height;                     \
+  op_params.stride_width = params->stride_width;                       \
+  op_params.filter_height = params->filter_height;                     \
+  op_params.filter_width = params->filter_width;                       \
+  op_params.padding_values.height = data->padding.height;              \
+  op_params.padding_values.width = data->padding.width;                \
+  op_params.quantized_activation_min = activation_min;                 \
+  op_params.quantized_activation_max = activation_max;                 \
+  type::MaxPool(op_params, GetTensorShape(input),                      \
+                GetTensorData<uint8_t>(input), GetTensorShape(output), \
+                GetTensorData<uint8_t>(output))
   if (kernel_type == kReference) {
     TF_LITE_MAX_POOL(reference_ops);
   } else {
@@ -216,13 +238,18 @@ void L2EvalFloat(TfLiteContext* context, TfLiteNode* node,
   float activation_min, activation_max;
   CalculateActivationRange(params->activation, &activation_min,
                            &activation_max);
-#define TF_LITE_L2_POOL(type)                                               \
-  type::L2Pool(GetTensorData<float>(input), GetTensorShape(input),          \
-               params->stride_width, params->stride_height,                 \
-               data->padding.width, data->padding.height,                   \
-               params->filter_width, params->filter_height, activation_min, \
-               activation_max, GetTensorData<float>(output),                \
-               GetTensorShape(output))
+#define TF_LITE_L2_POOL(type)                                                 \
+  tflite::PoolParams op_params;                                               \
+  op_params.stride_height = params->stride_height;                            \
+  op_params.stride_width = params->stride_width;                              \
+  op_params.filter_height = params->filter_height;                            \
+  op_params.filter_width = params->filter_width;                              \
+  op_params.padding_values.height = data->padding.height;                     \
+  op_params.padding_values.width = data->padding.width;                       \
+  op_params.float_activation_min = activation_min;                            \
+  op_params.float_activation_max = activation_max;                            \
+  type::L2Pool(op_params, GetTensorShape(input), GetTensorData<float>(input), \
+               GetTensorShape(output), GetTensorData<float>(output))
   if (kernel_type == kReference) {
     TF_LITE_L2_POOL(reference_ops);
   } else {
diff --git a/tensorflow/contrib/lite/kernels/pow_test.cc b/tensorflow/contrib/lite/kernels/pow_test.cc
index 474d323bc3..74b3aef5bd 100644
--- a/tensorflow/contrib/lite/kernels/pow_test.cc
+++ b/tensorflow/contrib/lite/kernels/pow_test.cc
@@ -50,22 +50,22 @@ class PowOpModel : public SingleOpModel {
 };
 
 TEST(PowOpModel, Simple) {
-  PowOpModel<int32> model({TensorType_INT32, {1, 2, 2, 1}},
-                          {TensorType_INT32, {1, 2, 2, 1}},
-                          {TensorType_INT32, {}});
-  model.PopulateTensor<int32>(model.input1(), {12, 2, 7, 8});
-  model.PopulateTensor<int32>(model.input2(), {1, 2, 3, 1});
+  PowOpModel<int32_t> model({TensorType_INT32, {1, 2, 2, 1}},
+                            {TensorType_INT32, {1, 2, 2, 1}},
+                            {TensorType_INT32, {}});
+  model.PopulateTensor<int32_t>(model.input1(), {12, 2, 7, 8});
+  model.PopulateTensor<int32_t>(model.input2(), {1, 2, 3, 1});
   model.Invoke();
   EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 2, 2, 1));
   EXPECT_THAT(model.GetOutput(), ElementsAre(12, 4, 343, 8));
 }
 
 TEST(PowOpModel, NegativeAndZeroValue) {
-  PowOpModel<int32> model({TensorType_INT32, {1, 2, 2, 1}},
-                          {TensorType_INT32, {1, 2, 2, 1}},
-                          {TensorType_INT32, {}});
-  model.PopulateTensor<int32>(model.input1(), {0, 2, -7, 8});
-  model.PopulateTensor<int32>(model.input2(), {1, 2, 3, 0});
+  PowOpModel<int32_t> model({TensorType_INT32, {1, 2, 2, 1}},
+                            {TensorType_INT32, {1, 2, 2, 1}},
+                            {TensorType_INT32, {}});
+  model.PopulateTensor<int32_t>(model.input1(), {0, 2, -7, 8});
+  model.PopulateTensor<int32_t>(model.input2(), {1, 2, 3, 0});
   model.Invoke();
   EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 2, 2, 1));
   EXPECT_THAT(model.GetOutput(), ElementsAre(0, 4, -343, 1));
@@ -98,10 +98,10 @@ TEST(PowOpModel, NegativeFloatTest) {
 }
 
 TEST(PowOpModel, BroadcastTest) {
-  PowOpModel<int32> model({TensorType_INT32, {1, 2, 2, 1}},
-                          {TensorType_INT32, {1}}, {TensorType_INT32, {}});
-  model.PopulateTensor<int32>(model.input1(), {12, 2, 7, 8});
-  model.PopulateTensor<int32>(model.input2(), {4});
+  PowOpModel<int32_t> model({TensorType_INT32, {1, 2, 2, 1}},
+                            {TensorType_INT32, {1}}, {TensorType_INT32, {}});
+  model.PopulateTensor<int32_t>(model.input1(), {12, 2, 7, 8});
+  model.PopulateTensor<int32_t>(model.input2(), {4});
   model.Invoke();
   EXPECT_THAT(model.GetOutputShape(), ElementsAre(1, 2, 2, 1));
   EXPECT_THAT(model.GetOutput(), ElementsAre(20736, 16, 2401, 4096));
diff --git a/tensorflow/contrib/lite/kernels/reduce.cc b/tensorflow/contrib/lite/kernels/reduce.cc
index 31c331a8c6..e99f67c725 100644
--- a/tensorflow/contrib/lite/kernels/reduce.cc
+++ b/tensorflow/contrib/lite/kernels/reduce.cc
@@ -78,6 +78,10 @@ TfLiteStatus ResizeOutputTensor(TfLiteContext* context, OpContext* op_context) {
   size_t num_axis = NumElements(op_context->axis);
   const TfLiteIntArray* input_dims = op_context->input->dims;
   int input_num_dims = NumDimensions(op_context->input);
+  if (input_num_dims == 0) {
+    return context->ResizeTensor(context, op_context->output,
+                                 TfLiteIntArrayCreate(0));
+  }
   const int* axis = GetTensorData<int>(op_context->axis);
   if (op_context->params->keep_dims) {
     TfLiteIntArray* output_dims = TfLiteIntArrayCreate(input_num_dims);
@@ -315,6 +319,99 @@ TfLiteStatus EvalSum(TfLiteContext* context, TfLiteNode* node) {
   return kTfLiteOk;
 }
 
+template <KernelType kernel_type>
+TfLiteStatus EvalProd(TfLiteContext* context, TfLiteNode* node) {
+  OpContext op_context(context, node);
+  int64_t num_axis = NumElements(op_context.axis);
+  TfLiteTensor* temp_index = GetTemporary(context, node, /*index=*/0);
+  TfLiteTensor* resolved_axis = GetTemporary(context, node, /*index=*/1);
+  // Resize the output tensor if the output tensor is dynamic.
+  if (IsDynamicTensor(op_context.output)) {
+    TF_LITE_ENSURE_OK(context,
+                      ResizeTempAxis(context, &op_context, resolved_axis));
+    TF_LITE_ENSURE_OK(context, ResizeOutputTensor(context, &op_context));
+  }
+
+#define TF_LITE_PROD(kernel_type, data_type)                        \
+  kernel_type::ReduceProd<>(                                        \
+      GetTensorData<data_type>(op_context.input),                   \
+      op_context.input->dims->data, op_context.input->dims->size,   \
+      GetTensorData<data_type>(op_context.output),                  \
+      op_context.output->dims->data, op_context.output->dims->size, \
+      GetTensorData<int>(op_context.axis), num_axis,                \
+      op_context.params->keep_dims, GetTensorData<int>(temp_index), \
+      GetTensorData<int>(resolved_axis))
+
+  if (kernel_type == kReference) {
+    switch (op_context.input->type) {
+      case kTfLiteFloat32:
+        TF_LITE_ENSURE(context, TF_LITE_PROD(reference_ops, float));
+        break;
+      case kTfLiteInt32:
+        TF_LITE_ENSURE(context, TF_LITE_PROD(reference_ops, int));
+        break;
+      case kTfLiteInt64:
+        TF_LITE_ENSURE(context, TF_LITE_PROD(reference_ops, int64_t));
+        break;
+      case kTfLiteUInt8:
+        // TODO(wangtz): uint8 reduce_prod is not yet supported.
+      default:
+        return kTfLiteError;
+    }
+  }
+#undef TF_LITE_PROD
+  return kTfLiteOk;
+}
+
+template <KernelType kernel_type>
+TfLiteStatus EvalMax(TfLiteContext* context, TfLiteNode* node) {
+  OpContext op_context(context, node);
+  int64_t num_axis = NumElements(op_context.axis);
+  TfLiteTensor* temp_index = GetTemporary(context, node, /*index=*/0);
+  TfLiteTensor* resolved_axis = GetTemporary(context, node, /*index=*/1);
+  // Resize the output tensor if the output tensor is dynamic.
+  if (IsDynamicTensor(op_context.output)) {
+    TF_LITE_ENSURE_OK(context,
+                      ResizeTempAxis(context, &op_context, resolved_axis));
+    TF_LITE_ENSURE_OK(context, ResizeOutputTensor(context, &op_context));
+  }
+
+#define TF_LITE_MAX(kernel_type, data_type)                         \
+  kernel_type::ReduceMax<>(                                         \
+      GetTensorData<data_type>(op_context.input),                   \
+      op_context.input->dims->data, op_context.input->dims->size,   \
+      GetTensorData<data_type>(op_context.output),                  \
+      op_context.output->dims->data, op_context.output->dims->size, \
+      GetTensorData<int>(op_context.axis), num_axis,                \
+      op_context.params->keep_dims, GetTensorData<int>(temp_index), \
+      GetTensorData<int>(resolved_axis))
+
+  if (kernel_type == kReference) {
+    switch (op_context.input->type) {
+      case kTfLiteFloat32:
+        TF_LITE_ENSURE(context, TF_LITE_MAX(reference_ops, float));
+        break;
+      case kTfLiteInt32:
+        TF_LITE_ENSURE(context, TF_LITE_MAX(reference_ops, int));
+        break;
+      case kTfLiteInt64:
+        TF_LITE_ENSURE(context, TF_LITE_MAX(reference_ops, int64_t));
+        break;
+      case kTfLiteUInt8:
+        TF_LITE_ENSURE_EQ(context, op_context.input->params.scale,
+                          op_context.output->params.scale);
+        TF_LITE_ENSURE_EQ(context, op_context.input->params.zero_point,
+                          op_context.output->params.zero_point);
+        TF_LITE_ENSURE(context, TF_LITE_MAX(reference_ops, uint8_t));
+        break;
+      default:
+        return kTfLiteError;
+    }
+  }
+#undef TF_LITE_MAX
+  return kTfLiteOk;
+}
+
 }  // namespace reduce
 
 TfLiteRegistration* Register_MEAN_REF() {
@@ -331,9 +428,27 @@ TfLiteRegistration* Register_SUM_REF() {
   return &r;
 }
 
+TfLiteRegistration* Register_REDUCE_PROD_REF() {
+  static TfLiteRegistration r = {reduce::Init, reduce::Free,
+                                 reduce::PrepareSimple,
+                                 reduce::EvalProd<reduce::kReference>};
+  return &r;
+}
+
+TfLiteRegistration* Register_REDUCE_MAX_REF() {
+  static TfLiteRegistration r = {reduce::Init, reduce::Free,
+                                 reduce::PrepareSimple,
+                                 reduce::EvalMax<reduce::kReference>};
+  return &r;
+}
+
 // TODO(kanlig): add optimized implementation of Mean.
 TfLiteRegistration* Register_MEAN() { return Register_MEAN_REF(); }
 TfLiteRegistration* Register_SUM() { return Register_SUM_REF(); }
+TfLiteRegistration* Register_REDUCE_PROD() {
+  return Register_REDUCE_PROD_REF();
+}
+TfLiteRegistration* Register_REDUCE_MAX() { return Register_REDUCE_MAX_REF(); }
 
 }  // namespace builtin
 }  // namespace ops
diff --git a/tensorflow/contrib/lite/kernels/reduce_test.cc b/tensorflow/contrib/lite/kernels/reduce_test.cc
index 9e946822c6..5d432d34ef 100644
--- a/tensorflow/contrib/lite/kernels/reduce_test.cc
+++ b/tensorflow/contrib/lite/kernels/reduce_test.cc
@@ -22,13 +22,14 @@ namespace tflite {
 namespace {
 
 using ::testing::ElementsAreArray;
+using ::testing::IsEmpty;
 
 class BaseOpModel : public SingleOpModel {
  public:
-  void SetAxis(std::initializer_list<int> data) { PopulateTensor(axis_, data); }
+  void SetAxis(const std::vector<int>& data) { PopulateTensor(axis_, data); }
 
   template <class T>
-  void SetInput(std::initializer_list<T> data) {
+  void SetInput(std::vector<T> data) {
     PopulateTensor(input_, data);
   }
 
@@ -110,14 +111,72 @@ class SumOpDynamicModel : public BaseOpModel {
   }
 };
 
+// Model for the tests case where axis is a const tensor.
+class ProdOpConstModel : public BaseOpModel {
+ public:
+  ProdOpConstModel(const TensorData& input, const TensorData& output,
+                   std::initializer_list<int> axis_shape,
+                   std::initializer_list<int> axis, bool keep_dims) {
+    input_ = AddInput(input);
+    axis_ = AddConstInput(TensorType_INT32, axis, axis_shape);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_REDUCE_PROD, BuiltinOptions_ReducerOptions,
+                 CreateReducerOptions(builder_, keep_dims).Union());
+    BuildInterpreter({GetShape(input_)});
+  }
+};
+
+// Model for the tests case where axis is a dynamic tensor.
+class ProdOpDynamicModel : public BaseOpModel {
+ public:
+  ProdOpDynamicModel(const TensorData& input, const TensorData& output,
+                     const TensorData& axis, bool keep_dims) {
+    input_ = AddInput(input);
+    axis_ = AddInput(axis);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_REDUCE_PROD, BuiltinOptions_ReducerOptions,
+                 CreateReducerOptions(builder_, keep_dims).Union());
+    BuildInterpreter({GetShape(input_)});
+  }
+};
+
+// Model for the tests case where axis is a const tensor.
+class MaxOpConstModel : public BaseOpModel {
+ public:
+  MaxOpConstModel(const TensorData& input, const TensorData& output,
+                  std::initializer_list<int> axis_shape,
+                  std::initializer_list<int> axis, bool keep_dims) {
+    input_ = AddInput(input);
+    axis_ = AddConstInput(TensorType_INT32, axis, axis_shape);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_REDUCE_MAX, BuiltinOptions_ReducerOptions,
+                 CreateReducerOptions(builder_, keep_dims).Union());
+    BuildInterpreter({GetShape(input_)});
+  }
+};
+
+// Model for the tests case where axis is a dynamic tensor.
+class MaxOpDynamicModel : public BaseOpModel {
+ public:
+  MaxOpDynamicModel(const TensorData& input, const TensorData& output,
+                    const TensorData& axis, bool keep_dims) {
+    input_ = AddInput(input);
+    axis_ = AddInput(axis);
+    output_ = AddOutput(output);
+    SetBuiltinOp(BuiltinOperator_REDUCE_MAX, BuiltinOptions_ReducerOptions,
+                 CreateReducerOptions(builder_, keep_dims).Union());
+    BuildInterpreter({GetShape(input_)});
+  }
+};
+
 // for quantized Add, the error shouldn't exceed step
 float GetTolerance(int min, int max) { return (max - min) / 255.0; }
 
 // Tests for reduce_mean
 TEST(ConstFloatMeanOpTest, NotKeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   MeanOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {2}},
                      {4}, {1, 0, -3, -3}, false);
   m.SetInput(data);
@@ -127,9 +186,9 @@ TEST(ConstFloatMeanOpTest, NotKeepDims) {
 }
 
 TEST(ConstFloatMeanOpTest, KeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   MeanOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {3}},
                      {2}, {0, 2}, true);
   m.SetInput(data);
@@ -139,14 +198,24 @@ TEST(ConstFloatMeanOpTest, KeepDims) {
               ElementsAreArray(ArrayFloatNear({10.5, 12.5, 14.5})));
 }
 
+TEST(ConstFloatMeanOpTest, Scalar) {
+  std::vector<float> data = {3.27};
+  MeanOpConstModel m({TensorType_FLOAT32, {}}, {TensorType_FLOAT32, {}}, {},
+                     {0}, true);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), IsEmpty());
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({3.27})));
+}
+
 TEST(DynamicFloatMeanOpTest, NotKeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   MeanOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
                        {TensorType_FLOAT32, {2}}, {TensorType_INT32, {4}},
                        false);
-  std::initializer_list<int> axis = {1, 0, -3, -3};
+  std::vector<int> axis = {1, 0, -3, -3};
   m.SetAxis(axis);
   m.SetInput(data);
   m.Invoke();
@@ -155,13 +224,13 @@ TEST(DynamicFloatMeanOpTest, NotKeepDims) {
 }
 
 TEST(DynamicFloatMeanOpTest, KeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   MeanOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
                        {TensorType_FLOAT32, {3}}, {TensorType_INT32, {2}},
                        true);
-  std::initializer_list<int> axis = {0, 2};
+  std::vector<int> axis = {0, 2};
   m.SetAxis(axis);
   m.SetInput(data);
   m.Invoke();
@@ -171,10 +240,10 @@ TEST(DynamicFloatMeanOpTest, KeepDims) {
 }
 
 TEST(DynamicFloatMeanOpTest, Scale) {
-  std::initializer_list<float> data = {9.527};
+  std::vector<float> data = {9.527};
   MeanOpDynamicModel m({TensorType_FLOAT32, {1}}, {TensorType_FLOAT32, {1}},
                        {TensorType_INT32, {1}}, true);
-  std::initializer_list<int> axis = {0};
+  std::vector<int> axis = {0};
   m.SetAxis(axis);
   m.SetInput(data);
   m.Invoke();
@@ -185,7 +254,7 @@ TEST(DynamicFloatMeanOpTest, Scale) {
 
 TEST(ConstUint8MeanOpTest, NotKeepDims) {
   float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
-  std::initializer_list<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
+  std::vector<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
   MeanOpConstModel m({TensorType_UINT8, {1, 3, 2}, -1.0, 1.0},
                      {TensorType_UINT8, {2}, -1.0, 1.0}, {1}, {1}, false);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
@@ -197,7 +266,7 @@ TEST(ConstUint8MeanOpTest, NotKeepDims) {
 
 TEST(ConstUint8MeanOpTest, KeepDims) {
   float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
-  std::initializer_list<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
+  std::vector<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
   MeanOpConstModel m({TensorType_UINT8, {3, 2}, -1.0, 1.0},
                      {TensorType_UINT8, {3}, -1.0, 1.0}, {1}, {1}, true);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
@@ -210,11 +279,11 @@ TEST(ConstUint8MeanOpTest, KeepDims) {
 
 TEST(DynamicUint8MeanOpTest, NotKeepDims) {
   float kQuantizedTolerance = GetTolerance(-5.0, 2.0);
-  std::initializer_list<float> data = {1.3, -4.8, -3.6, 0.24};
+  std::vector<float> data = {1.3, -4.8, -3.6, 0.24};
   MeanOpDynamicModel m({TensorType_UINT8, {2, 2}, -5.0, 2.0},
                        {TensorType_UINT8, {2}, -5.0, 2.0},
                        {TensorType_INT32, {1}}, false);
-  std::initializer_list<int> axis = {1};
+  std::vector<int> axis = {1};
   m.SetAxis(axis);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
   m.Invoke();
@@ -226,11 +295,11 @@ TEST(DynamicUint8MeanOpTest, NotKeepDims) {
 
 TEST(DynamicUint8MeanOpTest, KeepDims) {
   float kQuantizedTolerance = GetTolerance(-10.0, 12.0);
-  std::initializer_list<float> data = {11.14, -0.14, 7.423, 0.879};
+  std::vector<float> data = {11.14, -0.14, 7.423, 0.879};
   MeanOpDynamicModel m({TensorType_UINT8, {2, 2}, -10.0, 12.0},
                        {TensorType_UINT8, {2}, -10.0, 12.0},
                        {TensorType_INT32, {1}}, true);
-  std::initializer_list<int> axis = {0};
+  std::vector<int> axis = {0};
   m.SetAxis(axis);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
   m.Invoke();
@@ -243,9 +312,9 @@ TEST(DynamicUint8MeanOpTest, KeepDims) {
 // Tests for reduce_sum
 
 TEST(ConstFloatSumOpTest, NotKeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   SumOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {2}},
                     {4}, {1, 0, -3, -3}, false);
   m.SetInput(data);
@@ -256,9 +325,9 @@ TEST(ConstFloatSumOpTest, NotKeepDims) {
 }
 
 TEST(ConstFloatSumOpTest, KeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   SumOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {3}},
                     {2}, {0, 2}, true);
   m.SetInput(data);
@@ -269,13 +338,13 @@ TEST(ConstFloatSumOpTest, KeepDims) {
 }
 
 TEST(DynamicFloatSumOpTest, NotKeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   SumOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
                       {TensorType_FLOAT32, {2}}, {TensorType_INT32, {4}},
                       false);
-  std::initializer_list<int> axis = {1, 0, -3, -3};
+  std::vector<int> axis = {1, 0, -3, -3};
   m.SetAxis(axis);
   m.SetInput(data);
   m.Invoke();
@@ -284,13 +353,23 @@ TEST(DynamicFloatSumOpTest, NotKeepDims) {
               ElementsAreArray(ArrayFloatNear({144, 156})));
 }
 
+TEST(ConstFloatSumOpTest, Scalar) {
+  std::vector<float> data = {17.};
+  SumOpConstModel m({TensorType_FLOAT32, {}}, {TensorType_FLOAT32, {}}, {}, {0},
+                    false);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), IsEmpty());
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({17.})));
+}
+
 TEST(DynamicFloatSumOpTest, KeepDims) {
-  std::initializer_list<float> data = {
-      1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
-      13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
   SumOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
                       {TensorType_FLOAT32, {3}}, {TensorType_INT32, {2}}, true);
-  std::initializer_list<int> axis = {0, 2};
+  std::vector<int> axis = {0, 2};
   m.SetAxis(axis);
   m.SetInput(data);
   m.Invoke();
@@ -300,10 +379,10 @@ TEST(DynamicFloatSumOpTest, KeepDims) {
 }
 
 TEST(DynamicFloatSumOpTest, Scale) {
-  std::initializer_list<float> data = {9.527};
+  std::vector<float> data = {9.527};
   SumOpDynamicModel m({TensorType_FLOAT32, {1}}, {TensorType_FLOAT32, {1}},
                       {TensorType_INT32, {1}}, true);
-  std::initializer_list<int> axis = {0};
+  std::vector<int> axis = {0};
   m.SetAxis(axis);
   m.SetInput(data);
   m.Invoke();
@@ -313,7 +392,7 @@ TEST(DynamicFloatSumOpTest, Scale) {
 
 TEST(ConstUint8SumOpTest, NotKeepDims) {
   float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
-  std::initializer_list<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
+  std::vector<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
   SumOpConstModel m({TensorType_UINT8, {1, 3, 2}, -1.0, 1.0},
                     {TensorType_UINT8, {2}, -1.0, 1.0}, {1}, {1}, false);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
@@ -326,7 +405,7 @@ TEST(ConstUint8SumOpTest, NotKeepDims) {
 
 TEST(ConstUint8SumOpTest, KeepDims) {
   float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
-  std::initializer_list<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
+  std::vector<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
   SumOpConstModel m({TensorType_UINT8, {3, 2}, -1.0, 1.0},
                     {TensorType_UINT8, {3}, -1.0, 1.0}, {1}, {1}, true);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
@@ -339,11 +418,11 @@ TEST(ConstUint8SumOpTest, KeepDims) {
 
 TEST(DynamicUint8SumOpTest, NotKeepDims) {
   float kQuantizedTolerance = GetTolerance(-5.0, 2.0);
-  std::initializer_list<float> data = {1.3, -4.8, -3.6, 0.24};
+  std::vector<float> data = {1.3, -4.8, -3.6, 0.24};
   SumOpDynamicModel m({TensorType_UINT8, {2, 2}, -5.0, 2.0},
                       {TensorType_UINT8, {2}, -5.0, 2.0},
                       {TensorType_INT32, {1}}, false);
-  std::initializer_list<int> axis = {1};
+  std::vector<int> axis = {1};
   m.SetAxis(axis);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
   m.Invoke();
@@ -355,11 +434,11 @@ TEST(DynamicUint8SumOpTest, NotKeepDims) {
 
 TEST(DynamicUint8SumOpTest, KeepDims) {
   float kQuantizedTolerance = GetTolerance(-10.0, 12.0);
-  std::initializer_list<float> data = {11.14, -0.14, 7.423, 0.879};
+  std::vector<float> data = {11.14, -0.14, 7.423, 0.879};
   SumOpDynamicModel m({TensorType_UINT8, {2, 2}, -10.0, 12.0},
                       {TensorType_UINT8, {2}, -10.0, 12.0},
                       {TensorType_INT32, {1}}, true);
-  std::initializer_list<int> axis = {0};
+  std::vector<int> axis = {0};
   m.SetAxis(axis);
   m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
   m.Invoke();
@@ -369,6 +448,223 @@ TEST(DynamicUint8SumOpTest, KeepDims) {
       ElementsAreArray(ArrayFloatNear({6.47059, 10.698}, kQuantizedTolerance)));
 }
 
+// Tests for reduce_prod
+
+TEST(ConstFloatProdOpTest, NotKeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  ProdOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {2}},
+                     {4}, {1, 0, -3, -3}, false);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2}));
+  EXPECT_THAT(
+      m.GetOutput<float>(),
+      ElementsAreArray(ArrayFloatNear({3.162341376e+11, 1.9619905536e+12})));
+}
+
+TEST(ConstFloatProdOpTest, KeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  ProdOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {3}},
+                     {2}, {0, 2}, true);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 3, 1}));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(
+                  ArrayFloatNear({7.74592e+06, 1.197504e+08, 6.6889152e+08})));
+}
+
+TEST(DynamicFloatProdOpTest, NotKeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  ProdOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
+                       {TensorType_FLOAT32, {2}}, {TensorType_INT32, {4}},
+                       false);
+  std::vector<int> axis = {1, 0, -3, -3};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2}));
+  EXPECT_THAT(
+      m.GetOutput<float>(),
+      ElementsAreArray(ArrayFloatNear({3.16234143225e+11, 1.9619905536e+12})));
+}
+
+TEST(DynamicFloatProdOpTest, KeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  ProdOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
+                       {TensorType_FLOAT32, {3}}, {TensorType_INT32, {2}},
+                       true);
+  std::vector<int> axis = {0, 2};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 3, 1}));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(
+                  ArrayFloatNear({7.74592e+06, 1.197504e+08, 6.6889152e+08})));
+}
+
+TEST(DynamicFloatProdOpTest, Scale) {
+  std::vector<float> data = {9.527};
+  ProdOpDynamicModel m({TensorType_FLOAT32, {1}}, {TensorType_FLOAT32, {1}},
+                       {TensorType_INT32, {1}}, true);
+  std::vector<int> axis = {0};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1}));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({9.527})));
+}
+
+// Tests for reduce_max
+
+TEST(ConstFloatMaxOpTest, NotKeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  MaxOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {2}},
+                    {4}, {1, 0, -3, -3}, false);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2}));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({23, 24})));
+}
+
+TEST(ConstFloatMaxOpTest, KeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  MaxOpConstModel m({TensorType_FLOAT32, {4, 3, 2}}, {TensorType_FLOAT32, {3}},
+                    {2}, {0, 2}, true);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 3, 1}));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({20, 22, 24})));
+}
+
+TEST(DynamicFloatMaxOpTest, NotKeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  MaxOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
+                      {TensorType_FLOAT32, {2}}, {TensorType_INT32, {4}},
+                      false);
+  std::vector<int> axis = {1, 0, -3, -3};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2}));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({23, 24})));
+}
+
+TEST(DynamicFloatMaxOpTest, KeepDims) {
+  std::vector<float> data = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
+                             9.0,  10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
+                             17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
+  MaxOpDynamicModel m({TensorType_FLOAT32, {4, 3, 2}},
+                      {TensorType_FLOAT32, {3}}, {TensorType_INT32, {2}}, true);
+  std::vector<int> axis = {0, 2};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 3, 1}));
+  EXPECT_THAT(m.GetOutput<float>(),
+              ElementsAreArray(ArrayFloatNear({20, 22, 24})));
+}
+
+TEST(DynamicFloatMaxOpTest, Scale) {
+  std::vector<float> data = {9.527};
+  MaxOpDynamicModel m({TensorType_FLOAT32, {1}}, {TensorType_FLOAT32, {1}},
+                      {TensorType_INT32, {1}}, true);
+  std::vector<int> axis = {0};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1}));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({9.527})));
+}
+
+TEST(ConstUint8MaxOpTest, NotKeepDims) {
+  float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
+  std::vector<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
+  MaxOpConstModel m({TensorType_UINT8, {1, 3, 2}, -1.0, 1.0},
+                    {TensorType_UINT8, {2}, -1.0, 1.0}, {1}, {1}, false);
+  m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 2}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(
+                  ArrayFloatNear({0.501961, 0.603922}, kQuantizedTolerance)));
+}
+
+TEST(ConstUint8MaxOpTest, KeepDims) {
+  float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
+  std::vector<float> data = {0.4, 0.2, 0.3, 0.4, 0.5, 0.6};
+  MaxOpConstModel m({TensorType_UINT8, {3, 2}, -1.0, 1.0},
+                    {TensorType_UINT8, {3}, -1.0, 1.0}, {1}, {1}, true);
+  m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({3, 1}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(
+                  ArrayFloatNear({0.4, 0.4, 0.603922}, kQuantizedTolerance)));
+}
+
+TEST(DynamicUint8MaxOpTest, NotKeepDims) {
+  float kQuantizedTolerance = GetTolerance(-5.0, 2.0);
+  std::vector<float> data = {1.3, -4.8, -3.6, 0.24};
+  MaxOpDynamicModel m({TensorType_UINT8, {2, 2}, -5.0, 2.0},
+                      {TensorType_UINT8, {2}, -5.0, 2.0},
+                      {TensorType_INT32, {1}}, false);
+  std::vector<int> axis = {1};
+  m.SetAxis(axis);
+  m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(
+                  ArrayFloatNear({1.2902, 0.247059}, kQuantizedTolerance)));
+}
+
+TEST(DynamicUint8MaxOpTest, KeepDims) {
+  float kQuantizedTolerance = GetTolerance(-10.0, 12.0);
+  std::vector<float> data = {11.14, -0.14, 7.423, 0.879};
+  MaxOpDynamicModel m({TensorType_UINT8, {2, 2}, -10.0, 12.0},
+                      {TensorType_UINT8, {2}, -10.0, 12.0},
+                      {TensorType_INT32, {1}}, true);
+  std::vector<int> axis = {0};
+  m.SetAxis(axis);
+  m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 2}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(
+                  ArrayFloatNear({11.1294, 0.862745}, kQuantizedTolerance)));
+}
+
+TEST(DynamicUint8MaxOpTest, Scalar) {
+  float kQuantizedTolerance = GetTolerance(-10.0, 12.0);
+  std::vector<float> data = {11.14};
+  MaxOpDynamicModel m({TensorType_UINT8, {}, -10.0, 12.0},
+                      {TensorType_UINT8, {}, -10.0, 12.0},
+                      {TensorType_INT32, {1}}, true);
+  std::vector<int> axis = {0};
+  m.QuantizeAndPopulate<uint8_t>(m.Input(), data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), IsEmpty());
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear({11.1294}, kQuantizedTolerance)));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/register.cc b/tensorflow/contrib/lite/kernels/register.cc
index 0ca08cd8f3..5ad0f4d232 100644
--- a/tensorflow/contrib/lite/kernels/register.cc
+++ b/tensorflow/contrib/lite/kernels/register.cc
@@ -82,6 +82,7 @@ TfLiteRegistration* Register_PRELU();
 TfLiteRegistration* Register_MAXIMUM();
 TfLiteRegistration* Register_MINIMUM();
 TfLiteRegistration* Register_ARG_MAX();
+TfLiteRegistration* Register_ARG_MIN();
 TfLiteRegistration* Register_GREATER();
 TfLiteRegistration* Register_GREATER_EQUAL();
 TfLiteRegistration* Register_LESS();
@@ -90,6 +91,8 @@ TfLiteRegistration* Register_FLOOR();
 TfLiteRegistration* Register_TILE();
 TfLiteRegistration* Register_NEG();
 TfLiteRegistration* Register_SUM();
+TfLiteRegistration* Register_REDUCE_PROD();
+TfLiteRegistration* Register_REDUCE_MAX();
 TfLiteRegistration* Register_SELECT();
 TfLiteRegistration* Register_SLICE();
 TfLiteRegistration* Register_SIN();
@@ -102,6 +105,37 @@ TfLiteRegistration* Register_SQRT();
 TfLiteRegistration* Register_RSQRT();
 TfLiteRegistration* Register_SHAPE();
 TfLiteRegistration* Register_POW();
+TfLiteRegistration* Register_FAKE_QUANT();
+TfLiteRegistration* Register_PACK();
+TfLiteRegistration* Register_ONE_HOT();
+TfLiteRegistration* Register_LOGICAL_OR();
+
+TfLiteStatus UnsupportedTensorFlowOp(TfLiteContext* context, TfLiteNode* node) {
+  context->ReportError(
+      context,
+      "Regular TensorFlow ops are not supported by this interpreter. Make sure "
+      "you invoke the Eager delegate before inference.");
+  return kTfLiteError;
+}
+
+const TfLiteRegistration* BuiltinOpResolver::FindOp(tflite::BuiltinOperator op,
+                                                    int version) const {
+  return MutableOpResolver::FindOp(op, version);
+}
+
+const TfLiteRegistration* BuiltinOpResolver::FindOp(const char* op,
+                                                    int version) const {
+  // Return the NULL Op for all ops whose name start with "Eager:", allowing
+  // the interpreter to delegate their execution.
+  if (string(op).find("Eager:") == 0) {
+    static TfLiteRegistration null_op{
+        nullptr, nullptr, &UnsupportedTensorFlowOp,
+        nullptr, nullptr, BuiltinOperator_CUSTOM,
+        "Eager", 1};
+    return &null_op;
+  }
+  return MutableOpResolver::FindOp(op, version);
+}
 
 BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_RELU, Register_RELU());
@@ -167,6 +201,7 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_MAXIMUM, Register_MAXIMUM());
   AddBuiltin(BuiltinOperator_MINIMUM, Register_MINIMUM());
   AddBuiltin(BuiltinOperator_ARG_MAX, Register_ARG_MAX());
+  AddBuiltin(BuiltinOperator_ARG_MIN, Register_ARG_MIN());
   AddBuiltin(BuiltinOperator_GREATER, Register_GREATER());
   AddBuiltin(BuiltinOperator_GREATER_EQUAL, Register_GREATER_EQUAL());
   AddBuiltin(BuiltinOperator_LESS, Register_LESS());
@@ -179,6 +214,8 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_TRANSPOSE_CONV, Register_TRANSPOSE_CONV());
   AddBuiltin(BuiltinOperator_TILE, Register_TILE());
   AddBuiltin(BuiltinOperator_SUM, Register_SUM());
+  AddBuiltin(BuiltinOperator_REDUCE_PROD, Register_REDUCE_PROD());
+  AddBuiltin(BuiltinOperator_REDUCE_MAX, Register_REDUCE_MAX());
   AddBuiltin(BuiltinOperator_EXPAND_DIMS, Register_EXPAND_DIMS());
   AddBuiltin(BuiltinOperator_SPARSE_TO_DENSE, Register_SPARSE_TO_DENSE());
   AddBuiltin(BuiltinOperator_EQUAL, Register_EQUAL());
@@ -187,6 +224,10 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_RSQRT, Register_RSQRT());
   AddBuiltin(BuiltinOperator_SHAPE, Register_SHAPE());
   AddBuiltin(BuiltinOperator_POW, Register_POW());
+  AddBuiltin(BuiltinOperator_FAKE_QUANT, Register_FAKE_QUANT(), 1, 2);
+  AddBuiltin(BuiltinOperator_PACK, Register_PACK());
+  AddBuiltin(BuiltinOperator_ONE_HOT, Register_ONE_HOT());
+  AddBuiltin(BuiltinOperator_LOGICAL_OR, Register_LOGICAL_OR());
 
   // TODO(andrewharp, ahentz): Move these somewhere more appropriate so that
   // custom ops aren't always included by default.
diff --git a/tensorflow/contrib/lite/kernels/register.h b/tensorflow/contrib/lite/kernels/register.h
index 940718d67e..0296152d68 100644
--- a/tensorflow/contrib/lite/kernels/register.h
+++ b/tensorflow/contrib/lite/kernels/register.h
@@ -26,6 +26,10 @@ namespace builtin {
 class BuiltinOpResolver : public MutableOpResolver {
  public:
   BuiltinOpResolver();
+
+  const TfLiteRegistration* FindOp(tflite::BuiltinOperator op,
+                                   int version) const override;
+  const TfLiteRegistration* FindOp(const char* op, int version) const override;
 };
 
 }  // namespace builtin
diff --git a/tensorflow/contrib/lite/kernels/reshape.cc b/tensorflow/contrib/lite/kernels/reshape.cc
index 3287040695..49ba0571e2 100644
--- a/tensorflow/contrib/lite/kernels/reshape.cc
+++ b/tensorflow/contrib/lite/kernels/reshape.cc
@@ -25,16 +25,11 @@ namespace builtin {
 namespace reshape {
 
 constexpr int kInputTensor = 0;
+constexpr int kShapeTensor = 1;
 constexpr int kOutputTensor = 0;
 
-TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
-  auto* params = reinterpret_cast<TfLiteReshapeParams*>(node->builtin_data);
-
-  // TODO(ahentz): we are often given a tensor with the shape but we only pay
-  // attention to what the shape specified in 'params'.
-  TF_LITE_ENSURE(context, NumInputs(node) == 1 || NumInputs(node) == 2);
-  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
-
+TfLiteStatus ResizeOutput(TfLiteContext* context, TfLiteNode* node,
+                          TfLiteIntArray* output_shape) {
   const TfLiteTensor* input = GetInput(context, node, kInputTensor);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
@@ -42,37 +37,84 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   // special -1 value, meaning it will be calculated automatically based on the
   // input. Here we calculate what that dimension should be so that the number
   // of output elements in the same as the number of input elements.
-  int num_input_elements = 1;
-  for (int i = 0; i < NumDimensions(input); ++i) {
-    num_input_elements *= SizeOfDimension(input, i);
-  }
+  int num_input_elements = NumElements(input);
 
-  TfLiteIntArray* output_size = TfLiteIntArrayCreate(params->num_dimensions);
   int num_output_elements = 1;
   int stretch_dim = -1;
-  for (int i = 0; i < params->num_dimensions; ++i) {
-    int value = params->shape[i];
+  for (int i = 0; i < output_shape->size; ++i) {
+    int value = output_shape->data[i];
     if (value == -1) {
       TF_LITE_ENSURE_EQ(context, stretch_dim, -1);
       stretch_dim = i;
     } else {
       num_output_elements *= value;
-      output_size->data[i] = value;
     }
   }
   if (stretch_dim != -1) {
-    output_size->data[stretch_dim] = num_input_elements / num_output_elements;
-    num_output_elements *= output_size->data[stretch_dim];
+    output_shape->data[stretch_dim] = num_input_elements / num_output_elements;
+    num_output_elements *= output_shape->data[stretch_dim];
   }
 
   TF_LITE_ENSURE_EQ(context, num_input_elements, num_output_elements);
-  return context->ResizeTensor(context, output, output_size);
+  return context->ResizeTensor(context, output, output_shape);
+}
+
+TfLiteStatus ResizeOutputWithShapeTensor(TfLiteContext* context,
+                                         TfLiteNode* node) {
+  const TfLiteTensor* shape = GetInput(context, node, kShapeTensor);
+
+  TfLiteIntArray* output_shape = TfLiteIntArrayCreate(shape->dims->data[0]);
+  for (int i = 0; i < output_shape->size; ++i) {
+    output_shape->data[i] = shape->data.i32[i];
+  }
+  return ResizeOutput(context, node, output_shape);
+}
+
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  auto* params = reinterpret_cast<TfLiteReshapeParams*>(node->builtin_data);
+
+  TF_LITE_ENSURE(context, NumInputs(node) == 1 || NumInputs(node) == 2);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+
+  // Attempt to use shape tensor if it exists.
+  if (NumInputs(node) == 2) {
+    const TfLiteTensor* shape = GetInput(context, node, kShapeTensor);
+    // Check if the shape tensor is valid.
+    if (shape->dims->size == 1 && shape->type == kTfLiteInt32) {
+      // Set the output tensor as dynamic if the shape isn't constnat.
+      if (!IsConstantTensor(shape)) {
+        TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
+        SetTensorToDynamic(output);
+        return kTfLiteOk;
+      }
+      // Shape is constant. Resize now.
+      return ResizeOutputWithShapeTensor(context, node);
+    }
+  }
+  // The function is returned above this line if the shape tensor is usable.
+  // Now fallback to the shape parameter in `TfLiteReshapeParams`.
+  int num_dimensions = params->num_dimensions;
+  if (num_dimensions == 1 && params->shape[0] == 0) {
+    // Legacy tflite models use a shape parameter of [0] to indicate scalars,
+    // so adjust accordingly. TODO(b/111614235): Allow zero-sized buffers during
+    // toco conversion.
+    num_dimensions = 0;
+  }
+  TfLiteIntArray* output_shape = TfLiteIntArrayCreate(num_dimensions);
+  for (int i = 0; i < num_dimensions; ++i) {
+    output_shape->data[i] = params->shape[i];
+  }
+  return ResizeOutput(context, node, output_shape);
 }
 
 TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const TfLiteTensor* input = GetInput(context, node, kInputTensor);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
+  if (IsDynamicTensor(output)) {
+    TF_LITE_ENSURE_OK(context, ResizeOutputWithShapeTensor(context, node));
+  }
+
   memcpy(output->data.raw, input->data.raw, input->bytes);
 
   return kTfLiteOk;
diff --git a/tensorflow/contrib/lite/kernels/reshape_test.cc b/tensorflow/contrib/lite/kernels/reshape_test.cc
index aecbd0399f..52d71350d3 100644
--- a/tensorflow/contrib/lite/kernels/reshape_test.cc
+++ b/tensorflow/contrib/lite/kernels/reshape_test.cc
@@ -22,18 +22,27 @@ namespace tflite {
 namespace {
 
 using ::testing::ElementsAreArray;
+using ::testing::IsEmpty;
 
 class ReshapeOpModel : public SingleOpModel {
  public:
   ReshapeOpModel(std::initializer_list<int> input_shape,
-                 std::initializer_list<int> new_shape) {
+                 std::initializer_list<int> new_shape,
+                 bool use_shape_input_tensor = false) {
     input_ = AddInput(TensorType_FLOAT32);
     output_ = AddOutput(TensorType_FLOAT32);
+    int shape_input_tensor =
+        use_shape_input_tensor ? AddInput(TensorType_INT32) : -1;
     SetBuiltinOp(
         BuiltinOperator_RESHAPE, BuiltinOptions_ReshapeOptions,
         CreateReshapeOptions(builder_, builder_.CreateVector<int>(new_shape))
             .Union());
-    BuildInterpreter({input_shape});
+    if (use_shape_input_tensor) {
+      BuildInterpreter({input_shape, GetShape(shape_input_tensor)});
+      PopulateTensor<int>(shape_input_tensor, new_shape);
+    } else {
+      BuildInterpreter({input_shape});
+    }
   }
 
   void SetInput(std::initializer_list<float> data) {
@@ -71,6 +80,14 @@ TEST(ReshapeOpTest, SimpleTest) {
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 2, 2}));
 }
 
+TEST(ReshapeOpTest, ShapeTensorInput) {
+  ReshapeOpModel m({1, 2, 4, 1}, {2, 2, 2}, /*use_shape_input_tensor=*/true);
+  m.SetInput({1, 2, 3, 4, 5, 6, 7, 8});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({1, 2, 3, 4, 5, 6, 7, 8}));
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 2, 2}));
+}
+
 TEST(ReshapeOpTest, WithStretchDimension) {
   ReshapeOpModel m({1, 2, 4, 1}, {2, 1, -1});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8});
@@ -79,6 +96,22 @@ TEST(ReshapeOpTest, WithStretchDimension) {
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2, 1, 4}));
 }
 
+TEST(ReshapeOpTest, ScalarOutput) {
+  ReshapeOpModel m({1}, {});
+  m.SetInput({3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({3}));
+  EXPECT_THAT(m.GetOutputShape(), IsEmpty());
+}
+
+TEST(ReshapeOpTest, LegacyScalarOutput) {
+  ReshapeOpModel m({1}, {0});
+  m.SetInput({3});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({3}));
+  EXPECT_THAT(m.GetOutputShape(), IsEmpty());
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc b/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc
index 10caffea03..f4289105f7 100644
--- a/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc
+++ b/tensorflow/contrib/lite/kernels/resize_bilinear_test.cc
@@ -247,7 +247,7 @@ TEST(ResizeBilinearOpTest, TwoDimensionalResizeWithTwoBatches8Bit) {
       3, 6,   //
       9, 12,  //
       4, 10,  //
-      10, 16  //
+      12, 16  //
   });
   m.SetSize({3, 3});
   m.Invoke();
@@ -256,8 +256,8 @@ TEST(ResizeBilinearOpTest, TwoDimensionalResizeWithTwoBatches8Bit) {
                                         7, 9, 10,    //
                                         9, 11, 12,   //
                                         4, 8, 10,    //
-                                        8, 12, 14,   //
-                                        10, 13, 16,  //
+                                        9, 12, 14,   //
+                                        12, 14, 16,  //
                                     })));
 
   ResizeBilinearOpModel const_m({TensorType_UINT8, {2, 2, 2, 1}}, {3, 3});
@@ -265,7 +265,7 @@ TEST(ResizeBilinearOpTest, TwoDimensionalResizeWithTwoBatches8Bit) {
       3, 6,   //
       9, 12,  //
       4, 10,  //
-      10, 16  //
+      12, 16  //
   });
   const_m.Invoke();
   EXPECT_THAT(const_m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
@@ -273,35 +273,35 @@ TEST(ResizeBilinearOpTest, TwoDimensionalResizeWithTwoBatches8Bit) {
                                               7, 9, 10,    //
                                               9, 11, 12,   //
                                               4, 8, 10,    //
-                                              8, 12, 14,   //
-                                              10, 13, 16,  //
+                                              9, 12, 14,   //
+                                              12, 14, 16,  //
                                           })));
 }
 
 TEST(ResizeBilinearOpTest, ThreeDimensionalResize8Bit) {
   ResizeBilinearOpModel m({TensorType_UINT8, {1, 2, 2, 2}});
   m.SetInput<uint8>({
-      3, 4, 6, 10,    //
-      9, 10, 12, 16,  //
+      3, 4, 6, 10,     //
+      10, 12, 14, 16,  //
   });
   m.SetSize({3, 3});
   m.Invoke();
   EXPECT_THAT(m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
-                                        3, 4, 5, 8, 6, 10,      //
-                                        7, 8, 9, 12, 10, 14,    //
-                                        9, 10, 11, 13, 12, 16,  //
+                                        3, 4, 5, 8, 6, 10,       //
+                                        7, 9, 10, 12, 11, 14,    //
+                                        10, 12, 12, 14, 14, 16,  //
                                     })));
 
   ResizeBilinearOpModel const_m({TensorType_UINT8, {1, 2, 2, 2}}, {3, 3});
   const_m.SetInput<uint8>({
-      3, 4, 6, 10,    //
-      9, 10, 12, 16,  //
+      3, 4, 6, 10,     //
+      10, 12, 14, 16,  //
   });
   const_m.Invoke();
   EXPECT_THAT(const_m.GetOutput<uint8>(), ElementsAreArray(ArrayFloatNear({
-                                              3, 4, 5, 8, 6, 10,      //
-                                              7, 8, 9, 12, 10, 14,    //
-                                              9, 10, 11, 13, 12, 16,  //
+                                              3, 4, 5, 8, 6, 10,       //
+                                              7, 9, 10, 12, 11, 14,    //
+                                              10, 12, 12, 14, 14, 16,  //
                                           })));
 }
 }  // namespace
diff --git a/tensorflow/contrib/lite/kernels/select.cc b/tensorflow/contrib/lite/kernels/select.cc
index 9b6cee3cb5..3cdb5db209 100644
--- a/tensorflow/contrib/lite/kernels/select.cc
+++ b/tensorflow/contrib/lite/kernels/select.cc
@@ -89,6 +89,9 @@ TfLiteStatus SelectEval(TfLiteContext* context, TfLiteNode* node) {
     case kTfLiteUInt8:                                                         \
       TF_LITE_SELECT(uint8_t, op);                                             \
       break;                                                                   \
+    case kTfLiteInt16:                                                         \
+      TF_LITE_SELECT(int16_t, op);                                             \
+      break;                                                                   \
     case kTfLiteInt32:                                                         \
       TF_LITE_SELECT(int32_t, op);                                             \
       break;                                                                   \
diff --git a/tensorflow/contrib/lite/kernels/select_test.cc b/tensorflow/contrib/lite/kernels/select_test.cc
index 4664b9acb4..5b2e61cd29 100644
--- a/tensorflow/contrib/lite/kernels/select_test.cc
+++ b/tensorflow/contrib/lite/kernels/select_test.cc
@@ -96,6 +96,19 @@ TEST(SelectOpTest, SelectUInt8) {
   EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
 }
 
+TEST(SelectOpTest, SelectInt16) {
+  SelectOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, {1, 1, 1, 4},
+                      TensorType_INT16);
+
+  model.PopulateTensor<bool>(model.input1(), {false, true, false, false});
+  model.PopulateTensor<int16_t>(model.input2(), {1, 2, 3, 4});
+  model.PopulateTensor<int16_t>(model.input3(), {5, 6, 7, 8});
+  model.Invoke();
+
+  EXPECT_THAT(model.GetOutput<int16_t>(), ElementsAreArray({5, 2, 7, 8}));
+  EXPECT_THAT(model.GetOutputShape(), ElementsAreArray({1, 1, 1, 4}));
+}
+
 TEST(SelectOpTest, SelectInt32) {
   SelectOpModel model({1, 1, 1, 4}, {1, 1, 1, 4}, {1, 1, 1, 4},
                       TensorType_INT32);
diff --git a/tensorflow/contrib/lite/kernels/space_to_batch_nd.cc b/tensorflow/contrib/lite/kernels/space_to_batch_nd.cc
index c9269599e5..03079f1c3b 100644
--- a/tensorflow/contrib/lite/kernels/space_to_batch_nd.cc
+++ b/tensorflow/contrib/lite/kernels/space_to_batch_nd.cc
@@ -113,7 +113,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     TF_LITE_ENSURE_OK(context, ResizeOutputTensor(context, &op_context));
   }
 
-#define TF_LITE_SPACE_TO_BATCH_ND(type, scalar)                        \
+#define TF_LITE_SPACE_TO_BATCH_ND(type, scalar, pad_value)             \
   type::SpaceToBatchND(GetTensorData<scalar>(op_context.input),        \
                        GetTensorDims(op_context.input),                \
                        GetTensorData<int32_t>(op_context.block_shape), \
@@ -121,34 +121,36 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
                        GetTensorData<int32_t>(op_context.paddings),    \
                        GetTensorDims(op_context.paddings),             \
                        GetTensorData<scalar>(op_context.output),       \
-                       GetTensorDims(op_context.output))
+                       GetTensorDims(op_context.output), pad_value)
   switch (op_context.input->type) {  // Already know in/out types are same.
     case kTfLiteFloat32:
       if (kernel_type == kReference) {
-        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, float);
+        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, float, 0);
       } else {
-        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, float);
+        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, float, 0);
       }
       break;
     case kTfLiteUInt8:
       if (kernel_type == kReference) {
-        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, uint8_t);
+        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, uint8_t,
+                                  op_context.output->params.zero_point);
       } else {
-        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, uint8_t);
+        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, uint8_t,
+                                  op_context.output->params.zero_point);
       }
       break;
     case kTfLiteInt32:
       if (kernel_type == kReference) {
-        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, int32_t);
+        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, int32_t, 0);
       } else {
-        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, int32_t);
+        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, int32_t, 0);
       }
       break;
     case kTfLiteInt64:
       if (kernel_type == kReference) {
-        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, int64_t);
+        TF_LITE_SPACE_TO_BATCH_ND(reference_ops, int64_t, 0);
       } else {
-        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, int64_t);
+        TF_LITE_SPACE_TO_BATCH_ND(optimized_ops, int64_t, 0);
       }
       break;
     default:
diff --git a/tensorflow/contrib/lite/kernels/space_to_batch_nd_test.cc b/tensorflow/contrib/lite/kernels/space_to_batch_nd_test.cc
index 92a4a037d5..5756573629 100644
--- a/tensorflow/contrib/lite/kernels/space_to_batch_nd_test.cc
+++ b/tensorflow/contrib/lite/kernels/space_to_batch_nd_test.cc
@@ -23,6 +23,7 @@ namespace tflite {
 namespace {
 
 using ::testing::ElementsAreArray;
+using ::testing::Matcher;
 
 class SpaceToBatchNDOpModel : public SingleOpModel {
  public:
@@ -30,6 +31,10 @@ class SpaceToBatchNDOpModel : public SingleOpModel {
     PopulateTensor<float>(input_, data);
   }
 
+  void SetQuantizedInput(std::initializer_list<float> data) {
+    QuantizeAndPopulate<uint8_t>(input_, data);
+  }
+
   void SetBlockShape(std::initializer_list<int> data) {
     PopulateTensor<int>(block_shape_, data);
   }
@@ -41,6 +46,11 @@ class SpaceToBatchNDOpModel : public SingleOpModel {
   std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
   std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
 
+  std::vector<float> GetDequantizedOutput() {
+    return Dequantize<uint8_t>(ExtractVector<uint8_t>(output_),
+                               GetScale(output_), GetZeroPoint(output_));
+  }
+
  protected:
   int input_;
   int block_shape_;
@@ -56,18 +66,19 @@ class SpaceToBatchNDOpModel : public SingleOpModel {
 //    m.Invoke();
 class SpaceToBatchNDOpConstModel : public SpaceToBatchNDOpModel {
  public:
-  SpaceToBatchNDOpConstModel(std::initializer_list<int> input_shape,
+  SpaceToBatchNDOpConstModel(const TensorData& input,
                              std::initializer_list<int> block_shape,
-                             std::initializer_list<int> paddings) {
-    input_ = AddInput(TensorType_FLOAT32);
+                             std::initializer_list<int> paddings,
+                             const TensorData& output) {
+    input_ = AddInput(input);
     block_shape_ = AddConstInput(TensorType_INT32, block_shape, {2});
     paddings_ = AddConstInput(TensorType_INT32, paddings, {2, 2});
-    output_ = AddOutput(TensorType_FLOAT32);
+    output_ = AddOutput(output);
 
     SetBuiltinOp(BuiltinOperator_SPACE_TO_BATCH_ND,
                  BuiltinOptions_SpaceToBatchNDOptions,
                  CreateSpaceToBatchNDOptions(builder_).Union());
-    BuildInterpreter({input_shape});
+    BuildInterpreter({input.shape});
   }
 };
 
@@ -81,26 +92,30 @@ class SpaceToBatchNDOpConstModel : public SpaceToBatchNDOpModel {
 //    m.Invoke();
 class SpaceToBatchNDOpDynamicModel : public SpaceToBatchNDOpModel {
  public:
-  SpaceToBatchNDOpDynamicModel(std::initializer_list<int> input_shape) {
-    input_ = AddInput(TensorType_FLOAT32);
+  SpaceToBatchNDOpDynamicModel(const TensorData& input,
+                               const TensorData& output) {
+    input_ = AddInput(input);
     block_shape_ = AddInput(TensorType_INT32);
     paddings_ = AddInput(TensorType_INT32);
-    output_ = AddOutput(TensorType_FLOAT32);
+    output_ = AddOutput(output);
 
     SetBuiltinOp(BuiltinOperator_SPACE_TO_BATCH_ND,
                  BuiltinOptions_SpaceToBatchNDOptions,
                  CreateSpaceToBatchNDOptions(builder_).Union());
-    BuildInterpreter({input_shape, {2}, {2, 2}});
+    BuildInterpreter({input.shape, {2}, {2, 2}});
   }
 };
 
 TEST(SpaceToBatchNDOpTest, InvalidShapeTest) {
-  EXPECT_DEATH(SpaceToBatchNDOpConstModel({1, 3, 3, 1}, {2, 2}, {0, 0, 0, 0}),
-               "Cannot allocate tensors");
+  EXPECT_DEATH(
+      SpaceToBatchNDOpConstModel({TensorType_FLOAT32, {1, 3, 3, 1}}, {2, 2},
+                                 {0, 0, 0, 0}, {TensorType_FLOAT32}),
+      "Cannot allocate tensors");
 }
 
 TEST(SpaceToBatchNDOpTest, SimpleConstTest) {
-  SpaceToBatchNDOpConstModel m({1, 4, 4, 1}, {2, 2}, {0, 0, 0, 0});
+  SpaceToBatchNDOpConstModel m({TensorType_FLOAT32, {1, 4, 4, 1}}, {2, 2},
+                               {0, 0, 0, 0}, {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
   m.Invoke();
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({4, 2, 2, 1}));
@@ -109,7 +124,8 @@ TEST(SpaceToBatchNDOpTest, SimpleConstTest) {
 }
 
 TEST(SpaceToBatchNDOpTest, SimpleDynamicTest) {
-  SpaceToBatchNDOpDynamicModel m({1, 4, 4, 1});
+  SpaceToBatchNDOpDynamicModel m({TensorType_FLOAT32, {1, 4, 4, 1}},
+                                 {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
   m.SetBlockShape({2, 2});
   m.SetPaddings({0, 0, 0, 0});
@@ -120,7 +136,8 @@ TEST(SpaceToBatchNDOpTest, SimpleDynamicTest) {
 }
 
 TEST(SpaceToBatchNDOpTest, MultipleInputBatchesConstTest) {
-  SpaceToBatchNDOpConstModel m({2, 2, 4, 1}, {2, 2}, {0, 0, 0, 0});
+  SpaceToBatchNDOpConstModel m({TensorType_FLOAT32, {2, 2, 4, 1}}, {2, 2},
+                               {0, 0, 0, 0}, {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
   m.Invoke();
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({8, 1, 2, 1}));
@@ -129,7 +146,8 @@ TEST(SpaceToBatchNDOpTest, MultipleInputBatchesConstTest) {
 }
 
 TEST(SpaceToBatchNDOpTest, MultipleInputBatchesDynamicTest) {
-  SpaceToBatchNDOpDynamicModel m({2, 2, 4, 1});
+  SpaceToBatchNDOpDynamicModel m({TensorType_FLOAT32, {2, 2, 4, 1}},
+                                 {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
   m.SetBlockShape({2, 2});
   m.SetPaddings({0, 0, 0, 0});
@@ -140,7 +158,8 @@ TEST(SpaceToBatchNDOpTest, MultipleInputBatchesDynamicTest) {
 }
 
 TEST(SpaceToBatchNDOpTest, SimplePaddingConstTest) {
-  SpaceToBatchNDOpConstModel m({1, 5, 2, 1}, {3, 2}, {1, 0, 2, 0});
+  SpaceToBatchNDOpConstModel m({TensorType_FLOAT32, {1, 5, 2, 1}}, {3, 2},
+                               {1, 0, 2, 0}, {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
   m.Invoke();
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 2, 1}));
@@ -151,7 +170,8 @@ TEST(SpaceToBatchNDOpTest, SimplePaddingConstTest) {
 }
 
 TEST(SpaceToBatchNDOpTest, SimplePaddingDynamicTest) {
-  SpaceToBatchNDOpDynamicModel m({1, 5, 2, 1});
+  SpaceToBatchNDOpDynamicModel m({TensorType_FLOAT32, {1, 5, 2, 1}},
+                                 {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
   m.SetBlockShape({3, 2});
   m.SetPaddings({1, 0, 2, 0});
@@ -164,7 +184,8 @@ TEST(SpaceToBatchNDOpTest, SimplePaddingDynamicTest) {
 }
 
 TEST(SpaceToBatchNDOpTest, ComplexPaddingConstTest) {
-  SpaceToBatchNDOpConstModel m({1, 4, 2, 1}, {3, 2}, {1, 1, 2, 4});
+  SpaceToBatchNDOpConstModel m({TensorType_FLOAT32, {1, 4, 2, 1}}, {3, 2},
+                               {1, 1, 2, 4}, {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8});
   m.Invoke();
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 4, 1}));
@@ -176,7 +197,8 @@ TEST(SpaceToBatchNDOpTest, ComplexPaddingConstTest) {
 }
 
 TEST(SpaceToBatchNDOpTest, ComplexPaddingDynamicTest) {
-  SpaceToBatchNDOpDynamicModel m({1, 4, 2, 1});
+  SpaceToBatchNDOpDynamicModel m({TensorType_FLOAT32, {1, 4, 2, 1}},
+                                 {TensorType_FLOAT32});
   m.SetInput({1, 2, 3, 4, 5, 6, 7, 8});
   m.SetBlockShape({3, 2});
   m.SetPaddings({1, 1, 2, 4});
@@ -189,6 +211,88 @@ TEST(SpaceToBatchNDOpTest, ComplexPaddingDynamicTest) {
                              }));
 }
 
+class QuantizedSpaceToBatchNDOpTest : public ::testing::Test {
+ protected:
+  std::vector<Matcher<float>> DequantizedArrayNear(
+      const std::vector<float>& values, const float min, const float max) {
+    const float quantization_tolerance = (max - min) / 255.0;
+    return ArrayFloatNear(values, quantization_tolerance);
+  }
+};
+
+TEST_F(QuantizedSpaceToBatchNDOpTest, ZeroNotInQuantizationRange) {
+  // The test_util and actual quantization code currently ensure that the range
+  // must include zero, but if that ever changes, this test will catch it.
+  EXPECT_DEATH(SpaceToBatchNDOpConstModel m(
+                   {TensorType_UINT8, {1, 2, 2, 1}, 1.0, 2.0}, {4, 2},
+                   {0, 0, 1, 1, 1, 1, 0, 0}, {TensorType_UINT8, {}, 1.0, 2.0}),
+               ".*Check failed: f_min <= 0.*");
+}
+
+TEST_F(QuantizedSpaceToBatchNDOpTest, SimplePaddingConstTest) {
+  SpaceToBatchNDOpConstModel m({TensorType_UINT8, {1, 5, 2, 1}, -1.0, 1.0},
+                               {3, 2}, {1, 0, 2, 0},
+                               {TensorType_UINT8, {}, -1.0, 1.0});
+  m.SetQuantizedInput({-0.1, 0.2, -0.3, 0.4, -0.5, 0.6, -0.7, 0.8, -0.9, 0.1});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 2, 1}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(DequantizedArrayNear(
+                  {0, 0,   0, -0.5, 0, 0,    0, 0.6,  0, -0.1, 0, -0.7,
+                   0, 0.2, 0, 0.8,  0, -0.3, 0, -0.9, 0, 0.4,  0, 0.1},
+                  -1.0, 1.0)));
+}
+
+TEST_F(QuantizedSpaceToBatchNDOpTest, SimplePaddingDynamicTest) {
+  SpaceToBatchNDOpDynamicModel m({TensorType_UINT8, {1, 5, 2, 1}, -1.0, 1.0},
+                                 {TensorType_UINT8, {}, -1.0, 1.0});
+  m.SetQuantizedInput({-0.1, 0.2, -0.3, 0.4, -0.5, 0.6, -0.7, 0.8, -0.9, 0.1});
+  m.SetBlockShape({3, 2});
+  m.SetPaddings({1, 0, 2, 0});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 2, 1}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(DequantizedArrayNear(
+                  {0, 0,   0, -0.5, 0, 0,    0, 0.6,  0, -0.1, 0, -0.7,
+                   0, 0.2, 0, 0.8,  0, -0.3, 0, -0.9, 0, 0.4,  0, 0.1},
+                  -1.0, 1.0)));
+}
+
+TEST_F(QuantizedSpaceToBatchNDOpTest, ComplexPaddingConstTest) {
+  SpaceToBatchNDOpConstModel m({TensorType_UINT8, {1, 4, 2, 1}, -1.0, 1.0},
+                               {3, 2}, {1, 1, 2, 4},
+                               {TensorType_UINT8, {}, -1.0, 1.0});
+  m.SetQuantizedInput({-0.1, 0.2, -0.3, 0.4, -0.5, 0.6, -0.7, 0.8});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 4, 1}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(DequantizedArrayNear(
+                  {
+                      0, 0,    0, 0, 0, -0.5, 0, 0, 0, 0,   0, 0, 0, 0.6, 0, 0,
+                      0, -0.1, 0, 0, 0, -0.7, 0, 0, 0, 0.2, 0, 0, 0, 0.8, 0, 0,
+                      0, -0.3, 0, 0, 0, 0,    0, 0, 0, 0.4, 0, 0, 0, 0,   0, 0,
+                  },
+                  -1.0, 1.0)));
+}
+
+TEST_F(QuantizedSpaceToBatchNDOpTest, ComplexPaddingDynamicTest) {
+  SpaceToBatchNDOpDynamicModel m({TensorType_UINT8, {1, 4, 2, 1}, -1.0, 1.0},
+                                 {TensorType_UINT8, {}, -1.0, 1.0});
+  m.SetQuantizedInput({-0.1, 0.2, -0.3, 0.4, -0.5, 0.6, -0.7, 0.8});
+  m.SetBlockShape({3, 2});
+  m.SetPaddings({1, 1, 2, 4});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({6, 2, 4, 1}));
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(DequantizedArrayNear(
+                  {
+                      0, 0,    0, 0, 0, -0.5, 0, 0, 0, 0,   0, 0, 0, 0.6, 0, 0,
+                      0, -0.1, 0, 0, 0, -0.7, 0, 0, 0, 0.2, 0, 0, 0, 0.8, 0, 0,
+                      0, -0.3, 0, 0, 0, 0,    0, 0, 0, 0.4, 0, 0, 0, 0,   0, 0,
+                  },
+                  -1.0, 1.0)));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/sparse_to_dense.cc b/tensorflow/contrib/lite/kernels/sparse_to_dense.cc
index 404c32ad9c..7be5e66c16 100644
--- a/tensorflow/contrib/lite/kernels/sparse_to_dense.cc
+++ b/tensorflow/contrib/lite/kernels/sparse_to_dense.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
diff --git a/tensorflow/contrib/lite/kernels/sub.cc b/tensorflow/contrib/lite/kernels/sub.cc
index 1247525d41..77a1f59689 100644
--- a/tensorflow/contrib/lite/kernels/sub.cc
+++ b/tensorflow/contrib/lite/kernels/sub.cc
@@ -78,29 +78,47 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
 }
 
 template <KernelType kernel_type>
-void EvalFloat(TfLiteContext* context, TfLiteNode* node,
-               TfLiteSubParams* params, const OpData* data,
-               const TfLiteTensor* input1, const TfLiteTensor* input2,
-               TfLiteTensor* output) {
-  float output_activation_min, output_activation_max;
-  CalculateActivationRange(params->activation, &output_activation_min,
-                           &output_activation_max);
-#define TF_LITE_SUB(type, opname)                                   \
-  type::opname(GetTensorData<float>(input1), GetTensorDims(input1), \
-               GetTensorData<float>(input2), GetTensorDims(input2), \
-               output_activation_min, output_activation_max,        \
-               GetTensorData<float>(output), GetTensorDims(output))
-  if (kernel_type == kReference) {
-    if (data->requires_broadcast) {
-      TF_LITE_SUB(reference_ops, BroadcastSub);
+void EvalSub(TfLiteContext* context, TfLiteNode* node, TfLiteSubParams* params,
+             const OpData* data, const TfLiteTensor* input1,
+             const TfLiteTensor* input2, TfLiteTensor* output) {
+#define TF_LITE_SUB(type, opname, data_type)                             \
+  data_type output_activation_min, output_activation_max;                \
+  CalculateActivationRange(params->activation, &output_activation_min,   \
+                           &output_activation_max);                      \
+  tflite::ArithmeticParams op_params;                                    \
+  SetActivationParams(output_activation_min, output_activation_max,      \
+                      &op_params);                                       \
+  type::opname(op_params, GetTensorShape(input1),                        \
+               GetTensorData<data_type>(input1), GetTensorShape(input2), \
+               GetTensorData<data_type>(input2), GetTensorShape(output), \
+               GetTensorData<data_type>(output))
+  if (output->type == kTfLiteInt32) {
+    if (kernel_type == kReference) {
+      if (data->requires_broadcast) {
+        TF_LITE_SUB(reference_ops, BroadcastSub4DSlow, int32_t);
+      } else {
+        TF_LITE_SUB(reference_ops, SubWithActivation, int32_t);
+      }
     } else {
-      TF_LITE_SUB(reference_ops, Sub);
+      if (data->requires_broadcast) {
+        TF_LITE_SUB(optimized_ops, BroadcastSub4DSlow, int32_t);
+      } else {
+        TF_LITE_SUB(optimized_ops, SubWithActivation, int32_t);
+      }
     }
-  } else {
-    if (data->requires_broadcast) {
-      TF_LITE_SUB(optimized_ops, BroadcastSub);
+  } else if (output->type == kTfLiteFloat32) {
+    if (kernel_type == kReference) {
+      if (data->requires_broadcast) {
+        TF_LITE_SUB(reference_ops, BroadcastSub4DSlow, float);
+      } else {
+        TF_LITE_SUB(reference_ops, SubWithActivation, float);
+      }
     } else {
-      TF_LITE_SUB(optimized_ops, Sub);
+      if (data->requires_broadcast) {
+        TF_LITE_SUB(optimized_ops, BroadcastSub4DSlow, float);
+      } else {
+        TF_LITE_SUB(optimized_ops, SubWithActivation, float);
+      }
     }
   }
 #undef TF_LITE_SUB
@@ -128,36 +146,43 @@ void EvalQuantized(TfLiteContext* context, TfLiteNode* node,
   int input1_shift;
   QuantizeMultiplierSmallerThanOneExp(real_input1_multiplier,
                                       &input1_multiplier, &input1_shift);
-  input1_shift *= -1;
   int32 input2_multiplier;
   int input2_shift;
   QuantizeMultiplierSmallerThanOneExp(real_input2_multiplier,
                                       &input2_multiplier, &input2_shift);
-  input2_shift *= -1;
   int32 output_multiplier;
   int output_shift;
   QuantizeMultiplierSmallerThanOneExp(real_output_multiplier,
                                       &output_multiplier, &output_shift);
-  output_shift *= -1;
 
   int32 output_activation_min, output_activation_max;
   CalculateActivationRangeUint8(params->activation, output,
                                 &output_activation_min, &output_activation_max);
 
-#define TF_LITE_SUB(type, opname)                                            \
-  type::opname(left_shift, GetTensorData<uint8_t>(input1),                   \
-               GetTensorDims(input1), input1_offset, input1_multiplier,      \
-               input1_shift, GetTensorData<uint8_t>(input2),                 \
-               GetTensorDims(input2), input2_offset, input2_multiplier,      \
-               input2_shift, output_offset, output_multiplier, output_shift, \
-               output_activation_min, output_activation_max,                 \
-               GetTensorData<uint8_t>(output), GetTensorDims(output));
+#define TF_LITE_SUB(type, opname)                                      \
+  tflite::ArithmeticParams op_params;                                  \
+  op_params.left_shift = left_shift;                                   \
+  op_params.input1_offset = input1_offset;                             \
+  op_params.input1_multiplier = input1_multiplier;                     \
+  op_params.input1_shift = input1_shift;                               \
+  op_params.input2_offset = input2_offset;                             \
+  op_params.input2_multiplier = input2_multiplier;                     \
+  op_params.input2_shift = input2_shift;                               \
+  op_params.output_offset = output_offset;                             \
+  op_params.output_multiplier = output_multiplier;                     \
+  op_params.output_shift = output_shift;                               \
+  SetActivationParams(output_activation_min, output_activation_max,    \
+                      &op_params);                                     \
+  type::opname(op_params, GetTensorShape(input1),                      \
+               GetTensorData<uint8_t>(input1), GetTensorShape(input2), \
+               GetTensorData<uint8_t>(input2), GetTensorShape(output), \
+               GetTensorData<uint8_t>(output))
   // The quantized version of Sub doesn't support activations, so we
   // always use BroadcastSub.
   if (kernel_type == kReference) {
-    TF_LITE_SUB(reference_ops, BroadcastSub);
+    TF_LITE_SUB(reference_ops, BroadcastSub4DSlow);
   } else {
-    TF_LITE_SUB(optimized_ops, BroadcastSub);
+    TF_LITE_SUB(optimized_ops, BroadcastSub4DSlow);
   }
 #undef TF_LITE_SUB
 }
@@ -171,14 +196,15 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const TfLiteTensor* input2 = GetInput(context, node, kInputTensor2);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
 
-  if (output->type == kTfLiteFloat32) {
-    EvalFloat<kernel_type>(context, node, params, data, input1, input2, output);
+  if (output->type == kTfLiteFloat32 || output->type == kTfLiteInt32) {
+    EvalSub<kernel_type>(context, node, params, data, input1, input2, output);
   } else if (output->type == kTfLiteUInt8) {
     EvalQuantized<kernel_type>(context, node, params, data, input1, input2,
                                output);
   } else {
     context->ReportError(
-        context, "output type %d is not supported, requires float|uint8 types.",
+        context,
+        "output type %d is not supported, requires float|uint8|int32 types.",
         output->type);
     return kTfLiteError;
   }
diff --git a/tensorflow/contrib/lite/kernels/sub_test.cc b/tensorflow/contrib/lite/kernels/sub_test.cc
index ff07aeec49..5978c574d3 100644
--- a/tensorflow/contrib/lite/kernels/sub_test.cc
+++ b/tensorflow/contrib/lite/kernels/sub_test.cc
@@ -52,6 +52,13 @@ class FloatSubOpModel : public BaseSubOpModel {
   std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
 };
 
+class IntegerSubOpModel : public BaseSubOpModel {
+ public:
+  using BaseSubOpModel::BaseSubOpModel;
+
+  std::vector<int32_t> GetOutput() { return ExtractVector<int32_t>(output_); }
+};
+
 class QuantizedSubOpModel : public BaseSubOpModel {
  public:
   using BaseSubOpModel::BaseSubOpModel;
@@ -125,6 +132,57 @@ TEST(FloatSubOpModel, WithBroadcast) {
   }
 }
 
+TEST(IntegerSubOpModel, NoActivation) {
+  IntegerSubOpModel m({TensorType_INT32, {1, 2, 2, 1}},
+                      {TensorType_INT32, {1, 2, 2, 1}}, {TensorType_INT32, {}},
+                      ActivationFunctionType_NONE);
+  m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8});
+  m.PopulateTensor<int32_t>(m.input2(), {1, 2, 3, 5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({-21, 0, 4, 3}));
+}
+
+TEST(IntegerSubOpModel, ActivationRELU_N1_TO_1) {
+  IntegerSubOpModel m({TensorType_INT32, {1, 2, 2, 1}},
+                      {TensorType_INT32, {1, 2, 2, 1}}, {TensorType_INT32, {}},
+                      ActivationFunctionType_RELU_N1_TO_1);
+  m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8});
+  m.PopulateTensor<int32_t>(m.input2(), {1, 2, 3, 5});
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray({-1, 0, 1, 1}));
+}
+
+TEST(IntegerSubOpModel, VariousInputShapes) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    IntegerSubOpModel m({TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8, 11, 20});
+    m.PopulateTensor<int32_t>(m.input2(), {1, 2, 3, 5, 11, 1});
+    m.Invoke();
+    EXPECT_THAT(m.GetOutput(), ElementsAreArray({-21, 0, 4, 3, 0, 19}))
+        << "With shape number " << i;
+  }
+}
+
+TEST(IntegerSubOpModel, WithBroadcast) {
+  std::vector<std::initializer_list<int>> test_shapes = {
+      {6}, {2, 3}, {2, 1, 3}, {1, 3, 1, 2}};
+  for (int i = 0; i < test_shapes.size(); ++i) {
+    IntegerSubOpModel m({TensorType_INT32, test_shapes[i]},
+                        {TensorType_INT32, {}},  // always a scalar
+                        {TensorType_INT32, {}}, ActivationFunctionType_NONE);
+    m.PopulateTensor<int32_t>(m.input1(), {-20, 2, 7, 8, 11, 20});
+    m.PopulateTensor<int32_t>(m.input2(), {1});
+    m.Invoke();
+    EXPECT_THAT(m.GetOutput(),
+                ElementsAreArray(ArrayFloatNear({-21, 1, 6, 7, 10, 19})))
+        << "With shape number " << i;
+  }
+}
+
 TEST(QuantizedSubOpModel, QuantizedTestsNoActivation) {
   float kQuantizedTolerance = GetTolerance(-1.0, 1.0);
   std::vector<std::initializer_list<float>> inputs1 = {
diff --git a/tensorflow/contrib/lite/kernels/svdf.cc b/tensorflow/contrib/lite/kernels/svdf.cc
index 22eebdd4ce..6d4912ce3a 100644
--- a/tensorflow/contrib/lite/kernels/svdf.cc
+++ b/tensorflow/contrib/lite/kernels/svdf.cc
@@ -16,7 +16,6 @@ limitations under the License.
 // SVDF op that compresses a fully connected op via low-rank matrix
 // factorization. See https://research.google.com/pubs/archive/43813.pdf for
 // details.
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -105,7 +104,7 @@ constexpr int kStateTensor = 0;
 constexpr int kOutputTensor = 1;
 
 void* Init(TfLiteContext* context, const char* buffer, size_t length) {
-  auto* op_data = new OpData;
+  auto* op_data = new OpData();
   op_data->float_weights_time_initialized = false;
   context->AddTensors(context, /*tensors_to_add=*/4,
                       &op_data->scratch_tensor_index);
diff --git a/tensorflow/contrib/lite/kernels/transpose_conv.cc b/tensorflow/contrib/lite/kernels/transpose_conv.cc
index 7182374a6f..a9baa5c698 100644
--- a/tensorflow/contrib/lite/kernels/transpose_conv.cc
+++ b/tensorflow/contrib/lite/kernels/transpose_conv.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -22,7 +21,6 @@ limitations under the License.
 
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/context.h"
-#include "tensorflow/contrib/lite/kernels/eigen_support.h"
 #include "tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h"
 #include "tensorflow/contrib/lite/kernels/internal/tensor.h"
 #include "tensorflow/contrib/lite/kernels/kernel_util.h"
@@ -39,35 +37,9 @@ constexpr int kWeightsTensor = 1;
 constexpr int kDataInputTensor = 2;
 constexpr int kOutputTensor = 0;
 
-const int kTensorNotAllocated = -1;
-
-struct OpData {
-  // IDs are the arbitrary identifiers used by TF Lite to identify and access
-  // memory buffers.
-  int im2col_id = kTensorNotAllocated;
-
-  // im2col is the only temporary currently tracked, therefore always index 0.
-  // If more temporaries are added, they should be properly tracked.
-  int32_t im2col_index = 0;
-};
-
-void* Init(TfLiteContext* context, const char* buffer, size_t length) {
-  // This is a builtin op, so we don't use the contents in 'buffer', if any.
-  // Instead, we allocate a new object to use as scratch space for im2col, and
-  // to carry information from Prepare() to Eval().
-  auto* data = new OpData;
-  eigen_support::IncrementUsageCounter(context);
-  return data;
-}
-
-void Free(TfLiteContext* context, void* buffer) {
-  eigen_support::DecrementUsageCounter(context);
-  delete reinterpret_cast<OpData*>(buffer);
-}
-
-TfLiteStatus ResizeOutputTensor(TfLiteContext* context,
-                                const TfLiteTensor* output_shape,
-                                TfLiteTensor* output) {
+TfLiteStatus ResizeOutputShape(TfLiteContext* context,
+                               const TfLiteTensor* output_shape,
+                               TfLiteTensor* output) {
   // Currently only support int32 for output shape.
   if (output_shape->type != kTfLiteInt32) {
     context->ReportError(context, "Output shape is %d, not int32.",
@@ -83,60 +55,15 @@ TfLiteStatus ResizeOutputTensor(TfLiteContext* context,
   return context->ResizeTensor(context, output, output_shape_array);
 }
 
-// Allocate temporary im2col tensor.
-static TfLiteStatus AllocateIm2colTensor(TfLiteContext* context,
-                                         TfLiteNode* node) {
-  OpData* data = reinterpret_cast<OpData*>(node->user_data);
-  if (data->im2col_id == kTensorNotAllocated) {
-    context->AddTensors(context, 1, &data->im2col_id);
-  }
-
-  TfLiteIntArrayFree(node->temporaries);
-  node->temporaries = TfLiteIntArrayCreate(1);
-  node->temporaries->data[data->im2col_index] = data->im2col_id;
-
-  return kTfLiteOk;
-}
-
-TfLiteStatus ResizeIm2ColTensor(TfLiteContext* context,
-                                const TfLiteTensor* output_shape,
-                                const TfLiteTensor* weights,
-                                const TfLiteTensor* input,
-                                TfLiteTensor* im2col) {
-  if (output_shape->type != kTfLiteInt32) {
-    context->ReportError(context, "im2col shape is %d, not int32.",
-                         output_shape->type);
-    return kTfLiteError;
-  }
-  TF_LITE_ENSURE_EQ(context, NumElements(output_shape), 4);
-  TfLiteIntArray* im2col_shape_array = TfLiteIntArrayCreate(4);
-  im2col_shape_array->data[0] = output_shape->data.i32[0];
-  im2col_shape_array->data[1] = output_shape->data.i32[1];
-  im2col_shape_array->data[2] = output_shape->data.i32[2];
-  const int input_depth = SizeOfDimension(input, 3);
-  const int filter_width = SizeOfDimension(weights, 1);
-  const int filter_height = SizeOfDimension(weights, 2);
-  im2col_shape_array->data[3] = input_depth * filter_height * filter_width;
-
-  im2col->type = input->type;
-  im2col->allocation_type = kTfLiteArenaRw;
-  return context->ResizeTensor(context, im2col, im2col_shape_array);
-}
-
 TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   TF_LITE_ENSURE_EQ(context, NumInputs(node), 3);
   TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
 
-  TF_LITE_ENSURE_STATUS(AllocateIm2colTensor(context, node));
-
   const TfLiteTensor* output_shape =
       GetInput(context, node, kOutputShapeTensor);
   const TfLiteTensor* weights = GetInput(context, node, kWeightsTensor);
   const TfLiteTensor* input = GetInput(context, node, kDataInputTensor);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
-  OpData* user_data = reinterpret_cast<OpData*>(node->user_data);
-  TfLiteTensor* im2col =
-      &context->tensors[node->temporaries->data[user_data->im2col_index]];
 
   TF_LITE_ENSURE_EQ(context, NumDimensions(output_shape), 1);
   TF_LITE_ENSURE_EQ(context, NumDimensions(input), 4);
@@ -153,15 +80,11 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   TF_LITE_ENSURE_EQ(context, SizeOfDimension(input, 3),
                     SizeOfDimension(weights, 3));
 
-  if (IsConstantTensor(output_shape)) {
-    TF_LITE_ENSURE_STATUS(ResizeOutputTensor(context, output_shape, output));
-    TF_LITE_ENSURE_STATUS(
-        ResizeIm2ColTensor(context, output_shape, weights, input, im2col));
-  } else {
-    // Defer resizing until Eval().
+  if (!IsConstantTensor(output_shape)) {
     SetTensorToDynamic(output);
+    return kTfLiteOk;
   }
-  return kTfLiteOk;
+  return ResizeOutputShape(context, output_shape, output);
 }
 
 TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
@@ -170,19 +93,13 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const TfLiteTensor* weights = GetInput(context, node, kWeightsTensor);
   const TfLiteTensor* input = GetInput(context, node, kDataInputTensor);
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
-  OpData* user_data = reinterpret_cast<OpData*>(node->user_data);
-  TfLiteTensor* im2col =
-      &context->tensors[node->temporaries->data[user_data->im2col_index]];
+
   const auto* params =
       reinterpret_cast<TfLiteTransposeConvParams*>(node->builtin_data);
 
   if (IsDynamicTensor(output)) {
     TF_LITE_ENSURE_OK(context,
-                      ResizeOutputTensor(context, output_shape, output));
-  }
-  if (IsDynamicTensor(im2col)) {
-    TF_LITE_ENSURE_OK(context, ResizeIm2ColTensor(context, output_shape,
-                                                  weights, input, im2col));
+                      ResizeOutputShape(context, output_shape, output));
   }
 
   // Get height and width of the output image.
@@ -201,12 +118,17 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   // Currently only support float32.
   switch (input->type) {
     case kTfLiteFloat32:
-      optimized_ops::TransposeConv(
+      reference_ops::TransposeConv(
           GetTensorData<float>(input), GetTensorDims(input),
           GetTensorData<float>(weights), GetTensorDims(weights), stride_width,
           stride_height, padding_size.width, padding_size.height,
           GetTensorData<float>(output), GetTensorDims(output),
-          GetTensorData<float>(im2col), GetTensorDims(im2col));
+          // Last two args specify im2col which reference_ops ignores.
+          // (Note this does not lead to a performance regression, as the
+          // previous optimized version was just a copy of the reference code.)
+          // TODO(b/110208176): Allocate im2col tensors and switch to
+          // optimized_ops.
+          GetTensorData<float>(output), GetTensorDims(output));
       break;
     default:
       context->ReportError(context, "Type %d, not currently supported.",
@@ -219,8 +141,8 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
 }  // namespace transpose_conv
 
 TfLiteRegistration* Register_TRANSPOSE_CONV() {
-  static TfLiteRegistration r = {transpose_conv::Init, transpose_conv::Free,
-                                 transpose_conv::Prepare, transpose_conv::Eval};
+  static TfLiteRegistration r = {nullptr, nullptr, transpose_conv::Prepare,
+                                 transpose_conv::Eval};
   return &r;
 }
 
diff --git a/tensorflow/contrib/lite/kernels/transpose_conv_test.cc b/tensorflow/contrib/lite/kernels/transpose_conv_test.cc
index c741df19de..55df897180 100644
--- a/tensorflow/contrib/lite/kernels/transpose_conv_test.cc
+++ b/tensorflow/contrib/lite/kernels/transpose_conv_test.cc
@@ -14,7 +14,6 @@ limitations under the License.
 ==============================================================================*/
 #include <cstdarg>
 #include <gtest/gtest.h>
-#include "absl/memory/memory.h"
 #include "tensorflow/contrib/lite/interpreter.h"
 #include "tensorflow/contrib/lite/kernels/register.h"
 #include "tensorflow/contrib/lite/kernels/test_util.h"
@@ -25,49 +24,9 @@ namespace {
 
 using ::testing::ElementsAreArray;
 
-class ConstTransposeConvOpModel : public SingleOpModel {
-  // Just to be extra confusing, transpose_conv has an _input_ named
-  // "output_shape". This input sets the shape of the output tensor of the op.
-  // In this version of the test class, "output_shape" is a constant that must
-  // be specified in the constructor.
- public:
-  ConstTransposeConvOpModel(TfLiteRegistration* registration,
-                            std::initializer_list<int> input_shape,
-                            std::initializer_list<int> filter_shape,
-                            std::initializer_list<int> output_shape_data,
-                            Padding padding, int stride_w, int stride_h) {
-    output_shape_ = AddConstInput(TensorType_INT32, output_shape_data,
-                                  {static_cast<int>(output_shape_data.size())});
-    filter_ = AddInput(TensorType_FLOAT32);
-    input_ = AddInput(TensorType_FLOAT32);
-    output_ = AddOutput(TensorType_FLOAT32);
-    SetBuiltinOp(
-        BuiltinOperator_TRANSPOSE_CONV, BuiltinOptions_TransposeConvOptions,
-        CreateTransposeConvOptions(builder_, padding, stride_w, stride_h)
-            .Union());
-    resolver_ = absl::make_unique<SingleOpResolver>(
-        BuiltinOperator_TRANSPOSE_CONV, registration);
-    BuildInterpreter({{4}, filter_shape, input_shape});
-  }
-
-  int output_shape() { return output_shape_; }
-  int filter() { return filter_; }
-  int input() { return input_; }
-
-  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
-  std::vector<int> GetOutputShape() { return GetTensorShape(output_); }
-
- private:
-  int output_shape_;
-  int filter_;
-  int input_;
-  int output_;
-};
-
 class TransposeConvOpModel : public SingleOpModel {
  public:
-  TransposeConvOpModel(TfLiteRegistration* registration,
-                       std::initializer_list<int> input_shape,
+  TransposeConvOpModel(std::initializer_list<int> input_shape,
                        std::initializer_list<int> filter_shape, Padding padding,
                        int stride_w, int stride_h) {
     output_shape_ = AddInput(TensorType_INT32);
@@ -78,8 +37,6 @@ class TransposeConvOpModel : public SingleOpModel {
         BuiltinOperator_TRANSPOSE_CONV, BuiltinOptions_TransposeConvOptions,
         CreateTransposeConvOptions(builder_, padding, stride_w, stride_h)
             .Union());
-    resolver_ = absl::make_unique<SingleOpResolver>(
-        BuiltinOperator_TRANSPOSE_CONV, registration);
     BuildInterpreter({{4}, filter_shape, input_shape});
   }
 
@@ -97,15 +54,6 @@ class TransposeConvOpModel : public SingleOpModel {
   int output_;
 };
 
-const auto kKernelMap = new std::map<string, TfLiteRegistration*>({});
-
-class TransposeConvOpTest : public SingleOpTest {
- protected:
-  const std::map<string, TfLiteRegistration*>& GetKernelMap() override {
-    return *kKernelMap;
-  }
-};
-
 // Test case:
 // output = tf.nn.conv2d_backprop_input(
 //     tf.constant([ 1, 4, 4, 1 ]),
@@ -113,9 +61,8 @@ class TransposeConvOpTest : public SingleOpTest {
 //     tf.constant(np.arange(1, 17), shape=[ 1, 4, 4, 1 ], dtype=tf.float32),
 //     [1, 1, 1, 1 ],
 //     "SAME")
-TEST_P(TransposeConvOpTest, SimpleTest) {
-  TransposeConvOpModel m(GetRegistration(), {1, 4, 4, 1}, {1, 3, 3, 1},
-                         Padding_SAME, 1, 1);
+TEST(TransposeConvOpModelTest, SimpleTest) {
+  TransposeConvOpModel m({1, 4, 4, 1}, {1, 3, 3, 1}, Padding_SAME, 1, 1);
   m.PopulateTensor<int>(m.output_shape(), {1, 4, 4, 1});
   m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9});
   m.PopulateTensor<float>(
@@ -128,21 +75,6 @@ TEST_P(TransposeConvOpTest, SimpleTest) {
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 4, 4, 1}));
 }
 
-// Test case: Same as above, but with a const "output_shape"
-TEST_P(TransposeConvOpTest, ConstSimpleTest) {
-  ConstTransposeConvOpModel m(GetRegistration(), {1, 4, 4, 1}, {1, 4, 4, 1},
-                              {1, 3, 3, 1}, Padding_SAME, 1, 1);
-  m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9});
-  m.PopulateTensor<float>(
-      m.input(), {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
-  m.Invoke();
-
-  EXPECT_THAT(m.GetOutput(),
-              ElementsAreArray({29, 62, 83, 75, 99, 192, 237, 198, 207, 372,
-                                417, 330, 263, 446, 485, 365}));
-  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 4, 4, 1}));
-}
-
 // Test case:
 // filter = tf.constant(np.arange(1, 19),
 //                      shape=[ 3, 3, 1, 2 ],
@@ -155,9 +87,8 @@ TEST_P(TransposeConvOpTest, ConstSimpleTest) {
 //     "SAME")
 // And filter value is derived by:
 // filter = tf.reshape(tf.transpose(filter, perm=[3, 0, 1, 2]), shape=[18, 1])
-TEST_P(TransposeConvOpTest, TwoFiltersTest) {
-  TransposeConvOpModel m(GetRegistration(), {1, 4, 4, 2}, {1, 3, 3, 2},
-                         Padding_SAME, 1, 1);
+TEST(TransposeConvOpModelTest, TwoFiltersTest) {
+  TransposeConvOpModel m({1, 4, 4, 2}, {1, 3, 3, 2}, Padding_SAME, 1, 1);
   m.PopulateTensor<int>(m.output_shape(), {1, 4, 4, 1});
   m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                        13, 14, 15, 16, 17, 18});
@@ -185,9 +116,8 @@ TEST_P(TransposeConvOpTest, TwoFiltersTest) {
 //     "VALID")
 // And filter value is derived by:
 // filter = tf.reshape(tf.transpose(filter, perm=[3, 0, 1, 2]), shape=[1, 18])
-TEST_P(TransposeConvOpTest, PaddingValidTest) {
-  TransposeConvOpModel m(GetRegistration(), {1, 4, 4, 2}, {1, 3, 3, 2},
-                         Padding_VALID, 1, 1);
+TEST(TransposeConvOpModelTest, PaddingValidTest) {
+  TransposeConvOpModel m({1, 4, 4, 2}, {1, 3, 3, 2}, Padding_VALID, 1, 1);
   m.PopulateTensor<int>(m.output_shape(), {1, 6, 6, 1});
   m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                        13, 14, 15, 16, 17, 18});
@@ -216,9 +146,8 @@ TEST_P(TransposeConvOpTest, PaddingValidTest) {
 //     tf.constant(np.arange(1, 5), shape=[ 1, 2, 2, 1 ], dtype=tf.float32),
 //     [1, 2, 2, 1 ],
 //     "VALID")
-TEST_P(TransposeConvOpTest, StrideValidTest) {
-  TransposeConvOpModel m(GetRegistration(), {1, 2, 2, 1}, {1, 3, 3, 1},
-                         Padding_VALID, 2, 2);
+TEST(TransposeConvOpModelTest, StrideValidTest) {
+  TransposeConvOpModel m({1, 2, 2, 1}, {1, 3, 3, 1}, Padding_VALID, 2, 2);
   m.PopulateTensor<int>(m.output_shape(), {1, 5, 5, 1});
   m.PopulateTensor<float>(m.filter(), {1, 2, 3, 4, 5, 6, 7, 8, 9});
   m.PopulateTensor<float>(m.input(), {1, 2, 3, 4});
@@ -241,9 +170,8 @@ TEST_P(TransposeConvOpTest, StrideValidTest) {
 //     tf.constant(np.arange(1, 5), shape=[ 1, 2, 2, 1 ], dtype=tf.float32),
 //     [1, 2, 2, 1 ],
 //     "VALID")
-TEST_P(TransposeConvOpTest, MultiChannelTest) {
-  TransposeConvOpModel m(GetRegistration(), {1, 2, 2, 1}, {2, 3, 3, 1},
-                         Padding_VALID, 2, 2);
+TEST(TransposeConvOpModelTest, MultiChannelTest) {
+  TransposeConvOpModel m({1, 2, 2, 1}, {2, 3, 3, 1}, Padding_VALID, 2, 2);
   m.PopulateTensor<int>(m.output_shape(), {1, 5, 5, 2});
   m.PopulateTensor<float>(m.filter(), {1, 3, 5, 7, 9, 11, 13, 15, 17, 2, 4, 6,
                                        8, 10, 12, 14, 16, 18});
@@ -259,24 +187,6 @@ TEST_P(TransposeConvOpTest, MultiChannelTest) {
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 5, 5, 2}));
 }
 
-// Test case: Same as above, but with a const "output_shape"
-TEST_P(TransposeConvOpTest, ConstMultiChannelTest) {
-  ConstTransposeConvOpModel m(GetRegistration(), {1, 2, 2, 1}, {2, 3, 3, 1},
-                              {1, 5, 5, 2}, Padding_VALID, 2, 2);
-  m.PopulateTensor<float>(m.filter(), {1, 3, 5, 7, 9, 11, 13, 15, 17, 2, 4, 6,
-                                       8, 10, 12, 14, 16, 18});
-  m.PopulateTensor<float>(m.input(), {1, 2, 3, 4});
-  m.Invoke();
-
-  EXPECT_THAT(
-      m.GetOutput(),
-      ElementsAreArray({1,  2,  3,  4,  7,  10,  6,   8,  10, 12, 7,  8,  9,
-                        10, 25, 28, 18, 20, 22,  24,  16, 20, 24, 28, 62, 72,
-                        42, 48, 54, 60, 21, 24,  27,  30, 61, 68, 36, 40, 44,
-                        48, 39, 42, 45, 48, 103, 110, 60, 64, 68, 72}));
-  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 5, 5, 2}));
-}
-
 // Test case:
 // filter = tf.constant(np.random.randint(1, 10, size=9),
 //                      shape=[ 3, 3, 1, 1 ],
@@ -289,9 +199,8 @@ TEST_P(TransposeConvOpTest, ConstMultiChannelTest) {
 //     "SAME")
 // And filter value is derived by:
 // filter = tf.reshape(tf.transpose(filter, perm=[3, 0, 1, 2]), shape=[-1])
-TEST_P(TransposeConvOpTest, AccuracyTest) {
-  TransposeConvOpModel m(GetRegistration(), {1, 1, 2, 1}, {1, 3, 3, 1},
-                         Padding_SAME, 3, 3);
+TEST(TransposeConvOpModelTest, AccuracyTest) {
+  TransposeConvOpModel m({1, 1, 2, 1}, {1, 3, 3, 1}, Padding_SAME, 3, 3);
   m.PopulateTensor<int>(m.output_shape(), {1, 3, 4, 1});
   m.PopulateTensor<float>(m.filter(), {9, 5, 6, 9, 8, 5, 3, 1, 4});
   m.PopulateTensor<float>(m.input(), {323, 521});
@@ -303,10 +212,6 @@ TEST_P(TransposeConvOpTest, AccuracyTest) {
   EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 3, 4, 1}));
 }
 
-INSTANTIATE_TEST_CASE_P(
-    TransposeConvOpTest, TransposeConvOpTest,
-    ::testing::ValuesIn(SingleOpTest::GetKernelTags(*kKernelMap)));
-
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
index 32daf2bb02..0acd705950 100644
--- a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
+++ b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
@@ -13,7 +13,6 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
@@ -274,7 +273,8 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   const int n_output = recurrent_to_output_weights->dims->data[1];
 
   // Check that input tensor dimensions matches with each other.
-  CheckInputTensorDimensions(context, node, n_input, n_output, n_cell);
+  TF_LITE_ENSURE_OK(context, CheckInputTensorDimensions(context, node, n_input,
+                                                        n_output, n_cell));
 
   // Get the pointer to output, output_state and cell_state buffer tensors.
   TfLiteTensor* output = GetOutput(context, node, kOutputTensor);
diff --git a/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc b/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc
index 164a0cbd08..0d6d29a171 100644
--- a/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc
+++ b/tensorflow/contrib/lite/kernels/unidirectional_sequence_rnn.cc
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
-#include <unistd.h>
 #include <cassert>
 #include <cmath>
 #include <cstdio>
diff --git a/tensorflow/contrib/lite/mmap_allocation.cc b/tensorflow/contrib/lite/mmap_allocation.cc
new file mode 100644
index 0000000000..fa9a3cd1d8
--- /dev/null
+++ b/tensorflow/contrib/lite/mmap_allocation.cc
@@ -0,0 +1,61 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "tensorflow/contrib/lite/allocation.h"
+#include "tensorflow/contrib/lite/error_reporter.h"
+
+namespace tflite {
+
+MMAPAllocation::MMAPAllocation(const char* filename,
+                               ErrorReporter* error_reporter)
+    : Allocation(error_reporter), mmapped_buffer_(MAP_FAILED) {
+  mmap_fd_ = open(filename, O_RDONLY);
+  if (mmap_fd_ == -1) {
+    error_reporter_->Report("Could not open '%s'.", filename);
+    return;
+  }
+  struct stat sb;
+  fstat(mmap_fd_, &sb);
+  buffer_size_bytes_ = sb.st_size;
+  mmapped_buffer_ =
+      mmap(nullptr, buffer_size_bytes_, PROT_READ, MAP_SHARED, mmap_fd_, 0);
+  if (mmapped_buffer_ == MAP_FAILED) {
+    error_reporter_->Report("Mmap of '%s' failed.", filename);
+    return;
+  }
+}
+
+MMAPAllocation::~MMAPAllocation() {
+  if (valid()) {
+    munmap(const_cast<void*>(mmapped_buffer_), buffer_size_bytes_);
+  }
+  if (mmap_fd_ != -1) close(mmap_fd_);
+}
+
+const void* MMAPAllocation::base() const { return mmapped_buffer_; }
+
+size_t MMAPAllocation::bytes() const { return buffer_size_bytes_; }
+
+bool MMAPAllocation::valid() const { return mmapped_buffer_ != MAP_FAILED; }
+
+bool MMAPAllocation::IsSupported() { return true; }
+
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc b/tensorflow/contrib/lite/mmap_allocation_disabled.cc
index 0e08a04370..f3d4cf1a25 100644
--- a/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc
+++ b/tensorflow/contrib/lite/mmap_allocation_disabled.cc
@@ -1,4 +1,4 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -13,26 +13,27 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include <jni.h>
-#include <time.h>
+#include "tensorflow/contrib/lite/allocation.h"
+
+#include <cassert>
 
 namespace tflite {
 
-// Gets the elapsed wall-clock timespec.
-timespec getCurrentTime() {
-  timespec time;
-  clock_gettime(CLOCK_MONOTONIC, &time);
-  return time;
+MMAPAllocation::MMAPAllocation(const char* filename,
+                               ErrorReporter* error_reporter)
+    : Allocation(error_reporter), mmapped_buffer_(nullptr) {
+  // The disabled variant should never be created.
+  assert(false);
 }
 
-// Computes the time diff from two timespecs. Returns '-1' if 'stop' is earlier
-// than 'start'.
-jlong timespec_diff_nanoseconds(struct timespec* start, struct timespec* stop) {
-  jlong result = stop->tv_sec - start->tv_sec;
-  if (result < 0) return -1;
-  result = 1000000000 * result + (stop->tv_nsec - start->tv_nsec);
-  if (result < 0) return -1;
-  return result;
-}
+MMAPAllocation::~MMAPAllocation() {}
+
+const void* MMAPAllocation::base() const { return nullptr; }
+
+size_t MMAPAllocation::bytes() const { return 0; }
+
+bool MMAPAllocation::valid() const { return false; }
+
+bool MMAPAllocation::IsSupported() { return false; }
 
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/model.cc b/tensorflow/contrib/lite/model.cc
index c448fb71db..e50b262ab8 100644
--- a/tensorflow/contrib/lite/model.cc
+++ b/tensorflow/contrib/lite/model.cc
@@ -16,10 +16,8 @@ limitations under the License.
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
-#include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/types.h>
-#include <unistd.h>
 
 #include "tensorflow/contrib/lite/allocation.h"
 #include "tensorflow/contrib/lite/builtin_op_data.h"
@@ -81,8 +79,8 @@ std::unique_ptr<Allocation> GetAllocationFromFile(const char* filename,
                                                   ErrorReporter* error_reporter,
                                                   bool use_nnapi) {
   std::unique_ptr<Allocation> allocation;
-  if (mmap_file) {
-    if (use_nnapi && NNAPIExists())
+  if (mmap_file && MMAPAllocation::IsSupported()) {
+    if (use_nnapi && NNAPIDelegate::IsSupported())
       allocation.reset(new NNAPIAllocation(filename, error_reporter));
     else
       allocation.reset(new MMAPAllocation(filename, error_reporter));
@@ -186,6 +184,8 @@ InterpreterBuilder::InterpreterBuilder(const ::tflite::Model* model,
       op_resolver_(op_resolver),
       error_reporter_(ValidateErrorReporter(error_reporter)) {}
 
+InterpreterBuilder::~InterpreterBuilder() {}
+
 TfLiteStatus InterpreterBuilder::BuildLocalIndexToRegistrationMapping() {
   TfLiteStatus status = kTfLiteOk;
   auto opcodes = model_->operator_codes();
@@ -204,8 +204,9 @@ TfLiteStatus InterpreterBuilder::BuildLocalIndexToRegistrationMapping() {
     } else if (builtin_code != BuiltinOperator_CUSTOM) {
       registration = op_resolver_.FindOp(builtin_code, version);
       if (registration == nullptr) {
-        error_reporter_->Report("Didn't find op for builtin opcode '%s'\n",
-                                EnumNameBuiltinOperator(builtin_code));
+        error_reporter_->Report(
+            "Didn't find op for builtin opcode '%s' version '%d'\n",
+            EnumNameBuiltinOperator(builtin_code), version);
         status = kTfLiteError;
       }
     } else if (!opcode->custom_code()) {
@@ -613,6 +614,8 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       break;
     }
     case BuiltinOperator_MEAN:
+    case BuiltinOperator_REDUCE_MAX:
+    case BuiltinOperator_REDUCE_PROD:
     case BuiltinOperator_SUM: {
       auto* params = MallocPOD<TfLiteReducerParams>();
       if (auto* schema_params = op->builtin_options_as_ReducerOptions()) {
@@ -661,6 +664,15 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = reinterpret_cast<void*>(params);
       break;
     }
+    case BuiltinOperator_ARG_MIN: {
+      auto* params = MallocPOD<TfLiteArgMinParams>();
+      if (const auto* schema_params = op->builtin_options_as_ArgMinOptions()) {
+        ConvertTensorType(schema_params->output_type(), &params->output_type,
+                          error_reporter);
+      }
+      *builtin_data = reinterpret_cast<void*>(params);
+      break;
+    }
     case BuiltinOperator_TRANSPOSE_CONV: {
       TfLiteTransposeConvParams* params =
           MallocPOD<TfLiteTransposeConvParams>();
@@ -692,11 +704,39 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
       *builtin_data = static_cast<void*>(params);
       break;
     }
+    case BuiltinOperator_PACK: {
+      TfLitePackParams* params = MallocPOD<TfLitePackParams>();
+      if (auto* pack_params = op->builtin_options_as_PackOptions()) {
+        params->values_count = pack_params->values_count();
+        params->axis = pack_params->axis();
+      }
+      *builtin_data = reinterpret_cast<void*>(params);
+      break;
+    }
     case BuiltinOperator_DELEGATE: {
       // TODO(ycling): Revisit when supporting saving delegated models.
       error_reporter->Report("DELEGATE op shouldn't exist in model.");
       return kTfLiteError;
     }
+    case BuiltinOperator_FAKE_QUANT: {
+      auto* params = MallocPOD<TfLiteFakeQuantParams>();
+      if (auto* schema_params = op->builtin_options_as_FakeQuantOptions()) {
+        params->min = schema_params->min();
+        params->max = schema_params->max();
+        params->num_bits = schema_params->num_bits();
+        params->narrow_range = schema_params->narrow_range();
+      }
+      *builtin_data = static_cast<void*>(params);
+      break;
+    }
+    case BuiltinOperator_ONE_HOT: {
+      auto* params = MallocPOD<TfLiteOneHotParams>();
+      if (auto* schema_params = op->builtin_options_as_OneHotOptions()) {
+        params->axis = schema_params->axis();
+      }
+      *builtin_data = static_cast<void*>(params);
+      break;
+    }
 
     // Below are the ops with no builtin_data strcture.
     case BuiltinOperator_BATCH_TO_SPACE_ND:
@@ -739,6 +779,9 @@ TfLiteStatus ParseOpData(const Operator* op, BuiltinOperator op_type,
     case BuiltinOperator_TOPK_V2:
     case BuiltinOperator_TRANSPOSE:
     case BuiltinOperator_POW:
+    case BuiltinOperator_LOGICAL_OR:
+    case BuiltinOperator_LOGICAL_AND:
+    case BuiltinOperator_LOGICAL_NOT:
       break;
   }
   return kTfLiteOk;
diff --git a/tensorflow/contrib/lite/model.h b/tensorflow/contrib/lite/model.h
index 3946b49041..8bc9ecd7ce 100644
--- a/tensorflow/contrib/lite/model.h
+++ b/tensorflow/contrib/lite/model.h
@@ -156,6 +156,7 @@ class InterpreterBuilder {
   InterpreterBuilder(const ::tflite::Model* model,
                      const OpResolver& op_resolver,
                      ErrorReporter* error_reporter = DefaultErrorReporter());
+  ~InterpreterBuilder();
   InterpreterBuilder(const InterpreterBuilder&) = delete;
   InterpreterBuilder& operator=(const InterpreterBuilder&) = delete;
   TfLiteStatus operator()(std::unique_ptr<Interpreter>* interpreter);
diff --git a/tensorflow/contrib/lite/model_test.cc b/tensorflow/contrib/lite/model_test.cc
index 15bae21a41..df4f60d4ad 100644
--- a/tensorflow/contrib/lite/model_test.cc
+++ b/tensorflow/contrib/lite/model_test.cc
@@ -19,7 +19,6 @@ limitations under the License.
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/types.h>
-#include <unistd.h>
 
 #include "tensorflow/contrib/lite/model.h"
 
@@ -242,14 +241,6 @@ TEST(BasicFlatBufferModel, TestWithNullVerifier) {
       "tensorflow/contrib/lite/testdata/test_model.bin", nullptr));
 }
 
-struct TestErrorReporter : public ErrorReporter {
-  int Report(const char* format, va_list args) override {
-    calls++;
-    return 0;
-  }
-  int calls = 0;
-};
-
 // This makes sure the ErrorReporter is marshalled from FlatBufferModel to
 // the Interpreter.
 TEST(BasicFlatBufferModel, TestCustomErrorReporter) {
@@ -263,7 +254,7 @@ TEST(BasicFlatBufferModel, TestCustomErrorReporter) {
   TrivialResolver resolver;
   InterpreterBuilder(*model, resolver)(&interpreter);
   ASSERT_NE(interpreter->Invoke(), kTfLiteOk);
-  ASSERT_EQ(reporter.calls, 1);
+  ASSERT_EQ(reporter.num_calls(), 1);
 }
 
 // This makes sure the ErrorReporter is marshalled from FlatBufferModel to
diff --git a/tensorflow/contrib/lite/models/smartreply/BUILD b/tensorflow/contrib/lite/models/smartreply/BUILD
index 8b5fa240ac..9d88c396ba 100644
--- a/tensorflow/contrib/lite/models/smartreply/BUILD
+++ b/tensorflow/contrib/lite/models/smartreply/BUILD
@@ -47,6 +47,7 @@ cc_test(
     name = "extract_feature_op_test",
     size = "small",
     srcs = ["ops/extract_feature_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":custom_ops",
         "//tensorflow/contrib/lite:framework",
@@ -61,6 +62,7 @@ cc_test(
     name = "normalize_op_test",
     size = "small",
     srcs = ["ops/normalize_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":custom_ops",
         "//tensorflow/contrib/lite:framework",
@@ -75,6 +77,7 @@ cc_test(
     name = "predict_op_test",
     size = "small",
     srcs = ["ops/predict_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":custom_ops",
         "//tensorflow/contrib/lite:framework",
diff --git a/tensorflow/contrib/lite/models/smartreply/predictor.h b/tensorflow/contrib/lite/models/smartreply/predictor.h
index 90260c8d62..3151192d92 100644
--- a/tensorflow/contrib/lite/models/smartreply/predictor.h
+++ b/tensorflow/contrib/lite/models/smartreply/predictor.h
@@ -65,9 +65,9 @@ struct SmartReplyConfig {
   float backoff_confidence;
   // Backoff responses are used when predicted responses cannot fulfill the
   // list.
-  const std::vector<std::string>& backoff_responses;
+  std::vector<std::string> backoff_responses;
 
-  SmartReplyConfig(std::vector<std::string> backoff_responses)
+  SmartReplyConfig(const std::vector<std::string>& backoff_responses)
       : num_response(kDefaultNumResponse),
         backoff_confidence(kDefaultBackoffConfidence),
         backoff_responses(backoff_responses) {}
diff --git a/tensorflow/contrib/lite/nnapi_delegate.cc b/tensorflow/contrib/lite/nnapi_delegate.cc
index 905c0919cb..c91f488175 100644
--- a/tensorflow/contrib/lite/nnapi_delegate.cc
+++ b/tensorflow/contrib/lite/nnapi_delegate.cc
@@ -548,6 +548,26 @@ TfLiteStatus AddOpsAndParams(
         add_squeeze_params(node.builtin_data);
         nn_op_type = ANEURALNETWORKS_SQUEEZE;
         break;
+      case tflite::BuiltinOperator_TRANSPOSE:
+        // The permutation input tensor value dictates the output dimensions.
+        // TODO(b/110888333): Support dynamically-sized tensors in delegates.
+        if ((node.inputs->size > 1) &&
+            (interpreter->tensor(node.inputs->data[1])->allocation_type !=
+             kTfLiteMmapRo)) {
+          logError("NNAPI does not yet support dynamic tensors.");
+          return kTfLiteError;
+        }
+        nnapi_version = 11;  // require NNAPI 1.1
+        nn_op_type = ANEURALNETWORKS_TRANSPOSE;
+        break;
+      case tflite::BuiltinOperator_L2_NORMALIZATION:
+        nn_op_type = ANEURALNETWORKS_L2_NORMALIZATION;
+        if (reinterpret_cast<TfLiteL2NormParams*>(node.builtin_data)
+                ->activation != kTfLiteActNone) {
+          FATAL(
+              "NNAPI does not support L2Normalization with fused activations");
+        }
+        break;
       case tflite::BuiltinOperator_CONCAT_EMBEDDINGS:
       case tflite::BuiltinOperator_LSH_PROJECTION:
       case tflite::BuiltinOperator_HASHTABLE_LOOKUP:
@@ -556,7 +576,6 @@ TfLiteStatus AddOpsAndParams(
       case tflite::BuiltinOperator_EMBEDDING_LOOKUP_SPARSE:
       case tflite::BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM:
       case tflite::BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_LSTM:
-      case tflite::BuiltinOperator_L2_NORMALIZATION:
       case tflite::BuiltinOperator_LOCAL_RESPONSE_NORMALIZATION:
       case tflite::BuiltinOperator_PADV2:
       case tflite::BuiltinOperator_RESIZE_BILINEAR:
@@ -567,7 +586,6 @@ TfLiteStatus AddOpsAndParams(
       case tflite::BuiltinOperator_SPACE_TO_BATCH_ND:
       case tflite::BuiltinOperator_BATCH_TO_SPACE_ND:
       case tflite::BuiltinOperator_TOPK_V2:
-      case tflite::BuiltinOperator_TRANSPOSE:
       case tflite::BuiltinOperator_SPLIT:
       case tflite::BuiltinOperator_STRIDED_SLICE:
       case tflite::BuiltinOperator_EXP:
@@ -579,6 +597,7 @@ TfLiteStatus AddOpsAndParams(
       case tflite::BuiltinOperator_MAXIMUM:
       case tflite::BuiltinOperator_MINIMUM:
       case tflite::BuiltinOperator_ARG_MAX:
+      case tflite::BuiltinOperator_ARG_MIN:
       case tflite::BuiltinOperator_GREATER:
       case tflite::BuiltinOperator_GREATER_EQUAL:
       case tflite::BuiltinOperator_LESS:
@@ -595,10 +614,18 @@ TfLiteStatus AddOpsAndParams(
       case tflite::BuiltinOperator_EQUAL:
       case tflite::BuiltinOperator_NOT_EQUAL:
       case tflite::BuiltinOperator_SUM:
+      case tflite::BuiltinOperator_REDUCE_MAX:
+      case tflite::BuiltinOperator_REDUCE_PROD:
       case tflite::BuiltinOperator_SQRT:
       case tflite::BuiltinOperator_RSQRT:
       case tflite::BuiltinOperator_SHAPE:
       case tflite::BuiltinOperator_POW:
+      case tflite::BuiltinOperator_FAKE_QUANT:
+      case tflite::BuiltinOperator_PACK:
+      case tflite::BuiltinOperator_LOGICAL_OR:
+      case tflite::BuiltinOperator_ONE_HOT:
+      case tflite::BuiltinOperator_LOGICAL_AND:
+      case tflite::BuiltinOperator_LOGICAL_NOT:
         logError("Op code %d is currently not delegated to NNAPI", builtin);
         return kTfLiteError;
         break;
@@ -764,4 +791,6 @@ TfLiteStatus NNAPIDelegate::Invoke(Interpreter* interpreter) {
   return kTfLiteOk;
 }
 
+bool NNAPIDelegate::IsSupported() { return NNAPIExists(); }
+
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/nnapi_delegate.h b/tensorflow/contrib/lite/nnapi_delegate.h
index 8dc7d38a30..2bdb2cc5c8 100644
--- a/tensorflow/contrib/lite/nnapi_delegate.h
+++ b/tensorflow/contrib/lite/nnapi_delegate.h
@@ -19,9 +19,10 @@ limitations under the License.
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/error_reporter.h"
 #include "tensorflow/contrib/lite/interpreter.h"
-#include "tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h"
 
-class ANeuralNetworsModel;
+class ANeuralNetworksModel;
+class ANeuralNetworksMemory;
+class ANeuralNetworksCompilation;
 
 namespace tflite {
 
@@ -54,6 +55,9 @@ class NNAPIDelegate {
   // Run
   TfLiteStatus Invoke(Interpreter* interpreter);
 
+  // Whether the current platform supports NNAPI delegation.
+  static bool IsSupported();
+
  private:
   // The NN API model handle
   ANeuralNetworksModel* nn_model_ = nullptr;
diff --git a/tensorflow/contrib/lite/nnapi_delegate_disabled.cc b/tensorflow/contrib/lite/nnapi_delegate_disabled.cc
new file mode 100644
index 0000000000..efde72b1a7
--- /dev/null
+++ b/tensorflow/contrib/lite/nnapi_delegate_disabled.cc
@@ -0,0 +1,42 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/nnapi_delegate.h"
+
+#include <cassert>
+
+namespace tflite {
+
+NNAPIAllocation::NNAPIAllocation(const char* filename,
+                                 ErrorReporter* error_reporter)
+    : MMAPAllocation(filename, error_reporter) {
+  // The disabled variant should never be created.
+  assert(false);
+}
+
+NNAPIAllocation::~NNAPIAllocation() {}
+
+NNAPIDelegate::~NNAPIDelegate() {}
+
+TfLiteStatus NNAPIDelegate::BuildGraph(Interpreter* interpreter) {
+  return kTfLiteError;
+}
+
+TfLiteStatus NNAPIDelegate::Invoke(Interpreter* interpreter) {
+  return kTfLiteError;
+}
+
+bool NNAPIDelegate::IsSupported() { return false; }
+
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/profiling/BUILD b/tensorflow/contrib/lite/profiling/BUILD
index a162b87b8f..1172722f7a 100644
--- a/tensorflow/contrib/lite/profiling/BUILD
+++ b/tensorflow/contrib/lite/profiling/BUILD
@@ -58,6 +58,7 @@ cc_test(
     name = "profile_summarizer_test",
     srcs = ["profile_summarizer_test.cc"],
     copts = common_copts,
+    tags = ["no_oss"],
     deps = [
         ":profile_summarizer",
         "//tensorflow/contrib/lite:framework",
diff --git a/tensorflow/contrib/lite/profiling/profile_summarizer.cc b/tensorflow/contrib/lite/profiling/profile_summarizer.cc
index c37a096588..720bd717b9 100644
--- a/tensorflow/contrib/lite/profiling/profile_summarizer.cc
+++ b/tensorflow/contrib/lite/profiling/profile_summarizer.cc
@@ -23,8 +23,6 @@ namespace tflite {
 namespace profiling {
 namespace {
 
-using Detail = tensorflow::StatsCalculator::Detail;
-
 struct OperatorDetails {
   std::string name;
   std::vector<std::string> inputs;
@@ -83,7 +81,7 @@ OperatorDetails GetOperatorDetails(const tflite::Interpreter& interpreter,
   OperatorDetails details;
   details.name = op_name;
   if (profiling_string) {
-    details.name += ":" + string(profiling_string);
+    details.name += ":" + std::string(profiling_string);
   }
   details.inputs = GetTensorNames(interpreter, inputs);
   details.outputs = GetTensorNames(interpreter, outputs);
@@ -125,28 +123,17 @@ void ProfileSummarizer::ProcessProfiles(
   int64_t base_start_us = events[0]->begin_timestamp_us;
   int node_num = 0;
   int64_t curr_total_us = 0;
-  std::map<std::string, Detail> details;
   for (auto event : events) {
     auto op_details = GetOperatorDetails(interpreter, event->event_metadata);
     auto node_name = ToString(op_details.outputs);
-    auto result = details.emplace(node_name, Detail());
-    Detail* detail = &(result.first->second);
-    detail->start_us.UpdateStat(event->begin_timestamp_us - base_start_us);
+    int64_t start_us = event->begin_timestamp_us - base_start_us;
     int64_t node_exec_time =
         event->end_timestamp_us - event->begin_timestamp_us;
-    detail->rel_end_us.UpdateStat(node_exec_time);
+    stats_calculator_->AddNodeStats(node_name, op_details.name, node_num,
+                                    start_us, node_exec_time, 0 /*memory */);
     curr_total_us += node_exec_time;
     ++node_num;
-
-    if (result.second) {
-      detail->name = node_name;
-      detail->type = op_details.name;
-      detail->run_order = node_num;
-      detail->times_called = 0;
-    }
-    ++detail->times_called;
   }
-  stats_calculator_->UpdateDetails(details);
   stats_calculator_->UpdateRunTotalUs(curr_total_us);
 }
 }  // namespace profiling
diff --git a/tensorflow/contrib/lite/profiling/time.cc b/tensorflow/contrib/lite/profiling/time.cc
index 446660bb74..875ddb02bc 100644
--- a/tensorflow/contrib/lite/profiling/time.cc
+++ b/tensorflow/contrib/lite/profiling/time.cc
@@ -14,16 +14,34 @@ limitations under the License.
 ==============================================================================*/
 #include "tensorflow/contrib/lite/profiling/time.h"
 
+#if defined(_MSC_VER)
+#include <chrono>  // NOLINT(build/c++11)
+#else
 #include <sys/time.h>
+#endif
 
 namespace tflite {
 namespace profiling {
 namespace time {
+
+#if defined(_MSC_VER)
+
+uint64_t NowMicros() {
+  return std::chrono::duration_cast<std::chrono::microseconds>(
+             std::chrono::system_clock::now().time_since_epoch())
+      .count();
+}
+
+#else
+
 uint64_t NowMicros() {
   struct timeval tv;
   gettimeofday(&tv, nullptr);
   return static_cast<uint64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
 }
+
+#endif  // defined(_MSC_VER)
+
 }  // namespace time
 }  // namespace profiling
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/python/BUILD b/tensorflow/contrib/lite/python/BUILD
index 27909a9458..860aff9e7e 100644
--- a/tensorflow/contrib/lite/python/BUILD
+++ b/tensorflow/contrib/lite/python/BUILD
@@ -19,6 +19,8 @@ py_library(
     visibility = ["//visibility:public"],
     deps = [
         "//tensorflow/contrib/lite/python/interpreter_wrapper:tensorflow_wrap_interpreter_wrapper",
+        "//tensorflow/python:util",
+        "//third_party/py/numpy",
     ],
 )
 
@@ -30,9 +32,10 @@ py_test(
     tags = ["no_oss"],
     deps = [
         ":interpreter",
-        "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
-        "//tensorflow/python:platform_test",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:platform",
+        "//third_party/py/numpy",
     ],
 )
 
@@ -69,7 +72,10 @@ py_test(
     srcs = ["lite_test.py"],
     data = [":interpreter_test_data"],
     srcs_version = "PY2AND3",
-    tags = ["no_windows"],
+    tags = [
+        "no_oss",
+        "no_windows",
+    ],
     deps = [
         ":lite",
     ],
@@ -161,7 +167,10 @@ py_test(
     name = "convert_saved_model_test",
     srcs = ["convert_saved_model_test.py"],
     srcs_version = "PY2AND3",
-    tags = ["no_windows"],
+    tags = [
+        "no_oss",
+        "no_windows",
+    ],
     visibility = ["//visibility:public"],
     deps = [
         ":convert_saved_model",
diff --git a/tensorflow/contrib/lite/python/convert.py b/tensorflow/contrib/lite/python/convert.py
index 0ea2630f71..ec49738fb5 100644
--- a/tensorflow/contrib/lite/python/convert.py
+++ b/tensorflow/contrib/lite/python/convert.py
@@ -115,6 +115,7 @@ def build_toco_convert_protos(input_tensors,
                               inference_type=lite_constants.FLOAT,
                               inference_input_type=None,
                               input_format=lite_constants.TENSORFLOW_GRAPHDEF,
+                              input_shapes=None,
                               output_format=lite_constants.TFLITE,
                               quantized_input_stats=None,
                               default_ranges_stats=None,
@@ -141,6 +142,8 @@ def build_toco_convert_protos(input_tensors,
       Must be `{FLOAT, QUANTIZED_UINT8}`. (default `inference_type`)
     input_format: Type of data to read Currently must be
       `{TENSORFLOW_GRAPHDEF}`. (default TENSORFLOW_GRAPHDEF)
+    input_shapes: Input array shape. It needs to be a list of the same length
+      as `input_tensors`, or None. (default None)
     output_format: Output file format. Currently must be `{TFLITE,
       GRAPHVIZ_DOT}`. (default TFLITE)
     quantized_input_stats: List of tuples of integers representing the mean and
@@ -209,7 +212,11 @@ def build_toco_convert_protos(input_tensors,
     if inference_type == lite_constants.QUANTIZED_UINT8:
       input_array.mean_value, input_array.std_value = quantized_input_stats[idx]
     input_array.name = tensor_name(input_tensor)
-    input_array.shape.dims.extend(map(int, input_tensor.get_shape()))
+    if input_shapes is None:
+      shape = input_tensor.get_shape()
+    else:
+      shape = input_shapes[idx]
+    input_array.shape.dims.extend(map(int, shape))
 
   for output_tensor in output_tensors:
     model.output_arrays.append(tensor_name(output_tensor))
diff --git a/tensorflow/contrib/lite/python/interpreter.py b/tensorflow/contrib/lite/python/interpreter.py
index fd90823425..3243bddac8 100644
--- a/tensorflow/contrib/lite/python/interpreter.py
+++ b/tensorflow/contrib/lite/python/interpreter.py
@@ -18,6 +18,7 @@ from __future__ import division
 from __future__ import print_function
 
 import sys
+import numpy as np
 from tensorflow.python.util.lazy_loader import LazyLoader
 
 # Lazy load since some of the performance benchmark skylark rules
@@ -56,9 +57,6 @@ class Interpreter(object):
       self._interpreter = (
           _interpreter_wrapper.InterpreterWrapper_CreateWrapperCPPFromBuffer(
               model_content))
-      if not self._interpreter:
-        raise ValueError(
-            'Failed to create model from {} bytes'.format(len(model_content)))
     elif not model_path and not model_path:
       raise ValueError('`model_path` or `model_content` must be specified.')
     else:
@@ -66,8 +64,7 @@ class Interpreter(object):
 
   def allocate_tensors(self):
     self._ensure_safe()
-    if not self._interpreter.AllocateTensors():
-      raise ValueError('Failed to allocate tensors')
+    return self._interpreter.AllocateTensors()
 
   def _safe_to_run(self):
     """Returns true if there exist no numpy array buffers.
@@ -152,8 +149,7 @@ class Interpreter(object):
     Raises:
       ValueError: If the interpreter could not set the tensor.
     """
-    if not self._interpreter.SetTensor(tensor_index, value):
-      raise ValueError('Failed to set tensor')
+    self._interpreter.SetTensor(tensor_index, value)
 
   def resize_tensor_input(self, input_index, tensor_size):
     """Resizes an input tensor.
@@ -167,8 +163,10 @@ class Interpreter(object):
       ValueError: If the interpreter could not resize the input tensor.
     """
     self._ensure_safe()
-    if not self._interpreter.ResizeInputTensor(input_index, tensor_size):
-      raise ValueError('Failed to resize input')
+    # `ResizeInputTensor` now only accepts int32 numpy array as `tensor_size
+    # parameter.
+    tensor_size = np.array(tensor_size, dtype=np.int32)
+    self._interpreter.ResizeInputTensor(input_index, tensor_size)
 
   def get_output_details(self):
     """Gets model output details.
@@ -181,7 +179,9 @@ class Interpreter(object):
     ]
 
   def get_tensor(self, tensor_index):
-    """Gets the value of the input tensor. Note this makes a copy so prefer `tensor()`.
+    """Gets the value of the input tensor (get a copy).
+
+    If you wish to avoid the copy, use `tensor()`.
 
     Args:
       tensor_index: Tensor index of tensor to get. This value can be gotten from
@@ -208,7 +208,7 @@ class Interpreter(object):
     for i in range(10):
       input().fill(3.)
       interpreter.invoke()
-      print("inference %s" % output)
+      print("inference %s" % output())
 
     Notice how this function avoids making a numpy array directly. This is
     because it is important to not hold actual numpy views to the data longer
@@ -247,5 +247,7 @@ class Interpreter(object):
       ValueError: When the underlying interpreter fails raise ValueError.
     """
     self._ensure_safe()
-    if not self._interpreter.Invoke():
-      raise ValueError('Failed to invoke TFLite model')
+    self._interpreter.Invoke()
+
+  def reset_all_variables_to_zero(self):
+    return self._interpreter.ResetVariableTensorsToZero()
diff --git a/tensorflow/contrib/lite/python/interpreter_test.py b/tensorflow/contrib/lite/python/interpreter_test.py
index 5f1fa26c3b..e77d52ca99 100644
--- a/tensorflow/contrib/lite/python/interpreter_test.py
+++ b/tensorflow/contrib/lite/python/interpreter_test.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 
 import io
 import numpy as np
+import six
 
 from tensorflow.contrib.lite.python import interpreter as interpreter_wrapper
 from tensorflow.python.framework import test_util
@@ -82,7 +83,7 @@ class InterpreterTest(test_util.TensorFlowTestCase):
     test_input = np.array([[1, 2, 3, 4]], dtype=np.uint8)
     expected_output = np.array([[4, 3, 2, 1]], dtype=np.uint8)
     interpreter.resize_tensor_input(input_details[0]['index'],
-                                    np.array(test_input.shape, dtype=np.int32))
+                                    test_input.shape)
     interpreter.allocate_tensors()
     interpreter.set_tensor(input_details[0]['index'], test_input)
     interpreter.invoke()
@@ -91,6 +92,28 @@ class InterpreterTest(test_util.TensorFlowTestCase):
     self.assertTrue((expected_output == output_data).all())
 
 
+class InterpreterTestErrorPropagation(test_util.TensorFlowTestCase):
+
+  def testInvalidModelContent(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 'Model provided has model identifier \''):
+      interpreter_wrapper.Interpreter(model_content=six.b('garbage'))
+
+  def testInvalidModelFile(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'Could not open \'totally_invalid_file_name\''):
+      interpreter_wrapper.Interpreter(
+          model_path='totally_invalid_file_name')
+
+  def testInvokeBeforeReady(self):
+    interpreter = interpreter_wrapper.Interpreter(
+        model_path=resource_loader.get_path_to_datafile(
+            'testdata/permute_float.tflite'))
+    with self.assertRaisesRegexp(RuntimeError,
+                                 'Invoke called on model that is not ready'):
+      interpreter.invoke()
+
+
 class InterpreterTensorAccessorTest(test_util.TensorFlowTestCase):
 
   def setUp(self):
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD b/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
index 634c2a1e1f..69ee95c320 100644
--- a/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/BUILD
@@ -13,7 +13,6 @@ cc_library(
     deps = [
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite/kernels:builtin_ops",
-        "//tensorflow/core:lib",
         "//third_party/py/numpy:headers",
         "//third_party/python_runtime:headers",
         "@com_google_absl//absl/memory",
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
index 5554d08fa0..9ab05f3068 100644
--- a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.cc
@@ -14,13 +14,13 @@ limitations under the License.
 ==============================================================================*/
 #include "tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h"
 
+#include <sstream>
 #include <string>
 
 #include "absl/memory/memory.h"
 #include "tensorflow/contrib/lite/interpreter.h"
 #include "tensorflow/contrib/lite/kernels/register.h"
 #include "tensorflow/contrib/lite/model.h"
-#include "tensorflow/core/platform/logging.h"
 
 // Disallow Numpy 1.7 deprecated symbols.
 #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
@@ -38,9 +38,58 @@ limitations under the License.
 #define CPP_TO_PYSTRING PyString_FromStringAndSize
 #endif
 
+#define TFLITE_PY_CHECK(x)               \
+  if ((x) != kTfLiteOk) {                \
+    return error_reporter_->exception(); \
+  }
+
+#define TFLITE_PY_TENSOR_BOUNDS_CHECK(i)                                    \
+  if (i >= interpreter_->tensors_size() || i < 0) {                         \
+    PyErr_Format(PyExc_ValueError,                                          \
+                 "Invalid tensor index %d exceeds max tensor index %lu", i, \
+                 interpreter_->tensors_size());                             \
+    return nullptr;                                                         \
+  }
+
+#define TFLITE_PY_ENSURE_VALID_INTERPRETER()                               \
+  if (!interpreter_) {                                                     \
+    PyErr_SetString(PyExc_ValueError, "Interpreter was not initialized."); \
+    return nullptr;                                                        \
+  }
+
 namespace tflite {
 namespace interpreter_wrapper {
 
+class PythonErrorReporter : public tflite::ErrorReporter {
+ public:
+  PythonErrorReporter() {}
+
+  // Report an error message
+  int Report(const char* format, va_list args) override {
+    char buf[1024];
+    int formatted = vsnprintf(buf, sizeof(buf), format, args);
+    buffer_ << buf;
+    return formatted;
+  }
+
+  // Set's a Python runtime exception with the last error.
+  PyObject* exception() {
+    std::string last_message = message();
+    PyErr_SetString(PyExc_RuntimeError, last_message.c_str());
+    return nullptr;
+  }
+
+  // Gets the last error message and clears the buffer.
+  std::string message() {
+    std::string value = buffer_.str();
+    buffer_.clear();
+    return value;
+  }
+
+ private:
+  std::stringstream buffer_;
+};
+
 namespace {
 
 // Calls PyArray's initialization to initialize all the API pointers. Note that
@@ -59,19 +108,8 @@ std::unique_ptr<tflite::Interpreter> CreateInterpreter(
   ImportNumpy();
 
   std::unique_ptr<tflite::Interpreter> interpreter;
-  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
-  if (interpreter) {
-    for (const int input_index : interpreter->inputs()) {
-      const TfLiteTensor* tensor = interpreter->tensor(input_index);
-      CHECK(tensor);
-      const TfLiteIntArray* dims = tensor->dims;
-      if (!dims) {
-        continue;
-      }
-
-      std::vector<int> input_dims(dims->data, dims->data + dims->size);
-      interpreter->ResizeInputTensor(input_index, input_dims);
-    }
+  if (tflite::InterpreterBuilder(*model, resolver)(&interpreter) != kTfLiteOk) {
+    return nullptr;
   }
   return interpreter;
 }
@@ -95,10 +133,10 @@ int TfLiteTypeToPyArrayType(TfLiteType tf_lite_type) {
     case kTfLiteComplex64:
       return NPY_COMPLEX64;
     case kTfLiteNoType:
-      return -1;
+      return NPY_NOTYPE;
+      // Avoid default so compiler errors created when new types are made.
   }
-  LOG(ERROR) << "Unknown TfLiteType " << tf_lite_type;
-  return -1;
+  return NPY_NOTYPE;
 }
 
 TfLiteType TfLiteTypeFromPyArray(PyArrayObject* array) {
@@ -122,8 +160,8 @@ TfLiteType TfLiteTypeFromPyArray(PyArrayObject* array) {
       return kTfLiteString;
     case NPY_COMPLEX64:
       return kTfLiteComplex64;
+      // Avoid default so compiler errors created when new types are made.
   }
-  LOG(ERROR) << "Unknown PyArray dtype " << pyarray_type;
   return kTfLiteNoType;
 }
 
@@ -146,33 +184,54 @@ PyObject* PyTupleFromQuantizationParam(const TfLiteQuantizationParams& param) {
 
 }  // namespace
 
+InterpreterWrapper* InterpreterWrapper::CreateInterpreterWrapper(
+    std::unique_ptr<tflite::FlatBufferModel> model,
+    std::unique_ptr<PythonErrorReporter> error_reporter,
+    std::string* error_msg) {
+  if (!model) {
+    *error_msg = error_reporter->message();
+    return nullptr;
+  }
+
+  auto resolver = absl::make_unique<tflite::ops::builtin::BuiltinOpResolver>();
+  auto interpreter = CreateInterpreter(model.get(), *resolver);
+  if (!interpreter) {
+    *error_msg = error_reporter->message();
+    return nullptr;
+  }
+
+  InterpreterWrapper* wrapper =
+      new InterpreterWrapper(std::move(model), std::move(error_reporter),
+                             std::move(resolver), std::move(interpreter));
+  return wrapper;
+}
+
 InterpreterWrapper::InterpreterWrapper(
-    std::unique_ptr<tflite::FlatBufferModel> model)
+    std::unique_ptr<tflite::FlatBufferModel> model,
+    std::unique_ptr<PythonErrorReporter> error_reporter,
+    std::unique_ptr<tflite::ops::builtin::BuiltinOpResolver> resolver,
+    std::unique_ptr<tflite::Interpreter> interpreter)
     : model_(std::move(model)),
-      resolver_(absl::make_unique<tflite::ops::builtin::BuiltinOpResolver>()),
-      interpreter_(CreateInterpreter(model_.get(), *resolver_)) {}
+      error_reporter_(std::move(error_reporter)),
+      resolver_(std::move(resolver)),
+      interpreter_(std::move(interpreter)) {}
 
 InterpreterWrapper::~InterpreterWrapper() {}
 
-bool InterpreterWrapper::AllocateTensors() {
-  if (!interpreter_) {
-    LOG(ERROR) << "Cannot allocate tensors: invalid interpreter.";
-    return false;
-  }
-
-  if (interpreter_->AllocateTensors() != kTfLiteOk) {
-    LOG(ERROR) << "Unable to allocate tensors.";
-    return false;
-  }
-
-  return true;
+PyObject* InterpreterWrapper::AllocateTensors() {
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_CHECK(interpreter_->AllocateTensors());
+  Py_RETURN_NONE;
 }
 
-bool InterpreterWrapper::Invoke() {
-  return interpreter_ ? (interpreter_->Invoke() == kTfLiteOk) : false;
+PyObject* InterpreterWrapper::Invoke() {
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_CHECK(interpreter_->Invoke());
+  Py_RETURN_NONE;
 }
 
 PyObject* InterpreterWrapper::InputIndices() const {
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
   PyObject* np_array = PyArrayFromIntVector(interpreter_->inputs().data(),
                                             interpreter_->inputs().size());
 
@@ -186,35 +245,36 @@ PyObject* InterpreterWrapper::OutputIndices() const {
   return PyArray_Return(reinterpret_cast<PyArrayObject*>(np_array));
 }
 
-bool InterpreterWrapper::ResizeInputTensor(int i, PyObject* value) {
-  if (!interpreter_) {
-    LOG(ERROR) << "Invalid interpreter.";
-    return false;
-  }
+PyObject* InterpreterWrapper::ResizeInputTensor(int i, PyObject* value) {
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
 
   std::unique_ptr<PyObject, PyDecrefDeleter> array_safe(
       PyArray_FromAny(value, nullptr, 0, 0, NPY_ARRAY_CARRAY, nullptr));
   if (!array_safe) {
-    LOG(ERROR) << "Failed to convert value into readable tensor.";
-    return false;
+    PyErr_SetString(PyExc_ValueError,
+                    "Failed to convert numpy value into readable tensor.");
+    return nullptr;
   }
 
   PyArrayObject* array = reinterpret_cast<PyArrayObject*>(array_safe.get());
 
   if (PyArray_NDIM(array) != 1) {
-    LOG(ERROR) << "Expected 1-D defining input shape.";
-    return false;
+    PyErr_Format(PyExc_ValueError, "Shape should be 1D instead of %d.",
+                 PyArray_NDIM(array));
+    return nullptr;
   }
 
   if (PyArray_TYPE(array) != NPY_INT32) {
-    LOG(ERROR) << "Shape must be an int32 array";
-    return false;
+    PyErr_Format(PyExc_ValueError, "Shape must be type int32 (was %d).",
+                 PyArray_TYPE(array));
+    return nullptr;
   }
 
   std::vector<int> dims(PyArray_SHAPE(array)[0]);
   memcpy(dims.data(), PyArray_BYTES(array), dims.size() * sizeof(int));
 
-  return (interpreter_->ResizeInputTensor(i, dims) == kTfLiteOk);
+  TFLITE_PY_CHECK(interpreter_->ResizeInputTensor(i, dims));
+  Py_RETURN_NONE;
 }
 
 std::string InterpreterWrapper::TensorName(int i) const {
@@ -227,21 +287,21 @@ std::string InterpreterWrapper::TensorName(int i) const {
 }
 
 PyObject* InterpreterWrapper::TensorType(int i) const {
-  if (!interpreter_ || i >= interpreter_->tensors_size() || i < 0) {
-    return nullptr;
-  }
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_TENSOR_BOUNDS_CHECK(i);
 
   const TfLiteTensor* tensor = interpreter_->tensor(i);
-  int typenum = TfLiteTypeToPyArrayType(tensor->type);
-  return PyArray_TypeObjectFromType(typenum);
+  int code = TfLiteTypeToPyArrayType(tensor->type);
+  if (code == -1) {
+    PyErr_Format(PyExc_ValueError, "Invalid tflite type code %d", code);
+    return nullptr;
+  }
+  return PyArray_TypeObjectFromType(code);
 }
 
 PyObject* InterpreterWrapper::TensorSize(int i) const {
-  if (!interpreter_ || i >= interpreter_->tensors_size() || i < 0) {
-    Py_INCREF(Py_None);
-    return Py_None;
-  }
-
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_TENSOR_BOUNDS_CHECK(i);
   const TfLiteTensor* tensor = interpreter_->tensor(i);
   PyObject* np_array =
       PyArrayFromIntVector(tensor->dims->data, tensor->dims->size);
@@ -250,100 +310,87 @@ PyObject* InterpreterWrapper::TensorSize(int i) const {
 }
 
 PyObject* InterpreterWrapper::TensorQuantization(int i) const {
-  if (!interpreter_ || i >= interpreter_->tensors_size() || i < 0) {
-    Py_INCREF(Py_None);
-    return Py_None;
-  }
-
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_TENSOR_BOUNDS_CHECK(i);
   const TfLiteTensor* tensor = interpreter_->tensor(i);
   return PyTupleFromQuantizationParam(tensor->params);
 }
 
-bool InterpreterWrapper::SetTensor(int i, PyObject* value) {
-  if (!interpreter_) {
-    LOG(ERROR) << "Invalid interpreter.";
-    return false;
-  }
-
-  if (i >= interpreter_->tensors_size()) {
-    LOG(ERROR) << "Invalid tensor index: " << i << " exceeds max tensor index "
-               << interpreter_->tensors_size();
-    return false;
-  }
+PyObject* InterpreterWrapper::SetTensor(int i, PyObject* value) {
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_TENSOR_BOUNDS_CHECK(i);
 
   std::unique_ptr<PyObject, PyDecrefDeleter> array_safe(
       PyArray_FromAny(value, nullptr, 0, 0, NPY_ARRAY_CARRAY, nullptr));
   if (!array_safe) {
-    LOG(ERROR) << "Failed to convert value into readable tensor.";
-    return false;
+    PyErr_SetString(PyExc_ValueError,
+                    "Failed to convert value into readable tensor.");
+    return nullptr;
   }
 
   PyArrayObject* array = reinterpret_cast<PyArrayObject*>(array_safe.get());
   const TfLiteTensor* tensor = interpreter_->tensor(i);
 
   if (TfLiteTypeFromPyArray(array) != tensor->type) {
-    LOG(ERROR) << "Cannot set tensor:"
-               << " Got tensor of type " << TfLiteTypeFromPyArray(array)
-               << " but expected type " << tensor->type << " for input " << i;
-    return false;
+    PyErr_Format(PyExc_ValueError,
+                 "Cannot set tensor:"
+                 " Got tensor of type %d"
+                 " but expected type %d for input %d ",
+                 TfLiteTypeFromPyArray(array), tensor->type, i);
+    return nullptr;
   }
 
   if (PyArray_NDIM(array) != tensor->dims->size) {
-    LOG(ERROR) << "Cannot set tensor: Dimension mismatch";
-    return false;
+    PyErr_SetString(PyExc_ValueError, "Cannot set tensor: Dimension mismatch");
+    return nullptr;
   }
 
   for (int j = 0; j < PyArray_NDIM(array); j++) {
     if (tensor->dims->data[j] != PyArray_SHAPE(array)[j]) {
-      LOG(ERROR) << "Cannot set tensor: Dimension mismatch";
-      return false;
+      PyErr_SetString(PyExc_ValueError,
+                      "Cannot set tensor: Dimension mismatch");
+      return nullptr;
     }
   }
 
   size_t size = PyArray_NBYTES(array);
-  DCHECK_EQ(size, tensor->bytes);
+  if (size != tensor->bytes) {
+    PyErr_Format(PyExc_ValueError,
+                 "numpy array had %zu bytes but expected %zu bytes.", size,
+                 tensor->bytes);
+    return nullptr;
+  }
   memcpy(tensor->data.raw, PyArray_DATA(array), size);
-  return true;
+  Py_RETURN_NONE;
 }
 
 namespace {
 
-PyObject* CheckGetTensorArgs(Interpreter* interpreter, int tensor_index,
+// Checks to see if a tensor access can succeed (returns nullptr on error).
+// Otherwise returns Py_None.
+PyObject* CheckGetTensorArgs(Interpreter* interpreter_, int tensor_index,
                              TfLiteTensor** tensor, int* type_num) {
-  if (!interpreter) {
-    LOG(ERROR) << "Invalid interpreter.";
-    Py_INCREF(Py_None);
-    return Py_None;
-  }
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_TENSOR_BOUNDS_CHECK(tensor_index);
 
-  if (tensor_index >= interpreter->tensors_size() || tensor_index < 0) {
-    LOG(ERROR) << "Invalid tensor index: " << tensor_index
-               << " exceeds max tensor index " << interpreter->inputs().size();
-    Py_INCREF(Py_None);
-    return Py_None;
-  }
-
-  *tensor = interpreter->tensor(tensor_index);
+  *tensor = interpreter_->tensor(tensor_index);
   if ((*tensor)->bytes == 0) {
-    LOG(ERROR) << "Invalid tensor size";
-    Py_INCREF(Py_None);
-    return Py_None;
+    PyErr_SetString(PyExc_ValueError, "Invalid tensor size.");
+    return nullptr;
   }
 
   *type_num = TfLiteTypeToPyArrayType((*tensor)->type);
   if (*type_num == -1) {
-    LOG(ERROR) << "Unknown tensor type " << (*tensor)->type;
-    Py_INCREF(Py_None);
-    return Py_None;
+    PyErr_SetString(PyExc_ValueError, "Unknown tensor type.");
+    return nullptr;
   }
 
   if (!(*tensor)->data.raw) {
-    LOG(ERROR) << "Tensor data is null.";
-    Py_INCREF(Py_None);
-    return Py_None;
+    PyErr_SetString(PyExc_ValueError, "Tensor data is null.");
+    return nullptr;
   }
 
-  return nullptr;
+  Py_RETURN_NONE;
 }
 
 }  // namespace
@@ -352,19 +399,20 @@ PyObject* InterpreterWrapper::GetTensor(int i) const {
   // Sanity check accessor
   TfLiteTensor* tensor = nullptr;
   int type_num = 0;
-  if (PyObject* pynone_or_nullptr =
-          CheckGetTensorArgs(interpreter_.get(), i, &tensor, &type_num)) {
-    return pynone_or_nullptr;
-  }
+
+  PyObject* check_result =
+      CheckGetTensorArgs(interpreter_.get(), i, &tensor, &type_num);
+  if (check_result == nullptr) return check_result;
+  Py_XDECREF(check_result);
+
   std::vector<npy_intp> dims(tensor->dims->data,
                              tensor->dims->data + tensor->dims->size);
   // Make a buffer copy but we must tell Numpy It owns that data or else
   // it will leak.
   void* data = malloc(tensor->bytes);
   if (!data) {
-    LOG(ERROR) << "Malloc to copy tensor failed.";
-    Py_INCREF(Py_None);
-    return Py_None;
+    PyErr_SetString(PyExc_ValueError, "Malloc to copy tensor failed.");
+    return nullptr;
   }
   memcpy(data, tensor->data.raw, tensor->bytes);
   PyObject* np_array =
@@ -378,10 +426,11 @@ PyObject* InterpreterWrapper::tensor(PyObject* base_object, int i) {
   // Sanity check accessor
   TfLiteTensor* tensor = nullptr;
   int type_num = 0;
-  if (PyObject* pynone_or_nullptr =
-          CheckGetTensorArgs(interpreter_.get(), i, &tensor, &type_num)) {
-    return pynone_or_nullptr;
-  }
+
+  PyObject* check_result =
+      CheckGetTensorArgs(interpreter_.get(), i, &tensor, &type_num);
+  if (check_result == nullptr) return check_result;
+  Py_XDECREF(check_result);
 
   std::vector<npy_intp> dims(tensor->dims->data,
                              tensor->dims->data + tensor->dims->size);
@@ -394,22 +443,33 @@ PyObject* InterpreterWrapper::tensor(PyObject* base_object, int i) {
 }
 
 InterpreterWrapper* InterpreterWrapper::CreateWrapperCPPFromFile(
-    const char* model_path) {
+    const char* model_path, std::string* error_msg) {
+  std::unique_ptr<PythonErrorReporter> error_reporter(new PythonErrorReporter);
   std::unique_ptr<tflite::FlatBufferModel> model =
-      tflite::FlatBufferModel::BuildFromFile(model_path);
-  return model ? new InterpreterWrapper(std::move(model)) : nullptr;
+      tflite::FlatBufferModel::BuildFromFile(model_path, error_reporter.get());
+  return CreateInterpreterWrapper(std::move(model), std::move(error_reporter),
+                                  error_msg);
 }
 
 InterpreterWrapper* InterpreterWrapper::CreateWrapperCPPFromBuffer(
-    PyObject* data) {
+    PyObject* data, std::string* error_msg) {
   char * buf = nullptr;
   Py_ssize_t length;
+  std::unique_ptr<PythonErrorReporter> error_reporter(new PythonErrorReporter);
   if (PY_TO_CPPSTRING(data, &buf, &length) == -1) {
     return nullptr;
   }
   std::unique_ptr<tflite::FlatBufferModel> model =
-      tflite::FlatBufferModel::BuildFromBuffer(buf, length);
-  return model ? new InterpreterWrapper(std::move(model)) : nullptr;
+      tflite::FlatBufferModel::BuildFromBuffer(buf, length,
+                                               error_reporter.get());
+  return CreateInterpreterWrapper(std::move(model), std::move(error_reporter),
+                                  error_msg);
+}
+
+PyObject* InterpreterWrapper::ResetVariableTensorsToZero() {
+  TFLITE_PY_ENSURE_VALID_INTERPRETER();
+  TFLITE_PY_CHECK(interpreter_->ResetVariableTensorsToZero());
+  Py_RETURN_NONE;
 }
 
 }  // namespace interpreter_wrapper
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
index 681448be20..3e03751da4 100644
--- a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h
@@ -15,13 +15,13 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_LITE_PYTHON_INTERPRETER_WRAPPER_INTERPRETER_WRAPPER_H_
 #define TENSORFLOW_CONTRIB_LITE_PYTHON_INTERPRETER_WRAPPER_INTERPRETER_WRAPPER_H_
 
+// Place `<locale>` before <Python.h> to avoid build failures in macOS.
+#include <locale>
 #include <memory>
 #include <string>
 #include <vector>
 
-// Place `<locale>` before <Python.h> to avoid build failures in macOS.
 #include <Python.h>
-#include <locale>
 
 // We forward declare TFLite classes here to avoid exposing them to SWIG.
 namespace tflite {
@@ -36,41 +36,63 @@ class Interpreter;
 
 namespace interpreter_wrapper {
 
+class PythonErrorReporter;
+
 class InterpreterWrapper {
  public:
   // SWIG caller takes ownership of pointer.
-  static InterpreterWrapper* CreateWrapperCPPFromFile(const char* model_path);
+  static InterpreterWrapper* CreateWrapperCPPFromFile(const char* model_path,
+                                                      std::string* error_msg);
 
   // SWIG caller takes ownership of pointer.
-  static InterpreterWrapper* CreateWrapperCPPFromBuffer(PyObject* data);
+  static InterpreterWrapper* CreateWrapperCPPFromBuffer(PyObject* data,
+                                                        std::string* error_msg);
 
   ~InterpreterWrapper();
-  bool AllocateTensors();
-  bool Invoke();
+  PyObject* AllocateTensors();
+  PyObject* Invoke();
 
   PyObject* InputIndices() const;
   PyObject* OutputIndices() const;
-  bool ResizeInputTensor(int i, PyObject* value);
+  PyObject* ResizeInputTensor(int i, PyObject* value);
 
   std::string TensorName(int i) const;
   PyObject* TensorType(int i) const;
   PyObject* TensorSize(int i) const;
   PyObject* TensorQuantization(int i) const;
-  bool SetTensor(int i, PyObject* value);
+  PyObject* SetTensor(int i, PyObject* value);
   PyObject* GetTensor(int i) const;
+  PyObject* ResetVariableTensorsToZero();
+
   // Returns a reference to tensor index i as a numpy array. The base_object
   // should be the interpreter object providing the memory.
   PyObject* tensor(PyObject* base_object, int i);
 
  private:
-  InterpreterWrapper(std::unique_ptr<tflite::FlatBufferModel> model);
+  // Helper function to construct an `InterpreterWrapper` object.
+  // It only returns InterpreterWrapper if it can construct an `Interpreter`.
+  // Otherwise it returns `nullptr`.
+  static InterpreterWrapper* CreateInterpreterWrapper(
+      std::unique_ptr<tflite::FlatBufferModel> model,
+      std::unique_ptr<PythonErrorReporter> error_reporter,
+      std::string* error_msg);
+
+  InterpreterWrapper(
+      std::unique_ptr<tflite::FlatBufferModel> model,
+      std::unique_ptr<PythonErrorReporter> error_reporter,
+      std::unique_ptr<tflite::ops::builtin::BuiltinOpResolver> resolver,
+      std::unique_ptr<tflite::Interpreter> interpreter);
 
   // InterpreterWrapper is not copyable or assignable. We avoid the use of
   // InterpreterWrapper() = delete here for SWIG compatibility.
   InterpreterWrapper();
   InterpreterWrapper(const InterpreterWrapper& rhs);
 
+  // The public functions which creates `InterpreterWrapper` should ensure all
+  // these member variables are initialized successfully. Otherwise it should
+  // report the error and return `nullptr`.
   const std::unique_ptr<tflite::FlatBufferModel> model_;
+  const std::unique_ptr<PythonErrorReporter> error_reporter_;
   const std::unique_ptr<tflite::ops::builtin::BuiltinOpResolver> resolver_;
   const std::unique_ptr<tflite::Interpreter> interpreter_;
 };
diff --git a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.i b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.i
index 7f51f9f00d..afb2092eac 100644
--- a/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.i
+++ b/tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.i
@@ -18,8 +18,51 @@ limitations under the License.
 
 %{
 #define SWIG_FILE_WITH_INIT
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/model.h"
 #include "tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h"
 %}
 
 
 %include "tensorflow/contrib/lite/python/interpreter_wrapper/interpreter_wrapper.h"
+
+namespace tflite {
+namespace interpreter_wrapper {
+%extend InterpreterWrapper {
+
+  // Version of the constructor that handles producing Python exceptions
+  // that propagate strings.
+  static PyObject* CreateWrapperCPPFromFile(const char* model_path) {
+    std::string error;
+    if(tflite::interpreter_wrapper::InterpreterWrapper* ptr =
+        tflite::interpreter_wrapper::InterpreterWrapper
+            ::CreateWrapperCPPFromFile(
+        model_path, &error)) {
+      return SWIG_NewPointerObj(
+          ptr, SWIGTYPE_p_tflite__interpreter_wrapper__InterpreterWrapper, 1);
+    } else {
+      PyErr_SetString(PyExc_ValueError, error.c_str());
+      return nullptr;
+    }
+  }
+
+  // Version of the constructor that handles producing Python exceptions
+  // that propagate strings.
+  static PyObject* CreateWrapperCPPFromBuffer(
+      PyObject* data) {
+    std::string error;
+    if(tflite::interpreter_wrapper::InterpreterWrapper* ptr =
+        tflite::interpreter_wrapper::InterpreterWrapper
+            ::CreateWrapperCPPFromBuffer(
+        data, &error)) {
+      return SWIG_NewPointerObj(
+          ptr, SWIGTYPE_p_tflite__interpreter_wrapper__InterpreterWrapper, 1);
+    } else {
+      PyErr_SetString(PyExc_ValueError, error.c_str());
+      return nullptr;
+    }
+  }
+}
+
+}  // namespace interpreter_wrapper
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/python/lite.py b/tensorflow/contrib/lite/python/lite.py
index 29a1487c1f..2f9b9d469a 100644
--- a/tensorflow/contrib/lite/python/lite.py
+++ b/tensorflow/contrib/lite/python/lite.py
@@ -40,24 +40,23 @@ from google.protobuf import text_format as _text_format
 from google.protobuf.message import DecodeError
 from tensorflow.contrib.lite.python import lite_constants as constants
 from tensorflow.contrib.lite.python.convert import build_toco_convert_protos  # pylint: disable=unused-import
-from tensorflow.contrib.lite.python.convert import tensor_name
+from tensorflow.contrib.lite.python.convert import tensor_name as _tensor_name
 from tensorflow.contrib.lite.python.convert import toco_convert
 from tensorflow.contrib.lite.python.convert import toco_convert_protos  # pylint: disable=unused-import
-from tensorflow.contrib.lite.python.convert_saved_model import freeze_saved_model
-from tensorflow.contrib.lite.python.convert_saved_model import get_tensors_from_tensor_names
-from tensorflow.contrib.lite.python.convert_saved_model import set_tensor_shapes
+from tensorflow.contrib.lite.python.convert_saved_model import freeze_saved_model as _freeze_saved_model
+from tensorflow.contrib.lite.python.convert_saved_model import get_tensors_from_tensor_names as _get_tensors_from_tensor_names
+from tensorflow.contrib.lite.python.convert_saved_model import set_tensor_shapes as _set_tensor_shapes
 from tensorflow.contrib.lite.python.interpreter import Interpreter  # pylint: disable=unused-import
 from tensorflow.contrib.lite.python.op_hint import convert_op_hints_to_stubs  # pylint: disable=unused-import
 from tensorflow.contrib.lite.python.op_hint import OpHint  # pylint: disable=unused-import
 from tensorflow.core.framework import graph_pb2 as _graph_pb2
 from tensorflow.python import keras as _keras
 from tensorflow.python.client import session as _session
-from tensorflow.python.framework import graph_util as tf_graph_util
-from tensorflow.python.framework.importer import import_graph_def
-from tensorflow.python.ops.variables import global_variables_initializer
-from tensorflow.python.saved_model import signature_constants
-from tensorflow.python.saved_model import tag_constants
-# from tensorflow.python.util.all_util import remove_undocumented
+from tensorflow.python.framework import graph_util as _tf_graph_util
+from tensorflow.python.framework.importer import import_graph_def as _import_graph_def
+from tensorflow.python.ops.variables import global_variables_initializer as _global_variables_initializer
+from tensorflow.python.saved_model import signature_constants as _signature_constants
+from tensorflow.python.saved_model import tag_constants as _tag_constants
 
 
 class TocoConverter(object):
@@ -196,7 +195,7 @@ class TocoConverter(object):
         input_arrays or output_arrays contains an invalid tensor name.
     """
     with _session.Session() as sess:
-      sess.run(global_variables_initializer())
+      sess.run(_global_variables_initializer())
 
       # Read GraphDef from file.
       graph_def = _graph_pb2.GraphDef()
@@ -218,12 +217,12 @@ class TocoConverter(object):
           raise ValueError(
               "Unable to parse input file '{}'.".format(graph_def_file))
       sess.graph.as_default()
-      import_graph_def(graph_def, name="")
+      _import_graph_def(graph_def, name="")
 
       # Get input and output tensors.
-      input_tensors = get_tensors_from_tensor_names(sess.graph, input_arrays)
-      output_tensors = get_tensors_from_tensor_names(sess.graph, output_arrays)
-      set_tensor_shapes(input_tensors, input_shapes)
+      input_tensors = _get_tensors_from_tensor_names(sess.graph, input_arrays)
+      output_tensors = _get_tensors_from_tensor_names(sess.graph, output_arrays)
+      _set_tensor_shapes(input_tensors, input_shapes)
 
       # Check if graph is frozen.
       if not _is_frozen_graph(sess):
@@ -261,12 +260,12 @@ class TocoConverter(object):
       TocoConverter class.
     """
     if tag_set is None:
-      tag_set = set([tag_constants.SERVING])
+      tag_set = set([_tag_constants.SERVING])
     if signature_key is None:
-      signature_key = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
+      signature_key = _signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
 
-    result = freeze_saved_model(saved_model_dir, input_arrays, input_shapes,
-                                output_arrays, tag_set, signature_key)
+    result = _freeze_saved_model(saved_model_dir, input_arrays, input_shapes,
+                                 output_arrays, tag_set, signature_key)
     return cls(
         graph_def=result[0], input_tensors=result[1], output_tensors=result[2])
 
@@ -299,15 +298,15 @@ class TocoConverter(object):
 
     # Get input and output tensors.
     if input_arrays:
-      input_tensors = get_tensors_from_tensor_names(sess.graph, input_arrays)
+      input_tensors = _get_tensors_from_tensor_names(sess.graph, input_arrays)
     else:
       input_tensors = keras_model.inputs
 
     if output_arrays:
-      output_tensors = get_tensors_from_tensor_names(sess.graph, output_arrays)
+      output_tensors = _get_tensors_from_tensor_names(sess.graph, output_arrays)
     else:
       output_tensors = keras_model.outputs
-    set_tensor_shapes(input_tensors, input_shapes)
+    _set_tensor_shapes(input_tensors, input_shapes)
 
     graph_def = _freeze_graph(sess, output_tensors)
     return cls(graph_def, input_tensors, output_tensors)
@@ -328,12 +327,12 @@ class TocoConverter(object):
     for tensor in self._input_tensors:
       if not tensor.get_shape():
         raise ValueError("Provide an input shape for input array '{0}'.".format(
-            tensor_name(tensor)))
+            _tensor_name(tensor)))
       shape = tensor.get_shape().as_list()
       if None in shape[1:]:
         raise ValueError(
             "None is only supported in the 1st dimension. Tensor '{0}' has "
-            "invalid shape '{1}'.".format(tensor_name(tensor), shape))
+            "invalid shape '{1}'.".format(_tensor_name(tensor), shape))
       elif shape[0] is None:
         self._set_batch_size(batch_size=1)
 
@@ -343,7 +342,7 @@ class TocoConverter(object):
       quantized_stats = []
       invalid_stats = []
       for tensor in self._input_tensors:
-        name = tensor_name(tensor)
+        name = _tensor_name(tensor)
         if name in self.quantized_input_stats:
           quantized_stats.append(self.quantized_input_stats[name])
         else:
@@ -381,7 +380,7 @@ class TocoConverter(object):
     Returns:
       List of strings.
     """
-    return [tensor_name(tensor) for tensor in self._input_tensors]
+    return [_tensor_name(tensor) for tensor in self._input_tensors]
 
   def _set_batch_size(self, batch_size):
     """Sets the first dimension of the input tensor to `batch_size`.
@@ -428,11 +427,9 @@ def _freeze_graph(sess, output_tensors):
     Frozen GraphDef.
   """
   if not _is_frozen_graph(sess):
-    sess.run(global_variables_initializer())
-    output_arrays = [tensor_name(tensor) for tensor in output_tensors]
-    return tf_graph_util.convert_variables_to_constants(sess, sess.graph_def,
-                                                        output_arrays)
+    sess.run(_global_variables_initializer())
+    output_arrays = [_tensor_name(tensor) for tensor in output_tensors]
+    return _tf_graph_util.convert_variables_to_constants(
+        sess, sess.graph_def, output_arrays)
   else:
     return sess.graph_def
-
-# remove_undocumented(__name__)
diff --git a/tensorflow/contrib/lite/python/tflite_convert.py b/tensorflow/contrib/lite/python/tflite_convert.py
index 9bd1f4f76e..d17482e601 100644
--- a/tensorflow/contrib/lite/python/tflite_convert.py
+++ b/tensorflow/contrib/lite/python/tflite_convert.py
@@ -257,7 +257,7 @@ def run_main(_):
   parser.add_argument(
       "--input_arrays",
       type=str,
-      help="Names of the output arrays, comma-separated.")
+      help="Names of the input arrays, comma-separated.")
   parser.add_argument(
       "--input_shapes",
       type=str,
diff --git a/tensorflow/contrib/lite/schema/BUILD b/tensorflow/contrib/lite/schema/BUILD
index f095151cae..b616e449e6 100644
--- a/tensorflow/contrib/lite/schema/BUILD
+++ b/tensorflow/contrib/lite/schema/BUILD
@@ -30,7 +30,10 @@ py_test(
     size = "small",
     srcs = ["upgrade_schema_test.py"],
     srcs_version = "PY2AND3",
-    tags = ["no_pip"],
+    tags = [
+        "no_oss",
+        "no_pip",
+    ],
     deps = [
         ":upgrade_schema",
         "//tensorflow/python:client_testlib",
@@ -64,6 +67,7 @@ cc_test(
         "schema_v3.fbs",
     ],
     tags = [
+        "no_oss",
         "tflite_not_portable_android",
         "tflite_not_portable_ios",
     ],
diff --git a/tensorflow/contrib/lite/schema/builtin_ops_header/BUILD b/tensorflow/contrib/lite/schema/builtin_ops_header/BUILD
index 0148149a6a..4a627761da 100644
--- a/tensorflow/contrib/lite/schema/builtin_ops_header/BUILD
+++ b/tensorflow/contrib/lite/schema/builtin_ops_header/BUILD
@@ -24,6 +24,7 @@ cc_binary(
 cc_test(
     name = "generator_test",
     srcs = ["generator_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":generator",
         "@com_google_googletest//:gtest",
@@ -36,6 +37,7 @@ cc_test(
     data = [
         "//tensorflow/contrib/lite:builtin_ops.h",
     ],
+    tags = ["no_oss"],
     deps = [
         ":generator",
         "@com_google_googletest//:gtest",
diff --git a/tensorflow/contrib/lite/schema/schema.fbs b/tensorflow/contrib/lite/schema/schema.fbs
index 15fb8bbdb8..14f88b4c00 100644
--- a/tensorflow/contrib/lite/schema/schema.fbs
+++ b/tensorflow/contrib/lite/schema/schema.fbs
@@ -44,7 +44,7 @@ enum TensorType : byte {
 table QuantizationParameters {
   min:[float];  // For importing back into tensorflow.
   max:[float];  // For importing back into tensorflow.
-  scale:[float];
+  scale:[float];  // For dequantizing the tensor's values.
   zero_point:[long];
 }
 
@@ -155,11 +155,20 @@ enum BuiltinOperator : byte {
   EQUAL = 71,
   NOT_EQUAL = 72,
   LOG = 73,
-  SUM=74,
+  SUM = 74,
   SQRT = 75,
   RSQRT = 76,
   SHAPE = 77,
   POW = 78,
+  ARG_MIN = 79,
+  FAKE_QUANT = 80,
+  REDUCE_PROD = 81,
+  REDUCE_MAX = 82,
+  PACK = 83,
+  LOGICAL_OR = 84,
+  ONE_HOT = 85,
+  LOGICAL_AND = 86,
+  LOGICAL_NOT = 87,
 }
 
 // Options for the builtin operators.
@@ -220,6 +229,13 @@ union BuiltinOptions {
   NotEqualOptions,
   ShapeOptions,
   PowOptions,
+  ArgMinOptions,
+  FakeQuantOptions,
+  PackOptions,
+  LogicalOrOptions,
+  OneHotOptions,
+  LogicalAndOptions,
+  LogicalNotOptions,
 }
 
 enum Padding : byte { SAME, VALID }
@@ -469,6 +485,10 @@ table ArgMaxOptions {
   output_type : TensorType;
 }
 
+table ArgMinOptions {
+  output_type : TensorType;
+}
+
 table GreaterOptions {
 }
 
@@ -517,6 +537,34 @@ table ShapeOptions {
 table PowOptions {
 }
 
+table FakeQuantOptions {
+  // Parameters supported by version 1:
+  min:float;
+  max:float;
+  num_bits:int;
+
+  // Parameters supported by version 2:
+  narrow_range:bool;
+}
+
+table PackOptions {
+  values_count:int;
+  axis:int;
+}
+
+table LogicalOrOptions {
+}
+
+table OneHotOptions {
+  axis:int;
+}
+
+table LogicalAndOptions {
+}
+
+table LogicalNotOptions {
+}
+
 // An OperatorCode can be an enum value (BuiltinOperator) if the operator is a
 // builtin, or a string if the operator is custom.
 table OperatorCode {
diff --git a/tensorflow/contrib/lite/schema/schema_generated.h b/tensorflow/contrib/lite/schema/schema_generated.h
index fe0ff9a7a5..3efa153e2c 100755
--- a/tensorflow/contrib/lite/schema/schema_generated.h
+++ b/tensorflow/contrib/lite/schema/schema_generated.h
@@ -157,6 +157,9 @@ struct TileOptionsT;
 struct ArgMaxOptions;
 struct ArgMaxOptionsT;
 
+struct ArgMinOptions;
+struct ArgMinOptionsT;
+
 struct GreaterOptions;
 struct GreaterOptionsT;
 
@@ -199,6 +202,24 @@ struct ShapeOptionsT;
 struct PowOptions;
 struct PowOptionsT;
 
+struct FakeQuantOptions;
+struct FakeQuantOptionsT;
+
+struct PackOptions;
+struct PackOptionsT;
+
+struct LogicalOrOptions;
+struct LogicalOrOptionsT;
+
+struct OneHotOptions;
+struct OneHotOptionsT;
+
+struct LogicalAndOptions;
+struct LogicalAndOptionsT;
+
+struct LogicalNotOptions;
+struct LogicalNotOptionsT;
+
 struct OperatorCode;
 struct OperatorCodeT;
 
@@ -343,11 +364,20 @@ enum BuiltinOperator {
   BuiltinOperator_RSQRT = 76,
   BuiltinOperator_SHAPE = 77,
   BuiltinOperator_POW = 78,
+  BuiltinOperator_ARG_MIN = 79,
+  BuiltinOperator_FAKE_QUANT = 80,
+  BuiltinOperator_REDUCE_PROD = 81,
+  BuiltinOperator_REDUCE_MAX = 82,
+  BuiltinOperator_PACK = 83,
+  BuiltinOperator_LOGICAL_OR = 84,
+  BuiltinOperator_ONE_HOT = 85,
+  BuiltinOperator_LOGICAL_AND = 86,
+  BuiltinOperator_LOGICAL_NOT = 87,
   BuiltinOperator_MIN = BuiltinOperator_ADD,
-  BuiltinOperator_MAX = BuiltinOperator_POW
+  BuiltinOperator_MAX = BuiltinOperator_LOGICAL_NOT
 };
 
-inline BuiltinOperator (&EnumValuesBuiltinOperator())[78] {
+inline BuiltinOperator (&EnumValuesBuiltinOperator())[87] {
   static BuiltinOperator values[] = {
     BuiltinOperator_ADD,
     BuiltinOperator_AVERAGE_POOL_2D,
@@ -426,7 +456,16 @@ inline BuiltinOperator (&EnumValuesBuiltinOperator())[78] {
     BuiltinOperator_SQRT,
     BuiltinOperator_RSQRT,
     BuiltinOperator_SHAPE,
-    BuiltinOperator_POW
+    BuiltinOperator_POW,
+    BuiltinOperator_ARG_MIN,
+    BuiltinOperator_FAKE_QUANT,
+    BuiltinOperator_REDUCE_PROD,
+    BuiltinOperator_REDUCE_MAX,
+    BuiltinOperator_PACK,
+    BuiltinOperator_LOGICAL_OR,
+    BuiltinOperator_ONE_HOT,
+    BuiltinOperator_LOGICAL_AND,
+    BuiltinOperator_LOGICAL_NOT
   };
   return values;
 }
@@ -512,6 +551,15 @@ inline const char **EnumNamesBuiltinOperator() {
     "RSQRT",
     "SHAPE",
     "POW",
+    "ARG_MIN",
+    "FAKE_QUANT",
+    "REDUCE_PROD",
+    "REDUCE_MAX",
+    "PACK",
+    "LOGICAL_OR",
+    "ONE_HOT",
+    "LOGICAL_AND",
+    "LOGICAL_NOT",
     nullptr
   };
   return names;
@@ -580,11 +628,18 @@ enum BuiltinOptions {
   BuiltinOptions_NotEqualOptions = 54,
   BuiltinOptions_ShapeOptions = 55,
   BuiltinOptions_PowOptions = 56,
+  BuiltinOptions_ArgMinOptions = 57,
+  BuiltinOptions_FakeQuantOptions = 58,
+  BuiltinOptions_PackOptions = 59,
+  BuiltinOptions_LogicalOrOptions = 60,
+  BuiltinOptions_OneHotOptions = 61,
+  BuiltinOptions_LogicalAndOptions = 62,
+  BuiltinOptions_LogicalNotOptions = 63,
   BuiltinOptions_MIN = BuiltinOptions_NONE,
-  BuiltinOptions_MAX = BuiltinOptions_PowOptions
+  BuiltinOptions_MAX = BuiltinOptions_LogicalNotOptions
 };
 
-inline BuiltinOptions (&EnumValuesBuiltinOptions())[57] {
+inline BuiltinOptions (&EnumValuesBuiltinOptions())[64] {
   static BuiltinOptions values[] = {
     BuiltinOptions_NONE,
     BuiltinOptions_Conv2DOptions,
@@ -642,7 +697,14 @@ inline BuiltinOptions (&EnumValuesBuiltinOptions())[57] {
     BuiltinOptions_EqualOptions,
     BuiltinOptions_NotEqualOptions,
     BuiltinOptions_ShapeOptions,
-    BuiltinOptions_PowOptions
+    BuiltinOptions_PowOptions,
+    BuiltinOptions_ArgMinOptions,
+    BuiltinOptions_FakeQuantOptions,
+    BuiltinOptions_PackOptions,
+    BuiltinOptions_LogicalOrOptions,
+    BuiltinOptions_OneHotOptions,
+    BuiltinOptions_LogicalAndOptions,
+    BuiltinOptions_LogicalNotOptions
   };
   return values;
 }
@@ -706,6 +768,13 @@ inline const char **EnumNamesBuiltinOptions() {
     "NotEqualOptions",
     "ShapeOptions",
     "PowOptions",
+    "ArgMinOptions",
+    "FakeQuantOptions",
+    "PackOptions",
+    "LogicalOrOptions",
+    "OneHotOptions",
+    "LogicalAndOptions",
+    "LogicalNotOptions",
     nullptr
   };
   return names;
@@ -944,6 +1013,34 @@ template<> struct BuiltinOptionsTraits<PowOptions> {
   static const BuiltinOptions enum_value = BuiltinOptions_PowOptions;
 };
 
+template<> struct BuiltinOptionsTraits<ArgMinOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_ArgMinOptions;
+};
+
+template<> struct BuiltinOptionsTraits<FakeQuantOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_FakeQuantOptions;
+};
+
+template<> struct BuiltinOptionsTraits<PackOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_PackOptions;
+};
+
+template<> struct BuiltinOptionsTraits<LogicalOrOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_LogicalOrOptions;
+};
+
+template<> struct BuiltinOptionsTraits<OneHotOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_OneHotOptions;
+};
+
+template<> struct BuiltinOptionsTraits<LogicalAndOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_LogicalAndOptions;
+};
+
+template<> struct BuiltinOptionsTraits<LogicalNotOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_LogicalNotOptions;
+};
+
 struct BuiltinOptionsUnion {
   BuiltinOptions type;
   void *value;
@@ -1423,6 +1520,62 @@ struct BuiltinOptionsUnion {
     return type == BuiltinOptions_PowOptions ?
       reinterpret_cast<const PowOptionsT *>(value) : nullptr;
   }
+  ArgMinOptionsT *AsArgMinOptions() {
+    return type == BuiltinOptions_ArgMinOptions ?
+      reinterpret_cast<ArgMinOptionsT *>(value) : nullptr;
+  }
+  const ArgMinOptionsT *AsArgMinOptions() const {
+    return type == BuiltinOptions_ArgMinOptions ?
+      reinterpret_cast<const ArgMinOptionsT *>(value) : nullptr;
+  }
+  FakeQuantOptionsT *AsFakeQuantOptions() {
+    return type == BuiltinOptions_FakeQuantOptions ?
+      reinterpret_cast<FakeQuantOptionsT *>(value) : nullptr;
+  }
+  const FakeQuantOptionsT *AsFakeQuantOptions() const {
+    return type == BuiltinOptions_FakeQuantOptions ?
+      reinterpret_cast<const FakeQuantOptionsT *>(value) : nullptr;
+  }
+  PackOptionsT *AsPackOptions() {
+    return type == BuiltinOptions_PackOptions ?
+      reinterpret_cast<PackOptionsT *>(value) : nullptr;
+  }
+  const PackOptionsT *AsPackOptions() const {
+    return type == BuiltinOptions_PackOptions ?
+      reinterpret_cast<const PackOptionsT *>(value) : nullptr;
+  }
+  LogicalOrOptionsT *AsLogicalOrOptions() {
+    return type == BuiltinOptions_LogicalOrOptions ?
+      reinterpret_cast<LogicalOrOptionsT *>(value) : nullptr;
+  }
+  const LogicalOrOptionsT *AsLogicalOrOptions() const {
+    return type == BuiltinOptions_LogicalOrOptions ?
+      reinterpret_cast<const LogicalOrOptionsT *>(value) : nullptr;
+  }
+  OneHotOptionsT *AsOneHotOptions() {
+    return type == BuiltinOptions_OneHotOptions ?
+      reinterpret_cast<OneHotOptionsT *>(value) : nullptr;
+  }
+  const OneHotOptionsT *AsOneHotOptions() const {
+    return type == BuiltinOptions_OneHotOptions ?
+      reinterpret_cast<const OneHotOptionsT *>(value) : nullptr;
+  }
+  LogicalAndOptionsT *AsLogicalAndOptions() {
+    return type == BuiltinOptions_LogicalAndOptions ?
+      reinterpret_cast<LogicalAndOptionsT *>(value) : nullptr;
+  }
+  const LogicalAndOptionsT *AsLogicalAndOptions() const {
+    return type == BuiltinOptions_LogicalAndOptions ?
+      reinterpret_cast<const LogicalAndOptionsT *>(value) : nullptr;
+  }
+  LogicalNotOptionsT *AsLogicalNotOptions() {
+    return type == BuiltinOptions_LogicalNotOptions ?
+      reinterpret_cast<LogicalNotOptionsT *>(value) : nullptr;
+  }
+  const LogicalNotOptionsT *AsLogicalNotOptions() const {
+    return type == BuiltinOptions_LogicalNotOptions ?
+      reinterpret_cast<const LogicalNotOptionsT *>(value) : nullptr;
+  }
 };
 
 bool VerifyBuiltinOptions(flatbuffers::Verifier &verifier, const void *obj, BuiltinOptions type);
@@ -4486,6 +4639,60 @@ inline flatbuffers::Offset<ArgMaxOptions> CreateArgMaxOptions(
 
 flatbuffers::Offset<ArgMaxOptions> CreateArgMaxOptions(flatbuffers::FlatBufferBuilder &_fbb, const ArgMaxOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
 
+struct ArgMinOptionsT : public flatbuffers::NativeTable {
+  typedef ArgMinOptions TableType;
+  TensorType output_type;
+  ArgMinOptionsT()
+      : output_type(TensorType_FLOAT32) {
+  }
+};
+
+struct ArgMinOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef ArgMinOptionsT NativeTableType;
+  enum {
+    VT_OUTPUT_TYPE = 4
+  };
+  TensorType output_type() const {
+    return static_cast<TensorType>(GetField<int8_t>(VT_OUTPUT_TYPE, 0));
+  }
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           VerifyField<int8_t>(verifier, VT_OUTPUT_TYPE) &&
+           verifier.EndTable();
+  }
+  ArgMinOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(ArgMinOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<ArgMinOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const ArgMinOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct ArgMinOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  void add_output_type(TensorType output_type) {
+    fbb_.AddElement<int8_t>(ArgMinOptions::VT_OUTPUT_TYPE, static_cast<int8_t>(output_type), 0);
+  }
+  explicit ArgMinOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  ArgMinOptionsBuilder &operator=(const ArgMinOptionsBuilder &);
+  flatbuffers::Offset<ArgMinOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<ArgMinOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<ArgMinOptions> CreateArgMinOptions(
+    flatbuffers::FlatBufferBuilder &_fbb,
+    TensorType output_type = TensorType_FLOAT32) {
+  ArgMinOptionsBuilder builder_(_fbb);
+  builder_.add_output_type(output_type);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<ArgMinOptions> CreateArgMinOptions(flatbuffers::FlatBufferBuilder &_fbb, const ArgMinOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
 struct GreaterOptionsT : public flatbuffers::NativeTable {
   typedef GreaterOptions TableType;
   GreaterOptionsT() {
@@ -5112,6 +5319,336 @@ inline flatbuffers::Offset<PowOptions> CreatePowOptions(
 
 flatbuffers::Offset<PowOptions> CreatePowOptions(flatbuffers::FlatBufferBuilder &_fbb, const PowOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
 
+struct FakeQuantOptionsT : public flatbuffers::NativeTable {
+  typedef FakeQuantOptions TableType;
+  float min;
+  float max;
+  int32_t num_bits;
+  bool narrow_range;
+  FakeQuantOptionsT()
+      : min(0.0f),
+        max(0.0f),
+        num_bits(0),
+        narrow_range(false) {
+  }
+};
+
+struct FakeQuantOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef FakeQuantOptionsT NativeTableType;
+  enum {
+    VT_MIN = 4,
+    VT_MAX = 6,
+    VT_NUM_BITS = 8,
+    VT_NARROW_RANGE = 10
+  };
+  float min() const {
+    return GetField<float>(VT_MIN, 0.0f);
+  }
+  float max() const {
+    return GetField<float>(VT_MAX, 0.0f);
+  }
+  int32_t num_bits() const {
+    return GetField<int32_t>(VT_NUM_BITS, 0);
+  }
+  bool narrow_range() const {
+    return GetField<uint8_t>(VT_NARROW_RANGE, 0) != 0;
+  }
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           VerifyField<float>(verifier, VT_MIN) &&
+           VerifyField<float>(verifier, VT_MAX) &&
+           VerifyField<int32_t>(verifier, VT_NUM_BITS) &&
+           VerifyField<uint8_t>(verifier, VT_NARROW_RANGE) &&
+           verifier.EndTable();
+  }
+  FakeQuantOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(FakeQuantOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<FakeQuantOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const FakeQuantOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct FakeQuantOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  void add_min(float min) {
+    fbb_.AddElement<float>(FakeQuantOptions::VT_MIN, min, 0.0f);
+  }
+  void add_max(float max) {
+    fbb_.AddElement<float>(FakeQuantOptions::VT_MAX, max, 0.0f);
+  }
+  void add_num_bits(int32_t num_bits) {
+    fbb_.AddElement<int32_t>(FakeQuantOptions::VT_NUM_BITS, num_bits, 0);
+  }
+  void add_narrow_range(bool narrow_range) {
+    fbb_.AddElement<uint8_t>(FakeQuantOptions::VT_NARROW_RANGE, static_cast<uint8_t>(narrow_range), 0);
+  }
+  explicit FakeQuantOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  FakeQuantOptionsBuilder &operator=(const FakeQuantOptionsBuilder &);
+  flatbuffers::Offset<FakeQuantOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<FakeQuantOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<FakeQuantOptions> CreateFakeQuantOptions(
+    flatbuffers::FlatBufferBuilder &_fbb,
+    float min = 0.0f,
+    float max = 0.0f,
+    int32_t num_bits = 0,
+    bool narrow_range = false) {
+  FakeQuantOptionsBuilder builder_(_fbb);
+  builder_.add_num_bits(num_bits);
+  builder_.add_max(max);
+  builder_.add_min(min);
+  builder_.add_narrow_range(narrow_range);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<FakeQuantOptions> CreateFakeQuantOptions(flatbuffers::FlatBufferBuilder &_fbb, const FakeQuantOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct PackOptionsT : public flatbuffers::NativeTable {
+  typedef PackOptions TableType;
+  int32_t values_count;
+  int32_t axis;
+  PackOptionsT()
+      : values_count(0),
+        axis(0) {
+  }
+};
+
+struct PackOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef PackOptionsT NativeTableType;
+  enum {
+    VT_VALUES_COUNT = 4,
+    VT_AXIS = 6
+  };
+  int32_t values_count() const {
+    return GetField<int32_t>(VT_VALUES_COUNT, 0);
+  }
+  int32_t axis() const {
+    return GetField<int32_t>(VT_AXIS, 0);
+  }
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           VerifyField<int32_t>(verifier, VT_VALUES_COUNT) &&
+           VerifyField<int32_t>(verifier, VT_AXIS) &&
+           verifier.EndTable();
+  }
+  PackOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(PackOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<PackOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const PackOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct PackOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  void add_values_count(int32_t values_count) {
+    fbb_.AddElement<int32_t>(PackOptions::VT_VALUES_COUNT, values_count, 0);
+  }
+  void add_axis(int32_t axis) {
+    fbb_.AddElement<int32_t>(PackOptions::VT_AXIS, axis, 0);
+  }
+  explicit PackOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  PackOptionsBuilder &operator=(const PackOptionsBuilder &);
+  flatbuffers::Offset<PackOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<PackOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<PackOptions> CreatePackOptions(
+    flatbuffers::FlatBufferBuilder &_fbb,
+    int32_t values_count = 0,
+    int32_t axis = 0) {
+  PackOptionsBuilder builder_(_fbb);
+  builder_.add_axis(axis);
+  builder_.add_values_count(values_count);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<PackOptions> CreatePackOptions(flatbuffers::FlatBufferBuilder &_fbb, const PackOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct LogicalOrOptionsT : public flatbuffers::NativeTable {
+  typedef LogicalOrOptions TableType;
+  LogicalOrOptionsT() {
+  }
+};
+
+struct LogicalOrOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef LogicalOrOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  LogicalOrOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(LogicalOrOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<LogicalOrOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogicalOrOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct LogicalOrOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit LogicalOrOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  LogicalOrOptionsBuilder &operator=(const LogicalOrOptionsBuilder &);
+  flatbuffers::Offset<LogicalOrOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<LogicalOrOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<LogicalOrOptions> CreateLogicalOrOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  LogicalOrOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<LogicalOrOptions> CreateLogicalOrOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogicalOrOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct OneHotOptionsT : public flatbuffers::NativeTable {
+  typedef OneHotOptions TableType;
+  int32_t axis;
+  OneHotOptionsT()
+      : axis(0) {
+  }
+};
+
+struct OneHotOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef OneHotOptionsT NativeTableType;
+  enum {
+    VT_AXIS = 4
+  };
+  int32_t axis() const {
+    return GetField<int32_t>(VT_AXIS, 0);
+  }
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           VerifyField<int32_t>(verifier, VT_AXIS) &&
+           verifier.EndTable();
+  }
+  OneHotOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(OneHotOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<OneHotOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const OneHotOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct OneHotOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  void add_axis(int32_t axis) {
+    fbb_.AddElement<int32_t>(OneHotOptions::VT_AXIS, axis, 0);
+  }
+  explicit OneHotOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  OneHotOptionsBuilder &operator=(const OneHotOptionsBuilder &);
+  flatbuffers::Offset<OneHotOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<OneHotOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<OneHotOptions> CreateOneHotOptions(
+    flatbuffers::FlatBufferBuilder &_fbb,
+    int32_t axis = 0) {
+  OneHotOptionsBuilder builder_(_fbb);
+  builder_.add_axis(axis);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<OneHotOptions> CreateOneHotOptions(flatbuffers::FlatBufferBuilder &_fbb, const OneHotOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct LogicalAndOptionsT : public flatbuffers::NativeTable {
+  typedef LogicalAndOptions TableType;
+  LogicalAndOptionsT() {
+  }
+};
+
+struct LogicalAndOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef LogicalAndOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  LogicalAndOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(LogicalAndOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<LogicalAndOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogicalAndOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct LogicalAndOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit LogicalAndOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  LogicalAndOptionsBuilder &operator=(const LogicalAndOptionsBuilder &);
+  flatbuffers::Offset<LogicalAndOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<LogicalAndOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<LogicalAndOptions> CreateLogicalAndOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  LogicalAndOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<LogicalAndOptions> CreateLogicalAndOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogicalAndOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
+struct LogicalNotOptionsT : public flatbuffers::NativeTable {
+  typedef LogicalNotOptions TableType;
+  LogicalNotOptionsT() {
+  }
+};
+
+struct LogicalNotOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef LogicalNotOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  LogicalNotOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(LogicalNotOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<LogicalNotOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogicalNotOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct LogicalNotOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit LogicalNotOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  LogicalNotOptionsBuilder &operator=(const LogicalNotOptionsBuilder &);
+  flatbuffers::Offset<LogicalNotOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<LogicalNotOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<LogicalNotOptions> CreateLogicalNotOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  LogicalNotOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<LogicalNotOptions> CreateLogicalNotOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogicalNotOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
 struct OperatorCodeT : public flatbuffers::NativeTable {
   typedef OperatorCode TableType;
   BuiltinOperator builtin_code;
@@ -5413,6 +5950,27 @@ struct Operator FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
   const PowOptions *builtin_options_as_PowOptions() const {
     return builtin_options_type() == BuiltinOptions_PowOptions ? static_cast<const PowOptions *>(builtin_options()) : nullptr;
   }
+  const ArgMinOptions *builtin_options_as_ArgMinOptions() const {
+    return builtin_options_type() == BuiltinOptions_ArgMinOptions ? static_cast<const ArgMinOptions *>(builtin_options()) : nullptr;
+  }
+  const FakeQuantOptions *builtin_options_as_FakeQuantOptions() const {
+    return builtin_options_type() == BuiltinOptions_FakeQuantOptions ? static_cast<const FakeQuantOptions *>(builtin_options()) : nullptr;
+  }
+  const PackOptions *builtin_options_as_PackOptions() const {
+    return builtin_options_type() == BuiltinOptions_PackOptions ? static_cast<const PackOptions *>(builtin_options()) : nullptr;
+  }
+  const LogicalOrOptions *builtin_options_as_LogicalOrOptions() const {
+    return builtin_options_type() == BuiltinOptions_LogicalOrOptions ? static_cast<const LogicalOrOptions *>(builtin_options()) : nullptr;
+  }
+  const OneHotOptions *builtin_options_as_OneHotOptions() const {
+    return builtin_options_type() == BuiltinOptions_OneHotOptions ? static_cast<const OneHotOptions *>(builtin_options()) : nullptr;
+  }
+  const LogicalAndOptions *builtin_options_as_LogicalAndOptions() const {
+    return builtin_options_type() == BuiltinOptions_LogicalAndOptions ? static_cast<const LogicalAndOptions *>(builtin_options()) : nullptr;
+  }
+  const LogicalNotOptions *builtin_options_as_LogicalNotOptions() const {
+    return builtin_options_type() == BuiltinOptions_LogicalNotOptions ? static_cast<const LogicalNotOptions *>(builtin_options()) : nullptr;
+  }
   const flatbuffers::Vector<uint8_t> *custom_options() const {
     return GetPointer<const flatbuffers::Vector<uint8_t> *>(VT_CUSTOM_OPTIONS);
   }
@@ -5668,6 +6226,34 @@ template<> inline const PowOptions *Operator::builtin_options_as<PowOptions>() c
   return builtin_options_as_PowOptions();
 }
 
+template<> inline const ArgMinOptions *Operator::builtin_options_as<ArgMinOptions>() const {
+  return builtin_options_as_ArgMinOptions();
+}
+
+template<> inline const FakeQuantOptions *Operator::builtin_options_as<FakeQuantOptions>() const {
+  return builtin_options_as_FakeQuantOptions();
+}
+
+template<> inline const PackOptions *Operator::builtin_options_as<PackOptions>() const {
+  return builtin_options_as_PackOptions();
+}
+
+template<> inline const LogicalOrOptions *Operator::builtin_options_as<LogicalOrOptions>() const {
+  return builtin_options_as_LogicalOrOptions();
+}
+
+template<> inline const OneHotOptions *Operator::builtin_options_as<OneHotOptions>() const {
+  return builtin_options_as_OneHotOptions();
+}
+
+template<> inline const LogicalAndOptions *Operator::builtin_options_as<LogicalAndOptions>() const {
+  return builtin_options_as_LogicalAndOptions();
+}
+
+template<> inline const LogicalNotOptions *Operator::builtin_options_as<LogicalNotOptions>() const {
+  return builtin_options_as_LogicalNotOptions();
+}
+
 struct OperatorBuilder {
   flatbuffers::FlatBufferBuilder &fbb_;
   flatbuffers::uoffset_t start_;
@@ -7333,6 +7919,32 @@ inline flatbuffers::Offset<ArgMaxOptions> CreateArgMaxOptions(flatbuffers::FlatB
       _output_type);
 }
 
+inline ArgMinOptionsT *ArgMinOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new ArgMinOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void ArgMinOptions::UnPackTo(ArgMinOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+  { auto _e = output_type(); _o->output_type = _e; };
+}
+
+inline flatbuffers::Offset<ArgMinOptions> ArgMinOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const ArgMinOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateArgMinOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<ArgMinOptions> CreateArgMinOptions(flatbuffers::FlatBufferBuilder &_fbb, const ArgMinOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const ArgMinOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  auto _output_type = _o->output_type;
+  return tflite::CreateArgMinOptions(
+      _fbb,
+      _output_type);
+}
+
 inline GreaterOptionsT *GreaterOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
   auto _o = new GreaterOptionsT();
   UnPackTo(_o, _resolver);
@@ -7670,6 +8282,165 @@ inline flatbuffers::Offset<PowOptions> CreatePowOptions(flatbuffers::FlatBufferB
       _fbb);
 }
 
+inline FakeQuantOptionsT *FakeQuantOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new FakeQuantOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void FakeQuantOptions::UnPackTo(FakeQuantOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+  { auto _e = min(); _o->min = _e; };
+  { auto _e = max(); _o->max = _e; };
+  { auto _e = num_bits(); _o->num_bits = _e; };
+  { auto _e = narrow_range(); _o->narrow_range = _e; };
+}
+
+inline flatbuffers::Offset<FakeQuantOptions> FakeQuantOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const FakeQuantOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateFakeQuantOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<FakeQuantOptions> CreateFakeQuantOptions(flatbuffers::FlatBufferBuilder &_fbb, const FakeQuantOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const FakeQuantOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  auto _min = _o->min;
+  auto _max = _o->max;
+  auto _num_bits = _o->num_bits;
+  auto _narrow_range = _o->narrow_range;
+  return tflite::CreateFakeQuantOptions(
+      _fbb,
+      _min,
+      _max,
+      _num_bits,
+      _narrow_range);
+}
+
+inline PackOptionsT *PackOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new PackOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void PackOptions::UnPackTo(PackOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+  { auto _e = values_count(); _o->values_count = _e; };
+  { auto _e = axis(); _o->axis = _e; };
+}
+
+inline flatbuffers::Offset<PackOptions> PackOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const PackOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreatePackOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<PackOptions> CreatePackOptions(flatbuffers::FlatBufferBuilder &_fbb, const PackOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const PackOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  auto _values_count = _o->values_count;
+  auto _axis = _o->axis;
+  return tflite::CreatePackOptions(
+      _fbb,
+      _values_count,
+      _axis);
+}
+
+inline LogicalOrOptionsT *LogicalOrOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new LogicalOrOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void LogicalOrOptions::UnPackTo(LogicalOrOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<LogicalOrOptions> LogicalOrOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogicalOrOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateLogicalOrOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<LogicalOrOptions> CreateLogicalOrOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogicalOrOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const LogicalOrOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateLogicalOrOptions(
+      _fbb);
+}
+
+inline OneHotOptionsT *OneHotOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new OneHotOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void OneHotOptions::UnPackTo(OneHotOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+  { auto _e = axis(); _o->axis = _e; };
+}
+
+inline flatbuffers::Offset<OneHotOptions> OneHotOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const OneHotOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateOneHotOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<OneHotOptions> CreateOneHotOptions(flatbuffers::FlatBufferBuilder &_fbb, const OneHotOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const OneHotOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  auto _axis = _o->axis;
+  return tflite::CreateOneHotOptions(
+      _fbb,
+      _axis);
+}
+
+inline LogicalAndOptionsT *LogicalAndOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new LogicalAndOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void LogicalAndOptions::UnPackTo(LogicalAndOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<LogicalAndOptions> LogicalAndOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogicalAndOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateLogicalAndOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<LogicalAndOptions> CreateLogicalAndOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogicalAndOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const LogicalAndOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateLogicalAndOptions(
+      _fbb);
+}
+
+inline LogicalNotOptionsT *LogicalNotOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new LogicalNotOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void LogicalNotOptions::UnPackTo(LogicalNotOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<LogicalNotOptions> LogicalNotOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogicalNotOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateLogicalNotOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<LogicalNotOptions> CreateLogicalNotOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogicalNotOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const LogicalNotOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateLogicalNotOptions(
+      _fbb);
+}
+
 inline OperatorCodeT *OperatorCode::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
   auto _o = new OperatorCodeT();
   UnPackTo(_o, _resolver);
@@ -8083,6 +8854,34 @@ inline bool VerifyBuiltinOptions(flatbuffers::Verifier &verifier, const void *ob
       auto ptr = reinterpret_cast<const PowOptions *>(obj);
       return verifier.VerifyTable(ptr);
     }
+    case BuiltinOptions_ArgMinOptions: {
+      auto ptr = reinterpret_cast<const ArgMinOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_FakeQuantOptions: {
+      auto ptr = reinterpret_cast<const FakeQuantOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_PackOptions: {
+      auto ptr = reinterpret_cast<const PackOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_LogicalOrOptions: {
+      auto ptr = reinterpret_cast<const LogicalOrOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_OneHotOptions: {
+      auto ptr = reinterpret_cast<const OneHotOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_LogicalAndOptions: {
+      auto ptr = reinterpret_cast<const LogicalAndOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
+    case BuiltinOptions_LogicalNotOptions: {
+      auto ptr = reinterpret_cast<const LogicalNotOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
     default: return false;
   }
 }
@@ -8325,6 +9124,34 @@ inline void *BuiltinOptionsUnion::UnPack(const void *obj, BuiltinOptions type, c
       auto ptr = reinterpret_cast<const PowOptions *>(obj);
       return ptr->UnPack(resolver);
     }
+    case BuiltinOptions_ArgMinOptions: {
+      auto ptr = reinterpret_cast<const ArgMinOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_FakeQuantOptions: {
+      auto ptr = reinterpret_cast<const FakeQuantOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_PackOptions: {
+      auto ptr = reinterpret_cast<const PackOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_LogicalOrOptions: {
+      auto ptr = reinterpret_cast<const LogicalOrOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_OneHotOptions: {
+      auto ptr = reinterpret_cast<const OneHotOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_LogicalAndOptions: {
+      auto ptr = reinterpret_cast<const LogicalAndOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
+    case BuiltinOptions_LogicalNotOptions: {
+      auto ptr = reinterpret_cast<const LogicalNotOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
     default: return nullptr;
   }
 }
@@ -8555,6 +9382,34 @@ inline flatbuffers::Offset<void> BuiltinOptionsUnion::Pack(flatbuffers::FlatBuff
       auto ptr = reinterpret_cast<const PowOptionsT *>(value);
       return CreatePowOptions(_fbb, ptr, _rehasher).Union();
     }
+    case BuiltinOptions_ArgMinOptions: {
+      auto ptr = reinterpret_cast<const ArgMinOptionsT *>(value);
+      return CreateArgMinOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_FakeQuantOptions: {
+      auto ptr = reinterpret_cast<const FakeQuantOptionsT *>(value);
+      return CreateFakeQuantOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_PackOptions: {
+      auto ptr = reinterpret_cast<const PackOptionsT *>(value);
+      return CreatePackOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_LogicalOrOptions: {
+      auto ptr = reinterpret_cast<const LogicalOrOptionsT *>(value);
+      return CreateLogicalOrOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_OneHotOptions: {
+      auto ptr = reinterpret_cast<const OneHotOptionsT *>(value);
+      return CreateOneHotOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_LogicalAndOptions: {
+      auto ptr = reinterpret_cast<const LogicalAndOptionsT *>(value);
+      return CreateLogicalAndOptions(_fbb, ptr, _rehasher).Union();
+    }
+    case BuiltinOptions_LogicalNotOptions: {
+      auto ptr = reinterpret_cast<const LogicalNotOptionsT *>(value);
+      return CreateLogicalNotOptions(_fbb, ptr, _rehasher).Union();
+    }
     default: return 0;
   }
 }
@@ -8785,6 +9640,34 @@ inline BuiltinOptionsUnion::BuiltinOptionsUnion(const BuiltinOptionsUnion &u) FL
       value = new PowOptionsT(*reinterpret_cast<PowOptionsT *>(u.value));
       break;
     }
+    case BuiltinOptions_ArgMinOptions: {
+      value = new ArgMinOptionsT(*reinterpret_cast<ArgMinOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_FakeQuantOptions: {
+      value = new FakeQuantOptionsT(*reinterpret_cast<FakeQuantOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_PackOptions: {
+      value = new PackOptionsT(*reinterpret_cast<PackOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_LogicalOrOptions: {
+      value = new LogicalOrOptionsT(*reinterpret_cast<LogicalOrOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_OneHotOptions: {
+      value = new OneHotOptionsT(*reinterpret_cast<OneHotOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_LogicalAndOptions: {
+      value = new LogicalAndOptionsT(*reinterpret_cast<LogicalAndOptionsT *>(u.value));
+      break;
+    }
+    case BuiltinOptions_LogicalNotOptions: {
+      value = new LogicalNotOptionsT(*reinterpret_cast<LogicalNotOptionsT *>(u.value));
+      break;
+    }
     default:
       break;
   }
@@ -9072,6 +9955,41 @@ inline void BuiltinOptionsUnion::Reset() {
       delete ptr;
       break;
     }
+    case BuiltinOptions_ArgMinOptions: {
+      auto ptr = reinterpret_cast<ArgMinOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_FakeQuantOptions: {
+      auto ptr = reinterpret_cast<FakeQuantOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_PackOptions: {
+      auto ptr = reinterpret_cast<PackOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_LogicalOrOptions: {
+      auto ptr = reinterpret_cast<LogicalOrOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_OneHotOptions: {
+      auto ptr = reinterpret_cast<OneHotOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_LogicalAndOptions: {
+      auto ptr = reinterpret_cast<LogicalAndOptionsT *>(value);
+      delete ptr;
+      break;
+    }
+    case BuiltinOptions_LogicalNotOptions: {
+      auto ptr = reinterpret_cast<LogicalNotOptionsT *>(value);
+      delete ptr;
+      break;
+    }
     default: break;
   }
   value = nullptr;
diff --git a/tensorflow/contrib/lite/simple_memory_arena.cc b/tensorflow/contrib/lite/simple_memory_arena.cc
index 4eaf6f1bfe..cd0f1f7c17 100644
--- a/tensorflow/contrib/lite/simple_memory_arena.cc
+++ b/tensorflow/contrib/lite/simple_memory_arena.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/contrib/lite/simple_memory_arena.h"
 
+#include <algorithm>
 #include <cstring>
 #include <limits>
 #include <vector>
@@ -34,7 +35,7 @@ namespace tflite {
 TfLiteStatus SimpleMemoryArena::Allocate(TfLiteContext* context,
                                          size_t alignment, size_t size,
                                          ArenaAlloc* new_alloc) {
-  TF_LITE_ENSURE(context, alignment < arena_alignment_);
+  TF_LITE_ENSURE(context, alignment <= arena_alignment_);
 
   if (size == 0) {
     new_alloc->offset = 0;
diff --git a/tensorflow/contrib/lite/testdata/add.bin b/tensorflow/contrib/lite/testdata/add.bin
new file mode 100644
index 0000000000..aef0fe3d82
--- /dev/null
+++ b/tensorflow/contrib/lite/testdata/add.bin
diff --git a/tensorflow/contrib/lite/testing/BUILD b/tensorflow/contrib/lite/testing/BUILD
index 789bc695f8..a788d41ba7 100644
--- a/tensorflow/contrib/lite/testing/BUILD
+++ b/tensorflow/contrib/lite/testing/BUILD
@@ -140,6 +140,7 @@ cc_test(
 cc_library(
     name = "join",
     hdrs = ["join.h"],
+    deps = ["//tensorflow/contrib/lite:string"],
 )
 
 cc_test(
@@ -209,6 +210,10 @@ cc_library(
 cc_library(
     name = "util",
     hdrs = ["util.h"],
+    deps = [
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:string",
+    ],
 )
 
 cc_test(
@@ -252,6 +257,7 @@ cc_test(
     srcs = ["tf_driver_test.cc"],
     data = ["//tensorflow/contrib/lite:testdata/multi_add.pb"],
     tags = [
+        "no_oss",
         "tflite_not_portable",
     ],
     deps = [
@@ -268,6 +274,7 @@ cc_library(
         ":join",
         ":split",
         ":tf_driver",
+        "//tensorflow/contrib/lite:string",
         "//tensorflow/core:framework",
     ],
 )
@@ -277,6 +284,7 @@ cc_test(
     size = "small",
     srcs = ["generate_testspec_test.cc"],
     tags = [
+        "no_oss",
         "tflite_not_portable",
     ],
     deps = [
@@ -333,7 +341,7 @@ tf_cc_test(
     ],
     tags = [
         "no_cuda_on_cpu_tap",
-        "no_oss",
+        "no_oss",  # needs test data
         "tflite_not_portable",
     ],
     deps = [
diff --git a/tensorflow/contrib/lite/testing/generate_examples.py b/tensorflow/contrib/lite/testing/generate_examples.py
index 50237ed792..6d03c0fd9e 100644
--- a/tensorflow/contrib/lite/testing/generate_examples.py
+++ b/tensorflow/contrib/lite/testing/generate_examples.py
@@ -104,6 +104,8 @@ KNOWN_BUGS = {
     r"div.*int32": "72051395",
     # No support for SplitV
     r"split.*num_or_size_splits=\[2,2\]": "73377559",
+    # Scalar constants don't work.
+    r"constant.*shape=\[\]": "109811500",
 }
 
 
@@ -229,6 +231,7 @@ _TF_TYPE_INFO = {
     tf.int32: (np.int32, "INT32"),
     tf.uint8: (np.uint8, "QUANTIZED_UINT8"),
     tf.int64: (np.int64, "INT64"),
+    tf.bool: (np.bool, "BOOL"),
 }
 
 
@@ -242,7 +245,10 @@ def create_tensor_data(dtype, shape, min_value=-100, max_value=100):
     value = (max_value-min_value)*np.random.random_sample(shape)+min_value
   elif dtype in (tf.int32, tf.uint8, tf.int64):
     value = np.random.randint(min_value, max_value+1, shape)
-  return value.astype(dtype)
+  elif dtype == tf.bool:
+    value = np.random.choice([True, False], size=shape)
+  return np.dtype(dtype).type(value) if np.isscalar(value) else value.astype(
+      dtype)
 
 
 def create_scalar_data(dtype, min_value=-100, max_value=100):
@@ -479,7 +485,7 @@ def make_zip_of_tests(zip_path,
                           else report_lib.FAILED)
         report["toco_log"] = toco_log
 
-        if FLAGS.save_graphdefs:
+        if True or FLAGS.save_graphdefs:
           archive.writestr(label + ".pbtxt",
                            text_format.MessageToString(graph_def),
                            zipfile.ZIP_DEFLATED)
@@ -678,6 +684,55 @@ def make_relu6_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_prelu_tests(zip_path):
+  """Make a set of tests to do PReLU."""
+
+  test_parameters = [{
+      # The canonical case for image processing is having a 4D `input` (NHWC)
+      # and `shared_axes`=[1, 2], so the alpha parameter is per channel.
+      "input_shape": [[1, 10, 10, 3], [3, 3, 3, 3]],
+      "shared_axes": [[1, 2], [1]],
+  }]
+
+  def build_graph(parameters):
+    """Build the graph for the test case."""
+
+    input_tensor = tf.placeholder(
+        dtype=tf.float32, name="input", shape=parameters["input_shape"])
+    prelu = tf.keras.layers.PReLU(shared_axes=parameters["shared_axes"])
+    out = prelu(input_tensor)
+    return [input_tensor], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    """Build the inputs for the test case."""
+
+    input_shape = parameters["input_shape"]
+    input_values = create_tensor_data(
+        np.float32, input_shape, min_value=-10, max_value=10)
+    shared_axes = parameters["shared_axes"]
+
+    alpha_shape = []
+    for dim in range(1, len(input_shape)):
+      alpha_shape.append(1 if dim in shared_axes else input_shape[dim])
+
+    alpha_values = create_tensor_data(np.float32, alpha_shape)
+
+    # There should be only 1 trainable variable tensor.
+    variables = tf.all_variables()
+    assert len(variables) == 1
+    sess.run(variables[0].assign(alpha_values))
+
+    return [input_values], sess.run(
+        outputs, feed_dict=dict(zip(inputs, [input_values])))
+
+  make_zip_of_tests(
+      zip_path,
+      test_parameters,
+      build_graph,
+      build_inputs,
+      use_frozen_graph=True)
+
+
 # This function tests various TensorFLow functions that generates Const op,
 # including `tf.ones`, `tf.zeros` and random functions.
 def make_constant_tests(zip_path):
@@ -685,21 +740,22 @@ def make_constant_tests(zip_path):
 
   test_parameters = [{
       "dtype": [tf.float32, tf.int32],
-      "input_shape": [[1], [2], [1, 1, 1, 1], [2, 2, 2, 2]],
+      "input_shape": [[], [1], [2], [1, 1, 1, 1], [2, 2, 2, 2]],
   }]
 
   def build_graph(parameters):
-    # Since Toco & Tflite can't have a single constant op in the entire graph,
-    # this test adds a zero tensor with a constant op tensor.
-    input1 = tf.placeholder(dtype=parameters["dtype"], name="input1",
-                            shape=parameters["input_shape"])
-    out = tf.ones(parameters["input_shape"], dtype=parameters["dtype"]) + input1
-    return [input1], [out]
+    dummy_input = tf.placeholder(
+        dtype=parameters["dtype"],
+        name="input1",
+        shape=parameters["input_shape"])
+    out = tf.constant(
+        create_tensor_data(parameters["dtype"], parameters["input_shape"]))
+    return [dummy_input], [out]
 
   def build_inputs(parameters, sess, inputs, outputs):
-    input1 = np.zeros(parameters["input_shape"],
-                      dtype=_TF_TYPE_INFO[parameters["dtype"]][0])
-    return [input1], sess.run(outputs, feed_dict={inputs[0]: input1})
+    dummy_input = np.zeros(
+        parameters["input_shape"], dtype=_TF_TYPE_INFO[parameters["dtype"]][0])
+    return [dummy_input], sess.run(outputs, feed_dict={inputs[0]: dummy_input})
 
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
@@ -723,6 +779,11 @@ def make_binary_op_tests(zip_path, binary_operator):
       "input_shape_1": [[1, 3, 4, 3]],
       "input_shape_2": [[3]],
       "activation": [True]
+  }, {
+      "dtype": [tf.float32],
+      "input_shape_1": [[]],
+      "input_shape_2": [[]],
+      "activation": [False]
   }]
 
   def build_graph(parameters):
@@ -772,7 +833,7 @@ def make_reduce_tests(reduce_op):
         "input_dtype": [tf.float32, tf.int32, tf.int64],
         "input_shape": [[3, 2, 4]],
         "axis": [
-            None, 0, 1, 2, [0, 1], [0, 2], [1, 2], [0, 1, 2], [1, 0], [2, 0],
+            0, 1, 2, [0, 1], [0, 2], [1, 2], [0, 1, 2], [1, 0], [2, 0],
             [2, 1], [2, 1, 0], [2, 0, 1], -1, -2, -3, [1, -1], [0, -1], [-1, 0],
             [-1, -2, -3], [0, 0, 0], [2, 2, 0], [1, 0, -3, -3]
         ],
@@ -782,13 +843,19 @@ def make_reduce_tests(reduce_op):
         "input_dtype": [tf.float32],
         "input_shape": [[1, 8, 8, 3]],
         "axis": [
-            None, 0, 1, 2, 3, [1, 2], [0, 3], [1, 2, 3], [0, 1, 2, 3],
+            0, 1, 2, 3, [1, 2], [0, 3], [1, 2, 3], [0, 1, 2, 3],
             [3, 2, 1, 0], [3, 1, 0, 2], [2, 0], [3, 0], [3, 1], [1, 0], -1, -2,
             -3, -4, [0, -2], [2, 3, -1, 0], [3, 1, 2, -3], [3, -4], [2, 2, 2],
             [2, 2, 3], [-3, -3, -4], [-3, 2, 1]
         ],
         "const_axis": [True, False],
         "keepdims": [True, False],
+    }, {
+        "input_dtype": [tf.float32],
+        "input_shape": [[], [1, 8, 8, 3], [3, 2, 4]],
+        "axis": [None],
+        "const_axis": [True],
+        "keepdims": [True, False],
     }]
 
     def build_graph(parameters):
@@ -806,7 +873,7 @@ def make_reduce_tests(reduce_op):
         if isinstance(parameters["axis"], list):
           shape = [len(parameters["axis"])]
         else:
-          shape = [0]  # shape for None or integers.
+          shape = []  # shape for None or integers.
         axis = tf.placeholder(dtype=tf.int32, name="axis", shape=shape)
         input_tensors = [input_tensor, axis]
 
@@ -817,10 +884,11 @@ def make_reduce_tests(reduce_op):
     def build_inputs(parameters, sess, inputs, outputs):
       values = [
           create_tensor_data(parameters["input_dtype"],
-                             parameters["input_shape"])]
+                             parameters["input_shape"],
+                             min_value=-10,
+                             max_value=10)]
       if not parameters["const_axis"]:
-        if parameters["axis"]:
-          values.append(np.array(parameters["axis"]))
+        values.append(np.array(parameters["axis"]))
       return values, sess.run(outputs, feed_dict=dict(zip(inputs, values)))
 
     make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
@@ -830,22 +898,30 @@ def make_reduce_tests(reduce_op):
 
 def make_mean_tests(zip_path):
   """Make a set of tests to do mean."""
-
   return make_reduce_tests(tf.reduce_mean)(zip_path)
 
 
 def make_sum_tests(zip_path):
   """Make a set of tests to do sum."""
-
   return make_reduce_tests(tf.reduce_sum)(zip_path)
 
 
+def make_reduce_prod_tests(zip_path):
+  """Make a set of tests to do prod."""
+  return make_reduce_tests(tf.reduce_prod)(zip_path)
+
+
+def make_reduce_max_tests(zip_path):
+  """Make a set of tests to do max."""
+  return make_reduce_tests(tf.reduce_max)(zip_path)
+
+
 def make_exp_tests(zip_path):
   """Make a set of tests to do exp."""
 
   test_parameters = [{
       "input_dtype": [tf.float32],
-      "input_shape": [[3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
+      "input_shape": [[], [3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
   }]
 
   def build_graph(parameters):
@@ -904,8 +980,8 @@ def make_maximum_tests(zip_path):
 
   test_parameters = [{
       "input_dtype": [tf.float32],
-      "input_shape_1": [[3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
-      "input_shape_2": [[3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
+      "input_shape_1": [[], [3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
+      "input_shape_2": [[], [3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
   }]
 
   def build_graph(parameters):
@@ -939,8 +1015,8 @@ def make_minimum_tests(zip_path):
 
   test_parameters = [{
       "input_dtype": [tf.float32],
-      "input_shape_1": [[3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
-      "input_shape_2": [[3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
+      "input_shape_1": [[], [3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
+      "input_shape_2": [[], [3], [1, 100], [4, 2, 3], [5, 224, 224, 3]],
   }]
 
   def build_graph(parameters):
@@ -1538,19 +1614,39 @@ def make_reshape_tests(zip_path):
       "dtype": [tf.float32, tf.int32],
       "input_shape": [[3, 4, 5, 7], [4, 105], [21, 5, 2, 2], [420]],
       "output_shape": [[15, 28], [420], [1, -1, 5, 7], [-1]],
+      "constant_shape": [True, False],
+  }, {
+      "dtype": [tf.float32],
+      "input_shape": [[1]],
+      "output_shape": [[]],
+      "constant_shape": [True, False],
   }]
 
   def build_graph(parameters):
     input_tensor = tf.placeholder(dtype=parameters["dtype"], name="input",
                                   shape=parameters["input_shape"])
-    out = tf.reshape(input_tensor, shape=parameters["output_shape"])
-    return [input_tensor], [out]
+
+    # Get shape as either a placeholder or constants.
+    if parameters["constant_shape"]:
+      output_shape = parameters["output_shape"]
+      input_tensors = [input_tensor]
+    else:
+      # The shape of the shape tensor.
+      shape_tensor_shape = [len(parameters["output_shape"])]
+      output_shape = tf.placeholder(
+          dtype=tf.int32, name="output_shape", shape=shape_tensor_shape)
+      input_tensors = [input_tensor, output_shape]
+    out = tf.reshape(input_tensor, shape=output_shape)
+    return input_tensors, [out]
 
   def build_inputs(parameters, sess, inputs, outputs):
-    input_values = create_tensor_data(parameters["dtype"],
-                                      parameters["input_shape"])
-    return [input_values], sess.run(
-        outputs, feed_dict=dict(zip(inputs, [input_values])))
+    values = [
+        create_tensor_data(parameters["dtype"], parameters["input_shape"])
+    ]
+    if not parameters["constant_shape"]:
+      values.append(np.array(parameters["output_shape"]))
+
+    return values, sess.run(outputs, feed_dict=dict(zip(inputs, values)))
 
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
@@ -1581,6 +1677,65 @@ def make_shape_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_one_hot_tests(zip_path):
+  """Make a set of tests to do one_hot."""
+
+  test_parameters = [{
+      "indices_type": [tf.int32, tf.int64],
+      "indices_shape": [[3], [4, 4], [1, 5], [5, 1]],
+      "axis": [0, 1],
+      "dtype": [tf.int32, tf.int64, tf.float32],
+      "provide_optional_inputs": [True, False],
+  }]
+
+  def build_graph(parameters):
+    indices = tf.placeholder(
+        dtype=parameters["indices_type"],
+        name="indices",
+        shape=parameters["indices_shape"])
+    depth = tf.placeholder(dtype=tf.int32, name="depth", shape=())
+
+    if not parameters["provide_optional_inputs"]:
+      out = tf.one_hot(indices=indices, depth=depth)
+      return [indices, depth], [out]
+
+    on_value = tf.placeholder(
+        dtype=parameters["dtype"], name="on_value", shape=())
+    off_value = tf.placeholder(
+        dtype=parameters["dtype"], name="off_value", shape=())
+    out = tf.one_hot(
+        indices=indices,
+        depth=depth,
+        on_value=on_value,
+        off_value=off_value,
+        axis=parameters["axis"],
+        dtype=parameters["dtype"])
+    return [indices, depth, on_value, off_value], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    input_values = [
+        create_tensor_data(
+            parameters["indices_type"],
+            shape=parameters["indices_shape"],
+            min_value=-1,
+            max_value=10),
+        create_tensor_data(tf.int32, shape=None, min_value=1, max_value=10),
+    ]
+
+    if parameters["provide_optional_inputs"]:
+      input_values.append(
+          create_tensor_data(
+              parameters["dtype"], shape=None, min_value=1, max_value=10))
+      input_values.append(
+          create_tensor_data(
+              parameters["dtype"], shape=None, min_value=-1, max_value=0))
+
+    return input_values, sess.run(
+        outputs, feed_dict=dict(zip(inputs, input_values)))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
 def make_resize_bilinear_tests(zip_path):
   """Make a set of tests to do resize_bilinear."""
 
@@ -2175,14 +2330,15 @@ def make_topk_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
-def make_arg_max_tests(zip_path):
+def make_arg_min_max_tests(zip_path):
   """Make a set of tests to do arg_max."""
 
   test_parameters = [{
       "input_dtype": [tf.float32, tf.int32],
-      "input_shape": [[1, 1, 1, 3], [2, 3, 4, 5], [2, 3, 3], [5, 5], [10]],
+      "input_shape": [[], [1, 1, 1, 3], [2, 3, 4, 5], [2, 3, 3], [5, 5], [10]],
       "output_type": [tf.int32, tf.int64],
       "axis_is_last_dim": [True, False],
+      "is_arg_max": [True],
   }]
 
   def build_graph(parameters):
@@ -2195,7 +2351,10 @@ def make_arg_max_tests(zip_path):
       axis = len(parameters["input_shape"]) - 1
     else:
       axis = random.randint(0, max(len(parameters["input_shape"]) - 2, 0))
-    out = tf.arg_max(input_value, axis, output_type=parameters["output_type"])
+    if parameters["is_arg_max"]:
+      out = tf.arg_max(input_value, axis, output_type=parameters["output_type"])
+    else:
+      out = tf.arg_min(input_value, axis, output_type=parameters["output_type"])
     return [input_value], [out]
 
   def build_inputs(parameters, sess, inputs, outputs):
@@ -2212,7 +2371,8 @@ def make_equal_tests(zip_path):
 
   test_parameters = [{
       "input_dtype": [tf.float32, tf.int32, tf.int64],
-      "input_shape_pair": [([1, 1, 1, 3], [1, 1, 1, 3]),
+      "input_shape_pair": [([], []),
+                           ([1, 1, 1, 3], [1, 1, 1, 3]),
                            ([2, 3, 4, 5], [2, 3, 4, 5]), ([2, 3, 3], [2, 3]),
                            ([5, 5], [1]), ([10], [2, 4, 10])],
   }]
@@ -2469,7 +2629,7 @@ def _make_elementwise_tests(op):
     """Actual function that generates examples."""
     test_parameters = [{
         "input_dtype": [tf.float32],
-        "input_shape": [[1], [1, 2], [5, 6, 7, 8], [3, 4, 5, 6]],
+        "input_shape": [[], [1], [1, 2], [5, 6, 7, 8], [3, 4, 5, 6]],
     }]
 
     def build_graph(parameters):
@@ -2791,6 +2951,73 @@ def make_sparse_to_dense_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_pack_tests(zip_path):
+  """Make a set of tests to do stack."""
+
+  test_parameters = [{
+      "base_shape": [[3, 4, 3], [3, 4], [5]],
+      "num_tensors": [1, 2, 3, 4, 5, 6],
+      "axis": [0, 1, 2, 3],
+      "additional_shape": [1, 2, 3],
+  }]
+
+  def get_shape(parameters):
+    """Return a tweaked version of 'base_shape'."""
+    axis = parameters["axis"]
+    shape = parameters["base_shape"][:]
+    if axis < len(shape):
+      shape[axis] += parameters["additional_shape"]
+    return shape
+
+  def build_graph(parameters):
+    all_tensors = []
+    for n in range(0, parameters["num_tensors"]):
+      input_tensor = tf.placeholder(
+          dtype=tf.float32, name=("input%d" % n), shape=get_shape(parameters))
+      all_tensors.append(input_tensor)
+    out = tf.stack(all_tensors, parameters["axis"])
+    return all_tensors, [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    all_values = []
+    for _ in range(0, parameters["num_tensors"]):
+      input_values = create_tensor_data(np.float32, get_shape(parameters))
+      all_values.append(input_values)
+    return all_values, sess.run(
+        outputs, feed_dict=dict(zip(inputs, all_values)))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
+def make_logical_or_tests(zip_path):
+  """Make a set of tests to do logical_or."""
+
+  test_parameters = [{
+      "input_shape_pair": [([], []), ([1, 1, 1, 3], [1, 1, 1, 3]),
+                           ([2, 3, 4, 5], [2, 3, 4, 5]), ([2, 3, 3], [2, 3]),
+                           ([5, 5], [1]), ([10], [2, 4, 10])],
+  }]
+
+  def build_graph(parameters):
+    """Build the logical_or op testing graph."""
+    input_value1 = tf.placeholder(
+        dtype=tf.bool, name="input1", shape=parameters["input_shape_pair"][0])
+    input_value2 = tf.placeholder(
+        dtype=tf.bool, name="input2", shape=parameters["input_shape_pair"][1])
+    out = tf.logical_or(input_value1, input_value2)
+    return [input_value1, input_value2], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    input_value1 = create_tensor_data(tf.bool,
+                                      parameters["input_shape_pair"][0])
+    input_value2 = create_tensor_data(tf.bool,
+                                      parameters["input_shape_pair"][1])
+    return [input_value1, input_value2], sess.run(
+        outputs, feed_dict=dict(zip(inputs, [input_value1, input_value2])))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
 # Toco binary path provided by the generate rule.
 bin_path = None
 
diff --git a/tensorflow/contrib/lite/testing/generate_testspec.cc b/tensorflow/contrib/lite/testing/generate_testspec.cc
index c1092e4d25..f29c188e6c 100644
--- a/tensorflow/contrib/lite/testing/generate_testspec.cc
+++ b/tensorflow/contrib/lite/testing/generate_testspec.cc
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include <iostream>
+
 #include "tensorflow/contrib/lite/testing/generate_testspec.h"
 #include "tensorflow/contrib/lite/testing/join.h"
 #include "tensorflow/contrib/lite/testing/split.h"
@@ -88,13 +90,13 @@ bool GenerateTestSpecFromTensorflowModel(
   TfDriver runner(input_layer, input_layer_type, input_layer_shape,
                   output_layer);
   if (!runner.IsValid()) {
-    cerr << runner.GetErrorMessage() << endl;
+    std::cerr << runner.GetErrorMessage() << std::endl;
     return false;
   }
 
   runner.LoadModel(tensorflow_model_path);
   if (!runner.IsValid()) {
-    cerr << runner.GetErrorMessage() << endl;
+    std::cerr << runner.GetErrorMessage() << std::endl;
     return false;
   }
 
@@ -118,14 +120,14 @@ bool GenerateTestSpecFromTensorflowModel(
     for (int j = 0; j < input_values.size(); j++) {
       runner.SetInput(j, input_values[j]);
       if (!runner.IsValid()) {
-        cerr << runner.GetErrorMessage() << endl;
+        std::cerr << runner.GetErrorMessage() << std::endl;
         return false;
       }
     }
 
     runner.Invoke();
     if (!runner.IsValid()) {
-      cerr << runner.GetErrorMessage() << endl;
+      std::cerr << runner.GetErrorMessage() << std::endl;
       return false;
     }
 
@@ -137,7 +139,7 @@ bool GenerateTestSpecFromTensorflowModel(
     for (int j = 0; j < output_layer.size(); j++) {
       stream << "  output: \"" << runner.ReadOutput(j) << "\"\n";
       if (!runner.IsValid()) {
-        cerr << runner.GetErrorMessage() << endl;
+        std::cerr << runner.GetErrorMessage() << std::endl;
         return false;
       }
     }
diff --git a/tensorflow/contrib/lite/testing/generate_testspec.h b/tensorflow/contrib/lite/testing/generate_testspec.h
index bfaf5e7ec8..b3d0db31c0 100644
--- a/tensorflow/contrib/lite/testing/generate_testspec.h
+++ b/tensorflow/contrib/lite/testing/generate_testspec.h
@@ -19,6 +19,8 @@ limitations under the License.
 #include <iostream>
 #include <vector>
 
+#include "tensorflow/contrib/lite/string.h"
+
 namespace tflite {
 namespace testing {
 
diff --git a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
index c4e20312d8..106cbc1b8e 100644
--- a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
+++ b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
@@ -53,10 +53,6 @@ tensorflow::Env* env = tensorflow::Env::Default();
 // Key is a substring of the test name and value is a bug number.
 // TODO(ahentz): make sure we clean this list up frequently.
 std::map<string, string> kBrokenTests = {
-    {R"(^\/mul.*int32)", "68808744"},
-    {R"(^\/div.*int32)", "68808744"},
-    {R"(^\/sub.*int32)", "68808744"},
-
     // Pad and PadV2 only supports 4D tensors.
     {R"(^\/pad.*,input_shape=\[.,.\],paddings=\[\[.,.\],\[.,.\]\])",
      "70527055"},
@@ -97,11 +93,12 @@ std::map<string, string> kBrokenTests = {
     {R"(^\/gather.*axis=1)", "76910444"},
 
     // No support for arbitrary dimensions in ArgMax.
-    {R"(^\/arg_max.*axis_is_last_dim=False.*input_shape=\[.,.,.,.\])",
+    {R"(^\/arg_min_max.*axis_is_last_dim=False.*input_shape=\[.,.,.,.\])",
+     "77546240"},
+    {R"(^\/arg_min_max.*axis_is_last_dim=False.*input_shape=\[.,.,.\])",
      "77546240"},
-    {R"(^\/arg_max.*axis_is_last_dim=False.*input_shape=\[.,.,.\])",
+    {R"(^\/arg_min_max.*axis_is_last_dim=False.*input_shape=\[.,.\])",
      "77546240"},
-    {R"(^\/arg_max.*axis_is_last_dim=False.*input_shape=\[.,.\])", "77546240"},
 };
 
 // Allows test data to be unzipped into a temporary directory and makes
@@ -229,7 +226,8 @@ TEST_P(OpsTest, RunZipTests) {
   string message = test_driver.GetErrorMessage();
   if (bug_number.empty()) {
     if (FLAGS_use_nnapi && FLAGS_ignore_unsupported_nnapi && !result) {
-      EXPECT_EQ(message, string("Failed to invoke interpreter")) << message;
+      EXPECT_EQ(message, string("Failed to invoke NNAPI interpreter"))
+          << message;
     } else {
       EXPECT_TRUE(result) << message;
     }
diff --git a/tensorflow/contrib/lite/testing/join.h b/tensorflow/contrib/lite/testing/join.h
index 1edee01cf9..4be19ad756 100644
--- a/tensorflow/contrib/lite/testing/join.h
+++ b/tensorflow/contrib/lite/testing/join.h
@@ -17,7 +17,8 @@ limitations under the License.
 
 #include <cstdlib>
 #include <sstream>
-#include <string>
+
+#include "tensorflow/contrib/lite/string.h"
 
 namespace tflite {
 namespace testing {
diff --git a/tensorflow/contrib/lite/testing/test_runner.h b/tensorflow/contrib/lite/testing/test_runner.h
index 96ab6be54e..fac7d01aab 100644
--- a/tensorflow/contrib/lite/testing/test_runner.h
+++ b/tensorflow/contrib/lite/testing/test_runner.h
@@ -90,7 +90,7 @@ class TestRunner {
 
   // Invalidate the test runner, preventing it from executing any further.
   void Invalidate(const string& error_message) {
-    cerr << error_message << std::endl;
+    std::cerr << error_message << std::endl;
     error_message_ = error_message;
   }
   bool IsValid() const { return error_message_.empty(); }
diff --git a/tensorflow/contrib/lite/testing/tf_driver.cc b/tensorflow/contrib/lite/testing/tf_driver.cc
index 3b27f6f3da..ec435ca60d 100644
--- a/tensorflow/contrib/lite/testing/tf_driver.cc
+++ b/tensorflow/contrib/lite/testing/tf_driver.cc
@@ -28,8 +28,8 @@ namespace {
 
 tensorflow::Tensor CreateTensor(const tensorflow::DataType type,
                                 const std::vector<int64_t>& dim) {
-  tensorflow::TensorShape shape{gtl::ArraySlice<int64>{
-      reinterpret_cast<const int64*>(dim.data()), dim.size()}};
+  tensorflow::TensorShape shape{tensorflow::gtl::ArraySlice<tensorflow::int64>{
+      reinterpret_cast<const tensorflow::int64*>(dim.data()), dim.size()}};
   return {type, shape};
 }
 
@@ -179,7 +179,7 @@ void TfDriver::Invoke() {
   auto status = session_->Run({input_tensors_.begin(), input_tensors_.end()},
                               output_names_, {}, &output_tensors_);
   if (!status.ok()) {
-    Invalidate("Failed to invoke interpreter");
+    Invalidate("Failed to run input data on graph");
   }
 }
 
diff --git a/tensorflow/contrib/lite/testing/tflite_diff_flags.h b/tensorflow/contrib/lite/testing/tflite_diff_flags.h
index 7a57e8d3fb..695c2a3de6 100644
--- a/tensorflow/contrib/lite/testing/tflite_diff_flags.h
+++ b/tensorflow/contrib/lite/testing/tflite_diff_flags.h
@@ -15,6 +15,8 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_FLAGS_H_
 #define TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_FLAGS_H_
 
+#include <cstring>
+
 #include "tensorflow/contrib/lite/testing/split.h"
 #include "tensorflow/contrib/lite/testing/tflite_diff_util.h"
 #include "tensorflow/core/util/command_line_flags.h"
diff --git a/tensorflow/contrib/lite/testing/util.h b/tensorflow/contrib/lite/testing/util.h
index 6d20aec141..8aa639157b 100644
--- a/tensorflow/contrib/lite/testing/util.h
+++ b/tensorflow/contrib/lite/testing/util.h
@@ -15,8 +15,39 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_LITE_TESTING_UTIL_H_
 #define TENSORFLOW_CONTRIB_LITE_TESTING_UTIL_H_
 
+#include <cstdio>
+
+#include "tensorflow/contrib/lite/error_reporter.h"
+#include "tensorflow/contrib/lite/string.h"
+
 namespace tflite {
 
+// An ErrorReporter that collects error message in a string, in addition
+// to printing to stderr.
+class TestErrorReporter : public ErrorReporter {
+ public:
+  int Report(const char* format, va_list args) override {
+    char buffer[1024];
+    int size = vsnprintf(buffer, sizeof(buffer), format, args);
+    fprintf(stderr, "%s", buffer);
+    error_messages_ += buffer;
+    num_calls_++;
+    return size;
+  }
+
+  void Reset() {
+    num_calls_ = 0;
+    error_messages_.clear();
+  }
+
+  int num_calls() const { return num_calls_; }
+  const string& error_messages() const { return error_messages_; }
+
+ private:
+  int num_calls_ = 0;
+  string error_messages_;
+};
+
 inline void LogToStderr() {
 #ifdef PLATFORM_GOOGLE
   FLAGS_logtostderr = true;
diff --git a/tensorflow/contrib/lite/toco/BUILD b/tensorflow/contrib/lite/toco/BUILD
index 209dce56cb..c88079717d 100644
--- a/tensorflow/contrib/lite/toco/BUILD
+++ b/tensorflow/contrib/lite/toco/BUILD
@@ -93,6 +93,7 @@ cc_library(
         ":runtime",
         ":toco_port",
         "//tensorflow/core:lib",
+        "@com_google_absl//absl/types:optional",
     ],
 )
 
@@ -176,7 +177,7 @@ cc_library(
         "graph_transformations/convert_reorder_axes.cc",
         "graph_transformations/convert_squeeze_to_reshape.cc",
         "graph_transformations/convert_trivial_addn_to_add.cc",
-        "graph_transformations/convert_trivial_stack_to_reshape.cc",
+        "graph_transformations/convert_trivial_pack_to_reshape.cc",
         "graph_transformations/convert_trivial_tile_to_concat.cc",
         "graph_transformations/convert_trivial_transpose_to_reshape.cc",
         "graph_transformations/create_im2col_arrays.cc",
@@ -212,7 +213,7 @@ cc_library(
         "graph_transformations/quantization_util.h",
         "graph_transformations/quantize.cc",
         "graph_transformations/quantize_weights.cc",
-        "graph_transformations/read_fake_quant_min_max.cc",
+        "graph_transformations/read_array_minmax_and_narrow_range_from_fake_quant.cc",
         "graph_transformations/remove_final_dequantize_op.cc",
         "graph_transformations/remove_tensorflow_assert.cc",
         "graph_transformations/remove_tensorflow_identity.cc",
@@ -236,19 +237,21 @@ cc_library(
         "graph_transformations/resolve_constant_fake_quant.cc",
         "graph_transformations/resolve_constant_fill.cc",
         "graph_transformations/resolve_constant_gather.cc",
+        "graph_transformations/resolve_constant_pack.cc",
         "graph_transformations/resolve_constant_random_uniform.cc",
         "graph_transformations/resolve_constant_range.cc",
         "graph_transformations/resolve_constant_reshape.cc",
         "graph_transformations/resolve_constant_shape_or_rank.cc",
         "graph_transformations/resolve_constant_slice.cc",
-        "graph_transformations/resolve_constant_stack.cc",
         "graph_transformations/resolve_constant_strided_slice.cc",
         "graph_transformations/resolve_constant_transpose.cc",
         "graph_transformations/resolve_constant_unary.cc",
-        "graph_transformations/resolve_mean_attributes.cc",
+        "graph_transformations/resolve_fake_quant_args_from_vars.cc",
+        "graph_transformations/resolve_gather_attributes.cc",
         "graph_transformations/resolve_multiply_by_zero.cc",
         "graph_transformations/resolve_pad_attributes.cc",
         "graph_transformations/resolve_padv2_attributes.cc",
+        "graph_transformations/resolve_reduce_attributes.cc",
         "graph_transformations/resolve_reorder_axes.cc",
         "graph_transformations/resolve_reshape_attributes.cc",
         "graph_transformations/resolve_slice_attributes.cc",
@@ -336,6 +339,7 @@ cc_library(
 tf_cc_test(
     name = "import_tensorflow_test",
     srcs = ["import_tensorflow_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":toco_tooling",
         "//tensorflow/core:framework",
@@ -375,6 +379,7 @@ cc_library(
 tf_cc_test(
     name = "tooling_util_test",
     srcs = ["tooling_util_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":model",
         ":tooling_util",
@@ -409,6 +414,7 @@ tf_cc_test(
     data = [
         "toco_port_test.cc",
     ],
+    tags = ["no_oss"],
     deps = [
         ":toco_port",
         "@com_google_googletest//:gtest_main",
diff --git a/tensorflow/contrib/lite/toco/export_tensorflow.cc b/tensorflow/contrib/lite/toco/export_tensorflow.cc
index 6be6b25f93..8b41865985 100644
--- a/tensorflow/contrib/lite/toco/export_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/export_tensorflow.cc
@@ -215,6 +215,30 @@ void ConvertFloatTensorConst(const Model& model, const string& name,
                    LegacyScalarPolicy::kAvoidLegacyScalars);
 }
 
+void ConvertBoolTensorConst(const Model& model, const string& name,
+                            GraphDef* tensorflow_graph) {
+  if (HasAlreadyExportedConst(name, *tensorflow_graph)) {
+    return;
+  }
+  CHECK(model.HasArray(name));
+  const auto& array = model.GetArray(name);
+  tensorflow::NodeDef* const_op = tensorflow_graph->add_node();
+  const_op->set_op("Const");
+  const_op->set_name(name);
+  (*const_op->mutable_attr())["dtype"].set_type(DT_BOOL);
+  auto* tensor = (*const_op->mutable_attr())["value"].mutable_tensor();
+  tensor->set_dtype(DT_BOOL);
+  const auto& data = array.GetBuffer<ArrayDataType::kBool>().data;
+  for (auto index : data) {
+    tensor->add_bool_val(index);
+  }
+  const auto& array_shape = array.shape();
+  auto* shape = tensor->mutable_tensor_shape();
+  for (int i = 0; i < array_shape.dimensions_count(); i++) {
+    shape->add_dim()->set_size(array_shape.dims(i));
+  }
+}
+
 void ConvertIntTensorConst(const Model& model, const string& name,
                            GraphDef* tensorflow_graph) {
   if (HasAlreadyExportedConst(name, *tensorflow_graph)) {
@@ -621,7 +645,8 @@ void ConvertAddOperator(const Model& model, const AddOperator& src_op,
   CHECK_EQ(src_op.inputs.size(), 2);
   *add_op->add_input() = src_op.inputs[0];
   *add_op->add_input() = src_op.inputs[1];
-  (*add_op->mutable_attr())["T"].set_type(DT_FLOAT);
+  (*add_op->mutable_attr())["T"].set_type(
+      GetTensorFlowDataType(model, src_op.outputs[0]));
 }
 
 void ConvertAddNOperator(const Model& model, const AddNOperator& src_op,
@@ -633,7 +658,8 @@ void ConvertAddNOperator(const Model& model, const AddNOperator& src_op,
     *add_op->add_input() = input;
   }
   (*add_op->mutable_attr())["N"].set_i(src_op.inputs.size());
-  (*add_op->mutable_attr())["T"].set_type(DT_FLOAT);
+  (*add_op->mutable_attr())["T"].set_type(
+      GetTensorFlowDataType(model, src_op.outputs[0]));
 }
 
 void ConvertMulOperator(const Model& model, const MulOperator& src_op,
@@ -644,16 +670,18 @@ void ConvertMulOperator(const Model& model, const MulOperator& src_op,
   CHECK_EQ(src_op.inputs.size(), 2);
   *add_op->add_input() = src_op.inputs[0];
   *add_op->add_input() = src_op.inputs[1];
-  (*add_op->mutable_attr())["T"].set_type(DT_FLOAT);
+  (*add_op->mutable_attr())["T"].set_type(
+      GetTensorFlowDataType(model, src_op.outputs[0]));
 }
 
-void ConvertReluOperator(const ReluOperator& src_op,
+void ConvertReluOperator(const Model& model, const ReluOperator& src_op,
                          GraphDef* tensorflow_graph) {
   tensorflow::NodeDef* relu_op = tensorflow_graph->add_node();
   relu_op->set_op("Relu");
   relu_op->set_name(src_op.outputs[0]);
   *relu_op->add_input() = src_op.inputs[0];
-  (*relu_op->mutable_attr())["T"].set_type(DT_FLOAT);
+  (*relu_op->mutable_attr())["T"].set_type(
+      GetTensorFlowDataType(model, src_op.outputs[0]));
 }
 
 void ConvertRelu1Operator(const Relu1Operator& src_op,
@@ -884,6 +912,9 @@ void ConvertFakeQuantOperator(const FakeQuantOperator& src_op,
   if (src_op.num_bits) {
     (*fakequant_op->mutable_attr())["num_bits"].set_i(src_op.num_bits);
   }
+  if (src_op.narrow_range) {
+    (*fakequant_op->mutable_attr())["narrow_range"].set_b(src_op.narrow_range);
+  }
 }
 
 void ConvertMaxPoolOperator(const MaxPoolOperator& src_op,
@@ -1107,13 +1138,27 @@ void ConvertFloorOperator(const Model& model, const FloorOperator& src_op,
 void ConvertGatherOperator(const Model& model, const GatherOperator& src_op,
                            GraphDef* tensorflow_graph) {
   tensorflow::NodeDef* gather_op = tensorflow_graph->add_node();
-  gather_op->set_op("Gather");
+  gather_op->set_op("GatherV2");
   gather_op->set_name(src_op.outputs[0]);
-  CHECK_EQ(src_op.inputs.size(), 2);
   *gather_op->add_input() = src_op.inputs[0];
   *gather_op->add_input() = src_op.inputs[1];
 
+  if (!src_op.axis) {
+    // Dynamic axis.
+    CHECK_EQ(src_op.inputs.size(), 3);
+    *gather_op->add_input() = src_op.inputs[2];
+  } else {
+    // Constant axis.
+    CHECK_EQ(src_op.inputs.size(), 2);
+    const string gather_axis =
+        AvailableArrayName(model, gather_op->name() + "/axis");
+    CreateIntTensorConst(gather_axis, {src_op.axis.value()}, {},
+                         tensorflow_graph);
+    *gather_op->add_input() = gather_axis;
+  }
+
   (*gather_op->mutable_attr())["Tindices"].set_type(DT_INT32);
+  (*gather_op->mutable_attr())["Taxis"].set_type(DT_INT32);
   const tensorflow::DataType params_type =
       GetTensorFlowDataType(model, src_op.inputs[0]);
   (*gather_op->mutable_attr())["Tparams"].set_type(params_type);
@@ -1135,6 +1180,22 @@ void ConvertArgMaxOperator(const Model& model, const ArgMaxOperator& src_op,
       GetTensorFlowDataType(model, src_op.outputs[0]));
 }
 
+void ConvertArgMinOperator(const Model& model, const ArgMinOperator& src_op,
+                           GraphDef* tensorflow_graph) {
+  tensorflow::NodeDef* argmin_op = tensorflow_graph->add_node();
+  argmin_op->set_op("ArgMin");
+  argmin_op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 2);
+  *argmin_op->add_input() = src_op.inputs[0];
+  *argmin_op->add_input() = src_op.inputs[1];
+  (*argmin_op->mutable_attr())["T"].set_type(
+      GetTensorFlowDataType(model, src_op.inputs[0]));
+  (*argmin_op->mutable_attr())["Tidx"].set_type(
+      GetTensorFlowDataType(model, src_op.inputs[1]));
+  (*argmin_op->mutable_attr())["output_type"].set_type(
+      GetTensorFlowDataType(model, src_op.outputs[0]));
+}
+
 void ConvertTransposeOperator(const Model& model,
                               const TransposeOperator& src_op,
                               GraphDef* tensorflow_graph) {
@@ -1188,17 +1249,17 @@ void ConvertRangeOperator(const Model& model, const RangeOperator& src_op,
       GetTensorFlowDataType(src_op.dtype));
 }
 
-void ConvertStackOperator(const Model& model, const StackOperator& src_op,
-                          GraphDef* tensorflow_graph) {
-  tensorflow::NodeDef* stack_op = tensorflow_graph->add_node();
-  stack_op->set_op("Stack");
-  stack_op->set_name(src_op.outputs[0]);
+void ConvertPackOperator(const Model& model, const PackOperator& src_op,
+                         GraphDef* tensorflow_graph) {
+  tensorflow::NodeDef* pack_op = tensorflow_graph->add_node();
+  pack_op->set_op("Pack");
+  pack_op->set_name(src_op.outputs[0]);
   for (const auto& input : src_op.inputs) {
-    *stack_op->add_input() = input;
+    *pack_op->add_input() = input;
   }
-  (*stack_op->mutable_attr())["elem_type"].set_type(
-      GetTensorFlowDataType(model, src_op.outputs[0]));
-  (*stack_op->mutable_attr())["axis"].set_i(src_op.axis);
+  (*pack_op->mutable_attr())["axis"].set_i(src_op.axis);
+  (*pack_op->mutable_attr())["N"].set_i(src_op.inputs.size());
+  (*pack_op->mutable_attr())["T"].set_type(GetTensorFlowDataType(src_op.dtype));
 }
 
 void ConvertFillOperator(const Model& model, const FillOperator& src_op,
@@ -1255,6 +1316,20 @@ void ConvertResizeBilinearOperator(const Model& model,
   (*resize_op->mutable_attr())["align_corners"].set_b(src_op.align_corners);
 }
 
+void ConvertOneHotOperator(const Model& model, const OneHotOperator& src_op,
+                           GraphDef* tensorflow_graph) {
+  tensorflow::NodeDef* onehot_op = tensorflow_graph->add_node();
+  onehot_op->set_op("OneHot");
+  onehot_op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 4);
+  for (const auto& input : src_op.inputs) {
+    *onehot_op->add_input() = input;
+  }
+  (*onehot_op->mutable_attr())["T"].set_type(
+      GetTensorFlowDataType(model, src_op.outputs[0]));
+  (*onehot_op->mutable_attr())["axis"].set_i(src_op.axis);
+}
+
 namespace {
 // TODO(aselle): Remove when available in absl
 absl::string_view FindLongestCommonPrefix(absl::string_view a,
@@ -1604,10 +1679,11 @@ void ConvertSliceOperator(const Model& model, const SliceOperator& src_op,
   CreateSliceInput(src_op.inputs[2], src_op.size, tensorflow_graph);
 }
 
-void ConvertMeanOperator(const Model& model, const MeanOperator& src_op,
-                         GraphDef* tensorflow_graph) {
+template <typename T>
+void ConvertReduceOperator(const Model& model, const T& src_op,
+                           GraphDef* tensorflow_graph, const string& op_name) {
   tensorflow::NodeDef* new_op = tensorflow_graph->add_node();
-  new_op->set_op("Mean");
+  new_op->set_op(op_name);
   new_op->set_name(src_op.outputs[0]);
   CHECK_EQ(src_op.inputs.size(), 2);
   *new_op->add_input() = src_op.inputs[0];
@@ -1616,6 +1692,9 @@ void ConvertMeanOperator(const Model& model, const MeanOperator& src_op,
   const tensorflow::DataType params_type =
       GetTensorFlowDataType(model, src_op.inputs[0]);
   (*new_op->mutable_attr())["T"].set_type(params_type);
+  const tensorflow::DataType indices_type =
+      GetTensorFlowDataType(model, src_op.inputs[1]);
+  (*new_op->mutable_attr())["Tidx"].set_type(indices_type);
 
   if (src_op.keep_dims) {
     (*new_op->mutable_attr())["keep_dims"].set_b(true);
@@ -1672,43 +1751,43 @@ void ConvertSubOperator(const Model& model, const SubOperator& src_op,
 void ConvertTensorFlowMinimumOperator(const Model& model,
                                       const TensorFlowMinimumOperator& src_op,
                                       GraphDef* tensorflow_graph) {
-  tensorflow::NodeDef* sub_op = tensorflow_graph->add_node();
-  sub_op->set_op("Minimum");
-  sub_op->set_name(src_op.outputs[0]);
+  tensorflow::NodeDef* min_op = tensorflow_graph->add_node();
+  min_op->set_op("Minimum");
+  min_op->set_name(src_op.outputs[0]);
   CHECK_EQ(src_op.inputs.size(), 2);
-  *sub_op->add_input() = src_op.inputs[0];
-  *sub_op->add_input() = src_op.inputs[1];
+  *min_op->add_input() = src_op.inputs[0];
+  *min_op->add_input() = src_op.inputs[1];
   const tensorflow::DataType data_type =
       GetTensorFlowDataType(model, src_op.inputs[0]);
-  (*sub_op->mutable_attr())["T"].set_type(data_type);
+  (*min_op->mutable_attr())["T"].set_type(data_type);
 }
 
 void ConvertTensorFlowMaximumOperator(const Model& model,
                                       const TensorFlowMaximumOperator& src_op,
                                       GraphDef* tensorflow_graph) {
-  tensorflow::NodeDef* sub_op = tensorflow_graph->add_node();
-  sub_op->set_op("Maximum");
-  sub_op->set_name(src_op.outputs[0]);
+  tensorflow::NodeDef* max_op = tensorflow_graph->add_node();
+  max_op->set_op("Maximum");
+  max_op->set_name(src_op.outputs[0]);
   CHECK_EQ(src_op.inputs.size(), 2);
-  *sub_op->add_input() = src_op.inputs[0];
-  *sub_op->add_input() = src_op.inputs[1];
+  *max_op->add_input() = src_op.inputs[0];
+  *max_op->add_input() = src_op.inputs[1];
   const tensorflow::DataType data_type =
       GetTensorFlowDataType(model, src_op.inputs[0]);
-  (*sub_op->mutable_attr())["T"].set_type(data_type);
+  (*max_op->mutable_attr())["T"].set_type(data_type);
 }
 
 void ConvertSelectOperator(const Model& model, const SelectOperator& src_op,
                            GraphDef* tensorflow_graph) {
-  tensorflow::NodeDef* sub_op = tensorflow_graph->add_node();
-  sub_op->set_op("Select");
-  sub_op->set_name(src_op.outputs[0]);
+  tensorflow::NodeDef* select_op = tensorflow_graph->add_node();
+  select_op->set_op("Select");
+  select_op->set_name(src_op.outputs[0]);
   CHECK_EQ(src_op.inputs.size(), 3);
-  *sub_op->add_input() = src_op.inputs[0];
-  *sub_op->add_input() = src_op.inputs[1];
-  *sub_op->add_input() = src_op.inputs[2];
+  *select_op->add_input() = src_op.inputs[0];
+  *select_op->add_input() = src_op.inputs[1];
+  *select_op->add_input() = src_op.inputs[2];
   const tensorflow::DataType data_type =
       GetTensorFlowDataType(model, src_op.inputs[1]);
-  (*sub_op->mutable_attr())["T"].set_type(data_type);
+  (*select_op->mutable_attr())["T"].set_type(data_type);
 }
 
 void ConvertTileOperator(const Model& model,
@@ -1731,11 +1810,14 @@ void ConvertTileOperator(const Model& model,
 void ConvertTopKV2Operator(const Model& model, const TopKV2Operator& src_op,
                            GraphDef* tensorflow_graph) {
   tensorflow::NodeDef* topk_op = tensorflow_graph->add_node();
-  topk_op->set_op("TOPKV2");
+  topk_op->set_op("TopKV2");
   topk_op->set_name(src_op.outputs[0]);
   CHECK_EQ(src_op.inputs.size(), 2);
   *topk_op->add_input() = src_op.inputs[0];
   *topk_op->add_input() = src_op.inputs[1];
+  const tensorflow::DataType data_type =
+      GetTensorFlowDataType(model, src_op.inputs[0]);
+  (*topk_op->mutable_attr())["T"].set_type(data_type);
   (*topk_op->mutable_attr())["sorted"].set_b(true);
 }
 
@@ -1806,6 +1888,73 @@ void ConvertPowOperator(const Model& model, const PowOperator& src_op,
   (*pow_op->mutable_attr())["T"].set_type(data_type);
 }
 
+void ConvertAnyOperator(const Model& model, const AnyOperator& src_op,
+                        GraphDef* tensorflow_graph) {
+  tensorflow::NodeDef* any_op = tensorflow_graph->add_node();
+  any_op->set_op("Any");
+  any_op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 2);
+  for (int i = 0; i < 2; ++i) {
+    *any_op->add_input() = src_op.inputs[i];
+  }
+  const tensorflow::DataType data_type =
+      GetTensorFlowDataType(model, src_op.inputs[1]);
+  (*any_op->mutable_attr())["Tidx"].set_type(data_type);
+  (*any_op->mutable_attr())["keep_dims"].set_b(src_op.keep_dims);
+}
+
+void ConvertLogicalAndOperator(const Model& model,
+                               const LogicalAndOperator& src_op,
+                               GraphDef* tensorflow_graph) {
+  tensorflow::NodeDef* logical_op = tensorflow_graph->add_node();
+  logical_op->set_op("LogicalAnd");
+  logical_op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 2);
+  for (int i = 0; i < 2; ++i) {
+    *logical_op->add_input() = src_op.inputs[i];
+  }
+}
+
+void ConvertLogicalNotOperator(const Model& model,
+                               const LogicalNotOperator& src_op,
+                               GraphDef* tensorflow_graph) {
+  tensorflow::NodeDef* logical_op = tensorflow_graph->add_node();
+  logical_op->set_op("LogicalNot");
+  logical_op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 1);
+  *logical_op->add_input() = src_op.inputs[0];
+}
+
+void ConvertLogicalOrOperator(const Model& model,
+                              const LogicalOrOperator& src_op,
+                              const char* op_name, GraphDef* tensorflow_graph) {
+  tensorflow::NodeDef* logical_or_op = tensorflow_graph->add_node();
+  logical_or_op->set_op(op_name);
+  logical_or_op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 2);
+  for (int i = 0; i < 2; ++i) {
+    *logical_or_op->add_input() = src_op.inputs[i];
+  }
+  const tensorflow::DataType data_type =
+      GetTensorFlowDataType(model, src_op.inputs[0]);
+  (*logical_or_op->mutable_attr())["T"].set_type(data_type);
+}
+
+void ConvertCTCBeamSearchDecoderOperator(
+    const Model& model, const CTCBeamSearchDecoderOperator& src_op,
+    const char* op_name, GraphDef* tensorflow_graph) {
+  auto* op = tensorflow_graph->add_node();
+  op->set_op(op_name);
+  op->set_name(src_op.outputs[0]);
+  CHECK_EQ(src_op.inputs.size(), 2);
+  for (int i = 0; i < 2; ++i) {
+    *op->add_input() = src_op.inputs[i];
+  }
+  (*op->mutable_attr())["beam_width"].set_i(src_op.beam_width);
+  (*op->mutable_attr())["top_paths"].set_i(src_op.top_paths);
+  (*op->mutable_attr())["merge_repeated"].set_b(src_op.merge_repeated);
+}
+
 void ConvertOperator(const Model& model, const Operator& src_op,
                      GraphDef* tensorflow_graph) {
   if (src_op.fused_activation_function != FusedActivationFunctionType::kNone) {
@@ -1842,7 +1991,7 @@ void ConvertOperator(const Model& model, const Operator& src_op,
     ConvertMulOperator(model, static_cast<const MulOperator&>(src_op),
                        tensorflow_graph);
   } else if (src_op.type == OperatorType::kRelu) {
-    ConvertReluOperator(static_cast<const ReluOperator&>(src_op),
+    ConvertReluOperator(model, static_cast<const ReluOperator&>(src_op),
                         tensorflow_graph);
   } else if (src_op.type == OperatorType::kRelu1) {
     ConvertRelu1Operator(static_cast<const Relu1Operator&>(src_op),
@@ -1942,8 +2091,24 @@ void ConvertOperator(const Model& model, const Operator& src_op,
         model, static_cast<const StridedSliceOperator&>(src_op),
         tensorflow_graph);
   } else if (src_op.type == OperatorType::kMean) {
-    ConvertMeanOperator(model, static_cast<const MeanOperator&>(src_op),
-                        tensorflow_graph);
+    ConvertReduceOperator(model, static_cast<const MeanOperator&>(src_op),
+                          tensorflow_graph, "Mean");
+  } else if (src_op.type == OperatorType::kSum) {
+    ConvertReduceOperator(model,
+                          static_cast<const TensorFlowSumOperator&>(src_op),
+                          tensorflow_graph, "Sum");
+  } else if (src_op.type == OperatorType::kReduceProd) {
+    ConvertReduceOperator(model,
+                          static_cast<const TensorFlowProdOperator&>(src_op),
+                          tensorflow_graph, "Prod");
+  } else if (src_op.type == OperatorType::kReduceMin) {
+    ConvertReduceOperator(model,
+                          static_cast<const TensorFlowMaxOperator&>(src_op),
+                          tensorflow_graph, "Min");
+  } else if (src_op.type == OperatorType::kReduceMax) {
+    ConvertReduceOperator(model,
+                          static_cast<const TensorFlowMaxOperator&>(src_op),
+                          tensorflow_graph, "Max");
   } else if (src_op.type == OperatorType::kSub) {
     ConvertSubOperator(model, static_cast<const SubOperator&>(src_op),
                        tensorflow_graph);
@@ -1964,6 +2129,9 @@ void ConvertOperator(const Model& model, const Operator& src_op,
   } else if (src_op.type == OperatorType::kArgMax) {
     ConvertArgMaxOperator(model, static_cast<const ArgMaxOperator&>(src_op),
                           tensorflow_graph);
+  } else if (src_op.type == OperatorType::kArgMin) {
+    ConvertArgMinOperator(model, static_cast<const ArgMinOperator&>(src_op),
+                          tensorflow_graph);
   } else if (src_op.type == OperatorType::kTopK_V2) {
     ConvertTopKV2Operator(model, static_cast<const TopKV2Operator&>(src_op),
                           tensorflow_graph);
@@ -1980,9 +2148,9 @@ void ConvertOperator(const Model& model, const Operator& src_op,
   } else if (src_op.type == OperatorType::kRange) {
     ConvertRangeOperator(model, static_cast<const RangeOperator&>(src_op),
                          tensorflow_graph);
-  } else if (src_op.type == OperatorType::kStack) {
-    ConvertStackOperator(model, static_cast<const StackOperator&>(src_op),
-                         tensorflow_graph);
+  } else if (src_op.type == OperatorType::kPack) {
+    ConvertPackOperator(model, static_cast<const PackOperator&>(src_op),
+                        tensorflow_graph);
   } else if (src_op.type == OperatorType::kFill) {
     ConvertFillOperator(model, static_cast<const FillOperator&>(src_op),
                         tensorflow_graph);
@@ -2023,6 +2191,28 @@ void ConvertOperator(const Model& model, const Operator& src_op,
   } else if (src_op.type == OperatorType::kPow) {
     ConvertPowOperator(model, static_cast<const PowOperator&>(src_op), "Pow",
                        tensorflow_graph);
+  } else if (src_op.type == OperatorType::kAny) {
+    ConvertAnyOperator(model, static_cast<const AnyOperator&>(src_op),
+                       tensorflow_graph);
+  } else if (src_op.type == OperatorType::kLogicalAnd) {
+    ConvertLogicalAndOperator(model,
+                              static_cast<const LogicalAndOperator&>(src_op),
+                              tensorflow_graph);
+  } else if (src_op.type == OperatorType::kLogicalNot) {
+    ConvertLogicalNotOperator(model,
+                              static_cast<const LogicalNotOperator&>(src_op),
+                              tensorflow_graph);
+  } else if (src_op.type == OperatorType::kOneHot) {
+    ConvertOneHotOperator(model, static_cast<const OneHotOperator&>(src_op),
+                          tensorflow_graph);
+  } else if (src_op.type == OperatorType::kLogicalOr) {
+    ConvertLogicalOrOperator(model,
+                             static_cast<const LogicalOrOperator&>(src_op),
+                             "LogicalOr", tensorflow_graph);
+  } else if (src_op.type == OperatorType::kCTCBeamSearchDecoder) {
+    ConvertCTCBeamSearchDecoderOperator(
+        model, static_cast<const CTCBeamSearchDecoderOperator&>(src_op),
+        "CTCBeamSearchDecoder", tensorflow_graph);
   } else {
     LOG(FATAL) << "Unhandled operator type " << OperatorTypeName(src_op.type);
   }
@@ -2101,6 +2291,9 @@ void ExportTensorFlowGraphDefImplementation(const Model& model,
     const auto& array = *array_pair.second;
     if (array.buffer) {
       switch (array.data_type) {
+        case ArrayDataType::kBool:
+          ConvertBoolTensorConst(model, array_name, tensorflow_graph);
+          break;
         case ArrayDataType::kFloat:
           ConvertFloatTensorConst(model, array_name, tensorflow_graph);
           break;
diff --git a/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md b/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md
index 18b7848db8..4bf47aa3c4 100644
--- a/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md
+++ b/tensorflow/contrib/lite/toco/g3doc/cmdline_examples.md
@@ -36,7 +36,7 @@ There are two approaches to running TOCO via command line.
 *   `tflite_convert`: Starting from TensorFlow 1.9, the command-line tool
     `tflite_convert` will be installed as part of the Python package. All of the
     examples below use `tflite_convert` for simplicity.
-    *   Example: `tflite --output_file=...`
+    *   Example: `tflite_convert --output_file=...`
 *   `bazel`: In order to run the latest version of TOCO, [clone the TensorFlow
     repository](https://www.tensorflow.org/install/install_sources#clone_the_tensorflow_repository)
     and use `bazel`. This is the recommended approach for converting models that
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/convert_expanddims_to_reshape.cc b/tensorflow/contrib/lite/toco/graph_transformations/convert_expanddims_to_reshape.cc
index 56f48d47de..310a88484c 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/convert_expanddims_to_reshape.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/convert_expanddims_to_reshape.cc
@@ -40,11 +40,6 @@ bool ConvertExpandDimsToReshape::Run(Model* model, std::size_t op_index) {
     // Yield until input dims have been resolved.
     return false;
   }
-  if (input_array.shape().dimensions_count() == 0) {
-    // Input array cannot be 0-D.
-    // (Unsure if this is TF behavior, but was required to get a test to pass.)
-    return false;
-  }
 
   const auto& axis_array = model->GetArray(expand_op->inputs[1]);
   if (!axis_array.has_shape()) {
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_stack_to_reshape.cc b/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_pack_to_reshape.cc
index 0615b5e6c6..75113a2a8c 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_stack_to_reshape.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/convert_trivial_pack_to_reshape.cc
@@ -25,19 +25,19 @@ limitations under the License.
 
 namespace toco {
 
-bool ConvertTrivialStackToReshape::Run(Model* model, std::size_t op_index) {
-  auto stack_it = model->operators.begin() + op_index;
-  if (stack_it->get()->type != OperatorType::kStack) {
+bool ConvertTrivialPackToReshape::Run(Model* model, std::size_t op_index) {
+  auto pack_it = model->operators.begin() + op_index;
+  if (pack_it->get()->type != OperatorType::kPack) {
     return false;
   }
-  auto* stack_op = static_cast<StackOperator*>(stack_it->get());
-  if (stack_op->inputs.size() > 1) {
+  auto* pack_op = static_cast<PackOperator*>(pack_it->get());
+  if (pack_op->inputs.size() > 1) {
     // Not trivial.
     return false;
   }
-  CHECK_EQ(stack_op->outputs.size(), 1);
+  CHECK_EQ(pack_op->outputs.size(), 1);
 
-  const auto& input_array = model->GetArray(stack_op->inputs[0]);
+  const auto& input_array = model->GetArray(pack_op->inputs[0]);
   if (!input_array.has_shape()) {
     // Yield until input dims have been resolved.
     return false;
@@ -48,16 +48,16 @@ bool ConvertTrivialStackToReshape::Run(Model* model, std::size_t op_index) {
     return false;
   }
 
-  AddMessageF("Converting trivial %s to a reshape", LogName(*stack_op));
+  AddMessageF("Converting trivial %s to a reshape", LogName(*pack_op));
 
   // Note that we could convert to ExpandDims but toco prefers reshapes.
   auto* reshape_op = new TensorFlowReshapeOperator;
-  reshape_op->inputs = {stack_op->inputs[0]};
-  reshape_op->outputs = stack_op->outputs;
+  reshape_op->inputs = {pack_op->inputs[0]};
+  reshape_op->outputs = pack_op->outputs;
 
   // Create shape param.
   string shape_array_name =
-      AvailableArrayName(*model, stack_op->outputs[0] + "_shape");
+      AvailableArrayName(*model, pack_op->outputs[0] + "_shape");
   Array& shape_array = model->GetOrCreateArray(shape_array_name);
   *(shape_array.mutable_shape()->mutable_dims()) = {
       1 + input_array.shape().dimensions_count()};
@@ -70,10 +70,10 @@ bool ConvertTrivialStackToReshape::Run(Model* model, std::size_t op_index) {
   }
 
   // Replace the operator in the graph.
-  const auto reshape_it = model->operators.emplace(stack_it, reshape_op);
-  stack_it = reshape_it + 1;
-  CHECK_EQ(stack_it->get(), stack_op);
-  model->operators.erase(stack_it);
+  const auto reshape_it = model->operators.emplace(pack_it, reshape_op);
+  pack_it = reshape_it + 1;
+  CHECK_EQ(pack_it->get(), pack_op);
+  model->operators.erase(pack_it);
 
   return true;
 }
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/dequantize.cc b/tensorflow/contrib/lite/toco/graph_transformations/dequantize.cc
index 2c7ffe4884..1688586733 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/dequantize.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/dequantize.cc
@@ -159,6 +159,7 @@ bool DequantizeArray(const string& array_name,
   new_array.GetOrCreateMinMax() = array->GetMinMax();
   fakequant_op->minmax.reset(new MinMax);
   *fakequant_op->minmax = array->GetMinMax();
+  fakequant_op->narrow_range = array->narrow_range;
   if (must_insert_fakequant_before) {
     for (const auto& op : model->operators) {
       for (string& output : op->outputs) {
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/ensure_uint8_weights_safe_for_fast_int8_kernels.cc b/tensorflow/contrib/lite/toco/graph_transformations/ensure_uint8_weights_safe_for_fast_int8_kernels.cc
index 75642bbc37..c13fc0de75 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/ensure_uint8_weights_safe_for_fast_int8_kernels.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/ensure_uint8_weights_safe_for_fast_int8_kernels.cc
@@ -181,7 +181,7 @@ bool EnsureUint8WeightsSafeForFastInt8Kernels::Run(Model* model,
         // future without worrying.
         static constexpr int kMinDistanceBetweenBadValues = 16;
         if (distance < kMinDistanceBetweenBadValues) {
-          if (allow_nudging_weights()) {
+          if (allow_nudging_weights() || has_default_ranges_flag()) {
             buffer_data[i] = 1;
             changed = true;
             continue;
@@ -200,6 +200,15 @@ bool EnsureUint8WeightsSafeForFastInt8Kernels::Run(Model* model,
   }
 
   if (changed) {
+    if (has_default_ranges_flag()) {
+      std::cerr
+          << "Since the specified values of --default_ranges_min and "
+             "--default_ranges_max result in values incompatible with TFLite's "
+             "fast int8 kernels, "
+             "--allow_nudging_weights_to_use_fast_gemm_kernel "
+             "has been enabled. This may affect the accuracy of the model."
+          << std::endl;
+    }
     AddMessageF("Tweaked weights values for %s", LogName(op));
   }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
index 8cd1298bca..8d9a4c4700 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
+++ b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
@@ -116,7 +116,7 @@ DECLARE_GRAPH_TRANSFORMATION(ConvertExpandDimsToReshape)
 DECLARE_GRAPH_TRANSFORMATION(ConvertPureConvToDepthwise)
 DECLARE_GRAPH_TRANSFORMATION(ConvertSqueezeToReshape)
 DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialAddNToAdd)
-DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialStackToReshape)
+DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialPackToReshape)
 DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialTileToConcat)
 DECLARE_GRAPH_TRANSFORMATION(ConvertTrivialTransposeToReshape)
 DECLARE_GRAPH_TRANSFORMATION(ConvertReorderAxes)
@@ -159,7 +159,7 @@ DECLARE_GRAPH_TRANSFORMATION(ResolveConstantBinaryOperator)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantUnaryOperator)
 DECLARE_GRAPH_TRANSFORMATION(CreateIm2colArrays)
 DECLARE_GRAPH_TRANSFORMATION(DropIm2colArrays)
-DECLARE_GRAPH_TRANSFORMATION(ReadFakeQuantMinMax)
+DECLARE_GRAPH_TRANSFORMATION(ReadArrayMinmaxAndNarrowRangeFromFakeQuant)
 DECLARE_GRAPH_TRANSFORMATION(ReorderElementwiseUnary)
 DECLARE_GRAPH_TRANSFORMATION(ReorderReshapeTranspose)
 DECLARE_GRAPH_TRANSFORMATION(ResolveReorderAxes)
@@ -180,13 +180,13 @@ DECLARE_GRAPH_TRANSFORMATION(ResolvePadAttributes)
 DECLARE_GRAPH_TRANSFORMATION(ResolvePadV2Attributes)
 DECLARE_GRAPH_TRANSFORMATION(ResolveStridedSliceAttributes)
 DECLARE_GRAPH_TRANSFORMATION(ResolveSliceAttributes)
-DECLARE_GRAPH_TRANSFORMATION(ResolveMeanAttributes)
+DECLARE_GRAPH_TRANSFORMATION(ResolveReduceAttributes)
 DECLARE_GRAPH_TRANSFORMATION(ResolveTransposeAttributes)
+DECLARE_GRAPH_TRANSFORMATION(ResolveConstantPack)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantRandomUniform)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantRange)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantShapeOrRank)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantSlice)
-DECLARE_GRAPH_TRANSFORMATION(ResolveConstantStack)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantStridedSlice)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantFill)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantGather)
@@ -194,6 +194,8 @@ DECLARE_GRAPH_TRANSFORMATION(ResolveMultiplyByZero)
 DECLARE_GRAPH_TRANSFORMATION(Dequantize)
 DECLARE_GRAPH_TRANSFORMATION(UnpartitionEmbeddingLookup)
 DECLARE_GRAPH_TRANSFORMATION(ShuffleFCWeights)
+DECLARE_GRAPH_TRANSFORMATION(ResolveFakeQuantArgsFromVars)
+DECLARE_GRAPH_TRANSFORMATION(ResolveGatherAttributes)
 
 class PropagateDefaultMinMax : public GraphTransformation {
  public:
@@ -260,8 +262,12 @@ class EnsureUint8WeightsSafeForFastInt8Kernels : public GraphTransformation {
   bool allow_nudging_weights() const { return allow_nudging_weights_; }
   void set_allow_nudging_weights(bool val) { allow_nudging_weights_ = val; }
 
+  bool has_default_ranges_flag() const { return has_default_ranges_flag_; }
+  void set_has_default_ranges_flag(bool val) { has_default_ranges_flag_ = val; }
+
  private:
   bool allow_nudging_weights_ = false;
+  bool has_default_ranges_flag_ = false;
 };
 
 #undef DECLARE_GRAPH_TRANSFORMATION
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
index 2f1bb8f0ad..527013bfa3 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
@@ -377,6 +377,19 @@ bool HardcodeMinMax::Run(Model* model, std::size_t op_index) {
     case OperatorType::kMean:
       changed = HardcodeMinMaxFromFirstInput(model, op);
       break;
+    case OperatorType::kSum:
+      // reduce_sum is expected to change the output range. Hence
+      // a fake_quant op is necessary in the output to minimize error. However
+      // in special circumstances like when computing expected value using
+      // reduce_sum the input range and the output range matches. Hence the
+      // below code would act as a fallback. If a fake_quant node is observed in
+      // the output that takes precendence over the hard coding logic below.
+      changed = HardcodeMinMaxFromFirstInput(model, op);
+      if (changed) {
+        LOG(WARNING) << "Using the input range for output in reduce_sum op."
+                     << "This could have an impact on your model accuracy.";
+      }
+      break;
     case OperatorType::kSelect:
       changed = HardcodeMinMaxForSelect(model, op);
       break;
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/identify_prelu.cc b/tensorflow/contrib/lite/toco/graph_transformations/identify_prelu.cc
index 30be4ac0aa..b90a156a0d 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/identify_prelu.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/identify_prelu.cc
@@ -74,14 +74,30 @@ bool IdentifyPRelu::Run(Model* model, std::size_t op_index) {
   const auto* relu_neg_input_op = GetOpWithOutput(*model, mul_op->inputs[1]);
 
   if (relu_neg_input_op == nullptr ||
-      relu_neg_input_op->type != OperatorType::kNeg ||
-      relu_neg_input_op->fused_activation_function !=
-          FusedActivationFunctionType::kRelu ||
       relu_neg_input_op->inputs.size() != 1) {
     return false;
   }
 
-  if (relu_input_op->inputs[0] != relu_neg_input_op->inputs[0]) {
+  const Operator* final_input_op;
+  if (relu_neg_input_op->type == OperatorType::kNeg &&
+      relu_neg_input_op->fused_activation_function ==
+          FusedActivationFunctionType::kRelu) {
+    // This detects a Neg op with fused Relu activation function.
+    final_input_op = relu_neg_input_op;
+  } else {
+    // This detects a Neg op followed by a separated Relu op.
+    const auto* neg_input_op =
+        GetOpWithOutput(*model, relu_neg_input_op->inputs[0]);
+    if (neg_input_op == nullptr || neg_input_op->inputs.size() != 1 ||
+        relu_neg_input_op->type != OperatorType::kRelu ||
+        relu_neg_input_op->fused_activation_function !=
+            FusedActivationFunctionType::kNone) {
+      return false;
+    }
+    final_input_op = neg_input_op;
+  }
+
+  if (relu_input_op->inputs[0] != final_input_op->inputs[0]) {
     return false;
   }
 
@@ -112,7 +128,6 @@ bool IdentifyPRelu::Run(Model* model, std::size_t op_index) {
   // intermediate tensors aren't used by other ops, those will be removed by
   // other graph transformation rules.
   model->operators.erase(FindOp(*model, add_op));
-
   return true;
 }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc b/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc
index 45d9f73a1e..f684de08ab 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/make_initial_dequantize_operator.cc
@@ -85,15 +85,8 @@ bool AddDequantizeOperatorToInput(const string& input_name, const Operator* op,
   dequantized_input_minmax = input_minmax;
   auto& input_qparams = input_array.GetOrCreateQuantizationParams();
   input_array.data_type = input_array.final_data_type;
-  if (input_array.data_type == ArrayDataType::kUint8) {
-    GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(input_minmax,
-                                                           &input_qparams);
-  } else if (input_array.data_type == ArrayDataType::kInt16) {
-    GetQuantizationParamsFromMinMax<ArrayDataType::kInt16>(input_minmax,
-                                                           &input_qparams);
-  } else {
-    LOG(FATAL) << "unhandled data type";
-  }
+  ChooseQuantizationParamsForArrayAndQuantizedDataType(
+      input_array, input_array.data_type, &input_qparams);
 
   transformation->AddMessageF(
       "Created %s"
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
index 00ab7cbaa9..c8310161cb 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
@@ -62,6 +62,10 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kGreaterEqual:
     case OperatorType::kEqual:
     case OperatorType::kNotEqual:
+    case OperatorType::kAny:
+    case OperatorType::kLogicalAnd:
+    case OperatorType::kLogicalNot:
+    case OperatorType::kLogicalOr:
       // These operators unconditionally produce bool outputs
       SetDataTypeForAllOutputs(model, op, ArrayDataType::kBool);
       break;
@@ -100,6 +104,13 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
       model->GetArray(op->outputs[0]).data_type = argmax_op->output_data_type;
       break;
     }
+    case OperatorType::kArgMin: {
+      // Data type of the ArgMin op is specified.
+      CHECK_EQ(op->outputs.size(), 1);
+      auto* argmin_op = static_cast<ArgMinOperator*>(op);
+      model->GetArray(op->outputs[0]).data_type = argmin_op->output_data_type;
+      break;
+    }
     case OperatorType::kRange: {
       auto* range_op = static_cast<RangeOperator*>(op);
       // Output type of the Range op can be set via an attribute
@@ -131,7 +142,8 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
       CHECK_EQ(op->inputs.size(), 2);
       CHECK_EQ(op->outputs.size(), 2);
       CHECK(model->GetArray(op->inputs[1]).data_type == ArrayDataType::kInt32);
-      model->GetArray(op->outputs[0]).data_type = model->GetArray(op->inputs[0]).data_type;
+      model->GetArray(op->outputs[0]).data_type =
+          model->GetArray(op->inputs[0]).data_type;
       model->GetArray(op->outputs[1]).data_type = ArrayDataType ::kInt32;
       break;
     }
@@ -144,8 +156,8 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
         return false;
       }
       for (int i = 0; i < op->outputs.size(); ++i) {
-        auto output = op->outputs[i];
-        auto data_type = unsupported_op->output_data_types[i];
+        const string& output = op->outputs[i];
+        const ArrayDataType data_type = unsupported_op->output_data_types[i];
         model->GetArray(output).data_type = data_type;
       }
       break;
@@ -183,6 +195,38 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
       SetDataTypeForAllOutputs(model, op, data_type);
       break;
     }
+    case OperatorType::kPack: {
+      const ArrayDataType data_type = model->GetArray(op->inputs[0]).data_type;
+      for (const auto& input : op->inputs) {
+        CHECK(data_type == model->GetArray(input).data_type);
+      }
+      SetDataTypeForAllOutputs(model, op, data_type);
+      break;
+    }
+    case OperatorType::kOneHot: {
+      CHECK_EQ(op->inputs.size(), 4);
+      CHECK_EQ(op->outputs.size(), 1);
+      const ArrayDataType on_value_type =
+          model->GetArray(op->inputs[OneHotOperator::ON_VALUE_INPUT]).data_type;
+      const ArrayDataType off_value_type =
+          model->GetArray(op->inputs[OneHotOperator::OFF_VALUE_INPUT])
+              .data_type;
+      CHECK(on_value_type == off_value_type);
+      model->GetArray(op->outputs[0]).data_type = on_value_type;
+      break;
+    }
+    case OperatorType::kCTCBeamSearchDecoder: {
+      CHECK_EQ(op->inputs.size(), 2);
+      // All outputs (sparse tensors) are int32s (although tf uses int64s)
+      // except the last one (log probabilities) is float.
+      const int output_size = op->outputs.size();
+      for (int i = 0; i < output_size - 1; ++i) {
+        model->GetArray(op->outputs[i]).data_type = ArrayDataType::kInt32;
+      }
+      model->GetArray(op->outputs[output_size - 1]).data_type =
+          ArrayDataType::kFloat;
+      break;
+    }
     default: {
       // These operators produce outputs with the same type as their 1st input
       CHECK_GT(op->inputs.size(), 0);
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_default_min_max.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_default_min_max.cc
index 50b90e7c2b..cd078ef189 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_default_min_max.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_default_min_max.cc
@@ -25,6 +25,14 @@ limitations under the License.
 
 namespace toco {
 
+namespace {
+
+bool SupportsMinMax(const Array& array) {
+  return array.data_type == ArrayDataType::kFloat;
+}
+
+}  // namespace
+
 // Propagates default min/max values to any operator input/output array that
 // is missing them.
 //
@@ -39,14 +47,16 @@ bool PropagateDefaultMinMax::Run(Model* model, std::size_t op_index) {
 
   for (const auto& input : op->inputs) {
     auto& input_array = model->GetArray(input);
-    if (!input_array.minmax && !input_array.buffer) {
+    if (!input_array.minmax && !input_array.buffer &&
+        SupportsMinMax(input_array)) {
       did_change |= SetArrayMinMax(input, &input_array);
     }
   }
 
   for (const auto& output : op->outputs) {
     auto& output_array = model->GetArray(output);
-    if (!output_array.minmax && !output_array.buffer) {
+    if (!output_array.minmax && !output_array.buffer &&
+        SupportsMinMax(output_array)) {
       did_change |= SetArrayMinMax(output, &output_array);
     }
   }
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fake_quant_num_bits.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fake_quant_num_bits.cc
index 0f2592d05f..3ad6b0ec6f 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fake_quant_num_bits.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fake_quant_num_bits.cc
@@ -30,15 +30,9 @@ namespace {
 bool ChangeArrayDataType(GraphTransformation* transformation, Array* array,
                          ArrayDataType new_data_type,
                          const MinMax* new_minmax) {
-  // The code below assumes kInt16, see
-  //     GetQuantizationParamsFromMinMax<ArrayDataType::kInt16>
-  if (new_data_type != ArrayDataType::kInt16) {
-    return false;
-  }
-
-  bool changed = false;
   // Ensure the array ends up in the new type (if it hasn't yet been quantized).
-  if ((array->final_data_type != new_data_type)) {
+  bool changed = false;
+  if (array->final_data_type != new_data_type) {
     array->final_data_type = new_data_type;
     changed = true;
   }
@@ -72,12 +66,10 @@ bool ChangeArrayDataType(GraphTransformation* transformation, Array* array,
         "Rescaling min/max from %g,%g (%s) to %g,%g (%s)", array_minmax.min,
         array_minmax.max, ArrayDataTypeName(array->data_type), min, max,
         ArrayDataTypeName(new_data_type));
-
     array_minmax.min = min;
     array_minmax.max = max;
-    GetQuantizationParamsFromMinMax<ArrayDataType::kInt16>(
-        array_minmax, array->quantization_params.get());
-
+    ChooseQuantizationParamsForArrayAndQuantizedDataType(
+        *array, new_data_type, array->quantization_params.get());
     // Directly change the type as the array was already quantized.
     array->data_type = new_data_type;
     changed = true;
@@ -95,6 +87,7 @@ bool ChangeArrayDataType(GraphTransformation* transformation, Array* array,
       changed = true;
     }
   }
+
   return changed;
 }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
index 8eb0423283..3c9379fd87 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
@@ -437,6 +437,7 @@ void ProcessTensorFlowReshapeOperator(Model* model,
       product_non_wildcard_dims *= shape_data[i];
     }
   }
+
   const int input_flat_size = RequiredBufferSizeForShape(input_shape);
   if (has_wildcard) {
     CHECK_GE(input_flat_size, product_non_wildcard_dims)
@@ -445,6 +446,12 @@ void ProcessTensorFlowReshapeOperator(Model* model,
         << op->outputs[0] << "\". Are your input shapes correct?";
     shape_data[wildcard_index] = input_flat_size / product_non_wildcard_dims;
   }
+
+  if (shape_data.size() == 1 && shape_data[0] == 0) {
+    // We have reshaped a scalar, so preserve as a scalar.
+    shape_data.clear();
+  }
+
   auto& output_shape = *output_array.mutable_shape();
   *output_shape.mutable_dims() = shape_data;
   CHECK_EQ(input_flat_size, RequiredBufferSizeForShape(output_shape))
@@ -522,12 +529,14 @@ void ProcessAddNOperator(Model* model, Operator* op) {
 
 bool KeepDims(const Operator& op) {
   switch (op.type) {
-    case OperatorType::kMin:  //  Reduction Min
+    case OperatorType::kReduceMin:  //  Reduction Min
       return static_cast<const TensorFlowMinOperator&>(op).keep_dims;
-    case OperatorType::kMax:  //  Reduction Max
+    case OperatorType::kReduceMax:  //  Reduction Max
       return static_cast<const TensorFlowMaxOperator&>(op).keep_dims;
     case OperatorType::kSum:
       return static_cast<const TensorFlowSumOperator&>(op).keep_dims;
+    case OperatorType::kReduceProd:
+      return static_cast<const TensorFlowProdOperator&>(op).keep_dims;
     case OperatorType::kMean:
       return static_cast<const MeanOperator&>(op).keep_dims;
     default:
@@ -1034,17 +1043,28 @@ void ProcessGatherOperator(Model* model, GatherOperator* op) {
     return;
   }
 
+  // Yield until the axis has been resolved.
+  if (!op->axis) {
+    return;
+  }
+  int axis = op->axis.value();
+
   const auto& input_shape = input_array.shape();
   const auto& indices_shape = indices_array.shape();
   QCHECK_GE(input_shape.dimensions_count(), 1);
   op->input_rank = input_shape.dimensions_count();
+  QCHECK_LT(axis, op->input_rank);
 
-  // Copy the input dimensions to the output except for dimension 0,
+  // Copy the input dimensions to the output except for the axis dimensions
   // where the dimension of indices_shape is used.
-  // TODO(mgubin): if axis != 0 this is not true, change when it's supported.
   auto output_dims = output_array.mutable_shape()->mutable_dims();
-  output_dims->push_back(indices_shape.dims(0));
-  for (int dim = 1; dim < input_shape.dimensions_count(); dim++) {
+  for (int dim = 0; dim < axis; ++dim) {
+    output_dims->push_back(input_shape.dims(dim));
+  }
+  for (int dim = 0; dim < indices_shape.dimensions_count(); ++dim) {
+    output_dims->push_back(indices_shape.dims(dim));
+  }
+  for (int dim = axis + 1; dim < input_shape.dimensions_count(); ++dim) {
     output_dims->push_back(input_shape.dims(dim));
   }
 }
@@ -1190,7 +1210,7 @@ void ProcessShapeOperator(Model* model, TensorFlowShapeOperator* op) {
   output_shape->ReplaceDims({input_array.shape().dimensions_count()});
 }
 
-void ProcessStackOperator(Model* model, StackOperator* op) {
+void ProcessPackOperator(Model* model, PackOperator* op) {
   CHECK_GE(op->inputs.size(), 1);
   CHECK_EQ(op->outputs.size(), 1);
   auto& output_array = model->GetArray(op->outputs[0]);
@@ -1199,7 +1219,7 @@ void ProcessStackOperator(Model* model, StackOperator* op) {
     return;
   }
 
-  std::unique_ptr<Shape> stacked_shape;
+  std::unique_ptr<Shape> packed_shape;
   for (const auto& input : op->inputs) {
     const auto& input_array = model->GetArray(input);
     if (!input_array.has_shape()) {
@@ -1208,23 +1228,23 @@ void ProcessStackOperator(Model* model, StackOperator* op) {
     }
 
     Shape shape = input_array.shape();
-    if (!stacked_shape) {
-      stacked_shape.reset(new Shape(shape));
+    if (!packed_shape) {
+      packed_shape.reset(new Shape(shape));
     } else {
-      CHECK(*stacked_shape == shape) << "All input arrays to Stack operators "
-                                        "must have the same shape. Input \""
-                                     << input << "\" is different.";
+      CHECK(*packed_shape == shape) << "All input arrays to Pack operators "
+                                       "must have the same shape. Input \""
+                                    << input << "\" is different.";
     }
   }
 
   int axis = op->axis;
   if (axis < 0) {
     // Handle negative axis
-    axis += stacked_shape->dims().size() + 1;
+    axis += packed_shape->dims().size() + 1;
   }
-  stacked_shape->mutable_dims()->insert(
-      stacked_shape->mutable_dims()->begin() + axis, op->inputs.size());
-  output_array.copy_shape(*stacked_shape);
+  packed_shape->mutable_dims()->insert(
+      packed_shape->mutable_dims()->begin() + axis, op->inputs.size());
+  output_array.copy_shape(*packed_shape);
 }
 
 void ProcessStridedSliceOperator(Model* model, StridedSliceOperator* op) {
@@ -1404,7 +1424,8 @@ void ProcessTransposeOperator(Model* model, TransposeOperator* op) {
   }
 }
 
-void ProcessArgMaxOperator(Model* model, ArgMaxOperator* op) {
+template <typename Op>
+void ProcessArgMinMaxOperator(Model* model, Op* op) {
   CHECK_EQ(op->inputs.size(), 2);
   const auto& input_array = model->GetArray(op->inputs[0]);
   // Yield until input dims have been resolved.
@@ -1498,6 +1519,120 @@ void ProcessTileOperator(Model* model, TensorFlowTileOperator* op) {
   }
 }
 
+void ProcessAnyOperator(Model* model, AnyOperator* op) {
+  CHECK_EQ(op->inputs.size(), 2);
+  CHECK_EQ(op->outputs.size(), 1);
+
+  auto& output_array = model->GetArray(op->outputs[0]);
+  if (output_array.has_shape()) {
+    // We have already run.
+    return;
+  }
+
+  const auto& input_array = model->GetArray(op->inputs[0]);
+  if (!input_array.has_shape()) {
+    // Yield until input dims have been resolved.
+    return;
+  }
+  const auto& input_shape = input_array.shape();
+
+  auto& reduction_indices_array = model->GetArray(op->inputs[1]);
+  if (!reduction_indices_array.has_shape()) {
+    // Yield until reduction indices shape been resolved.
+    return;
+  }
+  if (!reduction_indices_array.buffer) {
+    // Yield until the reduction indices are constant.
+    return;
+  }
+  CHECK(reduction_indices_array.data_type == ArrayDataType::kInt32)
+      << "Any reduction input must be int32";
+
+  int input_rank = input_shape.dimensions_count();
+  std::set<int32> true_indices;
+  const auto& reduction_indices =
+      reduction_indices_array.GetBuffer<ArrayDataType::kInt32>().data;
+  for (int i = 0; i < reduction_indices.size(); ++i) {
+    const int32 reduction_index = reduction_indices[i];
+    if (reduction_index < -input_rank || reduction_index >= input_rank) {
+      CHECK(false) << "Invalid reduction dimension " << reduction_index
+                   << " for input with " << input_rank << " dimensions";
+    }
+    int32 wrapped_index = reduction_index;
+    if (wrapped_index < 0) {
+      wrapped_index += input_rank;
+    }
+    true_indices.insert(wrapped_index);
+  }
+
+  auto* mutable_dims = output_array.mutable_shape()->mutable_dims();
+  mutable_dims->clear();
+  for (int i = 0; i < input_rank; ++i) {
+    if (true_indices.count(i) > 0) {
+      if (op->keep_dims) {
+        mutable_dims->emplace_back(1);
+      }
+    } else {
+      mutable_dims->emplace_back(input_shape.dims(i));
+    }
+  }
+}
+
+void ProcessOneHotOperator(Model* model, OneHotOperator* op) {
+  CHECK_EQ(op->inputs.size(), 4);
+  CHECK_EQ(op->outputs.size(), 1);
+  auto& output_array = model->GetArray(op->outputs[0]);
+  if (output_array.has_shape()) {
+    // Shape already propagated
+    return;
+  }
+
+  // Yield until indices dims have been resolved.
+  const auto& indices_array =
+      model->GetArray(op->inputs[OneHotOperator::INDICES_INPUT]);
+  if (!indices_array.has_shape()) {
+    return;
+  }
+
+  // Yield until depth is constant and dims have been resolved.
+  if (!IsConstantParameterArray(*model,
+                                op->inputs[OneHotOperator::DEPTH_INPUT])) {
+    return;
+  }
+  const auto& depth_array =
+      model->GetArray(op->inputs[OneHotOperator::DEPTH_INPUT]);
+  if (!depth_array.has_shape()) {
+    return;
+  }
+
+  CHECK(depth_array.data_type == ArrayDataType::kInt32)
+      << "Depth array must be int32.";
+  CHECK_EQ(RequiredBufferSizeForShape(depth_array.shape()), 1)
+      << "Depth array must be scalar.";
+
+  const int depth = depth_array.GetBuffer<ArrayDataType::kInt32>().data[0];
+  CHECK_GE(depth, 0) << "Depth must be non-negative.";
+
+  const int indices_dims = indices_array.shape().dimensions_count();
+  const int output_dims = indices_dims + 1;
+  const int axis = op->axis == -1 ? indices_dims : op->axis;
+  CHECK_GE(axis, 0) << "Resolved axis must be non-negative.";
+
+  auto* mutable_dims = output_array.mutable_shape()->mutable_dims();
+  mutable_dims->resize(output_dims);
+  for (int i = 0; i < output_dims; ++i) {
+    int dim = 0;
+    if (i < axis) {
+      dim = indices_array.shape().dims(i);
+    } else if (i == axis) {
+      dim = depth;
+    } else {
+      dim = indices_array.shape().dims(i - 1);
+    }
+    (*mutable_dims)[i] = dim;
+  }
+}
+
 }  // namespace
 
 bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
@@ -1536,6 +1671,9 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kFloor:
     case OperatorType::kExp:
     case OperatorType::kSin:
+    case OperatorType::kLogicalAnd:
+    case OperatorType::kLogicalNot:
+    case OperatorType::kLogicalOr:
       ProcessSimpleOperator(model, op, 0);
       break;
     case OperatorType::kGather:
@@ -1604,9 +1742,10 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kL2Pool:
       ProcessL2PoolOperator(model, static_cast<L2PoolOperator*>(op));
       break;
-    case OperatorType::kMin:  //  Reduction Min
-    case OperatorType::kMax:  //  Reduction Max
+    case OperatorType::kReduceMin:  //  Reduction Min
+    case OperatorType::kReduceMax:  //  Reduction Max
     case OperatorType::kSum:
+    case OperatorType::kReduceProd:
     case OperatorType::kMean:
       ProcessTensorFlowReductionOperator(model, op);
       break;
@@ -1655,8 +1794,8 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kShape:
       ProcessShapeOperator(model, static_cast<TensorFlowShapeOperator*>(op));
       break;
-    case OperatorType::kStack:
-      ProcessStackOperator(model, static_cast<StackOperator*>(op));
+    case OperatorType::kPack:
+      ProcessPackOperator(model, static_cast<PackOperator*>(op));
       break;
     case OperatorType::kReorderAxes:
       ProcessReorderAxesOperator(model, static_cast<ReorderAxesOperator*>(op));
@@ -1696,10 +1835,26 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
                                   static_cast<StridedSliceOperator*>(op));
       break;
     case OperatorType::kArgMax:
-      ProcessArgMaxOperator(model, static_cast<ArgMaxOperator*>(op));
+      ProcessArgMinMaxOperator<ArgMaxOperator>(
+          model, static_cast<ArgMaxOperator*>(op));
       break;
-    case OperatorType::kUnsupported:
+    case OperatorType::kArgMin:
+      ProcessArgMinMaxOperator<ArgMinOperator>(
+          model, static_cast<ArgMinOperator*>(op));
       break;
+    case OperatorType::kUnsupported: {
+      const auto* unsupported_op =
+          static_cast<TensorFlowUnsupportedOperator*>(op);
+      // Attribute can be not specified, ignore it.
+      if (unsupported_op->output_shapes.size() < op->outputs.size()) {
+        return false;
+      }
+      for (int i = 0; i < op->outputs.size(); ++i) {
+        const string& output = op->outputs[i];
+        model->GetArray(output).copy_shape(unsupported_op->output_shapes.at(i));
+      }
+      break;
+    }
     case OperatorType::kSvdf:
       ProcessSvdfOperator(model, static_cast<SvdfOperator*>(op));
       break;
@@ -1723,6 +1878,12 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kTile:
       ProcessTileOperator(model, static_cast<TensorFlowTileOperator*>(op));
       break;
+    case OperatorType::kAny:
+      ProcessAnyOperator(model, static_cast<AnyOperator*>(op));
+      break;
+    case OperatorType::kOneHot:
+      ProcessOneHotOperator(model, static_cast<OneHotOperator*>(op));
+      break;
     default:
       // Unimplemented, another graph transformation should drop it.
       LOG(FATAL) << "Unhandled operator type " << OperatorTypeName(op->type);
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.cc b/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.cc
index d74cad9a62..44733391f5 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.cc
@@ -74,46 +74,54 @@ ArrayDataType GetQuantizedDataType(const Array& array,
   }
 }
 
-void GetQuantizationParams(ArrayDataType data_type, const MinMax& minmax,
-                           QuantizationParams* quantization_params) {
-  switch (data_type) {
+template <ArrayDataType A>
+void ChooseQuantizationParamsForArrayAndQuantizedDataType(
+    const Array& array, QuantizationParams* quantization_params) {
+  *quantization_params = ::tflite::ChooseQuantizationParams<DataType<A>>(
+      array.minmax->min, array.minmax->max, array.narrow_range);
+}
+
+void ChooseQuantizationParamsForArrayAndQuantizedDataType(
+    const Array& array, ArrayDataType quantized_data_type,
+    QuantizationParams* quantization_params) {
+  switch (quantized_data_type) {
     case ArrayDataType::kInt8:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kInt8>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kInt8>(array, quantization_params);
       break;
     case ArrayDataType::kUint8:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kUint8>(array, quantization_params);
       break;
     case ArrayDataType::kInt16:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kInt16>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kInt16>(array, quantization_params);
       break;
     case ArrayDataType::kUint16:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kUint16>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kUint16>(array, quantization_params);
       break;
     case ArrayDataType::kInt32:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kInt32>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kInt32>(array, quantization_params);
       break;
     case ArrayDataType::kUint32:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kUint32>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kUint32>(array, quantization_params);
       break;
     case ArrayDataType::kInt64:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kInt64>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kInt64>(array, quantization_params);
       break;
     case ArrayDataType::kUint64:
-      GetQuantizationParamsFromMinMax<ArrayDataType::kUint64>(
-          minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType<
+          ArrayDataType::kUint64>(array, quantization_params);
       break;
     case ArrayDataType::kFloat:
     case ArrayDataType::kNone:
     default:
       LOG(FATAL) << "Unhandled final quantization type "
-                 << static_cast<int>(data_type);
+                 << static_cast<int>(quantized_data_type);
   }
 }
 
@@ -121,8 +129,8 @@ namespace {
 
 template <ArrayDataType A>
 std::unique_ptr<GenericBuffer> QuantizeBuffer(
-    const GenericBuffer& buffer,
-    const QuantizationParams& quantization_params) {
+    const Array& array, const QuantizationParams& quantization_params) {
+  const GenericBuffer& buffer = *array.buffer;
   const auto inverse_scale = 1. / quantization_params.scale;
   CHECK(buffer.type == ArrayDataType::kFloat);
   const auto& float_buffer =
@@ -140,8 +148,15 @@ std::unique_ptr<GenericBuffer> QuantizeBuffer(
     } else {
       scaled_val = quantization_params.zero_point + inverse_scale * src_val;
     }
-    quantized_buffer->data[i] =
-        tflite::SafeCast<DataType<A>>(std::round(scaled_val));
+    auto integer_val = tflite::SafeCast<DataType<A>>(std::round(scaled_val));
+    // In addition to its effect on the choice of quantization params upstream
+    // of here, narrow_range also means nudge the min quantized value by +1,
+    // so e.g. uint8 values get constrained to [1, 255].
+    if (integer_val == std::numeric_limits<DataType<A>>::min() &&
+        array.narrow_range) {
+      integer_val++;
+    }
+    quantized_buffer->data[i] = integer_val;
   }
   return std::unique_ptr<GenericBuffer>(quantized_buffer);
 }
@@ -155,7 +170,7 @@ void QuantizeArray(GraphTransformation* transformation, Model* model,
   CHECK(!array.quantization_params);
   array.GetOrCreateQuantizationParams() = quantization_params;
   if (array.buffer) {
-    array.buffer = QuantizeBuffer<A>(*array.buffer, quantization_params);
+    array.buffer = QuantizeBuffer<A>(array, quantization_params);
   }
   array.data_type = A;
   array.final_data_type = A;
@@ -210,8 +225,8 @@ bool IsArrayQuantizedRangeSubset(GraphTransformation* transformation,
     } else {
       // Work around cases where we are asking for this prior to the Quantize
       // transformation having added the quantization_params.
-      GetQuantizationParams(quantized_data_type, *array.minmax,
-                            &quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType(
+          array, quantized_data_type, &quantization_params);
       transformation->AddMessageF(
           "No quantization params - infering from data type %s with minmax "
           "%g,%g as zero_point=%g, scale=%g",
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.h b/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.h
index 79a2ce7e50..cf093c6f17 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.h
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantization_util.h
@@ -38,21 +38,11 @@ bool GetQuantizedDataTypeNumericalRange(ArrayDataType data_type,
 ArrayDataType GetQuantizedDataType(const Array& array,
                                    ArrayDataType default_type);
 
-// Returns the quantization params for the array with the given data type and
-// minmax.
-void GetQuantizationParams(ArrayDataType data_type, const MinMax& minmax,
-                           QuantizationParams* quantization_params);
-
-// Returns the quantization params for the data type and minmax values.
-template <ArrayDataType A>
-void GetQuantizationParamsFromMinMax(const MinMax& minmax,
-                                     QuantizationParams* quantization_params) {
-  using Integer = DataType<A>;
-  const double rmin = minmax.min;
-  const double rmax = minmax.max;
-  *quantization_params =
-      ::tflite::ChooseQuantizationParams<Integer>(rmin, rmax);
-}
+// Chooses the quantization params for a given array and a given target
+// quantized data type (which may not be the array's current data type).
+void ChooseQuantizationParamsForArrayAndQuantizedDataType(
+    const Array& array, ArrayDataType quantized_data_type,
+    QuantizationParams* quantization_params);
 
 // Quantizes an array by setting its data type and (if constant) quantizing
 // all values in the array.
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
index 58885b4950..fe037df122 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
@@ -50,6 +50,7 @@ bool SupportsQuantization(const Operator& op) {
          type == OperatorType::kSqueeze || type == OperatorType::kPad ||
          type == OperatorType::kPadV2 || type == OperatorType::kReshape ||
          type == OperatorType::kTanh || type == OperatorType::kMul ||
+         type == OperatorType::kBatchToSpaceND || type == OperatorType::kSum ||
          type == OperatorType::kSpaceToBatchND ||
          type == OperatorType::kSpaceToDepth ||
          type == OperatorType::kStridedSlice ||
@@ -212,13 +213,15 @@ bool ChooseQuantizationForOperatorInput(
   if (op.type == OperatorType::kLstmCell) {
     if (input_index == LstmCellOperator::PREV_STATE_INPUT) {
       *quantized_data_type = ArrayDataType::kInt16;
-      GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType(
+          array, *quantized_data_type, quantization_params);
       return true;
     }
   }
 
   *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kUint8);
-  GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
+  ChooseQuantizationParamsForArrayAndQuantizedDataType(
+      array, *quantized_data_type, quantization_params);
   transformation->AddMessageF(
       "For input array %s with min=%g, max=%g, chose to quantize as %s (f=%s) "
       "with zero_point=%d, scale=%g",
@@ -358,12 +361,14 @@ bool ChooseQuantizationForOperatorOutput(
     if (output_index == LstmCellOperator::STATE_OUTPUT ||
         output_index == LstmCellOperator::ACTIV_TEMP) {
       *quantized_data_type = ArrayDataType::kInt16;
-      GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
+      ChooseQuantizationParamsForArrayAndQuantizedDataType(
+          array, *quantized_data_type, quantization_params);
       return true;
     }
   }
   *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kUint8);
-  GetQuantizationParams(*quantized_data_type, minmax, quantization_params);
+  ChooseQuantizationParamsForArrayAndQuantizedDataType(
+      array, *quantized_data_type, quantization_params);
   transformation->AddMessageF(
       "For output array %s with min=%g, max=%g"
       ", chose to quantize as %s with zero_point=%d"
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantize_weights.cc b/tensorflow/contrib/lite/toco/graph_transformations/quantize_weights.cc
index 88ea0945e7..7a8515f6d1 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/quantize_weights.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantize_weights.cc
@@ -36,10 +36,8 @@ void GetQuantizationParamsFromArray(const Array& array,
   const std::vector<float>& float_vals =
       array.GetBuffer<ArrayDataType::kFloat>().data;
   auto minmax = std::minmax_element(float_vals.begin(), float_vals.end());
-  MinMax toco_minmax;
-  toco_minmax.min = *minmax.first;
-  toco_minmax.max = *minmax.second;
-  GetQuantizationParams(ArrayDataType::kUint8, toco_minmax, params);
+  *params = tflite::ChooseQuantizationParams<uint8>(
+      *minmax.first, *minmax.second, array.narrow_range);
 }
 
 }  // namespace
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/read_array_minmax_and_narrow_range_from_fake_quant.cc b/tensorflow/contrib/lite/toco/graph_transformations/read_array_minmax_and_narrow_range_from_fake_quant.cc
new file mode 100644
index 0000000000..5b41c49bfa
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/read_array_minmax_and_narrow_range_from_fake_quant.cc
@@ -0,0 +1,78 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <algorithm>
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace toco {
+
+namespace {
+
+bool ApplyAttrsToArray(GraphTransformation* transformation, Model* model,
+                       const FakeQuantOperator& fq_op,
+                       const string& array_name) {
+  bool changed = false;
+  auto& annotated_array = model->GetArray(array_name);
+  if (!annotated_array.minmax) {
+    const MinMax& minmax = *fq_op.minmax;
+    annotated_array.GetOrCreateMinMax() = minmax;
+    transformation->AddMessageF(
+        "Read min/max annotation for array %s: min=%g, max=%g", array_name,
+        minmax.min, minmax.max);
+    changed = true;
+  }
+  if (fq_op.narrow_range && !annotated_array.narrow_range) {
+    annotated_array.narrow_range = true;
+    transformation->AddMessageF("Read narrow_range annotation for array %s",
+                                array_name);
+    changed = true;
+  }
+  return changed;
+}
+
+}  // end namespace
+
+bool ReadArrayMinmaxAndNarrowRangeFromFakeQuant::Run(Model* model,
+                                                     std::size_t op_index) {
+  const auto fakequant_it = model->operators.begin() + op_index;
+  auto* fakequant_base_op = fakequant_it->get();
+  if (fakequant_base_op->type != OperatorType::kFakeQuant) {
+    return false;
+  }
+  auto* fq_op = static_cast<FakeQuantOperator*>(fakequant_base_op);
+
+  if (!fq_op->minmax) {
+    // Need to be resolved first by ResolveFakeQuantArgsFromVars.
+    return false;
+  }
+
+  // At this point, this FakeQuantOperator should have a MinMax
+  // attached to it, and should only have 1 input (it should not have
+  // 2nd and 3rd input arrays giving min and max anymore).
+  CHECK(fq_op->minmax);
+  CHECK_EQ(1, fq_op->inputs.size());
+
+  return ApplyAttrsToArray(this, model, *fq_op, fq_op->inputs[0]) ||
+         ApplyAttrsToArray(this, model, *fq_op, fq_op->outputs[0]);
+}
+
+}  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/read_fake_quant_min_max.cc b/tensorflow/contrib/lite/toco/graph_transformations/read_fake_quant_min_max.cc
deleted file mode 100644
index bdcca5b7ca..0000000000
--- a/tensorflow/contrib/lite/toco/graph_transformations/read_fake_quant_min_max.cc
+++ /dev/null
@@ -1,112 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-#include <algorithm>
-#include <memory>
-#include <string>
-#include <unordered_map>
-#include <vector>
-
-#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
-#include "tensorflow/contrib/lite/toco/model.h"
-#include "tensorflow/contrib/lite/toco/tooling_util.h"
-#include "tensorflow/core/platform/logging.h"
-
-namespace toco {
-
-namespace {
-
-bool ApplyMinMaxToArray(GraphTransformation* transformation, Model* model,
-                        const MinMax& minmax, const string& array_name) {
-  auto& annotated_array = model->GetArray(array_name);
-  if (annotated_array.minmax) {
-    return false;
-  }
-  annotated_array.GetOrCreateMinMax() = minmax;
-  transformation->AddMessageF(
-      "Read min/max annotation for array %s: min=%g, max=%g", array_name,
-      minmax.min, minmax.max);
-  return true;
-}
-
-}  // end namespace
-
-bool ReadFakeQuantMinMax::Run(Model* model, std::size_t op_index) {
-  const auto fakequant_it = model->operators.begin() + op_index;
-  auto* fakequant_base_op = fakequant_it->get();
-  if (fakequant_base_op->type != OperatorType::kFakeQuant) {
-    return false;
-  }
-  auto* fakequant_op = static_cast<FakeQuantOperator*>(fakequant_base_op);
-
-  bool changed = false;
-
-  if (!fakequant_op->minmax) {
-    CHECK_EQ(fakequant_op->inputs.size(), 3);
-    // We need to yield until the min and max parameters have been
-    // resolved to constant arrays.
-    for (int i = 1; i <= 2; i++) {
-      if (!IsConstantParameterArray(*model, fakequant_op->inputs[1])) {
-        return false;
-      }
-    }
-
-    // Obtain the final min/max values
-    const auto& min_array = model->GetArray(fakequant_op->inputs[1]);
-    const auto& max_array = model->GetArray(fakequant_op->inputs[2]);
-    CHECK_EQ(RequiredBufferSizeForShape(min_array.shape()), 1);
-    CHECK_EQ(RequiredBufferSizeForShape(max_array.shape()), 1);
-    fakequant_op->minmax.reset(new MinMax);
-    MinMax& minmax = *fakequant_op->minmax;
-    minmax.min = min_array.GetBuffer<ArrayDataType::kFloat>().data[0];
-    minmax.max = max_array.GetBuffer<ArrayDataType::kFloat>().data[0];
-    // We always want [min, max] to contain 0.
-    if (minmax.min > 0 || minmax.max < 0) {
-      LOG(ERROR) << "For " << LogName(*fakequant_op) << " the MinMax range "
-                 << "[" << minmax.min << ", " << minmax.max
-                 << "] does not contain 0. "
-                 << "Proceeding by tweaking it to contain 0, which will result "
-                    "in poor accuracy.";
-    }
-    minmax.min = std::min(minmax.min, 0.);
-    minmax.max = std::max(minmax.max, 0.);
-
-    // We won't use the input arrays that provided these min and max
-    // values, anymore. Delete them unless they are used by something
-    // else.
-    for (int i = 1; i <= 2; i++) {
-      if (CountOpsWithInput(*model, fakequant_op->inputs[i]) == 1) {
-        model->EraseArray(fakequant_op->inputs[i]);
-      }
-    }
-    fakequant_op->inputs.resize(1);
-    changed = true;
-  }
-
-  // At this point, this FakeQuantOperator should have a MinMax
-  // attached to it, and should only have 1 input (it should not have
-  // 2nd and 3rd input arrays giving min and max anymore).
-  CHECK(fakequant_op->minmax);
-  CHECK_EQ(1, fakequant_op->inputs.size());
-
-  const MinMax& minmax = *fakequant_op->minmax;
-
-  // Record the MinMax info on the input and output arrays
-  changed |= ApplyMinMaxToArray(this, model, minmax, fakequant_op->inputs[0]);
-  changed |= ApplyMinMaxToArray(this, model, minmax, fakequant_op->outputs[0]);
-
-  return changed;
-}
-
-}  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc
index 404f27e067..5295eeccec 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_reshape.cc
@@ -59,6 +59,15 @@ bool IsReshapeTrivial(const Model& model, const Operator& op,
   if (CountOpsWithInput(model, op.outputs[0]) == 1) {
     const auto* next_op = GetOpWithInput(model, op.outputs[0]);
     if (next_op->type == OperatorType::kReshape) {
+      if (!IsDiscardableArray(model, next_op->outputs[0])) {
+        // If the |next_op| output is used as a model output we need to preserve
+        // its shape.
+        transformation->AddMessageF(
+            "%s cannot be merged into following reshape %s as it is "
+            "non-discardable and must keep the specified shape",
+            LogName(op), LogName(*next_op));
+        return false;
+      }
       transformation->AddMessageF(
           "%s is trivial because its output is only consumed by another "
           "Reshape op %s",
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc
index efb7bb2184..058f314b33 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_fake_quant.cc
@@ -25,6 +25,37 @@ limitations under the License.
 
 namespace toco {
 
+template <ArrayDataType A>
+void GetBoundsForQuantizedDataType(double* min, double* max) {
+  using limits = std::numeric_limits<DataType<A>>;
+  *min = limits::min();
+  *max = limits::max();
+}
+
+void GetBoundsForQuantizedDataType(ArrayDataType quantized_data_type,
+                                   double* min, double* max) {
+  switch (quantized_data_type) {
+    case ArrayDataType::kUint8:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kUint8>(min, max);
+    case ArrayDataType::kInt8:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kInt8>(min, max);
+    case ArrayDataType::kUint16:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kUint16>(min, max);
+    case ArrayDataType::kInt16:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kInt16>(min, max);
+    case ArrayDataType::kUint32:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kUint32>(min, max);
+    case ArrayDataType::kInt32:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kInt32>(min, max);
+    case ArrayDataType::kUint64:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kUint64>(min, max);
+    case ArrayDataType::kInt64:
+      return GetBoundsForQuantizedDataType<ArrayDataType::kInt64>(min, max);
+    default:
+      LOG(FATAL) << "unhandled quantized data type";
+  }
+}
+
 bool ResolveConstantFakeQuant::Run(Model* model, std::size_t op_index) {
   const auto fakequant_it = model->operators.begin() + op_index;
   const auto* fakequant_base_op = fakequant_it->get();
@@ -76,14 +107,21 @@ bool ResolveConstantFakeQuant::Run(Model* model, std::size_t op_index) {
   const int size = input_buffer.data.size();
   output_buffer.data.resize(size);
   QuantizationParams qparams;
-  GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(*fakequant_op->minmax,
-                                                         &qparams);
+  ChooseQuantizationParamsForArrayAndQuantizedDataType(
+      output_array, quantized_data_type, &qparams);
+  double quantized_min, quantized_max;
+  GetBoundsForQuantizedDataType(quantized_data_type, &quantized_min,
+                                &quantized_max);
+  if (fakequant_op->narrow_range) {
+    quantized_min++;
+  }
+
   for (int i = 0; i < size; i++) {
     const double src_val = input_buffer.data[i];
     const double unclamped_quantized_val =
         std::round(qparams.zero_point + src_val / qparams.scale);
-    const double quantized_val =
-        std::min(255., std::max(0., unclamped_quantized_val));
+    const double quantized_val = std::min(
+        quantized_max, std::max(quantized_min, unclamped_quantized_val));
     const double dst_val = qparams.scale * (quantized_val - qparams.zero_point);
     output_buffer.data[i] = dst_val;
   }
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_gather.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_gather.cc
index debe298a5a..36d7dad0ce 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_gather.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_gather.cc
@@ -69,7 +69,7 @@ bool ResolveConstantGather::Run(Model* model, std::size_t op_index) {
   }
   const auto* op = static_cast<const GatherOperator*>(base_op);
 
-  CHECK_EQ(op->inputs.size(), 2);
+  CHECK_GE(op->inputs.size(), 2);
   CHECK_EQ(op->outputs.size(), 1);
   auto& output_array = model->GetArray(op->outputs[0]);
   if (output_array.data_type == ArrayDataType::kNone) {
@@ -81,10 +81,14 @@ bool ResolveConstantGather::Run(Model* model, std::size_t op_index) {
     return false;
   }
 
-  // Only handling axis=0 for now.
-  if (op->axis != 0) {
+  if (!op->axis) {
+    // Yield until axis has been set by ResolveGatherAttributes.
+    return false;
+  }
+  if (op->axis.value() != 0) {
+    // Only handling axis=0 for now.
     AddMessageF("%s has axis %d; only axis=0 is supported", LogName(*op),
-                op->axis);
+                op->axis.value());
     return false;
   }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_stack.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_pack.cc
index a4d5f1923a..e86616574d 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_stack.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_pack.cc
@@ -24,7 +24,7 @@ namespace toco {
 namespace {
 
 template <ArrayDataType Type>
-void Stack(Model* model, StackOperator const& op) {
+void Pack(Model* model, PackOperator const& op) {
   auto& output_array = model->GetArray(op.outputs[0]);
   CHECK(output_array.data_type == Type);
 
@@ -33,8 +33,8 @@ void Stack(Model* model, StackOperator const& op) {
       output_array.GetMutableBuffer<Type>().data;
   output_data.resize(RequiredBufferSizeForShape(output_array.shape()));
 
-  // Stack inputs into buffer
-  CHECK_EQ(op.axis, 0) << "Stacking only supported along first axis";
+  // Pack inputs into buffer
+  CHECK_EQ(op.axis, 0) << "Packing only supported along first axis";
   int dst_offset = 0;
   for (int i = 0; i < op.inputs.size(); i++) {
     // Append array data to output for each input array
@@ -49,13 +49,13 @@ void Stack(Model* model, StackOperator const& op) {
 
 }  // namespace
 
-bool ResolveConstantStack::Run(Model* model, std::size_t op_index) {
+bool ResolveConstantPack::Run(Model* model, std::size_t op_index) {
   auto it = model->operators.begin() + op_index;
   const auto* base_op = it->get();
-  if (base_op->type != OperatorType::kStack) {
+  if (base_op->type != OperatorType::kPack) {
     return false;
   }
-  const auto* op = static_cast<const StackOperator*>(base_op);
+  const auto* op = static_cast<const PackOperator*>(base_op);
 
   CHECK_GE(op->inputs.size(), 1);
   CHECK_EQ(op->outputs.size(), 1);
@@ -82,24 +82,24 @@ bool ResolveConstantStack::Run(Model* model, std::size_t op_index) {
     // Handle negative axis
     axis += model->GetArray(op->inputs[0]).shape().dims().size();
   }
-  CHECK_EQ(axis, 0) << "Stacking only supported along 0th axis";
+  CHECK_EQ(axis, 0) << "Packing only supported along 0th axis";
 
   CHECK(!output_array.buffer);
   switch (output_array.data_type) {
     case ArrayDataType::kFloat:
-      Stack<ArrayDataType::kFloat>(model, *op);
+      Pack<ArrayDataType::kFloat>(model, *op);
       break;
     case ArrayDataType::kUint8:
-      Stack<ArrayDataType::kUint8>(model, *op);
+      Pack<ArrayDataType::kUint8>(model, *op);
       break;
     case ArrayDataType::kInt32:
-      Stack<ArrayDataType::kInt32>(model, *op);
+      Pack<ArrayDataType::kInt32>(model, *op);
       break;
     case ArrayDataType::kInt64:
-      Stack<ArrayDataType::kInt64>(model, *op);
+      Pack<ArrayDataType::kInt64>(model, *op);
       break;
     default:
-      LOG(FATAL) << "Unsupported data type given to Stack op with output \""
+      LOG(FATAL) << "Unsupported data type given to Pack op with output \""
                  << op->outputs[0] << "\"";
       break;
   }
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
index f89ef85fdb..fe3882c28d 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
@@ -57,8 +57,8 @@ bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
     case OperatorType::kSqrt:
     case OperatorType::kSquare:
     case OperatorType::kSum:
-    case OperatorType::kMin:  //  Reduction Min
-    case OperatorType::kMax:  //  Reduction Max
+    case OperatorType::kReduceMin:  //  Reduction Min
+    case OperatorType::kReduceMax:  //  Reduction Max
     case OperatorType::kReshape:
     case OperatorType::kRelu6:
     case OperatorType::kRelu1:
@@ -196,7 +196,7 @@ bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
       }
       output_float_data[i] = sum;
     }
-  } else if (unary_op->type == OperatorType::kMin) {
+  } else if (unary_op->type == OperatorType::kReduceMin) {
     // At the moment only full reduction across all dimensions is supported.
     // TODO(starka): Output should not be padded.
     for (int i = 0; i < output_dims_count; i++) {
@@ -207,7 +207,7 @@ bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
       min = std::min(min, (*input_float_data)[i]);
     }
     output_float_data[0] = min;
-  } else if (unary_op->type == OperatorType::kMax) {
+  } else if (unary_op->type == OperatorType::kReduceMax) {
     // At the moment only full reduction across all dimensions is supported.
     // TODO(starka): Output should not be padded.
     for (int i = 0; i < output_dims_count; i++) {
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_fake_quant_args_from_vars.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_fake_quant_args_from_vars.cc
new file mode 100644
index 0000000000..0dda1fd0b3
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_fake_quant_args_from_vars.cc
@@ -0,0 +1,80 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <algorithm>
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace toco {
+
+bool ResolveFakeQuantArgsFromVars::Run(Model* model, std::size_t op_index) {
+  const auto fakequant_it = model->operators.begin() + op_index;
+  auto* fakequant_base_op = fakequant_it->get();
+  if (fakequant_base_op->type != OperatorType::kFakeQuant) {
+    return false;
+  }
+  auto* fakequant_op = static_cast<FakeQuantOperator*>(fakequant_base_op);
+
+  if (fakequant_op->minmax) {
+    // Already resolved.
+    return false;
+  }
+
+  CHECK_EQ(fakequant_op->inputs.size(), 3);
+  // We need to yield until the min and max parameters have been
+  // resolved to constant arrays.
+  for (int i = 1; i <= 2; i++) {
+    if (!IsConstantParameterArray(*model, fakequant_op->inputs[i])) {
+      return false;
+    }
+  }
+
+  // Obtain the final min/max values
+  const auto& min_array = model->GetArray(fakequant_op->inputs[1]);
+  const auto& max_array = model->GetArray(fakequant_op->inputs[2]);
+  CHECK_EQ(RequiredBufferSizeForShape(min_array.shape()), 1);
+  CHECK_EQ(RequiredBufferSizeForShape(max_array.shape()), 1);
+  fakequant_op->minmax.reset(new MinMax);
+  MinMax& minmax = *fakequant_op->minmax;
+  minmax.min = min_array.GetBuffer<ArrayDataType::kFloat>().data[0];
+  minmax.max = max_array.GetBuffer<ArrayDataType::kFloat>().data[0];
+  // We always want [min, max] to contain 0.
+  if (minmax.min > 0 || minmax.max < 0) {
+    LOG(ERROR) << "For " << LogName(*fakequant_op) << " the MinMax range "
+               << "[" << minmax.min << ", " << minmax.max
+               << "] does not contain 0. "
+               << "Proceeding by tweaking it to contain 0, which will result "
+                  "in poor accuracy.";
+  }
+  minmax.min = std::min(minmax.min, 0.);
+  minmax.max = std::max(minmax.max, 0.);
+
+  // We won't use the input arrays that provided these min and max
+  // values, anymore. Delete them unless they are used by something
+  // else.
+  for (int i = 1; i <= 2; i++) {
+    DeleteArrayIfUsedOnce(fakequant_op->inputs[i], model);
+  }
+  fakequant_op->inputs.resize(1);
+  return true;
+}
+
+}  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_gather_attributes.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_gather_attributes.cc
new file mode 100644
index 0000000000..ce825c91af
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_gather_attributes.cc
@@ -0,0 +1,53 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace toco {
+
+bool ResolveGatherAttributes::Run(Model* model, std::size_t op_index) {
+  auto* gather_op = model->operators[op_index].get();
+  if (gather_op->type != OperatorType::kGather) return false;
+  auto* op = static_cast<GatherOperator*>(gather_op);
+
+  if (op->axis) {
+    // Attributes already resolved
+    return false;
+  }
+  if (op->inputs.size() != 3) return false;
+  if (!IsConstantParameterArray(*model, op->inputs[2])) return false;
+
+  const auto& indices_array = model->GetArray(op->inputs[2]);
+  if (!indices_array.has_shape()) return false;
+  const auto& axis_data = indices_array.GetBuffer<ArrayDataType::kInt32>().data;
+  CHECK_EQ(axis_data.size(), 1)
+      << "Multidimensional gather not supported on " << LogName(*op);
+  op->axis = {axis_data[0]};
+
+  // Drop the axis array as we no longer need it.
+  DeleteArrayIfUsedOnce(op->inputs[2], model);
+  op->inputs.resize(2);
+
+  return true;
+}
+
+}  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_mean_attributes.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_reduce_attributes.cc
index 013b50ac9b..7d456af2fb 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_mean_attributes.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_reduce_attributes.cc
@@ -24,11 +24,8 @@ limitations under the License.
 
 namespace toco {
 
-bool ResolveMeanAttributes::Run(Model* model, std::size_t op_index) {
-  auto* mean_op = model->operators[op_index].get();
-  if (mean_op->type != OperatorType::kMean) return false;
-  auto* op = static_cast<MeanOperator*>(mean_op);
-
+template <typename T>
+bool ResolveAttributes(Model* model, T* op) {
   if (!op->axis.empty()) {
     // Attributes already resolved
     return false;
@@ -36,10 +33,28 @@ bool ResolveMeanAttributes::Run(Model* model, std::size_t op_index) {
   if (op->inputs.size() != 2) return false;
   if (!IsConstantParameterArray(*model, op->inputs[1])) return false;
 
-  const auto& indices_array = model->GetArray(op->inputs[1]);
+  const Array& indices_array = model->GetArray(op->inputs[1]);
   if (!indices_array.has_shape()) return false;
   op->axis = indices_array.GetBuffer<ArrayDataType::kInt32>().data;
   return true;
 }
 
+bool ResolveReduceAttributes::Run(Model* model, std::size_t op_index) {
+  Operator* op = model->operators[op_index].get();
+  switch (op->type) {
+    case OperatorType::kMean:
+      return ResolveAttributes(model, static_cast<MeanOperator*>(op));
+    case OperatorType::kSum:
+      return ResolveAttributes(model, static_cast<TensorFlowSumOperator*>(op));
+    case OperatorType::kReduceProd:
+      return ResolveAttributes(model, static_cast<TensorFlowProdOperator*>(op));
+    case OperatorType::kReduceMin:
+      return ResolveAttributes(model, static_cast<TensorFlowMinOperator*>(op));
+    case OperatorType::kReduceMax:
+      return ResolveAttributes(model, static_cast<TensorFlowMaxOperator*>(op));
+    default:
+      return false;
+  }
+}
+
 }  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD b/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD
index 95e8433be2..e163fc9ae1 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD
+++ b/tensorflow/contrib/lite/toco/graph_transformations/tests/BUILD
@@ -10,6 +10,7 @@ load(
 tf_cc_test(
     name = "lstm_utils_test",
     srcs = ["lstm_utils_test.cc"],
+    tags = ["no_oss"],
     deps = [
         "//tensorflow/contrib/lite/toco:graph_transformations",
         "//tensorflow/contrib/lite/toco:model",
@@ -21,6 +22,7 @@ tf_cc_test(
 tf_cc_test(
     name = "quantize_weights_test",
     srcs = ["quantize_weights_test.cc"],
+    tags = ["no_oss"],
     deps = [
         "//tensorflow/contrib/lite/toco:graph_transformations",
         "//tensorflow/contrib/lite/toco:model",
@@ -33,6 +35,7 @@ tf_cc_test(
 tf_cc_test(
     name = "resolve_constant_concatenation_test",
     srcs = ["resolve_constant_concatenation_test.cc"],
+    tags = ["no_oss"],
     deps = [
         "//tensorflow/contrib/lite/toco:graph_transformations",
         "//tensorflow/contrib/lite/toco:model",
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/unfuse_activation_functions.cc b/tensorflow/contrib/lite/toco/graph_transformations/unfuse_activation_functions.cc
index 2c7046c8c7..69bad2fa89 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/unfuse_activation_functions.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/unfuse_activation_functions.cc
@@ -64,7 +64,14 @@ bool UnfuseActivationFunctions::Run(Model* model, std::size_t op_index) {
   const string& tmp_array_name =
       AvailableArrayName(*model, op->outputs[0] + "_unfused");
   CHECK(!model->HasArray(tmp_array_name));
-  model->GetOrCreateArray(tmp_array_name);
+
+  const auto& output_array = model->GetArray(op->outputs[0]);
+  auto& tmp_array = model->GetOrCreateArray(tmp_array_name);
+  if (output_array.quantization_params) {
+    tmp_array.GetOrCreateQuantizationParams() =
+        output_array.GetQuantizationParams();
+  }
+
   ac_op->inputs = {tmp_array_name};
   op->outputs = {tmp_array_name};
   return true;
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc b/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc
index cbea39bcc0..dd9e26e68b 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc
@@ -187,6 +187,7 @@ bool UnpartitionEmbeddingLookup::Run(Model* model, std::size_t op_index) {
       AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_permuted/perm"));
   gather_params_permute_op->outputs.push_back(
       AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_permuted"));
+  gather_params_permute_op->axis = {0};
   op_it = model->operators.emplace(op_it, gather_params_permute_op) + 1;
   model->GetOrCreateArray(gather_params_permute_op->outputs[0]);
   const auto& partition_array = model->GetArray(gather_ops[0]->inputs[0]);
@@ -212,6 +213,7 @@ bool UnpartitionEmbeddingLookup::Run(Model* model, std::size_t op_index) {
                               mod_op->inputs[0]};
   merged_gather_op->outputs = {stitch_op->outputs[0]};
   merged_gather_op->input_rank = partition_array.shape().dimensions_count();
+  merged_gather_op->axis = {0};
   model->operators.emplace(op_it, merged_gather_op);
 
   AddMessageF(
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/unroll_batch_matmul.cc b/tensorflow/contrib/lite/toco/graph_transformations/unroll_batch_matmul.cc
index da81ea2ff3..5f0cece67a 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/unroll_batch_matmul.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/unroll_batch_matmul.cc
@@ -76,7 +76,7 @@ bool UnrollBatchMatMul::Run(Model* model, std::size_t op_index) {
   AddMessageF("Unrolling BatchMatMul %s %d times", LogName(*batch_op),
               batch_count);
   auto tail_it = batch_op_it;
-  std::vector<string> stack_inputs;
+  std::vector<string> pack_inputs;
   for (int batch = 0; batch < batch_count; ++batch) {
     std::string batch_name =
         std::string(batch_op->outputs[0]) + "_b" + std::to_string(batch);
@@ -146,15 +146,15 @@ bool UnrollBatchMatMul::Run(Model* model, std::size_t op_index) {
     tail_it = model->operators.emplace(tail_it, matmul_op) + 1;
 
     // Add to stack.
-    stack_inputs.push_back(matmul_op->outputs[0]);
+    pack_inputs.push_back(matmul_op->outputs[0]);
   }
 
-  // The stack that will join all the individual matmul results together.
-  auto* stack_op = new StackOperator;
-  stack_op->inputs = stack_inputs;
-  stack_op->outputs = {batch_op->outputs[0]};
-  stack_op->axis = 0;
-  model->operators.emplace(tail_it, stack_op);
+  // The pack that will join all the individual matmul results together.
+  auto* pack_op = new PackOperator;
+  pack_op->inputs = pack_inputs;
+  pack_op->outputs = {batch_op->outputs[0]};
+  pack_op->axis = 0;
+  model->operators.emplace(tail_it, pack_op);
 
   // Remove the old batch matmul now that we've unrolled.
   batch_op_it = model->operators.begin();
diff --git a/tensorflow/contrib/lite/toco/import_tensorflow.cc b/tensorflow/contrib/lite/toco/import_tensorflow.cc
index 5c32a39035..9a404c2606 100644
--- a/tensorflow/contrib/lite/toco/import_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/import_tensorflow.cc
@@ -215,7 +215,7 @@ tensorflow::Status ImportFloatArray(const TensorProto& input_tensor,
                                     Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_FLOAT);
   const auto& input_shape = input_tensor.tensor_shape();
-  CHECK_LE(input_shape.dim_size(), 4);
+  CHECK_LE(input_shape.dim_size(), 6);
   int input_flat_size;
   auto status = ImportShape(input_shape.dim(), &input_flat_size,
                             output_array->mutable_shape());
@@ -253,7 +253,7 @@ tensorflow::Status ImportQuint8Array(const TensorProto& input_tensor,
                                      Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_QUINT8);
   const auto& input_shape = input_tensor.tensor_shape();
-  CHECK_LE(input_shape.dim_size(), 4);
+  CHECK_LE(input_shape.dim_size(), 6);
   int input_flat_size;
   auto status = ImportShape(input_shape.dim(), &input_flat_size,
                             output_array->mutable_shape());
@@ -290,7 +290,7 @@ tensorflow::Status ImportInt32Array(const TensorProto& input_tensor,
                                     Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_INT32);
   const auto& input_shape = input_tensor.tensor_shape();
-  CHECK_LE(input_shape.dim_size(), 4);
+  CHECK_LE(input_shape.dim_size(), 6);
   int input_flat_size;
   auto status = ImportShape(input_shape.dim(), &input_flat_size,
                             output_array->mutable_shape());
@@ -326,7 +326,7 @@ tensorflow::Status ImportInt64Array(const TensorProto& input_tensor,
                                     Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_INT64);
   const auto& input_shape = input_tensor.tensor_shape();
-  CHECK_LE(input_shape.dim_size(), 4);
+  CHECK_LE(input_shape.dim_size(), 6);
   int input_flat_size;
   auto status = ImportShape(input_shape.dim(), &input_flat_size,
                             output_array->mutable_shape());
@@ -363,7 +363,7 @@ tensorflow::Status ImportBoolArray(const TensorProto& input_tensor,
                                    Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_BOOL);
   const auto& input_shape = input_tensor.tensor_shape();
-  CHECK_LE(input_shape.dim_size(), 4);
+  CHECK_LE(input_shape.dim_size(), 6);
   int input_flat_size;
   auto status = ImportShape(input_shape.dim(), &input_flat_size,
                             output_array->mutable_shape());
@@ -409,7 +409,7 @@ tensorflow::Status ImportStringArray(const TensorProto& input_tensor,
                                      Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_STRING);
   const auto& input_shape = input_tensor.tensor_shape();
-  CHECK_LE(input_shape.dim_size(), 4);
+  CHECK_LE(input_shape.dim_size(), 6);
   int input_flat_size;
   auto status = ImportShape(input_shape.dim(), &input_flat_size,
                             output_array->mutable_shape());
@@ -755,6 +755,9 @@ tensorflow::Status ConvertFakeQuantWithMinMaxArgs(
   op->outputs.push_back(node.name());
   // tf.fake_quant_with_min_max_args num_bits defaults to 8.
   op->num_bits = HasAttr(node, "num_bits") ? GetIntAttr(node, "num_bits") : 8;
+  if (HasAttr(node, "narrow_range")) {
+    op->narrow_range = GetBoolAttr(node, "narrow_range");
+  }
   model->operators.emplace_back(op);
   return tensorflow::Status::OK();
 }
@@ -774,6 +777,9 @@ tensorflow::Status ConvertFakeQuantWithMinMaxVars(
   }
   op->outputs.push_back(node.name());
   op->num_bits = HasAttr(node, "num_bits") ? GetIntAttr(node, "num_bits") : 8;
+  if (HasAttr(node, "narrow_range")) {
+    op->narrow_range = GetBoolAttr(node, "narrow_range");
+  }
   model->operators.emplace_back(op);
   return tensorflow::Status::OK();
 }
@@ -799,22 +805,6 @@ tensorflow::Status ConvertSqueezeOperator(
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertSumOperator(
-    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
-    Model* model) {
-  CHECK_EQ(node.op(), "Sum");
-  TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 2));
-  auto* op = new TensorFlowSumOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-  if (HasAttr(node, "keep_dims")) {
-    op->keep_dims = GetBoolAttr(node, "keep_dims");
-  }
-  return tensorflow::Status::OK();
-}
-
 tensorflow::Status ConvertSplitOperator(
     const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
     Model* model) {
@@ -1052,41 +1042,14 @@ tensorflow::Status ConvertSimpleOperator(
   return ConvertSimpleOperator<Op>(node, tf_import_flags, model);
 }
 
-tensorflow::Status ConvertMaxOperator(
-    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
-    Model* model) {
-  CHECK_EQ(node.op(), "Max");
-  TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 2));
-  auto* op = new TensorFlowMaxOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-  if (HasAttr(node, "keep_dims")) {
-    op->keep_dims = GetBoolAttr(node, "keep_dims");
-  }
-  return tensorflow::Status::OK();
-}
-
-tensorflow::Status ConvertMinOperator(
-    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
-    Model* model) {
-  CHECK_EQ(node.op(), "Min");
-  TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 2));
-  auto* op = new TensorFlowMinOperator;
-  op->inputs.push_back(node.input(0));
-  op->inputs.push_back(node.input(1));
-  op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
-  if (HasAttr(node, "keep_dims")) {
-    op->keep_dims = GetBoolAttr(node, "keep_dims");
-  }
-  return tensorflow::Status::OK();
-}
-
 tensorflow::Status ConvertUnsupportedOperator(
     const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
     Model* model) {
+  // Names of special attributes in TF graph that are used by Toco.
+  static constexpr char kAttrOutputQuantized[] = "_output_quantized";
+  static constexpr char kAttrOutputTypes[] = "_output_types";
+  static constexpr char kAttrOutputShapes[] = "_output_shapes";
+
   LOG(INFO) << "Converting unsupported operation: " << node.op();
   auto* op = new TensorFlowUnsupportedOperator;
   const int num_inputs = GetInputsCount(node, tf_import_flags);
@@ -1097,11 +1060,11 @@ tensorflow::Status ConvertUnsupportedOperator(
   op->tensorflow_op = node.op();
   node.SerializeToString(&op->tensorflow_node_def);
   model->operators.emplace_back(op);
-  if (HasAttr(node, "_output_quantized")) {
-    op->quantized = GetBoolAttr(node, "_output_quantized");
+  if (HasAttr(node, kAttrOutputQuantized)) {
+    op->quantized = GetBoolAttr(node, kAttrOutputQuantized);
   }
-  if (HasAttr(node, "_output_types")) {
-    const auto& output_types = GetListAttr(node, "_output_types");
+  if (HasAttr(node, kAttrOutputTypes)) {
+    const auto& output_types = GetListAttr(node, kAttrOutputTypes);
     for (int i = 0; i < output_types.type_size(); ++i) {
       op->output_data_types.push_back(ConvertDataType(output_types.type(i)));
     }
@@ -1109,6 +1072,19 @@ tensorflow::Status ConvertUnsupportedOperator(
     const auto& output_type = GetDataTypeAttr(node, "Tout");
     op->output_data_types.push_back(ConvertDataType(output_type));
   }
+  if (HasAttr(node, kAttrOutputShapes)) {
+    const auto& output_shapes = GetListAttr(node, kAttrOutputShapes);
+    Shape output_shape;
+    for (int i = 0; i < output_shapes.shape_size(); ++i) {
+      const auto status =
+          ImportShape(output_shapes.shape(i).dim(), /*input_flat_size=*/nullptr,
+                      &output_shape);
+      if (!status.ok()) {
+        return status;
+      }
+      op->output_shapes.push_back(output_shape);
+    }
+  }
   return tensorflow::Status::OK();
 }
 
@@ -1223,17 +1199,27 @@ tensorflow::Status ConvertGatherOperator(
   auto* op = new GatherOperator;
   op->inputs.push_back(node.input(0));
   op->inputs.push_back(node.input(1));
-  // TODO(ahentz): we currently ignore the third tensor in GatherV2 but we
-  // should read it an pass it on to the TF Lite Interpreter.
+  if (node.input_size() >= 3) {
+    // GatherV2 form where we are provided an axis. It may be either a constant
+    // or runtime defined value, so we just wire up the array and let
+    // ResolveGatherAttributes take care of it later on.
+    const auto axis_data_type = GetDataTypeAttr(node, "Taxis");
+    CHECK(axis_data_type == DT_INT32 || axis_data_type == DT_INT64);
+    op->inputs.push_back(node.input(2));
+  } else {
+    // Gather form that assumes axis=0.
+    op->axis = {0};
+  }
   op->outputs.push_back(node.name());
   model->operators.emplace_back(op);
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertArgMaxOperator(
+template <typename Op, const char* op_name>
+tensorflow::Status ConvertArgMinMaxOperator(
     const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
     Model* model) {
-  CHECK_EQ(node.op(), "ArgMax");
+  CHECK_EQ(node.op(), op_name);
   TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 2));
   const auto axis_data_type =
       HasAttr(node, "Tidx") ? GetDataTypeAttr(node, "Tidx") : DT_INT32;
@@ -1242,7 +1228,7 @@ tensorflow::Status ConvertArgMaxOperator(
                                : DT_INT64;
   CHECK(axis_data_type == DT_INT64 || axis_data_type == DT_INT32);
   CHECK(output_type == DT_INT64 || output_type == DT_INT32);
-  auto* op = new ArgMaxOperator;
+  auto* op = new Op;
   op->output_data_type = ConvertDataType(output_type);
   op->inputs.push_back(node.input(0));
   op->inputs.push_back(node.input(1));
@@ -1405,12 +1391,12 @@ tensorflow::Status ConvertBatchToSpaceNDOperator(
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertMeanOperator(
+template <typename T>
+tensorflow::Status ConvertReduceOperator(
     const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
     Model* model) {
-  CHECK_EQ(node.op(), "Mean");
   TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 2));
-  auto* op = new MeanOperator;
+  auto* op = new T;
   op->inputs.push_back(node.input(0));
   op->inputs.push_back(node.input(1));
   op->outputs.push_back(node.name());
@@ -1543,11 +1529,15 @@ tensorflow::Status ConvertRangeOperator(
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertStackOperator(
+// Note that it's easy to confuse/conflate "Stack" and "Pack" operators, but
+// they aren't the same thing.  tf.stack results in a "Pack" operator.  "Stack"
+// operators also exist, but involve manipulating the TF runtime stack, and are
+// not directly related to tf.stack() usage.
+tensorflow::Status ConvertPackOperator(
     const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
     Model* model) {
-  CHECK((node.op() == "Stack") || (node.op() == "Pack"));
-  auto* op = new StackOperator;
+  CHECK_EQ(node.op(), "Pack");
+  auto op = absl::make_unique<PackOperator>();
   const int num_inputs = GetInputsCount(node, tf_import_flags);
   QCHECK_GE(num_inputs, 1)
       << node.op()
@@ -1557,10 +1547,11 @@ tensorflow::Status ConvertStackOperator(
   for (int i = 0; i < num_inputs; ++i) {
     op->inputs.push_back(node.input(i));
   }
-  // Both "Stack" and "Pack" have the "axis" attribute.
+  op->values_count = HasAttr(node, "N") ? GetIntAttr(node, "N") : num_inputs;
   op->axis = HasAttr(node, "axis") ? GetIntAttr(node, "axis") : 0;
+  op->dtype = ConvertDataType(toco::GetDataTypeAttr(node, "T"));
   op->outputs.push_back(node.name());
-  model->operators.emplace_back(op);
+  model->operators.emplace_back(std::move(op));
   return tensorflow::Status::OK();
 }
 
@@ -1606,6 +1597,24 @@ tensorflow::Status ConvertShapeOperator(
   return tensorflow::Status::OK();
 }
 
+tensorflow::Status ConvertAnyOperator(
+    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
+    Model* model) {
+  CHECK_EQ(node.op(), "Any");
+  TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 2));
+  const auto idx_type =
+      HasAttr(node, "Tidx") ? GetDataTypeAttr(node, "Tidx") : DT_INT32;
+  CHECK(idx_type == DT_INT32);
+  auto op = absl::make_unique<AnyOperator>();
+  op->inputs.push_back(node.input(0));
+  op->inputs.push_back(node.input(1));
+  op->outputs.push_back(node.name());
+  op->keep_dims =
+      HasAttr(node, "keep_dims") ? GetBoolAttr(node, "keep_dims") : false;
+  model->operators.push_back(std::move(op));
+  return tensorflow::Status::OK();
+}
+
 void StripCaretFromArrayNames(Model* model) {
   for (auto& op : model->operators) {
     for (auto& input : op->inputs) {
@@ -1824,6 +1833,55 @@ tensorflow::Status ConvertSparseToDenseOperator(
   return tensorflow::Status::OK();
 }
 
+tensorflow::Status ConvertOneHotOperator(
+    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
+    Model* model) {
+  CHECK_EQ(node.op(), "OneHot");
+  TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 4));
+
+  const auto dtype = GetDataTypeAttr(node, "T");
+  // TODO(b/111744875): Support DT_UINT8 and quantization.
+  CHECK(dtype == DT_INT32 || dtype == DT_INT64 || dtype == DT_FLOAT ||
+        dtype == DT_BOOL);
+
+  auto op = absl::make_unique<OneHotOperator>();
+  op->axis = HasAttr(node, "axis") ? GetIntAttr(node, "axis") : -1;
+  for (const string& input : node.input()) {
+    op->inputs.push_back(input);
+  }
+  op->outputs.push_back(node.name());
+  model->operators.emplace_back(op.release());
+  return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertCTCBeamSearchDecoderOperator(
+    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
+    Model* model) {
+  CHECK_EQ(node.op(), "CTCBeamSearchDecoder");
+  TF_QCHECK_OK(CheckInputsCount(node, tf_import_flags, 2));
+
+  auto* op = new CTCBeamSearchDecoderOperator;
+  for (const string& input : node.input()) {
+    op->inputs.push_back(input);
+  }
+
+  op->beam_width =
+      HasAttr(node, "beam_width") ? GetIntAttr(node, "beam_width") : 1;
+  op->top_paths =
+      HasAttr(node, "top_paths") ? GetIntAttr(node, "top_paths") : 1;
+  op->merge_repeated = HasAttr(node, "merge_repeated")
+                           ? GetBoolAttr(node, "merge_repeated")
+                           : true;
+
+  // There are top_paths + 1 outputs.
+  op->outputs.push_back(node.name());  // Implicit :0.
+  for (int i = 0; i < op->top_paths; ++i) {
+    op->outputs.push_back(node.name() + ":" + std::to_string(i + 1));
+  }
+  model->operators.emplace_back(op);
+  return tensorflow::Status::OK();
+}
+
 }  // namespace
 
 namespace internal {
@@ -1833,12 +1891,17 @@ using ConverterType = tensorflow::Status (*)(
     Model* model);
 using ConverterMapType = std::unordered_map<std::string, ConverterType>;
 
+constexpr char kArgMax[] = "ArgMax";
+constexpr char kArgMin[] = "ArgMin";
+
 ConverterMapType GetTensorFlowNodeConverterMap() {
   return std::unordered_map<std::string, ConverterType>({
       {"Add", ConvertSimpleOperator<AddOperator, 2>},
       {"AddN", ConvertSimpleOperator<AddNOperator>},
       {"All", ConvertSimpleOperator<TensorFlowAllOperator>},
-      {"ArgMax", ConvertArgMaxOperator},
+      {"Any", ConvertAnyOperator},
+      {"ArgMax", ConvertArgMinMaxOperator<ArgMaxOperator, kArgMax>},
+      {"ArgMin", ConvertArgMinMaxOperator<ArgMinOperator, kArgMin>},
       {"Assert", ConvertSimpleOperator<TensorFlowAssertOperator>},
       {"AvgPool", ConvertAvgPoolOperator},
       {"BatchMatMul", ConvertBatchMatMulOperator},
@@ -1853,6 +1916,7 @@ ConverterMapType GetTensorFlowNodeConverterMap() {
       {"Const", ConvertConstOperator},
       {"Conv2D", ConvertConvOperator},
       {"Conv2DBackpropInput", ConvertTransposeConvOperator},
+      {"CTCBeamSearchDecoder", ConvertCTCBeamSearchDecoderOperator},
       {"DepthToSpace", ConvertDepthToSpaceOperator},
       {"DepthwiseConv2dNative", ConvertDepthwiseConvOperator},
       {"Div", ConvertSimpleOperator<DivOperator, 2>},
@@ -1879,28 +1943,32 @@ ConverterMapType GetTensorFlowNodeConverterMap() {
       {"Less", ConvertSimpleOperator<TensorFlowLessOperator, 2>},
       {"LessEqual", ConvertSimpleOperator<TensorFlowLessEqualOperator, 2>},
       {"Log", ConvertSimpleOperator<LogOperator, 1>},
-      {"Log", ConvertSimpleOperator<LogOperator, 1>},
+      {"LogicalAnd", ConvertSimpleOperator<LogicalAndOperator, 2>},
+      {"LogicalOr", ConvertSimpleOperator<LogicalOrOperator, 2>},
+      {"LogicalNot", ConvertSimpleOperator<LogicalNotOperator, 1>},
       {"LogSoftmax", ConvertSimpleOperator<LogSoftmaxOperator, 1>},
       {"MatMul", ConvertMatMulOperator},
-      {"Max", ConvertMaxOperator},
+      {"Max", ConvertReduceOperator<TensorFlowMaxOperator>},
       {"MaxPool", ConvertMaxPoolOperator},
       {"Maximum", ConvertSimpleOperator<TensorFlowMaximumOperator, 2>},
-      {"Mean", ConvertMeanOperator},
+      {"Mean", ConvertReduceOperator<MeanOperator>},
       {"Merge", ConvertSimpleOperator<TensorFlowMergeOperator, 2>},
-      {"Min", ConvertMinOperator},
+      {"Min", ConvertReduceOperator<TensorFlowMinOperator>},
       {"Minimum", ConvertSimpleOperator<TensorFlowMinimumOperator, 2>},
       {"Mul", ConvertSimpleOperator<MulOperator, 2>},
       {"Neg", ConvertSimpleOperator<NegOperator, 1>},
       {"NextIteration", ConvertOperatorSpecialCasedAsRNNBackEdge},
       {"NoOp", ConvertNoOpOperator},
       {"NotEqual", ConvertSimpleOperator<TensorFlowNotEqualOperator, 2>},
-      {"Pack", ConvertStackOperator},
+      {"OneHot", ConvertOneHotOperator},
+      {"Pack", ConvertPackOperator},
       {"Pad", ConvertSimpleOperator<PadOperator, 2>},
       {"PadV2", ConvertSimpleOperator<PadV2Operator, 3>},
       {"ParallelDynamicStitch", ConvertDynamicStitchOperator},
       {"Placeholder", ConvertPlaceholderOperator},
       {"PlaceholderWithDefault", ConvertIdentityOperator},
       {"Pow", ConvertSimpleOperator<PowOperator, 2>},
+      {"Prod", ConvertReduceOperator<TensorFlowProdOperator>},
       {"RandomUniform", ConvertRandomUniform},
       {"Range", ConvertRangeOperator},
       {"Rank", ConvertSimpleOperator<RankOperator, 1>},
@@ -1923,11 +1991,10 @@ ConverterMapType GetTensorFlowNodeConverterMap() {
       {"Sqrt", ConvertSimpleOperator<TensorFlowSqrtOperator, 1>},
       {"Square", ConvertSimpleOperator<TensorFlowSquareOperator, 1>},
       {"Squeeze", ConvertSqueezeOperator},
-      {"Stack", ConvertStackOperator},
       {"StopGradient", ConvertIdentityOperator},
       {"StridedSlice", ConvertStridedSliceOperator},
       {"Sub", ConvertSimpleOperator<SubOperator, 2>},
-      {"Sum", ConvertSumOperator},
+      {"Sum", ConvertReduceOperator<TensorFlowSumOperator>},
       {"Svdf", ConvertSvdfOperator},
       {"Switch", ConvertSwitchOperator},
       {"Tanh", ConvertSimpleOperator<TanhOperator, 1>},
diff --git a/tensorflow/contrib/lite/toco/model.h b/tensorflow/contrib/lite/toco/model.h
index 3a1d243f87..cd263930f5 100644
--- a/tensorflow/contrib/lite/toco/model.h
+++ b/tensorflow/contrib/lite/toco/model.h
@@ -23,6 +23,7 @@ limitations under the License.
 #include <unordered_map>
 #include <vector>
 
+#include "absl/types/optional.h"
 #include "tensorflow/contrib/lite/toco/model_flags.pb.h"
 #include "tensorflow/contrib/lite/toco/runtime/types.h"
 #include "tensorflow/contrib/lite/toco/toco_port.h"
@@ -63,6 +64,7 @@ enum class OperatorType : uint8 {
   kMaxPool,
   kFakeQuant,
   kMul,
+  kOneHot,
   kRandomUniform,
   kRange,
   kRank,
@@ -81,10 +83,11 @@ enum class OperatorType : uint8 {
   kResizeBilinear,
   kSin,
   kSpaceToBatchND,
-  kStack,
+  kPack,
   kBatchToSpaceND,
   kPad,
   kPadV2,
+  kReduceProd,  // Reduction product
   kStridedSlice,
   kSlice,
   kSqueeze,
@@ -106,10 +109,10 @@ enum class OperatorType : uint8 {
   kIdentity,
   kLess,
   kLessEqual,
-  kMax,      //  Reduction Max
-  kMaximum,  //  Element-wise Maximum
-  kMin,      //  Reduction Min
-  kMinimum,  //  Element-wise Minimum
+  kReduceMax,  //  Reduction Max
+  kMaximum,    //  Element-wise Maximum
+  kReduceMin,  //  Reduction Min
+  kMinimum,    //  Element-wise Minimum
   kMatMul,
   kMerge,
   kNeg,
@@ -140,6 +143,12 @@ enum class OperatorType : uint8 {
   kEqual,
   kNotEqual,
   kPow,
+  kArgMin,
+  kAny,
+  kLogicalAnd,
+  kLogicalNot,
+  kLogicalOr,
+  kCTCBeamSearchDecoder,
 };
 
 // Helper to deal with TensorFlow arrays using a different ordering of
@@ -286,6 +295,46 @@ struct Buffer : GenericBuffer {
   std::vector<DataType<A>> data;
 };
 
+class Shape {
+ public:
+  // For Shape, we stick to half-way encapsulation for now:
+  // we hide the raw dims_ member, but expose it raw by accessors
+  // because from some brainstorming, it's not at all easy to
+  // anticipate which flavor of more hermetic encapsulation would
+  // actually buy us future-proof-ness without being needlessly
+  // cumbersome.
+  Shape() {}
+  Shape(std::initializer_list<int> dim_list) : dims_(dim_list) {}
+
+  void ReplaceDims(std::initializer_list<int> dim_list) {
+    dims_ = std::vector<int>(dim_list);
+  }
+
+  const std::vector<int>& dims() const { return dims_; }
+  std::vector<int>* mutable_dims() { return &dims_; }
+  const int dimensions_count() const { return dims_.size(); }
+
+  // We still have that one convenience accessor to avoid
+  // the awkward double bracket issue:  shape.dims()[i].
+  int dims(int i) const {
+    // Always check for out-of-bounds accesses, even in optimized builds where
+    // standard assertions are disabled. Out-of-bounds access here is a common
+    // occurrence.
+    CHECK_GE(i, 0);
+    CHECK_GT(dims_.size(), i);
+    return dims_[i];
+  }
+
+  bool operator==(const Shape& comp) const {
+    return (this->dims_ == comp.dims());
+  }
+
+  bool operator!=(const Shape& comp) const { return !((*this) == comp); }
+
+ private:
+  std::vector<int> dims_;
+};
+
 // Base class for all operator classes.
 struct Operator {
   // Non-default-constructible: only OperatorType-specific subclass
@@ -390,6 +439,28 @@ struct ConvOperator : Operator {
   int dilation_height_factor = 1;
 };
 
+// CTCBeamSearchDecoder operator:
+//
+// Inputs:
+//   inputs[0]: required: the logits.
+//   inputs[1]: required: sequence length.
+//   inputs[2]: optional: beam width.
+//   inputs[3]: optional: top paths.
+//   inputs[4]: optional: merge repeated.
+//
+//  Outputs:
+//    outputs[0]: deocoded.
+//    outputs[1]: log probability.
+//
+// TensorFlow equivalent: CTCBeamSearchDecoder
+struct CTCBeamSearchDecoderOperator : Operator {
+  CTCBeamSearchDecoderOperator()
+      : Operator(OperatorType::kCTCBeamSearchDecoder) {}
+  int beam_width;
+  int top_paths;
+  bool merge_repeated = true;
+};
+
 // Depthwise-separable convolution operator.
 //
 // Inputs:
@@ -790,6 +861,7 @@ struct FakeQuantOperator : Operator {
   FakeQuantOperator() : Operator(OperatorType::kFakeQuant) {}
   std::unique_ptr<MinMax> minmax;
   int num_bits = 8;
+  bool narrow_range = false;
 };
 
 // Element-wise division operator.
@@ -1154,10 +1226,12 @@ struct TensorFlowRsqrtOperator : Operator {
 // Inputs: this operator accepts any number >= 1 of inputs.
 //   inputs[i]: the i-th array to merge.
 //
-// TensorFlow equivalent: Stack or Pack
-struct StackOperator : Operator {
-  StackOperator() : Operator(OperatorType::kStack) {}
+// TensorFlow equivalent: Pack
+struct PackOperator : Operator {
+  PackOperator() : Operator(OperatorType::kPack) {}
+  int values_count;
   int axis = 0;
+  ArrayDataType dtype = ArrayDataType::kNone;
 };
 
 // Shape operator. Extracts the shape of the tensor.
@@ -1227,6 +1301,19 @@ struct SubOperator : Operator {
 // TensorFlow equivalent: Sum
 struct TensorFlowSumOperator : Operator {
   TensorFlowSumOperator() : Operator(OperatorType::kSum) {}
+  std::vector<int> axis;
+  bool keep_dims = false;
+};
+
+// Prod reduction: computes the product of all of entries across the axes.
+//
+// Inputs:
+//   inputs[0]: required: the input array
+//
+// TensorFlow equivalent: Prod
+struct TensorFlowProdOperator : Operator {
+  TensorFlowProdOperator() : Operator(OperatorType::kReduceProd) {}
+  std::vector<int> axis;
   bool keep_dims = false;
 };
 
@@ -1386,29 +1473,27 @@ struct TensorFlowNotEqualOperator : Operator {
   TensorFlowNotEqualOperator() : Operator(OperatorType::kNotEqual) {}
 };
 
-// Global max reduction: computes the max of all of entries in the input array.
-// Thus the output is "0-dimensional": it consists of a single scalar value.
+// Max reduction: computes the max of all of entries across the axes.
 //
 // Inputs:
 //   inputs[0]: required: the input array
 //
-// TensorFlow equivalent: Max --- except that we only support the special case
-// of global reduction across all dimensions.
+// TensorFlow equivalent: Max
 struct TensorFlowMaxOperator : Operator {
-  TensorFlowMaxOperator() : Operator(OperatorType::kMax) {}
+  TensorFlowMaxOperator() : Operator(OperatorType::kReduceMax) {}
+  std::vector<int> axis;
   bool keep_dims = false;
 };
 
-// Global min reduction: computes the min of all of entries in the input array.
-// Thus the output is "0-dimensional": it consists of a single scalar value.
+// Min reduction: computes the min of all of entries across the axes.
 //
 // Inputs:
 //   inputs[0]: required: the input array
 //
-// TensorFlow equivalent: Min --- except that we only support the special case
-// of global reduction across all dimensions.
+// TensorFlow equivalent: Min
 struct TensorFlowMinOperator : Operator {
-  TensorFlowMinOperator() : Operator(OperatorType::kMin) {}
+  TensorFlowMinOperator() : Operator(OperatorType::kReduceMin) {}
+  std::vector<int> axis;
   bool keep_dims = false;
 };
 
@@ -1449,6 +1534,8 @@ struct TensorFlowUnsupportedOperator : Operator {
   bool quantized = false;
   // Output data types
   std::vector<ArrayDataType> output_data_types;
+  // Output shapes.
+  std::vector<Shape> output_shapes;
 };
 
 // Softmax activation function.
@@ -1509,11 +1596,15 @@ struct FloorOperator : Operator {
 // Inputs:
 //   inputs[0]: required: the params array
 //   inputs[1]: required: the indices to gather
+//   inputs[2]: optional: axis
 //
 // TensorFlow equivalent: Gather
 struct GatherOperator : Operator {
   GatherOperator() : Operator(OperatorType::kGather) {}
-  int axis = 0;
+  // Axis is populated explicitly or implicitly from the axis input by
+  // ResolveGatherAttributes. An empty axis indicates that the axis has not yet
+  // be resolved.
+  absl::optional<int> axis;
   int input_rank = 0;
 };
 
@@ -1528,6 +1619,17 @@ struct ArgMaxOperator : Operator {
   ArrayDataType output_data_type = ArrayDataType::kInt64;
 };
 
+// ArgMin operator. It returns the index of the minimum value along axis.
+//
+// Inputs:
+//   inputs[0]: required: the input tensor
+//
+// TensorFlow equivalent: ArgMin
+struct ArgMinOperator : Operator {
+  ArgMinOperator() : Operator(OperatorType::kArgMin) {}
+  ArrayDataType output_data_type = ArrayDataType::kInt64;
+};
+
 // ResizeBilinear operator. It resizes input images with bilinear interpolation.
 // It does not support align_corners at the moment.
 //
@@ -1658,6 +1760,71 @@ struct PowOperator : Operator {
   PowOperator() : Operator(OperatorType::kPow) {}
 };
 
+// Any operator:
+//
+// Inputs:
+// Inputs[0]: required: A boolean input tensor.
+// Inputs[1]: required: reduction_indices.
+//
+// TensorFlow equivalent: tf.reduce_any.
+struct AnyOperator : Operator {
+  AnyOperator() : Operator(OperatorType::kAny) {}
+  bool keep_dims = false;
+};
+
+// LogicalAnd operator:
+//
+// Inputs:
+// Inputs[0]: required: A boolean tensor.
+// Inputs[1]: required: A boolean tensor.
+//
+// TensorFlow equivalent: tf.logical_and.
+struct LogicalAndOperator : Operator {
+  LogicalAndOperator() : Operator(OperatorType::kLogicalAnd) {}
+};
+
+// LogicalNot operator:
+//
+// Inputs:
+// Inputs[0]: required: A boolean tensor.
+//
+// TensorFlow equivalent: tf.logical_not.
+struct LogicalNotOperator : Operator {
+  LogicalNotOperator() : Operator(OperatorType::kLogicalNot) {}
+};
+
+// OneHot operator:
+//
+// Inputs:
+// Inputs[0]: required: indices.
+// Inputs[1]: required: depth.
+// Inputs[2]: required: on_value.
+// Inputs[3]: required: off_value.
+//
+// TensorFlow equivalent: OneHot.
+struct OneHotOperator : Operator {
+  enum Inputs {
+    INDICES_INPUT = 0,
+    DEPTH_INPUT = 1,
+    ON_VALUE_INPUT = 2,
+    OFF_VALUE_INPUT = 3,
+  };
+
+  OneHotOperator() : Operator(OperatorType::kOneHot) {}
+  int axis = -1;
+};
+
+// LogicalOr operator:
+//
+// Inputs:
+// Inputs[0]: required: A Bool tensor.
+// Inputs[1]: required: A Bool tensor.
+//
+// TensorFlow equivalent: LogicalOr.
+struct LogicalOrOperator : Operator {
+  LogicalOrOperator() : Operator(OperatorType::kLogicalOr) {}
+};
+
 // Alloc's are used for transient arrays only. An Alloc specifies which interval
 // of the "transient_data" workspace buffer passed to inference functions, is to
 // be used for the transient array at hand. The 'start' and 'end' values are
@@ -1671,46 +1838,6 @@ inline bool operator<(const Alloc& a, const Alloc& b) {
   return a.start < b.start;
 }
 
-class Shape {
- public:
-  // For Shape, we stick to half-way encapsulation for now:
-  // we hide the raw dims_ member, but expose it raw by accessors
-  // because from some brainstorming, it's not at all easy to
-  // anticipate which flavor of more hermetic encapsulation would
-  // actually buy us future-proof-ness without being needlessly
-  // cumbersome.
-  Shape() {}
-  Shape(std::initializer_list<int> dim_list) : dims_(dim_list) {}
-
-  void ReplaceDims(std::initializer_list<int> dim_list) {
-    dims_ = std::vector<int>(dim_list);
-  }
-
-  const std::vector<int>& dims() const { return dims_; }
-  std::vector<int>* mutable_dims() { return &dims_; }
-  const int dimensions_count() const { return dims_.size(); }
-
-  // We still have that one convenience accessor to avoid
-  // the awkward double bracket issue:  shape.dims()[i].
-  int dims(int i) const {
-    // Always check for out-of-bounds accesses, even in optimized builds where
-    // standard assertions are disabled. Out-of-bounds access here is a common
-    // occurrence.
-    CHECK_GE(i, 0);
-    CHECK_GT(dims_.size(), i);
-    return dims_[i];
-  }
-
-  bool operator==(const Shape& comp) const {
-    return (this->dims_ == comp.dims());
-  }
-
-  bool operator!=(const Shape& comp) const { return !((*this) == comp); }
-
- private:
-  std::vector<int> dims_;
-};
-
 // Array represents an array (either a constant parameter array or an
 // activations array) in a Model.
 struct Array {
@@ -1842,6 +1969,40 @@ struct Array {
   // If this is non-null, then these quantization parameters are to be used
   // to assign a meaning as real numbers to the elements of this array.
   std::unique_ptr<QuantizationParams> quantization_params;
+  // narrow_range is a detail of how toco handles FakeQuant operators with
+  // narrow_range, see
+  // https://www.tensorflow.org/api_docs/python/tf/fake_quant_with_min_max_vars
+  //
+  // For more context about what that is useful for, see the big comment in
+  // graph_transformations/ensure_uint8_weights_safe_for_fast_int8_kernels.cc
+  //
+  // The narrow_range flag applies only to quantized arrays, and changes
+  // their quantization in the following way when it is set to 'true':
+  // 1. The computation of {zero_point, scale} from {min, max} needs to be
+  //    amended so that the real min value will get quantized to
+  //    (min_quantized_value + 1) instead of just (min_quantized_value).
+  //    E.g. for uint8 quantization, the real min value should get quantized to
+  //    the uint8 value 1, not 0.
+  // 2. Quantized values should get clamped to the interval
+  //    [min_quantized_value + 1, max_value]. Equivalently, the
+  //    min_quantized_value should get nudged to (min_quantized_value + 1).
+  // The reason why 1. does not imply 2. is that real values may not belong to
+  // the stated [min, max] interval. Concretely, weights recorded at the last
+  // learning step may not fall in the [min, max] interval recorded over
+  // previous learning steps, as the values evolve across learning steps.
+  //
+  // Rationale why this is directly a field on Array:
+  // - This can't be just a field on FakeQuantOperator, because
+  //   FakeQuantOperators are gone (DropFakeQuant) before we get to using that
+  //   information (Quantize). We need a place to store that bit in the interim.
+  // - This can't be in QuantizationParams because we need to record this
+  //   ahead of quantization, and QuantizationParams are only created during
+  //   quantization.
+  // - This could be in MinMax, but that would be an abuse of what MinMax is
+  //   about, and would break existing code that assumes that a MinMax is just
+  //   a min and a max. Unlike MinMax which is agnostic as to the quantized
+  //   data type, narrow_range refers to values in the quantized data type.
+  bool narrow_range = false;
 
  private:
   std::unique_ptr<Shape> array_shape;
diff --git a/tensorflow/contrib/lite/toco/model_cmdline_flags.cc b/tensorflow/contrib/lite/toco/model_cmdline_flags.cc
index 06072d1fcb..d34da63e43 100644
--- a/tensorflow/contrib/lite/toco/model_cmdline_flags.cc
+++ b/tensorflow/contrib/lite/toco/model_cmdline_flags.cc
@@ -322,6 +322,10 @@ void ReadModelFlagsFromCommandLineFlags(
     for (int i = 0; i < input_shapes.size(); ++i) {
       auto* shape = model_flags->mutable_input_arrays(i)->mutable_shape();
       shape->clear_dims();
+      // Treat an empty input shape as a scalar.
+      if (input_shapes[i].empty()) {
+        continue;
+      }
       for (const auto& dim_str : absl::StrSplit(input_shapes[i], ',')) {
         int size;
         CHECK(absl::SimpleAtoi(dim_str, &size))
diff --git a/tensorflow/contrib/lite/toco/python/BUILD b/tensorflow/contrib/lite/toco/python/BUILD
index 93fe756a55..33c5b16462 100644
--- a/tensorflow/contrib/lite/toco/python/BUILD
+++ b/tensorflow/contrib/lite/toco/python/BUILD
@@ -53,5 +53,8 @@ tf_py_test(
     data = [
         ":toco_from_protos",
     ],
-    tags = ["no_pip"],
+    tags = [
+        "no_oss",
+        "no_pip",
+    ],
 )
diff --git a/tensorflow/contrib/lite/toco/tensorflow_graph_matching/BUILD b/tensorflow/contrib/lite/toco/tensorflow_graph_matching/BUILD
index 336e94de1e..ea1fc2827e 100644
--- a/tensorflow/contrib/lite/toco/tensorflow_graph_matching/BUILD
+++ b/tensorflow/contrib/lite/toco/tensorflow_graph_matching/BUILD
@@ -60,6 +60,7 @@ cc_library(
 tf_cc_test(
     name = "resolve_svdf_test",
     srcs = ["resolve_svdf_test.cc"],
+    tags = ["no_oss"],
     deps = [
         ":cluster",
         ":cluster_utils",
diff --git a/tensorflow/contrib/lite/toco/tflite/BUILD b/tensorflow/contrib/lite/toco/tflite/BUILD
index a02f90988b..83e977d7b3 100644
--- a/tensorflow/contrib/lite/toco/tflite/BUILD
+++ b/tensorflow/contrib/lite/toco/tflite/BUILD
@@ -37,6 +37,7 @@ tf_cc_test(
     srcs = [
         "operator_test.cc",
     ],
+    tags = ["no_oss"],
     deps = [
         ":operator",
         "//tensorflow/contrib/lite/toco:tooling_util",
@@ -66,6 +67,7 @@ tf_cc_test(
     srcs = [
         "types_test.cc",
     ],
+    tags = ["no_oss"],
     deps = [
         ":types",
         "@com_google_googletest//:gtest_main",
@@ -98,6 +100,7 @@ tf_cc_test(
     srcs = [
         "export_test.cc",
     ],
+    tags = ["no_oss"],
     deps = [
         ":export",
         "//tensorflow/contrib/lite/schema:schema_fbs",
@@ -131,6 +134,7 @@ tf_cc_test(
     srcs = [
         "import_test.cc",
     ],
+    tags = ["no_oss"],
     deps = [
         ":import",
         "//tensorflow/contrib/lite:schema_fbs_version",
diff --git a/tensorflow/contrib/lite/toco/tflite/export.cc b/tensorflow/contrib/lite/toco/tflite/export.cc
index 1972246807..5ad307af14 100644
--- a/tensorflow/contrib/lite/toco/tflite/export.cc
+++ b/tensorflow/contrib/lite/toco/tflite/export.cc
@@ -336,17 +336,13 @@ void Export(
   auto op_codes = ExportOperatorCodes(model, ops_by_type, operators_map,
                                       &builder, &error_summary);
 
-  const string fake_quant_operation_name = "FAKE_QUANT";
-
-  if (error_summary.count(fake_quant_operation_name) != 0) {
-    LOG(ERROR)
-        << fake_quant_operation_name
-        << " operation was not converted. If running quantized make sure you "
-           "are passing --inference_type=QUANTIZED_UINT8 and values for "
-           "--std_values and --mean_values.";
-    // Remove the fake quant operation from the errors, since it shouldn't
-    // be provided a custom implementation.
-    error_summary.erase(fake_quant_operation_name);
+  for (const auto& op : model.operators) {
+    if (op->type == OperatorType::kFakeQuant) {
+      LOG(WARNING) << "FAKE_QUANT operation " << LogName(*op)
+                   << " was not converted. If running quantized make sure you "
+                      "are passing --inference_type=QUANTIZED_UINT8 and values "
+                      "for --std_values and --mean_values.";
+    }
   }
   if (!allow_custom_ops && !error_summary.empty()) {
     // Remove ExpandDims and ReorderAxes from unimplemented list unless they
diff --git a/tensorflow/contrib/lite/toco/tflite/export_test.cc b/tensorflow/contrib/lite/toco/tflite/export_test.cc
index d1fdbcb8e9..a95937ba0f 100644
--- a/tensorflow/contrib/lite/toco/tflite/export_test.cc
+++ b/tensorflow/contrib/lite/toco/tflite/export_test.cc
@@ -262,7 +262,7 @@ TEST_F(VersionedOpExportTest, Export) {
   EXPECT_EQ(1, (*operators)[1]->opcode_index());
 }
 
-// TODO(ahentz): tests for tensors, inputs, outpus, opcodes and operators.
+// TODO(ahentz): tests for tensors, inputs, outputs, opcodes and operators.
 
 }  // namespace
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/toco/tflite/operator.cc b/tensorflow/contrib/lite/toco/tflite/operator.cc
index 7e55ae92bd..b1cd74794c 100644
--- a/tensorflow/contrib/lite/toco/tflite/operator.cc
+++ b/tensorflow/contrib/lite/toco/tflite/operator.cc
@@ -282,25 +282,31 @@ class DepthToSpace : public CustomOperator<DepthToSpaceOperator> {
   int GetVersion(const Operator& op) const override { return 1; }
 };
 
-class FakeQuant : public CustomOperator<FakeQuantOperator> {
+class FakeQuant
+    : public BuiltinOperator<FakeQuantOperator, ::tflite::FakeQuantOptions,
+                             ::tflite::BuiltinOptions_FakeQuantOptions> {
  public:
-  using CustomOperator::CustomOperator;
-  void WriteOptions(const TocoOperator& op,
-                    flexbuffers::Builder* fbb) const override {
-    fbb->Float("min", op.minmax->min);
-    fbb->Float("max", op.minmax->max);
-    fbb->Int("num_bits", op.num_bits);
+  using BuiltinOperator::BuiltinOperator;
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateFakeQuantOptions(
+        *builder, op.minmax->min, op.minmax->max, op.num_bits, op.narrow_range);
   }
-  void ReadOptions(const flexbuffers::Map& m, TocoOperator* op) const override {
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {
     auto* minmax = new MinMax;
-    minmax->min = m["min"].AsFloat();
-    minmax->max = m["max"].AsFloat();
+    minmax->min = options.min();
+    minmax->max = options.max();
     op->minmax.reset(minmax);
-    const auto& num_bits = m["num_bits"];
-    op->num_bits = num_bits.IsInt() ? num_bits.AsInt32() : 8;
+    op->num_bits = options.num_bits();
+    op->narrow_range = options.narrow_range();
   }
 
-  int GetVersion(const Operator& op) const override { return 1; }
+  int GetVersion(const Operator& op) const override {
+    const auto& fq_op = static_cast<const FakeQuantOperator&>(op);
+    return fq_op.narrow_range ? 2 : 1;
+  }
 };
 
 class FullyConnected
@@ -364,12 +370,13 @@ class Gather : public BuiltinOperator<GatherOperator, ::tflite::GatherOptions,
   flatbuffers::Offset<TfLiteOptions> WriteOptions(
       const TocoOperator& op,
       flatbuffers::FlatBufferBuilder* builder) const override {
-    return ::tflite::CreateGatherOptions(*builder, op.axis);
+    int axis = op.axis ? op.axis.value() : 0;
+    return ::tflite::CreateGatherOptions(*builder, axis);
   }
 
   void ReadOptions(const TfLiteOptions& options,
                    TocoOperator* op) const override {
-    op->axis = options.axis();
+    op->axis = {options.axis()};
   }
 
   int GetVersion(const Operator& op) const override { return 1; }
@@ -761,6 +768,44 @@ class Sum
   int GetVersion(const Operator& op) const override { return 1; }
 };
 
+class ReduceMax
+    : public BuiltinOperator<TensorFlowSumOperator, ::tflite::ReducerOptions,
+                             ::tflite::BuiltinOptions_ReducerOptions> {
+ public:
+  using BuiltinOperator::BuiltinOperator;
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateReducerOptions(*builder, op.keep_dims);
+  }
+
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {
+    op->keep_dims = options.keep_dims();
+  }
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
+class ReduceProd
+    : public BuiltinOperator<TensorFlowSumOperator, ::tflite::ReducerOptions,
+                             ::tflite::BuiltinOptions_ReducerOptions> {
+ public:
+  using BuiltinOperator::BuiltinOperator;
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateReducerOptions(*builder, op.keep_dims);
+  }
+
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {
+    op->keep_dims = options.keep_dims();
+  }
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
 class ResizeBilinear
     : public BuiltinOperator<ResizeBilinearOperator,
                              ::tflite::ResizeBilinearOptions,
@@ -885,6 +930,25 @@ class ArgMax : public BuiltinOperator<ArgMaxOperator, ::tflite::ArgMaxOptions,
   int GetVersion(const Operator& op) const override { return 1; }
 };
 
+class ArgMin : public BuiltinOperator<ArgMinOperator, ::tflite::ArgMinOptions,
+                                      ::tflite::BuiltinOptions_ArgMinOptions> {
+ public:
+  using BuiltinOperator::BuiltinOperator;
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateArgMinOptions(
+        *builder, DataType::Serialize(op.output_data_type));
+  }
+
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {
+    op->output_data_type = DataType::Deserialize(options.output_type());
+  }
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
 class TransposeConv
     : public BuiltinOperator<TransposeConvOperator,
                              ::tflite::TransposeConvOptions,
@@ -949,6 +1013,26 @@ class ExpandDims
   int GetVersion(const Operator& op) const override { return 1; }
 };
 
+class Pack : public BuiltinOperator<PackOperator, ::tflite::PackOptions,
+                                    ::tflite::BuiltinOptions_PackOptions> {
+ public:
+  using BuiltinOperator::BuiltinOperator;
+
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreatePackOptions(*builder, op.values_count, op.axis);
+  }
+
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {
+    op->values_count = options.values_count();
+    op->axis = options.axis();
+  }
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
 class Shape
     : public BuiltinOperator<TensorFlowShapeOperator, ::tflite::ShapeOptions,
                              ::tflite::BuiltinOptions_ShapeOptions> {
@@ -969,6 +1053,44 @@ class Shape
   int GetVersion(const Operator& op) const override { return 1; }
 };
 
+class OneHot : public BuiltinOperator<OneHotOperator, ::tflite::OneHotOptions,
+                                      ::tflite::BuiltinOptions_OneHotOptions> {
+ public:
+  using BuiltinOperator::BuiltinOperator;
+  flatbuffers::Offset<TfLiteOptions> WriteOptions(
+      const TocoOperator& op,
+      flatbuffers::FlatBufferBuilder* builder) const override {
+    return ::tflite::CreateOneHotOptions(*builder, op.axis);
+  }
+  void ReadOptions(const TfLiteOptions& options,
+                   TocoOperator* op) const override {
+    op->axis = options.axis();
+  }
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
+class CTCBeamSearchDecoder
+    : public CustomOperator<CTCBeamSearchDecoderOperator> {
+ public:
+  using CustomOperator::CustomOperator;
+
+  void WriteOptions(const TocoOperator& op,
+                    flexbuffers::Builder* fbb) const override {
+    fbb->Int("beam_width", op.beam_width);
+    fbb->Int("top_paths", op.top_paths);
+    fbb->Bool("merge_repeated", op.merge_repeated);
+  }
+
+  void ReadOptions(const flexbuffers::Map& m, TocoOperator* op) const override {
+    op->beam_width = m["beam_width"].AsInt32();
+    op->top_paths = m["top_paths"].AsInt32();
+    op->merge_repeated = m["merge_repeated"].AsBool();
+  }
+
+  int GetVersion(const Operator& op) const override { return 1; }
+};
+
 class TensorFlowUnsupported : public BaseOperator {
  public:
   using BaseOperator::BaseOperator;
@@ -1158,6 +1280,10 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
   ops.emplace_back(
       new Mean(::tflite::BuiltinOperator_MEAN, OperatorType::kMean));
   ops.emplace_back(new Sum(::tflite::BuiltinOperator_SUM, OperatorType::kSum));
+  ops.emplace_back(new ReduceProd(::tflite::BuiltinOperator_REDUCE_PROD,
+                                  OperatorType::kReduceProd));
+  ops.emplace_back(new ReduceMax(::tflite::BuiltinOperator_REDUCE_MAX,
+                                 OperatorType::kReduceMax));
   ops.emplace_back(new ResizeBilinear(::tflite::BuiltinOperator_RESIZE_BILINEAR,
                                       OperatorType::kResizeBilinear));
   ops.emplace_back(
@@ -1175,6 +1301,8 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
   ops.emplace_back(
       new ArgMax(::tflite::BuiltinOperator_ARG_MAX, OperatorType::kArgMax));
   ops.emplace_back(
+      new ArgMin(::tflite::BuiltinOperator_ARG_MIN, OperatorType::kArgMin));
+  ops.emplace_back(
       new Tile(::tflite::BuiltinOperator_TILE, OperatorType::kTile));
   ops.emplace_back(new ExpandDims(::tflite::BuiltinOperator_EXPAND_DIMS,
                                   OperatorType::kExpandDims));
@@ -1184,11 +1312,18 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
                                      OperatorType::kSparseToDense));
   ops.emplace_back(
       new Shape(::tflite::BuiltinOperator_SHAPE, OperatorType::kShape));
+  ops.emplace_back(new FakeQuant(::tflite::BuiltinOperator_FAKE_QUANT,
+                                 OperatorType::kFakeQuant));
+  ops.emplace_back(
+      new Pack(::tflite::BuiltinOperator_PACK, OperatorType::kPack));
+  ops.emplace_back(
+      new OneHot(::tflite::BuiltinOperator_ONE_HOT, OperatorType::kOneHot));
 
   // Custom Operators.
   ops.emplace_back(
       new DepthToSpace("DEPTH_TO_SPACE", OperatorType::kDepthToSpace));
-  ops.emplace_back(new FakeQuant("FAKE_QUANT", OperatorType::kFakeQuant));
+  ops.emplace_back(new CTCBeamSearchDecoder(
+      "CTC_BEAM_SEARCH_DECODER", OperatorType::kCTCBeamSearchDecoder));
   ops.emplace_back(new TensorFlowUnsupported("TENSORFLOW_UNSUPPORTED",
                                              OperatorType::kUnsupported));
 
@@ -1238,6 +1373,8 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
   ops.emplace_back(
       new SimpleOperator<SliceOperator>("SLICE", OperatorType::kSlice));
   ops.emplace_back(new SimpleOperator<PowOperator>("POW", OperatorType::kPow));
+  ops.emplace_back(new SimpleOperator<LogicalOrOperator>(
+      "LOGICAL_OR", OperatorType::kLogicalOr));
   // Element-wise operator
   ops.emplace_back(new SimpleOperator<SinOperator>("SIN", OperatorType::kSin));
   ops.emplace_back(new SimpleOperator<LogOperator>("LOG", OperatorType::kLog));
diff --git a/tensorflow/contrib/lite/toco/tflite/operator_test.cc b/tensorflow/contrib/lite/toco/tflite/operator_test.cc
index 8b6808d3c7..12fdbbf214 100644
--- a/tensorflow/contrib/lite/toco/tflite/operator_test.cc
+++ b/tensorflow/contrib/lite/toco/tflite/operator_test.cc
@@ -127,6 +127,8 @@ TEST_F(OperatorTest, SimpleOperators) {
   CheckSimpleOperator<TensorFlowSqrtOperator>("SQRT", OperatorType::kSqrt);
   CheckSimpleOperator<TensorFlowRsqrtOperator>("RSQRT", OperatorType::kRsqrt);
   CheckSimpleOperator<PowOperator>("POW", OperatorType::kPow);
+  CheckSimpleOperator<LogicalOrOperator>("LOGICAL_OR",
+                                         OperatorType::kLogicalOr);
 }
 
 TEST_F(OperatorTest, BuiltinAdd) {
@@ -416,6 +418,13 @@ TEST_F(OperatorTest, BuiltinArgMax) {
   EXPECT_EQ(op.output_data_type, output_toco_op->output_data_type);
 }
 
+TEST_F(OperatorTest, BuiltinArgMin) {
+  ArgMinOperator op;
+  auto output_toco_op = SerializeAndDeserialize(
+      GetOperator("ARG_MIN", OperatorType::kArgMin), op);
+  EXPECT_EQ(op.output_data_type, output_toco_op->output_data_type);
+}
+
 TEST_F(OperatorTest, BuiltinTransposeConv) {
   TransposeConvOperator op;
   op.stride_width = 123;
@@ -445,6 +454,38 @@ TEST_F(OperatorTest, BuiltinSparseToDense) {
   EXPECT_EQ(op.validate_indices, output_toco_op->validate_indices);
 }
 
+TEST_F(OperatorTest, BuiltinPack) {
+  PackOperator op;
+  op.values_count = 3;
+  op.axis = 1;
+  std::unique_ptr<toco::PackOperator> output_toco_op =
+      SerializeAndDeserialize(GetOperator("PACK", OperatorType::kPack), op);
+  EXPECT_EQ(op.values_count, output_toco_op->values_count);
+  EXPECT_EQ(op.axis, output_toco_op->axis);
+}
+
+TEST_F(OperatorTest, BuiltinOneHot) {
+  OneHotOperator op;
+  op.axis = 2;
+  auto output_toco_op = SerializeAndDeserialize(
+      GetOperator("ONE_HOT", OperatorType::kOneHot), op);
+  EXPECT_EQ(op.axis, output_toco_op->axis);
+}
+
+TEST_F(OperatorTest, CustomCTCBeamSearchDecoder) {
+  CTCBeamSearchDecoderOperator op;
+  op.beam_width = 3;
+  op.top_paths = 2;
+  op.merge_repeated = false;
+  std::unique_ptr<toco::CTCBeamSearchDecoderOperator> output_toco_op =
+      SerializeAndDeserialize(GetOperator("CTC_BEAM_SEARCH_DECODER",
+                                          OperatorType::kCTCBeamSearchDecoder),
+                              op);
+  EXPECT_EQ(op.beam_width, output_toco_op->beam_width);
+  EXPECT_EQ(op.top_paths, output_toco_op->top_paths);
+  EXPECT_EQ(op.merge_repeated, output_toco_op->merge_repeated);
+}
+
 TEST_F(OperatorTest, TensorFlowUnsupported) {
   TensorFlowUnsupportedOperator op;
   op.tensorflow_op = "MyCustomUnsupportedOp";
diff --git a/tensorflow/contrib/lite/toco/toco_port.cc b/tensorflow/contrib/lite/toco/toco_port.cc
index de76fd4032..14168fa33f 100644
--- a/tensorflow/contrib/lite/toco/toco_port.cc
+++ b/tensorflow/contrib/lite/toco/toco_port.cc
@@ -38,7 +38,8 @@ void CopyToBuffer(const Cord& src, char* dest) { src.CopyToArray(dest); }
 }  // namespace port
 }  // namespace toco
 
-#if defined(PLATFORM_GOOGLE) && !defined(__APPLE__) && !defined(__ANDROID__)
+#if defined(PLATFORM_GOOGLE) && !defined(__APPLE__) && \
+    !defined(__ANDROID__) && !defined(_WIN32)
 
 // Wrap Google file operations.
 
@@ -115,9 +116,12 @@ string JoinPath(const string& a, const string& b) {
 }  // namespace port
 }  // namespace toco
 
-#else  // (__APPLE__ || __ANDROID__)
+#else  // !PLATFORM_GOOGLE || __APPLE__ || __ANDROID__ || _WIN32
 
 #include <fcntl.h>
+#if defined(_WIN32)
+#include <io.h>  // for _close, _open, _read
+#endif
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <unistd.h>
@@ -130,6 +134,19 @@ string JoinPath(const string& a, const string& b) {
 namespace toco {
 namespace port {
 
+#if defined(_WIN32)
+#define close _close
+#define open _open
+#define read _read
+#define O_RDONLY _O_RDONLY
+#define O_CREAT _O_CREAT
+#define O_WRONLY _O_WRONLY
+// Windows does not support the same set of file permissions as other platforms.
+constexpr int kFileCreateMode = _S_IREAD | _S_IWRITE;
+#else
+constexpr int kFileCreateMode = 0664;
+#endif  // _WIN32
+
 static bool port_initialized = false;
 
 void InitGoogle(const char* usage, int* argc, char*** argv, bool remove_flags) {
@@ -209,7 +226,7 @@ tensorflow::Status GetContents(const string& path, string* output,
 
 tensorflow::Status SetContents(const string& filename, const string& contents,
                                const file::Options& options) {
-  int fd = open(filename.c_str(), O_WRONLY | O_CREAT, 0664);
+  int fd = open(filename.c_str(), O_WRONLY | O_CREAT, kFileCreateMode);
   if (fd == -1) {
     return tensorflow::errors::Internal("can't open() for write");
   }
@@ -243,4 +260,4 @@ string JoinPath(const string& base, const string& filename) {
 }  // namespace port
 }  // namespace toco
 
-#endif  // (__APPLE || __ANDROID__)
+#endif  // !PLATFORM_GOOGLE || __APPLE || __ANDROID__ || _WIN32
diff --git a/tensorflow/contrib/lite/toco/toco_tooling.cc b/tensorflow/contrib/lite/toco/toco_tooling.cc
index fc1636831b..fcd3cbab07 100644
--- a/tensorflow/contrib/lite/toco/toco_tooling.cc
+++ b/tensorflow/contrib/lite/toco/toco_tooling.cc
@@ -55,7 +55,7 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new ConvertExpandDimsToReshape);
   transformations->Add(new ConvertSqueezeToReshape);
   transformations->Add(new ConvertTrivialAddNToAdd);
-  transformations->Add(new ConvertTrivialStackToReshape);
+  transformations->Add(new ConvertTrivialPackToReshape);
   transformations->Add(new ConvertTrivialTileToConcat);
   transformations->Add(new ConvertTrivialTransposeToReshape);
   transformations->Add(new ConvertReorderAxes);
@@ -86,11 +86,11 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new ResolveConstantBinaryOperator);
   transformations->Add(new ResolveConstantFill);
   transformations->Add(new ResolveConstantGather);
+  transformations->Add(new ResolveConstantPack);
   transformations->Add(new ResolveConstantRandomUniform);
   transformations->Add(new ResolveConstantRange);
   transformations->Add(new ResolveConstantReshape);
   transformations->Add(new ResolveConstantSlice);
-  transformations->Add(new ResolveConstantStack);
   transformations->Add(new ResolveConstantStridedSlice);
   transformations->Add(new ResolveConstantTranspose);
   transformations->Add(new ResolveConstantUnaryOperator);
@@ -105,17 +105,19 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new IdentifyRelu1);
   transformations->Add(new IdentifyPRelu);
   transformations->Add(new RemoveTrivialBinaryOperator);
-  transformations->Add(new ReadFakeQuantMinMax);
+  transformations->Add(new ResolveFakeQuantArgsFromVars);
+  transformations->Add(new ReadArrayMinmaxAndNarrowRangeFromFakeQuant);
   transformations->Add(new ResolveSpaceToBatchNDAttributes);
   transformations->Add(new ResolveBatchToSpaceNDAttributes);
   transformations->Add(new ResolvePadAttributes);
   transformations->Add(new ResolvePadV2Attributes);
   transformations->Add(new ResolveStridedSliceAttributes);
   transformations->Add(new ResolveSliceAttributes);
-  transformations->Add(new ResolveMeanAttributes);
+  transformations->Add(new ResolveReduceAttributes);
   transformations->Add(new ResolveConstantShapeOrRank);
   transformations->Add(new MakeInitialDequantizeOperator);
   transformations->Add(new UnpartitionEmbeddingLookup);
+  transformations->Add(new ResolveGatherAttributes);
 }
 
 bool SupportsQuantization(FileFormat format) {
@@ -273,13 +275,16 @@ void Transform(const TocoFlags& toco_flags, Model* model) {
       transformations.Add(new toco::MergeLstmCellInputs);
     }
   }
-  if (toco_flags.quantize_weights()) {
-    transformations.Add(new QuantizeWeights);
-  }
   transformations.Add(new ResolveConstantConcatenation);
   RunGraphTransformations(model, "general graph transformations",
                           transformations);
 
+  if (toco_flags.quantize_weights()) {
+    // Run the quantize weights transformation after batchnorms have been
+    // folded into the weights.
+    RunGraphTransformations(model, "quantize weights transformation",
+                            {new QuantizeWeights});
+  }
   if (quantize_output) {
     if (toco_flags.propagate_fake_quant_num_bits()) {
       RunGraphTransformations(model,
@@ -304,8 +309,9 @@ void Transform(const TocoFlags& toco_flags, Model* model) {
     // HardcodeMinMax to move changes through the graph as we make changes.
     auto propagate_default_min_max =
         absl::make_unique<PropagateDefaultMinMax>();
-    if (toco_flags.has_default_ranges_min() &&
-        toco_flags.has_default_ranges_max()) {
+    bool has_default_ranges_flag = (toco_flags.has_default_ranges_min() &&
+                                    toco_flags.has_default_ranges_max());
+    if (has_default_ranges_flag) {
       propagate_default_min_max->DefineTypeRange(
           ArrayDataType::kUint8, toco_flags.default_ranges_min(),
           toco_flags.default_ranges_max());
@@ -330,6 +336,8 @@ void Transform(const TocoFlags& toco_flags, Model* model) {
         new EnsureUint8WeightsSafeForFastInt8Kernels;
     ensure_safe_for_int8_kernels->set_allow_nudging_weights(
         toco_flags.allow_nudging_weights_to_use_fast_gemm_kernel());
+    ensure_safe_for_int8_kernels->set_has_default_ranges_flag(
+        has_default_ranges_flag);
     RunGraphTransformations(model, "quantization graph transformations",
                             {
                                 new RemoveTrivialQuantizedActivationFunc,
diff --git a/tensorflow/contrib/lite/toco/tooling_util.cc b/tensorflow/contrib/lite/toco/tooling_util.cc
index 01113506d0..80df09eb08 100644
--- a/tensorflow/contrib/lite/toco/tooling_util.cc
+++ b/tensorflow/contrib/lite/toco/tooling_util.cc
@@ -350,16 +350,17 @@ const char* OperatorTypeName(OperatorType type) {
     HANDLE_OPERATORTYPENAME_CASE(Less)
     HANDLE_OPERATORTYPENAME_CASE(LessEqual)
     HANDLE_OPERATORTYPENAME_CASE(MatMul)
-    HANDLE_OPERATORTYPENAME_CASE(Max)      //  Reduction Max
-    HANDLE_OPERATORTYPENAME_CASE(Maximum)  //  Element-wise Maximum
+    HANDLE_OPERATORTYPENAME_CASE(ReduceMax)  //  Reduction Max
+    HANDLE_OPERATORTYPENAME_CASE(Maximum)    //  Element-wise Maximum
     HANDLE_OPERATORTYPENAME_CASE(Merge)
-    HANDLE_OPERATORTYPENAME_CASE(Min)      //  Reduction Min
-    HANDLE_OPERATORTYPENAME_CASE(Minimum)  //  Element-wise Minimum
+    HANDLE_OPERATORTYPENAME_CASE(ReduceMin)  //  Reduction Min
+    HANDLE_OPERATORTYPENAME_CASE(Minimum)    //  Element-wise Minimum
     HANDLE_OPERATORTYPENAME_CASE(Neg)
+    HANDLE_OPERATORTYPENAME_CASE(OneHot)
+    HANDLE_OPERATORTYPENAME_CASE(Pack)
     HANDLE_OPERATORTYPENAME_CASE(Pad)
     HANDLE_OPERATORTYPENAME_CASE(PadV2)
     HANDLE_OPERATORTYPENAME_CASE(StridedSlice)
-    HANDLE_OPERATORTYPENAME_CASE(Stack)
     HANDLE_OPERATORTYPENAME_CASE(Range)
     HANDLE_OPERATORTYPENAME_CASE(Rank)
     HANDLE_OPERATORTYPENAME_CASE(Reshape)
@@ -385,8 +386,10 @@ const char* OperatorTypeName(OperatorType type) {
     HANDLE_OPERATORTYPENAME_CASE(SpaceToBatchND)
     HANDLE_OPERATORTYPENAME_CASE(BatchToSpaceND)
     HANDLE_OPERATORTYPENAME_CASE(Mean)
+    HANDLE_OPERATORTYPENAME_CASE(ReduceProd)
     HANDLE_OPERATORTYPENAME_CASE(Svdf)
     HANDLE_OPERATORTYPENAME_CASE(ArgMax)
+    HANDLE_OPERATORTYPENAME_CASE(ArgMin)
     HANDLE_OPERATORTYPENAME_CASE(TopK_V2)
     HANDLE_OPERATORTYPENAME_CASE(Unsupported)
     HANDLE_OPERATORTYPENAME_CASE(Exp)
@@ -397,6 +400,11 @@ const char* OperatorTypeName(OperatorType type) {
     HANDLE_OPERATORTYPENAME_CASE(Equal)
     HANDLE_OPERATORTYPENAME_CASE(NotEqual)
     HANDLE_OPERATORTYPENAME_CASE(Pow)
+    HANDLE_OPERATORTYPENAME_CASE(Any)
+    HANDLE_OPERATORTYPENAME_CASE(LogicalAnd)
+    HANDLE_OPERATORTYPENAME_CASE(LogicalNot)
+    HANDLE_OPERATORTYPENAME_CASE(LogicalOr)
+    HANDLE_OPERATORTYPENAME_CASE(CTCBeamSearchDecoder)
     default:
       LOG(FATAL) << "Unhandled op type";
 #undef HANDLE_OPERATORTYPENAME_CASE
@@ -938,8 +946,12 @@ void CheckEachArray(const Model& model) {
       // shape.
       CHECK(array->has_shape());
       // Constant buffer should has a valid shape.
-      for (int d : array->shape().dims()) {
-        CHECK_GE(d, 1);
+      bool is_scalar =
+          array->shape().dimensions_count() == 1 && array->shape().dims(0) == 0;
+      if (!is_scalar) {
+        for (int d : array->shape().dims()) {
+          CHECK_GE(d, 1);
+        }
       }
       // The shape flat-size should agree with the buffer length.
       CHECK_EQ(array->buffer->Length(),
@@ -1265,8 +1277,13 @@ void InsertCopyOperator(Model* model, const string& source_array_name,
   auto* copy_op = new TensorFlowReshapeOperator;
   copy_op->inputs = {
       source_array_name,
-      CreateInt32Array(model, target_array_name + "_copy_shape", shape)};
+      CreateInt32Array(
+          model, AvailableArrayName(*model, target_array_name + "_copy_shape"),
+          shape)};
   copy_op->outputs = {target_array_name};
+  if (target_array.has_shape()) {
+    copy_op->shape = target_array.shape().dims();
+  }
   model->operators.emplace_back(copy_op);
 }
 
@@ -1571,11 +1588,6 @@ void ResolveModelFlags(const ModelFlags& model_flags, Model* model) {
                                model);
   }
 
-  for (const auto& input_array : model->flags.input_arrays()) {
-    if (input_array.has_shape()) {
-      CHECK(input_array.shape().dims_size());
-    }
-  }
   model->flags.set_change_concat_input_ranges(
       model_flags.change_concat_input_ranges());
   model->flags.set_allow_nonascii_arrays(model_flags.allow_nonascii_arrays());
@@ -1608,11 +1620,12 @@ void CheckIsReadyForQuantization(const Model& model) {
           << "Array " << input << ", which is an input to the "
           << HelpfulOperatorTypeName(*op) << " operator producing the output "
           << "array " << op->outputs[0] << ", is lacking min/max data, "
-          << "which is necessary for quantization. Either target a "
-          << "non-quantized output format, or change the input graph to "
-          << "contain min/max information, or pass --default_ranges_min= and "
-          << "--default_ranges_max= if you do not care about the accuracy of "
-          << "results.";
+          << "which is necessary for quantization. If accuracy matters, either "
+          << "target a non-quantized output format, or run quantized training "
+          << "with your model from a floating point checkpoint to change the "
+          << "input graph to contain min/max information. If you don't care "
+          << "about accuracy, you can pass --default_ranges_min= and "
+          << "--default_ranges_max= for easy experimentation.";
     }
   }
 }
diff --git a/tensorflow/contrib/lite/toco/tooling_util_test.cc b/tensorflow/contrib/lite/toco/tooling_util_test.cc
index 8609e5bedd..eb495646a2 100644
--- a/tensorflow/contrib/lite/toco/tooling_util_test.cc
+++ b/tensorflow/contrib/lite/toco/tooling_util_test.cc
@@ -39,6 +39,8 @@ std::vector<ShapePair> CreateShapePairs() {
        {Shape({256, 256, 3}), Shape({256, 256, 3}), Agreement::kBroadcast},
        {Shape({256, 256, 3}), Shape({3}), Agreement::kBroadcast},
        {Shape({8, 1, 6, 1}), Shape({7, 1, 5}), Agreement::kBroadcast},
+       {Shape({}), Shape({3}), Agreement::kBroadcast},
+       {Shape({}), Shape({3, 1}), Agreement::kBroadcast},
 
        // These extend (and therefore broadcast).
        {Shape({3}), Shape({3}), Agreement::kExtend},
@@ -54,6 +56,7 @@ std::vector<ShapePair> CreateShapePairs() {
        {Shape({15, 3, 5}), Shape({15, 1, 5}), Agreement::kBroadcastNotExtend},
        {Shape({15, 3, 5}), Shape({3, 5}), Agreement::kBroadcastNotExtend},
        {Shape({15, 3, 5}), Shape({3, 1}), Agreement::kBroadcastNotExtend},
+       {Shape({3, 1}), Shape({}), Agreement::kBroadcastNotExtend},
 
        // These do not broadcast (and therefore also do not extend).
        {Shape({3}), Shape({4}), Agreement::kNeither},
@@ -175,6 +178,20 @@ TEST(NumElementsTest, UnsignedInt64) {
   EXPECT_EQ(status.error_message(), kLargeTensorMessage);
 }
 
+TEST(NumElementsTest, Scalar) {
+  tensorflow::Status status = tensorflow::Status::OK();
+
+  int32_t count;
+  status = NumElements(std::vector<int32_t>{}, &count);
+  EXPECT_TRUE(status.ok());
+  EXPECT_EQ(count, 1);
+
+  uint64_t countu64;
+  status = NumElements(std::vector<uint64_t>{}, &countu64);
+  EXPECT_TRUE(status.ok());
+  EXPECT_EQ(countu64, 1ULL);
+}
+
 TEST(FusedActivationTest, DefaultsToUnfused) {
   EXPECT_TRUE(OperatorSupportsFusedActivation(OperatorType::kAdd));
   EXPECT_FALSE(OperatorSupportsFusedActivation(OperatorType::kNone));
diff --git a/tensorflow/contrib/lite/tools/BUILD b/tensorflow/contrib/lite/tools/BUILD
index a3df37358f..0b26826403 100644
--- a/tensorflow/contrib/lite/tools/BUILD
+++ b/tensorflow/contrib/lite/tools/BUILD
@@ -14,6 +14,7 @@ py_binary(
     srcs = ["visualize.py"],
     data = [
         "//tensorflow/contrib/lite/schema:schema.fbs",
+        "//tensorflow/python:platform",
         "@flatbuffers//:flatc",
     ],
     srcs_version = "PY2AND3",
@@ -52,6 +53,7 @@ cc_test(
         "//tensorflow/contrib/lite:testdata/test_model_broken.bin",
     ],
     tags = [
+        "no_oss",
         "tflite_not_portable_android",
         "tflite_not_portable_ios",
     ],
@@ -78,6 +80,7 @@ cc_test(
     size = "small",
     srcs = ["verifier_test.cc"],
     tags = [
+        "no_oss",
         "tflite_not_portable",
     ],
     deps = [
diff --git a/tensorflow/contrib/lite/tools/benchmark/BUILD b/tensorflow/contrib/lite/tools/benchmark/BUILD
index 183a545295..2cb07eb6ec 100644
--- a/tensorflow/contrib/lite/tools/benchmark/BUILD
+++ b/tensorflow/contrib/lite/tools/benchmark/BUILD
@@ -10,11 +10,16 @@ load("//tensorflow/contrib/lite:build_def.bzl", "tflite_copts")
 
 common_copts = ["-Wall"] + tflite_copts()
 
+cc_library(
+    name = "logging",
+    hdrs = ["logging.h"],
+    copts = common_copts,
+)
+
 cc_binary(
     name = "benchmark_model",
     srcs = [
         "benchmark_main.cc",
-        "logging.h",
     ],
     copts = common_copts,
     linkopts = tflite_linkopts() + select({
@@ -26,6 +31,26 @@ cc_binary(
     }),
     deps = [
         ":benchmark_tflite_model_lib",
+        ":logging",
+    ],
+)
+
+cc_test(
+    name = "benchmark_test",
+    srcs = ["benchmark_test.cc"],
+    args = [
+        "--graph=$(location //tensorflow/contrib/lite:testdata/multi_add.bin)",
+    ],
+    data = ["//tensorflow/contrib/lite:testdata/multi_add.bin"],
+    tags = [
+        "tflite_not_portable_android",
+        "tflite_not_portable_ios",
+    ],
+    deps = [
+        ":benchmark_tflite_model_lib",
+        ":command_line_flags",
+        "//tensorflow/contrib/lite/testing:util",
+        "@com_google_googletest//:gtest",
     ],
 )
 
@@ -58,6 +83,7 @@ cc_library(
     copts = common_copts,
     deps = [
         ":benchmark_model_lib",
+        ":logging",
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite:string_util",
         "//tensorflow/contrib/lite/kernels:builtin_ops",
@@ -70,23 +96,23 @@ cc_library(
     name = "benchmark_params",
     srcs = [
         "benchmark_params.cc",
-        "logging.h",
     ],
     hdrs = ["benchmark_params.h"],
     copts = common_copts,
+    deps = [":logging"],
 )
 
 cc_library(
     name = "benchmark_model_lib",
     srcs = [
         "benchmark_model.cc",
-        "logging.h",
     ],
     hdrs = ["benchmark_model.h"],
     copts = common_copts,
     deps = [
         ":benchmark_params",
         ":command_line_flags",
+        ":logging",
         "//tensorflow/contrib/lite:framework",
         "//tensorflow/contrib/lite:string_util",
         "//tensorflow/contrib/lite/kernels:builtin_ops",
diff --git a/tensorflow/contrib/lite/tools/benchmark/README.md b/tensorflow/contrib/lite/tools/benchmark/README.md
index 93769305bd..f1e257ad10 100644
--- a/tensorflow/contrib/lite/tools/benchmark/README.md
+++ b/tensorflow/contrib/lite/tools/benchmark/README.md
@@ -115,7 +115,7 @@ E.g. for running the benchmark on big cores on Pixel 2 with a single thread one
 can use the following command:
 
 ```
-adb shell tasket f0 /data/local/tmp/benchmark_model \
+adb shell taskset f0 /data/local/tmp/benchmark_model \
   --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
   --input_layer="input" \
   --input_layer_shape="1,224,224,3" \
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_model.cc b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.cc
index 08648bcfe2..f86c0445b0 100644
--- a/tensorflow/contrib/lite/tools/benchmark/benchmark_model.cc
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.cc
@@ -84,7 +84,7 @@ std::vector<Flag> BenchmarkModel::GetFlags() {
   };
 }
 
-void BenchmarkModel::LogFlags() {
+void BenchmarkModel::LogParams() {
   TFLITE_LOG(INFO) << "Num runs: [" << params_.Get<int32_t>("num_runs") << "]";
   TFLITE_LOG(INFO) << "Inter-run delay (seconds): ["
                    << params_.Get<float>("run_delay") << "]";
@@ -98,10 +98,13 @@ void BenchmarkModel::LogFlags() {
                    << "]";
 }
 
+void BenchmarkModel::PrepareInputsAndOutputs() {}
+
 Stat<int64_t> BenchmarkModel::Run(int num_times, RunType run_type) {
   Stat<int64_t> run_stats;
   TFLITE_LOG(INFO) << "Running benchmark for " << num_times << " iterations ";
   for (int run = 0; run < num_times; run++) {
+    PrepareInputsAndOutputs();
     listeners_.OnSingleRunStart(run_type);
     int64_t start_us = profiling::time::NowMicros();
     RunImpl();
@@ -119,12 +122,18 @@ Stat<int64_t> BenchmarkModel::Run(int num_times, RunType run_type) {
   return run_stats;
 }
 
+bool BenchmarkModel::ValidateParams() { return true; }
+
 void BenchmarkModel::Run(int argc, char **argv) {
   if (!ParseFlags(argc, argv)) {
     return;
   }
+  Run();
+}
 
-  LogFlags();
+void BenchmarkModel::Run() {
+  ValidateParams();
+  LogParams();
 
   listeners_.OnBenchmarkStart(params_);
   int64_t initialization_start_us = profiling::time::NowMicros();
@@ -152,7 +161,7 @@ bool BenchmarkModel::ParseFlags(int argc, char **argv) {
     TFLITE_LOG(ERROR) << usage;
     return false;
   }
-  return ValidateFlags();
+  return true;
 }
 
 }  // namespace benchmark
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_model.h b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.h
index 942e21f67a..677a1ee68c 100644
--- a/tensorflow/contrib/lite/tools/benchmark/benchmark_model.h
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_model.h
@@ -137,19 +137,21 @@ class BenchmarkModel {
   BenchmarkModel();
   BenchmarkModel(BenchmarkParams params) : params_(std::move(params)) {}
   virtual ~BenchmarkModel() {}
-  bool ParseFlags(int argc, char** argv);
   virtual void Init() = 0;
   void Run(int argc, char** argv);
+  virtual void Run();
   void AddListener(BenchmarkListener* listener) {
     listeners_.AddListener(listener);
   }
 
  protected:
-  virtual void LogFlags();
-  virtual bool ValidateFlags() { return true; }
+  virtual void LogParams();
+  virtual bool ValidateParams();
+  bool ParseFlags(int argc, char** argv);
   virtual std::vector<Flag> GetFlags();
   virtual uint64_t ComputeInputBytes() = 0;
   virtual tensorflow::Stat<int64_t> Run(int num_times, RunType run_type);
+  virtual void PrepareInputsAndOutputs();
   virtual void RunImpl() = 0;
   BenchmarkParams params_;
   BenchmarkListeners listeners_;
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_params.h b/tensorflow/contrib/lite/tools/benchmark/benchmark_params.h
index 33448dd162..c98f47bb0d 100644
--- a/tensorflow/contrib/lite/tools/benchmark/benchmark_params.h
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_params.h
@@ -31,6 +31,8 @@ class TypedBenchmarkParam;
 class BenchmarkParam {
  protected:
   enum class ParamType { TYPE_INT32, TYPE_FLOAT, TYPE_BOOL, TYPE_STRING };
+  template <typename T>
+  static ParamType GetValueType();
 
  public:
   template <typename T>
@@ -49,8 +51,6 @@ class BenchmarkParam {
 
  private:
   static void AssertHasSameType(ParamType a, ParamType b);
-  template <typename T>
-  static ParamType GetValueType();
 
   const ParamType type_;
 };
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_test.cc b/tensorflow/contrib/lite/tools/benchmark/benchmark_test.cc
new file mode 100644
index 0000000000..b697bb394d
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_test.cc
@@ -0,0 +1,74 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <iostream>
+#include <string>
+#include <vector>
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/testing/util.h"
+#include "tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h"
+#include "tensorflow/contrib/lite/tools/benchmark/command_line_flags.h"
+
+namespace {
+const std::string* g_model_path = nullptr;
+}
+
+namespace tflite {
+namespace benchmark {
+namespace {
+
+BenchmarkParams CreateParams() {
+  BenchmarkParams params;
+  params.AddParam("num_runs", BenchmarkParam::Create<int32_t>(2));
+  params.AddParam("run_delay", BenchmarkParam::Create<float>(-1.0f));
+  params.AddParam("num_threads", BenchmarkParam::Create<int32_t>(1));
+  params.AddParam("benchmark_name", BenchmarkParam::Create<std::string>(""));
+  params.AddParam("output_prefix", BenchmarkParam::Create<std::string>(""));
+  params.AddParam("warmup_runs", BenchmarkParam::Create<int32_t>(1));
+  params.AddParam("graph", BenchmarkParam::Create<std::string>(*g_model_path));
+  params.AddParam("input_layer", BenchmarkParam::Create<std::string>(""));
+  params.AddParam("input_layer_shape", BenchmarkParam::Create<std::string>(""));
+  params.AddParam("use_nnapi", BenchmarkParam::Create<bool>(false));
+  return params;
+}
+
+TEST(BenchmarkTest, DoesntCrash) {
+  ASSERT_THAT(g_model_path, testing::NotNull());
+
+  BenchmarkTfLiteModel benchmark(CreateParams());
+  benchmark.Run();
+}
+
+}  // namespace
+}  // namespace benchmark
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  std::string model_path;
+  std::vector<tflite::Flag> flags = {
+      tflite::Flag::CreateFlag("graph", &model_path, "Path to model file.")};
+  g_model_path = &model_path;
+  const bool parse_result =
+      tflite::Flags::Parse(&argc, const_cast<const char**>(argv), flags);
+  if (!parse_result) {
+    std::cerr << tflite::Flags::Usage(argv[0], flags);
+    return 1;
+  }
+
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.cc b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.cc
index 73affc26b0..7f97f5d0cd 100644
--- a/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.cc
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.cc
@@ -198,8 +198,8 @@ std::vector<Flag> BenchmarkTfLiteModel::GetFlags() {
   return flags;
 }
 
-void BenchmarkTfLiteModel::LogFlags() {
-  BenchmarkModel::LogFlags();
+void BenchmarkTfLiteModel::LogParams() {
+  BenchmarkModel::LogParams();
   TFLITE_LOG(INFO) << "Graph: [" << params_.Get<std::string>("graph") << "]";
   TFLITE_LOG(INFO) << "Input layers: ["
                    << params_.Get<std::string>("input_layer") << "]";
@@ -208,7 +208,7 @@ void BenchmarkTfLiteModel::LogFlags() {
   TFLITE_LOG(INFO) << "Use nnapi : [" << params_.Get<bool>("use_nnapi") << "]";
 }
 
-bool BenchmarkTfLiteModel::ValidateFlags() {
+bool BenchmarkTfLiteModel::ValidateParams() {
   if (params_.Get<std::string>("graph").empty()) {
     TFLITE_LOG(ERROR)
         << "Please specify the name of your TF Lite input file with --graph";
diff --git a/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h
index 50cc3f24b3..9931dcbafe 100644
--- a/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h
+++ b/tensorflow/contrib/lite/tools/benchmark/benchmark_tflite_model.h
@@ -54,8 +54,8 @@ class BenchmarkTfLiteModel : public BenchmarkModel {
   BenchmarkTfLiteModel(BenchmarkParams params);
 
   std::vector<Flag> GetFlags() override;
-  void LogFlags() override;
-  bool ValidateFlags() override;
+  void LogParams() override;
+  bool ValidateParams() override;
   uint64_t ComputeInputBytes() override;
   void Init() override;
   void RunImpl() override;
diff --git a/tensorflow/contrib/lite/tools/visualize.py b/tensorflow/contrib/lite/tools/visualize.py
index f571dd59da..e07f899e4d 100644
--- a/tensorflow/contrib/lite/tools/visualize.py
+++ b/tensorflow/contrib/lite/tools/visualize.py
@@ -28,11 +28,24 @@ import json
 import os
 import sys
 
+from tensorflow.python.platform import resource_loader
+
 # Schema to use for flatbuffers
 _SCHEMA = "third_party/tensorflow/contrib/lite/schema/schema.fbs"
 
-# Where the binary will be once built in for the flatc converter
-_BINARY = "third_party/flatbuffers/flatc"
+# TODO(angerson): fix later when rules are simplified..
+_SCHEMA = resource_loader.get_path_to_datafile("../schema/schema.fbs")
+_BINARY = resource_loader.get_path_to_datafile("../../../../flatbuffers/flatc")
+# Account for different package positioning internal vs. external.
+if not os.path.exists(_BINARY):
+  _BINARY = resource_loader.get_path_to_datafile(
+      "../../../../../flatbuffers/flatc")
+
+if not os.path.exists(_SCHEMA):
+  raise RuntimeError("Sorry, schema file cannot be found at %r" % _SCHEMA)
+if not os.path.exists(_BINARY):
+  raise RuntimeError("Sorry, flatc is not available at %r" % _BINARY)
+
 
 # A CSS description for making the visualizer
 _CSS = """
diff --git a/tensorflow/contrib/lite/util.h b/tensorflow/contrib/lite/util.h
index 89d9b4f5cf..3c4801183b 100644
--- a/tensorflow/contrib/lite/util.h
+++ b/tensorflow/contrib/lite/util.h
@@ -26,12 +26,17 @@ limitations under the License.
 
 namespace tflite {
 
-// Converts a `std::vector` to a `TfLiteIntArray`.
+// Converts a `std::vector` to a `TfLiteIntArray`. The caller takes ownership
+// of the returned pointer.
 TfLiteIntArray* ConvertVectorToTfLiteIntArray(const std::vector<int>& input);
 
+// Converts an array (of the given size) to a `TfLiteIntArray`. The caller
+// takes ownership of the returned pointer, and must make sure 'dims' has at
+// least 'rank' elemnts.
 TfLiteIntArray* ConvertArrayToTfLiteIntArray(const int rank, const int* dims);
 
 // Checks whether a `TfLiteIntArray` and an int array have matching elements.
+// The caller must guarantee that 'b' has at least 'b_size' elements.
 bool EqualArrayAndTfLiteIntArray(const TfLiteIntArray* a, const int b_size,
                                  const int* b);
 
diff --git a/tensorflow/contrib/lookup/lookup_ops_test.py b/tensorflow/contrib/lookup/lookup_ops_test.py
index 889accdd5a..8d510ede58 100644
--- a/tensorflow/contrib/lookup/lookup_ops_test.py
+++ b/tensorflow/contrib/lookup/lookup_ops_test.py
@@ -280,6 +280,21 @@ class HashTableOpTest(test.TestCase):
       table.init.run()
       self.assertAllEqual(3, table.size().eval())
 
+  def testHashTableInt32String(self):
+    with self.test_session():
+      default_val = "n/a"
+      keys = constant_op.constant([0, 1, 2], dtypes.int32)
+      values = constant_op.constant(["brain", "salad", "surgery"])
+      table = lookup.HashTable(
+          lookup.KeyValueTensorInitializer(keys, values), default_val)
+      table.init.run()
+
+      input_tensor = constant_op.constant([0, 1, -1])
+      output = table.lookup(input_tensor)
+
+      result = output.eval()
+      self.assertAllEqual([b"brain", b"salad", b"n/a"], result)
+
 
 class MutableHashTableOpTest(test.TestCase):
 
diff --git a/tensorflow/contrib/makefile/proto_text_cc_files.txt b/tensorflow/contrib/makefile/proto_text_cc_files.txt
index 76428bc1d4..7d26429f9c 100644
--- a/tensorflow/contrib/makefile/proto_text_cc_files.txt
+++ b/tensorflow/contrib/makefile/proto_text_cc_files.txt
@@ -35,6 +35,7 @@ tensorflow/core/lib/random/random.cc
 tensorflow/core/lib/random/distribution_sampler.cc
 tensorflow/core/lib/io/zlib_outputbuffer.cc
 tensorflow/core/lib/io/zlib_inputstream.cc
+tensorflow/core/lib/io/zlib_compression_options.cc
 tensorflow/core/lib/io/two_level_iterator.cc
 tensorflow/core/lib/io/table_builder.cc
 tensorflow/core/lib/io/table.cc
diff --git a/tensorflow/contrib/makefile/tf_op_files.txt b/tensorflow/contrib/makefile/tf_op_files.txt
index 6e7423f85e..ecf2e120df 100644
--- a/tensorflow/contrib/makefile/tf_op_files.txt
+++ b/tensorflow/contrib/makefile/tf_op_files.txt
@@ -229,6 +229,8 @@ tensorflow/core/kernels/cast_op_impl_int32.cc
 tensorflow/core/kernels/cast_op_impl_int64.cc
 tensorflow/core/kernels/cast_op_impl_int8.cc
 tensorflow/core/kernels/cast_op_impl_uint16.cc
+tensorflow/core/kernels/cast_op_impl_uint32.cc
+tensorflow/core/kernels/cast_op_impl_uint64.cc
 tensorflow/core/kernels/cast_op_impl_uint8.cc
 tensorflow/core/kernels/boosted_trees/prediction_ops.cc
 tensorflow/core/kernels/boosted_trees/resource_ops.cc
diff --git a/tensorflow/contrib/metrics/BUILD b/tensorflow/contrib/metrics/BUILD
index 66cb493e5c..21cd34f73f 100644
--- a/tensorflow/contrib/metrics/BUILD
+++ b/tensorflow/contrib/metrics/BUILD
@@ -31,6 +31,7 @@ py_library(
         "//tensorflow/python:check_ops",
         "//tensorflow/python:confusion_matrix",
         "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:distribute",
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:histogram_ops",
         "//tensorflow/python:init_ops",
diff --git a/tensorflow/contrib/metrics/__init__.py b/tensorflow/contrib/metrics/__init__.py
index 5effea3596..88798d61b7 100644
--- a/tensorflow/contrib/metrics/__init__.py
+++ b/tensorflow/contrib/metrics/__init__.py
@@ -63,6 +63,7 @@ See the @{$python/contrib.metrics} guide.
 @@aggregate_metrics
 @@aggregate_metric_map
 @@confusion_matrix
+@@f1_score
 @@set_difference
 @@set_intersection
 @@set_size
diff --git a/tensorflow/contrib/metrics/python/metrics/classification.py b/tensorflow/contrib/metrics/python/metrics/classification.py
index 26aba1cc51..e553612269 100644
--- a/tensorflow/contrib/metrics/python/metrics/classification.py
+++ b/tensorflow/contrib/metrics/python/metrics/classification.py
@@ -22,6 +22,9 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import metrics_impl
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.training import distribute as distribute_lib
 
 # TODO(nsilberman): move into metrics/python/ops/
 
@@ -62,3 +65,121 @@ def accuracy(predictions, labels, weights=None, name=None):
       return math_ops.div(math_ops.reduce_sum(is_correct),
                           math_ops.reduce_sum(num_values))
     return math_ops.reduce_mean(is_correct)
+
+
+def f1_score(labels, predictions, weights=None, num_thresholds=200,
+             metrics_collections=None, updates_collections=None, name=None):
+  """Computes the approximately best F1-score across different thresholds.
+
+  The f1_score function applies a range of thresholds to the predictions to
+  convert them from [0, 1] to bool. Precision and recall are computed by
+  comparing them to the labels. The F1-Score is then defined as
+  2 * precision * recall / (precision + recall). The best one across the
+  thresholds is returned.
+
+  Disclaimer: In practice it may be desirable to choose the best threshold on
+  the validation set and evaluate the F1 score with this threshold on a
+  separate test set. Or it may be desirable to use a fixed threshold (e.g. 0.5).
+
+  This function internally creates four local variables, `true_positives`,
+  `true_negatives`, `false_positives` and `false_negatives` that are used to
+  compute the pairs of recall and precision values for a linearly spaced set of
+  thresholds from which the best f1-score is derived.
+
+  This value is ultimately returned as `f1-score`, an idempotent operation that
+  computes the F1-score (computed using the aforementioned variables). The
+  `num_thresholds` variable controls the degree of discretization with larger
+  numbers of thresholds more closely approximating the true best F1-score.
+
+  For estimation of the metric over a stream of data, the function creates an
+  `update_op` operation that updates these variables and returns the F1-score.
+
+  Example usage with a custom estimator:
+  def model_fn(features, labels, mode):
+    predictions = make_predictions(features)
+    loss = make_loss(predictions, labels)
+    train_op = tf.contrib.training.create_train_op(
+          total_loss=loss,
+          optimizer='Adam')
+    eval_metric_ops = {'f1': f1_score(labels, predictions)}
+    return tf.estimator.EstimatorSpec(
+        mode=mode,
+        predictions=predictions,
+        loss=loss,
+        train_op=train_op,
+        eval_metric_ops=eval_metric_ops,
+        export_outputs=export_outputs)
+  estimator = tf.estimator.Estimator(model_fn=model_fn)
+
+  If `weights` is `None`, weights default to 1. Use weights of 0 to mask values.
+
+  Args:
+    labels: A `Tensor` whose shape matches `predictions`. Will be cast to
+      `bool`.
+    predictions: A floating point `Tensor` of arbitrary shape and whose values
+      are in the range `[0, 1]`.
+    weights: Optional `Tensor` whose rank is either 0, or the same rank as
+      `labels`, and must be broadcastable to `labels` (i.e., all dimensions must
+      be either `1`, or the same as the corresponding `labels` dimension).
+    num_thresholds: The number of thresholds to use when discretizing the roc
+      curve.
+    metrics_collections: An optional list of collections that `f1_score` should
+      be added to.
+    updates_collections: An optional list of collections that `update_op` should
+      be added to.
+    name: An optional variable_scope name.
+
+  Returns:
+    f1_score: A scalar `Tensor` representing the current best f1-score across
+      different thresholds.
+    update_op: An operation that increments the `true_positives`,
+      `true_negatives`, `false_positives` and `false_negatives` variables
+      appropriately and whose value matches the `f1_score`.
+
+  Raises:
+    ValueError: If `predictions` and `labels` have mismatched shapes, or if
+      `weights` is not `None` and its shape doesn't match `predictions`, or if
+      either `metrics_collections` or `updates_collections` are not a list or
+      tuple.
+  """
+  with variable_scope.variable_scope(
+      name, 'f1', (labels, predictions, weights)):
+    predictions, labels, weights = metrics_impl._remove_squeezable_dimensions(  # pylint: disable=protected-access
+        predictions=predictions, labels=labels, weights=weights)
+    # To account for floating point imprecisions / avoid division by zero.
+    epsilon = 1e-7
+    thresholds = [(i + 1) * 1.0 / (num_thresholds - 1)
+                  for i in range(num_thresholds - 2)]
+    thresholds = [0.0 - epsilon] + thresholds + [1.0 + epsilon]
+
+    # Confusion matrix.
+    values, update_ops = metrics_impl._confusion_matrix_at_thresholds(  # pylint: disable=protected-access
+        labels, predictions, thresholds, weights, includes=('tp', 'fp', 'fn'))
+
+    # Compute precision and recall at various thresholds.
+    def compute_best_f1_score(tp, fp, fn, name):
+      precision_at_t = math_ops.div(tp, epsilon + tp + fp,
+                                    name='precision_' + name)
+      recall_at_t = math_ops.div(tp, epsilon + tp + fn, name='recall_' + name)
+      # Compute F1 score.
+      f1_at_thresholds = (
+          2.0 * precision_at_t * recall_at_t /
+          (precision_at_t + recall_at_t + epsilon))
+      return math_ops.reduce_max(f1_at_thresholds)
+
+    def f1_across_towers(_, values):
+      best_f1 = compute_best_f1_score(tp=values['tp'], fp=values['fp'],
+                                      fn=values['fn'], name='value')
+      if metrics_collections:
+        ops.add_to_collections(metrics_collections, best_f1)
+      return best_f1
+
+    best_f1 = distribute_lib.get_tower_context().merge_call(
+        f1_across_towers, values)
+
+    update_op = compute_best_f1_score(tp=update_ops['tp'], fp=update_ops['fp'],
+                                      fn=update_ops['fn'], name='update')
+    if updates_collections:
+      ops.add_to_collections(updates_collections, update_op)
+
+    return best_f1, update_op
diff --git a/tensorflow/contrib/metrics/python/metrics/classification_test.py b/tensorflow/contrib/metrics/python/metrics/classification_test.py
index fa0f12d029..3d0b81c1be 100644
--- a/tensorflow/contrib/metrics/python/metrics/classification_test.py
+++ b/tensorflow/contrib/metrics/python/metrics/classification_test.py
@@ -18,9 +18,16 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
 from tensorflow.contrib.metrics.python.metrics import classification
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import random_ops
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
 
@@ -108,5 +115,200 @@ class ClassificationTest(test.TestCase):
       self.assertEqual(result, 0.5)
 
 
+class F1ScoreTest(test.TestCase):
+
+  def setUp(self):
+    super(F1ScoreTest, self).setUp()
+    np.random.seed(1)
+
+  def testVars(self):
+    classification.f1_score(
+        predictions=array_ops.ones((10, 1)),
+        labels=array_ops.ones((10, 1)),
+        num_thresholds=3)
+    expected = {'f1/true_positives:0', 'f1/false_positives:0',
+                'f1/false_negatives:0'}
+    self.assertEquals(
+        expected, set(v.name for v in variables.local_variables()))
+    self.assertEquals(
+        set(expected), set(v.name for v in variables.local_variables()))
+    self.assertEquals(
+        set(expected),
+        set(v.name for v in ops.get_collection(ops.GraphKeys.METRIC_VARIABLES)))
+
+  def testMetricsCollection(self):
+    my_collection_name = '__metrics__'
+    f1, _ = classification.f1_score(
+        predictions=array_ops.ones((10, 1)),
+        labels=array_ops.ones((10, 1)),
+        num_thresholds=3,
+        metrics_collections=[my_collection_name])
+    self.assertListEqual(ops.get_collection(my_collection_name), [f1])
+
+  def testUpdatesCollection(self):
+    my_collection_name = '__updates__'
+    _, f1_op = classification.f1_score(
+        predictions=array_ops.ones((10, 1)),
+        labels=array_ops.ones((10, 1)),
+        num_thresholds=3,
+        updates_collections=[my_collection_name])
+    self.assertListEqual(ops.get_collection(my_collection_name), [f1_op])
+
+  def testValueTensorIsIdempotent(self):
+    predictions = random_ops.random_uniform(
+        (10, 3), maxval=1, dtype=dtypes.float32, seed=1)
+    labels = random_ops.random_uniform(
+        (10, 3), maxval=2, dtype=dtypes.int64, seed=2)
+    f1, f1_op = classification.f1_score(predictions, labels, num_thresholds=3)
+
+    with self.test_session() as sess:
+      sess.run(variables.local_variables_initializer())
+
+      # Run several updates.
+      for _ in range(10):
+        sess.run([f1_op])
+
+      # Then verify idempotency.
+      initial_f1 = f1.eval()
+      for _ in range(10):
+        self.assertAllClose(initial_f1, f1.eval())
+
+  def testAllCorrect(self):
+    inputs = np.random.randint(0, 2, size=(100, 1))
+
+    with self.test_session() as sess:
+      predictions = constant_op.constant(inputs, dtype=dtypes.float32)
+      labels = constant_op.constant(inputs)
+      f1, f1_op = classification.f1_score(predictions, labels, num_thresholds=3)
+
+      sess.run(variables.local_variables_initializer())
+      sess.run([f1_op])
+
+      self.assertEqual(1, f1.eval())
+
+  def testSomeCorrect(self):
+    predictions = constant_op.constant(
+        [1, 0, 1, 0], shape=(1, 4), dtype=dtypes.float32)
+    labels = constant_op.constant([0, 1, 1, 0], shape=(1, 4))
+    f1, f1_op = classification.f1_score(predictions, labels, num_thresholds=1)
+    with self.test_session() as sess:
+      sess.run(variables.local_variables_initializer())
+      sess.run([f1_op])
+      # Threshold 0 will have around 0.5 precision and 1 recall yielding an F1
+      # score of 2 * 0.5 * 1 / (1 + 0.5).
+      self.assertAlmostEqual(2 * 0.5 * 1 / (1 + 0.5), f1.eval())
+
+  def testAllIncorrect(self):
+    inputs = np.random.randint(0, 2, size=(10000, 1))
+
+    with self.test_session() as sess:
+      predictions = constant_op.constant(inputs, dtype=dtypes.float32)
+      labels = constant_op.constant(1 - inputs, dtype=dtypes.float32)
+      f1, f1_op = classification.f1_score(predictions, labels, num_thresholds=3)
+
+      sess.run(variables.local_variables_initializer())
+      sess.run([f1_op])
+
+      # Threshold 0 will have around 0.5 precision and 1 recall yielding an F1
+      # score of 2 * 0.5 * 1 / (1 + 0.5).
+      self.assertAlmostEqual(2 * 0.5 * 1 / (1 + 0.5), f1.eval(), places=2)
+
+  def testWeights1d(self):
+    with self.test_session() as sess:
+      predictions = constant_op.constant(
+          [[1, 0], [1, 0]], shape=(2, 2), dtype=dtypes.float32)
+      labels = constant_op.constant([[0, 1], [1, 0]], shape=(2, 2))
+      weights = constant_op.constant(
+          [[0], [1]], shape=(2, 1), dtype=dtypes.float32)
+      f1, f1_op = classification.f1_score(predictions, labels, weights,
+                                          num_thresholds=3)
+      sess.run(variables.local_variables_initializer())
+      sess.run([f1_op])
+
+      self.assertAlmostEqual(1.0, f1.eval(), places=5)
+
+  def testWeights2d(self):
+    with self.test_session() as sess:
+      predictions = constant_op.constant(
+          [[1, 0], [1, 0]], shape=(2, 2), dtype=dtypes.float32)
+      labels = constant_op.constant([[0, 1], [1, 0]], shape=(2, 2))
+      weights = constant_op.constant(
+          [[0, 0], [1, 1]], shape=(2, 2), dtype=dtypes.float32)
+      f1, f1_op = classification.f1_score(predictions, labels, weights,
+                                          num_thresholds=3)
+      sess.run(variables.local_variables_initializer())
+      sess.run([f1_op])
+
+      self.assertAlmostEqual(1.0, f1.eval(), places=5)
+
+  def testZeroLabelsPredictions(self):
+    with self.test_session() as sess:
+      predictions = array_ops.zeros([4], dtype=dtypes.float32)
+      labels = array_ops.zeros([4])
+      f1, f1_op = classification.f1_score(predictions, labels, num_thresholds=3)
+      sess.run(variables.local_variables_initializer())
+      sess.run([f1_op])
+
+      self.assertAlmostEqual(0.0, f1.eval(), places=5)
+
+  def testWithMultipleUpdates(self):
+    num_samples = 1000
+    batch_size = 10
+    num_batches = int(num_samples / batch_size)
+
+    # Create the labels and data.
+    labels = np.random.randint(0, 2, size=(num_samples, 1))
+    noise = np.random.normal(0.0, scale=0.2, size=(num_samples, 1))
+    predictions = 0.4 + 0.2 * labels + noise
+    predictions[predictions > 1] = 1
+    predictions[predictions < 0] = 0
+    thresholds = [-0.01, 0.5, 1.01]
+
+    expected_max_f1 = -1.0
+    for threshold in thresholds:
+      tp = 0
+      fp = 0
+      fn = 0
+      tn = 0
+      for i in range(num_samples):
+        if predictions[i] >= threshold:
+          if labels[i] == 1:
+            tp += 1
+          else:
+            fp += 1
+        else:
+          if labels[i] == 1:
+            fn += 1
+          else:
+            tn += 1
+      epsilon = 1e-7
+      expected_prec = tp / (epsilon + tp + fp)
+      expected_rec = tp / (epsilon + tp + fn)
+      expected_f1 = (2 * expected_prec * expected_rec /
+                     (epsilon + expected_prec + expected_rec))
+      if expected_f1 > expected_max_f1:
+        expected_max_f1 = expected_f1
+
+    labels = labels.astype(np.float32)
+    predictions = predictions.astype(np.float32)
+    tf_predictions, tf_labels = (dataset_ops.Dataset
+                                 .from_tensor_slices((predictions, labels))
+                                 .repeat()
+                                 .batch(batch_size)
+                                 .make_one_shot_iterator()
+                                 .get_next())
+    f1, f1_op = classification.f1_score(tf_labels, tf_predictions,
+                                        num_thresholds=3)
+
+    with self.test_session() as sess:
+      sess.run(variables.local_variables_initializer())
+      for _ in range(num_batches):
+        sess.run([f1_op])
+      # Since this is only approximate, we can't expect a 6 digits match.
+      # Although with higher number of samples/thresholds we should see the
+      # accuracy improving
+      self.assertAlmostEqual(expected_max_f1, f1.eval(), 2)
+
+
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops.py b/tensorflow/contrib/metrics/python/ops/metric_ops.py
index b14202ff9e..a328670526 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops.py
@@ -3715,6 +3715,7 @@ def count(values,
           name=None):
   """Computes the number of examples, or sum of `weights`.
 
+  This metric keeps track of the denominator in `tf.metrics.mean`.
   When evaluating some metric (e.g. mean) on one or more subsets of the data,
   this auxiliary metric is useful for keeping track of how many examples there
   are in each subset.
@@ -3741,15 +3742,21 @@ def count(values,
     ValueError: If `weights` is not `None` and its shape doesn't match `values`,
       or if either `metrics_collections` or `updates_collections` are not a list
       or tuple.
+    RuntimeError: If eager execution is enabled.
   """
+  if context.executing_eagerly():
+    raise RuntimeError('tf.contrib.metrics.count is not supported when eager '
+                       'execution is enabled.')
 
   with variable_scope.variable_scope(name, 'count', (values, weights)):
+
     count_ = metrics_impl.metric_variable([], dtypes.float32, name='count')
 
     if weights is None:
       num_values = math_ops.to_float(array_ops.size(values))
     else:
-      _, _, weights = metrics_impl._remove_squeezable_dimensions(  # pylint: disable=protected-access
+      values = math_ops.to_float(values)
+      values, _, weights = metrics_impl._remove_squeezable_dimensions(  # pylint: disable=protected-access
           predictions=values,
           labels=None,
           weights=weights)
@@ -3758,15 +3765,14 @@ def count(values,
       num_values = math_ops.reduce_sum(weights)
 
     with ops.control_dependencies([values]):
-      update_op = state_ops.assign_add(count_, num_values)
+      update_count_op = state_ops.assign_add(count_, num_values)
 
-    if metrics_collections:
-      ops.add_to_collections(metrics_collections, count_)
+    count_ = metrics_impl._aggregate_variable(count_, metrics_collections)  # pylint: disable=protected-access
 
     if updates_collections:
-      ops.add_to_collections(updates_collections, update_op)
+      ops.add_to_collections(updates_collections, update_count_op)
 
-    return count_, update_op
+    return count_, update_count_op
 
 
 def cohen_kappa(labels,
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
index a09fc4abd4..401fedcbed 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
@@ -6854,6 +6854,11 @@ class CountTest(test.TestCase):
         array_ops.ones([4, 3]), updates_collections=[my_collection_name])
     self.assertListEqual(ops.get_collection(my_collection_name), [update_op])
 
+  def testReturnType(self):
+    c, op = metrics.count(array_ops.ones([4, 3]))
+    self.assertTrue(isinstance(c, ops.Tensor))
+    self.assertTrue(isinstance(op, ops.Operation) or isinstance(op, ops.Tensor))
+
   def testBasic(self):
     with self.test_session() as sess:
       values_queue = data_flow_ops.FIFOQueue(
diff --git a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
index ef34f7bf7b..93050a3ae3 100644
--- a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
+++ b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
@@ -77,7 +77,7 @@ class LossScaleOptimizer(optimizer.Optimizer):
 
   If gradients clipping is applied, one can call
   `optimizer.compute_gradients()` and `optimizer.apply_gradients()`
-  seperately.
+  separately.
 
   Notice the following way of using LossScaleOptimizer is not intended. Always
   use `loss_scale_optimizer.compute_gradients()` to compute gradients instead of
diff --git a/tensorflow/contrib/model_pruning/README.md b/tensorflow/contrib/model_pruning/README.md
index 86f4fd6adf..dbe4e124fd 100644
--- a/tensorflow/contrib/model_pruning/README.md
+++ b/tensorflow/contrib/model_pruning/README.md
@@ -42,7 +42,7 @@ The pruning library allows for specification of the following hyper parameters:
 | name | string | model_pruning | Name of the pruning specification. Used for adding summaries and ops under a common tensorflow name_scope |
 | begin_pruning_step | integer | 0 | The global step at which to begin pruning |
 | end_pruning_step   | integer | -1 | The global step at which to terminate pruning. Defaults to -1 implying that pruning continues till  the training stops |
-| do_not_prune | list of strings | [""] | list of layers names that are not pruned |
+| weight_sparsity_map | list of strings | [""] | list of weight variable name (or layer name):target sparsity pairs. Eg. [conv1:0.9,conv2/kernel:0.8]. For layers/weights not in this list, sparsity as specified by the target_sparsity hyperparameter is used. |
 | threshold_decay | float | 0.9 | The decay factor to use for exponential decay of the thresholds |
 | pruning_frequency | integer | 10 | How often should the masks be updated? (in # of global_steps) |
 | nbins | integer | 256 | Number of bins to use for histogram computation |
@@ -66,10 +66,10 @@ is the sparsity_function_begin_step. In this equation, the
 sparsity_function_exponent is set to 3.
 ### Adding pruning ops to the training graph
 
-The final step involves adding ops to the training graph that monitors the
-distribution of the layer's weight magnitudes and determines the layer threshold
-such masking all the weights below this threshold achieves the sparsity level
-desired for the current training step. This can be achieved as follows:
+The final step involves adding ops to the training graph that monitor the
+distribution of the layer's weight magnitudes and determine the layer threshold,
+such that masking all the weights below this threshold achieves the sparsity
+level desired for the current training step. This can be achieved as follows:
 
 ```python
 tf.app.flags.DEFINE_string(
@@ -79,7 +79,7 @@ tf.app.flags.DEFINE_string(
 with tf.graph.as_default():
 
   # Create global step variable
-  global_step = tf.train.get_global_step()
+  global_step = tf.train.get_or_create_global_step()
 
   # Parse pruning hyperparameters
   pruning_hparams = pruning.get_pruning_hparams().parse(FLAGS.pruning_hparams)
@@ -103,6 +103,7 @@ with tf.graph.as_default():
     mon_sess.run(mask_update_op)
 
 ```
+Ensure that `global_step` is being [incremented](https://www.tensorflow.org/api_docs/python/tf/train/Optimizer#minimize), otherwise pruning will not work!
 
 ## Example: Pruning and training deep CNNs on the cifar10 dataset
 
diff --git a/tensorflow/contrib/model_pruning/python/learning.py b/tensorflow/contrib/model_pruning/python/learning.py
index 2b79c23cef..26695237c2 100644
--- a/tensorflow/contrib/model_pruning/python/learning.py
+++ b/tensorflow/contrib/model_pruning/python/learning.py
@@ -33,11 +33,14 @@ to support training of pruned models
   # Create the train_op
   train_op = slim.learning.create_train_op(total_loss, optimizer)
 
-  # Set up sparsity
-  sparsity = pruning.setup_gradual_sparsity(self.global_step)
+  # Parse pruning hyperparameters
+  pruning_hparams = pruning.get_pruning_hparams().parse(FLAGS.pruning_hparams)
 
-  # Create mask update op
-  mask_update_op = pruning.add_mask_update_ip(sparsity)
+  # Create a pruning object using the pruning_hparams
+  p = pruning.Pruning(pruning_hparams)
+
+  # Add mask update ops to the graph
+  mask_update_op = p.conditional_mask_update_op()
 
   # Run training.
   learning.train(train_op,
diff --git a/tensorflow/contrib/model_pruning/python/pruning.py b/tensorflow/contrib/model_pruning/python/pruning.py
index 4b7af18b33..723dab9369 100644
--- a/tensorflow/contrib/model_pruning/python/pruning.py
+++ b/tensorflow/contrib/model_pruning/python/pruning.py
@@ -152,8 +152,11 @@ def get_pruning_hparams():
     end_pruning_step: integer
       the global step at which to terminate pruning. Defaults to -1 implying
       that pruning continues till the training stops
-    do_not_prune: list of strings
-      list of layers that are not pruned
+    weight_sparsity_map: list of strings
+       comma separed list of weight variable name:target sparsity pairs.
+       For layers/weights not in this list, sparsity as specified by the
+       target_sparsity hyperparameter is used.
+       Eg. [conv1:0.9,conv2/kernel:0.8]
     threshold_decay: float
       the decay factor to use for exponential decay of the thresholds
     pruning_frequency: integer
@@ -200,7 +203,7 @@ def get_pruning_hparams():
       name='model_pruning',
       begin_pruning_step=0,
       end_pruning_step=-1,
-      do_not_prune=[''],
+      weight_sparsity_map=[''],
       threshold_decay=0.9,
       pruning_frequency=10,
       nbins=256,
@@ -256,6 +259,9 @@ class Pruning(object):
     # Block pooling function
     self._block_pooling_function = self._spec.block_pooling_function
 
+    # Mapping of weight names and target sparsity
+    self._weight_sparsity_map = self._get_weight_sparsity_map()
+
   def _setup_global_step(self, global_step):
     graph_global_step = global_step
     if graph_global_step is None:
@@ -306,15 +312,36 @@ class Pruning(object):
             'last_mask_update_step', dtype=dtypes.int32)
     return last_update_step
 
-  def _exists_in_do_not_prune_list(self, tensor_name):
-    do_not_prune_list = self._spec.do_not_prune
-    if not do_not_prune_list[0]:
-      return False
-    for layer_name in do_not_prune_list:
-      if tensor_name.find(layer_name) != -1:
-        return True
-
-    return False
+  def _get_weight_sparsity_map(self):
+    """Return the map of weight_name:sparsity parsed from the hparams."""
+    weight_sparsity_map = {}
+    val_list = self._spec.weight_sparsity_map
+    filtered_val_list = [l for l in val_list if l]
+    for val in filtered_val_list:
+      weight_name, sparsity = val.split(':')
+      if float(sparsity) >= 1.0:
+        raise ValueError('Weight sparsity can not exceed 1.0')
+      weight_sparsity_map[weight_name] = float(sparsity)
+
+    return weight_sparsity_map
+
+  def _get_sparsity(self, weight_name):
+    """Return target sparsity for the given layer/weight name."""
+    target_sparsity = [
+        sparsity for name, sparsity in self._weight_sparsity_map.items()
+        if weight_name.find(name) != -1
+    ]
+    if not target_sparsity:
+      return self._sparsity
+
+    if len(target_sparsity) > 1:
+      raise ValueError(
+          'Multiple matches in weight_sparsity_map for weight %s' % weight_name)
+    # TODO(suyoggupta): This will work when initial_sparsity = 0. Generalize
+    # to handle other cases as well.
+    return math_ops.mul(
+        self._sparsity,
+        math_ops.div(target_sparsity[0], self._spec.target_sparsity))
 
   def _update_mask(self, weights, threshold):
     """Updates the mask for a given weight tensor.
@@ -342,6 +369,8 @@ class Pruning(object):
     if self._sparsity is None:
       raise ValueError('Sparsity variable undefined')
 
+    sparsity = self._get_sparsity(weights.op.name)
+
     with ops.name_scope(weights.op.name + '_pruning_ops'):
       abs_weights = math_ops.abs(weights)
       max_value = math_ops.reduce_max(abs_weights)
@@ -354,7 +383,7 @@ class Pruning(object):
           math_ops.div(
               math_ops.reduce_sum(
                   math_ops.cast(
-                      math_ops.less(norm_cdf, self._sparsity), dtypes.float32)),
+                      math_ops.less(norm_cdf, sparsity), dtypes.float32)),
               float(self._spec.nbins)), max_value)
 
       smoothed_threshold = math_ops.add_n([
@@ -453,10 +482,6 @@ class Pruning(object):
       if is_partitioned:
         weight = weight.as_tensor()
 
-      if self._spec.do_not_prune:
-        if self._exists_in_do_not_prune_list(mask.name):
-          continue
-
       new_threshold, new_mask = self._maybe_update_block_mask(weight, threshold)
       self._assign_ops.append(
           pruning_utils.variable_assign(threshold, new_threshold))
@@ -507,22 +532,15 @@ class Pruning(object):
                                  no_update_op)
 
   def add_pruning_summaries(self):
-    """Adds summaries for this pruning spec.
-
-    Args: none
-
-    Returns: none
-    """
+    """Adds summaries of weight sparsities and thresholds."""
     with ops.name_scope(self._spec.name + '_summaries'):
       summary.scalar('sparsity', self._sparsity)
       summary.scalar('last_mask_update_step', self._last_update_step)
       masks = get_masks()
       thresholds = get_thresholds()
-      for index, mask in enumerate(masks):
-        if not self._exists_in_do_not_prune_list(mask.name):
-          summary.scalar(mask.name + '/sparsity', nn_impl.zero_fraction(mask))
-          summary.scalar(thresholds[index].op.name + '/threshold',
-                         thresholds[index])
+      for mask, threshold in zip(masks, thresholds):
+        summary.scalar(mask.op.name + '/sparsity', nn_impl.zero_fraction(mask))
+        summary.scalar(threshold.op.name + '/threshold', threshold)
 
   def print_hparams(self):
     logging.info(self._spec.to_json())
diff --git a/tensorflow/contrib/model_pruning/python/pruning_test.py b/tensorflow/contrib/model_pruning/python/pruning_test.py
index f80b7c52c0..5b67656e9f 100644
--- a/tensorflow/contrib/model_pruning/python/pruning_test.py
+++ b/tensorflow/contrib/model_pruning/python/pruning_test.py
@@ -35,8 +35,8 @@ from tensorflow.python.training import training_util
 class PruningHParamsTest(test.TestCase):
   PARAM_LIST = [
       "name=test", "threshold_decay=0.9", "pruning_frequency=10",
-      "do_not_prune=[conv1,conv2]", "sparsity_function_end_step=100",
-      "target_sparsity=0.9"
+      "sparsity_function_end_step=100", "target_sparsity=0.9",
+      "weight_sparsity_map=[conv1:0.8,conv2/kernel:0.8]"
   ]
   TEST_HPARAMS = ",".join(PARAM_LIST)
 
@@ -55,9 +55,11 @@ class PruningHParamsTest(test.TestCase):
     self.assertEqual(p._spec.name, "test")
     self.assertAlmostEqual(p._spec.threshold_decay, 0.9)
     self.assertEqual(p._spec.pruning_frequency, 10)
-    self.assertAllEqual(p._spec.do_not_prune, ["conv1", "conv2"])
     self.assertEqual(p._spec.sparsity_function_end_step, 100)
     self.assertAlmostEqual(p._spec.target_sparsity, 0.9)
+    self.assertEqual(p._weight_sparsity_map["conv1"], 0.8)
+    self.assertEqual(p._weight_sparsity_map["conv2/kernel"], 0.8)
+
 
   def testInitWithExternalSparsity(self):
     with self.test_session():
@@ -211,6 +213,37 @@ class PruningTest(test.TestCase):
     expected_non_zero_count = [100, 100, 80, 80, 60, 60, 40, 40, 40, 40]
     self.assertAllEqual(expected_non_zero_count, non_zero_count)
 
+  def testWeightSpecificSparsity(self):
+    param_list = [
+        "begin_pruning_step=1", "pruning_frequency=1", "end_pruning_step=100",
+        "target_sparsity=0.5", "weight_sparsity_map=[layer2/weights:0.75]",
+        "threshold_decay=0.0"
+    ]
+    test_spec = ",".join(param_list)
+    pruning_hparams = pruning.get_pruning_hparams().parse(test_spec)
+
+    with variable_scope.variable_scope("layer1"):
+      w1 = variables.Variable(
+          math_ops.linspace(1.0, 100.0, 100), name="weights")
+      _ = pruning.apply_mask(w1)
+    with variable_scope.variable_scope("layer2"):
+      w2 = variables.Variable(
+          math_ops.linspace(1.0, 100.0, 100), name="weights")
+      _ = pruning.apply_mask(w2)
+
+    p = pruning.Pruning(pruning_hparams)
+    mask_update_op = p.conditional_mask_update_op()
+    increment_global_step = state_ops.assign_add(self.global_step, 1)
+
+    with self.test_session() as session:
+      variables.global_variables_initializer().run()
+      for _ in range(110):
+        session.run(mask_update_op)
+        session.run(increment_global_step)
+
+      self.assertAllEqual(
+          session.run(pruning.get_weight_sparsity()), [0.5, 0.75])
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/mpi_collectives/mpi_ops.py b/tensorflow/contrib/mpi_collectives/mpi_ops.py
new file mode 100644
index 0000000000..bd7096d9ce
--- /dev/null
+++ b/tensorflow/contrib/mpi_collectives/mpi_ops.py
@@ -0,0 +1,163 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Inter-process communication using MPI."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import load_library
+from tensorflow.python.framework import ops
+from tensorflow.python.platform import resource_loader
+from tensorflow.python.platform import tf_logging as logging
+
+
+def _load_library(name, op_list=None):
+  """Loads a .so file containing the specified operators.
+
+  Args:
+    name: The name of the .so file to load.
+    op_list: A list of names of operators that the library should have. If None
+        then the .so file's contents will not be verified.
+
+  Raises:
+    NameError if one of the required ops is missing.
+  """
+  try:
+    filename = resource_loader.get_path_to_datafile(name)
+    library = load_library.load_op_library(filename)
+    for expected_op in (op_list or []):
+      for lib_op in library.OP_LIST.op:
+        if lib_op.name == expected_op:
+          break
+      else:
+        raise NameError('Could not find operator %s in dynamic library %s' %
+                        (expected_op, name))
+    return library
+  except errors.NotFoundError:
+    logging.warning('%s file could not be loaded.', name)
+
+
+MPI_LIB = _load_library(
+    'mpi_collectives.so',
+    ['MPISize', 'MPIRank', 'MPILocalRank', 'MPIAllgather', 'MPIAllreduce'])
+
+
+def size(name=None):
+  """An op which returns the number of MPI processes.
+
+  This is equivalent to running `MPI_Comm_size(MPI_COMM_WORLD, ...)` to get the
+  size of the global communicator.
+
+  Returns:
+    An integer scalar containing the number of MPI processes.
+  """
+  return MPI_LIB.mpi_size(name=name)
+
+
+ops.NotDifferentiable('MPISize')
+
+
+def rank(name=None):
+  """An op which returns the MPI rank of the calling process.
+
+  This is equivalent to running `MPI_Comm_rank(MPI_COMM_WORLD, ...)` to get the
+  rank of the current process in the global communicator.
+
+  Returns:
+    An integer scalar with the MPI rank of the calling process.
+  """
+  return MPI_LIB.mpi_rank(name=name)
+
+
+ops.NotDifferentiable('MPIRank')
+
+
+def init(name=None):
+  """An op which initializes MPI on the device on which it is run.
+
+  All future MPI ops must be run on the same device that the `init` op was run
+  on.
+  """
+  return MPI_LIB.mpi_init(name=name)
+
+
+ops.NotDifferentiable('MPIInit')
+
+
+def local_rank(name=None):
+  """An op which returns the local MPI rank of the calling process, within the
+  node that it is running on. For example, if there are seven processes running
+  on a node, their local ranks will be zero through six, inclusive.
+
+  This is equivalent to running `MPI_Comm_rank(...)` on a new communicator
+  which only includes processes on the same node.
+
+  Returns:
+    An integer scalar with the local MPI rank of the calling process.
+  """
+  return MPI_LIB.mpi_local_rank(name=name)
+
+
+ops.NotDifferentiable('MPILocalRank')
+
+
+def _allreduce(tensor, name=None):
+  """An op which sums an input tensor over all the MPI processes.
+
+  The reduction operation is keyed by the name of the op. The tensor type and
+  shape must be the same on all MPI processes for a given name. The reduction
+  will not start until all processes are ready to send and receive the tensor.
+
+  Returns:
+    A tensor of the same shape and type as `tensor`, summed across all
+    processes.
+  """
+  return MPI_LIB.mpi_allreduce(tensor, name=name)
+
+
+ops.NotDifferentiable('MPIAllreduce')
+
+
+def allgather(tensor, name=None):
+  """An op which concatenates the input tensor with the same input tensor on
+  all other MPI processes.
+
+  The concatenation is done on the first dimension, so the input tensors on the
+  different processes must have the same rank and shape, except for the first
+  dimension, which is allowed to be different.
+
+  Returns:
+    A tensor of the same type as `tensor`, concatenated on dimension zero
+    across all processes. The shape is identical to the input shape, except for
+    the first dimension, which may be greater and is the sum of all first
+    dimensions of the tensors in different MPI processes.
+  """
+  # Specify that first allgather is to collect the tensor gather sizes,
+  # indicated by passing in a scalar (0-D tensor) of value 0
+  sizes_flag = tf.constant(0, dtype=tf.int64, name='size_flag_const')
+  my_size = tf.slice(
+      tf.shape(tensor, out_type=tf.int64), [0], [1], name='size_slice')
+  if name is None:
+    name = 'allgather'
+  sizing_name = '{}_sizing'.format(name)
+  sizes = MPI_LIB.mpi_allgather(my_size, sizes_flag, name=sizing_name)
+  return MPI_LIB.mpi_allgather(tensor, sizes, name=name)
+
+
+ops.NotDifferentiable('MPIAllgather')
diff --git a/tensorflow/contrib/mpi_collectives/ring.cc b/tensorflow/contrib/mpi_collectives/ring.cc
new file mode 100644
index 0000000000..d93233eb21
--- /dev/null
+++ b/tensorflow/contrib/mpi_collectives/ring.cc
@@ -0,0 +1,80 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef TENSORFLOW_USE_MPI
+
+#define EIGEN_USE_THREADS
+
+#include "tensorflow/contrib/mpi_collectives/ring.h"
+
+namespace tensorflow {
+namespace contrib {
+namespace mpi {
+
+using CPUDevice = Eigen::ThreadPoolDevice;
+
+extern template MPI_Datatype MPIType<float>();
+extern template MPI_Datatype MPIType<int>();
+extern template MPI_Datatype MPIType<long long>();
+extern template DataType TensorFlowDataType<float>();
+extern template DataType TensorFlowDataType<int>();
+extern template DataType TensorFlowDataType<long long>();
+
+// Generate all necessary specializations for RingAllreduce.
+template Status RingAllreduce<CPUDevice, int>(OpKernelContext*, const Tensor*,
+                                              Tensor*, Tensor*);
+template Status RingAllreduce<CPUDevice, long long>(OpKernelContext*,
+                                                    const Tensor*, Tensor*,
+                                                    Tensor*);
+template Status RingAllreduce<CPUDevice, float>(OpKernelContext*, const Tensor*,
+                                                Tensor*, Tensor*);
+
+// Generate all necessary specializations for RingAllgather.
+template Status RingAllgather<CPUDevice, int>(OpKernelContext*, const Tensor*,
+                                              const std::vector<size_t>&,
+                                              Tensor*);
+template Status RingAllgather<CPUDevice, long long>(OpKernelContext*,
+                                                    const Tensor*,
+                                                    const std::vector<size_t>&,
+                                                    Tensor*);
+template Status RingAllgather<CPUDevice, float>(OpKernelContext*, const Tensor*,
+                                                const std::vector<size_t>&,
+                                                Tensor*);
+
+// Copy data on a CPU using a straight-forward memcpy.
+template <>
+void CopyTensorData<CPUDevice>(void* dst, void* src, size_t size) {
+  std::memcpy(dst, src, size);
+};
+
+// Accumulate values on a CPU.
+#define GENERATE_ACCUMULATE(type)                                    \
+  template <>                                                        \
+  void AccumulateTensorData<CPUDevice, type>(type * dst, type * src, \
+                                             size_t size) {          \
+    for (unsigned int i = 0; i < size; i++) {                        \
+      dst[i] += src[i];                                              \
+    }                                                                \
+  };
+GENERATE_ACCUMULATE(int);
+GENERATE_ACCUMULATE(long long);
+GENERATE_ACCUMULATE(float);
+#undef GENERATE_ACCUMULATE
+
+}  // namespace mpi
+}  // namespace contrib
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_USE_MPI
diff --git a/tensorflow/contrib/mpi_collectives/ring.cu.cc b/tensorflow/contrib/mpi_collectives/ring.cu.cc
new file mode 100644
index 0000000000..2f3eef366a
--- /dev/null
+++ b/tensorflow/contrib/mpi_collectives/ring.cu.cc
@@ -0,0 +1,117 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef TENSORFLOW_USE_MPI
+
+#if GOOGLE_CUDA
+
+#define EIGEN_USE_GPU
+
+#include "tensorflow/contrib/mpi_collectives/ring.h"
+
+namespace tensorflow {
+namespace contrib {
+namespace mpi {
+
+using CPUDevice = Eigen::ThreadPoolDevice;
+
+template <>
+MPI_Datatype MPIType<float>() {
+  return MPI_FLOAT;
+};
+template <>
+MPI_Datatype MPIType<int>() {
+  return MPI_INT;
+};
+template <>
+MPI_Datatype MPIType<long long>() {
+  return MPI_LONG_LONG;
+};
+
+template <>
+DataType TensorFlowDataType<float>() {
+  return DT_FLOAT;
+};
+template <>
+DataType TensorFlowDataType<int>() {
+  return DT_INT32;
+};
+template <>
+DataType TensorFlowDataType<long long>() {
+  return DT_INT64;
+};
+
+// Generate all necessary specializations for RingAllreduce.
+template Status RingAllreduce<GPUDevice, int>(OpKernelContext*, const Tensor*,
+                                              Tensor*, Tensor*);
+template Status RingAllreduce<GPUDevice, long long>(OpKernelContext*,
+                                                    const Tensor*, Tensor*,
+                                                    Tensor*);
+template Status RingAllreduce<GPUDevice, float>(OpKernelContext*, const Tensor*,
+                                                Tensor*, Tensor*);
+
+// Generate all necessary specializations for RingAllgather.
+template Status RingAllgather<GPUDevice, int>(OpKernelContext*, const Tensor*,
+                                              const std::vector<size_t>&,
+                                              Tensor*);
+template Status RingAllgather<GPUDevice, long long>(OpKernelContext*,
+                                                    const Tensor*,
+                                                    const std::vector<size_t>&,
+                                                    Tensor*);
+template Status RingAllgather<GPUDevice, float>(OpKernelContext*, const Tensor*,
+                                                const std::vector<size_t>&,
+                                                Tensor*);
+
+// Synchronously copy data on the GPU, using a different stream than the default
+// and than TensorFlow to avoid synchronizing on operations unrelated to the
+// allreduce.
+template <>
+void CopyTensorData<GPUDevice>(void* dst, void* src, size_t size) {
+  auto stream = CudaStreamForMPI();
+  cudaMemcpyAsync(dst, src, size, cudaMemcpyDeviceToDevice, stream);
+  cudaStreamSynchronize(stream);
+};
+
+// Elementwise accumulation kernel for GPU.
+template <typename T>
+__global__ void elemwise_accum(T* out, const T* in, const size_t N) {
+  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
+       i += blockDim.x * gridDim.x) {
+    out[i] += in[i];
+  }
+}
+
+// Synchronously accumulate tensors on the GPU, using a different stream than
+// the default and than TensorFlow to avoid synchronizing on operations
+// unrelated to the allreduce.
+#define GENERATE_ACCUMULATE(type)                                    \
+  template <>                                                        \
+  void AccumulateTensorData<GPUDevice, type>(type * dst, type * src, \
+                                             size_t size) {          \
+    auto stream = CudaStreamForMPI();                                \
+    elemwise_accum<type><<<32, 256, 0, stream>>>(dst, src, size);    \
+    cudaStreamSynchronize(stream);                                   \
+  };
+GENERATE_ACCUMULATE(int);
+GENERATE_ACCUMULATE(long long);
+GENERATE_ACCUMULATE(float);
+#undef GENERATE_ACCUMULATE
+
+}  // namespace mpi
+}  // namespace contrib
+}  // namespace tensorflow
+#endif  // GOOGLE_CUDA
+
+#endif  // TENSORFLOW_USE_MPI
diff --git a/tensorflow/contrib/mpi_collectives/ring.h b/tensorflow/contrib/mpi_collectives/ring.h
new file mode 100644
index 0000000000..cae57ce60e
--- /dev/null
+++ b/tensorflow/contrib/mpi_collectives/ring.h
@@ -0,0 +1,327 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_MPI_H_
+#define TENSORFLOW_CONTRIB_MPI_H_
+
+#ifdef TENSORFLOW_USE_MPI
+
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/shape_inference.h"
+
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+#include "tensorflow/core/framework/tensor_types.h"
+
+#if GOOGLE_CUDA
+#include "cuda_runtime.h"
+#endif
+
+// Needed to avoid header issues with C++-supporting MPI implementations
+#define OMPI_SKIP_MPICXX
+#include "third_party/mpi/mpi.h"
+
+#define TAG_TENSOR 12
+
+namespace tensorflow {
+namespace contrib {
+namespace mpi {
+
+using CPUDevice = Eigen::ThreadPoolDevice;
+using GPUDevice = Eigen::GpuDevice;
+
+// Convert from templated types to values we can pass to MPI.
+template <typename T>
+MPI_Datatype MPIType();
+
+// Convert from templated types to TensorFlow data types.
+template <typename T>
+DataType TensorFlowDataType();
+
+#define MPI_REQUIRES_OK(MPI_STATUS)                               \
+  if ((MPI_STATUS) != MPI_SUCCESS) {                              \
+    return errors::Unknown("MPI operation failed unexpectedly."); \
+  }
+
+// Copy data from one tensor to another tensor.
+// This uses a custom CUDA stream on GPU, which is necessary to overlay the
+// backpropagation computations with the allreduce.
+template <typename Device>
+void CopyTensorData(void* destination, void* source, size_t size);
+
+// Add a tensor into another tensor, accumulating in place.
+// This uses a custom CUDA stream on GPU, which is necessary to overlay the
+// backpropagation computations with the allreduce.
+template <typename Device, typename T>
+void AccumulateTensorData(T* destination, T* source, size_t size);
+
+// We need to get the right stream for doing CUDA memory transfers and
+// operations, which is possibly different from the standard TensorFlow stream.
+#if GOOGLE_CUDA
+cudaStream_t CudaStreamForMPI();
+#endif
+
+/* Perform a ring allreduce on the data. Allocate the necessary output tensor
+ * and store it in the output parameter.
+ *
+ * Assumes that all MPI processes are doing an allreduce of the same tensor,
+ * with the same dimensions.
+ *
+ * A ring allreduce is a bandwidth-optimal way to do an allreduce. To do the
+ * allreduce, the nodes involved are arranged in a ring:
+ *
+ *                   .--0--.
+ *                  /       \
+ *                 3         1
+ *                  \       /
+ *                   *--2--*
+ *
+ *  Each node always sends to the next clockwise node in the ring, and receives
+ *  from the previous one.
+ *
+ *  The allreduce is done in two parts: a scatter-reduce and an allgather. In
+ *  the scatter reduce, a reduction is done, so that each node ends up with a
+ *  chunk of the final output tensor which has contributions from all other
+ *  nodes.  In the allgather, those chunks are distributed among all the nodes,
+ *  so that all nodes have the entire output tensor.
+ *
+ *  Both of these operations are done by dividing the input tensor into N
+ *  evenly sized chunks (where N is the number of nodes in the ring).
+ *
+ *  The scatter-reduce is done in N-1 steps. In the ith step, node j will send
+ *  the (j - i)th chunk and receive the (j - i - 1)th chunk, adding it in to
+ *  its existing data for that chunk. For example, in the first iteration with
+ *  the ring depicted above, you will have the following transfers:
+ *
+ *      Segment 0:  Node 0 --> Node 1
+ *      Segment 1:  Node 1 --> Node 2
+ *      Segment 2:  Node 2 --> Node 3
+ *      Segment 3:  Node 3 --> Node 0
+ *
+ *  In the second iteration, you'll have the following transfers:
+ *
+ *      Segment 0:  Node 1 --> Node 2
+ *      Segment 1:  Node 2 --> Node 3
+ *      Segment 2:  Node 3 --> Node 0
+ *      Segment 3:  Node 0 --> Node 1
+ *
+ *  After this iteration, Node 2 has 3 of the four contributions to Segment 0.
+ *  The last iteration has the following transfers:
+ *
+ *      Segment 0:  Node 2 --> Node 3
+ *      Segment 1:  Node 3 --> Node 0
+ *      Segment 2:  Node 0 --> Node 1
+ *      Segment 3:  Node 1 --> Node 2
+ *
+ *  After this iteration, Node 3 has the fully accumulated Segment 0; Node 0
+ *  has the fully accumulated Segment 1; and so on. The scatter-reduce is
+ * complete.
+ *
+ *  Next, the allgather distributes these fully accumululated chunks across all
+ * nodes. Communication proceeds in the same ring, once again in N-1 steps. At
+ * the ith step, node j will send chunk (j - i + 1) and receive chunk (j - i).
+ * For example, at the first iteration, the following transfers will occur:
+ *
+ *      Segment 0:  Node 3 --> Node 0
+ *      Segment 1:  Node 0 --> Node 1
+ *      Segment 2:  Node 1 --> Node 2
+ *      Segment 3:  Node 2 --> Node 3
+ *
+ * After the first iteration, Node 0 will have a fully accumulated Segment 0
+ * (from Node 3) and Segment 1. In the next iteration, Node 0 will send its
+ * just-received Segment 0 onward to Node 1, and receive Segment 3 from Node 3.
+ * After this has continued for N - 1 iterations, all nodes will have a the
+ * fully accumulated tensor.
+ *
+ * Each node will do (N-1) sends for the scatter-reduce and (N-1) sends for the
+ * allgather. Each send will contain K / N bytes, if there are K bytes in the
+ * original tensor on every node. Thus, each node sends and receives 2K(N - 1)/N
+ * bytes of data, and the performance of the allreduce (assuming no latency in
+ * connections) is constrained by the slowest interconnect between the nodes.
+ *
+ */
+template <typename Device, typename T>
+Status RingAllreduce(OpKernelContext* context, const Tensor* input,
+                     Tensor* temp, Tensor* output) {
+  // Acquire MPI size and rank
+  int n, r;
+  MPI_REQUIRES_OK(MPI_Comm_size(MPI_COMM_WORLD, &n));
+  MPI_REQUIRES_OK(MPI_Comm_rank(MPI_COMM_WORLD, &r));
+
+  T* buffer = (T*)output->tensor_data().data();
+
+  CopyTensorData<Device>((void*)buffer, (void*)input->tensor_data().data(),
+                         output->tensor_data().size());
+
+  // Calculate segment sizes and segment ends
+  const size_t elements_to_reduce = input->NumElements();
+  const size_t segment_size = elements_to_reduce / n;
+  std::vector<size_t> segment_sizes(n, segment_size);
+
+  const size_t residual = elements_to_reduce % n;
+  for (size_t i = 0; i < residual; ++i) {
+    segment_sizes[i]++;
+  }
+
+  std::vector<size_t> segment_starts(n);
+  segment_starts[0] = 0;
+  for (size_t i = 1; i < segment_starts.size(); ++i) {
+    segment_starts[i] = segment_starts[i - 1] + segment_sizes[i - 1];
+  }
+
+  assert(segment_starts[n - 1] + segment_sizes[n - 1] == elements_to_reduce);
+
+  T* segment_recv = (T*)temp->tensor_data().data();
+
+  // Receive from your left neighbor with wrap-around
+  const size_t recv_from = ((r - 1) + n) % n;
+
+  // Send to your right neighbor with wrap-around
+  const size_t send_to = (r + 1) % n;
+
+  MPI_Status recv_status;
+  MPI_Request recv_req;
+
+  // Now start ring. At every step, for every rank, we iterate through
+  // segments with wraparound and send and recv from our neighbors and reduce
+  // locally. At the i'th iteration, rank r, sends segment (r-i) and receives
+  // segment (r-i-1).
+  for (int i = 0; i < n - 1; i++) {
+    const size_t send_seg_id = ((r - i) + n) % n;
+    const size_t recv_seg_id = ((r - i - 1) + n) % n;
+
+    T* segment_send = &(buffer[segment_starts[send_seg_id]]);
+
+    MPI_REQUIRES_OK(MPI_Irecv(segment_recv, segment_sizes[recv_seg_id],
+                              MPIType<T>(), recv_from, TAG_TENSOR,
+                              MPI_COMM_WORLD, &recv_req));
+
+    MPI_REQUIRES_OK(MPI_Send(segment_send, segment_sizes[send_seg_id],
+                             MPIType<T>(), send_to, TAG_TENSOR,
+                             MPI_COMM_WORLD));
+
+    T* segment_update = &(buffer[segment_starts[recv_seg_id]]);
+
+    // Wait for recv to complete before reduction
+    MPI_REQUIRES_OK(MPI_Wait(&recv_req, &recv_status));
+
+    const size_t recv_seg_size = segment_sizes[recv_seg_id];
+    AccumulateTensorData<Device, T>(segment_update, segment_recv,
+                                    recv_seg_size);
+  }
+
+  // Now start pipelined ring allgather. At every step, for every rank, we
+  // iterate through segments with wraparound and send and recv from our
+  // neighbors. At the i'th iteration, rank r, sends segment (r-i+1) and
+  // receives segment (r-i).
+  for (size_t i = 0; i < n - 1; ++i) {
+    const size_t send_seg_id = ((r - i + 1) + n) % n;
+    const size_t recv_seg_id = ((r - i) + n) % n;
+
+    // Segment to send - at every iteration we send segment (r-i+1)
+    T* segment_send = &(buffer[segment_starts[send_seg_id]]);
+
+    // Segment to recv - at every iteration we receive segment (r-i)
+    T* segment_recv = &(buffer[segment_starts[recv_seg_id]]);
+
+    MPI_REQUIRES_OK(MPI_Sendrecv(
+        segment_send, segment_sizes[send_seg_id], MPIType<T>(), send_to,
+        TAG_TENSOR, segment_recv, segment_sizes[recv_seg_id], MPIType<T>(),
+        recv_from, TAG_TENSOR, MPI_COMM_WORLD, &recv_status));
+  }
+
+  return Status::OK();
+}
+
+// Perform a ring allgather on a Tensor. Other ranks may allgather with a
+// tensor which differs in the first dimension only; all other dimensions must
+// be the same.
+//
+// For more information on the ring allgather, read the documentation for the
+// ring allreduce, which includes a ring allgather.
+template <typename Device, typename T>
+Status RingAllgather(OpKernelContext* context, const Tensor* input,
+                     const std::vector<size_t>& sizes, Tensor* output) {
+  // Acquire MPI size and rank
+  int n, r;
+  MPI_REQUIRES_OK(MPI_Comm_size(MPI_COMM_WORLD, &n));
+  MPI_REQUIRES_OK(MPI_Comm_rank(MPI_COMM_WORLD, &r));
+
+  assert(sizes.size() == n);
+  assert(input->dim_size(0) == sizes[r]);
+
+  // Compute number of elements in every "row". We can't compute number of
+  // elements in every chunks, because those chunks are variable length.
+  size_t elements_per_row = 1;
+  for (int i = 1; i < input->shape().dims(); i++) {
+    elements_per_row *= input->dim_size(i);
+  }
+
+  // Copy data from input tensor to correct place in output tensor.
+  std::vector<size_t> segment_starts(n);
+  segment_starts[0] = 0;
+  for (int i = 1; i < n; i++) {
+    segment_starts[i] = segment_starts[i - 1] + elements_per_row * sizes[i - 1];
+  }
+  size_t offset = segment_starts[r];
+
+  // Copy data to the right offset for this rank.
+  T* buffer = (T*)output->tensor_data().data();
+  CopyTensorData<Device>((void*)(buffer + offset),
+                         (void*)input->tensor_data().data(),
+                         elements_per_row * sizes[r] * sizeof(T));
+
+  // Receive from your left neighbor with wrap-around
+  const size_t recv_from = ((r - 1) + n) % n;
+
+  // Send to your right neighbor with wrap-around
+  const size_t send_to = (r + 1) % n;
+
+  // Perform a ring allgather. At every step, for every rank, we iterate
+  // through segments with wraparound and send and recv from our neighbors.
+  // At the i'th iteration, rank r, sends segment (r-i) and receives segment
+  // (r-1-i).
+  MPI_Status recv_status;
+  for (size_t i = 0; i < n - 1; ++i) {
+    const size_t send_seg_id = ((r - i) + n) % n;
+    const size_t recv_seg_id = ((r - i - 1) + n) % n;
+
+    // Segment to send - at every iteration we send segment (r-i)
+    size_t offset_send = segment_starts[send_seg_id];
+    size_t rows_send = sizes[send_seg_id];
+    T* segment_send = &(buffer[offset_send]);
+
+    // Segment to recv - at every iteration we receive segment (r-1-i)
+    size_t offset_recv = segment_starts[recv_seg_id];
+    size_t rows_recv = sizes[recv_seg_id];
+    T* segment_recv = &(buffer[offset_recv]);
+
+    MPI_REQUIRES_OK(MPI_Sendrecv(
+        segment_send, elements_per_row * rows_send, MPIType<T>(), send_to,
+        TAG_TENSOR, segment_recv, elements_per_row * rows_recv, MPIType<T>(),
+        recv_from, TAG_TENSOR, MPI_COMM_WORLD, &recv_status));
+  }
+
+  return Status::OK();
+}
+
+}  // namespace mpi
+}  // namespace contrib
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_USE_MPI
+
+#undef TENSORFLOW_CONTRIB_MPI_H_
+#endif  // TENSORFLOW_CONTRIB_MPI_H_
diff --git a/tensorflow/contrib/nccl/kernels/nccl_manager.cc b/tensorflow/contrib/nccl/kernels/nccl_manager.cc
index b1cb89391c..99fecf9651 100644
--- a/tensorflow/contrib/nccl/kernels/nccl_manager.cc
+++ b/tensorflow/contrib/nccl/kernels/nccl_manager.cc
@@ -445,7 +445,7 @@ void NcclManager::LoopKernelLaunches(NcclStream* nccl_stream) {
   se::Stream* comm_stream = nccl_stream->stream.get();
   ScopedActivateExecutorContext scoped_context(nccl_stream->executor);
   const cudaStream_t* cu_stream = reinterpret_cast<const cudaStream_t*>(
-      comm_stream->implementation()->CudaStreamMemberHack());
+      comm_stream->implementation()->GpuStreamMemberHack());
 
   while (true) {
     // Find collective to run.
diff --git a/tensorflow/contrib/opt/BUILD b/tensorflow/contrib/opt/BUILD
index bbdf962d04..280d4a5492 100644
--- a/tensorflow/contrib/opt/BUILD
+++ b/tensorflow/contrib/opt/BUILD
@@ -27,6 +27,7 @@ py_library(
         "python/training/nadam_optimizer.py",
         "python/training/powersign.py",
         "python/training/reg_adagrad_optimizer.py",
+        "python/training/shampoo.py",
         "python/training/sign_decay.py",
         "python/training/variable_clipping_optimizer.py",
         "python/training/weight_decay_optimizers.py",
@@ -344,3 +345,21 @@ py_test(
         "//third_party/py/numpy",
     ],
 )
+
+py_test(
+    name = "shampoo_test",
+    srcs = ["python/training/shampoo_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":opt_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:variables",
+        "//third_party/py/numpy",
+    ],
+)
diff --git a/tensorflow/contrib/opt/__init__.py b/tensorflow/contrib/opt/__init__.py
index 3e63e99030..9471fb0181 100644
--- a/tensorflow/contrib/opt/__init__.py
+++ b/tensorflow/contrib/opt/__init__.py
@@ -30,10 +30,10 @@ from tensorflow.contrib.opt.python.training.model_average_optimizer import *
 from tensorflow.contrib.opt.python.training.moving_average_optimizer import *
 from tensorflow.contrib.opt.python.training.multitask_optimizer_wrapper import *
 from tensorflow.contrib.opt.python.training.nadam_optimizer import *
+from tensorflow.contrib.opt.python.training.shampoo import *
 from tensorflow.contrib.opt.python.training.weight_decay_optimizers import *
 from tensorflow.contrib.opt.python.training.powersign import *
 from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import *
-from tensorflow.contrib.opt.python.training.weight_decay_optimizers import *
 # pylint: enable=wildcard-import
 
 from tensorflow.python.util.all_util import remove_undocumented
@@ -62,6 +62,7 @@ _allowed_symbols = [
     'ModelAverageOptimizer',
     'ModelAverageCustomGetter',
     'GGTOptimizer',
+    'ShampooOptimizer',
 ]
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/opt/python/training/addsign_test.py b/tensorflow/contrib/opt/python/training/addsign_test.py
index 08d45ed73f..628a735e72 100644
--- a/tensorflow/contrib/opt/python/training/addsign_test.py
+++ b/tensorflow/contrib/opt/python/training/addsign_test.py
@@ -214,7 +214,7 @@ class AddSignTest(test.TestCase):
         # Run 7 steps of AddSign
         # first 4 steps with positive gradient
         # last 3 steps with negative gradient (sign(gm) should be -1)
-        for t in range(1, 4):
+        for t in range(1, 8):
           if t < 5:
             update.run()
           else:
@@ -222,7 +222,7 @@ class AddSignTest(test.TestCase):
 
           var0_np, m0 = addsign_update_numpy(
               var0_np,
-              grads0_np,
+              grads0_np if t < 5 else -grads0_np,
               m0,
               learning_rate,
               alpha=alpha,
@@ -232,7 +232,7 @@ class AddSignTest(test.TestCase):
           )
           var1_np, m1 = addsign_update_numpy(
               var1_np,
-              grads1_np,
+              grads1_np if t < 5 else -grads1_np,
               m1,
               learning_rate,
               alpha=alpha,
diff --git a/tensorflow/contrib/opt/python/training/ggt.py b/tensorflow/contrib/opt/python/training/ggt.py
index 928c453517..cae952d8f5 100644
--- a/tensorflow/contrib/opt/python/training/ggt.py
+++ b/tensorflow/contrib/opt/python/training/ggt.py
@@ -33,7 +33,7 @@ class GGTOptimizer(optimizer_v2.OptimizerV2):
 
   GGT has an advantage over sgd and adam on large models with poor conditioning,
   for example language models and CNNs,
-  see [ABCHSZZ 2018]([pdf](https://arxiv.org/pdf/1806.02958.pdf)).
+  see [[ABCHSZZ 2018]](https://arxiv.org/pdf/1806.02958.pdf).
   """
 
   def __init__(self,
diff --git a/tensorflow/contrib/opt/python/training/powersign_test.py b/tensorflow/contrib/opt/python/training/powersign_test.py
index 5214082dd6..0bcf5d230a 100644
--- a/tensorflow/contrib/opt/python/training/powersign_test.py
+++ b/tensorflow/contrib/opt/python/training/powersign_test.py
@@ -216,7 +216,7 @@ class PowerSignTest(test.TestCase):
         self.assertAllClose([1.0, 2.0], var0.eval())
         self.assertAllClose([3.0, 4.0], var1.eval())
 
-        # Run 3 steps of powersign
+        # Run 7 steps of powersign
         # first 4 steps with positive gradient
         # last 3 steps with negative gradient (sign(gm) should be -1)
         for t in range(1, 8):
diff --git a/tensorflow/contrib/opt/python/training/shampoo.py b/tensorflow/contrib/opt/python/training/shampoo.py
new file mode 100644
index 0000000000..7afa0998f4
--- /dev/null
+++ b/tensorflow/contrib/opt/python/training/shampoo.py
@@ -0,0 +1,463 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""The Shampoo Optimizer.
+
+Variant of Adagrad using one preconditioner matrix per variable dimension.
+For details, see https://arxiv.org/abs/1802.09568
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import linalg_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.platform import tf_logging
+from tensorflow.python.training import optimizer
+
+
+def GetParam(var, timestep):
+  if callable(var):
+    return var(timestep)
+  else:
+    return var
+
+
+class ShampooOptimizer(optimizer.Optimizer):
+  """The Shampoo Optimizer
+
+  Variant of Adagrad using one preconditioner matrix per variable dimension.
+  For details, see https://arxiv.org/abs/1802.09568
+
+  gbar is time-weighted accumulated gradient:
+  gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
+
+  mat_gbar is time-weighted accumulated gradient square:
+  mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1]
+                  + mat_gbar_weight[t] * gg_j[t]
+  where if g[t] = g_abcd then gg_a[t] = g_abcd g_a'bcd (Einstein notation)
+
+  Update rule:
+  w[t+1] = w[t] - learning_rate[t] * Prod_j mat_gbar_j[t]^(-alpha/n) gbar[t]
+     Again, mat_gbar_j[t]^(-alpha) gbar[t] is a tensor contraction along the
+     j'th dimension of gbar[t] with the first dimension of
+     mat_gbar_j[t]^(-alpha/n), where alpha is a hyperparameter,
+     and n = rank of the variable.
+     Prod_j represents doing this contraction for all j in 0..n-1.
+
+  Typically learning_rate is constant, but could be time dependent by passing
+  a lambda function that depends on step.
+  """
+
+  def __init__(self, global_step=0,
+               max_matrix_size=500,
+               gbar_decay=0.0,
+               gbar_weight=1.0,
+               mat_gbar_decay=1.0,
+               mat_gbar_weight=1.0,
+               learning_rate=1.0,
+               svd_interval=1,
+               precond_update_interval=1,
+               epsilon=0.1,
+               alpha=0.5,
+               use_iterative_root=False,
+               use_locking=False,
+               name="Shampoo"):
+    """Default values of the various hyper-parameters.
+
+    gbar_decay, gbar_weight etc. can be a float or a time varying parameter.
+    For time-varying parameters use e.g. "lambda T: T / (T + 1.0)"
+    where the expression in the lambda is a tensorflow expression
+
+    Args:
+      global_step: tensorflow variable indicating the step.
+      max_matrix_size: We do not perform SVD for matrices larger than this.
+      gbar_decay:
+      gbar_weight:  Used to update gbar:
+            gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
+      mat_gbar_decay:
+      mat_gbar_weight:  Used to update mat_gbar:
+           mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1]
+                           + mat_gbar_weight[t] * gg_j[t]
+      learning_rate: Similar to SGD
+      svd_interval: We should do SVD after this many steps. Default = 1, i.e.
+                    every step. Usually 20 leads to no loss of accuracy, and
+                    50 or 100 is also OK. May also want more often early,
+                    and less often later - set in caller as for example:
+                    "svd_interval = lambda(T): tf.cond(
+                        T < 2000, lambda: 20.0, lambda: 1000.0)"
+      precond_update_interval: We should update the preconditioners after
+                               this many steps. Default = 1. Usually less than
+                               svd_interval.
+      epsilon:  epsilon * I_n is added to each mat_gbar_j for stability
+      alpha:  total power of the preconditioners.
+      use_iterative_root: should the optimizer use SVD (faster) or the
+                          iterative root method (for TPU) for finding the
+                          roots of PSD matrices.
+      use_locking:
+      name: name of optimizer.
+    """
+
+    super(ShampooOptimizer, self).__init__(use_locking, name)
+
+    self._global_step = math_ops.to_float(global_step)
+    self._max_matrix_size = max_matrix_size
+    self._gbar_decay = gbar_decay
+    self._gbar_weight = gbar_weight
+    self._mat_gbar_decay = mat_gbar_decay
+    self._mat_gbar_weight = mat_gbar_weight
+    self._learning_rate = learning_rate
+    self._svd_interval = svd_interval
+    self._precond_update_interval = precond_update_interval
+    self._epsilon = epsilon
+    self._alpha = alpha
+    self._use_iterative_root = use_iterative_root
+    self._name = name
+
+  def _create_slots(self, var_list):
+    for v in var_list:
+      with ops.colocate_with(v):
+        _ = self._zeros_slot(v, "gbar", self._name)
+        shape = np.array(v.get_shape())
+        for i, d in enumerate(shape):
+          d_tensor = ops.convert_to_tensor(d)
+          if d < self._max_matrix_size:
+            mat_g_init = array_ops.zeros_like(linalg_ops.eye(d_tensor))
+            if self._svd_interval > 1:
+              _ = self._get_or_make_slot(v, linalg_ops.eye(d_tensor),
+                                         "H_" + str(i), self._name)
+          else:
+            mat_g_init = array_ops.zeros([d_tensor])
+
+          _ = self._get_or_make_slot(v, mat_g_init, "Gbar_" + str(i),
+                                     self._name)
+
+  def _apply_dense(self, grad, var):
+    return self._apply_gradient(grad, var)
+
+  def _apply_sparse(self, grad, var):
+    if var.get_shape()[0] < self._max_matrix_size or self._gbar_decay != 0.0:
+      # The dimension is small enough, we can make the variable dense and
+      # do a dense update
+      dense_grad = array_ops.scatter_nd(
+          array_ops.expand_dims(grad.indices, axis=1),
+          grad.values, array_ops.shape(var, out_type=grad.indices.dtype))
+      return self._apply_gradient(dense_grad, var)
+    return self._apply_gradient(grad.values, var, grad.indices)
+
+  def _weighted_average(self, var, weight, weight_t, rest):
+    """Computes exponential weighted average: var = weight_t * var + rest.
+
+    Important to ensure that var does not occur in rest, otherwise
+    we can get race conditions in a distributed setting.
+
+    Args:
+      var: variable to be updated
+      weight: parameter to be checked. If it is a constant, we can optimize.
+      weight_t: current value of parameter, used for weighting
+      rest: the remaining tensor to be added
+
+    Returns:
+      updated variable.
+    """
+    if weight == 0.0:
+      return rest       # no need to update var, we will never use it.
+    if weight == 1.0:   # common case
+      return state_ops.assign_add(var, rest)
+    # The op below can cause race conditions in a distributed setting,
+    # since computing weight_t * var + rest can take some time, during
+    # which var may be set by another worker. To prevent this, it should
+    # be implemented as a C++ op.
+    return var.assign_add((weight_t - 1) * var + rest)
+
+  def _update_mat_g(self, mat_g, grad, axes, mat_gbar_decay,
+                    mat_gbar_weight, i):
+    """Updates the cumulative outer products of the gradients.
+
+    Args:
+      mat_g: the matrix to be updated
+      grad: the gradient of the variable
+      axes: a list of k-1 integers 0 to k-1, except i
+      mat_gbar_decay: constant for weighted average:
+          mat_g = mat_g * decay + grad * weight
+      mat_gbar_weight: constant for weighted average
+      i: index of dimension to be updated.
+
+    Returns:
+      updated mat_g = mat_g * mat_gbar_decay + grad_outer * mat_gbar_weight
+
+    In Einstein notation if i = 0: grad_outer_aa'= g_abcd g_a'bcd
+    thus grad_outer is a matrix d_i x d_i, where d_i is the size of the
+    i'th dimension of g.
+    Alternate view: If mat_i(grad) is the flattening of grad to a
+    d_i x (d_1d_2...d_{i-1}d_{i+1}...d_k) matrix, then
+         grad_outer = mat_i(grad) mat_i(grad).transpose
+    """
+    grad_outer = math_ops.tensordot(grad, grad, axes=(axes, axes),
+                                    name="grad_outer_" + str(i))
+    return self._weighted_average(mat_g, self._mat_gbar_decay, mat_gbar_decay,
+                                  mat_gbar_weight * grad_outer)
+
+  def _compute_power_svd(self, var, mat_g, mat_g_size, alpha, mat_h_slot_name):
+    """Computes mat_h = mat_g^alpha using svd. mat_g is a symmetric PSD matrix.
+
+    Args:
+      var: the variable we are updating.
+      mat_g: the symmetric PSD matrix whose power it to be computed
+      mat_g_size: size of mat_g
+      alpha: a real number
+      mat_h_slot_name: name of slot to store the power, if needed.
+
+    Returns:
+      mat_h = mat_g^alpha
+
+    Stores mat_h in the appropriate slot, if it exists.
+    Note that mat_g is PSD. So we could use linalg_ops.self_adjoint_eig.
+    """
+    if mat_g_size == 1:
+      mat_h = math_ops.pow(mat_g + self._epsilon, alpha)
+    else:
+      damping = self._epsilon * linalg_ops.eye(math_ops.to_int32(mat_g_size))
+      diag_d, mat_u, mat_v = linalg_ops.svd(mat_g + damping, full_matrices=True)
+      mat_h = math_ops.matmul(
+          mat_v * math_ops.pow(math_ops.maximum(diag_d, self._epsilon), alpha),
+          array_ops.transpose(mat_u))
+    if mat_h_slot_name is not None:
+      return state_ops.assign(self.get_slot(var, mat_h_slot_name), mat_h)
+    return mat_h
+
+  def _compute_power_iter(self, var, mat_g, mat_g_size, alpha, mat_h_slot_name,
+                          iter_count=100, epsilon=1e-6):
+    """Computes mat_g^alpha, where alpha = -1/p, p a positive integer.
+
+    We use an iterative Schur-Newton method from equation 3.2 on page 9 of:
+
+    A Schur-Newton Method for the Matrix p-th Root and its Inverse
+    by Chun-Hua Guo and Nicholas J. Higham
+    SIAM Journal on Matrix Analysis and Applications,
+    2006, Vol. 28, No. 3 : pp. 788-804
+    https://pdfs.semanticscholar.org/0abe/7f77433cf5908bfe2b79aa91af881da83858.pdf
+
+    Args:
+      var: the variable we are updating.
+      mat_g: the symmetric PSD matrix whose power it to be computed
+      mat_g_size: size of mat_g.
+      alpha: exponent, must be -1/p for p a positive integer.
+      mat_h_slot_name: name of slot to store the power, if needed.
+      iter_count: Maximum number of iterations.
+      epsilon: accuracy indicator, useful for early termination.
+
+    Returns:
+      mat_g^alpha
+    """
+
+    identity = linalg_ops.eye(math_ops.to_int32(mat_g_size))
+
+    def MatPower(mat_m, p):
+      """Computes mat_m^p, for p a positive integer.
+
+      Power p is known at graph compile time, so no need for loop and cond.
+      Args:
+        mat_m: a square matrix
+        p: a positive integer
+
+      Returns:
+        mat_m^p
+      """
+      assert p == int(p) and p > 0
+      power = None
+      while p > 0:
+        if p % 2 == 1:
+          power = math_ops.matmul(mat_m, power) if power is not None else mat_m
+        p //= 2
+        mat_m = math_ops.matmul(mat_m, mat_m)
+      return power
+
+    def IterCondition(i, mat_m, _):
+      return math_ops.logical_and(
+          i < iter_count,
+          math_ops.reduce_max(math_ops.abs(mat_m - identity)) > epsilon)
+
+    def IterBody(i, mat_m, mat_x):
+      mat_m_i = (1 - alpha) * identity + alpha * mat_m
+      return (i + 1, math_ops.matmul(MatPower(mat_m_i, -1.0/alpha), mat_m),
+              math_ops.matmul(mat_x, mat_m_i))
+
+    if mat_g_size == 1:
+      mat_h = math_ops.pow(mat_g + self._epsilon, alpha)
+    else:
+      damped_mat_g = mat_g + self._epsilon * identity
+      z = (1 - 1/alpha) / (2 * linalg_ops.norm(damped_mat_g, ord=2))
+      # The best value for z is
+      # (1 - 1/alpha) * (c_max^{-alpha} - c_min^{-alpha}) /
+      #                 (c_max^{1-alpha} - c_min^{1-alpha})
+      # where c_max and c_min are the largest and smallest singular values of
+      # damped_mat_g.
+      # The above estimate assumes that c_max > c_min * 2^p. (p = -1/alpha)
+      # Can replace above line by the one below, but it is less accurate,
+      # hence needs more iterations to converge.
+      # z = (1 - 1/alpha) / math_ops.trace(damped_mat_g)
+      # If we want the method to always converge, use z = 1 / norm(damped_mat_g)
+      # or z = 1 / math_ops.trace(damped_mat_g), but these can result in many
+      # extra iterations.
+      _, _, mat_h = control_flow_ops.while_loop(
+          IterCondition, IterBody,
+          [0, damped_mat_g * z, identity * math_ops.pow(z, -alpha)])
+    if mat_h_slot_name is not None:
+      return state_ops.assign(self.get_slot(var, mat_h_slot_name), mat_h)
+    return mat_h
+
+  def _compute_power(self, var, mat_g, mat_g_size, alpha, mat_h_slot_name=None):
+    """Just a switch between the iterative power vs svd."""
+    if self._use_iterative_root:
+      return self._compute_power_iter(var, mat_g, mat_g_size, alpha,
+                                      mat_h_slot_name)
+    else:
+      return self._compute_power_svd(var, mat_g, mat_g_size, alpha,
+                                     mat_h_slot_name)
+
+  def _apply_gradient(self, grad, var, indices=None):
+    """The main function to update a variable.
+
+    Args:
+      grad: A Tensor containing gradient to apply.
+      var: A Tensor containing the variable to update.
+      indices: An array of integers, for sparse update.
+
+    Returns:
+      Updated variable var = var - learning_rate * preconditioner * grad
+
+    If the gradient is dense, var and grad have the same shape.
+    If the update is sparse, then the first dimension of the gradient and var
+    may differ, others are all the same. In this case the indices array
+    provides the set of indices of the variable which are to be updated with
+    each row of the gradient.
+    """
+    global_step = self._global_step + 1
+
+    # Update accumulated weighted average of gradients
+    gbar = self.get_slot(var, "gbar")
+    gbar_decay_t = GetParam(self._gbar_decay, global_step)
+    gbar_weight_t = GetParam(self._gbar_weight, global_step)
+    if indices is not None:
+      # Note - the sparse update is not easily implemented, since the
+      # algorithm needs all indices of gbar to be updated
+      # if mat_gbar_decay != 1 or mat_gbar_decay != 0.
+      # One way to make mat_gbar_decay = 1 is by rescaling.
+      # If we want the update:
+      #         G_{t+1} = a_{t+1} G_t + b_{t+1} w_t
+      # define:
+      #         r_{t+1} = a_{t+1} * r_t
+      #         h_t = G_t / r_t
+      # Then:
+      #         h_{t+1} = h_t + (b_{t+1} / r_{t+1}) * w_t
+      # So we get the mat_gbar_decay = 1 as desired.
+      # We can implement this in a future version as needed.
+      # However we still need gbar_decay = 0, otherwise all indices
+      # of the variable will need to be updated.
+      if self._gbar_decay != 0.0:
+        tf_logging.warning("Not applying momentum for variable: %s" % var.name)
+      gbar_updated = grad
+    else:
+      gbar_updated = self._weighted_average(gbar, self._gbar_decay,
+                                            gbar_decay_t,
+                                            gbar_weight_t * grad)
+
+    # Update the preconditioners and compute the preconditioned gradient
+    shape = var.get_shape()
+    mat_g_list = []
+    for i in range(len(shape)):
+      mat_g_list.append(self.get_slot(var, "Gbar_" + str(i)))
+    mat_gbar_decay_t = GetParam(self._mat_gbar_decay, global_step)
+    mat_gbar_weight_t = GetParam(self._mat_gbar_weight, global_step)
+
+    preconditioned_grad = gbar_updated
+    v_rank = len(mat_g_list)
+    neg_alpha = - GetParam(self._alpha, global_step) / v_rank
+    svd_interval = GetParam(self._svd_interval, global_step)
+    precond_update_interval = GetParam(self._precond_update_interval,
+                                       global_step)
+    for i, mat_g in enumerate(mat_g_list):
+      # axes is the list of indices to reduce - everything but the current i.
+      axes = list(range(i)) + list(range(i+1, v_rank))
+      if shape[i] < self._max_matrix_size:
+        # If the tensor size is sufficiently small perform full Shampoo update
+        # Note if precond_update_interval > 1 and mat_gbar_decay_t != 1, this
+        # is not strictly correct. However we will use it for now, and
+        # fix if needed. (G_1 = aG + bg ==> G_n = a^n G + (1+a+..+a^{n-1})bg)
+
+        # pylint: disable=g-long-lambda,cell-var-from-loop
+        mat_g_updated = control_flow_ops.cond(
+            math_ops.mod(global_step, precond_update_interval) < 1,
+            lambda: self._update_mat_g(
+                mat_g, grad, axes, mat_gbar_decay_t,
+                mat_gbar_weight_t * precond_update_interval, i),
+            lambda: mat_g)
+
+        if self._svd_interval == 1:
+          mat_h = self._compute_power(var, mat_g_updated, shape[i], neg_alpha)
+        else:
+          mat_h = control_flow_ops.cond(
+              math_ops.mod(global_step, svd_interval) < 1,
+              lambda: self._compute_power(var, mat_g_updated, shape[i],
+                                          neg_alpha, "H_" + str(i)),
+              lambda: self.get_slot(var, "H_" + str(i)))
+
+        # mat_h is a square matrix of size d_i x d_i
+        # preconditioned_grad is a d_i x ... x d_n x d_0 x ... d_{i-1} tensor
+        # After contraction with a d_i x d_i tensor
+        # it becomes a d_{i+1} x ... x d_n x d_0 x ... d_i tensor
+        # (the first dimension is contracted out, and the second dimension of
+        # mat_h is appended).  After going through all the indices, it becomes
+        # a d_0 x ... x d_n tensor again.
+        preconditioned_grad = math_ops.tensordot(preconditioned_grad, mat_h,
+                                                 axes=([0], [0]),
+                                                 name="precond_" + str(i))
+      else:
+        # Tensor size is too large -- perform diagonal Shampoo update
+        grad_outer = math_ops.reduce_sum(grad * grad, axis=axes)
+        if i == 0 and indices is not None:
+          assert self._mat_gbar_decay == 1.0
+          mat_g_updated = state_ops.scatter_add(mat_g, indices,
+                                                mat_gbar_weight_t * grad_outer)
+          mat_h = math_ops.pow(
+              array_ops.gather(mat_g_updated, indices) + self._epsilon,
+              neg_alpha)
+        else:
+          mat_g_updated = self._weighted_average(mat_g,
+                                                 self._mat_gbar_decay,
+                                                 mat_gbar_decay_t,
+                                                 mat_gbar_weight_t * grad_outer)
+          mat_h = math_ops.pow(mat_g_updated + self._epsilon, neg_alpha)
+
+        # Need to do the transpose to ensure that the tensor becomes
+        # a d_{i+1} x ... x d_n x d_0 x ... d_i tensor as described above.
+        preconditioned_grad = array_ops.transpose(
+            preconditioned_grad, perm=list(range(1, v_rank)) + [0]) * mat_h
+
+    # Update the variable based on the Shampoo update
+    learning_rate_t = GetParam(self._learning_rate, global_step)
+    if indices is not None:
+      var_updated = state_ops.scatter_sub(var, indices,
+                                          learning_rate_t * preconditioned_grad)
+    else:
+      var_updated = state_ops.assign_sub(var,
+                                         learning_rate_t * preconditioned_grad)
+    return var_updated
diff --git a/tensorflow/contrib/opt/python/training/shampoo_test.py b/tensorflow/contrib/opt/python/training/shampoo_test.py
new file mode 100644
index 0000000000..3148d02296
--- /dev/null
+++ b/tensorflow/contrib/opt/python/training/shampoo_test.py
@@ -0,0 +1,669 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Functional tests for AdaMoo optimizer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.opt.python.training import shampoo
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+
+TOLERANCE = 1e-3
+
+
+def np_power(mat_g, alpha):
+  """Computes mat_g^alpha for a square symmetric matrix mat_g."""
+
+  mat_u, diag_d, mat_v = np.linalg.svd(mat_g)
+  diag_d = np.power(diag_d, alpha)
+  return np.dot(np.dot(mat_u, np.diag(diag_d)), mat_v)
+
+
+class ShampooTest(test.TestCase):
+
+  def testBasicVector(self):
+    """Similar to the full Adagrad update."""
+
+    size = 20
+    init_var_np = np.zeros(size)
+    grad_np = np.random.rand(size)
+    grad_np_2 = np.random.rand(size)
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = constant_op.constant(grad_np, dtype=dtypes.float32)
+      grad_2 = constant_op.constant(grad_np_2, dtype=dtypes.float32)
+
+      opt = shampoo.ShampooOptimizer(global_step)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      update_2 = opt.apply_gradients(zip([grad_2], [var]),
+                                     global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * mat_g^{-0.5} * grad
+      # lr = 1
+      mat_g = np.outer(grad_np, grad_np)
+      mat_h = np_power(mat_g + 0.1 * np.eye(size), -0.5)
+      new_val_np = init_var_np - np.dot(mat_h, grad_np)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+      # Run another step of Shampoo
+      update_2.run()
+      new_val = sess.run(var)
+
+      mat_g += np.outer(grad_np_2, grad_np_2)
+      mat_h = np_power(mat_g + 0.1 * np.eye(size), -0.5)
+      new_val_np -= np.dot(mat_h, grad_np_2)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+  def testBasicMatrix(self):
+    """Check update when gradient is a matrix."""
+    size = [10, 5]
+    init_var_np = np.zeros(size)
+    grad_np = np.random.rand(size[0], size[1])
+    grad_np_2 = np.random.rand(size[0], size[1])
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = constant_op.constant(grad_np, dtype=dtypes.float32)
+      grad_2 = constant_op.constant(grad_np_2, dtype=dtypes.float32)
+
+      opt = shampoo.ShampooOptimizer(global_step)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      update_2 = opt.apply_gradients(zip([grad_2], [var]),
+                                     global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * mat_g1^{-0.25} * grad * mat_g2^{-0.25}
+      # lr = 1
+      mat_g1 = np.dot(grad_np, grad_np.transpose())
+      mat_left = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.25)
+      mat_g2 = np.dot(grad_np.transpose(), grad_np)
+      mat_right = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.25)
+      new_val_np = init_var_np - np.dot(np.dot(mat_left, grad_np), mat_right)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+      # Run another step of Shampoo
+      update_2.run()
+      new_val = sess.run(var)
+
+      mat_g1 += np.dot(grad_np_2, grad_np_2.transpose())
+      mat_left = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.25)
+      mat_g2 += np.dot(grad_np_2.transpose(), grad_np_2)
+      mat_right = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.25)
+      new_val_np -= np.dot(np.dot(mat_left, grad_np_2), mat_right)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+  def _testBasicTensor(self, use_iterative_root):
+    """Check update when gradient is a tensor."""
+    size = [10, 5, 7]
+    init_var_np = np.zeros(size)
+    grad_np = np.random.rand(size[0], size[1], size[2])
+    grad_np_2 = np.random.rand(size[0], size[1], size[2])
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = constant_op.constant(grad_np, dtype=dtypes.float32)
+      grad_2 = constant_op.constant(grad_np_2, dtype=dtypes.float32)
+
+      opt = shampoo.ShampooOptimizer(global_step,
+                                     use_iterative_root=use_iterative_root)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      update_2 = opt.apply_gradients(zip([grad_2], [var]),
+                                     global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * Prod_i mat_g_i^{-0.5/3} grad
+      # lr = 1
+      mat_g1 = np.tensordot(grad_np, grad_np, axes=([1, 2], [1, 2]))
+      mat_g1_a = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.5/3.0)
+      mat_g2 = np.tensordot(grad_np, grad_np, axes=([0, 2], [0, 2]))
+      mat_g2_a = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.5/3.0)
+      mat_g3 = np.tensordot(grad_np, grad_np, axes=([0, 1], [0, 1]))
+      mat_g3_a = np_power(mat_g3 + 0.1 * np.eye(size[2]), -0.5/3.0)
+
+      precond_grad = np.tensordot(grad_np, mat_g1_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g2_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g3_a, axes=([0], [0]))
+      new_val_np = init_var_np - precond_grad
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+      # Run another step of Shampoo
+      update_2.run()
+      new_val = sess.run(var)
+
+      mat_g1 += np.tensordot(grad_np_2, grad_np_2, axes=([1, 2], [1, 2]))
+      mat_g1_a = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.5/3.0)
+      mat_g2 += np.tensordot(grad_np_2, grad_np_2, axes=([0, 2], [0, 2]))
+      mat_g2_a = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.5/3.0)
+      mat_g3 += np.tensordot(grad_np_2, grad_np_2, axes=([0, 1], [0, 1]))
+      mat_g3_a = np_power(mat_g3 + 0.1 * np.eye(size[2]), -0.5/3.0)
+
+      precond_grad = np.tensordot(grad_np_2, mat_g1_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g2_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g3_a, axes=([0], [0]))
+      new_val_np -= precond_grad
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+  def testBasicTensor(self):
+    for use_iterative_root in [True, False]:
+      self._testBasicTensor(use_iterative_root)
+
+  def testLargeVector(self):
+    """This is just the diagonal Adagrad update."""
+
+    size = 2000
+    init_var_np = np.zeros(size)
+    grad_np = np.random.rand(size)
+    grad_np_2 = np.random.rand(size)
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = constant_op.constant(grad_np, dtype=dtypes.float32)
+      grad_2 = constant_op.constant(grad_np_2, dtype=dtypes.float32)
+
+      opt = shampoo.ShampooOptimizer(global_step)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      update_2 = opt.apply_gradients(zip([grad_2], [var]),
+                                     global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * gg^{-0.5} * grad
+      # lr = 1
+      mat_g = grad_np * grad_np + 0.1
+      new_val_np = init_var_np - np.power(mat_g, -0.5) * grad_np
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val)
+
+      # Run another step of Shampoo
+      update_2.run()
+      new_val = sess.run(var)
+
+      mat_g += grad_np_2 * grad_np_2
+      new_val_np -= np.power(mat_g, -0.5) * grad_np_2
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val)
+
+  def testLargeMatrix(self):
+    """Gradient is a matrix, one of whose dimensions is large.
+
+    We do diagonal updates for large dimensions.
+    """
+
+    size = [2000, 3]
+    init_var_np = np.zeros(size)
+    grad_np = np.random.rand(size[0], size[1])
+    grad_np_2 = np.random.rand(size[0], size[1])
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = constant_op.constant(grad_np, dtype=dtypes.float32)
+      grad_2 = constant_op.constant(grad_np_2, dtype=dtypes.float32)
+
+      opt = shampoo.ShampooOptimizer(global_step)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      update_2 = opt.apply_gradients(zip([grad_2], [var]),
+                                     global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * mat_left * grad * mat_right
+      # where the mat_left * grad is just element-wise product,
+      # with broadcasting
+      # lr = 1
+
+      mat_g1 = np.sum(grad_np * grad_np, axis=1, keepdims=True)
+      mat_left = np.power(mat_g1 + 0.1, -0.25)
+      mat_g2 = np.dot(grad_np.transpose(), grad_np)
+      mat_right = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.25)
+      new_val_np = init_var_np - np.dot(grad_np * mat_left, mat_right)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+      # Run another step of Shampoo
+      update_2.run()
+      new_val = sess.run(var)
+
+      mat_g1 += np.sum(grad_np_2 * grad_np_2, axis=1, keepdims=True)
+      mat_left = np.power(mat_g1 + 0.1, -0.25)
+      mat_g2 += np.dot(grad_np_2.transpose(), grad_np_2)
+      mat_right = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.25)
+      new_val_np -= np.dot(grad_np_2 * mat_left, mat_right)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+  def testSparseUpdateLarge(self):
+    """Check update when gradient is of type IndexSlices.
+
+    We do diagonal updates for the first dimension, unless it is very small.
+    """
+
+    size = [2000, 3]
+    sample_size_1 = 100
+    init_var_np = np.zeros(size)
+    grad_indices = np.sort(np.random.choice(np.arange(size[0]), sample_size_1,
+                                            replace=False))
+    grad_np = np.random.rand(sample_size_1, size[1])
+
+    sample_size_2 = 7
+    grad_indices_2 = np.sort(np.random.choice(np.arange(size[0]), sample_size_2,
+                                              replace=False))
+    grad_np_2 = np.random.rand(sample_size_2, size[1])
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = ops.IndexedSlices(
+          constant_op.constant(grad_np, dtype=dtypes.float32),
+          constant_op.constant(grad_indices),
+          constant_op.constant(size))
+      grad_2 = ops.IndexedSlices(
+          constant_op.constant(grad_np_2, dtype=dtypes.float32),
+          constant_op.constant(grad_indices_2),
+          constant_op.constant(size))
+
+      opt = shampoo.ShampooOptimizer(global_step)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      update_2 = opt.apply_gradients(zip([grad_2], [var]),
+                                     global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * mat_left * grad * mat_right
+      # where the mat_left * grad is just element-wise product,
+      # with broadcasting
+      # lr = 1
+      # In this case the update lr * mat_left * grad * mat_right is
+      # of size 10 x 2.
+      # So the correct indices of var need to be updated.
+
+      mat_g1 = np.sum(grad_np * grad_np, axis=1, keepdims=True)
+      mat_g1_acc = np.zeros((size[0], 1))
+      mat_g1_acc[grad_indices] += mat_g1
+      mat_left = np.power(mat_g1 + 0.1, -0.25)
+      mat_g2 = np.dot(grad_np.transpose(), grad_np)
+      mat_right = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.25)
+      new_val_np = init_var_np
+      new_val_np[grad_indices, :] -= np.dot(grad_np * mat_left, mat_right)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+      # Run another step of Shampoo
+      update_2.run()
+      new_val = sess.run(var)
+
+      mat_g1 = np.sum(grad_np_2 * grad_np_2, axis=1, keepdims=True)
+      mat_g1_acc[grad_indices_2] += mat_g1
+      mat_left = np.power(mat_g1_acc[grad_indices_2] + 0.1, -0.25)
+      mat_g2 += np.dot(grad_np_2.transpose(), grad_np_2)
+      mat_right = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.25)
+      new_val_np[grad_indices_2, :] -= np.dot(grad_np_2 * mat_left, mat_right)
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+  def _testSparseUpdateSmall(self, use_iterative_root):
+    """Gradient is of type IndexSlices, but the first dimension is small.
+
+    We create dense gradient and do the full update with SVD etc.
+
+    Args:
+      use_iterative_root: use iterative power method or SVD to find nth roots.
+    """
+
+    size = [100, 3, 5]
+    sample_size = 10
+    init_var_np = np.zeros(size)
+    grad_indices = np.sort(np.random.choice(np.arange(size[0]), sample_size,
+                                            replace=False))
+    grad_np = np.random.rand(sample_size, size[1], size[2])
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = ops.IndexedSlices(
+          constant_op.constant(grad_np, dtype=dtypes.float32),
+          constant_op.constant(grad_indices),
+          constant_op.constant(size))
+
+      opt = shampoo.ShampooOptimizer(global_step,
+                                     use_iterative_root=use_iterative_root)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * Prod_i mat_g_i^{-0.125} grad
+      # lr = 1
+      grad_dense = np.zeros_like(init_var_np)
+      grad_dense[grad_indices] = grad_np
+
+      mat_g1 = np.tensordot(grad_dense, grad_dense, axes=([1, 2], [1, 2]))
+      mat_g1_a = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.5/3.0)
+      mat_g2 = np.tensordot(grad_dense, grad_dense, axes=([0, 2], [0, 2]))
+      mat_g2_a = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.5/3.0)
+      mat_g3 = np.tensordot(grad_dense, grad_dense, axes=([0, 1], [0, 1]))
+      mat_g3_a = np_power(mat_g3 + 0.1 * np.eye(size[2]), -0.5/3.0)
+
+      precond_grad = np.tensordot(grad_dense, mat_g1_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g2_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g3_a, axes=([0], [0]))
+      new_val_np = init_var_np - precond_grad
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+  def testSparseUpdateSmall(self):
+    for use_iterative_root in [True, False]:
+      self._testSparseUpdateSmall(use_iterative_root)
+
+  def _testBasicTensorWithMomentum(self, use_iterative_root):
+    """Check update with momentum when gradient is a tensor.
+
+    Args:
+      use_iterative_root: use iterative power method or SVD to find nth roots.
+    """
+    size = [10, 5, 7]
+    init_var_np = np.zeros(size)
+    grad_np = np.random.rand(size[0], size[1], size[2])
+    grad_np_2 = np.random.rand(size[0], size[1], size[2])
+    gbar_decay = 0.9
+    gbar_weight = 0.1
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = constant_op.constant(grad_np, dtype=dtypes.float32)
+      grad_2 = constant_op.constant(grad_np_2, dtype=dtypes.float32)
+
+      opt = shampoo.ShampooOptimizer(global_step, gbar_decay=gbar_decay,
+                                     gbar_weight=gbar_weight,
+                                     use_iterative_root=use_iterative_root)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      update_2 = opt.apply_gradients(zip([grad_2], [var]),
+                                     global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      # Run a step of Shampoo
+      update.run()
+      new_val = sess.run(var)
+
+      # let up compute this in numpy
+      # Update rule is var = var - lr * Prod_i mat_g_i^{-0.5/3} grad
+      # lr = 1
+      mat_g1 = np.tensordot(grad_np, grad_np, axes=([1, 2], [1, 2]))
+      mat_g1_a = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.5/3.0)
+      mat_g2 = np.tensordot(grad_np, grad_np, axes=([0, 2], [0, 2]))
+      mat_g2_a = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.5/3.0)
+      mat_g3 = np.tensordot(grad_np, grad_np, axes=([0, 1], [0, 1]))
+      mat_g3_a = np_power(mat_g3 + 0.1 * np.eye(size[2]), -0.5/3.0)
+
+      gbar_np = gbar_weight * grad_np
+      precond_grad = np.tensordot(gbar_np, mat_g1_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g2_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g3_a, axes=([0], [0]))
+      new_val_np = init_var_np - precond_grad
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+      # Run another step of Shampoo
+      update_2.run()
+      new_val = sess.run(var)
+
+      mat_g1 += np.tensordot(grad_np_2, grad_np_2, axes=([1, 2], [1, 2]))
+      mat_g1_a = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.5/3.0)
+      mat_g2 += np.tensordot(grad_np_2, grad_np_2, axes=([0, 2], [0, 2]))
+      mat_g2_a = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.5/3.0)
+      mat_g3 += np.tensordot(grad_np_2, grad_np_2, axes=([0, 1], [0, 1]))
+      mat_g3_a = np_power(mat_g3 + 0.1 * np.eye(size[2]), -0.5/3.0)
+
+      gbar_np_2 = gbar_decay * gbar_np + gbar_weight * grad_np_2
+      precond_grad = np.tensordot(gbar_np_2, mat_g1_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g2_a, axes=([0], [0]))
+      precond_grad = np.tensordot(precond_grad, mat_g3_a, axes=([0], [0]))
+      new_val_np -= precond_grad
+
+      self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                         atol=TOLERANCE, rtol=TOLERANCE)
+
+  def testBasicTensorWithMomentum(self):
+    for use_iterative_root in [True, False]:
+      self._testBasicTensorWithMomentum(use_iterative_root)
+
+  def _testDelayedSVD(self, use_iterative_root):
+    """Performing the SVD every nth step.
+
+    Args:
+      use_iterative_root: use iterative power method or SVD to find nth roots.
+    """
+    size = [10, 5, 7]
+    init_var_np = np.zeros(size).astype(np.float32)
+    iterations = 20
+    svd_interval = 5
+    grad_np = np.random.rand(
+        iterations, size[0], size[1], size[2]).astype(np.float32)
+    mat_g1_a = np.eye(size[0])
+    mat_g1 = np.zeros_like(mat_g1_a)
+    mat_g2_a = np.eye(size[1])
+    mat_g2 = np.zeros_like(mat_g2_a)
+    mat_g3_a = np.eye(size[2])
+    mat_g3 = np.zeros_like(mat_g3_a)
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = array_ops.placeholder(dtypes.float32, shape=size)
+
+      opt = shampoo.ShampooOptimizer(global_step, svd_interval=svd_interval,
+                                     use_iterative_root=use_iterative_root)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+      new_val_np = init_var_np
+
+      # Run n steps of Shampoo
+      for i in range(iterations):
+        _ = sess.run(update, feed_dict={grad: grad_np[i]})
+        new_val = sess.run(var)
+
+        # let up compute this in numpy
+        # Update rule is var = var - lr * Prod_i mat_g_i^{-0.5/3} grad
+        # lr = 1
+        mat_g1 += np.tensordot(grad_np[i], grad_np[i], axes=([1, 2], [1, 2]))
+        mat_g2 += np.tensordot(grad_np[i], grad_np[i], axes=([0, 2], [0, 2]))
+        mat_g3 += np.tensordot(grad_np[i], grad_np[i], axes=([0, 1], [0, 1]))
+        if (i + 1) % svd_interval == 0:
+          mat_g1_a = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.5/3.0)
+          mat_g2_a = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.5/3.0)
+          mat_g3_a = np_power(mat_g3 + 0.1 * np.eye(size[2]), -0.5/3.0)
+
+        precond_grad = np.tensordot(grad_np[i], mat_g1_a, axes=([0], [0]))
+        precond_grad = np.tensordot(precond_grad, mat_g2_a, axes=([0], [0]))
+        precond_grad = np.tensordot(precond_grad, mat_g3_a, axes=([0], [0]))
+        new_val_np -= precond_grad
+
+        self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                           atol=TOLERANCE, rtol=TOLERANCE)
+
+  def testDelayedSVD(self):
+    for use_iterative_root in [True, False]:
+      self._testDelayedSVD(use_iterative_root)
+
+  def _testDelayedPrecondUpdate(self, use_iterative_root):
+    """Update the squared sum every nth step, drop the other steps.
+
+    Args:
+      use_iterative_root: use iterative power method or SVD to find nth roots.
+    """
+    size = [10, 5, 7]
+    init_var_np = np.zeros(size).astype(np.float32)
+    iterations = 100
+    grad_np = np.random.rand(
+        iterations, size[0], size[1], size[2]).astype(np.float32)
+    svd_interval = 20
+    precond_update_interval = 5
+    mat_g1_a = np.eye(size[0])
+    mat_g1 = np.zeros_like(mat_g1_a)
+    mat_g2_a = np.eye(size[1])
+    mat_g2 = np.zeros_like(mat_g2_a)
+    mat_g3_a = np.eye(size[2])
+    mat_g3 = np.zeros_like(mat_g3_a)
+
+    with self.test_session() as sess:
+      global_step = variables.Variable(0, dtype=dtypes.int64)
+      var = variables.Variable(init_var_np, dtype=dtypes.float32)
+      grad = array_ops.placeholder(dtypes.float32, shape=size)
+
+      opt = shampoo.ShampooOptimizer(
+          global_step, svd_interval=svd_interval,
+          precond_update_interval=precond_update_interval,
+          use_iterative_root=use_iterative_root)
+      update = opt.apply_gradients(zip([grad], [var]),
+                                   global_step=global_step)
+      variables.global_variables_initializer().run()
+
+      init_val = sess.run(var)
+      self.assertAllCloseAccordingToType(init_var_np, init_val)
+      new_val_np = init_var_np
+
+      # Run n steps of Shampoo
+      for i in range(iterations):
+        _ = sess.run(update, feed_dict={grad: grad_np[i]})
+        new_val = sess.run(var)
+
+        # let up compute this in numpy
+        # Update rule is var = var - lr * Prod_i mat_g_i^{-0.5/3} grad
+        # lr = 1
+        if (i + 1) % precond_update_interval == 0:
+          mat_g1 += (np.tensordot(grad_np[i], grad_np[i], axes=([1, 2], [1, 2]))
+                     * precond_update_interval)
+          mat_g2 += (np.tensordot(grad_np[i], grad_np[i], axes=([0, 2], [0, 2]))
+                     * precond_update_interval)
+          mat_g3 += (np.tensordot(grad_np[i], grad_np[i], axes=([0, 1], [0, 1]))
+                     * precond_update_interval)
+
+        if (i + 1) % svd_interval == 0:
+          mat_g1_a = np_power(mat_g1 + 0.1 * np.eye(size[0]), -0.5/3.0)
+          mat_g2_a = np_power(mat_g2 + 0.1 * np.eye(size[1]), -0.5/3.0)
+          mat_g3_a = np_power(mat_g3 + 0.1 * np.eye(size[2]), -0.5/3.0)
+
+        precond_grad = np.tensordot(grad_np[i], mat_g1_a, axes=([0], [0]))
+        precond_grad = np.tensordot(precond_grad, mat_g2_a, axes=([0], [0]))
+        precond_grad = np.tensordot(precond_grad, mat_g3_a, axes=([0], [0]))
+        new_val_np -= precond_grad
+
+        self.assertAllCloseAccordingToType(new_val_np, new_val,
+                                           atol=TOLERANCE, rtol=TOLERANCE)
+
+  def testDelayedPrecondUpdate(self):
+    for use_iterative_root in [True, False]:
+      self._testDelayedPrecondUpdate(use_iterative_root)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/optimizer_v2/BUILD b/tensorflow/contrib/optimizer_v2/BUILD
index 5225ecc14f..3ba3ee29ec 100644
--- a/tensorflow/contrib/optimizer_v2/BUILD
+++ b/tensorflow/contrib/optimizer_v2/BUILD
@@ -193,6 +193,7 @@ cuda_py_test(
     srcs = ["rmsprop_test.py"],
     additional_deps = [
         ":training",
+        "@absl_py//absl/testing:parameterized",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:embedding_ops",
         "//tensorflow/python:framework",
diff --git a/tensorflow/contrib/optimizer_v2/checkpointable_utils_test.py b/tensorflow/contrib/optimizer_v2/checkpointable_utils_test.py
index 06ab58188a..28a531dfec 100644
--- a/tensorflow/contrib/optimizer_v2/checkpointable_utils_test.py
+++ b/tensorflow/contrib/optimizer_v2/checkpointable_utils_test.py
@@ -41,6 +41,7 @@ from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import template
 from tensorflow.python.ops import variable_scope
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as core_saver
 from tensorflow.python.training import training_util
 from tensorflow.python.training.checkpointable import tracking
@@ -278,7 +279,8 @@ class CheckpointingTests(test.TestCase):
       root = util.Checkpoint(
           optimizer=optimizer, model=model,
           optimizer_step=training_util.get_or_create_global_step())
-      root.restore(core_saver.latest_checkpoint(checkpoint_directory))
+      root.restore(checkpoint_management.latest_checkpoint(
+          checkpoint_directory))
       for _ in range(num_training_steps):
         # TODO(allenl): Use a Dataset and serialize/checkpoint it.
         input_value = constant_op.constant([[3.]])
@@ -306,7 +308,8 @@ class CheckpointingTests(test.TestCase):
           train_op = optimizer.minimize(
               model(input_value),
               global_step=root.global_step)
-          checkpoint_path = core_saver.latest_checkpoint(checkpoint_directory)
+          checkpoint_path = checkpoint_management.latest_checkpoint(
+              checkpoint_directory)
           with self.test_session(graph=ops.get_default_graph()) as session:
             status = root.restore(save_path=checkpoint_path)
             status.initialize_or_restore(session=session)
@@ -339,7 +342,8 @@ class CheckpointingTests(test.TestCase):
         root = util.Checkpoint(
             optimizer=optimizer, model=model,
             global_step=training_util.get_or_create_global_step())
-        checkpoint_path = core_saver.latest_checkpoint(checkpoint_directory)
+        checkpoint_path = checkpoint_management.latest_checkpoint(
+            checkpoint_directory)
         status = root.restore(save_path=checkpoint_path)
         input_value = constant_op.constant([[3.]])
         train_fn = functools.partial(
@@ -372,7 +376,8 @@ class CheckpointingTests(test.TestCase):
         root = util.Checkpoint(
             optimizer=optimizer, model=model,
             global_step=training_util.get_or_create_global_step())
-        checkpoint_path = core_saver.latest_checkpoint(checkpoint_directory)
+        checkpoint_path = checkpoint_management.latest_checkpoint(
+            checkpoint_directory)
         status = root.restore(save_path=checkpoint_path)
         def train_fn():
           @function.defun
diff --git a/tensorflow/contrib/optimizer_v2/optimizer_v2_test.py b/tensorflow/contrib/optimizer_v2/optimizer_v2_test.py
index ec033c4a01..a44bfd1bfd 100644
--- a/tensorflow/contrib/optimizer_v2/optimizer_v2_test.py
+++ b/tensorflow/contrib/optimizer_v2/optimizer_v2_test.py
@@ -38,12 +38,8 @@ class OptimizerTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes
   def testBasic(self):
     for i, dtype in enumerate([dtypes.half, dtypes.float32, dtypes.float64]):
-      # Note that we name the variables uniquely here since the variables don't
-      # seem to be getting deleted at the end of the loop.
-      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype,
-                                                    name='a_%d' % i)
-      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype,
-                                                    name='b_%d' % i)
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
       def loss():
         return 5 * var0 + 3 * var1  # pylint: disable=cell-var-from-loop
       # Note that for eager execution, minimize expects a function instead of a
@@ -131,12 +127,8 @@ class OptimizerTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes
   def testNoGradients(self):
     for i, dtype in enumerate([dtypes.half, dtypes.float32, dtypes.float64]):
-      # Note that we name the variables uniquely here since the variables don't
-      # seem to be getting deleted at the end of the loop.
-      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype,
-                                                    name='a%d' % i)
-      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype,
-                                                    name='b%d' % i)
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
       # pylint: disable=cell-var-from-loop
       def loss():
         return 5 * var0
@@ -149,12 +141,8 @@ class OptimizerTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes
   def testNoGradientsForAnyVariables_Minimize(self):
     for i, dtype in enumerate([dtypes.half, dtypes.float32, dtypes.float64]):
-      # Note that we name the variables uniquely here since the variables don't
-      # seem to be getting deleted at the end of the loop.
-      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype,
-                                                    name='a_%d' % i)
-      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype,
-                                                    name='b_%d' % i)
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
       def loss():
         return constant_op.constant(5.0)
       sgd_op = gradient_descent.GradientDescentOptimizer(3.0)
@@ -165,12 +153,8 @@ class OptimizerTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes
   def testNoGradientsForAnyVariables_ApplyGradients(self):
     for i, dtype in enumerate([dtypes.half, dtypes.float32, dtypes.float64]):
-      # Note that we name the variables uniquely here since the variables don't
-      # seem to be getting deleted at the end of the loop.
-      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype,
-                                                    name='a_%d' % i)
-      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype,
-                                                    name='b_%d' % i)
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
       sgd_op = gradient_descent.GradientDescentOptimizer(3.0)
       with self.assertRaisesRegexp(ValueError,
                                    'No gradients provided for any variable'):
@@ -179,12 +163,8 @@ class OptimizerTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes
   def testGradientsAsVariables(self):
     for i, dtype in enumerate([dtypes.half, dtypes.float32, dtypes.float64]):
-      # Note that we name the variables uniquely here since the variables don't
-      # seem to be getting deleted at the end of the loop.
-      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype,
-                                                    name='a%d' % i)
-      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype,
-                                                    name='b%d' % i)
+      var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+      var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
       def loss():
         return 5 * var0 + 3 * var1  # pylint: disable=cell-var-from-loop
       sgd_op = gradient_descent.GradientDescentOptimizer(3.0)
diff --git a/tensorflow/contrib/optimizer_v2/rmsprop_test.py b/tensorflow/contrib/optimizer_v2/rmsprop_test.py
index ed68f6afbf..dc23ef241a 100644
--- a/tensorflow/contrib/optimizer_v2/rmsprop_test.py
+++ b/tensorflow/contrib/optimizer_v2/rmsprop_test.py
@@ -19,15 +19,16 @@ from __future__ import division
 from __future__ import print_function
 
 import copy
-import itertools
 import math
 
+from absl.testing import parameterized
 import numpy as np
 
 from tensorflow.contrib.optimizer_v2 import rmsprop
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
 from tensorflow.python.ops import embedding_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
@@ -48,13 +49,8 @@ _TEST_PARAM_VALUES = [
     [0.5, 0.95, 0.9, 1e-5, True, False],
 ]
 
-_TESTPARAMS = [
-    [data_type] + values
-    for data_type, values in itertools.product(_DATA_TYPES, _TEST_PARAM_VALUES)
-]
-
 
-class RMSPropOptimizerTest(test.TestCase):
+class RMSPropOptimizerTest(test.TestCase, parameterized.TestCase):
 
   def _rmsprop_update_numpy(self, var, g, mg, rms, mom, lr, decay, momentum,
                             epsilon, centered):
@@ -87,362 +83,366 @@ class RMSPropOptimizerTest(test.TestCase):
       var_t[gindex] = var[gindex] - mom_t[gindex]
     return var_t, mg_t, rms_t, mom_t
 
-  def testDense(self):
-    # TODO(yori): Use ParameterizedTest when available
-    for (dtype, learning_rate, decay, momentum,
-         epsilon, centered, use_resource) in _TESTPARAMS:
-      with self.test_session(use_gpu=True):
-        # Initialize variables for numpy implementation.
-        var0_np = np.array([1.0, 2.0], dtype=dtype.as_numpy_dtype)
-        grads0_np = np.array([0.1, 0.2], dtype=dtype.as_numpy_dtype)
-        var1_np = np.array([3.0, 4.0], dtype=dtype.as_numpy_dtype)
-        grads1_np = np.array([0.01, 0.2], dtype=dtype.as_numpy_dtype)
-
-        if use_resource:
-          var0 = resource_variable_ops.ResourceVariable(var0_np)
-          var1 = resource_variable_ops.ResourceVariable(var1_np)
-        else:
-          var0 = variables.Variable(var0_np)
-          var1 = variables.Variable(var1_np)
-        grads0 = constant_op.constant(grads0_np)
-        grads1 = constant_op.constant(grads1_np)
-        opt = rmsprop.RMSPropOptimizer(
-            learning_rate=learning_rate,
-            decay=decay,
-            momentum=momentum,
-            epsilon=epsilon,
-            centered=centered)
-
-        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-
-        mg0 = opt.get_slot(var0, "mg")
-        self.assertEqual(mg0 is not None, centered)
-        mg1 = opt.get_slot(var1, "mg")
-        self.assertEqual(mg1 is not None, centered)
-        rms0 = opt.get_slot(var0, "rms")
-        self.assertTrue(rms0 is not None)
-        rms1 = opt.get_slot(var1, "rms")
-        self.assertTrue(rms1 is not None)
-        mom0 = opt.get_slot(var0, "momentum")
-        self.assertTrue(mom0 is not None)
-        mom1 = opt.get_slot(var1, "momentum")
-        self.assertTrue(mom1 is not None)
-
-        mg0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-        mg1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-        rms0_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
-        rms1_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
-        mom0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-        mom1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-
-        # Fetch params to validate initial values
-        self.assertAllClose([1.0, 2.0], var0.eval())
-        self.assertAllClose([3.0, 4.0], var1.eval())
-
-        # Run 4 steps of RMSProp
-        for _ in range(1, 5):
-          update.run()
-
-          var0_np, mg0_np, rms0_np, mom0_np = self._rmsprop_update_numpy(
-              var0_np, grads0_np, mg0_np, rms0_np, mom0_np, learning_rate,
-              decay, momentum, epsilon, centered)
-          var1_np, mg1_np, rms1_np, mom1_np = self._rmsprop_update_numpy(
-              var1_np, grads1_np, mg1_np, rms1_np, mom1_np, learning_rate,
-              decay, momentum, epsilon, centered)
-
-          # Validate updated params
-          if centered:
-            self.assertAllCloseAccordingToType(mg0_np, mg0.eval())
-            self.assertAllCloseAccordingToType(mg1_np, mg1.eval())
-          self.assertAllCloseAccordingToType(rms0_np, rms0.eval())
-          self.assertAllCloseAccordingToType(rms1_np, rms1.eval())
-          self.assertAllCloseAccordingToType(mom0_np, mom0.eval())
-          self.assertAllCloseAccordingToType(mom1_np, mom1.eval())
-          self.assertAllCloseAccordingToType(var0_np, var0.eval())
-          self.assertAllCloseAccordingToType(var1_np, var1.eval())
-
-  def testMinimizeSparseResourceVariable(self):
-    for dtype in [dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = resource_variable_ops.ResourceVariable([[1.0, 2.0]], dtype=dtype)
-        x = constant_op.constant([[4.0], [5.0]], dtype=dtype)
-        pred = math_ops.matmul(embedding_ops.embedding_lookup([var0], [0]), x)
-        loss = pred * pred
-        sgd_op = rmsprop.RMSPropOptimizer(
-            learning_rate=1.0,
-            decay=0.0,
-            momentum=0.0,
-            epsilon=0.0,
-            centered=False).minimize(loss)
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([[1.0, 2.0]], var0.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        self.assertAllCloseAccordingToType(
-            [[0., 1.]], var0.eval(), atol=0.01)
-
-  def testMinimizeSparseResourceVariableCentered(self):
-    for dtype in [dtypes.float32, dtypes.float64]:
-      with self.test_session():
-        var0 = resource_variable_ops.ResourceVariable([[1.0, 2.0]], dtype=dtype)
-        x = constant_op.constant([[4.0], [5.0]], dtype=dtype)
-        pred = math_ops.matmul(embedding_ops.embedding_lookup([var0], [0]), x)
-        loss = pred * pred
-        sgd_op = rmsprop.RMSPropOptimizer(
-            learning_rate=1.0,
-            decay=0.0,
-            momentum=0.0,
-            epsilon=1.0,
-            centered=True).minimize(loss)
-        variables.global_variables_initializer().run()
-        # Fetch params to validate initial values
-        self.assertAllCloseAccordingToType([[1.0, 2.0]], var0.eval())
-        # Run 1 step of sgd
-        sgd_op.run()
-        # Validate updated params
-        self.assertAllCloseAccordingToType(
-            [[-111, -138]], var0.eval(), atol=0.01)
-
-  def testSparse(self):
-    # TODO(yori): Use ParameterizedTest when available
-    for (dtype, learning_rate, decay,
-         momentum, epsilon, centered, _) in _TESTPARAMS:
-      with self.test_session(use_gpu=True):
-        # Initialize variables for numpy implementation.
-        var0_np = np.array([1.0, 2.0], dtype=dtype.as_numpy_dtype)
-        grads0_np = np.array([0.1], dtype=dtype.as_numpy_dtype)
-        var1_np = np.array([3.0, 4.0], dtype=dtype.as_numpy_dtype)
-        grads1_np = np.array([0.01], dtype=dtype.as_numpy_dtype)
-
+  @parameterized.named_parameters(
+      *test_util.generate_combinations_with_testcase_name(
+          dtype=_DATA_TYPES, param_value=_TEST_PARAM_VALUES))
+  def testDense(self, dtype, param_value):
+    (learning_rate, decay, momentum, epsilon, centered, use_resource) = tuple(
+        param_value)
+    with self.test_session(use_gpu=True):
+      # Initialize variables for numpy implementation.
+      var0_np = np.array([1.0, 2.0], dtype=dtype.as_numpy_dtype)
+      grads0_np = np.array([0.1, 0.2], dtype=dtype.as_numpy_dtype)
+      var1_np = np.array([3.0, 4.0], dtype=dtype.as_numpy_dtype)
+      grads1_np = np.array([0.01, 0.2], dtype=dtype.as_numpy_dtype)
+
+      if use_resource:
+        var0 = resource_variable_ops.ResourceVariable(var0_np)
+        var1 = resource_variable_ops.ResourceVariable(var1_np)
+      else:
         var0 = variables.Variable(var0_np)
         var1 = variables.Variable(var1_np)
-        grads0_np_indices = np.array([0], dtype=np.int32)
-        grads0 = ops.IndexedSlices(
-            constant_op.constant(grads0_np),
-            constant_op.constant(grads0_np_indices), constant_op.constant([1]))
-        grads1_np_indices = np.array([1], dtype=np.int32)
-        grads1 = ops.IndexedSlices(
-            constant_op.constant(grads1_np),
-            constant_op.constant(grads1_np_indices), constant_op.constant([1]))
-        opt = rmsprop.RMSPropOptimizer(
-            learning_rate=learning_rate,
-            decay=decay,
-            momentum=momentum,
-            epsilon=epsilon,
-            centered=centered)
-        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-
-        mg0 = opt.get_slot(var0, "mg")
-        self.assertEqual(mg0 is not None, centered)
-        mg1 = opt.get_slot(var1, "mg")
-        self.assertEqual(mg1 is not None, centered)
-        rms0 = opt.get_slot(var0, "rms")
-        self.assertTrue(rms0 is not None)
-        rms1 = opt.get_slot(var1, "rms")
-        self.assertTrue(rms1 is not None)
-        mom0 = opt.get_slot(var0, "momentum")
-        self.assertTrue(mom0 is not None)
-        mom1 = opt.get_slot(var1, "momentum")
-        self.assertTrue(mom1 is not None)
-
-        mg0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-        mg1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-        rms0_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
-        rms1_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
-        mom0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-        mom1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
-
-        # Fetch params to validate initial values
-        self.assertAllClose([1.0, 2.0], var0.eval())
-        self.assertAllClose([3.0, 4.0], var1.eval())
-
-        # Run 4 steps of RMSProp
-        for _ in range(1, 5):
-          update.run()
-
-          var0_np, mg0_np, rms0_np, mom0_np = self._sparse_rmsprop_update_numpy(
-              var0_np, grads0_np_indices, grads0_np, mg0_np, rms0_np, mom0_np,
-              learning_rate, decay, momentum, epsilon, centered)
-          var1_np, mg1_np, rms1_np, mom1_np = self._sparse_rmsprop_update_numpy(
-              var1_np, grads1_np_indices, grads1_np, mg1_np, rms1_np, mom1_np,
-              learning_rate, decay, momentum, epsilon, centered)
-
-          # Validate updated params
-          if centered:
-            self.assertAllCloseAccordingToType(mg0_np, mg0.eval())
-            self.assertAllCloseAccordingToType(mg1_np, mg1.eval())
-          self.assertAllCloseAccordingToType(rms0_np, rms0.eval())
-          self.assertAllCloseAccordingToType(rms1_np, rms1.eval())
-          self.assertAllCloseAccordingToType(mom0_np, mom0.eval())
-          self.assertAllCloseAccordingToType(mom1_np, mom1.eval())
-          self.assertAllCloseAccordingToType(var0_np, var0.eval())
-          self.assertAllCloseAccordingToType(var1_np, var1.eval())
-
-  def testWithoutMomentum(self):
-    for dtype in [dtypes.half, dtypes.float32]:
-      with self.test_session(use_gpu=True):
-        var0 = variables.Variable([1.0, 2.0], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-        opt = rmsprop.RMSPropOptimizer(
-            learning_rate=2.0, decay=0.9, momentum=0.0, epsilon=1.0)
-        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-
-        rms0 = opt.get_slot(var0, "rms")
-        self.assertTrue(rms0 is not None)
-        rms1 = opt.get_slot(var1, "rms")
-        self.assertTrue(rms1 is not None)
-        mom0 = opt.get_slot(var0, "momentum")
-        self.assertTrue(mom0 is not None)
-        mom1 = opt.get_slot(var1, "momentum")
-        self.assertTrue(mom1 is not None)
-
-        # Fetch params to validate initial values
-        self.assertAllClose([1.0, 2.0], var0.eval())
-        self.assertAllClose([3.0, 4.0], var1.eval())
-        # Step 1: the rms accumulators where 1. So we should see a normal
-        # update: v -= grad * learning_rate
-        update.run()
-        # Check the root mean square accumulators.
-        self.assertAllCloseAccordingToType(
-            np.array([0.901, 0.901]), rms0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([0.90001, 0.90001]), rms1.eval())
-        # Check the parameters.
-        self.assertAllCloseAccordingToType(
-            np.array([
-                1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)),
-                2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0))
-            ]), var0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([
-                3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)),
-                4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0))
-            ]), var1.eval())
-        # Step 2: the root mean square accumulators contain the previous update.
-        update.run()
-        # Check the rms accumulators.
-        self.assertAllCloseAccordingToType(
-            np.array([0.901 * 0.9 + 0.001, 0.901 * 0.9 + 0.001]), rms0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([0.90001 * 0.9 + 1e-5, 0.90001 * 0.9 + 1e-5]), rms1.eval())
-        # Check the parameters.
-        self.assertAllCloseAccordingToType(
-            np.array([
-                1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)) -
-                (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1.0)),
-                2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)) -
-                (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1.0))
-            ]), var0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([
-                3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)) -
-                (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 1e-5 + 1.0)),
-                4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)) -
-                (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 1e-5 + 1.0))
-            ]), var1.eval())
-
-  def testWithMomentum(self):
-    for dtype in [dtypes.half, dtypes.float32]:
-      with self.test_session(use_gpu=True):
-        var0 = variables.Variable([1.0, 2.0], dtype=dtype)
-        var1 = variables.Variable([3.0, 4.0], dtype=dtype)
-        grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
-        grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
-
-        opt = rmsprop.RMSPropOptimizer(
-            learning_rate=2.0, decay=0.9, momentum=0.5, epsilon=1e-5)
-        update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
-        variables.global_variables_initializer().run()
-
-        rms0 = opt.get_slot(var0, "rms")
-        self.assertTrue(rms0 is not None)
-        rms1 = opt.get_slot(var1, "rms")
-        self.assertTrue(rms1 is not None)
-        mom0 = opt.get_slot(var0, "momentum")
-        self.assertTrue(mom0 is not None)
-        mom1 = opt.get_slot(var1, "momentum")
-        self.assertTrue(mom1 is not None)
-
-        # Fetch params to validate initial values
-        self.assertAllClose([1.0, 2.0], var0.eval())
-        self.assertAllClose([3.0, 4.0], var1.eval())
-        # Step 1: rms = 1, mom = 0. So we should see a normal
-        # update: v -= grad * learning_rate
+      grads0 = constant_op.constant(grads0_np)
+      grads1 = constant_op.constant(grads1_np)
+      opt = rmsprop.RMSPropOptimizer(
+          learning_rate=learning_rate,
+          decay=decay,
+          momentum=momentum,
+          epsilon=epsilon,
+          centered=centered)
+
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      mg0 = opt.get_slot(var0, "mg")
+      self.assertEqual(mg0 is not None, centered)
+      mg1 = opt.get_slot(var1, "mg")
+      self.assertEqual(mg1 is not None, centered)
+      rms0 = opt.get_slot(var0, "rms")
+      self.assertIsNotNone(rms0)
+      rms1 = opt.get_slot(var1, "rms")
+      self.assertIsNotNone(rms1)
+      mom0 = opt.get_slot(var0, "momentum")
+      self.assertIsNotNone(mom0)
+      mom1 = opt.get_slot(var1, "momentum")
+      self.assertIsNotNone(mom1)
+
+      mg0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+      mg1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+      rms0_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
+      rms1_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
+      mom0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+      mom1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+
+      # Fetch params to validate initial values
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([3.0, 4.0], var1.eval())
+
+      # Run 4 steps of RMSProp
+      for _ in range(4):
         update.run()
-        # Check the root mean square accumulators.
-        self.assertAllCloseAccordingToType(
-            np.array([0.901, 0.901]), rms0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([0.90001, 0.90001]), rms1.eval())
-        # Check the momentum accumulators
-        self.assertAllCloseAccordingToType(
-            np.array([(0.1 * 2.0 / math.sqrt(0.901 + 1e-5)),
-                      (0.1 * 2.0 / math.sqrt(0.901 + 1e-5))]), mom0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([(0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)),
-                      (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5))]), mom1.eval())
-
-        # Check that the parameters.
-        self.assertAllCloseAccordingToType(
-            np.array([
-                1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)),
-                2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5))
-            ]), var0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([
-                3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)),
-                4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5))
-            ]), var1.eval())
-
-        # Step 2: the root mean square accumulators contain the previous update.
+
+        var0_np, mg0_np, rms0_np, mom0_np = self._rmsprop_update_numpy(
+            var0_np, grads0_np, mg0_np, rms0_np, mom0_np, learning_rate,
+            decay, momentum, epsilon, centered)
+        var1_np, mg1_np, rms1_np, mom1_np = self._rmsprop_update_numpy(
+            var1_np, grads1_np, mg1_np, rms1_np, mom1_np, learning_rate,
+            decay, momentum, epsilon, centered)
+
+        # Validate updated params
+        if centered:
+          self.assertAllCloseAccordingToType(mg0_np, mg0.eval())
+          self.assertAllCloseAccordingToType(mg1_np, mg1.eval())
+        self.assertAllCloseAccordingToType(rms0_np, rms0.eval())
+        self.assertAllCloseAccordingToType(rms1_np, rms1.eval())
+        self.assertAllCloseAccordingToType(mom0_np, mom0.eval())
+        self.assertAllCloseAccordingToType(mom1_np, mom1.eval())
+        self.assertAllCloseAccordingToType(var0_np, var0.eval())
+        self.assertAllCloseAccordingToType(var1_np, var1.eval())
+
+  @parameterized.parameters([dtypes.float32, dtypes.float64])
+  def testMinimizeSparseResourceVariable(self, dtype):
+    with self.test_session():
+      var0 = resource_variable_ops.ResourceVariable([[1.0, 2.0]], dtype=dtype)
+      x = constant_op.constant([[4.0], [5.0]], dtype=dtype)
+      pred = math_ops.matmul(embedding_ops.embedding_lookup([var0], [0]), x)
+      loss = pred * pred
+      sgd_op = rmsprop.RMSPropOptimizer(
+          learning_rate=1.0,
+          decay=0.0,
+          momentum=0.0,
+          epsilon=0.0,
+          centered=False).minimize(loss)
+      variables.global_variables_initializer().run()
+      # Fetch params to validate initial values
+      self.assertAllCloseAccordingToType([[1.0, 2.0]], var0.eval())
+      # Run 1 step of sgd
+      sgd_op.run()
+      # Validate updated params
+      self.assertAllCloseAccordingToType(
+          [[0., 1.]], var0.eval(), atol=0.01)
+
+  @parameterized.parameters([dtypes.float32, dtypes.float64])
+  def testMinimizeSparseResourceVariableCentered(self, dtype):
+    with self.test_session():
+      var0 = resource_variable_ops.ResourceVariable([[1.0, 2.0]], dtype=dtype)
+      x = constant_op.constant([[4.0], [5.0]], dtype=dtype)
+      pred = math_ops.matmul(embedding_ops.embedding_lookup([var0], [0]), x)
+      loss = pred * pred
+      sgd_op = rmsprop.RMSPropOptimizer(
+          learning_rate=1.0,
+          decay=0.0,
+          momentum=0.0,
+          epsilon=1.0,
+          centered=True).minimize(loss)
+      variables.global_variables_initializer().run()
+      # Fetch params to validate initial values
+      self.assertAllCloseAccordingToType([[1.0, 2.0]], var0.eval())
+      # Run 1 step of sgd
+      sgd_op.run()
+      # Validate updated params
+      self.assertAllCloseAccordingToType(
+          [[-111, -138]], var0.eval(), atol=0.01)
+
+  @parameterized.named_parameters(
+      *test_util.generate_combinations_with_testcase_name(
+          dtype=_DATA_TYPES, param_value=_TEST_PARAM_VALUES))
+  def testSparse(self, dtype, param_value):
+    (learning_rate, decay, momentum, epsilon, centered, _) = tuple(
+        param_value)
+    with self.test_session(use_gpu=True):
+      # Initialize variables for numpy implementation.
+      var0_np = np.array([1.0, 2.0], dtype=dtype.as_numpy_dtype)
+      grads0_np = np.array([0.1], dtype=dtype.as_numpy_dtype)
+      var1_np = np.array([3.0, 4.0], dtype=dtype.as_numpy_dtype)
+      grads1_np = np.array([0.01], dtype=dtype.as_numpy_dtype)
+
+      var0 = variables.Variable(var0_np)
+      var1 = variables.Variable(var1_np)
+      grads0_np_indices = np.array([0], dtype=np.int32)
+      grads0 = ops.IndexedSlices(
+          constant_op.constant(grads0_np),
+          constant_op.constant(grads0_np_indices), constant_op.constant([1]))
+      grads1_np_indices = np.array([1], dtype=np.int32)
+      grads1 = ops.IndexedSlices(
+          constant_op.constant(grads1_np),
+          constant_op.constant(grads1_np_indices), constant_op.constant([1]))
+      opt = rmsprop.RMSPropOptimizer(
+          learning_rate=learning_rate,
+          decay=decay,
+          momentum=momentum,
+          epsilon=epsilon,
+          centered=centered)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      mg0 = opt.get_slot(var0, "mg")
+      self.assertEqual(mg0 is not None, centered)
+      mg1 = opt.get_slot(var1, "mg")
+      self.assertEqual(mg1 is not None, centered)
+      rms0 = opt.get_slot(var0, "rms")
+      self.assertIsNotNone(rms0)
+      rms1 = opt.get_slot(var1, "rms")
+      self.assertIsNotNone(rms1)
+      mom0 = opt.get_slot(var0, "momentum")
+      self.assertIsNotNone(mom0)
+      mom1 = opt.get_slot(var1, "momentum")
+      self.assertIsNotNone(mom1)
+
+      mg0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+      mg1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+      rms0_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
+      rms1_np = np.array([1.0, 1.0], dtype=dtype.as_numpy_dtype)
+      mom0_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+      mom1_np = np.array([0.0, 0.0], dtype=dtype.as_numpy_dtype)
+
+      # Fetch params to validate initial values
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([3.0, 4.0], var1.eval())
+
+      # Run 4 steps of RMSProp
+      for _ in range(4):
         update.run()
-        # Check the rms accumulators.
-        self.assertAllCloseAccordingToType(
-            np.array([0.901 * 0.9 + 0.001, 0.901 * 0.9 + 0.001]), rms0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([0.90001 * 0.9 + 1e-5, 0.90001 * 0.9 + 1e-5]), rms1.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([
-                0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
-                (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5)),
-                0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
-                (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5))
-            ]), mom0.eval())
-        self.assertAllCloseAccordingToType(
-            np.array([
-                0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
-                (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5)),
-                0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
-                (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5))
-            ]), mom1.eval())
-
-        # Check the parameters.
-        self.assertAllCloseAccordingToType(
-            np.array([
-                1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) -
-                (0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
-                 (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5))),
-                2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) -
-                (0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
-                 (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5)))
-            ]), var0.eval())
-
-        self.assertAllCloseAccordingToType(
-            np.array([
-                3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) -
-                (0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
-                 (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5))),
-                4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) -
-                (0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
-                 (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5)))
-            ]), var1.eval())
+
+        var0_np, mg0_np, rms0_np, mom0_np = self._sparse_rmsprop_update_numpy(
+            var0_np, grads0_np_indices, grads0_np, mg0_np, rms0_np, mom0_np,
+            learning_rate, decay, momentum, epsilon, centered)
+        var1_np, mg1_np, rms1_np, mom1_np = self._sparse_rmsprop_update_numpy(
+            var1_np, grads1_np_indices, grads1_np, mg1_np, rms1_np, mom1_np,
+            learning_rate, decay, momentum, epsilon, centered)
+
+        # Validate updated params
+        if centered:
+          self.assertAllCloseAccordingToType(mg0_np, mg0.eval())
+          self.assertAllCloseAccordingToType(mg1_np, mg1.eval())
+        self.assertAllCloseAccordingToType(rms0_np, rms0.eval())
+        self.assertAllCloseAccordingToType(rms1_np, rms1.eval())
+        self.assertAllCloseAccordingToType(mom0_np, mom0.eval())
+        self.assertAllCloseAccordingToType(mom1_np, mom1.eval())
+        self.assertAllCloseAccordingToType(var0_np, var0.eval())
+        self.assertAllCloseAccordingToType(var1_np, var1.eval())
+
+  @parameterized.parameters(_DATA_TYPES)
+  def testWithoutMomentum(self, dtype):
+    with self.test_session(use_gpu=True):
+      var0 = variables.Variable([1.0, 2.0], dtype=dtype)
+      var1 = variables.Variable([3.0, 4.0], dtype=dtype)
+      grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+      grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+      opt = rmsprop.RMSPropOptimizer(
+          learning_rate=2.0, decay=0.9, momentum=0.0, epsilon=1.0)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      rms0 = opt.get_slot(var0, "rms")
+      self.assertIsNotNone(rms0)
+      rms1 = opt.get_slot(var1, "rms")
+      self.assertIsNotNone(rms1)
+      mom0 = opt.get_slot(var0, "momentum")
+      self.assertIsNotNone(mom0)
+      mom1 = opt.get_slot(var1, "momentum")
+      self.assertIsNotNone(mom1)
+
+      # Fetch params to validate initial values
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([3.0, 4.0], var1.eval())
+      # Step 1: the rms accumulators where 1. So we should see a normal
+      # update: v -= grad * learning_rate
+      update.run()
+      # Check the root mean square accumulators.
+      self.assertAllCloseAccordingToType(
+          np.array([0.901, 0.901]), rms0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([0.90001, 0.90001]), rms1.eval())
+      # Check the parameters.
+      self.assertAllCloseAccordingToType(
+          np.array([
+              1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)),
+              2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0))
+          ]), var0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([
+              3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)),
+              4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0))
+          ]), var1.eval())
+      # Step 2: the root mean square accumulators contain the previous update.
+      update.run()
+      # Check the rms accumulators.
+      self.assertAllCloseAccordingToType(
+          np.array([0.901 * 0.9 + 0.001, 0.901 * 0.9 + 0.001]), rms0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([0.90001 * 0.9 + 1e-5, 0.90001 * 0.9 + 1e-5]), rms1.eval())
+      # Check the parameters.
+      self.assertAllCloseAccordingToType(
+          np.array([
+              1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)) -
+              (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1.0)),
+              2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1.0)) -
+              (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1.0))
+          ]), var0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([
+              3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)) -
+              (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 1e-5 + 1.0)),
+              4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1.0)) -
+              (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 1e-5 + 1.0))
+          ]), var1.eval())
+
+  @parameterized.parameters(_DATA_TYPES)
+  def testWithMomentum(self, dtype):
+    with self.test_session(use_gpu=True):
+      var0 = variables.Variable([1.0, 2.0], dtype=dtype)
+      var1 = variables.Variable([3.0, 4.0], dtype=dtype)
+      grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+      grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+
+      opt = rmsprop.RMSPropOptimizer(
+          learning_rate=2.0, decay=0.9, momentum=0.5, epsilon=1e-5)
+      update = opt.apply_gradients(zip([grads0, grads1], [var0, var1]))
+      variables.global_variables_initializer().run()
+
+      rms0 = opt.get_slot(var0, "rms")
+      self.assertIsNotNone(rms0)
+      rms1 = opt.get_slot(var1, "rms")
+      self.assertIsNotNone(rms1)
+      mom0 = opt.get_slot(var0, "momentum")
+      self.assertIsNotNone(mom0)
+      mom1 = opt.get_slot(var1, "momentum")
+      self.assertIsNotNone(mom1)
+
+      # Fetch params to validate initial values
+      self.assertAllClose([1.0, 2.0], var0.eval())
+      self.assertAllClose([3.0, 4.0], var1.eval())
+      # Step 1: rms = 1, mom = 0. So we should see a normal
+      # update: v -= grad * learning_rate
+      update.run()
+      # Check the root mean square accumulators.
+      self.assertAllCloseAccordingToType(
+          np.array([0.901, 0.901]), rms0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([0.90001, 0.90001]), rms1.eval())
+      # Check the momentum accumulators
+      self.assertAllCloseAccordingToType(
+          np.array([(0.1 * 2.0 / math.sqrt(0.901 + 1e-5)),
+                    (0.1 * 2.0 / math.sqrt(0.901 + 1e-5))]), mom0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([(0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)),
+                    (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5))]), mom1.eval())
+
+      # Check that the parameters.
+      self.assertAllCloseAccordingToType(
+          np.array([
+              1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)),
+              2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5))
+          ]), var0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([
+              3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)),
+              4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5))
+          ]), var1.eval())
+
+      # Step 2: the root mean square accumulators contain the previous update.
+      update.run()
+      # Check the rms accumulators.
+      self.assertAllCloseAccordingToType(
+          np.array([0.901 * 0.9 + 0.001, 0.901 * 0.9 + 0.001]), rms0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([0.90001 * 0.9 + 1e-5, 0.90001 * 0.9 + 1e-5]), rms1.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([
+              0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
+              (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5)),
+              0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
+              (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5))
+          ]), mom0.eval())
+      self.assertAllCloseAccordingToType(
+          np.array([
+              0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
+              (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5)),
+              0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
+              (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5))
+          ]), mom1.eval())
+
+      # Check the parameters.
+      self.assertAllCloseAccordingToType(
+          np.array([
+              1.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) -
+              (0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
+               (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5))),
+              2.0 - (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) -
+              (0.5 * (0.1 * 2.0 / math.sqrt(0.901 + 1e-5)) +
+               (0.1 * 2.0 / math.sqrt(0.901 * 0.9 + 0.001 + 1e-5)))
+          ]), var0.eval())
+
+      self.assertAllCloseAccordingToType(
+          np.array([
+              3.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) -
+              (0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
+               (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5))),
+              4.0 - (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) -
+              (0.5 * (0.01 * 2.0 / math.sqrt(0.90001 + 1e-5)) +
+               (0.01 * 2.0 / math.sqrt(0.90001 * 0.9 + 2e-5)))
+          ]), var1.eval())
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/predictor/contrib_estimator_predictor.py b/tensorflow/contrib/predictor/contrib_estimator_predictor.py
index af3b2ad1b5..c2166594e5 100644
--- a/tensorflow/contrib/predictor/contrib_estimator_predictor.py
+++ b/tensorflow/contrib/predictor/contrib_estimator_predictor.py
@@ -22,8 +22,8 @@ from __future__ import print_function
 from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils
 from tensorflow.contrib.predictor import predictor
 from tensorflow.python.framework import ops
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import monitored_session
-from tensorflow.python.training import saver
 
 
 class ContribEstimatorPredictor(predictor.Predictor):
@@ -57,7 +57,8 @@ class ContribEstimatorPredictor(predictor.Predictor):
       # pylint: disable=protected-access
       model_fn_ops = estimator._get_predict_ops(input_fn_ops.features)
       # pylint: enable=protected-access
-      checkpoint_path = saver.latest_checkpoint(estimator.model_dir)
+      checkpoint_path = checkpoint_management.latest_checkpoint(
+          estimator.model_dir)
       self._session = monitored_session.MonitoredSession(
           session_creator=monitored_session.ChiefSessionCreator(
               config=config,
diff --git a/tensorflow/contrib/predictor/predictor_factories.py b/tensorflow/contrib/predictor/predictor_factories.py
index f275bc15ad..7886744b3c 100644
--- a/tensorflow/contrib/predictor/predictor_factories.py
+++ b/tensorflow/contrib/predictor/predictor_factories.py
@@ -108,6 +108,8 @@ def from_estimator(estimator,
 def from_saved_model(export_dir,
                      signature_def_key=None,
                      signature_def=None,
+                     input_names=None,
+                     output_names=None,
                      tags=None,
                      graph=None,
                      config=None):
@@ -121,6 +123,12 @@ def from_saved_model(export_dir,
     signature_def: A `SignatureDef` proto specifying the inputs and outputs
       for prediction. Only one of `signature_def_key` and `signature_def`
       should be specified.
+      input_names: A dictionary mapping strings to `Tensor`s in the `SavedModel`
+        that represent the input. The keys can be any string of the user's
+        choosing.
+      output_names: A dictionary mapping strings to `Tensor`s in the
+        `SavedModel` that represent the output. The keys can be any string of
+        the user's choosing.
     tags: Optional. Tags that will be used to retrieve the correct
       `SignatureDef`. Defaults to `DEFAULT_TAGS`.
     graph: Optional. The Tensorflow `graph` in which prediction should be
@@ -138,6 +146,8 @@ def from_saved_model(export_dir,
       export_dir,
       signature_def_key=signature_def_key,
       signature_def=signature_def,
+      input_names=input_names,
+      output_names=output_names,
       tags=tags,
       graph=graph,
       config=config)
diff --git a/tensorflow/contrib/proto/BUILD b/tensorflow/contrib/proto/BUILD
index 3e9b1a0b8d..b27142cf4a 100644
--- a/tensorflow/contrib/proto/BUILD
+++ b/tensorflow/contrib/proto/BUILD
@@ -16,17 +16,3 @@ py_library(
         "//tensorflow/contrib/proto/python/ops:encode_proto_op_py",
     ],
 )
-
-py_library(
-    name = "proto_pip",
-    data = [
-        "//tensorflow/contrib/proto/python/kernel_tests:test_messages",
-    ] + if_static(
-        [],
-        otherwise = ["//tensorflow/contrib/proto/python/kernel_tests:libtestexample.so"],
-    ),
-    deps = [
-        ":proto",
-        "//tensorflow/contrib/proto/python/kernel_tests:py_test_deps",
-    ],
-)
diff --git a/tensorflow/contrib/proto/python/kernel_tests/BUILD b/tensorflow/contrib/proto/python/kernel_tests/BUILD
index a380a131f8..125c1cee29 100644
--- a/tensorflow/contrib/proto/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/proto/python/kernel_tests/BUILD
@@ -4,47 +4,41 @@ licenses(["notice"])  # Apache 2.0
 
 exports_files(["LICENSE"])
 
-# Much of the work in this BUILD file actually happens in the corresponding
-# build_defs.bzl, which creates an individual testcase for each example .pbtxt
-# file in this directory.
-#
-load(":build_defs.bzl", "decode_proto_test_suite")
-load(":build_defs.bzl", "encode_proto_test_suite")
-
-# This expands to a tf_py_test for each test file.
-# It defines the test_suite :decode_proto_op_tests.
-decode_proto_test_suite(
-    name = "decode_proto_tests",
-    examples = glob(["*.pbtxt"]),
-)
-
-# This expands to a tf_py_test for each test file.
-# It defines the test_suite :encode_proto_op_tests.
-encode_proto_test_suite(
-    name = "encode_proto_tests",
-    examples = glob(["*.pbtxt"]),
-)
-
-# Below here are tests that are not tied to an example text proto.
-filegroup(
-    name = "test_messages",
-    srcs = glob(["*.pbtxt"]),
-)
-
 load("//tensorflow:tensorflow.bzl", "tf_py_test")
 load("//tensorflow:tensorflow.bzl", "tf_cc_shared_object")
 load("//tensorflow/core:platform/default/build_config_root.bzl", "if_static")
 load("//tensorflow/core:platform/default/build_config.bzl", "tf_proto_library")
 
 tf_py_test(
-    name = "decode_proto_fail_test",
+    name = "decode_proto_op_test",
     size = "small",
-    srcs = ["decode_proto_fail_test.py"],
+    srcs = ["decode_proto_op_test.py"],
     additional_deps = [
+        ":decode_proto_op_test_base",
+        ":py_test_deps",
+        "//tensorflow/contrib/proto:proto",
+        "//tensorflow/contrib/proto/python/ops:decode_proto_op_py",
+    ],
+    data = if_static(
+        [],
+        otherwise = [":libtestexample.so"],
+    ),
+    tags = [
+        "no_pip",  # TODO(b/78026780)
+        "no_windows",  # TODO(b/78028010)
+    ],
+)
+
+tf_py_test(
+    name = "encode_proto_op_test",
+    size = "small",
+    srcs = ["encode_proto_op_test.py"],
+    additional_deps = [
+        ":encode_proto_op_test_base",
         ":py_test_deps",
-        "//third_party/py/numpy",
         "//tensorflow/contrib/proto:proto",
         "//tensorflow/contrib/proto/python/ops:decode_proto_op_py",
+        "//tensorflow/contrib/proto/python/ops:encode_proto_op_py",
     ],
     data = if_static(
         [],
@@ -57,19 +51,41 @@ tf_py_test(
 )
 
 py_library(
-    name = "test_case",
-    srcs = ["test_case.py"],
-    deps = ["//tensorflow/python:client_testlib"],
+    name = "proto_op_test_base",
+    testonly = 1,
+    srcs = ["proto_op_test_base.py"],
+    deps = [
+        ":test_example_proto_py",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_library(
+    name = "decode_proto_op_test_base",
+    testonly = 1,
+    srcs = ["decode_proto_op_test_base.py"],
+    deps = [
+        ":proto_op_test_base",
+        ":test_example_proto_py",
+        "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
+    ],
 )
 
 py_library(
-    name = "py_test_deps",
+    name = "encode_proto_op_test_base",
+    testonly = 1,
+    srcs = ["encode_proto_op_test_base.py"],
     deps = [
-        ":test_case",
+        ":proto_op_test_base",
         ":test_example_proto_py",
+        "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
     ],
 )
 
+py_library(name = "py_test_deps")
+
 tf_proto_library(
     name = "test_example_proto",
     srcs = ["test_example.proto"],
@@ -84,3 +100,30 @@ tf_cc_shared_object(
         ":test_example_proto_cc",
     ],
 )
+
+py_library(
+    name = "descriptor_source_test_base",
+    testonly = 1,
+    srcs = ["descriptor_source_test_base.py"],
+    deps = [
+        ":proto_op_test_base",
+        "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
+        "@protobuf_archive//:protobuf_python",
+    ],
+)
+
+tf_py_test(
+    name = "descriptor_source_test",
+    size = "small",
+    srcs = ["descriptor_source_test.py"],
+    additional_deps = [
+        ":descriptor_source_test_base",
+        "//tensorflow/contrib/proto/python/ops:decode_proto_op_py",
+        "//tensorflow/contrib/proto/python/ops:encode_proto_op_py",
+        "//tensorflow/python:client_testlib",
+    ],
+    tags = [
+        "no_pip",
+    ],
+)
diff --git a/tensorflow/contrib/proto/python/kernel_tests/build_defs.bzl b/tensorflow/contrib/proto/python/kernel_tests/build_defs.bzl
deleted file mode 100644
index f425601691..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/build_defs.bzl
+++ /dev/null
@@ -1,89 +0,0 @@
-"""BUILD rules for generating file-driven proto test cases.
-
-The decode_proto_test_suite() and encode_proto_test_suite() rules take a list
-of text protos and generates a tf_py_test() for each one.
-"""
-
-load("//tensorflow:tensorflow.bzl", "tf_py_test")
-load("//tensorflow:tensorflow.bzl", "register_extension_info")
-load("//tensorflow/core:platform/default/build_config_root.bzl", "if_static")
-
-def _test_name(test, path):
-  return "%s_%s_test" % (test, path.split("/")[-1].split(".")[0])
-
-def decode_proto_test_suite(name, examples):
-  """Build the decode_proto py_test for each test filename."""
-  for test_filename in examples:
-    tf_py_test(
-        name = _test_name("decode_proto", test_filename),
-        srcs = ["decode_proto_op_test.py"],
-        size = "small",
-        data = [test_filename] + if_static(
-            [],
-            otherwise = [":libtestexample.so"],
-        ),
-        main = "decode_proto_op_test.py",
-        args = [
-            "--message_text_file=\"%s/%s\"" % (native.package_name(), test_filename),
-        ],
-        additional_deps = [
-            ":py_test_deps",
-            "//third_party/py/numpy",
-            "//tensorflow/contrib/proto:proto",
-            "//tensorflow/contrib/proto/python/ops:decode_proto_op_py",
-        ],
-        tags = [
-            "no_pip",  # TODO(b/78026780)
-            "no_windows",  # TODO(b/78028010)
-        ],
-    )
-  native.test_suite(
-      name = name,
-      tests = [":" + _test_name("decode_proto", test_filename)
-               for test_filename in examples],
-  )
-
-def encode_proto_test_suite(name, examples):
-  """Build the encode_proto py_test for each test filename."""
-  for test_filename in examples:
-    tf_py_test(
-        name = _test_name("encode_proto", test_filename),
-        srcs = ["encode_proto_op_test.py"],
-        size = "small",
-        data = [test_filename] + if_static(
-            [],
-            otherwise = [":libtestexample.so"],
-        ),
-        main = "encode_proto_op_test.py",
-        args = [
-            "--message_text_file=\"%s/%s\"" % (native.package_name(), test_filename),
-        ],
-        additional_deps = [
-            ":py_test_deps",
-            "//third_party/py/numpy",
-            "//tensorflow/contrib/proto:proto",
-            "//tensorflow/contrib/proto/python/ops:decode_proto_op_py",
-            "//tensorflow/contrib/proto/python/ops:encode_proto_op_py",
-        ],
-        tags = [
-            "no_pip",  # TODO(b/78026780)
-            "no_windows",  # TODO(b/78028010)
-        ],
-    )
-  native.test_suite(
-      name = name,
-      tests = [":" + _test_name("encode_proto", test_filename)
-               for test_filename in examples],
-  )
-
-register_extension_info(
-    extension_name = "decode_proto_test_suite",
-    label_regex_map = {
-        "deps": "deps:decode_example_.*",
-    })
-
-register_extension_info(
-    extension_name = "encode_proto_test_suite",
-    label_regex_map = {
-        "deps": "deps:encode_example_.*",
-    })
diff --git a/tensorflow/contrib/proto/python/kernel_tests/decode_proto_fail_test.py b/tensorflow/contrib/proto/python/kernel_tests/decode_proto_fail_test.py
deleted file mode 100644
index 5298342ee7..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/decode_proto_fail_test.py
+++ /dev/null
@@ -1,68 +0,0 @@
-# =============================================================================
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# =============================================================================
-
-# Python3 preparedness imports.
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.contrib.proto.python.kernel_tests import test_case
-from tensorflow.contrib.proto.python.ops import decode_proto_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import errors
-from tensorflow.python.platform import test
-
-
-class DecodeProtoFailTest(test_case.ProtoOpTestCase):
-  """Test failure cases for DecodeToProto."""
-
-  def _TestCorruptProtobuf(self, sanitize):
-    """Test failure cases for DecodeToProto."""
-
-    # The goal here is to check the error reporting.
-    # Testing against a variety of corrupt protobufs is
-    # done by fuzzing.
-    corrupt_proto = 'This is not a binary protobuf'
-
-    # Numpy silently truncates the strings if you don't specify dtype=object.
-    batch = np.array(corrupt_proto, dtype=object)
-    msg_type = 'tensorflow.contrib.proto.TestCase'
-    field_names = ['sizes']
-    field_types = [dtypes.int32]
-
-    with self.test_session() as sess:
-      ctensor, vtensor = decode_proto_op.decode_proto(
-          batch,
-          message_type=msg_type,
-          field_names=field_names,
-          output_types=field_types,
-          sanitize=sanitize)
-      with self.assertRaisesRegexp(errors.DataLossError,
-                                   'Unable to parse binary protobuf'
-                                   '|Failed to consume entire buffer'):
-        _ = sess.run([ctensor] + vtensor)
-
-  def testCorrupt(self):
-    self._TestCorruptProtobuf(sanitize=False)
-
-  def testSanitizerCorrupt(self):
-    self._TestCorruptProtobuf(sanitize=True)
-
-
-if __name__ == '__main__':
-  test.main()
diff --git a/tensorflow/contrib/proto/python/kernel_tests/decode_proto_op_test.py b/tensorflow/contrib/proto/python/kernel_tests/decode_proto_op_test.py
index d1c13c82bc..934035ec4c 100644
--- a/tensorflow/contrib/proto/python/kernel_tests/decode_proto_op_test.py
+++ b/tensorflow/contrib/proto/python/kernel_tests/decode_proto_op_test.py
@@ -13,287 +13,22 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # =============================================================================
-"""Table-driven test for decode_proto op.
+"""Tests for decode_proto op."""
 
-This test is run once with each of the *.TestCase.pbtxt files
-in the test directory.
-"""
 # Python3 preparedness imports.
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import numpy as np
-
-from google.protobuf import text_format
-
-from tensorflow.contrib.proto.python.kernel_tests import test_case
-from tensorflow.contrib.proto.python.kernel_tests import test_example_pb2
+from tensorflow.contrib.proto.python.kernel_tests import decode_proto_op_test_base as test_base
 from tensorflow.contrib.proto.python.ops import decode_proto_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.platform import flags
 from tensorflow.python.platform import test
 
-FLAGS = flags.FLAGS
-
-flags.DEFINE_string('message_text_file', None,
-                    'A file containing a text serialized TestCase protobuf.')
-
-
-class DecodeProtoOpTest(test_case.ProtoOpTestCase):
-
-  def _compareValues(self, fd, vs, evs):
-    """Compare lists/arrays of field values."""
-
-    if len(vs) != len(evs):
-      self.fail('Field %s decoded %d outputs, expected %d' %
-                (fd.name, len(vs), len(evs)))
-    for i, ev in enumerate(evs):
-      # Special case fuzzy match for float32. TensorFlow seems to mess with
-      # MAX_FLT slightly and the test doesn't work otherwise.
-      # TODO(nix): ask on TF list about why MAX_FLT doesn't pass through.
-      if fd.cpp_type == fd.CPPTYPE_FLOAT:
-        # Numpy isclose() is better than assertIsClose() which uses an absolute
-        # value comparison.
-        self.assertTrue(
-            np.isclose(vs[i], ev), 'expected %r, actual %r' % (ev, vs[i]))
-      elif fd.cpp_type == fd.CPPTYPE_STRING:
-        # In Python3 string tensor values will be represented as bytes, so we
-        # reencode the proto values to match that.
-        self.assertEqual(vs[i], ev.encode('ascii'))
-      else:
-        # Doubles and other types pass through unscathed.
-        self.assertEqual(vs[i], ev)
-
-  def _compareRepeatedPrimitiveValue(self, batch_shape, sizes, fields,
-                                     field_dict):
-    """Compare protos of type RepeatedPrimitiveValue.
-
-    Args:
-      batch_shape: the shape of the input tensor of serialized messages.
-      sizes: int matrix of repeat counts returned by decode_proto
-      fields: list of test_example_pb2.FieldSpec (types and expected values)
-      field_dict: map from field names to decoded numpy tensors of values
-    """
-
-    # Check that expected values match.
-    for field in fields:
-      values = field_dict[field.name]
-      self.assertEqual(dtypes.as_dtype(values.dtype), field.dtype)
-
-      fd = field.expected.DESCRIPTOR.fields_by_name[field.name]
-
-      # Values has the same shape as the input plus an extra
-      # dimension for repeats.
-      self.assertEqual(list(values.shape)[:-1], batch_shape)
-
-      # Nested messages are represented as TF strings, requiring
-      # some special handling.
-      if field.name == 'message_value':
-        vs = []
-        for buf in values.flat:
-          msg = test_example_pb2.PrimitiveValue()
-          msg.ParseFromString(buf)
-          vs.append(msg)
-        evs = getattr(field.expected, field.name)
-        if len(vs) != len(evs):
-          self.fail('Field %s decoded %d outputs, expected %d' %
-                    (fd.name, len(vs), len(evs)))
-        for v, ev in zip(vs, evs):
-          self.assertEqual(v, ev)
-        continue
-
-      # This can be a little confusing. For testing we are using
-      # RepeatedPrimitiveValue in two ways: it's the proto that we
-      # decode for testing, and it's used in the expected value as a
-      # union type. The two cases are slightly different: this is the
-      # second case.
-      # We may be fetching the uint64_value from the test proto, but
-      # in the expected proto we store it in the int64_value field
-      # because TensorFlow doesn't support unsigned int64.
-      tf_type_to_primitive_value_field = {
-          dtypes.float32:
-              'float_value',
-          dtypes.float64:
-              'double_value',
-          dtypes.int32:
-              'int32_value',
-          dtypes.uint8:
-              'uint8_value',
-          dtypes.int8:
-              'int8_value',
-          dtypes.string:
-              'string_value',
-          dtypes.int64:
-              'int64_value',
-          dtypes.bool:
-              'bool_value',
-          # Unhandled TensorFlow types:
-          # DT_INT16 DT_COMPLEX64 DT_QINT8 DT_QUINT8 DT_QINT32
-          # DT_BFLOAT16 DT_QINT16 DT_QUINT16 DT_UINT16
-      }
-      tf_field_name = tf_type_to_primitive_value_field.get(field.dtype)
-      if tf_field_name is None:
-        self.fail('Unhandled tensorflow type %d' % field.dtype)
-
-      self._compareValues(fd, values.flat,
-                          getattr(field.expected, tf_field_name))
-
-  def _runDecodeProtoTests(self, fields, case_sizes, batch_shape, batch,
-                           message_type, message_format, sanitize,
-                           force_disordered=False):
-    """Run decode tests on a batch of messages.
-
-    Args:
-      fields: list of test_example_pb2.FieldSpec (types and expected values)
-      case_sizes: expected sizes array
-      batch_shape: the shape of the input tensor of serialized messages
-      batch: list of serialized messages
-      message_type: descriptor name for messages
-      message_format: format of messages, 'text' or 'binary'
-      sanitize: whether to sanitize binary protobuf inputs
-      force_disordered: whether to force fields encoded out of order.
-    """
-
-    if force_disordered:
-      # Exercise code path that handles out-of-order fields by prepending extra
-      # fields with tag numbers higher than any real field. Note that this won't
-      # work with sanitization because that forces reserialization using a
-      # trusted decoder and encoder.
-      assert not sanitize
-      extra_fields = test_example_pb2.ExtraFields()
-      extra_fields.string_value = 'IGNORE ME'
-      extra_fields.bool_value = False
-      extra_msg = extra_fields.SerializeToString()
-      batch = [extra_msg + msg for msg in batch]
-
-    # Numpy silently truncates the strings if you don't specify dtype=object.
-    batch = np.array(batch, dtype=object)
-    batch = np.reshape(batch, batch_shape)
-
-    field_names = [f.name for f in fields]
-    output_types = [f.dtype for f in fields]
-
-    with self.test_session() as sess:
-      sizes, vtensor = decode_proto_op.decode_proto(
-          batch,
-          message_type=message_type,
-          field_names=field_names,
-          output_types=output_types,
-          message_format=message_format,
-          sanitize=sanitize)
-
-      vlist = sess.run([sizes] + vtensor)
-      sizes = vlist[0]
-      # Values is a list of tensors, one for each field.
-      value_tensors = vlist[1:]
-
-      # Check that the repeat sizes are correct.
-      self.assertTrue(
-          np.all(np.array(sizes.shape) == batch_shape + [len(field_names)]))
-
-      # Check that the decoded sizes match the expected sizes.
-      self.assertEqual(len(sizes.flat), len(case_sizes))
-      self.assertTrue(
-          np.all(sizes.flat == np.array(
-              case_sizes, dtype=np.int32)))
-
-      field_dict = dict(zip(field_names, value_tensors))
-
-      self._compareRepeatedPrimitiveValue(batch_shape, sizes, fields,
-                                          field_dict)
-
-  def testBinary(self):
-    with open(FLAGS.message_text_file, 'r') as fp:
-      case = text_format.Parse(fp.read(), test_example_pb2.TestCase())
-
-    batch = [primitive.SerializeToString() for primitive in case.primitive]
-    self._runDecodeProtoTests(
-        case.field,
-        case.sizes,
-        list(case.shape),
-        batch,
-        'tensorflow.contrib.proto.RepeatedPrimitiveValue',
-        'binary',
-        sanitize=False)
-
-  def testBinaryDisordered(self):
-    with open(FLAGS.message_text_file, 'r') as fp:
-      case = text_format.Parse(fp.read(), test_example_pb2.TestCase())
-
-    batch = [primitive.SerializeToString() for primitive in case.primitive]
-    self._runDecodeProtoTests(
-        case.field,
-        case.sizes,
-        list(case.shape),
-        batch,
-        'tensorflow.contrib.proto.RepeatedPrimitiveValue',
-        'binary',
-        sanitize=False,
-        force_disordered=True)
-
-  def testPacked(self):
-    with open(FLAGS.message_text_file, 'r') as fp:
-      case = text_format.Parse(fp.read(), test_example_pb2.TestCase())
-
-    # Now try with the packed serialization.
-    # We test the packed representations by loading the same test cases
-    # using PackedPrimitiveValue instead of RepeatedPrimitiveValue.
-    # To do this we rely on the text format being the same for packed and
-    # unpacked fields, and reparse the test message using the packed version
-    # of the proto.
-    packed_batch = [
-        # Note: float_format='.17g' is necessary to ensure preservation of
-        # doubles and floats in text format.
-        text_format.Parse(
-            text_format.MessageToString(
-                primitive, float_format='.17g'),
-            test_example_pb2.PackedPrimitiveValue()).SerializeToString()
-        for primitive in case.primitive
-    ]
-
-    self._runDecodeProtoTests(
-        case.field,
-        case.sizes,
-        list(case.shape),
-        packed_batch,
-        'tensorflow.contrib.proto.PackedPrimitiveValue',
-        'binary',
-        sanitize=False)
-
-  def testText(self):
-    with open(FLAGS.message_text_file, 'r') as fp:
-      case = text_format.Parse(fp.read(), test_example_pb2.TestCase())
-
-    # Note: float_format='.17g' is necessary to ensure preservation of
-    # doubles and floats in text format.
-    text_batch = [
-        text_format.MessageToString(
-            primitive, float_format='.17g') for primitive in case.primitive
-    ]
-
-    self._runDecodeProtoTests(
-        case.field,
-        case.sizes,
-        list(case.shape),
-        text_batch,
-        'tensorflow.contrib.proto.RepeatedPrimitiveValue',
-        'text',
-        sanitize=False)
 
-  def testSanitizerGood(self):
-    with open(FLAGS.message_text_file, 'r') as fp:
-      case = text_format.Parse(fp.read(), test_example_pb2.TestCase())
+class DecodeProtoOpTest(test_base.DecodeProtoOpTestBase):
 
-    batch = [primitive.SerializeToString() for primitive in case.primitive]
-    self._runDecodeProtoTests(
-        case.field,
-        case.sizes,
-        list(case.shape),
-        batch,
-        'tensorflow.contrib.proto.RepeatedPrimitiveValue',
-        'binary',
-        sanitize=True)
+  def __init__(self, methodName='runTest'):  # pylint: disable=invalid-name
+    super(DecodeProtoOpTest, self).__init__(decode_proto_op, methodName)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/proto/python/kernel_tests/decode_proto_op_test_base.py b/tensorflow/contrib/proto/python/kernel_tests/decode_proto_op_test_base.py
new file mode 100644
index 0000000000..e3570e38a3
--- /dev/null
+++ b/tensorflow/contrib/proto/python/kernel_tests/decode_proto_op_test_base.py
@@ -0,0 +1,303 @@
+# =============================================================================
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Tests for decode_proto op."""
+
+# Python3 preparedness imports.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from absl.testing import parameterized
+import numpy as np
+
+
+from google.protobuf import text_format
+
+from tensorflow.contrib.proto.python.kernel_tests import proto_op_test_base as test_base
+from tensorflow.contrib.proto.python.kernel_tests import test_example_pb2
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+
+
+class DecodeProtoOpTestBase(test_base.ProtoOpTestBase, parameterized.TestCase):
+  """Base class for testing proto decoding ops."""
+
+  def __init__(self, decode_module, methodName='runTest'):  # pylint: disable=invalid-name
+    """DecodeProtoOpTestBase initializer.
+
+    Args:
+      decode_module: a module containing the `decode_proto_op` method
+      methodName: the name of the test method (same as for test.TestCase)
+    """
+
+    super(DecodeProtoOpTestBase, self).__init__(methodName)
+    self._decode_module = decode_module
+
+  def _compareValues(self, fd, vs, evs):
+    """Compare lists/arrays of field values."""
+
+    if len(vs) != len(evs):
+      self.fail('Field %s decoded %d outputs, expected %d' %
+                (fd.name, len(vs), len(evs)))
+    for i, ev in enumerate(evs):
+      # Special case fuzzy match for float32. TensorFlow seems to mess with
+      # MAX_FLT slightly and the test doesn't work otherwise.
+      # TODO(nix): ask on TF list about why MAX_FLT doesn't pass through.
+      if fd.cpp_type == fd.CPPTYPE_FLOAT:
+        # Numpy isclose() is better than assertIsClose() which uses an absolute
+        # value comparison.
+        self.assertTrue(
+            np.isclose(vs[i], ev), 'expected %r, actual %r' % (ev, vs[i]))
+      elif fd.cpp_type == fd.CPPTYPE_STRING:
+        # In Python3 string tensor values will be represented as bytes, so we
+        # reencode the proto values to match that.
+        self.assertEqual(vs[i], ev.encode('ascii'))
+      else:
+        # Doubles and other types pass through unscathed.
+        self.assertEqual(vs[i], ev)
+
+  def _compareProtos(self, batch_shape, sizes, fields, field_dict):
+    """Compare protos of type TestValue.
+
+    Args:
+      batch_shape: the shape of the input tensor of serialized messages.
+      sizes: int matrix of repeat counts returned by decode_proto
+      fields: list of test_example_pb2.FieldSpec (types and expected values)
+      field_dict: map from field names to decoded numpy tensors of values
+    """
+
+    # Check that expected values match.
+    for field in fields:
+      values = field_dict[field.name]
+      self.assertEqual(dtypes.as_dtype(values.dtype), field.dtype)
+
+      fd = field.value.DESCRIPTOR.fields_by_name[field.name]
+
+      # Values has the same shape as the input plus an extra
+      # dimension for repeats.
+      self.assertEqual(list(values.shape)[:-1], batch_shape)
+
+      # Nested messages are represented as TF strings, requiring
+      # some special handling.
+      if field.name == 'message_value':
+        vs = []
+        for buf in values.flat:
+          msg = test_example_pb2.PrimitiveValue()
+          msg.ParseFromString(buf)
+          vs.append(msg)
+        evs = getattr(field.value, field.name)
+        if len(vs) != len(evs):
+          self.fail('Field %s decoded %d outputs, expected %d' %
+                    (fd.name, len(vs), len(evs)))
+        for v, ev in zip(vs, evs):
+          self.assertEqual(v, ev)
+        continue
+
+      tf_type_to_primitive_value_field = {
+          dtypes.bool:
+              'bool_value',
+          dtypes.float32:
+              'float_value',
+          dtypes.float64:
+              'double_value',
+          dtypes.int8:
+              'int8_value',
+          dtypes.int32:
+              'int32_value',
+          dtypes.int64:
+              'int64_value',
+          dtypes.string:
+              'string_value',
+          dtypes.uint8:
+              'uint8_value',
+          dtypes.uint32:
+              'uint32_value',
+          dtypes.uint64:
+              'uint64_value',
+      }
+      tf_field_name = tf_type_to_primitive_value_field.get(field.dtype)
+      if tf_field_name is None:
+        self.fail('Unhandled tensorflow type %d' % field.dtype)
+
+      self._compareValues(fd, values.flat,
+                          getattr(field.value, tf_field_name))
+
+  def _runDecodeProtoTests(self, fields, case_sizes, batch_shape, batch,
+                           message_type, message_format, sanitize,
+                           force_disordered=False):
+    """Run decode tests on a batch of messages.
+
+    Args:
+      fields: list of test_example_pb2.FieldSpec (types and expected values)
+      case_sizes: expected sizes array
+      batch_shape: the shape of the input tensor of serialized messages
+      batch: list of serialized messages
+      message_type: descriptor name for messages
+      message_format: format of messages, 'text' or 'binary'
+      sanitize: whether to sanitize binary protobuf inputs
+      force_disordered: whether to force fields encoded out of order.
+    """
+
+    if force_disordered:
+      # Exercise code path that handles out-of-order fields by prepending extra
+      # fields with tag numbers higher than any real field. Note that this won't
+      # work with sanitization because that forces reserialization using a
+      # trusted decoder and encoder.
+      assert not sanitize
+      extra_fields = test_example_pb2.ExtraFields()
+      extra_fields.string_value = 'IGNORE ME'
+      extra_fields.bool_value = False
+      extra_msg = extra_fields.SerializeToString()
+      batch = [extra_msg + msg for msg in batch]
+
+    # Numpy silently truncates the strings if you don't specify dtype=object.
+    batch = np.array(batch, dtype=object)
+    batch = np.reshape(batch, batch_shape)
+
+    field_names = [f.name for f in fields]
+    output_types = [f.dtype for f in fields]
+
+    with self.test_session() as sess:
+      sizes, vtensor = self._decode_module.decode_proto(
+          batch,
+          message_type=message_type,
+          field_names=field_names,
+          output_types=output_types,
+          message_format=message_format,
+          sanitize=sanitize)
+
+      vlist = sess.run([sizes] + vtensor)
+      sizes = vlist[0]
+      # Values is a list of tensors, one for each field.
+      value_tensors = vlist[1:]
+
+      # Check that the repeat sizes are correct.
+      self.assertTrue(
+          np.all(np.array(sizes.shape) == batch_shape + [len(field_names)]))
+
+      # Check that the decoded sizes match the expected sizes.
+      self.assertEqual(len(sizes.flat), len(case_sizes))
+      self.assertTrue(
+          np.all(sizes.flat == np.array(
+              case_sizes, dtype=np.int32)))
+
+      field_dict = dict(zip(field_names, value_tensors))
+
+      self._compareProtos(batch_shape, sizes, fields, field_dict)
+
+  @parameterized.named_parameters(*test_base.ProtoOpTestBase.named_parameters())
+  def testBinary(self, case):
+    batch = [value.SerializeToString() for value in case.values]
+    self._runDecodeProtoTests(
+        case.fields,
+        case.sizes,
+        list(case.shapes),
+        batch,
+        'tensorflow.contrib.proto.TestValue',
+        'binary',
+        sanitize=False)
+
+  @parameterized.named_parameters(*test_base.ProtoOpTestBase.named_parameters())
+  def testBinaryDisordered(self, case):
+    batch = [value.SerializeToString() for value in case.values]
+    self._runDecodeProtoTests(
+        case.fields,
+        case.sizes,
+        list(case.shapes),
+        batch,
+        'tensorflow.contrib.proto.TestValue',
+        'binary',
+        sanitize=False,
+        force_disordered=True)
+
+  @parameterized.named_parameters(*test_base.ProtoOpTestBase.named_parameters())
+  def testPacked(self, case):
+    # Now try with the packed serialization.
+    #
+    # We test the packed representations by loading the same test case using
+    # PackedTestValue instead of TestValue. To do this we rely on the text
+    # format being the same for packed and unpacked fields, and reparse the
+    # test message using the packed version of the proto.
+    packed_batch = [
+        # Note: float_format='.17g' is necessary to ensure preservation of
+        # doubles and floats in text format.
+        text_format.Parse(
+            text_format.MessageToString(
+                value, float_format='.17g'),
+            test_example_pb2.PackedTestValue()).SerializeToString()
+        for value in case.values
+    ]
+
+    self._runDecodeProtoTests(
+        case.fields,
+        case.sizes,
+        list(case.shapes),
+        packed_batch,
+        'tensorflow.contrib.proto.PackedTestValue',
+        'binary',
+        sanitize=False)
+
+  @parameterized.named_parameters(*test_base.ProtoOpTestBase.named_parameters())
+  def testText(self, case):
+    # Note: float_format='.17g' is necessary to ensure preservation of
+    # doubles and floats in text format.
+    text_batch = [
+        text_format.MessageToString(
+            value, float_format='.17g') for value in case.values
+    ]
+
+    self._runDecodeProtoTests(
+        case.fields,
+        case.sizes,
+        list(case.shapes),
+        text_batch,
+        'tensorflow.contrib.proto.TestValue',
+        'text',
+        sanitize=False)
+
+  @parameterized.named_parameters(*test_base.ProtoOpTestBase.named_parameters())
+  def testSanitizerGood(self, case):
+    batch = [value.SerializeToString() for value in case.values]
+    self._runDecodeProtoTests(
+        case.fields,
+        case.sizes,
+        list(case.shapes),
+        batch,
+        'tensorflow.contrib.proto.TestValue',
+        'binary',
+        sanitize=True)
+
+  @parameterized.parameters((False), (True))
+  def testCorruptProtobuf(self, sanitize):
+    corrupt_proto = 'This is not a binary protobuf'
+
+    # Numpy silently truncates the strings if you don't specify dtype=object.
+    batch = np.array(corrupt_proto, dtype=object)
+    msg_type = 'tensorflow.contrib.proto.TestCase'
+    field_names = ['sizes']
+    field_types = [dtypes.int32]
+
+    with self.test_session() as sess:
+      ctensor, vtensor = self._decode_module.decode_proto(
+          batch,
+          message_type=msg_type,
+          field_names=field_names,
+          output_types=field_types,
+          sanitize=sanitize)
+      with self.assertRaisesRegexp(errors.DataLossError,
+                                   'Unable to parse binary protobuf'
+                                   '|Failed to consume entire buffer'):
+        _ = sess.run([ctensor] + vtensor)
diff --git a/tensorflow/contrib/proto/python/kernel_tests/defaut_values.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/defaut_values.TestCase.pbtxt
deleted file mode 100644
index 4e31681907..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/defaut_values.TestCase.pbtxt
+++ /dev/null
@@ -1,94 +0,0 @@
-primitive {
-  # No fields specified, so we get all defaults
-}
-shape: 1
-sizes: 0
-field {
-  name: "double_default"
-  dtype: DT_DOUBLE
-  expected { double_value: 1.0 }
-}
-sizes: 0
-field {
-  name: "float_default"
-  dtype: DT_DOUBLE  # Try casting the float field to double.
-  expected { double_value: 2.0 }
-}
-sizes: 0
-field {
-  name: "int64_default"
-  dtype: DT_INT64
-  expected { int64_value: 3 }
-}
-sizes: 0
-field {
-  name: "uint64_default"
-  dtype: DT_INT64
-  expected { int64_value: 4 }
-}
-sizes: 0
-field {
-  name: "int32_default"
-  dtype: DT_INT32
-  expected { int32_value: 5 }
-}
-sizes: 0
-field {
-  name: "fixed64_default"
-  dtype: DT_INT64
-  expected { int64_value: 6 }
-}
-sizes: 0
-field {
-  name: "fixed32_default"
-  dtype: DT_INT32
-  expected { int32_value: 7 }
-}
-sizes: 0
-field {
-  name: "bool_default"
-  dtype: DT_BOOL
-  expected { bool_value: true }
-}
-sizes: 0
-field {
-  name: "string_default"
-  dtype: DT_STRING
-  expected { string_value: "a" }
-}
-sizes: 0
-field {
-  name: "bytes_default"
-  dtype: DT_STRING
-  expected { string_value: "a longer default string" }
-}
-sizes: 0
-field {
-  name: "uint32_default"
-  dtype: DT_INT32
-  expected { int32_value: -1 }
-}
-sizes: 0
-field {
-  name: "sfixed32_default"
-  dtype: DT_INT32
-  expected { int32_value: 10 }
-}
-sizes: 0
-field {
-  name: "sfixed64_default"
-  dtype: DT_INT64
-  expected { int64_value: 11 }
-}
-sizes: 0
-field {
-  name: "sint32_default"
-  dtype: DT_INT32
-  expected { int32_value: 12 }
-}
-sizes: 0
-field {
-  name: "sint64_default"
-  dtype: DT_INT64
-  expected { int64_value: 13 }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/test_case.py b/tensorflow/contrib/proto/python/kernel_tests/descriptor_source_test.py
index b95202c5df..32ca318f73 100644
--- a/tensorflow/contrib/proto/python/kernel_tests/test_case.py
+++ b/tensorflow/contrib/proto/python/kernel_tests/descriptor_source_test.py
@@ -13,23 +13,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # =============================================================================
-"""Test case base for testing proto operations."""
-
+"""Tests for proto ops reading descriptors from other sources."""
 # Python3 preparedness imports.
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import ctypes as ct
-import os
-
+from tensorflow.contrib.proto.python.kernel_tests import descriptor_source_test_base as test_base
+from tensorflow.contrib.proto.python.ops import decode_proto_op
+from tensorflow.contrib.proto.python.ops import encode_proto_op
 from tensorflow.python.platform import test
 
 
-class ProtoOpTestCase(test.TestCase):
+class DescriptorSourceTest(test_base.DescriptorSourceTestBase):
 
   def __init__(self, methodName='runTest'):  # pylint: disable=invalid-name
-    super(ProtoOpTestCase, self).__init__(methodName)
-    lib = os.path.join(os.path.dirname(__file__), 'libtestexample.so')
-    if os.path.isfile(lib):
-      ct.cdll.LoadLibrary(lib)
+    super(DescriptorSourceTest, self).__init__(decode_proto_op, encode_proto_op,
+                                               methodName)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/proto/python/kernel_tests/descriptor_source_test_base.py b/tensorflow/contrib/proto/python/kernel_tests/descriptor_source_test_base.py
new file mode 100644
index 0000000000..9a1c04af32
--- /dev/null
+++ b/tensorflow/contrib/proto/python/kernel_tests/descriptor_source_test_base.py
@@ -0,0 +1,176 @@
+# =============================================================================
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Tests for proto ops reading descriptors from other sources."""
+# Python3 preparedness imports.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+import numpy as np
+
+from google.protobuf.descriptor_pb2 import FieldDescriptorProto
+from google.protobuf.descriptor_pb2 import FileDescriptorSet
+from tensorflow.contrib.proto.python.kernel_tests import proto_op_test_base as test_base
+from tensorflow.python.framework import dtypes
+from tensorflow.python.platform import test
+
+
+class DescriptorSourceTestBase(test.TestCase):
+  """Base class for testing descriptor sources."""
+
+  def __init__(self, decode_module, encode_module, methodName='runTest'):  # pylint: disable=invalid-name
+    """DescriptorSourceTestBase initializer.
+
+    Args:
+      decode_module: a module containing the `decode_proto_op` method
+      encode_module: a module containing the `encode_proto_op` method
+      methodName: the name of the test method (same as for test.TestCase)
+    """
+
+    super(DescriptorSourceTestBase, self).__init__(methodName)
+    self._decode_module = decode_module
+    self._encode_module = encode_module
+
+  # NOTE: We generate the descriptor programmatically instead of via a compiler
+  # because of differences between different versions of the compiler.
+  #
+  # The generated descriptor should capture the subset of `test_example.proto`
+  # used in `test_base.simple_test_case()`.
+  def _createDescriptorFile(self):
+    set_proto = FileDescriptorSet()
+
+    file_proto = set_proto.file.add(
+        name='types.proto',
+        package='tensorflow',
+        syntax='proto3')
+    enum_proto = file_proto.enum_type.add(name='DataType')
+    enum_proto.value.add(name='DT_DOUBLE', number=0)
+    enum_proto.value.add(name='DT_BOOL', number=1)
+
+    file_proto = set_proto.file.add(
+        name='test_example.proto',
+        package='tensorflow.contrib.proto',
+        dependency=['types.proto'])
+    message_proto = file_proto.message_type.add(name='TestCase')
+    message_proto.field.add(
+        name='values',
+        number=1,
+        type=FieldDescriptorProto.TYPE_MESSAGE,
+        type_name='.tensorflow.contrib.proto.TestValue',
+        label=FieldDescriptorProto.LABEL_REPEATED)
+    message_proto.field.add(
+        name='shapes',
+        number=2,
+        type=FieldDescriptorProto.TYPE_INT32,
+        label=FieldDescriptorProto.LABEL_REPEATED)
+    message_proto.field.add(
+        name='sizes',
+        number=3,
+        type=FieldDescriptorProto.TYPE_INT32,
+        label=FieldDescriptorProto.LABEL_REPEATED)
+    message_proto.field.add(
+        name='fields',
+        number=4,
+        type=FieldDescriptorProto.TYPE_MESSAGE,
+        type_name='.tensorflow.contrib.proto.FieldSpec',
+        label=FieldDescriptorProto.LABEL_REPEATED)
+
+    message_proto = file_proto.message_type.add(
+        name='TestValue')
+    message_proto.field.add(
+        name='double_value',
+        number=1,
+        type=FieldDescriptorProto.TYPE_DOUBLE,
+        label=FieldDescriptorProto.LABEL_REPEATED)
+    message_proto.field.add(
+        name='bool_value',
+        number=2,
+        type=FieldDescriptorProto.TYPE_BOOL,
+        label=FieldDescriptorProto.LABEL_REPEATED)
+
+    message_proto = file_proto.message_type.add(
+        name='FieldSpec')
+    message_proto.field.add(
+        name='name',
+        number=1,
+        type=FieldDescriptorProto.TYPE_STRING,
+        label=FieldDescriptorProto.LABEL_OPTIONAL)
+    message_proto.field.add(
+        name='dtype',
+        number=2,
+        type=FieldDescriptorProto.TYPE_ENUM,
+        type_name='.tensorflow.DataType',
+        label=FieldDescriptorProto.LABEL_OPTIONAL)
+    message_proto.field.add(
+        name='value',
+        number=3,
+        type=FieldDescriptorProto.TYPE_MESSAGE,
+        type_name='.tensorflow.contrib.proto.TestValue',
+        label=FieldDescriptorProto.LABEL_OPTIONAL)
+
+    fn = os.path.join(self.get_temp_dir(), 'descriptor.pb')
+    with open(fn, 'wb') as f:
+      f.write(set_proto.SerializeToString())
+    return fn
+
+  def _testRoundtrip(self, descriptor_source):
+    # Numpy silently truncates the strings if you don't specify dtype=object.
+    in_bufs = np.array(
+        [test_base.ProtoOpTestBase.simple_test_case().SerializeToString()],
+        dtype=object)
+    message_type = 'tensorflow.contrib.proto.TestCase'
+    field_names = ['values', 'shapes', 'sizes', 'fields']
+    tensor_types = [dtypes.string, dtypes.int32, dtypes.int32, dtypes.string]
+
+    with self.test_session() as sess:
+      sizes, field_tensors = self._decode_module.decode_proto(
+          in_bufs,
+          message_type=message_type,
+          field_names=field_names,
+          output_types=tensor_types,
+          descriptor_source=descriptor_source)
+
+      out_tensors = self._encode_module.encode_proto(
+          sizes,
+          field_tensors,
+          message_type=message_type,
+          field_names=field_names,
+          descriptor_source=descriptor_source)
+
+      out_bufs, = sess.run([out_tensors])
+
+      # Check that the re-encoded tensor has the same shape.
+      self.assertEqual(in_bufs.shape, out_bufs.shape)
+
+      # Compare the input and output.
+      for in_buf, out_buf in zip(in_bufs.flat, out_bufs.flat):
+        # Check that the input and output serialized messages are identical.
+        # If we fail here, there is a difference in the serialized
+        # representation but the new serialization still parses. This could
+        # be harmless (a change in map ordering?) or it could be bad (e.g.
+        # loss of packing in the encoding).
+        self.assertEqual(in_buf, out_buf)
+
+  def testWithFileDescriptorSet(self):
+    # First try parsing with a local proto db, which should fail.
+    with self.assertRaisesOpError('No descriptor found for message type'):
+      self._testRoundtrip('local://')
+
+    # Now try parsing with a FileDescriptorSet which contains the test proto.
+    descriptor_file = self._createDescriptorFile()
+    self._testRoundtrip(descriptor_file)
diff --git a/tensorflow/contrib/proto/python/kernel_tests/encode_proto_op_test.py b/tensorflow/contrib/proto/python/kernel_tests/encode_proto_op_test.py
index 30e58e6336..fc5cd25d43 100644
--- a/tensorflow/contrib/proto/python/kernel_tests/encode_proto_op_test.py
+++ b/tensorflow/contrib/proto/python/kernel_tests/encode_proto_op_test.py
@@ -13,167 +13,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # =============================================================================
-"""Table-driven test for encode_proto op.
+"""Tests for encode_proto op."""
 
-This test is run once with each of the *.TestCase.pbtxt files
-in the test directory.
-
-It tests that encode_proto is a lossless inverse of decode_proto
-(for the specified fields).
-"""
 # Python3 readiness boilerplate
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import numpy as np
-
-from google.protobuf import text_format
-
-from tensorflow.contrib.proto.python.kernel_tests import test_case
-from tensorflow.contrib.proto.python.kernel_tests import test_example_pb2
+from tensorflow.contrib.proto.python.kernel_tests import encode_proto_op_test_base as test_base
 from tensorflow.contrib.proto.python.ops import decode_proto_op
 from tensorflow.contrib.proto.python.ops import encode_proto_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import array_ops
-from tensorflow.python.platform import flags
 from tensorflow.python.platform import test
 
-FLAGS = flags.FLAGS
-
-flags.DEFINE_string('message_text_file', None,
-                    'A file containing a text serialized TestCase protobuf.')
-
-
-class EncodeProtoOpTest(test_case.ProtoOpTestCase):
-
-  def testBadInputs(self):
-    # Invalid field name
-    with self.test_session():
-      with self.assertRaisesOpError('Unknown field: non_existent_field'):
-        encode_proto_op.encode_proto(
-            sizes=[[1]],
-            values=[np.array([[0.0]], dtype=np.int32)],
-            message_type='tensorflow.contrib.proto.RepeatedPrimitiveValue',
-            field_names=['non_existent_field']).eval()
-
-    # Incorrect types.
-    with self.test_session():
-      with self.assertRaisesOpError(
-          'Incompatible type for field double_value.'):
-        encode_proto_op.encode_proto(
-            sizes=[[1]],
-            values=[np.array([[0.0]], dtype=np.int32)],
-            message_type='tensorflow.contrib.proto.RepeatedPrimitiveValue',
-            field_names=['double_value']).eval()
-
-    # Incorrect shapes of sizes.
-    with self.test_session():
-      with self.assertRaisesOpError(
-          r'sizes should be batch_size \+ \[len\(field_names\)\]'):
-        sizes = array_ops.placeholder(dtypes.int32)
-        values = array_ops.placeholder(dtypes.float64)
-        encode_proto_op.encode_proto(
-            sizes=sizes,
-            values=[values],
-            message_type='tensorflow.contrib.proto.RepeatedPrimitiveValue',
-            field_names=['double_value']).eval(feed_dict={
-                sizes: [[[0, 0]]],
-                values: [[0.0]]
-            })
-
-    # Inconsistent shapes of values.
-    with self.test_session():
-      with self.assertRaisesOpError(
-          'Values must match up to the last dimension'):
-        sizes = array_ops.placeholder(dtypes.int32)
-        values1 = array_ops.placeholder(dtypes.float64)
-        values2 = array_ops.placeholder(dtypes.int32)
-        (encode_proto_op.encode_proto(
-            sizes=[[1, 1]],
-            values=[values1, values2],
-            message_type='tensorflow.contrib.proto.RepeatedPrimitiveValue',
-            field_names=['double_value', 'int32_value']).eval(feed_dict={
-                values1: [[0.0]],
-                values2: [[0], [0]]
-            }))
-
-  def _testRoundtrip(self, in_bufs, message_type, fields):
-
-    field_names = [f.name for f in fields]
-    out_types = [f.dtype for f in fields]
-
-    with self.test_session() as sess:
-      sizes, field_tensors = decode_proto_op.decode_proto(
-          in_bufs,
-          message_type=message_type,
-          field_names=field_names,
-          output_types=out_types)
-
-      out_tensors = encode_proto_op.encode_proto(
-          sizes,
-          field_tensors,
-          message_type=message_type,
-          field_names=field_names)
-
-      out_bufs, = sess.run([out_tensors])
-
-      # Check that the re-encoded tensor has the same shape.
-      self.assertEqual(in_bufs.shape, out_bufs.shape)
-
-      # Compare the input and output.
-      for in_buf, out_buf in zip(in_bufs.flat, out_bufs.flat):
-        in_obj = test_example_pb2.RepeatedPrimitiveValue()
-        in_obj.ParseFromString(in_buf)
-
-        out_obj = test_example_pb2.RepeatedPrimitiveValue()
-        out_obj.ParseFromString(out_buf)
-
-        # Check that the deserialized objects are identical.
-        self.assertEqual(in_obj, out_obj)
-
-        # Check that the input and output serialized messages are identical.
-        # If we fail here, there is a difference in the serialized
-        # representation but the new serialization still parses. This could
-        # be harmless (a change in map ordering?) or it could be bad (e.g.
-        # loss of packing in the encoding).
-        self.assertEqual(in_buf, out_buf)
-
-  def testRoundtrip(self):
-    with open(FLAGS.message_text_file, 'r') as fp:
-      case = text_format.Parse(fp.read(), test_example_pb2.TestCase())
-
-    in_bufs = [primitive.SerializeToString() for primitive in case.primitive]
-
-    # np.array silently truncates strings if you don't specify dtype=object.
-    in_bufs = np.reshape(np.array(in_bufs, dtype=object), list(case.shape))
-    return self._testRoundtrip(
-        in_bufs, 'tensorflow.contrib.proto.RepeatedPrimitiveValue', case.field)
-
-  def testRoundtripPacked(self):
-    with open(FLAGS.message_text_file, 'r') as fp:
-      case = text_format.Parse(fp.read(), test_example_pb2.TestCase())
 
-    # Now try with the packed serialization.
-    # We test the packed representations by loading the same test cases
-    # using PackedPrimitiveValue instead of RepeatedPrimitiveValue.
-    # To do this we rely on the text format being the same for packed and
-    # unpacked fields, and reparse the test message using the packed version
-    # of the proto.
-    in_bufs = [
-        # Note: float_format='.17g' is necessary to ensure preservation of
-        # doubles and floats in text format.
-        text_format.Parse(
-            text_format.MessageToString(
-                primitive, float_format='.17g'),
-            test_example_pb2.PackedPrimitiveValue()).SerializeToString()
-        for primitive in case.primitive
-    ]
+class EncodeProtoOpTest(test_base.EncodeProtoOpTestBase):
 
-    # np.array silently truncates strings if you don't specify dtype=object.
-    in_bufs = np.reshape(np.array(in_bufs, dtype=object), list(case.shape))
-    return self._testRoundtrip(
-        in_bufs, 'tensorflow.contrib.proto.PackedPrimitiveValue', case.field)
+  def __init__(self, methodName='runTest'):  # pylint: disable=invalid-name
+    super(EncodeProtoOpTest, self).__init__(decode_proto_op, encode_proto_op,
+                                            methodName)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/proto/python/kernel_tests/encode_proto_op_test_base.py b/tensorflow/contrib/proto/python/kernel_tests/encode_proto_op_test_base.py
new file mode 100644
index 0000000000..07dfb924d3
--- /dev/null
+++ b/tensorflow/contrib/proto/python/kernel_tests/encode_proto_op_test_base.py
@@ -0,0 +1,177 @@
+# =============================================================================
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Table-driven test for encode_proto op.
+
+This test is run once with each of the *.TestCase.pbtxt files
+in the test directory.
+
+It tests that encode_proto is a lossless inverse of decode_proto
+(for the specified fields).
+"""
+# Python3 readiness boilerplate
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from absl.testing import parameterized
+import numpy as np
+
+from google.protobuf import text_format
+
+from tensorflow.contrib.proto.python.kernel_tests import proto_op_test_base as test_base
+from tensorflow.contrib.proto.python.kernel_tests import test_example_pb2
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import array_ops
+
+
+class EncodeProtoOpTestBase(test_base.ProtoOpTestBase, parameterized.TestCase):
+  """Base class for testing proto encoding ops."""
+
+  def __init__(self, decode_module, encode_module, methodName='runTest'):  # pylint: disable=invalid-name
+    """EncodeProtoOpTestBase initializer.
+
+    Args:
+      decode_module: a module containing the `decode_proto_op` method
+      encode_module: a module containing  the `encode_proto_op` method
+      methodName: the name of the test method (same as for test.TestCase)
+    """
+
+    super(EncodeProtoOpTestBase, self).__init__(methodName)
+    self._decode_module = decode_module
+    self._encode_module = encode_module
+
+  def testBadInputs(self):
+    # Invalid field name
+    with self.test_session():
+      with self.assertRaisesOpError('Unknown field: non_existent_field'):
+        self._encode_module.encode_proto(
+            sizes=[[1]],
+            values=[np.array([[0.0]], dtype=np.int32)],
+            message_type='tensorflow.contrib.proto.TestValue',
+            field_names=['non_existent_field']).eval()
+
+    # Incorrect types.
+    with self.test_session():
+      with self.assertRaisesOpError(
+          'Incompatible type for field double_value.'):
+        self._encode_module.encode_proto(
+            sizes=[[1]],
+            values=[np.array([[0.0]], dtype=np.int32)],
+            message_type='tensorflow.contrib.proto.TestValue',
+            field_names=['double_value']).eval()
+
+    # Incorrect shapes of sizes.
+    with self.test_session():
+      with self.assertRaisesOpError(
+          r'sizes should be batch_size \+ \[len\(field_names\)\]'):
+        sizes = array_ops.placeholder(dtypes.int32)
+        values = array_ops.placeholder(dtypes.float64)
+        self._encode_module.encode_proto(
+            sizes=sizes,
+            values=[values],
+            message_type='tensorflow.contrib.proto.TestValue',
+            field_names=['double_value']).eval(feed_dict={
+                sizes: [[[0, 0]]],
+                values: [[0.0]]
+            })
+
+    # Inconsistent shapes of values.
+    with self.test_session():
+      with self.assertRaisesOpError(
+          'Values must match up to the last dimension'):
+        sizes = array_ops.placeholder(dtypes.int32)
+        values1 = array_ops.placeholder(dtypes.float64)
+        values2 = array_ops.placeholder(dtypes.int32)
+        (self._encode_module.encode_proto(
+            sizes=[[1, 1]],
+            values=[values1, values2],
+            message_type='tensorflow.contrib.proto.TestValue',
+            field_names=['double_value', 'int32_value']).eval(feed_dict={
+                values1: [[0.0]],
+                values2: [[0], [0]]
+            }))
+
+  def _testRoundtrip(self, in_bufs, message_type, fields):
+
+    field_names = [f.name for f in fields]
+    out_types = [f.dtype for f in fields]
+
+    with self.test_session() as sess:
+      sizes, field_tensors = self._decode_module.decode_proto(
+          in_bufs,
+          message_type=message_type,
+          field_names=field_names,
+          output_types=out_types)
+
+      out_tensors = self._encode_module.encode_proto(
+          sizes,
+          field_tensors,
+          message_type=message_type,
+          field_names=field_names)
+
+      out_bufs, = sess.run([out_tensors])
+
+      # Check that the re-encoded tensor has the same shape.
+      self.assertEqual(in_bufs.shape, out_bufs.shape)
+
+      # Compare the input and output.
+      for in_buf, out_buf in zip(in_bufs.flat, out_bufs.flat):
+        in_obj = test_example_pb2.TestValue()
+        in_obj.ParseFromString(in_buf)
+
+        out_obj = test_example_pb2.TestValue()
+        out_obj.ParseFromString(out_buf)
+
+        # Check that the deserialized objects are identical.
+        self.assertEqual(in_obj, out_obj)
+
+        # Check that the input and output serialized messages are identical.
+        # If we fail here, there is a difference in the serialized
+        # representation but the new serialization still parses. This could
+        # be harmless (a change in map ordering?) or it could be bad (e.g.
+        # loss of packing in the encoding).
+        self.assertEqual(in_buf, out_buf)
+
+  @parameterized.named_parameters(*test_base.ProtoOpTestBase.named_parameters())
+  def testRoundtrip(self, case):
+    in_bufs = [value.SerializeToString() for value in case.values]
+
+    # np.array silently truncates strings if you don't specify dtype=object.
+    in_bufs = np.reshape(np.array(in_bufs, dtype=object), list(case.shapes))
+    return self._testRoundtrip(
+        in_bufs, 'tensorflow.contrib.proto.TestValue', case.fields)
+
+  @parameterized.named_parameters(*test_base.ProtoOpTestBase.named_parameters())
+  def testRoundtripPacked(self, case):
+    # Now try with the packed serialization.
+    # We test the packed representations by loading the same test cases using
+    # PackedTestValue instead of TestValue. To do this we rely on the text
+    # format being the same for packed and unpacked fields, and reparse the test
+    # message using the packed version of the proto.
+    in_bufs = [
+        # Note: float_format='.17g' is necessary to ensure preservation of
+        # doubles and floats in text format.
+        text_format.Parse(
+            text_format.MessageToString(
+                value, float_format='.17g'),
+            test_example_pb2.PackedTestValue()).SerializeToString()
+        for value in case.values
+    ]
+
+    # np.array silently truncates strings if you don't specify dtype=object.
+    in_bufs = np.reshape(np.array(in_bufs, dtype=object), list(case.shapes))
+    return self._testRoundtrip(
+        in_bufs, 'tensorflow.contrib.proto.PackedTestValue', case.fields)
diff --git a/tensorflow/contrib/proto/python/kernel_tests/minmax.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/minmax.TestCase.pbtxt
deleted file mode 100644
index b170f89c0f..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/minmax.TestCase.pbtxt
+++ /dev/null
@@ -1,161 +0,0 @@
-primitive {
-  double_value: -1.7976931348623158e+308
-  double_value: 2.2250738585072014e-308
-  double_value: 1.7976931348623158e+308
-  float_value: -3.402823466e+38
-  float_value: 1.175494351e-38
-  float_value: 3.402823466e+38
-  int64_value: -9223372036854775808
-  int64_value: 9223372036854775807
-  uint64_value: 0
-  uint64_value: 18446744073709551615
-  int32_value: -2147483648
-  int32_value: 2147483647
-  fixed64_value: 0
-  fixed64_value: 18446744073709551615
-  fixed32_value: 0
-  fixed32_value: 4294967295
-  bool_value: false
-  bool_value: true
-  string_value: ""
-  string_value: "I refer to the infinite."
-  uint32_value: 0
-  uint32_value: 4294967295
-  sfixed32_value: -2147483648
-  sfixed32_value: 2147483647
-  sfixed64_value: -9223372036854775808
-  sfixed64_value: 9223372036854775807
-  sint32_value: -2147483648
-  sint32_value: 2147483647
-  sint64_value: -9223372036854775808
-  sint64_value: 9223372036854775807
-}
-shape: 1
-sizes: 3
-sizes: 3
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-sizes: 2
-field {
-  name: "double_value"
-  dtype: DT_DOUBLE
-  expected {
-    double_value: -1.7976931348623158e+308
-    double_value: 2.2250738585072014e-308
-    double_value: 1.7976931348623158e+308
-  }
-}
-field {
-  name: "float_value"
-  dtype: DT_FLOAT
-  expected {
-    float_value: -3.402823466e+38
-    float_value: 1.175494351e-38
-    float_value: 3.402823466e+38
-  }
-}
-field {
-  name: "int64_value"
-  dtype: DT_INT64
-  expected {
-    int64_value: -9223372036854775808
-    int64_value: 9223372036854775807
-  }
-}
-field {
-  name: "uint64_value"
-  dtype: DT_INT64
-  expected {
-    int64_value: 0
-    int64_value: -1
-  }
-}
-field {
-  name: "int32_value"
-  dtype: DT_INT32
-  expected {
-    int32_value: -2147483648
-    int32_value: 2147483647
-  }
-}
-field {
-  name: "fixed64_value"
-  dtype: DT_INT64
-  expected {
-    int64_value: 0
-    int64_value: -1  # unsigned is 18446744073709551615
-  }
-}
-field {
-  name: "fixed32_value"
-  dtype: DT_INT32
-  expected {
-    int32_value: 0
-    int32_value: -1  # unsigned is 4294967295
-  }
-}
-field {
-  name: "bool_value"
-  dtype: DT_BOOL
-  expected {
-    bool_value: false
-    bool_value: true
-  }
-}
-field {
-  name: "string_value"
-  dtype: DT_STRING
-  expected {
-    string_value: ""
-    string_value: "I refer to the infinite."
-  }
-}
-field {
-  name: "uint32_value"
-  dtype: DT_INT32
-  expected {
-    int32_value: 0
-    int32_value: -1  # unsigned is 4294967295
-  }
-}
-field {
-  name: "sfixed32_value"
-  dtype: DT_INT32
-  expected {
-    int32_value: -2147483648
-    int32_value: 2147483647
-  }
-}
-field {
-  name: "sfixed64_value"
-  dtype: DT_INT64
-  expected {
-    int64_value: -9223372036854775808
-    int64_value: 9223372036854775807
-  }
-}
-field {
-  name: "sint32_value"
-  dtype: DT_INT32
-  expected {
-    int32_value: -2147483648
-    int32_value: 2147483647
-  }
-}
-field {
-  name: "sint64_value"
-  dtype: DT_INT64
-  expected {
-    int64_value: -9223372036854775808
-    int64_value: 9223372036854775807
-  }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/nested.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/nested.TestCase.pbtxt
deleted file mode 100644
index c664e52851..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/nested.TestCase.pbtxt
+++ /dev/null
@@ -1,16 +0,0 @@
-primitive {
-  message_value {
-    double_value: 23.5
-  }
-}
-shape: 1
-sizes: 1
-field {
-  name: "message_value"
-  dtype: DT_STRING
-  expected {
-    message_value {
-      double_value: 23.5
-    }
-  }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/optional.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/optional.TestCase.pbtxt
deleted file mode 100644
index 125651d7ea..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/optional.TestCase.pbtxt
+++ /dev/null
@@ -1,20 +0,0 @@
-primitive {
-  bool_value: true
-}
-shape: 1
-sizes: 1
-sizes: 0
-field {
-  name: "bool_value"
-  dtype: DT_BOOL
-  expected {
-    bool_value: true
-  }
-}
-field {
-  name: "double_value"
-  dtype: DT_DOUBLE
-  expected {
-    double_value: 0.0
-  }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/promote_unsigned.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/promote_unsigned.TestCase.pbtxt
deleted file mode 100644
index bc07efc8f3..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/promote_unsigned.TestCase.pbtxt
+++ /dev/null
@@ -1,29 +0,0 @@
-primitive {
-  fixed32_value: 4294967295
-  uint32_value: 4294967295
-}
-shape: 1
-sizes: 1
-field {
-  name: "fixed32_value"
-  dtype: DT_INT64
-  expected {
-    int64_value: 4294967295
-  }
-}
-sizes: 1
-field {
-  name: "uint32_value"
-  dtype: DT_INT64
-  expected {
-    int64_value: 4294967295
-  }
-}
-sizes: 0
-field {
-  name: "uint32_default"
-  dtype: DT_INT64
-  expected {
-    int64_value: 4294967295  # Comes from an explicitly-specified default
-  }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/proto_op_test_base.py b/tensorflow/contrib/proto/python/kernel_tests/proto_op_test_base.py
new file mode 100644
index 0000000000..2950c7dfdc
--- /dev/null
+++ b/tensorflow/contrib/proto/python/kernel_tests/proto_op_test_base.py
@@ -0,0 +1,419 @@
+# =============================================================================
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Test case base for testing proto operations."""
+
+# Python3 preparedness imports.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import ctypes as ct
+import os
+
+from tensorflow.contrib.proto.python.kernel_tests import test_example_pb2
+from tensorflow.core.framework import types_pb2
+from tensorflow.python.platform import test
+
+
+class ProtoOpTestBase(test.TestCase):
+  """Base class for testing proto decoding and encoding ops."""
+
+  def __init__(self, methodName="runTest"):  # pylint: disable=invalid-name
+    super(ProtoOpTestBase, self).__init__(methodName)
+    lib = os.path.join(os.path.dirname(__file__), "libtestexample.so")
+    if os.path.isfile(lib):
+      ct.cdll.LoadLibrary(lib)
+
+  @staticmethod
+  def named_parameters():
+    return (
+        ("defaults", ProtoOpTestBase.defaults_test_case()),
+        ("minmax", ProtoOpTestBase.minmax_test_case()),
+        ("nested", ProtoOpTestBase.nested_test_case()),
+        ("optional", ProtoOpTestBase.optional_test_case()),
+        ("promote", ProtoOpTestBase.promote_test_case()),
+        ("ragged", ProtoOpTestBase.ragged_test_case()),
+        ("shaped_batch", ProtoOpTestBase.shaped_batch_test_case()),
+        ("simple", ProtoOpTestBase.simple_test_case()),
+    )
+
+  @staticmethod
+  def defaults_test_case():
+    test_case = test_example_pb2.TestCase()
+    test_case.values.add()  # No fields specified, so we get all defaults.
+    test_case.shapes.append(1)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "double_value_with_default"
+    field.dtype = types_pb2.DT_DOUBLE
+    field.value.double_value.append(1.0)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "float_value_with_default"
+    field.dtype = types_pb2.DT_FLOAT
+    field.value.float_value.append(2.0)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "int64_value_with_default"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(3)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "sfixed64_value_with_default"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(11)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "sint64_value_with_default"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(13)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "uint64_value_with_default"
+    field.dtype = types_pb2.DT_UINT64
+    field.value.uint64_value.append(4)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "fixed64_value_with_default"
+    field.dtype = types_pb2.DT_UINT64
+    field.value.uint64_value.append(6)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "int32_value_with_default"
+    field.dtype = types_pb2.DT_INT32
+    field.value.int32_value.append(5)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "sfixed32_value_with_default"
+    field.dtype = types_pb2.DT_INT32
+    field.value.int32_value.append(10)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "sint32_value_with_default"
+    field.dtype = types_pb2.DT_INT32
+    field.value.int32_value.append(12)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "uint32_value_with_default"
+    field.dtype = types_pb2.DT_UINT32
+    field.value.uint32_value.append(9)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "fixed32_value_with_default"
+    field.dtype = types_pb2.DT_UINT32
+    field.value.uint32_value.append(7)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "bool_value_with_default"
+    field.dtype = types_pb2.DT_BOOL
+    field.value.bool_value.append(True)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "string_value_with_default"
+    field.dtype = types_pb2.DT_STRING
+    field.value.string_value.append("a")
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "bytes_value_with_default"
+    field.dtype = types_pb2.DT_STRING
+    field.value.string_value.append("a longer default string")
+    return test_case
+
+  @staticmethod
+  def minmax_test_case():
+    test_case = test_example_pb2.TestCase()
+    value = test_case.values.add()
+    value.double_value.append(-1.7976931348623158e+308)
+    value.double_value.append(2.2250738585072014e-308)
+    value.double_value.append(1.7976931348623158e+308)
+    value.float_value.append(-3.402823466e+38)
+    value.float_value.append(1.175494351e-38)
+    value.float_value.append(3.402823466e+38)
+    value.int64_value.append(-9223372036854775808)
+    value.int64_value.append(9223372036854775807)
+    value.sfixed64_value.append(-9223372036854775808)
+    value.sfixed64_value.append(9223372036854775807)
+    value.sint64_value.append(-9223372036854775808)
+    value.sint64_value.append(9223372036854775807)
+    value.uint64_value.append(0)
+    value.uint64_value.append(18446744073709551615)
+    value.fixed64_value.append(0)
+    value.fixed64_value.append(18446744073709551615)
+    value.int32_value.append(-2147483648)
+    value.int32_value.append(2147483647)
+    value.sfixed32_value.append(-2147483648)
+    value.sfixed32_value.append(2147483647)
+    value.sint32_value.append(-2147483648)
+    value.sint32_value.append(2147483647)
+    value.uint32_value.append(0)
+    value.uint32_value.append(4294967295)
+    value.fixed32_value.append(0)
+    value.fixed32_value.append(4294967295)
+    value.bool_value.append(False)
+    value.bool_value.append(True)
+    value.string_value.append("")
+    value.string_value.append("I refer to the infinite.")
+    test_case.shapes.append(1)
+    test_case.sizes.append(3)
+    field = test_case.fields.add()
+    field.name = "double_value"
+    field.dtype = types_pb2.DT_DOUBLE
+    field.value.double_value.append(-1.7976931348623158e+308)
+    field.value.double_value.append(2.2250738585072014e-308)
+    field.value.double_value.append(1.7976931348623158e+308)
+    test_case.sizes.append(3)
+    field = test_case.fields.add()
+    field.name = "float_value"
+    field.dtype = types_pb2.DT_FLOAT
+    field.value.float_value.append(-3.402823466e+38)
+    field.value.float_value.append(1.175494351e-38)
+    field.value.float_value.append(3.402823466e+38)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "int64_value"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(-9223372036854775808)
+    field.value.int64_value.append(9223372036854775807)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "sfixed64_value"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(-9223372036854775808)
+    field.value.int64_value.append(9223372036854775807)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "sint64_value"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(-9223372036854775808)
+    field.value.int64_value.append(9223372036854775807)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "uint64_value"
+    field.dtype = types_pb2.DT_UINT64
+    field.value.uint64_value.append(0)
+    field.value.uint64_value.append(18446744073709551615)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "fixed64_value"
+    field.dtype = types_pb2.DT_UINT64
+    field.value.uint64_value.append(0)
+    field.value.uint64_value.append(18446744073709551615)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "int32_value"
+    field.dtype = types_pb2.DT_INT32
+    field.value.int32_value.append(-2147483648)
+    field.value.int32_value.append(2147483647)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "sfixed32_value"
+    field.dtype = types_pb2.DT_INT32
+    field.value.int32_value.append(-2147483648)
+    field.value.int32_value.append(2147483647)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "sint32_value"
+    field.dtype = types_pb2.DT_INT32
+    field.value.int32_value.append(-2147483648)
+    field.value.int32_value.append(2147483647)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "uint32_value"
+    field.dtype = types_pb2.DT_UINT32
+    field.value.uint32_value.append(0)
+    field.value.uint32_value.append(4294967295)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "fixed32_value"
+    field.dtype = types_pb2.DT_UINT32
+    field.value.uint32_value.append(0)
+    field.value.uint32_value.append(4294967295)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "bool_value"
+    field.dtype = types_pb2.DT_BOOL
+    field.value.bool_value.append(False)
+    field.value.bool_value.append(True)
+    test_case.sizes.append(2)
+    field = test_case.fields.add()
+    field.name = "string_value"
+    field.dtype = types_pb2.DT_STRING
+    field.value.string_value.append("")
+    field.value.string_value.append("I refer to the infinite.")
+    return test_case
+
+  @staticmethod
+  def nested_test_case():
+    test_case = test_example_pb2.TestCase()
+    value = test_case.values.add()
+    message_value = value.message_value.add()
+    message_value.double_value = 23.5
+    test_case.shapes.append(1)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "message_value"
+    field.dtype = types_pb2.DT_STRING
+    message_value = field.value.message_value.add()
+    message_value.double_value = 23.5
+    return test_case
+
+  @staticmethod
+  def optional_test_case():
+    test_case = test_example_pb2.TestCase()
+    value = test_case.values.add()
+    value.bool_value.append(True)
+    test_case.shapes.append(1)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "bool_value"
+    field.dtype = types_pb2.DT_BOOL
+    field.value.bool_value.append(True)
+    test_case.sizes.append(0)
+    field = test_case.fields.add()
+    field.name = "double_value"
+    field.dtype = types_pb2.DT_DOUBLE
+    field.value.double_value.append(0.0)
+    return test_case
+
+  @staticmethod
+  def promote_test_case():
+    test_case = test_example_pb2.TestCase()
+    value = test_case.values.add()
+    value.sint32_value.append(2147483647)
+    value.sfixed32_value.append(2147483647)
+    value.int32_value.append(2147483647)
+    value.fixed32_value.append(4294967295)
+    value.uint32_value.append(4294967295)
+    test_case.shapes.append(1)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "sint32_value"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(2147483647)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "sfixed32_value"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(2147483647)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "int32_value"
+    field.dtype = types_pb2.DT_INT64
+    field.value.int64_value.append(2147483647)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "fixed32_value"
+    field.dtype = types_pb2.DT_UINT64
+    field.value.uint64_value.append(4294967295)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "uint32_value"
+    field.dtype = types_pb2.DT_UINT64
+    field.value.uint64_value.append(4294967295)
+    return test_case
+
+  @staticmethod
+  def ragged_test_case():
+    test_case = test_example_pb2.TestCase()
+    value = test_case.values.add()
+    value.double_value.append(23.5)
+    value.double_value.append(123.0)
+    value.bool_value.append(True)
+    value = test_case.values.add()
+    value.double_value.append(3.1)
+    value.bool_value.append(False)
+    test_case.shapes.append(2)
+    test_case.sizes.append(2)
+    test_case.sizes.append(1)
+    test_case.sizes.append(1)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "double_value"
+    field.dtype = types_pb2.DT_DOUBLE
+    field.value.double_value.append(23.5)
+    field.value.double_value.append(123.0)
+    field.value.double_value.append(3.1)
+    field.value.double_value.append(0.0)
+    field = test_case.fields.add()
+    field.name = "bool_value"
+    field.dtype = types_pb2.DT_BOOL
+    field.value.bool_value.append(True)
+    field.value.bool_value.append(False)
+    return test_case
+
+  @staticmethod
+  def shaped_batch_test_case():
+    test_case = test_example_pb2.TestCase()
+    value = test_case.values.add()
+    value.double_value.append(23.5)
+    value.bool_value.append(True)
+    value = test_case.values.add()
+    value.double_value.append(44.0)
+    value.bool_value.append(False)
+    value = test_case.values.add()
+    value.double_value.append(3.14159)
+    value.bool_value.append(True)
+    value = test_case.values.add()
+    value.double_value.append(1.414)
+    value.bool_value.append(True)
+    value = test_case.values.add()
+    value.double_value.append(-32.2)
+    value.bool_value.append(False)
+    value = test_case.values.add()
+    value.double_value.append(0.0001)
+    value.bool_value.append(True)
+    test_case.shapes.append(3)
+    test_case.shapes.append(2)
+    for _ in range(12):
+      test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "double_value"
+    field.dtype = types_pb2.DT_DOUBLE
+    field.value.double_value.append(23.5)
+    field.value.double_value.append(44.0)
+    field.value.double_value.append(3.14159)
+    field.value.double_value.append(1.414)
+    field.value.double_value.append(-32.2)
+    field.value.double_value.append(0.0001)
+    field = test_case.fields.add()
+    field.name = "bool_value"
+    field.dtype = types_pb2.DT_BOOL
+    field.value.bool_value.append(True)
+    field.value.bool_value.append(False)
+    field.value.bool_value.append(True)
+    field.value.bool_value.append(True)
+    field.value.bool_value.append(False)
+    field.value.bool_value.append(True)
+    return test_case
+
+  @staticmethod
+  def simple_test_case():
+    test_case = test_example_pb2.TestCase()
+    value = test_case.values.add()
+    value.double_value.append(23.5)
+    value.bool_value.append(True)
+    test_case.shapes.append(1)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "double_value"
+    field.dtype = types_pb2.DT_DOUBLE
+    field.value.double_value.append(23.5)
+    test_case.sizes.append(1)
+    field = test_case.fields.add()
+    field.name = "bool_value"
+    field.dtype = types_pb2.DT_BOOL
+    field.value.bool_value.append(True)
+    return test_case
diff --git a/tensorflow/contrib/proto/python/kernel_tests/ragged.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/ragged.TestCase.pbtxt
deleted file mode 100644
index 61c7ac53f7..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/ragged.TestCase.pbtxt
+++ /dev/null
@@ -1,32 +0,0 @@
-primitive {
-  double_value: 23.5
-  double_value: 123.0
-  bool_value: true
-}
-primitive {
-  double_value: 3.1
-  bool_value: false
-}
-shape: 2
-sizes: 2
-sizes: 1
-sizes: 1
-sizes: 1
-field {
-  name: "double_value"
-  dtype: DT_DOUBLE
-  expected {
-    double_value: 23.5
-    double_value: 123.0
-    double_value: 3.1
-    double_value: 0.0
-  }
-}
-field {
-  name: "bool_value"
-  dtype: DT_BOOL
-  expected {
-    bool_value: true
-    bool_value: false
-  }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/shaped_batch.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/shaped_batch.TestCase.pbtxt
deleted file mode 100644
index f4828076d5..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/shaped_batch.TestCase.pbtxt
+++ /dev/null
@@ -1,62 +0,0 @@
-primitive {
-  double_value: 23.5
-  bool_value: true
-}
-primitive {
-  double_value: 44.0
-  bool_value: false
-}
-primitive {
-  double_value: 3.14159
-  bool_value: true
-}
-primitive {
-  double_value: 1.414
-  bool_value: true
-}
-primitive {
-  double_value: -32.2
-  bool_value: false
-}
-primitive {
-  double_value: 0.0001
-  bool_value: true
-}
-shape: 3
-shape: 2
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-sizes: 1
-field {
-  name: "double_value"
-  dtype: DT_DOUBLE
-  expected {
-    double_value: 23.5
-    double_value: 44.0
-    double_value: 3.14159
-    double_value: 1.414
-    double_value: -32.2
-    double_value: 0.0001
-  }
-}
-field {
-  name: "bool_value"
-  dtype: DT_BOOL
-  expected {
-    bool_value: true
-    bool_value: false
-    bool_value: true
-    bool_value: true
-    bool_value: false
-    bool_value: true
-  }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/simple.TestCase.pbtxt b/tensorflow/contrib/proto/python/kernel_tests/simple.TestCase.pbtxt
deleted file mode 100644
index dc20ac147b..0000000000
--- a/tensorflow/contrib/proto/python/kernel_tests/simple.TestCase.pbtxt
+++ /dev/null
@@ -1,21 +0,0 @@
-primitive {
-  double_value: 23.5
-  bool_value: true
-}
-shape: 1
-sizes: 1
-sizes: 1
-field {
-  name: "double_value"
-  dtype: DT_DOUBLE
-  expected {
-    double_value: 23.5
-  }
-}
-field {
-  name: "bool_value"
-  dtype: DT_BOOL
-  expected {
-    bool_value: true
-  }
-}
diff --git a/tensorflow/contrib/proto/python/kernel_tests/test_example.proto b/tensorflow/contrib/proto/python/kernel_tests/test_example.proto
index a2c88e372b..674d881220 100644
--- a/tensorflow/contrib/proto/python/kernel_tests/test_example.proto
+++ b/tensorflow/contrib/proto/python/kernel_tests/test_example.proto
@@ -1,6 +1,4 @@
 // Test description and protos to work with it.
-//
-// Many of the protos in this file are for unit tests that haven't been written yet.
 
 syntax = "proto2";
 
@@ -8,54 +6,27 @@ import "tensorflow/core/framework/types.proto";
 
 package tensorflow.contrib.proto;
 
-// A TestCase holds a proto and a bunch of assertions
-// about how it should decode.
+// A TestCase holds a proto and assertions about how it should decode.
 message TestCase {
-  // A batch of primitives to be serialized and decoded.
-  repeated RepeatedPrimitiveValue primitive = 1;
-  // The shape of the batch.
-  repeated int32 shape = 2;
+  // Batches of primitive values.
+  repeated TestValue values = 1;
+  // The batch shapes.
+  repeated int32 shapes = 2;
   // Expected sizes for each field.
   repeated int32 sizes = 3;
   // Expected values for each field.
-  repeated FieldSpec field = 4;
+  repeated FieldSpec fields = 4;
 };
 
 // FieldSpec describes the expected output for a single field.
 message FieldSpec {
   optional string name = 1;
   optional tensorflow.DataType dtype = 2;
-  optional RepeatedPrimitiveValue expected = 3;
+  optional TestValue value = 3;
 };
 
+// NOTE: This definition must be kept in sync with PackedTestValue.
 message TestValue {
-  optional PrimitiveValue primitive_value = 1;
-  optional EnumValue enum_value = 2;
-  optional MessageValue message_value = 3;
-  optional RepeatedMessageValue repeated_message_value = 4;
-  optional RepeatedPrimitiveValue repeated_primitive_value = 6;
-}
-
-message PrimitiveValue {
-  optional double double_value = 1;
-  optional float float_value = 2;
-  optional int64 int64_value = 3;
-  optional uint64 uint64_value = 4;
-  optional int32 int32_value = 5;
-  optional fixed64 fixed64_value = 6;
-  optional fixed32 fixed32_value = 7;
-  optional bool bool_value = 8;
-  optional string string_value = 9;
-  optional bytes bytes_value = 12;
-  optional uint32 uint32_value = 13;
-  optional sfixed32 sfixed32_value = 15;
-  optional sfixed64 sfixed64_value = 16;
-  optional sint32 sint32_value = 17;
-  optional sint64 sint64_value = 18;
-}
-
-// NOTE: This definition must be kept in sync with PackedPrimitiveValue.
-message RepeatedPrimitiveValue {
   repeated double double_value = 1;
   repeated float float_value = 2;
   repeated int64 int64_value = 3;
@@ -74,30 +45,31 @@ message RepeatedPrimitiveValue {
   repeated PrimitiveValue message_value = 19;
 
   // Optional fields with explicitly-specified defaults.
-  optional double double_default = 20 [default = 1.0];
-  optional float float_default = 21 [default = 2.0];
-  optional int64 int64_default = 22 [default = 3];
-  optional uint64 uint64_default = 23 [default = 4];
-  optional int32 int32_default = 24 [default = 5];
-  optional fixed64 fixed64_default = 25 [default = 6];
-  optional fixed32 fixed32_default = 26 [default = 7];
-  optional bool bool_default = 27 [default = true];
-  optional string string_default = 28 [default = "a"];
-  optional bytes bytes_default = 29 [default = "a longer default string"];
-  optional uint32 uint32_default = 30 [default = 4294967295];
-  optional sfixed32 sfixed32_default = 31 [default = 10];
-  optional sfixed64 sfixed64_default = 32 [default = 11];
-  optional sint32 sint32_default = 33 [default = 12];
-  optional sint64 sint64_default = 34 [default = 13];
+  optional double double_value_with_default = 20 [default = 1.0];
+  optional float float_value_with_default = 21 [default = 2.0];
+  optional int64 int64_value_with_default = 22 [default = 3];
+  optional uint64 uint64_value_with_default = 23 [default = 4];
+  optional int32 int32_value_with_default = 24 [default = 5];
+  optional fixed64 fixed64_value_with_default = 25 [default = 6];
+  optional fixed32 fixed32_value_with_default = 26 [default = 7];
+  optional bool bool_value_with_default = 27 [default = true];
+  optional string string_value_with_default = 28 [default = "a"];
+  optional bytes bytes_value_with_default = 29
+      [default = "a longer default string"];
+  optional uint32 uint32_value_with_default = 30 [default = 9];
+  optional sfixed32 sfixed32_value_with_default = 31 [default = 10];
+  optional sfixed64 sfixed64_value_with_default = 32 [default = 11];
+  optional sint32 sint32_value_with_default = 33 [default = 12];
+  optional sint64 sint64_value_with_default = 34 [default = 13];
 }
 
-// A PackedPrimitiveValue looks exactly the same as a RepeatedPrimitiveValue
-// in the text format, but the binary serializion is different.
-// We test the packed representations by loading the same test cases
-// using this definition instead of RepeatedPrimitiveValue.
-// NOTE: This definition must be kept in sync with RepeatedPrimitiveValue
-// in every way except the packed=true declaration.
-message PackedPrimitiveValue {
+// A PackedTestValue looks exactly the same as a TestValue in the text format,
+// but the binary serializion is different. We test the packed representations
+// by loading the same test cases using this definition instead of TestValue.
+//
+// NOTE: This definition must be kept in sync with TestValue in every way except
+// the packed=true declaration.
+message PackedTestValue {
   repeated double double_value = 1 [packed = true];
   repeated float float_value = 2 [packed = true];
   repeated int64 int64_value = 3 [packed = true];
@@ -115,23 +87,53 @@ message PackedPrimitiveValue {
   repeated sint64 sint64_value = 18 [packed = true];
   repeated PrimitiveValue message_value = 19;
 
-  optional double double_default = 20 [default = 1.0];
-  optional float float_default = 21 [default = 2.0];
-  optional int64 int64_default = 22 [default = 3];
-  optional uint64 uint64_default = 23 [default = 4];
-  optional int32 int32_default = 24 [default = 5];
-  optional fixed64 fixed64_default = 25 [default = 6];
-  optional fixed32 fixed32_default = 26 [default = 7];
-  optional bool bool_default = 27 [default = true];
-  optional string string_default = 28 [default = "a"];
-  optional bytes bytes_default = 29 [default = "a longer default string"];
-  optional uint32 uint32_default = 30 [default = 4294967295];
-  optional sfixed32 sfixed32_default = 31 [default = 10];
-  optional sfixed64 sfixed64_default = 32 [default = 11];
-  optional sint32 sint32_default = 33 [default = 12];
-  optional sint64 sint64_default = 34 [default = 13];
+  optional double double_value_with_default = 20 [default = 1.0];
+  optional float float_value_with_default = 21 [default = 2.0];
+  optional int64 int64_value_with_default = 22 [default = 3];
+  optional uint64 uint64_value_with_default = 23 [default = 4];
+  optional int32 int32_value_with_default = 24 [default = 5];
+  optional fixed64 fixed64_value_with_default = 25 [default = 6];
+  optional fixed32 fixed32_value_with_default = 26 [default = 7];
+  optional bool bool_value_with_default = 27 [default = true];
+  optional string string_value_with_default = 28 [default = "a"];
+  optional bytes bytes_value_with_default = 29
+      [default = "a longer default string"];
+  optional uint32 uint32_value_with_default = 30 [default = 9];
+  optional sfixed32 sfixed32_value_with_default = 31 [default = 10];
+  optional sfixed64 sfixed64_value_with_default = 32 [default = 11];
+  optional sint32 sint32_value_with_default = 33 [default = 12];
+  optional sint64 sint64_value_with_default = 34 [default = 13];
 }
 
+message PrimitiveValue {
+  optional double double_value = 1;
+  optional float float_value = 2;
+  optional int64 int64_value = 3;
+  optional uint64 uint64_value = 4;
+  optional int32 int32_value = 5;
+  optional fixed64 fixed64_value = 6;
+  optional fixed32 fixed32_value = 7;
+  optional bool bool_value = 8;
+  optional string string_value = 9;
+  optional bytes bytes_value = 12;
+  optional uint32 uint32_value = 13;
+  optional sfixed32 sfixed32_value = 15;
+  optional sfixed64 sfixed64_value = 16;
+  optional sint32 sint32_value = 17;
+  optional sint64 sint64_value = 18;
+}
+
+// Message containing fields with field numbers higher than any field above.
+// An instance of this message is prepended to each binary message in the test
+// to exercise the code path that handles fields encoded out of order of field
+// number.
+message ExtraFields {
+  optional string string_value = 1776;
+  optional bool bool_value = 1777;
+}
+
+// The messages below are for yet-to-be created tests.
+
 message EnumValue {
   enum Color {
     RED = 0;
@@ -171,12 +173,3 @@ message RepeatedMessageValue {
 
   repeated NestedMessageValue message_values = 11;
 }
-
-// Message containing fields with field numbers higher than any field above. An
-// instance of this message is prepended to each binary message in the test to
-// exercise the code path that handles fields encoded out of order of field
-// number.
-message ExtraFields {
-  optional string string_value = 1776;
-  optional bool bool_value = 1777;
-}
diff --git a/tensorflow/contrib/quantize/python/quantize.py b/tensorflow/contrib/quantize/python/quantize.py
index 19e5bef1ea..903faeff11 100644
--- a/tensorflow/contrib/quantize/python/quantize.py
+++ b/tensorflow/contrib/quantize/python/quantize.py
@@ -261,6 +261,16 @@ def _FindLayersToQuantize(graph):
 
   layer_output_pattern = graph_matcher.OneofPattern(
       [batch_to_space_pattern, layer_pattern])
+
+  # For separable convolutions, we are looking for a conv, followed by a conv
+  # with no activations between the two.
+  sep_conv_pattern = graph_matcher.OpTypePattern(
+      '|'.join(_QUANTIZABLE_TYPES),
+      inputs=[
+          graph_matcher.OneofPattern([layer_output_pattern]),
+          graph_matcher.OpTypePattern('*')
+      ],
+      ordered_inputs=False)
   folded_bias_mul_pattern = graph_matcher.OpTypePattern(
       'Mul',
       inputs=[graph_matcher.OpTypePattern('*'), layer_output_pattern],
@@ -278,6 +288,13 @@ def _FindLayersToQuantize(graph):
       ],
       ordered_inputs=False)
 
+  # batch_norms with forced updates have an Identity operation at the end.
+  # TODO(suharshs): Find a way to easily skip extra Identity operations. The
+  # current issue is that doing so can often match patterns across many layers
+  # incorrectly.
+  batch_norm_identity = graph_matcher.OpTypePattern(
+      'Identity', inputs=[folded_bias_add_pattern])
+
   bias_add_pattern = graph_matcher.OpTypePattern(
       'Add|BiasAdd', inputs=[layer_output_pattern, '*'], ordered_inputs=False)
 
@@ -286,20 +303,22 @@ def _FindLayersToQuantize(graph):
       'Add',
       inputs=[
           graph_matcher.OneofPattern(
-              [bias_add_pattern, folded_bias_add_pattern]), '*'
+              [bias_add_pattern, folded_bias_add_pattern, batch_norm_identity]),
+          '*'
       ],
       ordered_inputs=False)
 
   # The input to the activation can come from bias add, fold bias add, the
   # bypasses.
   # TODO(suharshs): We should ideally skip Identity operations instead of
-  # treating them as an activation.
+  # treating them as activations.
   activation_pattern = graph_matcher.OpTypePattern(
       '|'.join(_ACTIVATION_TYPES) + '|Identity',
       inputs=[
           graph_matcher.OneofPattern([
               bias_add_pattern,
               folded_bias_add_pattern,
+              batch_norm_identity,
               bypass_pattern,
           ])
       ])
@@ -384,6 +403,17 @@ def _FindLayersToQuantize(graph):
       layer_matches.append(
           _LayerMatch(layer_op, weight_tensor, activation_op, None, None, None))
 
+  # Look for separable convolutions here
+  sep_conv_matcher = graph_matcher.GraphMatcher(sep_conv_pattern)
+  for match_result in sep_conv_matcher.match_graph(graph):
+    layer_op = match_result.get_op(layer_pattern)
+    weight_tensor = match_result.get_tensor(weight_identity_pattern)
+    activation_op = match_result.get_op(layer_pattern)
+    if layer_op not in matched_layer_set:
+      matched_layer_set.add(layer_op)
+      layer_matches.append(
+          _LayerMatch(layer_op, weight_tensor, activation_op, None, None, None))
+
   return layer_matches
 
 
diff --git a/tensorflow/contrib/quantize/python/quantize_graph.py b/tensorflow/contrib/quantize/python/quantize_graph.py
index 11d052d7f4..2944f964c7 100644
--- a/tensorflow/contrib/quantize/python/quantize_graph.py
+++ b/tensorflow/contrib/quantize/python/quantize_graph.py
@@ -191,6 +191,7 @@ def experimental_create_training_graph(input_graph=None,
 def experimental_create_eval_graph(input_graph=None,
                                    weight_bits=8,
                                    activation_bits=8,
+                                   quant_delay=None,
                                    scope=None):
   """Rewrites an eval input_graph in place for simulated quantization.
 
@@ -209,6 +210,8 @@ def experimental_create_eval_graph(input_graph=None,
       default graph.
     weight_bits: Number of bits to use for quantizing weights.
     activation_bits: Number of bits to use for quantizing activations.
+    quant_delay: Number of steps after which weights and activations are
+      quantized during eval.
     scope: The scope to be transformed. If it's not None, only the ops which
       are in this scope will be transformed.
 
@@ -221,4 +224,5 @@ def experimental_create_eval_graph(input_graph=None,
       is_training=False,
       weight_bits=weight_bits,
       activation_bits=activation_bits,
+      quant_delay=quant_delay,
       scope=scope)
diff --git a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
index 5e3af0a567..31a2955ddb 100644
--- a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
@@ -654,8 +654,80 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     graph_def_after = str(graph.as_graph_def())
     self.assertEqual(graph_def_before, graph_def_after)
 
-  def _BatchNormParams(self, fused=False):
-    return {'center': True, 'scale': True, 'decay': 1.0 - 0.003, 'fused': fused}
+  def testBatchNormForcedUpdates(self):
+    parameter_list = [
+        # (activation, activation_op_name, fused_batch_norm)
+        (nn_ops.relu6, 'Relu6', False),
+        (nn_ops.relu, 'Relu', False),
+        (array_ops.identity, 'Identity', False),
+        (nn_ops.relu6, 'Relu6', True),
+        (nn_ops.relu, 'Relu', True),
+        (array_ops.identity, 'Identity', True),
+    ]
+    for params in parameter_list:
+      self._TestBatchNormForcedUpdates(params[0], params[1], params[2], False)
+      self._TestBatchNormForcedUpdates(params[0], params[1], params[2], True)
+
+  def _TestBatchNormForcedUpdates(self, activation, activation_op_name,
+                                  fused_batch_norm, use_resource):
+    """post_activation bypass quantization should happen with forced updates."""
+    graph = ops.Graph()
+    with graph.as_default():
+      variable_scope.get_variable_scope().set_use_resource(use_resource)
+      batch_size, height, width, depth = 5, 128, 128, 3
+      input1 = array_ops.zeros((batch_size, height, width, depth))
+      input2 = array_ops.zeros((batch_size, height / 2, width / 2, 32))
+      # Setting updates_collections to None forces updates adding an extra
+      # identity operation following batch norms.
+      bn_params = self._BatchNormParams(
+          fused=fused_batch_norm, force_updates=True)
+      conv = conv2d(
+          input1,
+          32, [5, 5],
+          stride=2,
+          padding='SAME',
+          weights_initializer=self._WeightInit(0.09),
+          activation_fn=activation,
+          normalizer_fn=batch_norm,
+          normalizer_params=bn_params,
+          scope='test/test')
+      bypass_tensor = math_ops.add(conv, input2, name='test/add')
+      # The output of the post_activation bypass will be another layer.
+      _ = conv2d(
+          bypass_tensor,
+          32, [5, 5],
+          stride=2,
+          padding='SAME',
+          weights_initializer=self._WeightInit(0.09),
+          normalizer_fn=batch_norm,
+          normalizer_params=bn_params,
+          activation_fn=activation,
+          scope='test/unused')
+
+      fold_batch_norms.FoldBatchNorms(graph, is_training=True)
+      quantize.Quantize(graph, is_training=True)
+
+      # Ensure that the bypass node is preceded by and followed by a
+      # FakeQuantWithMinMaxVar operation, since the output of the Add isn't an
+      # activation.
+      self.assertTrue('FakeQuantWithMinMaxVars' in
+                      [c.type for c in bypass_tensor.consumers()])
+      self.assertTrue('FakeQuantWithMinMaxVars' in
+                      [i.op.type for i in bypass_tensor.op.inputs])
+
+    with open('/tmp/bn_quant_test.pbtxt', 'w') as f:
+      f.write(str(graph.as_graph_def()))
+
+  def _BatchNormParams(self, fused=False, force_updates=False):
+    params = {
+        'center': True,
+        'scale': True,
+        'decay': 1.0 - 0.003,
+        'fused': fused
+    }
+    if force_updates:
+      params['updates_collections'] = None
+    return params
 
   def _WeightInit(self, stddev):
     """Returns truncated normal variable initializer.
diff --git a/tensorflow/contrib/quantize/python/quantize_test.py b/tensorflow/contrib/quantize/python/quantize_test.py
index 92ca4a1b0c..98209fffb9 100644
--- a/tensorflow/contrib/quantize/python/quantize_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_test.py
@@ -122,12 +122,67 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         array_ops.identity(node, name='control_dependency')
 
     quantize.Quantize(graph, is_training, weight_bits=8, activation_bits=8)
+    # Check if output of bias add is quantized
+    quantization_node_name = 'FakeQuantWithMinMaxVars'
+    conv_quant = graph.get_operation_by_name('test/test/conv_quant/' +
+                                             quantization_node_name)
+    self.assertEqual(conv_quant.type, quantization_node_name)
+
+    for op in graph.get_operations():
+      if op.type == quantization_node_name:
+        quant_op = graph.get_operation_by_name(op.name)
+        # Scan through all FakeQuant operations, ensuring that the activation
+        # identity op isn't in the consumers of the operation.
+        consumers = []
+        for output in quant_op.outputs:
+          consumers.extend(output.consumers())
 
+        self.assertNotIn('test/relu6', [c.name for c in consumers])
+
+  def testInsertQuantOpInSeparableConv2d(self):
+    self._RunTestOverParameters(self._TestInsertQuantOpInSeparableConv2d)
+
+  def _TestInsertQuantOpInSeparableConv2d(self, is_training):
+    graph = ops.Graph()
+    with graph.as_default():
+      batch_size, height, width, depth = 5, 128, 128, 3
+      input1 = array_ops.zeros((batch_size, height, width, depth))
+      input2 = array_ops.zeros((batch_size, height / 2, width / 2, depth))
+      conv = separable_conv2d(
+          input1,
+          3, [5, 5],
+          stride=2,
+          depth_multiplier=1.0,
+          padding='SAME',
+          weights_initializer=self._WeightInit(0.09),
+          activation_fn=None,
+          scope='test/test')
+      node = math_ops.add(conv, input2, name='test/add')
+      node = nn_ops.relu6(node, name='test/relu6')
+      update_barrier = control_flow_ops.no_op(name='update_barrier')
+      with ops.control_dependencies([update_barrier]):
+        array_ops.identity(node, name='control_dependency')
+
+    quantize.Quantize(graph, is_training, weight_bits=8, activation_bits=8)
+    # Check if output of bias add is quantized
     quantization_node_name = 'FakeQuantWithMinMaxVars'
     conv_quant = graph.get_operation_by_name('test/test/conv_quant/' +
                                              quantization_node_name)
     self.assertEqual(conv_quant.type, quantization_node_name)
 
+    # Check if weights for both convs inside seperable conv are quantized
+    pointwise_weight_quant = graph.get_operation_by_name(
+        'test/test/weights_quant/' + quantization_node_name)
+    self.assertEqual(pointwise_weight_quant.type, quantization_node_name)
+    depthwise_weight_quant = graph.get_operation_by_name(
+        'test/test/separable_conv2d/weights_quant/' + quantization_node_name)
+    self.assertEqual(depthwise_weight_quant.type, quantization_node_name)
+
+    # Check if activations after first depthwise conv are quantized.
+    depthwise_act_quant = graph.get_operation_by_name(
+        'test/test/separable_conv2d/act_quant/' + quantization_node_name)
+    self.assertEqual(depthwise_act_quant.type, quantization_node_name)
+
     for op in graph.get_operations():
       if op.type == quantization_node_name:
         quant_op = graph.get_operation_by_name(op.name)
diff --git a/tensorflow/contrib/recurrent/python/ops/recurrent.py b/tensorflow/contrib/recurrent/python/ops/recurrent.py
index fa16b82ab6..4f289e0c85 100644
--- a/tensorflow/contrib/recurrent/python/ops/recurrent.py
+++ b/tensorflow/contrib/recurrent/python/ops/recurrent.py
@@ -79,7 +79,7 @@ def _Index(struct, index):
   """
   index = ops.convert_to_tensor(index)
   index.get_shape().assert_has_rank(0)
-  return nest.map_structure(lambda x: x[index], struct)
+  return nest.map_structure(lambda x: array_ops.gather(x, index), struct)
 
 
 def _Update(struct_acc, struct_x, t):
diff --git a/tensorflow/contrib/rnn/BUILD b/tensorflow/contrib/rnn/BUILD
index 4eb5c920b3..2a84629080 100644
--- a/tensorflow/contrib/rnn/BUILD
+++ b/tensorflow/contrib/rnn/BUILD
@@ -118,7 +118,6 @@ cuda_py_tests(
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:init_ops",
         "//tensorflow/python:math_ops",
-        "//tensorflow/python:random_ops",
         "//tensorflow/python:rnn",
         "//tensorflow/python:rnn_cell",
         "//tensorflow/python:variable_scope",
diff --git a/tensorflow/contrib/rnn/__init__.py b/tensorflow/contrib/rnn/__init__.py
index 07227bcb77..cb437f2a2f 100644
--- a/tensorflow/contrib/rnn/__init__.py
+++ b/tensorflow/contrib/rnn/__init__.py
@@ -59,6 +59,9 @@ See @{$python/contrib.rnn} guide.
 @@HighwayWrapper
 @@GLSTMCell
 @@SRUCell
+@@IndRNNCell
+@@IndyGRUCell
+@@IndyLSTMCell
 
 <!--RNNCell wrappers-->
 @@AttentionCellWrapper
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
index 86f1e27abd..85f0f8ced9 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
@@ -18,7 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import functools
 import os
 
 import numpy as np
@@ -35,7 +34,6 @@ from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import rnn
 from tensorflow.python.ops import rnn_cell_impl
 from tensorflow.python.ops import variable_scope
@@ -117,6 +115,27 @@ class RNNCellTest(test.TestCase):
         })
         self.assertEqual(res[0].shape, (1, 2))
 
+  def testIndRNNCell(self):
+    with self.test_session() as sess:
+      with variable_scope.variable_scope(
+          "root", initializer=init_ops.constant_initializer(0.5)):
+        x = array_ops.zeros([1, 2])
+        m = array_ops.zeros([1, 2])
+        cell = contrib_rnn_cell.IndRNNCell(2)
+        g, _ = cell(x, m)
+        self.assertEqual([
+            "root/ind_rnn_cell/%s_w:0" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+            "root/ind_rnn_cell/%s_u:0" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+            "root/ind_rnn_cell/%s:0" % rnn_cell_impl._BIAS_VARIABLE_NAME
+        ], [v.name for v in cell.trainable_variables])
+        self.assertFalse(cell.non_trainable_variables)
+        sess.run([variables_lib.global_variables_initializer()])
+        res = sess.run([g], {
+            x.name: np.array([[1., 1.]]),
+            m.name: np.array([[0.1, 0.1]])
+        })
+        self.assertEqual(res[0].shape, (1, 2))
+
   def testGRUCell(self):
     with self.test_session() as sess:
       with variable_scope.variable_scope(
@@ -145,6 +164,34 @@ class RNNCellTest(test.TestCase):
         # Smoke test
         self.assertAllClose(res[0], [[0.156736, 0.156736]])
 
+  def testIndyGRUCell(self):
+    with self.test_session() as sess:
+      with variable_scope.variable_scope(
+          "root", initializer=init_ops.constant_initializer(0.5)):
+        x = array_ops.zeros([1, 2])
+        m = array_ops.zeros([1, 2])
+        g, _ = contrib_rnn_cell.IndyGRUCell(2)(x, m)
+        sess.run([variables_lib.global_variables_initializer()])
+        res = sess.run([g], {
+            x.name: np.array([[1., 1.]]),
+            m.name: np.array([[0.1, 0.1]])
+        })
+        # Smoke test
+        self.assertAllClose(res[0], [[0.185265, 0.17704]])
+      with variable_scope.variable_scope(
+          "other", initializer=init_ops.constant_initializer(0.5)):
+        # Test IndyGRUCell with input_size != num_units.
+        x = array_ops.zeros([1, 3])
+        m = array_ops.zeros([1, 2])
+        g, _ = contrib_rnn_cell.IndyGRUCell(2)(x, m)
+        sess.run([variables_lib.global_variables_initializer()])
+        res = sess.run([g], {
+            x.name: np.array([[1., 1., 1.]]),
+            m.name: np.array([[0.1, 0.1]])
+        })
+        # Smoke test
+        self.assertAllClose(res[0], [[0.155127, 0.157328]])
+
   def testSRUCell(self):
     with self.test_session() as sess:
       with variable_scope.variable_scope(
@@ -345,6 +392,72 @@ class RNNCellTest(test.TestCase):
         self.assertAllClose(res[1], expected_mem0)
         self.assertAllClose(res[2], expected_mem1)
 
+  def testIndyLSTMCell(self):
+    for dtype in [dtypes.float16, dtypes.float32]:
+      np_dtype = dtype.as_numpy_dtype
+      with self.test_session(graph=ops.Graph()) as sess:
+        with variable_scope.variable_scope(
+            "root", initializer=init_ops.constant_initializer(0.5)):
+          x = array_ops.zeros([1, 2], dtype=dtype)
+          state_0 = (array_ops.zeros([1, 2], dtype=dtype),) * 2
+          state_1 = (array_ops.zeros([1, 2], dtype=dtype),) * 2
+          cell = rnn_cell_impl.MultiRNNCell(
+              [contrib_rnn_cell.IndyLSTMCell(2) for _ in range(2)])
+          self.assertEqual(cell.dtype, None)
+          self.assertEqual("cell-0", cell._checkpoint_dependencies[0].name)
+          self.assertEqual("cell-1", cell._checkpoint_dependencies[1].name)
+          cell.get_config()  # Should not throw an error
+          g, (out_state_0, out_state_1) = cell(x, (state_0, state_1))
+          # Layer infers the input type.
+          self.assertEqual(cell.dtype, dtype.name)
+          expected_variable_names = [
+              "root/multi_rnn_cell/cell_0/indy_lstm_cell/%s_w:0" %
+              rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+              "root/multi_rnn_cell/cell_0/indy_lstm_cell/%s_u:0" %
+              rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+              "root/multi_rnn_cell/cell_0/indy_lstm_cell/%s:0" %
+              rnn_cell_impl._BIAS_VARIABLE_NAME,
+              "root/multi_rnn_cell/cell_1/indy_lstm_cell/%s_w:0" %
+              rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+              "root/multi_rnn_cell/cell_1/indy_lstm_cell/%s_u:0" %
+              rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+              "root/multi_rnn_cell/cell_1/indy_lstm_cell/%s:0" %
+              rnn_cell_impl._BIAS_VARIABLE_NAME
+          ]
+          self.assertEqual(expected_variable_names,
+                           [v.name for v in cell.trainable_variables])
+          self.assertFalse(cell.non_trainable_variables)
+          sess.run([variables_lib.global_variables_initializer()])
+          res = sess.run(
+              [g, out_state_0, out_state_1], {
+                  x.name: np.array([[1., 1.]]),
+                  state_0[0].name: 0.1 * np.ones([1, 2]),
+                  state_0[1].name: 0.1 * np.ones([1, 2]),
+                  state_1[0].name: 0.1 * np.ones([1, 2]),
+                  state_1[1].name: 0.1 * np.ones([1, 2]),
+              })
+          self.assertEqual(len(res), 3)
+          variables = variables_lib.global_variables()
+          self.assertEqual(expected_variable_names, [v.name for v in variables])
+          # Only check the range of outputs as this is just a smoke test.
+          self.assertAllInRange(res[0], -1.0, 1.0)
+          self.assertAllInRange(res[1], -1.0, 1.0)
+          self.assertAllInRange(res[2], -1.0, 1.0)
+        with variable_scope.variable_scope(
+            "other", initializer=init_ops.constant_initializer(0.5)):
+          # Test IndyLSTMCell with input_size != num_units.
+          x = array_ops.zeros([1, 3], dtype=dtype)
+          state = (array_ops.zeros([1, 2], dtype=dtype),) * 2
+          g, out_state = contrib_rnn_cell.IndyLSTMCell(2)(x, state)
+          sess.run([variables_lib.global_variables_initializer()])
+          res = sess.run(
+              [g, out_state], {
+                  x.name: np.array([[1., 1., 1.]], dtype=np_dtype),
+                  state[0].name: 0.1 * np.ones([1, 2], dtype=np_dtype),
+                  state[1].name: 0.1 * np.ones([1, 2], dtype=np_dtype),
+              })
+          self.assertEqual(len(res), 2)
+
   def testLSTMCell(self):
     with self.test_session() as sess:
       num_units = 8
@@ -935,50 +1048,6 @@ class DropoutWrapperTest(test.TestCase):
     self.assertAllClose(res0[1].h, res1[1].h)
 
 
-class SlimRNNCellTest(test.TestCase):
-
-  def testBasicRNNCell(self):
-    with self.test_session() as sess:
-      with variable_scope.variable_scope(
-          "root", initializer=init_ops.constant_initializer(0.5)):
-        x = array_ops.zeros([1, 2])
-        m = array_ops.zeros([1, 2])
-        my_cell = functools.partial(basic_rnn_cell, num_units=2)
-        # pylint: disable=protected-access
-        g, _ = rnn_cell_impl._SlimRNNCell(my_cell)(x, m)
-        # pylint: enable=protected-access
-        sess.run([variables_lib.global_variables_initializer()])
-        res = sess.run([g], {
-            x.name: np.array([[1., 1.]]),
-            m.name: np.array([[0.1, 0.1]])
-        })
-        self.assertEqual(res[0].shape, (1, 2))
-
-  def testBasicRNNCellMatch(self):
-    batch_size = 32
-    input_size = 100
-    num_units = 10
-    with self.test_session() as sess:
-      with variable_scope.variable_scope(
-          "root", initializer=init_ops.constant_initializer(0.5)):
-        inputs = random_ops.random_uniform((batch_size, input_size))
-        _, initial_state = basic_rnn_cell(inputs, None, num_units)
-        rnn_cell = rnn_cell_impl.BasicRNNCell(num_units)
-        outputs, state = rnn_cell(inputs, initial_state)
-        variable_scope.get_variable_scope().reuse_variables()
-        my_cell = functools.partial(basic_rnn_cell, num_units=num_units)
-        # pylint: disable=protected-access
-        slim_cell = rnn_cell_impl._SlimRNNCell(my_cell)
-        # pylint: enable=protected-access
-        slim_outputs, slim_state = slim_cell(inputs, initial_state)
-        self.assertEqual(slim_outputs.get_shape(), outputs.get_shape())
-        self.assertEqual(slim_state.get_shape(), state.get_shape())
-        sess.run([variables_lib.global_variables_initializer()])
-        res = sess.run([slim_outputs, slim_state, outputs, state])
-        self.assertAllClose(res[0], res[2])
-        self.assertAllClose(res[1], res[3])
-
-
 def basic_rnn_cell(inputs, state, num_units, scope=None):
   if state is None:
     if inputs is not None:
diff --git a/tensorflow/contrib/rnn/python/ops/rnn.py b/tensorflow/contrib/rnn/python/ops/rnn.py
index 2f0caadda3..0266b72dcb 100644
--- a/tensorflow/contrib/rnn/python/ops/rnn.py
+++ b/tensorflow/contrib/rnn/python/ops/rnn.py
@@ -175,7 +175,7 @@ def stack_bidirectional_dynamic_rnn(cells_fw,
   Returns:
     A tuple (outputs, output_state_fw, output_state_bw) where:
       outputs: Output `Tensor` shaped:
-        `batch_size, max_time, layers_output]`. Where layers_output
+        `[batch_size, max_time, layers_output]`. Where layers_output
         are depth-concatenated forward and backward outputs.
       output_states_fw is the final states, one tensor per layer,
         of the forward rnn.
diff --git a/tensorflow/contrib/rnn/python/ops/rnn_cell.py b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
index b12e2cd5ed..1816b469ee 100644
--- a/tensorflow/contrib/rnn/python/ops/rnn_cell.py
+++ b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
@@ -23,6 +23,7 @@ import math
 from tensorflow.contrib.compiler import jit
 from tensorflow.contrib.layers.python.layers import layers
 from tensorflow.contrib.rnn.python.ops import core_rnn_cell
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import op_def_registry
 from tensorflow.python.framework import ops
@@ -30,6 +31,7 @@ from tensorflow.python.framework import tensor_shape
 from tensorflow.python.layers import base as base_layer
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import clip_ops
+from tensorflow.python.ops import gen_array_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_impl  # pylint: disable=unused-import
@@ -3050,3 +3052,343 @@ class WeightNormLSTMCell(rnn_cell_impl.RNNCell):
 
       new_state = rnn_cell_impl.LSTMStateTuple(new_c, new_h)
       return new_h, new_state
+
+
+class IndRNNCell(rnn_cell_impl.LayerRNNCell):
+  """Independently Recurrent Neural Network (IndRNN) cell
+    (cf. https://arxiv.org/abs/1803.04831).
+
+  Args:
+    num_units: int, The number of units in the RNN cell.
+    activation: Nonlinearity to use.  Default: `tanh`.
+    reuse: (optional) Python boolean describing whether to reuse variables
+     in an existing scope.  If not `True`, and the existing scope already has
+     the given variables, an error is raised.
+    name: String, the name of the layer. Layers with the same name will
+      share weights, but to avoid mistakes we require reuse=True in such
+      cases.
+    dtype: Default dtype of the layer (default of `None` means use the type
+      of the first input). Required when `build` is called before `call`.
+  """
+
+  def __init__(self,
+               num_units,
+               activation=None,
+               reuse=None,
+               name=None,
+               dtype=None):
+    super(IndRNNCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)
+
+    # Inputs must be 2-dimensional.
+    self.input_spec = base_layer.InputSpec(ndim=2)
+
+    self._num_units = num_units
+    self._activation = activation or math_ops.tanh
+
+  @property
+  def state_size(self):
+    return self._num_units
+
+  @property
+  def output_size(self):
+    return self._num_units
+
+  def build(self, inputs_shape):
+    if inputs_shape[1].value is None:
+      raise ValueError(
+          "Expected inputs.shape[-1] to be known, saw shape: %s" % inputs_shape)
+
+    input_depth = inputs_shape[1].value
+    # pylint: disable=protected-access
+    self._kernel_w = self.add_variable(
+        "%s_w" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[input_depth, self._num_units])
+    self._kernel_u = self.add_variable(
+        "%s_u" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[1, self._num_units],
+        initializer=init_ops.random_uniform_initializer(
+            minval=-1, maxval=1, dtype=self.dtype))
+    self._bias = self.add_variable(
+        rnn_cell_impl._BIAS_VARIABLE_NAME,
+        shape=[self._num_units],
+        initializer=init_ops.zeros_initializer(dtype=self.dtype))
+    # pylint: enable=protected-access
+
+    self.built = True
+
+  def call(self, inputs, state):
+    """IndRNN: output = new_state = act(W * input + u * state + B)."""
+
+    gate_inputs = math_ops.matmul(inputs, self._kernel_w) + (
+        state * self._kernel_u)
+    gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
+    output = self._activation(gate_inputs)
+    return output, output
+
+
+class IndyGRUCell(rnn_cell_impl.LayerRNNCell):
+  r"""Independently Gated Recurrent Unit cell.
+
+  Based on IndRNNs (https://arxiv.org/abs/1803.04831) and similar to GRUCell,
+  yet with the \(U_r\), \(U_z\), and \(U\) matrices in equations 5, 6, and
+  8 of http://arxiv.org/abs/1406.1078 respectively replaced by diagonal
+  matrices, i.e. a Hadamard product with a single vector:
+
+    $$r_j = \sigma\left([\mathbf W_r\mathbf x]_j +
+      [\mathbf u_r\circ \mathbf h_{(t-1)}]_j\right)$$
+    $$z_j = \sigma\left([\mathbf W_z\mathbf x]_j +
+      [\mathbf u_z\circ \mathbf h_{(t-1)}]_j\right)$$
+    $$\tilde{h}^{(t)}_j = \phi\left([\mathbf W \mathbf x]_j +
+      [\mathbf u \circ \mathbf r \circ \mathbf h_{(t-1)}]_j\right)$$
+
+  where \(\circ\) denotes the Hadamard operator. This means that each IndyGRU
+  node sees only its own state, as opposed to seeing all states in the same
+  layer.
+
+  TODO(gonnet): Write a paper describing this and add a reference here.
+
+  Args:
+    num_units: int, The number of units in the GRU cell.
+    activation: Nonlinearity to use.  Default: `tanh`.
+    reuse: (optional) Python boolean describing whether to reuse variables
+     in an existing scope.  If not `True`, and the existing scope already has
+     the given variables, an error is raised.
+    kernel_initializer: (optional) The initializer to use for the weight
+      matrices applied to the input.
+    bias_initializer: (optional) The initializer to use for the bias.
+    name: String, the name of the layer. Layers with the same name will
+      share weights, but to avoid mistakes we require reuse=True in such
+      cases.
+    dtype: Default dtype of the layer (default of `None` means use the type
+      of the first input). Required when `build` is called before `call`.
+  """
+
+  def __init__(self,
+               num_units,
+               activation=None,
+               reuse=None,
+               kernel_initializer=None,
+               bias_initializer=None,
+               name=None,
+               dtype=None):
+    super(IndyGRUCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)
+
+    # Inputs must be 2-dimensional.
+    self.input_spec = base_layer.InputSpec(ndim=2)
+
+    self._num_units = num_units
+    self._activation = activation or math_ops.tanh
+    self._kernel_initializer = kernel_initializer
+    self._bias_initializer = bias_initializer
+
+  @property
+  def state_size(self):
+    return self._num_units
+
+  @property
+  def output_size(self):
+    return self._num_units
+
+  def build(self, inputs_shape):
+    if inputs_shape[1].value is None:
+      raise ValueError(
+          "Expected inputs.shape[-1] to be known, saw shape: %s" % inputs_shape)
+
+    input_depth = inputs_shape[1].value
+    # pylint: disable=protected-access
+    self._gate_kernel_w = self.add_variable(
+        "gates/%s_w" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[input_depth, 2 * self._num_units],
+        initializer=self._kernel_initializer)
+    self._gate_kernel_u = self.add_variable(
+        "gates/%s_u" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[1, 2 * self._num_units],
+        initializer=init_ops.random_uniform_initializer(
+            minval=-1, maxval=1, dtype=self.dtype))
+    self._gate_bias = self.add_variable(
+        "gates/%s" % rnn_cell_impl._BIAS_VARIABLE_NAME,
+        shape=[2 * self._num_units],
+        initializer=(self._bias_initializer
+                     if self._bias_initializer is not None else
+                     init_ops.constant_initializer(1.0, dtype=self.dtype)))
+    self._candidate_kernel_w = self.add_variable(
+        "candidate/%s" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[input_depth, self._num_units],
+        initializer=self._kernel_initializer)
+    self._candidate_kernel_u = self.add_variable(
+        "candidate/%s_u" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[1, self._num_units],
+        initializer=init_ops.random_uniform_initializer(
+            minval=-1, maxval=1, dtype=self.dtype))
+    self._candidate_bias = self.add_variable(
+        "candidate/%s" % rnn_cell_impl._BIAS_VARIABLE_NAME,
+        shape=[self._num_units],
+        initializer=(self._bias_initializer
+                     if self._bias_initializer is not None else
+                     init_ops.zeros_initializer(dtype=self.dtype)))
+    # pylint: enable=protected-access
+
+    self.built = True
+
+  def call(self, inputs, state):
+    """Gated recurrent unit (GRU) with nunits cells."""
+
+    gate_inputs = math_ops.matmul(inputs, self._gate_kernel_w) + (
+        gen_array_ops.tile(state, [1, 2]) * self._gate_kernel_u)
+    gate_inputs = nn_ops.bias_add(gate_inputs, self._gate_bias)
+
+    value = math_ops.sigmoid(gate_inputs)
+    r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)
+
+    r_state = r * state
+
+    candidate = math_ops.matmul(inputs, self._candidate_kernel_w) + (
+        r_state * self._candidate_kernel_u)
+    candidate = nn_ops.bias_add(candidate, self._candidate_bias)
+
+    c = self._activation(candidate)
+    new_h = u * state + (1 - u) * c
+    return new_h, new_h
+
+
+class IndyLSTMCell(rnn_cell_impl.LayerRNNCell):
+  r"""Basic IndyLSTM recurrent network cell.
+
+  Based on IndRNNs (https://arxiv.org/abs/1803.04831) and similar to
+  BasicLSTMCell, yet with the \(U_f\), \(U_i\), \(U_o\) and \(U_c\)
+  matrices in
+  https://en.wikipedia.org/wiki/Long_short-term_memory#LSTM_with_a_forget_gate
+  replaced by diagonal matrices, i.e. a Hadamard product with a single vector:
+
+    $$f_t = \sigma_g\left(W_f x_t + u_f \circ h_{t-1} + b_f\right)$$
+    $$i_t = \sigma_g\left(W_i x_t + u_i \circ h_{t-1} + b_i\right)$$
+    $$o_t = \sigma_g\left(W_o x_t + u_o \circ h_{t-1} + b_o\right)$$
+    $$c_t = f_t \circ c_{t-1} +
+            i_t \circ \sigma_c\left(W_c x_t + u_c \circ h_{t-1} + b_c\right)$$
+
+  where \(\circ\) denotes the Hadamard operator. This means that each IndyLSTM
+  node sees only its own state \(h\) and \(c\), as opposed to seeing all
+  states in the same layer.
+
+  We add forget_bias (default: 1) to the biases of the forget gate in order to
+  reduce the scale of forgetting in the beginning of the training.
+
+  It does not allow cell clipping, a projection layer, and does not
+  use peep-hole connections: it is the basic baseline.
+
+  For advanced models, please use the full @{tf.nn.rnn_cell.LSTMCell}
+  that follows.
+
+  TODO(gonnet): Write a paper describing this and add a reference here.
+  """
+
+  def __init__(self,
+               num_units,
+               forget_bias=1.0,
+               activation=None,
+               reuse=None,
+               kernel_initializer=None,
+               bias_initializer=None,
+               name=None,
+               dtype=None):
+    """Initialize the IndyLSTM cell.
+
+    Args:
+      num_units: int, The number of units in the LSTM cell.
+      forget_bias: float, The bias added to forget gates (see above).
+        Must set to `0.0` manually when restoring from CudnnLSTM-trained
+        checkpoints.
+      activation: Activation function of the inner states.  Default: `tanh`.
+      reuse: (optional) Python boolean describing whether to reuse variables
+        in an existing scope.  If not `True`, and the existing scope already has
+        the given variables, an error is raised.
+      kernel_initializer: (optional) The initializer to use for the weight
+        matrix applied to the inputs.
+      bias_initializer: (optional) The initializer to use for the bias.
+      name: String, the name of the layer. Layers with the same name will
+        share weights, but to avoid mistakes we require reuse=True in such
+        cases.
+      dtype: Default dtype of the layer (default of `None` means use the type
+        of the first input). Required when `build` is called before `call`.
+    """
+    super(IndyLSTMCell, self).__init__(_reuse=reuse, name=name, dtype=dtype)
+
+    # Inputs must be 2-dimensional.
+    self.input_spec = base_layer.InputSpec(ndim=2)
+
+    self._num_units = num_units
+    self._forget_bias = forget_bias
+    self._activation = activation or math_ops.tanh
+    self._kernel_initializer = kernel_initializer
+    self._bias_initializer = bias_initializer
+
+  @property
+  def state_size(self):
+    return rnn_cell_impl.LSTMStateTuple(self._num_units, self._num_units)
+
+  @property
+  def output_size(self):
+    return self._num_units
+
+  def build(self, inputs_shape):
+    if inputs_shape[1].value is None:
+      raise ValueError(
+          "Expected inputs.shape[-1] to be known, saw shape: %s" % inputs_shape)
+
+    input_depth = inputs_shape[1].value
+    # pylint: disable=protected-access
+    self._kernel_w = self.add_variable(
+        "%s_w" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[input_depth, 4 * self._num_units],
+        initializer=self._kernel_initializer)
+    self._kernel_u = self.add_variable(
+        "%s_u" % rnn_cell_impl._WEIGHTS_VARIABLE_NAME,
+        shape=[1, 4 * self._num_units],
+        initializer=init_ops.random_uniform_initializer(
+            minval=-1, maxval=1, dtype=self.dtype))
+    self._bias = self.add_variable(
+        rnn_cell_impl._BIAS_VARIABLE_NAME,
+        shape=[4 * self._num_units],
+        initializer=(self._bias_initializer
+                     if self._bias_initializer is not None else
+                     init_ops.zeros_initializer(dtype=self.dtype)))
+    # pylint: enable=protected-access
+
+    self.built = True
+
+  def call(self, inputs, state):
+    """Independent Long short-term memory cell (IndyLSTM).
+
+    Args:
+      inputs: `2-D` tensor with shape `[batch_size, input_size]`.
+      state: An `LSTMStateTuple` of state tensors, each shaped
+        `[batch_size, num_units]`.
+
+    Returns:
+      A pair containing the new hidden state, and the new state (a
+        `LSTMStateTuple`).
+    """
+    sigmoid = math_ops.sigmoid
+    one = constant_op.constant(1, dtype=dtypes.int32)
+    c, h = state
+
+    gate_inputs = math_ops.matmul(inputs, self._kernel_w)
+    gate_inputs += gen_array_ops.tile(h, [1, 4]) * self._kernel_u
+    gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
+
+    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
+    i, j, f, o = array_ops.split(
+        value=gate_inputs, num_or_size_splits=4, axis=one)
+
+    forget_bias_tensor = constant_op.constant(self._forget_bias, dtype=f.dtype)
+    # Note that using `add` and `multiply` instead of `+` and `*` gives a
+    # performance improvement. So using those at the cost of readability.
+    add = math_ops.add
+    multiply = math_ops.multiply
+    new_c = add(
+        multiply(c, sigmoid(add(f, forget_bias_tensor))),
+        multiply(sigmoid(i), self._activation(j)))
+    new_h = multiply(self._activation(new_c), sigmoid(o))
+
+    new_state = rnn_cell_impl.LSTMStateTuple(new_c, new_h)
+    return new_h, new_state
diff --git a/tensorflow/contrib/rpc/python/kernel_tests/BUILD b/tensorflow/contrib/rpc/python/kernel_tests/BUILD
index 2311c15a68..cb0b89ae55 100644
--- a/tensorflow/contrib/rpc/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/rpc/python/kernel_tests/BUILD
@@ -1,5 +1,3 @@
-# TODO(b/76425722): Port everything in here to OS (currently excluded).
-
 package(default_visibility = ["//visibility:public"])
 
 licenses(["notice"])  # Apache 2.0
@@ -17,7 +15,6 @@ tf_proto_library(
     srcs = ["test_example.proto"],
     has_services = 1,
     cc_api_version = 2,
-    protodeps = ["//tensorflow/core:protos_all"],
 )
 
 py_library(
diff --git a/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_base.py b/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_base.py
index 27273d16b1..1c23c28860 100644
--- a/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_base.py
+++ b/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_base.py
@@ -51,23 +51,23 @@ class RpcOpTestBase(object):
   def testScalarHostPortRpc(self):
     with self.test_session() as sess:
       request_tensors = (
-          test_example_pb2.TestCase(shape=[1, 2, 3]).SerializeToString())
+          test_example_pb2.TestCase(values=[1, 2, 3]).SerializeToString())
       response_tensors = self.rpc(
-          method=self.get_method_name('IncrementTestShapes'),
+          method=self.get_method_name('Increment'),
           address=self._address,
           request=request_tensors)
       self.assertEqual(response_tensors.shape, ())
       response_values = sess.run(response_tensors)
     response_message = test_example_pb2.TestCase()
     self.assertTrue(response_message.ParseFromString(response_values))
-    self.assertAllEqual([2, 3, 4], response_message.shape)
+    self.assertAllEqual([2, 3, 4], response_message.values)
 
   def testScalarHostPortTryRpc(self):
     with self.test_session() as sess:
       request_tensors = (
-          test_example_pb2.TestCase(shape=[1, 2, 3]).SerializeToString())
+          test_example_pb2.TestCase(values=[1, 2, 3]).SerializeToString())
       response_tensors, status_code, status_message = self.try_rpc(
-          method=self.get_method_name('IncrementTestShapes'),
+          method=self.get_method_name('Increment'),
           address=self._address,
           request=request_tensors)
       self.assertEqual(status_code.shape, ())
@@ -77,7 +77,7 @@ class RpcOpTestBase(object):
           sess.run((response_tensors, status_code, status_message)))
     response_message = test_example_pb2.TestCase()
     self.assertTrue(response_message.ParseFromString(response_values))
-    self.assertAllEqual([2, 3, 4], response_message.shape)
+    self.assertAllEqual([2, 3, 4], response_message.values)
     # For the base Rpc op, don't expect to get error status back.
     self.assertEqual(errors.OK, status_code_values)
     self.assertEqual(b'', status_message_values)
@@ -86,7 +86,7 @@ class RpcOpTestBase(object):
     with self.test_session() as sess:
       request_tensors = []
       response_tensors = self.rpc(
-          method=self.get_method_name('IncrementTestShapes'),
+          method=self.get_method_name('Increment'),
           address=self._address,
           request=request_tensors)
       self.assertAllEqual(response_tensors.shape, [0])
@@ -95,7 +95,7 @@ class RpcOpTestBase(object):
 
   def testInvalidMethod(self):
     for method in [
-        '/InvalidService.IncrementTestShapes',
+        '/InvalidService.Increment',
         self.get_method_name('InvalidMethodName')
     ]:
       with self.test_session() as sess:
@@ -115,12 +115,12 @@ class RpcOpTestBase(object):
       with self.assertRaises(errors.UnavailableError):
         sess.run(
             self.rpc(
-                method=self.get_method_name('IncrementTestShapes'),
+                method=self.get_method_name('Increment'),
                 address=address,
                 request=''))
       _, status_code_value, status_message_value = sess.run(
           self.try_rpc(
-              method=self.get_method_name('IncrementTestShapes'),
+              method=self.get_method_name('Increment'),
               address=address,
               request=''))
       self.assertEqual(errors.UNAVAILABLE, status_code_value)
@@ -182,10 +182,10 @@ class RpcOpTestBase(object):
     with self.test_session() as sess:
       request_tensors = [
           test_example_pb2.TestCase(
-              shape=[i, i + 1, i + 2]).SerializeToString() for i in range(20)
+              values=[i, i + 1, i + 2]).SerializeToString() for i in range(20)
       ]
       response_tensors = self.rpc(
-          method=self.get_method_name('IncrementTestShapes'),
+          method=self.get_method_name('Increment'),
           address=self._address,
           request=request_tensors)
       self.assertEqual(response_tensors.shape, (20,))
@@ -194,17 +194,17 @@ class RpcOpTestBase(object):
     for i in range(20):
       response_message = test_example_pb2.TestCase()
       self.assertTrue(response_message.ParseFromString(response_values[i]))
-      self.assertAllEqual([i + 1, i + 2, i + 3], response_message.shape)
+      self.assertAllEqual([i + 1, i + 2, i + 3], response_message.values)
 
   def testVecHostPortManyParallelRpcs(self):
     with self.test_session() as sess:
       request_tensors = [
           test_example_pb2.TestCase(
-              shape=[i, i + 1, i + 2]).SerializeToString() for i in range(20)
+              values=[i, i + 1, i + 2]).SerializeToString() for i in range(20)
       ]
       many_response_tensors = [
           self.rpc(
-              method=self.get_method_name('IncrementTestShapes'),
+              method=self.get_method_name('Increment'),
               address=self._address,
               request=request_tensors) for _ in range(10)
       ]
@@ -216,25 +216,25 @@ class RpcOpTestBase(object):
       for i in range(20):
         response_message = test_example_pb2.TestCase()
         self.assertTrue(response_message.ParseFromString(response_values[i]))
-        self.assertAllEqual([i + 1, i + 2, i + 3], response_message.shape)
+        self.assertAllEqual([i + 1, i + 2, i + 3], response_message.values)
 
   def testVecHostPortRpcUsingEncodeAndDecodeProto(self):
     with self.test_session() as sess:
       request_tensors = encode_proto_op.encode_proto(
           message_type='tensorflow.contrib.rpc.TestCase',
-          field_names=['shape'],
+          field_names=['values'],
           sizes=[[3]] * 20,
           values=[
               [[i, i + 1, i + 2] for i in range(20)],
           ])
       response_tensor_strings = self.rpc(
-          method=self.get_method_name('IncrementTestShapes'),
+          method=self.get_method_name('Increment'),
           address=self._address,
           request=request_tensors)
       _, (response_shape,) = decode_proto_op.decode_proto(
           bytes=response_tensor_strings,
           message_type='tensorflow.contrib.rpc.TestCase',
-          field_names=['shape'],
+          field_names=['values'],
           output_types=[dtypes.int32])
       response_shape_values = sess.run(response_shape)
     self.assertAllEqual([[i + 1, i + 2, i + 3]
@@ -285,9 +285,9 @@ class RpcOpTestBase(object):
       addresses = flatten([[
           self._address, 'unix:/tmp/this_unix_socket_doesnt_exist_97820348!!@'
       ] for _ in range(10)])
-      request = test_example_pb2.TestCase(shape=[0, 1, 2]).SerializeToString()
+      request = test_example_pb2.TestCase(values=[0, 1, 2]).SerializeToString()
       response_tensors, status_code, _ = self.try_rpc(
-          method=self.get_method_name('IncrementTestShapes'),
+          method=self.get_method_name('Increment'),
           address=addresses,
           request=request)
       response_tensors_values, status_code_values = sess.run((response_tensors,
@@ -303,9 +303,9 @@ class RpcOpTestBase(object):
     flatten = lambda x: list(itertools.chain.from_iterable(x))
     with self.test_session() as sess:
       methods = flatten(
-          [[self.get_method_name('IncrementTestShapes'), 'InvalidMethodName']
+          [[self.get_method_name('Increment'), 'InvalidMethodName']
            for _ in range(10)])
-      request = test_example_pb2.TestCase(shape=[0, 1, 2]).SerializeToString()
+      request = test_example_pb2.TestCase(values=[0, 1, 2]).SerializeToString()
       response_tensors, status_code, _ = self.try_rpc(
           method=methods, address=self._address, request=request)
       response_tensors_values, status_code_values = sess.run((response_tensors,
@@ -325,10 +325,10 @@ class RpcOpTestBase(object):
       ] for _ in range(10)])
       requests = [
           test_example_pb2.TestCase(
-              shape=[i, i + 1, i + 2]).SerializeToString() for i in range(20)
+              values=[i, i + 1, i + 2]).SerializeToString() for i in range(20)
       ]
       response_tensors, status_code, _ = self.try_rpc(
-          method=self.get_method_name('IncrementTestShapes'),
+          method=self.get_method_name('Increment'),
           address=addresses,
           request=requests)
       response_tensors_values, status_code_values = sess.run((response_tensors,
@@ -343,4 +343,4 @@ class RpcOpTestBase(object):
           response_message = test_example_pb2.TestCase()
           self.assertTrue(
               response_message.ParseFromString(response_tensors_values[i]))
-          self.assertAllEqual([i + 1, i + 2, i + 3], response_message.shape)
+          self.assertAllEqual([i + 1, i + 2, i + 3], response_message.values)
diff --git a/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_servicer.py b/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_servicer.py
index 7cbd636cb1..265254aa51 100644
--- a/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_servicer.py
+++ b/tensorflow/contrib/rpc/python/kernel_tests/rpc_op_test_servicer.py
@@ -30,8 +30,8 @@ from tensorflow.contrib.rpc.python.kernel_tests import test_example_pb2_grpc
 class RpcOpTestServicer(test_example_pb2_grpc.TestCaseServiceServicer):
   """Test servicer for RpcOp tests."""
 
-  def IncrementTestShapes(self, request, context):
-    """Increment the entries in the shape attribute of request.
+  def Increment(self, request, context):
+    """Increment the entries in the `values` attribute of request.
 
     Args:
       request: input TestCase.
@@ -40,8 +40,8 @@ class RpcOpTestServicer(test_example_pb2_grpc.TestCaseServiceServicer):
     Returns:
       output TestCase.
     """
-    for i in range(len(request.shape)):
-      request.shape[i] += 1
+    for i in range(len(request.values)):
+      request.values[i] += 1
     return request
 
   def AlwaysFailWithInvalidArgument(self, request, context):
diff --git a/tensorflow/contrib/rpc/python/kernel_tests/test_example.proto b/tensorflow/contrib/rpc/python/kernel_tests/test_example.proto
index 96f4550f62..8141466349 100644
--- a/tensorflow/contrib/rpc/python/kernel_tests/test_example.proto
+++ b/tensorflow/contrib/rpc/python/kernel_tests/test_example.proto
@@ -1,29 +1,17 @@
 // Test description and protos to work with it.
-//
-// Many of the protos in this file are for unit tests that haven't been written yet.
 
 syntax = "proto2";
 
-import "tensorflow/core/framework/types.proto";
-
 package tensorflow.contrib.rpc;
 
-// A TestCase holds a proto and a bunch of assertions
-// about how it should decode.
+// A TestCase holds a sequence of values.
 message TestCase {
-  // A batch of primitives to be serialized and decoded.
-  repeated RepeatedPrimitiveValue primitive = 1;
-  // The shape of the batch.
-  repeated int32 shape = 2;
-  // Expected sizes for each field.
-  repeated int32 sizes = 3;
-  // Expected values for each field.
-  repeated FieldSpec field = 4;
+  repeated int32 values = 1;
 };
 
 service TestCaseService {
-  // Copy input, and increment each entry in 'shape' by 1.
-  rpc IncrementTestShapes(TestCase) returns (TestCase) {
+  // Copy input, and increment each entry in 'values' by 1.
+  rpc Increment(TestCase) returns (TestCase) {
   }
 
   // Sleep forever.
@@ -42,130 +30,3 @@ service TestCaseService {
   rpc SometimesFailWithInvalidArgument(TestCase) returns (TestCase) {
   }
 };
-
-// FieldSpec describes the expected output for a single field.
-message FieldSpec {
-  optional string name = 1;
-  optional tensorflow.DataType dtype = 2;
-  optional RepeatedPrimitiveValue expected = 3;
-};
-
-message TestValue {
-  optional PrimitiveValue primitive_value = 1;
-  optional EnumValue enum_value = 2;
-  optional MessageValue message_value = 3;
-  optional RepeatedMessageValue repeated_message_value = 4;
-  optional RepeatedPrimitiveValue repeated_primitive_value = 6;
-}
-
-message PrimitiveValue {
-  optional double double_value = 1;
-  optional float float_value = 2;
-  optional int64 int64_value = 3;
-  optional uint64 uint64_value = 4;
-  optional int32 int32_value = 5;
-  optional fixed64 fixed64_value = 6;
-  optional fixed32 fixed32_value = 7;
-  optional bool bool_value = 8;
-  optional string string_value = 9;
-  optional bytes bytes_value = 12;
-  optional uint32 uint32_value = 13;
-  optional sfixed32 sfixed32_value = 15;
-  optional sfixed64 sfixed64_value = 16;
-  optional sint32 sint32_value = 17;
-  optional sint64 sint64_value = 18;
-}
-
-// NOTE: This definition must be kept in sync with PackedPrimitiveValue.
-message RepeatedPrimitiveValue {
-  repeated double double_value = 1;
-  repeated float float_value = 2;
-  repeated int64 int64_value = 3;
-  repeated uint64 uint64_value = 4;
-  repeated int32 int32_value = 5;
-  repeated fixed64 fixed64_value = 6;
-  repeated fixed32 fixed32_value = 7;
-  repeated bool bool_value = 8;
-  repeated string string_value = 9;
-  repeated bytes bytes_value = 12;
-  repeated uint32 uint32_value = 13;
-  repeated sfixed32 sfixed32_value = 15;
-  repeated sfixed64 sfixed64_value = 16;
-  repeated sint32 sint32_value = 17;
-  repeated sint64 sint64_value = 18;
-  repeated PrimitiveValue message_value = 19;
-}
-
-// A PackedPrimitiveValue looks exactly the same as a RepeatedPrimitiveValue
-// in the text format, but the binary serializion is different.
-// We test the packed representations by loading the same test cases
-// using this definition instead of RepeatedPrimitiveValue.
-// NOTE: This definition must be kept in sync with RepeatedPrimitiveValue
-// in every way except the packed=true declaration.
-message PackedPrimitiveValue {
-  repeated double double_value = 1 [packed = true];
-  repeated float float_value = 2 [packed = true];
-  repeated int64 int64_value = 3 [packed = true];
-  repeated uint64 uint64_value = 4 [packed = true];
-  repeated int32 int32_value = 5 [packed = true];
-  repeated fixed64 fixed64_value = 6 [packed = true];
-  repeated fixed32 fixed32_value = 7 [packed = true];
-  repeated bool bool_value = 8 [packed = true];
-  repeated string string_value = 9;
-  repeated bytes bytes_value = 12;
-  repeated uint32 uint32_value = 13 [packed = true];
-  repeated sfixed32 sfixed32_value = 15 [packed = true];
-  repeated sfixed64 sfixed64_value = 16 [packed = true];
-  repeated sint32 sint32_value = 17 [packed = true];
-  repeated sint64 sint64_value = 18 [packed = true];
-  repeated PrimitiveValue message_value = 19;
-}
-
-message EnumValue {
-  enum Color {
-    RED = 0;
-    ORANGE = 1;
-    YELLOW = 2;
-    GREEN = 3;
-    BLUE = 4;
-    INDIGO = 5;
-    VIOLET = 6;
-  };
-  optional Color enum_value = 14;
-  repeated Color repeated_enum_value = 15;
-}
-
-
-message InnerMessageValue {
-  optional float float_value = 2;
-  repeated bytes bytes_values = 8;
-}
-
-message MiddleMessageValue {
-  repeated int32 int32_values = 5;
-  optional InnerMessageValue message_value = 11;
-  optional uint32 uint32_value = 13;
-}
-
-message MessageValue {
-  optional double double_value = 1;
-  optional MiddleMessageValue message_value = 11;
-}
-
-message RepeatedMessageValue {
-  message NestedMessageValue {
-    optional float float_value = 2;
-    repeated bytes bytes_values = 8;
-  }
-
-  repeated NestedMessageValue message_values = 11;
-}
-
-// Message containing fields with field numbers higher than any field above. An
-// instance of this message is prepended to each binary message in the test to
-// exercise the code path that handles fields encoded out of order of field
-// number.
-message ExtraFields {
-  optional string string_value = 1776;
-  optional bool bool_value = 1777;
-}
diff --git a/tensorflow/contrib/saved_model/BUILD b/tensorflow/contrib/saved_model/BUILD
index 26fd4e2023..fbb50befdf 100644
--- a/tensorflow/contrib/saved_model/BUILD
+++ b/tensorflow/contrib/saved_model/BUILD
@@ -93,3 +93,32 @@ py_test(
         "//tensorflow/python/saved_model:utils",
     ],
 )
+
+py_library(
+    name = "keras_saved_model",
+    srcs = ["python/saved_model/keras_saved_model.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_windows"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/python:lib",
+        "//tensorflow/python:util",
+        "//tensorflow/python/keras:engine",
+        "//tensorflow/python/saved_model:constants",
+    ],
+)
+
+py_test(
+    name = "keras_saved_model_test",
+    size = "small",
+    srcs = ["python/saved_model/keras_saved_model_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_windows"],
+    deps = [
+        ":saved_model_py",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:training",
+        "//tensorflow/python/keras",
+        "//third_party/py/numpy",
+    ],
+)
diff --git a/tensorflow/contrib/saved_model/__init__.py b/tensorflow/contrib/saved_model/__init__.py
index b4f27a055d..95e1a8967b 100644
--- a/tensorflow/contrib/saved_model/__init__.py
+++ b/tensorflow/contrib/saved_model/__init__.py
@@ -24,11 +24,12 @@ from __future__ import division
 from __future__ import print_function
 
 # pylint: disable=unused-import,wildcard-import,line-too-long
+from tensorflow.contrib.saved_model.python.saved_model.keras_saved_model import *
 from tensorflow.contrib.saved_model.python.saved_model.signature_def_utils import *
 # pylint: enable=unused-import,widcard-import,line-too-long
 
 from tensorflow.python.util.all_util import remove_undocumented
 
-_allowed_symbols = ["get_signature_def_by_key"]
+_allowed_symbols = ["get_signature_def_by_key", "load_model", "save_model"]
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/saved_model/python/saved_model/__init__.py b/tensorflow/contrib/saved_model/python/saved_model/__init__.py
index 7b91622b61..e3b76bb6f3 100644
--- a/tensorflow/contrib/saved_model/python/saved_model/__init__.py
+++ b/tensorflow/contrib/saved_model/python/saved_model/__init__.py
@@ -24,5 +24,6 @@ from __future__ import division
 from __future__ import print_function
 
 # pylint: disable=wildcard-import
+from tensorflow.contrib.saved_model.python.saved_model import keras_saved_model
 from tensorflow.contrib.saved_model.python.saved_model import signature_def_utils
 # pylint: enable=wildcard-import
diff --git a/tensorflow/contrib/saved_model/python/saved_model/keras_saved_model.py b/tensorflow/contrib/saved_model/python/saved_model/keras_saved_model.py
new file mode 100644
index 0000000000..e2a969f053
--- /dev/null
+++ b/tensorflow/contrib/saved_model/python/saved_model/keras_saved_model.py
@@ -0,0 +1,108 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=protected-access
+"""Utility functions to save/load keras Model to/from SavedModel."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+from tensorflow.python.keras.models import model_from_json
+from tensorflow.python.lib.io import file_io
+from tensorflow.python.saved_model import constants
+from tensorflow.python.util import compat
+
+
+def save_model(model, saved_model_path):
+  """Save a `tf.keras.Model` into Tensorflow SavedModel format.
+
+  `save_model` generates such files/folders under the `saved_model_path` folder:
+  1) an asset folder containing the json string of the model's
+  configuration(topology).
+  2) a checkpoint containing the model weights.
+
+  Note that subclassed models can not be saved via this function, unless you
+  provide an implementation for get_config() and from_config().
+  Also note that `tf.keras.optimizers.Optimizer` instances can not currently be
+  saved to checkpoints. Use optimizers from `tf.train`.
+
+  Args:
+    model: A `tf.keras.Model` to be saved.
+    saved_model_path: a string specifying the path to the SavedModel directory.
+
+  Raises:
+    NotImplementedError: If the passed in model is a subclassed model.
+  """
+  if not model._is_graph_network:
+    raise NotImplementedError
+
+  # save model configuration as a json string under assets folder.
+  model_json = model.to_json()
+  assets_destination_dir = os.path.join(
+      compat.as_bytes(saved_model_path),
+      compat.as_bytes(constants.ASSETS_DIRECTORY))
+
+  if not file_io.file_exists(assets_destination_dir):
+    file_io.recursive_create_dir(assets_destination_dir)
+
+  model_json_filepath = os.path.join(
+      compat.as_bytes(assets_destination_dir),
+      compat.as_bytes(constants.SAVED_MODEL_FILENAME_JSON))
+  file_io.write_string_to_file(model_json_filepath, model_json)
+
+  # save model weights in checkpoint format.
+  checkpoint_destination_dir = os.path.join(
+      compat.as_bytes(saved_model_path),
+      compat.as_bytes(constants.VARIABLES_DIRECTORY))
+
+  if not file_io.file_exists(checkpoint_destination_dir):
+    file_io.recursive_create_dir(checkpoint_destination_dir)
+
+  checkpoint_prefix = os.path.join(
+      compat.as_text(checkpoint_destination_dir),
+      compat.as_text(constants.VARIABLES_FILENAME))
+  model.save_weights(checkpoint_prefix, save_format='tf', overwrite=True)
+
+
+def load_model(saved_model_path):
+  """Load a keras.Model from SavedModel.
+
+  load_model reinstantiates model state by:
+  1) loading model topology from json (this will eventually come
+     from metagraph).
+  2) loading model weights from checkpoint.
+
+  Args:
+    saved_model_path: a string specifying the path to an existing SavedModel.
+
+  Returns:
+    a keras.Model instance.
+  """
+  # restore model topology from json string
+  model_json_filepath = os.path.join(
+      compat.as_bytes(saved_model_path),
+      compat.as_bytes(constants.ASSETS_DIRECTORY),
+      compat.as_bytes(constants.SAVED_MODEL_FILENAME_JSON))
+  model_json = file_io.read_file_to_string(model_json_filepath)
+  model = model_from_json(model_json)
+
+  # restore model weights
+  checkpoint_prefix = os.path.join(
+      compat.as_text(saved_model_path),
+      compat.as_text(constants.VARIABLES_DIRECTORY),
+      compat.as_text(constants.VARIABLES_FILENAME))
+  model.load_weights(checkpoint_prefix)
+  return model
diff --git a/tensorflow/contrib/saved_model/python/saved_model/keras_saved_model_test.py b/tensorflow/contrib/saved_model/python/saved_model/keras_saved_model_test.py
new file mode 100644
index 0000000000..107ae1b07b
--- /dev/null
+++ b/tensorflow/contrib/saved_model/python/saved_model/keras_saved_model_test.py
@@ -0,0 +1,201 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=protected-access
+"""Tests for saving/loading function for keras Model."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import shutil
+import numpy as np
+
+from tensorflow.contrib.saved_model.python.saved_model import keras_saved_model
+from tensorflow.python import keras
+from tensorflow.python.framework import test_util
+from tensorflow.python.keras.engine import training
+from tensorflow.python.platform import test
+from tensorflow.python.training import training as training_module
+
+
+class TestModelSavingandLoading(test.TestCase):
+
+  def test_saving_sequential_model(self):
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.RepeatVector(3))
+      model.add(keras.layers.TimeDistributed(keras.layers.Dense(3)))
+      model.compile(
+          loss=keras.losses.MSE,
+          optimizer=keras.optimizers.RMSprop(lr=0.0001),
+          metrics=[keras.metrics.categorical_accuracy],
+          sample_weight_mode='temporal')
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3, 3))
+      model.train_on_batch(x, y)
+
+      ref_y = model.predict(x)
+      temp_dir = self.get_temp_dir()
+      self.addCleanup(shutil.rmtree, temp_dir)
+
+      temp_saved_model = os.path.join(temp_dir, 'saved_model')
+      keras_saved_model.save_model(model, temp_saved_model)
+
+      loaded_model = keras_saved_model.load_model(temp_saved_model)
+      y = loaded_model.predict(x)
+      self.assertAllClose(ref_y, y, atol=1e-05)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_saving_sequential_model_without_compile(self):
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.RepeatVector(3))
+      model.add(keras.layers.TimeDistributed(keras.layers.Dense(3)))
+
+      x = np.random.random((1, 3))
+      ref_y = model.predict(x)
+
+      temp_dir = self.get_temp_dir()
+      self.addCleanup(shutil.rmtree, temp_dir)
+
+      temp_saved_model = os.path.join(temp_dir, 'saved_model')
+      keras_saved_model.save_model(model, temp_saved_model)
+      loaded_model = keras_saved_model.load_model(temp_saved_model)
+
+      y = loaded_model.predict(x)
+      self.assertAllClose(ref_y, y, atol=1e-05)
+
+  def test_saving_functional_model(self):
+    with self.test_session():
+      inputs = keras.layers.Input(shape=(3,))
+      x = keras.layers.Dense(2)(inputs)
+      output = keras.layers.Dense(3)(x)
+
+      model = keras.models.Model(inputs, output)
+      model.compile(
+          loss=keras.losses.MSE,
+          optimizer=keras.optimizers.RMSprop(lr=0.0001),
+          metrics=[keras.metrics.categorical_accuracy])
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3))
+      model.train_on_batch(x, y)
+
+      ref_y = model.predict(x)
+      temp_dir = self.get_temp_dir()
+      self.addCleanup(shutil.rmtree, temp_dir)
+
+      temp_saved_model = os.path.join(temp_dir, 'saved_model')
+      keras_saved_model.save_model(model, temp_saved_model)
+      loaded_model = keras_saved_model.load_model(temp_saved_model)
+
+      y = loaded_model.predict(x)
+      self.assertAllClose(ref_y, y, atol=1e-05)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_saving_functional_model_without_compile(self):
+    with self.test_session():
+      inputs = keras.layers.Input(shape=(3,))
+      x = keras.layers.Dense(2)(inputs)
+      output = keras.layers.Dense(3)(x)
+
+      model = keras.models.Model(inputs, output)
+
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3))
+
+      ref_y = model.predict(x)
+      temp_dir = self.get_temp_dir()
+      self.addCleanup(shutil.rmtree, temp_dir)
+
+      temp_saved_model = os.path.join(temp_dir, 'saved_model')
+      keras_saved_model.save_model(model, temp_saved_model)
+      loaded_model = keras_saved_model.load_model(temp_saved_model)
+
+      y = loaded_model.predict(x)
+      self.assertAllClose(ref_y, y, atol=1e-05)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_saving_with_tf_optimizer(self):
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.Dense(3))
+      model.compile(
+          loss='mse',
+          optimizer=training_module.RMSPropOptimizer(0.1),
+          metrics=['acc'])
+
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3))
+      model.train_on_batch(x, y)
+
+      ref_y = model.predict(x)
+      temp_dir = self.get_temp_dir()
+      self.addCleanup(shutil.rmtree, temp_dir)
+
+      temp_saved_model = os.path.join(temp_dir, 'saved_model')
+      keras_saved_model.save_model(model, temp_saved_model)
+      loaded_model = keras_saved_model.load_model(temp_saved_model)
+      loaded_model.compile(
+          loss='mse',
+          optimizer=training_module.RMSPropOptimizer(0.1),
+          metrics=['acc'])
+      y = loaded_model.predict(x)
+      self.assertAllClose(ref_y, y, atol=1e-05)
+
+      # test that new updates are the same with both models
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3))
+
+      ref_loss = model.train_on_batch(x, y)
+      loss = loaded_model.train_on_batch(x, y)
+      self.assertAllClose(ref_loss, loss, atol=1e-05)
+
+      ref_y = model.predict(x)
+      y = loaded_model.predict(x)
+      self.assertAllClose(ref_y, y, atol=1e-05)
+
+      # test saving/loading again
+      keras_saved_model.save_model(loaded_model, temp_saved_model)
+      loaded_model = keras_saved_model.load_model(temp_saved_model)
+      y = loaded_model.predict(x)
+      self.assertAllClose(ref_y, y, atol=1e-05)
+
+  def test_saving_subclassed_model_raise_error(self):
+    # For now, saving subclassed model should raise an error. It should be
+    # avoided later with loading from SavedModel.pb.
+
+    class SubclassedModel(training.Model):
+
+      def __init__(self):
+        super(SubclassedModel, self).__init__()
+        self.layer1 = keras.layers.Dense(3)
+        self.layer2 = keras.layers.Dense(1)
+
+      def call(self, inp):
+        return self.layer2(self.layer1(inp))
+
+    model = SubclassedModel()
+    temp_dir = self.get_temp_dir()
+    self.addCleanup(shutil.rmtree, temp_dir)
+    temp_saved_model = os.path.join(temp_dir, 'saved_model')
+    with self.assertRaises(NotImplementedError):
+      keras_saved_model.save_model(model, temp_saved_model)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py b/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
index 178328619f..4073b390fc 100644
--- a/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
+++ b/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
@@ -132,6 +132,48 @@ class TestGatherTree(test.TestCase):
   def test_gather_tree_from_array_2d(self):
     self._test_gather_tree_from_array(depth_ndims=2)
 
+  def test_gather_tree_from_array_complex_trajectory(self):
+    # Max. time = 7, batch = 1, beam = 5.
+    array = np.expand_dims(np.array(
+        [[[25, 12, 114, 89, 97]],
+         [[9, 91, 64, 11, 162]],
+         [[34, 34, 34, 34, 34]],
+         [[2, 4, 2, 2, 4]],
+         [[2, 3, 6, 2, 2]],
+         [[2, 2, 2, 3, 2]],
+         [[2, 2, 2, 2, 2]]]), -1)
+    parent_ids = np.array(
+        [[[0, 0, 0, 0, 0]],
+         [[0, 0, 0, 0, 0]],
+         [[0, 1, 2, 3, 4]],
+         [[0, 0, 1, 2, 1]],
+         [[0, 1, 1, 2, 3]],
+         [[0, 1, 3, 1, 2]],
+         [[0, 1, 2, 3, 4]]])
+    expected_array = np.expand_dims(np.array(
+        [[[25, 25, 25, 25, 25]],
+         [[9, 9, 91, 9, 9]],
+         [[34, 34, 34, 34, 34]],
+         [[2, 4, 2, 4, 4]],
+         [[2, 3, 6, 3, 6]],
+         [[2, 2, 2, 3, 2]],
+         [[2, 2, 2, 2, 2]]]), -1)
+    sequence_length = [[4, 6, 4, 7, 6]]
+
+    array = ops.convert_to_tensor(
+        array, dtype=dtypes.float32)
+    parent_ids = ops.convert_to_tensor(
+        parent_ids, dtype=dtypes.int32)
+    expected_array = ops.convert_to_tensor(
+        expected_array, dtype=dtypes.float32)
+
+    sorted_array = beam_search_decoder.gather_tree_from_array(
+        array, parent_ids, sequence_length)
+
+    with self.test_session() as sess:
+      sorted_array, expected_array = sess.run([sorted_array, expected_array])
+      self.assertAllEqual(expected_array, sorted_array)
+
 
 class TestArrayShapeChecks(test.TestCase):
 
diff --git a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
index c7fbeea310..f17dbb0fe3 100644
--- a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
+++ b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
@@ -145,24 +145,20 @@ def gather_tree_from_array(t, parent_ids, sequence_length):
       array_ops.expand_dims(math_ops.range(beam_width), 0), 0)
   beam_ids = array_ops.tile(beam_ids, [max_time, batch_size, 1])
 
-  mask = array_ops.sequence_mask(
-      sequence_length, maxlen=max_time, dtype=dtypes.int32)
-  mask = array_ops.transpose(mask, perm=[2, 0, 1])
-
-  # Use beam_width + 1 to mark the end of beam.
-  masked_beam_ids = (beam_ids * mask) + (1 - mask) * (beam_width + 1)
-
   max_sequence_lengths = math_ops.to_int32(
       math_ops.reduce_max(sequence_length, axis=1))
   sorted_beam_ids = beam_search_ops.gather_tree(
-      step_ids=masked_beam_ids,
+      step_ids=beam_ids,
       parent_ids=parent_ids,
       max_sequence_lengths=max_sequence_lengths,
       end_token=beam_width + 1)
 
   # For out of range steps, simply copy the same beam.
+  in_bound_steps = array_ops.transpose(
+      array_ops.sequence_mask(sequence_length, maxlen=max_time),
+      perm=[2, 0, 1])
   sorted_beam_ids = array_ops.where(
-      math_ops.cast(mask, dtypes.bool), x=sorted_beam_ids, y=beam_ids)
+      in_bound_steps, x=sorted_beam_ids, y=beam_ids)
 
   # Generate indices for gather_nd.
   time_ind = array_ops.tile(array_ops.reshape(
diff --git a/tensorflow/contrib/signal/python/kernel_tests/mel_ops_test.py b/tensorflow/contrib/signal/python/kernel_tests/mel_ops_test.py
index 345eb6cfaa..f4348e80ea 100644
--- a/tensorflow/contrib/signal/python/kernel_tests/mel_ops_test.py
+++ b/tensorflow/contrib/signal/python/kernel_tests/mel_ops_test.py
@@ -53,7 +53,8 @@ def spectrogram_to_mel_matrix(num_mel_bins=20,
                               num_spectrogram_bins=129,
                               audio_sample_rate=8000,
                               lower_edge_hertz=125.0,
-                              upper_edge_hertz=3800.0):
+                              upper_edge_hertz=3800.0,
+                              unused_dtype=None):
   """Return a matrix that can post-multiply spectrogram rows to make mel.
 
   Copied from
@@ -132,9 +133,9 @@ class LinearToMelTest(test.TestCase):
     # lower_edge_hertz, upper_edge_hertz) to test.
     configs = [
         # Defaults.
-        (20, 129, 8000.0, 125.0, 3800.0),
+        (20, 129, 8000.0, 125.0, 3800.0, dtypes.float64),
         # Settings used by Tacotron (https://arxiv.org/abs/1703.10135).
-        (80, 1025, 24000.0, 80.0, 12000.0)
+        (80, 1025, 24000.0, 80.0, 12000.0, dtypes.float64)
     ]
     with self.test_session(use_gpu=True):
       for config in configs:
@@ -143,7 +144,8 @@ class LinearToMelTest(test.TestCase):
         self.assertAllClose(mel_matrix_np, mel_matrix.eval(), atol=3e-6)
 
   def test_dtypes(self):
-    for dtype in (dtypes.float16, dtypes.float32, dtypes.float64):
+    # LinSpace is not supported for tf.float16.
+    for dtype in (dtypes.bfloat16, dtypes.float32, dtypes.float64):
       self.assertEqual(dtype,
                        mel_ops.linear_to_mel_weight_matrix(dtype=dtype).dtype)
 
@@ -167,7 +169,8 @@ class LinearToMelTest(test.TestCase):
 
   def test_constant_folding(self):
     """Mel functions should be constant foldable."""
-    for dtype in (dtypes.float16, dtypes.float32, dtypes.float64):
+    # TODO(rjryan): tf.bloat16 cannot be constant folded by Grappler.
+    for dtype in (dtypes.float32, dtypes.float64):
       g = ops.Graph()
       with g.as_default():
         mel_matrix = mel_ops.linear_to_mel_weight_matrix(dtype=dtype)
diff --git a/tensorflow/contrib/signal/python/ops/mel_ops.py b/tensorflow/contrib/signal/python/ops/mel_ops.py
index 1e84006116..062d84aea1 100644
--- a/tensorflow/contrib/signal/python/ops/mel_ops.py
+++ b/tensorflow/contrib/signal/python/ops/mel_ops.py
@@ -151,22 +151,21 @@ def linear_to_mel_weight_matrix(num_mel_bins=20,
     _validate_arguments(num_mel_bins, sample_rate,
                         lower_edge_hertz, upper_edge_hertz, dtype)
 
-    # To preserve accuracy, we compute the matrix at float64 precision and then
-    # cast to `dtype` at the end. This function can be constant folded by graph
-    # optimization since there are no Tensor inputs.
+    # This function can be constant folded by graph optimization since there are
+    # no Tensor inputs.
     sample_rate = ops.convert_to_tensor(
-        sample_rate, dtypes.float64, name='sample_rate')
+        sample_rate, dtype, name='sample_rate')
     lower_edge_hertz = ops.convert_to_tensor(
-        lower_edge_hertz, dtypes.float64, name='lower_edge_hertz')
+        lower_edge_hertz, dtype, name='lower_edge_hertz')
     upper_edge_hertz = ops.convert_to_tensor(
-        upper_edge_hertz, dtypes.float64, name='upper_edge_hertz')
-    zero_float64 = ops.convert_to_tensor(0.0, dtypes.float64)
+        upper_edge_hertz, dtype, name='upper_edge_hertz')
+    zero = ops.convert_to_tensor(0.0, dtype)
 
     # HTK excludes the spectrogram DC bin.
     bands_to_zero = 1
     nyquist_hertz = sample_rate / 2.0
     linear_frequencies = math_ops.linspace(
-        zero_float64, nyquist_hertz, num_spectrogram_bins)[bands_to_zero:]
+        zero, nyquist_hertz, num_spectrogram_bins)[bands_to_zero:]
     spectrogram_bins_mel = array_ops.expand_dims(
         _hertz_to_mel(linear_frequencies), 1)
 
@@ -193,11 +192,8 @@ def linear_to_mel_weight_matrix(num_mel_bins=20,
 
     # Intersect the line segments with each other and zero.
     mel_weights_matrix = math_ops.maximum(
-        zero_float64, math_ops.minimum(lower_slopes, upper_slopes))
+        zero, math_ops.minimum(lower_slopes, upper_slopes))
 
     # Re-add the zeroed lower bins we sliced out above.
-    mel_weights_matrix = array_ops.pad(
-        mel_weights_matrix, [[bands_to_zero, 0], [0, 0]])
-
-    # Cast to the desired type.
-    return math_ops.cast(mel_weights_matrix, dtype, name=name)
+    return array_ops.pad(
+        mel_weights_matrix, [[bands_to_zero, 0], [0, 0]], name=name)
diff --git a/tensorflow/contrib/summary/summary_ops_test.py b/tensorflow/contrib/summary/summary_ops_test.py
index 3e41e3d0b4..4d1807130c 100644
--- a/tensorflow/contrib/summary/summary_ops_test.py
+++ b/tensorflow/contrib/summary/summary_ops_test.py
@@ -20,6 +20,8 @@ import os
 import tempfile
 import time
 
+import sqlite3
+
 import numpy as np
 import six
 
@@ -275,6 +277,22 @@ class EagerFileTest(test_util.TensorFlowTestCase):
 
 class EagerDbTest(summary_test_util.SummaryDbTest):
 
+  def testDbURIOpen(self):
+    tmpdb_path = os.path.join(self.get_temp_dir(), 'tmpDbURITest.sqlite')
+    tmpdb_uri = six.moves.urllib_parse.urljoin("file:", tmpdb_path)
+    tmpdb_writer = summary_ops.create_db_writer(
+        tmpdb_uri,
+        "experimentA",
+        "run1",
+        "user1")
+    with summary_ops.always_record_summaries():
+      with tmpdb_writer.as_default():
+        summary_ops.scalar('t1', 2.0)
+    tmpdb = sqlite3.connect(tmpdb_path)
+    num = get_one(tmpdb, 'SELECT count(*) FROM Tags WHERE tag_name = "t1"')
+    self.assertEqual(num, 1)
+    tmpdb.close()
+
   def testIntegerSummaries(self):
     step = training_util.create_global_step()
     writer = self.create_db_writer()
diff --git a/tensorflow/contrib/tensor_forest/BUILD b/tensorflow/contrib/tensor_forest/BUILD
index 136856c015..164f3e58e6 100644
--- a/tensorflow/contrib/tensor_forest/BUILD
+++ b/tensorflow/contrib/tensor_forest/BUILD
@@ -223,7 +223,6 @@ tf_kernel_library(
         ":model_ops_lib",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
     ],
     alwayslink = 1,
 )
@@ -319,7 +318,6 @@ tf_kernel_library(
         ":stats_ops_lib",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
     ],
     alwayslink = 1,
 )
diff --git a/tensorflow/contrib/tensor_forest/kernels/v4/decision_node_evaluator.cc b/tensorflow/contrib/tensor_forest/kernels/v4/decision_node_evaluator.cc
index 7e25579070..6cb2c881e2 100644
--- a/tensorflow/contrib/tensor_forest/kernels/v4/decision_node_evaluator.cc
+++ b/tensorflow/contrib/tensor_forest/kernels/v4/decision_node_evaluator.cc
@@ -51,7 +51,8 @@ std::unique_ptr<DecisionNodeEvaluator> CreateBinaryDecisionNodeEvaluator(
 InequalityDecisionNodeEvaluator::InequalityDecisionNodeEvaluator(
     const decision_trees::InequalityTest& test, int32 left, int32 right)
     : BinaryDecisionNodeEvaluator(left, right) {
-  safe_strto32(test.feature_id().id().value(), &feature_num_);
+  CHECK(safe_strto32(test.feature_id().id().value(), &feature_num_))
+      << "Invalid feature ID: [" << test.feature_id().id().value() << "]";
   threshold_ = test.threshold().float_value();
   include_equals_ =
       test.type() == decision_trees::InequalityTest::LESS_OR_EQUAL;
@@ -72,7 +73,9 @@ ObliqueInequalityDecisionNodeEvaluator::ObliqueInequalityDecisionNodeEvaluator(
     : BinaryDecisionNodeEvaluator(left, right) {
   for (int i = 0; i < test.oblique().features_size(); ++i) {
     int32 val;
-    safe_strto32(test.oblique().features(i).id().value(), &val);
+    CHECK(safe_strto32(test.oblique().features(i).id().value(), &val))
+        << "Invalid feature ID: [" << test.oblique().features(i).id().value()
+        << "]";
     feature_num_.push_back(val);
     feature_weights_.push_back(test.oblique().weights(i));
   }
@@ -97,7 +100,8 @@ int32 ObliqueInequalityDecisionNodeEvaluator::Decide(
 MatchingValuesDecisionNodeEvaluator::MatchingValuesDecisionNodeEvaluator(
     const decision_trees::MatchingValuesTest& test, int32 left, int32 right)
     : BinaryDecisionNodeEvaluator(left, right) {
-  safe_strto32(test.feature_id().id().value(), &feature_num_);
+  CHECK(safe_strto32(test.feature_id().id().value(), &feature_num_))
+      << "Invalid feature ID: [" << test.feature_id().id().value() << "]";
   for (const auto& val : test.value()) {
     values_.push_back(val.float_value());
   }
diff --git a/tensorflow/contrib/tensorboard/db/BUILD b/tensorflow/contrib/tensorboard/db/BUILD
index 3f6b4cdc9a..6507546ee9 100644
--- a/tensorflow/contrib/tensorboard/db/BUILD
+++ b/tensorflow/contrib/tensorboard/db/BUILD
@@ -106,6 +106,7 @@ cc_library(
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
+        "//tensorflow/core:png_internal",
         "//tensorflow/core:protos_all_cc",
     ],
 )
diff --git a/tensorflow/contrib/tensorrt/BUILD b/tensorflow/contrib/tensorrt/BUILD
index adda0b758b..d69d44a454 100644
--- a/tensorflow/contrib/tensorrt/BUILD
+++ b/tensorflow/contrib/tensorrt/BUILD
@@ -3,7 +3,7 @@
 #   and provide TensorRT operators and converter package.
 #   APIs are meant to change over time.
 
-package(default_visibility = ["//tensorflow:__subpackages__"])
+package(default_visibility = ["//visibility:public"])
 
 licenses(["notice"])  # Apache 2.0
 
@@ -11,7 +11,6 @@ exports_files(["LICENSE"])
 
 load(
     "//tensorflow:tensorflow.bzl",
-    "py_test",
     "tf_cc_test",
     "tf_copts",
     "tf_cuda_library",
@@ -20,6 +19,7 @@ load(
     "tf_gen_op_libs",
     "tf_gen_op_wrapper_py",
 )
+load("//tensorflow:tensorflow.bzl", "cuda_py_tests")
 load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_test")
 load("//tensorflow:tensorflow.bzl", "tf_custom_op_py_library")
 load("//tensorflow:tensorflow.bzl", "tf_py_wrap_cc")
@@ -33,11 +33,13 @@ tf_cuda_cc_test(
     size = "small",
     srcs = ["tensorrt_test.cc"],
     tags = [
-        "manual",
-        "notap",
+        "no_windows",
+        "nomac",
     ],
     deps = [
+        "//tensorflow/core:gpu_init",
         "//tensorflow/core:lib",
+        "//tensorflow/core:stream_executor",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
     ] + if_tensorrt([
@@ -83,6 +85,7 @@ cc_library(
     copts = tf_copts(),
     visibility = ["//visibility:public"],
     deps = [
+        ":trt_allocator",
         ":trt_logging",
         ":trt_plugins",
         ":trt_resources",
@@ -119,7 +122,6 @@ tf_cuda_library(
 
 tf_gen_op_wrapper_py(
     name = "trt_engine_op",
-    gen_locally = True,
     deps = [
         ":trt_engine_op_op_lib",
         ":trt_logging",
@@ -156,6 +158,7 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
+        ":tf_trt_integration_test_base",
         ":trt_convert_py",
         ":trt_ops_py",
         "//tensorflow/python:errors",
@@ -181,10 +184,15 @@ py_library(
     ],
 )
 
+# TODO(aaroey): this wrapper has been causing troubles of double linking, so
+# either get rid of it, or split to make it contain minimum dependencies.
 tf_py_wrap_cc(
     name = "wrap_conversion",
     srcs = ["trt_conversion.i"],
     copts = tf_copts(),
+    swig_includes = [
+        "//tensorflow/python:platform/base.i",
+    ],
     deps = [
         ":trt_conversion",
         ":trt_engine_op_kernel",
@@ -195,17 +203,16 @@ tf_py_wrap_cc(
 tf_cuda_library(
     name = "trt_resources",
     srcs = [
-        "resources/trt_allocator.cc",
         "resources/trt_int8_calibrator.cc",
         "resources/trt_resource_manager.cc",
     ],
     hdrs = [
-        "resources/trt_allocator.h",
         "resources/trt_int8_calibrator.h",
         "resources/trt_resource_manager.h",
         "resources/trt_resources.h",
     ],
     deps = [
+        ":trt_allocator",
         ":trt_logging",
         ":utils",
         "//tensorflow/core:framework_headers_lib",
@@ -216,6 +223,34 @@ tf_cuda_library(
     ]),
 )
 
+tf_cuda_library(
+    name = "trt_allocator",
+    srcs = ["resources/trt_allocator.cc"],
+    hdrs = ["resources/trt_allocator.h"],
+    deps = [
+        "//tensorflow/core:framework_headers_lib",
+        "//tensorflow/core:framework_lite",
+        "//tensorflow/core:lib_proto_parsing",
+    ] + if_tensorrt([
+        "@local_config_tensorrt//:nv_infer",
+    ]),
+)
+
+tf_cc_test(
+    name = "trt_allocator_test",
+    size = "small",
+    srcs = ["resources/trt_allocator_test.cc"],
+    tags = [
+        "no_windows",
+        "nomac",
+    ],
+    deps = [
+        ":trt_allocator",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
 # Library for the node-level conversion portion of TensorRT operation creation
 tf_cuda_library(
     name = "trt_conversion",
@@ -231,6 +266,7 @@ tf_cuda_library(
     ],
     deps = [
         ":segment",
+        ":trt_allocator",
         ":trt_plugins",
         ":trt_logging",
         ":trt_resources",
@@ -240,7 +276,6 @@ tf_cuda_library(
         "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
-        "//tensorflow/core:gpu_runtime",
         "//tensorflow/core:framework_lite",
         "//tensorflow/core:graph",
         "//tensorflow/core:lib",
@@ -275,13 +310,21 @@ tf_cc_test(
     name = "segment_test",
     size = "small",
     srcs = ["segment/segment_test.cc"],
+    tags = [
+        "no_windows",
+        "nomac",
+    ],
     deps = [
         ":segment",
-        "//tensorflow/c:c_api",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:scope",
+        "//tensorflow/core:core_cpu",
         "//tensorflow/core:lib",
+        "//tensorflow/core:ops",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
     ],
 )
 
@@ -311,8 +354,9 @@ tf_cuda_cc_test(
     size = "small",
     srcs = ["plugin/trt_plugin_factory_test.cc"],
     tags = [
-        "manual",
-        "notap",
+        "no_cuda_on_cpu_tap",
+        "no_windows",
+        "nomac",
     ],
     deps = [
         ":trt_plugins",
@@ -325,23 +369,51 @@ tf_cuda_cc_test(
     ]),
 )
 
-py_test(
+py_library(
+    name = "tf_trt_integration_test_base",
+    srcs = ["test/tf_trt_integration_test_base.py"],
+    deps = [
+        ":trt_convert_py",
+        ":trt_ops_py",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_test_lib",
+    ],
+)
+
+cuda_py_tests(
     name = "tf_trt_integration_test",
-    srcs = ["test/tf_trt_integration_test.py"],
-    main = "test/tf_trt_integration_test.py",
-    srcs_version = "PY2AND3",
-    tags = [
-        "manual",
-        "notap",
+    srcs = [
+        "test/base_test.py",
+        # "test/batch_matmul_test.py",
+        # "test/biasadd_matmul_test.py",
+        # "test/binary_tensor_weight_broadcast_test.py",  # Blocked by trt4 installation
+        # "test/concatenation_test.py",  # Blocked by trt4 installation
+        "test/const_broadcast_test.py",
+        "test/multi_connection_neighbor_engine_test.py",
+        "test/neighboring_engine_test.py",
+        # "test/unary_test.py",  # Blocked by trt4 installation
+        # "test/vgg_block_nchw_test.py",
+        # "test/vgg_block_test.py",
+        "test/memory_alignment_test.py",
     ],
-    deps = [
-        ":init_py",
+    additional_deps = [
+        ":tf_trt_integration_test_base",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:framework_test_lib",
     ],
+    tags = [
+        "no_cuda_on_cpu_tap",
+        "no_windows",
+        "nomac",
+    ],
 )
 
 cc_library(
     name = "utils",
+    srcs = ["convert/utils.cc"],
     hdrs = ["convert/utils.h"],
+    copts = tf_copts(),
+    deps = [
+        "//tensorflow/core:lib",
+    ],
 )
diff --git a/tensorflow/contrib/tensorrt/convert/convert_graph.cc b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
index 17b32c0e30..896968647e 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_graph.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
@@ -29,9 +29,6 @@ limitations under the License.
 #include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
 #include "tensorflow/contrib/tensorrt/resources/trt_resources.h"
 #include "tensorflow/contrib/tensorrt/segment/segment.h"
-#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
-#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
-#include "tensorflow/core/common_runtime/gpu/gpu_process_state.h"
 #include "tensorflow/core/framework/function.h"
 #include "tensorflow/core/framework/graph_to_functiondef.h"
 #include "tensorflow/core/framework/node_def_builder.h"
@@ -86,27 +83,48 @@ bool IsTensorRTCandidate(const tensorflow::Node* node) {
   // TODO(jie): Segmentation shouldn't associated with op name.
   //            Split it into a registration for each kernel.
   static const std::set<string> candidate_ops = {
-      "Identity",
-      "Snapshot",
-      "Const",
-      "Conv2D",
-      "MaxPool",
-      "BiasAdd",
-      "Relu",
-      "Add",
-      "Mul",
-      "Sub",
-      "Rsqrt",
-      "Pad",
-      "Mean",
-      "AvgPool",
-      "ConcatV2",
-      "DepthwiseConv2dNative",
-      "FusedBatchNorm",
-      "FusedBatchNormV2",
-      // TODO(ben,jie): ...
+    "Identity",
+    "Snapshot",
+    "Const",
+    "Conv2D",
+    "MaxPool",
+    "BiasAdd",
+    "Relu",
+    "Add",
+    "Mul",
+    "Sub",
+    "Rsqrt",
+    "Pad",
+    "Mean",
+    "AvgPool",
+    "ConcatV2",
+    "DepthwiseConv2dNative",
+    "FusedBatchNorm",
+    "FusedBatchNormV2",
+    "Div",
+    "RealDiv",
+    "Rsqrt",
+    "Reciprocal",
+    "Exp",
+    "Log",
+    "Sqrt",
+    "Abs",
+    "Neg",
+#if NV_TENSORRT_MAJOR > 3
+    "MatMul",
+    "BatchMatMul",
+    "Softmax",
+    "Minimum",
+    "Maximum",
+    "TopKV2",
+    "Sum",
+    "Prod",
+    "Max",
+    "Min",
+#endif
+    // TODO(ben,jie): ...
   };
-  // LINT.ThenChange(//tensorflow/contrib/tensorrt/convert/convert_nodes.h)
+  // LINT.ThenChange(//tensorflow/contrib/tensorrt/convert/convert_nodes.cc)
   return (candidate_ops.count(node->type_string()) ||
           PluginFactoryTensorRT::GetInstance()->IsPlugin(node->type_string()));
 }
@@ -142,7 +160,7 @@ tensorflow::Status ConvertCalibGraphToInferGraph(
     auto n = infer_graph->mutable_node(i);
     if (n->op() == "TRTEngineOp") {
       VLOG(1) << "Processing " << n->name();
-      string container_name = n->attr().at("segment_funcdef_name").s();
+      const string& container_name = n->attr().at("segment_funcdef_name").s();
       TRTCalibrationResource* cres = nullptr;
       auto status = calib_rm->Lookup(container_name, "Calibrator", &cres);
       if (!status.ok()) {
@@ -168,25 +186,50 @@ tensorflow::Status ConvertCalibGraphToInferGraph(
             "Can't get TRTCalibrator from resource manager!");
       }
       cres->Unref();
+      TF_RETURN_IF_ERROR(calib_rm->Cleanup(container_name));
     }
   }
   return tensorflow::Status::OK();
 }
 
-// Entry function from Python.
 tensorflow::Status ConvertGraphDefToTensorRT(
     const tensorflow::GraphDef& graph_def,
     const std::vector<string>& output_names, size_t max_batch_size,
     size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def,
     int precision_mode, int minimum_segment_size, bool is_dyn_op,
     int max_cached_engines, std::vector<int> cached_engine_batches) {
-  // optimization pass
+  // Create GrapplerItem.
   tensorflow::grappler::GrapplerItem item;
   item.fetch = output_names;
   item.graph = graph_def;
-  // grappler requires a virtual cluster with a proper GPU device
-  // in order to calculate flops>0 or fails with FATAL
-  // We add numbers from a Pascal card here to have flops>0
+
+  // TODO(aaroey): we should have used single machine cluster like the
+  // following, but the problem is then wrap_conversion will depend on
+  // direct_session and cause double linking problems. To fix this we need to
+  // fix or get rid of the swig dependency. Here we use VirtualCluster
+  // as a work around, and we need to create a session to initialize the
+  // underlying device before calling this method.
+#if 0
+  // Create single machine cluster. Note that this will create a session and
+  // initialize the gpu devices.
+  const int num_cpu_cores =
+      tensorflow::grappler::GetNumAvailableLogicalCPUCores();
+  const int num_gpus = tensorflow::grappler::GetNumAvailableGPUs();
+  VLOG(2) << "cpu_cores: " << num_cpu_cores;
+  VLOG(2) << "gpus: " << num_gpus;
+  const int timeout_s = 60 * 10;
+  std::unique_ptr<tensorflow::grappler::Cluster> cluster(
+      new tensorflow::grappler::SingleMachine(
+          timeout_s, num_cpu_cores, num_gpus));
+  // These settings are the defaults in tensorflow/python/grappler/cluster.py.
+  cluster->DisableDetailedStats(true);
+  cluster->AllowSoftPlacement(true);
+  cluster->SetNumWarmupSteps(10);
+  TF_RETURN_IF_ERROR(cluster->Provision());
+#else
+  // Create virtual cluster. Grappler requires a virtual cluster with a proper
+  // GPU device in order to calculate flops>0 or fails with FATAL in dbg mode.
+  // We add numbers from a Pascal card here to have flops>0.
   tensorflow::DeviceProperties device_properties;
   device_properties.set_type("GPU");
   device_properties.mutable_environment()->insert({"architecture", "6"});
@@ -195,47 +238,43 @@ tensorflow::Status ConvertGraphDefToTensorRT(
   std::unique_ptr<tensorflow::grappler::Cluster> cluster(
       new tensorflow::grappler::VirtualCluster(
           {{"/GPU:0", device_properties}}));
+#endif
 
-  // single machine
-  int num_cpu_cores = tensorflow::grappler::GetNumAvailableLogicalCPUCores();
-  int num_gpus = tensorflow::grappler::GetNumAvailableGPUs();
-  VLOG(2) << "cpu_cores: " << num_cpu_cores;
-  VLOG(2) << "gpus: " << num_gpus;
+  // Create RewriterConfig.
   tensorflow::RewriterConfig rw_cfg;
-  // use only const folding and layout for the time being since new optimizers
-  // break the graph for us
+  // TODO(aaroey): use only const folding and layout for the time being since
+  // new optimizers break the graph for trt.
   rw_cfg.add_optimizers("constfold");
   rw_cfg.add_optimizers("layout");
-  rw_cfg.set_meta_optimizer_iterations(tensorflow::RewriterConfig::ONE);
+  auto optimizer = rw_cfg.add_custom_optimizers();
+  optimizer->set_name("TensorRTOptimizer");
+  auto& parameters = *(optimizer->mutable_parameter_map());
+  parameters["minimum_segment_size"].set_i(minimum_segment_size);
+  parameters["max_batch_size"].set_i(max_batch_size);
+  parameters["is_dynamic_op"].set_b(is_dyn_op);
+  parameters["max_workspace_size_bytes"].set_i(max_workspace_size_bytes);
+  TF_RETURN_IF_ERROR(GetPrecisionModeName(
+      precision_mode, parameters["precision_mode"].mutable_s()));
+  parameters["maximum_cached_engines"].set_i(max_cached_engines);
+  if (!cached_engine_batches.empty()) {
+    auto list = parameters["cached_engine_batches"].mutable_list();
+    for (const int batch : cached_engine_batches) {
+      list->add_i(batch);
+    }
+  }
+
+  // Run optimizer.
   tensorflow::grappler::MetaOptimizer meta_opt(nullptr, rw_cfg);
-  tensorflow::GraphDef gdef;
-  TF_RETURN_IF_ERROR(meta_opt.Optimize(cluster.get(), item, &gdef));
-  item.graph = gdef;
-
-  // AJ refactoring shape inference through grappler/GraphProperties.
-  tensorflow::grappler::GraphProperties static_graph_properties(item);
-  TF_RETURN_IF_ERROR(static_graph_properties.InferStatically(true));
-  // Build full graph
-  ConversionParams cp;
-  cp.input_graph_def = &gdef;
-  cp.output_names = &output_names;
-  cp.max_batch_size = max_batch_size;
-  cp.output_graph_def = new_graph_def;
-  cp.precision_mode = precision_mode;
-  cp.is_dyn_op = is_dyn_op;
-  cp.max_cached_engines = max_cached_engines;
-  cp.cached_engine_batches = cached_engine_batches;
-  cp.minimum_segment_size = minimum_segment_size;
-  cp.graph_properties = &static_graph_properties;
-  cp.max_workspace_size_bytes = max_workspace_size_bytes;
+  TF_RETURN_IF_ERROR(meta_opt.Optimize(cluster.get(), item, new_graph_def));
+
   if (VLOG_IS_ON(5)) {
     std::fstream f;
     f.open("TRTConversionInput.pb",
            std::fstream::out | std::fstream::binary | std::fstream::trunc);
-    f << gdef.SerializeAsString();
+    f << new_graph_def->SerializeAsString();
     f.close();
   }
-  return ConvertAfterShapes(cp);
+  return Status::OK();
 }
 
 // Function to get subsegment information structure.
@@ -247,6 +286,7 @@ tensorflow::Status GetEngineInfo(
     const std::vector<tensorflow::Node*>& reverse_topo_order,
     EngineInfo* info) {
   std::vector<int> subgraph_node_ids;
+  std::set<int> added_const_node_ids;  // Used to prevent double insertion.
   std::set<string> segment_devices;
   int input_port = 0;
   int output_port = 0;
@@ -256,6 +296,7 @@ tensorflow::Status GetEngineInfo(
   // edge, thus there must not be any duplicates since source nodes of
   // input/output edges must be in different split of the graph.
   // TODO(aaroey): consider using node id and port instead.
+  // TODO(aaroey): using topo order instead of reverting reverse topo order.
   std::unordered_map<string, int> created_edges;
   for (auto it = reverse_topo_order.rbegin(); it != reverse_topo_order.rend();
        ++it) {
@@ -274,19 +315,22 @@ tensorflow::Status GetEngineInfo(
                 << " neither have requested device nor assigned device";
       }
     }
-    int node_id = node->id();
-    subgraph_node_ids.push_back(node_id);
+    const int node_id = node->id();
     for (const auto edge : node->in_edges()) {
       auto input_node = edge->src();
-      if (segment_nodes.count(input_node->name()) == 0) {
+      if (segment_nodes.count(input_node->name()) == 0 &&
+          !edge->IsControlEdge() && !input_node->IsSource()) {
         // Add constant input node into the segment. We don't care if it has
         // other output edges going into other engines or TF nodes. Since we add
         // it only to the subsegment node list, not the subsegment itself, it
         // won't be removed from the graph. If it doesn't have any edges, TF
         // will prune it out.
         if (input_node->type_string() == "Const") {
-          subgraph_node_ids.push_back(input_node->id());
-        } else if (!edge->IsControlEdge() && !input_node->IsSource()) {
+          if (added_const_node_ids.count(input_node->id()) == 0) {
+            added_const_node_ids.insert(input_node->id());
+            subgraph_node_ids.push_back(input_node->id());
+          }
+        } else {
           string s(input_node->name());
           StrAppend(&s, ":", edge->src_output());
           VLOG(1) << "Input edge = " << s;
@@ -303,6 +347,9 @@ tensorflow::Status GetEngineInfo(
         }
       }
     }
+    // We need to add possible const input nodes before adding this node in
+    // order to keep the topological order.
+    subgraph_node_ids.push_back(node_id);
     for (const auto edge : node->out_edges()) {
       auto output_node = edge->dst();
       if (segment_nodes.count(output_node->name()) == 0 &&
@@ -349,9 +396,9 @@ tensorflow::Status CreateTRTNode(tensorflow::Graph* graph,
                                  nvinfer1::IGpuAllocator* alloc,
                                  int max_batch_size) {
   const auto& info = infos.at(pos);
-  std::vector<tensorflow::TensorShapeProto> out_shapes;
-  std::vector<tensorflow::TensorShapeProto> input_shapes;
-  std::vector<tensorflow::PartialTensorShape> shapes;
+  std::vector<tensorflow::TensorShapeProto> output_shape_protos;
+  std::vector<tensorflow::TensorShapeProto> input_shape_protos;
+  std::vector<tensorflow::PartialTensorShape> input_shapes;
   std::vector<tensorflow::NodeDefBuilder::NodeOut> inputs;
   std::vector<tensorflow::DataType> out_types;
   VLOG(1) << "Processing " << info.engine_name;
@@ -364,11 +411,11 @@ tensorflow::Status CreateTRTNode(tensorflow::Graph* graph,
       tensorflow::TensorShapeProto out_shape;
       // shape of the output node inside segment
       conn.inside_shape.AsProto(&out_shape);
-      if (out_shapes.size() <= conn.port_number) {
-        out_shapes.resize(conn.port_number + 1);
+      if (output_shape_protos.size() <= conn.port_number) {
+        output_shape_protos.resize(conn.port_number + 1);
         out_types.resize(conn.port_number + 1);
       }
-      out_shapes.at(conn.port_number) = out_shape;
+      output_shape_protos.at(conn.port_number) = out_shape;
       out_types.at(conn.port_number) = conn.connection_type;
       continue;
     }
@@ -376,12 +423,12 @@ tensorflow::Status CreateTRTNode(tensorflow::Graph* graph,
     // Set the shapes and data types of input edge.
     tensorflow::TensorShapeProto in_shape;
     conn.outside_shape.AsProto(&in_shape);
-    if (input_shapes.size() <= conn.port_number) {
+    if (input_shape_protos.size() <= conn.port_number) {
+      input_shape_protos.resize(conn.port_number + 1);
       input_shapes.resize(conn.port_number + 1);
-      shapes.resize(conn.port_number + 1);
     }
-    input_shapes.at(conn.port_number) = in_shape;
-    shapes.at(conn.port_number) = conn.outside_shape;
+    input_shape_protos.at(conn.port_number) = in_shape;
+    input_shapes.at(conn.port_number) = conn.outside_shape;
 
     string input_node = conn.outside_node_name;
     int input_port = conn.outside_port;
@@ -409,6 +456,8 @@ tensorflow::Status CreateTRTNode(tensorflow::Graph* graph,
     VLOG(1) << "Engine Input " << input_node << ":" << input_port << " -> "
             << info.engine_name << ":" << inputs.size();
     // Skip duplicate inputs.
+    // TODO(aaroey): use std::find instead. GetEngineInfo already remove
+    // duplicate connections, so here we should never find any duplicate?
     bool new_input = true;
     for (const auto& inp : inputs) {
       if (inp.node == input_node && inp.index == input_port) {
@@ -436,8 +485,8 @@ tensorflow::Status CreateTRTNode(tensorflow::Graph* graph,
     TF_RETURN_IF_ERROR(ConvertGraphDefToEngine(
         info.segment_graph_def,
         info.precision_mode == INT8MODE ? FP32MODE : info.precision_mode,
-        max_batch_size, info.max_workspace_size_bytes, shapes, &trt_logger,
-        alloc, /*calibrator=*/nullptr, &engine,
+        max_batch_size, info.max_workspace_size_bytes, input_shapes,
+        &trt_logger, alloc, /*calibrator=*/nullptr, &engine,
         /*convert_successfully=*/nullptr));
     TrtUniquePtrType<nvinfer1::IHostMemory> engine_data(engine->serialize());
     segment_string =
@@ -453,21 +502,10 @@ tensorflow::Status CreateTRTNode(tensorflow::Graph* graph,
   // TODO(aaroey): use enum instead, and add a helper method to do the
   // conversion.
   string prec_string;
-  switch (info.precision_mode) {
-    case FP32MODE:
-      prec_string = "FP32";
-      break;
-    case FP16MODE:
-      prec_string = "FP16";
-      break;
-    case INT8MODE:
-      prec_string = "INT8";
-      if (!TRTResourceManager::instance()->getManager("TRTCalibration")) {
-        LOG(ERROR) << "Failed to construct calibration storage";
-      }
-      break;
-    default:
-      return tensorflow::errors::OutOfRange("Unknown precision mode");
+  TF_RETURN_IF_ERROR(GetPrecisionModeName(info.precision_mode, &prec_string));
+  if (info.precision_mode == INT8MODE &&
+      !TRTResourceManager::instance()->getManager("TRTCalibration")) {
+    LOG(ERROR) << "Failed to construct calibration storage";
   }
   tensorflow::NodeDefBuilder node_builder(info.engine_name, "TRTEngineOp");
   if (!info.device.empty()) node_builder.Device(info.device);
@@ -485,8 +523,8 @@ tensorflow::Status CreateTRTNode(tensorflow::Graph* graph,
   }
   tensorflow::NodeDef trt_node;
   tensorflow::Status status =
-      node_builder.Attr("input_shapes", input_shapes)
-          .Attr("output_shapes", out_shapes)
+      node_builder.Attr("input_shapes", input_shape_protos)
+          .Attr("output_shapes", output_shape_protos)
           .Attr("static_engine",
                 info.engine_type == EngineInfo::EngineType::TRTStatic)
           .Attr("segment_funcdef_name",
@@ -595,7 +633,9 @@ tensorflow::Status RegisterSegmentFunctionToFunctionLibrary(
         edge->src()->output_type(edge->src_output()));
     VLOG(1) << " input " << nout.node << ":" << nout.index
             << " dtype=" << tensorflow::DataTypeString(nout.data_type);
-    node_builder.Input({nout});
+    // nvcc complains that Input(<brace-enclosed initializer list>) is
+    // ambiguous, so do not use Input({nout}).
+    node_builder.Input(nout);
     TF_RETURN_IF_ERROR(node_builder.Attr("T", node->output_type(0))
                            .Attr("index", i)
                            .Finalize(&nd));
@@ -632,77 +672,42 @@ tensorflow::Status RegisterSegmentFunctionToFunctionLibrary(
 }
 
 std::pair<int, tensorflow::Allocator*> GetDeviceAndAllocator(
-    ConversionParams& params, EngineInfo& engine) {
+    const ConversionParams& params, const EngineInfo& engine) {
   int cuda_device_id = -1;
-  auto check_device_id = [](int tfid) -> int {
-    tensorflow::TfGpuId tf_gpu_id(tfid);
-    CudaGpuId cuda_gpu_id;
-    Status s = GpuIdManager::TfToCudaGpuId(tf_gpu_id, &cuda_gpu_id);
-    if (s.ok()) {
-      VLOG(1) << "Found TF GPU " << tf_gpu_id.value() << " at cuda device "
-              << cuda_gpu_id.value();
-      return cuda_gpu_id.value();
-    }
-    VLOG(2) << "TF GPU with id " << tfid << " do not exist " << s;
-    return -1;
-  };
   tensorflow::Allocator* dev_allocator = nullptr;
-  // we need to us PM here since in python path there is no way to get
-  // to allocators.
-  // TODO(sami): when grappler devices become available else path will not be
-  // necessary
-  auto pm = tensorflow::GPUProcessState::singleton();
-  if (params.cluster) {  // get allocator
-    tensorflow::Device* device = nullptr;
-    if (params.cluster->GetDeviceSet()) {
-      device = params.cluster->GetDeviceSet()->FindDeviceByName(engine.device);
+  if (params.cluster) {
+    std::vector<tensorflow::Device*> devices;
+    if (!engine.device.empty() && params.cluster->GetDeviceSet()) {
+      DeviceNameUtils::ParsedName parsed_name;
+      if (DeviceNameUtils::ParseFullName(engine.device, &parsed_name) &&
+          parsed_name.has_id) {
+        params.cluster->GetDeviceSet()->FindMatchingDevices(parsed_name,
+                                                            &devices);
+      }
     }
-    if (device) {
+    if (!devices.empty()) {
+      if (devices.size() > 1) {
+        string msg = "Found multiple matching devices using name '";
+        StrAppend(&msg, engine.device, "': ");
+        for (auto d : devices) StrAppend(&msg, d->name(), ", ");
+        StrAppend(&msg, ". Will get the allocator from first one.");
+        LOG(WARNING) << msg;
+      }
       tensorflow::AllocatorAttributes alloc_attr;
-      dev_allocator = device->GetAllocator(alloc_attr);
-      VLOG(1) << "Using allocator " << dev_allocator->Name();
+      cuda_device_id = devices[0]->tensorflow_gpu_device_info()->gpu_id;
+      dev_allocator = devices[0]->GetAllocator(alloc_attr);
+      VLOG(1) << "Using allocator " << dev_allocator->Name()
+              << " and cuda_device_id " << cuda_device_id;
     } else {
       LOG(WARNING) << "Cluster is set but device '" << engine.device
                    << "' is not found in the cluster";
     }
-  } else {  // cluster not found, possibly a python call
-    VLOG(1) << "Cluster is not set, probably called from python";
-    int found_device = 0;
-    bool try_gpu_ids = true;
-    // if device is set, try to find the device. Might be a problem for multi
-    // host case but TensorRT do not support multi host setups yet.
-    if (!engine.device.empty()) {
-      DeviceNameUtils::ParsedName parsed_name;
-      if (DeviceNameUtils::ParseFullName(engine.device, &parsed_name)) {
-        cuda_device_id = parsed_name.has_id ? parsed_name.id : -1;
-      }
-      try_gpu_ids = !parsed_name.has_id;
-    }
-    if (try_gpu_ids) {
-      while (found_device < 100) {
-        cuda_device_id = check_device_id(found_device);
-        if (cuda_device_id >= 0) break;
-        found_device++;
-      }
-    }
-    if (found_device == 100) {
-      LOG(ERROR) << " Can't find a GPU device to work with. Please "
-                    "instantiate a session to initialize devices";
-      return std::make_pair(cuda_device_id, dev_allocator);
-    }
-    LOG(WARNING)
-        << "Can't determine the device, constructing an allocator at device "
-        << found_device;
-    tensorflow::GPUOptions gpuoptions;
-    // this will be a noop if device is already initialized
-    gpuoptions.set_allow_growth(true);
-    tensorflow::TfGpuId tf_gpu_id(found_device);
-    dev_allocator = pm->GetGPUAllocator(gpuoptions, tf_gpu_id, 1);
   }
   return std::make_pair(cuda_device_id, dev_allocator);
 }
 
 // Entry function from optimization pass.
+// TODO(aaeory): parameter should use pointer type.
 tensorflow::Status ConvertAfterShapes(ConversionParams& params) {
   // Convert graphdef to graph.
   tensorflow::FunctionLibraryDefinition flib(tensorflow::OpRegistry::Global(),
@@ -720,7 +725,8 @@ tensorflow::Status ConvertAfterShapes(ConversionParams& params) {
   segment_options.minimum_segment_size = params.minimum_segment_size;
   tensorflow::tensorrt::segment::SegmentNodesVector initial_segments;
   TF_RETURN_IF_ERROR(tensorrt::segment::SegmentGraph(
-      &graph, IsTensorRTCandidate, segment_options, &initial_segments));
+      &graph, IsTensorRTCandidate, InputEdgeValidator(*params.graph_properties),
+      OutputEdgeValidator(), segment_options, &initial_segments));
   if (initial_segments.size() > 1) {
     VLOG(0) << "MULTIPLE tensorrt candidate conversion: "
             << initial_segments.size();
@@ -800,7 +806,7 @@ tensorflow::Status ConvertAfterShapes(ConversionParams& params) {
     // The allocator is used to build the engine. The build and the built engine
     // will be destroyed after we get the serialized engine string, so it's fine
     // to use unique_ptr here.
-    std::unique_ptr<nvinfer1::IGpuAllocator> alloc;
+    std::unique_ptr<TRTBaseAllocator> alloc;
     auto device_alloc = GetDeviceAndAllocator(params, engine);
     int cuda_device_id = 0;
     if (device_alloc.first >= 0) {
diff --git a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
index 146b9c7344..451d6fe698 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/contrib/tensorrt/convert/convert_nodes.h"
 
 #include <algorithm>
+#include <cstring>
 #include <list>
 #include <map>
 #include <memory>
@@ -49,15 +50,34 @@ limitations under the License.
 #if GOOGLE_TENSORRT
 #include "tensorrt/include/NvInfer.h"
 
-//  Check if the types are equal. Cast to int first so that failure log message
-//  would work!
-#define CHECK_EQ_TYPE(val1, val2) CHECK_EQ((int)val1, (int)val2)
+// Check if the types are equal. Cast to int first so that failure log message
+// would work!
+#define TFTRT_CHECK_EQ_TYPE(val1, val2) CHECK_EQ((int)val1, (int)val2)
+
+#define TFTRT_INTERNAL_ERROR_AT_NODE(node)                               \
+  do {                                                                   \
+    return tensorflow::errors::Internal(                                 \
+        "TFTRT::", __FUNCTION__, "failed to add TRT layer, at: ", node); \
+  } while (0)
+
+#define TFTRT_RETURN_ERROR_IF_FALSE(status, node) \
+  do {                                            \
+    if (status == false) {                        \
+      TFTRT_INTERNAL_ERROR_AT_NODE(node);         \
+    }                                             \
+  } while (0)
+
+#define TFTRT_RETURN_ERROR_IF_NULLPTR(ptr, node) \
+  do {                                           \
+    if (ptr == nullptr) {                        \
+      TFTRT_INTERNAL_ERROR_AT_NODE(node);        \
+    }                                            \
+  } while (0)
 
 namespace tensorflow {
 namespace tensorrt {
 namespace convert {
 using ::tensorflow::str_util::Split;
-
 using ::tensorflow::strings::StrAppend;
 using ::tensorflow::strings::StrCat;
 
@@ -75,13 +95,163 @@ inline tensorflow::Status ConvertDType(tensorflow::DataType tf_dtype,
     case tensorflow::DataType::DT_HALF:
       *trt_dtype = nvinfer1::DataType::kHALF;
       break;
+#if NV_TENSORRT_MAJOR > 3
+    case tensorflow::DataType::DT_INT32:
+      *trt_dtype = nvinfer1::DataType::kINT32;
+      break;
+#endif
     default:
       return tensorflow::errors::InvalidArgument(
-          "Unsupported data type " + tensorflow::DataTypeString(tf_dtype));
+          "Unsupported data type ", tensorflow::DataTypeString(tf_dtype));
   }
   return tensorflow::Status::OK();
 }
 
+void GetInputProperties(const grappler::GraphProperties& graph_properties,
+                        const Node* outside_node, const int out_port,
+                        PartialTensorShape* shape,
+                        tensorflow::DataType* dtype) {
+  if (graph_properties.HasOutputProperties(outside_node->name())) {
+    auto output_params =
+        graph_properties.GetOutputProperties(outside_node->name());
+    auto out_shape = output_params.at(out_port);
+    *dtype = out_shape.dtype();
+    *shape = out_shape.shape();
+  } else {
+    VLOG(0) << "Unknown output shape" << outside_node->name();
+    *dtype = outside_node->output_type(out_port);
+  }
+}
+
+void GetOutputProperties(const grappler::GraphProperties& graph_properties,
+                         const Node* outside_node, const int in_port,
+                         PartialTensorShape* shape,
+                         tensorflow::DataType* dtype) {
+  if (graph_properties.HasInputProperties(outside_node->name())) {
+    auto input_params =
+        graph_properties.GetInputProperties(outside_node->name());
+    auto in_shape = input_params.at(in_port);
+    *dtype = in_shape.dtype();
+    *shape = in_shape.shape();
+  } else {
+    *dtype = outside_node->input_type(in_port);
+  }
+}
+
+tensorflow::Status ValidateInputProperties(const PartialTensorShape& shape,
+                                           const tensorflow::DataType dtype,
+                                           nvinfer1::DataType* trt_dtype) {
+  // TODO(aaroey): some of these checks also apply to IsTensorRTCandidate(), so
+  // put them there instead.
+  TF_RETURN_IF_ERROR(ConvertDType(dtype, trt_dtype));
+  if (shape.dims() < 0) {
+    return tensorflow::errors::InvalidArgument("Input tensor rank is unknown.");
+  }
+  if (shape.dims() > 9) {
+    return tensorflow::errors::OutOfRange(
+        "Input tensor rank is greater than 8.");
+  }
+  for (int d = 1; d < shape.dims(); ++d) {
+    if (shape.dim_size(d) < 0) {
+      return tensorflow::errors::InvalidArgument(
+          "Input tensor has a unknown non-batch dimemension at dim ", d);
+    }
+  }
+  return Status::OK();
+}
+
+// Return whether or not the broadcast is feasible;
+bool TensorRTGetBroadcastShape(const nvinfer1::Dims& operand_l,
+                               const bool operand_l_is_tensor,
+                               const nvinfer1::Dims& operand_r,
+                               const bool operand_r_is_tensor,
+                               nvinfer1::Dims* operand_l_new_shape,
+                               nvinfer1::Dims* operand_r_new_shape) {
+  // ***************************************************************************
+  // TensorRT Elementwise op supports broadcast but requires both tensor to be
+  // of Identical rank
+  //
+  // We consider case of:
+  //   1. operand_l to be a Tensor & operand_r to be a Const;
+  //   2. operand_l to be a Tensor & operand_r to be a Tensor;
+  // note: const op const (constant folding) should fallback to TensorFlow
+  //
+  // broadcast scheme:
+  //       T:  1 3 5    (tensor would not have batch dimension)
+  //       W:  1 1 3 1  (weight would have all explicit dimensions)
+  // i. fill in explicit dimensions
+  //    -> T: -1 1 3 5  (we put a -1 for batch dimension)
+  //    -> W:  1 1 3 1
+  // ii. compare broadcast feasibility
+  //
+  // We cannot support the following since TensorRT does not allow manipulation
+  // on batch dimension, we cannot generate output with proper shape
+  //    T: 3 5 1
+  //    W: 1 1 1  1 3 5 1
+  // -> T: 1 1 1 -1 3 5 1
+  // -> W: 1 1 1  1 3 5 1
+  // ***************************************************************************
+  const int max_nb_dims = nvinfer1::Dims::MAX_DIMS + 1;
+  const size_t element_size = sizeof(operand_l.d[0]);
+
+  // fill in dimensions
+  int l_s[max_nb_dims];
+  std::fill(l_s, l_s + max_nb_dims, 1);
+  int l_d = operand_l_is_tensor ? operand_l.nbDims + 1 : operand_l.nbDims;
+  int r_s[max_nb_dims];
+  std::fill(r_s, r_s + max_nb_dims, 1);
+  int r_d = operand_r_is_tensor ? operand_r.nbDims + 1 : operand_r.nbDims;
+
+  int max_d = std::max(l_d, r_d);
+  std::memcpy(l_s + max_d - operand_l.nbDims, operand_l.d,
+              operand_l.nbDims * element_size);
+  std::memcpy(r_s + max_d - operand_r.nbDims, operand_r.d,
+              operand_r.nbDims * element_size);
+
+  // set -1 for batch dimension, since batch size is not supposed to be
+  // broadcasted
+  if (operand_l_is_tensor) {
+    if (max_d != l_d) {  // if broadcast beyond batch dimension, fail
+      return false;
+    }
+    l_s[0] = -1;
+  }
+  if (operand_r_is_tensor) {
+    if (max_d != r_d) {  // if broadcast beyond batch dimension, fail
+      return false;
+    }
+    r_s[0] = -1;
+  }
+
+  // compare broadcast feasibility
+  for (int i = max_d - 1; i >= 0; i--) {
+    if ((l_s[i] != r_s[i]) && (l_s[i] != 1) && (r_s[i] != 1)) {
+      return false;
+    }
+  }
+
+  // output new TensorRT Dimension (stripping the batch dimension)
+  operand_l_new_shape->nbDims = max_d - 1;
+  std::memcpy(operand_l_new_shape->d, l_s + 1, (max_d - 1) * element_size);
+  operand_r_new_shape->nbDims = max_d - 1;
+  std::memcpy(operand_r_new_shape->d, r_s + 1, (max_d - 1) * element_size);
+
+  return true;
+}
+
+inline bool DimsEqual(const nvinfer1::Dims& dim_l,
+                      const nvinfer1::Dims& dim_r) {
+  if (dim_l.nbDims != dim_r.nbDims) {
+    return false;
+  }
+  for (int i = 0; i < dim_l.nbDims; i++) {
+    if (dim_l.d[i] != dim_r.d[i]) {
+      return false;
+    }
+  }
+  return true;
+}
+
 inline nvinfer1::Dims GetTensorShape(const tensorflow::Tensor& tensor) {
   nvinfer1::Dims dims;
   dims.nbDims = tensor.dims();
@@ -91,7 +261,7 @@ inline nvinfer1::Dims GetTensorShape(const tensorflow::Tensor& tensor) {
   return dims;
 }
 
-inline int64_t GetShapeSize(nvinfer1::Dims shape) {
+inline int64_t GetShapeSize(const nvinfer1::Dims& shape) {
   // Returns total number of elements in shape
   int64_t count = 1;
   for (int d = 0; d < shape.nbDims; ++d) {
@@ -104,7 +274,7 @@ static std::vector<std::pair<int, int>> CreateSamePadding(
     const nvinfer1::DimsHW& stride, const nvinfer1::DimsHW& kernel,
     const std::vector<int64_t>& input_dims) {
   std::vector<std::pair<int, int>> padding(input_dims.size());
-  CHECK_EQ((size_t)stride.nbDims, input_dims.size());  // TODO(jie): N+C? NC+?
+  CHECK_EQ(stride.nbDims, input_dims.size());  // TODO(jie): N+C? NC+?
 
   for (size_t i = 0; i < input_dims.size(); ++i) {
     // Formula to calculate the padding
@@ -134,6 +304,7 @@ string GetCommonNameScope(const string& op_name_a, const string& op_name_b) {
   return op_name_a.substr(0, last_scope_separator);
 }
 
+// Class to convert TF weight to TRT weight.
 class TRT_ShapedWeights {
  public:
   TRT_ShapedWeights(tensorflow::DataType type, const void* values,
@@ -145,12 +316,14 @@ class TRT_ShapedWeights {
   explicit TRT_ShapedWeights(tensorflow::DataType type)
       : shape_(), type_(type), values_(nullptr), empty_weight_flag_(true) {}
 
+  // TODO(aaroey): use rvalue reference.
   TRT_ShapedWeights(const TRT_ShapedWeights& rhs)
       : shape_(rhs.shape_),
         type_(rhs.type_),
         values_(rhs.values_),
         empty_weight_flag_(rhs.empty_weight_flag_) {}
 
+  // TODO(aaroey): use GetShapeSize() instead.
   int64_t count() const {
     int64_t c = 1;
     for (int i = 0; i < shape_.nbDims; i++) c *= shape_.d[i];
@@ -168,6 +341,7 @@ class TRT_ShapedWeights {
 
   const void* GetValues() const { return values_; }
 
+  // TODO(aaroey): get rid of this method.
   void SetValues(const void* values) { values_ = values; }
 
   size_t size_bytes() const {
@@ -178,10 +352,12 @@ class TRT_ShapedWeights {
   // Default converter
   operator nvinfer1::Weights() const { return GetWeightsForTRT(); }
 
+  // TODO(aaroey): make these private.
   nvinfer1::Dims shape_;
   tensorflow::DataType type_;
 
  private:
+  // TODO(aaroey): this should not be const as it's always from TRTWeightStore.
   const void* values_;
   bool empty_weight_flag_;
 };
@@ -192,6 +368,7 @@ class TRT_TensorOrWeights {
       : tensor_(tensor), weights_(DT_FLOAT), variant_(TRT_NODE_TENSOR) {}
   explicit TRT_TensorOrWeights(const TRT_ShapedWeights& weights)
       : tensor_(nullptr), weights_(weights), variant_(TRT_NODE_WEIGHTS) {}
+  // TODO(aaroey): use rvalue reference.
   TRT_TensorOrWeights(const TRT_TensorOrWeights& rhs)
       : tensor_(rhs.tensor_), weights_(rhs.weights_), variant_(rhs.variant_) {}
   ~TRT_TensorOrWeights() {}
@@ -200,19 +377,19 @@ class TRT_TensorOrWeights {
   bool is_weights() const { return variant_ == TRT_NODE_WEIGHTS; }
 
   nvinfer1::ITensor* tensor() {
-    CHECK_EQ(is_tensor(), true);
+    CHECK(is_tensor());
     return tensor_;
   }
   const nvinfer1::ITensor* tensor() const {
-    CHECK_EQ(is_tensor(), true);
+    CHECK(is_tensor());
     return tensor_;
   }
   TRT_ShapedWeights& weights() {
-    CHECK_EQ(is_weights(), true);
+    CHECK(is_weights());
     return weights_;
   }
   const TRT_ShapedWeights& weights() const {
-    CHECK_EQ(is_weights(), true);
+    CHECK(is_weights());
     return weights_;
   }
   nvinfer1::Dims shape() const {
@@ -236,21 +413,25 @@ class TFAttrs {
       attrs_.insert({attr.first, &attr.second});
     }
   }
-  bool count(string key) const { return attrs_.count(key); }
-  tensorflow::AttrValue const* at(string key) const {
+
+  bool count(const string& key) const { return attrs_.count(key); }
+
+  tensorflow::AttrValue const* at(const string& key) const {
     if (!attrs_.count(key)) {
       LOG(FATAL) << "Attribute not found: " << key;
     }
     return attrs_.at(key);
   }
+
   template <typename T>
   T get(const string& key) const;
+
   template <typename T>
   T get(const string& key, const T& default_value) const {
     return attrs_.count(key) ? this->get<T>(key) : default_value;
   }
 
-  std::vector<string> GetAllAttrKey() {
+  std::vector<string> GetAllAttrKeys() const {
     std::vector<string> attr_list;
     for (const auto& attr_item : attrs_) {
       attr_list.emplace_back(attr_item.first);
@@ -285,15 +466,6 @@ std::vector<string> TFAttrs::get<std::vector<string>>(const string& key) const {
   auto attr = this->at(key)->list().s();
   return std::vector<string>(attr.begin(), attr.end());
 }
-template <>
-nvinfer1::Dims TFAttrs::get<nvinfer1::Dims>(const string& key) const {
-  auto values = this->get<std::vector<int>>(key);
-  nvinfer1::Dims dims;
-  dims.nbDims = values.size();
-  std::copy(values.begin(), values.end(), dims.d);
-  // Note: No dimension type information is included
-  return dims;
-}
 
 template <>
 nvinfer1::DataType TFAttrs::get<nvinfer1::DataType>(const string& key) const {
@@ -319,10 +491,11 @@ bool TFAttrs::get<bool>(const string& key) const {
 }
 
 // TODO(jie): reorder4 & reorder2 should be merged?
+// TODO(aaroey): fix the order of parameters.
 template <typename T>
-void Reorder4(nvinfer1::DimsNCHW shape, const T* idata,
-              nvinfer1::DimsNCHW istrides, T* odata,
-              nvinfer1::DimsNCHW ostrides) {
+void Reorder4(const nvinfer1::DimsNCHW& shape, const T* idata,
+              const nvinfer1::DimsNCHW& istrides, T* odata,
+              const nvinfer1::DimsNCHW& ostrides) {
   for (int n = 0; n < shape.n(); ++n) {
     for (int c = 0; c < shape.c(); ++c) {
       for (int h = 0; h < shape.h(); ++h) {
@@ -337,12 +510,13 @@ void Reorder4(nvinfer1::DimsNCHW shape, const T* idata,
 }
 
 template <typename T>
-void Reorder2(nvinfer1::DimsHW shape, const T* idata, nvinfer1::DimsHW istrides,
-              T* odata, nvinfer1::DimsHW ostrides) {
+void Reorder2(const nvinfer1::DimsHW& shape, const T* idata,
+              const nvinfer1::DimsHW& istrides, T* odata,
+              const nvinfer1::DimsHW& ostrides) {
   for (int h = 0; h < shape.h(); ++h) {
     for (int w = 0; w < shape.w(); ++w) {
       odata[h * ostrides.h() + w * ostrides.w()] =
-          idata[h * ostrides.h() + w * ostrides.w()];
+          idata[h * istrides.h() + w * istrides.w()];
     }
   }
 }
@@ -350,16 +524,17 @@ void Reorder2(nvinfer1::DimsHW shape, const T* idata, nvinfer1::DimsHW istrides,
 // TODO(jie): fallback to tensorflow!!
 void ReorderCKtoKC(const TRT_ShapedWeights& iweights,
                    TRT_ShapedWeights* oweights) {
-  int c = iweights.shape_.d[0];
-  int k = iweights.shape_.d[1];
+  const int c = iweights.shape_.d[0];
+  const int k = iweights.shape_.d[1];
   oweights->shape_.d[0] = k;
   oweights->shape_.d[1] = c;
-  nvinfer1::DimsHW istrides = {1, k};
-  nvinfer1::DimsHW ostrides = {c, 1};
+  const nvinfer1::DimsHW istrides = {1, k};
+  const nvinfer1::DimsHW ostrides = {c, 1};
   switch (iweights.type_) {
     case tensorflow::DataType::DT_FLOAT: {
       Reorder2({k, c}, static_cast<float const*>(iweights.GetValues()),
                istrides,
+               // TODO(aaroey): get rid of all the const_cast like this.
                static_cast<float*>(const_cast<void*>(oweights->GetValues())),
                ostrides);
       break;
@@ -382,21 +557,24 @@ void ReorderRSCKToKCRS(const TRT_ShapedWeights& iweights,
                        TRT_ShapedWeights* oweights, int num_groups) {
   CHECK_EQ(iweights.type_, oweights->type_);
   CHECK_EQ(iweights.size_bytes(), oweights->size_bytes());
-  int r = iweights.shape_.d[0];
-  int s = iweights.shape_.d[1];
-  // TRT requires GKcRS, while TF depthwise has RSCK
-  //   where c=1, C=G
+  // K indexes over output channels, C over input channels, and R and S over the
+  // height and width of the convolution
+  const int r = iweights.shape_.d[0];
+  const int s = iweights.shape_.d[1];
+  // TRT requires GKcRS, while TF depthwise has RSCK where c=1, C=G
   VLOG(2) << "num_groups: " << num_groups;
-  int c = iweights.shape_.d[2] / num_groups;
+  const int c = iweights.shape_.d[2] / num_groups;
   VLOG(2) << "c" << iweights.shape_.d[2] << " then " << c;
-  int k = iweights.shape_.d[3] * num_groups;
+  const int k = iweights.shape_.d[3] * num_groups;
   VLOG(2) << "k" << iweights.shape_.d[3] << " then " << k;
+  VLOG(2) << "r" << iweights.shape_.d[0] << " then " << r;
+  VLOG(2) << "s" << iweights.shape_.d[1] << " then " << s;
   oweights->shape_.d[0] = k / num_groups;
   oweights->shape_.d[1] = c * num_groups;
   oweights->shape_.d[2] = r;
   oweights->shape_.d[3] = s;
-  nvinfer1::DimsNCHW istrides = {1, k, s * k * c, c * k};
-  nvinfer1::DimsNCHW ostrides = {c * r * s, r * s, s, 1};
+  const nvinfer1::DimsNCHW istrides = {1, k, s * k * c, c * k};
+  const nvinfer1::DimsNCHW ostrides = {c * r * s, r * s, s, 1};
   switch (iweights.type_) {
     case tensorflow::DataType::DT_FLOAT: {
       Reorder4({k, c, r, s}, static_cast<float const*>(iweights.GetValues()),
@@ -428,11 +606,14 @@ using OpConverter =
                                      std::vector<TRT_TensorOrWeights>*)>;
 
 class Converter {
+  // TODO(aaroey): fix the order of members.
   std::unordered_map<string, TRT_TensorOrWeights> trt_tensors_;
   std::unordered_map<string, OpConverter> op_registry_;
   OpConverter plugin_converter_;
   nvinfer1::INetworkDefinition* trt_network_;
   std::list<std::vector<uint8_t>> temp_bufs_;
+  // TODO(aaroey): inline the definition of TRTWeightStore here, and add APIs to
+  // operate the stored weights instead of operating it directly.
   TRTWeightStore* weight_store_;
   bool fp16_;
   void register_op_converters();
@@ -440,7 +621,7 @@ class Converter {
                                 std::vector<TRT_TensorOrWeights>* inputs) {
     for (auto const& input_name : node_def.input()) {
       /*************************************************************************
-       * TODO(jie) handle case 1) here
+       * TODO(jie): handle case 1) here.
        * Normalizes the inputs and extracts associated metadata:
        * 1) Inputs can contain a colon followed by a suffix of characters.
        *    That suffix may be a single number (e.g. inputName:1) or several
@@ -454,6 +635,7 @@ class Converter {
       if (input_name[0] == '^') continue;
       string name = input_name;
       auto first = name.find_first_of(':');
+      // TODO(aaroey): why removing the colon but not the zero? A bug?
       if (first != string::npos && first + 2 == name.size() &&
           name[first + 1] == '0')
         name.erase(first);
@@ -462,12 +644,13 @@ class Converter {
       if (trt_tensors_.count(name)) {
         inputs->push_back(trt_tensors_.at(name));
       } else {
-        string str("Node ");
-        StrAppend(&str, node_def.name(), " should have an input named '", name,
+        // TODO(aaroey): this should not happen, make it a CHECK.
+        // TODO(aaroey): use StrCat for pattern like this.
+        string msg("Node ");
+        StrAppend(&msg, node_def.name(), " should have an input named '", name,
                   "' but it is not available");
-        LOG(WARNING) << "input: " << name << " not available for node at "
-                     << node_def.name();
-        return tensorflow::errors::InvalidArgument(str);
+        LOG(ERROR) << msg;
+        return tensorflow::errors::InvalidArgument(msg);
       }
     }
     return tensorflow::Status::OK();
@@ -488,6 +671,7 @@ class Converter {
     weights.SetValues(weight_store_->store_.back().data());
     return weights;
   }
+  // TODO(aaroey): fix all the namings.
   bool isFP16() { return fp16_; }
   TRT_ShapedWeights get_temp_weights_like(const TRT_ShapedWeights& weights) {
     return this->get_temp_weights(weights.type_, weights.shape_);
@@ -496,9 +680,10 @@ class Converter {
   tensorflow::Status convert_node(const tensorflow::NodeDef& node_def) {
     std::vector<TRT_TensorOrWeights> inputs;
     TF_RETURN_IF_ERROR(this->get_inputs(node_def, &inputs));
-    string op = node_def.op();
+    const string& op = node_def.op();
     std::vector<TRT_TensorOrWeights> outputs;
     if (PluginFactoryTensorRT::GetInstance()->IsPlugin(op)) {
+      // TODO(aaroey): plugin_converter_ is not set, fix it.
       TF_RETURN_IF_ERROR(plugin_converter_(*this, node_def, inputs, &outputs));
     } else {
       if (!op_registry_.count(op)) {
@@ -509,7 +694,7 @@ class Converter {
       TF_RETURN_IF_ERROR(op_converter(*this, node_def, inputs, &outputs));
     }
     for (size_t i = 0; i < outputs.size(); ++i) {
-      TRT_TensorOrWeights output = outputs.at(i);
+      TRT_TensorOrWeights& output = outputs[i];
       // TODO(jie): tf protobuf seems to be omitting the :0 suffix
       string output_name = node_def.name();
       if (i != 0) output_name = StrCat(output_name, ":", i);
@@ -527,26 +712,29 @@ class Converter {
 
   nvinfer1::INetworkDefinition* network() { return trt_network_; }
 
-  TRT_TensorOrWeights get_tensor(string name) {
+  TRT_TensorOrWeights get_tensor(const string& name) {
     if (!trt_tensors_.count(name)) {
       return TRT_TensorOrWeights(nullptr);
     }
     return trt_tensors_.at(name);
   }
 
-  bool insert_input_tensor(string name, nvinfer1::ITensor* tensor) {
+  bool insert_input_tensor(const string& name, nvinfer1::ITensor* tensor) {
     return trt_tensors_.insert({name, TRT_TensorOrWeights(tensor)}).second;
   }
 
   nvinfer1::ITensor* TransposeTensor(nvinfer1::ITensor* input_tensor,
-                                     std::vector<int> order) {
-    auto dims = input_tensor->getDimensions();
+                                     const std::vector<int>& order) {
+    const auto dims = input_tensor->getDimensions();
 
     // TODO(jie): change the return to status and properly exit
     if (order.size() - 1 != size_t(dims.nbDims))
       LOG(ERROR) << "Dimension does not match, fail gracefully";
 
     nvinfer1::IShuffleLayer* layer = this->network()->addShuffle(*input_tensor);
+    if (layer == nullptr) {
+      return nullptr;
+    }
     nvinfer1::Permutation permutation;
     for (int32_t i = 0; i < dims.nbDims; ++i) {
       permutation.order[i] = order[i + 1] - 1;
@@ -577,13 +765,14 @@ TRT_ShapedWeights ConvertFP32ToFP16(Converter& ctx,
   }
   return weights;
 }
+
 // ****************************************************************************
 // Constant folding functions
 // TODO(jie): once optimizer kicks in, we should have done constant folding
 // there.
-//*****************************************************************************/
+// *****************************************************************************
 struct LambdaFactory {
-  enum class OP_CATEGORY : int { RSQRT = 0, NEG, ADD, MUL, SUB };
+  enum class OP_CATEGORY : int { RSQRT = 0, NEG, ADD, MUL, SUB, RECIP };
   OP_CATEGORY op;
 
   template <typename T>
@@ -595,6 +784,8 @@ struct LambdaFactory {
       }
       case OP_CATEGORY::NEG:
         return [](T t) -> T { return -t; };
+      case OP_CATEGORY::RECIP:
+        return [](T t) -> T { return 1.0 / t; };
       default:
         VLOG(2) << "Not supported op for unary: " << static_cast<int>(op);
         return nullptr;
@@ -628,7 +819,6 @@ struct LambdaFactory {
           VLOG(2) << "LAMBDA VAL : " << val;
           return l + val;
         };
-      // Return [val](T l)-> T {return l+val;};
       case OP_CATEGORY::SUB:
         return [val](T l) -> T {
           VLOG(2) << "LAMBDA VAL : " << val;
@@ -688,11 +878,13 @@ std::function<Eigen::half(Eigen::half)> LambdaFactory::unary<Eigen::half>() {
     }
     case OP_CATEGORY::NEG:
       return [](Eigen::half t) -> Eigen::half { return -t; };
+    // TODO(aaroey): can we support RECIP?
     default:
       VLOG(2) << "Not supported op for unary: " << static_cast<int>(op);
       return nullptr;
   }
 }
+
 tensorflow::Status UnaryCompute(const TRT_ShapedWeights& iweights,
                                 TRT_ShapedWeights* oweights,
                                 LambdaFactory unary_op) {
@@ -738,6 +930,7 @@ tensorflow::Status BinaryCompute(const TRT_ShapedWeights& iweights_l,
       if (iweights_l.count() != iweights_r.count()) {
         // We only supports broadcast of RankZero
         if (iweights_l.count() == 1) {
+          // TODO(aaroey): Remove loggings like this.
           VLOG(2) << "I bet it is not working!" << (*inp_l);
           std::transform(inp_r, inp_r + iweights_r.count(), oup,
                          binary_op.broadcast_l<float>(*inp_l));
@@ -790,117 +983,21 @@ tensorflow::Status BinaryCompute(const TRT_ShapedWeights& iweights_l,
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConstantFoldUnary(
-    Converter& ctx, const tensorflow::NodeDef& node_def,
-    const std::vector<TRT_TensorOrWeights>& inputs,
-    std::vector<TRT_TensorOrWeights>* outputs) {
-  TRT_ShapedWeights weights_input = inputs.at(0).weights();
-
-  // Allocate output weights
-  TRT_ShapedWeights weights_output = ctx.get_temp_weights_like(weights_input);
-
-  // FIXME assume type matches input weights
-  // Get trt type & shape
-  // Maybe this part has to be moved into the block of rsqrt later
-  // Check type consistency
-  CHECK_EQ(weights_input.type_,
-           TFAttrs(node_def).get<tensorflow::DataType>("T"));
-
-  LambdaFactory unary_op;
-  if (node_def.op() == "Rsqrt") {
-    // Compute rsqrt
-    unary_op.op = LambdaFactory::OP_CATEGORY::RSQRT;
-    auto ret = UnaryCompute(weights_input, &weights_output, unary_op);
-    // Pass the output
-    if (ret == tensorflow::Status::OK()) {
-      outputs->push_back(TRT_TensorOrWeights(weights_output));
-    }
-    return ret;
-  } else {
-    return tensorflow::errors::Unimplemented("Binary op not supported: " +
-                                             node_def.op());
-  }
-}
-
-// TODO(jie,ben) broadcast is needed yet not implemented
-// Let's get the simple stuff working first. Maybe we should fall back to TF
-//   approach for constant folding
-tensorflow::Status ConstantFoldBinary(
-    Converter& ctx, const tensorflow::NodeDef& node_def,
-    const std::vector<TRT_TensorOrWeights>& inputs,
-    std::vector<TRT_TensorOrWeights>* outputs) {
-  TRT_ShapedWeights weights_input_l = inputs.at(0).weights();
-  TRT_ShapedWeights weights_input_r = inputs.at(1).weights();
-
-  // Check type consistency
-  CHECK_EQ(weights_input_l.type_, weights_input_r.type_);
-
-  if (weights_input_l.shape_.nbDims != weights_input_r.shape_.nbDims)
-    return tensorflow::errors::Unimplemented(
-        "Binary op implicit broadcast not supported: " + node_def.op());
-
-  // TODO(jie): constant fold should really fall back to TF.
-  int num_dims = weights_input_l.shape_.nbDims;
-  nvinfer1::Dims output_shape;
-  output_shape.nbDims = num_dims;
-  VLOG(2) << "nb_dims: " << num_dims
-          << ", the other: " << weights_input_r.shape_.nbDims;
-  for (int i = 0; i < num_dims; i++) {
-    if (weights_input_l.shape_.d[i] == weights_input_r.shape_.d[i]) {
-      output_shape.d[i] = weights_input_l.shape_.d[i];
-    } else if (weights_input_l.shape_.d[i] == 1 ||
-               weights_input_r.shape_.d[i] == 1) {
-      output_shape.d[i] =
-          std::max(weights_input_l.shape_.d[i], weights_input_r.shape_.d[i]);
-    } else {
-      return tensorflow::errors::Unimplemented(
-          "Binary op with incompatible shape at, " + node_def.op());
-    }
-    VLOG(2) << "left: " << weights_input_l.shape_.d[i]
-            << "right: " << weights_input_r.shape_.d[i]
-            << "output: " << output_shape.d[i];
-  }
-
-  // FIXME assume type matches input weights
-  // Get trt type & shape
-  TFAttrs attrs(node_def);
-  // Maybe this part has to be moved into the block of rsqrt later
-  tensorflow::DataType dtype = attrs.get<tensorflow::DataType>("T");
-
-  // Allocate output weights
-  TRT_ShapedWeights weights_output = ctx.get_temp_weights(dtype, output_shape);
-
-  LambdaFactory binary_op;
-  if (node_def.op() == "Sub") {
-    binary_op.op = LambdaFactory::OP_CATEGORY::SUB;
-  } else if (node_def.op() == "Mul") {
-    binary_op.op = LambdaFactory::OP_CATEGORY::MUL;
-  } else if (node_def.op() == "Add") {
-    binary_op.op = LambdaFactory::OP_CATEGORY::ADD;
-  } else {
-    return tensorflow::errors::Unimplemented("Binary op not supported: " +
-                                             node_def.op());
-  }
-  auto ret = BinaryCompute(weights_input_l, weights_input_r, &weights_output,
-                           binary_op);
-
-  // Pass the output
-  if (ret == tensorflow::Status::OK()) {
-    outputs->push_back(TRT_TensorOrWeights(weights_output));
-  }
-
-  return ret;
-}
-
 // TODO(jie): broadcast is needed yet not implemented.
 // Only implemented channel wise for the time being
 tensorflow::Status BinaryTensorOpWeight(
     Converter& ctx, const tensorflow::NodeDef& node_def,
     const nvinfer1::ITensor* tensor, TRT_ShapedWeights weights,
-    std::vector<TRT_TensorOrWeights>* outputs) {
-  // FIXME assume type matches input weights
-  // Get trt type & shape
-  // Maybe this part has to be moved into the block of rsqrt later
+    bool swapped_inputs, std::vector<TRT_TensorOrWeights>* outputs) {
+  // tensor is the left operand while weights is the right operand;
+  // when swapped_inputs set to true, those two are swapped.
+  // TODO(aaroey): use a set.
+  if (node_def.op() != "Sub" && node_def.op() != "Add" &&
+      node_def.op() != "Mul" && node_def.op() != "Div" &&
+      node_def.op() != "RealDiv") {
+    return tensorflow::errors::Unimplemented(
+        "op not supported: " + node_def.op() + ", at: " + node_def.name());
+  }
 
   // Check type consistency
   nvinfer1::DataType ttype;
@@ -910,6 +1007,12 @@ tensorflow::Status BinaryTensorOpWeight(
   auto dims_w = weights.shape_;
   auto dims_t = tensor->getDimensions();
 
+  // TODO(jie): addScale checks for input tensor dimension
+  if (dims_t.nbDims != 3) {
+    return tensorflow::errors::InvalidArgument(
+        "addScale requires tensor with rank 3, " + node_def.name());
+  }
+
   // default to element-wise
   auto scale_mode = nvinfer1::ScaleMode::kELEMENTWISE;
 
@@ -980,6 +1083,7 @@ tensorflow::Status BinaryTensorOpWeight(
       permutation[dims_t.nbDims] = 1;
       tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
                                    permutation);
+      TFTRT_RETURN_ERROR_IF_NULLPTR(tensor, node_def.name());
     } else {
       return tensorflow::errors::InvalidArgument(
           "Transpose cannot be applied, " + node_def.name());
@@ -997,11 +1101,35 @@ tensorflow::Status BinaryTensorOpWeight(
 
   // Maybe I should do a switch
   if (node_def.op() == "Sub") {
-    TRT_ShapedWeights neg_weights = ctx.get_temp_weights_like(weights);
-    LambdaFactory unary_op;
-    unary_op.op = LambdaFactory::OP_CATEGORY::NEG;
-    TF_RETURN_IF_ERROR(UnaryCompute(weights, &neg_weights, unary_op));
-    shift_weights = neg_weights;
+    if (swapped_inputs) {
+      shift_weights = weights;
+      nvinfer1::IUnaryLayer* layer =
+          ctx.network()->addUnary(*const_cast<nvinfer1::ITensor*>(tensor),
+                                  nvinfer1::UnaryOperation::kNEG);
+      TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+      tensor = layer->getOutput(0);
+    } else {
+      TRT_ShapedWeights neg_weights = ctx.get_temp_weights_like(weights);
+      LambdaFactory unary_op;
+      unary_op.op = LambdaFactory::OP_CATEGORY::NEG;
+      TF_RETURN_IF_ERROR(UnaryCompute(weights, &neg_weights, unary_op));
+      shift_weights = neg_weights;
+    }
+  } else if (node_def.op() == "Div" || node_def.op() == "RealDiv") {
+    if (swapped_inputs) {
+      scale_weights = weights;
+      nvinfer1::IUnaryLayer* layer =
+          ctx.network()->addUnary(*const_cast<nvinfer1::ITensor*>(tensor),
+                                  nvinfer1::UnaryOperation::kRECIP);
+      TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+      tensor = layer->getOutput(0);
+    } else {
+      TRT_ShapedWeights recip_weights = ctx.get_temp_weights_like(weights);
+      LambdaFactory unary_op;
+      unary_op.op = LambdaFactory::OP_CATEGORY::RECIP;
+      TF_RETURN_IF_ERROR(UnaryCompute(weights, &recip_weights, unary_op));
+      scale_weights = recip_weights;
+    }
   } else if (node_def.op() == "Mul") {
     scale_weights = weights;
   } else if (node_def.op() == "Add") {
@@ -1014,11 +1142,13 @@ tensorflow::Status BinaryTensorOpWeight(
   nvinfer1::IScaleLayer* layer = ctx.network()->addScale(
       *const_cast<nvinfer1::ITensor*>(tensor), scale_mode, shift_weights,
       scale_weights, power_weights);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
 
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
   // transpose back dimension
   if (permutation_flag) {
     output_tensor = ctx.TransposeTensor(output_tensor, permutation);
+    TFTRT_RETURN_ERROR_IF_NULLPTR(output_tensor, node_def.name());
   }
 
   // Pass the output
@@ -1042,20 +1172,31 @@ tensorflow::Status ConvertConv2DHelper(
   if (data_format == "NHWC") {
     tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
                                  {0, 3, 1, 2});
+    TFTRT_RETURN_ERROR_IF_NULLPTR(tensor, node_def.name());
     h_index = 1;
     w_index = 2;
     // TODO(jie): transpose it
   }
 
   // tensor after transpose (NCHW)
-  auto tensor_dim = tensor->getDimensions();
+  const auto tensor_dim = tensor->getDimensions();
 
   int num_groups = group;
-  if (num_groups == 0)  // depthwise convolution
-    num_groups = tensor_dim.d[0];
+  if (num_groups == 0) num_groups = tensor_dim.d[0];  // depthwise convolution
   VLOG(2) << "groups count: " << num_groups;
 
   TRT_ShapedWeights weights_rsck = inputs.at(1).weights();
+
+  VLOG(2) << "weight shape: " << weights_rsck.shape_.nbDims;
+  for (int i = 0; i < weights_rsck.shape_.nbDims; i++) {
+    VLOG(2) << weights_rsck.shape_.d[i];
+  }
+
+  if (weights_rsck.shape_.nbDims != 4) {
+    return tensorflow::errors::Internal(
+        "Conv2D expects kernel of dimension 4, at: " + node_def.name());
+  }
+
   if (ctx.isFP16()) {
     weights_rsck = ConvertFP32ToFP16(ctx, inputs.at(1).weights());
   }
@@ -1063,18 +1204,22 @@ tensorflow::Status ConvertConv2DHelper(
   TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_rsck);
   ReorderRSCKToKCRS(weights_rsck, &weights, num_groups);
   TRT_ShapedWeights biases(weights.type_);
-  int noutput = weights.shape_.d[0] * num_groups;
+  const int noutput = weights.shape_.d[0] * num_groups;
   nvinfer1::DimsHW kernel_size;
   kernel_size.h() = weights.shape_.d[2];
   kernel_size.w() = weights.shape_.d[3];
+  VLOG(2) << "RSCK: ";
+  for (int i = 0; i < 4; i++) {
+    VLOG(2) << "     " << weights.shape_.d[i];
+  }
   VLOG(2) << "kernel size: " << kernel_size.h() << ", " << kernel_size.w();
 
   // TODO(jie): stride. (NHWC/NCHW)
-  auto tf_stride = attrs.get<std::vector<int>>("strides");
+  const auto tf_stride = attrs.get<std::vector<int>>("strides");
   VLOG(2) << "h_INDEX" << h_index << ", w_index " << w_index;
   VLOG(2) << "stride!!!: " << tf_stride[0] << tf_stride[1] << tf_stride[2]
           << tf_stride[3];
-  nvinfer1::DimsHW stride(tf_stride[h_index], tf_stride[w_index]);
+  const nvinfer1::DimsHW stride(tf_stride[h_index], tf_stride[w_index]);
 
   std::vector<std::pair<int, int>> padding;
   // TODO(jie): padding.
@@ -1102,6 +1247,7 @@ tensorflow::Status ConvertConv2DHelper(
         *const_cast<nvinfer1::ITensor*>(tensor),
         nvinfer1::DimsHW(padding[0].first, padding[1].first),
         nvinfer1::DimsHW(padding[0].second, padding[1].second));
+    TFTRT_RETURN_ERROR_IF_NULLPTR(pad_layer, node_def.name());
     padding = {{0, 0}, {0, 0}};
     tensor = pad_layer->getOutput(0);
     auto dim_after = tensor->getDimensions();
@@ -1112,6 +1258,7 @@ tensorflow::Status ConvertConv2DHelper(
   nvinfer1::IConvolutionLayer* layer =
       ctx.network()->addConvolution(*const_cast<nvinfer1::ITensor*>(tensor),
                                     noutput, kernel_size, weights, biases);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
 
   layer->setStride(stride);
   layer->setPadding({padding[0].first, padding[1].first});
@@ -1126,6 +1273,7 @@ tensorflow::Status ConvertConv2DHelper(
   if (data_format == "NHWC") {
     // TODO(jie): transpose it back!
     output_tensor = ctx.TransposeTensor(output_tensor, {0, 2, 3, 1});
+    TFTRT_RETURN_ERROR_IF_NULLPTR(output_tensor, node_def.name());
   } else {
     VLOG(2) << "NCHW !!!!";
   }
@@ -1147,35 +1295,91 @@ tensorflow::Status ConvertConv2DHelper(
                                            node_def.name());
 }
 
+// Helper function converts input into tensor with shape specified by dims.
+bool PrepareTensorForShape(Converter& ctx, const TRT_TensorOrWeights& input,
+                           const nvinfer1::Dims& dims,
+                           const nvinfer1::ITensor** tensor) {
+  if (input.is_tensor()) {
+    if (DimsEqual(input.shape(), dims)) {
+      *tensor = input.tensor();
+    } else {
+      nvinfer1::IShuffleLayer* layer = ctx.network()->addShuffle(
+          *const_cast<nvinfer1::ITensor*>(input.tensor()));
+      if (layer != nullptr) {
+        layer->setReshapeDimensions(dims);
+        *tensor = layer->getOutput(0);
+      } else {
+        return false;
+      }
+    }
+  } else {
+#if NV_TENSORRT_MAJOR > 3
+    nvinfer1::IConstantLayer* layer =
+        ctx.network()->addConstant(dims, input.weights());
+    if (layer != nullptr) {
+      *tensor = layer->getOutput(0);
+    } else {
+      return false;
+    }
+#else
+    return false;
+#endif
+  }
+  return true;
+}
+
 tensorflow::Status BinaryTensorOpTensor(
     Converter& ctx, const tensorflow::NodeDef& node_def,
-    const nvinfer1::ITensor* tensor_l, const nvinfer1::ITensor* tensor_r,
+    const TRT_TensorOrWeights& operand_l, const TRT_TensorOrWeights& operand_r,
     std::vector<TRT_TensorOrWeights>* outputs) {
   static const std::unordered_map<string, nvinfer1::ElementWiseOperation> ops{
       {"Add", nvinfer1::ElementWiseOperation::kSUM},
       {"Mul", nvinfer1::ElementWiseOperation::kPROD},
       {"Sub", nvinfer1::ElementWiseOperation::kSUB},
       {"Div", nvinfer1::ElementWiseOperation::kDIV},
+      {"RealDiv", nvinfer1::ElementWiseOperation::kDIV},
+      {"Minimum", nvinfer1::ElementWiseOperation::kMIN},
+      {"Maximum", nvinfer1::ElementWiseOperation::kMAX},
   };
 
-  // FIXME assume type matches input weights
+  const nvinfer1::ITensor* tensor_l;
+  const nvinfer1::ITensor* tensor_r;
+
+  nvinfer1::Dims dim_l;
+  nvinfer1::Dims dim_r;
+
+  if (!TensorRTGetBroadcastShape(operand_l.shape(), operand_l.is_tensor(),
+                                 operand_r.shape(), operand_r.is_tensor(),
+                                 &dim_l, &dim_r)) {
+    return tensorflow::errors::InvalidArgument(
+        "Binary op broadcast scheme not supported by TensorRT op: " +
+        node_def.op() + ", at: " + node_def.name());
+  }
+
+  TFTRT_RETURN_ERROR_IF_FALSE(
+      PrepareTensorForShape(ctx, operand_l, dim_l, &tensor_l), node_def.name());
+  TFTRT_RETURN_ERROR_IF_FALSE(
+      PrepareTensorForShape(ctx, operand_r, dim_r, &tensor_r), node_def.name());
+
   // get trt type & shape
   TFAttrs attrs(node_def);
   // maybe this part has to be moved into the block of rsqrt later
   nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("T");
 
   // check type consistency
-  CHECK_EQ_TYPE(tensor_l->getType(), dtype);
-  CHECK_EQ_TYPE(tensor_r->getType(), dtype);
+  TFTRT_CHECK_EQ_TYPE(tensor_l->getType(), dtype);
+  TFTRT_CHECK_EQ_TYPE(tensor_r->getType(), dtype);
   auto op_pair = ops.find(node_def.op());
-  if (op_pair == ops.end())
+  if (op_pair == ops.end()) {
     return tensorflow::errors::Unimplemented(
-        "binary op: " + node_def.op() +
-        " not supported at: " + node_def.name());
+        "binary op: ", node_def.op(), " not supported at: ", node_def.name());
+  }
 
   nvinfer1::IElementWiseLayer* layer = ctx.network()->addElementWise(
+      // TODO(aaroey): will tensor_l/tensor_r get modified?
       *const_cast<nvinfer1::ITensor*>(tensor_l),
       *const_cast<nvinfer1::ITensor*>(tensor_r), op_pair->second);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
 
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
 
@@ -1202,7 +1406,7 @@ tensorflow::Status ConvertPlugin(Converter& ctx,
   // passing attributes
   // TODO(jie): support more general attribute
   TFAttrs attrs(node_def);
-  auto attr_key_vector = attrs.GetAllAttrKey();
+  auto attr_key_vector = attrs.GetAllAttrKeys();
   for (auto attr_key : attr_key_vector) {
     // TODO(jie): support only list of float for toy example here.
     auto data = attrs.get<std::vector<float>>(attr_key);
@@ -1223,29 +1427,6 @@ tensorflow::Status ConvertPlugin(Converter& ctx,
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertPlaceholder(
-    Converter& ctx, const tensorflow::NodeDef& node_def,
-    const std::vector<TRT_TensorOrWeights>& inputs,
-    std::vector<TRT_TensorOrWeights>* outputs) {
-  VLOG(2) << "Placeholder should have been replace already";
-  return tensorflow::errors::Unimplemented("cannot convert Placeholder op");
-  // OK this make sense since we are supposed to replace it with input
-  TFAttrs attrs(node_def);
-  nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("dtype");
-  nvinfer1::Dims dims = attrs.get<nvinfer1::Dims>("shape");
-
-  dims.nbDims--;
-  for (int i = 0; i < dims.nbDims; i++) dims.d[i] = dims.d[i + 1];
-
-  nvinfer1::ITensor* output =
-      ctx.network()->addInput(node_def.name().c_str(), dtype, dims);
-  if (!output) {
-    return tensorflow::errors::InvalidArgument("Failed to create Input layer");
-  }
-  outputs->push_back(TRT_TensorOrWeights(output));
-  return tensorflow::Status::OK();
-}
-
 tensorflow::Status ConvertConv2D(Converter& ctx,
                                  const tensorflow::NodeDef& node_def,
                                  const std::vector<TRT_TensorOrWeights>& inputs,
@@ -1271,65 +1452,64 @@ tensorflow::Status ConvertPool(Converter& ctx,
 
   int h_index = 2;
   int w_index = 3;
-  auto data_format = attrs.get<string>("data_format");
+  const auto data_format = attrs.get<string>("data_format");
   if (data_format == "NHWC") {
     h_index = 1;
     w_index = 2;
     tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
                                  {0, 3, 1, 2});
-  } else {
-    VLOG(2) << "NCHW !!!!";
+    TFTRT_RETURN_ERROR_IF_NULLPTR(tensor, node_def.name());
   }
+
   nvinfer1::PoolingType type;
-  // TODO(jie): support other pooling type
-  if (node_def.op() == "MaxPool")
+  if (node_def.op() == "MaxPool") {
     type = nvinfer1::PoolingType::kMAX;
-  else if (node_def.op() == "AvgPool")
+  } else if (node_def.op() == "AvgPool") {
     type = nvinfer1::PoolingType::kAVERAGE;
-  else
-    return tensorflow::errors::Unimplemented("Only supports Max pool");
+  } else {
+    return tensorflow::errors::Unimplemented("Unsupported pool type: ",
+                                             node_def.op());
+  }
 
-  // TODO(jie): NCHW
-  auto tf_stride = attrs.get<std::vector<int>>("strides");
-  nvinfer1::DimsHW stride(tf_stride[h_index], tf_stride[w_index]);
+  const auto tf_stride = attrs.get<std::vector<int>>("strides");
+  const nvinfer1::DimsHW stride(tf_stride[h_index], tf_stride[w_index]);
 
-  auto tf_kernel = attrs.get<std::vector<int>>("ksize");
-  nvinfer1::DimsHW ksize(tf_kernel[h_index], tf_kernel[w_index]);
+  const auto tf_kernel = attrs.get<std::vector<int>>("ksize");
+  const nvinfer1::DimsHW ksize(tf_kernel[h_index], tf_kernel[w_index]);
 
   auto tensor_dim = tensor->getDimensions();
   std::vector<std::pair<int, int>> padding;
-  // TODO(jie): padding.
-  if (attrs.get<string>("padding") == "SAME") {
+  const string padding_type = attrs.get<string>("padding");
+  if (padding_type == "SAME") {
     // This is NCHW tensor with no batch dimension.
     //  1 -> h
     //  2 -> w
     padding = CreateSamePadding(
         stride, ksize,
         {static_cast<int>(tensor_dim.d[1]), static_cast<int>(tensor_dim.d[2])});
-  } else if (attrs.get<string>("padding") == "VALID") {
-    // No padding for valid padding here
-    VLOG(2) << "No padding added for VALID padding in pool" << node_def.name();
+  } else if (padding_type == "VALID") {
     padding = {{0, 0}, {0, 0}};
   } else {
-    return tensorflow::errors::Unimplemented(
-        "Current MaxPool cannot support padding other than SAME");
+    return tensorflow::errors::Unimplemented("Unsupported padding type: ",
+                                             padding_type);
   }
 
   if (padding[0].first != padding[0].second ||
       padding[1].first != padding[1].second) {
-    // TODO(jie): handle asymmetric padding
     VLOG(2) << "Padding!!!: " << padding[0].first << padding[0].second
             << padding[1].first << padding[1].second;
     auto pad_layer = ctx.network()->addPadding(
         *const_cast<nvinfer1::ITensor*>(tensor),
         nvinfer1::DimsHW(padding[0].first, padding[1].first),
         nvinfer1::DimsHW(padding[0].second, padding[1].second));
+    TFTRT_RETURN_ERROR_IF_NULLPTR(pad_layer, node_def.name());
     padding = {{0, 0}, {0, 0}};
     tensor = pad_layer->getOutput(0);
   }
 
   nvinfer1::IPoolingLayer* layer = ctx.network()->addPooling(
       *const_cast<nvinfer1::ITensor*>(tensor), type, ksize);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
 
   layer->setStride(stride);
   layer->setPadding({padding[0].first, padding[1].first});
@@ -1337,10 +1517,8 @@ tensorflow::Status ConvertPool(Converter& ctx,
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
 
   if (data_format == "NHWC") {
-    // TODO(jie): transpose it back!
     output_tensor = ctx.TransposeTensor(output_tensor, {0, 2, 3, 1});
-  } else {
-    VLOG(2) << "NCHW !!!!";
+    TFTRT_RETURN_ERROR_IF_NULLPTR(output_tensor, node_def.name());
   }
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
@@ -1353,6 +1531,7 @@ tensorflow::Status ConvertActivation(
   const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
   nvinfer1::IActivationLayer* layer = ctx.network()->addActivation(
       *const_cast<nvinfer1::ITensor*>(tensor), nvinfer1::ActivationType::kRELU);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
@@ -1363,40 +1542,61 @@ tensorflow::Status ConvertScale(Converter& ctx,
                                 const std::vector<TRT_TensorOrWeights>& inputs,
                                 std::vector<TRT_TensorOrWeights>* outputs) {
   if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
-      !inputs.at(1).is_weights())
+      !inputs.at(1).is_weights()) {
     return tensorflow::errors::Unimplemented(
-        "Only supports tensor op weight for now, at " + node_def.name());
-  // Implement tensor binaryOp weight [channel wise] for now;
-  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
+        "ConvertScale only supports tensor<op>weight: ", node_def.name());
+  }
 
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
   TRT_ShapedWeights weights = inputs.at(1).weights();
   if (ctx.isFP16()) {
     weights = ConvertFP32ToFP16(ctx, inputs.at(1).weights());
   }
 
   TRT_ShapedWeights empty_weights(weights.type_);
-
   TFAttrs attrs(node_def);
 
-  // Transpose NHWC
-  auto data_format = attrs.get<string>("data_format");
+  const auto data_format = attrs.get<string>("data_format");
+  int channel_index;
+  const auto dims = tensor->getDimensions();
   if (data_format == "NHWC") {
-    tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
-                                 {0, 3, 1, 2});
-    // TODO(jie): transpose it
+    //  1). NHWC is really N+C
+    channel_index = dims.nbDims - 1;  // batch dimension is implicit here!
   } else {
-    VLOG(2) << "NCHW !!!!";
+    //  2). NCHW is really N+CHW
+    channel_index = dims.nbDims - 3;  // batch dimension is implicit here!
   }
 
-  auto dims = tensor->getDimensions();
-  VLOG(2) << "tensor dimensions: " << dims.nbDims;
-  for (int i = 0; i < dims.nbDims; i++) {
-    VLOG(2) << "i: " << dims.d[i];
+  nvinfer1::Permutation permutation;
+  for (int32_t i = 0; i < dims.nbDims; ++i) {
+    permutation.order[i] = i;
   }
-  dims = weights.shape_;
-  VLOG(2) << "tensor dimensions: " << dims.nbDims;
-  for (int i = 0; i < dims.nbDims; i++) {
-    VLOG(2) << "i: " << dims.d[i];
+
+  if (channel_index >= 0) {
+    permutation.order[0] = channel_index;
+    permutation.order[channel_index] = 0;
+  } else {
+    return tensorflow::errors::Unimplemented(
+        "TFTRT::BiasAdd cannot apply on batch dimension, at ", node_def.name());
+  }
+
+  // TensorRT addScale requires input to be of rank 3, we need to apply
+  // transpose as well as reshape
+  if (channel_index != 0 || dims.nbDims != 3) {
+    nvinfer1::IShuffleLayer* shuffle_layer =
+        ctx.network()->addShuffle(*const_cast<nvinfer1::ITensor*>(tensor));
+    TFTRT_RETURN_ERROR_IF_NULLPTR(shuffle_layer, node_def.name());
+    nvinfer1::Dims reshape_dims;
+    reshape_dims.nbDims = 3;
+    reshape_dims.d[0] = 0;                          // 0 copy from the input
+    reshape_dims.d[1] = dims.nbDims >= 2 ? 0 : 1;   // 0 copy from the input
+    reshape_dims.d[2] = dims.nbDims >= 3 ? -1 : 1;  // -1 infer from the rest
+    if (channel_index != 0) {
+      // maybe we do not need this check. concerned about TRT optimization
+      shuffle_layer->setFirstTranspose(permutation);
+    }
+    shuffle_layer->setReshapeDimensions(reshape_dims);
+    tensor = shuffle_layer->getOutput(0);
   }
 
   nvinfer1::ScaleMode mode = nvinfer1::ScaleMode::kCHANNEL;
@@ -1407,14 +1607,26 @@ tensorflow::Status ConvertScale(Converter& ctx,
   nvinfer1::IScaleLayer* layer =
       ctx.network()->addScale(*const_cast<nvinfer1::ITensor*>(tensor), mode,
                               weights, empty_weights, empty_weights);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
 
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
-  if (data_format == "NHWC") {
-    // TODO(jie): transpose it back!
-    output_tensor = ctx.TransposeTensor(output_tensor, {0, 2, 3, 1});
-  } else {
-    VLOG(2) << "NCHW !!!!";
+
+  // restore transpose & reshape
+  if (channel_index != 0 || dims.nbDims != 3) {
+    nvinfer1::IShuffleLayer* shuffle_layer = ctx.network()->addShuffle(
+        *const_cast<nvinfer1::ITensor*>(output_tensor));
+    TFTRT_RETURN_ERROR_IF_NULLPTR(shuffle_layer, node_def.name());
+    nvinfer1::Dims reshape_dims = dims;
+    int tmp = reshape_dims.d[channel_index];
+    reshape_dims.d[channel_index] = reshape_dims.d[0];
+    reshape_dims.d[0] = tmp;
+    shuffle_layer->setReshapeDimensions(reshape_dims);
+    if (channel_index != 0) {
+      shuffle_layer->setSecondTranspose(permutation);
+    }
+    output_tensor = shuffle_layer->getOutput(0);
   }
+
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
 }
@@ -1431,11 +1643,13 @@ tensorflow::Status ConvertConst(Converter& ctx,
 
   // Create shaped weights as output
   tensorflow::Tensor tensor;
-  if (!tensor.FromProto(weights_tensor))
-    return tensorflow::errors::Internal("Cannot parse weight tensor proto: " +
+  if (!tensor.FromProto(weights_tensor)) {
+    return tensorflow::errors::Internal("Cannot parse weight tensor proto: ",
                                         node_def.name());
+  }
 
   TRT_ShapedWeights weights(dtype);
+  // TODO(aaroey): we should choose the array using dtype and shape.
   if (!weights_tensor.float_val().empty()) {
     VLOG(2) << "SCALAR!!!" << node_def.name();
     nvinfer1::Dims scalar_shape;
@@ -1443,22 +1657,16 @@ tensorflow::Status ConvertConst(Converter& ctx,
       VLOG(2) << "dimensions: " << tensor.dims();
       VLOG(2) << "size: " << weights_tensor.float_val_size();
       scalar_shape = GetTensorShape(tensor);
+      VLOG(2) << "details: ";
       for (int i = 0; i < scalar_shape.nbDims; i++)
         VLOG(2) << scalar_shape.d[i];
-      if (GetShapeSize(scalar_shape) != weights_tensor.float_val_size()) {
-        if (weights_tensor.float_val_size() == 1 ||
-            scalar_shape.d[0] == weights_tensor.float_val_size()) {
-          scalar_shape.nbDims = 1;
-          // no dimension provided. flatten it
-          scalar_shape.d[0] = weights_tensor.float_val_size();
-          scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
-        } else {
-          LOG(WARNING) << "Broadcast on weights only supports kCHANNEL and"
-                       << " kUNIFORM, at: " << node_def.name();
-          string err_str("Broadcast method is not supported for '");
-          StrAppend(&err_str, node_def.name(), "' of type ", node_def.op());
-          return tensorflow::errors::InvalidArgument(err_str);
-        }
+      if (GetShapeSize(scalar_shape) != weights_tensor.float_val_size() &&
+          weights_tensor.float_val_size() != 1) {
+        LOG(ERROR) << "Broadcast on weights only supports kCHANNEL and"
+                   << " kUNIFORM, at: " << node_def.name();
+        string err_str("Broadcast method is not supported for '");
+        StrAppend(&err_str, node_def.name(), "' of type ", node_def.op());
+        return tensorflow::errors::InvalidArgument(err_str);
       }
     } else {
       VLOG(2) << "Dimensions: " << tensor.dims();
@@ -1468,39 +1676,42 @@ tensorflow::Status ConvertConst(Converter& ctx,
       scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
       for (int i = 1; i < nvinfer1::Dims::MAX_DIMS; i++) {
         scalar_shape.d[i] = 0;
-        scalar_shape.type[i] = nvinfer1::DimensionType::kSPATIAL;
       }
     }
+    // TODO(aaroey): use GetShapeSize().
     size_t len_data = tensorflow::DataTypeSize(dtype);
     for (int i = 0; i < scalar_shape.nbDims; i++) len_data *= scalar_shape.d[i];
     ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
     void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
-    std::vector<float> tensor_data(
-        weights_tensor.float_val().begin(),
-        weights_tensor.float_val()
-            .end());  //  make a local copy first to flatten
-    memcpy(dst, tensor_data.data(), len_data);  // store into weight store
+    if (weights_tensor.float_val_size() == 1) {
+      std::fill_n((float*)dst, GetShapeSize(scalar_shape),
+                  *weights_tensor.float_val().begin());
+    } else {
+      // TODO(aaroey): get rid of this copy as RepeatedField is always
+      // contiguous make a local copy first to flatten doesn't have to be
+      // contiguous
+      std::vector<float> tensor_data(weights_tensor.float_val().begin(),
+                                     weights_tensor.float_val().end());
+      memcpy(dst, tensor_data.data(), len_data);  // store into weight store
+    }
+    VLOG(2) << "create shape details: ";
+    for (int i = 0; i < scalar_shape.nbDims; i++) VLOG(2) << scalar_shape.d[i];
     weights = TRT_ShapedWeights(dtype, dst, scalar_shape);
   } else if (!weights_tensor.int_val().empty()) {
+    // TODO(aaroey): this is very similar to the above code for float, merge
+    // them.
     VLOG(2) << "int!!!" << node_def.name();
     nvinfer1::Dims scalar_shape;
     if (tensor.dims() > 0) {
       VLOG(2) << "dimensions: " << tensor.dims();
       scalar_shape = GetTensorShape(tensor);
-      if (GetShapeSize(scalar_shape) != weights_tensor.int_val_size()) {
-        if (weights_tensor.int_val_size() == 1 ||
-            scalar_shape.d[0] == weights_tensor.int_val_size()) {
-          scalar_shape.nbDims = 1;
-          // no dimension provided. flatten it
-          scalar_shape.d[0] = weights_tensor.int_val_size();
-          scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
-        } else {
-          LOG(WARNING) << "Broadcast on weights only supports kCHANNEL and"
-                       << " kUNIFORM, at: " << node_def.name();
-          string err_str("Broadcast method is not supported for '");
-          StrAppend(&err_str, node_def.name(), "' of type ", node_def.op());
-          return tensorflow::errors::InvalidArgument(err_str);
-        }
+      if (GetShapeSize(scalar_shape) != weights_tensor.int_val_size() &&
+          weights_tensor.int_val_size() != 1) {
+        LOG(WARNING) << "Broadcast on weights only supports kCHANNEL and"
+                     << " kUNIFORM, at: " << node_def.name();
+        string err_str("Broadcast method is not supported for '");
+        StrAppend(&err_str, node_def.name(), "' of type ", node_def.op());
+        return tensorflow::errors::InvalidArgument(err_str);
       }
     } else {
       VLOG(2) << "dimensions: " << tensor.dims();
@@ -1513,23 +1724,30 @@ tensorflow::Status ConvertConst(Converter& ctx,
         scalar_shape.type[i] = nvinfer1::DimensionType::kSPATIAL;
       }
     }
-    //  we should not have converted //if (ctx.isFP16()) {
+    // we should not have converted
     size_t len_data = tensorflow::DataTypeSize(dtype);
     for (int i = 0; i < scalar_shape.nbDims; i++) len_data *= scalar_shape.d[i];
     size_t len_tensor = weights_tensor.int_val_size() * sizeof(int32);
     len_data = std::max(len_data, len_tensor);
     ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
     void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
-    std::vector<int32> tensor_data(
-        weights_tensor.int_val().begin(),
-        weights_tensor.int_val().end());  //  make a local copy first to flatten
-                                          //  doesn't have to be contigous
-    memcpy(dst, tensor_data.data(), len_tensor);  // store into weight store
+    if (weights_tensor.int_val_size() == 1) {
+      std::fill_n((int*)dst, GetShapeSize(scalar_shape),
+                  *weights_tensor.int_val().begin());
+    } else {
+      // TODO(aaroey): get rid of this copy as RepeatedField is always
+      // contiguous make a local copy first to flatten doesn't have to be
+      // contiguous
+      std::vector<int32> tensor_data(weights_tensor.int_val().begin(),
+                                     weights_tensor.int_val().end());
+      memcpy(dst, tensor_data.data(), len_tensor);  // store into weight store
+    }
     weights = TRT_ShapedWeights(dtype, dst, scalar_shape);
   } else if (!weights_tensor.tensor_content().empty()) {
-    //  obsolete method.
-    //  After optimization path, we do not see weights in this format.
-    //  fp16 conversion technically should be needed here.
+    // obsolete method.
+    // After optimization path, we do not see weights in this format.
+    // TODO(aaroey): why?
+    // fp16 conversion technically should be needed here.
     VLOG(2) << "TENSOR!!!" << node_def.name();
     const auto& content = weights_tensor.tensor_content();
 
@@ -1543,8 +1761,8 @@ tensorflow::Status ConvertConst(Converter& ctx,
           content, static_cast<char*>(const_cast<void*>(weights.GetValues())));
     }
   } else {
-    return tensorflow::errors::Unimplemented(
-        "Not supported constant type, at " + node_def.name());
+    return tensorflow::errors::Unimplemented("Not supported constant type, at ",
+                                             node_def.name());
   }
   // Pass the output
   outputs->push_back(TRT_TensorOrWeights(weights));
@@ -1563,96 +1781,144 @@ tensorflow::Status ConvertBinary(Converter& ctx,
                                  const tensorflow::NodeDef& node_def,
                                  const std::vector<TRT_TensorOrWeights>& inputs,
                                  std::vector<TRT_TensorOrWeights>* outputs) {
-  if (inputs.size() != 2)
+  if (inputs.size() != 2) {
     return tensorflow::errors::FailedPrecondition(
-        "Binary ops require two tensor input, at " + node_def.name());
-
-  if (inputs.at(0).is_weights() && inputs.at(1).is_weights())
-    return ConstantFoldBinary(ctx, node_def, inputs, outputs);
-
-  if (inputs.at(0).is_tensor() && inputs.at(1).is_weights())
-    return BinaryTensorOpWeight(ctx, node_def, inputs.at(0).tensor(),
-                                inputs.at(1).weights(), outputs);
+        "Binary ops require two tensor input, at ", node_def.name());
+  }
 
-  if (inputs.at(0).is_weights() && inputs.at(1).is_tensor())
-    return BinaryTensorOpWeight(ctx, node_def, inputs.at(1).tensor(),
-                                inputs.at(0).weights(), outputs);
+  // Constant folding should have been done by TensorFlow
 
-  if (inputs.at(0).is_tensor() && inputs.at(1).is_tensor())
-    return BinaryTensorOpTensor(ctx, node_def, inputs.at(0).tensor(),
-                                inputs.at(1).tensor(), outputs);
+  if (inputs.at(0).is_weights() && inputs.at(1).is_weights()) {
+    return tensorflow::errors::Unimplemented(
+        "Constant folding is falled back to TensorFlow, binary op received "
+        "both input as constant at: ",
+        node_def.name());
+  }
 
-  return tensorflow::errors::Unknown("Binary op input error, at " +
-                                     node_def.name());
+  // Try to convert into Scale layer first (for better performance)
+  // Since scale layer supports restricted broadcast policy and op types, we
+  // allow failure and try to handle it through Elementwise op
+  // (BinaryTensorOpTensor)
+  Status status = tensorflow::Status::OK();
+  if (inputs.at(0).is_tensor() && inputs.at(1).is_weights()) {
+    status = BinaryTensorOpWeight(ctx, node_def, inputs.at(0).tensor(),
+                                  inputs.at(1).weights(), false, outputs);
+  } else if (inputs.at(0).is_weights() && inputs.at(1).is_tensor()) {
+    status = BinaryTensorOpWeight(ctx, node_def, inputs.at(1).tensor(),
+                                  inputs.at(0).weights(), true, outputs);
+#if NV_TENSORRT_MAJOR == 3
+  } else {
+#else
+  }
+  if ((inputs.at(0).is_tensor() && inputs.at(1).is_tensor()) || !status.ok()) {
+#endif
+    status = BinaryTensorOpTensor(ctx, node_def, inputs.at(0), inputs.at(1),
+                                  outputs);
+  }
+  return status;
 }
 
 tensorflow::Status ConvertUnary(Converter& ctx,
                                 const tensorflow::NodeDef& node_def,
                                 const std::vector<TRT_TensorOrWeights>& inputs,
                                 std::vector<TRT_TensorOrWeights>* outputs) {
-  if (inputs.size() != 1)
+  static const std::unordered_map<string, nvinfer1::UnaryOperation> ops{
+      {"Neg", nvinfer1::UnaryOperation::kNEG},
+      {"Exp", nvinfer1::UnaryOperation::kEXP},
+      {"Log", nvinfer1::UnaryOperation::kLOG},
+      {"Sqrt", nvinfer1::UnaryOperation::kSQRT},
+      {"Abs", nvinfer1::UnaryOperation::kABS},
+      {"Reciprocal", nvinfer1::UnaryOperation::kRECIP},
+  };
+
+  if (inputs.size() != 1) {
     return tensorflow::errors::FailedPrecondition(
-        "Unary ops require single tensor input, at " + node_def.name());
+        "Unary ops require single tensor input, at ", node_def.name());
+  }
 
-  if (inputs.at(0).is_weights())
-    return ConstantFoldUnary(ctx, node_def, inputs, outputs);
-  else if (inputs.at(0).is_tensor())
+#if NV_TENSORRT_MAJOR == 3
+  if (inputs.at(0).is_weights()) {
     return tensorflow::errors::Unimplemented(
-        "Unary op for tensor not supported, at " + node_def.name());
+        "Constant folding for unary op is not supported", node_def.name());
+  }
+#endif
 
-  return tensorflow::errors::Unknown("Binary op input error, at " +
-                                     node_def.name());
+  // TODO(jie): check type
+  const nvinfer1::ITensor* tensor;
+  TFTRT_RETURN_ERROR_IF_FALSE(
+      PrepareTensorForShape(ctx, inputs.at(0), inputs.at(0).shape(), &tensor),
+      node_def.name());
+
+  nvinfer1::IUnaryLayer* layer;
+  if (node_def.op() == "Rsqrt") {
+    layer = ctx.network()->addUnary(*const_cast<nvinfer1::ITensor*>(tensor),
+                                    nvinfer1::UnaryOperation::kSQRT);
+    TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+    tensor = layer->getOutput(0);
+    layer = ctx.network()->addUnary(*const_cast<nvinfer1::ITensor*>(tensor),
+                                    nvinfer1::UnaryOperation::kRECIP);
+  } else if (ops.count(node_def.op()) != 0) {
+    layer = ctx.network()->addUnary(*const_cast<nvinfer1::ITensor*>(tensor),
+                                    ops.at(node_def.op()));
+  } else {
+    return tensorflow::errors::InvalidArgument(
+        "Binary op: ", node_def.op(), " not supported, at ", node_def.name());
+  }
+
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertReduce(Converter& ctx,
-                                 const tensorflow::NodeDef& node_def,
-                                 const std::vector<TRT_TensorOrWeights>& inputs,
-                                 std::vector<TRT_TensorOrWeights>* outputs) {
+#if NV_TENSORRT_MAJOR == 3
+tensorflow::Status ConvertReducePool(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs) {
   if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
-      !inputs.at(1).is_weights())
+      !inputs.at(1).is_weights()) {
     return tensorflow::errors::InvalidArgument(
-        "Input expects tensor and weights, at" + node_def.name());
+        "Input expects tensor and weights, at", node_def.name());
+  }
 
   // Implement tensor binaryOp weight [channel wise] for now;
   const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
-  auto dims = tensor->getDimensions();
+  const auto dims = tensor->getDimensions();
   // Restore implicit batch dimension
-  int nb_dims = dims.nbDims + 1;
+  const int nb_dims = dims.nbDims + 1;
 
   TRT_ShapedWeights index_list = inputs.at(1).weights();
-
   TFAttrs attrs(node_def);
-  // TODO(jie): handle data type.
-  // Index type here is done through TF type, so I can leverage their
-  // EnumToDataType for my cast
   auto index_type = attrs.get<tensorflow::DataType>("Tidx");
 
   // Only expect to handle INT32 as attributes for now
-  if (index_type != tensorflow::DataType::DT_INT32)
+  if (index_type != tensorflow::DataType::DT_INT32) {
     return tensorflow::errors::Unimplemented("Tidx supports only DT_INT32");
-  auto index_list_data =
+  }
+  const auto index_list_data =
       static_cast<int*>(const_cast<void*>(index_list.GetValues()));
 
-  // Hack warning: have to fall back to pool layer since reduce is not in public
-  // TRT yet.
-  if (nb_dims != 4)
+  if (nb_dims != 4) {
     return tensorflow::errors::InvalidArgument(
-        "TRT only support reduce on 4 dimensional tensors, at" +
+        "TRT only support reduce on 4 dimensional tensors, at",
         node_def.name());
-  if (index_list.count() > 2)
+  }
+  if (index_list.count() > 2) {
     return tensorflow::errors::InvalidArgument(
-        "TRT cannot support reduce on more than 2 dimensions, at" +
+        "TRT cannot support reduce on more than 2 dimensions, at",
         node_def.name());
+  }
 
   std::set<int> idx_set;
   // We cannot operate on Channel. permutation flag used to transpose tensor
   int permuted_index = -1;
   for (int i = 0; i < index_list.count(); i++) {
-    if (index_list_data[i] == 0)
-      return tensorflow::errors::InvalidArgument("TRT cannot reduce at 0, at" +
+    if (index_list_data[i] == 0) {
+      return tensorflow::errors::InvalidArgument("TRT cannot reduce at 0, at",
                                                  node_def.name());
+    }
     if (index_list_data[i] == 1) permuted_index = 1;
-
     idx_set.emplace(index_list_data[i]);
   }
 
@@ -1673,6 +1939,7 @@ tensorflow::Status ConvertReduce(Converter& ctx,
     // Apply permutation before extracting dimension for pool_kernel
     tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
                                  permutation_order);
+    TFTRT_RETURN_ERROR_IF_NULLPTR(tensor, node_def.name());
   }
 
   // Apply permutation before extracting dimension for pool_kernel
@@ -1685,34 +1952,104 @@ tensorflow::Status ConvertReduce(Converter& ctx,
     nvinfer1::IPoolingLayer* layer =
         ctx.network()->addPooling(*const_cast<nvinfer1::ITensor*>(tensor),
                                   nvinfer1::PoolingType::kAVERAGE, pool_kernel);
+    TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
     output_tensor = layer->getOutput(0);
   } else {
-    return tensorflow::errors::Unimplemented(
-        "Op not supported " + node_def.op() + " , at " + node_def.name());
+    return tensorflow::errors::Unimplemented("Op not supported ", node_def.op(),
+                                             " , at ", node_def.name());
   }
   if (permuted_index != -1) {
     // Apply permutation before extracting dimension for pool_kernel
     output_tensor = ctx.TransposeTensor(
         const_cast<nvinfer1::ITensor*>(output_tensor), permutation_order);
+    TFTRT_RETURN_ERROR_IF_NULLPTR(output_tensor, node_def.name());
   }
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
 }
+#elif NV_TENSORRT_MAJOR > 3
+tensorflow::Status ConvertReduce(Converter& ctx,
+                                 const tensorflow::NodeDef& node_def,
+                                 const std::vector<TRT_TensorOrWeights>& inputs,
+                                 std::vector<TRT_TensorOrWeights>* outputs) {
+  if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
+      !inputs.at(1).is_weights()) {
+    return tensorflow::errors::InvalidArgument(
+        "Input expects tensor and weights, at", node_def.name());
+  }
+
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
+  TRT_ShapedWeights index_list = inputs.at(1).weights();
+
+  TFAttrs attrs(node_def);
+  auto index_type = attrs.get<tensorflow::DataType>("Tidx");
+
+  // Only expect to handle INT32 as attributes for now
+  if (index_type != tensorflow::DataType::DT_INT32) {
+    return tensorflow::errors::Unimplemented("Tidx supports only DT_INT32");
+  }
+
+  const auto keep_dims = attrs.get<bool>("keep_dims");
+  auto index_list_data =
+      static_cast<int*>(const_cast<void*>(index_list.GetValues()));
+
+  int axes = 0;
+  if (index_list.count() == 0) {
+    return tensorflow::errors::InvalidArgument(
+        "TRT cannot support reduce on all (batch) dimensions, at",
+        node_def.name());
+  } else {
+    for (int i = 0; i < index_list.count(); i++) {
+      if (index_list_data[i] == 0) {
+        return tensorflow::errors::InvalidArgument(
+            "TRT cannot reduce at batch dimension, at", node_def.name());
+      }
+      axes |= (1 << (index_list_data[i] - 1));
+    }
+  }
+
+  nvinfer1::ReduceOperation reduce_operation;
+  if (node_def.op() == "Sum") {
+    reduce_operation = nvinfer1::ReduceOperation::kSUM;
+  } else if (node_def.op() == "Prod") {
+    reduce_operation = nvinfer1::ReduceOperation::kPROD;
+  } else if (node_def.op() == "Max") {
+    reduce_operation = nvinfer1::ReduceOperation::kMAX;
+  } else if (node_def.op() == "Min") {
+    reduce_operation = nvinfer1::ReduceOperation::kMIN;
+  } else if (node_def.op() == "Mean") {
+    reduce_operation = nvinfer1::ReduceOperation::kAVG;
+  } else {
+    return tensorflow::errors::Unimplemented("Op not supported ", node_def.op(),
+                                             " , at ", node_def.name());
+  }
+
+  nvinfer1::ILayer* layer =
+      ctx.network()->addReduce(*const_cast<nvinfer1::ITensor*>(tensor),
+                               reduce_operation, axes, keep_dims);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+
+  outputs->push_back(TRT_TensorOrWeights(layer->getOutput(0)));
+  return tensorflow::Status::OK();
+}
+#endif
 
 tensorflow::Status ConvertPad(Converter& ctx,
                               const tensorflow::NodeDef& node_def,
                               const std::vector<TRT_TensorOrWeights>& inputs,
                               std::vector<TRT_TensorOrWeights>* outputs) {
+  // TODO(aaroey): make a routine for this check and reuse it.
   if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
-      !inputs.at(1).is_weights())
+      !inputs.at(1).is_weights()) {
     return tensorflow::errors::InvalidArgument(
-        "Input expects tensor and weights, at" + node_def.name());
+        "Input expects tensor and weights, at", node_def.name());
+  }
 
   // Implement tensor binaryOp weight [channel wise] for now;
   const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
-  auto dims = tensor->getDimensions();
+  const auto dims = tensor->getDimensions();
   // Restore implicit batch dimension
-  int nb_dims = dims.nbDims + 1;
+  const int nb_dims = dims.nbDims + 1;
 
   TRT_ShapedWeights pads = inputs.at(1).weights();
 
@@ -1722,21 +2059,24 @@ tensorflow::Status ConvertPad(Converter& ctx,
   auto padding_type = attrs.get<tensorflow::DataType>("Tpaddings");
   // TODO(jie): handle data type conversion for TRT?
 
-  if (pads.shape_.d[0] != nb_dims || pads.shape_.d[1] != 2)
+  if (pads.shape_.d[0] != nb_dims || pads.shape_.d[1] != 2) {
     return tensorflow::errors::InvalidArgument(
-        "Pad only supports explicit padding on 4 dimensional tensor, at " +
+        "Pad only supports explicit padding on 4 dimensional tensor, at ",
         node_def.name());
+  }
 
   // Only expect to handle INT32 as attributes for now
-  if (padding_type != tensorflow::DataType::DT_INT32)
+  if (padding_type != tensorflow::DataType::DT_INT32) {
     return tensorflow::errors::Unimplemented(
         "Tpaddings supports only DT_INT32");
+  }
   auto pad_data = static_cast<int*>(const_cast<void*>(pads.GetValues()));
 
   std::vector<int32_t> pad_index;
   for (int i = 0; i < nb_dims; i++) {
-    if (pad_data[2 * i] != 0 || pad_data[2 * i + 1] != 0)
+    if (pad_data[2 * i] != 0 || pad_data[2 * i + 1] != 0) {
       pad_index.push_back(i);
+    }
   }
 
   // No padding at all, we should exit
@@ -1746,20 +2086,23 @@ tensorflow::Status ConvertPad(Converter& ctx,
   }
 
   // Only supports padding on less than 2 axis GIE-2579
-  if (pad_index.size() > 2)
+  if (pad_index.size() > 2) {
     return tensorflow::errors::InvalidArgument(
         "Padding layer does not support padding on > 2");
+  }
 
   // Padding on batch dimension is not supported
-  if (pad_index[0] == 0)
+  if (pad_index[0] == 0) {
     return tensorflow::errors::InvalidArgument(
         "Padding layer does not support padding on batch dimension");
+  }
 
   // Not doing the legit thing here. ignoring padding on dim 1 and 3;
   // TODO(jie): implement pad as uff parser
-  if (pad_index.size() == 2 && pad_index[0] == 0 && pad_index[1] == 3)
+  if (pad_index.size() == 2 && pad_index[0] == 0 && pad_index[1] == 3) {
     return tensorflow::errors::Unimplemented(
         "Padding layer does not support padding on dimension 1 and 3 yet");
+  }
 
   bool legit_pad = true;
   nvinfer1::DimsHW pre_padding(0, 0);
@@ -1770,6 +2113,7 @@ tensorflow::Status ConvertPad(Converter& ctx,
     legit_pad = false;
     tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
                                  {0, 3, 2, 1});
+    TFTRT_RETURN_ERROR_IF_NULLPTR(tensor, node_def.name());
     permuted_pad_index[0] = 3;
   }
 
@@ -1786,11 +2130,14 @@ tensorflow::Status ConvertPad(Converter& ctx,
 
   nvinfer1::IPaddingLayer* layer = ctx.network()->addPadding(
       *const_cast<nvinfer1::ITensor*>(tensor), pre_padding, post_padding);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
 
-  if (!legit_pad)
+  if (!legit_pad) {
     output_tensor = ctx.TransposeTensor(
         const_cast<nvinfer1::ITensor*>(output_tensor), {0, 3, 2, 1});
+    TFTRT_RETURN_ERROR_IF_NULLPTR(output_tensor, node_def.name());
+  }
 
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
@@ -1803,9 +2150,10 @@ tensorflow::Status ConvertConcat(Converter& ctx,
   // not including the last input (axis) here
   int input_size = static_cast<int>(inputs.size()) - 1;
 
-  if (!inputs.at(0).is_tensor())
+  if (!inputs.at(0).is_tensor()) {
     return tensorflow::errors::InvalidArgument(
-        "Concat in TRT support only Tensor input, at " + node_def.name());
+        "Concat in TRT support only Tensor input, at ", node_def.name());
+  }
 
   // We are retrieving the axis
   TRT_ShapedWeights axis = inputs.at(input_size).weights();
@@ -1816,8 +2164,8 @@ tensorflow::Status ConvertConcat(Converter& ctx,
   // TODO(jie): handle data type
   // Only expect to handle INT32 as index attributes for now
   if (index_type != tensorflow::DataType::DT_INT32)
-    return tensorflow::errors::Unimplemented(
-        "Tidx supports only DT_INT32, at " + node_def.name());
+    return tensorflow::errors::Unimplemented("Tidx supports only DT_INT32, at ",
+                                             node_def.name());
 
   int index = *(static_cast<int*>(const_cast<void*>(axis.GetValues())));
 
@@ -1825,23 +2173,29 @@ tensorflow::Status ConvertConcat(Converter& ctx,
 
   auto dim = inputs.at(0).tensor()->getDimensions();
   // dimension check
-  if (index > dim.nbDims + 1)
+  if (index > dim.nbDims + 1) {
     return tensorflow::errors::InvalidArgument(
-        "Concatenate on axis out of dimension range, at " + node_def.name());
-
-  if (index == 0)
+        "Concatenate on axis out of dimension range, at ", node_def.name());
+  }
+  if (index == 0) {
     return tensorflow::errors::InvalidArgument(
-        "Concatenate on batch dimension not supported, at " + node_def.name());
+        "Concatenate on batch dimension not supported, at ", node_def.name());
+  }
+  if (index < 0) {
+    index = dim.nbDims + index + 1;
+  }
 
+#if NV_TENSORRT_MAJOR == 3
   // incase we need permutation;
   std::vector<int> permutation_order(dim.nbDims + 1);
 
   for (int i = 0; i < dim.nbDims + 1; i++) permutation_order[i] = i;
 
   if (index != 1) {
-    permutation_order[1] = index - 1;
-    permutation_order[index - 1] = 1;
+    permutation_order[1] = index;
+    permutation_order[index] = 1;
   }
+#endif
 
   std::vector<nvinfer1::ITensor const*> inputs_vec;
   // Shap chack (all input tensor should have same shape)
@@ -1849,24 +2203,28 @@ tensorflow::Status ConvertConcat(Converter& ctx,
   for (int i = 0; i < input_size; i++) {
     auto tensor_i = inputs.at(i).tensor();
     auto dim_i = tensor_i->getDimensions();
-    if (dim_i.nbDims != dim.nbDims)
+    if (dim_i.nbDims != dim.nbDims) {
       return tensorflow::errors::InvalidArgument(
-          "Concatenate receives inputs with inconsistent dimensions, at " +
+          "Concatenate receives inputs with inconsistent dimensions, at ",
           node_def.name());
-
+    }
     for (int j = 0; j < dim.nbDims; j++) {
       // check dimension consistency on non-concatenate axis
-      if (j != index - 1 && dim_i.d[j] != dim.d[j])
+      if (j != index - 1 && dim_i.d[j] != dim.d[j]) {
         return tensorflow::errors::InvalidArgument(
-            "Concatenate receives inputs with inconsistent shape, at" +
+            "Concatenate receives inputs with inconsistent shape, at",
             node_def.name());
+      }
     }
 
-    // TRT does concatenation only on channel!
-    if (index != 1)
+#if NV_TENSORRT_MAJOR == 3
+    // TRT3 does concatenation only on channel!
+    if (index != 1) {
       tensor_i = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor_i),
                                      permutation_order);
-
+      TFTRT_RETURN_ERROR_IF_NULLPTR(tensor_i, node_def.name());
+    }
+#endif
     inputs_vec.push_back(tensor_i);
   }
 
@@ -1874,11 +2232,18 @@ tensorflow::Status ConvertConcat(Converter& ctx,
   nvinfer1::IConcatenationLayer* layer = ctx.network()->addConcatenation(
       const_cast<nvinfer1::ITensor* const*>(inputs_vec.data()),
       inputs_vec.size());
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+#if NV_TENSORRT_MAJOR > 3
+  layer->setAxis(index - 1);
+#endif
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
 
+#if NV_TENSORRT_MAJOR == 3
   if (index != 1) {
     output_tensor = ctx.TransposeTensor(output_tensor, permutation_order);
+    TFTRT_RETURN_ERROR_IF_NULLPTR(output_tensor, node_def.name());
   }
+#endif
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
 }
@@ -1997,112 +2362,243 @@ tensorflow::Status ConvertFusedBatchNorm(
                               combined_offset_weights.GetWeightsForTRT(),
                               combined_scale_weights.GetWeightsForTRT(),
                               dummy_power_weights.GetWeightsForTRT());
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertMatMul(Converter& ctx,
-                                 const tensorflow::NodeDef& node_def,
-                                 const std::vector<TRT_TensorOrWeights>& inputs,
-                                 std::vector<TRT_TensorOrWeights>* outputs) {
-  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
-
-  // TODO(jie): transpose!
-  TFAttrs attrs(node_def);
+#if NV_TENSORRT_MAJOR > 3
+tensorflow::Status ConvertMatMulHelper(
+    Converter& ctx, TRT_TensorOrWeights tensor_input,
+    TRT_ShapedWeights weights_raw, bool transpose_weight, string node_name,
+    std::vector<TRT_TensorOrWeights>* outputs) {
+  nvinfer1::ITensor* output_tensor;
+  if (!tensor_input.is_tensor()) {
+    return tensorflow::errors::InvalidArgument("Input 0 expects tensor");
+  }
+  const nvinfer1::ITensor* tensor = tensor_input.tensor();
 
-  TRT_ShapedWeights weights_ck = inputs.at(1).weights();
-  TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_ck);
-  ReorderCKtoKC(weights_ck, &weights);
+  TRT_ShapedWeights weights(weights_raw.type_);
+  if (transpose_weight) {
+    weights = weights_raw;
+  } else {
+    TRT_ShapedWeights weights_ck = weights_raw;
+    weights = ctx.get_temp_weights_like(weights_ck);
+    ReorderCKtoKC(weights_raw, &weights);
+  }
   TRT_ShapedWeights biases(weights.type_);
 
   int noutput = weights.shape_.d[0];
 
+  auto input_dim = tensor->getDimensions();
+  while (input_dim.nbDims != 3) {
+    input_dim.d[input_dim.nbDims++] = 1;
+  }
+  TFTRT_RETURN_ERROR_IF_FALSE(
+      PrepareTensorForShape(ctx, tensor_input, input_dim, &tensor), node_name);
+
   nvinfer1::IFullyConnectedLayer* layer = ctx.network()->addFullyConnected(
       *const_cast<nvinfer1::ITensor*>(tensor), noutput, weights, biases);
-
-  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_name);
+  output_tensor = layer->getOutput(0);
+
+  const nvinfer1::ITensor* temp_tensor;
+  auto output_dim = output_tensor->getDimensions();
+  output_dim.nbDims = 1;
+  TFTRT_RETURN_ERROR_IF_FALSE(
+      PrepareTensorForShape(ctx, TRT_TensorOrWeights(output_tensor), output_dim,
+                            &temp_tensor),
+      node_name);
+  output_tensor = const_cast<nvinfer1::ITensor*>(temp_tensor);
   outputs->push_back(TRT_TensorOrWeights(output_tensor));
   return tensorflow::Status::OK();
 }
 
-tensorflow::Status ConvertReshape(
+// inputs are both two dimensional (tensorflow::ops::MatMul)
+tensorflow::Status ConvertMatMul(Converter& ctx,
+                                 const tensorflow::NodeDef& node_def,
+                                 const std::vector<TRT_TensorOrWeights>& inputs,
+                                 std::vector<TRT_TensorOrWeights>* outputs) {
+  if (!inputs.at(0).is_tensor()) {
+    return tensorflow::errors::InvalidArgument("Input 0 expects tensor, at" +
+                                               node_def.name());
+  }
+
+  TFAttrs attrs(node_def);
+  // TODO(jie): INT32 should be converted?
+  tensorflow::DataType tf_dtype = attrs.get<tensorflow::DataType>("T");
+  if (tf_dtype != tensorflow::DataType::DT_FLOAT &&
+      tf_dtype != tensorflow::DataType::DT_HALF) {
+    return tensorflow::errors::Unimplemented(
+        "data type is not supported, for node " + node_def.name() + " got " +
+        tensorflow::DataTypeString(tf_dtype));
+  }
+  bool transpose_a = attrs.get<bool>("transpose_a");
+  bool transpose_b = attrs.get<bool>("transpose_b");
+
+  // FullyConnected:
+  if (transpose_a) {
+    return tensorflow::errors::Internal(
+        "Transpose_a is not supported for TensorRT FullyConnected (op: " +
+        node_def.op() + "), at: " + node_def.name());
+  }
+  if (inputs.at(1).is_tensor()) {
+    return tensorflow::errors::Internal(
+        "Operand 1 must be constant for TensorRT FullyConnected (op: " +
+        node_def.op() + "), at: " + node_def.name());
+  }
+  return ConvertMatMulHelper(ctx, inputs.at(0), inputs.at(1).weights(),
+                             transpose_b, node_def.name(), outputs);
+}
+
+tensorflow::Status ConvertBatchMatMul(
     Converter& ctx, const tensorflow::NodeDef& node_def,
     const std::vector<TRT_TensorOrWeights>& inputs,
     std::vector<TRT_TensorOrWeights>* outputs) {
-  if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
-      !inputs.at(1).is_weights())
-    return tensorflow::errors::InvalidArgument(
-        "Input expects tensor and weights, at" + node_def.name());
+  TFAttrs attrs(node_def);
 
-  // implement tensor binaryOp weight [channel wise] for now;
-  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
-  auto dims = tensor->getDimensions();
-  // restore implicit batch dimension
+  // TODO(jie): INT32 should be converted?
+  tensorflow::DataType tf_dtype = attrs.get<tensorflow::DataType>("T");
+  if (tf_dtype != tensorflow::DataType::DT_FLOAT &&
+      tf_dtype != tensorflow::DataType::DT_HALF) {
+    return tensorflow::errors::Unimplemented(
+        "data type is not supported, for node " + node_def.name() + " got " +
+        tensorflow::DataTypeString(tf_dtype));
+  }
 
-  TRT_ShapedWeights shape = inputs.at(1).weights();
+  bool transpose_a = attrs.get<bool>("adj_x");
+  bool transpose_b = attrs.get<bool>("adj_y");
 
-  TFAttrs attrs(node_def);
+  auto dims = inputs.at(0).shape();
+  if (dims.nbDims == 1) {  // NC * CK is only supported through fully connected
+    if (transpose_a == false && inputs.at(0).is_tensor() &&
+        inputs.at(1).is_weights()) {
+      return ConvertMatMulHelper(ctx, inputs.at(0), inputs.at(1).weights(),
+                                 transpose_b, node_def.name(), outputs);
+    } else {
+      return tensorflow::errors::InvalidArgument(
+          "Invalid configuration for MatMul, at: " + node_def.name());
+    }
+  }
 
-  auto padding_type = attrs.get<tensorflow::DataType>("Tshape");
+  const nvinfer1::ITensor* tensor_l;
+  const nvinfer1::ITensor* tensor_r;
+  auto dims_l = inputs.at(0).shape();
+  auto dims_r = inputs.at(1).shape();
+  if (inputs.at(0).is_weights()) {
+    if (inputs.at(0).shape().d[0] != 1) {
+      return tensorflow::errors::InvalidArgument(
+          "Input 0 as weight assumes broadcast across batch for MatMul, at: " +
+          node_def.name());
+    } else {
+      for (int i = 0; i < dims_l.nbDims - 1; i++) {
+        dims_l.d[i] = dims_l.d[i + 1];
+      }
+      dims_l.nbDims--;
+    }
+  }
+  if (inputs.at(1).is_weights()) {
+    if (inputs.at(1).shape().d[0] != 1) {
+      return tensorflow::errors::InvalidArgument(
+          "Input 1 as weight assumes broadcast across batch for MatMul, at: " +
+          node_def.name());
+    } else {
+      for (int i = 0; i < dims_r.nbDims - 1; i++) {
+        dims_r.d[i] = dims_r.d[i + 1];
+      }
+      dims_r.nbDims--;
+    }
+  }
 
-  if (shape.shape_.nbDims != 1)
-    return tensorflow::errors::InvalidArgument(
-        "reshape new shape is not 1 dimensional, at " + node_def.name());
+  TFTRT_RETURN_ERROR_IF_FALSE(
+      PrepareTensorForShape(ctx, inputs.at(0), dims_l, &tensor_l),
+      node_def.name());
+  TFTRT_RETURN_ERROR_IF_FALSE(
+      PrepareTensorForShape(ctx, inputs.at(1), dims_r, &tensor_r),
+      node_def.name());
 
-  // Only expect to handle INT32 as attributes for now
-  if (padding_type != tensorflow::DataType::DT_INT32)
-    return tensorflow::errors::Unimplemented(
-        "reshape new shape supports only DT_INT32, at " + node_def.name());
+  nvinfer1::IMatrixMultiplyLayer* layer = ctx.network()->addMatrixMultiply(
+      *const_cast<nvinfer1::ITensor*>(tensor_l), transpose_a,
+      *const_cast<nvinfer1::ITensor*>(tensor_r), transpose_b);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
+}
+#endif
 
-  auto shape_data = static_cast<int*>(const_cast<void*>(shape.GetValues()));
+#if NV_TENSORRT_MAJOR > 3
+tensorflow::Status ConvertSoftmax(
+    Converter& ctx, const tensorflow::NodeDef& node_def,
+    const std::vector<TRT_TensorOrWeights>& inputs,
+    std::vector<TRT_TensorOrWeights>* outputs) {
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
 
-  if (shape_data[0] != -1)
+  int nbDims = tensor->getDimensions().nbDims;
+  if (nbDims == 0) {
     return tensorflow::errors::InvalidArgument(
-        "reshape new shape first dimension is not -1, at " + node_def.name());
+        "TensorRT Softmax cannot apply on batch dimension, at" +
+        node_def.name());
+  }
+  nvinfer1::ISoftMaxLayer* layer =
+      ctx.network()->addSoftMax(*const_cast<nvinfer1::ITensor*>(tensor));
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+  // Tensorflow SoftMax assumes applying softmax on the last dimension.
+  layer->setAxes(1 << (nbDims - 1));
 
-  auto shape_num_dims = shape.shape_.d[0];
-  VLOG(2) << "shape dimensions: " << shape_num_dims;
-  int volume_w = 1;
-  for (int i = 1; i < shape.shape_.d[0]; i++) volume_w *= shape_data[i];
+  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  return tensorflow::Status::OK();
+}
+#endif
 
-  int volume_t = 1;
-  for (int i = 0; i < dims.nbDims; i++) volume_t *= dims.d[i];
+#if NV_TENSORRT_MAJOR > 3
+tensorflow::Status ConvertTopK(Converter& ctx,
+                               const tensorflow::NodeDef& node_def,
+                               const std::vector<TRT_TensorOrWeights>& inputs,
+                               std::vector<TRT_TensorOrWeights>* outputs) {
+  const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
 
-  VLOG(2) << "volume: " << volume_t << " volume weights: " << volume_w;
-  if (volume_w != volume_t)
+  int nbDims = tensor->getDimensions().nbDims;
+  if (nbDims == 0) {
     return tensorflow::errors::InvalidArgument(
-        "volume does not agree between tensor and new shape, at " +
-        node_def.name());
+        "TensorRT TopK cannot apply on batch dimension, at" + node_def.name());
+  }
 
-  nvinfer1::IShuffleLayer* layer =
-      ctx.network()->addShuffle(*const_cast<nvinfer1::ITensor*>(tensor));
+  TRT_ShapedWeights k_w = inputs.at(1).weights();
+  int k = *(static_cast<int*>(const_cast<void*>(k_w.GetValues())));
 
-  nvinfer1::Dims reshape_dims;
-  VLOG(2) << "new dimension: " << shape_num_dims - 1;
-  reshape_dims.nbDims = shape_num_dims - 1;
-  for (int32_t i = 0; i < reshape_dims.nbDims; ++i) {
-    reshape_dims.d[i] = shape_data[i + 1];
+  nvinfer1::TopKOperation op;
+  uint32_t reducedAxes = 0;
+  if (node_def.op() == "TopKV2") {
+    op = nvinfer1::TopKOperation::kMAX;
+    reducedAxes |= 1 << (nbDims - 1);
+  } else {
+    return tensorflow::errors::Unimplemented(
+        "Operation: " + node_def.op() +
+        " not implemented, at: " + node_def.name());
   }
-  layer->setReshapeDimensions(reshape_dims);
-  VLOG(2) << "new dimension: " << shape_num_dims - 1;
 
-  nvinfer1::ITensor* output_tensor = layer->getOutput(0);
-  auto dims_output = output_tensor->getDimensions();
-  VLOG(2) << "output tensor dimension:" << dims_output.nbDims;
-  outputs->push_back(TRT_TensorOrWeights(output_tensor));
+  nvinfer1::ITopKLayer* layer = ctx.network()->addTopK(
+      *const_cast<nvinfer1::ITensor*>(tensor), op, k, reducedAxes);
+  TFTRT_RETURN_ERROR_IF_NULLPTR(layer, node_def.name());
+
+  nvinfer1::ITensor* output_value_tensor = layer->getOutput(0);
+  nvinfer1::ITensor* output_indices_tensor = layer->getOutput(1);
+  outputs->push_back(TRT_TensorOrWeights(output_value_tensor));
+  outputs->push_back(TRT_TensorOrWeights(output_indices_tensor));
   return tensorflow::Status::OK();
 }
+#endif
 
 void Converter::register_op_converters() {
   // vgg_16 slim implementation
-  op_registry_["Placeholder"] = ConvertPlaceholder;
   op_registry_["Conv2D"] = ConvertConv2D;
   op_registry_["DepthwiseConv2dNative"] = ConvertConv2DDepthwise;
   op_registry_["Relu"] = ConvertActivation;
   op_registry_["MaxPool"] = ConvertPool;
   op_registry_["AvgPool"] = ConvertPool;
-  // This could be really handled as ConvertBinary
   op_registry_["BiasAdd"] = ConvertScale;
   op_registry_["Const"] = ConvertConst;
   // TODO(ben,jie): this is a temp hack.
@@ -2113,17 +2609,39 @@ void Converter::register_op_converters() {
   op_registry_["Add"] = ConvertBinary;
   op_registry_["Mul"] = ConvertBinary;
   op_registry_["Sub"] = ConvertBinary;
-  op_registry_["Rsqrt"] = ConvertUnary;
-  op_registry_["Mean"] = ConvertReduce;
   op_registry_["Pad"] = ConvertPad;
-  // TODO(ben,jie): Add more ops
 
   op_registry_["ConcatV2"] = ConvertConcat;
-  op_registry_["MatMul"] = ConvertMatMul;
-  op_registry_["Reshape"] = ConvertReshape;
   op_registry_["FusedBatchNorm"] = ConvertFusedBatchNorm;
   op_registry_["FusedBatchNormV2"] = ConvertFusedBatchNorm;
 
+  op_registry_["Div"] = ConvertBinary;
+  op_registry_["RealDiv"] = ConvertBinary;
+
+  op_registry_["Rsqrt"] = ConvertUnary;
+  op_registry_["Reciprocal"] = ConvertUnary;
+  op_registry_["Exp"] = ConvertUnary;
+  op_registry_["Log"] = ConvertUnary;
+  op_registry_["Sqrt"] = ConvertUnary;
+  op_registry_["Abs"] = ConvertUnary;
+  op_registry_["Neg"] = ConvertUnary;
+#if NV_TENSORRT_MAJOR == 3
+  op_registry_["Mean"] = ConvertReducePool;
+#endif
+#if NV_TENSORRT_MAJOR > 3
+  op_registry_["Sum"] = ConvertReduce;
+  op_registry_["Prod"] = ConvertReduce;
+  op_registry_["Max"] = ConvertReduce;
+  op_registry_["Min"] = ConvertReduce;
+  op_registry_["Mean"] = ConvertReduce;
+  op_registry_["Maximum"] = ConvertBinary;
+  op_registry_["Minimum"] = ConvertBinary;
+  op_registry_["Softmax"] = ConvertSoftmax;
+  op_registry_["MatMul"] = ConvertMatMul;
+  op_registry_["BatchMatMul"] = ConvertBatchMatMul;
+  op_registry_["TopKV2"] = ConvertTopK;
+#endif
+
   plugin_converter_ = ConvertPlugin;
 }
 
@@ -2177,25 +2695,22 @@ tensorflow::Status ConvertGraphDefToEngine(
         (node_def.op() == "Placeholder")) {
       nvinfer1::DimsCHW input_dim_pseudo_chw;
       for (int i = 0; i < 8; i++) input_dim_pseudo_chw.d[i] = 0;
-      nvinfer1::DataType dtype(nvinfer1::DataType::kFLOAT);
-      auto type_status =
-          ConvertDType(node_def.attr().at("dtype").type(), &dtype);
-      if (type_status != tensorflow::Status::OK()) {
-        LOG(WARNING) << "Type conversion failed for " << node_name;
-        return type_status;
-      }
       int32 slot_number = -1;
-      if (!tensorflow::strings::safe_strto32(node_name.c_str() + 8,
-                                             &slot_number)) {
-        LOG(ERROR) << "Failed to parse slot number from " << node_name
-                   << " +8= " << node_name.c_str() + 8;
+      if (!tensorflow::strings::safe_strto32(
+              node_name.c_str() + strlen(kInputPHName), &slot_number)) {
+        return tensorflow::errors::InvalidArgument(
+            "Failed to parse slot number from ", node_name);
       }
+      nvinfer1::DataType dtype;
       auto shape = input_shapes.at(slot_number);
-      if (shape.dims() > 8) {
-        LOG(ERROR) << "Tensor rank is greater than 8 for " << node_name
-                   << " at input slot " << slot_number;
-        return tensorflow::errors::OutOfRange(
-            "Input tensor rank is greater than 8");
+      auto status = ValidateInputProperties(
+          shape, node_def.attr().at("dtype").type(), &dtype);
+      if (!status.ok()) {
+        const string error_message =
+            StrCat("Validation failed for ", node_name, " and input slot ",
+                   slot_number, ": ", status.error_message());
+        LOG(WARNING) << error_message;
+        return Status(status.code(), error_message);
       }
       if (VLOG_IS_ON(1)) {
         string dim_str("dims=");
@@ -2226,10 +2741,10 @@ tensorflow::Status ConvertGraphDefToEngine(
     } else if (tensorflow::str_util::StartsWith(node_name, kOutputPHName) &&
                (node_def.op() == "Identity")) {
       int32 slot_number = -1;
-      if (!tensorflow::strings::safe_strto32(node_name.c_str() + 9,
-                                             &slot_number)) {
-        LOG(ERROR) << "Failed to parse slot number from " << node_name
-                   << " +9=" << node_name.c_str() + 9;
+      if (!tensorflow::strings::safe_strto32(
+              node_name.c_str() + strlen(kOutputPHName), &slot_number)) {
+        return tensorflow::errors::InvalidArgument(
+            "Failed to parse slot number from ", node_name);
       }
       if (output_tensors.size() <= slot_number) {
         output_tensors.resize(slot_number + 1);
@@ -2288,38 +2803,20 @@ tensorflow::Status ConvertSegmentToGraphDef(
           "Cannot find node with id ", connection.outside_id, " in the graph.");
     }
     // Updates the shape and data types of input/output connections.
-    tensorflow::DataType input_type = tensorflow::DT_FLOAT;
+    tensorflow::DataType dtype;
     tensorflow::PartialTensorShape partial_shape;
     if (connection.is_input_edge) {
-      if (graph_properties.HasOutputProperties(connection.outside_node_name)) {
-        auto output_params =
-            graph_properties.GetOutputProperties(connection.outside_node_name);
-        auto out_shape = output_params.at(connection.outside_port);
-        input_type = out_shape.dtype();
-        std::vector<tensorflow::int64> dims;
-        partial_shape = out_shape.shape();
-        connection.outside_shape = partial_shape;
-      } else {
-        VLOG(0) << "Unknown output shape" << outside_node->name();
-        input_type = graph->FindNodeId(connection.outside_id)
-                         ->output_type(connection.outside_port);
-      }
-      connection.connection_type = input_type;
-
-    } else {  // output edge
-      if (graph_properties.HasInputProperties(connection.outside_node_name)) {
-        auto input_params =
-            graph_properties.GetInputProperties(connection.outside_node_name);
-        auto in_shape = input_params.at(connection.outside_port);
-        input_type = in_shape.dtype();
-        partial_shape = in_shape.shape();
-        connection.inside_shape = partial_shape;
-      } else {
-        input_type = graph->FindNodeId(connection.inside_id)
-                         ->output_type(connection.outside_port);
-      }
-      connection.connection_type = input_type;
+      GetInputProperties(graph_properties,
+                         graph->FindNodeId(connection.outside_id),
+                         connection.outside_port, &partial_shape, &dtype);
+
+    } else {
+      GetOutputProperties(graph_properties,
+                          graph->FindNodeId(connection.outside_id),
+                          connection.outside_port, &partial_shape, &dtype);
     }
+    connection.outside_shape = partial_shape;
+    connection.connection_type = dtype;
 
     // Add dummy input/output nodes to the segment graphdef.
     if (connection.is_input_edge) {
@@ -2335,7 +2832,7 @@ tensorflow::Status ConvertSegmentToGraphDef(
       auto seg_node = segment_def->add_node();
       tensorflow::NodeDefBuilder builder(node_name, "Placeholder");
       auto status = builder.Attr("shape", partial_shape)
-                        .Attr("dtype", input_type)
+                        .Attr("dtype", dtype)
                         .Finalize(seg_node);
       VLOG(1) << "Constructing input " << node_name << " for the edge "
               << connection.outside_node_name << ":" << connection.outside_port
@@ -2353,7 +2850,7 @@ tensorflow::Status ConvertSegmentToGraphDef(
       marker_nodes.insert(node_name);
       auto seg_node = segment_def->add_node();
       tensorflow::NodeDefBuilder builder(node_name, "Identity");
-      auto status = builder.Input(connection.inside_node_name, 0, input_type)
+      auto status = builder.Input(connection.inside_node_name, 0, dtype)
                         .Finalize(seg_node);
       VLOG(1) << "Constructing output " << node_name << " for the edge "
               << connection.inside_node_name << ":" << connection.inside_port
@@ -2391,6 +2888,38 @@ tensorflow::Status ConvertSegmentToGraphDef(
   return tensorflow::Status::OK();
 }
 
+bool InputEdgeValidator::operator()(const tensorflow::Edge* in_edge) const {
+  if (in_edge->IsControlEdge()) return true;
+  PartialTensorShape shape;
+  tensorflow::DataType dtype;
+  GetInputProperties(graph_properties_, in_edge->src(), in_edge->src_output(),
+                     &shape, &dtype);
+  nvinfer1::DataType trt_dtype;
+  Status status = ValidateInputProperties(shape, dtype, &trt_dtype);
+  if (!status.ok()) {
+    VLOG(2) << "--> Need to remove input node " << in_edge->dst()->name()
+            << ": " << status;
+    return false;
+  }
+  if (shape.dims() < 3 && in_edge->src()->type_string() != "Const") {
+    VLOG(2) << "--> Need to remove input node " << in_edge->dst()->name()
+            << " which has an input at port " << in_edge->dst_input()
+            << " with #dim<3 and is not a const: " << shape;
+    return false;
+  }
+  return true;
+}
+
+bool OutputEdgeValidator::operator()(const tensorflow::Edge* out_edge) const {
+  if (out_edge->IsControlEdge()) return true;
+  if (out_edge->src()->type_string() == "Const") {
+    VLOG(2) << "--> Need to remove output node " << out_edge->src()->name()
+            << " which is a Const.";
+    return false;
+  }
+  return true;
+}
+
 }  // namespace convert
 }  // namespace tensorrt
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/convert/convert_nodes.h b/tensorflow/contrib/tensorrt/convert/convert_nodes.h
index 1a4c0e755d..6a63c9f82f 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_nodes.h
+++ b/tensorflow/contrib/tensorrt/convert/convert_nodes.h
@@ -23,6 +23,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/contrib/tensorrt/convert/utils.h"
+#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
 #include "tensorflow/contrib/tensorrt/resources/trt_allocator.h"
 #include "tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h"
 #include "tensorflow/core/framework/graph.pb.h"
@@ -39,11 +40,6 @@ static const char* kInputPHName = "InputPH_";
 static const char* kOutputPHName = "OutputPH_";
 namespace convert {
 
-// TODO(aaroey): use an enum instead.
-const int FP32MODE = 0;
-const int FP16MODE = 1;
-const int INT8MODE = 2;
-
 struct EngineConnection {
   EngineConnection(const string& outside, int out_id, int out_port,
                    const string& inside, int in_id, int in_port,
@@ -104,6 +100,8 @@ struct EngineInfo {
 //   topological order.
 // - segment_def: the output GraphDef, whose non-input/output nodedefs will be
 //   sorted in topological order.
+//
+// TODO(aaroey): add tests to validate these properties.
 tensorflow::Status ConvertSegmentToGraphDef(
     const tensorflow::Graph* graph,
     const tensorflow::grappler::GraphProperties& graph_properties,
@@ -128,6 +126,30 @@ tensorflow::Status ConvertGraphDefToEngine(
     TrtUniquePtrType<nvinfer1::ICudaEngine>* engine,
     bool* convert_successfully);
 
+// Helper class for the segmenter to determine whether an input edge to the TRT
+// segment is valid.
+class InputEdgeValidator {
+ public:
+  InputEdgeValidator(const grappler::GraphProperties& graph_properties)
+      : graph_properties_(graph_properties) {}
+
+  // Return true if the specified edge is eligible to be an input edge of the
+  // TRT segment.
+  bool operator()(const tensorflow::Edge* in_edge) const;
+
+ private:
+  const grappler::GraphProperties& graph_properties_;
+};
+
+// Helper class for the segmenter to determine whether an output edge from the
+// TRT segment is valid.
+class OutputEdgeValidator {
+ public:
+  // Return true if the specified edge is eligible to be an output edge of the
+  // TRT segment.
+  bool operator()(const tensorflow::Edge* out_edge) const;
+};
+
 }  // namespace convert
 }  // namespace tensorrt
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc b/tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc
index ec9dbfa13b..044c736c03 100644
--- a/tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc
+++ b/tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc
@@ -17,6 +17,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/clusters/cluster.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
@@ -232,8 +233,25 @@ tensorflow::Status TRTOptimizationPass::Optimize(
   tensorflow::grappler::GraphProperties static_graph_properties(item);
   TF_RETURN_IF_ERROR(static_graph_properties.InferStatically(true));
   tensorflow::tensorrt::convert::ConversionParams cp;
+
+  std::vector<string> nodes_to_preserve;
+  for (const auto& n : item.NodesToPreserve()) {
+    auto tokens = str_util::Split(n, ":");
+    string s = tokens.at(0);
+    for (int i = 1; i < tokens.size() - 1; ++i) {
+      StrAppend(&s, ":", tokens.at(i));
+    }
+    int dumm_port = -1;
+    // If the last token is not an integer, it must be part of the name.
+    // Otherwise it is port number.
+    if (tokens.size() > 1 &&
+        !strings::safe_strto32(tokens.back(), &dumm_port)) {
+      StrAppend(&s, ":", tokens.back());
+    }
+    nodes_to_preserve.push_back(s);
+  }
   cp.input_graph_def = &item.graph;
-  cp.output_names = &item.fetch;
+  cp.output_names = &nodes_to_preserve;
   cp.max_batch_size = maximum_batch_size_;
   cp.max_workspace_size_bytes = maximum_workspace_size_;
   cp.output_graph_def = optimized_graph;
diff --git a/tensorflow/contrib/tensorrt/convert/utils.cc b/tensorflow/contrib/tensorrt/convert/utils.cc
new file mode 100644
index 0000000000..e7a1febb8c
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/convert/utils.cc
@@ -0,0 +1,69 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/convert/utils.h"
+
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+bool IsGoogleTensorRTEnabled() {
+  // TODO(laigd): consider also checking if tensorrt shared libraries are
+  // accessible. We can then direct users to this function to make sure they can
+  // safely write code that uses tensorrt conditionally. E.g. if it does not
+  // check for for tensorrt, and user mistakenly uses tensorrt, they will just
+  // crash and burn.
+#if GOOGLE_CUDA && GOOGLE_TENSORRT
+  return true;
+#else
+  return false;
+#endif
+}
+
+Status GetPrecisionModeName(const int precision_mode, string* name) {
+  switch (precision_mode) {
+    case FP32MODE:
+      *name = "FP32";
+      break;
+    case FP16MODE:
+      *name = "FP16";
+      break;
+    case INT8MODE:
+      *name = "INT8";
+      break;
+    default:
+      return tensorflow::errors::OutOfRange("Unknown precision mode");
+  }
+  return Status::OK();
+}
+
+Status GetPrecisionMode(const string& name, int* precision_mode) {
+  if (name == "FP32") {
+    *precision_mode = FP32MODE;
+  } else if (name == "FP16") {
+    *precision_mode = FP16MODE;
+  } else if (name == "INT8") {
+    *precision_mode = INT8MODE;
+  } else {
+    return tensorflow::errors::InvalidArgument("Invalid precision mode name: ",
+                                               name);
+  }
+  return Status::OK();
+}
+
+}  // namespace tensorrt
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/convert/utils.h b/tensorflow/contrib/tensorrt/convert/utils.h
index f601c06701..0592f31462 100644
--- a/tensorflow/contrib/tensorrt/convert/utils.h
+++ b/tensorflow/contrib/tensorrt/convert/utils.h
@@ -18,6 +18,8 @@ limitations under the License.
 
 #include <memory>
 
+#include "tensorflow/core/lib/core/status.h"
+
 namespace tensorflow {
 namespace tensorrt {
 
@@ -31,6 +33,17 @@ struct TrtDestroyer {
 template <typename T>
 using TrtUniquePtrType = std::unique_ptr<T, TrtDestroyer<T>>;
 
+bool IsGoogleTensorRTEnabled();
+
+// TODO(aaroey): use an enum instead.
+const int FP32MODE = 0;
+const int FP16MODE = 1;
+const int INT8MODE = 2;
+
+Status GetPrecisionModeName(const int precision_mode, string* name);
+
+Status GetPrecisionMode(const string& name, int* precision_mode);
+
 }  // namespace tensorrt
 }  // namespace tensorflow
 
diff --git a/tensorflow/contrib/tensorrt/custom_plugin_examples/BUILD b/tensorflow/contrib/tensorrt/custom_plugin_examples/BUILD
index a89cf3ab8b..69058c5826 100644
--- a/tensorflow/contrib/tensorrt/custom_plugin_examples/BUILD
+++ b/tensorflow/contrib/tensorrt/custom_plugin_examples/BUILD
@@ -112,7 +112,9 @@ cuda_py_test(
     ],
     tags = [
         "manual",
+        "no_windows",
         "noguitar",
+        "nomac",
         "notap",
     ],
 )
diff --git a/tensorflow/contrib/tensorrt/custom_plugin_examples/inc_op_kernel.cu.cc b/tensorflow/contrib/tensorrt/custom_plugin_examples/inc_op_kernel.cu.cc
index 988b35f74f..2de7973750 100644
--- a/tensorflow/contrib/tensorrt/custom_plugin_examples/inc_op_kernel.cu.cc
+++ b/tensorflow/contrib/tensorrt/custom_plugin_examples/inc_op_kernel.cu.cc
@@ -65,7 +65,7 @@ class IncPluginTRT : public OpKernel {
         reinterpret_cast<const cudaStream_t*>(context->op_device_context()
                                                   ->stream()
                                                   ->implementation()
-                                                  ->CudaStreamMemberHack()));
+                                                  ->GpuStreamMemberHack()));
     IncrementKernel(input_tensor.flat<float>().data(), inc_,
                     output_tensor->flat<float>().data(),
                     input_shape.num_elements(), *stream);
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
index 8a17eb02f1..6851f79ef6 100644
--- a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
+++ b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
@@ -15,9 +15,11 @@ limitations under the License.
 #include "tensorflow/contrib/tensorrt/kernels/trt_engine_op.h"
 
 #include <algorithm>
+
 #include "tensorflow/contrib/tensorrt/convert/convert_nodes.h"
 #include "tensorflow/contrib/tensorrt/convert/utils.h"
 #include "tensorflow/contrib/tensorrt/log/trt_logger.h"
+#include "tensorflow/contrib/tensorrt/plugin/trt_plugin_factory.h"
 #include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
 #include "tensorflow/contrib/tensorrt/resources/trt_resources.h"
 #include "tensorflow/core/framework/graph_to_functiondef.h"
@@ -43,11 +45,11 @@ using ::tensorflow::strings::StrCat;
 // Helps simultaneous execution of native and TRT engines.
 class AsyncHelper : public tensorflow::core::RefCounted {
  public:
-  AsyncHelper(tensorflow::AsyncOpKernel::DoneCallback done) { done_ = done; }
+  AsyncHelper(AsyncOpKernel::DoneCallback done) { done_ = done; }
   ~AsyncHelper() override { done_(); }
 
  private:
-  tensorflow::AsyncOpKernel::DoneCallback done_;
+  AsyncOpKernel::DoneCallback done_;
 };
 
 #define TYPECASE(dt, X, Y)                                                \
@@ -120,15 +122,9 @@ TRTEngineOp::TRTEngineOp(OpKernelConstruction* context)
                  context->GetAttr("calibration_data", &calibration_data));
   OP_REQUIRES_OK(context,
                  context->GetAttr("segment_funcdef_name", &funcdef_name_));
-  if (precision_string == "FP32") {
-    precision_mode_ = convert::FP32MODE;
-  } else if (precision_string == "FP16") {
-    precision_mode_ = convert::FP16MODE;
-  } else if (precision_string == "INT8") {
-    precision_mode_ = convert::INT8MODE;
-  }
+  OP_REQUIRES_OK(context, GetPrecisionMode(precision_string, &precision_mode_));
   calibration_mode_ =
-      (precision_mode_ == convert::INT8MODE && calibration_data.size() == 0);
+      (precision_mode_ == INT8MODE && calibration_data.size() == 0);
   if (calibration_data.size()) {
     calibrator_.reset(new TRTInt8Calibrator(calibration_data));
     calibration_data.resize(0);
@@ -150,7 +146,7 @@ TRTEngineOp::TRTEngineOp(OpKernelConstruction* context)
   }
 }
 
-void TRTEngineOp::ExecuteNativeSegment(tensorflow::OpKernelContext* ctx,
+void TRTEngineOp::ExecuteNativeSegment(OpKernelContext* ctx,
                                        AsyncHelper* helper) {
   if (!calibration_mode_) {
     VLOG(1) << "Executing native engine";
@@ -191,7 +187,7 @@ void TRTEngineOp::ExecuteNativeSegment(tensorflow::OpKernelContext* ctx,
            });
 }
 
-void TRTEngineOp::ExecuteCalibration(tensorflow::OpKernelContext* ctx,
+void TRTEngineOp::ExecuteCalibration(OpKernelContext* ctx,
                                      AsyncHelper* helper) {
   helper->Ref();
   tensorflow::core::ScopedUnref sc(helper);
@@ -230,13 +226,13 @@ void TRTEngineOp::ExecuteCalibration(tensorflow::OpKernelContext* ctx,
       reinterpret_cast<const cudaStream_t*>(ctx->op_device_context()
                                                 ->stream()
                                                 ->implementation()
-                                                ->CudaStreamMemberHack()));
+                                                ->GpuStreamMemberHack()));
   calib_res->calibrator_->setBatch(input_data, *stream);
   VLOG(2) << "Passed calibration data";
   ExecuteNativeSegment(ctx, helper);
 }
 
-int TRTEngineOp::GetEngineBatch(tensorflow::OpKernelContext* ctx) {
+int TRTEngineOp::GetEngineBatch(OpKernelContext* ctx) {
   int num_batch = ctx->input(0).shape().dim_size(0);
   int smallest_engine = 0;
   for (const auto i : cached_engine_batches_) {
@@ -252,21 +248,20 @@ int TRTEngineOp::GetEngineBatch(tensorflow::OpKernelContext* ctx) {
       cached_engine_batches_.push_back(num_batch);
       VLOG(1) << "Running with batch size " << num_batch;
     } else {
-      string s("Engine buffer is full. buffer limit= ");
-      StrAppend(&s, max_cached_engines_, ", current entries= ");
-      for (auto i : cached_engine_batches_) StrAppend(&s, i, ", ");
-      StrAppend(&s, "Requested batch= ", num_batch);
-      LOG(ERROR) << s;
-      ctx->SetStatus(tensorflow::errors::ResourceExhausted(
-          "Requested batch size is not available and engine cache is full"));
+      string msg =
+          StrCat("Engine buffer is full. buffer limit=", max_cached_engines_,
+                 ", current entries=");
+      for (auto i : cached_engine_batches_) StrAppend(&msg, i, ",");
+      StrAppend(&msg, "Requested batch=", num_batch);
+      LOG(WARNING) << msg;
       return -1;
     }
   }
   return smallest_engine;
 }
 
-void TRTEngineOp::ComputeAsync(tensorflow::OpKernelContext* ctx,
-                               tensorflow::AsyncOpKernel::DoneCallback done) {
+void TRTEngineOp::ComputeAsync(OpKernelContext* ctx,
+                               AsyncOpKernel::DoneCallback done) {
   auto helper = new AsyncHelper(done);
   tensorflow::core::ScopedUnref sc(helper);
   if (calibration_mode_) {
@@ -274,32 +269,52 @@ void TRTEngineOp::ComputeAsync(tensorflow::OpKernelContext* ctx,
     return;
   }
   const int smallest_engine = GetEngineBatch(ctx);
-  if (smallest_engine < 0) return;  // GetEngineBatch already set the status.
+  if (smallest_engine < 0) {
+    LOG(WARNING) << "Failed to get engine batch, running native segment";
+    ExecuteNativeSegment(ctx, helper);
+    return;
+  }
 
   const int num_batch = ctx->input(0).shape().dim_size(0);
   auto& engine_ctx_pair = GetEngine(smallest_engine, ctx);
   auto& trt_engine_ptr = engine_ctx_pair.first;
   if (!trt_engine_ptr) {
     LOG(WARNING) << "Engine retrieval for batch size " << num_batch
-                 << " failed Running native segment";
+                 << " failed. Running native segment";
+    ExecuteNativeSegment(ctx, helper);
+    return;
+  }
+  const bool retry = ExecuteTrtEngine(ctx, num_batch, trt_engine_ptr.get(),
+                                      engine_ctx_pair.second.get());
+  if (retry) {
+    LOG(WARNING) << "Failed to execute engine, retrying with native segment";
     ExecuteNativeSegment(ctx, helper);
     return;
   }
+}
 
+bool TRTEngineOp::ExecuteTrtEngine(
+    OpKernelContext* ctx, const int num_batch,
+    nvinfer1::ICudaEngine* trt_engine_ptr,
+    nvinfer1::IExecutionContext* trt_execution_context_ptr) {
+  const bool kRetry = true;
   const int num_binding = ctx->num_inputs() + ctx->num_outputs();
   std::vector<void*> buffers(num_binding);
   for (int i = 0; i < ctx->num_inputs(); i++) {
-    const string inp_name = StrCat(kInputPHName, i);
+    const string input_name = StrCat(kInputPHName, i);
     const size_t binding_index =
-        trt_engine_ptr->getBindingIndex(inp_name.c_str());
+        trt_engine_ptr->getBindingIndex(input_name.c_str());
+    if (binding_index == -1) {
+      LOG(ERROR) << "Input node not found, at " << input_name;
+      return kRetry;
+    }
 
     const Tensor& input_tensor = ctx->input(i);
     const TensorShape& input_shape = input_tensor.shape();
     if (num_batch != input_shape.dim_size(0)) {
-      LOG(ERROR) << "input data inconsistent batch size";
-      ctx->SetStatus(tensorflow::errors::FailedPrecondition(
-          "Different batch sizes between input tensors"));
-      return;
+      LOG(ERROR) << "Input data has inconsistent batch size: " << num_batch
+                 << " vs " << input_shape.dim_size(0);
+      return kRetry;
     }
     auto dtype = trt_engine_ptr->getBindingDataType(binding_index);
     switch (dtype) {
@@ -308,19 +323,18 @@ void TRTEngineOp::ComputeAsync(tensorflow::OpKernelContext* ctx,
         break;
       case nvinfer1::DataType::kHALF:
         LOG(ERROR) << "FP16 inputs are not supported yet!";
-        ctx->SetStatus(tensorflow::errors::InvalidArgument(
-            "FP16 inputs are not supported!"));
-        return;
+        return kRetry;
       case nvinfer1::DataType::kINT8:
         LOG(ERROR) << "INT8 inputs are not supported yet!";
-        ctx->SetStatus(tensorflow::errors::InvalidArgument(
-            "INT8 inputs are not supported!"));
-        return;
+        return kRetry;
+#if NV_TENSORRT_MAJOR > 3
+      case nvinfer1::DataType::kINT32:
+        buffers[binding_index] = (void*)(input_tensor.flat<int32>().data());
+        break;
+#endif
       default:
         LOG(ERROR) << "Unknown TRT data type: " << int(dtype);
-        ctx->SetStatus(tensorflow::errors::InvalidArgument(
-            "Unknown output TRT data type! ", static_cast<int>(dtype)));
-        return;
+        return kRetry;
     }
   }
 
@@ -337,20 +351,23 @@ void TRTEngineOp::ComputeAsync(tensorflow::OpKernelContext* ctx,
       std::vector<int> trt_shape(dims.nbDims + 1);
       trt_shape[0] = num_batch;
       for (int j = 0; j < dims.nbDims; j++) trt_shape[j + 1] = dims.d[j];
-      OP_REQUIRES_OK(
-          ctx, TensorShapeUtils::MakeShape(trt_shape.data(), trt_shape.size(),
-                                           &output_shape));
+      auto status = TensorShapeUtils::MakeShape(
+          trt_shape.data(), trt_shape.size(), &output_shape);
+      if (!status.ok()) {
+        LOG(ERROR) << "Failed to get output shape: " << status;
+        return kRetry;
+      }
     } else {
-      LOG(ERROR) << "output node not found, at " << output_name;
-      ctx->SetStatus(tensorflow::errors::Internal("output ", output_name,
-                                                  " couldn't be found!"));
-      return;
+      LOG(ERROR) << "Output node not found, at " << output_name;
+      return kRetry;
     }
     auto status = ctx->allocate_output(i, output_shape, &output_tensor);
     if (!status.ok()) {
       LOG(ERROR) << "Allocating output failed with " << status;
       ctx->SetStatus(status);
-      return;
+      // Do not retry since we cannot allocate the same output twice.
+      // TODO(aaroey): ideally we should retry, fix this.
+      return !kRetry;
     }
     auto dtype = trt_engine_ptr->getBindingDataType(binding_index);
     switch (dtype) {
@@ -359,39 +376,38 @@ void TRTEngineOp::ComputeAsync(tensorflow::OpKernelContext* ctx,
             reinterpret_cast<void*>(output_tensor->flat<float>().data());
         break;
       case nvinfer1::DataType::kHALF:
-        LOG(ERROR) << "half size is not supported yet!";
-        ctx->SetStatus(tensorflow::errors::InvalidArgument(
-            "Half outputs are not supported!"));
-        return;
+        LOG(WARNING) << "half size is not supported yet!";
+        return kRetry;
       case nvinfer1::DataType::kINT8:
-        LOG(ERROR) << "int8 is not supported yet!";
-        ctx->SetStatus(tensorflow::errors::InvalidArgument(
-            "INT8 outputs are not supported!"));
-        return;
+        LOG(WARNING) << "int8 is not supported yet!";
+        return kRetry;
+#if NV_TENSORRT_MAJOR > 3
+      case nvinfer1::DataType::kINT32:
+        buffers[binding_index] =
+            reinterpret_cast<void*>(output_tensor->flat<int32>().data());
+        break;
+#endif
       default:
-        LOG(ERROR) << "Unknown TRT data type: " << static_cast<int>(dtype);
-        ctx->SetStatus(tensorflow::errors::InvalidArgument(
-            "Unsupported output data type! ", static_cast<int>(dtype)));
-        return;
+        LOG(WARNING) << "Unknown TRT data type: " << static_cast<int>(dtype);
+        return kRetry;
     }
   }
-  // copied from cuda_kernel_helper since it seems only valid in *.cu.cc files
+  // Copied from cuda_kernel_helper since it seems only valid in *.cu.cc files
   const cudaStream_t* stream = CHECK_NOTNULL(
       reinterpret_cast<const cudaStream_t*>(ctx->op_device_context()
                                                 ->stream()
                                                 ->implementation()
-                                                ->CudaStreamMemberHack()));
+                                                ->GpuStreamMemberHack()));
 
   // TODO(jie): trt enqueue does not return error
-  auto& trt_execution_context_ptr = engine_ctx_pair.second;
   auto ret = trt_execution_context_ptr->enqueue(num_batch, &buffers[0], *stream,
                                                 nullptr);
   if (!ret) {
-    LOG(ERROR) << "Failed to enqueue batch for TRT engine: " << name();
-    ctx->SetStatus(tensorflow::errors::Internal(
-        "Failed to enqueue batch for TRT engine: ", name()));
+    LOG(WARNING) << "Failed to enqueue batch for TRT engine: " << name();
+    return kRetry;
   }
-  // sync should be done by TF.
+  // Synchronization will be done by TF.
+  return !kRetry;
 }
 
 TRTEngineOp::~TRTEngineOp() {
@@ -411,8 +427,6 @@ nvinfer1::IGpuAllocator* TRTEngineOp::GetAllocator(OpKernelContext* ctx) {
   if (!alloc) {
     LOG(ERROR) << "Can't find device allocator for gpu device "
                << device->name();
-    ctx->SetStatus(tensorflow::errors::Internal(
-        "Can't get device allocator for device ", device->name()));
     return nullptr;
   }
   allocator_.reset(new TRTDeviceAllocator(alloc));
@@ -439,14 +453,14 @@ TRTEngineOp::EngineCtxPair& TRTEngineOp::GetEngine(int batch_size,
 #if NV_TENSORRT_MAJOR > 3
     auto allocator = GetAllocator(ctx);
     if (allocator == nullptr) {
-      // GetAllocator already set the Status.
       return null_pair;
     }
     infer->setGpuAllocator(allocator);
 #endif
     TrtUniquePtrType<nvinfer1::ICudaEngine> static_engine(
         infer->deserializeCudaEngine(serialized_segment_.c_str(),
-                                     serialized_segment_.size(), nullptr));
+                                     serialized_segment_.size(),
+                                     PluginFactoryTensorRT::GetInstance()));
     auto raw_static_engine = static_engine.get();
     const auto max_batch_size = raw_static_engine->getMaxBatchSize();
     engine_map_[max_batch_size] = {
@@ -455,7 +469,9 @@ TRTEngineOp::EngineCtxPair& TRTEngineOp::GetEngine(int batch_size,
             raw_static_engine->createExecutionContext())};
     // Runtime is safe to delete after engine creation
     serialized_segment_.clear();
-    if (max_batch_size < batch_size) return null_pair;
+    if (max_batch_size < batch_size) {
+      return null_pair;
+    }
     return engine_map_.at(max_batch_size);
   }  // static_engine_
 
@@ -467,7 +483,6 @@ TRTEngineOp::EngineCtxPair& TRTEngineOp::GetEngine(int batch_size,
 #if NV_TENSORRT_MAJOR > 3
     allocator = GetAllocator(ctx);
     if (allocator == nullptr) {
-      // GetAllocator already set the Status.
       return null_pair;
     }
 #endif
@@ -491,9 +506,8 @@ TRTEngineOp::EngineCtxPair& TRTEngineOp::GetEngine(int batch_size,
         // retry in the future.
         engine_map_[batch_size] = {nullptr, nullptr};
       }
-      LOG(ERROR) << "Engine creation for batch size " << batch_size
-                 << " failed " << status;
-      ctx->SetStatus(tensorflow::errors::Internal("Engine creation failed!"));
+      LOG(WARNING) << "Engine creation for batch size " << batch_size
+                   << " failed " << status;
       return null_pair;
     }
     VLOG(1) << "Conversion is done";
@@ -505,7 +519,7 @@ TRTEngineOp::EngineCtxPair& TRTEngineOp::GetEngine(int batch_size,
 }
 
 tensorflow::Status TRTEngineOp::AllocateCalibrationResources(
-    tensorflow::OpKernelContext* ctx, TRTCalibrationResource** cr) {
+    OpKernelContext* ctx, TRTCalibrationResource** cr) {
   auto cres = new TRTCalibrationResource();
   *cr = cres;
   // Get the allocator.
@@ -569,7 +583,7 @@ tensorflow::Status TRTEngineOp::AllocateCalibrationResources(
     // TODO(aaroey): maybe setting the max batch size using the python
     // calibration wrapper class.
     auto s = convert::ConvertGraphDefToEngine(
-        *segment_graph, convert::INT8MODE, cres->calibrator_->getBatchSize(),
+        *segment_graph, INT8MODE, cres->calibrator_->getBatchSize(),
         workspace_size_bytes, shapes, &cres->logger_, cres->allocator_.get(),
         cres->calibrator_.get(), &cres->engine_,
         /*convert_successfully=*/nullptr);
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.h b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.h
index 6fe318be6a..59b744e6d3 100644
--- a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.h
+++ b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.h
@@ -60,6 +60,12 @@ class TRTEngineOp : public AsyncOpKernel {
   // Execute replaced native segment as function Op.
   void ExecuteNativeSegment(OpKernelContext* ctx, AsyncHelper* helper);
 
+  // Execute the tensorrt engine. Returns whether we need to retry by running
+  // the native segment.
+  bool ExecuteTrtEngine(OpKernelContext* ctx, const int num_batch,
+                        nvinfer1::ICudaEngine* trt_engine_ptr,
+                        nvinfer1::IExecutionContext* trt_execution_context_ptr);
+
   // Allocate necessary resources for calibration
   Status AllocateCalibrationResources(OpKernelContext* ctx,
                                       TRTCalibrationResource** cr);
@@ -81,7 +87,7 @@ class TRTEngineOp : public AsyncOpKernel {
   std::vector<string> output_nodes_;
 
   // keep device allocator for TRT.
-  std::unique_ptr<TRTDeviceAllocator> allocator_;
+  std::unique_ptr<TRTBaseAllocator> allocator_;
 
   // serialized protobuf segment or trt engine depending on static_engine_ flag.
   string serialized_segment_;
diff --git a/tensorflow/contrib/tensorrt/ops/trt_engine_op.cc b/tensorflow/contrib/tensorrt/ops/trt_engine_op.cc
index 383635f428..e0c7b62723 100644
--- a/tensorflow/contrib/tensorrt/ops/trt_engine_op.cc
+++ b/tensorflow/contrib/tensorrt/ops/trt_engine_op.cc
@@ -42,8 +42,14 @@ REGISTER_OP("TRTEngineOp")
     .Attr("precision_mode: {'FP32', 'FP16', 'INT8', 'INT8CALIB'}")
     .Attr("calibration_data: string = ''")
     .Input("in_tensor: InT")
-    .Output("out_tensor: OutT")
-    .SetShapeFn(shape_inference::TRTEngineOpShapeInference);
+    .Output("out_tensor: OutT");
+// TODO(jie): TF requires concrete output shape for concrete input shapes.
+// This is tricky for batch dimension, since we cannot ensure which input
+// would carry the correct batch dimension (for the current stage of the
+// implementation, we do require all input tensor to carry the same batch
+// size, but this could change in the future). Hence we disable shape
+// inference function as a workaround.
+// .SetShapeFn(shape_inference::TRTEngineOpShapeInference);
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/contrib/tensorrt/plugin/trt_plugin_factory.cc b/tensorflow/contrib/tensorrt/plugin/trt_plugin_factory.cc
index 2bc591484d..cccc912262 100644
--- a/tensorflow/contrib/tensorrt/plugin/trt_plugin_factory.cc
+++ b/tensorflow/contrib/tensorrt/plugin/trt_plugin_factory.cc
@@ -65,9 +65,6 @@ bool PluginFactoryTensorRT::RegisterPlugin(
 
 void PluginFactoryTensorRT::DestroyPlugins() {
   tensorflow::mutex_lock lock(instance_m_);
-  for (auto& owned_plugin_ptr : owned_plugins_) {
-    owned_plugin_ptr.release();
-  }
   owned_plugins_.clear();
 }
 
diff --git a/tensorflow/contrib/tensorrt/python/__init__.py b/tensorflow/contrib/tensorrt/python/__init__.py
index 0b2321b5fc..fe4fa166a1 100644
--- a/tensorflow/contrib/tensorrt/python/__init__.py
+++ b/tensorflow/contrib/tensorrt/python/__init__.py
@@ -22,4 +22,5 @@ from __future__ import print_function
 from tensorflow.contrib.tensorrt.python.ops import trt_engine_op
 from tensorflow.contrib.tensorrt.python.trt_convert import calib_graph_to_infer_graph
 from tensorflow.contrib.tensorrt.python.trt_convert import create_inference_graph
+from tensorflow.contrib.tensorrt.python.trt_convert import is_tensorrt_enabled
 # pylint: enable=unused-import,line-too-long
diff --git a/tensorflow/contrib/tensorrt/python/trt_convert.py b/tensorflow/contrib/tensorrt/python/trt_convert.py
index 79f512dbcf..c696a8b1f0 100644
--- a/tensorflow/contrib/tensorrt/python/trt_convert.py
+++ b/tensorflow/contrib/tensorrt/python/trt_convert.py
@@ -23,22 +23,19 @@ import six as _six
 from tensorflow.contrib.tensorrt.wrap_conversion import calib_convert
 from tensorflow.contrib.tensorrt.wrap_conversion import get_linked_tensorrt_version
 from tensorflow.contrib.tensorrt.wrap_conversion import get_loaded_tensorrt_version
-from tensorflow.contrib.tensorrt.wrap_conversion import trt_convert
+from tensorflow.contrib.tensorrt.wrap_conversion import is_tensorrt_enabled
 from tensorflow.core.framework import graph_pb2
+from tensorflow.core.protobuf import meta_graph_pb2
 from tensorflow.core.protobuf import rewriter_config_pb2
-from tensorflow.python.framework import errors
 from tensorflow.python.framework import errors_impl as _impl
-from tensorflow.python.framework import meta_graph
+from tensorflow.python.framework import importer
 from tensorflow.python.framework import ops
 from tensorflow.python.grappler import tf_optimizer
 from tensorflow.python.platform import tf_logging
-from tensorflow.python.util import compat
-
+from tensorflow.python.training import saver
 # pylint: enable=unused-import,line-too-long
 
 
-# TODO(skama): get outputs from session when implemented as c++
-# optimization pass
 def create_inference_graph(input_graph_def,
                            outputs,
                            max_batch_size=1,
@@ -47,7 +44,7 @@ def create_inference_graph(input_graph_def,
                            minimum_segment_size=3,
                            is_dynamic_op=False,
                            maximum_cached_engines=1,
-                           cached_engine_batches=[]):
+                           cached_engine_batches=None):
   """Python wrapper for the TRT transformation.
 
   Args:
@@ -86,8 +83,7 @@ def create_inference_graph(input_graph_def,
         (".".join([str(x) for x in compiled_version]),
          ".".join([str(x) for x in loaded_version])) +
         ". Please make sure that correct version of TensorRT " +
-        "is available in the system and added to ldconfig or LD_LIBRARY_PATH"
-    )
+        "is available in the system and added to ldconfig or LD_LIBRARY_PATH")
     raise RuntimeError("Incompatible TensorRT library version")
   for i in zip(loaded_version, compiled_version):
     if i[0] != i[1]:
@@ -120,41 +116,42 @@ def create_inference_graph(input_graph_def,
     to_bytes = py3bytes
     to_string = py3string
 
-  out_names = []
-  for i in outputs:
-    if isinstance(i, ops.Tensor):
-      out_names.append(to_bytes(i.name))
-    else:
-      out_names.append(to_bytes(i))
-
-  input_graph_def_str = input_graph_def.SerializeToString()
-
-  # TODO(sami): Fix this when we can return status from C++ library
-  # There is a problem with the TF internal library setup that doesn't
-  # allow us to return a status object from C++.  Thus we return a
-  # pair or strings where first one is encoded status and the second
-  # one is the transformed graphs protobuf string.
-  out = trt_convert(input_graph_def_str, out_names, max_batch_size,
-                    max_workspace_size_bytes, mode, minimum_segment_size,
-                    is_dynamic_op, maximum_cached_engines,
-                    cached_engine_batches)
-  status = to_string(out[0])
-  output_graph_def_string = out[1]
-  del input_graph_def_str  # Save some memory
-  if len(status) < 2:
-    raise _impl.UnknownError(None, None, status)
-  if status[:2] != "OK":
-    msg = status.split(";")
-    if len(msg) == 1:
-      raise RuntimeError("Status message is malformed {}".format(status))
-    # pylint: disable=protected-access
-    raise _impl._make_specific_exception(None, None, ";".join(msg[1:]),
-                                         int(msg[0]))
-    # pylint: enable=protected-access
-  output_graph_def = graph_pb2.GraphDef()
-  output_graph_def.ParseFromString(output_graph_def_string)
-  del output_graph_def_string  # Save some memory
-  return output_graph_def
+  # Create MetaGraphDef
+  graph = ops.Graph()
+  with graph.as_default():
+    importer.import_graph_def(input_graph_def, name="")
+  meta_graph = saver.export_meta_graph(
+      graph_def=graph.as_graph_def(), graph=graph)
+  if outputs:
+    output_collection = meta_graph_pb2.CollectionDef()
+    output_list = output_collection.node_list.value
+    for i in outputs:
+      if isinstance(i, ops.Tensor):
+        output_list.append(to_bytes(i.name))
+      else:
+        output_list.append(to_bytes(i))
+    meta_graph.collection_def["train_op"].CopyFrom(output_collection)
+
+  # Create RewriterConfig.
+  rewriter_cfg = rewriter_config_pb2.RewriterConfig()
+  rewriter_cfg.optimizers.extend(["constfold", "layout"])
+  optimizer = rewriter_cfg.custom_optimizers.add()
+  optimizer.name = "TensorRTOptimizer"
+  optimizer.parameter_map["minimum_segment_size"].i = minimum_segment_size
+  optimizer.parameter_map["max_batch_size"].i = max_batch_size
+  optimizer.parameter_map["is_dynamic_op"].b = is_dynamic_op
+  optimizer.parameter_map[
+      "max_workspace_size_bytes"].i = max_workspace_size_bytes
+  optimizer.parameter_map["precision_mode"].s = to_bytes(precision_mode)
+  optimizer.parameter_map["maximum_cached_engines"].i = maximum_cached_engines
+  if cached_engine_batches:
+    if not isinstance(cached_engine_batches, list):
+      raise TypeError("cached_engine_batches should be a list.")
+    optimizer.parameter_map["cached_engine_batches"].list.i.extend(
+        cached_engine_batches)
+
+  return tf_optimizer.OptimizeGraph(
+      rewriter_cfg, meta_graph, graph_id=b"tf_graph")
 
 
 def calib_graph_to_infer_graph(calibration_graph_def, is_dynamic_op=False):
diff --git a/tensorflow/contrib/tensorrt/resources/trt_allocator.cc b/tensorflow/contrib/tensorrt/resources/trt_allocator.cc
index 9f115990c3..d8f97bfbbc 100644
--- a/tensorflow/contrib/tensorrt/resources/trt_allocator.cc
+++ b/tensorflow/contrib/tensorrt/resources/trt_allocator.cc
@@ -19,12 +19,42 @@ limitations under the License.
 
 #if GOOGLE_CUDA
 #if GOOGLE_TENSORRT
+#include "cuda/include/cuda_runtime_api.h"
+#endif  // GOOGLE_TENSORRT
+#endif  // GOOGLE_CUDA
+
+namespace tensorflow {
+namespace tensorrt {
+
+// std::align is not supported, so this method mimic its behavior.
+void* Align(size_t alignment, size_t size, void*& ptr, size_t& space) {
+  QCHECK_GT(alignment, 0) << "alignment must be greater than 0.";
+  QCHECK_EQ(0, alignment & (alignment - 1)) << "Alignment must be power of 2.";
+  QCHECK_GT(size, 0) << "size must be greater than 0.";
+  QCHECK(ptr) << "ptr must not be nullptr.";
+  QCHECK_GT(space, 0) << "space must be greater than 0.";
+  const uintptr_t ptr_val = reinterpret_cast<uintptr_t>(ptr);
+  QCHECK_GE(ptr_val + space, ptr_val) << "Provided space overflows.";
 
+  if (size > space) return nullptr;
+  const uintptr_t aligned_ptr_val = ((ptr_val + alignment - 1) & -alignment);
+  if (aligned_ptr_val > ptr_val + space - size) return nullptr;
+  ptr = reinterpret_cast<void*>(aligned_ptr_val);
+  const uintptr_t diff = aligned_ptr_val - ptr_val;
+  space -= diff;
+  return ptr;
+}
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
 #if NV_TENSORRT_MAJOR > 2
-#include "cuda/include/cuda_runtime_api.h"
 
 namespace tensorflow {
 namespace tensorrt {
+
 void* TRTCudaAllocator::allocate(uint64_t size, uint64_t alignment,
                                  uint32_t flags) {
   assert((alignment & (alignment - 1)) == 0);  // zero or a power of 2.
@@ -37,10 +67,23 @@ void TRTCudaAllocator::free(void* memory) { cudaFree(memory); }
 
 void* TRTDeviceAllocator::allocate(uint64_t size, uint64_t alignment,
                                    uint32_t flags) {
+  // WAR for allocator alignment requirement. Certain cuda API calls require GPU
+  // memory with alignemtn to cudaDeviceProp::textureAlignment.
+  // See issue #20856
+  alignment = 512;
   assert((alignment & (alignment - 1)) == 0);  // zero or a power of 2.
-  void* mem = allocator_->AllocateRaw(alignment, size);
-  VLOG(2) << "Allocated " << size << " bytes with alignment " << alignment
-          << " @ " << mem;
+  size_t total_size = size + alignment;
+  void* mem = allocator_->AllocateRaw(alignment, total_size);
+  if (!mem) return nullptr;
+
+  void* alloc_mem = mem;
+  QCHECK(Align(alignment, size, mem, total_size));
+  if (mem != alloc_mem) {
+    QCHECK(mem_map_.insert({mem, alloc_mem}).second);
+  }
+  VLOG(2) << "Allocated " << total_size << " bytes memory @" << alloc_mem
+          << "; aligned to " << size << " bytes @" << mem << " with alignment "
+          << alignment;
   return mem;
 }
 
@@ -51,12 +94,20 @@ TRTDeviceAllocator::TRTDeviceAllocator(tensorflow::Allocator* allocator)
 
 void TRTDeviceAllocator::free(void* memory) {
   VLOG(2) << "Deallocating @ " << memory;
-  allocator_->DeallocateRaw(memory);
+  // allocated memory adjusted for alignment, restore the original pointer
+  if (memory) {
+    auto alloc_mem = mem_map_.find(memory);
+    if (alloc_mem != mem_map_.end()) {
+      memory = alloc_mem->second;
+      mem_map_.erase(alloc_mem->first);
+    }
+    allocator_->DeallocateRaw(memory);
+  }
 }
 
 }  // namespace tensorrt
 }  // namespace tensorflow
 
 #endif
-#endif
-#endif
+#endif  // GOOGLE_TENSORRT
+#endif  // GOOGLE_CUDA
diff --git a/tensorflow/contrib/tensorrt/resources/trt_allocator.h b/tensorflow/contrib/tensorrt/resources/trt_allocator.h
index c5d2cec730..6f94492083 100644
--- a/tensorflow/contrib/tensorrt/resources/trt_allocator.h
+++ b/tensorflow/contrib/tensorrt/resources/trt_allocator.h
@@ -16,13 +16,25 @@ limitations under the License.
 #ifndef TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_ALLOCATOR_H_
 #define TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_ALLOCATOR_H_
 
-#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
+#include <unordered_map>
+
 #include "tensorflow/core/framework/allocator.h"
 
 #if GOOGLE_CUDA
 #if GOOGLE_TENSORRT
 #include "tensorrt/include/NvInfer.h"
+#endif  // GOOGLE_TENSORRT
+#endif  // GOOGLE_CUDA
+
+namespace tensorflow {
+namespace tensorrt {
+// std::align is not supported, so this function mimic its behavior.
+void* Align(size_t alignment, size_t size, void*& ptr, size_t& space);
+}  // namespace tensorrt
+}  // namespace tensorflow
 
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
 #if NV_TENSORRT_MAJOR == 3
 // Define interface here temporarily until TRT 4.0 is released
 namespace nvinfer1 {
@@ -37,7 +49,14 @@ class IGpuAllocator {
 namespace tensorflow {
 namespace tensorrt {
 
-class TRTCudaAllocator : public nvinfer1::IGpuAllocator {
+class TRTBaseAllocator : public nvinfer1::IGpuAllocator {
+  // Base allocator class so we can have a virtual destructor;
+ public:
+  // python wrapper seems to be not happy with an pure virtual destructor;
+  virtual ~TRTBaseAllocator() = default;
+};
+
+class TRTCudaAllocator : public TRTBaseAllocator {
   // Allocator implementation that is using cuda allocator instead of device
   // allocator in case we can't get device allocator from TF.
  public:
@@ -47,10 +66,13 @@ class TRTCudaAllocator : public nvinfer1::IGpuAllocator {
   void free(void* memory) override;
 };
 
-class TRTDeviceAllocator : public nvinfer1::IGpuAllocator {
+class TRTDeviceAllocator : public TRTBaseAllocator {
   // Allocator implementation wrapping TF device allocators.
  public:
   TRTDeviceAllocator(tensorflow::Allocator* allocator);
+
+  // TODO(aaroey): base class doesn't have a virtual destructor, work with
+  // Nvidia to fix it.
   virtual ~TRTDeviceAllocator() {
     VLOG(1) << "Destroying allocator attached to " << allocator_->Name();
   }
@@ -59,6 +81,9 @@ class TRTDeviceAllocator : public nvinfer1::IGpuAllocator {
 
  private:
   tensorflow::Allocator* allocator_;
+
+  // supporting alignment from allocation request requires a map to free;
+  std::unordered_map<void*, void*> mem_map_;
 };
 
 }  // namespace tensorrt
diff --git a/tensorflow/contrib/tensorrt/resources/trt_allocator_test.cc b/tensorflow/contrib/tensorrt/resources/trt_allocator_test.cc
new file mode 100644
index 0000000000..f515ed03f2
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_allocator_test.cc
@@ -0,0 +1,79 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/resources/trt_allocator.h"
+
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+bool RunTest(const size_t alignment, const size_t size,
+             const intptr_t orig_ptr_val, const size_t orig_space) {
+  void* const orig_ptr = reinterpret_cast<void*>(orig_ptr_val);
+  void* ptr = orig_ptr;
+  size_t space = orig_space;
+  void* result = Align(alignment, size, ptr, space);
+  if (result == nullptr) {
+    EXPECT_EQ(orig_ptr, ptr);
+    EXPECT_EQ(orig_space, space);
+    return false;
+  } else {
+    EXPECT_EQ(result, ptr);
+    const intptr_t ptr_val = reinterpret_cast<intptr_t>(ptr);
+    EXPECT_EQ(0, ptr_val % alignment);
+    EXPECT_GE(ptr_val, orig_ptr_val);
+    EXPECT_GE(space, size);
+    EXPECT_LE(space, orig_space);
+    EXPECT_EQ(ptr_val + space, orig_ptr_val + orig_space);
+    return true;
+  }
+}
+
+TEST(TRTAllocatorTest, Align) {
+  for (const size_t space :
+       {1, 2, 3, 4, 7, 8, 9, 10, 16, 32, 511, 512, 513, 700, 12345}) {
+    for (size_t alignment = 1; alignment <= space * 4; alignment *= 2) {
+      for (const intptr_t ptr_val :
+           {1ul, alignment == 1 ? 1ul : alignment - 1, alignment, alignment + 1,
+            alignment + (alignment / 2)}) {
+        if (ptr_val % alignment == 0) {
+          for (const size_t size :
+               {1ul, space == 1 ? 1ul : space - 1, space, space + 1}) {
+            EXPECT_EQ(space >= size, RunTest(alignment, size, ptr_val, space));
+          }
+        } else {
+          EXPECT_FALSE(RunTest(alignment, space, ptr_val, space));
+          const size_t diff = alignment - ptr_val % alignment;
+          if (space > diff) {
+            EXPECT_TRUE(
+                RunTest(alignment, space - diff, ptr_val + diff, space - diff));
+            for (const size_t size :
+                 {1ul, space - diff > 1 ? space - diff - 1 : 1ul, space - diff,
+                  space - diff + 1, space - 1}) {
+              EXPECT_EQ(space - diff >= size,
+                        RunTest(alignment, size, ptr_val, space));
+            }
+          } else {
+            EXPECT_FALSE(RunTest(alignment, 1, ptr_val, space));
+          }
+        }
+      }
+    }
+  }
+}
+
+}  // namespace tensorrt
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/resources/trt_resources.h b/tensorflow/contrib/tensorrt/resources/trt_resources.h
index b7d5ffd674..d7d56cb95e 100644
--- a/tensorflow/contrib/tensorrt/resources/trt_resources.h
+++ b/tensorflow/contrib/tensorrt/resources/trt_resources.h
@@ -64,7 +64,7 @@ class TRTCalibrationResource : public tensorflow::ResourceBase {
   std::unique_ptr<TRTInt8Calibrator> calibrator_;
   TrtUniquePtrType<nvinfer1::IBuilder> builder_;
   TrtUniquePtrType<nvinfer1::ICudaEngine> engine_;
-  std::unique_ptr<nvinfer1::IGpuAllocator> allocator_;
+  std::unique_ptr<TRTBaseAllocator> allocator_;
   tensorflow::tensorrt::Logger logger_;
   // TODO(sami): Use threadpool threads!
   std::unique_ptr<std::thread> thr_;
diff --git a/tensorflow/contrib/tensorrt/segment/segment.cc b/tensorflow/contrib/tensorrt/segment/segment.cc
index cc42913eca..008fffc954 100644
--- a/tensorflow/contrib/tensorrt/segment/segment.cc
+++ b/tensorflow/contrib/tensorrt/segment/segment.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/contrib/tensorrt/segment/segment.h"
 
+#include <queue>
 #include <set>
 #include <unordered_map>
 #include <vector>
@@ -32,6 +33,7 @@ namespace tensorflow {
 namespace tensorrt {
 namespace segment {
 using ::tensorflow::strings::StrAppend;
+
 // A simple graph representation to mirror tensorflow::Graph. This structure
 // helps saving memory since segmenter modifies the graph in place, preventing
 // the need to create a copy of the graph. It is composed of edges and nodes.
@@ -215,7 +217,7 @@ namespace {
 
 bool CheckCycles(const std::unique_ptr<SimpleGraph>& g, const SimpleNode* src,
                  const std::vector<SimpleNode*>& start) {
-  // copied from TF ReverseDFS.
+  // Copied from TF ReverseDFS, which only works for tensorflow::Graph.
   struct Work {
     SimpleNode* node;
     bool leave;  // Are we entering or leaving n?
@@ -269,6 +271,24 @@ bool CanContractEdge(const SimpleEdge* edge,
   //   1. Get all nodes incoming to 'dst', excluding 'src'
   //   2. Reverse DFS from those nodes
   //   3. If reverse DFS reaches 'src' then we have a cycle
+  //
+  // TODO(aaroey): there are several problems with the current approach:
+  // 1. src->dst->src, this is not detected but it should be;
+  // 2. src->dst->...(any node sequence that doesn't contain src)...->dst, this
+  //    is detected but it should not be.
+  //
+  // Note that it's fine that dst connects back to src indirectly (i.e. through
+  // a path with length > 1 that consists of intermedia nodes other than src).
+  // While loops is one example.
+  //
+  // The goal is to make sure that the trt subgraph:
+  // 1. has no loops (i.e. is a DAG), and
+  // 2. if there is a path in the subgraph from X to Y (X and Y are both nodes
+  //    in the subgraph), then all paths from X to Y are in the subgraph.
+  //
+  // To achieve this goal, the correct way seems to be:
+  // 1. remove any direct edge from src->dst;
+  // 2. detect if src can reach dst, if so they cannot be merged.
   std::vector<SimpleNode*> dfs_start_nodes;
   for (SimpleNode* node : dst->in_nodes()) {
     if (node != src) {
@@ -276,8 +296,8 @@ bool CanContractEdge(const SimpleEdge* edge,
     }
   }
 
-  bool is_cycle = CheckCycles(graph, src, dfs_start_nodes);
-  return !is_cycle;
+  const bool has_cycle = CheckCycles(graph, src, dfs_start_nodes);
+  return !has_cycle;
 }
 }  // namespace
 
@@ -342,22 +362,20 @@ void ContractEdge(SimpleEdge* edge, SimpleGraph* graph,
 }
 
 tensorflow::Status SegmentGraph(
-    const tensorflow::GraphDef& gdef,
-    const std::function<bool(const tensorflow::Node*)>& candidate_fn,
-    const SegmentOptions& options, SegmentNodesVector* segments) {
-  // Create a Graph representation of the GraphDef.
-  tensorflow::FunctionLibraryDefinition flib(tensorflow::OpRegistry::Global(),
-                                             gdef.library());
-  tensorflow::Graph graph(flib);
-  TF_RETURN_IF_ERROR(tensorflow::ConvertGraphDefToGraph(
-      tensorflow::GraphConstructorOptions(), gdef, &graph));
-  return SegmentGraph(&graph, candidate_fn, options, segments);
-}
-
-tensorflow::Status SegmentGraph(
-    tensorflow::Graph* tf_graph,
+    const tensorflow::Graph* tf_graph,
     const std::function<bool(const tensorflow::Node*)>& candidate_fn,
+    const std::function<bool(const tensorflow::Edge*)>& input_candidate_fn,
+    const std::function<bool(const tensorflow::Edge*)>& output_candidate_fn,
     const SegmentOptions& options, SegmentNodesVector* segments) {
+  // Steps:
+  // 1. run the segmentation algorithm to find all the segments, which uses
+  //    candidate_fn to determine the candidates segment nodes;
+  // 2. for each segments, remove the nodes that are inputs/outputs of the
+  //    segment but are not eligible, using input/output_candidate_fn to
+  //    determine the eligibilities;
+  // 3. convert the segment into expected return format and return the result.
+
+  // --------------------------------- Step 1 ---------------------------------
   auto graph = std::unique_ptr<SimpleGraph>(new SimpleGraph(tf_graph));
   // Use a union-find to collect the nodes that belong to the same
   // segment. A node value of nullptr indicates that the node is not a candidate
@@ -372,14 +390,19 @@ tensorflow::Status SegmentGraph(
     node_segments.emplace_back(node);
   }
 
-  // The segmentation algorithm below visits nodes in reverse
-  // topological order and attempts to merge nodes along output
-  // edges. That means that subgraphs grow from the output-side of the
-  // network towards the inputs. In general this is not guaranteed to
-  // produce a globally optimal segmentation. In the future if we have
-  // a measure of how beneficial it is to include a given node in a
-  // TRT subgraph then we can revisit this algorithm to take advantage
-  // of that information.
+  // The segmentation algorithm below visits nodes in reverse topological order
+  // and attempts to merge nodes along output edges. That means that subgraphs
+  // grow from the output-side of the network towards the inputs.
+  //
+  // In general this is not guaranteed to produce a globally optimal
+  // segmentation. For exaample, consider graph with node {A, B, C, D} and edges
+  // {A->B, A->C, B->D, C->D), where A, B, D are trt compatible but C is not, so
+  // in theory we can choose to contract either A, B or B, D but not both, but
+  // here it always choose to contract B, D.
+  //
+  // In the future if we have a measure of how beneficial it is to include a
+  // given node in a TRT subgraph then we can revisit this algorithm to take
+  // advantage of that information.
   std::vector<tensorflow::Node*> tforder;
   tensorflow::GetPostOrder(*tf_graph, &tforder);
   // use postorder implementation from tensorflow and construct mirror in
@@ -392,13 +415,11 @@ tensorflow::Status SegmentGraph(
   for (const SimpleNode* node : order) {
     // All output nodes of 'node' have been visited...
     VLOG(2) << "Trying node " << node->name() << " id=" << node->id();
-
     // 'node' must be a TRT candidate...
     if (node_segments[node->id()].Value() == nullptr) {
       VLOG(2) << "... not a TRT candidate";
       continue;
     }
-
     // Contract output edges to combine 'node' with output
     // nodes. Iterate since combining two nodes may unblock other
     // combining.
@@ -416,7 +437,6 @@ tensorflow::Status SegmentGraph(
           VLOG(2) << "... ... not a TRT candidate";
           continue;
         }
-
         if (CanContractEdge(out_edge, graph)) {
           VLOG(2) << "... ... can contract";
           contract_edges.insert(out_edge);
@@ -424,11 +444,9 @@ tensorflow::Status SegmentGraph(
           VLOG(2) << "... ... cannot contract, would form cycle";
         }
       }
-
       if (contract_edges.empty()) {
         break;
       }
-
       // Contract edges and collect the adjacent nodes into the same
       // segment/subgraph.
       while (!contract_edges.empty()) {
@@ -457,11 +475,22 @@ tensorflow::Status SegmentGraph(
 
   // Collect the segments/subgraphs. Each subgraph is represented by a
   // set of the names of the nodes in that subgraph.
-  std::unordered_map<string, std::set<string>> sg_map;
+
+  // A map from the segment identifier (currently the name of the root node of
+  // the segment tree) to the segment nodes set.
+  std::unordered_map<string, std::set<const tensorflow::Node*>> sg_map;
+
+  // A map from the segment identifier (currently the name of the root node of
+  // the segment tree) to the device names that the nodes in the segment are
+  // assigned to.
+  //
+  // TODO(aaroey): nodes assigned to different devices should not be merged,
+  // fix this.
   std::unordered_map<string, std::set<string>> device_maps;
+
   for (auto& u : node_segments) {
     if ((u.Value() != nullptr) && (u.ParentValue() != nullptr)) {
-      sg_map[u.ParentValue()->name()].insert(u.Value()->name());
+      sg_map[u.ParentValue()->name()].insert(u.Value()->tf_node());
       auto tf_node = u.Value()->tf_node();
       // has_assigned_device_name() is expected to return true
       // when called from optimization pass. However, since graph
@@ -482,25 +511,104 @@ tensorflow::Status SegmentGraph(
     }
   }
 
+  // --------------------------------- Step 2 ---------------------------------
+  // Remove ineligible input/output nodes.
+  for (auto& itr : sg_map) {
+    std::set<const tensorflow::Node*>& segment_nodes = itr.second;
+    VLOG(1) << "Segment original size: " << segment_nodes.size();
+    while (true) {
+      std::deque<const tensorflow::Node*> in_nodes_que, out_nodes_que;
+      // Find an input node that is not eligible and add it to the queue.
+      // Nodes that has no incoming edges should not be treated as "input",
+      // as there are really no inputs to them. Similar for output nodes.
+      for (auto node : segment_nodes) {
+        bool added = false;
+        for (const tensorflow::Edge* edge : node->in_edges()) {
+          if (!edge->IsControlEdge() && !edge->src()->IsSource() &&
+              !segment_nodes.count(edge->src())) {  // 'node' is an input node.
+            if (!input_candidate_fn(edge)) {
+              in_nodes_que.push_back(node);
+              added = true;
+              break;
+            }
+          }
+        }
+        if (added) continue;  // Only adding the node once to either queue.
+        for (const tensorflow::Edge* edge : node->out_edges()) {
+          if (!edge->dst()->IsSink() && !edge->IsControlEdge() &&
+              !segment_nodes.count(edge->dst())) {  // 'node' is an output node.
+            if (!output_candidate_fn(edge)) {
+              out_nodes_que.push_back(node);
+              break;
+            }
+          }
+        }
+      }
+      if (in_nodes_que.empty() && out_nodes_que.empty()) {
+        // No more ineligible input/output nodes.
+        break;
+      }
+      // Now for each ineligible node, remove all of its inputs or outputs from
+      // the subgraph.
+      //
+      // It can be proven that, if the original subgraph:
+      // 1. is a DAG, and
+      // 2. all paths between two nodes in the subgraph are all inside the
+      //    subgraph
+      // then after doing this operation the resulting subgraph will keep the
+      // same properties 1 and 2.
+      //
+      // For simplicity we use heuristics: for input nodes remove all its
+      // input, for output nodes remove all its output. In this way, for common
+      // cases the number of removed nodes should be minimum.
+      auto remove_nodes = [&segment_nodes](
+                              bool is_input_nodes,
+                              std::deque<const tensorflow::Node*>* que) {
+        // Run a BFS on the queue to find all the input/output nodes.
+        std::set<const tensorflow::Node*> visited;
+        while (!que->empty()) {
+          auto node = que->front();
+          que->pop_front();
+          if (!visited.insert(node).second) continue;
+          segment_nodes.erase(node);
+          for (auto in :
+               is_input_nodes ? node->in_nodes() : node->out_nodes()) {
+            if (segment_nodes.count(in)) {
+              que->push_back(in);
+              VLOG(2) << "Need to remove node " << in->name()
+                      << " because one of its "
+                      << (is_input_nodes ? "output" : "input")
+                      << " nodes in the graph was removed: " << node->name();
+            }
+          }
+        }
+      };
+      remove_nodes(true, &in_nodes_que);
+      remove_nodes(false, &out_nodes_que);
+    }
+    VLOG(1) << "Segment new size: " << segment_nodes.size();
+  }
+
+  // --------------------------------- Step 3 ---------------------------------
   // Convert the segments into the expected return format
   for (const auto& itr : sg_map) {
-    const auto& segment_node_names = itr.second;
+    const std::set<const tensorflow::Node*>& segment_nodes = itr.second;
     if (VLOG_IS_ON(1)) {
       string s;
-      for (const auto& name : segment_node_names) {
-        s += " " + name;
-      }
-      VLOG(1) << "Segment " << segments->size() << ":" << s;
+      for (auto node : segment_nodes) s += " " + node->name();
+      VLOG(1) << "Segment " << segments->size() << ": " << s;
     }
 
     // Don't use small segments.
-    if (static_cast<int>(segment_node_names.size()) <
-        options.minimum_segment_size) {
+    if (static_cast<int>(segment_nodes.size()) < options.minimum_segment_size) {
       VLOG(1) << "Segment " << segments->size() << " has only "
-              << segment_node_names.size() << " nodes, dropping";
+              << segment_nodes.size() << " nodes, dropping";
       continue;
     }
+
     // TODO(sami): Make segmenter placement aware once trtscopes are in place
+    std::set<string> segment_node_names;
+    for (auto node : itr.second) segment_node_names.insert(node->name());
     const auto& dev_itr = device_maps.find(itr.first);
     if (dev_itr == device_maps.end() || dev_itr->second.empty()) {
       VLOG(1) << "No device assigned to segment " << segments->size();
diff --git a/tensorflow/contrib/tensorrt/segment/segment.h b/tensorflow/contrib/tensorrt/segment/segment.h
index 81b4bfe49f..8c44eb782a 100644
--- a/tensorflow/contrib/tensorrt/segment/segment.h
+++ b/tensorflow/contrib/tensorrt/segment/segment.h
@@ -42,22 +42,6 @@ struct SegmentOptions {
 
 // Get the subgraphs of a graph that can be handled by TensorRT.
 //
-// @param gdef The GraphDef describing the network
-// @param candidate_fn A function that returns true for a NodeDef if
-// that node can be handled by TensorRT.
-// @param segments Returns the TensorRT segments/subgraphs. Each entry
-// in the vector describes a subgraph by giving a set of the names of
-// all the NodeDefs in that subgraph.
-// @return the status.
-//
-// TODO(aaroey): remove this method.
-tensorflow::Status SegmentGraph(
-    const tensorflow::GraphDef& gdef,
-    const std::function<bool(const tensorflow::Node*)>& candidate_fn,
-    const SegmentOptions& options, SegmentNodesVector* segments);
-
-// Get the subgraphs of a graph that can be handled by TensorRT.
-//
 // @param graph tensorflow::Graph of the network
 // @param candidate_fn A function that returns true for a Node* if
 // that node can be handled by TensorRT.
@@ -66,8 +50,10 @@ tensorflow::Status SegmentGraph(
 // all the NodeDefs in that subgraph.
 // @return the status.
 tensorflow::Status SegmentGraph(
-    tensorflow::Graph* tf_graph,
+    const tensorflow::Graph* tf_graph,
     const std::function<bool(const tensorflow::Node*)>& candidate_fn,
+    const std::function<bool(const tensorflow::Edge*)>& input_candidate_fn,
+    const std::function<bool(const tensorflow::Edge*)>& output_candidate_fn,
     const SegmentOptions& options, SegmentNodesVector* segments);
 
 }  // namespace segment
diff --git a/tensorflow/contrib/tensorrt/segment/segment_test.cc b/tensorflow/contrib/tensorrt/segment/segment_test.cc
index f5b2d258d7..432e7b1c04 100644
--- a/tensorflow/contrib/tensorrt/segment/segment_test.cc
+++ b/tensorflow/contrib/tensorrt/segment/segment_test.cc
@@ -14,350 +14,245 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/contrib/tensorrt/segment/segment.h"
-#include "tensorflow/c/c_api.h"
-#include "tensorflow/core/framework/graph.pb.h"
+
+#include "tensorflow/cc/framework/scope.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/graph/testlib.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/public/session.h"
 
 namespace tensorflow {
 namespace tensorrt {
 namespace segment {
 namespace test {
+namespace ops = ::tensorflow::ops;
 
 class SegmentTest : public ::testing::Test {
- public:
-  bool GetGraphDef(TF_Graph* graph, tensorflow::GraphDef* graph_def);
-
-  TF_Operation* Placeholder(TF_Graph* graph, TF_Status* s, const char* name);
-  TF_Operation* Add(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
-                    TF_Status* s, const char* name);
-
-  std::function<bool(const tensorflow::Node*)> MakeCandidateFn(
-      const std::set<string>& node_names);
-
  protected:
-  void PlaceholderHelper(TF_Graph* graph, TF_Status* s, const char* name,
-                         TF_Operation** op);
-  void AddHelper(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
-                 TF_Status* s, const char* name, TF_Operation** op, bool check);
-
-  SegmentOptions default_options_;
-};
-
-bool SegmentTest::GetGraphDef(TF_Graph* graph,
-                              tensorflow::GraphDef* graph_def) {
-  TF_Status* s = TF_NewStatus();
-  TF_Buffer* buffer = TF_NewBuffer();
-  TF_GraphToGraphDef(graph, buffer, s);
-  bool ret = TF_GetCode(s) == TF_OK;
-  EXPECT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  if (ret) ret = graph_def->ParseFromArray(buffer->data, buffer->length);
-  TF_DeleteBuffer(buffer);
-  TF_DeleteStatus(s);
-  return ret;
-}
+  std::function<bool(const tensorflow::Node*)> MakeCandidateFn(
+      const std::set<string>& node_names) {
+    return [node_names](const tensorflow::Node* node) -> bool {
+      return node_names.find(node->name()) != node_names.end();
+    };
+  }
 
-std::function<bool(const tensorflow::Node*)> SegmentTest::MakeCandidateFn(
-    const std::set<string>& node_names) {
-  return [node_names](const tensorflow::Node* node) -> bool {
-    return node_names.find(node->name()) != node_names.end();
-  };
-}
+  std::function<bool(const tensorflow::Edge*)> MakeInputEdgeCandidateFn(
+      const std::set<string>& node_names) {
+    return [node_names](const tensorflow::Edge* in_edge) -> bool {
+      return node_names.find(in_edge->dst()->name()) != node_names.end();
+    };
+  }
 
-void SegmentTest::PlaceholderHelper(TF_Graph* graph, TF_Status* s,
-                                    const char* name, TF_Operation** op) {
-  TF_OperationDescription* desc = TF_NewOperation(graph, "Placeholder", name);
-  TF_SetAttrType(desc, "dtype", TF_INT32);
-  *op = TF_FinishOperation(desc, s);
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  ASSERT_NE(*op, nullptr);
-}
+  std::function<bool(const tensorflow::Edge*)> MakeOutputEdgeCandidateFn(
+      const std::set<string>& node_names) {
+    return [node_names](const tensorflow::Edge* out_edge) -> bool {
+      return node_names.find(out_edge->src()->name()) != node_names.end();
+    };
+  }
 
-TF_Operation* SegmentTest::Placeholder(TF_Graph* graph, TF_Status* s,
-                                       const char* name) {
-  TF_Operation* op;
-  PlaceholderHelper(graph, s, name, &op);
-  return op;
-}
+  void RunTest(const tensorflow::Graph* graph,
+               const std::set<string>& candidates,
+               const std::set<string>& input_candidates,
+               const std::set<string>& output_candidates,
+               const std::vector<std::set<string>>& expected_segments) {
+    SegmentNodesVector segments;
+    TF_EXPECT_OK(SegmentGraph(graph, MakeCandidateFn(candidates),
+                              MakeInputEdgeCandidateFn(input_candidates),
+                              MakeOutputEdgeCandidateFn(output_candidates),
+                              default_options_, &segments));
+    ValidateSegment(segments, expected_segments);
+  }
 
-void SegmentTest::AddHelper(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
-                            TF_Status* s, const char* name, TF_Operation** op,
-                            bool check) {
-  TF_OperationDescription* desc = TF_NewOperation(graph, "AddN", name);
-  TF_Output add_inputs[2] = {{l, 0}, {r, 0}};
-  TF_AddInputList(desc, add_inputs, 2);
-  *op = TF_FinishOperation(desc, s);
-  if (check) {
-    ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-    ASSERT_NE(*op, nullptr);
+  void ValidateSegment(const SegmentNodesVector& segments,
+                       const std::vector<std::set<string>>& expected_segments) {
+    EXPECT_EQ(expected_segments.size(), segments.size());
+    for (int i = 0; i < segments.size(); ++i) {
+      const auto& segment_node_names = segments[i].first;
+      const auto& expected = expected_segments[i];
+      for (const auto& name : expected) {
+        EXPECT_TRUE(segment_node_names.count(name))
+            << "Segment " << i << " is missing expected node: " << name;
+      }
+      if (segment_node_names.size() == expected.size()) continue;
+      for (const auto& name : segment_node_names) {
+        EXPECT_TRUE(expected.count(name))
+            << "Unexpected node found in segment " << i << ": " << name;
+      }
+    }
   }
-}
 
-TF_Operation* SegmentTest::Add(TF_Operation* l, TF_Operation* r,
-                               TF_Graph* graph, TF_Status* s,
-                               const char* name) {
-  TF_Operation* op;
-  AddHelper(l, r, graph, s, name, &op, true);
-  return op;
+  SegmentOptions default_options_;
+};
+
+std::set<string> operator-(const std::set<string>& lhs, const string& rhs) {
+  std::set<string> result = lhs;
+  CHECK(result.erase(rhs));
+  return result;
 }
 
 TEST_F(SegmentTest, Empty) {
-  TF_Graph* graph = TF_NewGraph();
-
-  GraphDef graph_def;
-  ASSERT_TRUE(GetGraphDef(graph, &graph_def));
-
-  SegmentNodesVector segments;
-  ASSERT_EQ(
-      SegmentGraph(graph_def, MakeCandidateFn({}), default_options_, &segments),
-      tensorflow::Status::OK());
-
+  Scope s = Scope::NewRootScope();
+  tensorflow::Graph g(OpRegistry::Global());
+  TF_EXPECT_OK(s.ToGraph(&g));
   // Expect no segments/subgraphs.
-  EXPECT_TRUE(segments.empty());
-  TF_DeleteGraph(graph);
+  RunTest(&g, {}, {}, {}, {});
 }
 
 TEST_F(SegmentTest, Simple) {
-  TF_Status* s = TF_NewStatus();
-  TF_Graph* graph = TF_NewGraph();
-
   //           feed
-  //         //    ||
+  //          //  \\
   //       add0    add1
-  //        | |    /
+  //        | \    /
   //        |  add2
-  //        |  /  ||
+  //        | /   \\
   //       add3    add4
-  //           |  /
+  //          \    /
   //          <sink>
-  //
-  TF_Operation* feed = Placeholder(graph, s, "feed");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("feed"), string(TF_OperationName(feed)));
-
-  TF_Operation* add0 = Add(feed, feed, graph, s, "add0");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add1 = Add(feed, feed, graph, s, "add1");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add2 = Add(add0, add1, graph, s, "add2");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add3 = Add(add0, add2, graph, s, "add3");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add3"), string(TF_OperationName(add3)));
-  TF_Operation* add4 = Add(add2, add2, graph, s, "add4");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add4"), string(TF_OperationName(add4)));
-
-  GraphDef graph_def;
-  ASSERT_TRUE(GetGraphDef(graph, &graph_def));
-
-  SegmentNodesVector segments;
-  ASSERT_EQ(
-      SegmentGraph(graph_def,
-                   MakeCandidateFn({"add0", "add1", "add2", "add3", "add4"}),
-                   default_options_, &segments),
-      tensorflow::Status::OK());
-
-  // Expect all Add operations to be collapsed into a single segment
-  ASSERT_EQ(segments.size(), 1);
-  std::vector<string> expected{"add0", "add1", "add2", "add3", "add4"};
-  for (const auto& ex : expected) {
-    EXPECT_TRUE(segments[0].first.find(ex) != segments[0].first.end())
-        << "Missing expected node " << ex;
-  }
-  TF_DeleteGraph(graph);
-  TF_DeleteStatus(s);
+  Scope s = Scope::NewRootScope();
+  auto feed = ops::Placeholder(s.WithOpName("feed"), DT_FLOAT);
+  auto add0 = ops::Add(s.WithOpName("add0"), feed, feed);
+  auto add1 = ops::Add(s.WithOpName("add1"), feed, feed);
+  auto add2 = ops::Add(s.WithOpName("add2"), add0, add1);
+  auto add3 = ops::Add(s.WithOpName("add3"), add0, add2);
+  auto add4 = ops::Add(s.WithOpName("add4"), add2, add2);
+  tensorflow::Graph g(OpRegistry::Global());
+  TF_EXPECT_OK(s.ToGraph(&g));
+
+  // All Add operations are candidates, and we expect all of them to be
+  // collapsed into a single segment
+  const std::set<string> all_adds = {"add0", "add1", "add2", "add3", "add4"};
+  RunTest(&g, all_adds, all_adds, all_adds, {all_adds});
+
+  // Make add1 not a candidate, and we expect all other Add operations to be
+  // collapsed into a single segment
+  auto without_add1 = all_adds - "add1";
+  RunTest(&g, without_add1, without_add1, without_add1, {without_add1});
+
+  // Make add1 not a candidate and add2 not an input candidate, and we expect
+  // add0 and add2 are removed from the segment.
+  auto without_add2 = all_adds - "add2";
+  RunTest(&g, without_add1, without_add2, without_add1, {{"add3", "add4"}});
+
+  // Making add2 not an input candidate itself won't affect anything.
+  RunTest(&g, all_adds, without_add2, all_adds, {all_adds});
+
+  // Making add1 not an input candidate.
+  RunTest(&g, all_adds, without_add1, all_adds, {without_add1});
+
+  // Making add3 not an output candidate doesn't affect anything, since it's
+  // output is sink.
+  auto without_add3 = all_adds - "add3";
+  RunTest(&g, all_adds, all_adds, without_add3, {all_adds});
 }
 
 TEST_F(SegmentTest, AvoidCycle) {
-  TF_Status* s = TF_NewStatus();
-  TF_Graph* graph = TF_NewGraph();
-
-  // add2 is not a TRT candidate so add0/add3 cannot be formed as a
-  // subgraph
-  //
   //           feed
-  //         //    ||
+  //          //  \\
   //       add0    add1
-  //        | |    /
+  //        | \    /
   //        |  add2
-  //        |  /  ||
+  //        |  /  \\
   //       add3    add4
-  //           |  /
+  //          \    /
   //          <sink>
-  //
-  TF_Operation* feed = Placeholder(graph, s, "feed");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("feed"), string(TF_OperationName(feed)));
-
-  TF_Operation* add0 = Add(feed, feed, graph, s, "add0");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add1 = Add(feed, feed, graph, s, "add1");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add2 = Add(add0, add1, graph, s, "add2");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add3 = Add(add0, add2, graph, s, "add3");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add3"), string(TF_OperationName(add3)));
-  TF_Operation* add4 = Add(add2, add2, graph, s, "add4");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add4"), string(TF_OperationName(add4)));
-
-  GraphDef graph_def;
-  ASSERT_TRUE(GetGraphDef(graph, &graph_def));
-
-  SegmentNodesVector segments;
-  ASSERT_EQ(
-      SegmentGraph(graph_def, MakeCandidateFn({"add0", "add1", "add3", "add4"}),
-                   default_options_, &segments),
-      tensorflow::Status::OK());
-
-  // Expect no subgraphs
-  EXPECT_EQ(segments.size(), 0);
-  TF_DeleteGraph(graph);
-  TF_DeleteStatus(s);
+  Scope s = Scope::NewRootScope();
+  auto feed = ops::Placeholder(s.WithOpName("feed"), DT_FLOAT);
+  auto add0 = ops::Add(s.WithOpName("add0"), feed, feed);
+  auto add1 = ops::Add(s.WithOpName("add1"), feed, feed);
+  auto add2 = ops::Add(s.WithOpName("add2"), add0, add1);
+  auto add3 = ops::Add(s.WithOpName("add3"), add0, add2);
+  auto add4 = ops::Add(s.WithOpName("add4"), add2, add2);
+  tensorflow::Graph g(OpRegistry::Global());
+  TF_EXPECT_OK(s.ToGraph(&g));
+
+  // add2 is not a TRT candidate so there should be no segments generated.
+  const std::set<string> without_add2 = {"add0", "add1", "add3", "add4"};
+  RunTest(&g, without_add2, without_add2, without_add2, {});
 }
 
 TEST_F(SegmentTest, Multiple) {
-  TF_Status* s = TF_NewStatus();
-  TF_Graph* graph = TF_NewGraph();
-
-  // add5 is not a TRT candidate so two subgraphs should be formed
-  //
-  //                feed
-  //         //      ||     ||
-  //       add0    add1      add7
-  //        | |    /        /   ||
-  //        |  add2-----add5    add8
-  //        |  /  |    |  |    |
-  //       add3   add4     add6
-  //           |     |     /
-  //               <sink>
-  //
-  TF_Operation* feed = Placeholder(graph, s, "feed");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("feed"), string(TF_OperationName(feed)));
-
-  TF_Operation* add0 = Add(feed, feed, graph, s, "add0");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add1 = Add(feed, feed, graph, s, "add1");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add7 = Add(feed, feed, graph, s, "add7");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add2 = Add(add0, add1, graph, s, "add2");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add5 = Add(add2, add7, graph, s, "add5");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add8 = Add(add7, add7, graph, s, "add8");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add3 = Add(add0, add2, graph, s, "add3");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add3"), string(TF_OperationName(add3)));
-  TF_Operation* add4 = Add(add2, add5, graph, s, "add4");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add4"), string(TF_OperationName(add4)));
-  TF_Operation* add6 = Add(add5, add8, graph, s, "add6");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add6"), string(TF_OperationName(add6)));
-
-  GraphDef graph_def;
-  ASSERT_TRUE(GetGraphDef(graph, &graph_def));
-
-  SegmentNodesVector segments;
-  ASSERT_EQ(SegmentGraph(graph_def,
-                         MakeCandidateFn({"add0", "add1", "add2", "add3",
-                                          "add4", "add6", "add7", "add8"}),
-                         default_options_, &segments),
-            tensorflow::Status::OK());
-
-  // Expect two subgraphs
-  EXPECT_EQ(segments.size(), 2);
-
-  std::vector<string> expected0{"add6", "add8"};
-  for (const auto& ex : expected0) {
-    EXPECT_TRUE(segments[0].first.find(ex) != segments[0].first.end())
-        << "Missing expected node " << ex;
-  }
-
-  std::vector<string> expected1{"add0", "add1", "add2", "add3"};
-  for (const auto& ex : expected1) {
-    EXPECT_TRUE(segments[1].first.find(ex) != segments[1].first.end())
-        << "Missing expected node " << ex;
-  }
-  TF_DeleteGraph(graph);
-  TF_DeleteStatus(s);
+  //              feed
+  //           //  ||  \\
+  //        add0  add1  add7
+  //        |  \  /     / \\
+  //        |  add2    /   \\
+  //        |   || \   |   ||
+  //        |   ||  add5  add8
+  //        |  /  \ /  \   /
+  //        add3  add4  add6
+  //           \   |   /
+  //             <sink>
+  Scope s = Scope::NewRootScope();
+  auto feed = ops::Placeholder(s.WithOpName("feed"), DT_FLOAT);
+  auto add0 = ops::Add(s.WithOpName("add0"), feed, feed);
+  auto add1 = ops::Add(s.WithOpName("add1"), feed, feed);
+  auto add7 = ops::Add(s.WithOpName("add7"), feed, feed);
+  auto add2 = ops::Add(s.WithOpName("add2"), add0, add1);
+  auto add5 = ops::Add(s.WithOpName("add5"), add2, add7);
+  auto add8 = ops::Add(s.WithOpName("add8"), add7, add7);
+  auto add3 = ops::Add(s.WithOpName("add3"), add0, add2);
+  auto add4 = ops::Add(s.WithOpName("add4"), add2, add5);
+  auto add6 = ops::Add(s.WithOpName("add6"), add5, add8);
+  tensorflow::Graph g(OpRegistry::Global());
+  TF_EXPECT_OK(s.ToGraph(&g));
+
+  const std::set<string> all_adds = {"add0", "add1", "add2", "add3", "add4",
+                                     "add5", "add6", "add7", "add8"};
+  // Make add5 not a TRT candidate, and we expect two segments.
+  auto without_add5 = all_adds - "add5";
+  RunTest(&g, without_add5, without_add5, without_add5,
+          {{"add6", "add8"}, {"add0", "add1", "add2", "add3"}});
+
+  // Make add8 not a candidate and add6 not an input candidate, then all direct
+  // and indirect inputs of add6 will be removed from the segment.
+  auto without_add8 = all_adds - "add8";
+  auto without_add6 = all_adds - "add6";
+  RunTest(&g, without_add8, without_add6, all_adds, {{"add3", "add4"}});
+
+  // Make add3 not a candidate and add0 not an output candidate, then all
+  // direct and indirect outputs of add0 will be removed from the segment.
+  auto without_add3 = all_adds - "add3";
+  auto without_add0 = all_adds - "add0";
+  RunTest(&g, without_add3, all_adds, without_add0, {{"add1", "add7", "add8"}});
 }
 
 TEST_F(SegmentTest, BigIfElse) {
-  TF_Status* s = TF_NewStatus();
-  TF_Graph* graph = TF_NewGraph();
-
-  // add2 is not a TRT candidate
-  //
   //           feed
   //            ||
   //           add0
-  //         //    ||
+  //         //    \\
   //       add1    add4
   //        ||      ||
   //       add2    add5
   //        ||      ||
   //       add3    add6
-  //         ||    //
+  //         \\    //
   //           add7
   //            ||
   //          <sink>
-  //
-  TF_Operation* feed = Placeholder(graph, s, "feed");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("feed"), string(TF_OperationName(feed)));
-
-  TF_Operation* add0 = Add(feed, feed, graph, s, "add0");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add1 = Add(add0, add0, graph, s, "add1");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add2 = Add(add1, add1, graph, s, "add2");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add3 = Add(add2, add2, graph, s, "add3");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add4 = Add(add0, add0, graph, s, "add4");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add5 = Add(add4, add4, graph, s, "add5");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add6 = Add(add5, add5, graph, s, "add6");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  TF_Operation* add7 = Add(add3, add6, graph, s, "add7");
-  ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
-  EXPECT_EQ(string("add7"), string(TF_OperationName(add7)));
-
-  GraphDef graph_def;
-  ASSERT_TRUE(GetGraphDef(graph, &graph_def));
-
-  SegmentNodesVector segments;
-  ASSERT_EQ(SegmentGraph(graph_def,
-                         MakeCandidateFn({"add0", "add1", "add3", "add4",
-                                          "add5", "add6", "add7"}),
-                         default_options_, &segments),
-            tensorflow::Status::OK());
-
-  // Expect 2 subgraphs
-  EXPECT_EQ(segments.size(), 2);
-
-  std::vector<string> expected0{"add3", "add4", "add5", "add6", "add7"};
-  for (const auto& ex : expected0) {
-    EXPECT_TRUE(segments[0].first.find(ex) != segments[0].first.end())
-        << "Missing expected node " << ex;
-  }
-
-  std::vector<string> expected1{"add0", "add1"};
-  for (const auto& ex : expected1) {
-    EXPECT_TRUE(segments[1].first.find(ex) != segments[1].first.end())
-        << "Missing expected node " << ex;
-  }
-  TF_DeleteGraph(graph);
-  TF_DeleteStatus(s);
+  Scope s = Scope::NewRootScope();
+  auto feed = ops::Placeholder(s.WithOpName("feed"), DT_FLOAT);
+  auto add0 = ops::Add(s.WithOpName("add0"), feed, feed);
+  auto add1 = ops::Add(s.WithOpName("add1"), add0, add0);
+  auto add2 = ops::Add(s.WithOpName("add2"), add1, add1);
+  auto add3 = ops::Add(s.WithOpName("add3"), add2, add2);
+  auto add4 = ops::Add(s.WithOpName("add4"), add0, add0);
+  auto add5 = ops::Add(s.WithOpName("add5"), add4, add4);
+  auto add6 = ops::Add(s.WithOpName("add6"), add5, add5);
+  auto add7 = ops::Add(s.WithOpName("add7"), add3, add6);
+  tensorflow::Graph g(OpRegistry::Global());
+  TF_EXPECT_OK(s.ToGraph(&g));
+
+  // Make add2 not a TRT candidate, and we expect 2 segments.
+  const std::set<string> all_adds = {"add0", "add1", "add2", "add3",
+                                     "add4", "add5", "add6", "add7"};
+  RunTest(&g, all_adds - "add2", all_adds, all_adds,
+          {{"add3", "add4", "add5", "add6", "add7"}, {"add0", "add1"}});
 }
 
 }  // namespace test
diff --git a/tensorflow/contrib/tensorrt/shape_fn/trt_shfn.cc b/tensorflow/contrib/tensorrt/shape_fn/trt_shfn.cc
index 227ac120dd..f30dba59ad 100644
--- a/tensorflow/contrib/tensorrt/shape_fn/trt_shfn.cc
+++ b/tensorflow/contrib/tensorrt/shape_fn/trt_shfn.cc
@@ -28,36 +28,50 @@ limitations under the License.
 namespace tensorflow {
 namespace shape_inference {
 
-tensorflow::Status TRTEngineOpShapeInference(InferenceContext* context) {
-  std::vector<tensorflow::TensorShape> shapes;
-  for (int i = 0; i < context->num_outputs(); ++i) {
-    context->set_output(i, context->UnknownShape());
+tensorflow::Status TRTEngineOpShapeInference(InferenceContext* c) {
+  for (int i = 0; i < c->num_outputs(); ++i) {
+    c->set_output(i, c->UnknownShape());
   }
-  auto status = context->GetAttr("input_shapes", &shapes);
-  // it is ok to not to have shapes
-  if (!status.ok()) return Status::OK();
-  if ((int)shapes.size() != context->num_inputs()) return Status::OK();
-  bool different_input = false;
-  for (int i = 0; i < context->num_inputs(); ++i) {
-    if (shapes.at(i) != context->input_tensor(i)->shape())
-      different_input = true;
+
+  // Check the sanity of the input shapes.
+  std::vector<tensorflow::TensorShape> input_shapes;
+  TF_RETURN_IF_ERROR(c->GetAttr("input_shapes", &input_shapes));
+  if (input_shapes.size() != c->num_inputs()) {
+    return tensorflow::errors::InvalidArgument(
+        "The actual number of inputs doesn't match the number of input "
+        "shapes set in the attr: ",
+        c->num_inputs(), " vs ", input_shapes.size());
+  }
+  bool input_match = true;
+  for (int i = 0; i < c->num_inputs(); ++i) {
+    ShapeHandle handle;
+    TF_RETURN_IF_ERROR(
+        c->MakeShapeFromTensorShape(input_shapes.at(i), &handle));
+    ShapeHandle merged;
+    if (!c->Merge(c->input(i), handle, &merged).ok()) {
+      // Input shape doesn't match what was set in attr, fine.
+      input_match = false;
+    }
   }
-  if (different_input) return Status::OK();
-  shapes.resize(0);
-  status = context->GetAttr("output_shapes", &shapes);
-  if (!status.ok()) return Status::OK();
-  if ((int)shapes.size() != context->num_outputs()) return Status::OK();
-  std::vector<ShapeHandle> shape_handles(shapes.size());
-  for (size_t i = 0; i < shapes.size(); ++i) {
-    status =
-        context->MakeShapeFromTensorShape(shapes.at(i), &shape_handles.at(i));
-    if (!status.ok()) return Status::OK();
+
+  // Check the sanity of the output shapes.
+  std::vector<tensorflow::TensorShape> output_shapes;
+  TF_RETURN_IF_ERROR(c->GetAttr("output_shapes", &output_shapes));
+  if (output_shapes.size() != c->num_outputs()) {
+    return tensorflow::errors::InvalidArgument(
+        "The actual number of outputs doesn't match the number of output "
+        "shapes set in the attr: ",
+        c->num_outputs(), " vs ", output_shapes.size());
   }
-  for (int i = 0; i < context->num_outputs(); ++i) {
-    context->set_output(i, shape_handles.at(i));
+  for (size_t i = 0; i < output_shapes.size(); ++i) {
+    ShapeHandle handle;
+    TF_RETURN_IF_ERROR(
+        c->MakeShapeFromTensorShape(output_shapes.at(i), &handle));
+    if (input_match) c->set_output(i, handle);
   }
   return Status::OK();
 }
+
 }  // namespace shape_inference
 }  // namespace tensorflow
 
diff --git a/tensorflow/contrib/tensorrt/tensorrt_test.cc b/tensorflow/contrib/tensorrt/tensorrt_test.cc
index 3712a9a6fe..769982c645 100644
--- a/tensorflow/contrib/tensorrt/tensorrt_test.cc
+++ b/tensorflow/contrib/tensorrt/tensorrt_test.cc
@@ -13,7 +13,9 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
 #include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/stream_executor.h"
 #include "tensorflow/core/platform/test.h"
 
 #if GOOGLE_CUDA
@@ -130,6 +132,13 @@ void Execute(nvinfer1::IExecutionContext* context, const float* input,
 }
 
 TEST(TensorrtTest, BasicFunctions) {
+  // Handle the case where the test is run on machine with no gpu available.
+  if (CHECK_NOTNULL(GPUMachineManager())->VisibleDeviceCount() <= 0) {
+    LOG(WARNING) << "No gpu device available, probably not being run on a gpu "
+                    "machine. Skipping...";
+    return;
+  }
+
   // Create the network model.
   nvinfer1::IHostMemory* model = CreateNetwork();
   // Use the model to create an engine and then an execution context.
diff --git a/tensorflow/contrib/tensorrt/test/base_test.py b/tensorflow/contrib/tensorrt/test/base_test.py
new file mode 100644
index 0000000000..edd30ad7a9
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/base_test.py
@@ -0,0 +1,126 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Basic tests for TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.platform import test
+
+
+class SimpleSingleEngineGraphDefTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Create a graph containing single segment."""
+    # TODO(aaroey): test graph with different dtypes.
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [100, 24, 24, 2]
+    g = ops.Graph()
+    with g.as_default():
+      inp = array_ops.placeholder(
+          dtype=dtype, shape=[None] + input_dims[1:], name=input_name)
+      with g.device("/GPU:0"):
+        conv_filter = constant_op.constant(
+            [[[[1., 0.5, 4., 6., 0.5, 1.], [1., 0.5, 1., 1., 0.5, 1.]]]],
+            name="weights",
+            dtype=dtype)
+        conv = nn.conv2d(
+            input=inp,
+            filter=conv_filter,
+            strides=[1, 2, 2, 1],
+            padding="SAME",
+            name="conv")
+        bias = constant_op.constant(
+            [4., 1.5, 2., 3., 5., 7.], name="bias", dtype=dtype)
+        added = nn.bias_add(conv, bias, name="bias_add")
+        relu = nn.relu(added, "relu")
+        identity = array_ops.identity(relu, "identity")
+        pool = nn_ops.max_pool(
+            identity, [1, 2, 2, 1], [1, 2, 2, 1], "VALID", name="max_pool")
+      array_ops.squeeze(pool, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=1,
+        expected_output_dims=(100, 6, 6, 6),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+class SimpleMultiEngineGraphDefTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Create a graph containing multiple segment."""
+    # TODO(aaroey): test graph with different dtypes.
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [100, 24, 24, 2]
+    g = ops.Graph()
+    with g.as_default():
+      inp = array_ops.placeholder(
+          dtype=dtype, shape=[None] + input_dims[1:], name=input_name)
+      with g.device("/GPU:0"):
+        conv_filter = constant_op.constant(
+            [[[[1., 0.5, 4., 6., 0.5, 1.], [1., 0.5, 1., 1., 0.5, 1.]]]],
+            name="weights",
+            dtype=dtype)
+        conv = nn.conv2d(
+            input=inp,
+            filter=conv_filter,
+            strides=[1, 2, 2, 1],
+            padding="SAME",
+            name="conv")
+        c1 = constant_op.constant(
+            np.random.randn(input_dims[0], 12, 12, 6), dtype=dtype)
+        p = conv * c1
+        c2 = constant_op.constant(
+            np.random.randn(input_dims[0], 12, 12, 6), dtype=dtype)
+        q = conv / c2
+
+        edge = self.trt_incompatible_op(q)
+        edge /= edge
+        r = edge + edge
+
+        p -= edge
+        q *= edge
+        s = p + q
+        s -= r
+      array_ops.squeeze(s, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=2,
+        expected_output_dims=(100, 12, 12, 6),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+# TODO(aaroey): add a large complex graph to test.
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/batch_matmul_test.py b/tensorflow/contrib/tensorrt/test/batch_matmul_test.py
new file mode 100644
index 0000000000..730b6843fb
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/batch_matmul_test.py
@@ -0,0 +1,76 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class BatchMatMulTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Testing conversion of BatchMatMul in TF-TRT conversion."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [12, 5, 8, 12]
+    w1_name = "matmul_w1"
+    w1_dims = [12, 5, 12, 7]
+    w2_name = "matmul_w2"
+    w2_dims = [12, 12, 7]
+    g = ops.Graph()
+    with g.as_default():
+      inp = array_ops.placeholder(
+          dtype=dtype, shape=[None] + input_dims[1:], name=input_name)
+      w1 = array_ops.placeholder(dtype=dtype, shape=w1_dims, name=w1_name)
+      w2 = array_ops.placeholder(dtype=dtype, shape=w2_dims, name=w2_name)
+      with g.device("/GPU:0"):
+        b = constant_op.constant(np.random.randn(12, 5, 12, 7), dtype=dtype)
+        c = constant_op.constant(np.random.randn(5, 1, 1), dtype=dtype)
+        d = constant_op.constant(np.random.randn(5, 1, 1), dtype=dtype)
+        x1 = math_ops.matmul(inp, b)
+        x1 = x1 + c
+        x2 = math_ops.matmul(inp, w1)
+        x2 = x2 * d
+        e = gen_array_ops.reshape(inp, [12, 40, 12])
+        x3 = math_ops.matmul(e, w2)
+        f = constant_op.constant(np.random.randn(40, 1), dtype=dtype)
+        x3 = x3 + f
+        x3 = gen_array_ops.reshape(x3, [12, 5, 8, 7])
+        out = x1 + x2 + x3
+      array_ops.squeeze(out, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name, w1_name, w2_name],
+        input_dims=[input_dims, w1_dims, w2_dims],
+        num_expected_engines=1,
+        expected_output_dims=(12, 5, 8, 7),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/biasadd_matmul_test.py b/tensorflow/contrib/tensorrt/test/biasadd_matmul_test.py
new file mode 100644
index 0000000000..0c03a10b64
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/biasadd_matmul_test.py
@@ -0,0 +1,112 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.platform import test
+
+
+class BiasaddMatMulTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Testing conversion of BiasAdd MatMul in TF-TRT conversion."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [48, 12]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+
+      b = constant_op.constant(np.random.randn(12, 4), dtype=dtype)
+      x1 = math_ops.matmul(x, b)
+      b = constant_op.constant(np.random.randn(1, 4), dtype=dtype)
+      x1 = x1 + b
+
+      b = constant_op.constant(np.random.randn(48, 4), dtype=dtype)
+      x2 = math_ops.matmul(x, b, transpose_a=True)
+      x2 = gen_array_ops.reshape(x2, [48, 1])
+
+      b = constant_op.constant(np.random.randn(4, 12), dtype=dtype)
+      x3 = math_ops.matmul(x, b, transpose_b=True)
+
+      b = constant_op.constant(np.random.randn(16, 48), dtype=dtype)
+      x4 = math_ops.matmul(x, b, transpose_b=True, transpose_a=True)
+      x4 = gen_array_ops.reshape(x4, [48, 4])
+
+      x5 = gen_array_ops.reshape(x, [4, 144])
+      b = constant_op.constant(np.random.randn(144, 48), dtype=dtype)
+      x5 = math_ops.matmul(x5, b)
+      b = constant_op.constant(np.random.randn(48), dtype=dtype)
+      x5 = nn.bias_add(x5, b)
+      x5 = gen_array_ops.reshape(x5, [48, 4])
+
+      x6 = gen_array_ops.reshape(x, [4, 12, 12])
+      b = constant_op.constant(np.random.randn(12), dtype=dtype)
+      x6 = nn.bias_add(x6, b, data_format="NHWC")
+      x6 = gen_array_ops.reshape(x6, [48, -1])
+
+      x7 = gen_array_ops.reshape(x, [4, 12, 3, 4])
+      b = constant_op.constant(np.random.randn(4), dtype=dtype)
+      x7 = nn.bias_add(x7, b, data_format="NHWC")
+      x7 = gen_array_ops.reshape(x7, [48, -1])
+
+      x8 = gen_array_ops.reshape(x, [4, 12, 3, 2, 2])
+      b = constant_op.constant(np.random.randn(2), dtype=dtype)
+      x8 = nn.bias_add(x8, b, data_format="NHWC")
+      x8 = gen_array_ops.reshape(x8, [48, -1])
+
+      x9 = gen_array_ops.reshape(x, [4, 12, 3, 2, 2])
+      b = constant_op.constant(np.random.randn(3), dtype=dtype)
+      x9 = nn.bias_add(x9, b, data_format="NCHW")
+      x9 = gen_array_ops.reshape(x9, [48, -1])
+
+      x10 = gen_array_ops.reshape(x, [4, 12, 3, 4])
+      b = constant_op.constant(np.random.randn(12), dtype=dtype)
+      x10 = nn.bias_add(x10, b, data_format="NCHW")
+      x10 = gen_array_ops.reshape(x10, [48, -1])
+
+      x11 = gen_array_ops.reshape(x, [4, 12, 12])
+      b = constant_op.constant(np.random.randn(4), dtype=dtype)
+      x11 = nn.bias_add(x11, b, data_format="NCHW")
+      x11 = gen_array_ops.reshape(x11, [48, -1])
+
+      out = array_ops.concat(
+          [x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11], axis=-1)
+      out = array_ops.squeeze(out, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=7,
+        expected_output_dims=(48, 89),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/binary_tensor_weight_broadcast_test.py b/tensorflow/contrib/tensorrt/test/binary_tensor_weight_broadcast_test.py
new file mode 100644
index 0000000000..dd673463a5
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/binary_tensor_weight_broadcast_test.py
@@ -0,0 +1,119 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class BinaryTensorWeightBroadcastTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Tests for scale & elementwise layers in TF-TRT."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [10, 24, 24, 20]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      # scale
+      a = constant_op.constant(np.random.randn(1), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      # scale
+      a = constant_op.constant(np.random.randn(1), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # scale
+      a = constant_op.constant(np.random.randn(24, 1, 1), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      # scale
+      a = constant_op.constant(np.random.randn(24, 1, 1), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # scale
+      a = constant_op.constant(np.random.randn(24, 24, 20), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # scale
+      a = constant_op.constant(np.random.randn(24, 24, 20), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(20), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(20), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(1, 24, 1, 1), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(1, 24, 1, 1), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(1, 24, 24, 1), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(1, 24, 24, 1), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(1, 24, 24, 20), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(1, 24, 24, 20), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(24, 20), dtype=dtype)
+      f = a + x
+      x = math_ops.sigmoid(f)
+      # elementwise
+      a = constant_op.constant(np.random.randn(24, 20), dtype=dtype)
+      f = x + a
+      x = math_ops.sigmoid(f)
+      gen_array_ops.reshape(x, [5, -1], name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=16,
+        expected_output_dims=(5, 23040),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/concatenation_test.py b/tensorflow/contrib/tensorrt/test/concatenation_test.py
new file mode 100644
index 0000000000..8c51c45b0a
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/concatenation_test.py
@@ -0,0 +1,83 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_array_ops
+from tensorflow.python.ops import gen_math_ops
+from tensorflow.python.platform import test
+
+
+class ConcatenationTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Testing Concatenation in TF-TRT conversion."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [2, 3, 3, 1]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      # scale
+      a = constant_op.constant(np.random.randn(3, 1, 1), dtype=dtype)
+      r1 = x / a
+      a = constant_op.constant(np.random.randn(3, 1, 1), dtype=dtype)
+      r2 = a / x
+      a = constant_op.constant(np.random.randn(1, 3, 1), dtype=dtype)
+      r3 = a + x
+      a = constant_op.constant(np.random.randn(1, 3, 1), dtype=dtype)
+      r4 = x * a
+      a = constant_op.constant(np.random.randn(3, 1, 1), dtype=dtype)
+      r5 = x - a
+      a = constant_op.constant(np.random.randn(3, 1, 1), dtype=dtype)
+      r6 = a - x
+      a = constant_op.constant(np.random.randn(3, 1), dtype=dtype)
+      r7 = x - a
+      a = constant_op.constant(np.random.randn(3, 1), dtype=dtype)
+      r8 = a - x
+      a = constant_op.constant(np.random.randn(3, 1, 1), dtype=dtype)
+      r9 = gen_math_ops.maximum(x, a)
+      a = constant_op.constant(np.random.randn(3, 1), dtype=dtype)
+      r10 = gen_math_ops.minimum(a, x)
+      a = constant_op.constant(np.random.randn(3), dtype=dtype)
+      r11 = x * a
+      a = constant_op.constant(np.random.randn(1), dtype=dtype)
+      r12 = a * x
+      concat1 = array_ops.concat([r1, r2, r3, r4, r5, r6], axis=-1)
+      concat2 = array_ops.concat([r7, r8, r9, r10, r11, r12], axis=3)
+      x = array_ops.concat([concat1, concat2], axis=-1)
+      gen_array_ops.reshape(x, [2, -1], name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=1,
+        expected_output_dims=(2, 126),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/const_broadcast_test.py b/tensorflow/contrib/tensorrt/test/const_broadcast_test.py
new file mode 100644
index 0000000000..97b29bf05d
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/const_broadcast_test.py
@@ -0,0 +1,68 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.platform import test
+
+
+class ConstBroadcastTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Test for Constant broadcasting in TF-TRT."""
+    dtype = dtypes.float32
+    input_name = 'input'
+    input_dims = [5, 12, 12, 2]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      filt1 = constant_op.constant(
+          0.3, shape=(3, 3, 2, 1), dtype=dtype, name='filt1')
+      y1 = nn.conv2d(x, filt1, strides=[1, 1, 1, 1], padding='SAME', name='y1')
+      z1 = nn.relu(y1, name='z1')
+      filt2 = constant_op.constant(
+          np.random.randn(9), shape=(3, 3, 1, 1), dtype=dtype, name='filt2')
+      y2 = nn.conv2d(z1, filt2, strides=[1, 1, 1, 1], padding='SAME', name='y2')
+      z2 = nn.relu(y2, name='z')
+      filt3 = constant_op.constant(
+          np.random.randn(3, 3, 1, 1),
+          shape=(3, 3, 1, 1),
+          dtype=dtype,
+          name='filt3')
+      y3 = nn.conv2d(z2, filt3, strides=[1, 1, 1, 1], padding='SAME', name='y3')
+      nn.relu(y3, name='output')
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=1,
+        expected_output_dims=(5, 12, 12, 1),
+        allclose_atol=1.e-02,
+        allclose_rtol=1.e-02)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/memory_alignment_test.py b/tensorflow/contrib/tensorrt/test/memory_alignment_test.py
new file mode 100644
index 0000000000..3dd95c6f62
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/memory_alignment_test.py
@@ -0,0 +1,72 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.platform import test
+
+
+class MemoryAlignmentTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Testing conversion of BatchMatMul in TF-TRT conversion."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [2, 15, 15, 3]
+    g = ops.Graph()
+    with g.as_default():
+      inp = array_ops.placeholder(
+          dtype=dtype, shape=[None] + input_dims[1:], name=input_name)
+      with g.device("/GPU:0"):
+        e1 = constant_op.constant(
+            np.random.randn(1, 1, 3, 5), name="kernel_1", dtype=dtype)
+        e2 = constant_op.constant(
+            np.random.randn(1, 1, 5, 10), name="kernel_2", dtype=dtype)
+        conv = nn.conv2d(
+            input=inp,
+            filter=e1,
+            strides=[1, 1, 1, 1],
+            padding="VALID",
+            name="conv")
+        out = nn.conv2d(
+            input=conv,
+            filter=e2,
+            strides=[1, 1, 1, 1],
+            padding="VALID",
+            name="conv_2")
+      array_ops.squeeze(out, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=1,
+        expected_output_dims=(2, 15, 15, 10),
+        allclose_atol=1.e-02,
+        allclose_rtol=1.e-02)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/multi_connection_neighbor_engine_test.py b/tensorflow/contrib/tensorrt/test/multi_connection_neighbor_engine_test.py
new file mode 100644
index 0000000000..734ccf6345
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/multi_connection_neighbor_engine_test.py
@@ -0,0 +1,87 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_math_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.platform import test
+
+
+class MultiConnectionNeighborEngineTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Test for multi connection neighboring nodes wiring tests in TF-TRT."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [2, 3, 7, 5]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      e = constant_op.constant(
+          np.random.normal(.05, .005, [3, 2, 3, 4]),
+          name="weights",
+          dtype=dtype)
+      conv = nn.conv2d(
+          input=x,
+          filter=e,
+          data_format="NCHW",
+          strides=[1, 1, 1, 1],
+          padding="VALID",
+          name="conv")
+      b = constant_op.constant(
+          np.random.normal(2.0, 1.0, [1, 4, 1, 1]), name="bias", dtype=dtype)
+      t = conv + b
+
+      b = constant_op.constant(
+          np.random.normal(5.0, 1.0, [1, 4, 1, 1]), name="bias", dtype=dtype)
+      q = conv - b
+      edge = math_ops.sigmoid(q)
+
+      b = constant_op.constant(
+          np.random.normal(5.0, 1.0, [1, 4, 1, 1]), name="bias", dtype=dtype)
+      d = b + conv
+      edge3 = math_ops.sigmoid(d)
+
+      edge1 = gen_math_ops.tan(conv)
+      t = t - edge1
+      q = q + edge
+      t = t + q
+      t = t + d
+      t = t - edge3
+      array_ops.squeeze(t, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=2,
+        expected_output_dims=(2, 4, 5, 4),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/neighboring_engine_test.py b/tensorflow/contrib/tensorrt/test/neighboring_engine_test.py
new file mode 100644
index 0000000000..50265c0845
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/neighboring_engine_test.py
@@ -0,0 +1,69 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_math_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.platform import test
+
+
+class NeighboringEngineTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Neighboring node wiring tests in TF-TRT conversion."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [2, 3, 7, 5]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      e = constant_op.constant(
+          np.random.normal(.3, 0.05, [3, 2, 3, 4]), name="weights", dtype=dtype)
+      conv = nn.conv2d(
+          input=x,
+          filter=e,
+          data_format="NCHW",
+          strides=[1, 1, 1, 1],
+          padding="VALID",
+          name="conv")
+      b = constant_op.constant(
+          np.random.normal(1.0, 1.0, [1, 4, 1, 1]), name="bias", dtype=dtype)
+      t = conv * b
+      e = gen_math_ops.tan(conv)
+      t = t - e
+      array_ops.squeeze(t, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=2,
+        expected_output_dims=(2, 4, 5, 4),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/tf_trt_integration_test.py b/tensorflow/contrib/tensorrt/test/tf_trt_integration_test.py
deleted file mode 100644
index d9c41f90d0..0000000000
--- a/tensorflow/contrib/tensorrt/test/tf_trt_integration_test.py
+++ /dev/null
@@ -1,347 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Script to test TF-TensorRT integration."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from collections import namedtuple
-import itertools
-import warnings
-import numpy as np
-import six
-
-from tensorflow.contrib import tensorrt as trt
-from tensorflow.core.protobuf import config_pb2
-from tensorflow.core.protobuf import rewriter_config_pb2
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import importer
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import test_util
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import nn
-from tensorflow.python.ops import nn_ops
-from tensorflow.python.platform import test
-
-INPUT_NAME = "input"
-OUTPUT_NAME = "output"
-INPUT_DIMS = [100, 24, 24, 2]
-MODE_FP32 = "FP32"
-MODE_FP16 = "FP16"
-MODE_INT8 = "INT8"
-
-if six.PY2:
-  to_bytes = lambda s: s
-  to_string = lambda s: s
-else:
-  to_bytes = lambda s: s.encode("utf-8", errors="surrogateescape")
-  to_string = lambda s: s.decode("utf-8")
-
-
-# TODO(aaroey): test graph with different dtypes.
-def GetSingleEngineGraphDef(dtype=dtypes.float32):
-  """Create a graph containing single segment."""
-  g = ops.Graph()
-  with g.as_default():
-    inp = array_ops.placeholder(
-        dtype=dtype, shape=[None] + INPUT_DIMS[1:], name=INPUT_NAME)
-    with g.device("/GPU:0"):
-      conv_filter = constant_op.constant(
-          [[[[1., 0.5, 4., 6., 0.5, 1.], [1., 0.5, 1., 1., 0.5, 1.]]]],
-          name="weights",
-          dtype=dtype)
-      conv = nn.conv2d(
-          input=inp,
-          filter=conv_filter,
-          strides=[1, 2, 2, 1],
-          padding="SAME",
-          name="conv")
-      bias = constant_op.constant(
-          [4., 1.5, 2., 3., 5., 7.], name="bias", dtype=dtype)
-      added = nn.bias_add(conv, bias, name="bias_add")
-      relu = nn.relu(added, "relu")
-      identity = array_ops.identity(relu, "identity")
-      pool = nn_ops.max_pool(
-          identity, [1, 2, 2, 1], [1, 2, 2, 1], "VALID", name="max_pool")
-    array_ops.squeeze(pool, name=OUTPUT_NAME)
-  return g.as_graph_def()
-
-
-# TODO(aaroey): test graph with different dtypes.
-def GetMultiEngineGraphDef(dtype=dtypes.float32):
-  """Create a graph containing multiple segment."""
-  g = ops.Graph()
-  with g.as_default():
-    inp = array_ops.placeholder(
-        dtype=dtype, shape=[None] + INPUT_DIMS[1:], name=INPUT_NAME)
-    with g.device("/GPU:0"):
-      conv_filter = constant_op.constant(
-          [[[[1., 0.5, 4., 6., 0.5, 1.], [1., 0.5, 1., 1., 0.5, 1.]]]],
-          name="weights",
-          dtype=dtype)
-      conv = nn.conv2d(
-          input=inp,
-          filter=conv_filter,
-          strides=[1, 2, 2, 1],
-          padding="SAME",
-          name="conv")
-      c1 = constant_op.constant(
-          np.random.randn(INPUT_DIMS[0], 12, 12, 6), dtype=dtype)
-      p = conv * c1
-      c2 = constant_op.constant(
-          np.random.randn(INPUT_DIMS[0], 12, 12, 6), dtype=dtype)
-      q = conv / c2
-
-      edge = math_ops.sin(q)
-      edge /= edge
-      r = edge + edge
-
-      p -= edge
-      q *= edge
-      s = p + q
-      s -= r
-    array_ops.squeeze(s, name=OUTPUT_NAME)
-  return g.as_graph_def()
-
-
-TestGraph = namedtuple("TestGraph",
-                       ["gdef", "num_expected_engines", "expected_output_dims"])
-
-TEST_GRAPHS = {
-    "SingleEngineGraph":
-        TestGraph(
-            gdef=GetSingleEngineGraphDef(),
-            num_expected_engines=1,
-            expected_output_dims=(100, 6, 6, 6)),
-    "MultiEngineGraph":
-        TestGraph(
-            gdef=GetMultiEngineGraphDef(),
-            num_expected_engines=2,
-            expected_output_dims=(100, 12, 12, 6)),
-    # TODO(aaroey): add a large complex graph to test.
-}
-
-
-class TfTrtIntegrationTest(test_util.TensorFlowTestCase):
-  """Class to test Tensorflow-TensorRT integration."""
-
-  def setUp(self):
-    """Setup method."""
-    super(TfTrtIntegrationTest, self).setUp()
-    warnings.simplefilter("always")
-    self._input = np.random.random_sample(INPUT_DIMS)
-
-  def _GetConfigProto(self,
-                      use_optimizer,
-                      precision_mode=None,
-                      is_dynamic_op=None):
-    if use_optimizer:
-      rewriter_cfg = rewriter_config_pb2.RewriterConfig()
-      rewriter_cfg.optimizers.extend(["constfold", "layout"])
-      custom_op = rewriter_cfg.custom_optimizers.add()
-      custom_op.name = "TensorRTOptimizer"
-      custom_op.parameter_map["minimum_segment_size"].i = 3
-      custom_op.parameter_map["max_batch_size"].i = self._input.shape[0]
-      custom_op.parameter_map["is_dynamic_op"].b = is_dynamic_op
-      custom_op.parameter_map["max_workspace_size_bytes"].i = 1 << 25
-      custom_op.parameter_map["precision_mode"].s = to_bytes(precision_mode)
-      graph_options = config_pb2.GraphOptions(rewrite_options=rewriter_cfg)
-    else:
-      graph_options = config_pb2.GraphOptions()
-
-    gpu_options = config_pb2.GPUOptions()
-    if trt.trt_convert.get_linked_tensorrt_version()[0] == 3:
-      gpu_options.per_process_gpu_memory_fraction = 0.50
-
-    config = config_pb2.ConfigProto(
-        gpu_options=gpu_options, graph_options=graph_options)
-    return config
-
-  def _RunGraph(self, graph_key, gdef, input_data, config, num_runs=2):
-    """Run given graphdef multiple times."""
-    g = ops.Graph()
-    with g.as_default():
-      inp, out = importer.import_graph_def(
-          graph_def=gdef, return_elements=[INPUT_NAME, OUTPUT_NAME], name="")
-      inp = inp.outputs[0]
-      out = out.outputs[0]
-    with self.test_session(
-        graph=g, config=config, use_gpu=True, force_gpu=True) as sess:
-      val = None
-      # Defaults to 2 runs to verify result across multiple runs is same.
-      for _ in range(num_runs):
-        new_val = sess.run(out, {inp: input_data})
-        self.assertEquals(TEST_GRAPHS[graph_key].expected_output_dims,
-                          new_val.shape)
-        if val is not None:
-          self.assertAllEqual(new_val, val)
-        val = new_val
-    return val
-
-  # Use real data that is representative of the inference dataset
-  # for calibration. For this test script it is random data.
-  def _RunCalibration(self, graph_key, gdef, input_data, config):
-    """Run calibration on given graph."""
-    return self._RunGraph(graph_key, gdef, input_data, config, 30)
-
-  def _GetTrtGraph(self, gdef, precision_mode, is_dynamic_op):
-    """Return trt converted graph."""
-    return trt.create_inference_graph(
-        input_graph_def=gdef,
-        outputs=[OUTPUT_NAME],
-        max_batch_size=self._input.shape[0],
-        max_workspace_size_bytes=1 << 25,
-        precision_mode=precision_mode,
-        minimum_segment_size=2,
-        is_dynamic_op=is_dynamic_op)
-
-  def _VerifyGraphDef(self,
-                      graph_key,
-                      gdef,
-                      precision_mode=None,
-                      is_calibrated=None,
-                      dynamic_engine=None):
-    num_engines = 0
-    for n in gdef.node:
-      if n.op == "TRTEngineOp":
-        num_engines += 1
-        self.assertNotEqual("", n.attr["serialized_segment"].s)
-        self.assertNotEqual("", n.attr["segment_funcdef_name"].s)
-        self.assertEquals(n.attr["precision_mode"].s, precision_mode)
-        self.assertEquals(n.attr["static_engine"].b, not dynamic_engine)
-        if precision_mode == MODE_INT8 and is_calibrated:
-          self.assertNotEqual("", n.attr["calibration_data"].s)
-        else:
-          self.assertEquals("", n.attr["calibration_data"].s)
-    if precision_mode is None:
-      self.assertEquals(num_engines, 0)
-    else:
-      self.assertEquals(num_engines,
-                        TEST_GRAPHS[graph_key].num_expected_engines)
-
-  def _RunTest(self, graph_key, use_optimizer, precision_mode,
-               dynamic_infer_engine, dynamic_calib_engine):
-    assert precision_mode in [MODE_FP32, MODE_FP16, MODE_INT8]
-    input_gdef = TEST_GRAPHS[graph_key].gdef
-    self._VerifyGraphDef(graph_key, input_gdef)
-
-    # Get reference result without running trt.
-    config_no_trt = self._GetConfigProto(False)
-    print("Running original graph w/o trt, config:\n%s" % str(config_no_trt))
-    ref_result = self._RunGraph(graph_key, input_gdef, self._input,
-                                config_no_trt)
-
-    # Run calibration if necessary.
-    if precision_mode == MODE_INT8:
-
-      calib_config = self._GetConfigProto(use_optimizer, precision_mode,
-                                          dynamic_calib_engine)
-      print("Running calibration graph, config:\n%s" % str(calib_config))
-      if use_optimizer:
-        self.assertTrue(False)
-        # TODO(aaroey): uncomment this and get infer_gdef when this mode is
-        # supported.
-        # result = self._RunCalibration(graph_key, input_gdef, self._input,
-        #                               calib_config)
-      else:
-        calib_gdef = self._GetTrtGraph(input_gdef, precision_mode,
-                                       dynamic_calib_engine)
-        self._VerifyGraphDef(graph_key, calib_gdef, precision_mode, False,
-                             dynamic_calib_engine)
-        result = self._RunCalibration(graph_key, calib_gdef, self._input,
-                                      calib_config)
-        infer_gdef = trt.calib_graph_to_infer_graph(calib_gdef)
-        self._VerifyGraphDef(graph_key, infer_gdef, precision_mode, True,
-                             dynamic_calib_engine)
-      self.assertAllClose(ref_result, result, rtol=1.e-03)
-    else:
-      infer_gdef = input_gdef
-
-    # Run inference.
-    infer_config = self._GetConfigProto(use_optimizer, precision_mode,
-                                        dynamic_infer_engine)
-    print("Running final inference graph, config:\n%s" % str(infer_config))
-    if use_optimizer:
-      result = self._RunGraph(graph_key, infer_gdef, self._input, infer_config)
-    else:
-      trt_infer_gdef = self._GetTrtGraph(infer_gdef, precision_mode,
-                                         dynamic_infer_engine)
-      self._VerifyGraphDef(graph_key, trt_infer_gdef, precision_mode, True,
-                           dynamic_infer_engine)
-      result = self._RunGraph(graph_key, trt_infer_gdef, self._input,
-                              infer_config)
-    self.assertAllClose(ref_result, result, rtol=1.e-03)
-
-  def testIdempotence(self):
-    # Test that applying tensorrt optimizer or offline conversion tools multiple
-    # times to the same graph will result in same graph.
-    # TODO(aaroey): implement this.
-    pass
-
-
-def GetTests():
-
-  def _GetTest(g, u, p, i, c):
-
-    def _Test(self):
-      print("Running test with parameters: graph_key=%s, use_optimizer=%s, "
-            "precision_mode=%s, dynamic_infer_engine=%s, "
-            "dynamic_calib_engine=%s" % (g, u, p, i, c))
-      self._RunTest(g, u, p, i, c)
-
-    return _Test
-
-  use_optimizer_options = [False, True]
-  precision_mode_options = [MODE_FP32, MODE_FP16, MODE_INT8]
-  dynamic_infer_engine_options = [False, True]
-  dynamic_calib_engine_options = [False, True]
-  for (graph_key, use_optimizer, precision_mode,
-       dynamic_infer_engine, dynamic_calib_engine) in itertools.product(
-           TEST_GRAPHS, use_optimizer_options, precision_mode_options,
-           dynamic_infer_engine_options, dynamic_calib_engine_options):
-    if precision_mode == MODE_INT8:
-      if not dynamic_calib_engine and dynamic_infer_engine:
-        # TODO(aaroey): test this case, the conversion from static calibration
-        # engine to dynamic inference engine should be a noop.
-        continue
-      if use_optimizer:
-        # TODO(aaroey): if use_optimizer is True we need to get the inference
-        # graphdef using custom python wrapper class, which is not currently
-        # supported yet.
-        continue
-      if not dynamic_calib_engine:
-        # TODO(aaroey): construction of static calibration engine is not
-        # supported yet.
-        continue
-      if dynamic_calib_engine and not dynamic_infer_engine:
-        # TODO(aaroey): construction of static inference engine using dynamic
-        # calibration engine is not supported yet.
-        continue
-    else:  # In non int8 mode.
-      if dynamic_calib_engine:
-        # dynamic_calib_engine doesn't affect non-int8 modes, so just let
-        # related tests run once on dynamic_calib_engine=False.
-        continue
-    yield _GetTest(graph_key, use_optimizer, precision_mode,
-                   dynamic_infer_engine, dynamic_calib_engine)
-
-
-if __name__ == "__main__":
-  for index, t in enumerate(GetTests()):
-    setattr(TfTrtIntegrationTest, "testTfTRT_" + str(index), t)
-  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/tf_trt_integration_test_base.py b/tensorflow/contrib/tensorrt/test/tf_trt_integration_test_base.py
new file mode 100644
index 0000000000..bb7f5a77f0
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/tf_trt_integration_test_base.py
@@ -0,0 +1,329 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utilities to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from collections import namedtuple
+import itertools
+import warnings
+import numpy as np
+import six
+
+from tensorflow.contrib.tensorrt.python import trt_convert
+# pylint: disable=unused-import
+from tensorflow.contrib.tensorrt.python.ops import trt_engine_op
+# pylint: enable=unused-import
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.core.protobuf import rewriter_config_pb2
+from tensorflow.python.framework import importer
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import tf_logging as logging
+
+TfTrtIntegrationTestParams = namedtuple("TfTrtIntegrationTestParams", [
+    "gdef", "input_names", "input_dims", "num_expected_engines",
+    "expected_output_dims", "allclose_atol", "allclose_rtol"
+])
+
+PRECISION_MODES = ["FP32", "FP16", "INT8"]
+
+
+def _IsQuantizationMode(mode):
+  return mode == "INT8"
+
+
+class TfTrtIntegrationTestBase(test_util.TensorFlowTestCase):
+  """Class to test Tensorflow-TensorRT integration."""
+
+  @property
+  def output_name(self):
+    return "output"
+
+  @property
+  def trt_incompatible_op(self):
+    return math_ops.sin
+
+  @property
+  def precision_modes(self):
+    return ["FP32", "FP16", "INT8"]
+
+  def _ToBytes(self, s):
+    if six.PY2:
+      return s
+    else:
+      return s.encode("utf-8")
+
+  def _ToString(self, s):
+    if six.PY2:
+      return s
+    else:
+      return s.decode("utf-8")
+
+  def setUp(self):
+    """Setup method."""
+    super(TfTrtIntegrationTestBase, self).setUp()
+    warnings.simplefilter("always")
+
+  def GetParams(self):
+    """Return a TfTrtIntegrationTestParams for test, implemented by subclass."""
+    raise NotImplementedError()
+
+  def _GetConfigProto(self,
+                      params,
+                      use_optimizer,
+                      precision_mode=None,
+                      is_dynamic_op=None):
+    """Get config proto based on specific settings."""
+    if use_optimizer:
+      rewriter_cfg = rewriter_config_pb2.RewriterConfig()
+      rewriter_cfg.optimizers.extend(["constfold", "layout"])
+      custom_op = rewriter_cfg.custom_optimizers.add()
+      custom_op.name = "TensorRTOptimizer"
+      custom_op.parameter_map["minimum_segment_size"].i = 3
+      custom_op.parameter_map["max_batch_size"].i = max(
+          [dims[0] for dims in params.input_dims])
+      custom_op.parameter_map["is_dynamic_op"].b = is_dynamic_op
+      custom_op.parameter_map["max_workspace_size_bytes"].i = 1 << 25
+      custom_op.parameter_map["precision_mode"].s = self._ToBytes(
+          precision_mode)
+      graph_options = config_pb2.GraphOptions(rewrite_options=rewriter_cfg)
+    else:
+      graph_options = config_pb2.GraphOptions()
+
+    gpu_options = config_pb2.GPUOptions()
+    gpu_options.allow_growth = True
+    if trt_convert.get_linked_tensorrt_version()[0] == 3:
+      gpu_options.per_process_gpu_memory_fraction = 0.50
+
+    config = config_pb2.ConfigProto(
+        gpu_options=gpu_options, graph_options=graph_options)
+    return config
+
+  def _RunGraph(self, params, gdef, input_data, config, num_runs=2):
+    """Run given graphdef multiple times."""
+    assert len(params.input_names) == len(input_data)
+    g = ops.Graph()
+    with g.as_default():
+      io_ops = importer.import_graph_def(
+          graph_def=gdef,
+          return_elements=params.input_names + [self.output_name],
+          name="")
+      inp = [i.outputs[0] for i in io_ops[:-1]]
+      assert len(inp) == len(input_data)
+      out = io_ops[-1].outputs[0]
+    with self.test_session(
+        graph=g, config=config, use_gpu=True, force_gpu=True) as sess:
+      val = None
+      # Defaults to 2 runs to verify result across multiple runs is same.
+      for _ in range(num_runs):
+        new_val = sess.run(out,
+                           {inp[i]: input_data[i] for i in range(len(inp))})
+        self.assertEqual(params.expected_output_dims, new_val.shape)
+        if val is not None:
+          self.assertAllEqual(val, new_val)
+        val = new_val
+    return val
+
+  # Use real data that is representative of the inference dataset
+  # for calibration. For this test script it is random data.
+  def _RunCalibration(self, params, gdef, input_data, config):
+    """Run calibration on given graph."""
+    return self._RunGraph(params, gdef, input_data, config, 30)
+
+  def _GetTrtGraphDef(self, params, gdef, precision_mode, is_dynamic_op):
+    """Return trt converted graphdef."""
+    return trt_convert.create_inference_graph(
+        input_graph_def=gdef,
+        outputs=[self.output_name],
+        max_batch_size=max([dims[0] for dims in params.input_dims]),
+        max_workspace_size_bytes=1 << 25,
+        precision_mode=precision_mode,
+        minimum_segment_size=2,
+        is_dynamic_op=is_dynamic_op)
+
+  def _VerifyGraphDef(self,
+                      params,
+                      gdef,
+                      precision_mode=None,
+                      is_calibrated=None,
+                      dynamic_engine=None):
+    num_engines = 0
+    for n in gdef.node:
+      # TODO(jie): we should have coverage for failed conversion (TF fallback).
+      # where the conversion will fail and we shouldn't count this engine as the
+      # converted engines.
+      if n.op == "TRTEngineOp":
+        num_engines += 1
+        self.assertNotEqual(self._ToBytes(""), n.attr["serialized_segment"].s)
+        self.assertNotEqual(self._ToBytes(""), n.attr["segment_funcdef_name"].s)
+        self.assertEqual(
+            self._ToBytes(precision_mode), n.attr["precision_mode"].s)
+        self.assertEqual(not dynamic_engine, n.attr["static_engine"].b)
+        if _IsQuantizationMode(precision_mode) and is_calibrated:
+          self.assertNotEqual(self._ToBytes(""), n.attr["calibration_data"].s)
+        else:
+          self.assertEqual(self._ToBytes(""), n.attr["calibration_data"].s)
+    if precision_mode is None:  # This means gdef is the original GraphDef.
+      self.assertEqual(0, num_engines)
+    else:
+      self.assertEqual(num_engines, params.num_expected_engines)
+
+  def RunTest(self, params, use_optimizer, precision_mode,
+              dynamic_infer_engine, dynamic_calib_engine):
+    assert precision_mode in PRECISION_MODES
+    input_data = [np.random.random_sample(dims) for dims in params.input_dims]
+    input_gdef = params.gdef
+    self._VerifyGraphDef(params, input_gdef)
+
+    # Get reference result without running trt.
+    config_no_trt = self._GetConfigProto(params, False)
+    logging.info("Running original graph w/o trt, config:\n%s",
+                 str(config_no_trt))
+    ref_result = self._RunGraph(params, input_gdef, input_data, config_no_trt)
+
+    # Run calibration if necessary.
+    if _IsQuantizationMode(precision_mode):
+
+      calib_config = self._GetConfigProto(params, use_optimizer, precision_mode,
+                                          dynamic_calib_engine)
+      logging.info("Running calibration graph, config:\n%s", str(calib_config))
+      if use_optimizer:
+        self.assertTrue(False)
+        # TODO(aaroey): uncomment this and get infer_gdef when this mode is
+        # supported.
+        # result = self._RunCalibration(params, input_gdef, input_data,
+        #                               calib_config)
+      else:
+        calib_gdef = self._GetTrtGraphDef(params, input_gdef, precision_mode,
+                                          dynamic_calib_engine)
+        self._VerifyGraphDef(params, calib_gdef, precision_mode, False,
+                             dynamic_calib_engine)
+        result = self._RunCalibration(params, calib_gdef, input_data,
+                                      calib_config)
+        infer_gdef = trt_convert.calib_graph_to_infer_graph(calib_gdef)
+        self._VerifyGraphDef(params, infer_gdef, precision_mode, True,
+                             dynamic_calib_engine)
+
+      self.assertAllClose(
+          ref_result,
+          result,
+          atol=params.allclose_atol,
+          rtol=params.allclose_rtol)
+    else:
+      infer_gdef = input_gdef
+
+    # Run inference.
+    infer_config = self._GetConfigProto(params, use_optimizer, precision_mode,
+                                        dynamic_infer_engine)
+    logging.info("Running final inference graph, config:\n%s",
+                 str(infer_config))
+    if use_optimizer:
+      result = self._RunGraph(params, infer_gdef, input_data, infer_config)
+    else:
+      trt_infer_gdef = self._GetTrtGraphDef(params, infer_gdef, precision_mode,
+                                            dynamic_infer_engine)
+      self._VerifyGraphDef(params, trt_infer_gdef, precision_mode, True,
+                           dynamic_infer_engine)
+      result = self._RunGraph(params, trt_infer_gdef, input_data, infer_config)
+
+    self.assertAllClose(
+        ref_result,
+        result,
+        atol=params.allclose_atol,
+        rtol=params.allclose_rtol)
+
+  def testIdempotence(self):
+    # Test that applying tensorrt optimizer or offline conversion tools multiple
+    # times to the same graph will result in same graph.
+    #
+    # TODO(aaroey): currently the conversion is not deterministic, this is
+    # mainly because during tensorflow::ConvertGraphDefToGraph(), the graph uses
+    # EdgeSet which use a map keyed by Edge*, so the order of input/output edges
+    # of a node is nondeterministic, thus the order for segmenter to contract
+    # edges is nondeterministic. Need to evaluate whether we should fix this.
+    pass
+
+
+def _AddTests(test_class):
+  """Adds test methods to TfTrtIntegrationTestBase."""
+
+  def _GetTest(use_optimizer, precision_mode, dynamic_infer_engine,
+               dynamic_calib_engine):
+    """Gets a single test method based on the parameters."""
+
+    def _Test(self):
+      params = self.GetParams()
+      logging.info(
+          "Running test with parameters: use_optimizer=%s, precision_mode=%s, "
+          "dynamic_infer_engine=%s, dynamic_calib_engine=%s", use_optimizer,
+          precision_mode, dynamic_infer_engine, dynamic_calib_engine)
+      self.RunTest(params, use_optimizer, precision_mode, dynamic_infer_engine,
+                   dynamic_calib_engine)
+
+    return _Test
+
+  use_optimizer_options = [False, True]
+  dynamic_infer_engine_options = [False, True]
+  dynamic_calib_engine_options = [False, True]
+  for (use_optimizer, precision_mode,
+       dynamic_infer_engine, dynamic_calib_engine) in itertools.product(
+           use_optimizer_options, PRECISION_MODES, dynamic_infer_engine_options,
+           dynamic_calib_engine_options):
+    if _IsQuantizationMode(precision_mode):
+      if not dynamic_calib_engine and dynamic_infer_engine:
+        # TODO(aaroey): test this case, the conversion from static calibration
+        # engine to dynamic inference engine should be a noop.
+        continue
+      if use_optimizer:
+        # TODO(aaroey): if use_optimizer is True we need to get the inference
+        # graphdef using custom python wrapper class, which is not currently
+        # supported yet.
+        continue
+      if not dynamic_calib_engine:
+        # TODO(aaroey): construction of static calibration engine is not
+        # supported yet.
+        continue
+      if dynamic_calib_engine and not dynamic_infer_engine:
+        # TODO(aaroey): construction of static inference engine using dynamic
+        # calibration engine is not supported yet.
+        continue
+    else:  # In non int8 mode.
+      if dynamic_calib_engine:
+        # dynamic_calib_engine doesn't affect non-int8 modes, so just let
+        # related tests run once on dynamic_calib_engine=False.
+        continue
+
+    conversion = "OptimizerConversion" if use_optimizer else "ToolConversion"
+    infer_engine_type = ("DynamicInferEngine"
+                         if dynamic_infer_engine else "StaticInferEngine")
+    calib_engine_type = ""
+    if precision_mode == "INT8":
+      calib_engine_type = ("DynamicCalibEngine"
+                           if dynamic_calib_engine else "StaticCalibEngine")
+    test_name = "%s_%s_%s%s" % (conversion, precision_mode, infer_engine_type,
+                                ("_" + calib_engine_type)
+                                if len(calib_engine_type) else "")
+    setattr(
+        test_class, "testTfTRT_" + test_name,
+        _GetTest(use_optimizer, precision_mode, dynamic_infer_engine,
+                 dynamic_calib_engine))
+
+
+if trt_convert.is_tensorrt_enabled():
+  _AddTests(TfTrtIntegrationTestBase)
diff --git a/tensorflow/contrib/tensorrt/test/unary_test.py b/tensorflow/contrib/tensorrt/test/unary_test.py
new file mode 100644
index 0000000000..b9e977cf67
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/unary_test.py
@@ -0,0 +1,110 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_array_ops
+from tensorflow.python.ops import gen_math_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class UnaryTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Test for unary operations in TF-TRT."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [12, 5, 8, 1, 1, 12]
+    input2_name = "input_2"
+    input2_dims = [12, 5, 8, 1, 12, 1, 1]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      q = math_ops.abs(x)
+      q = q + 1.0
+      q = gen_math_ops.exp(q)
+      q = gen_math_ops.log(q)
+      q = array_ops.squeeze(q, axis=-2)
+      q = math_ops.abs(q)
+      q = q + 2.2
+      q = gen_math_ops.sqrt(q)
+      q = gen_math_ops.rsqrt(q)
+      q = math_ops.negative(q)
+      q = array_ops.squeeze(q, axis=3)
+      q = math_ops.abs(q)
+      q = q + 3.0
+      a = gen_math_ops.reciprocal(q)
+
+      x = constant_op.constant(np.random.randn(5, 8, 12), dtype=dtype)
+      q = math_ops.abs(x)
+      q = q + 2.0
+      q = gen_math_ops.exp(q)
+      q = gen_math_ops.log(q)
+      q = math_ops.abs(q)
+      q = q + 2.1
+      q = gen_math_ops.sqrt(q)
+      q = gen_math_ops.rsqrt(q)
+      q = math_ops.negative(q)
+      q = math_ops.abs(q)
+      q = q + 4.0
+      b = gen_math_ops.reciprocal(q)
+
+      # TODO(jie): this one will break, broadcasting on batch.
+      x = array_ops.placeholder(
+          dtype=dtype, shape=input2_dims, name=input2_name)
+      q = math_ops.abs(x)
+      q = q + 5.0
+      q = gen_math_ops.exp(q)
+      q = array_ops.squeeze(q, axis=[-1, -2, 3])
+      q = gen_math_ops.log(q)
+      q = math_ops.abs(q)
+      q = q + 5.1
+      q = gen_array_ops.reshape(q, [12, 5, 1, 1, 8, 1, 12])
+      q = array_ops.squeeze(q, axis=[5, 2, 3])
+      q = gen_math_ops.sqrt(q)
+      q = math_ops.abs(q)
+      q = q + 5.2
+      q = gen_math_ops.rsqrt(q)
+      q = math_ops.negative(q)
+      q = math_ops.abs(q)
+      q = q + 5.3
+      c = gen_math_ops.reciprocal(q)
+
+      q = a * b
+      q = q / c
+      array_ops.squeeze(q, name=self.output_name)
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name, input2_name],
+        input_dims=[input_dims, input2_dims],
+        num_expected_engines=5,
+        expected_output_dims=(12, 5, 8, 12),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/vgg_block_nchw_test.py b/tensorflow/contrib/tensorrt/test/vgg_block_nchw_test.py
new file mode 100644
index 0000000000..2b134c3bce
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/vgg_block_nchw_test.py
@@ -0,0 +1,82 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.ops import nn_impl
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.platform import test
+
+
+class VGGBlockNCHWTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Single vgg layer in NCHW unit tests in TF-TRT."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [5, 2, 8, 8]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      x, _, _ = nn_impl.fused_batch_norm(
+          x,
+          np.random.randn(2).astype(np.float32),
+          np.random.randn(2).astype(np.float32),
+          mean=np.random.randn(2).astype(np.float32),
+          variance=np.random.randn(2).astype(np.float32),
+          data_format="NCHW",
+          is_training=False)
+      e = constant_op.constant(
+          np.random.randn(1, 1, 2, 6), name="weights", dtype=dtype)
+      conv = nn.conv2d(
+          input=x,
+          filter=e,
+          data_format="NCHW",
+          strides=[1, 1, 2, 2],
+          padding="SAME",
+          name="conv")
+      b = constant_op.constant(np.random.randn(6), name="bias", dtype=dtype)
+      t = nn.bias_add(conv, b, data_format="NCHW", name="biasAdd")
+      relu = nn.relu(t, "relu")
+      idty = array_ops.identity(relu, "ID")
+      v = nn_ops.max_pool(
+          idty, [1, 1, 2, 2], [1, 1, 2, 2],
+          "VALID",
+          data_format="NCHW",
+          name="max_pool")
+      array_ops.squeeze(v, name="output")
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=1,
+        expected_output_dims=(5, 6, 2, 2),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/test/vgg_block_test.py b/tensorflow/contrib/tensorrt/test/vgg_block_test.py
new file mode 100644
index 0000000000..bec2f23eff
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/test/vgg_block_test.py
@@ -0,0 +1,73 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model script to test TF-TensorRT integration."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.tensorrt.test import tf_trt_integration_test_base as trt_test
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import nn
+from tensorflow.python.ops import nn_impl
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.platform import test
+
+
+class VGGBlockTest(trt_test.TfTrtIntegrationTestBase):
+
+  def GetParams(self):
+    """Single vgg layer test in TF-TRT conversion."""
+    dtype = dtypes.float32
+    input_name = "input"
+    input_dims = [5, 8, 8, 2]
+    g = ops.Graph()
+    with g.as_default():
+      x = array_ops.placeholder(dtype=dtype, shape=input_dims, name=input_name)
+      x, _, _ = nn_impl.fused_batch_norm(
+          x,
+          np.random.randn(2).astype(np.float32),
+          np.random.randn(2).astype(np.float32),
+          mean=np.random.randn(2).astype(np.float32),
+          variance=np.random.randn(2).astype(np.float32),
+          is_training=False)
+      e = constant_op.constant(
+          np.random.randn(1, 1, 2, 6), name="weights", dtype=dtype)
+      conv = nn.conv2d(
+          input=x, filter=e, strides=[1, 2, 2, 1], padding="SAME", name="conv")
+      b = constant_op.constant(np.random.randn(6), name="bias", dtype=dtype)
+      t = nn.bias_add(conv, b, name="biasAdd")
+      relu = nn.relu(t, "relu")
+      idty = array_ops.identity(relu, "ID")
+      v = nn_ops.max_pool(
+          idty, [1, 2, 2, 1], [1, 2, 2, 1], "VALID", name="max_pool")
+      array_ops.squeeze(v, name="output")
+    return trt_test.TfTrtIntegrationTestParams(
+        gdef=g.as_graph_def(),
+        input_names=[input_name],
+        input_dims=[input_dims],
+        num_expected_engines=1,
+        expected_output_dims=(5, 2, 2, 6),
+        allclose_atol=1.e-03,
+        allclose_rtol=1.e-03)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/tensorrt/trt_conversion.i b/tensorflow/contrib/tensorrt/trt_conversion.i
index d6628cd1eb..3b1a18f8ac 100644
--- a/tensorflow/contrib/tensorrt/trt_conversion.i
+++ b/tensorflow/contrib/tensorrt/trt_conversion.i
@@ -100,81 +100,18 @@ _LIST_OUTPUT_TYPEMAP(int, PyLong_FromLong);
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/util/stat_summarizer.h"
 #include "tensorflow/contrib/tensorrt/convert/convert_graph.h"
+#include "tensorflow/contrib/tensorrt/convert/utils.h"
 %}
 
 %ignoreall
 %unignore tensorflow;
-%unignore trt_convert;
 %unignore calib_convert;
 %unignore get_linked_tensorrt_version;
 %unignore get_loaded_tensorrt_version;
+%unignore is_tensorrt_enabled;
 
 %{
 
-std::pair<string, string> trt_convert(
-    string graph_def_string,  // The serialized GraphDef string.
-    std::vector<string> output_names,
-    size_t max_batch_size,
-    size_t max_workspace_size_bytes,
-    int precision_mode,
-    int minimum_segment_size,
-    bool is_dyn_op,
-    int max_cached_engines,
-    std::vector<int> cached_engine_batches
-    // Unfortunately we can't use TF_Status here since it
-    // is in c/c_api and brings in a lot of other libraries
-    // which in turn declare ops. These ops are included
-    // statically in our library and cause an abort when
-    // module is loaded due to double registration
-    // until Tensorflow properly exposes these headers
-    // we have to work around this by returning a string
-    // and converting it to exception on python side.
-    //,TF_Status* out_status) {
-) {
-#if GOOGLE_CUDA && GOOGLE_TENSORRT
-  string out_status;
-
-  tensorflow::GraphDef graph_def;
-  if (!graph_def.ParseFromString(graph_def_string)) {
-    out_status = "InvalidArgument;Couldn't interpret input as a GraphDef";
-    return std::pair<string, string>{out_status, ""};
-  }
-
-  if(precision_mode < 0 || precision_mode > 2){
-    out_status = "InvalidArgument;Invalid precision_mode";
-    return std::pair<string, string>{out_status, ""};
-  }
-  if (!output_names.size()) {
-    out_status = "InvalidArgument;Size of the output_names vector is 0";
-    return std::pair<string, string>{out_status, ""};
-  }
-  tensorflow::GraphDef out_graph;
-  tensorflow::Status conversion_status =
-      tensorflow::tensorrt::convert::ConvertGraphDefToTensorRT(
-          graph_def, output_names, max_batch_size, max_workspace_size_bytes,
-          &out_graph, precision_mode, minimum_segment_size,
-          is_dyn_op, max_cached_engines, cached_engine_batches);
-  if (!conversion_status.ok()) {
-    auto retCode = (int)conversion_status.code();
-    char buff[2000];
-    snprintf(buff, 2000, "%d;%s", retCode,
-             conversion_status.error_message().c_str());
-    out_status = buff;
-    return std::pair<string, string>{out_status, ""};
-  }
-  string result;
-  if (!out_graph.SerializeToString(&result)) {
-    out_status = "InvalidArgument;Couldn't serialize output as a GraphDef";
-    return std::pair<string, string>{out_status, ""};
-  }
-  out_status = "OK;All good!";
-  return std::pair<string, string>{out_status, result};
-#else
-  // Returns FAILED_PRECONDITION.
-  return std::pair<string, string>{"9;TensorRT is not enabled!", ""};
-#endif  // GOOGLE_CUDA && GOOGLE_TENSORRT
-}
-
 std::pair<string, string> calib_convert(
     string graph_def_string, bool is_dyn_op
     // unfortunately we can't use TF_Status here since it
@@ -232,7 +169,8 @@ version_struct get_linked_tensorrt_version() {
 #endif  // GOOGLE_CUDA && GOOGLE_TENSORRT
   return s;
 }
-version_struct get_loaded_tensorrt_version(){
+
+version_struct get_loaded_tensorrt_version() {
   // Return the version from the loaded library.
   version_struct s;
 #if GOOGLE_CUDA && GOOGLE_TENSORRT
@@ -244,19 +182,16 @@ version_struct get_loaded_tensorrt_version(){
   return s;
 }
 
-%}
+bool is_tensorrt_enabled() {
+  return tensorflow::tensorrt::IsGoogleTensorRTEnabled();
+}
 
-std::pair<string, string> calib_convert(string graph_def_string, bool is_dyn_op);
+%}
 
-std::pair<string, string> trt_convert(string graph_def_string,
-                                      std::vector<string> output_names,
-                                      size_t max_batch_size,
-                                      size_t max_workspace_size_bytes,
-                                      int precision_mode, int minimum_segment_size,
-                                      bool is_dyn_op,
-                                      int max_cached_engines,
-                                      std::vector<int> cached_engine_batches);
+std::pair<string, string> calib_convert(
+    string graph_def_string, bool is_dyn_op);
 version_struct get_linked_tensorrt_version();
 version_struct get_loaded_tensorrt_version();
+bool is_tensorrt_enabled();
 
 %unignoreall
diff --git a/tensorflow/contrib/timeseries/__init__.py b/tensorflow/contrib/timeseries/__init__.py
index 11db56b1b7..654a4db098 100644
--- a/tensorflow/contrib/timeseries/__init__.py
+++ b/tensorflow/contrib/timeseries/__init__.py
@@ -27,6 +27,9 @@
 
 @@TrainEvalFeatures
 @@FilteringResults
+
+@@TimeSeriesRegressor
+@@OneShotPredictionHead
 """
 
 from __future__ import absolute_import
diff --git a/tensorflow/contrib/timeseries/examples/multivariate.py b/tensorflow/contrib/timeseries/examples/multivariate.py
index ed799542fd..e81cb18ad7 100644
--- a/tensorflow/contrib/timeseries/examples/multivariate.py
+++ b/tensorflow/contrib/timeseries/examples/multivariate.py
@@ -80,8 +80,8 @@ def multivariate_train_and_sample(
                 session=session, steps=1))
         next_sample = numpy.random.multivariate_normal(
             # Squeeze out the batch and series length dimensions (both 1).
-            mean=numpy.squeeze(current_prediction["mean"], axis=[0, 1]),
-            cov=numpy.squeeze(current_prediction["covariance"], axis=[0, 1]))
+            mean=numpy.squeeze(current_prediction["mean"], axis=(0, 1)),
+            cov=numpy.squeeze(current_prediction["covariance"], axis=(0, 1)))
         # Update model state so that future predictions are conditional on the
         # value we just sampled.
         filtering_features = {
diff --git a/tensorflow/contrib/timeseries/python/timeseries/BUILD b/tensorflow/contrib/timeseries/python/timeseries/BUILD
index e4963596d3..0e96c1fbd4 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/BUILD
+++ b/tensorflow/contrib/timeseries/python/timeseries/BUILD
@@ -157,9 +157,11 @@ py_library(
 
 py_test(
     name = "head_test",
+    size = "large",
     srcs = [
         "head_test.py",
     ],
+    shard_count = 4,
     srcs_version = "PY2AND3",
     tags = ["no_pip_gpu"],  # b/63391119
     deps = [
@@ -184,6 +186,7 @@ py_test(
         "//tensorflow/python/saved_model:loader",
         "//tensorflow/python/saved_model:tag_constants",
         "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
         "@six_archive//:six",
     ],
 )
diff --git a/tensorflow/contrib/timeseries/python/timeseries/__init__.py b/tensorflow/contrib/timeseries/python/timeseries/__init__.py
index c683dad71d..8462138339 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/__init__.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/__init__.py
@@ -24,5 +24,6 @@ from tensorflow.contrib.timeseries.python.timeseries import saved_model_utils
 from tensorflow.contrib.timeseries.python.timeseries.ar_model import *
 from tensorflow.contrib.timeseries.python.timeseries.estimators import *
 from tensorflow.contrib.timeseries.python.timeseries.feature_keys import *
+from tensorflow.contrib.timeseries.python.timeseries.head import *
 from tensorflow.contrib.timeseries.python.timeseries.input_pipeline import *
 # pylint: enable=wildcard-import
diff --git a/tensorflow/contrib/timeseries/python/timeseries/estimators.py b/tensorflow/contrib/timeseries/python/timeseries/estimators.py
index 4ec8d26116..0ddc4b4144 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/estimators.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/estimators.py
@@ -37,6 +37,7 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import parsing_ops
 from tensorflow.python.training import training as train
 from tensorflow.python.util import nest
@@ -79,12 +80,137 @@ class TimeSeriesRegressor(estimator_lib.Estimator):
         model_dir=model_dir,
         config=config)
 
-  # TODO(allenl): A parsing input receiver function, which takes a serialized
-  # tf.Example containing all features (times, values, any exogenous features)
-  # and serialized model state (possibly also as a tf.Example).
-  def build_raw_serving_input_receiver_fn(self,
-                                          default_batch_size=None,
-                                          default_series_length=None):
+  def _model_start_state_placeholders(
+      self, batch_size_tensor, static_batch_size=None):
+    """Creates placeholders with zeroed start state for the current model."""
+    gathered_state = {}
+    # Models may not know the shape of their state without creating some
+    # variables/ops. Avoid polluting the default graph by making a new one. We
+    # use only static metadata from the returned Tensors.
+    with ops.Graph().as_default():
+      self._model.initialize_graph()
+      # Evaluate the initial state as same-dtype "zero" values. These zero
+      # constants aren't used, but are necessary for feeding to
+      # placeholder_with_default for the "cold start" case where state is not
+      # fed to the model.
+      def _zeros_like_constant(tensor):
+        return tensor_util.constant_value(array_ops.zeros_like(tensor))
+      start_state = nest.map_structure(
+          _zeros_like_constant, self._model.get_start_state())
+    for prefixed_state_name, state in ts_head_lib.state_to_dictionary(
+        start_state).items():
+      state_shape_with_batch = tensor_shape.TensorShape(
+          (static_batch_size,)).concatenate(state.shape)
+      default_state_broadcast = array_ops.tile(
+          state[None, ...],
+          multiples=array_ops.concat(
+              [batch_size_tensor[None],
+               array_ops.ones(len(state.shape), dtype=dtypes.int32)],
+              axis=0))
+      gathered_state[prefixed_state_name] = array_ops.placeholder_with_default(
+          input=default_state_broadcast,
+          name=prefixed_state_name,
+          shape=state_shape_with_batch)
+    return gathered_state
+
+  def build_one_shot_parsing_serving_input_receiver_fn(
+      self, filtering_length, prediction_length, default_batch_size=None,
+      values_input_dtype=None, truncate_values=False):
+    """Build an input_receiver_fn for export_savedmodel accepting tf.Examples.
+
+    Only compatible with `OneShotPredictionHead` (see `head`).
+
+    Args:
+      filtering_length: The number of time steps used as input to the model, for
+        which values are provided. If more than `filtering_length` values are
+        provided (via `truncate_values`), only the first `filtering_length`
+        values are used.
+      prediction_length: The number of time steps requested as predictions from
+        the model. Times and all exogenous features must be provided for these
+        steps.
+      default_batch_size: If specified, must be a scalar integer. Sets the batch
+        size in the static shape information of all feature Tensors, which means
+        only this batch size will be accepted by the exported model. If None
+        (default), static shape information for batch sizes is omitted.
+      values_input_dtype: An optional dtype specification for values in the
+        tf.Example protos (either float32 or int64, since these are the numeric
+        types supported by tf.Example). After parsing, values are cast to the
+        model's dtype (float32 or float64).
+      truncate_values: If True, expects `filtering_length + prediction_length`
+        values to be provided, but only uses the first `filtering_length`. If
+        False (default), exactly `filtering_length` values must be provided.
+
+    Returns:
+      An input_receiver_fn which may be passed to the Estimator's
+      export_savedmodel.
+
+      Expects features contained in a vector of serialized tf.Examples with
+      shape [batch size] (dtype `tf.string`), each tf.Example containing
+      features with the following shapes:
+        times: [filtering_length + prediction_length] integer
+        values: [filtering_length, num features] floating point. If
+          `truncate_values` is True, expects `filtering_length +
+          prediction_length` values but only uses the first `filtering_length`.
+        all exogenous features: [filtering_length + prediction_length, ...]
+          (various dtypes)
+    """
+    if values_input_dtype is None:
+      values_input_dtype = dtypes.float32
+    if truncate_values:
+      values_proto_length = filtering_length + prediction_length
+    else:
+      values_proto_length = filtering_length
+
+    def _serving_input_receiver_fn():
+      """A receiver function to be passed to export_savedmodel."""
+      times_column = feature_column.numeric_column(
+          key=feature_keys.TrainEvalFeatures.TIMES, dtype=dtypes.int64)
+      values_column = feature_column.numeric_column(
+          key=feature_keys.TrainEvalFeatures.VALUES, dtype=values_input_dtype,
+          shape=(self._model.num_features,))
+      parsed_features_no_sequence = (
+          feature_column.make_parse_example_spec(
+              list(self._model.exogenous_feature_columns)
+              + [times_column, values_column]))
+      parsed_features = {}
+      for key, feature_spec in parsed_features_no_sequence.items():
+        if isinstance(feature_spec, parsing_ops.FixedLenFeature):
+          if key == feature_keys.TrainEvalFeatures.VALUES:
+            parsed_features[key] = feature_spec._replace(
+                shape=((values_proto_length,)
+                       + feature_spec.shape))
+          else:
+            parsed_features[key] = feature_spec._replace(
+                shape=((filtering_length + prediction_length,)
+                       + feature_spec.shape))
+        elif feature_spec.dtype == dtypes.string:
+          parsed_features[key] = parsing_ops.FixedLenFeature(
+              shape=(filtering_length + prediction_length,),
+              dtype=dtypes.string)
+        else:  # VarLenFeature
+          raise ValueError("VarLenFeatures not supported, got %s for key %s"
+                           % (feature_spec, key))
+      tfexamples = array_ops.placeholder(
+          shape=[default_batch_size], dtype=dtypes.string, name="input")
+      features = parsing_ops.parse_example(
+          serialized=tfexamples,
+          features=parsed_features)
+      features[feature_keys.TrainEvalFeatures.TIMES] = array_ops.squeeze(
+          features[feature_keys.TrainEvalFeatures.TIMES], axis=-1)
+      features[feature_keys.TrainEvalFeatures.VALUES] = math_ops.cast(
+          features[feature_keys.TrainEvalFeatures.VALUES],
+          dtype=self._model.dtype)[:, :filtering_length]
+      features.update(
+          self._model_start_state_placeholders(
+              batch_size_tensor=array_ops.shape(
+                  features[feature_keys.TrainEvalFeatures.TIMES])[0],
+              static_batch_size=default_batch_size))
+      return export_lib.ServingInputReceiver(
+          features, {"examples": tfexamples})
+    return _serving_input_receiver_fn
+
+  def build_raw_serving_input_receiver_fn(
+      self, default_batch_size=None, default_series_length=None):
     """Build an input_receiver_fn for export_savedmodel which accepts arrays.
 
     Automatically creates placeholders for exogenous `FeatureColumn`s passed to
@@ -149,34 +275,10 @@ class TimeSeriesRegressor(estimator_lib.Estimator):
                            + batch_only_feature_shape[1:])
           placeholders[feature_key] = array_ops.placeholder(
               dtype=value_dtype, name=feature_key, shape=feature_shape)
-      # Models may not know the shape of their state without creating some
-      # variables/ops. Avoid polluting the default graph by making a new one. We
-      # use only static metadata from the returned Tensors.
-      with ops.Graph().as_default():
-        self._model.initialize_graph()
-        # Evaluate the initial state as same-dtype "zero" values. These zero
-        # constants aren't used, but are necessary for feeding to
-        # placeholder_with_default for the "cold start" case where state is not
-        # fed to the model.
-        def _zeros_like_constant(tensor):
-          return tensor_util.constant_value(array_ops.zeros_like(tensor))
-        start_state = nest.map_structure(
-            _zeros_like_constant, self._model.get_start_state())
       batch_size_tensor = array_ops.shape(time_placeholder)[0]
-      for prefixed_state_name, state in ts_head_lib.state_to_dictionary(
-          start_state).items():
-        state_shape_with_batch = tensor_shape.TensorShape(
-            (default_batch_size,)).concatenate(state.shape)
-        default_state_broadcast = array_ops.tile(
-            state[None, ...],
-            multiples=array_ops.concat(
-                [batch_size_tensor[None],
-                 array_ops.ones(len(state.shape), dtype=dtypes.int32)],
-                axis=0))
-        placeholders[prefixed_state_name] = array_ops.placeholder_with_default(
-            input=default_state_broadcast,
-            name=prefixed_state_name,
-            shape=state_shape_with_batch)
+      placeholders.update(
+          self._model_start_state_placeholders(
+              batch_size_tensor, static_batch_size=default_batch_size))
       return export_lib.ServingInputReceiver(placeholders, placeholders)
 
     return _serving_input_receiver_fn
@@ -288,7 +390,7 @@ class StateSpaceRegressor(TimeSeriesRegressor):
   """An Estimator for general state space models."""
 
   def __init__(self, model, state_manager=None, optimizer=None, model_dir=None,
-               config=None):
+               config=None, head_type=ts_head_lib.TimeSeriesRegressionHead):
     """See TimeSeriesRegressor. Uses the ChainingStateManager by default."""
     if not isinstance(model, state_space_model.StateSpaceModel):
       raise ValueError(
@@ -301,7 +403,8 @@ class StateSpaceRegressor(TimeSeriesRegressor):
         state_manager=state_manager,
         optimizer=optimizer,
         model_dir=model_dir,
-        config=config)
+        config=config,
+        head_type=head_type)
 
 
 class StructuralEnsembleRegressor(StateSpaceRegressor):
@@ -344,7 +447,8 @@ class StructuralEnsembleRegressor(StateSpaceRegressor):
                anomaly_prior_probability=None,
                optimizer=None,
                model_dir=None,
-               config=None):
+               config=None,
+               head_type=ts_head_lib.TimeSeriesRegressionHead):
     """Initialize the Estimator.
 
     Args:
@@ -401,6 +505,8 @@ class StructuralEnsembleRegressor(StateSpaceRegressor):
           from tf.train.Optimizer. Defaults to Adam with step size 0.02.
       model_dir: See `Estimator`.
       config: See `Estimator`.
+      head_type: The kind of head to use for the model (inheriting from
+          `TimeSeriesRegressionHead`).
     """
     if anomaly_prior_probability is not None:
       filtering_postprocessor = StateInterpolatingAnomalyDetector(
@@ -424,4 +530,5 @@ class StructuralEnsembleRegressor(StateSpaceRegressor):
         model=model,
         optimizer=optimizer,
         model_dir=model_dir,
-        config=config)
+        config=config,
+        head_type=head_type)
diff --git a/tensorflow/contrib/timeseries/python/timeseries/head.py b/tensorflow/contrib/timeseries/python/timeseries/head.py
index f236329fdb..d2484d0ef5 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/head.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/head.py
@@ -19,24 +19,22 @@ from __future__ import print_function
 
 import re
 
-from tensorflow.python.training import training_util
-from tensorflow.contrib.layers.python.layers import optimizers
-
 from tensorflow.contrib.timeseries.python.timeseries import feature_keys
-
 from tensorflow.python.estimator import estimator_lib
 from tensorflow.python.estimator.canned import head as head_lib
 from tensorflow.python.estimator.canned import metric_keys
 from tensorflow.python.estimator.export import export_lib
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope
-from tensorflow.python.util import nest
 from tensorflow.python.summary import summary
+from tensorflow.python.training import training_util
+from tensorflow.python.util import nest
 
 
 class _NoStatePredictOutput(export_lib.PredictOutput):
@@ -102,12 +100,9 @@ class TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acce
         use_resource=True):
       model_outputs = self.create_loss(features, mode)
 
-    train_op = optimizers.optimize_loss(
+    train_op = self.optimizer.minimize(
         model_outputs.loss,
-        global_step=training_util.get_global_step(),
-        optimizer=self.optimizer,
-        # Learning rate is set in the Optimizer object
-        learning_rate=None)
+        global_step=training_util.get_global_step())
     return estimator_lib.EstimatorSpec(
         loss=model_outputs.loss,
         mode=mode,
@@ -186,7 +181,7 @@ class TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acce
       return math_ops.cast(value, self.model.dtype)
     if name == feature_keys.PredictionFeatures.STATE_TUPLE:
       return value  # Correct dtypes are model-dependent
-    return ops.convert_to_tensor(value)
+    return sparse_tensor.convert_to_tensor_or_sparse_tensor(value)
 
   def _gather_state(self, features):
     """Returns `features` with state packed, indicates if packing was done."""
@@ -208,6 +203,29 @@ class TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acce
         flat_sequence=[tensor for _, _, tensor in numbered_state])
     return features, True
 
+  def _check_predict_features(self, features):
+    """Raises errors if features are not suitable for prediction."""
+    if feature_keys.PredictionFeatures.TIMES not in features:
+      raise ValueError("Expected a '{}' feature for prediction.".format(
+          feature_keys.PredictionFeatures.TIMES))
+    if feature_keys.PredictionFeatures.STATE_TUPLE not in features:
+      raise ValueError("Expected a '{}' feature for prediction.".format(
+          feature_keys.PredictionFeatures.STATE_TUPLE))
+    times_feature = features[feature_keys.PredictionFeatures.TIMES]
+    if not times_feature.get_shape().is_compatible_with([None, None]):
+      raise ValueError(
+          ("Expected shape (batch dimension, window size) for feature '{}' "
+           "(got shape {})").format(feature_keys.PredictionFeatures.TIMES,
+                                    times_feature.get_shape()))
+    _check_feature_shapes_compatible_with(
+        features=features,
+        compatible_with_name=feature_keys.PredictionFeatures.TIMES,
+        compatible_with_value=times_feature,
+        ignore=set([
+            # Model-dependent shapes
+            feature_keys.PredictionFeatures.STATE_TUPLE
+        ]))
+
   def create_estimator_spec(self, features, mode, labels=None):
     """Performs basic error checking and returns an EstimatorSpec."""
     with ops.name_scope(self._name, "head"):
@@ -236,7 +254,7 @@ class TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acce
           mode == estimator_lib.ModeKeys.EVAL):
         _check_train_eval_features(features, self.model)
       elif mode == estimator_lib.ModeKeys.PREDICT:
-        _check_predict_features(features)
+        self._check_predict_features(features)
       else:
         raise ValueError("Unknown mode '{}' passed to model_fn.".format(mode))
 
@@ -273,6 +291,36 @@ class OneShotPredictionHead(TimeSeriesRegressionHead):
   each time predictions are requested when using this head.
   """
 
+  def _check_predict_features(self, features):
+    """Raises errors if features are not suitable for one-shot prediction."""
+    if feature_keys.PredictionFeatures.TIMES not in features:
+      raise ValueError("Expected a '{}' feature for prediction.".format(
+          feature_keys.PredictionFeatures.TIMES))
+    if feature_keys.TrainEvalFeatures.VALUES not in features:
+      raise ValueError("Expected a '{}' feature for prediction.".format(
+          feature_keys.TrainEvalFeatures.VALUES))
+    if feature_keys.PredictionFeatures.STATE_TUPLE not in features:
+      raise ValueError("Expected a '{}' feature for prediction.".format(
+          feature_keys.PredictionFeatures.STATE_TUPLE))
+    times_feature = features[feature_keys.PredictionFeatures.TIMES]
+    if not times_feature.get_shape().is_compatible_with([None, None]):
+      raise ValueError(
+          ("Expected shape (batch dimension, window size) for feature '{}' "
+           "(got shape {})").format(feature_keys.PredictionFeatures.TIMES,
+                                    times_feature.get_shape()))
+    _check_feature_shapes_compatible_with(
+        features=features,
+        compatible_with_name=feature_keys.PredictionFeatures.TIMES,
+        compatible_with_value=times_feature,
+        ignore=set([
+            # Model-dependent shapes
+            feature_keys.PredictionFeatures.STATE_TUPLE,
+            # One shot prediction head relies on values being shorter than
+            # times. Even though we're predicting eventually, we need values for
+            # the filtering phase.
+            feature_keys.TrainEvalFeatures.VALUES,
+        ]))
+
   def _serving_ops(self, features):
     """Add ops for serving to the graph."""
     with variable_scope.variable_scope("model", use_resource=True):
@@ -339,29 +387,6 @@ def _check_feature_shapes_compatible_with(features,
                times_shape=compatible_with_value.get_shape()))
 
 
-def _check_predict_features(features):
-  """Raises errors if features are not suitable for prediction."""
-  if feature_keys.PredictionFeatures.TIMES not in features:
-    raise ValueError("Expected a '{}' feature for prediction.".format(
-        feature_keys.PredictionFeatures.TIMES))
-  if feature_keys.PredictionFeatures.STATE_TUPLE not in features:
-    raise ValueError("Expected a '{}' feature for prediction.".format(
-        feature_keys.PredictionFeatures.STATE_TUPLE))
-  times_feature = features[feature_keys.PredictionFeatures.TIMES]
-  if not times_feature.get_shape().is_compatible_with([None, None]):
-    raise ValueError(
-        ("Expected shape (batch dimension, window size) for feature '{}' "
-         "(got shape {})").format(feature_keys.PredictionFeatures.TIMES,
-                                  times_feature.get_shape()))
-  _check_feature_shapes_compatible_with(
-      features=features,
-      compatible_with_name=feature_keys.PredictionFeatures.TIMES,
-      compatible_with_value=times_feature,
-      ignore=set([
-          feature_keys.PredictionFeatures.STATE_TUPLE  # Model-dependent shapes
-      ]))
-
-
 def _check_train_eval_features(features, model):
   """Raise errors if features are not suitable for training/evaluation."""
   if feature_keys.TrainEvalFeatures.TIMES not in features:
diff --git a/tensorflow/contrib/timeseries/python/timeseries/head_test.py b/tensorflow/contrib/timeseries/python/timeseries/head_test.py
index ed8f29c321..857e7c5635 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/head_test.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/head_test.py
@@ -18,17 +18,23 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import functools
+import os
+
+from absl.testing import parameterized
 import numpy
 import six
 
 from tensorflow.contrib.estimator.python.estimator import extenders
 from tensorflow.contrib.timeseries.examples import lstm as lstm_example
+from tensorflow.contrib.timeseries.python.timeseries import ar_model
 from tensorflow.contrib.timeseries.python.timeseries import estimators as ts_estimators
 from tensorflow.contrib.timeseries.python.timeseries import feature_keys
 from tensorflow.contrib.timeseries.python.timeseries import head as ts_head_lib
 from tensorflow.contrib.timeseries.python.timeseries import input_pipeline
 from tensorflow.contrib.timeseries.python.timeseries import model
 from tensorflow.contrib.timeseries.python.timeseries import state_management
+from tensorflow.core.example import example_pb2
 
 from tensorflow.python.client import session as session_lib
 from tensorflow.python.estimator import estimator_lib
@@ -317,10 +323,56 @@ class PredictFeatureCheckingTests(test.TestCase):
           mode=estimator_lib.ModeKeys.PREDICT)
 
 
-class OneShotTests(test.TestCase):
-
-  def test_one_shot_prediction_head_export(self):
-    model_dir = self.get_temp_dir()
+def _custom_time_series_regressor(
+    model_dir, head_type, exogenous_feature_columns):
+  return ts_estimators.TimeSeriesRegressor(
+      model=lstm_example._LSTMModel(
+          num_features=5, num_units=128,
+          exogenous_feature_columns=exogenous_feature_columns),
+      optimizer=adam.AdamOptimizer(0.001),
+      config=estimator_lib.RunConfig(tf_random_seed=4),
+      state_manager=state_management.ChainingStateManager(),
+      head_type=head_type,
+      model_dir=model_dir)
+
+
+def _structural_ensemble_regressor(
+    model_dir, head_type, exogenous_feature_columns):
+  return ts_estimators.StructuralEnsembleRegressor(
+      periodicities=None,
+      num_features=5,
+      exogenous_feature_columns=exogenous_feature_columns,
+      head_type=head_type,
+      model_dir=model_dir)
+
+
+def _ar_lstm_regressor(
+    model_dir, head_type, exogenous_feature_columns):
+  return ts_estimators.TimeSeriesRegressor(
+      model=ar_model.ARModel(
+          periodicities=10, input_window_size=10, output_window_size=6,
+          num_features=5,
+          exogenous_feature_columns=exogenous_feature_columns,
+          prediction_model_factory=functools.partial(
+              ar_model.LSTMPredictionModel,
+              num_units=10)),
+      head_type=head_type,
+      model_dir=model_dir)
+
+
+class OneShotTests(parameterized.TestCase):
+
+  @parameterized.named_parameters(
+      {"testcase_name": "ar_lstm_regressor",
+       "estimator_factory": _ar_lstm_regressor},
+      {"testcase_name": "custom_time_series_regressor",
+       "estimator_factory": _custom_time_series_regressor},
+      {"testcase_name": "structural_ensemble_regressor",
+       "estimator_factory": _structural_ensemble_regressor})
+  def test_one_shot_prediction_head_export(self, estimator_factory):
+    def _new_temp_dir():
+      return os.path.join(test.get_temp_dir(), str(ops.uid()))
+    model_dir = _new_temp_dir()
     categorical_column = feature_column.categorical_column_with_hash_bucket(
         key="categorical_exogenous_feature", hash_bucket_size=16)
     exogenous_feature_columns = [
@@ -328,15 +380,10 @@ class OneShotTests(test.TestCase):
             "2d_exogenous_feature", shape=(2,)),
         feature_column.embedding_column(
             categorical_column=categorical_column, dimension=10)]
-    estimator = ts_estimators.TimeSeriesRegressor(
-        model=lstm_example._LSTMModel(
-            num_features=5, num_units=128,
-            exogenous_feature_columns=exogenous_feature_columns),
-        optimizer=adam.AdamOptimizer(0.001),
-        config=estimator_lib.RunConfig(tf_random_seed=4),
-        state_manager=state_management.ChainingStateManager(),
-        head_type=ts_head_lib.OneShotPredictionHead,
-        model_dir=model_dir)
+    estimator = estimator_factory(
+        model_dir=model_dir,
+        exogenous_feature_columns=exogenous_feature_columns,
+        head_type=ts_head_lib.OneShotPredictionHead)
     train_features = {
         feature_keys.TrainEvalFeatures.TIMES: numpy.arange(
             20, dtype=numpy.int64),
@@ -351,7 +398,7 @@ class OneShotTests(test.TestCase):
         num_threads=1, batch_size=16, window_size=16)
     estimator.train(input_fn=train_input_fn, steps=5)
     input_receiver_fn = estimator.build_raw_serving_input_receiver_fn()
-    export_location = estimator.export_savedmodel(self.get_temp_dir(),
+    export_location = estimator.export_savedmodel(_new_temp_dir(),
                                                   input_receiver_fn)
     graph = ops.Graph()
     with graph.as_default():
@@ -385,7 +432,42 @@ class OneShotTests(test.TestCase):
                    for output_key, output_value
                    in predict_signature.outputs.items()}
         output = session.run(fetches, feed_dict=feeds)
-        self.assertAllEqual((2, 15, 5), output["mean"].shape)
+        self.assertEqual((2, 15, 5), output["mean"].shape)
+    # Build a parsing input function, then make a tf.Example for it to parse.
+    export_location = estimator.export_savedmodel(
+        _new_temp_dir(),
+        estimator.build_one_shot_parsing_serving_input_receiver_fn(
+            filtering_length=20, prediction_length=15))
+    graph = ops.Graph()
+    with graph.as_default():
+      with session_lib.Session() as session:
+        example = example_pb2.Example()
+        times = example.features.feature[feature_keys.TrainEvalFeatures.TIMES]
+        values = example.features.feature[feature_keys.TrainEvalFeatures.VALUES]
+        times.int64_list.value.extend(range(35))
+        for i in range(20):
+          values.float_list.value.extend(
+              [float(i) * 2. + feature_number
+               for feature_number in range(5)])
+        real_feature = example.features.feature["2d_exogenous_feature"]
+        categortical_feature = example.features.feature[
+            "categorical_exogenous_feature"]
+        for i in range(35):
+          real_feature.float_list.value.extend([1, 1])
+          categortical_feature.bytes_list.value.append(b"strkey")
+        # Serialize the tf.Example for feeding to the Session
+        examples = [example.SerializeToString()] * 2
+        signatures = loader.load(
+            session, [tag_constants.SERVING], export_location)
+        predict_signature = signatures.signature_def[
+            feature_keys.SavedModelLabels.PREDICT]
+        ((_, input_value),) = predict_signature.inputs.items()
+        feeds = {graph.as_graph_element(input_value.name): examples}
+        fetches = {output_key: graph.as_graph_element(output_value.name)
+                   for output_key, output_value
+                   in predict_signature.outputs.items()}
+        output = session.run(fetches, feed_dict=feeds)
+        self.assertEqual((2, 15, 5), output["mean"].shape)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/tpu/BUILD b/tensorflow/contrib/tpu/BUILD
index c08f088be7..c5855106de 100644
--- a/tensorflow/contrib/tpu/BUILD
+++ b/tensorflow/contrib/tpu/BUILD
@@ -15,8 +15,8 @@ package(
     default_visibility = [
         "//cloud/vmm/testing/tests/tpu:__subpackages__",
         "//learning/brain:__subpackages__",
+        "//learning/deepmind:__subpackages__",
         "//tensorflow:__subpackages__",
-        "//third_party/cloud_tpu:__subpackages__",
     ],
 )
 
@@ -37,6 +37,7 @@ cc_library(
 py_library(
     name = "tpu_estimator",
     srcs = [
+        "python/tpu/error_handling.py",
         "python/tpu/tpu_config.py",
         "python/tpu/tpu_context.py",
         "python/tpu/tpu_estimator.py",
@@ -47,6 +48,8 @@ py_library(
     deps = [
         ":tpu_lib",
         ":tpu_py",
+        "//tensorflow/compiler/xla/experimental/xla_sharding",
+        "//tensorflow/compiler/xla/python_api:xla_shape",
         "//tensorflow/contrib/training:training_py",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
@@ -161,12 +164,55 @@ py_library(
 )
 
 py_library(
+    name = "keras_support",
+    srcs = [
+        "python/tpu/keras_support.py",
+    ],
+    srcs_version = "PY2AND3",
+    visibility = [
+        "//cloud/vmm/testing/tests/tpu:__subpackages__",
+        "//learning/brain:__subpackages__",
+        # TODO(b/111651964): Clean special visibility for keras_support.
+        #
+        # Note: If you are an end user, please do not add your project to this
+        # visibility. This feature is experimental, and will be made public
+        # when ready.
+        "//third_party/cloud_tpu/models/keras:__subpackages__",
+        "//tensorflow:__subpackages__",
+    ],
+    deps = [
+        ":tpu_lib",
+        ":tpu_py",
+        "//tensorflow/contrib/cluster_resolver:tpu_cluster_resolver_py",
+        "//tensorflow/contrib/distribute/python:tpu_strategy",
+        "//tensorflow/contrib/framework:framework_py",
+        "//tensorflow/contrib/tpu/proto:compilation_result_proto_py",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:linalg_ops",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:random_ops",
+        "//tensorflow/python:session",
+        "//tensorflow/python:tensor_spec",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/estimator:model_fn",
+        "//tensorflow/python/keras:backend",
+        "//tensorflow/python/keras:engine",
+        "//tensorflow/python/keras:layers",
+        "//third_party/py/numpy",
+    ],
+)
+
+py_library(
     name = "tpu_lib",
     srcs = [
         "python/tpu/__init__.py",
         "python/tpu/bfloat16.py",
         "python/tpu/device_assignment.py",
-        "python/tpu/keras_support.py",
         "python/tpu/session_support.py",
         "python/tpu/topology.py",
         "python/tpu/tpu.py",
diff --git a/tensorflow/contrib/tpu/__init__.py b/tensorflow/contrib/tpu/__init__.py
index dc90668559..cac346ae30 100644
--- a/tensorflow/contrib/tpu/__init__.py
+++ b/tensorflow/contrib/tpu/__init__.py
@@ -42,9 +42,11 @@
 
 @@TPUEstimator
 @@TPUEstimatorSpec
+@@export_estimator_savedmodel
 @@RunConfig
 @@InputPipelineConfig
 @@TPUConfig
+@@bfloat16_scope
 """
 
 from __future__ import absolute_import
@@ -60,7 +62,7 @@ from tensorflow.contrib.tpu.python.tpu.topology import *
 from tensorflow.contrib.tpu.python.tpu.tpu import *
 from tensorflow.contrib.tpu.python.tpu.tpu_config import *
 from tensorflow.contrib.tpu.python.tpu.tpu_estimator import *
-from tensorflow.contrib.tpu.python.tpu.tpu_feed import *
+from tensorflow.contrib.tpu.python.tpu.tpu_feed import InfeedQueue
 from tensorflow.contrib.tpu.python.tpu.tpu_optimizer import *
 from tensorflow.contrib.tpu.python.tpu.training_loop import *
 # pylint: enable=wildcard-import,unused-import
diff --git a/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc b/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
index f80f5652af..8e6e9aa0cd 100644
--- a/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
+++ b/tensorflow/contrib/tpu/profiler/capture_tpu_profile.cc
@@ -84,8 +84,6 @@ ProfileRequest PopulateProfileRequest(int duration_ms,
   request.add_tools("memory_viewer");
   request.add_tools("overview_page");
   *request.mutable_opts() = opts;
-  std::cout << "Limiting the number of trace events to " << kMaxEvents
-            << std::endl;
   return request;
 }
 
@@ -99,7 +97,6 @@ bool Profile(const string& service_addr, const string& logdir, int duration_ms,
 
   ::grpc::ClientContext context;
   ::grpc::ChannelArguments channel_args;
-  // TODO(ioeric): use `SetMaxReceiveMessageSize` instead once it's available.
   // TODO(qiuminxu): use `NewHostPortGrpcChannel` instead once their
   // `ValidateHostPortPair` checks for empty host string case.
   channel_args.SetInt(GRPC_ARG_MAX_MESSAGE_LENGTH,
@@ -166,6 +163,85 @@ bool NewSession(const string& service_addr,
   return new_session_response.empty_trace();
 }
 
+// Starts tracing on a single or multiple TPU hosts and saves the result in the
+// given logdir. If no trace was collected, retries tracing for
+// num_tracing_attempts.
+void StartTracing(const tensorflow::string& service_addr,
+                  const tensorflow::string& logdir,
+                  const tensorflow::string& workers_list,
+                  bool include_dataset_ops, int duration_ms,
+                  int num_tracing_attempts) {
+  // Use the current timestamp as the run name.
+  tensorflow::string session_id = GetCurrentTimeStampAsString();
+  constexpr char kProfilePluginDirectory[] = "plugins/profile/";
+  tensorflow::string repository_root =
+      io::JoinPath(logdir, kProfilePluginDirectory);
+  std::vector<tensorflow::string> hostnames =
+      tensorflow::str_util::Split(workers_list, ",");
+
+  bool empty_trace = false;
+  int remaining_attempts = num_tracing_attempts;
+  tensorflow::ProfileOptions opts;
+  opts.set_include_dataset_ops(include_dataset_ops);
+  while (true) {
+    std::cout << "Starting to profile TPU traces for " << duration_ms << " ms. "
+              << "Remaining attempt(s): " << remaining_attempts-- << std::endl;
+    if (hostnames.empty()) {
+      empty_trace = tensorflow::tpu::Profile(service_addr, logdir, duration_ms,
+                                             repository_root, session_id, opts);
+    } else {
+      tensorflow::string tpu_master = service_addr;
+      empty_trace =
+          tensorflow::tpu::NewSession(tpu_master, hostnames, duration_ms,
+                                      repository_root, session_id, opts);
+    }
+    if (remaining_attempts <= 0 || !empty_trace) break;
+    std::cout << "No trace event is collected. Automatically retrying."
+              << std::endl
+              << std::endl;
+  }
+
+  if (empty_trace) {
+    std::cout << "No trace event is collected after " << num_tracing_attempts
+              << " attempt(s). "
+              << "Perhaps, you want to try again (with more attempts?)."
+              << std::endl
+              << "Tip: increase number of attempts with --num_tracing_attempts."
+              << std::endl;
+  }
+}
+
+MonitorRequest PopulateMonitorRequest(int duration_ms, int monitoring_level) {
+  MonitorRequest request;
+  request.set_duration_ms(duration_ms);
+  request.set_monitoring_level(monitoring_level);
+  return request;
+}
+
+// Repeatedly collects profiles and shows user-friendly metrics for
+// 'num_queries' time(s).
+void StartMonitoring(const tensorflow::string& service_addr, int duration_ms,
+                     int monitoring_level, int num_queries) {
+  for (int query = 0; query < num_queries; ++query) {
+    MonitorRequest request =
+        PopulateMonitorRequest(duration_ms, monitoring_level);
+
+    ::grpc::ClientContext context;
+    ::grpc::ChannelArguments channel_args;
+    channel_args.SetInt(GRPC_ARG_MAX_MESSAGE_LENGTH,
+                        std::numeric_limits<int32>::max());
+    std::unique_ptr<TPUProfiler::Stub> stub =
+        TPUProfiler::NewStub(::grpc::CreateCustomChannel(
+            "dns:///" + service_addr, ::grpc::InsecureChannelCredentials(),
+            channel_args));
+    MonitorResponse response;
+    TF_QCHECK_OK(FromGrpcStatus(stub->Monitor(&context, request, &response)));
+
+    std::cout << "Xprof Monitoring Results (Sample " << query + 1 << "):\n\n"
+              << response.data() << std::flush;
+  }
+}
+
 }  // namespace
 }  // namespace tpu
 }  // namespace tensorflow
@@ -174,9 +250,11 @@ int main(int argc, char** argv) {
   tensorflow::string FLAGS_service_addr;
   tensorflow::string FLAGS_logdir;
   tensorflow::string FLAGS_workers_list;
-  int FLAGS_duration_ms = 2000;
+  int FLAGS_duration_ms = 0;
   int FLAGS_num_tracing_attempts = 3;
   bool FLAGS_include_dataset_ops = true;
+  int FLAGS_monitoring_level = 0;
+  int FLAGS_num_queries = 100;
   std::vector<tensorflow::Flag> flag_list = {
       tensorflow::Flag("service_addr", &FLAGS_service_addr,
                        "Address of TPU profiler service e.g. localhost:8466"),
@@ -186,21 +264,38 @@ int main(int argc, char** argv) {
       tensorflow::Flag("logdir", &FLAGS_logdir,
                        "Path of TensorBoard log directory e.g. /tmp/tb_log, "
                        "gs://tb_bucket"),
-      tensorflow::Flag("duration_ms", &FLAGS_duration_ms,
-                       "Duration of tracing in ms. Default is 2000ms."),
+      tensorflow::Flag(
+          "duration_ms", &FLAGS_duration_ms,
+          "Duration of tracing or monitoring in ms. Default is 2000ms for "
+          "tracing and 1000ms for monitoring."),
       tensorflow::Flag("num_tracing_attempts", &FLAGS_num_tracing_attempts,
                        "Automatically retry N times when no trace event "
                        "is collected. Default is 3."),
       tensorflow::Flag("include_dataset_ops", &FLAGS_include_dataset_ops,
                        "Set to false to profile longer TPU device traces."),
-  };
+      tensorflow::Flag("monitoring_level", &FLAGS_monitoring_level,
+                       "Choose a monitoring level between 1 and 2 to monitor "
+                       "your TPU job continuously. Level 2 is more verbose "
+                       "than level 1 and shows more metrics."),
+      tensorflow::Flag("num_queries", &FLAGS_num_queries,
+                       "This script will run monitoring for num_queries before "
+                       "it stops.")};
 
   std::cout << "Welcome to the Cloud TPU Profiler v" << TPU_PROFILER_VERSION
             << std::endl;
 
   tensorflow::string usage = tensorflow::Flags::Usage(argv[0], flag_list);
   bool parse_ok = tensorflow::Flags::Parse(&argc, argv, flag_list);
-  if (!parse_ok || FLAGS_service_addr.empty() || FLAGS_logdir.empty()) {
+  if (!parse_ok || FLAGS_service_addr.empty() ||
+      (FLAGS_logdir.empty() && FLAGS_monitoring_level == 0)) {
+    // Fail if flags are not parsed correctly or service_addr not provided.
+    // Also, fail if neither logdir is provided (required for tracing) nor
+    // monitoring level is provided (required for monitoring).
+    std::cout << usage.c_str() << std::endl;
+    return 2;
+  }
+  if (FLAGS_monitoring_level < 0 || FLAGS_monitoring_level > 2) {
+    // Invalid monitoring level.
     std::cout << usage.c_str() << std::endl;
     return 2;
   }
@@ -213,52 +308,27 @@ int main(int argc, char** argv) {
   }
   tensorflow::port::InitMain(argv[0], &argc, &argv);
 
-  // Sets the minimum duration_ms and tracing attempts to one.
-  int duration_ms = std::max(FLAGS_duration_ms, 1);
-  int remaining_attempts = std::max(FLAGS_num_tracing_attempts, 1);
-  tensorflow::ProfileOptions opts;
-  opts.set_include_dataset_ops(FLAGS_include_dataset_ops);
-  tensorflow::ProfileResponse response;
-
-  // Use the current timestamp as the run name.
-  tensorflow::string session_id =
-      tensorflow::tpu::GetCurrentTimeStampAsString();
-  constexpr char kProfilePluginDirectory[] = "plugins/profile/";
-  tensorflow::string repository_root =
-      ::tensorflow::io::JoinPath(FLAGS_logdir, kProfilePluginDirectory);
-  std::vector<tensorflow::string> hostnames =
-      tensorflow::str_util::Split(FLAGS_workers_list, ",");
-
-  bool empty_trace = false;
-  while (true) {
-    std::cout << "Starting to profile TPU traces for " << duration_ms << " ms. "
-              << "Remaining attempt(s): " << remaining_attempts-- << std::endl;
-    if (hostnames.empty()) {
-      empty_trace = tensorflow::tpu::Profile(FLAGS_service_addr, FLAGS_logdir,
-                                             duration_ms, repository_root,
-                                             session_id, opts);
-    } else {
-      tensorflow::string tpu_master = FLAGS_service_addr;
-      empty_trace =
-          tensorflow::tpu::NewSession(tpu_master, hostnames, duration_ms,
-                                      repository_root, session_id, opts);
-    }
-    if (remaining_attempts <= 0 || !empty_trace) break;
-    std::cout << "No trace event is collected. Automatically retrying."
-              << std::endl
-              << std::endl;
+  // Sets the minimum duration_ms, tracing attempts and num queries.
+  int duration_ms = std::max(FLAGS_duration_ms, 0);
+  if (duration_ms == 0) {
+    // If profiling duration was not set by user or set to a negative value, we
+    // set it to default values of 2000ms for tracing and 1000ms for monitoring.
+    duration_ms = FLAGS_monitoring_level == 0 ? 2000 : 1000;
   }
+  int num_tracing_attempts = std::max(FLAGS_num_tracing_attempts, 1);
+  int num_queries = std::max(FLAGS_num_queries, 1);
 
-  if (empty_trace) {
-    std::cout << "No trace event is collected after "
-              << FLAGS_num_tracing_attempts << " attempt(s). "
-              << "Perhaps, you want to try again (with more attempts?)."
-              << std::endl
-              << "Tip: increase number of attempts with --num_tracing_attempts."
+  if (FLAGS_monitoring_level != 0) {
+    std::cout << "Since monitoring level is provided, profile "
+              << FLAGS_service_addr << " for " << duration_ms
+              << "ms and show metrics for " << num_queries << " time(s)."
               << std::endl;
-    // Don't dump profile data if no trace is collected.
-    return 0;
+    tensorflow::tpu::StartMonitoring(FLAGS_service_addr, duration_ms,
+                                     FLAGS_monitoring_level, num_queries);
+  } else {
+    tensorflow::tpu::StartTracing(FLAGS_service_addr, FLAGS_logdir,
+                                  FLAGS_workers_list, FLAGS_include_dataset_ops,
+                                  duration_ms, num_tracing_attempts);
   }
-
   return 0;
 }
diff --git a/tensorflow/contrib/tpu/profiler/pip_package/cloud_tpu_profiler/main.py b/tensorflow/contrib/tpu/profiler/pip_package/cloud_tpu_profiler/main.py
index 7a5d01cca4..438f442848 100644
--- a/tensorflow/contrib/tpu/profiler/pip_package/cloud_tpu_profiler/main.py
+++ b/tensorflow/contrib/tpu/profiler/pip_package/cloud_tpu_profiler/main.py
@@ -50,7 +50,8 @@ flags.DEFINE_string(
 flags.DEFINE_string(
     'logdir', None, 'Path of TensorBoard log directory e.g. /tmp/tb_log, '
     'gs://tb_bucket')
-flags.DEFINE_integer('duration_ms', 2000, 'Duration of tracing in ms.')
+flags.DEFINE_integer('duration_ms', 0,
+                     'Duration of tracing or monitoring in ms.')
 flags.DEFINE_integer(
     'num_tracing_attempts', 3, 'Automatically retry N times when no trace '
     'event is collected.')
@@ -58,6 +59,14 @@ flags.DEFINE_boolean('include_dataset_ops', True,
                      'Set to false to profile longer TPU '
                      'device traces.')
 
+# Monitoring parameters
+flags.DEFINE_integer(
+    'monitoring_level', 0, 'Choose a monitoring level between '
+    '1 and 2 to monitor your TPU job continuously.')
+flags.DEFINE_integer(
+    'num_queries', 100,
+    'This script will run monitoring for num_queries before it stops.')
+
 FLAGS = flags.FLAGS
 EXECUTABLE = 'data/capture_tpu_profile'
 JOB_NAME = 'worker'
@@ -118,6 +127,8 @@ def main(unused_argv=None):
   cmd.append('--duration_ms=' + str(FLAGS.duration_ms))
   cmd.append('--num_tracing_attempts=' + str(FLAGS.num_tracing_attempts))
   cmd.append('--include_dataset_ops=' + str(FLAGS.include_dataset_ops).lower())
+  cmd.append('--monitoring_level=' + str(FLAGS.monitoring_level))
+  cmd.append('--num_queries=' + str(FLAGS.num_queries))
   subprocess.call(cmd)
 
 
diff --git a/tensorflow/contrib/tpu/profiler/tpu_profiler.proto b/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
index f0fca63db0..da4a95e045 100644
--- a/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
+++ b/tensorflow/contrib/tpu/profiler/tpu_profiler.proto
@@ -11,6 +11,9 @@ service TPUProfiler {
   // Starts a profiling session, blocks until it completes, and returns data.
   rpc Profile(ProfileRequest) returns (ProfileResponse) {
   }
+  // Collects profiling data and returns user-friendly metrics.
+  rpc Monitor(MonitorRequest) returns (MonitorResponse) {
+  }
 }
 
 message ProfileOptions {
@@ -104,3 +107,26 @@ message ProfileResponse {
 
   // next-field: 8
 }
+
+message MonitorRequest {
+  // Duration for which to profile between each update.
+  uint64 duration_ms = 1;
+
+  // Indicates the level at which we want to monitor. Currently, two levels are
+  // supported:
+  // Level 1: An ultra lightweight mode that captures only some utilization
+  // metrics.
+  // Level 2: More verbose than level 1. Collects utilization metrics, device
+  // information, step time information, etc. Do not use this option if the TPU
+  // host is being very heavily used.
+  int32 monitoring_level = 2;
+
+  // next-field: 3
+}
+
+message MonitorResponse {
+  // Properly formatted string data that can be directly returned back to user.
+  string data = 1;
+
+  // next-field: 2
+}
diff --git a/tensorflow/contrib/tpu/proto/BUILD b/tensorflow/contrib/tpu/proto/BUILD
index 26016f47df..598b73b438 100644
--- a/tensorflow/contrib/tpu/proto/BUILD
+++ b/tensorflow/contrib/tpu/proto/BUILD
@@ -15,6 +15,16 @@ tf_proto_library(
         "tpu_embedding_config.proto",
     ],
     cc_api_version = 2,
+    protodeps = [":optimization_parameters_proto"],
+    visibility = ["//visibility:public"],
+)
+
+tf_proto_library(
+    name = "optimization_parameters_proto",
+    srcs = [
+        "optimization_parameters.proto",
+    ],
+    cc_api_version = 2,
     visibility = ["//visibility:public"],
 )
 
diff --git a/tensorflow/contrib/tpu/proto/optimization_parameters.proto b/tensorflow/contrib/tpu/proto/optimization_parameters.proto
new file mode 100644
index 0000000000..2cc17d6d92
--- /dev/null
+++ b/tensorflow/contrib/tpu/proto/optimization_parameters.proto
@@ -0,0 +1,164 @@
+syntax = "proto3";
+
+package tensorflow.tpu;
+
+import "google/protobuf/wrappers.proto";
+
+message ClippingLimits {
+  google.protobuf.FloatValue lower = 1;  // -inf if not set
+  google.protobuf.FloatValue upper = 2;  // +inf if not set
+}
+
+// Get the learning rate from a <yet to be determined> source that can change
+// dynamically.
+message DynamicLearningRate {
+}
+
+// Source of learning rate to use.
+message LearningRate {
+  oneof learning_rate {
+    float constant = 1;
+    DynamicLearningRate dynamic = 2;
+  }
+}
+
+message AdagradParameters {
+  float initial_accumulator = 1;
+}
+
+message StochasticGradientDescentParameters {
+}
+
+message FtrlParameters {
+  float l1 = 1;
+  float l2 = 2;
+  float lr_power = 3;
+  float initial_accum = 4;
+  float initial_linear = 5;
+}
+
+// The Adam optimizer does not implement hyper-parameter update; use the dynamic
+// learning rate feature instead, setting the learning rate to:
+// user learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
+// Here, t is the current timestep.
+// https://github.com/tensorflow/tensorflow/blob/ab51450c817674c8ff08a7ae4f8ac50cdc4bed8b/tensorflow/python/training/adam.py#L54
+message AdamParameters {
+  float beta1 = 3;
+  float beta2 = 4;
+  float epsilon = 5;
+  float initial_m = 6;
+  float initial_v = 7;
+}
+
+message MomentumParameters {
+  float momentum = 1;
+  bool use_nesterov = 2;
+  float initial_accum = 3;
+}
+
+message RmsPropParameters {
+  float rho = 1;
+  float momentum = 2;
+  float epsilon = 3;
+  float initial_ms = 4;
+  float initial_mom = 5;
+}
+
+message CenteredRmsPropParameters {
+  float rho = 1;
+  float momentum = 2;
+  float epsilon = 3;
+  float initial_ms = 4;
+  float initial_mom = 5;
+  float initial_mg = 6;
+}
+
+message MdlAdagradLightParameters {
+  float l2 = 1;
+  float lr_power = 2;
+  float min_servable_mdl_benefit = 3;
+  float mdl_mix_in_margin = 4;
+  float mdl_benefit_rampup_coeff = 5;
+  float mdl_min_weight = 6;
+  float benefit_revisit_scale = 7;
+  float max_event_benefit = 8;
+  float max_total_benefit = 9;
+  float mdl_hard_limit = 10;
+  bool hard_limit_min_benefit = 11;
+  bool mdl_regularize = 12;
+  float initial_accumulator = 13;
+  float initial_weight = 14;
+  float initial_benefit = 15;
+}
+
+message AdadeltaParameters {
+  float rho = 1;
+  float epsilon = 2;
+  float initial_accumulator = 3;
+  float initial_update = 4;
+}
+
+message ProximalAdagradParameters {
+  float l1 = 1;
+  float l2 = 2;
+  float initial_accumulator = 3;
+}
+
+message OptimizationParameters {
+  // Learning rate used for updating the embedding layer parameters.
+  LearningRate learning_rate = 13;
+  reserved 1;  // Old learning rate tag.
+
+  // Limits to which to clip the weight values after the backward pass; not
+  // present means no limits are applied.
+  ClippingLimits clipping_limits = 2;
+
+  // Limits to which to clip the backward pass gradient before using it for
+  // updates; not present means no limits are applied.
+  ClippingLimits gradient_clipping_limits = 7;
+
+  // Whether to use gradient accumulation (do two passes over the input
+  // gradients: one to accumulate them into a temporary array and another to
+  // apply them using the actual optimization algorithm).
+  bool use_gradient_accumulation = 15;
+
+  // Optimization algorithm parameters; which field is selected determines which
+  // algorithm to use.
+  oneof parameters {
+    AdagradParameters adagrad = 3;
+    StochasticGradientDescentParameters stochastic_gradient_descent = 4;
+    FtrlParameters ftrl = 5;
+    AdamParameters adam = 6;
+    MomentumParameters momentum = 8;
+    RmsPropParameters rms_prop = 9;
+    CenteredRmsPropParameters centered_rms_prop = 10;
+    MdlAdagradLightParameters mdl_adagrad_light = 11;
+    AdadeltaParameters adadelta = 12;
+    ProximalAdagradParameters proximal_adagrad = 14;
+  }
+}
+
+// Specification of an optimization algorithm's state variables (both the main
+// value vector and any extra accumulators, etc.).
+message StateVariableSpecification {
+  // Parameter name for the state variable.
+  string name = 1;
+
+  // A normal state variable that should be saved and restored in checkpoints
+  // and used as an input or output to non-debug TensorFlow ops.
+  message UserDefined {
+  }
+
+  // A state variable that should be filled with a constant and normally hidden
+  // from users (used for intermediate gradients being accumulated, for
+  // example).
+  message FillWithConstant {
+    double initial_value = 1;
+  }
+
+  // Usage type of this state variable.
+  oneof usage {
+    UserDefined user_defined = 2;
+    FillWithConstant fill_with_constant = 3;
+  }
+}
diff --git a/tensorflow/contrib/tpu/proto/tpu_embedding_config.proto b/tensorflow/contrib/tpu/proto/tpu_embedding_config.proto
index b0ec968d3a..3476cc8953 100644
--- a/tensorflow/contrib/tpu/proto/tpu_embedding_config.proto
+++ b/tensorflow/contrib/tpu/proto/tpu_embedding_config.proto
@@ -2,6 +2,8 @@ syntax = "proto3";
 
 package tensorflow.tpu;
 
+import "tensorflow/contrib/tpu/proto/optimization_parameters.proto";
+
 // The TPUEmbeddingConfiguration contains specification of TPU Embedding lookups
 // and gradient updates separate from the TF Graph.
 message TPUEmbeddingConfiguration {
@@ -30,15 +32,6 @@ message TPUEmbeddingConfiguration {
   // The number of training examples per TensorNode.
   int32 batch_size = 4;
 
-  message GradientDescentOptimizer {
-    float learning_rate = 1;
-  }
-
-  message AdagradOptimizer {
-    float learning_rate = 1;
-    float initial_accumulator = 2;
-  }
-
   // Each Embedding
   message TPUEmbeddingTable {
     // Name of the embedding table. This will be used to name Variables in the
@@ -66,10 +59,7 @@ message TPUEmbeddingConfiguration {
     // separately to the convolutional or recurrent network.
     int32 num_features = 5;
 
-    oneof optimizer {
-      GradientDescentOptimizer gradient_descent = 6;
-      AdagradOptimizer adagrad = 7;
-    }
+    OptimizationParameters optimization_parameters = 6;
   }
 
   repeated TPUEmbeddingTable table_config = 5;
diff --git a/tensorflow/contrib/tpu/python/tpu/device_assignment.py b/tensorflow/contrib/tpu/python/tpu/device_assignment.py
index 726b2d248e..471b1fa46c 100644
--- a/tensorflow/contrib/tpu/python/tpu/device_assignment.py
+++ b/tensorflow/contrib/tpu/python/tpu/device_assignment.py
@@ -175,6 +175,8 @@ class DeviceAssignment(object):
     """Returns the physical topology coordinates of a logical core."""
     if logical_core is None:
       logical_core = np.array([0, 0, 0], np.int32)
+    else:
+      logical_core = np.asarray(logical_core)
 
     if any(logical_core < 0) or any(logical_core >= self.computation_shape):
       raise ValueError("Invalid core {}; computation shape is {}".format(
diff --git a/tensorflow/contrib/tpu/python/tpu/error_handling.py b/tensorflow/contrib/tpu/python/tpu/error_handling.py
new file mode 100644
index 0000000000..52e1ea4237
--- /dev/null
+++ b/tensorflow/contrib/tpu/python/tpu/error_handling.py
@@ -0,0 +1,132 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===================================================================
+"""ErrorRendezvous handler for collecting errors from multiple threads."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import contextlib
+import sys
+import threading
+import time
+
+import six
+
+from tensorflow.python.framework import errors
+from tensorflow.python.platform import tf_logging as logging
+
+_UNINTERESTING_ERRORS = (errors.CancelledError,)
+
+
+class ErrorRendezvous(object):
+  """Resolve errors from multiple threads during TPU execution.
+
+  TPU errors can occur on the infeed or outfeed threads as well as the main
+  training thread.
+
+  Depending on which thread "wins" and receives the session error first, we may
+  end up showing users a confusing and non-actionable error message (session
+  cancelled) instead of a root cause (e.g. a bad filename).
+
+  The rendezvous object provides a location to capture these errors until all
+  threads terminate.  At that point we can choose the most informative error
+  to report.
+  """
+
+  def __init__(self, num_sources):
+    # string -> (message, traceback)
+    self._errors = {}
+    self._num_sources = num_sources
+    self._session_cancel_timer = None
+
+  def record_error(self, source, exc_info, session=None):
+    """Report an exception from the given source.
+
+    If a session is passed, a timer will be registered to close it after a few
+    seconds.  This is necessary to ensure the main training loop does not hang
+    if an infeed/oufeed error occurs.  We sleep a few seconds to allow a more
+    interesting error from another thread to propagate.
+
+    Args:
+      source: string, source of the error
+      exc_info: Output from `sys.exc_info` (type, value, traceback)
+      session: Session to close after delay.
+    """
+    _, value, _ = exc_info
+    self._errors[source] = exc_info
+    logging.info('Error recorded from %s: %s', source, value)
+
+    if session is not None and self._session_cancel_timer is None:
+
+      def _cancel_session():
+        time.sleep(5)
+        try:
+          session.close()
+        except:  # pylint: disable=bare-except
+          pass
+
+      self._session_cancel_timer = threading.Thread(target=_cancel_session,)
+      self._session_cancel_timer.daemon = True
+      self._session_cancel_timer.start()
+
+  def record_done(self, source):
+    """Mark execution source `source` as done.
+
+    If an error was originally reported from `source` it is left intact.
+
+    Args:
+      source: `str`, source being recorded
+    """
+    logging.info('%s marked as finished', source)
+    if source not in self._errors:
+      self._errors[source] = None
+
+  @contextlib.contextmanager
+  def catch_errors(self, source, session=None):
+    """Context manager to report any errors within a block."""
+    try:
+      yield
+    except Exception:  # pylint: disable=broad-except
+      self.record_error(source, sys.exc_info(), session)
+
+  def raise_errors(self, timeout_sec=0):
+    """Wait for up to `timeout` seconds for all error sources to finish.
+
+    Preferentially raise "interesting" errors (errors not in the
+    _UNINTERESTING_ERRORS) set.
+
+    Args:
+      timeout_sec: Seconds to wait for other error sources.
+    """
+    for _ in range(timeout_sec):
+      if len(self._errors) == self._num_sources:
+        break
+      time.sleep(1)
+
+    kept_errors = [(k, v) for (k, v) in self._errors.items() if v is not None]
+
+    # First check for any interesting errors, then fall back on the session
+    # cancelled errors etc.
+    for k, (typ, value, traceback) in kept_errors:
+      if isinstance(value, _UNINTERESTING_ERRORS):
+        continue
+      else:
+        logging.warn('Reraising captured error')
+        six.reraise(typ, value, traceback)
+
+    for k, (typ, value, traceback) in kept_errors:
+      logging.warn('Reraising captured error')
+      six.reraise(typ, value, traceback)
diff --git a/tensorflow/contrib/tpu/python/tpu/keras_support.py b/tensorflow/contrib/tpu/python/tpu/keras_support.py
index 7541544382..81798ee423 100644
--- a/tensorflow/contrib/tpu/python/tpu/keras_support.py
+++ b/tensorflow/contrib/tpu/python/tpu/keras_support.py
@@ -45,6 +45,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import abc
 import collections
 import contextlib
 import re
@@ -59,11 +60,15 @@ from tensorflow.contrib.framework.python.framework import experimental
 from tensorflow.contrib.tpu.proto import compilation_result_pb2 as tpu_compilation_result
 from tensorflow.contrib.tpu.python.ops import tpu_ops
 from tensorflow.contrib.tpu.python.tpu import tpu
+from tensorflow.contrib.tpu.python.tpu import tpu_function
 from tensorflow.contrib.tpu.python.tpu import tpu_optimizer
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.client import session as tf_session
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.estimator import model_fn as model_fn_lib
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_spec
 from tensorflow.python.keras import backend as K
 from tensorflow.python.keras import models
@@ -71,7 +76,9 @@ from tensorflow.python.keras import optimizers as keras_optimizers
 from tensorflow.python.keras.engine import base_layer
 from tensorflow.python.keras.layers import embeddings
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_linalg_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import tf_logging as logging
 
@@ -99,6 +106,45 @@ class TPUEmbedding(embeddings.Embedding):
     return math_ops.tensordot(inputs, self.embeddings, 1)
 
 
+class KerasCrossShardOptimizer(keras_optimizers.Optimizer):
+  """An optimizer that averages gradients across TPU shards."""
+
+  def __init__(self, opt, name='KerasCrossShardOptimizer'):
+    """Construct a new cross-shard optimizer.
+
+    Args:
+      opt: An existing `Optimizer` to encapsulate.
+      name: Optional name prefix for the operations created when applying
+        gradients. Defaults to "KerasCrossShardOptimizer".
+
+    Raises:
+      ValueError: If reduction is not a valid cross-shard reduction.
+    """
+    super(KerasCrossShardOptimizer, self).__init__()
+    self._name = name
+    self._opt = opt
+
+  def get_updates(self, loss, params):
+    logging.info('Get updates: %s', loss)
+    self._opt.get_gradients = self.get_gradients
+    return self._opt.get_updates(loss, params)
+
+  def get_gradients(self, loss, params):
+    num_shards = tpu_function.get_tpu_context().number_of_shards
+    grads = super(KerasCrossShardOptimizer, self).get_gradients(loss, params)
+    return [tpu_ops.cross_replica_sum(grad) / num_shards for grad in grads]
+
+  def set_weights(self, weights):
+    self._opt.set_weights()
+
+  def get_weights(self):
+    return self._opt.get_weights()
+
+  @property
+  def lr(self):
+    return self._opt.lr
+
+
 class TPUModelOp(
     collections.namedtuple('TPUModelOp', [
         'compile_op', 'execute_op', 'infeed_tensors', 'infeed_op', 'outfeed_op'
@@ -113,8 +159,13 @@ def _valid_name(tensor_name):
 
 def _replicated_optimizer(opt):
   """Wrap the optimizer `opt` with CrossShardOptimizer if applicable."""
-  return keras_optimizers.TFOptimizer(
-      optimizer=tpu_optimizer.CrossShardOptimizer(opt.optimizer))
+  if tpu_function.get_tpu_context().number_of_shards == 1:
+    return opt
+
+  if isinstance(opt, keras_optimizers.TFOptimizer):
+    return tpu_optimizer.CrossShardOptimizer(opt.optimizer)
+  else:
+    return KerasCrossShardOptimizer(opt)
 
 
 class TPURewriteContext(object):
@@ -154,7 +205,6 @@ class TPURewriteContext(object):
       caller_obj = caller_frame.f_locals.get('self')
       if (caller_obj is not None and
           isinstance(caller_obj, base_layer.Layer) and name is not None):
-        logging.info('Intercepted name_scope: %s', caller_obj)
         return variable_scope.variable_scope(
             name, default_name, values, reuse=variable_scope.AUTO_REUSE)
 
@@ -163,8 +213,51 @@ class TPURewriteContext(object):
     self._default_placeholder = array_ops.placeholder
     self._default_name_scope = ops.name_scope
     self._default_make_variable = base_layer.make_variable
+    self._default_random_normal = random_ops.random_normal
+    self._default_qr = gen_linalg_ops.qr
 
     array_ops.placeholder = _placeholder
+
+    # Replace random_ops.random_normal with a dummy function because
+    # `random_normal` isn't yet implemented on the TPU. Because these
+    # initialized values are overwritten by the CPU values, this is okay.
+    def random_normal(shape,
+                      mean=0.0,
+                      stddev=1.0,
+                      dtype=dtypes.float32,
+                      seed=None,
+                      name=None):
+      del mean
+      del stddev
+      del seed
+      return array_ops.zeros(shape, dtype=dtype, name=name)
+
+    random_ops.random_normal = random_normal
+
+    # Replace gen_linalg_ops.qr because QR decomposition is not yet implemented.
+    # TODO(saeta): Remove qr override once we confirm the qr implementation is
+    # ok.
+    # pylint: disable=redefined-builtin
+    def qr(input, full_matrices=False, name=None):
+      """Dummy implementation of qr decomposition."""
+      del full_matrices  # TODO(saeta): Properly handle the full matrix case.
+      input_shape = input.shape
+      if len(input_shape) < 2:
+        raise ValueError('Invalid shape passed to qr: %s' % input_shape)
+      p = min(input_shape[-1], input_shape[-2])
+      if len(input_shape) == 2:
+        q = array_ops.zeros((p, p), name=name)
+        r = array_ops.zeros(input_shape, name=name)
+        return (r, q)
+      elif len(input_shape) == 3:
+        n = input_shape[0]
+        q = array_ops.zeros((n, p, p), name=name)
+        r = array_ops.zeros(input_shape, name=name)
+        return (r, q)
+      else:
+        raise ValueError('Invalid shape passed to qr: %s' % input_shape)
+    gen_linalg_ops.qr = qr
+
     ops.name_scope = _name_scope
     base_layer.make_variable = variable_scope.get_variable
     logging.info('Overriding default placeholder.')
@@ -174,6 +267,334 @@ class TPURewriteContext(object):
     array_ops.placeholder = self._default_placeholder
     ops.name_scope = self._default_name_scope
     base_layer.make_variable = self._default_make_variable
+    random_ops.random_normal = self._default_random_normal
+    gen_linalg_ops.qr = self._default_qr
+
+
+class SizedInfeed(collections.namedtuple('SizedInfeed',
+                                         ['sharded_infeed_tensors',
+                                          'infeed_ops'])):
+  """Represents an instantiation of the infeed ops for a concrete input shape.
+
+  sharded_infeed_tensors: A data structure of Tensors used to represent the
+    placeholder tensors that must be fed when using feed_dicts.
+
+  infeed_ops: the set of ops that will be run to drive infeed for a single step.
+  """
+  pass
+
+
+class TPUInfeedInstance(object):
+  """TPUInfeedInstance represents the logic to manage feeding in a single step.
+
+  See the comments on the `TPUInfeedManager` for a description for how infeed
+  is managed.
+  """
+
+  @abc.abstractmethod
+  def make_input_specs(self, input_tensors):
+    """Constructs the infeed_specs for the given Infeed instance.
+
+    Args:
+      input_tensors: The inputs to the model.
+
+    Returns:
+      A list of
+    """
+    pass
+
+  def make_feed_dict(self, tpu_model_op):
+    """Constructs a feed_dict for this instance, given the tpu_model_op.
+
+    Args:
+      tpu_model_op: A `TPUModelOp` representing the TPU Model for this
+        instance's input spec.
+
+    Returns:
+      A dictionary to use as the feed_dict of a `session.run` call.
+    """
+    pass
+
+
+class TPUInfeedManager(object):
+  """TPUInfeedManager manages the data infeeding of data to a TPU computation.
+
+  Because there are multiple data sources (e.g. in-memory NumPy arrays,
+  `tf.data.Dataset`s), we abstract the different logic behind a single
+  interface: the `TPUInfeedManager`.
+
+  (1) A `TPUFunction` is called with a set of inputs. Based on the inputs,
+  `TPUFunction` retrieves the corresponding `TPUInfeedManager` (or constructs a
+  new one if required).
+
+  (2) The `TPUFunction` calls `make_infeed_instance` on the `TPUInfeedManager`
+  which returns a `TPUInfeedInstance`.
+
+  (3) The `TPUFunction` checks in the shape cache for a pre-compiled instance of
+  the model based on the returned `input_specs` from `TPUInfeedInstance`.
+
+  (4) [Optional.] If the model has not already been instantiated for the given
+  input spec, the `TPUFunction` compiles the model for the input spec (using the
+  `TPUInfeedManager`).
+
+  (5) The `TPUInfeedInstance` constructs the session.run's feed_dict given the
+  compiled model instance corresponding to its shape.
+  """
+
+  @abc.abstractmethod
+  def make_infeed_instance(self, inputs):
+    """Given a single step's input, construct a `TPUInfeedInstance`.
+
+    Args:
+      inputs: The inputs to a given step.
+
+    Returns:
+      A subclass of `TPUInfeedInstance`.
+    """
+    pass
+
+  @abc.abstractmethod
+  def build_infeed_from_input_specs(self, input_specs, execution_mode):
+    """For a given input specification (size, type), construct the infeed ops.
+
+    This is called only once for a given input specification and builds the
+    graph ops. It does not have a pointer to the actual infeed data.
+
+    Args:
+      input_specs: TODO(saeta): Document me!
+      execution_mode: TODO(saeta): Document me!
+
+    Returns:
+      A `SizedInfeed` instance.
+    """
+    pass
+
+
+class TPUNumpyInfeedManager(TPUInfeedManager):
+  """TPU Infeed manager for Numpy inputs."""
+
+  class NumpyInfeedInstance(TPUInfeedInstance):
+    """Infeed instance for Numpy inputs."""
+
+    def __init__(self, sharded_inputs):
+      self._sharded_inputs = sharded_inputs
+
+    def make_input_specs(self, input_tensors):
+      # Compute an input specification (used to generate infeed enqueue and
+      # dequeue operations).  We use the shape from our input array and the
+      # dtype from our model.  A user may pass in a float64 for a float32
+      # input: for model compatibility we still must generate a float32 infeed.
+      input_specs = []
+      # We use the shape and dtype from the first shard to compute the input
+      # metadata (`input_specs`); all replicas have the same type and shape.
+      for tensor, ary in zip(input_tensors, self._sharded_inputs[0]):
+        input_specs.append(
+            tensor_spec.TensorSpec(ary.shape, tensor.dtype,
+                                   _valid_name(tensor.name)))
+
+      return input_specs
+
+    def make_feed_dict(self, tpu_model_op):
+      infeed_dict = {}
+      for infeed_tensors, inputs in zip(tpu_model_op.infeed_tensors,
+                                        self._sharded_inputs):
+        for tensor, value in zip(infeed_tensors, inputs):
+          infeed_dict[tensor] = value
+      return infeed_dict
+
+  def __init__(self, distribution_strategy):
+    self._strategy = distribution_strategy
+
+  def _split_tensors(self, inputs):
+    """Split input data across shards.
+
+    Each input is sliced along the batch axis.
+
+    Args:
+      inputs: List of Numpy arrays to run on the TPU.
+
+    Returns:
+      List of lists containing the input to feed to each TPU shard.
+    """
+    if self._strategy.num_towers == 1:
+      return [inputs]
+
+    batch_size = inputs[0].shape[0]
+    assert batch_size % self._strategy.num_towers == 0, (
+        'batch_size must be divisible by strategy.num_towers (%s vs %s)' %
+        (batch_size, self._strategy.num_towers))
+    shard_size = batch_size // self._strategy.num_towers
+    input_list = []
+    for index in range(self._strategy.num_towers):
+      shard_inputs = [
+          x[index * shard_size:(index + 1) * shard_size] for x in inputs
+      ]
+      input_list.append(shard_inputs)
+    return input_list
+
+  def make_infeed_instance(self, inputs):
+    sharded_inputs = self._split_tensors(inputs)
+    return self.NumpyInfeedInstance(sharded_inputs)
+
+  def build_infeed_from_input_specs(self, input_specs, execution_mode):
+    infeed_op = []
+    shard_infeed_tensors = []
+
+    for shard_id in range(self._strategy.num_towers):
+      with ops.device('/device:CPU:0'):
+        infeed_tensors = []
+        with ops.device('/device:TPU:%d' % shard_id):
+          for spec in input_specs:
+            # Construct placeholders for each of the inputs.
+            infeed_tensors.append(
+                array_ops.placeholder(
+                    dtype=spec.dtype,
+                    shape=spec.shape,
+                    name='infeed-enqueue-%s-%d' % (spec.name, shard_id)))
+        shard_infeed_tensors.append(infeed_tensors)
+
+        infeed_op.append(
+            tpu_ops.infeed_enqueue_tuple(
+                infeed_tensors, [spec.shape for spec in input_specs],
+                name='infeed-enqueue-%s-%d' % (execution_mode, shard_id),
+                device_ordinal=shard_id))
+    return SizedInfeed(infeed_ops=infeed_op,
+                       sharded_infeed_tensors=shard_infeed_tensors)
+
+
+class TPUDatasetInfeedManager(TPUInfeedManager):
+  """Manages infeed for a `tf.data.Dataset` into a TPU computation.
+  """
+
+  class DatasetInfeedInstance(TPUInfeedInstance):
+    """An instance of the TPU infeed."""
+
+    def __init__(self, input_specs):
+      self._input_specs = input_specs
+
+    def make_input_specs(self, input_tensors):
+      # TODO(saeta): Do error checking here!
+      return self._input_specs
+
+    def make_feed_dict(self, tpu_model_op):
+      # TODO(saeta): Verify tpu_model_op is as expected!
+      return {}
+
+  def __init__(self, dataset, distribution_strategy, tpu_session):
+    """Constructs a TPUDatasetInfeedManager.
+
+    Must be called within a `KerasTPUModel.tpu_session` context!
+
+    Args:
+      dataset: A `tf.data.Dataset` to infeed.
+      distribution_strategy: The `TPUDistributionStrategy` used to configure the
+        Keras TPU model.
+      tpu_session: The `tf.Session` object used for running the TPU model.
+    """
+    self._verify_dataset_shape(dataset)
+    self._dataset = dataset
+    self._strategy = distribution_strategy
+    dummy_x_shape = dataset.output_shapes[0].as_list()
+    dummy_x_shape[0] *= distribution_strategy.num_towers
+    dummy_y_shape = dataset.output_shapes[1].as_list()
+    dummy_y_shape[0] *= distribution_strategy.num_towers
+    self._iterator = dataset.make_initializable_iterator()
+    tpu_session.run(self._iterator.initializer)
+
+    self._get_next_ops = []
+    ctrl_deps = []
+    for i in range(distribution_strategy.num_towers):
+      with ops.control_dependencies(ctrl_deps):  # Ensure deterministic
+        # TODO(saeta): Ensure correct placement!
+        get_next_op = self._iterator.get_next()
+        self._get_next_ops.append(get_next_op)
+        ctrl_deps.extend(get_next_op)
+
+    # Use dummy numpy inputs for the rest of Keras' shape checking. We
+    # intercept them when building the model.
+    self._dummy_x = np.zeros(dummy_x_shape,
+                             dtype=dataset.output_types[0].as_numpy_dtype)
+    self._dummy_y = np.zeros(dummy_y_shape,
+                             dtype=dataset.output_types[1].as_numpy_dtype)
+
+    input_specs = []
+    if isinstance(self._iterator.output_shapes, tuple):
+      assert isinstance(self._iterator.output_types, tuple)
+      assert len(self._iterator.output_shapes) == len(
+          self._iterator.output_types)
+      for i in range(len(self._iterator.output_shapes)):
+        spec = tensor_spec.TensorSpec(self._iterator.output_shapes[i],
+                                      self._iterator.output_types[i])
+        input_specs.append(spec)
+    elif isinstance(self._iterator.output_shapes, tensor_shape.TensorShape):
+      spec = tensor_spec.TensorSpec(self._iterator.output_shapes,
+                                    self._iterator.output_types)
+      input_specs.append(spec)
+
+    self._infeed_instance = self.DatasetInfeedInstance(input_specs)
+
+  def _verify_dataset_shape(self, dataset):
+    """Verifies a dataset is of an appropriate shape for TPUs."""
+    if not isinstance(dataset, dataset_ops.Dataset):
+      raise ValueError('The function passed as the `x` parameter did not '
+                       'return a `tf.data.Dataset`.')
+    if not isinstance(dataset.output_classes, tuple):
+      raise ValueError('The dataset must return a tuple of tf.Tensors, '
+                       'instead it returns: %s' % dataset.output_classes)
+    if len(dataset.output_classes) != 2:
+      raise ValueError(
+          'The dataset must return a 2-element tuple, got '
+          '%s output classes instead.' % (dataset.output_classes,))
+    for i, cls in enumerate(dataset.output_classes):
+      if cls != ops.Tensor:
+        raise ValueError('The dataset returned a non-Tensor type (%s) at '
+                         'index %d.' % (cls, i))
+    for i, shape in enumerate(dataset.output_shapes):
+      if not shape:
+        raise ValueError('The dataset returns a scalar tensor in '
+                         'tuple index %d. Did you forget to batch? '
+                         '(Output shapes: %s).' % (i,
+                                                   dataset.output_shapes))
+      for j, dim in enumerate(shape):
+        if dim.value is None:
+          if j == 0:
+            hint = (' Hint: did you use `ds.batch(BATCH_SIZE, '
+                    'drop_remainder=True)`?')
+          else:
+            hint = ''
+          raise ValueError(
+              'The Keras-TPU integration for `tf.data` '
+              'currently requires static shapes. The provided '
+              'dataset only has a partially defined shape. '
+              '(Dimension %d of output tensor %d is not statically known '
+              'for output shapes: %s.%s)' % (i, j, dataset.output_shapes, hint))
+
+  @property
+  def dummy_x(self):
+    return self._dummy_x
+
+  @property
+  def dummy_y(self):
+    return self._dummy_y
+
+  def make_infeed_instance(self, inputs):
+    # TODO(saeta): Verify inputs is as expected.
+    return self._infeed_instance
+
+  def build_infeed_from_input_specs(self, input_specs, execution_mode):
+    shard_infeed_tensors = self._get_next_ops
+    assert len(shard_infeed_tensors) == self._strategy.num_towers
+    infeed_ops = []
+    for shard_id in range(self._strategy.num_towers):
+      with ops.device('/device:CPU:0'):
+        infeed_ops.append(
+            tpu_ops.infeed_enqueue_tuple(
+                shard_infeed_tensors[shard_id],
+                [spec.shape for spec in input_specs],
+                name='infeed-enqueue-%s-%d' % (execution_mode, shard_id),
+                device_ordinal=shard_id))
+    return SizedInfeed(infeed_ops=infeed_ops,
+                       sharded_infeed_tensors=shard_infeed_tensors)
 
 
 class TPUFunction(object):
@@ -195,7 +616,13 @@ class TPUFunction(object):
     self._compilation_cache = {}
     self._cloned_model = None
 
-  def _specialize_model(self, input_specs):
+    # Copy optimizer configuration.  This is done prior to `_specialize_model`
+    # as the configuration may require evaluating variables in the CPU session.
+    self._optimizer_config = None
+    if not isinstance(self.model.optimizer, keras_optimizers.TFOptimizer):
+      self._optimizer_config = self.model.optimizer.get_config()
+
+  def _specialize_model(self, input_specs, infeed_manager):
     """Specialize `self.model` (a Keras model) for the given input shapes."""
     # Re-create our input and output layers inside our subgraph.  They will be
     # attached to the true computation when we clone our model in `tpu_fn`.
@@ -221,8 +648,8 @@ class TPUFunction(object):
           name='infeed-%s' % self.execution_mode)
 
       assert len(infeed_tensors) == len(infeed_layers), (
-          'Infeed inputs did not match model: %s vs %s', (infeed_layers,
-                                                          infeed_tensors))
+          'Infeed inputs did not match model: %s vs %s' % (infeed_layers,
+                                                           infeed_tensors))
 
       tpu_targets = []
       tpu_input_map = {}
@@ -236,11 +663,23 @@ class TPUFunction(object):
 
       # Clone our CPU model, running within the TPU device context.
       with TPURewriteContext(tpu_input_map):
-        self._cloned_model = models.clone_model(self.model)
+        # TODO(power): Replicate variables.
+        with ops.device('/device:TPU:0'):
+          self._cloned_model = models.clone_model(self.model)
+
+      # Create a copy of the optimizer for this graph.
+      if isinstance(self.model.optimizer, keras_optimizers.TFOptimizer):
+        cloned_optimizer = keras_optimizers.TFOptimizer(
+            self.model.optimizer.optimizer)
+      else:
+        logging.info('Cloning %s %s', self.model.optimizer.__class__.__name__,
+                     self._optimizer_config)
+        cloned_optimizer = self.model.optimizer.__class__.from_config(
+            self._optimizer_config)
 
       if is_training or is_test:
         self._cloned_model.compile(
-            optimizer=_replicated_optimizer(self.model.optimizer),
+            optimizer=_replicated_optimizer(cloned_optimizer),
             loss=self.model.loss,
             loss_weights=self.model.loss_weights,
             metrics=self.model.metrics,
@@ -299,37 +738,24 @@ class TPUFunction(object):
 
     # Generate CPU side operations to enqueue features/labels and dequeue
     # outputs from the model call.
-    infeed_op = []
+    sized_infeed = infeed_manager.build_infeed_from_input_specs(
+        input_specs, self.execution_mode)
+    # Build output ops.
     outfeed_op = []
-    shard_infeed_tensors = []
-
     for shard_id in range(self._strategy.num_towers):
-      with ops.device('/device:TPU:%d' % shard_id):
-        infeed_tensors = []
-        for spec in input_specs:
-          infeed_tensors.append(
-              array_ops.placeholder(
-                  dtype=spec.dtype,
-                  shape=spec.shape,
-                  name='infeed-enqueue-%s-%d' % (spec.name, shard_id)))
-        shard_infeed_tensors.append(infeed_tensors)
-
-        infeed_op.append(
-            tpu_ops.infeed_enqueue_tuple(
-                infeed_tensors, [spec.shape for spec in input_specs],
-                name='infeed-enqueue-%s-%d' % (self.execution_mode, shard_id)))
-
+      with ops.device('/device:CPU:0'):
         outfeed_op.extend(
             tpu_ops.outfeed_dequeue_tuple(
                 dtypes=[spec.dtype for spec in self._outfeed_spec],
                 shapes=[spec.shape for spec in self._outfeed_spec],
-                name='outfeed-dequeue-%s-%d' % (self.execution_mode, shard_id)))
+                name='outfeed-dequeue-%s-%d' % (self.execution_mode, shard_id),
+                device_ordinal=shard_id))
 
     return TPUModelOp(
         compile_op,
         execute_op,
-        infeed_tensors=shard_infeed_tensors,
-        infeed_op=infeed_op,
+        infeed_tensors=sized_infeed.sharded_infeed_tensors,
+        infeed_op=sized_infeed.infeed_ops,
         outfeed_op=outfeed_op)
 
   def _test_model_compiles(self, tpu_model_ops):
@@ -348,37 +774,17 @@ class TPUFunction(object):
     logging.info('Finished compiling. Time elapsed: %s secs',
                  end_time - start_time)
 
-  def _split_tensors(self, inputs):
-    """Split input data across shards.
-
-    Each input is sliced along the batch axis.
-
-    Args:
-      inputs: List of Numpy arrays to run on the TPU.
-
-    Returns:
-      List of lists containing the input to feed to each TPU shard.
-    """
-    if self._strategy.num_towers == 1:
-      return [inputs]
-
-    batch_size = inputs[0].shape[0]
-    assert batch_size % self._strategy.num_towers == 0, (
-        'batch_size must be divisible by strategy.num_towers (%s vs %s)' %
-        (batch_size, self._strategy.num_towers)
-    )
-    shard_size = batch_size // self._strategy.num_towers
-    input_list = []
-    for index in range(self._strategy.num_towers):
-      shard_inputs = [
-          x[index * shard_size:(index + 1) * shard_size] for x in inputs
-      ]
-      input_list.append(shard_inputs)
-    return input_list
-
   def __call__(self, inputs):
     assert isinstance(inputs, list)
 
+    infeed_manager = None
+    for x, mgr in self.model._numpy_to_infeed_manager_list:
+      if inputs[0] is x:
+        infeed_manager = mgr
+        break
+    if infeed_manager is None:
+      infeed_manager = TPUNumpyInfeedManager(self.model._strategy)
+
     # Strip sample weight from inputs
     if (self.execution_mode == model_fn_lib.ModeKeys.TRAIN or
         self.execution_mode == model_fn_lib.ModeKeys.EVAL):
@@ -387,21 +793,9 @@ class TPUFunction(object):
     else:
       input_tensors = self.model._feed_inputs
 
-    shard_inputs = self._split_tensors(inputs)
+    infeed_instance = infeed_manager.make_infeed_instance(inputs)
     del inputs  # To avoid accident usage.
-
-    # Compute an input specification (used to generate infeed enqueue and
-    # dequeue operations).  We use the shape from our input array and the
-    # dtype from our model.  A user may pass in a float64 for a float32
-    # input: for model compatibility we still must generate a float32 infeed.
-    input_specs = []
-
-    # We use the shape and dtype from the first shard to compute the input
-    # metadata (`input_specs`); all replicas have the same type and shape.
-    for tensor, ary in zip(input_tensors, shard_inputs[0]):
-      input_specs.append(
-          tensor_spec.TensorSpec(ary.shape, tensor.dtype,
-                                 _valid_name(tensor.name)))
+    input_specs = infeed_instance.make_input_specs(input_tensors)
 
     # XLA requires every operation in the graph has a fixed shape.  To
     # handle varying batch sizes we recompile a new sub-graph for each
@@ -412,7 +806,8 @@ class TPUFunction(object):
       with self.model.tpu_session():
         logging.info('New input shapes; (re-)compiling: mode=%s, %s',
                      self.execution_mode, input_specs)
-        new_tpu_model_ops = self._specialize_model(input_specs)
+        new_tpu_model_ops = self._specialize_model(input_specs,
+                                                   infeed_manager)
         self._compilation_cache[shape_key] = new_tpu_model_ops
         self._test_model_compiles(new_tpu_model_ops)
 
@@ -420,11 +815,7 @@ class TPUFunction(object):
     self.model._initialize_weights(self._cloned_model)
     tpu_model_ops = self._compilation_cache[shape_key]
 
-    infeed_dict = {}
-    for infeed_tensors, inputs in zip(tpu_model_ops.infeed_tensors,
-                                      shard_inputs):
-      for tensor, value in zip(infeed_tensors, inputs):
-        infeed_dict[tensor] = value
+    infeed_dict = infeed_instance.make_feed_dict(tpu_model_ops)
 
     with self.model.tpu_session() as session:
       _, _, outfeed_outputs = session.run([
@@ -438,9 +829,8 @@ class TPUFunction(object):
       outputs_per_replica = len(self._outfeed_spec)
 
       for i in range(self._strategy.num_towers):
-        output_group = outfeed_outputs[
-            i * outputs_per_replica:(i+1) * outputs_per_replica
-        ]
+        output_group = outfeed_outputs[i * outputs_per_replica:(i + 1) *
+                                       outputs_per_replica]
         for j in range(outputs_per_replica):
           outputs[j].append(output_group[j])
 
@@ -459,6 +849,11 @@ class KerasTPUModel(models.Model):
         name=cpu_model.name,
     )
 
+    # Create a mapping from numpy arrays to infeed managers.
+    # Note: uses a list of tuples instead of a map because numpy arrays are
+    # not hashable.
+    self._numpy_to_infeed_manager_list = []
+
     self.predict_function = None
     self.test_function = None
     self.train_function = None
@@ -470,14 +865,16 @@ class KerasTPUModel(models.Model):
     self._tpu_weights_initialized = False
     self._graph = ops.Graph()
 
-    cluster_resolver = tpu_cluster_resolver.TPUClusterResolver(
+    self._cluster_resolver = tpu_cluster_resolver.TPUClusterResolver(
         tpu_name_or_address)
-    cluster_spec = cluster_resolver.cluster_spec()
+    master = self._cluster_resolver.master()
+    cluster_spec = self._cluster_resolver.cluster_spec()
     self._session = tf_session.Session(
         graph=self._graph,
-        target=cluster_resolver.master(),
+        target=master,
         config=config_pb2.ConfigProto(isolate_session_state=True))
 
+    # TODO(saeta): Confirm the lines below work in ClusterSpec propagation env.
     if cluster_spec:
       self._session.cluster_def.CopyFrom(cluster_spec.as_cluster_def())
 
@@ -529,10 +926,91 @@ class KerasTPUModel(models.Model):
                               sample_weight_mode, weighted_metrics,
                               target_tensors, **kwargs)
 
-    # Keras optimizers are not compatible with TPU rewrite
-    if not isinstance(self.optimizer, keras_optimizers.TFOptimizer):
+  def fit(self,
+          x=None,
+          y=None,
+          batch_size=None,
+          epochs=1,
+          verbose=1,
+          callbacks=None,
+          validation_split=0.,
+          validation_data=None,
+          shuffle=True,
+          class_weight=None,
+          sample_weight=None,
+          initial_epoch=0,
+          steps_per_epoch=None,
+          validation_steps=None,
+          **kwargs):
+    assert not self._numpy_to_infeed_manager_list  # Ensure empty.
+
+    infeed_managers = []  # Managers to clean up at the end of the fit call.
+    if isinstance(x, dataset_ops.Dataset):
+      # TODO(b/111413240): Support taking a tf.data.Dataset directly.
+      raise ValueError(
+          'Taking a Dataset directly is not yet supported. Please '
+          'wrap your dataset construction code in a function and '
+          'pass that to fit instead. For examples, see: '
+          'https://github.com/tensorflow/tpu/tree/master/models/experimental'
+          '/keras')
+    if callable(x):
+      with self.tpu_session() as sess:
+        dataset = x()
+        if steps_per_epoch is None:
+          raise ValueError('When using tf.data as input to a model, you '
+                           'should specify the steps_per_epoch argument.')
+        if y is not None:
+          raise ValueError('When using tf.data as input to a model, y must be '
+                           'None')
+        infeed_manager = TPUDatasetInfeedManager(dataset, self._strategy, sess)
+        # Use dummy numpy inputs for the rest of Keras' shape checking. We
+        # intercept them when building the model.
+        x = infeed_manager.dummy_x
+        y = infeed_manager.dummy_y
+        infeed_managers.append((x, infeed_manager))
+
+    if isinstance(validation_data, dataset_ops.Dataset):
+      # TODO(b/111413240): Support taking a tf.data.Dataset directly.
       raise ValueError(
-          'Optimizer must be a TFOptimizer, got: %s' % self.optimizer)
+          'Taking a Dataset directly is not yet supported. Please '
+          'wrap your dataset construction code in a function and '
+          'pass that to fit instead. For examples, see: '
+          'https://github.com/tensorflow/tpu/tree/master/models/experimental'
+          '/keras')
+    if callable(validation_data):
+      with self.tpu_session() as sess:
+        dataset = validation_data()
+        if validation_steps is None:
+          raise ValueError('When using tf.data as validation for a model, you '
+                           'should specify the validation_steps argument.')
+        infeed_manager = TPUDatasetInfeedManager(dataset, self._strategy, sess)
+        # Use dummy numpy inputs for the rest of Keras' shape checking. We
+        # intercept them when building the model.
+        val_x = infeed_manager.dummy_x
+        val_y = infeed_manager.dummy_y
+        infeed_managers.append((val_x, infeed_manager))
+        validation_data = (val_x, val_y)
+
+    self._numpy_to_infeed_manager_list = infeed_managers
+    try:
+      return super(KerasTPUModel, self).fit(
+          x,
+          y,
+          batch_size,
+          epochs,
+          verbose,
+          callbacks,
+          validation_split,
+          validation_data,
+          shuffle,
+          class_weight,
+          sample_weight,
+          initial_epoch,
+          steps_per_epoch,
+          validation_steps,
+          **kwargs)
+    finally:
+      self._numpy_to_infeed_manager_list = []
 
   def _make_train_function(self):
     if not self.train_function:
@@ -615,10 +1093,10 @@ class KerasTPUModel(models.Model):
       K.set_session(default_session)
 
   def shutdown(self):
-    logging.info('Shutting down TPU session.')
-    with self.tpu_session() as session:
-      session.run(tpu.shutdown_system())
-
+    # TODO(b/111364423): Actually shut down the system.
+    logging.info('Skipping shutting down TPU system.')
+    # with self.tpu_session() as session:
+    #   session.run(tpu.shutdown_system())
     self._session.close()
 
 
@@ -652,7 +1130,7 @@ Output shape: %(output_shape)s
       'layer': layer,
       'input_shape': layer.input_shape,
       'output_shape': layer.output_shape
-      })
+  })
 
 
 @experimental
@@ -687,6 +1165,10 @@ def tpu_model(model, tpu_name_or_address=None, strategy=None):
   Returns:
     A new `KerasTPUModel` instance.
   """
+  # Force initialization of the CPU model.
+  model.get_weights()
+  model.reset_states()
+
   _validate_shapes(model)
   # TODO(xiejw): Validate TPU model. TPUModel only?
   # TODO(xiejw): Validate replicas. Full or 1. Shall we allow subset?
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu.py b/tensorflow/contrib/tpu/python/tpu/tpu.py
index 6a64893d9a..7994c2c6c7 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu.py
@@ -151,6 +151,41 @@ class TPUReplicateContext(control_flow_ops.XLAControlFlowContext):
     self._name = name
     self._unsupported_ops = []
     self._pivot = pivot
+    self._replicated_vars = {}
+
+  def get_replicated_var_handle(self, var):
+    """Returns a variable handle for replicated TPU variable 'var'.
+
+    This is an method used by an experimental replicated variable
+    implementation and is not intended as a public API.
+
+    Args:
+      var: The replicated TPU variable.
+
+    Returns:
+      The handle of the TPU replicated input node.
+    """
+    handle = self._replicated_vars.get(var)
+    if handle is not None:
+      return handle
+
+    # Builds a TPUReplicatedInput node for the variable, if one does not already
+    # exist. The TPUReplicatedInput node must belong to the enclosing
+    # control-flow scope of the TPUReplicateContext.
+    # TODO(phawkins): consider changing the contract of the TPU encapsulation
+    # so the TPUReplicatedInput nodes go inside the TPUReplicateContext scope
+    # instead.
+
+    # pylint: disable=protected-access
+    graph = ops.get_default_graph()
+    saved_context = graph._get_control_flow_context()
+    graph._set_control_flow_context(self.outer_context)
+    handle = tpu_ops.tpu_replicated_input(
+        [v.handle for v in var._vars], name=var.name + "/handle")
+    graph._set_control_flow_context(saved_context)
+    # pylint: enable=protected-access
+    self._replicated_vars[var] = handle
+    return handle
 
   def report_unsupported_operations(self):
     if self._unsupported_ops:
@@ -279,7 +314,9 @@ class TPUReplicateContext(control_flow_ops.XLAControlFlowContext):
       # Capture the device function stack at the time of first entry
       # since that is the stack that will be used outside_compilation.
       graph = ops.get_default_graph()
-      self._outer_device_function_stack = list(graph._device_function_stack)  # pylint: disable=protected-access
+      # pylint: disable=protected-access
+      self._outer_device_function_stack = graph._device_function_stack.copy()
+      # pylint: enable=protected-access
     super(TPUReplicateContext, self).Enter()
 
   def HostComputeCore(self):
@@ -598,23 +635,14 @@ def split_compile_and_replicate(computation,
     with tpu_function.tpu_shard_context(
         num_replicas), ops.control_dependencies([metadata]):
 
-      # For backward compatibility reasons, we tag replicated inputs with the
-      # _tpu_replicated_input attribute. This does nothing and exists only for
-      # backward compatibility.
-      # TODO(phawkins): delete the attr_scope after 6/28/2018.
-      # pylint: disable=protected-access
-      with graph._attr_scope({
-          "_tpu_replicated_input": attr_value_pb2.AttrValue(b=True)
-      }):
-        # Add identity ops so even unused inputs are "consumed" by the
-        # computation. This is to avoid orphaned TPUReplicatedInput nodes.
-        # TODO(phawkins): consider instead pruning unused TPUReplicatedInput
-        # and eliding trivial TPUReplicatedInput/TPUReplicatedOutput pairs.
-        computation_inputs = [
-            array_ops.identity(x, name="replicated_input_{}".format(i))
-            for i, x in enumerate(computation_inputs)
-        ]
-      # pylint: enable=protected-access
+      # Add identity ops so even unused inputs are "consumed" by the
+      # computation. This is to avoid orphaned TPUReplicatedInput nodes.
+      # TODO(phawkins): consider instead pruning unused TPUReplicatedInput
+      # and eliding trivial TPUReplicatedInput/TPUReplicatedOutput pairs.
+      computation_inputs = [
+          array_ops.identity(x, name="replicated_input_{}".format(i))
+          for i, x in enumerate(computation_inputs)
+      ]
 
       # If there is an infeed queue, adds the dequeued values to the
       # computation's inputs.
@@ -942,8 +970,15 @@ def rewrite(computation,
   Args:
     computation: A Python function that builds a computation to apply
       to the input. If the function takes n inputs, 'inputs' should be
-      a list of n tensors. If the function returns m outputs, rewrite
-      will return a list of m tensors.
+      a list of n tensors.
+
+      `computation` may return a list of operations and tensors.  Tensors must
+      come before operations in the returned list.  The return value of
+      `rewrite` is a list of tensors corresponding to the tensors from the
+      from `computation`.
+
+      All `Operation`s returned from `computation` will be executed when
+      evaluating any of the returned output tensors.
     inputs: A list of input tensors or `None` (equivalent to an empty list).
     infeed_queue: If not `None`, the `InfeedQueue` from which to append a tuple
       of arguments as inputs to `computation`.
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_config.py b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
index 6d7331e3c7..8d05e081a7 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_config.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
@@ -23,8 +23,6 @@ import collections
 import json
 import os
 
-import numpy as np
-
 from tensorflow.contrib.tpu.python.tpu import util as util_lib
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.estimator import run_config as run_config_lib
@@ -43,17 +41,18 @@ class InputPipelineConfig(object):
   PER_SHARD_V1 = 1
   PER_HOST_V1 = 2
   PER_HOST_V2 = 3
+  BROADCAST = 4
 
 
-# TODO(b/72511246) Provide a simplified api to configure model parallelism.
 class TPUConfig(
     collections.namedtuple('TPUConfig', [
         'iterations_per_loop',
         'num_shards',
-        'computation_shape',
+        'num_cores_per_replica',
         'per_host_input_for_training',
         'tpu_job_name',
         'initial_infeed_sleep_secs',
+        'input_partition_dims',
     ])):
   r"""TPU related configuration required by `TPUEstimator`.
 
@@ -67,22 +66,22 @@ class TPUConfig(
       case, this number equals the total number of TPU cores. For
       model-parallelism, the total number of TPU cores equals
       product(computation_shape) * num_shards.
-    computation_shape: Defaults to `None`, which disables model parallelism. A
-      list of size 3 which describes the shape of a model replica's block of
-      cores. This is required by model-parallelism which enables partitioning
-      the model to multiple cores. For example, [2, 2, 1] means the model is
-      partitioned across 4 cores which span two cores in both x and y
-      coordinates.  Please refer to @{tf.contrib.tpu.Topology} for the
-      geometry of a TPU mesh.
+    num_cores_per_replica: Defaults to `None`, which disables model parallelism.
+      An integer which describes the number of TPU cores per model replica. This
+      is required by model-parallelism which enables partitioning
+      the model to multiple cores. Currently num_cores_per_replica must be
+      1, 2, 4, or 8.
     per_host_input_for_training: If `True`, `PER_HOST_V1`, or `PER_HOST_V2`,
-      `input_fn` is invoked per-host rather than per-core. With per-host input
-      pipeline configuration, `input_fn` is invoked once on each host. With the
-      per-core input pipeline configuration, it is invoked once for each core.
+      `input_fn` is invoked once on each host. With the per-core input pipeline
+      configuration, it is invoked once for each core.
       With a global batch size `train_batch_size` in `TPUEstimator` constructor,
       the batch size for each shard is `train_batch_size` // #hosts in the
       `True` or `PER_HOST_V1` mode. In `PER_HOST_V2` mode, it is
-      `train_batch_size` // #cores. With the per-core input pipeline
-      configuration, the shard batch size is also `train_batch_size` // #cores.
+      `train_batch_size` // #cores. In `BROADCAST` mode, `input_fn` is only
+      invoked once on host 0 and the tensors are broadcasted to all other
+      replicas. The batch size equals to train_batch_size`. With the per-core
+      input pipeline configuration, the shard batch size is also
+      `train_batch_size` // #cores.
       Note: per_host_input_for_training==PER_SHARD_V1 only supports mode.TRAIN.
     tpu_job_name: The name of the TPU job. Typically, this name is auto-inferred
       within TPUEstimator, however when using ClusterSpec propagation in more
@@ -91,6 +90,17 @@ class TPUConfig(
     initial_infeed_sleep_secs: The number of seconds the infeed thread should
       wait before enqueueing the first batch. This helps avoid timeouts for
       models that require a long compilation time.
+    input_partition_dims: A nested list to describe the partition dims
+      for all the tensors from input_fn(). The structure of
+      input_partition_dims must match the structure of `features` and
+      `labels` from input_fn(). The total number of partitions must match
+      `num_cores_per_replica`. For example, if input_fn() returns two tensors:
+      images with shape [N, H, W, C] and labels [N].
+      input_partition_dims = [[1, 2, 2, 1], None] will split the images to 4
+      pieces and feed into 4 TPU cores. labels tensor are directly broadcasted
+      to all the TPU cores since the partition dims is `None`.
+      Current limitations: This feature is only supported with the PER_HOST_V2
+      input mode.
 
     Raises:
       ValueError: If `computation_shape` or `computation_shape` are invalid.
@@ -99,10 +109,11 @@ class TPUConfig(
   def __new__(cls,
               iterations_per_loop=2,
               num_shards=None,
-              computation_shape=None,
+              num_cores_per_replica=None,
               per_host_input_for_training=True,
               tpu_job_name=None,
-              initial_infeed_sleep_secs=None):
+              initial_infeed_sleep_secs=None,
+              input_partition_dims=None):
 
     # Check iterations_per_loop.
     util_lib.check_positive_integer(iterations_per_loop,
@@ -112,19 +123,26 @@ class TPUConfig(
     if num_shards is not None:
       util_lib.check_positive_integer(num_shards, 'TPUConfig num_shards')
 
-    # Check computation_shape
-    if computation_shape is not None and len(computation_shape) != 3:
-      raise ValueError(
-          'computation_shape must be a list with length 3 or None; got {}'.
-          format(str(computation_shape)))
+    if input_partition_dims is not None:
+      if len(input_partition_dims) != 1 and len(input_partition_dims) != 2:
+        raise ValueError(
+            'input_partition_dims must be a list/tuple with one or two'
+            ' elements.')
+
+      if per_host_input_for_training is not InputPipelineConfig.PER_HOST_V2:
+        raise ValueError(
+            'input_partition_dims is only supported in PER_HOST_V2 mode.')
+
+      if num_cores_per_replica is None:
+        raise ValueError(
+            'input_partition_dims requires setting num_cores_per_replica.')
 
-    if computation_shape is not None:
-      computation_shape_array = np.asarray(computation_shape, dtype=np.int32)
-      # This prevents any computation being replicated across multiple hosts, so
-      # that each host feeds the same number of computations.
-      if any(computation_shape_array < 1) or any(computation_shape_array > 2):
-        raise ValueError('computation_shape elements can only be 1 or 2; got '
-                         'computation_shape={}'.format(computation_shape))
+    # Parse computation_shape
+    if num_cores_per_replica is not None:
+      if num_cores_per_replica not in [1, 2, 4, 8]:
+        raise ValueError(
+            'num_cores_per_replica must be 1, 2, 4, or 8; got {}'.format(
+                str(num_cores_per_replica)))
 
     # per_host_input_for_training may be True, False, or integer in [1..3].
     # Map legacy values (True, False) to numeric values.
@@ -144,10 +162,11 @@ class TPUConfig(
         cls,
         iterations_per_loop=iterations_per_loop,
         num_shards=num_shards,
-        computation_shape=computation_shape,
+        num_cores_per_replica=num_cores_per_replica,
         per_host_input_for_training=per_host_input_for_training,
         tpu_job_name=tpu_job_name,
-        initial_infeed_sleep_secs=initial_infeed_sleep_secs)
+        initial_infeed_sleep_secs=initial_infeed_sleep_secs,
+        input_partition_dims=input_partition_dims)
 
 
 class RunConfig(run_config_lib.RunConfig):
@@ -214,6 +233,12 @@ class RunConfig(run_config_lib.RunConfig):
         self._session_config.cluster_def.CopyFrom(
             self._cluster_spec.as_cluster_def())
 
+  def _maybe_overwrite_session_config_for_distributed_training(self):
+    # Overrides the parent class session_config overwrite for between-graph. TPU
+    # runs with in-graph, which should not have device filter. Doing nothing
+    # ("pass") basically disables it.
+    pass
+
   @property
   def evaluation_master(self):
     return self._evaluation_master
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_config_test.py b/tensorflow/contrib/tpu/python/tpu/tpu_config_test.py
index 37ef3dbe1e..2326fe97a8 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_config_test.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_config_test.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 import json
 
 from tensorflow.contrib.tpu.python.tpu import tpu_config as tpu_config_lib
+from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.estimator import run_config as run_config_lib
 from tensorflow.python.platform import test
 
@@ -33,6 +34,46 @@ def _set_tf_config_env_variable(tf_config):
 
 class TPURunConfigTest(test.TestCase):
 
+  def test_no_session_config_set_in_local_case(self):
+    run_config = tpu_config_lib.RunConfig()
+    self.assertIsNone(run_config.session_config)
+
+  def test_no_session_config_overwrite_in_local_case(self):
+    session_config = config_pb2.ConfigProto(allow_soft_placement=True)
+    run_config = tpu_config_lib.RunConfig(session_config=session_config)
+    self.assertEqual(session_config, run_config.session_config)
+
+  def test_no_session_config_set_with_cluster_spec(self):
+    tf_config = {
+        'cluster': {
+            run_config_lib.TaskType.CHIEF: ['host3:3'],
+            run_config_lib.TaskType.WORKER: ['host3:4']
+        },
+        'task': {
+            'type': run_config_lib.TaskType.CHIEF,
+            'index': 0
+        }
+    }
+    with _set_tf_config_env_variable(tf_config):
+      run_config = tpu_config_lib.RunConfig()
+      self.assertIsNone(run_config.session_config)
+
+  def test_no_session_config_overwrite_with_cluster_spec(self):
+    tf_config = {
+        'cluster': {
+            run_config_lib.TaskType.CHIEF: ['host3:3'],
+            run_config_lib.TaskType.WORKER: ['host3:4']
+        },
+        'task': {
+            'type': run_config_lib.TaskType.CHIEF,
+            'index': 0
+        }
+    }
+    with _set_tf_config_env_variable(tf_config):
+      session_config = config_pb2.ConfigProto(allow_soft_placement=True)
+      run_config = tpu_config_lib.RunConfig(session_config=session_config)
+      self.assertEqual(session_config, run_config.session_config)
+
   def test_fail_with_invalid_num_shards(self):
     with self.assertRaisesRegexp(ValueError, 'must be positive'):
       tpu_config_lib.RunConfig(
@@ -43,15 +84,11 @@ class TPURunConfigTest(test.TestCase):
       tpu_config_lib.RunConfig(
           tpu_config=tpu_config_lib.TPUConfig(iterations_per_loop=0))
 
-  def test_fail_with_invalid_computation_shape(self):
-    with self.assertRaisesRegexp(ValueError,
-                                 'computation_shape must be a list with length'
-                                 ' 3 or None'):
-      tpu_config_lib.TPUConfig(computation_shape=[2, 1])
-
-    with self.assertRaisesRegexp(ValueError,
-                                 'computation_shape elements can only be'):
-      tpu_config_lib.TPUConfig(computation_shape=[1, 3, 1])
+  def test_fail_with_invalid_num_cores_per_replica(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'num_cores_per_replica must be 1, 2, 4, or 8;'
+        ' got 7'):
+      tpu_config_lib.TPUConfig(num_cores_per_replica=7)
 
 
 class TPURunConfigMasterTest(test.TestCase):
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_context.py b/tensorflow/contrib/tpu/python/tpu/tpu_context.py
index aec59f3885..2c054360a4 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_context.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_context.py
@@ -21,8 +21,6 @@ from __future__ import print_function
 from contextlib import contextmanager
 import copy
 
-import numpy as np
-
 from tensorflow.contrib.tpu.python.tpu import device_assignment  as tpu_device_assignment
 from tensorflow.contrib.tpu.python.tpu import tpu_config
 from tensorflow.contrib.tpu.python.tpu import tpu_system_metadata as tpu_system_metadata_lib
@@ -33,15 +31,26 @@ from tensorflow.python.platform import tf_logging as logging
 _DEFAULT_JOB_NAME = 'tpu_worker'
 _DEFAULT_COORDINATOR_JOB_NAME = 'coordinator'
 _LOCAL_MASTERS = ('', 'local')
+_NUM_CORES_TO_COMPUTATION_SHAPE = {
+    1: [1, 1, 1],
+    2: [1, 1, 2],
+    4: [1, 2, 2],
+    8: [2, 2, 2]
+}
 
 
 class TPUContext(object):
   """The context of current input_fn invocation."""
 
-  def __init__(self, internal_ctx, input_device=None, invocation_index=None):
+  def __init__(self,
+               internal_ctx,
+               input_device=None,
+               invocation_index=None,
+               call_from_input_fn=True):
     self._internal_ctx = internal_ctx
     self._input_device = input_device
     self._invocation_index = invocation_index
+    self._call_from_input_fn = call_from_input_fn
 
   def current_input_fn_deployment(self):
     """The configuration of the current input_fn invocation.
@@ -69,11 +78,21 @@ class TPUContext(object):
            total invocation count is equal to the number of hosts in the system
            and num replicas consumed by current invocation is equal to number of
            cores per host.
+
+    Raises:
+      RuntimeError: If this method must not be called from input_fn.
     """
+    if not self._call_from_input_fn:
+      raise RuntimeError('This TPUContext instance must not be called from'
+                         ' model_fn.')
+
     if self._internal_ctx.is_input_sharded_per_core():
       total_invocation_count = (self._internal_ctx.num_hosts
                                 * self._internal_ctx.num_of_replicas_per_host)
       replicas_consumed = 1
+    elif self._internal_ctx.is_input_broadcast_with_iterators():
+      total_invocation_count = 1
+      replicas_consumed = self._internal_ctx.num_replicas
     else:
       total_invocation_count = self._internal_ctx.num_hosts
       replicas_consumed = self._internal_ctx.num_of_replicas_per_host
@@ -105,6 +124,14 @@ class TPUContext(object):
           'num_of_replicas_per_host is not supported for model_parallelism')
     return self._internal_ctx.num_of_replicas_per_host
 
+  @property
+  def device_assignment(self):
+    """Returns device_assignment object."""
+    if self._call_from_input_fn:
+      raise RuntimeError('This TPUContext instance must not be called from'
+                         ' input_fn.')
+    return self._internal_ctx.device_assignment
+
   def device_for_replica(self, replica_id):
     """Returns the tuple of (CPU device and device ordinal) for replica.
 
@@ -119,24 +146,7 @@ class TPUContext(object):
     # Note that: For the non-model parallelism, the mapping could be
     # a random permutation. The order should not matter in most cases
     # as far as model is replicated to all cores in the system.
-
-    # If the precise replica_id to device mapping is required, please
-    # set the computation_shape as [1,1,1] in TPUConfig to enable
-    # the model parallelism.
-    if self._internal_ctx.model_parallelism_enabled:
-      return RuntimeError(
-          'device_for_replica is not yet implemented for model parallelism. '
-          'b/79689078.')
-
-    master = self._internal_ctx.master_job
-    job_device = '' if master is None else ('/job:%s' % master)
-
-    num_of_replicas_per_host = self._internal_ctx.num_of_replicas_per_host
-    host_id = replica_id / num_of_replicas_per_host
-    ordinal_id = replica_id % num_of_replicas_per_host
-
-    host_device = '%s/task:%d/device:CPU:0' % (job_device, host_id)
-    return (host_device, ordinal_id)
+    return self._internal_ctx.device_for_replica(replica_id)
 
 
 class _InternalTPUContext(object):
@@ -175,9 +185,14 @@ class _InternalTPUContext(object):
 
     self._eval_on_tpu = eval_on_tpu
     self._model_parallelism_enabled = (
-        use_tpu and config.tpu_config.computation_shape)
+        use_tpu and config.tpu_config.num_cores_per_replica)
     self._mode = None
-
+    num_cores_per_replica = config.tpu_config.num_cores_per_replica
+    if num_cores_per_replica:
+      self._computation_shape = _NUM_CORES_TO_COMPUTATION_SHAPE[
+          num_cores_per_replica]
+    else:
+      self._computation_shape = None
     self._lazy_tpu_system_metadata_dict = {}  # key by master address
     self._lazy_device_assignment_dict = {}  # key by master address
     self._lazy_validation_dict = {}  # key by ModeKeys
@@ -238,11 +253,12 @@ class _InternalTPUContext(object):
 
     device_assignment = tpu_device_assignment.device_assignment(
         tpu_system_metadata.topology,
-        computation_shape=self._config.tpu_config.computation_shape,
+        computation_shape=self._computation_shape,
         num_replicas=self.num_replicas)
 
-    logging.info('computation_shape: %s',
-                 str(self._config.tpu_config.computation_shape))
+    logging.info('num_cores_per_replica: %s',
+                 str(self._config.tpu_config.num_cores_per_replica))
+    logging.info('computation_shape: %s', str(self._computation_shape))
     logging.info('num_replicas: %d', self.num_replicas)
     logging.info('device_assignment.topology.device_coordinates: %s',
                  str(device_assignment.topology.device_coordinates))
@@ -257,6 +273,10 @@ class _InternalTPUContext(object):
     return self._model_parallelism_enabled
 
   @property
+  def input_partition_dims(self):
+    return self._config.tpu_config.input_partition_dims
+
+  @property
   def device_assignment(self):
     return (self._get_device_assignment()
             if self._model_parallelism_enabled else None)
@@ -283,23 +303,20 @@ class _InternalTPUContext(object):
     num_cores_in_system = self.num_cores
 
     if self.model_parallelism_enabled:
-      computation_shape_array = np.asarray(
-          self._config.tpu_config.computation_shape, dtype=np.int32)
-      num_cores_per_replica = np.prod(computation_shape_array)
+      num_cores_per_replica = self._config.tpu_config.num_cores_per_replica
       if num_cores_per_replica > num_cores_in_system:
         raise ValueError(
             'The num of cores required by the model parallelism, specified by '
-            'TPUConfig.computation_shape, is larger than the total num of '
-            'TPU cores in the system. computation_shape: {}, num cores '
-            'in the system: {}'.format(
-                self._config.tpu_config.computation_shape,
-                num_cores_in_system))
+            'TPUConfig.num_cores_per_replica, is larger than the total num of '
+            'TPU cores in the system. num_cores_per_replica: {}, num cores '
+            'in the system: {}'.format(num_cores_per_replica,
+                                       num_cores_in_system))
 
       if num_cores_in_system % num_cores_per_replica != 0:
         raise RuntimeError(
             'The num of cores in the system ({}) is not divisible by the num '
             'of cores ({}) required by the model parallelism, specified by '
-            'TPUConfig.computation_shape. This should never happen!'.format(
+            'TPUConfig.num_cores_per_replica. This should never happen!'.format(
                 num_cores_in_system, num_cores_per_replica))
 
       return num_cores_in_system // num_cores_per_replica
@@ -327,6 +344,11 @@ class _InternalTPUContext(object):
     return (self._config.tpu_config.per_host_input_for_training is
             tpu_config.InputPipelineConfig.PER_HOST_V2)
 
+  def is_input_broadcast_with_iterators(self):
+    """Return true if input_fn should be run in the full_replicae config."""
+    return (self._config.tpu_config.per_host_input_for_training is
+            tpu_config.InputPipelineConfig.BROADCAST)
+
   def is_running_on_cpu(self, is_export_mode=False):
     """Determines whether the input_fn and model_fn should be invoked on CPU.
 
@@ -391,7 +413,7 @@ class _InternalTPUContext(object):
     """Returns the shard batch size for `input_fn`."""
     global_batch_size = self.global_batch_size
 
-    if self.is_running_on_cpu():
+    if (self.is_running_on_cpu() or self.is_input_broadcast_with_iterators()):
       return global_batch_size
 
     # On TPU
@@ -406,7 +428,7 @@ class _InternalTPUContext(object):
     """Returns the shard batch size for `model_fn`."""
     global_batch_size = self.global_batch_size
 
-    if self.is_running_on_cpu():
+    if (self.is_running_on_cpu() or self.is_input_broadcast_with_iterators()):
       return global_batch_size
 
     # On TPU. always sharded per shard.
@@ -463,17 +485,23 @@ class _InternalTPUContext(object):
 
     master = self.master_job
 
-    def _placement_function(_sentinal=None, core_id=None, host_id=None):  # pylint: disable=invalid-name
+    def _placement_function(_sentinal=None, replica_id=None, host_id=None):  # pylint: disable=invalid-name
+      """Return the host device given replica_id or host_id."""
       assert _sentinal is None
-      if core_id is not None and host_id is not None:
+      if replica_id is not None and host_id is not None:
         raise RuntimeError(
-            'core_id and host_id can have only one non-None value.')
+            'replica_id and host_id can have only one non-None value.')
 
       if master is None:
         return '/replica:0/task:0/device:CPU:0'
       else:
-        if core_id is not None:
-          host_id = core_id / self.num_of_cores_per_host
+        if replica_id is not None:
+          if self.model_parallelism_enabled:
+            return self.device_assignment.host_device(
+                replica=replica_id, job=master)
+          else:
+            host_id = replica_id / self.num_of_cores_per_host
+
         return '/job:%s/task:%d/device:CPU:0' % (master, host_id)
 
     return _placement_function
@@ -546,7 +574,7 @@ class _InternalTPUContext(object):
             'be ({}), got ({}). For non-model-parallelism, num_replicas should '
             'be the total num of TPU cores in the system. For '
             'model-parallelism, the total number of TPU cores should be '
-            'product(computation_shape) * num_replicas. Please set it '
+            'num_cores_per_replica * num_replicas. Please set it '
             'accordingly or leave it as `None`'.format(
                 self._get_master_address(), num_replicas,
                 user_provided_num_replicas))
@@ -554,7 +582,8 @@ class _InternalTPUContext(object):
         raise ValueError(message)
 
     if mode == model_fn_lib.ModeKeys.TRAIN:
-      if self._train_batch_size % num_replicas != 0:
+      if (self._train_batch_size % num_replicas != 0 and
+          not self.is_input_broadcast_with_iterators()):
         raise ValueError(
             'train batch size {} must be divisible by number of replicas {}'
             .format(self._train_batch_size, num_replicas))
@@ -564,11 +593,12 @@ class _InternalTPUContext(object):
         raise ValueError(
             'eval_batch_size in TPUEstimator constructor cannot be `None`'
             'if .evaluate is running on TPU.')
-      if self._eval_batch_size % num_replicas != 0:
+      if (self._eval_batch_size % num_replicas != 0 and
+          not self.is_input_broadcast_with_iterators()):
         raise ValueError(
             'eval batch size {} must be divisible by number of replicas {}'
             .format(self._eval_batch_size, num_replicas))
-      if num_hosts > 1:
+      if num_hosts > 1 and not self.is_input_broadcast_with_iterators():
         raise ValueError(
             'TPUEstimator.evaluate should be running on single TPU worker. '
             'got {}.'.format(num_hosts))
@@ -578,11 +608,12 @@ class _InternalTPUContext(object):
         raise ValueError(
             'predict_batch_size in TPUEstimator constructor should not be '
             '`None` if .predict is running on TPU.')
-      if self._predict_batch_size % num_replicas != 0:
+      if (self._predict_batch_size % num_replicas != 0 and
+          not self.is_input_broadcast_with_iterators()):
         raise ValueError(
             'predict batch size {} must be divisible by number of replicas {}'
             .format(self._predict_batch_size, num_replicas))
-      if num_hosts > 1:
+      if num_hosts > 1 and not self.is_input_broadcast_with_iterators():
         raise ValueError(
             'TPUEstimator.predict should be running on single TPU worker. '
             'got {}.'.format(num_hosts))
@@ -590,6 +621,33 @@ class _InternalTPUContext(object):
     # Record the state "validated" into lazy dictionary.
     self._lazy_validation_dict[mode] = True
 
+  def device_for_replica(self, replica_id):
+    """Returns the tuple of (CPU device and device ordinal) for replica.
+
+    This should be used for full replicate for non-model-parallelism.
+
+    Args:
+       replica_id: Int, the replica index.
+
+    Returns:
+       A tuple of device spec for CPU device and int device ordinal.
+    """
+    master = self.master_job
+
+    if self.model_parallelism_enabled:
+      return (self.device_assignment.host_device(
+          replica=replica_id, job=master),
+              self.device_assignment.tpu_ordinal(replica=replica_id))
+
+    job_device = '' if master is None else ('/job:%s' % master)
+
+    num_of_replicas_per_host = self.num_of_replicas_per_host
+    host_id = replica_id / num_of_replicas_per_host
+    ordinal_id = replica_id % num_of_replicas_per_host
+
+    host_device = '%s/task:%d/device:CPU:0' % (job_device, host_id)
+    return (host_device, ordinal_id)
+
 
 class _OneCoreTPUContext(_InternalTPUContext):
   """Special _InternalTPUContext for one core usage."""
@@ -625,7 +683,7 @@ def _get_tpu_context(config, train_batch_size, eval_batch_size,
   """Returns an instance of `_InternalTPUContext`."""
 
   if (config.tpu_config.num_shards == 1 and
-      config.tpu_config.computation_shape is None):
+      config.tpu_config.num_cores_per_replica is None):
     logging.warning(
         'Setting TPUConfig.num_shards==1 is an unsupported behavior. '
         'Please fix as soon as possible (leaving num_shards as None.')
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
index 49cd318b89..bd8f2c99a8 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
@@ -22,9 +22,9 @@ import collections
 import copy
 import os
 import signal
+import sys
 import threading
 import time
-import traceback
 
 import numpy as np
 import six
@@ -32,6 +32,7 @@ from six.moves import queue as Queue  # pylint: disable=redefined-builtin
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
 from tensorflow.contrib.tpu.python.ops import tpu_ops
+from tensorflow.contrib.tpu.python.tpu import error_handling
 from tensorflow.contrib.tpu.python.tpu import session_support
 from tensorflow.contrib.tpu.python.tpu import tpu
 from tensorflow.contrib.tpu.python.tpu import tpu_config
@@ -257,7 +258,10 @@ class TPUEstimatorSpec(model_fn_lib._TPUEstimatorSpec):  # pylint: disable=prote
               eval_metrics=None,
               export_outputs=None,
               scaffold_fn=None,
-              host_call=None):
+              host_call=None,
+              training_hooks=None,
+              evaluation_hooks=None,
+              prediction_hooks=None):
     """Creates a validated `TPUEstimatorSpec` instance."""
     host_calls = {}
     if eval_metrics is not None:
@@ -265,6 +269,17 @@ class TPUEstimatorSpec(model_fn_lib._TPUEstimatorSpec):  # pylint: disable=prote
     if host_call is not None:
       host_calls['host_call'] = host_call
     _OutfeedHostCall.validate(host_calls)
+
+    training_hooks = list(training_hooks or [])
+    evaluation_hooks = list(evaluation_hooks or [])
+    prediction_hooks = list(prediction_hooks or [])
+
+    for hook in training_hooks + evaluation_hooks + prediction_hooks:
+      if not isinstance(hook, session_run_hook.SessionRunHook):
+        raise TypeError(
+            'All hooks must be SessionRunHook instances, given: {}'.format(
+                hook))
+
     return super(TPUEstimatorSpec, cls).__new__(
         cls,
         mode=mode,
@@ -274,7 +289,10 @@ class TPUEstimatorSpec(model_fn_lib._TPUEstimatorSpec):  # pylint: disable=prote
         eval_metrics=eval_metrics,
         export_outputs=export_outputs,
         scaffold_fn=scaffold_fn,
-        host_call=host_call)
+        host_call=host_call,
+        training_hooks=training_hooks,
+        evaluation_hooks=evaluation_hooks,
+        prediction_hooks=prediction_hooks)
 
   def as_estimator_spec(self):
     """Creates an equivalent `EstimatorSpec` used by CPU train/eval."""
@@ -290,6 +308,7 @@ class TPUEstimatorSpec(model_fn_lib._TPUEstimatorSpec):  # pylint: disable=prote
     hooks = None
     if self.host_call is not None:
       hooks = [_OutfeedHostCallHook(host_call_ret['host_call'])]
+    hooks = list(hooks or [])
     scaffold = self.scaffold_fn() if self.scaffold_fn else None
     return model_fn_lib.EstimatorSpec(
         mode=self.mode,
@@ -299,9 +318,9 @@ class TPUEstimatorSpec(model_fn_lib._TPUEstimatorSpec):  # pylint: disable=prote
         eval_metric_ops=eval_metric_ops,
         export_outputs=self.export_outputs,
         scaffold=scaffold,
-        training_hooks=hooks,
-        evaluation_hooks=hooks,
-        prediction_hooks=hooks)
+        training_hooks=self.training_hooks + hooks,
+        evaluation_hooks=self.evaluation_hooks + hooks,
+        prediction_hooks=self.prediction_hooks + hooks)
 
 
 class _OpQueueContext(object):
@@ -365,17 +384,17 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
                ctx,
                enqueue_ops,
                dequeue_ops,
-               run_infeed_loop_on_coordinator=True):
+               run_infeed_loop_on_coordinator=True,
+               rendezvous=None):
     self._master_job = ctx.master_job
     self._enqueue_ops = enqueue_ops
     self._dequeue_ops = dequeue_ops
+    self._rendezvous = rendezvous
 
     self._run_infeed_loop_on_coordinator = run_infeed_loop_on_coordinator
     self._initial_infeed_sleep_secs = (
         ctx.config.tpu_config.initial_infeed_sleep_secs)
 
-    self._session_cancel_timer = None
-
     self._feed_error = None
     self._finished = False
 
@@ -392,62 +411,6 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
     for op in summary_writer_init_ops:
       self._finalize_ops.append(contrib_summary.flush(writer=op.inputs[0]))
 
-  def _log_error(self, session, error):
-    """Log an infeed or outfeed error.
-
-    This logs a short error message immediately, and schedules a timer to
-    emit the full stack trace and error message after a short period of time.
-    If the main session has terminated by the time the timer triggers, we
-    assume the real source of the error was from the main session and avoid
-    emitting a stack trace for the infeed.
-
-    Args:
-      session: `tf.Session`, session to be terminated error: exception that
-        triggered logging.
-      error: the Exception to log.
-    """
-    logging.warning(
-        '\n\n'
-        'Error occurred during infeed/outfeed.  This may be due to a compile '
-        'error in the main session.  Waiting for a short time for the main '
-        'session to come back.\n\n%s', error)
-
-    self._feed_error = traceback.format_exc()
-
-    # If we've already encountered a feed error, don't schedule another
-    # cancellation op.
-    if self._session_cancel_timer:
-      return
-
-    def _cancel_session():
-      """Close the session to avoid the main thread from hanging.
-
-      If input pipeline triggers any error, the infeed thread dies but the main
-      thread for TPU computation waits for the infeed enqueue forever. Close the
-      Session to cancel the main thread Session.run execution.
-
-      We sleep for a few seconds before closing to give some time for the TPU
-      compilation error, if any, propagating, from TPU to CPU host. Compilation
-      errors should be reported by the main thread so that the program can be
-      interrupted and users can take action.  Due to a race condition, the
-      infeed thread might see an error first.  Closing the session here
-      immediately would result in a session cancellation exception in the main
-      thread, instead of the expected compile error.  User code that depends on
-      having the proper exception type will therefore be confused.
-      """
-      time.sleep(5)
-
-      # If the main session is still running, the infeed/outfeed errors are
-      # legitimate, and should be logged.
-      if not self._finished and self._feed_error:
-        logging.error('Feed error: %s', self._feed_error)
-        logging.error('Closing session.  A RuntimeError should follow.')
-        session.close()
-
-    self._session_cancel_timer = threading.Thread(target=_cancel_session)
-    self._session_cancel_timer.daemon = True
-    self._session_cancel_timer.start()
-
   def _run_infeed(self, queue_ctx, session):
     logging.info('Starting infeed thread controller.')
     if self._initial_infeed_sleep_secs:
@@ -456,7 +419,7 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
       time.sleep(self._initial_infeed_sleep_secs)
       logging.info('%s thread starting after sleep', self._name)
 
-    try:
+    with self._rendezvous.catch_errors(source='infeed', session=session):
       if self._run_infeed_loop_on_coordinator:
         for count, steps in enumerate(queue_ctx.read_iteration_counts()):
           for i in xrange(steps):
@@ -466,19 +429,15 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
         for _ in queue_ctx.read_iteration_counts():
           session.run(self._enqueue_ops)
       logging.info('Infeed thread finished, shutting down.')
-    except Exception as e:  # pylint: disable=broad-except
-      self._log_error(session, e)
 
   def _run_outfeed(self, queue_ctx, session):
     logging.info('Starting outfeed thread controller.')
-    try:
+    with self._rendezvous.catch_errors(source='outfeed', session=session):
       for count, steps in enumerate(queue_ctx.read_iteration_counts()):
         for i in xrange(steps):
           logging.debug('Outfeed dequeue for iteration (%d, %d)', count, i)
           session.run(self._dequeue_ops)
       logging.info('Outfeed thread finished, shutting down.')
-    except Exception as e:  # pylint: disable=broad-except
-      self._log_error(session, e)
 
   def _create_infeed_controller(self, name, target, args):
     return _OpQueueContext(name=name, target=target, args=args)
@@ -497,11 +456,6 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
   def before_run(self, run_context):
     self._feed_error = None
 
-    # Wait for the cancellation timer to complete before continuing.
-    if self._session_cancel_timer:
-      self._session_cancel_timer.join()
-      self._session_cancel_timer = None
-
     iterations = run_context.session.run(self._iterations_per_loop_var)
 
     logging.info('Enqueue next (%d) batch(es) of data to infeed.', iterations)
@@ -512,16 +466,14 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
     self._outfeed_controller.send_next_batch_signal(iterations)
 
   def end(self, session):
-    if self._session_cancel_timer:
-      logging.warning('Feed error occurred; waiting for message.')
-      self._session_cancel_timer.join()
-
     self._finished = True
     logging.info('Stop infeed thread controller')
     self._infeed_controller.join()
+    self._rendezvous.record_done('infeed')
 
     logging.info('Stop output thread controller')
     self._outfeed_controller.join()
+    self._rendezvous.record_done('outfeed')
 
     logging.info('Shutdown TPU system.')
     session.run(self._finalize_ops)
@@ -529,9 +481,10 @@ class TPUInfeedOutfeedSessionHook(session_run_hook.SessionRunHook):
 
 class TPUInfeedOutfeedSessionHookForPrediction(TPUInfeedOutfeedSessionHook):
 
-  def __init__(self, ctx, enqueue_ops, dequeue_ops):
+  def __init__(self, ctx, enqueue_ops, dequeue_ops, rendezvous=None):
     super(TPUInfeedOutfeedSessionHookForPrediction, self).__init__(
-        ctx, enqueue_ops, dequeue_ops, run_infeed_loop_on_coordinator=False)
+        ctx, enqueue_ops, dequeue_ops, run_infeed_loop_on_coordinator=False,
+        rendezvous=rendezvous)
 
   def _create_infeed_controller(self, name, target, args):
     return _OpSignalOnceQueueContext(name=name, target=target, args=args)
@@ -701,8 +654,6 @@ def generate_per_core_enqueue_ops_fn_for_host(
     infeed_queue = tpu_feed.InfeedQueue(
         number_of_tuple_elements=len(per_host_sharded_inputs[0]))
     captured_infeed_queue.capture(infeed_queue)
-    infeed_queue.set_configuration_from_sharded_input_tensors(
-        per_host_sharded_inputs)
 
     per_host_enqueue_ops = infeed_queue.generate_enqueue_ops(
         per_host_sharded_inputs, tpu_ordinal_function=tpu_ordinal_function_impl)
@@ -830,19 +781,105 @@ def generate_per_host_v2_enqueue_ops_fn_for_host(
         flattened_inputs = (
             inputs_structure_recorder.flatten_features_and_labels(
                 features, labels))
-
         control_deps.extend(flattened_inputs)
         per_host_sharded_inputs.append(flattened_inputs)
 
+      if inputs_structure_recorder.flattened_input_dims:
+        # pylint: disable=protected-access
+        infeed_queue = tpu_feed._PartitionedInfeedQueue(
+            number_of_tuple_elements=len(per_host_sharded_inputs[0]),
+            host_id=host_id,
+            input_partition_dims=inputs_structure_recorder.flattened_input_dims,
+            device_assignment=ctx.device_assignment)
+        per_host_enqueue_ops = infeed_queue.generate_enqueue_ops(
+            per_host_sharded_inputs)
+      else:
+        infeed_queue = tpu_feed.InfeedQueue(
+            number_of_tuple_elements=len(per_host_sharded_inputs[0]))
+        per_host_enqueue_ops = infeed_queue.generate_enqueue_ops(
+            per_host_sharded_inputs,
+            tpu_ordinal_function=tpu_ordinal_function_impl)
+      captured_infeed_queue.capture(infeed_queue)
+
+    return per_host_enqueue_ops
+
+  return enqueue_ops_fn, captured_infeed_queue, hooks, is_dataset
+
+
+def generate_broadcast_enqueue_ops_fn(ctx, input_fn, inputs_structure_recorder,
+                                      num_hosts):
+  """Generates infeed enqueue ops for one input_fn on all the hosts."""
+  captured_infeed_queue = _CapturedObject()
+  hooks = []
+  device_0 = ctx.tpu_host_placement_function(host_id=0)
+  with ops.device(device_0):
+    user_context = tpu_context.TPUContext(
+        internal_ctx=ctx, input_device=device_0, invocation_index=0)
+    inputs = _Inputs.from_input_fn(input_fn(user_context))
+
+    is_dataset = inputs.is_dataset
+    if ctx.mode == model_fn_lib.ModeKeys.PREDICT:
+      if not is_dataset:
+        raise TypeError(
+            'For mode PREDICT, `input_fn` must return `Dataset` instead of '
+            '`features` and `labels`.')
+
+      inputs = _InputsWithStoppingSignals(
+          dataset=inputs.dataset,
+          batch_size=ctx.batch_size_for_input_fn,
+          add_padding=True)
+
+    if is_dataset:
+      hooks.append(inputs.dataset_initializer_hook())
+    num_replicas_per_host = ctx.num_of_replicas_per_host
+
+  def tpu_ordinal_function_impl(replica_id):
+    if ctx.device_assignment:
+      return ctx.device_assignment.tpu_ordinal(replica=replica_id)
+    else:
+      return replica_id % num_replicas_per_host
+
+  def device_function_impl(replica_id):
+    return ctx.tpu_host_placement_function(replica_id=replica_id)
+
+  def enqueue_ops_fn():
+    """Generates enqueue ops for all the hosts."""
+    broadcasted_inputs = []
+    flattened_inputs = None  # Cache result from input_fn.
+    signals = None
+    for host_id in xrange(num_hosts):
+      with ops.device(ctx.tpu_host_placement_function(host_id=host_id)):
+        for _ in xrange(ctx.num_of_replicas_per_host):
+          # Note: input_fn is only called once at host 0 for the first replica.
+          # The features and labels returned from that invocation are
+          # broadcasted to other replicas(including the replicas on other
+          # hosts).
+          if flattened_inputs is None:
+            features, labels = inputs.features_and_labels()  # Calls get_next()
+            signals = inputs.signals()
+
+            inputs_structure_recorder.validate_and_record_structure(
+                features, labels, signals)
+            flattened_inputs = (
+                inputs_structure_recorder.flatten_features_and_labels(
+                    features, labels, signals))
+          broadcasted_inputs.append(flattened_inputs)
+
     infeed_queue = tpu_feed.InfeedQueue(
-        number_of_tuple_elements=len(per_host_sharded_inputs[0]))
+        number_of_tuple_elements=len(broadcasted_inputs[0]))
     captured_infeed_queue.capture(infeed_queue)
-    infeed_queue.set_configuration_from_sharded_input_tensors(
-        per_host_sharded_inputs)
+    enqueue_ops = infeed_queue.generate_enqueue_ops(
+        broadcasted_inputs,
+        tpu_ordinal_function=tpu_ordinal_function_impl,
+        placement_function=device_function_impl)
 
-    per_host_enqueue_ops = infeed_queue.generate_enqueue_ops(
-        per_host_sharded_inputs, tpu_ordinal_function=tpu_ordinal_function_impl)
-    return per_host_enqueue_ops
+    if signals is None:
+      return enqueue_ops
+    else:
+      return {
+          'ops': enqueue_ops,
+          'signals': signals,
+      }
 
   return enqueue_ops_fn, captured_infeed_queue, hooks, is_dataset
 
@@ -880,21 +917,68 @@ class _InputPipeline(object):
   class InputsStructureRecorder(object):
     """The recorder to record inputs structure."""
 
-    def __init__(self):
+    def __init__(self, input_partition_dims=None):
       # Holds the structure of inputs
       self._feature_names = []
       self._label_names = []
       self._has_labels = False
       self._signals_helper = None
+      self._flattened_input_dims = None
+
+      if input_partition_dims:
+        # This should have been validated in TPUConfig.
+        assert len(input_partition_dims) <= 2, 'must have 1 or 2 elements.'
+        if len(input_partition_dims) == 2:
+          self._feature_dims, self._label_dims = input_partition_dims
+        else:
+          self._feature_dims = input_partition_dims[0]
+          self._label_dims = None
+
+        assert self._feature_dims is not None, ('input_partition_dims[0] must '
+                                                'not be None')
+      else:
+        self._feature_dims = None
+        self._label_dims = None
 
       # Internal state.
       self._initialized = False
 
+    @property
+    def flattened_input_dims(self):
+      assert self._initialized, 'InputsStructureRecorder is not initialized.'
+      return self._flattened_input_dims
+
     def has_labels(self):
       return self._has_labels
 
+    def _flatten_input_dims(self, feature_dims, feature_dims_names, label_dims,
+                            label_dims_names, label_names, has_labels):
+      """Flatten input dims with the same order as flattened input tensors."""
+      flattened_input_dims = []
+      if feature_dims_names:
+        # We need a fixed ordering for matching the tensors in features.
+        flattened_input_dims.extend(
+            [feature_dims[name] for name in feature_dims_names])
+      else:
+        flattened_input_dims.append(feature_dims)
+
+      if label_dims_names:
+        # We need a fixed ordering for matching the tensors in labels.
+        flattened_input_dims.extend(
+            [label_dims[name] for name in label_dims_names])
+      else:
+        if label_names:
+          num_tensors_in_label = len(label_names)
+        else:
+          num_tensors_in_label = int(has_labels)
+        # Setting `None` in input_partition_dims[1] will apply `None` to
+        # all the tensors in labels, regardless of internal structure.
+        flattened_input_dims.extend([label_dims] * num_tensors_in_label)
+
+      return flattened_input_dims
+
     def validate_and_record_structure(self, features, labels, signals=None):
-      """Validates and records the structure of features` and `labels`."""
+      """Validates and records the structure of `features` and `labels`."""
 
       def _extract_key_names(tensor_or_dict):
         if tensor_or_dict is None:
@@ -922,6 +1006,24 @@ class _InputPipeline(object):
         self._feature_names = feature_names
         self._label_names = label_names
         self._has_labels = has_labels
+        if self._feature_dims is not None:
+          feature_dims_names = _extract_key_names(self._feature_dims)
+          if feature_dims_names != feature_names:
+            raise ValueError(
+                'TPUConfig.input_partition_dims[0] mismatched feature'
+                ' keys. Expected {}, got {}'.format(feature_names,
+                                                    feature_dims_names))
+
+          label_dims_names = _extract_key_names(self._label_dims)
+          if self._label_dims is not None and label_dims_names != label_names:
+            raise ValueError(
+                'TPUConfig.input_partition_dims[1] mismatched label'
+                ' keys. Expected {}, got {}'.format(label_names,
+                                                    label_dims_names))
+
+          self._flattened_input_dims = self._flatten_input_dims(
+              self._feature_dims, feature_dims_names, self._label_dims,
+              label_dims_names, label_names, has_labels)
 
     def flatten_features_and_labels(self, features, labels, signals=None):
       """Flattens the `features` and `labels` to a single tensor list."""
@@ -1016,7 +1118,8 @@ class _InputPipeline(object):
     Raises:
       ValueError: If both `sharded_features` and `num_cores` are `None`.
     """
-    self._inputs_structure_recorder = _InputPipeline.InputsStructureRecorder()
+    self._inputs_structure_recorder = _InputPipeline.InputsStructureRecorder(
+        ctx.input_partition_dims)
 
     self._sharded_per_core = ctx.is_input_sharded_per_core()
     self._input_fn = input_fn
@@ -1079,6 +1182,24 @@ class _InputPipeline(object):
             # Infeed_queue_getter must be called after enqueue_ops_fn is called.
             infeed_queues.append(captured_infeed_queue.get())
 
+    elif self._ctx.is_input_broadcast_with_iterators():
+      # Only calls input_fn in host 0.
+      host_device = tpu_host_placement_fn(host_id=0)
+      enqueue_ops_fn, captured_infeed_queue, hooks, is_dataset = (
+          generate_broadcast_enqueue_ops_fn(self._ctx, self._input_fn,
+                                            self._inputs_structure_recorder,
+                                            num_hosts))
+      all_hooks.extend(hooks)
+      if is_dataset:
+        run_infeed_loop_on_coordinator = False
+        wrap_fn = (
+            _wrap_computation_in_while_loop
+            if self._ctx.mode != model_fn_lib.ModeKeys.PREDICT else
+            _wrap_computation_in_while_loop_with_stopping_signals)
+        enqueue_ops.append(wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
+      else:
+        enqueue_ops.append(enqueue_ops_fn())
+      infeed_queues.append(captured_infeed_queue.get())
     else:
       for host_id in range(num_hosts):
         host_device = tpu_host_placement_fn(host_id=host_id)
@@ -1193,6 +1314,7 @@ class _ModelFnWrapper(object):
 
     host_call = _OutfeedHostCall(self._ctx)
     captured_scaffold_fn = _CapturedObject()
+    captured_training_hooks = _CapturedObject()
 
     def train_step(loss):
       """Training step function for use inside a while loop."""
@@ -1209,6 +1331,8 @@ class _ModelFnWrapper(object):
       else:
         captured_scaffold_fn.capture(None)
 
+      captured_training_hooks.capture(estimator_spec.training_hooks)
+
       # We must run train_op to update the variables prior to running the
       # outfeed.
       with ops.control_dependencies([train_op]):
@@ -1220,7 +1344,8 @@ class _ModelFnWrapper(object):
         with ops.control_dependencies(host_call_outfeed_ops):
           return array_ops.identity(loss)
 
-    return train_step, host_call, captured_scaffold_fn
+    return (train_step, host_call, captured_scaffold_fn,
+            captured_training_hooks)
 
   def convert_to_single_tpu_eval_step(self, dequeue_fn):
     """Converts user provided model_fn` as a single eval step on TPU.
@@ -1250,6 +1375,7 @@ class _ModelFnWrapper(object):
     """
     host_calls = _OutfeedHostCall(self._ctx)
     captured_scaffold_fn = _CapturedObject()
+    captured_eval_hooks = _CapturedObject()
 
     def eval_step(total_loss):
       """Evaluation step function for use inside a while loop."""
@@ -1264,8 +1390,11 @@ class _ModelFnWrapper(object):
 
       loss = tpu_estimator_spec.loss
       captured_scaffold_fn.capture(tpu_estimator_spec.scaffold_fn)
+      captured_eval_hooks.capture(tpu_estimator_spec.evaluation_hooks)
+
       to_record = {}
-      to_record['eval_metrics'] = tpu_estimator_spec.eval_metrics
+      if tpu_estimator_spec.eval_metrics:
+        to_record['eval_metrics'] = tpu_estimator_spec.eval_metrics
       if tpu_estimator_spec.host_call is not None:
         # We assume that evaluate won't update global step, so we don't wrap
         # this host_call.
@@ -1275,7 +1404,7 @@ class _ModelFnWrapper(object):
       with ops.control_dependencies(host_calls.create_enqueue_op()):
         return math_ops.add(total_loss, loss)
 
-    return eval_step, host_calls, captured_scaffold_fn
+    return eval_step, host_calls, captured_scaffold_fn, captured_eval_hooks
 
   def convert_to_single_tpu_predict_step(self, dequeue_fn):
     """Converts user provided model_fn` as a single predict step on TPU.
@@ -1290,6 +1419,7 @@ class _ModelFnWrapper(object):
     """
     host_calls = _OutfeedHostCall(self._ctx)
     captured_scaffold_fn = _CapturedObject()
+    captured_predict_hooks = _CapturedObject()
 
     def predict_step(unused_scalar_stopping_signal):
       """Evaluation step function for use inside a while loop."""
@@ -1310,6 +1440,7 @@ class _ModelFnWrapper(object):
       self._verify_tpu_spec_predictions(tpu_estimator_spec.predictions)
 
       captured_scaffold_fn.capture(tpu_estimator_spec.scaffold_fn)
+      captured_predict_hooks.capture(tpu_estimator_spec.prediction_hooks)
       to_record = {}
       identity_fn = lambda **kwargs: kwargs
       to_record['predictions'] = [identity_fn, tpu_estimator_spec.predictions]
@@ -1321,7 +1452,8 @@ class _ModelFnWrapper(object):
       with ops.control_dependencies(host_calls.create_enqueue_op()):
         return _StopSignals.as_scalar_stopping_signal(stopping_signals)
 
-    return predict_step, host_calls, captured_scaffold_fn
+    return (predict_step, host_calls, captured_scaffold_fn,
+            captured_predict_hooks)
 
   def _verify_tpu_spec_predictions(self, predictions):
     """Validates TPUEstimatorSpec.predictions dict."""
@@ -1422,6 +1554,11 @@ class _ModelFnWrapper(object):
     running_on_cpu = self._ctx.is_running_on_cpu(is_export_mode)
     _add_item_to_params(params, _USE_TPU_KEY, not running_on_cpu)
 
+    if not running_on_cpu:
+      user_context = tpu_context.TPUContext(
+          internal_ctx=self._ctx, call_from_input_fn=False)
+      _add_item_to_params(params, _CTX_KEY, user_context)
+
     estimator_spec = self._model_fn(features=features, **kwargs)
     if (running_on_cpu and
         isinstance(estimator_spec, model_fn_lib._TPUEstimatorSpec)):  # pylint: disable=protected-access
@@ -1438,11 +1575,9 @@ class _ModelFnWrapper(object):
 
     err_msg = '{} returned by EstimatorSpec is not supported in TPUEstimator.'
     if estimator_spec.training_chief_hooks:
-      raise ValueError(err_msg.format('training_chief_hooks'))
-    if estimator_spec.training_hooks:
-      raise ValueError(err_msg.format('training_hooks'))
-    if estimator_spec.evaluation_hooks:
-      raise ValueError(err_msg.format('evaluation_hooks'))
+      raise ValueError(
+          err_msg.format('training_chief_hooks') + 'If you want' +
+          ' to pass training hooks, please pass via training_hooks.')
 
     if estimator_spec.scaffold:
       logging.warning('EstimatorSpec.Scaffold is ignored by TPU train/eval. '
@@ -1563,7 +1698,7 @@ class _OutfeedHostCall(object):
       RuntimeError: If outfeed tensor is scalar.
     """
     if not self._names:
-      return []
+      return {}
 
     ret = {}
     # For each i, dequeue_ops[i] is a list containing the tensors from all
@@ -1582,11 +1717,13 @@ class _OutfeedHostCall(object):
     # Outfeed ops execute on each replica's first logical core. Note: we must
     # constraint it such that we have at most one outfeed dequeue and enqueue
     # per replica.
-    tpu_device_placement_fn = self._ctx.tpu_device_placement_function
     for i in xrange(self._ctx.num_replicas):
-      with ops.device(tpu_device_placement_fn(i)):
+      host_device, ordinal_id = self._ctx.device_for_replica(i)
+      with ops.device(host_device):
         outfeed_tensors = tpu_ops.outfeed_dequeue_tuple(
-            dtypes=tensor_dtypes, shapes=tensor_shapes)
+            dtypes=tensor_dtypes,
+            shapes=tensor_shapes,
+            device_ordinal=ordinal_id)
         for j, item in enumerate(outfeed_tensors):
           dequeue_ops[j].append(item)
 
@@ -1601,7 +1738,7 @@ class _OutfeedHostCall(object):
     # place all ops on tpu host if possible.
     #
     # TODO(jhseu): Evaluate whether this is right for summaries.
-    with ops.device(self._ctx.tpu_host_placement_function(core_id=0)):
+    with ops.device(self._ctx.tpu_host_placement_function(replica_id=0)):
       for name in self._names:
         dequeue_ops = dequeue_ops_by_name[name]
         for i, item in enumerate(dequeue_ops):
@@ -1710,6 +1847,9 @@ class InstallSignalHandlerHook(session_run_hook.SessionRunHook):
 class TPUEstimator(estimator_lib.Estimator):
   """Estimator with TPU support.
 
+  TPUEstimator also supports training on CPU and GPU. You don't need to define
+  a separate `tf.estimator.Estimator`.
+
   TPUEstimator handles many of the details of running on TPU devices, such as
   replicating inputs and models for each core, and returning to host
   periodically to run hooks.
@@ -1747,7 +1887,8 @@ class TPUEstimator(estimator_lib.Estimator):
   Current limitations:
   --------------------
 
-  1. TPU evaluation only works on a single host (one TPU worker).
+  1. TPU evaluation only works on a single host (one TPU worker) except
+     BROADCAST mode.
 
   2. `input_fn` for evaluation should **NOT** raise an end-of-input exception
      (`OutOfRangeError` or `StopIteration`). And all evaluation steps and all
@@ -1918,10 +2059,9 @@ class TPUEstimator(estimator_lib.Estimator):
     """Constructs an `TPUEstimator` instance.
 
     Args:
-      model_fn: Model function as required by `Estimator`. For training, the
-        returned `EstimatorSpec` cannot have hooks as it is not supported in
-        `TPUEstimator`. Instead, the user can pass the training hooks as
-        an argument to `TPUEstimator.train()`.
+      model_fn: Model function as required by `Estimator` which returns
+      EstimatorSpec or TPUEstimatorSpec. `training_hooks`, 'evaluation_hooks',
+      and `prediction_hooks` must not capure any TPU Tensor inside the model_fn.
       model_dir: Directory to save model parameters, graph and etc. This can
         also be used to load checkpoints from the directory into a estimator to
         continue training a previously saved model. If `None`, the model_dir in
@@ -1986,7 +2126,7 @@ class TPUEstimator(estimator_lib.Estimator):
 
       if (config.tpu_config.per_host_input_for_training is
           tpu_config.InputPipelineConfig.PER_SHARD_V1 and
-          config.tpu_config.computation_shape):
+          config.tpu_config.num_cores_per_replica):
         raise ValueError(
             'Model parallelism only supports per host input for training. '
             'Please adjust TPURunconfig.per_host_input_for_training.')
@@ -2033,6 +2173,7 @@ class TPUEstimator(estimator_lib.Estimator):
     self._export_to_tpu = export_to_tpu
 
     self._is_input_fn_invoked = None
+    self._rendezvous = {}
 
   def _add_meta_graph_for_mode(self,
                                builder,
@@ -2276,6 +2417,65 @@ class TPUEstimator(estimator_lib.Estimator):
     """
     pass
 
+  def train(self,
+            input_fn,
+            hooks=None,
+            steps=None,
+            max_steps=None,
+            saving_listeners=None):
+    rendezvous = error_handling.ErrorRendezvous(num_sources=3)
+    self._rendezvous[model_fn_lib.ModeKeys.TRAIN] = rendezvous
+    try:
+      return super(TPUEstimator, self).train(
+          input_fn=input_fn, hooks=hooks, steps=steps, max_steps=max_steps,
+          saving_listeners=saving_listeners
+      )
+    except Exception:  # pylint: disable=broad-except
+      rendezvous.record_error('training_loop', sys.exc_info())
+    finally:
+      rendezvous.record_done('training_loop')
+      rendezvous.raise_errors()
+
+  def evaluate(self, input_fn, steps=None, hooks=None, checkpoint_path=None,
+               name=None):
+    rendezvous = error_handling.ErrorRendezvous(num_sources=3)
+    self._rendezvous[model_fn_lib.ModeKeys.EVAL] = rendezvous
+    try:
+      return super(TPUEstimator, self).evaluate(
+          input_fn, steps=steps, hooks=hooks, checkpoint_path=checkpoint_path,
+          name=name
+      )
+    except Exception:  # pylint: disable=broad-except
+      rendezvous.record_error('evaluation_loop', sys.exc_info())
+    finally:
+      rendezvous.record_done('evaluation_loop')
+      rendezvous.raise_errors()
+
+  def predict(self,
+              input_fn,
+              predict_keys=None,
+              hooks=None,
+              checkpoint_path=None,
+              yield_single_examples=True):
+    rendezvous = error_handling.ErrorRendezvous(num_sources=3)
+    self._rendezvous[model_fn_lib.ModeKeys.PREDICT] = rendezvous
+    try:
+      for result in super(TPUEstimator, self).predict(
+          input_fn=input_fn,
+          predict_keys=predict_keys,
+          hooks=hooks,
+          checkpoint_path=checkpoint_path,
+          yield_single_examples=yield_single_examples):
+        yield result
+    except Exception:  # pylint: disable=broad-except
+      rendezvous.record_error('prediction_loop', sys.exc_info())
+    finally:
+      rendezvous.record_done('prediction_loop')
+      rendezvous.raise_errors()
+
+    rendezvous.record_done('prediction_loop')
+    rendezvous.raise_errors()
+
   def _augment_model_fn(self, model_fn, batch_axis):
     """Returns a new model_fn, which wraps the TPU support."""
 
@@ -2298,10 +2498,20 @@ class TPUEstimator(estimator_lib.Estimator):
         # Clear the bit.
         self._is_input_fn_invoked = None
 
+        # examples_hook is added to training_hooks for both CPU and TPU
+        # execution.
+        examples_hook = ExamplesPerSecondHook(
+            ctx.global_batch_size,
+            output_dir=self.model_dir,
+            every_n_steps=self._log_every_n_steps)
+
         if ctx.is_running_on_cpu(is_export_mode=is_export_mode):
           logging.info('Running %s on CPU', mode)
-          return model_fn_wrapper.call_without_tpu(
+          estimator_spec = model_fn_wrapper.call_without_tpu(
               features, labels, is_export_mode=is_export_mode)
+          estimator_spec = estimator_spec._replace(
+              training_hooks=estimator_spec.training_hooks + (examples_hook,))
+          return estimator_spec
 
         assert labels is None, '`labels` passed to `model_fn` must be `None`.'
         # TPUEstimator._call_input_fn passes `input_fn` as features to here.
@@ -2320,7 +2530,7 @@ class TPUEstimator(estimator_lib.Estimator):
             graph.add_to_collection(_TPU_ENQUEUE_OPS, enqueue_op)
 
         if mode == model_fn_lib.ModeKeys.TRAIN:
-          loss, host_call, scaffold = (
+          loss, host_call, scaffold, training_hooks = (
               _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
           host_ops = host_call.create_tpu_hostcall()
           if host_ops is None:
@@ -2360,7 +2570,9 @@ class TPUEstimator(estimator_lib.Estimator):
                   enqueue_ops,
                   host_ops,
                   run_infeed_loop_on_coordinator=(
-                      run_infeed_loop_on_coordinator)),
+                      run_infeed_loop_on_coordinator),
+                  rendezvous=self._rendezvous[mode],
+              ),
               InstallSignalHandlerHook(),
               training.LoggingTensorHook(
                   {
@@ -2369,14 +2581,13 @@ class TPUEstimator(estimator_lib.Estimator):
                   },
                   every_n_iter=logging_hook_frequency)
           ])
-          examples_hook = ExamplesPerSecondHook(
-              ctx.global_batch_size,
-              output_dir=self.model_dir,
-              every_n_steps=self._log_every_n_steps)
           examples_hook._set_steps_per_run(   # pylint: disable=protected-access
               self._config.tpu_config.iterations_per_loop)
           hooks.append(examples_hook)
 
+          if training_hooks:
+            hooks.extend(training_hooks)
+
           chief_hooks = []
           if (self._config.save_checkpoints_secs or
               self._config.save_checkpoints_steps):
@@ -2388,6 +2599,7 @@ class TPUEstimator(estimator_lib.Estimator):
             checkpoint_hook._set_steps_per_run(   # pylint: disable=protected-access
                 self._config.tpu_config.iterations_per_loop)
             chief_hooks.append(checkpoint_hook)
+
           summary.scalar(model_fn_lib.LOSS_METRIC_KEY, loss)
           with ops.control_dependencies([loss]):
             update_ops = _sync_variables_ops()
@@ -2407,7 +2619,7 @@ class TPUEstimator(estimator_lib.Estimator):
               scaffold=scaffold)
 
         if mode == model_fn_lib.ModeKeys.EVAL:
-          total_loss, host_calls, scaffold = _eval_on_tpu_system(
+          total_loss, host_calls, scaffold, eval_hooks = _eval_on_tpu_system(
               ctx, model_fn_wrapper, dequeue_fn)
           iterations_per_loop_var = _create_or_get_iterations_per_loop()
           mean_loss = math_ops.div(total_loss,
@@ -2432,7 +2644,8 @@ class TPUEstimator(estimator_lib.Estimator):
           host_call_ret = host_calls.create_tpu_hostcall()
           eval_metric_ops = {}
           eval_update_ops = []
-          for k, v in host_call_ret['eval_metrics'].items():
+
+          for k, v in host_call_ret.get('eval_metrics', {}).items():
             eval_metric_ops[k] = (v[0], dummy_update_op)
             eval_update_ops.append(v[1])
 
@@ -2446,9 +2659,13 @@ class TPUEstimator(estimator_lib.Estimator):
                   enqueue_ops,
                   eval_update_ops + host_ops,
                   run_infeed_loop_on_coordinator=(
-                      run_infeed_loop_on_coordinator)),
+                      run_infeed_loop_on_coordinator),
+                  rendezvous=self._rendezvous[mode]),
           ] + input_hooks
 
+          if eval_hooks:
+            hooks.extend(eval_hooks)
+
           return model_fn_lib.EstimatorSpec(
               mode,
               loss=mean_loss,
@@ -2459,8 +2676,9 @@ class TPUEstimator(estimator_lib.Estimator):
         # Predict
         assert mode == model_fn_lib.ModeKeys.PREDICT
 
-        dummy_predict_op, host_calls, scaffold = _predict_on_tpu_system(
-            ctx, model_fn_wrapper, dequeue_fn)
+        (dummy_predict_op, host_calls,
+         scaffold, prediction_hooks) = _predict_on_tpu_system(
+             ctx, model_fn_wrapper, dequeue_fn)
         with ops.control_dependencies([dummy_predict_op]):
           internal_ops_to_run = _sync_variables_ops()
           with ops.control_dependencies(internal_ops_to_run):
@@ -2512,10 +2730,13 @@ class TPUEstimator(estimator_lib.Estimator):
 
         hooks = [
             _StoppingPredictHook(scalar_stopping_signal),
-            TPUInfeedOutfeedSessionHookForPrediction(ctx, enqueue_ops,
-                                                     host_ops),
+            TPUInfeedOutfeedSessionHookForPrediction(
+                ctx, enqueue_ops, host_ops, rendezvous=self._rendezvous[mode]),
         ] + input_hooks
 
+        if prediction_hooks:
+          hooks.extend(prediction_hooks)
+
         return model_fn_lib.EstimatorSpec(
             mode,
             prediction_hooks=hooks,
@@ -2599,8 +2820,8 @@ def _eval_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn):
   """Executes `model_fn_wrapper` multiple times on all TPU shards."""
   iterations_per_loop_var = _create_or_get_iterations_per_loop()
 
-  single_tpu_eval_step, host_calls, captured_scaffold_fn = (
-      model_fn_wrapper.convert_to_single_tpu_eval_step(dequeue_fn))
+  (single_tpu_eval_step, host_calls, captured_scaffold_fn, captured_eval_hooks
+  ) = model_fn_wrapper.convert_to_single_tpu_eval_step(dequeue_fn)
 
   def multi_tpu_eval_steps_on_single_shard():
     return training_loop.repeat(
@@ -2615,15 +2836,16 @@ def _eval_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn):
       device_assignment=ctx.device_assignment)
 
   scaffold = _get_scaffold(captured_scaffold_fn)
-  return loss, host_calls, scaffold
+  return loss, host_calls, scaffold, captured_eval_hooks.get()
 
 
 def _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn):
   """Executes `model_fn_wrapper` multiple times on all TPU shards."""
   iterations_per_loop_var = _create_or_get_iterations_per_loop()
 
-  single_tpu_train_step, host_call, captured_scaffold_fn = (
-      model_fn_wrapper.convert_to_single_tpu_train_step(dequeue_fn))
+  (single_tpu_train_step, host_call, captured_scaffold_fn,
+   captured_training_hooks) = (
+       model_fn_wrapper.convert_to_single_tpu_train_step(dequeue_fn))
 
   def multi_tpu_train_steps_on_single_shard():
     return training_loop.repeat(
@@ -2638,15 +2860,16 @@ def _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn):
       device_assignment=ctx.device_assignment)
 
   scaffold = _get_scaffold(captured_scaffold_fn)
-  return loss, host_call, scaffold
+  return loss, host_call, scaffold, captured_training_hooks.get()
 
 
 def _predict_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn):
   """Executes `model_fn_wrapper` multiple times on all TPU shards."""
   num_cores = ctx.num_cores
 
-  single_tpu_predict_step, host_calls, captured_scaffold_fn = (
-      model_fn_wrapper.convert_to_single_tpu_predict_step(dequeue_fn))
+  (single_tpu_predict_step, host_calls, captured_scaffold_fn,
+   captured_predict_hooks
+  ) = model_fn_wrapper.convert_to_single_tpu_predict_step(dequeue_fn)
 
   def multi_tpu_predict_steps_on_single_shard():
 
@@ -2666,7 +2889,7 @@ def _predict_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn):
       outputs_from_all_shards=False)
 
   scaffold = _get_scaffold(captured_scaffold_fn)
-  return dummy_predict_op, host_calls, scaffold
+  return dummy_predict_op, host_calls, scaffold, captured_predict_hooks.get()
 
 
 def _wrap_computation_in_while_loop(device, op_fn):
@@ -3163,3 +3386,47 @@ def _add_item_to_params(params, key, value):
   else:
     # Now params is Python dict.
     params[key] = value
+
+
+def export_estimator_savedmodel(estimator,
+                                export_dir_base,
+                                serving_input_receiver_fn,
+                                assets_extra=None,
+                                as_text=False,
+                                checkpoint_path=None,
+                                strip_default_attrs=False):
+  """Export `Estimator` trained model for TPU inference.
+
+  Args:
+    estimator: `Estimator` with which model has been trained.
+    export_dir_base: A string containing a directory in which to create
+      timestamped subdirectories containing exported SavedModels.
+    serving_input_receiver_fn: A function that takes no argument and
+      returns a `ServingInputReceiver` or `TensorServingInputReceiver`.
+    assets_extra: A dict specifying how to populate the assets.extra directory
+      within the exported SavedModel, or `None` if no extra assets are needed.
+    as_text: whether to write the SavedModel proto in text format.
+    checkpoint_path: The checkpoint path to export.  If `None` (the default),
+      the most recent checkpoint found within the model directory is chosen.
+    strip_default_attrs: Boolean. If `True`, default-valued attributes will be
+      removed from the NodeDefs.
+
+  Returns:
+    The string path to the exported directory.
+  """
+  # `TPUEstimator` requires `tpu_config.RunConfig`, so we cannot use
+  # `estimator.config`.
+  config = tpu_config.RunConfig(model_dir=estimator.model_dir)
+  est = TPUEstimator(
+      estimator._model_fn,  # pylint: disable=protected-access
+      config=config,
+      params=estimator.params,
+      use_tpu=True,
+      train_batch_size=2048,  # Does not matter.
+      eval_batch_size=2048,  # Does not matter.
+  )
+  return est.export_savedmodel(export_dir_base, serving_input_receiver_fn,
+                               assets_extra,
+                               as_text,
+                               checkpoint_path,
+                               strip_default_attrs)
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_feed.py b/tensorflow/contrib/tpu/python/tpu/tpu_feed.py
index 604e6600c8..d9c77a3ea1 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_feed.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_feed.py
@@ -20,8 +20,13 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import itertools
+
+import numpy as np
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
+from tensorflow.compiler.xla.experimental.xla_sharding import xla_sharding
+from tensorflow.compiler.xla.python_api import xla_shape
 from tensorflow.contrib.tpu.python.ops import tpu_ops
 from tensorflow.contrib.tpu.python.tpu import tpu
 from tensorflow.contrib.tpu.python.tpu import tpu_sharding
@@ -30,6 +35,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
+from tensorflow.python.util import nest
 
 
 class InfeedQueue(object):
@@ -461,7 +467,10 @@ class InfeedQueue(object):
             name=full_name,
             device_ordinal=tpu_ordinal)
 
-  def generate_enqueue_ops(self, sharded_inputs, tpu_ordinal_function=None):
+  def generate_enqueue_ops(self,
+                           sharded_inputs,
+                           tpu_ordinal_function=None,
+                           placement_function=None):
     """Generates the host-side Ops to enqueue the shards of a tuple.
 
     sharded_inputs is a list, one for each shard, of lists of
@@ -483,6 +492,9 @@ class InfeedQueue(object):
         shard index as input and returns the ordinal of the TPU device
         the shard's infeed should be placed on. tpu_ordinal_function must be
         set if the inputs are placed on CPU devices.
+      placement_function: if not None, a function that takes the shard index as
+        input and returns the host device where the enqueue op should be placed
+        on.
 
     Returns:
       A list of host-side Ops, one for each shard, that when executed together
@@ -508,8 +520,12 @@ class InfeedQueue(object):
       tpu_ordinal_function = lambda index: -1
     name_prefix = "%s/enqueue" % self._name
     return [
-        self._generate_enqueue_op(shard, name_prefix, index,
-                                  tpu_ordinal=tpu_ordinal_function(index))
+        self._generate_enqueue_op(
+            shard,
+            name_prefix,
+            index,
+            tpu_ordinal=tpu_ordinal_function(index),
+            device=placement_function(index) if placement_function else None)
         for (shard, index) in zip(sharded_inputs, xrange(self.number_of_shards))
     ]
 
@@ -630,3 +646,264 @@ class InfeedQueue(object):
             tpu_ordinal=tpu_ordinal_function(index))
         for (shard, index) in zip(sharded_inputs, xrange(self.number_of_shards))
     ]
+
+
+class _PartitionedInfeedQueue(InfeedQueue):
+  """A helper object to build a device infeed queue with input partition.
+
+  Args:
+    number_of_tuple_elements: the number of Tensors fed atomically through the
+      queue, must be present unless it can be inferred from other arguments.
+    device_assignment: A TPU `DeviceAssignment` which is used to place all the
+      partitions to different TPU infeed queues.
+    host_id: The id of the host machine.
+    input_partition_dims: A nested list/tuple of integers. Each inner
+      list/tuple describes how to partition the corresponding input tensor.
+    tuple_types: If not None, a list of types of the elements of the queue.
+    tuple_shapes: If not None, a list of shapes of the elements of the queue.
+    name: The name of the queue.
+  """
+
+  def __init__(self,
+               number_of_tuple_elements,
+               device_assignment,
+               host_id,
+               input_partition_dims=None,
+               tuple_types=None,
+               tuple_shapes=None,
+               name=None):
+    super(_PartitionedInfeedQueue, self).__init__(
+        number_of_tuple_elements=number_of_tuple_elements,
+        tuple_types=tuple_types,
+        tuple_shapes=None,
+        shard_dimensions=None,
+        name="PartitionedInfeedQueue" if name is None else name)
+    self._input_partition_dims = input_partition_dims
+    self._host_id = host_id
+    self._device_assignment = device_assignment
+
+  def generate_dequeue_op(self, tpu_device=0):
+    """Generate TPU dequeue ops.
+
+    Args:
+      tpu_device: The TPU device ordinal where the infeed instruction should be
+        placed.
+
+    Returns:
+      A list of Outputs corresponding to a partition of infeed dequeued
+      into XLA, suitable for use within a replicated block.
+
+    Raises:
+      ValueError: if the types or shapes of the tuple elements have not been
+      set; or if a dequeue op has already been generated.
+    """
+    self.freeze()
+    if self._generated_dequeue_op:
+      raise ValueError("Can't generate two dequeue Ops from the same queue")
+    self._generated_dequeue_op = True
+    full_name = "%s/dequeue" % self._name
+    sharded_shapes = [
+        policy.get_sharded_shape(shape)
+        for (shape, policy) in zip(self._tuple_shapes, self._sharding_policies)
+    ]
+    with ops.device(tpu.core(tpu_device)):
+      values = tpu_ops.infeed_dequeue_tuple(
+          dtypes=self._tuple_types, shapes=sharded_shapes, name=full_name)
+    return self._tag_sharding_attribute_for_dequeued_tensors(
+        values, self._input_partition_dims)
+
+  def generate_enqueue_ops(self, per_host_sharded_inputs):
+    """Generates the host-side Ops to enqueue the partitioned inputs.
+
+    per_host_sharded_inputs is a list, one for each replica, of lists of
+    Tensors. sharded_inputs[i] is the tuple of Tensors to use to feed
+    replica i.
+    sharded_inputs[i][j] is partitioned by self._input_partition_dims[j].
+
+    For example, if sharded_inputs[i][j] is a 2-D Tensor:
+    [[A, B, C, D],
+     [E ,F, G, H]]
+    self._input_partition_dims[j] is [2, 4].
+
+    sharded_inputs[i][j] will be partitioned and flattened into:
+    [A, B, C, D, E, F, G, H] and fed into the logical core ids:
+    [0, 1, 2, 3, 4, 5, 6, 7] respectively.
+
+    Args:
+      per_host_sharded_inputs: a list of lists of Tensors. The length of the
+        outer list determines the number of shards. Each inner list indicates
+        the types and shapes of the tuples in the corresponding shard.
+
+    Returns:
+      A list of host-side Ops, one for each shard, that when executed together
+      will enqueue a full-size element of infeed.
+
+    Raises:
+      ValueError: if the queue configuration has previously been frozen and the
+        shapes of the elements of sharded_inputs are not compatible with the
+        frozen configuration; or if the shapes of the elements of sharded_inputs
+        don't form a consistent unsharded tuple; or if the elements of a tuple
+        have different device constraints; or if the partition dims are invalid.
+      TypeError: if the queue configuration has previously been frozen and the
+        types of the elements of sharded_inputs are not compatible with the
+        frozen configuration; or if the types of the elements of sharded_inputs
+        don't form a consistent unsharded tuple.
+    """
+    self.set_configuration_from_sharded_input_tensors(per_host_sharded_inputs)
+    number_of_replicas_per_host = len(per_host_sharded_inputs)
+    number_of_tuple_elements = len(per_host_sharded_inputs[0])
+
+    assert len(self._input_partition_dims) == number_of_tuple_elements
+    per_host_enqueue_ops = []
+
+    for replica_index in range(number_of_replicas_per_host):
+      flattened_inputs = per_host_sharded_inputs[replica_index]
+      inputs_part_dims_flat = nest.flatten_up_to(flattened_inputs,
+                                                 self._input_partition_dims)
+      inputs_parted_iters = [
+          iter(self._partition_or_replicate_on_host(x, dims)) for x, dims in
+          zip(per_host_sharded_inputs[replica_index], inputs_part_dims_flat)
+      ]
+
+      for core_index in xrange(self._device_assignment.num_cores_per_replica):
+        # Places different partitions to different logic cores.
+        logical_core = self._get_logical_core(core_index)
+        replica_id = self._device_assignment.lookup_replicas(
+            self._host_id, logical_core)[replica_index]
+        ordinal = self._device_assignment.tpu_ordinal(
+            replica=replica_id, logical_core=logical_core)
+        infeed_inputs = []
+        for it in inputs_parted_iters:
+          input_for_device = next(it, None)
+          if input_for_device is not None:
+            infeed_inputs.append(input_for_device)
+
+        if infeed_inputs:
+          per_host_enqueue_ops.append(
+              tpu_ops.infeed_enqueue_tuple(
+                  inputs=infeed_inputs,
+                  shapes=[x.shape for x in infeed_inputs],
+                  name="enqueue/replica_{0}/input_{1}".format(
+                      replica_index, core_index),
+                  device_ordinal=ordinal))
+    return per_host_enqueue_ops
+
+  def _check_input_partition_dims(self, tensor, dims):
+    """Checks that input partition dims are valid for the `Tensor`.
+
+    Args:
+      tensor: Input tensor for partitioning.
+      dims: A list of integer describes how to partition the input tensor.
+
+    Raises:
+      ValueError: If the tensor can't be partitioned by dims or the
+        num_cores_per_replica doesn't match the number of
+        partitions(dims.prod()).
+    """
+    if dims is None:
+      return
+
+    dims = np.array(dims)
+
+    if (dims < 1).any():
+      raise ValueError("All input partition dims must be >= 1.")
+
+    # No partitioning, so don't perform further checks.
+    if dims.prod() == 1:
+      return
+
+    if dims.prod() != self._device_assignment.num_cores_per_replica:
+      raise ValueError(
+          "The product of each input parition dim should equal to "
+          "num_cores_per_replica. (dim = {}, num_cores_per_replica "
+          "= {})".format(dims, self._device_assignment.num_cores_per_replica))
+    if dims.shape[0] != tensor.shape.ndims:
+      raise ValueError(
+          "Input partition dims must have the same number of dimensions "
+          "as the `Tensor` to be partitioned. (tensor shape = {}, input "
+          "partition dims = {}).".format(tensor.shape.as_list(), dims))
+
+    tensor.shape.assert_is_fully_defined()
+    if (np.array(tensor.shape.as_list()) % dims != 0).any():
+      raise ValueError(
+          "All input partition dims must divide exactly into the `Tensor` "
+          "shape (tensor shape = {}, input partition dims = {}).".format(
+              tensor.shape.as_list(), dims))
+
+  def _partition_or_replicate_on_host(self, tensor, dims):
+    """Partitions or replicates the input tensor.
+
+      The ops inside this function are placed on the host side.
+
+    Args:
+      tensor: The input tensor which will be partioned or replicated.
+      dims: A list of integer describes how to partition the input tensor.
+    Returns:
+      An iterator of `Tensor`s or a list of partioned tensors.
+    """
+    self._check_input_partition_dims(tensor, dims)
+    if dims is None:
+      return itertools.repeat(tensor)
+    else:
+      output = [tensor]
+      for axis, dim in enumerate(dims):
+        if dim > 1:
+          output = [array_ops.split(x, dim, axis=axis) for x in output]
+          output = nest.flatten(output)
+      return output
+
+  def _tag_sharding_attribute_for_dequeued_tensor(self, tensor, dims):
+    """Tags appropriate XLA sharding attribute to the dequeued tensor.
+
+    Args:
+      tensor: The dequeued tensor on TPU.
+      dims: A list of integer describes how the tensor is partitioned.
+
+    Returns:
+      The same tensor with the xla_sharding attribute.
+    """
+    if dims is None:
+      return xla_sharding.replicate(tensor)
+    elif np.prod(dims) == 1:
+      return xla_sharding.assign_device(tensor, 0)
+    else:
+      tile_shape = np.array(tensor.shape.as_list()) // dims
+      tile_assignment = np.arange(np.prod(dims)).reshape(dims)
+      return xla_sharding.tile(
+          tensor=tensor,
+          tile_shape=xla_shape.CreateShapeFromDtypeAndTuple(
+              dtype=np.dtype(tensor.dtype.as_numpy_dtype),
+              shape_tuple=tile_shape),
+          tile_assignment=tile_assignment)
+
+  def _tag_sharding_attribute_for_dequeued_tensors(self, dequeues, dims):
+    """Tags appropriate XLA sharding attribute to the dequeued tensors.
+
+    Args:
+      dequeues: A list of dequeued tensors on TPU.
+      dims: A list of integer describes how the tensor is partitioned.
+
+    Returns:
+      The same dequeues with appropriate xla_sharding attribute.
+    """
+    nest.assert_shallow_structure(dequeues, dims)
+    return nest.map_structure_up_to(
+        dequeues, self._tag_sharding_attribute_for_dequeued_tensor, dequeues,
+        dims)
+
+  def _get_logical_core(self, core_index):
+    """Maps the core index to the 3D coordinate within replica.
+
+      The lowest dimension number in computation_shape is the slowest varying
+      dimension (most major).
+
+    Args:
+      core_index: An integer represents the core index within replcia.
+
+    Returns:
+      A tuple with three integers which represents the 3D coordinate.
+    """
+    computation_shape = self._device_assignment.computation_shape
+    return (core_index // (computation_shape[1] * computation_shape[2]),
+            core_index % (computation_shape[1] * computation_shape[2]) //
+            computation_shape[2], core_index % computation_shape[2])
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_optimizer.py b/tensorflow/contrib/tpu/python/tpu/tpu_optimizer.py
index 15f99d7eeb..53d33f4077 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_optimizer.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_optimizer.py
@@ -23,6 +23,7 @@ import collections
 
 from tensorflow.contrib.tpu.python.ops import tpu_ops
 from tensorflow.contrib.tpu.python.tpu import tpu_function
+from tensorflow.python.framework import ops
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import optimizer
@@ -153,8 +154,9 @@ class CrossShardOptimizer(optimizer.Optimizer):
       if grad is None:
         summed_grads_and_vars.append((grad, var))
       else:
-        summed_grads_and_vars.append((tpu_ops.cross_replica_sum(
-            grad, self._group_assignment), var))
+        with ops.colocate_with(grad):
+          summed_grads_and_vars.append((tpu_ops.cross_replica_sum(
+              grad, self._group_assignment), var))
     return self._opt.apply_gradients(summed_grads_and_vars, global_step, name)
 
   def get_slot(self, *args, **kwargs):
diff --git a/tensorflow/contrib/training/python/training/evaluation.py b/tensorflow/contrib/training/python/training/evaluation.py
index f7fd66d33f..01bac891da 100644
--- a/tensorflow/contrib/training/python/training/evaluation.py
+++ b/tensorflow/contrib/training/python/training/evaluation.py
@@ -142,9 +142,9 @@ from tensorflow.python.ops import state_ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.summary import summary
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import evaluation
 from tensorflow.python.training import monitored_session
-from tensorflow.python.training import saver as tf_saver
 from tensorflow.python.training import session_run_hook
 from tensorflow.python.training import training_util
 
@@ -189,7 +189,7 @@ def wait_for_new_checkpoint(checkpoint_dir,
   logging.info('Waiting for new checkpoint at %s', checkpoint_dir)
   stop_time = time.time() + timeout if timeout is not None else None
   while True:
-    checkpoint_path = tf_saver.latest_checkpoint(checkpoint_dir)
+    checkpoint_path = checkpoint_management.latest_checkpoint(checkpoint_dir)
     if checkpoint_path is None or checkpoint_path == last_checkpoint:
       if stop_time is not None and time.time() + seconds_to_sleep > stop_time:
         return None
diff --git a/tensorflow/contrib/training/python/training/sgdr_learning_rate_decay.py b/tensorflow/contrib/training/python/training/sgdr_learning_rate_decay.py
new file mode 100644
index 0000000000..ed0f398e30
--- /dev/null
+++ b/tensorflow/contrib/training/python/training/sgdr_learning_rate_decay.py
@@ -0,0 +1,187 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""SGDR learning rate decay function."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import math
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import math_ops, control_flow_ops
+
+
+def sgdr_decay(learning_rate, global_step, initial_period_steps,
+               t_mul=2.0, m_mul=1.0, name=None):
+  """Implements Stochastic Gradient Descent with Warm Restarts (SGDR).
+
+  As described in "SGDR: Stochastic Gradient Descent
+  with Warm Restarts" by Ilya Loshchilov & Frank Hutter, Proceedings of
+  ICLR'2017, available at https://arxiv.org/pdf/1608.03983.pdf
+
+  The learning rate decreases according to cosine annealing:
+
+  ```python
+  learning_rate * 0.5 * (1 + cos(x_val * pi)) # for x_val defined in [0, 1]
+  ```
+
+  Thus, at the beginning (when the restart index i = 0),
+  the learning rate decreases for `initial_period_steps` steps from the initial
+  learning rate `learning_rate` (when `x_val=0`, we get `cos(0)=1`) to
+  0 (when `x_val=1`, we get `cos(pi)=-1`).
+
+  The decrease within the i-th period takes `t_i` steps,
+  where `t_0` = `initial_period_steps` is the user-defined number of batch
+  iterations (not epochs as in the paper) to be performed before the first
+  restart is launched.
+
+  Then, we perform the first restart (i=1) by setting the learning rate to
+  `learning_rate*(m_mul^i)`, where `m_mul in [0,1]` (set to 1 by default).
+  The i-th restart runs for `t_i=t_0*(t_mul^i)` steps, i.e., every new
+  restart runs `t_mul` times longer than the previous one.
+
+  Importantly, when one has no access to a validation set, SGDR suggests
+  to report the best expected / recommended solution in the following way:
+  When we are within our initial run (i=0), every new solution represents
+  SGDR's recommended solution. Instead, when i>0, the recommended solution is
+  the one obtained at the end of each restart.
+
+  Note that the minimum learning rate is set to 0 for simplicity,
+  you can adjust the code to deal with any positive minimum learning rate
+  as defined in the paper.
+
+  `initial_period_steps` is the duration of the first period measured in terms
+  of number of minibatch updates. If one wants to use epochs, one should compute
+  the number of updates required for an epoch.
+
+  For example, assume the following parameters and intention:
+      Minibatch size: 100
+      Training dataset size: 10000
+      If the user wants the first decay period to span across 5 epochs, then
+      `initial_period_steps` = 5 * 10000/100 = 500
+
+      Train for 10000 batch iterations with the initial learning rate set to
+      0.1, then restart to run 2 times longer, i.e, for 20000 batch iterations
+      and with the initial learning rate 0.05, then restart again and again,
+      doubling the runtime of each new period and with two times smaller
+      initial learning rate.
+
+  To accomplish the above, one would write:
+
+  ```python
+  ...
+  global_step = tf.Variable(0, trainable=False)
+  starter_learning_rate = 0.1
+  learning_rate = sgdr_decay(starter_learning_rate, global_step,
+                             initial_period_steps=10000, t_mul=2, m_mul=0.5)
+  # Passing global_step to minimize() will increment it at each step.
+  learning_step = (
+      tf.train.GradientDescentOptimizer(learning_rate)
+      .minimize(...my loss..., global_step=global_step)
+  )
+
+  # Step  | 0   | 1000  | 5000 | 9000  | 9999 | 10000 | 11000  |
+  # LR    | 0.1 | 0.097 | 0.05 | 0.002 | 0.00 | 0.05  | 0.0496 |
+
+  # Step  | 20000 | 29000  | 29999 | 30000 |
+  # LR    | 0.025 | 0.0003 | 0.00  | 0.025 |
+  ```
+
+  Args:
+    learning_rate: A scalar `float32` or `float64` `Tensor` or a
+      Python number.  The initial learning rate.
+    global_step: A scalar `int32` or `int64` `Tensor` or a Python number.
+      Global step to use for the decay computation.  Must not be negative.
+    initial_period_steps: Duration of the first period measured as the number
+      of minibatch updates, if one wants to use epochs, one should compute
+      the number of updates required for an epoch.
+    t_mul: A scalar `float32` or `float64` `Tensor` or a Python number.
+      Must be positive.
+      Used to derive the number of iterations in the i-th period:
+      `initial_period_steps * (t_mul^i)`. Defaults to 2.0.
+    m_mul: A scalar `float32` or `float64` `Tensor` or a Python number.
+      Must be positive.
+      Used to derive the initial learning rate of the i-th period:
+      `learning_rate * (m_mul^i)`. Defaults to 1.0
+
+  Returns:
+    A scalar `Tensor` of the same type as `learning_rate`.
+    The learning rate for a provided global_step.
+  Raises:
+    ValueError: if `global_step` is not supplied.
+  """
+
+  if global_step is None:
+    raise ValueError("global_step is required for sgdr_decay.")
+  with ops.name_scope(name, "SGDRDecay",
+                      [learning_rate, global_step,
+                       initial_period_steps, t_mul, m_mul]) as name:
+    learning_rate = ops.convert_to_tensor(learning_rate,
+                                          name="initial_learning_rate")
+    dtype = learning_rate.dtype
+    global_step = math_ops.cast(global_step, dtype)
+    t_0 = math_ops.cast(initial_period_steps, dtype)
+    t_mul = math_ops.cast(t_mul, dtype)
+    m_mul = math_ops.cast(m_mul, dtype)
+
+    c_one = math_ops.cast(constant_op.constant(1.0), dtype)
+    c_half = math_ops.cast(constant_op.constant(0.5), dtype)
+    c_pi = math_ops.cast(constant_op.constant(math.pi), dtype)
+
+    # Find normalized value of the current step
+    x_val = math_ops.div(global_step, t_0)
+
+    def compute_step(x_val, geometric=False):
+      if geometric:
+        # Consider geometric series where t_mul != 1
+        # 1 + t_mul + t_mul^2 ... = (1 - t_mul^i_restart) / (1 - t_mul)
+
+        # First find how many restarts were performed for a given x_val
+        # Find maximal integer i_restart value for which this equation holds
+        # x_val >= (1 - t_mul^i_restart) / (1 - t_mul)
+        # x_val * (1 - t_mul) <= (1 - t_mul^i_restart)
+        # t_mul^i_restart <= (1 - x_val * (1 - t_mul))
+
+        # tensorflow allows only log with base e
+        # i_restart <= log(1 - x_val * (1 - t_mul) / log(t_mul)
+        # Find how many restarts were performed
+
+        i_restart = math_ops.floor(
+            math_ops.log(c_one - x_val * (c_one - t_mul)) / math_ops.log(t_mul))
+        # Compute the sum of all restarts before the current one
+        sum_r = (c_one - t_mul ** i_restart) / (c_one - t_mul)
+        # Compute our position within the current restart
+        x_val = (x_val - sum_r) / t_mul ** i_restart
+
+      else:
+        # Find how many restarts were performed
+        i_restart = math_ops.floor(x_val)
+        # Compute our position within the current restart
+        x_val = x_val - i_restart
+      return i_restart, x_val
+
+    i_restart, x_val = control_flow_ops.cond(
+        math_ops.equal(t_mul, c_one),
+        lambda: compute_step(x_val, geometric=False),
+        lambda: compute_step(x_val, geometric=True))
+
+    # If m_mul < 1, then the initial learning rate of every new restart will be
+    # smaller, i.e., by a factor of m_mul ** i_restart at i_restart-th restart
+    m_fac = learning_rate * (m_mul ** i_restart)
+
+  return math_ops.multiply(c_half * m_fac,
+                           (math_ops.cos(x_val * c_pi) + c_one), name=name)
diff --git a/tensorflow/contrib/training/python/training/sgdr_learning_rate_decay_test.py b/tensorflow/contrib/training/python/training/sgdr_learning_rate_decay_test.py
new file mode 100644
index 0000000000..4a46e9a49e
--- /dev/null
+++ b/tensorflow/contrib/training/python/training/sgdr_learning_rate_decay_test.py
@@ -0,0 +1,145 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Functional test for sgdr learning rate decay."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import math
+
+from sgdr_learning_rate_decay import sgdr_decay
+from tensorflow.python.platform import googletest
+from tensorflow.python.framework import test_util
+from tensorflow.python.framework import dtypes
+from tensorflow import placeholder
+
+
+class SGDRDecayTest(test_util.TensorFlowTestCase):
+  """Unit tests for SGDR learning rate decay."""
+
+  def get_original_values(self, lr, t_e, mult_factor, iter_per_epoch, epochs):
+    """Get an array with learning rate values from the consecutive steps using
+    the original implementation
+    (https://github.com/loshchil/SGDR/blob/master/SGDR_WRNs.py)."""
+    t0 = math.pi / 2.0
+    tt = 0
+    te_next = t_e
+
+    lr_values = []
+    sh_lr = lr
+    for epoch in range(epochs):
+      for _ in range(iter_per_epoch):
+        # In the original approach training function is executed here
+        lr_values.append(sh_lr)
+        dt = 2.0 * math.pi / float(2.0 * t_e)
+        tt = tt + float(dt) / iter_per_epoch
+        if tt >= math.pi:
+          tt = tt - math.pi
+        cur_t = t0 + tt
+        new_lr = lr * (1.0 + math.sin(cur_t)) / 2.0  # lr_min = 0, lr_max = lr
+        sh_lr = new_lr
+      if (epoch + 1) == te_next:  # time to restart
+        sh_lr = lr
+        tt = 0                # by setting to 0 we set lr to lr_max, see above
+        t_e = t_e * mult_factor  # change the period of restarts
+        te_next = te_next + t_e  # note the next restart's epoch
+
+    return lr_values
+
+  def get_sgdr_values(self, lr, initial_period_steps, t_mul, iters):
+    """Get an array with learning rate values from the consecutive steps
+    using current tensorflow implementation."""
+    with self.test_session():
+      step = placeholder(dtypes.int32)
+
+      decay = sgdr_decay(lr, step, initial_period_steps, t_mul)
+      lr_values = []
+      for i in range(iters):
+        lr_values.append(decay.eval(feed_dict={step: i}))
+
+      return lr_values
+
+  def testCompareToOriginal(self):
+    """Compare values generated by tensorflow implementation to the values
+    generated by the original implementation
+    (https://github.com/loshchil/SGDR/blob/master/SGDR_WRNs.py)."""
+    with self.test_session():
+      lr = 10.0
+      init_steps = 2
+      t_mul = 3
+      iters = 10
+      epochs = 50
+
+      org_lr = self.get_original_values(lr, init_steps, t_mul, iters, epochs)
+      sgdr_lr = self.get_sgdr_values(lr, init_steps*iters, t_mul, iters*epochs)
+
+      for org, sgdr in zip(org_lr, sgdr_lr):
+        self.assertAllClose(org, sgdr)
+
+  def testMDecay(self):
+    """Test m_mul argument. Check values for learning rate at the beginning
+    of the first, second, third and fourth period. """
+    with self.test_session():
+      step = placeholder(dtypes.int32)
+
+      lr = 0.1
+      t_e = 10
+      t_mul = 3
+      m_mul = 0.9
+
+      decay = sgdr_decay(lr, step, t_e, t_mul, m_mul)
+
+      test_step = 0
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}),
+                          lr)
+
+      test_step = t_e
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}),
+                          lr * m_mul)
+
+      test_step = t_e + t_e*t_mul
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}),
+                          lr * m_mul**2)
+
+      test_step = t_e + t_e*t_mul + t_e * (t_mul**2)
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}),
+                          lr * (m_mul**3))
+
+  def testCos(self):
+    """Check learning rate values at the beginning, in the middle
+    and at the end of the period."""
+    with self.test_session():
+      step = placeholder(dtypes.int32)
+      lr = 0.2
+      t_e = 1000
+      t_mul = 1
+
+      decay = sgdr_decay(lr, step, t_e, t_mul)
+
+      test_step = 0
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}), lr)
+
+      test_step = t_e//2
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}), lr/2)
+
+      test_step = t_e
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}), lr)
+
+      test_step = t_e*3//2
+      self.assertAllClose(decay.eval(feed_dict={step: test_step}), lr/2)
+
+if __name__ == "__main__":
+  googletest.main()
diff --git a/tensorflow/contrib/training/python/training/training_test.py b/tensorflow/contrib/training/python/training/training_test.py
index 4877c010fa..94cf7788b2 100644
--- a/tensorflow/contrib/training/python/training/training_test.py
+++ b/tensorflow/contrib/training/python/training/training_test.py
@@ -36,6 +36,7 @@ from tensorflow.python.ops.losses import losses
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import gradient_descent
 from tensorflow.python.training import monitored_session
 from tensorflow.python.training import saver as saver_lib
@@ -421,7 +422,7 @@ class TrainTest(test.TestCase):
       train_op = self.create_train_op()
 
       model_variables = variables_lib2.global_variables()
-      model_path = saver_lib.latest_checkpoint(logdir1)
+      model_path = checkpoint_management.latest_checkpoint(logdir1)
 
       assign_fn = variables_lib.assign_from_checkpoint_fn(
           model_path, model_variables)
diff --git a/tensorflow/contrib/verbs/rdma_mgr.cc b/tensorflow/contrib/verbs/rdma_mgr.cc
index 9cb3d1fbbf..3cb5e61fac 100644
--- a/tensorflow/contrib/verbs/rdma_mgr.cc
+++ b/tensorflow/contrib/verbs/rdma_mgr.cc
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/bfc_allocator.h"
 #include "tensorflow/core/common_runtime/gpu/gpu_process_state.h"
 #include "tensorflow/core/common_runtime/gpu/gpu_util.h"
+#include "tensorflow/core/common_runtime/pool_allocator.h"
 #include "tensorflow/core/common_runtime/process_state.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.h"
 #include "tensorflow/core/distributed_runtime/session_mgr.h"
@@ -255,28 +256,25 @@ void MRDeleter(ibv_mr* mr) {
   }
 }
 
-// TODO(byronyi): remove this class duplicated from the one in
-// common/runtime/gpu/pool_allocator.h when it is available in common_runtime
-class BasicCPUAllocator : public SubAllocator {
- public:
-  ~BasicCPUAllocator() override {}
-
-  void* Alloc(size_t alignment, size_t num_bytes) override {
-    return port::AlignedMalloc(num_bytes, alignment);
-  }
-  void Free(void* ptr, size_t) override { port::AlignedFree(ptr); }
-};
-
 // TODO(byronyi): remove this class and its registration when the default
-// cpu_allocator() returns visitable allocator
+// cpu_allocator() returns visitable allocator, or cpu_allocator() is no
+// longer in use.
 class BFCRdmaAllocator : public BFCAllocator {
  public:
   BFCRdmaAllocator()
-      : BFCAllocator(new BasicCPUAllocator(), 1LL << 36, true, "cpu_rdma_bfc") {
+      : BFCAllocator(new BasicCPUAllocator(port::kNUMANoAffinity), 1LL << 36,
+                     true, "cpu_rdma_bfc") {}
+};
+class BFCRdmaAllocatorFactory : public AllocatorFactory {
+ public:
+  Allocator* CreateAllocator() { return new BFCRdmaAllocator; }
+
+  SubAllocator* CreateSubAllocator(int numa_node) {
+    return new BasicCPUAllocator(numa_node);
   }
 };
 
-REGISTER_MEM_ALLOCATOR("BFCRdmaAllocator", 101, BFCRdmaAllocator);
+REGISTER_MEM_ALLOCATOR("BFCRdmaAllocator", 101, BFCRdmaAllocatorFactory);
 
 void RdmaMgr::InitAllocators() {
   RdmaMemoryMgr::Singleton().pd_ = rdma_adapter_->pd_;
@@ -284,8 +282,8 @@ void RdmaMgr::InitAllocators() {
   Allocator* allocators[] = {
 #if GOOGLE_CUDA
     GPUProcessState::singleton()->GetCUDAHostAllocator(0),
-    ProcessState::singleton()->GetCPUAllocator(0),
 #endif  // GOOGLE_CUDA
+    ProcessState::singleton()->GetCPUAllocator(0),
     cpu_allocator(),
   };
 
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index 97880219b8..8dea6316a7 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -150,7 +150,6 @@ load(
     "//third_party/mkl:build_defs.bzl",
     "if_mkl",
 )
-load("@io_bazel_rules_closure//closure:defs.bzl", "closure_proto_library")
 
 exports_files(["ops/ops.pbtxt"])
 
@@ -334,6 +333,7 @@ filegroup(
         "platform/init_main.h",
         "platform/mem.h",
         "platform/mutex.h",
+        "platform/numa.h",
         "platform/thread_annotations.h",
     ],
     visibility = ["//visibility:private"],
@@ -662,6 +662,7 @@ cc_library(
         "lib/random/random_distributions.h",
         "lib/random/simple_philox.h",
         "lib/strings/numbers.h",
+        "lib/strings/proto_serialization.h",
         "lib/strings/str_util.h",
         "lib/strings/strcat.h",
         "lib/strings/stringprintf.h",
@@ -846,6 +847,7 @@ tf_cuda_library(
         "util/sparse/sparse_tensor.h",
         "util/stat_summarizer.h",
         "util/stat_summarizer_options.h",
+        "util/status_util.h",
         "util/stream_executor_util.h",
         "util/strided_slice_op.h",
         "util/tensor_format.h",
@@ -882,6 +884,16 @@ cc_library(
     copts = tf_copts(),
 )
 
+tf_cc_test(
+    name = "stats_calculator_test",
+    srcs = ["util/stats_calculator_test.cc"],
+    deps = [
+        ":stats_calculator_portable",
+        ":test",
+        ":test_main",
+    ],
+)
+
 cc_library(
     name = "overflow",
     hdrs = ["util/overflow.h"],
@@ -1644,6 +1656,7 @@ cc_library(
     copts = tf_copts(android_optimization_level_override = None) + [
         "-DSUPPORT_SELECTIVE_REGISTRATION",
     ],
+    linkopts = if_android(["-lz"]),
     tags = [
         "manual",
         "notap",
@@ -1667,6 +1680,7 @@ cc_library(
     copts = tf_copts(android_optimization_level_override = None) + tf_opts_nortti_if_android() + [
         "-DSUPPORT_SELECTIVE_REGISTRATION",
     ],
+    linkopts = if_android(["-lz"]),
     tags = [
         "manual",
         "notap",
@@ -1923,7 +1937,6 @@ tf_proto_library_cc(
     srcs = ["protobuf/master_service.proto"],
     has_services = 1,
     cc_api_version = 2,
-    cc_grpc_version = 1,
     cc_stubby_versions = ["2"],
     protodeps = [":master_proto"],
     visibility = [
@@ -1953,8 +1966,10 @@ LIB_INTERNAL_PRIVATE_HEADERS = ["framework/resource_handle.h"] + glob(
         "**/*test*",
         "lib/gif/**/*",
         "lib/jpeg/**/*",
+        "lib/png/**/*",
         "platform/gif.h",
         "platform/jpeg.h",
+        "platform/png.h",
         "platform/**/cuda.h",
         "platform/**/stream_executor.h",
     ],
@@ -2049,6 +2064,7 @@ cc_library(
             "lib/hash/crc32c_accelerate.cc",
             "lib/gif/**/*",
             "lib/jpeg/**/*",
+            "lib/png/**/*",
             "platform/**/env_time.cc",
             "platform/**/cuda_libdevice_path.cc",
             "platform/**/device_tracer.cc",
@@ -2145,6 +2161,39 @@ cc_library(
 )
 
 cc_library(
+    name = "png_internal",
+    srcs = ["lib/png/png_io.cc"],
+    hdrs = [
+        "lib/bfloat16/bfloat16.h",
+        "lib/core/casts.h",
+        "lib/core/stringpiece.h",
+        "lib/png/png_io.h",
+        "platform/byte_order.h",
+        "platform/cpu_info.h",
+        "platform/default/integral_types.h",
+        "platform/default/logging.h",
+        "platform/logging.h",
+        "platform/macros.h",
+        "platform/platform.h",
+        "platform/png.h",
+        "platform/types.h",
+    ],
+    copts = tf_copts(),
+    linkopts = select({
+        "//tensorflow:freebsd": [],
+        "//tensorflow:windows": [],
+        "//tensorflow:windows_msvc": [],
+        "//conditions:default": ["-ldl"],
+    }),
+    deps = [
+        ":lib",
+        ":lib_internal",
+        "//tensorflow/core/platform/default/build_config:png",
+        "@zlib_archive//:zlib",
+    ],
+)
+
+cc_library(
     name = "tflite_portable_logging",
     srcs = [],
     hdrs = [
@@ -2189,6 +2238,7 @@ cc_library(
     linkopts = ["-ldl"],
     deps = [
         "//tensorflow/core/platform/default/build_config:jpeg",
+        "//tensorflow/core/platform/default/build_config:logging",
     ],
 )
 
@@ -2217,6 +2267,7 @@ cc_library(
     linkopts = ["-ldl"],
     deps = [
         "//tensorflow/core/platform/default/build_config:gif",
+        "//tensorflow/core/platform/default/build_config:logging",
     ],
 )
 
@@ -2243,6 +2294,7 @@ cc_library(
     copts = tf_copts(),
     linkopts = ["-ldl"],
     deps = [
+        "//tensorflow/core/platform/default/build_config:logging",
         "@png_archive//:png",
     ],
 )
@@ -2429,6 +2481,7 @@ tf_cuda_library(
             "framework/resource_handle.cc",
             "util/memmapped_file_system.*",
             "util/memmapped_file_system_writer.*",
+            "util/stats_calculator.*",
             "util/version_info.cc",
         ],
     ) + select({
@@ -2455,6 +2508,7 @@ tf_cuda_library(
         ":protos_all_proto_text",
         ":error_codes_proto_text",
         ":protos_all_cc",
+        ":stats_calculator_portable",
         ":version_lib",
         "//tensorflow/core/platform/default/build_config:platformlib",
         "//tensorflow/core/kernels:bounds_check",
@@ -2874,6 +2928,14 @@ tf_cuda_library(
 )
 
 cc_library(
+    name = "session_ref",
+    srcs = ["common_runtime/session_ref.cc"],
+    hdrs = ["common_runtime/session_ref.h"],
+    copts = tf_copts(),
+    deps = [":core_cpu_base"],
+)
+
+cc_library(
     name = "gpu_id",
     hdrs = [
         "common_runtime/gpu/gpu_id.h",
@@ -3188,6 +3250,7 @@ tf_cc_tests(
         ":test",
         ":test_main",
         "//third_party/eigen3",
+        "@zlib_archive//:zlib",
     ],
 )
 
@@ -3238,6 +3301,28 @@ tf_cc_test(
 )
 
 tf_cc_test(
+    name = "platform_numa_test",
+    size = "small",
+    srcs = ["platform/numa_test.cc"],
+    tags = [
+        # This test will not pass unless it has access to all NUMA nodes
+        # on the executing machine.
+        "manual",
+        "notap",
+    ],
+    deps = [
+        ":framework",
+        ":lib",
+        ":lib_internal",
+        ":lib_test_internal",
+        ":protos_all_cc",
+        ":test",
+        ":test_main",
+        "//third_party/eigen3",
+    ],
+)
+
+tf_cc_test(
     name = "platform_setround_test",
     size = "small",
     srcs = ["platform/setround_test.cc"],
@@ -3408,6 +3493,7 @@ tf_cc_tests(
         "framework/tensor_shape_test.cc",
         "framework/tensor_slice_test.cc",
         "framework/tensor_test.cc",
+        "framework/tensor_testutil_test.cc",
         "framework/tensor_util_test.cc",
         "framework/tracking_allocator_test.cc",
         "framework/types_test.cc",
@@ -3601,6 +3687,7 @@ tf_cc_test_mkl(
     deps = [
         ":core",
         ":core_cpu",
+        ":core_cpu_internal",
         ":framework",
         ":framework_internal",
         ":test",
@@ -3674,7 +3761,6 @@ tf_cc_tests_gpu(
         "common_runtime/gpu/gpu_bfc_allocator_test.cc",
         "common_runtime/gpu/gpu_device_test.cc",
         "common_runtime/gpu/gpu_id_manager_test.cc",
-        "common_runtime/gpu/gpu_event_mgr_test.cc",
         "common_runtime/gpu/pool_allocator_test.cc",
     ],
     linkstatic = tf_kernel_tests_linkstatic(),
@@ -3698,6 +3784,23 @@ tf_cc_tests_gpu(
     ],
 )
 
+tf_cc_test_gpu(
+    name = "gpu_event_mgr_test",
+    srcs = ["common_runtime/gpu/gpu_event_mgr_test.cc"],
+    linkstatic = tf_kernel_tests_linkstatic(),
+    tags = tf_cuda_tests_tags(),
+    deps = [
+        ":framework",
+        ":framework_internal",
+        ":lib",
+        ":lib_internal",
+        ":protos_all_cc",
+        ":test",
+        ":test_main",
+        ":testlib",
+    ],
+)
+
 tf_cuda_cc_test(
     name = "gpu_device_unified_memory_test",
     size = "small",
diff --git a/tensorflow/core/api_def/base_api/api_def_BoostedTreesCenterBias.pbtxt b/tensorflow/core/api_def/base_api/api_def_BoostedTreesCenterBias.pbtxt
new file mode 100644
index 0000000000..b58b974eb4
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_BoostedTreesCenterBias.pbtxt
@@ -0,0 +1,41 @@
+op {
+  graph_op_name: "BoostedTreesCenterBias"
+  visibility: HIDDEN
+  in_arg {
+    name: "tree_ensemble_handle"
+    description: <<END
+Handle to the tree ensemble.
+END
+  }
+  in_arg {
+    name: "mean_gradients"
+    description: <<END
+A tensor with shape=[logits_dimension] with mean of gradients for a first node.
+END
+  }
+  in_arg {
+    name: "mean_hessians"
+    description: <<END
+A tensor with shape=[logits_dimension] mean of hessians for a first node.
+END
+  }
+in_arg {
+    name: "l1"
+    description: <<END
+l1 regularization factor on leaf weights, per instance based.
+END
+  }
+  in_arg {
+    name: "l2"
+    description: <<END
+l2 regularization factor on leaf weights, per instance based.
+END
+  }
+  out_arg {
+    name: "continue_centering"
+    description: <<END
+Bool, whether to continue bias centering.
+END
+  }
+  summary: "Calculates the prior from the training data (the bias) and fills in the first node with the logits' prior. Returns a boolean indicating whether to continue centering."
+}
+\ No newline at end of file
diff --git a/tensorflow/core/api_def/base_api/api_def_Ceil.pbtxt b/tensorflow/core/api_def/base_api/api_def_Ceil.pbtxt
index ad1ada8d71..3134fceeca 100644
--- a/tensorflow/core/api_def/base_api/api_def_Ceil.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_Ceil.pbtxt
@@ -1,4 +1,4 @@
 op {
   graph_op_name: "Ceil"
-  summary: "Returns element-wise smallest integer in not less than x."
+  summary: "Returns element-wise smallest integer not less than x."
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_DrawBoundingBoxes.pbtxt b/tensorflow/core/api_def/base_api/api_def_DrawBoundingBoxes.pbtxt
index 6c3ae09f5d..35c916e269 100644
--- a/tensorflow/core/api_def/base_api/api_def_DrawBoundingBoxes.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_DrawBoundingBoxes.pbtxt
@@ -30,7 +30,7 @@ height of the underlying image.
 
 For example, if an image is 100 x 200 pixels (height x width) and the bounding
 box is `[0.1, 0.2, 0.5, 0.9]`, the upper-left and bottom-right coordinates of
-the bounding box will be `(40, 10)` to `(100, 50)` (in (x,y) coordinates).
+the bounding box will be `(40, 10)` to `(180, 50)` (in (x,y) coordinates).
 
 Parts of the bounding box may fall outside the image.
 END
diff --git a/tensorflow/core/api_def/base_api/api_def_IteratorFromStringHandleV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_IteratorFromStringHandleV2.pbtxt
new file mode 100644
index 0000000000..9d464b2aea
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_IteratorFromStringHandleV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorFromStringHandleV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_IteratorGetNext.pbtxt b/tensorflow/core/api_def/base_api/api_def_IteratorGetNext.pbtxt
index ea5669693e..dfd199d012 100644
--- a/tensorflow/core/api_def/base_api/api_def_IteratorGetNext.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_IteratorGetNext.pbtxt
@@ -1,4 +1,4 @@
 op {
   graph_op_name: "IteratorGetNext"
-  summary: "Gets the next output from the given iterator."
+  summary: "Gets the next output from the given iterator ."
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_IteratorGetNextAsOptional.pbtxt b/tensorflow/core/api_def/base_api/api_def_IteratorGetNextAsOptional.pbtxt
new file mode 100644
index 0000000000..7068336847
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_IteratorGetNextAsOptional.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorGetNextAsOptional"
+  summary: "Gets the next output from the given iterator as an Optional variant."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_IteratorV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_IteratorV2.pbtxt
new file mode 100644
index 0000000000..becc729016
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_IteratorV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_NonMaxSuppressionV4.pbtxt b/tensorflow/core/api_def/base_api/api_def_NonMaxSuppressionV4.pbtxt
new file mode 100644
index 0000000000..75df90f570
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_NonMaxSuppressionV4.pbtxt
@@ -0,0 +1,78 @@
+op {
+  graph_op_name: "NonMaxSuppressionV4"
+  in_arg {
+    name: "boxes"
+    description: <<END
+A 2-D float tensor of shape `[num_boxes, 4]`.
+END
+  }
+  in_arg {
+    name: "scores"
+    description: <<END
+A 1-D float tensor of shape `[num_boxes]` representing a single
+score corresponding to each box (each row of boxes).
+END
+  }
+  in_arg {
+    name: "max_output_size"
+    description: <<END
+A scalar integer tensor representing the maximum number of
+boxes to be selected by non max suppression.
+END
+  }
+  in_arg {
+    name: "iou_threshold"
+    description: <<END
+A 0-D float tensor representing the threshold for deciding whether
+boxes overlap too much with respect to IOU.
+END
+  }
+  in_arg {
+    name: "score_threshold"
+    description: <<END
+A 0-D float tensor representing the threshold for deciding when to remove
+boxes based on score.
+END
+  }
+  attr {
+    name: "pad_to_max_output_size"
+    description: <<END
+If true, the output `selected_indices` is padded to be of length
+`max_output_size`. Defaults to false.
+END
+  }
+  out_arg {
+    name: "selected_indices"
+    description: <<END
+A 1-D integer tensor of shape `[M]` representing the selected
+indices from the boxes tensor, where `M <= max_output_size`.
+END
+  }
+  out_arg {
+    name: "valid_outputs"
+    description: <<END
+A 0-D integer tensor representing the number of valid elements in
+`selected_indices`, with the valid elements appearing first.
+END
+  }
+  summary: "Greedily selects a subset of bounding boxes in descending order of score,"
+  description: <<END
+pruning away boxes that have high intersection-over-union (IOU) overlap
+with previously selected boxes.  Bounding boxes with score less than
+`score_threshold` are removed.  Bounding boxes are supplied as
+[y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
+diagonal pair of box corners and the coordinates can be provided as normalized
+(i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
+is agnostic to where the origin is in the coordinate system and more
+generally is invariant to orthogonal transformations and translations
+of the coordinate system; thus translating or reflections of the coordinate
+system result in the same boxes being selected by the algorithm.
+The output of this operation is a set of integers indexing into the input
+collection of bounding boxes representing the selected boxes.  The bounding
+box coordinates corresponding to the selected indices can then be obtained
+using the `tf.gather operation`.  For example:
+  selected_indices = tf.image.non_max_suppression_v2(
+      boxes, scores, max_output_size, iou_threshold, score_threshold)
+  selected_boxes = tf.gather(boxes, selected_indices)
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_NonMaxSuppressionWithOverlaps.pbtxt b/tensorflow/core/api_def/base_api/api_def_NonMaxSuppressionWithOverlaps.pbtxt
new file mode 100644
index 0000000000..180edb15a4
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_NonMaxSuppressionWithOverlaps.pbtxt
@@ -0,0 +1,62 @@
+op {
+  graph_op_name: "NonMaxSuppressionWithOverlaps"
+  in_arg {
+    name: "overlaps"
+    description: <<END
+A 2-D float tensor of shape `[num_boxes, num_boxes]` representing
+the n-by-n box overlap values.
+END
+  }
+  in_arg {
+    name: "scores"
+    description: <<END
+A 1-D float tensor of shape `[num_boxes]` representing a single
+score corresponding to each box (each row of boxes).
+END
+  }
+  in_arg {
+    name: "max_output_size"
+    description: <<END
+A scalar integer tensor representing the maximum number of
+boxes to be selected by non max suppression.
+END
+  }
+  in_arg {
+    name: "overlap_threshold"
+    description: <<END
+A 0-D float tensor representing the threshold for deciding whether
+boxes overlap too.
+END
+  }
+  in_arg {
+    name: "score_threshold"
+    description: <<END
+A 0-D float tensor representing the threshold for deciding when to remove
+boxes based on score.
+END
+  }
+  out_arg {
+    name: "selected_indices"
+    description: <<END
+A 1-D integer tensor of shape `[M]` representing the selected
+indices from the boxes tensor, where `M <= max_output_size`.
+END
+  }
+  summary: "Greedily selects a subset of bounding boxes in descending order of score,"
+  description: <<END
+pruning away boxes that have high overlaps
+with previously selected boxes.  Bounding boxes with score less than
+`score_threshold` are removed. N-by-n overlap values are supplied as square matrix,
+which allows for defining a custom overlap criterium (eg. intersection over union,
+intersection over area, etc.).
+
+The output of this operation is a set of integers indexing into the input
+collection of bounding boxes representing the selected boxes.  The bounding
+box coordinates corresponding to the selected indices can then be obtained
+using the `tf.gather operation`.  For example:
+
+  selected_indices = tf.image.non_max_suppression_with_overlaps(
+      overlaps, scores, max_output_size, overlap_threshold, score_threshold)
+  selected_boxes = tf.gather(boxes, selected_indices)
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_OptionalFromValue.pbtxt b/tensorflow/core/api_def/base_api/api_def_OptionalFromValue.pbtxt
new file mode 100644
index 0000000000..4a15eea424
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_OptionalFromValue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalFromValue"
+  summary: "Constructs an Optional variant from a tuple of tensors."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_OptionalGetValue.pbtxt b/tensorflow/core/api_def/base_api/api_def_OptionalGetValue.pbtxt
new file mode 100644
index 0000000000..11c0c545d0
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_OptionalGetValue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalGetValue"
+  summary: "Returns the value stored in an Optional variant or raises an error if none exists."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_OptionalHasValue.pbtxt b/tensorflow/core/api_def/base_api/api_def_OptionalHasValue.pbtxt
new file mode 100644
index 0000000000..7669178427
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_OptionalHasValue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalHasValue"
+  summary: "Returns true if and only if the given Optional variant has a value."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_OptionalNone.pbtxt b/tensorflow/core/api_def/base_api/api_def_OptionalNone.pbtxt
new file mode 100644
index 0000000000..150062a704
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_OptionalNone.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalNone"
+  summary: "Creates an Optional variant with no value."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdAdd.pbtxt b/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdAdd.pbtxt
index 3b3a274df5..2b58969da2 100644
--- a/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdAdd.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdAdd.pbtxt
@@ -51,7 +51,7 @@ For example, say we want to update 4 scattered elements to a rank-1 tensor to
 8 elements. In Python, that update would look like this:
 
 ```python
-    ref = tfe.Variable([1, 2, 3, 4, 5, 6, 7, 8])
+    ref = tf.Variable([1, 2, 3, 4, 5, 6, 7, 8], use_resource=True)
     indices = tf.constant([[4], [3], [1] ,[7]])
     updates = tf.constant([9, 10, 11, 12])
     update = tf.scatter_nd_add(ref, indices, updates)
diff --git a/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdUpdate.pbtxt b/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdUpdate.pbtxt
index b07ee9fda9..17b79ee30c 100644
--- a/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdUpdate.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_ResourceScatterNdUpdate.pbtxt
@@ -51,7 +51,7 @@ For example, say we want to update 4 scattered elements to a rank-1 tensor to
 8 elements. In Python, that update would look like this:
 
 ```python
-    ref = tfe.Variable([1, 2, 3, 4, 5, 6, 7, 8])
+    ref = tf.Variable([1, 2, 3, 4, 5, 6, 7, 8])
     indices = tf.constant([[4], [3], [1] ,[7]])
     updates = tf.constant([9, 10, 11, 12])
     update = tf.scatter_nd_update(ref, indices, updates)
diff --git a/tensorflow/core/api_def/base_api/api_def_ScatterNd.pbtxt b/tensorflow/core/api_def/base_api/api_def_ScatterNd.pbtxt
index 58753a651a..ad1c527b01 100644
--- a/tensorflow/core/api_def/base_api/api_def_ScatterNd.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_ScatterNd.pbtxt
@@ -32,8 +32,12 @@ slices within a tensor (initially zero for numeric, empty for string) of
 the given `shape` according to indices.  This operator is the inverse of the
 @{tf.gather_nd} operator which extracts values or slices from a given tensor.
 
+If `indices` contains duplicates, then their updates are accumulated (summed).
+
 **WARNING**: The order in which updates are applied is nondeterministic, so the
-output will be nondeterministic if `indices` contains duplicates.
+output will be nondeterministic if `indices` contains duplicates -- because
+of some numerical approximation issues, numbers summed in different order
+may yield different results.
 
 `indices` is an integer tensor containing indices into a new tensor of shape
 `shape`.  The last dimension of `indices` can be at most the rank of `shape`:
diff --git a/tensorflow/core/api_def/base_api/api_def_IdentityDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_SinkDataset.pbtxt
index ff2854fd2c..b5758ddbfb 100644
--- a/tensorflow/core/api_def/base_api/api_def_IdentityDataset.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SinkDataset.pbtxt
@@ -1,5 +1,5 @@
 op {
-  graph_op_name: "IdentityDataset"
+  graph_op_name: "SinkDataset"
   visibility: HIDDEN
   in_arg {
     name: "input_dataset"
diff --git a/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt
index c80ee77f73..ddde3ee5b4 100644
--- a/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt
@@ -8,11 +8,18 @@ sliding window.
 END
   }
   in_arg {
-    name: "stride"
+    name: "window_shift"
     description: <<END
 A scalar representing the steps moving the sliding window
 forward in one iteration. It must be positive.
 END
   }
+  in_arg {
+    name: "window_stride"
+    description: <<END
+A scalar representing the stride of the input elements of the sliding window.
+It must be positive.
+END
+  }
   summary: "Creates a dataset that passes a sliding window over `input_dataset`."
 }
diff --git a/tensorflow/core/api_def/base_api/api_def_StatefulPartitionedCall.pbtxt b/tensorflow/core/api_def/base_api/api_def_StatefulPartitionedCall.pbtxt
new file mode 100644
index 0000000000..c4cb4e362a
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_StatefulPartitionedCall.pbtxt
@@ -0,0 +1,25 @@
+
+op {
+  graph_op_name: "StatefulPartitionedCall"
+  in_arg {
+    name: "args"
+    description: "A list of input tensors."
+  }
+  out_arg {
+    name: "output"
+    description: "A list of return values."
+  }
+  attr { name: "Tin"  description: "A list of input types." }
+  attr { name: "Tout"  description: "A list of output types." }
+  attr {
+    name: "f"
+    description: <<END
+      A function that takes 'args', a list of tensors, and returns 'output',
+      another list of tensors. Input and output types are specified by 'Tin'
+      and 'Tout'. The function body of f will be placed and partitioned across
+      devices, setting this op apart from the regular Call op. This op is
+      stateful.
+END
+  }
+  summary: "returns `f(inputs)`, where `f`'s body is placed and partitioned."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_WindowDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_WindowDataset.pbtxt
new file mode 100644
index 0000000000..1bc3660479
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_WindowDataset.pbtxt
@@ -0,0 +1,11 @@
+op {
+  visibility: HIDDEN
+  graph_op_name: "WindowDataset"
+  in_arg {
+    name: "window_size"
+    description: <<END
+A scalar representing the number of elements to accumulate in a window.
+END
+  }
+  summary: "A dataset that creates window datasets from the input dataset."
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Acos.pbtxt b/tensorflow/core/api_def/python_api/api_def_Acos.pbtxt
index ca1ee78526..1fd8baf05f 100644
--- a/tensorflow/core/api_def/python_api/api_def_Acos.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Acos.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "acos"
-    deprecation_message: "tf.acos is deprecated, please use tf.math.acos instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Acosh.pbtxt b/tensorflow/core/api_def/python_api/api_def_Acosh.pbtxt
index 7503353e41..f7946652ef 100644
--- a/tensorflow/core/api_def/python_api/api_def_Acosh.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Acosh.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "acosh"
-    deprecation_message: "tf.acosh is deprecated, please use tf.math.acosh instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Add.pbtxt b/tensorflow/core/api_def/python_api/api_def_Add.pbtxt
index cc5d68b15d..fb505a91ac 100644
--- a/tensorflow/core/api_def/python_api/api_def_Add.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Add.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "add"
-    deprecation_message: "tf.add is deprecated, please use tf.math.add instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_AsString.pbtxt b/tensorflow/core/api_def/python_api/api_def_AsString.pbtxt
index 9306eaf373..ea65543a76 100644
--- a/tensorflow/core/api_def/python_api/api_def_AsString.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_AsString.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "as_string"
-    deprecation_message: "tf.as_string is deprecated, please use tf.dtypes.as_string instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Asin.pbtxt b/tensorflow/core/api_def/python_api/api_def_Asin.pbtxt
index 7622af7b45..eedf4553c6 100644
--- a/tensorflow/core/api_def/python_api/api_def_Asin.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Asin.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "asin"
-    deprecation_message: "tf.asin is deprecated, please use tf.math.asin instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Asinh.pbtxt b/tensorflow/core/api_def/python_api/api_def_Asinh.pbtxt
index 395275c21d..10c2fb356e 100644
--- a/tensorflow/core/api_def/python_api/api_def_Asinh.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Asinh.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "asinh"
-    deprecation_message: "tf.asinh is deprecated, please use tf.math.asinh instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Atan.pbtxt b/tensorflow/core/api_def/python_api/api_def_Atan.pbtxt
index dfcd632558..03dd5dc848 100644
--- a/tensorflow/core/api_def/python_api/api_def_Atan.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Atan.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "atan"
-    deprecation_message: "tf.atan is deprecated, please use tf.math.atan instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Atan2.pbtxt b/tensorflow/core/api_def/python_api/api_def_Atan2.pbtxt
index fba79507aa..85b27bd881 100644
--- a/tensorflow/core/api_def/python_api/api_def_Atan2.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Atan2.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "atan2"
-    deprecation_message: "tf.atan2 is deprecated, please use tf.math.atan2 instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Atanh.pbtxt b/tensorflow/core/api_def/python_api/api_def_Atanh.pbtxt
index f7164c33e8..ee7c0600d6 100644
--- a/tensorflow/core/api_def/python_api/api_def_Atanh.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Atanh.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "atanh"
-    deprecation_message: "tf.atanh is deprecated, please use tf.math.atanh instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchToSpaceND.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchToSpaceND.pbtxt
index 56e49a2221..9552fc92e3 100644
--- a/tensorflow/core/api_def/python_api/api_def_BatchToSpaceND.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_BatchToSpaceND.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "batch_to_space_nd"
-    deprecation_message: "tf.batch_to_space_nd is deprecated, please use tf.manip.batch_to_space_nd instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Betainc.pbtxt b/tensorflow/core/api_def/python_api/api_def_Betainc.pbtxt
index 7c37b534c7..7ad7cbcba9 100644
--- a/tensorflow/core/api_def/python_api/api_def_Betainc.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Betainc.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "betainc"
-    deprecation_message: "tf.betainc is deprecated, please use tf.math.betainc instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Ceil.pbtxt b/tensorflow/core/api_def/python_api/api_def_Ceil.pbtxt
index 0c72cf2edd..f2265bad56 100644
--- a/tensorflow/core/api_def/python_api/api_def_Ceil.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Ceil.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "ceil"
-    deprecation_message: "tf.ceil is deprecated, please use tf.math.ceil instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_CheckNumerics.pbtxt b/tensorflow/core/api_def/python_api/api_def_CheckNumerics.pbtxt
index 7ea52d30b6..541b09a591 100644
--- a/tensorflow/core/api_def/python_api/api_def_CheckNumerics.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_CheckNumerics.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "check_numerics"
-    deprecation_message: "tf.check_numerics is deprecated, please use tf.debugging.check_numerics instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Cholesky.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cholesky.pbtxt
index 568fab4037..942f4e6ed8 100644
--- a/tensorflow/core/api_def/python_api/api_def_Cholesky.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Cholesky.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "cholesky"
-    deprecation_message: "tf.cholesky is deprecated, please use tf.linalg.cholesky instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Cos.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cos.pbtxt
index 6550cd2d4e..1af8c0c2c9 100644
--- a/tensorflow/core/api_def/python_api/api_def_Cos.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Cos.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "cos"
-    deprecation_message: "tf.cos is deprecated, please use tf.math.cos instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Cosh.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cosh.pbtxt
index ef82a45a80..2de87df40d 100644
--- a/tensorflow/core/api_def/python_api/api_def_Cosh.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Cosh.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "cosh"
-    deprecation_message: "tf.cosh is deprecated, please use tf.math.cosh instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Cross.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cross.pbtxt
index 33c1b8c617..e8a871cae6 100644
--- a/tensorflow/core/api_def/python_api/api_def_Cross.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Cross.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "cross"
-    deprecation_message: "tf.cross is deprecated, please use tf.linalg.cross instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_DecodeBase64.pbtxt b/tensorflow/core/api_def/python_api/api_def_DecodeBase64.pbtxt
index 55c43ceba2..8b96eee631 100644
--- a/tensorflow/core/api_def/python_api/api_def_DecodeBase64.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_DecodeBase64.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "decode_base64"
-    deprecation_message: "tf.decode_base64 is deprecated, please use tf.io.decode_base64 instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt b/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt
index 5f6be24cc4..829608fc8f 100644
--- a/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "decode_compressed"
-    deprecation_message: "tf.decode_compressed is deprecated, please use tf.io.decode_compressed instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_DecodeJSONExample.pbtxt b/tensorflow/core/api_def/python_api/api_def_DecodeJSONExample.pbtxt
index 3759047f57..9f28bc5f59 100644
--- a/tensorflow/core/api_def/python_api/api_def_DecodeJSONExample.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_DecodeJSONExample.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "decode_json_example"
-    deprecation_message: "tf.decode_json_example is deprecated, please use tf.io.decode_json_example instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_DecodeRaw.pbtxt b/tensorflow/core/api_def/python_api/api_def_DecodeRaw.pbtxt
index a83f702dca..0010a59ca4 100644
--- a/tensorflow/core/api_def/python_api/api_def_DecodeRaw.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_DecodeRaw.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "decode_raw"
-    deprecation_message: "tf.decode_raw is deprecated, please use tf.io.decode_raw instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Dequantize.pbtxt b/tensorflow/core/api_def/python_api/api_def_Dequantize.pbtxt
index c9b4f76fab..5edd0c216b 100644
--- a/tensorflow/core/api_def/python_api/api_def_Dequantize.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Dequantize.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "dequantize"
-    deprecation_message: "tf.dequantize is deprecated, please use tf.quantization.dequantize instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Diag.pbtxt b/tensorflow/core/api_def/python_api/api_def_Diag.pbtxt
index 2043facfa9..cba30e63e8 100644
--- a/tensorflow/core/api_def/python_api/api_def_Diag.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Diag.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "diag"
-    deprecation_message: "tf.diag is deprecated, please use tf.linalg.tensor_diag instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_DiagPart.pbtxt b/tensorflow/core/api_def/python_api/api_def_DiagPart.pbtxt
index 7fa30b2347..54e1f34e82 100644
--- a/tensorflow/core/api_def/python_api/api_def_DiagPart.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_DiagPart.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "diag_part"
-    deprecation_message: "tf.diag_part is deprecated, please use tf.linalg.tensor_diag_part instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Digamma.pbtxt b/tensorflow/core/api_def/python_api/api_def_Digamma.pbtxt
index 03f57678a8..91b4dfead7 100644
--- a/tensorflow/core/api_def/python_api/api_def_Digamma.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Digamma.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "digamma"
-    deprecation_message: "tf.digamma is deprecated, please use tf.math.digamma instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_EncodeBase64.pbtxt b/tensorflow/core/api_def/python_api/api_def_EncodeBase64.pbtxt
index 47b4ab4da4..71bb73cfb2 100644
--- a/tensorflow/core/api_def/python_api/api_def_EncodeBase64.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_EncodeBase64.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "encode_base64"
-    deprecation_message: "tf.encode_base64 is deprecated, please use tf.io.encode_base64 instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Equal.pbtxt b/tensorflow/core/api_def/python_api/api_def_Equal.pbtxt
index 2630962f7d..78aa1b3bc5 100644
--- a/tensorflow/core/api_def/python_api/api_def_Equal.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Equal.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "equal"
-    deprecation_message: "tf.equal is deprecated, please use tf.math.equal instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Erfc.pbtxt b/tensorflow/core/api_def/python_api/api_def_Erfc.pbtxt
index 6a511b3251..e96df0c596 100644
--- a/tensorflow/core/api_def/python_api/api_def_Erfc.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Erfc.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "erfc"
-    deprecation_message: "tf.erfc is deprecated, please use tf.math.erfc instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Exp.pbtxt b/tensorflow/core/api_def/python_api/api_def_Exp.pbtxt
index e1fd718ff0..70323fe5b4 100644
--- a/tensorflow/core/api_def/python_api/api_def_Exp.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Exp.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "exp"
-    deprecation_message: "tf.exp is deprecated, please use tf.math.exp instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Expm1.pbtxt b/tensorflow/core/api_def/python_api/api_def_Expm1.pbtxt
index ca25706407..8ddf9d4d70 100644
--- a/tensorflow/core/api_def/python_api/api_def_Expm1.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Expm1.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "expm1"
-    deprecation_message: "tf.expm1 is deprecated, please use tf.math.expm1 instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_ExtractImagePatches.pbtxt b/tensorflow/core/api_def/python_api/api_def_ExtractImagePatches.pbtxt
index d302e26ad2..f008b1222d 100644
--- a/tensorflow/core/api_def/python_api/api_def_ExtractImagePatches.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_ExtractImagePatches.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "extract_image_patches"
-    deprecation_message: "tf.extract_image_patches is deprecated, please use tf.image.extract_image_patches instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_FFT.pbtxt b/tensorflow/core/api_def/python_api/api_def_FFT.pbtxt
index 57a00a08e3..d79e936b71 100644
--- a/tensorflow/core/api_def/python_api/api_def_FFT.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_FFT.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "fft"
-    deprecation_message: "tf.fft is deprecated, please use tf.spectral.fft instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgs.pbtxt b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgs.pbtxt
index cd14b13675..d8db83331f 100644
--- a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgs.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgs.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "fake_quant_with_min_max_args"
-    deprecation_message: "tf.fake_quant_with_min_max_args is deprecated, please use tf.quantization.fake_quant_with_min_max_args instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgsGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgsGradient.pbtxt
index d55cb69d1d..74f01d1a0c 100644
--- a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgsGradient.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxArgsGradient.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "fake_quant_with_min_max_args_gradient"
-    deprecation_message: "tf.fake_quant_with_min_max_args_gradient is deprecated, please use tf.quantization.fake_quant_with_min_max_args_gradient instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVars.pbtxt b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVars.pbtxt
index 6ff4f2cdb2..e14fb6d118 100644
--- a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVars.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVars.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "fake_quant_with_min_max_vars"
-    deprecation_message: "tf.fake_quant_with_min_max_vars is deprecated, please use tf.quantization.fake_quant_with_min_max_vars instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsGradient.pbtxt
index 817a35cc6c..4611ebdfb8 100644
--- a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsGradient.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsGradient.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "fake_quant_with_min_max_vars_gradient"
-    deprecation_message: "tf.fake_quant_with_min_max_vars_gradient is deprecated, please use tf.quantization.fake_quant_with_min_max_vars_gradient instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt
index 275c0d5225..0936e513c3 100644
--- a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannel.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "fake_quant_with_min_max_vars_per_channel"
-    deprecation_message: "tf.fake_quant_with_min_max_vars_per_channel is deprecated, please use tf.quantization.fake_quant_with_min_max_vars_per_channel instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt
index 897312897f..0d9968248c 100644
--- a/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_FakeQuantWithMinMaxVarsPerChannelGradient.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "fake_quant_with_min_max_vars_per_channel_gradient"
-    deprecation_message: "tf.fake_quant_with_min_max_vars_per_channel_gradient is deprecated, please use tf.quantization.fake_quant_with_min_max_vars_per_channel_gradient instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Floor.pbtxt b/tensorflow/core/api_def/python_api/api_def_Floor.pbtxt
index 788d95edc1..9b93caa0b1 100644
--- a/tensorflow/core/api_def/python_api/api_def_Floor.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Floor.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "floor"
-    deprecation_message: "tf.floor is deprecated, please use tf.math.floor instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_GatherNd.pbtxt b/tensorflow/core/api_def/python_api/api_def_GatherNd.pbtxt
index 371dc740df..71257c8855 100644
--- a/tensorflow/core/api_def/python_api/api_def_GatherNd.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_GatherNd.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "gather_nd"
-    deprecation_message: "tf.gather_nd is deprecated, please use tf.manip.gather_nd instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Greater.pbtxt b/tensorflow/core/api_def/python_api/api_def_Greater.pbtxt
index c8c56515b2..7de60d44c4 100644
--- a/tensorflow/core/api_def/python_api/api_def_Greater.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Greater.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "greater"
-    deprecation_message: "tf.greater is deprecated, please use tf.math.greater instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_GreaterEqual.pbtxt b/tensorflow/core/api_def/python_api/api_def_GreaterEqual.pbtxt
index ccb390fb3e..9c8975c2a9 100644
--- a/tensorflow/core/api_def/python_api/api_def_GreaterEqual.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_GreaterEqual.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "greater_equal"
-    deprecation_message: "tf.greater_equal is deprecated, please use tf.math.greater_equal instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_IFFT.pbtxt b/tensorflow/core/api_def/python_api/api_def_IFFT.pbtxt
index 267ad8d0a0..17fbd8ace4 100644
--- a/tensorflow/core/api_def/python_api/api_def_IFFT.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_IFFT.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "ifft"
-    deprecation_message: "tf.ifft is deprecated, please use tf.spectral.ifft instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Igamma.pbtxt b/tensorflow/core/api_def/python_api/api_def_Igamma.pbtxt
index 4e7e3a6e57..8c4815c26e 100644
--- a/tensorflow/core/api_def/python_api/api_def_Igamma.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Igamma.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "igamma"
-    deprecation_message: "tf.igamma is deprecated, please use tf.math.igamma instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Igammac.pbtxt b/tensorflow/core/api_def/python_api/api_def_Igammac.pbtxt
index ea92a0916b..b43b54391b 100644
--- a/tensorflow/core/api_def/python_api/api_def_Igammac.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Igammac.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "igammac"
-    deprecation_message: "tf.igammac is deprecated, please use tf.math.igammac instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_InvertPermutation.pbtxt b/tensorflow/core/api_def/python_api/api_def_InvertPermutation.pbtxt
index bce642b96a..d75fcd63e3 100644
--- a/tensorflow/core/api_def/python_api/api_def_InvertPermutation.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_InvertPermutation.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "invert_permutation"
-    deprecation_message: "tf.invert_permutation is deprecated, please use tf.math.invert_permutation instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_IsFinite.pbtxt b/tensorflow/core/api_def/python_api/api_def_IsFinite.pbtxt
index a2c12f2ea0..27142644bf 100644
--- a/tensorflow/core/api_def/python_api/api_def_IsFinite.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_IsFinite.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "is_finite"
-    deprecation_message: "tf.is_finite is deprecated, please use tf.debugging.is_finite instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_IsInf.pbtxt b/tensorflow/core/api_def/python_api/api_def_IsInf.pbtxt
index 7c29811fd7..4cd92f1cb7 100644
--- a/tensorflow/core/api_def/python_api/api_def_IsInf.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_IsInf.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "is_inf"
-    deprecation_message: "tf.is_inf is deprecated, please use tf.debugging.is_inf instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_IsNan.pbtxt b/tensorflow/core/api_def/python_api/api_def_IsNan.pbtxt
index 459cf3ccbd..07d49f9436 100644
--- a/tensorflow/core/api_def/python_api/api_def_IsNan.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_IsNan.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "is_nan"
-    deprecation_message: "tf.is_nan is deprecated, please use tf.debugging.is_nan instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorGetNextAsOptional.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorGetNextAsOptional.pbtxt
new file mode 100644
index 0000000000..a88f422c21
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorGetNextAsOptional.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorGetNextAsOptional"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Less.pbtxt b/tensorflow/core/api_def/python_api/api_def_Less.pbtxt
index 15cbdc6d8e..055df2922a 100644
--- a/tensorflow/core/api_def/python_api/api_def_Less.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Less.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "less"
-    deprecation_message: "tf.less is deprecated, please use tf.math.less instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_LessEqual.pbtxt b/tensorflow/core/api_def/python_api/api_def_LessEqual.pbtxt
index 35aa18698f..d2803ddb69 100644
--- a/tensorflow/core/api_def/python_api/api_def_LessEqual.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_LessEqual.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "less_equal"
-    deprecation_message: "tf.less_equal is deprecated, please use tf.math.less_equal instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Lgamma.pbtxt b/tensorflow/core/api_def/python_api/api_def_Lgamma.pbtxt
index 89886b09d3..0262b838ca 100644
--- a/tensorflow/core/api_def/python_api/api_def_Lgamma.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Lgamma.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "lgamma"
-    deprecation_message: "tf.lgamma is deprecated, please use tf.math.lgamma instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Log.pbtxt b/tensorflow/core/api_def/python_api/api_def_Log.pbtxt
index fb82aa7e43..26d2473b9c 100644
--- a/tensorflow/core/api_def/python_api/api_def_Log.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Log.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "log"
-    deprecation_message: "tf.log is deprecated, please use tf.math.log instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Log1p.pbtxt b/tensorflow/core/api_def/python_api/api_def_Log1p.pbtxt
index 6b451aa546..d85b6dccec 100644
--- a/tensorflow/core/api_def/python_api/api_def_Log1p.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Log1p.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "log1p"
-    deprecation_message: "tf.log1p is deprecated, please use tf.math.log1p instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_LogicalAnd.pbtxt b/tensorflow/core/api_def/python_api/api_def_LogicalAnd.pbtxt
index 403a8c71ff..80bd98b740 100644
--- a/tensorflow/core/api_def/python_api/api_def_LogicalAnd.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_LogicalAnd.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "logical_and"
-    deprecation_message: "tf.logical_and is deprecated, please use tf.math.logical_and instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_LogicalNot.pbtxt b/tensorflow/core/api_def/python_api/api_def_LogicalNot.pbtxt
index f228958c77..b2244c44b1 100644
--- a/tensorflow/core/api_def/python_api/api_def_LogicalNot.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_LogicalNot.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "logical_not"
-    deprecation_message: "tf.logical_not is deprecated, please use tf.math.logical_not instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_LogicalOr.pbtxt b/tensorflow/core/api_def/python_api/api_def_LogicalOr.pbtxt
index ab89f236e7..cf78b52e07 100644
--- a/tensorflow/core/api_def/python_api/api_def_LogicalOr.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_LogicalOr.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "logical_or"
-    deprecation_message: "tf.logical_or is deprecated, please use tf.math.logical_or instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatchingFiles.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatchingFiles.pbtxt
index 8930d66940..74145670a8 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatchingFiles.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatchingFiles.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matching_files"
-    deprecation_message: "tf.matching_files is deprecated, please use tf.io.matching_files instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixBandPart.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixBandPart.pbtxt
index bad2f03f32..1122c52ab4 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixBandPart.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixBandPart.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_band_part"
-    deprecation_message: "tf.matrix_band_part is deprecated, please use tf.linalg.band_part instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixDeterminant.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixDeterminant.pbtxt
index d241d4d721..9563bf0354 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixDeterminant.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixDeterminant.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_determinant"
-    deprecation_message: "tf.matrix_determinant is deprecated, please use tf.linalg.det instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixDiag.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixDiag.pbtxt
index 208b37e297..8ab0bf75eb 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixDiag.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixDiag.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_diag"
-    deprecation_message: "tf.matrix_diag is deprecated, please use tf.linalg.diag instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixDiagPart.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixDiagPart.pbtxt
index a8a50e8a89..82ce67853c 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixDiagPart.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixDiagPart.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_diag_part"
-    deprecation_message: "tf.matrix_diag_part is deprecated, please use tf.linalg.diag_part instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixInverse.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixInverse.pbtxt
index 944513fcd9..85862f6eb5 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixInverse.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixInverse.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_inverse"
-    deprecation_message: "tf.matrix_inverse is deprecated, please use tf.linalg.inv instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixSetDiag.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixSetDiag.pbtxt
index a6080dbc2d..6325e4f0e6 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixSetDiag.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixSetDiag.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_set_diag"
-    deprecation_message: "tf.matrix_set_diag is deprecated, please use tf.linalg.set_diag instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixSolve.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixSolve.pbtxt
index caba80326b..6325dff407 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixSolve.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixSolve.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_solve"
-    deprecation_message: "tf.matrix_solve is deprecated, please use tf.linalg.solve instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_MatrixTriangularSolve.pbtxt b/tensorflow/core/api_def/python_api/api_def_MatrixTriangularSolve.pbtxt
index a4dfa538ed..7f865e23b2 100644
--- a/tensorflow/core/api_def/python_api/api_def_MatrixTriangularSolve.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_MatrixTriangularSolve.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "matrix_triangular_solve"
-    deprecation_message: "tf.matrix_triangular_solve is deprecated, please use tf.linalg.triangular_solve instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Maximum.pbtxt b/tensorflow/core/api_def/python_api/api_def_Maximum.pbtxt
index 90af9e145b..bcff379b71 100644
--- a/tensorflow/core/api_def/python_api/api_def_Maximum.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Maximum.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "maximum"
-    deprecation_message: "tf.maximum is deprecated, please use tf.math.maximum instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Minimum.pbtxt b/tensorflow/core/api_def/python_api/api_def_Minimum.pbtxt
index 33bcd6f667..9aae74226a 100644
--- a/tensorflow/core/api_def/python_api/api_def_Minimum.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Minimum.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "minimum"
-    deprecation_message: "tf.minimum is deprecated, please use tf.math.minimum instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_NonMaxSuppressionV4.pbtxt b/tensorflow/core/api_def/python_api/api_def_NonMaxSuppressionV4.pbtxt
new file mode 100644
index 0000000000..be6caacd00
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_NonMaxSuppressionV4.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "NonMaxSuppressionV4"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_NonMaxSuppressionWithOverlaps.pbtxt b/tensorflow/core/api_def/python_api/api_def_NonMaxSuppressionWithOverlaps.pbtxt
new file mode 100644
index 0000000000..0d358dff98
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_NonMaxSuppressionWithOverlaps.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "NonMaxSuppressionWithOverlaps"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_NotEqual.pbtxt b/tensorflow/core/api_def/python_api/api_def_NotEqual.pbtxt
index 385565daaf..f37317854f 100644
--- a/tensorflow/core/api_def/python_api/api_def_NotEqual.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_NotEqual.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "not_equal"
-    deprecation_message: "tf.not_equal is deprecated, please use tf.math.not_equal instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_OptionalFromValue.pbtxt b/tensorflow/core/api_def/python_api/api_def_OptionalFromValue.pbtxt
new file mode 100644
index 0000000000..c4949258e6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OptionalFromValue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalFromValue"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OptionalGetValue.pbtxt b/tensorflow/core/api_def/python_api/api_def_OptionalGetValue.pbtxt
new file mode 100644
index 0000000000..e3d362ac6e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OptionalGetValue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalGetValue"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OptionalHasValue.pbtxt b/tensorflow/core/api_def/python_api/api_def_OptionalHasValue.pbtxt
new file mode 100644
index 0000000000..7f5a96982a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OptionalHasValue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalHasValue"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OptionalNone.pbtxt b/tensorflow/core/api_def/python_api/api_def_OptionalNone.pbtxt
new file mode 100644
index 0000000000..15d11c4169
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OptionalNone.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OptionalNone"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParseTensor.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParseTensor.pbtxt
index 29f02ab1ac..10b3aab0c7 100644
--- a/tensorflow/core/api_def/python_api/api_def_ParseTensor.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_ParseTensor.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "parse_tensor"
-    deprecation_message: "tf.parse_tensor is deprecated, please use tf.io.parse_tensor instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Polygamma.pbtxt b/tensorflow/core/api_def/python_api/api_def_Polygamma.pbtxt
index 567a448642..9df81402d5 100644
--- a/tensorflow/core/api_def/python_api/api_def_Polygamma.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Polygamma.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "polygamma"
-    deprecation_message: "tf.polygamma is deprecated, please use tf.math.polygamma instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Qr.pbtxt b/tensorflow/core/api_def/python_api/api_def_Qr.pbtxt
index a9371b5d9b..0260eecc91 100644
--- a/tensorflow/core/api_def/python_api/api_def_Qr.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Qr.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "qr"
-    deprecation_message: "tf.qr is deprecated, please use tf.linalg.qr instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedConcat.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedConcat.pbtxt
index 44508ef079..69404b9472 100644
--- a/tensorflow/core/api_def/python_api/api_def_QuantizedConcat.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedConcat.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "quantized_concat"
-    deprecation_message: "tf.quantized_concat is deprecated, please use tf.quantization.quantized_concat instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_ReadFile.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReadFile.pbtxt
index 7c38fae31c..9d479be45f 100644
--- a/tensorflow/core/api_def/python_api/api_def_ReadFile.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_ReadFile.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "read_file"
-    deprecation_message: "tf.read_file is deprecated, please use tf.io.read_file instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Reciprocal.pbtxt b/tensorflow/core/api_def/python_api/api_def_Reciprocal.pbtxt
index 0f37e99f4f..c4d4c27722 100644
--- a/tensorflow/core/api_def/python_api/api_def_Reciprocal.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Reciprocal.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "reciprocal"
-    deprecation_message: "tf.reciprocal is deprecated, please use tf.math.reciprocal instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_RegexReplace.pbtxt b/tensorflow/core/api_def/python_api/api_def_RegexReplace.pbtxt
index 6938e20e57..b17806b338 100644
--- a/tensorflow/core/api_def/python_api/api_def_RegexReplace.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_RegexReplace.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "regex_replace"
-    deprecation_message: "tf.regex_replace is deprecated, please use tf.strings.regex_replace instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Reshape.pbtxt b/tensorflow/core/api_def/python_api/api_def_Reshape.pbtxt
index 907d95a6f0..c469665b66 100644
--- a/tensorflow/core/api_def/python_api/api_def_Reshape.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Reshape.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "reshape"
-    deprecation_message: "tf.reshape is deprecated, please use tf.manip.reshape instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_ReverseV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReverseV2.pbtxt
index bbe9e97d60..77f595927b 100644
--- a/tensorflow/core/api_def/python_api/api_def_ReverseV2.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_ReverseV2.pbtxt
@@ -5,10 +5,10 @@ op {
   }
   endpoint {
     name: "reverse"
-    deprecation_message: "tf.reverse is deprecated, please use tf.manip.reverse instead."
+    deprecated: true
   }
   endpoint {
     name: "reverse_v2"
-    deprecation_message: "tf.reverse_v2 is deprecated, please use tf.manip.reverse instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Rint.pbtxt b/tensorflow/core/api_def/python_api/api_def_Rint.pbtxt
index 4330a80d04..ec37a23127 100644
--- a/tensorflow/core/api_def/python_api/api_def_Rint.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Rint.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "rint"
-    deprecation_message: "tf.rint is deprecated, please use tf.math.rint instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Rsqrt.pbtxt b/tensorflow/core/api_def/python_api/api_def_Rsqrt.pbtxt
index 6a45f4aff5..4fc2b81421 100644
--- a/tensorflow/core/api_def/python_api/api_def_Rsqrt.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Rsqrt.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "rsqrt"
-    deprecation_message: "tf.rsqrt is deprecated, please use tf.math.rsqrt instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_ScatterNd.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScatterNd.pbtxt
index cabf171cb0..a65a19b542 100644
--- a/tensorflow/core/api_def/python_api/api_def_ScatterNd.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_ScatterNd.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "scatter_nd"
-    deprecation_message: "tf.scatter_nd is deprecated, please use tf.manip.scatter_nd instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_SegmentMax.pbtxt b/tensorflow/core/api_def/python_api/api_def_SegmentMax.pbtxt
index 65e34a1fcf..2e22c375c0 100644
--- a/tensorflow/core/api_def/python_api/api_def_SegmentMax.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_SegmentMax.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "segment_max"
-    deprecation_message: "tf.segment_max is deprecated, please use tf.math.segment_max instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_SegmentMean.pbtxt b/tensorflow/core/api_def/python_api/api_def_SegmentMean.pbtxt
index f1e19c5571..646348072f 100644
--- a/tensorflow/core/api_def/python_api/api_def_SegmentMean.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_SegmentMean.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "segment_mean"
-    deprecation_message: "tf.segment_mean is deprecated, please use tf.math.segment_mean instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_SegmentMin.pbtxt b/tensorflow/core/api_def/python_api/api_def_SegmentMin.pbtxt
index fd9a3c380d..1a77019a2d 100644
--- a/tensorflow/core/api_def/python_api/api_def_SegmentMin.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_SegmentMin.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "segment_min"
-    deprecation_message: "tf.segment_min is deprecated, please use tf.math.segment_min instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_SegmentProd.pbtxt b/tensorflow/core/api_def/python_api/api_def_SegmentProd.pbtxt
index f2be8baafc..cf4d6f0237 100644
--- a/tensorflow/core/api_def/python_api/api_def_SegmentProd.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_SegmentProd.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "segment_prod"
-    deprecation_message: "tf.segment_prod is deprecated, please use tf.math.segment_prod instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_SegmentSum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SegmentSum.pbtxt
index c7cc1d0c9f..c6d7999455 100644
--- a/tensorflow/core/api_def/python_api/api_def_SegmentSum.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_SegmentSum.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "segment_sum"
-    deprecation_message: "tf.segment_sum is deprecated, please use tf.math.segment_sum instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Sin.pbtxt b/tensorflow/core/api_def/python_api/api_def_Sin.pbtxt
index 0794334987..9c19a1a177 100644
--- a/tensorflow/core/api_def/python_api/api_def_Sin.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Sin.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "sin"
-    deprecation_message: "tf.sin is deprecated, please use tf.math.sin instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Sinh.pbtxt b/tensorflow/core/api_def/python_api/api_def_Sinh.pbtxt
index c42f8678c6..155e58e6d5 100644
--- a/tensorflow/core/api_def/python_api/api_def_Sinh.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Sinh.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "sinh"
-    deprecation_message: "tf.sinh is deprecated, please use tf.math.sinh instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_SpaceToBatchND.pbtxt b/tensorflow/core/api_def/python_api/api_def_SpaceToBatchND.pbtxt
index 63a7547e14..af323a6cf3 100644
--- a/tensorflow/core/api_def/python_api/api_def_SpaceToBatchND.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_SpaceToBatchND.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "space_to_batch_nd"
-    deprecation_message: "tf.space_to_batch_nd is deprecated, please use tf.manip.space_to_batch_nd instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_SquaredDifference.pbtxt b/tensorflow/core/api_def/python_api/api_def_SquaredDifference.pbtxt
index 01a33a3346..4bab8cf00c 100644
--- a/tensorflow/core/api_def/python_api/api_def_SquaredDifference.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_SquaredDifference.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "squared_difference"
-    deprecation_message: "tf.squared_difference is deprecated, please use tf.math.squared_difference instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_StatefulPartitionedCall.pbtxt b/tensorflow/core/api_def/python_api/api_def_StatefulPartitionedCall.pbtxt
new file mode 100644
index 0000000000..eb8e3ae902
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StatefulPartitionedCall.pbtxt
@@ -0,0 +1 @@
+op { graph_op_name: "StatefulPartitionedCall" visibility: HIDDEN }
diff --git a/tensorflow/core/api_def/python_api/api_def_StringJoin.pbtxt b/tensorflow/core/api_def/python_api/api_def_StringJoin.pbtxt
index 53c1b8053d..46a7c0361e 100644
--- a/tensorflow/core/api_def/python_api/api_def_StringJoin.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_StringJoin.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "string_join"
-    deprecation_message: "tf.string_join is deprecated, please use tf.strings.join instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_StringStrip.pbtxt b/tensorflow/core/api_def/python_api/api_def_StringStrip.pbtxt
index 364806e1f5..fbcdeaad6d 100644
--- a/tensorflow/core/api_def/python_api/api_def_StringStrip.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_StringStrip.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "string_strip"
-    deprecation_message: "tf.string_strip is deprecated, please use tf.strings.strip instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_StringToHashBucket.pbtxt b/tensorflow/core/api_def/python_api/api_def_StringToHashBucket.pbtxt
index b0e93d2b22..d122e79b39 100644
--- a/tensorflow/core/api_def/python_api/api_def_StringToHashBucket.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_StringToHashBucket.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "string_to_hash_bucket"
-    deprecation_message: "tf.string_to_hash_bucket is deprecated, please use tf.strings.to_hash_bucket instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_StringToHashBucketFast.pbtxt b/tensorflow/core/api_def/python_api/api_def_StringToHashBucketFast.pbtxt
index 9576e1a9de..aef9dffefe 100644
--- a/tensorflow/core/api_def/python_api/api_def_StringToHashBucketFast.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_StringToHashBucketFast.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "string_to_hash_bucket_fast"
-    deprecation_message: "tf.string_to_hash_bucket_fast is deprecated, please use tf.strings.to_hash_bucket_fast instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_StringToHashBucketStrong.pbtxt b/tensorflow/core/api_def/python_api/api_def_StringToHashBucketStrong.pbtxt
index e8c7c12608..385b9fd02a 100644
--- a/tensorflow/core/api_def/python_api/api_def_StringToHashBucketStrong.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_StringToHashBucketStrong.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "string_to_hash_bucket_strong"
-    deprecation_message: "tf.string_to_hash_bucket_strong is deprecated, please use tf.strings.to_hash_bucket_strong instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_StringToNumber.pbtxt b/tensorflow/core/api_def/python_api/api_def_StringToNumber.pbtxt
index 9de1ca0b30..f740b9849d 100644
--- a/tensorflow/core/api_def/python_api/api_def_StringToNumber.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_StringToNumber.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "string_to_number"
-    deprecation_message: "tf.string_to_number is deprecated, please use tf.strings.to_number instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Substr.pbtxt b/tensorflow/core/api_def/python_api/api_def_Substr.pbtxt
index 25d1bb3f51..4778d7927c 100644
--- a/tensorflow/core/api_def/python_api/api_def_Substr.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Substr.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "substr"
-    deprecation_message: "tf.substr is deprecated, please use tf.strings.substr instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Tan.pbtxt b/tensorflow/core/api_def/python_api/api_def_Tan.pbtxt
index 8bcf381dd4..ffa92f5580 100644
--- a/tensorflow/core/api_def/python_api/api_def_Tan.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Tan.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "tan"
-    deprecation_message: "tf.tan is deprecated, please use tf.math.tan instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Tile.pbtxt b/tensorflow/core/api_def/python_api/api_def_Tile.pbtxt
index 0b9053a529..c34061c941 100644
--- a/tensorflow/core/api_def/python_api/api_def_Tile.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Tile.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "tile"
-    deprecation_message: "tf.tile is deprecated, please use tf.manip.tile instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMax.pbtxt b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMax.pbtxt
index 1ea59d2e63..cf81843241 100644
--- a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMax.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMax.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "unsorted_segment_max"
-    deprecation_message: "tf.unsorted_segment_max is deprecated, please use tf.math.unsorted_segment_max instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMin.pbtxt b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMin.pbtxt
index 9857def6fe..475361c85a 100644
--- a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMin.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentMin.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "unsorted_segment_min"
-    deprecation_message: "tf.unsorted_segment_min is deprecated, please use tf.math.unsorted_segment_min instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentProd.pbtxt b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentProd.pbtxt
index d9e3f7be69..a9d741bbc3 100644
--- a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentProd.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentProd.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "unsorted_segment_prod"
-    deprecation_message: "tf.unsorted_segment_prod is deprecated, please use tf.math.unsorted_segment_prod instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentSum.pbtxt b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentSum.pbtxt
index 0cffd12404..337678dcff 100644
--- a/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentSum.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_UnsortedSegmentSum.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "unsorted_segment_sum"
-    deprecation_message: "tf.unsorted_segment_sum is deprecated, please use tf.math.unsorted_segment_sum instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_WriteFile.pbtxt b/tensorflow/core/api_def/python_api/api_def_WriteFile.pbtxt
index f28a9151ca..1a58ae19e5 100644
--- a/tensorflow/core/api_def/python_api/api_def_WriteFile.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_WriteFile.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "write_file"
-    deprecation_message: "tf.write_file is deprecated, please use tf.io.write_file instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/api_def/python_api/api_def_Zeta.pbtxt b/tensorflow/core/api_def/python_api/api_def_Zeta.pbtxt
index a84ffcdf14..4684a9d624 100644
--- a/tensorflow/core/api_def/python_api/api_def_Zeta.pbtxt
+++ b/tensorflow/core/api_def/python_api/api_def_Zeta.pbtxt
@@ -5,6 +5,6 @@ op {
   }
   endpoint {
     name: "zeta"
-    deprecation_message: "tf.zeta is deprecated, please use tf.math.zeta instead."
+    deprecated: true
   }
 }
diff --git a/tensorflow/core/common_runtime/bfc_allocator.cc b/tensorflow/core/common_runtime/bfc_allocator.cc
index 9cda17867b..3bf0532491 100644
--- a/tensorflow/core/common_runtime/bfc_allocator.cc
+++ b/tensorflow/core/common_runtime/bfc_allocator.cc
@@ -155,10 +155,6 @@ bool BFCAllocator::Extend(size_t alignment, size_t rounded_bytes) {
 
   region_manager_.set_handle(c->ptr, h);
 
-  // TODO(vrv): Try to merge this new region with an existing region,
-  // if the address space is contiguous, to avoid fragmentation
-  // across regions.
-
   // Insert the chunk into the right bin.
   InsertFreeChunkIntoBin(h);
 
@@ -465,49 +461,33 @@ void BFCAllocator::FreeAndMaybeCoalesce(BFCAllocator::ChunkHandle h) {
   Chunk* c = ChunkFromHandle(h);
   CHECK(c->in_use() && (c->bin_num == kInvalidBinNum));
 
-  // Mark the chunk as no longer in use
+  // Mark the chunk as no longer in use.
   c->allocation_id = -1;
 
   // Updates the stats.
   stats_.bytes_in_use -= c->size;
 
-  // This chunk is no longer in-use, consider coalescing the chunk
-  // with adjacent chunks.
-  ChunkHandle chunk_to_reassign = h;
-
-  // If the next chunk is free, coalesce the two
-  if (c->next != kInvalidChunkHandle) {
-    Chunk* cnext = ChunkFromHandle(c->next);
-    if (!cnext->in_use()) {
-      //      VLOG(8) << "Chunk at " << cnext->ptr << " merging with c " <<
-      //      c->ptr;
-
-      chunk_to_reassign = h;
+  ChunkHandle coalesced_chunk = h;
 
-      // Deletes c->next
-      RemoveFreeChunkFromBin(c->next);
-      Merge(h, ChunkFromHandle(h)->next);
-    }
+  // If the next chunk is free, merge it into c and delete it.
+  if (c->next != kInvalidChunkHandle && !ChunkFromHandle(c->next)->in_use()) {
+    // VLOG(8) << "Merging c->next " << ChunkFromHandle(c->next)->ptr
+    //         << " with c " << c->ptr;
+    RemoveFreeChunkFromBin(c->next);
+    Merge(h, c->next);
   }
 
-  // If the previous chunk is free, coalesce the two
-  c = ChunkFromHandle(h);
-  if (c->prev != kInvalidChunkHandle) {
-    Chunk* cprev = ChunkFromHandle(c->prev);
-    if (!cprev->in_use()) {
-      //      VLOG(8) << "Chunk at " << c->ptr << " merging into c->prev "
-      //       << cprev->ptr;
-
-      chunk_to_reassign = c->prev;
+  // If the previous chunk is free, merge c into it and delete c.
+  if (c->prev != kInvalidChunkHandle && !ChunkFromHandle(c->prev)->in_use()) {
+    // VLOG(8) << "Merging c " << c->ptr << " into c->prev "
+    //         << ChunkFromHandle(c->prev)->ptr;
 
-      // Deletes c
-      RemoveFreeChunkFromBin(c->prev);
-      Merge(ChunkFromHandle(h)->prev, h);
-      c = ChunkFromHandle(h);
-    }
+    coalesced_chunk = c->prev;
+    RemoveFreeChunkFromBin(c->prev);
+    Merge(c->prev, h);
   }
 
-  InsertFreeChunkIntoBin(chunk_to_reassign);
+  InsertFreeChunkIntoBin(coalesced_chunk);
 }
 
 void BFCAllocator::AddAllocVisitor(Visitor visitor) {
diff --git a/tensorflow/core/common_runtime/bfc_allocator.h b/tensorflow/core/common_runtime/bfc_allocator.h
index 52aedb1e9c..580e61e2ea 100644
--- a/tensorflow/core/common_runtime/bfc_allocator.h
+++ b/tensorflow/core/common_runtime/bfc_allocator.h
@@ -88,11 +88,20 @@ class BFCAllocator : public VisitableAllocator {
   static const int kInvalidBinNum = -1;
   static const int kNumBins = 21;
 
-  // Chunks point to memory.  Their prev/next pointers form a
-  // doubly-linked list of addresses sorted by base address that
-  // must be contiguous.  Chunks contain information about whether
-  // they are in use or whether they are free, and contain a pointer
-  // to the bin they are in.
+  // A Chunk points to a piece of memory that's either entirely free or entirely
+  // in use by one user memory allocation.
+  //
+  // An AllocationRegion's memory is split up into one or more disjoint Chunks,
+  // which together cover the whole region without gaps.  Chunks participate in
+  // a doubly-linked list, and the prev/next pointers point to the physically
+  // adjacent chunks.
+  //
+  // Since a chunk cannot be partially in use, we may need to split a free chunk
+  // in order to service a user allocation.  We always merge adjacent free
+  // chunks.
+  //
+  // Chunks contain information about whether they are in use or whether they
+  // are free, and contain a pointer to the bin they are in.
   struct Chunk {
     size_t size = 0;  // Full size of buffer.
 
@@ -177,8 +186,12 @@ class BFCAllocator : public VisitableAllocator {
   static const size_t kMinAllocationBits = 8;
   static const size_t kMinAllocationSize = 1 << kMinAllocationBits;
 
-  // AllocationRegion maps pointers to ChunkHandles for a single
-  // contiguous memory region.
+  // BFCAllocator allocates memory into a collection of disjoint
+  // AllocationRegions.  Each AllocationRegion corresponds to one call to
+  // SubAllocator::Alloc().
+  //
+  // An AllocationRegion contains one or more Chunks, covering all of its
+  // memory.  Its primary job is to map a pointers to ChunkHandles.
   //
   // This class is thread-compatible.
   class AllocationRegion {
@@ -191,18 +204,14 @@ class BFCAllocator : public VisitableAllocator {
       DCHECK_EQ(0, memory_size % kMinAllocationSize);
       const size_t n_handles =
           (memory_size + kMinAllocationSize - 1) / kMinAllocationSize;
-      handles_ = new ChunkHandle[n_handles];
+      handles_.reset(new ChunkHandle[n_handles]);
       for (size_t i = 0; i < n_handles; i++) {
         handles_[i] = kInvalidChunkHandle;
       }
     }
 
-    AllocationRegion() {}
-
-    ~AllocationRegion() { delete[] handles_; }
-
+    AllocationRegion() = default;
     AllocationRegion(AllocationRegion&& other) { Swap(other); }
-
     AllocationRegion& operator=(AllocationRegion&& other) {
       Swap(other);
       return *this;
@@ -241,7 +250,7 @@ class BFCAllocator : public VisitableAllocator {
     // Array of size "memory_size / kMinAllocationSize".  It is
     // indexed by (p-base) / kMinAllocationSize, contains ChunkHandle
     // for the memory allocation represented by "p"
-    ChunkHandle* handles_ = nullptr;
+    std::unique_ptr<ChunkHandle[]> handles_;
 
     TF_DISALLOW_COPY_AND_ASSIGN(AllocationRegion);
   };
diff --git a/tensorflow/core/common_runtime/broadcaster.cc b/tensorflow/core/common_runtime/broadcaster.cc
index 46142d5923..e1c6b21939 100644
--- a/tensorflow/core/common_runtime/broadcaster.cc
+++ b/tensorflow/core/common_runtime/broadcaster.cc
@@ -27,13 +27,14 @@ namespace tensorflow {
 
 namespace {
 // Key to be used for BufRendezvous by Broadcaster.
-string BroadcastBufKey(const string& exec_key, int src_rank, int dst_rank) {
+string BroadcastBufKey(const string& exec_key, int subdiv, int src_rank,
+                       int dst_rank) {
   if (READABLE_KEYS) {
-    return strings::StrCat("broadcast(", exec_key, "):src(", src_rank, "):dst(",
-                           dst_rank, ")");
+    return strings::StrCat("broadcast(", exec_key, "):subdiv(", subdiv,
+                           "):src(", src_rank, "):dst(", dst_rank, ")");
   } else {
     // TODO(tucker): Try a denser format, e.g. a 64 or 128 bit hash.
-    return strings::StrCat(exec_key, ":", src_rank, ":", dst_rank);
+    return strings::StrCat(exec_key, ":", subdiv, ":", src_rank, ":", dst_rank);
   }
 }
 }  // namespace
@@ -85,11 +86,15 @@ void Broadcaster::Run(StatusCallback done) {
 // device, no send to it is necessary.
 
 /* static*/
-int Broadcaster::TreeRecvFrom(const CollectiveParams& cp) {
-  DCHECK_EQ(1, cp.subdiv_rank.size());
-  if (cp.is_source) return -1;
-  int source_rank = cp.instance.impl_details.subdiv_source_rank[0];
-  int my_rank = cp.subdiv_rank[0];
+int Broadcaster::TreeRecvFrom(const CollectiveParams& cp, int subdiv) {
+  DCHECK_LT(subdiv, static_cast<int>(cp.subdiv_rank.size()));
+  int my_rank = cp.subdiv_rank[subdiv];
+  if (-1 == my_rank) return -1;
+
+  const auto& impl = cp.instance.impl_details;
+  DCHECK_LT(subdiv, static_cast<int>(impl.subdiv_source_rank.size()));
+  int source_rank = impl.subdiv_source_rank[subdiv];
+  if (my_rank == source_rank) return -1;
   if (source_rank == 0) {
     return (my_rank - 1) / 2;
   } else {
@@ -99,13 +104,24 @@ int Broadcaster::TreeRecvFrom(const CollectiveParams& cp) {
 }
 
 /* static */
-void Broadcaster::TreeSendTo(const CollectiveParams& cp,
+void Broadcaster::TreeSendTo(const CollectiveParams& cp, int subdiv,
                              std::vector<int>* targets) {
-  DCHECK_EQ(1, cp.subdiv_rank.size());
+  DCHECK_LT(subdiv, static_cast<int>(cp.subdiv_rank.size()));
+  int my_rank = cp.subdiv_rank[subdiv];
+  if (-1 == my_rank) return;
+
+  const auto& impl = cp.instance.impl_details;
+  DCHECK_LT(subdiv, static_cast<int>(impl.subdiv_source_rank.size()));
+  int source_rank = impl.subdiv_source_rank[subdiv];
+
+  int group_size = 0;
+  for (int i = 0; i < impl.subdiv_permutations[subdiv].size(); i++) {
+    if (impl.subdiv_permutations[subdiv][i] >= 0) {
+      group_size++;
+    }
+  }
+
   targets->clear();
-  int my_rank = cp.subdiv_rank[0];
-  DCHECK_EQ(1, cp.instance.impl_details.subdiv_source_rank.size());
-  int source_rank = cp.instance.impl_details.subdiv_source_rank[0];
   int successor_rank = 0;
   if (source_rank == 0) {
     successor_rank = (2 * my_rank) + 1;
@@ -116,108 +132,147 @@ void Broadcaster::TreeSendTo(const CollectiveParams& cp,
   if (cp.is_source && source_rank != 0) {
     // The source sends to rank 0,1 in addition to its positional
     // descendants.
-    if (cp.group.group_size > 1) {
+    if (group_size > 1) {
       targets->push_back(0);
     }
-    if (cp.group.group_size > 2 && source_rank != 1) {
+    if (group_size > 2 && source_rank != 1) {
       targets->push_back(1);
     }
   }
   for (int i = 0; i < 2; ++i) {
-    if (successor_rank < cp.group.group_size && successor_rank != source_rank) {
+    if (successor_rank < group_size && successor_rank != source_rank) {
       targets->push_back(successor_rank);
     }
     ++successor_rank;
   }
 }
 
-// Execute a tree broadcast, i.e. each non-source device receives from
-// one other and sends to up-to two others.
+// Executes a hierarchical tree broadcast.
+// Each subdiv is a broadcast between a subset of the devices.
+// If there is only one task, there is one subdiv comprising a broadcast between
+// all devices belonging to the task.
+// If there are n tasks, n>1, then there are n+1 subdivs.  In the first (global)
+// subdiv, one device from each task participates in a binary tree broadcast.
+// Each task receives a copy of the tensor on one device via this broadcast.
+// Subsequent subdivs correspond to intra-task broadcasts.  Subdiv i+1
+// corresponds to broadcast between all devices on task i.  Thus, each task
+// participates in at most 2 subdivs.
 void Broadcaster::RunTree() {
-  mutex mu;               // also guards status_ while callbacks are pending
-  int pending_count = 0;  // GUARDED_BY(mu)
-  condition_variable all_done;
-  std::vector<int> send_to_ranks;
-  TreeSendTo(col_params_, &send_to_ranks);
-
-  if (!is_source_) {
-    // Begin by receiving the value.
-    int recv_from_rank = TreeRecvFrom(col_params_);
-    Notification note;
-    DispatchRecv(recv_from_rank, output_,
-                 [this, recv_from_rank, &mu, &note](const Status& s) {
-                   mutex_lock l(mu);
-                   status_.Update(s);
-                   note.Notify();
-                 });
-    note.WaitForNotification();
-  }
+  int num_subdivs = static_cast<int>(col_params_.subdiv_rank.size());
+  // TODO(ayushd): this is easily improved when a node participates in both
+  // first and second subdivision.  It would first send to its descendents in
+  // the first subdiv, then wait until all pending ops are finished before
+  // sending to descendents in second subdiv.  A better implementation would
+  // collapse the two send blocks.
+  for (int si = 0; si < num_subdivs; si++) {
+    int my_rank = col_params_.subdiv_rank[si];
+    // If rank is -1, this device does not participate in this subdiv.
+    if (-1 == my_rank) continue;
+    int source_rank = col_params_.instance.impl_details.subdiv_source_rank[si];
+    if (VLOG_IS_ON(1)) {
+      string subdiv_buf;
+      for (int r : col_params_.instance.impl_details.subdiv_permutations[si]) {
+        strings::StrAppend(&subdiv_buf, r, ",");
+      }
+      VLOG(1) << "Running Broadcast tree device=" << device_->name()
+              << " subdiv=" << si << " perm=" << subdiv_buf
+              << " my_rank=" << my_rank << " source_rank=" << source_rank;
+    }
+
+    mutex mu;               // also guards status_ while callbacks are pending
+    int pending_count = 0;  // GUARDED_BY(mu)
+    condition_variable all_done;
 
-  // Then forward value to all descendent devices.
-  if (status_.ok()) {
-    for (int i = 0; i < send_to_ranks.size(); ++i) {
-      int target_rank = send_to_ranks[i];
-      {
-        mutex_lock l(mu);
-        ++pending_count;
+    if (my_rank >= 0 && my_rank != source_rank) {
+      // Begin by receiving the value.
+      int recv_from_rank = TreeRecvFrom(col_params_, si);
+      Notification note;
+      DispatchRecv(si, recv_from_rank, my_rank, output_,
+                   [this, &mu, &note](const Status& s) {
+                     mutex_lock l(mu);
+                     status_.Update(s);
+                     note.Notify();
+                   });
+      note.WaitForNotification();
+    }
+
+    // Then forward value to all descendent devices.
+    if (my_rank >= 0 && status_.ok()) {
+      std::vector<int> send_to_ranks;
+      TreeSendTo(col_params_, si, &send_to_ranks);
+      for (int i = 0; i < send_to_ranks.size(); ++i) {
+        int target_rank = send_to_ranks[i];
+        {
+          mutex_lock l(mu);
+          ++pending_count;
+        }
+        DispatchSend(si, target_rank, my_rank,
+                     (is_source_ ? &ctx_->input(0) : output_),
+                     [this, &mu, &pending_count, &all_done](const Status& s) {
+                       mutex_lock l(mu);
+                       status_.Update(s);
+                       --pending_count;
+                       if (pending_count == 0) {
+                         all_done.notify_all();
+                       }
+                     });
       }
-      DispatchSend(
-          target_rank, (is_source_ ? &ctx_->input(0) : output_),
-          [this, target_rank, &mu, &pending_count, &all_done](const Status& s) {
-            mutex_lock l(mu);
-            status_.Update(s);
-            --pending_count;
-            if (pending_count == 0) {
-              all_done.notify_all();
-            }
-          });
     }
-  }
 
-  if (status_.ok() && is_source_) {
-    // Meanwhile, copy input to output if we weren't lucky enough to
-    // be able to reuse input as output.
-    const Tensor* input = &ctx_->input(0);
-    if (input != output_ &&
-        (DMAHelper::base(input) != DMAHelper::base(output_))) {
-      {
-        mutex_lock l(mu);
-        ++pending_count;
+    // For the original source device, we copy input to output if they are
+    // different.
+    // If there is only 1 subdiv, we do this in that subdiv.  If there is more
+    // than 1 subdiv, then the original source device will participate in 2
+    // subdivs - the global inter-task broadcast and one local intra-task
+    // broadcast.  In this case, we perform the copy in the second subdiv for
+    // this device.
+    if (status_.ok() && is_source_ && (1 == num_subdivs || 0 != si)) {
+      VLOG(2) << "copying input to output for device=" << device_->name()
+              << " subdiv=" << si;
+      const Tensor* input = &ctx_->input(0);
+      if (input != output_ &&
+          (DMAHelper::base(input) != DMAHelper::base(output_))) {
+        {
+          mutex_lock l(mu);
+          ++pending_count;
+        }
+        DeviceContext* op_dev_ctx = ctx_->op_device_context();
+        CollectiveRemoteAccessLocal::MemCpyAsync(
+            op_dev_ctx, op_dev_ctx, device_, device_, ctx_->input_alloc_attr(0),
+            ctx_->output_alloc_attr(0), input, output_, 0, /*stream_index*/
+            [this, &mu, &pending_count, &all_done](const Status& s) {
+              mutex_lock l(mu);
+              status_.Update(s);
+              --pending_count;
+              if (0 == pending_count) {
+                all_done.notify_all();
+              }
+            });
       }
-      DeviceContext* op_dev_ctx = ctx_->op_device_context();
-      CollectiveRemoteAccessLocal::MemCpyAsync(
-          op_dev_ctx, op_dev_ctx, device_, device_, ctx_->input_alloc_attr(0),
-          ctx_->output_alloc_attr(0), input, output_, 0 /*steam_index*/,
-          [this, &mu, &pending_count, &all_done](const Status& s) {
-            mutex_lock l(mu);
-            status_.Update(s);
-            --pending_count;
-            if (0 == pending_count) {
-              all_done.notify_all();
-            }
-          });
     }
-  }
 
-  // Then wait for all pending actions to complete.
-  {
-    mutex_lock l(mu);
-    if (pending_count > 0) {
-      all_done.wait(l);
+    // Then wait for all pending actions to complete.
+    {
+      mutex_lock l(mu);
+      if (pending_count > 0) {
+        all_done.wait(l);
+      }
     }
   }
-
-  VLOG(2) << "return status " << status_;
+  VLOG(2) << "device=" << device_->name() << " return status " << status_;
   done_(status_);
 }
 
-void Broadcaster::DispatchSend(int dst_rank, const Tensor* src_tensor,
+void Broadcaster::DispatchSend(int subdiv, int dst_rank, int src_rank,
+                               const Tensor* src_tensor,
                                const StatusCallback& done) {
-  string send_buf_key = BroadcastBufKey(exec_key_, rank_, dst_rank);
-  VLOG(1) << "DispatchSend " << send_buf_key << " from_device "
-          << device_->name();
+  string send_buf_key = BroadcastBufKey(exec_key_, subdiv, src_rank, dst_rank);
   int dst_idx =
-      col_params_.instance.impl_details.subdiv_permutations[0][dst_rank];
+      col_params_.instance.impl_details.subdiv_permutations[subdiv][dst_rank];
+  VLOG(1) << "DispatchSend " << send_buf_key << " from_device "
+          << device_->name() << " to_device "
+          << col_params_.instance.device_names[dst_idx] << " subdiv=" << subdiv
+          << " dst_rank=" << dst_rank << " dst_idx=" << dst_idx;
   col_exec_->PostToPeer(col_params_.instance.device_names[dst_idx],
                         col_params_.instance.task_names[dst_idx], send_buf_key,
                         device_, ctx_->op_device_context(),
@@ -225,15 +280,15 @@ void Broadcaster::DispatchSend(int dst_rank, const Tensor* src_tensor,
                         device_locality_, done);
 }
 
-void Broadcaster::DispatchRecv(int src_rank, Tensor* dst_tensor,
-                               const StatusCallback& done) {
-  string recv_buf_key = BroadcastBufKey(exec_key_, src_rank, rank_);
+void Broadcaster::DispatchRecv(int subdiv, int src_rank, int dst_rank,
+                               Tensor* dst_tensor, const StatusCallback& done) {
+  string recv_buf_key = BroadcastBufKey(exec_key_, subdiv, src_rank, dst_rank);
   int src_idx =
-      col_params_.instance.impl_details.subdiv_permutations[0][src_rank];
+      col_params_.instance.impl_details.subdiv_permutations[subdiv][src_rank];
   VLOG(1) << "DispatchRecv " << recv_buf_key << " from_device "
-          << col_params_.instance.device_names[src_idx];
-  int dst_idx = col_params_.instance.impl_details.subdiv_permutations[0][rank_];
-  CHECK_EQ(col_params_.instance.device_names[dst_idx], device_->name());
+          << col_params_.instance.device_names[src_idx] << " to_device "
+          << device_->name() << " subdiv=" << subdiv << " src_rank=" << src_rank
+          << " src_idx=" << src_idx;
   col_exec_->RecvFromPeer(col_params_.instance.device_names[src_idx],
                           col_params_.instance.task_names[src_idx],
                           col_params_.task.is_local[src_idx], recv_buf_key,
diff --git a/tensorflow/core/common_runtime/broadcaster.h b/tensorflow/core/common_runtime/broadcaster.h
index bdf68f19ab..799228b161 100644
--- a/tensorflow/core/common_runtime/broadcaster.h
+++ b/tensorflow/core/common_runtime/broadcaster.h
@@ -34,17 +34,24 @@ class Broadcaster {
 
   // Returns the rank of the device from which this device should receive
   // its value, -1 if no value should be received.
-  static int TreeRecvFrom(const CollectiveParams& cp);
+  static int TreeRecvFrom(const CollectiveParams& cp, int subdiv);
 
   // Populates targets with the ranks of the devices to which this device
   // should forward the value.
-  static void TreeSendTo(const CollectiveParams& cp, std::vector<int>* targets);
+  static void TreeSendTo(const CollectiveParams& cp, int subdiv,
+                         std::vector<int>* targets);
 
  private:
-  void DispatchSend(int dst_rank, const Tensor* src_tensor,
-                    const StatusCallback& done);
-  void DispatchRecv(int src_rank, Tensor* dst_tensor,
+  // Sends `src_tensor` asynchronously from this device to device at `dst_rank`
+  // in `subdiv`.  Calls `done` upon completion.
+  void DispatchSend(int subdiv, int dst_rank, int src_rank,
+                    const Tensor* src_tensor, const StatusCallback& done);
+  // Receives a tensor into the memory buffer owned by `dst_tensor` at this
+  // device from device at `src_rank` in `subdiv`.  Calls `done` upon
+  // completion.
+  void DispatchRecv(int subdiv, int src_rank, int dst_rank, Tensor* dst_tensor,
                     const StatusCallback& done);
+  // Executes the hierarchical broadcast defined by this op.
   void RunTree();
 
   Status status_;
diff --git a/tensorflow/core/common_runtime/broadcaster_test.cc b/tensorflow/core/common_runtime/broadcaster_test.cc
index 6a163a0db0..3960fc6c97 100644
--- a/tensorflow/core/common_runtime/broadcaster_test.cc
+++ b/tensorflow/core/common_runtime/broadcaster_test.cc
@@ -38,7 +38,6 @@ namespace tensorflow {
 namespace {
 
 static int64 kStepId = 123;
-static int32 kNumSubdivs = 1;  // Subdiv not yet meaningful for broadcast
 
 // The test harness won't allow a mixture of fixture and non-fixture
 // tests in one file, so this is a trival fixture for tests that don't
@@ -59,12 +58,14 @@ class TrivialTest : public ::testing::Test {
     CollectiveParams cp;                                           \
     cp.group.group_size = D;                                       \
     cp.instance.impl_details.subdiv_source_rank = {S};             \
+    cp.instance.impl_details.subdiv_permutations.push_back(        \
+        std::vector<int>(D, 0));                                   \
     cp.subdiv_rank = {R};                                          \
     cp.is_source = (S == R);                                       \
-    EXPECT_EQ(RF, Broadcaster::TreeRecvFrom(cp));                  \
+    EXPECT_EQ(RF, Broadcaster::TreeRecvFrom(cp, 0));               \
     std::vector<int> expected = ST;                                \
     std::vector<int> send_to;                                      \
-    Broadcaster::TreeSendTo(cp, &send_to);                         \
+    Broadcaster::TreeSendTo(cp, 0, &send_to);                      \
     ASSERT_EQ(expected.size(), send_to.size());                    \
     for (int i = 0; i < expected.size(); ++i) {                    \
       EXPECT_EQ(expected[i], send_to[i]);                          \
@@ -209,8 +210,11 @@ class BroadcasterTest : public ::testing::Test {
 #endif
   }
 
-  void Init(int num_workers, int num_devices, DataType dtype,
+  void Init(int num_workers, int num_devices_per_worker, DataType dtype,
             const DeviceType& device_type, int fail_after) {
+    VLOG(2) << "num_workers=" << num_workers
+            << " num_devices_per_worker=" << num_devices_per_worker;
+    int total_num_devices = num_workers * num_devices_per_worker;
     device_type_ = device_type;
     std::vector<Device*> local_devices;
     SessionOptions sess_opts;
@@ -218,14 +222,14 @@ class BroadcasterTest : public ::testing::Test {
     Bytes mem_limit(4 << 20);
     DeviceLocality dev_locality;
     for (int wi = 0; wi < num_workers; ++wi) {
-      for (int di = 0; di < num_devices; ++di) {
+      for (int di = 0; di < num_devices_per_worker; ++di) {
         if (device_type == DEVICE_CPU) {
           string dev_name = strings::StrCat("/job:worker/replica:0/task:", wi,
                                             "/device:CPU:", di);
           local_devices.push_back(new ThreadPoolDevice(
               sess_opts, dev_name, mem_limit, dev_locality, cpu_allocator()));
         } else if (device_type == DEVICE_GPU && !gpu_devices_.empty()) {
-          int dev_idx = (wi * num_devices) + di;
+          int dev_idx = (wi * num_devices_per_worker) + di;
           if (dev_idx >= static_cast<int>(gpu_devices_.size())) {
             LOG(INFO) << "dev_mgr has access to limited GPUs, reusing for more "
                          "than one ring node.";
@@ -247,67 +251,86 @@ class BroadcasterTest : public ::testing::Test {
                                            dev_mgr_.get());
     col_params_.name = "test_collective";
     col_params_.instance.data_type = dtype;
-    static const int kGroupKey = 5;
+    static const int kGroupKey = 6;
     col_params_.group.group_key = kGroupKey;
-    static const int kInstanceKey = 17;
+    static const int kInstanceKey = 18;
     col_params_.instance.instance_key = kInstanceKey;
     col_params_.group.device_type = device_type;
-    col_params_.group.group_size = num_workers * num_devices;
+    col_params_.group.group_size = num_workers * num_devices_per_worker;
     col_params_.instance.impl_details.subdiv_offsets.clear();
     col_params_.instance.type = BROADCAST_COLLECTIVE;
-    col_params_.instance.impl_details.subdiv_permutations.resize(kNumSubdivs);
-    col_params_.subdiv_rank.resize(kNumSubdivs);
-    int subdiv_stride = num_devices / kNumSubdivs;
-    for (int sdi = 0; sdi < kNumSubdivs; ++sdi) {
-      col_params_.instance.impl_details.subdiv_offsets.push_back(sdi *
-                                                                 subdiv_stride);
-      col_params_.subdiv_rank[sdi] = sdi * subdiv_stride;
-    }
 
-    // Set up a local device ring order that's not just 0,1,2...
-    std::vector<int> local_ring_order;
-    for (int di = 0; di < num_devices; ++di) {
-      local_ring_order.push_back(di);
+    int num_subdivs = num_workers + (num_workers > 1 ? 1 : 0);
+    VLOG(2) << "#subdiv=" << num_subdivs;
+    col_params_.instance.impl_details.subdiv_permutations.resize(num_subdivs);
+    col_params_.subdiv_rank.resize(num_subdivs);
+
+    // Inter-machine broadcast.
+    int subdiv_i = 0;
+    if (num_workers > 1) {
+      col_params_.instance.impl_details.subdiv_permutations[subdiv_i].resize(
+          total_num_devices, -1);
+      for (int i = 0, rank = 0; i < total_num_devices; i++) {
+        if (i % num_devices_per_worker == 0) {
+          col_params_.instance.impl_details
+              .subdiv_permutations[subdiv_i][rank] = i;
+          rank++;
+        }
+      }
+      if (VLOG_IS_ON(2)) {
+        string sp_buf;
+        for (int p :
+             col_params_.instance.impl_details.subdiv_permutations[subdiv_i])
+          strings::StrAppend(&sp_buf, p, ", ");
+        VLOG(2) << "subdiv_i=" << subdiv_i << " perm=" << sp_buf;
+      }
+      subdiv_i++;
     }
-    for (int di = 0; di < num_devices; ++di) {
-      bool is_odd = ((di % 2) == 1);
-      int other = (di + (is_odd ? 7 : 3)) % num_devices;
-      if (di == other) continue;
-      iter_swap(local_ring_order.begin() + di,
-                local_ring_order.begin() + other);
+    // Intra-machine broadcast.
+    for (int i = 0; subdiv_i < num_subdivs; i++, subdiv_i++) {
+      col_params_.instance.impl_details.subdiv_permutations[subdiv_i].resize(
+          total_num_devices, -1);
+      int perm_i_base = i * num_devices_per_worker;
+      VLOG(2) << "subdiv_i=" << subdiv_i << " i=" << i
+              << " perm_i_base=" << perm_i_base << " subdiv_perms.size="
+              << col_params_.instance.impl_details.subdiv_permutations.size();
+      // subdiv for worker i.
+      for (int j = perm_i_base, rank = 0;
+           j < perm_i_base + num_devices_per_worker; j++, rank++) {
+        col_params_.instance.impl_details.subdiv_permutations[subdiv_i][rank] =
+            j;
+      }
+      if (VLOG_IS_ON(2)) {
+        string sp_buf;
+        for (int p :
+             col_params_.instance.impl_details.subdiv_permutations[subdiv_i])
+          strings::StrAppend(&sp_buf, p, ", ");
+        VLOG(2) << "subdiv_i=" << subdiv_i << " perm=" << sp_buf;
+      }
     }
-    broadcast_dev_id_ = local_ring_order[0];
-    string lro_buf;
-    for (auto d : local_ring_order) strings::StrAppend(&lro_buf, d, ", ");
-    VLOG(1) << "local_ring_order " << lro_buf;
 
-    // Set up all of the fake device contexts.
-    for (int wi = 0; wi < num_workers; ++wi) {
-      for (int di = 0; di < num_devices; ++di) {
+    // Set up all the fake device contexts.
+    for (int wi = 0; wi < num_workers; wi++) {
+      for (int di = 0; di < num_devices_per_worker; di++) {
         string task_name = strings::StrCat("/job:worker/replica:0/task:", wi);
-        string dev_name = strings::StrCat(task_name, "/device:CPU:", di);
+        string dev_name;
         if (device_type == DEVICE_GPU) {
           dev_name = strings::StrCat(task_name, "/device:GPU:0");
+        } else {
+          dev_name = strings::StrCat(task_name, "/device:CPU:", di);
         }
+        VLOG(2) << "dev=" << dev_name;
         col_params_.instance.device_names.push_back(dev_name);
         col_params_.instance.task_names.push_back(task_name);
-        // Normally each device would set is_local to its own perspective but
-        // this test runs in a single process so is_local is always true.
         col_params_.task.is_local.push_back(true);
-        for (int sdi = 0; sdi < kNumSubdivs; ++sdi) {
-          int rotated_di =
-              (di + col_params_.instance.impl_details.subdiv_offsets[sdi]) %
-              num_devices;
-          col_params_.instance.impl_details.subdiv_permutations[sdi].push_back(
-              wi * num_devices + local_ring_order[rotated_di]);
-        }
       }
     }
-    for (int wi = 0; wi < num_workers; ++wi) {
-      for (int di = 0; di < num_devices; ++di) {
-        int rank = wi * num_devices + di;
+    for (int wi = 0; wi < num_workers; wi++) {
+      for (int di = 0; di < num_devices_per_worker; di++) {
+        int default_rank = wi * num_devices_per_worker + di;
         instances_.push_back(new DeviceInstance(
-            rank, col_params_.instance.device_names[rank], device_type_, this));
+            default_rank, col_params_.instance.device_names[default_rank],
+            device_type, this));
       }
     }
   }
@@ -315,6 +338,7 @@ class BroadcasterTest : public ::testing::Test {
   typedef std::function<void(Tensor*)> InitFunc;
 
   void Broadcast(bool forward_input) {
+    VLOG(2) << "#instances=" << instances_.size();
     std::atomic<int> done(0);
     for (auto di : instances_) {
       SchedClosure([di, forward_input, &done] {
@@ -516,39 +540,29 @@ class BroadcasterTest : public ::testing::Test {
       CHECK_EQ(group_size, col_params_.instance.device_names.size());
       // Default rank is order in device_names.
       col_params_.default_rank = rank;
-      // perm_rank is order in subdiv[0]:
-      int perm_rank = -1;
-      for (int i = 0;
-           i < col_params_.instance.impl_details.subdiv_permutations[0].size();
-           ++i) {
-        if (rank ==
-            col_params_.instance.impl_details.subdiv_permutations[0][i]) {
-          perm_rank = i;
-          break;
-        }
-      }
-      CHECK_GE(perm_rank, 0);
-      col_params_.instance.impl_details.subdiv_source_rank.resize(1, 0);
-      col_params_.is_source =
-          (perm_rank ==
-           col_params_.instance.impl_details.subdiv_source_rank[0]);
-      // Set rank in all subdivs by finding that default_rank.
-      for (int sdi = 0; sdi < kNumSubdivs; ++sdi) {
-        for (int r = 0;
-             r <
-             col_params_.instance.impl_details.subdiv_permutations[sdi].size();
-             ++r) {
-          if (col_params_.default_rank ==
-              col_params_.instance.impl_details.subdiv_permutations[sdi][r]) {
-            col_params_.subdiv_rank[sdi] = r;
-            CHECK_EQ(0, sdi);
-            CHECK_EQ(perm_rank, col_params_.subdiv_rank[sdi]);
+
+      auto& impl = col_params_.instance.impl_details;
+      size_t num_subdivs = impl.subdiv_permutations.size();
+      impl.subdiv_source_rank.resize(num_subdivs, 0);
+      col_params_.subdiv_rank.resize(num_subdivs);
+      for (size_t si = 0; si < num_subdivs; si++) {
+        int perm_rank = -1;
+        for (int i = 0; i < group_size; i++) {
+          if (rank == impl.subdiv_permutations[si][i]) {
+            perm_rank = i;
             break;
           }
         }
+        col_params_.subdiv_rank[si] = perm_rank;
+      }
+      string rank_buf;
+      for (int r : col_params_.subdiv_rank) {
+        strings::StrAppend(&rank_buf, r, ", ");
       }
-      CHECK_EQ(group_size, col_params_.task.is_local.size());
-      CHECK_EQ(group_size, col_params_.instance.task_names.size());
+      VLOG(1) << "default=" << rank << " subdiv_ranks=" << rank_buf;
+
+      col_params_.is_source =
+          col_params_.subdiv_rank[0] == impl.subdiv_source_rank[0];
     }
 
     void InitTensor(DataType dtype, const TensorShape& shape,
diff --git a/tensorflow/core/common_runtime/collective_param_resolver_local.cc b/tensorflow/core/common_runtime/collective_param_resolver_local.cc
index 236f999228..2a14493a67 100644
--- a/tensorflow/core/common_runtime/collective_param_resolver_local.cc
+++ b/tensorflow/core/common_runtime/collective_param_resolver_local.cc
@@ -319,6 +319,97 @@ void SortDevicesAndTasks(CollectiveParams* cp) {
 }
 }  // namespace
 
+int GetDeviceTask(int device_rank, const std::vector<int>& dev_per_task) {
+  int num_tasks = static_cast<int>(dev_per_task.size());
+  int task_lo = 0;
+  int task_hi;
+  for (int ti = 0; ti < num_tasks; ti++) {
+    task_hi = task_lo + dev_per_task[ti];
+    if (task_lo <= device_rank && device_rank < task_hi) return ti;
+    task_lo += dev_per_task[ti];
+  }
+  LOG(FATAL) << "Unexpected device rank " << device_rank << " for " << task_hi
+             << " devices";
+  return -1;
+}
+
+void CollectiveParamResolverLocal::GenerateBcastSubdivPerms(
+    const string& device, int source_rank, const std::vector<int>& dev_per_task,
+    CollectiveParams* cp) {
+  if (VLOG_IS_ON(1)) {
+    string dpt_buf;
+    for (int dpt : dev_per_task) strings::StrAppend(&dpt_buf, dpt, ";");
+    VLOG(1) << "GenerateBcastSubdivPerms device=" << device
+            << " source_rank=" << source_rank << " dev_per_task=" << dpt_buf;
+  }
+  int num_tasks = cp->group.num_tasks;
+  // If there is just 1 task, then execute binary tree broadcast over all
+  // devices.  Otherwise, the first subdiv is inter-task broadcast, and then
+  // there are N more subdivs, where N is #task.
+  int num_subdivs = num_tasks + (num_tasks > 1 ? 1 : 0);
+  int total_num_devices = 0;
+  for (int num_dev : dev_per_task) total_num_devices += num_dev;
+
+  cp->instance.impl_details.subdiv_permutations.resize(num_subdivs);
+  cp->subdiv_rank.reserve(num_subdivs);
+  cp->instance.impl_details.subdiv_source_rank.reserve(num_subdivs);
+
+  // Inter-task subdiv.  Pick one device from each task - this is the source
+  // device if it belongs to that task, or device 0 for that task.  If a device
+  // does not participate in the subdiv, set subdiv_rank to -1.
+  if (num_tasks > 1) {
+    const int sdi = 0;
+    std::vector<int>& perm = cp->instance.impl_details.subdiv_permutations[sdi];
+    CHECK_EQ(perm.size(), 0);
+    int device_count = 0;
+    int source_task = GetDeviceTask(source_rank, dev_per_task);
+    for (int ti = 0; ti < cp->group.num_tasks; ti++) {
+      bool participate = false;
+      if (source_task == ti) {
+        // Source device belongs to this task.
+        perm.push_back(source_rank);
+        participate = cp->instance.device_names[source_rank] == device;
+      } else {
+        // Source does not belong to this task, choose dev 0.
+        perm.push_back(device_count);
+        participate = cp->instance.device_names[device_count] == device;
+      }
+      if (participate) cp->subdiv_rank.push_back(ti);
+      device_count += dev_per_task[ti];
+    }
+    if (cp->subdiv_rank.empty()) cp->subdiv_rank.push_back(-1);
+    cp->instance.impl_details.subdiv_source_rank.push_back(source_task);
+  }
+
+  // Intra-task subdivs.  Pick all devices in task ti for subdiv sdi.  Set
+  // source to dev 0 for that task if it does not contain original source, else
+  // set to rank of original source.  If a device does not participate in the
+  // subdiv, set subdiv_rank to -1;
+  int abs_di = 0;
+  for (int ti = 0; ti < cp->group.num_tasks; ti++) {
+    const int sdi = ti + (num_tasks > 1 ? 1 : 0);
+    std::vector<int>& perm = cp->instance.impl_details.subdiv_permutations[sdi];
+    CHECK_EQ(perm.size(), 0);
+    bool participate = false;
+    int subdiv_source = 0;
+    for (int di = 0; di < dev_per_task[ti]; di++) {
+      perm.push_back(abs_di);
+      if (cp->instance.device_names[abs_di] == device) {
+        participate = true;
+        cp->subdiv_rank.push_back(di);
+      }
+      if (abs_di == source_rank) subdiv_source = di;
+      abs_di++;
+    }
+    if (!participate) cp->subdiv_rank.push_back(-1);
+    cp->instance.impl_details.subdiv_source_rank.push_back(subdiv_source);
+  }
+
+  for (int sri = 0; sri < num_subdivs; sri++) {
+    CHECK_GE(cp->instance.impl_details.subdiv_source_rank[sri], 0);
+  }
+}
+
 // Establish the requested number of subdivision permutations based on the
 // ring order implicit in the device order.
 /*static*/
@@ -351,61 +442,51 @@ void CollectiveParamResolverLocal::GenerateSubdivPerms(const string& device,
   dev_per_task.push_back(dev_count);
   CHECK_EQ(cp->group.num_tasks, dev_per_task.size());
 
-  // Generate a ring permutation for each requested offset.
-  CHECK_GT(cp->instance.impl_details.subdiv_offsets.size(), 0);
-  VLOG(2) << "Setting up perms for cp " << cp << " subdiv_permutations "
-          << &cp->instance.impl_details.subdiv_permutations;
-  cp->instance.impl_details.subdiv_permutations.resize(
-      cp->instance.impl_details.subdiv_offsets.size());
-  cp->subdiv_rank.resize(cp->instance.impl_details.subdiv_offsets.size(), -1);
-  for (int sdi = 0; sdi < cp->instance.impl_details.subdiv_offsets.size();
-       ++sdi) {
-    std::vector<int>& perm = cp->instance.impl_details.subdiv_permutations[sdi];
-    CHECK_EQ(perm.size(), 0);
-    int offset = cp->instance.impl_details.subdiv_offsets[sdi];
-    // A negative subdivision offset is interpreted as follows:
-    //  1. Reverse the local device ordering.
-    //  2. Begin the subdivision at abs(offset) in the reversed ordering.
-    bool reverse = false;
-    if (offset < 0) {
-      offset = abs(offset);
-      reverse = true;
-    }
-    int prior_dev_count = 0;  // sum over prior worker device counts
-    for (int ti = 0; ti < cp->group.num_tasks; ++ti) {
-      for (int di = 0; di < dev_per_task[ti]; ++di) {
-        int di_offset = (di + offset) % dev_per_task[ti];
-        int offset_di =
-            reverse ? (dev_per_task[ti] - (di_offset + 1)) : di_offset;
-        // Device index in global subdivision permutation.
-        int permuted_di = prior_dev_count + offset_di;
-        int rank = static_cast<int>(perm.size());
-        perm.push_back(permuted_di);
-        if (cp->instance.device_names[permuted_di] == device) {
-          CHECK_EQ(permuted_di, cp->default_rank);
-          cp->subdiv_rank[sdi] = rank;
-        }
-      }
-      prior_dev_count += dev_per_task[ti];
-    }
-    CHECK_EQ(cp->group.group_size, perm.size());
-  }
-
-  if (cp->instance.type == BROADCAST_COLLECTIVE) {
-    CHECK_GE(source_rank, 0);
-    cp->instance.impl_details.subdiv_source_rank.resize(
-        cp->instance.impl_details.subdiv_offsets.size(), -1);
-    for (int sdi = 0; sdi < cp->instance.impl_details.subdiv_source_rank.size();
+  CHECK(cp->instance.type == REDUCTION_COLLECTIVE ||
+        cp->instance.type == BROADCAST_COLLECTIVE);
+  if (cp->instance.type == REDUCTION_COLLECTIVE) {
+    // Generate a ring permutation for each requested offset.
+    CHECK_GT(cp->instance.impl_details.subdiv_offsets.size(), 0);
+    VLOG(2) << "Setting up perms for cp " << cp << " subdiv_permutations "
+            << &cp->instance.impl_details.subdiv_permutations;
+    cp->instance.impl_details.subdiv_permutations.resize(
+        cp->instance.impl_details.subdiv_offsets.size());
+    cp->subdiv_rank.resize(cp->instance.impl_details.subdiv_offsets.size(), -1);
+    for (int sdi = 0; sdi < cp->instance.impl_details.subdiv_offsets.size();
          ++sdi) {
-      for (int j = 0; j < cp->group.group_size; ++j) {
-        if (cp->instance.impl_details.subdiv_permutations[sdi][j] ==
-            source_rank) {
-          cp->instance.impl_details.subdiv_source_rank[sdi] = j;
-          break;
+      std::vector<int>& perm =
+          cp->instance.impl_details.subdiv_permutations[sdi];
+      CHECK_EQ(perm.size(), 0);
+      int offset = cp->instance.impl_details.subdiv_offsets[sdi];
+      // A negative subdivision offset is interpreted as follows:
+      //  1. Reverse the local device ordering.
+      //  2. Begin the subdivision at abs(offset) in the reversed ordering.
+      bool reverse = false;
+      if (offset < 0) {
+        offset = abs(offset);
+        reverse = true;
+      }
+      int prior_dev_count = 0;  // sum over prior worker device counts
+      for (int ti = 0; ti < cp->group.num_tasks; ++ti) {
+        for (int di = 0; di < dev_per_task[ti]; ++di) {
+          int di_offset = (di + offset) % dev_per_task[ti];
+          int offset_di =
+              reverse ? (dev_per_task[ti] - (di_offset + 1)) : di_offset;
+          // Device index in global subdivision permutation.
+          int permuted_di = prior_dev_count + offset_di;
+          int rank = static_cast<int>(perm.size());
+          perm.push_back(permuted_di);
+          if (cp->instance.device_names[permuted_di] == device) {
+            CHECK_EQ(permuted_di, cp->default_rank);
+            cp->subdiv_rank[sdi] = rank;
+          }
         }
+        prior_dev_count += dev_per_task[ti];
       }
-      CHECK_GE(cp->instance.impl_details.subdiv_source_rank[sdi], 0);
+      CHECK_EQ(cp->group.group_size, perm.size());
     }
+  } else if (cp->instance.type == BROADCAST_COLLECTIVE) {
+    GenerateBcastSubdivPerms(device, source_rank, dev_per_task, cp);
   }
 
   if (VLOG_IS_ON(1)) {
@@ -418,13 +499,21 @@ void CollectiveParamResolverLocal::GenerateSubdivPerms(const string& device,
            di < cp->instance.impl_details.subdiv_permutations[sdi].size();
            ++di) {
         int idx = cp->instance.impl_details.subdiv_permutations[sdi][di];
-        strings::StrAppend(&buf, cp->instance.device_names[idx], "\n");
+        if (idx >= 0) {
+          CHECK_GT(cp->instance.device_names.size(), idx);
+          strings::StrAppend(&buf, cp->instance.device_names[idx], "\n");
+        }
       }
       strings::StrAppend(&buf, " subdiv_offsets: ");
       for (auto o : cp->instance.impl_details.subdiv_offsets)
         strings::StrAppend(&buf, o, " ");
       strings::StrAppend(&buf, " SubdivRank: ");
       for (auto d : cp->subdiv_rank) strings::StrAppend(&buf, d, " ");
+      if (cp->instance.type == BROADCAST_COLLECTIVE) {
+        strings::StrAppend(&buf, " subdiv_source_rank: ");
+        for (auto src : cp->instance.impl_details.subdiv_source_rank)
+          strings::StrAppend(&buf, src, " ");
+      }
       VLOG(1) << buf;
     }
   }
diff --git a/tensorflow/core/common_runtime/collective_param_resolver_local.h b/tensorflow/core/common_runtime/collective_param_resolver_local.h
index 01bdeca7d1..2e2aa801d9 100644
--- a/tensorflow/core/common_runtime/collective_param_resolver_local.h
+++ b/tensorflow/core/common_runtime/collective_param_resolver_local.h
@@ -213,8 +213,16 @@ class CollectiveParamResolverLocal : public ParamResolverInterface {
       LOCKS_EXCLUDED(irec->out_mu);
 
   friend class CollectiveParamResolverLocalTest;
+  // Establishes the requested number of subdivision permutations based on the
+  // ring order implicit in the device order.
   static void GenerateSubdivPerms(const string& device, int source_rank,
                                   CollectiveParams* cp);
+  // Establishes the subdivisions for broadcast op.  The first subdiv executes
+  // binary tree bcast with one device per task.  Each subsequent subdiv
+  // executes intra-task binary tree broadcast.
+  static void GenerateBcastSubdivPerms(const string& device, int source_rank,
+                                       const std::vector<int>& dev_per_task,
+                                       CollectiveParams* cp);
 
   const DeviceMgr* dev_mgr_;
   DeviceResolverInterface* dev_resolver_;  // Not owned.
diff --git a/tensorflow/core/common_runtime/collective_param_resolver_local_test.cc b/tensorflow/core/common_runtime/collective_param_resolver_local_test.cc
index d5be8f927e..9ea23b72d2 100644
--- a/tensorflow/core/common_runtime/collective_param_resolver_local_test.cc
+++ b/tensorflow/core/common_runtime/collective_param_resolver_local_test.cc
@@ -49,6 +49,26 @@ class CollectiveParamResolverLocalTest : public ::testing::Test {
     CollectiveParamResolverLocal::GenerateSubdivPerms(device, source_rank, cp);
   }
 
+  // Calls GenerateBcastSubdivPerms for device at `device_rank`.  Checks if the
+  // generated subdiv perms, ranks, and source ranks match the expected values.
+  void BcastSubdivPerms(
+      CollectiveParams* cp, const std::vector<int>& dev_per_task,
+      int device_rank, int source_rank,
+      const std::vector<std::vector<int>>& expected_subdiv_perms,
+      const std::vector<int>& expected_subdiv_rank,
+      const std::vector<int>& expected_subdiv_source_rank) {
+    cp->subdiv_rank.clear();
+    cp->instance.impl_details.subdiv_permutations.clear();
+    cp->instance.impl_details.subdiv_source_rank.clear();
+    CollectiveParamResolverLocal::GenerateBcastSubdivPerms(
+        cp->instance.device_names[device_rank], source_rank, dev_per_task, cp);
+    EXPECT_EQ(expected_subdiv_perms,
+              cp->instance.impl_details.subdiv_permutations);
+    EXPECT_EQ(expected_subdiv_rank, cp->subdiv_rank);
+    EXPECT_EQ(expected_subdiv_source_rank,
+              cp->instance.impl_details.subdiv_source_rank);
+  }
+
   std::vector<Device*> devices_;
   std::unique_ptr<DeviceMgr> device_mgr_;
   std::unique_ptr<DeviceResolverLocal> drl_;
@@ -216,4 +236,113 @@ TEST_F(CollectiveParamResolverLocalTest, GenerateSubdivPerms) {
   EXPECT_EQ(1, cp.subdiv_rank[1]);
 }
 
+TEST_F(CollectiveParamResolverLocalTest, GenerateBcastSubdivPerms1Task8GPU) {
+  CollectiveParams cp;
+  cp.group.device_type = DeviceType("GPU");
+  cp.group.num_tasks = 1;
+  cp.instance.type = BROADCAST_COLLECTIVE;
+  for (int i = 0; i < 8; i++) {
+    string dev_name =
+        strings::StrCat("/job:worker/replica:0/task:0/device:GPU:", i);
+    cp.instance.device_names.push_back(dev_name);
+  }
+  std::vector<int> dev_per_task = {8};
+
+  // source 0 device 0
+  BcastSubdivPerms(&cp, dev_per_task, 0, 0, {{0, 1, 2, 3, 4, 5, 6, 7}}, {0},
+                   {0});
+
+  // source 2 device 2
+  BcastSubdivPerms(&cp, dev_per_task, 2, 2, {{0, 1, 2, 3, 4, 5, 6, 7}}, {2},
+                   {2});
+
+  // source 2 device 0
+  BcastSubdivPerms(&cp, dev_per_task, 0, 2, {{0, 1, 2, 3, 4, 5, 6, 7}}, {0},
+                   {2});
+}
+
+TEST_F(CollectiveParamResolverLocalTest, GenerateBcastSubdivPerms4Tasks8GPU) {
+  CollectiveParams cp;
+  cp.group.device_type = DeviceType("GPU");
+  cp.group.num_tasks = 4;
+  cp.instance.type = BROADCAST_COLLECTIVE;
+  for (int ti = 0; ti < cp.group.num_tasks; ti++) {
+    for (int di = 0; di < 8; di++) {
+      string dev_name = strings::StrCat("/job:worker/replica:0/task:", ti,
+                                        "/device:GPU:", di);
+      cp.instance.device_names.push_back(dev_name);
+    }
+  }
+  std::vector<int> dev_per_task = {8, 8, 8, 8};
+
+  // source 0 device 0
+  BcastSubdivPerms(&cp, dev_per_task, 0, 0,
+                   {{0, 8, 16, 24},
+                    {0, 1, 2, 3, 4, 5, 6, 7},
+                    {8, 9, 10, 11, 12, 13, 14, 15},
+                    {16, 17, 18, 19, 20, 21, 22, 23},
+                    {24, 25, 26, 27, 28, 29, 30, 31}},
+                   {0, 0, -1, -1, -1}, {0, 0, 0, 0, 0});
+
+  // source 2 device 0
+  BcastSubdivPerms(&cp, dev_per_task, 0, 2,
+                   {{2, 8, 16, 24},
+                    {0, 1, 2, 3, 4, 5, 6, 7},
+                    {8, 9, 10, 11, 12, 13, 14, 15},
+                    {16, 17, 18, 19, 20, 21, 22, 23},
+                    {24, 25, 26, 27, 28, 29, 30, 31}},
+                   {-1, 0, -1, -1, -1}, {0, 2, 0, 0, 0});
+
+  // source 9 device 9
+  BcastSubdivPerms(&cp, dev_per_task, 9, 9,
+                   {{0, 9, 16, 24},
+                    {0, 1, 2, 3, 4, 5, 6, 7},
+                    {8, 9, 10, 11, 12, 13, 14, 15},
+                    {16, 17, 18, 19, 20, 21, 22, 23},
+                    {24, 25, 26, 27, 28, 29, 30, 31}},
+                   {1, -1, 1, -1, -1}, {1, 0, 1, 0, 0});
+}
+
+TEST_F(CollectiveParamResolverLocalTest,
+       GenerateBcastSubdivPerms4TasksVariableGPU) {
+  CollectiveParams cp;
+  cp.group.device_type = DeviceType("GPU");
+  cp.group.num_tasks = 4;
+  std::vector<int> dev_per_task = {4, 4, 6, 8};
+  for (int ti = 0; ti < cp.group.num_tasks; ti++) {
+    for (int di = 0; di < dev_per_task[ti]; di++) {
+      string dev_name = strings::StrCat("/job:worker/replica:0/task:", ti,
+                                        "/device:GPU:", di);
+      cp.instance.device_names.push_back(dev_name);
+    }
+  }
+
+  // source 0 device 0
+  BcastSubdivPerms(&cp, dev_per_task, 0, 0,
+                   {{0, 4, 8, 14},
+                    {0, 1, 2, 3},
+                    {4, 5, 6, 7},
+                    {8, 9, 10, 11, 12, 13},
+                    {14, 15, 16, 17, 18, 19, 20, 21}},
+                   {0, 0, -1, -1, -1}, {0, 0, 0, 0, 0});
+
+  // source 2 device 0
+  BcastSubdivPerms(&cp, dev_per_task, 0, 2,
+                   {{2, 4, 8, 14},
+                    {0, 1, 2, 3},
+                    {4, 5, 6, 7},
+                    {8, 9, 10, 11, 12, 13},
+                    {14, 15, 16, 17, 18, 19, 20, 21}},
+                   {-1, 0, -1, -1, -1}, {0, 2, 0, 0, 0});
+
+  // source 9 device 5
+  BcastSubdivPerms(&cp, dev_per_task, 5, 9,
+                   {{0, 4, 9, 14},
+                    {0, 1, 2, 3},
+                    {4, 5, 6, 7},
+                    {8, 9, 10, 11, 12, 13},
+                    {14, 15, 16, 17, 18, 19, 20, 21}},
+                   {-1, -1, 1, -1, -1}, {2, 0, 0, 1, 0});
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/direct_session.cc b/tensorflow/core/common_runtime/direct_session.cc
index f903faf1bd..0695278c0d 100644
--- a/tensorflow/core/common_runtime/direct_session.cc
+++ b/tensorflow/core/common_runtime/direct_session.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/device_resolver_local.h"
 #include "tensorflow/core/common_runtime/executor.h"
+#include "tensorflow/core/common_runtime/executor_factory.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/common_runtime/graph_optimizer.h"
 #include "tensorflow/core/common_runtime/memory_types.h"
@@ -146,18 +147,15 @@ class DirectSessionFactory : public SessionFactory {
     return options.target.empty();
   }
 
-  Session* NewSession(const SessionOptions& options) override {
+  Status NewSession(const SessionOptions& options,
+                    Session** out_session) override {
     // Must do this before the CPU allocator is created.
     if (options.config.graph_options().build_cost_model() > 0) {
       EnableCPUAllocatorFullStats(true);
     }
     std::vector<Device*> devices;
-    const Status s = DeviceFactory::AddDevices(
-        options, "/job:localhost/replica:0/task:0", &devices);
-    if (!s.ok()) {
-      LOG(ERROR) << s;
-      return nullptr;
-    }
+    TF_RETURN_IF_ERROR(DeviceFactory::AddDevices(
+        options, "/job:localhost/replica:0/task:0", &devices));
 
     DirectSession* session =
         new DirectSession(options, new DeviceMgr(devices), this);
@@ -165,7 +163,8 @@ class DirectSessionFactory : public SessionFactory {
       mutex_lock l(sessions_lock_);
       sessions_.push_back(session);
     }
-    return session;
+    *out_session = session;
+    return Status::OK();
   }
 
   Status Reset(const SessionOptions& options,
@@ -237,7 +236,11 @@ void DirectSession::SchedClosure(thread::ThreadPool* pool,
   // safe given the reasoning above.
   c();
 #else
-  pool->Schedule(std::move(c));
+  if (pool != nullptr) {
+    pool->Schedule(std::move(c));
+  } else {
+    c();
+  }
 #endif  // __ANDROID__
 }
 
@@ -524,8 +527,9 @@ Status DirectSession::RunInternal(int64 step_id, const RunOptions& run_options,
     }
   }
 
-  if (run_options.inter_op_thread_pool() < 0 ||
-      run_options.inter_op_thread_pool() >= thread_pools_.size()) {
+  if (run_options.inter_op_thread_pool() < -1 ||
+      run_options.inter_op_thread_pool() >=
+          static_cast<int32>(thread_pools_.size())) {
     run_state.executors_done.Notify();
     delete barrier;
     return errors::InvalidArgument("Invalid inter_op_thread_pool: ",
@@ -550,7 +554,19 @@ Status DirectSession::RunInternal(int64 step_id, const RunOptions& run_options,
   }
 
   thread::ThreadPool* pool =
-      thread_pools_[run_options.inter_op_thread_pool()].first;
+      run_options.inter_op_thread_pool() >= 0
+          ? thread_pools_[run_options.inter_op_thread_pool()].first
+          : nullptr;
+
+  if (pool == nullptr) {
+    // We allow using the caller thread only when having a single executor
+    // specified.
+    if (executors_and_keys->items.size() > 1) {
+      pool = thread_pools_[0].first;
+    } else {
+      VLOG(1) << "Executing Session::Run() synchronously!";
+    }
+  }
 
   Executor::Args::Runner default_runner = [this,
                                            pool](Executor::Args::Closure c) {
@@ -702,7 +718,8 @@ Status DirectSession::Run(const RunOptions& run_options,
   // Receive outputs.
   if (outputs) {
     std::vector<Tensor> sorted_outputs;
-    const Status s = call_frame.ConsumeRetvals(&sorted_outputs);
+    const Status s = call_frame.ConsumeRetvals(
+        &sorted_outputs, /* allow_dead_tensors = */ false);
     if (errors::IsInternal(s)) {
       return errors::InvalidArgument(s.error_message());
     } else if (!s.ok()) {
@@ -1188,12 +1205,11 @@ Status DirectSession::CreateExecutors(
         delete kernel;
       }
     };
-    params.node_outputs_cb = node_outputs_callback_;
 
     optimizer.Optimize(lib, options_.env, device, &iter->second,
                        /*shape_map=*/nullptr);
 
-    // EXPERIMENTAL: tfdbg inserts debug nodes in the graph.
+    // TensorFlow Debugger (tfdbg) inserts debug nodes in the graph.
     const DebugOptions& debug_options =
         options.callable_options.run_options().debug_options();
     if (!debug_options.debug_tensor_watch_opts().empty()) {
@@ -1208,10 +1224,9 @@ Status DirectSession::CreateExecutors(
     item->graph = partition_graph.get();
     item->executor = nullptr;
     item->device = device;
-    Executor* executor;
-    TF_RETURN_IF_ERROR(
-        NewLocalExecutor(params, std::move(partition_graph), &executor));
-    item->executor.reset(executor);
+    auto executor_type = options_.config.experimental().executor_type();
+    TF_RETURN_IF_ERROR(NewExecutor(
+        executor_type, params, std::move(partition_graph), &item->executor));
   }
 
   // Cache the mapping from input/output names to graph elements to
diff --git a/tensorflow/core/common_runtime/direct_session_test.cc b/tensorflow/core/common_runtime/direct_session_test.cc
index 5b424230ca..4b51b20bb1 100644
--- a/tensorflow/core/common_runtime/direct_session_test.cc
+++ b/tensorflow/core/common_runtime/direct_session_test.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <map>
 #include <memory>
 #include <string>
+#include <thread>
 #include <unordered_map>
 #include <vector>
 
@@ -896,6 +897,125 @@ TEST(DirectSessionTest, FetchMultipleTimes) {
   }
 }
 
+TEST(DirectSessionTest, MultipleFeedTestSomeSyncRun) {
+  GraphDef def;
+  Graph g(OpRegistry::Global());
+  RunOptions run_options;
+  run_options.set_inter_op_thread_pool(-1);
+
+  Tensor first_value(DT_FLOAT, TensorShape({}));
+  first_value.scalar<float>()() = 1.0;
+  Node* first_const = test::graph::Constant(&g, first_value);
+  Node* first_identity = test::graph::Identity(&g, first_const);
+
+  Tensor second_value(DT_FLOAT, TensorShape({}));
+  second_value.scalar<float>()() = 2.0;
+  Node* second_const = test::graph::Constant(&g, second_value);
+  Node* second_identity = test::graph::Identity(&g, second_const);
+
+  test::graph::ToGraphDef(&g, &def);
+
+  auto session = CreateSession();
+  ASSERT_TRUE(session != nullptr);
+  TF_ASSERT_OK(session->Create(def));
+
+  std::vector<Tensor> outputs;
+
+  // Fetch without feeding.
+  Status s = session->Run(
+      run_options, {},
+      {first_identity->name() + ":0", second_identity->name() + ":0"}, {},
+      &outputs, nullptr);
+  TF_ASSERT_OK(s);
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(1.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(2.0, outputs[1].flat<float>()(0));
+
+  s = session->Run(
+      {}, {second_identity->name() + ":0", first_identity->name() + ":0"}, {},
+      &outputs);
+  TF_ASSERT_OK(s);
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(2.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(1.0, outputs[1].flat<float>()(0));
+
+  Tensor value_11(DT_FLOAT, TensorShape({}));
+  value_11.scalar<float>()() = 11.0;
+  Tensor value_22(DT_FLOAT, TensorShape({}));
+  value_22.scalar<float>()() = 22.0;
+
+  // Feed [first_const, second_const]
+  s = session->Run(
+      {{first_const->name(), value_11}, {second_const->name(), value_22}},
+      {first_identity->name() + ":0", second_identity->name() + ":0"}, {},
+      &outputs);
+  TF_ASSERT_OK(s);
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(11.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(22.0, outputs[1].flat<float>()(0));
+
+  // Feed [second_const, first_const]
+  s = session->Run(
+      {{second_const->name(), value_22}, {first_const->name(), value_11}},
+      {first_identity->name() + ":0", second_identity->name() + ":0"}, {},
+      &outputs);
+  TF_ASSERT_OK(s);
+  ASSERT_EQ(2, outputs.size());
+  ASSERT_EQ(11.0, outputs[0].flat<float>()(0));
+  ASSERT_EQ(22.0, outputs[1].flat<float>()(0));
+
+  // Feed [first_const, first_const]
+  s = session->Run(
+      run_options,
+      {{first_const->name(), value_11}, {first_const->name(), value_22}},
+      {first_identity->name() + ":0", second_identity->name() + ":0"}, {},
+      &outputs, nullptr);
+  EXPECT_TRUE(errors::IsInvalidArgument(s));
+  EXPECT_TRUE(str_util::StrContains(s.error_message(), "fed more than once"));
+}
+
+REGISTER_OP("ThreadID").Input("x: int64").Output("y: int64").Doc(R"doc(
+ThreadID returns the thread ID that called compute.
+
+x: int64
+y: int64
+)doc");
+
+// The ThreadID kernel returns the thread ID that executed Compute.
+class ThreadIDOp : public OpKernel {
+ public:
+  explicit ThreadIDOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+  void Compute(OpKernelContext* ctx) override {
+    Tensor* out_tensor = nullptr;
+    OP_REQUIRES_OK(ctx,
+                   ctx->allocate_output("y", TensorShape({}), &out_tensor));
+    std::hash<std::thread::id> hasher;
+    out_tensor->scalar<int64>()() =
+        static_cast<int64>(hasher(std::this_thread::get_id()));
+  }
+};
+REGISTER_KERNEL_BUILDER(Name("ThreadID").Device(DEVICE_CPU), ThreadIDOp);
+
+TEST(DirectSessionTest, SessionSyncRun) {
+  Graph g(OpRegistry::Global());
+  Tensor vx(DT_INT64, TensorShape({}));
+  vx.scalar<int64>()() = 17;
+  Node* x = test::graph::Constant(&g, vx);
+  Node* y = test::graph::Unary(&g, "ThreadID", x);
+  GraphDef def;
+  test::graph::ToGraphDef(&g, &def);
+  auto sess = CreateSession();
+  TF_ASSERT_OK(sess->Create(def));
+  std::vector<Tensor> outputs;
+  RunOptions run_opts;
+  run_opts.set_inter_op_thread_pool(-1);
+  auto s = sess->Run(run_opts, {}, {y->name() + ":0"}, {}, &outputs, nullptr);
+
+  std::hash<std::thread::id> hasher;
+  EXPECT_EQ(static_cast<int64>(hasher(std::this_thread::get_id())),
+            static_cast<int64>(outputs[0].scalar<int64>()()));
+}
+
 REGISTER_OP("Darth").Input("x: float").Output("y: float").Doc(R"doc(
 Darth promises one return value.
 
@@ -1400,6 +1520,7 @@ static void TestSessionInterOpThreadsImpl(bool use_function_lib,
   p = options.config.add_session_inter_op_thread_pool();
   if (use_global_pools) p->set_global_name("small pool");
   p->set_num_threads(1);
+  const int kSyncPool = -1;
   const int kLargePool = 0;
   const int kSmallPool = 1;
 
@@ -1442,7 +1563,11 @@ static void TestSessionInterOpThreadsImpl(bool use_function_lib,
           EXPECT_FLOAT_EQ(1.2, flat(0));
           num_done.fetch_add(1);
         };
-        tp->Schedule(fn);
+        if (tp != nullptr) {
+          tp->Schedule(fn);
+        } else {
+          fn();
+        }
       };
 
   // For blocking states:
@@ -1463,9 +1588,10 @@ static void TestSessionInterOpThreadsImpl(bool use_function_lib,
 
   tp1 = new thread::ThreadPool(Env::Default(), "tp1", 5);
 
-  // Launch 2 session run calls. Neither will finish until the blocking op is
+  // Launch a session run call. It will not finish until the blocking op is
   // unblocked, because it is using all threads in the small pool.
   add_session_run_call(tp1, y, kSmallPool);
+
   blocking_op_state->AwaitState(1);  // Wait for the blocking op to Compute.
 
   // These will block on <BlockingOpState>.
@@ -1484,10 +1610,15 @@ static void TestSessionInterOpThreadsImpl(bool use_function_lib,
   delete tp2;
   EXPECT_EQ(kUnblockedThreads, num_done.load());
 
+  // Launch a session call using this thread. This will finish as it runs
+  // synchronously in this thread.
+  add_session_run_call(nullptr, x, kSyncPool);
+
   // Unblock the blocked op and wait for the blocked functions to finish.
   blocking_op_state->MoveToState(1, 2);
   delete tp1;
-  EXPECT_EQ(kUnblockedThreads + kBlockedThreads + 1, num_done.load());
+
+  EXPECT_EQ(kUnblockedThreads + kBlockedThreads + 1 + 1, num_done.load());
   delete blocking_op_state;
   blocking_op_state = nullptr;
 }
@@ -1532,7 +1663,7 @@ TEST(DirectSessionTest, TestSessionInterOpThreadsInvalidOptions) {
   {
     std::unique_ptr<Session> session(NewSession(options));
     TF_ASSERT_OK(session->Create(def));
-    for (int pool_num = -1; pool_num <= 1; pool_num += 2) {
+    for (int pool_num = -2; pool_num <= 1; pool_num += 3) {
       RunOptions run_options;
       run_options.set_inter_op_thread_pool(pool_num);
       std::vector<Tensor> outputs;
@@ -1714,8 +1845,7 @@ TEST(DirectSessionTest, LocalDeviceManager) {
 // y = tf.square(x)
 GraphDef CreateGraphForYEqualsXSquared() {
   GraphDef graph_def;
-  QCHECK(protobuf::TextFormat::ParseFromString(
-      R"EOF(
+  const char* text_proto = R"EOF(
 node {
   name: "x"
   op: "Placeholder"
@@ -1731,8 +1861,9 @@ node {
 versions {
   producer: 26
 }
-  )EOF",
-      &graph_def));
+  )EOF";
+
+  QCHECK(protobuf::TextFormat::ParseFromString(text_proto, &graph_def));
   return graph_def;
 }
 
diff --git a/tensorflow/core/common_runtime/eager/context.cc b/tensorflow/core/common_runtime/eager/context.cc
index 70208fb6d1..b623ed4421 100644
--- a/tensorflow/core/common_runtime/eager/context.cc
+++ b/tensorflow/core/common_runtime/eager/context.cc
@@ -17,8 +17,20 @@ limitations under the License.
 
 #include "tensorflow/core/common_runtime/process_util.h"
 #include "tensorflow/core/lib/core/blocking_counter.h"
+#include "tensorflow/core/util/env_var.h"
 
 namespace tensorflow {
+namespace {
+
+bool ReadBoolFromEnvVar(StringPiece env_var_name, bool default_val) {
+  bool val;
+  if (ReadBoolFromEnvVar(env_var_name, default_val, &val).ok()) {
+    return val;
+  }
+  return default_val;
+}
+
+}  // namespace
 
 EagerContext::EagerContext(const SessionOptions& opts,
                            ContextDevicePlacementPolicy default_policy,
@@ -34,8 +46,16 @@ EagerContext::EagerContext(const SessionOptions& opts,
           local_device_manager_.get(), opts.env, TF_GRAPH_DEF_VERSION,
           &func_lib_def_, {}, thread_pool_.get())),
       log_device_placement_(opts.config.log_device_placement()),
-      async_default_(async) {
+      async_default_(async),
+      use_send_tensor_rpc_(false) {
   InitDeviceMapAndAsync();
+  if (opts.config.inter_op_parallelism_threads() > 0) {
+    runner_ = [this](std::function<void()> closure) {
+      this->thread_pool_->Schedule(closure);
+    };
+  } else {
+    runner_ = [](std::function<void()> closure) { closure(); };
+  }
 }
 
 #ifndef __ANDROID__
@@ -59,7 +79,9 @@ EagerContext::EagerContext(
       remote_device_manager_(std::move(remote_device_manager)),
       server_(std::move(server)),
       remote_eager_workers_(std::move(remote_eager_workers)),
-      remote_contexts_(remote_contexts) {
+      remote_contexts_(remote_contexts),
+      use_send_tensor_rpc_(
+          ReadBoolFromEnvVar("TF_EAGER_REMOTE_USE_SEND_TENSOR_RPC", false)) {
   InitDeviceMapAndAsync();
 }
 #endif
@@ -195,7 +217,7 @@ Status EagerContext::FindDeviceByName(const string& name, Device** result) {
 
 Status EagerContext::MaybeRegisterFunctionRemotely(const FunctionDef& fdef) {
   if (remote_device_manager_ == nullptr) return Status::OK();
-
+#ifndef __ANDROID__
   BlockingCounter blocking_counter(static_cast<int>(remote_contexts_.size()));
 
   std::vector<eager::RegisterFunctionRequest> requests(remote_contexts_.size());
@@ -225,6 +247,7 @@ Status EagerContext::MaybeRegisterFunctionRemotely(const FunctionDef& fdef) {
   for (int i = 0; i < remote_contexts_.size(); i++) {
     TF_RETURN_IF_ERROR(statuses[i]);
   }
+#endif
   return Status::OK();
 }
 
diff --git a/tensorflow/core/common_runtime/eager/context.h b/tensorflow/core/common_runtime/eager/context.h
index 864f514a19..9c8c599452 100644
--- a/tensorflow/core/common_runtime/eager/context.h
+++ b/tensorflow/core/common_runtime/eager/context.h
@@ -105,6 +105,8 @@ class EagerContext {
 
   EagerExecutor* Executor() { return &executor_; }
 
+  std::function<void(std::function<void()>)>* runner() { return &runner_; }
+
   // Sets whether this thread should run in synchronous or asynchronous mode.
   Status SetAsyncForThread(bool async);
 
@@ -180,7 +182,13 @@ class EagerContext {
 #ifndef __ANDROID__
   Status GetClientAndContextID(Device* device, eager::EagerClient** client,
                                uint64* context_id);
+
 #endif
+  // If true, then tensors should be shipped across processes via the
+  // EagerService.SendTensor RPC. If false, _Send/_Recv ops should be used
+  // instead (which in-turn use WorkerService.RecvTensor RPCs.
+  bool UseSendTensorRPC() { return use_send_tensor_rpc_; }
+
  private:
   void InitDeviceMapAndAsync();
   Status MaybeRegisterFunctionRemotely(const FunctionDef& fdef);
@@ -214,6 +222,8 @@ class EagerContext {
   // session->devices[i].
   const std::unique_ptr<ProcessFunctionLibraryRuntime> pflr_;
 
+  std::function<void(std::function<void()>)> runner_;
+
   mutex cache_mu_;
   std::unordered_map<Fprint128, KernelAndDevice*, Fprint128Hasher> kernel_cache_
       GUARDED_BY(cache_mu_);
@@ -235,10 +245,10 @@ class EagerContext {
 
   const std::unique_ptr<DeviceMgr> remote_device_manager_;
 
+#ifndef __ANDROID__
   // The server_ is not const since we release it when the context is destroyed.
   // Therefore the server_ object is not marked as const (even though it should
   // be).
-#ifndef __ANDROID__
   std::unique_ptr<ServerInterface> server_;
   const std::unique_ptr<eager::EagerClientCache> remote_eager_workers_;
 
@@ -246,6 +256,8 @@ class EagerContext {
   gtl::FlatMap<Device*, std::pair<eager::EagerClient*, uint64>>
       device_to_client_cache_;
 #endif
+
+  const bool use_send_tensor_rpc_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eager/execute.cc b/tensorflow/core/common_runtime/eager/execute.cc
index 7a2b477845..181b222b4c 100644
--- a/tensorflow/core/common_runtime/eager/execute.cc
+++ b/tensorflow/core/common_runtime/eager/execute.cc
@@ -88,6 +88,8 @@ Status MaybeCopyInputToExpectedDevice(EagerOperation* op, int i,
   TF_RETURN_IF_ERROR((*handle)->Device(&handle_device));
   const Device* actual_device =
       handle_device == nullptr ? ctx->HostCPU() : handle_device;
+  const Device* op_device =
+      op->Device() == nullptr ? ctx->HostCPU() : op->Device();
 
   if (expected_device != actual_device) {
     switch (ctx->GetDevicePlacementPolicy()) {
@@ -106,8 +108,8 @@ Status MaybeCopyInputToExpectedDevice(EagerOperation* op, int i,
             " cannot compute ",
             op->Name(), " as input #", i, " was expected to be on ",
             expected_device->name(), " but is actually on ",
-            actual_device->name(), " (operation running on ",
-            op->Device()->name(), ")",
+            actual_device->name(), " (operation running on ", op_device->name(),
+            ")",
             " Tensors can be copied explicitly using .gpu() or .cpu() "
             "methods,"
             " or transparently copied by using tf.enable_eager_execution("
@@ -118,7 +120,7 @@ Status MaybeCopyInputToExpectedDevice(EagerOperation* op, int i,
         LOG(WARNING) << "before computing " << op->Name() << " input #" << i
                      << " was expected to be on " << expected_device->name()
                      << " but is actually on " << actual_device->name()
-                     << " (operation running on " << op->Device()->name()
+                     << " (operation running on " << op_device->name()
                      << "). This triggers a copy which can be a performance "
                         "bottleneck.";
         break;
@@ -127,7 +129,7 @@ Status MaybeCopyInputToExpectedDevice(EagerOperation* op, int i,
     }
     // We are only here if the policy is warn or silent copies, so we should
     // trigger a copy.
-    auto pre_time = Env::Default()->NowMicros();
+    auto pre_time_nanos = Env::Default()->NowNanos();
     TensorHandle* result_handle = nullptr;
     Status status = EagerCopyToDevice(
         *handle, ctx, expected_device->name().c_str(), &result_handle);
@@ -139,8 +141,13 @@ Status MaybeCopyInputToExpectedDevice(EagerOperation* op, int i,
       auto* dev_stats = step_stats->mutable_dev_stats(device_idx);
       auto* node_stats = dev_stats->add_node_stats();
       node_stats->set_node_name("_Send");
-      node_stats->set_all_start_micros(pre_time);
-      node_stats->set_op_end_rel_micros(Env::Default()->NowMicros() - pre_time);
+      node_stats->set_all_start_micros(pre_time_nanos /
+                                       EnvTime::kMicrosToNanos);
+      node_stats->set_all_start_nanos(pre_time_nanos);
+      int64 now_nanos = Env::Default()->NowNanos();
+      node_stats->set_op_end_rel_micros((now_nanos - pre_time_nanos) /
+                                        EnvTime::kMicrosToNanos);
+      node_stats->set_op_end_rel_nanos(now_nanos - pre_time_nanos);
     }
     if (!status.ok()) {
       if (result_handle != nullptr) result_handle->Unref();
@@ -173,7 +180,7 @@ Status ValidateInputTypeAndPlacement(EagerContext* ctx, Device* op_device,
     tensorflow::TensorHandle* handle = op->Inputs()[i];
     if (handle->dtype != kernel->input_type(i)) {
       return errors::InvalidArgument(
-          "cannot compute ", op->Name(), " as input #", i,
+          "cannot compute ", op->Name(), " as input #", i, "(zero-based)",
           " was expected to be a ", DataTypeString(kernel->input_type(i)),
           " tensor but is a ", DataTypeString(handle->dtype), " tensor");
     }
@@ -204,222 +211,6 @@ Status SelectDevice(const NodeDef& ndef, EagerContext* ctx, Device** device) {
                          ndef.DebugString());
 }
 
-#ifdef TENSORFLOW_EAGER_USE_XLA
-// Synthesizes and returns a wrapper function over `op`, which must be a
-// primitive op (e.g. matmul).
-//
-// The wrapper function conforms to the function signature expected by
-// XlaLaunch, with input params ordered by <constants, (variable) args and
-// resources>. For example, if the op has input params <Const1, Arg2, Const3,
-// Resource4, Arg5>, they will be reordered to <Const1, Const3, Arg2, Arg5,
-// Resource4> as the input params to the synthesized function.
-//
-// It populates `const_input_types`, `arg_input_types` and
-// `op_input_to_func_input` based on the reordering results, that the caller
-// can use them to build an XlaLaunch. On error, it returns NULL, and sets
-// `status` accordingly.
-const FunctionDef* OpToFunction(TFE_Op* op,
-                                std::vector<TF_DataType>* const_input_types,
-                                std::vector<TF_DataType>* arg_input_types,
-                                gtl::FlatMap<int, int>* op_input_to_func_input,
-                                TF_Status* status) {
-  DCHECK(!op->operation.is_function());
-
-  FunctionDef fdef;
-
-  // Get the OpDef of the op we are trying to encapsulate.
-  TFE_Context* ctx = op->operation.ctx;
-  const OpRegistrationData* op_data;
-  {
-    status = ctx->context.FindFunctionOpData(op->operation.Name(), &op_data);
-    if (!status.ok()) {
-      return nullptr;
-    }
-  }
-  const OpDef& op_def = op_data->op_def;
-
-  OpDef* signature = fdef.mutable_signature();
-
-  // Handle constant inputs.
-  const std::unordered_set<string> const_inputs(
-      *XlaOpRegistry::CompileTimeConstantInputs(op->operation.Name()));
-
-  // First add place holders for the input args, so that we can refer to them
-  // by position in the next loop. Also tally up the resource inputs.
-  int num_resource_inputs = 0;
-  for (int i = 0; i < op_def.input_arg_size(); ++i) {
-    if (op_def.input_arg(i).type() == DT_RESOURCE) {
-      ++num_resource_inputs;
-    }
-    signature->add_input_arg();
-  }
-
-  // Now we map the input params from `op_def` to `signature`, where the param
-  // ordering for `signature` is: <constants, args, resources>.
-  int const_index = 0;
-  int arg_index = const_inputs.size();
-  int resource_index = op_def.input_arg_size() - num_resource_inputs;
-  for (int i = 0; i < op_def.input_arg_size(); ++i) {
-    const OpDef::ArgDef& op_input_arg = op_def.input_arg(i);
-    OpDef::ArgDef* func_input_arg = nullptr;
-    if (const_inputs.find(op_input_arg.name()) != const_inputs.end()) {
-      VLOG(1) << "For const input, mapping op input " << i << " to func input "
-              << const_index;
-      (*op_input_to_func_input)[i] = const_index;
-      func_input_arg = signature->mutable_input_arg(const_index++);
-      const_input_types->push_back(
-          static_cast<TF_DataType>(op->operation.Inputs()[i]->dtype));
-    } else if (op_input_arg.type() == DT_RESOURCE) {
-      VLOG(1) << "For resource input, mapping op input " << i
-              << " to func input " << resource_index;
-      (*op_input_to_func_input)[i] = resource_index;
-      func_input_arg = signature->mutable_input_arg(resource_index++);
-    } else {
-      VLOG(1) << "For arg input, mapping op input " << i << " to func input "
-              << arg_index;
-      (*op_input_to_func_input)[i] = arg_index;
-      func_input_arg = signature->mutable_input_arg(arg_index++);
-      arg_input_types->push_back(
-          static_cast<TF_DataType>(op->operation.Inputs()[i]->dtype));
-    }
-
-    func_input_arg->set_name(op_input_arg.name());
-    func_input_arg->set_type(op->operation.Inputs()[i]->dtype);
-  }
-  VLOG(1) << "Added OpDef Inputs: " << fdef.DebugString();
-
-  // Resources args are at the end of the function input params, and we should
-  // have iterated over all of them.
-  DCHECK_EQ(signature->input_arg_size(), resource_index);
-
-  // Make the synthesized function's name unique.
-  signature->set_name(
-      strings::StrCat(op_def.name(), func_id_generator.fetch_add(1)));
-
-  // Add the node def and set its input names to match op_def's names.
-  const NodeDef& ndef = op->operation.MutableAttrs()->BuildNodeDef();
-  DCHECK_EQ(signature->input_arg_size(), ndef.input_size());
-  *fdef.add_node_def() = ndef;
-  for (int i = 0; i < op_def.input_arg_size(); ++i) {
-    fdef.mutable_node_def(0)->set_input(i, op_def.input_arg(i).name());
-  }
-  VLOG(1) << "Added NodeDef: " << fdef.DebugString();
-
-  // Fix the output names and set output types.
-  for (int i = 0; i < op_def.output_arg_size(); ++i) {
-    OpDef::ArgDef* arg = signature->add_output_arg();
-    const OpDef::ArgDef& op_def_arg = op_def.output_arg(i);
-    const string& out_tensor_name =
-        strings::StrCat(ndef.name(), ":", op_def_arg.name(), ":", 0);
-    arg->set_name(op_def_arg.name());
-    (*fdef.mutable_ret())[op_def_arg.name()] = out_tensor_name;
-    const string& type_attr = op_def_arg.type_attr();
-    if (!type_attr.empty()) {
-      auto i = ndef.attr().find(type_attr);
-      if (i == ndef.attr().end()) {
-        status = errors::InvalidArgument(
-            strings::StrCat("Could not find attr ", type_attr, " in NodeDef ",
-                            ndef.DebugString()));
-        return nullptr;
-      }
-      arg->set_type(i->second.type());
-    }
-  }
-  VLOG(1) << "Fixed Output names and all types: " << fdef.DebugString();
-
-  status = ctx->context.AddFunctionDef(fdef);
-  if (!status.ok()) return nullptr;
-  const auto ret = ctx->context.FindFunctionDef(signature->name());
-  DCHECK(ret != nullptr);
-  return ret;
-}
-
-// Builds an XlaLaunch as a wrapper over 'op', so that 'op' can be executed
-// via XLA.
-std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
-  VLOG(1) << "Creating XlaLaunch for TFE_Op " << op->operation.Name();
-  auto launch_op = std::unique_ptr<TFE_Op>(
-      TFE_NewOp(op->operation.ctx, "XlaLaunch", status));
-  if (TF_GetCode(status) != TF_OK) return nullptr;
-  if (op->operation.device) {
-    TFE_OpSetDevice(launch_op.get(), op->operation.device->name().c_str(),
-                    status);
-    if (TF_GetCode(status) != TF_OK) return nullptr;
-  }
-
-  const FunctionDef* fdef;
-  { fdef = op->operation.ctx->FindFunctionDef(op->operation.Name()); }
-  std::vector<TF_DataType> const_input_types;
-  std::vector<TF_DataType> arg_input_types;
-  gtl::FlatMap<int, int> op_input_to_func_input;
-  if (fdef == nullptr) {
-    // See if this is a primitive op, and if so create a function for it, so
-    // that XlaLaunch can access it.
-    fdef = OpToFunction(op, &const_input_types, &arg_input_types,
-                        &op_input_to_func_input, status);
-    if (!status.ok()) return nullptr;
-  } else {
-    // TODO(hongm): XlaOpRegistry::CompileTimeConstantInputs() does not work
-    // for functions, so we need to find another way to handle constant
-    // inputs.
-    for (int i = const_input_types.size();
-         i < fdef->signature().input_arg_size(); ++i) {
-      VLOG(1) << "Adding Targs from input arg " << i;
-      const OpDef::ArgDef& arg = fdef->signature().input_arg(i);
-      arg_input_types.push_back(static_cast<TF_DataType>(arg.type()));
-    }
-  }
-  DCHECK(fdef != nullptr);
-
-  // Copy inputs and their devices.
-  // Since input param reordering may have occurred between `op` and
-  // `launch_op` via `op_input_to_func_input`, adjust the actual inputs
-  // accordingly.
-  *launch_op->operation.MutableInputs() = op->operation.Inputs();
-  for (TensorHandle* h : launch_op->operation.Inputs()) {
-    h->Ref();
-  }
-  if (!op_input_to_func_input.empty()) {
-    DCHECK_EQ(op->operation.Inputs().size(), op_input_to_func_input.size());
-    for (int i = 0; i < op_input_to_func_input.size(); ++i) {
-      VLOG(1) << "mapping op input " << i << " to func input "
-              << op_input_to_func_input[i];
-
-      (*launch_op->operation.MuableInputs())[op_input_to_func_input[i]] =
-          op->operation.Inputs()[i];
-    }
-  }
-  launch_op->operation.MutableAttrs()->NumInputs(op->operation.Inputs().size());
-
-  TFE_OpSetAttrTypeList(launch_op.get(), "Tconstants", const_input_types.data(),
-                        const_input_types.size());
-
-  // Set Targs and Nresources attrs.
-  TFE_OpSetAttrTypeList(launch_op.get(), "Targs", arg_input_types.data(),
-                        arg_input_types.size());
-  const int num_resource_inputs = fdef->signature().input_arg_size() -
-                                  const_input_types.size() -
-                                  arg_input_types.size();
-  TFE_OpSetAttrInt(launch_op.get(), "Nresources", num_resource_inputs);
-
-  // Set Tresults attr.
-  std::vector<TF_DataType> tresults;
-  for (const OpDef::ArgDef& arg : fdef->signature().output_arg()) {
-    tresults.push_back(static_cast<TF_DataType>(arg.type()));
-  }
-  TFE_OpSetAttrTypeList(launch_op.get(), "Tresults", tresults.data(),
-                        tresults.size());
-
-  // Set function attr.
-  AttrValue attr_value;
-  NameAttrList* func = attr_value.mutable_func();
-  func->set_name(fdef->signature().name());
-  launch_op->attrs.Set("function", attr_value);
-
-  return launch_op;
-}
-#endif  // TENSORFLOW_EAGER_USE_XLA
-
 Status GetOutputDTypes(EagerOperation* op, DataTypeVector* output_dtypes) {
   const auto& node_def = op->MutableAttrs()->BuildNodeDef();
   const OpDef* op_def = nullptr;
@@ -446,20 +237,20 @@ bool IsLocal(EagerContext* ctx, tensorflow::Device* d) {
   return ctx->local_device_mgr()->LookupDevice(d->name(), &tmp).ok();
 }
 
+bool OnSameTask(EagerContext* ctx, Device* first, Device* second) {
+  if (first == nullptr) first = ctx->HostCPU();
+  if (second == nullptr) second = ctx->HostCPU();
+  return first->parsed_name().job == second->parsed_name().job &&
+         first->parsed_name().replica == second->parsed_name().replica &&
+         first->parsed_name().task == second->parsed_name().task;
+}
+
 Status EagerLocalExecute(EagerOperation* op,
                          gtl::InlinedVector<TensorHandle*, 2>* retvals,
                          int* num_retvals) {
   EagerContext* ctx = op->EagerContext();
   auto status = ctx->GetStatus();
   if (!status.ok()) return status;
-#ifdef TENSORFLOW_EAGER_USE_XLA
-  std::unique_ptr<TFE_Op> xla_launch_op;
-  if (op->UseXla() && op->Name() != "XlaLaunch") {
-    xla_launch_op = BuildXlaLaunch(op, status);
-    if (!status.ok()) return status;
-    op = xla_launch_op.get();
-  }
-#endif  // TENSORFLOW_EAGER_USE_XLA
   // Ensure all resource-touching ops run in the device the resource is,
   // regardless of anything else that has been specified. This is identical to
   // the graph mode behavior.
@@ -512,7 +303,8 @@ Status EagerLocalExecute(EagerOperation* op,
     // See WARNING comment in Execute (before kernel->Run) - would be nice to
     // rework to avoid this subtlety.
     tf_shared_lock l(*ctx->FunctionsMu());
-    status = KernelAndDevice::Init(ndef, ctx->func_lib(device), kernel);
+    status = KernelAndDevice::Init(ndef, ctx->func_lib(device), ctx->runner(),
+                                   kernel);
     if (!status.ok()) {
       delete kernel;
       return status;
@@ -552,11 +344,15 @@ Status EagerLocalExecute(EagerOperation* op,
   if (!status.ok()) return status;
   std::unique_ptr<NodeExecStats> maybe_stats;
   if (ctx->ShouldStoreMetadata()) {
+    int64 now_nanos = Env::Default()->NowNanos();
     maybe_stats.reset(new NodeExecStats);
     maybe_stats->set_node_name(op->Name());
-    maybe_stats->set_all_start_micros(Env::Default()->NowMicros());
+    maybe_stats->set_all_start_micros(now_nanos / EnvTime::kMicrosToNanos);
+    maybe_stats->set_all_start_nanos(now_nanos);
     maybe_stats->set_op_start_rel_micros(0);
-    maybe_stats->set_scheduled_micros(Env::Default()->NowMicros());
+    maybe_stats->set_op_start_rel_nanos(0);
+    maybe_stats->set_scheduled_micros(now_nanos / EnvTime::kMicrosToNanos);
+    maybe_stats->set_scheduled_nanos(now_nanos);
     // TODO(apassos) track referenced tensors
   }
   retvals->resize(*num_retvals);
@@ -582,6 +378,94 @@ Status EagerLocalExecute(EagerOperation* op,
   return status;
 }
 
+#ifndef __ANDROID__
+std::function<void()> GetRemoteTensorDestructor(
+    EagerContext* ctx, eager::EagerClient* eager_client, uint64 context_id,
+    uint64 op_id, int output_num) {
+  return [ctx, eager_client, context_id, op_id, output_num]() {
+    std::unique_ptr<eager::EnqueueRequest> request(new eager::EnqueueRequest);
+    request->set_context_id(context_id);
+
+    auto* handle_to_decref = request->add_queue()->mutable_handle_to_decref();
+    handle_to_decref->set_op_id(op_id);
+    handle_to_decref->set_output_num(output_num);
+
+    if (ctx->Async()) {
+      tensorflow::uint64 id = ctx->NextId();
+      auto* node =
+          new eager::RemoteExecuteNode(id, std::move(request), eager_client);
+      ctx->ExecutorAdd(node);
+    } else {
+      eager::EnqueueRequest* actual_request = request.release();
+      eager::EnqueueResponse* response = new eager::EnqueueResponse;
+      eager_client->EnqueueAsync(
+          actual_request, response,
+          [actual_request, response](const tensorflow::Status& s) {
+            delete actual_request;
+            delete response;
+          });
+    }
+
+    return tensorflow::Status::OK();
+  };
+}
+#endif
+
+// When !ctx->UseSendTensorRPC(), then tensors are shipped between remote
+// devices by the receiver invoking the WorkerService.RecvTensor RPC *on the
+// sender* (Rendezvous::RecvAsync() invoked by the _Recv kernel).
+//
+// However, in some configurations the node that has the tensor to be copied
+// isn't running a server (WorkerService RPC interface). For such cases,
+// this function enables sending tensors using the EagerService.SendTensor RPC
+// *on the receiver*.
+Status EagerRemoteSendTensor(EagerContext* ctx, TensorHandle* h,
+                             Device* recv_device, TensorHandle** result) {
+#ifdef __ANDROID__
+  return errors::Unimplemented(
+      "Eager's remote execution is not available on Android devices.");
+#else
+  eager::EagerClient* eager_client;
+  uint64 context_id;
+  TF_RETURN_IF_ERROR(
+      ctx->GetClientAndContextID(recv_device, &eager_client, &context_id));
+
+  eager::SendTensorRequest request;
+  eager::SendTensorResponse response;
+
+  request.set_context_id(context_id);
+  request.set_op_id(ctx->NextId());
+  request.set_device_name(recv_device->name());
+
+  const Tensor* tensor;
+  TF_RETURN_IF_ERROR(h->Tensor(&tensor));
+  tensor->AsProtoTensorContent(request.add_tensors());
+
+  const tensorflow::uint64 id = request.op_id();
+
+  // TODO(nareshmodi): support making this call async.
+  Notification n;
+  Status status;
+  eager_client->SendTensorAsync(&request, &response,
+                                [&n, &status](const Status& s) {
+                                  status = s;
+                                  n.Notify();
+                                });
+  n.WaitForNotification();
+  if (!status.ok()) return status;
+
+  std::function<void()> destructor =
+      GetRemoteTensorDestructor(ctx, eager_client, context_id, id, 0);
+
+  *result = new TensorHandle(id, /*output_num=*/0, /*remote_shape_node_id=*/0,
+                             tensor->dtype(), std::move(destructor),
+                             recv_device, recv_device, ctx);
+  (*result)->SetRemoteShape(MakeUnique<TensorShape>(tensor->shape()));
+
+  return Status::OK();
+#endif
+}
+
 Status EagerRemoteExecute(EagerOperation* op, TensorHandle** retvals,
                           int* num_retvals) {
 #ifdef __ANDROID__
@@ -595,15 +479,21 @@ Status EagerRemoteExecute(EagerOperation* op, TensorHandle** retvals,
   TF_RETURN_IF_ERROR(
       ctx->GetClientAndContextID(op->Device(), &eager_client, &context_id));
 
-  eager::EnqueueRequest request;
+  std::unique_ptr<eager::EnqueueRequest> request(new eager::EnqueueRequest);
   eager::EnqueueResponse response;
 
-  auto* remote_op = request.add_queue()->mutable_operation();
+  request->set_context_id(context_id);
+
+  auto* remote_op = request->add_queue()->mutable_operation();
 
   for (int i = 0; i < op->Inputs().size(); i++) {
     tensorflow::Device* input_device;
     TF_RETURN_IF_ERROR(op->Inputs()[i]->Device(&input_device));
-    if (op->Device() != input_device) {
+    if (op->Device() != input_device &&
+        // If the expected and actual devices are on the same task, don't
+        // explicitly copy, and instead depend on the copy to happen locally
+        // when the op is executed on the device.
+        !OnSameTask(ctx, op->Device(), input_device)) {
       // TODO(b/110044833): It's possible the same tensor gets copied to the
       // remote device repeatedly.
       TF_RETURN_IF_ERROR(MaybeCopyInputToExpectedDevice(
@@ -628,8 +518,6 @@ Status EagerRemoteExecute(EagerOperation* op, TensorHandle** retvals,
   op->Attrs().FillAttrValueMap(remote_op->mutable_attrs());
   remote_op->set_device(op->Device()->name());
 
-  request.set_context_id(context_id);
-
   DataTypeVector output_dtypes;
   TF_RETURN_IF_ERROR(GetOutputDTypes(op, &output_dtypes));
 
@@ -651,32 +539,11 @@ Status EagerRemoteExecute(EagerOperation* op, TensorHandle** retvals,
   for (int i = 0; i < *num_retvals; i++) {
     // TODO(nareshmodi): Change the callback to instead add the decref to a list
     // of pending decrefs that we can send as a batch with the next execute.
-    std::function<void()> callback = [ctx, eager_client, context_id, id, i]() {
-      eager::EnqueueRequest request;
-      request.set_context_id(context_id);
-
-      auto* handle_to_decref = request.add_queue()->mutable_handle_to_decref();
-      handle_to_decref->set_op_id(id);
-      handle_to_decref->set_output_num(i);
-
-      if (ctx->Async()) {
-        tensorflow::uint64 id = ctx->NextId();
-        auto* node = new eager::RemoteExecuteNode(id, request, eager_client);
-        ctx->ExecutorAdd(node);
-      } else {
-        Notification n;
-        eager::EnqueueResponse response;
-        eager_client->EnqueueAsync(
-            &request, &response,
-            [&n](const tensorflow::Status& s) { n.Notify(); });
-        n.WaitForNotification();
-      }
-
-      return tensorflow::Status::OK();
-    };
+    std::function<void()> destructor =
+        GetRemoteTensorDestructor(ctx, eager_client, context_id, id, i);
 
     retvals[i] = new TensorHandle(remote_op->id(), i, remote_node_id,
-                                  output_dtypes[i], std::move(callback),
+                                  output_dtypes[i], std::move(destructor),
                                   op_device, op_device, op->EagerContext());
   }
 
@@ -690,7 +557,7 @@ Status EagerRemoteExecute(EagerOperation* op, TensorHandle** retvals,
     }
     // Unable to capture via std::move, so bind instead.
     auto* node = new eager::RemoteExecuteNode(
-        remote_node_id, request, eager_client, op->Inputs(),
+        remote_node_id, std::move(request), eager_client, op->Inputs(),
         std::bind(
             [](const gtl::InlinedVector<TensorHandle*, 2>& retvals,
                const Status& status, const eager::EnqueueResponse& response) {
@@ -707,7 +574,7 @@ Status EagerRemoteExecute(EagerOperation* op, TensorHandle** retvals,
   } else {
     Notification n;
     Status status;
-    eager_client->EnqueueAsync(&request, &response,
+    eager_client->EnqueueAsync(request.get(), &response,
                                [&n, &status](const Status& s) {
                                  status = s;
                                  n.Notify();
@@ -736,6 +603,11 @@ Status EagerExecute(EagerOperation* op,
     return EagerLocalExecute(op, retvals, num_retvals);
   }
 
+  if (op->EagerContext()->LogDevicePlacement()) {
+    LOG(INFO) << "Executing op " << op->Name() << " in device "
+              << op->Device()->name();
+  }
+
   return EagerRemoteExecute(op, retvals->data(), num_retvals);
 }
 
@@ -770,8 +642,10 @@ Status EagerExecute(EagerContext* ctx, Device* device,
   // TODO(agarwal): change Run to take vector of handles ?
   TF_RETURN_IF_ERROR(kernel->Run(&inputs, &outputs, maybe_stats));
   if (maybe_stats != nullptr) {
-    maybe_stats->set_op_end_rel_micros(Env::Default()->NowMicros() -
+    int64 nanos = Env::Default()->NowNanos();
+    maybe_stats->set_op_end_rel_micros(nanos / EnvTime::kMicrosToNanos -
                                        maybe_stats->all_start_micros());
+    maybe_stats->set_op_end_rel_nanos(nanos - maybe_stats->all_start_nanos());
     mutex_lock ml(*ctx->MetadataMu());
     if (ctx->ShouldStoreMetadata()) {
       auto* step_stats = ctx->RunMetadataProto()->mutable_step_stats();
@@ -936,6 +810,8 @@ Status EagerCopyToDevice(TensorHandle* h, EagerContext* ctx,
 
   if (sender_is_local && recver_is_local) {
     return LocalEagerCopyToDevice(h, ctx, recv_device, result);
+  } else if (ctx->UseSendTensorRPC() && sender_is_local && !recver_is_local) {
+    return EagerRemoteSendTensor(ctx, h, recv_device, result);
   } else {
     string wire_id = GetUniqueWireID();
 
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device.cc b/tensorflow/core/common_runtime/eager/kernel_and_device.cc
index b410ea175b..dae5d1983f 100644
--- a/tensorflow/core/common_runtime/eager/kernel_and_device.cc
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device.cc
@@ -41,17 +41,22 @@ Status KernelAndDevice::InitOp(Device* device, const NodeDef& ndef,
   out->device_ = device;
   out->kernel_.reset(k);
   out->flib_ = nullptr;
+  out->runner_ = nullptr;
+  out->default_runner_ = [](std::function<void()> f) { f(); };
   return s;
 }
 
 // static
 Status KernelAndDevice::Init(const NodeDef& ndef, FunctionLibraryRuntime* flib,
+                             std::function<void(std::function<void()>)>* runner,
                              KernelAndDevice* out) {
   OpKernel* k = nullptr;
   Status s = flib->CreateKernel(ndef, &k);
   out->device_ = flib->device();
   out->kernel_.reset(k);
   out->flib_ = flib;
+  out->runner_ = runner;
+  out->default_runner_ = [](std::function<void()> f) { f(); };
   return s;
 }
 
@@ -83,10 +88,11 @@ Status KernelAndDevice::Run(std::vector<Tensor>* input_tensors,
   if (stats != nullptr) {
     params.track_allocations = true;
   }
-  // TODO(apassos): use a thread pool.
-  std::function<void(std::function<void()>)> runner =
-      [](std::function<void()> f) { f(); };
-  params.runner = &runner;
+  if (runner_ == nullptr) {
+    params.runner = &default_runner_;
+  } else {
+    params.runner = runner_;
+  }
 
   ScopedStepContainer step_container(0, [this](const string& name) {
     device_->resource_manager()->Cleanup(name).IgnoreError();
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device.h b/tensorflow/core/common_runtime/eager/kernel_and_device.h
index c41a0972b1..c0b676b285 100644
--- a/tensorflow/core/common_runtime/eager/kernel_and_device.h
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device.h
@@ -57,6 +57,7 @@ class KernelAndDevice {
   // the FunctionLibraryRuntime is pushed on to the caller (see locking in
   // c_api.cc).
   static Status Init(const NodeDef& ndef, FunctionLibraryRuntime* flib,
+                     std::function<void(std::function<void()>)>* runner,
                      KernelAndDevice* out);
   // TODO(ashankar): Remove this
   static Status InitOp(Device* device, const NodeDef& ndef,
@@ -88,6 +89,8 @@ class KernelAndDevice {
   checkpoint::TensorSliceReaderCacheWrapper slice_reader_cache_;
   Rendezvous* rendez_;
   DataTypeVector output_dtypes_;
+  std::function<void(std::function<void()>)>* runner_;
+  std::function<void(std::function<void()>)> default_runner_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/eager/kernel_and_device_test.cc b/tensorflow/core/common_runtime/eager/kernel_and_device_test.cc
index b4349e1dee..6abe98f53c 100644
--- a/tensorflow/core/common_runtime/eager/kernel_and_device_test.cc
+++ b/tensorflow/core/common_runtime/eager/kernel_and_device_test.cc
@@ -107,8 +107,8 @@ void BM_KernelAndDeviceInit(int iters) {
   KernelAndDevice k(nullptr);
   tensorflow::testing::StartTiming();
   for (int i = 0; i < iters; ++i) {
-    TF_CHECK_OK(
-        KernelAndDevice::Init(ndef, env.function_library_runtime(), &k));
+    TF_CHECK_OK(KernelAndDevice::Init(ndef, env.function_library_runtime(),
+                                      nullptr, &k));
   }
 }
 BENCHMARK(BM_KernelAndDeviceInit);
@@ -128,8 +128,8 @@ void BM_KernelAndDeviceRun(int iters) {
                    .BuildNodeDef());
   TestEnv env;
   KernelAndDevice kernel(nullptr);
-  TF_CHECK_OK(
-      KernelAndDevice::Init(ndef, env.function_library_runtime(), &kernel));
+  TF_CHECK_OK(KernelAndDevice::Init(ndef, env.function_library_runtime(),
+                                    nullptr, &kernel));
   tensorflow::testing::StartTiming();
   for (int i = 0; i < iters; ++i) {
     TF_CHECK_OK(kernel.Run(&inputs, &outputs, nullptr));
diff --git a/tensorflow/core/common_runtime/eager/tensor_handle.cc b/tensorflow/core/common_runtime/eager/tensor_handle.cc
index f9b9abcc99..85b0b79bce 100644
--- a/tensorflow/core/common_runtime/eager/tensor_handle.cc
+++ b/tensorflow/core/common_runtime/eager/tensor_handle.cc
@@ -109,6 +109,19 @@ Status TensorHandle::TensorAndDevice(const tensorflow::Tensor** tensor,
   return Status::OK();
 }
 
+Status TensorHandle::Shape(tensorflow::TensorShape* shape) {
+  if (IsRemote()) {
+    TF_RETURN_IF_ERROR(WaitForNode(remote_shape_node_id_, false));
+    CHECK(remote_shape_ != nullptr);
+    *shape = *(remote_shape_.get());
+  } else {
+    TF_RETURN_IF_ERROR(WaitReady());
+    DCHECK(IsReady());
+    *shape = tensor_.shape();
+  }
+  return Status::OK();
+}
+
 Status TensorHandle::NumDims(int* num_dims) {
   if (IsRemote()) {
     TF_RETURN_IF_ERROR(WaitForNode(remote_shape_node_id_, false));
diff --git a/tensorflow/core/common_runtime/eager/tensor_handle.h b/tensorflow/core/common_runtime/eager/tensor_handle.h
index 46bc94f875..1bc9c6531a 100644
--- a/tensorflow/core/common_runtime/eager/tensor_handle.h
+++ b/tensorflow/core/common_runtime/eager/tensor_handle.h
@@ -109,6 +109,8 @@ class TensorHandle : public core::RefCounted {
                          tensorflow::Device** device,
                          tensorflow::Device** op_device);
 
+  Status Shape(tensorflow::TensorShape* shape);
+
   Status NumDims(int* num_dims);
   Status Dim(int dim_index, int64* dim);
 
@@ -138,6 +140,12 @@ class TensorHandle : public core::RefCounted {
     remote_shape_ = std::move(remote_shape);
   }
 
+  bool OnHostCPU() {
+    mutex_lock ml(ctx_mutex_);
+    return device_ == nullptr ||
+           (ctx_ == nullptr || ctx_->HostCPU() == device_);
+  }
+
  private:
   // If the contents of the Tensor pointed to by this handle is yet to be
   // computed by a EagerNode, this function will block till that compuatation is
diff --git a/tensorflow/core/common_runtime/executor.cc b/tensorflow/core/common_runtime/executor.cc
index f7f2cdc14f..c2fac4c2c8 100644
--- a/tensorflow/core/common_runtime/executor.cc
+++ b/tensorflow/core/common_runtime/executor.cc
@@ -127,36 +127,52 @@ bool SetTimelineLabel(const Node* node, NodeExecStatsWrapper* stats) {
 // Helper routines for collecting step stats.
 namespace nodestats {
 inline int64 NowInUsec() { return Env::Default()->NowMicros(); }
+inline int64 NowInNsec() { return Env::Default()->NowNanos(); }
 
-void SetScheduled(NodeExecStatsWrapper* stats, int64 t) {
+void SetScheduled(NodeExecStatsWrapper* stats, int64 nanos) {
   if (!stats) return;
-  stats->stats()->set_scheduled_micros(t);
+  stats->stats()->set_scheduled_micros(nanos / EnvTime::kMicrosToNanos);
+  stats->stats()->set_scheduled_nanos(nanos);
 }
 
 void SetAllStart(NodeExecStatsWrapper* stats) {
   if (!stats) return;
-  stats->stats()->set_all_start_micros(NowInUsec());
+  int64 now_nanos = NowInNsec();
+  stats->stats()->set_all_start_micros(now_nanos / EnvTime::kMicrosToNanos);
+  stats->stats()->set_all_start_nanos(now_nanos);
 }
 
 void SetOpStart(NodeExecStatsWrapper* stats) {
   if (!stats) return;
   NodeExecStats* nt = stats->stats();
   DCHECK_NE(nt->all_start_micros(), 0);
-  nt->set_op_start_rel_micros(NowInUsec() - nt->all_start_micros());
+  DCHECK_NE(nt->all_start_nanos(), 0);
+  int64 now_nanos = NowInNsec();
+  nt->set_op_start_rel_micros(now_nanos / EnvTime::kMicrosToNanos -
+                              nt->all_start_micros());
+  nt->set_op_start_rel_nanos(now_nanos - nt->all_start_nanos());
 }
 
 void SetOpEnd(NodeExecStatsWrapper* stats) {
   if (!stats) return;
   NodeExecStats* nt = stats->stats();
   DCHECK_NE(nt->all_start_micros(), 0);
-  nt->set_op_end_rel_micros(NowInUsec() - nt->all_start_micros());
+  DCHECK_NE(nt->all_start_nanos(), 0);
+  int64 now_nanos = NowInNsec();
+  nt->set_op_end_rel_micros(now_nanos / EnvTime::kMicrosToNanos -
+                            nt->all_start_micros());
+  nt->set_op_end_rel_nanos(now_nanos - nt->all_start_nanos());
 }
 
 void SetAllEnd(NodeExecStatsWrapper* stats) {
   if (!stats) return;
   NodeExecStats* nt = stats->stats();
   DCHECK_NE(nt->all_start_micros(), 0);
-  nt->set_all_end_rel_micros(NowInUsec() - nt->all_start_micros());
+  DCHECK_NE(nt->all_start_nanos(), 0);
+  int64 now_nanos = NowInNsec();
+  nt->set_all_end_rel_micros(now_nanos / EnvTime::kMicrosToNanos -
+                             nt->all_start_micros());
+  nt->set_all_end_rel_nanos(now_nanos - nt->all_start_nanos());
 }
 
 void SetOutput(NodeExecStatsWrapper* stats, int slot, const Tensor* v) {
@@ -1357,7 +1373,7 @@ class ExecutorState {
                                TaggedNodeSeq* ready);
 
   // Process a ready node in current thread.
-  void Process(TaggedNode node, int64 scheduled_usec);
+  void Process(TaggedNode node, int64 scheduled_nsec);
 
   // Before invoking item->kernel, fills in its "inputs".
   Status PrepareInputs(const NodeItem& item, Entry* first_input,
@@ -1615,7 +1631,7 @@ struct ExecutorState::AsyncState {
   }
 };
 
-void ExecutorState::Process(TaggedNode tagged_node, int64 scheduled_usec) {
+void ExecutorState::Process(TaggedNode tagged_node, int64 scheduled_nsec) {
   const GraphView& gview = impl_->gview_;
   TaggedNodeSeq ready;
   TaggedNodeReadyQueue inline_ready;
@@ -1680,7 +1696,7 @@ void ExecutorState::Process(TaggedNode tagged_node, int64 scheduled_usec) {
       params.track_allocations = true;
       stats = new NodeExecStatsWrapper;
       stats->stats()->set_node_name(node->name());
-      nodestats::SetScheduled(stats, scheduled_usec);
+      nodestats::SetScheduled(stats, scheduled_nsec);
       nodestats::SetAllStart(stats);
     }
 
@@ -1823,7 +1839,7 @@ void ExecutorState::Process(TaggedNode tagged_node, int64 scheduled_usec) {
         device->ConsumeListOfAccessedTensors(device_context, accessed_tensors);
       }
       if (stats) {
-        scheduled_usec = nodestats::NowInUsec();
+        scheduled_nsec = nodestats::NowInNsec();
       }
       // Postprocess.
       completed = NodeDone(s, item.node, ready, stats, &inline_ready);
@@ -1966,17 +1982,9 @@ Status ExecutorState::ProcessOutputs(const NodeItem& item, OpKernelContext* ctx,
     device_context = device_context_map_[node->id()];
   }
 
-  // Experimental: debugger (tfdb) access to intermediate node completion.
-  if (item.num_outputs == 0 && impl_->params_.node_outputs_cb != nullptr) {
-    // If the node has no output, invoke the callback with output slot set to
-    // -1, signifying that this is a no-output node.
-    s.Update(impl_->params_.node_outputs_cb(item.node->name(), -1, nullptr,
-                                            false, ctx));
-  }
-
   for (int i = 0; i < item.num_outputs; ++i) {
     const TensorValue val = ctx->release_output(i);
-    if (*ctx->is_output_dead() || val.tensor == nullptr) {
+    if (val.tensor == nullptr) {
       // Unless it's a Switch or a Recv, the node must produce a
       // tensor value at i-th output.
       if (!IsSwitch(node) && !IsRecv(node)) {
@@ -2018,13 +2026,6 @@ Status ExecutorState::ProcessOutputs(const NodeItem& item, OpKernelContext* ctx,
             LogMemory::RecordTensorOutput(ctx->op_kernel().name(),
                                           ctx->step_id(), i, to_log);
           }
-
-          // Experimental: debugger (tfdb) access to intermediate node
-          // outputs.
-          if (impl_->params_.node_outputs_cb != nullptr) {
-            s.Update(impl_->params_.node_outputs_cb(item.node->name(), i,
-                                                    out->ref, true, ctx));
-          }
         } else {
           // NOTE that std::move is used here, so val.tensor goes to
           // uninitialized state (val.tensor->IsInitialized return false).
@@ -2036,12 +2037,6 @@ Status ExecutorState::ProcessOutputs(const NodeItem& item, OpKernelContext* ctx,
             LogMemory::RecordTensorOutput(ctx->op_kernel().name(),
                                           ctx->step_id(), i, *out->val);
           }
-
-          // Experimental: debugger access to intermediate node outputs.
-          if (impl_->params_.node_outputs_cb != nullptr) {
-            s.Update(impl_->params_.node_outputs_cb(
-                item.node->name(), i, out->val.get(), false, ctx));
-          }
         }
       } else {
         s.Update(errors::Internal("Output ", i, " of type ",
@@ -2219,14 +2214,14 @@ void ExecutorState::ScheduleReady(const TaggedNodeSeq& ready,
                                   TaggedNodeReadyQueue* inline_ready) {
   if (ready.empty()) return;
 
-  int64 scheduled_usec = 0;
+  int64 scheduled_nsec = 0;
   if (stats_collector_) {
-    scheduled_usec = nodestats::NowInUsec();
+    scheduled_nsec = nodestats::NowInNsec();
   }
   if (inline_ready == nullptr) {
     // Schedule to run all the ready ops in thread pool.
     for (auto& tagged_node : ready) {
-      runner_([=]() { Process(tagged_node, scheduled_usec); });
+      runner_([=]() { Process(tagged_node, scheduled_nsec); });
     }
     return;
   }
@@ -2242,7 +2237,7 @@ void ExecutorState::ScheduleReady(const TaggedNodeSeq& ready,
         // Dispatch to another thread since there is plenty of work to
         // do for this thread.
         runner_(std::bind(&ExecutorState::Process, this, *curr_expensive_node,
-                          scheduled_usec));
+                          scheduled_nsec));
       }
       curr_expensive_node = &tagged_node;
     }
@@ -2255,7 +2250,7 @@ void ExecutorState::ScheduleReady(const TaggedNodeSeq& ready,
       // There are inline nodes to run already. We dispatch this expensive
       // node to other thread.
       runner_(std::bind(&ExecutorState::Process, this, *curr_expensive_node,
-                        scheduled_usec));
+                        scheduled_nsec));
     }
   }
 }
diff --git a/tensorflow/core/common_runtime/executor.h b/tensorflow/core/common_runtime/executor.h
index e5d7b7c53c..cd01b43aea 100644
--- a/tensorflow/core/common_runtime/executor.h
+++ b/tensorflow/core/common_runtime/executor.h
@@ -103,7 +103,6 @@ class Executor {
                                  const Tensor* tensor, const bool is_ref,
                                  OpKernelContext* ctx)>
         NodeOutputsCallback;
-    NodeOutputsCallback node_outputs_cb = nullptr;
   };
   typedef std::function<void(const Status&)> DoneCallback;
   virtual void RunAsync(const Args& args, DoneCallback done) = 0;
@@ -139,8 +138,6 @@ struct LocalExecutorParams {
   // when the executor is deleted.
   std::function<Status(const NodeDef&, OpKernel**)> create_kernel;
   std::function<void(OpKernel*)> delete_kernel;
-
-  Executor::Args::NodeOutputsCallback node_outputs_cb;
 };
 ::tensorflow::Status NewLocalExecutor(const LocalExecutorParams& params,
                                       std::unique_ptr<const Graph> graph,
diff --git a/tensorflow/core/common_runtime/function.cc b/tensorflow/core/common_runtime/function.cc
index 08057f26d3..54bbe84b57 100644
--- a/tensorflow/core/common_runtime/function.cc
+++ b/tensorflow/core/common_runtime/function.cc
@@ -399,12 +399,11 @@ Status FunctionLibraryRuntimeImpl::CreateKernel(
   // types.
   MemoryTypeVector input_memory_types;
   for (const auto& t : fbody->arg_types) {
-    input_memory_types.push_back(
-        (t == DT_INT32 || t == DT_RESOURCE) ? HOST_MEMORY : DEVICE_MEMORY);
+    input_memory_types.push_back(MTypeFromDType(t));
   }
   MemoryTypeVector output_memory_types;
   for (const auto& t : fbody->ret_types) {
-    output_memory_types.push_back(t == DT_INT32 ? HOST_MEMORY : DEVICE_MEMORY);
+    output_memory_types.push_back(MTypeFromDType(t));
   }
 
   // Constructs a CallOp kernel for running the instantiated function.
@@ -731,21 +730,24 @@ void FunctionLibraryRuntimeImpl::RunRemote(const Options& opts, Handle handle,
   std::vector<AllocatorAttributes> args_alloc_attrs, rets_alloc_attrs;
   args_alloc_attrs.reserve(fbody->arg_types.size());
   rets_alloc_attrs.reserve(fbody->ret_types.size());
+  // Note: Functions assume that int32's are always on host memory.
   for (const auto& arg_type : fbody->arg_types) {
     AllocatorAttributes arg_alloc_attrs;
-    if (DataTypeAlwaysOnHost(arg_type)) {
+    if (MTypeFromDType(arg_type) == HOST_MEMORY) {
       arg_alloc_attrs.set_on_host(true);
     }
     args_alloc_attrs.push_back(arg_alloc_attrs);
   }
   for (const auto& ret_type : fbody->ret_types) {
     AllocatorAttributes ret_alloc_attrs;
-    if (DataTypeAlwaysOnHost(ret_type)) {
+    if (MTypeFromDType(ret_type) == HOST_MEMORY) {
       ret_alloc_attrs.set_on_host(true);
     }
     rets_alloc_attrs.push_back(ret_alloc_attrs);
   }
 
+  bool allow_dead_tensors = opts.allow_dead_tensors;
+
   // The ProcFLR sends the arguments to the function from the source_device to
   // the target_device. So here we receive those arguments. Similarly, when the
   // computation is done and stored in *rets, we send the return values back
@@ -756,7 +758,7 @@ void FunctionLibraryRuntimeImpl::RunRemote(const Options& opts, Handle handle,
       device_context, args_alloc_attrs, rendezvous, remote_args,
       [frame, remote_args, item, source_device, target_device,
        target_incarnation, rendezvous, device_context, rets, done, exec_args,
-       rets_alloc_attrs](const Status& status) {
+       rets_alloc_attrs, allow_dead_tensors](const Status& status) {
         Status s = status;
         if (s.ok()) {
           s = frame->SetArgs(*remote_args);
@@ -769,13 +771,13 @@ void FunctionLibraryRuntimeImpl::RunRemote(const Options& opts, Handle handle,
           return;
         }
         item->exec->RunAsync(
-            *exec_args,
-            [frame, rets, done, source_device, target_device,
-             target_incarnation, rendezvous, device_context, remote_args,
-             exec_args, rets_alloc_attrs](const Status& status) {
+            *exec_args, [frame, rets, done, source_device, target_device,
+                         target_incarnation, rendezvous, device_context,
+                         remote_args, exec_args, rets_alloc_attrs,
+                         allow_dead_tensors](const Status& status) {
               Status s = status;
               if (s.ok()) {
-                s = frame->ConsumeRetvals(rets);
+                s = frame->ConsumeRetvals(rets, allow_dead_tensors);
               }
               delete frame;
               if (!s.ok()) {
@@ -859,14 +861,15 @@ void FunctionLibraryRuntimeImpl::Run(const Options& opts, Handle handle,
     return;
   }
 
+  bool allow_dead_tensors = opts.allow_dead_tensors;
   item->exec->RunAsync(
       // Executor args
       *exec_args,
       // Done callback.
-      [frame, rets, done, exec_args](const Status& status) {
+      [frame, rets, done, exec_args, allow_dead_tensors](const Status& status) {
         Status s = status;
         if (s.ok()) {
-          s = frame->ConsumeRetvals(rets);
+          s = frame->ConsumeRetvals(rets, allow_dead_tensors);
         }
         delete frame;
         delete exec_args;
diff --git a/tensorflow/core/common_runtime/function_test.cc b/tensorflow/core/common_runtime/function_test.cc
index 1e837e9a7e..120f480198 100644
--- a/tensorflow/core/common_runtime/function_test.cc
+++ b/tensorflow/core/common_runtime/function_test.cc
@@ -1019,8 +1019,9 @@ TEST_F(FunctionLibraryRuntimeTest, Error_BadControlFlow) {
   DCHECK_EQ(x.dtype(), DT_INT32);
   Tensor y;
   HasError(InstantiateAndRun(flr0_, "InvalidControlFlow", {}, {x}, {&y}),
-           "The node 'add' has inputs from different frames. The input 'enter' "
-           "is in frame 'while'. The input 'i' is in frame ''.");
+           "{{node add}} has inputs from different frames. The input"
+           " {{node enter}} is in frame 'while'. The input {{node i}} is in"
+           " frame ''.");
 }
 
 TEST_F(FunctionLibraryRuntimeTest, Gradient_XTimesTwo) {
diff --git a/tensorflow/core/common_runtime/gpu/gpu_device.cc b/tensorflow/core/common_runtime/gpu/gpu_device.cc
index 3cb51b0dbc..3292ef2f62 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_device.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_device.cc
@@ -41,6 +41,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/gpu/gpu_util.h"
 #include "tensorflow/core/common_runtime/gpu_device_context.h"
 #include "tensorflow/core/common_runtime/local_device.h"
+#include "tensorflow/core/common_runtime/visitable_allocator.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/device_base.h"
 #include "tensorflow/core/framework/op_kernel.h"
@@ -224,6 +225,7 @@ class BaseGPUDevice::StreamGroupFactory {
 
       int num_d2d_streams =
           options.experimental().num_dev_to_dev_copy_streams();
+      if (num_d2d_streams == 0) num_d2d_streams = 1;
       if (num_d2d_streams < 1 || num_d2d_streams > 4) {
         LOG(ERROR)
             << "Illegal GPUOptions.experimental.num_dev_to_dev_copy_streams="
@@ -856,7 +858,7 @@ void BaseGPUDevice::ReinitializeDevice(OpKernelContext* context,
       static_cast<ConcretePerOpGpuDevice*>(device);
   DCHECK(concrete_device);
   const cudaStream_t* cuda_stream = reinterpret_cast<const cudaStream_t*>(
-      streams_[stream_id]->compute->implementation()->CudaStreamMemberHack());
+      streams_[stream_id]->compute->implementation()->GpuStreamMemberHack());
   concrete_device->Reinitialize(context, cuda_stream, tf_gpu_id_, allocator,
                                 scratch_[stream_id]);
 }
diff --git a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc
index 4898448476..3c1c31aa73 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc
@@ -15,11 +15,80 @@ limitations under the License.
 
 #include "tensorflow/core/common_runtime/gpu/gpu_event_mgr.h"
 
+#include "tensorflow/core/platform/stacktrace.h"
 #include "tensorflow/core/platform/stream_executor.h"
 #include "tensorflow/core/protobuf/config.pb.h"
 
 namespace tensorflow {
 
+namespace {
+// The EventMgr has 1 thread for the polling loop and one to execute
+// event callback functions. Issues for reconsideration:
+//  - Is this the right number of threads?
+//  - Should EventMgrs be shared between GPUDevices on a multi-GPU machine?
+static const int kNumThreads = 2;
+}  // namespace
+
+namespace gpu_event_mgr {
+class ThreadLabel {
+ public:
+  static const char* GetValue() { return value_; }
+
+  // v must be a static const because value_ will capture and use its value
+  // until reset or thread terminates.
+  static void SetValue(const char* v) { value_ = v; }
+
+ private:
+  static thread_local const char* value_;
+};
+thread_local const char* ThreadLabel::value_ = "";
+
+void WarnIfInCallback(std::function<void()> f) {
+  const char* label = ThreadLabel::GetValue();
+  if (label && !strcmp(label, "gpu_event_mgr")) {
+    if (f) {
+      f();
+    } else {
+      LOG(WARNING) << "Executing inside EventMgr callback thread: "
+                   << CurrentStackTrace();
+    }
+  }
+}
+
+void InitThreadpoolLabels(thread::ThreadPool* threadpool) {
+  static const char* label = "gpu_event_mgr";
+  mutex mu;
+  int init_count = 0;
+  condition_variable all_initialized;
+  int exit_count = 0;
+  condition_variable ready_to_exit;
+  const int num_threads = threadpool->NumThreads();
+  for (int i = 0; i < num_threads; ++i) {
+    threadpool->Schedule([num_threads, &mu, &init_count, &all_initialized,
+                          &exit_count, &ready_to_exit]() {
+      gpu_event_mgr::ThreadLabel::SetValue(label);
+      mutex_lock l(mu);
+      ++init_count;
+      if (init_count == num_threads) {
+        all_initialized.notify_all();
+      }
+      while (init_count < num_threads) {
+        all_initialized.wait(l);
+      }
+      if (++exit_count == num_threads) {
+        ready_to_exit.notify_all();
+      }
+    });
+  }
+  {
+    mutex_lock l(mu);
+    while (exit_count < num_threads) {
+      ready_to_exit.wait(l);
+    }
+  }
+}
+}  // namespace gpu_event_mgr
+
 EventMgr::EventMgr(se::StreamExecutor* se, const GPUOptions& gpu_options)
     : exec_(se),
       deferred_bytes_threshold_(gpu_options.deferred_deletion_bytes()
@@ -31,9 +100,8 @@ EventMgr::EventMgr(se::StreamExecutor* se, const GPUOptions& gpu_options)
       accumulated_stream_(nullptr),
       accumulated_tensors_(new TensorReferenceVector),
       accumulated_tensor_bytes_(0),
-      // threadpool_ has 1 thread for the polling loop, and one to execute
-      // event callback functions. Maybe we should have more?
-      threadpool_(Env::Default(), "GPU_Event_Manager", 2) {
+      threadpool_(Env::Default(), "GPU_Event_Manager", kNumThreads) {
+  gpu_event_mgr::InitThreadpoolLabels(&threadpool_);
   StartPollingLoop();
 }
 
diff --git a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h
index b26f88a201..f0a109cc10 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_event_mgr.h
@@ -39,6 +39,25 @@ namespace tensorflow {
 
 class GPUOptions;
 
+// The callback provided to EventMgr::ThenExecute must not block or take a long
+// time.  If it does, performance may be impacted and GPU memory may be
+// exhausted.  This macro is for checking that an EventMgr thread is not
+// accidentally entering blocking parts of the code, e.g. the RPC subsystem.
+//
+// Intended use is something like
+//
+//   void RespondToAnRPC(Params* params) {
+//      WARN_IF_IN_EVENT_MGR_THREAD;
+//      if (params->status.ok()) { ...
+//
+namespace gpu_event_mgr {
+// Logs a stack trace if current execution thread belongs to this EventMgr
+// object.  If f is not nullptr, executes instead of  logging the stack trace.
+// trace.
+void WarnIfInCallback(std::function<void()> f);
+}  // namespace gpu_event_mgr
+#define WARN_IF_IN_EVENT_MGR_THREAD gpu_event_mgr::WarnIfInCallback(nullptr)
+
 // An object to keep track of pending Events in the StreamExecutor streams
 // and associated Tensors that cannot safely be deleted until the associated
 // Events are recorded.
@@ -74,6 +93,9 @@ class EventMgr {
     FreeMemory(to_free);
   }
 
+  // Execute func when all pending stream actions have completed.
+  // func must be brief and non-blocking since it executes in the one
+  // thread used for all such callbacks and also buffer deletions.
   inline void ThenExecute(se::Stream* stream, std::function<void()> func) {
     ToFreeVector to_free;
     {
diff --git a/tensorflow/core/common_runtime/gpu/gpu_event_mgr_test.cc b/tensorflow/core/common_runtime/gpu/gpu_event_mgr_test.cc
index c5ff6c97a1..d2adf699f5 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_event_mgr_test.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_event_mgr_test.cc
@@ -19,6 +19,7 @@ limitations under the License.
 
 #include <atomic>
 #include "tensorflow/core/common_runtime/gpu/gpu_init.h"
+#include "tensorflow/core/lib/core/notification.h"
 #include "tensorflow/core/platform/stream_executor.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/protobuf/config.pb.h"
@@ -243,6 +244,28 @@ TEST(EventMgr, NonEmptyShutdown) {
   }
 }
 
+// Tests that WarnIfInCallback() triggers correctly.
+TEST(EventMgr, WarnIfInCallback) {
+  auto stream_exec = GPUMachineManager()->ExecutorForDevice(0).ValueOrDie();
+  EventMgr em(stream_exec, GPUOptions());
+  TEST_EventMgrHelper th(&em);
+  std::unique_ptr<se::Stream> stream(new se::Stream(stream_exec));
+  CHECK(stream);
+  stream->Init();
+  bool hit = false;
+  gpu_event_mgr::WarnIfInCallback([&hit] { hit = true; });
+  EXPECT_FALSE(hit);
+  Notification note;
+  em.ThenExecute(stream.get(), [&hit, &note]() {
+    gpu_event_mgr::WarnIfInCallback([&hit, &note] {
+      hit = true;
+      note.Notify();
+    });
+  });
+  note.WaitForNotification();
+  EXPECT_TRUE(hit);
+}
+
 }  // namespace
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/common_runtime/gpu/gpu_process_state.cc b/tensorflow/core/common_runtime/gpu/gpu_process_state.cc
index 607cb33b80..b18688174d 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_process_state.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_process_state.cc
@@ -68,6 +68,7 @@ GPUProcessState* GPUProcessState::instance_ = nullptr;
 
 GPUProcessState::GPUProcessState() : gpu_device_enabled_(false) {
   CHECK(instance_ == nullptr);
+  instance_ = this;
   process_state_ = ProcessState::singleton();
 }
 
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.h b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
index 29f702699f..94e10dbfa2 100644
--- a/tensorflow/core/common_runtime/mkl_cpu_allocator.h
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
@@ -22,7 +22,6 @@ limitations under the License.
 #ifdef INTEL_MKL
 
 #include <cstdlib>
-#include <string>
 #include "tensorflow/core/common_runtime/bfc_allocator.h"
 #include "tensorflow/core/common_runtime/visitable_allocator.h"
 #include "tensorflow/core/lib/strings/numbers.h"
diff --git a/tensorflow/core/common_runtime/placer.cc b/tensorflow/core/common_runtime/placer.cc
index 86851c2c07..d581f45a90 100644
--- a/tensorflow/core/common_runtime/placer.cc
+++ b/tensorflow/core/common_runtime/placer.cc
@@ -30,6 +30,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/util/status_util.h"
 
 namespace tensorflow {
 
@@ -40,10 +41,8 @@ namespace {
 const StringPiece kColocationAttrNameStringPiece(kColocationAttrName);
 const StringPiece kColocationGroupPrefixStringPiece(kColocationGroupPrefix);
 
-// Returns a list of devices sorted by preferred type and then name
-// from 'devices' whose type is in 'supported_device_types'.  This
-// function searches the device types in 'supported_device_types' and
-// returns the subset of devices that match.
+// Returns a list of devices having type in supported_device_types.  The
+// returned list is sorted by preferred type (higher numeric type is preferred).
 std::vector<Device*> FilterSupportedDevices(
     const std::vector<Device*>& devices,
     const DeviceTypeVector& supported_device_types) {
@@ -80,12 +79,12 @@ std::vector<Device*> FilterSupportedDevices(
 //   DeviceSet device_set = ...;
 //   ColocationGraph colocation_graph(graph, device_set);
 //
-//   // Add all the nodes of graph to colocation_graph.
+//   // Add all the nodes of the `graph` to the `colocation_graph`.
 //   for (Node* node : graph.nodes()) {
 //     TF_RETURN_IF_ERROR(colocation_graph.AddNode(*node));
 //   }
 //
-//   // Add one or more colocation constraint.
+//   // Add one or more colocation constraints.
 //   Node node_1 = *graph.FindNodeId(...);
 //   Node node_2 = *graph.FindNodeId(...);
 //   TF_RETURN_IF_ERROR(colocation_graph.ColocateNodes(node_1, node_2));
@@ -95,9 +94,9 @@ std::vector<Device*> FilterSupportedDevices(
 //     TF_RETURN_IF_ERROR(colocation_graph.AssignDevice(node));
 //   }
 //
-// The implementation uses the union-find algorithm to maintain the
-// connected components efficiently and incrementally as edges
-// (implied by ColocationGraph::ColocateNodes() invocations) are added.
+// This implementation uses the Union-Find algorithm to efficiently maintain the
+// connected components and incrementally adds edges via
+// ColocationGraph::ColocateNodes() invocations.
 class ColocationGraph {
  public:
   ColocationGraph(Graph* graph, const DeviceSet* device_set,
@@ -133,13 +132,9 @@ class ColocationGraph {
     std::unordered_map<StringPiece, const Node*, StringPieceHasher>
         colocation_group_root;
 
-    for (Node* node : graph_->nodes()) {
-      if (!node->IsOp()) {
-        continue;
-      }
-
-      // When adding the node, identify whether it is part of a
-      // colocation group.
+    for (Node* node : graph_->op_nodes()) {
+      // When adding the node, identify whether it is part of a colocation
+      // group.
 
       // This code is effectively the equivalent of GetNodeAttr() for a string
       // array, but it avoids all internal allocations (the allocation of the
@@ -218,11 +213,10 @@ class ColocationGraph {
     Member& x_root_member = members_[x_root];
     Member& y_root_member = members_[y_root];
 
-    // Merge the sets by swinging the parent pointer of the smaller
-    // tree to point to the root of the larger tree. Together with
-    // path compression in ColocationGraph::FindRoot, this ensures
-    // that we do not experience pathological performance on graphs
-    // such as chains.
+    // Merge the sets by setting the parent pointer of the smaller tree's root
+    // node to point to the root of the larger tree. Together with path
+    // compression in ColocationGraph::FindRoot, this ensures that we do not
+    // experience pathological performance on graphs such as chains.
     int new_root, old_root;
     if (x_root_member.rank < y_root_member.rank) {
       // The tree rooted at x_root is shallower, so connect it to
@@ -610,22 +604,50 @@ class ColocationGraph {
   // given id is connected.
   int FindRoot(int node_id) {
     Member& member = members_[node_id];
+    DCHECK_GE(member.parent, 0);
+    if (member.parent == node_id) {
+      // member.parent is the root of this disjoint tree.  Do nothing.
+    } else {
+      member.parent = FindRoot(member.parent);
+    }
+    // Now it is guaranteed that member.parent is the root of this disjoint
+    // tree.
+    DCHECK_GE(member.parent, 0);
+    return member.parent;
+  }
 
-    int parent = member.parent;
-    DCHECK_GE(parent, 0);
-
-    if (parent != node_id) {
-      // NOTE: Compress paths from node_id to its root, so that future
-      // calls to FindRoot and ColocateNodes are more efficient.
-      int root = FindRoot(parent);
-      if (parent != root) {
-        parent = root;
-        member.parent = root;
+  // Ensures that the devices of 'dst's resource and reference match the device
+  // specified for 'src', which is an input of 'dst' with a partially or fully
+  // specified device.
+  Status VerifyResourceAndRefInputsCanBeColocated(
+      const Node* dst, const Node* src,
+      const DeviceNameUtils::ParsedName& src_parsed_name) {
+    std::vector<const Edge*> edges;
+    TF_RETURN_IF_ERROR(dst->input_edges(&edges));
+    for (const Edge* edge : edges) {
+      DataType input_type = dst->input_type(edge->dst_input());
+      if (input_type == DT_RESOURCE || IsRefType(input_type)) {
+        const Node* input_node = edge->src();
+        if (input_node == src) {
+          continue;
+        }
+        const auto& input_root = members_[FindRoot(input_node->id())];
+        const auto& input_parsed_name = input_root.device_name;
+        if (DeviceNameUtils::HasSomeDetails(input_parsed_name) &&
+            !DeviceNameUtils::AreCompatibleDevNames(input_parsed_name,
+                                                    src_parsed_name)) {
+          return AttachDef(
+              errors::InvalidArgument(
+                  "Could not colocate node with its "
+                  "resource and reference inputs; devices ",
+                  DeviceNameUtils::ParsedNameToString(input_parsed_name),
+                  " and ", DeviceNameUtils::ParsedNameToString(src_parsed_name),
+                  " are not compatible."),
+              *dst);
+        }
       }
     }
-
-    DCHECK_GE(parent, 0);
-    return parent;
+    return Status::OK();
   }
 
   Graph* const graph_;  // Not owned.
@@ -646,6 +668,15 @@ bool IsGeneratorNode(const Node* node) {
          !IsRefType(node->output_type(0));
 }
 
+bool IsExemptFromResourceInputColocation(const Node* node) {
+  // Note: Partitioned function calls, which place and partition their
+  // function bodies, are exempt from this check: they forward resource and
+  // ref inputs to operations that are appropriately placed, instead of
+  // dereferencing them.
+  const string& op_type = node->op_def().name();
+  return op_type == "PartitionedCall" || op_type == "StatefulPartitionedCall";
+}
+
 }  // namespace
 
 Placer::Placer(Graph* graph, const DeviceSet* devices,
@@ -680,8 +711,8 @@ Status Placer::Run() {
   // 2. Enumerate the constraint edges, and use them to update the disjoint
   // node set.
 
-  // If `node` has an input edge with reference type, add an
-  // edge from the source of that edge to `node`.
+  // If `node` has an input edge with reference type, add an edge from the
+  // source of that edge to `node`.
   for (const Edge* edge : graph_->edges()) {
     if (edge->IsControlEdge()) {
       continue;
@@ -689,7 +720,10 @@ Status Placer::Run() {
     Node* src = edge->src();
     Node* dst = edge->dst();
     DataType input_type = dst->input_type(edge->dst_input());
-    if (input_type == DT_RESOURCE || IsRefType(input_type)) {
+    if ((input_type == DT_RESOURCE || IsRefType(input_type)) &&
+        !IsExemptFromResourceInputColocation(dst)) {
+      // Colocate `src` and `dst` to maintain the invariant that nodes connected
+      // by reference edges are colocated.
       int src_root_id = colocation_graph.FindRoot(src->id());
       int dst_root_id = colocation_graph.FindRoot(dst->id());
       auto& src_root = colocation_graph.members_[src_root_id];
@@ -706,6 +740,9 @@ Status Placer::Run() {
         // incompatible.
         if (!DeviceNameUtils::AreCompatibleDevNames(source_parsed_name,
                                                     dest_parsed_name)) {
+          TF_RETURN_IF_ERROR(
+              colocation_graph.VerifyResourceAndRefInputsCanBeColocated(
+                  dst, src, source_parsed_name));
           if (log_device_placement_) {
             LOG(INFO) << "Ignoring device specification "
                       << DeviceNameUtils::ParsedNameToString(dest_parsed_name)
@@ -773,10 +810,10 @@ Status Placer::Run() {
     std::vector<Device*>* devices;
     Status status = colocation_graph.GetDevicesForNode(node, &devices);
     if (!status.ok()) {
-      return AttachDef(
-          errors::InvalidArgument("Cannot assign a device for operation '",
-                                  node->name(), "': ", status.error_message()),
-          *node);
+      return AttachDef(errors::InvalidArgument(
+                           "Cannot assign a device for operation ",
+                           RichNodeName(node), ": ", status.error_message()),
+                       *node);
     }
 
     // Returns the first device in sorted devices list so we will always
@@ -820,10 +857,10 @@ Status Placer::Run() {
     std::vector<Device*>* devices;
     Status status = colocation_graph.GetDevicesForNode(node, &devices);
     if (!status.ok()) {
-      return AttachDef(
-          errors::InvalidArgument("Cannot assign a device for operation '",
-                                  node->name(), "': ", status.error_message()),
-          *node);
+      return AttachDef(errors::InvalidArgument(
+                           "Cannot assign a device for operation ",
+                           RichNodeName(node), ": ", status.error_message()),
+                       *node);
     }
 
     int assigned_device = -1;
@@ -889,4 +926,22 @@ void Placer::LogDeviceAssignment(const Node* node) const {
   }
 }
 
+bool Placer::ClientHandlesErrorFormatting() const {
+  return options_ != nullptr &&
+         options_->config.experimental().client_handles_error_formatting();
+}
+
+// Returns the node name in single quotes. If the client handles formatted
+// errors, appends a formatting tag which the client will reformat into, for
+// example, " (defined at filename:123)".
+string Placer::RichNodeName(const Node* node) const {
+  string quoted_name = strings::StrCat("'", node->name(), "'");
+  if (ClientHandlesErrorFormatting()) {
+    string file_and_line = error_format_tag(*node, "${defined_at}");
+    return strings::StrCat(quoted_name, file_and_line);
+  } else {
+    return quoted_name;
+  }
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/placer.h b/tensorflow/core/common_runtime/placer.h
index 75dce7c7fe..fce87269c5 100644
--- a/tensorflow/core/common_runtime/placer.h
+++ b/tensorflow/core/common_runtime/placer.h
@@ -87,6 +87,8 @@ class Placer {
   // placement if the SessionOptions entry in 'options_' requests it.
   void AssignAndLog(int assigned_device, Node* node) const;
   void LogDeviceAssignment(const Node* node) const;
+  bool ClientHandlesErrorFormatting() const;
+  string RichNodeName(const Node* node) const;
 
   Graph* const graph_;              // Not owned.
   const DeviceSet* const devices_;  // Not owned.
diff --git a/tensorflow/core/common_runtime/placer_test.cc b/tensorflow/core/common_runtime/placer_test.cc
index 5ad251c892..87f2f2ceb9 100644
--- a/tensorflow/core/common_runtime/placer_test.cc
+++ b/tensorflow/core/common_runtime/placer_test.cc
@@ -575,6 +575,10 @@ REGISTER_KERNEL_BUILDER(Name("HandleAssignCPU").Device("FakeCPU"), DummyOp);
 REGISTER_OP("HandleAssignGPU").Input("i: resource").Input("v: float");
 REGISTER_KERNEL_BUILDER(Name("HandleAssignGPU").Device("FakeGPU"), DummyOp);
 
+REGISTER_OP("TestTwoHandlesIn").Input("i: resource").Input("j: resource");
+REGISTER_KERNEL_BUILDER(Name("TestTwoHandlesIn").Device("FakeCPU"), DummyOp);
+REGISTER_KERNEL_BUILDER(Name("TestTwoHandlesIn").Device("FakeGPU"), DummyOp);
+
 // Tests all combinations of resource handles and ops using them.
 TEST_F(PlacerTest, TestResourceHandle) {
   auto handle_test = [this](const string& var_op_name,
@@ -609,6 +613,42 @@ TEST_F(PlacerTest, TestResourceHandle) {
       handle_test("HandleVariableCPU", "HandleAssignGPU", "FakeCPU").ok());
 }
 
+TEST_F(PlacerTest, TestResourceHandlesOnDifferentDevicesFails) {
+  auto handle_test = [this](bool allow_soft_placement) {
+    Graph g(OpRegistry::Global());
+    {  // Scope for temporary variables used to construct g.
+      GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
+      Node* var_cpu =
+          ops::SourceOp("TestHandleVariable", b.opts().WithName("var_cpu"));
+      Node* var_gpu =
+          ops::SourceOp("TestHandleVariable", b.opts().WithName("var_gpu"));
+      ops::BinaryOp("TestTwoHandlesIn", var_cpu, var_gpu,
+                    b.opts().WithName("two_handles_in"));
+      TF_EXPECT_OK(BuildGraph(b, &g));
+
+      GetNodeByName(g, "var_cpu")
+          ->set_assigned_device_name(
+              "/job:a/replica:0/task:0/device:fakecpu:0");
+      GetNodeByName(g, "var_gpu")
+          ->set_assigned_device_name(
+              "/job:a/replica:0/task:0/device:fakegpu:0");
+    }
+
+    SessionOptions options;
+    options.config.set_allow_soft_placement(allow_soft_placement);
+    options.config.set_log_device_placement(true);
+    Status s = Place(&g, &options);
+    EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
+    EXPECT_TRUE(str_util::StrContains(
+        s.error_message(),
+        "Could not colocate node with its resource and reference inputs"));
+    return Status::OK();
+  };
+
+  TF_EXPECT_OK(handle_test(false));
+  TF_EXPECT_OK(handle_test(true));
+}
+
 // Test that an assignment of an operator to the wrong device
 // is ignored when it could never be satisfied (due to reference
 // edges, for example).
@@ -1102,6 +1142,50 @@ TEST_F(PlacerTest, TestNonexistentGpuNoAllowSoftPlacement) {
   EXPECT_TRUE(str_util::StrContains(s.error_message(), "/device:fakegpu:11"));
 }
 
+// Test that the "Cannot assign a device" error message contains a format tag
+// when requested.
+TEST_F(PlacerTest, TestNonexistentGpuNoAllowSoftPlacementFormatTag) {
+  Graph g(OpRegistry::Global());
+  {  // Scope for temporary variables used to construct g.
+    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
+    ops::SourceOp("TestDevice",
+                  b.opts().WithName("in").WithDevice("/device:fakegpu:11"));
+    TF_EXPECT_OK(BuildGraph(b, &g));
+  }
+
+  SessionOptions options;
+  options.config.mutable_experimental()->set_client_handles_error_formatting(
+      true);
+  Status s = Place(&g, &options);
+  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
+  LOG(WARNING) << s.error_message();
+  EXPECT_TRUE(str_util::StrContains(s.error_message(),
+                                    "Cannot assign a device for operation 'in'"
+                                    "^^node:in:${defined_at}^^"));
+}
+
+// Test that the "Cannot assign a device" error message does not contain a
+// format tag when not it shouldn't
+TEST_F(PlacerTest, TestNonexistentGpuNoAllowSoftPlacementNoFormatTag) {
+  Graph g(OpRegistry::Global());
+  {  // Scope for temporary variables used to construct g.
+    GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
+    ops::SourceOp("TestDevice",
+                  b.opts().WithName("in").WithDevice("/device:fakegpu:11"));
+    TF_EXPECT_OK(BuildGraph(b, &g));
+  }
+
+  SessionOptions options;
+  options.config.mutable_experimental()->set_client_handles_error_formatting(
+      false);
+  Status s = Place(&g, &options);
+  EXPECT_EQ(error::INVALID_ARGUMENT, s.code());
+  EXPECT_TRUE(str_util::StrContains(
+      s.error_message(), "Cannot assign a device for operation 'in'"));
+  EXPECT_FALSE(str_util::StrContains(
+      s.error_message(), "'in' (defined at ^^node:in:${file}:${line}^^)"));
+}
+
 // Test that placement fails when a node requests an explicit device that is not
 // supported by the registered kernels if allow_soft_placement is no set.
 TEST_F(PlacerTest, TestUnsupportedDeviceNoAllowSoftPlacement) {
diff --git a/tensorflow/core/common_runtime/process_state.cc b/tensorflow/core/common_runtime/process_state.cc
index 4d83b25ce6..447338e7bd 100644
--- a/tensorflow/core/common_runtime/process_state.cc
+++ b/tensorflow/core/common_runtime/process_state.cc
@@ -71,7 +71,7 @@ ProcessState::MemDesc ProcessState::PtrType(const void* ptr) {
   return MemDesc();
 }
 
-Allocator* ProcessState::GetCPUAllocator(int numa_node) {
+VisitableAllocator* ProcessState::GetCPUAllocator(int numa_node) {
   CHECK_GE(numa_node, 0);
   if (!numa_enabled_) numa_node = 0;
   mutex_lock lock(mu_);
diff --git a/tensorflow/core/common_runtime/process_state.h b/tensorflow/core/common_runtime/process_state.h
index 0f4ae230bb..2892677333 100644
--- a/tensorflow/core/common_runtime/process_state.h
+++ b/tensorflow/core/common_runtime/process_state.h
@@ -65,7 +65,7 @@ class ProcessState {
 
   // Returns the one CPUAllocator used for the given numa_node.
   // TEMPORARY: ignores numa_node.
-  Allocator* GetCPUAllocator(int numa_node);
+  VisitableAllocator* GetCPUAllocator(int numa_node);
 
   typedef std::unordered_map<const void*, MemDesc> MDMap;
 
@@ -87,7 +87,7 @@ class ProcessState {
 
   mutex mu_;
 
-  std::vector<Allocator*> cpu_allocators_ GUARDED_BY(mu_);
+  std::vector<VisitableAllocator*> cpu_allocators_ GUARDED_BY(mu_);
 
   virtual ~ProcessState();
 
diff --git a/tensorflow/core/common_runtime/ring_reducer.cc b/tensorflow/core/common_runtime/ring_reducer.cc
index c1e514d5ad..e26761703b 100644
--- a/tensorflow/core/common_runtime/ring_reducer.cc
+++ b/tensorflow/core/common_runtime/ring_reducer.cc
@@ -206,6 +206,9 @@ void RingReducer::ContinueAfterInputCopy() {
       group_size_tensor_ = group_size_val;
       group_size_tensor_ready_.Notify();
     }
+  } else {
+    // Value won't be used, so no need to initialize.
+    group_size_tensor_ready_.Notify();
   }
   Finish(RunAsyncParts());
 }
diff --git a/tensorflow/core/common_runtime/session.cc b/tensorflow/core/common_runtime/session.cc
index 4a9248171b..8c30beeec2 100644
--- a/tensorflow/core/common_runtime/session.cc
+++ b/tensorflow/core/common_runtime/session.cc
@@ -53,27 +53,33 @@ Status Session::PRun(const string& handle,
 
 Session* NewSession(const SessionOptions& options) {
   SessionFactory* factory;
-  const Status s = SessionFactory::GetFactory(options, &factory);
+  Status s = SessionFactory::GetFactory(options, &factory);
   if (!s.ok()) {
     LOG(ERROR) << s;
     return nullptr;
   }
-  return factory->NewSession(options);
+  Session* out_session;
+  s = NewSession(options, &out_session);
+  if (!s.ok()) {
+    LOG(ERROR) << "Failed to create session: " << s;
+    return nullptr;
+  }
+  return out_session;
 }
 
 Status NewSession(const SessionOptions& options, Session** out_session) {
   SessionFactory* factory;
-  const Status s = SessionFactory::GetFactory(options, &factory);
+  Status s = SessionFactory::GetFactory(options, &factory);
   if (!s.ok()) {
     *out_session = nullptr;
     LOG(ERROR) << s;
     return s;
   }
-  *out_session = factory->NewSession(options);
-  if (!*out_session) {
-    return errors::Internal("Failed to create session.");
+  s = factory->NewSession(options, out_session);
+  if (!s.ok()) {
+    *out_session = nullptr;
   }
-  return Status::OK();
+  return s;
 }
 
 Status Reset(const SessionOptions& options,
diff --git a/tensorflow/core/common_runtime/session_factory.h b/tensorflow/core/common_runtime/session_factory.h
index df3198a70d..81c172c6ae 100644
--- a/tensorflow/core/common_runtime/session_factory.h
+++ b/tensorflow/core/common_runtime/session_factory.h
@@ -30,7 +30,12 @@ struct SessionOptions;
 
 class SessionFactory {
  public:
-  virtual Session* NewSession(const SessionOptions& options) = 0;
+  // Creates a new session and stores it in *out_session, or fails with an error
+  // status if the Session could not be created. Caller takes ownership of
+  // *out_session if this returns Status::OK().
+  virtual Status NewSession(const SessionOptions& options,
+                            Session** out_session) = 0;
+
   virtual bool AcceptsOptions(const SessionOptions& options) = 0;
 
   // Abort and close all existing sessions, disconnecting their resources from
diff --git a/tensorflow/core/common_runtime/session_ref.cc b/tensorflow/core/common_runtime/session_ref.cc
new file mode 100644
index 0000000000..b931ef4229
--- /dev/null
+++ b/tensorflow/core/common_runtime/session_ref.cc
@@ -0,0 +1,170 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/common_runtime/session_ref.h"
+
+#include <utility>
+
+namespace tensorflow {
+
+namespace {
+
+// Scope helper to track active calls and manage session lifetime.
+struct RunCounter {
+  std::shared_ptr<Session> session;
+  uint64* value;
+  mutex* m;
+  condition_variable* cv;
+
+  explicit RunCounter(std::shared_ptr<Session> s, uint64* v, mutex* m,
+                      condition_variable* cv)
+      : session(std::move(s)), value(v), m(m), cv(cv) {
+    mutex_lock l(*m);
+    ++*value;
+  }
+
+  ~RunCounter() {
+    mutex_lock l(*m);
+    if (--*value == 0) {
+      cv->notify_all();
+    }
+  }
+};
+
+}  // namespace
+
+Status SessionRef::CheckNotClosed() {
+  mutex_lock l(run_lock_);
+  if (session_ == nullptr) return errors::Cancelled("Session has been closed.");
+  return ::tensorflow::Status::OK();
+}
+
+Status SessionRef::Run(const RunOptions& run_options,
+                       const std::vector<std::pair<string, Tensor> >& inputs,
+                       const std::vector<string>& output_tensor_names,
+                       const std::vector<string>& target_node_names,
+                       std::vector<Tensor>* outputs,
+                       RunMetadata* run_metadata) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->Run(run_options, inputs, output_tensor_names,
+                         target_node_names, outputs, run_metadata);
+}
+
+Status SessionRef::Create(const GraphDef& graph) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->Create(graph);
+}
+
+Status SessionRef::Create(const RunOptions& run_options,
+                          const GraphDef& graph) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->Create(run_options, graph);
+}
+
+Status SessionRef::Extend(const RunOptions& run_options,
+                          const GraphDef& graph) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->Extend(run_options, graph);
+}
+
+Status SessionRef::Extend(const GraphDef& graph) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->Extend(graph);
+}
+
+Status SessionRef::Close(const RunOptions& run_options) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  mutex_lock l(run_lock_);
+  Status status = session_->Close(run_options);
+  session_.reset();
+  while (run_count_ > 0) {
+    run_finished_.wait(l);
+  }
+  return status;
+}
+
+Status SessionRef::Close() {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  mutex_lock l(run_lock_);
+  Status status = session_->Close();
+  session_.reset();
+  while (run_count_ > 0) {
+    run_finished_.wait(l);
+  }
+  return status;
+}
+
+Status SessionRef::Run(const std::vector<std::pair<string, Tensor> >& inputs,
+                       const std::vector<string>& output_tensor_names,
+                       const std::vector<string>& target_node_names,
+                       std::vector<Tensor>* outputs) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->Run(inputs, output_tensor_names, target_node_names,
+                         outputs);
+}
+
+Status SessionRef::ListDevices(std::vector<DeviceAttributes>* response) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->ListDevices(response);
+}
+
+Status SessionRef::PRunSetup(const std::vector<string>& input_names,
+                             const std::vector<string>& output_names,
+                             const std::vector<string>& target_nodes,
+                             string* handle) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->PRunSetup(input_names, output_names, target_nodes, handle);
+}
+
+Status SessionRef::PRun(const string& handle,
+                        const std::vector<std::pair<string, Tensor> >& inputs,
+                        const std::vector<string>& output_names,
+                        std::vector<Tensor>* outputs) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->PRun(handle, inputs, output_names, outputs);
+}
+
+Status SessionRef::MakeCallable(const CallableOptions& callable_options,
+                                CallableHandle* out_handle) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->MakeCallable(callable_options, out_handle);
+}
+
+Status SessionRef::RunCallable(CallableHandle handle,
+                               const std::vector<Tensor>& feed_tensors,
+                               std::vector<Tensor>* fetch_tensors,
+                               RunMetadata* run_metadata) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->RunCallable(handle, feed_tensors, fetch_tensors,
+                                 run_metadata);
+}
+
+Status SessionRef::ReleaseCallable(CallableHandle handle) {
+  TF_RETURN_IF_ERROR(CheckNotClosed());
+  RunCounter rc(session_, &run_count_, &run_lock_, &run_finished_);
+  return rc.session->ReleaseCallable(handle);
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/session_ref.h b/tensorflow/core/common_runtime/session_ref.h
new file mode 100644
index 0000000000..9459e7edbe
--- /dev/null
+++ b/tensorflow/core/common_runtime/session_ref.h
@@ -0,0 +1,86 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_SESSION_REF_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_SESSION_REF_H_
+
+#include <memory>
+
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/public/session.h"
+
+namespace tensorflow {
+
+// A `SessionRef` manages the lifetime of a wrapped `Session` pointer.
+//
+// SessionRef blocks the return of Close() until all pending operations have
+// been completed or cancelled and underlying session has been freed.  Any
+// subsequent operations on the SessionRef object will return errors::Cancelled.
+class SessionRef : public Session {
+ public:
+  SessionRef(Session* session) : session_(session) {}
+  virtual ~SessionRef() {}
+
+  Status Create(const GraphDef& graph) override;
+  Status Extend(const GraphDef& graph) override;
+  Status Create(const RunOptions& run_options, const GraphDef& graph) override;
+  Status Extend(const RunOptions& run_options, const GraphDef& graph) override;
+  Status Run(const std::vector<std::pair<string, Tensor> >& inputs,
+             const std::vector<string>& output_tensor_names,
+             const std::vector<string>& target_node_names,
+             std::vector<Tensor>* outputs) override;
+
+  Status ListDevices(std::vector<DeviceAttributes>* response) override;
+
+  Status Close() override;
+  Status Close(const RunOptions& run_options) override;
+
+  Status Run(const RunOptions& run_options,
+             const std::vector<std::pair<string, Tensor> >& inputs,
+             const std::vector<string>& output_tensor_names,
+             const std::vector<string>& target_node_names,
+             std::vector<Tensor>* outputs, RunMetadata* run_metadata) override;
+
+  Status PRunSetup(const std::vector<string>& input_names,
+                   const std::vector<string>& output_names,
+                   const std::vector<string>& target_nodes,
+                   string* handle) override;
+
+  Status PRun(const string& handle,
+              const std::vector<std::pair<string, Tensor> >& inputs,
+              const std::vector<string>& output_names,
+              std::vector<Tensor>* outputs) override;
+
+  Status MakeCallable(const CallableOptions& callable_options,
+                      CallableHandle* out_handle) override;
+
+  Status RunCallable(CallableHandle handle,
+                     const std::vector<Tensor>& feed_tensors,
+                     std::vector<Tensor>* fetch_tensors,
+                     RunMetadata* run_metadata) override;
+
+  Status ReleaseCallable(CallableHandle handle) override;
+
+ private:
+  mutex run_lock_;
+  condition_variable run_finished_;
+  uint64 run_count_ GUARDED_BY(run_lock_) = {0};
+  std::shared_ptr<Session> session_;
+
+  Status CheckNotClosed();
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_COMMON_RUNTIME_SESSION_REF_H_
diff --git a/tensorflow/core/common_runtime/session_test.cc b/tensorflow/core/common_runtime/session_test.cc
index feaf29c7bb..1fa5aad60c 100644
--- a/tensorflow/core/common_runtime/session_test.cc
+++ b/tensorflow/core/common_runtime/session_test.cc
@@ -47,8 +47,10 @@ class FakeSessionFactory : public SessionFactory {
     return str_util::StartsWith(options.target, "fake");
   }
 
-  Session* NewSession(const SessionOptions& options) override {
-    return nullptr;
+  Status NewSession(const SessionOptions& options,
+                    Session** out_session) override {
+    *out_session = nullptr;
+    return Status::OK();
   }
 };
 class FakeSessionRegistrar {
diff --git a/tensorflow/core/common_runtime/threadpool_device.cc b/tensorflow/core/common_runtime/threadpool_device.cc
index 74a87215e1..7406ecf4f8 100644
--- a/tensorflow/core/common_runtime/threadpool_device.cc
+++ b/tensorflow/core/common_runtime/threadpool_device.cc
@@ -111,7 +111,21 @@ Status ThreadPoolDevice::MakeTensorFromProto(
 }
 
 #ifdef INTEL_MKL
-REGISTER_MEM_ALLOCATOR("MklCPUAllocator", 200, MklCPUAllocator);
+namespace {
+class MklCPUAllocatorFactory : public AllocatorFactory {
+ public:
+  bool NumaEnabled() override { return false; }
+
+  Allocator* CreateAllocator() override { return new MklCPUAllocator; }
+
+  // Note: Ignores numa_node, for now.
+  virtual SubAllocator* CreateSubAllocator(int numa_node) {
+    return new MklSubAllocator;
+  }
+};
+
+REGISTER_MEM_ALLOCATOR("MklCPUAllocator", 200, MklCPUAllocatorFactory);
+}  // namespace
 #endif
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/debug/BUILD b/tensorflow/core/debug/BUILD
index 36e9b3455a..591c22b8f6 100644
--- a/tensorflow/core/debug/BUILD
+++ b/tensorflow/core/debug/BUILD
@@ -82,25 +82,6 @@ cc_library(
 )
 
 tf_cuda_library(
-    name = "debug_gateway_internal",
-    srcs = ["debug_gateway.cc"],
-    hdrs = ["debug_gateway.h"],
-    copts = tf_copts(),
-    linkstatic = 1,
-    deps = [
-        ":debug",
-        "//tensorflow/core:core_cpu_internal",
-        "//tensorflow/core:direct_session_internal",
-        "//tensorflow/core:framework",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
-        "//tensorflow/core:proto_text",
-        "//tensorflow/core:protos_all_cc",
-    ],
-    alwayslink = 1,
-)
-
-tf_cuda_library(
     name = "debugger_state_impl",
     srcs = ["debugger_state_impl.cc"],
     hdrs = ["debugger_state_impl.h"],
@@ -187,42 +168,6 @@ tf_cuda_library(
     ],
 )
 
-# TODO(cais): Fix flakiness on GPU and change this back to a tf_cc_test_gpu.
-#   See b/34081273.
-tf_cc_test(
-    name = "debug_gateway_test",
-    size = "small",
-    srcs = ["debug_gateway_test.cc"],
-    args = ["--heap_check=local"],
-    linkstatic = tf_kernel_tests_linkstatic(),
-    tags = [
-        "no_cuda_on_cpu_tap",
-        "no_gpu",
-    ],
-    deps = [
-        ":debug",
-        ":debug_gateway_internal",
-        ":debug_graph_utils",
-        "//tensorflow/cc:cc_ops",
-        "//tensorflow/core:all_kernels",
-        "//tensorflow/core:core_cpu",
-        "//tensorflow/core:core_cpu_internal",
-        "//tensorflow/core:direct_session",
-        "//tensorflow/core:direct_session_internal",
-        "//tensorflow/core:framework",
-        "//tensorflow/core:framework_internal",
-        "//tensorflow/core:gpu_runtime",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
-        "//tensorflow/core:protos_all_cc",
-        "//tensorflow/core:test",
-        "//tensorflow/core:test_main",
-        "//tensorflow/core:testlib",
-        "//tensorflow/core/kernels:debug_ops",
-        "//tensorflow/core/kernels:ops_util",
-    ],
-)
-
 tf_cc_test(
     name = "debug_io_utils_test",
     size = "small",
diff --git a/tensorflow/core/debug/debug_gateway.cc b/tensorflow/core/debug/debug_gateway.cc
deleted file mode 100644
index 2e1aabd1cc..0000000000
--- a/tensorflow/core/debug/debug_gateway.cc
+++ /dev/null
@@ -1,122 +0,0 @@
-/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/core/debug/debug_gateway.h"
-
-#include <utility>
-
-#include "tensorflow/core/common_runtime/device_factory.h"
-#include "tensorflow/core/common_runtime/session_factory.h"
-#include "tensorflow/core/framework/tensor.h"
-
-namespace tensorflow {
-
-DebugGateway::DebugGateway(DirectSession* session) : session_(session) {
-  session_->node_outputs_callback_ =
-      [this](const string& node_name, const int output_slot,
-             const Tensor* tensor, const bool is_ref, OpKernelContext* ctx) {
-        if (comp_cb_ != nullptr && output_slot <= 0) {
-          // The node completion callback is invoked once for a node regardless
-          // of whether the node has zero, one or more outputs.
-          // The output_slot can be negative (-1, or kControlSlot) if
-          // node_outputs_callback_ is invoked for a node with no output. If
-          // that is the case, notify the callback that the node in question has
-          // no output.
-          comp_cb_(node_name, output_slot == 0);
-        }
-
-        // Copy tensor values (e.g., from GPU to host) only if the
-        // value callback is not nullptr.
-        if (val_cb_ != nullptr && output_slot >= 0) {
-          CopyTensor(node_name, output_slot, tensor, ctx,
-                     [this, node_name, output_slot,
-                      is_ref](const Tensor* copied_tensor) {
-                       val_cb_(node_name, output_slot, *copied_tensor, is_ref);
-                     });
-        }
-
-        return Status::OK();
-      };
-}
-
-DebugGateway::~DebugGateway() {
-  if (session_ != nullptr) {
-    session_->node_outputs_callback_ = nullptr;
-  }
-}
-
-void DebugGateway::SetNodeCompletionCallback(NodeCompletionCallback callback) {
-  comp_cb_ = std::move(callback);
-}
-
-void DebugGateway::SetNodeValueCallback(NodeValueCallback callback) {
-  val_cb_ = std::move(callback);
-}
-
-void DebugGateway::CopyTensor(const string& node_name, const int output_slot,
-                              const Tensor* src_tensor, OpKernelContext* ctx,
-                              CopyDoneCallback copy_done_cb) {
-  Device* device = static_cast<Device*>(ctx->device());
-
-  // Determine if the tensor is initialized properly.
-  // The second part of the check is necessary because in some cases, a
-  // tensor can pass the IsInitialized() check, but the dtype is not set,
-  // e.g., tf.FIFOQueue.
-  if (src_tensor->IsInitialized() && DataTypeSize(src_tensor->dtype()) > 0) {
-    // Tensor is initialized.
-
-    string tensor_tag = strings::StrCat(node_name, ":", output_slot);
-
-    // Create copied tensor on host
-    Allocator* cpu_allocator = tensorflow::cpu_allocator();
-    Tensor cpu_tensor(cpu_allocator, src_tensor->dtype(), src_tensor->shape());
-
-    // Determine if the tensor is on device (GPU) or host (CPU).
-    // The second part of the check is necessary because even an OpKernel on
-    // may have output tensors allocated on CPU.
-    if ((device->name().find("GPU:") != string::npos ||
-         device->name().find("SYCL:") != string::npos) &&
-        !ctx->output_alloc_attr(output_slot).on_host()) {
-      // GPU tensors: Copy it to host (CPU).
-      DeviceContext* device_ctxt = ctx->op_device_context();
-
-      // Copy device (e.g., GPU) tensor to host and when done, invoke the
-      // callback.
-      device_ctxt->CopyDeviceTensorToCPU(
-          src_tensor, "TensorCopy", device, &cpu_tensor,
-          [node_name, cpu_tensor, copy_done_cb](const Status& s) {
-            if (s.ok()) {
-              copy_done_cb(&cpu_tensor);
-            } else {
-              LOG(ERROR) << "Copying of device Tensor " << node_name
-                         << " to CPU for debugging failed.";
-            }
-          });
-    } else {
-      // For CPU tensors, copy the source tensor and own the copy, because the
-      // value callback may outlive the life time of the tensor and the tensor
-      // may shared the underlying buffer with other tensors.
-      cpu_tensor.UnsafeCopyFromInternal(*src_tensor, src_tensor->dtype(),
-                                        src_tensor->shape());
-
-      copy_done_cb(&cpu_tensor);
-    }
-  } else {
-    // Tensor is not initialized: No need to copy.
-    copy_done_cb(src_tensor);
-  }
-}
-
-}  // namespace tensorflow
diff --git a/tensorflow/core/debug/debug_gateway.h b/tensorflow/core/debug/debug_gateway.h
deleted file mode 100644
index bf5b6e08db..0000000000
--- a/tensorflow/core/debug/debug_gateway.h
+++ /dev/null
@@ -1,83 +0,0 @@
-/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_DEBUG_DEBUG_SESSION_H_
-#define TENSORFLOW_DEBUG_DEBUG_SESSION_H_
-
-#include <unordered_map>
-
-#include "tensorflow/core/common_runtime/direct_session.h"
-#include "tensorflow/core/common_runtime/executor.h"
-
-namespace tensorflow {
-
-// Experimental. tfdb (TensorFlow Debugger): Gateway to intermediate node
-// outputs during Session Run calls. Currently limited to DirectSession.
-class DebugGateway {
- public:
-  DebugGateway(DirectSession* session);
-  virtual ~DebugGateway();
-
-  // Callback for node completion. This callback is invoked only once for
-  // a node regardless of whether it has one or more outputs. The value(s) of
-  // the output tensor(s) are not necessarily available when this callback is
-  // invoked. They may need to be asynchronously copied from device (e.g.,
-  // GPU) to host, hence the need for the NodeValueCallback below.
-  //
-  // Args:
-  //   node_name: Name of the node that has just completed execution
-  //   any_output: Whether the node has any output(s)
-  typedef std::function<void(const string& node_name, const bool any_output)>
-      NodeCompletionCallback;
-  void SetNodeCompletionCallback(NodeCompletionCallback callback);
-
-  // Callback for node value. This is invoked when the value of a node's
-  // output tensor is available on the host, possibly after copying from
-  // a device (e.g., GPU).
-  //
-  // Args:
-  //   node_name: Name of the node of which the output has become available
-  //   output_slot: Output slot number of the output Tensor
-  //   tensor_value: Reference to the tensor value
-  //   is_ref: Whether the output of the reference type
-  typedef std::function<void(const string& node_name, const int output_slot,
-                             const Tensor& tensor_value, const bool is_ref)>
-      NodeValueCallback;
-  void SetNodeValueCallback(NodeValueCallback callback);
-
-  // TODO(cais): Add whitelists for ops/tensors (e.g., {"A:0", "B:0"})
-  // for node completion callback (whitelist_comp_) and node value callback
-  // (whitelist_val_). If whitelist_comp_ is non-empty, the gateway will
-  // invoke the NodeCompletionCallback only for the nodes specified in the
-  // whitelist. And so forth for whitelist_val_.
-
- private:
-  DirectSession* session_;
-  // TODO(cais): DebugGateway currently supports only DirectSession. Add
-  // support for GrpcSession.
-
-  NodeCompletionCallback comp_cb_ = nullptr;
-  NodeValueCallback val_cb_ = nullptr;
-
-  typedef std::function<void(const Tensor* dst_tensor)> CopyDoneCallback;
-
-  void CopyTensor(const string& node_name, const int output_slot,
-                  const Tensor* src_tensor, OpKernelContext* ctx,
-                  CopyDoneCallback copy_done_cb);
-};
-
-}  // end namespace tensorflow
-
-#endif  // TENSORFLOW_DEBUG_DEBUG_SESSION_H_
diff --git a/tensorflow/core/debug/debug_gateway_test.cc b/tensorflow/core/debug/debug_gateway_test.cc
deleted file mode 100644
index b1bbd3f698..0000000000
--- a/tensorflow/core/debug/debug_gateway_test.cc
+++ /dev/null
@@ -1,1011 +0,0 @@
-/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/core/debug/debug_gateway.h"
-
-#include <algorithm>
-#include <cstdlib>
-#include <memory>
-#include <unordered_map>
-
-#include "tensorflow/core/debug/debug_graph_utils.h"
-#include "tensorflow/core/framework/tensor_testutil.h"
-#include "tensorflow/core/graph/testlib.h"
-#include "tensorflow/core/lib/core/notification.h"
-#include "tensorflow/core/lib/core/status_test_util.h"
-#include "tensorflow/core/lib/core/threadpool.h"
-#include "tensorflow/core/protobuf/rewriter_config.pb.h"
-
-namespace tensorflow {
-namespace {
-
-std::unique_ptr<DirectSession> CreateSession() {
-  SessionOptions options;
-  // Turn off graph optimizer so we can observe intermediate node states.
-  options.config.mutable_graph_options()
-      ->mutable_optimizer_options()
-      ->set_opt_level(OptimizerOptions_Level_L0);
-  options.config.mutable_graph_options()
-      ->mutable_rewrite_options()
-      ->set_constant_folding(RewriterConfig::OFF);
-  options.config.mutable_graph_options()
-      ->mutable_rewrite_options()
-      ->set_dependency_optimization(RewriterConfig::OFF);
-
-  return std::unique_ptr<DirectSession>(
-      dynamic_cast<DirectSession*>(NewSession(options)));
-}
-
-class SessionDebugMinusAXTest : public ::testing::Test {
- public:
-  void Initialize(std::initializer_list<float> a_values) {
-    Graph graph(OpRegistry::Global());
-
-#if GOOGLE_CUDA
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:GPU:0";
-#elif defined(TENSORFLOW_USE_SYCL)
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
-#else
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:CPU:0";
-#endif
-
-    Tensor a_tensor(DT_FLOAT, TensorShape({2, 2}));
-    test::FillValues<float>(&a_tensor, a_values);
-    Node* a = test::graph::Constant(&graph, a_tensor);
-    a->set_assigned_device_name(kDeviceName);
-    a_ = a->name();
-
-    Tensor x_tensor(DT_FLOAT, TensorShape({2, 1}));
-    test::FillValues<float>(&x_tensor, {1, 1});
-    Node* x = test::graph::Constant(&graph, x_tensor);
-    x->set_assigned_device_name(kDeviceName);
-    x_ = x->name();
-
-    // y = A * x
-    Node* y = test::graph::Matmul(&graph, a, x, false, false);
-    y->set_assigned_device_name(kDeviceName);
-    y_ = y->name();
-
-    Node* y_neg = test::graph::Unary(&graph, "Neg", y);
-    y_neg_ = y_neg->name();
-    y_neg->set_assigned_device_name(kDeviceName);
-
-    test::graph::ToGraphDef(&graph, &def_);
-  }
-
-  string a_;
-  string x_;
-  string y_;
-  string y_neg_;
-  GraphDef def_;
-};
-
-TEST_F(SessionDebugMinusAXTest, RunSimpleNetwork) {
-  Initialize({3, 2, -1, 0});
-  auto session = CreateSession();
-  ASSERT_TRUE(session != nullptr);
-
-  DebugGateway debug_gateway(session.get());
-
-  // Supply completion and value callbacks
-  mutex mu;
-  // Completed nodes with and without outputs
-  std::vector<string> completed_nodes_w_outputs;
-  std::vector<string> completed_nodes_wo_outputs;
-
-  Notification callbacks_done;
-  debug_gateway.SetNodeCompletionCallback(
-      [&mu, &completed_nodes_w_outputs, &completed_nodes_wo_outputs](
-          const string& node_name, const bool any_output) {
-        mutex_lock l(mu);
-        if (any_output) {
-          completed_nodes_w_outputs.push_back(node_name);
-        } else {
-          completed_nodes_wo_outputs.push_back(node_name);
-        }
-      });
-
-  std::vector<bool> tensors_initialized;
-  std::unordered_map<string, Tensor> tensor_vals;
-  // output_slot values recorded in value callbacks
-  std::vector<int> output_slots_val;
-  // is_ref values recorded in value callbacks
-  std::vector<bool> is_refs_val;
-
-  debug_gateway.SetNodeValueCallback(
-      [this, &mu, &tensors_initialized, &tensor_vals, &output_slots_val,
-       &is_refs_val,
-       &callbacks_done](const string& node_name, const int output_slot,
-                        const Tensor& tensor_value, const bool is_ref) {
-        mutex_lock l(mu);
-        tensors_initialized.push_back(tensor_value.IsInitialized());
-        tensor_vals.insert(std::make_pair(node_name, tensor_value));
-        output_slots_val.push_back(output_slot);
-        is_refs_val.push_back(is_ref);
-
-        // Set the notification once we have the value from the target node.
-        if (node_name == y_neg_ && !callbacks_done.HasBeenNotified()) {
-          callbacks_done.Notify();
-        }
-      });
-
-  TF_ASSERT_OK(session->Create(def_));
-
-  std::vector<std::pair<string, Tensor>> inputs;
-
-  // Request two targets: one fetch output and one non-fetched output.
-  std::vector<string> output_names = {y_ + ":0"};
-  std::vector<string> target_nodes = {y_neg_};
-  std::vector<Tensor> outputs;
-  Status s = session->Run(inputs, output_names, target_nodes, &outputs);
-  TF_ASSERT_OK(s);
-
-  // Wait for callbacks to complete.
-  callbacks_done.WaitForNotification();
-
-  ASSERT_EQ(1, outputs.size());
-  // The first output should be initialized and have the correct
-  // output.
-  auto mat = outputs[0].matrix<float>();
-  ASSERT_TRUE(outputs[0].IsInitialized());
-  EXPECT_FLOAT_EQ(5.0, mat(0, 0));
-
-  // Verify the calling history of the completion callback
-  // The following verifies each node with output(s) invoked the callback
-  // exactly once.
-  ASSERT_GE(completed_nodes_w_outputs.size(), 4);  // There may be added nodes.
-
-  ASSERT_EQ(1, std::count(completed_nodes_w_outputs.begin(),
-                          completed_nodes_w_outputs.end(), a_));
-  ASSERT_EQ(1, std::count(completed_nodes_w_outputs.begin(),
-                          completed_nodes_w_outputs.end(), x_));
-  ASSERT_EQ(1, std::count(completed_nodes_w_outputs.begin(),
-                          completed_nodes_w_outputs.end(), y_));
-  ASSERT_EQ(1, std::count(completed_nodes_w_outputs.begin(),
-                          completed_nodes_w_outputs.end(), y_neg_));
-
-  // Apart from nodes with outputs, there are also no-output (control) nodes.
-  // They ought to be captured by the DebugGateway through
-  // NodeOutputCallback as well.
-  ASSERT_GT(completed_nodes_wo_outputs.size(), 0);
-
-  // The DebugGateway should have captured the _SOURCE node.
-  ASSERT_LE(1, std::count(completed_nodes_wo_outputs.begin(),
-                          completed_nodes_wo_outputs.end(), "_SOURCE"));
-
-  // Verify the calling history of the value callabck
-  ASSERT_EQ(completed_nodes_w_outputs.size(), tensors_initialized.size());
-
-  // In this graph, there is no uninitialized node value.
-  ASSERT_EQ(
-      tensors_initialized.end(),
-      std::find(tensors_initialized.begin(), tensors_initialized.end(), false));
-
-  ASSERT_EQ(completed_nodes_w_outputs.size(), tensor_vals.size());
-  ASSERT_EQ(completed_nodes_w_outputs.size(), output_slots_val.size());
-  ASSERT_EQ(completed_nodes_w_outputs.size(), is_refs_val.size());
-
-  // Verify the intermediate tensor values captured through the value callback
-  auto mat_a = tensor_vals[a_].matrix<float>();
-  ASSERT_EQ(3.0, mat_a(0, 0));
-  ASSERT_EQ(2.0, mat_a(0, 1));
-  ASSERT_EQ(-1.0, mat_a(1, 0));
-  ASSERT_EQ(0.0, mat_a(1, 1));
-
-  auto mat_x = tensor_vals[x_].matrix<float>();
-  ASSERT_EQ(1.0, mat_x(0, 0));
-  ASSERT_EQ(1.0, mat_x(1, 0));
-
-  auto mat_y = tensor_vals[y_].matrix<float>();
-  ASSERT_EQ(5.0, mat_y(0, 0));
-  ASSERT_EQ(-1.0, mat_y(1, 0));
-
-  auto mat_y_neg = tensor_vals[y_neg_].matrix<float>();
-  ASSERT_EQ(-5.0, mat_y_neg(0, 0));
-  ASSERT_EQ(1.0, mat_y_neg(1, 0));
-
-  // In this graph, all outputs are on the first slot
-  ASSERT_EQ(output_slots_val.size(),
-            std::count_if(output_slots_val.begin(), output_slots_val.end(),
-                          [](int slot) { return slot == 0; }));
-
-  // In this graph, there is no ref-type tensor.
-  ASSERT_EQ(is_refs_val.end(),
-            std::find(is_refs_val.begin(), is_refs_val.end(), true));
-}
-
-TEST_F(SessionDebugMinusAXTest, RunSimpleNetworkWithTwoDebugNodesInserted) {
-  // Tensor contains one count of NaN
-  Initialize({3, std::numeric_limits<float>::quiet_NaN(), -1, 0});
-  auto session = CreateSession();
-  ASSERT_TRUE(session != nullptr);
-
-  DebugGateway debug_gateway(session.get());
-
-  // Create debug tensor watch options with two debug ops:
-  // DebugIdentity and DebugNanCount
-  RunOptions run_opts;
-  run_opts.set_output_partition_graphs(true);
-
-  const string debug_identity = "DebugIdentity";
-  const string debug_nan_count = "DebugNanCount";
-  DebugTensorWatch* tensor_watch_opts =
-      run_opts.mutable_debug_options()->add_debug_tensor_watch_opts();
-  tensor_watch_opts->set_node_name(y_);
-  tensor_watch_opts->set_output_slot(0);
-  tensor_watch_opts->add_debug_ops(debug_identity);
-  tensor_watch_opts->add_debug_ops(debug_nan_count);
-
-  // Expected name of the inserted debug node
-  string debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(y_, ":", 0), 0, debug_identity);
-  string debug_nan_count_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(y_, ":", 0), 1, debug_nan_count);
-
-  // Supply completion and value callbacks
-  mutex mu;
-  // Completed nodes with and without outputs
-  std::vector<string> completed_debug_nodes;
-
-  Notification callbacks_done;
-  debug_gateway.SetNodeCompletionCallback(
-      [&mu, &debug_identity_node_name, &debug_nan_count_node_name,
-       &completed_debug_nodes](const string& node_name, const bool any_output) {
-        mutex_lock l(mu);
-        if (any_output && (node_name == debug_identity_node_name ||
-                           node_name == debug_nan_count_node_name)) {
-          completed_debug_nodes.push_back(node_name);
-        }
-      });
-
-  std::vector<Tensor> watched_tensor_vals;
-  std::vector<Tensor> debug_identity_tensor_vals;
-  std::vector<Tensor> debug_nan_count_tensor_vals;
-
-  debug_gateway.SetNodeValueCallback(
-      [this, &mu, &debug_identity_node_name, &debug_nan_count_node_name,
-       &watched_tensor_vals, &debug_identity_tensor_vals,
-       &debug_nan_count_tensor_vals,
-       &callbacks_done](const string& node_name, const int output_slot,
-                        const Tensor& tensor_value, const bool is_ref) {
-        mutex_lock l(mu);
-        if (node_name == y_) {
-          watched_tensor_vals.push_back(tensor_value);
-        } else if (node_name == debug_identity_node_name && output_slot == 0) {
-          // output_slot == 0 carries the debug signal. Same below.
-          debug_identity_tensor_vals.push_back(tensor_value);
-        } else if (node_name == debug_nan_count_node_name && output_slot == 0) {
-          debug_nan_count_tensor_vals.push_back(tensor_value);
-        }
-
-        // Set the notification once we have the value from the target node.
-        if (node_name == y_neg_ && !callbacks_done.HasBeenNotified()) {
-          callbacks_done.Notify();
-        }
-      });
-
-  TF_ASSERT_OK(session->Create(def_));
-
-  std::vector<std::pair<string, Tensor>> inputs;
-
-  // Request two targets: one fetch output and one non-fetched output.
-  std::vector<string> output_names = {y_ + ":0"};
-  std::vector<string> target_nodes = {y_neg_};
-  std::vector<Tensor> outputs;
-
-  RunMetadata run_metadata;
-  Status s = session->Run(run_opts, inputs, output_names, target_nodes,
-                          &outputs, &run_metadata);
-  TF_ASSERT_OK(s);
-
-// Verify the correct number of partition graphs (GraphDefs) outputted
-// through RunMetadata, given whether GPU is involved.
-#if GOOGLE_CUDA
-  ASSERT_EQ(2, run_metadata.partition_graphs().size());
-#elif defined(TENSORFLOW_USE_SYCL)
-  ASSERT_EQ(2, run_metadata.partition_graphs().size());
-#else
-  ASSERT_EQ(1, run_metadata.partition_graphs().size());
-#endif
-
-  // Wait for callbacks to complete.
-  callbacks_done.WaitForNotification();
-
-  // Verify that each of the two debug nodes has completed exactly once.
-  ASSERT_EQ(2, completed_debug_nodes.size());
-  ASSERT_EQ(
-      1, std::count(completed_debug_nodes.begin(), completed_debug_nodes.end(),
-                    debug_identity_node_name));
-  ASSERT_EQ(
-      1, std::count(completed_debug_nodes.begin(), completed_debug_nodes.end(),
-                    debug_nan_count_node_name));
-
-  // Verify that the tensor values from the watched node and the identity
-  // debug node are received and they are equal (owing to the debug op being
-  // "DebugIdentity")
-  ASSERT_EQ(1, watched_tensor_vals.size());
-  ASSERT_EQ(1, debug_identity_tensor_vals.size());
-  auto mat_y = watched_tensor_vals[0].matrix<float>();
-  auto mat_identity = debug_identity_tensor_vals[0].matrix<float>();
-  // ASSERT_EQ doesn't work for nan == nan
-  ASSERT_TRUE(std::isnan(mat_y(0, 0)));
-  ASSERT_TRUE(std::isnan(mat_identity(0, 0)));
-  ASSERT_EQ(-1, mat_identity(1, 0));
-
-  // Verify that the output from the NaN-count debug node indicates exactly
-  // one NaN.
-  ASSERT_EQ(1, debug_nan_count_tensor_vals.size());
-  ASSERT_EQ(1, debug_nan_count_tensor_vals[0].scalar<int64>()());
-}
-
-#if !defined(GOOGLE_CUDA) && !defined(TENSORFLOW_USE_SYCL)
-// TODO(cais): Reinstate the following test for concurrent debugged runs on
-//   a GPU once the root cause of the ~0.5% flakiness has been addressed.
-//   (b/34081273)
-TEST_F(SessionDebugMinusAXTest,
-       RunSimpleNetworkConcurrentlyWithDifferentDebugTensorWatches) {
-  // Test concurrent Run() calls on a graph with different debug watches.
-
-  Initialize({3, 2, -1, 0});
-  auto session = CreateSession();
-  ASSERT_TRUE(session != nullptr);
-  TF_ASSERT_OK(session->Create(def_));
-
-  // Number of concurrent Run() calls to launch.
-  const int kConcurrentRuns = 3;
-  thread::ThreadPool* tp =
-      new thread::ThreadPool(Env::Default(), "test", kConcurrentRuns);
-
-  std::vector<string> output_names = {y_ + ":0"};
-  std::vector<string> target_nodes = {y_neg_};
-
-  mutex mu;
-  DebugGateway debug_gateway(session.get());
-  std::unordered_map<string, Tensor> debug_identity_tensor_vals;
-
-  const string debug_identity = "DebugIdentity";
-
-  const string a_debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(a_, ":", 0), 0, debug_identity);
-  const string x_debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(x_, ":", 0), 0, debug_identity);
-  const string y_debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(y_, ":", 0), 0, debug_identity);
-
-  Notification callbacks_done;
-  volatile int val_callback_count = 0;
-
-  debug_gateway.SetNodeValueCallback(
-      [this, &mu, &val_callback_count, &a_debug_identity_node_name,
-       &x_debug_identity_node_name, &y_debug_identity_node_name,
-       &debug_identity_tensor_vals, &callbacks_done,
-       &kConcurrentRuns](const string& node_name, const int output_slot,
-                         const Tensor& tensor_value, const bool is_ref) {
-        mutex_lock l(mu);
-
-        if (node_name == a_debug_identity_node_name && output_slot == 0) {
-          debug_identity_tensor_vals["a"] = tensor_value;
-          val_callback_count++;
-        } else if (node_name == x_debug_identity_node_name &&
-                   output_slot == 0) {
-          // output_slot == 0 carries the debug signal.
-          debug_identity_tensor_vals["x"] = tensor_value;
-          val_callback_count++;
-        } else if (node_name == y_debug_identity_node_name &&
-                   output_slot == 0) {
-          debug_identity_tensor_vals["y"] = tensor_value;
-          val_callback_count++;
-        }
-
-        // Set the notification once we have the value from the callbacks from
-        // all the concurrent Run() calls.
-        if (val_callback_count == kConcurrentRuns &&
-            !callbacks_done.HasBeenNotified()) {
-          callbacks_done.Notify();
-        }
-      });
-
-  int run_counter = 0;
-  mutex run_lock;
-
-  // Function to be executed concurrently.
-  auto fn = [this, &run_lock, &run_counter, &session, output_names,
-             target_nodes, &debug_identity]() {
-    // Create unique debug tensor watch options for each of the concurrent
-    // run calls.
-    RunOptions run_opts;
-    run_opts.set_output_partition_graphs(true);
-
-    DebugTensorWatch* tensor_watch_opts =
-        run_opts.mutable_debug_options()->add_debug_tensor_watch_opts();
-    tensor_watch_opts->set_output_slot(0);
-    tensor_watch_opts->add_debug_ops(debug_identity);
-
-    {
-      // Let the concurrent runs watch different tensors.
-
-      mutex_lock l(run_lock);
-
-      if (run_counter == 0) {
-        // Let the 1st concurrent run watch a.
-        tensor_watch_opts->set_node_name(a_);
-      } else if (run_counter == 1) {
-        // Let the 2nd concurrent watch x.
-        tensor_watch_opts->set_node_name(x_);
-      } else if (run_counter == 2) {
-        // Let the 3rd concurrent watch y.
-        tensor_watch_opts->set_node_name(y_);
-      }
-
-      run_counter++;
-    }
-
-    // Run the graph.
-    RunMetadata run_metadata;
-    std::vector<std::pair<string, Tensor>> inputs;
-    std::vector<Tensor> outputs;
-    Status s = session->Run(run_opts, inputs, output_names, target_nodes,
-                            &outputs, &run_metadata);
-    TF_ASSERT_OK(s);
-
-    ASSERT_EQ(1, run_metadata.partition_graphs().size());
-
-    ASSERT_EQ(1, outputs.size());
-    ASSERT_TRUE(outputs[0].IsInitialized());
-    ASSERT_EQ(TensorShape({2, 1}), outputs[0].shape());
-    auto mat = outputs[0].matrix<float>();
-    EXPECT_FLOAT_EQ(5.0, mat(0, 0));
-    EXPECT_FLOAT_EQ(-1.0, mat(1, 0));
-  };
-
-  for (int i = 0; i < kConcurrentRuns; ++i) {
-    tp->Schedule(fn);
-  }
-
-  // Wait for the debug callbacks to finish.
-  callbacks_done.WaitForNotification();
-
-  // Wait for the concurrent functions with Run() calls to finish.
-  delete tp;
-
-  {
-    mutex_lock l(mu);
-
-    ASSERT_EQ(kConcurrentRuns, val_callback_count);
-    ASSERT_EQ(kConcurrentRuns, debug_identity_tensor_vals.size());
-
-    ASSERT_EQ(TensorShape({2, 2}), debug_identity_tensor_vals["a"].shape());
-    auto a_mat_identity = debug_identity_tensor_vals["a"].matrix<float>();
-    ASSERT_EQ(3.0, a_mat_identity(0, 0));
-    ASSERT_EQ(2.0, a_mat_identity(0, 1));
-    ASSERT_EQ(-1.0, a_mat_identity(1, 0));
-    ASSERT_EQ(0.0, a_mat_identity(1, 1));
-
-    ASSERT_EQ(TensorShape({2, 1}), debug_identity_tensor_vals["x"].shape());
-    auto x_mat_identity = debug_identity_tensor_vals["x"].matrix<float>();
-    ASSERT_EQ(1.0, x_mat_identity(0, 0));
-    ASSERT_EQ(1.0, x_mat_identity(1, 0));
-
-    ASSERT_EQ(TensorShape({2, 1}), debug_identity_tensor_vals["y"].shape());
-    auto y_mat_identity = debug_identity_tensor_vals["y"].matrix<float>();
-    ASSERT_EQ(5.0, y_mat_identity(0, 0));
-    ASSERT_EQ(-1.0, y_mat_identity(1, 0));
-  }
-}
-#endif
-
-class SessionDebugOutputSlotWithoutOutgoingEdgeTest : public ::testing::Test {
- public:
-  void Initialize() {
-    Graph graph(OpRegistry::Global());
-
-#if GOOGLE_CUDA
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:GPU:0";
-#elif defined(TENSORFLOW_USE_SYCL)
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
-#else
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:CPU:0";
-#endif
-
-    Tensor a_tensor(DT_FLOAT, TensorShape({1, 1}));
-    test::FillValues<float>(&a_tensor, {42.0});
-    Node* a = test::graph::Constant(&graph, a_tensor);
-    a->set_assigned_device_name(kDeviceName);
-
-    Node* c = test::graph::Constant(&graph, a_tensor);
-    c->set_assigned_device_name(kDeviceName);
-    c_ = c->name();
-
-    // Node c will be executed only because of the control edge from c to y.
-    // Its output slot (slot 0) does not have an outgoing edge. This test
-    // is for testing that the debugger can watch that slot properly.
-    Node* y = test::graph::NoOp(&graph, {c});
-    y->set_assigned_device_name(kDeviceName);
-    y_ = y->name();
-
-    test::graph::ToGraphDef(&graph, &def_);
-  }
-
-  string c_;
-  string y_;
-  GraphDef def_;
-};
-
-TEST_F(SessionDebugOutputSlotWithoutOutgoingEdgeTest,
-       WatchSlotWithoutOutgoingEdge) {
-  Initialize();
-  auto session = CreateSession();
-  ASSERT_TRUE(session != nullptr);
-
-  DebugGateway debug_gateway(session.get());
-
-  // Supply completion and value callbacks
-  mutex mu;
-
-  string debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(c_, ":", 0), 0, "DebugIdentity");
-
-  Notification callbacks_done;
-
-  std::vector<Tensor> debug_identity_tensor_vals;
-  debug_gateway.SetNodeValueCallback(
-      [this, &mu, &callbacks_done, &debug_identity_node_name,
-       &debug_identity_tensor_vals](
-          const string& node_name, const int output_slot,
-          const Tensor& tensor_value, const bool is_ref) {
-        mutex_lock l(mu);
-
-        if (node_name == debug_identity_node_name && output_slot == 0) {
-          debug_identity_tensor_vals.push_back(tensor_value);
-
-          if (!callbacks_done.HasBeenNotified()) {
-            callbacks_done.Notify();
-          }
-        }
-      });
-
-  // Add DebugIdentity watch on c:0, which does not have an outgoing edge.
-  RunOptions run_opts;
-  run_opts.set_output_partition_graphs(true);
-
-  DebugTensorWatch* tensor_watch_opts =
-      run_opts.mutable_debug_options()->add_debug_tensor_watch_opts();
-  tensor_watch_opts->set_node_name(c_);
-  tensor_watch_opts->set_output_slot(0);
-  tensor_watch_opts->add_debug_ops("DebugIdentity");
-
-  TF_ASSERT_OK(session->Create(def_));
-
-  // Invoke Session::Run() on y.
-  std::vector<std::pair<string, Tensor>> inputs;
-  std::vector<string> output_names;
-  std::vector<string> target_nodes = {y_};
-  std::vector<Tensor> outputs;
-
-  RunMetadata run_metadata;
-  Status s = session->Run(run_opts, inputs, output_names, target_nodes,
-                          &outputs, &run_metadata);
-  TF_ASSERT_OK(s);
-
-  // Wait for callbacks to complete.
-  callbacks_done.WaitForNotification();
-
-  // Assert that DebugIdentity node watching the control edge has been run.
-  ASSERT_EQ(1, debug_identity_tensor_vals.size());
-  auto mat_identity = debug_identity_tensor_vals[0].matrix<float>();
-  ASSERT_EQ(42.0, mat_identity(0, 0));
-}
-
-class SessionDebugVariableTest : public ::testing::Test {
- public:
-  void Initialize() {
-    Graph graph(OpRegistry::Global());
-
-#if GOOGLE_CUDA
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:GPU:0";
-#elif defined(TENSORFLOW_USE_SYCL)
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
-#else
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:CPU:0";
-#endif
-
-    // Define variable node.
-    var_node_name_ = "var";
-    Node* var =
-        test::graph::Var(&graph, DT_FLOAT, TensorShape({3}), var_node_name_);
-    var->set_assigned_device_name(kDeviceName);
-
-    // Define the initial value and the initial-value node.
-    Tensor nan_nan_seven(DT_FLOAT, TensorShape({3}));
-    nan_nan_seven.flat<float>()(0) = std::numeric_limits<float>::quiet_NaN();
-    nan_nan_seven.flat<float>()(1) = std::numeric_limits<float>::quiet_NaN();
-    nan_nan_seven.flat<float>()(2) = 7.0;
-
-    init_val_node_name_ = "init_val";
-    Node* init_val =
-        test::graph::Constant(&graph, nan_nan_seven, init_val_node_name_);
-    init_val->set_assigned_device_name(kDeviceName);
-
-    // Define node for variable value initialization
-    Node* init = test::graph::Assign(&graph, var, init_val);
-    init->set_assigned_device_name(kDeviceName);
-    init_node_name_ = init->name();
-
-    // Define new value node
-    Tensor nan_eight_eight(DT_FLOAT, TensorShape({3}));
-    nan_eight_eight.flat<float>()(0) = std::numeric_limits<float>::quiet_NaN();
-    nan_eight_eight.flat<float>()(1) = 8.0;
-    nan_eight_eight.flat<float>()(2) = 8.0;
-
-    Node* new_val = test::graph::Constant(&graph, nan_eight_eight);
-    new_val->set_assigned_device_name(kDeviceName);
-    new_val_node_name_ = new_val->name();
-
-    // Define node for assigning new value
-    Node* assign = test::graph::Assign(&graph, var, new_val);
-    assign->set_assigned_device_name(kDeviceName);
-    assign_node_name_ = assign->name();
-
-    test::graph::ToGraphDef(&graph, &def_);
-  }
-
-  string var_node_name_;
-  string init_val_node_name_;
-  string init_node_name_;
-  string new_val_node_name_;
-  string assign_node_name_;
-  GraphDef def_;
-};
-
-TEST_F(SessionDebugVariableTest, WatchUninitializedVariableWithDebugOps) {
-  Initialize();
-  auto session = CreateSession();
-  ASSERT_TRUE(session != nullptr);
-
-  DebugGateway debug_gateway(session.get());
-
-  TF_ASSERT_OK(session->Create(def_));
-
-  // Set up DebugTensorWatch for an uninitialized tensor (in node var).
-  RunOptions run_opts;
-  const string debug_identity = "DebugIdentity";
-  DebugTensorWatch* tensor_watch_opts =
-      run_opts.mutable_debug_options()->add_debug_tensor_watch_opts();
-  tensor_watch_opts->set_node_name(var_node_name_);
-  tensor_watch_opts->set_output_slot(0);
-  tensor_watch_opts->add_debug_ops(debug_identity);
-
-  // Expected name of the inserted debug node
-  string debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(var_node_name_, ":", 0), 0, debug_identity);
-
-  // Supply completion and value callbacks
-  mutex mu;
-  // Completed nodes with and without outputs
-  std::vector<string> completed_debug_nodes;
-
-  Notification callbacks_done;
-  debug_gateway.SetNodeCompletionCallback(
-      [this, &mu, &debug_identity_node_name, &completed_debug_nodes,
-       &callbacks_done](const string& node_name, const bool any_output) {
-        mutex_lock l(mu);
-        if (any_output && (node_name == debug_identity_node_name)) {
-          completed_debug_nodes.push_back(node_name);
-        }
-      });
-
-  std::vector<Tensor> debug_identity_tensor_vals;
-
-  debug_gateway.SetNodeValueCallback(
-      [this, &mu, &debug_identity_node_name, &debug_identity_tensor_vals,
-       &callbacks_done](const string& node_name, const int output_slot,
-                        const Tensor& tensor_value, const bool is_ref) {
-        mutex_lock l(mu);
-        if (node_name == debug_identity_node_name && output_slot == 0) {
-          // output_slot == 0 carries the debug signal. Same below.
-          debug_identity_tensor_vals.push_back(tensor_value);
-        }
-
-        // Set the notification once we have the value from the target node.
-        if (node_name == init_node_name_ && !callbacks_done.HasBeenNotified()) {
-          callbacks_done.Notify();
-        }
-      });
-
-  // First run the initialization op
-  std::vector<std::pair<string, Tensor>> inputs_init;
-  std::vector<Tensor> outputs_init;
-
-  RunMetadata run_metadata;
-  Status s = session->Run(run_opts, inputs_init, {init_node_name_}, {},
-                          &outputs_init, &run_metadata);
-  TF_ASSERT_OK(s);
-
-  callbacks_done.WaitForNotification();
-
-  ASSERT_EQ(1, completed_debug_nodes.size());
-  ASSERT_EQ(
-      1, std::count(completed_debug_nodes.begin(), completed_debug_nodes.end(),
-                    debug_identity_node_name));
-
-  // Assert the output reflects the uninitialized nature of var's tensor.
-  ASSERT_EQ(1, debug_identity_tensor_vals.size());
-  ASSERT_FALSE(debug_identity_tensor_vals[0].IsInitialized());
-  ASSERT_EQ(DT_FLOAT, debug_identity_tensor_vals[0].dtype());
-  ASSERT_EQ(TensorShape({3}), debug_identity_tensor_vals[0].shape());
-}
-
-TEST_F(SessionDebugVariableTest, VariableAssignWithDebugOps) {
-  // Tensor contains one count of NaN
-  Initialize();
-  auto session = CreateSession();
-  ASSERT_TRUE(session != nullptr);
-
-  DebugGateway debug_gateway(session.get());
-
-  TF_ASSERT_OK(session->Create(def_));
-
-  // First run the initialization op
-  std::vector<std::pair<string, Tensor>> inputs_init;
-  std::vector<Tensor> outputs_init;
-  Status s = session->Run(inputs_init, {init_node_name_}, {}, &outputs_init);
-  TF_ASSERT_OK(s);
-
-  // Create debug tensor watch options with two ref-type debug ops:
-  // DebugIdentity and DebugNanCount
-  RunOptions run_opts;
-  run_opts.set_output_partition_graphs(true);
-  const string debug_identity = "DebugIdentity";
-  const string debug_nan_count = "DebugNanCount";
-  DebugTensorWatch* tensor_watch_opts =
-      run_opts.mutable_debug_options()->add_debug_tensor_watch_opts();
-  tensor_watch_opts->set_node_name(var_node_name_);
-  tensor_watch_opts->set_output_slot(0);
-  tensor_watch_opts->add_debug_ops(debug_identity);
-  tensor_watch_opts->add_debug_ops(debug_nan_count);
-
-  char tempdir_template[] = "/tmp/tfdbg_XXXXXX";
-  string temp_dir(mkdtemp(tempdir_template));
-  tensor_watch_opts->add_debug_urls(strings::StrCat("file://", temp_dir));
-
-  // Expected name of the inserted debug node
-  string debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(var_node_name_, ":", 0), 0, debug_identity);
-  string debug_nan_count_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(var_node_name_, ":", 0), 1, debug_nan_count);
-
-  // Supply completion and value callbacks
-  mutex mu;
-  // Completed nodes with and without outputs
-  std::vector<string> completed_debug_nodes;
-
-  Notification callbacks_done;
-  debug_gateway.SetNodeCompletionCallback(
-      [this, &mu, &debug_identity_node_name, &debug_nan_count_node_name,
-       &completed_debug_nodes,
-       &callbacks_done](const string& node_name, const bool any_output) {
-        mutex_lock l(mu);
-        if (any_output && (node_name == debug_identity_node_name ||
-                           node_name == debug_nan_count_node_name)) {
-          completed_debug_nodes.push_back(node_name);
-        }
-      });
-
-  std::vector<Tensor> debug_identity_tensor_vals;
-  std::vector<Tensor> debug_nan_count_tensor_vals;
-
-  debug_gateway.SetNodeValueCallback(
-      [this, &mu, &debug_identity_node_name, &debug_nan_count_node_name,
-       &debug_identity_tensor_vals, &debug_nan_count_tensor_vals,
-       &callbacks_done](const string& node_name, const int output_slot,
-                        const Tensor& tensor_value, const bool is_ref) {
-        mutex_lock l(mu);
-        if (node_name == debug_identity_node_name && output_slot == 0) {
-          // output_slot == 0 carries the debug signal. Same below.
-          debug_identity_tensor_vals.push_back(tensor_value);
-        } else if (node_name == debug_nan_count_node_name && output_slot == 0) {
-          debug_nan_count_tensor_vals.push_back(tensor_value);
-        }
-
-        // Set the notification once we have the value from the target node.
-        if (node_name == assign_node_name_ &&
-            !callbacks_done.HasBeenNotified()) {
-          callbacks_done.Notify();
-        }
-      });
-
-  // // Request two targets: one fetch output and one non-fetched output.
-  std::vector<std::pair<string, Tensor>> inputs;
-  std::vector<string> output_names = {assign_node_name_ + ":0"};
-  std::vector<string> target_nodes = {assign_node_name_};
-  std::vector<Tensor> outputs;
-
-  // Run with RunOptions that has tensor watches
-  RunMetadata run_metadata;
-  s = session->Run(run_opts, inputs, output_names, target_nodes, &outputs,
-                   &run_metadata);
-  TF_ASSERT_OK(s);
-
-#if GOOGLE_CUDA
-  ASSERT_EQ(2, run_metadata.partition_graphs().size());
-#elif defined(TENSORFLOW_USE_SYCL)
-  ASSERT_EQ(2, run_metadata.partition_graphs().size());
-#else
-  ASSERT_EQ(1, run_metadata.partition_graphs().size());
-#endif
-
-  // Wait for callbacks to complete.
-  callbacks_done.WaitForNotification();
-
-  // Verify that the update has happened properly.
-  ASSERT_EQ(1, outputs.size());
-  ASSERT_TRUE(std::isnan(outputs[0].vec<float>()(0)));
-  ASSERT_EQ(8.0, outputs[0].vec<float>()(1));  // Expect new value
-  ASSERT_EQ(8.0, outputs[0].vec<float>()(2));  // Expect new value
-
-  // Verify that each of the two debug nodes has completed exactly once.
-  ASSERT_EQ(2, completed_debug_nodes.size());
-  ASSERT_EQ(
-      1, std::count(completed_debug_nodes.begin(), completed_debug_nodes.end(),
-                    debug_identity_node_name));
-  ASSERT_EQ(
-      1, std::count(completed_debug_nodes.begin(), completed_debug_nodes.end(),
-                    debug_nan_count_node_name));
-
-  // Verify that the values from the ref identity node reflects the value
-  // before the new assign.
-  ASSERT_EQ(1, debug_identity_tensor_vals.size());
-
-  auto vec_identity = debug_identity_tensor_vals[0].vec<float>();
-  ASSERT_TRUE(std::isnan(vec_identity(0)));
-  ASSERT_TRUE(std::isnan(vec_identity(1)));
-  ASSERT_EQ(7.0, vec_identity(2));
-
-  // Verify that the output from the NaN-count debug node indicates exactly
-  // two NaNs, i.e., reflecting the value before the new assign.
-  ASSERT_EQ(1, debug_nan_count_tensor_vals.size());
-  ASSERT_EQ(2, debug_nan_count_tensor_vals[0].scalar<int64>()());
-}
-
-#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_SYCL)
-class SessionDebugGPUSwitchTest : public ::testing::Test {
- public:
-  void Initialize() {
-    Graph graph(OpRegistry::Global());
-
-#ifdef GOOGLE_CUDA
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:GPU:0";
-#elif TENSORFLOW_USE_SYCL
-    const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
-#endif
-
-    Tensor vb(DT_BOOL, TensorShape({}));
-    vb.scalar<bool>()() = true;
-    Tensor vi(DT_INT64, TensorShape({}));
-    vi.scalar<int>()() = 42;
-    // So vi is expected to be forwarded to the second output port of sw.
-
-    Node* pred = test::graph::Constant(&graph, vb);
-    pred->set_assigned_device_name(kDeviceName);
-    pred_node_name_ = pred->name();
-
-    Node* value = test::graph::Constant(&graph, vi);
-    pred->set_assigned_device_name(kDeviceName);
-    value_node_name_ = value->name();
-
-    Node* sw = test::graph::Switch(&graph, value, pred);
-    sw->set_assigned_device_name(kDeviceName);
-    sw_node_name_ = sw->name();
-
-    Node* z = test::graph::Identity(&graph, sw, 1);
-    sw->set_assigned_device_name(kDeviceName);
-    z_node_name_ = z->name();
-
-    test::graph::ToGraphDef(&graph, &def_);
-  }
-
-  string pred_node_name_;
-  string value_node_name_;
-  string sw_node_name_;
-  string z_node_name_;
-  GraphDef def_;
-};
-
-// Test for debug-watching tensors marked as HOST_MEMORY on GPU.
-TEST_F(SessionDebugGPUSwitchTest, RunSwitchWithHostMemoryDebugOp) {
-  Initialize();
-  auto session = CreateSession();
-  ASSERT_TRUE(session != nullptr);
-
-  DebugGateway debug_gateway(session.get());
-
-  RunOptions run_opts;
-  run_opts.set_output_partition_graphs(true);
-  // This is the name of the boolean tensor fed as pred to the Switch node.
-  // On GPU, this edge is HOST_MEMORY.
-  const string watched_tensor = strings::StrCat(pred_node_name_, "/_1");
-
-  const string debug_identity = "DebugIdentity";
-  DebugTensorWatch* tensor_watch_opts =
-      run_opts.mutable_debug_options()->add_debug_tensor_watch_opts();
-  tensor_watch_opts->set_node_name(watched_tensor);
-  tensor_watch_opts->set_output_slot(0);
-  tensor_watch_opts->add_debug_ops(debug_identity);
-
-  // Expected name of the inserted debug node
-  string debug_identity_node_name = DebugNodeInserter::GetDebugNodeName(
-      strings::StrCat(watched_tensor, ":", 0), 0, debug_identity);
-
-  // Supply completion and value callbacks
-  mutex mu;
-  // Completed nodes with and without outputs
-  std::vector<string> completed_nodes_w_outputs;
-  std::vector<string> completed_nodes_wo_outputs;
-
-  Notification callbacks_done;
-  debug_gateway.SetNodeCompletionCallback(
-      [&mu, &completed_nodes_w_outputs, &completed_nodes_wo_outputs](
-          const string& node_name, const bool any_output) {
-        mutex_lock l(mu);
-        if (any_output) {
-          completed_nodes_w_outputs.push_back(node_name);
-        } else {
-          completed_nodes_wo_outputs.push_back(node_name);
-        }
-      });
-
-  std::vector<Tensor> debug_identity_tensor_vals;
-
-  debug_gateway.SetNodeValueCallback(
-      [this, &mu, &debug_identity_node_name, &debug_identity_tensor_vals,
-       &callbacks_done](const string& node_name, const int output_slot,
-                        const Tensor& tensor_value, const bool is_ref) {
-        mutex_lock l(mu);
-        if (node_name == debug_identity_node_name && output_slot == 0) {
-          debug_identity_tensor_vals.push_back(tensor_value);
-        }
-
-        // Set the notification once we have the value from the target node.
-        if (node_name == z_node_name_ && !callbacks_done.HasBeenNotified()) {
-          callbacks_done.Notify();
-        }
-      });
-
-  TF_ASSERT_OK(session->Create(def_));
-
-  std::vector<std::pair<string, Tensor>> inputs;
-
-  // Request two targets: one fetch output and one non-fetched output.
-  std::vector<string> output_names = {z_node_name_ + ":0"};
-  std::vector<string> target_nodes = {z_node_name_};
-  std::vector<Tensor> outputs;
-
-  RunMetadata run_metadata;
-  Status s = session->Run(run_opts, inputs, output_names, target_nodes,
-                          &outputs, &run_metadata);
-  TF_ASSERT_OK(s);
-
-  ASSERT_EQ(2, run_metadata.partition_graphs().size());
-
-  // Wait for callbacks to complete.
-  callbacks_done.WaitForNotification();
-
-  ASSERT_EQ(1, debug_identity_tensor_vals.size());
-  ASSERT_TRUE(debug_identity_tensor_vals[0].scalar<bool>()());
-}
-#endif  // GOOGLE_CUDA
-
-}  // end namespace
-}  // end namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/BUILD b/tensorflow/core/distributed_runtime/BUILD
index 75f8a19e9c..2059b1ce0d 100644
--- a/tensorflow/core/distributed_runtime/BUILD
+++ b/tensorflow/core/distributed_runtime/BUILD
@@ -494,9 +494,11 @@ tf_cc_test(
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:session_options",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
+        "//tensorflow/core:worker_proto_cc",
     ],
 )
 
@@ -636,12 +638,12 @@ tf_cuda_cc_test(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:master_proto_cc",
-        "//tensorflow/core:master_service_proto_cc",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
         "//tensorflow/core:testlib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_channel",
+        "//tensorflow/core/distributed_runtime/rpc:grpc_master_service_impl",
         "//tensorflow/core/distributed_runtime/rpc:grpc_testlib",
         "//tensorflow/core/distributed_runtime/rpc:grpc_util",
         "//tensorflow/core/distributed_runtime/rpc:grpc_worker_cache",
diff --git a/tensorflow/core/distributed_runtime/eager/eager_client.h b/tensorflow/core/distributed_runtime/eager/eager_client.h
index 9ba8c8d80c..707f3234b9 100644
--- a/tensorflow/core/distributed_runtime/eager/eager_client.h
+++ b/tensorflow/core/distributed_runtime/eager/eager_client.h
@@ -39,6 +39,7 @@ class EagerClient {
   CLIENT_METHOD(KeepAlive);
   CLIENT_METHOD(CloseContext);
   CLIENT_METHOD(RegisterFunction);
+  CLIENT_METHOD(SendTensor);
 
 #undef CLIENT_METHOD
 };
diff --git a/tensorflow/core/distributed_runtime/eager/eager_service_impl.cc b/tensorflow/core/distributed_runtime/eager/eager_service_impl.cc
index 5a26d5bf48..916c8720f0 100644
--- a/tensorflow/core/distributed_runtime/eager/eager_service_impl.cc
+++ b/tensorflow/core/distributed_runtime/eager/eager_service_impl.cc
@@ -63,10 +63,10 @@ Status GetNumRetvals(tensorflow::EagerContext* context, const string& op_name,
       }
       *num_retvals += iter->second.i();
     } else if (!output_arg.type_list_attr().empty()) {
-      auto iter = attrs.find(output_arg.number_attr());
+      auto iter = attrs.find(output_arg.type_list_attr());
       if (iter == attrs.end()) {
-        return errors::InvalidArgument("Unable to find number_attr ",
-                                       output_arg.number_attr(),
+        return errors::InvalidArgument("Unable to find type_list_attr ",
+                                       output_arg.type_list_attr(),
                                        " for Op: ", op_name);
       }
       *num_retvals += iter->second.list().type_size();
@@ -81,6 +81,11 @@ Status GetNumRetvals(tensorflow::EagerContext* context, const string& op_name,
 
 Status EagerServiceImpl::CreateContext(const CreateContextRequest* request,
                                        CreateContextResponse* response) {
+  // make sure env_ , env_->rendezvous_mgr available
+  if (env_ == nullptr || env_->rendezvous_mgr == nullptr) {
+    return tensorflow::errors::Internal(
+        "invalid eager env_ or env_->rendezvous_mgr.");
+  }
   std::vector<tensorflow::Device*> devices;
 
   TF_RETURN_IF_ERROR(tensorflow::DeviceFactory::AddDevices(
@@ -262,6 +267,35 @@ Status EagerServiceImpl::RegisterFunction(
   return context->Context()->AddFunctionDef(request->function_def());
 }
 
+Status EagerServiceImpl::SendTensor(const SendTensorRequest* request,
+                                    SendTensorResponse* response) {
+  ServerContext* context = nullptr;
+  TF_RETURN_IF_ERROR(GetServerContext(request->context_id(), &context));
+  core::ScopedUnref context_unref(context);
+
+  tensorflow::gtl::InlinedVector<tensorflow::TensorHandle*, 2> tensors;
+  for (const auto& tensor_proto : request->tensors()) {
+    Tensor tensor;
+    if (!tensor.FromProto(tensor_proto)) {
+      return errors::InvalidArgument("Unable to parse tensor proto");
+    }
+
+    TensorHandle* tensor_handle =
+        new TensorHandle(tensor, nullptr, nullptr, nullptr);
+
+    TensorHandle* copied_handle = nullptr;
+    TF_RETURN_IF_ERROR(EagerCopyToDevice(tensor_handle, context->Context(),
+                                         request->device_name().c_str(),
+                                         &copied_handle));
+    tensors.push_back(copied_handle);
+    tensor_handle->Unref();
+  }
+
+  context->AddOperationOutputs(tensors, request->op_id());
+
+  return Status::OK();
+}
+
 tensorflow::Status EagerServiceImpl::GetServerContext(
     uint64 context_id, ServerContext** server_context) {
   mutex_lock l(contexts_mu_);
diff --git a/tensorflow/core/distributed_runtime/eager/eager_service_impl.h b/tensorflow/core/distributed_runtime/eager/eager_service_impl.h
index b0e4aa84b9..718b4e2457 100644
--- a/tensorflow/core/distributed_runtime/eager/eager_service_impl.h
+++ b/tensorflow/core/distributed_runtime/eager/eager_service_impl.h
@@ -62,6 +62,9 @@ class EagerServiceImpl {
   Status RegisterFunction(const RegisterFunctionRequest* request,
                           RegisterFunctionResponse* response);
 
+  Status SendTensor(const SendTensorRequest* request,
+                    SendTensorResponse* response);
+
  protected:
   // This is the server-side execution context. All state regarding execution of
   // a client's ops is held in this server-side context (all generated tensors,
diff --git a/tensorflow/core/distributed_runtime/eager/eager_service_impl_test.cc b/tensorflow/core/distributed_runtime/eager/eager_service_impl_test.cc
index b98386ba86..d1f2a6da8f 100644
--- a/tensorflow/core/distributed_runtime/eager/eager_service_impl_test.cc
+++ b/tensorflow/core/distributed_runtime/eager/eager_service_impl_test.cc
@@ -84,7 +84,7 @@ class EagerServiceImplTest : public ::testing::Test {
   std::unique_ptr<DeviceMgr> device_mgr_;
 };
 
-void SetTensorProto(AttrValue* val) {
+void SetTensorProto(TensorProto* tensor_proto) {
   int64_t dims[] = {2, 2};
   float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
   TF_Tensor* t = TF_AllocateTensor(
@@ -92,7 +92,7 @@ void SetTensorProto(AttrValue* val) {
   memcpy(TF_TensorData(t), &data[0], TF_TensorByteSize(t));
   tensorflow::Tensor tensor;
   TF_ASSERT_OK(tensorflow::TF_TensorToTensor(t, &tensor));
-  tensor.AsProtoTensorContent(val->mutable_tensor());
+  tensor.AsProtoTensorContent(tensor_proto);
   TF_DeleteTensor(t);
 }
 
@@ -175,7 +175,7 @@ TEST_F(EagerServiceImplTest, BasicTest) {
   val.set_type(tensorflow::DataType::DT_FLOAT);
   const_attrs.insert({"dtype", val});
   val.Clear();
-  SetTensorProto(&val);
+  SetTensorProto(val.mutable_tensor());
   const_attrs.insert({"value", val});
 
   AddOperationToEnqueueRequest(1, "Const", {}, const_attrs,
@@ -260,7 +260,7 @@ TEST_F(EagerServiceImplTest, BasicFunctionTest) {
   const_attrs.insert({"dtype", val});
   val.Clear();
 
-  SetTensorProto(&val);
+  SetTensorProto(val.mutable_tensor());
   const_attrs.insert({"value", val});
 
   AddOperationToEnqueueRequest(1, "Const", {}, const_attrs,
@@ -294,6 +294,77 @@ TEST_F(EagerServiceImplTest, BasicFunctionTest) {
                                                &close_context_response));
 }
 
+// Test creates a context and attempts to send a tensor (using the RPC), and
+// then use the tensor.
+TEST_F(EagerServiceImplTest, SendTensorTest) {
+  TestEagerServiceImpl eager_service_impl(&worker_env_);
+
+  CreateContextRequest request;
+  request.mutable_server_def()->set_job_name("localhost");
+  request.mutable_server_def()->set_task_index(0);
+  request.set_rendezvous_id(random::New64());
+  CreateContextResponse response;
+
+  TF_ASSERT_OK(eager_service_impl.CreateContext(&request, &response));
+
+  uint64 context_id = response.context_id();
+
+  SendTensorRequest send_tensor_request;
+  send_tensor_request.set_context_id(context_id);
+  send_tensor_request.set_op_id(1);
+  SetTensorProto(send_tensor_request.add_tensors());
+  SendTensorResponse send_tensor_response;
+
+  TF_ASSERT_OK(eager_service_impl.SendTensor(&send_tensor_request,
+                                             &send_tensor_response));
+
+  EnqueueRequest remote_enqueue_request;
+  remote_enqueue_request.set_context_id(context_id);
+  EnqueueResponse remote_enqueue_response;
+
+  std::unordered_map<string, AttrValue> attrs;
+  AttrValue val;
+  val.Clear();
+  val.set_type(tensorflow::DataType::DT_FLOAT);
+  attrs.insert({"T", val});
+  val.Clear();
+  val.set_b(false);
+  attrs.insert({"transpose_a", val});
+  attrs.insert({"transpose_b", val});
+
+  AddOperationToEnqueueRequest(2, "MatMul", {{1, 0}, {1, 0}}, attrs,
+                               "/job:localhost/replica:0/task:0/device:CPU:0",
+                               &remote_enqueue_request);
+
+  TF_ASSERT_OK(eager_service_impl.Enqueue(&remote_enqueue_request,
+                                          &remote_enqueue_response));
+
+  const tensorflow::Tensor* t = nullptr;
+  tensorflow::TensorHandle* tensor_handle;
+  TF_ASSERT_OK(eager_service_impl.GetTensorHandle(
+      response.context_id(), RemoteTensorHandleInternal(2, 0), &tensor_handle));
+  TF_ASSERT_OK(tensor_handle->Tensor(&t));
+
+  Device* device = nullptr;
+  TF_ASSERT_OK(tensor_handle->Device(&device));
+  EXPECT_NE(device, nullptr);
+  EXPECT_EQ(device->name(), "/job:localhost/replica:0/task:0/device:CPU:0");
+
+  auto actual = t->flat<float>();
+  EXPECT_EQ(4, actual.size());
+
+  EXPECT_EQ(7, actual(0));
+  EXPECT_EQ(10, actual(1));
+  EXPECT_EQ(15, actual(2));
+  EXPECT_EQ(22, actual(3));
+
+  CloseContextRequest close_context_request;
+  close_context_request.set_context_id(context_id);
+  CloseContextResponse close_context_response;
+  TF_ASSERT_OK(eager_service_impl.CloseContext(&close_context_request,
+                                               &close_context_response));
+}
+
 }  // namespace
 }  // namespace eager
 }  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/eager/remote_execute_node.h b/tensorflow/core/distributed_runtime/eager/remote_execute_node.h
index 28b68c3b88..0e3a68c4d8 100644
--- a/tensorflow/core/distributed_runtime/eager/remote_execute_node.h
+++ b/tensorflow/core/distributed_runtime/eager/remote_execute_node.h
@@ -29,8 +29,8 @@ namespace eager {
 class RemoteExecuteNode : public tensorflow::EagerNode {
  public:
   RemoteExecuteNode(
-      tensorflow::uint64 id, const tensorflow::eager::EnqueueRequest& request,
-      tensorflow::eager::EagerClient* eager_client,
+      tensorflow::uint64 id, std::unique_ptr<EnqueueRequest> request,
+      EagerClient* eager_client,
       const gtl::InlinedVector<TensorHandle*, 4>& inputs,
       std::function<void(const Status& status, const EnqueueResponse& response)>
           done_callback)
@@ -45,8 +45,8 @@ class RemoteExecuteNode : public tensorflow::EagerNode {
   }
 
   RemoteExecuteNode(tensorflow::uint64 id,
-                    const tensorflow::eager::EnqueueRequest& request,
-                    tensorflow::eager::EagerClient* eager_client)
+                    std::unique_ptr<EnqueueRequest> request,
+                    EagerClient* eager_client)
       : tensorflow::EagerNode(id),
         request_(std::move(request)),
         eager_client_(eager_client) {}
@@ -58,10 +58,10 @@ class RemoteExecuteNode : public tensorflow::EagerNode {
   }
 
   tensorflow::Status Run() override {
-    tensorflow::eager::EnqueueResponse response;
-    tensorflow::Status status;
+    EnqueueResponse response;
+    Status status;
     Notification n;
-    eager_client_->EnqueueAsync(&request_, &response,
+    eager_client_->EnqueueAsync(request_.get(), &response,
                                 [&n, &status](const tensorflow::Status& s) {
                                   status.Update(s);
                                   n.Notify();
@@ -76,9 +76,8 @@ class RemoteExecuteNode : public tensorflow::EagerNode {
   }
 
  private:
-  EnqueueRequest request_;
-  tensorflow::eager::EagerClient*
-      eager_client_;  // Not owned, and must outlive the RemoteExecuteNode.
+  std::unique_ptr<EnqueueRequest> request_;
+  EagerClient* eager_client_;  // Not owned, and must outlive this node.
 
   // This is required to ensure that the tensor handles stay alive across the
   // execution.
diff --git a/tensorflow/core/distributed_runtime/graph_mgr.cc b/tensorflow/core/distributed_runtime/graph_mgr.cc
index e2f13df19f..6c146036ae 100644
--- a/tensorflow/core/distributed_runtime/graph_mgr.cc
+++ b/tensorflow/core/distributed_runtime/graph_mgr.cc
@@ -261,7 +261,7 @@ Status GraphMgr::InitItem(const string& session, const GraphDef& gdef,
     optimizer.Optimize(lib, worker_env_->env, params.device, &subgraph,
                        /*shape_map=*/nullptr);
 
-    // EXPERIMENTAL: tfdbg inserts debug nodes (i.e., probes) to the graph.
+    // TensorFlow Debugger (tfdbg) inserts debug nodes in the graph.
     if (!debug_options.debug_tensor_watch_opts().empty()) {
       TF_RETURN_IF_ERROR(DecorateAndPublishGraphForDebug(
           debug_options, subgraph.get(), params.device));
diff --git a/tensorflow/core/distributed_runtime/master_test.cc b/tensorflow/core/distributed_runtime/master_test.cc
index 09e96cbd40..62b18a45b1 100644
--- a/tensorflow/core/distributed_runtime/master_test.cc
+++ b/tensorflow/core/distributed_runtime/master_test.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include "grpcpp/grpcpp.h"
 
 #include "tensorflow/core/distributed_runtime/rpc/grpc_channel.h"
+#include "tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_testlib.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 #include "tensorflow/core/framework/allocator.h"
@@ -37,7 +38,6 @@ limitations under the License.
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/protobuf/master.pb.h"
-#include "tensorflow/core/protobuf/master_service.grpc.pb.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/distributed_runtime/rpc/BUILD b/tensorflow/core/distributed_runtime/rpc/BUILD
index d6c493c022..4a10d99a60 100644
--- a/tensorflow/core/distributed_runtime/rpc/BUILD
+++ b/tensorflow/core/distributed_runtime/rpc/BUILD
@@ -201,11 +201,11 @@ cc_library(
     srcs = ["grpc_remote_master.cc"],
     hdrs = ["grpc_remote_master.h"],
     deps = [
+        ":grpc_master_service_impl",
         ":grpc_util",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:master_proto_cc",
-        "//tensorflow/core:master_service_proto_cc",
         "//tensorflow/core/distributed_runtime:call_options",
         "//tensorflow/core/distributed_runtime:master_interface",
     ],
@@ -219,18 +219,28 @@ cc_library(
     deps = [
         ":async_service_interface",
         ":grpc_call",
+        ":grpc_master_service_impl",
         ":grpc_util",
         "//tensorflow:grpc++",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:master_proto_cc",
-        "//tensorflow/core:master_service_proto_cc",
         "//tensorflow/core/distributed_runtime:master",
     ],
     alwayslink = 1,
 )
 
 cc_library(
+    name = "grpc_master_service_impl",
+    srcs = ["grpc_master_service_impl.cc"],
+    hdrs = ["grpc_master_service_impl.h"],
+    deps = [
+        "//tensorflow:grpc++",
+        "//tensorflow/core:master_proto_cc",
+    ],
+)
+
+cc_library(
     name = "rpc_rendezvous_mgr",
     srcs = ["rpc_rendezvous_mgr.cc"],
     hdrs = ["rpc_rendezvous_mgr.h"],
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc
index b23466037f..181422118c 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc
@@ -49,6 +49,7 @@ class GrpcEagerClient : public EagerClient {
   CLIENT_METHOD(KeepAlive);
   CLIENT_METHOD(CloseContext);
   CLIENT_METHOD(RegisterFunction);
+  CLIENT_METHOD(SendTensor);
 
 #undef CLIENT_METHOD
 
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc
index 39ab6856c5..ab3aa3fd1d 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.cc
@@ -36,6 +36,7 @@ static const char* grpcEagerService_method_names[] = {
     "/tensorflow.eager.EagerService/KeepAlive",
     "/tensorflow.eager.EagerService/CloseContext",
     "/tensorflow.eager.EagerService/RegisterFunction",
+    "/tensorflow.eager.EagerService/SendTensor",
 };
 
 std::unique_ptr<EagerService::Stub> EagerService::NewStub(
@@ -62,7 +63,9 @@ EagerService::Stub::Stub(
                               ::grpc::internal::RpcMethod::NORMAL_RPC, channel),
       rpcmethod_RegisterFunction_(grpcEagerService_method_names[5],
                                   ::grpc::internal::RpcMethod::NORMAL_RPC,
-                                  channel) {}
+                                  channel),
+      rpcmethod_SendTensor_(grpcEagerService_method_names[6],
+                            ::grpc::internal::RpcMethod::NORMAL_RPC, channel) {}
 
 ::grpc::Status EagerService::Stub::CreateContext(
     ::grpc::ClientContext* context, const CreateContextRequest& request,
@@ -106,8 +109,15 @@ EagerService::Stub::Stub(
       channel_.get(), rpcmethod_RegisterFunction_, context, request, response);
 }
 
+::grpc::Status EagerService::Stub::SendTensor(::grpc::ClientContext* context,
+                                              const SendTensorRequest& request,
+                                              SendTensorResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_SendTensor_, context, request, response);
+}
+
 EagerService::AsyncService::AsyncService() {
-  for (int i = 0; i < 6; ++i) {
+  for (int i = 0; i < 7; ++i) {
     AddMethod(new ::grpc::internal::RpcServiceMethod(
         grpcEagerService_method_names[i],
         ::grpc::internal::RpcMethod::NORMAL_RPC, nullptr));
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h
index 66458186ad..521e0ac4fa 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service.h
@@ -69,6 +69,9 @@ class EagerService final {
     virtual ::grpc::Status RegisterFunction(
         ::grpc::ClientContext* context, const RegisterFunctionRequest& request,
         RegisterFunctionResponse* response) = 0;
+    virtual ::grpc::Status SendTensor(::grpc::ClientContext* context,
+                                      const SendTensorRequest& request,
+                                      SendTensorResponse* response) = 0;
   };
   class Stub final : public StubInterface {
    public:
@@ -91,6 +94,9 @@ class EagerService final {
     ::grpc::Status RegisterFunction(
         ::grpc::ClientContext* context, const RegisterFunctionRequest& request,
         RegisterFunctionResponse* response) override;
+    ::grpc::Status SendTensor(::grpc::ClientContext* context,
+                              const SendTensorRequest& request,
+                              SendTensorResponse* response) override;
 
    private:
     std::shared_ptr< ::grpc::ChannelInterface> channel_;
@@ -100,6 +106,7 @@ class EagerService final {
     const ::grpc::internal::RpcMethod rpcmethod_KeepAlive_;
     const ::grpc::internal::RpcMethod rpcmethod_CloseContext_;
     const ::grpc::internal::RpcMethod rpcmethod_RegisterFunction_;
+    const ::grpc::internal::RpcMethod rpcmethod_SendTensor_;
   };
   static std::unique_ptr<Stub> NewStub(
       const std::shared_ptr< ::grpc::ChannelInterface>& channel,
@@ -157,6 +164,14 @@ class EagerService final {
       ::grpc::Service::RequestAsyncUnary(5, context, request, response,
                                          new_call_cq, notification_cq, tag);
     }
+    void RequestSendTensor(
+        ::grpc::ServerContext* context, SendTensorRequest* request,
+        ::grpc::ServerAsyncResponseWriter<SendTensorResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(6, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
   };
 };
 
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.cc b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.cc
index 52e06c263d..f511674e1f 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.cc
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.cc
@@ -27,9 +27,7 @@ namespace eager {
 
 GrpcEagerServiceImpl::GrpcEagerServiceImpl(
     const WorkerEnv* env, ::grpc::ServerBuilder* server_builder)
-    : local_impl_(env) {
-  request_handler_threadpool_ =
-      MakeUnique<thread::ThreadPool>(env->env, "EagerServiceRequestHandler", 4);
+    : env_(env), local_impl_(env) {
   server_builder->RegisterService(&service_);
   cq_ = server_builder->AddCompletionQueue();
 }
@@ -50,6 +48,7 @@ void GrpcEagerServiceImpl::HandleRPCsLoop() {
   ENQUEUE_REQUEST(KeepAlive);
   ENQUEUE_REQUEST(CloseContext);
   ENQUEUE_REQUEST(RegisterFunction);
+  ENQUEUE_REQUEST(SendTensor);
 #undef ENQUEUE_REQUEST
 
   void* tag;  // Matches the operation started against this cq_.
diff --git a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h
index 9a94026342..537e9043bd 100644
--- a/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h
+++ b/tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_service_impl.h
@@ -45,7 +45,7 @@ class GrpcEagerServiceImpl : public AsyncServiceInterface {
  private:
 #define HANDLER(method)                                                        \
   void method##Handler(EagerCall<method##Request, method##Response>* call) {   \
-    request_handler_threadpool_->Schedule([this, call]() {                     \
+    env_->compute_pool->Schedule([this, call]() {                              \
       call->SendResponse(                                                      \
           ToGrpcStatus(local_impl_.method(&call->request, &call->response)));  \
     });                                                                        \
@@ -62,8 +62,10 @@ class GrpcEagerServiceImpl : public AsyncServiceInterface {
   HANDLER(KeepAlive);
   HANDLER(CloseContext);
   HANDLER(RegisterFunction);
+  HANDLER(SendTensor);
 #undef HANDLER
 
+  const WorkerEnv* const env_;  // Not owned.
   EagerServiceImpl local_impl_;
 
   std::unique_ptr<::grpc::Alarm> shutdown_alarm_;
@@ -71,8 +73,6 @@ class GrpcEagerServiceImpl : public AsyncServiceInterface {
   std::unique_ptr<::grpc::ServerCompletionQueue> cq_;
   tensorflow::eager::grpc::EagerService::AsyncService service_;
 
-  std::unique_ptr<thread::ThreadPool> request_handler_threadpool_;
-
   TF_DISALLOW_COPY_AND_ASSIGN(GrpcEagerServiceImpl);
 };
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc b/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc
index 0ebc084cb6..b7eb3c9015 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_channel.cc
@@ -42,12 +42,12 @@ string MakeAddress(const string& job, int task) {
   return strings::StrCat("/job:", job, "/replica:0/task:", task);
 }
 
+// Allows the host to be a raw IP (either v4 or v6).
 Status ValidateHostPortPair(const string& host_port) {
   uint32 port;
-  std::vector<string> parts = str_util::Split(host_port, ':');
-  // Must be host:port, port must be a number, host must not contain a '/'.
-  if (parts.size() != 2 || !strings::safe_strtou32(parts[1], &port) ||
-      parts[0].find("/") != string::npos) {
+  auto colon_index = host_port.find_last_of(':');
+  if (!strings::safe_strtou32(host_port.substr(colon_index + 1), &port) ||
+      host_port.substr(0, colon_index).find("/") != string::npos) {
     return errors::InvalidArgument("Could not interpret \"", host_port,
                                    "\" as a host-port pair.");
   }
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_channel_test.cc b/tensorflow/core/distributed_runtime/rpc/grpc_channel_test.cc
index a17acc85b3..f07a5a0974 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_channel_test.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_channel_test.cc
@@ -150,10 +150,15 @@ TEST(GrpcChannelTest, NewHostPortGrpcChannelValidation) {
   EXPECT_TRUE(NewHostPortGrpcChannel("127.0.0.1:2222", &mock_ptr).ok());
   EXPECT_TRUE(NewHostPortGrpcChannel("example.com:2222", &mock_ptr).ok());
   EXPECT_TRUE(NewHostPortGrpcChannel("fqdn.example.com.:2222", &mock_ptr).ok());
+  EXPECT_TRUE(NewHostPortGrpcChannel("[2002:a9c:258e::]:2222", &mock_ptr).ok());
+  EXPECT_TRUE(NewHostPortGrpcChannel("[::]:2222", &mock_ptr).ok());
 
   EXPECT_FALSE(NewHostPortGrpcChannel("example.com/abc:2222", &mock_ptr).ok());
   EXPECT_FALSE(NewHostPortGrpcChannel("127.0.0.1:2222/", &mock_ptr).ok());
   EXPECT_FALSE(NewHostPortGrpcChannel("example.com/abc:", &mock_ptr).ok());
+  EXPECT_FALSE(NewHostPortGrpcChannel("[::]/:2222", &mock_ptr).ok());
+  EXPECT_FALSE(NewHostPortGrpcChannel("[::]:2222/", &mock_ptr).ok());
+  EXPECT_FALSE(NewHostPortGrpcChannel("[::]:", &mock_ptr).ok());
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc b/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
index 2c2c1d484a..127dea2882 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
@@ -36,12 +36,12 @@ limitations under the License.
 #include "tensorflow/core/distributed_runtime/master.h"
 #include "tensorflow/core/distributed_runtime/rpc/async_service_interface.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_call.h"
+#include "tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/tracing.h"
 #include "tensorflow/core/protobuf/master.pb.h"
-#include "tensorflow/core/protobuf/master_service.grpc.pb.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.cc b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.cc
new file mode 100644
index 0000000000..770a0fcf14
--- /dev/null
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.cc
@@ -0,0 +1,164 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h"
+
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/channel_interface.h"
+#include "grpcpp/impl/codegen/client_unary_call.h"
+#include "grpcpp/impl/codegen/method_handler_impl.h"
+#include "grpcpp/impl/codegen/rpc_service_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
+
+namespace tensorflow {
+
+namespace grpc {
+
+static const char* grpcMasterService_method_names[] = {
+    "/tensorflow.MasterService/CreateSession",
+    "/tensorflow.MasterService/ExtendSession",
+    "/tensorflow.MasterService/PartialRunSetup",
+    "/tensorflow.MasterService/RunStep",
+    "/tensorflow.MasterService/CloseSession",
+    "/tensorflow.MasterService/ListDevices",
+    "/tensorflow.MasterService/Reset",
+    "/tensorflow.MasterService/MakeCallable",
+    "/tensorflow.MasterService/RunCallable",
+    "/tensorflow.MasterService/ReleaseCallable",
+};
+
+std::unique_ptr<MasterService::Stub> MasterService::NewStub(
+    const std::shared_ptr< ::grpc::ChannelInterface>& channel,
+    const ::grpc::StubOptions& options) {
+  std::unique_ptr<MasterService::Stub> stub(new MasterService::Stub(channel));
+  return stub;
+}
+
+MasterService::Stub::Stub(
+    const std::shared_ptr< ::grpc::ChannelInterface>& channel)
+    : channel_(channel),
+      rpcmethod_CreateSession_(grpcMasterService_method_names[0],
+                               ::grpc::internal::RpcMethod::NORMAL_RPC,
+                               channel),
+      rpcmethod_ExtendSession_(grpcMasterService_method_names[1],
+                               ::grpc::internal::RpcMethod::NORMAL_RPC,
+                               channel),
+      rpcmethod_PartialRunSetup_(grpcMasterService_method_names[2],
+                                 ::grpc::internal::RpcMethod::NORMAL_RPC,
+                                 channel),
+      rpcmethod_RunStep_(grpcMasterService_method_names[3],
+                         ::grpc::internal::RpcMethod::NORMAL_RPC, channel),
+      rpcmethod_CloseSession_(grpcMasterService_method_names[4],
+                              ::grpc::internal::RpcMethod::NORMAL_RPC, channel),
+      rpcmethod_ListDevices_(grpcMasterService_method_names[5],
+                             ::grpc::internal::RpcMethod::NORMAL_RPC, channel),
+      rpcmethod_Reset_(grpcMasterService_method_names[6],
+                       ::grpc::internal::RpcMethod::NORMAL_RPC, channel),
+      rpcmethod_MakeCallable_(grpcMasterService_method_names[7],
+                              ::grpc::internal::RpcMethod::NORMAL_RPC, channel),
+      rpcmethod_RunCallable_(grpcMasterService_method_names[8],
+                             ::grpc::internal::RpcMethod::NORMAL_RPC, channel),
+      rpcmethod_ReleaseCallable_(grpcMasterService_method_names[9],
+                                 ::grpc::internal::RpcMethod::NORMAL_RPC,
+                                 channel) {}
+
+::grpc::Status MasterService::Stub::CreateSession(
+    ::grpc::ClientContext* context, const CreateSessionRequest& request,
+    CreateSessionResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_CreateSession_, context, request, response);
+}
+
+::grpc::Status MasterService::Stub::ExtendSession(
+    ::grpc::ClientContext* context, const ExtendSessionRequest& request,
+    ExtendSessionResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_ExtendSession_, context, request, response);
+}
+
+::grpc::Status MasterService::Stub::PartialRunSetup(
+    ::grpc::ClientContext* context, const PartialRunSetupRequest& request,
+    PartialRunSetupResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_PartialRunSetup_, context, request, response);
+}
+
+::grpc::Status MasterService::Stub::RunStep(::grpc::ClientContext* context,
+                                            const RunStepRequest& request,
+                                            RunStepResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(channel_.get(), rpcmethod_RunStep_,
+                                             context, request, response);
+}
+
+::grpc::Status MasterService::Stub::CloseSession(
+    ::grpc::ClientContext* context, const CloseSessionRequest& request,
+    CloseSessionResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_CloseSession_, context, request, response);
+}
+
+::grpc::Status MasterService::Stub::ListDevices(
+    ::grpc::ClientContext* context, const ListDevicesRequest& request,
+    ListDevicesResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_ListDevices_, context, request, response);
+}
+
+::grpc::Status MasterService::Stub::Reset(::grpc::ClientContext* context,
+                                          const ResetRequest& request,
+                                          ResetResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(channel_.get(), rpcmethod_Reset_,
+                                             context, request, response);
+}
+
+::grpc::Status MasterService::Stub::MakeCallable(
+    ::grpc::ClientContext* context, const MakeCallableRequest& request,
+    MakeCallableResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_MakeCallable_, context, request, response);
+}
+
+::grpc::Status MasterService::Stub::RunCallable(
+    ::grpc::ClientContext* context, const RunCallableRequest& request,
+    RunCallableResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_RunCallable_, context, request, response);
+}
+
+::grpc::Status MasterService::Stub::ReleaseCallable(
+    ::grpc::ClientContext* context, const ReleaseCallableRequest& request,
+    ReleaseCallableResponse* response) {
+  return ::grpc::internal::BlockingUnaryCall(
+      channel_.get(), rpcmethod_ReleaseCallable_, context, request, response);
+}
+
+MasterService::AsyncService::AsyncService() {
+  int method_len = sizeof(grpcMasterService_method_names) / 
+                    sizeof(grpcMasterService_method_names[0]);
+  for (int i = 0; i < method_len; ++i) {
+    AddMethod(new ::grpc::internal::RpcServiceMethod(
+        grpcMasterService_method_names[i],
+        ::grpc::internal::RpcMethod::NORMAL_RPC, nullptr));
+    ::grpc::Service::MarkMethodAsync(i);
+  }
+}
+
+MasterService::AsyncService::~AsyncService() {}
+
+}  // namespace grpc
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h
new file mode 100644
index 0000000000..751f2633e7
--- /dev/null
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h
@@ -0,0 +1,224 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_MASTER_SERVICE_IMPL_H_
+#define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_MASTER_SERVICE_IMPL_H_
+
+#include "grpcpp/impl/codegen/async_stream.h"
+#include "grpcpp/impl/codegen/async_unary_call.h"
+#include "grpcpp/impl/codegen/proto_utils.h"
+#include "grpcpp/impl/codegen/rpc_method.h"
+#include "grpcpp/impl/codegen/service_type.h"
+#include "grpcpp/impl/codegen/status.h"
+#include "grpcpp/impl/codegen/stub_options.h"
+#include "grpcpp/impl/codegen/sync_stream.h"
+
+#include "tensorflow/core/protobuf/master.pb.h"
+
+namespace grpc {
+class CompletionQueue;
+class Channel;
+class RpcService;
+class ServerCompletionQueue;
+class ServerContext;
+}  // namespace grpc
+
+namespace tensorflow {
+
+namespace grpc {
+
+// Implementation of `tensorflow.MasterService`, based on the
+// definition in "//tensorflow/core/protobuf/master_service.proto",
+// and the gRPC generated stub and service classes.
+// See that file for the definition of methods and messages.
+class MasterService final {
+ public:
+  class StubInterface {
+   public:
+    virtual ~StubInterface() {}
+    virtual ::grpc::Status CreateSession(::grpc::ClientContext* context,
+                                         const CreateSessionRequest& request,
+                                         CreateSessionResponse* response) = 0;
+    virtual ::grpc::Status ExtendSession(::grpc::ClientContext* context,
+                                         const ExtendSessionRequest& request,
+                                         ExtendSessionResponse* response) = 0;
+    virtual ::grpc::Status PartialRunSetup(
+        ::grpc::ClientContext* context, const PartialRunSetupRequest& request,
+        PartialRunSetupResponse* response) = 0;
+    virtual ::grpc::Status RunStep(::grpc::ClientContext* context,
+                                   const RunStepRequest& request,
+                                   RunStepResponse* response) = 0;
+    virtual ::grpc::Status CloseSession(::grpc::ClientContext* context,
+                                        const CloseSessionRequest& request,
+                                        CloseSessionResponse* response) = 0;
+    virtual ::grpc::Status ListDevices(::grpc::ClientContext* context,
+                                       const ListDevicesRequest& request,
+                                       ListDevicesResponse* response) = 0;
+    virtual ::grpc::Status Reset(::grpc::ClientContext* context,
+                                 const ResetRequest& request,
+                                 ResetResponse* response) = 0;
+    virtual ::grpc::Status MakeCallable(::grpc::ClientContext* context,
+                                        const MakeCallableRequest& request,
+                                        MakeCallableResponse* response) = 0;
+    virtual ::grpc::Status RunCallable(::grpc::ClientContext* context,
+                                       const RunCallableRequest& request,
+                                       RunCallableResponse* response) = 0;
+    virtual ::grpc::Status ReleaseCallable(
+        ::grpc::ClientContext* context, const ReleaseCallableRequest& request,
+        ReleaseCallableResponse* response) = 0;
+  };
+  class Stub final : public StubInterface {
+   public:
+    Stub(const std::shared_ptr< ::grpc::ChannelInterface>& channel);
+    ::grpc::Status CreateSession(::grpc::ClientContext* context,
+                                 const CreateSessionRequest& request,
+                                 CreateSessionResponse* response) override;
+    ::grpc::Status ExtendSession(::grpc::ClientContext* context,
+                                 const ExtendSessionRequest& request,
+                                 ExtendSessionResponse* response) override;
+    ::grpc::Status PartialRunSetup(::grpc::ClientContext* context,
+                                   const PartialRunSetupRequest& request,
+                                   PartialRunSetupResponse* response) override;
+    ::grpc::Status RunStep(::grpc::ClientContext* context,
+                           const RunStepRequest& request,
+                           RunStepResponse* response) override;
+    ::grpc::Status CloseSession(::grpc::ClientContext* context,
+                                const CloseSessionRequest& request,
+                                CloseSessionResponse* response) override;
+    ::grpc::Status ListDevices(::grpc::ClientContext* context,
+                               const ListDevicesRequest& request,
+                               ListDevicesResponse* response) override;
+    ::grpc::Status Reset(::grpc::ClientContext* context,
+                         const ResetRequest& request,
+                         ResetResponse* response) override;
+    ::grpc::Status MakeCallable(::grpc::ClientContext* context,
+                                const MakeCallableRequest& request,
+                                MakeCallableResponse* response) override;
+    ::grpc::Status RunCallable(::grpc::ClientContext* context,
+                               const RunCallableRequest& request,
+                               RunCallableResponse* response) override;
+    ::grpc::Status ReleaseCallable(::grpc::ClientContext* context,
+                                   const ReleaseCallableRequest& request,
+                                   ReleaseCallableResponse* response) override;
+
+   private:
+    std::shared_ptr< ::grpc::ChannelInterface> channel_;
+    const ::grpc::internal::RpcMethod rpcmethod_CreateSession_;
+    const ::grpc::internal::RpcMethod rpcmethod_ExtendSession_;
+    const ::grpc::internal::RpcMethod rpcmethod_PartialRunSetup_;
+    const ::grpc::internal::RpcMethod rpcmethod_RunStep_;
+    const ::grpc::internal::RpcMethod rpcmethod_CloseSession_;
+    const ::grpc::internal::RpcMethod rpcmethod_ListDevices_;
+    const ::grpc::internal::RpcMethod rpcmethod_Reset_;
+    const ::grpc::internal::RpcMethod rpcmethod_MakeCallable_;
+    const ::grpc::internal::RpcMethod rpcmethod_RunCallable_;
+    const ::grpc::internal::RpcMethod rpcmethod_ReleaseCallable_;
+  };
+  static std::unique_ptr<Stub> NewStub(
+      const std::shared_ptr< ::grpc::ChannelInterface>& channel,
+      const ::grpc::StubOptions& options = ::grpc::StubOptions());
+
+  class AsyncService : public ::grpc::Service {
+   public:
+    AsyncService();
+    virtual ~AsyncService();
+    void RequestCreateSession(
+        ::grpc::ServerContext* context, CreateSessionRequest* request,
+        ::grpc::ServerAsyncResponseWriter<CreateSessionResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(0, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestExtendSession(
+        ::grpc::ServerContext* context, ExtendSessionRequest* request,
+        ::grpc::ServerAsyncResponseWriter<ExtendSessionResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(1, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestPartialRunSetup(
+        ::grpc::ServerContext* context, PartialRunSetupRequest* request,
+        ::grpc::ServerAsyncResponseWriter<PartialRunSetupResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(2, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestRunStep(
+        ::grpc::ServerContext* context, RunStepRequest* request,
+        ::grpc::ServerAsyncResponseWriter<RunStepResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(3, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestCloseSession(
+        ::grpc::ServerContext* context, CloseSessionRequest* request,
+        ::grpc::ServerAsyncResponseWriter<CloseSessionResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(4, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestListDevices(
+        ::grpc::ServerContext* context, ListDevicesRequest* request,
+        ::grpc::ServerAsyncResponseWriter<ListDevicesResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(5, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestReset(
+        ::grpc::ServerContext* context, ResetRequest* request,
+        ::grpc::ServerAsyncResponseWriter<ResetResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(6, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestMakeCallable(
+        ::grpc::ServerContext* context, MakeCallableRequest* request,
+        ::grpc::ServerAsyncResponseWriter<MakeCallableResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(7, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestRunCallable(
+        ::grpc::ServerContext* context, RunCallableRequest* request,
+        ::grpc::ServerAsyncResponseWriter<RunCallableResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(8, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+    void RequestReleaseCallable(
+        ::grpc::ServerContext* context, ReleaseCallableRequest* request,
+        ::grpc::ServerAsyncResponseWriter<ReleaseCallableResponse>* response,
+        ::grpc::CompletionQueue* new_call_cq,
+        ::grpc::ServerCompletionQueue* notification_cq, void* tag) {
+      ::grpc::Service::RequestAsyncUnary(9, context, request, response,
+                                         new_call_cq, notification_cq, tag);
+    }
+  };
+};
+
+}  // namespace grpc
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RPC_GRPC_MASTER_SERVICE_IMPL_H_
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc b/tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc
index 6c2940553c..b832a2115c 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc
@@ -19,13 +19,13 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/call_options.h"
 #include "tensorflow/core/distributed_runtime/master_interface.h"
+#include "tensorflow/core/distributed_runtime/rpc/grpc_master_service_impl.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/tracing.h"
 #include "tensorflow/core/protobuf/master.pb.h"
-#include "tensorflow/core/protobuf/master_service.grpc.pb.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
index 2c833d11a9..8a6903be9e 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
@@ -152,16 +152,14 @@ Status GrpcServer::Init(
                                        " was not defined in job \"",
                                        server_def_.job_name(), "\"");
       }
-      const std::vector<string> hostname_port =
-          str_util::Split(iter->second, ':');
-      if (hostname_port.size() != 2 ||
-          !strings::safe_strto32(hostname_port[1], &requested_port)) {
+      auto colon_index = iter->second.find_last_of(':');
+      if (!strings::safe_strto32(iter->second.substr(colon_index + 1),
+                                 &requested_port)) {
         return errors::InvalidArgument(
             "Could not parse port for local server from \"", iter->second,
-            "\"");
-      } else {
-        break;
+            "\".");
       }
+      break;
     }
   }
   if (requested_port == -1) {
@@ -246,6 +244,7 @@ Status GrpcServer::Init(
   // Finish setting up master environment.
   master_env_.ops = OpRegistry::Global();
   master_env_.worker_cache = worker_cache;
+  master_env_.collective_executor_mgr = worker_env_.collective_executor_mgr;
   master_env_.master_session_factory =
       [config, stats_factory](
           SessionOptions options, const MasterEnv* env,
@@ -343,11 +342,13 @@ Status GrpcServer::WorkerCacheFactory(const WorkerCacheFactoryOptions& options,
   const string host_port = channel_cache_->TranslateTask(name_prefix);
   int requested_port;
 
-  if (!strings::safe_strto32(str_util::Split(host_port, ':')[1],
+  auto colon_index = host_port.find_last_of(':');
+  if (!strings::safe_strto32(host_port.substr(colon_index + 1),
                              &requested_port)) {
     return errors::Internal("Could not parse port for local server from \"",
-                            channel_cache_->TranslateTask(name_prefix), "\".");
+                            host_port, "\".");
   }
+
   if (requested_port != bound_port_) {
     return errors::InvalidArgument("Requested port ", requested_port,
                                    " differs from expected port ", bound_port_);
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_session.cc b/tensorflow/core/distributed_runtime/rpc/grpc_session.cc
index fd1c150fa7..fdce1b10e0 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_session.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_session.cc
@@ -452,15 +452,12 @@ class GrpcSessionFactory : public SessionFactory {
     return str_util::StartsWith(options.target, kSchemePrefix);
   }
 
-  Session* NewSession(const SessionOptions& options) override {
-    std::unique_ptr<GrpcSession> ret;
-    Status s = GrpcSession::Create(options, &ret);
-    if (s.ok()) {
-      return ret.release();
-    } else {
-      LOG(ERROR) << "Error during session construction: " << s.ToString();
-      return nullptr;
-    }
+  Status NewSession(const SessionOptions& options,
+                    Session** out_session) override {
+    std::unique_ptr<GrpcSession> session;
+    TF_RETURN_IF_ERROR(GrpcSession::Create(options, &session));
+    *out_session = session.release();
+    return Status::OK();
   }
 
   // Invokes the session specific static method to reset containers.
diff --git a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc
index 5eeed6e382..45b989f6e2 100644
--- a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc
+++ b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.cc
@@ -99,6 +99,32 @@ void RpcCollectiveExecutorMgr::RefreshStepIdSequenceAsync(
   }
 }
 
+void RpcCollectiveExecutorMgr::GetStepSequenceAsync(
+    const GetStepSequenceRequest* request, GetStepSequenceResponse* response,
+    const StatusCallback& done) {
+  if (!group_leader_.empty()) {
+    LOG(ERROR) << "GetStepSequence called at non-group-leader";
+    done(errors::Internal("GetStepSequenceAsync called at non-group-leader"));
+  } else {
+    mutex_lock l(sequence_mu_);
+    for (int64 graph_key : request->graph_key()) {
+      auto it = sequence_table_.find(graph_key);
+      GraphKeySequence* gks = nullptr;
+      if (it == sequence_table_.end()) {
+        gks = new GraphKeySequence(graph_key);
+        gks->next_step_id_ = NewRandomStepId();
+        sequence_table_[graph_key] = gks;
+      } else {
+        gks = it->second;
+      }
+      StepSequence* ss = response->add_step_sequence();
+      ss->set_graph_key(graph_key);
+      ss->set_next_step_id(gks->next_step_id_);
+    }
+    done(Status::OK());
+  }
+}
+
 Status RpcCollectiveExecutorMgr::UpdateStepSequences(
     const GetStepSequenceResponse& resp) {
   mutex_lock l(sequence_mu_);
diff --git a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h
index e9f3f0ebe8..c9581fa00f 100644
--- a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h
+++ b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr.h
@@ -42,6 +42,12 @@ class RpcCollectiveExecutorMgr : public CollectiveExecutorMgr {
 
   virtual ~RpcCollectiveExecutorMgr();
 
+  // This function should only be called at the group_leader, by an RPC.
+  // Other needs for StepIds should be satisfied by NextStepId.
+  void GetStepSequenceAsync(const GetStepSequenceRequest* request,
+                            GetStepSequenceResponse* response,
+                            const StatusCallback& done) override;
+
   void RefreshStepIdSequenceAsync(int64 graph_key,
                                   const StatusCallback& done) override;
 
diff --git a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr_test.cc b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr_test.cc
index 37b83d82be..0323300fdd 100644
--- a/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr_test.cc
+++ b/tensorflow/core/distributed_runtime/rpc_collective_executor_mgr_test.cc
@@ -26,6 +26,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/protobuf/worker.pb.h"
 #include "tensorflow/core/public/session_options.h"
 
 namespace tensorflow {
@@ -121,4 +122,50 @@ TEST_F(RpcCollectiveExecutorMgrTest, NextStepId) {
   EXPECT_GT(llabs(y - z), 3);
 }
 
+TEST_F(RpcCollectiveExecutorMgrTest, GetStepSequence) {
+  int64 x = cme_->NextStepId(3);
+  EXPECT_EQ(x, CollectiveExecutor::kInvalidId);
+  int64 y = cme_->NextStepId(4);
+  EXPECT_EQ(y, CollectiveExecutor::kInvalidId);
+  GetStepSequenceRequest request;
+  GetStepSequenceResponse response;
+  request.add_graph_key(3);
+  request.add_graph_key(4);
+  {
+    Notification note;
+    Status status;
+    cme_->GetStepSequenceAsync(&request, &response,
+                               [this, &status, &note](const Status& s) {
+                                 status = s;
+                                 note.Notify();
+                               });
+    note.WaitForNotification();
+    EXPECT_TRUE(status.ok());
+  }
+  ASSERT_EQ(2, response.step_sequence_size());
+  std::unordered_map<int64, int64> values;
+  for (const auto& ss : response.step_sequence()) {
+    values[ss.graph_key()] = ss.next_step_id();
+  }
+  EXPECT_NE(values[3], CollectiveExecutor::kInvalidId);
+  EXPECT_NE(values[4], CollectiveExecutor::kInvalidId);
+  // Re-get, should be same values.
+  response.Clear();
+  {
+    Notification note;
+    Status status;
+    cme_->GetStepSequenceAsync(&request, &response,
+                               [this, &status, &note](const Status& s) {
+                                 status = s;
+                                 note.Notify();
+                               });
+    note.WaitForNotification();
+    EXPECT_TRUE(status.ok());
+  }
+  ASSERT_EQ(2, response.step_sequence_size());
+  for (const auto& ss : response.step_sequence()) {
+    EXPECT_EQ(values[ss.graph_key()], ss.next_step_id());
+  }
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/allocator.cc b/tensorflow/core/framework/allocator.cc
index 1c62d37955..888ed0c57b 100644
--- a/tensorflow/core/framework/allocator.cc
+++ b/tensorflow/core/framework/allocator.cc
@@ -91,6 +91,11 @@ void EnableCPUAllocatorFullStats(bool enable) {
   cpu_allocator_collect_full_stats = enable;
 }
 
+namespace {
+// A default Allocator for CPU devices.  ProcessState::GetCPUAllocator() will
+// return a different version that may perform better, but may also lack the
+// optional stats triggered by the functions above.  TODO(tucker): migrate all
+// uses of cpu_allocator() except tests to use ProcessState instead.
 class CPUAllocator : public Allocator {
  public:
   CPUAllocator()
@@ -170,14 +175,42 @@ class CPUAllocator : public Allocator {
   TF_DISALLOW_COPY_AND_ASSIGN(CPUAllocator);
 };
 
+class CPUAllocatorFactory : public AllocatorFactory {
+ public:
+  Allocator* CreateAllocator() override { return new CPUAllocator; }
+
+  SubAllocator* CreateSubAllocator(int numa_node) override {
+    return new CPUSubAllocator(new CPUAllocator);
+  }
+
+ private:
+  class CPUSubAllocator : public SubAllocator {
+   public:
+    explicit CPUSubAllocator(CPUAllocator* cpu_allocator)
+        : cpu_allocator_(cpu_allocator) {}
+
+    void* Alloc(size_t alignment, size_t num_bytes) override {
+      return cpu_allocator_->AllocateRaw(alignment, num_bytes);
+    }
+
+    void Free(void* ptr, size_t num_bytes) override {
+      cpu_allocator_->DeallocateRaw(ptr);
+    }
+
+   private:
+    CPUAllocator* cpu_allocator_;
+  };
+};
+
+REGISTER_MEM_ALLOCATOR("DefaultCPUAllocator", 100, CPUAllocatorFactory);
+}  // namespace
+
 Allocator* cpu_allocator() {
-  static Allocator* cpu_alloc = AllocatorRegistry::Global()->GetAllocator();
+  static Allocator* cpu_alloc =
+      AllocatorFactoryRegistry::singleton()->GetAllocator();
   if (cpu_allocator_collect_full_stats && !cpu_alloc->TracksAllocationSizes()) {
     cpu_alloc = new TrackingAllocator(cpu_alloc, true);
   }
   return cpu_alloc;
 }
-
-REGISTER_MEM_ALLOCATOR("DefaultCPUAllocator", 100, CPUAllocator);
-
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/allocator.h b/tensorflow/core/framework/allocator.h
index 2bb4d32d57..774b1fe137 100644
--- a/tensorflow/core/framework/allocator.h
+++ b/tensorflow/core/framework/allocator.h
@@ -376,16 +376,18 @@ struct AllocatorAttributes {
   int32 scope_id = 0;
 };
 
-// Returns a trivial implementation of Allocator which uses the system
-// default malloc. The returned allocator is a process singleton.
+// Returns a trivial implementation of Allocator, which is a process singleton.
+// Access through this function is only intended for use in tests and auxiliary
+// processing.  Performance sensitive uses should always obtain allocators from
+// ProcessState.
 Allocator* cpu_allocator();
 
-// If 'enable' is true, the process-wide cpu allocator collects
+// If 'enable' is true, the default CPU allocator implementation will collect
 // AllocatorStats. By default, it's disabled.
 void EnableCPUAllocatorStats(bool enable);
 
-// If 'enable' is true, the process-wide cpu allocator collects full
-// statistics. By default, it's disabled.
+// If 'enable' is true, the default CPU allocator implementation will collect
+// full statistics. By default, it's disabled.
 void EnableCPUAllocatorFullStats(bool enable);
 
 // Abstract interface of an object that does the underlying suballoc/free of
diff --git a/tensorflow/core/framework/allocator_registry.cc b/tensorflow/core/framework/allocator_registry.cc
index 486be39ae3..099c4bacc8 100644
--- a/tensorflow/core/framework/allocator_registry.cc
+++ b/tensorflow/core/framework/allocator_registry.cc
@@ -21,60 +21,110 @@ limitations under the License.
 namespace tensorflow {
 
 // static
-AllocatorRegistry* AllocatorRegistry::Global() {
-  static AllocatorRegistry* global_allocator_registry = new AllocatorRegistry;
-  return global_allocator_registry;
+AllocatorFactoryRegistry* AllocatorFactoryRegistry::singleton() {
+  static AllocatorFactoryRegistry* singleton = new AllocatorFactoryRegistry;
+  return singleton;
 }
 
-Allocator* AllocatorRegistry::GetRegisteredAllocator(const string& name,
-                                                     int priority) {
-  for (auto entry : allocators_) {
+const AllocatorFactoryRegistry::FactoryEntry*
+AllocatorFactoryRegistry::FindEntry(const string& name, int priority) const {
+  for (auto& entry : factories_) {
     if (!name.compare(entry.name) && priority == entry.priority) {
-      return entry.allocator;
+      return &entry;
     }
   }
   return nullptr;
 }
 
-void AllocatorRegistry::Register(const string& name, int priority,
-                                 Allocator* allocator) {
+void AllocatorFactoryRegistry::Register(const char* source_file,
+                                        int source_line, const string& name,
+                                        int priority,
+                                        AllocatorFactory* factory) {
+  mutex_lock l(mu_);
+  CHECK(!first_alloc_made_) << "Attempt to register an AllocatorFactory "
+                            << "after call to GetAllocator()";
   CHECK(!name.empty()) << "Need a valid name for Allocator";
   CHECK_GE(priority, 0) << "Priority needs to be non-negative";
 
-  Allocator* existing = GetRegisteredAllocator(name, priority);
+  const FactoryEntry* existing = FindEntry(name, priority);
   if (existing != nullptr) {
-    // A duplicate is if the registration name and priority match
-    // but the Allocator::Name()'s don't match.
-    CHECK_EQ(existing->Name(), allocator->Name())
-        << "Allocator with name: [" << name << "], type [" << existing->Name()
-        << "], priority: [" << priority
-        << "] already registered.  Choose a different name to register "
-        << "an allocator of type " << allocator->Name();
-
-    // The allocator names match, so we can just return.
-    // It should be safe to delete the allocator since the caller
-    // gives up ownership of it.
-    delete allocator;
-    return;
+    // Duplicate registration is a hard failure.
+    LOG(FATAL) << "New registration for AllocatorFactory with name=" << name
+               << " priority=" << priority << " at location " << source_file
+               << ":" << source_line
+               << " conflicts with previous registration at location "
+               << existing->source_file << ":" << existing->source_line;
   }
 
-  AllocatorRegistryEntry tmp_entry;
-  tmp_entry.name = name;
-  tmp_entry.priority = priority;
-  tmp_entry.allocator = allocator;
+  FactoryEntry entry;
+  entry.source_file = source_file;
+  entry.source_line = source_line;
+  entry.name = name;
+  entry.priority = priority;
+  entry.factory.reset(factory);
+  factories_.push_back(std::move(entry));
+}
 
-  allocators_.push_back(tmp_entry);
-  int high_pri = -1;
-  for (auto entry : allocators_) {
-    if (high_pri < entry.priority) {
-      m_curr_allocator_ = entry.allocator;
-      high_pri = entry.priority;
+Allocator* AllocatorFactoryRegistry::GetAllocator() {
+  mutex_lock l(mu_);
+  first_alloc_made_ = true;
+  FactoryEntry* best_entry = nullptr;
+  for (auto& entry : factories_) {
+    if (best_entry == nullptr) {
+      best_entry = &entry;
+    } else if (entry.priority > best_entry->priority) {
+      best_entry = &entry;
     }
   }
+  if (best_entry) {
+    if (!best_entry->allocator) {
+      best_entry->allocator.reset(best_entry->factory->CreateAllocator());
+    }
+    return best_entry->allocator.get();
+  } else {
+    LOG(FATAL) << "No registered CPU AllocatorFactory";
+    return nullptr;
+  }
 }
 
-Allocator* AllocatorRegistry::GetAllocator() {
-  return CHECK_NOTNULL(m_curr_allocator_);
+SubAllocator* AllocatorFactoryRegistry::GetSubAllocator(int numa_node) {
+  mutex_lock l(mu_);
+  first_alloc_made_ = true;
+  FactoryEntry* best_entry = nullptr;
+  for (auto& entry : factories_) {
+    if (best_entry == nullptr) {
+      best_entry = &entry;
+    } else if (best_entry->factory->NumaEnabled()) {
+      if (entry.factory->NumaEnabled() &&
+          (entry.priority > best_entry->priority)) {
+        best_entry = &entry;
+      }
+    } else {
+      DCHECK(!best_entry->factory->NumaEnabled());
+      if (entry.factory->NumaEnabled() ||
+          (entry.priority > best_entry->priority)) {
+        best_entry = &entry;
+      }
+    }
+  }
+  if (best_entry) {
+    int index = 0;
+    if (numa_node != port::kNUMANoAffinity) {
+      CHECK_LE(numa_node, port::NUMANumNodes());
+      index = 1 + numa_node;
+    }
+    if (best_entry->sub_allocators.size() < (index + 1)) {
+      best_entry->sub_allocators.resize(index + 1);
+    }
+    if (!best_entry->sub_allocators[index].get()) {
+      best_entry->sub_allocators[index].reset(
+          best_entry->factory->CreateSubAllocator(numa_node));
+    }
+    return best_entry->sub_allocators[index].get();
+  } else {
+    LOG(FATAL) << "No registered CPU AllocatorFactory";
+    return nullptr;
+  }
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/allocator_registry.h b/tensorflow/core/framework/allocator_registry.h
index b26e79ac3b..24f282ce84 100644
--- a/tensorflow/core/framework/allocator_registry.h
+++ b/tensorflow/core/framework/allocator_registry.h
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-// Classes to maintain a static registry of memory allocators
+// Classes to maintain a static registry of memory allocator factories.
 #ifndef TENSORFLOW_CORE_FRAMEWORK_ALLOCATOR_REGISTRY_H_
 #define TENSORFLOW_CORE_FRAMEWORK_ALLOCATOR_REGISTRY_H_
 
@@ -21,59 +21,100 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/platform/numa.h"
 
 namespace tensorflow {
 
-// A global AllocatorRegistry is used to hold allocators for CPU backends
-class AllocatorRegistry {
+class AllocatorFactory {
  public:
-  // Add an allocator to the registry.  Caller releases ownership of
-  // 'allocator'.
-  void Register(const string& name, int priority, Allocator* allocator);
+  virtual ~AllocatorFactory() {}
 
-  // Return allocator with highest priority
-  // If multiple allocators have the same high priority, return one of them
+  // Returns true if the factory will create a functionally different
+  // SubAllocator for different (legal) values of numa_node.
+  virtual bool NumaEnabled() { return false; }
+
+  // Create an Allocator.
+  virtual Allocator* CreateAllocator() = 0;
+
+  // Create a SubAllocator. If NumaEnabled() is true, then returned SubAllocator
+  // will allocate memory local to numa_node.  If numa_node == kNUMANoAffinity
+  // then allocated memory is not specific to any NUMA node.
+  virtual SubAllocator* CreateSubAllocator(int numa_node) = 0;
+};
+
+// A singleton registry of AllocatorFactories.
+//
+// Allocators should be obtained through ProcessState or cpu_allocator()
+// (deprecated), not directly through this interface.  The purpose of this
+// registry is to allow link-time discovery of multiple AllocatorFactories among
+// which ProcessState will obtain the best fit at startup.
+class AllocatorFactoryRegistry {
+ public:
+  AllocatorFactoryRegistry() {}
+  ~AllocatorFactoryRegistry() {}
+
+  void Register(const char* source_file, int source_line, const string& name,
+                int priority, AllocatorFactory* factory);
+
+  // Returns 'best fit' Allocator.  Find the factory with the highest priority
+  // and return an allocator constructed by it.  If multiple factories have
+  // been registered with the same priority, picks one by unspecified criteria.
   Allocator* GetAllocator();
 
-  // Returns the global registry of allocators.
-  static AllocatorRegistry* Global();
+  // Returns 'best fit' SubAllocator.  First look for the highest priority
+  // factory that is NUMA-enabled.  If none is registered, fall back to the
+  // highest priority non-NUMA-enabled factory.  If NUMA-enabled, return a
+  // SubAllocator specific to numa_node, otherwise return a NUMA-insensitive
+  // SubAllocator.
+  SubAllocator* GetSubAllocator(int numa_node);
+
+  // Returns the singleton value.
+  static AllocatorFactoryRegistry* singleton();
 
  private:
-  typedef struct {
+  mutex mu_;
+  bool first_alloc_made_ = false;
+  struct FactoryEntry {
+    const char* source_file;
+    int source_line;
     string name;
     int priority;
-    Allocator* allocator;  // not owned
-  } AllocatorRegistryEntry;
-
-  // Returns the Allocator registered for 'name' and 'priority',
-  // or 'nullptr' if not found.
-  Allocator* GetRegisteredAllocator(const string& name, int priority);
-
-  std::vector<AllocatorRegistryEntry> allocators_;
-  Allocator* m_curr_allocator_;  // not owned
+    std::unique_ptr<AllocatorFactory> factory;
+    std::unique_ptr<Allocator> allocator;
+    // Index 0 corresponds to kNUMANoAffinity, other indices are (numa_node +
+    // 1).
+    std::vector<std::unique_ptr<SubAllocator>> sub_allocators;
+  };
+  std::vector<FactoryEntry> factories_ GUARDED_BY(mu_);
+
+  // Returns any FactoryEntry registered under 'name' and 'priority',
+  // or 'nullptr' if none found.
+  const FactoryEntry* FindEntry(const string& name, int priority) const
+      EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  TF_DISALLOW_COPY_AND_ASSIGN(AllocatorFactoryRegistry);
 };
 
-namespace allocator_registration {
-
-class AllocatorRegistration {
+class AllocatorFactoryRegistration {
  public:
-  AllocatorRegistration(const string& name, int priority,
-                        Allocator* allocator) {
-    AllocatorRegistry::Global()->Register(name, priority, allocator);
+  AllocatorFactoryRegistration(const char* file, int line, const string& name,
+                               int priority, AllocatorFactory* factory) {
+    AllocatorFactoryRegistry::singleton()->Register(file, line, name, priority,
+                                                    factory);
   }
 };
 
-}  // namespace allocator_registration
-
-#define REGISTER_MEM_ALLOCATOR(name, priority, allocator) \
-  REGISTER_MEM_ALLOCATOR_UNIQ_HELPER(__COUNTER__, name, priority, allocator)
+#define REGISTER_MEM_ALLOCATOR(name, priority, factory)                     \
+  REGISTER_MEM_ALLOCATOR_UNIQ_HELPER(__COUNTER__, __FILE__, __LINE__, name, \
+                                     priority, factory)
 
-#define REGISTER_MEM_ALLOCATOR_UNIQ_HELPER(ctr, name, priority, allocator) \
-  REGISTER_MEM_ALLOCATOR_UNIQ(ctr, name, priority, allocator)
+#define REGISTER_MEM_ALLOCATOR_UNIQ_HELPER(ctr, file, line, name, priority, \
+                                           factory)                         \
+  REGISTER_MEM_ALLOCATOR_UNIQ(ctr, file, line, name, priority, factory)
 
-#define REGISTER_MEM_ALLOCATOR_UNIQ(ctr, name, priority, allocator) \
-  static allocator_registration::AllocatorRegistration              \
-      register_allocator_##ctr(name, priority, new allocator)
+#define REGISTER_MEM_ALLOCATOR_UNIQ(ctr, file, line, name, priority, factory) \
+  static AllocatorFactoryRegistration allocator_factory_reg_##ctr(            \
+      file, line, name, priority, new factory)
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/framework/api_def.proto b/tensorflow/core/framework/api_def.proto
index c6cda06342..f8553cf5bb 100644
--- a/tensorflow/core/framework/api_def.proto
+++ b/tensorflow/core/framework/api_def.proto
@@ -64,12 +64,6 @@ message ApiDef {
     // to use a non-deprecated endpoint instead will be printed. If all
     // endpoints are deprecated, set deprecation_message in ApiDef instead.
     bool deprecated = 3;
-    // Deprecated: set deprecated to "true" instead. We can auto-generate
-    // the message.
-    // If this endpoint is deprecated, set deprecation_message to a
-    // message that should be logged when the endpoint is used.
-    // The message should indicate alternative endpoint to use, if any.
-    string deprecation_message = 2;
   }
   repeated Endpoint endpoint = 3;
 
diff --git a/tensorflow/core/framework/bfloat16_test.cc b/tensorflow/core/framework/bfloat16_test.cc
index 206396a25a..0a1b5e1975 100644
--- a/tensorflow/core/framework/bfloat16_test.cc
+++ b/tensorflow/core/framework/bfloat16_test.cc
@@ -45,7 +45,8 @@ class Bfloat16Test : public ::testing::Test,
                      public ::testing::WithParamInterface<Bfloat16TestParam> {};
 
 TEST_P(Bfloat16Test, TruncateTest) {
-  bfloat16 truncated(GetParam().input);
+  bfloat16 truncated = bfloat16::truncate_to_bfloat16((GetParam().input));
+
   if (std::isnan(GetParam().input)) {
     EXPECT_TRUE(std::isnan(float(truncated)) || std::isinf(float(truncated)));
     return;
diff --git a/tensorflow/core/framework/common_shape_fns.cc b/tensorflow/core/framework/common_shape_fns.cc
index ed3318d841..21c6940b62 100644
--- a/tensorflow/core/framework/common_shape_fns.cc
+++ b/tensorflow/core/framework/common_shape_fns.cc
@@ -1231,11 +1231,13 @@ Status ConcatV2Shape(InferenceContext* c) {
                            c->num_inputs() - 1 /* dim_index */);
 }
 
-Status BroadcastBinaryOpOutputShapeFn(InferenceContext* c, int output_index) {
-  ShapeHandle shape_x = c->input(0);
-  ShapeHandle shape_y = c->input(1);
+Status BroadcastBinaryOpOutputShapeFnHelper(InferenceContext* c,
+                                            ShapeHandle shape_x,
+                                            ShapeHandle shape_y,
+                                            ShapeHandle* out) {
+  CHECK_NOTNULL(out);
   if (!c->RankKnown(shape_x) || !c->RankKnown(shape_y)) {
-    c->set_output(0, c->UnknownShape());
+    *out = c->UnknownShape();
     return Status::OK();
   }
   const int32 rank_x = c->Rank(shape_x);
@@ -1293,7 +1295,7 @@ Status BroadcastBinaryOpOutputShapeFn(InferenceContext* c, int output_index) {
     }
   }
 
-  c->set_output(output_index, c->MakeShape(dims));
+  *out = c->MakeShape(dims);
   return Status::OK();
 }
 
diff --git a/tensorflow/core/framework/common_shape_fns.h b/tensorflow/core/framework/common_shape_fns.h
index 87bb133d92..2bedce1d6a 100644
--- a/tensorflow/core/framework/common_shape_fns.h
+++ b/tensorflow/core/framework/common_shape_fns.h
@@ -267,7 +267,22 @@ Status ConcatV2Shape(shape_inference::InferenceContext* c);
 
 // Shape function for binary operators that broadcast their inputs
 // and with output to output_index.
-Status BroadcastBinaryOpOutputShapeFn(InferenceContext* c, int output_index);
+// Note: out cannot be NULL.
+Status BroadcastBinaryOpOutputShapeFnHelper(InferenceContext* c,
+                                            ShapeHandle shape_x,
+                                            ShapeHandle shape_y,
+                                            ShapeHandle* out);
+
+// Shape function for binary operators that broadcast their inputs
+// and with output to output_index.
+inline Status BroadcastBinaryOpOutputShapeFn(InferenceContext* c,
+                                             int output_index) {
+  ShapeHandle out;
+  TF_RETURN_IF_ERROR(
+      BroadcastBinaryOpOutputShapeFnHelper(c, c->input(0), c->input(1), &out));
+  c->set_output(output_index, out);
+  return Status::OK();
+}
 
 // Shape function for binary operators that broadcast their inputs.
 // Tested by ops/math_ops_test.cc.
diff --git a/tensorflow/core/framework/dataset.h b/tensorflow/core/framework/dataset.h
index d8618f391e..8cf84afedb 100644
--- a/tensorflow/core/framework/dataset.h
+++ b/tensorflow/core/framework/dataset.h
@@ -498,28 +498,24 @@ class GraphDatasetBase : public DatasetBase {
 };
 
 // Represents an iterator that is associated with a particular parent dataset.
-template <class DatasetType>
-class DatasetIterator : public IteratorBase {
+class DatasetBaseIterator : public IteratorBase {
  public:
-  struct Params {
-    // Owns one reference on the shared dataset resource.
-    const DatasetType* dataset;
+  struct BaseParams {
+    // Owns one reference on the shared dataset object.
+    const DatasetBase* dataset;
 
     // Identifies the sequence of iterators leading up to this iterator.
     const string prefix;
   };
 
-  explicit DatasetIterator(const Params& params) : params_(params) {
+  explicit DatasetBaseIterator(const BaseParams& params) : params_(params) {
     params_.dataset->Ref();
   }
 
-  ~DatasetIterator() override { params_.dataset->Unref(); }
-
-  // The dataset from which this iterator was created.
-  const DatasetType* dataset() const { return params_.dataset; }
+  ~DatasetBaseIterator() override { params_.dataset->Unref(); }
 
   // The sequence of iterators leading up to this iterator.
-  const string prefix() const { return params_.prefix; }
+  const string& prefix() const { return params_.prefix; }
 
   const DataTypeVector& output_dtypes() const override {
     return params_.dataset->output_dtypes();
@@ -545,7 +541,7 @@ class DatasetIterator : public IteratorBase {
   }
 
   Status Save(OpKernelContext* ctx, IteratorStateWriter* writer) final {
-    TF_RETURN_IF_ERROR(dataset()->Save(ctx, writer));
+    TF_RETURN_IF_ERROR(params_.dataset->Save(ctx, writer));
     return IteratorBase::Save(ctx, writer);
   }
 
@@ -556,11 +552,40 @@ class DatasetIterator : public IteratorBase {
                                  bool* end_of_sequence) = 0;
 
   string full_name(const string& name) const {
-    return strings::StrCat(prefix(), ":", name);
+    return strings::StrCat(params_.prefix, ":", name);
   }
 
  private:
-  Params params_;
+  BaseParams params_;
+};
+
+// Represents an iterator that is associated with a particular parent dataset
+// with a particular type.
+template <class DatasetType>
+class DatasetIterator : public DatasetBaseIterator {
+ public:
+  struct Params {
+    // Borrowed pointer to the parent dataset.
+    const DatasetType* dataset;
+
+    // Identifies the sequence of iterators leading up to this iterator.
+    const string prefix;
+  };
+
+  explicit DatasetIterator(const Params& params)
+      : DatasetBaseIterator({params.dataset, params.prefix}),
+        typed_dataset_(params.dataset) {}
+
+  // The dataset from which this iterator was created.
+  const DatasetType* dataset() const { return typed_dataset_; }
+
+ protected:
+  virtual Status GetNextInternal(IteratorContext* ctx,
+                                 std::vector<Tensor>* out_tensors,
+                                 bool* end_of_sequence) = 0;
+
+ private:
+  const DatasetType* const typed_dataset_;  // Not owned.
 };
 
 // Encapsulates the work required to plug a DatasetBase into the core TensorFlow
diff --git a/tensorflow/core/framework/device_base.h b/tensorflow/core/framework/device_base.h
index 922d34fac9..b184fd91e1 100644
--- a/tensorflow/core/framework/device_base.h
+++ b/tensorflow/core/framework/device_base.h
@@ -184,9 +184,7 @@ class DeviceBase {
 
   virtual ScopedAllocatorMgr* GetScopedAllocatorMgr() const { return nullptr; }
 
-  const bool has_eigen_cpu_device() const {
-    return !eigen_cpu_devices_.empty();
-  }
+  bool has_eigen_cpu_device() const { return !eigen_cpu_devices_.empty(); }
 
   virtual const Eigen::ThreadPoolDevice* eigen_cpu_device();
 
diff --git a/tensorflow/core/framework/function.cc b/tensorflow/core/framework/function.cc
index 88d9d65f5a..57bcc0f513 100644
--- a/tensorflow/core/framework/function.cc
+++ b/tensorflow/core/framework/function.cc
@@ -865,12 +865,15 @@ Status FunctionCallFrame::GetRetvals(std::vector<Tensor>* rets) const {
   return Status::OK();
 }
 
-Status FunctionCallFrame::ConsumeRetvals(std::vector<Tensor>* rets) {
+Status FunctionCallFrame::ConsumeRetvals(std::vector<Tensor>* rets,
+                                         bool allow_dead_tensors) {
   rets->clear();
   rets->reserve(rets_.size());
   for (size_t i = 0; i < rets_.size(); ++i) {
     if (rets_[i].has_val) {
       rets->emplace_back(std::move(rets_[i].val));
+    } else if (allow_dead_tensors) {
+      rets->emplace_back();
     } else {
       return errors::Internal("Retval[", i, "] does not have value");
     }
diff --git a/tensorflow/core/framework/function.h b/tensorflow/core/framework/function.h
index 8e607b927c..5da9af7db3 100644
--- a/tensorflow/core/framework/function.h
+++ b/tensorflow/core/framework/function.h
@@ -261,7 +261,10 @@ class FunctionCallFrame : public CallFrameInterface {
   // Caller methods.
   Status SetArgs(gtl::ArraySlice<Tensor> args);
   Status GetRetvals(std::vector<Tensor>* rets) const;
-  Status ConsumeRetvals(std::vector<Tensor>* rets);
+
+  // Moves the return values from the frame to rets. If allow_dead_tensors is
+  // false it will fail if any of the retvals do not have a value.
+  Status ConsumeRetvals(std::vector<Tensor>* rets, bool allow_dead_tensors);
 
   size_t num_args() const override { return arg_types_.size(); }
   size_t num_retvals() const override { return ret_types_.size(); }
@@ -510,6 +513,9 @@ class FunctionLibraryRuntime {
     // If true, we create a new IntraProcessRendezvous, else use the existing
     // one.
     bool create_rendezvous = false;
+
+    // If True, allow returning dead tensors.
+    bool allow_dead_tensors = false;
   };
   typedef std::function<void(const Status&)> DoneCallback;
   virtual void Run(const Options& opts, Handle handle,
diff --git a/tensorflow/core/framework/function_testlib.cc b/tensorflow/core/framework/function_testlib.cc
index 2b5a0fe1bb..a8eecc1a63 100644
--- a/tensorflow/core/framework/function_testlib.cc
+++ b/tensorflow/core/framework/function_testlib.cc
@@ -45,13 +45,12 @@ GraphDef GDef(gtl::ArraySlice<NodeDef> nodes,
 }
 
 // Helper to construct a NodeDef.
-NodeDef NDef(const string& name, const string& op,
-             gtl::ArraySlice<string> inputs,
+NodeDef NDef(StringPiece name, StringPiece op, gtl::ArraySlice<string> inputs,
              gtl::ArraySlice<std::pair<string, FDH::AttrValueWrapper>> attrs,
              const string& device) {
   NodeDef n;
-  n.set_name(name);
-  n.set_op(op);
+  n.set_name(name.ToString());
+  n.set_op(op.ToString());
   for (const auto& in : inputs) n.add_input(in);
   n.set_device(device);
   for (auto na : attrs) n.mutable_attr()->insert({na.first, na.second.proto});
diff --git a/tensorflow/core/framework/function_testlib.h b/tensorflow/core/framework/function_testlib.h
index b67c5cb1ab..8cf3c6a680 100644
--- a/tensorflow/core/framework/function_testlib.h
+++ b/tensorflow/core/framework/function_testlib.h
@@ -48,7 +48,7 @@ class Attrs {
 
 // Helper to construct a NodeDef.
 NodeDef NDef(
-    const string& name, const string& op, gtl::ArraySlice<string> inputs,
+    StringPiece name, StringPiece op, gtl::ArraySlice<string> inputs,
     gtl::ArraySlice<std::pair<string, FunctionDefHelper::AttrValueWrapper>>
         attrs = {},
     const string& device = "");
diff --git a/tensorflow/core/framework/memory_types.cc b/tensorflow/core/framework/memory_types.cc
index 270118bb67..6dff6fe654 100644
--- a/tensorflow/core/framework/memory_types.cc
+++ b/tensorflow/core/framework/memory_types.cc
@@ -60,13 +60,18 @@ void MemoryTypesHelper(const NameRangeMap& name_map,
   host_memory_args->resize(keep);
 }
 
+bool IsFunctionCallOp(const string& op_type) {
+  return op_type == "SymbolicGradient" || op_type == "PartitionedCall" ||
+         op_type == "StatefulPartitionedCall";
+}
+
+}  // namespace
+
 MemoryType MTypeFromDType(const DataType dtype) {
   return (dtype == DT_INT32 || DataTypeAlwaysOnHost(dtype)) ? HOST_MEMORY
                                                             : DEVICE_MEMORY;
 }
 
-}  // namespace
-
 Status MemoryTypesForNode(const OpRegistryInterface* op_registry,
                           const DeviceType& device_type, const NodeDef& ndef,
                           MemoryTypeVector* inp_mtypes,
@@ -94,7 +99,7 @@ Status MemoryTypesForNode(const OpRegistryInterface* op_registry,
   // TODO(zhifengc,phawkins): We should do type inference over function bodies
   // to derive the correct input/output memory types. We should also split
   // host-memory and non host-memory arguments into separate type lists.
-  if (!status.ok() || ndef.op() == "SymbolicGradient") {
+  if (!status.ok() || IsFunctionCallOp(ndef.op())) {
     for (const auto& t : inp_dtypes) inp_mtypes->push_back(MTypeFromDType(t));
     for (const auto& t : out_dtypes) out_mtypes->push_back(MTypeFromDType(t));
     return Status::OK();
diff --git a/tensorflow/core/framework/node_def_util.cc b/tensorflow/core/framework/node_def_util.cc
index e8ea904ebd..0bd79366eb 100644
--- a/tensorflow/core/framework/node_def_util.cc
+++ b/tensorflow/core/framework/node_def_util.cc
@@ -86,7 +86,8 @@ string AttrSlice::SummarizeNode() const {
 string SummarizeNode(const Node& node) { return SummarizeNodeDef(node.def()); }
 
 string SummarizeNodeDef(const NodeDef& node_def) {
-  string ret = strings::StrCat(node_def.name(), " = ", node_def.op(), "[");
+  string ret = strings::StrCat(FormatNodeDefForError(node_def), " = ",
+                               node_def.op(), "[");
   strings::StrAppend(&ret, SummarizeAttrsHelper(node_def, node_def.device()));
   strings::StrAppend(&ret, "](");
 
@@ -101,6 +102,14 @@ string SummarizeNodeDef(const NodeDef& node_def) {
   return ret;
 }
 
+string FormatNodeForError(const Node& node) {
+  return FormatNodeDefForError(node.def());
+}
+
+string FormatNodeDefForError(const NodeDef& node_def) {
+  return errors::FormatNodeNameForError(node_def.name());
+}
+
 const AttrValue* AttrSlice::Find(StringPiece attr_name) const {
   // Currently, the collection used for NodeDef::attr() (google::protobuf::Map)
   // requires that the keys used for lookups have type 'const string&'. Because
@@ -634,7 +643,7 @@ Status ValidateExternalNodeDefSyntax(const NodeDef& node_def) {
 Status AttachDef(const Status& status, const NodeDef& node_def) {
   Status ret = status;
   errors::AppendToMessage(
-      &ret, strings::StrCat(" [[Node: ", SummarizeNodeDef(node_def), "]]"));
+      &ret, strings::StrCat(" [[", SummarizeNodeDef(node_def), "]]"));
   return ret;
 }
 
diff --git a/tensorflow/core/framework/node_def_util.h b/tensorflow/core/framework/node_def_util.h
index 64c8b386e8..c012b7c3d3 100644
--- a/tensorflow/core/framework/node_def_util.h
+++ b/tensorflow/core/framework/node_def_util.h
@@ -50,6 +50,12 @@ extern const char* const kColocationGroupPrefix;
 string SummarizeNode(const Node& node);
 string SummarizeNodeDef(const NodeDef& node_def);
 
+// Produces a formatted string pattern from the node which can uniquely identify
+// this node upstream to produce an informative error message. The pattern
+// followed is: {{node <node_name>}}
+string FormatNodeForError(const Node& node);
+string FormatNodeDefForError(const NodeDef& node_def);
+
 typedef protobuf::Map<string, AttrValue> AttrValueMap;
 
 // Adds an attr with name <name> and value <value> to *node_def.
diff --git a/tensorflow/core/framework/node_def_util_test.cc b/tensorflow/core/framework/node_def_util_test.cc
index 35b7b2272b..74cc594863 100644
--- a/tensorflow/core/framework/node_def_util_test.cc
+++ b/tensorflow/core/framework/node_def_util_test.cc
@@ -20,6 +20,8 @@ limitations under the License.
 #include "tensorflow/core/framework/node_def_builder.h"
 #include "tensorflow/core/framework/op_def_builder.h"
 #include "tensorflow/core/framework/op_def_util.h"
+#include "tensorflow/core/graph/graph.h"
+#include "tensorflow/core/graph/node_builder.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/lib/strings/str_util.h"
@@ -79,7 +81,7 @@ TEST(NodeDefUtilTest, In) {
     )proto");
   ExpectSuccess(node_def, op);
 
-  EXPECT_EQ("n = In[T=DT_FLOAT](a)", SummarizeNodeDef(node_def));
+  EXPECT_EQ("{{node n}} = In[T=DT_FLOAT](a)", SummarizeNodeDef(node_def));
 
   // Mismatching Op names.
   NodeDef bad = node_def;
@@ -144,7 +146,7 @@ TEST(NodeDefUtilTest, Out) {
     )proto");
   ExpectSuccess(node_def, op);
 
-  EXPECT_EQ("n = Out[T=DT_INT32]()", SummarizeNodeDef(node_def));
+  EXPECT_EQ("{{node n}} = Out[T=DT_INT32]()", SummarizeNodeDef(node_def));
 
   // Non-number type.
   NodeDef bad = node_def;
@@ -164,7 +166,7 @@ TEST(NodeDefUtilTest, Enum) {
     )proto");
   ExpectSuccess(node_def, op);
 
-  EXPECT_EQ("n = Enum[e=\"apple\"]()", SummarizeNodeDef(node_def));
+  EXPECT_EQ("{{node n}} = Enum[e=\"apple\"]()", SummarizeNodeDef(node_def));
 
   NodeDef good = node_def;
   good.clear_attr();
@@ -191,7 +193,8 @@ TEST(NodeDefUtilTest, SameIn) {
     )proto");
   ExpectSuccess(node_def, op);
 
-  EXPECT_EQ("n = SameIn[N=2, T=DT_DOUBLE](a, b)", SummarizeNodeDef(node_def));
+  EXPECT_EQ("{{node n}} = SameIn[N=2, T=DT_DOUBLE](a, b)",
+            SummarizeNodeDef(node_def));
 
   // Illegal type
   NodeDef bad = ToNodeDef(R"proto(
@@ -220,7 +223,7 @@ TEST(NodeDefUtilTest, AnyIn) {
     )proto");
   ExpectSuccess(node_def, op);
 
-  EXPECT_EQ("n = AnyIn[T=[DT_INT32, DT_STRING]](a, b)",
+  EXPECT_EQ("{{node n}} = AnyIn[T=[DT_INT32, DT_STRING]](a, b)",
             SummarizeNodeDef(node_def));
 
   const NodeDef bad = ToNodeDef(R"proto(
@@ -243,13 +246,14 @@ TEST(NodeDefUtilTest, Device) {
   const NodeDef node_def1 =
       ToNodeDef(NodeDefBuilder("d", &op_def1).Device("/cpu:17"));
   ExpectSuccess(node_def1, op_def1);
-  EXPECT_EQ("d = None[_device=\"/cpu:17\"]()", SummarizeNodeDef(node_def1));
+  EXPECT_EQ("{{node d}} = None[_device=\"/cpu:17\"]()",
+            SummarizeNodeDef(node_def1));
 
   const OpDef op_def2 = ToOpDef(OpDefBuilder("WithAttr").Attr("v: int"));
   const NodeDef node_def2 =
       ToNodeDef(NodeDefBuilder("d", &op_def2).Attr("v", 7).Device("/cpu:5"));
   ExpectSuccess(node_def2, op_def2);
-  EXPECT_EQ("d = WithAttr[v=7, _device=\"/cpu:5\"]()",
+  EXPECT_EQ("{{node d}} = WithAttr[v=7, _device=\"/cpu:5\"]()",
             SummarizeNodeDef(node_def2));
 }
 
@@ -284,7 +288,7 @@ TEST(NodeDefUtilTest, ValidSyntax) {
     )proto");
   ExpectValidSyntax(node_def_explicit_inputs);
 
-  EXPECT_EQ("n = AnyIn[T=[DT_INT32, DT_STRING]](a:0, b:123)",
+  EXPECT_EQ("{{node n}} = AnyIn[T=[DT_INT32, DT_STRING]](a:0, b:123)",
             SummarizeNodeDef(node_def_explicit_inputs));
 
   const NodeDef node_def_partial_shape = ToNodeDef(R"proto(
@@ -379,7 +383,7 @@ TEST(NameRangesForNodeTest, Simple) {
   EXPECT_EQ(NameRangeMap({{"a", {0, 1}}, {"b", {1, 2}}}), inputs);
   EXPECT_EQ(NameRangeMap({{"c", {0, 1}}, {"d", {1, 2}}}), outputs);
 
-  EXPECT_EQ("simple = Simple[](a, b)", SummarizeNodeDef(node_def));
+  EXPECT_EQ("{{node simple}} = Simple[](a, b)", SummarizeNodeDef(node_def));
 
   OpDef bad_op_def = op_def;
   bad_op_def.mutable_input_arg(0)->clear_type();
@@ -399,7 +403,7 @@ TEST(NameRangesForNodeTest, Polymorphic) {
   TF_EXPECT_OK(NameRangesForNode(node_def1, op_def, &inputs, &outputs));
   EXPECT_EQ(NameRangeMap({{"a", {0, 1}}, {"b", {1, 2}}}), inputs);
   EXPECT_EQ(NameRangeMap({{"c", {0, 1}}}), outputs);
-  EXPECT_EQ("poly = Polymorphic[T=DT_INT32](a, b)",
+  EXPECT_EQ("{{node poly}} = Polymorphic[T=DT_INT32](a, b)",
             SummarizeNodeDef(node_def1));
 
   const NodeDef node_def2 = ToNodeDef(NodeDefBuilder("poly", &op_def)
@@ -408,7 +412,8 @@ TEST(NameRangesForNodeTest, Polymorphic) {
   TF_EXPECT_OK(NameRangesForNode(node_def2, op_def, &inputs, &outputs));
   EXPECT_EQ(NameRangeMap({{"a", {0, 1}}, {"b", {1, 2}}}), inputs);
   EXPECT_EQ(NameRangeMap({{"c", {0, 1}}}), outputs);
-  EXPECT_EQ("poly = Polymorphic[T=DT_BOOL](a, b)", SummarizeNodeDef(node_def2));
+  EXPECT_EQ("{{node poly}} = Polymorphic[T=DT_BOOL](a, b)",
+            SummarizeNodeDef(node_def2));
 }
 
 TEST(NameRangesForNodeTest, NRepeats) {
@@ -431,7 +436,8 @@ TEST(NameRangesForNodeTest, NRepeats) {
   EXPECT_EQ(NameRangeMap({{"c", {0, 1}}, {"d", {1, 5}}, {"e", {5, 8}}}),
             outputs);
   EXPECT_EQ(
-      "nr = NRepeats[M=3, N=4, T=DT_FLOAT](a, a:1, a:2, a:3, b, b:1, b:2, b:3)",
+      "{{node nr}} = NRepeats[M=3, N=4, T=DT_FLOAT](a, a:1, a:2, a:3, b, b:1, "
+      "b:2, b:3)",
       SummarizeNodeDef(node_def1));
 
   const NodeDef node_def2 = ToNodeDef(NodeDefBuilder("nr", &op_def)
@@ -442,7 +448,7 @@ TEST(NameRangesForNodeTest, NRepeats) {
   EXPECT_EQ(NameRangeMap({{"a", {0, 2}}, {"b", {2, 4}}}), inputs);
   EXPECT_EQ(NameRangeMap({{"c", {0, 1}}, {"d", {1, 3}}, {"e", {3, 10}}}),
             outputs);
-  EXPECT_EQ("nr = NRepeats[M=7, N=2, T=DT_DOUBLE](a, a:1, b, b:1)",
+  EXPECT_EQ("{{node nr}} = NRepeats[M=7, N=2, T=DT_DOUBLE](a, a:1, b, b:1)",
             SummarizeNodeDef(node_def2));
 
   NodeDef bad_node_def = node_def2;
@@ -471,7 +477,7 @@ TEST(NameRangesForNodeTest, TypeList) {
   EXPECT_EQ(NameRangeMap({{"c", {0, 4}}, {"d", {4, 7}}, {"e", {7, 9}}}),
             outputs);
   EXPECT_EQ(
-      "tl = TypeList[T1=[DT_BOOL, DT_FLOAT],"
+      "{{node tl}} = TypeList[T1=[DT_BOOL, DT_FLOAT],"
       " T2=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT],"
       " T3=[DT_INT32, DT_DOUBLE, DT_STRING]](a, a:1, b, b:1, b:2, b:3)",
       SummarizeNodeDef(node_def1));
@@ -485,7 +491,8 @@ TEST(NameRangesForNodeTest, TypeList) {
   EXPECT_EQ(NameRangeMap({{"c", {0, 1}}, {"d", {1, 3}}, {"e", {3, 10}}}),
             outputs);
   EXPECT_EQ(
-      "tl = TypeList[T1=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32,"
+      "{{node tl}} = TypeList[T1=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, "
+      "DT_INT32,"
       " DT_INT32, DT_INT32], T2=[DT_DOUBLE], T3=[DT_DOUBLE, DT_STRING]]"
       "(a, a:1, a:2, a:3, a:4, a:5, a:6, b)",
       SummarizeNodeDef(node_def2));
@@ -509,5 +516,20 @@ TEST(AddPrefixAndSuffixToNode, Enter) {
   EXPECT_EQ("prefix/test_frame/suffix", frame_name);
 }
 
+TEST(FormatNodeForErrorTest, Node) {
+  Graph g(OpRegistry::Global());
+  Node* node;
+  TF_CHECK_OK(NodeBuilder("enter", "NoOp").Finalize(&g, &node));
+  EXPECT_EQ("{{node enter}}", FormatNodeForError(*node));
+}
+
+TEST(FormatNodeForErrorTest, NodeDef) {
+  NodeDef node_def;
+  node_def.set_name("enter");
+  node_def.set_op("Enter");
+  AddNodeAttr("frame_name", "test_frame", &node_def);
+  EXPECT_EQ("{{node enter}}", FormatNodeDefForError(node_def));
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/op_compatibility_test.cc b/tensorflow/core/framework/op_compatibility_test.cc
index c782480f1f..140f201085 100644
--- a/tensorflow/core/framework/op_compatibility_test.cc
+++ b/tensorflow/core/framework/op_compatibility_test.cc
@@ -209,8 +209,8 @@ TEST_F(OpCompatibilityTest, Same) {
                    .Finalize(node_def()));
   ExpectSuccess(*RegisteredOpDef());
   EXPECT_EQ(
-      "same = Same[N=3, T=DT_FLOAT, TList=[DT_BOOL, DT_BOOL]](a, b, c, c:1, "
-      "c:2, d, d:1, d:2, e, e:1)",
+      "{{node same}} = Same[N=3, T=DT_FLOAT, TList=[DT_BOOL, DT_BOOL]](a, b, "
+      "c, c:1, c:2, d, d:1, d:2, e, e:1)",
       Result());
 }
 
@@ -224,7 +224,7 @@ TEST_F(OpCompatibilityTest, AddAttr) {
       OpDefBuilder("AddAttr").Output("ndef: string").Finalize(&old_op));
   TF_ASSERT_OK(NodeDefBuilder("add_attr", &old_op.op_def).Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("add_attr = AddAttr[a=42]()", Result());
+  EXPECT_EQ("{{node add_attr}} = AddAttr[a=42]()", Result());
 }
 
 // Should be able to make an attr restriction less strict.
@@ -241,7 +241,7 @@ TEST_F(OpCompatibilityTest, LessStrict) {
                    .Attr("a", "B")
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("less_strict = LessStrict[a=\"B\"]()", Result());
+  EXPECT_EQ("{{node less_strict}} = LessStrict[a=\"B\"]()", Result());
 }
 
 // Should be able to remove an attr restriction.
@@ -259,7 +259,8 @@ TEST_F(OpCompatibilityTest, RemoveRestriction) {
                    .Attr("a", DT_INT32)
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("remove_restriction = RemoveRestriction[a=DT_INT32]()", Result());
+  EXPECT_EQ("{{node remove_restriction}} = RemoveRestriction[a=DT_INT32]()",
+            Result());
 }
 
 // Should be able to change the order of attrs.
@@ -278,7 +279,7 @@ TEST_F(OpCompatibilityTest, AttrOrder) {
                    .Attr("a", 7)
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("attr_order = AttrOrder[a=7, b=true]()", Result());
+  EXPECT_EQ("{{node attr_order}} = AttrOrder[a=7, b=true]()", Result());
 }
 
 // Should be able to make an input/output polymorphic.
@@ -299,7 +300,8 @@ TEST_F(OpCompatibilityTest, TypePolymorphic) {
                    .Input(FakeInput())
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("type_polymorphic = TypePolymorphic[T=DT_INT32](a)", Result());
+  EXPECT_EQ("{{node type_polymorphic}} = TypePolymorphic[T=DT_INT32](a)",
+            Result());
 }
 
 // Should be able to make a single input/output into a list.
@@ -320,7 +322,7 @@ TEST_F(OpCompatibilityTest, MakeList) {
                    .Input(FakeInput())
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("make_list = MakeList[N=1](a)", Result());
+  EXPECT_EQ("{{node make_list}} = MakeList[N=1](a)", Result());
 }
 
 // Should be able to make a single input/output into a polymorphic list.
@@ -343,7 +345,8 @@ TEST_F(OpCompatibilityTest, MakePolyList) {
                    .Input(FakeInput())
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("make_poly_list = MakePolyList[N=1, T=DT_INT32](a)", Result());
+  EXPECT_EQ("{{node make_poly_list}} = MakePolyList[N=1, T=DT_INT32](a)",
+            Result());
 }
 
 // Should be able to make a single input/output into an arbitrary list.
@@ -364,7 +367,7 @@ TEST_F(OpCompatibilityTest, MakeAnyList) {
                    .Input(FakeInput())
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("make_any_list = MakeAnyList[T=[DT_INT32]](a)", Result());
+  EXPECT_EQ("{{node make_any_list}} = MakeAnyList[T=[DT_INT32]](a)", Result());
 }
 
 // Should be able to make a single polymorphic input/output into a list of
@@ -387,7 +390,8 @@ TEST_F(OpCompatibilityTest, PolyIntoList) {
                    .Input(FakeInput(DT_INT32))
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("poly_into_list = PolyIntoList[N=1, T=DT_INT32](a)", Result());
+  EXPECT_EQ("{{node poly_into_list}} = PolyIntoList[N=1, T=DT_INT32](a)",
+            Result());
 }
 
 // Should be able to make a multiple inputs/outputs into a list with
@@ -413,7 +417,7 @@ TEST_F(OpCompatibilityTest, MakeMultipleSameList) {
                    .Input(FakeInput())
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("make_list = MakeMultipleSameList[N=2](a, b)", Result());
+  EXPECT_EQ("{{node make_list}} = MakeMultipleSameList[N=2](a, b)", Result());
 }
 
 // Changing from int32, float -> T
@@ -437,8 +441,9 @@ TEST_F(OpCompatibilityTest, MakeMultipleAnyList) {
                    .Input(FakeInput())
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("make_list = MakeMultipleAnyList[T=[DT_INT32, DT_FLOAT]](a, b)",
-            Result());
+  EXPECT_EQ(
+      "{{node make_list}} = MakeMultipleAnyList[T=[DT_INT32, DT_FLOAT]](a, b)",
+      Result());
 }
 
 // Should be able to change the name of an input/output.
@@ -455,7 +460,7 @@ TEST_F(OpCompatibilityTest, ChangeName) {
                    .Input(FakeInput())
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("change_name = ChangeName[](a)", Result());
+  EXPECT_EQ("{{node change_name}} = ChangeName[](a)", Result());
 }
 
 // Should be able to add an input/output of type
@@ -473,7 +478,7 @@ TEST_F(OpCompatibilityTest, AddNInts) {
   TF_ASSERT_OK(
       NodeDefBuilder("add_n_ints", &old_op.op_def).Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("add_n_ints = AddNInts[N=0]()", Result());
+  EXPECT_EQ("{{node add_n_ints}} = AddNInts[N=0]()", Result());
 }
 
 // Should be able to add an input/output of type N * T
@@ -492,7 +497,7 @@ TEST_F(OpCompatibilityTest, AddNSame) {
   TF_ASSERT_OK(
       NodeDefBuilder("add_n_same", &old_op.op_def).Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("add_n_same = AddNSame[N=0, T=DT_BOOL]()", Result());
+  EXPECT_EQ("{{node add_n_same}} = AddNSame[N=0, T=DT_BOOL]()", Result());
 }
 
 // Should be able to add an input/output of type N * T
@@ -517,8 +522,10 @@ TEST_F(OpCompatibilityTest, AddNSameAsExisting) {
                    .Input(FakeInput(DT_STRING))
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("add_n_same_as_existing = AddNSameAsExisting[N=0, T=DT_STRING](a)",
-            Result());
+  EXPECT_EQ(
+      "{{node add_n_same_as_existing}} = AddNSameAsExisting[N=0, "
+      "T=DT_STRING](a)",
+      Result());
 }
 
 // Should be able to add an input/output of type T
@@ -536,7 +543,7 @@ TEST_F(OpCompatibilityTest, AddAnyList) {
   TF_ASSERT_OK(
       NodeDefBuilder("add_any_list", &old_op.op_def).Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("add_any_list = AddAnyList[T=[]]()", Result());
+  EXPECT_EQ("{{node add_any_list}} = AddAnyList[T=[]]()", Result());
 }
 
 // Should be able to allow shorter lists.
@@ -557,8 +564,10 @@ TEST_F(OpCompatibilityTest, ShorterAnyList) {
                    .Input(FakeInput(2, DT_BOOL))
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("shorter_any_list = ShorterAnyList[T=[DT_BOOL, DT_BOOL]](a, a:1)",
-            Result());
+  EXPECT_EQ(
+      "{{node shorter_any_list}} = ShorterAnyList[T=[DT_BOOL, DT_BOOL]](a, "
+      "a:1)",
+      Result());
 }
 
 REGISTER_OP("ShorterSameList")
@@ -578,7 +587,8 @@ TEST_F(OpCompatibilityTest, ShorterSameList) {
                    .Input(FakeInput(2))
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("shorter_same_list = ShorterSameList[N=2](a, a:1)", Result());
+  EXPECT_EQ("{{node shorter_same_list}} = ShorterSameList[N=2](a, a:1)",
+            Result());
 }
 
 // Can remove a restriction to an attr
@@ -597,7 +607,7 @@ TEST_F(OpCompatibilityTest, AttrRemoveRestriction) {
                    .Attr("t", DT_INT32)
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("remove_restriction = AttrRemoveRestriction[t=DT_INT32]()",
+  EXPECT_EQ("{{node remove_restriction}} = AttrRemoveRestriction[t=DT_INT32]()",
             Result());
 }
 
@@ -619,7 +629,8 @@ TEST_F(OpCompatibilityTest, AttrLessRestrictive) {
                    .Attr("t", DT_INT32)
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("less_restrictive = AttrLessRestrictive[t=DT_INT32]()", Result());
+  EXPECT_EQ("{{node less_restrictive}} = AttrLessRestrictive[t=DT_INT32]()",
+            Result());
 }
 
 // Can remove a minimum from an attr.
@@ -637,7 +648,7 @@ TEST_F(OpCompatibilityTest, AttrRemoveMin) {
                    .Attr("n", 4)
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("remove_min = AttrRemoveMin[n=4]()", Result());
+  EXPECT_EQ("{{node remove_min}} = AttrRemoveMin[n=4]()", Result());
 }
 
 // Can lower the minimum on an attr.
@@ -655,7 +666,7 @@ TEST_F(OpCompatibilityTest, AttrLowerMin) {
                    .Attr("n", 4)
                    .Finalize(node_def()));
   ExpectSuccess(old_op.op_def);
-  EXPECT_EQ("lower_min = AttrLowerMin[n=4]()", Result());
+  EXPECT_EQ("{{node lower_min}} = AttrLowerMin[n=4]()", Result());
 }
 
 // Can make a ref input into a non-ref input.
diff --git a/tensorflow/core/framework/op_kernel.cc b/tensorflow/core/framework/op_kernel.cc
index 8a332fa1d8..b285accce7 100644
--- a/tensorflow/core/framework/op_kernel.cc
+++ b/tensorflow/core/framework/op_kernel.cc
@@ -263,11 +263,13 @@ OpKernelContext::OpKernelContext(Params* params, int num_outputs)
       outputs_(num_outputs),
       temp_memory_allocated_(0),
       persistent_memory_allocated_(0) {
-  Allocator* eigen_gpu_allocator = get_allocator(AllocatorAttributes());
   params_->ensure_eigen_gpu_device();
-  params_->device->ReinitializeGpuDevice(this, params_->eigen_gpu_device,
-                                         params_->op_device_context,
-                                         eigen_gpu_allocator);
+  if (params_->eigen_gpu_device != nullptr) {
+    Allocator* eigen_gpu_allocator = get_allocator(AllocatorAttributes());
+    params_->device->ReinitializeGpuDevice(this, params_->eigen_gpu_device,
+                                           params_->op_device_context,
+                                           eigen_gpu_allocator);
+  }
   if (params_->record_tensor_accesses) {
     referenced_tensors_.Init();
   }
@@ -824,19 +826,6 @@ Status OpKernelContext::mutable_output(StringPiece name, Tensor** tensor) {
   return Status::OK();
 }
 
-Status OpKernelContext::release_output(StringPiece name, TensorValue* value) {
-  int start, stop;
-  TF_RETURN_IF_ERROR(params_->op_kernel->OutputRange(name, &start, &stop));
-  if (stop != start + 1) {
-    return errors::InvalidArgument("OpKernel used list-valued output name '",
-                                   name,
-                                   "' when single-valued output was "
-                                   "expected");
-  }
-  *value = release_output(start);
-  return Status::OK();
-}
-
 bool OpKernelContext::ValidateInputsAreSameShape(OpKernel* op) {
   const auto& inputs = *params_->inputs;
   for (size_t i = 1; i < inputs.size(); ++i) {
@@ -1059,40 +1048,51 @@ Status SupportedDeviceTypesForNode(
 }
 
 void LogAllRegisteredKernels() {
-  for (const auto& key_registration : *GlobalKernelRegistryTyped()) {
-    const KernelDef& kernel_def(key_registration.second.def);
+  KernelList kernel_list = GetAllRegisteredKernels();
+  for (const auto& kernel_def : kernel_list.kernel()) {
     LOG(INFO) << "OpKernel ('" << ProtoShortDebugString(kernel_def) << "')";
   }
 }
 
 KernelList GetAllRegisteredKernels() {
+  return GetFilteredRegisteredKernels([](const KernelDef& k) { return true; });
+}
+
+KernelList GetFilteredRegisteredKernels(
+    const std::function<bool(const KernelDef&)>& predicate) {
   const KernelRegistry* const typed_registry = GlobalKernelRegistryTyped();
   KernelList kernel_list;
   kernel_list.mutable_kernel()->Reserve(typed_registry->size());
   for (const auto& p : *typed_registry) {
-    *kernel_list.add_kernel() = p.second.def;
+    const KernelDef& kernel_def = p.second.def;
+    if (predicate(kernel_def)) {
+      *kernel_list.add_kernel() = kernel_def;
+    }
   }
   return kernel_list;
 }
 
+KernelList GetRegisteredKernelsForOp(StringPiece op_name) {
+  auto op_pred = [op_name](const KernelDef& k) { return k.op() == op_name; };
+  return GetFilteredRegisteredKernels(op_pred);
+}
+
 string KernelsRegisteredForOp(StringPiece op_name) {
+  KernelList kernel_list = GetRegisteredKernelsForOp(op_name);
+  if (kernel_list.kernel_size() == 0) return "  <no registered kernels>\n";
   string ret;
-  for (const auto& key_registration : *GlobalKernelRegistryTyped()) {
-    const KernelDef& kernel_def(key_registration.second.def);
-    if (kernel_def.op() == op_name) {
-      strings::StrAppend(&ret, "  device='", kernel_def.device_type(), "'");
-      if (!kernel_def.label().empty()) {
-        strings::StrAppend(&ret, "; label='", kernel_def.label(), "'");
-      }
-      for (int i = 0; i < kernel_def.constraint_size(); ++i) {
-        strings::StrAppend(
-            &ret, "; ", kernel_def.constraint(i).name(), " in ",
-            SummarizeAttrValue(kernel_def.constraint(i).allowed_values()));
-      }
-      strings::StrAppend(&ret, "\n");
+  for (const auto& kernel_def : kernel_list.kernel()) {
+    strings::StrAppend(&ret, "  device='", kernel_def.device_type(), "'");
+    if (!kernel_def.label().empty()) {
+      strings::StrAppend(&ret, "; label='", kernel_def.label(), "'");
     }
+    for (int i = 0; i < kernel_def.constraint_size(); ++i) {
+      strings::StrAppend(
+          &ret, "; ", kernel_def.constraint(i).name(), " in ",
+          SummarizeAttrValue(kernel_def.constraint(i).allowed_values()));
+    }
+    strings::StrAppend(&ret, "\n");
   }
-  if (ret.empty()) return "  <no registered kernels>\n";
   return ret;
 }
 
@@ -1275,4 +1275,10 @@ void OpKernelContext::CtxFailureWithWarning(const char* file, int line,
   SetStatus(s);
 }
 
+void CheckNotInComputeAsync(OpKernelContext* ctx,
+                            const char* correct_macro_name) {
+  CHECK_EQ(nullptr, ctx->op_kernel().AsAsync())
+      << "Use " << correct_macro_name << " in AsyncOpKernel implementations.";
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/op_kernel.h b/tensorflow/core/framework/op_kernel.h
index 6c4c3a2ac1..aab95b785b 100644
--- a/tensorflow/core/framework/op_kernel.h
+++ b/tensorflow/core/framework/op_kernel.h
@@ -113,6 +113,7 @@ class OpKernel {
 
   // Returns nullptr iff this op kernel is synchronous.
   virtual AsyncOpKernel* AsAsync() { return nullptr; }
+  virtual const AsyncOpKernel* AsAsync() const { return nullptr; }
 
   // Returns true iff this op kernel is considered "expensive". The
   // runtime may use this flag to optimize graph execution for example
@@ -197,6 +198,7 @@ class AsyncOpKernel : public OpKernel {
   virtual void ComputeAsync(OpKernelContext* context, DoneCallback done) = 0;
 
   AsyncOpKernel* AsAsync() final { return this; }
+  const AsyncOpKernel* AsAsync() const final { return this; }
 
   void Compute(OpKernelContext* context) final;
 
@@ -902,12 +904,6 @@ class OpKernelContext {
   // Returns nullptr if allocate_output() or set_output() have not been called.
   Status mutable_output(StringPiece name, Tensor** tensor);
 
-  // Transfers ownership of an output tensor to the caller.
-  // NOTE: For non-reference outputs, the caller takes responsibility
-  // for deletion. For reference outputs, the caller does NOT take
-  // responsibility for deletion.
-  Status release_output(StringPiece name, TensorValue* value);
-
   // Records device specific state about how the input tensors were
   // computed.
   //
@@ -1044,7 +1040,6 @@ class OpKernelContext {
   // For control flow.
   FrameAndIter frame_iter() const { return params_->frame_iter; }
   bool is_input_dead() const { return params_->is_input_dead; }
-  bool* is_output_dead() { return &is_output_dead_; }
 
   // May be used, e.g., to get GPU handles, etc.
   // TODO(tucker): Add example usage.
@@ -1143,8 +1138,6 @@ class OpKernelContext {
   // Constructed only if <params->record_tensor_accesses>.
   ManualConstructor<UniqueTensorReferences> referenced_tensors_ GUARDED_BY(mu_);
 
-  bool is_output_dead_ = false;
-
   // The following data members are only used when allocation tracking is
   // enabled.
   mutable mutex stats_mu_;
@@ -1307,6 +1300,13 @@ void LogAllRegisteredKernels();
 // Gets a list of all registered kernels.
 KernelList GetAllRegisteredKernels();
 
+// Gets a list of all registered kernels for which predicate returns true
+KernelList GetFilteredRegisteredKernels(
+    const std::function<bool(const KernelDef&)>& predicate);
+
+// Gets a list of all registered kernels for a given op
+KernelList GetRegisteredKernelsForOp(StringPiece op_name);
+
 namespace kernel_factory {
 
 class OpKernelRegistrar {
@@ -1538,21 +1538,36 @@ inline void OpOutputList::set_ref(int i, mutex* mu, Tensor* tensor_for_ref) {
 //   ...
 // }
 
-#define OP_REQUIRES(CTX, EXP, STATUS)                  \
-  do {                                                 \
-    if (!TF_PREDICT_TRUE(EXP)) {                       \
-      (CTX)->CtxFailure(__FILE__, __LINE__, (STATUS)); \
-      return;                                          \
-    }                                                  \
+// Generate a fatal error if OP_REQUIRES or OP_REQUIRES_OK are used in
+// AsyncOpKernel implementations. If these macros are used and the condition
+// does not hold, the `done` callback will never be called and the system will
+// deadlock, so a crash failure is preferable. Since the OP_REQUIRES[_OK] macros
+// are legal to use in AsyncOpKernel constructors, we use overload resolution
+// to distinguish between OpKernelConstruction* and OpKernelContext* context
+// types.
+class XlaOpKernelContext;
+inline void CheckNotInComputeAsync(XlaOpKernelContext*, const char*) {}
+inline void CheckNotInComputeAsync(OpKernelConstruction*, const char*) {}
+void CheckNotInComputeAsync(OpKernelContext* ctx,
+                            const char* correct_macro_name);
+
+#define OP_REQUIRES(CTX, EXP, STATUS)                     \
+  do {                                                    \
+    if (!TF_PREDICT_TRUE(EXP)) {                          \
+      CheckNotInComputeAsync((CTX), "OP_REQUIRES_ASYNC"); \
+      (CTX)->CtxFailure(__FILE__, __LINE__, (STATUS));    \
+      return;                                             \
+    }                                                     \
   } while (0)
 
-#define OP_REQUIRES_OK(CTX, ...)                            \
-  do {                                                      \
-    ::tensorflow::Status _s(__VA_ARGS__);                   \
-    if (!TF_PREDICT_TRUE(_s.ok())) {                        \
-      (CTX)->CtxFailureWithWarning(__FILE__, __LINE__, _s); \
-      return;                                               \
-    }                                                       \
+#define OP_REQUIRES_OK(CTX, ...)                             \
+  do {                                                       \
+    ::tensorflow::Status _s(__VA_ARGS__);                    \
+    if (!TF_PREDICT_TRUE(_s.ok())) {                         \
+      CheckNotInComputeAsync((CTX), "OP_REQUIRES_OK_ASYNC"); \
+      (CTX)->CtxFailureWithWarning(__FILE__, __LINE__, _s);  \
+      return;                                                \
+    }                                                        \
   } while (0)
 
 #define OP_REQUIRES_ASYNC(CTX, EXP, STATUS, CALLBACK)  \
diff --git a/tensorflow/core/framework/op_kernel_test.cc b/tensorflow/core/framework/op_kernel_test.cc
index b76a3400a8..83dda6579b 100644
--- a/tensorflow/core/framework/op_kernel_test.cc
+++ b/tensorflow/core/framework/op_kernel_test.cc
@@ -965,7 +965,8 @@ BENCHMARK(BM_ConcatInputRange);
 BENCHMARK(BM_SelectInputRange);
 
 TEST(RegisteredKernels, CanCallGetAllRegisteredKernels) {
-  auto all_registered_kernels = GetAllRegisteredKernels().kernel();
+  auto kernel_list = GetAllRegisteredKernels();
+  auto all_registered_kernels = kernel_list.kernel();
   auto has_name_test1 = [](const KernelDef& k) { return k.op() == "Test1"; };
 
   // Verify we can find the "Test1" op registered above
@@ -986,5 +987,20 @@ TEST(RegisteredKernels, CanLogAllRegisteredKernels) {
   tensorflow::LogAllRegisteredKernels();
 }
 
+TEST(RegisteredKernels, GetFilteredRegisteredKernels) {
+  auto has_name_test1 = [](const KernelDef& k) { return k.op() == "Test1"; };
+  auto kernel_list = GetFilteredRegisteredKernels(has_name_test1);
+  ASSERT_EQ(kernel_list.kernel_size(), 1);
+  EXPECT_EQ(kernel_list.kernel(0).op(), "Test1");
+  EXPECT_EQ(kernel_list.kernel(0).device_type(), "CPU");
+}
+
+TEST(RegisteredKernels, GetRegisteredKernelsForOp) {
+  auto kernel_list = GetRegisteredKernelsForOp("Test1");
+  ASSERT_EQ(kernel_list.kernel_size(), 1);
+  EXPECT_EQ(kernel_list.kernel(0).op(), "Test1");
+  EXPECT_EQ(kernel_list.kernel(0).device_type(), "CPU");
+}
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/framework/register_types.h b/tensorflow/core/framework/register_types.h
index e90596980f..f1cd37ecda 100644
--- a/tensorflow/core/framework/register_types.h
+++ b/tensorflow/core/framework/register_types.h
@@ -151,6 +151,12 @@ limitations under the License.
 
 // Defines for sets of types.
 
+// TODO(b/111604096): Add uint32 and uint64 to TF_CALL_INTEGRAL_TYPES.
+//
+// The uint32 and uint64 types were introduced in 10/2017 to be used via XLA and
+// thus were not included in TF_CALL_INTEGRAL_TYPES. Including them in
+// TF_CALL_INTEGRAL_TYPES should only happen after evaluating the effect on the
+// TF binary size and performance.
 #define TF_CALL_INTEGRAL_TYPES(m)                                      \
   TF_CALL_int64(m) TF_CALL_int32(m) TF_CALL_uint16(m) TF_CALL_int16(m) \
       TF_CALL_uint8(m) TF_CALL_int8(m)
diff --git a/tensorflow/core/framework/step_stats.proto b/tensorflow/core/framework/step_stats.proto
index d98999cb54..67cc9e3845 100644
--- a/tensorflow/core/framework/step_stats.proto
+++ b/tensorflow/core/framework/step_stats.proto
@@ -67,6 +67,11 @@ message NodeExecStats {
   uint32 thread_id = 10;
   repeated AllocationDescription referenced_tensor = 11;
   MemoryStats memory_stats = 12;
+  int64 all_start_nanos = 13;
+  int64 op_start_rel_nanos = 14;
+  int64 op_end_rel_nanos = 15;
+  int64 all_end_rel_nanos = 16;
+  int64 scheduled_nanos = 17;
 };
 
 message DeviceStepStats {
diff --git a/tensorflow/core/framework/tensor.h b/tensorflow/core/framework/tensor.h
index d2f2609d3b..1b19ab5da3 100644
--- a/tensorflow/core/framework/tensor.h
+++ b/tensorflow/core/framework/tensor.h
@@ -482,6 +482,7 @@ class Tensor {
   friend class VariableOp;            // For access to set_shape
   friend class AutoReloadVariableOp;  // For access to set_shape
   friend class TensorTestHelper;      // For access to set_shape
+  friend class CastOpBase;            // For access to set_dtype;
   friend class OpKernelContext;       // For access to RefCountIsOne().
   friend class ScopedAllocator;       // For access to buf_.
   friend class XlaTensor;             // For access to RefCountIsOne().
diff --git a/tensorflow/core/framework/tensor_testutil.cc b/tensorflow/core/framework/tensor_testutil.cc
index 8f480d65f2..1a7812ce4e 100644
--- a/tensorflow/core/framework/tensor_testutil.cc
+++ b/tensorflow/core/framework/tensor_testutil.cc
@@ -20,30 +20,42 @@ namespace tensorflow {
 namespace test {
 
 template <typename T>
-bool IsClose(const T& x, const T& y, double atol, double rtol) {
-  // Need x == y so that infinities are close to themselves
-  return x == y || std::abs(x - y) < atol + rtol * std::abs(x);
-}
-
-template <typename T>
 void ExpectClose(const Tensor& x, const Tensor& y, double atol, double rtol) {
-  auto Tx = x.flat<T>();
-  auto Ty = y.flat<T>();
-  for (int i = 0; i < Tx.size(); ++i) {
-    if (!IsClose(Tx(i), Ty(i), atol, rtol)) {
-      LOG(ERROR) << "x = " << x.DebugString();
-      LOG(ERROR) << "y = " << y.DebugString();
-      LOG(ERROR) << "atol = " << atol << " rtol = " << rtol
-                 << " tol = " << atol + rtol * std::abs(Tx(i));
-      EXPECT_TRUE(false) << i << "-th element is not close " << Tx(i) << " vs. "
-                         << Ty(i);
-    }
+  const T* Tx = x.flat<T>().data();
+  const T* Ty = y.flat<T>().data();
+  const auto size = x.NumElements();
+
+  // Tolerance's type (RealType) can be different from T.
+  // For example, if T = std::complex<float>, then RealType = float.
+  // Did not use std::numeric_limits<T> because
+  // 1) It returns 0 for Eigen::half.
+  // 2) It doesn't support T=std::complex<RealType>.
+  //    (Would have to write a templated struct to handle this.)
+  typedef decltype(Eigen::NumTraits<T>::epsilon()) RealType;
+  const RealType kSlackFactor = static_cast<RealType>(5.0);
+  const RealType kDefaultTol = kSlackFactor * Eigen::NumTraits<T>::epsilon();
+  const RealType typed_atol =
+      (atol < 0) ? kDefaultTol : static_cast<RealType>(atol);
+  const RealType typed_rtol =
+      (rtol < 0) ? kDefaultTol : static_cast<RealType>(rtol);
+  ASSERT_GE(typed_atol, static_cast<RealType>(0.0))
+      << "typed_atol is negative: " << typed_atol;
+  ASSERT_GE(typed_rtol, static_cast<RealType>(0.0))
+      << "typed_rtol is negative: " << typed_rtol;
+  for (int i = 0; i < size; ++i) {
+    EXPECT_TRUE(
+        internal::Helper<T>::IsClose(Tx[i], Ty[i], typed_atol, typed_rtol))
+        << "index = " << i << " x = " << Tx[i] << " y = " << Ty[i]
+        << " typed_atol = " << typed_atol << " typed_rtol = " << typed_rtol;
   }
 }
 
 void ExpectClose(const Tensor& x, const Tensor& y, double atol, double rtol) {
   internal::AssertSameTypeDims(x, y);
   switch (x.dtype()) {
+    case DT_HALF:
+      ExpectClose<Eigen::half>(x, y, atol, rtol);
+      break;
     case DT_FLOAT:
       ExpectClose<float>(x, y, atol, rtol);
       break;
diff --git a/tensorflow/core/framework/tensor_testutil.h b/tensorflow/core/framework/tensor_testutil.h
index 4c216a84f0..3163002851 100644
--- a/tensorflow/core/framework/tensor_testutil.h
+++ b/tensorflow/core/framework/tensor_testutil.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_FRAMEWORK_TENSOR_TESTUTIL_H_
-#define TENSORFLOW_FRAMEWORK_TENSOR_TESTUTIL_H_
+#ifndef TENSORFLOW_CORE_FRAMEWORK_TENSOR_TESTUTIL_H_
+#define TENSORFLOW_CORE_FRAMEWORK_TENSOR_TESTUTIL_H_
 
 #include <numeric>
 
@@ -105,9 +105,10 @@ void ExpectTensorNear(const Tensor& x, const Tensor& y, const T& abs_err);
 
 // Expects "x" and "y" are tensors of the same type (float or double),
 // same shape and element-wise difference between x and y is no more
-// than atol + rtol * abs(x).
-void ExpectClose(const Tensor& x, const Tensor& y, double atol = 1e-6,
-                 double rtol = 1e-6);
+// than atol + rtol * abs(x). If atol or rtol is negative, it is replaced
+// with a default tolerance value = data type's epsilon * kSlackFactor.
+void ExpectClose(const Tensor& x, const Tensor& y, double atol = -1.0,
+                 double rtol = -1.0);
 
 // Implementation details.
 
@@ -191,11 +192,10 @@ struct Expector<T, true> {
     }
   }
 
-  static void Near(const T& a, const T& b, const double abs_err, int index) {
-    if (a != b) {  // Takes care of inf.
-      EXPECT_LE(double(Eigen::numext::abs(a - b)), abs_err)
-          << "a = " << a << " b = " << b << " index = " << index;
-    }
+  static bool Near(const T& a, const T& b, const double abs_err) {
+    // Need a == b so that infinities are close to themselves.
+    return (a == b) ||
+           (static_cast<double>(Eigen::numext::abs(a - b)) <= abs_err);
   }
 
   static void Near(const Tensor& x, const Tensor& y, const double abs_err) {
@@ -205,11 +205,31 @@ struct Expector<T, true> {
     const T* a = x.flat<T>().data();
     const T* b = y.flat<T>().data();
     for (int i = 0; i < size; ++i) {
-      Near(a[i], b[i], abs_err, i);
+      EXPECT_TRUE(Near(a[i], b[i], abs_err))
+          << "a = " << a[i] << " b = " << b << " index = " << i;
     }
   }
 };
 
+template <typename T>
+struct Helper {
+  // Assumes atol and rtol are nonnegative.
+  static bool IsClose(const T& x, const T& y, const T& atol, const T& rtol) {
+    // Need x == y so that infinities are close to themselves.
+    return (x == y) ||
+           (Eigen::numext::abs(x - y) <= atol + rtol * Eigen::numext::abs(x));
+  }
+};
+
+template <typename T>
+struct Helper<std::complex<T>> {
+  static bool IsClose(const std::complex<T>& x, const std::complex<T>& y,
+                      const T& atol, const T& rtol) {
+    return Helper<T>::IsClose(x.real(), y.real(), atol, rtol) &&
+           Helper<T>::IsClose(x.imag(), y.imag(), atol, rtol);
+  }
+};
+
 }  // namespace internal
 
 template <typename T>
@@ -221,10 +241,11 @@ template <typename T>
 void ExpectTensorNear(const Tensor& x, const Tensor& y, const double abs_err) {
   static_assert(internal::is_floating_point_type<T>::value,
                 "T is not a floating point types.");
+  ASSERT_GE(abs_err, 0.0) << "abs_error is negative" << abs_err;
   internal::Expector<T>::Near(x, y, abs_err);
 }
 
 }  // namespace test
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_FRAMEWORK_TENSOR_TESTUTIL_H_
+#endif  // TENSORFLOW_CORE_FRAMEWORK_TENSOR_TESTUTIL_H_
diff --git a/tensorflow/core/framework/tensor_testutil_test.cc b/tensorflow/core/framework/tensor_testutil_test.cc
new file mode 100644
index 0000000000..dd321535f2
--- /dev/null
+++ b/tensorflow/core/framework/tensor_testutil_test.cc
@@ -0,0 +1,356 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/tensor_testutil.h"
+
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/util/ptr_util.h"
+
+namespace tensorflow {
+namespace test {
+namespace {
+
+using internal::Expector;
+using internal::Helper;
+
+template <typename T>
+static void TestEdgeCasesNear() {
+  EXPECT_TRUE(Expector<T>::Near(Eigen::NumTraits<T>::infinity(),
+                                Eigen::NumTraits<T>::infinity(), 0.0));
+  EXPECT_TRUE(Expector<T>::Near(Eigen::NumTraits<T>::lowest(),
+                                Eigen::NumTraits<T>::highest(),
+                                Eigen::NumTraits<double>::infinity()));
+  EXPECT_FALSE(Expector<T>::Near(Eigen::NumTraits<T>::lowest(),
+                                 Eigen::NumTraits<T>::highest(),
+                                 Eigen::NumTraits<double>::highest()));
+  EXPECT_FALSE(Expector<T>::Near(Eigen::NumTraits<T>::quiet_NaN(),
+                                 Eigen::NumTraits<T>::quiet_NaN(), 0.0));
+  EXPECT_FALSE(Expector<T>::Near(Eigen::NumTraits<T>::quiet_NaN(),
+                                 Eigen::NumTraits<T>::quiet_NaN(),
+                                 Eigen::NumTraits<double>::infinity()));
+}
+
+// For debug printing. Example usage:
+// dumpFloatingPointStorage<Eigen::half, uint16>(
+//     static_cast<Eigen::half>(-2.71f));
+// dumpFloatingPointStorage<float, uint32>(-2.718281f);
+// dumpFloatingPointStorage <double, uint64>(-2.71828182846);
+template <typename T, typename U>
+static void dumpFloatingPointStorage(T value) {
+  U* integral = reinterpret_cast<U*>(&value);
+  int shift_amount = (sizeof(U) << 3) - 1;
+  int exponent_bits = 2 + (log2(sizeof(U)) * 3);
+  U mask = static_cast<U>(1) << shift_amount;
+  for (int bits = 0; bits <= shift_amount; ++bits) {
+    std::cout << ((*integral & mask) > 0);
+    if (bits == 0 || bits == exponent_bits) std::cout << " ";
+    mask >>= 1;
+  }
+  std::cout << std::endl;
+  printf("%.20lf\n", static_cast<double>(value));
+}
+
+TEST(TensorTestUtilTest, ExpectTensorNearHalf) {
+  // Eigen::half has 1 sign bit, 5 exponent bits, and 10 mantissa bits.
+  // The exponent is offset at 15.
+  // https://en.wikipedia.org/wiki/Half-precision_floating-point_format
+  typedef Eigen::half T;
+#define HALF(x) static_cast<T>(x)
+
+  // Trivial cases: equalities.
+  EXPECT_TRUE(Expector<T>::Near(HALF(1.0f), HALF(1.0f), 0.0));
+  EXPECT_TRUE(Expector<T>::Near(HALF(0.0f), HALF(-0.0f), 0.0));
+  EXPECT_TRUE(Expector<T>::Near(HALF(3.141592f), HALF(3.141592f), 0.0));
+
+  // 0 10010 0001111110 -> 1150/128 = 8.984375 vs
+  // 0 10010 0001111111 -> 1151/128 = 8.9921875 (diff = 0.0078125)
+  EXPECT_TRUE(Expector<T>::Near(HALF(8.9875f), HALF(8.99f), 0.0078125));
+  EXPECT_FALSE(Expector<T>::Near(HALF(8.9875f), HALF(8.99f), 0.007));
+
+  // 0 11000 0110100000 -> 1440/2 = 720 vs
+  // 0 11000 0110100001 -> 1441/2 = 720.5 (diff = 0.5)
+  EXPECT_TRUE(Expector<T>::Near(HALF(720.2f), HALF(720.3f), 0.5));
+  EXPECT_FALSE(Expector<T>::Near(HALF(720.2f), HALF(720.3f), 0.4));
+
+  // 0 11001 0011010010 -> 1234 vs
+  // 0 11001 0011010011 -> 1235 (diff = 1)
+  // Rounds to even (1234.5 -> 1234).
+  EXPECT_TRUE(Expector<T>::Near(HALF(1234.f), HALF(1235.f), 1.0));
+  EXPECT_FALSE(Expector<T>::Near(HALF(1234.5f), HALF(1235.f), 0.5));
+  EXPECT_TRUE(Expector<T>::Near(HALF(1234.5f), HALF(1235.f), 1.0));
+
+  // 1 10000 0101101100 -> -1388/512 = -2.7109375 vs
+  // 1 10000 0101110001 -> -1393/512 = -2.720703125 (diff = 0.009765625)
+  EXPECT_TRUE(Expector<T>::Near(HALF(-2.71f), HALF(-2.72f), 0.01));
+
+#undef HALF
+
+  // Some of the cases failed because Eigen::half doesn't behave as expected.
+  // For example, (inf == inf) should have been true, but it returns false.
+  // TODO(penporn): uncomment this test once we fix Eigen::half
+  // TestEdgeCasesNear<T>();
+}
+
+TEST(TensorTestUtilTest, ExpectTensorNearFloat) {
+  // float has 1 sign bit, 8 exponent bits, and 23 mantissa bits.
+  // The exponent offset is 127.
+  // https://en.wikipedia.org/wiki/Single-precision_floating-point_format
+  typedef float T;
+  // Trivial cases: equalities.
+  EXPECT_TRUE(Expector<T>::Near(1.0f, 1.0f, 0.0));
+  EXPECT_TRUE(Expector<T>::Near(0.0f, -0.0f, 0.0));
+  EXPECT_TRUE(Expector<T>::Near(3.14159265359f, 3.14159265359f, 0.0));
+
+  // 0 10000010 00011111100110011001101 -> 9,424,077/2^20 vs
+  // 0 10000010 00011111100110100110110 -> 9,424,182/2^20
+  // diff = 105/2^20 = 0.000100135803223
+  EXPECT_TRUE(Expector<T>::Near(8.9875f, 8.9876f, 0.0001002));
+  EXPECT_FALSE(Expector<T>::Near(8.9875f, 8.9876f, 0.0001));
+
+  // 0 10001000 01101000000110011101001 -> 11,799,785/2^14 vs
+  // 0 10001000 01101000000110011101010 -> 11,799,786/2^14
+  // diff = 1/2^14 = 0.00006103515625
+  EXPECT_TRUE(Expector<T>::Near(720.2017f, 720.2018f, 0.0001));
+  EXPECT_FALSE(Expector<T>::Near(720.20175f, 720.20185f, 0.0001));
+  EXPECT_TRUE(Expector<T>::Near(720.20175f, 720.20185f, 0.00013));
+
+  // 0 10011001 11010110111100110100010 -> 15,432,098*2^3 vs
+  // 0 10011001 11010110111100110100011 -> 15,432,099*2^3 (diff = 2^3 = 8)
+  EXPECT_FALSE(Expector<T>::Near(123456788.f, 123456789.f, 4.0));
+  EXPECT_TRUE(Expector<T>::Near(123456788.f, 123456789.f, 8.0));
+
+  // 1 10000000 01011011111100001010001 -> 11,401,297/2^22 vs
+  // 1 10000000 01011011111100001010101 -> 11,401,301/2^22
+  // diff = 4/2^22 = 0.000000953674316
+  EXPECT_TRUE(Expector<T>::Near(-2.718281f, -2.718282f, 0.1));
+
+  TestEdgeCasesNear<T>();
+}
+
+TEST(TensorTestUtilTest, ExpectTensorNearDouble) {
+  // double has 1 sign bit, 11 exponent bits, and 52 mantissa bits.
+  // The exponent offset is 1,023.
+  // https://en.wikipedia.org/wiki/Double-precision_floating-point_format
+  typedef double T;
+  // Trivial cases: equalities.
+  EXPECT_TRUE(Expector<T>::Near(1.0, 1.0, 0.0));
+  EXPECT_TRUE(Expector<T>::Near(0.0, -0.0, 0.0));
+  EXPECT_TRUE(Expector<T>::Near(3.14159265359, 3.14159265359, 0.0));
+
+  // 0 10000000010 0001111110011001100110011001100110011001100110011010
+  //   -> 5,059,512,706,374,042/2^49 vs
+  // 0 10000000010 0001111110011010011010110101000010110000111100101000
+  //   -> 5,059,569,001,369,384/2^49
+  // diff = 56,294,995,342/2^49 = 9.999999999976694198267E-5
+  EXPECT_TRUE(Expector<T>::Near(8.9875, 8.9876, 0.0001));
+
+  // 0 10000001111 1000100101110000001100111010100100101010001100000101
+  //   -> 6,921,439,564,440,325/2^36
+  // 0 10000001111 1000100101110000001100111010111110110111111010010001
+  //   -> 6,921,439,571,312,273/2^36
+  // diff = 6,871,948/2^36 = 1.000000047497451305389E-4
+  EXPECT_FALSE(Expector<T>::Near(100720.2018, 100720.2019, 0.0001));
+  EXPECT_TRUE(Expector<T>::Near(100720.2018, 100720.2019, 1.00000005e-4));
+
+  // 0 10000110100 0101111011100010101000101110101101011010010111000100
+  //   -> 6,172,839,450,617,284 * 2
+  // 0 10000110100 0101111011100010101000101110101101011010010111000011
+  //   -> 6,172,839,450,617,283 * 2
+  // diff = 1 * 2 = 2
+  EXPECT_FALSE(Expector<T>::Near(12345678901234567., 12345678901234566., 1.0));
+  EXPECT_TRUE(Expector<T>::Near(12345678901234567., 12345678901234566., 2.0));
+
+  // 1 10000000000 0101101111110000101010001011000101000101111111001111
+  //   -> -6,121,026,514,870,223/2^51
+  // 1 10000000000 0101101111110000101010001011000101001011011111000101
+  //   -> -6,121,026,514,892,741/2^51
+  // diff = 22,518/2^51 = 1.00000008274037099909E-11
+  EXPECT_FALSE(Expector<T>::Near(-2.71828182846, -2.71828182847, 1.0e-11));
+  EXPECT_TRUE(
+      Expector<T>::Near(-2.71828182846, -2.71828182847, 1.00000009e-11));
+
+  TestEdgeCasesNear<T>();
+}
+
+static const double kSlackFactor = 5.0;
+
+template <typename T>
+static void TestEdgeCasesClose() {
+  T kZero = static_cast<T>(0.0);
+  EXPECT_TRUE(Helper<T>::IsClose(Eigen::NumTraits<T>::infinity(),
+                                 Eigen::NumTraits<T>::infinity(), kZero,
+                                 kZero));
+  EXPECT_TRUE(Helper<T>::IsClose(
+      Eigen::NumTraits<T>::lowest(), Eigen::NumTraits<T>::highest(),
+      Eigen::NumTraits<T>::infinity(), Eigen::NumTraits<T>::infinity()));
+  EXPECT_TRUE(Helper<T>::IsClose(
+      Eigen::NumTraits<T>::lowest(), Eigen::NumTraits<T>::highest(),
+      Eigen::NumTraits<T>::highest(), Eigen::NumTraits<T>::highest()));
+  EXPECT_FALSE(Helper<T>::IsClose(Eigen::NumTraits<T>::quiet_NaN(),
+                                  Eigen::NumTraits<T>::quiet_NaN(), kZero,
+                                  kZero));
+  EXPECT_FALSE(Helper<T>::IsClose(
+      Eigen::NumTraits<T>::quiet_NaN(), Eigen::NumTraits<T>::quiet_NaN(),
+      Eigen::NumTraits<T>::infinity(), Eigen::NumTraits<T>::infinity()));
+}
+
+TEST(TensorTestUtilTest, ExpectTensorCloseHalf) {
+  typedef Eigen::half T;
+#define HALF(x) static_cast<T>(x)
+  EXPECT_TRUE(
+      Helper<T>::IsClose(HALF(1.0f), HALF(1.1f), HALF(0.1f), HALF(0.1f)));
+  EXPECT_TRUE(
+      Helper<T>::IsClose(HALF(1.0f), HALF(1.0f), HALF(0.0f), HALF(0.0f)));
+  EXPECT_FALSE(
+      Helper<T>::IsClose(HALF(1.0f), HALF(1.1f), HALF(0.0f), HALF(0.0f)));
+
+  // Epsilon:      0 00010 0000000000 -> 2^-13  = 0.0001220703125
+  // kDefaultTol:  0 00100 0100000000 -> 5/2^13 = 0.0006103515625
+  const T kDefaultTol =
+      static_cast<T>(kSlackFactor) * Eigen::NumTraits<T>::epsilon();
+
+  // 1.234 -> 0 01111 0011110000 -> 1264/2^10 = 1.234375
+  // 1.233 -> 0 01111 0011101111 -> 1263/2^10 = 1.2333984375
+  // 1.235 -> 0 01111 0011110001 -> 1265/2^10 = 1.2353515625
+  // 1.232 -> 0 01111 0011101110 -> 1262/2^10 = 1.232421875
+  // 1.236 -> 0 01111 0011110010 -> 1266/2^10 = 1.236328125
+  // 1/2^10 = 0.0009765625E
+  // Threshold = 0.0013637542724609375
+  EXPECT_TRUE(
+      Helper<T>::IsClose(HALF(1.234f), HALF(1.234f), kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(
+      Helper<T>::IsClose(HALF(1.234f), HALF(1.233f), kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(
+      Helper<T>::IsClose(HALF(1.234f), HALF(1.235f), kDefaultTol, kDefaultTol));
+
+  // Diff = 0.001953125
+  EXPECT_FALSE(
+      Helper<T>::IsClose(HALF(1.234f), HALF(1.232f), kDefaultTol, kDefaultTol));
+  EXPECT_FALSE(
+      Helper<T>::IsClose(HALF(1.234f), HALF(1.236f), kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(
+      Helper<T>::IsClose(HALF(1.234f), HALF(1.232f), HALF(8e-4f), HALF(1e-3f)));
+  EXPECT_TRUE(Helper<T>::IsClose(HALF(1.234f), HALF(1.236f), HALF(1.4e-3f),
+                                 HALF(5e-4f)));
+
+  // Too fine-grained: won't detect the difference
+  EXPECT_TRUE(Helper<T>::IsClose(HALF(3.141592f), HALF(3.141593f), HALF(0.0),
+                                 HALF(0.0)));
+
+  // Trivial case.
+  EXPECT_FALSE(
+      Helper<T>::IsClose(HALF(1e4f), HALF(1e-4f), kDefaultTol, kDefaultTol));
+#undef HALF
+
+  // Some of the cases failed because Eigen::half doesn't behave as expected.
+  // For example, (inf == inf) should have been true, but it returns false.
+  // TODO(penporn): uncomment this test once we fix Eigen::half
+  // TestEdgeCasesClose<T>();
+}
+
+TEST(TensorTestUtilTest, ExpectTensorCloseFloat) {
+  typedef float T;
+
+  EXPECT_TRUE(Helper<T>::IsClose(1.0f, 1.1f, 0.1f, 0.1f));
+  EXPECT_TRUE(Helper<T>::IsClose(1.0f, 1.0f, 0.0f, 0.0f));
+  EXPECT_FALSE(Helper<T>::IsClose(1.0f, 1.1f, 0.0f, 0.0f));
+
+  // Epsilon:      2^-23  ~ 0.00000011920928955078
+  // kDefaultTol:  5/2^23 ~ 0.00000059604644775391
+  const T kDefaultTol =
+      static_cast<T>(kSlackFactor) * Eigen::NumTraits<T>::epsilon();
+
+  // 1.234567f -> 10,356,299/2^23 ~ 1.234567046165466308594
+  // 1.234568f -> 10,356,307/2^23 ~ 1.234567999839782714844
+  // 1.234566f -> 10,356,290/2^23 ~ 1.234565973281860351563
+  // 1.234569f -> 10,356,315/2^23 ~ 1.234568953514099121094
+  // 1.234565f -> 10,356,282/2^23 ~ 1.234565019607543945313
+  // Threshold ~ 0.00000133190576434572
+  EXPECT_TRUE(
+      Helper<T>::IsClose(1.234567f, 1.234567f, kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(
+      Helper<T>::IsClose(1.234567f, 1.234568f, kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(
+      Helper<T>::IsClose(1.234567f, 1.234566f, kDefaultTol, kDefaultTol));
+  EXPECT_FALSE(
+      Helper<T>::IsClose(1.234567f, 1.234569f, kDefaultTol, kDefaultTol));
+  EXPECT_FALSE(
+      Helper<T>::IsClose(1.234567f, 1.234565f, kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567f, 1.234569f, 8e-7f, 1e-6f));
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567f, 1.234565f, 3e-7f, 1.5e-6f));
+
+  // Too fine-grained: won't detect the difference
+  EXPECT_TRUE(Helper<T>::IsClose(3.14159265f, 3.14159266f, 0.0f, 0.0f));
+
+  // Trivial cases
+  EXPECT_FALSE(Helper<T>::IsClose(1e8f, 1e-8f, kDefaultTol, kDefaultTol));
+  EXPECT_FALSE(Helper<T>::IsClose(1e15f, 1e-15f, kDefaultTol, kDefaultTol));
+
+  TestEdgeCasesClose<T>();
+}
+
+TEST(TensorTestUtilTest, ExpectTensorCloseDouble) {
+  typedef double T;
+
+  EXPECT_TRUE(Helper<T>::IsClose(1.0, 1.1, 0.1, 0.1));
+  EXPECT_TRUE(Helper<T>::IsClose(1.0, 1.0, 0.0, 0.0));
+  EXPECT_FALSE(Helper<T>::IsClose(1.0, 1.1, 0.0, 0.0));
+
+  // Epsilon:      2^-52  ~ 2.220446049250313080847E-16
+  // kDefaultTol:  5/2^52 ~ 1.110223024625156540424E-15
+  const T kDefaultTol =
+      static_cast<T>(kSlackFactor) * Eigen::NumTraits<T>::epsilon();
+
+  // 1.234567890123456 -> 5,559,999,489,923,576/2^52 ~ 1.234567890123456024298
+  // 1.234567890123457 -> 5,559,999,489,923,580/2^52 ~ 1.234567890123456912477
+  // 1.234567890123455 -> 5,559,999,489,923,571/2^52 ~ 1.234567890123454914075
+  // 1.234567890123458 -> 5,559,999,489,923,585/2^52 ~ 1.2345678901234580227
+  // 1.234567890123454 -> 5,559,999,489,923,567/2^52 ~ 1.234567890123454025897
+  // 1.234567890123459 -> 5,559,999,489,923,589/2^52 ~ 1.234567890123458910878
+  // 1.234567890123453 -> 5,559,999,489,923,562/2^52 ~ 1.234567890123452915674
+  // Threshold ~ 2.480868721703117812159E-15
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567890123456, 1.234567890123456,
+                                 kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567890123456, 1.234567890123457,
+                                 kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567890123456, 1.234567890123455,
+                                 kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567890123456, 1.234567890123458,
+                                 kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567890123456, 1.234567890123454,
+                                 kDefaultTol, kDefaultTol));
+  EXPECT_FALSE(Helper<T>::IsClose(1.234567890123456, 1.234567890123459,
+                                  kDefaultTol, kDefaultTol));
+  EXPECT_FALSE(Helper<T>::IsClose(1.234567890123456, 1.234567890123453,
+                                  kDefaultTol, kDefaultTol));
+  EXPECT_TRUE(Helper<T>::IsClose(1.234567890123456, 1.234567890123459, 9.5e-16,
+                                 1.6e-15));
+  EXPECT_TRUE(
+      Helper<T>::IsClose(1.234567890123456, 1.234567890123453, 7e-16, 2e-15));
+
+  // Too fine-grained: won't detect the difference
+  EXPECT_TRUE(
+      Helper<T>::IsClose(3.141592653589793238, 3.141592653589793239, 0.0, 0.0));
+
+  // Trivial cases
+  EXPECT_FALSE(Helper<T>::IsClose(1e15, 1e-15, kDefaultTol, kDefaultTol));
+  EXPECT_FALSE(Helper<T>::IsClose(1e30, 1e-30, kDefaultTol, kDefaultTol));
+
+  TestEdgeCasesClose<T>();
+}
+
+}  // namespace
+}  // namespace test
+}  // namespace tensorflow
diff --git a/tensorflow/core/framework/types.h b/tensorflow/core/framework/types.h
index ded6aa0991..ff7c9855d6 100644
--- a/tensorflow/core/framework/types.h
+++ b/tensorflow/core/framework/types.h
@@ -470,6 +470,10 @@ inline bool DataTypeIsUnsigned(DataType dt) {
 // Returns a 0 on failure
 int DataTypeSize(DataType dt);
 
+// Returns HOST_MEMORY if `dtype` is always on host or is a DT_INT32,
+// DEVICE_MEMORY otherwise.
+MemoryType MTypeFromDType(const DataType dtype);
+
 // Types that always sit on host: DT_STRING, DT_STRING_REF, DT_RESOURCE.
 // For DT_RESOURCE, the handle always sits on host (even if the underlying
 // object has device-allocated resources).
diff --git a/tensorflow/core/graph/algorithm.cc b/tensorflow/core/graph/algorithm.cc
index 4652fbe406..9b4200e0b4 100644
--- a/tensorflow/core/graph/algorithm.cc
+++ b/tensorflow/core/graph/algorithm.cc
@@ -25,7 +25,8 @@ namespace tensorflow {
 
 void DFS(const Graph& g, const std::function<void(Node*)>& enter,
          const std::function<void(Node*)>& leave,
-         const NodeComparator& stable_comparator) {
+         const NodeComparator& stable_comparator,
+         const EdgeFilter& edge_filter) {
   // Stack of work to do.
   struct Work {
     Node* node;
@@ -52,7 +53,6 @@ void DFS(const Graph& g, const std::function<void(Node*)>& enter,
     // Arrange to call leave(n) when all done with descendants.
     if (leave) stack.push_back(Work{n, true});
 
-    gtl::iterator_range<NeighborIter> nodes = n->out_nodes();
     auto add_work = [&visited, &stack](Node* out) {
       if (!visited[out->id()]) {
         // Note; we must not mark as visited until we actually process it.
@@ -62,16 +62,20 @@ void DFS(const Graph& g, const std::function<void(Node*)>& enter,
 
     if (stable_comparator) {
       std::vector<Node*> nodes_sorted;
-      for (Node* out : nodes) {
-        nodes_sorted.emplace_back(out);
+      for (const Edge* out_edge : n->out_edges()) {
+        if (!edge_filter || edge_filter(*out_edge)) {
+          nodes_sorted.emplace_back(out_edge->dst());
+        }
       }
       std::sort(nodes_sorted.begin(), nodes_sorted.end(), stable_comparator);
       for (Node* out : nodes_sorted) {
         add_work(out);
       }
     } else {
-      for (Node* out : nodes) {
-        add_work(out);
+      for (const Edge* out_edge : n->out_edges()) {
+        if (!edge_filter || edge_filter(*out_edge)) {
+          add_work(out_edge->dst());
+        }
       }
     }
   }
@@ -118,8 +122,6 @@ void ReverseDFSFromHelper(const Graph& g, gtl::ArraySlice<T> start,
     // Arrange to call leave(n) when all done with descendants.
     if (leave) stack.push_back(Work{n, true});
 
-    gtl::iterator_range<NeighborIter> nodes = n->in_nodes();
-
     auto add_work = [&visited, &stack](T out) {
       if (!visited[out->id()]) {
         // Note; we must not mark as visited until we actually process it.
@@ -129,16 +131,16 @@ void ReverseDFSFromHelper(const Graph& g, gtl::ArraySlice<T> start,
 
     if (stable_comparator) {
       std::vector<T> nodes_sorted;
-      for (T in : nodes) {
-        nodes_sorted.emplace_back(in);
+      for (const Edge* in_edge : n->in_edges()) {
+        nodes_sorted.emplace_back(in_edge->src());
       }
       std::sort(nodes_sorted.begin(), nodes_sorted.end(), stable_comparator);
       for (T in : nodes_sorted) {
         add_work(in);
       }
     } else {
-      for (T in : nodes) {
-        add_work(in);
+      for (const Edge* in_edge : n->in_edges()) {
+        add_work(in_edge->src());
       }
     }
   }
@@ -161,14 +163,17 @@ void ReverseDFSFrom(const Graph& g, gtl::ArraySlice<Node*> start,
 }
 
 void GetPostOrder(const Graph& g, std::vector<Node*>* order,
-                  const NodeComparator& stable_comparator) {
+                  const NodeComparator& stable_comparator,
+                  const EdgeFilter& edge_filter) {
   order->clear();
-  DFS(g, nullptr, [order](Node* n) { order->push_back(n); }, stable_comparator);
+  DFS(g, nullptr, [order](Node* n) { order->push_back(n); }, stable_comparator,
+      edge_filter);
 }
 
 void GetReversePostOrder(const Graph& g, std::vector<Node*>* order,
-                         const NodeComparator& stable_comparator) {
-  GetPostOrder(g, order, stable_comparator);
+                         const NodeComparator& stable_comparator,
+                         const EdgeFilter& edge_filter) {
+  GetPostOrder(g, order, stable_comparator, edge_filter);
   std::reverse(order->begin(), order->end());
 }
 
diff --git a/tensorflow/core/graph/algorithm.h b/tensorflow/core/graph/algorithm.h
index ac4a099013..5bbbc6f6dc 100644
--- a/tensorflow/core/graph/algorithm.h
+++ b/tensorflow/core/graph/algorithm.h
@@ -28,6 +28,8 @@ namespace tensorflow {
 // Comparator for two nodes. This is used in order to get a stable ording.
 using NodeComparator = std::function<bool(const Node*, const Node*)>;
 
+using EdgeFilter = std::function<bool(const Edge&)>;
+
 // Compares two node based on their ids.
 struct NodeComparatorID {
   bool operator()(const Node* n1, const Node* n2) const {
@@ -47,9 +49,11 @@ struct NodeComparatorName {
 // If leave is not empty, calls leave(n) after visiting all children of n.
 // If stable_comparator is set, a stable ordering of visit is achieved by
 // sorting a node's neighbors first before visiting them.
+// If edge_filter is set then ignores edges for which edge_filter returns false.
 extern void DFS(const Graph& g, const std::function<void(Node*)>& enter,
                 const std::function<void(Node*)>& leave,
-                const NodeComparator& stable_comparator = {});
+                const NodeComparator& stable_comparator = {},
+                const EdgeFilter& edge_filter = {});
 
 // Perform a reverse depth-first-search on g starting at the sink node.
 // If enter is not empty, calls enter(n) before visiting any parents of n.
@@ -83,15 +87,21 @@ extern void ReverseDFSFrom(const Graph& g, gtl::ArraySlice<const Node*> start,
 // If stable_comparator is set, a stable ordering of visit is achieved by
 // sorting a node's neighbors first before visiting them.
 //
+// If edge_filter is set then ignores edges for which edge_filter returns false.
+//
 // REQUIRES: order is not NULL.
 void GetPostOrder(const Graph& g, std::vector<Node*>* order,
-                  const NodeComparator& stable_comparator = {});
+                  const NodeComparator& stable_comparator = {},
+                  const EdgeFilter& edge_filter = {});
 
 // Stores in *order the reverse post-order numbering of all nodes
 // If stable_comparator is set, a stable ordering of visit is achieved by
 // sorting a node's neighbors first before visiting them.
+//
+// If edge_filter is set then ignores edges for which edge_filter returns false.
 void GetReversePostOrder(const Graph& g, std::vector<Node*>* order,
-                         const NodeComparator& stable_comparator = {});
+                         const NodeComparator& stable_comparator = {},
+                         const EdgeFilter& edge_filter = {});
 
 // Prune nodes in "g" that are not in some path from the source node
 // to any node in 'nodes'. Returns true if changes were made to the graph.
diff --git a/tensorflow/core/graph/algorithm_test.cc b/tensorflow/core/graph/algorithm_test.cc
index f67d5a2fd2..60a3e66aa1 100644
--- a/tensorflow/core/graph/algorithm_test.cc
+++ b/tensorflow/core/graph/algorithm_test.cc
@@ -36,6 +36,11 @@ namespace {
 REGISTER_OP("TestParams").Output("o: float");
 REGISTER_OP("TestInput").Output("a: float").Output("b: float");
 REGISTER_OP("TestMul").Input("a: float").Input("b: float").Output("o: float");
+REGISTER_OP("TestUnary").Input("a: float").Output("o: float");
+REGISTER_OP("TestBinary")
+    .Input("a: float")
+    .Input("b: float")
+    .Output("o: float");
 
 // Compares that the order of nodes in 'inputs' respects the
 // pair orders described in 'ordered_pairs'.
@@ -148,5 +153,52 @@ TEST(AlgorithmTest, ReversePostOrderStable) {
     EXPECT_TRUE(ExpectBefore({{"t2", "t3"}}, order, &error));
   }
 }
+
+TEST(AlgorithmTest, PostOrderWithEdgeFilter) {
+  GraphDefBuilder b(GraphDefBuilder::kFailImmediately);
+  string error;
+  Node* n0 = ops::SourceOp("TestParams", b.opts().WithName("n0"));
+  Node* n1 = ops::UnaryOp("TestUnary", n0, b.opts().WithName("n1"));
+  Node* n2 = ops::UnaryOp("TestUnary", n1, b.opts().WithName("n2"));
+  Node* n3 = ops::BinaryOp("TestBinary", n2, n0, b.opts().WithName("n3"));
+
+  Graph g(OpRegistry::Global());
+  TF_ASSERT_OK(GraphDefBuilderToGraph(b, &g));
+
+  g.AddEdge(g.FindNodeId(n3->id()), 0, g.FindNodeId(n1->id()), 1);
+
+  std::vector<Node*> post_order;
+  auto edge_filter = [&](const Edge& e) {
+    return !(e.src()->id() == n3->id() && e.dst()->id() == n1->id());
+  };
+
+  std::vector<Node*> expected_post_order = {
+      g.sink_node(),          g.FindNodeId(n3->id()), g.FindNodeId(n2->id()),
+      g.FindNodeId(n1->id()), g.FindNodeId(n0->id()), g.source_node()};
+
+  std::vector<Node*> expected_reverse_post_order = expected_post_order;
+  std::reverse(expected_reverse_post_order.begin(),
+               expected_reverse_post_order.end());
+
+  GetPostOrder(g, &post_order, /*stable_comparator=*/{},
+               /*edge_filter=*/edge_filter);
+
+  ASSERT_EQ(expected_post_order.size(), post_order.size());
+  for (int i = 0; i < post_order.size(); i++) {
+    CHECK_EQ(post_order[i], expected_post_order[i])
+        << post_order[i]->name() << " vs. " << expected_post_order[i]->name();
+  }
+
+  std::vector<Node*> reverse_post_order;
+  GetReversePostOrder(g, &reverse_post_order, /*stable_comparator=*/{},
+                      /*edge_filter=*/edge_filter);
+
+  ASSERT_EQ(expected_reverse_post_order.size(), reverse_post_order.size());
+  for (int i = 0; i < reverse_post_order.size(); i++) {
+    CHECK_EQ(reverse_post_order[i], expected_reverse_post_order[i])
+        << reverse_post_order[i]->name() << " vs. "
+        << expected_reverse_post_order[i]->name();
+  }
+}
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/graph/control_flow.cc b/tensorflow/core/graph/control_flow.cc
index 1778e48ef6..8e1e56d29b 100644
--- a/tensorflow/core/graph/control_flow.cc
+++ b/tensorflow/core/graph/control_flow.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include <deque>
 #include <vector>
 
+#include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/graph/node_builder.h"
 #include "tensorflow/core/lib/core/errors.h"
@@ -54,10 +55,11 @@ Status ValidateControlFlowInfo(const Graph* graph,
       frame.parent = parent;
       frame.name = cf.frame_name;
     } else if (frame.parent != parent) {
-      return errors::InvalidArgument(
+      return errors::Internal(
           "Invalid loop structure: Mismatched parent frames for \"",
           cf.frame_name, "\": \"", parent->name, "\" vs \"", frame.parent->name,
-          "\". This is an internal bug, please file a bug report with "
+          "\". The node giving this error: ", FormatNodeForError(*node),
+          "This is an internal bug, please file a bug report with "
           "instructions on how to reproduce the error.");
     }
     if (IsLoopCond(node)) {
@@ -69,9 +71,9 @@ Status ValidateControlFlowInfo(const Graph* graph,
           !str_util::StrContains(node->name(), "LoopCounter")) {
         return errors::InvalidArgument(
             "Invalid loop structure: Loop \"", cf.frame_name,
-            "\" has more than one LoopCond node: \"", node->name(), "\" and \"",
-            frame.loop_cond->name(),
-            "\". This is an internal bug, please file a bug report with "
+            "\" has more than one LoopCond node: ", FormatNodeForError(*node),
+            " and ", FormatNodeForError(*frame.loop_cond),
+            ". This is an internal bug, please file a bug report with "
             "instructions on how to reproduce the error.");
       }
       frame.loop_cond = node;
@@ -135,12 +137,11 @@ Status BuildControlFlowInfo(const Graph* g, std::vector<ControlFlowInfo>* info,
           const string& parent_frame = (*info)[out_parent->id()].frame_name;
           if (parent_frame != frame_name) {
             return errors::InvalidArgument(
-                "The node '", out->name(),
-                "' has inputs from different "
-                "frames. The input '",
-                curr_node->name(), "' is in frame '", frame_name,
-                "'. The input '", parent_nodes[out->id()]->name(),
-                "' is in frame '", parent_frame, "'.");
+                FormatNodeForError(*out),
+                " has inputs from different frames. The input ",
+                FormatNodeForError(*curr_node), " is in frame '", frame_name,
+                "'. The input ", FormatNodeForError(*parent_nodes[out->id()]),
+                " is in frame '", parent_frame, "'.");
           }
         } else {
           out_info->frame = out;
@@ -148,7 +149,8 @@ Status BuildControlFlowInfo(const Graph* g, std::vector<ControlFlowInfo>* info,
           TF_RETURN_IF_ERROR(
               GetNodeAttr(out->attrs(), "frame_name", &out_info->frame_name));
           if (out_info->frame_name.empty()) {
-            return errors::InvalidArgument("The Enter node ", out->name(),
+            return errors::InvalidArgument("The Enter ",
+                                           FormatNodeForError(*out),
                                            " must have a frame name.");
           }
         }
@@ -156,12 +158,11 @@ Status BuildControlFlowInfo(const Graph* g, std::vector<ControlFlowInfo>* info,
         if (is_visited) {
           if (out_info->frame_name != frame_name) {
             return errors::InvalidArgument(
-                "The node '", out->name(),
-                "' has inputs from different "
-                "frames. The input '",
-                curr_node->name(), "' is in frame '", frame_name,
-                "'. The input '", parent_nodes[out->id()]->name(),
-                "' is in frame '", out_info->frame_name, "'.");
+                FormatNodeForError(*out),
+                " has inputs from different frames. The input ",
+                FormatNodeForError(*curr_node), " is in frame '", frame_name,
+                "'. The input ", FormatNodeForError(*parent_nodes[out->id()]),
+                " is in frame '", out_info->frame_name, "'.");
           }
         } else {
           out_info->frame = frame;
diff --git a/tensorflow/core/graph/control_flow_test.cc b/tensorflow/core/graph/control_flow_test.cc
index eb7937400f..803c757c3f 100644
--- a/tensorflow/core/graph/control_flow_test.cc
+++ b/tensorflow/core/graph/control_flow_test.cc
@@ -63,6 +63,15 @@ TEST(ValidateControlFlowTest, InputsFromDifferentFrames) {
   EXPECT_TRUE(str_util::StrContains(status.error_message(),
                                     "has inputs from different frames"))
       << status.error_message();
+  EXPECT_TRUE(str_util::StrContains(status.error_message(),
+                                    "{{node outer/body/inner/Merge}}"))
+      << status.error_message();
+  EXPECT_TRUE(str_util::StrContains(status.error_message(),
+                                    "{{node outer/body/inner/Enter}}"))
+      << status.error_message();
+  EXPECT_TRUE(
+      str_util::StrContains(status.error_message(), "{{node outer/Switch}}"))
+      << status.error_message();
 }
 
 TEST(ValidateControlFlowTest, MismatchedParentFrames) {
@@ -102,6 +111,8 @@ TEST(ValidateControlFlowTest, MismatchedParentFrames) {
   EXPECT_TRUE(
       str_util::StrContains(status.error_message(), "Mismatched parent frames"))
       << status.error_message();
+  EXPECT_TRUE(str_util::StrContains(status.error_message(), "{{node Enter2}}"))
+      << status.error_message();
 }
 
 TEST(ValidateControlFlowTest, TwoLoopCond) {
@@ -125,6 +136,12 @@ TEST(ValidateControlFlowTest, TwoLoopCond) {
   EXPECT_TRUE(str_util::StrContains(status.error_message(),
                                     "more than one LoopCond node"))
       << status.error_message();
+  EXPECT_TRUE(
+      str_util::StrContains(status.error_message(), "{{node sub/LoopCond}}"))
+      << status.error_message();
+  EXPECT_TRUE(
+      str_util::StrContains(status.error_message(), "{{node LoopCond}}"))
+      << status.error_message();
 }
 
 }  // namespace
diff --git a/tensorflow/core/graph/graph_constructor.cc b/tensorflow/core/graph/graph_constructor.cc
index add26f3b71..8c73f8f712 100644
--- a/tensorflow/core/graph/graph_constructor.cc
+++ b/tensorflow/core/graph/graph_constructor.cc
@@ -1042,6 +1042,14 @@ Status GraphConstructor::Convert() {
   }
 
   if (processed < node_defs_.size()) {
+    LOG(WARNING) << "IN " << __func__ << (node_defs_.size() - processed)
+                 << " NODES IN A CYCLE";
+    for (int64 i = 0; i < node_defs_.size(); i++) {
+      if (pending_count_[i] != 0) {
+        LOG(WARNING) << "PENDING: " << SummarizeNodeDef(*node_defs_[i])
+                     << "WITH PENDING COUNT = " << pending_count_[i];
+      }
+    }
     return errors::InvalidArgument(node_defs_.size() - processed,
                                    " nodes in a cycle");
   }
diff --git a/tensorflow/core/graph/graph_partition.cc b/tensorflow/core/graph/graph_partition.cc
index 1b1941f9c1..ea0a814ab8 100644
--- a/tensorflow/core/graph/graph_partition.cc
+++ b/tensorflow/core/graph/graph_partition.cc
@@ -214,6 +214,14 @@ NodeDef* AddSend(const PartitionOptions& opts, const GraphInfo& g_info,
       cast_builder.Attr("_start_time", start_time);
     }
     cast_builder.Attr("DstT", cast_dtype);
+
+    if (cast_dtype == DT_BFLOAT16) {
+      // the below attribute specifies that the cast to bfloat16 should use
+      // truncation. This is needed to retain legacy behavior when we change
+      // the default bfloat16 casts to use rounding instead of truncation
+      cast_builder.Attr("Truncate", true);
+    }
+
     NodeDef* cast = gdef->add_node();
     *status = cast_builder.Finalize(cast);
     if (!status->ok()) return nullptr;
diff --git a/tensorflow/core/graph/mkl_graph_util.h b/tensorflow/core/graph/mkl_graph_util.h
index 5f51d6083b..333bf761b0 100644
--- a/tensorflow/core/graph/mkl_graph_util.h
+++ b/tensorflow/core/graph/mkl_graph_util.h
@@ -17,7 +17,6 @@ limitations under the License.
 #define TENSORFLOW_CORE_GRAPH_MKL_GRAPH_UTIL_H_
 #ifdef INTEL_MKL
 
-#include <string>
 #include "tensorflow/core/framework/op_kernel.h"
 
 namespace tensorflow {
diff --git a/tensorflow/core/graph/mkl_layout_pass.cc b/tensorflow/core/graph/mkl_layout_pass.cc
index b9667998d6..c22e0a3872 100644
--- a/tensorflow/core/graph/mkl_layout_pass.cc
+++ b/tensorflow/core/graph/mkl_layout_pass.cc
@@ -22,7 +22,6 @@ limitations under the License.
 #include <memory>
 #include <queue>
 #include <set>
-#include <string>
 #include <unordered_set>
 #include <utility>
 #include <vector>
@@ -2495,13 +2494,13 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
                       CopyAttrsLRN, LrnRewrite});
     rinfo_.push_back({csinfo_.lrn_grad,
                       mkl_op_registry::GetMklOpName(csinfo_.lrn_grad),
-                      CopyAttrsLRN, LrnRewrite});
+                      CopyAttrsLRN, LrnGradRewrite});
     rinfo_.push_back({csinfo_.max_pool,
                       mkl_op_registry::GetMklOpName(csinfo_.max_pool),
                       CopyAttrsPooling, NonDepthBatchWisePoolRewrite});
     rinfo_.push_back({csinfo_.max_pool_grad,
                       mkl_op_registry::GetMklOpName(csinfo_.max_pool_grad),
-                      CopyAttrsPooling, AlwaysRewrite});
+                      CopyAttrsPooling, MaxpoolGradRewrite});
 
     rinfo_.push_back({csinfo_.maximum,
                       mkl_op_registry::GetMklOpName(csinfo_.maximum),
@@ -2887,6 +2886,41 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
     return false;
   }
 
+  static bool LrnGradRewrite(const Node* n) {
+    CHECK_NOTNULL(n);
+    bool do_rewrite = false;
+
+    for (const Edge* e : n->in_edges()) {
+      // Rewrite only if there is corresponding LRN, i.e workspace is available
+      if (e->dst()->type_string() == csinfo_.lrn_grad && e->dst_input() == 2 &&
+          e->src()->type_string() ==
+              mkl_op_registry::GetMklOpName(csinfo_.lrn) &&
+          e->src_output() == 0) {
+        do_rewrite = true;
+        break;
+      }
+    }
+    return do_rewrite;
+  }
+
+  static bool MaxpoolGradRewrite(const Node* n) {
+    CHECK_NOTNULL(n);
+    bool do_rewrite = false;
+    for (const Edge* e : n->in_edges()) {
+      // Rewrite only if there is corresponding Maxpool, i.e workspace is
+      // available
+      if (e->dst()->type_string() == csinfo_.max_pool_grad &&
+          e->dst_input() == 1 &&
+          e->src()->type_string() ==
+              mkl_op_registry::GetMklOpName(csinfo_.max_pool) &&
+          e->src_output() == 0) {
+        do_rewrite = true;
+        break;
+      }
+    }
+    return do_rewrite;
+  }
+
   static bool AddNRewrite(const Node* n) {
     CHECK_NOTNULL(n);
 
@@ -3421,44 +3455,9 @@ Status MklLayoutRewritePass::SetUpInputs(
 // TODO(nhasabni) We should move this to mkl_util.h.
 void MklLayoutRewritePass::GetDummyWorkspaceTensorNode(
     std::unique_ptr<Graph>* g, Node** out, Node* orig_node) {
-  // We use a tensor of shape {1} and value 0 to represent
-  // dummy float tensor. We need this as a dummy workspace tensor.
-  // Workspace tensor has type uint8.
-  const DataType dt = DataTypeToEnum<uint8>::v();
-  TensorProto proto;
-  proto.set_dtype(dt);
-  float zero[1] = {0};
-  proto.set_tensor_content(string(reinterpret_cast<char*>(&zero), 4));
-  TensorShape dummy_shape({1});
-  dummy_shape.AsProto(proto.mutable_tensor_shape());
-  TF_CHECK_OK(NodeBuilder((*g)->NewName("DMT"), "Const")
-                  .Attr("value", proto)
-                  .Attr("dtype", dt)
-                  .Device(orig_node->def().device())  // We place this node on
-                                                      // same the device as the
-                                                      // device of the original
-                                                      // node.
-                  .Finalize(&**g, out));
-
-  // If number of inputs to the original node is > 0, then we add
-  // control dependency between 1st input (index 0) of the original node and
-  // the dummy Mkl node. This is needed because control-flow ops such as Enter,
-  // Merge, etc, require frame_name of the dummy Mkl node to be same as the
-  // rewritten node. Adding control edge between 1st input of the original node
-  // and the dummy Mkl node ensures that the dummy node is in the same frame
-  // as the original node. Choosing 1st input is not necessary - any input of
-  // the original node is fine because all the inputs of a node are always in
-  // the same frame.
-  if (orig_node->num_inputs() > 0) {
-    Node* orig_input0 = nullptr;
-    TF_CHECK_OK(
-        orig_node->input_node(0, const_cast<const Node**>(&orig_input0)));
-    // Allow duplicate while adding control edge as it would fail (return
-    // NULL) if we try to add duplicate edge.
-    CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out, true));
-  }
-
-  (*out)->set_assigned_device_name(orig_node->assigned_device_name());
+  // We use uint8 tensor of shape 8 with content {0,0,0,0,0,0,0,0} to represent
+  // workspace tensor.
+  GetDummyMklTensorNode(g, out, orig_node);
 }
 
 void MklLayoutRewritePass::AddWorkSpaceEdgeIfNeeded(
diff --git a/tensorflow/core/graph/mkl_layout_pass_test.cc b/tensorflow/core/graph/mkl_layout_pass_test.cc
index fc474c0dc8..a41f5861af 100644
--- a/tensorflow/core/graph/mkl_layout_pass_test.cc
+++ b/tensorflow/core/graph/mkl_layout_pass_test.cc
@@ -19,7 +19,6 @@ limitations under the License.
 #include "tensorflow/core/graph/mkl_graph_util.h"
 
 #include <algorithm>
-#include <string>
 #include <vector>
 
 #include "tensorflow/core/framework/op.h"
@@ -3015,12 +3014,8 @@ TEST_F(MklLayoutPassTest, LRN_Negative2) {
       "node { name: 'E' op: 'Zeta' attr { key: 'T' value { type: DT_FLOAT } }"
       " input: ['A', 'D'] }");
   EXPECT_EQ(DoMklLayoutOptimizationPass(),
-            "A(Input);B(Input);C(Input);D(_MklLRNGrad);DMT/_0(Const);"
-            "DMT/_1(Const);DMT/_2(Const);DMT/_3(Const);DMT/_4(Const);E(Zeta)|"
-            "A->D;A->E;A:control->DMT/_0:control;A:control->DMT/_1:control;"
-            "A:control->DMT/_2:control;A:control->DMT/_3:control;"
-            "A:control->DMT/_4:control;B->D:1;C->D:2;D->E:1;DMT/_0->D:3;"
-            "DMT/_1->D:7;DMT/_2->D:4;DMT/_3->D:5;DMT/_4->D:6");
+            "A(Input);B(Input);C(Input);D(LRNGrad);"
+            "E(Zeta)|A->D;A->E;B->D:1;C->D:2;D->E:1");
 }
 
 /* Test LRN->LRNGrad negative case, where single LRN feeds
@@ -3058,15 +3053,11 @@ TEST_F(MklLayoutPassTest, LRN_Negative3) {
       " input: ['E', 'F'] }");
   EXPECT_EQ(DoMklLayoutOptimizationPass(),
             "A(Input);B(_MklLRN);C(Input);D(Input);DMT/_0(Const);DMT/_1(Const);"
-            "DMT/_2(Const);DMT/_3(Const);DMT/_4(Const);DMT/_5(Const);"
-            "DMT/_6(Const);E(_MklLRNGrad);F(_MklLRNGrad);G(Zeta)|A->B;"
-            "A:control->DMT/_0:control;B->E:2;"
-            "B->F:1;B:1->E:3;B:2->E:6;B:2->F:5;B:3->E:7;C->E;C->F;"
-            "C:control->DMT/_1:control;C:control->DMT/_2:control;"
-            "C:control->DMT/_3:control;C:control->DMT/_4:control;"
-            "C:control->DMT/_5:control;C:control->DMT/_6:control;"
-            "D->E:1;D->F:2;DMT/_0->B:1;DMT/_1->E:4;DMT/_2->E:5;DMT/_3->F:3;"
-            "DMT/_4->F:7;DMT/_5->F:4;DMT/_6->F:6;E->G;F->G:1");
+            "DMT/_2(Const);E(_MklLRNGrad);F(LRNGrad);G(Zeta)|A->B;"
+            "A:control->DMT/_0:control;B->E:2;B->F:1;B:1->E:3;B:2->E:6;"
+            "B:3->E:7;C->E;C->F;C:control->DMT/_1:control;"
+            "C:control->DMT/_2:control;D->E:1;D->F:2;DMT/_0->B:1;"
+            "DMT/_1->E:4;DMT/_2->E:5;E->G;F->G:1");
 }
 
 /* Test MaxPool->MaxPoolGrad replacement by workspace+rewrite nodes. */
@@ -3137,12 +3128,8 @@ TEST_F(MklLayoutPassTest, NodeWorkspace_MaxPool_Negative2) {
       "node { name: 'E' op: 'Zeta' attr { key: 'T' value { type: DT_FLOAT } }"
       " input: ['A', 'D'] }");
   EXPECT_EQ(DoMklLayoutOptimizationPass(),
-            "A(Input);B(Input);C(Input);D(_MklMaxPoolGrad);DMT/_0(Const);"
-            "DMT/_1(Const);DMT/_2(Const);DMT/_3(Const);DMT/_4(Const);E(Zeta)|"
-            "A->D;A->E;A:control->DMT/_0:control;A:control->DMT/_1:control;"
-            "A:control->DMT/_2:control;A:control->DMT/_3:control;"
-            "A:control->DMT/_4:control;B->D:1;C->D:2;D->E:1;DMT/_0->D:3;"
-            "DMT/_1->D:7;DMT/_2->D:4;DMT/_3->D:5;DMT/_4->D:6");
+            "A(Input);B(Input);C(Input);D(MaxPoolGrad);"
+            "E(Zeta)|A->D;A->E;B->D:1;C->D:2;D->E:1");
 }
 
 // Test MaxPool handling for batch-wise pooling (NCHW)
diff --git a/tensorflow/core/graph/mkl_tfconversion_pass.cc b/tensorflow/core/graph/mkl_tfconversion_pass.cc
index e9ced4d2b6..aa39af637f 100644
--- a/tensorflow/core/graph/mkl_tfconversion_pass.cc
+++ b/tensorflow/core/graph/mkl_tfconversion_pass.cc
@@ -18,7 +18,6 @@ limitations under the License.
 #include <memory>
 #include <queue>
 #include <set>
-#include <string>
 #include <utility>
 #include <vector>
 
diff --git a/tensorflow/core/graph/mkl_tfconversion_pass_test.cc b/tensorflow/core/graph/mkl_tfconversion_pass_test.cc
index bbdbe78bbd..ebcb6de551 100644
--- a/tensorflow/core/graph/mkl_tfconversion_pass_test.cc
+++ b/tensorflow/core/graph/mkl_tfconversion_pass_test.cc
@@ -19,7 +19,6 @@ limitations under the License.
 #include "tensorflow/core/graph/mkl_graph_util.h"
 
 #include <algorithm>
-#include <string>
 #include <vector>
 
 #include "tensorflow/core/framework/op.h"
diff --git a/tensorflow/core/graph/tensor_id.cc b/tensorflow/core/graph/tensor_id.cc
index b5c2c2aac8..80c76df255 100644
--- a/tensorflow/core/graph/tensor_id.cc
+++ b/tensorflow/core/graph/tensor_id.cc
@@ -24,9 +24,6 @@ namespace tensorflow {
 
 TensorId::TensorId(const SafeTensorId& id) : TensorId(id.first, id.second) {}
 
-SafeTensorId::SafeTensorId(StringPiece str, int idx)
-    : SafeTensorId(str.ToString(), idx) {}
-
 SafeTensorId::SafeTensorId(const TensorId& id)
     : SafeTensorId(id.first.ToString(), id.second) {}
 
diff --git a/tensorflow/core/graph/tensor_id.h b/tensorflow/core/graph/tensor_id.h
index b0978b4120..0ba3942618 100644
--- a/tensorflow/core/graph/tensor_id.h
+++ b/tensorflow/core/graph/tensor_id.h
@@ -62,13 +62,10 @@ TensorId ParseTensorName(StringPiece name);
 struct SafeTensorId : public std::pair<string, int> {
   typedef std::pair<string, int> Base;
 
-  // Inherit the set of constructors.
-  using Base::pair;
-
   // NOTE(skyewm): this is required on some platforms. I'm not sure why the
-  // using statement above isn't always sufficient.
+  // using "using Base::pair;" isn't always sufficient.
   SafeTensorId() : Base() {}
-  SafeTensorId(StringPiece str, int idx);
+  SafeTensorId(const string& str, int idx) : Base(str, idx) {}
   SafeTensorId(const TensorId& id);
 
   string ToString() const {
diff --git a/tensorflow/core/grappler/BUILD b/tensorflow/core/grappler/BUILD
index 9dcc6765f5..7c6fe56e1f 100644
--- a/tensorflow/core/grappler/BUILD
+++ b/tensorflow/core/grappler/BUILD
@@ -33,6 +33,7 @@ tf_cc_test(
     name = "utils_test",
     srcs = ["utils_test.cc"],
     deps = [
+        ":grappler_item",
         ":utils",
         "//tensorflow/cc:cc_ops",
         "//tensorflow/core:all_kernels",
@@ -151,3 +152,32 @@ tf_cc_test(
         "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
     ],
 )
+
+cc_library(
+    name = "mutable_graph_view",
+    srcs = [
+        "mutable_graph_view.cc",
+    ],
+    hdrs = ["mutable_graph_view.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_view",
+        ":grappler_item",
+        ":utils",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
+tf_cc_test(
+    name = "mutable_graph_view_test",
+    srcs = ["mutable_graph_view_test.cc"],
+    deps = [
+        ":grappler_item",
+        ":mutable_graph_view",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
+    ],
+)
diff --git a/tensorflow/core/grappler/clusters/cluster.cc b/tensorflow/core/grappler/clusters/cluster.cc
index 8d8c6084ec..6ca379323e 100644
--- a/tensorflow/core/grappler/clusters/cluster.cc
+++ b/tensorflow/core/grappler/clusters/cluster.cc
@@ -29,11 +29,24 @@ void Cluster::AllowSoftPlacement(bool soft_placement_state) {
   options_.config.set_allow_soft_placement(soft_placement_state);
 }
 
+void Cluster::SetNumInterOpThreads(int num_threads) {
+  for (int i = 0; i < options_.config.session_inter_op_thread_pool_size();
+       ++i) {
+    options_.config.mutable_session_inter_op_thread_pool(i)->set_num_threads(
+        num_threads);
+  }
+}
+
 void Cluster::SetNumWarmupSteps(int num_steps) {
   options_.config.mutable_graph_options()->set_build_cost_model_after(
       num_steps);
 }
 
+// Set executor type to instantiate
+void Cluster::SetExecutorType(const string* executor_type) {
+  options_.config.mutable_experimental()->set_executor_type(*executor_type);
+}
+
 int Cluster::NumWarmupSteps() const {
   return options_.config.graph_options().build_cost_model_after();
 }
diff --git a/tensorflow/core/grappler/clusters/cluster.h b/tensorflow/core/grappler/clusters/cluster.h
index 06db36b3aa..519d5ed875 100644
--- a/tensorflow/core/grappler/clusters/cluster.h
+++ b/tensorflow/core/grappler/clusters/cluster.h
@@ -65,10 +65,16 @@ class Cluster {
   // with reftype input(s) which are from CPU.
   void AllowSoftPlacement(bool soft_placement_state);
 
+  // Update the number of inter-op threads for each per-session threadpool
+  void SetNumInterOpThreads(int num_threads);
+
   // Set the number of steps required to warmup TensorFlow. Must be called
   // before Provision().
   void SetNumWarmupSteps(int num_steps);
 
+  // Set executor type to instantiate
+  void SetExecutorType(const string* executor_type);
+
   // Returns the number of warmup steps.
   int NumWarmupSteps() const;
 
diff --git a/tensorflow/core/grappler/costs/BUILD b/tensorflow/core/grappler/costs/BUILD
index b054068299..f3dc2c2091 100644
--- a/tensorflow/core/grappler/costs/BUILD
+++ b/tensorflow/core/grappler/costs/BUILD
@@ -41,6 +41,7 @@ cc_library(
     visibility = ["//visibility:public"],
     deps = [
         ":utils",
+        "//tensorflow/core/grappler/utils:functions",
         "//tensorflow/core/grappler/utils:topological_sort",
         "//tensorflow/core/grappler:graph_view",
         "//tensorflow/core/grappler:op_types",
diff --git a/tensorflow/core/grappler/costs/graph_properties.cc b/tensorflow/core/grappler/costs/graph_properties.cc
index 0c02876ac5..231c7c63be 100644
--- a/tensorflow/core/grappler/costs/graph_properties.cc
+++ b/tensorflow/core/grappler/costs/graph_properties.cc
@@ -28,6 +28,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/graph_view.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/functions.h"
 #include "tensorflow/core/grappler/utils/topological_sort.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 
@@ -422,11 +423,106 @@ class SymbolicShapeRefiner {
     return it->second.inference_context.get();
   }
 
-  // Forward the shapes from the function's fanin to the function body,
-  // then call PropagateShapes.
-  // Returns an error if 'node' is not a function node.
-  Status UpdateFunction(const NodeDef* node, bool* refined) {
-    return UpdateNode(node, refined);
+  // Forward the shapes from the function input nodes to
+  // the argument nodes (which are Placeholder nodes), then
+  // perform shape inference on the function body.
+  //
+  // Propagate shape information of final function body node
+  // to function node `node`.
+  //
+  // In the event of an error, UpdateNode will simply set `node`'s
+  // output shape to be Unknown.
+  Status UpdateFunction(const NodeDef* node) {
+    auto it = fun_to_grappler_function_item_.find(node->op());
+    if (it == fun_to_grappler_function_item_.end()) {
+      return errors::InvalidArgument(
+          node->op(), " was not previously added to SymbolicShapeRefiner.");
+    }
+
+    GrapplerFunctionItem& grappler_function_item = it->second;
+    GraphView gv(&grappler_function_item.graph);
+
+    // Forward shapes from function input nodes to argument nodes.
+    for (int i = 0; i < grappler_function_item.inputs().size(); ++i) {
+      auto& fun_input = grappler_function_item.input(i);
+      if (fun_input.placeholders.size() > 1) {
+        // TODO(jmdecker): Handle case with multiple input placeholders
+        return errors::Unimplemented(
+            "Input arguments with multiple placeholders are not yet "
+            "supported.");
+      }
+      NodeDef* fun_node = gv.GetNode(fun_input.input_name);
+      const string& input = node->input(i);
+      const string& node_name = NodeName(input);
+
+      if (IsControlInput(input)) {
+        return errors::FailedPrecondition(
+            "Function inputs should not contain control nodes.");
+      }
+
+      NodeDef* input_node = graph_.GetNode(node_name);
+      if (input_node == nullptr) {
+        return errors::FailedPrecondition(node_name,
+                                          " was not found in the graph.");
+      }
+
+      InferenceContext* input_inference_context = GetContext(input_node);
+      if (input_inference_context == nullptr) {
+        return errors::FailedPrecondition(
+            "Inference context has not been created for ", node_name);
+      }
+
+      int output_port_num = NodePosition(input);
+      AttrValue attr_output_shape;
+      TensorShapeProto proto;
+      const auto& handle = input_inference_context->output(output_port_num);
+      input_inference_context->ShapeHandleToProto(handle, &proto);
+      *attr_output_shape.mutable_shape() = proto;
+      (*fun_node->mutable_attr())["shape"] = attr_output_shape;
+    }
+
+    // Perform inference on function body.
+    GraphProperties gp(grappler_function_item);
+    TF_RETURN_IF_ERROR(gp.InferStatically(true));
+
+    // Add return nodes for output shapes.
+    auto ic = GetContext(node);
+    int output = 0;
+    for (auto const& out_arg : grappler_function_item.outputs()) {
+      if (out_arg.output_tensors.size() > 1) {
+        // TODO(jmdecker): Handle case of multiple output tensors
+        return errors::Unimplemented(
+            "Output arguments with multiple output tensors are not yet "
+            "supported.");
+      }
+
+      // It is guaranteed that output_tensors does not contain any control
+      // inputs, so port_id >= 0.
+      string out_tensor = out_arg.output_tensors[0];
+      int port_id;
+      string node_name = ParseNodeName(out_tensor, &port_id);
+
+      const NodeDef* retnode = gv.GetNode(node_name);
+      if (retnode == nullptr) {
+        return errors::FailedPrecondition("Unable to find return node ",
+                                          node_name, " for ", node->name());
+      }
+
+      auto output_properties = gp.GetOutputProperties(retnode->name());
+      if (port_id >= output_properties.size()) {
+        return errors::InvalidArgument(
+            out_tensor, " has invalid position ", port_id,
+            " (output_properties.size() = ", output_properties.size(), ").");
+      }
+      auto const& outprop = output_properties[port_id];
+      const TensorShapeProto& shape = outprop.shape();
+      ShapeHandle out;
+      TF_RETURN_IF_ERROR(ic->MakeShapeFromShapeProto(shape, &out));
+      ic->set_output(output, out);
+      output++;
+    }
+
+    return Status::OK();
   }
 
   Status UpdateNode(const NodeDef* node, bool* refined) {
@@ -436,6 +532,7 @@ class SymbolicShapeRefiner {
       node_context = CHECK_NOTNULL(GetNodeContext(node));
       *refined = true;
     }
+
     // Check if the shapes of the nodes in the fan-in of this node have changed,
     // and if they have, update the node input shapes.
     InferenceContext* inference_context = node_context->inference_context.get();
@@ -455,7 +552,8 @@ class SymbolicShapeRefiner {
         if (c == nullptr) {
           return errors::FailedPrecondition(
               "Input ", dst_input, " ('", input->name(), "') for '",
-              node->name(), "' was not previously added to ShapeRefiner.");
+              node->name(),
+              "' was not previously added to SymbolicShapeRefiner.");
         }
 
         if (IsConstant(*input)) {
@@ -565,6 +663,21 @@ class SymbolicShapeRefiner {
     node_context->inference_context->set_input_tensors_as_shapes(
         input_tensors_as_shapes);
 
+    // Properly handle function nodes.
+    if (node_context->op_data && node_context->op_data->is_function_op) {
+      // TODO(jmdecker): Detect if the input shapes have changed for this
+      // function. Note that when we hit a function call node, refined will be
+      // true, as the updates to the call node will have changed, even if it's
+      // the same function being called twice with the same input shapes.
+      // Example: simple_function.pbtxt
+      if (UpdateFunction(node).ok()) {
+        return Status::OK();
+      } else {
+        VLOG(1) << "UpdateFunction failed for " << node->op()
+                << ". Defaulting to ShapeUnknown.";
+      }
+    }
+
     // Update the shapes of the outputs.
     return InferShapes(*node, node_context);
   }
@@ -681,7 +794,39 @@ class SymbolicShapeRefiner {
     return true;
   }
 
-  Status AddFunction(const NodeDef* node) { return Status::OK(); }
+  Status AddFunction(const NodeDef* function_node) {
+    auto it = fun_to_grappler_function_item_.find(function_node->op());
+    if (it != fun_to_grappler_function_item_.end()) {
+      return Status::OK();
+    }
+
+    const FunctionDef* function_def =
+        CHECK_NOTNULL(function_library_.Find(function_node->op()));
+
+    GrapplerFunctionItem grappler_function_item;
+    TF_RETURN_IF_ERROR(MakeGrapplerFunctionItem(
+        *function_def, function_library_, &grappler_function_item));
+
+    if (grappler_function_item.inputs().size() > function_node->input_size()) {
+      return errors::FailedPrecondition(
+          "Function input size should be smaller than node input size.");
+    }
+
+    for (int i = grappler_function_item.inputs().size();
+         i < function_node->input_size(); ++i) {
+      const string& input = function_node->input(i);
+      if (!IsControlInput(input)) {
+        return errors::FailedPrecondition(
+            "Found regular input (", input,
+            ") instead of control nodes for node ", function_node->name());
+      }
+    }
+
+    fun_to_grappler_function_item_[function_def->signature().name()] =
+        grappler_function_item;
+
+    return Status::OK();
+  }
 
   Status AddNode(const NodeDef* node) {
     NodeContext& node_ctx = node_to_context_[node];
@@ -911,6 +1056,8 @@ class SymbolicShapeRefiner {
   std::unordered_map<const NodeDef*, NodeContext> node_to_context_;
   std::unordered_map<ShapeId, ShapeHandle, HashShapeId> unknown_shapes_;
   std::unordered_map<DimId, DimensionHandle, HashDimId> unknown_dims_;
+  std::unordered_map<string, GrapplerFunctionItem>
+      fun_to_grappler_function_item_;
   FunctionLibraryDefinition function_library_;
   const std::unordered_map<string, std::unordered_set<int>>& fed_ports_;
 };
@@ -1082,13 +1229,9 @@ Status GraphProperties::UpdateShapes(
     // Set shapes and types of Queue ops, if needed.
     TF_RETURN_IF_ERROR(UpdateQueue(n, shape_refiner, new_shapes));
   } else {
-    auto c = shape_refiner->GetNodeContext(n);
-    if (c && c->op_data && c->op_data->is_function_op) {
-      TF_RETURN_IF_ERROR(shape_refiner->UpdateFunction(n, new_shapes));
-    } else {
-      // Rely on regular TF shape refinement for all the other nodes.
-      TF_RETURN_IF_ERROR(shape_refiner->UpdateNode(n, new_shapes));
-    }
+    // Rely on regular TF shape refinement for all the other nodes.
+    // UpdateNode calls UpdateFunction if a function node is detected.
+    TF_RETURN_IF_ERROR(shape_refiner->UpdateNode(n, new_shapes));
   }
   return Status::OK();
 }
diff --git a/tensorflow/core/grappler/costs/graph_properties_test.cc b/tensorflow/core/grappler/costs/graph_properties_test.cc
index aa787ae620..5acfb56b05 100644
--- a/tensorflow/core/grappler/costs/graph_properties_test.cc
+++ b/tensorflow/core/grappler/costs/graph_properties_test.cc
@@ -783,7 +783,7 @@ TEST_F(GraphPropertiesTest, InferRestoreOpShape_WithTwoNodesShareSameOutput) {
   EXPECT_EQ("float: [128,256]", PropToString(prop));
 }
 
-TEST_F(GraphPropertiesTest, FunctionStaticShapeInference) {
+TEST_F(GraphPropertiesTest, SimpleFunctionStaticShapeInference) {
   // Test graph produced in python using:
   /*
     @function.Defun(*[tf.float32] * 2, noinline=True)
@@ -796,7 +796,6 @@ TEST_F(GraphPropertiesTest, FunctionStaticShapeInference) {
       z = MyAdd(x, y)
       z = MyAdd(x, z)
   */
-  // Check that the shape inference code infers what it can.
   GrapplerItem item;
   string filename = io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPath,
                                  "simple_function.pbtxt");
@@ -806,15 +805,296 @@ TEST_F(GraphPropertiesTest, FunctionStaticShapeInference) {
   const auto out_props = properties.GetOutputProperties("MyAdd_55e046a8");
   const OpInfo::TensorProperties& out_prop = out_props[0];
   EXPECT_EQ(DT_FLOAT, out_prop.dtype());
-  EXPECT_TRUE(out_prop.shape().unknown_rank());
+  EXPECT_FALSE(out_prop.shape().unknown_rank());
+  EXPECT_EQ(2, out_prop.shape().dim_size());
+  EXPECT_EQ(1, out_prop.shape().dim(0).size());
+  EXPECT_EQ(2, out_prop.shape().dim(1).size());
 
   const auto in_props = properties.GetInputProperties("MyAdd_55e046a8");
+  EXPECT_EQ(2, in_props.size());
+
+  const OpInfo::TensorProperties& in_prop = in_props[0];
+  EXPECT_EQ(DT_FLOAT, in_prop.dtype());
+  EXPECT_FALSE(in_prop.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop.shape().dim_size());
+  EXPECT_EQ(1, in_prop.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop.shape().dim(1).size());
+
+  const OpInfo::TensorProperties& in_prop1 = in_props[1];
+  EXPECT_EQ(DT_FLOAT, in_prop1.dtype());
+  EXPECT_FALSE(in_prop1.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop1.shape().dim_size());
+  EXPECT_EQ(1, in_prop1.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop1.shape().dim(1).size());
+}
+
+TEST_F(GraphPropertiesTest, LargeFunctionStaticShapeInference) {
+  GrapplerItem item;
+  string filename = io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPath,
+                                 "large_function_graph.pbtxt");
+  TF_CHECK_OK(ReadGraphDefFromFile(filename, &item.graph));
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(false));
+
+  const auto out_props = properties.GetOutputProperties("y0");
+  EXPECT_EQ(2, out_props.size());
+
+  const OpInfo::TensorProperties& out_prop0 = out_props[0];
+  EXPECT_EQ(DT_FLOAT, out_prop0.dtype());
+  EXPECT_EQ(4, out_prop0.shape().dim_size());
+  EXPECT_EQ(128, out_prop0.shape().dim(0).size());
+  EXPECT_EQ(112, out_prop0.shape().dim(1).size());
+  EXPECT_EQ(112, out_prop0.shape().dim(2).size());
+  EXPECT_EQ(64, out_prop0.shape().dim(3).size());
+
+  const OpInfo::TensorProperties& out_prop1 = out_props[1];
+  EXPECT_EQ(DT_FLOAT, out_prop1.dtype());
+  EXPECT_EQ(128, out_prop1.shape().dim(0).size());
+  EXPECT_EQ(112, out_prop1.shape().dim(1).size());
+  EXPECT_EQ(112, out_prop1.shape().dim(2).size());
+  EXPECT_EQ(24, out_prop1.shape().dim(3).size());
+
+  const auto in_props = properties.GetInputProperties("y0");
+  EXPECT_EQ(4, in_props.size());
+
+  const OpInfo::TensorProperties& in_prop0 = in_props[0];
+  EXPECT_EQ(DT_FLOAT, in_prop0.dtype());
+  EXPECT_EQ(1, in_prop0.shape().dim_size());
+  EXPECT_EQ(64, in_prop0.shape().dim(0).size());
+
+  const OpInfo::TensorProperties& in_prop1 = in_props[1];
+  EXPECT_EQ(DT_FLOAT, in_prop1.dtype());
+  EXPECT_EQ(4, in_prop1.shape().dim_size());
+  EXPECT_EQ(1, in_prop1.shape().dim(0).size());
+  EXPECT_EQ(1, in_prop1.shape().dim(1).size());
+  EXPECT_EQ(24, in_prop1.shape().dim(2).size());
+  EXPECT_EQ(64, in_prop1.shape().dim(3).size());
+
+  const OpInfo::TensorProperties& in_prop2 = in_props[2];
+  EXPECT_EQ(DT_FLOAT, in_prop2.dtype());
+  EXPECT_EQ(4, in_prop2.shape().dim_size());
+  EXPECT_EQ(128, in_prop2.shape().dim(0).size());
+  EXPECT_EQ(224, in_prop2.shape().dim(1).size());
+  EXPECT_EQ(224, in_prop2.shape().dim(2).size());
+  EXPECT_EQ(3, in_prop2.shape().dim(3).size());
+
+  const OpInfo::TensorProperties& in_prop3 = in_props[3];
+  EXPECT_EQ(DT_FLOAT, in_prop3.dtype());
+  EXPECT_EQ(4, in_prop3.shape().dim_size());
+  EXPECT_EQ(7, in_prop3.shape().dim(0).size());
+  EXPECT_EQ(7, in_prop3.shape().dim(1).size());
+  EXPECT_EQ(3, in_prop3.shape().dim(2).size());
+  EXPECT_EQ(8, in_prop3.shape().dim(3).size());
+}
+
+TEST_F(GraphPropertiesTest, LargeFunctionWithMultipleOutputs) {
+  // Test graph produced in python using:
+  /*
+    @function.Defun(noinline=True)
+    def MyFunc():
+      @function.Defun(*[tf.float32] * 2)
+      def Cond(n, unused_x):
+        return n > 0
+
+      @function.Defun(*[tf.float32] * 2)
+      def Body(n, x):
+        return n - 1, x + n
+
+      i = tf.constant(10)
+      return functional_ops.While([i, 0.], Cond, Body)
+
+    with tf.Graph().as_default():
+      z = MyFunc()
+  */
+  GrapplerItem item;
+  string filename = io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPath,
+                                 "function_functional_while.pbtxt");
+  TF_CHECK_OK(ReadGraphDefFromFile(filename, &item.graph));
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(false));
+
+  const auto out_props = properties.GetOutputProperties("MyFunc_AenMyWWx1Us");
+  EXPECT_EQ(2, out_props.size());
+
+  const OpInfo::TensorProperties& out_prop0 = out_props[0];
+  EXPECT_EQ(DT_INT32, out_prop0.dtype());
+  EXPECT_FALSE(out_prop0.shape().unknown_rank());
+
+  const OpInfo::TensorProperties& out_prop1 = out_props[1];
+  EXPECT_EQ(DT_FLOAT, out_prop1.dtype());
+  EXPECT_FALSE(out_prop1.shape().unknown_rank());
+}
+
+TEST_F(GraphPropertiesTest, FunctionWithErrorStaticShapeInference) {
+  GrapplerItem item;
+  string filename = io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPath,
+                                 "function_error.pbtxt");
+  TF_CHECK_OK(ReadGraphDefFromFile(filename, &item.graph));
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(false));
+
+  const auto out_props = properties.GetOutputProperties("MyAdd_yabA4wXEdM4");
+  EXPECT_EQ(1, out_props.size());
+
+  const OpInfo::TensorProperties& out_prop = out_props[0];
+  EXPECT_EQ(DT_FLOAT, out_prop.dtype());
+  EXPECT_TRUE(out_prop.shape().unknown_rank());
+
+  const auto in_props = properties.GetInputProperties("MyAdd_yabA4wXEdM4");
+  EXPECT_EQ(2, in_props.size());
+
+  const OpInfo::TensorProperties& in_prop = in_props[0];
+  EXPECT_EQ(DT_FLOAT, in_prop.dtype());
+  EXPECT_FALSE(in_prop.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop.shape().dim_size());
+  EXPECT_EQ(1, in_prop.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop.shape().dim(1).size());
+
+  const OpInfo::TensorProperties& in_prop1 = in_props[1];
+  EXPECT_EQ(DT_FLOAT, in_prop1.dtype());
+  EXPECT_FALSE(in_prop1.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop1.shape().dim_size());
+  EXPECT_EQ(1, in_prop1.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop1.shape().dim(1).size());
+}
+
+TEST_F(GraphPropertiesTest, FunctionSwitchStaticShapeInference) {
+  // Test graph produced in python using:
+  /*
+    @function.Defun(*[tf.float32] * 2, noinline=True)
+    def MyAdd(x, y):
+      return tf.add(x, y)
+
+    with tf.Graph().as_default():
+      x = lambda: tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      y = lambda: tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      z = tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      z2 = MyAdd(tf.case([(tf.less(0, 1), x)], default=y), z)
+  */
+  GrapplerItem item;
+  string filename = io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPath,
+                                 "function_switch.pbtxt");
+  TF_CHECK_OK(ReadGraphDefFromFile(filename, &item.graph));
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(false));
+  const auto out_props = properties.GetOutputProperties("MyAdd_MPaeanipb7o");
+  const OpInfo::TensorProperties& out_prop = out_props[0];
+  EXPECT_EQ(DT_FLOAT, out_prop.dtype());
+  EXPECT_FALSE(out_prop.shape().unknown_rank());
+  EXPECT_EQ(2, out_prop.shape().dim_size());
+  EXPECT_EQ(1, out_prop.shape().dim(0).size());
+  EXPECT_EQ(2, out_prop.shape().dim(1).size());
+
+  const auto in_props = properties.GetInputProperties("MyAdd_MPaeanipb7o");
+  EXPECT_EQ(2, in_props.size());
+
+  const OpInfo::TensorProperties& in_prop = in_props[0];
+  EXPECT_EQ(DT_FLOAT, in_prop.dtype());
+  EXPECT_FALSE(in_prop.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop.shape().dim_size());
+  EXPECT_EQ(1, in_prop.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop.shape().dim(1).size());
+
+  const OpInfo::TensorProperties& in_prop1 = in_props[1];
+  EXPECT_EQ(DT_FLOAT, in_prop1.dtype());
+  EXPECT_FALSE(in_prop1.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop1.shape().dim_size());
+  EXPECT_EQ(1, in_prop1.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop1.shape().dim(1).size());
+}
+
+TEST_F(GraphPropertiesTest, FunctionSwitch2StaticShapeInference) {
+  // Test graph produced in python using:
+  /*
+    @function.Defun(*[tf.float32] * 2, noinline=True)
+    def MyAdd(x, y):
+      return tf.add(x, y)
+
+    with tf.Graph().as_default():
+      x = lambda: tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      y = lambda: tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      z = tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      z2 = MyAdd(tf.case([(tf.less(1, 0), x)], default=y), z)
+  */
+  GrapplerItem item;
+  string filename = io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPath,
+                                 "function_switch_2.pbtxt");
+  TF_CHECK_OK(ReadGraphDefFromFile(filename, &item.graph));
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(false));
+  const auto out_props = properties.GetOutputProperties("MyAdd_MPaeanipb7o");
+  const OpInfo::TensorProperties& out_prop = out_props[0];
+  EXPECT_EQ(DT_FLOAT, out_prop.dtype());
+  EXPECT_FALSE(out_prop.shape().unknown_rank());
+  EXPECT_EQ(2, out_prop.shape().dim_size());
+  EXPECT_EQ(1, out_prop.shape().dim(0).size());
+  EXPECT_EQ(2, out_prop.shape().dim(1).size());
+
+  const auto in_props = properties.GetInputProperties("MyAdd_MPaeanipb7o");
+  EXPECT_EQ(2, in_props.size());
+
   const OpInfo::TensorProperties& in_prop = in_props[0];
   EXPECT_EQ(DT_FLOAT, in_prop.dtype());
   EXPECT_FALSE(in_prop.shape().unknown_rank());
   EXPECT_EQ(2, in_prop.shape().dim_size());
   EXPECT_EQ(1, in_prop.shape().dim(0).size());
   EXPECT_EQ(2, in_prop.shape().dim(1).size());
+
+  const OpInfo::TensorProperties& in_prop1 = in_props[1];
+  EXPECT_EQ(DT_FLOAT, in_prop1.dtype());
+  EXPECT_FALSE(in_prop1.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop1.shape().dim_size());
+  EXPECT_EQ(1, in_prop1.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop1.shape().dim(1).size());
+}
+
+TEST_F(GraphPropertiesTest, FunctionSwitchShapesStaticShapeInference) {
+  // Test graph produced in python using:
+  /*
+    @function.Defun(*[tf.float32] * 2, noinline=True)
+    def MyAdd(x, y):
+      a = tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      b = tf.constant(2.0, shape=[1, 3], dtype=tf.float32)
+      c = tf.add(x, a)
+      d = tf.add(y, b)
+      return c
+
+    with tf.Graph().as_default():
+      x = lambda: tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      y = lambda: tf.constant(2.0, shape=[1, 2], dtype=tf.float32)
+      z = tf.constant(2.0, shape=[1, 3], dtype=tf.float32)
+      z2 = MyAdd(tf.case([(tf.less(1, 0), x)], default=y), z)
+  */
+  GrapplerItem item;
+  string filename = io::JoinPath(testing::TensorFlowSrcRoot(), kTestDataPath,
+                                 "function_switch_shapes.pbtxt");
+  TF_CHECK_OK(ReadGraphDefFromFile(filename, &item.graph));
+  GraphProperties properties(item);
+  TF_CHECK_OK(properties.InferStatically(false));
+  const auto out_props = properties.GetOutputProperties("MyAdd_lEKAAnIwI5I");
+  const OpInfo::TensorProperties& out_prop = out_props[0];
+  EXPECT_EQ(DT_FLOAT, out_prop.dtype());
+  EXPECT_FALSE(out_prop.shape().unknown_rank());
+  EXPECT_EQ(2, out_prop.shape().dim_size());
+  EXPECT_EQ(1, out_prop.shape().dim(0).size());
+  EXPECT_EQ(2, out_prop.shape().dim(1).size());
+
+  const auto in_props = properties.GetInputProperties("MyAdd_lEKAAnIwI5I");
+  EXPECT_EQ(2, in_props.size());
+
+  const OpInfo::TensorProperties& in_prop = in_props[0];
+  EXPECT_EQ(DT_FLOAT, in_prop.dtype());
+  EXPECT_FALSE(in_prop.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop.shape().dim_size());
+  EXPECT_EQ(1, in_prop.shape().dim(0).size());
+  EXPECT_EQ(2, in_prop.shape().dim(1).size());
+
+  const OpInfo::TensorProperties& in_prop1 = in_props[1];
+  EXPECT_EQ(DT_FLOAT, in_prop1.dtype());
+  EXPECT_FALSE(in_prop1.shape().unknown_rank());
+  EXPECT_EQ(2, in_prop1.shape().dim_size());
+  EXPECT_EQ(1, in_prop1.shape().dim(0).size());
+  EXPECT_EQ(3, in_prop1.shape().dim(1).size());
 }
 
 TEST_F(GraphPropertiesTest, SymbolicShapes) {
diff --git a/tensorflow/core/grappler/costs/graph_properties_testdata/function_error.pbtxt b/tensorflow/core/grappler/costs/graph_properties_testdata/function_error.pbtxt
new file mode 100644
index 0000000000..c3f0a6c95d
--- /dev/null
+++ b/tensorflow/core/grappler/costs/graph_properties_testdata/function_error.pbtxt
@@ -0,0 +1,117 @@
+node {
+  name: "Const"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "Const_1"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "MyAdd_yabA4wXEdM4"
+  op: "MyAdd_yabA4wXEdM4"
+  input: "Const"
+  input: "Const_1"
+}
+library {
+  function {
+    signature {
+      name: "MyAdd_yabA4wXEdM4"
+      input_arg {
+        name: "x"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "y"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "add_1"
+        type: DT_FLOAT
+      }
+    }
+    node_def {
+      name: "Add"
+      op: "Add"
+      input: "x"
+      input: "Add:z:0"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    node_def {
+      name: "Add_1"
+      op: "Add"
+      input: "Add:z:0"
+      input: "y"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    ret {
+      key: "add_1"
+      value: "Add_1:z:0"
+    }
+    attr {
+      key: "_noinline"
+      value {
+        b: true
+      }
+    }
+  }
+}
+versions {
+  producer: 26
+  min_consumer: 12
+}
diff --git a/tensorflow/core/grappler/costs/graph_properties_testdata/function_functional_while.pbtxt b/tensorflow/core/grappler/costs/graph_properties_testdata/function_functional_while.pbtxt
new file mode 100644
index 0000000000..c94ee2f227
--- /dev/null
+++ b/tensorflow/core/grappler/costs/graph_properties_testdata/function_functional_while.pbtxt
@@ -0,0 +1,239 @@
+node {
+  name: "MyFunc_AenMyWWx1Us"
+  op: "MyFunc_AenMyWWx1Us"
+}
+library {
+  function {
+    signature {
+      name: "MyFunc_AenMyWWx1Us"
+      output_arg {
+        name: "while"
+        type: DT_INT32
+      }
+      output_arg {
+        name: "while_0"
+        type: DT_FLOAT
+      }
+      is_stateful: true
+    }
+    node_def {
+      name: "Const"
+      op: "Const"
+      attr {
+        key: "dtype"
+        value {
+          type: DT_INT32
+        }
+      }
+      attr {
+        key: "value"
+        value {
+          tensor {
+            dtype: DT_INT32
+            tensor_shape {
+            }
+            int_val: 10
+          }
+        }
+      }
+    }
+    node_def {
+      name: "While/input_1"
+      op: "Const"
+      attr {
+        key: "dtype"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "value"
+        value {
+          tensor {
+            dtype: DT_FLOAT
+            tensor_shape {
+            }
+            float_val: 0.0
+          }
+        }
+      }
+    }
+    node_def {
+      name: "While"
+      op: "While"
+      input: "Const:output:0"
+      input: "While/input_1:output:0"
+      attr {
+        key: "T"
+        value {
+          list {
+            type: DT_INT32
+            type: DT_FLOAT
+          }
+        }
+      }
+      attr {
+        key: "body"
+        value {
+          func {
+            name: "Body_8GOMGeZeK5c"
+          }
+        }
+      }
+      attr {
+        key: "cond"
+        value {
+          func {
+            name: "Cond_Xf5ttAHgUCg"
+          }
+        }
+      }
+    }
+    ret {
+      key: "while"
+      value: "While:output:0"
+    }
+    ret {
+      key: "while_0"
+      value: "While:output:1"
+    }
+    attr {
+      key: "_noinline"
+      value {
+        b: true
+      }
+    }
+  }
+  function {
+    signature {
+      name: "Body_8GOMGeZeK5c"
+      input_arg {
+        name: "n"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "x"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "sub"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "add"
+        type: DT_FLOAT
+      }
+    }
+    node_def {
+      name: "sub/y"
+      op: "Const"
+      attr {
+        key: "dtype"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "value"
+        value {
+          tensor {
+            dtype: DT_FLOAT
+            tensor_shape {
+            }
+            float_val: 1.0
+          }
+        }
+      }
+    }
+    node_def {
+      name: "sub_0"
+      op: "Sub"
+      input: "n"
+      input: "sub/y:output:0"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    node_def {
+      name: "add_0"
+      op: "Add"
+      input: "x"
+      input: "n"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    ret {
+      key: "add"
+      value: "add_0:z:0"
+    }
+    ret {
+      key: "sub"
+      value: "sub_0:z:0"
+    }
+  }
+  function {
+    signature {
+      name: "Cond_Xf5ttAHgUCg"
+      input_arg {
+        name: "n"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "unused_x"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "greater"
+        type: DT_BOOL
+      }
+    }
+    node_def {
+      name: "Greater/y"
+      op: "Const"
+      attr {
+        key: "dtype"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "value"
+        value {
+          tensor {
+            dtype: DT_FLOAT
+            tensor_shape {
+            }
+            float_val: 0.0
+          }
+        }
+      }
+    }
+    node_def {
+      name: "Greater"
+      op: "Greater"
+      input: "n"
+      input: "Greater/y:output:0"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    ret {
+      key: "greater"
+      value: "Greater:z:0"
+    }
+  }
+}
+versions {
+  producer: 26
+  min_consumer: 12
+}
diff --git a/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch.pbtxt b/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch.pbtxt
new file mode 100644
index 0000000000..d6d856ce41
--- /dev/null
+++ b/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch.pbtxt
@@ -0,0 +1,251 @@
+node {
+  name: "Const"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "Less/x"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 0
+      }
+    }
+  }
+}
+node {
+  name: "Less/y"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 1
+      }
+    }
+  }
+}
+node {
+  name: "Less"
+  op: "Less"
+  input: "Less/x"
+  input: "Less/y"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "case/cond/Switch"
+  op: "Switch"
+  input: "Less"
+  input: "Less"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/switch_t"
+  op: "Identity"
+  input: "case/cond/Switch:1"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/switch_f"
+  op: "Identity"
+  input: "case/cond/Switch"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/pred_id"
+  op: "Identity"
+  input: "Less"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/Const"
+  op: "Const"
+  input: "^case/cond/switch_t"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "case/cond/Const_1"
+  op: "Const"
+  input: "^case/cond/switch_f"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "case/cond/Merge"
+  op: "Merge"
+  input: "case/cond/Const_1"
+  input: "case/cond/Const"
+  attr {
+    key: "N"
+    value {
+      i: 2
+    }
+  }
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+node {
+  name: "MyAdd_MPaeanipb7o"
+  op: "MyAdd_MPaeanipb7o"
+  input: "case/cond/Merge"
+  input: "Const"
+}
+library {
+  function {
+    signature {
+      name: "MyAdd_MPaeanipb7o"
+      input_arg {
+        name: "x"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "y"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "Add"
+        type: DT_FLOAT
+      }
+    }
+    node_def {
+      name: "Add"
+      op: "Add"
+      input: "x"
+      input: "y"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    ret {
+      key: "Add"
+      value: "Add:z:0"
+    }
+    attr {
+      key: "_noinline"
+      value {
+        b: true
+      }
+    }
+  }
+}
+versions {
+  producer: 26
+  min_consumer: 12
+}
diff --git a/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch_2.pbtxt b/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch_2.pbtxt
new file mode 100644
index 0000000000..e57d9d7076
--- /dev/null
+++ b/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch_2.pbtxt
@@ -0,0 +1,251 @@
+node {
+  name: "Const"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "Less/x"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 1
+      }
+    }
+  }
+}
+node {
+  name: "Less/y"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 0
+      }
+    }
+  }
+}
+node {
+  name: "Less"
+  op: "Less"
+  input: "Less/x"
+  input: "Less/y"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "case/cond/Switch"
+  op: "Switch"
+  input: "Less"
+  input: "Less"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/switch_t"
+  op: "Identity"
+  input: "case/cond/Switch:1"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/switch_f"
+  op: "Identity"
+  input: "case/cond/Switch"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/pred_id"
+  op: "Identity"
+  input: "Less"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/Const"
+  op: "Const"
+  input: "^case/cond/switch_t"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "case/cond/Const_1"
+  op: "Const"
+  input: "^case/cond/switch_f"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "case/cond/Merge"
+  op: "Merge"
+  input: "case/cond/Const_1"
+  input: "case/cond/Const"
+  attr {
+    key: "N"
+    value {
+      i: 2
+    }
+  }
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+node {
+  name: "MyAdd_MPaeanipb7o"
+  op: "MyAdd_MPaeanipb7o"
+  input: "case/cond/Merge"
+  input: "Const"
+}
+library {
+  function {
+    signature {
+      name: "MyAdd_MPaeanipb7o"
+      input_arg {
+        name: "x"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "y"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "Add"
+        type: DT_FLOAT
+      }
+    }
+    node_def {
+      name: "Add"
+      op: "Add"
+      input: "x"
+      input: "y"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    ret {
+      key: "Add"
+      value: "Add:z:0"
+    }
+    attr {
+      key: "_noinline"
+      value {
+        b: true
+      }
+    }
+  }
+}
+versions {
+  producer: 26
+  min_consumer: 12
+}
diff --git a/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch_shapes.pbtxt b/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch_shapes.pbtxt
new file mode 100644
index 0000000000..e9afa91886
--- /dev/null
+++ b/tensorflow/core/grappler/costs/graph_properties_testdata/function_switch_shapes.pbtxt
@@ -0,0 +1,317 @@
+node {
+  name: "Const"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 3
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "Less/x"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 1
+      }
+    }
+  }
+}
+node {
+  name: "Less/y"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 0
+      }
+    }
+  }
+}
+node {
+  name: "Less"
+  op: "Less"
+  input: "Less/x"
+  input: "Less/y"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "case/cond/Switch"
+  op: "Switch"
+  input: "Less"
+  input: "Less"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/switch_t"
+  op: "Identity"
+  input: "case/cond/Switch:1"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/switch_f"
+  op: "Identity"
+  input: "case/cond/Switch"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/pred_id"
+  op: "Identity"
+  input: "Less"
+  attr {
+    key: "T"
+    value {
+      type: DT_BOOL
+    }
+  }
+}
+node {
+  name: "case/cond/Const"
+  op: "Const"
+  input: "^case/cond/switch_t"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "case/cond/Const_1"
+  op: "Const"
+  input: "^case/cond/switch_f"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_FLOAT
+        tensor_shape {
+          dim {
+            size: 1
+          }
+          dim {
+            size: 2
+          }
+        }
+        float_val: 2.0
+      }
+    }
+  }
+}
+node {
+  name: "case/cond/Merge"
+  op: "Merge"
+  input: "case/cond/Const_1"
+  input: "case/cond/Const"
+  attr {
+    key: "N"
+    value {
+      i: 2
+    }
+  }
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+node {
+  name: "MyAdd_lEKAAnIwI5I"
+  op: "MyAdd_lEKAAnIwI5I"
+  input: "case/cond/Merge"
+  input: "Const"
+}
+library {
+  function {
+    signature {
+      name: "MyAdd_lEKAAnIwI5I"
+      input_arg {
+        name: "x"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "y"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "Add"
+        type: DT_FLOAT
+      }
+    }
+    node_def {
+      name: "Const"
+      op: "Const"
+      attr {
+        key: "dtype"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "value"
+        value {
+          tensor {
+            dtype: DT_FLOAT
+            tensor_shape {
+              dim {
+                size: 1
+              }
+              dim {
+                size: 2
+              }
+            }
+            float_val: 2.0
+          }
+        }
+      }
+    }
+    node_def {
+      name: "Const_1"
+      op: "Const"
+      attr {
+        key: "dtype"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "value"
+        value {
+          tensor {
+            dtype: DT_FLOAT
+            tensor_shape {
+              dim {
+                size: 1
+              }
+              dim {
+                size: 3
+              }
+            }
+            float_val: 2.0
+          }
+        }
+      }
+    }
+    node_def {
+      name: "Add"
+      op: "Add"
+      input: "x"
+      input: "Const:output:0"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    node_def {
+      name: "Add_1"
+      op: "Add"
+      input: "y"
+      input: "Const_1:output:0"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    ret {
+      key: "Add"
+      value: "Add:z:0"
+    }
+    attr {
+      key: "_noinline"
+      value {
+        b: true
+      }
+    }
+  }
+}
+versions {
+  producer: 26
+  min_consumer: 12
+}
diff --git a/tensorflow/core/grappler/costs/graph_properties_testdata/large_function_graph.pbtxt b/tensorflow/core/grappler/costs/graph_properties_testdata/large_function_graph.pbtxt
new file mode 100644
index 0000000000..415c347a1d
--- /dev/null
+++ b/tensorflow/core/grappler/costs/graph_properties_testdata/large_function_graph.pbtxt
@@ -0,0 +1,597 @@
+node {
+  name: "Const/Const"
+  op: "Const"
+  device: "/cpu:0"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+          dim {
+            size: 1
+          }
+        }
+        int_val: 64
+      }
+    }
+  }
+}
+node {
+  name: "input_0_0"
+  op: "RandomUniform"
+  input: "Const/Const"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "seed"
+    value {
+      i: 0
+    }
+  }
+  attr {
+    key: "seed2"
+    value {
+      i: 0
+    }
+  }
+}
+node {
+  name: "Const_1/Const"
+  op: "Const"
+  device: "/cpu:0"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+          dim {
+            size: 4
+          }
+        }
+        tensor_content: "\001\000\000\000\001\000\000\000\030\000\000\000@\000\000\000"
+      }
+    }
+  }
+}
+node {
+  name: "input_1_0"
+  op: "RandomUniform"
+  input: "Const_1/Const"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "seed"
+    value {
+      i: 0
+    }
+  }
+  attr {
+    key: "seed2"
+    value {
+      i: 0
+    }
+  }
+}
+node {
+  name: "Const_2/Const"
+  op: "Const"
+  device: "/cpu:0"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+          dim {
+            size: 4
+          }
+        }
+        tensor_content: "\200\000\000\000\340\000\000\000\340\000\000\000\003\000\000\000"
+      }
+    }
+  }
+}
+node {
+  name: "input_2_0"
+  op: "RandomUniform"
+  input: "Const_2/Const"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "seed"
+    value {
+      i: 0
+    }
+  }
+  attr {
+    key: "seed2"
+    value {
+      i: 0
+    }
+  }
+}
+node {
+  name: "Const_3/Const"
+  op: "Const"
+  device: "/cpu:0"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+          dim {
+            size: 4
+          }
+        }
+        tensor_content: "\007\000\000\000\007\000\000\000\003\000\000\000\010\000\000\000"
+      }
+    }
+  }
+}
+node {
+  name: "input_3_0"
+  op: "RandomUniform"
+  input: "Const_3/Const"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "dtype"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "seed"
+    value {
+      i: 0
+    }
+  }
+  attr {
+    key: "seed2"
+    value {
+      i: 0
+    }
+  }
+}
+node {
+  name: "y0"
+  op: "BiasAddx1_Conv2Dx1_DepthwiseConv2dNativex1_Relux1_95"
+  input: "input_0_0"
+  input: "input_1_0"
+  input: "input_2_0"
+  input: "input_3_0"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+}
+node {
+  name: "shape"
+  op: "Shape"
+  input: "y0"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "out_type"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "zeros"
+  op: "ZerosLike"
+  input: "shape"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "ones"
+  op: "OnesLike"
+  input: "shape"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "slice_0"
+  op: "Slice"
+  input: "y0"
+  input: "zeros"
+  input: "ones"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "Index"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+node {
+  name: "identity_0"
+  op: "Identity"
+  input: "slice_0"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+node {
+  name: "shape_1"
+  op: "Shape"
+  input: "y0:1"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+  attr {
+    key: "out_type"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "zeros_1"
+  op: "ZerosLike"
+  input: "shape_1"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "ones_1"
+  op: "OnesLike"
+  input: "shape_1"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "slice_1"
+  op: "Slice"
+  input: "y0:1"
+  input: "zeros_1"
+  input: "ones_1"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "Index"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+node {
+  name: "identity_1"
+  op: "Identity"
+  input: "slice_1"
+  input: "^input_0_0"
+  input: "^input_1_0"
+  input: "^input_2_0"
+  input: "^input_3_0"
+  device: "/cpu:0"
+  attr {
+    key: "T"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+library {
+  function {
+    signature {
+      name: "BiasAddx1_Conv2Dx1_DepthwiseConv2dNativex1_Relux1_95"
+      input_arg {
+        name: "InceptionV2/Conv2d_1a_7x7/biases/read"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "InceptionV2/Conv2d_1a_7x7/pointwise_weights/read"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "random_uniform"
+        type: DT_FLOAT
+      }
+      input_arg {
+        name: "InceptionV2/Conv2d_1a_7x7/depthwise_weights/read"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "InceptionV2/InceptionV2/Conv2d_1a_7x7/Relu"
+        type: DT_FLOAT
+      }
+      output_arg {
+        name: "InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d/depthwise"
+        type: DT_FLOAT
+      }
+    }
+    node_def {
+      name: "InceptionV2/InceptionV2/Conv2d_1a_7x7/BiasAdd"
+      op: "BiasAdd"
+      input: "InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d:output:0"
+      input: "InceptionV2/Conv2d_1a_7x7/biases/read"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "data_format"
+        value {
+          s: "NHWC"
+        }
+      }
+    }
+    node_def {
+      name: "InceptionV2/InceptionV2/Conv2d_1a_7x7/Relu"
+      op: "Relu"
+      input: "InceptionV2/InceptionV2/Conv2d_1a_7x7/BiasAdd:output:0"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+    }
+    node_def {
+      name: "InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d"
+      op: "Conv2D"
+      input: "InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d/depthwise:output:0"
+      input: "InceptionV2/Conv2d_1a_7x7/pointwise_weights/read"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "data_format"
+        value {
+          s: "NHWC"
+        }
+      }
+      attr {
+        key: "dilations"
+        value {
+          list {
+            i: 1
+            i: 1
+            i: 1
+            i: 1
+          }
+        }
+      }
+      attr {
+        key: "padding"
+        value {
+          s: "VALID"
+        }
+      }
+      attr {
+        key: "strides"
+        value {
+          list {
+            i: 1
+            i: 1
+            i: 1
+            i: 1
+          }
+        }
+      }
+      attr {
+        key: "use_cudnn_on_gpu"
+        value {
+          b: true
+        }
+      }
+    }
+    node_def {
+      name: "InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d/depthwise"
+      op: "DepthwiseConv2dNative"
+      input: "random_uniform"
+      input: "InceptionV2/Conv2d_1a_7x7/depthwise_weights/read"
+      attr {
+        key: "T"
+        value {
+          type: DT_FLOAT
+        }
+      }
+      attr {
+        key: "data_format"
+        value {
+          s: "NHWC"
+        }
+      }
+      attr {
+        key: "dilations"
+        value {
+          list {
+            i: 1
+            i: 1
+            i: 1
+            i: 1
+          }
+        }
+      }
+      attr {
+        key: "padding"
+        value {
+          s: "SAME"
+        }
+      }
+      attr {
+        key: "strides"
+        value {
+          list {
+            i: 1
+            i: 2
+            i: 2
+            i: 1
+          }
+        }
+      }
+    }
+    ret {
+      key: "InceptionV2/InceptionV2/Conv2d_1a_7x7/Relu"
+      value: "InceptionV2/InceptionV2/Conv2d_1a_7x7/Relu:activations:0"
+    }
+    ret {
+      key: "InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d/depthwise"
+      value: "InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d/depthwise:output:0"
+    }
+    attr {
+      key: "_noinline"
+      value {
+        b: true
+      }
+    }
+  }
+}
+versions {
+  producer: 26
+  min_consumer: 12
+}
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
index d34eecd009..5b303f6ccb 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
@@ -65,6 +65,7 @@ constexpr char kAvgPool[] = "AvgPool";
 constexpr char kAvgPoolGrad[] = "AvgPoolGrad";
 constexpr char kFusedBatchNorm[] = "FusedBatchNorm";
 constexpr char kFusedBatchNormGrad[] = "FusedBatchNormGrad";
+constexpr char kQuantizedMatMulV2[] = "QuantizedMatMulV2";
 
 static const Costs::Duration kMinComputeTime(1);
 
@@ -226,6 +227,7 @@ OpLevelCostEstimator::OpLevelCostEstimator() {
       {kMatMul, wrap(&OpLevelCostEstimator::PredictMatMul)},
       {kSparseMatMul, wrap(&OpLevelCostEstimator::PredictMatMul)},
       {kBatchMatMul, wrap(&OpLevelCostEstimator::PredictBatchMatMul)},
+      {kQuantizedMatMulV2, wrap(&OpLevelCostEstimator::PredictMatMul)},
 
       {kNoOp, wrap(&OpLevelCostEstimator::PredictNoOp)},
       {kGuaranteeConst, wrap(&OpLevelCostEstimator::PredictNoOp)},
@@ -268,67 +270,70 @@ OpLevelCostEstimator::OpLevelCostEstimator() {
       EIGEN_COST(scalar_product_op<float>) + EIGEN_COST(scalar_max_op<float>) +
       EIGEN_COST(scalar_min_op<float>) + EIGEN_COST(scalar_round_op<float>);
 
-  elementwise_ops_ = {// Unary ops alphabetically sorted
-                      {"Acos", EIGEN_COST(scalar_acos_op<float>)},
-                      {"Asin", EIGEN_COST(scalar_asin_op<float>)},
-                      {"Atan", EIGEN_COST(scalar_atan_op<float>)},
-                      {"Atan2", EIGEN_COST(scalar_quotient_op<float>) +
-                                    EIGEN_COST(scalar_atan_op<float>)},
-                      {"Ceil", EIGEN_COST(scalar_ceil_op<float>)},
-                      {"Cos", EIGEN_COST(scalar_cos_op<float>)},
-                      {"Dequantize", EIGEN_COST(scalar_product_op<float>)},
-                      {"Erf", 1},
-                      {"Erfc", 1},
-                      {"Exp", EIGEN_COST(scalar_exp_op<float>)},
-                      {"Expm1", EIGEN_COST(scalar_expm1_op<float>)},
-                      {"Floor", EIGEN_COST(scalar_floor_op<float>)},
-                      {"Inv", EIGEN_COST(scalar_inverse_op<float>)},
-                      {"InvGrad", 1},
-                      {"Lgamma", 1},
-                      {"Log", EIGEN_COST(scalar_log_op<float>)},
-                      {"Log1p", EIGEN_COST(scalar_log1p_op<float>)},
-                      {"Neg", EIGEN_COST(scalar_opposite_op<float>)},
-                      {"QuantizeV2", quantize_v2_cost},
-                      {"Reciprocal", EIGEN_COST(scalar_inverse_op<float>)},
-                      {"Rint", 1},
-                      {"Round", EIGEN_COST(scalar_round_op<float>)},
-                      {"Rsqrt", EIGEN_COST(scalar_rsqrt_op<float>)},
-                      {"Sqrt", EIGEN_COST(scalar_sqrt_op<float>)},
-                      {"Square", EIGEN_COST(scalar_square_op<float>)},
-                      {"Tanh", EIGEN_COST(scalar_tanh_op<float>)},
-                      {"Relu", EIGEN_COST(scalar_max_op<float>)},
-                      {"Sigmoid", EIGEN_COST(scalar_sigmoid_op<float>)},
-                      {"Sign", EIGEN_COST(scalar_sign_op<float>)},
-                      {"Sin", EIGEN_COST(scalar_sin_op<float>)},
-                      {"Tan", EIGEN_COST(scalar_tan_op<float>)},
-                      // Binary ops alphabetically sorted
-                      {"Add", EIGEN_COST(scalar_sum_op<float>)},
-                      {"ApproximateEqual", 1},
-                      {"BiasAdd", EIGEN_COST(scalar_sum_op<float>)},
-                      {"Div", EIGEN_COST(scalar_quotient_op<float>)},
-                      {"Equal", 1},
-                      {"FloorDiv", EIGEN_COST(scalar_quotient_op<float>)},
-                      {"FloorMod", EIGEN_COST(scalar_mod_op<float>)},
-                      {"Greater", 1},
-                      {"GreaterEqual", 1},
-                      {"Less", 1},
-                      {"LessEqual", 1},
-                      {"LogicalAnd", EIGEN_COST(scalar_boolean_and_op)},
-                      {"LogicalNot", 1},
-                      {"LogicalOr", EIGEN_COST(scalar_boolean_or_op)},
-                      {"Maximum", EIGEN_COST(scalar_max_op<float>)},
-                      {"Minimum", EIGEN_COST(scalar_min_op<float>)},
-                      {"Mod", EIGEN_COST(scalar_mod_op<float>)},
-                      {"Mul", EIGEN_COST(scalar_product_op<float>)},
-                      {"NotEqual", 1},
-                      {"QuantizedAdd", EIGEN_COST(scalar_sum_op<float>)},
-                      {"QuantizedMul", EIGEN_COST(scalar_product_op<float>)},
-                      {"RealDiv", EIGEN_COST(scalar_quotient_op<float>)},
-                      {"ReluGrad", EIGEN_COST(scalar_max_op<float>)},
-                      {"SquareDifference", 1},
-                      {"Sub", EIGEN_COST(scalar_difference_op<float>)},
-                      {"TruncateDiv", EIGEN_COST(scalar_quotient_op<float>)},
-                      {"TruncateMod", EIGEN_COST(scalar_mod_op<float>)}};
+  elementwise_ops_ = {
+      // Unary ops alphabetically sorted
+      {"Acos", EIGEN_COST(scalar_acos_op<float>)},
+      {"Asin", EIGEN_COST(scalar_asin_op<float>)},
+      {"Atan", EIGEN_COST(scalar_atan_op<float>)},
+      {"Atan2", EIGEN_COST(scalar_quotient_op<float>) +
+                    EIGEN_COST(scalar_atan_op<float>)},
+      {"Ceil", EIGEN_COST(scalar_ceil_op<float>)},
+      {"Cos", EIGEN_COST(scalar_cos_op<float>)},
+      {"Dequantize", EIGEN_COST(scalar_product_op<float>)},
+      {"Erf", 1},
+      {"Erfc", 1},
+      {"Exp", EIGEN_COST(scalar_exp_op<float>)},
+      {"Expm1", EIGEN_COST(scalar_expm1_op<float>)},
+      {"Floor", EIGEN_COST(scalar_floor_op<float>)},
+      {"Inv", EIGEN_COST(scalar_inverse_op<float>)},
+      {"InvGrad", 1},
+      {"Lgamma", 1},
+      {"Log", EIGEN_COST(scalar_log_op<float>)},
+      {"Log1p", EIGEN_COST(scalar_log1p_op<float>)},
+      {"Neg", EIGEN_COST(scalar_opposite_op<float>)},
+      {"QuantizeV2", quantize_v2_cost},
+      {"Reciprocal", EIGEN_COST(scalar_inverse_op<float>)},
+      {"Rint", 1},
+      {"Round", EIGEN_COST(scalar_round_op<float>)},
+      {"Rsqrt", EIGEN_COST(scalar_rsqrt_op<float>)},
+      {"Sqrt", EIGEN_COST(scalar_sqrt_op<float>)},
+      {"Square", EIGEN_COST(scalar_square_op<float>)},
+      {"Tanh", EIGEN_COST(scalar_tanh_op<float>)},
+      {"Relu", EIGEN_COST(scalar_max_op<float>)},
+      {"Sigmoid", EIGEN_COST(scalar_sigmoid_op<float>)},
+      {"QuantizedSigmoid", EIGEN_COST(scalar_sigmoid_op<float>)},
+      {"Sign", EIGEN_COST(scalar_sign_op<float>)},
+      {"Sin", EIGEN_COST(scalar_sin_op<float>)},
+      {"Tan", EIGEN_COST(scalar_tan_op<float>)},
+      // Binary ops alphabetically sorted
+      {"Add", EIGEN_COST(scalar_sum_op<float>)},
+      {"ApproximateEqual", 1},
+      {"BiasAdd", EIGEN_COST(scalar_sum_op<float>)},
+      {"QuantizedBiasAdd", EIGEN_COST(scalar_sum_op<float>)},
+      {"Div", EIGEN_COST(scalar_quotient_op<float>)},
+      {"Equal", 1},
+      {"FloorDiv", EIGEN_COST(scalar_quotient_op<float>)},
+      {"FloorMod", EIGEN_COST(scalar_mod_op<float>)},
+      {"Greater", 1},
+      {"GreaterEqual", 1},
+      {"Less", 1},
+      {"LessEqual", 1},
+      {"LogicalAnd", EIGEN_COST(scalar_boolean_and_op)},
+      {"LogicalNot", 1},
+      {"LogicalOr", EIGEN_COST(scalar_boolean_or_op)},
+      {"Maximum", EIGEN_COST(scalar_max_op<float>)},
+      {"Minimum", EIGEN_COST(scalar_min_op<float>)},
+      {"Mod", EIGEN_COST(scalar_mod_op<float>)},
+      {"Mul", EIGEN_COST(scalar_product_op<float>)},
+      {"NotEqual", 1},
+      {"QuantizedAdd", EIGEN_COST(scalar_sum_op<float>)},
+      {"QuantizedMul", EIGEN_COST(scalar_product_op<float>)},
+      {"RealDiv", EIGEN_COST(scalar_quotient_op<float>)},
+      {"ReluGrad", EIGEN_COST(scalar_max_op<float>)},
+      {"SquareDifference", 1},
+      {"Sub", EIGEN_COST(scalar_difference_op<float>)},
+      {"TruncateDiv", EIGEN_COST(scalar_quotient_op<float>)},
+      {"TruncateMod", EIGEN_COST(scalar_mod_op<float>)}};
 
 #undef EIGEN_COST
 
@@ -675,7 +680,7 @@ int64 OpLevelCostEstimator::CountMatMulOperations(
   }
 
   ops = m_dim * n_dim * k_dim * 2;
-  VLOG(1) << "Operations for Matmul" << ops;
+  VLOG(1) << "Operations for Matmul: " << ops;
 
   if (mat_mul != nullptr) {
     mat_mul->m = m_dim;
@@ -972,8 +977,10 @@ int64 OpLevelCostEstimator::CalculateTensorElementCount(
 
 int64 OpLevelCostEstimator::CalculateTensorSize(
     const OpInfo::TensorProperties& tensor, bool* found_unknown_shapes) const {
-  return CalculateTensorElementCount(tensor, found_unknown_shapes) *
-         DataTypeSize(BaseType(tensor.dtype()));
+  int64 count = CalculateTensorElementCount(tensor, found_unknown_shapes);
+  int size = DataTypeSize(BaseType(tensor.dtype()));
+  VLOG(2) << "Count: " << count << " DataTypeSize: " << size;
+  return count * size;
 }
 
 int64 OpLevelCostEstimator::CalculateInputSize(
diff --git a/tensorflow/core/grappler/costs/virtual_scheduler.cc b/tensorflow/core/grappler/costs/virtual_scheduler.cc
index 7f68272950..34fe2c75ef 100644
--- a/tensorflow/core/grappler/costs/virtual_scheduler.cc
+++ b/tensorflow/core/grappler/costs/virtual_scheduler.cc
@@ -30,6 +30,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/util/device_name_utils.h"
 
@@ -858,8 +859,10 @@ Costs VirtualScheduler::Summary() const {
     const auto& memory_cost = op_cost_pair.second.memory_time.count();
     const bool is_op_cost_accurate = !op_cost_pair.second.inaccurate;
     if (cost) {  // Skip printing out zero-cost ops.
-      VLOG(1) << " + " << op << " : " << (is_op_cost_accurate ? "" : "~")
-              << cost << " / " << compute_cost << " / " << memory_cost;
+      VLOG(1) << strings::Printf(
+          " + %30s : %c %10lld / %10lld / %10lld", op.c_str(),
+          (is_op_cost_accurate ? ' ' : '~'), static_cast<int64>(cost),
+          static_cast<int64>(compute_cost), static_cast<int64>(memory_cost));
     }
   }
 
@@ -934,9 +937,13 @@ Costs VirtualScheduler::Summary() const {
                                : 0.0;
       if (cost || mem_usage_percent > 1.0) {
         // Print out only non-zero cost ops or ops with > 1% memory usage.
-        VLOG(1) << " + " << op << " : " << (is_op_cost_accurate ? "" : "~")
-                << cost << " / " << compute_cost << " / " << memory_cost << " ("
-                << strings::HumanReadableNumBytes(op_mem_usage) << " ["
+        VLOG(1) << strings::Printf(" + %30s : %c %10lld / %10lld / %10lld",
+                                   op.c_str(),
+                                   (is_op_cost_accurate ? ' ' : '~'),
+                                   static_cast<int64>(cost),
+                                   static_cast<int64>(compute_cost),
+                                   static_cast<int64>(memory_cost))
+                << " (" << strings::HumanReadableNumBytes(op_mem_usage) << " ["
                 << mem_usage_percent << "%] "
                 << (persisent_ops.count(op) > 0 ? ": persistent op)" : ")");
       }
diff --git a/tensorflow/core/grappler/costs/virtual_scheduler.h b/tensorflow/core/grappler/costs/virtual_scheduler.h
index 34d48819ac..353ca6f071 100644
--- a/tensorflow/core/grappler/costs/virtual_scheduler.h
+++ b/tensorflow/core/grappler/costs/virtual_scheduler.h
@@ -275,7 +275,6 @@ class VirtualScheduler {
   // Return per device peak memory usage.
   const std::unordered_map<string, int64> GetPeakMemoryUsage() const;
 
- protected:
   const std::unordered_map<string, DeviceState>* GetDeviceStates() const {
     return &device_;
   }
@@ -283,6 +282,7 @@ class VirtualScheduler {
     return &node_map_;
   }
 
+ protected:
   // Returns the size of output at port_num (unit: bytes). A special case is
   // port_num -1, which is for control dependency and assumed to be 4 bytes.
   int64 CalculateOutputSize(
diff --git a/tensorflow/core/grappler/graph_view.cc b/tensorflow/core/grappler/graph_view.cc
index 3e448216f9..a6b6b6f8b2 100644
--- a/tensorflow/core/grappler/graph_view.cc
+++ b/tensorflow/core/grappler/graph_view.cc
@@ -22,28 +22,37 @@ namespace grappler {
 GraphView::GraphView(GraphDef* graph) : graph_(graph) {
   for (int i = 0; i < graph_->node_size(); i++) {
     auto node = graph_->mutable_node(i);
-    auto rslt = nodes_.insert(std::make_pair(node->name(), node));
-    // Check that the graph doesn't contain multiple nodes with the same name.
-    CHECK(rslt.second) << "Non unique node name detected: " << node->name();
+    AddUniqueNodeOrDie(node);
   }
+
   for (NodeDef& node : *graph_->mutable_node()) {
-    for (int i = 0; i < node.input_size(); ++i) {
-      OutputPort fanin;
-      string fanin_name = ParseNodeName(node.input(i), &fanin.port_id);
-      fanin.node = nodes_[fanin_name];
+    AddFanouts(&node);
+  }
+}
 
-      InputPort input;
-      input.node = &node;
-      if (fanin.port_id < 0) {
-        input.port_id = -1;
-      } else {
-        input.port_id = i;
-        num_regular_outputs_[fanin.node] =
-            std::max(num_regular_outputs_[fanin.node], fanin.port_id);
-      }
+void GraphView::AddUniqueNodeOrDie(NodeDef* node) {
+  auto result = nodes_.emplace(node->name(), node);
+  // Check that the graph doesn't contain multiple nodes with the same name.
+  CHECK(result.second) << "Non unique node name detected: " << node->name();
+}
+
+void GraphView::AddFanouts(NodeDef* node) {
+  for (int i = 0; i < node->input_size(); ++i) {
+    OutputPort fanin;
+    string fanin_name = ParseNodeName(node->input(i), &fanin.port_id);
+    fanin.node = nodes_[fanin_name];
 
-      fanouts_[fanin].insert(input);
+    InputPort input;
+    input.node = node;
+    if (fanin.port_id < 0) {
+      input.port_id = -1;
+    } else {
+      input.port_id = i;
+      num_regular_outputs_[fanin.node] =
+          std::max(num_regular_outputs_[fanin.node], fanin.port_id);
     }
+
+    fanouts_[fanin].insert(input);
   }
 }
 
diff --git a/tensorflow/core/grappler/graph_view.h b/tensorflow/core/grappler/graph_view.h
index 584cb9048b..ac260f85a0 100644
--- a/tensorflow/core/grappler/graph_view.h
+++ b/tensorflow/core/grappler/graph_view.h
@@ -29,8 +29,11 @@ namespace grappler {
 class GraphView {
  public:
   struct Port {
-    Port() : node(nullptr), port_id(-1) {}
+    Port() = default;
     Port(NodeDef* n, int port) : node(n), port_id(port) {}
+
+    // TODO(prazek): ports should keep the constness of GraphView.  The only way
+    // to modify graph through the view should be using MutableGraphView.
     NodeDef* node = nullptr;
     int port_id = -1;
 
@@ -111,13 +114,24 @@ class GraphView {
   std::unordered_set<Edge, HashEdge> GetFaninEdges(
       const NodeDef& node, bool include_controlling_edges) const;
 
+ protected:
+  // Add a new `node` to the graph.
+  void AddUniqueNodeOrDie(NodeDef* node);
+  // Add fanout to every `node` input.
+  void AddFanouts(NodeDef* node);
+  std::unordered_map<string, NodeDef*>* MutableNodes() { return &nodes_; }
+  GraphDef* MutableGraph() { return graph_; }
+
+  using FanoutsMapType =
+      std::unordered_map<OutputPort, std::unordered_set<InputPort, HashPort>,
+                         HashPort>;
+  FanoutsMapType* MutableFanouts() { return &fanouts_; }
+
  private:
   GraphDef* graph_;
   std::unordered_map<string, NodeDef*> nodes_;
   std::unordered_set<InputPort, HashPort> empty_set_;
-  std::unordered_map<OutputPort, std::unordered_set<InputPort, HashPort>,
-                     HashPort>
-      fanouts_;
+  FanoutsMapType fanouts_;
   std::unordered_map<const NodeDef*, int> num_regular_outputs_;
 };
 
diff --git a/tensorflow/core/grappler/mutable_graph_view.cc b/tensorflow/core/grappler/mutable_graph_view.cc
new file mode 100644
index 0000000000..f0aff90c6c
--- /dev/null
+++ b/tensorflow/core/grappler/mutable_graph_view.cc
@@ -0,0 +1,84 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/mutable_graph_view.h"
+#include "tensorflow/core/grappler/utils.h"
+
+namespace tensorflow {
+namespace grappler {
+
+NodeDef* MutableGraphView::AddNode(NodeDef&& node) {
+  auto* node_in_graph = GetGraph()->add_node();
+  *node_in_graph = std::move(node);
+
+  AddUniqueNodeOrDie(node_in_graph);
+
+  AddFanouts(node_in_graph);
+  return node_in_graph;
+}
+
+NodeDef* MutableGraphView::InsertNode(const NodeDef& input_node, NodeDef&& node,
+                                      const int output_port_id) {
+  auto* node_in_graph = GetGraph()->add_node();
+  *node_in_graph = std::move(node);
+
+  AddUniqueNodeOrDie(node_in_graph);
+
+  // replace input for the output nodes of `input_node` with `node`
+  ReplaceInput(input_node, *node_in_graph, output_port_id);
+
+  AddFanouts(node_in_graph);
+  return node_in_graph;
+}
+
+void MutableGraphView::ReplaceInput(const NodeDef& old_input,
+                                    const NodeDef& new_input,
+                                    const int output_port_id) {
+  GraphView::OutputPort output_port =
+      GetOutputPort(old_input.name(), output_port_id);
+  auto fanout = GetFanout(output_port);
+  for (auto& input_port : fanout) {
+    input_port.node->set_input(input_port.port_id, new_input.name());
+    AddFanouts(input_port.node);
+  }
+}
+
+void MutableGraphView::DeleteNodes(const std::set<string>& nodes_to_delete) {
+  for (const string& node_name_to_delete : nodes_to_delete)
+    RemoveFanouts(MutableNodes()->at(node_name_to_delete));
+  for (const string& node_name_to_delete : nodes_to_delete)
+    MutableNodes()->erase(node_name_to_delete);
+  EraseNodesFromGraph(nodes_to_delete, GetGraph());
+}
+
+void MutableGraphView::RemoveFanouts(NodeDef* node) {
+  for (int i = 0; i < node->input_size(); ++i) {
+    OutputPort fanin;
+    string fanin_name = ParseNodeName(node->input(i), &fanin.port_id);
+    fanin.node = (*MutableNodes())[fanin_name];
+
+    InputPort input;
+    input.node = node;
+    if (fanin.port_id < 0)
+      input.port_id = -1;
+    else
+      input.port_id = i;
+
+    (*MutableFanouts())[fanin].erase(input);
+  }
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/mutable_graph_view.h b/tensorflow/core/grappler/mutable_graph_view.h
new file mode 100644
index 0000000000..971e5503d4
--- /dev/null
+++ b/tensorflow/core/grappler/mutable_graph_view.h
@@ -0,0 +1,63 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_MUTABLE_GRAPH_VIEW_H_
+#define TENSORFLOW_CORE_GRAPPLER_MUTABLE_GRAPH_VIEW_H_
+
+#include "tensorflow/core/grappler/graph_view.h"
+
+namespace tensorflow {
+namespace grappler {
+
+// A utility class to simplify the traversal of a GraphDef that, unlike
+// GraphView, supports updating the graph.  Note that you should not modify the
+// graph separately, because the view will get out of sync.
+class MutableGraphView : public GraphView {
+ public:
+  using GraphView::GraphView;
+
+  GraphDef* GetGraph() { return MutableGraph(); }
+
+  // Adds a new node to graph and updates the view.
+  NodeDef* AddNode(NodeDef&& node);
+
+  // Inserts a new node to the graph after `input` node and updates the view.
+  // This adds `node` to the graph and replaces the input for the output
+  // nodes of `input` with a port `output_port_id` with the new node.
+  NodeDef* InsertNode(const NodeDef& input, NodeDef&& node,
+                      int output_port_id = 0);
+
+  // Replaces the input for the output nodes of 'old_input' with a port
+  // `output_port_id` with 'new_input'.
+  //
+  // E.g: We have 2 nodes that use 'bar' node outputs as inputs:
+  // foo(bar:0, bar:1),  foo2(other:0, bar:0)
+  // Calling ReplaceInput(bar, new, 0) changes every occurrence of bar:0 for
+  // new:0.  Result:
+  // foo(new:0, bar:1),  foo2(other:0, new:0)
+  void ReplaceInput(const NodeDef& old_input, const NodeDef& new_input,
+                    int output_port_id = 0);
+
+  // Deletes nodes from the graph.
+  void DeleteNodes(const std::set<string>& nodes_to_delete);
+
+ private:
+  void RemoveFanouts(NodeDef* node);
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_MUTABLE_GRAPH_VIEW_H_
diff --git a/tensorflow/core/grappler/mutable_graph_view_test.cc b/tensorflow/core/grappler/mutable_graph_view_test.cc
new file mode 100644
index 0000000000..2536bec35d
--- /dev/null
+++ b/tensorflow/core/grappler/mutable_graph_view_test.cc
@@ -0,0 +1,127 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/mutable_graph_view.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+bool FindChildWithName(const MutableGraphView& graph,
+                       const string& output_port_name,
+                       const string& input_name) {
+  GraphView::OutputPort output_port = graph.GetOutputPort(output_port_name, 0);
+  auto fanout = graph.GetFanout(output_port);
+  for (auto& input_port : fanout) {
+    if (input_port.node->name() == input_name) return true;
+  }
+  return false;
+}
+
+TrivialTestGraphInputYielder SimpleGraph() {
+  // This outputs simple graph like:
+  //        x
+  //       / \
+  // Square   Square_1
+  //   |   \  /    |
+  //   |    \/     |
+  //   |    /\     |
+  //   |   /  \    |
+  //  AddN     AddN_1
+  //      \   /
+  //        y
+  TrivialTestGraphInputYielder simple_graph(2, 2, 2, false,
+                                            {"/CPU:0", "/GPU:0"});
+  return simple_graph;
+}
+
+TEST(MutableGraphViewTest, AddAndReplaceInput) {
+  TrivialTestGraphInputYielder fake_input = SimpleGraph();
+  GrapplerItem item;
+  CHECK(fake_input.NextItem(&item));
+
+  GraphDef new_graph = item.graph;
+  MutableGraphView graph(&new_graph);
+
+  GraphView::InputPort input = graph.GetInputPort("AddN", 0);
+  EXPECT_EQ("AddN", input.node->name());
+  EXPECT_EQ(0, input.port_id);
+  GraphView::OutputPort fanin = graph.GetRegularFanin(input);
+  EXPECT_EQ("Square", fanin.node->name());
+  EXPECT_EQ(0, fanin.port_id);
+
+  EXPECT_FALSE(FindChildWithName(graph, "Square", "new_node"));
+
+  NodeDef new_node = *input.node;
+  new_node.set_name("new_node");
+
+  EXPECT_EQ(graph.GetNode("new_node"), nullptr);
+  NodeDef* node_in_graph = graph.AddNode(std::move(new_node));
+  EXPECT_NE(graph.GetNode("new_node"), nullptr);
+
+  graph.ReplaceInput(*input.node, *node_in_graph);
+  EXPECT_TRUE(FindChildWithName(graph, "Square", "new_node"));
+  EXPECT_TRUE(FindChildWithName(graph, "new_node", "y"));
+}
+
+TEST(MutableGraphViewTest, InsertNodes) {
+  TrivialTestGraphInputYielder fake_input = SimpleGraph();
+
+  GrapplerItem item;
+  CHECK(fake_input.NextItem(&item));
+
+  GraphDef new_graph = item.graph;
+  MutableGraphView graph(&new_graph);
+
+  GraphView::InputPort input = graph.GetInputPort("AddN", 0);
+
+  NodeDef new_node = *input.node;
+  new_node.set_name("new_node");
+  new_node.set_input(0, input.node->name());
+
+  EXPECT_EQ(graph.GetNode("new_node"), nullptr);
+  graph.InsertNode(*input.node, std::move(new_node));
+  EXPECT_NE(graph.GetNode("new_node"), nullptr);
+  EXPECT_TRUE(FindChildWithName(graph, "Square", "AddN"));
+  EXPECT_TRUE(FindChildWithName(graph, "Square", "AddN_1"));
+  EXPECT_TRUE(FindChildWithName(graph, "Square_1", "AddN"));
+  EXPECT_TRUE(FindChildWithName(graph, "Square_1", "AddN_1"));
+  EXPECT_TRUE(FindChildWithName(graph, "AddN", "new_node"));
+  EXPECT_TRUE(FindChildWithName(graph, "AddN_1", "y"));
+  EXPECT_TRUE(FindChildWithName(graph, "new_node", "y"));
+}
+
+TEST(MutableGraphViewTest, DeleteNodes) {
+  // Outputs simple graph as described in first test.
+  TrivialTestGraphInputYielder fake_input = SimpleGraph();
+  GrapplerItem item;
+  CHECK(fake_input.NextItem(&item));
+
+  GraphDef new_graph = item.graph;
+  MutableGraphView graph(&new_graph);
+
+  EXPECT_NE(graph.GetNode("AddN"), nullptr);
+  graph.DeleteNodes({"AddN"});
+
+  EXPECT_EQ(graph.GetNode("AddN"), nullptr);
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/BUILD b/tensorflow/core/grappler/optimizers/BUILD
index b1d6d48e31..caaa5ac8db 100644
--- a/tensorflow/core/grappler/optimizers/BUILD
+++ b/tensorflow/core/grappler/optimizers/BUILD
@@ -95,6 +95,7 @@ cc_library(
     ],
     visibility = ["//visibility:public"],
     deps = [
+        ":evaluation_utils",
         ":graph_optimizer",
         ":symbolic_shapes",
         "//tensorflow/core:framework",
@@ -603,7 +604,9 @@ cc_library(
     visibility = ["//visibility:public"],
     deps = [
         ":constant_folding",
+        ":evaluation_utils",
         ":graph_optimizer",
+        "//tensorflow/core:core_cpu_lib",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
@@ -624,6 +627,7 @@ tf_cuda_cc_test(
         ":loop_optimizer",
         "//tensorflow/cc:cc_ops",
         "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:tensor_testutil",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
         "//tensorflow/core/grappler:grappler_item",
@@ -810,3 +814,39 @@ tf_cc_test(
         "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
     ],
 )
+
+cc_library(
+    name = "evaluation_utils",
+    srcs = ["evaluation_utils.cc"],
+    hdrs = [
+        "evaluation_utils.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/clusters:cluster",
+        "//tensorflow/core/grappler/costs:graph_properties",
+    ],
+)
+
+tf_cc_test(
+    name = "evaluation_utils_test",
+    srcs = ["evaluation_utils_test.cc"],
+    deps = [
+        ":evaluation_utils",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+        "//third_party/eigen3",
+    ],
+)
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
index b7369c7b4a..889445bbd6 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
@@ -2598,23 +2598,36 @@ class ConvertExpm1Stage : public ArithmeticOptimizerStage {
       : ArithmeticOptimizerStage("ConvertExpm1", ctx, ctx_ext) {}
   ~ConvertExpm1Stage() override = default;
 
-  bool IsSupported(const NodeDef* node) const override { return IsExp(*node); }
+  bool IsSupported(const NodeDef* node) const override {
+    if (!IsSub(*node))
+      return false;
 
-  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
     NodeDef* input;
-    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &input));
-    if (!IsSub(*input)) {
+    if (!GetInputNode(node->input(0), &input).ok())
+      return false;
+
+    return IsExp(*input);
+  }
+
+  Status TrySimplify(NodeDef* node, string* simplified_node_name) override {
+    if (ctx().graph_properties->GetInputProperties(node->name()).size() < 2) {
       return Status::OK();
     }
 
-    if (ctx().graph_properties->GetInputProperties(input->name()).size() < 2) {
+    NodeDef* exp;
+    TF_RETURN_IF_ERROR(GetInputNode(node->input(0), &exp));
+    if (!IsExp(*exp)) {
+      return Status::OK();
+    }
+
+    if (ctx().graph_properties->GetInputProperties(exp->name()).empty()) {
       return Status::OK();
     }
 
     const auto& t =
-        ctx().graph_properties->GetInputProperties(input->name())[0];
+        ctx().graph_properties->GetInputProperties(exp->name())[0];
     const auto& c =
-        ctx().graph_properties->GetInputProperties(input->name())[1];
+        ctx().graph_properties->GetInputProperties(node->name())[1];
     for (int k = 0; k < c.shape().dim_size(); ++k) {
       // Skip if c shape is not fully determined.
       if (c.shape().dim(k).size() < 0) {
@@ -2650,18 +2663,18 @@ class ConvertExpm1Stage : public ArithmeticOptimizerStage {
           return Status::OK();
         }
       }
-      NodeDef *x, *y;
-      TF_RETURN_IF_ERROR(GetInputNode(input->input(0), &x));
-      TF_RETURN_IF_ERROR(GetInputNode(input->input(1), &y));
+      NodeDef *exp_input, *ones;
+      TF_RETURN_IF_ERROR(GetInputNode(exp->input(0), &exp_input));
+      TF_RETURN_IF_ERROR(GetInputNode(node->input(1), &ones));
       node->set_op("Expm1");
-      node->set_input(0, input->input(0));
-      node->add_input(AsControlDependency(y->name()));
-      ForwardControlDependencies(node, {input});
+      node->set_input(0, exp->input(0));
+      node->set_input(1, AsControlDependency(ones->name()));
+      ForwardControlDependencies(node, {exp});
 
       AddToOptimizationQueue(node);
-      AddToOptimizationQueue(input);
-      AddToOptimizationQueue(x);
-      AddToOptimizationQueue(y);
+      AddToOptimizationQueue(exp);
+      AddToOptimizationQueue(exp_input);
+      AddToOptimizationQueue(ones);
     }
     return Status::OK();
   }
@@ -3078,14 +3091,7 @@ void ArithmeticOptimizer::DedupComputations() {
 
   // Delete duplicates
   if (fetch_nodes_known_ && !duplicates.empty()) {
-    int last = optimized_graph_->node_size() - 1;
-    for (auto it = duplicates.rbegin(); it != duplicates.rend(); ++it) {
-      int index = *it;
-      optimized_graph_->mutable_node()->SwapElements(index, last);
-      last--;
-    }
-    optimized_graph_->mutable_node()->DeleteSubrange(last + 1,
-                                                     duplicates.size());
+    EraseNodesFromGraph(duplicates, optimized_graph_);
     // Rebuild the NodeMap which was invalidated by the node  swapping above.
     node_map_.reset(new NodeMap(optimized_graph_));
   }
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc b/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
index 54fdc01adb..85cb19d419 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc
@@ -2553,10 +2553,9 @@ TEST_F(ArithmeticOptimizerTest, Expm1) {
   auto x1 = ops::Const(s.WithOpName("x1"), {2.0f, 2.0f}, {1, 2});
   auto x2 = ops::Const(s.WithOpName("x2"), {1.0f, 1.0f}, {1, 2});
   auto x3 = ops::Const(s.WithOpName("x3"), {3.0f, 3.0f}, {1, 2});
-  auto s12 = ops::Sub(s.WithOpName("s12").WithControlDependencies(x3), x1, x2);
-  auto s23 = ops::Sub(s.WithOpName("s23"), x2, x3);
-  Output out1 = ops::Exp(s.WithOpName("out1"), s12);
-  Output out2 = ops::Exp(s.WithOpName("out2"), s23);
+  auto exp1 = ops::Exp(s.WithOpName("exp1").WithControlDependencies(x3), x1);
+  Output out1 = ops::Sub(s.WithOpName("out1"), exp1, x2);
+  Output out2 = ops::Sub(s.WithOpName("out2"), exp1, x3);
 
   GrapplerItem item;
   item.fetch = {"out1", "out2"};
@@ -2571,15 +2570,20 @@ TEST_F(ArithmeticOptimizerTest, Expm1) {
   auto tensors = EvaluateNodes(got, item.fetch);
   EXPECT_EQ(2, tensors.size());
 
+  for (int i = 0; i < 2; ++i) {
+    EXPECT_EQ(tensors[i].NumElements(), tensors_expected[i].NumElements());
+    test::ExpectTensorNear<float>(tensors[i], tensors_expected[i], 1e-6);
+  }
+
   GraphDef want;
   AddNode("x1", "Const", {}, {}, &want);
   AddNode("x2", "Const", {}, {}, &want);
   AddNode("x3", "Const", {}, {}, &want);
-  AddNode("s23", "Sub", {"x2", "x3"}, {}, &want);
+  AddNode("exp1", "Exp", {"x1", AsControlDependency("x3")}, {}, &want);
   AddNode("out1", "Expm1",
           {"x1", AsControlDependency("x2"), AsControlDependency("x3")}, {},
           &want);
-  AddNode("out2", "Exp", {"s23"}, {}, &want);
+  AddNode("out2", "Sub", {"exp1", "x3"}, {}, &want);
 
   CompareGraphs(want, got);
 }
diff --git a/tensorflow/core/grappler/optimizers/constant_folding.cc b/tensorflow/core/grappler/optimizers/constant_folding.cc
index 76c928f995..f2ac3a44c0 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding.cc
+++ b/tensorflow/core/grappler/optimizers/constant_folding.cc
@@ -31,6 +31,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/costs/graph_properties.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/evaluation_utils.h"
 #include "tensorflow/core/grappler/optimizers/symbolic_shapes.h"
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
@@ -73,44 +74,6 @@ class EigenThreadPoolWrapper : public Eigen::ThreadPoolInterface {
   thread::ThreadPool* pool_ = nullptr;
 };
 
-class DeviceSimple : public DeviceBase {
- public:
-  DeviceSimple() : DeviceBase(Env::Default()) {
-    eigen_worker_threads_.num_threads = port::NumSchedulableCPUs();
-    eigen_worker_threads_.workers = new thread::ThreadPool(
-        Env::Default(), "constant_folding", eigen_worker_threads_.num_threads);
-    eigen_threadpool_wrapper_.reset(
-        new EigenThreadPoolWrapper(eigen_worker_threads_.workers));
-    eigen_device_.reset(new Eigen::ThreadPoolDevice(
-        eigen_threadpool_wrapper_.get(), eigen_worker_threads_.num_threads));
-    set_tensorflow_cpu_worker_threads(&eigen_worker_threads_);
-    set_eigen_cpu_device(eigen_device_.get());
-  }
-  ~DeviceSimple() override {
-    eigen_threadpool_wrapper_.reset();
-    eigen_device_.reset();
-    delete eigen_worker_threads_.workers;
-  }
-  Status MakeTensorFromProto(const TensorProto& tensor_proto,
-                             const AllocatorAttributes alloc_attrs,
-                             Tensor* tensor) override {
-    Tensor parsed(tensor_proto.dtype());
-    if (!parsed.FromProto(cpu_allocator(), tensor_proto)) {
-      return errors::InvalidArgument("Cannot parse tensor from tensor_proto.");
-    }
-    *tensor = parsed;
-    return Status::OK();
-  }
-  Allocator* GetAllocator(AllocatorAttributes attr) override {
-    return cpu_allocator();
-  }
-
- private:
-  DeviceBase::CpuWorkerThreads eigen_worker_threads_;
-  std::unique_ptr<Eigen::ThreadPoolInterface> eigen_threadpool_wrapper_;
-  std::unique_ptr<Eigen::ThreadPoolDevice> eigen_device_;
-};
-
 template <typename T>
 bool AllValuesAre(const TensorProto& proto, const T& value) {
   Tensor tensor;
@@ -983,33 +946,8 @@ Status ConstantFolding::CreateNodeDef(const string& name,
 Status ConstantFolding::EvaluateNode(const NodeDef& node,
                                      const TensorVector& inputs,
                                      TensorVector* output) const {
-  Status status;
-  auto op_kernel =
-      CreateOpKernel("CPU", cpu_device_, cpu_device_->GetAllocator({}), node,
-                     TF_GRAPH_DEF_VERSION, &status);
-  TF_RETURN_IF_ERROR(status);
-  OpKernelContext::Params params;
-  params.device = cpu_device_;
-  params.frame_iter = FrameAndIter(0, 0);
-  params.inputs = &inputs;
-  params.op_kernel = op_kernel.get();
-  params.resource_manager = resource_mgr_.get();
-
-  gtl::InlinedVector<AllocatorAttributes, 4> output_attrs;
-  const int num_outputs = op_kernel->num_outputs();
-  for (int i = 0; i < num_outputs; i++) {
-    AllocatorAttributes attr;
-    attr.set_on_host(true);
-    output_attrs.push_back(attr);
-  }
-  params.output_attr_array = output_attrs.data();
-
-  OpKernelContext op_context(&params);
-  op_kernel->Compute(&op_context);
-  for (int i = 0; i < num_outputs; i++) {
-    output->push_back(op_context.release_output(i));
-  }
-  return op_context.status();
+  return ::tensorflow::grappler::EvaluateNode(node, inputs, cpu_device_,
+                                              resource_mgr_.get(), output);
 }
 
 Status ConstantFolding::EvaluateOneFoldable(const NodeDef& node,
@@ -1305,17 +1243,12 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
   }
 
   // Delete the newly created nodes that don't feed anything.
-  int last = output->node_size() - 1;
-  for (int i = output->node_size() - 1; i >= 0; --i) {
-    const NodeDef& node = output->node(i);
-    auto fanout = node_map_->GetOutputs(node.name());
-    if (fanout.empty()) {
-      output->mutable_node()->SwapElements(i, last);
-      last--;
-    }
+  std::vector<int> nodes_to_delete;
+  for (int i = 0; i < output->node_size(); i++) {
+    auto fanout = node_map_->GetOutputs(output->node(i).name());
+    if (fanout.empty()) nodes_to_delete.push_back(i);
   }
-  output->mutable_node()->DeleteSubrange(last + 1,
-                                         output->node_size() - last - 1);
+  EraseNodesFromGraph(std::move(nodes_to_delete), output);
 
   for (const auto& node : graph_->node()) {
     // If no fetch nodes is provided, we conservatively
diff --git a/tensorflow/core/grappler/optimizers/data/BUILD b/tensorflow/core/grappler/optimizers/data/BUILD
index 1dbd044778..451ef6cabb 100644
--- a/tensorflow/core/grappler/optimizers/data/BUILD
+++ b/tensorflow/core/grappler/optimizers/data/BUILD
@@ -4,6 +4,39 @@ load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 load("//tensorflow/core:platform/default/build_config.bzl", "tf_protos_all")
 
 cc_library(
+    name = "function_rename",
+    srcs = ["function_rename.cc"],
+    hdrs = [
+        "function_rename.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/clusters:cluster",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
+    ] + tf_protos_all(),
+)
+
+tf_cc_test(
+    name = "function_rename_test",
+    srcs = ["function_rename_test.cc"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":function_rename",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler:grappler_item",
+    ] + tf_protos_all(),
+)
+
+cc_library(
     name = "graph_utils",
     srcs = ["graph_utils.cc"],
     hdrs = [
@@ -13,12 +46,9 @@ cc_library(
     deps = [
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
-        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:mutable_graph_view",
         "//tensorflow/core/grappler:grappler_item",
-        "//tensorflow/core/grappler:grappler_item_builder",
         "//tensorflow/core/grappler:utils",
-        "//tensorflow/core/grappler/clusters:virtual_cluster",
-        "//tensorflow/core/grappler/optimizers:meta_optimizer",
     ] + tf_protos_all(),
 )
 
@@ -28,12 +58,38 @@ tf_cc_test(
     visibility = ["//visibility:public"],
     deps = [
         ":graph_utils",
+        "//tensorflow/core:framework",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
+        "//tensorflow/core/kernels:cast_op",
     ],
 )
 
 cc_library(
+    name = "latency_all_edges",
+    srcs = ["latency_all_edges.cc"],
+    hdrs = [
+        "latency_all_edges.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        "//tensorflow/core/grappler:mutable_graph_view",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/clusters:cluster",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
+    ] + tf_protos_all(),
+)
+
+cc_library(
     name = "map_and_batch_fusion",
     srcs = ["map_and_batch_fusion.cc"],
     hdrs = [
@@ -43,7 +99,7 @@ cc_library(
     deps = [
         ":graph_utils",
         "//tensorflow/core:lib",
-        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:mutable_graph_view",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:op_types",
         "//tensorflow/core/grappler:utils",
@@ -68,50 +124,87 @@ tf_cc_test(
 )
 
 cc_library(
-    name = "shuffle_and_repeat_fusion",
-    srcs = ["shuffle_and_repeat_fusion.cc"],
+    name = "map_fusion",
+    srcs = ["map_fusion.cc"],
     hdrs = [
-        "shuffle_and_repeat_fusion.h",
+        "map_fusion.h",
     ],
     visibility = ["//visibility:public"],
     deps = [
         ":graph_utils",
+        "//tensorflow/core/grappler:mutable_graph_view",
         "//tensorflow/core:lib",
-        "//tensorflow/core/grappler:graph_view",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:op_types",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/grappler/clusters:cluster",
+        "//tensorflow/core/kernels:cast_op",
+        "//tensorflow/core/grappler/utils:topological_sort",
         "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
         "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
     ] + tf_protos_all(),
 )
 
 tf_cc_test(
-    name = "shuffle_and_repeat_fusion_test",
-    srcs = ["shuffle_and_repeat_fusion_test.cc"],
+    name = "map_fusion_test",
+    srcs = ["map_fusion_test.cc"],
     visibility = ["//visibility:public"],
     deps = [
         ":graph_utils",
-        ":shuffle_and_repeat_fusion",
+        ":map_fusion",
         "//tensorflow/core:framework",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
         "//tensorflow/core/grappler:grappler_item",
     ],
 )
 
 cc_library(
-    name = "take_all_elimination",
-    srcs = ["take_all_elimination.cc"],
+    name = "noop_elimination",
+    srcs = ["noop_elimination.cc"],
     hdrs = [
-        "take_all_elimination.h",
+        "noop_elimination.h",
     ],
     visibility = ["//visibility:public"],
     deps = [
         ":graph_utils",
         "//tensorflow/core:lib",
-        "//tensorflow/core/grappler:graph_view",
+        "//tensorflow/core/grappler:mutable_graph_view",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/clusters:cluster",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer",
+        "//tensorflow/core/grappler/optimizers:custom_graph_optimizer_registry",
+    ] + tf_protos_all(),
+)
+
+tf_cc_test(
+    name = "noop_elimination_test",
+    srcs = ["noop_elimination_test.cc"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        ":noop_elimination",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler:grappler_item",
+    ],
+)
+
+cc_library(
+    name = "shuffle_and_repeat_fusion",
+    srcs = ["shuffle_and_repeat_fusion.cc"],
+    hdrs = [
+        "shuffle_and_repeat_fusion.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_utils",
+        "//tensorflow/core:lib",
+        "//tensorflow/core/grappler:mutable_graph_view",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:op_types",
         "//tensorflow/core/grappler:utils",
@@ -122,12 +215,12 @@ cc_library(
 )
 
 tf_cc_test(
-    name = "take_all_elimination_test",
-    srcs = ["take_all_elimination_test.cc"],
+    name = "shuffle_and_repeat_fusion_test",
+    srcs = ["shuffle_and_repeat_fusion_test.cc"],
     visibility = ["//visibility:public"],
     deps = [
         ":graph_utils",
-        ":take_all_elimination",
+        ":shuffle_and_repeat_fusion",
         "//tensorflow/core:framework",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
@@ -139,9 +232,26 @@ cc_library(
     name = "data",
     visibility = ["//visibility:public"],
     deps = [
+        ":function_rename",
+        ":latency_all_edges",
         ":map_and_batch_fusion",
+        ":map_fusion",
+        ":noop_elimination",
         ":shuffle_and_repeat_fusion",
-        ":take_all_elimination",
     ],
     alwayslink = 1,
 )
+
+tf_cc_test(
+    name = "latency_all_edges_test",
+    srcs = ["latency_all_edges_test.cc"],
+    deps = [
+        ":graph_utils",
+        ":latency_all_edges",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+        "//tensorflow/core/grappler:grappler_item",
+    ],
+)
diff --git a/tensorflow/core/grappler/optimizers/data/function_rename.cc b/tensorflow/core/grappler/optimizers/data/function_rename.cc
new file mode 100644
index 0000000000..8cf044d1bd
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/function_rename.cc
@@ -0,0 +1,51 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/function_rename.h"
+
+#include "tensorflow/core/grappler/clusters/cluster.h"
+#include "tensorflow/core/grappler/graph_view.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+namespace grappler {
+
+Status FunctionRename::Optimize(Cluster* cluster, const GrapplerItem& item,
+                                GraphDef* output) {
+  *output = item.graph;
+  GraphView graph(output);
+  int n = output->mutable_library()->function_size();
+  for (int i = 0; i < n; ++i) {
+    FunctionDef* fn = output->mutable_library()->mutable_function(i);
+    fn->mutable_signature()->set_name(fn->signature().name() + "world");
+  }
+
+  return Status::OK();
+}
+
+void FunctionRename::Feedback(Cluster* cluster, const GrapplerItem& item,
+                              const GraphDef& optimize_output, double result) {
+  // no-op
+}
+
+REGISTER_GRAPH_OPTIMIZER_AS(FunctionRename, "_test_only_function_rename");
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/take_all_elimination.h b/tensorflow/core/grappler/optimizers/data/function_rename.h
index 8351603bda..23ad9470ff 100644
--- a/tensorflow/core/grappler/optimizers/data/take_all_elimination.h
+++ b/tensorflow/core/grappler/optimizers/data/function_rename.h
@@ -13,22 +13,20 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_TAKE_ALL_ELIMINATION_H_
-#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_TAKE_ALL_ELIMINATION_H_
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_FUNCTION_RENAME_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_FUNCTION_RENAME_H_
 
 #include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
 
 namespace tensorflow {
 namespace grappler {
 
-// This class optimizes 'TakeDataset' operations that take all the elements
-// e.g. - take(-1).
-class TakeAllElimination : public CustomGraphOptimizer {
+class FunctionRename : public CustomGraphOptimizer {
  public:
-  TakeAllElimination() = default;
-  ~TakeAllElimination() override = default;
+  FunctionRename() = default;
+  ~FunctionRename() override = default;
 
-  string name() const override { return "take_all_elimination"; };
+  string name() const override { return "_test_only_function_rename"; };
 
   Status Init(
       const tensorflow::RewriterConfig_CustomGraphOptimizer* config) override {
@@ -45,4 +43,4 @@ class TakeAllElimination : public CustomGraphOptimizer {
 }  // end namespace grappler
 }  // end namespace tensorflow
 
-#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_TAKE_ALL_ELIMINATION_H_
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_FUNCTION_RENAME_H_
diff --git a/tensorflow/core/grappler/optimizers/data/function_rename_test.cc b/tensorflow/core/grappler/optimizers/data/function_rename_test.cc
new file mode 100644
index 0000000000..56b8a960a7
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/function_rename_test.cc
@@ -0,0 +1,42 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/function_rename.h"
+
+#include "tensorflow/core/framework/function.pb.h"
+#include "tensorflow/core/framework/op_def.pb.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+TEST(FunctionRenameTest, RenameFunction) {
+  GrapplerItem item;
+  GraphDef *graph = &item.graph;
+  FunctionDef *fn = graph->mutable_library()->add_function();
+  fn->mutable_signature()->set_name("hello");
+
+  FunctionRename optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  EXPECT_EQ(output.library().function(0).signature().name(), "helloworld");
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/graph_utils.cc b/tensorflow/core/grappler/optimizers/data/graph_utils.cc
index b5b46ccafe..838787d2a5 100644
--- a/tensorflow/core/grappler/optimizers/data/graph_utils.cc
+++ b/tensorflow/core/grappler/optimizers/data/graph_utils.cc
@@ -16,11 +16,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
 
 #include "tensorflow/core/framework/device_base.h"
-#include "tensorflow/core/grappler/clusters/virtual_cluster.h"
-#include "tensorflow/core/grappler/graph_view.h"
-#include "tensorflow/core/grappler/grappler_item.h"
-#include "tensorflow/core/grappler/grappler_item_builder.h"
-#include "tensorflow/core/grappler/optimizers/meta_optimizer.h"
+#include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/util/ptr_util.h"
 
 namespace tensorflow {
@@ -30,14 +26,18 @@ namespace {
 
 constexpr char kConstOpName[] = "Const";
 
-int FindNodeWithPredicate(const std::function<bool(const NodeDef&)>& predicate,
-                          const GraphDef& graph) {
-  for (int i = 0; i < graph.node_size(); ++i) {
-    if (predicate(graph.node(i))) {
-      return i;
+template <typename Predicate, typename Collection>
+std::vector<int> GetElementIndicesWithPredicate(const Predicate& predicate,
+                                                const Collection& collection) {
+  std::vector<int> indices = {};
+  unsigned idx = 0;
+  for (auto&& element : collection) {
+    if (predicate(element)) {
+      indices.push_back(idx);
     }
+    idx++;
   }
-  return -1;
+  return indices;
 }
 
 std::vector<int> CreateNameIndex(const GraphDef& graph) {
@@ -66,13 +66,14 @@ std::vector<int> CreateInputIndex(const NodeDef& node) {
   return index;
 }
 
-Status AddScalarConstNodeHelper(
+NodeDef* AddScalarConstNodeHelper(
     DataType dtype, const std::function<void(TensorProto*)>& add_value,
-    GraphDef* graph, NodeDef** result) {
-  NodeDef* node = graph->add_node();
-  node->set_op(kConstOpName);
-  SetUniqueName(kConstOpName, graph, node);
-  (*node->mutable_attr())["dtype"].set_type(dtype);
+    MutableGraphView* graph) {
+  NodeDef node;
+  node.set_op(kConstOpName);
+  SetUniqueGraphNodeName(kConstOpName, graph->GetGraph(), &node);
+
+  (*node.mutable_attr())["dtype"].set_type(dtype);
   std::unique_ptr<tensorflow::TensorProto> tensor =
       tensorflow::MakeUnique<tensorflow::TensorProto>();
   std::unique_ptr<tensorflow::TensorShapeProto> tensor_shape =
@@ -80,75 +81,69 @@ Status AddScalarConstNodeHelper(
   tensor->set_allocated_tensor_shape(tensor_shape.release());
   tensor->set_dtype(dtype);
   add_value(tensor.get());
-  (*node->mutable_attr())["value"].set_allocated_tensor(tensor.release());
-  *result = node;
-  return Status::OK();
+  (*node.mutable_attr())["value"].set_allocated_tensor(tensor.release());
+
+  return graph->AddNode(std::move(node));
 }
 
 }  // namespace
 
-Status AddNode(const string& name, const string& op,
-               const std::vector<string>& inputs,
-               const std::vector<std::pair<string, AttrValue>>& attributes,
-               GraphDef* graph, NodeDef** result) {
-  NodeDef* node = graph->add_node();
+NodeDef* AddNode(const string& name, const string& op,
+                 const std::vector<string>& inputs,
+                 const std::vector<std::pair<string, AttrValue>>& attributes,
+                 MutableGraphView* graph) {
+  NodeDef node;
   if (!name.empty()) {
-    node->set_name(name);
+    node.set_name(name);
   } else {
-    SetUniqueName(op, graph, node);
+    SetUniqueGraphNodeName(op, graph->GetGraph(), &node);
   }
-  node->set_op(op);
+  node.set_op(op);
   for (const string& input : inputs) {
-    node->add_input(input);
+    node.add_input(input);
   }
   for (auto attr : attributes) {
-    (*node->mutable_attr())[attr.first] = attr.second;
+    (*node.mutable_attr())[attr.first] = attr.second;
   }
-  *result = node;
-  return Status::OK();
+  return graph->AddNode(std::move(node));
 }
 
 template <>
-Status AddScalarConstNode(bool v, GraphDef* graph, NodeDef** result) {
+NodeDef* AddScalarConstNode(bool v, MutableGraphView* graph) {
   return AddScalarConstNodeHelper(
-      DT_BOOL, [v](TensorProto* proto) { proto->add_bool_val(v); }, graph,
-      result);
+      DT_BOOL, [v](TensorProto* proto) { proto->add_bool_val(v); }, graph);
 }
 
 template <>
-Status AddScalarConstNode(double v, GraphDef* graph, NodeDef** result) {
+NodeDef* AddScalarConstNode(double v, MutableGraphView* graph) {
   return AddScalarConstNodeHelper(
-      DT_DOUBLE, [v](TensorProto* proto) { proto->add_double_val(v); }, graph,
-      result);
+      DT_DOUBLE, [v](TensorProto* proto) { proto->add_double_val(v); }, graph);
 }
 
 template <>
-Status AddScalarConstNode(float v, GraphDef* graph, NodeDef** result) {
+NodeDef* AddScalarConstNode(float v, MutableGraphView* graph) {
   return AddScalarConstNodeHelper(
-      DT_FLOAT, [v](TensorProto* proto) { proto->add_float_val(v); }, graph,
-      result);
+      DT_FLOAT, [v](TensorProto* proto) { proto->add_float_val(v); }, graph);
 }
 
 template <>
-Status AddScalarConstNode(int v, GraphDef* graph, NodeDef** result) {
+NodeDef* AddScalarConstNode(int v, MutableGraphView* graph) {
   return AddScalarConstNodeHelper(
-      DT_INT32, [v](TensorProto* proto) { proto->add_int_val(v); }, graph,
-      result);
+      DT_INT32, [v](TensorProto* proto) { proto->add_int_val(v); }, graph);
 }
 
 template <>
-Status AddScalarConstNode(int64 v, GraphDef* graph, NodeDef** result) {
+NodeDef* AddScalarConstNode(int64 v, MutableGraphView* graph) {
   return AddScalarConstNodeHelper(
-      DT_INT64, [v](TensorProto* proto) { proto->add_int64_val(v); }, graph,
-      result);
+      DT_INT64, [v](TensorProto* proto) { proto->add_int64_val(v); }, graph);
 }
 
 template <>
-Status AddScalarConstNode(StringPiece v, GraphDef* graph, NodeDef** result) {
+NodeDef* AddScalarConstNode(StringPiece v, MutableGraphView* graph) {
   return AddScalarConstNodeHelper(
       DT_STRING,
       [v](TensorProto* proto) { proto->add_string_val(v.data(), v.size()); },
-      graph, result);
+      graph);
 }
 
 bool Compare(const GraphDef& g1, const GraphDef& g2) {
@@ -181,52 +176,97 @@ bool Compare(const GraphDef& g1, const GraphDef& g2) {
   return true;
 }
 
-bool ContainsNodeWithName(const string& name, const GraphDef& graph) {
-  return FindNodeWithName(name, graph) != -1;
+bool ContainsGraphNodeWithName(const string& name, const GraphDef& graph) {
+  return FindGraphNodeWithName(name, graph) != -1;
 }
 
 bool ContainsNodeWithOp(const string& op, const GraphDef& graph) {
   return FindNodeWithOp(op, graph) != -1;
 }
 
-Status DeleteNodes(const std::set<string>& nodes_to_delete, GraphDef* graph) {
-  int last = graph->node_size() - 1;
-  for (int i = graph->node_size() - 1; i >= 0; --i) {
-    const NodeDef& node = graph->node(i);
-    if (nodes_to_delete.find(node.name()) != nodes_to_delete.end()) {
-      graph->mutable_node()->SwapElements(i, last);
-      last--;
-    }
-  }
-  graph->mutable_node()->DeleteSubrange(last + 1,
-                                        graph->node_size() - last - 1);
-  return Status::OK();
+bool ContainsGraphFunctionWithName(const string& name,
+                                   const FunctionDefLibrary& library) {
+  return FindGraphFunctionWithName(name, library) != -1;
 }
 
-int FindNodeWithName(const string& name, const GraphDef& graph) {
-  return FindNodeWithPredicate(
-      [name](const NodeDef& node) { return node.name() == name; }, graph);
+bool ContainsFunctionNodeWithName(const string& name,
+                                  const FunctionDef& function) {
+  return FindFunctionNodeWithName(name, function) != -1;
+}
+
+int FindGraphNodeWithName(const string& name, const GraphDef& graph) {
+  std::vector<int> indices = GetElementIndicesWithPredicate(
+      [&name](const NodeDef& node) { return node.name() == name; },
+      graph.node());
+  return indices.empty() ? -1 : indices.front();
 }
 
 int FindNodeWithOp(const string& op, const GraphDef& graph) {
-  return FindNodeWithPredicate(
-      [op](const NodeDef& node) { return node.op() == op; }, graph);
+  std::vector<int> indices = GetElementIndicesWithPredicate(
+      [&op](const NodeDef& node) { return node.op() == op; }, graph.node());
+  return indices.empty() ? -1 : indices.front();
+}
+
+std::vector<int> FindAllGraphNodesWithOp(const string& op,
+                                         const GraphDef& graph) {
+  return GetElementIndicesWithPredicate(
+      [&op](const NodeDef& node) { return node.op() == op; }, graph.node());
+}
+
+int FindGraphFunctionWithName(const string& name,
+                              const FunctionDefLibrary& library) {
+  std::vector<int> indices = GetElementIndicesWithPredicate(
+      [&name](const FunctionDef& function) {
+        return function.signature().name() == name;
+      },
+      library.function());
+  return indices.empty() ? -1 : indices.front();
+}
+
+int FindFunctionNodeWithName(const string& name, const FunctionDef& function) {
+  std::vector<int> indices = GetElementIndicesWithPredicate(
+      [&name](const NodeDef& node) { return node.name() == name; },
+      function.node_def());
+  return indices.empty() ? -1 : indices.front();
 }
 
-void SetUniqueName(const string& op, GraphDef* graph, NodeDef* node) {
+void SetUniqueGraphNodeName(const string& prefix, GraphDef* graph,
+                            NodeDef* node) {
+  string name = prefix;
   int id = graph->node_size();
-  while (ContainsNodeWithName(strings::StrCat(op, "/_", id), *graph)) {
+  while (ContainsGraphNodeWithName(name, *graph)) {
+    if (name.rfind("_generated") != std::string::npos &&
+        (name.rfind("_generated") == (name.size() - strlen("_generated")))) {
+      name.insert(name.rfind("_generated"), strings::StrCat("/_", id));
+    } else {
+      name = strings::StrCat(prefix, "/_", id);
+    }
+    ++id;
+  }
+  node->set_name(std::move(name));
+}
+
+void SetUniqueFunctionNodeName(const string& prefix, FunctionDef* function,
+                               NodeDef* node) {
+  string name = prefix;
+  int id = function->node_def_size();
+  while (ContainsFunctionNodeWithName(name, *function)) {
+    name = strings::StrCat(prefix, "/_", id);
     ++id;
   }
-  node->set_name(strings::StrCat(op, "/_", id));
+  node->set_name(std::move(name));
 }
 
-void ReplaceInput(const NodeDef& old_input, const NodeDef& new_input,
-                  GraphView* graph) {
-  GraphView::OutputPort output_port = graph->GetOutputPort(old_input.name(), 0);
-  auto fanout = graph->GetFanout(output_port);
-  for (auto& input_port : fanout)
-    input_port.node->set_input(0, new_input.name());
+void SetUniqueGraphFunctionName(const string& prefix,
+                                FunctionDefLibrary* library,
+                                FunctionDef* function) {
+  string name = prefix;
+  int id = library->function_size();
+  while (ContainsGraphFunctionWithName(name, *library)) {
+    name = strings::StrCat(prefix, "/_", id);
+    ++id;
+  }
+  function->mutable_signature()->set_name(name);
 }
 
 }  // end namespace graph_utils
diff --git a/tensorflow/core/grappler/optimizers/data/graph_utils.h b/tensorflow/core/grappler/optimizers/data/graph_utils.h
index 1cb0f0c81d..39c687b501 100644
--- a/tensorflow/core/grappler/optimizers/data/graph_utils.h
+++ b/tensorflow/core/grappler/optimizers/data/graph_utils.h
@@ -17,12 +17,13 @@ limitations under the License.
 #define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_GRAPH_UTILS_H_
 
 #include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/function.pb.h"
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/tensor.pb.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/framework/types.h"
-#include "tensorflow/core/grappler/graph_view.h"
+#include "tensorflow/core/grappler/mutable_graph_view.h"
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/lib/core/errors.h"
 
@@ -31,57 +32,88 @@ namespace grappler {
 namespace graph_utils {
 
 // Adds a node to the graph.
-Status AddNode(const string& name, const string& op,
-               const std::vector<string>& inputs,
-               const std::vector<std::pair<string, AttrValue>>& attributes,
-               GraphDef* graph, NodeDef** result);
+NodeDef* AddNode(const string& name, const string& op,
+                 const std::vector<string>& inputs,
+                 const std::vector<std::pair<string, AttrValue>>& attributes,
+                 MutableGraphView* graph);
 
 // Adds a Const node with the given value to the graph.
 template <typename T>
-Status AddScalarConstNode(T v, GraphDef* graph, NodeDef** result) {
-  return errors::Unimplemented("Type %s is not supported.",
-                               DataTypeToEnum<T>::value);
+NodeDef* AddScalarConstNode(T v, MutableGraphView* graph) {
+  // is_same is an idiomatic hack for making it compile if not instantiated.
+  // Replacing with false will result in a compile-time error.
+  static_assert(!std::is_same<T, T>::value,
+                "Invalid specialization of this method for type T.");
+  return {};
 }
+
 template <>
-Status AddScalarConstNode(bool v, GraphDef* graph, NodeDef** result);
+NodeDef* AddScalarConstNode(bool v, MutableGraphView* graph);
 template <>
-Status AddScalarConstNode(double v, GraphDef* graph, NodeDef** result);
+NodeDef* AddScalarConstNode(double v, MutableGraphView* graph);
 template <>
-Status AddScalarConstNode(float v, GraphDef* graph, NodeDef** result);
+NodeDef* AddScalarConstNode(float v, MutableGraphView* graph);
 template <>
-Status AddScalarConstNode(int v, GraphDef* graph, NodeDef** result);
+NodeDef* AddScalarConstNode(int v, MutableGraphView* graph);
 template <>
-Status AddScalarConstNode(int64 v, GraphDef* graph, NodeDef** result);
+NodeDef* AddScalarConstNode(int64 v, MutableGraphView* graph);
 template <>
-Status AddScalarConstNode(StringPiece v, GraphDef* graph, NodeDef** result);
+NodeDef* AddScalarConstNode(StringPiece v, MutableGraphView* graph);
 
 // Checks whether the two graphs are the same.
 bool Compare(const GraphDef& g1, const GraphDef& g2);
 
 // Checks whether the graph contains a node with the given name.
-bool ContainsNodeWithName(const string& name, const GraphDef& graph);
+bool ContainsGraphNodeWithName(const string& name, const GraphDef& graph);
+
+// Checks whether the library contains a function with the given name.
+bool ContainsGraphFunctionWithName(const string& name,
+                                   const FunctionDefLibrary& library);
+
+// Checks whether the function contains a node with the given name.
+bool ContainsFunctionNodeWithName(const string& name,
+                                  const FunctionDef& function);
 
 // Checks whether the graph contains a node with the given op.
 bool ContainsNodeWithOp(const string& op, const GraphDef& graph);
 
-// Deletes nodes from the graph.
-Status DeleteNodes(const std::set<string>& nodes_to_delete, GraphDef* graph);
-
 // Returns the index of the node with the given name or -1 if the node does
 // not exist.
-int FindNodeWithName(const string& name, const GraphDef& graph);
+int FindGraphNodeWithName(const string& name, const GraphDef& graph);
 
-// Returns the index of a node with the given op or -1 if no such  node
+// Returns the index of the function with the given name or -1 if the function
+// does not exist.
+int FindGraphFunctionWithName(const string& name,
+                              const FunctionDefLibrary& library);
+
+// Returns the index of the function node with the given name or -1 if the
+// function node does not exist.
+int FindFunctionNodeWithName(const string& name, const FunctionDef& function);
+
+// Returns the index of the first node with the given op or -1 if no such  node
 // exists.
 int FindNodeWithOp(const string& op, const GraphDef& graph);
 
-// Sets the node name using the op name as a prefix while guaranteeing the name
-// is unique across the graph.
-void SetUniqueName(const string& op, GraphDef* graph, NodeDef* node);
+// Returns the list of indices of all nodes with the given op or empty list if
+// no such node exists.
+std::vector<int> FindAllGraphNodesWithOp(const string& op,
+                                         const GraphDef& graph);
 
-// Replaces the input for the output nodes of 'old_input' with 'new_input'.
-void ReplaceInput(const NodeDef& old_input, const NodeDef& new_input,
-                  GraphView* graph);
+// Sets the node name using `prefix` as a prefix while guaranteeing the name
+// is unique across the graph.
+void SetUniqueGraphNodeName(const string& prefix, GraphDef* graph,
+                            NodeDef* node);
+
+// Sets the function node name using the `prefix` as a prefix while guaranteeing
+// the name is unique across the functions nodes.
+void SetUniqueFunctionNodeName(const string& prefix, FunctionDef* function,
+                               NodeDef* node);
+
+// Sets the node name using the `prefix` name as a prefix while guaranteeing the
+// name is unique across the graph.
+void SetUniqueGraphFunctionName(const string& prefix,
+                                FunctionDefLibrary* library,
+                                FunctionDef* function);
 
 }  // end namespace graph_utils
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/optimizers/data/graph_utils_test.cc b/tensorflow/core/grappler/optimizers/data/graph_utils_test.cc
index d723d73b7a..e6789d47b5 100644
--- a/tensorflow/core/grappler/optimizers/data/graph_utils_test.cc
+++ b/tensorflow/core/grappler/optimizers/data/graph_utils_test.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
 
+#include "tensorflow/core/framework/function_testlib.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
 
@@ -24,149 +25,215 @@ namespace graph_utils {
 namespace {
 
 TEST(GraphUtilsTest, AddScalarConstNodeBool) {
-  GraphDef graph;
-  NodeDef* bool_node;
-  TF_EXPECT_OK(AddScalarConstNode<bool>(true, &graph, &bool_node));
-  EXPECT_TRUE(ContainsNodeWithName(bool_node->name(), graph));
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  NodeDef* bool_node = AddScalarConstNode<bool>(true, &graph);
+  EXPECT_TRUE(ContainsGraphNodeWithName(bool_node->name(), *graph.GetGraph()));
   EXPECT_EQ(bool_node->attr().at("value").tensor().bool_val(0), true);
 }
 
 TEST(GraphUtilsTest, AddScalarConstNodeDouble) {
-  GraphDef graph;
-  NodeDef* double_node;
-  TF_EXPECT_OK(AddScalarConstNode<double>(3.14, &graph, &double_node));
-  EXPECT_TRUE(ContainsNodeWithName(double_node->name(), graph));
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  NodeDef* double_node = AddScalarConstNode<double>(3.14, &graph);
+  EXPECT_TRUE(
+      ContainsGraphNodeWithName(double_node->name(), *graph.GetGraph()));
   EXPECT_FLOAT_EQ(double_node->attr().at("value").tensor().double_val(0), 3.14);
 }
 
 TEST(GraphUtilsTest, AddScalarConstNodeFloat) {
-  GraphDef graph;
-  NodeDef* float_node;
-  TF_EXPECT_OK(AddScalarConstNode<float>(3.14, &graph, &float_node));
-  EXPECT_TRUE(ContainsNodeWithName(float_node->name(), graph));
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  NodeDef* float_node = AddScalarConstNode<float>(3.14, &graph);
+  EXPECT_TRUE(ContainsGraphNodeWithName(float_node->name(), *graph.GetGraph()));
   EXPECT_FLOAT_EQ(float_node->attr().at("value").tensor().float_val(0), 3.14);
 }
 
 TEST(GraphUtilsTest, AddScalarConstNodeInt) {
-  GraphDef graph;
-  NodeDef* int_node;
-  TF_EXPECT_OK(AddScalarConstNode<int>(42, &graph, &int_node));
-  EXPECT_TRUE(ContainsNodeWithName(int_node->name(), graph));
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  NodeDef* int_node = AddScalarConstNode<int>(42, &graph);
+  EXPECT_TRUE(ContainsGraphNodeWithName(int_node->name(), *graph.GetGraph()));
   EXPECT_EQ(int_node->attr().at("value").tensor().int_val(0), 42);
 }
 
 TEST(GraphUtilsTest, AddScalarConstNodeInt64) {
-  GraphDef graph;
-  NodeDef* int64_node;
-  TF_EXPECT_OK(AddScalarConstNode<int64>(42, &graph, &int64_node));
-  EXPECT_TRUE(ContainsNodeWithName(int64_node->name(), graph));
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  NodeDef* int64_node = AddScalarConstNode<int64>(42, &graph);
+  EXPECT_TRUE(ContainsGraphNodeWithName(int64_node->name(), *graph.GetGraph()));
   EXPECT_EQ(int64_node->attr().at("value").tensor().int64_val(0), 42);
 }
 
 TEST(GraphUtilsTest, AddScalarConstNodeString) {
-  GraphDef graph;
-  NodeDef* string_node;
-  TF_EXPECT_OK(AddScalarConstNode<StringPiece>("hello", &graph, &string_node));
-  EXPECT_TRUE(ContainsNodeWithName(string_node->name(), graph));
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  NodeDef* string_node = AddScalarConstNode<StringPiece>("hello", &graph);
+  EXPECT_TRUE(
+      ContainsGraphNodeWithName(string_node->name(), *graph.GetGraph()));
   EXPECT_EQ(string_node->attr().at("value").tensor().string_val(0), "hello");
 }
 
 TEST(GraphUtilsTest, Compare) {
-  GraphDef graphA;
-  GraphDef graphB;
-  EXPECT_TRUE(Compare(graphA, graphB));
+  GraphDef graph_def_a;
+  MutableGraphView graph_a(&graph_def_a);
+  GraphDef graph_def_b;
+  MutableGraphView graph_b(&graph_def_b);
 
-  NodeDef* nodeA;
-  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graphA, &nodeA));
-  NodeDef* nodeB;
-  TF_EXPECT_OK(AddNode("B", "OpB", {"A"}, {}, &graphA, &nodeB));
-  EXPECT_FALSE(Compare(graphA, graphB));
+  EXPECT_TRUE(Compare(graph_def_a, graph_def_b));
 
-  graphB.mutable_node()->CopyFrom(graphA.node());
-  EXPECT_TRUE(Compare(graphA, graphB));
+  AddNode("A", "OpA", {}, {}, &graph_a);
+  AddNode("B", "OpB", {"A"}, {}, &graph_a);
+  EXPECT_FALSE(Compare(graph_def_a, graph_def_b));
+
+  graph_def_b.mutable_node()->CopyFrom(graph_def_a.node());
+  EXPECT_TRUE(Compare(graph_def_a, graph_def_b));
 }
 
-TEST(GraphUtilsTest, ContainsNodeWithName) {
-  GraphDef graph;
-  EXPECT_TRUE(!ContainsNodeWithName("A", graph));
+TEST(GraphUtilsTest, ContainsGraphNodeWithName) {
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  EXPECT_TRUE(!ContainsGraphNodeWithName("A", *graph.GetGraph()));
 
-  NodeDef* node;
-  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
-  EXPECT_TRUE(ContainsNodeWithName("A", graph));
+  AddNode("A", "OpA", {}, {}, &graph);
+  EXPECT_TRUE(ContainsGraphNodeWithName("A", *graph.GetGraph()));
 
-  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
-  EXPECT_TRUE(!ContainsNodeWithName("A", graph));
+  graph.DeleteNodes({"A"});
+  EXPECT_TRUE(!ContainsGraphNodeWithName("A", *graph.GetGraph()));
 }
 
-TEST(GraphUtilsTest, ContainsNodeWithOp) {
-  GraphDef graph;
-  EXPECT_TRUE(!ContainsNodeWithOp("OpA", graph));
+TEST(GraphUtilsTest, ContainsGraphFunctionWithName) {
+  FunctionDefLibrary library;
+  EXPECT_FALSE(ContainsGraphFunctionWithName("new_function", library));
+  FunctionDef* new_function = library.add_function();
+  SetUniqueGraphFunctionName("new_function", &library, new_function);
 
-  NodeDef* node;
-  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
-  EXPECT_TRUE(ContainsNodeWithOp("OpA", graph));
+  EXPECT_TRUE(
+      ContainsGraphFunctionWithName(new_function->signature().name(), library));
+}
 
-  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
-  EXPECT_TRUE(!ContainsNodeWithOp("OpA", graph));
+TEST(GraphUtilsTest, ContainsFunctionNodeWithName) {
+  FunctionDef function = test::function::XTimesTwo();
+  EXPECT_FALSE(ContainsFunctionNodeWithName(
+      "weird_name_that_should_not_be_there", function));
+  EXPECT_TRUE(ContainsFunctionNodeWithName("two", function));
 }
 
-TEST(GraphUtilsTest, FindNodeWithName) {
-  GraphDef graph;
-  EXPECT_EQ(FindNodeWithName("A", graph), -1);
+TEST(GraphUtilsTest, ContainsNodeWithOp) {
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  EXPECT_TRUE(!ContainsNodeWithOp("OpA", *graph.GetGraph()));
 
-  NodeDef* node;
-  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
-  EXPECT_NE(FindNodeWithName("A", graph), -1);
+  AddNode("A", "OpA", {}, {}, &graph);
+  EXPECT_TRUE(ContainsNodeWithOp("OpA", *graph.GetGraph()));
 
-  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
-  EXPECT_EQ(FindNodeWithName("A", graph), -1);
+  graph.DeleteNodes({"A"});
+  EXPECT_TRUE(!ContainsNodeWithOp("OpA", *graph.GetGraph()));
 }
 
-TEST(GraphUtilsTest, FindNodeWithOp) {
-  GraphDef graph;
-  EXPECT_EQ(FindNodeWithOp("OpA", graph), -1);
+TEST(GraphUtilsTest, FindGraphNodeWithName) {
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  EXPECT_EQ(FindGraphNodeWithName("A", *graph.GetGraph()), -1);
 
-  NodeDef* node;
-  TF_EXPECT_OK(AddNode("A", "OpA", {}, {}, &graph, &node));
-  EXPECT_NE(FindNodeWithOp("OpA", graph), -1);
+  AddNode("A", "OpA", {}, {}, &graph);
+  EXPECT_NE(FindGraphNodeWithName("A", *graph.GetGraph()), -1);
 
-  TF_EXPECT_OK(DeleteNodes({"A"}, &graph));
-  EXPECT_EQ(FindNodeWithOp("OpA", graph), -1);
+  graph.DeleteNodes({"A"});
+  EXPECT_EQ(FindGraphNodeWithName("A", *graph.GetGraph()), -1);
 }
 
-TEST(GraphUtilsTest, SetUniqueName) {
-  GraphDef graph;
+TEST(GraphUtilsTest, FindFunctionWithName) {
+  FunctionDef function = test::function::XTimesTwo();
+  EXPECT_EQ(
+      FindFunctionNodeWithName("weird_name_that_should_not_be_there", function),
+      -1);
+  EXPECT_NE(FindFunctionNodeWithName("two", function), -1);
+}
 
-  NodeDef* node1;
-  TF_EXPECT_OK(AddNode("", "A", {}, {}, &graph, &node1));
-  NodeDef* node2;
-  TF_EXPECT_OK(AddNode("", "A", {}, {}, &graph, &node2));
-  EXPECT_NE(node1->name(), node2->name());
+TEST(GraphUtilsTest, FindGraphFunctionWithName) {
+  FunctionDefLibrary library;
+  EXPECT_EQ(FindGraphFunctionWithName("new_function", library), -1);
+  FunctionDef* new_function = library.add_function();
+  SetUniqueGraphFunctionName("new_function", &library, new_function);
 
-  TF_EXPECT_OK(DeleteNodes({node1->name()}, &graph));
-  NodeDef* node3;
-  TF_EXPECT_OK(AddNode("", "A", {}, {}, &graph, &node3));
-  EXPECT_NE(node2->name(), node3->name());
+  EXPECT_NE(
+      FindGraphFunctionWithName(new_function->signature().name(), library), -1);
 }
 
-TEST(GraphUtilsTest, ReplaceInput) {
-  GraphDef graph;
+TEST(GraphUtilsTest, FindNodeWithOp) {
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  EXPECT_EQ(FindNodeWithOp("OpA", *graph.GetGraph()), -1);
+
+  AddNode("A", "OpA", {}, {}, &graph);
+  AddNode("B", "OpB", {"A"}, {}, &graph);
+  AddNode("A2", "OpA", {"B"}, {}, &graph);
+  EXPECT_EQ(FindNodeWithOp("OpA", *graph.GetGraph()), 0);
+
+  graph.DeleteNodes({"B"});
+  EXPECT_EQ(FindNodeWithOp("OpB", *graph.GetGraph()), -1);
+  EXPECT_EQ(FindGraphNodeWithName("A2", *graph.GetGraph()), 1);
+}
+
+TEST(GraphUtilsTest, FindAllGraphNodesWithOp) {
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
+  EXPECT_EQ(FindNodeWithOp("OpA", *graph.GetGraph()), -1);
+
+  AddNode("A", "OpA", {}, {}, &graph);
+  AddNode("B", "OpB", {"A"}, {}, &graph);
+  AddNode("A2", "OpA", {"B"}, {}, &graph);
+  std::vector<int> result_indices =
+      FindAllGraphNodesWithOp("OpA", *graph.GetGraph());
+  EXPECT_EQ(result_indices.size(), 2);
+  EXPECT_EQ(result_indices.at(0), 0);
+  EXPECT_EQ(result_indices.at(1), 2);
+
+  graph.DeleteNodes({"A2"});
+  std::vector<int> result_indices_new =
+      FindAllGraphNodesWithOp("OpA", *graph.GetGraph());
+  EXPECT_EQ(result_indices_new.size(), 1);
+  EXPECT_EQ(result_indices_new.at(0), 0);
+}
 
-  NodeDef* node1;
-  TF_EXPECT_OK(AddNode("", "A", {}, {}, &graph, &node1));
+TEST(GraphUtilsTest, SetUniqueGraphNodeName) {
+  GraphDef graph_def;
+  MutableGraphView graph(&graph_def);
 
-  NodeDef* node2;
-  TF_EXPECT_OK(AddNode("", "A", {node1->name()}, {}, &graph, &node2));
+  NodeDef* node1 = AddNode("", "A", {}, {}, &graph);
+  NodeDef* node2 = AddNode("", "A", {}, {}, &graph);
+  EXPECT_NE(node1->name(), node2->name());
 
-  NodeDef* node3;
-  TF_EXPECT_OK(AddNode("", "A", {node2->name()}, {}, &graph, &node3));
+  graph.DeleteNodes({node1->name()});
+  NodeDef* node3 = AddNode("", "A", {}, {}, &graph);
+  EXPECT_NE(node2->name(), node3->name());
+}
 
-  EXPECT_EQ(node3->input(0), node2->name());
+TEST(GraphUtilsTest, SetUniqueFunctionNodeName) {
+  FunctionDef function = test::function::XTimesTwo();
+  NodeDef node;
+  SetUniqueFunctionNodeName("abc", &function, &node);
+  for (const NodeDef& function_node : function.node_def()) {
+    EXPECT_NE(node.name(), function_node.name());
+  }
+  auto* new_node = function.add_node_def();
+  *new_node = node;
+
+  NodeDef other;
+  SetUniqueFunctionNodeName("abc", &function, &other);
+  EXPECT_NE(other.name(), new_node->name());
+}
 
-  GraphView view(&graph);
-  ReplaceInput(*node2, *node1, &view);
+TEST(GraphUtilsTest, SetUniqueGraphFunctionName) {
+  FunctionDefLibrary library;
+  FunctionDef* new_function = library.add_function();
+  SetUniqueGraphFunctionName("new_function", &library, new_function);
 
-  EXPECT_EQ(node3->input(0), node1->name());
+  FunctionDef* other_function = library.add_function();
+  SetUniqueGraphFunctionName("new_function", &library, other_function);
+  EXPECT_NE(new_function->signature().name(),
+            other_function->signature().name());
 }
 
 }  // namespace
diff --git a/tensorflow/core/grappler/optimizers/data/latency_all_edges.cc b/tensorflow/core/grappler/optimizers/data/latency_all_edges.cc
new file mode 100644
index 0000000000..0b25b1ea9d
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/latency_all_edges.cc
@@ -0,0 +1,112 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/latency_all_edges.h"
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/grappler/clusters/cluster.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/mutable_graph_view.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+constexpr char kInsertOpName[] = "LatencyStatsDataset";
+
+NodeDef make_latency_node(const NodeDef& node, MutableGraphView* graph) {
+  NodeDef new_node;
+  new_node.set_op(kInsertOpName);
+  graph_utils::SetUniqueGraphNodeName(
+      strings::StrCat(kInsertOpName, "_generated"), graph->GetGraph(),
+      &new_node);
+  // Set the input of LatencyDataset node as `node`
+  new_node.add_input(node.name());
+
+  NodeDef* tag = graph_utils::AddScalarConstNode<StringPiece>(
+      StringPiece("record_latency_" + node.name()), graph);
+  new_node.add_input(tag->name());
+
+  // Set `output_types` and `output_shapes` attributes.
+  for (auto key : {"output_shapes", "output_types"}) {
+    if (node.attr().find(key) != node.attr().end()) {
+      (*new_node.mutable_attr())[key] = node.attr().at(key);
+    } else {
+      const char* kInferredAttrPrefix = "T";
+      if (node.attr().find(strings::StrCat(kInferredAttrPrefix, key)) !=
+          node.attr().end()) {
+        (*new_node.mutable_attr())[key] =
+            node.attr().at(strings::StrCat(kInferredAttrPrefix, key));
+      }
+    }
+  }
+  return new_node;
+}
+
+}  // namespace
+
+Status LatencyAllEdges::Optimize(Cluster* cluster, const GrapplerItem& item,
+                                 GraphDef* output) {
+  *output = item.graph;
+  MutableGraphView graph(output);
+
+  // Add LatencyDatasetOp node after each node.
+  // TODO(shivaniagrawal): Add Op to return Latency for the particular Op than
+  // for the edge (e2 - e1?).
+  for (const NodeDef& node : item.graph.node()) {
+    if (node.op().rfind("Dataset") != node.op().size() - strlen("Dataset") ||
+        node.attr().empty() ||
+        node.name().rfind("_generated") ==
+            node.name().size() - strlen("_generated")) {
+      // TODO(b/111805951): Replace this with non-approximate way to check if
+      // node corresponds to a `Dataset` op.
+      continue;
+    }
+    GraphView::OutputPort output_port = graph.GetOutputPort(node.name(), 0);
+    auto fanout = graph.GetFanout(output_port);
+    if (fanout.size() > 1) {
+      LOG(WARNING) << node.name() << " has fanout size " << fanout.size();
+      continue;
+    } else {  // fanout will have size 0 for last dataset node in the pipeline.
+      if (fanout.size() == 1) {
+        NodeDef* output_node = (*(fanout.begin())).node;
+        if (output_node->name().rfind("_generated") ==
+            output_node->name().size() - strlen("_generated")) {
+          continue;
+        }
+      }
+    }
+
+    graph.InsertNode(node, make_latency_node(node, &graph));
+  }
+  return Status::OK();
+}
+
+void LatencyAllEdges::Feedback(Cluster* cluster, const GrapplerItem& item,
+                               const GraphDef& optimize_output, double result) {
+  // no-op
+}
+
+REGISTER_GRAPH_OPTIMIZER_AS(LatencyAllEdges, "latency_all_edges");
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/latency_all_edges.h b/tensorflow/core/grappler/optimizers/data/latency_all_edges.h
new file mode 100644
index 0000000000..f6c71a9ec7
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/latency_all_edges.h
@@ -0,0 +1,46 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_LATENCY_ALL_EDGES_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_LATENCY_ALL_EDGES_H_
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+
+namespace tensorflow {
+namespace grappler {
+
+class LatencyAllEdges : public CustomGraphOptimizer {
+ public:
+  LatencyAllEdges() = default;
+  ~LatencyAllEdges() override = default;
+
+  string name() const override { return "latency_all_edges"; };
+
+  Status Init(
+      const tensorflow::RewriterConfig_CustomGraphOptimizer* config) override {
+    return Status::OK();
+  }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* output) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimize_output, double result) override;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_LATENCY_ALL_EDGES_H_
diff --git a/tensorflow/core/grappler/optimizers/data/latency_all_edges_test.cc b/tensorflow/core/grappler/optimizers/data/latency_all_edges_test.cc
new file mode 100644
index 0000000000..6789cf5bd6
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/latency_all_edges_test.cc
@@ -0,0 +1,92 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/latency_all_edges.h"
+
+#include "tensorflow/core/framework/attr_value_util.h"
+#include "tensorflow/core/framework/function_testlib.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+TEST(LatencyAllEdgesTest, AddLatenciesAfterTensorMapPrefetch) {
+  using test::function::NDef;
+  GrapplerItem item;
+  NodeDef component_node =
+      NDef("component_nodes", "Const", {}, {{"value", 1}, {"dtype", DT_INT32}});
+  NodeDef from_tensor_node =
+      NDef("from_tensor_nodes", "TensorDataset", {"component_nodes"},
+           {{"Toutput_types", {}}, {"output_shapes", {}}});
+
+  NodeDef captured_input_node = NDef("captured_input_node", "Const", {},
+                                     {{"value", ""}, {"dtype", DT_STRING}});
+  NodeDef map_node = NDef("map_node", "MapDataset",
+                          {"from_tensor_node", "captured_input_node"},
+                          {{"f", {}},
+                           {"Targumemts", {}},
+                           {"output_shapes", {}},
+                           {"output_types", {}}});
+  NodeDef buffer_size_node = NDef("buffer_size_node", "Const", {},
+                                  {{"value", 1}, {"dtype", DT_INT32}});
+  NodeDef prefetch_node = NDef("prefetch_node", "Prefetch_Dataset",
+                               {"map_node", "buffer_size_node"},
+                               {{"output_shapes", {}}, {"output_types", {}}});
+
+  item.graph = test::function::GDef({component_node, from_tensor_node,
+                                     captured_input_node, map_node,
+                                     buffer_size_node, prefetch_node});
+
+  LatencyAllEdges optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_TRUE(graph_utils::ContainsNodeWithOp("LatencyStatsDataset", output));
+  std::vector<int> latency_node_indices =
+      graph_utils::FindAllGraphNodesWithOp("LatencyStatsDataset", output);
+  EXPECT_EQ(latency_node_indices.size(), 3);
+  std::vector<NodeDef> dataset_nodes = {std::move(from_tensor_node),
+                                        std::move(map_node),
+                                        std::move(prefetch_node)};
+  for (int i = 0; i < latency_node_indices.size(); i++) {
+    NodeDef latency_node = output.node(latency_node_indices[i]);
+    EXPECT_EQ(latency_node.input_size(), 2);
+    EXPECT_EQ(latency_node.input(0), dataset_nodes[i].name());
+    EXPECT_TRUE(
+        AreAttrValuesEqual(latency_node.attr().at("output_shapes"),
+                           dataset_nodes[i].attr().at("output_shapes")));
+    if (dataset_nodes[i].attr().find("output_types") !=
+        dataset_nodes[i].attr().end()) {
+      EXPECT_TRUE(
+          AreAttrValuesEqual(latency_node.attr().at("output_types"),
+                             dataset_nodes[i].attr().at("output_types")));
+    } else {
+      if (dataset_nodes[i].attr().find("Toutput_types") !=
+          dataset_nodes[i].attr().end()) {
+        EXPECT_TRUE(
+            AreAttrValuesEqual(latency_node.attr().at("output_types"),
+                               dataset_nodes[i].attr().at("Toutput_types")));
+      }
+    }
+  }
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.cc b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.cc
index eac665bd92..3ce238a30a 100644
--- a/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.cc
+++ b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion.cc
@@ -18,8 +18,8 @@ limitations under the License.
 #include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/grappler/clusters/cluster.h"
-#include "tensorflow/core/grappler/graph_view.h"
 #include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/mutable_graph_view.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
 #include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
@@ -32,12 +32,70 @@ namespace {
 
 constexpr char kFusedOpName[] = "MapAndBatchDatasetV2";
 
+NodeDef make_map_and_batch_node(const NodeDef& map_node,
+                                const NodeDef& batch_node,
+                                MutableGraphView* graph) {
+  NodeDef new_node;
+  new_node.set_op(kFusedOpName);
+  graph_utils::SetUniqueGraphNodeName(kFusedOpName, graph->GetGraph(),
+                                      &new_node);
+
+  // Set the `input` input argument.
+  new_node.add_input(map_node.input(0));
+
+  // Set the `other_arguments` input arguments.
+  int num_other_args;
+  if (map_node.op() == "ParallelMapDataset") {
+    num_other_args = map_node.input_size() - 2;
+  } else {
+    num_other_args = map_node.input_size() - 1;
+  }
+  for (int i = 0; i < num_other_args; i++) {
+    new_node.add_input(map_node.input(i + 1));
+  }
+
+  // Set the `batch_size` input argument.
+  new_node.add_input(batch_node.input(1));
+
+  // Set the `num_parallel_calls` input argument.
+  if (map_node.op() == "ParallelMapDataset") {
+    // The type of the `num_parallel_calls` argument in ParallelMapDataset
+    // and MapAndBatchDataset is different (int32 and int64 respectively)
+    // so we cannot reuse the same Const node and thus create a new one.
+    NodeDef* v = graph->GetNode(map_node.input(map_node.input_size() - 1));
+    NodeDef* tmp = graph_utils::AddScalarConstNode<int64>(
+        v->attr().at("value").tensor().int_val(0), graph);
+    new_node.add_input(tmp->name());
+  } else {
+    NodeDef* tmp = graph_utils::AddScalarConstNode<int64>(1, graph);
+    new_node.add_input(tmp->name());
+  }
+
+  // Set the `drop_remainder` input argument.
+  if (batch_node.op() == "BatchDatasetV2") {
+    new_node.add_input(batch_node.input(2));
+  } else {
+    NodeDef* tmp = graph_utils::AddScalarConstNode<bool>(false, graph);
+    new_node.add_input(tmp->name());
+  }
+
+  // Set `f` and `Targuments` attributes.
+  for (auto key : {"f", "Targuments"}) {
+    (*new_node.mutable_attr())[key] = map_node.attr().at(key);
+  }
+  // Set `output_types` and `output_shapes` attributes.
+  for (auto key : {"output_shapes", "output_types"}) {
+    (*new_node.mutable_attr())[key] = batch_node.attr().at(key);
+  }
+  return new_node;
+}
+
 }  // namespace
 
 Status MapAndBatchFusion::Optimize(Cluster* cluster, const GrapplerItem& item,
                                    GraphDef* output) {
   *output = item.graph;
-  GraphView graph(output);
+  MutableGraphView graph(output);
   std::set<string> nodes_to_delete;
   for (const NodeDef& node : item.graph.node()) {
     if (node.op() != "BatchDataset" && node.op() != "BatchDatasetV2") {
@@ -45,79 +103,25 @@ Status MapAndBatchFusion::Optimize(Cluster* cluster, const GrapplerItem& item,
     }
 
     // Use a more descriptive variable name now that we know the node type.
-    const NodeDef batch_node(node);
+    const NodeDef& batch_node = node;
     GraphView::InputPort input_port = graph.GetInputPort(batch_node.name(), 0);
     NodeDef* node2 = graph.GetRegularFanin(input_port).node;
     if (node2->op() != "MapDataset" && node2->op() != "ParallelMapDataset") {
       continue;
     }
-
-    NodeDef* new_node = output->add_node();
-    new_node->set_op(kFusedOpName);
-    graph_utils::SetUniqueName(kFusedOpName, output, new_node);
-
     // Use a more descriptive variable name now that we know the node type.
     NodeDef* map_node = node2;
-    // Set the `input` input argument.
-    new_node->add_input(map_node->input(0));
-
-    // Set the `other_arguments` input arguments.
-    int num_other_args;
-    if (map_node->op() == "ParallelMapDataset") {
-      num_other_args = map_node->input_size() - 2;
-    } else {
-      num_other_args = map_node->input_size() - 1;
-    }
-    for (int i = 0; i < num_other_args; i++) {
-      new_node->add_input(map_node->input(i + 1));
-    }
 
-    // Set the `batch_size` input argument.
-    new_node->add_input(batch_node.input(1));
-
-    // Set the `num_parallel_calls` input argument.
-    if (map_node->op() == "ParallelMapDataset") {
-      // The type of the `num_parallel_calls` argument in ParallelMapDataset
-      // and MapAndBatchDataset is different (int32 and int64 respectively)
-      // so we cannot reuse the same Const node and thus create a new one.
-      NodeDef* v = graph.GetNode(map_node->input(map_node->input_size() - 1));
-      NodeDef* tmp;
-      TF_RETURN_IF_ERROR(graph_utils::AddScalarConstNode<int64>(
-          v->attr().at("value").tensor().int_val(0), output, &tmp));
-      new_node->add_input(tmp->name());
-    } else {
-      NodeDef* tmp;
-      TF_RETURN_IF_ERROR(
-          graph_utils::AddScalarConstNode<int64>(1, output, &tmp));
-      new_node->add_input(tmp->name());
-    }
-
-    // Set the `drop_remainder` input argument.
-    if (batch_node.op() == "BatchDatasetV2") {
-      new_node->add_input(batch_node.input(2));
-    } else {
-      NodeDef* tmp;
-      TF_RETURN_IF_ERROR(
-          graph_utils::AddScalarConstNode<bool>(false, output, &tmp));
-      new_node->add_input(tmp->name());
-    }
-
-    // Set `f` and `Targuments` attributes.
-    for (auto key : {"f", "Targuments"}) {
-      (*new_node->mutable_attr())[key] = map_node->attr().at(key);
-    }
-    // Set `output_types` and `output_shapes` attributes.
-    for (auto key : {"output_shapes", "output_types"}) {
-      (*new_node->mutable_attr())[key] = batch_node.attr().at(key);
-    }
+    auto* new_node =
+        graph.AddNode(make_map_and_batch_node(*map_node, batch_node, &graph));
+    graph.ReplaceInput(batch_node, *new_node);
 
     // Mark the `Map` and `Batch` nodes for removal.
     nodes_to_delete.insert(map_node->name());
     nodes_to_delete.insert(batch_node.name());
-
-    graph_utils::ReplaceInput(batch_node, *new_node, &graph);
   }
-  TF_RETURN_IF_ERROR(graph_utils::DeleteNodes(nodes_to_delete, output));
+
+  graph.DeleteNodes(nodes_to_delete);
   return Status::OK();
 }
 
diff --git a/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion_test.cc b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion_test.cc
index 3c1d8d5359..a46c504ac4 100644
--- a/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion_test.cc
+++ b/tensorflow/core/grappler/optimizers/data/map_and_batch_fusion_test.cc
@@ -27,25 +27,21 @@ namespace {
 
 TEST(MapAndBatchFusionTest, FuseMapAndBatchNodesIntoOne) {
   GrapplerItem item;
-  GraphDef *graph = &item.graph;
-  NodeDef *start_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
-  NodeDef *stop_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
-  NodeDef *step_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+  MutableGraphView graph(&item.graph);
+
+  NodeDef *start_node = graph_utils::AddScalarConstNode<int64>(0, &graph);
+  NodeDef *stop_node = graph_utils::AddScalarConstNode<int64>(10, &graph);
+  NodeDef *step_node = graph_utils::AddScalarConstNode<int64>(1, &graph);
 
   std::vector<string> range_inputs(3);
   range_inputs[0] = start_node->name();
   range_inputs[1] = stop_node->name();
   range_inputs[2] = step_node->name();
   std::vector<std::pair<string, AttrValue>> range_attrs;
-  NodeDef *range_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
-                                    range_attrs, graph, &range_node));
-  NodeDef *captured_input_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<StringPiece>(
-      "hello", graph, &captured_input_node));
+  NodeDef *range_node = graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                             range_attrs, &graph);
+  NodeDef *captured_input_node =
+      graph_utils::AddScalarConstNode<StringPiece>("hello", &graph);
 
   NodeDef *map_node;
   {
@@ -59,13 +55,11 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchNodesIntoOne) {
     AttrValue args_attr;
     SetAttrValue("Targuments", &args_attr);
     map_attrs[1] = std::make_pair("Targuments", args_attr);
-    TF_ASSERT_OK(graph_utils::AddNode("", "MapDataset", map_inputs, map_attrs,
-                                      graph, &map_node));
+    map_node =
+        graph_utils::AddNode("", "MapDataset", map_inputs, map_attrs, &graph);
   }
 
-  NodeDef *batch_size_node;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
+  NodeDef *batch_size_node = graph_utils::AddScalarConstNode<int64>(5, &graph);
   NodeDef *batch_node;
   {
     std::vector<string> batch_inputs(2);
@@ -78,16 +72,18 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchNodesIntoOne) {
     AttrValue types_attr;
     SetAttrValue("output_types", &types_attr);
     batch_attrs[1] = std::make_pair("output_types", types_attr);
-    TF_ASSERT_OK(graph_utils::AddNode("", "BatchDataset", batch_inputs,
-                                      batch_attrs, graph, &batch_node));
+    batch_node = graph_utils::AddNode("", "BatchDataset", batch_inputs,
+                                      batch_attrs, &graph);
   }
 
   MapAndBatchFusion optimizer;
   GraphDef output;
   TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
 
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(map_node->name(), output));
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(batch_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(map_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(batch_node->name(), output));
   EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapAndBatchDatasetV2", output));
   NodeDef map_and_batch_node =
       output.node(graph_utils::FindNodeWithOp("MapAndBatchDatasetV2", output));
@@ -96,11 +92,11 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchNodesIntoOne) {
   EXPECT_EQ(map_and_batch_node.input(1), map_node->input(1));
   EXPECT_EQ(map_and_batch_node.input(2), batch_node->input(1));
   NodeDef num_parallel_calls_node = output.node(
-      graph_utils::FindNodeWithName(map_and_batch_node.input(3), output));
+      graph_utils::FindGraphNodeWithName(map_and_batch_node.input(3), output));
   EXPECT_EQ(num_parallel_calls_node.attr().at("value").tensor().int64_val(0),
             1);
   NodeDef drop_remainder_node = output.node(
-      graph_utils::FindNodeWithName(map_and_batch_node.input(4), output));
+      graph_utils::FindGraphNodeWithName(map_and_batch_node.input(4), output));
   EXPECT_EQ(drop_remainder_node.attr().at("value").tensor().bool_val(0), false);
   EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("f"),
                                  map_node->attr().at("f")));
@@ -114,25 +110,20 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchNodesIntoOne) {
 
 TEST(MapAndBatchFusionTest, FuseMapAndBatchV2NodesIntoOne) {
   GrapplerItem item;
-  GraphDef *graph = &item.graph;
-  NodeDef *start_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
-  NodeDef *stop_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
-  NodeDef *step_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+  MutableGraphView graph(&item.graph);
+  NodeDef *start_node = graph_utils::AddScalarConstNode<int64>(0, &graph);
+  NodeDef *stop_node = graph_utils::AddScalarConstNode<int64>(10, &graph);
+  NodeDef *step_node = graph_utils::AddScalarConstNode<int64>(1, &graph);
 
   std::vector<string> range_inputs(3);
   range_inputs[0] = start_node->name();
   range_inputs[1] = stop_node->name();
   range_inputs[2] = step_node->name();
   std::vector<std::pair<string, AttrValue>> range_attrs;
-  NodeDef *range_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
-                                    range_attrs, graph, &range_node));
-  NodeDef *captured_input_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<StringPiece>(
-      "hello", graph, &captured_input_node));
+  NodeDef *range_node = graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                             range_attrs, &graph);
+  NodeDef *captured_input_node =
+      graph_utils::AddScalarConstNode<StringPiece>("hello", &graph);
 
   NodeDef *map_node;
   {
@@ -146,16 +137,13 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchV2NodesIntoOne) {
     AttrValue args_attr;
     SetAttrValue("Targuments", &args_attr);
     map_attrs[1] = std::make_pair("Targuments", args_attr);
-    TF_ASSERT_OK(graph_utils::AddNode("", "MapDataset", map_inputs, map_attrs,
-                                      graph, &map_node));
+    map_node =
+        graph_utils::AddNode("", "MapDataset", map_inputs, map_attrs, &graph);
   }
 
-  NodeDef *batch_size_node;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
-  NodeDef *drop_remainder_node;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<bool>(true, graph, &drop_remainder_node));
+  NodeDef *batch_size_node = graph_utils::AddScalarConstNode<int64>(5, &graph);
+  NodeDef *drop_remainder_node =
+      graph_utils::AddScalarConstNode<bool>(true, &graph);
   NodeDef *batch_node;
   {
     std::vector<string> batch_inputs(3);
@@ -169,16 +157,18 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchV2NodesIntoOne) {
     AttrValue types_attr;
     SetAttrValue("output_types", &types_attr);
     batch_attrs[1] = std::make_pair("output_types", types_attr);
-    TF_ASSERT_OK(graph_utils::AddNode("", "BatchDatasetV2", batch_inputs,
-                                      batch_attrs, graph, &batch_node));
+    batch_node = graph_utils::AddNode("", "BatchDatasetV2", batch_inputs,
+                                      batch_attrs, &graph);
   }
 
   MapAndBatchFusion optimizer;
   GraphDef output;
   TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
 
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(map_node->name(), output));
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(batch_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(map_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(batch_node->name(), output));
   EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapAndBatchDatasetV2", output));
   NodeDef map_and_batch_node =
       output.node(graph_utils::FindNodeWithOp("MapAndBatchDatasetV2", output));
@@ -187,7 +177,7 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchV2NodesIntoOne) {
   EXPECT_EQ(map_and_batch_node.input(1), map_node->input(1));
   EXPECT_EQ(map_and_batch_node.input(2), batch_node->input(1));
   NodeDef num_parallel_calls_node = output.node(
-      graph_utils::FindNodeWithName(map_and_batch_node.input(3), output));
+      graph_utils::FindGraphNodeWithName(map_and_batch_node.input(3), output));
   EXPECT_EQ(num_parallel_calls_node.attr().at("value").tensor().int64_val(0),
             1);
   EXPECT_EQ(map_and_batch_node.input(4), batch_node->input(2));
@@ -203,28 +193,22 @@ TEST(MapAndBatchFusionTest, FuseMapAndBatchV2NodesIntoOne) {
 
 TEST(MapAndBatchFusionTest, FuseParallelMapAndBatchNodesIntoOne) {
   GrapplerItem item;
-  GraphDef *graph = &item.graph;
-  NodeDef *start_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
-  NodeDef *stop_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
-  NodeDef *step_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+  MutableGraphView graph(&item.graph);
+  NodeDef *start_node = graph_utils::AddScalarConstNode<int64>(0, &graph);
+  NodeDef *stop_node = graph_utils::AddScalarConstNode<int64>(10, &graph);
+  NodeDef *step_node = graph_utils::AddScalarConstNode<int64>(1, &graph);
 
   std::vector<string> range_inputs(3);
   range_inputs[0] = start_node->name();
   range_inputs[1] = stop_node->name();
   range_inputs[2] = step_node->name();
   std::vector<std::pair<string, AttrValue>> range_attrs;
-  NodeDef *range_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
-                                    range_attrs, graph, &range_node));
-  NodeDef *captured_input_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<StringPiece>(
-      "hello", graph, &captured_input_node));
-  NodeDef *num_parallel_calls_node;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<int>(2, graph, &num_parallel_calls_node));
+  NodeDef *range_node = graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                             range_attrs, &graph);
+  NodeDef *captured_input_node =
+      graph_utils::AddScalarConstNode<StringPiece>("hello", &graph);
+  NodeDef *num_parallel_calls_node =
+      graph_utils::AddScalarConstNode<int>(2, &graph);
 
   NodeDef *map_node;
   {
@@ -239,13 +223,11 @@ TEST(MapAndBatchFusionTest, FuseParallelMapAndBatchNodesIntoOne) {
     AttrValue args_attr;
     SetAttrValue("Targuments", &args_attr);
     map_attrs[1] = std::make_pair("Targuments", args_attr);
-    TF_ASSERT_OK(graph_utils::AddNode("", "ParallelMapDataset", map_inputs,
-                                      map_attrs, graph, &map_node));
+    map_node = graph_utils::AddNode("", "ParallelMapDataset", map_inputs,
+                                    map_attrs, &graph);
   }
 
-  NodeDef *batch_size_node;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
+  NodeDef *batch_size_node = graph_utils::AddScalarConstNode<int64>(5, &graph);
   NodeDef *batch_node;
   {
     std::vector<string> batch_inputs(2);
@@ -258,16 +240,18 @@ TEST(MapAndBatchFusionTest, FuseParallelMapAndBatchNodesIntoOne) {
     AttrValue types_attr;
     SetAttrValue("output_types", &types_attr);
     batch_attrs[1] = std::make_pair("output_types", types_attr);
-    TF_ASSERT_OK(graph_utils::AddNode("", "BatchDataset", batch_inputs,
-                                      batch_attrs, graph, &batch_node));
+    batch_node = graph_utils::AddNode("", "BatchDataset", batch_inputs,
+                                      batch_attrs, &graph);
   }
 
   MapAndBatchFusion optimizer;
   GraphDef output;
   TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
 
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(map_node->name(), output));
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(batch_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(map_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(batch_node->name(), output));
   EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapAndBatchDatasetV2", output));
   NodeDef map_and_batch_node =
       output.node(graph_utils::FindNodeWithOp("MapAndBatchDatasetV2", output));
@@ -276,11 +260,11 @@ TEST(MapAndBatchFusionTest, FuseParallelMapAndBatchNodesIntoOne) {
   EXPECT_EQ(map_and_batch_node.input(1), map_node->input(1));
   EXPECT_EQ(map_and_batch_node.input(2), batch_node->input(1));
   NodeDef num_parallel_calls_node2 = output.node(
-      graph_utils::FindNodeWithName(map_and_batch_node.input(3), output));
+      graph_utils::FindGraphNodeWithName(map_and_batch_node.input(3), output));
   EXPECT_EQ(num_parallel_calls_node2.attr().at("value").tensor().int64_val(0),
             2);
   NodeDef drop_remainder_node = output.node(
-      graph_utils::FindNodeWithName(map_and_batch_node.input(4), output));
+      graph_utils::FindGraphNodeWithName(map_and_batch_node.input(4), output));
   EXPECT_EQ(drop_remainder_node.attr().at("value").tensor().bool_val(0), false);
   EXPECT_TRUE(AreAttrValuesEqual(map_and_batch_node.attr().at("f"),
                                  map_node->attr().at("f")));
@@ -294,27 +278,21 @@ TEST(MapAndBatchFusionTest, FuseParallelMapAndBatchNodesIntoOne) {
 
 TEST(MapAndBatchFusionTest, NoChange) {
   GrapplerItem item;
-  GraphDef *graph = &item.graph;
+  MutableGraphView graph(&item.graph);
 
-  NodeDef *start_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
-  NodeDef *stop_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
-  NodeDef *step_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+  NodeDef *start_node = graph_utils::AddScalarConstNode<int64>(0, &graph);
+  NodeDef *stop_node = graph_utils::AddScalarConstNode<int64>(10, &graph);
+  NodeDef *step_node = graph_utils::AddScalarConstNode<int64>(1, &graph);
 
   std::vector<string> range_inputs(3);
   range_inputs[0] = start_node->name();
   range_inputs[1] = stop_node->name();
   range_inputs[2] = step_node->name();
   std::vector<std::pair<string, AttrValue>> range_attrs;
-  NodeDef *range_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
-                                    range_attrs, graph, &range_node));
+  NodeDef *range_node = graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                             range_attrs, &graph);
 
-  NodeDef *batch_size_node;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<int64>(5, graph, &batch_size_node));
+  NodeDef *batch_size_node = graph_utils::AddScalarConstNode<int64>(5, &graph);
   std::vector<string> batch_inputs(2);
   batch_inputs[0] = range_node->name();
   batch_inputs[1] = batch_size_node->name();
@@ -325,15 +303,13 @@ TEST(MapAndBatchFusionTest, NoChange) {
   AttrValue types_attr;
   SetAttrValue("output_types", &types_attr);
   batch_attrs[1] = std::make_pair("output_types", types_attr);
-  NodeDef *batch_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "BatchDataset", batch_inputs,
-                                    batch_attrs, graph, &batch_node));
+  graph_utils::AddNode("", "BatchDataset", batch_inputs, batch_attrs, &graph);
 
   MapAndBatchFusion optimizer;
   GraphDef output;
   TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
 
-  EXPECT_TRUE(graph_utils::Compare(*graph, output));
+  EXPECT_TRUE(graph_utils::Compare(*graph.GetGraph(), output));
 }
 
 }  // namespace
diff --git a/tensorflow/core/grappler/optimizers/data/map_fusion.cc b/tensorflow/core/grappler/optimizers/data/map_fusion.cc
new file mode 100644
index 0000000000..707f4a3407
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/map_fusion.cc
@@ -0,0 +1,261 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/map_fusion.h"
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/grappler/clusters/cluster.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/mutable_graph_view.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/topological_sort.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+// Sets basic function parameters and copies attributes from parent and map
+// node.
+NodeDef MakeFusedNode(const NodeDef& parent_map_node, const NodeDef& map_node,
+                      const FunctionDef& fused_function,
+                      MutableGraphView* graph) {
+  NodeDef fused_node;
+  graph_utils::SetUniqueGraphNodeName("fused_map", graph->GetGraph(),
+                                      &fused_node);
+
+  fused_node.set_op("MapDataset");
+  fused_node.add_input(parent_map_node.input(0));
+
+  auto copy_attribute = [](const string& attribute_name, const NodeDef& from,
+                           NodeDef* to) {
+    (*to->mutable_attr())[attribute_name] = from.attr().at(attribute_name);
+  };
+
+  auto attr = parent_map_node.attr().at("f");
+  *attr.mutable_func()->mutable_name() = fused_function.signature().name();
+  (*fused_node.mutable_attr())["f"] = std::move(attr);
+
+  copy_attribute("Targuments", parent_map_node, &fused_node);
+
+  for (auto key : {"output_shapes", "output_types"})
+    copy_attribute(key, map_node, &fused_node);
+
+  return fused_node;
+}
+
+string ParseNodeConnection(const string& name) {
+  // If input/output node name has semicolon, take the prefix.  Otherwise take
+  // the whole string.
+  return name.substr(0, name.find(':'));
+}
+
+string ParseOutputNode(const string& name) {
+  return name.substr(name.find(':'), string::npos);
+}
+
+const string& GetOutputNode(const FunctionDef& parent_function,
+                            int output_idx) {
+  const auto& ret_output_name =
+      parent_function.signature().output_arg(output_idx).name();
+  return parent_function.ret().at(ret_output_name);
+}
+
+// Nodes that will be added to the function can have the same name as the nodes
+// from parent function.  We need to rename them and the connections of the
+// inputs that refer to them.
+void RenameFunctionNodes(FunctionDef* fused_function,
+                         protobuf::RepeatedPtrField<NodeDef>* nodes_to_fuse) {
+  std::unordered_map<string, string> changed_node_names;
+  for (NodeDef& function_node : *nodes_to_fuse) {
+    string name_before = function_node.name();
+    graph_utils::SetUniqueFunctionNodeName(name_before, fused_function,
+                                           &function_node);
+    if (name_before != function_node.name())
+      changed_node_names[name_before] = function_node.name();
+  }
+
+  auto update_name = [&changed_node_names](string* input) {
+    string input_node = ParseNodeConnection(*input);
+    if (changed_node_names.count(input_node) == 0) return;
+    const string& new_node_name = changed_node_names.at(input_node);
+    *input = new_node_name + ParseOutputNode(*input);
+  };
+
+  for (NodeDef& function_node : *nodes_to_fuse) {
+    for (string& input : *function_node.mutable_input()) {
+      update_name(&input);
+    }
+  }
+
+  for (auto& ret : *fused_function->mutable_ret()) update_name(&ret.second);
+}
+
+// This function adds new nodes and changes their input to the output nodes
+// of parent function.
+void FuseFunctionNodes(const FunctionDef& parent_function,
+                       const FunctionDef& function,
+                       protobuf::RepeatedPtrField<NodeDef>* nodes_to_fuse) {
+  const auto number_of_outputs = parent_function.signature().output_arg_size();
+  CHECK(number_of_outputs == function.signature().input_arg_size())
+      << "The number of input arguments of function "
+      << function.signature().name()
+      << " should be the same as the number of output arguments of function "
+      << parent_function.signature().name() << ".";
+
+  for (int output_idx = 0; output_idx < number_of_outputs; output_idx++) {
+    const string& output = GetOutputNode(parent_function, output_idx);
+
+    const auto& input_node_name =
+        function.signature().input_arg(output_idx).name();
+
+    for (NodeDef& function_node : *nodes_to_fuse) {
+      for (auto& node_input : *function_node.mutable_input()) {
+        auto parsed_name = ParseNodeConnection(node_input);
+        if (parsed_name != input_node_name) continue;
+
+        node_input = output;
+      }
+    }
+  }
+}
+
+// This function looks for direct edges from input to return and rewrites
+// them to the coresponding input of the return of parent_function.
+void FuseReturns(const FunctionDef& parent_function,
+                 const FunctionDef& function, FunctionDef* fused_function) {
+  const auto number_of_inputs = function.signature().input_arg_size();
+
+  for (auto& ret : *fused_function->mutable_ret()) {
+    auto return_input = ParseNodeConnection(ret.second);
+    for (int input_idx = 0; input_idx < number_of_inputs; input_idx++) {
+      const auto& input_arg = function.signature().input_arg(input_idx);
+      if (return_input != input_arg.name()) continue;
+
+      ret.second = GetOutputNode(parent_function, input_idx);
+    }
+  }
+}
+
+// This function produces new function that is a result of fusion of
+// `parent_function` with `function`.
+FunctionDef* FuseFunctions(const FunctionDef& parent_function,
+                           const FunctionDef& function,
+                           FunctionDefLibrary* library) {
+  FunctionDef* fused_function = library->add_function();
+  graph_utils::SetUniqueGraphFunctionName("fused_function", library,
+                                          fused_function);
+
+  // Copy input signature from parent function.
+  *fused_function->mutable_signature()->mutable_input_arg() =
+      parent_function.signature().input_arg();
+
+  fused_function->mutable_node_def()->CopyFrom(parent_function.node_def());
+  // This code assumes functions does not have any attributes. If this is
+  // not the case, we need to merge attributes and fix name conflicts.
+  CHECK(parent_function.attr_size() == 0 && function.attr_size() == 0 &&
+        "Functions with attributes are currently not supported");
+
+  // Copy the returns and output signature from the second node.
+  auto nodes_to_fuse = function.node_def();
+  fused_function->mutable_signature()->mutable_output_arg()->CopyFrom(
+      function.signature().output_arg());
+  *fused_function->mutable_ret() = function.ret();
+
+  RenameFunctionNodes(fused_function, &nodes_to_fuse);
+  FuseFunctionNodes(parent_function, function, &nodes_to_fuse);
+  FuseReturns(parent_function, function, fused_function);
+
+  // Copy transformed nodes from the second function.
+  fused_function->mutable_node_def()->MergeFrom(nodes_to_fuse);
+
+  return fused_function;
+}
+
+}  // namespace
+
+Status MapFusion::Optimize(Cluster* cluster, const GrapplerItem& item,
+                           GraphDef* output) {
+  GraphDef sorted_old_graph = item.graph;
+  TF_RETURN_IF_ERROR(TopologicalSort(&sorted_old_graph));
+  *output = sorted_old_graph;
+
+  MutableGraphView graph(output);
+  std::set<string> nodes_to_delete;
+  FunctionLibraryDefinition function_library(OpRegistry::Global(),
+                                             item.graph.library());
+
+  auto get_map_node = [](const NodeDef& node) -> const NodeDef* {
+    // TODO(prazek): we could also handle ParallelMapDataset and
+    // MapAndBatchDataset.
+    if (node.op() == "MapDataset") return &node;
+    return nullptr;
+  };
+
+  auto get_fused_function = [&function_library, &output](
+                                const NodeDef* parent_map_node,
+                                const NodeDef* map_node) {
+    const auto& parent_fun = parent_map_node->attr().at("f");
+    const FunctionDef* parent_func =
+        function_library.Find(parent_fun.func().name());
+    const auto& fun = map_node->attr().at("f");
+    const FunctionDef* func = function_library.Find(fun.func().name());
+
+    return FuseFunctions(*parent_func, *func, output->mutable_library());
+  };
+
+  for (const NodeDef& node : sorted_old_graph.node()) {
+    const NodeDef* map_node = get_map_node(node);
+    if (!map_node) continue;
+
+    GraphView::InputPort input_port = graph.GetInputPort(map_node->name(), 0);
+    const NodeDef* parent_map_node =
+        get_map_node(*graph.GetRegularFanin(input_port).node);
+    if (!parent_map_node) continue;
+
+    const auto* fused_function = get_fused_function(parent_map_node, map_node);
+    const auto* fused_maps_node = graph.AddNode(
+        MakeFusedNode(*parent_map_node, *map_node, *fused_function, &graph));
+
+    graph.ReplaceInput(*map_node, *fused_maps_node);
+
+    // TODO(prazek): we should run some optimizations on the fused map
+    // functions, or make sure that optimization passes run after map
+    // fusion.
+    TF_RETURN_IF_ERROR(function_library.AddFunctionDef(*fused_function));
+
+    // TODO(prazek): we could also remove map functions from library if they
+    // are not used anymore.
+    nodes_to_delete.insert(parent_map_node->name());
+    nodes_to_delete.insert(map_node->name());
+  }
+
+  graph.DeleteNodes(nodes_to_delete);
+  return Status::OK();
+}
+
+void MapFusion::Feedback(Cluster* cluster, const GrapplerItem& item,
+                         const GraphDef& optimize_output, double result) {
+  // no-op
+}
+
+REGISTER_GRAPH_OPTIMIZER_AS(MapFusion, "map_fusion");
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/map_fusion.h b/tensorflow/core/grappler/optimizers/data/map_fusion.h
new file mode 100644
index 0000000000..a6a06592b8
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/map_fusion.h
@@ -0,0 +1,47 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_MAP_FUSION_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_MAP_FUSION_H_
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+
+namespace tensorflow {
+namespace grappler {
+
+// This optimization fuses map transformations by merging their map functions.
+class MapFusion : public CustomGraphOptimizer {
+ public:
+  MapFusion() = default;
+  ~MapFusion() override = default;
+
+  string name() const override { return "map_fusion"; };
+
+  Status Init(
+      const tensorflow::RewriterConfig_CustomGraphOptimizer* config) override {
+    return Status::OK();
+  }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* output) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimize_output, double result) override;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_MAP_FUSION_H_
diff --git a/tensorflow/core/grappler/optimizers/data/map_fusion_test.cc b/tensorflow/core/grappler/optimizers/data/map_fusion_test.cc
new file mode 100644
index 0000000000..df6c19dc7c
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/map_fusion_test.cc
@@ -0,0 +1,90 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/map_fusion.h"
+
+#include "tensorflow/core/framework/attr_value_util.h"
+#include "tensorflow/core/framework/function_testlib.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+NodeDef MakeMapNode(StringPiece name, StringPiece input_node_name) {
+  return test::function::NDef(
+      name, "MapDataset", {input_node_name.ToString()},
+      {{"f", FunctionDefHelper::FunctionRef("XTimesTwo")},
+       {"Targuments", {}},
+       {"output_shapes", {}},
+       {"output_types", {}}});
+}
+
+TEST(MapFusionTest, FuseTwoMapNodesIntoOne) {
+  using test::function::NDef;
+  GrapplerItem item;
+  item.graph = test::function::GDef(
+      {NDef("start", "Const", {}, {{"value", 0}, {"dtype", DT_INT32}}),
+       NDef("stop", "Const", {}, {{"value", 10}, {"dtype", DT_INT32}}),
+       NDef("step", "Const", {}, {{"value", 1}, {"dtype", DT_INT32}}),
+       NDef("range", "RangeDataset", {"start", "stop", "step"}, {}),
+       MakeMapNode("map1", "range"), MakeMapNode("map2", "map1")},
+      // FunctionLib
+      {
+          test::function::XTimesTwo(),
+      });
+
+  MapFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapDataset", output));
+  EXPECT_FALSE(graph_utils::ContainsGraphNodeWithName("map1", output));
+  EXPECT_FALSE(graph_utils::ContainsGraphNodeWithName("map2", output));
+}
+
+TEST(MapFusionTest, FuseThreeNodesIntoOne) {
+  using test::function::NDef;
+  GrapplerItem item;
+  item.graph = test::function::GDef(
+      {NDef("start", "Const", {}, {{"value", 0}, {"dtype", DT_INT32}}),
+       NDef("stop", "Const", {}, {{"value", 10}, {"dtype", DT_INT32}}),
+       NDef("step", "Const", {}, {{"value", 1}, {"dtype", DT_INT32}}),
+       NDef("filename", "Const", {}, {{"value", ""}, {"dtype", DT_STRING}}),
+       NDef("range", "RangeDataset", {"start", "stop", "step"}, {}),
+       MakeMapNode("map1", "range"), MakeMapNode("map2", "map1"),
+       MakeMapNode("map3", "map2"),
+       NDef("cache", "CacheDataset", {"map3", "filename"}, {})},
+      // FunctionLib
+      {
+          test::function::XTimesTwo(),
+      });
+
+  MapFusion optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+  EXPECT_TRUE(graph_utils::ContainsNodeWithOp("MapDataset", output));
+  EXPECT_FALSE(graph_utils::ContainsGraphNodeWithName("map1", output));
+  EXPECT_FALSE(graph_utils::ContainsGraphNodeWithName("map2", output));
+  EXPECT_FALSE(graph_utils::ContainsGraphNodeWithName("map3", output));
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/noop_elimination.cc b/tensorflow/core/grappler/optimizers/data/noop_elimination.cc
new file mode 100644
index 0000000000..55d57b3b97
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/noop_elimination.cc
@@ -0,0 +1,91 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/noop_elimination.h"
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/grappler/clusters/cluster.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/mutable_graph_view.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+bool IsTakeAll(const NodeDef& take_node, const GraphView& graph) {
+  if (take_node.op() != "TakeDataset") return false;
+
+  const NodeDef& count_node = *graph.GetNode(take_node.input(1));
+  // We are looking only for 'take' with negative count.
+  return count_node.attr().at("value").tensor().int64_val(0) < 0;
+}
+
+bool IsSkipNone(const NodeDef& skip_node, const GraphView& graph) {
+  if (skip_node.op() != "SkipDataset") return false;
+
+  const NodeDef& count_node = *graph.GetNode(skip_node.input(1));
+  // We are looking only for skip(0) nodes.
+  return count_node.attr().at("value").tensor().int64_val(0) == 0;
+}
+
+bool IsRepeatOne(const NodeDef& repeat_node, const GraphView& graph) {
+  if (repeat_node.op() != "RepeatDataset") return false;
+
+  const NodeDef& count_node = *graph.GetNode(repeat_node.input(1));
+  // We are looking only for repeat(1) nodes.
+  return count_node.attr().at("value").tensor().int64_val(0) == 1;
+}
+
+bool IsNoOp(const NodeDef& node, const GraphView& graph) {
+  return IsTakeAll(node, graph) || IsSkipNone(node, graph) ||
+         IsRepeatOne(node, graph);
+}
+
+}  // namespace
+
+Status NoOpElimination::Optimize(Cluster* cluster, const GrapplerItem& item,
+                                 GraphDef* output) {
+  *output = item.graph;
+  MutableGraphView graph(output);
+  std::set<string> nodes_to_delete;
+  for (const NodeDef& node : item.graph.node()) {
+    if (!IsNoOp(node, graph)) continue;
+
+    GraphView::InputPort input_port = graph.GetInputPort(node.name(), 0);
+    NodeDef* const parent = graph.GetRegularFanin(input_port).node;
+    graph.ReplaceInput(node, *parent);
+
+    nodes_to_delete.insert(node.name());
+  }
+
+  graph.DeleteNodes(nodes_to_delete);
+  return Status::OK();
+}
+
+void NoOpElimination::Feedback(Cluster* cluster, const GrapplerItem& item,
+                               const GraphDef& optimize_output, double result) {
+  // no-op
+}
+
+REGISTER_GRAPH_OPTIMIZER_AS(NoOpElimination, "noop_elimination");
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/noop_elimination.h b/tensorflow/core/grappler/optimizers/data/noop_elimination.h
new file mode 100644
index 0000000000..c67cea49d5
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/noop_elimination.h
@@ -0,0 +1,48 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_NOOP_ELIMINATION_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_NOOP_ELIMINATION_H_
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+
+namespace tensorflow {
+namespace grappler {
+
+// This class eliminates tf.data transformations such as `take(n)` (for n < 0),
+// `skip(0)`, or `repeat(1)`
+class NoOpElimination : public CustomGraphOptimizer {
+ public:
+  NoOpElimination() = default;
+  ~NoOpElimination() override = default;
+
+  string name() const override { return "noop_elimination"; };
+
+  Status Init(
+      const tensorflow::RewriterConfig_CustomGraphOptimizer* config) override {
+    return Status::OK();
+  }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* output) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimize_output, double result) override;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_DATA_NOOP_ELIMINATION_H_
diff --git a/tensorflow/core/grappler/optimizers/data/noop_elimination_test.cc b/tensorflow/core/grappler/optimizers/data/noop_elimination_test.cc
new file mode 100644
index 0000000000..a6cc63edba
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/data/noop_elimination_test.cc
@@ -0,0 +1,210 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/data/noop_elimination.h"
+#include <tuple>
+#include "tensorflow/core/framework/attr_value_util.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+std::vector<std::pair<string, AttrValue>> GetCommonAttributes() {
+  AttrValue shapes_attr, types_attr;
+  SetAttrValue("output_shapes", &shapes_attr);
+  SetAttrValue("output_types", &types_attr);
+  std::vector<std::pair<string, AttrValue>> commonAttributes = {
+      {"output_shapes", shapes_attr}, {"output_types", types_attr}};
+
+  return commonAttributes;
+}
+
+NodeDef *MakeUnaryNode(const std::string &node_type, int count,
+                       string input_node, MutableGraphView *graph) {
+  NodeDef *node_count = graph_utils::AddScalarConstNode<int64>(count, graph);
+  return graph_utils::AddNode("", node_type,
+                              {std::move(input_node), node_count->name()},
+                              GetCommonAttributes(), graph);
+}
+
+NodeDef *MakeCacheNode(string input_node, MutableGraphView *graph) {
+  NodeDef *node_filename =
+      graph_utils::AddScalarConstNode<StringPiece>("", graph);
+  return graph_utils::AddNode("", "CacheDataset",
+                              {std::move(input_node), node_filename->name()},
+                              GetCommonAttributes(), graph);
+}
+
+NodeDef *MakeRangeNode(MutableGraphView *graph) {
+  auto *start_node = graph_utils::AddScalarConstNode<int64>(0, graph);
+  auto *stop_node = graph_utils::AddScalarConstNode<int64>(10, graph);
+  auto *step_node = graph_utils::AddScalarConstNode<int64>(1, graph);
+
+  std::vector<string> range_inputs = {start_node->name(), stop_node->name(),
+                                      step_node->name()};
+
+  return graph_utils::AddNode("", "RangeDataset", range_inputs,
+                              GetCommonAttributes(), graph);
+}
+
+struct NoOpLastEliminationTest
+    : ::testing::TestWithParam<std::tuple<std::string, int, bool>> {};
+
+// This test checks whether the no-op elimination correctly handles
+// transformations at the end of the pipeline.
+TEST_P(NoOpLastEliminationTest, EliminateLastNoOpNode) {
+  GrapplerItem item;
+  MutableGraphView graph(&item.graph);
+
+  const std::string &node_type = std::get<0>(GetParam());
+  const int node_count = std::get<1>(GetParam());
+  const bool should_keep_node = std::get<2>(GetParam());
+
+  NodeDef *range_node = MakeRangeNode(&graph);
+
+  NodeDef *node =
+      MakeUnaryNode(node_type, node_count, range_node->name(), &graph);
+
+  NoOpElimination optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_EQ(graph_utils::ContainsGraphNodeWithName(node->name(), output),
+            should_keep_node);
+}
+
+INSTANTIATE_TEST_CASE_P(
+    BasicRemovalTest, NoOpLastEliminationTest,
+    ::testing::Values(std::make_tuple("TakeDataset", -3, false),
+                      std::make_tuple("TakeDataset", -1, false),
+                      std::make_tuple("TakeDataset", 0, true),
+                      std::make_tuple("TakeDataset", 3, true),
+                      std::make_tuple("SkipDataset", -1, true),
+                      std::make_tuple("SkipDataset", 0, false),
+                      std::make_tuple("SkipDataset", 3, true),
+                      std::make_tuple("RepeatDataset", 1, false),
+                      std::make_tuple("RepeatDataset", 2, true)));
+
+struct NoOpMiddleEliminationTest
+    : ::testing::TestWithParam<std::tuple<std::string, int, bool>> {};
+
+// This test checks whether the no-op elimination correctly handles
+// transformations int the middle of the pipeline.
+TEST_P(NoOpMiddleEliminationTest, EliminateMiddleNoOpNode) {
+  GrapplerItem item;
+  MutableGraphView graph(&item.graph);
+
+  const std::string &node_type = std::get<0>(GetParam());
+  const int node_count = std::get<1>(GetParam());
+  const bool should_keep_node = std::get<2>(GetParam());
+
+  NodeDef *range_node = MakeRangeNode(&graph);
+
+  NodeDef *node =
+      MakeUnaryNode(node_type, node_count, range_node->name(), &graph);
+
+  NodeDef *cache_node = MakeCacheNode(node->name(), &graph);
+  NoOpElimination optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  EXPECT_EQ(graph_utils::ContainsGraphNodeWithName(node->name(), output),
+            should_keep_node);
+  EXPECT_TRUE(
+      graph_utils::ContainsGraphNodeWithName(cache_node->name(), output));
+
+  NodeDef cache_node_out = output.node(
+      graph_utils::FindGraphNodeWithName(cache_node->name(), output));
+
+  EXPECT_EQ(cache_node_out.input_size(), 2);
+  auto last_node_input = (should_keep_node ? node : range_node)->name();
+  EXPECT_EQ(cache_node_out.input(0), last_node_input);
+}
+
+INSTANTIATE_TEST_CASE_P(
+    BasicRemovalTest, NoOpMiddleEliminationTest,
+    ::testing::Values(std::make_tuple("TakeDataset", -1, false),
+                      std::make_tuple("TakeDataset", -3, false),
+                      std::make_tuple("TakeDataset", 0, true),
+                      std::make_tuple("TakeDataset", 3, true),
+                      std::make_tuple("SkipDataset", -1, true),
+                      std::make_tuple("SkipDataset", 0, false),
+                      std::make_tuple("SkipDataset", 3, true),
+                      std::make_tuple("RepeatDataset", 1, false),
+                      std::make_tuple("RepeatDataset", 2, true)));
+
+using NodesTypes = std::tuple<std::pair<string, int>, std::pair<string, int>>;
+struct NoOpMultipleEliminationTest : ::testing::TestWithParam<NodesTypes> {};
+
+// This test checks whether the no-op elimination correctly removes
+// multiple noop nodes.
+TEST_P(NoOpMultipleEliminationTest, EliminateMultipleNoOpNode) {
+  GrapplerItem item;
+  MutableGraphView graph(&item.graph);
+
+  static_assert(std::tuple_size<NodesTypes>::value == 2,
+                "Make sure to include everything in the test");
+  const std::vector<std::pair<string, int>> noop_nodes = {
+      std::get<0>(GetParam()), std::get<1>(GetParam())};
+
+  NodeDef *range_node = MakeRangeNode(&graph);
+
+  NodeDef *previous = range_node;
+  std::vector<string> nodes_to_remove;
+  nodes_to_remove.reserve(noop_nodes.size());
+
+  for (const auto &noop_node : noop_nodes) {
+    NodeDef *node = MakeUnaryNode(noop_node.first, noop_node.second,
+                                  previous->name(), &graph);
+    nodes_to_remove.push_back(node->name());
+    previous = node;
+  }
+
+  NodeDef *cache_node = MakeCacheNode(previous->name(), &graph);
+  NoOpElimination optimizer;
+  GraphDef output;
+  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
+
+  for (const auto &noop_node_name : nodes_to_remove)
+    EXPECT_FALSE(
+        graph_utils::ContainsGraphNodeWithName(noop_node_name, output));
+
+  EXPECT_TRUE(
+      graph_utils::ContainsGraphNodeWithName(cache_node->name(), output));
+
+  NodeDef cache_node_out = output.node(
+      graph_utils::FindGraphNodeWithName(cache_node->name(), output));
+
+  EXPECT_EQ(cache_node_out.input_size(), 2);
+  EXPECT_EQ(cache_node_out.input(0), range_node->name());
+}
+
+const auto *const kTakeNode = new std::pair<string, int>{"TakeDataset", -1};
+const auto *const kSkipNode = new std::pair<string, int>{"SkipDataset", 0};
+const auto *const kRepeatNode = new std::pair<string, int>{"RepeatDataset", 1};
+
+INSTANTIATE_TEST_CASE_P(
+    BasicRemovalTest, NoOpMultipleEliminationTest,
+    ::testing::Combine(::testing::Values(*kTakeNode, *kSkipNode, *kRepeatNode),
+                       ::testing::Values(*kTakeNode, *kSkipNode,
+                                         *kRepeatNode)));
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.cc b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.cc
index 8332fb0b1e..7c7161c5b2 100644
--- a/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.cc
+++ b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion.cc
@@ -18,8 +18,8 @@ limitations under the License.
 #include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/grappler/clusters/cluster.h"
-#include "tensorflow/core/grappler/graph_view.h"
 #include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/mutable_graph_view.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
 #include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
@@ -38,55 +38,62 @@ Status ShuffleAndRepeatFusion::Optimize(Cluster* cluster,
                                         const GrapplerItem& item,
                                         GraphDef* output) {
   *output = item.graph;
-  GraphView graph(output);
+  MutableGraphView graph(output);
   std::set<string> nodes_to_delete;
-  for (const NodeDef& node : item.graph.node()) {
-    if (node.op() != "RepeatDataset") {
-      continue;
-    }
 
-    // Use a more descriptive variable name now that we know the node type.
-    const NodeDef repeat_node(node);
-    GraphView::InputPort input_port = graph.GetInputPort(repeat_node.name(), 0);
-    NodeDef* node2 = graph.GetRegularFanin(input_port).node;
-    if (node2->op() != "ShuffleDataset") {
-      continue;
-    }
-
-    NodeDef* new_node = output->add_node();
-    new_node->set_op(kFusedOpName);
-    graph_utils::SetUniqueName(kFusedOpName, output, new_node);
-
-    // Use a more descriptive variable name now that we know the node type.
-    NodeDef* shuffle_node = node2;
+  auto make_shuffle_and_repeat_node = [&output](const NodeDef& shuffle_node,
+                                                const NodeDef& repeat_node) {
+    NodeDef new_node;
+    new_node.set_op(kFusedOpName);
+    graph_utils::SetUniqueGraphNodeName(kFusedOpName, output, &new_node);
 
     // Set the `input` input argument.
-    new_node->add_input(shuffle_node->input(0));
+    new_node.add_input(shuffle_node.input(0));
 
     // Set the `buffer_size` input argument.
-    new_node->add_input(shuffle_node->input(1));
+    new_node.add_input(shuffle_node.input(1));
 
     // Set the `seed` input argument.
-    new_node->add_input(shuffle_node->input(2));
+    new_node.add_input(shuffle_node.input(2));
 
     // Set the `seed2` input argument.
-    new_node->add_input(shuffle_node->input(3));
+    new_node.add_input(shuffle_node.input(3));
 
     // Set the `count` input argument.
-    new_node->add_input(repeat_node.input(1));
+    new_node.add_input(repeat_node.input(1));
 
     // Set `output_types` and `output_shapes` attributes.
     for (auto key : {"output_shapes", "output_types"}) {
-      (*new_node->mutable_attr())[key] = repeat_node.attr().at(key);
+      (*new_node.mutable_attr())[key] = repeat_node.attr().at(key);
     }
+    return new_node;
+  };
+
+  for (const NodeDef& node : item.graph.node()) {
+    if (node.op() != "RepeatDataset") {
+      continue;
+    }
+
+    // Use a more descriptive variable name now that we know the node type.
+    const NodeDef& repeat_node = node;
+    GraphView::InputPort input_port = graph.GetInputPort(repeat_node.name(), 0);
+    NodeDef* node2 = graph.GetRegularFanin(input_port).node;
+    if (node2->op() != "ShuffleDataset") {
+      continue;
+    }
+    // Use a more descriptive variable name now that we know the node type.
+    const NodeDef& shuffle_node = *node2;
+
+    NodeDef* shuffle_and_repeat_node =
+        graph.AddNode(make_shuffle_and_repeat_node(shuffle_node, repeat_node));
+    graph.ReplaceInput(repeat_node, *shuffle_and_repeat_node);
 
     // Mark the `Shuffle` and `Repeat` nodes for removal.
-    nodes_to_delete.insert(shuffle_node->name());
+    nodes_to_delete.insert(shuffle_node.name());
     nodes_to_delete.insert(repeat_node.name());
-
-    graph_utils::ReplaceInput(repeat_node, *new_node, &graph);
   }
-  TF_RETURN_IF_ERROR(graph_utils::DeleteNodes(nodes_to_delete, output));
+
+  graph.DeleteNodes(nodes_to_delete);
   return Status::OK();
 }
 
diff --git a/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion_test.cc b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion_test.cc
index e89675efb7..a2e470e511 100644
--- a/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion_test.cc
+++ b/tensorflow/core/grappler/optimizers/data/shuffle_and_repeat_fusion_test.cc
@@ -27,7 +27,7 @@ namespace {
 
 TEST(ShuffleAndRepeatFusionTest, FuseShuffleAndRepeatNodesIntoOne) {
   GrapplerItem item;
-  GraphDef *graph = &item.graph;
+  MutableGraphView graph(&item.graph);
 
   std::vector<std::pair<string, AttrValue>> common_attrs(2);
   AttrValue shapes_attr;
@@ -37,52 +37,44 @@ TEST(ShuffleAndRepeatFusionTest, FuseShuffleAndRepeatNodesIntoOne) {
   SetAttrValue("output_types", &types_attr);
   common_attrs[1] = std::make_pair("output_types", types_attr);
 
-  NodeDef *start_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
-  NodeDef *stop_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
-  NodeDef *step_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+  NodeDef *start_node = graph_utils::AddScalarConstNode<int64>(0, &graph);
+  NodeDef *stop_node = graph_utils::AddScalarConstNode<int64>(10, &graph);
+  NodeDef *step_node = graph_utils::AddScalarConstNode<int64>(1, &graph);
 
   std::vector<string> range_inputs(3);
   range_inputs[0] = start_node->name();
   range_inputs[1] = stop_node->name();
   range_inputs[2] = step_node->name();
-  NodeDef *range_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
-                                    common_attrs, graph, &range_node));
-
-  NodeDef *buffer_size_node;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<int64>(128, graph, &buffer_size_node));
-  NodeDef *seed_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &seed_node));
-  NodeDef *seed2_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &seed2_node));
+  NodeDef *range_node = graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                             common_attrs, &graph);
+
+  NodeDef *buffer_size_node =
+      graph_utils::AddScalarConstNode<int64>(128, &graph);
+  NodeDef *seed_node = graph_utils::AddScalarConstNode<int64>(-1, &graph);
+  NodeDef *seed2_node = graph_utils::AddScalarConstNode<int64>(-1, &graph);
   std::vector<string> shuffle_inputs(4);
   shuffle_inputs[0] = range_node->name();
   shuffle_inputs[1] = buffer_size_node->name();
   shuffle_inputs[2] = seed_node->name();
   shuffle_inputs[3] = seed2_node->name();
-  NodeDef *shuffle_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "ShuffleDataset", shuffle_inputs,
-                                    common_attrs, graph, &shuffle_node));
+  NodeDef *shuffle_node = graph_utils::AddNode(
+      "", "ShuffleDataset", shuffle_inputs, common_attrs, &graph);
 
-  NodeDef *count_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &count_node));
+  NodeDef *count_node = graph_utils::AddScalarConstNode<int64>(-1, &graph);
   std::vector<string> repeat_inputs(2);
   repeat_inputs[0] = shuffle_node->name();
   repeat_inputs[1] = count_node->name();
-  NodeDef *repeat_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RepeatDataset", repeat_inputs,
-                                    common_attrs, graph, &repeat_node));
+  NodeDef *repeat_node = graph_utils::AddNode(
+      "", "RepeatDataset", repeat_inputs, common_attrs, &graph);
 
   ShuffleAndRepeatFusion optimizer;
   GraphDef output;
   TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
 
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(shuffle_node->name(), output));
-  EXPECT_FALSE(graph_utils::ContainsNodeWithName(repeat_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(shuffle_node->name(), output));
+  EXPECT_FALSE(
+      graph_utils::ContainsGraphNodeWithName(repeat_node->name(), output));
   EXPECT_TRUE(
       graph_utils::ContainsNodeWithOp("ShuffleAndRepeatDataset", output));
   NodeDef shuffle_and_repeat_node = output.node(
@@ -103,7 +95,7 @@ TEST(ShuffleAndRepeatFusionTest, FuseShuffleAndRepeatNodesIntoOne) {
 
 TEST(ShuffleAndRepeatFusionTest, NoChange) {
   GrapplerItem item;
-  GraphDef *graph = &item.graph;
+  MutableGraphView graph(&item.graph);
 
   std::vector<std::pair<string, AttrValue>> common_attrs(2);
   AttrValue shapes_attr;
@@ -113,35 +105,29 @@ TEST(ShuffleAndRepeatFusionTest, NoChange) {
   SetAttrValue("output_types", &types_attr);
   common_attrs[1] = std::make_pair("output_types", types_attr);
 
-  NodeDef *start_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
-  NodeDef *stop_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
-  NodeDef *step_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
+  NodeDef *start_node = graph_utils::AddScalarConstNode<int64>(0, &graph);
+  NodeDef *stop_node = graph_utils::AddScalarConstNode<int64>(10, &graph);
+  NodeDef *step_node = graph_utils::AddScalarConstNode<int64>(1, &graph);
 
   std::vector<string> range_inputs(3);
   range_inputs[0] = start_node->name();
   range_inputs[1] = stop_node->name();
   range_inputs[2] = step_node->name();
-  NodeDef *range_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
-                                    common_attrs, graph, &range_node));
+  NodeDef *range_node = graph_utils::AddNode("", "RangeDataset", range_inputs,
+                                             common_attrs, &graph);
 
-  NodeDef *count_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(-1, graph, &count_node));
+  NodeDef *count_node = graph_utils::AddScalarConstNode<int64>(-1, &graph);
   std::vector<string> repeat_inputs(2);
   repeat_inputs[0] = range_node->name();
   repeat_inputs[1] = count_node->name();
-  NodeDef *repeat_node;
-  TF_ASSERT_OK(graph_utils::AddNode("", "RepeatDataset", repeat_inputs,
-                                    common_attrs, graph, &repeat_node));
+  graph_utils::AddNode("", "RepeatDataset", repeat_inputs, common_attrs,
+                       &graph);
 
   ShuffleAndRepeatFusion optimizer;
   GraphDef output;
   TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
 
-  EXPECT_TRUE(graph_utils::Compare(*graph, output));
+  EXPECT_TRUE(graph_utils::Compare(*graph.GetGraph(), output));
 }
 
 }  // namespace
diff --git a/tensorflow/core/grappler/optimizers/data/take_all_elimination.cc b/tensorflow/core/grappler/optimizers/data/take_all_elimination.cc
deleted file mode 100644
index 8e55947de1..0000000000
--- a/tensorflow/core/grappler/optimizers/data/take_all_elimination.cc
+++ /dev/null
@@ -1,67 +0,0 @@
-/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/core/grappler/optimizers/data/take_all_elimination.h"
-
-#include "tensorflow/core/framework/attr_value.pb.h"
-#include "tensorflow/core/framework/node_def.pb.h"
-#include "tensorflow/core/grappler/clusters/cluster.h"
-#include "tensorflow/core/grappler/graph_view.h"
-#include "tensorflow/core/grappler/grappler_item.h"
-#include "tensorflow/core/grappler/op_types.h"
-#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
-#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
-#include "tensorflow/core/grappler/utils.h"
-#include "tensorflow/core/platform/protobuf.h"
-
-namespace tensorflow {
-namespace grappler {
-
-Status TakeAllElimination::Optimize(Cluster* cluster, const GrapplerItem& item,
-                                    GraphDef* output) {
-  *output = item.graph;
-  GraphView graph(output);
-  std::set<string> nodes_to_delete;
-  for (const NodeDef& node : item.graph.node()) {
-    if (node.op() != "TakeDataset") continue;
-
-    // Use a more descriptive variable name now that we know the node type.
-    const auto take_node(node);
-
-    const NodeDef& count_node = *graph.GetNode(take_node.input(1));
-
-    // We are looking only for take(-1) nodes.
-    if (count_node.attr().at("value").tensor().int64_val(0) >= 0) continue;
-
-    GraphView::InputPort input_port = graph.GetInputPort(take_node.name(), 0);
-    NodeDef* const parent = graph.GetRegularFanin(input_port).node;
-    graph_utils::ReplaceInput(take_node, *parent, &graph);
-
-    nodes_to_delete.insert(take_node.name());
-  }
-  TF_RETURN_IF_ERROR(graph_utils::DeleteNodes(nodes_to_delete, output));
-  return Status::OK();
-}
-
-void TakeAllElimination::Feedback(Cluster* cluster, const GrapplerItem& item,
-                                  const GraphDef& optimize_output,
-                                  double result) {
-  // no-op
-}
-
-REGISTER_GRAPH_OPTIMIZER_AS(TakeAllElimination, "take_all_elimination");
-
-}  // end namespace grappler
-}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/data/take_all_elimination_test.cc b/tensorflow/core/grappler/optimizers/data/take_all_elimination_test.cc
deleted file mode 100644
index f19596e5b9..0000000000
--- a/tensorflow/core/grappler/optimizers/data/take_all_elimination_test.cc
+++ /dev/null
@@ -1,144 +0,0 @@
-/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/core/grappler/optimizers/data/take_all_elimination.h"
-
-#include "tensorflow/core/framework/attr_value_util.h"
-#include "tensorflow/core/grappler/grappler_item.h"
-#include "tensorflow/core/grappler/optimizers/data/graph_utils.h"
-#include "tensorflow/core/lib/core/status_test_util.h"
-#include "tensorflow/core/platform/test.h"
-
-namespace tensorflow {
-namespace grappler {
-namespace {
-
-std::vector<std::pair<string, AttrValue>> GetCommonAttributes() {
-  AttrValue shapes_attr, types_attr;
-  SetAttrValue("output_shapes", &shapes_attr);
-  SetAttrValue("output_types", &types_attr);
-  std::vector<std::pair<string, AttrValue>> commonAttributes = {
-      {"output_shapes", shapes_attr}, {"output_types", types_attr}};
-
-  return commonAttributes;
-}
-
-void MakeTakeNode(GraphDef *graph, int count, string input_node,
-                  NodeDef **return_node) {
-  NodeDef *take_count;
-  TF_ASSERT_OK(
-      graph_utils::AddScalarConstNode<int64>(count, graph, &take_count));
-  TF_ASSERT_OK(graph_utils::AddNode("", "TakeDataset",
-                                    {std::move(input_node), take_count->name()},
-                                    GetCommonAttributes(), graph, return_node));
-}
-
-void MakeRangeNode(GraphDef *graph, NodeDef **range_node) {
-  NodeDef *start_node, *stop_node, *step_node;
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(0, graph, &start_node));
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(10, graph, &stop_node));
-  TF_ASSERT_OK(graph_utils::AddScalarConstNode<int64>(1, graph, &step_node));
-
-  std::vector<string> range_inputs = {start_node->name(), stop_node->name(),
-                                      step_node->name()};
-
-  TF_ASSERT_OK(graph_utils::AddNode("", "RangeDataset", range_inputs,
-                                    GetCommonAttributes(), graph, range_node));
-}
-
-struct TakeLastEliminationTest
-    : ::testing::TestWithParam<std::pair<int, bool>> {};
-
-// This test checks if given pipeline:
-// range(10) -> take(X)
-// is transformed into:
-// range(10)  if X < 0.
-TEST_P(TakeLastEliminationTest, EliminateLastTakeNode) {
-  GrapplerItem item;
-  GraphDef *graph = &item.graph;
-
-  const int node_count = GetParam().first;
-  const bool should_keep_node = GetParam().second;
-
-  NodeDef *range_node;
-  MakeRangeNode(graph, &range_node);
-
-  NodeDef *take_all_node;
-  MakeTakeNode(graph, node_count, range_node->name(), &take_all_node);
-
-  TakeAllElimination optimizer;
-  GraphDef output;
-  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
-
-  EXPECT_EQ(graph_utils::ContainsNodeWithName(take_all_node->name(), output),
-            should_keep_node);
-}
-
-INSTANTIATE_TEST_CASE_P(BasicRemovalTest, TakeLastEliminationTest,
-                        ::testing::Values(std::make_pair(-1, false),
-                                          std::make_pair(-3, false),
-                                          std::make_pair(0, true),
-                                          std::make_pair(3, true)));
-
-struct TakeMiddleEliminationTest
-    : ::testing::TestWithParam<std::pair<int, bool>> {};
-
-// This test checks if given pipeline:
-// range(10) -> take(X) -> take(3)
-// is transformed into:
-// range(10) -> take(3) if X < 0.
-TEST_P(TakeMiddleEliminationTest, EliminateMiddleTakeNode) {
-  GrapplerItem item;
-  GraphDef *graph = &item.graph;
-
-  const int node_count = GetParam().first;
-  const bool should_keep_node = GetParam().second;
-
-  NodeDef *range_node;
-  MakeRangeNode(graph, &range_node);
-
-  NodeDef *take_all_node;
-  MakeTakeNode(graph, node_count, range_node->name(), &take_all_node);
-
-  NodeDef *take_three_node;
-  MakeTakeNode(graph, 3, take_all_node->name(), &take_three_node);
-
-  TakeAllElimination optimizer;
-  GraphDef output;
-  TF_ASSERT_OK(optimizer.Optimize(nullptr, item, &output));
-
-  EXPECT_EQ(graph_utils::ContainsNodeWithName(take_all_node->name(), output),
-            should_keep_node);
-  EXPECT_TRUE(
-      graph_utils::ContainsNodeWithName(take_three_node->name(), output));
-
-  NodeDef take_three_out = output.node(
-      graph_utils::FindNodeWithName(take_three_node->name(), output));
-
-  EXPECT_EQ(take_three_out.input_size(), 2);
-  auto last_node_input =
-      (should_keep_node ? take_all_node : range_node)->name();
-  EXPECT_EQ(take_three_out.input(0), last_node_input);
-}
-
-INSTANTIATE_TEST_CASE_P(BasicRemovalTest, TakeMiddleEliminationTest,
-                        ::testing::Values(std::make_pair(-1, false),
-                                          std::make_pair(-3, false),
-                                          std::make_pair(0, true),
-                                          std::make_pair(3, true)));
-
-}  // namespace
-}  // namespace grappler
-}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
index fdd82b9603..bb14ce310d 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
+#include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/grappler/utils/topological_sort.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
@@ -53,16 +54,6 @@ bool RemoveInput(NodeDef* node, const string& input, NodeMap* node_map) {
   return removed_input;
 }
 
-void DeleteNodes(const std::set<int>& nodes_to_delete, GraphDef* graph) {
-  int last = graph->node_size() - 1;
-  for (auto it = nodes_to_delete.rbegin(); it != nodes_to_delete.rend(); ++it) {
-    const int index = *it;
-    graph->mutable_node()->SwapElements(index, last);
-    last--;
-  }
-  graph->mutable_node()->DeleteSubrange(last + 1, nodes_to_delete.size());
-}
-
 }  // namespace
 
 bool DependencyOptimizer::SafeToRemoveIdentity(const NodeDef& node) const {
@@ -441,7 +432,7 @@ Status DependencyOptimizer::OptimizeDependencies() {
   if (fetch_nodes_known_) {
     VLOG(1) << "Deleted " << nodes_to_delete.size() << " out of "
             << optimized_graph_->node_size() << " nodes.";
-    DeleteNodes(nodes_to_delete, optimized_graph_);
+    EraseNodesFromGraph(nodes_to_delete, optimized_graph_);
     node_map_.reset(new NodeMap(optimized_graph_));
     BuildNodeToIdx();
   }
diff --git a/tensorflow/core/grappler/optimizers/evaluation_utils.cc b/tensorflow/core/grappler/optimizers/evaluation_utils.cc
new file mode 100644
index 0000000000..00ad7494f4
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/evaluation_utils.cc
@@ -0,0 +1,120 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/evaluation_utils.h"
+
+#include "tensorflow/core/lib/core/threadpool.h"
+#include "tensorflow/core/platform/cpu_info.h"
+#include "tensorflow/core/platform/denormal.h"
+#include "tensorflow/core/platform/setround.h"
+#include "tensorflow/core/public/version.h"
+
+namespace tensorflow {
+namespace grappler {
+using TensorVector = gtl::InlinedVector<TensorValue, 4>;
+
+namespace {
+class EigenThreadPoolWrapper : public Eigen::ThreadPoolInterface {
+ public:
+  explicit EigenThreadPoolWrapper(thread::ThreadPool* pool) : pool_(pool) {}
+  ~EigenThreadPoolWrapper() override {}
+  void Schedule(std::function<void()> fn) override {
+    auto wrapped = [=]() {
+      // TensorFlow flushes denormals to zero and rounds to nearest, so we do
+      // the same here.
+      port::ScopedFlushDenormal flush;
+      port::ScopedSetRound round(FE_TONEAREST);
+      fn();
+    };
+    pool_->Schedule(std::move(wrapped));
+  }
+  int NumThreads() const override { return pool_->NumThreads(); }
+  int CurrentThreadId() const override { return pool_->CurrentThreadId(); }
+
+ private:
+  thread::ThreadPool* pool_ = nullptr;
+};
+
+}  // namespace
+
+DeviceSimple::DeviceSimple() : DeviceBase(Env::Default()) {
+  eigen_worker_threads_.num_threads = port::NumSchedulableCPUs();
+  eigen_worker_threads_.workers = new thread::ThreadPool(
+      Env::Default(), "evaluation_utils", eigen_worker_threads_.num_threads);
+  eigen_threadpool_wrapper_.reset(
+      new EigenThreadPoolWrapper(eigen_worker_threads_.workers));
+  eigen_device_.reset(new Eigen::ThreadPoolDevice(
+      eigen_threadpool_wrapper_.get(), eigen_worker_threads_.num_threads));
+  set_tensorflow_cpu_worker_threads(&eigen_worker_threads_);
+  set_eigen_cpu_device(eigen_device_.get());
+}
+
+DeviceSimple::~DeviceSimple() {
+  eigen_threadpool_wrapper_.reset();
+  eigen_device_.reset();
+  delete eigen_worker_threads_.workers;
+}
+
+Status DeviceSimple::MakeTensorFromProto(const TensorProto& tensor_proto,
+                                         const AllocatorAttributes alloc_attrs,
+                                         Tensor* tensor) {
+  Tensor parsed(tensor_proto.dtype());
+  if (!parsed.FromProto(cpu_allocator(), tensor_proto)) {
+    return errors::InvalidArgument("Cannot parse tensor from tensor_proto.");
+  }
+  *tensor = parsed;
+  return Status::OK();
+}
+
+Status EvaluateNode(const NodeDef& node, const TensorVector& inputs,
+                    DeviceBase* cpu_device, ResourceMgr* resource_mgr,
+                    TensorVector* output) {
+  Status status;
+  std::unique_ptr<DeviceBase> device;
+  if (cpu_device == nullptr) {
+    device.reset(new DeviceSimple());
+    cpu_device = device.get();
+  }
+
+  std::unique_ptr<OpKernel> op_kernel(
+      CreateOpKernel("CPU", cpu_device, cpu_device->GetAllocator({}), node,
+                     TF_GRAPH_DEF_VERSION, &status));
+  TF_RETURN_IF_ERROR(status);
+  OpKernelContext::Params params;
+  params.device = cpu_device;
+  params.frame_iter = FrameAndIter(0, 0);
+  params.inputs = &inputs;
+  params.op_kernel = op_kernel.get();
+  params.resource_manager = resource_mgr;
+
+  gtl::InlinedVector<AllocatorAttributes, 4> output_attrs;
+  const int num_outputs = op_kernel->num_outputs();
+  for (int i = 0; i < num_outputs; i++) {
+    AllocatorAttributes attr;
+    attr.set_on_host(true);
+    output_attrs.push_back(attr);
+  }
+  params.output_attr_array = output_attrs.data();
+
+  OpKernelContext op_context(&params);
+  op_kernel->Compute(&op_context);
+  for (int i = 0; i < num_outputs; i++) {
+    output->push_back(op_context.release_output(i));
+  }
+  return op_context.status();
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/evaluation_utils.h b/tensorflow/core/grappler/optimizers/evaluation_utils.h
new file mode 100644
index 0000000000..8414b5b8ca
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/evaluation_utils.h
@@ -0,0 +1,61 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_EVALUATION_UTILS_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_EVALUATION_UTILS_H_
+
+#define EIGEN_USE_THREADS
+
+#include "tensorflow/core/framework/device_base.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
+
+namespace Eigen {
+class ThreadPoolInterface;
+class ThreadPoolWrapper;
+}  // namespace Eigen
+
+namespace tensorflow {
+namespace grappler {
+
+class DeviceSimple : public DeviceBase {
+ public:
+  DeviceSimple();
+  ~DeviceSimple();
+
+  Status MakeTensorFromProto(const TensorProto& tensor_proto,
+                             const AllocatorAttributes alloc_attrs,
+                             Tensor* tensor) override;
+
+  Allocator* GetAllocator(AllocatorAttributes attr) override {
+    return cpu_allocator();
+  }
+
+ private:
+  DeviceBase::CpuWorkerThreads eigen_worker_threads_;
+  std::unique_ptr<Eigen::ThreadPoolInterface> eigen_threadpool_wrapper_;
+  std::unique_ptr<Eigen::ThreadPoolDevice> eigen_device_;
+};
+
+Status EvaluateNode(const NodeDef& node,
+                    const gtl::InlinedVector<TensorValue, 4>& inputs,
+                    DeviceBase* cpu_device, ResourceMgr* resource_mgr,
+                    gtl::InlinedVector<TensorValue, 4>* output);
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_EVALUATION_UTILS_H_
diff --git a/tensorflow/core/grappler/optimizers/evaluation_utils_test.cc b/tensorflow/core/grappler/optimizers/evaluation_utils_test.cc
new file mode 100644
index 0000000000..17b42490d7
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/evaluation_utils_test.cc
@@ -0,0 +1,63 @@
+#include "tensorflow/core/platform/cpu_info.h"
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#define EIGEN_USE_THREADS
+
+#include "third_party/eigen3/unsupported/Eigen/CXX11/ThreadPool"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor.pb.h"
+#include "tensorflow/core/grappler/optimizers/evaluation_utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+
+TEST(EvaluationUtilsTest, DeviceSimple_BasicProperties) {
+  DeviceSimple dsimple;
+  ASSERT_TRUE(dsimple.has_eigen_cpu_device());
+  EXPECT_EQ(dsimple.eigen_cpu_device()->numThreads(),
+            port::NumSchedulableCPUs());
+  const Eigen::ThreadPoolInterface* pool =
+      dsimple.eigen_cpu_device()->getPool();
+  ASSERT_NE(pool, nullptr);
+}
+
+TEST(EvaluationUtilsTest, DeviceSimple_MakeTensorFromProto) {
+  DeviceSimple dsimple;
+
+  TensorProto proto;
+  Tensor tensor;
+  EXPECT_FALSE(dsimple.MakeTensorFromProto(proto, {}, &tensor).ok());
+
+  Tensor original(tensorflow::DT_INT16, TensorShape{4, 2});
+  original.flat<int16>().setRandom();
+
+  original.AsProtoTensorContent(&proto);
+  TF_ASSERT_OK(dsimple.MakeTensorFromProto(proto, {}, &tensor));
+
+  ASSERT_EQ(tensor.dtype(), original.dtype());
+  ASSERT_EQ(tensor.shape(), original.shape());
+
+  auto buf0 = original.flat<int16>();
+  auto buf1 = tensor.flat<int16>();
+  ASSERT_EQ(buf0.size(), buf1.size());
+  for (int i = 0; i < buf0.size(); ++i) {
+    EXPECT_EQ(buf0(i), buf1(i));
+  }
+}
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer.cc b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
index 9627ed7323..f3a07be728 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
@@ -22,20 +22,26 @@ limitations under the License.
 #include <unordered_set>
 #include <vector>
 
+#include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/tensor.pb.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/grappler/graph_view.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/op_types.h"
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
+#include "tensorflow/core/grappler/optimizers/evaluation_utils.h"
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/grappler/utils/frame.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/lib/gtl/inlined_vector.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/tensor_coding.h"
+#include "tensorflow/core/public/version.h"
 #include "tensorflow/core/util/device_name_utils.h"
 #include "tensorflow/core/util/saved_tensor_slice_util.h"
 
@@ -45,6 +51,8 @@ namespace tensorflow {
 namespace grappler {
 namespace {
 
+using TensorVector = gtl::InlinedVector<TensorValue, 4>;
+
 class LoopInvariantNodeMotionOptimizer {
  public:
   explicit LoopInvariantNodeMotionOptimizer(GraphDef* optimized_graph)
@@ -456,7 +464,25 @@ std::vector<int> GetStackPushNodesToConvert(
     const NodeDef& fanout_node = graph_view.graph()->node(fanout_idx);
     VLOG(1) << "Fanout " << fanout_idx << " : " << fanout_node.name();
     if (IsStackPushOp(fanout_node)) {
-      nodes_to_convert.push_back(fanout_idx);
+      // Check that the stack itself is not a node we want to preserve. This can
+      // happen when the graph we have contains only the forward pass for a loop
+      // (as when the forward and backward passes are split across different
+      // functions).
+      if (graph_view.has_node(fanout_node.input(0))) {
+        const NodeDef* stack_node =
+            &graph_view.node(graph_view.index(fanout_node.input(0)));
+        while (stack_node->op() != "Stack" && stack_node->op() != "StackV2" &&
+               stack_node->input_size() > 0 &&
+               graph_view.has_node(stack_node->input(0))) {
+          stack_node = &graph_view.node(graph_view.index(stack_node->input(0)));
+        }
+        if (nodes_to_preserve.find(stack_node->name()) ==
+            nodes_to_preserve.end()) {
+          nodes_to_convert.push_back(fanout_idx);
+        }
+      } else {
+        nodes_to_convert.push_back(fanout_idx);
+      }
     } else if (IsStackOp(fanout_node) || IsStackCloseOp(fanout_node) ||
                op_types_to_traverse.find(fanout_node.op()) !=
                    op_types_to_traverse.end()) {
@@ -504,8 +530,179 @@ Status RemoveStackOps(const std::unordered_set<string>& nodes_to_preserve,
   return Status::OK();
 }
 
-Status RemoveDeadBranches(const std::unordered_set<string>& nodes_to_preserve,
-                          GraphDef* optimized_graph) {
+bool IsSimpleBinaryOperator(const NodeDef& node) {
+  return (IsLess(node) || IsLessEqual(node) || IsGreater(node) ||
+          IsGreaterEqual(node) || IsEqual(node));
+}
+
+Status EvaluateBoolOpForConstantOperands(const NodeDef& op_node,
+                                         const NodeDef& constant_operand_0,
+                                         const NodeDef& constant_operand_1,
+                                         DeviceBase* cpu_device,
+                                         ResourceMgr* resource_mgr,
+                                         bool* value) {
+  TensorVector inputs;
+
+  const TensorProto& raw_val_0 = constant_operand_0.attr().at("value").tensor();
+  Tensor value_0(raw_val_0.dtype(), raw_val_0.tensor_shape());
+  CHECK(value_0.FromProto(raw_val_0));
+  inputs.emplace_back(&value_0);
+  const TensorProto& raw_val_1 = constant_operand_1.attr().at("value").tensor();
+  Tensor value_1(raw_val_1.dtype(), raw_val_1.tensor_shape());
+  CHECK(value_1.FromProto(raw_val_1));
+  inputs.emplace_back(&value_1);
+
+  TensorVector outputs;
+  TF_RETURN_IF_ERROR(
+      EvaluateNode(op_node, inputs, cpu_device, resource_mgr, &outputs));
+
+  if (outputs.size() != 1 || outputs[0].tensor == nullptr) {
+    return Status(error::INVALID_ARGUMENT, "Expected one output.");
+  }
+  *value = outputs[0].tensor->scalar<bool>()();
+  delete outputs[0].tensor;
+
+  return Status::OK();
+}
+
+Status CheckForDeadFanout(const GraphView& view, const NodeDef& switch_node,
+                          const NodeMap& node_map,
+                          DeviceBase* cpu_device, ResourceMgr* resource_mgr,
+                          bool* has_dead_fanout, int* dead_fanout) {
+  *has_dead_fanout = false;
+  GraphView::InputPort switch_loopcond_port(&switch_node, 1);
+  NodeDef* switch_predicate = view.GetRegularFanin(switch_loopcond_port).node;
+
+  // CASE 1: Control is a constant.
+  if (IsConstant(*switch_predicate)) {
+    Tensor selector;
+    CHECK(selector.FromProto(switch_predicate->attr().at("value").tensor()));
+    *has_dead_fanout = true;
+    *dead_fanout = selector.scalar<bool>()() ? 0 : 1;
+  }
+
+  GraphView::InputPort switch_input_port(&switch_node, 0);
+  NodeDef* switch_input = view.GetRegularFanin(switch_input_port).node;
+
+  // CASE 2: Zero-iteration while loop.
+  // We check if its a while loop such that the condition is a simple binary
+  // operator which returns false for the initialization value.
+  // TODO(srjoglekar): Improve to work with arbitrary predicate subgraphs.
+  if (!IsMerge(*switch_input)) {
+    return Status::OK();
+  }
+
+  // Find the boolean Op from predicate node.
+  NodeDef* switch_ctrl_node = nullptr;
+  for (int i = 0; i < switch_predicate->input().size(); ++i) {
+    NodeDef* node = node_map.GetNode(switch_predicate->input(i));
+    if (IsSimpleBinaryOperator(*node)) {
+      switch_ctrl_node = node;
+    }
+  }
+  if (switch_ctrl_node == nullptr) {
+    return Status::OK();
+  }
+  // Find the Merge node & the Constant Operand to the condition node, if
+  // available.
+  NodeDef* merge_node = nullptr;
+  NodeDef* constant_ctrl_input = nullptr;
+  int constant_index = 0;
+  for (int i = 0; i < switch_ctrl_node->input().size(); ++i) {
+    NodeDef* node = node_map.GetNode(switch_ctrl_node->input(i));
+    if (IsMerge(*node)) {
+      merge_node = node;
+    }
+    if (IsConstant(*node)) {
+      constant_ctrl_input = node;
+      constant_index = i;
+    }
+  }
+  if (merge_node == nullptr || constant_ctrl_input == nullptr) {
+    return Status::OK();
+  }
+  // Find the initialization constant (via Enter, if one exists).
+  NodeDef* enter_node = nullptr;
+  NodeDef* constant_init_node = nullptr;
+  for (const auto& input : merge_node->input()) {
+    NodeDef* node = node_map.GetNode(input);
+    if (IsEnter(*node)) {
+      enter_node = node;
+    }
+    if (IsConstant(*node)) {
+      constant_init_node = node;
+    }
+  }
+  if (enter_node != nullptr) {
+    if (constant_init_node != nullptr) return Status::OK();
+    for (const auto& input : enter_node->input()) {
+      NodeDef* node = node_map.GetNode(input);
+      if (IsConstant(*node)) {
+        constant_init_node = node;
+      }
+    }
+  }
+  if (constant_init_node == nullptr) {
+    return Status::OK();
+  }
+
+  // Check if there will be 0 iterations. This will only happen if the condition
+  // evaluates to false with respect to the initialization value.
+  NodeDef* operand_0 =
+      constant_index ? constant_init_node : constant_ctrl_input;
+  NodeDef* operand_1 =
+      constant_index ? constant_ctrl_input : constant_init_node;
+  bool constant_switch_value;
+  TF_RETURN_IF_ERROR(EvaluateBoolOpForConstantOperands(
+      *switch_ctrl_node, *operand_0, *operand_1, cpu_device, resource_mgr,
+      &constant_switch_value));
+  if (constant_switch_value == false) {
+    *has_dead_fanout = true;
+    *dead_fanout = 1;
+  }
+  return Status::OK();
+}
+
+}  // namespace
+
+LoopOptimizer::LoopOptimizer()
+    : opt_level_(RewriterConfig::ON),
+      cpu_device_(nullptr),
+      options_(LoopOptimizerOptions::Default(RewriterConfig::ON)) {}
+
+LoopOptimizer::LoopOptimizer(RewriterConfig::Toggle opt_level,
+                             DeviceBase* cpu_device)
+    : opt_level_(opt_level),
+      cpu_device_(cpu_device),
+      options_(LoopOptimizerOptions::Default(RewriterConfig::ON)) {
+  resource_mgr_.reset(new ResourceMgr());
+}
+
+Status LoopOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
+                               GraphDef* optimized_graph) {
+  *optimized_graph = item.graph;
+  // Set up helper data structures.
+  if (options_.enable_loop_invariant_node_motion) {
+    LoopInvariantNodeMotionOptimizer linm_optimizer(optimized_graph);
+    TF_RETURN_IF_ERROR(linm_optimizer.Optimize());
+  }
+  if (options_.enable_stack_push_removal) {
+    TF_RETURN_IF_ERROR(RemoveStackOps(item.NodesToPreserve(), optimized_graph));
+  }
+  if (options_.enable_dead_branch_removal) {
+    // TODO(srjoglekar): Figure out if we can optimize NodeMap creations across
+    // optimizer passes.
+    NodeMap node_map(optimized_graph);
+    TF_RETURN_IF_ERROR(
+        RemoveDeadBranches(item.NodesToPreserve(), node_map, optimized_graph));
+  }
+
+  return Status::OK();
+}
+
+Status LoopOptimizer::RemoveDeadBranches(
+    const std::unordered_set<string>& nodes_to_preserve,
+    const NodeMap& node_map, GraphDef* optimized_graph) {
   std::unordered_set<const NodeDef*> dead_nodes;
   std::unordered_map<NodeDef*, std::set<int>> dead_merge_inputs;
   // TODO(bsteiner): also rewrite switches as identity. For now we just record
@@ -521,14 +718,15 @@ Status RemoveDeadBranches(const std::unordered_set<string>& nodes_to_preserve,
     if (nodes_to_preserve.find(node.name()) != nodes_to_preserve.end()) {
       continue;
     }
-    GraphView::InputPort ctrl_port(&node, 1);
-    GraphView::OutputPort ctrl_node = view.GetRegularFanin(ctrl_port);
-    if (!IsConstant(*ctrl_node.node)) {
+
+    int dead_fanout;
+    bool has_dead_fanout;
+    TF_RETURN_IF_ERROR(CheckForDeadFanout(view, node, node_map, cpu_device_,
+                                          resource_mgr_.get(), &has_dead_fanout,
+                                          &dead_fanout));
+    if (!has_dead_fanout) {
       continue;
     }
-    Tensor selector;
-    CHECK(selector.FromProto(ctrl_node.node->attr().at("value").tensor()));
-    const int dead_fanout = selector.scalar<bool>()() ? 0 : 1;
     GraphView::OutputPort dead(const_cast<NodeDef*>(&node), dead_fanout);
     identity_switches.insert(dead);
 
@@ -616,15 +814,13 @@ Status RemoveDeadBranches(const std::unordered_set<string>& nodes_to_preserve,
     }
   }
 
-  int last = optimized_graph->node_size() - 1;
-  for (int i = optimized_graph->node_size() - 1; i >= 0; --i) {
-    NodeDef* node = optimized_graph->mutable_node(i);
-    if (dead_nodes.find(node) != dead_nodes.end()) {
-      optimized_graph->mutable_node()->SwapElements(i, last);
-      last--;
-    }
+  std::vector<int> nodes_idx_to_delete;
+  nodes_idx_to_delete.reserve(dead_nodes.size());
+  for (int i = 0; i < optimized_graph->node_size(); ++i) {
+    if (dead_nodes.count(&optimized_graph->node(i)))
+      nodes_idx_to_delete.push_back(i);
   }
-  optimized_graph->mutable_node()->DeleteSubrange(last + 1, dead_nodes.size());
+  EraseNodesFromGraph(std::move(nodes_idx_to_delete), optimized_graph);
 
   for (const auto& itr : dead_merge_inputs) {
     NodeDef* dead_node = itr.first;
@@ -642,27 +838,6 @@ Status RemoveDeadBranches(const std::unordered_set<string>& nodes_to_preserve,
   return Status::OK();
 }
 
-}  // namespace
-
-Status LoopOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
-                               GraphDef* optimized_graph) {
-  *optimized_graph = item.graph;
-  // Set up helper data structures.
-  if (options_.enable_loop_invariant_node_motion) {
-    LoopInvariantNodeMotionOptimizer linm_optimizer(optimized_graph);
-    TF_RETURN_IF_ERROR(linm_optimizer.Optimize());
-  }
-  if (options_.enable_stack_push_removal) {
-    TF_RETURN_IF_ERROR(RemoveStackOps(item.NodesToPreserve(), optimized_graph));
-  }
-  if (options_.enable_dead_branch_removal) {
-    TF_RETURN_IF_ERROR(
-        RemoveDeadBranches(item.NodesToPreserve(), optimized_graph));
-  }
-
-  return Status::OK();
-}
-
 void LoopOptimizer::Feedback(Cluster* /*cluster*/, const GrapplerItem& /*item*/,
                              const GraphDef& /*optimized_graph*/,
                              double /*result*/) {
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer.h b/tensorflow/core/grappler/optimizers/loop_optimizer.h
index 85b8e65543..7c04f55381 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer.h
@@ -30,12 +30,10 @@ constexpr char kLoopOptimizer[] = "LoopOptimizer";
 
 class LoopOptimizer : public GraphOptimizer {
  public:
-  LoopOptimizer()
-      : opt_level_(RewriterConfig::ON),
-        options_(LoopOptimizerOptions::Default(RewriterConfig::ON)) {}
-  explicit LoopOptimizer(RewriterConfig::Toggle opt_level)
-      : opt_level_(opt_level),
-        options_(LoopOptimizerOptions::Default(RewriterConfig::ON)) {}
+  LoopOptimizer();
+
+  explicit LoopOptimizer(RewriterConfig::Toggle opt_level,
+                         DeviceBase* cpu_device);
 
   ~LoopOptimizer() override {}
 
@@ -62,8 +60,13 @@ class LoopOptimizer : public GraphOptimizer {
     }
   };
 
+  Status RemoveDeadBranches(const std::unordered_set<string>& nodes_to_preserve,
+                            const NodeMap& node_map, GraphDef* optimized_graph);
+
   RewriterConfig::Toggle opt_level_;
+  DeviceBase* cpu_device_;
   LoopOptimizerOptions options_;
+  std::unique_ptr<ResourceMgr> resource_mgr_;
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc b/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
index 6fd177b710..81f40db8f0 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/loop_optimizer.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"
 #include "tensorflow/core/grappler/utils.h"
@@ -535,6 +536,29 @@ TEST_F(LoopOptimizerTest, RemovePush_NoOp) {
   VerifyGraphsEqual(item.graph, output, __FUNCTION__);
 }
 
+TEST_F(LoopOptimizerTest, RemovePush_NoPopButStackLives) {
+  GrapplerItem item;
+  GraphDef& graph = item.graph;
+  AddSimpleNode("c", "Const", {}, &graph);
+  // Stack with corresponding push
+  AddSimpleNode("stack1", "StackV2", {}, &graph);
+  AddSimpleNode("push1", "StackPushV2", {"stack1", "c"}, &graph);
+  // Stack with corresponding push behind Enter.
+  AddSimpleNode("stack2", "StackV2", {}, &graph);
+  AddEnterNode("enter2_c", "frame_name", false, 1, {"c"}, &graph);
+  AddEnterNode("enter2_stack2", "frame_name", false, 1, {"stack2"}, &graph);
+  AddSimpleNode("push2", "StackPushV2", {"enter2_stack2", "enter2_c"}, &graph);
+  item.keep_ops.push_back("stack1");
+  item.keep_ops.push_back("stack2");
+
+  LoopOptimizer optimizer;
+  EnableOnlyStackPushRemoval(&optimizer);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  VerifyGraphsEqual(item.graph, output, __FUNCTION__);
+}
+
 TEST_F(LoopOptimizerTest, RemovePushWithoutMatchingPop) {
   GrapplerItem item;
   GraphDef& graph = item.graph;
@@ -589,7 +613,7 @@ TEST_F(LoopOptimizerTest, RemovePushWithoutMatchingPop) {
   }
 }
 
-TEST_F(LoopOptimizerTest, RemoveDeadBranches) {
+TEST_F(LoopOptimizerTest, RemoveDeadBranches_ConstantCondition) {
   Scope scope = Scope::NewRootScope();
   Output v_in = ops::Variable(scope.WithOpName("v_in"), {3}, DT_FLOAT);
 
@@ -639,7 +663,7 @@ TEST_F(LoopOptimizerTest, RemoveDeadBranches) {
 
   TF_CHECK_OK(scope.ToGraphDef(&item.graph));
 
-  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE);
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE, nullptr);
   GraphDef output;
   Status status = optimizer.Optimize(nullptr, item, &output);
   TF_CHECK_OK(status);
@@ -696,5 +720,237 @@ TEST_F(LoopOptimizerTest, RemoveDeadBranches) {
   }
 }
 
+TEST_F(LoopOptimizerTest, RemoveDeadBranches_ZeroIterWhile) {
+  const string gdef_ascii = R"EOF(
+node {
+  name: "Const"
+  op: "Const"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 20
+      }
+    }
+  }
+}
+node {
+  name: "while/Enter"
+  op: "Enter"
+  input: "Const"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "frame_name"
+    value {
+      s: "while/while/"
+    }
+  }
+  attr {
+    key: "is_constant"
+    value {
+      b: false
+    }
+  }
+  attr {
+    key: "parallel_iterations"
+    value {
+      i: 1
+    }
+  }
+}
+node {
+  name: "while/Merge"
+  op: "Merge"
+  input: "while/Enter"
+  input: "while/NextIteration"
+  attr {
+    key: "N"
+    value {
+      i: 2
+    }
+  }
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "while/Less/y"
+  op: "Const"
+  input: "^while/Merge"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 10
+      }
+    }
+  }
+}
+node {
+  name: "while/Less"
+  op: "Less"
+  input: "while/Merge"
+  input: "while/Less/y"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "while/LoopCond"
+  op: "LoopCond"
+  input: "while/Less"
+}
+node {
+  name: "while/Switch"
+  op: "Switch"
+  input: "while/Merge"
+  input: "while/LoopCond"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "_class"
+    value {
+      list {
+        s: "loc:@while/Merge"
+      }
+    }
+  }
+}
+node {
+  name: "while/Identity"
+  op: "Identity"
+  input: "while/Switch:1"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "while/add/y"
+  op: "Const"
+  input: "^while/Identity"
+  attr {
+    key: "dtype"
+    value {
+      type: DT_INT32
+    }
+  }
+  attr {
+    key: "value"
+    value {
+      tensor {
+        dtype: DT_INT32
+        tensor_shape {
+        }
+        int_val: 1
+      }
+    }
+  }
+}
+node {
+  name: "while/add"
+  op: "Add"
+  input: "while/Identity"
+  input: "while/add/y"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "while/NextIteration"
+  op: "NextIteration"
+  input: "while/add"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+node {
+  name: "while/Exit"
+  op: "Exit"
+  input: "while/Switch"
+  attr {
+    key: "T"
+    value {
+      type: DT_INT32
+    }
+  }
+}
+versions {
+  producer: 21
+}
+  )EOF";
+
+  GrapplerItem item;
+  CHECK(protobuf::TextFormat::ParseFromString(gdef_ascii, &item.graph));
+  item.fetch = {"while/Exit"};
+  auto tensors_expected = EvaluateNodes(item.graph, item.fetch);
+  EXPECT_EQ(1, tensors_expected.size());
+
+  LoopOptimizer optimizer(RewriterConfig::AGGRESSIVE, nullptr);
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_CHECK_OK(status);
+  auto tensors_got = EvaluateNodes(output, item.fetch);
+  EXPECT_EQ(1, tensors_got.size());
+  test::ExpectTensorEqual<int32>(tensors_expected[0], tensors_got[0]);
+
+  int nodes_present = 0;
+  for (const NodeDef& node : output.node()) {
+    // All nodes connected to Switch's positive check should be pruned.
+    if (node.name() == "while/add") {
+      LOG(ERROR) << "while/add is present after optimization";
+    } else if (node.name() == "while/add/y") {
+      LOG(ERROR) << "while/add/y is present after optimization";
+    } else if (node.name() == "while/NextIteration") {
+      LOG(ERROR) << "while/NextIteration is present after optimization";
+    } else if (node.name() == "while/Identity") {
+      LOG(ERROR) << "while/Identity is present after optimization";
+    }
+    ++nodes_present;
+  }
+  EXPECT_EQ(8, nodes_present);
+}
+
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/meta_optimizer.cc b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
index c55f479451..96f6fe1e0b 100644
--- a/tensorflow/core/grappler/optimizers/meta_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
@@ -87,7 +87,7 @@ std::unique_ptr<GraphOptimizer> MetaOptimizer::MakeNewOptimizer(
   MK_OPT("memory", new MemoryOptimizer(RewriterConfig::MANUAL));
   MK_OPT("arithmetic", new ArithmeticOptimizer(cfg_.arithmetic_optimization()));
   MK_OPT("autoparallel", new AutoParallel(cfg_.auto_parallel().num_replicas()));
-  MK_OPT("loop", new LoopOptimizer(cfg_.loop_optimization()));
+  MK_OPT("loop", new LoopOptimizer(cfg_.loop_optimization(), cpu_device_));
   MK_OPT("dependency", new DependencyOptimizer(cfg_.dependency_optimization()));
   MK_OPT("debug_stripper", new DebugStripper());
   MK_OPT("scoped_allocator",
@@ -126,7 +126,8 @@ Status MetaOptimizer::InitializeOptimizers(
         new ArithmeticOptimizer(cfg_.arithmetic_optimization()));
   }
   if (cfg_.loop_optimization() != RewriterConfig::OFF) {
-    optimizers->emplace_back(new LoopOptimizer(cfg_.loop_optimization()));
+    optimizers->emplace_back(
+        new LoopOptimizer(cfg_.loop_optimization(), cpu_device_));
   }
   if (cfg_.dependency_optimization() != RewriterConfig::OFF) {
     optimizers->emplace_back(
diff --git a/tensorflow/core/grappler/utils.cc b/tensorflow/core/grappler/utils.cc
index c8e63f95e1..153785d3b4 100644
--- a/tensorflow/core/grappler/utils.cc
+++ b/tensorflow/core/grappler/utils.cc
@@ -13,7 +13,10 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include "tensorflow/core/grappler/utils.h"
+
 #include <memory>
+#include <queue>
 #include <vector>
 
 #include "tensorflow/core/framework/attr_value.pb.h"
@@ -21,7 +24,6 @@ limitations under the License.
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/framework/types.h"
-#include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/scanner.h"
 #include "tensorflow/core/lib/strings/strcat.h"
@@ -354,13 +356,51 @@ void DedupControlInputs(NodeDef* node) {
 }
 
 namespace {
+
+template <typename UniqueContainer>
+void EraseNodesFromGraphImpl(const UniqueContainer& nodes_to_delete,
+                             GraphDef* graph) {
+  static_assert(std::is_same<typename UniqueContainer::value_type, int>::value,
+                "Need to pass container of ints");
+
+  int last = graph->node_size() - 1;
+  for (auto it = nodes_to_delete.rbegin(); it != nodes_to_delete.rend(); ++it) {
+    const int index = *it;
+    graph->mutable_node()->SwapElements(index, last);
+    last--;
+  }
+  graph->mutable_node()->DeleteSubrange(last + 1, nodes_to_delete.size());
+}
+
 template <typename T>
 inline void STLSortAndRemoveDuplicates(T* v) {
   std::sort(v->begin(), v->end());
   v->erase(std::unique(v->begin(), v->end()), v->end());
 }
+
 }  // namespace
 
+void EraseNodesFromGraph(const std::set<int>& nodes_to_delete,
+                         GraphDef* graph) {
+  EraseNodesFromGraphImpl(nodes_to_delete, graph);
+}
+
+void EraseNodesFromGraph(std::vector<int>&& nodes_to_delete, GraphDef* graph) {
+  STLSortAndRemoveDuplicates(&nodes_to_delete);
+  EraseNodesFromGraphImpl(nodes_to_delete, graph);
+}
+
+void EraseNodesFromGraph(const std::set<string>& nodes_to_delete,
+                         GraphDef* graph) {
+  std::vector<int> nodes_idx_to_delete;
+  nodes_idx_to_delete.reserve(nodes_to_delete.size());
+  for (int i = 0; i < graph->node_size(); ++i) {
+    if (nodes_to_delete.count(graph->node(i).name()))
+      nodes_idx_to_delete.push_back(i);
+  }
+  EraseNodesFromGraphImpl(nodes_idx_to_delete, graph);
+}
+
 Status SimpleGraphView::Initialize(
     const GraphDef& graph,
     const std::vector<std::pair<const NodeDef*, const NodeDef*>>*
diff --git a/tensorflow/core/grappler/utils.h b/tensorflow/core/grappler/utils.h
index 1c6fef59ea..a9c34b6d08 100644
--- a/tensorflow/core/grappler/utils.h
+++ b/tensorflow/core/grappler/utils.h
@@ -209,6 +209,13 @@ void PermuteNodesInPlace(GraphDef* graph, std::vector<int>* permutation,
 
 Status SetTensorValue(DataType dtype, int value, Tensor* tensor);
 
+void EraseNodesFromGraph(const std::set<int>& nodes_to_delete, GraphDef* graph);
+
+void EraseNodesFromGraph(std::vector<int>&& nodes_to_delete, GraphDef* graph);
+
+void EraseNodesFromGraph(const std::set<string>& nodes_to_delete,
+                         GraphDef* graph);
+
 class SimpleGraphView {
  public:
   // Build a graph view for the specified graphdef.
@@ -232,11 +239,17 @@ class SimpleGraphView {
 
   const GraphDef* graph() const { return graph_; }
   inline int num_nodes() const { return index_to_name_.size(); }
+  inline bool has_node(const string& node_name) const {
+    return name_to_index_.find(node_name) != name_to_index_.end();
+  }
   inline const int index(const string& node_name) const {
     const auto& it = name_to_index_.find(node_name);
     DCHECK(it != name_to_index_.end());
     return it == name_to_index_.end() ? -1 : it->second;
   }
+  inline const NodeDef& node(int node_idx) const {
+    return graph_->node(node_idx);
+  }
   inline const string& node_name(int node_idx) const {
     return index_to_name_[node_idx];
   }
diff --git a/tensorflow/core/grappler/utils/functions.cc b/tensorflow/core/grappler/utils/functions.cc
index d64cb49715..fd71406d2c 100644
--- a/tensorflow/core/grappler/utils/functions.cc
+++ b/tensorflow/core/grappler/utils/functions.cc
@@ -119,7 +119,7 @@ Status GrapplerFunctionConnectivity::ExpandFunctionDefInput(
   if (Scanner(remaining)
           .OneLiteral(":")
           .RestartCapture()
-          .One(strings::Scanner::LOWERLETTER)
+          .One(strings::Scanner::LETTER)
           .Any(strings::Scanner::LETTER_DIGIT_UNDERSCORE)
           .GetResult(&remaining, &capture)) {
     node_output = string(capture.data(), capture.size());
diff --git a/tensorflow/core/grappler/utils/scc.cc b/tensorflow/core/grappler/utils/scc.cc
index f2a6507d94..d033e9c522 100644
--- a/tensorflow/core/grappler/utils/scc.cc
+++ b/tensorflow/core/grappler/utils/scc.cc
@@ -142,9 +142,13 @@ void StronglyConnectedComponents(
 
   // Create a list of top-level parents (add them to object queue)
   // Also create a mapping from nodes to their children.
+  // Inputs might not be present if called on a subgraph.
   for (const NodeDef& node : graph.node()) {
     for (const string& input : node.input()) {
-      name_to_data[NodeName(input)]->children.push_back(node_to_data[&node]);
+      auto it = name_to_data.find(NodeName(input));
+      if (it != name_to_data.end()) {
+        it->second->children.push_back(node_to_data[&node]);
+      }
     }
   }
 
@@ -202,10 +206,12 @@ int IdentifyLoops(const GraphDef& graph,
     const std::vector<const NodeDef*>& component_nodes = component.second;
     std::vector<std::pair<NodeDef*, string>> next_iter_nodes;
     GraphDef subgraph;
+    std::unordered_map<const NodeDef*, const NodeDef*> subgraph_mapping;
 
     for (const auto& component_node : component_nodes) {
       NodeDef* node = subgraph.add_node();
       *node = *component_node;
+      subgraph_mapping[node] = component_node;
       if (IsNextIteration(*node)) {
         CHECK_EQ(1, node->input_size());
         next_iter_nodes.emplace_back(node, node->input(0));
@@ -227,13 +233,13 @@ int IdentifyLoops(const GraphDef& graph,
         int num_components = 0;
         std::unordered_map<const NodeDef*, int> components;
         StronglyConnectedComponents(subgraph, &components, &num_components);
-        CHECK_EQ(1, num_components);
+        CHECK_GE(num_components, 1);
         for (const auto it : components) {
           int id = it.second;
           if (id < 0) {
             continue;
           }
-          (*loops)[it.first].push_back(loop_id);
+          (*loops)[subgraph_mapping[it.first]].push_back(loop_id);
         }
         ++loop_id;
       }
diff --git a/tensorflow/core/grappler/utils_test.cc b/tensorflow/core/grappler/utils_test.cc
index 49a1996d25..c6e035834c 100644
--- a/tensorflow/core/grappler/utils_test.cc
+++ b/tensorflow/core/grappler/utils_test.cc
@@ -16,7 +16,9 @@ limitations under the License.
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/platform/notification.h"
@@ -333,7 +335,9 @@ TEST_F(UtilsTest, NumNonControlOutputs) {
   EXPECT_EQ(1, NumNonControlDataOutputs(*add_node, node_map));
 }
 
-TEST_F(UtilsTest, DeleteNodes) {}
+TEST_F(UtilsTest, DeleteNodes) {
+  // TODO(rmlarsen): write forgtten test.
+}
 
 }  // namespace
 }  // namespace grappler
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index 07360d594b..6126e8b7ba 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -124,6 +124,7 @@ tf_kernel_library(
         ":bounds_check",
         ":dense_update_functor",
         ":ops_util",
+        ":training_op_helpers",
         ":variable_ops",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
@@ -882,7 +883,6 @@ tf_kernel_library(
         "tile_functor_gpu.cu.cc",
     ],
     prefix = "tile_ops",
-    textual_hdrs = ["tile_ops_gpu_impl.h"],
     deps = ARRAY_DEPS,
 )
 
@@ -1106,6 +1106,29 @@ tf_cc_test(
     ],
 )
 
+tf_cuda_cc_test(
+    name = "depthwise_conv_ops_test",
+    size = "small",
+    srcs = ["depthwise_conv_ops_test.cc"],
+    tags = ["requires-gpu-sm35"],
+    deps = [
+        ":conv_ops",
+        ":image",
+        ":ops_testutil",
+        ":ops_util",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:tensorflow",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
+
 tf_cc_test(
     name = "decode_wav_op_test",
     size = "small",
@@ -2087,6 +2110,7 @@ IMAGE_DEPS = [
     "//tensorflow/core:jpeg_internal",
     "//tensorflow/core:lib",
     "//tensorflow/core:lib_internal",
+    "//tensorflow/core:png_internal",
     "//tensorflow/core:protos_all_cc",
 ]
 
@@ -2323,6 +2347,22 @@ tf_cuda_cc_test(
 )
 
 tf_cuda_cc_test(
+    name = "crop_and_resize_op_benchmark_test",
+    srcs = ["crop_and_resize_op_benchmark_test.cc"],
+    deps = [
+        ":image",
+        ":ops_testutil",
+        ":ops_util",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
+
+tf_cuda_cc_test(
     name = "resize_benchmark_test",
     srcs = ["resize_op_benchmark_test.cc"],
     deps = [
@@ -2661,7 +2701,7 @@ tf_kernel_library(
 tf_kernel_library(
     name = "summary_image_op",
     prefix = "summary_image_op",
-    deps = LOGGING_DEPS,
+    deps = LOGGING_DEPS + ["//tensorflow/core:png_internal"],
 )
 
 tf_kernel_library(
@@ -2706,17 +2746,16 @@ cc_library(
     ],
 )
 
-MANIP_DEPS = [
-    "//tensorflow/core:framework",
-    "//tensorflow/core:lib",
-    "//tensorflow/core:manip_ops_op_lib",
-    "//third_party/eigen3",
-]
-
 tf_kernel_library(
     name = "roll_op",
     prefix = "roll_op",
-    deps = MANIP_DEPS,
+    deps = [
+        ":bounds_check",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:manip_ops_op_lib",
+        "//third_party/eigen3",
+    ],
 )
 
 tf_cc_test(
@@ -4827,6 +4866,8 @@ filegroup(
         "cast_op_impl_int64.cc",
         "cast_op_impl_int8.cc",
         "cast_op_impl_uint16.cc",
+        "cast_op_impl_uint32.cc",
+        "cast_op_impl_uint64.cc",
         "cast_op_impl_uint8.cc",
         "concat_lib.h",
         "concat_lib_cpu.cc",
@@ -5201,6 +5242,16 @@ filegroup(
     visibility = ["//visibility:public"],
 )
 
+ANDROID_TEXTUAL_HDRS = [
+    "gather_nd_op_cpu_impl.h",
+    "gemm_functors.h",
+    "mirror_pad_op_cpu_impl.h",
+    "scatter_nd_op_cpu_impl.h",
+    "slice_op_cpu_impl.h",
+    "strided_slice_op_impl.h",
+    "tile_ops_cpu_impl.h",
+]
+
 # A file group which contains nearly all available operators which
 # may work on Android. This is intended to be used with selective
 # registration.
@@ -5262,10 +5313,20 @@ filegroup(
             "batch_kernels.*",
             "regex_full_match_op.cc",
             "regex_replace_op.cc",
-        ],
+            # Ops that are inherently incompatible with Android (e.g. tied to x86 platform).
+            "mkl_*",
+            "xsmm_*",
+            "cwise_ops_sycl_common.h",
+        ] + ANDROID_TEXTUAL_HDRS,
     ),
     visibility = ["//visibility:public"],
 )
+
+filegroup(
+    name = "android_all_ops_textual_hdrs",
+    srcs = ANDROID_TEXTUAL_HDRS,
+    visibility = ["//visibility:public"],
+)
 # LINT.ThenChange(//tensorflow/contrib/makefile/tf_op_files.txt)
 
 cc_library(
@@ -5306,10 +5367,6 @@ cc_library(
     srcs = if_android(["decode_image_op.cc"]),
     copts = tf_copts(),
     linkopts = ["-ldl"],
-    tags = [
-        "manual",
-        "notap",
-    ],
     visibility = ["//visibility:public"],
     deps = [
         "//tensorflow/core:android_gif_internal",
@@ -6289,6 +6346,7 @@ tf_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core/util/proto:decode",
         "//tensorflow/core/util/proto:descriptors",
+        "//tensorflow/core/util/proto:proto_utils",
         "//third_party/eigen3",
     ],
 )
@@ -6301,6 +6359,7 @@ tf_kernel_library(
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core/util/proto:descriptors",
+        "//tensorflow/core/util/proto:proto_utils",
         "//third_party/eigen3",
     ],
 )
diff --git a/tensorflow/core/kernels/argmax_op.cc b/tensorflow/core/kernels/argmax_op.cc
index 49cd997fed..c731b64993 100644
--- a/tensorflow/core/kernels/argmax_op.cc
+++ b/tensorflow/core/kernels/argmax_op.cc
@@ -59,7 +59,7 @@ class ArgOp : public OpKernel {
 
     int axis = dim < 0 ? dim + input_dims : dim;
 
-    OP_REQUIRES(context, axis >= 0 && axis < input_dims,
+    OP_REQUIRES(context, FastBoundsCheck(axis, input_dims),
                 errors::InvalidArgument("Expected dimension in the range [",
                                         -input_dims, ", ", input_dims,
                                         "), but got ", dim));
@@ -76,6 +76,10 @@ class ArgOp : public OpKernel {
     Tensor* output = nullptr;
     OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));
 
+    if (output_shape.num_elements() == 0) {
+      return;
+    }
+
 #define HANDLE_DIM(NDIM)                                        \
   case NDIM:                                                    \
     ArgFunctor::Reduce##NDIM(context->eigen_device<Device>(),   \
diff --git a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
index b77c14d012..656b6ced6d 100644
--- a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
+++ b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
@@ -147,13 +147,21 @@ class AdaptiveSharedBatchScheduler
 
   // Tracks processing latency and adjusts in_flight_batches_limit to minimize.
   void CallbackWrapper(const internal::ASBSBatch<TaskType>* batch,
-                       BatchProcessor callback);
+                       BatchProcessor callback, bool is_express);
 
   // Schedules batch if in_flight_batches_limit_ is not met.
   void MaybeScheduleNextBatch() EXCLUSIVE_LOCKS_REQUIRED(mu_);
 
+  // Schedules the earliest closed batch in batches_
+  // if batch_thread_pool_ has an idle thead.
+  // Batches scheduled this way are called express batches.
+  // Express batches are not limited by in_flight_batches_limit_, and
+  // their latencies will not affect in_flight_batches_limit_.
+  void MaybeScheduleClosedBatch() EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
   // Notifies scheduler of non-empty batch which is eligible for processing.
-  void AddBatch(const internal::ASBSBatch<TaskType>* batch);
+  void AddBatch(const internal::ASBSBatch<TaskType>* batch,
+                bool also_schedule_closed_batch);
 
   // Removes queue from scheduler.
   void RemoveQueue(const internal::ASBSQueue<TaskType>* queue);
@@ -180,8 +188,10 @@ class AdaptiveSharedBatchScheduler
   // results in an actual cap of 3 80% of the time, and 4 20% of the time.
   double in_flight_batches_limit_ GUARDED_BY(mu_);
 
-  // Number of batches currently being processed.
+  // Number of regular batches currently being processed.
   int64 in_flight_batches_ GUARDED_BY(mu_) = 0;
+  // Number of express batches currently being processed.
+  int64 in_flight_express_batches_ GUARDED_BY(mu_) = 0;
 
   // RNG engine and distribution.
   std::default_random_engine rand_engine_;
@@ -363,10 +373,14 @@ Status AdaptiveSharedBatchScheduler<TaskType>::AddQueue(
 
 template <typename TaskType>
 void AdaptiveSharedBatchScheduler<TaskType>::AddBatch(
-    const internal::ASBSBatch<TaskType>* batch) {
+    const internal::ASBSBatch<TaskType>* batch,
+    bool also_schedule_closed_batch) {
   mutex_lock l(mu_);
   batches_.push_back(batch);
   MaybeScheduleNextBatch();
+  if (also_schedule_closed_batch) {
+    MaybeScheduleClosedBatch();
+  }
 }
 
 template <typename TaskType>
@@ -407,19 +421,45 @@ void AdaptiveSharedBatchScheduler<TaskType>::MaybeScheduleNextBatch() {
   batch->queue()->ReleaseBatch(batch);
   batch_thread_pool_->Schedule(
       std::bind(&AdaptiveSharedBatchScheduler<TaskType>::CallbackWrapper, this,
-                batch, queues_and_callbacks_[batch->queue()]));
+                batch, queues_and_callbacks_[batch->queue()], false));
   in_flight_batches_++;
 }
 
 template <typename TaskType>
+void AdaptiveSharedBatchScheduler<TaskType>::MaybeScheduleClosedBatch() {
+  if (in_flight_batches_ + in_flight_express_batches_ >=
+      options_.num_batch_threads) {
+    return;
+  }
+  for (auto it = batches_.begin(); it != batches_.end(); it++) {
+    if ((*it)->IsClosed()) {
+      const internal::ASBSBatch<TaskType>* batch = *it;
+      batches_.erase(it);
+      batch->queue()->ReleaseBatch(batch);
+      batch_thread_pool_->Schedule(
+          std::bind(&AdaptiveSharedBatchScheduler<TaskType>::CallbackWrapper,
+                    this, batch, queues_and_callbacks_[batch->queue()], true));
+      in_flight_express_batches_++;
+      return;
+    }
+  }
+}
+
+template <typename TaskType>
 void AdaptiveSharedBatchScheduler<TaskType>::CallbackWrapper(
     const internal::ASBSBatch<TaskType>* batch,
-    AdaptiveSharedBatchScheduler<TaskType>::BatchProcessor callback) {
+    AdaptiveSharedBatchScheduler<TaskType>::BatchProcessor callback,
+    bool is_express) {
   int64 start_time = batch->creation_time_micros();
   callback(std::unique_ptr<Batch<TaskType>>(
       const_cast<internal::ASBSBatch<TaskType>*>(batch)));
   int64 end_time = GetEnv()->NowMicros();
   mutex_lock l(mu_);
+  if (is_express) {
+    in_flight_express_batches_--;
+    MaybeScheduleClosedBatch();
+    return;
+  }
   in_flight_batches_--;
   batch_count_++;
   batch_latency_sum_ += end_time - start_time;
@@ -496,6 +536,7 @@ Status ASBSQueue<TaskType>::Schedule(std::unique_ptr<TaskType>* task) {
                                    " is larger than maximum batch size ",
                                    options_.max_batch_size);
   }
+  bool is_old_batch_closed = false;
   {
     mutex_lock l(mu_);
     // Current batch is full, create another if allowed.
@@ -505,6 +546,7 @@ Status ASBSQueue<TaskType>::Schedule(std::unique_ptr<TaskType>* task) {
         return errors::Unavailable("The batch scheduling queue is full");
       }
       current_batch_->Close();
+      is_old_batch_closed = true;
       current_batch_ = nullptr;
     }
     if (!current_batch_) {
@@ -516,7 +558,8 @@ Status ASBSQueue<TaskType>::Schedule(std::unique_ptr<TaskType>* task) {
     num_enqueued_tasks_++;
   }
   // AddBatch must be called outside of lock, since it may call ReleaseBatch.
-  if (new_batch != nullptr) scheduler_->AddBatch(new_batch);
+  if (new_batch != nullptr)
+    scheduler_->AddBatch(new_batch, is_old_batch_closed);
   return Status::OK();
 }
 
diff --git a/tensorflow/core/kernels/boosted_trees/BUILD b/tensorflow/core/kernels/boosted_trees/BUILD
index 0244f3cd8d..4910021c63 100644
--- a/tensorflow/core/kernels/boosted_trees/BUILD
+++ b/tensorflow/core/kernels/boosted_trees/BUILD
@@ -45,6 +45,11 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "tree_helper",
+    hdrs = ["tree_helper.h"],
+)
+
 tf_kernel_library(
     name = "resource_ops",
     srcs = ["resource_ops.cc"],
@@ -61,6 +66,7 @@ tf_kernel_library(
     name = "stats_ops",
     srcs = ["stats_ops.cc"],
     deps = [
+        ":tree_helper",
         "//tensorflow/core:boosted_trees_ops_op_lib",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
@@ -72,6 +78,7 @@ tf_kernel_library(
     srcs = ["training_ops.cc"],
     deps = [
         ":resources",
+        ":tree_helper",
         "//tensorflow/core:boosted_trees_ops_op_lib",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
diff --git a/tensorflow/core/kernels/boosted_trees/prediction_ops.cc b/tensorflow/core/kernels/boosted_trees/prediction_ops.cc
index 2920132a27..b2efa06941 100644
--- a/tensorflow/core/kernels/boosted_trees/prediction_ops.cc
+++ b/tensorflow/core/kernels/boosted_trees/prediction_ops.cc
@@ -104,8 +104,8 @@ class BoostedTreesTrainingPredictOp : public OpKernel {
     const int32 latest_tree = resource->num_trees() - 1;
 
     if (latest_tree < 0) {
-      // Ensemble was empty. Nothing changes.
-      output_node_ids = cached_node_ids;
+      // Ensemble was empty. Output the very first node.
+      output_node_ids.setZero();
       output_tree_ids = cached_tree_ids;
       // All the predictions are zeros.
       output_partial_logits.setZero();
@@ -120,16 +120,20 @@ class BoostedTreesTrainingPredictOp : public OpKernel {
           int32 node_id = cached_node_ids(i);
           float partial_tree_logit = 0.0;
 
-          // If the tree was pruned, returns the node id into which the
-          // current_node_id was pruned, as well the correction of the cached
-          // logit prediction.
-          resource->GetPostPruneCorrection(tree_id, node_id, &node_id,
-                                           &partial_tree_logit);
-
-          // Logic in the loop adds the cached node value again if it is a leaf.
-          // If it is not a leaf anymore we need to subtract the old node's
-          // value. The following logic handles both of these cases.
-          partial_tree_logit -= resource->node_value(tree_id, node_id);
+          if (node_id >= 0) {
+            // If the tree was pruned, returns the node id into which the
+            // current_node_id was pruned, as well the correction of the cached
+            // logit prediction.
+            resource->GetPostPruneCorrection(tree_id, node_id, &node_id,
+                                             &partial_tree_logit);
+            // Logic in the loop adds the cached node value again if it is a
+            // leaf. If it is not a leaf anymore we need to subtract the old
+            // node's value. The following logic handles both of these cases.
+            partial_tree_logit -= resource->node_value(tree_id, node_id);
+          } else {
+            // No cache exists, start from the very first node.
+            node_id = 0;
+          }
           float partial_all_logit = 0.0;
           while (true) {
             if (resource->is_leaf(tree_id, node_id)) {
diff --git a/tensorflow/core/kernels/boosted_trees/resources.cc b/tensorflow/core/kernels/boosted_trees/resources.cc
index c410748c27..cc90bb2f45 100644
--- a/tensorflow/core/kernels/boosted_trees/resources.cc
+++ b/tensorflow/core/kernels/boosted_trees/resources.cc
@@ -21,6 +21,10 @@ limitations under the License.
 
 namespace tensorflow {
 
+namespace {
+constexpr float kLayerByLayerTreeWeight = 1.0;
+}  // namespace
+
 // Constructor.
 BoostedTreesEnsembleResource::BoostedTreesEnsembleResource()
     : tree_ensemble_(
@@ -78,6 +82,16 @@ float BoostedTreesEnsembleResource::node_value(const int32 tree_id,
   }
 }
 
+void BoostedTreesEnsembleResource::set_node_value(const int32 tree_id,
+                                                  const int32 node_id,
+                                                  const float logits) {
+  DCHECK_LT(tree_id, tree_ensemble_->trees_size());
+  DCHECK_LT(node_id, tree_ensemble_->trees(tree_id).nodes_size());
+  auto* node = tree_ensemble_->mutable_trees(tree_id)->mutable_nodes(node_id);
+  DCHECK(node->node_case() == boosted_trees::Node::kLeaf);
+  node->mutable_leaf()->set_scalar(logits);
+}
+
 int32 BoostedTreesEnsembleResource::GetNumLayersGrown(
     const int32 tree_id) const {
   DCHECK_LT(tree_id, tree_ensemble_->trees_size());
@@ -204,9 +218,14 @@ void BoostedTreesEnsembleResource::UpdateGrowingMetadata() const {
 
 // Add a tree to the ensemble and returns a new tree_id.
 int32 BoostedTreesEnsembleResource::AddNewTree(const float weight) {
+  return AddNewTreeWithLogits(weight, 0.0);
+}
+
+int32 BoostedTreesEnsembleResource::AddNewTreeWithLogits(const float weight,
+                                                         const float logits) {
   const int32 new_tree_id = tree_ensemble_->trees_size();
   auto* node = tree_ensemble_->add_trees()->add_nodes();
-  node->mutable_leaf()->set_scalar(0.0);
+  node->mutable_leaf()->set_scalar(logits);
   tree_ensemble_->add_tree_weights(weight);
   tree_ensemble_->add_tree_metadata();
 
@@ -225,7 +244,7 @@ void BoostedTreesEnsembleResource::AddBucketizedSplitNode(
   *right_node_id = *left_node_id + 1;
   auto* left_node = tree->add_nodes();
   auto* right_node = tree->add_nodes();
-  if (node_id != 0) {
+  if (node_id != 0 || (node->has_leaf() && node->leaf().scalar() != 0)) {
     // Save previous leaf value if it is not the first leaf in the tree.
     node->mutable_metadata()->mutable_original_leaf()->Swap(
         node->mutable_leaf());
diff --git a/tensorflow/core/kernels/boosted_trees/resources.h b/tensorflow/core/kernels/boosted_trees/resources.h
index df78d3f275..f961ed3814 100644
--- a/tensorflow/core/kernels/boosted_trees/resources.h
+++ b/tensorflow/core/kernels/boosted_trees/resources.h
@@ -70,6 +70,9 @@ class BoostedTreesEnsembleResource : public StampedResource {
 
   float node_value(const int32 tree_id, const int32 node_id) const;
 
+  void set_node_value(const int32 tree_id, const int32 node_id,
+                      const float logits);
+
   int32 GetNumLayersGrown(const int32 tree_id) const;
 
   void SetNumLayersGrown(const int32 tree_id, int32 new_num_layers) const;
@@ -99,6 +102,9 @@ class BoostedTreesEnsembleResource : public StampedResource {
   // Add a tree to the ensemble and returns a new tree_id.
   int32 AddNewTree(const float weight);
 
+  // Adds new tree with one node to the ensemble and sets node's value to logits
+  int32 AddNewTreeWithLogits(const float weight, const float logits);
+
   // Grows the tree by adding a split and leaves.
   void AddBucketizedSplitNode(const int32 tree_id, const int32 node_id,
                               const int32 feature_id, const int32 threshold,
diff --git a/tensorflow/core/kernels/boosted_trees/stats_ops.cc b/tensorflow/core/kernels/boosted_trees/stats_ops.cc
index 48afd3fbf3..64ec1caa9c 100644
--- a/tensorflow/core/kernels/boosted_trees/stats_ops.cc
+++ b/tensorflow/core/kernels/boosted_trees/stats_ops.cc
@@ -17,13 +17,10 @@ limitations under the License.
 
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/kernels/boosted_trees/tree_helper.h"
 
 namespace tensorflow {
 
-namespace {
-const float kEps = 1e-15;
-}  // namespace
-
 class BoostedTreesCalculateBestGainsPerFeatureOp : public OpKernel {
  public:
   explicit BoostedTreesCalculateBestGainsPerFeatureOp(
@@ -139,7 +136,7 @@ class BoostedTreesCalculateBestGainsPerFeatureOp : public OpKernel {
                                    total_hess - cum_hess_bucket, l1, l2,
                                    &contrib_for_right, &gain_for_right);
 
-          if (gain_for_left + gain_for_right > best_gain) {
+          if (GainIsLarger(gain_for_left + gain_for_right, best_gain)) {
             best_gain = gain_for_left + gain_for_right;
             best_bucket = bucket;
             best_contrib_for_left = contrib_for_left;
@@ -200,40 +197,6 @@ class BoostedTreesCalculateBestGainsPerFeatureOp : public OpKernel {
   }
 
  private:
-  void CalculateWeightsAndGains(const float g, const float h, const float l1,
-                                const float l2, float* weight, float* gain) {
-    //
-    // The formula for weight is -(g+l1*sgn(w))/(H+l2), for gain it is
-    // (g+l1*sgn(w))^2/(h+l2).
-    // This is because for each leaf we optimize
-    // 1/2(h+l2)*w^2+g*w+l1*abs(w)
-    float g_with_l1 = g;
-    // Apply L1 regularization.
-    // 1) Assume w>0 => w=-(g+l1)/(h+l2)=> g+l1 < 0 => g < -l1
-    // 2) Assume w<0 => w=-(g-l1)/(h+l2)=> g-l1 > 0 => g > l1
-    // For g from (-l1, l1), thus there is no solution => set to 0.
-    if (l1 > 0) {
-      if (g > l1) {
-        g_with_l1 -= l1;
-      } else if (g < -l1) {
-        g_with_l1 += l1;
-      } else {
-        *weight = 0.0;
-        *gain = 0.0;
-        return;
-      }
-    }
-    // Apply L2 regularization.
-    if (h + l2 <= kEps) {
-      // Avoid division by 0 or infinitesimal.
-      *weight = 0;
-      *gain = 0;
-    } else {
-      *weight = -g_with_l1 / (h + l2);
-      *gain = -g_with_l1 * (*weight);
-    }
-  }
-
   int max_splits_;
   int num_features_;
 };
diff --git a/tensorflow/core/kernels/boosted_trees/training_ops.cc b/tensorflow/core/kernels/boosted_trees/training_ops.cc
index a14fd4a133..973cdec13a 100644
--- a/tensorflow/core/kernels/boosted_trees/training_ops.cc
+++ b/tensorflow/core/kernels/boosted_trees/training_ops.cc
@@ -16,11 +16,13 @@ limitations under the License.
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor_shape.h"
 #include "tensorflow/core/kernels/boosted_trees/resources.h"
+#include "tensorflow/core/kernels/boosted_trees/tree_helper.h"
 
 namespace tensorflow {
 
 namespace {
 constexpr float kLayerByLayerTreeWeight = 1.0;
+constexpr float kMinDeltaForCenterBias = 0.01;
 
 // TODO(nponomareva, youngheek): consider using vector.
 struct SplitCandidate {
@@ -89,7 +91,8 @@ class BoostedTreesUpdateEnsembleOp : public OpKernel {
 
     // Find best splits for each active node.
     std::map<int32, SplitCandidate> best_splits;
-    FindBestSplitsPerNode(context, node_ids_list, gains_list, &best_splits);
+    FindBestSplitsPerNode(context, node_ids_list, gains_list, feature_ids,
+                          &best_splits);
 
     int32 current_tree =
         UpdateGlobalAttemptsAndRetrieveGrowableTree(ensemble_resource);
@@ -193,6 +196,7 @@ class BoostedTreesUpdateEnsembleOp : public OpKernel {
   void FindBestSplitsPerNode(
       OpKernelContext* const context, const OpInputList& node_ids_list,
       const OpInputList& gains_list,
+      const TTypes<const int32>::Vec& feature_ids,
       std::map<int32, SplitCandidate>* best_split_per_node) {
     // Find best split per node going through every feature candidate.
     for (int64 feature_idx = 0; feature_idx < num_features_; ++feature_idx) {
@@ -211,8 +215,18 @@ class BoostedTreesUpdateEnsembleOp : public OpKernel {
         candidate.candidate_idx = candidate_idx;
         candidate.gain = gain;
 
-        if (best_split_it == best_split_per_node->end() ||
-            gain > best_split_it->second.gain) {
+        if (TF_PREDICT_FALSE(best_split_it != best_split_per_node->end() &&
+                             GainsAreEqual(gain, best_split_it->second.gain))) {
+          const auto best_candidate = (*best_split_per_node)[node_id];
+          const int32 best_feature_id = feature_ids(best_candidate.feature_idx);
+          const int32 feature_id = feature_ids(candidate.feature_idx);
+          VLOG(2) << "Breaking ties on feature ids and buckets";
+          // Breaking ties deterministically.
+          if (feature_id < best_feature_id) {
+            (*best_split_per_node)[node_id] = candidate;
+          }
+        } else if (best_split_it == best_split_per_node->end() ||
+                   GainIsLarger(gain, best_split_it->second.gain)) {
           (*best_split_per_node)[node_id] = candidate;
         }
       }
@@ -227,4 +241,69 @@ class BoostedTreesUpdateEnsembleOp : public OpKernel {
 REGISTER_KERNEL_BUILDER(Name("BoostedTreesUpdateEnsemble").Device(DEVICE_CPU),
                         BoostedTreesUpdateEnsembleOp);
 
+class BoostedTreesCenterBiasOp : public OpKernel {
+ public:
+  explicit BoostedTreesCenterBiasOp(OpKernelConstruction* const context)
+      : OpKernel(context) {}
+
+  void Compute(OpKernelContext* const context) override {
+    // Get decision tree ensemble.
+    BoostedTreesEnsembleResource* ensemble_resource;
+    OP_REQUIRES_OK(context, LookupResource(context, HandleFromInput(context, 0),
+                                           &ensemble_resource));
+    core::ScopedUnref unref_me(ensemble_resource);
+    mutex_lock l(*ensemble_resource->get_mutex());
+    // Increase the ensemble stamp.
+    ensemble_resource->set_stamp(ensemble_resource->stamp() + 1);
+
+    // Read means of hessians and gradients
+    const Tensor* mean_gradients_t;
+    OP_REQUIRES_OK(context,
+                   context->input("mean_gradients", &mean_gradients_t));
+
+    const Tensor* mean_hessians_t;
+    OP_REQUIRES_OK(context, context->input("mean_hessians", &mean_hessians_t));
+
+    // Get the regularization options.
+    const Tensor* l1_t;
+    OP_REQUIRES_OK(context, context->input("l1", &l1_t));
+    const auto l1 = l1_t->scalar<float>()();
+    const Tensor* l2_t;
+    OP_REQUIRES_OK(context, context->input("l2", &l2_t));
+    const auto l2 = l2_t->scalar<float>()();
+
+    // For now, assume 1-dimensional weight on leaves.
+    float logits;
+    float unused_gain;
+
+    // TODO(nponomareva): change this when supporting multiclass.
+    const float gradients_mean = mean_gradients_t->flat<float>()(0);
+    const float hessians_mean = mean_hessians_t->flat<float>()(0);
+    CalculateWeightsAndGains(gradients_mean, hessians_mean, l1, l2, &logits,
+                             &unused_gain);
+
+    float current_bias = 0.0;
+    bool continue_centering = true;
+    if (ensemble_resource->num_trees() == 0) {
+      ensemble_resource->AddNewTreeWithLogits(kLayerByLayerTreeWeight, logits);
+      current_bias = logits;
+    } else {
+      current_bias = ensemble_resource->node_value(0, 0);
+      continue_centering =
+          std::abs(logits / current_bias) > kMinDeltaForCenterBias;
+      current_bias += logits;
+      ensemble_resource->set_node_value(0, 0, current_bias);
+    }
+
+    Tensor* continue_centering_t = nullptr;
+    OP_REQUIRES_OK(
+        context, context->allocate_output("continue_centering", TensorShape({}),
+                                          &continue_centering_t));
+    // Check if we need to continue centering bias.
+    continue_centering_t->scalar<bool>()() = continue_centering;
+  }
+};
+REGISTER_KERNEL_BUILDER(Name("BoostedTreesCenterBias").Device(DEVICE_CPU),
+                        BoostedTreesCenterBiasOp);
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/boosted_trees/tree_helper.h b/tensorflow/core/kernels/boosted_trees/tree_helper.h
new file mode 100644
index 0000000000..8b18d9e5f8
--- /dev/null
+++ b/tensorflow/core/kernels/boosted_trees/tree_helper.h
@@ -0,0 +1,69 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_KERNELS_BOOSTED_TREES_TREE_HELPER_H_
+#define TENSORFLOW_CORE_KERNELS_BOOSTED_TREES_TREE_HELPER_H_
+#include <cmath>
+
+namespace tensorflow {
+
+static bool GainsAreEqual(const float g1, const float g2) {
+  const float kTolerance = 1e-15;
+  return std::abs(g1 - g2) < kTolerance;
+}
+
+static bool GainIsLarger(const float g1, const float g2) {
+  const float kTolerance = 1e-15;
+  return g1 - g2 >= kTolerance;
+}
+
+static void CalculateWeightsAndGains(const float g, const float h,
+                                     const float l1, const float l2,
+                                     float* weight, float* gain) {
+  const float kEps = 1e-15;
+  // The formula for weight is -(g+l1*sgn(w))/(H+l2), for gain it is
+  // (g+l1*sgn(w))^2/(h+l2).
+  // This is because for each leaf we optimize
+  // 1/2(h+l2)*w^2+g*w+l1*abs(w)
+  float g_with_l1 = g;
+  // Apply L1 regularization.
+  // 1) Assume w>0 => w=-(g+l1)/(h+l2)=> g+l1 < 0 => g < -l1
+  // 2) Assume w<0 => w=-(g-l1)/(h+l2)=> g-l1 > 0 => g > l1
+  // For g from (-l1, l1), thus there is no solution => set to 0.
+  if (l1 > 0) {
+    if (g > l1) {
+      g_with_l1 -= l1;
+    } else if (g < -l1) {
+      g_with_l1 += l1;
+    } else {
+      *weight = 0.0;
+      *gain = 0.0;
+      return;
+    }
+  }
+  // Apply L2 regularization.
+  if (h + l2 <= kEps) {
+    // Avoid division by 0 or infinitesimal.
+    *weight = 0;
+    *gain = 0;
+  } else {
+    *weight = -g_with_l1 / (h + l2);
+    *gain = -g_with_l1 * (*weight);
+  }
+}
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_KERNELS_BOOSTED_TREES_TREE_HELPER_H_
diff --git a/tensorflow/core/kernels/cast_op.cc b/tensorflow/core/kernels/cast_op.cc
index 626db9131a..0478c93280 100644
--- a/tensorflow/core/kernels/cast_op.cc
+++ b/tensorflow/core/kernels/cast_op.cc
@@ -41,8 +41,10 @@ typedef Eigen::SyclDevice SYCLDevice;
 #define CURRY_TYPES2(FN, arg0)   \
   FN(arg0, bool);                \
   FN(arg0, uint8);               \
-  FN(arg0, int8);                \
   FN(arg0, uint16);              \
+  FN(arg0, uint32);              \
+  FN(arg0, uint64);              \
+  FN(arg0, int8);                \
   FN(arg0, int16);               \
   FN(arg0, int32);               \
   FN(arg0, int64);               \
@@ -53,8 +55,41 @@ typedef Eigen::SyclDevice SYCLDevice;
   FN(arg0, std::complex<double>)
 
 CastOpBase::CastOpBase(OpKernelConstruction* ctx) : OpKernel(ctx) {
-  OP_REQUIRES_OK(ctx, ctx->GetAttr("SrcT", &src_dtype_));
-  OP_REQUIRES_OK(ctx, ctx->GetAttr("DstT", &dst_dtype_));
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("SrcT", &external_src_dtype_));
+
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("DstT", &external_dst_dtype_));
+
+  OP_REQUIRES_OK(ctx, ctx->GetAttr("Truncate", &use_truncation_));
+
+  // Quantized data types use the same underlying format as their non quantized
+  // version so we use the non quantized implementation for casting.
+  if (external_dst_dtype_ == DT_QUINT8) {
+    dst_dtype_ = DT_UINT8;
+  } else if (external_dst_dtype_ == DT_QINT8) {
+    dst_dtype_ = DT_INT8;
+  } else if (external_dst_dtype_ == DT_QINT32) {
+    dst_dtype_ = DT_INT32;
+  } else if (external_dst_dtype_ == DT_QINT16) {
+    dst_dtype_ = DT_INT16;
+  } else if (external_dst_dtype_ == DT_QUINT16) {
+    dst_dtype_ = DT_UINT16;
+  } else {
+    dst_dtype_ = external_dst_dtype_;
+  }
+
+  if (external_src_dtype_ == DT_QUINT8) {
+    src_dtype_ = DT_UINT8;
+  } else if (external_src_dtype_ == DT_QINT8) {
+    src_dtype_ = DT_INT8;
+  } else if (external_src_dtype_ == DT_QINT32) {
+    src_dtype_ = DT_INT32;
+  } else if (external_src_dtype_ == DT_QINT16) {
+    src_dtype_ = DT_INT16;
+  } else if (external_src_dtype_ == DT_QUINT16) {
+    src_dtype_ = DT_UINT16;
+  } else {
+    src_dtype_ = external_src_dtype_;
+  }
 }
 
 void CastOpBase::Compute(OpKernelContext* ctx) {
@@ -62,15 +97,20 @@ void CastOpBase::Compute(OpKernelContext* ctx) {
   if (work_ == nullptr) {
     ctx->set_output(0, inp);
   } else {
+    Tensor in;
+    in.UnsafeCopyFromInternal(inp, src_dtype_, inp.shape());
     Tensor* out = nullptr;
-    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, inp.shape(), &out));
-    work_(ctx, inp, out);
+    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, in.shape(), &out));
+    out->set_dtype(dst_dtype_);
+    work_(ctx, in, out, use_truncation_);
+    out->set_dtype(external_dst_dtype_);
   }
 }
 
 Status CastOpBase::Unimplemented() {
-  return errors::Unimplemented("Cast ", DataTypeString(src_dtype_), " to ",
-                               DataTypeString(dst_dtype_), " is not supported");
+  return errors::Unimplemented("Cast ", DataTypeString(external_src_dtype_),
+                               " to ", DataTypeString(external_dst_dtype_),
+                               " is not supported");
 }
 
 CpuCastOp::CpuCastOp(OpKernelConstruction* ctx) : CastOpBase(ctx) {
@@ -78,7 +118,7 @@ CpuCastOp::CpuCastOp(OpKernelConstruction* ctx) : CastOpBase(ctx) {
 }
 
 Status CpuCastOp::Prepare() {
-  if (src_dtype_ == dst_dtype_) {
+  if (external_src_dtype_ == external_dst_dtype_) {
     work_ = nullptr;  // Identity
     return Status::OK();
   }
@@ -86,10 +126,14 @@ Status CpuCastOp::Prepare() {
     work_ = GetCpuCastFromBool(dst_dtype_);
   } else if (src_dtype_ == DT_UINT8) {
     work_ = GetCpuCastFromUint8(dst_dtype_);
-  } else if (src_dtype_ == DT_INT8) {
-    work_ = GetCpuCastFromInt8(dst_dtype_);
   } else if (src_dtype_ == DT_UINT16) {
     work_ = GetCpuCastFromUint16(dst_dtype_);
+  } else if (src_dtype_ == DT_UINT32) {
+    work_ = GetCpuCastFromUint32(dst_dtype_);
+  } else if (src_dtype_ == DT_UINT64) {
+    work_ = GetCpuCastFromUint64(dst_dtype_);
+  } else if (src_dtype_ == DT_INT8) {
+    work_ = GetCpuCastFromInt8(dst_dtype_);
   } else if (src_dtype_ == DT_INT16) {
     work_ = GetCpuCastFromInt16(dst_dtype_);
   } else if (src_dtype_ == DT_INT32) {
@@ -127,7 +171,7 @@ class GpuCastOp : public CastOpBase {
 
  private:
   Status Prepare() {
-    if (src_dtype_ == dst_dtype_) {
+    if (external_src_dtype_ == external_dst_dtype_) {
       work_ = nullptr;  // Identity
       return Status::OK();
     }
@@ -135,10 +179,14 @@ class GpuCastOp : public CastOpBase {
       work_ = GetGpuCastFromBool(dst_dtype_);
     } else if (src_dtype_ == DT_UINT8) {
       work_ = GetGpuCastFromUint8(dst_dtype_);
-    } else if (src_dtype_ == DT_INT8) {
-      work_ = GetGpuCastFromInt8(dst_dtype_);
     } else if (src_dtype_ == DT_UINT16) {
       work_ = GetGpuCastFromUint16(dst_dtype_);
+    } else if (src_dtype_ == DT_UINT32) {
+      work_ = GetGpuCastFromUint32(dst_dtype_);
+    } else if (src_dtype_ == DT_UINT64) {
+      work_ = GetGpuCastFromUint64(dst_dtype_);
+    } else if (src_dtype_ == DT_INT8) {
+      work_ = GetGpuCastFromInt8(dst_dtype_);
     } else if (src_dtype_ == DT_INT16) {
       work_ = GetGpuCastFromInt16(dst_dtype_);
     } else if (src_dtype_ == DT_INT32) {
@@ -178,8 +226,10 @@ REGISTER_KERNEL_BUILDER(Name("Cast").Device(DEVICE_CPU), CpuCastOp);
 
 CURRY_TYPES2(REGISTER_CAST_GPU, bool);
 CURRY_TYPES2(REGISTER_CAST_GPU, uint8);
-CURRY_TYPES2(REGISTER_CAST_GPU, int8);
 CURRY_TYPES2(REGISTER_CAST_GPU, uint16);
+CURRY_TYPES2(REGISTER_CAST_GPU, uint32);
+CURRY_TYPES2(REGISTER_CAST_GPU, uint64);
+CURRY_TYPES2(REGISTER_CAST_GPU, int8);
 CURRY_TYPES2(REGISTER_CAST_GPU, int16);
 CURRY_TYPES2(REGISTER_CAST_GPU, int32);
 CURRY_TYPES2(REGISTER_CAST_GPU, int64);
@@ -203,7 +253,7 @@ class SyclCastOp : public CastOpBase {
 
  private:
   Status Prepare() {
-    if (src_dtype_ == dst_dtype_) {
+    if (external_src_dtype_ == external_dst_dtype_) {
       work_ = nullptr;  // Identity
       return Status::OK();
     }
diff --git a/tensorflow/core/kernels/cast_op.h b/tensorflow/core/kernels/cast_op.h
index 16d2e0e0a5..527ab528c9 100644
--- a/tensorflow/core/kernels/cast_op.h
+++ b/tensorflow/core/kernels/cast_op.h
@@ -24,8 +24,71 @@ limitations under the License.
 #include "tensorflow/core/platform/byte_order.h"
 #include "tensorflow/core/platform/types.h"
 
+// Note that the GPU cast functor templates need to be instantiated unlike the
+// CPU ones, and hence their specializations are different than that for CPUs.
+#ifdef SPECIALIZE_FOR_GPUS
+#define SPECIALIZE_CAST(DEVICE, OUT_TYPE, IN_OUT)                   \
+  template <typename Device>                                        \
+  struct CastFunctor<Device, OUT_TYPE, IN_OUT> {                    \
+    void operator()(const Device& d,                                \
+                    typename TTypes<OUT_TYPE>::Flat out_tensor,     \
+                    typename TTypes<IN_OUT>::ConstFlat in_tensor,   \
+                    bool truncate = false) {                        \
+      if (truncate) {                                               \
+        out_tensor.device(d) =                                      \
+            in_tensor.unaryExpr(LSBZeroSetter<IN_OUT, OUT_TYPE>())  \
+                .template cast<OUT_TYPE>();                         \
+      } else {                                                      \
+        out_tensor.device(d) = in_tensor.template cast<OUT_TYPE>(); \
+      }                                                             \
+    }                                                               \
+  };                                                                \
+  template struct CastFunctor<DEVICE, OUT_TYPE, IN_OUT>;
+#else
+#define SPECIALIZE_CAST(DEVICE, OUT_TYPE, IN_OUT)                   \
+  template <>                                                       \
+  struct CastFunctor<DEVICE, OUT_TYPE, IN_OUT> {                    \
+    void operator()(const DEVICE& d,                                \
+                    typename TTypes<OUT_TYPE>::Flat out_tensor,     \
+                    typename TTypes<IN_OUT>::ConstFlat in_tensor,   \
+                    bool truncate = false) {                        \
+      if (truncate) {                                               \
+        out_tensor.device(d) =                                      \
+            in_tensor.unaryExpr(LSBZeroSetter<IN_OUT, OUT_TYPE>())  \
+                .template cast<OUT_TYPE>();                         \
+      } else {                                                      \
+        out_tensor.device(d) = in_tensor.template cast<OUT_TYPE>(); \
+      }                                                             \
+    }                                                               \
+  };
+#endif
+
+#define CAST_FUNCTORS(devname)                                        \
+  SPECIALIZE_CAST(devname, float, double)                             \
+  SPECIALIZE_CAST(devname, float, std::complex<double>)               \
+  SPECIALIZE_CAST(devname, std::complex<float>, std::complex<double>) \
+  SPECIALIZE_CAST(devname, std::complex<float>, double)               \
+  SPECIALIZE_CAST(devname, Eigen::half, double)                       \
+  SPECIALIZE_CAST(devname, Eigen::half, float)                        \
+  SPECIALIZE_CAST(devname, Eigen::half, std::complex<double>)         \
+  SPECIALIZE_CAST(devname, Eigen::half, std::complex<float>)          \
+  SPECIALIZE_CAST(devname, bfloat16, float)                           \
+  template <typename OUT_TYPE, typename IN_OUT>                       \
+  struct CastFunctor<devname, OUT_TYPE, IN_OUT> {                     \
+    void operator()(const devname& d,                                 \
+                    typename TTypes<OUT_TYPE>::Flat out_tensor,       \
+                    typename TTypes<IN_OUT>::ConstFlat in_tensor,     \
+                    bool truncate = false) {                          \
+      out_tensor.device(d) = in_tensor.template cast<OUT_TYPE>();     \
+    }                                                                 \
+  };
+
 namespace tensorflow {
 
+typedef std::function<void(OpKernelContext*, const Tensor&, Tensor*,
+                           bool trunc)>
+    CastFunctorType;
+
 // Common base class of Cast kernels
 class CastOpBase : public OpKernel {
  public:
@@ -36,8 +99,10 @@ class CastOpBase : public OpKernel {
  protected:
   DataType src_dtype_;
   DataType dst_dtype_;
-  std::function<void(OpKernelContext*, const Tensor&, Tensor*)> work_ = nullptr;
-
+  DataType external_src_dtype_;
+  DataType external_dst_dtype_;
+  bool use_truncation_;
+  CastFunctorType work_ = nullptr;
   Status Unimplemented();
 
   TF_DISALLOW_COPY_AND_ASSIGN(CastOpBase);
@@ -54,6 +119,23 @@ class CpuCastOp : public CastOpBase {
 
 namespace functor {
 
+template <typename I>
+constexpr int MantissaWidth() {
+  return std::numeric_limits<I>::digits;
+}
+
+template <>
+constexpr int MantissaWidth<Eigen::half>() {
+  // Remember, there's 1 hidden bit
+  return 10 + 1;
+}
+
+template <>
+constexpr int MantissaWidth<bfloat16>() {
+  // Remember, there's 1 hidden bit
+  return 7 + 1;
+}
+
 template <typename Device, typename Tout, typename Tin>
 void Cast(const Device& d, typename TTypes<Tout>::Flat o,
           typename TTypes<Tin>::ConstFlat i) {
@@ -63,7 +145,85 @@ void Cast(const Device& d, typename TTypes<Tout>::Flat o,
 template <typename Device, typename Tout, typename Tin>
 struct CastFunctor {
   void operator()(const Device& d, typename TTypes<Tout>::Flat o,
-                  typename TTypes<Tin>::ConstFlat i);
+                  typename TTypes<Tin>::ConstFlat i, bool truncate = false);
+};
+
+// Only enable LSBZeroSetterHelper for 64 and 32 bit input data types.
+// Specialize for others if needed in future.
+template <typename I>
+typename std::enable_if<sizeof(I) == 8, void>::type EIGEN_DEVICE_FUNC
+    EIGEN_STRONG_INLINE static LSBZeroSetterHelper(I& t, int n) {
+  // Only zero the bits for non-NaNs.
+  // For NaNs, let the non-truncation version handle it.
+  if (!std::isnan(t)) {
+    uint64_t* p = reinterpret_cast<uint64_t*>(&t);
+    *p &= (0xFFFFFFFFFFFFFFFF << n);
+  }
+}
+
+template <typename I>
+typename std::enable_if<sizeof(I) == 4, void>::type EIGEN_DEVICE_FUNC
+    EIGEN_STRONG_INLINE static LSBZeroSetterHelper(I& t, int n) {
+  // Only zero the bits for non-NaNs.
+  // For NaNs, let the non-truncation version handle it.
+  if (!std::isnan(t)) {
+    uint32_t* p = reinterpret_cast<uint32_t*>(&t);
+    *p &= (0xFFFFFFFF << n);
+  }
+}
+
+// Set n least significant bits to 0
+template <typename I, typename O>
+struct LSBZeroSetter {
+  EIGEN_EMPTY_STRUCT_CTOR(LSBZeroSetter)
+
+  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const I operator()(const I& a) const {
+    constexpr int bits = MantissaWidth<I>() - MantissaWidth<O>();
+    static_assert(
+        bits > 0,
+        "The output type must have fewer mantissa bits than the input type\n");
+    I t = a;
+    LSBZeroSetterHelper(t, bits);
+    return t;
+  }
+};
+
+template <typename I, typename O>
+struct LSBZeroSetter<std::complex<I>, std::complex<O>> {
+  EIGEN_EMPTY_STRUCT_CTOR(LSBZeroSetter)
+
+  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const std::complex<I> operator()(
+      const std::complex<I>& a) const {
+    constexpr int bits = MantissaWidth<I>() - MantissaWidth<O>();
+    static_assert(
+        bits > 0,
+        "The output type must have fewer mantissa bits than the input type\n");
+    I re = std::real(a);
+    I img = std::imag(a);
+    LSBZeroSetterHelper(re, bits);
+    LSBZeroSetterHelper(img, bits);
+    std::complex<I> toReturn(re, img);
+    return toReturn;
+  }
+};
+
+template <typename I, typename O>
+struct LSBZeroSetter<std::complex<I>, O> {
+  EIGEN_EMPTY_STRUCT_CTOR(LSBZeroSetter)
+  // Sets the 16 LSBits of the float to 0
+  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const std::complex<I> operator()(
+      const std::complex<I>& a) const {
+    constexpr int bits = MantissaWidth<I>() - MantissaWidth<O>();
+    static_assert(
+        bits > 0,
+        "The output type must have fewer mantissa bits than the input type\n");
+    I re = std::real(a);
+    I img = std::imag(a);
+    LSBZeroSetterHelper(re, bits);
+    LSBZeroSetterHelper(img, bits);
+    std::complex<I> toReturn(re, img);
+    return toReturn;
+  }
 };
 
 }  // end namespace functor
diff --git a/tensorflow/core/kernels/cast_op_gpu.cu.cc b/tensorflow/core/kernels/cast_op_gpu.cu.cc
index 9c9e9e7658..036996fca2 100644
--- a/tensorflow/core/kernels/cast_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/cast_op_gpu.cu.cc
@@ -18,27 +18,26 @@ limitations under the License.
 #define EIGEN_USE_GPU
 
 #include "tensorflow/core/framework/bfloat16.h"
+#define SPECIALIZE_FOR_GPUS
 #include "tensorflow/core/kernels/cast_op.h"
+#undef SPECIALIZE_FOR_GPUS
 
 namespace tensorflow {
 namespace functor {
 
 typedef Eigen::GpuDevice GPUDevice;
 
-template <typename O, typename I>
-struct CastFunctor<GPUDevice, O, I> {
-  void operator()(const GPUDevice& d, typename TTypes<O>::Flat o,
-                  typename TTypes<I>::ConstFlat i) {
-    Cast<GPUDevice, O, I>(d, o, i);
-  }
-};
+CAST_FUNCTORS(GPUDevice);
 
 #define DEFINE(O, I) template struct CastFunctor<GPUDevice, O, I>
+
 #define DEFINE_ALL_FROM(in_type)        \
   DEFINE(in_type, bool);                \
   DEFINE(in_type, uint8);               \
-  DEFINE(in_type, int8);                \
   DEFINE(in_type, uint16);              \
+  DEFINE(in_type, uint32);              \
+  DEFINE(in_type, uint64);              \
+  DEFINE(in_type, int8);                \
   DEFINE(in_type, int16);               \
   DEFINE(in_type, int32);               \
   DEFINE(in_type, int64);               \
@@ -50,19 +49,50 @@ struct CastFunctor<GPUDevice, O, I> {
 
 DEFINE_ALL_FROM(bool);
 DEFINE_ALL_FROM(uint8);
-DEFINE_ALL_FROM(int8);
 DEFINE_ALL_FROM(uint16);
+DEFINE_ALL_FROM(uint32);
+DEFINE_ALL_FROM(uint64);
+DEFINE_ALL_FROM(int8);
 DEFINE_ALL_FROM(int16);
 DEFINE_ALL_FROM(int32);
 DEFINE_ALL_FROM(int64);
-DEFINE_ALL_FROM(Eigen::half);
-DEFINE_ALL_FROM(float);
 DEFINE_ALL_FROM(double);
-DEFINE_ALL_FROM(std::complex<float>);
 DEFINE_ALL_FROM(std::complex<double>);
-DEFINE(bfloat16, float);
 DEFINE(float, bfloat16);
 
+#define DEFINE_ALL_TO_FLOAT(out_type) \
+  DEFINE(out_type, bool);             \
+  DEFINE(out_type, uint8);            \
+  DEFINE(out_type, uint16);           \
+  DEFINE(out_type, uint32);           \
+  DEFINE(out_type, uint64);           \
+  DEFINE(out_type, int8);             \
+  DEFINE(out_type, int16);            \
+  DEFINE(out_type, int32);            \
+  DEFINE(out_type, int64);            \
+  DEFINE(out_type, Eigen::half);      \
+  DEFINE(out_type, float);            \
+  DEFINE(out_type, std::complex<float>)
+
+#define DEFINE_ALL_TO_HALF(out_type) \
+  DEFINE(out_type, bool);            \
+  DEFINE(out_type, uint8);           \
+  DEFINE(out_type, uint16);          \
+  DEFINE(out_type, uint32);          \
+  DEFINE(out_type, uint64);          \
+  DEFINE(out_type, int8);            \
+  DEFINE(out_type, int16);           \
+  DEFINE(out_type, int32);           \
+  DEFINE(out_type, int64);           \
+  DEFINE(out_type, Eigen::half)
+
+DEFINE_ALL_TO_HALF(Eigen::half);
+DEFINE_ALL_TO_HALF(bfloat16);
+DEFINE_ALL_TO_FLOAT(float);
+DEFINE_ALL_TO_FLOAT(std::complex<float>);
+
+#undef DEFINE_ALL_TO_FLOAT
+#undef DEFINE_ALL_TO_HALF
 #undef DEFINE_ALL_FROM
 #undef DEFINE
 
diff --git a/tensorflow/core/kernels/cast_op_impl.h b/tensorflow/core/kernels/cast_op_impl.h
index 382e5440e1..b899bac681 100644
--- a/tensorflow/core/kernels/cast_op_impl.h
+++ b/tensorflow/core/kernels/cast_op_impl.h
@@ -25,22 +25,10 @@ namespace tensorflow {
 
 namespace functor {
 
-template <typename O, typename I>
-struct CastFunctor<Eigen::ThreadPoolDevice, O, I> {
-  void operator()(const Eigen::ThreadPoolDevice& d, typename TTypes<O>::Flat o,
-                  typename TTypes<I>::ConstFlat i) {
-    o.device(d) = i.template cast<O>();
-  }
-};
+CAST_FUNCTORS(Eigen::ThreadPoolDevice);
 
 #ifdef TENSORFLOW_USE_SYCL
-template <typename O, typename I>
-struct CastFunctor<Eigen::SyclDevice, O, I> {
-  void operator()(const Eigen::SyclDevice& d, typename TTypes<O>::Flat o,
-                  typename TTypes<I>::ConstFlat i) {
-    o.device(d) = i.template cast<O>();
-  }
-};
+CAST_FUNCTORS(Eigen::SyclDevice);
 #endif  // TENSORFLOW_USE_SYCL
 
 }  // namespace functor
@@ -48,8 +36,10 @@ struct CastFunctor<Eigen::SyclDevice, O, I> {
 #define CURRY_TYPES3_NO_HALF(FN, arg0, arg1) \
   FN(arg0, arg1, bool);                      \
   FN(arg0, arg1, uint8);                     \
-  FN(arg0, arg1, int8);                      \
   FN(arg0, arg1, uint16);                    \
+  FN(arg0, arg1, uint32);                    \
+  FN(arg0, arg1, uint64);                    \
+  FN(arg0, arg1, int8);                      \
   FN(arg0, arg1, int16);                     \
   FN(arg0, arg1, int32);                     \
   FN(arg0, arg1, int64);                     \
@@ -66,121 +56,103 @@ struct CastFunctor<Eigen::SyclDevice, O, I> {
   CURRY_TYPES3_NO_BF16(FN, arg0, arg1) \
   FN(arg0, arg1, bfloat16);
 
-#define CAST_CASE(DEVICE, IN, OUT)                                         \
-  if (DataTypeToEnum<OUT>::value == dst_dtype) {                           \
-    return [](OpKernelContext* ctx, const Tensor& inp, Tensor* out) {      \
-      functor::CastFunctor<DEVICE, OUT, IN> func;                          \
-      func(ctx->eigen_device<DEVICE>(), out->flat<OUT>(), inp.flat<IN>()); \
-    };                                                                     \
+#define CAST_CASE(DEVICE, IN, OUT)                                        \
+  if (DataTypeToEnum<OUT>::value == dst_dtype) {                          \
+    return [](OpKernelContext* ctx, const Tensor& inp, Tensor* out,       \
+              bool truncate) {                                            \
+      functor::CastFunctor<DEVICE, OUT, IN> func;                         \
+      func(ctx->eigen_device<DEVICE>(), out->flat<OUT>(), inp.flat<IN>(), \
+           truncate);                                                     \
+    };                                                                    \
   }
 
 // The functions below are implemented in the cast_op_impl_*.cc files.
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromBool(DataType dst_dtype);
+CastFunctorType GetCpuCastFromBool(DataType dst_dtype);
+
+CastFunctorType GetCpuCastFromUint8(DataType dst_dtype);
+
+CastFunctorType GetCpuCastFromUint16(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromUint8(DataType dst_dtype);
+CastFunctorType GetCpuCastFromInt8(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt8(DataType dst_dtype);
+CastFunctorType GetCpuCastFromUint32(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromUint16(DataType dst_dtype);
+CastFunctorType GetCpuCastFromUint64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt16(DataType dst_dtype);
+CastFunctorType GetCpuCastFromInt8(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt32(DataType dst_dtype);
+CastFunctorType GetCpuCastFromInt16(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt64(DataType dst_dtype);
+CastFunctorType GetCpuCastFromInt32(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromHalf(DataType dst_dtype);
+CastFunctorType GetCpuCastFromInt64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromFloat(DataType dst_dtype);
+CastFunctorType GetCpuCastFromHalf(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromDouble(DataType dst_dtype);
+CastFunctorType GetCpuCastFromFloat(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromComplex64(DataType dst_dtype);
+CastFunctorType GetCpuCastFromDouble(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromComplex128(DataType dst_dtype);
+CastFunctorType GetCpuCastFromComplex64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromBfloat(DataType dst_dtype);
+CastFunctorType GetCpuCastFromComplex128(DataType dst_dtype);
+
+CastFunctorType GetCpuCastFromBfloat(DataType dst_dtype);
 
 #if GOOGLE_CUDA
 // Same, for GPU.
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromBool(DataType dst_dtype);
+CastFunctorType GetGpuCastFromBool(DataType dst_dtype);
+
+CastFunctorType GetGpuCastFromUint8(DataType dst_dtype);
+
+CastFunctorType GetGpuCastFromUint16(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromUint8(DataType dst_dtype);
+CastFunctorType GetGpuCastFromInt8(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt8(DataType dst_dtype);
+CastFunctorType GetGpuCastFromUint32(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromUint16(DataType dst_dtype);
+CastFunctorType GetGpuCastFromUint64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt16(DataType dst_dtype);
+CastFunctorType GetGpuCastFromInt16(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt32(DataType dst_dtype);
+CastFunctorType GetGpuCastFromInt32(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt64(DataType dst_dtype);
+CastFunctorType GetGpuCastFromInt64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromHalf(DataType dst_dtype);
+CastFunctorType GetGpuCastFromHalf(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromFloat(DataType dst_dtype);
+CastFunctorType GetGpuCastFromFloat(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromDouble(DataType dst_dtype);
+CastFunctorType GetGpuCastFromDouble(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromComplex64(DataType dst_dtype);
+CastFunctorType GetGpuCastFromComplex64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromComplex128(DataType dst_dtype);
+CastFunctorType GetGpuCastFromComplex128(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromBfloat(DataType dst_dtype);
+CastFunctorType GetGpuCastFromBfloat(DataType dst_dtype);
 
 #endif  // GOOGLE_CUDA
 
 #ifdef TENSORFLOW_USE_SYCL
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromBool(DataType dst_dtype);
+CastFunctorType GetSyclCastFromBool(DataType dst_dtype);
+
+CastFunctorType GetSyclCastFromUint8(DataType dst_dtype);
+
+CastFunctorType GetSyclCastFromUint16(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromUint8(DataType dst_dtype);
+CastFunctorType GetSyclCastFromUint32(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromUint16(DataType dst_dtype);
+CastFunctorType GetSyclCastFromUint64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromInt16(DataType dst_dtype);
+CastFunctorType GetSyclCastFromInt16(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromInt32(DataType dst_dtype);
+CastFunctorType GetSyclCastFromInt32(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromInt64(DataType dst_dtype);
+CastFunctorType GetSyclCastFromInt64(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromFloat(DataType dst_dtype);
+CastFunctorType GetSyclCastFromFloat(DataType dst_dtype);
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromDouble(DataType dst_dtype);
+CastFunctorType GetSyclCastFromDouble(DataType dst_dtype);
 #endif  // TENSORFLOW_USE_SYCL
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/cast_op_impl_bfloat.cc b/tensorflow/core/kernels/cast_op_impl_bfloat.cc
index bfa7ba0d47..96aae15608 100644
--- a/tensorflow/core/kernels/cast_op_impl_bfloat.cc
+++ b/tensorflow/core/kernels/cast_op_impl_bfloat.cc
@@ -22,20 +22,19 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromBfloat(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromBfloat(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, bfloat16);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromBfloat(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromBfloat(DataType dst_dtype) {
   if (dst_dtype == DT_FLOAT) {
-    return [](OpKernelContext* ctx, const Tensor& inp, Tensor* out) {
+    return [](OpKernelContext* ctx, const Tensor& inp, Tensor* out,
+              bool truncate) {
       functor::CastFunctor<GPUDevice, float, bfloat16> func;
       func(ctx->eigen_device<GPUDevice>(), out->flat<float>(),
-           inp.flat<bfloat16>());
+           inp.flat<bfloat16>(), truncate);
     };
   }
   return nullptr;
diff --git a/tensorflow/core/kernels/cast_op_impl_bool.cc b/tensorflow/core/kernels/cast_op_impl_bool.cc
index c5c7394b43..792d4781f2 100644
--- a/tensorflow/core/kernels/cast_op_impl_bool.cc
+++ b/tensorflow/core/kernels/cast_op_impl_bool.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromBool(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromBool(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, bool);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromBool(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromBool(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, bool);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromBool(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromBool(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromBool(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, bool);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_complex128.cc b/tensorflow/core/kernels/cast_op_impl_complex128.cc
index 52899d58cd..9a184e5954 100644
--- a/tensorflow/core/kernels/cast_op_impl_complex128.cc
+++ b/tensorflow/core/kernels/cast_op_impl_complex128.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromComplex128(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromComplex128(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, std::complex<double>);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromComplex128(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromComplex128(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, std::complex<double>);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_complex64.cc b/tensorflow/core/kernels/cast_op_impl_complex64.cc
index 617bda53d5..77bc620b46 100644
--- a/tensorflow/core/kernels/cast_op_impl_complex64.cc
+++ b/tensorflow/core/kernels/cast_op_impl_complex64.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromComplex64(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromComplex64(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, std::complex<float>);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromComplex64(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromComplex64(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, std::complex<float>);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_double.cc b/tensorflow/core/kernels/cast_op_impl_double.cc
index 7dc485ddad..ff9056897f 100644
--- a/tensorflow/core/kernels/cast_op_impl_double.cc
+++ b/tensorflow/core/kernels/cast_op_impl_double.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromDouble(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromDouble(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, double);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromDouble(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromDouble(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, double);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromDouble(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromDouble(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromDouble(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, double);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_float.cc b/tensorflow/core/kernels/cast_op_impl_float.cc
index 1c933914fd..f1e8f0e37b 100644
--- a/tensorflow/core/kernels/cast_op_impl_float.cc
+++ b/tensorflow/core/kernels/cast_op_impl_float.cc
@@ -22,15 +22,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromFloat(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromFloat(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, float);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromFloat(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromFloat(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, GPUDevice, float);
   return nullptr;
 }
@@ -38,8 +36,7 @@ GetGpuCastFromFloat(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromFloat(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromFloat(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, float);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_half.cc b/tensorflow/core/kernels/cast_op_impl_half.cc
index ef4b94e326..5da3a01352 100644
--- a/tensorflow/core/kernels/cast_op_impl_half.cc
+++ b/tensorflow/core/kernels/cast_op_impl_half.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromHalf(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromHalf(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, Eigen::half);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromHalf(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromHalf(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, Eigen::half);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_int16.cc b/tensorflow/core/kernels/cast_op_impl_int16.cc
index 59360f7445..440ee88fb5 100644
--- a/tensorflow/core/kernels/cast_op_impl_int16.cc
+++ b/tensorflow/core/kernels/cast_op_impl_int16.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt16(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromInt16(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, int16);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt16(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromInt16(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, int16);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromInt16(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromInt16(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromInt16(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int16);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_int32.cc b/tensorflow/core/kernels/cast_op_impl_int32.cc
index a867392fde..4b3e7efddc 100644
--- a/tensorflow/core/kernels/cast_op_impl_int32.cc
+++ b/tensorflow/core/kernels/cast_op_impl_int32.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt32(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromInt32(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, int32);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt32(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromInt32(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, int32);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromInt32(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromInt32(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromInt32(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int32);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_int64.cc b/tensorflow/core/kernels/cast_op_impl_int64.cc
index 467a8f6c89..0f711aa560 100644
--- a/tensorflow/core/kernels/cast_op_impl_int64.cc
+++ b/tensorflow/core/kernels/cast_op_impl_int64.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt64(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromInt64(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, int64);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt64(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromInt64(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, int64);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromInt64(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromInt64(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromInt64(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int64);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_int8.cc b/tensorflow/core/kernels/cast_op_impl_int8.cc
index 21002a4321..eac185d5a0 100644
--- a/tensorflow/core/kernels/cast_op_impl_int8.cc
+++ b/tensorflow/core/kernels/cast_op_impl_int8.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromInt8(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromInt8(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, int8);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromInt8(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromInt8(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, int8);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromInt8(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromInt8(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromInt8(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int8);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_uint16.cc b/tensorflow/core/kernels/cast_op_impl_uint16.cc
index cd829bae2a..3aebbdc1f3 100644
--- a/tensorflow/core/kernels/cast_op_impl_uint16.cc
+++ b/tensorflow/core/kernels/cast_op_impl_uint16.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromUint16(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromUint16(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, uint16);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromUint16(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromUint16(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, uint16);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromUint16(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromUint16(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromUint16(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, uint16);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_impl_uint32.cc b/tensorflow/core/kernels/cast_op_impl_uint32.cc
new file mode 100644
index 0000000000..86f5961bcc
--- /dev/null
+++ b/tensorflow/core/kernels/cast_op_impl_uint32.cc
@@ -0,0 +1,43 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/kernels/cast_op_impl.h"
+
+namespace tensorflow {
+
+typedef Eigen::ThreadPoolDevice CPUDevice;
+typedef Eigen::GpuDevice GPUDevice;
+
+CastFunctorType GetCpuCastFromUint32(DataType dst_dtype) {
+  CURRY_TYPES3(CAST_CASE, CPUDevice, uint32);
+  return nullptr;
+}
+
+#if GOOGLE_CUDA
+CastFunctorType GetGpuCastFromUint32(DataType dst_dtype) {
+  CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, uint32);
+  return nullptr;
+}
+#endif  // GOOGLE_CUDA
+
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+CastFunctorType GetSyclCastFromUint32(DataType dst_dtype) {
+  CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, uint32);
+  return nullptr;
+}
+#endif  // TENSORFLOW_USE_SYCL
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/cast_op_impl_uint64.cc b/tensorflow/core/kernels/cast_op_impl_uint64.cc
new file mode 100644
index 0000000000..6478c266ee
--- /dev/null
+++ b/tensorflow/core/kernels/cast_op_impl_uint64.cc
@@ -0,0 +1,43 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/kernels/cast_op_impl.h"
+
+namespace tensorflow {
+
+typedef Eigen::ThreadPoolDevice CPUDevice;
+typedef Eigen::GpuDevice GPUDevice;
+
+CastFunctorType GetCpuCastFromUint64(DataType dst_dtype) {
+  CURRY_TYPES3(CAST_CASE, CPUDevice, uint64);
+  return nullptr;
+}
+
+#if GOOGLE_CUDA
+CastFunctorType GetGpuCastFromUint64(DataType dst_dtype) {
+  CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, uint64);
+  return nullptr;
+}
+#endif  // GOOGLE_CUDA
+
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+CastFunctorType GetSyclCastFromUint64(DataType dst_dtype) {
+  CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, uint64);
+  return nullptr;
+}
+#endif  // TENSORFLOW_USE_SYCL
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/cast_op_impl_uint8.cc b/tensorflow/core/kernels/cast_op_impl_uint8.cc
index 2d1a6f3a4e..b22547a23e 100644
--- a/tensorflow/core/kernels/cast_op_impl_uint8.cc
+++ b/tensorflow/core/kernels/cast_op_impl_uint8.cc
@@ -20,15 +20,13 @@ namespace tensorflow {
 typedef Eigen::ThreadPoolDevice CPUDevice;
 typedef Eigen::GpuDevice GPUDevice;
 
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetCpuCastFromUint8(DataType dst_dtype) {
+CastFunctorType GetCpuCastFromUint8(DataType dst_dtype) {
   CURRY_TYPES3(CAST_CASE, CPUDevice, uint8);
   return nullptr;
 }
 
 #if GOOGLE_CUDA
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetGpuCastFromUint8(DataType dst_dtype) {
+CastFunctorType GetGpuCastFromUint8(DataType dst_dtype) {
   CURRY_TYPES3_NO_BF16(CAST_CASE, GPUDevice, uint8);
   return nullptr;
 }
@@ -36,8 +34,7 @@ GetGpuCastFromUint8(DataType dst_dtype) {
 
 #ifdef TENSORFLOW_USE_SYCL
 typedef Eigen::SyclDevice SYCLDevice;
-std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
-GetSyclCastFromUint8(DataType dst_dtype) {
+CastFunctorType GetSyclCastFromUint8(DataType dst_dtype) {
   CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, uint8);
   return nullptr;
 }
diff --git a/tensorflow/core/kernels/cast_op_test.cc b/tensorflow/core/kernels/cast_op_test.cc
index 7da9d28a3d..cb305de5e3 100644
--- a/tensorflow/core/kernels/cast_op_test.cc
+++ b/tensorflow/core/kernels/cast_op_test.cc
@@ -40,17 +40,27 @@ static Graph* Cast(int num) {
 
 class CastOpTest : public OpsTestBase {
  protected:
-  void MakeOp(DataType src, DataType dst) {
-    TF_EXPECT_OK(NodeDefBuilder("cast_op", "Cast")
-                     .Input(FakeInput(src))
-                     .Attr("SrcT", src)
-                     .Attr("DstT", dst)
-                     .Finalize(node_def()));
+  void MakeOp(DataType src, DataType dst, bool trunc = false) {
+    if (trunc) {
+      TF_EXPECT_OK(NodeDefBuilder("cast_op", "Cast")
+                       .Input(FakeInput(src))
+                       .Attr("SrcT", src)
+                       .Attr("DstT", dst)
+                       .Attr("Truncate", true)
+                       .Finalize(node_def()));
+    } else {
+      TF_EXPECT_OK(NodeDefBuilder("cast_op", "Cast")
+                       .Input(FakeInput(src))
+                       .Attr("SrcT", src)
+                       .Attr("DstT", dst)
+                       .Finalize(node_def()));
+    }
+
     TF_EXPECT_OK(InitOp());
   }
 
   template <typename INPUT, typename OUTPUT>
-  void CheckCast() {
+  void CheckCast(bool trunc = false) {
     DataType in_type = DataTypeToEnum<INPUT>::v();
     DataType out_type = DataTypeToEnum<OUTPUT>::v();
     MakeOp(in_type, out_type);
@@ -64,22 +74,32 @@ class CastOpTest : public OpsTestBase {
   }
 };
 
-#define TEST_CAST(in, out) \
-  TEST_F(CastOpTest, TestCast##_##in##_##out) { CheckCast<in, out>(); }
+#define TEST_CAST(in, out)                                              \
+  TEST_F(CastOpTest, TestCast##_##in##_##out) { CheckCast<in, out>(); } \
+  TEST_F(CastOpTest, TestCast2##_##in##_##out) { CheckCast<in, out>(true); }
 
 #define TEST_ALL_CASTS_FROM(in) \
   TEST_CAST(in, uint8);         \
   TEST_CAST(in, uint16);        \
+  TEST_CAST(in, uint32);        \
+  TEST_CAST(in, uint64);        \
   TEST_CAST(in, int16);         \
   TEST_CAST(in, int32);         \
   TEST_CAST(in, int64);         \
   TEST_CAST(in, half);          \
   TEST_CAST(in, float);         \
   TEST_CAST(in, double);        \
-  TEST_CAST(in, bfloat16);
+  TEST_CAST(in, bfloat16);      \
+  TEST_CAST(in, quint8);        \
+  TEST_CAST(in, qint8);         \
+  TEST_CAST(in, qint32);        \
+  TEST_CAST(in, qint16);        \
+  TEST_CAST(in, quint16);
 
 TEST_ALL_CASTS_FROM(uint8)
 TEST_ALL_CASTS_FROM(uint16)
+TEST_ALL_CASTS_FROM(uint32)
+TEST_ALL_CASTS_FROM(uint64)
 TEST_ALL_CASTS_FROM(int16)
 TEST_ALL_CASTS_FROM(int32)
 TEST_ALL_CASTS_FROM(int64)
@@ -87,6 +107,11 @@ TEST_ALL_CASTS_FROM(half)
 TEST_ALL_CASTS_FROM(float)
 TEST_ALL_CASTS_FROM(double)
 TEST_ALL_CASTS_FROM(bfloat16)
+TEST_ALL_CASTS_FROM(quint8)
+TEST_ALL_CASTS_FROM(qint8)
+TEST_ALL_CASTS_FROM(qint32)
+TEST_ALL_CASTS_FROM(qint16)
+TEST_ALL_CASTS_FROM(quint16)
 
 #undef TEST_ALL_CASTS_FROM
 #undef TEST_CAST
diff --git a/tensorflow/core/kernels/concat_op.cc b/tensorflow/core/kernels/concat_op.cc
index a87b63f913..902327aaea 100644
--- a/tensorflow/core/kernels/concat_op.cc
+++ b/tensorflow/core/kernels/concat_op.cc
@@ -113,7 +113,7 @@ class ConcatBaseOp : public OpKernel {
     int64 output_concat_dim = 0;
     const bool input_is_scalar = IsLegacyScalar(input_shape);
     for (int i = 0; i < N; ++i) {
-      const auto in = values[i];
+      const auto& in = values[i];
       const bool in_is_scalar = IsLegacyScalar(in.shape());
       OP_REQUIRES(
           c, in.dims() == input_dims || (input_is_scalar && in_is_scalar),
diff --git a/tensorflow/core/kernels/conv_grad_filter_ops.cc b/tensorflow/core/kernels/conv_grad_filter_ops.cc
index aca75176a5..63b1bcda43 100644
--- a/tensorflow/core/kernels/conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_filter_ops.cc
@@ -909,6 +909,7 @@ void LaunchConv2DBackpropFilterOp<Eigen::GpuDevice, T>::operator()(
       dims.in_depth,                       // in_depths
       {{input_desc.height(),               // in_rows
         input_desc.width()}},              // in_cols
+      FORMAT_NCHW,                         // compute_data_format
       dims.out_depth,                      // out_depths
       {{dims.spatial_dims[0].filter_size,  // filter_rows
         dims.spatial_dims[1].filter_size,  // filter_cols
diff --git a/tensorflow/core/kernels/conv_grad_input_ops.cc b/tensorflow/core/kernels/conv_grad_input_ops.cc
index 63a775afa8..d664a11e73 100644
--- a/tensorflow/core/kernels/conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_input_ops.cc
@@ -957,6 +957,7 @@ void LaunchConv2DBackpropInputOp<GPUDevice, T>::operator()(
       dims.in_depth,                       // in_depths
       {{input_desc.height(),               // in_rows
         input_desc.width()}},              // in_cols
+      FORMAT_NCHW,                         // compute_data_format
       dims.out_depth,                      // out_depths
       {{dims.spatial_dims[0].filter_size,  // filter_rows
         dims.spatial_dims[1].filter_size,  // filter_cols
diff --git a/tensorflow/core/kernels/conv_grad_ops_3d.cc b/tensorflow/core/kernels/conv_grad_ops_3d.cc
index 980b1063de..15f1bf9aba 100644
--- a/tensorflow/core/kernels/conv_grad_ops_3d.cc
+++ b/tensorflow/core/kernels/conv_grad_ops_3d.cc
@@ -716,6 +716,7 @@ class Conv3DBackpropInputOp<GPUDevice, T> : public OpKernel {
         batch,
         in_depth,
         {{input_size[0], input_size[1], input_size[2]}},
+        FORMAT_NCHW,
         out_depth,
         {{filter_size[0], filter_size[1], filter_size[2]}},
         {{dilations[0], dilations[1], dilations[2]}},
@@ -1112,6 +1113,7 @@ class Conv3DBackpropFilterOp<GPUDevice, T> : public OpKernel {
         batch,
         in_depth,
         {{input_size[0], input_size[1], input_size[2]}},
+        FORMAT_NCHW,
         out_depth,
         {{filter_size[0], filter_size[1], filter_size[2]}},
         {{dilations[0], dilations[1], dilations[2]}},
diff --git a/tensorflow/core/kernels/conv_ops.cc b/tensorflow/core/kernels/conv_ops.cc
index 3b9886eece..ef692418d6 100644
--- a/tensorflow/core/kernels/conv_ops.cc
+++ b/tensorflow/core/kernels/conv_ops.cc
@@ -713,6 +713,7 @@ void LaunchConv2DOp<GPUDevice, T>::operator()(
       in_depths,         // in_depths
       {{in_rows,         // in_rows
         in_cols}},       // in_cols
+      FORMAT_NCHW,       // compute_data_format
       out_depths,        // out_depths
       {{patch_rows,      // filter_rows
         patch_cols,      // filter_cols
diff --git a/tensorflow/core/kernels/conv_ops_3d.cc b/tensorflow/core/kernels/conv_ops_3d.cc
index 9ec16be67d..a1eed4e68c 100644
--- a/tensorflow/core/kernels/conv_ops_3d.cc
+++ b/tensorflow/core/kernels/conv_ops_3d.cc
@@ -415,6 +415,7 @@ struct LaunchConvOp<GPUDevice, T> {
         in_batch,
         in_depth,
         {{in_planes, in_rows, in_cols}},
+        FORMAT_NCHW,
         out_depth,
         {{filter_planes, filter_rows, filter_cols}},
         {{dilations[0], dilations[1], dilations[2]}},
diff --git a/tensorflow/core/kernels/conv_ops_fused.cc b/tensorflow/core/kernels/conv_ops_fused.cc
index 1b40ad81f4..972100ba77 100644
--- a/tensorflow/core/kernels/conv_ops_fused.cc
+++ b/tensorflow/core/kernels/conv_ops_fused.cc
@@ -195,7 +195,7 @@ EIGEN_ALWAYS_INLINE PerCacheLineParameters<T1> CalculatePerCacheLineParameters(
   const int64 bottom_y_index =
       std::min(static_cast<int64>(std::ceil(in_y)), (st.in_height - 1));
   // Lerp is used for bilinear filtering when that's needed.
-  result.y_lerp = in_y - top_y_index;
+  result.y_lerp = static_cast<T1>(in_y - top_y_index);
   // Which rows of the original input image to pull the values from.
   result.input_top_row_start =
       input_batch_start + (top_y_index * input_width * input_depth);
@@ -245,7 +245,7 @@ CalculatePerCachePixelParameters(int64 cache_x, int64 cache_start_x,
   result.right_x_index =
       std::min(static_cast<int64>(std::ceil(in_x)), (st.in_width - 1));
   // This x_lerp is used to blend pixels in bilinear filtering.
-  result.x_lerp = in_x - result.left_x_index;
+  result.x_lerp = static_cast<T1>(in_x - result.left_x_index);
   return result;
 }
 
@@ -465,8 +465,8 @@ class FusedResizeAndPadConvFunctor {
                   // for that operation are always present.
                   // Work out the parameters that remain constant across the
                   // row we're calculating.
-                  PerCacheLineParameters<float> line_params(
-                      CalculatePerCacheLineParameters<float>(
+                  PerCacheLineParameters<T1> line_params(
+                      CalculatePerCacheLineParameters<T1>(
                           task_params.cache_height, cache_y,
                           task_params.resize_cache,
                           task_params.cache_line_width, task_params.input_width,
@@ -881,7 +881,9 @@ class FusedResizeConv2DUsingGemmOp : public OpKernel {
                                        BILINEAR>,                         \
           true>);
 
+TF_CALL_half(REGISTER_FUSED);
 TF_CALL_float(REGISTER_FUSED);
+TF_CALL_double(REGISTER_FUSED);
 
 #define REGISTER_PAD_ONLY_FUSED(T)                                        \
   REGISTER_KERNEL_BUILDER(                                                \
@@ -892,6 +894,8 @@ TF_CALL_float(REGISTER_FUSED);
                                        NEAREST>,                          \
           false>);
 
+TF_CALL_half(REGISTER_PAD_ONLY_FUSED);
 TF_CALL_float(REGISTER_PAD_ONLY_FUSED);
+TF_CALL_double(REGISTER_PAD_ONLY_FUSED);
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/conv_ops_gpu.h b/tensorflow/core/kernels/conv_ops_gpu.h
index d2c8020bb6..afc611f277 100644
--- a/tensorflow/core/kernels/conv_ops_gpu.h
+++ b/tensorflow/core/kernels/conv_ops_gpu.h
@@ -85,13 +85,15 @@ class ConvParameters {
  public:
   using SpatialArray = gtl::InlinedVector<int64, 3>;
   ConvParameters(int64 batch, int64 in_depths, const SpatialArray& in,
-                 int64 out_depths, const SpatialArray& filter,
-                 const SpatialArray& dilation, const SpatialArray& stride,
-                 const SpatialArray& padding, DataType dtype, int device_id)
+                 TensorFormat data_format, int64 out_depths,
+                 const SpatialArray& filter, const SpatialArray& dilation,
+                 const SpatialArray& stride, const SpatialArray& padding,
+                 DataType dtype, int device_id)
       : batch_(batch),
         in_depths_(in_depths),
         out_depths_(out_depths),
         in_(in),
+        data_format_(data_format),
         filter_(filter),
         dilation_(dilation),
         stride_(stride),
@@ -101,6 +103,7 @@ class ConvParameters {
     hash_code_ = batch;
     hash_code_ = Hash64Combine(hash_code_, in_depths);
     for (int64 val : in) hash_code_ = Hash64Combine(hash_code_, val);
+    hash_code_ = Hash64Combine(hash_code_, data_format);
     hash_code_ = Hash64Combine(hash_code_, out_depths);
     for (int64 val : filter) hash_code_ = Hash64Combine(hash_code_, val);
     for (int64 val : dilation) hash_code_ = Hash64Combine(hash_code_, val);
@@ -123,6 +126,7 @@ class ConvParameters {
     return strings::StrCat(
         batch_, ", ", in_depths_, ", ",
         "(", str_util::Join(in_, ", "), "), ",
+        ::tensorflow::ToString(data_format_), ", ",
         out_depths_, ", ",
         "(", str_util::Join(filter_, ", "), "), ",
         "(", str_util::Join(dilation_, ", "), "), ",
@@ -148,12 +152,13 @@ class ConvParameters {
 
  protected:
   using ParameterDataType =
-      std::tuple<int64, int64, SpatialArray, int64, SpatialArray, SpatialArray,
-                 SpatialArray, SpatialArray, DataType, int>;
+      std::tuple<int64, int64, SpatialArray, TensorFormat, int64, SpatialArray,
+                 SpatialArray, SpatialArray, SpatialArray, DataType, int>;
 
   ParameterDataType get_data_as_tuple() const {
-    return std::make_tuple(batch_, in_depths_, in_, out_depths_, filter_,
-                           dilation_, stride_, padding_, dtype_, device_id_);
+    return std::make_tuple(batch_, in_depths_, in_, data_format_, out_depths_,
+                           filter_, dilation_, stride_, padding_, dtype_,
+                           device_id_);
   }
 
   uint64 hash_code_;
@@ -178,6 +183,7 @@ class ConvParameters {
   int64 in_depths_;
   int64 out_depths_;
   SpatialArray in_;
+  TensorFormat data_format_;
   SpatialArray filter_;
   SpatialArray dilation_;
   SpatialArray stride_;
diff --git a/tensorflow/core/kernels/conv_ops_test.cc b/tensorflow/core/kernels/conv_ops_test.cc
index 9acc725ba8..1236f27051 100644
--- a/tensorflow/core/kernels/conv_ops_test.cc
+++ b/tensorflow/core/kernels/conv_ops_test.cc
@@ -44,41 +44,43 @@ struct ConvParametersPeer {
 
 TEST(ConvParameters, WinogradNonfusedAlgoSize) {
   ConvParametersPeer conv_params_small = {{
-      1,         // batch
-      32,        // in_depths
-      {{300,     // in_rows
-        300}},   // in_cols
-      128,       // out_depths
-      {{3,       // filter_rows
-        3}},     // filter_cols
-      {{1,       // dilation_rows
-        1}},     // dilation_cols
-      {{1,       // stride_rows
-        1}},     // stride_cols
-      {{0,       // padding_rows
-        0}},     // padding_cols
-      DT_FLOAT,  // tensor datatype
-      0,         // device_id
+      1,            // batch
+      32,           // in_depths
+      {{300,        // in_rows
+        300}},      // in_cols
+      FORMAT_NCHW,  // compute_data_format
+      128,          // out_depths
+      {{3,          // filter_rows
+        3}},        // filter_cols
+      {{1,          // dilation_rows
+        1}},        // dilation_cols
+      {{1,          // stride_rows
+        1}},        // stride_cols
+      {{0,          // padding_rows
+        0}},        // padding_cols
+      DT_FLOAT,     // tensor datatype
+      0,            // device_id
   }};
   EXPECT_TRUE(
       conv_params_small.ShouldIncludeWinogradNonfusedAlgoPreCudnn7<float>());
 
   ConvParametersPeer conv_params_large = {{
-      1,         // batch
-      128,       // in_depths
-      {{300,     // in_rows
-        300}},   // in_cols
-      768,       // out_depths
-      {{3,       // filter_rows
-        3}},     // filter_cols
-      {{1,       // dilation_rows
-        1}},     // dilation_cols
-      {{1,       // stride_rows
-        1}},     // stride_cols
-      {{0,       // padding_rows
-        0}},     // padding_cols
-      DT_FLOAT,  // tensor datatype
-      0,         // device_id
+      1,            // batch
+      128,          // in_depths
+      {{300,        // in_rows
+        300}},      // in_cols
+      FORMAT_NCHW,  // compute_data_format
+      768,          // out_depths
+      {{3,          // filter_rows
+        3}},        // filter_cols
+      {{1,          // dilation_rows
+        1}},        // dilation_cols
+      {{1,          // stride_rows
+        1}},        // stride_cols
+      {{0,          // padding_rows
+        0}},        // padding_cols
+      DT_FLOAT,     // tensor datatype
+      0,            // device_id
   }};
   EXPECT_FALSE(
       conv_params_large.ShouldIncludeWinogradNonfusedAlgoPreCudnn7<float>());
@@ -88,14 +90,15 @@ TEST(ConvParameters, WinogradNonfusedAlgoSize) {
 
 class FusedResizePadConvOpTest : public OpsTestBase {
  protected:
-  void HandwrittenConv() {
+  template <typename T>
+  void HandwrittenConv(DataType dtype) {
     const int stride = 1;
     TF_EXPECT_OK(NodeDefBuilder("fused_resize_op", "FusedResizeAndPadConv2D")
-                     .Input(FakeInput(DT_FLOAT))
+                     .Input(FakeInput(dtype))
                      .Input(FakeInput(DT_INT32))
                      .Input(FakeInput(DT_INT32))
-                     .Input(FakeInput(DT_FLOAT))
-                     .Attr("T", DT_FLOAT)
+                     .Input(FakeInput(dtype))
+                     .Attr("T", dtype)
                      .Attr("resize_align_corners", false)
                      .Attr("mode", "REFLECT")
                      .Attr("strides", {1, stride, stride, 1})
@@ -110,9 +113,8 @@ class FusedResizePadConvOpTest : public OpsTestBase {
     // |  1 |  2 |  3 |  4 |
     // |  5 |  6 |  7 |  8 |
     // |  9 | 10 | 11 | 12 |
-    Tensor image(DT_FLOAT,
-                 {image_batch_count, image_height, image_width, depth});
-    test::FillValues<float>(&image, {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
+    Tensor image(dtype, {image_batch_count, image_height, image_width, depth});
+    test::FillValues<T>(&image, {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
 
     // The filter matrix is:
     // | 1 | 4 | 7 |
@@ -120,8 +122,8 @@ class FusedResizePadConvOpTest : public OpsTestBase {
     // | 3 | 6 | 9 |
     const int filter_size = 3;
     const int filter_count = 1;
-    Tensor filter(DT_FLOAT, {filter_size, filter_size, depth, filter_count});
-    test::FillValues<float>(&filter, {1, 4, 7, 2, 5, 8, 3, 6, 9});
+    Tensor filter(dtype, {filter_size, filter_size, depth, filter_count});
+    test::FillValues<T>(&filter, {1, 4, 7, 2, 5, 8, 3, 6, 9});
 
     const int resized_width = image_width;
     const int resized_height = image_height;
@@ -131,12 +133,12 @@ class FusedResizePadConvOpTest : public OpsTestBase {
     const int left_padding = 0;
     const int right_padding = 0;
 
-    AddInputFromArray<float>(image.shape(), image.flat<float>());
+    AddInputFromArray<T>(image.shape(), image.flat<T>());
     AddInputFromArray<int32>(TensorShape({2}), {resized_height, resized_width});
     AddInputFromArray<int32>(
         TensorShape({4, 2}),
         {0, 0, top_padding, bottom_padding, left_padding, right_padding, 0, 0});
-    AddInputFromArray<float>(filter.shape(), filter.flat<float>());
+    AddInputFromArray<T>(filter.shape(), filter.flat<T>());
     TF_ASSERT_OK(RunOpKernel());
 
     // We're sliding the 3x3 filter across the 3x4 image, with accesses outside
@@ -160,21 +162,22 @@ class FusedResizePadConvOpTest : public OpsTestBase {
     // |  187  |  234  |  261  |  121  |
     const int expected_width = image_width;
     const int expected_height = image_height * filter_count;
-    Tensor expected(DT_FLOAT, TensorShape({image_batch_count, expected_height,
-                                           expected_width, filter_count}));
-    test::FillValues<float>(
+    Tensor expected(dtype, TensorShape({image_batch_count, expected_height,
+                                        expected_width, filter_count}));
+    test::FillValues<T>(
         &expected, {105, 150, 183, 95, 235, 312, 357, 178, 187, 234, 261, 121});
     const Tensor& output = *GetOutput(0);
-    test::ExpectTensorNear<float>(expected, output, 1e-5);
+    test::ExpectTensorNear<T>(expected, output, 1e-5);
   }
 
+  template <typename T>
   void CompareFusedAndSeparate(int input_width, int input_height,
                                int input_depth, int resize_width,
                                int resize_height, int y_padding, int x_padding,
                                int filter_size, int filter_count,
                                bool resize_align_corners,
                                const string& pad_mode, int stride,
-                               const string& padding) {
+                               const string& padding, DataType dtype) {
     auto root = tensorflow::Scope::NewRootScope();
     using namespace ::tensorflow::ops;  // NOLINT(build/namespaces)
 
@@ -183,29 +186,34 @@ class FusedResizePadConvOpTest : public OpsTestBase {
     test::FillIota<float>(&input_data, 1.0f);
     Output input =
         Const(root.WithOpName("input"), Input::Initializer(input_data));
+    Output casted_input = Cast(root.WithOpName("casted_input"), input, dtype);
 
     Tensor filter_data(DT_FLOAT, TensorShape({filter_size, filter_size,
                                               input_depth, filter_count}));
     test::FillIota<float>(&filter_data, 1.0f);
     Output filter =
         Const(root.WithOpName("filter"), Input::Initializer(filter_data));
+    Output casted_filter =
+        Cast(root.WithOpName("casted_filter"), filter, dtype);
 
     Output resize_size =
         Const(root.WithOpName("resize_size"), {resize_height, resize_width});
     Output resize =
         ResizeBilinear(root.WithOpName("resize"), input, resize_size,
                        ResizeBilinear::AlignCorners(resize_align_corners));
+    // Bilinear resize only output float, cast it to dtype to match the input.
+    Output casted_resize = Cast(root.WithOpName("cast"), resize, dtype);
     Output paddings =
         Const(root.WithOpName("paddings"),
               {{0, 0}, {y_padding, y_padding}, {x_padding, x_padding}, {0, 0}});
-    Output mirror_pad =
-        MirrorPad(root.WithOpName("mirror_pad"), resize, paddings, pad_mode);
-    Output conv = Conv2D(root.WithOpName("conv"), mirror_pad, filter,
+    Output mirror_pad = MirrorPad(root.WithOpName("mirror_pad"), casted_resize,
+                                  paddings, pad_mode);
+    Output conv = Conv2D(root.WithOpName("conv"), mirror_pad, casted_filter,
                          {1, stride, stride, 1}, padding);
 
     Output fused_conv = FusedResizeAndPadConv2D(
-        root.WithOpName("fused_conv"), input, resize_size, paddings, filter,
-        pad_mode, {1, stride, stride, 1}, padding,
+        root.WithOpName("fused_conv"), casted_input, resize_size, paddings,
+        casted_filter, pad_mode, {1, stride, stride, 1}, padding,
         FusedResizeAndPadConv2D::ResizeAlignCorners(resize_align_corners));
 
     tensorflow::GraphDef graph;
@@ -224,11 +232,13 @@ class FusedResizePadConvOpTest : public OpsTestBase {
     test::ExpectClose(unfused_tensors[0], fused_tensors[0]);
   }
 
+  template <typename T>
   void CompareFusedPadOnlyAndSeparate(int input_width, int input_height,
                                       int input_depth, int y_padding,
                                       int x_padding, int filter_size,
                                       int filter_count, const string& pad_mode,
-                                      int stride, const string& padding) {
+                                      int stride, const string& padding,
+                                      DataType dtype) {
     auto root = tensorflow::Scope::NewRootScope();
     using namespace ::tensorflow::ops;  // NOLINT(build/namespaces)
 
@@ -237,24 +247,27 @@ class FusedResizePadConvOpTest : public OpsTestBase {
     test::FillIota<float>(&input_data, 1.0f);
     Output input =
         Const(root.WithOpName("input"), Input::Initializer(input_data));
+    Output casted_input = Cast(root.WithOpName("casted_input"), input, dtype);
 
     Tensor filter_data(DT_FLOAT, TensorShape({filter_size, filter_size,
                                               input_depth, filter_count}));
     test::FillIota<float>(&filter_data, 1.0f);
     Output filter =
         Const(root.WithOpName("filter"), Input::Initializer(filter_data));
+    Output casted_filter =
+        Cast(root.WithOpName("casted_filter"), filter, dtype);
 
     Output paddings =
         Const(root.WithOpName("paddings"),
               {{0, 0}, {y_padding, y_padding}, {x_padding, x_padding}, {0, 0}});
-    Output mirror_pad =
-        MirrorPad(root.WithOpName("mirror_pad"), input, paddings, pad_mode);
-    Output conv = Conv2D(root.WithOpName("conv"), mirror_pad, filter,
+    Output mirror_pad = MirrorPad(root.WithOpName("mirror_pad"), casted_input,
+                                  paddings, pad_mode);
+    Output conv = Conv2D(root.WithOpName("conv"), mirror_pad, casted_filter,
                          {1, stride, stride, 1}, padding);
 
-    Output fused_conv =
-        FusedPadConv2D(root.WithOpName("fused_conv"), input, paddings, filter,
-                       pad_mode, {1, stride, stride, 1}, padding);
+    Output fused_conv = FusedPadConv2D(
+        root.WithOpName("fused_conv"), casted_input, paddings, casted_filter,
+        pad_mode, {1, stride, stride, 1}, padding);
 
     tensorflow::GraphDef graph;
     TF_ASSERT_OK(root.ToGraphDef(&graph));
@@ -273,91 +286,126 @@ class FusedResizePadConvOpTest : public OpsTestBase {
   }
 };
 
-TEST_F(FusedResizePadConvOpTest, HandwrittenConv) { HandwrittenConv(); }
+TEST_F(FusedResizePadConvOpTest, HandwrittenConvHalf) {
+  HandwrittenConv<Eigen::half>(DT_HALF);
+}
 
-TEST_F(FusedResizePadConvOpTest, IdentityComparative) {
-  CompareFusedAndSeparate(10, 10, 1, 10, 10, 0, 0, 1, 1, false, "REFLECT", 1,
-                          "SAME");
+TEST_F(FusedResizePadConvOpTest, HandwrittenConvFloat) {
+  HandwrittenConv<float>(DT_FLOAT);
+}
+
+TEST_F(FusedResizePadConvOpTest, HandwrittenConvDouble) {
+  HandwrittenConv<double>(DT_DOUBLE);
+}
+
+TEST_F(FusedResizePadConvOpTest, IdentityComparativeHalf) {
+  CompareFusedAndSeparate<Eigen::half>(10, 10, 1, 10, 10, 0, 0, 1, 1, false,
+                                       "REFLECT", 1, "SAME", DT_HALF);
+}
+
+TEST_F(FusedResizePadConvOpTest, IdentityComparativeFloat) {
+  CompareFusedAndSeparate<float>(10, 10, 1, 10, 10, 0, 0, 1, 1, false,
+                                 "REFLECT", 1, "SAME", DT_FLOAT);
+}
+
+TEST_F(FusedResizePadConvOpTest, IdentityComparativeDouble) {
+  CompareFusedAndSeparate<double>(10, 10, 1, 10, 10, 0, 0, 1, 1, false,
+                                  "REFLECT", 1, "SAME", DT_DOUBLE);
 }
 
 TEST_F(FusedResizePadConvOpTest, ConvOnlyComparative) {
-  CompareFusedAndSeparate(10, 10, 3, 10, 10, 0, 0, 4, 4, false, "REFLECT", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(10, 10, 3, 10, 10, 0, 0, 4, 4, false,
+                                 "REFLECT", 1, "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, ResizeOnlyComparative) {
-  CompareFusedAndSeparate(10, 10, 1, 20, 20, 0, 0, 1, 1, false, "REFLECT", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(10, 10, 1, 20, 20, 0, 0, 1, 1, false,
+                                 "REFLECT", 1, "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, ResizeAndConvComparative) {
-  CompareFusedAndSeparate(2, 2, 4, 4, 2, 0, 0, 2, 2, false, "REFLECT", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(2, 2, 4, 4, 2, 0, 0, 2, 2, false, "REFLECT", 1,
+                                 "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, ResizeAlignAndConvComparative) {
-  CompareFusedAndSeparate(2, 2, 4, 4, 2, 0, 0, 2, 2, true, "REFLECT", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(2, 2, 4, 4, 2, 0, 0, 2, 2, true, "REFLECT", 1,
+                                 "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, ResizeAndConvStridedComparative) {
-  CompareFusedAndSeparate(2, 2, 4, 4, 2, 0, 0, 2, 2, false, "REFLECT", 2,
-                          "SAME");
+  CompareFusedAndSeparate<float>(2, 2, 4, 4, 2, 0, 0, 2, 2, false, "REFLECT", 2,
+                                 "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, ResizeAlignAndConvValidComparative) {
-  CompareFusedAndSeparate(2, 2, 4, 4, 2, 0, 0, 2, 2, true, "REFLECT", 1,
-                          "VALID");
+  CompareFusedAndSeparate<float>(2, 2, 4, 4, 2, 0, 0, 2, 2, true, "REFLECT", 1,
+                                 "VALID", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, PadOnlyComparative) {
-  CompareFusedAndSeparate(4, 4, 1, 4, 4, 2, 2, 1, 1, false, "REFLECT", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(4, 4, 1, 4, 4, 2, 2, 1, 1, false, "REFLECT", 1,
+                                 "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, PadOnlyWithChannelsComparative) {
-  CompareFusedAndSeparate(4, 4, 3, 4, 4, 2, 2, 1, 1, false, "REFLECT", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(4, 4, 3, 4, 4, 2, 2, 1, 1, false, "REFLECT", 1,
+                                 "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, ResizeAndPadComparative) {
-  CompareFusedAndSeparate(4, 4, 1, 6, 6, 2, 2, 1, 1, false, "REFLECT", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(4, 4, 1, 6, 6, 2, 2, 1, 1, false, "REFLECT", 1,
+                                 "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, PadOnlySymmetricComparative) {
-  CompareFusedAndSeparate(4, 4, 1, 4, 4, 2, 2, 1, 1, false, "SYMMETRIC", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(4, 4, 1, 4, 4, 2, 2, 1, 1, false, "SYMMETRIC",
+                                 1, "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, ResizeAndPadSymmetricComparative) {
-  CompareFusedAndSeparate(4, 4, 3, 6, 6, 2, 2, 1, 1, false, "SYMMETRIC", 1,
-                          "SAME");
+  CompareFusedAndSeparate<float>(4, 4, 3, 6, 6, 2, 2, 1, 1, false, "SYMMETRIC",
+                                 1, "SAME", DT_FLOAT);
+}
+
+TEST_F(FusedResizePadConvOpTest, ResizeAndPadSymmetricComparativeLarge) {
+  CompareFusedAndSeparate<float>(1000, 1000, 3, 1006, 1006, 2, 2, 1, 1, false,
+                                 "SYMMETRIC", 1, "SAME", DT_FLOAT);
 }
 
-TEST_F(FusedResizePadConvOpTest, NoResizeIdentityComparative) {
-  CompareFusedPadOnlyAndSeparate(10, 10, 1, 0, 0, 1, 1, "REFLECT", 1, "SAME");
+TEST_F(FusedResizePadConvOpTest, NoResizeIdentityComparativeHalf) {
+  CompareFusedPadOnlyAndSeparate<Eigen::half>(10, 10, 1, 0, 0, 1, 1, "REFLECT",
+                                              1, "SAME", DT_HALF);
+}
+
+TEST_F(FusedResizePadConvOpTest, NoResizeIdentityComparativeFloat) {
+  CompareFusedPadOnlyAndSeparate<float>(10, 10, 1, 0, 0, 1, 1, "REFLECT", 1,
+                                        "SAME", DT_FLOAT);
+}
+
+TEST_F(FusedResizePadConvOpTest, NoResizeIdentityComparativeDouble) {
+  CompareFusedPadOnlyAndSeparate<double>(10, 10, 1, 0, 0, 1, 1, "REFLECT", 1,
+                                         "SAME", DT_DOUBLE);
 }
 
 TEST_F(FusedResizePadConvOpTest, NoResizeConvOnlyComparative) {
-  CompareFusedPadOnlyAndSeparate(10, 10, 3, 0, 0, 4, 4, "REFLECT", 1, "SAME");
+  CompareFusedPadOnlyAndSeparate<float>(10, 10, 3, 0, 0, 4, 4, "REFLECT", 1,
+                                        "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, NoResizePadOnlyComparative) {
-  CompareFusedPadOnlyAndSeparate(4, 4, 1, 2, 2, 1, 1, "REFLECT", 1, "SAME");
+  CompareFusedPadOnlyAndSeparate<float>(4, 4, 1, 2, 2, 1, 1, "REFLECT", 1,
+                                        "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, NoResizePadOnlyWithChannelsComparative) {
-  CompareFusedPadOnlyAndSeparate(4, 4, 3, 2, 2, 1, 1, "REFLECT", 1, "SAME");
+  CompareFusedPadOnlyAndSeparate<float>(4, 4, 3, 2, 2, 1, 1, "REFLECT", 1,
+                                        "SAME", DT_FLOAT);
 }
 
 TEST_F(FusedResizePadConvOpTest, NoResizePadOnlySymmetricComparative) {
-  CompareFusedPadOnlyAndSeparate(4, 4, 1, 2, 2, 1, 1, "SYMMETRIC", 1, "SAME");
-}
-
-TEST_F(FusedResizePadConvOpTest, ResizeAndPadSymmetricComparativeLarge) {
-  CompareFusedAndSeparate(1000, 1000, 3, 1006, 1006, 2, 2, 1, 1, false,
-                          "SYMMETRIC", 1, "SAME");
+  CompareFusedPadOnlyAndSeparate<float>(4, 4, 1, 2, 2, 1, 1, "SYMMETRIC", 1,
+                                        "SAME", DT_FLOAT);
 }
 
 class ConvOpTest : public OpsTestBase {
diff --git a/tensorflow/core/kernels/crop_and_resize_op_benchmark_test.cc b/tensorflow/core/kernels/crop_and_resize_op_benchmark_test.cc
new file mode 100644
index 0000000000..d7ca64bea0
--- /dev/null
+++ b/tensorflow/core/kernels/crop_and_resize_op_benchmark_test.cc
@@ -0,0 +1,72 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/graph/node_builder.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
+
+namespace tensorflow {
+
+static Graph* BM_CropAndResize(int batches, int width, int height, int depth,
+                               int crop_height, int crop_width) {
+  Graph* g = new Graph(OpRegistry::Global());
+  Tensor in(DT_FLOAT, TensorShape({batches, height, width, depth}));
+  in.flat<float>().setRandom();
+  Tensor boxes(DT_FLOAT, TensorShape({batches, 4}));
+  auto boxes_tensor = boxes.matrix<float>();
+  Tensor box_ind(DT_INT32, TensorShape({batches}));
+  auto box_ind_flat = box_ind.flat<int32>();
+  for (int i = 0; i < batches; ++i) {
+    boxes_tensor(i, 0) = 0.2;
+    boxes_tensor(i, 1) = 0.2;
+    boxes_tensor(i, 2) = 0.8;
+    boxes_tensor(i, 3) = 0.7;
+    box_ind_flat(i) = i;
+  }
+  Tensor crop_size(DT_INT32, TensorShape({2}));
+  auto crop_size_flat = crop_size.flat<int32>();
+  crop_size_flat(0) = crop_height;
+  crop_size_flat(1) = crop_width;
+  Node* ret;
+  TF_CHECK_OK(NodeBuilder(g->NewName("n"), "CropAndResize")
+                  .Input(test::graph::Constant(g, in))
+                  .Input(test::graph::Constant(g, boxes))
+                  .Input(test::graph::Constant(g, box_ind))
+                  .Input(test::graph::Constant(g, crop_size))
+                  .Finalize(g, &ret));
+  return g;
+}
+
+#define BM_CropAndResizeDev(DEVICE, B, W, H, D, CH, CW)                        \
+  static void BM_CropAndResize_##DEVICE##_##B##_##W##_##H##_##D##_##CH##_##CW( \
+      int iters) {                                                             \
+    testing::ItemsProcessed(iters* B* W* H* D);                                \
+    test::Benchmark(#DEVICE, BM_CropAndResize(B, W, H, D, CH, CW)).Run(iters); \
+  }                                                                            \
+  BENCHMARK(BM_CropAndResize_##DEVICE##_##B##_##W##_##H##_##D##_##CH##_##CW);
+
+// Benchmark results using CPU:Intel Haswell with HyperThreading (6 cores)
+// Benchmark                                Time(ns) CPU(ns)  Iterations
+// BM_CropAndResize_cpu_1_640_640_3_512_512 7078765 7173520 100 163.361M items/s
+// BM_CropAndResize_cpu_1_640_640_1_512_512 3801232 3914692 185  99.784M items/s
+// BM_CropAndResize_cpu_1_80_80_512_7_7      182470  241767 2941  1.372G items/s
+
+BM_CropAndResizeDev(cpu, 1, 640, 640, 3, 512, 512);
+BM_CropAndResizeDev(cpu, 1, 640, 640, 1, 512, 512);
+BM_CropAndResizeDev(cpu, 1, 80, 80, 512, 7, 7);
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/ctc_loss_op.cc b/tensorflow/core/kernels/ctc_loss_op.cc
index b38d838bf1..fb375ee4b3 100644
--- a/tensorflow/core/kernels/ctc_loss_op.cc
+++ b/tensorflow/core/kernels/ctc_loss_op.cc
@@ -100,8 +100,10 @@ class CTCLossOp : public OpKernel {
 
     TensorShape labels_shape({batch_size, max_label_len});
     std::vector<int64> order{0, 1};
-    sparse::SparseTensor labels_sp(*labels_indices, *labels_values,
-                                   labels_shape, order);
+    sparse::SparseTensor labels_sp;
+    OP_REQUIRES_OK(
+        ctx, sparse::SparseTensor::Create(*labels_indices, *labels_values,
+                                          labels_shape, order, &labels_sp));
 
     Status labels_sp_valid = labels_sp.IndicesValid();
     OP_REQUIRES(ctx, labels_sp_valid.ok(),
diff --git a/tensorflow/core/kernels/cuda_solvers.cc b/tensorflow/core/kernels/cuda_solvers.cc
index a857bd3ce4..a59baaa96f 100644
--- a/tensorflow/core/kernels/cuda_solvers.cc
+++ b/tensorflow/core/kernels/cuda_solvers.cc
@@ -151,7 +151,7 @@ CudaSolver::CudaSolver(OpKernelContext* context) : context_(context) {
       reinterpret_cast<const cudaStream_t*>(context->op_device_context()
                                                 ->stream()
                                                 ->implementation()
-                                                ->CudaStreamMemberHack()));
+                                                ->GpuStreamMemberHack()));
   cuda_stream_ = *cu_stream_ptr;
   HandleMap* handle_map = CHECK_NOTNULL(GetHandleMapSingleton());
   auto it = handle_map->find(cuda_stream_);
diff --git a/tensorflow/core/kernels/cwise_op_tan.cc b/tensorflow/core/kernels/cwise_op_tan.cc
index c1a25767d3..90762fb1b0 100644
--- a/tensorflow/core/kernels/cwise_op_tan.cc
+++ b/tensorflow/core/kernels/cwise_op_tan.cc
@@ -16,7 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER2(UnaryOp, CPU, "Tan", functor::tan, float, double);
+REGISTER4(UnaryOp, CPU, "Tan", functor::tan, float, double, complex64,
+          complex128);
 
 #if GOOGLE_CUDA
 REGISTER2(UnaryOp, GPU, "Tan", functor::tan, float, double);
diff --git a/tensorflow/core/kernels/data/BUILD b/tensorflow/core/kernels/data/BUILD
index 6d2a04aa25..e9175e768d 100644
--- a/tensorflow/core/kernels/data/BUILD
+++ b/tensorflow/core/kernels/data/BUILD
@@ -85,6 +85,19 @@ tf_kernel_library(
 )
 
 tf_kernel_library(
+    name = "window_dataset_op",
+    srcs = ["window_dataset_op.cc"],
+    deps = [
+        ":dataset",
+        ":window_dataset",
+        "//tensorflow/core:dataset_ops_op_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
+tf_kernel_library(
     name = "slide_dataset_op",
     srcs = ["slide_dataset_op.cc"],
     deps = [
@@ -191,12 +204,28 @@ tf_kernel_library(
     ],
 )
 
+cc_library(
+    name = "parallel_map_iterator",
+    srcs = ["parallel_map_iterator.cc"],
+    hdrs = ["parallel_map_iterator.h"],
+    deps = [
+        ":dataset",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:dataset_ops_op_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
 tf_kernel_library(
     name = "parallel_map_dataset_op",
     srcs = ["parallel_map_dataset_op.cc"],
     deps = [
         ":captured_function",
         ":dataset",
+        ":parallel_map_iterator",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:dataset_ops_op_lib",
         "//tensorflow/core:framework",
@@ -525,6 +554,7 @@ tf_kernel_library(
     deps = [
         ":dataset",
         ":dataset_utils",
+        ":optional_ops",
         "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:dataset_ops_op_lib",
         "//tensorflow/core:framework",
@@ -537,24 +567,29 @@ tf_kernel_library(
 )
 
 tf_kernel_library(
-    name = "cache_dataset_ops",
-    srcs = ["cache_dataset_ops.cc"],
+    name = "optional_ops",
+    srcs = ["optional_ops.cc"],
+    hdrs = ["optional_ops.h"],
     deps = [
-        ":dataset",
+        "//tensorflow/core:core_cpu_internal",
         "//tensorflow/core:dataset_ops_op_lib",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
-        "//tensorflow/core/util/tensor_bundle",
+        "//tensorflow/core:protos_all_cc",
     ],
 )
 
 tf_kernel_library(
-    name = "identity_dataset_op",
-    srcs = ["identity_dataset_op.cc"],
+    name = "cache_dataset_ops",
+    srcs = ["cache_dataset_ops.cc"],
     deps = [
         ":dataset",
+        "//tensorflow/core:dataset_ops_op_lib",
         "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core/util/tensor_bundle",
     ],
 )
 
@@ -606,12 +641,12 @@ tf_kernel_library(
         ":generator_dataset_op",
         ":group_by_reducer_dataset_op",
         ":group_by_window_dataset_op",
-        ":identity_dataset_op",
         ":interleave_dataset_op",
         ":iterator_ops",
         ":map_and_batch_dataset_op",
         ":map_dataset_op",
         ":optimize_dataset_op",
+        ":optional_ops",
         ":padded_batch_dataset_op",
         ":parallel_interleave_dataset_op",
         ":parallel_map_dataset_op",
@@ -634,6 +669,7 @@ tf_kernel_library(
         ":tensor_queue_dataset_op",
         ":tensor_slice_dataset_op",
         ":unbatch_dataset_op",
+        ":window_dataset_op",
         ":writer_ops",
         ":zip_dataset_op",
     ],
diff --git a/tensorflow/core/kernels/data/captured_function.cc b/tensorflow/core/kernels/data/captured_function.cc
index ee58341cfd..82da385405 100644
--- a/tensorflow/core/kernels/data/captured_function.cc
+++ b/tensorflow/core/kernels/data/captured_function.cc
@@ -214,6 +214,9 @@ Status CapturedFunction::Run(IteratorContext* ctx, std::vector<Tensor>&& args,
   });
   f_opts.step_container = &step_container;
   f_opts.runner = ctx->runner();
+  if (ctx->lib()->device()->device_type() != DEVICE_CPU) {
+    f_opts.create_rendezvous = true;
+  }
   // TODO(mrry): Add cancellation manager support to IteratorContext
   // so that we can cancel running map functions. The local
   // cancellation manager here is created so that we can run kernels
@@ -248,6 +251,9 @@ Status CapturedFunction::RunWithBorrowedArgs(IteratorContext* ctx,
   });
   f_opts.step_container = &step_container;
   f_opts.runner = ctx->runner();
+  if (ctx->lib()->device()->device_type() != DEVICE_CPU) {
+    f_opts.create_rendezvous = true;
+  }
   // TODO(mrry): Add cancellation manager support to IteratorContext
   // so that we can cancel running map functions. The local
   // cancellation manager here is created so that we can run kernels
@@ -304,6 +310,9 @@ Status CapturedFunction::RunInstantiated(const std::vector<Tensor>& args,
   });
   f_opts.step_container = &step_container;
   f_opts.runner = runner;
+  if (lib->device()->device_type() != DEVICE_CPU) {
+    f_opts.create_rendezvous = true;
+  }
   // TODO(mrry): Add cancellation manager support to IteratorContext
   // so that we can cancel running map functions. The local
   // cancellation manager here is created so that we can run kernels
@@ -351,6 +360,9 @@ void CapturedFunction::RunAsync(IteratorContext* ctx,
       });
   f_opts.step_container = step_container;
   f_opts.runner = ctx->runner();
+  if (ctx->lib()->device()->device_type() != DEVICE_CPU) {
+    f_opts.create_rendezvous = true;
+  }
   // TODO(mrry): Add cancellation manager support to IteratorContext
   // so that we can cancel running map functions. The local
   // cancellation manager here is created so that we can run kernels
diff --git a/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc b/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc
index 91b9279427..da4b14c8b9 100644
--- a/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/dense_to_sparse_batch_dataset_op.cc
@@ -101,8 +101,8 @@ class DenseToSparseBatchDatasetOp : public UnaryDatasetOpKernel {
     }
 
     const DataTypeVector& output_dtypes() const override {
-      static DataTypeVector* output_dtypes_ = new DataTypeVector({DT_VARIANT});
-      return *output_dtypes_;
+      static DataTypeVector* output_dtypes = new DataTypeVector({DT_VARIANT});
+      return *output_dtypes;
     }
 
     const std::vector<PartialTensorShape>& output_shapes() const override {
diff --git a/tensorflow/core/kernels/data/generator_dataset_op.cc b/tensorflow/core/kernels/data/generator_dataset_op.cc
index aae62ad2fe..0981e42ba1 100644
--- a/tensorflow/core/kernels/data/generator_dataset_op.cc
+++ b/tensorflow/core/kernels/data/generator_dataset_op.cc
@@ -197,6 +197,9 @@ class GeneratorDatasetOp : public DatasetOpKernel {
 
 REGISTER_KERNEL_BUILDER(Name("GeneratorDataset").Device(DEVICE_CPU),
                         GeneratorDatasetOp);
+REGISTER_KERNEL_BUILDER(
+    Name("GeneratorDataset").Device(DEVICE_GPU).HostMemory("handle"),
+    GeneratorDatasetOp);
 
 }  // namespace
 
diff --git a/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc b/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc
index 03abae79d2..7206be8c0d 100644
--- a/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc
+++ b/tensorflow/core/kernels/data/group_by_reducer_dataset_op.cc
@@ -254,6 +254,7 @@ class GroupByReducerDatasetOp : public UnaryDatasetOpKernel {
         TF_RETURN_IF_ERROR(
             dataset()->captured_finalize_func_->RunWithBorrowedArgs(
                 ctx, states_[keys_[keys_index_++]], out_tensors));
+        *end_of_sequence = false;
         return Status::OK();
       }
 
diff --git a/tensorflow/core/kernels/data/identity_dataset_op.cc b/tensorflow/core/kernels/data/identity_dataset_op.cc
deleted file mode 100644
index e28f188336..0000000000
--- a/tensorflow/core/kernels/data/identity_dataset_op.cc
+++ /dev/null
@@ -1,102 +0,0 @@
-/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-#include <map>
-
-#include "tensorflow/core/framework/tensor.h"
-#include "tensorflow/core/kernels/data/dataset.h"
-
-namespace tensorflow {
-namespace {
-
-// The purpose of identity dataset is to serve as a placeholder when performing
-// optimizations. It is not expected to be surfaced in the Python API.
-class IdentityDatasetOp : public UnaryDatasetOpKernel {
- public:
-  explicit IdentityDatasetOp(OpKernelConstruction* ctx)
-      : UnaryDatasetOpKernel(ctx) {
-    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
-    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
-  }
-
- protected:
-  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
-                   DatasetBase** output) override {
-    *output = new Dataset(ctx, input);
-  }
-
- private:
-  class Dataset : public GraphDatasetBase {
-   public:
-    Dataset(OpKernelContext* ctx, const DatasetBase* input)
-        : GraphDatasetBase(ctx), input_(input) {
-      input_->Ref();
-    }
-
-    ~Dataset() override { input_->Unref(); }
-
-    std::unique_ptr<IteratorBase> MakeIteratorInternal(
-        const string& prefix) const override {
-      return std::unique_ptr<IteratorBase>(
-          new Iterator({this, strings::StrCat(prefix, "::Identity")}));
-    }
-
-    const DataTypeVector& output_dtypes() const override {
-      return input_->output_dtypes();
-    }
-
-    const std::vector<PartialTensorShape>& output_shapes() const override {
-      return input_->output_shapes();
-    }
-
-    string DebugString() const override { return "IdentityDatasetOp::Dataset"; }
-
-   protected:
-    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
-                              Node** output) const override {
-      Node* input_graph_node = nullptr;
-      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
-      TF_RETURN_IF_ERROR(b->AddDataset(this, {input_graph_node}, output));
-      return Status::OK();
-    }
-
-   private:
-    class Iterator : public DatasetIterator<Dataset> {
-     public:
-      explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params) {}
-
-      Status Initialize(IteratorContext* ctx) override {
-        return errors::Unimplemented(strings::StrCat(prefix(), "::Initialize"));
-      }
-
-      Status GetNextInternal(IteratorContext* ctx,
-                             std::vector<Tensor>* out_tensors,
-                             bool* end_of_sequence) override {
-        return errors::Unimplemented(
-            strings::StrCat(prefix(), "::GetNextInternal"));
-      }
-    };
-
-    const DatasetBase* const input_;
-  };
-
-  DataTypeVector output_types_;
-  std::vector<PartialTensorShape> output_shapes_;
-};
-
-REGISTER_KERNEL_BUILDER(Name("IdentityDataset").Device(DEVICE_CPU),
-                        IdentityDatasetOp);
-}  // namespace
-}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/iterator_ops.cc b/tensorflow/core/kernels/data/iterator_ops.cc
index 090eb3275b..9997939d64 100644
--- a/tensorflow/core/kernels/data/iterator_ops.cc
+++ b/tensorflow/core/kernels/data/iterator_ops.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/kernels/data/dataset.h"
 #include "tensorflow/core/kernels/data/dataset_utils.h"
+#include "tensorflow/core/kernels/data/optional_ops.h"
 #include "tensorflow/core/kernels/ops_util.h"
 #include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/lib/gtl/cleanup.h"
@@ -662,21 +663,89 @@ class MakeIteratorOp : public OpKernel {
   }
 };
 
+// A simple background worker that executes closures asynchronously and without
+// blocking.
+//
+// A `BackgroundWorker` is used to offload blocking work from an `AsyncOpKernel`
+// to avoid blocking an executor thread that may be required by the blocking
+// work.
+//
+// NOTE(mrry): We do not use a regular `tensorflow::thread::ThreadPool` for this
+// purpose because its current implementation (in Eigen) uses a finite-length
+// queue and will block the caller when full. This can lead to deadlock under
+// heavy load. Since the number of concurrent work items in each user of a
+// `BackgroundWorker` is at most one per op invocation, the dynamic allocation
+// overhead is tolerable.
+class BackgroundWorker {
+ public:
+  BackgroundWorker(Env* env, const string& name) {
+    thread_.reset(env->StartThread({} /* thread_options */, name,
+                                   [this]() { WorkerLoop(); }));
+  }
+
+  ~BackgroundWorker() {
+    {
+      mutex_lock l(mu_);
+      cancelled_ = true;
+    }
+    cond_var_.notify_one();
+    // Block until the background thread has terminated.
+    //
+    // NOTE(mrry): We explicitly free and join the thread here because
+    // `WorkerLoop()` uses other members of this object, and so we must join
+    // the thread before destroying them.
+    thread_.reset();
+  }
+
+  void Schedule(std::function<void()> work_item) {
+    {
+      mutex_lock l(mu_);
+      work_queue_.push_back(std::move(work_item));
+    }
+    cond_var_.notify_one();
+  }
+
+ private:
+  void WorkerLoop() {
+    while (true) {
+      std::function<void()> work_item = nullptr;
+      {
+        mutex_lock l(mu_);
+        while (!cancelled_ && work_queue_.empty()) {
+          cond_var_.wait(l);
+        }
+        if (cancelled_) {
+          return;
+        }
+        DCHECK(!work_queue_.empty());
+        work_item = std::move(work_queue_.front());
+        work_queue_.pop_front();
+      }
+      DCHECK(work_item != nullptr);
+      work_item();
+    }
+  }
+
+  std::unique_ptr<Thread> thread_;
+  mutex mu_;
+  condition_variable cond_var_;
+  bool cancelled_ GUARDED_BY(mu_) = false;
+  std::deque<std::function<void()>> work_queue_ GUARDED_BY(mu_);
+};
+
 class ToSingleElementOp : public AsyncOpKernel {
  public:
   explicit ToSingleElementOp(OpKernelConstruction* ctx)
       : AsyncOpKernel(ctx),
-        thread_pool_(new thread::ThreadPool(
-            ctx->env(), ThreadOptions(),
-            strings::StrCat("to_single_element_op_thread_",
-                            SanitizeThreadSuffix(name())),
-            1 /* num_threads */, false /* low_latency_hint */)) {}
+        background_worker_(ctx->env(),
+                           strings::StrCat("to_single_element_op_thread_",
+                                           SanitizeThreadSuffix(name()))) {}
 
   void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
     // The call to `iterator->GetNext()` may block and depend on an
     // inter-op thread pool thread, so we issue the call from the
     // owned thread pool.
-    thread_pool_->Schedule([ctx, done]() {
+    background_worker_.Schedule([ctx, done]() {
       DatasetBase* dataset;
       OP_REQUIRES_OK_ASYNC(
           ctx, GetDatasetFromVariantTensor(ctx->input(0), &dataset), done);
@@ -696,7 +765,7 @@ class ToSingleElementOp : public AsyncOpKernel {
       });
       std::vector<Tensor> components;
       components.reserve(dataset->output_dtypes().size());
-      bool end_of_sequence;
+      bool end_of_sequence = false;
 
       Status s =
           raw_iterator->GetNext(&iter_ctx, &components, &end_of_sequence);
@@ -729,18 +798,17 @@ class ToSingleElementOp : public AsyncOpKernel {
   }
 
  private:
-  std::unique_ptr<thread::ThreadPool> thread_pool_;
+  BackgroundWorker background_worker_;
 };
 
 class OneShotIteratorOp : public AsyncOpKernel {
  public:
   explicit OneShotIteratorOp(OpKernelConstruction* ctx)
       : AsyncOpKernel(ctx),
-        thread_pool_(new thread::ThreadPool(
-            ctx->env(), ThreadOptions(),
+        background_worker_(
+            ctx->env(),
             strings::StrCat("one_shot_iterator_initialization_thread_",
-                            SanitizeThreadSuffix(name())),
-            1 /* num_threads */, false /* low_latency_hint */)),
+                            SanitizeThreadSuffix(name()))),
         graph_def_version_(ctx->graph_def_version())
 
   {
@@ -782,7 +850,7 @@ class OneShotIteratorOp : public AsyncOpKernel {
         if (!initialization_started_) {
           // TODO(mrry): Convert the initialization code to use
           // callbacks instead of wasting a thread.
-          thread_pool_->Schedule([this, ctx, done]() { Init(ctx, done); });
+          background_worker_.Schedule([this, ctx, done]() { Init(ctx, done); });
           initialization_started_ = true;
         } else {
           done_callbacks_.emplace_back(ctx, std::move(done));
@@ -915,7 +983,7 @@ class OneShotIteratorOp : public AsyncOpKernel {
   DataTypeVector output_dtypes_;
   std::vector<PartialTensorShape> output_shapes_;
 
-  std::unique_ptr<thread::ThreadPool> thread_pool_;
+  BackgroundWorker background_worker_;
 
   mutex mu_;
   ContainerInfo cinfo_ GUARDED_BY(mu_);
@@ -932,11 +1000,9 @@ class IteratorGetNextOp : public AsyncOpKernel {
  public:
   explicit IteratorGetNextOp(OpKernelConstruction* ctx)
       : AsyncOpKernel(ctx),
-        thread_pool_(new thread::ThreadPool(
-            ctx->env(), ThreadOptions(),
-            strings::StrCat("iterator_get_next_thread_",
-                            SanitizeThreadSuffix(name())),
-            1 /* num_threads */, false /* low_latency_hint */)) {}
+        background_worker_(ctx->env(),
+                           strings::StrCat("iterator_get_next_thread_",
+                                           SanitizeThreadSuffix(name()))) {}
 
   void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
     IteratorResource* iterator;
@@ -945,7 +1011,7 @@ class IteratorGetNextOp : public AsyncOpKernel {
     // The call to `iterator->GetNext()` may block and depend on an
     // inter-op thread pool thread, so we issue the call from the
     // owned thread pool.
-    thread_pool_->Schedule(std::bind(
+    background_worker_.Schedule(std::bind(
         [ctx, iterator](DoneCallback done) {
           std::vector<Tensor> components;
           bool end_of_sequence = false;
@@ -982,7 +1048,7 @@ class IteratorGetNextOp : public AsyncOpKernel {
   }
 
  private:
-  std::unique_ptr<thread::ThreadPool> thread_pool_;
+  BackgroundWorker background_worker_;
 };
 
 class IteratorGetNextSyncOp : public OpKernel {
@@ -1019,6 +1085,86 @@ class IteratorGetNextSyncOp : public OpKernel {
   }
 };
 
+class IteratorGetNextAsOptionalOp : public AsyncOpKernel {
+ public:
+  explicit IteratorGetNextAsOptionalOp(OpKernelConstruction* ctx)
+      : AsyncOpKernel(ctx),
+        background_worker_(
+            ctx->env(), strings::StrCat("iterator_get_next_as_optional_thread_",
+                                        SanitizeThreadSuffix(name()))) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+  }
+
+  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
+    IteratorResource* iterator;
+    OP_REQUIRES_OK_ASYNC(
+        ctx, LookupResource(ctx, HandleFromInput(ctx, 0), &iterator), done);
+    // The call to `iterator->GetNext()` may block and depend on an
+    // inter-op thread pool thread, so we issue the call from the
+    // owned thread pool.
+    background_worker_.Schedule(std::bind(
+        [this, ctx, iterator](DoneCallback done) {
+          std::vector<Tensor> components;
+          bool end_of_sequence = false;
+
+          IteratorContext::Params params;
+          params.env = ctx->env();
+          params.runner = *(ctx->runner());
+          params.function_library = iterator->function_library();
+          DeviceBase* device = ctx->function_library()->device();
+          params.allocator_getter = [device](AllocatorAttributes attrs) {
+            return device->GetAllocator(attrs);
+          };
+          IteratorContext iter_ctx(std::move(params));
+
+          Status s =
+              iterator->GetNext(&iter_ctx, &components, &end_of_sequence);
+          // NOTE(mrry): We must unref the iterator before calling `done()`, to
+          // avoid destruction races.
+          iterator->Unref();
+
+          if (!s.ok()) {
+            ctx->SetStatus(s);
+          } else if (end_of_sequence) {
+            OP_REQUIRES_OK_ASYNC(ctx, WriteOptionalNoneToOutput(ctx, 0), done);
+          } else {
+            for (int i = 0; i < components.size(); ++i) {
+              OP_REQUIRES_ASYNC(
+                  ctx, components[i].dtype() == output_types_[i],
+                  errors::InvalidArgument(
+                      "The given optional does not match the expected type for "
+                      "component ",
+                      i, ". Expected: ", DataTypeString(output_types_[i]),
+                      ". Actual: ", DataTypeString(components[i].dtype()), "."),
+                  done);
+              OP_REQUIRES_ASYNC(
+                  ctx,
+                  output_shapes_[i].IsCompatibleWith(components[i].shape()),
+                  errors::InvalidArgument(
+                      "The given optional does not match the expected shape "
+                      "for component ",
+                      i, ". Expected: ", output_shapes_[i].DebugString(),
+                      ". Actual: ", components[i].shape().DebugString(), "."),
+                  done);
+            }
+
+            OP_REQUIRES_OK_ASYNC(
+                ctx,
+                WriteOptionalWithValueToOutput(ctx, 0, std::move(components)),
+                done);
+          }
+          done();
+        },
+        std::move(done)));
+  }
+
+ private:
+  BackgroundWorker background_worker_;
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+};
+
 class IteratorToStringHandleOp : public OpKernel {
  public:
   explicit IteratorToStringHandleOp(OpKernelConstruction* ctx)
@@ -1150,22 +1296,48 @@ class DeserializeIteratorOp : public OpKernel {
 
 
 REGISTER_KERNEL_BUILDER(Name("Iterator").Device(DEVICE_CPU), IteratorHandleOp);
+REGISTER_KERNEL_BUILDER(Name("IteratorV2").Device(DEVICE_CPU),
+                        IteratorHandleOp);
+REGISTER_KERNEL_BUILDER(Name("IteratorV2").Device(DEVICE_GPU),
+                        IteratorHandleOp);
 REGISTER_KERNEL_BUILDER(Name("MakeIterator").Device(DEVICE_CPU),
                         MakeIteratorOp);
+REGISTER_KERNEL_BUILDER(
+    Name("MakeIterator").Device(DEVICE_GPU).HostMemory("dataset"),
+    MakeIteratorOp);
 REGISTER_KERNEL_BUILDER(Name("AnonymousIterator").Device(DEVICE_CPU),
                         AnonymousIteratorHandleOp);
+REGISTER_KERNEL_BUILDER(Name("AnonymousIterator").Device(DEVICE_GPU),
+                        AnonymousIteratorHandleOp);
 REGISTER_KERNEL_BUILDER(Name("DatasetToSingleElement").Device(DEVICE_CPU),
                         ToSingleElementOp);
 REGISTER_KERNEL_BUILDER(Name("OneShotIterator").Device(DEVICE_CPU),
                         OneShotIteratorOp);
 REGISTER_KERNEL_BUILDER(Name("IteratorGetNext").Device(DEVICE_CPU),
                         IteratorGetNextOp);
+REGISTER_KERNEL_BUILDER(Name("IteratorGetNext").Device(DEVICE_GPU),
+                        IteratorGetNextOp);
 REGISTER_KERNEL_BUILDER(Name("IteratorGetNextSync").Device(DEVICE_CPU),
                         IteratorGetNextSyncOp);
+REGISTER_KERNEL_BUILDER(Name("IteratorGetNextSync").Device(DEVICE_GPU),
+                        IteratorGetNextSyncOp);
+// TODO(b/111349762): Add registration for other devices.
+REGISTER_KERNEL_BUILDER(Name("IteratorGetNextAsOptional").Device(DEVICE_CPU),
+                        IteratorGetNextAsOptionalOp);
 REGISTER_KERNEL_BUILDER(Name("IteratorToStringHandle").Device(DEVICE_CPU),
                         IteratorToStringHandleOp);
+REGISTER_KERNEL_BUILDER(Name("IteratorToStringHandle")
+                            .Device(DEVICE_GPU)
+                            .HostMemory("string_handle"),
+                        IteratorToStringHandleOp);
 REGISTER_KERNEL_BUILDER(Name("IteratorFromStringHandle").Device(DEVICE_CPU),
                         IteratorFromStringHandleOp);
+REGISTER_KERNEL_BUILDER(Name("IteratorFromStringHandleV2").Device(DEVICE_CPU),
+                        IteratorFromStringHandleOp);
+REGISTER_KERNEL_BUILDER(Name("IteratorFromStringHandleV2")
+                            .Device(DEVICE_GPU)
+                            .HostMemory("string_handle"),
+                        IteratorFromStringHandleOp);
 REGISTER_KERNEL_BUILDER(Name("SerializeIterator").Device(DEVICE_CPU),
                         SerializeIteratorOp);
 REGISTER_KERNEL_BUILDER(Name("DeserializeIterator").Device(DEVICE_CPU),
diff --git a/tensorflow/core/kernels/data/optimize_dataset_op.cc b/tensorflow/core/kernels/data/optimize_dataset_op.cc
index 8965858e8d..276f5f89c8 100644
--- a/tensorflow/core/kernels/data/optimize_dataset_op.cc
+++ b/tensorflow/core/kernels/data/optimize_dataset_op.cc
@@ -54,8 +54,8 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
         ctx, ParseVectorArgument<string>(ctx, "optimizations", &optimizations));
     Dataset* dataset =
         new Dataset(ctx, input, optimizations, output_types_, output_shapes_);
-    core::ScopedUnref unref(dataset);
-    OP_REQUIRES_OK(ctx, dataset->Optimize(ctx, output));
+    OP_REQUIRES_OK(ctx, dataset->Optimize(ctx));
+    *output = dataset;
   }
 
  private:
@@ -73,7 +73,10 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
       input_->Ref();
     }
 
-    ~Dataset() override { input_->Unref(); }
+    ~Dataset() override {
+      input_->Unref();
+      optimized_input_->Unref();
+    }
 
     std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
@@ -81,7 +84,7 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
           new Iterator({this, strings::StrCat(prefix, "::Optimize")}));
     }
 
-    Status Optimize(OpKernelContext* ctx, DatasetBase** output) {
+    Status Optimize(OpKernelContext* ctx) {
       GraphDefBuilder b;
       DatasetGraphDefBuilder db(&b);
       Node* input_node = nullptr;
@@ -89,18 +92,20 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
       string output_node = input_node->name();
       GraphDef graph_def;
       TF_RETURN_IF_ERROR(b.ToGraphDef(&graph_def));
+      VLOG(3) << "Before optimization: " << graph_def.DebugString();
       TF_RETURN_IF_ERROR(ApplyOptimizations(ctx, &graph_def, &output_node));
-
+      VLOG(3) << "After optimization: " << graph_def.DebugString();
+      flib_def_.reset(new FunctionLibraryDefinition(OpRegistry::Global(),
+                                                    graph_def.library()));
       Graph graph(OpRegistry::Global());
       TF_RETURN_IF_ERROR(ImportGraphDef({}, graph_def, &graph, nullptr));
       std::vector<Tensor> outputs;
-      GraphRunner graph_runner(ctx->env());
-      // Once rewrites that add/modify functions are introduced, we will need
-      // persist the results in a function library runtime.
+      GraphRunner graph_runner(ctx->function_library()->device());
       TF_RETURN_IF_ERROR(graph_runner.Run(&graph, ctx->function_library(), {},
                                           {output_node}, &outputs));
-      TF_RETURN_IF_ERROR(GetDatasetFromVariantTensor(outputs[0], output));
-      (*output)->Ref();
+      TF_RETURN_IF_ERROR(
+          GetDatasetFromVariantTensor(outputs[0], &optimized_input_));
+      optimized_input_->Ref();
       return Status::OK();
     }
 
@@ -113,6 +118,18 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
 
     string DebugString() const override { return "OptimizeDatasetOp::Dataset"; }
 
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      Node* input_graph_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
+      Node* optimizations_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddVector(optimizations_, &optimizations_node));
+      TF_RETURN_IF_ERROR(
+          b->AddDataset(this, {input_graph_node, optimizations_node}, output));
+      return Status::OK();
+    }
+
    private:
     class Iterator : public DatasetIterator<Dataset> {
      public:
@@ -120,15 +137,38 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
           : DatasetIterator<Dataset>(params) {}
 
       Status Initialize(IteratorContext* ctx) override {
-        return errors::Unimplemented(strings::StrCat(prefix(), "::Initialize"));
+        return dataset()->optimized_input_->MakeIterator(ctx, prefix(),
+                                                         &input_impl_);
       }
 
       Status GetNextInternal(IteratorContext* ctx,
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
-        return errors::Unimplemented(
-            strings::StrCat(prefix(), "::GetNextInternal"));
+        IteratorContext::Params params;
+        params.env = ctx->env();
+        params.runner = *(ctx->runner());
+        params.stats_aggregator_getter = ctx->stats_aggregator_getter();
+        params.lib = ctx->lib();
+        params.function_library = dataset()->flib_def_;
+        params.allocator_getter = ctx->allocator_getter();
+        IteratorContext iter_ctx(params);
+        return input_impl_->GetNext(&iter_ctx, out_tensors, end_of_sequence);
+      }
+
+     protected:
+      Status SaveInternal(IteratorStateWriter* writer) override {
+        TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+        return Status::OK();
       }
+
+      Status RestoreInternal(IteratorContext* ctx,
+                             IteratorStateReader* reader) override {
+        TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+        return Status::OK();
+      }
+
+     private:
+      std::unique_ptr<IteratorBase> input_impl_;
     };
 
     Status ApplyOptimizations(OpKernelContext* ctx, GraphDef* graph_def,
@@ -136,16 +176,8 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
       // Add a fake sink node to allow rewriting the actual sink node.
       NodeDef* node = graph_def->mutable_node()->Add();
       node->set_name("FakeSink");
-      node->set_op("IdentityDataset");
+      node->set_op("SinkDataset");
       node->add_input(*output_node);
-      {
-        grappler::GraphView graph(graph_def);
-        NodeDef* sink = graph.GetNode(*output_node);
-        (*node->mutable_attr())["output_shapes"] =
-            sink->attr().at("output_shapes");
-        (*node->mutable_attr())["output_types"] =
-            sink->attr().at("output_types");
-      }
 
       // Create metagraph.
       MetaGraphDef meta_graph_def;
@@ -162,10 +194,10 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
       for (const string& optimization : optimizations_) {
         rewriter_config.add_optimizers(optimization);
       }
-      // If no optimizations were specified, supply a non-existent optimization
-      // to prevent Grappler from applying the default set of optimizations as
-      // some of them do not work out of the box at the moment (e.g. because we
-      // have no cost model for dataset ops).
+      // If no optimizations were specified, supply a non-existent
+      // optimization to prevent Grappler from applying the default set of
+      // optimizations as some of them do not work out of the box at the
+      // moment (e.g. because we have no cost model for dataset ops).
       if (optimizations_.empty()) {
         rewriter_config.add_optimizers("non-existent");
       }
@@ -178,6 +210,12 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
       tensorflow::grappler::VirtualCluster cluster(device_map);
 
       // Run optimizer.
+      if (VLOG_IS_ON(2)) {
+        LOG(INFO) << "Performing the following optimizations:";
+        for (const string& optimization : optimizations_) {
+          LOG(INFO) << "  " << optimization;
+        }
+      }
       TF_RETURN_IF_ERROR(tensorflow::grappler::RunMetaOptimizer(
           *grappler_item, rewriter_config, ctx->device(), &cluster, graph_def));
 
@@ -192,6 +230,8 @@ class OptimizeDatasetOp : public UnaryDatasetOpKernel {
       return Status::OK();
     }
 
+    DatasetBase* optimized_input_;
+    std::shared_ptr<FunctionLibraryDefinition> flib_def_;
     const DatasetBase* input_;
     const std::vector<string> optimizations_;
     const DataTypeVector output_types_;
diff --git a/tensorflow/core/kernels/data/optional_ops.cc b/tensorflow/core/kernels/data/optional_ops.cc
new file mode 100644
index 0000000000..11a791969c
--- /dev/null
+++ b/tensorflow/core/kernels/data/optional_ops.cc
@@ -0,0 +1,224 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/kernels/data/optional_ops.h"
+
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/variant_encode_decode.h"
+
+namespace tensorflow {
+namespace {
+const char kOptionalVariantTypeName[] = "tensorflow::data::Optional";
+
+// An `OptionalVariant` can represent either an "actual value" (a tuple of
+// tensors) or "none", and may be stored in a DT_VARIANT tensor.
+//
+// TODO(b/111349762): Add registrations for copying `OptionalVariant` between
+// devices.
+class OptionalVariant {
+ public:
+  // Create an `OptionalVariant` with no actual value.
+  OptionalVariant() : values_(nullptr) {}
+
+  // Create an `OptionalVariant` with the actual value given by the tuple of
+  // tensors in `values`.
+  explicit OptionalVariant(std::vector<Tensor> values)
+      : values_(new std::vector<Tensor>(std::move(values))) {}
+
+  OptionalVariant(const OptionalVariant& other) : values_(other.values_) {}
+
+  // Returns true if `this` represents an actual value.
+  bool has_value() const { return values_ != nullptr; }
+
+  // REQUIRES: `this->has_value()` must be true.
+  const std::vector<Tensor>& get_values() const {
+    CHECK(values_) << "Tried to get values from an empty OptionalVariant";
+    return *values_;
+  }
+
+  // Implementations of the necessary methods for using `OptionalVariant`
+  // objects in DT_VARIANT tensors.
+  string TypeName() const { return kOptionalVariantTypeName; }
+  void Encode(VariantTensorData* data) const {
+    data->set_metadata(values_ != nullptr);
+    if (values_ != nullptr) {
+      for (const auto& t : *values_) {
+        *(data->add_tensors()) = t;
+      }
+    }
+  }
+
+  bool Decode(const VariantTensorData& data) {
+    if (data.type_name() != TypeName()) {
+      return false;
+    }
+    bool has_value = false;
+    if (!data.get_metadata(&has_value)) {
+      return false;
+    }
+    if (has_value) {
+      values_.reset(new std::vector<Tensor>(data.tensors()));
+    } else {
+      values_.reset();
+    }
+    return true;
+  }
+
+  string DebugString() const {
+    if (values_) {
+      return strings::StrCat("OptionalVariant<", "values: (",
+                             str_util::Join(*values_, ", ",
+                                            [](string* s, const Tensor& elem) {
+                                              *s = elem.DebugString();
+                                            }),
+                             ")>");
+    } else {
+      return strings::StrCat("OptionalVariant<None>");
+    }
+  }
+
+ private:
+  std::shared_ptr<const std::vector<Tensor>> values_;
+};
+
+class OptionalNoneOp : public OpKernel {
+ public:
+  explicit OptionalNoneOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+  void Compute(OpKernelContext* ctx) override {
+    OP_REQUIRES_OK(ctx, WriteOptionalNoneToOutput(ctx, 0));
+  }
+};
+
+class OptionalFromValueOp : public OpKernel {
+ public:
+  explicit OptionalFromValueOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+  void Compute(OpKernelContext* ctx) override {
+    OpInputList components_input;
+    OP_REQUIRES_OK(ctx, ctx->input_list("components", &components_input));
+    std::vector<Tensor> components;
+    components.reserve(components_input.size());
+    for (const Tensor& component_t : components_input) {
+      components.push_back(component_t);
+    }
+    OP_REQUIRES_OK(
+        ctx, WriteOptionalWithValueToOutput(ctx, 0, std::move(components)));
+  }
+};
+
+class OptionalHasValueOp : public OpKernel {
+ public:
+  explicit OptionalHasValueOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+  void Compute(OpKernelContext* ctx) override {
+    const Tensor* optional_input;
+    OP_REQUIRES_OK(ctx, ctx->input("optional", &optional_input));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(optional_input->shape()),
+                errors::InvalidArgument(
+                    "Input to OptionalHasValue must be a scalar tensor "
+                    "containing an OptionalVariant object."));
+    const OptionalVariant* optional =
+        optional_input->scalar<Variant>()().get<OptionalVariant>();
+    OP_REQUIRES(
+        ctx, optional != nullptr,
+        errors::InvalidArgument(
+            "Input to OptionalHasValue must be an OptionalVariant object."));
+    Tensor* result;
+    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, {}, &result));
+    result->scalar<bool>()() = optional->has_value();
+  }
+};
+
+class OptionalGetValueOp : public OpKernel {
+ public:
+  explicit OptionalGetValueOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+  }
+
+  void Compute(OpKernelContext* ctx) override {
+    const Tensor* optional_input;
+    OP_REQUIRES_OK(ctx, ctx->input("optional", &optional_input));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(optional_input->shape()),
+                errors::InvalidArgument(
+                    "Input to OptionalHasValue must be a scalar tensor "
+                    "containing an OptionalVariant object."));
+    const OptionalVariant* optional =
+        optional_input->scalar<Variant>()().get<OptionalVariant>();
+    OP_REQUIRES(
+        ctx, optional != nullptr,
+        errors::InvalidArgument(
+            "Input to OptionalHasValue must be an OptionalVariant object."));
+    OP_REQUIRES(
+        ctx, optional->has_value(),
+        errors::InvalidArgument("The given optional does not have a value."));
+    const auto& components = optional->get_values();
+    for (int i = 0; i < components.size(); ++i) {
+      OP_REQUIRES(
+          ctx, components[i].dtype() == output_types_[i],
+          errors::InvalidArgument(
+              "The given optional does not match the expected type for "
+              "component ",
+              i, ". Expected: ", DataTypeString(output_types_[i]),
+              ". Actual: ", DataTypeString(components[i].dtype()), "."));
+      OP_REQUIRES(ctx,
+                  output_shapes_[i].IsCompatibleWith(components[i].shape()),
+                  errors::InvalidArgument(
+                      "The given optional does not match the expected shape "
+                      "for component ",
+                      i, ". Expected: ", output_shapes_[i].DebugString(),
+                      ". Actual: ", components[i].shape().DebugString(), "."));
+      ctx->set_output(i, components[i]);
+    }
+  }
+
+ private:
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+};
+
+// TODO(b/111349762): Add registrations for other devices.
+REGISTER_KERNEL_BUILDER(Name("OptionalNone").Device(DEVICE_CPU),
+                        OptionalNoneOp);
+REGISTER_KERNEL_BUILDER(Name("OptionalFromValue").Device(DEVICE_CPU),
+                        OptionalFromValueOp);
+
+REGISTER_KERNEL_BUILDER(Name("OptionalHasValue").Device(DEVICE_CPU),
+                        OptionalHasValueOp);
+REGISTER_KERNEL_BUILDER(Name("OptionalGetValue").Device(DEVICE_CPU),
+                        OptionalGetValueOp);
+
+}  // namespace
+
+Status WriteOptionalWithValueToOutput(OpKernelContext* ctx, int output_index,
+                                      std::vector<Tensor> value) {
+  OptionalVariant v(std::move(value));
+  Tensor* variant_t;
+  TF_RETURN_IF_ERROR(
+      ctx->allocate_output(output_index, TensorShape({}), &variant_t));
+  variant_t->scalar<Variant>()() = v;
+  return Status::OK();
+}
+
+Status WriteOptionalNoneToOutput(OpKernelContext* ctx, int output_index) {
+  OptionalVariant v;
+  Tensor* variant_t;
+  TF_RETURN_IF_ERROR(
+      ctx->allocate_output(output_index, TensorShape({}), &variant_t));
+  variant_t->scalar<Variant>()() = v;
+  return Status::OK();
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/optional_ops.h b/tensorflow/core/kernels/data/optional_ops.h
new file mode 100644
index 0000000000..6f25567678
--- /dev/null
+++ b/tensorflow/core/kernels/data/optional_ops.h
@@ -0,0 +1,36 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_KERNELS_DATA_OPTIONAL_OPS_H_
+#define TENSORFLOW_CORE_KERNELS_DATA_OPTIONAL_OPS_H_
+
+#include <vector>
+
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/variant_tensor_data.h"
+
+namespace tensorflow {
+
+// Stores a DT_VARIANT value representing an Optional with the given value
+// in the `output_index`^th output of the given kernel execution context.
+Status WriteOptionalWithValueToOutput(OpKernelContext* ctx, int output_index,
+                                      std::vector<Tensor> value);
+
+// Stores a DT_VARIANT value representing an Optional with no value
+// in the `output_index`^th output of the given kernel execution context.
+Status WriteOptionalNoneToOutput(OpKernelContext* ctx, int output_index);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_KERNELS_DATA_OPTIONAL_OPS_H_
diff --git a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
index 15f3dc3b1d..b736b33c2e 100644
--- a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
+++ b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/kernels/data/captured_function.h"
 #include "tensorflow/core/kernels/data/dataset.h"
+#include "tensorflow/core/kernels/data/parallel_map_iterator.h"
 #include "tensorflow/core/lib/core/error_codes.pb.h"
 #include "tensorflow/core/lib/random/random.h"
 
@@ -87,8 +88,16 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
 
     std::unique_ptr<IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
-      return std::unique_ptr<IteratorBase>(
-          new Iterator({this, strings::StrCat(prefix, "::ParallelMap")}));
+      auto map_func = [this](IteratorContext* ctx,
+                             std::vector<Tensor> input_element,
+                             std::vector<Tensor>* result, StatusCallback done) {
+        captured_func_->RunAsync(ctx, std::move(input_element), result,
+                                 std::move(done));
+      };
+
+      return NewParallelMapIterator(
+          {this, strings::StrCat(prefix, "::ParallelMap")}, input_,
+          std::move(map_func), num_parallel_calls_);
     }
 
     const DataTypeVector& output_dtypes() const override {
@@ -148,279 +157,6 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
     }
 
    private:
-    class Iterator : public DatasetIterator<Dataset> {
-     public:
-      explicit Iterator(const Params& params)
-          : DatasetIterator<Dataset>(params) {}
-
-      ~Iterator() override {
-        // TODO(mrry): Replace this cancellation logic with a
-        // CancellationManager. The syntax would be more heavyweight,
-        // but it would be possible to thread a cancellation manager
-        // through the IteratorContext to upstream,
-        // potentially-blocking iterators, when we add these.
-        mutex_lock l(mu_);
-        // Cancel the runner thread.
-        cancelled_ = true;
-        cond_var_.notify_all();
-        // Wait for all in-flight calls to complete.
-        while (num_calls_ > 0) {
-          cond_var_.wait(l);
-        }
-      }
-
-      Status Initialize(IteratorContext* ctx) override {
-        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
-      }
-
-      Status GetNextInternal(IteratorContext* ctx,
-                             std::vector<Tensor>* out_tensors,
-                             bool* end_of_sequence) override {
-        std::shared_ptr<InvocationResult> result;
-        {
-          mutex_lock l(mu_);
-          EnsureRunnerThreadStarted(ctx);
-          while (invocation_results_.empty()) {
-            cond_var_.wait(l);
-          }
-          std::swap(result, invocation_results_.front());
-          invocation_results_.pop_front();
-        }
-        cond_var_.notify_all();
-        result->notification.WaitForNotification();
-        return ProcessResult(result, out_tensors, end_of_sequence);
-      }
-
-     protected:
-      Status SaveInternal(IteratorStateWriter* writer) override {
-        mutex_lock l(mu_);
-        // Wait for all in-flight calls to complete.
-        while (num_calls_ > 0) {
-          cond_var_.wait(l);
-        }
-        CHECK_EQ(num_calls_, 0);
-        TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
-        TF_RETURN_IF_ERROR(writer->WriteScalar(
-            full_name("invocation_results.size"), invocation_results_.size()));
-        for (size_t i = 0; i < invocation_results_.size(); i++) {
-          std::shared_ptr<InvocationResult> result = invocation_results_[i];
-          TF_RETURN_IF_ERROR(WriteStatusLocked(writer, i, result->status));
-          TF_RETURN_IF_ERROR(writer->WriteScalar(
-              full_name(strings::StrCat("invocation_results[", i, "].size")),
-              result->return_values.size()));
-          for (size_t j = 0; j < result->return_values.size(); j++) {
-            TF_RETURN_IF_ERROR(writer->WriteTensor(
-                full_name(
-                    strings::StrCat("invocation_results[", i, "][", j, "]")),
-                result->return_values[j]));
-          }
-          if (result->end_of_input) {
-            TF_RETURN_IF_ERROR(writer->WriteScalar(
-                full_name(strings::StrCat("invocation_results[", i,
-                                          "].end_of_input")),
-                ""));
-          }
-        }
-        return Status::OK();
-      }
-
-      Status RestoreInternal(IteratorContext* ctx,
-                             IteratorStateReader* reader) override {
-        mutex_lock l(mu_);
-        TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
-        int64 invocation_results_size;
-        TF_RETURN_IF_ERROR(reader->ReadScalar(
-            full_name("invocation_results.size"), &invocation_results_size));
-        for (size_t i = 0; i < invocation_results_size; i++) {
-          std::shared_ptr<InvocationResult> result(new InvocationResult());
-          invocation_results_.push_back(result);
-          TF_RETURN_IF_ERROR(ReadStatusLocked(reader, i, &result->status));
-          size_t num_return_values;
-          {
-            int64 size;
-            TF_RETURN_IF_ERROR(reader->ReadScalar(
-                full_name(strings::StrCat("invocation_results[", i, "].size")),
-                &size));
-            num_return_values = static_cast<size_t>(size);
-            if (num_return_values != size) {
-              return errors::InvalidArgument(strings::StrCat(
-                  full_name(
-                      strings::StrCat("invocation_results[", i, "].size")),
-                  ": ", size, " is not a valid value of type size_t."));
-            }
-          }
-          result->return_values.reserve(num_return_values);
-          for (size_t j = 0; j < num_return_values; j++) {
-            result->return_values.emplace_back();
-            TF_RETURN_IF_ERROR(
-                reader->ReadTensor(full_name(strings::StrCat(
-                                       "invocation_results[", i, "][", j, "]")),
-                                   &result->return_values.back()));
-          }
-          result->end_of_input = reader->Contains(full_name(
-              strings::StrCat("invocation_results[", i, "].end_of_input")));
-          result->notification.Notify();
-        }
-        return Status::OK();
-      }
-
-     private:
-      struct InvocationResult {
-        Notification notification;
-        Status status;
-        std::vector<Tensor> return_values;
-        bool end_of_input;
-      };
-
-      void EnsureRunnerThreadStarted(IteratorContext* ctx)
-          EXCLUSIVE_LOCKS_REQUIRED(mu_) {
-        if (!runner_thread_) {
-          std::shared_ptr<IteratorContext> ctx_copy(new IteratorContext(*ctx));
-          runner_thread_.reset(ctx->env()->StartThread(
-              {}, "runner_thread",
-              std::bind(&Iterator::RunnerThread, this, ctx_copy)));
-        }
-      }
-
-      void CallCompleted(const std::shared_ptr<InvocationResult>& result)
-          LOCKS_EXCLUDED(mu_) {
-        {
-          mutex_lock l(mu_);
-          num_calls_--;
-        }
-        result->notification.Notify();
-        cond_var_.notify_all();
-      }
-
-      void CallFunction(const std::shared_ptr<IteratorContext>& ctx,
-                        const std::shared_ptr<InvocationResult>& result)
-          LOCKS_EXCLUDED(mu_) {
-        // Get the next input element.
-        std::vector<Tensor> input_element;
-        result->status = input_impl_->GetNext(ctx.get(), &input_element,
-                                              &result->end_of_input);
-        if (result->end_of_input || !result->status.ok()) {
-          CallCompleted(result);
-          return;
-        }
-
-        // Call `func_(input_element)`, store the result in
-        // `result->return_values`, and notify `result->notification` to unblock
-        // a consumer.
-        auto done = [this, result](Status status) {
-          result->status.Update(status);
-          CallCompleted(result);
-        };
-        dataset()->captured_func_->RunAsync(ctx.get(), std::move(input_element),
-                                            &result->return_values, done);
-      }
-
-      int64 MaxInvocationResults() { return dataset()->num_parallel_calls_; }
-
-      Status ProcessResult(const std::shared_ptr<InvocationResult>& result,
-                           std::vector<Tensor>* out_tensors,
-                           bool* end_of_sequence) {
-        if (!result->end_of_input && result->status.ok()) {
-          *out_tensors = std::move(result->return_values);
-          *end_of_sequence = false;
-          return Status::OK();
-        }
-        if (errors::IsOutOfRange(result->status)) {
-          // `f` may deliberately raise `errors::OutOfRange` to indicate that we
-          // should terminate the iteration early.
-          *end_of_sequence = true;
-          return Status::OK();
-        }
-        *end_of_sequence = result->end_of_input;
-        return result->status;
-      }
-
-      void RunnerThread(const std::shared_ptr<IteratorContext>& ctx) {
-        std::vector<std::shared_ptr<InvocationResult>> new_calls;
-        new_calls.reserve(dataset()->num_parallel_calls_);
-        while (true) {
-          {
-            mutex_lock l(mu_);
-            while (!cancelled_ &&
-                   (num_calls_ >= dataset()->num_parallel_calls_ ||
-                    invocation_results_.size() >= MaxInvocationResults())) {
-              cond_var_.wait(l);
-            }
-            if (cancelled_) {
-              return;
-            }
-            while (num_calls_ < dataset()->num_parallel_calls_ &&
-                   invocation_results_.size() < MaxInvocationResults()) {
-              invocation_results_.emplace_back(new InvocationResult());
-              new_calls.push_back(invocation_results_.back());
-              num_calls_++;
-            }
-          }
-          cond_var_.notify_all();
-          for (const auto& call : new_calls) {
-            CallFunction(ctx, call);
-          }
-          new_calls.clear();
-        }
-      }
-
-      Status WriteStatusLocked(IteratorStateWriter* writer, size_t index,
-                               const Status& status)
-          EXCLUSIVE_LOCKS_REQUIRED(mu_) {
-        TF_RETURN_IF_ERROR(writer->WriteScalar(
-            CodeKey(index), static_cast<int64>(status.code())));
-        if (!status.ok()) {
-          TF_RETURN_IF_ERROR(writer->WriteScalar(ErrorMessageKey(index),
-                                                 status.error_message()));
-        }
-        return Status::OK();
-      }
-
-      Status ReadStatusLocked(IteratorStateReader* reader, size_t index,
-                              Status* status) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
-        int64 code_int;
-        TF_RETURN_IF_ERROR(reader->ReadScalar(CodeKey(index), &code_int));
-        error::Code code = static_cast<error::Code>(code_int);
-
-        if (code != error::Code::OK) {
-          string error_message;
-          TF_RETURN_IF_ERROR(
-              reader->ReadScalar(ErrorMessageKey(index), &error_message));
-          *status = Status(code, error_message);
-        } else {
-          *status = Status::OK();
-        }
-        return Status::OK();
-      }
-
-      string CodeKey(size_t index) {
-        return full_name(
-            strings::StrCat("invocation_results[", index, "].code"));
-      }
-
-      string ErrorMessageKey(size_t index) {
-        return full_name(
-            strings::StrCat("invocation_results[", index, "].error_message"));
-      }
-
-      // Used for coordination between the main thread and the runner thread.
-      mutex mu_;
-      // Used for coordination between the main thread and the runner thread. In
-      // particular, the runner thread should only schedule new calls when the
-      // number of in-flight calls is less than the user specified level of
-      // parallelism and there are slots available in the `invocation_results_`
-      // buffer.
-      condition_variable cond_var_;
-      // Counts the number of outstanding calls.
-      int64 num_calls_ GUARDED_BY(mu_) = 0;
-      std::unique_ptr<IteratorBase> input_impl_;
-      // Buffer for storing the invocation results.
-      std::deque<std::shared_ptr<InvocationResult>> invocation_results_
-          GUARDED_BY(mu_);
-      std::unique_ptr<Thread> runner_thread_ GUARDED_BY(mu_);
-      bool cancelled_ GUARDED_BY(mu_) = false;
-    };
-
     const DatasetBase* const input_;
     const NameAttrList func_;
     const int32 num_parallel_calls_;
diff --git a/tensorflow/core/kernels/data/parallel_map_iterator.cc b/tensorflow/core/kernels/data/parallel_map_iterator.cc
new file mode 100644
index 0000000000..10549df25e
--- /dev/null
+++ b/tensorflow/core/kernels/data/parallel_map_iterator.cc
@@ -0,0 +1,318 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/kernels/data/parallel_map_iterator.h"
+
+#include <deque>
+#include <functional>
+#include <utility>
+#include <vector>
+
+namespace tensorflow {
+namespace {
+
+class ParallelMapIterator : public DatasetBaseIterator {
+ public:
+  explicit ParallelMapIterator(
+      const typename DatasetBaseIterator::BaseParams& params,
+      const DatasetBase* input_dataset, ParallelMapIteratorFunction map_func,
+      int32 num_parallel_calls)
+      : DatasetBaseIterator(params),
+        input_dataset_(input_dataset),
+        map_func_(std::move(map_func)),
+        num_parallel_calls_(num_parallel_calls) {}
+
+  ~ParallelMapIterator() override {
+    // TODO(mrry): Replace this cancellation logic with a
+    // CancellationManager. The syntax would be more heavyweight,
+    // but it would be possible to thread a cancellation manager
+    // through the IteratorContext to upstream,
+    // potentially-blocking iterators, when we add these.
+    mutex_lock l(mu_);
+    // Cancel the runner thread.
+    cancelled_ = true;
+    cond_var_.notify_all();
+    // Wait for all in-flight calls to complete.
+    while (num_calls_ > 0) {
+      cond_var_.wait(l);
+    }
+  }
+
+  Status Initialize(IteratorContext* ctx) override {
+    return input_dataset_->MakeIterator(ctx, prefix(), &input_impl_);
+  }
+
+  Status GetNextInternal(IteratorContext* ctx, std::vector<Tensor>* out_tensors,
+                         bool* end_of_sequence) override {
+    std::shared_ptr<InvocationResult> result;
+    {
+      mutex_lock l(mu_);
+      EnsureRunnerThreadStarted(ctx);
+      while (invocation_results_.empty()) {
+        cond_var_.wait(l);
+      }
+      std::swap(result, invocation_results_.front());
+      invocation_results_.pop_front();
+    }
+    cond_var_.notify_all();
+    result->notification.WaitForNotification();
+    return ProcessResult(result, out_tensors, end_of_sequence);
+  }
+
+ protected:
+  Status SaveInternal(IteratorStateWriter* writer) override {
+    mutex_lock l(mu_);
+    // Wait for all in-flight calls to complete.
+    while (num_calls_ > 0) {
+      cond_var_.wait(l);
+    }
+    CHECK_EQ(num_calls_, 0);
+    TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+    TF_RETURN_IF_ERROR(
+        writer->WriteScalar(full_name("invocation_results.size"),
+                            invocation_results_.size()));
+    for (size_t i = 0; i < invocation_results_.size(); i++) {
+      std::shared_ptr<InvocationResult> result = invocation_results_[i];
+      TF_RETURN_IF_ERROR(WriteStatusLocked(writer, i, result->status));
+      TF_RETURN_IF_ERROR(writer->WriteScalar(
+          full_name(strings::StrCat("invocation_results[", i, "].size")),
+          result->return_values.size()));
+      for (size_t j = 0; j < result->return_values.size(); j++) {
+        TF_RETURN_IF_ERROR(
+            writer->WriteTensor(full_name(strings::StrCat(
+                                    "invocation_results[", i, "][", j, "]")),
+                                result->return_values[j]));
+      }
+      if (result->end_of_input) {
+        TF_RETURN_IF_ERROR(writer->WriteScalar(
+            full_name(
+                strings::StrCat("invocation_results[", i, "].end_of_input")),
+            ""));
+      }
+    }
+    return Status::OK();
+  }
+
+  Status RestoreInternal(IteratorContext* ctx,
+                         IteratorStateReader* reader) override {
+    mutex_lock l(mu_);
+    TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+    int64 invocation_results_size;
+    TF_RETURN_IF_ERROR(reader->ReadScalar(
+        full_name("invocation_results.size"), &invocation_results_size));
+    for (size_t i = 0; i < invocation_results_size; i++) {
+      std::shared_ptr<InvocationResult> result(new InvocationResult());
+      invocation_results_.push_back(result);
+      TF_RETURN_IF_ERROR(ReadStatusLocked(reader, i, &result->status));
+      size_t num_return_values;
+      {
+        int64 size;
+        TF_RETURN_IF_ERROR(
+            reader->ReadScalar(full_name(strings::StrCat(
+                                   "invocation_results[", i, "].size")),
+                               &size));
+        num_return_values = static_cast<size_t>(size);
+        if (num_return_values != size) {
+          return errors::InvalidArgument(strings::StrCat(
+              full_name(
+                  strings::StrCat("invocation_results[", i, "].size")),
+              ": ", size, " is not a valid value of type size_t."));
+        }
+      }
+      result->return_values.reserve(num_return_values);
+      for (size_t j = 0; j < num_return_values; j++) {
+        result->return_values.emplace_back();
+        TF_RETURN_IF_ERROR(
+            reader->ReadTensor(full_name(strings::StrCat(
+                                   "invocation_results[", i, "][", j, "]")),
+                               &result->return_values.back()));
+      }
+      result->end_of_input = reader->Contains(full_name(
+          strings::StrCat("invocation_results[", i, "].end_of_input")));
+      result->notification.Notify();
+    }
+    return Status::OK();
+  }
+
+ private:
+  struct InvocationResult {
+    Notification notification;
+    Status status;
+    std::vector<Tensor> return_values;
+    bool end_of_input;
+  };
+
+  void EnsureRunnerThreadStarted(IteratorContext* ctx)
+      EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+    if (!runner_thread_) {
+      std::shared_ptr<IteratorContext> ctx_copy(new IteratorContext(*ctx));
+      runner_thread_.reset(ctx->env()->StartThread(
+          {}, "runner_thread",
+          std::bind(&ParallelMapIterator::RunnerThread, this, ctx_copy)));
+    }
+  }
+
+  void CallCompleted(const std::shared_ptr<InvocationResult>& result)
+      LOCKS_EXCLUDED(mu_) {
+    {
+      mutex_lock l(mu_);
+      num_calls_--;
+    }
+    result->notification.Notify();
+    cond_var_.notify_all();
+  }
+
+  void CallFunction(const std::shared_ptr<IteratorContext>& ctx,
+                    const std::shared_ptr<InvocationResult>& result)
+      LOCKS_EXCLUDED(mu_) {
+    // Get the next input element.
+    std::vector<Tensor> input_element;
+    result->status =
+        input_impl_->GetNext(ctx.get(), &input_element, &result->end_of_input);
+    if (result->end_of_input || !result->status.ok()) {
+      CallCompleted(result);
+      return;
+    }
+
+    // Call `func_(input_element)`, store the result in
+    // `result->return_values`, and notify `result->notification` to unblock
+    // a consumer.
+    auto done = [this, result](Status status) {
+      result->status.Update(status);
+      CallCompleted(result);
+    };
+
+    map_func_(ctx.get(), std::move(input_element), &result->return_values,
+              std::move(done));
+  }
+
+  int64 MaxInvocationResults() { return num_parallel_calls_; }
+
+  Status ProcessResult(const std::shared_ptr<InvocationResult>& result,
+                       std::vector<Tensor>* out_tensors,
+                       bool* end_of_sequence) {
+    if (!result->end_of_input && result->status.ok()) {
+      *out_tensors = std::move(result->return_values);
+      *end_of_sequence = false;
+      return Status::OK();
+    }
+    if (errors::IsOutOfRange(result->status)) {
+      // `f` may deliberately raise `errors::OutOfRange` to indicate that we
+      // should terminate the iteration early.
+      *end_of_sequence = true;
+      return Status::OK();
+    }
+    *end_of_sequence = result->end_of_input;
+    return result->status;
+  }
+
+  void RunnerThread(const std::shared_ptr<IteratorContext>& ctx) {
+    std::vector<std::shared_ptr<InvocationResult>> new_calls;
+    new_calls.reserve(num_parallel_calls_);
+    while (true) {
+      {
+        mutex_lock l(mu_);
+        while (!cancelled_ &&
+               (num_calls_ >= num_parallel_calls_ ||
+                invocation_results_.size() >= MaxInvocationResults())) {
+          cond_var_.wait(l);
+        }
+        if (cancelled_) {
+          return;
+        }
+        while (num_calls_ < num_parallel_calls_ &&
+               invocation_results_.size() < MaxInvocationResults()) {
+          invocation_results_.emplace_back(new InvocationResult());
+          new_calls.push_back(invocation_results_.back());
+          num_calls_++;
+        }
+      }
+      cond_var_.notify_all();
+      for (const auto& call : new_calls) {
+        CallFunction(ctx, call);
+      }
+      new_calls.clear();
+    }
+  }
+
+  Status WriteStatusLocked(IteratorStateWriter* writer, size_t index,
+                           const Status& status) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+    TF_RETURN_IF_ERROR(
+        writer->WriteScalar(CodeKey(index), static_cast<int64>(status.code())));
+    if (!status.ok()) {
+      TF_RETURN_IF_ERROR(
+          writer->WriteScalar(ErrorMessageKey(index), status.error_message()));
+    }
+    return Status::OK();
+  }
+
+  Status ReadStatusLocked(IteratorStateReader* reader, size_t index,
+                          Status* status) EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+    int64 code_int;
+    TF_RETURN_IF_ERROR(reader->ReadScalar(CodeKey(index), &code_int));
+    error::Code code = static_cast<error::Code>(code_int);
+
+    if (code != error::Code::OK) {
+      string error_message;
+      TF_RETURN_IF_ERROR(
+          reader->ReadScalar(ErrorMessageKey(index), &error_message));
+      *status = Status(code, error_message);
+    } else {
+      *status = Status::OK();
+    }
+    return Status::OK();
+  }
+
+  string CodeKey(size_t index) {
+    return full_name(
+        strings::StrCat("invocation_results[", index, "].code"));
+  }
+
+  string ErrorMessageKey(size_t index) {
+    return full_name(
+        strings::StrCat("invocation_results[", index, "].error_message"));
+  }
+
+  const DatasetBase* const input_dataset_;  // Not owned.
+  const ParallelMapIteratorFunction map_func_;
+  const int32 num_parallel_calls_;
+  // Used for coordination between the main thread and the runner thread.
+  mutex mu_;
+  // Used for coordination between the main thread and the runner thread. In
+  // particular, the runner thread should only schedule new calls when the
+  // number of in-flight calls is less than the user specified level of
+  // parallelism and there are slots available in the `invocation_results_`
+  // buffer.
+  condition_variable cond_var_;
+  // Counts the number of outstanding calls.
+  int64 num_calls_ GUARDED_BY(mu_) = 0;
+  std::unique_ptr<IteratorBase> input_impl_;
+  // Buffer for storing the invocation results.
+  std::deque<std::shared_ptr<InvocationResult>> invocation_results_
+      GUARDED_BY(mu_);
+  std::unique_ptr<Thread> runner_thread_ GUARDED_BY(mu_);
+  bool cancelled_ GUARDED_BY(mu_) = false;
+};
+
+}  // namespace
+
+std::unique_ptr<IteratorBase> NewParallelMapIterator(
+    const DatasetBaseIterator::BaseParams& params,
+    const DatasetBase* input_dataset, ParallelMapIteratorFunction map_func,
+    int32 num_parallel_calls) {
+  return std::unique_ptr<IteratorBase>(new ParallelMapIterator(
+      params, input_dataset, std::move(map_func), num_parallel_calls));
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/parallel_map_iterator.h b/tensorflow/core/kernels/data/parallel_map_iterator.h
new file mode 100644
index 0000000000..2ce36c3869
--- /dev/null
+++ b/tensorflow/core/kernels/data/parallel_map_iterator.h
@@ -0,0 +1,44 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_KERNELS_DATA_PARALLEL_MAP_ITERATOR_H_
+#define TENSORFLOW_CORE_KERNELS_DATA_PARALLEL_MAP_ITERATOR_H_
+
+#include <memory>
+
+#include "tensorflow/core/framework/dataset.h"
+
+namespace tensorflow {
+
+// A function that transforms elements of one dataset into another
+// asynchronously. The arguments are:
+// 1. An `IteratorContext*` for the context in which the function should
+// execute.
+// 2. A `std::vector<Tensor>` containing the input element.
+// 3. A `std::vector<Tensor>*` to which the function will write the result.
+// 4. A `StatusCallback` that should be invoked when the function is complete.
+using ParallelMapIteratorFunction =
+    std::function<void(IteratorContext*, std::vector<Tensor>,
+                       std::vector<Tensor>*, StatusCallback)>;
+
+// Returns a new iterator that applies `map_func` to the elements of
+// `input_dataset` using the given degree of parallelism.
+std::unique_ptr<IteratorBase> NewParallelMapIterator(
+    const DatasetBaseIterator::BaseParams& params,
+    const DatasetBase* input_dataset, ParallelMapIteratorFunction map_func,
+    int32 num_parallel_calls);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_KERNELS_DATA_PARALLEL_MAP_ITERATOR_H_
diff --git a/tensorflow/core/kernels/data/prefetch_dataset_op.cc b/tensorflow/core/kernels/data/prefetch_dataset_op.cc
index 2bafb985ef..cc16108dce 100644
--- a/tensorflow/core/kernels/data/prefetch_dataset_op.cc
+++ b/tensorflow/core/kernels/data/prefetch_dataset_op.cc
@@ -357,7 +357,12 @@ class PrefetchDatasetOp : public UnaryDatasetOpKernel {
 
 REGISTER_KERNEL_BUILDER(Name("PrefetchDataset").Device(DEVICE_CPU),
                         PrefetchDatasetOp);
-
+REGISTER_KERNEL_BUILDER(Name("PrefetchDataset")
+                            .Device(DEVICE_GPU)
+                            .HostMemory("buffer_size")
+                            .HostMemory("input_dataset")
+                            .HostMemory("handle"),
+                        PrefetchDatasetOp);
 }  // namespace
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/slide_dataset_op.cc b/tensorflow/core/kernels/data/slide_dataset_op.cc
index 07cc91f9d5..5765c61f30 100644
--- a/tensorflow/core/kernels/data/slide_dataset_op.cc
+++ b/tensorflow/core/kernels/data/slide_dataset_op.cc
@@ -12,6 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
+
+#include <deque>
+#include <vector>
+
 #include "tensorflow/core/framework/partial_tensor_shape.h"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/kernels/data/dataset.h"
@@ -33,36 +37,40 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
   void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
                    DatasetBase** output) override {
     int64 window_size = 0;
-    int64 stride = 0;
     OP_REQUIRES_OK(
         ctx, ParseScalarArgument<int64>(ctx, "window_size", &window_size));
-    OP_REQUIRES_OK(ctx, ParseScalarArgument<int64>(ctx, "stride", &stride));
     OP_REQUIRES(
         ctx, window_size > 0,
         errors::InvalidArgument("Window size must be greater than zero."));
-    OP_REQUIRES(ctx, stride > 0,
-                errors::InvalidArgument("Stride must be greater than zero."));
-    if (stride == window_size) {
-      LOG(WARNING) << "stride: " << stride
+    int64 window_shift = 0;
+    OP_REQUIRES_OK(
+        ctx, ParseScalarArgument<int64>(ctx, "window_shift", &window_shift));
+    OP_REQUIRES(
+        ctx, window_shift > 0,
+        errors::InvalidArgument("Window shift must be greater than zero."));
+    int64 window_stride = 0;
+    OP_REQUIRES_OK(
+        ctx, ParseScalarArgument<int64>(ctx, "window_stride", &window_stride));
+    OP_REQUIRES(
+        ctx, window_stride > 0,
+        errors::InvalidArgument("window_stride must be greater than zero."));
+    if (window_size == window_shift && window_stride == 1) {
+      LOG(WARNING) << "window_shift: " << window_shift
                    << " is equal to window_size: " << window_size
-                   << ", to use `batch` instead.";
-    } else if (stride > window_size) {
-      LOG(WARNING) << "stride: " << stride
-                   << " is greater than window_size: " << window_size
-                   << ", you will lose some data.";
+                   << " and window_stride is 1, use `batch` instead.";
     }
-
-    *output = new Dataset(ctx, window_size, stride, input);
+    *output = new Dataset(ctx, window_size, window_shift, window_stride, input);
   }
 
  private:
   class Dataset : public GraphDatasetBase {
    public:
-    Dataset(OpKernelContext* ctx, int64 window_size, int64 stride,
-            const DatasetBase* input)
+    Dataset(OpKernelContext* ctx, int64 window_size, int64 window_shift,
+            int64 window_stride, const DatasetBase* input)
         : GraphDatasetBase(ctx),
           window_size_(window_size),
-          stride_(stride),
+          window_shift_(window_shift),
+          window_stride_(window_stride),
           input_(input) {
       input_->Ref();
 
@@ -91,8 +99,8 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
     }
 
     string DebugString() const override {
-      return strings::StrCat("SlideDatasetOp(", window_size_, ", ", stride_,
-                             ")::Dataset");
+      return strings::StrCat("SlideDatasetOp(", window_size_, ", ",
+                             window_shift_, ", ", window_stride_, ")::Dataset");
     }
 
    protected:
@@ -101,16 +109,18 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
       Node* input_graph_node = nullptr;
       TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
       Node* window_size = nullptr;
-      Node* stride = nullptr;
+      Node* window_shift = nullptr;
+      Node* window_stride = nullptr;
       TF_RETURN_IF_ERROR(b->AddScalar(window_size_, &window_size));
-      TF_RETURN_IF_ERROR(b->AddScalar(stride_, &stride));
-      TF_RETURN_IF_ERROR(
-          b->AddDataset(this, {input_graph_node, window_size, stride}, output));
+      TF_RETURN_IF_ERROR(b->AddScalar(window_shift_, &window_shift));
+      TF_RETURN_IF_ERROR(b->AddScalar(window_stride_, &window_stride));
+      TF_RETURN_IF_ERROR(b->AddDataset(
+          this, {input_graph_node, window_size, window_shift, window_stride},
+          output));
       return Status::OK();
     }
 
    private:
-
     class Iterator : public DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
@@ -124,7 +134,8 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
                              std::vector<Tensor>* out_tensors,
                              bool* end_of_sequence) override {
         const int64 window_size = dataset()->window_size_;
-        const int64 stride = dataset()->stride_;
+        const int64 window_shift = dataset()->window_shift_;
+        const int64 window_stride = dataset()->window_stride_;
         std::vector<std::vector<Tensor>> batch_elements;
         {
           mutex_lock l(mu_);
@@ -133,55 +144,51 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
             return Status::OK();
           }
           batch_elements.reserve(window_size);
-          // Use cache if stride < window_size.
-          if (stride < window_size) {
-            const bool first_call = cache_.empty();
-            if (first_call) {
-              cache_.reserve(window_size);
-            } else {
-              // Reuse cache in the previous iteration.
-              cache_.swap(batch_elements);
-            }
-          }
-          // Fill up with new elements.
+
+          // Fill up buffer.
+          size_t target_size = TargetBufferSize(window_size, window_stride);
           *end_of_sequence = false;
-          for (size_t i = batch_elements.size(); i < window_size && !*end_of_sequence;
-              ++i) {
-            std::vector<Tensor> batch_element_tuple;
-            TF_RETURN_IF_ERROR(input_impl_->GetNext(ctx, &batch_element_tuple,
-                                                    end_of_sequence));
+          for (size_t i = buffer_.size(); i < target_size && !*end_of_sequence;
+               ++i) {
+            std::vector<Tensor> element;
+            TF_RETURN_IF_ERROR(
+                input_impl_->GetNext(ctx, &element, end_of_sequence));
             if (!*end_of_sequence) {
-              batch_elements.push_back(std::move(batch_element_tuple));
+              buffer_.push_back(std::move(element));
             } else {
               input_impl_.reset();
             }
           }
-          // Drop the final smaller blocks.
-          if (batch_elements.size() < window_size) {
+
+          // Drop the final smaller batch.
+          if (buffer_.size() < target_size) {
             DCHECK(*end_of_sequence);
             return Status::OK();
           }
 
-          if (stride < window_size) {
-            // Cache the data used for the next iteration.
-            for (size_t i = stride; i < window_size; ++i) {
-              cache_.emplace_back(batch_elements[i]);
-            }
-          } else if (stride > window_size) {
-            // Drop the data before the next iteration.
-            std::vector<Tensor> batch_element_tuple;
-            for (size_t i = window_size; i < stride && !*end_of_sequence; ++i) {
-              TF_RETURN_IF_ERROR(input_impl_->GetNext(ctx, &batch_element_tuple,
-                                                      end_of_sequence));
-              if (*end_of_sequence) {
+          for (size_t i = 0; i < window_size; ++i) {
+            batch_elements.emplace_back(buffer_[window_stride * i]);
+          }
+
+          // Drop the data before the next iteration.
+          if (window_shift >= buffer_.size()) {
+            for (size_t i = buffer_.size(); i < window_shift; ++i) {
+              bool end_of_input;
+              std::vector<Tensor> element;
+              TF_RETURN_IF_ERROR(
+                  input_impl_->GetNext(ctx, &element, &end_of_input));
+              if (end_of_input) {
                 input_impl_.reset();
+                break;
               }
             }
+            buffer_.clear();
+          } else {
+            buffer_.erase(buffer_.begin(), buffer_.begin() + window_shift);
           }
         }
 
         // Construct output tensors.
-        // Those codes below are copied from batch_dataset_op.cc.
         const size_t num_tuple_components = batch_elements[0].size();
         const int64 num_batch_elements = batch_elements.size();
         for (size_t component_index = 0; component_index < num_tuple_components;
@@ -223,15 +230,15 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
         } else {
           TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
         }
-        // Save cache.
-        TF_RETURN_IF_ERROR(
-            writer->WriteScalar(strings::StrCat("cache_size"), cache_.size()));
-        for (int64 i = 0; i < cache_.size(); i++) {
+        // Save buffer.
+        TF_RETURN_IF_ERROR(writer->WriteScalar(strings::StrCat("buffer_size"),
+                                               buffer_.size()));
+        for (int64 i = 0; i < buffer_.size(); i++) {
           TF_RETURN_IF_ERROR(writer->WriteScalar(
-              strings::StrCat("cache[", i, "]_size"), cache_[i].size()));
-          for (int64 j = 0; j < cache_[i].size(); j++) {
+              strings::StrCat("buffer[", i, "]_size"), buffer_[i].size()));
+          for (int64 j = 0; j < buffer_[i].size(); j++) {
             TF_RETURN_IF_ERROR(writer->WriteTensor(
-                strings::StrCat("cache[", i, "][", j, "]"), cache_[i][j]));
+                strings::StrCat("buffer[", i, "][", j, "]"), buffer_[i][j]));
           }
         }
         return Status::OK();
@@ -245,32 +252,37 @@ class SlideDatasetOp : public UnaryDatasetOpKernel {
         } else {
           input_impl_.reset();
         }
-        // Restore cache.
-        int64 cache_size;
+        // Restore buffer.
+        int64 buffer_size;
         TF_RETURN_IF_ERROR(
-            reader->ReadScalar(strings::StrCat("cache_size"), &cache_size));
-        cache_.resize(cache_size);
-        for (int64 i = 0; i < cache_size; i++) {
+            reader->ReadScalar(strings::StrCat("buffer_size"), &buffer_size));
+        buffer_.resize(buffer_size);
+        for (int64 i = 0; i < buffer_size; i++) {
           int64 vector_size;
           TF_RETURN_IF_ERROR(reader->ReadScalar(
-              strings::StrCat("cache[", i, "]_size"), &vector_size));
-          cache_[i].resize(vector_size);
+              strings::StrCat("buffer[", i, "]_size"), &vector_size));
+          buffer_[i].resize(vector_size);
           for (int64 j = 0; j < vector_size; j++) {
             TF_RETURN_IF_ERROR(reader->ReadTensor(
-                strings::StrCat("cache[", i, "][", j, "]"), &cache_[i][j]));
+                strings::StrCat("buffer[", i, "][", j, "]"), &buffer_[i][j]));
           }
         }
         return Status::OK();
       }
 
      private:
+      size_t TargetBufferSize(int64 window_size, int64 window_stride) {
+        return (window_size - 1) * window_stride + 1;
+      }
+
       mutex mu_;
-      std::vector<std::vector<Tensor>> cache_ GUARDED_BY(mu_);
+      std::deque<std::vector<Tensor>> buffer_ GUARDED_BY(mu_);
       std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
     };
 
     const int64 window_size_;
-    const int64 stride_;
+    const int64 window_shift_;
+    const int64 window_stride_;
     const DatasetBase* const input_;
     std::vector<PartialTensorShape> output_shapes_;
   };
diff --git a/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc b/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc
index 2604822cc9..b5dff48d2d 100644
--- a/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc
+++ b/tensorflow/core/kernels/data/sparse_tensor_slice_dataset_op.cc
@@ -252,10 +252,12 @@ class SparseTensorSliceDatasetOp : public DatasetOpKernel {
       previous_batch_index = next_batch_index;
     }
     gtl::InlinedVector<int64, 8> std_order(dense_shape->NumElements(), 0);
-    sparse::SparseTensor sparse_tensor(
-        *indices, *values, TensorShape(dense_shape->vec<int64>()), std_order);
-
-    *output = new Dataset<T>(ctx, sparse_tensor);
+    sparse::SparseTensor tensor;
+    OP_REQUIRES_OK(
+        ctx, sparse::SparseTensor::Create(
+                 *indices, *values, TensorShape(dense_shape->vec<int64>()),
+                 std_order, &tensor));
+    *output = new Dataset<T>(ctx, std::move(tensor));
   }
 
  private:
diff --git a/tensorflow/core/kernels/data/stats_dataset_ops.cc b/tensorflow/core/kernels/data/stats_dataset_ops.cc
index a537e7e68f..58ec3d4495 100644
--- a/tensorflow/core/kernels/data/stats_dataset_ops.cc
+++ b/tensorflow/core/kernels/data/stats_dataset_ops.cc
@@ -310,7 +310,7 @@ class FeatureStatsDatasetOp : public UnaryDatasetOpKernel {
           for (const Tensor& t : *out_tensors) {
             auto record_t = t.flat<string>();
             Example example;
-            // TODO(shivaniagrawal): redundant parsing here, potential solutions
+            // TODO(b/111553342): redundant parsing here, potential solutions
             // to improve performance is to a) have a potential
             // ParseExampleDataset and collect stats from there and b) make
             // changes to parse_example() where it returns stats as well.
@@ -333,7 +333,6 @@ class FeatureStatsDatasetOp : public UnaryDatasetOpKernel {
         return s;
       }
 
-      // TODO(shivaniagrawal): Add features/feature-values to streamz metrics.
       int AddStatsFeatureValues(const Feature& feature) {
         int feature_values_list_size = 0;
         switch (feature.kind_case()) {
@@ -391,7 +390,7 @@ class FeatureStatsDatasetOp : public UnaryDatasetOpKernel {
 
         for (const auto& feature_list :
              example.feature_lists().feature_list()) {
-          stats_aggregator->IncrementCounter("feature_lists_count", "reainer",
+          stats_aggregator->IncrementCounter("feature_lists_count", "trainer",
                                              1);
           for (const auto& feature : feature_list.second.feature()) {
             feature_values_list_size_sum += AddStatsFeatureValues(feature);
diff --git a/tensorflow/core/kernels/data/window_dataset.cc b/tensorflow/core/kernels/data/window_dataset.cc
index 668b461374..17551bccd9 100644
--- a/tensorflow/core/kernels/data/window_dataset.cc
+++ b/tensorflow/core/kernels/data/window_dataset.cc
@@ -17,6 +17,7 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
+// TODO(b/110981596): Support checkpointing.
 class WindowDataset : public DatasetBase {
  public:
   WindowDataset(std::vector<std::vector<Tensor>> elements,
diff --git a/tensorflow/core/kernels/data/window_dataset.h b/tensorflow/core/kernels/data/window_dataset.h
index 97c31668ac..7bd31a0bc7 100644
--- a/tensorflow/core/kernels/data/window_dataset.h
+++ b/tensorflow/core/kernels/data/window_dataset.h
@@ -31,7 +31,7 @@ namespace tensorflow {
 //
 // This dataset is constructed internally for use in datasets that
 // build nested dataset expressions (e.g. the reducer function for
-// GroupByBatchDataset). It efficiently supports multiple iterators on
+// GroupByWindowDataset). It efficiently supports multiple iterators on
 // the same window without recomputation.
 //
 // REQUIRES: `output_types` must match the types of the respective
diff --git a/tensorflow/core/kernels/data/window_dataset_op.cc b/tensorflow/core/kernels/data/window_dataset_op.cc
new file mode 100644
index 0000000000..0283e5697b
--- /dev/null
+++ b/tensorflow/core/kernels/data/window_dataset_op.cc
@@ -0,0 +1,196 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/partial_tensor_shape.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/kernels/data/dataset.h"
+#include "tensorflow/core/kernels/data/window_dataset.h"
+
+namespace tensorflow {
+
+namespace {
+
+// See documentation in ../ops/dataset_ops.cc for a high-level
+// description of the following op.
+
+class WindowDatasetOp : public UnaryDatasetOpKernel {
+ public:
+  explicit WindowDatasetOp(OpKernelConstruction* ctx)
+      : UnaryDatasetOpKernel(ctx) {}
+
+  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+                   DatasetBase** output) override {
+    int64 window_size = 0;
+    OP_REQUIRES_OK(
+        ctx, ParseScalarArgument<int64>(ctx, "window_size", &window_size));
+    OP_REQUIRES(
+        ctx, window_size > 0,
+        errors::InvalidArgument("Window size must be greater than zero."));
+
+    *output = new Dataset(ctx, window_size, input);
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    Dataset(OpKernelContext* ctx, int64 window_size, const DatasetBase* input)
+        : GraphDatasetBase(ctx), window_size_(window_size), input_(input) {
+      input_->Ref();
+    }
+
+    ~Dataset() override { input_->Unref(); }
+
+    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(new Iterator(
+          Iterator::Params{this, strings::StrCat(prefix, "::Window")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      static DataTypeVector* output_dtypes = new DataTypeVector({DT_VARIANT});
+      return *output_dtypes;
+    }
+
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      static std::vector<PartialTensorShape>* output_shapes =
+          new std::vector<PartialTensorShape>({TensorShape({})});
+      return *output_shapes;
+    }
+
+    string DebugString() const override {
+      return strings::StrCat("WindowDatasetOp(", window_size_, ")::Dataset");
+    }
+
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      Node* input_graph_node = nullptr;
+      TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
+      Node* window_size = nullptr;
+      TF_RETURN_IF_ERROR(b->AddScalar(window_size_, &window_size));
+      TF_RETURN_IF_ERROR(
+          b->AddDataset(this, {input_graph_node, window_size}, output));
+      return Status::OK();
+    }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      Status Initialize(IteratorContext* ctx) override {
+        return dataset()->input_->MakeIterator(ctx, prefix(), &input_impl_);
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        // Each row of `window_elements` is a tuple of tensors from the
+        // input iterator.
+        std::vector<std::vector<Tensor>> window_elements;
+        {
+          mutex_lock l(mu_);
+          if (!input_impl_) {
+            *end_of_sequence = true;
+            return Status::OK();
+          }
+          window_elements.reserve(dataset()->window_size_);
+          *end_of_sequence = false;
+          for (int i = 0; i < dataset()->window_size_ && !*end_of_sequence;
+               ++i) {
+            std::vector<Tensor> window_element_tuple;
+            TF_RETURN_IF_ERROR(input_impl_->GetNext(ctx, &window_element_tuple,
+                                                    end_of_sequence));
+            if (!*end_of_sequence) {
+              window_elements.emplace_back(std::move(window_element_tuple));
+            } else {
+              input_impl_.reset();
+            }
+          }
+        }
+
+        if (window_elements.empty()) {
+          DCHECK(*end_of_sequence);
+          return Status::OK();
+        }
+
+        const size_t num_tuple_components = window_elements[0].size();
+        const int64 num_window_elements = window_elements.size();
+        for (size_t idx = 0; idx < num_tuple_components; ++idx) {
+          DatasetBase* window_dataset;
+          std::vector<std::vector<Tensor>> window_component_elements;
+          window_component_elements.reserve(num_window_elements);
+          // Build the output tuple component by copying one slice
+          // from each input element in the window.
+          for (size_t i = 0; i < num_window_elements; ++i) {
+            std::vector<Tensor> component_element;
+            component_element.push_back(std::move(window_elements[i][idx]));
+            window_component_elements.push_back(component_element);
+          }
+          DataTypeVector output_types(
+              {dataset()->input_->output_dtypes()[idx]});
+          std::vector<PartialTensorShape> output_shapes(
+              {dataset()->input_->output_shapes()[idx]});
+          TF_RETURN_IF_ERROR(NewWindowDataset(window_component_elements,
+                                              output_types, output_shapes,
+                                              &window_dataset));
+          out_tensors->emplace_back(DT_VARIANT, TensorShape({}));
+          TF_RETURN_IF_ERROR(StoreDatasetInVariantTensor(window_dataset,
+                                                         &out_tensors->back()));
+        }
+        *end_of_sequence = false;
+        return Status::OK();
+      }
+
+     protected:
+      Status SaveInternal(IteratorStateWriter* writer) override {
+        mutex_lock l(mu_);
+        if (!input_impl_) {
+          TF_RETURN_IF_ERROR(
+              writer->WriteScalar(full_name("input_impl_empty"), ""));
+        } else {
+          TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+        }
+        return Status::OK();
+      }
+
+      Status RestoreInternal(IteratorContext* ctx,
+                             IteratorStateReader* reader) override {
+        mutex_lock l(mu_);
+        if (!reader->Contains(full_name("input_impl_empty"))) {
+          TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+        } else {
+          input_impl_.reset();
+        }
+        return Status::OK();
+      }
+
+     private:
+      mutex mu_;
+      std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
+    };
+
+    const int64 window_size_;
+    const DatasetBase* const input_;
+  };
+};
+
+REGISTER_KERNEL_BUILDER(Name("WindowDataset").Device(DEVICE_CPU),
+                        WindowDatasetOp);
+
+}  // namespace
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/decode_proto_op.cc b/tensorflow/core/kernels/decode_proto_op.cc
index 6d3dcc1c59..b54e1ea8ac 100644
--- a/tensorflow/core/kernels/decode_proto_op.cc
+++ b/tensorflow/core/kernels/decode_proto_op.cc
@@ -13,21 +13,19 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-// DecodeProto is a TensorFlow Op which extracts arbitrary fields
-// from protos serialized as strings.
+// DecodeProto is a TensorFlow op which extracts arbitrary fields from protos
+// serialized as strings.
 //
 // See docs in ../ops/decode_proto_op.cc.
 //
-// This implementation reads the serialized format using a handful of
-// calls from the WireFormatLite API used by generated proto code.
-// WireFormatLite is marked as an "internal" proto API but is widely
-// used in practice and highly unlikely to change.
-// This will be much faster than the previous implementation based on
-// constructing a temporary dynamic message in memory and using the
-// proto reflection api to read it.
-// It can be used with any proto whose descriptors are available at
-// runtime but should be competitive in speed with approaches that
-// compile in the proto definitions.
+// This implementation reads the serialized format using a handful of calls from
+// the WireFormatLite API used by generated proto code. WireFormatLite is marked
+// as an "internal" proto API but is widely used in practice and highly unlikely
+// to change. This will be much faster than the previous implementation based on
+// constructing a temporary dynamic message in memory and using the proto
+// reflection api to read it. It can be used with any proto whose descriptors
+// are available at runtime but should be competitive in speed with approaches
+// that compile in the proto definitions.
 
 #include <memory>
 #include <string>
@@ -36,11 +34,13 @@ limitations under the License.
 #include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/util/proto/decode.h"
 #include "tensorflow/core/util/proto/descriptors.h"
+#include "tensorflow/core/util/proto/proto_utils.h"
 #include "tensorflow/core/util/ptr_util.h"
 
 namespace tensorflow {
@@ -58,53 +58,6 @@ using ::tensorflow::protobuf::io::CodedInputStream;
 
 const bool kFailOnDecodeError = true;
 
-// Returns true if the proto field type can be converted to the
-// tensorflow::DataType.
-bool CheckOutputType(FieldDescriptor::Type field_type, DataType output_type) {
-  switch (field_type) {
-    case WireFormatLite::TYPE_DOUBLE:
-      return output_type == tensorflow::DT_DOUBLE;
-    case WireFormatLite::TYPE_FLOAT:
-      return output_type == tensorflow::DT_FLOAT ||
-             output_type == tensorflow::DT_DOUBLE;
-    case WireFormatLite::TYPE_INT64:
-      return output_type == tensorflow::DT_INT64;
-    case WireFormatLite::TYPE_UINT64:
-      return output_type == tensorflow::DT_INT64;
-    case WireFormatLite::TYPE_INT32:
-      return output_type == tensorflow::DT_INT32;
-    case WireFormatLite::TYPE_FIXED64:
-      return output_type == tensorflow::DT_INT64;
-    case WireFormatLite::TYPE_FIXED32:
-      return output_type == tensorflow::DT_INT32 ||
-             output_type == tensorflow::DT_INT64;
-    case WireFormatLite::TYPE_BOOL:
-      return output_type == tensorflow::DT_BOOL;
-    case WireFormatLite::TYPE_STRING:
-      return output_type == tensorflow::DT_STRING;
-    case WireFormatLite::TYPE_GROUP:
-      return output_type == tensorflow::DT_STRING;
-    case WireFormatLite::TYPE_MESSAGE:
-      return output_type == tensorflow::DT_STRING;
-    case WireFormatLite::TYPE_BYTES:
-      return output_type == tensorflow::DT_STRING;
-    case WireFormatLite::TYPE_UINT32:
-      return output_type == tensorflow::DT_INT32 ||
-             output_type == tensorflow::DT_INT64;
-    case WireFormatLite::TYPE_ENUM:
-      return output_type == tensorflow::DT_INT32;
-    case WireFormatLite::TYPE_SFIXED32:
-      return output_type == tensorflow::DT_INT32;
-    case WireFormatLite::TYPE_SFIXED64:
-      return output_type == tensorflow::DT_INT64;
-    case WireFormatLite::TYPE_SINT32:
-      return output_type == tensorflow::DT_INT32;
-    case WireFormatLite::TYPE_SINT64:
-      return output_type == tensorflow::DT_INT64;
-      // default: intentionally omitted in order to enable static checking.
-  }
-}
-
 // Used to store the default value of a protocol message field, casted to the
 // type of the output tensor.
 //
@@ -113,13 +66,15 @@ struct DefaultValue {
   DataType dtype = DataType::DT_INVALID;
   union Value {
     bool v_bool;           // DT_BOOL
-    uint8 v_uint8;         // DT_UINT8
+    double v_double;       // DT_DOUBLE
+    float v_float;         // DT_FLOAT
     int8 v_int8;           // DT_INT8
     int32 v_int32;         // DT_INT32
     int64 v_int64;         // DT_INT64
-    float v_float;         // DT_FLOAT
-    double v_double;       // DT_DOUBLE
     const char* v_string;  // DT_STRING
+    uint8 v_uint8;         // DT_UINT8
+    uint8 v_uint32;        // DT_UINT32
+    uint8 v_uint64;        // DT_UINT64
   };
   Value value;
 };
@@ -138,23 +93,29 @@ Status InitDefaultValue(DataType dtype, const T value, DefaultValue* result) {
     case DT_BOOL:
       result->value.v_bool = static_cast<bool>(value);
       break;
-    case DT_INT32:
-      result->value.v_int32 = static_cast<int32>(value);
+    case DT_DOUBLE:
+      result->value.v_double = static_cast<double>(value);
+      break;
+    case DT_FLOAT:
+      result->value.v_float = static_cast<float>(value);
       break;
     case DT_INT8:
       result->value.v_int8 = static_cast<int8>(value);
       break;
-    case DT_UINT8:
-      result->value.v_uint8 = static_cast<uint8>(value);
+    case DT_INT32:
+      result->value.v_int32 = static_cast<int32>(value);
       break;
     case DT_INT64:
       result->value.v_int64 = static_cast<int64>(value);
       break;
-    case DT_FLOAT:
-      result->value.v_float = static_cast<float>(value);
+    case DT_UINT8:
+      result->value.v_uint8 = static_cast<uint8>(value);
       break;
-    case DT_DOUBLE:
-      result->value.v_double = static_cast<double>(value);
+    case DT_UINT32:
+      result->value.v_uint32 = static_cast<uint32>(value);
+      break;
+    case DT_UINT64:
+      result->value.v_uint64 = static_cast<uint64>(value);
       break;
     default:
       // We should never get here, given the type checking that occurs earlier.
@@ -241,13 +202,11 @@ struct FieldInfo {
     number = field_desc->number();
 
     // The wire format library defines the same constants used in
-    // descriptor.proto. This static_cast is safe because they
-    // are guaranteed to stay in sync.
-    // We need the field type from the FieldDescriptor here
-    // because the wire format doesn't tell us anything about
-    // what happens inside a packed repeated field: there is
-    // enough information in the wire format to skip the
-    // whole field but not enough to know how to parse what's
+    // descriptor.proto. This static_cast is safe because they are guaranteed to
+    // stay in sync. We need the field type from the FieldDescriptor here
+    // because the wire format doesn't tell us anything about what happens
+    // inside a packed repeated field: there is enough information in the wire
+    // format to skip the whole field but not enough to know how to parse what's
     // inside. For that we go to the schema.
     type = static_cast<WireFormatLite::FieldType>(field_desc->type());
     is_repeated = field_desc->is_repeated();
@@ -257,16 +216,15 @@ struct FieldInfo {
   FieldInfo(const FieldInfo&) = delete;
   FieldInfo& operator=(const FieldInfo&) = delete;
 
-  // Internally we sort field descriptors by wire number for
-  // fast lookup. In general this is different from the order
-  // given by the user. Output_index gives the index into
-  // the field_names and output_types attributes and into
+  // Internally we sort field descriptors by wire number for fast lookup. In
+  // general this is different from the order given by the user. Output_index
+  // gives the index into the field_names and output_types attributes and into
   // the output tensor list.
   int output_index = -1;
 
-  // This is a cache of the relevant fields from `FieldDescriptorProto`.
-  // This was added after noticing that FieldDescriptor->type() was
-  // using 6% of the cpu profile.
+  // This is a cache of the relevant fields from `FieldDescriptorProto`. This
+  // was added after noticing that FieldDescriptor->type() was using 6% of the
+  // cpu profile.
   WireFormatLite::FieldType type;
   int number;
   bool is_repeated;
@@ -275,16 +233,16 @@ struct FieldInfo {
 
 // A CountCollector counts sizes of repeated and optional fields in a proto.
 //
-// Each field is tracked by a single CountCollector instance. The
-// instance manages a single count, which is stored as a pointer (it
-// is intended to be a reference to the `sizes` output which is being
-// filled in). The pointer is passed in at initialization.
+// Each field is tracked by a single CountCollector instance. The instance
+// manages a single count, which is stored as a pointer (it is intended to be a
+// reference to the `sizes` output which is being filled in). The pointer is
+// passed in at initialization.
 //
-// Counting is done as a separate pass in order to allocate output tensors
-// all at once. This allows the TensorFlow runtime to optimize allocation
-// for the consumer, while removing the need for copying inside this op.
-// After this pass, the DenseCollector class (below) gathers the data:
-// It is more complex and provides better motivation for the API here.
+// Counting is done as a separate pass in order to allocate output tensors all
+// at once. This allows the TensorFlow runtime to optimize allocation for the
+// consumer, while removing the need for copying inside this op. After this
+// pass, the DenseCollector class (below) gathers the data: it is more complex
+// and provides better motivation for the API here.
 class CountCollector {
  public:
   CountCollector() = delete;
@@ -298,8 +256,8 @@ class CountCollector {
     if (*count_ptr_ == 0 || field.is_repeated) {
       (*count_ptr_)++;
     }
-    // We expect a wire type based on the schema field_type, to allow
-    // a little more checking.
+    // We expect a wire type based on the schema field_type, to allow a little
+    // more checking.
     if (!SkipValue(input, field)) {
       return errors::DataLoss("ReadValue: Failed skipping field when counting");
     }
@@ -329,8 +287,8 @@ class CountCollector {
       return errors::DataLoss("ReadPackedValues: Skipping packed field failed");
     }
 
-    // Dispatch to the appropriately typed field reader based on the
-    // schema type.
+    // Dispatch to the appropriately typed field reader based on the schema
+    // type.
     Status st;
     switch (field.type) {
       case WireFormatLite::TYPE_DOUBLE:
@@ -409,18 +367,17 @@ class CountCollector {
     return input->Skip(length);
   }
 
-  // Counts the number of packed varints in an array.
-  // The end of a varint is signaled by a value < 0x80,
-  // so counting them requires parsing the bytestream.
-  // It is the caller's responsibility to ensure that len > 0.
+  // Counts the number of packed varints in an array. The end of a varint is
+  // signaled by a value < 0x80, so counting them requires parsing the
+  // bytestream. It is the caller's responsibility to ensure that len > 0.
   Status CountPackedVarint(const uint8* buf, size_t len) {
     const uint8* bound = buf + len;
     int count;
 
-    // The last byte in a valid encoded varint is guaranteed to have
-    // the high bit unset. We rely on this property to prevent
-    // ReadVarint64FromArray from going out of bounds, so validate
-    // the end of the buf before scanning anything.
+    // The last byte in a valid encoded varint is guaranteed to have the high
+    // bit unset. We rely on this property to prevent ReadVarint64FromArray from
+    // going out of bounds, so validate the end of the buf before scanning
+    // anything.
     if (bound[-1] & 0x80) {
       return errors::DataLoss("Corrupt packed varint");
     }
@@ -439,8 +396,8 @@ class CountCollector {
     return Status::OK();
   }
 
-  // Counts the number of fixed-size values in a packed field.
-  // This can be done without actually parsing anything.
+  // Counts the number of fixed-size values in a packed field. This can be done
+  // without actually parsing anything.
   template <typename T>
   Status CountPackedFixed(const uint8* unused_buf, size_t len) {
     int count = len / sizeof(T);
@@ -452,10 +409,9 @@ class CountCollector {
     return Status::OK();
   }
 
-  // Skips a single value in the input stream.
-  // Dispatches to the appropriately typed field skipper based on the
-  // schema type tag.
-  // This is not as permissive as just handling the wire type.
+  // Skips a single value in the input stream. Dispatches to the appropriately
+  // typed field skipper based on the schema type tag. This is not as permissive
+  // as just handling the wire type.
   static bool SkipValue(CodedInputStream* input, const FieldInfo& field) {
     uint32 tmp32;
     protobuf_uint64 tmp64;
@@ -507,13 +463,13 @@ class CountCollector {
 
 // A DenseCollector accumulates values from a proto into a tensor.
 //
-// There is an instance of DenseCollector for each field of each
-// proto. The DenseCollector deserializes the value from the wire
-// directly into the preallocated output Tensor.
+// There is an instance of DenseCollector for each field of each proto. The
+// DenseCollector deserializes the value from the wire directly into the
+// preallocated output Tensor.
 //
-// This class is named DenseCollector because in the future there should
-// be a SparseCollector that accumulates field data into sparse tensors if
-// the user requests it.
+// This class is named DenseCollector because in the future there should be a
+// SparseCollector that accumulates field data into sparse tensors if the user
+// requests it.
 class DenseCollector {
  public:
   DenseCollector() = delete;
@@ -578,40 +534,43 @@ class DenseCollector {
     }
   }
 
-  // Fills in any missing values in the output array with defaults.
-  // Dispatches to the appropriately typed field default based on the
-  // runtime type tag.
+  // Fills in any missing values in the output array with defaults. Dispatches
+  // to the appropriately typed field default based on the runtime type tag.
   Status FillWithDefaults() {
     switch (default_value_.dtype) {
+      case DataType::DT_BOOL:
+        return FillDefault<bool>(default_value_.value.v_bool);
       case DataType::DT_FLOAT:
         return FillDefault<float>(default_value_.value.v_float);
       case DataType::DT_DOUBLE:
         return FillDefault<double>(default_value_.value.v_double);
-      case DataType::DT_INT32:
-        return FillDefault<int32>(default_value_.value.v_int32);
-      case DataType::DT_UINT8:
-        return FillDefault<uint8>(default_value_.value.v_uint8);
       case DataType::DT_INT8:
         return FillDefault<int8>(default_value_.value.v_int8);
-      case DataType::DT_STRING:
-        return FillDefault<string>(default_value_.value.v_string);
+      case DataType::DT_INT32:
+        return FillDefault<int32>(default_value_.value.v_int32);
       case DataType::DT_INT64:
         return FillDefault<int64>(default_value_.value.v_int64);
-      case DataType::DT_BOOL:
-        return FillDefault<bool>(default_value_.value.v_bool);
+      case DataType::DT_STRING:
+        return FillDefault<string>(default_value_.value.v_string);
+      case DataType::DT_UINT8:
+        return FillDefault<uint8>(default_value_.value.v_uint8);
+      case DataType::DT_UINT32:
+        return FillDefault<uint32>(default_value_.value.v_uint32);
+      case DataType::DT_UINT64:
+        return FillDefault<uint64>(default_value_.value.v_uint64);
       default:
         // There are many tensorflow dtypes not handled here, but they
         // should not come up unless type casting is added to the Op.
         // Chaining with tf.cast() should do the right thing until then.
-        return errors::DataLoss(
-            "Failed filling defaults in unknown tf::DataType");
+        return errors::DataLoss("Failed filling defaults for ",
+                                DataTypeString(default_value_.dtype));
     }
   }
 
  private:
-  // Fills empty values in the dense representation with a
-  // default value. This uses next_repeat_index_ which counts the number
-  // of parsed values for the field.
+  // Fills empty values in the dense representation with a default value. This
+  // uses next_repeat_index_ which counts the number of parsed values for the
+  // field.
   template <class T>
   Status FillDefault(const T& default_value) {
     for (int i = next_repeat_index_; i < max_repeat_count_; i++) {
@@ -622,11 +581,10 @@ class DenseCollector {
 
   int32 next_repeat_index_ = 0;
 
-  // This is a pointer to data_[message_index_].
-  // There is no bounds checking at this level: we computed the max
-  // repeat size for each field in CountCollector and use the same
-  // code to traverse it here, so we are guaranteed not to be called
-  // for more items than we have allocated space.
+  // This is a pointer to data_[message_index_]. There is no bounds checking at
+  // this level: we computed the max repeat size for each field in
+  // CountCollector and use the same code to traverse it here, so we are
+  // guaranteed not to be called for more items than we have allocated space.
   void* const datap_ = nullptr;
 
   const DefaultValue default_value_;
@@ -665,7 +623,6 @@ class DecodeProtoOp : public OpKernel {
                                 "have the same length"));
 
     // Gather the field descriptors and check that requested output types match.
-
     int field_index = 0;
     std::vector<const FieldDescriptor*> field_descs;
     for (const string& name : field_names) {
@@ -673,18 +630,16 @@ class DecodeProtoOp : public OpKernel {
       OP_REQUIRES(context, fd != nullptr,
                   errors::InvalidArgument("Unknown field: ", name,
                                           " in message type ", message_type));
-      OP_REQUIRES(context,
-                  CheckOutputType(fd->type(), output_types[field_index]),
-                  // Many TensorFlow types don't have corresponding proto types
-                  // and the user will get an error if they are requested. It
-                  // would be nice to allow conversions here, but tf.cast
-                  // already exists so we don't duplicate the functionality.
-                  // Known unhandled types:
-                  //   DT_INT16 DT_COMPLEX64 DT_QINT8 DT_QUINT8 DT_QINT32
-                  //   DT_BFLOAT16 DT_QINT16 DT_QUINT16 DT_UINT16
-                  errors::InvalidArgument("Unexpected output type for ",
-                                          fd->full_name(), ": ", fd->cpp_type(),
-                                          " to ", output_types[field_index]));
+      OP_REQUIRES(
+          context,
+          proto_utils::IsCompatibleType(fd->type(), output_types[field_index]),
+          // Many TensorFlow types don't have corresponding proto types and the
+          // user will get an error if they are requested. It would be nice to
+          // allow conversions here, but tf.cast already exists so we don't
+          // duplicate the functionality.
+          errors::InvalidArgument("Unexpected output type for ",
+                                  fd->full_name(), ": ", fd->cpp_type(), " to ",
+                                  output_types[field_index]));
 
       field_index++;
       field_descs.push_back(fd);
@@ -726,10 +681,9 @@ class DecodeProtoOp : public OpKernel {
         errors::InvalidArgument("format must be one of binary or text"));
     is_binary_ = format == "binary";
 
-    // Enable the initial protobuf sanitizer, which is much
-    // more expensive than the decoder.
-    // TODO(nix): Remove this once the fast decoder
-    // has passed security review.
+    // Enable the initial protobuf sanitizer, which is much more expensive than
+    // the decoder.
+    // TODO(nix): Remove this once the fast decoder has passed security review.
     OP_REQUIRES_OK(context, context->GetAttr("sanitize", &sanitize_));
   }
 
@@ -742,9 +696,9 @@ class DecodeProtoOp : public OpKernel {
 
     int field_count = fields_.size();
 
-    // Save the argument shape for later, then flatten the input
-    // Tensor since we are working componentwise. We will restore
-    // the same shape in the returned Tensor.
+    // Save the argument shape for later, then flatten the input Tensor since we
+    // are working componentwise. We will restore the same shape in the returned
+    // Tensor.
     const TensorShape& shape_prefix = buf_tensor.shape();
 
     TensorShape sizes_shape = shape_prefix;
@@ -752,8 +706,8 @@ class DecodeProtoOp : public OpKernel {
     Tensor* sizes_tensor = nullptr;
     OP_REQUIRES_OK(ctx, ctx->allocate_output(0, sizes_shape, &sizes_tensor));
 
-    // This is used to allocate binary bufs if used. It serves only
-    // to define memory ownership.
+    // This is used to allocate binary bufs if used. It serves only to define
+    // memory ownership.
     std::vector<string> tmp_binary_bufs(message_count);
 
     // These are the actual buffers to use, which may be in tmp_binary_bufs
@@ -768,8 +722,8 @@ class DecodeProtoOp : public OpKernel {
         bufs.push_back(buf);
       }
     } else {
-      // We will have to allocate a copy, either to convert from text to
-      // binary or to sanitize a binary proto.
+      // We will have to allocate a copy, either to convert from text to binary
+      // or to sanitize a binary proto.
       for (int mi = 0; mi < message_count; ++mi) {
         ReserializeMessage(ctx, buf_tensor.flat<string>()(mi),
                            &tmp_binary_bufs[mi]);
@@ -780,16 +734,14 @@ class DecodeProtoOp : public OpKernel {
       }
     }
 
-    // Walk through all the strings in the input tensor, counting
-    // the number of fields in each.
-    // We can't allocate our actual output Tensor until we know the
-    // maximum repeat count, so we do a first pass through the serialized
-    // proto just counting fields.
-    // We always allocate at least one value so that optional fields
-    // are populated with default values - this avoids a TF
-    // conditional when handling the output data.
-    // The caller can distinguish between real data and defaults
-    // using the repeat count matrix that is returned by decode_proto.
+    // Walk through all the strings in the input tensor, counting the number of
+    // fields in each. We can't allocate our actual output Tensor until we know
+    // the maximum repeat count, so we do a first pass through the serialized
+    // proto just counting fields. We always allocate at least one value so that
+    // optional fields are populated with default values - this avoids a TF
+    // conditional when handling the output data. The caller can distinguish
+    // between real data and defaults using the repeat count matrix that is
+    // returned by decode_proto.
     std::vector<int32> max_sizes(field_count, 1);
     for (int mi = 0; mi < message_count; ++mi) {
       CountFields(ctx, mi, *bufs[mi], sizes_tensor, &max_sizes);
@@ -814,14 +766,12 @@ class DecodeProtoOp : public OpKernel {
       //  REGISTER_OP(...)
       //    .Attr("output_types: list(type) >= 0")
       //    .Output("values: output_types")
-      OP_REQUIRES_OK(ctx,
-                     // ctx->allocate_output(output_indices_[fi] + 1,
-                     ctx->allocate_output(fields_[fi]->output_index + 1,
-                                          out_shape, &outputs[fi]));
+      OP_REQUIRES_OK(ctx, ctx->allocate_output(fields_[fi]->output_index + 1,
+                                               out_shape, &outputs[fi]));
     }
 
-    // Make the second pass through the serialized proto, decoding
-    // into preallocated tensors.
+    // Make the second pass through the serialized proto, decoding into
+    // preallocated tensors.
     AccumulateFields(ctx, bufs, outputs);
   }
 
@@ -976,6 +926,7 @@ class DecodeProtoOp : public OpKernel {
   // Look up the FieldDescriptor for a particular field number.
   bool LookupField(int field_number, int* field_index) {
     // Look up the FieldDescriptor using linear search.
+    //
     // TODO(nix): this could be sped up with binary search, but we are
     // already way off the fastpath at this point. If you see a hotspot
     // here, somebody is sending you very inefficient protos.
@@ -1010,6 +961,7 @@ class DecodeProtoOp : public OpKernel {
       // This takes advantage of the sorted field numbers in most serialized
       // protos: it tries the next expected field first rather than doing
       // a lookup by field number.
+      //
       // TODO(nix): haberman@ suggests a hybrid approach with a lookup table
       // for small field numbers and a hash table for larger ones. This would
       // be a simpler approach that should offer comparable speed in most
@@ -1029,9 +981,9 @@ class DecodeProtoOp : public OpKernel {
             last_good_field_index = field_index;
           }
         } else {
-          // If we see a field that is past the next field we want,
-          // it was empty. Look for the one after that.
-          // Repeat until we run out of fields that we care about.
+          // If we see a field that is past the next field we want, it was
+          // empty. Look for the one after that. Repeat until we run out of
+          // fields that we care about.
           while (field_number >= next_good_field_number) {
             if (field_number == next_good_field_number) {
               last_good_field_number = field_number;
@@ -1044,10 +996,9 @@ class DecodeProtoOp : public OpKernel {
               next_good_field_number =
                   fields_[last_good_field_index + 1]->number;
             } else {
-              // Saw something past the last field we care about.
-              // Continue parsing the message just in case there
-              // are disordered fields later, but any remaining
-              // ordered fields will have no effect.
+              // Saw something past the last field we care about. Continue
+              // parsing the message just in case there are disordered fields
+              // later, but any remaining ordered fields will have no effect.
               next_good_field_number = INT_MAX;
             }
           }
@@ -1077,20 +1028,20 @@ class DecodeProtoOp : public OpKernel {
                       WireFormatLite::WireType wire_type,
                       CodedInputStream* input, CollectorClass* collector) {
     // The wire format library defines the same constants used in
-    // descriptor.proto. This static_cast is safe because they
-    // are guaranteed to stay in sync.
-    // We need the field type from the FieldDescriptor here
-    // because the wire format doesn't tell us anything about
-    // what happens inside a packed repeated field: there is
-    // enough information in the wire format to skip the
-    // whole field but not enough to know how to parse what's
-    // inside. For that we go to the schema.
+    // descriptor.proto. This static_cast is safe because they are guaranteed to
+    // stay in sync.
+    //
+    // We need the field type from the FieldDescriptor here because the wire
+    // format doesn't tell us anything about what happens inside a packed
+    // repeated field: there is enough information in the wire format to skip
+    // the whole field but not enough to know how to parse what's inside. For
+    // that we go to the schema.
     WireFormatLite::WireType schema_wire_type =
         WireFormatLite::WireTypeForFieldType(field.type);
 
-    // Handle packed repeated fields. SkipField would skip the
-    // whole length-delimited blob without letting us count the
-    // values, so we have to scan them ourselves.
+    // Handle packed repeated fields. SkipField would skip the whole
+    // length-delimited blob without letting us count the values, so we have to
+    // scan them ourselves.
     if (wire_type == WireFormatLite::WIRETYPE_LENGTH_DELIMITED &&
         schema_wire_type != WireFormatLite::WIRETYPE_LENGTH_DELIMITED) {
       // Handle packed repeated primitives.
@@ -1098,11 +1049,7 @@ class DecodeProtoOp : public OpKernel {
       if (!input->ReadVarintSizeAsInt(&length)) {
         return errors::DataLoss("CollectField: Failed reading packed size");
       }
-      Status st = collector->ReadPackedValues(input, field, length);
-      if (!st.ok()) {
-        return st;
-      }
-      return Status::OK();
+      return collector->ReadPackedValues(input, field, length);
     }
 
     // Read ordinary values, including strings, bytes, and messages.
@@ -1118,9 +1065,9 @@ class DecodeProtoOp : public OpKernel {
   }
 
   string message_type_;
-  // Note that fields are sorted by increasing field number,
-  // which is not in general the order given by the user-specified
-  // field_names and output_types Op attributes.
+  // Note that fields are sorted by increasing field number, which is not in
+  // general the order given by the user-specified field_names and output_types
+  // Op attributes.
   std::vector<std::unique_ptr<const FieldInfo>> fields_;
 
   // Owned_desc_pool_ is null when using descriptor_source=local.
@@ -1131,12 +1078,12 @@ class DecodeProtoOp : public OpKernel {
   // True if decoding binary format, false if decoding text format.
   bool is_binary_;
 
-  // True if the protos should be sanitized before parsing.
-  // Enables the initial protobuf sanitizer, which is much
-  // more expensive than the decoder. The flag defaults to true
-  // but can be set to false for trusted sources.
-  // TODO(nix): flip the default to false when the fast decoder
-  // has passed security review.
+  // True if the protos should be sanitized before parsing. Enables the initial
+  // protobuf sanitizer, which is much more expensive than the decoder. The flag
+  // defaults to true but can be set to false for trusted sources.
+  //
+  // TODO(nix): Flip the default to false when the fast decoder has passed
+  // security review.
   bool sanitize_;
 
   TF_DISALLOW_COPY_AND_ASSIGN(DecodeProtoOp);
diff --git a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
index 5390222b3a..2a25459194 100644
--- a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
@@ -165,15 +165,18 @@ __global__ void __launch_bounds__(1024, 2)
 // one each in the lower and upper half of a tile.
 // Backprop input direction is the same as forward direction with the filter
 // rotated by 180°.
+// T is the tensors' data type. S is the math type the kernel uses. This is the
+// same as T for all cases but pseudo half (which has T=Eigen::half, S=float).
 template <typename T, DepthwiseConv2dDirection kDirection,
           int kKnownFilterWidth, int kKnownFilterHeight, int kBlockDepth,
-          bool kKnownEvenHeight>
+          bool kKnownEvenHeight, typename S>
 __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
-  T* const shared_data = reinterpret_cast<T*>(shared_memory);
+  extern __shared__ __align__(8) unsigned char shared_memory[];
+  static_assert(sizeof(S) <= 8, "Insufficient alignement detected");
+  S* const shared_data = reinterpret_cast<S*>(shared_memory);
 
   const int num_batches = args.batch;
   const int in_height = args.in_rows;
@@ -219,7 +222,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
 
   // Initialize tile, in particular the padding.
   for (int i = thread_idx; i < tile_size; i += block_size) {
-    shared_data[i] = T(0);
+    shared_data[i] = S();
   }
   __syncthreads();
 
@@ -254,14 +257,15 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
 
     if (channel_in_range) {
       const T* const in_ptr = inout_offset + input;
-      T* const tile_ptr = tile_idx + shared_data;
-      tile_ptr[0] = ldg(in_ptr);
+      S* const tile_ptr = tile_idx + shared_data;
+      tile_ptr[0] = static_cast<S>(ldg(in_ptr));
       if (!skip_second) {
-        tile_ptr[tile_offset] = ldg(tensor_offset + in_ptr);
+        tile_ptr[tile_offset] = static_cast<S>(ldg(tensor_offset + in_ptr));
       }
 
       if (filter_write_offset != 0) {
-        shared_data[filter_write_offset] = ldg(filter_offset + filter);
+        shared_data[filter_write_offset] =
+            static_cast<S>(ldg(filter_offset + filter));
       }
     }
 
@@ -269,17 +273,17 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
     __syncthreads();
 
     if (channel_in_range) {
-      T sum1 = static_cast<T>(0);
-      T sum2 = static_cast<T>(0);
+      S sum1 = S();
+      S sum2 = S();
       int shared_offset = data_idx;
-      const T* filter_ptr = filter_read_offset + shared_data;
+      const S* filter_ptr = filter_read_offset + shared_data;
       UNROLL for (int r = 0; r < filter_height; ++r) {
         UNROLL for (int c = 0; c < filter_width; ++c) {
           if (kDirection == DIRECTION_BACKWARD) {
             filter_ptr -= kBlockDepth;
           }
-          const T filter_value = *filter_ptr;
-          const T* const tile_ptr = shared_offset + shared_data;
+          const S filter_value = *filter_ptr;
+          const S* const tile_ptr = shared_offset + shared_data;
           sum1 += filter_value * tile_ptr[0];
           sum2 += filter_value * tile_ptr[tile_offset];
           shared_offset += kBlockDepth;
@@ -290,9 +294,9 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
         shared_offset += in_increment;
       }
       T* const out_ptr = inout_offset + output;
-      out_ptr[0] = sum1;
+      out_ptr[0] = static_cast<T>(sum1);
       if (!skip_second) {
-        out_ptr[tensor_offset] = sum2;
+        out_ptr[tensor_offset] = static_cast<T>(sum2);
       }
     }
 
@@ -445,15 +449,18 @@ __global__ void __launch_bounds__(1024, 2)
 // one each in the lower and upper half of a tile.
 // Backprop input direction is the same as forward direction with the filter
 // rotated by 180°.
+// T is the tensors' data type. S is the math type the kernel uses. This is the
+// same as T for all cases but pseudo half (which has T=Eigen::half, S=float).
 template <typename T, DepthwiseConv2dDirection kDirection,
           int kKnownFilterWidth, int kKnownFilterHeight, int kBlockDepth,
-          bool kKnownEvenHeight>
+          bool kKnownEvenHeight, typename S>
 __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* input, const T* filter, T* output) {
   assert(CanLaunchDepthwiseConv2dGPUSmall(args));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
-  T* const shared_data = reinterpret_cast<T*>(shared_memory);
+  extern __shared__ __align__(8) unsigned char shared_memory[];
+  static_assert(sizeof(S) <= 8, "Insufficient alignement detected");
+  S* const shared_data = reinterpret_cast<S*>(shared_memory);
 
   const int num_batches = args.batch;
   const int in_height = args.in_rows;
@@ -498,7 +505,7 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
 
   // Initialize tile, in particular the padding.
   for (int i = thread_idx; i < tile_size; i += block_size) {
-    shared_data[i] = T(0);
+    shared_data[i] = S();
   }
   __syncthreads();
 
@@ -534,34 +541,35 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
 
     if (channel_in_range) {
       const T* const in_ptr = inout_offset + input;
-      T* const tile_ptr = tile_idx + shared_data;
-      tile_ptr[0] = ldg(in_ptr);
+      S* const tile_ptr = tile_idx + shared_data;
+      tile_ptr[0] = static_cast<S>(ldg(in_ptr));
       if (!skip_second) {
-        tile_ptr[tile_offset] = ldg(block_pixels + in_ptr);
+        tile_ptr[tile_offset] = static_cast<S>(ldg(block_pixels + in_ptr));
       }
     }
 
     if (filter_write_offset != 0) {
       const int filter_offset =
           filter_idx + (channel + filter_channel) % in_depth;
-      shared_data[filter_write_offset] = ldg(filter_offset + filter);
+      shared_data[filter_write_offset] =
+          static_cast<S>(ldg(filter_offset + filter));
     }
 
     // Note: the condition to reach this is uniform across the entire block.
     __syncthreads();
 
     if (channel_in_range) {
-      T sum1 = static_cast<T>(0);
-      T sum2 = static_cast<T>(0);
+      S sum1 = S();
+      S sum2 = S();
       int shared_offset = data_idx;
-      const T* filter_ptr = filter_read_offset + shared_data;
+      const S* filter_ptr = filter_read_offset + shared_data;
       UNROLL for (int r = 0; r < filter_height; ++r) {
         UNROLL for (int c = 0; c < filter_width; ++c) {
           if (kDirection == DIRECTION_BACKWARD) {
             filter_ptr -= kBlockDepth;
           }
-          const T filter_value = *filter_ptr;
-          const T* const tile_ptr = shared_offset + shared_data;
+          const S filter_value = *filter_ptr;
+          const S* const tile_ptr = shared_offset + shared_data;
           sum1 += filter_value * tile_ptr[0];
           sum2 += filter_value * tile_ptr[tile_offset];
           ++shared_offset;
@@ -572,9 +580,9 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
         shared_offset += in_increment;
       }
       T* const out_ptr = inout_offset + output;
-      out_ptr[0] = sum1;
+      out_ptr[0] = static_cast<T>(sum1);
       if (!skip_second) {
-        out_ptr[block_pixels] = sum2;
+        out_ptr[block_pixels] = static_cast<T>(sum2);
       }
     }
 
@@ -585,11 +593,11 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
 
 template <typename T, DepthwiseConv2dDirection kDirection,
           int kKnownFilterWidth, int kKnownFilterHeight, int kBlockDepth,
-          bool kKnownEvenHeight>
-void LaunchDepthwiseConv2dGPUSmall(const GpuDevice& device,
-                                   const DepthwiseArgs& args, const T* input,
-                                   const T* filter, T* output,
-                                   TensorFormat data_format) {
+          bool kKnownEvenHeight, typename S>
+Status LaunchDepthwiseConv2dGPUSmall(OpKernelContext* ctx,
+                                     const DepthwiseArgs& args, const T* input,
+                                     const T* filter, T* output,
+                                     TensorFormat data_format) {
   const int block_height = (args.in_rows + 1) / 2;
   dim3 block_dim;
   int block_count;
@@ -602,7 +610,7 @@ void LaunchDepthwiseConv2dGPUSmall(const GpuDevice& device,
       kernel =
           DepthwiseConv2dGPUKernelNHWCSmall<T, kDirection, kKnownFilterWidth,
                                             kKnownFilterHeight, kBlockDepth,
-                                            kKnownEvenHeight>;
+                                            kKnownEvenHeight, S>;
       break;
     case FORMAT_NCHW:
       block_dim = dim3(args.in_cols, block_height, kBlockDepth);
@@ -611,73 +619,126 @@ void LaunchDepthwiseConv2dGPUSmall(const GpuDevice& device,
       kernel =
           DepthwiseConv2dGPUKernelNCHWSmall<T, kDirection, kKnownFilterWidth,
                                             kKnownFilterHeight, kBlockDepth,
-                                            kKnownEvenHeight>;
+                                            kKnownEvenHeight, S>;
       break;
     default:
-      LOG(ERROR) << "FORMAT_" << ToString(data_format) << " is not supported";
-      return;
+      return errors::InvalidArgument("FORMAT_", ToString(data_format),
+                                     " is not supported");
   }
   const int tile_width = args.in_cols + args.filter_cols - 1;
   const int tile_height = block_height * 2 + args.filter_rows - 1;
   const int tile_pixels = tile_height * tile_width;
   const int filter_pixels = args.filter_rows * args.filter_cols;
   const int shared_memory_size =
-      kBlockDepth * (tile_pixels + filter_pixels) * sizeof(T);
+      kBlockDepth * (tile_pixels + filter_pixels) * sizeof(S);
   const int num_outputs = args.out_rows * args.out_cols * block_count;
+  auto device = ctx->eigen_gpu_device();
   CudaLaunchConfig config = GetCudaLaunchConfigFixedBlockSize(
       num_outputs, device, kernel, shared_memory_size,
       block_dim.x * block_dim.y * block_dim.z);
   kernel<<<config.block_count, block_dim, shared_memory_size,
            device.stream()>>>(args, input, filter, output);
+  return Status::OK();
+}
+
+namespace detail {
+template <typename T>
+struct PseudoHalfType {
+  using Type = T;
+};
+template <>
+struct PseudoHalfType<Eigen::half> {
+  using Type = float;
+};
+}  // namespace detail
+
+namespace {
+// Maps to float if T is __half, and to T otherwise.
+template <typename T>
+using PseudoHalfType = typename detail::PseudoHalfType<T>::Type;
+
+// Returns whether the context's GPU supports efficient fp16 math.
+bool HasFastHalfMath(OpKernelContext* ctx) {
+  int major, minor;
+  ctx->op_device_context()
+      ->stream()
+      ->parent()
+      ->GetDeviceDescription()
+      .cuda_compute_capability(&major, &minor);
+  auto cuda_arch = major * 100 + minor * 10;
+  // GPUs before sm_53 don't support fp16 math, and sm_61's fp16 math is slow.
+  return cuda_arch >= 530 && cuda_arch != 610;
+}
+}  // namespace
+
+template <typename T, DepthwiseConv2dDirection kDirection,
+          int kKnownFilterWidth, int kKnownFilterHeight, int kBlockDepth,
+          bool kKnownEvenHeight>
+Status LaunchDepthwiseConv2dGPUSmall(OpKernelContext* ctx,
+                                     const DepthwiseArgs& args, const T* input,
+                                     const T* filter, T* output,
+                                     TensorFormat data_format) {
+#if !defined __CUDA_ARCH__ || __CUDA_ARCH__ >= 530
+  if (HasFastHalfMath(ctx)) {
+    return LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
+                                         kKnownFilterHeight, kBlockDepth,
+                                         kKnownEvenHeight, T>(
+        ctx, args, input, filter, output, data_format);
+  }
+#endif
+  return LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
+                                       kKnownFilterHeight, kBlockDepth,
+                                       kKnownEvenHeight, PseudoHalfType<T>>(
+      ctx, args, input, filter, output, data_format);
 }
 
 template <typename T, DepthwiseConv2dDirection kDirection,
           int kKnownFilterWidth, int kKnownFilterHeight, int kBlockDepth>
-void LaunchDepthwiseConv2dGPUSmall(const GpuDevice& device,
-                                   const DepthwiseArgs& args, const T* input,
-                                   const T* filter, T* output,
-                                   TensorFormat data_format) {
+Status LaunchDepthwiseConv2dGPUSmall(OpKernelContext* ctx,
+                                     const DepthwiseArgs& args, const T* input,
+                                     const T* filter, T* output,
+                                     TensorFormat data_format) {
   if (args.in_rows & 1) {
-    LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
-                                  kKnownFilterHeight, kBlockDepth, false>(
-        device, args, input, filter, output, data_format);
+    return LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
+                                         kKnownFilterHeight, kBlockDepth,
+                                         false>(ctx, args, input, filter,
+                                                output, data_format);
   } else {
-    LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
-                                  kKnownFilterHeight, kBlockDepth, true>(
-        device, args, input, filter, output, data_format);
+    return LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
+                                         kKnownFilterHeight, kBlockDepth, true>(
+        ctx, args, input, filter, output, data_format);
   }
 }
 
 template <typename T, DepthwiseConv2dDirection kDirection,
           int kKnownFilterWidth, int kKnownFilterHeight>
-void LaunchDepthwiseConv2dGPUSmall(const GpuDevice& device,
-                                   const DepthwiseArgs& args, const T* input,
-                                   const T* filter, T* output,
-                                   TensorFormat data_format) {
+Status LaunchDepthwiseConv2dGPUSmall(OpKernelContext* ctx,
+                                     const DepthwiseArgs& args, const T* input,
+                                     const T* filter, T* output,
+                                     TensorFormat data_format) {
   // Maximize (power of two) kBlockDepth while keeping a block within 1024
   // threads (2 pixels per thread).
   const int block_pixels = (args.in_rows + 1) / 2 * args.in_cols;
   if (block_pixels > 256) {
-    LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
-                                  kKnownFilterHeight, 2>(
-        device, args, input, filter, output, data_format);
+    return LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
+                                         kKnownFilterHeight, 2>(
+        ctx, args, input, filter, output, data_format);
   } else if (block_pixels > 128) {
-    LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
-                                  kKnownFilterHeight, 4>(
-        device, args, input, filter, output, data_format);
+    return LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
+                                         kKnownFilterHeight, 4>(
+        ctx, args, input, filter, output, data_format);
   } else {
-    LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
-                                  kKnownFilterHeight, 8>(
-        device, args, input, filter, output, data_format);
+    return LaunchDepthwiseConv2dGPUSmall<T, kDirection, kKnownFilterWidth,
+                                         kKnownFilterHeight, 8>(
+        ctx, args, input, filter, output, data_format);
   }
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
           int kKnownDepthMultiplier>
-void LaunchDepthwiseConv2dGPU(const GpuDevice& device,
-                              const DepthwiseArgs& args, const T* input,
-                              const T* filter, T* output,
-                              TensorFormat data_format) {
+Status LaunchDepthwiseConv2dGPU(OpKernelContext* ctx, const DepthwiseArgs& args,
+                                const T* input, const T* filter, T* output,
+                                TensorFormat data_format) {
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*, int);
   switch (data_format) {
     case FORMAT_NHWC:
@@ -691,11 +752,12 @@ void LaunchDepthwiseConv2dGPU(const GpuDevice& device,
                                        kKnownDepthMultiplier>;
       break;
     default:
-      LOG(ERROR) << "FORMAT_" << ToString(data_format) << " is not supported";
-      return;
+      return errors::InvalidArgument("FORMAT_", ToString(data_format),
+                                     " is not supported");
   }
   const int num_outputs =
       args.batch * args.out_rows * args.out_cols * args.out_depth;
+  auto device = ctx->eigen_gpu_device();
   CudaLaunchConfig config =
       GetCudaLaunchConfig(num_outputs, device, kernel, 0, 0);
   // The compile-time constant version runs faster with a single block.
@@ -706,26 +768,27 @@ void LaunchDepthwiseConv2dGPU(const GpuDevice& device,
   kernel<<<std::min(max_block_count, config.block_count),
            config.thread_per_block, 0, device.stream()>>>(args, input, filter,
                                                           output, num_outputs);
+  return Status::OK();
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight>
-void LaunchDepthwiseConv2dGPU(const GpuDevice& device,
-                              const DepthwiseArgs& args, const T* input,
-                              const T* filter, T* output,
-                              TensorFormat data_format) {
+Status LaunchDepthwiseConv2dGPU(OpKernelContext* ctx, const DepthwiseArgs& args,
+                                const T* input, const T* filter, T* output,
+                                TensorFormat data_format) {
   if (args.depth_multiplier == 1) {
     if (CanLaunchDepthwiseConv2dGPUSmall(args)) {
-      LaunchDepthwiseConv2dGPUSmall<T, DIRECTION_FORWARD, kKnownFilterWidth,
-                                    kKnownFilterHeight>(
-          device, args, input, filter, output, data_format);
-      return;
+      return LaunchDepthwiseConv2dGPUSmall<
+          T, DIRECTION_FORWARD, kKnownFilterWidth, kKnownFilterHeight>(
+          ctx, args, input, filter, output, data_format);
     }
 
-    LaunchDepthwiseConv2dGPU<T, kKnownFilterWidth, kKnownFilterHeight, 1>(
-        device, args, input, filter, output, data_format);
+    return LaunchDepthwiseConv2dGPU<T, kKnownFilterWidth, kKnownFilterHeight,
+                                    1>(ctx, args, input, filter, output,
+                                       data_format);
   } else {
-    LaunchDepthwiseConv2dGPU<T, kKnownFilterWidth, kKnownFilterHeight, -1>(
-        device, args, input, filter, output, data_format);
+    return LaunchDepthwiseConv2dGPU<T, kKnownFilterWidth, kKnownFilterHeight,
+                                    -1>(ctx, args, input, filter, output,
+                                        data_format);
   }
 }
 
@@ -736,18 +799,13 @@ void LaunchDepthwiseConvOp<GpuDevice, T>::operator()(OpKernelContext* ctx,
                                                      const T* input,
                                                      const T* filter, T* output,
                                                      TensorFormat data_format) {
-  const GpuDevice& device = ctx->eigen_device<GpuDevice>();
   if (args.filter_rows == 3 && args.filter_cols == 3) {
-    LaunchDepthwiseConv2dGPU<T, 3, 3>(device, args, input, filter, output,
-                                      data_format);
+    OP_REQUIRES_OK(ctx, LaunchDepthwiseConv2dGPU<T, 3, 3>(
+                            ctx, args, input, filter, output, data_format));
   } else {
-    LaunchDepthwiseConv2dGPU<T, -1, -1>(device, args, input, filter, output,
-                                        data_format);
+    OP_REQUIRES_OK(ctx, LaunchDepthwiseConv2dGPU<T, -1, -1>(
+                            ctx, args, input, filter, output, data_format));
   }
-  auto stream = ctx->op_device_context()->stream();
-  OP_REQUIRES(ctx, stream->ok(),
-              errors::Internal(
-                  "Launch of gpu kernel for DepthwiseConv2dGPULaunch failed"));
 }
 
 template struct LaunchDepthwiseConvOp<GpuDevice, Eigen::half>;
@@ -904,11 +962,11 @@ __global__ void __launch_bounds__(640, 2)
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
           int kKnownDepthMultiplier>
-void LaunchDepthwiseConv2dBackpropInputGPU(const GpuDevice& device,
-                                           const DepthwiseArgs& args,
-                                           const T* out_backprop,
-                                           const T* filter, T* in_backprop,
-                                           TensorFormat data_format) {
+Status LaunchDepthwiseConv2dBackpropInputGPU(OpKernelContext* ctx,
+                                             const DepthwiseArgs& args,
+                                             const T* out_backprop,
+                                             const T* filter, T* in_backprop,
+                                             TensorFormat data_format) {
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*, int);
   switch (data_format) {
     case FORMAT_NHWC:
@@ -920,38 +978,39 @@ void LaunchDepthwiseConv2dBackpropInputGPU(const GpuDevice& device,
           T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
       break;
     default:
-      LOG(ERROR) << "FORMAT_" << ToString(data_format) << " is not supported";
-      return;
+      return errors::InvalidArgument("FORMAT_", ToString(data_format),
+                                     " is not supported");
   }
   const int num_in_backprop =
       args.batch * args.in_rows * args.in_cols * args.in_depth;
+  auto device = ctx->eigen_gpu_device();
   CudaLaunchConfig config =
       GetCudaLaunchConfig(num_in_backprop, device, kernel, 0, 0);
   kernel<<<config.block_count, config.thread_per_block, 0, device.stream()>>>(
       args, out_backprop, filter, in_backprop, num_in_backprop);
+  return Status::OK();
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight>
-void LaunchDepthwiseConv2dBackpropInputGPU(const GpuDevice& device,
-                                           const DepthwiseArgs& args,
-                                           const T* out_backprop,
-                                           const T* filter, T* in_backprop,
-                                           TensorFormat data_format) {
+Status LaunchDepthwiseConv2dBackpropInputGPU(OpKernelContext* ctx,
+                                             const DepthwiseArgs& args,
+                                             const T* out_backprop,
+                                             const T* filter, T* in_backprop,
+                                             TensorFormat data_format) {
   if (args.depth_multiplier == 1) {
     if (CanLaunchDepthwiseConv2dGPUSmall(args)) {
-      LaunchDepthwiseConv2dGPUSmall<T, DIRECTION_BACKWARD, kKnownFilterWidth,
-                                    kKnownFilterHeight>(
-          device, args, out_backprop, filter, in_backprop, data_format);
-      return;
+      return LaunchDepthwiseConv2dGPUSmall<
+          T, DIRECTION_BACKWARD, kKnownFilterWidth, kKnownFilterHeight>(
+          ctx, args, out_backprop, filter, in_backprop, data_format);
     }
 
-    LaunchDepthwiseConv2dBackpropInputGPU<T, kKnownFilterWidth,
-                                          kKnownFilterHeight, 1>(
-        device, args, out_backprop, filter, in_backprop, data_format);
+    return LaunchDepthwiseConv2dBackpropInputGPU<T, kKnownFilterWidth,
+                                                 kKnownFilterHeight, 1>(
+        ctx, args, out_backprop, filter, in_backprop, data_format);
   } else {
-    LaunchDepthwiseConv2dBackpropInputGPU<T, kKnownFilterWidth,
-                                          kKnownFilterHeight, -1>(
-        device, args, out_backprop, filter, in_backprop, data_format);
+    return LaunchDepthwiseConv2dBackpropInputGPU<T, kKnownFilterWidth,
+                                                 kKnownFilterHeight, -1>(
+        ctx, args, out_backprop, filter, in_backprop, data_format);
   }
 }
 
@@ -960,19 +1019,15 @@ template <typename T>
 void LaunchDepthwiseConvBackpropInputOp<GpuDevice, T>::operator()(
     OpKernelContext* ctx, const DepthwiseArgs& args, const T* out_backprop,
     const T* filter, T* in_backprop, TensorFormat data_format) {
-  const GpuDevice& device = ctx->eigen_device<GpuDevice>();
   if (args.filter_rows == 3 && args.filter_cols == 3) {
-    LaunchDepthwiseConv2dBackpropInputGPU<T, 3, 3>(
-        device, args, out_backprop, filter, in_backprop, data_format);
+    OP_REQUIRES_OK(
+        ctx, LaunchDepthwiseConv2dBackpropInputGPU<T, 3, 3>(
+                 ctx, args, out_backprop, filter, in_backprop, data_format));
   } else {
-    LaunchDepthwiseConv2dBackpropInputGPU<T, -1, -1>(
-        device, args, out_backprop, filter, in_backprop, data_format);
+    OP_REQUIRES_OK(
+        ctx, LaunchDepthwiseConv2dBackpropInputGPU<T, -1, -1>(
+                 ctx, args, out_backprop, filter, in_backprop, data_format));
   }
-  auto stream = ctx->op_device_context()->stream();
-  OP_REQUIRES(ctx, stream->ok(),
-              errors::Internal("Launch of gpu kernel for "
-                               "DepthwiseConv2dBackpropInp"
-                               "utGPULaunch failed"));
 }
 
 template struct LaunchDepthwiseConvBackpropInputOp<GpuDevice, Eigen::half>;
@@ -1111,15 +1166,18 @@ __device__ __forceinline__ T WarpSumReduce(T val) {
 // up in global memory using atomics.
 // Requirements: threads per block must be multiple of 32 and <= launch_bounds,
 // kAccumPixels * 64 >= args.in_rows * args.in_cols * kBlockDepth.
+// T is the tensors' data type. S is the math type the kernel uses. This is the
+// same as T for all cases but pseudo half (which has T=Eigen::half, S=float).
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
-          int kBlockDepth, int kAccumPixels>
+          int kBlockDepth, int kAccumPixels, typename S>
 __global__
 __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.z));
   // Holds block plus halo and filter data for blockDim.x depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
-  T* const shared_data = reinterpret_cast<T*>(shared_memory);
+  extern __shared__ __align__(8) unsigned char shared_memory[];
+  static_assert(sizeof(S) <= 8, "Insufficient alignement detected");
+  S* const shared_data = reinterpret_cast<S*>(shared_memory);
 
   const int num_batches = args.batch;
   const int in_height = args.in_rows;
@@ -1169,7 +1227,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
 
   // Initialize tile, in particular the padding and accumulator.
   for (int i = thread_idx; i < tile_size + accum_size; i += block_size) {
-    shared_data[i] = T(0);
+    shared_data[i] = S();
   }
   __syncthreads();
 
@@ -1203,10 +1261,10 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
 
     if (channel_in_range) {
       const T* const in_ptr = inout_offset + input;
-      T* const tile_ptr = tile_idx + shared_data;
-      tile_ptr[0] = ldg(in_ptr);
+      S* const tile_ptr = tile_idx + shared_data;
+      tile_ptr[0] = static_cast<S>(ldg(in_ptr));
       if (!skip_second) {
-        tile_ptr[tile_offset] = ldg(tensor_offset + in_ptr);
+        tile_ptr[tile_offset] = static_cast<S>(ldg(tensor_offset + in_ptr));
       }
     }
 
@@ -1216,14 +1274,15 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
 
     if (channel_in_range) {
       const T* const out_ptr = inout_offset + output;
-      const T out1 = ldg(out_ptr);
-      const T out2 = skip_second ? T(0) : ldg(tensor_offset + out_ptr);
+      const S out1 = static_cast<S>(ldg(out_ptr));
+      const S out2 =
+          skip_second ? S() : static_cast<S>(ldg(tensor_offset + out_ptr));
       int shared_offset = data_idx;
-      T* accum_ptr = accum_offset + shared_data;
+      S* accum_ptr = accum_offset + shared_data;
       UNROLL for (int r = 0; r < filter_height; ++r) {
         UNROLL for (int c = 0; c < filter_width; ++c) {
-          const T* const tile_ptr = shared_offset + shared_data;
-          T val = out1 * tile_ptr[0] + out2 * tile_ptr[tile_offset];
+          const S* const tile_ptr = shared_offset + shared_data;
+          S val = out1 * tile_ptr[0] + out2 * tile_ptr[tile_offset];
           // Warp-accumulate pixels of the same depth and write to accumulator.
           for (int delta = 16; delta >= kBlockDepth; delta /= 2) {
             val += CudaShuffleXorSync(active_threads, val, delta);
@@ -1241,18 +1300,18 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
     // Note: the condition to reach this is uniform across the entire block.
     __syncthreads();
 
-    const T* const accum_data = tile_size + shared_data;
+    const S* const accum_data = tile_size + shared_data;
     for (int i = thread_idx; i < accum_size; i += block_size) {
       const int filter_idx = i / kAccumPixels;
       const int filter_pix = filter_idx / kBlockDepth;
       const int filter_channel = filter_idx % kBlockDepth + start_channel;
       const int filter_offset = filter_pix * in_depth + filter_channel;
       if (filter_channel < in_depth) {
-        T val = accum_data[i];
+        S val = accum_data[i];
         // Warp-accumulate the pixels of the same depth from the accumulator.
         val = WarpSumReduce<kAccumPixels>(val);
         if (!(thread_idx & kAccumPixels - 1)) {
-          CudaAtomicAdd(filter_offset + filter, val);
+          CudaAtomicAdd(filter_offset + filter, static_cast<T>(val));
         }
       }
     }
@@ -1382,14 +1441,15 @@ __global__ void __launch_bounds__(640, 2)
 // Requirements: threads per block must be multiple of 32 and <= launch_bounds,
 // kAccumPixels * 64 >= args.in_rows * args.in_cols * kBlockDepth.
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
-          int kBlockDepth, int kAccumPixels>
+          int kBlockDepth, int kAccumPixels, typename S>
 __global__
 __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
     const DepthwiseArgs args, const T* output, const T* input, T* filter) {
   assert(CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, blockDim.x));
   // Holds block plus halo and filter data for blockDim.z depths.
-  extern __shared__ __align__(sizeof(T)) unsigned char shared_memory[];
-  T* const shared_data = reinterpret_cast<T*>(shared_memory);
+  extern __shared__ __align__(8) unsigned char shared_memory[];
+  static_assert(sizeof(S) <= 8, "Insufficient alignement detected");
+  S* const shared_data = reinterpret_cast<S*>(shared_memory);
 
   const int num_batches = args.batch;
   const int in_height = args.in_rows;
@@ -1438,7 +1498,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
 
   // Initialize tile, in particular the padding and accumulator.
   for (int i = thread_idx; i < tile_size + accum_size; i += block_size) {
-    shared_data[i] = T(0);
+    shared_data[i] = S();
   }
   __syncthreads();
 
@@ -1468,10 +1528,10 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
 
     if (channel_in_range) {
       const T* const in_ptr = inout_offset + input;
-      T* const tile_ptr = tile_idx + shared_data;
-      tile_ptr[0] = ldg(in_ptr);
+      S* const tile_ptr = tile_idx + shared_data;
+      tile_ptr[0] = static_cast<S>(ldg(in_ptr));
       if (!skip_second) {
-        tile_ptr[tile_offset] = ldg(block_pixels + in_ptr);
+        tile_ptr[tile_offset] = static_cast<S>(ldg(block_pixels + in_ptr));
       }
     }
 
@@ -1481,14 +1541,15 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
 
     if (channel_in_range) {
       const T* const out_ptr = inout_offset + output;
-      const T out1 = ldg(out_ptr);
-      const T out2 = skip_second ? T(0) : ldg(block_pixels + out_ptr);
+      const S out1 = static_cast<S>(ldg(out_ptr));
+      const S out2 =
+          skip_second ? S() : static_cast<S>(ldg(block_pixels + out_ptr));
       int shared_offset = data_idx;
-      T* accum_ptr = accum_offset + shared_data;
+      S* accum_ptr = accum_offset + shared_data;
       UNROLL for (int r = 0; r < filter_height; ++r) {
         UNROLL for (int c = 0; c < filter_width; ++c) {
-          const T* const tile_ptr = shared_offset + shared_data;
-          T val = out1 * tile_ptr[0] + out2 * tile_ptr[tile_offset];
+          const S* const tile_ptr = shared_offset + shared_data;
+          S val = out1 * tile_ptr[0] + out2 * tile_ptr[tile_offset];
           // Warp-accumulate pixels of the same depth and write to accumulator.
           for (int delta = 16 / kBlockDepth; delta > 0; delta /= 2) {
             val += CudaShuffleXorSync(active_threads, val, delta);
@@ -1506,7 +1567,7 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
     // Note: the condition to reach this is uniform across the entire block.
     __syncthreads();
 
-    const T* const accum_data = tile_size + shared_data;
+    const S* const accum_data = tile_size + shared_data;
     for (int i = thread_idx; i < accum_size; i += block_size) {
       const int filter_idx = i / kAccumPixels;
       const int filter_pix = filter_idx / kBlockDepth;
@@ -1514,11 +1575,11 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
           (channel + filter_idx % kBlockDepth) % in_depth;
       const int filter_offset = filter_pix * in_depth + filter_channel;
       if (filter_channel < in_depth) {
-        T val = accum_data[i];
+        S val = accum_data[i];
         // Warp-accumulate pixels of the same depth from the accumulator.
         val = WarpSumReduce<kAccumPixels>(val);
         if (!(thread_idx & kAccumPixels - 1)) {
-          CudaAtomicAdd(filter_offset + filter, val);
+          CudaAtomicAdd(filter_offset + filter, static_cast<T>(val));
         }
       }
     }
@@ -1526,19 +1587,20 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
-          int kBlockDepth, int kAccumPixels>
-bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
-    const GpuDevice& device, const DepthwiseArgs& args, const int block_height,
+          int kBlockDepth, int kAccumPixels, typename S>
+Status TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
+    OpKernelContext* ctx, const DepthwiseArgs& args, const int block_height,
     const T* out_backprop, const T* input, T* filter_backprop,
     TensorFormat data_format) {
+  auto device = ctx->eigen_gpu_device();
   const int tile_width = args.in_cols + args.filter_cols - 1;
   const int tile_height = block_height * 2 + args.filter_rows - 1;
   const int tile_pixels = tile_height * tile_width;
   const int filter_pixels = args.filter_rows * args.filter_cols;
   const int shared_memory_size =
-      kBlockDepth * (tile_pixels + filter_pixels * kAccumPixels) * sizeof(T);
+      kBlockDepth * (tile_pixels + filter_pixels * kAccumPixels) * sizeof(S);
   if (shared_memory_size > device.sharedMemPerBlock()) {
-    return false;
+    return errors::FailedPrecondition("Not enough shared memory");
   }
 
   dim3 block_dim;
@@ -1550,18 +1612,20 @@ bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
       block_count =
           args.batch * DivUp(args.out_depth, kBlockDepth) * kBlockDepth;
       kernel = DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall<
-          T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels>;
+          T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels,
+          S>;
       break;
     case FORMAT_NCHW:
       block_dim = dim3(args.in_cols, block_height, kBlockDepth);
       block_count =
           DivUp(args.batch * args.out_depth, kBlockDepth) * kBlockDepth;
       kernel = DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall<
-          T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels>;
+          T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels,
+          S>;
       break;
     default:
-      LOG(ERROR) << "FORMAT_" << ToString(data_format) << " is not supported";
-      return false;
+      return errors::InvalidArgument("FORMAT_", ToString(data_format),
+                                     " is not supported");
   }
   const int num_out_backprop = args.out_rows * args.out_cols * block_count;
   CudaLaunchConfig config = GetCudaLaunchConfigFixedBlockSize(
@@ -1569,13 +1633,33 @@ bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
       block_dim.x * block_dim.y * block_dim.z);
   kernel<<<config.block_count, block_dim, shared_memory_size,
            device.stream()>>>(args, out_backprop, input, filter_backprop);
-  return true;
+  return Status::OK();
+}
+
+template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
+          int kBlockDepth, int kAccumPixels>
+Status TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
+    OpKernelContext* ctx, const DepthwiseArgs& args, const int block_height,
+    const T* out_backprop, const T* input, T* filter_backprop,
+    TensorFormat data_format) {
+#if !defined __CUDA_ARCH__ || __CUDA_ARCH__ >= 530
+  if (HasFastHalfMath(ctx)) {
+    return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
+        T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels, T>(
+        ctx, args, block_height, out_backprop, input, filter_backprop,
+        data_format);
+  }
+#endif
+  return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
+      T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels,
+      PseudoHalfType<T>>(ctx, args, block_height, out_backprop, input,
+                         filter_backprop, data_format);
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
           int kBlockDepth>
-bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
-    const GpuDevice& device, const DepthwiseArgs& args, const int block_height,
+Status TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
+    OpKernelContext* ctx, const DepthwiseArgs& args, const int block_height,
     const T* out_backprop, const T* input, T* filter_backprop,
     TensorFormat data_format) {
   // Minimize (power of two) kAccumPixels, while satisfying
@@ -1584,24 +1668,24 @@ bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
   if (block_pixels > 512) {
     return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
         T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, 32>(
-        device, args, block_height, out_backprop, input, filter_backprop,
+        ctx, args, block_height, out_backprop, input, filter_backprop,
         data_format);
   } else if (block_pixels > 256) {
     return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
         T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, 16>(
-        device, args, block_height, out_backprop, input, filter_backprop,
+        ctx, args, block_height, out_backprop, input, filter_backprop,
         data_format);
   } else {
     return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
         T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, 8>(
-        device, args, block_height, out_backprop, input, filter_backprop,
+        ctx, args, block_height, out_backprop, input, filter_backprop,
         data_format);
   }
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight>
-bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
-    const GpuDevice& device, const DepthwiseArgs& args, const T* out_backprop,
+Status TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
+    OpKernelContext* ctx, const DepthwiseArgs& args, const T* out_backprop,
     const T* input, T* filter_backprop, TensorFormat data_format) {
   // Maximize (power of two) kBlockDepth while keeping a block within 1024
   // threads (2 pixels per thread).
@@ -1621,37 +1705,35 @@ bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
   }
 
   if (!CanLaunchDepthwiseConv2dBackpropFilterGPUSmall(args, block_height)) {
-    return false;
+    return errors::FailedPrecondition("Cannot launch this configuration");
   }
 
   switch (block_depth) {
     case 8:
       return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
           T, kKnownFilterWidth, kKnownFilterHeight, 8>(
-          device, args, block_height, out_backprop, input, filter_backprop,
+          ctx, args, block_height, out_backprop, input, filter_backprop,
           data_format);
     case 4:
       return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
           T, kKnownFilterWidth, kKnownFilterHeight, 4>(
-          device, args, block_height, out_backprop, input, filter_backprop,
+          ctx, args, block_height, out_backprop, input, filter_backprop,
           data_format);
     case 2:
       return TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<
           T, kKnownFilterWidth, kKnownFilterHeight, 2>(
-          device, args, block_height, out_backprop, input, filter_backprop,
+          ctx, args, block_height, out_backprop, input, filter_backprop,
           data_format);
     default:
-      return false;
+      return errors::InvalidArgument("Unexpected block depth");
   }
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight,
           int kKnownDepthMultiplier>
-void LaunchDepthwiseConv2dBackpropFilterGPU(const GpuDevice& device,
-                                            const DepthwiseArgs& args,
-                                            const T* out_backprop,
-                                            const T* input, T* filter_backprop,
-                                            TensorFormat data_format) {
+Status LaunchDepthwiseConv2dBackpropFilterGPU(
+    OpKernelContext* ctx, const DepthwiseArgs& args, const T* out_backprop,
+    const T* input, T* filter_backprop, TensorFormat data_format) {
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*, int);
   switch (data_format) {
     case FORMAT_NHWC:
@@ -1663,37 +1745,38 @@ void LaunchDepthwiseConv2dBackpropFilterGPU(const GpuDevice& device,
           T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
       break;
     default:
-      LOG(ERROR) << "FORMAT_" << ToString(data_format) << " is not supported";
-      return;
+      return errors::InvalidArgument("FORMAT_", ToString(data_format),
+                                     " is not supported");
   }
   const int num_out_backprop =
       args.batch * args.out_rows * args.out_cols * args.out_depth;
+  auto device = ctx->eigen_gpu_device();
   CudaLaunchConfig config =
       GetCudaLaunchConfig(num_out_backprop, device, kernel, 0, 0);
   kernel<<<config.block_count, config.thread_per_block, 0, device.stream()>>>(
       args, out_backprop, input, filter_backprop, num_out_backprop);
+  return Status::OK();
 }
 
 template <typename T, int kKnownFilterWidth, int kKnownFilterHeight>
-void LaunchDepthwiseConv2dBackpropFilterGPU(const GpuDevice& device,
-                                            const DepthwiseArgs& args,
-                                            const T* out_backprop,
-                                            const T* input, T* filter_backprop,
-                                            TensorFormat data_format) {
+Status LaunchDepthwiseConv2dBackpropFilterGPU(
+    OpKernelContext* ctx, const DepthwiseArgs& args, const T* out_backprop,
+    const T* input, T* filter_backprop, TensorFormat data_format) {
   if (args.depth_multiplier == 1) {
     if (TryLaunchDepthwiseConv2dBackpropFilterGPUSmall<T, kKnownFilterWidth,
                                                        kKnownFilterHeight>(
-            device, args, out_backprop, input, filter_backprop, data_format)) {
-      return;
+            ctx, args, out_backprop, input, filter_backprop, data_format)
+            .ok()) {
+      return Status::OK();
     }
 
-    LaunchDepthwiseConv2dBackpropFilterGPU<T, kKnownFilterWidth,
-                                           kKnownFilterHeight, 1>(
-        device, args, out_backprop, input, filter_backprop, data_format);
+    return LaunchDepthwiseConv2dBackpropFilterGPU<T, kKnownFilterWidth,
+                                                  kKnownFilterHeight, 1>(
+        ctx, args, out_backprop, input, filter_backprop, data_format);
   } else {
-    LaunchDepthwiseConv2dBackpropFilterGPU<T, kKnownFilterWidth,
-                                           kKnownFilterHeight, -1>(
-        device, args, out_backprop, input, filter_backprop, data_format);
+    return LaunchDepthwiseConv2dBackpropFilterGPU<T, kKnownFilterWidth,
+                                                  kKnownFilterHeight, -1>(
+        ctx, args, out_backprop, input, filter_backprop, data_format);
   }
 }
 
@@ -1702,7 +1785,6 @@ template <typename T>
 void LaunchDepthwiseConvBackpropFilterOp<GpuDevice, T>::operator()(
     OpKernelContext* ctx, const DepthwiseArgs& args, const T* out_backprop,
     const T* input, T* filter_backprop, TensorFormat data_format) {
-  const GpuDevice& device = ctx->eigen_device<GpuDevice>();
   auto stream = ctx->op_device_context()->stream();
 
   // Initialize the results to 0.
@@ -1712,16 +1794,14 @@ void LaunchDepthwiseConvBackpropFilterOp<GpuDevice, T>::operator()(
   stream->ThenMemset32(&filter_bp_ptr, 0, num_filter_backprop * sizeof(T));
 
   if (args.filter_rows == 3 && args.filter_cols == 3) {
-    LaunchDepthwiseConv2dBackpropFilterGPU<T, 3, 3>(
-        device, args, out_backprop, input, filter_backprop, data_format);
+    OP_REQUIRES_OK(
+        ctx, LaunchDepthwiseConv2dBackpropFilterGPU<T, 3, 3>(
+                 ctx, args, out_backprop, input, filter_backprop, data_format));
   } else {
-    LaunchDepthwiseConv2dBackpropFilterGPU<T, -1, -1>(
-        device, args, out_backprop, input, filter_backprop, data_format);
+    OP_REQUIRES_OK(
+        ctx, LaunchDepthwiseConv2dBackpropFilterGPU<T, -1, -1>(
+                 ctx, args, out_backprop, input, filter_backprop, data_format));
   }
-  OP_REQUIRES(ctx, stream->ok(),
-              errors::Internal("Launch of gpu kernel for "
-                               "DepthwiseConv2dBackpropFil"
-                               "terGPULaunch failed"));
 }
 
 template struct LaunchDepthwiseConvBackpropFilterOp<GpuDevice, Eigen::half>;
diff --git a/tensorflow/core/kernels/depthwise_conv_ops_test.cc b/tensorflow/core/kernels/depthwise_conv_ops_test.cc
new file mode 100644
index 0000000000..87bb68a43b
--- /dev/null
+++ b/tensorflow/core/kernels/depthwise_conv_ops_test.cc
@@ -0,0 +1,114 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/cc/ops/const_op.h"
+#include "tensorflow/cc/ops/image_ops.h"
+#include "tensorflow/cc/ops/nn_ops.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
+#include "tensorflow/core/framework/fake_input.h"
+#include "tensorflow/core/framework/node_def_builder.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/kernels/conv_ops_gpu.h"
+#include "tensorflow/core/kernels/ops_testutil.h"
+#include "tensorflow/core/kernels/ops_util.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
+#include "tensorflow/core/public/session.h"
+
+namespace tensorflow {
+namespace {
+class DepthwiseConvOpTest : public OpsTestBase {
+ protected:
+  enum class Device { CPU, GPU };
+
+  template <typename T>
+  void Run(Device device) {
+    if (device == Device::GPU) {
+      SetDevice(DEVICE_GPU,
+                std::unique_ptr<tensorflow::Device>(DeviceFactory::NewDevice(
+                    "GPU", {}, "/job:a/replica:0/task:0")));
+    }
+    DataType dtype = DataTypeToEnum<T>::value;
+    TF_EXPECT_OK(NodeDefBuilder("depthwise_conv2d", "DepthwiseConv2dNative")
+                     .Input(FakeInput(dtype))
+                     .Input(FakeInput(dtype))
+                     .Attr("T", dtype)
+                     .Attr("strides", {1, 1, 1, 1})
+                     .Attr("padding", "SAME")
+                     .Finalize(node_def()));
+    TF_EXPECT_OK(InitOp());
+    const int depth = 2;
+    const int image_width = 2;
+    const int image_height = 3;
+    const int batch_count = 1;
+    // The image matrix is ('first/second' channel):
+    // | 1/2  |  3/4  |
+    // | 5/6  |  7/8  |
+    // | 9/10 | 11/12 |
+    Tensor image(dtype, {batch_count, image_height, image_width, depth});
+    test::FillValues<T>(&image, {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
+
+    // The filter matrix is:
+    // | 1/2 |  7/8  | 13/14 |
+    // | 3/4 |  9/10 | 15/16 |
+    // | 5/6 | 11/12 | 17/18 |
+    const int filter_size = 3;
+    const int filter_count = 1;
+    Tensor filter(dtype, {filter_size, filter_size, depth, filter_count});
+    test::FillValues<T>(&filter, {1, 2, 7, 8, 13, 14, 3, 4, 9, 10, 15, 16, 5, 6,
+                                  11, 12, 17, 18});
+
+    AddInputFromArray<T>(image.shape(), image.flat<T>());
+    AddInputFromArray<T>(filter.shape(), filter.flat<T>());
+    TF_ASSERT_OK(RunOpKernel());
+
+    // We're sliding two 3x3 filters across the 3x2 image, with accesses outside
+    // the input set to zero because we're using the 'SAME' padding mode.
+    // This means we should end up with this matrix:
+    // | 105/150 | 183/95  |
+    // | 235/312 | 357/178 |
+    // | 187/234 | 261/121 |
+    Tensor expected(dtype, image.shape());
+    test::FillValues<T>(&expected, {228, 300, 132, 180, 482, 596, 266, 344, 372,
+                                    452, 180, 236});
+    const Tensor& output = *GetOutput(0);
+    // TODO(csigg): This should happen as part of GetOutput.
+    TF_EXPECT_OK(device_->Sync());
+    test::ExpectTensorNear<T>(expected, output, 1e-5);
+  }
+};
+
+TEST_F(DepthwiseConvOpTest, DepthwiseConvFloatCpu) { Run<float>(Device::CPU); }
+TEST_F(DepthwiseConvOpTest, DepthwiseConvDoubleCpu) {
+  Run<double>(Device::CPU);
+}
+TEST_F(DepthwiseConvOpTest, DepthwiseConvHalfCpu) {
+  Run<Eigen::half>(Device::CPU);
+}
+
+#ifdef GOOGLE_CUDA
+TEST_F(DepthwiseConvOpTest, DepthwiseConvFloatGpu) { Run<float>(Device::GPU); }
+TEST_F(DepthwiseConvOpTest, DepthwiseConvDoubleGpu) {
+  Run<double>(Device::GPU);
+}
+TEST_F(DepthwiseConvOpTest, DepthwiseConvHalfGpu) {
+  Run<Eigen::half>(Device::GPU);
+}
+#endif
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/deserialize_sparse_string_op.cc b/tensorflow/core/kernels/deserialize_sparse_string_op.cc
index 6fb07c11e9..2c13f24ad6 100644
--- a/tensorflow/core/kernels/deserialize_sparse_string_op.cc
+++ b/tensorflow/core/kernels/deserialize_sparse_string_op.cc
@@ -165,7 +165,10 @@ class DeserializeSparseOp : public OpKernel {
     std::vector<SparseTensor> tensors;
     tensors.reserve(num_sparse_tensors);
     for (int i = 0; i < num_sparse_tensors; ++i) {
-      tensors.emplace_back(indices[i], values[i], shape, std_order);
+      SparseTensor tensor;
+      OP_REQUIRES_OK(context, SparseTensor::Create(indices[i], values[i], shape,
+                                                   std_order, &tensor));
+      tensors.push_back(std::move(tensor));
     }
 
     gtl::optional<SparseTensor> maybe_output;
diff --git a/tensorflow/core/kernels/edit_distance_op.cc b/tensorflow/core/kernels/edit_distance_op.cc
index 20d857c721..4aecdc9e41 100644
--- a/tensorflow/core/kernels/edit_distance_op.cc
+++ b/tensorflow/core/kernels/edit_distance_op.cc
@@ -133,10 +133,15 @@ class EditDistanceOp : public OpKernel {
     std::vector<int64> sorted_order(truth_st_shape.dims());
     std::iota(sorted_order.begin(), sorted_order.end(), 0);
 
-    sparse::SparseTensor hypothesis(*hypothesis_indices, *hypothesis_values,
-                                    hypothesis_st_shape, sorted_order);
-    sparse::SparseTensor truth(*truth_indices, *truth_values, truth_st_shape,
-                               sorted_order);
+    sparse::SparseTensor hypothesis;
+    OP_REQUIRES_OK(ctx, sparse::SparseTensor::Create(
+                            *hypothesis_indices, *hypothesis_values,
+                            hypothesis_st_shape, sorted_order, &hypothesis));
+
+    sparse::SparseTensor truth;
+    OP_REQUIRES_OK(ctx, sparse::SparseTensor::Create(
+                            *truth_indices, *truth_values, truth_st_shape,
+                            sorted_order, &truth));
 
     // Group dims 0, 1, ..., RANK - 1.  The very last dim is assumed
     // to store the variable length sequences.
diff --git a/tensorflow/core/kernels/encode_proto_op.cc b/tensorflow/core/kernels/encode_proto_op.cc
index 3b02ae52a2..4a0c1943e5 100644
--- a/tensorflow/core/kernels/encode_proto_op.cc
+++ b/tensorflow/core/kernels/encode_proto_op.cc
@@ -31,6 +31,7 @@ limitations under the License.
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/util/proto/descriptors.h"
+#include "tensorflow/core/util/proto/proto_utils.h"
 
 namespace tensorflow {
 namespace {
@@ -42,9 +43,9 @@ using ::tensorflow::protobuf::internal::WireFormatLite;
 using ::tensorflow::protobuf::io::CodedOutputStream;
 using ::tensorflow::protobuf::io::StringOutputStream;
 
-// Computes the total serialized size for a packed repeated field.
-// For fixed-size types this can just multiply, but for variable-sized
-// types it has to iterate through the values in the tensor.
+// Computes the total serialized size for a packed repeated field. For
+// fixed-size types this can just multiply, but for variable-sized types it has
+// to iterate through the values in the tensor.
 template <WireFormatLite::FieldType FieldType, typename TensorT>
 size_t TotalPackedSize(const Tensor& input, int message_index, int size);
 
@@ -83,11 +84,11 @@ size_t TotalPackedSize<WireFormatLite::TYPE_INT64, int64>(const Tensor& input,
 }
 
 template <>
-size_t TotalPackedSize<WireFormatLite::TYPE_UINT64, int64>(const Tensor& input,
-                                                           int message_index,
-                                                           int size) {
+size_t TotalPackedSize<WireFormatLite::TYPE_UINT64, uint64>(const Tensor& input,
+                                                            int message_index,
+                                                            int size) {
   size_t data_size = 0;
-  auto input_t = input.flat_inner_dims<int64>();
+  auto input_t = input.flat_inner_dims<uint64>();
   for (int64 i = 0; i < size; i++) {
     data_size += WireFormatLite::UInt64Size(
         input_t(static_cast<int64>(message_index), i));
@@ -96,6 +97,19 @@ size_t TotalPackedSize<WireFormatLite::TYPE_UINT64, int64>(const Tensor& input,
 }
 
 template <>
+size_t TotalPackedSize<WireFormatLite::TYPE_INT32, int64>(const Tensor& input,
+                                                          int message_index,
+                                                          int size) {
+  size_t data_size = 0;
+  auto input_t = input.flat_inner_dims<int64>();
+  for (int64 i = 0; i < size; i++) {
+    data_size += WireFormatLite::Int32Size(
+        input_t(static_cast<int64>(message_index), i));
+  }
+  return data_size;
+}
+
+template <>
 size_t TotalPackedSize<WireFormatLite::TYPE_INT32, int32>(const Tensor& input,
                                                           int message_index,
                                                           int size) {
@@ -109,23 +123,20 @@ size_t TotalPackedSize<WireFormatLite::TYPE_INT32, int32>(const Tensor& input,
 }
 
 template <>
-size_t TotalPackedSize<WireFormatLite::TYPE_FIXED64, int64>(const Tensor& input,
-                                                            int message_index,
-                                                            int size) {
+size_t TotalPackedSize<WireFormatLite::TYPE_FIXED64, uint64>(
+    const Tensor& input, int message_index, int size) {
   return size * WireFormatLite::kFixed64Size;
 }
 
 template <>
-size_t TotalPackedSize<WireFormatLite::TYPE_FIXED32, int64>(const Tensor& input,
-                                                            int message_index,
-                                                            int size) {
+size_t TotalPackedSize<WireFormatLite::TYPE_FIXED32, uint64>(
+    const Tensor& input, int message_index, int size) {
   return size * WireFormatLite::kFixed32Size;
 }
 
 template <>
-size_t TotalPackedSize<WireFormatLite::TYPE_FIXED32, int32>(const Tensor& input,
-                                                            int message_index,
-                                                            int size) {
+size_t TotalPackedSize<WireFormatLite::TYPE_FIXED32, uint32>(
+    const Tensor& input, int message_index, int size) {
   return size * WireFormatLite::kFixed32Size;
 }
 
@@ -137,11 +148,11 @@ size_t TotalPackedSize<WireFormatLite::TYPE_BOOL, bool>(const Tensor& input,
 }
 
 template <>
-size_t TotalPackedSize<WireFormatLite::TYPE_UINT32, int64>(const Tensor& input,
-                                                           int message_index,
-                                                           int size) {
+size_t TotalPackedSize<WireFormatLite::TYPE_UINT32, uint64>(const Tensor& input,
+                                                            int message_index,
+                                                            int size) {
   size_t data_size = 0;
-  auto input_t = input.flat_inner_dims<int64>();
+  auto input_t = input.flat_inner_dims<uint64>();
   for (int64 i = 0; i < size; i++) {
     data_size += WireFormatLite::UInt32Size(
         input_t(static_cast<int64>(message_index), i));
@@ -150,11 +161,11 @@ size_t TotalPackedSize<WireFormatLite::TYPE_UINT32, int64>(const Tensor& input,
 }
 
 template <>
-size_t TotalPackedSize<WireFormatLite::TYPE_UINT32, int32>(const Tensor& input,
-                                                           int message_index,
-                                                           int size) {
+size_t TotalPackedSize<WireFormatLite::TYPE_UINT32, uint32>(const Tensor& input,
+                                                            int message_index,
+                                                            int size) {
   size_t data_size = 0;
-  auto input_t = input.flat_inner_dims<int32>();
+  auto input_t = input.flat_inner_dims<uint32>();
   for (int64 i = 0; i < size; i++) {
     data_size += WireFormatLite::UInt32Size(
         input_t(static_cast<int64>(message_index), i));
@@ -182,6 +193,12 @@ size_t TotalPackedSize<WireFormatLite::TYPE_SFIXED32, int32>(
 }
 
 template <>
+size_t TotalPackedSize<WireFormatLite::TYPE_SFIXED32, int64>(
+    const Tensor& input, int message_index, int size) {
+  return size * WireFormatLite::kSFixed32Size;
+}
+
+template <>
 size_t TotalPackedSize<WireFormatLite::TYPE_SFIXED64, int64>(
     const Tensor& input, int message_index, int size) {
   return size * WireFormatLite::kSFixed64Size;
@@ -201,6 +218,19 @@ size_t TotalPackedSize<WireFormatLite::TYPE_SINT32, int32>(const Tensor& input,
 }
 
 template <>
+size_t TotalPackedSize<WireFormatLite::TYPE_SINT32, int64>(const Tensor& input,
+                                                           int message_index,
+                                                           int size) {
+  size_t data_size = 0;
+  auto input_t = input.flat_inner_dims<int64>();
+  for (int64 i = 0; i < size; i++) {
+    data_size += WireFormatLite::SInt32Size(
+        input_t(static_cast<int64>(message_index), i));
+  }
+  return data_size;
+}
+
+template <>
 size_t TotalPackedSize<WireFormatLite::TYPE_SINT64, int64>(const Tensor& input,
                                                            int message_index,
                                                            int size) {
@@ -213,14 +243,13 @@ size_t TotalPackedSize<WireFormatLite::TYPE_SINT64, int64>(const Tensor& input,
   return data_size;
 }
 
-// Writes a possibly repeated primitive field.
-// TensorFlow does not have unsigned types, so we decode them to signed and
-// encode them back to unsigned.
+// Writes a possibly repeated primitive field. TensorFlow does not have unsigned
+// types, so we decode them to signed and encode them back to unsigned.
 template <typename TensorT, typename ProtoT,
           WireFormatLite::FieldType FieldType,
           void Writer(ProtoT, CodedOutputStream*)>
-void WriteField(const FieldDescriptor& field_desc, const Tensor& input,
-                int message_index, int size, CodedOutputStream* output) {
+Status WriteField(const FieldDescriptor& field_desc, const Tensor& input,
+                  int message_index, int size, CodedOutputStream* output) {
   auto wire_type = WireFormatLite::WireTypeForFieldType(
       WireFormatLite::FieldType(field_desc.type()));
 
@@ -250,12 +279,14 @@ void WriteField(const FieldDescriptor& field_desc, const Tensor& input,
       Writer(value, output);
     }
   }
+  return Status::OK();
 }
 
 // Writes a possibly repeated string, bytes, or message field.
 template <typename T, void Writer(int, const T&, CodedOutputStream*)>
-void WriteVarLenField(const FieldDescriptor& field_desc, const Tensor& input,
-                      int message_index, int size, CodedOutputStream* output) {
+Status WriteVarLenField(const FieldDescriptor& field_desc, const Tensor& input,
+                        int message_index, int size,
+                        CodedOutputStream* output) {
   auto input_t = input.flat_inner_dims<T>();
   for (int64 i = 0; i < size; i++) {
     const T& value = input_t(static_cast<int64>(message_index), i);
@@ -264,14 +295,14 @@ void WriteVarLenField(const FieldDescriptor& field_desc, const Tensor& input,
     // small speedup.
     Writer(field_desc.number(), value, output);
   }
+  return Status::OK();
 }
 
-// Writes a group field.
-// Groups are treated like submessages, but tag-delimited
-// instead of length-delimited. WireFormatLite handles this
-// differently so we code it ourselves.
-void WriteGroup(const FieldDescriptor& field_desc, const Tensor& input,
-                int message_index, int size, CodedOutputStream* output) {
+// Writes a group field. Groups are treated like submessages, but tag-delimited
+// instead of length-delimited. WireFormatLite handles this differently so we
+// code it ourselves.
+Status WriteGroup(const FieldDescriptor& field_desc, const Tensor& input,
+                  int message_index, int size, CodedOutputStream* output) {
   auto input_t = input.flat_inner_dims<string>();
   for (int64 i = 0; i < size; i++) {
     const string& value = input_t(static_cast<int64>(message_index), i);
@@ -282,16 +313,16 @@ void WriteGroup(const FieldDescriptor& field_desc, const Tensor& input,
     WireFormatLite::WriteTag(field_desc.number(),
                              WireFormatLite::WIRETYPE_END_GROUP, output);
   }
+  return Status::OK();
 }
 
-// Writes a (possibly repeated) field into an output stream.
-// It is the caller's responsibility to ensure that the type of
-// the input tensor is compatible with the type of the proto
-// field descriptor, and that (message_index, size-1) is within
-// bounds.
-void WriteField(const FieldDescriptor& field_desc, const Tensor& input,
-                int message_index, int size, CodedOutputStream* output) {
-  DataType tf_type = input.dtype();
+// Writes a (possibly repeated) field into an output stream. It is the caller's
+// responsibility to ensure that the type of the input tensor is compatible with
+// the type of the proto field descriptor, and that (message_index, size-1) is
+// within bounds.
+Status WriteField(const FieldDescriptor& field_desc, const Tensor& input,
+                  int message_index, int size, CodedOutputStream* output) {
+  DataType dtype = input.dtype();
 
   switch (field_desc.type()) {
     case WireFormatLite::TYPE_DOUBLE:
@@ -299,7 +330,7 @@ void WriteField(const FieldDescriptor& field_desc, const Tensor& input,
                         WireFormatLite::WriteDoubleNoTag>(
           field_desc, input, message_index, size, output);
     case WireFormatLite::TYPE_FLOAT:
-      switch (tf_type) {
+      switch (dtype) {
         case DataType::DT_FLOAT:
           return WriteField<float, float, WireFormatLite::TYPE_FLOAT,
                             WireFormatLite::WriteFloatNoTag>(
@@ -309,36 +340,48 @@ void WriteField(const FieldDescriptor& field_desc, const Tensor& input,
                             WireFormatLite::WriteFloatNoTag>(
               field_desc, input, message_index, size, output);
         default:
-          return;
+          return errors::DataLoss("Failed writing TYPE_FLOAT for ",
+                                  DataTypeString(dtype));
       }
     case WireFormatLite::TYPE_INT64:
       return WriteField<int64, protobuf_int64, WireFormatLite::TYPE_INT64,
                         WireFormatLite::WriteInt64NoTag>(
           field_desc, input, message_index, size, output);
     case WireFormatLite::TYPE_UINT64:
-      return WriteField<int64, protobuf_uint64, WireFormatLite::TYPE_UINT64,
+      return WriteField<uint64, protobuf_uint64, WireFormatLite::TYPE_UINT64,
                         WireFormatLite::WriteUInt64NoTag>(
           field_desc, input, message_index, size, output);
     case WireFormatLite::TYPE_INT32:
-      return WriteField<int32, int32, WireFormatLite::TYPE_INT32,
-                        WireFormatLite::WriteInt32NoTag>(
-          field_desc, input, message_index, size, output);
+      switch (dtype) {
+        case DataType::DT_INT64:
+          return WriteField<int64, int32, WireFormatLite::TYPE_INT32,
+                            WireFormatLite::WriteInt32NoTag>(
+              field_desc, input, message_index, size, output);
+        case DataType::DT_INT32:
+          return WriteField<int32, int32, WireFormatLite::TYPE_INT32,
+                            WireFormatLite::WriteInt32NoTag>(
+              field_desc, input, message_index, size, output);
+        default:
+          return errors::DataLoss("Failed writing TYPE_INT32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_FIXED64:
-      return WriteField<int64, protobuf_uint64, WireFormatLite::TYPE_FIXED64,
+      return WriteField<uint64, protobuf_uint64, WireFormatLite::TYPE_FIXED64,
                         WireFormatLite::WriteFixed64NoTag>(
           field_desc, input, message_index, size, output);
     case WireFormatLite::TYPE_FIXED32:
-      switch (tf_type) {
-        case DataType::DT_INT64:
-          return WriteField<int64, uint32, WireFormatLite::TYPE_FIXED32,
+      switch (dtype) {
+        case DataType::DT_UINT64:
+          return WriteField<uint64, uint32, WireFormatLite::TYPE_FIXED32,
                             WireFormatLite::WriteFixed32NoTag>(
               field_desc, input, message_index, size, output);
-        case DataType::DT_INT32:
-          return WriteField<int32, uint32, WireFormatLite::TYPE_FIXED32,
+        case DataType::DT_UINT32:
+          return WriteField<uint32, uint32, WireFormatLite::TYPE_FIXED32,
                             WireFormatLite::WriteFixed32NoTag>(
               field_desc, input, message_index, size, output);
         default:
-          return;
+          return errors::DataLoss("Failed writing TYPE_FIXED32 for ",
+                                  DataTypeString(dtype));
       }
     case WireFormatLite::TYPE_BOOL:
       return WriteField<bool, bool, WireFormatLite::TYPE_BOOL,
@@ -356,34 +399,55 @@ void WriteField(const FieldDescriptor& field_desc, const Tensor& input,
       return WriteVarLenField<string, WireFormatLite::WriteBytes>(
           field_desc, input, message_index, size, output);
     case WireFormatLite::TYPE_UINT32:
-      switch (tf_type) {
-        case DataType::DT_INT64:
-          return WriteField<int64, uint32, WireFormatLite::TYPE_UINT32,
+      switch (dtype) {
+        case DataType::DT_UINT64:
+          return WriteField<uint64, uint32, WireFormatLite::TYPE_UINT32,
                             WireFormatLite::WriteUInt32NoTag>(
               field_desc, input, message_index, size, output);
-        case DataType::DT_INT32:
-          return WriteField<int32, uint32, WireFormatLite::TYPE_UINT32,
+        case DataType::DT_UINT32:
+          return WriteField<uint32, uint32, WireFormatLite::TYPE_UINT32,
                             WireFormatLite::WriteUInt32NoTag>(
               field_desc, input, message_index, size, output);
         default:
-          return;
+          return errors::DataLoss("Failed writing TYPE_UINT32 for ",
+                                  DataTypeString(dtype));
       }
     case WireFormatLite::TYPE_ENUM:
       return WriteField<int32, int32, WireFormatLite::TYPE_ENUM,
                         WireFormatLite::WriteEnumNoTag>(
           field_desc, input, message_index, size, output);
     case WireFormatLite::TYPE_SFIXED32:
-      return WriteField<int32, int32, WireFormatLite::TYPE_SFIXED32,
-                        WireFormatLite::WriteSFixed32NoTag>(
-          field_desc, input, message_index, size, output);
+      switch (dtype) {
+        case DataType::DT_INT64:
+          return WriteField<int64, int32, WireFormatLite::TYPE_SFIXED32,
+                            WireFormatLite::WriteSFixed32NoTag>(
+              field_desc, input, message_index, size, output);
+        case DataType::DT_INT32:
+          return WriteField<int32, int32, WireFormatLite::TYPE_SFIXED32,
+                            WireFormatLite::WriteSFixed32NoTag>(
+              field_desc, input, message_index, size, output);
+        default:
+          return errors::DataLoss("Failed writing TYPE_SFIXED32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_SFIXED64:
       return WriteField<int64, protobuf_int64, WireFormatLite::TYPE_SFIXED64,
                         WireFormatLite::WriteSFixed64NoTag>(
           field_desc, input, message_index, size, output);
     case WireFormatLite::TYPE_SINT32:
-      return WriteField<int32, int32, WireFormatLite::TYPE_SINT32,
-                        WireFormatLite::WriteSInt32NoTag>(
-          field_desc, input, message_index, size, output);
+      switch (dtype) {
+        case DataType::DT_INT64:
+          return WriteField<int64, int32, WireFormatLite::TYPE_SINT32,
+                            WireFormatLite::WriteSInt32NoTag>(
+              field_desc, input, message_index, size, output);
+        case DataType::DT_INT32:
+          return WriteField<int32, int32, WireFormatLite::TYPE_SINT32,
+                            WireFormatLite::WriteSInt32NoTag>(
+              field_desc, input, message_index, size, output);
+        default:
+          return errors::DataLoss("Failed writing TYPE_SINT32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_SINT64:
       return WriteField<int64, protobuf_int64, WireFormatLite::TYPE_SINT64,
                         WireFormatLite::WriteSInt64NoTag>(
@@ -392,42 +456,6 @@ void WriteField(const FieldDescriptor& field_desc, const Tensor& input,
   }
 }
 
-// Checks that a Protobuf field is compatible with a TensorFlow datatype.
-// This is separated from WriteField to lift it out of the inner loop.
-bool IsCompatibleType(const FieldDescriptor& field_desc, DataType tf_type) {
-  switch (field_desc.type()) {
-    case WireFormatLite::TYPE_DOUBLE:
-      return tf_type == DataType::DT_DOUBLE;
-    case WireFormatLite::TYPE_FLOAT:
-      return tf_type == DataType::DT_FLOAT || tf_type == DataType::DT_DOUBLE;
-    case WireFormatLite::TYPE_INT64:
-    case WireFormatLite::TYPE_SFIXED64:
-    case WireFormatLite::TYPE_SINT64:
-      return tf_type == DataType::DT_INT64;
-    case WireFormatLite::TYPE_UINT64:
-      return tf_type == DataType::DT_INT64;
-    case WireFormatLite::TYPE_INT32:
-    case WireFormatLite::TYPE_ENUM:
-    case WireFormatLite::TYPE_SFIXED32:
-    case WireFormatLite::TYPE_SINT32:
-      return tf_type == DataType::DT_INT32;
-    case WireFormatLite::TYPE_FIXED64:
-      return tf_type == DataType::DT_INT64;
-    case WireFormatLite::TYPE_FIXED32:
-    case WireFormatLite::TYPE_UINT32:
-      return tf_type == DataType::DT_INT64 || tf_type == DataType::DT_INT32;
-    case WireFormatLite::TYPE_BOOL:
-      return tf_type == DataType::DT_BOOL;
-    case WireFormatLite::TYPE_STRING:
-    case WireFormatLite::TYPE_GROUP:
-    case WireFormatLite::TYPE_MESSAGE:
-    case WireFormatLite::TYPE_BYTES:
-      return tf_type == DataType::DT_STRING;
-      // default: intentionally omitted in order to enable static checking.
-  }
-  return false;
-}
-
 class EncodeProtoOp : public OpKernel {
  public:
   explicit EncodeProtoOp(OpKernelConstruction* context) : OpKernel(context) {
@@ -475,14 +503,14 @@ class EncodeProtoOp : public OpKernel {
               });
   }
 
-  void Compute(OpKernelContext* cx) override {
+  void Compute(OpKernelContext* ctx) override {
     const Tensor* sizes_tensor;
-    OP_REQUIRES_OK(cx, cx->input("sizes", &sizes_tensor));
+    OP_REQUIRES_OK(ctx, ctx->input("sizes", &sizes_tensor));
 
     OpInputList values;
-    OP_REQUIRES_OK(cx, cx->input_list("values", &values));
+    OP_REQUIRES_OK(ctx, ctx->input_list("values", &values));
 
-    OP_REQUIRES(cx, field_descs_.size() == values.size(),
+    OP_REQUIRES(ctx, field_descs_.size() == values.size(),
                 errors::InvalidArgument(
                     "Length of inputs list must match field_names"));
 
@@ -493,12 +521,14 @@ class EncodeProtoOp : public OpKernel {
       const Tensor& v = values[i];
 
       // The type of each value tensor must match the corresponding field.
-      OP_REQUIRES(cx, IsCompatibleType(*field_descs_[i], v.dtype()),
-                  errors::InvalidArgument(
-                      "Incompatible type for field " + field_names_[i] +
-                          ".  Saw dtype: ",
-                      DataTypeString(v.dtype()),
-                      " but field type is: ", field_descs_[i]->type_name()));
+      OP_REQUIRES(
+          ctx,
+          proto_utils::IsCompatibleType(field_descs_[i]->type(), v.dtype()),
+          errors::InvalidArgument(
+              "Incompatible type for field " + field_names_[i] +
+                  ".  Saw dtype: ",
+              DataTypeString(v.dtype()),
+              " but field type is: ", field_descs_[i]->type_name()));
 
       // All value tensors must have the same shape prefix (i.e. batch size).
       TensorShape shape_prefix = v.shape();
@@ -507,14 +537,14 @@ class EncodeProtoOp : public OpKernel {
       // Do some initialization on the first input value. The rest will
       // have to match this one.
       if (i == 0) {
-        OP_REQUIRES(cx, v.dims() >= 1,
+        OP_REQUIRES(ctx, v.dims() >= 1,
                     errors::InvalidArgument(
                         "Expected value to be at least a vector, saw shape: ",
                         v.shape().DebugString()));
         common_prefix = shape_prefix;
         message_count = common_prefix.num_elements();
       } else {
-        OP_REQUIRES(cx, shape_prefix == common_prefix,
+        OP_REQUIRES(ctx, shape_prefix == common_prefix,
                     errors::InvalidArgument(
                         "Values must match up to the last dimension"));
       }
@@ -523,7 +553,7 @@ class EncodeProtoOp : public OpKernel {
     TensorShape expected_sizes_shape = common_prefix;
     expected_sizes_shape.AddDim(field_descs_.size());
 
-    OP_REQUIRES(cx, sizes_tensor->shape() == expected_sizes_shape,
+    OP_REQUIRES(ctx, sizes_tensor->shape() == expected_sizes_shape,
                 errors::InvalidArgument(
                     "sizes should be batch_size + [len(field_names)].  Saw: ",
                     sizes_tensor->shape().DebugString(),
@@ -536,12 +566,11 @@ class EncodeProtoOp : public OpKernel {
       int max_size = v.dim_size(v.dims() - 1);
 
       // The last dimension of a value tensor must be greater than the
-      // corresponding
-      // size in the sizes tensor.
+      // corresponding size in the sizes tensor.
       for (int message_index = 0; message_index < message_count;
            message_index++) {
         OP_REQUIRES(
-            cx, sizes(message_index, i) <= max_size,
+            ctx, sizes(message_index, i) <= max_size,
             errors::InvalidArgument(
                 "Size to write must not be larger than value tensor; but saw: ",
                 sizes(message_index, i), " > ", max_size, " at message ",
@@ -551,13 +580,13 @@ class EncodeProtoOp : public OpKernel {
 
     // This pointer is owned by the context.
     Tensor* output_tensor;
-    OP_REQUIRES_OK(cx, cx->allocate_output(0, common_prefix, &output_tensor));
+    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, common_prefix, &output_tensor));
 
     auto bufs = output_tensor->flat<string>();
     for (int message_index = 0; message_index < message_count;
          message_index++) {
       // TODO(nix): possibly optimize allocation here by calling
-      //   bufs(message_index).reserve(DEFAULT_BUF_SIZE);
+      // `bufs(message_index).reserve(DEFAULT_BUF_SIZE)`.
       StringOutputStream output_string(&bufs(message_index));
       CodedOutputStream out(&output_string);
       // Write fields in ascending field_number order.
@@ -566,7 +595,8 @@ class EncodeProtoOp : public OpKernel {
         const Tensor& v = values[i];
         int size = sizes(message_index, i);
         if (!size) continue;
-        WriteField(field_desc, v, message_index, size, &out);
+        OP_REQUIRES_OK(ctx,
+                       WriteField(field_desc, v, message_index, size, &out));
       }
     }
   }
@@ -578,8 +608,8 @@ class EncodeProtoOp : public OpKernel {
   // Owned_desc_pool_ is null when using descriptor_source=local.
   std::unique_ptr<DescriptorPool> owned_desc_pool_;
 
-  // Contains indices into field_names_, sorted by field number since
-  // that's the order of writing.
+  // Contains indices into field_names_, sorted by field number since that's the
+  // order of writing.
   std::vector<int> sorted_field_index_;
 
   TF_DISALLOW_COPY_AND_ASSIGN(EncodeProtoOp);
diff --git a/tensorflow/core/kernels/function_ops.cc b/tensorflow/core/kernels/function_ops.cc
index e269aa3a78..d5c33c0188 100644
--- a/tensorflow/core/kernels/function_ops.cc
+++ b/tensorflow/core/kernels/function_ops.cc
@@ -135,6 +135,12 @@ REGISTER_KERNEL_BUILDER(Name(kArgOp)
                             .TypeConstraint<ResourceHandle>("T"),
                         ArgOp);
 
+REGISTER_KERNEL_BUILDER(Name(kArgOp)
+                            .Device(DEVICE_GPU)
+                            .HostMemory("output")
+                            .TypeConstraint<string>("T"),
+                        ArgOp);
+
 #define REGISTER(type)     \
   REGISTER_KERNEL_BUILDER( \
       Name(kRetOp).Device(DEVICE_GPU).TypeConstraint<type>("T"), RetvalOp);
@@ -149,6 +155,12 @@ REGISTER_KERNEL_BUILDER(Name(kRetOp)
                             .TypeConstraint<ResourceHandle>("T")
                             .HostMemory("input"),
                         RetvalOp);
+
+REGISTER_KERNEL_BUILDER(Name(kRetOp)
+                            .Device(DEVICE_GPU)
+                            .TypeConstraint<string>("T")
+                            .HostMemory("input"),
+                        RetvalOp);
 #undef REGISTER
 
 class PassOn : public OpKernel {
diff --git a/tensorflow/core/kernels/functional_ops.cc b/tensorflow/core/kernels/functional_ops.cc
index 519c475332..1c0abf26cd 100644
--- a/tensorflow/core/kernels/functional_ops.cc
+++ b/tensorflow/core/kernels/functional_ops.cc
@@ -127,31 +127,47 @@ class IfOp : public AsyncOpKernel {
   explicit IfOp(OpKernelConstruction* ctx) : AsyncOpKernel(ctx) {
     auto lib = ctx->function_library();
     OP_REQUIRES(ctx, lib != nullptr, errors::Internal("No function library"));
-    const NameAttrList* func;
-    OP_REQUIRES_OK(ctx, ctx->GetAttr("then_branch", &func));
-    OP_REQUIRES_OK(ctx, Instantiate(lib, *func, &then_handle_));
-    OP_REQUIRES_OK(ctx, ctx->GetAttr("else_branch", &func));
-    OP_REQUIRES_OK(ctx, Instantiate(lib, *func, &else_handle_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("then_branch", &then_func_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("else_branch", &else_func_));
   }
 
   ~IfOp() override {}
 
   void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
+    auto lib = ctx->function_library();
+    OP_REQUIRES_ASYNC(ctx, lib != nullptr,
+                      errors::Internal("No function library"), done);
+
+    // TODO(b/37549631): Because this op has `SetIsStateful()` in its op
+    // registration, this kernel may be shared by multiple subgraphs, which have
+    // different associated `FunctionLibraryRuntime` objects and hence different
+    // `FHandle` namespaces. So we must call Instantiate() to make sure we get
+    // the correct function handles with respect to `lib`. Note the underlying
+    // `lib->Instantiate()` caches the created function handles, so calling
+    // `Instantiate()` repeatedly on the same `lib` and function is cheap.
+    FHandle then_handle;
+    FHandle else_handle;
+    OP_REQUIRES_OK_ASYNC(ctx, Instantiate(lib, then_func_, &then_handle), done);
+    OP_REQUIRES_OK_ASYNC(ctx, Instantiate(lib, else_func_, &else_handle), done);
+
     bool cond;
     OP_REQUIRES_OK(ctx, ToBool({ctx->input(0)}, &cond));
-    (new State(this, ctx, cond, done))->Start();
+    (new State(this, ctx, cond, then_handle, else_handle, done))->Start();
   }
 
  private:
-  FHandle then_handle_;
-  FHandle else_handle_;
+  NameAttrList then_func_;
+  NameAttrList else_func_;
 
   class State {
    public:
-    State(IfOp* kernel, OpKernelContext* ctx, bool cond, DoneCallback done)
+    State(IfOp* kernel, OpKernelContext* ctx, bool cond, FHandle then_handle,
+          FHandle else_handle, DoneCallback done)
         : kernel_(kernel),
           ctx_(ctx),
           cond_(cond),
+          then_handle_(then_handle),
+          else_handle_(else_handle),
           done_(std::move(done)),
           lib_(CHECK_NOTNULL(ctx_->function_library())) {
       SetRunOptions(ctx_, &opts_, true /* always_collect_stats */);
@@ -163,7 +179,7 @@ class IfOp : public AsyncOpKernel {
     ~State() {}
 
     void Start() {
-      FHandle handle = cond_ ? kernel_->then_handle_ : kernel_->else_handle_;
+      FHandle handle = cond_ ? then_handle_ : else_handle_;
       rets_.clear();
       lib_->Run(
           // Evaluate one of the branch.
@@ -184,6 +200,8 @@ class IfOp : public AsyncOpKernel {
     IfOp* const kernel_;
     OpKernelContext* const ctx_;
     const bool cond_;
+    FHandle then_handle_;
+    FHandle else_handle_;
     DoneCallback done_;
     FunctionLibraryRuntime* const lib_;
     FunctionLibraryRuntime::Options opts_;
@@ -214,30 +232,17 @@ class WhileOp : public AsyncOpKernel {
     OP_REQUIRES_ASYNC(ctx, lib != nullptr,
                       errors::Internal("No function library"), done);
 
-    // TODO(b/37549631): Because this op has `SetIsStateful()` in its
-    // op registration, this kernel may be shared by multiple
-    // subgraphs, which have different associated
-    // `FunctionLibraryRuntime` objects and hence different `FHandle`
-    // namespaces. We currently work around this by caching the map
-    // from `FunctionLibraryRuntime*` to `FHandle` pairs for the two
-    // functions this op uses.
+    // TODO(b/37549631): Because this op has `SetIsStateful()` in its op
+    // registration, this kernel may be shared by multiple subgraphs, which have
+    // different associated `FunctionLibraryRuntime` objects and hence different
+    // `FHandle` namespaces. So we must call Instantiate() to make sure we get
+    // the correct function handles with respect to `lib`. Note the underlying
+    // `lib->Instantiate()` caches the created function handles, so calling
+    // `Instantiate()` repeatedly on the same `lib` and function is cheap.
     FHandle cond_handle;
     FHandle body_handle;
-    {
-      mutex_lock l(mu_);
-      const auto iter = handles_.find(lib);
-      if (iter == handles_.end()) {
-        OP_REQUIRES_OK_ASYNC(ctx, Instantiate(lib, cond_func_, &cond_handle),
-                             done);
-        OP_REQUIRES_OK_ASYNC(ctx, Instantiate(lib, body_func_, &body_handle),
-                             done);
-        handles_[lib] = {cond_handle, body_handle};
-      } else {
-        cond_handle = iter->second.first;
-        body_handle = iter->second.second;
-      }
-    }
-
+    OP_REQUIRES_OK_ASYNC(ctx, Instantiate(lib, cond_func_, &cond_handle), done);
+    OP_REQUIRES_OK_ASYNC(ctx, Instantiate(lib, body_func_, &body_handle), done);
     (new State(this, ctx, cond_handle, body_handle, done))->Start();
   }
 
@@ -245,10 +250,6 @@ class WhileOp : public AsyncOpKernel {
   NameAttrList cond_func_;
   NameAttrList body_func_;
 
-  mutex mu_;
-  std::unordered_map<FunctionLibraryRuntime*, std::pair<FHandle, FHandle>>
-      handles_ GUARDED_BY(mu_);
-
   class State {
    public:
     State(WhileOp* kernel, OpKernelContext* ctx, FHandle cond_handle,
@@ -536,6 +537,7 @@ class FakeParamOp : public OpKernel {
 };
 
 REGISTER_KERNEL_BUILDER(Name("FakeParam").Device(DEVICE_CPU), FakeParamOp);
+REGISTER_KERNEL_BUILDER(Name("FakeParam").Device(DEVICE_GPU), FakeParamOp);
 
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/fused_batch_norm_op.cc b/tensorflow/core/kernels/fused_batch_norm_op.cc
index f99dd643f7..d89f1592bd 100644
--- a/tensorflow/core/kernels/fused_batch_norm_op.cc
+++ b/tensorflow/core/kernels/fused_batch_norm_op.cc
@@ -45,6 +45,24 @@ struct FusedBatchNorm;
 template <typename Device, typename T, typename U>
 struct FusedBatchNormGrad;
 
+template <bool IsSame, typename Y, typename X, typename T>
+struct CastIfNecessary {
+  static inline void process(
+      Y& y, X& x_shifted, const Eigen::DSizes<Eigen::Index, 2>& rest_by_depth,
+      const CPUDevice& d) {
+    y.reshape(rest_by_depth).device(d) = x_shifted.template cast<T>();
+  }
+};
+
+template <typename Y, typename X, typename T>
+struct CastIfNecessary<true, Y, X, T> {
+  static inline void process(
+      Y& y, X& x_shifted, const Eigen::DSizes<Eigen::Index, 2>& rest_by_depth,
+      const CPUDevice& d) {
+    y.reshape(rest_by_depth).device(d) = x_shifted;
+  }
+};
+
 template <typename T, typename U>
 struct FusedBatchNorm<CPUDevice, T, U> {
   void operator()(OpKernelContext* context, const Tensor& x_input,
@@ -125,7 +143,11 @@ struct FusedBatchNorm<CPUDevice, T, U> {
     auto x_shifted =
         x_scaled + offset.reshape(one_by_depth).broadcast(bcast_spec);
 
-    y.reshape(rest_by_depth).device(d) = x_shifted.template cast<T>();
+    // Explicitly checks the types of T and U and only casts x_shifted when
+    // T != U. (Not doing so caused a 35-50% performance slowdown for
+    // some compiler flags.)
+    CastIfNecessary<std::is_same<T, U>::value, decltype(y), decltype(x_shifted),
+                    T>::process(y, x_shifted, rest_by_depth, d);
   }
 };
 
diff --git a/tensorflow/core/kernels/gather_nd_op.cc b/tensorflow/core/kernels/gather_nd_op.cc
index 4e53291b7f..e50b7fe3bf 100644
--- a/tensorflow/core/kernels/gather_nd_op.cc
+++ b/tensorflow/core/kernels/gather_nd_op.cc
@@ -188,12 +188,13 @@ Status DoGatherNd(OpKernelContext* c, const Tensor& params,
 
     // bad_i will only return >= 0 on CPUs right now.
     if (bad_i >= 0) {
+      auto shape = indices.shape();
+      shape.RemoveLastDims(1);
       return errors::InvalidArgument(
-          "flat indices[", bad_i, ", :] = [",
+          "indices", SliceDebugString(shape, bad_i), " = [",
           str_util::Join(
               gtl::ArraySlice<Index>(&indices_mat(bad_i, 0), indices_nd), ", "),
-          "] does not index into param (shape: ", params.shape().DebugString(),
-          ").");
+          "] does not index into param shape ", params.shape().DebugString());
     }
   }
   return Status::OK();
diff --git a/tensorflow/core/kernels/identity_op.cc b/tensorflow/core/kernels/identity_op.cc
index dffb4d7171..6f79729883 100644
--- a/tensorflow/core/kernels/identity_op.cc
+++ b/tensorflow/core/kernels/identity_op.cc
@@ -145,6 +145,7 @@ REGISTER_GPU_KERNEL(Variant);
 REGISTER_GPU_HOST_KERNEL(int32);
 REGISTER_GPU_HOST_KERNEL(bool);
 REGISTER_GPU_HOST_KERNEL(string);
+REGISTER_GPU_HOST_KERNEL(ResourceHandle);
 
 #undef REGISTER_GPU_HOST_KERNEL
 
diff --git a/tensorflow/core/kernels/initializable_lookup_table.h b/tensorflow/core/kernels/initializable_lookup_table.h
index 990cbceac2..b4f81d9a70 100644
--- a/tensorflow/core/kernels/initializable_lookup_table.h
+++ b/tensorflow/core/kernels/initializable_lookup_table.h
@@ -51,7 +51,7 @@ class InitializableLookupTable : public LookupInterface {
         "Insert not supported by InitializableLookupTable implementations");
   }
 
-  Status ExportValues(OpKernelContext* context) {
+  Status ExportValues(OpKernelContext* context) override {
     return errors::Unimplemented(
         "ExportValues not supported by InitializableLookupTable "
         "implementations");
diff --git a/tensorflow/core/kernels/lookup_table_op.cc b/tensorflow/core/kernels/lookup_table_op.cc
index 57b7798ba0..07e754a6ef 100644
--- a/tensorflow/core/kernels/lookup_table_op.cc
+++ b/tensorflow/core/kernels/lookup_table_op.cc
@@ -822,6 +822,7 @@ REGISTER_KERNEL(int64, float);
 REGISTER_KERNEL(string, string);
 REGISTER_KERNEL(string, bool);
 REGISTER_KERNEL(int32, int32);
+REGISTER_KERNEL(int32, string);
 
 #undef REGISTER_KERNEL
 
diff --git a/tensorflow/core/kernels/matmul_op.cc b/tensorflow/core/kernels/matmul_op.cc
index b596dbc782..80376c61aa 100644
--- a/tensorflow/core/kernels/matmul_op.cc
+++ b/tensorflow/core/kernels/matmul_op.cc
@@ -453,10 +453,14 @@ class MatMulOp : public OpKernel {
     const Tensor& b = ctx->input(1);
 
     // Check that the dimensions of the two matrices are valid.
-    OP_REQUIRES(ctx, TensorShapeUtils::IsMatrix(a.shape()),
-                errors::InvalidArgument("In[0] is not a matrix"));
-    OP_REQUIRES(ctx, TensorShapeUtils::IsMatrix(b.shape()),
-                errors::InvalidArgument("In[1] is not a matrix"));
+    OP_REQUIRES(
+        ctx, TensorShapeUtils::IsMatrix(a.shape()),
+        errors::InvalidArgument("In[0] is not a matrix. Instead it has shape ",
+                                a.shape().DebugString()));
+    OP_REQUIRES(
+        ctx, TensorShapeUtils::IsMatrix(b.shape()),
+        errors::InvalidArgument("In[1] is not a matrix. Instead it has shape ",
+                                b.shape().DebugString()));
     Eigen::array<Eigen::IndexPair<Eigen::DenseIndex>, 1> dim_pair;
     dim_pair[0].first = transpose_a_ ? 0 : 1;
     dim_pair[0].second = transpose_b_ ? 1 : 0;
diff --git a/tensorflow/core/kernels/mkl_aggregate_ops.cc b/tensorflow/core/kernels/mkl_aggregate_ops.cc
index 4ad858e4a9..3d04aeeb3e 100644
--- a/tensorflow/core/kernels/mkl_aggregate_ops.cc
+++ b/tensorflow/core/kernels/mkl_aggregate_ops.cc
@@ -445,11 +445,10 @@ class MklAddNOp : public OpKernel {
       // atleast one input is in MKL format, we choose output descriptor for
       // reorder.
       std::vector<primitive::at> inputs;
-      std::vector<primitive> net;
       // Check if actual input format of the tensor is different than common_pd
       // we told MKLDNN. In that case, we will need reorder.
-      src1.CheckReorderToOpMem(srcs_pd[0], &net);
-      src2.CheckReorderToOpMem(srcs_pd[1], &net);
+      src1.CheckReorderToOpMem(srcs_pd[0]);
+      src2.CheckReorderToOpMem(srcs_pd[1]);
       inputs.push_back(src1.GetOpMem());
       inputs.push_back(src2.GetOpMem());
 
@@ -482,6 +481,7 @@ class MklAddNOp : public OpKernel {
       dst.SetUsrMemDataHandle(dst_tensor);
 
       // Create Sum op, and submit net for execution.
+      std::vector<primitive> net;
       net.push_back(sum(sum_pd, inputs, dst.GetOpMem()));
       stream(stream::kind::eager).submit(net).wait();
     } catch (mkldnn::error& e) {
diff --git a/tensorflow/core/kernels/mkl_concat_op.cc b/tensorflow/core/kernels/mkl_concat_op.cc
index d054f0d404..d8efb1be3e 100644
--- a/tensorflow/core/kernels/mkl_concat_op.cc
+++ b/tensorflow/core/kernels/mkl_concat_op.cc
@@ -308,11 +308,9 @@ class MklConcatOp : public OpKernel {
     }
 
     if (invoke_eigen) {
-      string msg = std::string("Invoking Eigen version of Concat. Reason:") +
-                   (!is_concat_dim_channel
-                        ? std::string("Concat dimension is not channel")
-                        : std::string("Not all tensors are in Mkl layout"));
-      VLOG(1) << "_MklConcatOp: " << msg;
+      VLOG(1) << "_MklConcatOp: Invoking Eigen version of Concat. Reason:"
+              << (!is_concat_dim_channel ? "Concat dimension is not channel"
+                                         : "Not all tensors are in Mkl layout");
       CallEigenVersion(context, input_tensors, input_shapes);
       return;
     }
@@ -756,11 +754,10 @@ class MklConcatOp : public OpKernel {
       }
 
       std::vector<primitive::at> inputs;
-      std::vector<primitive> net;
       if (isMklReorderNeeded) {
         for (int k = 0; k < input_tensors.size(); k++) {
           if (input_tensors[k].NumElements() > 0) {
-            srcs[k].CheckReorderToOpMem(srcs_pd[k], &net);
+            srcs[k].CheckReorderToOpMem(srcs_pd[k]);
           }
         }
       }
@@ -806,6 +803,7 @@ class MklConcatOp : public OpKernel {
       dst.SetUsrMem(dst_md, dst_tensor);
 
       auto concat_op = concat(concat_pd, inputs, dst.GetOpMem());
+      std::vector<primitive> net;
       net.push_back(concat_op);
       stream(stream::kind::eager).submit(net).wait();
     } catch (mkldnn::error& e) {
diff --git a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
index 356eed8b67..b73a119a88 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
@@ -54,9 +54,310 @@ using mkldnn::stream;
 #include "tensorflow/core/util/mkl_util.h"
 
 namespace tensorflow {
-
 typedef Eigen::ThreadPoolDevice CPUDevice;
 
+#ifndef INTEL_MKL_ML
+
+struct MklConvBwdFilterParams {
+  memory::dims src_dims;
+  memory::dims diff_filter_dims;
+  memory::dims diff_bias_dims;
+  memory::dims diff_dst_dims;
+  memory::dims strides;
+  memory::dims dilations;
+  memory::dims padding_left;
+  memory::dims padding_right;
+  padding_kind padding;
+
+  MklConvBwdFilterParams(memory::dims src_dims,
+    memory::dims diff_filter_dims, memory::dims diff_bias_dims,
+    memory::dims diff_dst_dims, memory::dims strides,
+    memory::dims dilations, memory::dims padding_left,
+    memory::dims padding_right, padding_kind padding) :
+      src_dims(src_dims), diff_filter_dims(diff_filter_dims),
+      diff_bias_dims(diff_bias_dims), diff_dst_dims(diff_dst_dims),
+      strides(strides), dilations(dilations),
+      padding_left(padding_left), padding_right(padding_right),
+      padding(padding) {
+  }
+};
+
+template <typename T>
+class MklConv2DBwdFilterPrimitive : public MklPrimitive {
+ public:
+  explicit MklConv2DBwdFilterPrimitive(
+      const MklConvBwdFilterParams& convBwdFilterDims) :
+            cpu_engine_(engine::cpu, 0) {
+    context_.bwd_filter_stream.reset(new stream(stream::kind::eager));
+    // create conv primitive
+    if (context_.conv_bwd_filter == nullptr) {
+      Setup(convBwdFilterDims);
+    }
+  }
+
+  ~MklConv2DBwdFilterPrimitive() {}
+
+  // Convolution backward weights with bias
+  //   src_data:         input data buffer of src
+  //   diff_filter_data: output data buffer of diff_filter
+  //   diff_bias_data:   output data buffer of diff_bias
+  //   diff_dst_data:    input data buffer of diff_dst
+  void Execute(const T* src_data, const T* diff_filter_data,
+      const T* diff_bias_data, const T* diff_dst_data) {
+    context_.src_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(src_data)));
+    context_.diff_filter_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(diff_filter_data)));
+    context_.diff_bias_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(diff_bias_data)));
+    context_.diff_dst_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(diff_dst_data)));
+
+    context_.bwd_filter_stream->submit(context_.bwd_filter_primitives);
+
+    context_.src_mem->set_data_handle(DummyData);
+    context_.diff_filter_mem->set_data_handle(DummyData);
+    context_.diff_bias_mem->set_data_handle(DummyData);
+    context_.diff_dst_mem->set_data_handle(DummyData);
+    return;
+  }
+
+  // Convolution backward weights without bias
+  //   src_data:         input data buffer of src
+  //   diff_filter_data: output data buffer of diff_filter
+  //   diff_dst_data:    input data buffer of diff_dst
+  void Execute(const T* src_data,
+      const T* diff_filter_data, const T* diff_dst_data) {
+    context_.src_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(src_data)));
+    context_.diff_filter_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(diff_filter_data)));
+    context_.diff_dst_mem->set_data_handle(
+        static_cast<void*>(const_cast<T*>(diff_dst_data)));
+
+    context_.bwd_filter_stream->submit(context_.bwd_filter_primitives);
+
+    context_.src_mem->set_data_handle(DummyData);
+    context_.diff_filter_mem->set_data_handle(DummyData);
+    context_.diff_dst_mem->set_data_handle(DummyData);
+    return;
+  }
+
+  memory::format GetSrcMemoryFormat() const {
+    return context_.src_fmt;
+  }
+
+  memory::format GetDiffDstMemoryFormat() const {
+    return context_.diff_dst_fmt;
+  }
+
+  memory::format GetDiffFilterMemoryFormat() const {
+    return context_.diff_filter_fmt;
+  }
+
+  // convolution primitive
+  std::shared_ptr<mkldnn::convolution_backward_weights::primitive_desc>
+  GetPrimitiveDesc() const {
+    return context_.bwd_filter_pd;
+  }
+
+ private:
+  // Primitive reuse context for Conv2D bwd filter op
+  struct ConvBwdFilterContext {
+    // expected memory format for this primitive instance
+    memory::format src_fmt;
+    memory::format diff_dst_fmt;
+    memory::format diff_filter_fmt;
+
+    // convolution bwd input primitive
+    std::shared_ptr<mkldnn::convolution_backward_weights::primitive_desc>
+        bwd_filter_pd;
+    std::shared_ptr<mkldnn::primitive> conv_bwd_filter;
+
+    // MKLDNN memory
+    std::shared_ptr<mkldnn::memory> src_mem;
+    std::shared_ptr<mkldnn::memory> diff_filter_mem;
+    std::shared_ptr<mkldnn::memory> diff_bias_mem;
+    std::shared_ptr<mkldnn::memory> diff_dst_mem;
+
+    // desc & prmitive desc
+    std::shared_ptr<mkldnn::convolution_backward_weights::desc> bwd_filter_desc;
+    std::shared_ptr<mkldnn::convolution_forward::desc> fwd_desc;
+    std::shared_ptr<mkldnn::convolution_forward::primitive_desc> fwd_pd;
+
+    // memory desc: forward & backward can share same memory desc
+    std::shared_ptr<mkldnn::memory::desc> src_md;
+    std::shared_ptr<mkldnn::memory::desc> diff_filter_md;
+    std::shared_ptr<mkldnn::memory::desc> diff_bias_md;
+    std::shared_ptr<mkldnn::memory::desc> diff_dst_md;
+
+    // MKL pipeline
+    std::shared_ptr<mkldnn::stream> bwd_filter_stream;
+    std::vector<mkldnn::primitive> bwd_filter_primitives;
+
+    ConvBwdFilterContext() :
+        src_fmt(memory::format::any),
+        diff_dst_fmt(memory::format::any),
+        diff_filter_fmt(memory::format::any),
+        src_mem(nullptr), diff_filter_mem(nullptr),
+        diff_bias_mem(nullptr), diff_dst_mem(nullptr),
+        bwd_filter_desc(nullptr), fwd_desc(nullptr), fwd_pd(nullptr),
+        src_md(nullptr), diff_filter_md(nullptr),
+        diff_bias_md(nullptr), diff_dst_md(nullptr),
+        bwd_filter_stream(nullptr) {
+    }
+  };
+
+  // Setup Conv2d backward filter (weights) primitives.
+  void Setup(const MklConvBwdFilterParams& convBwdFilterDims) {
+    // create memory descriptors for convolution data w/ no specified format
+    context_.src_md.reset(new memory::desc({convBwdFilterDims.src_dims},
+        MklDnnType<T>(), memory::format::any));
+
+    context_.diff_dst_md.reset(new memory::desc(
+        {convBwdFilterDims.diff_dst_dims},
+        MklDnnType<T>(), memory::format::any));
+
+    context_.diff_filter_md.reset(new memory::desc(
+        {convBwdFilterDims.diff_filter_dims},
+        MklDnnType<T>(), memory::format::any));
+
+    if (!convBwdFilterDims.diff_bias_dims.empty())
+      context_.diff_bias_md.reset(new memory::desc(
+          {convBwdFilterDims.diff_bias_dims},
+          MklDnnType<T>(), memory::format::x));
+
+    // create a convolution
+    if (!convBwdFilterDims.diff_bias_dims.empty()) {
+      context_.bwd_filter_desc.reset(new convolution_backward_weights::desc(
+          convolution_direct, *context_.src_md, *context_.diff_filter_md,
+          *context_.diff_bias_md, *context_.diff_dst_md,
+          convBwdFilterDims.strides, convBwdFilterDims.dilations,
+          convBwdFilterDims.padding_left, convBwdFilterDims.padding_right,
+          convBwdFilterDims.padding));
+    } else {
+      context_.bwd_filter_desc.reset(
+          new convolution_backward_weights::desc(
+          convolution_direct, *context_.src_md, *context_.diff_filter_md,
+          *context_.diff_dst_md, convBwdFilterDims.strides,
+          convBwdFilterDims.dilations, convBwdFilterDims.padding_left,
+          convBwdFilterDims.padding_right, convBwdFilterDims.padding));
+    }
+
+    // create fwd primitive_desc
+    context_.fwd_desc.reset(new convolution_forward::desc(
+        prop_kind::forward, convolution_direct,
+        *context_.src_md, *context_.diff_filter_md, *context_.diff_dst_md,
+        convBwdFilterDims.strides,
+        convBwdFilterDims.dilations, convBwdFilterDims.padding_left,
+        convBwdFilterDims.padding_right, convBwdFilterDims.padding));
+    context_.fwd_pd.reset(new convolution_forward::primitive_desc(
+        *context_.fwd_desc, cpu_engine_));
+
+    // create backward conv primitive_desc
+    context_.bwd_filter_pd.reset(
+        new convolution_backward_weights::primitive_desc(
+        *context_.bwd_filter_desc, cpu_engine_, *context_.fwd_pd));
+
+    // store the expected memory format
+    auto bwd_filter_pd = context_.bwd_filter_pd.get();
+    context_.src_fmt = static_cast<mkldnn::memory::format>(
+        bwd_filter_pd->src_primitive_desc().desc().data.format);
+    context_.diff_filter_fmt = static_cast<mkldnn::memory::format>(
+        bwd_filter_pd->diff_weights_primitive_desc().desc().data.format);
+    context_.diff_dst_fmt = static_cast<mkldnn::memory::format>(
+        bwd_filter_pd->diff_dst_primitive_desc().desc().data.format);
+
+    // create memory primitive based on dummy data
+    context_.src_mem.reset(new memory(
+        bwd_filter_pd->src_primitive_desc(), DummyData));
+    context_.diff_filter_mem.reset(new memory(
+        bwd_filter_pd->diff_weights_primitive_desc(), DummyData));
+    context_.diff_dst_mem.reset(new memory(
+        bwd_filter_pd->diff_dst_primitive_desc(), DummyData));
+
+    // create convolution primitive and add it to net
+    if (!convBwdFilterDims.diff_bias_dims.empty()) {
+      context_.diff_bias_mem.reset(new memory(
+          {{{convBwdFilterDims.diff_bias_dims}, MklDnnType<T>(),
+          memory::format::x}, cpu_engine_}, DummyData));
+      context_.conv_bwd_filter.reset(new convolution_backward_weights(
+          *context_.bwd_filter_pd, *context_.src_mem, *context_.diff_dst_mem,
+          *context_.diff_filter_mem, *context_.diff_bias_mem));
+    } else {
+      context_.conv_bwd_filter.reset(new convolution_backward_weights(
+          *context_.bwd_filter_pd, *context_.src_mem,
+          *context_.diff_dst_mem, *context_.diff_filter_mem));
+    }
+
+    context_.bwd_filter_primitives.push_back(*context_.conv_bwd_filter);
+  }
+
+  struct ConvBwdFilterContext context_;
+  engine cpu_engine_;
+};
+
+template <typename T>
+class MklConv2DBwdFilterPrimitiveFactory : public MklPrimitiveFactory<T> {
+ public:
+  static MklConv2DBwdFilterPrimitive<T>* Get(
+      const MklConvBwdFilterParams& convBwdFilterDims) {
+    MklConv2DBwdFilterPrimitive<T>* conv2d_bwd_filter = nullptr;
+
+    // look into the pool for reusable primitive
+    conv2d_bwd_filter = dynamic_cast<MklConv2DBwdFilterPrimitive<T>*> (
+      MklConv2DBwdFilterPrimitiveFactory<T>::GetInstance().GetConv2dBwdFilter(
+      convBwdFilterDims));
+
+    if (conv2d_bwd_filter == nullptr) {
+        conv2d_bwd_filter = new MklConv2DBwdFilterPrimitive<T>(
+            convBwdFilterDims);
+        MklConv2DBwdFilterPrimitiveFactory<T>::GetInstance().SetConv2dBwdFilter(
+            convBwdFilterDims, conv2d_bwd_filter);
+    }
+    return conv2d_bwd_filter;
+  }
+
+
+ private:
+  MklConv2DBwdFilterPrimitiveFactory() {}
+  ~MklConv2DBwdFilterPrimitiveFactory() {}
+
+  static MklConv2DBwdFilterPrimitiveFactory& GetInstance() {
+    static MklConv2DBwdFilterPrimitiveFactory instance_;
+    return instance_;
+  }
+
+  static string CreateKey(const MklConvBwdFilterParams& convBwdFilterDims) {
+    string prefix = "conv2d_bwd_filter";
+    FactoryKeyCreator key_creator;
+    key_creator.AddAsKey(prefix);
+    key_creator.AddAsKey(convBwdFilterDims.src_dims);
+    key_creator.AddAsKey(convBwdFilterDims.diff_filter_dims);
+    key_creator.AddAsKey(convBwdFilterDims.diff_bias_dims);
+    key_creator.AddAsKey(convBwdFilterDims.diff_dst_dims);
+    key_creator.AddAsKey(convBwdFilterDims.strides);
+    key_creator.AddAsKey(convBwdFilterDims.dilations);
+    key_creator.AddAsKey(convBwdFilterDims.padding_left);
+    key_creator.AddAsKey(convBwdFilterDims.padding_right);
+    return key_creator.GetKey();
+  }
+
+  MklPrimitive* GetConv2dBwdFilter(
+      const MklConvBwdFilterParams& convBwdFilterDims) {
+    string key = CreateKey(convBwdFilterDims);
+    return this->GetOp(key);
+  }
+
+  void SetConv2dBwdFilter(
+      const MklConvBwdFilterParams& convBwdFilterDims, MklPrimitive* op) {
+    string key = CreateKey(convBwdFilterDims);
+    this->SetOp(key, op);
+  }
+};
+
+#endif
+
 #ifdef INTEL_MKL_ML
 
 template <typename Device, class T>
@@ -442,11 +743,207 @@ class MklConv2DCustomBackpropFilterOp
     : public MklConv2DBackpropCommonOp<Device, T> {
  public:
   explicit MklConv2DCustomBackpropFilterOp(OpKernelConstruction* context)
-      : MklConv2DBackpropCommonOp<Device, T>(context) {}
+      : MklConv2DBackpropCommonOp<Device, T>(context) {
+  }
+
   ~MklConv2DCustomBackpropFilterOp() {}
 
+  void Compute(OpKernelContext* context) {
+    try {
+      MklDnnData<T> src(&cpu_engine_);
+      MklDnnData<T> diff_dst(&cpu_engine_);
+      MklDnnData<T> diff_filter(&cpu_engine_);  // output
+
+      // Input tensors
+      const int kInputIdx = 0, kFilterIdx = 1, kOutbpropIdx = 2;
+      const Tensor& src_tensor = MklGetInput(context, kInputIdx);
+      const Tensor& filter_tensor = MklGetInput(context, kFilterIdx);
+      const Tensor& diff_dst_tensor = MklGetInput(context, kOutbpropIdx);
+
+      MklDnnShape src_mkl_shape, filter_mkl_shape, diff_dst_mkl_shape;
+      GetMklShape(context, kInputIdx, &src_mkl_shape);
+      GetMklShape(context, kFilterIdx, &filter_mkl_shape);
+      GetMklShape(context, kOutbpropIdx, &diff_dst_mkl_shape);
+      // Allow operator-specific sanity checking of shapes.
+      ValidateMklShapes(src_mkl_shape, filter_mkl_shape, diff_dst_mkl_shape);
+
+      // Allow operator-specific generation of shapes.
+      // E.g., Conv2DBackpropFilter gets filter as filter_sizes. It is a
+      // tensor containing shape of filter. So filter.shape() is not
+      // a correct way to get filter shape. These operator-specific calls
+      // allow this class to handle this case.
+      TensorShape src_tf_shape = MakeInputTfShape(context, src_tensor);
+      TensorShape filter_tf_shape = MakeFilterTfShape(context, filter_tensor);
+      TensorShape diff_dst_tf_shape = GetTfShape(context, kOutbpropIdx);
+
+      // Corner cases: output with 0 elements and 0 batch size.
+      Tensor* diff_filter_tensor = nullptr;
+      if (src_tf_shape.num_elements() == 0 ||
+          filter_tf_shape.num_elements() == 0 ||
+          diff_dst_tf_shape.num_elements() == 0) {
+        MklDnnShape diff_filter_mkl_shape;
+        diff_filter_mkl_shape.SetMklTensor(false);
+        TensorShape diff_filter_tf_shape = GetOutputTfShape(
+            src_tf_shape, filter_tf_shape, diff_dst_tf_shape);
+        const int kOutputIdx = 0;
+        AllocateOutputSetMklShape(context, kOutputIdx, &diff_filter_tensor,
+                                  diff_filter_tf_shape, diff_filter_mkl_shape);
+        CHECK_NOTNULL(diff_filter_tensor);
+
+        // if output tensor has more than 0 elements, we need to 0 them out.
+        auto diff_filter_data = diff_filter_tensor->flat<T>().data();
+        for (size_t i = 0; i < diff_filter_tf_shape.num_elements(); ++i) {
+          diff_filter_data[i] = 0;
+        }
+        return;
+      }
+
+      // By default, all dims are in MKL order. Only dims in TF order
+      // are those with prefix tf_order.
+      memory::dims diff_dst_dims, fwd_src_dims, fwd_filter_dims;
+      memory::dims padding_left, padding_right, dilations,
+          strides, fwd_dst_dims;
+      memory::dims fwd_dst_dims_tf_order;
+
+      // Get forward convolution parameters.
+      MklDnnConvUtil conv_utl(context, this->strides_, this->padding_,
+          this->data_format_, this->dilations_);
+      conv_utl.GetConvFwdSizesInMklOrder(
+          src_tf_shape, filter_tf_shape, &fwd_src_dims, &fwd_filter_dims,
+          &strides, &dilations, &fwd_dst_dims_tf_order,
+          &fwd_dst_dims, &padding_left, &padding_right);
+      if (!context->status().ok()) return;
+
+      auto tf_fmt = TFDataFormatToMklDnnDataFormat(this->data_format_);
+      auto fwd_src_md =
+          src_mkl_shape.IsMklTensor()
+              ? src_mkl_shape.GetMklLayout()
+              : memory::desc(fwd_src_dims, MklDnnType<T>(), tf_fmt);
+
+      conv_utl.GetInputSizeInMklOrder(diff_dst_tf_shape, &diff_dst_dims);
+      if (!context->status().ok()) return;
+
+      auto diff_dst_md = diff_dst_mkl_shape.IsMklTensor()
+                       ? diff_dst_mkl_shape.GetMklLayout()
+                       : memory::desc(diff_dst_dims,
+                           MklDnnType<T>(), tf_fmt);
+
+      memory::dims diff_bias_dims = {};
+      int64 depth = 0;
+      if (biasEnabled) {
+        TensorShape obp_tf_shape = GetTfShape(context, 2);
+        depth = (this->data_format_ == FORMAT_NCHW)
+              ? obp_tf_shape.dim_size(1)
+              : obp_tf_shape.dim_size(3);
+        diff_bias_dims = {static_cast<int>(depth)};
+      }
+
+      dilations[kDilationH] -= 1;
+      dilations[kDilationW] -= 1;
+
+      MklConv2DBwdFilterPrimitive<T> *conv2d_bwd_filter = nullptr;
+      MklConvBwdFilterParams convBwdFilterDims(fwd_src_dims, fwd_filter_dims,
+          diff_bias_dims, diff_dst_dims, strides, dilations, padding_left,
+          padding_right, TFPaddingToMklDnnPadding(this->padding_));
+      conv2d_bwd_filter = MklConv2DBwdFilterPrimitiveFactory<T>::Get(
+          convBwdFilterDims);
+      auto bwd_filter_pd = conv2d_bwd_filter->GetPrimitiveDesc();
+
+      // allocate output tensors: diff_fitler and diff_bias (w bias)
+      auto bwd_output_dims = GetOutputDims(fwd_src_dims, fwd_filter_dims);
+
+      // diff_filter
+      MklDnnShape diff_filter_mkl_shape;
+      diff_filter_mkl_shape.SetMklTensor(false);
+      // output_dims_mkl_order is in OIHW format.
+      TensorShape diff_filter_tf_shape(
+                    {bwd_output_dims[MklDnnDims::Dim_H],
+                     bwd_output_dims[MklDnnDims::Dim_W],
+                     bwd_output_dims[MklDnnDims::Dim_I],
+                     bwd_output_dims[MklDnnDims::Dim_O]});
+      AllocateOutputSetMklShape(context, 0, &diff_filter_tensor,
+                diff_filter_tf_shape, diff_filter_mkl_shape);
+
+      Tensor* diff_bias_tensor = nullptr;
+      if (biasEnabled) {
+        TensorShape diff_bias_shape({depth});
+        AllocateBiasGradTensor(context, diff_bias_shape, &diff_bias_tensor);
+      }
+
+      // check if src and diff_dst need reorder
+      T *src_data = nullptr;
+      if (fwd_src_md.data.format != conv2d_bwd_filter->GetSrcMemoryFormat()) {
+        src.SetUsrMem(fwd_src_md, &src_tensor);
+        src.CheckReorderToOpMem(bwd_filter_pd->src_primitive_desc());
+        src_data = static_cast<T*>(src.GetOpMem().get_data_handle());
+      } else {
+        src_data = static_cast<T*>(const_cast<T*>(
+            src_tensor.flat<T>().data()));
+      }
+
+      T *diff_dst_data = nullptr;
+      if (diff_dst_md.data.format !=
+          conv2d_bwd_filter->GetDiffDstMemoryFormat()) {
+        diff_dst.SetUsrMem(diff_dst_md, &diff_dst_tensor);
+        diff_dst.CheckReorderToOpMem(bwd_filter_pd->diff_dst_primitive_desc());
+        diff_dst_data = static_cast<T*>(
+            diff_dst.GetOpMem().get_data_handle());
+      } else {
+        diff_dst_data = static_cast<T*>(const_cast<T*>(
+            diff_dst_tensor.flat<T>().data()));
+      }
+
+      // For backward filter, convert diff_filter back to Tensorflow layout
+      // Here we prepare to reorder op memory back to user memory
+      bool diff_filter_reorder_required = false;
+      T *diff_filter_data = nullptr;
+      if (GetOutputFormat(tf_fmt) !=
+          conv2d_bwd_filter->GetDiffFilterMemoryFormat()) {
+        // Allocate diff filter tensor as Tensorflow layout
+        diff_filter.SetUsrMem(bwd_output_dims, GetOutputFormat(tf_fmt),
+                              diff_filter_tensor);
+        diff_filter_reorder_required = true;
+        diff_filter.PrepareReorderToUserMemIfReq(
+                bwd_filter_pd->diff_weights_primitive_desc());
+        diff_filter_data = static_cast<T*>(
+                            diff_filter.GetOpMem().get_data_handle());
+      } else {
+        diff_filter_data = static_cast<T*>(const_cast<T*>(
+                            diff_filter_tensor->flat<T>().data()));
+      }
+
+      // Execute convolution filter bwd
+      if (biasEnabled) {
+        T* diff_bias_data = static_cast<T*>(const_cast<T*>(
+                         diff_bias_tensor->flat<T>().data()));
+        conv2d_bwd_filter->Execute(src_data, diff_filter_data,
+                               diff_bias_data, diff_dst_data);
+      } else {
+        conv2d_bwd_filter->Execute(src_data, diff_filter_data, diff_dst_data);
+      }
+
+      // Reorder diff_filter back to Tensorflow layout if necessary
+      if (diff_filter_reorder_required) {
+        diff_filter.InsertReorderToUserMem();
+      }
+    } catch (mkldnn::error& e) {
+      string error_msg = "Status: " + std::to_string(e.status) +
+                         ", message: " + string(e.message) + ", in file " +
+                         string(__FILE__) + ":" + std::to_string(__LINE__);
+      OP_REQUIRES_OK(
+          context,
+          errors::Aborted("Operation received an exception:", error_msg));
+    }
+  }
+
  private:
+  const int kInputIndex_Filter = 1;
+  const int kInputIndex_InputSizes = 0;
   const int kDilationH = 0, kDilationW = 1;
+  engine cpu_engine_ = engine(engine::cpu, 0);
+
+  // Validate input shapes.
+  // Function asserts that input shapes are valid.
   void ValidateMklShapes(const MklDnnShape& input_mkl_shape,
                          const MklDnnShape& filter_mkl_shape,
                          const MklDnnShape& obp_mkl_shape) {
@@ -454,141 +951,44 @@ class MklConv2DCustomBackpropFilterOp
         << "Conv2DBackpropFilter: filter should not be in MKL Layout";
   }
 
-  size_t GetInputTensorIndexWithSizes() { return 1; /* filter index */ }
-
+  // Get TensorFlow shape of input tensor.
   TensorShape MakeInputTfShape(OpKernelContext* context,
                                const Tensor& input_tensor) {
     size_t input_idx = 0;
     return GetTfShape(context, input_idx);
   }
 
+  // Get TensorFlow shape of filter tensor.
   TensorShape MakeFilterTfShape(OpKernelContext* context,
                                 const Tensor& filter_tensor) {
     TensorShape filter_tf_shape;
     CHECK_EQ(TensorShapeUtils::IsVector(filter_tensor.shape()), true);
     CHECK_EQ(TensorShapeUtils::MakeShape(filter_tensor.vec<int32>(),
-                                         &filter_tf_shape)
-                 .ok(),
-             true);
+             &filter_tf_shape).ok(), true);
     return filter_tf_shape;
   }
 
+  // Get Tensorflow shape of output tensor (diff_filter),
+  // which is same as shape of filter.
   TensorShape GetOutputTfShape(const TensorShape& input_shape,
                                const TensorShape& filter_shape,
                                const TensorShape& outbprop_shape) {
-    // Shape of output of Conv2DBackpropFilter is same as shape of filter.
     return filter_shape;
   }
 
+  // Get the shape of output (diff_filter) in MKL-DNN order.
+  // Computes shape of output from input shape (fwd_input_dims)
+  // and filter shape (fwd_filter_dims).
   const memory::dims& GetOutputDims(const memory::dims& fwd_input_dims,
                                     const memory::dims& fwd_filter_dims) {
-    // Shape of output of Conv2DBackpropFilter is same as shape of filter.
     return fwd_filter_dims;
   }
 
+  // Output layout is Tensorflow's filter layout (HWIO).
   memory::format GetOutputFormat(const memory::format data_format) {
-    // Output layout is Tensorflow's filter layout (HWIO).
     return memory::format::hwio;
   }
 
-  void CreatePrimitive(OpKernelContext* context, const engine& cpu_engine,
-                       const convolution_forward::primitive_desc& conv_fwd_pd,
-                       MklDnnData<T>* input, MklDnnData<T>* filter,
-                       MklDnnData<T>* outbackprop, MklDnnData<T>* output,
-                       Tensor** output_tensor,
-                       const memory::dims& strides,
-                       const memory::dims& dilations,
-                       const memory::dims& padding_l,
-                       const memory::dims& padding_r, padding_kind padding,
-                       const memory::dims& bwd_output_dims,
-                       memory::format bwd_output_format) {
-    CHECK_NOTNULL(context);
-    CHECK_NOTNULL(input);
-    CHECK_NOTNULL(filter);
-    CHECK_NOTNULL(outbackprop);
-    CHECK_NOTNULL(output);
-    CHECK_NOTNULL(output_tensor);
-
-    MklDnnData<T>* bias_grad = nullptr;
-    int depth = 0;
-    if (biasEnabled) {
-      // Data structure for bias_grad
-      bias_grad = new MklDnnData<T>(&cpu_engine);
-      TensorShape obp_tf_shape = GetTfShape(context, 2);
-      depth = (MklConv2DBackpropCommonOp<Device, T>::GetTFDataFormat() ==
-               FORMAT_NCHW)
-                  ? obp_tf_shape.dim_size(1)
-                  : obp_tf_shape.dim_size(3);
-      memory::dims bias_grad_dims = {depth};
-      bias_grad->SetOpMemDesc(bias_grad_dims, memory::format::x);
-    }
-
-    if (biasEnabled && (bias_grad != nullptr)) {
-      // Create convolution backward weights with bias primitive.
-      // Use dilated convolution in case dilate rates are greater than zero.
-      auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
-        convolution_backward_weights::desc(convolution_direct,
-                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
-                                  bias_grad->GetOpMemDesc(),
-                                  outbackprop->GetOpMemDesc(), strides,
-                                  dilations, padding_l, padding_r, padding) :
-        convolution_backward_weights::desc(convolution_direct,
-                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
-                                  bias_grad->GetOpMemDesc(),
-                                  outbackprop->GetOpMemDesc(),
-                                  strides, padding_l, padding_r, padding);
-      auto bwd_pd = convolution_backward_weights::primitive_desc(bwd_desc,
-                                                            cpu_engine,
-                                                            conv_fwd_pd);
-
-      // Allocate output tensor.
-      AllocateOutputTensor(context, bwd_pd, bwd_output_dims,
-                           bwd_output_format, output_tensor);
-
-      CHECK_NOTNULL(*output_tensor);
-      // Set buffer handle using allocated output tensor.
-      output->SetUsrMemDataHandle(*output_tensor);
-
-      // Allocate bias_grad tensor
-      TensorShape bias_grad_shape({depth});
-      Tensor* bias_grad_tensor = nullptr;
-      AllocateBiasGradTensor(context, bias_grad_shape, &bias_grad_tensor);
-      memory::dims bias_grad_dims = {depth};
-      // Since Bias is 1D, we use format::x from MKLDNN to represent it.
-      auto bias_grad_md =
-          memory::desc({bias_grad_dims}, MklDnnType<T>(), memory::format::x);
-      bias_grad->SetUsrMem(bias_grad_md, bias_grad_tensor);
-      bias_grad->SetUsrMemDataHandle(bias_grad_tensor);
-
-      PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output,
-                                  bias_grad);
-    } else {
-      // Create convolution backward weights primitive.
-      // Use dilated convolution in case dilate rates are greater than zero.
-      auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
-        convolution_backward_weights::desc(convolution_direct,
-                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
-                                  outbackprop->GetOpMemDesc(), strides,
-                                  dilations, padding_l, padding_r, padding) :
-        convolution_backward_weights::desc(convolution_direct,
-                                  input->GetOpMemDesc(), output->GetOpMemDesc(),
-                                  outbackprop->GetOpMemDesc(),
-                                  strides, padding_l, padding_r, padding);
-      auto bwd_pd = convolution_backward_weights::primitive_desc(bwd_desc,
-                                                            cpu_engine,
-                                                            conv_fwd_pd);
-
-      // Allocate output tensor.
-      AllocateOutputTensor(context, bwd_pd, bwd_output_dims,
-                           bwd_output_format, output_tensor);
-
-      CHECK_NOTNULL(*output_tensor);
-      // Set buffer handle using allocated output tensor.
-      output->SetUsrMemDataHandle(*output_tensor);
-      PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output);
-    }
-  }
-
   // Allocate output tensor.
   void AllocateOutputTensor(
       OpKernelContext* context,
@@ -623,40 +1023,8 @@ class MklConv2DCustomBackpropFilterOp
 
     MklDnnShape bias_grad_mkl_shape;
     bias_grad_mkl_shape.SetMklTensor(false);
-    AllocateOutputSetMklShape(context, 1, bias_grad_tensor, bias_grad_shape,
-                              bias_grad_mkl_shape);
-  }
-
-  // Prepare and execute net - checks for input and output reorders.
-  void PrepareAndExecutePrimitive(
-      const convolution_backward_weights::primitive_desc& conv_pd,
-      MklDnnData<T>* input, MklDnnData<T>* obp, MklDnnData<T>* output,
-      MklDnnData<T>* bias_grad = nullptr) {
-    // Create reorders between user layout and MKL layout if it is needed and
-    // add it to the net before convolution.
-    std::vector<primitive> net;
-    input->CheckReorderToOpMem(conv_pd.src_primitive_desc(), &net);
-    obp->CheckReorderToOpMem(conv_pd.diff_dst_primitive_desc(), &net);
-
-    // For BackpropFilter, we convert the output tensor back in Tensorflow
-    // layout.
-    bool output_reorder_required = output->PrepareReorderToUserMemIfReq(
-        conv_pd.diff_weights_primitive_desc());
-
-    if (biasEnabled && (bias_grad != nullptr)) {
-      net.push_back(convolution_backward_weights(
-          conv_pd, input->GetOpMem(), obp->GetOpMem(), output->GetOpMem(),
-          bias_grad->GetOpMem()));
-    } else {
-      net.push_back(convolution_backward_weights(
-          conv_pd, input->GetOpMem(), obp->GetOpMem(), output->GetOpMem()));
-    }
-
-    if (output_reorder_required) {
-      output->InsertReorderToUserMem(&net);
-    }
-
-    stream(stream::kind::eager).submit(net).wait();
+    AllocateOutputSetMklShape(context, 1, bias_grad_tensor,
+        bias_grad_shape, bias_grad_mkl_shape);
   }
 };
 
diff --git a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
index 21b18f9119..39498f1a80 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
@@ -55,9 +55,245 @@ using mkldnn::stream;
 #endif
 
 namespace tensorflow {
-
 typedef Eigen::ThreadPoolDevice CPUDevice;
 
+#ifndef INTEL_MKL_ML
+
+/// utility classes enabling primitive reuse for backward conv2d ops.
+struct MklConvBwdInputParams {
+  memory::dims diff_src_dims;
+  memory::dims filter_dims;
+  memory::dims diff_dst_dims;
+  memory::dims strides;
+  memory::dims dilations;
+  memory::dims padding_left;
+  memory::dims padding_right;
+  padding_kind padding;
+
+  MklConvBwdInputParams(memory::dims diff_src_dims,
+    memory::dims filter_dims, memory::dims diff_dst_dims,
+    memory::dims strides, memory::dims dilations,
+    memory::dims padding_left, memory::dims padding_right,
+    padding_kind padding) :
+      diff_src_dims(diff_src_dims), filter_dims(filter_dims),
+      diff_dst_dims(diff_dst_dims), strides(strides),
+      dilations(dilations), padding_left(padding_left),
+      padding_right(padding_right), padding(padding) {
+  }
+};
+
+template <typename T>
+class MklConv2DBwdInputPrimitive : public MklPrimitive {
+ public:
+  explicit MklConv2DBwdInputPrimitive(
+      const MklConvBwdInputParams& convBwdInputDims) :
+           cpu_engine_(engine::cpu, 0) {
+    context_.bwd_input_stream.reset(new stream(stream::kind::eager));
+
+    // create conv primitive
+    if (context_.conv_bwd_input == nullptr) {
+      Setup(convBwdInputDims);
+    }
+  }
+  ~MklConv2DBwdInputPrimitive() {}
+
+  // Convolution backward filter (weights)
+  //   diff_src_data: output data buffer of diff_src
+  //   filter_data:   input data buffer of filter (weights)
+  //   diff_dst_data: input data buffer of dst
+  // Bias does not matter here
+  void Execute(const T* diff_src_data,
+      const T* filter_data, const T* diff_dst_data) {
+    context_.diff_src_mem->set_data_handle(
+        static_cast<T*>(const_cast<T*>(diff_src_data)));
+    context_.filter_mem->set_data_handle(
+        static_cast<T*>(const_cast<T*>(filter_data)));
+    context_.diff_dst_mem->set_data_handle(
+        static_cast<T*>(const_cast<T*>(diff_dst_data)));
+
+    context_.bwd_input_stream->submit(context_.bwd_input_primitives);
+
+    // set back data handle
+    context_.diff_src_mem->set_data_handle(DummyData);
+    context_.filter_mem->set_data_handle(DummyData);
+    context_.diff_dst_mem->set_data_handle(DummyData);
+    return;
+  }
+
+  memory::format GetFilterMemoryFormat() const {
+    return context_.filter_fmt;
+  }
+
+  memory::format GetDiffDstMemoryFormat() const {
+    return context_.diff_dst_fmt;
+  }
+
+  std::shared_ptr<mkldnn::convolution_backward_data::primitive_desc>
+  GetPrimitiveDesc() const {
+    return context_.bwd_input_pd;
+  }
+
+ private:
+  // Primitive reuse context for Conv2D Bwd Input op
+  struct ConvBwdInputContext {
+    // expected memory format for this primitive instance
+    memory::format filter_fmt;
+    memory::format diff_dst_fmt;
+
+    // MKLDNN memory
+    std::shared_ptr<mkldnn::memory> diff_src_mem;
+    std::shared_ptr<mkldnn::memory> filter_mem;
+    std::shared_ptr<mkldnn::memory> diff_dst_mem;
+
+    // convolution primitive
+    std::shared_ptr<mkldnn::convolution_backward_data::primitive_desc>
+        bwd_input_pd;
+    std::shared_ptr<mkldnn::primitive> conv_bwd_input;
+
+    // desc & prmitive desc
+    std::shared_ptr<mkldnn::convolution_backward_data::desc> bwd_input_desc;
+    std::shared_ptr<mkldnn::convolution_forward::desc> fwd_desc;
+    std::shared_ptr<mkldnn::convolution_forward::primitive_desc> fwd_pd;
+
+    // memory desc: forward & backward can share same memory::desc
+    std::shared_ptr<memory::desc> diff_src_md;
+    std::shared_ptr<memory::desc> filter_md;
+    std::shared_ptr<memory::desc> diff_dst_md;
+
+    // MKL pipeline
+    std::shared_ptr<mkldnn::stream> bwd_input_stream;
+    std::vector<mkldnn::primitive> bwd_input_primitives;
+
+    ConvBwdInputContext() :
+        filter_fmt(memory::format::any), diff_dst_fmt(memory::format::any),
+        diff_src_mem(nullptr), filter_mem(nullptr), diff_dst_mem(nullptr),
+        bwd_input_pd(nullptr), conv_bwd_input(nullptr),
+        bwd_input_desc(nullptr), fwd_desc(nullptr), fwd_pd(nullptr),
+        diff_src_md(nullptr), filter_md(nullptr), diff_dst_md(nullptr),
+        bwd_input_stream(nullptr) {
+    }
+  };
+
+
+  void Setup(const MklConvBwdInputParams& convBwdInputDims) {
+    // create memory descriptors for convolution data w/ no specified format
+    context_.diff_src_md.reset(new memory::desc(
+        {convBwdInputDims.diff_src_dims},
+        MklDnnType<T>(), memory::format::any));
+    context_.filter_md.reset(new memory::desc(
+        {convBwdInputDims.filter_dims},
+        MklDnnType<T>(), memory::format::any));
+    context_.diff_dst_md.reset(new memory::desc(
+        {convBwdInputDims.diff_dst_dims},
+        MklDnnType<T>(), memory::format::any));
+
+    // create convolution primitives
+    context_.bwd_input_desc.reset(new convolution_backward_data::desc(
+        convolution_direct, *context_.diff_src_md, *context_.filter_md,
+        *context_.diff_dst_md, convBwdInputDims.strides,
+        convBwdInputDims.dilations, convBwdInputDims.padding_left,
+        convBwdInputDims.padding_right, convBwdInputDims.padding));
+
+    context_.fwd_desc.reset(new convolution_forward::desc(prop_kind::forward,
+        convolution_direct, *context_.diff_src_md, *context_.filter_md,
+        *context_.diff_dst_md, convBwdInputDims.strides,
+        convBwdInputDims.dilations, convBwdInputDims.padding_left,
+        convBwdInputDims.padding_right, convBwdInputDims.padding));
+
+    context_.fwd_pd.reset(new convolution_forward::primitive_desc(
+        *context_.fwd_desc, cpu_engine_));
+
+    // create backward conv prim desc
+    context_.bwd_input_pd.reset(
+        new convolution_backward_data::primitive_desc(
+        *context_.bwd_input_desc, cpu_engine_, *context_.fwd_pd));
+
+    // create memory primitive based on dummy data
+    context_.diff_src_mem.reset(new memory(
+        context_.bwd_input_pd.get()->diff_src_primitive_desc(), DummyData));
+    context_.filter_mem.reset(new memory(
+        context_.bwd_input_pd.get()->weights_primitive_desc(), DummyData));
+    context_.diff_dst_mem.reset(new memory(
+        context_.bwd_input_pd.get()->diff_dst_primitive_desc(), DummyData));
+
+    // store the expected memory format
+    context_.filter_fmt = static_cast<memory::format>(
+     context_.bwd_input_pd.get()->weights_primitive_desc().desc().data.format);
+    context_.diff_dst_fmt = static_cast<memory::format>(
+     context_.bwd_input_pd.get()->diff_dst_primitive_desc().desc().data.format);
+
+    // create convolution primitive and add it to net
+    context_.conv_bwd_input.reset(new convolution_backward_data(
+        *context_.bwd_input_pd, *context_.diff_dst_mem,
+        *context_.filter_mem, *context_.diff_src_mem));
+
+    context_.bwd_input_primitives.push_back(*context_.conv_bwd_input);
+  }
+
+  struct ConvBwdInputContext context_;
+  engine cpu_engine_;
+};
+
+template <typename T>
+class MklConv2DBwdInputPrimitiveFactory : public MklPrimitiveFactory<T> {
+ private:
+  MklConv2DBwdInputPrimitiveFactory() {}
+  ~MklConv2DBwdInputPrimitiveFactory() {}
+
+ public:
+  static MklConv2DBwdInputPrimitive<T>* Get(
+      const MklConvBwdInputParams& convBwdInputDims) {
+    MklConv2DBwdInputPrimitive<T>* conv2d_bwd_input = nullptr;
+
+    // look into the pool for reusable primitive
+    conv2d_bwd_input = dynamic_cast<MklConv2DBwdInputPrimitive<T>*> (
+        MklConv2DBwdInputPrimitiveFactory<T>::GetInstance().GetConv2dBwdInput(
+            convBwdInputDims));
+
+    if (conv2d_bwd_input == nullptr) {
+      conv2d_bwd_input = new MklConv2DBwdInputPrimitive<T>(
+          convBwdInputDims);
+      MklConv2DBwdInputPrimitiveFactory<T>::GetInstance().SetConv2dBwdInput(
+          convBwdInputDims, conv2d_bwd_input);
+    }
+    return conv2d_bwd_input;
+  }
+
+ private:
+  static MklConv2DBwdInputPrimitiveFactory& GetInstance() {
+    static MklConv2DBwdInputPrimitiveFactory instance_;
+    return instance_;
+  }
+
+  static string CreateKey(const MklConvBwdInputParams& convBwdInputDims) {
+    string prefix = "conv2d_bwd_input";
+    FactoryKeyCreator key_creator;
+    key_creator.AddAsKey(prefix);
+    key_creator.AddAsKey(convBwdInputDims.diff_src_dims);
+    key_creator.AddAsKey(convBwdInputDims.filter_dims);
+    key_creator.AddAsKey(convBwdInputDims.diff_dst_dims);
+    key_creator.AddAsKey(convBwdInputDims.strides);
+    key_creator.AddAsKey(convBwdInputDims.dilations);
+    key_creator.AddAsKey(convBwdInputDims.padding_left);
+    key_creator.AddAsKey(convBwdInputDims.padding_right);
+    return key_creator.GetKey();
+  }
+
+  MklPrimitive* GetConv2dBwdInput(
+      const MklConvBwdInputParams& convBwdInputDims) {
+    string key = CreateKey(convBwdInputDims);
+    return this->GetOp(key);
+  }
+
+  void SetConv2dBwdInput(
+      const MklConvBwdInputParams& convBwdInputDims, MklPrimitive *op) {
+    string key = CreateKey(convBwdInputDims);
+    this->SetOp(key, op);
+  }
+};
+
+#endif
+
 #ifdef INTEL_MKL_ML
 
 template <typename Device, class T>
@@ -365,13 +601,168 @@ class MklConv2DCustomBackpropInputOp
     : public MklConv2DBackpropCommonOp<Device, T> {
  public:
   explicit MklConv2DCustomBackpropInputOp(OpKernelConstruction* context)
-      : MklConv2DBackpropCommonOp<Device, T>(context) {}
+      : MklConv2DBackpropCommonOp<Device, T>(context) {
+  }
+
   ~MklConv2DCustomBackpropInputOp() {}
 
+  void Compute(OpKernelContext* context) {
+    try {
+      MklDnnData<T> filter(&cpu_engine);
+      MklDnnData<T> diff_dst(&cpu_engine);
+
+      // Input tensors
+      const int kInputIdx = 0, kFilterIdx = 1, kOutbpropIdx = 2;
+      const Tensor& src_tensor = MklGetInput(context, kInputIdx);
+      const Tensor& filter_tensor = MklGetInput(context, kFilterIdx);
+      const Tensor& diff_dst_tensor = MklGetInput(context, kOutbpropIdx);
+
+      MklDnnShape src_mkl_shape, filter_mkl_shape, diff_dst_mkl_shape;
+      GetMklShape(context, kInputIdx, &src_mkl_shape);
+      GetMklShape(context, kFilterIdx, &filter_mkl_shape);
+      GetMklShape(context, kOutbpropIdx, &diff_dst_mkl_shape);
+      // Allow operator-specific sanity checking of shapes.
+      ValidateMklShapes(src_mkl_shape, filter_mkl_shape,
+                        diff_dst_mkl_shape);
+
+      // Allow operator-specific generation of shapes.
+      // E.g., Conv2DBackpropFilter gets filter as filter_sizes. It is a
+      // tensor containing shape of filter. So filter.shape() is not
+      // a correct way to get filter shape. These operator-specific calls
+      // allow this class to handle this case.
+      TensorShape src_tf_shape = MakeInputTfShape(context, src_tensor);
+      TensorShape filter_tf_shape = MakeFilterTfShape(context, filter_tensor);
+      TensorShape diff_dst_tf_shape = GetTfShape(context, kOutbpropIdx);
+
+      // Corner cases: output with 0 elements and 0 batch size.
+      Tensor* diff_src_tensor = nullptr;
+      if (src_tf_shape.num_elements() == 0 ||
+          filter_tf_shape.num_elements() == 0 ||
+          diff_dst_tf_shape.num_elements() == 0) {
+        MklDnnShape diff_src_mkl_shape;
+        diff_src_mkl_shape.SetMklTensor(false);
+        TensorShape diff_src_tf_shape = GetOutputTfShape(
+            src_tf_shape, filter_tf_shape, diff_dst_tf_shape);
+        const int kOutputIdx = 0;
+        AllocateOutputSetMklShape(context, kOutputIdx, &diff_src_tensor,
+                       diff_src_tf_shape, diff_src_mkl_shape);
+        CHECK_NOTNULL(diff_src_tensor);
+
+        // if output tensor has more than 0 elements, we need to 0 them out.
+        auto diff_src_data = diff_src_tensor->flat<T>().data();
+        for (size_t i = 0; i < diff_src_tf_shape.num_elements(); ++i) {
+          diff_src_data[i] = 0;
+        }
+        return;
+      }
+      // By default, all dims are in MKL order. Only dims in TF order
+      // are those with postfix tf_order.
+      memory::dims diff_dst_dims, fwd_src_dims, fwd_filter_dims;
+      memory::dims padding_left, padding_right, dilations, strides;
+      memory::dims fwd_output_dims, fwd_output_dims_tf_order;
+
+      // Get forward convolution parameters.
+      MklDnnConvUtil conv_utl(context, this->strides_, this->padding_,
+          this->data_format_, this->dilations_);
+      conv_utl.GetConvFwdSizesInMklOrder(
+          src_tf_shape, filter_tf_shape, &fwd_src_dims, &fwd_filter_dims,
+          &strides, &dilations, &fwd_output_dims_tf_order, &fwd_output_dims,
+          &padding_left, &padding_right);
+      if (!context->status().ok()) return;
+
+      // Create Convolution forward descriptor since Convolution backward
+      // API needs it. For that, we first need to create input, filter
+      // and output memory descriptors.
+      auto tf_fmt = TFDataFormatToMklDnnDataFormat(this->data_format_);
+
+      // If filter is in MKL layout, then simply grab filter layout;
+      // otherwise, construct filter in TF layout.
+      // For TF layout, filter is in HWIO format.
+      auto fwd_filter_md = filter_mkl_shape.IsMklTensor()
+                         ? filter_mkl_shape.GetMklLayout()
+                         : memory::desc(fwd_filter_dims, MklDnnType<T>(),
+                                        memory::format::hwio);
+
+      conv_utl.GetInputSizeInMklOrder(diff_dst_tf_shape, &diff_dst_dims);
+      if (!context->status().ok()) return;
+      auto diff_dst_md = diff_dst_mkl_shape.IsMklTensor()
+                       ? diff_dst_mkl_shape.GetMklLayout()
+                       : memory::desc(diff_dst_dims,
+                           MklDnnType<T>(), tf_fmt);
+
+      dilations[kDilationH] -= 1;
+      dilations[kDilationW] -= 1;
+
+      MklConv2DBwdInputPrimitive<T> *conv2d_bwd_input = nullptr;
+      conv_utl.GetInputSizeInMklOrder(diff_dst_tf_shape, &diff_dst_dims);
+      MklConvBwdInputParams convBwdInputDims(fwd_src_dims, fwd_filter_dims,
+          diff_dst_dims, strides, dilations, padding_left, padding_right,
+          TFPaddingToMklDnnPadding(this->padding_));
+      conv2d_bwd_input = MklConv2DBwdInputPrimitiveFactory<T>::Get(
+          convBwdInputDims);
+      auto bwd_input_pd = conv2d_bwd_input->GetPrimitiveDesc();
+
+      // allocate output tensor
+      auto diff_src_pd = bwd_input_pd->diff_src_primitive_desc();
+      auto bwd_diff_src_dims = GetOutputDims(fwd_src_dims, fwd_filter_dims);
+      auto bwd_diff_src_format = GetOutputFormat(tf_fmt);
+      MklDnnShape diff_src_mkl_shape;
+      diff_src_mkl_shape.SetMklTensor(true);
+      diff_src_mkl_shape.SetMklLayout(&diff_src_pd);
+      diff_src_mkl_shape.SetElemType(MklDnnType<T>());
+      diff_src_mkl_shape.SetTfLayout(bwd_diff_src_dims.size(),
+          bwd_diff_src_dims, bwd_diff_src_format);
+      TensorShape diff_src_tf_shape;
+      diff_src_tf_shape.AddDim(diff_src_pd.get_size() / sizeof(T));
+      AllocateOutputSetMklShape(context, 0, &diff_src_tensor,
+          diff_src_tf_shape, diff_src_mkl_shape);
+
+      T *diff_src_data = static_cast<T*>(const_cast<T*>(
+          diff_src_tensor->flat<T>().data()));
+
+      // check if filter and diff_dst need reorder
+      T* filter_data = nullptr;
+      if (fwd_filter_md.data.format !=
+          conv2d_bwd_input->GetFilterMemoryFormat()) {
+        filter.SetUsrMem(fwd_filter_md, &filter_tensor);
+        filter.CheckReorderToOpMem(bwd_input_pd->weights_primitive_desc());
+        filter_data = static_cast<T*>(filter.GetOpMem().get_data_handle());
+      } else {
+        filter_data = static_cast<T*>(const_cast<T*>(
+                       filter_tensor.flat<T>().data()));
+      }
+
+      T* diff_dst_data = nullptr;
+      if (diff_dst_md.data.format !=
+          conv2d_bwd_input->GetDiffDstMemoryFormat()) {
+        diff_dst.SetUsrMem(diff_dst_md, &diff_dst_tensor);
+        diff_dst.CheckReorderToOpMem(bwd_input_pd->diff_dst_primitive_desc());
+        diff_dst_data = static_cast<T*>(
+                         diff_dst.GetOpMem().get_data_handle());
+      } else {
+        diff_dst_data = static_cast<T*>(const_cast<T*>(
+                         diff_dst_tensor.flat<T>().data()));
+      }
+
+      // execute convolution input bwd
+      conv2d_bwd_input->Execute(diff_src_data, filter_data, diff_dst_data);
+    } catch (mkldnn::error& e) {
+      string error_msg = "Status: " + std::to_string(e.status) +
+                         ", message: " + string(e.message) + ", in file " +
+                         string(__FILE__) + ":" + std::to_string(__LINE__);
+      OP_REQUIRES_OK(
+          context,
+          errors::Aborted("Operation received an exception:", error_msg));
+    }
+  }
+
  private:
-  const int kInputIndex_Filter = 1, kInputIndex_InputSizes = 0,
-            kInputIndex_OutBackProp = 2;
+  const int kInputIndex_Filter = 1, kInputIndex_InputSizes = 0;
   const int kDilationH = 0, kDilationW = 1;
+  engine cpu_engine = engine(engine::cpu, 0);
+
+  // Validate input shapes.
+  // Function asserts that input shapes are valid.
   void ValidateMklShapes(const MklDnnShape& input_mkl_shape,
                          const MklDnnShape& filter_mkl_shape,
                          const MklDnnShape& obp_mkl_shape) {
@@ -382,8 +773,7 @@ class MklConv2DCustomBackpropInputOp
         << "Conv2DBackpropInput: input should not be in MKL Layout";
   }
 
-  size_t GetInputTensorIndexWithSizes() { return kInputIndex_InputSizes; }
-
+  // Get TensorFlow shape of input tensor.
   TensorShape MakeInputTfShape(OpKernelContext* context,
                                const Tensor& input_tensor) {
     TensorShape input_tf_shape;
@@ -395,72 +785,32 @@ class MklConv2DCustomBackpropInputOp
     return input_tf_shape;
   }
 
+  // Get TensorFlow shape of filter tensor.
   TensorShape MakeFilterTfShape(OpKernelContext* context,
                                 const Tensor& filter_tensor) {
     return GetTfShape(context, kInputIndex_Filter);
   }
 
+  // Get the Tensorflow shape of Output (diff_src),
+  // which is same as shape of Conv2D 'input'.
   TensorShape GetOutputTfShape(const TensorShape& input_shape,
                                const TensorShape& filter_shape,
                                const TensorShape& outbprop_shape) {
-    // Output Shape of Conv2DBackpropInput is same as shape of Conv2D 'input'.
     return input_shape;
   }
 
+  // Get the Tensorflow shape of Output (diff_src),
+  // which is same as shape of Conv2D 'input'.
   const memory::dims& GetOutputDims(const memory::dims& fwd_input_dims,
                                     const memory::dims& fwd_filter_dims) {
-    // Output Shape of Conv2DBackpropInput is same as shape of Conv2D 'input'.
     return fwd_input_dims;
   }
 
+  // Output layout is Tensorflow's layout in data format order.
   memory::format GetOutputFormat(const memory::format data_format) {
-    // Output layout is Tensorflow's layout in data format order.
     return data_format;
   }
 
-  void CreatePrimitive(OpKernelContext* context, const engine& cpu_engine,
-                       const convolution_forward::primitive_desc& conv_fwd_pd,
-                       MklDnnData<T>* input, MklDnnData<T>* filter,
-                       MklDnnData<T>* outbackprop, MklDnnData<T>* output,
-                       Tensor** output_tensor,
-                       const memory::dims& strides,
-                       const memory::dims& dilations,
-                       const memory::dims& padding_l,
-                       const memory::dims& padding_r, padding_kind padding,
-                       const memory::dims& bwd_output_dims,
-                       memory::format bwd_output_format) {
-    CHECK_NOTNULL(context);
-    CHECK_NOTNULL(input);
-    CHECK_NOTNULL(filter);
-    CHECK_NOTNULL(outbackprop);
-    CHECK_NOTNULL(output);
-    CHECK_NOTNULL(output_tensor);
-
-    // Create convolution backward data primitive.
-    // Use dilated convolution in case dilate rates are greater than zero.
-    auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
-        convolution_backward_data::desc(convolution_direct,
-                      output->GetOpMemDesc(), filter->GetOpMemDesc(),
-                      outbackprop->GetOpMemDesc(), strides,
-                      dilations, padding_l, padding_r, padding):
-        convolution_backward_data::desc(convolution_direct,
-                      output->GetOpMemDesc(), filter->GetOpMemDesc(),
-                      outbackprop->GetOpMemDesc(),
-                      strides, padding_l, padding_r, padding);
-
-    auto bwd_pd = convolution_backward_data::primitive_desc(
-        bwd_desc, cpu_engine, conv_fwd_pd);
-
-    // Allocate output tensor in TensorFlow and MKL layout.
-    AllocateOutputTensor(context, bwd_pd, bwd_output_dims, bwd_output_format,
-                         output_tensor);
-    CHECK_NOTNULL(*output_tensor);
-    // Set buffer handle using allocated output tensor.
-    output->SetUsrMemDataHandle(*output_tensor);
-
-    PrepareAndExecutePrimitive(bwd_pd, filter, outbackprop, output);
-  }
-
   // Allocate output tensor.
   void AllocateOutputTensor(
       OpKernelContext* context,
@@ -487,22 +837,6 @@ class MklConv2DCustomBackpropInputOp
     AllocateOutputSetMklShape(context, 0, output_tensor, output_tf_shape,
                               output_mkl_shape);
   }
-
-  // Prepare and execute net - checks for input and output reorders.
-  void PrepareAndExecutePrimitive(
-      const convolution_backward_data::primitive_desc& conv_pd,
-      MklDnnData<T>* filter, MklDnnData<T>* obp, MklDnnData<T>* output) {
-    // Create reorders between user layout and MKL layout if it is needed and
-    // add it to the net before convolution.
-    std::vector<primitive> net;
-    filter->CheckReorderToOpMem(conv_pd.weights_primitive_desc(), &net);
-    obp->CheckReorderToOpMem(conv_pd.diff_dst_primitive_desc(), &net);
-
-    net.push_back(convolution_backward_data(
-        conv_pd, obp->GetOpMem(), filter->GetOpMem(), output->GetOpMem()));
-
-    stream(stream::kind::eager).submit(net).wait();
-  }
 };
 
 #endif  // INTEL_MKL_ML
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index 1d0edb10b3..62396eeb8b 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -18,7 +18,6 @@ limitations under the License.
 
 #include <string.h>
 #include <map>
-#include <string>
 #include <vector>
 #include <memory>
 
@@ -35,6 +34,7 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/util/padding.h"
@@ -298,8 +298,8 @@ class MklConv2DFwdPrimitiveFactory : public MklPrimitiveFactory<T> {
     return instance_;
   }
 
-  static std::string CreateKey(const MklConvFwdParams& convFwdDims) {
-    std::string prefix = "conv2d_fwd_";
+  static string CreateKey(const MklConvFwdParams& convFwdDims) {
+    string prefix = "conv2d_fwd_";
     FactoryKeyCreator key_creator;
     key_creator.AddAsKey(prefix);
     key_creator.AddAsKey(convFwdDims.src_dims);
@@ -314,12 +314,12 @@ class MklConv2DFwdPrimitiveFactory : public MklPrimitiveFactory<T> {
   }
 
   MklPrimitive* GetConv2DFwd(const MklConvFwdParams& convFwdDims) {
-    std::string key = CreateKey(convFwdDims);
+    string key = CreateKey(convFwdDims);
     return this->GetOp(key);
   }
 
   void SetConv2DFwd(const MklConvFwdParams& convFwdDims, MklPrimitive* op) {
-    std::string key = CreateKey(convFwdDims);
+    string key = CreateKey(convFwdDims);
     this->SetOp(key, op);
   }
 };
@@ -899,11 +899,10 @@ class MklConv2DOp : public OpKernel {
       T* dst_data = static_cast<T*>(dst_tensor->flat<T>().data());
 
       // check whether src/filter need reorder
-      std::vector<primitive> net;
-      T* src_data = nullptr;
+      T *src_data = nullptr;
       if (src_md.data.format != conv2d_fwd->GetSrcMemoryFormat()) {
         src.SetUsrMem(src_md, &src_tensor);
-        src.CheckReorderToOpMem(conv_fwd_pd.get()->src_primitive_desc(), &net);
+        src.CheckReorderToOpMem(conv_fwd_pd.get()->src_primitive_desc());
         src_data = static_cast<T*>(src.GetOpMem().get_data_handle());
       } else {
         src_data = static_cast<T*>(const_cast<T*>(src_tensor.flat<T>().data()));
@@ -912,16 +911,13 @@ class MklConv2DOp : public OpKernel {
       if (filter_md.data.format != conv2d_fwd->GetFilterMemoryFormat()) {
         filter.SetUsrMem(filter_md, &filter_tensor);
         filter.CheckReorderToOpMem(conv_fwd_pd.get()->weights_primitive_desc(),
-                                   filter.GetTensorBuffer(filter_out_tensor),
-                                   &net);
+                                   filter.GetTensorBuffer(filter_out_tensor));
         filter_data = static_cast<T*>(filter.GetOpMem().get_data_handle());
       } else {
         filter_data =
             static_cast<T*>(const_cast<T*>(filter_tensor.flat<T>().data()));
       }
 
-      stream(stream::kind::eager).submit(net).wait();
-
 
       // execute convolution
       if (biasEnabled) {
@@ -934,10 +930,9 @@ class MklConv2DOp : public OpKernel {
         conv2d_fwd->Execute(src_data, filter_data, dst_data);
       }
     } catch (mkldnn::error &e) {
-      string error_msg = "Status: " + std::to_string(e.status) +
-                       ", message: " + std::string(e.message) +
-                       ", in file " + std::string(__FILE__) + ":" +
-                       std::to_string(__LINE__);
+      string error_msg = tensorflow::strings::StrCat(
+          "Status: ", e.status, ", message: ", string(e.message), ", in file ",
+          __FILE__, ":", __LINE__);
       OP_REQUIRES_OK(context,
         errors::Aborted("Operation received an exception:", error_msg));
     }
@@ -1016,16 +1011,15 @@ class MklConv2DOp : public OpKernel {
     // Create reorders between user layout and MKL layout if it is needed and
     // add it to the net before convolution. No need to check for output
     // reorder as we propagate output layout to the next layer.
-    std::vector<primitive> net;
-    src->CheckReorderToOpMem(conv_prim_desc.src_primitive_desc(), &net);
+    src->CheckReorderToOpMem(conv_prim_desc.src_primitive_desc());
 
     // rather than re-order to a temp buffer, reorder directly to the
     // filter output tensor
     filter->CheckReorderToOpMem(conv_prim_desc.weights_primitive_desc(),
-                                filter->GetTensorBuffer(filter_out_tensor),
-                                &net);
+                                filter->GetTensorBuffer(filter_out_tensor));
 
     // Create convolution primitive and add it to net.
+    std::vector<primitive> net;
     if (bias) {
       CHECK_EQ(biasEnabled, true);
       net.push_back(convolution_forward(conv_prim_desc, src->GetOpMem(),
diff --git a/tensorflow/core/kernels/mkl_conv_ops.h b/tensorflow/core/kernels/mkl_conv_ops.h
index 8333a09316..3f154ff33b 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.h
+++ b/tensorflow/core/kernels/mkl_conv_ops.h
@@ -17,8 +17,8 @@ limitations under the License.
 #define TENSORFLOW_CORE_KERNELS_MKL_CONV_OPS_H_
 
 #include <limits>
-#include <string>
 #include <vector>
+#include <memory>
 
 #include "tensorflow/core/framework/numeric_op.h"
 #include "tensorflow/core/framework/op_kernel.h"
@@ -349,6 +349,7 @@ class MklDnnConvUtil {
   }
 };
 
+
 /////////////////////////////////////////////////////////////////////
 ///  Common class that implements Conv2DBackpropFilter and Input
 /////////////////////////////////////////////////////////////////////
@@ -388,227 +389,17 @@ class MklConv2DBackpropCommonOp : public OpKernel {
     OP_REQUIRES_OK(context, context->GetAttr("padding", &padding_));
   }
 
-  void Compute(OpKernelContext* context) override {
-    try {
-      auto cpu_engine = engine(engine::cpu, 0);
-
-      // Prepare common tensors for Conv2DBackpropInput and
-      // Conv2DBackpropFilter.
-      MklDnnData<T> input(&cpu_engine);
-      MklDnnData<T> filter(&cpu_engine);
-      MklDnnData<T> outbackprop(&cpu_engine);
-      MklDnnData<T> output(&cpu_engine);
-
-      // Input tensors
-      const int kInputIdx = 0, kFilterIdx = 1, kOutbpropIdx = 2;
-      const Tensor& input_tensor = MklGetInput(context, kInputIdx);
-      const Tensor& filter_tensor = MklGetInput(context, kFilterIdx);
-      const Tensor& outbprop_tensor = MklGetInput(context, kOutbpropIdx);
-
-      MklDnnShape input_mkl_shape, filter_mkl_shape, outbprop_mkl_shape;
-      GetMklShape(context, kInputIdx, &input_mkl_shape);
-      GetMklShape(context, kFilterIdx, &filter_mkl_shape);
-      GetMklShape(context, kOutbpropIdx, &outbprop_mkl_shape);
-      // Allow operator-specific sanity checking of shapes.
-      ValidateMklShapes(input_mkl_shape, filter_mkl_shape, outbprop_mkl_shape);
-
-      // Allow operator-specific generation of shapes.
-      // E.g., Conv2DBackpropFilter gets filter as filter_sizes. It is a
-      // tensor containing shape of filter. So filter.shape() is not
-      // a correct way to get filter shape. These operator-specific calls
-      // allow this class to handle this case.
-      TensorShape input_tf_shape = MakeInputTfShape(context, input_tensor);
-      TensorShape filter_tf_shape = MakeFilterTfShape(context, filter_tensor);
-      TensorShape outbprop_tf_shape = GetTfShape(context, kOutbpropIdx);
-
-      // Corner cases: output with 0 elements and 0 batch size.
-      Tensor* output_tensor = nullptr;
-      if (input_tf_shape.num_elements() == 0 ||
-          filter_tf_shape.num_elements() == 0 ||
-          outbprop_tf_shape.num_elements() == 0) {
-        MklDnnShape output_mkl_shape;
-        output_mkl_shape.SetMklTensor(false);
-        TensorShape output_tf_shape = GetOutputTfShape(
-            input_tf_shape, filter_tf_shape, outbprop_tf_shape);
-        const int kOutputIdx = 0;
-        AllocateOutputSetMklShape(context, kOutputIdx, &output_tensor,
-                                  output_tf_shape, output_mkl_shape);
-        CHECK_NOTNULL(output_tensor);
-
-        // if output tensor has more than 0 elements, we need to 0 them out.
-        for (size_t i = 0; i < output_tf_shape.num_elements(); ++i) {
-          output_tensor->flat<T>().data()[i] = 0;
-        }
-
-        return;
-      }
-
-      // By default, all dims are in MKL order. Only dims in TF order
-      // are those with prefix tf_order.
-      memory::dims outbprop_dims, fwd_input_dims, fwd_filter_dims;
-      memory::dims padding_l, padding_r, dilations, strides, fwd_output_dims;
-      memory::dims fwd_output_dims_tf_order;
-
-      // Get forward convolution parameters.
-      MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_,
-                             dilations_);
-      conv_utl.GetConvFwdSizesInMklOrder(
-          input_tf_shape, filter_tf_shape, &fwd_input_dims, &fwd_filter_dims,
-          &strides, &dilations, &fwd_output_dims_tf_order, &fwd_output_dims,
-          &padding_l, &padding_r);
-      if (!context->status().ok()) return;
-
-      // Create Convolution forward descriptor since Convolution backward
-      // API needs it. For that, we first need to create input, filter
-      // and output memory descriptors.
-      auto tf_fmt = TFDataFormatToMklDnnDataFormat(data_format_);
-      // If input is in MKL layout, then simply grab input layout; otherwise,
-      // construct input TF layout. For TF layout, although input shape
-      // required is in MKL-DNN order, the layout is Tensorflow's layout
-      // (NHWC or NCHW depending on data format).
-      auto fwd_input_md =
-          input_mkl_shape.IsMklTensor()
-              ? input_mkl_shape.GetMklLayout()
-              : memory::desc(fwd_input_dims, MklDnnType<T>(), tf_fmt);
-      // If filter is in MKL layout, then simply grab filter layout; otherwise
-      // construct filter in TF layout. For TF layout, filter is in HWIO format.
-      auto fwd_filter_md = filter_mkl_shape.IsMklTensor()
-                               ? filter_mkl_shape.GetMklLayout()
-                               : memory::desc(fwd_filter_dims, MklDnnType<T>(),
-                                              memory::format::hwio);
-      // Tensorflow Output of Conv2D is in data_format order.
-      auto fwd_out_md = memory::desc(fwd_output_dims, MklDnnType<T>(), tf_fmt);
-
-      const int kDilationH = 0, kDilationW = 1;
-      dilations[kDilationH] -= 1;
-      dilations[kDilationW] -= 1;
-      auto fwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0)?
-              convolution_forward::desc(prop_kind::forward,
-                     convolution_direct, fwd_input_md,
-                     fwd_filter_md, fwd_out_md,
-                     strides, dilations, padding_l, padding_r,
-                     TFPaddingToMklDnnPadding(padding_)) :
-              convolution_forward::desc(prop_kind::forward,
-                     convolution_direct, fwd_input_md,
-                     fwd_filter_md, fwd_out_md,
-                     strides, padding_l, padding_r,
-                     TFPaddingToMklDnnPadding(padding_));
-      auto fwd_pd = convolution_forward::primitive_desc(fwd_desc, cpu_engine);
-
-      // Create memory for user data. Describe how the inputs and outputs of
-      // Convolution look like. Also specify buffers containing actual input
-      // and output data.
-
-      // Since this is a common class for both Conv2DBackpropFilter and
-      // Conv2DBackpropInput, we skip SetUsrMem call for input tensor (for
-      // Conv2DBackpropInput) and for filter tensor (for
-      // conv2DBackpropFilter) depending on which tensor is int32 type.
-      size_t input_with_sizes = GetInputTensorIndexWithSizes();
-      if (input_with_sizes != kInputIdx) {
-        // Shape of Conv2DBackpropFilter's input is same as Conv2D input.
-        input.SetUsrMem(fwd_input_md, &input_tensor);
-      } else if (input_with_sizes != kFilterIdx) {
-        // Shape of Conv2DBackpropInput's filter is same as Conv2D filter.
-        filter.SetUsrMem(fwd_filter_md, &filter_tensor);
-      }
-
-      conv_utl.GetInputSizeInMklOrder(outbprop_tf_shape, &outbprop_dims);
-      if (!context->status().ok()) return;
-      if (outbprop_mkl_shape.IsMklTensor()) {
-        // If outbackprop is in Mkl layout, then simply grab it.
-        auto outbprop_md = outbprop_mkl_shape.GetMklLayout();
-        outbackprop.SetUsrMem(outbprop_md, &outbprop_tensor);
-      } else {
-        // If outbackprop is in TensorFlow layout, then we need to create memory
-        // descriptor for it. Outbackprop shape is data format order.
-        outbackprop.SetUsrMem(outbprop_dims, tf_fmt, &outbprop_tensor);
-      }
-
-      // Operator specific call to get output shape and data_format.
-      auto bwd_output_dims = GetOutputDims(fwd_input_dims, fwd_filter_dims);
-      auto bwd_output_format = GetOutputFormat(tf_fmt);
-      output.SetUsrMem(bwd_output_dims, bwd_output_format);
-
-      // Create memory descriptors for convolution data w/ no specified format.
-      input.SetOpMemDesc(fwd_input_dims, memory::format::any);
-      filter.SetOpMemDesc(fwd_filter_dims, memory::format::any);
-      outbackprop.SetOpMemDesc(outbprop_dims, memory::format::any);
-      output.SetOpMemDesc(bwd_output_dims, memory::format::any);
-
-      // Operator-specific call to create and execute primitive.
-      CreatePrimitive(context, cpu_engine, fwd_pd, &input, &filter,
-                      &outbackprop, &output, &output_tensor,
-                      strides, dilations, padding_l, padding_r,
-                      TFPaddingToMklDnnPadding(padding_),
-                      bwd_output_dims, bwd_output_format);
-    } catch (mkldnn::error& e) {
-      string error_msg = "Status: " + std::to_string(e.status) +
-                         ", message: " + string(e.message) + ", in file " +
-                         string(__FILE__) + ":" + std::to_string(__LINE__);
-      OP_REQUIRES_OK(
-          context,
-          errors::Aborted("Operation received an exception:", error_msg));
-    }
-  }
-
-  /// Pure virtual function to allow operator to check for validity of input
-  /// shapes. Function asserts that input shapes are valid.
-  virtual void ValidateMklShapes(const MklDnnShape& input_mkl_shape,
-                                 const MklDnnShape& filter_mkl_shape,
-                                 const MklDnnShape& outbprop_mkl_shape) = 0;
-
-  /// Operator-specific function that returns index of input that is
-  /// representing input sizes. For Conv2DBackpropFilter it returns 1 since
-  /// filter for this operator is filter shape. For Conv2DBackpropInput it
-  /// returns 0 (for input).
-  virtual size_t GetInputTensorIndexWithSizes() = 0;
-
-  /// Get TensorFlow shape of input tensor.
-  virtual TensorShape MakeInputTfShape(OpKernelContext* context,
-                                       const Tensor& input_tensor) = 0;
-
-  /// Get TensorFlow shape of filter tensor.
-  virtual TensorShape MakeFilterTfShape(OpKernelContext* context,
-                                        const Tensor& filter_tensor) = 0;
-
-  /// Get the TensorFlow shape of output tensor.
-  virtual TensorShape GetOutputTfShape(const TensorShape& input_shape,
-                                       const TensorShape& filter_shape,
-                                       const TensorShape& outbprop_shape) = 0;
-
-  /// Get shape of output in MKL-DNN order. Computes shape of output from
-  /// input shape (fwd_input_dims) and filter shape (fwd_filter_dims).
-  virtual const memory::dims& GetOutputDims(
-      const memory::dims& fwd_input_dims,
-      const memory::dims& fwd_filter_dims) = 0;
-
-  /// Get data_format of output in MKL-DNN order. If output data format is
-  /// same as input data format, then it simply returns value of data_format
-  /// parameter as it is.
-  virtual memory::format GetOutputFormat(const memory::format data_format) = 0;
-
-  /// Create and execute the primitive storing output in the output_tensor.
-  virtual void CreatePrimitive(OpKernelContext* context,
-    const engine& cpu_engine,
-    const convolution_forward::primitive_desc& conv_fwd_pd,
-    MklDnnData<T>* input, MklDnnData<T>* filter, MklDnnData<T>* outbackprop,
-    MklDnnData<T>* output, Tensor** output_tensor, const memory::dims& strides,
-    const memory::dims& dilations, const memory::dims& padding_l,
-    const memory::dims& padding_r, padding_kind padding,
-    const memory::dims& bwd_output_dims,
-    memory::format bwd_output_format) = 0;
-
-  // Get the data_format {NCHW, NHWC}
-  TensorFormat GetTFDataFormat() { return data_format_; }
-
- private:
+ protected:
+  // data members accessible to derived classes.
   std::vector<int32> dilations_;
   std::vector<int32> strides_;
   Padding padding_;
-  TensorFormat data_format_;
+  TensorFormat data_format_;  // NCHW or NHWC
 };
+
 #endif  // INTEL_MKL_ML
 
+
 /////////////////////////////////////////////////////////////////////
 ///  Dummy Mkl op that is just used for operators that are intermediate
 ///  output of node fusion in the graph
diff --git a/tensorflow/core/kernels/mkl_lrn_op.cc b/tensorflow/core/kernels/mkl_lrn_op.cc
index dfe50e6a7f..7966c271d5 100644
--- a/tensorflow/core/kernels/mkl_lrn_op.cc
+++ b/tensorflow/core/kernels/mkl_lrn_op.cc
@@ -847,12 +847,12 @@ class MklLRNOp : public OpKernel {
                             MklDnnData<T>* src_dnn_data,
                             MklDnnData<T>* dst_dnn_data,
                             MklDnnData<uint8>* wksp_dnn_data = nullptr) {
-    std::vector<primitive> net;
 
     // Check for input reorder
-    src_dnn_data->CheckReorderToOpMem(lrn_fwd_desc.src_primitive_desc(), &net);
+    src_dnn_data->CheckReorderToOpMem(lrn_fwd_desc.src_primitive_desc());
 
     // Create pooling primitive and add it to net
+    std::vector<primitive> net;
     if (wksp_dnn_data != nullptr) {
       net.push_back(lrn_forward(lrn_fwd_desc, src_dnn_data->GetOpMem(),
                                 wksp_dnn_data->GetOpMem(),
@@ -1160,15 +1160,15 @@ class MklLRNGradOp : public OpKernel {
       MklDnnData<T>* output_diff_src,
       const memory::primitive_desc& target_diff_dst_pd,
       const MklDnnData<uint8>* workspace_dnn_data = nullptr) {
-    std::vector<primitive> net;
 
     // Check for input reordering on the diff dst input
     input_gradient_diff_dst->CheckReorderToOpMem(
-        lrn_bkwd_desc.diff_dst_primitive_desc(), &net);
+        lrn_bkwd_desc.diff_dst_primitive_desc());
 
     // Check for input reordering on the original input
-    src_dnn_data->CheckReorderToOpMem(lrn_fwd_desc.src_primitive_desc(), &net);
+    src_dnn_data->CheckReorderToOpMem(lrn_fwd_desc.src_primitive_desc());
     // Create pooling primitive and add it to net
+    std::vector<primitive> net;
     if (nullptr == workspace_dnn_data) {
       net.push_back(lrn_backward(lrn_bkwd_desc, src_dnn_data->GetOpMem(),
                                  input_gradient_diff_dst->GetOpMem(),
diff --git a/tensorflow/core/kernels/mkl_pooling_ops_common.h b/tensorflow/core/kernels/mkl_pooling_ops_common.h
index c0dfed7d7d..cb1eecb36a 100644
--- a/tensorflow/core/kernels/mkl_pooling_ops_common.h
+++ b/tensorflow/core/kernels/mkl_pooling_ops_common.h
@@ -17,7 +17,6 @@ limitations under the License.
 #define TENSORFLOW_CORE_KERNELS_MKL_POOLING_OPS_COMMON_H_
 
 #ifdef INTEL_MKL
-#include <string>
 #include <vector>
 #include "tensorflow/core/util/mkl_util.h"
 #include "tensorflow/core/util/padding.h"
diff --git a/tensorflow/core/kernels/mkl_reshape_op.cc b/tensorflow/core/kernels/mkl_reshape_op.cc
index c44a6f3477..9c536df215 100644
--- a/tensorflow/core/kernels/mkl_reshape_op.cc
+++ b/tensorflow/core/kernels/mkl_reshape_op.cc
@@ -152,8 +152,12 @@ class MklReshapeOp : public OpKernel {
     // If Tensorflow's data format and the underlying format maintained by
     // MKLDNN are equivalent (both are NHWC or both are NCHW), then we can
     // safely return true.
+    // @todo: Future do not force skip reorder for all blocked format. Use
+    // blocking_desc_is_equal() for checking all the stride arrays in
+    // mkl-dnn/blob/master/src/common/type_helpers.hpp
     auto input_mkl_md = mkl_shape_input.GetMklLayout();
-    if (mkl_shape_input.GetTfDataFormat() == input_mkl_md.data.format) {
+    if (mkl_shape_input.GetTfDataFormat() == input_mkl_md.data.format &&
+        mkl_shape_input.GetTfDataFormat() != memory::format::blocked) {
       ret = true;
     }
 
@@ -263,10 +267,7 @@ class MklReshapeOp : public OpKernel {
             // shape_from != shape_to), then we just copy input tensor to
             // output tensor with target shape (we cannot forward Mkl layout
             // in such case because shape has changed.)
-            std::vector<primitive> net;
-            if (dnn_data_input.CheckReorderToOpMem(output_tf_pd, output_tensor,
-                                                   &net)) {
-              stream(stream::kind::eager).submit(net).wait();
+            if (dnn_data_input.CheckReorderToOpMem(output_tf_pd, output_tensor)) {
             } else {
               OP_REQUIRES(
                   context, output_tensor->CopyFrom(input_tensor, shape_to),
diff --git a/tensorflow/core/kernels/mkl_tfconv_op.h b/tensorflow/core/kernels/mkl_tfconv_op.h
index 7e8ed1b1d6..f4f0035f26 100644
--- a/tensorflow/core/kernels/mkl_tfconv_op.h
+++ b/tensorflow/core/kernels/mkl_tfconv_op.h
@@ -111,10 +111,8 @@ class MklToTfOp : public OpKernel {
       // Do we need to reorder Mkl layout into TensorFlow layout?
       if (input.IsReorderNeeded(output_tf_pd)) {
         // Insert reorder between Mkl layout and TensorFlow layout.
-        std::vector<primitive> net;
-        CHECK_EQ(input.CheckReorderToOpMem(output_tf_pd, output_tensor, &net),
+        CHECK_EQ(input.CheckReorderToOpMem(output_tf_pd, output_tensor),
                  true);
-        stream(stream::kind::eager).submit(net).wait();
       } else {
         // If not, just forward input tensor to output tensor.
         CHECK(output_tensor->CopyFrom(input_tensor, output_shape));
diff --git a/tensorflow/core/kernels/non_max_suppression_op.cc b/tensorflow/core/kernels/non_max_suppression_op.cc
index f08dd4f750..c7d0d4de0d 100644
--- a/tensorflow/core/kernels/non_max_suppression_op.cc
+++ b/tensorflow/core/kernels/non_max_suppression_op.cc
@@ -19,6 +19,7 @@ limitations under the License.
 
 #include "tensorflow/core/kernels/non_max_suppression_op.h"
 
+#include <functional>
 #include <queue>
 #include <vector>
 
@@ -38,9 +39,32 @@ namespace {
 
 typedef Eigen::ThreadPoolDevice CPUDevice;
 
+static inline void CheckScoreSizes(OpKernelContext* context, int num_boxes,
+                                   const Tensor& scores) {
+  // The shape of 'scores' is [num_boxes]
+  OP_REQUIRES(context, scores.dims() == 1,
+              errors::InvalidArgument("scores must be 1-D",
+                                      scores.shape().DebugString()));
+  OP_REQUIRES(context, scores.dim_size(0) == num_boxes,
+              errors::InvalidArgument("scores has incompatible shape"));
+}
+
+static inline void ParseAndCheckOverlapSizes(OpKernelContext* context,
+                                             const Tensor& overlaps,
+                                             int* num_boxes) {
+  // the shape of 'overlaps' is [num_boxes, num_boxes]
+  OP_REQUIRES(context, overlaps.dims() == 2,
+              errors::InvalidArgument("overlaps must be 2-D",
+                                      overlaps.shape().DebugString()));
+
+  *num_boxes = overlaps.dim_size(0);
+  OP_REQUIRES(context, overlaps.dim_size(1) == *num_boxes,
+              errors::InvalidArgument("overlaps must be square",
+                                      overlaps.shape().DebugString()));
+}
+
 static inline void ParseAndCheckBoxSizes(OpKernelContext* context,
-                                         const Tensor& boxes,
-                                         const Tensor& scores, int* num_boxes) {
+                                         const Tensor& boxes, int* num_boxes) {
   // The shape of 'boxes' is [num_boxes, 4]
   OP_REQUIRES(context, boxes.dims() == 2,
               errors::InvalidArgument("boxes must be 2-D",
@@ -48,18 +72,12 @@ static inline void ParseAndCheckBoxSizes(OpKernelContext* context,
   *num_boxes = boxes.dim_size(0);
   OP_REQUIRES(context, boxes.dim_size(1) == 4,
               errors::InvalidArgument("boxes must have 4 columns"));
-
-  // The shape of 'scores' is [num_boxes]
-  OP_REQUIRES(context, scores.dims() == 1,
-              errors::InvalidArgument("scores must be 1-D",
-                                      scores.shape().DebugString()));
-  OP_REQUIRES(context, scores.dim_size(0) == *num_boxes,
-              errors::InvalidArgument("scores has incompatible shape"));
 }
 
 // Return intersection-over-union overlap between boxes i and j
-static inline float IOU(typename TTypes<float, 2>::ConstTensor boxes, int i,
-                        int j) {
+static inline float IOUGreaterThanThreshold(
+    typename TTypes<float, 2>::ConstTensor boxes, int i, int j,
+    float iou_threshold) {
   const float ymin_i = std::min<float>(boxes(i, 0), boxes(i, 2));
   const float xmin_i = std::min<float>(boxes(i, 1), boxes(i, 3));
   const float ymax_i = std::max<float>(boxes(i, 0), boxes(i, 2));
@@ -78,24 +96,37 @@ static inline float IOU(typename TTypes<float, 2>::ConstTensor boxes, int i,
   const float intersection_area =
       std::max<float>(intersection_ymax - intersection_ymin, 0.0) *
       std::max<float>(intersection_xmax - intersection_xmin, 0.0);
-  return intersection_area / (area_i + area_j - intersection_area);
+  const float iou = intersection_area / (area_i + area_j - intersection_area);
+  return iou > iou_threshold;
 }
 
-void DoNonMaxSuppressionOp(OpKernelContext* context, const Tensor& boxes,
-                           const Tensor& scores, const Tensor& max_output_size,
-                           const float iou_threshold,
-                           const float score_threshold) {
-  OP_REQUIRES(context, iou_threshold >= 0 && iou_threshold <= 1,
-              errors::InvalidArgument("iou_threshold must be in [0, 1]"));
-
-  int num_boxes = 0;
-  ParseAndCheckBoxSizes(context, boxes, scores, &num_boxes);
-  if (!context->status().ok()) {
-    return;
-  }
+static inline bool OverlapsGreaterThanThreshold(
+    typename TTypes<float, 2>::ConstTensor overlaps, int i, int j,
+    float overlap_threshold) {
+  return overlaps(i, j) > overlap_threshold;
+}
 
+static inline std::function<bool(int, int)> CreateIOUSuppressCheckFn(
+    const Tensor& boxes, float threshold) {
+  typename TTypes<float, 2>::ConstTensor boxes_data = boxes.tensor<float, 2>();
+  return std::bind(&IOUGreaterThanThreshold, boxes_data, std::placeholders::_1,
+                   std::placeholders::_2, threshold);
+}
+
+static inline std::function<bool(int, int)> CreateOverlapsSuppressCheckFn(
+    const Tensor& overlaps, float threshold) {
+  typename TTypes<float, 2>::ConstTensor overlaps_data =
+      overlaps.tensor<float, 2>();
+  return std::bind(&OverlapsGreaterThanThreshold, overlaps_data,
+                   std::placeholders::_1, std::placeholders::_2, threshold);
+}
+
+void DoNonMaxSuppressionOp(
+    OpKernelContext* context, const Tensor& scores, int num_boxes,
+    const Tensor& max_output_size, const float score_threshold,
+    const std::function<bool(int, int)>& suppress_check_fn,
+    bool pad_to_max_output_size = false, int* ptr_num_valid_outputs = nullptr) {
   const int output_size = std::min(max_output_size.scalar<int>()(), num_boxes);
-  TTypes<float, 2>::ConstTensor boxes_data = boxes.tensor<float, 2>();
 
   std::vector<float> scores_data(num_boxes);
   std::copy_n(scores.flat<float>().data(), num_boxes, scores_data.begin());
@@ -120,11 +151,9 @@ void DoNonMaxSuppressionOp(OpKernelContext* context, const Tensor& boxes,
   std::vector<int> selected;
   std::vector<float> selected_scores;
   Candidate next_candidate;
-  float iou, original_score;
 
   while (selected.size() < output_size && !candidate_priority_queue.empty()) {
     next_candidate = candidate_priority_queue.top();
-    original_score = next_candidate.score;
     candidate_priority_queue.pop();
 
     // Overlapping boxes are likely to have similar scores,
@@ -132,8 +161,10 @@ void DoNonMaxSuppressionOp(OpKernelContext* context, const Tensor& boxes,
     // in order to see if `next_candidate` should be suppressed.
     bool should_select = true;
     for (int j = selected.size() - 1; j >= 0; --j) {
-      iou = IOU(boxes_data, next_candidate.box_index, selected[j]);
-      if (iou > iou_threshold) should_select = false;
+      if (suppress_check_fn(next_candidate.box_index, selected[j])) {
+        should_select = false;
+        break;
+      }
     }
 
     if (should_select) {
@@ -142,6 +173,15 @@ void DoNonMaxSuppressionOp(OpKernelContext* context, const Tensor& boxes,
     }
   }
 
+  int num_valid_outputs = selected.size();
+  if (pad_to_max_output_size) {
+    selected.resize(output_size, 0);
+    selected_scores.resize(output_size, 0);
+  }
+  if (ptr_num_valid_outputs) {
+    *ptr_num_valid_outputs = num_valid_outputs;
+  }
+
   // Allocate output tensors
   Tensor* output_indices = nullptr;
   TensorShape output_shape({static_cast<int>(selected.size())});
@@ -173,9 +213,19 @@ class NonMaxSuppressionOp : public OpKernel {
         errors::InvalidArgument("max_output_size must be 0-D, got shape ",
                                 max_output_size.shape().DebugString()));
 
+    OP_REQUIRES(context, iou_threshold_ >= 0 && iou_threshold_ <= 1,
+                errors::InvalidArgument("iou_threshold must be in [0, 1]"));
+    int num_boxes = 0;
+    ParseAndCheckBoxSizes(context, boxes, &num_boxes);
+    CheckScoreSizes(context, num_boxes, scores);
+    if (!context->status().ok()) {
+      return;
+    }
+    auto suppress_check_fn = CreateIOUSuppressCheckFn(boxes, iou_threshold_);
+
     const float score_threshold_val = std::numeric_limits<float>::lowest();
-    DoNonMaxSuppressionOp(context, boxes, scores, max_output_size,
-                          iou_threshold_, score_threshold_val);
+    DoNonMaxSuppressionOp(context, scores, num_boxes, max_output_size,
+                          score_threshold_val, suppress_check_fn);
   }
 
  private:
@@ -206,35 +256,145 @@ class NonMaxSuppressionV2Op : public OpKernel {
                                         iou_threshold.shape().DebugString()));
     const float iou_threshold_val = iou_threshold.scalar<float>()();
 
+    OP_REQUIRES(context, iou_threshold_val >= 0 && iou_threshold_val <= 1,
+                errors::InvalidArgument("iou_threshold must be in [0, 1]"));
+    int num_boxes = 0;
+    ParseAndCheckBoxSizes(context, boxes, &num_boxes);
+    CheckScoreSizes(context, num_boxes, scores);
+    if (!context->status().ok()) {
+      return;
+    }
+    auto suppress_check_fn = CreateIOUSuppressCheckFn(boxes, iou_threshold_val);
+
     const float score_threshold_val = std::numeric_limits<float>::lowest();
-    DoNonMaxSuppressionOp(context, boxes, scores, max_output_size,
-                          iou_threshold_val, score_threshold_val);
+    DoNonMaxSuppressionOp(context, scores, num_boxes, max_output_size,
+                          score_threshold_val, suppress_check_fn);
   }
 };
 
-template <typename Device>
-class NonMaxSuppressionV3Op : public OpKernel {
+class NonMaxSuppressionV3V4Base : public OpKernel {
  public:
-  explicit NonMaxSuppressionV3Op(OpKernelConstruction* context)
+  explicit NonMaxSuppressionV3V4Base(OpKernelConstruction* context)
       : OpKernel(context) {}
 
   void Compute(OpKernelContext* context) override {
     // boxes: [num_boxes, 4]
-    const Tensor& boxes = context->input(0);
+    boxes_ = context->input(0);
     // scores: [num_boxes]
-    const Tensor& scores = context->input(1);
+    scores_ = context->input(1);
     // max_output_size: scalar
-    const Tensor& max_output_size = context->input(2);
+    max_output_size_ = context->input(2);
     OP_REQUIRES(
-        context, TensorShapeUtils::IsScalar(max_output_size.shape()),
+        context, TensorShapeUtils::IsScalar(max_output_size_.shape()),
         errors::InvalidArgument("max_output_size must be 0-D, got shape ",
-                                max_output_size.shape().DebugString()));
+                                max_output_size_.shape().DebugString()));
     // iou_threshold: scalar
     const Tensor& iou_threshold = context->input(3);
     OP_REQUIRES(context, TensorShapeUtils::IsScalar(iou_threshold.shape()),
                 errors::InvalidArgument("iou_threshold must be 0-D, got shape ",
                                         iou_threshold.shape().DebugString()));
-    const float iou_threshold_val = iou_threshold.scalar<float>()();
+    iou_threshold_val_ = iou_threshold.scalar<float>()();
+    OP_REQUIRES(context, iou_threshold_val_ >= 0 && iou_threshold_val_ <= 1,
+                errors::InvalidArgument("iou_threshold must be in [0, 1]"));
+    // score_threshold: scalar
+    const Tensor& score_threshold = context->input(4);
+    OP_REQUIRES(
+        context, TensorShapeUtils::IsScalar(score_threshold.shape()),
+        errors::InvalidArgument("score_threshold must be 0-D, got shape ",
+                                score_threshold.shape().DebugString()));
+    score_threshold_val_ = score_threshold.scalar<float>()();
+
+    num_boxes_ = 0;
+    ParseAndCheckBoxSizes(context, boxes_, &num_boxes_);
+    CheckScoreSizes(context, num_boxes_, scores_);
+    if (!context->status().ok()) {
+      return;
+    }
+
+    DoComputeAndPostProcess(context);
+  }
+
+ protected:
+  virtual void DoComputeAndPostProcess(OpKernelContext* context) = 0;
+
+  Tensor boxes_;
+  Tensor scores_;
+  Tensor max_output_size_;
+  int num_boxes_;
+  float iou_threshold_val_;
+  float score_threshold_val_;
+};
+
+template <typename Device>
+class NonMaxSuppressionV3Op : public NonMaxSuppressionV3V4Base {
+ public:
+  explicit NonMaxSuppressionV3Op(OpKernelConstruction* context)
+      : NonMaxSuppressionV3V4Base(context) {}
+
+ protected:
+  void DoComputeAndPostProcess(OpKernelContext* context) override {
+    auto suppress_check_fn =
+        CreateIOUSuppressCheckFn(boxes_, iou_threshold_val_);
+
+    DoNonMaxSuppressionOp(context, scores_, num_boxes_, max_output_size_,
+                          score_threshold_val_, suppress_check_fn);
+  }
+};
+
+template <typename Device>
+class NonMaxSuppressionV4Op : public NonMaxSuppressionV3V4Base {
+ public:
+  explicit NonMaxSuppressionV4Op(OpKernelConstruction* context)
+      : NonMaxSuppressionV3V4Base(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("pad_to_max_output_size",
+                                             &pad_to_max_output_size_));
+  }
+
+ protected:
+  void DoComputeAndPostProcess(OpKernelContext* context) override {
+    auto suppress_check_fn =
+        CreateIOUSuppressCheckFn(boxes_, iou_threshold_val_);
+    int num_valid_outputs;
+
+    DoNonMaxSuppressionOp(context, scores_, num_boxes_, max_output_size_,
+                          score_threshold_val_, suppress_check_fn,
+                          pad_to_max_output_size_, &num_valid_outputs);
+
+    // Allocate scalar output tensor for number of indices computed.
+    Tensor* num_outputs_t = nullptr;
+    OP_REQUIRES_OK(context, context->allocate_output(
+                                1, tensorflow::TensorShape{}, &num_outputs_t));
+    num_outputs_t->scalar<int32>().setConstant(num_valid_outputs);
+  }
+
+ private:
+  bool pad_to_max_output_size_;
+};
+
+template <typename Device>
+class NonMaxSuppressionWithOverlapsOp : public OpKernel {
+ public:
+  explicit NonMaxSuppressionWithOverlapsOp(OpKernelConstruction* context)
+      : OpKernel(context) {}
+
+  void Compute(OpKernelContext* context) override {
+    // overlaps: [num_boxes, num_boxes]
+    const Tensor& overlaps = context->input(0);
+    // scores: [num_boxes]
+    const Tensor& scores = context->input(1);
+    // max_output_size: scalar
+    const Tensor& max_output_size = context->input(2);
+    OP_REQUIRES(
+        context, TensorShapeUtils::IsScalar(max_output_size.shape()),
+        errors::InvalidArgument("max_output_size must be 0-D, got shape ",
+                                max_output_size.shape().DebugString()));
+    // overlap_threshold: scalar
+    const Tensor& overlap_threshold = context->input(3);
+    OP_REQUIRES(
+        context, TensorShapeUtils::IsScalar(overlap_threshold.shape()),
+        errors::InvalidArgument("overlap_threshold must be 0-D, got shape ",
+                                overlap_threshold.shape().DebugString()));
+    const float overlap_threshold_val = overlap_threshold.scalar<float>()();
 
     // score_threshold: scalar
     const Tensor& score_threshold = context->input(4);
@@ -244,8 +404,17 @@ class NonMaxSuppressionV3Op : public OpKernel {
                                 score_threshold.shape().DebugString()));
     const float score_threshold_val = score_threshold.scalar<float>()();
 
-    DoNonMaxSuppressionOp(context, boxes, scores, max_output_size,
-                          iou_threshold_val, score_threshold_val);
+    int num_boxes = 0;
+    ParseAndCheckOverlapSizes(context, overlaps, &num_boxes);
+    CheckScoreSizes(context, num_boxes, scores);
+    if (!context->status().ok()) {
+      return;
+    }
+    auto suppress_check_fn =
+        CreateOverlapsSuppressCheckFn(overlaps, overlap_threshold_val);
+
+    DoNonMaxSuppressionOp(context, scores, num_boxes, max_output_size,
+                          score_threshold_val, suppress_check_fn);
   }
 };
 
@@ -258,4 +427,11 @@ REGISTER_KERNEL_BUILDER(Name("NonMaxSuppressionV2").Device(DEVICE_CPU),
 REGISTER_KERNEL_BUILDER(Name("NonMaxSuppressionV3").Device(DEVICE_CPU),
                         NonMaxSuppressionV3Op<CPUDevice>);
 
+REGISTER_KERNEL_BUILDER(Name("NonMaxSuppressionV4").Device(DEVICE_CPU),
+                        NonMaxSuppressionV4Op<CPUDevice>);
+
+REGISTER_KERNEL_BUILDER(
+    Name("NonMaxSuppressionWithOverlaps").Device(DEVICE_CPU),
+    NonMaxSuppressionWithOverlapsOp<CPUDevice>);
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/non_max_suppression_op_test.cc b/tensorflow/core/kernels/non_max_suppression_op_test.cc
index ed7db313bd..c321849f40 100644
--- a/tensorflow/core/kernels/non_max_suppression_op_test.cc
+++ b/tensorflow/core/kernels/non_max_suppression_op_test.cc
@@ -569,4 +569,296 @@ TEST_F(NonMaxSuppressionV3OpTest, TestEmptyInput) {
   test::ExpectTensorEqual<int>(expected, *GetOutput(0));
 }
 
+//
+// NonMaxSuppressionV4Op Tests
+//
+
+class NonMaxSuppressionV4OpTest : public OpsTestBase {
+ protected:
+  void MakeOp() {
+    TF_EXPECT_OK(NodeDefBuilder("non_max_suppression_op", "NonMaxSuppressionV4")
+                     .Input(FakeInput(DT_FLOAT))
+                     .Input(FakeInput(DT_FLOAT))
+                     .Input(FakeInput(DT_INT32))
+                     .Input(FakeInput(DT_FLOAT))
+                     .Input(FakeInput(DT_FLOAT))
+                     .Attr("pad_to_max_output_size", true)
+                     .Finalize(node_def()));
+    TF_EXPECT_OK(InitOp());
+  }
+};
+
+TEST_F(NonMaxSuppressionV4OpTest, TestSelectFromThreeClustersPadFive) {
+  MakeOp();
+  AddInputFromArray<float>(
+      TensorShape({6, 4}),
+      {0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+       0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(TensorShape({6}), {.9f, .75f, .6f, .95f, .5f, .3f});
+  AddInputFromArray<int>(TensorShape({}), {5});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  const auto expected_indices = test::AsTensor<int>({3, 0, 5, 0, 0});
+  test::ExpectTensorEqual<int>(expected_indices, *GetOutput(0));
+  Tensor expected_num_valid = test::AsScalar<int>(3);
+  test::ExpectTensorEqual<int>(expected_num_valid, *GetOutput(1));
+}
+
+TEST_F(NonMaxSuppressionV4OpTest, TestSelectFromThreeClustersPadFiveScoreThr) {
+  MakeOp();
+  AddInputFromArray<float>(
+      TensorShape({6, 4}),
+      {0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+       0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(TensorShape({6}), {.9f, .75f, .6f, .95f, .5f, .3f});
+  AddInputFromArray<int>(TensorShape({}), {6});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.4f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  const auto expected_indices = test::AsTensor<int>({3, 0, 0, 0, 0, 0});
+  test::ExpectTensorEqual<int>(expected_indices, *GetOutput(0));
+  Tensor expected_num_valid = test::AsScalar<int>(2);
+  test::ExpectTensorEqual<int>(expected_num_valid, *GetOutput(1));
+}
+
+//
+// NonMaxSuppressionWithOverlapsOp Tests
+//
+
+class NonMaxSuppressionWithOverlapsOpTest : public OpsTestBase {
+ protected:
+  void MakeOp() {
+    TF_EXPECT_OK(NodeDefBuilder("non_max_suppression_op",
+                                "NonMaxSuppressionWithOverlaps")
+                     .Input(FakeInput(DT_FLOAT))
+                     .Input(FakeInput(DT_FLOAT))
+                     .Input(FakeInput(DT_INT32))
+                     .Input(FakeInput(DT_FLOAT))
+                     .Input(FakeInput(DT_FLOAT))
+                     .Finalize(node_def()));
+    TF_EXPECT_OK(InitOp());
+  }
+
+  void AddIoUInput(const std::vector<float>& boxes) {
+    ASSERT_EQ((boxes.size() % 4), 0);
+    size_t num_boxes = boxes.size() / 4;
+    std::vector<float> iou_overlaps(num_boxes * num_boxes);
+
+    // compute the pairwise IoU overlaps
+    auto corner_access = [&boxes](size_t box_idx, size_t corner_idx) {
+      return boxes[box_idx * 4 + corner_idx];
+    };
+    for (size_t i = 0; i < num_boxes; ++i) {
+      for (size_t j = 0; j < num_boxes; ++j) {
+        const float ymin_i =
+            std::min<float>(corner_access(i, 0), corner_access(i, 2));
+        const float xmin_i =
+            std::min<float>(corner_access(i, 1), corner_access(i, 3));
+        const float ymax_i =
+            std::max<float>(corner_access(i, 0), corner_access(i, 2));
+        const float xmax_i =
+            std::max<float>(corner_access(i, 1), corner_access(i, 3));
+        const float ymin_j =
+            std::min<float>(corner_access(j, 0), corner_access(j, 2));
+        const float xmin_j =
+            std::min<float>(corner_access(j, 1), corner_access(j, 3));
+        const float ymax_j =
+            std::max<float>(corner_access(j, 0), corner_access(j, 2));
+        const float xmax_j =
+            std::max<float>(corner_access(j, 1), corner_access(j, 3));
+        const float area_i = (ymax_i - ymin_i) * (xmax_i - xmin_i);
+        const float area_j = (ymax_j - ymin_j) * (xmax_j - xmin_j);
+
+        float iou;
+        if (area_i <= 0 || area_j <= 0) {
+          iou = 0.0;
+        } else {
+          const float intersection_ymin = std::max<float>(ymin_i, ymin_j);
+          const float intersection_xmin = std::max<float>(xmin_i, xmin_j);
+          const float intersection_ymax = std::min<float>(ymax_i, ymax_j);
+          const float intersection_xmax = std::min<float>(xmax_i, xmax_j);
+          const float intersection_area =
+              std::max<float>(intersection_ymax - intersection_ymin, 0.0) *
+              std::max<float>(intersection_xmax - intersection_xmin, 0.0);
+          iou = intersection_area / (area_i + area_j - intersection_area);
+        }
+        iou_overlaps[i * num_boxes + j] = iou;
+      }
+    }
+
+    AddInputFromArray<float>(TensorShape({static_cast<signed>(num_boxes),
+                                          static_cast<signed>(num_boxes)}),
+                             iou_overlaps);
+  }
+};
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestSelectFromThreeClusters) {
+  MakeOp();
+  AddIoUInput({0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+               0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(TensorShape({6}), {.9f, .75f, .6f, .95f, .5f, .3f});
+  AddInputFromArray<int>(TensorShape({}), {3});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({3}));
+  test::FillValues<int>(&expected, {3, 0, 5});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest,
+       TestSelectFromThreeClustersFlippedCoordinates) {
+  MakeOp();
+  AddIoUInput({1, 1,  0, 0,  0, 0.1f,  1, 1.1f,  0, .9f, 1, -0.1f,
+               0, 10, 1, 11, 1, 10.1f, 0, 11.1f, 1, 101, 0, 100});
+  AddInputFromArray<float>(TensorShape({6}), {.9f, .75f, .6f, .95f, .5f, .3f});
+  AddInputFromArray<int>(TensorShape({}), {3});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({3}));
+  test::FillValues<int>(&expected, {3, 0, 5});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest,
+       TestSelectAtMostTwoBoxesFromThreeClusters) {
+  MakeOp();
+  AddIoUInput({0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+               0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(TensorShape({6}), {.9f, .75f, .6f, .95f, .5f, .3f});
+  AddInputFromArray<int>(TensorShape({}), {2});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({2}));
+  test::FillValues<int>(&expected, {3, 0});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest,
+       TestSelectAtMostThirtyBoxesFromThreeClusters) {
+  MakeOp();
+  AddIoUInput({0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+               0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(TensorShape({6}), {.9f, .75f, .6f, .95f, .5f, .3f});
+  AddInputFromArray<int>(TensorShape({}), {30});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({3}));
+  test::FillValues<int>(&expected, {3, 0, 5});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestSelectSingleBox) {
+  MakeOp();
+  AddIoUInput({0, 0, 1, 1});
+  AddInputFromArray<float>(TensorShape({1}), {.9f});
+  AddInputFromArray<int>(TensorShape({}), {3});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({1}));
+  test::FillValues<int>(&expected, {0});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestSelectFromTenIdenticalBoxes) {
+  MakeOp();
+
+  int num_boxes = 10;
+  std::vector<float> corners(num_boxes * 4);
+  std::vector<float> scores(num_boxes);
+  for (int i = 0; i < num_boxes; ++i) {
+    corners[i * 4 + 0] = 0;
+    corners[i * 4 + 1] = 0;
+    corners[i * 4 + 2] = 1;
+    corners[i * 4 + 3] = 1;
+    scores[i] = .9;
+  }
+  AddIoUInput(corners);
+  AddInputFromArray<float>(TensorShape({num_boxes}), scores);
+  AddInputFromArray<int>(TensorShape({}), {3});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({1}));
+  test::FillValues<int>(&expected, {0});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestInconsistentBoxAndScoreShapes) {
+  MakeOp();
+  AddIoUInput({0, 0,  1, 1,  0, 0.1f,  1, 1.1f,  0, -0.1f, 1, 0.9f,
+               0, 10, 1, 11, 0, 10.1f, 1, 11.1f, 0, 100,   1, 101});
+  AddInputFromArray<float>(TensorShape({5}), {.9f, .75f, .6f, .95f, .5f});
+  AddInputFromArray<int>(TensorShape({}), {30});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  Status s = RunOpKernel();
+
+  ASSERT_FALSE(s.ok());
+  EXPECT_TRUE(
+      str_util::StrContains(s.ToString(), "scores has incompatible shape"))
+      << s;
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestInvalidOverlapsShape) {
+  MakeOp();
+  AddInputFromArray<float>(TensorShape({2, 3}), {0, 0, 0, 0, 0, 0});
+  AddInputFromArray<float>(TensorShape({2}), {0.5f, 0.5f});
+  AddInputFromArray<int>(TensorShape({}), {30});
+  AddInputFromArray<float>(TensorShape({}), {0.f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  Status s = RunOpKernel();
+
+  ASSERT_FALSE(s.ok());
+  EXPECT_TRUE(str_util::StrContains(s.ToString(), "overlaps must be square"))
+      << s;
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestThresholdGreaterOne) {
+  MakeOp();
+  AddIoUInput({0, 0, 1, 1});
+  AddInputFromArray<float>(TensorShape({1}), {.9f});
+  AddInputFromArray<int>(TensorShape({}), {3});
+  AddInputFromArray<float>(TensorShape({}), {1.2f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestThresholdSmallerZero) {
+  MakeOp();
+  AddIoUInput({0, 0, 1, 1});
+  AddInputFromArray<float>(TensorShape({1}), {.9f});
+  AddInputFromArray<int>(TensorShape({}), {3});
+  AddInputFromArray<float>(TensorShape({}), {-0.2f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+}
+
+TEST_F(NonMaxSuppressionWithOverlapsOpTest, TestEmptyInput) {
+  MakeOp();
+  AddIoUInput({});
+  AddInputFromArray<float>(TensorShape({0}), {});
+  AddInputFromArray<int>(TensorShape({}), {30});
+  AddInputFromArray<float>(TensorShape({}), {.5f});
+  AddInputFromArray<float>(TensorShape({}), {0.0f});
+  TF_ASSERT_OK(RunOpKernel());
+
+  Tensor expected(allocator(), DT_INT32, TensorShape({0}));
+  test::FillValues<int>(&expected, {});
+  test::ExpectTensorEqual<int>(expected, *GetOutput(0));
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/partitioned_function_ops.cc b/tensorflow/core/kernels/partitioned_function_ops.cc
index b6ee808091..a7a9609c21 100644
--- a/tensorflow/core/kernels/partitioned_function_ops.cc
+++ b/tensorflow/core/kernels/partitioned_function_ops.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/core/framework/graph_to_functiondef.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/graph/graph.h"
 #include "tensorflow/core/graph/graph_constructor.h"
 #include "tensorflow/core/graph/graph_partition.h"
@@ -42,8 +43,7 @@ namespace {
 // TODO(akshayka): Support distributed execution.
 class PartitionedCallOp : public AsyncOpKernel {
  public:
-  explicit PartitionedCallOp(OpKernelConstruction* ctx)
-      : AsyncOpKernel(ctx), local_device_name_(ctx->device()->name()) {
+  explicit PartitionedCallOp(OpKernelConstruction* ctx) : AsyncOpKernel(ctx) {
     OP_REQUIRES_OK(ctx, ctx->GetAttr("f", &func_));
   }
 
@@ -55,6 +55,9 @@ class PartitionedCallOp : public AsyncOpKernel {
                       errors::Internal("No function library is provided."),
                       done);
 
+    OpInputList args;
+    OP_REQUIRES_OK_ASYNC(ctx, ctx->input_list("args", &args), done);
+
     // The function body's graph is placed and partitioned the first time
     // `ComputeAsync` is invoked; every subsequent invocation calls each
     // of the function shards yielded by partitioning.
@@ -67,16 +70,35 @@ class PartitionedCallOp : public AsyncOpKernel {
     // via, e.g., virtual device annotations and a list of device names supplied
     // through an attribute.
     //
-    // TODO(akshayka): Lift the constraint pinning inputs and outputs to the
-    // local device.
-    //
     // TODO(akshayka): Add a fastpath for functions that execute on a single
     // device.
     {
       mutex_lock l(mu_);
-      if (!partitioned_) {
-        auto graph = tensorflow::MakeUnique<Graph>(OpRegistry::Global());
-        OP_REQUIRES_OK_ASYNC(ctx, GetGraphFromFunction(lib, graph.get()), done);
+      if (function_handles_.find(lib) == function_handles_.end()) {
+        if (local_device_name_.empty()) {
+          // The full local device name isn't known at kernel construction
+          // time, hence the need to set it here.
+          local_device_name_ = lib->device()->name();
+        }
+
+        // TODO(b/37549631): Because this kernel may correspond to a stateful
+        // op, it may be shared by multiple subgraphs, which in turn may have
+        // different `FunctionLibraryRuntime` objects and therefore different
+        // `FHandle` namespaces. As such, we partition on a per-FLR basis.
+        FunctionLibraryRuntime::InstantiateOptions opts;
+        FHandle handle;
+        OP_REQUIRES_OK_ASYNC(
+            ctx,
+            lib->Instantiate(func_.name(), AttrSlice(&func_.attr()), opts,
+                             &handle),
+            done);
+        const FunctionBody* fbody = lib->GetFunctionBody(handle);
+        OP_REQUIRES_ASYNC(ctx, fbody != nullptr,
+                          errors::Internal("Could not find handle ", handle),
+                          done);
+        auto graph = tensorflow::MakeUnique<Graph>(fbody->graph->flib_def());
+        CopyGraph(*fbody->graph, graph.get());
+        OP_REQUIRES_OK_ASYNC(ctx, PinResourceArgs(graph.get(), args), done);
 
         DeviceSet device_set;
         for (auto d : lib->device_mgr()->ListDevices()) {
@@ -94,9 +116,14 @@ class PartitionedCallOp : public AsyncOpKernel {
         // an OpKernel, so functions are instantiated in an overlay library.
         overlay_lib_.reset(new FunctionLibraryDefinition(
             *lib->GetFunctionLibraryDefinition()));
+        auto handles = tensorflow::MakeUnique<gtl::FlatMap<string, FHandle>>();
         for (const auto& pair : subgraphs) {
+          // TODO(akshayka): Fail gracefully if the set of devices corresponds
+          // to more than one address space.
           const string& target = pair.first;
           const auto& subgraph = pair.second;
+          OP_REQUIRES_OK_ASYNC(
+              ctx, UpdateArgAndRetMetadata(target, subgraph.get()), done);
           FunctionDef shard;
           string unique_name = UniquifyFunctionName(func_.name());
           OP_REQUIRES_OK_ASYNC(
@@ -111,40 +138,38 @@ class PartitionedCallOp : public AsyncOpKernel {
               lib->Instantiate(unique_name, AttrSlice(&shard.attr()), opts,
                                &handle),
               done);
-          function_handles_.emplace(target, handle);
+          handles->emplace(target, handle);
         }
-        partitioned_ = true;
+
+        function_handles_.emplace(lib, std::move(handles));
       }
     }
-    ExecuteFunctions(lib, ctx, std::move(done));
+    ExecuteFunctions(lib, ctx, args, std::move(done));
   }
 
  private:
   typedef std::pair<string, FHandle> DeviceAndFHandle;
+  typedef std::pair<std::vector<int>, std::vector<int>> ArgAndRetIndices;
+  typedef std::pair<std::vector<AllocatorAttributes>,
+                    std::vector<AllocatorAttributes>>
+      ArgAndRetAllocAttrs;
 
-  // `func_` encapsulates the original, unsharded function.
-  // Copies the graph backing `func_` into `*graph`, pinning the input and
-  // output nodes to the local device.
-  //
-  // `*graph` must be a freshly allocated graph.
-  Status GetGraphFromFunction(FunctionLibraryRuntime* lib, Graph* graph) {
-    FunctionLibraryRuntime::InstantiateOptions opts;
-    FHandle handle;
-    TF_RETURN_IF_ERROR(lib->Instantiate(func_.name(), AttrSlice(&func_.attr()),
-                                        opts, &handle));
-    const FunctionBody* fbody = lib->GetFunctionBody(handle);
-    if (fbody == nullptr) {
-      return errors::Internal("Could not find handle ", handle);
-    }
-    CopyGraph(*fbody->graph, graph);
-
-    // Pin the inputs and outputs to the local device to simplify the
-    // function-dispatching logic.
+  // Pins each arg that emits a `DT_RESOURCE` tensor to the device on which the
+  // corresponding resource lives. This ensures that the Placer assigns ops that
+  // access these resources to the appropriate devices.
+  Status PinResourceArgs(Graph* graph, const OpInputList& args) {
     for (Node* node : graph->op_nodes()) {
       string node_type = node->type_string();
-      if (node_type == FunctionLibraryDefinition::kArgOp ||
-          node_type == FunctionLibraryDefinition::kRetOp) {
-        node->set_assigned_device_name(local_device_name_);
+      if (node_type == FunctionLibraryDefinition::kArgOp) {
+        const AttrValue* attr_value;
+        TF_RETURN_IF_ERROR(node->attrs().Find("index", &attr_value));
+        int index = attr_value->i();
+        TF_RETURN_IF_ERROR(node->attrs().Find("T", &attr_value));
+        DataType dtype = attr_value->type();
+        if (dtype == DT_RESOURCE) {
+          ResourceHandle handle = args[index].flat<ResourceHandle>()(0);
+          node->set_assigned_device_name(handle.device());
+        }
       }
     }
     return Status::OK();
@@ -198,9 +223,104 @@ class PartitionedCallOp : public AsyncOpKernel {
     return Status::OK();
   }
 
-  // Executes the partitioned functions.
+  // Each subgraph produced by partitioning the function body contains a subset
+  // of the original `Arg` and `Retval` nodes. This function performs
+  // bookkeeping to track which `Arg` and `Retval` nodes were placed on a
+  // particular device / subgraph.
+  //
+  // More specifically, this function
+  //  (1) rewrites the indices of the `Arg` and `Retval` nodes placed on a
+  //      particular device,
+  //  (2) records the subsets of `Arg` and `Retval` nodes assigned to the
+  //      device, and
+  //  (3) records which `Arg` and `Retval` nodes live in host memory.
+  Status UpdateArgAndRetMetadata(const string& device, Graph* subgraph) {
+    if (arg_and_ret_indices_.find(device) != arg_and_ret_indices_.end()) {
+      // This function has already been partitioned, albeit for a different
+      // function library.
+      return Status::OK();
+    }
+
+    ArgAndRetIndices indices;
+    std::vector<int>* arg_indices = &indices.first;
+    std::vector<int>* ret_indices = &indices.second;
+    std::vector<std::pair<Node*, int>> arg_nodes;
+    std::vector<std::pair<Node*, int>> ret_nodes;
+    const AttrValue* attr_value;
+
+    for (Node* node : subgraph->op_nodes()) {
+      string node_type = node->type_string();
+      if (node_type == FunctionLibraryDefinition::kArgOp) {
+        TF_RETURN_IF_ERROR(node->attrs().Find("index", &attr_value));
+        int index = attr_value->i();
+        arg_indices->push_back(index);
+        arg_nodes.push_back(std::make_pair(node, index));
+      } else if (node_type == FunctionLibraryDefinition::kRetOp) {
+        TF_RETURN_IF_ERROR(node->attrs().Find("index", &attr_value));
+        int index = attr_value->i();
+        ret_indices->push_back(index);
+        ret_nodes.push_back(std::make_pair(node, index));
+      }
+    }
+
+    auto sort_by_index = [](std::pair<Node*, int> one,
+                            std::pair<Node*, int> two) -> bool {
+      return one.second < two.second;
+    };
+    std::sort(arg_nodes.begin(), arg_nodes.end(), sort_by_index);
+    std::sort(ret_nodes.begin(), ret_nodes.end(), sort_by_index);
+    for (int i = 0; i < arg_nodes.size(); ++i) {
+      Node* arg = arg_nodes[i].first;
+      arg->AddAttr("index", i);
+      TF_RETURN_IF_ERROR(arg->attrs().Find("T", &attr_value));
+      AllocatorAttributes alloc_attr;
+      DataType type = attr_value->type();
+      if (MTypeFromDType(type) == HOST_MEMORY) {
+        alloc_attr.set_on_host(true);
+      }
+      arg_and_ret_alloc_attrs_[device].first.push_back(alloc_attr);
+    }
+    for (int i = 0; i < ret_nodes.size(); ++i) {
+      Node* ret = ret_nodes[i].first;
+      ret->AddAttr("index", i);
+      TF_RETURN_IF_ERROR(ret->attrs().Find("T", &attr_value));
+      AllocatorAttributes alloc_attr;
+      DataType type = attr_value->type();
+      if (MTypeFromDType(type) == HOST_MEMORY) {
+        alloc_attr.set_on_host(true);
+      }
+      arg_and_ret_alloc_attrs_[device].second.push_back(alloc_attr);
+    }
+
+    arg_and_ret_indices_.emplace(device, indices);
+    return Status::OK();
+  }
+
+  std::vector<Tensor> GetArgsForIndices(const std::vector<int>& indices,
+                                        const OpInputList& arguments) {
+    std::vector<Tensor> args;
+    args.reserve(indices.size());
+    for (int i : indices) {
+      args.push_back(arguments[i]);
+    }
+    return args;
+  }
+
   void ExecuteFunctions(FunctionLibraryRuntime* lib, OpKernelContext* ctx,
-                        DoneCallback done) LOCKS_EXCLUDED(mu_) {
+                        const OpInputList& op_args, DoneCallback done)
+      LOCKS_EXCLUDED(mu_) {
+    const gtl::FlatMap<string, FHandle>* handles;
+    {
+      mutex_lock l(mu_);
+      handles = function_handles_[lib].get();
+    }
+    if (handles->empty()) {
+      // Trivial case where the function body is empty.
+      ctx->SetStatus(Status::OK());
+      done();
+      return;
+    }
+
     FunctionLibraryRuntime::Options opts;
     opts.step_id = ctx->step_id();
     opts.step_container = ctx->step_container();
@@ -210,16 +330,12 @@ class PartitionedCallOp : public AsyncOpKernel {
     // using device-specific threadpools when available.
     opts.runner = ctx->runner();
     opts.source_device = local_device_name_;
+    opts.allow_dead_tensors = true;
     // TODO(akshayka): Accommodate the multiple-worker scenario by adding the
     // constructed rendezvous to a rendezvous manager.
     Rendezvous* rendez = new IntraProcessRendezvous(lib->device_mgr());
     opts.rendezvous = rendez;
 
-    OpInputList arguments;
-    OP_REQUIRES_OK_ASYNC(ctx, ctx->input_list("args", &arguments), done);
-    // Dummy args vector for the remote shards, which do not have inputs.
-    std::vector<Tensor> dummy_args;
-
     StatusCallback callback = std::bind(
         [](Rendezvous* rendez, DoneCallback& done, const Status& status) {
           rendez->Unref();
@@ -227,48 +343,62 @@ class PartitionedCallOp : public AsyncOpKernel {
         },
         rendez, std::move(done), std::placeholders::_1);
     auto* refcounted_done = new ReffedStatusCallback(std::move(callback));
-    for (int i = 1; i < function_handles_.size(); ++i) {
+    for (int i = 1; i < handles->size(); ++i) {
       refcounted_done->Ref();
     }
 
-    for (const auto& pair : function_handles_) {
-      const string& target_device = pair.first;
+    for (const auto& pair : *handles) {
+      const string& target = pair.first;
       FHandle handle = pair.second;
-      VLOG(3) << "Running function shard on device " << target_device;
-      if (target_device == local_device_name_) {
+      VLOG(3) << "Running function shard on device " << target;
+      ArgAndRetIndices indices = arg_and_ret_indices_[target];
+      ArgAndRetAllocAttrs alloc_attrs = arg_and_ret_alloc_attrs_[target];
+      const std::vector<int>& arg_indices = indices.first;
+      const std::vector<int>& ret_indices = indices.second;
+      opts.args_alloc_attrs = alloc_attrs.first;
+      opts.rets_alloc_attrs = alloc_attrs.second;
+      if (target == local_device_name_) {
         opts.remote_execution = false;
-        std::vector<Tensor> args;
-        args.reserve(arguments.size());
-        for (const Tensor& argument : arguments) {
-          args.push_back(argument);
-        }
-        auto* rets = new std::vector<Tensor>;
-        lib->Run(opts, handle, args, rets,
-                 [rets, refcounted_done, ctx](const Status& status) {
-                   if (!status.ok()) {
-                     ctx->SetStatus(status);
-                   } else {
-                     for (int i = 0; i < rets->size(); ++i) {
-                       ctx->set_output(i, (*rets)[i]);
-                     }
-                   }
-                   delete rets;
-                   refcounted_done->Unref();
-                 });
+        std::vector<Tensor> args = GetArgsForIndices(arg_indices, op_args);
+        std::vector<Tensor>* rets = new std::vector<Tensor>;
+        lib->Run(
+            opts, handle, args, rets,
+            [rets, ret_indices, refcounted_done, ctx](const Status& status) {
+              if (!status.ok()) {
+                VLOG(3) << "Local execution failed: " << status;
+                ctx->SetStatus(status);
+              } else {
+                for (int i = 0; i < rets->size(); ++i) {
+                  ctx->set_output(ret_indices[i], (*rets)[i]);
+                }
+              }
+              delete rets;
+              VLOG(3) << "Finished local execution.";
+              refcounted_done->Unref();
+            });
       } else {
         opts.remote_execution = true;
-        std::vector<Tensor>* dummy_rets = new std::vector<Tensor>;
-        lib->Run(opts, handle, dummy_args, dummy_rets,
-                 [dummy_rets, refcounted_done, ctx](const Status& status) {
-                   if (!status.ok()) {
-                     ctx->SetStatus(status);
-                   }
-                   delete dummy_rets;
-                   refcounted_done->Unref();
-                 });
+        std::vector<Tensor> args = GetArgsForIndices(arg_indices, op_args);
+        std::vector<Tensor>* rets = new std::vector<Tensor>;
+        lib->Run(
+            opts, handle, args, rets,
+            [rets, ret_indices, refcounted_done, ctx](const Status& status) {
+              if (!status.ok()) {
+                VLOG(3) << "Remote execution failed: " << status;
+                ctx->SetStatus(status);
+              } else {
+                for (int i = 0; i < rets->size(); ++i) {
+                  ctx->set_output(ret_indices[i], (*rets)[i]);
+                }
+              }
+              delete rets;
+              VLOG(3) << "Finished remote execution.";
+              refcounted_done->Unref();
+            });
       }
     }
   }
+
   string UniquifyFunctionName(const string& name) {
     for (;; ++suffix_) {
       const string candidate = strings::StrCat(name, "_", suffix_);
@@ -279,26 +409,40 @@ class PartitionedCallOp : public AsyncOpKernel {
   }
 
   NameAttrList func_;
-  const string local_device_name_;
+  string local_device_name_;
   // Function shards are added to `overlay_lib_`.
   std::unique_ptr<FunctionLibraryDefinition> overlay_lib_;
-  // A map from device names to handles of function shards; this map is
-  // read-only after the first execution of the OpKernel.
-  gtl::FlatMap<string, FHandle> function_handles_;
+  // Contains maps from device names to handles of function shards, keyed by
+  // FunctionLibraryRuntime pointers. (Because this kernel may be instantiated
+  // for a stateful op, different invocations of it may use different FLRs.)
+  gtl::FlatMap<FunctionLibraryRuntime*,
+               std::unique_ptr<gtl::FlatMap<string, FHandle>>>
+      function_handles_ GUARDED_BY(mu_);
+  // Map from device name to the indices of the arguments and return values
+  // placed on that device. Read-only after the first invocation.
+  gtl::FlatMap<string, ArgAndRetIndices> arg_and_ret_indices_;
+  // Map from device name to alloc attrs for arguments and return values of the
+  // function placed on that device. Read-only after the first invocation.
+  gtl::FlatMap<string, ArgAndRetAllocAttrs> arg_and_ret_alloc_attrs_;
 
   mutex mu_;
-  bool partitioned_ GUARDED_BY(mu_) = false;
 
   // Used to uniquify function names in `overlay_lib_`.
   uint32 suffix_ = 0;
 };
 REGISTER_KERNEL_BUILDER(Name("PartitionedCall").Device(DEVICE_CPU),
                         PartitionedCallOp);
+REGISTER_KERNEL_BUILDER(Name("StatefulPartitionedCall").Device(DEVICE_CPU),
+                        PartitionedCallOp);
 REGISTER_KERNEL_BUILDER(Name("PartitionedCall").Device(DEVICE_GPU),
                         PartitionedCallOp);
+REGISTER_KERNEL_BUILDER(Name("StatefulPartitionedCall").Device(DEVICE_GPU),
+                        PartitionedCallOp);
 #if TENSORFLOW_USE_SYCL
 REGISTER_KERNEL_BUILDER(Name("PartitionedCall").Device(DEVICE_SYCL),
                         PartitionedCallOp);
+REGISTER_KERNEL_BUILDER(Name("StatefulPartitionedCall").Device(DEVICE_SYCL),
+                        PartitionedCallOp);
 #endif  // TENSORFLOW_USE_SYCL
 
 }  // namespace
diff --git a/tensorflow/core/kernels/reshape_util.cc b/tensorflow/core/kernels/reshape_util.cc
index ac301f3342..50fdc17916 100644
--- a/tensorflow/core/kernels/reshape_util.cc
+++ b/tensorflow/core/kernels/reshape_util.cc
@@ -28,7 +28,6 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor_util.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
-#include "tensorflow/core/util/sparse/sparse_tensor.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/kernels/resource_variable_ops.cc b/tensorflow/core/kernels/resource_variable_ops.cc
index af921e4815..cab9eb729d 100644
--- a/tensorflow/core/kernels/resource_variable_ops.cc
+++ b/tensorflow/core/kernels/resource_variable_ops.cc
@@ -174,25 +174,20 @@ REGISTER_KERNEL_BUILDER(Name("VariableShape")
 
 #endif  // GOOGLE_CUDA
 
-class DestroyResourceOp : public OpKernel {
- public:
-  explicit DestroyResourceOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
-    OP_REQUIRES_OK(ctx,
-                   ctx->GetAttr("ignore_lookup_error", &ignore_lookup_error_));
-  }
+DestroyResourceOp::DestroyResourceOp(OpKernelConstruction* ctx)
+    : OpKernel(ctx) {
+  OP_REQUIRES_OK(ctx,
+                 ctx->GetAttr("ignore_lookup_error", &ignore_lookup_error_));
+}
 
-  void Compute(OpKernelContext* ctx) override {
-    const ResourceHandle& p = HandleFromInput(ctx, 0);
-    Status status = DeleteResource(ctx, p);
-    if (ignore_lookup_error_ && errors::IsNotFound(status)) {
-      return;
-    }
-    OP_REQUIRES_OK(ctx, status);
+void DestroyResourceOp::Compute(OpKernelContext* ctx) {
+  const ResourceHandle& p = HandleFromInput(ctx, 0);
+  Status status = DeleteResource(ctx, p);
+  if (ignore_lookup_error_ && errors::IsNotFound(status)) {
+    return;
   }
-
- private:
-  bool ignore_lookup_error_;
-};
+  OP_REQUIRES_OK(ctx, status);
+}
 
 REGISTER_KERNEL_BUILDER(Name("DestroyResourceOp").Device(DEVICE_CPU),
                         DestroyResourceOp);
@@ -218,64 +213,32 @@ class AssignVariableOp : public OpKernel {
                     "Variable and value dtypes don't match; respectively, ",
                     dtype_, " and ", context->input(1).dtype()));
     Var* variable = nullptr;
-    OP_REQUIRES_OK(
-        context,
-        LookupOrCreateResource<Var>(
-            context, HandleFromInput(context, 0), &variable,
-            [this, context](Var** ptr) {
-              *ptr = new Var(dtype_);
-              PersistentTensor unused;
-              Tensor* tmp;
-              AllocatorAttributes attr;
-              if (!relax_constraints_) {
-                attr.set_gpu_compatible(true);
-                attr.set_nic_compatible(true);
-              }
-              TF_RETURN_IF_ERROR(context->allocate_persistent(
-                  dtype_, context->input(1).shape(), &unused, &tmp, attr));
-              *(*ptr)->tensor() = *tmp;
-              return Status::OK();
-            }));
+    const Tensor& value = context->input(1);
+    // Note: every resource-variable-manipulating op assumes copy-on-write
+    // semantics, and creates a copy of the variable's Tensor if its refcount is
+    // bigger than 1 when we try to modify it. This means we never need to copy
+    // the original tensor for AssignVariableOp; even if there are other live
+    // users of it we know none can modify it so this is always safe (even in
+    // esoteric cases where the same tensor is used to initialize multiple
+    // variables or the tensor is a constant this is safe, as future writes will
+    // trigger copies).
+    OP_REQUIRES_OK(context, LookupOrCreateResource<Var>(
+                                context, HandleFromInput(context, 0), &variable,
+                                [this, &value](Var** ptr) {
+                                  *ptr = new Var(dtype_);
+                                  *(*ptr)->tensor() = value;
+                                  (*ptr)->is_initialized = true;
+                                  return Status::OK();
+                                }));
     core::ScopedUnref s(variable);
-
     OP_REQUIRES(context, variable->tensor()->dtype() == dtype_,
                 errors::InvalidArgument(
                     "Trying to assign variable with wrong dtype. Expected ",
                     DataTypeString(variable->tensor()->dtype()), " got ",
                     DataTypeString(dtype_)));
-
-    const Tensor& value = context->input(1);
-    AllocatorAttributes attr;
-    if (!relax_constraints_) {
-      attr.set_gpu_compatible(true);
-      attr.set_nic_compatible(true);
-    }
-
-    // Copying is unnecessary if we are the last user of the value
-    // tensor, we can just adopt the input tensor's buffer instead.
-    std::unique_ptr<Tensor> input_alias = context->forward_input(
-        1, OpKernelContext::Params::kNoReservation /*output_index*/, dtype_,
-        value.shape(), DEVICE_MEMORY, attr);
     mutex_lock ml(*variable->mu());
     variable->is_initialized = true;
-    if (input_alias) {
-      *variable->tensor() = *input_alias;
-      return;
-    }
-
-    // Need to copy, but maybe we can re-use variable's buffer?
-    if (!variable->tensor()->RefCountIsOne() ||
-        !variable->tensor()->shape().IsSameSize(value.shape())) {
-      // Copy to new buffer
-      PersistentTensor unused;
-      Tensor* tmp;
-      OP_REQUIRES_OK(context, context->allocate_persistent(
-                                  dtype_, value.shape(), &unused, &tmp, attr));
-      *variable->tensor() = *tmp;
-    }
-    functor::DenseUpdate<Device, T, ASSIGN> copy_functor;
-    copy_functor(context->eigen_device<Device>(), variable->tensor()->flat<T>(),
-                 value.flat<T>());
+    *variable->tensor() = value;
   }
 
  private:
diff --git a/tensorflow/core/kernels/resource_variable_ops.h b/tensorflow/core/kernels/resource_variable_ops.h
index 8cae5d21f0..9b60106f13 100644
--- a/tensorflow/core/kernels/resource_variable_ops.h
+++ b/tensorflow/core/kernels/resource_variable_ops.h
@@ -28,6 +28,15 @@ class ReadVariableOp : public OpKernel {
   DataType dtype_;
 };
 
+class DestroyResourceOp : public OpKernel {
+ public:
+  explicit DestroyResourceOp(OpKernelConstruction* ctx);
+  void Compute(OpKernelContext* ctx) override;
+
+ private:
+  bool ignore_lookup_error_;
+};
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_KERNELS_RESOURCE_VARIABLE_OPS_H_
diff --git a/tensorflow/core/kernels/roll_op.cc b/tensorflow/core/kernels/roll_op.cc
index 722116f86f..efa30438d9 100644
--- a/tensorflow/core/kernels/roll_op.cc
+++ b/tensorflow/core/kernels/roll_op.cc
@@ -19,6 +19,7 @@ limitations under the License.
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/register_types_traits.h"
 #include "tensorflow/core/framework/shape_inference.h"
+#include "tensorflow/core/kernels/bounds_check.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/util/work_sharder.h"
@@ -258,7 +259,7 @@ class RollOp : public OpKernel {
       if (axis < 0) {
         axis += num_dims;
       }
-      OP_REQUIRES(context, 0 <= axis && axis < num_dims,
+      OP_REQUIRES(context, FastBoundsCheck(axis, num_dims),
                   errors::InvalidArgument("axis ", axis, " is out of range"));
       const int ds = std::max<int>(static_cast<int>(input.dim_size(axis)), 1);
       const int sum = shift_mod_sum[axis] + static_cast<int>(shift_flat(i));
diff --git a/tensorflow/core/kernels/save_restore_tensor.cc b/tensorflow/core/kernels/save_restore_tensor.cc
index 990bd2bff9..7930ce4615 100644
--- a/tensorflow/core/kernels/save_restore_tensor.cc
+++ b/tensorflow/core/kernels/save_restore_tensor.cc
@@ -23,6 +23,7 @@ limitations under the License.
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/kernels/bounds_check.h"
+#include "tensorflow/core/lib/core/threadpool.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
@@ -226,43 +227,53 @@ void RestoreTensor(OpKernelContext* context,
 #undef READER_COPY
 }
 
-Status RestoreTensorsV2(OpKernelContext* context, const Tensor& prefix,
-                        const Tensor& tensor_names,
-                        const Tensor& shape_and_slices,
-                        gtl::ArraySlice<DataType> dtypes) {
-  const string& prefix_string = prefix.scalar<string>()();
+namespace {
 
-  const auto& tensor_names_flat = tensor_names.flat<string>();
-  const auto& shape_and_slices_flat = shape_and_slices.flat<string>();
+// Tensors larger than this threshold will be restored from a thread-pool.
+const int64 kLargeShapeThreshold = 16 << 20;  // 16M
 
-  // Sort lookup keys to improve locality when reading multiple tensors.
-  std::vector<size_t> sorted_name_idx(tensor_names_flat.size());
-  std::iota(sorted_name_idx.begin(), sorted_name_idx.end(), 0);
-  std::sort(sorted_name_idx.begin(), sorted_name_idx.end(),
-            [&tensor_names_flat](size_t a, size_t b) {
-              return tensor_names_flat(a) < tensor_names_flat(b);
-            });
+// A restore operation for a single tensor.  Small tensors may be restored
+// directly from the op thread to improve read locality.  Large tensors can be
+// restored from a thread pool: this requires creating a separate BundleReader
+// for each restore.
+struct RestoreOp {
+  RestoreOp& operator=(const RestoreOp&) = delete;
 
-  BundleReader reader(Env::Default(), prefix_string);
-  TF_RETURN_IF_ERROR(reader.status());
+  bool should_run_in_pool(BundleReader* reader) const {
+    TensorShape restored_full_shape;
 
-  // TODO(zongheng): potential optimization: one Seek() in first lookup.
-  // TODO(zongheng): consider measuring speed and issuing concurrent lookups
-  // within a fixed memory budget.
-  TensorShape restored_full_shape;
-  Tensor* restored_tensor = nullptr;
-  for (auto i : sorted_name_idx) {
-    const string& tensor_name = tensor_names_flat(i);
-    const string& shape_and_slice = shape_and_slices_flat(i);
+    // Ignore status here; we'll catch the error later.
+    if (!reader->LookupTensorShape(tensor_name, &restored_full_shape).ok()) {
+      return false;
+    }
 
+    return restored_full_shape.num_elements() > kLargeShapeThreshold;
+  }
+
+  // Run this restore operation using a new BundleReader.
+  void run_with_new_reader() {
+    BundleReader reader(Env::Default(), reader_prefix);
+    if (!reader.status().ok()) {
+      status = reader.status();
+      return;
+    }
+
+    status = run(&reader);
+  }
+
+  Status run(BundleReader* reader) {
+    TensorShape restored_full_shape;
     TF_RETURN_IF_ERROR(
-        reader.LookupTensorShape(tensor_name, &restored_full_shape));
+        reader->LookupTensorShape(tensor_name, &restored_full_shape));
 
+    VLOG(1) << "Restoring tensor " << idx << " : " << tensor_name << " : "
+            << restored_full_shape.num_elements();
+    Tensor* restored_tensor;
     if (shape_and_slice.empty()) {
       // Lookup the full tensor.
       TF_RETURN_IF_ERROR(
-          context->allocate_output(i, restored_full_shape, &restored_tensor));
-      TF_RETURN_IF_ERROR(reader.Lookup(tensor_name, restored_tensor));
+          context->allocate_output(idx, restored_full_shape, &restored_tensor));
+      TF_RETURN_IF_ERROR(reader->Lookup(tensor_name, restored_tensor));
     } else {
       // Lookup the slice.
       TensorShape parsed_full_shape;
@@ -272,6 +283,7 @@ Status RestoreTensorsV2(OpKernelContext* context, const Tensor& prefix,
       TF_RETURN_IF_ERROR(
           checkpoint::ParseShapeAndSlice(shape_and_slice, &parsed_full_shape,
                                          &parsed_slice, &parsed_slice_shape));
+
       if (!restored_full_shape.IsSameSize(parsed_full_shape)) {
         return errors::InvalidArgument(
             "tensor_name = ", tensor_name, "; shape in shape_and_slice spec ",
@@ -279,19 +291,93 @@ Status RestoreTensorsV2(OpKernelContext* context, const Tensor& prefix,
             " does not match the shape stored in checkpoint: ",
             restored_full_shape.DebugString());
       }
-
       TF_RETURN_IF_ERROR(
-          context->allocate_output(i, parsed_slice_shape, &restored_tensor));
+          context->allocate_output(idx, parsed_slice_shape, &restored_tensor));
       TF_RETURN_IF_ERROR(
-          reader.LookupSlice(tensor_name, parsed_slice, restored_tensor));
+          reader->LookupSlice(tensor_name, parsed_slice, restored_tensor));
+    }
+    return Status::OK();
+  }
+
+  OpKernelContext* context;
+  size_t idx;
+  string tensor_name;
+  string shape_and_slice;
+  string reader_prefix;
+
+  ::tensorflow::Status status;
+};
+
+}  // namespace
+
+Status RestoreTensorsV2(OpKernelContext* context, const Tensor& prefix,
+                        const Tensor& tensor_names,
+                        const Tensor& shape_and_slices,
+                        gtl::ArraySlice<DataType> dtypes) {
+  const string& prefix_string = prefix.scalar<string>()();
+
+  const auto& tensor_names_flat = tensor_names.flat<string>();
+  const auto& shape_and_slices_flat = shape_and_slices.flat<string>();
+
+  // Sort lookup keys to improve locality when reading multiple tensors.
+  std::vector<size_t> sorted_name_idx(tensor_names_flat.size());
+  std::iota(sorted_name_idx.begin(), sorted_name_idx.end(), 0);
+  std::sort(sorted_name_idx.begin(), sorted_name_idx.end(),
+            [&tensor_names_flat](size_t a, size_t b) {
+              return tensor_names_flat(a) < tensor_names_flat(b);
+            });
+
+  std::vector<std::unique_ptr<RestoreOp> > pool_restore_ops;
+  std::vector<std::unique_ptr<RestoreOp> > direct_restore_ops;
+
+  BundleReader default_reader(Env::Default(), prefix_string);
+  TF_RETURN_IF_ERROR(default_reader.status());
+
+  for (auto i : sorted_name_idx) {
+    const string& tensor_name = tensor_names_flat(i);
+    const string& shape_and_slice = shape_and_slices_flat(i);
+    auto op =
+        new RestoreOp{context, i, tensor_name, shape_and_slice, prefix_string};
+    if (op->should_run_in_pool(&default_reader)) {
+      pool_restore_ops.emplace_back(op);
+    } else {
+      direct_restore_ops.emplace_back(op);
     }
-    if (dtypes[i] != restored_tensor->dtype()) {
+  }
+
+  {
+    // Schedule any threaded operations first, skipping thread pool creation if
+    // we don't have any expensive operations.
+    std::unique_ptr<thread::ThreadPool> reader_pool;
+    if (!pool_restore_ops.empty()) {
+      reader_pool.reset(
+          new thread::ThreadPool(Env::Default(), "restore_tensors", 8));
+      for (auto& op : pool_restore_ops) {
+        reader_pool->Schedule([&op]() { op->run_with_new_reader(); });
+      }
+    }
+
+    // Read small tensors from the op thread
+    for (auto& op : direct_restore_ops) {
+      TF_RETURN_IF_ERROR(op->run(&default_reader));
+    }
+  }
+
+  // Check status of pool ops; this must come after the pool shuts down.
+  for (auto& op : pool_restore_ops) {
+    TF_RETURN_IF_ERROR(op->status);
+  }
+
+  for (auto i : sorted_name_idx) {
+    const string& tensor_name = tensor_names_flat(i);
+    if (dtypes[i] != context->mutable_output(i)->dtype()) {
       return errors::InvalidArgument(
           "tensor_name = ", tensor_name, "; expected dtype ",
           DataTypeString(dtypes[i]), " does not equal restored dtype ",
-          DataTypeString(restored_tensor->dtype()));
+          DataTypeString(context->mutable_output(i)->dtype()));
     }
   }
+
   return Status::OK();
 }
 
diff --git a/tensorflow/core/kernels/scatter_nd_op.cc b/tensorflow/core/kernels/scatter_nd_op.cc
index e1fc2ea128..e0194605ce 100644
--- a/tensorflow/core/kernels/scatter_nd_op.cc
+++ b/tensorflow/core/kernels/scatter_nd_op.cc
@@ -277,6 +277,9 @@ TF_CALL_NUMBER_TYPES(REGISTER_SCATTER_ND_ADD_SUB_CPU);
 TF_CALL_NUMBER_TYPES(REGISTER_SCATTER_ND_UPDATE_CPU);
 TF_CALL_NUMBER_TYPES(REGISTER_SCATTER_ND_CPU);
 TF_CALL_string(REGISTER_SCATTER_ND_CPU);
+TF_CALL_bool(REGISTER_SCATTER_ND_ADD_SUB_CPU);
+TF_CALL_bool(REGISTER_SCATTER_ND_UPDATE_CPU);
+TF_CALL_bool(REGISTER_SCATTER_ND_CPU);
 
 // Registers GPU kernels.
 #if GOOGLE_CUDA
@@ -309,6 +312,7 @@ TF_CALL_complex128(REGISTER_SCATTER_ND_ALL_GPU);
 
 TF_CALL_int32(REGISTER_SCATTER_ND_ADD_SUB_SYCL);
 TF_CALL_int32(REGISTER_SCATTER_ND_UPDATE_SYCL);
+TF_CALL_bool(REGISTER_SCATTER_ND_UPDATE_SYCL);
 TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SCATTER_ND_ADD_SUB_SYCL);
 TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SCATTER_ND_UPDATE_SYCL);
 #undef REGISTER_SCATTER_ND_ADD_SUB_SYCL
@@ -537,11 +541,13 @@ Status DoScatterNd(OpKernelContext* c, const Tensor& indices,
     }
   }
   if (bad_i >= 0) {
+    auto slice_shape = indices.shape();
+    slice_shape.RemoveLastDims(1);
     return errors::InvalidArgument(
-        "Invalid indices: ", SliceDebugString(indices.shape(), bad_i), " = [",
+        "indices", SliceDebugString(slice_shape, bad_i), " = [",
         str_util::Join(
             gtl::ArraySlice<Index>(&indices_flat(bad_i, 0), slice_dim), ", "),
-        "] does not index into ", shape.DebugString());
+        "] does not index into shape ", shape.DebugString());
   }
   return Status::OK();
 }
diff --git a/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h b/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h
index 7cfffa20c5..472f5a3547 100644
--- a/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h
+++ b/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h
@@ -161,15 +161,16 @@ struct ScatterNdFunctor<CPUDevice, T, Index, OP, IXDIM> {
 
 TF_CALL_ALL_TYPES(REGISTER_SCATTER_ND_UPDATE);
 REGISTER_SCATTER_ND_INDEX(string, scatter_nd_op::UpdateOp::ADD);
-TF_CALL_NUMBER_TYPES(REGISTER_SCATTER_ND_MATH)
-
+TF_CALL_NUMBER_TYPES(REGISTER_SCATTER_ND_MATH);
+TF_CALL_bool(REGISTER_SCATTER_ND_MATH);
 #undef REGISTER_SCATTER_ND_MATH
 #undef REGISTER_SCATTER_ND_UPDATE
 #undef REGISTER_SCATTER_ND_INDEX
 #undef REGISTER_SCATTER_ND_FULL
 
-#ifdef TENSORFLOW_USE_SYCL
 // Implementation of update functor for SYCL.
+#ifdef TENSORFLOW_USE_SYCL
+
 template <typename T, typename Index, scatter_nd_op::UpdateOp OP, int IXDIM>
 struct ScatterNdFunctor<SYCLDevice, T, Index, OP, IXDIM> {
   Index operator()(
diff --git a/tensorflow/core/kernels/scatter_nd_op_test.cc b/tensorflow/core/kernels/scatter_nd_op_test.cc
index c134a8dd5b..95ecc69c95 100644
--- a/tensorflow/core/kernels/scatter_nd_op_test.cc
+++ b/tensorflow/core/kernels/scatter_nd_op_test.cc
@@ -185,7 +185,7 @@ TEST_F(ScatterNdUpdateOpTest, Error_IndexOutOfRange) {
                            {100, 101, 102, 777, 778, 779, 10000, 10001, 10002});
   Status s = RunOpKernel();
   EXPECT_TRUE(str_util::StrContains(
-      s.ToString(), "Invalid indices: [2,0] = [99] does not index into [5,3]"))
+      s.ToString(), "indices[2] = [99] does not index into shape [5,3]"))
       << s;
 }
 
diff --git a/tensorflow/core/kernels/sdca_internal.cc b/tensorflow/core/kernels/sdca_internal.cc
index 3e16ba8d04..1c071d3d41 100644
--- a/tensorflow/core/kernels/sdca_internal.cc
+++ b/tensorflow/core/kernels/sdca_internal.cc
@@ -18,6 +18,7 @@ limitations under the License.
 #include "tensorflow/core/kernels/sdca_internal.h"
 
 #include <limits>
+#include <numeric>
 #include <random>
 
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
diff --git a/tensorflow/core/kernels/sdca_internal.h b/tensorflow/core/kernels/sdca_internal.h
index 897c488702..1eff4b15fa 100644
--- a/tensorflow/core/kernels/sdca_internal.h
+++ b/tensorflow/core/kernels/sdca_internal.h
@@ -43,8 +43,6 @@ limitations under the License.
 #include "tensorflow/core/lib/random/distribution_sampler.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/util/guarded_philox_random.h"
-#include "tensorflow/core/util/sparse/group_iterator.h"
-#include "tensorflow/core/util/sparse/sparse_tensor.h"
 #include "tensorflow/core/util/work_sharder.h"
 
 namespace tensorflow {
diff --git a/tensorflow/core/kernels/segment_reduction_ops.h b/tensorflow/core/kernels/segment_reduction_ops.h
index 2da83a0288..d28e35157b 100644
--- a/tensorflow/core/kernels/segment_reduction_ops.h
+++ b/tensorflow/core/kernels/segment_reduction_ops.h
@@ -13,16 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef THIRD_PARTY_TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
-#define THIRD_PARTY_TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
-
-
-// This file requires the following include because it uses CudaAtomicMax:
-// #include "tensorflow/core/util/cuda_kernel_helper.h"
-
-// Unfortunately we can't add the #include, since it breaks compilation for
-// non-GPU targets. This only breaks in clang, because it's more strict for
-// template code and CudaAtomicMax is used in template context.
+#ifndef TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
+#define TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
 
 // This file requires the following include because it uses CudaAtomicMax:
 // #include "tensorflow/core/util/cuda_kernel_helper.h"
diff --git a/tensorflow/core/kernels/sendrecv_ops.cc b/tensorflow/core/kernels/sendrecv_ops.cc
index 2f87057f4e..6521dcf932 100644
--- a/tensorflow/core/kernels/sendrecv_ops.cc
+++ b/tensorflow/core/kernels/sendrecv_ops.cc
@@ -160,7 +160,6 @@ Rendezvous::DoneCallback make_recv_callback(OpKernelContext* ctx,
           if (!is_dead) {
             ctx->set_output(0, val);
           }
-          *ctx->is_output_dead() = is_dead;
         }
         done();
       },
diff --git a/tensorflow/core/kernels/serialize_sparse_op.cc b/tensorflow/core/kernels/serialize_sparse_op.cc
index 852cef29c7..577e327809 100644
--- a/tensorflow/core/kernels/serialize_sparse_op.cc
+++ b/tensorflow/core/kernels/serialize_sparse_op.cc
@@ -190,8 +190,10 @@ class SerializeManySparseOp : public SerializeManySparseOpBase<U> {
     TensorShape tensor_input_shape(input_shape->vec<int64>());
     gtl::InlinedVector<int64, 8> std_order(rank);
     std::iota(std_order.begin(), std_order.end(), 0);
-    SparseTensor input_st(*input_indices, *input_values, tensor_input_shape,
-                          std_order);
+    SparseTensor input_st;
+    OP_REQUIRES_OK(context, SparseTensor::Create(*input_indices, *input_values,
+                                                 tensor_input_shape, std_order,
+                                                 &input_st));
 
     auto input_shape_t = input_shape->vec<int64>();
     const int64 N = input_shape_t(0);
diff --git a/tensorflow/core/kernels/set_kernels.cc b/tensorflow/core/kernels/set_kernels.cc
index e836c764ac..f893d4e945 100644
--- a/tensorflow/core/kernels/set_kernels.cc
+++ b/tensorflow/core/kernels/set_kernels.cc
@@ -63,9 +63,9 @@ Status GroupShape(const VarDimArray& input_shape, ShapeArray* grouped_shape) {
 
 // Build `SparseTensor` from indices, values, and shape in inputs
 // [base_index, base_index + 3), and validate its rank and indices.
-sparse::SparseTensor SparseTensorFromContext(OpKernelContext* ctx,
-                                             const int32 base_index,
-                                             bool validate_indices) {
+Status SparseTensorFromContext(OpKernelContext* ctx, const int32 base_index,
+                               bool validate_indices,
+                               sparse::SparseTensor* tensor) {
   // Assume row-major order.
   const TensorShape shape =
       TensorShape(ctx->input(base_index + 2).vec<int64>());
@@ -73,13 +73,8 @@ sparse::SparseTensor SparseTensorFromContext(OpKernelContext* ctx,
   std::vector<int64> order(shape.dims());
   std::iota(order.begin(), order.end(), 0);
 
-  const sparse::SparseTensor st(ctx->input(base_index),
-                                ctx->input(base_index + 1), shape, order);
-  if (validate_indices) {
-    Status s = st.IndicesValid();
-    if (!s.ok()) ctx->SetStatus(s);
-  }
-  return st;
+  return sparse::SparseTensor::Create(
+      ctx->input(base_index), ctx->input(base_index + 1), shape, order, tensor);
 }
 
 // TODO(ptucker): CheckGroup is just a sanity check on the result of
@@ -253,11 +248,13 @@ class SetSizeOp : public OpKernel {
 
 template <typename T>
 void SetSizeOp<T>::Compute(OpKernelContext* ctx) {
-  const sparse::SparseTensor set_st =
-      SparseTensorFromContext(ctx, 0, validate_indices_);
+  sparse::SparseTensor set_st;
+  OP_REQUIRES_OK(ctx,
+                 SparseTensorFromContext(ctx, 0, validate_indices_, &set_st));
+  OP_REQUIRES_OK(ctx, set_st.IndicesValid());
 
-  // Output shape is same as input except for last dimension, which reduces to
-  // the set size of values along that dimension.
+  // Output shape is same as input except for last dimension, which reduces
+  // to the set size of values along that dimension.
   ShapeArray output_shape;
   OP_REQUIRES_OK(ctx, GroupShape(set_st.shape(), &output_shape));
   const auto output_strides = Strides(output_shape);
@@ -484,8 +481,10 @@ void SetOperationOp<T>::ComputeDenseToDense(OpKernelContext* ctx) const {
 template <typename T>
 void SetOperationOp<T>::ComputeDenseToSparse(OpKernelContext* ctx) const {
   const Tensor& set1_t = ctx->input(0);
-  const sparse::SparseTensor set2_st =
-      SparseTensorFromContext(ctx, 1, validate_indices_);
+  sparse::SparseTensor set2_st;
+  OP_REQUIRES_OK(ctx,
+                 SparseTensorFromContext(ctx, 1, validate_indices_, &set2_st));
+  OP_REQUIRES_OK(ctx, set2_st.IndicesValid());
   // The following should stay in sync with `_dense_to_sparse_shape` shape
   // assertions in python/ops/set_ops.py, and `SetShapeFn` for
   // `DenseToSparseSetOperation` in ops/set_ops.cc.
@@ -597,10 +596,15 @@ const std::vector<int64> GROUP_ITER_END;
 // with the same first n-1 dimensions in set1 and set2.
 template <typename T>
 void SetOperationOp<T>::ComputeSparseToSparse(OpKernelContext* ctx) const {
-  const sparse::SparseTensor set1_st =
-      SparseTensorFromContext(ctx, 0, validate_indices_);
-  const sparse::SparseTensor set2_st =
-      SparseTensorFromContext(ctx, 3, validate_indices_);
+  sparse::SparseTensor set1_st;
+  OP_REQUIRES_OK(ctx,
+                 SparseTensorFromContext(ctx, 0, validate_indices_, &set1_st));
+  OP_REQUIRES_OK(ctx, set1_st.IndicesValid());
+
+  sparse::SparseTensor set2_st;
+  OP_REQUIRES_OK(ctx,
+                 SparseTensorFromContext(ctx, 3, validate_indices_, &set2_st));
+
   // The following should stay in sync with `_sparse_to_sparse_shape` shape
   // assertions in python/ops/set_ops.py, and `SetShapeFn` for
   // `SparseToSparseSetOperation` in ops/set_ops.cc.
diff --git a/tensorflow/core/kernels/spacetobatch_op.cc b/tensorflow/core/kernels/spacetobatch_op.cc
index fdc08ec8e3..64f1b0d661 100644
--- a/tensorflow/core/kernels/spacetobatch_op.cc
+++ b/tensorflow/core/kernels/spacetobatch_op.cc
@@ -42,29 +42,29 @@ typedef Eigen::GpuDevice GPUDevice;
 namespace {
 
 template <typename Device, typename T>
-void SpaceToBatchOpCompute(OpKernelContext* context,
-                           const Tensor& orig_input_tensor,
-                           const Tensor& orig_block_shape,
-                           const Tensor& orig_paddings) {
+Status SpaceToBatchOpCompute(OpKernelContext* context,
+                             const Tensor& orig_input_tensor,
+                             const Tensor& orig_block_shape,
+                             const Tensor& orig_paddings) {
   const int input_dims = orig_input_tensor.dims();
-  OP_REQUIRES(
-      context, TensorShapeUtils::IsVector(orig_block_shape.shape()),
-      errors::InvalidArgument("block_shape rank should be 1 instead of ",
-                              orig_block_shape.dims()));
+  if (!TensorShapeUtils::IsVector(orig_block_shape.shape())) {
+    return errors::InvalidArgument("block_shape rank should be 1 instead of ",
+                                   orig_block_shape.dims());
+  }
 
   const int block_dims = orig_block_shape.dim_size(0);
-  OP_REQUIRES(
-      context, orig_input_tensor.dims() >= 1 + block_dims,
-      errors::InvalidArgument("input rank should be >= ", 1 + block_dims,
-                              " instead of ", orig_input_tensor.dims()));
-
-  OP_REQUIRES(context,
-              TensorShapeUtils::IsMatrix(orig_paddings.shape()) &&
-                  block_dims == orig_paddings.dim_size(0) &&
-                  2 == orig_paddings.dim_size(1),
-              errors::InvalidArgument("paddings should have shape [",
-                                      block_dims, ", 2] instead of ",
-                                      orig_paddings.shape().DebugString()));
+  if (orig_input_tensor.dims() < 1 + block_dims) {
+    return errors::InvalidArgument("input rank should be >= ", 1 + block_dims,
+                                   " instead of ", orig_input_tensor.dims());
+  }
+
+  if (!(TensorShapeUtils::IsMatrix(orig_paddings.shape()) &&
+        block_dims == orig_paddings.dim_size(0) &&
+        2 == orig_paddings.dim_size(1))) {
+    return errors::InvalidArgument("paddings should have shape [", block_dims,
+                                   ", 2] instead of ",
+                                   orig_paddings.shape().DebugString());
+  }
 
   // To avoid out-of-bounds access in the case that the block_shape and/or
   // paddings tensors are concurrently modified, we must copy the values.
@@ -101,22 +101,23 @@ void SpaceToBatchOpCompute(OpKernelContext* context,
   for (int block_dim = 0; block_dim < block_dims; ++block_dim) {
     block_shape_product *= block_shape[block_dim];
   }
-  OP_REQUIRES(
-      context, block_shape_product > 0,
-      errors::InvalidArgument("Product of block sizes must be positive, got ",
-                              block_shape_product));
+  if (block_shape_product <= 0) {
+    return errors::InvalidArgument(
+        "Product of block sizes must be positive, got ", block_shape_product);
+  }
 
   const int internal_block_dims =
       block_dims - removed_prefix_block_dims - removed_suffix_block_dims;
-  OP_REQUIRES(context, internal_block_dims <= kMaxSpaceToBatchBlockDims,
-              errors::InvalidArgument(
-                  "Maximum number of non-combined block dimensions is ",
-                  internal_block_dims, " but must not exceed ",
-                  kMaxSpaceToBatchBlockDims));
+  if (internal_block_dims > kMaxSpaceToBatchBlockDims) {
+    return errors::InvalidArgument(
+        "Maximum number of non-combined block dimensions is ",
+        internal_block_dims, " but must not exceed ",
+        kMaxSpaceToBatchBlockDims);
+  }
 
   if (internal_block_dims == 0) {
     context->set_output(0, orig_input_tensor);
-    return;
+    return Status::OK();
   }
 
   // For the purpose of computing the result, the input will be treated as
@@ -146,16 +147,18 @@ void SpaceToBatchOpCompute(OpKernelContext* context,
        block_dim < block_dims - removed_suffix_block_dims; ++block_dim) {
     const int64 pad_start = paddings[2 * block_dim],
                 pad_end = paddings[2 * block_dim + 1];
-    OP_REQUIRES(context, pad_start >= 0 && pad_end >= 0,
-                errors::InvalidArgument("Paddings must be non-negative"));
+    if (pad_start < 0 || pad_end < 0) {
+      return errors::InvalidArgument("Paddings must be non-negative");
+    }
     const int64 input_size = orig_input_tensor.dim_size(block_dim + 1);
     const int64 block_shape_value = block_shape[block_dim];
     const int64 padded_size = input_size + pad_start + pad_end;
-    OP_REQUIRES(
-        context, padded_size % block_shape_value == 0,
-        errors::InvalidArgument("padded_shape[", block_dim, "]=", padded_size,
-                                " is not divisible by block_shape[", block_dim,
-                                "]=", block_shape_value));
+    if (padded_size % block_shape_value != 0) {
+      return errors::InvalidArgument("padded_shape[", block_dim,
+                                     "]=", padded_size,
+                                     " is not divisible by block_shape[",
+                                     block_dim, "]=", block_shape_value);
+    }
     internal_input_shape.AddDim(input_size);
     const int64 output_size = padded_size / block_shape_value;
     internal_output_shape.AddDim(output_size);
@@ -174,29 +177,29 @@ void SpaceToBatchOpCompute(OpKernelContext* context,
 
   // Allocate output tensor.
   Tensor* output_tensor = nullptr;
-  OP_REQUIRES_OK(context, context->allocate_output(0, external_output_shape,
-                                                   &output_tensor));
+  TF_RETURN_IF_ERROR(
+      context->allocate_output(0, external_output_shape, &output_tensor));
 
   const int64* internal_paddings = &paddings[2 * removed_prefix_block_dims];
   const int64* internal_block_shape = &block_shape[removed_prefix_block_dims];
 
   switch (internal_block_dims) {
-#define TF_SPACETOBATCH_BLOCK_DIMS_CASE(NUM_BLOCK_DIMS)                    \
-  case NUM_BLOCK_DIMS: {                                                   \
-    OP_REQUIRES_OK(                                                        \
-        context,                                                           \
-        (functor::SpaceToBatchFunctor<Device, T, NUM_BLOCK_DIMS, false>()( \
-            context->eigen_device<Device>(),                               \
-            orig_input_tensor.shaped<T, NUM_BLOCK_DIMS + 2>(               \
-                internal_input_shape.dim_sizes()),                         \
-            internal_block_shape, internal_paddings,                       \
-            output_tensor->shaped<T, NUM_BLOCK_DIMS + 2>(                  \
-                internal_output_shape.dim_sizes()))));                     \
-  } break;                                                                 \
+#define TF_SPACETOBATCH_BLOCK_DIMS_CASE(NUM_BLOCK_DIMS)                   \
+  case NUM_BLOCK_DIMS: {                                                  \
+    TF_RETURN_IF_ERROR(                                                   \
+        functor::SpaceToBatchFunctor<Device, T, NUM_BLOCK_DIMS, false>()( \
+            context->eigen_device<Device>(),                              \
+            orig_input_tensor.shaped<T, NUM_BLOCK_DIMS + 2>(              \
+                internal_input_shape.dim_sizes()),                        \
+            internal_block_shape, internal_paddings,                      \
+            output_tensor->shaped<T, NUM_BLOCK_DIMS + 2>(                 \
+                internal_output_shape.dim_sizes())));                     \
+  } break;                                                                \
     /**/
     TF_SPACETOBATCH_FOR_EACH_NUM_BLOCK_DIMS(TF_SPACETOBATCH_BLOCK_DIMS_CASE)
 #undef TF_SPACETOBATCH_BLOCK_DIMS_CASE
   }
+  return Status::OK();
 }
 
 }  // namespace
@@ -211,8 +214,9 @@ class SpaceToBatchNDOp : public OpKernel {
     const Tensor& orig_input_tensor = context->input(0);
     const Tensor& orig_block_shape = context->input(1);
     const Tensor& orig_paddings = context->input(2);
-    SpaceToBatchOpCompute<Device, T>(context, orig_input_tensor,
-                                     orig_block_shape, orig_paddings);
+    OP_REQUIRES_OK(context, SpaceToBatchOpCompute<Device, T>(
+                                context, orig_input_tensor, orig_block_shape,
+                                orig_paddings));
   }
 };
 
@@ -241,7 +245,8 @@ class SpaceToBatchOp : public OpKernel {
     OP_REQUIRES(context, kRequiredDims == dims,
                 errors::InvalidArgument("Input rank should be: ", kRequiredDims,
                                         "instead of: ", dims));
-    SpaceToBatchOpCompute<Device, T>(context, in0, block_shape_, in1);
+    OP_REQUIRES_OK(context, SpaceToBatchOpCompute<Device, T>(
+                                context, in0, block_shape_, in1));
   }
 
  private:
diff --git a/tensorflow/core/kernels/sparse_concat_op.cc b/tensorflow/core/kernels/sparse_concat_op.cc
index f813794374..3b2a0cb0f3 100644
--- a/tensorflow/core/kernels/sparse_concat_op.cc
+++ b/tensorflow/core/kernels/sparse_concat_op.cc
@@ -124,9 +124,12 @@ class SparseConcatOp : public OpKernel {
     std::vector<sparse::SparseTensor> sp_inputs;
     for (int i = 0; i < N; ++i) {
       const TensorShape current_shape(shapes[i].vec<int64>());
-      sp_inputs.emplace_back(tensor::DeepCopy(inds[i]),
-                             tensor::DeepCopy(vals[i]), current_shape,
-                             std_order);
+      sparse::SparseTensor tensor;
+      OP_REQUIRES_OK(context,
+                     sparse::SparseTensor::Create(
+                         tensor::DeepCopy(inds[i]), tensor::DeepCopy(vals[i]),
+                         current_shape, std_order, &tensor));
+      sp_inputs.push_back(std::move(tensor));
       sp_inputs[i].Reorder<T>(concat_order);
     }
 
diff --git a/tensorflow/core/kernels/sparse_reduce_op.cc b/tensorflow/core/kernels/sparse_reduce_op.cc
index 9e60791f97..a465564739 100644
--- a/tensorflow/core/kernels/sparse_reduce_op.cc
+++ b/tensorflow/core/kernels/sparse_reduce_op.cc
@@ -172,8 +172,10 @@ class SparseReduceOp : public OpKernel {
     // making deep copies here.  Remove this if/when we change Reorder()'s
     // semantics.
     const auto shape_vec = shape_t->vec<int64>();
-    SparseTensor sp(tensor::DeepCopy(*indices_t), tensor::DeepCopy(*values_t),
-                    TensorShape(shape_vec));
+    SparseTensor sp;
+    OP_REQUIRES_OK(ctx, SparseTensor::Create(
+        tensor::DeepCopy(*indices_t), tensor::DeepCopy(*values_t),
+                    TensorShape(shape_vec), &sp));
     ReduceDetails reduction = SparseTensorReduceHelper(
         sp, reduction_axes_t->flat<int32>(), keep_dims_);
 
@@ -260,8 +262,10 @@ class SparseReduceSparseOp : public OpKernel {
 
     OP_REQUIRES_OK(ctx, ValidateInputs(shape_t, reduction_axes_t));
 
-    SparseTensor sp(tensor::DeepCopy(*indices_t), tensor::DeepCopy(*values_t),
-                    TensorShape(shape_t->vec<int64>()));
+    SparseTensor sp;
+    OP_REQUIRES_OK(ctx, SparseTensor::Create(tensor::DeepCopy(*indices_t),
+                                         tensor::DeepCopy(*values_t),
+                    TensorShape(shape_t->vec<int64>()), &sp));
     ReduceDetails reduction = SparseTensorReduceHelper(
         sp, reduction_axes_t->flat<int32>(), keep_dims_);
 
diff --git a/tensorflow/core/kernels/sparse_reorder_op.cc b/tensorflow/core/kernels/sparse_reorder_op.cc
index d1373fe0ef..6f9065827f 100644
--- a/tensorflow/core/kernels/sparse_reorder_op.cc
+++ b/tensorflow/core/kernels/sparse_reorder_op.cc
@@ -60,16 +60,21 @@ class SparseReorderOp : public OpKernel {
     std::iota(std_order.begin(), std_order.end(), 0);
 
     // Check if the sparse tensor is already ordered correctly
-    sparse::SparseTensor input_sp(input_ind, input_val, input_shape, std_order);
+    sparse::SparseTensor input_sp;
+    OP_REQUIRES_OK(
+        context, sparse::SparseTensor::Create(input_ind, input_val, input_shape,
+                                              std_order, &input_sp));
 
     if (input_sp.IndicesValid().ok()) {
       context->set_output(0, input_sp.indices());
       context->set_output(1, input_sp.values());
     } else {
       // Deep-copy the input Tensors, then reorder in-place
-      sparse::SparseTensor reordered_sp(tensor::DeepCopy(input_ind),
-                                        tensor::DeepCopy(input_val),
-                                        input_shape);
+      sparse::SparseTensor reordered_sp;
+      OP_REQUIRES_OK(context,
+                     sparse::SparseTensor::Create(tensor::DeepCopy(input_ind),
+                                                  tensor::DeepCopy(input_val),
+                                                  input_shape, &reordered_sp));
       reordered_sp.Reorder<T>(std_order);
       context->set_output(0, reordered_sp.indices());
       context->set_output(1, reordered_sp.values());
diff --git a/tensorflow/core/kernels/sparse_slice_grad_op.cc b/tensorflow/core/kernels/sparse_slice_grad_op.cc
index 90a39ed818..f92b6414ff 100644
--- a/tensorflow/core/kernels/sparse_slice_grad_op.cc
+++ b/tensorflow/core/kernels/sparse_slice_grad_op.cc
@@ -18,7 +18,6 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_util.h"
 #include "tensorflow/core/framework/types.h"
-#include "tensorflow/core/util/sparse/sparse_tensor.h"
 
 namespace tensorflow {
 
diff --git a/tensorflow/core/kernels/sparse_slice_op.cc b/tensorflow/core/kernels/sparse_slice_op.cc
index 10dc208ab6..6aaf4fd88f 100644
--- a/tensorflow/core/kernels/sparse_slice_op.cc
+++ b/tensorflow/core/kernels/sparse_slice_op.cc
@@ -66,8 +66,11 @@ class SparseSliceOp : public OpKernel {
                     "Expected size to be a vector of length ", input_dims,
                     " but got length ", input_size.NumElements()));
 
-    sparse::SparseTensor sparse_tensor(input_indices, input_values,
-                                       TensorShape(input_shape.vec<int64>()));
+    sparse::SparseTensor sparse_tensor;
+    OP_REQUIRES_OK(context,
+                   sparse::SparseTensor::Create(
+                       input_indices, input_values,
+                       TensorShape(input_shape.vec<int64>()), &sparse_tensor));
 
     const gtl::ArraySlice<int64> start(input_start.flat<int64>().data(),
                                        input_dims);
diff --git a/tensorflow/core/kernels/sparse_softmax_op.cc b/tensorflow/core/kernels/sparse_softmax_op.cc
index 444a5f657a..dc3119bba4 100644
--- a/tensorflow/core/kernels/sparse_softmax_op.cc
+++ b/tensorflow/core/kernels/sparse_softmax_op.cc
@@ -69,8 +69,11 @@ class SparseSoftmaxOp : public OpKernel {
 
     const int nnz = static_cast<int>(indices_t->dim_size(0));
     const int rank = static_cast<int>(indices_t->dim_size(1));
-    SparseTensor st(tensor::DeepCopy(*indices_t), tensor::DeepCopy(*values_t),
-                    TensorShape(shape_t->flat<int64>()));
+    SparseTensor st;
+    OP_REQUIRES_OK(
+        context, SparseTensor::Create(
+                     tensor::DeepCopy(*indices_t), tensor::DeepCopy(*values_t),
+                     TensorShape(shape_t->flat<int64>()), &st));
 
     Tensor *output_values = nullptr;
     OP_REQUIRES_OK(context, context->allocate_output(0, TensorShape({nnz}),
diff --git a/tensorflow/core/kernels/sparse_split_op.cc b/tensorflow/core/kernels/sparse_split_op.cc
index 67dcf05a6c..3d02be47cb 100644
--- a/tensorflow/core/kernels/sparse_split_op.cc
+++ b/tensorflow/core/kernels/sparse_split_op.cc
@@ -63,10 +63,16 @@ class SparseSplitOp : public OpKernel {
                                 input_shape.vec<int64>()(split_dim), "), got ",
                                 num_split_));
 
-    sparse::SparseTensor sparse_tensor(input_indices, input_values,
-                                       TensorShape(input_shape.vec<int64>()));
-    const std::vector<sparse::SparseTensor> outputs =
-        sparse::SparseTensor::Split<T>(sparse_tensor, split_dim, num_split_);
+    sparse::SparseTensor sparse_tensor;
+    OP_REQUIRES_OK(context,
+                   sparse::SparseTensor::Create(
+                       input_indices, input_values,
+                       TensorShape(input_shape.vec<int64>()), &sparse_tensor));
+
+    std::vector<sparse::SparseTensor> outputs;
+    OP_REQUIRES_OK(context,
+                   sparse::SparseTensor::Split<T>(sparse_tensor, split_dim,
+                                                  num_split_, &outputs));
 
     for (int slice_index = 0; slice_index < num_split_; ++slice_index) {
       context->set_output(slice_index, outputs[slice_index].indices());
diff --git a/tensorflow/core/kernels/sparse_tensors_map_ops.cc b/tensorflow/core/kernels/sparse_tensors_map_ops.cc
index 2aadd92475..74fa3a15f0 100644
--- a/tensorflow/core/kernels/sparse_tensors_map_ops.cc
+++ b/tensorflow/core/kernels/sparse_tensors_map_ops.cc
@@ -93,8 +93,9 @@ class SparseTensorsMap : public ResourceBase {
         const Tensor* ix = sp_iter->second.indices.AccessTensor(ctx);
         const Tensor* values = sp_iter->second.values.AccessTensor(ctx);
         const auto& shape = sp_iter->second.shape;
-        sparse_tensors->emplace_back(*ix, *values, shape);
-
+        SparseTensor tensor;
+        TF_RETURN_IF_ERROR(SparseTensor::Create(*ix, *values, shape, &tensor));
+        sparse_tensors->push_back(std::move(tensor));
         sp_tensors_.erase(sp_iter);
       }
     }
@@ -195,7 +196,9 @@ class AddSparseToTensorsMapOp : public SparseTensorAccessingOp {
                    TensorShapeUtils::MakeShape(input_shape->vec<int64>().data(),
                                                input_shape->NumElements(),
                                                &input_shape_object));
-    SparseTensor st(*input_indices, *input_values, input_shape_object);
+    SparseTensor st;
+    OP_REQUIRES_OK(context, SparseTensor::Create(*input_indices, *input_values,
+                                                 input_shape_object, &st));
     int64 handle;
     OP_REQUIRES_OK(context, map->AddSparseTensor(context, st, &handle));
 
@@ -253,8 +256,10 @@ class AddManySparseToTensorsMapOp : public SparseTensorAccessingOp {
     TensorShape tensor_input_shape(input_shape->vec<int64>());
     gtl::InlinedVector<int64, 8> std_order(rank);
     std::iota(std_order.begin(), std_order.end(), 0);
-    SparseTensor input_st(*input_indices, *input_values, tensor_input_shape,
-                          std_order);
+    SparseTensor input_st;
+    OP_REQUIRES_OK(context, SparseTensor::Create(*input_indices, *input_values,
+                                                 tensor_input_shape, std_order,
+                                                 &input_st));
 
     auto input_shape_t = input_shape->vec<int64>();
     const int64 N = input_shape_t(0);
@@ -300,7 +305,10 @@ class AddManySparseToTensorsMapOp : public SparseTensorAccessingOp {
         output_values_t(i) = values(i);
       }
 
-      SparseTensor st_i(output_indices, output_values, output_shape);
+      SparseTensor st_i;
+      OP_REQUIRES_OK(context,
+                     SparseTensor::Create(output_indices, output_values,
+                                          output_shape, &st_i));
       int64 handle;
       OP_REQUIRES_OK(context, map->AddSparseTensor(context, st_i, &handle));
       sparse_handles_t(b) = handle;
@@ -311,7 +319,9 @@ class AddManySparseToTensorsMapOp : public SparseTensorAccessingOp {
     if (visited.size() < N) {
       Tensor empty_indices(DT_INT64, {0, rank - 1});
       Tensor empty_values(DataTypeToEnum<T>::value, {0});
-      SparseTensor empty_st(empty_indices, empty_values, output_shape);
+      SparseTensor empty_st;
+      OP_REQUIRES_OK(context, SparseTensor::Create(empty_indices, empty_values,
+                                                   output_shape, &empty_st));
 
       for (int64 b = 0; b < N; ++b) {
         // We skipped this batch entry.
@@ -466,13 +476,15 @@ class TakeManySparseFromTensorsMapOp : public SparseTensorAccessingOp {
     std::vector<SparseTensor> tensors_to_concat;
     tensors_to_concat.reserve(N);
     for (int i = 0; i < N; ++i) {
-      tensors_to_concat.emplace_back(std::move(indices_to_concat[i]),
-                                     std::move(values_to_concat[i]),
-                                     preconcat_shape, std_order);
+      SparseTensor tensor;
+      OP_REQUIRES_OK(context,
+                     SparseTensor::Create(std::move(indices_to_concat[i]),
+                                          std::move(values_to_concat[i]),
+                                          preconcat_shape, std_order, &tensor));
+      tensors_to_concat.push_back(std::move(tensor));
     }
 
-    SparseTensor output(SparseTensor::Concat<T>(tensors_to_concat));
-
+    auto output = SparseTensor::Concat<T>(tensors_to_concat);
     Tensor final_output_shape(DT_INT64, TensorShape({output.dims()}));
 
     std::copy_n(output.shape().data(), output.dims(),
diff --git a/tensorflow/core/kernels/sparse_to_dense_op.cc b/tensorflow/core/kernels/sparse_to_dense_op.cc
index ba3da21a43..f79a4d0494 100644
--- a/tensorflow/core/kernels/sparse_to_dense_op.cc
+++ b/tensorflow/core/kernels/sparse_to_dense_op.cc
@@ -119,8 +119,10 @@ class SparseToDense : public OpKernel {
     // Assume SparseTensor is lexicographically sorted.
     gtl::InlinedVector<int64, 8> order(output->shape().dims());
     std::iota(order.begin(), order.end(), 0);
-    sparse::SparseTensor st(indices_shaped, sparse_values_b, output->shape(),
-                            order);
+    sparse::SparseTensor st;
+    OP_REQUIRES_OK(c,
+                   sparse::SparseTensor::Create(indices_shaped, sparse_values_b,
+                                                output->shape(), order, &st));
 
     if (validate_indices_) {
       OP_REQUIRES_OK(c, st.IndicesValid());
diff --git a/tensorflow/core/kernels/strided_slice_op.cc b/tensorflow/core/kernels/strided_slice_op.cc
index 1e3e92a68a..59fdc2262a 100644
--- a/tensorflow/core/kernels/strided_slice_op.cc
+++ b/tensorflow/core/kernels/strided_slice_op.cc
@@ -32,6 +32,7 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/kernels/bounds_check.h"
 #include "tensorflow/core/kernels/ops_util.h"
+#include "tensorflow/core/kernels/training_op_helpers.h"
 #include "tensorflow/core/kernels/variable_ops.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
@@ -304,6 +305,9 @@ class StridedSliceAssignOp : public OpKernel {
       Var* v;
       OP_REQUIRES_OK(context,
                      LookupResource(context, HandleFromInput(context, 0), &v));
+      mutex_lock ml(*v->mu());
+      OP_REQUIRES_OK(context,
+                     PrepareToUpdateVariable<Device, T>(context, v->tensor()));
       old_lhs = *v->tensor();
       OP_REQUIRES(context, old_lhs.dtype() == DataTypeToEnum<T>::value,
                   errors::InvalidArgument(
diff --git a/tensorflow/core/kernels/training_op_helpers.cc b/tensorflow/core/kernels/training_op_helpers.cc
index f288e124ee..d3c4f62071 100644
--- a/tensorflow/core/kernels/training_op_helpers.cc
+++ b/tensorflow/core/kernels/training_op_helpers.cc
@@ -39,8 +39,15 @@ mutex* GetTrainingVariableMutex(OpKernelContext* ctx, int input) {
 // GetInputTensor which will signal a failure.
 std::vector<mutex_lock> MaybeLockVariableInputMutexesInOrder(
     OpKernelContext* ctx, bool do_lock, const std::vector<int>& input_ids) {
+  bool any_resource = false;
+  for (auto i : input_ids) {
+    if (ctx->input_dtype(i) == DT_RESOURCE) {
+      any_resource = true;
+      break;
+    }
+  }
   std::vector<mutex_lock> locks;
-  if (!do_lock) {
+  if (!do_lock && !any_resource) {
     return locks;
   }
   std::vector<mutex*> mutexes;
diff --git a/tensorflow/core/kernels/training_op_helpers.h b/tensorflow/core/kernels/training_op_helpers.h
index 7e56e15450..765335d3a0 100644
--- a/tensorflow/core/kernels/training_op_helpers.h
+++ b/tensorflow/core/kernels/training_op_helpers.h
@@ -80,18 +80,8 @@ Status GetInputTensorFromVariable(OpKernelContext* ctx, int input,
     Var* var;
     TF_RETURN_IF_ERROR(LookupResource(ctx, HandleFromInput(ctx, input), &var));
     core::ScopedUnref unref_var(var);
-    if (lock_held) {
-      TF_RETURN_IF_ERROR(
-          PrepareToUpdateVariable<Device, T>(ctx, var->tensor()));
-      *out = *var->tensor();
-    } else {
-      mutex_lock ml(*var->mu());
-      if (!sparse) {
-        TF_RETURN_IF_ERROR(
-            PrepareToUpdateVariable<Device, T>(ctx, var->tensor()));
-      }
-      *out = *var->tensor();
-    }
+    TF_RETURN_IF_ERROR(PrepareToUpdateVariable<Device, T>(ctx, var->tensor()));
+    *out = *var->tensor();
     return Status::OK();
   }
   *out = ctx->mutable_input(input, lock_held);
diff --git a/tensorflow/core/lib/bfloat16/bfloat16.h b/tensorflow/core/lib/bfloat16/bfloat16.h
index 1c130ba300..d6f3f26cd5 100644
--- a/tensorflow/core/lib/bfloat16/bfloat16.h
+++ b/tensorflow/core/lib/bfloat16/bfloat16.h
@@ -45,17 +45,25 @@ typedef std::complex<double> complex128;
 struct bfloat16 {
   B16_DEVICE_FUNC bfloat16() {}
 
-  B16_DEVICE_FUNC explicit bfloat16(const float v) {
+  B16_DEVICE_FUNC static bfloat16 truncate_to_bfloat16(const float v) {
+    bfloat16 output;
     if (float_isnan(v)) {
-      value = NAN_VALUE;
-      return;
+      output.value = NAN_VALUE;
+      return output;
     }
     const uint16_t* p = reinterpret_cast<const uint16_t*>(&v);
 #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
-    value = p[0];
+    output.value = p[0];
 #else
-    value = p[1];
+    output.value = p[1];
 #endif
+    return output;
+  }
+
+  B16_DEVICE_FUNC explicit bfloat16(const float v) {
+    // TODO(asabne) : change the below line to
+    // value = round_to_bfloat16(v).value;
+    value = truncate_to_bfloat16(v).value;
   }
 
   B16_DEVICE_FUNC explicit bfloat16(const double val)
@@ -169,8 +177,6 @@ struct bfloat16 {
 
   // Converts a float point to bfloat16, with round-nearest-to-even as rounding
   // method.
-  // TODO(b/69266521): Add a truncate_to_bfloat16 function and make this
-  // function as default behavior.
   // TODO: There is a slightly faster implementation (8% faster on CPU)
   // than this (documented in cl/175987786), that is exponentially harder to
   // understand and document. Switch to the faster version when converting to
diff --git a/tensorflow/core/lib/core/errors.h b/tensorflow/core/lib/core/errors.h
index 51c09032df..a631d9815a 100644
--- a/tensorflow/core/lib/core/errors.h
+++ b/tensorflow/core/lib/core/errors.h
@@ -19,6 +19,7 @@ limitations under the License.
 #include <sstream>
 
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/macros.h"
@@ -118,6 +119,25 @@ DECLARE_ERROR(Unauthenticated, UNAUTHENTICATED)
 
 #undef DECLARE_ERROR
 
+// Produces a formatted string pattern from the name which can uniquely identify
+// this node upstream to produce an informative error message. The pattern
+// followed is: {{node <name>}}
+// Note: The pattern below determines the regex _NODEDEF_NAME_RE in the file
+// tensorflow/python/client/session.py
+// LINT.IfChange
+inline string FormatNodeNameForError(const string& name) {
+  return strings::StrCat("{{node ", name, "}}");
+}
+// LINT.ThenChange(//tensorflow/python/client/session.py)
+template <typename T>
+string FormatNodeNamesForError(const T& names) {
+  ::tensorflow::str_util::Formatter<string> f(
+      [](string* output, const string& s) {
+        ::tensorflow::strings::StrAppend(output, FormatNodeNameForError(s));
+      });
+  return ::tensorflow::str_util::Join(names, ", ", f);
+}
+
 // The CanonicalCode() for non-errors.
 using ::tensorflow::error::OK;
 
diff --git a/tensorflow/core/lib/core/refcount.h b/tensorflow/core/lib/core/refcount.h
index eb41f9ff36..87bcfec411 100644
--- a/tensorflow/core/lib/core/refcount.h
+++ b/tensorflow/core/lib/core/refcount.h
@@ -17,6 +17,8 @@ limitations under the License.
 #define TENSORFLOW_LIB_CORE_REFCOUNT_H_
 
 #include <atomic>
+#include <memory>
+
 #include "tensorflow/core/platform/logging.h"
 
 namespace tensorflow {
@@ -58,6 +60,15 @@ class RefCounted {
   void operator=(const RefCounted&) = delete;
 };
 
+// A deleter class to form a std::unique_ptr that unrefs objects.
+struct RefCountDeleter {
+  void operator()(tensorflow::core::RefCounted* o) const { o->Unref(); }
+};
+
+// A unique_ptr that unrefs the owned object on destruction.
+template <typename T>
+using RefCountPtr = std::unique_ptr<T, RefCountDeleter>;
+
 // Helper class to unref an object when out-of-scope.
 class ScopedUnref {
  public:
diff --git a/tensorflow/core/lib/db/sqlite.cc b/tensorflow/core/lib/db/sqlite.cc
index cb6943379d..cf11f3a331 100644
--- a/tensorflow/core/lib/db/sqlite.cc
+++ b/tensorflow/core/lib/db/sqlite.cc
@@ -112,6 +112,7 @@ Status EnvPragma(Sqlite* db, const char* pragma, const char* var) {
 /* static */
 Status Sqlite::Open(const string& path, int flags, Sqlite** db) {
   flags |= SQLITE_OPEN_PRIVATECACHE;
+  flags |= SQLITE_OPEN_URI;
   sqlite3* sqlite = nullptr;
   int rc = sqlite3_open_v2(path.c_str(), &sqlite, flags, nullptr);
   if (rc != SQLITE_OK) {
diff --git a/tensorflow/core/lib/io/record_reader_writer_test.cc b/tensorflow/core/lib/io/record_reader_writer_test.cc
index 95ac040602..c36c909399 100644
--- a/tensorflow/core/lib/io/record_reader_writer_test.cc
+++ b/tensorflow/core/lib/io/record_reader_writer_test.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/core/lib/io/record_reader.h"
 #include "tensorflow/core/lib/io/record_writer.h"
 
+#include <zlib.h>
 #include <vector>
 #include "tensorflow/core/platform/env.h"
 
@@ -33,6 +34,89 @@ static std::vector<int> BufferSizes() {
           12, 13, 14, 15, 16, 17, 18, 19, 20, 65536};
 }
 
+namespace {
+
+io::RecordReaderOptions GetMatchingReaderOptions(
+    const io::RecordWriterOptions& options) {
+  if (options.compression_type == io::RecordWriterOptions::ZLIB_COMPRESSION) {
+    return io::RecordReaderOptions::CreateRecordReaderOptions("ZLIB");
+  }
+  return io::RecordReaderOptions::CreateRecordReaderOptions("");
+}
+
+uint64 GetFileSize(const string& fname) {
+  Env* env = Env::Default();
+  uint64 fsize;
+  TF_CHECK_OK(env->GetFileSize(fname, &fsize));
+  return fsize;
+}
+
+void VerifyFlush(const io::RecordWriterOptions& options) {
+  std::vector<string> records = {
+      "abcdefghijklmnopqrstuvwxyz",
+      "ZYXWVUTSRQPONMLKJIHGFEDCBA0123456789!@#$%^&*()",
+      "G5SyohOL9UmXofSOOwWDrv9hoLLMYPJbG9r38t3uBRcHxHj2PdKcPDuZmKW62RIY",
+      "aaaaaaaaaaaaaaaaaaaaaaaaaa",
+  };
+
+  Env* env = Env::Default();
+  string fname = testing::TmpDir() + "/record_reader_writer_flush_test";
+
+  std::unique_ptr<WritableFile> file;
+  TF_CHECK_OK(env->NewWritableFile(fname, &file));
+  io::RecordWriter writer(file.get(), options);
+
+  std::unique_ptr<RandomAccessFile> read_file;
+  TF_CHECK_OK(env->NewRandomAccessFile(fname, &read_file));
+  io::RecordReaderOptions read_options = GetMatchingReaderOptions(options);
+  io::RecordReader reader(read_file.get(), read_options);
+
+  EXPECT_EQ(GetFileSize(fname), 0);
+  for (size_t i = 0; i < records.size(); i++) {
+    uint64 start_size = GetFileSize(fname);
+
+    // Write a new record.
+    TF_EXPECT_OK(writer.WriteRecord(records[i]));
+    TF_CHECK_OK(writer.Flush());
+    TF_CHECK_OK(file->Flush());
+
+    // Verify that file size has changed after file flush.
+    uint64 new_size = GetFileSize(fname);
+    EXPECT_GT(new_size, start_size);
+
+    // Verify that file has all records written so far and no more.
+    uint64 offset = 0;
+    string record;
+    for (size_t j = 0; j <= i; j++) {
+      // Check that j'th record is written correctly.
+      TF_CHECK_OK(reader.ReadRecord(&offset, &record));
+      EXPECT_EQ(record, records[j]);
+    }
+
+    // Verify that file has no more records.
+    CHECK_EQ(reader.ReadRecord(&offset, &record).code(), error::OUT_OF_RANGE);
+  }
+}
+
+}  // namespace
+
+TEST(RecordReaderWriterTest, TestFlush) {
+  io::RecordWriterOptions options;
+  VerifyFlush(options);
+}
+
+TEST(RecordReaderWriterTest, TestZlibSyncFlush) {
+  io::RecordWriterOptions options;
+  options.compression_type = io::RecordWriterOptions::ZLIB_COMPRESSION;
+  // The default flush_mode is Z_NO_FLUSH and only writes to the file when the
+  // buffer is full or the file is closed, which makes testing harder.
+  // By using Z_SYNC_FLUSH the test can verify Flush does write out records of
+  // approximately the right size at the right times.
+  options.zlib_options.flush_mode = Z_SYNC_FLUSH;
+
+  VerifyFlush(options);
+}
+
 TEST(RecordReaderWriterTest, TestBasics) {
   Env* env = Env::Default();
   string fname = testing::TmpDir() + "/record_reader_writer_test";
diff --git a/tensorflow/core/lib/io/zlib_compression_options.cc b/tensorflow/core/lib/io/zlib_compression_options.cc
new file mode 100644
index 0000000000..fc54083be1
--- /dev/null
+++ b/tensorflow/core/lib/io/zlib_compression_options.cc
@@ -0,0 +1,32 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/lib/io/zlib_compression_options.h"
+
+#include <zlib.h>
+
+namespace tensorflow {
+namespace io {
+
+ZlibCompressionOptions::ZlibCompressionOptions() {
+  flush_mode = Z_NO_FLUSH;
+  window_bits = MAX_WBITS;
+  compression_level = Z_DEFAULT_COMPRESSION;
+  compression_method = Z_DEFLATED;
+  compression_strategy = Z_DEFAULT_STRATEGY;
+}
+
+}  // namespace io
+}  // namespace tensorflow
diff --git a/tensorflow/core/lib/io/zlib_compression_options.h b/tensorflow/core/lib/io/zlib_compression_options.h
index dc7218e866..238c1464fb 100644
--- a/tensorflow/core/lib/io/zlib_compression_options.h
+++ b/tensorflow/core/lib/io/zlib_compression_options.h
@@ -16,8 +16,6 @@ limitations under the License.
 #ifndef TENSORFLOW_LIB_IO_ZLIB_COMPRESSION_OPTIONS_H_
 #define TENSORFLOW_LIB_IO_ZLIB_COMPRESSION_OPTIONS_H_
 
-#include <zlib.h>
-
 #include "tensorflow/core/platform/types.h"
 
 namespace tensorflow {
@@ -25,11 +23,14 @@ namespace io {
 
 class ZlibCompressionOptions {
  public:
+  ZlibCompressionOptions();
+
   static ZlibCompressionOptions DEFAULT();
   static ZlibCompressionOptions RAW();
   static ZlibCompressionOptions GZIP();
 
-  int8 flush_mode = Z_NO_FLUSH;
+  // Defaults to Z_NO_FLUSH
+  int8 flush_mode;
 
   // Size of the buffer used for caching the data read from source file.
   int64 input_buffer_size = 256 << 10;
@@ -71,7 +72,9 @@ class ZlibCompressionOptions {
   // window_bits value provided used while compressing. If a compressed stream
   // with a larger window size is given as input, inflate() will return with the
   // error code Z_DATA_ERROR instead of trying to allocate a larger window.
-  int8 window_bits = MAX_WBITS;
+  //
+  // Defaults to MAX_WBITS
+  int8 window_bits;
 
   // From the zlib manual (http://www.zlib.net/manual.html):
   // The compression level must be Z_DEFAULT_COMPRESSION, or between 0 and 9:
@@ -79,10 +82,10 @@ class ZlibCompressionOptions {
   // (the input data is simply copied a block at a time). Z_DEFAULT_COMPRESSION
   // requests a default compromise between speed and compression (currently
   // equivalent to level 6).
-  int8 compression_level = Z_DEFAULT_COMPRESSION;
+  int8 compression_level;
 
-  // The only one supported at this time.
-  int8 compression_method = Z_DEFLATED;
+  // Only Z_DEFLATED is supported at this time.
+  int8 compression_method;
 
   // From the zlib manual (http://www.zlib.net/manual.html):
   // The mem_level parameter specifies how much memory should be allocated for
@@ -106,7 +109,7 @@ class ZlibCompressionOptions {
   // but not the correctness of the compressed output even if it is not set
   // appropriately. Z_FIXED prevents the use of dynamic Huffman codes, allowing
   // for a simpler decoder for special applications.
-  int8 compression_strategy = Z_DEFAULT_STRATEGY;
+  int8 compression_strategy;
 };
 
 inline ZlibCompressionOptions ZlibCompressionOptions::DEFAULT() {
diff --git a/tensorflow/core/lib/io/zlib_inputstream.cc b/tensorflow/core/lib/io/zlib_inputstream.cc
index 47de36bf6c..d069db6d20 100644
--- a/tensorflow/core/lib/io/zlib_inputstream.cc
+++ b/tensorflow/core/lib/io/zlib_inputstream.cc
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include <zlib.h>
+
 #include "tensorflow/core/lib/io/zlib_inputstream.h"
 
 #include "tensorflow/core/lib/strings/strcat.h"
@@ -21,6 +23,35 @@ limitations under the License.
 namespace tensorflow {
 namespace io {
 
+struct ZStreamDef {
+  ZStreamDef(size_t input_buffer_capacity, size_t output_buffer_capacity)
+      : input(new Bytef[input_buffer_capacity]),
+        output(new Bytef[output_buffer_capacity]),
+        stream(new z_stream) {}
+
+  // Buffer for storing contents read from compressed stream.
+  // TODO(srbs): Consider using circular buffers. That would greatly simplify
+  // the implementation.
+  std::unique_ptr<Bytef[]> input;
+
+  // Buffer for storing inflated contents of `input_stream_`.
+  std::unique_ptr<Bytef[]> output;
+
+  // Configuration passed to `inflate`.
+  //
+  // z_stream_def_->stream->next_in:
+  //   Next byte to de-compress. Points to some byte in
+  //   z_stream_def_->streamdef_.input buffer.
+  // z_stream_def_->stream->avail_in:
+  //   Number of bytes available to be decompressed at this time.
+  // z_stream_def_->stream->next_out:
+  //   Next byte to write de-compressed data to. Points to some byte in
+  //   z_stream_def_->streamdef_.output buffer.
+  // z_stream_def_->stream->avail_out:
+  //   Number of free bytes available at write location.
+  std::unique_ptr<z_stream> stream;
+};
+
 ZlibInputStream::ZlibInputStream(
     InputStreamInterface* input_stream,
     size_t input_buffer_bytes,   // size of z_stream.next_in buffer
@@ -30,10 +61,9 @@ ZlibInputStream::ZlibInputStream(
       input_stream_(input_stream),
       input_buffer_capacity_(input_buffer_bytes),
       output_buffer_capacity_(output_buffer_bytes),
-      z_stream_input_(new Bytef[input_buffer_capacity_]),
-      z_stream_output_(new Bytef[output_buffer_capacity_]),
       zlib_options_(zlib_options),
-      z_stream_(new z_stream),
+      z_stream_def_(
+          new ZStreamDef(input_buffer_capacity_, output_buffer_capacity_)),
       bytes_read_(0) {
   InitZlibBuffer();
 }
@@ -46,8 +76,8 @@ ZlibInputStream::ZlibInputStream(InputStreamInterface* input_stream,
                       zlib_options, false) {}
 
 ZlibInputStream::~ZlibInputStream() {
-  if (z_stream_) {
-    inflateEnd(z_stream_.get());
+  if (z_stream_def_->stream) {
+    inflateEnd(z_stream_def_->stream.get());
   }
   if (owns_input_stream_) {
     delete input_stream_;
@@ -56,51 +86,54 @@ ZlibInputStream::~ZlibInputStream() {
 
 Status ZlibInputStream::Reset() {
   TF_RETURN_IF_ERROR(input_stream_->Reset());
-  inflateEnd(z_stream_.get());
+  inflateEnd(z_stream_def_->stream.get());
   InitZlibBuffer();
   bytes_read_ = 0;
   return Status::OK();
 }
 
 void ZlibInputStream::InitZlibBuffer() {
-  memset(z_stream_.get(), 0, sizeof(z_stream));
+  memset(z_stream_def_->stream.get(), 0, sizeof(z_stream));
 
-  z_stream_->zalloc = Z_NULL;
-  z_stream_->zfree = Z_NULL;
-  z_stream_->opaque = Z_NULL;
-  z_stream_->next_in = Z_NULL;
-  z_stream_->avail_in = 0;
+  z_stream_def_->stream->zalloc = Z_NULL;
+  z_stream_def_->stream->zfree = Z_NULL;
+  z_stream_def_->stream->opaque = Z_NULL;
+  z_stream_def_->stream->next_in = Z_NULL;
+  z_stream_def_->stream->avail_in = 0;
 
-  int status = inflateInit2(z_stream_.get(), zlib_options_.window_bits);
+  int status =
+      inflateInit2(z_stream_def_->stream.get(), zlib_options_.window_bits);
 
   CHECK_EQ(status, Z_OK) << "inflateInit failed with status " << status;
 
-  z_stream_->next_in = z_stream_input_.get();
-  z_stream_->next_out = z_stream_output_.get();
-  next_unread_byte_ = reinterpret_cast<char*>(z_stream_output_.get());
-  z_stream_->avail_in = 0;
-  z_stream_->avail_out = output_buffer_capacity_;
+  z_stream_def_->stream->next_in = z_stream_def_->input.get();
+  z_stream_def_->stream->next_out = z_stream_def_->output.get();
+  next_unread_byte_ = reinterpret_cast<char*>(z_stream_def_->output.get());
+  z_stream_def_->stream->avail_in = 0;
+  z_stream_def_->stream->avail_out = output_buffer_capacity_;
 }
 
 Status ZlibInputStream::ReadFromStream() {
   int bytes_to_read = input_buffer_capacity_;
-  char* read_location = reinterpret_cast<char*>(z_stream_input_.get());
+  char* read_location = reinterpret_cast<char*>(z_stream_def_->input.get());
 
   // If there are unread bytes in the input stream we move them to the head
   // of the stream to maximize the space available to read new data into.
-  if (z_stream_->avail_in > 0) {
-    uLong read_bytes = z_stream_->next_in - z_stream_input_.get();
+  if (z_stream_def_->stream->avail_in > 0) {
+    uLong read_bytes =
+        z_stream_def_->stream->next_in - z_stream_def_->input.get();
     // Remove `read_bytes` from the head of the input stream.
     // Move unread bytes to the head of the input stream.
     if (read_bytes > 0) {
-      memmove(z_stream_input_.get(), z_stream_->next_in, z_stream_->avail_in);
+      memmove(z_stream_def_->input.get(), z_stream_def_->stream->next_in,
+              z_stream_def_->stream->avail_in);
     }
 
-    bytes_to_read -= z_stream_->avail_in;
-    read_location += z_stream_->avail_in;
+    bytes_to_read -= z_stream_def_->stream->avail_in;
+    read_location += z_stream_def_->stream->avail_in;
   }
   string data;
-  // Try to read enough data to fill up z_stream_input_.
+  // Try to read enough data to fill up z_stream_def_->input.
   // TODO(rohanj): Add a char* version of ReadNBytes to InputStreamInterface
   // and use that instead to make this more efficient.
   Status s = input_stream_->ReadNBytes(bytes_to_read, &data);
@@ -108,10 +141,10 @@ Status ZlibInputStream::ReadFromStream() {
 
   // Since we moved unread data to the head of the input stream we can point
   // next_in to the head of the input stream.
-  z_stream_->next_in = z_stream_input_.get();
+  z_stream_def_->stream->next_in = z_stream_def_->input.get();
 
   // Note: data.size() could be different from bytes_to_read.
-  z_stream_->avail_in += data.size();
+  z_stream_def_->stream->avail_in += data.size();
 
   if (!s.ok() && !errors::IsOutOfRange(s)) {
     return s;
@@ -135,7 +168,8 @@ Status ZlibInputStream::ReadFromStream() {
 size_t ZlibInputStream::ReadBytesFromCache(size_t bytes_to_read,
                                            string* result) {
   size_t unread_bytes =
-      reinterpret_cast<char*>(z_stream_->next_out) - next_unread_byte_;
+      reinterpret_cast<char*>(z_stream_def_->stream->next_out) -
+      next_unread_byte_;
   size_t can_read_bytes = std::min(bytes_to_read, unread_bytes);
   if (can_read_bytes > 0) {
     result->append(next_unread_byte_, can_read_bytes);
@@ -147,8 +181,9 @@ size_t ZlibInputStream::ReadBytesFromCache(size_t bytes_to_read,
 
 size_t ZlibInputStream::NumUnreadBytes() const {
   size_t read_bytes =
-      next_unread_byte_ - reinterpret_cast<char*>(z_stream_output_.get());
-  return output_buffer_capacity_ - z_stream_->avail_out - read_bytes;
+      next_unread_byte_ - reinterpret_cast<char*>(z_stream_def_->output.get());
+  return output_buffer_capacity_ - z_stream_def_->stream->avail_out -
+         read_bytes;
 }
 
 Status ZlibInputStream::ReadNBytes(int64 bytes_to_read, string* result) {
@@ -167,14 +202,14 @@ Status ZlibInputStream::ReadNBytes(int64 bytes_to_read, string* result) {
     // completely consumed. This is an optimization and can be removed if
     // it causes problems. `ReadFromStream` is capable of handling partially
     // filled up buffers.
-    if (z_stream_->avail_in == 0) {
+    if (z_stream_def_->stream->avail_in == 0) {
       TF_RETURN_IF_ERROR(ReadFromStream());
     }
 
     // Step 2. Setup output stream.
-    z_stream_->next_out = z_stream_output_.get();
-    next_unread_byte_ = reinterpret_cast<char*>(z_stream_output_.get());
-    z_stream_->avail_out = output_buffer_capacity_;
+    z_stream_def_->stream->next_out = z_stream_def_->output.get();
+    next_unread_byte_ = reinterpret_cast<char*>(z_stream_def_->output.get());
+    z_stream_def_->stream->avail_out = output_buffer_capacity_;
 
     // Step 3. Inflate Inflate Inflate!
     TF_RETURN_IF_ERROR(Inflate());
@@ -188,12 +223,12 @@ Status ZlibInputStream::ReadNBytes(int64 bytes_to_read, string* result) {
 int64 ZlibInputStream::Tell() const { return bytes_read_; }
 
 Status ZlibInputStream::Inflate() {
-  int error = inflate(z_stream_.get(), zlib_options_.flush_mode);
+  int error = inflate(z_stream_def_->stream.get(), zlib_options_.flush_mode);
   if (error != Z_OK && error != Z_STREAM_END) {
     string error_string =
         strings::StrCat("inflate() failed with error ", error);
-    if (z_stream_->msg != nullptr) {
-      strings::StrAppend(&error_string, ": ", z_stream_->msg);
+    if (z_stream_def_->stream->msg != nullptr) {
+      strings::StrAppend(&error_string, ": ", z_stream_def_->stream->msg);
     }
     return errors::DataLoss(error_string);
   }
diff --git a/tensorflow/core/lib/io/zlib_inputstream.h b/tensorflow/core/lib/io/zlib_inputstream.h
index 37339163ee..ac9e23ca97 100644
--- a/tensorflow/core/lib/io/zlib_inputstream.h
+++ b/tensorflow/core/lib/io/zlib_inputstream.h
@@ -16,8 +16,6 @@ limitations under the License.
 #ifndef TENSORFLOW_LIB_IO_ZLIB_INPUTSTREAM_H_
 #define TENSORFLOW_LIB_IO_ZLIB_INPUTSTREAM_H_
 
-#include <zlib.h>
-
 #include <string>
 
 #include "tensorflow/core/lib/core/status.h"
@@ -30,6 +28,10 @@ limitations under the License.
 namespace tensorflow {
 namespace io {
 
+// Forward declare some members of zlib.h, which is only included in the
+// .cc file.
+struct ZStreamDef;
+
 // An ZlibInputStream provides support for reading from a stream compressed
 // using zlib (http://www.zlib.net/). Buffers the contents of the file.
 //
@@ -79,28 +81,9 @@ class ZlibInputStream : public InputStreamInterface {
   size_t output_buffer_capacity_;  // Size of z_stream_output_
   char* next_unread_byte_;         // Next unread byte in z_stream_output_
 
-  // Buffer for storing contents read from compressed stream.
-  // TODO(srbs): Consider using circular buffers. That would greatly simplify
-  // the implementation.
-  std::unique_ptr<Bytef[]> z_stream_input_;
-
-  // Buffer for storing inflated contents of `input_stream_`.
-  std::unique_ptr<Bytef[]> z_stream_output_;
-
   ZlibCompressionOptions const zlib_options_;
 
-  // Configuration passed to `inflate`.
-  //
-  // z_stream_->next_in:
-  //   Next byte to de-compress. Points to some byte in z_stream_input_ buffer.
-  // z_stream_->avail_in:
-  //   Number of bytes available to be decompressed at this time.
-  // z_stream_->next_out:
-  //   Next byte to write de-compressed data to. Points to some byte in
-  //   z_stream_output_ buffer.
-  // z_stream_->avail_out:
-  //   Number of free bytes available at write location.
-  std::unique_ptr<z_stream> z_stream_;
+  std::unique_ptr<ZStreamDef> z_stream_def_;
 
   // Reads data from `input_stream_` and tries to fill up `z_stream_input_` if
   // enough unread data is left in `input_stream_`.
diff --git a/tensorflow/core/ops/array_ops.cc b/tensorflow/core/ops/array_ops.cc
index fce0b93cd7..d6ae75473f 100644
--- a/tensorflow/core/ops/array_ops.cc
+++ b/tensorflow/core/ops/array_ops.cc
@@ -2549,14 +2549,16 @@ REGISTER_OP("ExtractImagePatches")
 REGISTER_OP("Bitcast")
     .Input("input: T")
     .Output("output: type")
-    // All supported dtypes are listed here to include qint16 and quint16.
+    // All supported dtypes are listed here to include qint16, quint16, uint32,
+    // and uint64.
     .Attr(
-        "T: {bfloat16, half, float, double, int64, int32, uint8, uint16, int8, "
-        "int16, complex64, complex128, qint8, quint8, qint16, quint16, qint32}")
+        "T: {bfloat16, half, float, double, int64, int32, uint8, uint16, "
+        "uint32, uint64, int8, int16, complex64, complex128, qint8, quint8, "
+        "qint16, quint16, qint32}")
     .Attr(
         "type: {bfloat16, half, float, double, int64, int32, uint8, uint16, "
-        "int8, int16, complex64, complex128, qint8, quint8, qint16, quint16, "
-        "qint32}")
+        "uint32, uint64, int8, int16, complex64, complex128, qint8, quint8, "
+        "qint16, quint16, qint32}")
     .SetShapeFn([](InferenceContext* c) {
       ShapeHandle input = c->input(0);
       if (!c->RankKnown(input)) {
@@ -2879,7 +2881,7 @@ REGISTER_OP("ScatterNdNonAliasingAdd")
     .Input("indices: Tindices")
     .Input("updates: T")
     .Output("output: T")
-    .Attr("T: numbertype")
+    .Attr("T: {numbertype, bool}")
     .Attr("Tindices: {int32, int64}")
     .SetShapeFn(shape_inference::ScatterNdUpdateShape);
 
diff --git a/tensorflow/core/ops/boosted_trees_ops.cc b/tensorflow/core/ops/boosted_trees_ops.cc
index edcdc4cb6a..01452b3e85 100644
--- a/tensorflow/core/ops/boosted_trees_ops.cc
+++ b/tensorflow/core/ops/boosted_trees_ops.cc
@@ -331,4 +331,27 @@ REGISTER_OP("BoostedTreesUpdateEnsemble")
       return Status::OK();
     });
 
+REGISTER_OP("BoostedTreesCenterBias")
+    .Input("tree_ensemble_handle: resource")
+    .Input("mean_gradients: float")
+    .Input("mean_hessians: float")
+    // Regularization-related.
+    .Input("l1: float")
+    .Input("l2: float")
+    .Output("continue_centering: bool")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle gradients_shape;
+      shape_inference::ShapeHandle hessians_shape;
+      shape_inference::ShapeHandle unused_shape;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 2, &gradients_shape));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 2, &hessians_shape));
+      TF_RETURN_IF_ERROR(
+          c->Merge(gradients_shape, hessians_shape, &unused_shape));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &unused_shape));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 0, &unused_shape));
+
+      c->set_output(0, c->Scalar());
+      return Status::OK();
+    });
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/ops/compat/ops_history.v1.pbtxt b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
index 8fc87408a2..b7ca6dbe3d 100644
--- a/tensorflow/core/ops/compat/ops_history.v1.pbtxt
+++ b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
@@ -7681,66 +7681,6 @@ op {
   }
 }
 op {
-  name: "AvgPool"
-  input_arg {
-    name: "value"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "ksize"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-}
-op {
   name: "AvgPool3D"
   input_arg {
     name: "input"
@@ -8430,70 +8370,6 @@ op {
   }
 }
 op {
-  name: "AvgPoolGrad"
-  input_arg {
-    name: "orig_input_shape"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "grad"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "ksize"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-}
-op {
   name: "Barrier"
   output_arg {
     name: "handle"
@@ -10555,61 +10431,6 @@ op {
   }
 }
 op {
-  name: "BiasAdd"
-  input_arg {
-    name: "value"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "bias"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_COMPLEX64
-        type: DT_INT64
-        type: DT_QINT8
-        type: DT_QUINT8
-        type: DT_QINT32
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_COMPLEX128
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-}
-op {
   name: "BiasAddGrad"
   input_arg {
     name: "out_backprop"
@@ -10802,57 +10623,6 @@ op {
   }
 }
 op {
-  name: "BiasAddGrad"
-  input_arg {
-    name: "out_backprop"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_COMPLEX64
-        type: DT_INT64
-        type: DT_QINT8
-        type: DT_QUINT8
-        type: DT_QINT32
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_COMPLEX128
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-}
-op {
   name: "BiasAddV1"
   input_arg {
     name: "value"
@@ -11276,6 +11046,71 @@ op {
   }
 }
 op {
+  name: "Bitcast"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "type"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_BFLOAT16
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT64
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_UINT16
+        type: DT_UINT32
+        type: DT_UINT64
+        type: DT_INT8
+        type: DT_INT16
+        type: DT_COMPLEX64
+        type: DT_COMPLEX128
+        type: DT_QINT8
+        type: DT_QUINT8
+        type: DT_QINT16
+        type: DT_QUINT16
+        type: DT_QINT32
+      }
+    }
+  }
+  attr {
+    name: "type"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_BFLOAT16
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT64
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_UINT16
+        type: DT_UINT32
+        type: DT_UINT64
+        type: DT_INT8
+        type: DT_INT16
+        type: DT_COMPLEX64
+        type: DT_COMPLEX128
+        type: DT_QINT8
+        type: DT_QUINT8
+        type: DT_QINT16
+        type: DT_QUINT16
+        type: DT_QINT32
+      }
+    }
+  }
+}
+op {
   name: "BitwiseAnd"
   input_arg {
     name: "x"
@@ -11527,6 +11362,34 @@ op {
   }
 }
 op {
+  name: "BoostedTreesCenterBias"
+  input_arg {
+    name: "tree_ensemble_handle"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "mean_gradients"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "mean_hessians"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "l1"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "l2"
+    type: DT_FLOAT
+  }
+  output_arg {
+    name: "continue_centering"
+    type: DT_BOOL
+  }
+  is_stateful: true
+}
+op {
   name: "BoostedTreesCreateEnsemble"
   input_arg {
     name: "tree_ensemble_handle"
@@ -12211,6 +12074,32 @@ op {
   }
 }
 op {
+  name: "Cast"
+  input_arg {
+    name: "x"
+    type_attr: "SrcT"
+  }
+  output_arg {
+    name: "y"
+    type_attr: "DstT"
+  }
+  attr {
+    name: "SrcT"
+    type: "type"
+  }
+  attr {
+    name: "DstT"
+    type: "type"
+  }
+  attr {
+    name: "Truncate"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+}
+op {
   name: "Ceil"
   input_arg {
     name: "x"
@@ -13429,81 +13318,6 @@ op {
   }
 }
 op {
-  name: "Conv2D"
-  input_arg {
-    name: "input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "filter"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-  }
-  attr {
-    name: "use_cudnn_on_gpu"
-    type: "bool"
-    default_value {
-      b: true
-    }
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "dilations"
-    type: "list(int)"
-    default_value {
-      list {
-        i: 1
-        i: 1
-        i: 1
-        i: 1
-      }
-    }
-  }
-}
-op {
   name: "Conv2DBackpropFilter"
   input_arg {
     name: "input"
@@ -13720,148 +13534,6 @@ op {
   }
 }
 op {
-  name: "Conv2DBackpropFilter"
-  input_arg {
-    name: "input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "filter_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "out_backprop"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-  }
-  attr {
-    name: "use_cudnn_on_gpu"
-    type: "bool"
-    default_value {
-      b: true
-    }
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "dilations"
-    type: "list(int)"
-    default_value {
-      list {
-        i: 1
-        i: 1
-        i: 1
-        i: 1
-      }
-    }
-  }
-}
-op {
-  name: "Conv2DBackpropInput"
-  input_arg {
-    name: "input_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "filter"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "out_backprop"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_FLOAT
-      }
-    }
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-  }
-  attr {
-    name: "use_cudnn_on_gpu"
-    type: "bool"
-    default_value {
-      b: true
-    }
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-      }
-    }
-  }
-}
-op {
   name: "Conv2DBackpropInput"
   input_arg {
     name: "input_sizes"
@@ -13885,7 +13557,6 @@ op {
     allowed_values {
       list {
         type: DT_HALF
-        type: DT_BFLOAT16
         type: DT_FLOAT
       }
     }
@@ -13924,18 +13595,6 @@ op {
       }
     }
   }
-  attr {
-    name: "dilations"
-    type: "list(int)"
-    default_value {
-      list {
-        i: 1
-        i: 1
-        i: 1
-        i: 1
-      }
-    }
-  }
 }
 op {
   name: "Conv2DBackpropInput"
@@ -13963,7 +13622,6 @@ op {
         type: DT_HALF
         type: DT_BFLOAT16
         type: DT_FLOAT
-        type: DT_DOUBLE
       }
     }
   }
@@ -14075,8 +13733,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -18824,74 +18480,6 @@ op {
   }
 }
 op {
-  name: "DepthwiseConv2dNative"
-  input_arg {
-    name: "input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "filter"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "dilations"
-    type: "list(int)"
-    default_value {
-      list {
-        i: 1
-        i: 1
-        i: 1
-        i: 1
-      }
-    }
-  }
-}
-op {
   name: "DepthwiseConv2dNativeBackpropFilter"
   input_arg {
     name: "input"
@@ -19130,78 +18718,6 @@ op {
   }
 }
 op {
-  name: "DepthwiseConv2dNativeBackpropFilter"
-  input_arg {
-    name: "input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "filter_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "out_backprop"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "dilations"
-    type: "list(int)"
-    default_value {
-      list {
-        i: 1
-        i: 1
-        i: 1
-        i: 1
-      }
-    }
-  }
-}
-op {
   name: "DepthwiseConv2dNativeBackpropInput"
   input_arg {
     name: "input_sizes"
@@ -19440,78 +18956,6 @@ op {
   }
 }
 op {
-  name: "DepthwiseConv2dNativeBackpropInput"
-  input_arg {
-    name: "input_sizes"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "filter"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "out_backprop"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_HALF
-        type: DT_BFLOAT16
-        type: DT_FLOAT
-        type: DT_DOUBLE
-      }
-    }
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "dilations"
-    type: "list(int)"
-    default_value {
-      list {
-        i: 1
-        i: 1
-        i: 1
-        i: 1
-      }
-    }
-  }
-}
-op {
   name: "Dequantize"
   input_arg {
     name: "input"
@@ -24750,6 +24194,60 @@ op {
   }
 }
 op {
+  name: "FusedPadConv2D"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "paddings"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "filter"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "mode"
+    type: "string"
+    allowed_values {
+      list {
+        s: "REFLECT"
+        s: "SYMMETRIC"
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+}
+op {
   name: "FusedResizeAndPadConv2D"
   input_arg {
     name: "input"
@@ -24813,6 +24311,71 @@ op {
   }
 }
 op {
+  name: "FusedResizeAndPadConv2D"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "paddings"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "filter"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_HALF
+        type: DT_FLOAT
+        type: DT_DOUBLE
+      }
+    }
+  }
+  attr {
+    name: "resize_align_corners"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+  attr {
+    name: "mode"
+    type: "string"
+    allowed_values {
+      list {
+        s: "REFLECT"
+        s: "SYMMETRIC"
+      }
+    }
+  }
+  attr {
+    name: "strides"
+    type: "list(int)"
+  }
+  attr {
+    name: "padding"
+    type: "string"
+    allowed_values {
+      list {
+        s: "SAME"
+        s: "VALID"
+      }
+    }
+  }
+}
+op {
   name: "Gather"
   input_arg {
     name: "params"
@@ -26129,29 +25692,6 @@ op {
   }
 }
 op {
-  name: "IdentityDataset"
-  input_arg {
-    name: "input_dataset"
-    type: DT_VARIANT
-  }
-  output_arg {
-    name: "handle"
-    type: DT_VARIANT
-  }
-  attr {
-    name: "output_types"
-    type: "list(type)"
-    has_minimum: true
-    minimum: 1
-  }
-  attr {
-    name: "output_shapes"
-    type: "list(shape)"
-    has_minimum: true
-    minimum: 1
-  }
-}
-op {
   name: "IdentityN"
   input_arg {
     name: "input"
@@ -26317,6 +25857,81 @@ op {
   }
 }
 op {
+  name: "If"
+  input_arg {
+    name: "cond"
+    type_attr: "Tcond"
+  }
+  input_arg {
+    name: "input"
+    type_list_attr: "Tin"
+  }
+  output_arg {
+    name: "output"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "Tcond"
+    type: "type"
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "then_branch"
+    type: "func"
+  }
+  attr {
+    name: "else_branch"
+    type: "func"
+  }
+}
+op {
+  name: "If"
+  input_arg {
+    name: "cond"
+    type_attr: "Tcond"
+  }
+  input_arg {
+    name: "input"
+    type_list_attr: "Tin"
+  }
+  output_arg {
+    name: "output"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "Tcond"
+    type: "type"
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "then_branch"
+    type: "func"
+  }
+  attr {
+    name: "else_branch"
+    type: "func"
+  }
+  is_stateful: true
+}
+op {
   name: "Igamma"
   input_arg {
     name: "a"
@@ -27685,6 +27300,36 @@ op {
   is_stateful: true
 }
 op {
+  name: "IteratorFromStringHandleV2"
+  input_arg {
+    name: "string_handle"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "resource_handle"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    default_value {
+      list {
+      }
+    }
+    has_minimum: true
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    default_value {
+      list {
+      }
+    }
+    has_minimum: true
+  }
+  is_stateful: true
+}
+op {
   name: "IteratorGetNext"
   input_arg {
     name: "iterator"
@@ -27709,6 +27354,30 @@ op {
   is_stateful: true
 }
 op {
+  name: "IteratorGetNextAsOptional"
+  input_arg {
+    name: "iterator"
+    type: DT_RESOURCE
+  }
+  output_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
+op {
   name: "IteratorGetNextSync"
   input_arg {
     name: "iterator"
@@ -27745,6 +27414,34 @@ op {
   is_stateful: true
 }
 op {
+  name: "IteratorV2"
+  output_arg {
+    name: "handle"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+  }
+  attr {
+    name: "container"
+    type: "string"
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
+op {
   name: "L2Loss"
   input_arg {
     name: "t"
@@ -32015,85 +31712,6 @@ op {
   }
 }
 op {
-  name: "MaxPoolGrad"
-  input_arg {
-    name: "orig_input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "orig_output"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "grad"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "ksize"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "T"
-    type: "type"
-    default_value {
-      type: DT_FLOAT
-    }
-    allowed_values {
-      list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_INT64
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
-      }
-    }
-  }
-}
-op {
   name: "MaxPoolGradGrad"
   input_arg {
     name: "orig_input"
@@ -32386,82 +32004,6 @@ op {
   }
 }
 op {
-  name: "MaxPoolGradGrad"
-  input_arg {
-    name: "orig_input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "orig_output"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "grad"
-    type_attr: "T"
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "ksize"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "strides"
-    type: "list(int)"
-    has_minimum: true
-    minimum: 4
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_INT64
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
-      }
-    }
-  }
-}
-op {
   name: "MaxPoolGradGradV2"
   input_arg {
     name: "orig_input"
@@ -32738,78 +32280,6 @@ op {
   }
 }
 op {
-  name: "MaxPoolGradGradV2"
-  input_arg {
-    name: "orig_input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "orig_output"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "grad"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "ksize"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "strides"
-    type: DT_INT32
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "T"
-    type: "type"
-    allowed_values {
-      list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_INT64
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
-      }
-    }
-  }
-}
-op {
   name: "MaxPoolGradGradWithArgmax"
   input_arg {
     name: "input"
@@ -33378,81 +32848,6 @@ op {
   }
 }
 op {
-  name: "MaxPoolGradV2"
-  input_arg {
-    name: "orig_input"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "orig_output"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "grad"
-    type_attr: "T"
-  }
-  input_arg {
-    name: "ksize"
-    type: DT_INT32
-  }
-  input_arg {
-    name: "strides"
-    type: DT_INT32
-  }
-  output_arg {
-    name: "output"
-    type_attr: "T"
-  }
-  attr {
-    name: "padding"
-    type: "string"
-    allowed_values {
-      list {
-        s: "SAME"
-        s: "VALID"
-      }
-    }
-  }
-  attr {
-    name: "data_format"
-    type: "string"
-    default_value {
-      s: "NHWC"
-    }
-    allowed_values {
-      list {
-        s: "NHWC"
-        s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
-      }
-    }
-  }
-  attr {
-    name: "T"
-    type: "type"
-    default_value {
-      type: DT_FLOAT
-    }
-    allowed_values {
-      list {
-        type: DT_FLOAT
-        type: DT_DOUBLE
-        type: DT_INT32
-        type: DT_UINT8
-        type: DT_INT16
-        type: DT_INT8
-        type: DT_INT64
-        type: DT_BFLOAT16
-        type: DT_UINT16
-        type: DT_HALF
-        type: DT_UINT32
-        type: DT_UINT64
-      }
-    }
-  }
-}
-op {
   name: "MaxPoolGradWithArgmax"
   input_arg {
     name: "input"
@@ -36137,6 +35532,71 @@ op {
   }
 }
 op {
+  name: "NonMaxSuppressionV4"
+  input_arg {
+    name: "boxes"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "scores"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "max_output_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "iou_threshold"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "score_threshold"
+    type: DT_FLOAT
+  }
+  output_arg {
+    name: "selected_indices"
+    type: DT_INT32
+  }
+  output_arg {
+    name: "valid_outputs"
+    type: DT_INT32
+  }
+  attr {
+    name: "pad_to_max_output_size"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+}
+op {
+  name: "NonMaxSuppressionWithOverlaps"
+  input_arg {
+    name: "overlaps"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "scores"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "max_output_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "overlap_threshold"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "score_threshold"
+    type: DT_FLOAT
+  }
+  output_arg {
+    name: "selected_indices"
+    type: DT_INT32
+  }
+}
+op {
   name: "NotEqual"
   input_arg {
     name: "x"
@@ -36582,6 +36042,64 @@ op {
   }
 }
 op {
+  name: "OptionalFromValue"
+  input_arg {
+    name: "components"
+    type_list_attr: "Toutput_types"
+  }
+  output_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "Toutput_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+}
+op {
+  name: "OptionalGetValue"
+  input_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "components"
+    type_list_attr: "output_types"
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
+op {
+  name: "OptionalHasValue"
+  input_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "has_value"
+    type: DT_BOOL
+  }
+}
+op {
+  name: "OptionalNone"
+  output_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+}
+op {
   name: "OrderedMapClear"
   attr {
     name: "capacity"
@@ -55909,6 +55427,61 @@ op {
   }
 }
 op {
+  name: "ScatterNdNonAliasingAdd"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  input_arg {
+    name: "indices"
+    type_attr: "Tindices"
+  }
+  input_arg {
+    name: "updates"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "output"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_UINT8
+        type: DT_INT16
+        type: DT_INT8
+        type: DT_COMPLEX64
+        type: DT_INT64
+        type: DT_QINT8
+        type: DT_QUINT8
+        type: DT_QINT32
+        type: DT_BFLOAT16
+        type: DT_UINT16
+        type: DT_COMPLEX128
+        type: DT_HALF
+        type: DT_UINT32
+        type: DT_UINT64
+        type: DT_BOOL
+      }
+    }
+  }
+  attr {
+    name: "Tindices"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
+op {
   name: "ScatterNdSub"
   input_arg {
     name: "ref"
@@ -58701,6 +58274,17 @@ op {
   }
 }
 op {
+  name: "SinkDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+}
+op {
   name: "Size"
   input_arg {
     name: "input"
@@ -58891,7 +58475,11 @@ op {
     type: DT_INT64
   }
   input_arg {
-    name: "stride"
+    name: "window_shift"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "window_stride"
     type: DT_INT64
   }
   output_arg {
@@ -68669,6 +68257,32 @@ op {
   is_stateful: true
 }
 op {
+  name: "StatefulPartitionedCall"
+  input_arg {
+    name: "args"
+    type_list_attr: "Tin"
+  }
+  output_arg {
+    name: "output"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "f"
+    type: "func"
+  }
+  is_stateful: true
+}
+op {
   name: "StatelessMultinomial"
   input_arg {
     name: "logits"
@@ -74645,6 +74259,33 @@ op {
   is_stateful: true
 }
 op {
+  name: "WindowDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "window_size"
+    type: DT_INT64
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
+op {
   name: "WriteAudioSummary"
   input_arg {
     name: "writer"
diff --git a/tensorflow/core/ops/dataset_ops.cc b/tensorflow/core/ops/dataset_ops.cc
index 9dca5f53ce..28bfa73b6d 100644
--- a/tensorflow/core/ops/dataset_ops.cc
+++ b/tensorflow/core/ops/dataset_ops.cc
@@ -362,6 +362,19 @@ REGISTER_OP("FilterDataset")
     .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("WindowDataset")
+    .Input("input_dataset: variant")
+    .Input("window_size: int64")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle unused;
+      // batch_size should be a scalar.
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
+      return shape_inference::ScalarShape(c);
+    });
+
 REGISTER_OP("BatchDataset")
     .Input("input_dataset: variant")
     .Input("batch_size: int64")
@@ -391,19 +404,20 @@ REGISTER_OP("BatchDatasetV2")
       return shape_inference::ScalarShape(c);
     });
 
-// TODO(mrry): move SlideDataset to contrib in the future.
 REGISTER_OP("SlideDataset")
     .Input("input_dataset: variant")
     .Input("window_size: int64")
-    .Input("stride: int64")
+    .Input("window_shift: int64")
+    .Input("window_stride: int64")
     .Output("handle: variant")
     .Attr("output_types: list(type) >= 1")
     .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn([](shape_inference::InferenceContext* c) {
       shape_inference::ShapeHandle unused;
-      // window_size and stride should be scalars.
+      // window_size, window_shift, and window_stride should be scalars.
       TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
       TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &unused));
       return shape_inference::ScalarShape(c);
     });
 
@@ -631,6 +645,14 @@ REGISTER_OP("Iterator")
     .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("IteratorV2")
+    .Output("handle: resource")
+    .Attr("shared_name: string")
+    .Attr("container: string")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape);
+
 REGISTER_OP("AnonymousIterator")
     .Output("handle: resource")
     .Attr("output_types: list(type) >= 1")
@@ -708,6 +730,13 @@ REGISTER_OP("IteratorFromStringHandle")
     .Attr("output_shapes: list(shape) >= 0 = []")
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("IteratorFromStringHandleV2")
+    .Input("string_handle: string")
+    .Output("resource_handle: resource")
+    .Attr("output_types: list(type) >= 0 = []")
+    .Attr("output_shapes: list(shape) >= 0 = []")
+    .SetShapeFn(shape_inference::ScalarShape);
+
 REGISTER_OP("SerializeIterator")
     .Input("resource_handle: resource")
     .Output("serialized: variant")
@@ -770,11 +799,9 @@ REGISTER_OP("DatasetToGraph")
     .Output("graph: string")
     .SetShapeFn(shape_inference::ScalarShape);
 
-REGISTER_OP("IdentityDataset")
+REGISTER_OP("SinkDataset")
     .Input("input_dataset: variant")
     .Output("handle: variant")
-    .Attr("output_types: list(type) >= 1")
-    .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
 REGISTER_OP("OptimizeDataset")
@@ -785,4 +812,33 @@ REGISTER_OP("OptimizeDataset")
     .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("OptionalFromValue")
+    .Input("components: Toutput_types")
+    .Output("optional: variant")
+    .Attr("Toutput_types: list(type) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape);
+
+REGISTER_OP("OptionalNone")
+    .Output("optional: variant")
+    .SetShapeFn(shape_inference::ScalarShape);
+
+REGISTER_OP("OptionalHasValue")
+    .Input("optional: variant")
+    .Output("has_value: bool")
+    .SetShapeFn(shape_inference::ScalarShape);
+
+REGISTER_OP("OptionalGetValue")
+    .Input("optional: variant")
+    .Output("components: output_types")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(IteratorGetNextShapeFn);
+
+REGISTER_OP("IteratorGetNextAsOptional")
+    .Input("iterator: resource")
+    .Output("optional: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape);
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/ops/debug_ops.cc b/tensorflow/core/ops/debug_ops.cc
index 5aebdca1ea..2d9b4360de 100644
--- a/tensorflow/core/ops/debug_ops.cc
+++ b/tensorflow/core/ops/debug_ops.cc
@@ -20,7 +20,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-// EXPERIMENTAL: tfdbg debugger-inserted ops.
+// TensorFlow Debugger-inserted ops.
 // These ops are used only internally by tfdbg. There is no API for users to
 // direct create them. Users can create them indirectly by using
 // RunOptions.debug_options during Session::Run() call. See tfdbg documentation
diff --git a/tensorflow/core/ops/functional_ops.cc b/tensorflow/core/ops/functional_ops.cc
index ec22ce4177..a16ecccf00 100644
--- a/tensorflow/core/ops/functional_ops.cc
+++ b/tensorflow/core/ops/functional_ops.cc
@@ -40,7 +40,11 @@ REGISTER_OP("SymbolicGradient")
         if (types[i] == DT_RESOURCE) {
           const std::vector<shape_inference::ShapeAndType>* handle_type =
               c->input_handle_shapes_and_types(i);
-          c->set_output(i, handle_type->at(0).shape);
+          if (handle_type != nullptr) {
+            c->set_output(i, handle_type->at(0).shape);
+          } else {
+            c->set_output(i, c->UnknownShape());
+          }
         } else {
           c->set_output(i, c->input(i));
         }
@@ -68,6 +72,7 @@ REGISTER_OP("_If")
     .Attr("Tout: list(type)")
     .Attr("then_branch: func")
     .Attr("else_branch: func")
+    .SetIsStateful()
     .SetShapeFn(shape_inference::UnknownShape)
     .Doc(R"doc(
 output = cond ? then_branch(input) : else_branch(input)
@@ -91,9 +96,10 @@ REGISTER_OP("If")
     .Output("output: Tout")
     .Attr("Tcond: type")
     .Attr("Tin: list(type) >= 0")
-    .Attr("Tout: list(type)")
+    .Attr("Tout: list(type) >= 0")
     .Attr("then_branch: func")
     .Attr("else_branch: func")
+    .SetIsStateful()
     .SetShapeFn(shape_inference::UnknownShape);
 
 // TODO(drpng): remove this.
@@ -153,7 +159,6 @@ REGISTER_OP("For")
     .Attr("body: func")
     .SetShapeFn(shape_inference::UnknownShape);
 
-// TODO(b/73826847, b/37549631) Mark as stateful.
 REGISTER_OP("PartitionedCall")
     .Input("args: Tin")
     .Output("output: Tout")
@@ -162,6 +167,15 @@ REGISTER_OP("PartitionedCall")
     .Attr("f: func")
     .SetShapeFn(shape_inference::UnknownShape);
 
+REGISTER_OP("StatefulPartitionedCall")
+    .Input("args: Tin")
+    .Output("output: Tout")
+    .Attr("Tin: list(type) >= 0")
+    .Attr("Tout: list(type) >= 0")
+    .Attr("f: func")
+    .SetIsStateful()
+    .SetShapeFn(shape_inference::UnknownShape);
+
 // This op is used as a placeholder in If branch functions. It doesn't provide a
 // valid output when run, so must either be removed (e.g. replaced with a
 // function input) or guaranteed not to be used (e.g. if mirroring an
diff --git a/tensorflow/core/ops/image_ops.cc b/tensorflow/core/ops/image_ops.cc
index 87f4991134..31267f72b8 100644
--- a/tensorflow/core/ops/image_ops.cc
+++ b/tensorflow/core/ops/image_ops.cc
@@ -442,8 +442,9 @@ REGISTER_OP("DrawBoundingBoxes")
       if (c->ValueKnown(c->Dim(images, 3))) {
         int64 depth = c->Value(c->Dim(images, 3));
         if (!(depth == 1 || depth == 3 || depth == 4)) {
-          return errors::InvalidArgument("Channel depth should be either 1 (GRY), "
-                                         "3 (RGB), or 4 (RGBA)");
+          return errors::InvalidArgument(
+              "Channel depth should be either 1 (GRY), "
+              "3 (RGB), or 4 (RGBA)");
         }
       }
 
@@ -709,4 +710,70 @@ REGISTER_OP("NonMaxSuppressionV3")
       return Status::OK();
     });
 
+REGISTER_OP("NonMaxSuppressionV4")
+    .Input("boxes: float")
+    .Input("scores: float")
+    .Input("max_output_size: int32")
+    .Input("iou_threshold: float")
+    .Input("score_threshold: float")
+    .Output("selected_indices: int32")
+    .Output("valid_outputs: int32")
+    .Attr("pad_to_max_output_size: bool = false")
+    .SetShapeFn([](InferenceContext* c) {
+      // Get inputs and validate ranks.
+      ShapeHandle boxes;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 2, &boxes));
+      ShapeHandle scores;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &scores));
+      ShapeHandle max_output_size;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &max_output_size));
+      ShapeHandle iou_threshold;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &iou_threshold));
+      ShapeHandle score_threshold;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 0, &score_threshold));
+      // The boxes is a 2-D float Tensor of shape [num_boxes, 4].
+      DimensionHandle unused;
+      // The boxes[0] and scores[0] are both num_boxes.
+      TF_RETURN_IF_ERROR(
+          c->Merge(c->Dim(boxes, 0), c->Dim(scores, 0), &unused));
+      // The boxes[1] is 4.
+      TF_RETURN_IF_ERROR(c->WithValue(c->Dim(boxes, 1), 4, &unused));
+
+      c->set_output(0, c->Vector(c->UnknownDim()));
+      c->set_output(1, c->MakeShape({}));
+      return Status::OK();
+    });
+
+REGISTER_OP("NonMaxSuppressionWithOverlaps")
+    .Input("overlaps: float")
+    .Input("scores: float")
+    .Input("max_output_size: int32")
+    .Input("overlap_threshold: float")
+    .Input("score_threshold: float")
+    .Output("selected_indices: int32")
+    .SetShapeFn([](InferenceContext* c) {
+      // Get inputs and validate ranks.
+      ShapeHandle overlaps;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 2, &overlaps));
+      ShapeHandle scores;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &scores));
+      ShapeHandle max_output_size;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &max_output_size));
+      ShapeHandle overlap_threshold;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &overlap_threshold));
+      ShapeHandle score_threshold;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 0, &score_threshold));
+      // The boxes is a 2-D float Tensor of shape [num_boxes, 4].
+      DimensionHandle unused;
+      // The boxes[0] and scores[0] are both num_boxes.
+      TF_RETURN_IF_ERROR(
+          c->Merge(c->Dim(overlaps, 0), c->Dim(scores, 0), &unused));
+      // The boxes[1] is 4.
+      TF_RETURN_IF_ERROR(
+          c->Merge(c->Dim(overlaps, 0), c->Dim(overlaps, 1), &unused));
+
+      c->set_output(0, c->Vector(c->UnknownDim()));
+      return Status::OK();
+    });
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/ops/lookup_ops.cc b/tensorflow/core/ops/lookup_ops.cc
index 444aa8b954..2059741da9 100644
--- a/tensorflow/core/ops/lookup_ops.cc
+++ b/tensorflow/core/ops/lookup_ops.cc
@@ -140,11 +140,13 @@ REGISTER_OP("LookupTableSize")
     .Input("table_handle: Ref(string)")
     .Output("size: int64")
     .SetShapeFn(TwoElementVectorInputsAndScalarOutputs);
+WHITELIST_STATEFUL_OP_FOR_DATASET_FUNCTIONS("LookupTableSize");
 
 REGISTER_OP("LookupTableSizeV2")
     .Input("table_handle: resource")
     .Output("size: int64")
     .SetShapeFn(ScalarAndTwoElementVectorInputsAndScalarOutputs);
+WHITELIST_STATEFUL_OP_FOR_DATASET_FUNCTIONS("LookupTableSizeV2");
 
 REGISTER_OP("LookupTableExport")
     .Input("table_handle: Ref(string)")
diff --git a/tensorflow/core/ops/math_grad_test.cc b/tensorflow/core/ops/math_grad_test.cc
index da38a6bc24..cfa3a64328 100644
--- a/tensorflow/core/ops/math_grad_test.cc
+++ b/tensorflow/core/ops/math_grad_test.cc
@@ -774,12 +774,40 @@ TEST_F(MathGradTest, ComplexPow) {
   };
   SymGrad("Pow", x, y, &dx, &dy);
 
+  // This case failed on Kokoro MacOS:
+  // dx[2] = (-4,6.0398321011234657e-07),
+  // test::AsTensor[2] = (-4,-3.4969110629390343e-07).
+  // dx[2] on linux is close to test::AsTensor[2].
+  // This error hasn't shown up before because
+  // ExpectClose used to check just the magnitude of a complex number, i.e.,
+  // std::abs(complex) = sqrt(real^2 + imag^2).
+  // Now ExpectClose checks the value of each component separately.
+  // Workaround: I set a big tolerance to make the case pass for now.
+  // TODO(penporn): Fix this or file a bug. This is not a precision issue.
+  // Even the most significant digit (or the sign) doesn't match.
   test::ExpectClose(
-      dx, test::AsTensor<complex64>({g(0.f, 2.f), g(2.f, 2.f), g(-2.f, 2.f)},
-                                    TensorShape({3})));
+      dx,
+      test::AsTensor<complex64>({g(0.f, 2.f), g(2.f, 2.f), g(-2.f, 2.f)},
+                                TensorShape({3})),
+      1e-6f);
+
+  // This case failed on Kokoro MacOS:
+  // dx[2] = (2.7725925445556641,12.56636905670166),
+  // test::AsTensor[2] = (2.7725865840911865,12.566371917724609)
+  // dx[2] on linux is close to test::AsTensor[2].
+  // Default atol = rtol = 5.96046e-07.
+  // Real: diff = 5.96046e-06 > threshold = 2.248633e-06 <- failed
+  // Complex: diff = 2.86102e-06 <= threshold = 8.08618e-06 <- passed
+  // Again, this error hasn't shown up before because ExpectClose used to
+  // check just the magnitude of the complex number. Now it checks each
+  // component separately.
+  // Workaround: Set a larger tolerance for now.
+  // TODO(penporn): See if this is a precision issue or a bug.
   test::ExpectClose(
-      dy, test::AsTensor<complex64>({h(0.f, 2.f), h(2.f, 2.f), h(-2.f, 2.f)},
-                                    TensorShape({3})));
+      dy,
+      test::AsTensor<complex64>({h(0.f, 2.f), h(2.f, 2.f), h(-2.f, 2.f)},
+                                TensorShape({3})),
+      4.5e-6f);
 }
 #endif  // TENSORFLOW_USE_SYCL
 
diff --git a/tensorflow/core/ops/math_ops.cc b/tensorflow/core/ops/math_ops.cc
index c229bd5a41..1667c398f4 100644
--- a/tensorflow/core/ops/math_ops.cc
+++ b/tensorflow/core/ops/math_ops.cc
@@ -114,6 +114,7 @@ REGISTER_OP("Cast")
     .Output("y: DstT")
     .Attr("SrcT: type")
     .Attr("DstT: type")
+    .Attr("Truncate: bool = false")
     .SetShapeFn(shape_inference::UnchangedShape);
 
 REGISTER_OP("_HostCast")
@@ -121,6 +122,7 @@ REGISTER_OP("_HostCast")
     .Output("y: DstT")
     .Attr("SrcT: type")
     .Attr("DstT: type")
+    .Attr("Truncate: bool = false")
     .SetShapeFn(shape_inference::UnchangedShape)
     .Doc(R"doc(
 Cast x of type SrcT to y of DstT.
@@ -1380,10 +1382,26 @@ REGISTER_OP("HistogramFixedWidth")
     .Attr("T: {int32, int64, float32, float64}")
     .Attr("dtype: {int32, int64} = DT_INT32")
     .SetShapeFn([](InferenceContext* c) {
+      // value_range should be a vector.
+      ShapeHandle value_range_shape;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &value_range_shape));
+      // value_range should have two elements.
+      DimensionHandle unused;
+      TF_RETURN_IF_ERROR(
+          c->WithValue(c->Dim(value_range_shape, 0), 2, &unused));
+      // nbins should be a scalar.
+      ShapeHandle nbins_shape;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &nbins_shape));
+
+      // If nbins is available, set the shape from nbins.
       const Tensor* nbins_input = c->input_tensor(2);
       if (nbins_input != nullptr) {
         int64 nbins;
         TF_RETURN_IF_ERROR(c->GetScalarFromTensor(nbins_input, &nbins));
+        // nbins has to be positive.
+        if (nbins <= 0) {
+          return errors::InvalidArgument("Requires nbins > 0: ", nbins);
+        }
         c->set_output(0, c->Vector(nbins));
       } else {
         c->set_output(0, c->UnknownShapeOfRank(1));
@@ -1488,6 +1506,13 @@ REGISTER_OP("QuantizedAdd")
     .SetIsCommutative()
     .SetShapeFn([](InferenceContext* c) {
       TF_RETURN_IF_ERROR(shape_inference::BroadcastBinaryOpShapeFn(c));
+      // min_x, max_x, min_y, max_y should be scalar.
+      ShapeHandle unused;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(3), 0, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(4), 0, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(5), 0, &unused));
+
       c->set_output(1, c->Scalar());
       c->set_output(2, c->Scalar());
       return Status::OK();
diff --git a/tensorflow/core/ops/math_ops_test.cc b/tensorflow/core/ops/math_ops_test.cc
index 8f974d5367..23f1538912 100644
--- a/tensorflow/core/ops/math_ops_test.cc
+++ b/tensorflow/core/ops/math_ops_test.cc
@@ -528,4 +528,34 @@ TEST(MathOpsTest, Cross_ShapeFn) {
   INFER_OK(op, "[?];[?]", "in0");
   INFER_OK(op, "[1,?,3];[?,?,?]", "in0");
 }
+
+TEST(MathOpsTest, HistogramFixedWidth_ShapeFn) {
+  ShapeInferenceTestOp op("HistogramFixedWidth");
+
+  // value_range should be vector.
+  INFER_ERROR("Shape must be rank 1 but is rank 0", op, "[];[];[]");
+  // value_range should have 2 elements.
+  INFER_ERROR("Dimension must be 2 but is 3", op, "[];[3];[]");
+  // nbins should be scalar.
+  INFER_ERROR("Shape must be rank 0 but is rank 1", op, "[];[2];[2]");
+
+  INFER_OK(op, "?;?;?", "[?]");
+  INFER_OK(op, "[?];[2];[]", "[?]");
+  INFER_OK(op, "[?];[2];?", "[?]");
+}
+
+TEST(MathOpsTest, QuantizedAdd_ShapeFn) {
+  ShapeInferenceTestOp op("QuantizedAdd");
+
+  INFER_OK(op, "?;?;?;?;?;?", "?;[];[]");
+  INFER_OK(op, "?;?;[];[];[];[]", "?;[];[]");
+  INFER_OK(op, "[1,2];?;[];[];[];[]", "?;[];[]");
+  INFER_OK(op, "[];[2];[];[];[];[]", "[d1_0];[];[]");
+
+  // Rank checks on input scalars.
+  INFER_ERROR("must be rank 0", op, "?;?;[1];?;?;?");
+  INFER_ERROR("must be rank 0", op, "?;?;?;[2];?;?");
+  INFER_ERROR("must be rank 0", op, "?;?;?;?;[3];?");
+  INFER_ERROR("must be rank 0", op, "?;?;?;?;?;[4]");
+}
 }  // end namespace tensorflow
diff --git a/tensorflow/core/ops/nn_ops.cc b/tensorflow/core/ops/nn_ops.cc
index f1bbfac5e6..f947d4c30d 100644
--- a/tensorflow/core/ops/nn_ops.cc
+++ b/tensorflow/core/ops/nn_ops.cc
@@ -432,7 +432,7 @@ REGISTER_OP("FusedResizeAndPadConv2D")
     .Input("paddings: int32")
     .Input("filter: T")
     .Output("output: T")
-    .Attr("T: {float}")
+    .Attr("T: {half, float, double}")
     .Attr("resize_align_corners: bool = false")
     .Attr(GetMirrorPadModeAttrString())
     .Attr("strides: list(int)")
@@ -446,7 +446,7 @@ REGISTER_OP("FusedPadConv2D")
     .Input("paddings: int32")
     .Input("filter: T")
     .Output("output: T")
-    .Attr("T: {float}")
+    .Attr("T: {half, float, double}")
     .Attr(GetMirrorPadModeAttrString())
     .Attr("strides: list(int)")
     .Attr(GetPaddingAttrString())
diff --git a/tensorflow/core/ops/ops.pbtxt b/tensorflow/core/ops/ops.pbtxt
index 32bbd4521a..ef167a2e73 100644
--- a/tensorflow/core/ops/ops.pbtxt
+++ b/tensorflow/core/ops/ops.pbtxt
@@ -2490,8 +2490,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -2674,8 +2672,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -3989,8 +3985,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -4040,8 +4034,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -4140,6 +4132,8 @@ op {
         type: DT_INT32
         type: DT_UINT8
         type: DT_UINT16
+        type: DT_UINT32
+        type: DT_UINT64
         type: DT_INT8
         type: DT_INT16
         type: DT_COMPLEX64
@@ -4165,6 +4159,8 @@ op {
         type: DT_INT32
         type: DT_UINT8
         type: DT_UINT16
+        type: DT_UINT32
+        type: DT_UINT64
         type: DT_INT8
         type: DT_INT16
         type: DT_COMPLEX64
@@ -4340,6 +4336,34 @@ op {
   }
 }
 op {
+  name: "BoostedTreesCenterBias"
+  input_arg {
+    name: "tree_ensemble_handle"
+    type: DT_RESOURCE
+  }
+  input_arg {
+    name: "mean_gradients"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "mean_hessians"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "l1"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "l2"
+    type: DT_FLOAT
+  }
+  output_arg {
+    name: "continue_centering"
+    type: DT_BOOL
+  }
+  is_stateful: true
+}
+op {
   name: "BoostedTreesCreateEnsemble"
   input_arg {
     name: "tree_ensemble_handle"
@@ -4953,6 +4977,13 @@ op {
     name: "DstT"
     type: "type"
   }
+  attr {
+    name: "Truncate"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
 }
 op {
   name: "Ceil"
@@ -5702,8 +5733,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -5781,8 +5810,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -5860,8 +5887,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -8564,8 +8589,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -8636,8 +8659,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -8708,8 +8729,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -11457,7 +11476,9 @@ op {
     type: "type"
     allowed_values {
       list {
+        type: DT_HALF
         type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
@@ -11513,7 +11534,9 @@ op {
     type: "type"
     allowed_values {
       list {
+        type: DT_HALF
         type: DT_FLOAT
+        type: DT_DOUBLE
       }
     }
   }
@@ -12342,29 +12365,6 @@ op {
   }
 }
 op {
-  name: "IdentityDataset"
-  input_arg {
-    name: "input_dataset"
-    type: DT_VARIANT
-  }
-  output_arg {
-    name: "handle"
-    type: DT_VARIANT
-  }
-  attr {
-    name: "output_types"
-    type: "list(type)"
-    has_minimum: true
-    minimum: 1
-  }
-  attr {
-    name: "output_shapes"
-    type: "list(shape)"
-    has_minimum: true
-    minimum: 1
-  }
-}
-op {
   name: "IdentityN"
   input_arg {
     name: "input"
@@ -12457,7 +12457,6 @@ op {
     name: "Tout"
     type: "list(type)"
     has_minimum: true
-    minimum: 1
   }
   attr {
     name: "then_branch"
@@ -12467,6 +12466,7 @@ op {
     name: "else_branch"
     type: "func"
   }
+  is_stateful: true
 }
 op {
   name: "Igamma"
@@ -13237,6 +13237,36 @@ op {
   is_stateful: true
 }
 op {
+  name: "IteratorFromStringHandleV2"
+  input_arg {
+    name: "string_handle"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "resource_handle"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    default_value {
+      list {
+      }
+    }
+    has_minimum: true
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    default_value {
+      list {
+      }
+    }
+    has_minimum: true
+  }
+  is_stateful: true
+}
+op {
   name: "IteratorGetNext"
   input_arg {
     name: "iterator"
@@ -13261,6 +13291,30 @@ op {
   is_stateful: true
 }
 op {
+  name: "IteratorGetNextAsOptional"
+  input_arg {
+    name: "iterator"
+    type: DT_RESOURCE
+  }
+  output_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
+op {
   name: "IteratorGetNextSync"
   input_arg {
     name: "iterator"
@@ -13297,6 +13351,34 @@ op {
   is_stateful: true
 }
 op {
+  name: "IteratorV2"
+  output_arg {
+    name: "handle"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+  }
+  attr {
+    name: "container"
+    type: "string"
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
+op {
   name: "L2Loss"
   input_arg {
     name: "t"
@@ -15443,8 +15525,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -15522,8 +15602,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -15594,8 +15672,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -15737,8 +15813,6 @@ op {
       list {
         s: "NHWC"
         s: "NCHW"
-        s: "HWNC"
-        s: "HWCN"
       }
     }
   }
@@ -16958,6 +17032,71 @@ op {
   }
 }
 op {
+  name: "NonMaxSuppressionV4"
+  input_arg {
+    name: "boxes"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "scores"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "max_output_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "iou_threshold"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "score_threshold"
+    type: DT_FLOAT
+  }
+  output_arg {
+    name: "selected_indices"
+    type: DT_INT32
+  }
+  output_arg {
+    name: "valid_outputs"
+    type: DT_INT32
+  }
+  attr {
+    name: "pad_to_max_output_size"
+    type: "bool"
+    default_value {
+      b: false
+    }
+  }
+}
+op {
+  name: "NonMaxSuppressionWithOverlaps"
+  input_arg {
+    name: "overlaps"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "scores"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "max_output_size"
+    type: DT_INT32
+  }
+  input_arg {
+    name: "overlap_threshold"
+    type: DT_FLOAT
+  }
+  input_arg {
+    name: "score_threshold"
+    type: DT_FLOAT
+  }
+  output_arg {
+    name: "selected_indices"
+    type: DT_INT32
+  }
+}
+op {
   name: "NotEqual"
   input_arg {
     name: "x"
@@ -17185,6 +17324,64 @@ op {
   }
 }
 op {
+  name: "OptionalFromValue"
+  input_arg {
+    name: "components"
+    type_list_attr: "Toutput_types"
+  }
+  output_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "Toutput_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+}
+op {
+  name: "OptionalGetValue"
+  input_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "components"
+    type_list_attr: "output_types"
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
+op {
+  name: "OptionalHasValue"
+  input_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "has_value"
+    type: DT_BOOL
+  }
+}
+op {
+  name: "OptionalNone"
+  output_arg {
+    name: "optional"
+    type: DT_VARIANT
+  }
+}
+op {
   name: "OrderedMapClear"
   attr {
     name: "capacity"
@@ -26113,6 +26310,7 @@ op {
         type: DT_HALF
         type: DT_UINT32
         type: DT_UINT64
+        type: DT_BOOL
       }
     }
   }
@@ -27339,6 +27537,17 @@ op {
   }
 }
 op {
+  name: "SinkDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+}
+op {
   name: "Size"
   input_arg {
     name: "input"
@@ -27502,7 +27711,11 @@ op {
     type: DT_INT64
   }
   input_arg {
-    name: "stride"
+    name: "window_shift"
+    type: DT_INT64
+  }
+  input_arg {
+    name: "window_stride"
     type: DT_INT64
   }
   output_arg {
@@ -31218,6 +31431,32 @@ op {
   is_stateful: true
 }
 op {
+  name: "StatefulPartitionedCall"
+  input_arg {
+    name: "args"
+    type_list_attr: "Tin"
+  }
+  output_arg {
+    name: "output"
+    type_list_attr: "Tout"
+  }
+  attr {
+    name: "Tin"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tout"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "f"
+    type: "func"
+  }
+  is_stateful: true
+}
+op {
   name: "StatelessMultinomial"
   input_arg {
     name: "logits"
@@ -35118,6 +35357,33 @@ op {
   is_stateful: true
 }
 op {
+  name: "WindowDataset"
+  input_arg {
+    name: "input_dataset"
+    type: DT_VARIANT
+  }
+  input_arg {
+    name: "window_size"
+    type: DT_INT64
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+}
+op {
   name: "WriteAudioSummary"
   input_arg {
     name: "writer"
diff --git a/tensorflow/core/platform/cloud/BUILD b/tensorflow/core/platform/cloud/BUILD
index 67651349ea..647a797b82 100644
--- a/tensorflow/core/platform/cloud/BUILD
+++ b/tensorflow/core/platform/cloud/BUILD
@@ -73,6 +73,8 @@ cc_library(
     linkstatic = 1,  # Needed since alwayslink is broken in bazel b/27630669
     visibility = ["//visibility:public"],
     deps = [
+        ":compute_engine_metadata_client",
+        ":compute_engine_zone_provider",
         ":curl_http_request",
         ":expiring_lru_cache",
         ":file_block_cache",
@@ -144,7 +146,7 @@ cc_library(
     copts = tf_copts(),
     visibility = ["//tensorflow:__subpackages__"],
     deps = [
-        ":curl_http_request",
+        ":compute_engine_metadata_client",
         ":oauth_client",
         ":retrying_utils",
         "//tensorflow/core:lib",
@@ -154,6 +156,43 @@ cc_library(
 )
 
 cc_library(
+    name = "compute_engine_metadata_client",
+    srcs = [
+        "compute_engine_metadata_client.cc",
+    ],
+    hdrs = [
+        "compute_engine_metadata_client.h",
+    ],
+    copts = tf_copts(),
+    visibility = ["//tensorflow:__subpackages__"],
+    deps = [
+        ":curl_http_request",
+        ":http_request",
+        ":retrying_utils",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
+cc_library(
+    name = "compute_engine_zone_provider",
+    srcs = [
+        "compute_engine_zone_provider.cc",
+    ],
+    hdrs = [
+        "compute_engine_zone_provider.h",
+        "zone_provider.h",
+    ],
+    copts = tf_copts(),
+    visibility = ["//tensorflow:__subpackages__"],
+    deps = [
+        ":compute_engine_metadata_client",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
+cc_library(
     name = "now_seconds_env",
     testonly = 1,
     hdrs = ["now_seconds_env.h"],
@@ -345,6 +384,34 @@ tf_cc_test(
 )
 
 tf_cc_test(
+    name = "compute_engine_metadata_client_test",
+    size = "small",
+    srcs = ["compute_engine_metadata_client_test.cc"],
+    deps = [
+        ":compute_engine_metadata_client",
+        ":http_request_fake",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
+    name = "compute_engine_zone_provider_test",
+    size = "small",
+    srcs = ["compute_engine_zone_provider_test.cc"],
+    deps = [
+        ":compute_engine_zone_provider",
+        ":http_request_fake",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
+tf_cc_test(
     name = "retrying_file_system_test",
     size = "small",
     srcs = ["retrying_file_system_test.cc"],
diff --git a/tensorflow/core/platform/cloud/compute_engine_metadata_client.cc b/tensorflow/core/platform/cloud/compute_engine_metadata_client.cc
new file mode 100644
index 0000000000..f41b83ac34
--- /dev/null
+++ b/tensorflow/core/platform/cloud/compute_engine_metadata_client.cc
@@ -0,0 +1,59 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/cloud/compute_engine_metadata_client.h"
+
+#include <utility>
+#include "tensorflow/core/platform/cloud/curl_http_request.h"
+#include "tensorflow/core/platform/cloud/retrying_utils.h"
+
+namespace tensorflow {
+
+namespace {
+
+// The URL to retrieve metadata when running in Google Compute Engine.
+constexpr char kGceMetadataBaseUrl[] = "http://metadata/computeMetadata/v1/";
+// The default initial delay between retries with exponential backoff.
+constexpr int kInitialRetryDelayUsec = 500000;  // 0.5 sec
+
+}  // namespace
+
+ComputeEngineMetadataClient::ComputeEngineMetadataClient(
+    std::shared_ptr<HttpRequest::Factory> http_request_factory)
+    : ComputeEngineMetadataClient(std::move(http_request_factory),
+                                  kInitialRetryDelayUsec) {}
+
+ComputeEngineMetadataClient::ComputeEngineMetadataClient(
+    std::shared_ptr<HttpRequest::Factory> http_request_factory,
+    int64 initial_retry_delay_usec)
+    : http_request_factory_(std::move(http_request_factory)),
+      initial_retry_delay_usec_(initial_retry_delay_usec) {}
+
+Status ComputeEngineMetadataClient::GetMetadata(
+    const string& path, std::vector<char>* response_buffer) {
+  const auto get_metadata_from_gce = [path, response_buffer, this]() {
+    std::unique_ptr<HttpRequest> request(http_request_factory_->Create());
+    request->SetUri(kGceMetadataBaseUrl + path);
+    request->AddHeader("Metadata-Flavor", "Google");
+    request->SetResultBuffer(response_buffer);
+    TF_RETURN_IF_ERROR(request->Send());
+    return Status::OK();
+  };
+
+  return RetryingUtils::CallWithRetries(get_metadata_from_gce,
+                                        initial_retry_delay_usec_);
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/compute_engine_metadata_client.h b/tensorflow/core/platform/cloud/compute_engine_metadata_client.h
new file mode 100644
index 0000000000..534ccf30b2
--- /dev/null
+++ b/tensorflow/core/platform/cloud/compute_engine_metadata_client.h
@@ -0,0 +1,64 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_CLOUD_COMPUTE_ENGINE_METADATA_CLIENT_H_
+#define TENSORFLOW_CORE_PLATFORM_CLOUD_COMPUTE_ENGINE_METADATA_CLIENT_H_
+
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/cloud/http_request.h"
+
+namespace tensorflow {
+
+/// \brief A client that accesses to the metadata server running on GCE hosts.
+///
+/// Uses the provided HttpRequest::Factory to make requests to the local
+/// metadata service
+/// (https://cloud.google.com/compute/docs/storing-retrieving-metadata).
+/// Retries on recoverable failures using exponential backoff with the initial
+/// retry wait configurable via initial_retry_delay_usec.
+class ComputeEngineMetadataClient {
+ public:
+  explicit ComputeEngineMetadataClient(
+      std::shared_ptr<HttpRequest::Factory> http_request_factory);
+  ComputeEngineMetadataClient(
+      std::shared_ptr<HttpRequest::Factory> http_request_factory,
+      int64 initial_retry_delay_usec);
+  virtual ~ComputeEngineMetadataClient() {}
+
+  /// \brief Get the metadata value for a given attribute of the metadata
+  /// service.
+  ///
+  /// Given a metadata path relative
+  /// to http://metadata.google.internal/computeMetadata/v1/,
+  /// fills response_buffer with the metadata. Returns OK if the server returns
+  /// the response for the given metadata path successfully.
+  ///
+  /// Example usage:
+  /// To get the zone of an instance:
+  ///   compute_engine_metadata_client.GetMetadata(
+  ///       "instance/zone", response_buffer);
+  virtual Status GetMetadata(const string& path,
+                             std::vector<char>* response_buffer);
+
+ private:
+  std::shared_ptr<HttpRequest::Factory> http_request_factory_;
+  const int64 initial_retry_delay_usec_;
+
+  TF_DISALLOW_COPY_AND_ASSIGN(ComputeEngineMetadataClient);
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_COMPUTE_ENGINE_METADATA_CLIENT_H_
diff --git a/tensorflow/core/platform/cloud/compute_engine_metadata_client_test.cc b/tensorflow/core/platform/cloud/compute_engine_metadata_client_test.cc
new file mode 100644
index 0000000000..4c41ccaa0e
--- /dev/null
+++ b/tensorflow/core/platform/cloud/compute_engine_metadata_client_test.cc
@@ -0,0 +1,68 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/cloud/compute_engine_metadata_client.h"
+#include "tensorflow/core/platform/cloud/http_request_fake.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+
+TEST(ComputeEngineMetadataClientTest, GetMetadata) {
+  const string example_response = "example response";
+
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: http://metadata/computeMetadata/v1/instance/service-accounts"
+      "/default/token\n"
+      "Header Metadata-Flavor: Google\n",
+      example_response)});
+
+  std::shared_ptr<HttpRequest::Factory> http_factory =
+      std::make_shared<FakeHttpRequestFactory>(&requests);
+  ComputeEngineMetadataClient client(http_factory, 0);
+
+  std::vector<char> result;
+  TF_EXPECT_OK(
+      client.GetMetadata("instance/service-accounts/default/token", &result));
+  std::vector<char> expected(example_response.begin(), example_response.end());
+  EXPECT_EQ(expected, result);
+}
+
+TEST(ComputeEngineMetadataClientTest, RetryOnFailure) {
+  const string example_response = "example response";
+
+  std::vector<HttpRequest*> requests(
+      {new FakeHttpRequest(
+           "Uri: http://metadata/computeMetadata/v1/instance/service-accounts"
+           "/default/token\n"
+           "Header Metadata-Flavor: Google\n",
+           "", errors::Unavailable("503"), 503),
+       new FakeHttpRequest(
+           "Uri: http://metadata/computeMetadata/v1/instance/service-accounts"
+           "/default/token\n"
+           "Header Metadata-Flavor: Google\n",
+           example_response)});
+
+  std::shared_ptr<HttpRequest::Factory> http_factory =
+      std::make_shared<FakeHttpRequestFactory>(&requests);
+  ComputeEngineMetadataClient client(http_factory, 0);
+
+  std::vector<char> result;
+  TF_EXPECT_OK(
+      client.GetMetadata("instance/service-accounts/default/token", &result));
+  std::vector<char> expected(example_response.begin(), example_response.end());
+  EXPECT_EQ(expected, result);
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/compute_engine_zone_provider.cc b/tensorflow/core/platform/cloud/compute_engine_zone_provider.cc
new file mode 100644
index 0000000000..dacf56187c
--- /dev/null
+++ b/tensorflow/core/platform/cloud/compute_engine_zone_provider.cc
@@ -0,0 +1,53 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/cloud/compute_engine_zone_provider.h"
+
+#include <utility>
+#include "tensorflow/core/lib/strings/str_util.h"
+namespace tensorflow {
+
+namespace {
+constexpr char kGceMetadataZonePath[] = "instance/zone";
+}  // namespace
+
+ComputeEngineZoneProvider::ComputeEngineZoneProvider(
+    std::shared_ptr<ComputeEngineMetadataClient> google_metadata_client)
+    : google_metadata_client_(std::move(google_metadata_client)) {}
+
+Status ComputeEngineZoneProvider::GetZone(string* zone) {
+  if (!cached_zone.empty()) {
+    *zone = cached_zone;
+    return Status::OK();
+  }
+  std::vector<char> response_buffer;
+  TF_RETURN_IF_ERROR(google_metadata_client_->GetMetadata(kGceMetadataZonePath,
+                                                          &response_buffer));
+  StringPiece location(&response_buffer[0], response_buffer.size());
+
+  std::vector<string> elems = str_util::Split(location, "/");
+  if (elems.size() == 4) {
+    cached_zone = elems.back();
+    *zone = cached_zone;
+  } else {
+    LOG(ERROR) << "Failed to parse the zone name from location: "
+               << location.ToString();
+  }
+
+  return Status::OK();
+}
+ComputeEngineZoneProvider::~ComputeEngineZoneProvider() {}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/compute_engine_zone_provider.h b/tensorflow/core/platform/cloud/compute_engine_zone_provider.h
new file mode 100644
index 0000000000..614b688e6f
--- /dev/null
+++ b/tensorflow/core/platform/cloud/compute_engine_zone_provider.h
@@ -0,0 +1,40 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_CLOUD_COMPUTE_ENGINE_ZONE_PROVIDER_H_
+#define TENSORFLOW_CORE_PLATFORM_CLOUD_COMPUTE_ENGINE_ZONE_PROVIDER_H_
+
+#include "tensorflow/core/platform/cloud/compute_engine_metadata_client.h"
+#include "tensorflow/core/platform/cloud/zone_provider.h"
+
+namespace tensorflow {
+
+class ComputeEngineZoneProvider : public ZoneProvider {
+ public:
+  explicit ComputeEngineZoneProvider(
+      std::shared_ptr<ComputeEngineMetadataClient> google_metadata_client);
+  virtual ~ComputeEngineZoneProvider();
+
+  Status GetZone(string* zone) override;
+
+ private:
+  std::shared_ptr<ComputeEngineMetadataClient> google_metadata_client_;
+  string cached_zone;
+  TF_DISALLOW_COPY_AND_ASSIGN(ComputeEngineZoneProvider);
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_COMPUTE_ENGINE_ZONE_PROVIDER_H_
diff --git a/tensorflow/core/platform/cloud/compute_engine_zone_provider_test.cc b/tensorflow/core/platform/cloud/compute_engine_zone_provider_test.cc
new file mode 100644
index 0000000000..f7477eca23
--- /dev/null
+++ b/tensorflow/core/platform/cloud/compute_engine_zone_provider_test.cc
@@ -0,0 +1,69 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/cloud/compute_engine_zone_provider.h"
+#include "tensorflow/core/platform/cloud/http_request_fake.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+
+class ComputeEngineZoneProviderTest : public ::testing::Test {
+ protected:
+  void SetUp() override {}
+
+  void TearDown() override {}
+};
+
+TEST_F(ComputeEngineZoneProviderTest, GetZone) {
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: http://metadata/computeMetadata/v1/instance/zone\n"
+      "Header Metadata-Flavor: Google\n",
+      "projects/123456789/zones/us-west1-b")});
+
+  auto httpRequestFactory = std::make_shared<FakeHttpRequestFactory>(&requests);
+
+  auto metadata_client =
+      std::make_shared<ComputeEngineMetadataClient>(httpRequestFactory, 0);
+
+  ComputeEngineZoneProvider provider(metadata_client);
+
+  string zone;
+
+  TF_EXPECT_OK(provider.GetZone(&zone));
+  EXPECT_EQ("us-west1-b", zone);
+  // Test caching, should be no further requests
+  TF_EXPECT_OK(provider.GetZone(&zone));
+}
+
+TEST_F(ComputeEngineZoneProviderTest, InvalidZoneString) {
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: http://metadata/computeMetadata/v1/instance/zone\n"
+      "Header Metadata-Flavor: Google\n",
+      "invalidresponse")});
+
+  auto httpRequestFactory = std::make_shared<FakeHttpRequestFactory>(&requests);
+
+  auto metadata_client =
+      std::make_shared<ComputeEngineMetadataClient>(httpRequestFactory, 0);
+
+  ComputeEngineZoneProvider provider(metadata_client);
+
+  string zone;
+
+  TF_EXPECT_OK(provider.GetZone(&zone));
+  EXPECT_EQ("", zone);
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/gcs_file_system.cc b/tensorflow/core/platform/cloud/gcs_file_system.cc
index ec77861480..67c872ac67 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system.cc
@@ -57,6 +57,7 @@ constexpr char kGcsUriBase[] = "https://www.googleapis.com/storage/v1/";
 constexpr char kGcsUploadUriBase[] =
     "https://www.googleapis.com/upload/storage/v1/";
 constexpr char kStorageHost[] = "storage.googleapis.com";
+constexpr char kBucketMetadataLocationKey[] = "location";
 constexpr size_t kReadAppendableFileBufferSize = 1024 * 1024;  // In bytes.
 constexpr int kGetChildrenDefaultPageSize = 1000;
 // The HTTP response code "308 Resume Incomplete".
@@ -98,6 +99,11 @@ constexpr uint64 kMatchingPathsCacheDefaultMaxAge = 0;
 constexpr char kMatchingPathsCacheMaxEntries[] =
     "GCS_MATCHING_PATHS_CACHE_MAX_ENTRIES";
 constexpr size_t kMatchingPathsCacheDefaultMaxEntries = 1024;
+// Number of bucket locations cached, most workloads wont touch more than one
+// bucket so this limit is set fairly low
+constexpr size_t kBucketLocationCacheMaxEntries = 10;
+// ExpiringLRUCache doesnt support any "cache forever" option
+constexpr size_t kCacheNeverExpire = std::numeric_limits<uint64>::max();
 // The file statistics returned by Stat() for directories.
 const FileStatistics DIRECTORY_STAT(0, 0, true);
 // Some environments exhibit unreliable DNS resolution. Set this environment
@@ -131,6 +137,14 @@ constexpr char kTokensPerRequest[] = "GCS_TOKENS_PER_REQUEST";
 // The environment variable to configure the initial tokens (format: <int64>)
 constexpr char kInitialTokens[] = "GCS_INITIAL_TOKENS";
 
+// The environment variable to customize which GCS bucket locations are allowed,
+// if the list is empty defaults to using the region of the zone (format, comma
+// delimited list). Requires 'storage.buckets.get' permission.
+constexpr char kAllowedBucketLocations[] = "GCS_ALLOWED_BUCKET_LOCATIONS";
+// When this value is passed as an allowed location detects the zone tensorflow
+// is running in and restricts to buckets in that region.
+constexpr char kDetectZoneSentinalValue[] = "auto";
+
 // TODO: DO NOT use a hardcoded path
 Status GetTmpFilename(string* filename) {
 #ifndef _WIN32
@@ -603,15 +617,35 @@ bool StringPieceIdentity(StringPiece str, StringPiece* value) {
   return true;
 }
 
+/// \brief Utility function to split a comma delimited list of strings to an
+/// unordered set
+bool SplitByCommaToSet(StringPiece list, std::unordered_set<string>* set) {
+  std::vector<string> vector = str_util::Split(list, ",");
+  *set = std::unordered_set<string>(vector.begin(), vector.end());
+  return true;
+}
+
+// \brief Convert Compute Engine zone to region
+string ZoneToRegion(string* zone) {
+  return zone->substr(0, zone->find_last_of('-'));
+}
+
 }  // namespace
 
-GcsFileSystem::GcsFileSystem()
-    : auth_provider_(new GoogleAuthProvider()),
-      http_request_factory_(new CurlHttpRequest::Factory()) {
+GcsFileSystem::GcsFileSystem() {
   uint64 value;
   size_t block_size = kDefaultBlockSize;
   size_t max_bytes = kDefaultMaxCacheSize;
   uint64 max_staleness = kDefaultMaxStaleness;
+
+  http_request_factory_ = std::make_shared<CurlHttpRequest::Factory>();
+  compute_engine_metadata_client_ =
+      std::make_shared<ComputeEngineMetadataClient>(http_request_factory_);
+  auth_provider_ = std::unique_ptr<AuthProvider>(
+      new GoogleAuthProvider(compute_engine_metadata_client_));
+  zone_provider_ = std::unique_ptr<ZoneProvider>(
+      new ComputeEngineZoneProvider(compute_engine_metadata_client_));
+
   // Apply the sys env override for the readahead buffer size if it's provided.
   if (GetEnvVar(kReadaheadBufferSize, strings::safe_strtou64, &value)) {
     block_size = value;
@@ -631,6 +665,9 @@ GcsFileSystem::GcsFileSystem()
     // Setting either to 0 disables the cache; set both for good measure.
     block_size = max_bytes = 0;
   }
+  VLOG(1) << "GCS cache max size = " << max_bytes << " ; "
+          << "block size = " << block_size << " ; "
+          << "max staleness = " << max_staleness;
   file_block_cache_ = MakeFileBlockCache(block_size, max_bytes, max_staleness);
   // Apply overrides for the stat cache max age and max entries, if provided.
   uint64 stat_cache_max_age = kStatCacheDefaultMaxAge;
@@ -658,6 +695,9 @@ GcsFileSystem::GcsFileSystem()
   matching_paths_cache_.reset(new ExpiringLRUCache<std::vector<string>>(
       matching_paths_cache_max_age, matching_paths_cache_max_entries));
 
+  bucket_location_cache_.reset(new ExpiringLRUCache<string>(
+      kCacheNeverExpire, kBucketLocationCacheMaxEntries));
+
   int64 resolve_frequency_secs;
   if (GetEnvVar(kResolveCacheSecs, strings::safe_strto64,
                 &resolve_frequency_secs)) {
@@ -737,24 +777,30 @@ GcsFileSystem::GcsFileSystem()
     }
     throttle_.SetConfig(config);
   }
+
+  GetEnvVar(kAllowedBucketLocations, SplitByCommaToSet, &allowed_locations_);
 }
 
 GcsFileSystem::GcsFileSystem(
     std::unique_ptr<AuthProvider> auth_provider,
     std::unique_ptr<HttpRequest::Factory> http_request_factory,
-    size_t block_size, size_t max_bytes, uint64 max_staleness,
-    uint64 stat_cache_max_age, size_t stat_cache_max_entries,
-    uint64 matching_paths_cache_max_age,
+    std::unique_ptr<ZoneProvider> zone_provider, size_t block_size,
+    size_t max_bytes, uint64 max_staleness, uint64 stat_cache_max_age,
+    size_t stat_cache_max_entries, uint64 matching_paths_cache_max_age,
     size_t matching_paths_cache_max_entries, int64 initial_retry_delay_usec,
-    TimeoutConfig timeouts,
+    TimeoutConfig timeouts, const std::unordered_set<string>& allowed_locations,
     std::pair<const string, const string>* additional_header)
     : auth_provider_(std::move(auth_provider)),
       http_request_factory_(std::move(http_request_factory)),
+      zone_provider_(std::move(zone_provider)),
       file_block_cache_(
           MakeFileBlockCache(block_size, max_bytes, max_staleness)),
       stat_cache_(new StatCache(stat_cache_max_age, stat_cache_max_entries)),
       matching_paths_cache_(new MatchingPathsCache(
           matching_paths_cache_max_age, matching_paths_cache_max_entries)),
+      bucket_location_cache_(new BucketLocationCache(
+          kCacheNeverExpire, kBucketLocationCacheMaxEntries)),
+      allowed_locations_(allowed_locations),
       timeouts_(timeouts),
       initial_retry_delay_usec_(initial_retry_delay_usec),
       additional_header_(additional_header) {}
@@ -763,6 +809,7 @@ Status GcsFileSystem::NewRandomAccessFile(
     const string& fname, std::unique_ptr<RandomAccessFile>* result) {
   string bucket, object;
   TF_RETURN_IF_ERROR(ParseGcsPath(fname, false, &bucket, &object));
+  TF_RETURN_IF_ERROR(CheckBucketLocationConstraint(bucket));
   result->reset(new GcsRandomAccessFile(fname, [this, bucket, object](
                                                    const string& fname,
                                                    uint64 offset, size_t n,
@@ -1064,11 +1111,7 @@ Status GcsFileSystem::StatForObject(const string& fname, const string& bucket,
 }
 
 Status GcsFileSystem::BucketExists(const string& bucket, bool* result) {
-  std::unique_ptr<HttpRequest> request;
-  TF_RETURN_IF_ERROR(CreateHttpRequest(&request));
-  request->SetUri(strings::StrCat(kGcsUriBase, "b/", bucket));
-  request->SetTimeouts(timeouts_.connect, timeouts_.idle, timeouts_.metadata);
-  const Status status = request->Send();
+  const Status status = GetBucketMetadata(bucket, nullptr);
   switch (status.code()) {
     case errors::Code::OK:
       *result = true;
@@ -1081,6 +1124,62 @@ Status GcsFileSystem::BucketExists(const string& bucket, bool* result) {
   }
 }
 
+Status GcsFileSystem::CheckBucketLocationConstraint(const string& bucket) {
+  if (allowed_locations_.empty()) {
+    return Status::OK();
+  }
+
+  // Avoid calling external API's in the constructor
+  if (allowed_locations_.erase(kDetectZoneSentinalValue) == 1) {
+    string zone;
+    TF_RETURN_IF_ERROR(zone_provider_->GetZone(&zone));
+    allowed_locations_.insert(ZoneToRegion(&zone));
+  }
+
+  string location;
+  TF_RETURN_IF_ERROR(GetBucketLocation(bucket, &location));
+  if (allowed_locations_.find(location) != allowed_locations_.end()) {
+    return Status::OK();
+  }
+
+  return errors::FailedPrecondition(strings::Printf(
+      "Bucket '%s' is in '%s' location, allowed locations are: (%s).",
+      bucket.c_str(), location.c_str(),
+      str_util::Join(allowed_locations_, ", ").c_str()));
+}
+
+Status GcsFileSystem::GetBucketLocation(const string& bucket,
+                                        string* location) {
+  auto compute_func = [this](const string& bucket, string* location) {
+    std::vector<char> result_buffer;
+    Status status = GetBucketMetadata(bucket, &result_buffer);
+    Json::Value result;
+    TF_RETURN_IF_ERROR(ParseJson(result_buffer, &result));
+    TF_RETURN_IF_ERROR(
+        GetStringValue(result, kBucketMetadataLocationKey, location));
+    return Status::OK();
+  };
+
+  TF_RETURN_IF_ERROR(
+      bucket_location_cache_->LookupOrCompute(bucket, location, compute_func));
+
+  return Status::OK();
+}
+
+Status GcsFileSystem::GetBucketMetadata(const string& bucket,
+                                        std::vector<char>* result_buffer) {
+  std::unique_ptr<HttpRequest> request;
+  TF_RETURN_IF_ERROR(CreateHttpRequest(&request));
+  request->SetUri(strings::StrCat(kGcsUriBase, "b/", bucket));
+
+  if (result_buffer != nullptr) {
+    request->SetResultBuffer(result_buffer);
+  }
+
+  request->SetTimeouts(timeouts_.connect, timeouts_.idle, timeouts_.metadata);
+  return request->Send();
+}
+
 Status GcsFileSystem::FolderExists(const string& dirname, bool* result) {
   StatCache::ComputeFunc compute_func = [this](const string& dirname,
                                                GcsFileStat* stat) {
@@ -1506,6 +1605,7 @@ void GcsFileSystem::FlushCaches() {
   file_block_cache_->Flush();
   stat_cache_->Clear();
   matching_paths_cache_->Clear();
+  bucket_location_cache_->Clear();
 }
 
 void GcsFileSystem::SetStats(GcsStatsInterface* stats) {
@@ -1557,6 +1657,7 @@ Status GcsFileSystem::CreateHttpRequest(std::unique_ptr<HttpRequest>* request) {
   return Status::OK();
 }
 
-REGISTER_FILE_SYSTEM("gs", RetryingGcsFileSystem);
-
 }  // namespace tensorflow
+
+// Initialize gcs_file_system
+REGISTER_FILE_SYSTEM("gs", ::tensorflow::RetryingGcsFileSystem);
diff --git a/tensorflow/core/platform/cloud/gcs_file_system.h b/tensorflow/core/platform/cloud/gcs_file_system.h
index 74768c98b5..71db707687 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system.h
+++ b/tensorflow/core/platform/cloud/gcs_file_system.h
@@ -22,6 +22,8 @@ limitations under the License.
 
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/cloud/auth_provider.h"
+#include "tensorflow/core/platform/cloud/compute_engine_metadata_client.h"
+#include "tensorflow/core/platform/cloud/compute_engine_zone_provider.h"
 #include "tensorflow/core/platform/cloud/expiring_lru_cache.h"
 #include "tensorflow/core/platform/cloud/file_block_cache.h"
 #include "tensorflow/core/platform/cloud/gcs_dns_cache.h"
@@ -80,14 +82,19 @@ class GcsFileSystem : public FileSystem {
  public:
   struct TimeoutConfig;
 
+  // Main constructor used (via RetryingFileSystem) throughout Tensorflow
   GcsFileSystem();
+  // Used mostly for unit testing or use cases which need to customize the
+  // filesystem from defaults
   GcsFileSystem(std::unique_ptr<AuthProvider> auth_provider,
                 std::unique_ptr<HttpRequest::Factory> http_request_factory,
-                size_t block_size, size_t max_bytes, uint64 max_staleness,
+                std::unique_ptr<ZoneProvider> zone_provider, size_t block_size,
+                size_t max_bytes, uint64 max_staleness,
                 uint64 stat_cache_max_age, size_t stat_cache_max_entries,
                 uint64 matching_paths_cache_max_age,
                 size_t matching_paths_cache_max_entries,
                 int64 initial_retry_delay_usec, TimeoutConfig timeouts,
+                const std::unordered_set<string>& allowed_locations,
                 std::pair<const string, const string>* additional_header);
 
   Status NewRandomAccessFile(
@@ -148,6 +155,9 @@ class GcsFileSystem : public FileSystem {
     return file_block_cache_->max_staleness();
   }
   TimeoutConfig timeouts() const { return timeouts_; }
+  std::unordered_set<string> allowed_locations() const {
+    return allowed_locations_;
+  }
   string additional_header_name() const {
     return additional_header_ ? additional_header_->first : "";
   }
@@ -229,6 +239,27 @@ class GcsFileSystem : public FileSystem {
   /// 'result' is set if the function returns OK. 'result' cannot be nullptr.
   Status BucketExists(const string& bucket, bool* result);
 
+  /// \brief Retrieves the GCS bucket location. Returns OK if the location was
+  /// retrieved.
+  ///
+  /// Given a string bucket the GCS bucket metadata API will be called and the
+  /// location string filled with the location of the bucket.
+  ///
+  /// This requires the bucket metadata permission.
+  /// Repeated calls for the same bucket are cached so this function can be
+  /// called frequently without causing an extra API call
+  Status GetBucketLocation(const string& bucket, string* location);
+
+  /// \brief Check if the GCS buckets location is allowed with the current
+  /// constraint configuration
+  Status CheckBucketLocationConstraint(const string& bucket);
+
+  /// \brief Given the input bucket `bucket`, fills `result_buffer` with the
+  /// results of the metadata. Returns OK if the API call succeeds without
+  /// error.
+  Status GetBucketMetadata(const string& bucket,
+                           std::vector<char>* result_buffer);
+
   /// \brief Checks if the object exists. Returns OK if the check succeeded.
   ///
   /// 'result' is set if the function returns OK. 'result' cannot be nullptr.
@@ -275,12 +306,14 @@ class GcsFileSystem : public FileSystem {
 
   mutex mu_;
   std::unique_ptr<AuthProvider> auth_provider_ GUARDED_BY(mu_);
-  std::unique_ptr<HttpRequest::Factory> http_request_factory_;
+  std::shared_ptr<HttpRequest::Factory> http_request_factory_;
+  std::unique_ptr<ZoneProvider> zone_provider_;
   // block_cache_lock_ protects the file_block_cache_ pointer (Note that
   // FileBlockCache instances are themselves threadsafe).
   mutex block_cache_lock_;
   std::unique_ptr<FileBlockCache> file_block_cache_
       GUARDED_BY(block_cache_lock_);
+  std::shared_ptr<ComputeEngineMetadataClient> compute_engine_metadata_client_;
   std::unique_ptr<GcsDnsCache> dns_cache_;
   GcsThrottle throttle_;
 
@@ -290,6 +323,10 @@ class GcsFileSystem : public FileSystem {
   using MatchingPathsCache = ExpiringLRUCache<std::vector<string>>;
   std::unique_ptr<MatchingPathsCache> matching_paths_cache_;
 
+  using BucketLocationCache = ExpiringLRUCache<string>;
+  std::unique_ptr<BucketLocationCache> bucket_location_cache_;
+  std::unordered_set<string> allowed_locations_;
+
   TimeoutConfig timeouts_;
 
   GcsStatsInterface* stats_ = nullptr;  // Not owned.
diff --git a/tensorflow/core/platform/cloud/gcs_file_system_test.cc b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
index e791ae5a19..ee2b034d74 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
@@ -24,6 +24,13 @@ namespace tensorflow {
 namespace {
 
 static GcsFileSystem::TimeoutConfig kTestTimeoutConfig(5, 1, 10, 20, 30);
+// Default (empty) constraint config
+static std::unordered_set<string>* kAllowedLocationsDefault =
+    new std::unordered_set<string>();
+// Constraint config if bucket location constraint is turned on, with no
+// custom list
+static std::unordered_set<string>* kAllowedLocationsAuto =
+    new std::unordered_set<string>({"auto"});
 
 class FakeAuthProvider : public AuthProvider {
  public:
@@ -33,6 +40,14 @@ class FakeAuthProvider : public AuthProvider {
   }
 };
 
+class FakeZoneProvider : public ZoneProvider {
+ public:
+  Status GetZone(string* zone) override {
+    *zone = "us-east1-b";
+    return Status::OK();
+  }
+};
+
 TEST(GcsFileSystemTest, NewRandomAccessFile_NoBlockCache) {
   std::vector<HttpRequest*> requests(
       {new FakeHttpRequest(
@@ -47,15 +62,16 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_NoBlockCache) {
            "Range: 6-11\n"
            "Timeouts: 5 1 20\n",
            "6789")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<RandomAccessFile> file;
   TF_EXPECT_OK(fs.NewRandomAccessFile("gs://bucket/random_access.txt", &file));
@@ -74,6 +90,118 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_NoBlockCache) {
   EXPECT_EQ("6789", result);
 }
 
+TEST(GcsFileSystemTest,
+     NewRandomAccessFile_WithLocationConstraintInSameLocation) {
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: https://www.googleapis.com/storage/v1/b/bucket\n"
+      "Auth Token: fake_token\n"
+      "Timeouts: 5 1 10\n",
+      R"(
+          {
+            "location":"us-east1"
+          })")});
+
+  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+                   std::unique_ptr<HttpRequest::Factory>(
+                       new FakeHttpRequestFactory(&requests)),
+                   std::unique_ptr<ZoneProvider>(new FakeZoneProvider),
+                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
+                   0 /* stat cache max age */, 0 /* stat cache max entries */,
+                   0 /* matching paths cache max age */,
+                   0 /* matching paths cache max entries */,
+                   0 /* initial retry delay */, kTestTimeoutConfig,
+                   *kAllowedLocationsAuto, nullptr /* gcs additional header */);
+
+  std::unique_ptr<RandomAccessFile> file;
+  TF_EXPECT_OK(fs.NewRandomAccessFile("gs://bucket/random_access.txt", &file));
+}
+
+TEST(GcsFileSystemTest, NewRandomAccessFile_WithLocationConstraintCaching) {
+  std::vector<HttpRequest*> requests(
+      {new FakeHttpRequest(
+           "Uri: https://www.googleapis.com/storage/v1/b/bucket\n"
+           "Auth Token: fake_token\n"
+           "Timeouts: 5 1 10\n",
+           R"(
+          {
+            "location":"us-east1"
+          })"),
+       new FakeHttpRequest(
+           "Uri: https://www.googleapis.com/storage/v1/b/anotherbucket\n"
+           "Auth Token: fake_token\n"
+           "Timeouts: 5 1 10\n",
+           R"(
+          {
+            "location":"us-east1"
+          })"),
+       new FakeHttpRequest(
+           "Uri: https://www.googleapis.com/storage/v1/b/bucket\n"
+           "Auth Token: fake_token\n"
+           "Timeouts: 5 1 10\n",
+           R"(
+          {
+            "location":"us-east1"
+          })")});
+
+  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+                   std::unique_ptr<HttpRequest::Factory>(
+                       new FakeHttpRequestFactory(&requests)),
+                   std::unique_ptr<ZoneProvider>(new FakeZoneProvider),
+                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
+                   0 /* stat cache max age */, 0 /* stat cache max entries */,
+                   0 /* matching paths cache max age */,
+                   0 /* matching paths cache max entries */,
+                   0 /* initial retry delay */, kTestTimeoutConfig,
+                   *kAllowedLocationsAuto, nullptr /* gcs additional header */);
+
+  std::unique_ptr<RandomAccessFile> file;
+
+  string bucket = "gs://bucket/random_access.txt";
+  string another_bucket = "gs://anotherbucket/random_access.txt";
+  // Multiple calls should only cause one request to the location api.
+  TF_EXPECT_OK(fs.NewRandomAccessFile(bucket, &file));
+  TF_EXPECT_OK(fs.NewRandomAccessFile(bucket, &file));
+
+  // A new bucket should have one cache miss
+  TF_EXPECT_OK(fs.NewRandomAccessFile(another_bucket, &file));
+  // And then future calls to both should be cached
+  TF_EXPECT_OK(fs.NewRandomAccessFile(bucket, &file));
+  TF_EXPECT_OK(fs.NewRandomAccessFile(another_bucket, &file));
+
+  // Trigger a flush, should then require one more call
+  fs.FlushCaches();
+  TF_EXPECT_OK(fs.NewRandomAccessFile(bucket, &file));
+}
+
+TEST(GcsFileSystemTest,
+     NewRandomAccessFile_WithLocationConstraintInDifferentLocation) {
+  std::vector<HttpRequest*> requests({new FakeHttpRequest(
+      "Uri: https://www.googleapis.com/storage/v1/b/bucket\n"
+      "Auth Token: fake_token\n"
+      "Timeouts: 5 1 10\n",
+      R"(
+          {
+            "location":"barfoo"
+          })")});
+
+  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+                   std::unique_ptr<HttpRequest::Factory>(
+                       new FakeHttpRequestFactory(&requests)),
+                   std::unique_ptr<ZoneProvider>(new FakeZoneProvider),
+                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
+                   0 /* stat cache max age */, 0 /* stat cache max entries */,
+                   0 /* matching paths cache max age */,
+                   0 /* matching paths cache max entries */,
+                   0 /* initial retry delay */, kTestTimeoutConfig,
+                   *kAllowedLocationsAuto, nullptr /* gcs additional header */);
+
+  std::unique_ptr<RandomAccessFile> file;
+  EXPECT_EQ(tensorflow::errors::FailedPrecondition(
+                "Bucket 'bucket' is in 'barfoo' location, allowed locations "
+                "are: (us-east1)."),
+            fs.NewRandomAccessFile("gs://bucket/random_access.txt", &file));
+}
+
 TEST(GcsFileSystemTest, NewRandomAccessFile_NoBlockCache_DifferentN) {
   std::vector<HttpRequest*> requests(
       {new FakeHttpRequest(
@@ -88,15 +216,16 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_NoBlockCache_DifferentN) {
            "Range: 3-12\n"
            "Timeouts: 5 1 20\n",
            "3456789")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<RandomAccessFile> file;
   TF_EXPECT_OK(fs.NewRandomAccessFile("gs://bucket/random_access.txt", &file));
@@ -151,11 +280,12 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_WithBlockCache) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      9 /* block size */, 18 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 9 /* block size */,
+      18 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   char scratch[100];
   StringPiece result;
@@ -239,11 +369,12 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_WithBlockCache_Flush) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      9 /* block size */, 18 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 9 /* block size */,
+      18 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   char scratch[100];
   StringPiece result;
@@ -287,11 +418,13 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_WithBlockCache_MaxStaleness) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      8 /* block size */, 16 /* max bytes */, 3600 /* max staleness */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 8 /* block size */,
+      16 /* max bytes */, 3600 /* max staleness */,
       3600 /* stat cache max age */, 0 /* stat cache max entries */,
       0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
   char scratch[100];
   StringPiece result;
   // There should only be two HTTP requests issued to GCS even though we iterate
@@ -356,11 +489,12 @@ TEST(GcsFileSystemTest,
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      9 /* block size */, 18 /* max bytes */, 0 /* max staleness */,
-      0 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 9 /* block size */,
+      18 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<RandomAccessFile> file;
   TF_EXPECT_OK(fs.NewRandomAccessFile("gs://bucket/random_access.txt", &file));
@@ -383,11 +517,13 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_NoObjectName) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider),
       0 /* read ahead bytes */, 0 /* max bytes */, 0 /* max staleness */,
       0 /* stat cache max age */, 0 /* stat cache max entries */,
       0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<RandomAccessFile> file;
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
@@ -411,15 +547,16 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_InconsistentRead) {
            "012")});
 
   // Set stat_cache_max_age to 1000s so that StatCache could work.
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   1e3 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 1e3 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // Stat the file first so that the file stats are cached.
   FileStatistics stat;
@@ -481,11 +618,12 @@ TEST(GcsFileSystemTest, NewWritableFile) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      8 /* block size */, 8 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 8 /* block size */,
+      8 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // Read from the file first, to fill the block cache.
   std::unique_ptr<RandomAccessFile> rfile;
@@ -565,15 +703,16 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadSucceeds) {
                            "Timeouts: 5 1 30\n"
                            "Put body: t2\n",
                            "")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<WritableFile> file;
   TF_EXPECT_OK(fs.NewWritableFile("gs://bucket/path/writeable.txt", &file));
@@ -638,11 +777,13 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadSucceedsOnGetStatus) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      8 /* block size */, 8 /* max bytes */, 3600 /* max staleness */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 8 /* block size */,
+      8 /* max bytes */, 3600 /* max staleness */,
       3600 /* stat cache max age */, 0 /* stat cache max entries */,
       0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
   // Pull the file's first block into the cache. This will trigger the first
   // HTTP request to GCS.
   std::unique_ptr<RandomAccessFile> rfile;
@@ -719,15 +860,16 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadAllAttemptsFail) {
                           "Timeouts: 5 1 30\n"
                           "Put body: content1,content2\n",
                           ""));
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   2 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 2 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<WritableFile> file;
   TF_EXPECT_OK(fs.NewWritableFile("gs://bucket/path/writeable.txt", &file));
@@ -776,15 +918,16 @@ TEST(GcsFileSystemTest, NewWritableFile_UploadReturns410) {
                            "Timeouts: 5 1 30\n"
                            "Put body: content1,content2\n",
                            "")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<WritableFile> file;
   TF_EXPECT_OK(fs.NewWritableFile("gs://bucket/path/writeable.txt", &file));
@@ -805,15 +948,16 @@ TEST(GcsFileSystemTest, NewWritableFile_UploadReturns410) {
 
 TEST(GcsFileSystemTest, NewWritableFile_NoObjectName) {
   std::vector<HttpRequest*> requests;
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<WritableFile> file;
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
@@ -866,11 +1010,12 @@ TEST(GcsFileSystemTest, NewAppendableFile) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      32 /* block size */, 32 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 32 /* block size */,
+      32 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // Create an appendable file. This should read the file from GCS, and pull its
   // contents into the block cache.
@@ -896,15 +1041,16 @@ TEST(GcsFileSystemTest, NewAppendableFile) {
 
 TEST(GcsFileSystemTest, NewAppendableFile_NoObjectName) {
   std::vector<HttpRequest*> requests;
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<WritableFile> file;
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
@@ -929,15 +1075,16 @@ TEST(GcsFileSystemTest, NewReadOnlyMemoryRegionFromFile) {
                            "Range: 0-",
                            content.size() - 1, "\n", "Timeouts: 5 1 20\n"),
            content)});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<ReadOnlyMemoryRegion> region;
   TF_EXPECT_OK(fs.NewReadOnlyMemoryRegionFromFile(
@@ -949,15 +1096,16 @@ TEST(GcsFileSystemTest, NewReadOnlyMemoryRegionFromFile) {
 
 TEST(GcsFileSystemTest, NewReadOnlyMemoryRegionFromFile_NoObjectName) {
   std::vector<HttpRequest*> requests;
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<ReadOnlyMemoryRegion> region;
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
@@ -972,15 +1120,16 @@ TEST(GcsFileSystemTest, FileExists_YesAsObject) {
       "Timeouts: 5 1 10\n",
       strings::StrCat("{\"size\": \"1010\",\"generation\": \"1\","
                       "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.FileExists("gs://bucket/path/file1.txt"));
 }
@@ -1001,15 +1150,16 @@ TEST(GcsFileSystemTest, FileExists_YesAsFolder) {
            "Timeouts: 5 1 10\n",
            "{\"items\": [ "
            "  { \"name\": \"path/subfolder/\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.FileExists("gs://bucket/path/subfolder"));
 }
@@ -1026,15 +1176,16 @@ TEST(GcsFileSystemTest, FileExists_YesAsBucket) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "{\"size\": \"100\"}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.FileExists("gs://bucket1"));
   TF_EXPECT_OK(fs.FileExists("gs://bucket1/"));
@@ -1055,15 +1206,16 @@ TEST(GcsFileSystemTest, FileExists_NotAsObjectOrFolder) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "{\"items\": []}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   EXPECT_EQ(errors::Code::NOT_FOUND,
             fs.FileExists("gs://bucket/path/file1.txt").code());
@@ -1081,15 +1233,16 @@ TEST(GcsFileSystemTest, FileExists_NotAsBucket) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "", errors::NotFound("404"), 404)});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
             fs.FileExists("gs://bucket2/").code());
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
@@ -1123,11 +1276,12 @@ TEST(GcsFileSystemTest, FileExists_StatCache) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // The stat cache will ensure that repeated lookups don't trigger additional
   // HTTP requests.
@@ -1149,11 +1303,12 @@ TEST(GcsFileSystemTest, FileExists_DirectoryMark) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.FileExists("gs://bucket/dir/"));
   TF_EXPECT_OK(fs.IsDirectory("gs://bucket/dir/"));
@@ -1167,15 +1322,16 @@ TEST(GcsFileSystemTest, GetChildren_NoItems) {
       "Auth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "{\"prefixes\": [\"path/subpath/\"]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> children;
   TF_EXPECT_OK(fs.GetChildren("gs://bucket/path/", &children));
@@ -1194,15 +1350,16 @@ TEST(GcsFileSystemTest, GetChildren_ThreeFiles) {
       "  { \"name\": \"path/file1.txt\" },"
       "  { \"name\": \"path/file3.txt\" }],"
       "\"prefixes\": [\"path/subpath/\"]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> children;
   TF_EXPECT_OK(fs.GetChildren("gs://bucket/path/", &children));
@@ -1222,15 +1379,16 @@ TEST(GcsFileSystemTest, GetChildren_SelfDirectoryMarker) {
       "  { \"name\": \"path/\" },"
       "  { \"name\": \"path/file3.txt\" }],"
       "\"prefixes\": [\"path/subpath/\"]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> children;
   TF_EXPECT_OK(fs.GetChildren("gs://bucket/path/", &children));
@@ -1249,15 +1407,16 @@ TEST(GcsFileSystemTest, GetChildren_ThreeFiles_NoSlash) {
       "  { \"name\": \"path/file1.txt\" },"
       "  { \"name\": \"path/file3.txt\" }],"
       "\"prefixes\": [\"path/subpath/\"]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> children;
   TF_EXPECT_OK(fs.GetChildren("gs://bucket/path", &children));
@@ -1273,15 +1432,16 @@ TEST(GcsFileSystemTest, GetChildren_Root) {
       "Auth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> children;
   TF_EXPECT_OK(fs.GetChildren("gs://bucket-a-b-c", &children));
@@ -1297,15 +1457,16 @@ TEST(GcsFileSystemTest, GetChildren_Empty) {
       "Auth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> children;
   TF_EXPECT_OK(fs.GetChildren("gs://bucket/path/", &children));
@@ -1337,15 +1498,16 @@ TEST(GcsFileSystemTest, GetChildren_Pagination) {
            "  { \"name\": \"path/file4.txt\" },"
            "  { \"name\": \"path/file5.txt\" }]}")});
 
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> children;
   TF_EXPECT_OK(fs.GetChildren("gs://bucket/path", &children));
@@ -1363,15 +1525,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_NoWildcard) {
       "Timeouts: 5 1 10\n",
       "{\"items\": [ "
       "  { \"name\": \"path/subpath/file2.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> result;
   TF_EXPECT_OK(
@@ -1390,15 +1553,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_BucketAndWildcard) {
       "  { \"name\": \"path/file1.txt\" },"
       "  { \"name\": \"path/subpath/file2.txt\" },"
       "  { \"name\": \"path/file3.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> result;
   TF_EXPECT_OK(fs.GetMatchingPaths("gs://bucket/*/*", &result));
@@ -1418,15 +1582,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_FolderAndWildcard_Matches) {
       "  { \"name\": \"path/file1.txt\" },"
       "  { \"name\": \"path/subpath/file2.txt\" },"
       "  { \"name\": \"path/file3.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> result;
   TF_EXPECT_OK(fs.GetMatchingPaths("gs://bucket/path/*/file2.txt", &result));
@@ -1443,15 +1608,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_SelfDirectoryMarker) {
       "{\"items\": [ "
       "  { \"name\": \"path/\" },"
       "  { \"name\": \"path/file3.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> result;
   TF_EXPECT_OK(fs.GetMatchingPaths("gs://bucket/path/*", &result));
@@ -1468,15 +1634,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_FolderAndWildcard_NoMatches) {
       "  { \"name\": \"path/file1.txt\" },"
       "  { \"name\": \"path/subpath/file2.txt\" },"
       "  { \"name\": \"path/file3.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> result;
   TF_EXPECT_OK(fs.GetMatchingPaths("gs://bucket/path/*/file3.txt", &result));
@@ -1485,15 +1652,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_FolderAndWildcard_NoMatches) {
 
 TEST(GcsFileSystemTest, GetMatchingPaths_OnlyWildcard) {
   std::vector<HttpRequest*> requests;
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::vector<string> result;
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
@@ -1518,15 +1686,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_Cache) {
            "  { \"name\": \"path/file1.txt\" },"
            "  { \"name\": \"path/subpath/file2.txt\" },"
            "  { \"name\": \"path/file3.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   3600 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 3600 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // Repeated calls to fs.GetMatchingPaths on these patterns should not lead to
   // any additional HTTP requests to GCS.
@@ -1560,15 +1729,16 @@ TEST(GcsFileSystemTest, GetMatchingPaths_Cache_Flush) {
            "Timeouts: 5 1 10\n",
            "{\"items\": [ "
            "  { \"name\": \"path/subpath/file2.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   3600 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 3600 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // This loop should trigger the first HTTP request to GCS.
   for (int i = 0; i < 10; i++) {
@@ -1627,11 +1797,12 @@ TEST(GcsFileSystemTest, DeleteFile) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      16 /* block size */, 16 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 16 /* block size */,
+      16 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // Do an initial read of the file to load its contents into the block cache.
   char scratch[100];
@@ -1650,15 +1821,16 @@ TEST(GcsFileSystemTest, DeleteFile) {
 
 TEST(GcsFileSystemTest, DeleteFile_NoObjectName) {
   std::vector<HttpRequest*> requests;
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
             fs.DeleteFile("gs://bucket/").code());
@@ -1696,11 +1868,12 @@ TEST(GcsFileSystemTest, DeleteFile_StatCacheRemoved) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      16 /* block size */, 16 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 16 /* block size */,
+      16 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // Stats the file first so the stat is cached.
   FileStatistics stat_before_deletion;
@@ -1721,15 +1894,16 @@ TEST(GcsFileSystemTest, DeleteDir_Empty) {
       "Auth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.DeleteDir("gs://bucket/path/"));
 }
@@ -1749,15 +1923,16 @@ TEST(GcsFileSystemTest, DeleteDir_OnlyDirMarkerLeft) {
                            "Timeouts: 5 1 10\n"
                            "Delete: yes\n",
                            "")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.DeleteDir("gs://bucket/path/"));
 }
@@ -1768,15 +1943,16 @@ TEST(GcsFileSystemTest, DeleteDir_BucketOnly) {
       "name%2CnextPageToken&maxResults=2\nAuth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.DeleteDir("gs://bucket"));
 }
@@ -1789,15 +1965,16 @@ TEST(GcsFileSystemTest, DeleteDir_NonEmpty) {
       "Timeouts: 5 1 10\n",
       "{\"items\": [ "
       "  { \"name\": \"path/file1.txt\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   EXPECT_EQ(error::Code::FAILED_PRECONDITION,
             fs.DeleteDir("gs://bucket/path/").code());
@@ -1811,15 +1988,16 @@ TEST(GcsFileSystemTest, GetFileSize) {
       "Timeouts: 5 1 10\n",
       strings::StrCat("{\"size\": \"1010\",\"generation\": \"1\","
                       "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   uint64 size;
   TF_EXPECT_OK(fs.GetFileSize("gs://bucket/file.txt", &size));
@@ -1828,15 +2006,16 @@ TEST(GcsFileSystemTest, GetFileSize) {
 
 TEST(GcsFileSystemTest, GetFileSize_NoObjectName) {
   std::vector<HttpRequest*> requests;
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   uint64 size;
   EXPECT_EQ(errors::Code::INVALID_ARGUMENT,
@@ -1913,15 +2092,16 @@ TEST(GcsFileSystemTest, RenameFile_Folder) {
            "Timeouts: 5 1 10\n"
            "Delete: yes\n",
            "")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.RenameFile("gs://bucket/path1", "gs://bucket/path2/"));
 }
@@ -2008,11 +2188,12 @@ TEST(GcsFileSystemTest, RenameFile_Object) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      16 /* block size */, 64 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 16 /* block size */,
+      64 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
   // Do an initial read of the source and destination files to load their
   // contents into the block cache.
   char scratch[100];
@@ -2088,11 +2269,12 @@ TEST(GcsFileSystemTest, RenameFile_Object_FlushTargetStatCache) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
   // Do an initial stat of the destination file to load their contents into the
   // stat cache.
   FileStatistics stat_before_renaming;
@@ -2150,15 +2332,16 @@ TEST(GcsFileSystemTest, RenameFile_Object_DeletionRetried) {
            "Timeouts: 5 1 10\n"
            "Delete: yes\n",
            "", errors::NotFound("404"), 404)});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(
       fs.RenameFile("gs://bucket/path/src.txt", "gs://bucket/path/dst.txt"));
@@ -2191,15 +2374,16 @@ TEST(GcsFileSystemTest, RenameFile_Object_Incomplete) {
            "Post: yes\n"
            "Timeouts: 5 1 10\n",
            "{\"done\": false}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   EXPECT_EQ(
       errors::Code::UNIMPLEMENTED,
@@ -2215,15 +2399,16 @@ TEST(GcsFileSystemTest, Stat_Object) {
       "Timeouts: 5 1 10\n",
       strings::StrCat("{\"size\": \"1010\",\"generation\": \"1\","
                       "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   FileStatistics stat;
   TF_EXPECT_OK(fs.Stat("gs://bucket/file.txt", &stat));
@@ -2248,15 +2433,16 @@ TEST(GcsFileSystemTest, Stat_Folder) {
            "Timeouts: 5 1 10\n",
            "{\"items\": [ "
            "  { \"name\": \"subfolder/\" }]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   FileStatistics stat;
   TF_EXPECT_OK(fs.Stat("gs://bucket/subfolder", &stat));
@@ -2280,15 +2466,16 @@ TEST(GcsFileSystemTest, Stat_ObjectOrFolderNotFound) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   FileStatistics stat;
   EXPECT_EQ(error::Code::NOT_FOUND, fs.Stat("gs://bucket/path", &stat).code());
@@ -2300,15 +2487,16 @@ TEST(GcsFileSystemTest, Stat_Bucket) {
       "Auth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   FileStatistics stat;
   TF_EXPECT_OK(fs.Stat("gs://bucket/", &stat));
@@ -2323,15 +2511,16 @@ TEST(GcsFileSystemTest, Stat_BucketNotFound) {
       "Auth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "", errors::NotFound("404"), 404)});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   FileStatistics stat;
   EXPECT_EQ(error::Code::NOT_FOUND, fs.Stat("gs://bucket/", &stat).code());
@@ -2364,11 +2553,12 @@ TEST(GcsFileSystemTest, Stat_Cache) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   // Repeated calls to fs.Stat on these paths should not lead to any additional
   // HTTP requests to GCS.
@@ -2405,11 +2595,12 @@ TEST(GcsFileSystemTest, Stat_Cache_Flush) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-      3600 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 3600 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
-      kTestTimeoutConfig, nullptr /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
   // There should be a single HTTP request to GCS for fs.Stat in this loop.
   for (int i = 0; i < 10; i++) {
     FileStatistics stat;
@@ -2437,15 +2628,16 @@ TEST(GcsFileSystemTest, Stat_FilenameEndingWithSlash) {
       "Timeouts: 5 1 10\n",
       strings::StrCat("{\"size\": \"5\",\"generation\": \"1\","
                       "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   FileStatistics stat;
   TF_EXPECT_OK(fs.Stat("gs://bucket/dir/", &stat));
@@ -2468,15 +2660,16 @@ TEST(GcsFileSystemTest, IsDirectory_NotFound) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "", errors::NotFound("404"), 404)});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   EXPECT_EQ(error::Code::NOT_FOUND,
             fs.IsDirectory("gs://bucket/file.txt").code());
@@ -2498,15 +2691,16 @@ TEST(GcsFileSystemTest, IsDirectory_NotDirectoryButObject) {
            "Timeouts: 5 1 10\n",
            strings::StrCat("{\"size\": \"1010\",\"generation\": \"1\","
                            "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   EXPECT_EQ(error::Code::FAILED_PRECONDITION,
             fs.IsDirectory("gs://bucket/file.txt").code());
@@ -2528,15 +2722,16 @@ TEST(GcsFileSystemTest, IsDirectory_Yes) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "{\"items\": [{\"name\": \"subfolder/\"}]}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.IsDirectory("gs://bucket/subfolder"));
   TF_EXPECT_OK(fs.IsDirectory("gs://bucket/subfolder/"));
@@ -2554,15 +2749,16 @@ TEST(GcsFileSystemTest, IsDirectory_Bucket) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.IsDirectory("gs://bucket"));
   TF_EXPECT_OK(fs.IsDirectory("gs://bucket/"));
@@ -2574,15 +2770,16 @@ TEST(GcsFileSystemTest, IsDirectory_BucketNotFound) {
       "Auth Token: fake_token\n"
       "Timeouts: 5 1 10\n",
       "", errors::NotFound("404"), 404)});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   EXPECT_EQ(error::Code::NOT_FOUND, fs.IsDirectory("gs://bucket/").code());
 }
@@ -2615,15 +2812,16 @@ TEST(GcsFileSystemTest, CreateDir_Folder) {
                            "Timeouts: 5 1 30\n"
                            "Put body: \n",
                            "")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.CreateDir("gs://bucket/subpath"));
   TF_EXPECT_OK(fs.CreateDir("gs://bucket/subpath/"));
@@ -2641,15 +2839,16 @@ TEST(GcsFileSystemTest, CreateDir_Bucket) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TF_EXPECT_OK(fs.CreateDir("gs://bucket/"));
   TF_EXPECT_OK(fs.CreateDir("gs://bucket"));
@@ -2712,15 +2911,16 @@ TEST(GcsFileSystemTest, DeleteRecursively_Ok) {
                            "Timeouts: 5 1 10\n"
                            "Delete: yes\n",
                            "")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   int64 undeleted_files, undeleted_dirs;
   TF_EXPECT_OK(fs.DeleteRecursively("gs://bucket/path", &undeleted_files,
@@ -2804,15 +3004,16 @@ TEST(GcsFileSystemTest, DeleteRecursively_DeletionErrors) {
            "Timeouts: 5 1 10\n",
            "", errors::NotFound("404"), 404)});
 
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   int64 undeleted_files, undeleted_dirs;
   TF_EXPECT_OK(fs.DeleteRecursively("gs://bucket/path", &undeleted_files,
@@ -2838,15 +3039,16 @@ TEST(GcsFileSystemTest, DeleteRecursively_NotAFolder) {
            "Auth Token: fake_token\n"
            "Timeouts: 5 1 10\n",
            "", errors::NotFound("404"), 404)});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay*/, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay*/,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   int64 undeleted_files, undeleted_dirs;
   EXPECT_EQ(error::Code::NOT_FOUND,
@@ -2857,6 +3059,29 @@ TEST(GcsFileSystemTest, DeleteRecursively_NotAFolder) {
   EXPECT_EQ(1, undeleted_dirs);
 }
 
+TEST(GcsFileSystemTest, NoConstraintsEnvironmentVariableTest) {
+  unsetenv("GCS_ALLOWED_BUCKET_LOCATIONS");
+  // No constraints
+  GcsFileSystem fs1;
+  EXPECT_EQ(*kAllowedLocationsDefault, fs1.allowed_locations());
+
+  // Cover cache initialization code, any uninitialized cache will cause this to
+  // fail
+  fs1.FlushCaches();
+}
+
+TEST(GcsFileSystemTest, BucketLocationConstraintEnvironmentVariableTest) {
+  unsetenv("GCS_ALLOWED_BUCKET_LOCATIONS");
+  setenv("GCS_ALLOWED_BUCKET_LOCATIONS", "auto", 1);
+  GcsFileSystem fs1;
+  EXPECT_EQ(*kAllowedLocationsAuto, fs1.allowed_locations());
+
+  setenv("GCS_ALLOWED_BUCKET_LOCATIONS", "custom,list", 1);
+  GcsFileSystem fs2;
+  EXPECT_EQ(std::unordered_set<string>({"custom", "list"}),
+            fs2.allowed_locations());
+}
+
 TEST(GcsFileSystemTest, AdditionalRequestHeaderTest) {
   GcsFileSystem fs1;
   EXPECT_EQ("", fs1.additional_header_name());
@@ -2902,11 +3127,12 @@ TEST(GcsFileSystemTest, AdditionalRequestHeaderTest) {
       std::unique_ptr<AuthProvider>(new FakeAuthProvider),
       std::unique_ptr<HttpRequest::Factory>(
           new FakeHttpRequestFactory(&requests)),
-      0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-      0 /* stat cache max age */, 0 /* stat cache max entries */,
-      0 /* matching paths cache max age */,
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
       0 /* matching paths cache max entries */, 0 /* initial retry delay */,
-      kTestTimeoutConfig, add_header /* gcs additional header */);
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      add_header /* gcs additional header */);
 
   std::unique_ptr<HttpRequest> request;
   TF_EXPECT_OK(fs7.CreateHttpRequest(&request));
@@ -2973,15 +3199,16 @@ TEST(GcsFileSystemTest, CreateHttpRequest) {
                            "Auth Token: fake_token\n"
                            "Header Hello: world\n",
                            "{}")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   std::unique_ptr<HttpRequest> request;
   TF_EXPECT_OK(fs.CreateHttpRequest(&request));
@@ -3035,15 +3262,16 @@ TEST(GcsFileSystemTest, Stat_StatsRecording) {
       "Timeouts: 5 1 10\n",
       strings::StrCat("{\"size\": \"1010\",\"generation\": \"1\","
                       "\"updated\": \"2016-04-29T23:15:24.896Z\"}"))});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TestGcsStats stats;
   fs.SetStats(&stats);
@@ -3061,15 +3289,16 @@ TEST(GcsFileSystemTest, NewRandomAccessFile_StatsRecording) {
       "Range: 0-5\n"
       "Timeouts: 5 1 20\n",
       "012345")});
-  GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
-                   std::unique_ptr<HttpRequest::Factory>(
-                       new FakeHttpRequestFactory(&requests)),
-                   0 /* block size */, 0 /* max bytes */, 0 /* max staleness */,
-                   0 /* stat cache max age */, 0 /* stat cache max entries */,
-                   0 /* matching paths cache max age */,
-                   0 /* matching paths cache max entries */,
-                   0 /* initial retry delay */, kTestTimeoutConfig,
-                   nullptr /* gcs additional header */);
+  GcsFileSystem fs(
+      std::unique_ptr<AuthProvider>(new FakeAuthProvider),
+      std::unique_ptr<HttpRequest::Factory>(
+          new FakeHttpRequestFactory(&requests)),
+      std::unique_ptr<ZoneProvider>(new FakeZoneProvider), 0 /* block size */,
+      0 /* max bytes */, 0 /* max staleness */, 0 /* stat cache max age */,
+      0 /* stat cache max entries */, 0 /* matching paths cache max age */,
+      0 /* matching paths cache max entries */, 0 /* initial retry delay */,
+      kTestTimeoutConfig, *kAllowedLocationsDefault,
+      nullptr /* gcs additional header */);
 
   TestGcsStats stats;
   fs.SetStats(&stats);
diff --git a/tensorflow/core/platform/cloud/gcs_throttle_test.cc b/tensorflow/core/platform/cloud/gcs_throttle_test.cc
index 57193ac405..8f962b92b8 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_throttle_test.cc
@@ -24,14 +24,14 @@ namespace {
 
 class TestTime : public EnvTime {
  public:
-  uint64 NowMicros() override { return now_; }
+  uint64 NowNanos() override { return now_micros_ * kMicrosToNanos; }
 
-  void SetTime(uint64 now_micros) { now_ = now_micros; }
+  void SetTime(uint64 now_micros) { now_micros_ = now_micros; }
 
-  void AdvanceSeconds(int64 secs) { now_ += secs * 1000000L; }
+  void AdvanceSeconds(int64 secs) { now_micros_ += secs * kSecondsToMicros; }
 
  private:
-  uint64 now_ = 1234567890000000ULL;
+  uint64 now_micros_ = 1234567890000000ULL;
 };
 
 class GcsThrottleTest : public ::testing::Test {
diff --git a/tensorflow/core/platform/cloud/google_auth_provider.cc b/tensorflow/core/platform/cloud/google_auth_provider.cc
index 7e39b63e3e..6ffe51e897 100644
--- a/tensorflow/core/platform/cloud/google_auth_provider.cc
+++ b/tensorflow/core/platform/cloud/google_auth_provider.cc
@@ -21,11 +21,11 @@ limitations under the License.
 #include <sys/types.h>
 #endif
 #include <fstream>
+#include <utility>
 #include "include/json/json.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/io/path.h"
 #include "tensorflow/core/lib/strings/base64.h"
-#include "tensorflow/core/platform/cloud/curl_http_request.h"
 #include "tensorflow/core/platform/cloud/retrying_utils.h"
 #include "tensorflow/core/platform/env.h"
 
@@ -63,16 +63,11 @@ constexpr char kOAuthV4Url[] = "https://www.googleapis.com/oauth2/v4/token";
 
 // The URL to retrieve the auth bearer token when running in Google Compute
 // Engine.
-constexpr char kGceTokenUrl[] =
-    "http://metadata/computeMetadata/v1/instance/service-accounts/default/"
-    "token";
+constexpr char kGceTokenPath[] = "instance/service-accounts/default/token";
 
 // The authentication token scope to request.
 constexpr char kOAuthScope[] = "https://www.googleapis.com/auth/cloud-platform";
 
-// The default initial delay between retries with exponential backoff.
-constexpr int kInitialRetryDelayUsec = 500000;  // 0.5 sec
-
 /// Returns whether the given path points to a readable file.
 bool IsFile(const string& filename) {
   std::ifstream fstream(filename.c_str());
@@ -121,20 +116,20 @@ Status GetWellKnownFileName(string* filename) {
 
 }  // namespace
 
-GoogleAuthProvider::GoogleAuthProvider()
-    : GoogleAuthProvider(
-          std::unique_ptr<OAuthClient>(new OAuthClient()),
-          std::unique_ptr<HttpRequest::Factory>(new CurlHttpRequest::Factory()),
-          Env::Default(), kInitialRetryDelayUsec) {}
+GoogleAuthProvider::GoogleAuthProvider(
+    std::shared_ptr<ComputeEngineMetadataClient> compute_engine_metadata_client)
+    : GoogleAuthProvider(std::unique_ptr<OAuthClient>(new OAuthClient()),
+                         std::move(compute_engine_metadata_client),
+                         Env::Default()) {}
 
 GoogleAuthProvider::GoogleAuthProvider(
     std::unique_ptr<OAuthClient> oauth_client,
-    std::unique_ptr<HttpRequest::Factory> http_request_factory, Env* env,
-    int64 initial_retry_delay_usec)
+    std::shared_ptr<ComputeEngineMetadataClient> compute_engine_metadata_client,
+    Env* env)
     : oauth_client_(std::move(oauth_client)),
-      http_request_factory_(std::move(http_request_factory)),
-      env_(env),
-      initial_retry_delay_usec_(initial_retry_delay_usec) {}
+      compute_engine_metadata_client_(
+          std::move(compute_engine_metadata_client)),
+      env_(env) {}
 
 Status GoogleAuthProvider::GetToken(string* t) {
   mutex_lock lock(mu_);
@@ -207,24 +202,19 @@ Status GoogleAuthProvider::GetTokenFromFiles() {
 }
 
 Status GoogleAuthProvider::GetTokenFromGce() {
-  const auto get_token_from_gce = [this]() {
-    std::unique_ptr<HttpRequest> request(http_request_factory_->Create());
-    std::vector<char> response_buffer;
-    const uint64 request_timestamp_sec = env_->NowSeconds();
-    request->SetUri(kGceTokenUrl);
-    request->AddHeader("Metadata-Flavor", "Google");
-    request->SetResultBuffer(&response_buffer);
-    TF_RETURN_IF_ERROR(request->Send());
-    StringPiece response =
-        StringPiece(&response_buffer[0], response_buffer.size());
-
-    TF_RETURN_IF_ERROR(oauth_client_->ParseOAuthResponse(
-        response, request_timestamp_sec, &current_token_,
-        &expiration_timestamp_sec_));
-    return Status::OK();
-  };
-  return RetryingUtils::CallWithRetries(get_token_from_gce,
-                                        initial_retry_delay_usec_);
+  std::vector<char> response_buffer;
+  const uint64 request_timestamp_sec = env_->NowSeconds();
+
+  TF_RETURN_IF_ERROR(compute_engine_metadata_client_->GetMetadata(
+      kGceTokenPath, &response_buffer));
+  StringPiece response =
+      StringPiece(&response_buffer[0], response_buffer.size());
+
+  TF_RETURN_IF_ERROR(oauth_client_->ParseOAuthResponse(
+      response, request_timestamp_sec, &current_token_,
+      &expiration_timestamp_sec_));
+
+  return Status::OK();
 }
 
 Status GoogleAuthProvider::GetTokenForTesting() {
diff --git a/tensorflow/core/platform/cloud/google_auth_provider.h b/tensorflow/core/platform/cloud/google_auth_provider.h
index 00da25a959..58a785fd60 100644
--- a/tensorflow/core/platform/cloud/google_auth_provider.h
+++ b/tensorflow/core/platform/cloud/google_auth_provider.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include <memory>
 #include "tensorflow/core/platform/cloud/auth_provider.h"
+#include "tensorflow/core/platform/cloud/compute_engine_metadata_client.h"
 #include "tensorflow/core/platform/cloud/oauth_client.h"
 #include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/thread_annotations.h"
@@ -27,11 +28,12 @@ namespace tensorflow {
 /// Implementation based on Google Application Default Credentials.
 class GoogleAuthProvider : public AuthProvider {
  public:
-  GoogleAuthProvider();
-  explicit GoogleAuthProvider(
-      std::unique_ptr<OAuthClient> oauth_client,
-      std::unique_ptr<HttpRequest::Factory> http_request_factory, Env* env,
-      int64 initial_retry_delay_usec);
+  GoogleAuthProvider(std::shared_ptr<ComputeEngineMetadataClient>
+                         compute_engine_metadata_client);
+  explicit GoogleAuthProvider(std::unique_ptr<OAuthClient> oauth_client,
+                              std::shared_ptr<ComputeEngineMetadataClient>
+                                  compute_engine_metadata_client,
+                              Env* env);
   virtual ~GoogleAuthProvider() {}
 
   /// \brief Returns the short-term authentication bearer token.
@@ -53,13 +55,11 @@ class GoogleAuthProvider : public AuthProvider {
   Status GetTokenForTesting() EXCLUSIVE_LOCKS_REQUIRED(mu_);
 
   std::unique_ptr<OAuthClient> oauth_client_;
-  std::unique_ptr<HttpRequest::Factory> http_request_factory_;
+  std::shared_ptr<ComputeEngineMetadataClient> compute_engine_metadata_client_;
   Env* env_;
   mutex mu_;
   string current_token_ GUARDED_BY(mu_);
   uint64 expiration_timestamp_sec_ GUARDED_BY(mu_) = 0;
-  // The initial delay for exponential backoffs when retrying failed calls.
-  const int64 initial_retry_delay_usec_;
   TF_DISALLOW_COPY_AND_ASSIGN(GoogleAuthProvider);
 };
 
diff --git a/tensorflow/core/platform/cloud/google_auth_provider_test.cc b/tensorflow/core/platform/cloud/google_auth_provider_test.cc
index 4281c6c737..07b88a880f 100644
--- a/tensorflow/core/platform/cloud/google_auth_provider_test.cc
+++ b/tensorflow/core/platform/cloud/google_auth_provider_test.cc
@@ -90,10 +90,13 @@ TEST_F(GoogleAuthProviderTest, EnvironmentVariable_Caching) {
   std::vector<HttpRequest*> requests;
 
   FakeEnv env;
+
+  std::shared_ptr<HttpRequest::Factory> fakeHttpRequestFactory =
+      std::make_shared<FakeHttpRequestFactory>(&requests);
+  auto metadataClient =
+      std::make_shared<ComputeEngineMetadataClient>(fakeHttpRequestFactory, 0);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
-                              std::unique_ptr<HttpRequest::Factory>(
-                                  new FakeHttpRequestFactory(&requests)),
-                              &env, 0);
+                              metadataClient, &env);
   oauth_client->return_token = "fake-token";
   oauth_client->return_expiration_timestamp = env.NowSeconds() + 3600;
 
@@ -124,10 +127,13 @@ TEST_F(GoogleAuthProviderTest, GCloudRefreshToken) {
   std::vector<HttpRequest*> requests;
 
   FakeEnv env;
+  std::shared_ptr<HttpRequest::Factory> fakeHttpRequestFactory =
+      std::make_shared<FakeHttpRequestFactory>(&requests);
+  auto metadataClient =
+      std::make_shared<ComputeEngineMetadataClient>(fakeHttpRequestFactory, 0);
+
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
-                              std::unique_ptr<HttpRequest::Factory>(
-                                  new FakeHttpRequestFactory(&requests)),
-                              &env, 0);
+                              metadataClient, &env);
   oauth_client->return_token = "fake-token";
   oauth_client->return_expiration_timestamp = env.NowSeconds() + 3600;
 
@@ -170,10 +176,12 @@ TEST_F(GoogleAuthProviderTest, RunningOnGCE) {
               })")});
 
   FakeEnv env;
+  std::shared_ptr<HttpRequest::Factory> fakeHttpRequestFactory =
+      std::make_shared<FakeHttpRequestFactory>(&requests);
+  auto metadataClient =
+      std::make_shared<ComputeEngineMetadataClient>(fakeHttpRequestFactory, 0);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
-                              std::unique_ptr<HttpRequest::Factory>(
-                                  new FakeHttpRequestFactory(&requests)),
-                              &env, 0);
+                              metadataClient, &env);
 
   string token;
   TF_EXPECT_OK(provider.GetToken(&token));
@@ -196,10 +204,12 @@ TEST_F(GoogleAuthProviderTest, OverrideForTesting) {
   auto oauth_client = new FakeOAuthClient;
   std::vector<HttpRequest*> empty_requests;
   FakeEnv env;
+  std::shared_ptr<HttpRequest::Factory> fakeHttpRequestFactory =
+      std::make_shared<FakeHttpRequestFactory>(&empty_requests);
+  auto metadataClient =
+      std::make_shared<ComputeEngineMetadataClient>(fakeHttpRequestFactory, 0);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
-                              std::unique_ptr<HttpRequest::Factory>(
-                                  new FakeHttpRequestFactory(&empty_requests)),
-                              &env, 0);
+                              metadataClient, &env);
 
   string token;
   TF_EXPECT_OK(provider.GetToken(&token));
@@ -216,10 +226,12 @@ TEST_F(GoogleAuthProviderTest, NothingAvailable) {
       "", errors::NotFound("404"), 404)});
 
   FakeEnv env;
+  std::shared_ptr<HttpRequest::Factory> fakeHttpRequestFactory =
+      std::make_shared<FakeHttpRequestFactory>(&requests);
+  auto metadataClient =
+      std::make_shared<ComputeEngineMetadataClient>(fakeHttpRequestFactory, 0);
   GoogleAuthProvider provider(std::unique_ptr<OAuthClient>(oauth_client),
-                              std::unique_ptr<HttpRequest::Factory>(
-                                  new FakeHttpRequestFactory(&requests)),
-                              &env, 0);
+                              metadataClient, &env);
 
   string token;
   TF_EXPECT_OK(provider.GetToken(&token));
diff --git a/tensorflow/core/platform/cloud/zone_provider.h b/tensorflow/core/platform/cloud/zone_provider.h
new file mode 100644
index 0000000000..421b6a7e1a
--- /dev/null
+++ b/tensorflow/core/platform/cloud/zone_provider.h
@@ -0,0 +1,48 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_CLOUD_ZONE_PROVIDER_H_
+#define TENSORFLOW_CORE_PLATFORM_CLOUD_ZONE_PROVIDER_H_
+
+#include <string>
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+
+/// Interface for a provider of cloud instance zone
+class ZoneProvider {
+ public:
+  virtual ~ZoneProvider() {}
+
+  /// \brief  Gets the zone of the Cloud instance and set the result in `zone`.
+  /// Returns OK if success.
+  ///
+  /// Returns an empty string in the case where the zone does not match the
+  /// expected format
+  /// Safe for concurrent use by multiple threads.
+  virtual Status GetZone(string* zone) = 0;
+
+  static Status GetZone(ZoneProvider* provider, string* zone) {
+    if (!provider) {
+      return errors::Internal("Zone provider is required.");
+    }
+    return provider->GetZone(zone);
+  }
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_ZONE_PROVIDER_H_
diff --git a/tensorflow/core/platform/default/build_config/BUILD b/tensorflow/core/platform/default/build_config/BUILD
index c17e4810d5..da1f66dc67 100644
--- a/tensorflow/core/platform/default/build_config/BUILD
+++ b/tensorflow/core/platform/default/build_config/BUILD
@@ -146,7 +146,6 @@ cc_library(
         "@farmhash_archive//:farmhash",
         "@fft2d",
         "@highwayhash//:sip_hash",
-        "@png_archive//:png",
     ],
 )
 
@@ -161,7 +160,7 @@ cc_library(
         "@farmhash_archive//:farmhash",
         "@fft2d",
         "@highwayhash//:sip_hash",
-        "@png_archive//:png",
+        "@zlib_archive//:zlib",
     ],
 )
 
@@ -187,6 +186,15 @@ cc_library(
 )
 
 cc_library(
+    name = "png",
+    copts = tf_copts(),
+    deps = [
+        "@png_archive//:png",
+        "@zlib_archive//:zlib",
+    ],
+)
+
+cc_library(
     name = "protos_cc_impl",
     copts = tf_copts(),
     deps = [
diff --git a/tensorflow/core/platform/default/build_config_root.bzl b/tensorflow/core/platform/default/build_config_root.bzl
index 09029a4b25..3a012c23fd 100644
--- a/tensorflow/core/platform/default/build_config_root.bzl
+++ b/tensorflow/core/platform/default/build_config_root.bzl
@@ -58,3 +58,9 @@ def if_static(extra_deps, otherwise=[]):
       str(Label("//tensorflow:framework_shared_object")): otherwise,
       "//conditions:default": extra_deps,
   })
+
+def if_dynamic_kernels(extra_deps, otherwise=[]):
+  return select({
+      str(Label("//tensorflow:dynamic_loaded_kernels")): extra_deps,
+      "//conditions:default": otherwise,
+  })
diff --git a/tensorflow/core/platform/env.h b/tensorflow/core/platform/env.h
index 9192f7ba10..5b237c4736 100644
--- a/tensorflow/core/platform/env.h
+++ b/tensorflow/core/platform/env.h
@@ -232,8 +232,11 @@ class Env {
   // TODO(jeff,sanjay): if needed, tighten spec so relative to epoch, or
   // provide a routine to get the absolute time.
 
+  /// \brief Returns the number of nano-seconds since the Unix epoch.
+  virtual uint64 NowNanos() { return envTime->NowNanos(); }
+
   /// \brief Returns the number of micro-seconds since the Unix epoch.
-  virtual uint64 NowMicros() { return envTime->NowMicros(); };
+  virtual uint64 NowMicros() { return envTime->NowMicros(); }
 
   /// \brief Returns the number of seconds since the Unix epoch.
   virtual uint64 NowSeconds() { return envTime->NowSeconds(); }
@@ -450,6 +453,6 @@ struct Register {
           ::tensorflow::register_file_system::Register<factory>(env, scheme)
 
 #define REGISTER_FILE_SYSTEM(scheme, factory) \
-  REGISTER_FILE_SYSTEM_ENV(Env::Default(), scheme, factory);
+  REGISTER_FILE_SYSTEM_ENV(::tensorflow::Env::Default(), scheme, factory);
 
 #endif  // TENSORFLOW_CORE_PLATFORM_ENV_H_
diff --git a/tensorflow/core/platform/env_test.cc b/tensorflow/core/platform/env_test.cc
index c461a40086..305a9a682f 100644
--- a/tensorflow/core/platform/env_test.cc
+++ b/tensorflow/core/platform/env_test.cc
@@ -86,7 +86,7 @@ TEST_F(DefaultEnvTest, IncompleteReadOutOfRange) {
 
 TEST_F(DefaultEnvTest, ReadFileToString) {
   for (const int length : {0, 1, 1212, 2553, 4928, 8196, 9000, (1 << 20) - 1,
-                           1 << 20, (1 << 20) + 1}) {
+                           1 << 20, (1 << 20) + 1, (256 << 20) + 100}) {
     const string filename = strings::StrCat(BaseDir(), "/bar/..//file", length);
 
     // Write a file with the given length
diff --git a/tensorflow/core/platform/env_time.h b/tensorflow/core/platform/env_time.h
index 23dbedd60d..b4756ed209 100644
--- a/tensorflow/core/platform/env_time.h
+++ b/tensorflow/core/platform/env_time.h
@@ -25,6 +25,13 @@ namespace tensorflow {
 /// access timer related operations.
 class EnvTime {
  public:
+  static constexpr uint64 kMicrosToNanos = 1000ULL;
+  static constexpr uint64 kMillisToMicros = 1000ULL;
+  static constexpr uint64 kMillisToNanos = 1000ULL * 1000ULL;
+  static constexpr uint64 kSecondsToMillis = 1000ULL;
+  static constexpr uint64 kSecondsToMicros = 1000ULL * 1000ULL;
+  static constexpr uint64 kSecondsToNanos = 1000ULL * 1000ULL * 1000ULL;
+
   EnvTime();
   virtual ~EnvTime() = default;
 
@@ -34,11 +41,14 @@ class EnvTime {
   /// The result of Default() belongs to this library and must never be deleted.
   static EnvTime* Default();
 
+  /// \brief Returns the number of nano-seconds since the Unix epoch.
+  virtual uint64 NowNanos() = 0;
+
   /// \brief Returns the number of micro-seconds since the Unix epoch.
-  virtual uint64 NowMicros() = 0;
+  virtual uint64 NowMicros() { return NowNanos() / kMicrosToNanos; }
 
   /// \brief Returns the number of seconds since the Unix epoch.
-  virtual uint64 NowSeconds() { return NowMicros() / 1000000L; }
+  virtual uint64 NowSeconds() { return NowNanos() / kSecondsToNanos; }
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/gif.h b/tensorflow/core/platform/gif.h
index ab095a35c9..61b9fbbcb2 100644
--- a/tensorflow/core/platform/gif.h
+++ b/tensorflow/core/platform/gif.h
@@ -18,10 +18,10 @@ limitations under the License.
 
 #include "tensorflow/core/platform/platform.h"
 
-#if defined(PLATFORM_GOOGLE)
+#if defined(PLATFORM_GOOGLE) && !defined(IS_MOBILE_PLATFORM)
 #include "tensorflow/core/platform/google/build_config/gif.h"
 #elif defined(PLATFORM_POSIX) || defined(PLATFORM_WINDOWS) || \
-    defined(PLATFORM_POSIX_ANDROID)
+    defined(PLATFORM_POSIX_ANDROID) || defined(IS_MOBILE_PLATFORM)
 #include <gif_lib.h>
 #else
 #error Define the appropriate PLATFORM_<foo> macro for this platform
diff --git a/tensorflow/core/platform/jpeg.h b/tensorflow/core/platform/jpeg.h
index 1b5e633f0a..f98ddb8c98 100644
--- a/tensorflow/core/platform/jpeg.h
+++ b/tensorflow/core/platform/jpeg.h
@@ -18,10 +18,10 @@ limitations under the License.
 
 #include "tensorflow/core/platform/platform.h"
 
-#if defined(PLATFORM_GOOGLE)
+#if defined(PLATFORM_GOOGLE) && !defined(IS_MOBILE_PLATFORM)
 #include "tensorflow/core/platform/google/build_config/jpeg.h"
 #elif defined(PLATFORM_POSIX) || defined(PLATFORM_WINDOWS) || \
-    defined(PLATFORM_POSIX_ANDROID)
+    defined(PLATFORM_POSIX_ANDROID) || defined(IS_MOBILE_PLATFORM)
 #include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
diff --git a/tensorflow/core/platform/numa.h b/tensorflow/core/platform/numa.h
new file mode 100644
index 0000000000..b1f08e4c4c
--- /dev/null
+++ b/tensorflow/core/platform/numa.h
@@ -0,0 +1,62 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_NUMA_H_
+#define TENSORFLOW_CORE_PLATFORM_NUMA_H_
+
+#include "tensorflow/core/platform/platform.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+namespace port {
+
+// Returns true iff NUMA functions are supported.
+bool NUMAEnabled();
+
+// Returns the number of NUMA nodes present with respect to CPU operations.
+// Typically this will be the number of sockets where some RAM has greater
+// affinity with one socket than another.
+int NUMANumNodes();
+
+static const int kNUMANoAffinity = -1;
+
+// If possible sets affinity of the current thread to the specified NUMA node.
+// If node == kNUMANoAffinity removes affinity to any particular node.
+void NUMASetThreadNodeAffinity(int node);
+
+// Returns NUMA node affinity of the current thread, kNUMANoAffinity if none.
+int NUMAGetThreadNodeAffinity();
+
+// Like AlignedMalloc, but allocates memory with affinity to the specified NUMA
+// node.
+//
+// Notes:
+//  1. node must be >= 0 and < NUMANumNodes.
+//  1. minimum_alignment must a factor of system page size, the memory
+//     returned will be page-aligned.
+//  2. This function is likely significantly slower than AlignedMalloc
+//     and should not be used for lots of small allocations.  It makes more
+//     sense as a backing allocator for BFCAllocator, PoolAllocator, or similar.
+void* NUMAMalloc(int node, size_t size, int minimum_alignment);
+
+// Memory allocated by NUMAMalloc must be freed via NUMAFree.
+void NUMAFree(void* ptr, size_t size);
+
+// Returns NUMA node affinity of memory address, kNUMANoAffinity if none.
+int NUMAGetMemAffinity(const void* ptr);
+
+}  // namespace port
+}  // namespace tensorflow
+#endif  // TENSORFLOW_CORE_PLATFORM_NUMA_H_
diff --git a/tensorflow/core/platform/numa_test.cc b/tensorflow/core/platform/numa_test.cc
new file mode 100644
index 0000000000..8b39ecd59c
--- /dev/null
+++ b/tensorflow/core/platform/numa_test.cc
@@ -0,0 +1,61 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/platform/numa.h"
+
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace internal {
+
+TEST(Numa, NumNodes) {
+  if (port::NUMAEnabled()) {
+    EXPECT_GE(port::NUMANumNodes(), 1);
+  }
+}
+
+TEST(Numa, Malloc) {
+  if (port::NUMAEnabled()) {
+    int num_nodes = port::NUMANumNodes();
+    for (int request_node = 0; request_node < num_nodes; ++request_node) {
+      void* ptr = port::NUMAMalloc(request_node, 8, 0);
+      EXPECT_NE(ptr, nullptr);
+      // Affinity cannot be tested until page is touched, so save a value.
+      *(reinterpret_cast<int*>(ptr)) = 0;
+      int affinity_node = port::NUMAGetMemAffinity(ptr);
+      EXPECT_EQ(affinity_node, request_node);
+      port::NUMAFree(ptr, 8);
+    }
+  }
+}
+
+TEST(Numa, SetNodeAffinity) {
+  // NOTE(tucker): This test is not reliable when executed under tap because
+  // the virtual machine may not have access to all of the availble NUMA
+  // nodes.  Not sure what to do about that.
+  EXPECT_EQ(-1, port::NUMAGetThreadNodeAffinity());
+  if (port::NUMAEnabled()) {
+    int num_nodes = port::NUMANumNodes();
+    for (int request_node = 0; request_node < num_nodes; ++request_node) {
+      port::NUMASetThreadNodeAffinity(request_node);
+      int affinity_node = port::NUMAGetThreadNodeAffinity();
+      EXPECT_EQ(affinity_node, request_node);
+    }
+  }
+}
+
+}  // namespace internal
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/png.h b/tensorflow/core/platform/png.h
index dad18d7219..b110d63aba 100644
--- a/tensorflow/core/platform/png.h
+++ b/tensorflow/core/platform/png.h
@@ -18,10 +18,10 @@ limitations under the License.
 
 #include "tensorflow/core/platform/platform.h"
 
-#if defined(PLATFORM_GOOGLE)
+#if defined(PLATFORM_GOOGLE) && !defined(IS_MOBILE_PLATFORM)
 #include "tensorflow/core/platform/google/build_config/png.h"
 #elif defined(PLATFORM_POSIX) || defined(PLATFORM_WINDOWS) || \
-    defined(PLATFORM_POSIX_ANDROID)
+    defined(PLATFORM_POSIX_ANDROID) || defined(IS_MOBILE_PLATFORM)
 #include <png.h>
 #else
 #error Define the appropriate PLATFORM_<foo> macro for this platform
diff --git a/tensorflow/core/platform/posix/env_time.cc b/tensorflow/core/platform/posix/env_time.cc
index 341c585a9e..59a67b17aa 100644
--- a/tensorflow/core/platform/posix/env_time.cc
+++ b/tensorflow/core/platform/posix/env_time.cc
@@ -26,10 +26,11 @@ class PosixEnvTime : public EnvTime {
  public:
   PosixEnvTime() {}
 
-  uint64 NowMicros() override {
-    struct timeval tv;
-    gettimeofday(&tv, nullptr);
-    return static_cast<uint64>(tv.tv_sec) * 1000000 + tv.tv_usec;
+  uint64 NowNanos() override {
+    struct timespec ts;
+    clock_gettime(CLOCK_REALTIME, &ts);
+    return (static_cast<uint64>(ts.tv_sec) * kSecondsToNanos +
+            static_cast<uint64>(ts.tv_nsec));
   }
 };
 
diff --git a/tensorflow/core/platform/posix/port.cc b/tensorflow/core/platform/posix/port.cc
index 708f32ba80..1939cf72fb 100644
--- a/tensorflow/core/platform/posix/port.cc
+++ b/tensorflow/core/platform/posix/port.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/core/platform/cpu_info.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/mem.h"
+#include "tensorflow/core/platform/numa.h"
 #include "tensorflow/core/platform/snappy.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -79,6 +80,19 @@ int NumHyperthreadsPerCore() {
   return (ht_per_core > 0) ? ht_per_core : 1;
 }
 
+bool NUMAEnabled() {
+  // Not yet implemented: coming soon.
+  return false;
+}
+
+int NUMANumNodes() { return 1; }
+
+void NUMASetThreadNodeAffinity(int node) {}
+
+int NUMAGetThreadNodeAffinity() {
+  return kNUMANoAffinity;
+}
+
 void* AlignedMalloc(size_t size, int minimum_alignment) {
 #if defined(__ANDROID__)
   return memalign(minimum_alignment, size);
@@ -128,6 +142,16 @@ void Free(void* ptr) {
 #endif
 }
 
+void* NUMAMalloc(int node, size_t size, int minimum_alignment) {
+  return AlignedMalloc(size, minimum_alignment);
+}
+
+void NUMAFree(void* ptr, size_t size) { Free(ptr); }
+
+int NUMAGetMemAffinity(const void* addr) {
+  return kNUMANoAffinity;
+}
+
 void MallocExtension_ReleaseToSystem(std::size_t num_bytes) {
   // No-op.
 }
diff --git a/tensorflow/core/platform/profile_utils/cpu_utils.cc b/tensorflow/core/platform/profile_utils/cpu_utils.cc
index 02de7d1362..664412565f 100644
--- a/tensorflow/core/platform/profile_utils/cpu_utils.cc
+++ b/tensorflow/core/platform/profile_utils/cpu_utils.cc
@@ -15,9 +15,14 @@ limitations under the License.
 
 #include "tensorflow/core/platform/profile_utils/cpu_utils.h"
 
+#include <fstream>
 #include <limits>
 #include <mutex>
 
+#if defined(_WIN32)
+#include <windows.h>
+#endif
+
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/profile_utils/android_armv7a_cpu_utils_helper.h"
 
@@ -67,22 +72,32 @@ static ICpuUtilsHelper* cpu_utils_helper_instance_ = nullptr;
 #if defined(__ANDROID__)
   return GetCpuUtilsHelperSingletonInstance().CalculateCpuFrequency();
 #elif defined(__linux__)
-  double bogomips;
-  FILE* fp = popen("grep '^bogomips' /proc/cpuinfo | head -1", "r");
-  if (fp == nullptr) {
-    return INVALID_FREQUENCY;
-  }
-  const int retval_of_bogomips = fscanf(fp, "bogomips : %lf", &bogomips);
-  if (retval_of_bogomips <= 0) {
+  // Read the contents of /proc/cpuinfo.
+  std::ifstream cpuinfo("/proc/cpuinfo");
+  if (!cpuinfo) {
+    LOG(WARNING) << "Failed to open /proc/cpuinfo";
     return INVALID_FREQUENCY;
   }
-  pclose(fp);
-  const double freq_ghz = bogomips / 1000.0 / 2.0;
-  if (retval_of_bogomips != 1 || freq_ghz < 0.01) {
-    LOG(WARNING) << "Failed to get CPU frequency: " << freq_ghz << " Hz";
-    return INVALID_FREQUENCY;
+  string line;
+  while (std::getline(cpuinfo, line)) {
+    double bogomips;
+    const int retval_of_bogomips =
+        sscanf(line.c_str(), "bogomips : %lf", &bogomips);
+    if (retval_of_bogomips > 0) {
+      const double freq_ghz = bogomips / 1000.0 / 2.0;
+      if (retval_of_bogomips != 1 || freq_ghz < 0.01) {
+        LOG(WARNING) << "Failed to get CPU frequency: " << freq_ghz << " Hz";
+        return INVALID_FREQUENCY;
+      }
+      const int64 freq_n =
+          static_cast<int64>(freq_ghz * 1000.0 * 1000.0 * 1000.0);
+      LOG(INFO) << "CPU Frequency: " << freq_n << " Hz";
+      return freq_n;
+    }
   }
-  return static_cast<int64>(freq_ghz * 1000.0 * 1000.0 * 1000.0);
+  LOG(WARNING) << "Failed to find bogomips in /proc/cpuinfo; cannot determine "
+                  "CPU frequency";
+  return INVALID_FREQUENCY;
 #elif defined(__APPLE__)
   int64 freq_hz;
   FILE* fp =
@@ -99,6 +114,10 @@ static ICpuUtilsHelper* cpu_utils_helper_instance_ = nullptr;
     return INVALID_FREQUENCY;
   }
   return freq_hz;
+#elif defined(_WIN32)
+  LARGE_INTEGER freq;
+  QueryPerformanceFrequency(&freq);
+  return freq.QuadPart;
 #else
   // TODO(satok): Support other OS if needed
   // Return INVALID_FREQUENCY on unsupported OS
diff --git a/tensorflow/core/platform/profile_utils/cpu_utils.h b/tensorflow/core/platform/profile_utils/cpu_utils.h
index 7b580c8bf6..8f06290303 100644
--- a/tensorflow/core/platform/profile_utils/cpu_utils.h
+++ b/tensorflow/core/platform/profile_utils/cpu_utils.h
@@ -28,6 +28,10 @@ limitations under the License.
 #include <sys/time.h>
 #endif
 
+#if defined(_WIN32)
+#include <intrin.h>
+#endif
+
 namespace tensorflow {
 
 namespace profile_utils {
@@ -55,6 +59,9 @@ class CpuUtils {
 #if defined(__ANDROID__)
     return GetCpuUtilsHelperSingletonInstance().GetCurrentClockCycle();
 // ----------------------------------------------------------------
+#elif defined(_WIN32)
+    return __rdtsc();
+// ----------------------------------------------------------------
 #elif defined(__x86_64__) || defined(__amd64__)
     uint64_t high, low;
     __asm__ volatile("rdtsc" : "=a"(low), "=d"(high));
diff --git a/tensorflow/core/platform/s3/s3_file_system.cc b/tensorflow/core/platform/s3/s3_file_system.cc
index bdc8f808df..d5f5dec390 100644
--- a/tensorflow/core/platform/s3/s3_file_system.cc
+++ b/tensorflow/core/platform/s3/s3_file_system.cc
@@ -187,9 +187,7 @@ class S3RandomAccessFile : public RandomAccessFile {
       return Status(error::OUT_OF_RANGE, "Read less bytes than requested");
     }
     n = getObjectOutcome.GetResult().GetContentLength();
-    std::stringstream ss;
-    ss << getObjectOutcome.GetResult().GetBody().rdbuf();
-    ss.read(scratch, n);
+    getObjectOutcome.GetResult().GetBody().read(scratch, n);
 
     *result = StringPiece(scratch, n);
     return Status::OK();
diff --git a/tensorflow/compiler/xla/service/pool_test.cc b/tensorflow/core/platform/vmodule_benchmark_test.cc
index 8c4fe258e3..0f9e75bf9c 100644
--- a/tensorflow/compiler/xla/service/pool_test.cc
+++ b/tensorflow/core/platform/vmodule_benchmark_test.cc
@@ -13,28 +13,16 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/service/pool.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/test_benchmark.h"
 
-#include "tensorflow/compiler/xla/test_helpers.h"
+namespace tensorflow {
 
-namespace xla {
-namespace {
-
-using PoolTest = ::testing::Test;
-
-TEST_F(PoolTest, Test) {
-  Pool<int> pool;
-
-  {
-    auto ptr = pool.Allocate();
-    EXPECT_NE(nullptr, ptr.get());
-    *ptr = 5;
+static void BM_DisabledVlog(int iters) {
+  for (int i = 0; i < iters; ++i) {
+    VLOG(1) << "Testing VLOG(1)!";
   }
-
-  auto ptr = pool.Allocate();
-  EXPECT_NE(nullptr, ptr.get());
-  EXPECT_EQ(5, *ptr);
 }
+BENCHMARK(BM_DisabledVlog);
 
-}  // namespace
-}  // namespace xla
+}  // namespace tensorflow
diff --git a/tensorflow/core/platform/vmodule_test.cc b/tensorflow/core/platform/vmodule_test.cc
new file mode 100644
index 0000000000..47b4b2e0e7
--- /dev/null
+++ b/tensorflow/core/platform/vmodule_test.cc
@@ -0,0 +1,117 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Test that popens a child process with the VLOG-ing environment variable set
+// for the logging framework, and observes VLOG_IS_ON and VLOG macro output.
+
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/platform.h"
+#include "tensorflow/core/platform/test.h"
+
+#include <string.h>
+
+namespace tensorflow {
+namespace {
+
+int RealMain(const char* argv0, bool do_vlog) {
+  if (do_vlog) {
+#if !defined(PLATFORM_GOOGLE)
+    // Note, we only test this when !defined(PLATFORM_GOOGLE) because
+    // VmoduleActivated doesn't exist in that implementation.
+    //
+    // Also, we call this internal API to simulate what would happen if
+    // differently-named translation units attempted to VLOG, so we don't need
+    // to create dummy translation unit files.
+    bool ok = internal::LogMessage::VmoduleActivated("vmodule_test.cc", 7) &&
+              internal::LogMessage::VmoduleActivated("shoobadooba.h", 3);
+    if (!ok) {
+      fprintf(stderr, "vmodule activated levels not as expected.\n");
+      return EXIT_FAILURE;
+    }
+#endif
+
+    // Print info on which VLOG levels are activated.
+    fprintf(stderr, "VLOG_IS_ON(8)? %d\n", VLOG_IS_ON(8));
+    fprintf(stderr, "VLOG_IS_ON(7)? %d\n", VLOG_IS_ON(7));
+    fprintf(stderr, "VLOG_IS_ON(6)? %d\n", VLOG_IS_ON(6));
+    // Do some VLOG-ing.
+    VLOG(8) << "VLOG(8)";
+    VLOG(7) << "VLOG(7)";
+    VLOG(6) << "VLOG(6)";
+    LOG(INFO) << "INFO";
+    return EXIT_SUCCESS;
+  }
+
+  // Popen the child process.
+  std::string command = std::string(argv0);
+#if defined(PLATFORM_GOOGLE)
+  command = command + " do_vlog --vmodule=vmodule_test=7 --alsologtostderr";
+#else
+  command =
+      "TF_CPP_VMODULE=vmodule_test=7,shoobadooba=3 " + command + " do_vlog";
+#endif
+  command += " 2>&1";
+  fprintf(stderr, "Running: \"%s\"\n", command.c_str());
+  FILE* f = popen(command.c_str(), "r");
+  if (f == nullptr) {
+    fprintf(stderr, "Failed to popen child: %s\n", strerror(errno));
+    return EXIT_FAILURE;
+  }
+
+  // Read data from the child's stdout.
+  constexpr int kBufferSizeBytes = 4096;
+  char buffer[kBufferSizeBytes];
+  size_t result = fread(buffer, sizeof(buffer[0]), kBufferSizeBytes - 1, f);
+  if (result == 0) {
+    fprintf(stderr, "Failed to read from child stdout: %zu %s\n", result,
+            strerror(errno));
+    return EXIT_FAILURE;
+  }
+  buffer[result] = '\0';
+  int status = pclose(f);
+  if (status == -1) {
+    fprintf(stderr, "Failed to close popen child: %s\n", strerror(errno));
+    return EXIT_FAILURE;
+  }
+
+  // Check output is as expected.
+  const char kExpected[] =
+      "VLOG_IS_ON(8)? 0\nVLOG_IS_ON(7)? 1\nVLOG_IS_ON(6)? 1\n";
+  if (strstr(buffer, kExpected) == nullptr) {
+    fprintf(stderr, "error: unexpected output from child: \"%.*s\"\n",
+            kBufferSizeBytes, buffer);
+    return EXIT_FAILURE;
+  }
+  bool ok = strstr(buffer, "VLOG(7)\n") != nullptr &&
+            strstr(buffer, "VLOG(6)\n") != nullptr &&
+            strstr(buffer, "VLOG(8)\n") == nullptr;
+  if (!ok) {
+    fprintf(stderr, "error: VLOG output not as expected: \"%.*s\"\n",
+            kBufferSizeBytes, buffer);
+    return EXIT_FAILURE;
+  }
+
+  // Success!
+  return EXIT_SUCCESS;
+}
+
+}  // namespace
+}  // namespace tensorflow
+
+int main(int argc, char** argv) {
+  testing::InitGoogleTest(&argc, argv);
+  bool do_vlog = argc >= 2 && strcmp(argv[1], "do_vlog") == 0;
+  return tensorflow::RealMain(argv[0], do_vlog);
+}
diff --git a/tensorflow/core/platform/windows/env_time.cc b/tensorflow/core/platform/windows/env_time.cc
index 16cc9dc675..b1713f695c 100644
--- a/tensorflow/core/platform/windows/env_time.cc
+++ b/tensorflow/core/platform/windows/env_time.cc
@@ -19,6 +19,10 @@ limitations under the License.
 #include <windows.h>
 #include <chrono>
 
+using std::chrono::duration_cast;
+using std::chrono::nanoseconds;
+using std::chrono::system_clock;
+
 namespace tensorflow {
 
 namespace {
@@ -38,18 +42,17 @@ class WindowsEnvTime : public EnvTime {
     }
   }
 
-  uint64 NowMicros() override {
+  uint64 NowNanos() {
     if (GetSystemTimePreciseAsFileTime_ != NULL) {
       // GetSystemTimePreciseAsFileTime function is only available in latest
       // versions of Windows, so we need to check for its existence here.
-      // All std::chrono clocks on Windows proved to return
-      // values that may repeat, which is not good enough for some uses.
+      // All std::chrono clocks on Windows proved to return values that may
+      // repeat, which is not good enough for some uses.
       constexpr int64_t kUnixEpochStartTicks = 116444736000000000i64;
-      constexpr int64_t kFtToMicroSec = 10;
 
-      // This interface needs to return system time and not
-      // just any microseconds because it is often used as an argument
-      // to TimedWait() on condition variable
+      // This interface needs to return system time and not just any time
+      // because it is often used as an argument to TimedWait() on condition
+      // variable.
       FILETIME system_time;
       GetSystemTimePreciseAsFileTime_(&system_time);
 
@@ -58,12 +61,12 @@ class WindowsEnvTime : public EnvTime {
       li.HighPart = system_time.dwHighDateTime;
       // Subtract unix epoch start
       li.QuadPart -= kUnixEpochStartTicks;
-      // Convert to microsecs
-      li.QuadPart /= kFtToMicroSec;
+
+      constexpr int64_t kFtToNanoSec = 100;
+      li.QuadPart *= kFtToNanoSec;
       return li.QuadPart;
     }
-    using namespace std::chrono;
-    return duration_cast<microseconds>(system_clock::now().time_since_epoch())
+    return duration_cast<nanoseconds>(system_clock::now().time_since_epoch())
         .count();
   }
 
diff --git a/tensorflow/core/platform/windows/port.cc b/tensorflow/core/platform/windows/port.cc
index f2aaf13bec..5375f56372 100644
--- a/tensorflow/core/platform/windows/port.cc
+++ b/tensorflow/core/platform/windows/port.cc
@@ -33,6 +33,7 @@ limitations under the License.
 #include "tensorflow/core/platform/init_main.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/mem.h"
+#include "tensorflow/core/platform/numa.h"
 #include "tensorflow/core/platform/snappy.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -57,6 +58,17 @@ int NumSchedulableCPUs() {
   return system_info.dwNumberOfProcessors;
 }
 
+bool NUMAEnabled() {
+  // Not yet implemented: coming soon.
+  return false;
+}
+
+int NUMANumNodes() { return 1; }
+
+void NUMASetThreadNodeAffinity(int node) {}
+
+int NUMAGetThreadNodeAffinity() { return kNUMANoAffinity; }
+
 void* AlignedMalloc(size_t size, int minimum_alignment) {
 #ifdef TENSORFLOW_USE_JEMALLOC
   void* ptr = NULL;
@@ -108,6 +120,14 @@ void Free(void* ptr) {
 #endif
 }
 
+void* NUMAMalloc(int node, size_t size, int minimum_alignment) {
+  return AlignedMalloc(size, minimum_alignment);
+}
+
+void NUMAFree(void* ptr, size_t size) { Free(ptr); }
+
+int NUMAGetMemAffinity(const void* addr) { return kNUMANoAffinity; }
+
 void MallocExtension_ReleaseToSystem(std::size_t num_bytes) {
   // No-op.
 }
diff --git a/tensorflow/core/protobuf/config.proto b/tensorflow/core/protobuf/config.proto
index 5b6aa47b93..da3a99565e 100644
--- a/tensorflow/core/protobuf/config.proto
+++ b/tensorflow/core/protobuf/config.proto
@@ -145,7 +145,8 @@ message GPUOptions {
     bool use_unified_memory = 2;
 
     // If > 1, the number of device-to-device copy streams to create
-    // for each GPUDevice.
+    // for each GPUDevice.  Default value is 0, which is automatically
+    // converted to 1.
     int32 num_dev_to_dev_copy_streams = 3;
   }
 
@@ -389,6 +390,13 @@ message ConfigProto {
   message Experimental {
     // Task name for group resolution.
     string collective_group_leader = 1;
+    // Whether the client will format templated errors. For example, the string:
+    // "The node was defined on ^^node:Foo:${file}:${line}^^".
+    bool client_handles_error_formatting = 2;
+
+    // Which executor to use, the default executor will be used
+    // if it is an empty string or "DEFAULT"
+    string executor_type = 3;
   };
 
   Experimental experimental = 16;
@@ -412,6 +420,11 @@ message RunOptions {
   int64 timeout_in_ms = 2;
 
   // The thread pool to use, if session_inter_op_thread_pool is configured.
+  // To use the caller thread set this to -1 - this uses the caller thread
+  // to execute Session::Run() and thus avoids a context switch. Using the
+  // caller thread to execute Session::Run() should be done ONLY for simple
+  // graphs, where the overhead of an additional context switch is
+  // comparable with the overhead of Session::Run().
   int32 inter_op_thread_pool = 3;
 
   // Whether the partition graph(s) executed by the executor(s) should be
diff --git a/tensorflow/core/protobuf/debug.proto b/tensorflow/core/protobuf/debug.proto
index 499900f965..811cf406b9 100644
--- a/tensorflow/core/protobuf/debug.proto
+++ b/tensorflow/core/protobuf/debug.proto
@@ -7,7 +7,7 @@ option java_multiple_files = true;
 option java_package = "org.tensorflow.framework";
 option go_package = "github.com/tensorflow/tensorflow/tensorflow/go/core/protobuf";
 
-// EXPERIMENTAL. Option for watching a node.
+// Option for watching a node in TensorFlow Debugger (tfdbg).
 message DebugTensorWatch {
   // Name of the node to watch.
   string node_name = 1;
@@ -51,7 +51,7 @@ message DebugTensorWatch {
   bool tolerate_debug_op_creation_failures = 5;
 }
 
-// EXPERIMENTAL. Options for initializing DebuggerState.
+// Options for initializing DebuggerState in TensorFlow Debugger (tfdbg).
 message DebugOptions {
   // Debugging options
   repeated DebugTensorWatch debug_tensor_watch_opts = 4;
diff --git a/tensorflow/core/protobuf/eager_service.proto b/tensorflow/core/protobuf/eager_service.proto
index 5b05a1b3ee..63ba4eb173 100644
--- a/tensorflow/core/protobuf/eager_service.proto
+++ b/tensorflow/core/protobuf/eager_service.proto
@@ -8,6 +8,7 @@ import "tensorflow/core/framework/function.proto";
 import "tensorflow/core/framework/versions.proto";
 import "tensorflow/core/protobuf/tensorflow_server.proto";
 import "tensorflow/core/framework/tensor_shape.proto";
+import "tensorflow/core/framework/tensor.proto";
 
 message RemoteTensorHandle {
   // The ID of the operation that produced this tensor.
@@ -128,6 +129,24 @@ message RegisterFunctionRequest {
 message RegisterFunctionResponse {
 }
 
+message SendTensorRequest {
+  fixed64 context_id = 1;
+
+  // All remote tensors are identified by <Op ID, Output num>. To mimic this
+  // situation when directly sending tensors, we include an "artificial" op ID
+  // (which would have corresponded to the _Recv op when not using SendTensor).
+  int64 op_id = 2;
+  // The index within the repeated field is the output number that will help
+  // uniquely identify (along with the above op_id) the particular tensor.
+  repeated TensorProto tensors = 3;
+
+  // The device on which the tensors should be resident.
+  string device_name = 4;
+}
+
+message SendTensorResponse {
+}
+
 ////////////////////////////////////////////////////////////////////////////////
 //
 // Eager Service defines a TensorFlow service that executes operations eagerly
@@ -174,4 +193,8 @@ service EagerService {
   // Takes a FunctionDef and makes it enqueable on the remote worker.
   rpc RegisterFunction(RegisterFunctionRequest)
       returns (RegisterFunctionResponse);
+
+  // An RPC to push tensors to the server. At times, certain environments don't
+  // allow the server to connect back to the client.
+  rpc SendTensor(SendTensorRequest) returns (SendTensorResponse);
 }
diff --git a/tensorflow/core/protobuf/tensorflow_server.proto b/tensorflow/core/protobuf/tensorflow_server.proto
index be25804a1b..2bf48d50e1 100644
--- a/tensorflow/core/protobuf/tensorflow_server.proto
+++ b/tensorflow/core/protobuf/tensorflow_server.proto
@@ -46,6 +46,6 @@ message ServerDef {
 
   // The protocol to be used by this server.
   //
-  // Acceptable values include: "grpc".
+  // Acceptable values include: "grpc", "grpc+verbs".
   string protocol = 5;
 }
diff --git a/tensorflow/core/public/session.h b/tensorflow/core/public/session.h
index d58c877cfd..cc8596ef3d 100644
--- a/tensorflow/core/public/session.h
+++ b/tensorflow/core/public/session.h
@@ -237,7 +237,7 @@ class Session {
 /// If session creation succeeds, the new `Session` will be stored in
 /// `*out_session`, the caller will take ownership of the returned
 /// `*out_session`, and this function will return `OK()`. Otherwise, this
-/// function will return an error status.
+/// function will return an error status and set *out_session to nullptr.
 Status NewSession(const SessionOptions& options, Session** out_session);
 
 /// \brief Resets resource containers associated with a target.
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index cb1fd09dbb..cea5e8ffb0 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -24,7 +24,7 @@ limitations under the License.
 
 // TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
 // "-beta", "-rc", "-rc.1")
-#define TF_VERSION_SUFFIX "-rc0"
+#define TF_VERSION_SUFFIX ""
 
 #define TF_STR_HELPER(x) #x
 #define TF_STR(x) TF_STR_HELPER(x)
diff --git a/tensorflow/core/util/batch_util.cc b/tensorflow/core/util/batch_util.cc
index 7ea8851e65..45556d53a4 100644
--- a/tensorflow/core/util/batch_util.cc
+++ b/tensorflow/core/util/batch_util.cc
@@ -264,6 +264,7 @@ Status CopyElementToLargerSlice(const Tensor& element, Tensor* parent,
     HANDLE_DIMS(2);
     HANDLE_DIMS(3);
     HANDLE_DIMS(4);
+    HANDLE_DIMS(5);
 #undef HANDLE_DIMS
     default:
       return errors::Unimplemented("CopyElementToLargerSlice Unhandled rank: ",
diff --git a/tensorflow/core/util/ctc/ctc_beam_entry.h b/tensorflow/core/util/ctc/ctc_beam_entry.h
index 53087821d7..973e315f09 100644
--- a/tensorflow/core/util/ctc/ctc_beam_entry.h
+++ b/tensorflow/core/util/ctc/ctc_beam_entry.h
@@ -1,3 +1,4 @@
+// LINT.IfChange
 /* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
@@ -145,3 +146,4 @@ class BeamComparer {
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_UTIL_CTC_CTC_BEAM_ENTRY_H_
+// LINT.ThenChange(//tensorflow/contrib/lite/experimental/kernels/ctc_beam_entry.h)
diff --git a/tensorflow/core/util/ctc/ctc_beam_scorer.h b/tensorflow/core/util/ctc/ctc_beam_scorer.h
index 2579198ece..1a622babe1 100644
--- a/tensorflow/core/util/ctc/ctc_beam_scorer.h
+++ b/tensorflow/core/util/ctc/ctc_beam_scorer.h
@@ -1,3 +1,4 @@
+// LINT.IfChange
 /* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
@@ -73,3 +74,4 @@ class BaseBeamScorer {
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_UTIL_CTC_CTC_BEAM_SCORER_H_
+// LINT.ThenChange(//tensorflow/contrib/lite/experimental/kernels/ctc_beam_scorer.h)
diff --git a/tensorflow/core/util/ctc/ctc_beam_search.h b/tensorflow/core/util/ctc/ctc_beam_search.h
index 709c65fc96..aee647a1b3 100644
--- a/tensorflow/core/util/ctc/ctc_beam_search.h
+++ b/tensorflow/core/util/ctc/ctc_beam_search.h
@@ -418,3 +418,4 @@ Status CTCBeamSearchDecoder<CTCBeamState, CTCBeamComparer>::TopPaths(
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_UTIL_CTC_CTC_BEAM_SEARCH_H_
+// LINT.ThenChange(//tensorflow/contrib/lite/experimental/kernels/ctc_beam_search.h)
diff --git a/tensorflow/core/util/ctc/ctc_decoder.h b/tensorflow/core/util/ctc/ctc_decoder.h
index b8bab69053..3be36822e5 100644
--- a/tensorflow/core/util/ctc/ctc_decoder.h
+++ b/tensorflow/core/util/ctc/ctc_decoder.h
@@ -1,3 +1,4 @@
+// LINT.IfChange
 /* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
@@ -112,3 +113,4 @@ class CTCGreedyDecoder : public CTCDecoder {
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_UTIL_CTC_CTC_DECODER_H_
+// LINT.ThenChange(//tensorflow/contrib/lite/experimental/kernels/ctc_decoder.h)
diff --git a/tensorflow/core/util/ctc/ctc_loss_util.h b/tensorflow/core/util/ctc/ctc_loss_util.h
index 9c71f58e23..36be9e92ef 100644
--- a/tensorflow/core/util/ctc/ctc_loss_util.h
+++ b/tensorflow/core/util/ctc/ctc_loss_util.h
@@ -1,3 +1,4 @@
+// LINT.IfChange
 /* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
@@ -31,8 +32,10 @@ const float kLogZero = -std::numeric_limits<float>::infinity();
 inline float LogSumExp(float log_prob_1, float log_prob_2) {
   // Always have 'b' be the smaller number to avoid the exponential from
   // blowing up.
-  if (log_prob_1 == kLogZero && log_prob_2 == kLogZero) {
-    return kLogZero;
+  if (log_prob_1 == kLogZero) {
+    return log_prob_2;
+  } else if (log_prob_2 == kLogZero) {
+    return log_prob_1;
   } else {
     return (log_prob_1 > log_prob_2)
                ? log_prob_1 + log1pf(expf(log_prob_2 - log_prob_1))
@@ -44,3 +47,4 @@ inline float LogSumExp(float log_prob_1, float log_prob_2) {
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_UTIL_CTC_CTC_LOSS_UTIL_H_
+// LINT.ThenChange(//tensorflow/contrib/lite/experimental/kernels/ctc_loss_util.h)
diff --git a/tensorflow/core/util/cuda_launch_config.h b/tensorflow/core/util/cuda_launch_config.h
index 81df7a51d7..d0d95736d3 100644
--- a/tensorflow/core/util/cuda_launch_config.h
+++ b/tensorflow/core/util/cuda_launch_config.h
@@ -295,7 +295,7 @@ inline const cudaStream_t& GetCudaStream(OpKernelContext* context) {
       reinterpret_cast<const cudaStream_t*>(context->op_device_context()
                                                 ->stream()
                                                 ->implementation()
-                                                ->CudaStreamMemberHack()));
+                                                ->GpuStreamMemberHack()));
   return *ptr;
 }
 
diff --git a/tensorflow/core/util/equal_graph_def_test.cc b/tensorflow/core/util/equal_graph_def_test.cc
index c54540332e..77ca8eaec3 100644
--- a/tensorflow/core/util/equal_graph_def_test.cc
+++ b/tensorflow/core/util/equal_graph_def_test.cc
@@ -85,7 +85,7 @@ TEST_F(EqualGraphDefTest, NoMatch) {
   Input(e_.opts().WithName("A"));
   Input(a_.opts().WithName("B"));
   EXPECT_FALSE(Match());
-  EXPECT_EQ("Did not find expected node 'A = Input[]()'", diff_);
+  EXPECT_EQ("Did not find expected node '{{node A}} = Input[]()'", diff_);
 }
 
 TEST_F(EqualGraphDefTest, MissingNode) {
@@ -93,7 +93,7 @@ TEST_F(EqualGraphDefTest, MissingNode) {
   Input(e_.opts().WithName("B"));
   Input(a_.opts().WithName("A"));
   EXPECT_FALSE(Match());
-  EXPECT_EQ("Did not find expected node 'B = Input[]()'", diff_);
+  EXPECT_EQ("Did not find expected node '{{node B}} = Input[]()'", diff_);
 }
 
 TEST_F(EqualGraphDefTest, ExtraNode) {
@@ -101,7 +101,7 @@ TEST_F(EqualGraphDefTest, ExtraNode) {
   Input(a_.opts().WithName("A"));
   Input(a_.opts().WithName("B"));
   EXPECT_FALSE(Match());
-  EXPECT_EQ("Found unexpected node 'B = Input[]()'", diff_);
+  EXPECT_EQ("Found unexpected node '{{node B}} = Input[]()'", diff_);
 }
 
 TEST_F(EqualGraphDefTest, NodeOrder) {
diff --git a/tensorflow/core/util/mkl_util.h b/tensorflow/core/util/mkl_util.h
index b5e42f5384..d90f85e422 100644
--- a/tensorflow/core/util/mkl_util.h
+++ b/tensorflow/core/util/mkl_util.h
@@ -17,7 +17,6 @@ limitations under the License.
 #define TENSORFLOW_CORE_UTIL_MKL_UTIL_H_
 #ifdef INTEL_MKL
 
-#include <string>
 #include <vector>
 #include <unordered_map>
 #include <utility>
@@ -1487,6 +1486,8 @@ inline memory::desc CreateBlockedMemDescHelper(const memory::dims& dim,
   return memory::desc(md);
 }
 
+template <typename T>
+inline primitive FindOrCreateReorder(const memory* from, const memory* to);
 /*
  * Class to represent all the resources corresponding to a tensor in TensorFlow
  * that are required to execute an operation (such as Convolution).
@@ -1733,6 +1734,24 @@ class MklDnnData {
     return false;
   }
 
+  /// TODO: this is a faster path with reorder primitive cache compared with
+  /// CheckReorderToOpMem(..., std::vector<primitive>* net), will remove
+  /// slow path in the future
+  inline bool CheckReorderToOpMem(const memory::primitive_desc& op_pd) {
+    CHECK_NOTNULL(user_memory_);
+    if (IsReorderNeeded(op_pd)) {
+      // TODO(nhasabni): can we remove dynamic memory allocation?
+      // primitive reuse don't allow two same reorder prim in
+      // one stream, so submit it immediately
+      reorder_memory_ = new memory(op_pd);
+      std::vector<primitive> net;
+      net.push_back(FindOrCreateReorder<T>(user_memory_, reorder_memory_));
+      stream(stream::kind::eager).submit(net).wait();
+      return true;
+    }
+    return false;
+  }
+
   /// Overloaded version of above function that accepts memory buffer
   /// where output of reorder needs to be stored.
   ///
@@ -1758,6 +1777,26 @@ class MklDnnData {
     return false;
   }
 
+  /// TODO: this is a faster path with reorder primitive cache compared with
+  /// CheckReorderToOpMem(..., std::vector<primitive>* net), will remove
+  /// slow path in the future
+  inline bool CheckReorderToOpMem(const memory::primitive_desc& op_pd,
+                                  void* reorder_data_handle) {
+    CHECK_NOTNULL(reorder_data_handle);
+    CHECK_NOTNULL(user_memory_);
+    if (IsReorderNeeded(op_pd)) {
+      // TODO(nhasabni): can we remove dynamic memory allocation?
+      // primitive reuse don't allow two same reorder prim in
+      // one stream, so submit it immediately
+      std::vector<primitive> net;
+      reorder_memory_ = new memory(op_pd, reorder_data_handle);
+      net.push_back(FindOrCreateReorder<T>(user_memory_, reorder_memory_));
+      stream(stream::kind::eager).submit(net).wait();
+      return true;
+    }
+    return false;
+  }
+
   /// Another overloaded version of CheckReorderToOpMem that accepts Tensor
   /// where output of reorder needs to be stored.
   ///
@@ -1776,6 +1815,15 @@ class MklDnnData {
     return CheckReorderToOpMem(op_pd, GetTensorBuffer(reorder_tensor), net);
   }
 
+  /// TODO: this is a faster path with reorder primitive cache compared with
+  /// CheckReorderToOpMem(..., std::vector<primitive>* net), will remove
+  /// slow path in the future
+  inline bool CheckReorderToOpMem(const memory::primitive_desc& op_pd,
+                                  Tensor* reorder_tensor) {
+    CHECK_NOTNULL(reorder_tensor);
+    return CheckReorderToOpMem(op_pd, GetTensorBuffer(reorder_tensor));
+  }
+
   /// Function to handle output reorder
   ///
   /// This function performs very similar functionality as input reordering
@@ -1812,6 +1860,20 @@ class MklDnnData {
     CHECK_NOTNULL(reorder_memory_);
     net->push_back(CreateReorder(reorder_memory_, user_memory_));
   }
+
+  /// TODO: this is a faster path with reorder primitive cache compared with
+  ///       InsertReorderToUserMem(std::vector<primitive>* net), will remove
+  ///       slow path in the future
+  inline void InsertReorderToUserMem() {
+    CHECK_NOTNULL(user_memory_);
+    CHECK_NOTNULL(reorder_memory_);
+    // primitive reuse don't allow two same reorder prim in
+    // one stream, so submit it immediately
+    std::vector<primitive> net;
+    net.push_back(FindOrCreateReorder<T>(reorder_memory_, user_memory_));
+    stream(stream::kind::eager).submit(net).wait();
+  }
+
 };
 
 /// Base class for operations with reuse of primitives
@@ -1820,9 +1882,8 @@ class MklPrimitive {
  public:
   virtual ~MklPrimitive() {}
 
-  // Dummy data. Its size, hard-coded as 256 here, does
-  // not matter since MKL should never operate on this buffer.
-  unsigned char DummyData[256];
+  // Dummy data which MKL DNN never operates on
+  unsigned char* DummyData = nullptr;
 };
 
 const mkldnn::memory::dims NONE_DIMS = {};
@@ -1833,26 +1894,29 @@ class MklPrimitiveFactory {
   MklPrimitiveFactory() {}
   ~MklPrimitiveFactory() {}
 
-  MklPrimitive* GetOp(const std::string& key) {
-    auto stream_iter = MklPrimitiveFactory<T>::GetHashMap().find(key);
-    if (stream_iter == MklPrimitiveFactory<T>::GetHashMap().end()) {
+  MklPrimitive* GetOp(const string& key) {
+    auto& map = MklPrimitiveFactory<T>::GetHashMap();
+    auto stream_iter = map.find(key);
+    if (stream_iter == map.end()) {
       return nullptr;
     } else {
+      CHECK(stream_iter->second != nullptr) << "nullptr present in map";
       return stream_iter->second;
     }
   }
 
-  void SetOp(const std::string& key, MklPrimitive* op) {
-    auto stream_iter = MklPrimitiveFactory<T>::GetHashMap().find(key);
+  void SetOp(const string& key, MklPrimitive* op) {
+    auto& map = MklPrimitiveFactory<T>::GetHashMap();
+    auto stream_iter = map.find(key);
 
-    CHECK(stream_iter == MklPrimitiveFactory<T>::GetHashMap().end());
+    CHECK(stream_iter == map.end());
 
-    MklPrimitiveFactory<T>::GetHashMap()[key] = op;
+    map[key] = op;
   }
 
  private:
-  static inline std::unordered_map<std::string, MklPrimitive*>& GetHashMap() {
-    static thread_local std::unordered_map<std::string, MklPrimitive*> map_;
+  static inline std::unordered_map<string, MklPrimitive*>& GetHashMap() {
+    static thread_local std::unordered_map<string, MklPrimitive*> map_;
     return map_;
   }
 };
@@ -1880,9 +1944,7 @@ class FactoryKeyCreator {
     Append(StringPiece(buffer, sizeof(T)));
   }
 
-  std::string GetKey() {
-    return key_;
-  }
+  string GetKey() { return key_; }
 
  private:
   string key_;
@@ -1894,6 +1956,109 @@ class FactoryKeyCreator {
   }
 };
 
+class MklReorderPrimitive : public MklPrimitive {
+  public:
+    explicit MklReorderPrimitive(const memory* from, const memory* to) {
+      Setup(from, to);
+    }
+    ~MklReorderPrimitive() {}
+
+    std::shared_ptr<primitive> GetPrimitive() {
+      return context_.reorder_prim;
+    }
+
+    void SetMemory(const memory* from, const memory* to) {
+      context_.src_mem->set_data_handle(from->get_data_handle());
+      context_.dst_mem->set_data_handle(to->get_data_handle());
+    }
+
+  private:
+    struct ReorderContext {
+      std::shared_ptr<mkldnn::memory> src_mem;
+      std::shared_ptr<mkldnn::memory> dst_mem;
+      std::shared_ptr<primitive> reorder_prim;
+      ReorderContext():
+        src_mem(nullptr), dst_mem(nullptr), reorder_prim(nullptr) {
+      }
+    } context_;
+
+    engine cpu_engine_ = engine(engine::cpu, 0);
+
+    void Setup(const memory* from, const memory* to) {
+      context_.src_mem.reset(new memory(
+            {from->get_primitive_desc().desc(), cpu_engine_}, DummyData));
+      context_.dst_mem.reset(new memory(
+            {to->get_primitive_desc().desc(), cpu_engine_}, DummyData));
+      context_.reorder_prim = std::make_shared<mkldnn::reorder>(
+          reorder(*context_.src_mem, *context_.dst_mem));
+    }
+};
+
+template <typename T>
+class MklReorderPrimitiveFactory : public MklPrimitiveFactory<T> {
+  public:
+    static MklReorderPrimitive* Get(const memory* from,
+        const memory* to) {
+      auto reorderPrim = static_cast<MklReorderPrimitive*>(
+        MklReorderPrimitiveFactory<T>::GetInstance().GetReorder(from, to));
+      if (reorderPrim == nullptr) {
+        reorderPrim = new MklReorderPrimitive(from, to);
+        MklReorderPrimitiveFactory<T>::GetInstance().SetReorder(
+            from, to, reorderPrim);
+      }
+      reorderPrim->SetMemory(from, to);
+      return reorderPrim;
+    }
+
+    static MklReorderPrimitiveFactory & GetInstance() {
+      static MklReorderPrimitiveFactory instance_;
+      return instance_;
+    }
+
+  private:
+    MklReorderPrimitiveFactory() {};
+    ~MklReorderPrimitiveFactory() {};
+
+    static string CreateKey(const memory* from, const memory* to) {
+      string prefix = "reorder";
+      FactoryKeyCreator key_creator;
+      auto const &from_desc =  from->get_primitive_desc().desc().data;
+      auto const &to_desc =  to->get_primitive_desc().desc().data;
+      memory::dims from_dims(from_desc.dims, &from_desc.dims[from_desc.ndims]);
+      memory::dims to_dims(to_desc.dims, &to_desc.dims[to_desc.ndims]);
+      key_creator.AddAsKey(prefix);
+      key_creator.AddAsKey(static_cast<int>(from_desc.format));
+      key_creator.AddAsKey(static_cast<int>(from_desc.data_type));
+      key_creator.AddAsKey(from_dims);
+      key_creator.AddAsKey(static_cast<int>(to_desc.format));
+      key_creator.AddAsKey(static_cast<int>(to_desc.data_type));
+      key_creator.AddAsKey(to_dims);
+      return key_creator.GetKey();
+    }
+
+    MklPrimitive* GetReorder(const memory* from, const memory* to) {
+      string key = CreateKey(from, to);
+      return this->GetOp(key);
+    }
+
+    void SetReorder(const memory* from, const memory* to, MklPrimitive* op) {
+      string key = CreateKey(from, to);
+      this->SetOp(key, op);
+    }
+};
+
+  /// Fuction to find(or create) a reorder from memory pointed by from to memory pointed
+  /// by to, it will created primitive or get primitive from pool if it is cached.
+  /// Returns the primitive.
+  template <typename T>
+  inline primitive FindOrCreateReorder(const memory* from, const memory* to) {
+    CHECK_NOTNULL(from);
+    CHECK_NOTNULL(to);
+    MklReorderPrimitive *reorder_prim = 
+      MklReorderPrimitiveFactory<T>::Get(from, to); 
+    return *reorder_prim->GetPrimitive();
+  }
+ 
 #endif  // INTEL_MKL_DNN
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/util/proto/BUILD b/tensorflow/core/util/proto/BUILD
index ade14ed162..7e549c7764 100644
--- a/tensorflow/core/util/proto/BUILD
+++ b/tensorflow/core/util/proto/BUILD
@@ -60,3 +60,13 @@ cc_library(
     ],
     alwayslink = 1,
 )
+
+cc_library(
+    name = "proto_utils",
+    srcs = ["proto_utils.cc"],
+    hdrs = ["proto_utils.h"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+)
diff --git a/tensorflow/core/util/proto/decode.h b/tensorflow/core/util/proto/decode.h
index 74634a356a..cbcb203ee7 100644
--- a/tensorflow/core/util/proto/decode.h
+++ b/tensorflow/core/util/proto/decode.h
@@ -27,6 +27,7 @@ limitations under the License.
 #define TENSORFLOW_CORE_UTIL_PROTO_DECODE_H_
 
 #include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/types.h"
 
@@ -103,6 +104,16 @@ template <class TensorType, enum WireFormatLite::FieldType DeclaredType>
 const uint8* ReadFromArray(const uint8* buf, TensorType* value);
 
 template <>
+inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_INT32>(
+    const uint8* buf, int64* value) {
+  uint32 temp;
+  bool unused_ok;  // The Counting pass would have failed if this were corrupt.
+  buf = ReadVarint32FromArray(buf, &unused_ok, &temp);
+  *value = static_cast<int64>(temp);
+  return buf;
+}
+
+template <>
 inline const uint8* ReadFromArray<int32, WireFormatLite::TYPE_INT32>(
     const uint8* buf, int32* value) {
   uint32 temp;
@@ -123,8 +134,8 @@ inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_INT64>(
 }
 
 template <>
-inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_UINT32>(
-    const uint8* buf, int64* value) {
+inline const uint8* ReadFromArray<uint64, WireFormatLite::TYPE_UINT32>(
+    const uint8* buf, uint64* value) {
   uint32 temp;
   bool unused_ok;  // The Counting pass would have failed if this were corrupt.
   buf = ReadVarint32FromArray(buf, &unused_ok, &temp);
@@ -133,22 +144,26 @@ inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_UINT32>(
 }
 
 template <>
-inline const uint8* ReadFromArray<int32, WireFormatLite::TYPE_UINT32>(
-    const uint8* buf, int32* value) {
-  uint32 temp;
+inline const uint8* ReadFromArray<uint32, WireFormatLite::TYPE_UINT32>(
+    const uint8* buf, uint32* value) {
   bool unused_ok;  // The Counting pass would have failed if this were corrupt.
-  buf = ReadVarint32FromArray(buf, &unused_ok, &temp);
-  *value = WrapUnsignedAsSigned32(temp);
-  return buf;
+  return ReadVarint32FromArray(buf, &unused_ok, value);
+}
+
+template <>
+inline const uint8* ReadFromArray<uint64, WireFormatLite::TYPE_UINT64>(
+    const uint8* buf, uint64* value) {
+  bool unused_ok;  // The Counting pass would have failed if this were corrupt.
+  return ReadVarint64FromArray(buf, &unused_ok, value);
 }
 
 template <>
-inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_UINT64>(
+inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_SINT32>(
     const uint8* buf, int64* value) {
   uint64 temp;
   bool unused_ok;  // The Counting pass would have failed if this were corrupt.
   buf = ReadVarint64FromArray(buf, &unused_ok, &temp);
-  *value = static_cast<int64>(temp);
+  *value = WireFormatLite::ZigZagDecode32(temp);
   return buf;
 }
 
@@ -173,8 +188,8 @@ inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_SINT64>(
 }
 
 template <>
-inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_FIXED32>(
-    const uint8* buf, int64* value) {
+inline const uint8* ReadFromArray<uint64, WireFormatLite::TYPE_FIXED32>(
+    const uint8* buf, uint64* value) {
   uint32 temp;
   buf = WireFormatLite::ReadPrimitiveFromArray<uint32,
                                                WireFormatLite::TYPE_FIXED32>(
@@ -184,8 +199,8 @@ inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_FIXED32>(
 }
 
 template <>
-inline const uint8* ReadFromArray<int32, WireFormatLite::TYPE_FIXED32>(
-    const uint8* buf, int32* value) {
+inline const uint8* ReadFromArray<uint32, WireFormatLite::TYPE_FIXED32>(
+    const uint8* buf, uint32* value) {
   uint32 temp;
   buf = WireFormatLite::ReadPrimitiveFromArray<uint32,
                                                WireFormatLite::TYPE_FIXED32>(
@@ -195,8 +210,8 @@ inline const uint8* ReadFromArray<int32, WireFormatLite::TYPE_FIXED32>(
 }
 
 template <>
-inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_FIXED64>(
-    const uint8* buf, int64* value) {
+inline const uint8* ReadFromArray<uint64, WireFormatLite::TYPE_FIXED64>(
+    const uint8* buf, uint64* value) {
   protobuf_uint64 temp;
   buf = WireFormatLite::ReadPrimitiveFromArray<protobuf_uint64,
                                                WireFormatLite::TYPE_FIXED64>(
@@ -206,6 +221,17 @@ inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_FIXED64>(
 }
 
 template <>
+inline const uint8* ReadFromArray<int64, WireFormatLite::TYPE_SFIXED32>(
+    const uint8* buf, int64* value) {
+  int32 temp;
+  buf = WireFormatLite::ReadPrimitiveFromArray<int32,
+                                               WireFormatLite::TYPE_SFIXED32>(
+      buf, &temp);
+  *value = temp;
+  return buf;
+}
+
+template <>
 inline const uint8* ReadFromArray<int32, WireFormatLite::TYPE_SFIXED32>(
     const uint8* buf, int32* value) {
   return WireFormatLite::ReadPrimitiveFromArray<int32,
@@ -233,6 +259,17 @@ inline const uint8* ReadFromArray<float, WireFormatLite::TYPE_FLOAT>(
 }
 
 template <>
+inline const uint8* ReadFromArray<double, WireFormatLite::TYPE_FLOAT>(
+    const uint8* buf, double* value) {
+  float temp;
+  buf =
+      WireFormatLite::ReadPrimitiveFromArray<float, WireFormatLite::TYPE_FLOAT>(
+          buf, &temp);
+  *value = temp;
+  return buf;
+}
+
+template <>
 inline const uint8* ReadFromArray<double, WireFormatLite::TYPE_DOUBLE>(
     const uint8* buf, double* value) {
   return WireFormatLite::ReadPrimitiveFromArray<double,
@@ -334,48 +371,56 @@ inline Status ReadGroupBytes(CodedInputStream* input, int field_number,
 inline Status ReadValue(CodedInputStream* input,
                         WireFormatLite::FieldType field_type, int field_number,
                         DataType dtype, int index, void* datap) {
-  // Dispatch to the appropriately typed field reader based on the
-  // schema type.
+  // Dispatch to the appropriately typed field reader based on the schema type.
   switch (field_type) {
     case WireFormatLite::TYPE_DOUBLE:
       return ReadPrimitive<double, double, WireFormatLite::TYPE_DOUBLE>(
           input, index, datap);
     case WireFormatLite::TYPE_FLOAT:
-      if (dtype == DataType::DT_FLOAT) {
-        return ReadPrimitive<float, float, WireFormatLite::TYPE_FLOAT>(
-            input, index, datap);
-      }
-      if (dtype == DataType::DT_DOUBLE) {
-        return ReadPrimitive<float, double, WireFormatLite::TYPE_FLOAT>(
-            input, index, datap);
+      switch (dtype) {
+        case DataType::DT_DOUBLE:
+          return ReadPrimitive<float, double, WireFormatLite::TYPE_FLOAT>(
+              input, index, datap);
+        case DataType::DT_FLOAT:
+          return ReadPrimitive<float, float, WireFormatLite::TYPE_FLOAT>(
+              input, index, datap);
+        default:
+          return errors::DataLoss("Failed reading TYPE_FLOAT for ",
+                                  DataTypeString(dtype));
       }
-      // Any case that reaches this point should have triggered an error
-      // already.
-      return errors::DataLoss("Failed reading TYPE_FLOAT");
     case WireFormatLite::TYPE_INT64:
       return ReadPrimitive<protobuf_int64, int64, WireFormatLite::TYPE_INT64>(
           input, index, datap);
     case WireFormatLite::TYPE_UINT64:
-      return ReadPrimitive<protobuf_uint64, int64, WireFormatLite::TYPE_UINT64>(
-          input, index, datap);
+      return ReadPrimitive<protobuf_uint64, uint64,
+                           WireFormatLite::TYPE_UINT64>(input, index, datap);
     case WireFormatLite::TYPE_INT32:
-      return ReadPrimitive<int32, int32, WireFormatLite::TYPE_INT32>(
-          input, index, datap);
+      switch (dtype) {
+        case DataType::DT_INT64:
+          return ReadPrimitive<int32, int64, WireFormatLite::TYPE_INT32>(
+              input, index, datap);
+        case DataType::DT_INT32:
+          return ReadPrimitive<int32, int32, WireFormatLite::TYPE_INT32>(
+              input, index, datap);
+        default:
+          return errors::DataLoss("Failed reading TYPE_INT32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_FIXED64:
-      return ReadPrimitive<protobuf_uint64, int64,
+      return ReadPrimitive<protobuf_uint64, uint64,
                            WireFormatLite::TYPE_FIXED64>(input, index, datap);
     case WireFormatLite::TYPE_FIXED32:
-      if (dtype == DataType::DT_INT64) {
-        return ReadPrimitive<uint32, int64, WireFormatLite::TYPE_FIXED32>(
-            input, index, datap);
-      }
-      if (dtype == DataType::DT_INT32) {
-        return ReadPrimitive<uint32, int32, WireFormatLite::TYPE_FIXED32>(
-            input, index, datap);
+      switch (dtype) {
+        case DataType::DT_UINT64:
+          return ReadPrimitive<uint32, uint64, WireFormatLite::TYPE_FIXED32>(
+              input, index, datap);
+        case DataType::DT_UINT32:
+          return ReadPrimitive<uint32, uint32, WireFormatLite::TYPE_FIXED32>(
+              input, index, datap);
+        default:
+          return errors::DataLoss("Failed reading TYPE_FIXED32 for ",
+                                  DataTypeString(dtype));
       }
-      // Any case that reaches this point should have triggered an error
-      // already.
-      return errors::DataLoss("Failed reading TYPE_FIXED32");
     case WireFormatLite::TYPE_BOOL:
       return ReadPrimitive<bool, bool, WireFormatLite::TYPE_BOOL>(input, index,
                                                                   datap);
@@ -388,29 +433,47 @@ inline Status ReadValue(CodedInputStream* input,
     case WireFormatLite::TYPE_BYTES:
       return ReadBytes(input, index, datap);
     case WireFormatLite::TYPE_UINT32:
-      if (dtype == DataType::DT_INT64) {
-        return ReadPrimitive<uint32, int64, WireFormatLite::TYPE_UINT32>(
-            input, index, datap);
+      switch (dtype) {
+        case DataType::DT_UINT64:
+          return ReadPrimitive<uint32, uint64, WireFormatLite::TYPE_UINT32>(
+              input, index, datap);
+        case DataType::DT_UINT32:
+          return ReadPrimitive<uint32, uint32, WireFormatLite::TYPE_UINT32>(
+              input, index, datap);
+        default:
+          return errors::DataLoss("Failed reading TYPE_UINT32 for ",
+                                  DataTypeString(dtype));
       }
-      if (dtype == DataType::DT_INT32) {
-        return ReadPrimitive<uint32, int32, WireFormatLite::TYPE_UINT32>(
-            input, index, datap);
-      }
-      // Any case that reaches this point should have triggered an error
-      // already.
-      return errors::DataLoss("Failed reading TYPE_UINT32");
     case WireFormatLite::TYPE_ENUM:
       return ReadPrimitive<int32, int32, WireFormatLite::TYPE_ENUM>(
           input, index, datap);
     case WireFormatLite::TYPE_SFIXED32:
-      return ReadPrimitive<int32, int32, WireFormatLite::TYPE_SFIXED32>(
-          input, index, datap);
+      switch (dtype) {
+        case DataType::DT_INT64:
+          return ReadPrimitive<int32, int64, WireFormatLite::TYPE_SFIXED32>(
+              input, index, datap);
+        case DataType::DT_INT32:
+          return ReadPrimitive<int32, int32, WireFormatLite::TYPE_SFIXED32>(
+              input, index, datap);
+        default:
+          return errors::DataLoss("Failed reading TYPE_SFIXED32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_SFIXED64:
       return ReadPrimitive<protobuf_int64, int64,
                            WireFormatLite::TYPE_SFIXED64>(input, index, datap);
     case WireFormatLite::TYPE_SINT32:
-      return ReadPrimitive<int32, int32, WireFormatLite::TYPE_SINT32>(
-          input, index, datap);
+      switch (dtype) {
+        case DataType::DT_INT64:
+          return ReadPrimitive<int32, int64, WireFormatLite::TYPE_SINT32>(
+              input, index, datap);
+        case DataType::DT_INT32:
+          return ReadPrimitive<int32, int32, WireFormatLite::TYPE_SINT32>(
+              input, index, datap);
+        default:
+          return errors::DataLoss("Failed reading TYPE_SINT32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_SINT64:
       return ReadPrimitive<protobuf_int64, int64, WireFormatLite::TYPE_SINT64>(
           input, index, datap);
@@ -425,47 +488,66 @@ inline Status ReadPackedFromArray(const void* buf, size_t buf_size,
                                   const WireFormatLite::FieldType field_type,
                                   const int field_number, const DataType dtype,
                                   const int stride, int* index, void* data) {
-  // Dispatch to the appropriately typed field reader based on the
-  // schema type.
+  // Dispatch to the appropriately typed field reader based on the schema type.
   switch (field_type) {
     case WireFormatLite::TYPE_DOUBLE:
       *index += ReadPackedPrimitives<double, WireFormatLite::TYPE_DOUBLE>(
           buf, buf_size, *index, stride, data);
       return Status::OK();
     case WireFormatLite::TYPE_FLOAT:
-      *index += ReadPackedPrimitives<float, WireFormatLite::TYPE_FLOAT>(
-          buf, buf_size, *index, stride, data);
-      return Status::OK();
+      switch (dtype) {
+        case DataType::DT_DOUBLE:
+          *index += ReadPackedPrimitives<double, WireFormatLite::TYPE_FLOAT>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        case DataType::DT_FLOAT:
+          *index += ReadPackedPrimitives<float, WireFormatLite::TYPE_FLOAT>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        default:
+          return errors::DataLoss("Failed reading TYPE_FLOAT for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_INT64:
       *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_INT64>(
           buf, buf_size, *index, stride, data);
       return Status::OK();
     case WireFormatLite::TYPE_UINT64:
-      *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_UINT64>(
+      *index += ReadPackedPrimitives<uint64, WireFormatLite::TYPE_UINT64>(
           buf, buf_size, *index, stride, data);
       return Status::OK();
     case WireFormatLite::TYPE_INT32:
-      *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_INT32>(
-          buf, buf_size, *index, stride, data);
-      return Status::OK();
+      switch (dtype) {
+        case DataType::DT_INT64:
+          *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_INT32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        case DataType::DT_INT32:
+          *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_INT32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        default:
+          return errors::DataLoss("Failed reading TYPE_INT32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_FIXED64:
-      *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_FIXED64>(
+      *index += ReadPackedPrimitives<uint64, WireFormatLite::TYPE_FIXED64>(
           buf, buf_size, *index, stride, data);
       return Status::OK();
     case WireFormatLite::TYPE_FIXED32:
-      if (dtype == DataType::DT_INT64) {
-        *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_FIXED32>(
-            buf, buf_size, *index, stride, data);
-        return Status::OK();
-      }
-      if (dtype == DataType::DT_INT32) {
-        *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_FIXED32>(
-            buf, buf_size, *index, stride, data);
-        return Status::OK();
+      switch (dtype) {
+        case DataType::DT_UINT64:
+          *index += ReadPackedPrimitives<uint64, WireFormatLite::TYPE_FIXED32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        case DataType::DT_UINT32:
+          *index += ReadPackedPrimitives<uint32, WireFormatLite::TYPE_FIXED32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        default:
+          return errors::DataLoss("Failed reading TYPE_FIXED32 for ",
+                                  DataTypeString(dtype));
       }
-      // Any case that reaches this point should have triggered an error
-      // already.
-      return errors::DataLoss("Failed reading TYPE_FIXED32");
     case WireFormatLite::TYPE_BOOL:
       *index += ReadPackedPrimitives<bool, WireFormatLite::TYPE_BOOL>(
           buf, buf_size, *index, stride, data);
@@ -476,38 +558,56 @@ inline Status ReadPackedFromArray(const void* buf, size_t buf_size,
     case WireFormatLite::TYPE_BYTES:
       return errors::DataLoss("Non-primitive type encountered as packed");
     case WireFormatLite::TYPE_UINT32:
-      if (dtype == DataType::DT_INT64) {
-        *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_UINT32>(
-            buf, buf_size, *index, stride, data);
-        return Status::OK();
+      switch (dtype) {
+        case DataType::DT_UINT64:
+          *index += ReadPackedPrimitives<uint64, WireFormatLite::TYPE_UINT32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        case DataType::DT_UINT32:
+          *index += ReadPackedPrimitives<uint32, WireFormatLite::TYPE_UINT32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        default:
+          return errors::DataLoss("Failed reading TYPE_UINT32 for ",
+                                  DataTypeString(dtype));
       }
-      if (dtype == DataType::DT_INT32) {
-        *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_UINT32>(
-            buf, buf_size, *index, stride, data);
-        return Status::OK();
-      }
-      // Any case that reaches this point should have triggered an error
-      // already.
-      return errors::DataLoss("Failed reading TYPE_UINT32");
     case WireFormatLite::TYPE_ENUM:
       *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_ENUM>(
           buf, buf_size, *index, stride, data);
       return Status::OK();
     case WireFormatLite::TYPE_SFIXED32:
-      *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_SFIXED32>(
-          buf, buf_size, *index, stride, data);
-      return Status::OK();
-
+      switch (dtype) {
+        case DataType::DT_INT64:
+          *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_SFIXED32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        case DataType::DT_INT32:
+          *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_SFIXED32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        default:
+          return errors::DataLoss("Failed reading TYPE_INT32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_SFIXED64:
       *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_SFIXED64>(
           buf, buf_size, *index, stride, data);
       return Status::OK();
 
     case WireFormatLite::TYPE_SINT32:
-      *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_SINT32>(
-          buf, buf_size, *index, stride, data);
-      return Status::OK();
-
+      switch (dtype) {
+        case DataType::DT_INT64:
+          *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_SINT32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        case DataType::DT_INT32:
+          *index += ReadPackedPrimitives<int32, WireFormatLite::TYPE_SINT32>(
+              buf, buf_size, *index, stride, data);
+          return Status::OK();
+        default:
+          return errors::DataLoss("Failed reading TYPE_SINT32 for ",
+                                  DataTypeString(dtype));
+      }
     case WireFormatLite::TYPE_SINT64:
       *index += ReadPackedPrimitives<int64, WireFormatLite::TYPE_SINT64>(
           buf, buf_size, *index, stride, data);
diff --git a/tensorflow/core/util/proto/proto_utils.cc b/tensorflow/core/util/proto/proto_utils.cc
new file mode 100644
index 0000000000..201f05a129
--- /dev/null
+++ b/tensorflow/core/util/proto/proto_utils.cc
@@ -0,0 +1,70 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+#include "tensorflow/core/util/proto/proto_utils.h"
+
+namespace tensorflow {
+namespace proto_utils {
+
+using tensorflow::protobuf::FieldDescriptor;
+using tensorflow::protobuf::internal::WireFormatLite;
+
+bool IsCompatibleType(FieldDescriptor::Type field_type, DataType dtype) {
+  switch (field_type) {
+    case WireFormatLite::TYPE_DOUBLE:
+      return dtype == tensorflow::DT_DOUBLE;
+    case WireFormatLite::TYPE_FLOAT:
+      return dtype == tensorflow::DT_FLOAT || dtype == tensorflow::DT_DOUBLE;
+    case WireFormatLite::TYPE_INT64:
+      return dtype == tensorflow::DT_INT64;
+    case WireFormatLite::TYPE_UINT64:
+      return dtype == tensorflow::DT_UINT64;
+    case WireFormatLite::TYPE_INT32:
+      return dtype == tensorflow::DT_INT32 || dtype == tensorflow::DT_INT64;
+    case WireFormatLite::TYPE_FIXED64:
+      return dtype == tensorflow::DT_UINT64;
+    case WireFormatLite::TYPE_FIXED32:
+      return dtype == tensorflow::DT_UINT32 || dtype == tensorflow::DT_UINT64;
+    case WireFormatLite::TYPE_BOOL:
+      return dtype == tensorflow::DT_BOOL;
+    case WireFormatLite::TYPE_STRING:
+      return dtype == tensorflow::DT_STRING;
+    case WireFormatLite::TYPE_GROUP:
+      return dtype == tensorflow::DT_STRING;
+    case WireFormatLite::TYPE_MESSAGE:
+      return dtype == tensorflow::DT_STRING;
+    case WireFormatLite::TYPE_BYTES:
+      return dtype == tensorflow::DT_STRING;
+    case WireFormatLite::TYPE_UINT32:
+      return dtype == tensorflow::DT_UINT32 || dtype == tensorflow::DT_UINT64;
+    case WireFormatLite::TYPE_ENUM:
+      return dtype == tensorflow::DT_INT32;
+    case WireFormatLite::TYPE_SFIXED32:
+      return dtype == tensorflow::DT_INT32 || dtype == tensorflow::DT_INT64;
+    case WireFormatLite::TYPE_SFIXED64:
+      return dtype == tensorflow::DT_INT64;
+    case WireFormatLite::TYPE_SINT32:
+      return dtype == tensorflow::DT_INT32 || dtype == tensorflow::DT_INT64;
+    case WireFormatLite::TYPE_SINT64:
+      return dtype == tensorflow::DT_INT64;
+      // default: intentionally omitted in order to enable static checking.
+  }
+}
+
+}  // namespace proto_utils
+}  // namespace tensorflow
diff --git a/tensorflow/core/util/proto/proto_utils.h b/tensorflow/core/util/proto/proto_utils.h
new file mode 100644
index 0000000000..d5e0b9006c
--- /dev/null
+++ b/tensorflow/core/util/proto/proto_utils.h
@@ -0,0 +1,33 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_UTIL_PROTO_PROTO_UTILS_H_
+#define TENSORFLOW_CORE_UTIL_PROTO_PROTO_UTILS_H_
+
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/platform/protobuf.h"
+
+namespace tensorflow {
+namespace proto_utils {
+
+using tensorflow::protobuf::FieldDescriptor;
+
+// Returns true if the proto field type can be converted to the tensor dtype.
+bool IsCompatibleType(FieldDescriptor::Type field_type, DataType dtype);
+
+}  // namespace proto_utils
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_UTIL_PROTO_PROTO_UTILS_H_
diff --git a/tensorflow/core/util/sparse/dim_comparator.h b/tensorflow/core/util/sparse/dim_comparator.h
index b773b33008..0782e7e1a8 100644
--- a/tensorflow/core/util/sparse/dim_comparator.h
+++ b/tensorflow/core/util/sparse/dim_comparator.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_UTIL_SPARSE_DIM_COMPARATOR_H_
-#define TENSORFLOW_UTIL_SPARSE_DIM_COMPARATOR_H_
+#ifndef TENSORFLOW_CORE_UTIL_SPARSE_DIM_COMPARATOR_H_
+#define TENSORFLOW_CORE_UTIL_SPARSE_DIM_COMPARATOR_H_
 
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/core/kernels/bounds_check.h"
@@ -49,11 +49,11 @@ class DimComparator {
   DimComparator(const TTypes<int64>::Matrix& ix, const VarDimArray& order,
                 const VarDimArray& shape)
       : ix_(ix), order_(order), dims_(shape.size()) {
-    CHECK_GT(order.size(), size_t{0}) << "Must order using at least one index";
-    CHECK_LE(order.size(), shape.size()) << "Can only sort up to dims";
+    DCHECK_GT(order.size(), size_t{0}) << "Must order using at least one index";
+    DCHECK_LE(order.size(), shape.size()) << "Can only sort up to dims";
     for (size_t d = 0; d < order.size(); ++d) {
-      CHECK_GE(order[d], 0);
-      CHECK_LT(order[d], shape.size());
+      DCHECK_GE(order[d], 0);
+      DCHECK_LT(order[d], shape.size());
     }
   }
 
@@ -97,7 +97,7 @@ class FixedDimComparator : DimComparator {
   FixedDimComparator(const TTypes<int64>::Matrix& ix, const VarDimArray& order,
                      const VarDimArray& shape)
       : DimComparator(ix, order, shape) {
-    CHECK_EQ(order.size(), ORDER_DIM);
+    DCHECK_EQ(order.size(), ORDER_DIM);
   }
   inline bool operator()(const int64 i, const int64 j) const {
     bool value = false;
@@ -116,4 +116,4 @@ class FixedDimComparator : DimComparator {
 }  // namespace sparse
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_UTIL_SPARSE_DIM_COMPARATOR_H_
+#endif  // TENSORFLOW_CORE_UTIL_SPARSE_DIM_COMPARATOR_H_
diff --git a/tensorflow/core/util/sparse/group_iterator.h b/tensorflow/core/util/sparse/group_iterator.h
index fb70318078..3fa8cb6116 100644
--- a/tensorflow/core/util/sparse/group_iterator.h
+++ b/tensorflow/core/util/sparse/group_iterator.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_UTIL_SPARSE_GROUP_ITERATOR_H_
-#define TENSORFLOW_UTIL_SPARSE_GROUP_ITERATOR_H_
+#ifndef TENSORFLOW_CORE_UTIL_SPARSE_GROUP_ITERATOR_H_
+#define TENSORFLOW_CORE_UTIL_SPARSE_GROUP_ITERATOR_H_
 
 #include <vector>
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
@@ -143,4 +143,4 @@ typename TTypes<T>::UnalignedVec Group::values() const {
 }  // namespace sparse
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_UTIL_SPARSE_GROUP_ITERATOR_H_
+#endif  // TENSORFLOW_CORE_UTIL_SPARSE_GROUP_ITERATOR_H_
diff --git a/tensorflow/core/util/sparse/sparse_tensor.h b/tensorflow/core/util/sparse/sparse_tensor.h
index 258ee418c1..0f04b65f60 100644
--- a/tensorflow/core/util/sparse/sparse_tensor.h
+++ b/tensorflow/core/util/sparse/sparse_tensor.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_UTIL_SPARSE_SPARSE_TENSOR_H_
-#define TENSORFLOW_UTIL_SPARSE_SPARSE_TENSOR_H_
+#ifndef TENSORFLOW_CORE_UTIL_SPARSE_SPARSE_TENSOR_H_
+#define TENSORFLOW_CORE_UTIL_SPARSE_SPARSE_TENSOR_H_
 
 #include <limits>
 #include <numeric>
@@ -26,8 +26,10 @@ limitations under the License.
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/framework/types.pb.h"
 #include "tensorflow/core/kernels/bounds_check.h"
+#include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/util/sparse/dim_comparator.h"
@@ -41,32 +43,88 @@ class SparseTensor {
   typedef typename gtl::ArraySlice<int64> VarDimArray;
   typedef typename gtl::InlinedVector<int64, 8> ShapeArray;
 
+  static Status Create(Tensor ix, Tensor vals, const VarDimArray shape,
+                       const VarDimArray order, SparseTensor* result) {
+    if (ix.dtype() != DT_INT64) {
+      return Status(
+          error::INVALID_ARGUMENT,
+          strings::StrCat("indices must be type int64 but got: ", ix.dtype()));
+    }
+    if (!TensorShapeUtils::IsVector(vals.shape())) {
+      return Status(error::INVALID_ARGUMENT,
+                    strings::StrCat("vals must be a vec, but got: ",
+                                    vals.shape().DebugString()));
+    }
+    if (ix.shape().dim_size(0) != vals.shape().dim_size(0)) {
+      return Status(error::INVALID_ARGUMENT,
+                    strings::StrCat("indices and values rows (indexing "
+                                    "dimension) must match. (indices = ",
+                                    ix.shape().dim_size(0), ", values = ",
+                                    vals.shape().dim_size(0), ")"));
+    }
+    int dims;
+    TF_RETURN_IF_ERROR(GetDimsFromIx(ix, &dims));
+    if (order.size() != dims) {
+      return Status(error::INVALID_ARGUMENT,
+                    "Order length must be SparseTensor rank.");
+    }
+    if (shape.size() != dims) {
+      return Status(error::INVALID_ARGUMENT,
+                    "Shape rank must be SparseTensor rank.");
+    }
+
+    *result = SparseTensor(ix, vals, shape, order);
+    return Status();
+  }
+
+  static Status Create(Tensor ix, Tensor vals, const TensorShape& shape,
+                       SparseTensor* result) {
+    return Create(ix, vals, TensorShapeToVector(shape),
+                  UndefinedOrder(TensorShapeToVector(shape)), result);
+  }
+
+  static Status Create(Tensor ix, Tensor vals, const VarDimArray shape,
+                       SparseTensor* result) {
+    return Create(ix, vals, shape, UndefinedOrder(shape), result);
+  }
+
+  static Status Create(Tensor ix, Tensor vals, const TensorShape& shape,
+                       const VarDimArray order, SparseTensor* result) {
+    return Create(ix, vals, TensorShapeToVector(shape), order, result);
+  }
+
+  SparseTensor() : dims_(0) {}
+
+  // DEPRECATED: use Create() functions instead of constructors directly.
   SparseTensor(Tensor ix, Tensor vals, const TensorShape& shape)
       : SparseTensor(ix, vals, TensorShapeToVector(shape),
                      UndefinedOrder(TensorShapeToVector(shape))) {}
 
+  // DEPRECATED: use Create() functions instead of constructors directly.
   SparseTensor(Tensor ix, Tensor vals, const VarDimArray shape)
       : SparseTensor(ix, vals, shape, UndefinedOrder(shape)) {}
 
+  // DEPRECATED: use Create() functions instead of constructors directly.
   SparseTensor(Tensor ix, Tensor vals, const TensorShape& shape,
                const VarDimArray order)
       : SparseTensor(ix, vals, TensorShapeToVector(shape), order) {}
 
+  // DEPRECATED: use Create() functions instead of constructors directly.
   SparseTensor(Tensor ix, Tensor vals, const VarDimArray shape,
                const VarDimArray order)
       : ix_(ix),
         vals_(vals),
         shape_(shape.begin(), shape.end()),
         order_(order.begin(), order.end()),
-        dims_(GetDimsFromIx(ix)) {
-    CHECK_EQ(ix.dtype(), DT_INT64)
+        dims_(UnsafeGetDimsFromIx(ix)) {
+    DCHECK_EQ(ix.dtype(), DT_INT64)
         << "indices must be type int64 but got: " << ix.dtype();
-    CHECK(TensorShapeUtils::IsVector(vals.shape()))
+    DCHECK(TensorShapeUtils::IsVector(vals.shape()))
         << "vals must be a vec, but got: " << vals.shape().DebugString();
-    CHECK_EQ(ix.shape().dim_size(0), vals.shape().dim_size(0))
+    DCHECK_EQ(ix.shape().dim_size(0), vals.shape().dim_size(0))
         << "indices and values rows (indexing dimension) must match.";
-    CHECK_EQ(order.size(), dims_) << "Order length must be SparseTensor rank.";
-    CHECK_EQ(shape.size(), dims_) << "Shape rank must be SparseTensor rank.";
+    DCHECK_EQ(order.size(), dims_) << "Order length must be SparseTensor rank.";
+    DCHECK_EQ(shape.size(), dims_) << "Shape rank must be SparseTensor rank.";
   }
 
   SparseTensor(const SparseTensor& other)
@@ -81,6 +139,16 @@ class SparseTensor {
     vals_ = other.vals_;
     shape_ = other.shape_;
     order_ = other.order_;
+    dims_ = other.dims_;
+    return *this;
+  }
+
+  SparseTensor& operator=(SparseTensor&& other) {
+    ix_ = std::move(other.ix_);
+    vals_ = std::move(other.vals_);
+    shape_ = std::move(other.shape_);
+    order_ = std::move(other.order_);
+    dims_ = std::move(other.dims_);
     return *this;
   }
 
@@ -126,11 +194,11 @@ class SparseTensor {
   //
   // See the README.md in this directory for more usage information.
   GroupIterable group(const VarDimArray& group_ix) const {
-    CHECK_LE(group_ix.size(), dims_);
+    DCHECK_LE(group_ix.size(), dims_);
     for (std::size_t di = 0; di < group_ix.size(); ++di) {
-      CHECK_GE(group_ix[di], 0) << "Group dimension out of range";
-      CHECK_LT(group_ix[di], dims_) << "Group dimension out of range";
-      CHECK_EQ(group_ix[di], order_[di])
+      DCHECK_GE(group_ix[di], 0) << "Group dimension out of range";
+      DCHECK_LT(group_ix[di], dims_) << "Group dimension out of range";
+      DCHECK_EQ(group_ix[di], order_[di])
           << "Group dimension does not match sorted order";
     }
     return GroupIterable(ix_, vals_, dims_, group_ix);
@@ -166,9 +234,16 @@ class SparseTensor {
   // isn't an integer multiple of split_dim, we add one extra dimension for
   // each slice.
   template <typename T>
+  static Status Split(const SparseTensor& tensor, const int split_dim,
+                      const int num_split, std::vector<SparseTensor>* result);
+
+  // DEPRECATED: use the form of Split() that takes an output pointer and
+  // returns a status instead.
+  template <typename T>
   static std::vector<SparseTensor> Split(const SparseTensor& tensor,
                                          const int split_dim,
-                                         const int num_split);
+                                         const int num_split,
+                                         Status* status = nullptr);
 
   // Slice() will slice the input SparseTensor into a SparseTensor based on
   // specified start and size. Both start and size are 1-D array with each
@@ -189,9 +264,18 @@ class SparseTensor {
   }
 
  private:
-  static int GetDimsFromIx(const Tensor& ix) {
-    CHECK(TensorShapeUtils::IsMatrix(ix.shape()))
-        << "indices must be a matrix, but got: " << ix.shape().DebugString();
+  static Status GetDimsFromIx(const Tensor& ix, int* result) {
+    if (!TensorShapeUtils::IsMatrix(ix.shape())) {
+      return Status(error::INVALID_ARGUMENT,
+                    strings::StrCat("indices must be a matrix, but got: ",
+                                    ix.shape().DebugString()));
+    }
+    *result = UnsafeGetDimsFromIx(ix);
+    return Status();
+  }
+
+  static int UnsafeGetDimsFromIx(const Tensor& ix) {
+    DCHECK(TensorShapeUtils::IsMatrix(ix.shape()));
     return ix.dim_size(1);
   }
 
@@ -251,8 +335,8 @@ class SparseTensor {
   // Helper for Split() that returns the slice index.
   static inline int GetSliceIndex(const int dim, const int split_size,
                                   const int residual) {
-    CHECK_GT(split_size, 0);
-    CHECK_GE(dim, 0);
+    DCHECK_GT(split_size, 0);
+    DCHECK_GE(dim, 0);
     if (residual == 0) return dim / split_size;
     const int offset = residual * (split_size + 1);
     if (dim < offset) {
@@ -265,8 +349,8 @@ class SparseTensor {
   // Helper for Split() that returns the dimension in the slice.
   static inline int GetDimensionInSlice(const int dim, const int split_size,
                                         const int residual) {
-    CHECK_GT(split_size, 0);
-    CHECK_GE(dim, 0);
+    DCHECK_GT(split_size, 0);
+    DCHECK_GE(dim, 0);
     if (residual == 0) return dim % split_size;
     const int offset = residual * (split_size + 1);
     if (dim < offset) {
@@ -279,8 +363,8 @@ class SparseTensor {
   // Helper for Split() that returns the shape given a slice index.
   static inline int GetSliceShape(const int slice_index, const int split_size,
                                   const int residual) {
-    CHECK_GT(split_size, 0);
-    CHECK_GE(slice_index, 0);
+    DCHECK_GT(split_size, 0);
+    DCHECK_GE(slice_index, 0);
     if (residual == 0) return split_size;
     if (slice_index < residual) {
       return split_size + 1;
@@ -293,7 +377,7 @@ class SparseTensor {
   Tensor vals_;
   ShapeArray shape_;
   ShapeArray order_;
-  const int dims_;
+  int dims_;
 };
 
 // This operation updates the indices and values Tensor rows, so it is
@@ -301,9 +385,9 @@ class SparseTensor {
 // temporary space.
 template <typename T>
 void SparseTensor::Reorder(const VarDimArray& order) {
-  CHECK_EQ(DataTypeToEnum<T>::v(), dtype())
+  DCHECK_EQ(DataTypeToEnum<T>::v(), dtype())
       << "Reorder requested with the wrong datatype";
-  CHECK_EQ(order.size(), dims_) << "Order length must be SparseTensor rank";
+  DCHECK_EQ(order.size(), dims_) << "Order length must be SparseTensor rank";
   auto ix_t = ix_.matrix<int64>();
   auto vals_t = vals_.vec<T>();
 
@@ -360,13 +444,13 @@ void SparseTensor::Reorder(const VarDimArray& order) {
 
 template <typename T>
 bool SparseTensor::ValidateAndInitializeToDense(Tensor* out, bool initialize) {
-  CHECK_EQ(DataTypeToEnum<T>::v(), dtype())
+  DCHECK_EQ(DataTypeToEnum<T>::v(), dtype())
       << "ToDense requested with the wrong datatype";
 
-  CHECK_EQ(out->shape().dims(), dims_)
+  DCHECK_EQ(out->shape().dims(), dims_)
       << "Incompatible dimensions between SparseTensor and output";
 
-  CHECK_EQ(out->dtype(), DataTypeToEnum<T>::v())
+  DCHECK_EQ(out->dtype(), DataTypeToEnum<T>::v())
       << "Output must be type: " << DataTypeToEnum<T>::v()
       << " but got: " << out->dtype();
 
@@ -422,9 +506,9 @@ bool SparseTensor::ToDense(Tensor* out, bool initialize) {
 template <typename T>
 SparseTensor SparseTensor::Concat(
     const gtl::ArraySlice<SparseTensor>& tensors) {
-  CHECK_GE(tensors.size(), size_t{1}) << "Cannot concat 0 SparseTensors";
+  DCHECK_GE(tensors.size(), size_t{1}) << "Cannot concat 0 SparseTensors";
   const int dims = tensors[0].dims_;
-  CHECK_GE(dims, 1) << "Cannot concat 0-dimensional SparseTensors";
+  DCHECK_GE(dims, 1) << "Cannot concat 0-dimensional SparseTensors";
   auto order_0 = tensors[0].order();
   const int primary_dim = order_0[0];
   ShapeArray final_order(order_0.begin(), order_0.end());
@@ -434,17 +518,17 @@ SparseTensor SparseTensor::Concat(
 
   bool fully_ordered = true;
   for (const SparseTensor& st : tensors) {
-    CHECK_EQ(st.dims_, dims) << "All SparseTensors must have the same rank.";
-    CHECK_EQ(DataTypeToEnum<T>::v(), st.dtype())
+    DCHECK_EQ(st.dims_, dims) << "All SparseTensors must have the same rank.";
+    DCHECK_EQ(DataTypeToEnum<T>::v(), st.dtype())
         << "Concat requested with the wrong data type";
-    CHECK_GE(st.order()[0], 0) << "SparseTensor must be ordered";
-    CHECK_EQ(st.order()[0], primary_dim)
+    DCHECK_GE(st.order()[0], 0) << "SparseTensor must be ordered";
+    DCHECK_EQ(st.order()[0], primary_dim)
         << "All SparseTensors' order[0] must match.  This is the concat dim.";
     if (st.order() != final_order) fully_ordered = false;
     const VarDimArray& st_shape = st.shape();
     for (int d = 0; d < dims - 1; ++d) {
       const int cdim = (d < primary_dim) ? d : d + 1;
-      CHECK_EQ(final_shape[cdim], st_shape[cdim])
+      DCHECK_EQ(final_shape[cdim], st_shape[cdim])
           << "All SparseTensors' shapes must match except on the concat dim.  "
           << "Concat dim: " << primary_dim
           << ", mismatched shape at dim: " << cdim
@@ -494,7 +578,8 @@ SparseTensor SparseTensor::Concat(
 template <typename T>
 std::vector<SparseTensor> SparseTensor::Split(const SparseTensor& input_tensor,
                                               const int split_dim,
-                                              const int num_split) {
+                                              const int num_split,
+                                              Status* status /* = nullptr */) {
   std::vector<Tensor> output_indices;
   std::vector<Tensor> output_values;
   std::vector<TensorShape> output_shapes;
@@ -514,12 +599,18 @@ std::vector<SparseTensor> SparseTensor::Split(const SparseTensor& input_tensor,
   const int split_dim_size = input_tensor.shape()[split_dim];
   const int split_size = split_dim_size / num_split;
 
-  CHECK(num_split > 0 && num_split <= split_dim_size) << "num_split must be in "
-                                                         "the interval (0, "
-                                                      << split_dim_size << "]";
-  CHECK(split_dim >= 0 && split_dim < num_dim) << "num_dim must be in "
-                                                  "the interval [0, "
-                                               << num_dim << ")";
+  if (!(num_split > 0 && num_split <= split_dim_size) && status != nullptr) {
+    *status = Status(error::INVALID_ARGUMENT,
+                     strings::StrCat("num_split must be in the interval (0, ",
+                                     split_dim_size, "]"));
+    return {};
+  }
+  if (!(split_dim >= 0 && split_dim < num_dim)) {
+    *status = Status(
+        error::INVALID_ARGUMENT,
+        strings::StrCat("num_dim must be in the interval [0, ", num_dim, ")"));
+    return {};
+  }
 
   const int residual = split_dim_size % num_split;
   for (int i = 0; i < input_tensor.indices().dim_size(0); ++i) {
@@ -559,13 +650,28 @@ std::vector<SparseTensor> SparseTensor::Split(const SparseTensor& input_tensor,
   std::vector<SparseTensor> output_tensors;
   output_tensors.reserve(num_split);
   for (int i = 0; i < num_split; ++i) {
-    output_tensors.emplace_back(output_indices[i], output_values[i],
-                                output_shapes[i]);
+    SparseTensor tensor;
+    Status create_status =
+        Create(output_indices[i], output_values[i], output_shapes[i], &tensor);
+    if (!create_status.ok() && status != nullptr) {
+      *status = create_status;
+      return {};
+    }
+    output_tensors.push_back(std::move(tensor));
   }
   return output_tensors;
 }
 
 template <typename T>
+Status SparseTensor::Split(const SparseTensor& input_tensor,
+                           const int split_dim, const int num_split,
+                           std::vector<SparseTensor>* result) {
+  Status status;
+  *result = Split<T>(input_tensor, split_dim, num_split, &status);
+  return status;
+}
+
+template <typename T>
 SparseTensor SparseTensor::Slice(const SparseTensor& input_tensor,
                                  const gtl::ArraySlice<int64>& start,
                                  const gtl::ArraySlice<int64>& size) {
@@ -643,4 +749,4 @@ SparseTensor SparseTensor::Slice(const SparseTensor& input_tensor,
 }  // namespace sparse
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_UTIL_SPARSE_SPARSE_TENSOR_H_
+#endif  // TENSORFLOW_CORE_UTIL_SPARSE_SPARSE_TENSOR_H_
diff --git a/tensorflow/core/util/sparse/sparse_tensor_test.cc b/tensorflow/core/util/sparse/sparse_tensor_test.cc
index 85de032085..5578e42625 100644
--- a/tensorflow/core/util/sparse/sparse_tensor_test.cc
+++ b/tensorflow/core/util/sparse/sparse_tensor_test.cc
@@ -94,9 +94,12 @@ TEST(SparseTensorTest, SparseTensorInvalidIndicesType) {
   const int NDIM = 3;
   Tensor ix(DT_INT32, TensorShape({N, NDIM}));
   Tensor vals(DT_STRING, TensorShape({N}));
+  SparseTensor result;
 
-  EXPECT_DEATH(SparseTensor(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2}),
-               "indices must be type int64");
+  EXPECT_EQ(SparseTensor::Create(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2},
+                                 &result)
+                .code(),
+            error::INVALID_ARGUMENT);
 }
 
 TEST(SparseTensorTest, SparseTensorInvalidIndicesShape) {
@@ -104,9 +107,12 @@ TEST(SparseTensorTest, SparseTensorInvalidIndicesShape) {
   const int NDIM = 3;
   Tensor ix(DT_INT64, TensorShape({N, NDIM, 1}));
   Tensor vals(DT_STRING, TensorShape({N}));
+  SparseTensor result;
 
-  EXPECT_DEATH(SparseTensor(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2}),
-               "indices must be a matrix");
+  EXPECT_EQ(SparseTensor::Create(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2},
+                                 &result)
+                .code(),
+            error::INVALID_ARGUMENT);
 }
 
 TEST(SparseTensorTest, SparseTensorInvalidValues) {
@@ -114,9 +120,12 @@ TEST(SparseTensorTest, SparseTensorInvalidValues) {
   const int NDIM = 3;
   Tensor ix(DT_INT64, TensorShape({N, NDIM}));
   Tensor vals(DT_STRING, TensorShape({N, 1}));
+  SparseTensor result;
 
-  EXPECT_DEATH(SparseTensor(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2}),
-               "vals must be a vec");
+  EXPECT_EQ(SparseTensor::Create(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2},
+                                 &result)
+                .code(),
+            error::INVALID_ARGUMENT);
 }
 
 TEST(SparseTensorTest, SparseTensorInvalidN) {
@@ -124,9 +133,12 @@ TEST(SparseTensorTest, SparseTensorInvalidN) {
   const int NDIM = 3;
   Tensor ix(DT_INT64, TensorShape({N, NDIM}));
   Tensor vals(DT_STRING, TensorShape({N - 1}));
+  SparseTensor result;
 
-  EXPECT_DEATH(SparseTensor(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2}),
-               "indices and values rows .* must match");
+  EXPECT_EQ(SparseTensor::Create(ix, vals, TensorShape({10, 10, 10}), {0, 1, 2},
+                                 &result)
+                .code(),
+            error::INVALID_ARGUMENT);
 }
 
 TEST(SparseTensorTest, SparseTensorInvalidOrder) {
@@ -134,18 +146,24 @@ TEST(SparseTensorTest, SparseTensorInvalidOrder) {
   const int NDIM = 3;
   Tensor ix(DT_INT64, TensorShape({N, NDIM}));
   Tensor vals(DT_STRING, TensorShape({N}));
+  SparseTensor result;
 
-  EXPECT_DEATH(SparseTensor(ix, vals, TensorShape({10, 10, 10}), {0, 1}),
-               "Order length must be SparseTensor rank");
+  EXPECT_EQ(
+      SparseTensor::Create(ix, vals, TensorShape({10, 10, 10}), {0, 1}, &result)
+          .code(),
+      error::INVALID_ARGUMENT);
 }
 TEST(SparseTensorTest, SparseTensorInvalidShape) {
   int N = 5;
   const int NDIM = 3;
   Tensor ix(DT_INT64, TensorShape({N, NDIM}));
   Tensor vals(DT_STRING, TensorShape({N}));
+  SparseTensor result;
 
-  EXPECT_DEATH(SparseTensor(ix, vals, TensorShape({10, 10}), {0, 1, 2}),
-               "Shape rank must be SparseTensor rank");
+  EXPECT_EQ(
+      SparseTensor::Create(ix, vals, TensorShape({10, 10}), {0, 1, 2}, &result)
+          .code(),
+      error::INVALID_ARGUMENT);
 }
 
 TEST(SparseTensorTest, SparseTensorConstruction) {
@@ -169,7 +187,8 @@ TEST(SparseTensorTest, SparseTensorConstruction) {
 
   TensorShape shape({10, 10, 10});
   std::vector<int64> order{0, 1, 2};
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
   Status st_indices_valid = st.IndicesValid();
   EXPECT_FALSE(st_indices_valid.ok());
   EXPECT_EQ("indices[2] = [2,0,0] is out of order",
@@ -210,7 +229,8 @@ TEST(SparseTensorTest, EmptySparseTensorAllowed) {
 
   std::vector<int64> shape{10, 10, 10};
   std::vector<int64> order{0, 1, 2};
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
   TF_EXPECT_OK(st.IndicesValid());
   EXPECT_EQ(st.order(), order);
 
@@ -227,7 +247,8 @@ TEST(SparseTensorTest, SortingWorksCorrectly) {
   Tensor ix(DT_INT64, TensorShape({N, NDIM}));
   Tensor vals(DT_STRING, TensorShape({N}));
   TensorShape shape({1000, 1000, 1000, 1000});
-  SparseTensor st(ix, vals, shape);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, &st));
 
   auto ix_t = ix.matrix<int64>();
 
@@ -266,7 +287,8 @@ TEST(SparseTensorTest, ValidateIndicesFindsInvalid) {
 
   TensorShape shape({10, 10, 10});
   std::vector<int64> order{0, 1, 2};
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
 
   st.Reorder<string>(order);
   Status st_indices_valid = st.IndicesValid();
@@ -302,7 +324,8 @@ TEST(SparseTensorTest, SparseTensorCheckBoundaries) {
   TensorShape shape({10, 10, 10});
   std::vector<int64> order{0, 1, 2};
 
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
   EXPECT_FALSE(st.IndicesValid().ok());
 
   st.Reorder<string>(order);
@@ -351,7 +374,8 @@ TEST(SparseTensorTest, SparseTensorToDenseTensor) {
 
   TensorShape shape({4, 4, 5});
   std::vector<int64> order{0, 1, 2};
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
 
   Tensor dense(DT_STRING, TensorShape({4, 4, 5}));
   st.ToDense<string>(&dense);
@@ -390,7 +414,8 @@ TEST(SparseTensorTest, SparseTensorToLargerDenseTensor) {
 
   TensorShape shape({4, 4, 5});
   std::vector<int64> order{0, 1, 2};
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
 
   Tensor dense(DT_STRING, TensorShape({10, 10, 10}));
   st.ToDense<string>(&dense);
@@ -433,7 +458,8 @@ TEST(SparseTensorTest, SparseTensorGroup) {
   TensorShape shape({10, 10, 10});
   std::vector<int64> order{0, 1, 2};
 
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
   st.Reorder<int32>(order);
 
   std::vector<std::vector<int64> > groups;
@@ -521,7 +547,8 @@ TEST(SparseTensorTest, Concat) {
   TensorShape shape({10, 10, 10});
   std::vector<int64> order{0, 1, 2};
 
-  SparseTensor st(ix, vals, shape, order);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
   EXPECT_FALSE(st.IndicesValid().ok());
   st.Reorder<string>(order);
   TF_EXPECT_OK(st.IndicesValid());
@@ -551,7 +578,9 @@ TEST(SparseTensorTest, Concat) {
 
   // Concat works if non-primary ix is out of order, but output order
   // is not defined
-  SparseTensor st_ooo(ix, vals, shape, {0, 2, 1});  // non-primary ix OOO
+  SparseTensor st_ooo;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, {0, 2, 1},
+                                    &st_ooo));  // non-primary ix OOO
   SparseTensor conc_ooo = SparseTensor::Concat<string>({st, st, st, st_ooo});
   std::vector<int64> expected_ooo{-1, -1, -1};
   EXPECT_EQ(conc_ooo.order(), expected_ooo);
@@ -584,9 +613,11 @@ TEST(SparseTensorTest, Split) {
   vals.vec<int64>()(2) = 3;
   vals.vec<int64>()(3) = 4;
 
-  SparseTensor st(ids, vals, TensorShape({4, 3}));
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ids, vals, TensorShape({4, 3}), &st));
 
-  std::vector<SparseTensor> st_list = SparseTensor::Split<int64>(st, 0, 2);
+  std::vector<SparseTensor> st_list;
+  TF_ASSERT_OK(SparseTensor::Split<int64>(st, 0, 2, &st_list));
 
   EXPECT_EQ(st_list.size(), 2);
   auto expected_shape = gtl::InlinedVector<int64, 8>{2, 3};
@@ -633,7 +664,8 @@ TEST(SparseTensorTest, Slice) {
   vals.vec<int64>()(2) = 3;
   vals.vec<int64>()(3) = 4;
 
-  SparseTensor st(ids, vals, TensorShape({4, 3}));
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ids, vals, TensorShape({4, 3}), &st));
 
   std::vector<int64> start(2, 0);
   std::vector<int64> size(2);
@@ -662,7 +694,8 @@ TEST(SparseTensorTest, Dim0SparseTensorToDenseTensor) {
   vals.scalar<int32>()() = 5;
 
   TensorShape shape({});
-  SparseTensor st(ix, vals, shape);
+  SparseTensor st;
+  TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, &st));
 
   Tensor dense(DT_INT32, TensorShape({}));
   st.ToDense<int32>(&dense);
@@ -699,7 +732,8 @@ static void BM_SparseReorderFloat(int iters, int N32, int NDIM32) {
         ix_t(i, d) = rnd.Rand64() % 1000;
       }
     }
-    SparseTensor st(ix, vals, shape, order);
+    SparseTensor st;
+    TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
 
     testing::StartTiming();
     st.Reorder<float>(reorder);
@@ -740,7 +774,8 @@ static void BM_SparseReorderString(int iters, int N32, int NDIM32) {
         ix_t(i, d) = rnd.Rand64() % 1000;
       }
     }
-    SparseTensor st(ix, vals, shape, order);
+    SparseTensor st;
+    TF_ASSERT_OK(SparseTensor::Create(ix, vals, shape, order, &st));
 
     testing::StartTiming();
     st.Reorder<string>(reorder);
diff --git a/tensorflow/core/util/stat_summarizer.cc b/tensorflow/core/util/stat_summarizer.cc
index a5c1fda102..2117042034 100644
--- a/tensorflow/core/util/stat_summarizer.cc
+++ b/tensorflow/core/util/stat_summarizer.cc
@@ -133,7 +133,6 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
 
   int64 first_node_start_us =
       step_stats.dev_stats(0).node_stats(0).all_start_micros();
-  std::map<std::string, Detail> details;
 
   int node_num = 0;
   for (const auto& ds : step_stats.dev_stats()) {
@@ -177,22 +176,15 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
       ++node_num;
       const int64 curr_time = ns.all_end_rel_micros();
       curr_total_us += curr_time;
-      auto result = details.emplace(name, Detail());
       auto output_result =
           outputs_.emplace(name, std::vector<TensorDescription>());
       std::vector<TensorDescription>* outputs = &(output_result.first->second);
-      Detail* detail = &(result.first->second);
 
-      detail->start_us.UpdateStat(ns.all_start_micros() - first_node_start_us);
-      detail->rel_end_us.UpdateStat(curr_time);
+      int64_t start_us = (ns.all_start_micros() - first_node_start_us);
+      int64_t rel_end_us = curr_time;
 
       // If this is the first pass, initialize some values.
-      if (result.second) {
-        detail->name = name;
-        detail->type = op_type;
-
-        detail->run_order = node_num;
-
+      if (output_result.second) {
         outputs->resize(ns.output_size());
         for (const auto& output : ns.output()) {
           const int32 slot = output.slot();
@@ -202,7 +194,6 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
           }
           (*outputs)[slot] = output.tensor_description();
         }
-        detail->times_called = 0;
       }
 
       int64 curr_node_mem = 0;
@@ -210,11 +201,10 @@ void StatSummarizer::ProcessStepStats(const StepStats& step_stats) {
         const int64 mem_usage = mem.total_bytes();
         curr_node_mem += mem_usage;
       }
-      detail->mem_used.UpdateStat(curr_node_mem);
-      mem_total += curr_node_mem;
+      stats_calculator_->AddNodeStats(name, op_type, node_num, start_us,
+                                      rel_end_us, curr_node_mem);
 
-      ++detail->times_called;
-      stats_calculator_->UpdateDetails(details);
+      mem_total += curr_node_mem;
 
       Validate(outputs, ns);
     }
diff --git a/tensorflow/core/util/stats_calculator.cc b/tensorflow/core/util/stats_calculator.cc
index c4befbdb84..eb07754650 100644
--- a/tensorflow/core/util/stats_calculator.cc
+++ b/tensorflow/core/util/stats_calculator.cc
@@ -272,9 +272,24 @@ std::string StatsCalculator::GetOutputString() const {
   return stream.str();
 }
 
-void StatsCalculator::UpdateDetails(
-    const std::map<std::string, Detail>& details) {
-  details_.insert(details.begin(), details.end());
+void StatsCalculator::AddNodeStats(const std::string& name,
+                                   const std::string& type, int64_t run_order,
+                                   int64_t start_us, int64_t rel_end_us,
+                                   int64_t mem_used) {
+  Detail* detail = nullptr;
+  if (details_.find(name) == details_.end()) {
+    details_.insert({name, {}});
+    detail = &details_.at(name);
+    detail->type = type;
+    detail->name = name;
+    detail->run_order = run_order;
+  } else {
+    detail = &details_.at(name);
+  }
+  detail->start_us.UpdateStat(start_us);
+  detail->rel_end_us.UpdateStat(rel_end_us);
+  detail->mem_used.UpdateStat(mem_used);
+  detail->times_called++;
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/util/stats_calculator.h b/tensorflow/core/util/stats_calculator.h
index 39cef816f1..e191737bb2 100644
--- a/tensorflow/core/util/stats_calculator.h
+++ b/tensorflow/core/util/stats_calculator.h
@@ -163,7 +163,10 @@ class StatsCalculator {
   };
 
   const std::map<std::string, Detail>& GetDetails() const { return details_; }
-  void UpdateDetails(const std::map<std::string, Detail>& details);
+
+  void AddNodeStats(const std::string& name, const std::string& type,
+                    int64_t run_order, int64_t start_us, int64_t rel_end_us,
+                    int64_t mem_used);
 
  private:
   void OrderNodesByMetric(SortingMetric sorting_metric,
diff --git a/tensorflow/core/util/stats_calculator_test.cc b/tensorflow/core/util/stats_calculator_test.cc
new file mode 100644
index 0000000000..00d7bfc2f9
--- /dev/null
+++ b/tensorflow/core/util/stats_calculator_test.cc
@@ -0,0 +1,76 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/util/stats_calculator.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace {
+
+using Detail = StatsCalculator::Detail;
+
+TEST(StatsCalculatorTest, TotalTimeMs) {
+  auto options = StatSummarizerOptions();
+  StatsCalculator calc(options);
+
+  EXPECT_EQ(0, calc.num_runs());
+  calc.UpdateRunTotalUs(1);
+
+  EXPECT_EQ(1, calc.num_runs());
+  calc.UpdateRunTotalUs(2);
+
+  EXPECT_EQ(2, calc.num_runs());
+  auto run_time_us = calc.run_total_us();
+  EXPECT_EQ(1, run_time_us.min());
+  EXPECT_FLOAT_EQ(1.5, run_time_us.avg());
+}
+
+TEST(StatsCalculatorTest, AddNodeStatsUpdate) {
+  auto options = StatSummarizerOptions();
+  StatsCalculator calc(options);
+  EXPECT_TRUE(calc.GetDetails().empty());
+
+  const int64_t node1_run_order = 1;
+  const int64_t run1_start_us = 1;
+  const int64_t run1_end_us = 2;
+  const int64_t run1_mem_used = 45;
+  calc.AddNodeStats("node1", "type_1", node1_run_order, run1_start_us,
+                    run1_end_us, run1_mem_used);
+  ASSERT_EQ(1, calc.GetDetails().size());
+  const Detail& detail = calc.GetDetails().at("node1");
+  EXPECT_EQ(1, detail.times_called);
+  EXPECT_EQ("node1", detail.name);
+  EXPECT_EQ("type_1", detail.type);
+  EXPECT_EQ(node1_run_order, detail.run_order);
+
+  const int64_t run2_start_us = 3;
+  const int64_t run2_end_us = 5;
+  const int64_t run2_mem_used = 145;
+  calc.AddNodeStats("node1", "type_1", node1_run_order, run2_start_us,
+                    run2_end_us, run2_mem_used);
+  EXPECT_EQ(1, calc.GetDetails().size());
+
+  EXPECT_EQ(2, detail.times_called);
+  EXPECT_EQ("node1", detail.name);
+  EXPECT_EQ("type_1", detail.type);
+  EXPECT_EQ(node1_run_order, detail.run_order);
+
+  EXPECT_EQ(run1_start_us + run2_start_us, detail.start_us.sum());
+  EXPECT_EQ(run1_end_us + run2_end_us, detail.rel_end_us.sum());
+  EXPECT_EQ(run1_mem_used + run2_mem_used, detail.mem_used.sum());
+}
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/util/tensor_format.cc b/tensorflow/core/util/tensor_format.cc
index 33ab87aa78..a5f7ecf0d1 100644
--- a/tensorflow/core/util/tensor_format.cc
+++ b/tensorflow/core/util/tensor_format.cc
@@ -18,7 +18,7 @@ limitations under the License.
 namespace tensorflow {
 
 string GetConvnetDataFormatAttrString() {
-  return "data_format: { 'NHWC', 'NCHW', 'HWNC', 'HWCN' } = 'NHWC' ";
+  return "data_format: { 'NHWC', 'NCHW' } = 'NHWC' ";
 }
 
 string GetConvnet3dDataFormatAttrString() {
diff --git a/tensorflow/docs_src/deploy/distributed.md b/tensorflow/docs_src/deploy/distributed.md
index 8e2c818e39..fc3a60603f 100644
--- a/tensorflow/docs_src/deploy/distributed.md
+++ b/tensorflow/docs_src/deploy/distributed.md
@@ -314,7 +314,7 @@ serve multiple clients.
 
 **Cluster**
 
-A TensorFlow cluster comprises a one or more "jobs", each divided into lists of
+A TensorFlow cluster comprises one or more "jobs", each divided into lists of
 one or more "tasks". A cluster is typically dedicated to a particular high-level
 objective, such as training a neural network, using many machines in parallel. A
 cluster is defined by
diff --git a/tensorflow/docs_src/deploy/s3.md b/tensorflow/docs_src/deploy/s3.md
index 9ef9674338..7028249e94 100644
--- a/tensorflow/docs_src/deploy/s3.md
+++ b/tensorflow/docs_src/deploy/s3.md
@@ -90,4 +90,4 @@ S3 was invented by Amazon, but the S3 API has spread in popularity and has sever
 
 * [Amazon S3](https://aws.amazon.com/s3/)
 * [Google Storage](https://cloud.google.com/storage/docs/interoperability)
-* [Minio](https://www.minio.io/kubernetes.html)(Standalone mode only)
+* [Minio](https://www.minio.io/kubernetes.html)
diff --git a/tensorflow/docs_src/extend/index.md b/tensorflow/docs_src/extend/index.md
index 1ab0340ad9..d48340a777 100644
--- a/tensorflow/docs_src/extend/index.md
+++ b/tensorflow/docs_src/extend/index.md
@@ -17,7 +17,8 @@ TensorFlow:
 
 Python is currently the only language supported by TensorFlow's API stability
 promises. However, TensorFlow also provides functionality in C++, Go, Java and
-[JavaScript](https://js.tensorflow.org),
+[JavaScript](https://js.tensorflow.org) (incuding
+[Node.js](https://github.com/tensorflow/tfjs-node)),
 plus community support for [Haskell](https://github.com/tensorflow/haskell) and
 [Rust](https://github.com/tensorflow/rust). If you'd like to create or
 develop TensorFlow features in a language other than these languages, read the
diff --git a/tensorflow/docs_src/extend/new_data_formats.md b/tensorflow/docs_src/extend/new_data_formats.md
index d1d1f69766..abbf47910e 100644
--- a/tensorflow/docs_src/extend/new_data_formats.md
+++ b/tensorflow/docs_src/extend/new_data_formats.md
@@ -77,18 +77,24 @@ can be used as a starting point for your implementation:
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/shape_inference.h"
 
-namespace tensorflow {
+namespace myproject {
 namespace {
 
-class MyReaderDatasetOp : public DatasetOpKernel {
+using ::tensorflow::DT_STRING;
+using ::tensorflow::PartialTensorShape;
+using ::tensorflow::Status;
+
+class MyReaderDatasetOp : public tensorflow::DatasetOpKernel {
  public:
 
-  MyReaderDatasetOp(OpKernelConstruction* ctx) : DatasetOpKernel(ctx) {
+  MyReaderDatasetOp(tensorflow::OpKernelConstruction* ctx)
+      : DatasetOpKernel(ctx) {
     // Parse and validate any attrs that define the dataset using
     // `ctx->GetAttr()`, and store them in member variables.
   }
 
-  void MakeDataset(OpKernelContext* ctx, DatasetBase** output) override {
+  void MakeDataset(tensorflow::OpKernelContext* ctx,
+                   tensorflow::DatasetBase** output) override {
     // Parse and validate any input tensors 0that define the dataset using
     // `ctx->input()` or the utility function
     // `ParseScalarArgument<T>(ctx, &arg)`.
@@ -99,14 +105,14 @@ class MyReaderDatasetOp : public DatasetOpKernel {
   }
 
  private:
-  class Dataset : public GraphDatasetBase {
+  class Dataset : public tensorflow::GraphDatasetBase {
    public:
-    Dataset(OpKernelContext* ctx) : GraphDatasetBase(ctx) {}
+    Dataset(tensorflow::OpKernelContext* ctx) : GraphDatasetBase(ctx) {}
 
-    std::unique_ptr<IteratorBase> MakeIteratorInternal(
+    std::unique_ptr<tensorflow::IteratorBase> MakeIteratorInternal(
         const string& prefix) const override {
-      return std::unique_ptr<IteratorBase>(
-          new Iterator({this, strings::StrCat(prefix, "::MyReader")}));
+      return std::unique_ptr<tensorflow::IteratorBase>(new Iterator(
+          {this, tensorflow::strings::StrCat(prefix, "::MyReader")}));
     }
 
     // Record structure: Each record is represented by a scalar string tensor.
@@ -114,8 +120,8 @@ class MyReaderDatasetOp : public DatasetOpKernel {
     // Dataset elements can have a fixed number of components of different
     // types and shapes; replace the following two methods to customize this
     // aspect of the dataset.
-    const DataTypeVector& output_dtypes() const override {
-      static DataTypeVector* dtypes = new DataTypeVector({DT_STRING});
+    const tensorflow::DataTypeVector& output_dtypes() const override {
+      static auto* const dtypes = new tensorflow::DataTypeVector({DT_STRING});
       return *dtypes;
     }
     const std::vector<PartialTensorShape>& output_shapes() const override {
@@ -132,16 +138,16 @@ class MyReaderDatasetOp : public DatasetOpKernel {
     // Implement this method if you want to be able to save and restore
     // instances of this dataset (and any iterators over it).
     Status AsGraphDefInternal(DatasetGraphDefBuilder* b,
-                              Node** output) const override {
+                              tensorflow::Node** output) const override {
       // Construct nodes to represent any of the input tensors from this
       // object's member variables using `b->AddScalar()` and `b->AddVector()`.
-      std::vector<Node*> input_tensors;
+      std::vector<tensorflow::Node*> input_tensors;
       TF_RETURN_IF_ERROR(b->AddDataset(this, input_tensors, output));
       return Status::OK();
     }
 
    private:
-    class Iterator : public DatasetIterator<Dataset> {
+    class Iterator : public tensorflow::DatasetIterator<Dataset> {
      public:
       explicit Iterator(const Params& params)
           : DatasetIterator<Dataset>(params), i_(0) {}
@@ -158,15 +164,15 @@ class MyReaderDatasetOp : public DatasetOpKernel {
       //    return `Status::OK()`.
       // 3. If an error occurs, return an error status using one of the helper
       //    functions from "tensorflow/core/lib/core/errors.h".
-      Status GetNextInternal(IteratorContext* ctx,
-                             std::vector<Tensor>* out_tensors,
+      Status GetNextInternal(tensorflow::IteratorContext* ctx,
+                             std::vector<tensorflow::Tensor>* out_tensors,
                              bool* end_of_sequence) override {
         // NOTE: `GetNextInternal()` may be called concurrently, so it is
         // recommended that you protect the iterator state with a mutex.
-        mutex_lock l(mu_);
+        tensorflow::mutex_lock l(mu_);
         if (i_ < 10) {
           // Create a scalar string tensor and add it to the output.
-          Tensor record_tensor(ctx->allocator({}), DT_STRING, {});
+          tensorflow::Tensor record_tensor(ctx->allocator({}), DT_STRING, {});
           record_tensor.scalar<string>()() = "MyReader!";
           out_tensors->emplace_back(std::move(record_tensor));
           ++i_;
@@ -183,20 +189,20 @@ class MyReaderDatasetOp : public DatasetOpKernel {
       //
       // Implement these two methods if you want to be able to save and restore
       // instances of this iterator.
-      Status SaveInternal(IteratorStateWriter* writer) override {
-        mutex_lock l(mu_);
+      Status SaveInternal(tensorflow::IteratorStateWriter* writer) override {
+        tensorflow::mutex_lock l(mu_);
         TF_RETURN_IF_ERROR(writer->WriteScalar(full_name("i"), i_));
         return Status::OK();
       }
-      Status RestoreInternal(IteratorContext* ctx,
-                             IteratorStateReader* reader) override {
-        mutex_lock l(mu_);
+      Status RestoreInternal(tensorflow::IteratorContext* ctx,
+                             tensorflow::IteratorStateReader* reader) override {
+        tensorflow::mutex_lock l(mu_);
         TF_RETURN_IF_ERROR(reader->ReadScalar(full_name("i"), &i_));
         return Status::OK();
       }
 
      private:
-      mutex mu_;
+      tensorflow::mutex mu_;
       int64 i_ GUARDED_BY(mu_);
     };
   };
@@ -211,14 +217,14 @@ class MyReaderDatasetOp : public DatasetOpKernel {
 REGISTER_OP("MyReaderDataset")
     .Output("handle: variant")
     .SetIsStateful()
-    .SetShapeFn(shape_inference::ScalarShape);
+    .SetShapeFn(tensorflow::shape_inference::ScalarShape);
 
 // Register the kernel implementation for MyReaderDataset.
-REGISTER_KERNEL_BUILDER(Name("MyReaderDataset").Device(DEVICE_CPU),
+REGISTER_KERNEL_BUILDER(Name("MyReaderDataset").Device(tensorflow::DEVICE_CPU),
                         MyReaderDatasetOp);
 
 }  // namespace
-}  // namespace tensorflow
+}  // namespace myproject
 ```
 
 The last step is to build the C++ code and add a Python wrapper. The easiest way
diff --git a/tensorflow/docs_src/get_started/eager.md b/tensorflow/docs_src/get_started/eager.md
deleted file mode 100644
index ddf239485a..0000000000
--- a/tensorflow/docs_src/get_started/eager.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Custom Training Walkthrough
-
-[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/r1.9.0/samples/core/get_started/eager.ipynb)
diff --git a/tensorflow/docs_src/get_started/index.md b/tensorflow/docs_src/get_started/index.md
deleted file mode 100644
index bd2a80d9ef..0000000000
--- a/tensorflow/docs_src/get_started/index.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# Get Started
-
-If you are new to machine learning, we recommend taking the following online
-course prior to diving into TensorFlow documentation:
-
-  * [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/),
-    which introduces machine learning concepts and encourages experimentation
-    with existing TensorFlow code.
-
-TensorFlow is a tool for machine learning. While it contains a wide range of
-functionality, TensorFlow is mainly designed for deep neural network models.
-
-The easiest way to get started with TensorFlow is by using Eager Execution.
-
-  * @{$get_started/eager}, is for anyone new to machine learning or TensorFlow.
-
-TensorFlow provides many APIs. The remainder of this section focuses on the
-Estimator API which provide scalable, high-performance models. See the
-@{$estimators} guide.
-
-For more advanced users:
-
-  * The @{$low_level_intro$Low Level Introduction} demonstrates how to use
-    TensorFlow outside of the Estimator framework, for debugging and
-    experimentation.
-  * The @{$guide$Programmer's Guide} details major
-    TensorFlow components.
-  * The @{$tutorials$Tutorials} provide walkthroughs of a variety of
-    TensorFlow models.
diff --git a/tensorflow/docs_src/get_started/leftnav_files b/tensorflow/docs_src/get_started/leftnav_files
deleted file mode 100644
index 99d2b2c3e1..0000000000
--- a/tensorflow/docs_src/get_started/leftnav_files
+++ /dev/null
@@ -1,10 +0,0 @@
-### Learn and use ML
-basic_classification.md: Basic classification
-basic_text_classification.md: Text classification
-basic_regression.md: Regression
-overfit_and_underfit.md
-save_and_restore_models.md
-next_steps.md
-
-### Research and experimentation
-eager.md
diff --git a/tensorflow/docs_src/guide/autograph.md b/tensorflow/docs_src/guide/autograph.md
new file mode 100644
index 0000000000..823e1c6d6b
--- /dev/null
+++ b/tensorflow/docs_src/guide/autograph.md
@@ -0,0 +1,3 @@
+# AutoGraph: Easy control flow for graphs
+
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/guide/autograph.ipynb)
diff --git a/tensorflow/docs_src/guide/custom_estimators.md b/tensorflow/docs_src/guide/custom_estimators.md
index a63e2bafb3..6e4ef2e0f2 100644
--- a/tensorflow/docs_src/guide/custom_estimators.md
+++ b/tensorflow/docs_src/guide/custom_estimators.md
@@ -149,7 +149,7 @@ model. This configuration step is similar to how we configured the @{tf.estimato
 
 ```python
 classifier = tf.estimator.Estimator(
-    model_fn=my_model,
+    model_fn=my_model_fn,
     params={
         'feature_columns': my_feature_columns,
         # Two hidden layers of 10 nodes each.
@@ -474,7 +474,7 @@ Instantiate the custom Estimator through the Estimator base class as follows:
 ```python
     # Build 2 hidden layer DNN with 10, 10 units respectively.
     classifier = tf.estimator.Estimator(
-        model_fn=my_model,
+        model_fn=my_model_fn,
         params={
             'feature_columns': my_feature_columns,
             # Two hidden layers of 10 nodes each.
diff --git a/tensorflow/docs_src/guide/datasets_for_estimators.md b/tensorflow/docs_src/guide/datasets_for_estimators.md
index b04af78cd8..b55a5731a4 100644
--- a/tensorflow/docs_src/guide/datasets_for_estimators.md
+++ b/tensorflow/docs_src/guide/datasets_for_estimators.md
@@ -76,9 +76,9 @@ Let's walk through the `train_input_fn()`.
 The function starts by using the @{tf.data.Dataset.from_tensor_slices} function
 to create a @{tf.data.Dataset} representing slices of the array. The array is
 sliced across the first dimension. For example, an array containing the
-@{$tutorials/layers$mnist training data} has a shape of `(60000, 28, 28)`.
-Passing this to `from_tensor_slices` returns a `Dataset` object containing
-60000 slices, each one a 28x28 image.
+MNIST training data has a shape of `(60000, 28, 28)`. Passing this to
+`from_tensor_slices` returns a `Dataset` object containing 60000 slices, each one
+a 28x28 image.
 
 The code that returns this `Dataset` is as follows:
 
diff --git a/tensorflow/docs_src/guide/debugger.md b/tensorflow/docs_src/guide/debugger.md
index dc4db58857..f0e465214e 100644
--- a/tensorflow/docs_src/guide/debugger.md
+++ b/tensorflow/docs_src/guide/debugger.md
@@ -463,7 +463,6 @@ predict_results = classifier.predict(predict_input_fn, hooks=hooks)
 ```
 
 [debug_tflearn_iris.py](https://www.tensorflow.org/code/tensorflow/python/debug/examples/debug_tflearn_iris.py),
-based on [tf-learn's iris tutorial](https://www.tensorflow.org/versions/r1.8/get_started/tflearn),
 contains a full example of how to use the tfdbg with `Estimator`s.
 To run this example, do:
 
@@ -781,7 +780,7 @@ sess.run(b)
 ``` python
 import numpy as np
 
-a = tf.Variable(np.ones[10], name="a")
+a = tf.Variable(np.ones(10), name="a")
 b = tf.add(a, a, name="b")
 sess = tf.Session()
 sess.run(tf.global_variables_initializer())
diff --git a/tensorflow/docs_src/guide/eager.md b/tensorflow/docs_src/guide/eager.md
index babdb1db09..4954e9a615 100644
--- a/tensorflow/docs_src/guide/eager.md
+++ b/tensorflow/docs_src/guide/eager.md
@@ -224,7 +224,7 @@ the tape backwards and then discard. A particular `tf.GradientTape` can only
 compute one gradient; subsequent calls throw a runtime error.
 
 ```py
-w = tfe.Variable([[1.0]])
+w = tf.Variable([[1.0]])
 with tf.GradientTape() as tape:
   loss = w * w
 
@@ -259,8 +259,8 @@ def grad(weights, biases):
 train_steps = 200
 learning_rate = 0.01
 # Start with arbitrary values for W and B on the same batch of data
-W = tfe.Variable(5.)
-B = tfe.Variable(10.)
+W = tf.Variable(5.)
+B = tf.Variable(10.)
 
 print("Initial loss: {:.3f}".format(loss(W, B)))
 
@@ -315,9 +315,8 @@ for (batch, (images, labels)) in enumerate(dataset):
 
 
 The following example creates a multi-layer model that classifies the standard
-[MNIST handwritten digits](https://www.tensorflow.org/tutorials/layers). It
-demonstrates the optimizer and layer APIs to build trainable graphs in an eager
-execution environment.
+MNIST handwritten digits. It demonstrates the optimizer and layer APIs to build
+trainable graphs in an eager execution environment.
 
 ### Train a model
 
@@ -407,11 +406,11 @@ with tf.device("/gpu:0"):
 
 ### Variables and optimizers
 
-`tfe.Variable` objects store mutable `tf.Tensor` values accessed during
+`tf.Variable` objects store mutable `tf.Tensor` values accessed during
 training to make automatic differentiation easier. The parameters of a model can
 be encapsulated in classes as variables.
 
-Better encapsulate model parameters by using `tfe.Variable` with
+Better encapsulate model parameters by using `tf.Variable` with
 `tf.GradientTape`. For example, the automatic differentiation example above
 can be rewritten:
 
@@ -419,9 +418,9 @@ can be rewritten:
 class Model(tf.keras.Model):
   def __init__(self):
     super(Model, self).__init__()
-    self.W = tfe.Variable(5., name='weight')
-    self.B = tfe.Variable(10., name='bias')
-  def predict(self, inputs):
+    self.W = tf.Variable(5., name='weight')
+    self.B = tf.Variable(10., name='bias')
+  def call(self, inputs):
     return inputs * self.W + self.B
 
 # A toy dataset of points around 3 * x + 2
@@ -432,7 +431,7 @@ training_outputs = training_inputs * 3 + 2 + noise
 
 # The loss function to be optimized
 def loss(model, inputs, targets):
-  error = model.predict(inputs) - targets
+  error = model(inputs) - targets
   return tf.reduce_mean(tf.square(error))
 
 def grad(model, inputs, targets):
@@ -498,19 +497,19 @@ is removed, and is then deleted.
 
 ```py
 with tf.device("gpu:0"):
-  v = tfe.Variable(tf.random_normal([1000, 1000]))
+  v = tf.Variable(tf.random_normal([1000, 1000]))
   v = None  # v no longer takes up GPU memory
 ```
 
 ### Object-based saving
 
-`tfe.Checkpoint` can save and restore `tfe.Variable`s to and from
+`tf.train.Checkpoint` can save and restore `tf.Variable`s to and from
 checkpoints:
 
 ```py
-x = tfe.Variable(10.)
+x = tf.Variable(10.)
 
-checkpoint = tfe.Checkpoint(x=x)  # save as "x"
+checkpoint = tf.train.Checkpoint(x=x)  # save as "x"
 
 x.assign(2.)   # Assign a new value to the variables and save.
 save_path = checkpoint.save('./ckpt/')
@@ -523,18 +522,18 @@ checkpoint.restore(save_path)
 print(x)  # => 2.0
 ```
 
-To save and load models, `tfe.Checkpoint` stores the internal state of objects,
+To save and load models, `tf.train.Checkpoint` stores the internal state of objects,
 without requiring hidden variables. To record the state of a `model`,
-an `optimizer`, and a global step, pass them to a `tfe.Checkpoint`:
+an `optimizer`, and a global step, pass them to a `tf.train.Checkpoint`:
 
 ```py
 model = MyModel()
 optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
 checkpoint_dir = ‘/path/to/model_dir’
 checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
-root = tfe.Checkpoint(optimizer=optimizer,
-                      model=model,
-                      optimizer_step=tf.train.get_or_create_global_step())
+root = tf.train.Checkpoint(optimizer=optimizer,
+                           model=model,
+                           optimizer_step=tf.train.get_or_create_global_step())
 
 root.save(file_prefix=checkpoint_prefix)
 # or
@@ -612,7 +611,7 @@ def line_search_step(fn, init_x, rate=1.0):
 `tf.GradientTape` is a powerful interface for computing gradients, but there
 is another [Autograd](https://github.com/HIPS/autograd)-style API available for
 automatic differentiation. These functions are useful if writing math code with
-only tensors and gradient functions, and without `tfe.Variables`:
+only tensors and gradient functions, and without `tf.Variables`:
 
 * `tfe.gradients_function` —Returns a function that computes the derivatives
   of its input function parameter with respect to its arguments. The input
@@ -824,7 +823,7 @@ gives you eager's interactive experimentation and debuggability with the
 distributed performance benefits of graph execution.
 
 Write, debug, and iterate in eager execution, then import the model graph for
-production deployment. Use `tfe.Checkpoint` to save and restore model
+production deployment. Use `tf.train.Checkpoint` to save and restore model
 variables, this allows movement between eager and graph execution environments.
 See the examples in:
 [tensorflow/contrib/eager/python/examples](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples).
diff --git a/tensorflow/docs_src/guide/feature_columns.md b/tensorflow/docs_src/guide/feature_columns.md
index 1013ec910c..41080e050b 100644
--- a/tensorflow/docs_src/guide/feature_columns.md
+++ b/tensorflow/docs_src/guide/feature_columns.md
@@ -561,9 +561,9 @@ For more examples on feature columns, view the following:
 
 * The @{$low_level_intro#feature_columns$Low Level Introduction} demonstrates how
   experiment directly with `feature_columns` using TensorFlow's low level APIs.
-* The @{$wide$wide} and @{$wide_and_deep$Wide & Deep} Tutorials solve a
-  binary classification problem using `feature_columns` on a variety of input
-  data types.
+* The [Estimator wide and deep learning tutorial](https://github.com/tensorflow/models/tree/master/official/wide_deep)
+  solves a binary classification problem using `feature_columns` on a variety of
+  input data types.
 
 To learn more about embeddings, see the following:
 
diff --git a/tensorflow/docs_src/guide/graph_viz.md b/tensorflow/docs_src/guide/graph_viz.md
index f581ae56da..a8876da5a5 100644
--- a/tensorflow/docs_src/guide/graph_viz.md
+++ b/tensorflow/docs_src/guide/graph_viz.md
@@ -248,7 +248,8 @@ The images below show the CIFAR-10 model with tensor shape information:
 Often it is useful to collect runtime metadata for a run, such as total memory
 usage, total compute time, and tensor shapes for nodes. The code example below
 is a snippet from the train and test section of a modification of the
-@{$layers$simple MNIST tutorial}, in which we have recorded summaries and
+[Estimators MNIST tutorial](../tutorials/estimators/cnn.md), in which we have
+recorded summaries and
 runtime statistics. See the
 @{$summaries_and_tensorboard#serializing-the-data$Summaries Tutorial}
 for details on how to record summaries.
diff --git a/tensorflow/docs_src/guide/graphs.md b/tensorflow/docs_src/guide/graphs.md
index e6246ef148..492f97c191 100644
--- a/tensorflow/docs_src/guide/graphs.md
+++ b/tensorflow/docs_src/guide/graphs.md
@@ -486,7 +486,7 @@ subgraph inside.
 ![](../images/mnist_deep.png)
 
 For more information about visualizing your TensorFlow application with
-TensorBoard, see the [TensorBoard tutorial](../get_started/summaries_and_tensorboard.md).
+TensorBoard, see the [TensorBoard guide](./summaries_and_tensorboard.md).
 
 ## Programming with multiple graphs
 
diff --git a/tensorflow/docs_src/guide/index.md b/tensorflow/docs_src/guide/index.md
index eefdb9ceae..f78dfc9a89 100644
--- a/tensorflow/docs_src/guide/index.md
+++ b/tensorflow/docs_src/guide/index.md
@@ -16,15 +16,12 @@ works. The units are as follows:
 
 ## Estimators
 
-* @{$estimators} provides an introduction.
-* @{$premade_estimators}, introduces Estimators for machine learning.
-* @{$custom_estimators}, which demonstrates how to build and train models you
-  design yourself.
-* @{$feature_columns}, which shows how an Estimator can handle a variety of input
-  data types without changes to the model.
-* @{$datasets_for_estimators} describes using tf.data with estimators.
-* @{$checkpoints}, which explains how to save training progress and resume where
-  you left off.
+* @{$estimators}, learn how to use Estimators for machine learning.
+* @{$premade_estimators}, the basics of premade Estimators.
+* @{$checkpoints}, save training progress and resume where you left off.
+* @{$feature_columns}, handle a variety of input data types without changes to the model.
+* @{$datasets_for_estimators}, use `tf.data` to input data.
+* @{$custom_estimators}, write your own Estimator.
 
 ## Accelerators
 
diff --git a/tensorflow/docs_src/guide/keras.md b/tensorflow/docs_src/guide/keras.md
index 1d846df104..2330fa03c7 100644
--- a/tensorflow/docs_src/guide/keras.md
+++ b/tensorflow/docs_src/guide/keras.md
@@ -467,13 +467,13 @@ JSON and YAML serialization formats:
 json_string = model.to_json()
 
 # Recreate the model (freshly initialized)
-fresh_model = keras.models.from_json(json_string)
+fresh_model = keras.models.model_from_json(json_string)
 
 # Serializes a model to YAML format
 yaml_string = model.to_yaml()
 
 # Recreate the model
-fresh_model = keras.models.from_yaml(yaml_string)
+fresh_model = keras.models.model_from_yaml(yaml_string)
 ```
 
 Caution: Subclassed models are not serializable because their architecture is
diff --git a/tensorflow/docs_src/guide/leftnav_files b/tensorflow/docs_src/guide/leftnav_files
index 357a2a1cb9..c4e235b41a 100644
--- a/tensorflow/docs_src/guide/leftnav_files
+++ b/tensorflow/docs_src/guide/leftnav_files
@@ -8,10 +8,10 @@ datasets.md
 ### Estimators
 estimators.md: Introduction to Estimators
 premade_estimators.md
-custom_estimators.md
+checkpoints.md
 feature_columns.md
 datasets_for_estimators.md
-checkpoints.md
+custom_estimators.md
 
 ### Accelerators
 using_gpu.md
@@ -23,6 +23,7 @@ tensors.md
 variables.md
 graphs.md
 saved_model.md
+autograph.md : Control flow
 
 ### ML Concepts
 embedding.md
diff --git a/tensorflow/docs_src/guide/saved_model.md b/tensorflow/docs_src/guide/saved_model.md
index acc3d3ca0b..717488e7cc 100644
--- a/tensorflow/docs_src/guide/saved_model.md
+++ b/tensorflow/docs_src/guide/saved_model.md
@@ -2,9 +2,8 @@
 
 The @{tf.train.Saver} class provides methods to save and restore models. The
 @{tf.saved_model.simple_save} function is an easy way to build a
-@{tf.saved_model$saved model} suitable for serving.
-[Estimators](@{$guide/estimators}) automatically save and restore
-variables in the `model_dir`.
+@{tf.saved_model$saved model} suitable for serving. [Estimators](./estimators)
+automatically save and restore variables in the `model_dir`.
 
 ## Save and restore variables
 
diff --git a/tensorflow/docs_src/guide/tensorboard_histograms.md b/tensorflow/docs_src/guide/tensorboard_histograms.md
index 918deda190..af8f2cadd1 100644
--- a/tensorflow/docs_src/guide/tensorboard_histograms.md
+++ b/tensorflow/docs_src/guide/tensorboard_histograms.md
@@ -13,8 +13,8 @@ TensorFlow has an op
 which is perfect for this purpose. As is usually the case with TensorBoard, we
 will ingest data using a summary op; in this case,
 ['tf.summary.histogram'](https://www.tensorflow.org/api_docs/python/tf/summary/histogram).
-For a primer on how summaries work, please see the general
-[TensorBoard tutorial](https://www.tensorflow.org/get_started/summaries_and_tensorboard).
+For a primer on how summaries work, please see the
+[TensorBoard guide](./summaries_and_tensorboard.md).
 
 Here is a code snippet that will generate some histogram summaries containing
 normally distributed data, where the mean of the distribution increases over
diff --git a/tensorflow/docs_src/guide/using_gpu.md b/tensorflow/docs_src/guide/using_gpu.md
index c429ca4750..c0218fd12e 100644
--- a/tensorflow/docs_src/guide/using_gpu.md
+++ b/tensorflow/docs_src/guide/using_gpu.md
@@ -143,7 +143,7 @@ If the device you have specified does not exist, you will get
 ```
 InvalidArgumentError: Invalid argument: Cannot assign a device to node 'b':
 Could not satisfy explicit device specification '/device:GPU:2'
-   [[Node: b = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [3,2]
+   [[{{node b}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [3,2]
    values: 1 2 3...>, _device="/device:GPU:2"]()]]
 ```
 
diff --git a/tensorflow/docs_src/guide/version_compat.md b/tensorflow/docs_src/guide/version_compat.md
index 72e427c5f8..d2e5e41190 100644
--- a/tensorflow/docs_src/guide/version_compat.md
+++ b/tensorflow/docs_src/guide/version_compat.md
@@ -301,8 +301,10 @@ existing producer scripts will not suddenly use the new functionality.
 #### Change an op's functionality
 
 1.  Add a new similar op named `SomethingV2` or similar and go through the
-    process of adding it and switching existing Python wrappers to use it, which
-    may take three weeks if forward compatibility is desired.
+    process of adding it and switching existing Python wrappers to use it.
+    To ensure forward compatibility use the checks suggested in
+    [compat.py](https://www.tensorflow.org/code/tensorflow/python/compat/compat.py)
+    when changing the Python wrappers.
 2.  Remove the old op (Can only take place with a major version change due to
     backward compatibility).
 3.  Increase `min_consumer` to rule out consumers with the old op, add back the
diff --git a/tensorflow/docs_src/install/index.md b/tensorflow/docs_src/install/index.md
index c2e5a991d4..55481cc400 100644
--- a/tensorflow/docs_src/install/index.md
+++ b/tensorflow/docs_src/install/index.md
@@ -1,36 +1,39 @@
-# Installing TensorFlow
+# Install TensorFlow
 
-We've built and tested TensorFlow on the following 64-bit laptop/desktop
-operating systems:
+Note: Run the [TensorFlow tutorials](../tutorials) in a pre-configured
+[Colab notebook environment](https://colab.research.google.com/notebooks/welcome.ipynb){: .external},
+without installation.
+
+TensorFlow is built and tested on the following 64-bit operating systems:
 
   * macOS 10.12.6 (Sierra) or later.
   * Ubuntu 16.04 or later
   * Windows 7 or later.
   * Raspbian 9.0 or later.
 
-Although you might be able to install TensorFlow on other laptop or desktop
-systems, we only support (and only fix issues in) the preceding configurations.
+While TensorFlow may work on other systems, we only support—and fix issues in—the
+systems listed above.
 
 The following guides explain how to install a version of TensorFlow
 that enables you to write applications in Python:
 
-  * @{$install_linux$Installing TensorFlow on Ubuntu}
-  * @{$install_mac$Installing TensorFlow on macOS}
-  * @{$install_windows$Installing TensorFlow on Windows}
-  * @{$install_raspbian$Installing TensorFlow on a Raspberry Pi}
-  * @{$install_sources$Installing TensorFlow from Sources}
+  * @{$install_linux$Install TensorFlow on Ubuntu}
+  * @{$install_mac$Install TensorFlow on macOS}
+  * @{$install_windows$Install TensorFlow on Windows}
+  * @{$install_raspbian$Install TensorFlow on a Raspberry Pi}
+  * @{$install_sources$Install TensorFlow from source code}
 
 Many aspects of the Python TensorFlow API changed from version 0.n to 1.0.
 The following guide explains how to migrate older TensorFlow applications
 to Version 1.0:
 
-  * @{$migration$Transitioning to TensorFlow 1.0}
+  * @{$migration$Transition to TensorFlow 1.0}
 
 The following guides explain how to install TensorFlow libraries for use in
 other programming languages. These APIs are aimed at deploying TensorFlow
 models in applications and are not as extensive as the Python APIs.
 
-  * @{$install_java$Installing TensorFlow for Java}
-  * @{$install_c$Installing TensorFlow for C}
-  * @{$install_go$Installing TensorFlow for Go}
+  * @{$install_java$Install TensorFlow for Java}
+  * @{$install_c$Install TensorFlow for C}
+  * @{$install_go$Install TensorFlow for Go}
 
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index 2901848745..cf869e8655 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -1,4 +1,4 @@
-# Installing TensorFlow for C
+# Install TensorFlow for C
 
 TensorFlow provides a C API defined in
 [`c_api.h`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/c_api.h),
@@ -38,7 +38,7 @@ enable TensorFlow for C:
          OS="linux" # Change to "darwin" for macOS
          TARGET_DIRECTORY="/usr/local"
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.9.0-rc0.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.9.0.tar.gz" |
            sudo tar -C $TARGET_DIRECTORY -xz
 
      The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index 2c126df5aa..4ec7e42773 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -1,4 +1,4 @@
-# Installing TensorFlow for Go
+# Install TensorFlow for Go
 
 TensorFlow provides APIs for use in Go programs. These APIs are particularly
 well-suited to loading models created in Python and executing them within
@@ -38,7 +38,7 @@ steps to install this library and enable TensorFlow for Go:
          TF_TYPE="cpu" # Change to "gpu" for GPU support
          TARGET_DIRECTORY='/usr/local'
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.9.0-rc0.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.9.0.tar.gz" |
          sudo tar -C $TARGET_DIRECTORY -xz
 
      The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 692dfc9cef..c5f760d254 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -1,4 +1,4 @@
-# Installing TensorFlow for Java
+# Install TensorFlow for Java
 
 TensorFlow provides APIs for use in Java programs. These APIs are particularly
 well-suited to loading models created in Python and executing them within a
@@ -36,7 +36,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>tensorflow</artifactId>
-  <version>1.9.0-rc0</version>
+  <version>1.9.0</version>
 </dependency>
 ```
 
@@ -65,7 +65,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
                <dependency>
                  <groupId>org.tensorflow</groupId>
                  <artifactId>tensorflow</artifactId>
-                 <version>1.9.0-rc0</version>
+                 <version>1.9.0</version>
                </dependency>
              </dependencies>
          </project>
@@ -124,12 +124,12 @@ instead:
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>libtensorflow</artifactId>
-  <version>1.9.0-rc0</version>
+  <version>1.9.0</version>
 </dependency>
 <dependency>
   <groupId>org.tensorflow</groupId>
   <artifactId>libtensorflow_jni_gpu</artifactId>
-  <version>1.9.0-rc0</version>
+  <version>1.9.0</version>
 </dependency>
 ```
 
@@ -148,7 +148,7 @@ refer to the simpler instructions above instead.
 Take the following steps to install TensorFlow for Java on Linux or macOS:
 
   1. Download
-     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.9.0-rc0.jar),
+     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.9.0.jar),
      which is the TensorFlow Java Archive (JAR).
 
   2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@@ -167,7 +167,7 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
          OS=$(uname -s | tr '[:upper:]' '[:lower:]')
          mkdir -p ./jni
          curl -L \
-           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.9.0-rc0.tar.gz" |
+           "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.9.0.tar.gz" |
            tar -xz -C ./jni
 
 ### Install on Windows
@@ -175,10 +175,10 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
 Take the following steps to install TensorFlow for Java on Windows:
 
   1. Download
-     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.9.0-rc0.jar),
+     [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.9.0.jar),
      which is the TensorFlow Java Archive (JAR).
   2. Download the following Java Native Interface (JNI) file appropriate for
-     [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.9.0-rc0.zip).
+     [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.9.0.zip).
   3. Extract this .zip file.
 
 __Note__: The native library (`tensorflow_jni.dll`) requires `msvcp140.dll` at runtime, which is included in the [Visual C++ 2015 Redistributable](https://www.microsoft.com/en-us/download/details.aspx?id=48145) package. 
@@ -227,7 +227,7 @@ must be part of your `classpath`. For example, you can include the
 downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
 as follows:
 
-<pre><b>javac -cp libtensorflow-1.9.0-rc0.jar HelloTF.java</b></pre>
+<pre><b>javac -cp libtensorflow-1.9.0.jar HelloTF.java</b></pre>
 
 
 ### Running
@@ -241,11 +241,11 @@ two files are available to the JVM:
 For example, the following command line executes the `HelloTF` program on Linux
 and macOS X:
 
-<pre><b>java -cp libtensorflow-1.9.0-rc0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.9.0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
 
 And the following command line executes the `HelloTF` program on Windows:
 
-<pre><b>java -cp libtensorflow-1.9.0-rc0.jar;. -Djava.library.path=jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.9.0.jar;. -Djava.library.path=jni HelloTF</b></pre>
 
 If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
 installed TensorFlow for Java and are ready to use the API.  If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index c573acaf45..3a9a01c57e 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -1,38 +1,38 @@
-# Installing TensorFlow on Ubuntu
+# Install TensorFlow on Ubuntu
 
 This guide explains how to install TensorFlow on Ubuntu Linux. While these
-instructions may work on other Linux variants, they are tested and supported with
-the following system requirements:
-
-* 64-bit desktops or laptops
-* Ubuntu 16.04 or higher
+instructions may work on other Linux variants, they are tested and supported
+with the following system requirements:
 
+*   64-bit desktops or laptops
+*   Ubuntu 16.04 or higher
 
 ## Choose which TensorFlow to install
 
 The following TensorFlow variants are available for installation:
 
-* __TensorFlow with CPU support only__. If your system does not have a
-  NVIDIA®&nbsp;GPU, you must install this version. This version of TensorFlow is
-  usually easier to install, so even if you have an NVIDIA GPU, we recommend
-  installing this version first.
-* __TensorFlow with GPU support__. TensorFlow programs usually run much faster on
-  a GPU instead of a CPU. If you run performance-critical applications and your
-  system has an NVIDIA®&nbsp;GPU that meets the prerequisites, you should install
-  this version. See [TensorFlow GPU support](#NVIDIARequirements) for details.
-
+*   __TensorFlow with CPU support only__. If your system does not have a
+    NVIDIA®&nbsp;GPU, you must install this version. This version of TensorFlow
+    is usually easier to install, so even if you have an NVIDIA GPU, we
+    recommend installing this version first.
+*   __TensorFlow with GPU support__. TensorFlow programs usually run much faster
+    on a GPU instead of a CPU. If you run performance-critical applications and
+    your system has an NVIDIA®&nbsp;GPU that meets the prerequisites, you should
+    install this version. See [TensorFlow GPU support](#NVIDIARequirements) for
+    details.
 
 ## How to install TensorFlow
 
 There are a few options to install TensorFlow on your machine:
 
-* [Use pip in a virtual environment](#InstallingVirtualenv) *(recommended)*
-* [Use pip in your system environment](#InstallingNativePip)
-* [Configure a Docker container](#InstallingDocker)
-* [Use pip in Anaconda](#InstallingAnaconda)
-* [Install TensorFlow from source](/install/install_sources)
+*   [Use pip in a virtual environment](#InstallingVirtualenv) *(recommended)*
+*   [Use pip in your system environment](#InstallingNativePip)
+*   [Configure a Docker container](#InstallingDocker)
+*   [Use pip in Anaconda](#InstallingAnaconda)
+*   [Install TensorFlow from source](/install/install_sources)
 
 <a name="InstallingVirtualenv"></a>
+
 ### Use `pip` in a virtual environment
 
 Key Point: Using a virtual environment is the recommended install method.
@@ -41,8 +41,8 @@ The [Virtualenv](https://virtualenv.pypa.io/en/stable/) tool creates virtual
 Python environments that are isolated from other Python development on the same
 machine. In this scenario, you install TensorFlow and its dependencies within a
 virtual environment that is available when *activated*. Virtualenv provides a
-reliable way to install and run TensorFlow while avoiding conflicts with the rest
-of the system.
+reliable way to install and run TensorFlow while avoiding conflicts with the
+rest of the system.
 
 ##### 1. Install Python, `pip`, and `virtualenv`.
 
@@ -62,10 +62,10 @@ To install these packages on Ubuntu:
 </pre>
 
 We *recommend* using `pip` version 8.1 or higher. If using a release before
-version 8.1,  upgrade `pip`:
+version 8.1, upgrade `pip`:
 
 <pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo pip install -U pip</code>
+  <code class="devsite-terminal">pip install --upgrade pip</code>
 </pre>
 
 If not using Ubuntu and [setuptools](https://pypi.org/project/setuptools/) is
@@ -102,7 +102,7 @@ When the Virtualenv is activated, the shell prompt displays as `(venv) $`.
 Within the active virtual environment, upgrade `pip`:
 
 <pre class="prettyprint lang-bsh">
-(venv)$ pip install -U pip
+(venv)$ pip install --upgrade pip
 </pre>
 
 You can install other Python packages within the virtual environment without
@@ -112,15 +112,15 @@ affecting packages outside the `virtualenv`.
 
 Choose one of the available TensorFlow packages for installation:
 
-* `tensorflow` —Current release for CPU
-* `tensorflow-gpu` —Current release with GPU support
-* `tf-nightly` —Nightly build for CPU
-* `tf-nightly-gpu` —Nightly build with GPU support
+*   `tensorflow` —Current release for CPU
+*   `tensorflow-gpu` —Current release with GPU support
+*   `tf-nightly` —Nightly build for CPU
+*   `tf-nightly-gpu` —Nightly build with GPU support
 
 Within an active Virtualenv environment, use `pip` to install the package:
 
 <pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">pip install -U tensorflow</code>
+  <code class="devsite-terminal">pip install --upgrade tensorflow</code>
 </pre>
 
 Use `pip list` to show the packages installed in the virtual environment.
@@ -160,14 +160,14 @@ To uninstall TensorFlow, remove the Virtualenv directory you created in step 2:
   <code class="devsite-terminal">rm -r ~/tensorflow/<var>venv</var></code>
 </pre>
 
-
 <a name="InstallingNativePip"></a>
+
 ### Use `pip` in your system environment
 
 Use `pip` to install the TensorFlow package directly on your system without
 using a container or virtual environment for isolation. This method is
-recommended for system administrators that want a TensorFlow installation that is
-available to everyone on a multi-user system.
+recommended for system administrators that want a TensorFlow installation that
+is available to everyone on a multi-user system.
 
 Since a system install is not isolated, it could interfere with other
 Python-based installations. But if you understand `pip` and your Python
@@ -195,10 +195,10 @@ To install these packages on Ubuntu:
 </pre>
 
 We *recommend* using `pip` version 8.1 or higher. If using a release before
-version 8.1,  upgrade `pip`:
+version 8.1, upgrade `pip`:
 
 <pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo pip install -U pip</code>
+  <code class="devsite-terminal">pip install --upgrade pip</code>
 </pre>
 
 If not using Ubuntu and [setuptools](https://pypi.org/project/setuptools/) is
@@ -212,16 +212,16 @@ installed, use `easy_install` to install `pip`:
 
 Choose one of the available TensorFlow packages for installation:
 
-* `tensorflow` —Current release for CPU
-* `tensorflow-gpu` —Current release with GPU support
-* `tf-nightly` —Nightly build for CPU
-* `tf-nightly-gpu` —Nightly build with GPU support
+*   `tensorflow` —Current release for CPU
+*   `tensorflow-gpu` —Current release with GPU support
+*   `tf-nightly` —Nightly build for CPU
+*   `tf-nightly-gpu` —Nightly build with GPU support
 
 And use `pip` to install the package for Python 2 or 3:
 
 <pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo pip install -U tensorflow   # Python 2.7</code>
-  <code class="devsite-terminal">sudo pip3 install -U tensorflow  # Python 3.n</code>
+  <code class="devsite-terminal">pip install --upgrade --user tensorflow   # Python 2.7</code>
+  <code class="devsite-terminal">pip3 install --upgrade --user tensorflow  # Python 3.n</code>
 </pre>
 
 Use `pip list` to show the packages installed on the system.
@@ -239,8 +239,8 @@ If the above steps failed, try installing the TensorFlow binary using the remote
 URL of the `pip` package:
 
 <pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo pip install --upgrade <var>remote-pkg-URL</var>   # Python 2.7</code>
-  <code class="devsite-terminal">sudo pip3 install --upgrade <var>remote-pkg-URL</var>  # Python 3.n</code>
+  <code class="devsite-terminal">pip install --user --upgrade <var>remote-pkg-URL</var>   # Python 2.7</code>
+  <code class="devsite-terminal">pip3 install --user --upgrade <var>remote-pkg-URL</var>  # Python 3.n</code>
 </pre>
 
 The <var>remote-pkg-URL</var> depends on the operating system, Python version,
@@ -255,42 +255,41 @@ encounter problems.
 To uninstall TensorFlow on your system, use one of following commands:
 
 <pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo pip uninstall tensorflow   # for Python 2.7</code>
-  <code class="devsite-terminal">sudo pip3 uninstall tensorflow  # for Python 3.n</code>
+  <code class="devsite-terminal">pip uninstall tensorflow   # for Python 2.7</code>
+  <code class="devsite-terminal">pip3 uninstall tensorflow  # for Python 3.n</code>
 </pre>
 
 <a name="InstallingDocker"></a>
+
 ### Configure a Docker container
 
-Docker completely isolates the TensorFlow installation
-from pre-existing packages on your machine. The Docker container contains
-TensorFlow and all its dependencies. Note that the Docker image can be quite
-large (hundreds of MBs). You might choose the Docker installation if you are
-incorporating TensorFlow into a larger application architecture that already
-uses Docker.
+Docker completely isolates the TensorFlow installation from pre-existing
+packages on your machine. The Docker container contains TensorFlow and all its
+dependencies. Note that the Docker image can be quite large (hundreds of MBs).
+You might choose the Docker installation if you are incorporating TensorFlow
+into a larger application architecture that already uses Docker.
 
 Take the following steps to install TensorFlow through Docker:
 
-  1. Install Docker on your machine as described in the
-     [Docker documentation](http://docs.docker.com/engine/installation/).
-  2. Optionally, create a Linux group called <code>docker</code> to allow
-     launching containers without sudo as described in the
-     [Docker documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/).
-     (If you don't do this step, you'll have to use sudo each time
-     you invoke Docker.)
-  3. To install a version of TensorFlow that supports GPUs, you must first
-     install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), which
-     is stored in github.
-  4. Launch a Docker container that contains one of the
-     [TensorFlow binary images](https://hub.docker.com/r/tensorflow/tensorflow/tags/).
+1.  Install Docker on your machine as described in the
+    [Docker documentation](http://docs.docker.com/engine/installation/).
+2.  Optionally, create a Linux group called <code>docker</code> to allow
+    launching containers without sudo as described in the
+    [Docker documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/).
+    (If you don't do this step, you'll have to use sudo each time you invoke
+    Docker.)
+3.  To install a version of TensorFlow that supports GPUs, you must first
+    install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), which is
+    stored in github.
+4.  Launch a Docker container that contains one of the
+    [TensorFlow binary images](https://hub.docker.com/r/tensorflow/tensorflow/tags/).
 
 The remainder of this section explains how to launch a Docker container.
 
-
 #### CPU-only
 
-To launch a Docker container with CPU-only support (that is, without
-GPU support), enter a command of the following format:
+To launch a Docker container with CPU-only support (that is, without GPU
+support), enter a command of the following format:
 
 <pre>
 $ docker run -it <i>-p hostPort:containerPort TensorFlowCPUImage</i>
@@ -298,29 +297,31 @@ $ docker run -it <i>-p hostPort:containerPort TensorFlowCPUImage</i>
 
 where:
 
-  * <tt><i>-p hostPort:containerPort</i></tt> is optional.
-    If you plan to run TensorFlow programs from the shell, omit this option.
-    If you plan to run TensorFlow programs as Jupyter notebooks, set both
-    <tt><i>hostPort</i></tt> and <tt><i>containerPort</i></tt>
-    to <tt>8888</tt>.  If you'd like to run TensorBoard inside the container,
-    add a second `-p` flag, setting both <i>hostPort</i> and <i>containerPort</i>
-    to 6006.
-  * <tt><i>TensorFlowCPUImage</i></tt> is required. It identifies the Docker
+*   <tt><i>-p hostPort:containerPort</i></tt> is optional. If you plan to run
+    TensorFlow programs from the shell, omit this option. If you plan to run
+    TensorFlow programs as Jupyter notebooks, set both <tt><i>hostPort</i></tt>
+    and <tt><i>containerPort</i></tt> to <tt>8888</tt>. If you'd like to run
+    TensorBoard inside the container, add a second `-p` flag, setting both
+    <i>hostPort</i> and <i>containerPort</i> to 6006.
+*   <tt><i>TensorFlowCPUImage</i></tt> is required. It identifies the Docker
     container. Specify one of the following values:
-    * <tt>tensorflow/tensorflow</tt>, which is the TensorFlow CPU binary image.
-    * <tt>tensorflow/tensorflow:latest-devel</tt>, which is the latest
-      TensorFlow CPU Binary image plus source code.
-    * <tt>tensorflow/tensorflow:<i>version</i></tt>, which is the
-      specified version (for example, 1.1.0rc1) of TensorFlow CPU binary image.
-    * <tt>tensorflow/tensorflow:<i>version</i>-devel</tt>, which is
-      the specified version (for example, 1.1.0rc1) of the TensorFlow GPU
-      binary image plus source code.
+
+    *   <tt>tensorflow/tensorflow</tt>, which is the TensorFlow CPU binary
+        image.
+    *   <tt>tensorflow/tensorflow:latest-devel</tt>, which is the latest
+        TensorFlow CPU Binary image plus source code.
+    *   <tt>tensorflow/tensorflow:<i>version</i></tt>, which is the specified
+        version (for example, 1.1.0rc1) of TensorFlow CPU binary image.
+    *   <tt>tensorflow/tensorflow:<i>version</i>-devel</tt>, which is the
+        specified version (for example, 1.1.0rc1) of the TensorFlow GPU binary
+        image plus source code.
 
     TensorFlow images are available at
     [dockerhub](https://hub.docker.com/r/tensorflow/tensorflow/).
 
-For example, the following command launches the latest TensorFlow CPU binary image
-in a Docker container from which you can run TensorFlow programs in a shell:
+For example, the following command launches the latest TensorFlow CPU binary
+image in a Docker container from which you can run TensorFlow programs in a
+shell:
 
 <pre>
 $ <b>docker run -it tensorflow/tensorflow bash</b>
@@ -336,10 +337,11 @@ $ <b>docker run -it -p 8888:8888 tensorflow/tensorflow</b>
 
 Docker will download the TensorFlow binary image the first time you launch it.
 
-
 #### GPU support
 
-To launch a Docker container with NVidia GPU support, enter a command of the following format (this [does not require any local CUDA installation](https://github.com/nvidia/nvidia-docker/wiki/CUDA#requirements)):
+To launch a Docker container with NVidia GPU support, enter a command of the
+following format (this
+[does not require any local CUDA installation](https://github.com/nvidia/nvidia-docker/wiki/CUDA#requirements)):
 
 <pre>
 $ <b>nvidia-docker run -it</b> <i>-p hostPort:containerPort TensorFlowGPUImage</i>
@@ -347,34 +349,34 @@ $ <b>nvidia-docker run -it</b> <i>-p hostPort:containerPort TensorFlowGPUImage</
 
 where:
 
-  * <tt><i>-p hostPort:containerPort</i></tt> is optional. If you plan
-    to run TensorFlow programs from the shell, omit this option. If you plan
-    to run TensorFlow programs as Jupyter notebooks, set both
-    <tt><i>hostPort</i></tt> and <code><em>containerPort</em></code> to `8888`.
-  * <i>TensorFlowGPUImage</i> specifies the Docker container. You must
-    specify one of the following values:
-    * <tt>tensorflow/tensorflow:latest-gpu</tt>, which is the latest
-      TensorFlow GPU binary image.
-    * <tt>tensorflow/tensorflow:latest-devel-gpu</tt>, which is
-      the latest TensorFlow GPU Binary image plus source code.
-    * <tt>tensorflow/tensorflow:<i>version</i>-gpu</tt>, which is the
-      specified version (for example, 0.12.1) of the TensorFlow GPU
-      binary image.
-    * <tt>tensorflow/tensorflow:<i>version</i>-devel-gpu</tt>, which is
-      the specified version (for example, 0.12.1) of the TensorFlow GPU
-      binary image plus source code.
-
-We recommend installing one of the `latest` versions. For example, the
-following command launches the latest TensorFlow GPU binary image in a
-Docker container from which you can run TensorFlow programs in a shell:
+*   <tt><i>-p hostPort:containerPort</i></tt> is optional. If you plan to run
+    TensorFlow programs from the shell, omit this option. If you plan to run
+    TensorFlow programs as Jupyter notebooks, set both <tt><i>hostPort</i></tt>
+    and <code><em>containerPort</em></code> to `8888`.
+*   <i>TensorFlowGPUImage</i> specifies the Docker container. You must specify
+    one of the following values:
+    *   <tt>tensorflow/tensorflow:latest-gpu</tt>, which is the latest
+        TensorFlow GPU binary image.
+    *   <tt>tensorflow/tensorflow:latest-devel-gpu</tt>, which is the latest
+        TensorFlow GPU Binary image plus source code.
+    *   <tt>tensorflow/tensorflow:<i>version</i>-gpu</tt>, which is the
+        specified version (for example, 0.12.1) of the TensorFlow GPU binary
+        image.
+    *   <tt>tensorflow/tensorflow:<i>version</i>-devel-gpu</tt>, which is the
+        specified version (for example, 0.12.1) of the TensorFlow GPU binary
+        image plus source code.
+
+We recommend installing one of the `latest` versions. For example, the following
+command launches the latest TensorFlow GPU binary image in a Docker container
+from which you can run TensorFlow programs in a shell:
 
 <pre>
 $ <b>nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash</b>
 </pre>
 
-The following command also launches the latest TensorFlow GPU binary image
-in a Docker container. In this Docker container, you can run TensorFlow
-programs in a Jupyter notebook:
+The following command also launches the latest TensorFlow GPU binary image in a
+Docker container. In this Docker container, you can run TensorFlow programs in a
+Jupyter notebook:
 
 <pre>
 $ <b>nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu</b>
@@ -390,14 +392,12 @@ Docker will download the TensorFlow binary image the first time you launch it.
 For more details see the
 [TensorFlow docker readme](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker).
 
-
 #### Next Steps
 
-You should now
-[validate your installation](#ValidateYourInstallation).
-
+You should now [validate your installation](#ValidateYourInstallation).
 
 <a name="InstallingAnaconda"></a>
+
 ### Use `pip` in Anaconda
 
 Anaconda provides the `conda` utility to create a virtual environment. However,
@@ -410,61 +410,59 @@ not tested on new TensorFlow releases.
 
 Take the following steps to install TensorFlow in an Anaconda environment:
 
-  1. Follow the instructions on the
-     [Anaconda download site](https://www.continuum.io/downloads)
-     to download and install Anaconda.
+1.  Follow the instructions on the
+    [Anaconda download site](https://www.continuum.io/downloads) to download and
+    install Anaconda.
 
-  2. Create a conda environment named <tt>tensorflow</tt> to run a version
-     of Python by invoking the following command:
+2.  Create a conda environment named <tt>tensorflow</tt> to run a version of
+    Python by invoking the following command:
 
      <pre>$ <b>conda create -n tensorflow pip python=2.7 # or python=3.3, etc.</b></pre>
 
-  3. Activate the conda environment by issuing the following command:
+3.  Activate the conda environment by issuing the following command:
 
      <pre>$ <b>source activate tensorflow</b>
      (tensorflow)$  # Your prompt should change </pre>
 
-  4. Issue a command of the following format to install
-     TensorFlow inside your conda environment:
+4.  Issue a command of the following format to install TensorFlow inside your
+    conda environment:
 
      <pre>(tensorflow)$ <b>pip install --ignore-installed --upgrade</b> <i>tfBinaryURL</i></pre>
 
-     where <code><em>tfBinaryURL</em></code> is the
-     [URL of the TensorFlow Python package](#the_url_of_the_tensorflow_python_package).
-     For example, the following command installs the CPU-only version of
-     TensorFlow for Python 3.4:
+    where <code><em>tfBinaryURL</em></code> is the
+    [URL of the TensorFlow Python package](#the_url_of_the_tensorflow_python_package).
+    For example, the following command installs the CPU-only version of
+    TensorFlow for Python 3.4:
 
      <pre>
      (tensorflow)$ <b>pip install --ignore-installed --upgrade \
-     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0-cp34-cp34m-linux_x86_64.whl</b></pre>
 
 <a name="ValidateYourInstallation"></a>
+
 ## Validate your installation
 
 To validate your TensorFlow installation, do the following:
 
-  1. Ensure that your environment is prepared to run TensorFlow programs.
-  2. Run a short TensorFlow program.
-
+1.  Ensure that your environment is prepared to run TensorFlow programs.
+2.  Run a short TensorFlow program.
 
 ### Prepare your environment
 
-If you installed on native pip, Virtualenv, or Anaconda, then
-do the following:
+If you installed on native pip, Virtualenv, or Anaconda, then do the following:
 
-  1. Start a terminal.
-  2. If you installed with Virtualenv or Anaconda, activate your container.
-  3. If you installed TensorFlow source code, navigate to any
-     directory *except* one containing TensorFlow source code.
+1.  Start a terminal.
+2.  If you installed with Virtualenv or Anaconda, activate your container.
+3.  If you installed TensorFlow source code, navigate to any directory *except*
+    one containing TensorFlow source code.
 
-If you installed through Docker, start a Docker container
-from which you can run bash. For example:
+If you installed through Docker, start a Docker container from which you can run
+bash. For example:
 
 <pre>
 $ <b>docker run -it tensorflow/tensorflow bash</b>
 </pre>
 
-
 ### Run a short TensorFlow program
 
 Invoke python from your shell as follows:
@@ -486,94 +484,71 @@ TensorFlow programs:
 
 <pre>Hello, TensorFlow!</pre>
 
-If the system outputs an error message instead of a greeting, see [Common
-installation problems](#common_installation_problems).
+If the system outputs an error message instead of a greeting, see
+[Common installation problems](#common_installation_problems).
 
-To learn more, see [Get Started with TensorFlow](https://www.tensorflow.org/get_started).
+To learn more, see the [TensorFlow tutorials](../tutorials/).
 
 <a name="NVIDIARequirements"></a>
-## TensorFlow GPU support
-
-To install TensorFlow with GPU support, configure the following NVIDIA® software
-on your system:
-
-* [CUDA Toolkit 9.0](http://nvidia.com/cuda). For details, see
-  [NVIDIA's documentation](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
-  Append the relevant CUDA pathnames to the `LD_LIBRARY_PATH` environmental
-  variable as described in the NVIDIA documentation.
-* [cuDNN SDK v7](http://developer.nvidia.com/cudnn). For details, see
-  [NVIDIA's documentation](http://docs.nvidia.com/deeplearning/sdk/cudnn-install/).
-  Create the `CUDA_HOME` environment variable as described in the NVIDIA
-  documentation.
-* A GPU card with CUDA Compute Capability 3.0 or higher for building TensorFlow
-  from source. To use the TensorFlow binaries, version 3.5 or higher is required.
-  See the [NVIDIA documentation](https://developer.nvidia.com/cuda-gpus) for a
-  list of supported GPU cards.
-* [GPU drivers](http://nvidia.com/drivers) that support your version of the CUDA
-  Toolkit.
-* The `libcupti-dev` library is the NVIDIA CUDA Profile Tools Interface. This
-  library provides advanced profiling support. To install this library,
-  use the following command for CUDA Toolkit >= 8.0:
-
-<pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo apt-get install cuda-command-line-tools</code>
-</pre>
-
-Add this path to the `LD_LIBRARY_PATH` environmental variable:
-
-<pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">export LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/local/cuda/extras/CUPTI/lib64</code>
-</pre>
-
-* *OPTIONAL*:  For optimized performance during inference, install
-  *NVIDIA&nbsp;TensorRT&nbsp;3.0*. To install the minimal amount of TensorRT
-  runtime components required to use with the pre-built `tensorflow-gpu` package:
 
-<pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/x86_64/nvinfer-runtime-trt-repo-ubuntu1404-3.0.4-ga-cuda9.0_1.0-1_amd64.deb</code>
-  <code class="devsite-terminal">sudo dpkg -i nvinfer-runtime-trt-repo-ubuntu1404-3.0.4-ga-cuda9.0_1.0-1_amd64.deb</code>
-  <code class="devsite-terminal">sudo apt-get update</code>
-  <code class="devsite-terminal">sudo apt-get install -y --allow-downgrades libnvinfer-dev libcudnn7-dev=7.0.5.15-1+cuda9.0 libcudnn7=7.0.5.15-1+cuda9.0</code>
-</pre>
-
-Note: For compatibility with the pre-built `tensorflow-gpu` package, use the
-Ubuntu *14.04* package of TensorRT (shown above). Use this even when installing
-on an Ubuntu 16.04 system.
-
-To build the TensorFlow-TensorRT integration module from source instead of using
-the pre-built binaries, see the
-[module documentation](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensorrt#using-tensorrt-in-tensorflow).
-For detailed TensorRT installation instructions, see
-[NVIDIA's TensorRT documentation](http://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html).
-
-To avoid cuDNN version conflicts during later system upgrades, hold the cuDNN
-version at 7.0.5:
-
-<pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo apt-mark hold libcudnn7 libcudnn7-dev</code>
-</pre>
-
-To allow upgrades, remove the this hold:
-
-<pre class="prettyprint lang-bsh">
-  <code class="devsite-terminal">sudo apt-mark unhold libcudnn7 libcudnn7-dev</code>
-</pre>
-
-If you have an earlier version of the preceding packages, upgrade to the
-specified versions. If upgrading is not possible, you can still run TensorFlow
-with GPU support by @{$install_sources}.
+## TensorFlow GPU support
 
+Note: Due to the number of libraries required, using [Docker](#InstallingDocker)
+is recommended over installing directly on the host system.
+
+The following NVIDIA® <i>hardware</i> must be installed on your system:
+
+*   GPU card with CUDA Compute Capability 3.5 or higher. See
+    [NVIDIA documentation](https://developer.nvidia.com/cuda-gpus) for a list of
+    supported GPU cards.
+
+The following NVIDIA® <i>software</i> must be installed on your system:
+
+*   [GPU drivers](http://nvidia.com/driver). CUDA 9.0 requires 384.x or higher.
+*   [CUDA Toolkit 9.0](http://nvidia.com/cuda).
+*   [cuDNN SDK](http://developer.nvidia.com/cudnn) (>= 7.0). Version 7.1 is
+    recommended.
+*   [CUPTI](http://docs.nvidia.com/cuda/cupti/) ships with the CUDA Toolkit, but
+    you also need to append its path to the `LD_LIBRARY_PATH` environment
+    variable: `export
+    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64`
+*   *OPTIONAL*: [NCCL 2.2](https://developer.nvidia.com/nccl) to use TensorFlow
+    with multiple GPUs.
+*   *OPTIONAL*:
+    [TensorRT](http://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html)
+    which can improve latency and throughput for inference for some models.
+
+To use a GPU with CUDA Compute Capability 3.0, or different versions of the
+preceding NVIDIA libraries see
+@{$install_sources$installing TensorFlow from Sources}. If using Ubuntu 16.04
+and possibly other Debian based linux distros, `apt-get` can be used with the
+NVIDIA repository to simplify installation.
+
+```bash
+# Adds NVIDIA package repository.
+sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
+wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
+wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
+sudo dpkg -i cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
+sudo dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
+sudo apt-get update
+# Includes optional NCCL 2.x.
+sudo apt-get install cuda9.0 cuda-cublas-9-0 cuda-cufft-9-0 cuda-curand-9-0 \
+  cuda-cusolver-9-0 cuda-cusparse-9-0 libcudnn7=7.1.4.18-1+cuda9.0 \
+   libnccl2=2.2.13-1+cuda9.0 cuda-command-line-tools-9-0
+# Optionally install TensorRT runtime, must be done after above cuda install.
+sudo apt-get update
+sudo apt-get install libnvinfer4=4.1.2-1+cuda9.0
+```
 
 ## Common installation problems
 
 We are relying on Stack Overflow to document TensorFlow installation problems
-and their remedies.  The following table contains links to Stack Overflow
-answers for some common installation problems.
-If you encounter an error message or other
-installation problem not listed in the following table, search for it
-on Stack Overflow.  If Stack Overflow doesn't show the error message,
-ask a new question about it on Stack Overflow and specify
-the `tensorflow` tag.
+and their remedies. The following table contains links to Stack Overflow answers
+for some common installation problems. If you encounter an error message or
+other installation problem not listed in the following table, search for it on
+Stack Overflow. If Stack Overflow doesn't show the error message, ask a new
+question about it on Stack Overflow and specify the `tensorflow` tag.
 
 <table>
 <tr> <th>Link to GitHub or Stack&nbsp;Overflow</th> <th>Error Message</th> </tr>
@@ -657,74 +632,67 @@ the `tensorflow` tag.
 
 </table>
 
-
 <a name="TF_PYTHON_URL"></a>
+
 ## The URL of the TensorFlow Python package
 
 A few installation mechanisms require the URL of the TensorFlow Python package.
 The value you specify depends on three factors:
 
-  * operating system
-  * Python version
-  * CPU only vs. GPU support
+*   operating system
+*   Python version
+*   CPU only vs. GPU support
 
 This section documents the relevant values for Linux installations.
 
-
 ### Python 2.7
 
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0-cp27-none-linux_x86_64.whl
 </pre>
 
-
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0-cp27-none-linux_x86_64.whl
 </pre>
 
 Note that GPU support requires the NVIDIA hardware and software described in
 [NVIDIA requirements to run TensorFlow with GPU support](#NVIDIARequirements).
 
-
 ### Python 3.4
 
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0-cp34-cp34m-linux_x86_64.whl
 </pre>
 
-
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0-cp34-cp34m-linux_x86_64.whl
 </pre>
 
 Note that GPU support requires the NVIDIA hardware and software described in
 [NVIDIA requirements to run TensorFlow with GPU support](#NVIDIARequirements).
 
-
 ### Python 3.5
 
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0-cp35-cp35m-linux_x86_64.whl
 </pre>
 
-
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0-cp35-cp35m-linux_x86_64.whl
 </pre>
 
-
 Note that GPU support requires the NVIDIA hardware and software described in
 [NVIDIA requirements to run TensorFlow with GPU support](#NVIDIARequirements).
 
@@ -733,16 +701,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
 CPU only:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0rc0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0-cp36-cp36m-linux_x86_64.whl
 </pre>
 
-
 GPU support:
 
 <pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0rc0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.9.0-cp36-cp36m-linux_x86_64.whl
 </pre>
 
-
 Note that GPU support requires the NVIDIA hardware and software described in
 [NVIDIA requirements to run TensorFlow with GPU support](#NVIDIARequirements).
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index 584f1e2e35..1a7b2b815d 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -1,4 +1,4 @@
-# Installing TensorFlow on macOS
+# Install TensorFlow on macOS
 
 This guide explains how to install TensorFlow on macOS. Although these
 instructions might also work on other macOS variants, we have only
@@ -119,7 +119,7 @@ Take the following steps to install TensorFlow with Virtualenv:
      TensorFlow in the active Virtualenv is as follows:
 
      <pre> $ <b>pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py3-none-any.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0-py3-none-any.whl</b></pre>
 
 If you encounter installation problems, see
 [Common Installation Problems](#common-installation-problems).
@@ -242,7 +242,7 @@ take the following steps:
      issue the following command:
 
      <pre> $ <b>sudo pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py3-none-any.whl</b> </pre>
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0-py3-none-any.whl</b> </pre>
 
      If the preceding command fails, see
      [installation problems](#common-installation-problems).
@@ -350,7 +350,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
      TensorFlow for Python 2.7:
 
      <pre> (<i>targetDirectory</i>)$ <b>pip install --ignore-installed --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py2-none-any.whl</b></pre>
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0-py2-none-any.whl</b></pre>
 
 
 <a name="ValidateYourInstallation"></a>
@@ -403,8 +403,7 @@ writing TensorFlow programs:
 If the system outputs an error message instead of a greeting, see
 [Common installation problems](#common_installation_problems).
 
-To learn more, see [Get Started with TensorFlow](https://www.tensorflow.org/get_started).
-
+To learn more, see the [TensorFlow tutorials](../tutorials/).
 
 ## Common installation problems
 
@@ -518,7 +517,7 @@ The value you specify depends on your Python version.
 
 
 <pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py2-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0-py2-none-any.whl
 </pre>
 
 
@@ -526,5 +525,5 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py2-none-a
 
 
 <pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0rc0-py3-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.9.0-py3-none-any.whl
 </pre>
diff --git a/tensorflow/docs_src/install/install_raspbian.md b/tensorflow/docs_src/install/install_raspbian.md
index 0caab6d335..58a5285c78 100644
--- a/tensorflow/docs_src/install/install_raspbian.md
+++ b/tensorflow/docs_src/install/install_raspbian.md
@@ -1,4 +1,4 @@
-# Installing TensorFlow on Raspbian
+# Install TensorFlow on Raspbian
 
 This guide explains how to install TensorFlow on a Raspberry Pi running
 Raspbian. Although these instructions might also work on other Pi variants, we
@@ -230,7 +230,7 @@ problems, despite the log message.
 If the system outputs an error message instead of a greeting, see [Common
 installation problems](#common_installation_problems).
 
-To learn more, see [Get Started with TensorFlow](https://www.tensorflow.org/get_started).
+To learn more, see the [TensorFlow tutorials](../tutorials/).
 
 ## Common installation problems
 
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index a641dc3a6f..31dcad64d4 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -1,28 +1,27 @@
-# Installing TensorFlow from Sources
+# Install TensorFlow from Sources
 
-This guide explains how to build TensorFlow sources into a TensorFlow
-binary and how to install that TensorFlow binary.  Note that we provide
-well-tested, pre-built TensorFlow binaries for Ubuntu, macOS, and Windows
-systems. In addition, there are pre-built TensorFlow
-[docker images](https://hub.docker.com/r/tensorflow/tensorflow/).
-So, don't build a TensorFlow binary yourself unless you are very
-comfortable building complex packages from source and dealing with
-the inevitable aftermath should things not go exactly as documented.
+This guide explains how to build TensorFlow sources into a TensorFlow binary and
+how to install that TensorFlow binary. Note that we provide well-tested,
+pre-built TensorFlow binaries for Ubuntu, macOS, and Windows systems. In
+addition, there are pre-built TensorFlow
+[docker images](https://hub.docker.com/r/tensorflow/tensorflow/). So, don't
+build a TensorFlow binary yourself unless you are very comfortable building
+complex packages from source and dealing with the inevitable aftermath should
+things not go exactly as documented.
 
-If the last paragraph didn't scare you off, welcome.  This guide explains
-how to build TensorFlow on 64-bit desktops and laptops running either of
-the following operating systems:
+If the last paragraph didn't scare you off, welcome. This guide explains how to
+build TensorFlow on 64-bit desktops and laptops running either of the following
+operating systems:
 
 *   Ubuntu
 *   macOS X
 
-Note: Some users have successfully built and installed TensorFlow from
-sources on non-supported systems.  Please remember that we do not fix
-issues stemming from these attempts.
+Note: Some users have successfully built and installed TensorFlow from sources
+on non-supported systems. Please remember that we do not fix issues stemming
+from these attempts.
 
-We **do not support** building TensorFlow on Windows. That said, if you'd
-like to try to build TensorFlow on Windows anyway, use either of the
-following:
+We **do not support** building TensorFlow on Windows. That said, if you'd like
+to try to build TensorFlow on Windows anyway, use either of the following:
 
 *   [Bazel on Windows](https://bazel.build/versions/master/docs/windows.html)
 *   [TensorFlow CMake build](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake)
@@ -32,38 +31,33 @@ instructions. Older CPUs may not be able to execute these binaries.
 
 ## Determine which TensorFlow to install
 
-You must choose one of the following types of TensorFlow to build and
-install:
-
-* **TensorFlow with CPU support only**. If your system does not have a
-  NVIDIA® GPU, build and install this version. Note that this version of
-  TensorFlow is typically easier to build and install, so even if you
-  have an NVIDIA GPU, we recommend building and installing this version
-  first.
-* **TensorFlow with GPU support**. TensorFlow programs typically run
-  significantly faster on a GPU than on a CPU. Therefore, if your system
-  has a NVIDIA GPU and you need to run performance-critical applications,
-  you should ultimately build and install this version.
-  Beyond the NVIDIA GPU itself, your system must also fulfill the NVIDIA
-  software requirements described in one of the following documents:
+You must choose one of the following types of TensorFlow to build and install:
 
-  * @{$install_linux#NVIDIARequirements$Installing TensorFlow on Ubuntu}
-  * @{$install_mac#NVIDIARequirements$Installing TensorFlow on macOS}
+*   **TensorFlow with CPU support only**. If your system does not have a NVIDIA®
+    GPU, build and install this version. Note that this version of TensorFlow is
+    typically easier to build and install, so even if you have an NVIDIA GPU, we
+    recommend building and installing this version first.
+*   **TensorFlow with GPU support**. TensorFlow programs typically run
+    significantly faster on a GPU than on a CPU. Therefore, if your system has a
+    NVIDIA GPU and you need to run performance-critical applications, you should
+    ultimately build and install this version. Beyond the NVIDIA GPU itself,
+    your system must also fulfill the NVIDIA software requirements described in
+    one of the following documents:
 
+    *   @ {$install_linux#NVIDIARequirements$Installing TensorFlow on Ubuntu}
+    *   @ {$install_mac#NVIDIARequirements$Installing TensorFlow on macOS}
 
 ## Clone the TensorFlow repository
 
-Start the process of building TensorFlow by cloning a TensorFlow
-repository.
+Start the process of building TensorFlow by cloning a TensorFlow repository.
 
 To clone **the latest** TensorFlow repository, issue the following command:
 
 <pre>$ <b>git clone https://github.com/tensorflow/tensorflow</b> </pre>
 
-The preceding <code>git clone</code> command creates a subdirectory
-named `tensorflow`.  After cloning, you may optionally build a
-**specific branch** (such as a release branch) by invoking the
-following commands:
+The preceding <code>git clone</code> command creates a subdirectory named
+`tensorflow`. After cloning, you may optionally build a **specific branch**
+(such as a release branch) by invoking the following commands:
 
 <pre>
 $ <b>cd tensorflow</b>
@@ -75,38 +69,34 @@ issue the following command:
 
 <pre>$ <b>git checkout r1.0</b></pre>
 
-Next, you must prepare your environment for
-[Linux](#PrepareLinux)
-or
+Next, you must prepare your environment for [Linux](#PrepareLinux) or
 [macOS](#PrepareMac)
 
-
 <a name="PrepareLinux"></a>
-## Prepare environment for Linux
 
-Before building TensorFlow on Linux, install the following build
-tools on your system:
+## Prepare environment for Linux
 
-  * bazel
-  * TensorFlow Python dependencies
-  * optionally, NVIDIA packages to support TensorFlow for GPU.
+Before building TensorFlow on Linux, install the following build tools on your
+system:
 
+*   bazel
+*   TensorFlow Python dependencies
+*   optionally, NVIDIA packages to support TensorFlow for GPU.
 
 ### Install Bazel
 
 If bazel is not installed on your system, install it now by following
 [these directions](https://bazel.build/versions/master/docs/install.html).
 
-
 ### Install TensorFlow Python dependencies
 
 To install TensorFlow, you must install the following packages:
 
-  * `numpy`, which is a numerical processing package that TensorFlow requires.
-  * `dev`, which enables adding extensions to Python.
-  * `pip`, which enables you to install and manage certain Python packages.
-  * `wheel`, which enables you to manage Python compressed packages in
-    the wheel (.whl) format.
+*   `numpy`, which is a numerical processing package that TensorFlow requires.
+*   `dev`, which enables adding extensions to Python.
+*   `pip`, which enables you to install and manage certain Python packages.
+*   `wheel`, which enables you to manage Python compressed packages in the wheel
+    (.whl) format.
 
 To install these packages for Python 2.7, issue the following command:
 
@@ -120,68 +110,70 @@ To install these packages for Python 3.n, issue the following command:
 $ <b>sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel</b>
 </pre>
 
-
 ### Optional: install TensorFlow for GPU prerequisites
 
 If you are building TensorFlow without GPU support, skip this section.
 
-The following NVIDIA <i>hardware</i> must be installed on your system:
-
-  * GPU card with CUDA Compute Capability 3.0 or higher.  See
-    [NVIDIA documentation](https://developer.nvidia.com/cuda-gpus)
-    for a list of supported GPU cards.
-
-The following NVIDIA <i>software</i> must be installed on your system:
-
-  * [CUDA Toolkit](http://nvidia.com/cuda) (>= 8.0). We recommend version 9.0.
-    For details, see
-    [NVIDIA's documentation](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
-    Ensure that you append the relevant CUDA pathnames to the
-    `LD_LIBRARY_PATH` environment variable as described in the
-    NVIDIA documentation.
-  * [GPU drivers](http://nvidia.com/driver) supporting your version of the CUDA
-    Toolkit.
-  * [cuDNN SDK](http://developer.nvidia.com/cudnn) (>= 6.0). We recommend version 7.0. For details, see
-    [NVIDIA's documentation](http://docs.nvidia.com/deeplearning/sdk/cudnn-install/).
-  * [CUPTI](http://docs.nvidia.com/cuda/cupti/) ships with the CUDA Toolkit, but
-    you also need to append its path to the `LD_LIBRARY_PATH` environment
-    variable:
+The following NVIDIA® <i>hardware</i> must be installed on your system:
 
-    <pre> $ <b>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64</b> </pre>
+*   GPU card with CUDA Compute Capability 3.5 or higher. See
+    [NVIDIA documentation](https://developer.nvidia.com/cuda-gpus) for a list of
+    supported GPU cards.
+
+The following NVIDIA® <i>software</i> must be installed on your system:
+
+*   [GPU drivers](http://nvidia.com/driver). CUDA 9.0 requires 384.x or higher.
+*   [CUDA Toolkit](http://nvidia.com/cuda) (>= 8.0). We recommend version 9.0.
+*   [cuDNN SDK](http://developer.nvidia.com/cudnn) (>= 6.0). We recommend
+    version 7.1.x.
+*   [CUPTI](http://docs.nvidia.com/cuda/cupti/) ships with the CUDA Toolkit, but
+    you also need to append its path to the `LD_LIBRARY_PATH` environment
+    variable: `export
+    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64`
+*   *OPTIONAL*: [NCCL 2.2](https://developer.nvidia.com/nccl) to use TensorFlow
+    with multiple GPUs.
+*   *OPTIONAL*:
+    [TensorRT](http://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html)
+    which can improve latency and throughput for inference for some models.
+
+While it is possible to install the NVIDIA libraries via `apt-get` from the
+NVIDIA repository, the libraries and headers are installed in locations that
+make it difficult to configure and debug build issues. Downloading and
+installing the libraries manually or using docker
+([latest-devel-gpu](https://hub.docker.com/r/tensorflow/tensorflow/tags/)) is
+recommended.
 
 ### Next
 
 After preparing the environment, you must now
 [configure the installation](#ConfigureInstallation).
 
-
 <a name="PrepareMac"></a>
+
 ## Prepare environment for macOS
 
 Before building TensorFlow, you must install the following on your system:
 
-  * bazel
-  * TensorFlow Python dependencies.
-  * optionally, NVIDIA packages to support TensorFlow for GPU.
-
+*   bazel
+*   TensorFlow Python dependencies.
+*   optionally, NVIDIA packages to support TensorFlow for GPU.
 
 ### Install bazel
 
 If bazel is not installed on your system, install it now by following
 [these directions](https://bazel.build/versions/master/docs/install.html#mac-os-x).
 
-
 ### Install python dependencies
 
 To build TensorFlow, you must install the following packages:
 
-  * six
-  * numpy, which is a numerical processing package that TensorFlow requires.
-  * wheel, which enables you to manage Python compressed packages
-    in the wheel (.whl) format.
+*   six
+*   numpy, which is a numerical processing package that TensorFlow requires.
+*   wheel, which enables you to manage Python compressed packages in the wheel
+    (.whl) format.
 
-You may install the python dependencies using pip. If you don't have pip
-on your machine, we recommend using homebrew to install Python and pip as
+You may install the python dependencies using pip. If you don't have pip on your
+machine, we recommend using homebrew to install Python and pip as
 [documented here](http://docs.python-guide.org/en/latest/starting/install/osx/).
 If you follow these instructions, you will not need to disable SIP.
 
@@ -192,22 +184,23 @@ After installing pip, invoke the following commands:
 Note: These are just the minimum requirements to _build_ tensorflow. Installing
 the pip package will download additional packages required to _run_ it. If you
 plan on executing tasks directly with `bazel` , without the pip installation,
-you may need to install additional python packages. For example, you should
-`pip install mock enum34` before running TensorFlow's tests with bazel.
+you may need to install additional python packages. For example, you should `pip
+install mock enum34` before running TensorFlow's tests with bazel.
 
 <a name="ConfigureInstallation"></a>
+
 ## Configure the installation
 
-The root of the source tree contains a bash script named
-<code>configure</code>. This script asks you to identify the pathname of all
-relevant TensorFlow dependencies and specify other build configuration options
-such as compiler flags. You must run this script *prior* to
-creating the pip package and installing TensorFlow.
+The root of the source tree contains a bash script named <code>configure</code>.
+This script asks you to identify the pathname of all relevant TensorFlow
+dependencies and specify other build configuration options such as compiler
+flags. You must run this script *prior* to creating the pip package and
+installing TensorFlow.
 
-If you wish to build TensorFlow with GPU, `configure` will ask
-you to specify the version numbers of CUDA and cuDNN. If several
-versions of CUDA or cuDNN are installed on your system, explicitly select
-the desired version instead of relying on the default.
+If you wish to build TensorFlow with GPU, `configure` will ask you to specify
+the version numbers of CUDA and cuDNN. If several versions of CUDA or cuDNN are
+installed on your system, explicitly select the desired version instead of
+relying on the default.
 
 One of the questions that `configure` will ask is as follows:
 
@@ -215,73 +208,117 @@ One of the questions that `configure` will ask is as follows:
 Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]
 </pre>
 
-This question refers to a later phase in which you'll use bazel to [build the
-pip package](#build-the-pip-package) or the [C/Java libraries](#BuildCorJava).
-We recommend accepting the default (`-march=native`), which will optimize the
-generated code for your local machine's CPU type.  However, if you are building
-TensorFlow on one CPU type but will run TensorFlow on a different CPU type, then
-consider specifying a more specific optimization
-flag as described in [the gcc
-documentation](https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html).
+This question refers to a later phase in which you'll use bazel to
+[build the pip package](#build-the-pip-package) or the
+[C/Java libraries](#BuildCorJava). We recommend accepting the default
+(`-march=native`), which will optimize the generated code for your local
+machine's CPU type. However, if you are building TensorFlow on one CPU type but
+will run TensorFlow on a different CPU type, then consider specifying a more
+specific optimization flag as described in
+[the gcc documentation](https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html).
 
-Here is an example execution of the `configure` script.  Note that your
-own input will likely differ from our sample input:
+Here is an example execution of the `configure` script. Note that your own input
+will likely differ from our sample input:
 
 <pre>
 $ <b>cd tensorflow</b>  # cd to the top-level directory created
 $ <b>./configure</b>
+You have bazel 0.15.0 installed.
 Please specify the location of python. [Default is /usr/bin/python]: <b>/usr/bin/python2.7</b>
+
+
 Found possible Python library paths:
   /usr/local/lib/python2.7/dist-packages
   /usr/lib/python2.7/dist-packages
 Please input the desired Python library path to use.  Default is [/usr/lib/python2.7/dist-packages]
 
-Using python library path: /usr/local/lib/python2.7/dist-packages
-Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
-Do you wish to use jemalloc as the malloc implementation? [Y/n]
-jemalloc enabled
-Do you wish to build TensorFlow with Google Cloud Platform support? [y/N]
-No Google Cloud Platform support will be enabled for TensorFlow
-Do you wish to build TensorFlow with Hadoop File System support? [y/N]
-No Hadoop File System support will be enabled for TensorFlow
-Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N]
-No XLA support will be enabled for TensorFlow
-Do you wish to build TensorFlow with VERBS support? [y/N]
-No VERBS support will be enabled for TensorFlow
-Do you wish to build TensorFlow with OpenCL support? [y/N]
-No OpenCL support will be enabled for TensorFlow
-Do you wish to build TensorFlow with CUDA support? [y/N] <b>Y</b>
-CUDA support will be enabled for TensorFlow
-Do you want to use clang as CUDA compiler? [y/N]
-nvcc will be used as CUDA compiler
+Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]:
+jemalloc as malloc support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]:
+Google Cloud Platform support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with Hadoop File System support? [Y/n]:
+Hadoop File System support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with Amazon AWS Platform support? [Y/n]:
+Amazon AWS Platform support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]:
+Apache Kafka Platform support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with XLA JIT support? [y/N]:
+No XLA JIT support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with GDR support? [y/N]:
+No GDR support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with VERBS support? [y/N]:
+No VERBS support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
+No OpenCL SYCL support will be enabled for TensorFlow.
+
+Do you wish to build TensorFlow with CUDA support? [y/N]: <b>Y</b>
+CUDA support will be enabled for TensorFlow.
+
 Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: <b>9.0</b>
+
+
 Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
-Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
-Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: <b>7</b>
+
+
+Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: <b>7.0</b>
+
+
 Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
-Please specify a list of comma-separated CUDA compute capabilities you want to build with.
+
+
+Do you wish to build TensorFlow with TensorRT support? [y/N]:
+No TensorRT support will be enabled for TensorFlow.
+
+Please specify the NCCL version you want to use. If NCLL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]: 1.3
+
+
+Please specify a list of comma-separated Cuda compute capabilities you want to build with.
 You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
-Please note that each additional compute capability significantly increases your build time and binary size.
-[Default is: "3.5,5.2"]: <b>3.0</b>
-Do you wish to build TensorFlow with MPI support? [y/N]
-MPI support will not be enabled for TensorFlow
+Please note that each additional compute capability significantly increases your
+build time and binary size. [Default is: 3.5,7.0] <b>6.1</b>
+
+
+Do you want to use clang as CUDA compiler? [y/N]:
+nvcc will be used as CUDA compiler.
+
+Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
+
+
+Do you wish to build TensorFlow with MPI support? [y/N]:
+No MPI support will be enabled for TensorFlow.
+
+Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
+
+
+Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
+Not configuring the WORKSPACE for Android builds.
+
+Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
+    --config=mkl            # Build with MKL support.
+    --config=monolithic     # Config for mostly static monolithic build.
 Configuration finished
 </pre>
 
-If you told `configure` to build for GPU support, then `configure`
-will create a canonical set of symbolic links to the CUDA libraries
-on your system.  Therefore, every time you change the CUDA library paths,
-you must rerun the `configure` script before re-invoking
-the <code>bazel build</code> command.
+If you told `configure` to build for GPU support, then `configure` will create a
+canonical set of symbolic links to the CUDA libraries on your system. Therefore,
+every time you change the CUDA library paths, you must rerun the `configure`
+script before re-invoking the <code>bazel build</code> command.
 
 Note the following:
 
-  * Although it is possible to build both CUDA and non-CUDA configs
-    under the same source tree, we recommend running `bazel clean` when
-    switching between these two configurations in the same source tree.
-  * If you don't run the `configure` script *before* running the
-    `bazel build` command, the `bazel build` command will fail.
-
+*   Although it is possible to build both CUDA and non-CUDA configs under the
+    same source tree, we recommend running `bazel clean` when switching between
+    these two configurations in the same source tree.
+*   If you don't run the `configure` script *before* running the `bazel build`
+    command, the `bazel build` command will fail.
 
 ## Build the pip package
 
@@ -297,7 +334,8 @@ To build a pip package for TensorFlow with CPU-only support:
 $ bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
 </pre>
 
-To build a pip package for TensorFlow with CPU-only support for the Intel® MKL-DNN:
+To build a pip package for TensorFlow with CPU-only support for the Intel®
+MKL-DNN:
 
 <pre>
 $ bazel build --config=mkl --config=opt //tensorflow/tools/pip_package:build_pip_package
@@ -311,37 +349,35 @@ To build a pip package for TensorFlow with GPU support:
 $ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
 </pre>
 
-**NOTE on gcc 5 or later:** the binary pip packages available on the
-TensorFlow website are built with gcc 4, which uses the older ABI. To
-make your build compatible with the older ABI, you need to add
-`--cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"` to your `bazel build` command.
-ABI compatibility allows custom ops built against the TensorFlow pip package
-to continue to work against your built package.
+**NOTE on gcc 5 or later:** the binary pip packages available on the TensorFlow
+website are built with gcc 4, which uses the older ABI. To make your build
+compatible with the older ABI, you need to add
+`--cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"` to your `bazel build` command. ABI
+compatibility allows custom ops built against the TensorFlow pip package to
+continue to work against your built package.
 
-<b>Tip:</b> By default, building TensorFlow from sources consumes
-a lot of RAM.  If RAM is an issue on your system, you may limit RAM usage
-by specifying <code>--local_resources 2048,.5,1.0</code> while
-invoking `bazel`.
+<b>Tip:</b> By default, building TensorFlow from sources consumes a lot of RAM.
+If RAM is an issue on your system, you may limit RAM usage by specifying
+<code>--local_resources 2048,.5,1.0</code> while invoking `bazel`.
 
-The <code>bazel build</code> command builds a script named
-`build_pip_package`.  Running this script as follows will build
-a `.whl` file within the `/tmp/tensorflow_pkg` directory:
+The <code>bazel build</code> command builds a script named `build_pip_package`.
+Running this script as follows will build a `.whl` file within the
+`/tmp/tensorflow_pkg` directory:
 
 <pre>
 $ <b>bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg</b>
 </pre>
 
-
 ## Install the pip package
 
-Invoke `pip install` to install that pip package.
-The filename of the `.whl` file depends on your platform.
-For example, the following command will install the pip package
+Invoke `pip install` to install that pip package. The filename of the `.whl`
+file depends on your platform. For example, the following command will install
+the pip package
 
-for TensorFlow 1.9.0rc0 on Linux:
+for TensorFlow 1.9.0 on Linux:
 
 <pre>
-$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.9.0rc0-py2-none-any.whl</b>
+$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.9.0-py2-none-any.whl</b>
 </pre>
 
 ## Validate your installation
@@ -372,28 +408,31 @@ TensorFlow programs:
 
 <pre>Hello, TensorFlow!</pre>
 
-To learn more, see [Get Started with TensorFlow](https://www.tensorflow.org/get_started).
+To learn more, see the [TensorFlow tutorials](../tutorials/).
 
-If the system outputs an error message instead of a greeting, see [Common
-installation problems](#common_installation_problems).
+If the system outputs an error message instead of a greeting, see
+[Common installation problems](#common_installation_problems).
 
 ## Common build and installation problems
 
 The build and installation problems you encounter typically depend on the
-operating system.  See the "Common installation problems" section
-of one of the following guides:
-
-  * @{$install_linux#common_installation_problems$Installing TensorFlow on Linux}
-  * @{$install_mac#common_installation_problems$Installing TensorFlow on Mac OS}
-  * @{$install_windows#common_installation_problems$Installing TensorFlow on Windows}
-
-Beyond the errors documented in those two guides, the following table
-notes additional errors specific to building TensorFlow.  Note that we
-are relying on Stack Overflow as the repository for build and installation
-problems.  If you encounter an error message not listed in the preceding
-two guides or in the following table, search for it on Stack Overflow.  If
-Stack Overflow doesn't show the error message, ask a new question on
-Stack Overflow and specify the `tensorflow` tag.
+operating system. See the "Common installation problems" section of one of the
+following guides:
+
+*   @
+    {$install_linux#common_installation_problems$Installing TensorFlow on Linux}
+*   @
+    {$install_mac#common_installation_problems$Installing TensorFlow on Mac OS}
+*   @
+    {$install_windows#common_installation_problems$Installing TensorFlow on Windows}
+
+Beyond the errors documented in those two guides, the following table notes
+additional errors specific to building TensorFlow. Note that we are relying on
+Stack Overflow as the repository for build and installation problems. If you
+encounter an error message not listed in the preceding two guides or in the
+following table, search for it on Stack Overflow. If Stack Overflow doesn't show
+the error message, ask a new question on Stack Overflow and specify the
+`tensorflow` tag.
 
 <table>
 <tr> <th>Stack Overflow Link</th> <th>Error Message</th> </tr>
@@ -440,6 +479,7 @@ Stack Overflow and specify the `tensorflow` tag.
 </table>
 
 ## Tested source configurations
+
 **Linux**
 <table>
 <tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
@@ -508,6 +548,7 @@ Stack Overflow and specify the `tensorflow` tag.
 </table>
 
 <a name="BuildCorJava"></a>
+
 ## Build the C or Java libraries
 
 The instructions above are tailored to building the TensorFlow Python packages.
@@ -516,10 +557,12 @@ If you're interested in building the libraries for the TensorFlow C API, do the
 following:
 
 1.  Follow the steps up to [Configure the installation](#ConfigureInstallation)
-2.  Build the C libraries following instructions in the [README](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/lib_package/README.md).
+2.  Build the C libraries following instructions in the
+    [README](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/lib_package/README.md).
 
-If you're interested inv building the libraries for the TensorFlow Java API,
-do the following:
+If you're interested inv building the libraries for the TensorFlow Java API, do
+the following:
 
 1.  Follow the steps up to [Configure the installation](#ConfigureInstallation)
-2.  Build the Java library following instructions in the [README](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/lib_package/README.md).
+2.  Build the Java library following instructions in the
+    [README](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/lib_package/README.md).
diff --git a/tensorflow/docs_src/install/install_windows.md b/tensorflow/docs_src/install/install_windows.md
index 7fe94f0bc3..e9061bf3c1 100644
--- a/tensorflow/docs_src/install/install_windows.md
+++ b/tensorflow/docs_src/install/install_windows.md
@@ -1,4 +1,4 @@
-# Installing TensorFlow on Windows
+# Install TensorFlow on Windows
 
 This guide explains how to install TensorFlow on Windows. Although these
 instructions might also work on other Windows variants, we have only
@@ -157,7 +157,7 @@ TensorFlow programs:
 If the system outputs an error message instead of a greeting, see [Common
 installation problems](#common_installation_problems).
 
-To learn more, see [Get Started with TensorFlow](https://www.tensorflow.org/get_started).
+To learn more, see the [TensorFlow tutorials](../tutorials/).
 
 ## Common installation problems
 
diff --git a/tensorflow/docs_src/install/migration.md b/tensorflow/docs_src/install/migration.md
index d6c31f96bd..19315ace2d 100644
--- a/tensorflow/docs_src/install/migration.md
+++ b/tensorflow/docs_src/install/migration.md
@@ -1,5 +1,4 @@
-
-# Transitioning to TensorFlow 1.0
+# Transition to TensorFlow 1.0
 
 
 The APIs in TensorFlow 1.0 have changed in ways that are not all backwards
diff --git a/tensorflow/docs_src/javascript/index.md b/tensorflow/docs_src/javascript/index.md
deleted file mode 100644
index ad63eeb255..0000000000
--- a/tensorflow/docs_src/javascript/index.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# JavaScript 
-
-You may develop TensorFlow programs in JavaScript, training and deploying
-models right in your browser.  For details, see
-[js.tensorflow.org](https://js.tensorflow.org).
diff --git a/tensorflow/docs_src/javascript/leftnav_files b/tensorflow/docs_src/javascript/leftnav_files
deleted file mode 100644
index fc0ab8a543..0000000000
--- a/tensorflow/docs_src/javascript/leftnav_files
+++ /dev/null
@@ -1 +0,0 @@
-index.md
diff --git a/tensorflow/docs_src/mobile/README.md b/tensorflow/docs_src/mobile/README.md
new file mode 100644
index 0000000000..ecf4267265
--- /dev/null
+++ b/tensorflow/docs_src/mobile/README.md
@@ -0,0 +1,3 @@
+# TF Lite subsite
+
+This subsite directory lives in [tensorflow/contrib/lite/g3doc](../../contrib/lite/g3doc/).
diff --git a/tensorflow/docs_src/mobile/index.md b/tensorflow/docs_src/mobile/index.md
deleted file mode 100644
index 419ae7094a..0000000000
--- a/tensorflow/docs_src/mobile/index.md
+++ /dev/null
@@ -1,36 +0,0 @@
-# Overview
-
-TensorFlow was designed to be a good deep learning solution for mobile
-platforms. Currently we have two solutions for deploying machine learning
-applications on mobile and embedded devices:
-@{$mobile/mobile_intro$TensorFlow for Mobile} and @{$mobile/tflite$TensorFlow Lite}.
-
-## TensorFlow Lite versus TensorFlow Mobile
-
-Here are a few of the differences between the two:
-
-- TensorFlow Lite is an evolution of TensorFlow Mobile.  In most cases, apps
-  developed with TensorFlow Lite will have a smaller binary size, fewer
-  dependencies, and better performance.
-
-- TensorFlow Lite is in developer preview, so not all use cases are covered yet.
-  We expect you to use TensorFlow Mobile to cover production cases.
-
-- TensorFlow Lite supports only a limited set of operators, so not all models
-  will work on it by default. TensorFlow for Mobile has a fuller set of
-  supported functionality.
-
-TensorFlow Lite provides better performance and a small binary size on mobile
-platforms as well as the ability to leverage hardware acceleration if available
-on their platforms. In addition, it has many fewer dependencies so it can be
-built and hosted on simpler, more constrained device scenarios. TensorFlow Lite
-also allows targeting accelerators through the [Neural Networks
-API](https://developer.android.com/ndk/guides/neuralnetworks/index.html).
-
-TensorFlow Lite currently has coverage for a limited set of operators. While
-TensorFlow for Mobile supports only a constrained set of ops by default, in
-principle if you use an arbitrary operator in TensorFlow, it can be customized
-to build that kernel. Thus use cases which are not currently supported by
-TensorFlow Lite should continue to use TensorFlow for Mobile. As TensorFlow Lite
-evolves, it will gain additional operators, and the decision will be easier to
-make.
diff --git a/tensorflow/docs_src/mobile/leftnav_files b/tensorflow/docs_src/mobile/leftnav_files
deleted file mode 100644
index 585470d5f0..0000000000
--- a/tensorflow/docs_src/mobile/leftnav_files
+++ /dev/null
@@ -1,14 +0,0 @@
-index.md
-### TensorFlow Lite
-tflite/index.md
-tflite/devguide.md
-tflite/demo_android.md
-tflite/demo_ios.md
->>>
-### TensorFlow Mobile
-mobile_intro.md
-android_build.md
-ios_build.md
-linking_libs.md
-prepare_models.md
-optimizing.md
diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md
index cb0f5ca924..dafacbe379 100644
--- a/tensorflow/docs_src/performance/performance_guide.md
+++ b/tensorflow/docs_src/performance/performance_guide.md
@@ -464,7 +464,7 @@ equal to the number of physical cores rather than logical cores.
   config = tf.ConfigProto()
   config.intra_op_parallelism_threads = 44
   config.inter_op_parallelism_threads = 44
-  tf.session(config=config)
+  tf.Session(config=config)
 
 ```
 
diff --git a/tensorflow/docs_src/performance/xla/broadcasting.md b/tensorflow/docs_src/performance/xla/broadcasting.md
index eaa709c2f8..7018ded53f 100644
--- a/tensorflow/docs_src/performance/xla/broadcasting.md
+++ b/tensorflow/docs_src/performance/xla/broadcasting.md
@@ -99,7 +99,7 @@ dimensions 1 and 2 of the cuboid.
 
 This type of broadcast is used in the binary ops in `XlaBuilder`, if the
 `broadcast_dimensions` argument is given. For example, see
-[XlaBuilder::Add](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.cc).
+[XlaBuilder::Add](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.cc).
 In the XLA source code, this type of broadcasting is sometimes called "InDim"
 broadcasting.
 
diff --git a/tensorflow/docs_src/performance/xla/developing_new_backend.md b/tensorflow/docs_src/performance/xla/developing_new_backend.md
index 74ea15bb2b..840f6983c2 100644
--- a/tensorflow/docs_src/performance/xla/developing_new_backend.md
+++ b/tensorflow/docs_src/performance/xla/developing_new_backend.md
@@ -44,7 +44,7 @@ It is possible to model a new
 implementation on the existing [`xla::CPUCompiler`]
 (https://www.tensorflow.org/code/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc)
 and [`xla::GPUCompiler`]
-(https://www.tensorflow.org/code/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc)
+(https://www.tensorflow.org/code/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc)
 classes, since these already emit LLVM IR. Depending on the nature of the
 hardware, it is possible that many of the LLVM IR generation aspects will have
 to be changed, but a lot of code can be shared with the existing backends.
diff --git a/tensorflow/docs_src/performance/xla/jit.md b/tensorflow/docs_src/performance/xla/jit.md
index 6724d1eaf8..7202ef47f7 100644
--- a/tensorflow/docs_src/performance/xla/jit.md
+++ b/tensorflow/docs_src/performance/xla/jit.md
@@ -19,10 +19,11 @@ on the `XLA_CPU` or `XLA_GPU` TensorFlow devices. Placing operators directly on
 a TensorFlow XLA device forces the operator to run on that device and is mainly
 used for testing.
 
-> Note: The XLA CPU backend produces fast single-threaded code (in most cases),
-> but does not yet parallelize as well as the TensorFlow CPU backend. The XLA
-> GPU backend is competitive with the standard TensorFlow implementation,
-> sometimes faster, sometimes slower.
+> Note: The XLA CPU backend supports intra-op parallelism (i.e. it can shard a
+> single operation across multiple cores) but it does not support inter-op
+> parallelism (i.e. it cannot execute independent operations concurrently across
+> multiple cores).  The XLA GPU backend is competitive with the standard
+> TensorFlow implementation, sometimes faster, sometimes slower.
 
 ### Turning on JIT compilation
 
@@ -55,8 +56,7 @@ sess = tf.Session(config=config)
 
 > Note: Turning on JIT at the session level will not result in operations being
 > compiled for the CPU. JIT compilation for CPU operations must be done via
-> the manual method documented below. This decision was made due to the CPU
-> backend being single-threaded.
+> the manual method documented below.
 
 #### Manual
 
diff --git a/tensorflow/docs_src/performance/xla/operation_semantics.md b/tensorflow/docs_src/performance/xla/operation_semantics.md
index 4c4f3f3934..edc777a3c7 100644
--- a/tensorflow/docs_src/performance/xla/operation_semantics.md
+++ b/tensorflow/docs_src/performance/xla/operation_semantics.md
@@ -1,7 +1,7 @@
 # Operation Semantics
 
 The following describes the semantics of operations defined in the
-[`XlaBuilder`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h)
+[`XlaBuilder`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
 interface. Typically, these operations map one-to-one to operations defined in
 the RPC interface in
 [`xla_data.proto`](https://www.tensorflow.org/code/tensorflow/compiler/xla/xla_data.proto).
@@ -16,7 +16,7 @@ and familiar names; for example a *vector* is a 1-dimensional array and a
 ## BatchNormGrad
 
 See also
-[`XlaBuilder::BatchNormGrad`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h)
+[`XlaBuilder::BatchNormGrad`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
 and [the original batch normalization paper](https://arxiv.org/abs/1502.03167)
 for a detailed description of the algorithm.
 
@@ -80,7 +80,7 @@ The output type is a tuple of three handles:
 ## BatchNormInference
 
 See also
-[`XlaBuilder::BatchNormInference`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h)
+[`XlaBuilder::BatchNormInference`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
 and [the original batch normalization paper](https://arxiv.org/abs/1502.03167)
 for a detailed description of the algorithm.
 
@@ -115,7 +115,7 @@ The output is an n-dimensional, normalized array with the same shape as input
 ## BatchNormTraining
 
 See also
-[`XlaBuilder::BatchNormTraining`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h)
+[`XlaBuilder::BatchNormTraining`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
 and [`the original batch normalization paper`](https://arxiv.org/abs/1502.03167)
 for a detailed description of the algorithm.
 
@@ -167,7 +167,7 @@ spatial dimensions using the formulas above.
 ## BitcastConvertType
 
 See also
-[`XlaBuilder::BitcastConvertType`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::BitcastConvertType`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Similar to a `tf.bitcast` in TensorFlow, performs an element-wise bitcast
 operation from a data shape to a target shape. The dimensions must match, and
@@ -189,7 +189,7 @@ and destination element types must not be tuples.
 ## Broadcast
 
 See also
-[`XlaBuilder::Broadcast`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Broadcast`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Adds dimensions to an array by duplicating the data in the array.
 
@@ -217,7 +217,7 @@ For example, if `operand` is a scalar `f32` with value `2.0f`, and
 ## Call
 
 See also
-[`XlaBuilder::Call`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Call`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Invokes a computation with the given arguments.
 
@@ -236,7 +236,7 @@ The arity and types of the `args` must match the parameters of the
 ## Clamp
 
 See also
-[`XlaBuilder::Clamp`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Clamp`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Clamps an operand to within the range between a minimum and maximum value.
 
@@ -269,7 +269,7 @@ Clamp(min, operand, max) = s32[3]{0, 5, 6};
 ## Collapse
 
 See also
-[`XlaBuilder::Collapse`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h)
+[`XlaBuilder::Collapse`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
 and the @{tf.reshape} operation.
 
 Collapses dimensions of an array into one dimension.
@@ -332,7 +332,7 @@ then v12 == f32[8x3] {{10, 11, 12},
 ## Concatenate
 
 See also
-[`XlaBuilder::ConcatInDim`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::ConcatInDim`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Concatenate composes an array from multiple array operands. The array is of the
 same rank as each of the input array operands (which must be of the same rank as
@@ -388,7 +388,7 @@ Diagram:
 ## Conditional
 
 See also
-[`XlaBuilder::Conditional`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Conditional`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `Conditional(pred, true_operand, true_computation, false_operand,
 false_computation)` </b>
@@ -416,7 +416,7 @@ executed depending on the value of `pred`.
 ## Conv (convolution)
 
 See also
-[`XlaBuilder::Conv`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Conv`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 As ConvWithGeneralPadding, but the padding is specified in a short-hand way as
 either SAME or VALID. SAME padding pads the input (`lhs`) with zeroes so that
@@ -426,7 +426,7 @@ account. VALID padding simply means no padding.
 ## ConvWithGeneralPadding (convolution)
 
 See also
-[`XlaBuilder::ConvWithGeneralPadding`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::ConvWithGeneralPadding`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Computes a convolution of the kind used in neural networks. Here, a convolution
 can be thought of as a n-dimensional window moving across a n-dimensional base
@@ -538,7 +538,7 @@ for (b, oz, oy, ox) {  // output coordinates
 ## ConvertElementType
 
 See also
-[`XlaBuilder::ConvertElementType`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::ConvertElementType`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Similar to an element-wise `static_cast` in C++, performs an element-wise
 conversion operation from a data shape to a target shape. The dimensions must
@@ -572,7 +572,7 @@ then b == f32[3]{0.0, 1.0, 2.0}
 ## CrossReplicaSum
 
 See also
-[`XlaBuilder::CrossReplicaSum`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::CrossReplicaSum`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Computes a sum across replicas.
 
@@ -607,7 +607,7 @@ than another.
 ## CustomCall
 
 See also
-[`XlaBuilder::CustomCall`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::CustomCall`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Call a user-provided function within a computation.
 
@@ -668,7 +668,7 @@ idempotent.
 ## Dot
 
 See also
-[`XlaBuilder::Dot`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Dot`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `Dot(lhs, rhs)` </b>
 
@@ -697,7 +697,7 @@ multiplications or matrix/matrix multiplications.
 ## DotGeneral
 
 See also
-[`XlaBuilder::DotGeneral`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::DotGeneral`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `DotGeneral(lhs, rhs, dimension_numbers)` </b>
 
@@ -784,15 +784,13 @@ non-contracting/non-batch dimension.
 ## DynamicSlice
 
 See also
-[`XlaBuilder::DynamicSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::DynamicSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 DynamicSlice extracts a sub-array from the input array at dynamic
 `start_indices`. The size of the slice in each dimension is passed in
 `size_indices`, which specify the end point of exclusive slice intervals in each
 dimension: [start, start + size). The shape of `start_indices` must be rank ==
 1, with dimension size equal to the rank of `operand`.
-Note: handling of out-of-bounds slice indices (generated by incorrect runtime
-calculation of 'start_indices') is currently implementation-defined.
 
 <b> `DynamicSlice(operand, start_indices, size_indices)` </b>
 
@@ -812,6 +810,17 @@ calculation of 'start_indices') is currently implementation-defined.
 :                 :                     : dimension to avoid wrapping modulo  :
 :                 :                     : dimension size.                     :
 
+The effective slice indices are computed by applying the following
+transformation for each index `i` in `[1, N)` before performing the slice:
+
+```
+start_indices[i] = clamp(start_indices[i], 0, operand.dimension_size[i] - size_indices[i])
+```
+
+This ensures that the extracted slice is always in-bounds with respect to the
+operand array. If the slice is in-bounds before the transformation is applied,
+the transformation has no effect.
+
 1-dimensional example:
 
 ```
@@ -839,7 +848,7 @@ DynamicSlice(b, s, {2, 2}) produces:
 ## DynamicUpdateSlice
 
 See also
-[`XlaBuilder::DynamicUpdateSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::DynamicUpdateSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 DynamicUpdateSlice generates a result which is the value of the input array
 `operand`, with a slice `update` overwritten at `start_indices`.
@@ -847,8 +856,6 @@ The shape of `update` determines the shape of the sub-array of the result which
 is updated.
 The shape of `start_indices` must be rank == 1, with dimension size equal to
 the rank of `operand`.
-Note: handling of out-of-bounds slice indices (generated by incorrect runtime
-calculation of 'start_indices') is currently implementation-defined.
 
 <b> `DynamicUpdateSlice(operand, update, start_indices)` </b>
 
@@ -866,6 +873,17 @@ calculation of 'start_indices') is currently implementation-defined.
 :                 :         : dimension. Value must be greater than or equal   :
 :                 :         : to zero.                                         :
 
+The effective slice indices are computed by applying the following
+transformation for each index `i` in `[1, N)` before performing the slice:
+
+```
+start_indices[i] = clamp(start_indices[i], 0, operand.dimension_size[i] - update.dimension_size[i])
+```
+
+This ensures that the updated slice is always in-bounds with respect to the
+operand array. If the slice is in-bounds before the transformation is applied,
+the transformation has no effect.
+
 1-dimensional example:
 
 ```
@@ -902,7 +920,7 @@ DynamicUpdateSlice(b, u, s) produces:
 ## Element-wise binary arithmetic operations
 
 See also
-[`XlaBuilder::Add`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Add`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 A set of element-wise binary arithmetic operations is supported.
 
@@ -947,7 +965,7 @@ shapes of both operands. The semantics are described in detail on the
 ## Element-wise comparison operations
 
 See also
-[`XlaBuilder::Eq`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Eq`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 A set of standard element-wise binary comparison operations is supported. Note
 that standard IEEE 754 floating-point comparison semantics apply when comparing
@@ -1033,7 +1051,7 @@ potentially different runtime offset) of an input tensor into an output tensor.
 ### General Semantics
 
 See also
-[`XlaBuilder::Gather`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Gather`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 For a more intuitive description, see the "Informal Description" section below.
 
 <b> `gather(operand, gather_indices, output_window_dims, elided_window_dims, window_bounds, gather_dims_to_operand_dims)` </b>
@@ -1236,7 +1254,7 @@ concatenation of all these rows.
 ## GetTupleElement
 
 See also
-[`XlaBuilder::GetTupleElement`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::GetTupleElement`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Indexes into a tuple with a compile-time-constant value.
 
@@ -1257,7 +1275,7 @@ See also @{tf.tuple}.
 ## Infeed
 
 See also
-[`XlaBuilder::Infeed`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Infeed`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `Infeed(shape)` </b>
 
@@ -1293,17 +1311,30 @@ Infeed of the device.
 > which case the compiler will provide information about how the Infeed
 > operations are serialized in the compiled program.
 
+## Iota
+
+<b> `Iota()` </b>
+
+Builds a constant literal on device rather than a potentially large host
+transfer.  Creates a rank 1 tensor of values starting at zero and incrementing
+by one.
+
+Arguments          | Type            | Semantics
+------------------ | --------------- | ---------------------------
+`type`             | `PrimitiveType` | type U
+`size`             | `int64`         | The number of elements in the tensor.
+
 ## Map
 
 See also
-[`XlaBuilder::Map`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Map`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `Map(operands..., computation)` </b>
 
 | Arguments         | Type                   | Semantics                      |
 | ----------------- | ---------------------- | ------------------------------ |
 | `operands`        | sequence of N `XlaOp`s | N arrays of types T_0..T_{N-1} |
-| `computation`     | `XlaComputation`        | computation of type `T_0, T_1, |
+| `computation`     | `XlaComputation`       | computation of type `T_0, T_1, |
 :                   :                        : ..., T_{N + M -1} -> S` with N :
 :                   :                        : parameters of type T and M of  :
 :                   :                        : arbitrary type                 :
@@ -1325,7 +1356,7 @@ input arrays to produce the output array.
 ## Pad
 
 See also
-[`XlaBuilder::Pad`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Pad`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `Pad(operand, padding_value, padding_config)` </b>
 
@@ -1364,7 +1395,7 @@ are all 0. The figure below shows examples of different `edge_padding` and
 ## Recv
 
 See also
-[`XlaBuilder::Recv`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Recv`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `Recv(shape, channel_handle)` </b>
 
@@ -1398,21 +1429,31 @@ complete and returns the received data.
 ## Reduce
 
 See also
-[`XlaBuilder::Reduce`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Reduce`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
+
+Applies a reduction function to one or more arrays in parallel.
+
+<b> `Reduce(operands..., init_values..., computation, dimensions)` </b>
 
-Applies a reduction function to an array.
+Arguments     | Type                  | Semantics
+------------- | --------------------- | ---------------------------------------
+`operands`    | Sequence of N `XlaOp` | N arrays of types `T_0, ..., T_N`.
+`init_values` | Sequence of N `XlaOp` | N scalars of types `T_0, ..., T_N`.
+`computation` | `XlaComputation`      | computation of type
+              :                       : `T_0, ..., T_N, T_0, ..., T_N -> Collate(T_0, ..., T_N)`
+`dimensions`  | `int64` array         | unordered array of dimensions to reduce
 
-<b> `Reduce(operand, init_value, computation, dimensions)` </b>
+Where:
+* N is required to be greater or equal to 1.
+* All input arrays must have the same dimensions.
+* If `N = 1`, `Collate(T)` is `T`.
+* If `N > 1`, `Collate(T_0, ..., T_N)` is a tuple of `N` elements of type `T`.
 
-Arguments     | Type             | Semantics
-------------- | ---------------- | ---------------------------------------
-`operand`     | `XlaOp`          | array of type `T`
-`init_value`  | `XlaOp`          | scalar of type `T`
-`computation` | `XlaComputation` | computation of type `T, T -> T`
-`dimensions`  | `int64` array    | unordered array of dimensions to reduce
+The output of the op is `Collate(Q_0, ..., Q_N)` where `Q_i` is an array of type
+`T_i`, the dimensions of which are described below.
 
-This operation reduces one or more dimensions of the input array into scalars.
-The rank of the returned array is `rank(operand) - len(dimensions)`.
+This operation reduces one or more dimensions of each input array into scalars.
+The rank of each returned array is `rank(operand) - len(dimensions)`.
 `init_value` is the initial value used for every reduction and may be inserted
 anywhere during computation by the back-end. In most cases, `init_value` is an
 identity of the reduction function (for example, 0 for addition). The applied
@@ -1428,9 +1469,9 @@ enough to being associative for most practical uses. It is possible to conceive
 of some completely non-associative reductions, however, and these will produce
 incorrect or unpredictable results in XLA reductions.
 
-As an example, when reducing across the one dimension in a 1D array with values
-[10, 11, 12, 13], with reduction function `f` (this is `computation`) then that
-could be computed as
+As an example, when reducing across one dimension in a single 1D array with
+values [10, 11, 12, 13], with reduction function `f` (this is `computation`)
+then that could be computed as
 
 `f(10, f(11, f(12, f(init_value, 13)))`
 
@@ -1512,10 +1553,38 @@ the 1D array `| 20 28 36 |`.
 
 Reducing the 3D array over all its dimensions produces the scalar `84`.
 
+When `N > 1`, reduce function application is slightly more complex, as it is
+applied simultaneously to all inputs. For example, consider the following
+reduction function, which can be used to compute the max and the argmax of a
+a 1-D tensor in parallel:
+
+```
+f: (Float, Int, Float, Int) -> Float, Int
+f(max, argmax, value, index):
+  if value >= argmax:
+    return (value, index)
+  else:
+    return (max, argmax)
+```
+
+For 1-D Input arrays `V = Float[N], K = Int[N]`, and init values
+`I_V = Float, I_K =  Int`, the result `f_(N-1)` of reducing across the only
+input dimension is equivalent to the following recursive application:
+```
+f_0 = f(I_V, I_K, V_0, K_0)
+f_1 = f(f_0.first, f_0.second, V_1, K_1)
+...
+f_(N-1) = f(f_(N-2).first, f_(N-2).second, V_(N-1), K_(N-1))
+```
+
+Applying this reduction to an array of values, and an array of sequential
+indices (i.e. iota), will co-iterate over the arrays, and return a tuple
+containing the maximal value and the matching index.
+
 ## ReducePrecision
 
 See also
-[`XlaBuilder::ReducePrecision`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::ReducePrecision`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Models the effect of converting floating-point values to a lower-precision
 format (such as IEEE-FP16) and back to the original format.  The number of
@@ -1546,7 +1615,7 @@ portion of the conversion is then simply a no-op.
 ## ReduceWindow
 
 See also
-[`XlaBuilder::ReduceWindow`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::ReduceWindow`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Applies a reduction function to all elements in each window of the input
 multi-dimensional array, producing an output multi-dimensional array with the
@@ -1629,7 +1698,7 @@ context of [`Reduce`](#reduce) for more details.
 ## Reshape
 
 See also
-[`XlaBuilder::Reshape`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h)
+[`XlaBuilder::Reshape`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
 and the [`Collapse`](#collapse) operation.
 
 Reshapes the dimensions of an array into a new configuration.
@@ -1710,7 +1779,7 @@ Reshape(5, {}, {1,1}) == f32[1x1] {{5}};
 ## Rev (reverse)
 
 See also
-[`XlaBuilder::Rev`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Rev`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b>`Rev(operand, dimensions)`</b>
 
@@ -1732,7 +1801,7 @@ the two window dimensions during the gradient computation in neural networks.
 ## RngNormal
 
 See also
-[`XlaBuilder::RngNormal`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::RngNormal`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Constructs an output of a given shape with random numbers generated following
 the $$N(\mu, \sigma)$$ normal distribution. The parameters `mu` and `sigma`, and
@@ -1752,7 +1821,7 @@ be scalar valued.
 ## RngUniform
 
 See also
-[`XlaBuilder::RngUniform`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::RngUniform`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Constructs an output of a given shape with random numbers generated following
 the uniform distribution over the interval $$[a,b)$$. The parameters and output
@@ -1770,10 +1839,142 @@ is implementation-defined.
 :           :                         : limit of interval                 :
 | `shape`   | `Shape`                 | Output shape of type T            |
 
+## Scatter
+
+The XLA scatter operation generates a result which is the value of the input
+tensor `operand`, with several slices (at indices specified by
+`scatter_indices`) updated with the values in `updates` using
+`update_computation`.
+
+See also
+[`XlaBuilder::Scatter`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+
+<b> `scatter(operand, scatter_indices, updates, update_computation, index_vector_dim, update_window_dims, inserted_window_dims, scatter_dims_to_operand_dims)` </b>
+
+|Arguments         | Type                   | Semantics                        |
+|------------------|------------------------|----------------------------------|
+|`operand`         | `XlaOp`                | Tensor to be scattered into.     |
+|`scatter_indices` | `XlaOp`                | Tensor containing the starting   |
+:                  :                        : indices of the slices that must  :
+:                  :                        : be scattered to.                 :
+|`updates`         | `XlaOp`                | Tensor containing the values that|
+:                  :                        : must be used for scattering.     :
+|`update_computation`| `XlaComputation`     | Computation to be used for       |
+:                  :                        : combining the existing values in :
+:                  :                        : the input tensor and the updates :
+:                  :                        : during scatter. This computation :
+:                  :                        : should be of type `T, T -> T`.   :
+|`index_vector_dim`| `int64`                | The dimension in                 |
+:                  :                        : `scatter_indices` that contains  :
+:                  :                        : the starting indices.            :
+|`update_window_dims`| `ArraySlice<int64>`  | The set of dimensions in         |
+:                  :                        : `updates` shape that are _window :
+:                  :                        : dimensions_.                     :
+|`inserted_window_dims`| `ArraySlice<int64>`| The set of _window dimensions_   |
+:                  :                        : that must be inserted into       :
+:                  :                        : `updates` shape.                 :
+|`scatter_dims_to_operand_dims`| `ArraySlice<int64>`  | A dimensions map from  |
+:                  :                        : the scatter indices to the       :
+:                  :                        : operand index space. This array  :
+:                  :                        : is interpreted as mapping `i` to :
+:                  :                        : `scatter_dims_to_operand_dims[i]`:
+:                  :                        : . It has to be one-to-one and    :
+:                  :                        : total.                           :
+
+If `index_vector_dim` is equal to `scatter_indices.rank` we implicitly consider
+`scatter_indices` to have a trailing `1` dimension.
+
+We define `update_scatter_dims` of type `ArraySlice<int64>` as the set of
+dimensions in `updates` shape that are not in `update_window_dims`, in ascending
+order.
+
+The arguments of scatter should follow these constraints:
+
+  - `updates` tensor must be of rank `update_window_dims.size +
+  scatter_indices.rank - 1`.
+
+  - Bounds of dimension `i` in `updates` must conform to the following:
+      - If `i` is present in `update_window_dims` (i.e. equal to
+        `update_window_dims`[`k`] for some `k`), then the bound of dimension
+        `i` in `updates` must not exceed the corresponding bound of `operand`
+        after accounting for the `inserted_window_dims` (i.e.
+        `adjusted_window_bounds`[`k`], where `adjusted_window_bounds` contains
+        the bounds of `operand` with the bounds at indices
+        `inserted_window_dims` removed).
+      - If `i` is present in `update_scatter_dims` (i.e. equal to
+        `update_scatter_dims`[`k`] for some `k`), then the bound of dimension
+        `i` in `updates` must be equal to the corresponding bound of
+        `scatter_indices`, skipping `index_vector_dim` (i.e.
+        `scatter_indices.shape.dims`[`k`], if `k` < `index_vector_dim` and
+        `scatter_indices.shape.dims`[`k+1`] otherwise).
+
+  - `update_window_dims` must be in ascending order, not have any repeating
+    dimension numbers, and be in the range `[0, updates.rank)`.
+
+  - `inserted_window_dims` must be in ascending order, not have any
+    repeating dimension numbers, and be in the range `[0, operand.rank)`.
+
+  - `scatter_dims_to_operand_dims.size` must be equal to
+    `scatter_indices`[`index_vector_dim`], and its values must be in the range
+    `[0, operand.rank)`.
+
+For a given index `U` in the `updates` tensor, the corresponding index `I` in
+the `operand` tensor into which this update has to be applied is computed as
+follows:
+
+  1. Let `G` = { `U`[`k`] for `k` in `update_scatter_dims` }. Use `G` to look up
+     an index vector `S` in the `scatter_indices` tensor such that `S`[`i`] =
+     `scatter_indices`[Combine(`G`, `i`)] where Combine(A, b) inserts b at
+     positions `index_vector_dim` into A.
+  2. Create an index `S`<sub>`in`</sub> into `operand` using `S` by scattering
+     `S` using the `scatter_dims_to_operand_dims` map. More formally:
+       1. `S`<sub>`in`</sub>[`scatter_dims_to_operand_dims`[`k`]] = `S`[`k`] if
+          `k` < `scatter_dims_to_operand_dims.size`.
+       2. `S`<sub>`in`</sub>[`_`] = `0` otherwise.
+  3. Create an index `W`<sub>`in`</sub> into `operand` by scattering the indices
+     at `update_window_dims` in `U` according to `inserted_window_dims`.
+     More formally:
+       1. `W`<sub>`in`</sub>[`window_dims_to_operand_dims`(`k`)] = `U`[`k`] if
+          `k` < `update_window_dims.size`, where `window_dims_to_operand_dims`
+          is the monotonic function with domain [`0`, `update_window_dims.size`)
+          and range [`0`, `operand.rank`) \\ `inserted_window_dims`. (For
+          example, if `update_window_dims.size` is `4`, `operand.rank` is `6`,
+          and `inserted_window_dims` is {`0`, `2`} then
+          `window_dims_to_operand_dims` is {`0`→`1`, `1`→`3`, `2`→`4`,
+          `3`→`5`}).
+       2. `W`<sub>`in`</sub>[`_`] = `0` otherwise.
+  4. `I` is `W`<sub>`in`</sub> + `S`<sub>`in`</sub> where + is element-wise
+     addition.
+
+In summary, the scatter operation can be defined as follows.
+
+   - Initialize `output` with `operand`, i.e. for all indices `O` in the
+     `operand` tensor:\
+       `output`[`O`] = `operand`[`O`]
+   - For every index `U` in the `updates` tensor and the corresponding index `O`
+     in the `operand` tensor:\
+       `output`[`O`] = `update_computation`(`output`[`O`], `updates`[`U`])
+
+The order in which updates are applied is non-deterministic. So, when multiple
+indices in `updates` refer to the same index in `operand`, the corresponding
+value in `output` will be non-deterministic.
+
+Note that the first parameter that is passed into the `update_computation` will
+always be the current value from the `output` tensor and the second parameter
+will always be the value from the `updates` tensor. This is important
+specifically for cases when the `update_computation` is _not commutative_.
+
+Informally, the scatter op can be viewed as an _inverse_ of the gather op, i.e.
+the scatter op updates the elements in the input that are extracted by the
+corresponding gather op.
+
+For a detailed informal description and examples, refer to the
+"Informal Description" section under `Gather`.
+
 ## Select
 
 See also
-[`XlaBuilder::Select`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Select`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Constructs an output array from elements of two input arrays, based on the
 values of a predicate array.
@@ -1824,7 +2025,7 @@ the same shape!) then `pred` has to be a scalar of type `PRED`.
 ## SelectAndScatter
 
 See also
-[`XlaBuilder::SelectAndScatter`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::SelectAndScatter`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 This operation can be considered as a composite operation that first computes
 `ReduceWindow` on the `operand` array to select an element from each window, and
@@ -1904,7 +2105,7 @@ context of [`Reduce`](#reduce) for more details.
 ## Send
 
 See also
-[`XlaBuilder::Send`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Send`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `Send(operand, channel_handle)` </b>
 
@@ -1959,7 +2160,7 @@ computations. For example, below schedules lead to deadlocks.
 ## Slice
 
 See also
-[`XlaBuilder::Slice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Slice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 Slicing extracts a sub-array from the input array. The sub-array is of the same
 rank as the input and contains the values inside a bounding box within the input
@@ -2008,37 +2209,44 @@ Slice(b, {2, 1}, {4, 3}) produces:
 ## Sort
 
 See also
-[`XlaBuilder::Sort`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Sort`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 There are two versions of the Sort instruction: a single-operand and a
 two-operand version.
 
 <b>`Sort(operand)`</b>
 
-Arguments | Type    | Semantics
---------- | ------- | --------------------
-`operand` | `XlaOp` | The operand to sort.
+Arguments   | Type    | Semantics
+----------- | ------- | --------------------
+`operand`   | `XlaOp` | The operand to sort.
+`dimension` | `int64` | The dimension along which to sort.
 
-Sorts the elements in the operand in ascending order. The operand must be rank-1.
-If the operand's elements have floating point type, and the operand contains
-NaN elements, the order of elements in the output is implementation-defined.
+Sorts the elements in the operand in ascending order along the provided
+dimension. For example, for a rank-2 (matrix) operand, a `dimension` value of 0
+will sort each column independently, and a `dimension` value of 1 will sort each
+row independently. If the operand's elements have floating point type, and the
+operand contains NaN elements, the order of elements in the output is
+implementation-defined.
 
 <b>`Sort(key, value)`</b>
 
 Sorts both the key and the value operands. The keys are sorted as in the
 single-operand version. The values are sorted according to the order of their
 corresponding keys. For example, if the inputs are `keys = [3, 1]` and
-`values = [42, 50]`, then the output of the sort is the tuple `{[1, 3], [50, 42]}`.
+`values = [42, 50]`, then the output of the sort is the tuple 
+`{[1, 3], [50, 42]}`.
+
 The sort is not guaranteed to be stable, that is, if the keys array contains
 duplicates, the order of their corresponding values may not be preserved.
 
-Arguments | Type    | Semantics
---------- | ------- | -------------------
-`keys`    | `XlaOp` | The sort keys.
-`values`  | `XlaOp` | The values to sort.
+Arguments   | Type    | Semantics
+----------- | ------- | -------------------
+`keys`      | `XlaOp` | The sort keys.
+`values`    | `XlaOp` | The values to sort.
+`dimension` | `int64` | The dimension along which to sort.
 
-The `keys` and `values` operand must both be rank-1, and must have the same
-dimensions, but may have different element types.
+The `keys` and `values` must have the same dimensions, but may have different
+element types.
 
 ## Transpose
 
@@ -2061,7 +2269,7 @@ This is the same as Reshape(operand, permutation,
 ## Tuple
 
 See also
-[`XlaBuilder::Tuple`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::Tuple`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 A tuple containing a variable number of data handles, each of which has its own
 shape.
@@ -2080,7 +2288,7 @@ Tuples can be deconstructed (accessed) via the [`GetTupleElement`]
 ## While
 
 See also
-[`XlaBuilder::While`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_client/xla_builder.h).
+[`XlaBuilder::While`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h).
 
 <b> `While(condition, body, init)` </b>
 
diff --git a/tensorflow/docs_src/get_started/_index.yaml b/tensorflow/docs_src/tutorials/_index.yaml
index 4060804892..9534114689 100644
--- a/tensorflow/docs_src/get_started/_index.yaml
+++ b/tensorflow/docs_src/tutorials/_index.yaml
@@ -2,6 +2,7 @@ project_path: /_project.yaml
 book_path: /_book.yaml
 description: <!--no description-->
 landing_page:
+  custom_css_path: /site-assets/css/style.css
   show_side_navs: True
   rows:
   - description: >
@@ -14,57 +15,6 @@ landing_page:
       </p>
     items:
     - custom_html: >
-        <style>
-        .tfo-button-primary {
-          background-color: #fca851;
-        }
-        .tfo-button-primary:hover {
-          background-color: #ef6c02;
-        }
-
-        a.colab-button {
-          display: inline-block;
-          background: rgba(255, 255, 255, 0.75);
-          padding: 4px 8px;
-          border-radius: 4px;
-          font-size: 11px!important;
-          text-decoration: none;
-          color:#aaa;border: none;
-          font-weight: 300;
-          border: solid 1px rgba(0, 0, 0, 0.08);
-          border-bottom-color: rgba(0, 0, 0, 0.15);
-          text-transform: uppercase;
-          line-height: 16px
-        }
-        a.colab-button:hover {
-          color: #666;
-          background: white;
-          border-color: rgba(0, 0, 0, 0.2);
-        }
-        a.colab-button span {
-          background-image: url("/images/colab_logo_button.svg");
-          background-repeat:no-repeat;background-size:20px;
-          background-position-y:2px;display:inline-block;
-          padding-left:24px;border-radius:4px;
-          text-decoration:none;
-        }
-
-        /* adjust code block for smaller screens */
-        @media screen and (max-width: 1000px) {
-          .tfo-landing-row-item-code-block {
-            flex-direction: column !important;
-          }
-          .tfo-landing-row-item-code-block > .devsite-landing-row-item-code {
-            /*display: none;*/
-            width: 100%;
-          }
-        }
-        @media screen and (max-width: 720px) {
-          .tfo-landing-row-item-code-block {
-            display: none;
-          }
-        }
-        </style>
         <div class="devsite-landing-row-item-description">
           <h3 class="hide-from-toc">Learn and use ML</h3>
           <div class="devsite-landing-row-item-description-content">
@@ -75,11 +25,11 @@ landing_page:
               <a href="/guide/keras">TensorFlow Keras guide</a>.
             </p>
             <ol style="padding-left:20px;">
-              <li><a href="/get_started/basic_classification">Basic classification</a></li>
-              <li><a href="/get_started/basic_text_classification">Text classification</a></li>
-              <li><a href="/get_started/basic_regression">Regression</a></li>
-              <li><a href="/get_started/overfit_and_underfit">Overfitting and underfitting</a></li>
-              <li><a href="/get_started/save_and_restore_models">Save and load</a></li>
+              <li><a href="./keras/basic_classification">Basic classification</a></li>
+              <li><a href="./keras/basic_text_classification">Text classification</a></li>
+              <li><a href="./keras/basic_regression">Regression</a></li>
+              <li><a href="./keras/overfit_and_underfit">Overfitting and underfitting</a></li>
+              <li><a href="./keras/save_and_restore_models">Save and load</a></li>
             </ol>
           </div>
           <div class="devsite-landing-row-item-buttons" style="margin-top:0;">
@@ -109,7 +59,7 @@ landing_page:
         model.evaluate(x_test, y_test)
         </pre>
         {% dynamic if request.tld != 'cn' %}
-        <a class="colab-button" target="_blank" href="https://colab.sandbox.google.com/github/tensorflow/models/blob/master/samples/core/get_started/_index.ipynb">Run in a <span>Notebook</span></a>
+        <a class="colab-button" target="_blank" href="https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/_index.ipynb">Run in a <span>Notebook</span></a>
         {% dynamic endif %}
 
   - items:
@@ -124,38 +74,38 @@ landing_page:
             <ol style="padding-left:20px;">
               <li>
                 {% dynamic if request.tld == 'cn' %}
-                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb" class="external">Eager execution basics</a>
+                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb" class="external">Eager execution basics</a>
                 {% dynamic else %}
-                <a href="https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb" class="external">Eager execution basics</a>
+                <a href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb" class="external">Eager execution basics</a>
                 {% dynamic endif %}
               </li>
               <li>
                 {% dynamic if request.tld == 'cn' %}
-                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb" class="external">Automatic differentiation and gradient tapes</a>
+                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb" class="external">Automatic differentiation and gradient tape</a>
                 {% dynamic else %}
-                <a href="https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb" class="external">Automatic differentiation and gradient tapes</a>
+                <a href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb" class="external">Automatic differentiation and gradient tape</a>
                 {% dynamic endif %}
               </li>
               <li>
                 {% dynamic if request.tld == 'cn' %}
-                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb" class="external">Variables, models, and training</a>
+                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb" class="external">Custom training: basics</a>
                 {% dynamic else %}
-                <a href="https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/3_training_models.ipynb" class="external">Variables, models, and training</a>
+                <a href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb" class="external">Custom training: basics</a>
                 {% dynamic endif %}
               </li>
               <li>
                 {% dynamic if request.tld == 'cn' %}
-                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb" class="external">Custom layers</a>
+                <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb" class="external">Custom layers</a>
                 {% dynamic else %}
-                <a href="https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/4_high_level.ipynb" class="external">Custom layers</a>
+                <a href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb" class="external">Custom layers</a>
                 {% dynamic endif %}
               </li>
-              <li><a href="/get_started/eager">Custom training walkthrough</a></li>
+              <li><a href="./eager/custom_training_walkthrough">Custom training: walkthrough</a></li>
               <li>
                 {% dynamic if request.tld == 'cn' %}
                 <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb" class="external">Example: Neural machine translation w/ attention</a>
                 {% dynamic else %}
-                <a href="https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb" class="external">Example: Neural machine translation w/ attention</a>
+                <a href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb" class="external">Example: Neural machine translation w/ attention</a>
                 {% dynamic endif %}
               </li>
             </ol>
@@ -170,13 +120,16 @@ landing_page:
           <div class="devsite-landing-row-item-description-content">
             <p>
               Estimators can train large models on multiple machines in a
-              production environment. Try the examples below and read the
+              production environment. TensorFlow provides a collection of
+              pre-made Estimators to implement common ML algorithms. See the
               <a href="/guide/estimators">Estimators guide</a>.
             </p>
             <ol style="padding-left: 20px;">
-              <li><a href="/tutorials/text_classification_with_tf_hub">How to build a simple text classifier with TF-Hub</a></li>
-              <li><a href="https://github.com/tensorflow/models/tree/master/official/boosted_trees">Classifying Higgs boson processes</a></li>
-              <li><a href="/tutorials/wide_and_deep">Wide and deep learning using estimators</a></li>
+              <li><a href="/tutorials/estimators/linear">Build a linear model with Estimators</a></li>
+              <li><a href="https://github.com/tensorflow/models/tree/master/official/wide_deep" class="external">Wide and deep learning with Estimators</a></li>
+              <li><a href="https://github.com/tensorflow/models/tree/master/official/boosted_trees" class="external">Boosted trees</a></li>
+              <li><a href="/hub/tutorials/text_classification_with_tf_hub">How to build a simple text classifier with TF-Hub</a></li>
+              <li><a href="/tutorials/estimators/cnn">Build a Convolutional Neural Network using Estimators</a></li>
             </ol>
           </div>
           <div class="devsite-landing-row-item-buttons">
@@ -187,7 +140,7 @@ landing_page:
   - description: >
       <h2 class="hide-from-toc">Google Colab&#58; An easy way to learn and use TensorFlow</h2>
       <p>
-        <a href="https://colab.sandbox.google.com/notebooks/welcome.ipynb" class="external">Colaboratory</a>
+        <a href="https://colab.research.google.com/notebooks/welcome.ipynb" class="external">Colaboratory</a>
         is a Google research project created to help disseminate machine learning
         education and research. It's a Jupyter notebook environment that requires
         no setup to use and runs entirely in the cloud.
diff --git a/tensorflow/docs_src/tutorials/_toc.yaml b/tensorflow/docs_src/tutorials/_toc.yaml
new file mode 100644
index 0000000000..d33869af6e
--- /dev/null
+++ b/tensorflow/docs_src/tutorials/_toc.yaml
@@ -0,0 +1,103 @@
+toc:
+- title: Get started with TensorFlow
+  path: /tutorials/
+
+- title: Learn and use ML
+  style: accordion
+  section:
+  - title: Overview
+    path: /tutorials/keras/
+  - title: Basic classification
+    path: /tutorials/keras/basic_classification
+  - title: Text classification
+    path: /tutorials/keras/basic_text_classification
+  - title: Regression
+    path: /tutorials/keras/basic_regression
+  - title: Overfitting and underfitting
+    path: /tutorials/keras/overfit_and_underfit
+  - title: Save and restore models
+    path: /tutorials/keras/save_and_restore_models
+
+- title: Research and experimentation
+  style: accordion
+  section:
+  - title: Overview
+    path: /tutorials/eager/
+  - title: Eager execution
+    path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb
+    status: external
+  - title: Automatic differentiation
+    path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb
+    status: external
+  - title: "Custom training: basics"
+    path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb
+    status: external
+  - title: Custom layers
+    path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb
+    status: external
+  - title: "Custom training: walkthrough"
+    path: /tutorials/eager/custom_training_walkthrough
+  - title: Translation with attention
+    path: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb
+    status: external
+
+- title: ML at production scale
+  style: accordion
+  section:
+  - title: Linear model with Estimators
+    path: /tutorials/estimators/linear
+  - title: Wide and deep learning
+    path: https://github.com/tensorflow/models/tree/master/official/wide_deep
+    status: external
+  - title: Boosted trees
+    path: https://github.com/tensorflow/models/tree/master/official/boosted_trees
+    status: external
+  - title: Text classifier with TF-Hub
+    path: /hub/tutorials/text_classification_with_tf_hub
+  - title: Build a CNN using Estimators
+    path: /tutorials/estimators/cnn
+
+- title: Images
+  style: accordion
+  section:
+  - title: Image recognition
+    path: /tutorials/images/image_recognition
+  - title: Image retraining
+    path: /hub/tutorials/image_retraining
+  - title: Advanced CNN
+    path: /tutorials/images/deep_cnn
+
+- title: Sequences
+  style: accordion
+  section:
+  - title: Recurrent neural network
+    path: /tutorials/sequences/recurrent
+  - title: Drawing classification
+    path: /tutorials/sequences/recurrent_quickdraw
+  - title: Simple audio recognition
+    path: /tutorials/sequences/audio_recognition
+  - title: Neural machine translation
+    path: https://github.com/tensorflow/nmt
+    status: external
+
+- title: Data representation
+  style: accordion
+  section:
+  - title: Vector representations of words
+    path: /tutorials/representation/word2vec
+  - title: Kernel methods
+    path: /tutorials/representation/kernel_methods
+  - title: Large-scale linear models
+    path: /tutorials/representation/linear
+
+- title: Non-ML
+  style: accordion
+  section:
+  - title: Mandelbrot set
+    path: /tutorials/non-ml/mandelbrot
+  - title: Partial differential equations
+    path: /tutorials/non-ml/pdes
+
+- break: True
+- title: Next steps
+  path: /tutorials/next_steps
diff --git a/tensorflow/docs_src/tutorials/eager/custom_training_walkthrough.md b/tensorflow/docs_src/tutorials/eager/custom_training_walkthrough.md
new file mode 100644
index 0000000000..b564a27ecf
--- /dev/null
+++ b/tensorflow/docs_src/tutorials/eager/custom_training_walkthrough.md
@@ -0,0 +1,3 @@
+# Custom training: walkthrough
+
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/eager/custom_training_walkthrough.ipynb)
diff --git a/tensorflow/docs_src/tutorials/eager/index.md b/tensorflow/docs_src/tutorials/eager/index.md
new file mode 100644
index 0000000000..a13b396094
--- /dev/null
+++ b/tensorflow/docs_src/tutorials/eager/index.md
@@ -0,0 +1,13 @@
+# Research and experimentation
+
+Eager execution provides an imperative, define-by-run interface for advanced
+operations. Write custom layers, forward passes, and training loops with
+auto&nbsp;differentiation. Start with these notebooks, then read the
+[eager execution guide](../../guide/eager).
+
+1. <span>[Eager execution](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb){:.external}</span>
+2. <span>[Automatic differentiation and gradient tape](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/automatic_differentiation.ipynb){:.external}</span>
+3. <span>[Custom training: basics](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_training.ipynb){:.external}</span>
+4. <span>[Custom layers](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/notebooks/custom_layers.ipynb){:.external}</span>
+5. [Custom training: walkthrough](/tutorials/eager/custom_training_walkthrough)
+6. <span>[Advanced example: Neural machine translation with attention](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb){:.external}</span>
diff --git a/tensorflow/docs_src/tutorials/layers.md b/tensorflow/docs_src/tutorials/estimators/cnn.md
index 791909f5fd..12a215b50c 100644
--- a/tensorflow/docs_src/tutorials/layers.md
+++ b/tensorflow/docs_src/tutorials/estimators/cnn.md
@@ -1,4 +1,4 @@
-# A Guide to TF Layers: Building a Convolutional Neural Network
+# Build a Convolutional Neural Network using Estimators
 
 The TensorFlow @{tf.layers$`layers` module} provides a high-level API that makes
 it easy to construct a neural network. It provides methods that facilitate the
diff --git a/tensorflow/docs_src/tutorials/estimators/linear.md b/tensorflow/docs_src/tutorials/estimators/linear.md
new file mode 100644
index 0000000000..067a33ac03
--- /dev/null
+++ b/tensorflow/docs_src/tutorials/estimators/linear.md
@@ -0,0 +1,3 @@
+# Build a linear model with Estimators
+
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/estimators/linear.ipynb)
diff --git a/tensorflow/docs_src/tutorials/image_retraining.md b/tensorflow/docs_src/tutorials/image_retraining.md
deleted file mode 100644
index 27784eef9c..0000000000
--- a/tensorflow/docs_src/tutorials/image_retraining.md
+++ /dev/null
@@ -1,4 +0,0 @@
-# How to Retrain Inception's Final Layer for New Categories
-
-**NOTE: This tutorial has moved to**
-https://github.com/tensorflow/hub/tree/master/docs/tutorials/image_retraining.md
diff --git a/tensorflow/docs_src/tutorials/deep_cnn.md b/tensorflow/docs_src/tutorials/images/deep_cnn.md
index 44a32d9d1d..27963575f5 100644
--- a/tensorflow/docs_src/tutorials/deep_cnn.md
+++ b/tensorflow/docs_src/tutorials/images/deep_cnn.md
@@ -1,7 +1,4 @@
-# Convolutional Neural Networks
-
-> **NOTE:** This tutorial is intended for *advanced* users of TensorFlow
-and assumes expertise and experience in machine learning.
+# Advanced Convolutional Neural Networks
 
 ## Overview
 
@@ -83,21 +80,21 @@ for details.  It consists of 1,068,298 learnable parameters and requires about
 ## Code Organization
 
 The code for this tutorial resides in
-[`models/tutorials/image/cifar10/`](https://www.tensorflow.org/code/tensorflow_models/tutorials/image/cifar10/).
+[`models/tutorials/image/cifar10/`](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/).
 
 File | Purpose
 --- | ---
-[`cifar10_input.py`](https://www.tensorflow.org/code/tensorflow_models/tutorials/image/cifar10/cifar10_input.py) | Reads the native CIFAR-10 binary file format.
-[`cifar10.py`](https://www.tensorflow.org/code/tensorflow_models/tutorials/image/cifar10/cifar10.py) | Builds the CIFAR-10 model.
-[`cifar10_train.py`](https://www.tensorflow.org/code/tensorflow_models/tutorials/image/cifar10/cifar10_train.py) | Trains a CIFAR-10 model on a CPU or GPU.
-[`cifar10_multi_gpu_train.py`](https://www.tensorflow.org/code/tensorflow_models/tutorials/image/cifar10/cifar10_multi_gpu_train.py) | Trains a CIFAR-10 model on multiple GPUs.
-[`cifar10_eval.py`](https://www.tensorflow.org/code/tensorflow_models/tutorials/image/cifar10/cifar10_eval.py) | Evaluates the predictive performance of a CIFAR-10 model.
+[`cifar10_input.py`](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10_input.py) | Reads the native CIFAR-10 binary file format.
+[`cifar10.py`](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10.py) | Builds the CIFAR-10 model.
+[`cifar10_train.py`](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10_train.py) | Trains a CIFAR-10 model on a CPU or GPU.
+[`cifar10_multi_gpu_train.py`](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py) | Trains a CIFAR-10 model on multiple GPUs.
+[`cifar10_eval.py`](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10_eval.py) | Evaluates the predictive performance of a CIFAR-10 model.
 
 
 ## CIFAR-10 Model
 
 The CIFAR-10 network is largely contained in
-[`cifar10.py`](https://www.tensorflow.org/code/tensorflow_models/tutorials/image/cifar10/cifar10.py).
+[`cifar10.py`](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10.py).
 The complete training
 graph contains roughly 765 operations. We find that we can make the code most
 reusable by constructing the graph with the following modules:
@@ -438,9 +435,6 @@ with a batch size of 64 and compare the training speed.
 
 ## Next Steps
 
-[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You have
-completed the CIFAR-10 tutorial.
-
 If you are now interested in developing and training your own image
 classification system, we recommend forking this tutorial and replacing
 components to address your image classification problem.
diff --git a/tensorflow/docs_src/tutorials/image_recognition.md b/tensorflow/docs_src/tutorials/images/image_recognition.md
index 332bcf54f0..d545de73df 100644
--- a/tensorflow/docs_src/tutorials/image_recognition.md
+++ b/tensorflow/docs_src/tutorials/images/image_recognition.md
@@ -434,7 +434,6 @@ should be able to transfer some of that understanding to solving related
 problems.  One way to perform transfer learning is to remove the final
 classification layer of the network and extract
 the [next-to-last layer of the CNN](https://arxiv.org/abs/1310.1531), in this case a 2048 dimensional vector.
-There's a guide to doing this @{$image_retraining$in the how-to section}.
 
 
 ## Resources for Learning More
@@ -450,7 +449,7 @@ covering them.
 
 To find out more about implementing convolutional neural networks, you can jump
 to the TensorFlow @{$deep_cnn$deep convolutional networks tutorial},
-or start a bit more gently with our @{$layers$MNIST starter tutorial}.
+or start a bit more gently with our [Estimator MNIST tutorial](../estimators/cnn.md).
 Finally, if you want to get up to speed on research in this area, you can
 read the recent work of all the papers referenced in this tutorial.
 
diff --git a/tensorflow/docs_src/tutorials/index.md b/tensorflow/docs_src/tutorials/index.md
deleted file mode 100644
index 6bd3a3a897..0000000000
--- a/tensorflow/docs_src/tutorials/index.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# Tutorials
-
-
-This section contains tutorials demonstrating how to do specific tasks
-in TensorFlow.  If you are new to TensorFlow, we recommend reading
-[Get Started with TensorFlow](/get_started/).
-
-## Images
-
-These tutorials cover different aspects of image recognition:
-
-  * @{$layers$MNIST}, which introduces convolutional neural networks (CNNs) and
-    demonstrates how to build a CNN in TensorFlow.
-  * @{$image_recognition}, which introduces the field of image recognition and
-    uses a pre-trained model (Inception) for recognizing images.
-  * @{$image_retraining}, which has a wonderfully self-explanatory title.
-  * @{$deep_cnn}, which demonstrates how to build a small CNN for recognizing
-    images.  This tutorial is aimed at advanced TensorFlow users.
-
-
-## Sequences
-
-These tutorials focus on machine learning problems dealing with sequence data.
-
-  * @{$recurrent}, which demonstrates how to use a
-    recurrent neural network to predict the next word in a sentence.
-  * @{$seq2seq}, which demonstrates how to use a
-    sequence-to-sequence model to translate text from English to French.
-  * @{$recurrent_quickdraw}
-    builds a classification model for drawings, directly from the sequence of
-    pen strokes.
-  * @{$audio_recognition}, which shows how to
-    build a basic speech recognition network.
-
-## Data representation
-
-These tutorials demonstrate various data representations that can be used in
-TensorFlow.
-
-  * @{$wide}, uses
-    @{tf.feature_column$feature columns} to feed a variety of data types
-    to linear model, to solve a classification problem.
-  * @{$wide_and_deep}, builds on the
-    above linear model tutorial, adding a deep feed-forward neural network
-    component and a DNN-compatible data representation.
-  * @{$word2vec}, which demonstrates how to
-    create an embedding for words.
-  * @{$kernel_methods},
-    which shows how to improve the quality of a linear model by using explicit
-    kernel mappings.
-
-## Non Machine Learning
-
-Although TensorFlow specializes in machine learning, the core of TensorFlow is
-a powerful numeric computation system which you can also use to solve other
-kinds of math problems.  For example:
-
-  * @{$mandelbrot}
-  * @{$pdes}
diff --git a/tensorflow/docs_src/get_started/basic_classification.md b/tensorflow/docs_src/tutorials/keras/basic_classification.md
index 91bbd85b24..e028af99b9 100644
--- a/tensorflow/docs_src/get_started/basic_classification.md
+++ b/tensorflow/docs_src/tutorials/keras/basic_classification.md
@@ -1,3 +1,3 @@
 # Basic Classification
 
-[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/basic_classification.ipynb)
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/keras/basic_classification.ipynb)
diff --git a/tensorflow/docs_src/get_started/basic_regression.md b/tensorflow/docs_src/tutorials/keras/basic_regression.md
index a535f22f5a..8721b7aca1 100644
--- a/tensorflow/docs_src/get_started/basic_regression.md
+++ b/tensorflow/docs_src/tutorials/keras/basic_regression.md
@@ -1,3 +1,3 @@
 # Basic Regression
 
-[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/basic_regression.ipynb)
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/keras/basic_regression.ipynb)
diff --git a/tensorflow/docs_src/get_started/basic_text_classification.md b/tensorflow/docs_src/tutorials/keras/basic_text_classification.md
index 7c5d4f7896..c2a16bdd20 100644
--- a/tensorflow/docs_src/get_started/basic_text_classification.md
+++ b/tensorflow/docs_src/tutorials/keras/basic_text_classification.md
@@ -1,3 +1,3 @@
 # Basic Text Classification
 
-[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/basic_text_classification.ipynb)
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/keras/basic_text_classification.ipynb)
diff --git a/tensorflow/docs_src/tutorials/keras/index.md b/tensorflow/docs_src/tutorials/keras/index.md
new file mode 100644
index 0000000000..9d42281c8f
--- /dev/null
+++ b/tensorflow/docs_src/tutorials/keras/index.md
@@ -0,0 +1,22 @@
+# Learn and use machine learning
+
+This notebook collection is inspired by the book
+*[Deep Learning with Python](https://books.google.com/books?id=Yo3CAQAACAAJ)*.
+These tutorials use `tf.keras`, TensorFlow's high-level Python API for building
+and training deep learning models. To learn more about using Keras with
+TensorFlow, see the [TensorFlow Keras Guide](../../guide/keras).
+
+Publisher's note: *Deep Learning with Python* introduces the field of deep
+learning using the Python language and the powerful Keras library. Written by
+Keras creator and Google AI researcher François Chollet, this book builds your
+understanding through intuitive explanations and practical examples.
+
+To learn about machine learning fundamentals and concepts, consider taking the
+[Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/).
+Additional TensorFlow and machine learning resources are listed in [next steps](../next_steps).
+
+1. [Basic classification](./basic_classification)
+2. [Text classification](./basic_text_classification)
+3. [Regression](./basic_regression)
+4. [Overfitting and underfitting](./overfit_and_underfit)
+5. [Save and restore models](./save_and_restore_models)
diff --git a/tensorflow/docs_src/get_started/overfit_and_underfit.md b/tensorflow/docs_src/tutorials/keras/overfit_and_underfit.md
index e5b5ae7b5a..f07f3addd8 100644
--- a/tensorflow/docs_src/get_started/overfit_and_underfit.md
+++ b/tensorflow/docs_src/tutorials/keras/overfit_and_underfit.md
@@ -1,3 +1,3 @@
 # Overfitting and Underfitting
 
-[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/overfit_and_underfit.ipynb)
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/keras/overfit_and_underfit.ipynb)
diff --git a/tensorflow/docs_src/get_started/save_and_restore_models.md b/tensorflow/docs_src/tutorials/keras/save_and_restore_models.md
index 44b3772945..a799b379a0 100644
--- a/tensorflow/docs_src/get_started/save_and_restore_models.md
+++ b/tensorflow/docs_src/tutorials/keras/save_and_restore_models.md
@@ -1,3 +1,3 @@
 # Save and restore Models
 
-[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/save_and_restore_models.ipynb)
+[Colab notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/tutorials/keras/save_and_restore_models.ipynb)
diff --git a/tensorflow/docs_src/tutorials/leftnav_files b/tensorflow/docs_src/tutorials/leftnav_files
deleted file mode 100644
index eadd410d08..0000000000
--- a/tensorflow/docs_src/tutorials/leftnav_files
+++ /dev/null
@@ -1,24 +0,0 @@
-index.md
-
-### Images
-layers.md: MNIST
-image_recognition.md: Image Recognition
-/hub/tutorials/image_retraining.md: Image Retraining
-deep_cnn.md
-
-### Sequences
-/hub/tutorials/text_classification_with_tf_hub: Text Classification
-recurrent.md
-seq2seq.md: Neural Machine Translation
-recurrent_quickdraw.md: Drawing Classification
-audio_recognition.md
-
-### Data Representation
-wide.md: Linear Models
-wide_and_deep.md: Wide & Deep Learning
-word2vec.md
-kernel_methods.md: Kernel Methods
-
-### Non-ML
-mandelbrot.md
-pdes.md
diff --git a/tensorflow/docs_src/get_started/next_steps.md b/tensorflow/docs_src/tutorials/next_steps.md
index 01c9f7204a..01c9f7204a 100644
--- a/tensorflow/docs_src/get_started/next_steps.md
+++ b/tensorflow/docs_src/tutorials/next_steps.md
diff --git a/tensorflow/docs_src/tutorials/mandelbrot.md b/tensorflow/docs_src/tutorials/non-ml/mandelbrot.md
index 1c0a548129..1c0a548129 100755..100644
--- a/tensorflow/docs_src/tutorials/mandelbrot.md
+++ b/tensorflow/docs_src/tutorials/non-ml/mandelbrot.md
diff --git a/tensorflow/docs_src/tutorials/pdes.md b/tensorflow/docs_src/tutorials/non-ml/pdes.md
index 425e8d7084..b5a0fa834a 100755..100644
--- a/tensorflow/docs_src/tutorials/pdes.md
+++ b/tensorflow/docs_src/tutorials/non-ml/pdes.md
@@ -135,7 +135,6 @@ for i in range(1000):
   DisplayArray(U.eval(), rng=[-0.1, 0.1])
 ```
 
-![jpeg](../images/pde_output_2.jpg)
+![jpeg](../../images/pde_output_2.jpg)
 
 Look! Ripples!
-
diff --git a/tensorflow/docs_src/tutorials/kernel_methods.md b/tensorflow/docs_src/tutorials/representation/kernel_methods.md
index 205e2a2d2c..f3c232c511 100644
--- a/tensorflow/docs_src/tutorials/kernel_methods.md
+++ b/tensorflow/docs_src/tutorials/representation/kernel_methods.md
@@ -27,7 +27,7 @@ TensorFlow will provide support for sparse features at a later release.
 
 This tutorial uses [tf.contrib.learn](https://www.tensorflow.org/code/tensorflow/contrib/learn/python/learn)
 (TensorFlow's high-level Machine Learning API) Estimators for our ML models.
-If you are not familiar with this API, [tf.estimator Quickstart](https://www.tensorflow.org/get_started/estimator)
+If you are not familiar with this API, The [Estimator guide](../../guide/estimators.md)
 is a good place to start. We will use the MNIST dataset. The tutorial consists
 of the following steps:
 
diff --git a/tensorflow/docs_src/tutorials/linear.md b/tensorflow/docs_src/tutorials/representation/linear.md
index 3f247ade26..1b418cf065 100644
--- a/tensorflow/docs_src/tutorials/linear.md
+++ b/tensorflow/docs_src/tutorials/representation/linear.md
@@ -11,8 +11,9 @@ those tools. It explains:
      deep learning to get the advantages of both.
 
 Read this overview to decide whether the Estimator's linear model tools  might
-be useful to you. Then do the @{$wide$Linear Models tutorial} to
-give it a try. This overview uses code samples from the tutorial, but the
+be useful to you. Then work through the
+[Estimator wide and deep learning tutorial](https://github.com/tensorflow/models/tree/master/official/wide_deep)
+to give it a try. This overview uses code samples from the tutorial, but the
 tutorial walks through the code in greater detail.
 
 To understand this overview it will help to have some familiarity
@@ -176,7 +177,7 @@ the name of a `FeatureColumn`. Each key's value is a tensor containing the
 values of that feature for all data instances. See
 @{$premade_estimators#input_fn} for a
 more comprehensive look at input functions, and `input_fn` in the
-[linear models tutorial code](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py)
+[wide and deep learning tutorial](https://github.com/tensorflow/models/tree/master/official/wide_deep)
 for an example implementation of an input function.
 
 The input function is passed to the `train()` and `evaluate()` calls that
@@ -234,4 +235,5 @@ e = tf.estimator.DNNLinearCombinedClassifier(
     dnn_feature_columns=deep_columns,
     dnn_hidden_units=[100, 50])
 ```
-For more information, see the @{$wide_and_deep$Wide and Deep Learning tutorial}.
+For more information, see the
+[wide and deep learning tutorial](https://github.com/tensorflow/models/tree/master/official/wide_deep).
diff --git a/tensorflow/docs_src/tutorials/word2vec.md b/tensorflow/docs_src/tutorials/representation/word2vec.md
index 3fe7352bd2..0a1c41c84a 100644
--- a/tensorflow/docs_src/tutorials/word2vec.md
+++ b/tensorflow/docs_src/tutorials/representation/word2vec.md
@@ -23,7 +23,7 @@ straight in, feel free to look at the minimalistic implementation in
 This basic example contains the code needed to download some data, train on it a
 bit and visualize the result. Once you get comfortable with reading and running
 the basic version, you can graduate to
-[models/tutorials/embedding/word2vec.py](https://www.tensorflow.org/code/tensorflow_models/tutorials/embedding/word2vec.py)
+[models/tutorials/embedding/word2vec.py](https://github.com/tensorflow/models/tree/master/tutorials/embedding/word2vec.py)
 which is a more serious implementation that showcases some more advanced
 TensorFlow principles about how to efficiently use threads to move data into a
 text model, how to checkpoint during training, etc.
@@ -341,7 +341,7 @@ t-SNE.
 Et voila! As expected, words that are similar end up clustering nearby each
 other. For a more heavyweight implementation of word2vec that showcases more of
 the advanced features of TensorFlow, see the implementation in
-[models/tutorials/embedding/word2vec.py](https://www.tensorflow.org/code/tensorflow_models/tutorials/embedding/word2vec.py).
+[models/tutorials/embedding/word2vec.py](https://github.com/tensorflow/models/tree/master/tutorials/embedding/word2vec.py).
 
 ## Evaluating Embeddings: Analogical Reasoning
 
@@ -357,7 +357,7 @@ Download the dataset for this task from
 
 To see how we do this evaluation, have a look at the `build_eval_graph()` and
 `eval()` functions in
-[models/tutorials/embedding/word2vec.py](https://www.tensorflow.org/code/tensorflow_models/tutorials/embedding/word2vec.py).
+[models/tutorials/embedding/word2vec.py](https://github.com/tensorflow/models/tree/master/tutorials/embedding/word2vec.py).
 
 The choice of hyperparameters can strongly influence the accuracy on this task.
 To achieve state-of-the-art performance on this task requires training over a
@@ -385,13 +385,13 @@ your model is seriously bottlenecked on input data, you may want to implement a
 custom data reader for your problem, as described in
 @{$new_data_formats$New Data Formats}.  For the case of Skip-Gram
 modeling, we've actually already done this for you as an example in
-[models/tutorials/embedding/word2vec.py](https://www.tensorflow.org/code/tensorflow_models/tutorials/embedding/word2vec.py).
+[models/tutorials/embedding/word2vec.py](https://github.com/tensorflow/models/tree/master/tutorials/embedding/word2vec.py).
 
 If your model is no longer I/O bound but you want still more performance, you
 can take things further by writing your own TensorFlow Ops, as described in
 @{$adding_an_op$Adding a New Op}.  Again we've provided an
 example of this for the Skip-Gram case
-[models/tutorials/embedding/word2vec_optimized.py](https://www.tensorflow.org/code/tensorflow_models/tutorials/embedding/word2vec_optimized.py).
+[models/tutorials/embedding/word2vec_optimized.py](https://github.com/tensorflow/models/tree/master/tutorials/embedding/word2vec_optimized.py).
 Feel free to benchmark these against each other to measure performance
 improvements at each stage.
 
diff --git a/tensorflow/docs_src/tutorials/seq2seq.md b/tensorflow/docs_src/tutorials/seq2seq.md
deleted file mode 100644
index 8928ba4f7d..0000000000
--- a/tensorflow/docs_src/tutorials/seq2seq.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Sequence-to-Sequence Models
-
-Please check out the
-[tensorflow neural machine translation tutorial](https://github.com/tensorflow/nmt)
-for building sequence-to-sequence models with the latest Tensorflow API.
diff --git a/tensorflow/docs_src/tutorials/audio_recognition.md b/tensorflow/docs_src/tutorials/sequences/audio_recognition.md
index d7a8da6f96..d7a8da6f96 100644
--- a/tensorflow/docs_src/tutorials/audio_recognition.md
+++ b/tensorflow/docs_src/tutorials/sequences/audio_recognition.md
diff --git a/tensorflow/docs_src/tutorials/recurrent.md b/tensorflow/docs_src/tutorials/sequences/recurrent.md
index 14da2c8785..715cc7856a 100644
--- a/tensorflow/docs_src/tutorials/recurrent.md
+++ b/tensorflow/docs_src/tutorials/sequences/recurrent.md
@@ -2,8 +2,8 @@
 
 ## Introduction
 
-Take a look at [this great article](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
-for an introduction to recurrent neural networks and LSTMs in particular.
+See [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/){:.external}
+for an introduction to recurrent neural networks and LSTMs.
 
 ## Language Modeling
 
diff --git a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md b/tensorflow/docs_src/tutorials/sequences/recurrent_quickdraw.md
index 1afd861738..37bce5b76d 100644
--- a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
+++ b/tensorflow/docs_src/tutorials/sequences/recurrent_quickdraw.md
@@ -13,7 +13,7 @@ In this tutorial we'll show how to build an RNN-based recognizer for this
 problem. The model will use a combination of convolutional layers, LSTM layers,
 and a softmax output layer to classify the drawings:
 
-<center> ![RNN model structure](../images/quickdraw_model.png) </center>
+<center> ![RNN model structure](../../images/quickdraw_model.png) </center>
 
 The figure above shows the structure of the model that we will build in this
 tutorial. The input is a drawing that is encoded as a sequence of strokes of
@@ -208,7 +208,7 @@ This data is then reformatted into a tensor of shape `[num_training_samples,
 max_length, 3]`. Then we determine the bounding box of the original drawing in
 screen coordinates and normalize the size such that the drawing has unit height.
 
-<center> ![Size normalization](../images/quickdraw_sizenormalization.png) </center>
+<center> ![Size normalization](../../images/quickdraw_sizenormalization.png) </center>
 
 Finally, we compute the differences between consecutive points and store these
 as a `VarLenFeature` in a
diff --git a/tensorflow/docs_src/tutorials/wide.md b/tensorflow/docs_src/tutorials/wide.md
deleted file mode 100644
index 27ce75a30d..0000000000
--- a/tensorflow/docs_src/tutorials/wide.md
+++ /dev/null
@@ -1,461 +0,0 @@
-# TensorFlow Linear Model Tutorial
-
-In this tutorial, we will use the tf.estimator API in TensorFlow to solve a
-binary classification problem: Given census data about a person such as age,
-education, marital status, and occupation (the features), we will try to predict
-whether or not the person earns more than 50,000 dollars a year (the target
-label). We will train a **logistic regression** model, and given an individual's
-information our model will output a number between 0 and 1, which can be
-interpreted as the probability that the individual has an annual income of over
-50,000 dollars.
-
-## Setup
-
-To try the code for this tutorial:
-
-1.  @{$install$Install TensorFlow} if you haven't already.
-
-2.  Download [the tutorial code](https://github.com/tensorflow/models/tree/master/official/wide_deep/).
-
-3. Execute the data download script we provide to you:
-
-        $ python data_download.py
-
-4. Execute the tutorial code with the following command to train the linear
-model described in this tutorial:
-
-        $ python wide_deep.py --model_type=wide
-
-Read on to find out how this code builds its linear model.
-
-## Reading The Census Data
-
-The dataset we'll be using is the
-[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).
-We have provided
-[data_download.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/data_download.py)
-which downloads the code and performs some additional cleanup.
-
-Since the task is a binary classification problem, we'll construct a label
-column named "label" whose value is 1 if the income is over 50K, and 0
-otherwise. For reference, see `input_fn` in
-[wide_deep.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py).
-
-Next, let's take a look at the dataframe and see which columns we can use to
-predict the target label. The columns can be grouped into two types—categorical
-and continuous columns:
-
-*   A column is called **categorical** if its value can only be one of the
-    categories in a finite set. For example, the relationship status of a person
-    (wife, husband, unmarried, etc.) or the education level (high school,
-    college, etc.) are categorical columns.
-*   A column is called **continuous** if its value can be any numerical value in
-    a continuous range. For example, the capital gain of a person (e.g. $14,084)
-    is a continuous column.
-
-Here's a list of columns available in the Census Income dataset:
-
-| Column Name    | Type        | Description                       |
-| -------------- | ----------- | --------------------------------- |
-| age            | Continuous  | The age of the individual         |
-| workclass      | Categorical | The type of employer the          |
-:                :             : individual has (government,       :
-:                :             : military, private, etc.).         :
-| fnlwgt         | Continuous  | The number of people the census   |
-:                :             : takers believe that observation   :
-:                :             : represents (sample weight). Final :
-:                :             : weight will not be used.          :
-| education      | Categorical | The highest level of education    |
-:                :             : achieved for that individual.     :
-| education_num  | Continuous  | The highest level of education in |
-:                :             : numerical form.                   :
-| marital_status | Categorical | Marital status of the individual. |
-| occupation     | Categorical | The occupation of the individual. |
-| relationship   | Categorical | Wife, Own-child, Husband,         |
-:                :             : Not-in-family, Other-relative,    :
-:                :             : Unmarried.                        :
-| race           | Categorical | Amer-Indian-Eskimo, Asian-Pac-    |
-:                :             : Islander, Black, White, Other.    :
-| gender         | Categorical | Female, Male.                     |
-| capital_gain   | Continuous  | Capital gains recorded.           |
-| capital_loss   | Continuous  | Capital Losses recorded.          |
-| hours_per_week | Continuous  | Hours worked per week.            |
-| native_country | Categorical | Country of origin of the          |
-:                :             : individual.                       :
-| income_bracket | Categorical | ">50K" or "<=50K", meaning        |
-:                :             : whether the person makes more     :
-:                :             : than $50,000 annually.            :
-
-## Converting Data into Tensors
-
-When building a tf.estimator model, the input data is specified by means of an
-Input Builder function. This builder function will not be called until it is
-later passed to tf.estimator.Estimator methods such as `train` and `evaluate`.
-The purpose of this function is to construct the input data, which is
-represented in the form of @{tf.Tensor}s or @{tf.SparseTensor}s.
-In more detail, the input builder function returns the following as a pair:
-
-1.  `features`: A dict from feature column names to `Tensors` or
-    `SparseTensors`.
-2.  `labels`: A `Tensor` containing the label column.
-
-The keys of the `features` will be used to construct columns in the next
-section. Because we want to call the `train` and `evaluate` methods with
-different data, we define a method that returns an input function based on the
-given data. Note that the returned input function will be called while
-constructing the TensorFlow graph, not while running the graph. What it is
-returning is a representation of the input data as the fundamental unit of
-TensorFlow computations, a `Tensor` (or `SparseTensor`).
-
-Each continuous column in the train or test data will be converted into a
-`Tensor`, which in general is a good format to represent dense data. For
-categorical data, we must represent the data as a `SparseTensor`. This data
-format is good for representing sparse data. Our `input_fn` uses the `tf.data`
-API, which makes it easy to apply transformations to our dataset:
-
-```python
-def input_fn(data_file, num_epochs, shuffle, batch_size):
-  """Generate an input function for the Estimator."""
-  assert tf.gfile.Exists(data_file), (
-      '%s not found. Please make sure you have either run data_download.py or '
-      'set both arguments --train_data and --test_data.' % data_file)
-
-  def parse_csv(value):
-    print('Parsing', data_file)
-    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
-    features = dict(zip(_CSV_COLUMNS, columns))
-    labels = features.pop('income_bracket')
-    return features, tf.equal(labels, '>50K')
-
-  # Extract lines from input files using the Dataset API.
-  dataset = tf.data.TextLineDataset(data_file)
-
-  if shuffle:
-    dataset = dataset.shuffle(buffer_size=_SHUFFLE_BUFFER)
-
-  dataset = dataset.map(parse_csv, num_parallel_calls=5)
-
-  # We call repeat after shuffling, rather than before, to prevent separate
-  # epochs from blending together.
-  dataset = dataset.repeat(num_epochs)
-  dataset = dataset.batch(batch_size)
-
-  iterator = dataset.make_one_shot_iterator()
-  features, labels = iterator.get_next()
-  return features, labels
-```
-
-## Selecting and Engineering Features for the Model
-
-Selecting and crafting the right set of feature columns is key to learning an
-effective model. A **feature column** can be either one of the raw columns in
-the original dataframe (let's call them **base feature columns**), or any new
-columns created based on some transformations defined over one or multiple base
-columns (let's call them **derived feature columns**). Basically, "feature
-column" is an abstract concept of any raw or derived variable that can be used
-to predict the target label.
-
-### Base Categorical Feature Columns
-
-To define a feature column for a categorical feature, we can create a
-`CategoricalColumn` using the tf.feature_column API. If you know the set of all
-possible feature values of a column and there are only a few of them, you can
-use `categorical_column_with_vocabulary_list`. Each key in the list will get
-assigned an auto-incremental ID starting from 0. For example, for the
-`relationship` column we can assign the feature string "Husband" to an integer
-ID of 0 and "Not-in-family" to 1, etc., by doing:
-
-```python
-relationship = tf.feature_column.categorical_column_with_vocabulary_list(
-    'relationship', [
-        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
-        'Other-relative'])
-```
-
-What if we don't know the set of possible values in advance? Not a problem. We
-can use `categorical_column_with_hash_bucket` instead:
-
-```python
-occupation = tf.feature_column.categorical_column_with_hash_bucket(
-    'occupation', hash_bucket_size=1000)
-```
-
-What will happen is that each possible value in the feature column `occupation`
-will be hashed to an integer ID as we encounter them in training. See an example
-illustration below:
-
-ID  | Feature
---- | -------------
-... |
-9   | `"Machine-op-inspct"`
-... |
-103 | `"Farming-fishing"`
-... |
-375 | `"Protective-serv"`
-... |
-
-No matter which way we choose to define a `SparseColumn`, each feature string
-will be mapped into an integer ID by looking up a fixed mapping or by hashing.
-Note that hashing collisions are possible, but may not significantly impact the
-model quality. Under the hood, the `LinearModel` class is responsible for
-managing the mapping and creating `tf.Variable` to store the model parameters
-(also known as model weights) for each feature ID. The model parameters will be
-learned through the model training process we'll go through later.
-
-We'll do the similar trick to define the other categorical features:
-
-```python
-education = tf.feature_column.categorical_column_with_vocabulary_list(
-    'education', [
-        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
-        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
-        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])
-
-marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
-    'marital_status', [
-        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
-        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])
-
-relationship = tf.feature_column.categorical_column_with_vocabulary_list(
-    'relationship', [
-        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
-        'Other-relative'])
-
-workclass = tf.feature_column.categorical_column_with_vocabulary_list(
-    'workclass', [
-        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
-        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])
-
-# To show an example of hashing:
-occupation = tf.feature_column.categorical_column_with_hash_bucket(
-    'occupation', hash_bucket_size=1000)
-```
-
-### Base Continuous Feature Columns
-
-Similarly, we can define a `NumericColumn` for each continuous feature column
-that we want to use in the model:
-
-```python
-age = tf.feature_column.numeric_column('age')
-education_num = tf.feature_column.numeric_column('education_num')
-capital_gain = tf.feature_column.numeric_column('capital_gain')
-capital_loss = tf.feature_column.numeric_column('capital_loss')
-hours_per_week = tf.feature_column.numeric_column('hours_per_week')
-```
-
-### Making Continuous Features Categorical through Bucketization
-
-Sometimes the relationship between a continuous feature and the label is not
-linear. As a hypothetical example, a person's income may grow with age in the
-early stage of one's career, then the growth may slow at some point, and finally
-the income decreases after retirement. In this scenario, using the raw `age` as
-a real-valued feature column might not be a good choice because the model can
-only learn one of the three cases:
-
-1.  Income always increases at some rate as age grows (positive correlation),
-1.  Income always decreases at some rate as age grows (negative correlation), or
-1.  Income stays the same no matter at what age (no correlation)
-
-If we want to learn the fine-grained correlation between income and each age
-group separately, we can leverage **bucketization**. Bucketization is a process
-of dividing the entire range of a continuous feature into a set of consecutive
-bins/buckets, and then converting the original numerical feature into a bucket
-ID (as a categorical feature) depending on which bucket that value falls into.
-So, we can define a `bucketized_column` over `age` as:
-
-```python
-age_buckets = tf.feature_column.bucketized_column(
-    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
-```
-
-where the `boundaries` is a list of bucket boundaries. In this case, there are
-10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,
-25-29, ..., to 65 and over).
-
-### Intersecting Multiple Columns with CrossedColumn
-
-Using each base feature column separately may not be enough to explain the data.
-For example, the correlation between education and the label (earning > 50,000
-dollars) may be different for different occupations. Therefore, if we only learn
-a single model weight for `education="Bachelors"` and `education="Masters"`, we
-won't be able to capture every single education-occupation combination (e.g.
-distinguishing between `education="Bachelors" AND occupation="Exec-managerial"`
-and `education="Bachelors" AND occupation="Craft-repair"`). To learn the
-differences between different feature combinations, we can add **crossed feature
-columns** to the model.
-
-```python
-education_x_occupation = tf.feature_column.crossed_column(
-    ['education', 'occupation'], hash_bucket_size=1000)
-```
-
-We can also create a `CrossedColumn` over more than two columns. Each
-constituent column can be either a base feature column that is categorical
-(`SparseColumn`), a bucketized real-valued feature column (`BucketizedColumn`),
-or even another `CrossColumn`. Here's an example:
-
-```python
-age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
-    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)
-```
-
-## Defining The Logistic Regression Model
-
-After processing the input data and defining all the feature columns, we're now
-ready to put them all together and build a Logistic Regression model. In the
-previous section we've seen several types of base and derived feature columns,
-including:
-
-*   `CategoricalColumn`
-*   `NumericColumn`
-*   `BucketizedColumn`
-*   `CrossedColumn`
-
-All of these are subclasses of the abstract `FeatureColumn` class, and can be
-added to the `feature_columns` field of a model:
-
-```python
-base_columns = [
-    education, marital_status, relationship, workclass, occupation,
-    age_buckets,
-]
-crossed_columns = [
-    tf.feature_column.crossed_column(
-        ['education', 'occupation'], hash_bucket_size=1000),
-    tf.feature_column.crossed_column(
-        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
-]
-
-model_dir = tempfile.mkdtemp()
-model = tf.estimator.LinearClassifier(
-    model_dir=model_dir, feature_columns=base_columns + crossed_columns)
-```
-
-The model also automatically learns a bias term, which controls the prediction
-one would make without observing any features (see the section "How Logistic
-Regression Works" for more explanations). The learned model files will be stored
-in `model_dir`.
-
-## Training and Evaluating Our Model
-
-After adding all the features to the model, now let's look at how to actually
-train the model. Training a model is just a single command using the
-tf.estimator API:
-
-```python
-model.train(input_fn=lambda: input_fn(train_data, num_epochs, True, batch_size))
-```
-
-After the model is trained, we can evaluate how good our model is at predicting
-the labels of the holdout data:
-
-```python
-results = model.evaluate(input_fn=lambda: input_fn(
-    test_data, 1, False, batch_size))
-for key in sorted(results):
-  print('%s: %s' % (key, results[key]))
-```
-
-The first line of the final output should be something like
-`accuracy: 0.83557522`, which means the accuracy is 83.6%. Feel free to try more
-features and transformations and see if you can do even better!
-
-After the model is evaluated, we can use the model to predict whether an individual has an annual income of over
-50,000 dollars given an individual's information input.
-```python
-  pred_iter = model.predict(input_fn=lambda: input_fn(FLAGS.test_data, 1, False, 1))
-  for pred in pred_iter:
-    print(pred['classes'])
-```
-
-The model prediction output would be like `[b'1']` or `[b'0']` which means whether corresponding individual has an annual income of over 50,000 dollars or not.
-
-If you'd like to see a working end-to-end example, you can download our
-[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py)
-and set the `model_type` flag to `wide`.
-
-## Adding Regularization to Prevent Overfitting
-
-Regularization is a technique used to avoid **overfitting**. Overfitting happens
-when your model does well on the data it is trained on, but worse on test data
-that the model has not seen before, such as live traffic. Overfitting generally
-occurs when a model is excessively complex, such as having too many parameters
-relative to the number of observed training data. Regularization allows for you
-to control your model's complexity and makes the model more generalizable to
-unseen data.
-
-In the Linear Model library, you can add L1 and L2 regularizations to the model
-as:
-
-```
-model = tf.estimator.LinearClassifier(
-    model_dir=model_dir, feature_columns=base_columns + crossed_columns,
-    optimizer=tf.train.FtrlOptimizer(
-        learning_rate=0.1,
-        l1_regularization_strength=1.0,
-        l2_regularization_strength=1.0))
-```
-
-One important difference between L1 and L2 regularization is that L1
-regularization tends to make model weights stay at zero, creating sparser
-models, whereas L2 regularization also tries to make the model weights closer to
-zero but not necessarily zero. Therefore, if you increase the strength of L1
-regularization, you will have a smaller model size because many of the model
-weights will be zero. This is often desirable when the feature space is very
-large but sparse, and when there are resource constraints that prevent you from
-serving a model that is too large.
-
-In practice, you should try various combinations of L1, L2 regularization
-strengths and find the best parameters that best control overfitting and give
-you a desirable model size.
-
-## How Logistic Regression Works
-
-Finally, let's take a minute to talk about what the Logistic Regression model
-actually looks like in case you're not already familiar with it. We'll denote
-the label as \\(Y\\), and the set of observed features as a feature vector
-\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). We define \\(Y=1\\) if an individual
-earned > 50,000 dollars and \\(Y=0\\) otherwise. In Logistic Regression, the
-probability of the label being positive (\\(Y=1\\)) given the features
-\\(\mathbf{x}\\) is given as:
-
-$$ P(Y=1|\mathbf{x}) = \frac{1}{1+\exp(-(\mathbf{w}^T\mathbf{x}+b))}$$
-
-where \\(\mathbf{w}=[w_1, w_2, ..., w_d]\\) are the model weights for the
-features \\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). \\(b\\) is a constant that is
-often called the **bias** of the model. The equation consists of two parts—A
-linear model and a logistic function:
-
-*   **Linear Model**: First, we can see that \\(\mathbf{w}^T\mathbf{x}+b = b +
-    w_1x_1 + ... +w_dx_d\\) is a linear model where the output is a linear
-    function of the input features \\(\mathbf{x}\\). The bias \\(b\\) is the
-    prediction one would make without observing any features. The model weight
-    \\(w_i\\) reflects how the feature \\(x_i\\) is correlated with the positive
-    label. If \\(x_i\\) is positively correlated with the positive label, the
-    weight \\(w_i\\) increases, and the probability \\(P(Y=1|\mathbf{x})\\) will
-    be closer to 1. On the other hand, if \\(x_i\\) is negatively correlated
-    with the positive label, then the weight \\(w_i\\) decreases and the
-    probability \\(P(Y=1|\mathbf{x})\\) will be closer to 0.
-
-*   **Logistic Function**: Second, we can see that there's a logistic function
-    (also known as the sigmoid function) \\(S(t) = 1/(1+\exp(-t))\\) being
-    applied to the linear model. The logistic function is used to convert the
-    output of the linear model \\(\mathbf{w}^T\mathbf{x}+b\\) from any real
-    number into the range of \\([0, 1]\\), which can be interpreted as a
-    probability.
-
-Model training is an optimization problem: The goal is to find a set of model
-weights (i.e. model parameters) to minimize a **loss function** defined over the
-training data, such as logistic loss for Logistic Regression models. The loss
-function measures the discrepancy between the ground-truth label and the model's
-prediction. If the prediction is very close to the ground-truth label, the loss
-value will be low; if the prediction is very far from the label, then the loss
-value would be high.
-
-## Learn Deeper
-
-If you're interested in learning more, check out our
-@{$wide_and_deep$Wide & Deep Learning Tutorial} where we'll show you how to
-combine the strengths of linear models and deep neural networks by jointly
-training them using the tf.estimator API.
diff --git a/tensorflow/docs_src/tutorials/wide_and_deep.md b/tensorflow/docs_src/tutorials/wide_and_deep.md
deleted file mode 100644
index 44677a810b..0000000000
--- a/tensorflow/docs_src/tutorials/wide_and_deep.md
+++ /dev/null
@@ -1,243 +0,0 @@
-# TensorFlow Wide & Deep Learning Tutorial
-
-In the previous @{$wide$TensorFlow Linear Model Tutorial}, we trained a logistic
-regression model to predict the probability that the individual has an annual
-income of over 50,000 dollars using the
-[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).
-TensorFlow is great for training deep neural networks too, and you might be
-thinking which one you should choose—well, why not both? Would it be possible to
-combine the strengths of both in one model?
-
-In this tutorial, we'll introduce how to use the tf.estimator API to jointly
-train a wide linear model and a deep feed-forward neural network. This approach
-combines the strengths of memorization and generalization. It's useful for
-generic large-scale regression and classification problems with sparse input
-features (e.g., categorical features with a large number of possible feature
-values). If you're interested in learning more about how Wide & Deep Learning
-works, please check out our [research paper](https://arxiv.org/abs/1606.07792).
-
-![Wide & Deep Spectrum of Models](https://www.tensorflow.org/images/wide_n_deep.svg "Wide & Deep")
-
-The figure above shows a comparison of a wide model (logistic regression with
-sparse features and transformations), a deep model (feed-forward neural network
-with an embedding layer and several hidden layers), and a Wide & Deep model
-(joint training of both). At a high level, there are only 3 steps to configure a
-wide, deep, or Wide & Deep model using the tf.estimator API:
-
-1.  Select features for the wide part: Choose the sparse base columns and
-    crossed columns you want to use.
-1.  Select features for the deep part: Choose the continuous columns, the
-    embedding dimension for each categorical column, and the hidden layer sizes.
-1.  Put them all together in a Wide & Deep model
-    (`DNNLinearCombinedClassifier`).
-
-And that's it! Let's go through a simple example.
-
-## Setup
-
-To try the code for this tutorial:
-
-1.  @{$install$Install TensorFlow} if you haven't already.
-
-2.  Download [the tutorial code](https://github.com/tensorflow/models/tree/master/official/wide_deep/).
-
-3. Execute the data download script we provide to you:
-
-        $ python data_download.py
-
-4. Execute the tutorial code with the following command to train the wide and
-deep model described in this tutorial:
-
-        $ python wide_deep.py
-
-Read on to find out how this code builds its model.
-
-
-## Define Base Feature Columns
-
-First, let's define the base categorical and continuous feature columns that
-we'll use. These base columns will be the building blocks used by both the wide
-part and the deep part of the model.
-
-```python
-import tensorflow as tf
-
-# Continuous columns
-age = tf.feature_column.numeric_column('age')
-education_num = tf.feature_column.numeric_column('education_num')
-capital_gain = tf.feature_column.numeric_column('capital_gain')
-capital_loss = tf.feature_column.numeric_column('capital_loss')
-hours_per_week = tf.feature_column.numeric_column('hours_per_week')
-
-education = tf.feature_column.categorical_column_with_vocabulary_list(
-    'education', [
-        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
-        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
-        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])
-
-marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
-    'marital_status', [
-        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
-        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])
-
-relationship = tf.feature_column.categorical_column_with_vocabulary_list(
-    'relationship', [
-        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
-        'Other-relative'])
-
-workclass = tf.feature_column.categorical_column_with_vocabulary_list(
-    'workclass', [
-        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
-        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])
-
-# To show an example of hashing:
-occupation = tf.feature_column.categorical_column_with_hash_bucket(
-    'occupation', hash_bucket_size=1000)
-
-# Transformations.
-age_buckets = tf.feature_column.bucketized_column(
-    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
-```
-
-## The Wide Model: Linear Model with Crossed Feature Columns
-
-The wide model is a linear model with a wide set of sparse and crossed feature
-columns:
-
-```python
-base_columns = [
-    education, marital_status, relationship, workclass, occupation,
-    age_buckets,
-]
-
-crossed_columns = [
-    tf.feature_column.crossed_column(
-        ['education', 'occupation'], hash_bucket_size=1000),
-    tf.feature_column.crossed_column(
-        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
-]
-```
-
-You can also see the @{$wide$TensorFlow Linear Model Tutorial} for more details.
-
-Wide models with crossed feature columns can memorize sparse interactions
-between features effectively. That being said, one limitation of crossed feature
-columns is that they do not generalize to feature combinations that have not
-appeared in the training data. Let's add a deep model with embeddings to fix
-that.
-
-## The Deep Model: Neural Network with Embeddings
-
-The deep model is a feed-forward neural network, as shown in the previous
-figure. Each of the sparse, high-dimensional categorical features are first
-converted into a low-dimensional and dense real-valued vector, often referred to
-as an embedding vector. These low-dimensional dense embedding vectors are
-concatenated with the continuous features, and then fed into the hidden layers
-of a neural network in the forward pass. The embedding values are initialized
-randomly, and are trained along with all other model parameters to minimize the
-training loss. If you're interested in learning more about embeddings, check out
-the TensorFlow tutorial on @{$word2vec$Vector Representations of Words} or
-[Word embedding](https://en.wikipedia.org/wiki/Word_embedding) on Wikipedia.
-
-Another way to represent categorical columns to feed into a neural network is
-via a one-hot or multi-hot representation. This is often appropriate for
-categorical columns with only a few possible values. As an example of a one-hot
-representation, for the relationship column, `"Husband"` can be represented as
-[1, 0, 0, 0, 0, 0], and `"Not-in-family"` as [0, 1, 0, 0, 0, 0], etc. This is a
-fixed representation, whereas embeddings are more flexible and calculated at
-training time.
-
-We'll configure the embeddings for the categorical columns using
-`embedding_column`, and concatenate them with the continuous columns.
-We also use `indicator_column` to create multi-hot representations of some
-categorical columns.
-
-```python
-deep_columns = [
-    age,
-    education_num,
-    capital_gain,
-    capital_loss,
-    hours_per_week,
-    tf.feature_column.indicator_column(workclass),
-    tf.feature_column.indicator_column(education),
-    tf.feature_column.indicator_column(marital_status),
-    tf.feature_column.indicator_column(relationship),
-    # To show an example of embedding
-    tf.feature_column.embedding_column(occupation, dimension=8),
-]
-```
-
-The higher the `dimension` of the embedding is, the more degrees of freedom the
-model will have to learn the representations of the features. For simplicity, we
-set the dimension to 8 for all feature columns here. Empirically, a more
-informed decision for the number of dimensions is to start with a value on the
-order of \\(\log_2(n)\\) or \\(k\sqrt[4]n\\), where \\(n\\) is the number of
-unique features in a feature column and \\(k\\) is a small constant (usually
-smaller than 10).
-
-Through dense embeddings, deep models can generalize better and make predictions
-on feature pairs that were previously unseen in the training data. However, it
-is difficult to learn effective low-dimensional representations for feature
-columns when the underlying interaction matrix between two feature columns is
-sparse and high-rank. In such cases, the interaction between most feature pairs
-should be zero except a few, but dense embeddings will lead to nonzero
-predictions for all feature pairs, and thus can over-generalize. On the other
-hand, linear models with crossed features can memorize these “exception rules”
-effectively with fewer model parameters.
-
-Now, let's see how to jointly train wide and deep models and allow them to
-complement each other’s strengths and weaknesses.
-
-## Combining Wide and Deep Models into One
-
-The wide models and deep models are combined by summing up their final output
-log odds as the prediction, then feeding the prediction to a logistic loss
-function. All the graph definition and variable allocations have already been
-handled for you under the hood, so you simply need to create a
-`DNNLinearCombinedClassifier`:
-
-```python
-model = tf.estimator.DNNLinearCombinedClassifier(
-    model_dir='/tmp/census_model',
-    linear_feature_columns=base_columns + crossed_columns,
-    dnn_feature_columns=deep_columns,
-    dnn_hidden_units=[100, 50])
-```
-
-## Training and Evaluating The Model
-
-Before we train the model, let's read in the Census dataset as we did in the
-@{$wide$TensorFlow Linear Model tutorial}. See `data_download.py` as well as
-`input_fn` within
-[`wide_deep.py`](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py).
-
-After reading in the data, you can train and evaluate the model:
-
-```python
-# Train and evaluate the model every `FLAGS.epochs_per_eval` epochs.
-for n in range(FLAGS.train_epochs // FLAGS.epochs_per_eval):
-  model.train(input_fn=lambda: input_fn(
-      FLAGS.train_data, FLAGS.epochs_per_eval, True, FLAGS.batch_size))
-
-  results = model.evaluate(input_fn=lambda: input_fn(
-      FLAGS.test_data, 1, False, FLAGS.batch_size))
-
-  # Display evaluation metrics
-  print('Results at epoch', (n + 1) * FLAGS.epochs_per_eval)
-  print('-' * 30)
-
-  for key in sorted(results):
-    print('%s: %s' % (key, results[key]))
-```
-
-The final output accuracy should be somewhere around 85.5%. If you'd like to
-see a working end-to-end example, you can download our
-[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py).
-
-Note that this tutorial is just a quick example on a small dataset to get you
-familiar with the API. Wide & Deep Learning will be even more powerful if you
-try it on a large dataset with many sparse feature columns that have a large
-number of possible feature values. Again, feel free to take a look at our
-[research paper](https://arxiv.org/abs/1606.07792) for more ideas about how to
-apply Wide & Deep Learning in real-world large-scale machine learning problems.
diff --git a/tensorflow/examples/android/src/org/tensorflow/demo/TensorFlowObjectDetectionAPIModel.java b/tensorflow/examples/android/src/org/tensorflow/demo/TensorFlowObjectDetectionAPIModel.java
index 614d3c7dd7..9739e58018 100644
--- a/tensorflow/examples/android/src/org/tensorflow/demo/TensorFlowObjectDetectionAPIModel.java
+++ b/tensorflow/examples/android/src/org/tensorflow/demo/TensorFlowObjectDetectionAPIModel.java
@@ -137,7 +137,7 @@ public class TensorFlowObjectDetectionAPIModel implements Classifier {
     Trace.beginSection("recognizeImage");
 
     Trace.beginSection("preprocessBitmap");
-    // Preprocess the image data from 0-255 int to normalized float based
+    // Preprocess the image data to extract R, G and B bytes from int of form 0x00RRGGBB
     // on the provided parameters.
     bitmap.getPixels(intValues, 0, bitmap.getWidth(), 0, 0, bitmap.getWidth(), bitmap.getHeight());
 
diff --git a/tensorflow/examples/saved_model/saved_model_half_plus_two.py b/tensorflow/examples/saved_model/saved_model_half_plus_two.py
index 0d6f1ef655..2d1e0c6f6d 100644
--- a/tensorflow/examples/saved_model/saved_model_half_plus_two.py
+++ b/tensorflow/examples/saved_model/saved_model_half_plus_two.py
@@ -33,6 +33,13 @@ where `a`, `b` and `c` are variables with `a=0.5` and `b=2` and `c=3`.
 
 Output from this program is typically used to exercise SavedModel load and
 execution code.
+
+To create a CPU model:
+  bazel run -c opt saved_half_plus_two -- --device=cpu
+
+To create GPU model:
+  bazel run --config=cuda -c opt saved_half_plus_two -- \
+  --device=gpu
 """
 
 from __future__ import absolute_import
@@ -105,42 +112,52 @@ def _build_classification_signature(input_tensor, scores_tensor):
 
 def _generate_saved_model_for_half_plus_two(export_dir,
                                             as_text=False,
-                                            use_main_op=False):
+                                            use_main_op=False,
+                                            device_type="cpu"):
   """Generates SavedModel for half plus two.
 
   Args:
     export_dir: The directory to which the SavedModel should be written.
     as_text: Writes the SavedModel protocol buffer in text format to disk.
     use_main_op: Whether to supply a main op during SavedModel build time.
+    device_name: Device to force ops to run on.
   """
   builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
 
-  with tf.Session(graph=tf.Graph()) as sess:
-    # Set up the model parameters as variables to exercise variable loading
-    # functionality upon restore.
-    a = tf.Variable(0.5, name="a")
-    b = tf.Variable(2.0, name="b")
-    c = tf.Variable(3.0, name="c")
-
-    # Create a placeholder for serialized tensorflow.Example messages to be fed.
-    serialized_tf_example = tf.placeholder(tf.string, name="tf_example")
-
-    # Parse the tensorflow.Example looking for a feature named "x" with a single
-    # floating point value.
-    feature_configs = {
-        "x": tf.FixedLenFeature(
-            [1], dtype=tf.float32),
-        "x2": tf.FixedLenFeature(
-            [1], dtype=tf.float32, default_value=[0.0])
-    }
-    tf_example = tf.parse_example(serialized_tf_example, feature_configs)
-    # Use tf.identity() to assign name
-    x = tf.identity(tf_example["x"], name="x")
-    y = tf.add(tf.multiply(a, x), b, name="y")
-    y2 = tf.add(tf.multiply(a, x), c, name="y2")
-
-    x2 = tf.identity(tf_example["x2"], name="x2")
-    y3 = tf.add(tf.multiply(a, x2), c, name="y3")
+  device_name = "/cpu:0"
+  if device_type == "gpu":
+    device_name = "/gpu:0"
+
+  with tf.Session(
+      graph=tf.Graph(),
+      config=tf.ConfigProto(log_device_placement=True)) as sess:
+    with tf.device(device_name):
+      # Set up the model parameters as variables to exercise variable loading
+      # functionality upon restore.
+      a = tf.Variable(0.5, name="a")
+      b = tf.Variable(2.0, name="b")
+      c = tf.Variable(3.0, name="c")
+
+      # Create a placeholder for serialized tensorflow.Example messages to be
+      # fed.
+      serialized_tf_example = tf.placeholder(tf.string, name="tf_example")
+
+      # Parse the tensorflow.Example looking for a feature named "x" with a
+      # single floating point value.
+      feature_configs = {
+          "x": tf.FixedLenFeature([1], dtype=tf.float32),
+          "x2": tf.FixedLenFeature([1], dtype=tf.float32, default_value=[0.0])
+      }
+      # parse_example only works on CPU
+      with tf.device("/cpu:0"):
+        tf_example = tf.parse_example(serialized_tf_example, feature_configs)
+      # Use tf.identity() to assign name
+      x = tf.identity(tf_example["x"], name="x")
+      y = tf.add(tf.multiply(a, x), b, name="y")
+      y2 = tf.add(tf.multiply(a, x), c, name="y2")
+
+      x2 = tf.identity(tf_example["x2"], name="x2")
+      y3 = tf.add(tf.multiply(a, x2), c, name="y3")
 
     # Create an assets file that can be saved and restored as part of the
     # SavedModel.
@@ -185,20 +202,7 @@ def _generate_saved_model_for_half_plus_two(export_dir,
     }
     # Initialize all variables and then save the SavedModel.
     sess.run(tf.global_variables_initializer())
-    signature_def_map = {
-        "regress_x_to_y":
-            _build_regression_signature(serialized_tf_example, y),
-        "regress_x_to_y2":
-            _build_regression_signature(serialized_tf_example, y2),
-        "regress_x2_to_y3":
-            _build_regression_signature(x2, y3),
-        "classify_x_to_y":
-            _build_classification_signature(serialized_tf_example, y),
-        "classify_x2_to_y3":
-            _build_classification_signature(x2, y3),
-        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
-            predict_signature_def
-    }
+
     if use_main_op:
       builder.add_meta_graph_and_variables(
           sess, [tf.saved_model.tag_constants.SERVING],
@@ -212,19 +216,30 @@ def _generate_saved_model_for_half_plus_two(export_dir,
           signature_def_map=signature_def_map,
           assets_collection=tf.get_collection(tf.GraphKeys.ASSET_FILEPATHS),
           legacy_init_op=tf.group(assign_filename_op))
-    builder.save(as_text)
+  builder.save(as_text)
 
 
 def main(_):
-  _generate_saved_model_for_half_plus_two(FLAGS.output_dir)
-  print("SavedModel generated at: %s" % FLAGS.output_dir)
+  _generate_saved_model_for_half_plus_two(
+      FLAGS.output_dir, device_type=FLAGS.device)
+  print("SavedModel generated for %(device)s at: %(dir)s" % {
+      "device": FLAGS.device,
+      "dir": FLAGS.output_dir
+  })
 
-  _generate_saved_model_for_half_plus_two(FLAGS.output_dir_pbtxt, as_text=True)
-  print("SavedModel generated at: %s" % FLAGS.output_dir_pbtxt)
+  _generate_saved_model_for_half_plus_two(
+      FLAGS.output_dir_pbtxt, as_text=True, device_type=FLAGS.device)
+  print("SavedModel generated for %(device)s at: %(dir)s" % {
+      "device": FLAGS.device,
+      "dir": FLAGS.output_dir_pbtxt
+  })
 
   _generate_saved_model_for_half_plus_two(
-      FLAGS.output_dir_main_op, use_main_op=True)
-  print("SavedModel generated at: %s" % FLAGS.output_dir_main_op)
+      FLAGS.output_dir_main_op, use_main_op=True, device_type=FLAGS.device)
+  print("SavedModel generated for %(device)s at: %(dir)s " % {
+      "device": FLAGS.device,
+      "dir": FLAGS.output_dir_main_op
+  })
 
 
 if __name__ == "__main__":
@@ -244,5 +259,10 @@ if __name__ == "__main__":
       type=str,
       default="/tmp/saved_model_half_plus_two_main_op",
       help="Directory where to output the SavedModel with a main op.")
+  parser.add_argument(
+      "--device",
+      type=str,
+      default="cpu",
+      help="Force model to run on 'cpu' or 'gpu'")
   FLAGS, unparsed = parser.parse_known_args()
   tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
diff --git a/tensorflow/examples/speech_commands/BUILD b/tensorflow/examples/speech_commands/BUILD
index 13bca34a86..7a44e2ee4f 100644
--- a/tensorflow/examples/speech_commands/BUILD
+++ b/tensorflow/examples/speech_commands/BUILD
@@ -56,6 +56,7 @@ tf_py_test(
     srcs = ["input_data_test.py"],
     additional_deps = [
         ":input_data",
+        ":models",
         "//tensorflow/python:client_testlib",
     ],
 )
diff --git a/tensorflow/examples/speech_commands/freeze.py b/tensorflow/examples/speech_commands/freeze.py
index c8671d9c41..89e790d4e4 100644
--- a/tensorflow/examples/speech_commands/freeze.py
+++ b/tensorflow/examples/speech_commands/freeze.py
@@ -54,7 +54,7 @@ FLAGS = None
 
 def create_inference_graph(wanted_words, sample_rate, clip_duration_ms,
                            clip_stride_ms, window_size_ms, window_stride_ms,
-                           dct_coefficient_count, model_architecture):
+                           feature_bin_count, model_architecture, preprocess):
   """Creates an audio model with the nodes needed for inference.
 
   Uses the supplied arguments to create a model, and inserts the input and
@@ -67,14 +67,19 @@ def create_inference_graph(wanted_words, sample_rate, clip_duration_ms,
     clip_stride_ms: How often to run recognition. Useful for models with cache.
     window_size_ms: Time slice duration to estimate frequencies from.
     window_stride_ms: How far apart time slices should be.
-    dct_coefficient_count: Number of frequency bands to analyze.
+    feature_bin_count: Number of frequency bands to analyze.
     model_architecture: Name of the kind of model to generate.
+    preprocess: How the spectrogram is processed to produce features, for
+      example 'mfcc' or 'average'.
+
+  Raises:
+    Exception: If the preprocessing mode isn't recognized.
   """
 
   words_list = input_data.prepare_words_list(wanted_words.split(','))
   model_settings = models.prepare_model_settings(
       len(words_list), sample_rate, clip_duration_ms, window_size_ms,
-      window_stride_ms, dct_coefficient_count)
+      window_stride_ms, feature_bin_count, preprocess)
   runtime_settings = {'clip_stride_ms': clip_stride_ms}
 
   wav_data_placeholder = tf.placeholder(tf.string, [], name='wav_data')
@@ -88,15 +93,25 @@ def create_inference_graph(wanted_words, sample_rate, clip_duration_ms,
       window_size=model_settings['window_size_samples'],
       stride=model_settings['window_stride_samples'],
       magnitude_squared=True)
-  fingerprint_input = contrib_audio.mfcc(
-      spectrogram,
-      decoded_sample_data.sample_rate,
-      dct_coefficient_count=dct_coefficient_count)
-  fingerprint_frequency_size = model_settings['dct_coefficient_count']
-  fingerprint_time_size = model_settings['spectrogram_length']
-  reshaped_input = tf.reshape(fingerprint_input, [
-      -1, fingerprint_time_size * fingerprint_frequency_size
-  ])
+
+  if preprocess == 'average':
+    fingerprint_input = tf.nn.pool(
+        tf.expand_dims(spectrogram, -1),
+        window_shape=[1, model_settings['average_window_width']],
+        strides=[1, model_settings['average_window_width']],
+        pooling_type='AVG',
+        padding='SAME')
+  elif preprocess == 'mfcc':
+    fingerprint_input = contrib_audio.mfcc(
+        spectrogram,
+        sample_rate,
+        dct_coefficient_count=model_settings['fingerprint_width'])
+  else:
+    raise Exception('Unknown preprocess mode "%s" (should be "mfcc" or'
+                    ' "average")' % (preprocess))
+
+  fingerprint_size = model_settings['fingerprint_size']
+  reshaped_input = tf.reshape(fingerprint_input, [-1, fingerprint_size])
 
   logits = models.create_model(
       reshaped_input, model_settings, model_architecture, is_training=False,
@@ -110,10 +125,12 @@ def main(_):
 
   # Create the model and load its weights.
   sess = tf.InteractiveSession()
-  create_inference_graph(FLAGS.wanted_words, FLAGS.sample_rate,
-                         FLAGS.clip_duration_ms, FLAGS.clip_stride_ms,
-                         FLAGS.window_size_ms, FLAGS.window_stride_ms,
-                         FLAGS.dct_coefficient_count, FLAGS.model_architecture)
+  create_inference_graph(
+      FLAGS.wanted_words, FLAGS.sample_rate, FLAGS.clip_duration_ms,
+      FLAGS.clip_stride_ms, FLAGS.window_size_ms, FLAGS.window_stride_ms,
+      FLAGS.feature_bin_count, FLAGS.model_architecture, FLAGS.preprocess)
+  if FLAGS.quantize:
+    tf.contrib.quantize.create_eval_graph()
   models.load_variables_from_checkpoint(sess, FLAGS.start_checkpoint)
 
   # Turn all the variables into inline constants inside the graph and save it.
@@ -155,10 +172,11 @@ if __name__ == '__main__':
       default=10.0,
       help='How long the stride is between spectrogram timeslices',)
   parser.add_argument(
-      '--dct_coefficient_count',
+      '--feature_bin_count',
       type=int,
       default=40,
-      help='How many bins to use for the MFCC fingerprint',)
+      help='How many bins to use for the MFCC fingerprint',
+  )
   parser.add_argument(
       '--start_checkpoint',
       type=str,
@@ -176,5 +194,15 @@ if __name__ == '__main__':
       help='Words to use (others will be added to an unknown label)',)
   parser.add_argument(
       '--output_file', type=str, help='Where to save the frozen graph.')
+  parser.add_argument(
+      '--quantize',
+      type=bool,
+      default=False,
+      help='Whether to train the model for eight-bit deployment')
+  parser.add_argument(
+      '--preprocess',
+      type=str,
+      default='mfcc',
+      help='Spectrogram processing mode. Can be "mfcc" or "average"')
   FLAGS, unparsed = parser.parse_known_args()
   tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
diff --git a/tensorflow/examples/speech_commands/freeze_test.py b/tensorflow/examples/speech_commands/freeze_test.py
index 97c6eac675..c8de6c2152 100644
--- a/tensorflow/examples/speech_commands/freeze_test.py
+++ b/tensorflow/examples/speech_commands/freeze_test.py
@@ -24,14 +24,62 @@ from tensorflow.python.platform import test
 
 class FreezeTest(test.TestCase):
 
-  def testCreateInferenceGraph(self):
+  def testCreateInferenceGraphWithMfcc(self):
     with self.test_session() as sess:
-      freeze.create_inference_graph('a,b,c,d', 16000, 1000.0, 30.0, 30.0, 10.0,
-                                    40, 'conv')
+      freeze.create_inference_graph(
+          wanted_words='a,b,c,d',
+          sample_rate=16000,
+          clip_duration_ms=1000.0,
+          clip_stride_ms=30.0,
+          window_size_ms=30.0,
+          window_stride_ms=10.0,
+          feature_bin_count=40,
+          model_architecture='conv',
+          preprocess='mfcc')
       self.assertIsNotNone(sess.graph.get_tensor_by_name('wav_data:0'))
       self.assertIsNotNone(
           sess.graph.get_tensor_by_name('decoded_sample_data:0'))
       self.assertIsNotNone(sess.graph.get_tensor_by_name('labels_softmax:0'))
+      ops = [node.op for node in sess.graph_def.node]
+      self.assertEqual(1, ops.count('Mfcc'))
+
+  def testCreateInferenceGraphWithoutMfcc(self):
+    with self.test_session() as sess:
+      freeze.create_inference_graph(
+          wanted_words='a,b,c,d',
+          sample_rate=16000,
+          clip_duration_ms=1000.0,
+          clip_stride_ms=30.0,
+          window_size_ms=30.0,
+          window_stride_ms=10.0,
+          feature_bin_count=40,
+          model_architecture='conv',
+          preprocess='average')
+      self.assertIsNotNone(sess.graph.get_tensor_by_name('wav_data:0'))
+      self.assertIsNotNone(
+          sess.graph.get_tensor_by_name('decoded_sample_data:0'))
+      self.assertIsNotNone(sess.graph.get_tensor_by_name('labels_softmax:0'))
+      ops = [node.op for node in sess.graph_def.node]
+      self.assertEqual(0, ops.count('Mfcc'))
+
+  def testFeatureBinCount(self):
+    with self.test_session() as sess:
+      freeze.create_inference_graph(
+          wanted_words='a,b,c,d',
+          sample_rate=16000,
+          clip_duration_ms=1000.0,
+          clip_stride_ms=30.0,
+          window_size_ms=30.0,
+          window_stride_ms=10.0,
+          feature_bin_count=80,
+          model_architecture='conv',
+          preprocess='average')
+      self.assertIsNotNone(sess.graph.get_tensor_by_name('wav_data:0'))
+      self.assertIsNotNone(
+          sess.graph.get_tensor_by_name('decoded_sample_data:0'))
+      self.assertIsNotNone(sess.graph.get_tensor_by_name('labels_softmax:0'))
+      ops = [node.op for node in sess.graph_def.node]
+      self.assertEqual(0, ops.count('Mfcc'))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/examples/speech_commands/generate_streaming_test_wav.py b/tensorflow/examples/speech_commands/generate_streaming_test_wav.py
index 053206ae2f..9858906927 100644
--- a/tensorflow/examples/speech_commands/generate_streaming_test_wav.py
+++ b/tensorflow/examples/speech_commands/generate_streaming_test_wav.py
@@ -87,11 +87,12 @@ def main(_):
   words_list = input_data.prepare_words_list(FLAGS.wanted_words.split(','))
   model_settings = models.prepare_model_settings(
       len(words_list), FLAGS.sample_rate, FLAGS.clip_duration_ms,
-      FLAGS.window_size_ms, FLAGS.window_stride_ms, FLAGS.dct_coefficient_count)
+      FLAGS.window_size_ms, FLAGS.window_stride_ms, FLAGS.feature_bin_count,
+      'mfcc')
   audio_processor = input_data.AudioProcessor(
       '', FLAGS.data_dir, FLAGS.silence_percentage, 10,
       FLAGS.wanted_words.split(','), FLAGS.validation_percentage,
-      FLAGS.testing_percentage, model_settings)
+      FLAGS.testing_percentage, model_settings, FLAGS.data_dir)
 
   output_audio_sample_count = FLAGS.sample_rate * FLAGS.test_duration_seconds
   output_audio = np.zeros((output_audio_sample_count,), dtype=np.float32)
@@ -242,10 +243,11 @@ if __name__ == '__main__':
       default=10.0,
       help='How long the stride is between spectrogram timeslices',)
   parser.add_argument(
-      '--dct_coefficient_count',
+      '--feature_bin_count',
       type=int,
       default=40,
-      help='How many bins to use for the MFCC fingerprint',)
+      help='How many bins to use for the MFCC fingerprint',
+  )
   parser.add_argument(
       '--wanted_words',
       type=str,
diff --git a/tensorflow/examples/speech_commands/input_data.py b/tensorflow/examples/speech_commands/input_data.py
index 63dd18457f..30f2cfa9fe 100644
--- a/tensorflow/examples/speech_commands/input_data.py
+++ b/tensorflow/examples/speech_commands/input_data.py
@@ -153,14 +153,14 @@ class AudioProcessor(object):
 
   def __init__(self, data_url, data_dir, silence_percentage, unknown_percentage,
                wanted_words, validation_percentage, testing_percentage,
-               model_settings):
+               model_settings, summaries_dir):
     self.data_dir = data_dir
     self.maybe_download_and_extract_dataset(data_url, data_dir)
     self.prepare_data_index(silence_percentage, unknown_percentage,
                             wanted_words, validation_percentage,
                             testing_percentage)
     self.prepare_background_data()
-    self.prepare_processing_graph(model_settings)
+    self.prepare_processing_graph(model_settings, summaries_dir)
 
   def maybe_download_and_extract_dataset(self, data_url, dest_directory):
     """Download and extract data set tar file.
@@ -325,7 +325,7 @@ class AudioProcessor(object):
       if not self.background_data:
         raise Exception('No background wav files were found in ' + search_path)
 
-  def prepare_processing_graph(self, model_settings):
+  def prepare_processing_graph(self, model_settings, summaries_dir):
     """Builds a TensorFlow graph to apply the input distortions.
 
     Creates a graph that loads a WAVE file, decodes it, scales the volume,
@@ -341,48 +341,88 @@ class AudioProcessor(object):
       - time_shift_offset_placeholder_: How much to move the clip in time.
       - background_data_placeholder_: PCM sample data for background noise.
       - background_volume_placeholder_: Loudness of mixed-in background.
-      - mfcc_: Output 2D fingerprint of processed audio.
+      - output_: Output 2D fingerprint of processed audio.
 
     Args:
       model_settings: Information about the current model being trained.
+      summaries_dir: Path to save training summary information to.
+
+    Raises:
+      ValueError: If the preprocessing mode isn't recognized.
     """
-    desired_samples = model_settings['desired_samples']
-    self.wav_filename_placeholder_ = tf.placeholder(tf.string, [])
-    wav_loader = io_ops.read_file(self.wav_filename_placeholder_)
-    wav_decoder = contrib_audio.decode_wav(
-        wav_loader, desired_channels=1, desired_samples=desired_samples)
-    # Allow the audio sample's volume to be adjusted.
-    self.foreground_volume_placeholder_ = tf.placeholder(tf.float32, [])
-    scaled_foreground = tf.multiply(wav_decoder.audio,
-                                    self.foreground_volume_placeholder_)
-    # Shift the sample's start position, and pad any gaps with zeros.
-    self.time_shift_padding_placeholder_ = tf.placeholder(tf.int32, [2, 2])
-    self.time_shift_offset_placeholder_ = tf.placeholder(tf.int32, [2])
-    padded_foreground = tf.pad(
-        scaled_foreground,
-        self.time_shift_padding_placeholder_,
-        mode='CONSTANT')
-    sliced_foreground = tf.slice(padded_foreground,
-                                 self.time_shift_offset_placeholder_,
-                                 [desired_samples, -1])
-    # Mix in background noise.
-    self.background_data_placeholder_ = tf.placeholder(tf.float32,
-                                                       [desired_samples, 1])
-    self.background_volume_placeholder_ = tf.placeholder(tf.float32, [])
-    background_mul = tf.multiply(self.background_data_placeholder_,
-                                 self.background_volume_placeholder_)
-    background_add = tf.add(background_mul, sliced_foreground)
-    background_clamp = tf.clip_by_value(background_add, -1.0, 1.0)
-    # Run the spectrogram and MFCC ops to get a 2D 'fingerprint' of the audio.
-    spectrogram = contrib_audio.audio_spectrogram(
-        background_clamp,
-        window_size=model_settings['window_size_samples'],
-        stride=model_settings['window_stride_samples'],
-        magnitude_squared=True)
-    self.mfcc_ = contrib_audio.mfcc(
-        spectrogram,
-        wav_decoder.sample_rate,
-        dct_coefficient_count=model_settings['dct_coefficient_count'])
+    with tf.get_default_graph().name_scope('data'):
+      desired_samples = model_settings['desired_samples']
+      self.wav_filename_placeholder_ = tf.placeholder(
+          tf.string, [], name='wav_filename')
+      wav_loader = io_ops.read_file(self.wav_filename_placeholder_)
+      wav_decoder = contrib_audio.decode_wav(
+          wav_loader, desired_channels=1, desired_samples=desired_samples)
+      # Allow the audio sample's volume to be adjusted.
+      self.foreground_volume_placeholder_ = tf.placeholder(
+          tf.float32, [], name='foreground_volume')
+      scaled_foreground = tf.multiply(wav_decoder.audio,
+                                      self.foreground_volume_placeholder_)
+      # Shift the sample's start position, and pad any gaps with zeros.
+      self.time_shift_padding_placeholder_ = tf.placeholder(
+          tf.int32, [2, 2], name='time_shift_padding')
+      self.time_shift_offset_placeholder_ = tf.placeholder(
+          tf.int32, [2], name='time_shift_offset')
+      padded_foreground = tf.pad(
+          scaled_foreground,
+          self.time_shift_padding_placeholder_,
+          mode='CONSTANT')
+      sliced_foreground = tf.slice(padded_foreground,
+                                   self.time_shift_offset_placeholder_,
+                                   [desired_samples, -1])
+      # Mix in background noise.
+      self.background_data_placeholder_ = tf.placeholder(
+          tf.float32, [desired_samples, 1], name='background_data')
+      self.background_volume_placeholder_ = tf.placeholder(
+          tf.float32, [], name='background_volume')
+      background_mul = tf.multiply(self.background_data_placeholder_,
+                                   self.background_volume_placeholder_)
+      background_add = tf.add(background_mul, sliced_foreground)
+      background_clamp = tf.clip_by_value(background_add, -1.0, 1.0)
+      # Run the spectrogram and MFCC ops to get a 2D 'fingerprint' of the audio.
+      spectrogram = contrib_audio.audio_spectrogram(
+          background_clamp,
+          window_size=model_settings['window_size_samples'],
+          stride=model_settings['window_stride_samples'],
+          magnitude_squared=True)
+      tf.summary.image(
+          'spectrogram', tf.expand_dims(spectrogram, -1), max_outputs=1)
+      # The number of buckets in each FFT row in the spectrogram will depend on
+      # how many input samples there are in each window. This can be quite
+      # large, with a 160 sample window producing 127 buckets for example. We
+      # don't need this level of detail for classification, so we often want to
+      # shrink them down to produce a smaller result. That's what this section
+      # implements. One method is to use average pooling to merge adjacent
+      # buckets, but a more sophisticated approach is to apply the MFCC
+      # algorithm to shrink the representation.
+      if model_settings['preprocess'] == 'average':
+        self.output_ = tf.nn.pool(
+            tf.expand_dims(spectrogram, -1),
+            window_shape=[1, model_settings['average_window_width']],
+            strides=[1, model_settings['average_window_width']],
+            pooling_type='AVG',
+            padding='SAME')
+        tf.summary.image('shrunk_spectrogram', self.output_, max_outputs=1)
+      elif model_settings['preprocess'] == 'mfcc':
+        self.output_ = contrib_audio.mfcc(
+            spectrogram,
+            wav_decoder.sample_rate,
+            dct_coefficient_count=model_settings['fingerprint_width'])
+        tf.summary.image(
+            'mfcc', tf.expand_dims(self.output_, -1), max_outputs=1)
+      else:
+        raise ValueError('Unknown preprocess mode "%s" (should be "mfcc" or'
+                         ' "average")' % (model_settings['preprocess']))
+
+      # Merge all the summaries and write them out to /tmp/retrain_logs (by
+      # default)
+      self.merged_summaries_ = tf.summary.merge_all(scope='data')
+      self.summary_writer_ = tf.summary.FileWriter(summaries_dir + '/data',
+                                                   tf.get_default_graph())
 
   def set_size(self, mode):
     """Calculates the number of samples in the dataset partition.
@@ -418,6 +458,9 @@ class AudioProcessor(object):
 
     Returns:
       List of sample data for the transformed samples, and list of label indexes
+
+    Raises:
+      ValueError: If background samples are too short.
     """
     # Pick one of the partitions to choose samples from.
     candidates = self.data_index[mode]
@@ -460,6 +503,11 @@ class AudioProcessor(object):
       if use_background or sample['label'] == SILENCE_LABEL:
         background_index = np.random.randint(len(self.background_data))
         background_samples = self.background_data[background_index]
+        if len(background_samples) <= model_settings['desired_samples']:
+          raise ValueError(
+              'Background sample is too short! Need more than %d'
+              ' samples but only %d were found' %
+              (model_settings['desired_samples'], len(background_samples)))
         background_offset = np.random.randint(
             0, len(background_samples) - model_settings['desired_samples'])
         background_clipped = background_samples[background_offset:(
@@ -482,7 +530,10 @@ class AudioProcessor(object):
       else:
         input_dict[self.foreground_volume_placeholder_] = 1
       # Run the graph to produce the output audio.
-      data[i - offset, :] = sess.run(self.mfcc_, feed_dict=input_dict).flatten()
+      summary, data_tensor = sess.run(
+          [self.merged_summaries_, self.output_], feed_dict=input_dict)
+      self.summary_writer_.add_summary(summary)
+      data[i - offset, :] = data_tensor.flatten()
       label_index = self.word_to_index[sample['label']]
       labels[i - offset] = label_index
     return data, labels
diff --git a/tensorflow/examples/speech_commands/input_data_test.py b/tensorflow/examples/speech_commands/input_data_test.py
index 13f294d39d..2e551be9a2 100644
--- a/tensorflow/examples/speech_commands/input_data_test.py
+++ b/tensorflow/examples/speech_commands/input_data_test.py
@@ -25,6 +25,7 @@ import tensorflow as tf
 
 from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio
 from tensorflow.examples.speech_commands import input_data
+from tensorflow.examples.speech_commands import models
 from tensorflow.python.platform import test
 
 
@@ -32,7 +33,7 @@ class InputDataTest(test.TestCase):
 
   def _getWavData(self):
     with self.test_session() as sess:
-      sample_data = tf.zeros([1000, 2])
+      sample_data = tf.zeros([32000, 2])
       wav_encoder = contrib_audio.encode_wav(sample_data, 16000)
       wav_data = sess.run(wav_encoder)
     return wav_data
@@ -57,9 +58,31 @@ class InputDataTest(test.TestCase):
         "label_count": 4,
         "window_size_samples": 100,
         "window_stride_samples": 100,
-        "dct_coefficient_count": 40,
+        "fingerprint_width": 40,
+        "preprocess": "mfcc",
     }
 
+  def _runGetDataTest(self, preprocess, window_length_ms):
+    tmp_dir = self.get_temp_dir()
+    wav_dir = os.path.join(tmp_dir, "wavs")
+    os.mkdir(wav_dir)
+    self._saveWavFolders(wav_dir, ["a", "b", "c"], 100)
+    background_dir = os.path.join(wav_dir, "_background_noise_")
+    os.mkdir(background_dir)
+    wav_data = self._getWavData()
+    for i in range(10):
+      file_path = os.path.join(background_dir, "background_audio_%d.wav" % i)
+      self._saveTestWavFile(file_path, wav_data)
+    model_settings = models.prepare_model_settings(
+        4, 16000, 1000, window_length_ms, 20, 40, preprocess)
+    with self.test_session() as sess:
+      audio_processor = input_data.AudioProcessor(
+          "", wav_dir, 10, 10, ["a", "b"], 10, 10, model_settings, tmp_dir)
+      result_data, result_labels = audio_processor.get_data(
+          10, 0, model_settings, 0.3, 0.1, 100, "training", sess)
+      self.assertEqual(10, len(result_data))
+      self.assertEqual(10, len(result_labels))
+
   def testPrepareWordsList(self):
     words_list = ["a", "b"]
     self.assertGreater(
@@ -76,8 +99,9 @@ class InputDataTest(test.TestCase):
   def testPrepareDataIndex(self):
     tmp_dir = self.get_temp_dir()
     self._saveWavFolders(tmp_dir, ["a", "b", "c"], 100)
-    audio_processor = input_data.AudioProcessor("", tmp_dir, 10, 10, ["a", "b"],
-                                                10, 10, self._model_settings())
+    audio_processor = input_data.AudioProcessor("", tmp_dir, 10, 10,
+                                                ["a", "b"], 10, 10,
+                                                self._model_settings(), tmp_dir)
     self.assertLess(0, audio_processor.set_size("training"))
     self.assertTrue("training" in audio_processor.data_index)
     self.assertTrue("validation" in audio_processor.data_index)
@@ -90,7 +114,7 @@ class InputDataTest(test.TestCase):
     self._saveWavFolders(tmp_dir, ["a", "b", "c"], 0)
     with self.assertRaises(Exception) as e:
       _ = input_data.AudioProcessor("", tmp_dir, 10, 10, ["a", "b"], 10, 10,
-                                    self._model_settings())
+                                    self._model_settings(), tmp_dir)
     self.assertTrue("No .wavs found" in str(e.exception))
 
   def testPrepareDataIndexMissing(self):
@@ -98,7 +122,7 @@ class InputDataTest(test.TestCase):
     self._saveWavFolders(tmp_dir, ["a", "b", "c"], 100)
     with self.assertRaises(Exception) as e:
       _ = input_data.AudioProcessor("", tmp_dir, 10, 10, ["a", "b", "d"], 10,
-                                    10, self._model_settings())
+                                    10, self._model_settings(), tmp_dir)
     self.assertTrue("Expected to find" in str(e.exception))
 
   def testPrepareBackgroundData(self):
@@ -110,8 +134,9 @@ class InputDataTest(test.TestCase):
       file_path = os.path.join(background_dir, "background_audio_%d.wav" % i)
       self._saveTestWavFile(file_path, wav_data)
     self._saveWavFolders(tmp_dir, ["a", "b", "c"], 100)
-    audio_processor = input_data.AudioProcessor("", tmp_dir, 10, 10, ["a", "b"],
-                                                10, 10, self._model_settings())
+    audio_processor = input_data.AudioProcessor("", tmp_dir, 10, 10,
+                                                ["a", "b"], 10, 10,
+                                                self._model_settings(), tmp_dir)
     self.assertEqual(10, len(audio_processor.background_data))
 
   def testLoadWavFile(self):
@@ -148,44 +173,27 @@ class InputDataTest(test.TestCase):
         "label_count": 4,
         "window_size_samples": 100,
         "window_stride_samples": 100,
-        "dct_coefficient_count": 40,
+        "fingerprint_width": 40,
+        "preprocess": "mfcc",
     }
     audio_processor = input_data.AudioProcessor("", wav_dir, 10, 10, ["a", "b"],
-                                                10, 10, model_settings)
+                                                10, 10, model_settings, tmp_dir)
     self.assertIsNotNone(audio_processor.wav_filename_placeholder_)
     self.assertIsNotNone(audio_processor.foreground_volume_placeholder_)
     self.assertIsNotNone(audio_processor.time_shift_padding_placeholder_)
     self.assertIsNotNone(audio_processor.time_shift_offset_placeholder_)
     self.assertIsNotNone(audio_processor.background_data_placeholder_)
     self.assertIsNotNone(audio_processor.background_volume_placeholder_)
-    self.assertIsNotNone(audio_processor.mfcc_)
+    self.assertIsNotNone(audio_processor.output_)
 
-  def testGetData(self):
-    tmp_dir = self.get_temp_dir()
-    wav_dir = os.path.join(tmp_dir, "wavs")
-    os.mkdir(wav_dir)
-    self._saveWavFolders(wav_dir, ["a", "b", "c"], 100)
-    background_dir = os.path.join(wav_dir, "_background_noise_")
-    os.mkdir(background_dir)
-    wav_data = self._getWavData()
-    for i in range(10):
-      file_path = os.path.join(background_dir, "background_audio_%d.wav" % i)
-      self._saveTestWavFile(file_path, wav_data)
-    model_settings = {
-        "desired_samples": 160,
-        "fingerprint_size": 40,
-        "label_count": 4,
-        "window_size_samples": 100,
-        "window_stride_samples": 100,
-        "dct_coefficient_count": 40,
-    }
-    audio_processor = input_data.AudioProcessor("", wav_dir, 10, 10, ["a", "b"],
-                                                10, 10, model_settings)
-    with self.test_session() as sess:
-      result_data, result_labels = audio_processor.get_data(
-          10, 0, model_settings, 0.3, 0.1, 100, "training", sess)
-    self.assertEqual(10, len(result_data))
-    self.assertEqual(10, len(result_labels))
+  def testGetDataAverage(self):
+    self._runGetDataTest("average", 10)
+
+  def testGetDataAverageLongWindow(self):
+    self._runGetDataTest("average", 30)
+
+  def testGetDataMfcc(self):
+    self._runGetDataTest("mfcc", 30)
 
   def testGetUnprocessedData(self):
     tmp_dir = self.get_temp_dir()
@@ -198,10 +206,11 @@ class InputDataTest(test.TestCase):
         "label_count": 4,
         "window_size_samples": 100,
         "window_stride_samples": 100,
-        "dct_coefficient_count": 40,
+        "fingerprint_width": 40,
+        "preprocess": "mfcc",
     }
     audio_processor = input_data.AudioProcessor("", wav_dir, 10, 10, ["a", "b"],
-                                                10, 10, model_settings)
+                                                10, 10, model_settings, tmp_dir)
     result_data, result_labels = audio_processor.get_unprocessed_data(
         10, model_settings, "training")
     self.assertEqual(10, len(result_data))
diff --git a/tensorflow/examples/speech_commands/models.py b/tensorflow/examples/speech_commands/models.py
index ab611f414a..4d1454be0d 100644
--- a/tensorflow/examples/speech_commands/models.py
+++ b/tensorflow/examples/speech_commands/models.py
@@ -24,9 +24,21 @@ import math
 import tensorflow as tf
 
 
+def _next_power_of_two(x):
+  """Calculates the smallest enclosing power of two for an input.
+
+  Args:
+    x: Positive float or integer number.
+
+  Returns:
+    Next largest power of two integer.
+  """
+  return 1 if x == 0 else 2**(int(x) - 1).bit_length()
+
+
 def prepare_model_settings(label_count, sample_rate, clip_duration_ms,
-                           window_size_ms, window_stride_ms,
-                           dct_coefficient_count):
+                           window_size_ms, window_stride_ms, feature_bin_count,
+                           preprocess):
   """Calculates common settings needed for all models.
 
   Args:
@@ -35,10 +47,14 @@ def prepare_model_settings(label_count, sample_rate, clip_duration_ms,
     clip_duration_ms: Length of each audio clip to be analyzed.
     window_size_ms: Duration of frequency analysis window.
     window_stride_ms: How far to move in time between frequency windows.
-    dct_coefficient_count: Number of frequency bins to use for analysis.
+    feature_bin_count: Number of frequency bins to use for analysis.
+    preprocess: How the spectrogram is processed to produce features.
 
   Returns:
     Dictionary containing common settings.
+
+  Raises:
+    ValueError: If the preprocessing mode isn't recognized.
   """
   desired_samples = int(sample_rate * clip_duration_ms / 1000)
   window_size_samples = int(sample_rate * window_size_ms / 1000)
@@ -48,16 +64,28 @@ def prepare_model_settings(label_count, sample_rate, clip_duration_ms,
     spectrogram_length = 0
   else:
     spectrogram_length = 1 + int(length_minus_window / window_stride_samples)
-  fingerprint_size = dct_coefficient_count * spectrogram_length
+  if preprocess == 'average':
+    fft_bin_count = 1 + (_next_power_of_two(window_size_samples) / 2)
+    average_window_width = int(math.floor(fft_bin_count / feature_bin_count))
+    fingerprint_width = int(math.ceil(fft_bin_count / average_window_width))
+  elif preprocess == 'mfcc':
+    average_window_width = -1
+    fingerprint_width = feature_bin_count
+  else:
+    raise ValueError('Unknown preprocess mode "%s" (should be "mfcc" or'
+                     ' "average")' % (preprocess))
+  fingerprint_size = fingerprint_width * spectrogram_length
   return {
       'desired_samples': desired_samples,
       'window_size_samples': window_size_samples,
       'window_stride_samples': window_stride_samples,
       'spectrogram_length': spectrogram_length,
-      'dct_coefficient_count': dct_coefficient_count,
+      'fingerprint_width': fingerprint_width,
       'fingerprint_size': fingerprint_size,
       'label_count': label_count,
       'sample_rate': sample_rate,
+      'preprocess': preprocess,
+      'average_window_width': average_window_width,
   }
 
 
@@ -106,10 +134,14 @@ def create_model(fingerprint_input, model_settings, model_architecture,
   elif model_architecture == 'low_latency_svdf':
     return create_low_latency_svdf_model(fingerprint_input, model_settings,
                                          is_training, runtime_settings)
+  elif model_architecture == 'tiny_conv':
+    return create_tiny_conv_model(fingerprint_input, model_settings,
+                                  is_training)
   else:
     raise Exception('model_architecture argument "' + model_architecture +
                     '" not recognized, should be one of "single_fc", "conv",' +
-                    ' "low_latency_conv, or "low_latency_svdf"')
+                    ' "low_latency_conv, "low_latency_svdf",' +
+                    ' or "tiny_conv"')
 
 
 def load_variables_from_checkpoint(sess, start_checkpoint):
@@ -152,9 +184,12 @@ def create_single_fc_model(fingerprint_input, model_settings, is_training):
     dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')
   fingerprint_size = model_settings['fingerprint_size']
   label_count = model_settings['label_count']
-  weights = tf.Variable(
-      tf.truncated_normal([fingerprint_size, label_count], stddev=0.001))
-  bias = tf.Variable(tf.zeros([label_count]))
+  weights = tf.get_variable(
+      name='weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.001),
+      shape=[fingerprint_size, label_count])
+  bias = tf.get_variable(
+      name='bias', initializer=tf.zeros_initializer, shape=[label_count])
   logits = tf.matmul(fingerprint_input, weights) + bias
   if is_training:
     return logits, dropout_prob
@@ -212,18 +247,21 @@ def create_conv_model(fingerprint_input, model_settings, is_training):
   """
   if is_training:
     dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')
-  input_frequency_size = model_settings['dct_coefficient_count']
+  input_frequency_size = model_settings['fingerprint_width']
   input_time_size = model_settings['spectrogram_length']
   fingerprint_4d = tf.reshape(fingerprint_input,
                               [-1, input_time_size, input_frequency_size, 1])
   first_filter_width = 8
   first_filter_height = 20
   first_filter_count = 64
-  first_weights = tf.Variable(
-      tf.truncated_normal(
-          [first_filter_height, first_filter_width, 1, first_filter_count],
-          stddev=0.01))
-  first_bias = tf.Variable(tf.zeros([first_filter_count]))
+  first_weights = tf.get_variable(
+      name='first_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[first_filter_height, first_filter_width, 1, first_filter_count])
+  first_bias = tf.get_variable(
+      name='first_bias',
+      initializer=tf.zeros_initializer,
+      shape=[first_filter_count])
   first_conv = tf.nn.conv2d(fingerprint_4d, first_weights, [1, 1, 1, 1],
                             'SAME') + first_bias
   first_relu = tf.nn.relu(first_conv)
@@ -235,14 +273,17 @@ def create_conv_model(fingerprint_input, model_settings, is_training):
   second_filter_width = 4
   second_filter_height = 10
   second_filter_count = 64
-  second_weights = tf.Variable(
-      tf.truncated_normal(
-          [
-              second_filter_height, second_filter_width, first_filter_count,
-              second_filter_count
-          ],
-          stddev=0.01))
-  second_bias = tf.Variable(tf.zeros([second_filter_count]))
+  second_weights = tf.get_variable(
+      name='second_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[
+          second_filter_height, second_filter_width, first_filter_count,
+          second_filter_count
+      ])
+  second_bias = tf.get_variable(
+      name='second_bias',
+      initializer=tf.zeros_initializer,
+      shape=[second_filter_count])
   second_conv = tf.nn.conv2d(max_pool, second_weights, [1, 1, 1, 1],
                              'SAME') + second_bias
   second_relu = tf.nn.relu(second_conv)
@@ -259,10 +300,14 @@ def create_conv_model(fingerprint_input, model_settings, is_training):
   flattened_second_conv = tf.reshape(second_dropout,
                                      [-1, second_conv_element_count])
   label_count = model_settings['label_count']
-  final_fc_weights = tf.Variable(
-      tf.truncated_normal(
-          [second_conv_element_count, label_count], stddev=0.01))
-  final_fc_bias = tf.Variable(tf.zeros([label_count]))
+  final_fc_weights = tf.get_variable(
+      name='final_fc_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[second_conv_element_count, label_count])
+  final_fc_bias = tf.get_variable(
+      name='final_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[label_count])
   final_fc = tf.matmul(flattened_second_conv, final_fc_weights) + final_fc_bias
   if is_training:
     return final_fc, dropout_prob
@@ -318,7 +363,7 @@ def create_low_latency_conv_model(fingerprint_input, model_settings,
   """
   if is_training:
     dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')
-  input_frequency_size = model_settings['dct_coefficient_count']
+  input_frequency_size = model_settings['fingerprint_width']
   input_time_size = model_settings['spectrogram_length']
   fingerprint_4d = tf.reshape(fingerprint_input,
                               [-1, input_time_size, input_frequency_size, 1])
@@ -327,11 +372,14 @@ def create_low_latency_conv_model(fingerprint_input, model_settings,
   first_filter_count = 186
   first_filter_stride_x = 1
   first_filter_stride_y = 1
-  first_weights = tf.Variable(
-      tf.truncated_normal(
-          [first_filter_height, first_filter_width, 1, first_filter_count],
-          stddev=0.01))
-  first_bias = tf.Variable(tf.zeros([first_filter_count]))
+  first_weights = tf.get_variable(
+      name='first_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[first_filter_height, first_filter_width, 1, first_filter_count])
+  first_bias = tf.get_variable(
+      name='first_bias',
+      initializer=tf.zeros_initializer,
+      shape=[first_filter_count])
   first_conv = tf.nn.conv2d(fingerprint_4d, first_weights, [
       1, first_filter_stride_y, first_filter_stride_x, 1
   ], 'VALID') + first_bias
@@ -351,30 +399,42 @@ def create_low_latency_conv_model(fingerprint_input, model_settings,
   flattened_first_conv = tf.reshape(first_dropout,
                                     [-1, first_conv_element_count])
   first_fc_output_channels = 128
-  first_fc_weights = tf.Variable(
-      tf.truncated_normal(
-          [first_conv_element_count, first_fc_output_channels], stddev=0.01))
-  first_fc_bias = tf.Variable(tf.zeros([first_fc_output_channels]))
+  first_fc_weights = tf.get_variable(
+      name='first_fc_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[first_conv_element_count, first_fc_output_channels])
+  first_fc_bias = tf.get_variable(
+      name='first_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[first_fc_output_channels])
   first_fc = tf.matmul(flattened_first_conv, first_fc_weights) + first_fc_bias
   if is_training:
     second_fc_input = tf.nn.dropout(first_fc, dropout_prob)
   else:
     second_fc_input = first_fc
   second_fc_output_channels = 128
-  second_fc_weights = tf.Variable(
-      tf.truncated_normal(
-          [first_fc_output_channels, second_fc_output_channels], stddev=0.01))
-  second_fc_bias = tf.Variable(tf.zeros([second_fc_output_channels]))
+  second_fc_weights = tf.get_variable(
+      name='second_fc_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[first_fc_output_channels, second_fc_output_channels])
+  second_fc_bias = tf.get_variable(
+      name='second_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[second_fc_output_channels])
   second_fc = tf.matmul(second_fc_input, second_fc_weights) + second_fc_bias
   if is_training:
     final_fc_input = tf.nn.dropout(second_fc, dropout_prob)
   else:
     final_fc_input = second_fc
   label_count = model_settings['label_count']
-  final_fc_weights = tf.Variable(
-      tf.truncated_normal(
-          [second_fc_output_channels, label_count], stddev=0.01))
-  final_fc_bias = tf.Variable(tf.zeros([label_count]))
+  final_fc_weights = tf.get_variable(
+      name='final_fc_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[second_fc_output_channels, label_count])
+  final_fc_bias = tf.get_variable(
+      name='final_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[label_count])
   final_fc = tf.matmul(final_fc_input, final_fc_weights) + final_fc_bias
   if is_training:
     return final_fc, dropout_prob
@@ -422,7 +482,7 @@ def create_low_latency_svdf_model(fingerprint_input, model_settings,
   Args:
     fingerprint_input: TensorFlow node that will output audio feature vectors.
     The node is expected to produce a 2D Tensor of shape:
-      [batch, model_settings['dct_coefficient_count'] *
+      [batch, model_settings['fingerprint_width'] *
               model_settings['spectrogram_length']]
     with the features corresponding to the same time slot arranged contiguously,
     and the oldest slot at index [:, 0], and newest at [:, -1].
@@ -440,7 +500,7 @@ def create_low_latency_svdf_model(fingerprint_input, model_settings,
   if is_training:
     dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')
 
-  input_frequency_size = model_settings['dct_coefficient_count']
+  input_frequency_size = model_settings['fingerprint_width']
   input_time_size = model_settings['spectrogram_length']
 
   # Validation.
@@ -462,8 +522,11 @@ def create_low_latency_svdf_model(fingerprint_input, model_settings,
   num_filters = rank * num_units
   # Create the runtime memory: [num_filters, batch, input_time_size]
   batch = 1
-  memory = tf.Variable(tf.zeros([num_filters, batch, input_time_size]),
-                       trainable=False, name='runtime-memory')
+  memory = tf.get_variable(
+      initializer=tf.zeros_initializer,
+      shape=[num_filters, batch, input_time_size],
+      trainable=False,
+      name='runtime-memory')
   # Determine the number of new frames in the input, such that we only operate
   # on those. For training we do not use the memory, and thus use all frames
   # provided in the input.
@@ -483,8 +546,10 @@ def create_low_latency_svdf_model(fingerprint_input, model_settings,
   new_fingerprint_input = tf.expand_dims(new_fingerprint_input, 2)
 
   # Create the frequency filters.
-  weights_frequency = tf.Variable(
-      tf.truncated_normal([input_frequency_size, num_filters], stddev=0.01))
+  weights_frequency = tf.get_variable(
+      name='weights_frequency',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[input_frequency_size, num_filters])
   # Expand to add input channels dimensions.
   # weights_frequency: [input_frequency_size, 1, num_filters]
   weights_frequency = tf.expand_dims(weights_frequency, 1)
@@ -506,8 +571,10 @@ def create_low_latency_svdf_model(fingerprint_input, model_settings,
     activations_time = new_memory
 
   # Create the time filters.
-  weights_time = tf.Variable(
-      tf.truncated_normal([num_filters, input_time_size], stddev=0.01))
+  weights_time = tf.get_variable(
+      name='weights_time',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[num_filters, input_time_size])
   # Apply the time filter on the outputs of the feature filters.
   # weights_time: [num_filters, input_time_size, 1]
   # outputs: [num_filters, batch, 1]
@@ -524,7 +591,8 @@ def create_low_latency_svdf_model(fingerprint_input, model_settings,
   units_output = tf.transpose(units_output)
 
   # Appy bias.
-  bias = tf.Variable(tf.zeros([num_units]))
+  bias = tf.get_variable(
+      name='bias', initializer=tf.zeros_initializer, shape=[num_units])
   first_bias = tf.nn.bias_add(units_output, bias)
 
   # Relu.
@@ -536,31 +604,135 @@ def create_low_latency_svdf_model(fingerprint_input, model_settings,
     first_dropout = first_relu
 
   first_fc_output_channels = 256
-  first_fc_weights = tf.Variable(
-      tf.truncated_normal([num_units, first_fc_output_channels], stddev=0.01))
-  first_fc_bias = tf.Variable(tf.zeros([first_fc_output_channels]))
+  first_fc_weights = tf.get_variable(
+      name='first_fc_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[num_units, first_fc_output_channels])
+  first_fc_bias = tf.get_variable(
+      name='first_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[first_fc_output_channels])
   first_fc = tf.matmul(first_dropout, first_fc_weights) + first_fc_bias
   if is_training:
     second_fc_input = tf.nn.dropout(first_fc, dropout_prob)
   else:
     second_fc_input = first_fc
   second_fc_output_channels = 256
-  second_fc_weights = tf.Variable(
-      tf.truncated_normal(
-          [first_fc_output_channels, second_fc_output_channels], stddev=0.01))
-  second_fc_bias = tf.Variable(tf.zeros([second_fc_output_channels]))
+  second_fc_weights = tf.get_variable(
+      name='second_fc_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[first_fc_output_channels, second_fc_output_channels])
+  second_fc_bias = tf.get_variable(
+      name='second_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[second_fc_output_channels])
   second_fc = tf.matmul(second_fc_input, second_fc_weights) + second_fc_bias
   if is_training:
     final_fc_input = tf.nn.dropout(second_fc, dropout_prob)
   else:
     final_fc_input = second_fc
   label_count = model_settings['label_count']
-  final_fc_weights = tf.Variable(
-      tf.truncated_normal(
-          [second_fc_output_channels, label_count], stddev=0.01))
-  final_fc_bias = tf.Variable(tf.zeros([label_count]))
+  final_fc_weights = tf.get_variable(
+      name='final_fc_weights',
+      initializer=tf.truncated_normal(stddev=0.01),
+      shape=[second_fc_output_channels, label_count])
+  final_fc_bias = tf.get_variable(
+      name='final_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[label_count])
   final_fc = tf.matmul(final_fc_input, final_fc_weights) + final_fc_bias
   if is_training:
     return final_fc, dropout_prob
   else:
     return final_fc
+
+
+def create_tiny_conv_model(fingerprint_input, model_settings, is_training):
+  """Builds a convolutional model aimed at microcontrollers.
+
+  Devices like DSPs and microcontrollers can have very small amounts of
+  memory and limited processing power. This model is designed to use less
+  than 20KB of working RAM, and fit within 32KB of read-only (flash) memory.
+
+  Here's the layout of the graph:
+
+  (fingerprint_input)
+          v
+      [Conv2D]<-(weights)
+          v
+      [BiasAdd]<-(bias)
+          v
+        [Relu]
+          v
+      [MatMul]<-(weights)
+          v
+      [BiasAdd]<-(bias)
+          v
+
+  This doesn't produce particularly accurate results, but it's designed to be
+  used as the first stage of a pipeline, running on a low-energy piece of
+  hardware that can always be on, and then wake higher-power chips when a
+  possible utterance has been found, so that more accurate analysis can be done.
+
+  During training, a dropout node is introduced after the relu, controlled by a
+  placeholder.
+
+  Args:
+    fingerprint_input: TensorFlow node that will output audio feature vectors.
+    model_settings: Dictionary of information about the model.
+    is_training: Whether the model is going to be used for training.
+
+  Returns:
+    TensorFlow node outputting logits results, and optionally a dropout
+    placeholder.
+  """
+  if is_training:
+    dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')
+  input_frequency_size = model_settings['fingerprint_width']
+  input_time_size = model_settings['spectrogram_length']
+  fingerprint_4d = tf.reshape(fingerprint_input,
+                              [-1, input_time_size, input_frequency_size, 1])
+  first_filter_width = 8
+  first_filter_height = 10
+  first_filter_count = 8
+  first_weights = tf.get_variable(
+      name='first_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[first_filter_height, first_filter_width, 1, first_filter_count])
+  first_bias = tf.get_variable(
+      name='first_bias',
+      initializer=tf.zeros_initializer,
+      shape=[first_filter_count])
+  first_conv_stride_x = 2
+  first_conv_stride_y = 2
+  first_conv = tf.nn.conv2d(fingerprint_4d, first_weights,
+                            [1, first_conv_stride_y, first_conv_stride_x, 1],
+                            'SAME') + first_bias
+  first_relu = tf.nn.relu(first_conv)
+  if is_training:
+    first_dropout = tf.nn.dropout(first_relu, dropout_prob)
+  else:
+    first_dropout = first_relu
+  first_dropout_shape = first_dropout.get_shape()
+  first_dropout_output_width = first_dropout_shape[2]
+  first_dropout_output_height = first_dropout_shape[1]
+  first_dropout_element_count = int(
+      first_dropout_output_width * first_dropout_output_height *
+      first_filter_count)
+  flattened_first_dropout = tf.reshape(first_dropout,
+                                       [-1, first_dropout_element_count])
+  label_count = model_settings['label_count']
+  final_fc_weights = tf.get_variable(
+      name='final_fc_weights',
+      initializer=tf.truncated_normal_initializer(stddev=0.01),
+      shape=[first_dropout_element_count, label_count])
+  final_fc_bias = tf.get_variable(
+      name='final_fc_bias',
+      initializer=tf.zeros_initializer,
+      shape=[label_count])
+  final_fc = (
+      tf.matmul(flattened_first_dropout, final_fc_weights) + final_fc_bias)
+  if is_training:
+    return final_fc, dropout_prob
+  else:
+    return final_fc
diff --git a/tensorflow/examples/speech_commands/models_test.py b/tensorflow/examples/speech_commands/models_test.py
index 80c795367f..0c373967ed 100644
--- a/tensorflow/examples/speech_commands/models_test.py
+++ b/tensorflow/examples/speech_commands/models_test.py
@@ -26,12 +26,29 @@ from tensorflow.python.platform import test
 
 class ModelsTest(test.TestCase):
 
+  def _modelSettings(self):
+    return models.prepare_model_settings(
+        label_count=10,
+        sample_rate=16000,
+        clip_duration_ms=1000,
+        window_size_ms=20,
+        window_stride_ms=10,
+        feature_bin_count=40,
+        preprocess="mfcc")
+
   def testPrepareModelSettings(self):
     self.assertIsNotNone(
-        models.prepare_model_settings(10, 16000, 1000, 20, 10, 40))
+        models.prepare_model_settings(
+            label_count=10,
+            sample_rate=16000,
+            clip_duration_ms=1000,
+            window_size_ms=20,
+            window_stride_ms=10,
+            feature_bin_count=40,
+            preprocess="mfcc"))
 
   def testCreateModelConvTraining(self):
-    model_settings = models.prepare_model_settings(10, 16000, 1000, 20, 10, 40)
+    model_settings = self._modelSettings()
     with self.test_session() as sess:
       fingerprint_input = tf.zeros([1, model_settings["fingerprint_size"]])
       logits, dropout_prob = models.create_model(fingerprint_input,
@@ -42,7 +59,7 @@ class ModelsTest(test.TestCase):
       self.assertIsNotNone(sess.graph.get_tensor_by_name(dropout_prob.name))
 
   def testCreateModelConvInference(self):
-    model_settings = models.prepare_model_settings(10, 16000, 1000, 20, 10, 40)
+    model_settings = self._modelSettings()
     with self.test_session() as sess:
       fingerprint_input = tf.zeros([1, model_settings["fingerprint_size"]])
       logits = models.create_model(fingerprint_input, model_settings, "conv",
@@ -51,7 +68,7 @@ class ModelsTest(test.TestCase):
       self.assertIsNotNone(sess.graph.get_tensor_by_name(logits.name))
 
   def testCreateModelLowLatencyConvTraining(self):
-    model_settings = models.prepare_model_settings(10, 16000, 1000, 20, 10, 40)
+    model_settings = self._modelSettings()
     with self.test_session() as sess:
       fingerprint_input = tf.zeros([1, model_settings["fingerprint_size"]])
       logits, dropout_prob = models.create_model(
@@ -62,7 +79,7 @@ class ModelsTest(test.TestCase):
       self.assertIsNotNone(sess.graph.get_tensor_by_name(dropout_prob.name))
 
   def testCreateModelFullyConnectedTraining(self):
-    model_settings = models.prepare_model_settings(10, 16000, 1000, 20, 10, 40)
+    model_settings = self._modelSettings()
     with self.test_session() as sess:
       fingerprint_input = tf.zeros([1, model_settings["fingerprint_size"]])
       logits, dropout_prob = models.create_model(
@@ -73,7 +90,7 @@ class ModelsTest(test.TestCase):
       self.assertIsNotNone(sess.graph.get_tensor_by_name(dropout_prob.name))
 
   def testCreateModelBadArchitecture(self):
-    model_settings = models.prepare_model_settings(10, 16000, 1000, 20, 10, 40)
+    model_settings = self._modelSettings()
     with self.test_session():
       fingerprint_input = tf.zeros([1, model_settings["fingerprint_size"]])
       with self.assertRaises(Exception) as e:
@@ -81,6 +98,17 @@ class ModelsTest(test.TestCase):
                             "bad_architecture", True)
       self.assertTrue("not recognized" in str(e.exception))
 
+  def testCreateModelTinyConvTraining(self):
+    model_settings = self._modelSettings()
+    with self.test_session() as sess:
+      fingerprint_input = tf.zeros([1, model_settings["fingerprint_size"]])
+      logits, dropout_prob = models.create_model(
+          fingerprint_input, model_settings, "tiny_conv", True)
+      self.assertIsNotNone(logits)
+      self.assertIsNotNone(dropout_prob)
+      self.assertIsNotNone(sess.graph.get_tensor_by_name(logits.name))
+      self.assertIsNotNone(sess.graph.get_tensor_by_name(dropout_prob.name))
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/examples/speech_commands/train.py b/tensorflow/examples/speech_commands/train.py
index fc28eb0631..eca34f8812 100644
--- a/tensorflow/examples/speech_commands/train.py
+++ b/tensorflow/examples/speech_commands/train.py
@@ -98,12 +98,12 @@ def main(_):
   model_settings = models.prepare_model_settings(
       len(input_data.prepare_words_list(FLAGS.wanted_words.split(','))),
       FLAGS.sample_rate, FLAGS.clip_duration_ms, FLAGS.window_size_ms,
-      FLAGS.window_stride_ms, FLAGS.dct_coefficient_count)
+      FLAGS.window_stride_ms, FLAGS.feature_bin_count, FLAGS.preprocess)
   audio_processor = input_data.AudioProcessor(
-      FLAGS.data_url, FLAGS.data_dir, FLAGS.silence_percentage,
-      FLAGS.unknown_percentage,
+      FLAGS.data_url, FLAGS.data_dir,
+      FLAGS.silence_percentage, FLAGS.unknown_percentage,
       FLAGS.wanted_words.split(','), FLAGS.validation_percentage,
-      FLAGS.testing_percentage, model_settings)
+      FLAGS.testing_percentage, model_settings, FLAGS.summaries_dir)
   fingerprint_size = model_settings['fingerprint_size']
   label_count = model_settings['label_count']
   time_shift_samples = int((FLAGS.time_shift_ms * FLAGS.sample_rate) / 1000)
@@ -122,8 +122,25 @@ def main(_):
         'lists, but are %d and %d long instead' % (len(training_steps_list),
                                                    len(learning_rates_list)))
 
-  fingerprint_input = tf.placeholder(
+  input_placeholder = tf.placeholder(
       tf.float32, [None, fingerprint_size], name='fingerprint_input')
+  if FLAGS.quantize:
+    # TODO(petewarden): These values have been derived from the observed ranges
+    # of spectrogram and MFCC inputs. If the preprocessing pipeline changes,
+    # they may need to be updated.
+    if FLAGS.preprocess == 'average':
+      fingerprint_min = 0.0
+      fingerprint_max = 2048.0
+    elif FLAGS.preprocess == 'mfcc':
+      fingerprint_min = -247.0
+      fingerprint_max = 30.0
+    else:
+      raise Exception('Unknown preprocess mode "%s" (should be "mfcc" or'
+                      ' "average")' % (FLAGS.preprocess))
+    fingerprint_input = tf.fake_quant_with_min_max_args(
+        input_placeholder, fingerprint_min, fingerprint_max)
+  else:
+    fingerprint_input = input_placeholder
 
   logits, dropout_prob = models.create_model(
       fingerprint_input,
@@ -146,7 +163,8 @@ def main(_):
   with tf.name_scope('cross_entropy'):
     cross_entropy_mean = tf.losses.sparse_softmax_cross_entropy(
         labels=ground_truth_input, logits=logits)
-  tf.summary.scalar('cross_entropy', cross_entropy_mean)
+  if FLAGS.quantize:
+    tf.contrib.quantize.create_training_graph(quant_delay=0)
   with tf.name_scope('train'), tf.control_dependencies(control_dependencies):
     learning_rate_input = tf.placeholder(
         tf.float32, [], name='learning_rate_input')
@@ -157,7 +175,9 @@ def main(_):
   confusion_matrix = tf.confusion_matrix(
       ground_truth_input, predicted_indices, num_classes=label_count)
   evaluation_step = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
-  tf.summary.scalar('accuracy', evaluation_step)
+  with tf.get_default_graph().name_scope('eval'):
+    tf.summary.scalar('cross_entropy', cross_entropy_mean)
+    tf.summary.scalar('accuracy', evaluation_step)
 
   global_step = tf.train.get_or_create_global_step()
   increment_global_step = tf.assign(global_step, global_step + 1)
@@ -165,7 +185,7 @@ def main(_):
   saver = tf.train.Saver(tf.global_variables())
 
   # Merge all the summaries and write them out to /tmp/retrain_logs (by default)
-  merged_summaries = tf.summary.merge_all()
+  merged_summaries = tf.summary.merge_all(scope='eval')
   train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train',
                                        sess.graph)
   validation_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/validation')
@@ -207,8 +227,11 @@ def main(_):
     # Run the graph with this batch of training data.
     train_summary, train_accuracy, cross_entropy_value, _, _ = sess.run(
         [
-            merged_summaries, evaluation_step, cross_entropy_mean, train_step,
-            increment_global_step
+            merged_summaries,
+            evaluation_step,
+            cross_entropy_mean,
+            train_step,
+            increment_global_step,
         ],
         feed_dict={
             fingerprint_input: train_fingerprints,
@@ -364,10 +387,11 @@ if __name__ == '__main__':
       default=10.0,
       help='How far to move in time between spectogram timeslices.',)
   parser.add_argument(
-      '--dct_coefficient_count',
+      '--feature_bin_count',
       type=int,
       default=40,
-      help='How many bins to use for the MFCC fingerprint',)
+      help='How many bins to use for the MFCC fingerprint',
+  )
   parser.add_argument(
       '--how_many_training_steps',
       type=str,
@@ -423,6 +447,16 @@ if __name__ == '__main__':
       type=bool,
       default=False,
       help='Whether to check for invalid numbers during processing')
+  parser.add_argument(
+      '--quantize',
+      type=bool,
+      default=False,
+      help='Whether to train the model for eight-bit deployment')
+  parser.add_argument(
+      '--preprocess',
+      type=str,
+      default='mfcc',
+      help='Spectrogram processing mode. Can be "mfcc" or "average"')
 
   FLAGS, unparsed = parser.parse_known_args()
   tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
diff --git a/tensorflow/examples/tutorials/mnist/mnist_deep.py b/tensorflow/examples/tutorials/mnist/mnist_deep.py
index 1e0294db27..5d8d8d84fe 100644
--- a/tensorflow/examples/tutorials/mnist/mnist_deep.py
+++ b/tensorflow/examples/tutorials/mnist/mnist_deep.py
@@ -34,6 +34,8 @@ from tensorflow.examples.tutorials.mnist import input_data
 
 import tensorflow as tf
 
+import numpy
+
 FLAGS = None
 
 
@@ -164,8 +166,15 @@ def main(_):
         print('step %d, training accuracy %g' % (i, train_accuracy))
       train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
 
-    print('test accuracy %g' % accuracy.eval(feed_dict={
-        x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
+    # compute in batches to avoid OOM on GPUs 
+    accuracy_l = []
+    for _ in range(20):
+      batch = mnist.test.next_batch(500, shuffle=False)
+      accuracy_l.append(accuracy.eval(feed_dict={x: batch[0], 
+                                                 y_: batch[1], 
+                                                 keep_prob: 1.0}))
+    print('test accuracy %g' % numpy.mean(accuracy_l))
+
 
 if __name__ == '__main__':
   parser = argparse.ArgumentParser()
diff --git a/tensorflow/go/README.md b/tensorflow/go/README.md
index e251356ec8..288a32530a 100644
--- a/tensorflow/go/README.md
+++ b/tensorflow/go/README.md
@@ -46,7 +46,7 @@ from source.
     ```sh
     cd ${GOPATH}/src/github.com/tensorflow/tensorflow
     ./configure
-    bazel build --config opt //tensorflow:libtensorflow.so
+    bazel build -c opt //tensorflow:libtensorflow.so
     ```
 
     This can take a while (tens of minutes, more if also building for GPU).
diff --git a/tensorflow/go/graph.go b/tensorflow/go/graph.go
index 08943a527c..32a77550ee 100644
--- a/tensorflow/go/graph.go
+++ b/tensorflow/go/graph.go
@@ -177,7 +177,14 @@ type OpSpec struct {
 	// being added.
 	ControlDependencies []*Operation
 
-	// Other possible fields: Device, ColocateWith.
+	// The device on which the operation should be executed.
+	// If omitted, an appropriate device will automatically be selected.
+	//
+	// For example, if set of "/device:GPU:0", then the operation will
+	// execute on GPU #0.
+	Device string
+
+	// Other possible fields: ColocateWith.
 }
 
 // AddOperation adds an operation to g.
@@ -225,6 +232,11 @@ func (g *Graph) AddOperation(args OpSpec) (*Operation, error) {
 			return nil, fmt.Errorf("%v (memory will be leaked)", err)
 		}
 	}
+	if len(args.Device) > 0 {
+		cdevice := C.CString(args.Device)
+		C.TF_SetDevice(cdesc, cdevice)
+		C.free(unsafe.Pointer(cdevice))
+	}
 	c := C.TF_FinishOperation(cdesc, status.c)
 	if err := status.Err(); err != nil {
 		return nil, err
diff --git a/tensorflow/go/op/scope.go b/tensorflow/go/op/scope.go
index 13de4294dc..ac39808d83 100644
--- a/tensorflow/go/op/scope.go
+++ b/tensorflow/go/op/scope.go
@@ -37,6 +37,7 @@ type Scope struct {
 	namemap             map[string]int
 	namespace           string
 	controlDependencies []*tf.Operation
+	device              string
 	err                 *scopeErr
 }
 
@@ -82,6 +83,7 @@ func (s *Scope) AddOperation(args tf.OpSpec) *tf.Operation {
 		args.Name = s.namespace + "/" + args.Name
 	}
 	args.ControlDependencies = append(args.ControlDependencies, s.controlDependencies...)
+	args.Device = s.device
 	op, err := s.graph.AddOperation(args)
 	if err != nil {
 		s.UpdateErr(args.Type, err)
@@ -98,10 +100,12 @@ func (s *Scope) SubScope(namespace string) *Scope {
 		namespace = s.namespace + "/" + namespace
 	}
 	return &Scope{
-		graph:     s.graph,
-		namemap:   make(map[string]int),
-		namespace: namespace,
-		err:       s.err,
+		graph:               s.graph,
+		namemap:             make(map[string]int),
+		namespace:           namespace,
+		controlDependencies: s.controlDependencies,
+		device:              s.device,
+		err:                 s.err,
 	}
 }
 
@@ -123,6 +127,25 @@ func (s *Scope) WithControlDependencies(ops ...*tf.Operation) *Scope {
 		namemap:             s.namemap,
 		namespace:           s.namespace,
 		controlDependencies: deps,
+		device:              s.device,
+		err:                 s.err,
+	}
+}
+
+// WithDevice returns a new Scope which will cause all operations added to the
+// graph to execute on devices that match the provided device specification.
+//
+// For example, WithDevice("/device:GPU:0") will cause operations added to
+// the graph to execute on GPU #0.
+//
+// An empty string removes any device restrictions.
+func (s *Scope) WithDevice(device string) *Scope {
+	return &Scope{
+		graph:               s.graph,
+		namemap:             s.namemap,
+		namespace:           s.namespace,
+		controlDependencies: s.controlDependencies,
+		device:              device,
 		err:                 s.err,
 	}
 }
diff --git a/tensorflow/go/op/scope_test.go b/tensorflow/go/op/scope_test.go
index b58a61de98..be7b0ad892 100644
--- a/tensorflow/go/op/scope_test.go
+++ b/tensorflow/go/op/scope_test.go
@@ -112,6 +112,21 @@ func TestControlDependencies(t *testing.T) {
 	}
 }
 
+func TestDevice(t *testing.T) {
+	s := NewScope()
+	matrix := Const(s, [][]float32{{3.0}})
+	s = s.WithDevice("/device:GPU:0")
+	square := MatMul(s.SubScope("square"), matrix, matrix)
+	s = s.WithDevice("")
+	cube := MatMul(s.SubScope("cube"), square, matrix)
+	if got, want := square.Op.Device(), "/device:GPU:0"; got != want {
+		t.Errorf("Got %q, want %q", got, want)
+	}
+	if got, want := cube.Op.Device(), ""; got != want {
+		t.Errorf("Got %q, want %q", got, want)
+	}
+}
+
 func TestScopeFinalize(t *testing.T) {
 	var (
 		root = NewScope()
diff --git a/tensorflow/go/op/wrappers.go b/tensorflow/go/op/wrappers.go
index 7f1f0970a6..ca1521e641 100644
--- a/tensorflow/go/op/wrappers.go
+++ b/tensorflow/go/op/wrappers.go
@@ -327,15 +327,19 @@ func FakeQuantWithMinMaxArgs(scope *Scope, inputs tf.Output, optional ...FakeQua
 	return op.Output(0)
 }
 
-// Scatter `updates` into a new (initially zero) tensor according to `indices`.
+// Scatter `updates` into a new tensor according to `indices`.
 //
-// Creates a new tensor by applying sparse `updates` to individual
-// values or slices within a zero tensor of the given `shape` according to
-// indices.  This operator is the inverse of the @{tf.gather_nd} operator which
-// extracts values or slices from a given tensor.
+// Creates a new tensor by applying sparse `updates` to individual values or
+// slices within a tensor (initially zero for numeric, empty for string) of
+// the given `shape` according to indices.  This operator is the inverse of the
+// @{tf.gather_nd} operator which extracts values or slices from a given tensor.
+//
+// If `indices` contains duplicates, then their updates are accumulated (summed).
 //
 // **WARNING**: The order in which updates are applied is nondeterministic, so the
-// output will be nondeterministic if `indices` contains duplicates.
+// output will be nondeterministic if `indices` contains duplicates -- because
+// of some numerical approximation issues, numbers summed in different order
+// may yield different results.
 //
 // `indices` is an integer tensor containing indices into a new tensor of shape
 // `shape`.  The last dimension of `indices` can be at most the rank of `shape`:
@@ -430,7 +434,8 @@ type QuantizeAndDequantizeV2Attr func(optionalAttr)
 
 // QuantizeAndDequantizeV2SignedInput sets the optional signed_input attribute to value.
 //
-// value: If the quantization is signed or unsigned.
+// value: Whether the quantization is signed or unsigned. (actually this parameter should
+// have been called <b>`signed_output`</b>)
 // If not specified, defaults to true
 func QuantizeAndDequantizeV2SignedInput(value bool) QuantizeAndDequantizeV2Attr {
 	return func(m optionalAttr) {
@@ -450,7 +455,7 @@ func QuantizeAndDequantizeV2NumBits(value int64) QuantizeAndDequantizeV2Attr {
 
 // QuantizeAndDequantizeV2RangeGiven sets the optional range_given attribute to value.
 //
-// value: If the range is given or should be computed from the tensor.
+// value: Whether the range is given or should be determined from the `input` tensor.
 // If not specified, defaults to false
 func QuantizeAndDequantizeV2RangeGiven(value bool) QuantizeAndDequantizeV2Attr {
 	return func(m optionalAttr) {
@@ -461,61 +466,64 @@ func QuantizeAndDequantizeV2RangeGiven(value bool) QuantizeAndDequantizeV2Attr {
 // Quantizes then dequantizes a tensor.
 //
 // This op simulates the precision loss from the quantized forward pass by:
+//
 // 1. Quantizing the tensor to fixed point numbers, which should match the target
 //    quantization method when it is used in inference.
 // 2. Dequantizing it back to floating point numbers for the following ops, most
 //    likely matmul.
 //
-// There are different ways to quantize. This version does not use the full range
-// of the output type, choosing to elide the lowest possible value for symmetry
-// (e.g., output range is -127 to 127, not -128 to 127 for signed 8 bit
-// quantization), so that 0.0 maps to 0.
-//
-// To perform this op, we first find the range of values in our tensor. The range
-// we use is always centered on 0, so we find m such that
-//
-// 1. m = max(abs(input_min), abs(input_max)) if range_given is true,
-// 2. m = max(abs(min_elem(input)), abs(max_elem(input))) otherwise.
+// There are different ways to quantize. This version uses only scaling, so 0.0
+// maps to 0.
 //
-// Our input tensor range is then [-m, m].
+// From the specified 'num_bits' in the quantized output type, it determines
+// minimum and maximum representable quantized values.
 //
-// Next, we choose our fixed-point quantization buckets, [min_fixed, max_fixed].
-// If signed_input is true, this is
+// e.g.
 //
-//   [min_fixed, max_fixed ] =
-//       [-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1].
+// *   [-128, 127] for signed, num_bits = 8, or
+// *   [0, 255] for unsigned, num_bits = 8.
 //
-// Otherwise, if signed_input is false, the fixed-point range is
+// If range_given == False, the initial input_min, input_max will be determined
+// automatically as the minimum and maximum values in the input tensor, otherwise
+// the specified values of input_min, input_max are used.
 //
-//   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1].
+// Note: If the input_min, input_max are specified, they do not need to equal the
+// actual minimum and maximum values in the tensor. e.g. in some cases it may be
+// beneficial to specify these values such that the low probability extremes of the
+// input distribution are clipped.
 //
-// From this we compute our scaling factor, s:
-//
-//   s = (max_fixed - min_fixed) / (2 * m).
+// This op determines the maximum scale_factor that would map the initial
+// [input_min, input_max] range to a range that lies within the representable
+// quantized range.
 //
-// Now we can quantize and dequantize the elements of our tensor.  An element e
-// is transformed into e':
+// It determines the scale from one of input_min and input_max, then updates the
+// other one to maximize the respresentable range.
 //
-//   e' = (e * s).round_to_nearest() / s.
+// e.g.
 //
-// Note that we have a different number of buckets in the signed vs. unsigned
-// cases.  For example, if num_bits == 8, we get 254 buckets in the signed case
-// vs. 255 in the unsigned case.
+// *   if the output is signed, num_bits = 8, [input_min, input_max] = [-10.0,
+//     5.0]: it would use a scale_factor of -128 / -10.0 = 12.8 In this case, it
+//     would update input_max to be 127 / 12.8 = 9.921875
+// *   if the output is signed, num_bits = 8, [input_min, input_max] = [-10.0,
+//     10.0]: it would use a scale_factor of 127 / 10.0 = 12.7 In this case, it
+//     would update input_min to be 128.0 / 12.7 = -10.07874
+// *   if the output is unsigned, input_min is forced to be 0, and only the
+//     specified input_max is used.
 //
-// For example, suppose num_bits = 8 and m = 1.  Then
+// After determining the scale_factor and updating the input range, it applies the
+// following to each value in the 'input' tensor.
 //
-//   [min_fixed, max_fixed] = [-127, 127], and
-//   s = (127 + 127) / 2 = 127.
+// output = round(clamp(value, input_min, input_max) * scale_factor) / scale_factor.
 //
-// Given the vector {-1, -0.5, 0, 0.3}, this is quantized to
-// {-127, -63, 0, 38}, and dequantized to {-1, -63.0/127, 0, 38.0/127}.
 //
 // Arguments:
 //	input: Tensor to quantize and then dequantize.
-//	input_min: If range_given, this is the min of the range, otherwise this input
-// will be ignored.
-//	input_max: If range_given, this is the max of the range, otherwise this input
-// will be ignored.
+//	input_min: If `range_given == True`, this specifies the minimum input value that needs to
+// be represented, otherwise it is determined from the min value of the `input`
+// tensor.
+//	input_max: If `range_given == True`, this specifies the maximum input value that needs to
+// be represented, otherwise it is determined from the max value of the `input`
+// tensor.
 func QuantizeAndDequantizeV2(scope *Scope, input tf.Output, input_min tf.Output, input_max tf.Output, optional ...QuantizeAndDequantizeV2Attr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
@@ -2249,7 +2257,7 @@ func CheckNumerics(scope *Scope, tensor tf.Output, message string) (output tf.Ou
 // (K-1)-dimensional tensor of indices into `params`, where each element defines a
 // slice of `params`:
 //
-//     output[i_0, ..., i_{K-2}] = params[indices[i0, ..., i_{K-2}]]
+//     output[\\(i_0, ..., i_{K-2}\\)] = params[indices[\\(i_0, ..., i_{K-2}\\)]]
 //
 // Whereas in @{tf.gather} `indices` defines slices into the first
 // dimension of `params`, in `tf.gather_nd`, `indices` defines slices into the
@@ -3015,6 +3023,45 @@ func Concat(scope *Scope, concat_dim tf.Output, values []tf.Output) (output tf.O
 	return op.Output(0)
 }
 
+// Broadcast an array for a compatible shape.
+//
+// Broadcasting is the process of making arrays to have compatible shapes
+// for arithmetic operations. Two shapes are compatible if for each
+// dimension pair they are either equal or one of them is one. When trying
+// to broadcast a Tensor to a shape, it starts with the trailing dimensions,
+// and works its way forward.
+//
+// For example,
+// ```
+// >>> x = tf.constant([1, 2, 3])
+// >>> y = tf.broadcast_to(x, [3, 3])
+// >>> sess.run(y)
+// array([[1, 2, 3],
+//        [1, 2, 3],
+//        [1, 2, 3]], dtype=int32)
+// ```
+// In the above example, the input Tensor with the shape of `[1, 3]`
+// is broadcasted to output Tensor with shape of `[3, 3]`.
+//
+// Arguments:
+//	input: A Tensor to broadcast.
+//	shape: An 1-D `int` Tensor. The shape of the desired output.
+//
+// Returns A Tensor.
+func BroadcastTo(scope *Scope, input tf.Output, shape tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "BroadcastTo",
+		Input: []tf.Input{
+			input, shape,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Converts a flat index or array of flat indices into a tuple of
 //
 // coordinate arrays.
@@ -3045,25 +3092,292 @@ func UnravelIndex(scope *Scope, indices tf.Output, dims tf.Output) (output tf.Ou
 	return op.Output(0)
 }
 
-// Computes gradients for SparseSegmentSqrtN.
+//     Subtracts `v` into specified rows of `x`.
 //
-// Returns tensor "output" with same shape as grad, except for dimension 0 whose
-// value is output_dim0.
+//     Computes y = x; y[i, :] -= v; return y.
 //
 // Arguments:
-//	grad: gradient propagated to the SparseSegmentSqrtN op.
-//	indices: indices passed to the corresponding SparseSegmentSqrtN op.
-//	segment_ids: segment_ids passed to the corresponding SparseSegmentSqrtN op.
-//	output_dim0: dimension 0 of "data" passed to SparseSegmentSqrtN op.
-func SparseSegmentSqrtNGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
+//	x: A `Tensor` of type T.
+//	i: A vector. Indices into the left-most dimension of `x`.
+//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
+//
+// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
+func InplaceSub(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SparseSegmentSqrtNGrad",
+		Type: "InplaceSub",
 		Input: []tf.Input{
-			grad, indices, segment_ids, output_dim0,
+			x, i, v,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+//     Updates specified rows with values in `v`.
+//
+//     Computes `x[i, :] = v; return x`.
+//
+// Arguments:
+//	x: A tensor of type `T`.
+//	i: A vector. Indices into the left-most dimension of `x`.
+//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
+//
+// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
+func InplaceUpdate(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "InplaceUpdate",
+		Input: []tf.Input{
+			x, i, v,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Makes a copy of `x`.
+//
+// Arguments:
+//	x: The source tensor of type `T`.
+//
+// Returns     y: A `Tensor` of type `T`. A copy of `x`. Guaranteed that `y`
+//       is not an alias of `x`.
+func DeepCopy(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "DeepCopy",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// PackAttr is an optional argument to Pack.
+type PackAttr func(optionalAttr)
+
+// PackAxis sets the optional axis attribute to value.
+//
+// value: Dimension along which to pack.  Negative values wrap around, so the
+// valid range is `[-(R+1), R+1)`.
+// If not specified, defaults to 0
+func PackAxis(value int64) PackAttr {
+	return func(m optionalAttr) {
+		m["axis"] = value
+	}
+}
+
+// Packs a list of `N` rank-`R` tensors into one rank-`(R+1)` tensor.
+//
+// Packs the `N` tensors in `values` into a tensor with rank one higher than each
+// tensor in `values`, by packing them along the `axis` dimension.
+// Given a list of tensors of shape `(A, B, C)`;
+//
+// if `axis == 0` then the `output` tensor will have the shape `(N, A, B, C)`.
+// if `axis == 1` then the `output` tensor will have the shape `(A, N, B, C)`.
+// Etc.
+//
+// For example:
+//
+// ```
+// # 'x' is [1, 4]
+// # 'y' is [2, 5]
+// # 'z' is [3, 6]
+// pack([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
+// pack([x, y, z], axis=1) => [[1, 2, 3], [4, 5, 6]]
+// ```
+//
+// This is the opposite of `unpack`.
+//
+// Arguments:
+//	values: Must be of same shape and type.
+//
+// Returns The packed tensor.
+func Pack(scope *Scope, values []tf.Output, optional ...PackAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Pack",
+		Input: []tf.Input{
+			tf.OutputList(values),
 		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Concatenates a list of `N` tensors along the first dimension.
+//
+// The input tensors are all required to have size 1 in the first dimension.
+//
+// For example:
+//
+// ```
+// # 'x' is [[1, 4]]
+// # 'y' is [[2, 5]]
+// # 'z' is [[3, 6]]
+// parallel_concat([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
+// ```
+//
+// The difference between concat and parallel_concat is that concat requires all
+// of the inputs be computed before the operation will begin but doesn't require
+// that the input shapes be known during graph construction.  Parallel concat
+// will copy pieces of the input into the output as they become available, in
+// some situations this can provide a performance benefit.
+//
+// Arguments:
+//	values: Tensors to be concatenated. All must have size 1 in the first dimension
+// and same shape.
+//	shape: the final shape of the result; should be equal to the shapes of any input
+// but with the number of input values in the first dimension.
+//
+// Returns The concatenated tensor.
+func ParallelConcat(scope *Scope, values []tf.Output, shape tf.Shape) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"shape": shape}
+	opspec := tf.OpSpec{
+		Type: "ParallelConcat",
+		Input: []tf.Input{
+			tf.OutputList(values),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// DecodeWavAttr is an optional argument to DecodeWav.
+type DecodeWavAttr func(optionalAttr)
+
+// DecodeWavDesiredChannels sets the optional desired_channels attribute to value.
+//
+// value: Number of sample channels wanted.
+// If not specified, defaults to -1
+func DecodeWavDesiredChannels(value int64) DecodeWavAttr {
+	return func(m optionalAttr) {
+		m["desired_channels"] = value
+	}
+}
+
+// DecodeWavDesiredSamples sets the optional desired_samples attribute to value.
+//
+// value: Length of audio requested.
+// If not specified, defaults to -1
+func DecodeWavDesiredSamples(value int64) DecodeWavAttr {
+	return func(m optionalAttr) {
+		m["desired_samples"] = value
+	}
+}
+
+// Decode a 16-bit PCM WAV file to a float tensor.
+//
+// The -32768 to 32767 signed 16-bit values will be scaled to -1.0 to 1.0 in float.
+//
+// When desired_channels is set, if the input contains fewer channels than this
+// then the last channel will be duplicated to give the requested number, else if
+// the input has more channels than requested then the additional channels will be
+// ignored.
+//
+// If desired_samples is set, then the audio will be cropped or padded with zeroes
+// to the requested length.
+//
+// The first output contains a Tensor with the content of the audio samples. The
+// lowest dimension will be the number of channels, and the second will be the
+// number of samples. For example, a ten-sample-long stereo WAV file should give an
+// output shape of [10, 2].
+//
+// Arguments:
+//	contents: The WAV-encoded audio, usually from a file.
+//
+// Returns 2-D with shape `[length, channels]`.Scalar holding the sample rate found in the WAV header.
+func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (audio tf.Output, sample_rate tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "DecodeWav",
+		Input: []tf.Input{
+			contents,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
+// UnbatchAttr is an optional argument to Unbatch.
+type UnbatchAttr func(optionalAttr)
+
+// UnbatchContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func UnbatchContainer(value string) UnbatchAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// UnbatchSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func UnbatchSharedName(value string) UnbatchAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Reverses the operation of Batch for a single output Tensor.
+//
+// An instance of Unbatch either receives an empty batched_tensor, in which case it
+// asynchronously waits until the values become available from a concurrently
+// running instance of Unbatch with the same container and shared_name, or receives
+// a non-empty batched_tensor in which case it finalizes all other concurrently
+// running instances and outputs its own element from the batch.
+//
+// batched_tensor: The possibly transformed output of Batch. The size of the first
+//  dimension should remain unchanged by the transformations for the operation to
+//  work.
+// batch_index: The matching batch_index obtained from Batch.
+// id: The id scalar emitted by Batch.
+// unbatched_tensor: The Tensor corresponding to this execution.
+// timeout_micros: Maximum amount of time (in microseconds) to wait to receive the
+//  batched input tensor associated with a given invocation of the op.
+// container: Container to control resource sharing.
+// shared_name: Instances of Unbatch with the same container and shared_name are
+//  assumed to possibly belong to the same batch. If left empty, the op name will
+//  be used as the shared name.
+func Unbatch(scope *Scope, batched_tensor tf.Output, batch_index tf.Output, id tf.Output, timeout_micros int64, optional ...UnbatchAttr) (unbatched_tensor tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"timeout_micros": timeout_micros}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Unbatch",
+		Input: []tf.Input{
+			batched_tensor, batch_index, id,
+		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
@@ -3121,6 +3435,57 @@ func StackPopV2(scope *Scope, handle tf.Output, elem_type tf.DataType) (elem tf.
 	return op.Output(0)
 }
 
+// Computes the sum along sparse segments of a tensor.
+//
+// Like `SparseSegmentSum`, but allows missing ids in `segment_ids`. If an id is
+// misisng, the `output` tensor at that position will be zeroed.
+//
+// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// For example:
+//
+// ```python
+// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
+//
+// tf.sparse_segment_sum_with_num_segments(
+//     c, tf.constant([0, 1]), tf.constant([0, 0]), num_segments=3)
+// # => [[0 0 0 0]
+// #     [0 0 0 0]
+// #     [0 0 0 0]]
+//
+// tf.sparse_segment_sum_with_num_segments(c,
+//                                         tf.constant([0, 1]),
+//                                         tf.constant([0, 2],
+//                                         num_segments=4))
+// # => [[ 1  2  3  4]
+// #     [ 0  0  0  0]
+// #     [-1 -2 -3 -4]
+// #     [ 0  0  0  0]]
+// ```
+//
+// Arguments:
+//
+//	indices: A 1-D tensor. Has same rank as `segment_ids`.
+//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
+//	num_segments: Should equal the number of distinct segment IDs.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `num_segments`.
+func SparseSegmentSumWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSegmentSumWithNumSegments",
+		Input: []tf.Input{
+			data, indices, segment_ids, num_segments,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // PreventGradientAttr is an optional argument to PreventGradient.
 type PreventGradientAttr func(optionalAttr)
 
@@ -3309,7 +3674,7 @@ func Relu6(scope *Scope, features tf.Output) (activations tf.Output) {
 // segments.
 //
 // Computes a tensor such that
-// `(output[i] = sum_{j...} data[j...]` where the sum is over tuples `j...` such
+// \\(output[i] = sum_{j...} data[j...]\\) where the sum is over tuples `j...` such
 // that `segment_ids[j...] == i`.  Unlike `SegmentSum`, `segment_ids`
 // need not be sorted and need not cover all values in the full
 // range of valid values.
@@ -3678,11 +4043,13 @@ func Atan2(scope *Scope, y tf.Output, x tf.Output) (z tf.Output) {
 //
 //	window_size: A scalar representing the number of elements in the
 // sliding window.
-//	stride: A scalar representing the steps moving the sliding window
-// forward in one iteration. It must be in `[1, window_size)`.
+//	window_shift: A scalar representing the steps moving the sliding window
+// forward in one iteration. It must be positive.
+//	window_stride: A scalar representing the stride of the input elements of the sliding window.
+// It must be positive.
 //
 //
-func SlideDataset(scope *Scope, input_dataset tf.Output, window_size tf.Output, stride tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+func SlideDataset(scope *Scope, input_dataset tf.Output, window_size tf.Output, window_shift tf.Output, window_stride tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -3690,7 +4057,7 @@ func SlideDataset(scope *Scope, input_dataset tf.Output, window_size tf.Output,
 	opspec := tf.OpSpec{
 		Type: "SlideDataset",
 		Input: []tf.Input{
-			input_dataset, window_size, stride,
+			input_dataset, window_size, window_shift, window_stride,
 		},
 		Attrs: attrs,
 	}
@@ -4635,6 +5002,146 @@ func Rsqrt(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
+// AudioSpectrogramAttr is an optional argument to AudioSpectrogram.
+type AudioSpectrogramAttr func(optionalAttr)
+
+// AudioSpectrogramMagnitudeSquared sets the optional magnitude_squared attribute to value.
+//
+// value: Whether to return the squared magnitude or just the
+// magnitude. Using squared magnitude can avoid extra calculations.
+// If not specified, defaults to false
+func AudioSpectrogramMagnitudeSquared(value bool) AudioSpectrogramAttr {
+	return func(m optionalAttr) {
+		m["magnitude_squared"] = value
+	}
+}
+
+// Produces a visualization of audio data over time.
+//
+// Spectrograms are a standard way of representing audio information as a series of
+// slices of frequency information, one slice for each window of time. By joining
+// these together into a sequence, they form a distinctive fingerprint of the sound
+// over time.
+//
+// This op expects to receive audio data as an input, stored as floats in the range
+// -1 to 1, together with a window width in samples, and a stride specifying how
+// far to move the window between slices. From this it generates a three
+// dimensional output. The lowest dimension has an amplitude value for each
+// frequency during that time slice. The next dimension is time, with successive
+// frequency slices. The final dimension is for the channels in the input, so a
+// stereo audio input would have two here for example.
+//
+// This means the layout when converted and saved as an image is rotated 90 degrees
+// clockwise from a typical spectrogram. Time is descending down the Y axis, and
+// the frequency decreases from left to right.
+//
+// Each value in the result represents the square root of the sum of the real and
+// imaginary parts of an FFT on the current window of samples. In this way, the
+// lowest dimension represents the power of each frequency in the current window,
+// and adjacent windows are concatenated in the next dimension.
+//
+// To get a more intuitive and visual look at what this operation does, you can run
+// tensorflow/examples/wav_to_spectrogram to read in an audio file and save out the
+// resulting spectrogram as a PNG image.
+//
+// Arguments:
+//	input: Float representation of audio data.
+//	window_size: How wide the input window is in samples. For the highest efficiency
+// this should be a power of two, but other values are accepted.
+//	stride: How widely apart the center of adjacent sample windows should be.
+//
+// Returns 3D representation of the audio frequencies as an image.
+func AudioSpectrogram(scope *Scope, input tf.Output, window_size int64, stride int64, optional ...AudioSpectrogramAttr) (spectrogram tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"window_size": window_size, "stride": stride}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "AudioSpectrogram",
+		Input: []tf.Input{
+			input,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// CTCBeamSearchDecoderAttr is an optional argument to CTCBeamSearchDecoder.
+type CTCBeamSearchDecoderAttr func(optionalAttr)
+
+// CTCBeamSearchDecoderMergeRepeated sets the optional merge_repeated attribute to value.
+//
+// value: If true, merge repeated classes in output.
+// If not specified, defaults to true
+func CTCBeamSearchDecoderMergeRepeated(value bool) CTCBeamSearchDecoderAttr {
+	return func(m optionalAttr) {
+		m["merge_repeated"] = value
+	}
+}
+
+// Performs beam search decoding on the logits given in input.
+//
+// A note about the attribute merge_repeated: For the beam search decoder,
+// this means that if consecutive entries in a beam are the same, only
+// the first of these is emitted.  That is, when the top path is "A B B B B",
+// "A B" is returned if merge_repeated = True but "A B B B B" is
+// returned if merge_repeated = False.
+//
+// Arguments:
+//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
+//	sequence_length: A vector containing sequence lengths, size `(batch)`.
+//	beam_width: A scalar >= 0 (beam search beam width).
+//	top_paths: A scalar >= 0, <= beam_width (controls output size).
+//
+// Returns A list (length: top_paths) of indices matrices.  Matrix j,
+// size `(total_decoded_outputs[j] x 2)`, has indices of a
+// `SparseTensor<int64, 2>`.  The rows store: [batch, time].A list (length: top_paths) of values vectors.  Vector j,
+// size `(length total_decoded_outputs[j])`, has the values of a
+// `SparseTensor<int64, 2>`.  The vector stores the decoded classes for beam j.A list (length: top_paths) of shape vector.  Vector j,
+// size `(2)`, stores the shape of the decoded `SparseTensor[j]`.
+// Its values are: `[batch_size, max_decoded_length[j]]`.A matrix, shaped: `(batch_size x top_paths)`.  The
+// sequence log-probabilities.
+func CTCBeamSearchDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, beam_width int64, top_paths int64, optional ...CTCBeamSearchDecoderAttr) (decoded_indices []tf.Output, decoded_values []tf.Output, decoded_shape []tf.Output, log_probability tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"beam_width": beam_width, "top_paths": top_paths}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "CTCBeamSearchDecoder",
+		Input: []tf.Input{
+			inputs, sequence_length,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if decoded_indices, idx, err = makeOutputList(op, idx, "decoded_indices"); err != nil {
+		scope.UpdateErr("CTCBeamSearchDecoder", err)
+		return
+	}
+	if decoded_values, idx, err = makeOutputList(op, idx, "decoded_values"); err != nil {
+		scope.UpdateErr("CTCBeamSearchDecoder", err)
+		return
+	}
+	if decoded_shape, idx, err = makeOutputList(op, idx, "decoded_shape"); err != nil {
+		scope.UpdateErr("CTCBeamSearchDecoder", err)
+		return
+	}
+	log_probability = op.Output(idx)
+	return decoded_indices, decoded_values, decoded_shape, log_probability
+}
+
 // MatrixInverseAttr is an optional argument to MatrixInverse.
 type MatrixInverseAttr func(optionalAttr)
 
@@ -4705,6 +5212,21 @@ func Add(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
+// Computes the derivative of a Gamma random sample w.r.t. `alpha`.
+func RandomGammaGrad(scope *Scope, alpha tf.Output, sample tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "RandomGammaGrad",
+		Input: []tf.Input{
+			alpha, sample,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Computes square of x element-wise.
 //
 // I.e., \\(y = x * x = x^2\\).
@@ -4968,12 +5490,26 @@ func IsBoostedTreesEnsembleInitialized(scope *Scope, tree_ensemble_handle tf.Out
 	return op.Output(0)
 }
 
+// CastAttr is an optional argument to Cast.
+type CastAttr func(optionalAttr)
+
+// CastTruncate sets the optional Truncate attribute to value.
+// If not specified, defaults to false
+func CastTruncate(value bool) CastAttr {
+	return func(m optionalAttr) {
+		m["Truncate"] = value
+	}
+}
+
 // Cast x of type SrcT to y of DstT.
-func Cast(scope *Scope, x tf.Output, DstT tf.DataType) (y tf.Output) {
+func Cast(scope *Scope, x tf.Output, DstT tf.DataType, optional ...CastAttr) (y tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	attrs := map[string]interface{}{"DstT": DstT}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
 		Type: "Cast",
 		Input: []tf.Input{
@@ -5453,7 +5989,7 @@ func LessEqual(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 //
 // For each batch `i` and class `j` we have
 //
-//     softmax[i, j] = exp(logits[i, j]) / sum_j(exp(logits[i, j]))
+//     $$softmax[i, j] = exp(logits[i, j]) / sum_j(exp(logits[i, j]))$$
 //
 // Arguments:
 //	logits: 2-D with shape `[batch_size, num_classes]`.
@@ -5758,146 +6294,6 @@ func Dilation2DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, ou
 	return op.Output(0)
 }
 
-// CTCBeamSearchDecoderAttr is an optional argument to CTCBeamSearchDecoder.
-type CTCBeamSearchDecoderAttr func(optionalAttr)
-
-// CTCBeamSearchDecoderMergeRepeated sets the optional merge_repeated attribute to value.
-//
-// value: If true, merge repeated classes in output.
-// If not specified, defaults to true
-func CTCBeamSearchDecoderMergeRepeated(value bool) CTCBeamSearchDecoderAttr {
-	return func(m optionalAttr) {
-		m["merge_repeated"] = value
-	}
-}
-
-// Performs beam search decoding on the logits given in input.
-//
-// A note about the attribute merge_repeated: For the beam search decoder,
-// this means that if consecutive entries in a beam are the same, only
-// the first of these is emitted.  That is, when the top path is "A B B B B",
-// "A B" is returned if merge_repeated = True but "A B B B B" is
-// returned if merge_repeated = False.
-//
-// Arguments:
-//	inputs: 3-D, shape: `(max_time x batch_size x num_classes)`, the logits.
-//	sequence_length: A vector containing sequence lengths, size `(batch)`.
-//	beam_width: A scalar >= 0 (beam search beam width).
-//	top_paths: A scalar >= 0, <= beam_width (controls output size).
-//
-// Returns A list (length: top_paths) of indices matrices.  Matrix j,
-// size `(total_decoded_outputs[j] x 2)`, has indices of a
-// `SparseTensor<int64, 2>`.  The rows store: [batch, time].A list (length: top_paths) of values vectors.  Vector j,
-// size `(length total_decoded_outputs[j])`, has the values of a
-// `SparseTensor<int64, 2>`.  The vector stores the decoded classes for beam j.A list (length: top_paths) of shape vector.  Vector j,
-// size `(2)`, stores the shape of the decoded `SparseTensor[j]`.
-// Its values are: `[batch_size, max_decoded_length[j]]`.A matrix, shaped: `(batch_size x top_paths)`.  The
-// sequence log-probabilities.
-func CTCBeamSearchDecoder(scope *Scope, inputs tf.Output, sequence_length tf.Output, beam_width int64, top_paths int64, optional ...CTCBeamSearchDecoderAttr) (decoded_indices []tf.Output, decoded_values []tf.Output, decoded_shape []tf.Output, log_probability tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"beam_width": beam_width, "top_paths": top_paths}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "CTCBeamSearchDecoder",
-		Input: []tf.Input{
-			inputs, sequence_length,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if decoded_indices, idx, err = makeOutputList(op, idx, "decoded_indices"); err != nil {
-		scope.UpdateErr("CTCBeamSearchDecoder", err)
-		return
-	}
-	if decoded_values, idx, err = makeOutputList(op, idx, "decoded_values"); err != nil {
-		scope.UpdateErr("CTCBeamSearchDecoder", err)
-		return
-	}
-	if decoded_shape, idx, err = makeOutputList(op, idx, "decoded_shape"); err != nil {
-		scope.UpdateErr("CTCBeamSearchDecoder", err)
-		return
-	}
-	log_probability = op.Output(idx)
-	return decoded_indices, decoded_values, decoded_shape, log_probability
-}
-
-// AudioSpectrogramAttr is an optional argument to AudioSpectrogram.
-type AudioSpectrogramAttr func(optionalAttr)
-
-// AudioSpectrogramMagnitudeSquared sets the optional magnitude_squared attribute to value.
-//
-// value: Whether to return the squared magnitude or just the
-// magnitude. Using squared magnitude can avoid extra calculations.
-// If not specified, defaults to false
-func AudioSpectrogramMagnitudeSquared(value bool) AudioSpectrogramAttr {
-	return func(m optionalAttr) {
-		m["magnitude_squared"] = value
-	}
-}
-
-// Produces a visualization of audio data over time.
-//
-// Spectrograms are a standard way of representing audio information as a series of
-// slices of frequency information, one slice for each window of time. By joining
-// these together into a sequence, they form a distinctive fingerprint of the sound
-// over time.
-//
-// This op expects to receive audio data as an input, stored as floats in the range
-// -1 to 1, together with a window width in samples, and a stride specifying how
-// far to move the window between slices. From this it generates a three
-// dimensional output. The lowest dimension has an amplitude value for each
-// frequency during that time slice. The next dimension is time, with successive
-// frequency slices. The final dimension is for the channels in the input, so a
-// stereo audio input would have two here for example.
-//
-// This means the layout when converted and saved as an image is rotated 90 degrees
-// clockwise from a typical spectrogram. Time is descending down the Y axis, and
-// the frequency decreases from left to right.
-//
-// Each value in the result represents the square root of the sum of the real and
-// imaginary parts of an FFT on the current window of samples. In this way, the
-// lowest dimension represents the power of each frequency in the current window,
-// and adjacent windows are concatenated in the next dimension.
-//
-// To get a more intuitive and visual look at what this operation does, you can run
-// tensorflow/examples/wav_to_spectrogram to read in an audio file and save out the
-// resulting spectrogram as a PNG image.
-//
-// Arguments:
-//	input: Float representation of audio data.
-//	window_size: How wide the input window is in samples. For the highest efficiency
-// this should be a power of two, but other values are accepted.
-//	stride: How widely apart the center of adjacent sample windows should be.
-//
-// Returns 3D representation of the audio frequencies as an image.
-func AudioSpectrogram(scope *Scope, input tf.Output, window_size int64, stride int64, optional ...AudioSpectrogramAttr) (spectrogram tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"window_size": window_size, "stride": stride}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "AudioSpectrogram",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Compute the polygamma function \\(\psi^{(n)}(x)\\).
 //
 // The polygamma function is defined as:
@@ -6071,53 +6467,6 @@ func MutexV2(scope *Scope, optional ...MutexV2Attr) (resource tf.Output) {
 	return op.Output(0)
 }
 
-// AvgPool3DAttr is an optional argument to AvgPool3D.
-type AvgPool3DAttr func(optionalAttr)
-
-// AvgPool3DDataFormat sets the optional data_format attribute to value.
-//
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func AvgPool3DDataFormat(value string) AvgPool3DAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Performs 3D average pooling on the input.
-//
-// Arguments:
-//	input: Shape `[batch, depth, rows, cols, channels]` tensor to pool over.
-//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
-// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
-//
-// Returns The average pooled output tensor.
-func AvgPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "AvgPool3D",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Returns element-wise remainder of division. This emulates C semantics in that
 //
 // the result here is consistent with a truncating divide. E.g.
@@ -6678,8 +7027,9 @@ type CropAndResizeAttr func(optionalAttr)
 
 // CropAndResizeMethod sets the optional method attribute to value.
 //
-// value: A string specifying the interpolation method. Only 'bilinear' is
-// supported for now.
+// value: A string specifying the sampling method for resizing. It can be either
+// `"bilinear"` or `"nearest"` and default to `"bilinear"`. Currently two sampling
+// methods are supported: Bilinear and Nearest Neighbor.
 // If not specified, defaults to "bilinear"
 func CropAndResizeMethod(value string) CropAndResizeAttr {
 	return func(m optionalAttr) {
@@ -6697,19 +7047,23 @@ func CropAndResizeExtrapolationValue(value float32) CropAndResizeAttr {
 	}
 }
 
-// Extracts crops from the input image tensor and bilinearly resizes them (possibly
+// Extracts crops from the input image tensor and resizes them.
 //
-// with aspect ratio change) to a common output size specified by `crop_size`. This
-// is more general than the `crop_to_bounding_box` op which extracts a fixed size
-// slice from the input image and does not allow resizing or aspect ratio change.
+// Extracts crops from the input image tensor and resizes them using bilinear
+// sampling or nearest neighbor sampling (possibly with aspect ratio change) to a
+// common output size specified by `crop_size`. This is more general than the
+// `crop_to_bounding_box` op which extracts a fixed size slice from the input image
+// and does not allow resizing or aspect ratio change.
 //
 // Returns a tensor with `crops` from the input `image` at positions defined at the
 // bounding box locations in `boxes`. The cropped boxes are all resized (with
-// bilinear interpolation) to a fixed `size = [crop_height, crop_width]`. The
-// result is a 4-D tensor `[num_boxes, crop_height, crop_width, depth]`. The
-// resizing is corner aligned. In particular, if `boxes = [[0, 0, 1, 1]]`, the
-// method will give identical results to using `tf.image.resize_bilinear()`
-// with `align_corners=True`.
+// bilinear or nearest neighbor interpolation) to a fixed
+// `size = [crop_height, crop_width]`. The result is a 4-D tensor
+// `[num_boxes, crop_height, crop_width, depth]`. The resizing is corner aligned.
+// In particular, if `boxes = [[0, 0, 1, 1]]`, the method will give identical
+// results to using `tf.image.resize_bilinear()` or
+// `tf.image.resize_nearest_neighbor()`(depends on the `method` argument) with
+// `align_corners=True`.
 //
 // Arguments:
 //	image: A 4-D tensor of shape `[batch, image_height, image_width, depth]`.
@@ -7092,6 +7446,26 @@ func Min(scope *Scope, input tf.Output, axis tf.Output, optional ...MinAttr) (ou
 	return op.Output(0)
 }
 
+// Computes the Bessel i1e function of `x` element-wise.
+//
+// Exponentially scaled modified Bessel function of order 0 defined as
+// `bessel_i1e(x) = exp(-abs(x)) bessel_i1(x)`.
+//
+// This function is faster and numerically stabler than `bessel_i1(x)`.
+func BesselI1e(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "BesselI1e",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Transforms a Tensor into a serialized TensorProto proto.
 //
 // Arguments:
@@ -7127,6 +7501,272 @@ func Acos(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
+// UnbatchGradAttr is an optional argument to UnbatchGrad.
+type UnbatchGradAttr func(optionalAttr)
+
+// UnbatchGradContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func UnbatchGradContainer(value string) UnbatchGradAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// UnbatchGradSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func UnbatchGradSharedName(value string) UnbatchGradAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Gradient of Unbatch.
+//
+// Acts like Batch but using the given batch_index index of batching things as they
+// become available. This ensures that the gradients are propagated back in the
+// same session which did the forward pass.
+//
+// original_input: The input to the Unbatch operation this is the gradient of.
+// batch_index: The batch_index given to the Unbatch operation this is the gradient
+// of.
+// grad: The downstream gradient.
+// id: The id scalar emitted by Batch.
+// batched_grad: The return value, either an empty tensor or the batched gradient.
+// container: Container to control resource sharing.
+// shared_name: Instances of UnbatchGrad with the same container and shared_name
+//  are assumed to possibly belong to the same batch. If left empty, the op name
+//  will be used as the shared name.
+func UnbatchGrad(scope *Scope, original_input tf.Output, batch_index tf.Output, grad tf.Output, id tf.Output, optional ...UnbatchGradAttr) (batched_grad tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "UnbatchGrad",
+		Input: []tf.Input{
+			original_input, batch_index, grad, id,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// AvgPool3DGradAttr is an optional argument to AvgPool3DGrad.
+type AvgPool3DGradAttr func(optionalAttr)
+
+// AvgPool3DGradDataFormat sets the optional data_format attribute to value.
+//
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func AvgPool3DGradDataFormat(value string) AvgPool3DGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Computes gradients of average pooling function.
+//
+// Arguments:
+//	orig_input_shape: The original input dimensions.
+//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
+//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
+// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
+//
+// Returns The backprop for input.
+func AvgPool3DGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DGradAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "AvgPool3DGrad",
+		Input: []tf.Input{
+			orig_input_shape, grad,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// ParseSingleSequenceExampleAttr is an optional argument to ParseSingleSequenceExample.
+type ParseSingleSequenceExampleAttr func(optionalAttr)
+
+// ParseSingleSequenceExampleContextSparseTypes sets the optional context_sparse_types attribute to value.
+//
+// value: A list of Ncontext_sparse types; the data types of data in
+// each context Feature given in context_sparse_keys.
+// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleContextSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["context_sparse_types"] = value
+	}
+}
+
+// ParseSingleSequenceExampleFeatureListDenseTypes sets the optional feature_list_dense_types attribute to value.
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListDenseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_dense_types"] = value
+	}
+}
+
+// ParseSingleSequenceExampleContextDenseShapes sets the optional context_dense_shapes attribute to value.
+//
+// value: A list of Ncontext_dense shapes; the shapes of data in
+// each context Feature given in context_dense_keys.
+// The number of elements in the Feature corresponding to context_dense_key[j]
+// must always equal context_dense_shapes[j].NumEntries().
+// The shape of context_dense_values[j] will match context_dense_shapes[j].
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleContextDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["context_dense_shapes"] = value
+	}
+}
+
+// ParseSingleSequenceExampleFeatureListSparseTypes sets the optional feature_list_sparse_types attribute to value.
+//
+// value: A list of Nfeature_list_sparse types; the data types
+// of data in each FeatureList given in feature_list_sparse_keys.
+// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
+// DT_INT64 (Int64List), and DT_STRING (BytesList).
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_sparse_types"] = value
+	}
+}
+
+// ParseSingleSequenceExampleFeatureListDenseShapes sets the optional feature_list_dense_shapes attribute to value.
+//
+// value: A list of Nfeature_list_dense shapes; the shapes of
+// data in each FeatureList given in feature_list_dense_keys.
+// The shape of each Feature in the FeatureList corresponding to
+// feature_list_dense_key[j] must always equal
+// feature_list_dense_shapes[j].NumEntries().
+// If not specified, defaults to <>
+//
+// REQUIRES: len(value) >= 0
+func ParseSingleSequenceExampleFeatureListDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
+	return func(m optionalAttr) {
+		m["feature_list_dense_shapes"] = value
+	}
+}
+
+// Transforms a scalar brain.SequenceExample proto (as strings) into typed tensors.
+//
+// Arguments:
+//	serialized: A scalar containing a binary serialized SequenceExample proto.
+//	feature_list_dense_missing_assumed_empty: A vector listing the
+// FeatureList keys which may be missing from the SequenceExample.  If the
+// associated FeatureList is missing, it is treated as empty.  By default,
+// any FeatureList not listed in this vector must exist in the SequenceExample.
+//	context_sparse_keys: A list of Ncontext_sparse string Tensors (scalars).
+// The keys expected in the Examples' features associated with context_sparse
+// values.
+//	context_dense_keys: A list of Ncontext_dense string Tensors (scalars).
+// The keys expected in the SequenceExamples' context features associated with
+// dense values.
+//	feature_list_sparse_keys: A list of Nfeature_list_sparse string Tensors
+// (scalars).  The keys expected in the FeatureLists associated with sparse
+// values.
+//	feature_list_dense_keys: A list of Nfeature_list_dense string Tensors (scalars).
+// The keys expected in the SequenceExamples' feature_lists associated
+// with lists of dense values.
+//	context_dense_defaults: A list of Ncontext_dense Tensors (some may be empty).
+// context_dense_defaults[j] provides default values
+// when the SequenceExample's context map lacks context_dense_key[j].
+// If an empty Tensor is provided for context_dense_defaults[j],
+// then the Feature context_dense_keys[j] is required.
+// The input type is inferred from context_dense_defaults[j], even when it's
+// empty.  If context_dense_defaults[j] is not empty, its shape must match
+// context_dense_shapes[j].
+//	debug_name: A scalar containing the name of the serialized proto.
+// May contain, for example, table key (descriptive) name for the
+// corresponding serialized proto.  This is purely useful for debugging
+// purposes, and the presence of values here has no effect on the output.
+// May also be an empty scalar if no name is available.
+func ParseSingleSequenceExample(scope *Scope, serialized tf.Output, feature_list_dense_missing_assumed_empty tf.Output, context_sparse_keys []tf.Output, context_dense_keys []tf.Output, feature_list_sparse_keys []tf.Output, feature_list_dense_keys []tf.Output, context_dense_defaults []tf.Output, debug_name tf.Output, optional ...ParseSingleSequenceExampleAttr) (context_sparse_indices []tf.Output, context_sparse_values []tf.Output, context_sparse_shapes []tf.Output, context_dense_values []tf.Output, feature_list_sparse_indices []tf.Output, feature_list_sparse_values []tf.Output, feature_list_sparse_shapes []tf.Output, feature_list_dense_values []tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "ParseSingleSequenceExample",
+		Input: []tf.Input{
+			serialized, feature_list_dense_missing_assumed_empty, tf.OutputList(context_sparse_keys), tf.OutputList(context_dense_keys), tf.OutputList(feature_list_sparse_keys), tf.OutputList(feature_list_dense_keys), tf.OutputList(context_dense_defaults), debug_name,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if context_sparse_indices, idx, err = makeOutputList(op, idx, "context_sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if context_sparse_values, idx, err = makeOutputList(op, idx, "context_sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if context_sparse_shapes, idx, err = makeOutputList(op, idx, "context_sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if context_dense_values, idx, err = makeOutputList(op, idx, "context_dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_sparse_indices, idx, err = makeOutputList(op, idx, "feature_list_sparse_indices"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_sparse_values, idx, err = makeOutputList(op, idx, "feature_list_sparse_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_sparse_shapes, idx, err = makeOutputList(op, idx, "feature_list_sparse_shapes"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	if feature_list_dense_values, idx, err = makeOutputList(op, idx, "feature_list_dense_values"); err != nil {
+		scope.UpdateErr("ParseSingleSequenceExample", err)
+		return
+	}
+	return context_sparse_indices, context_sparse_values, context_sparse_shapes, context_dense_values, feature_list_sparse_indices, feature_list_sparse_values, feature_list_sparse_shapes, feature_list_dense_values
+}
+
 // QuantizeAndDequantizeAttr is an optional argument to QuantizeAndDequantize.
 type QuantizeAndDequantizeAttr func(optionalAttr)
 
@@ -7677,6 +8317,124 @@ func AccumulateNV2(scope *Scope, inputs []tf.Output, shape tf.Shape) (sum tf.Out
 	return op.Output(0)
 }
 
+// RandomShuffleAttr is an optional argument to RandomShuffle.
+type RandomShuffleAttr func(optionalAttr)
+
+// RandomShuffleSeed sets the optional seed attribute to value.
+//
+// value: If either `seed` or `seed2` are set to be non-zero, the random number
+// generator is seeded by the given seed.  Otherwise, it is seeded by a
+// random seed.
+// If not specified, defaults to 0
+func RandomShuffleSeed(value int64) RandomShuffleAttr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// RandomShuffleSeed2 sets the optional seed2 attribute to value.
+//
+// value: A second seed to avoid seed collision.
+// If not specified, defaults to 0
+func RandomShuffleSeed2(value int64) RandomShuffleAttr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Randomly shuffles a tensor along its first dimension.
+//
+//   The tensor is shuffled along dimension 0, such that each `value[j]` is mapped
+//   to one and only one `output[i]`. For example, a mapping that might occur for a
+//   3x2 tensor is:
+//
+// ```
+// [[1, 2],       [[5, 6],
+//  [3, 4],  ==>   [1, 2],
+//  [5, 6]]        [3, 4]]
+// ```
+//
+// Arguments:
+//	value: The tensor to be shuffled.
+//
+// Returns A tensor of same shape and type as `value`, shuffled along its first
+// dimension.
+func RandomShuffle(scope *Scope, value tf.Output, optional ...RandomShuffleAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "RandomShuffle",
+		Input: []tf.Input{
+			value,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// OrderedMapIncompleteSizeAttr is an optional argument to OrderedMapIncompleteSize.
+type OrderedMapIncompleteSizeAttr func(optionalAttr)
+
+// OrderedMapIncompleteSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func OrderedMapIncompleteSizeCapacity(value int64) OrderedMapIncompleteSizeAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// OrderedMapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func OrderedMapIncompleteSizeMemoryLimit(value int64) OrderedMapIncompleteSizeAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// OrderedMapIncompleteSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func OrderedMapIncompleteSizeContainer(value string) OrderedMapIncompleteSizeAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// OrderedMapIncompleteSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func OrderedMapIncompleteSizeSharedName(value string) OrderedMapIncompleteSizeAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op returns the number of incomplete elements in the underlying container.
+func OrderedMapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapIncompleteSizeAttr) (size tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "OrderedMapIncompleteSize",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // DepthwiseConv2dNativeBackpropFilterAttr is an optional argument to DepthwiseConv2dNativeBackpropFilter.
 type DepthwiseConv2dNativeBackpropFilterAttr func(optionalAttr)
 
@@ -7916,6 +8674,101 @@ func RandomUniform(scope *Scope, shape tf.Output, dtype tf.DataType, optional ..
 	return op.Output(0)
 }
 
+// Encode audio data using the WAV file format.
+//
+// This operation will generate a string suitable to be saved out to create a .wav
+// audio file. It will be encoded in the 16-bit PCM format. It takes in float
+// values in the range -1.0f to 1.0f, and any outside that value will be clamped to
+// that range.
+//
+// `audio` is a 2-D float Tensor of shape `[length, channels]`.
+// `sample_rate` is a scalar Tensor holding the rate to use (e.g. 44100).
+//
+// Arguments:
+//	audio: 2-D with shape `[length, channels]`.
+//	sample_rate: Scalar containing the sample frequency.
+//
+// Returns 0-D. WAV-encoded file contents.
+func EncodeWav(scope *Scope, audio tf.Output, sample_rate tf.Output) (contents tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "EncodeWav",
+		Input: []tf.Input{
+			audio, sample_rate,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes atan of x element-wise.
+func Atan(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "Atan",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// ResourceApplyAdaMaxAttr is an optional argument to ResourceApplyAdaMax.
+type ResourceApplyAdaMaxAttr func(optionalAttr)
+
+// ResourceApplyAdaMaxUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var, m, and v tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyAdaMaxUseLocking(value bool) ResourceApplyAdaMaxAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update '*var' according to the AdaMax algorithm.
+//
+// m_t <- beta1 * m_{t-1} + (1 - beta1) * g
+// v_t <- max(beta2 * v_{t-1}, abs(g))
+// variable <- variable - learning_rate / (1 - beta1^t) * m_t / (v_t + epsilon)
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	m: Should be from a Variable().
+//	v: Should be from a Variable().
+//	beta1_power: Must be a scalar.
+//	lr: Scaling factor. Must be a scalar.
+//	beta1: Momentum factor. Must be a scalar.
+//	beta2: Momentum factor. Must be a scalar.
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//
+// Returns the created operation.
+func ResourceApplyAdaMax(scope *Scope, var_ tf.Output, m tf.Output, v tf.Output, beta1_power tf.Output, lr tf.Output, beta1 tf.Output, beta2 tf.Output, epsilon tf.Output, grad tf.Output, optional ...ResourceApplyAdaMaxAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "ResourceApplyAdaMax",
+		Input: []tf.Input{
+			var_, m, v, beta1_power, lr, beta1, beta2, epsilon, grad,
+		},
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
+
 // AssertAttr is an optional argument to Assert.
 type AssertAttr func(optionalAttr)
 
@@ -7996,27 +8849,6 @@ func CollectiveBcastSend(scope *Scope, input tf.Output, group_size int64, group_
 	return op.Output(0)
 }
 
-// Makes a copy of `x`.
-//
-// Arguments:
-//	x: The source tensor of type `T`.
-//
-// Returns     y: A `Tensor` of type `T`. A copy of `x`. Guaranteed that `y`
-//       is not an alias of `x`.
-func DeepCopy(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "DeepCopy",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Split a `SparseTensor` into `num_split` tensors along one dimension.
 //
 // If the `shape[split_dim]` is not an integer multiple of `num_split`. Slices
@@ -8190,6 +9022,21 @@ func DataFormatVecPermute(scope *Scope, x tf.Output, optional ...DataFormatVecPe
 	return op.Output(0)
 }
 
+// Computes the gradient of `igamma(a, x)` wrt `a`.
+func IgammaGradA(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "IgammaGradA",
+		Input: []tf.Input{
+			a, x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Converts each string in the input Tensor to its hash mod by a number of buckets.
 //
 // The hash function is deterministic on the content of the string within the
@@ -8854,6 +9701,85 @@ func ResourceScatterDiv(scope *Scope, resource tf.Output, indices tf.Output, upd
 	return scope.AddOperation(opspec)
 }
 
+// ResourceScatterNdAddAttr is an optional argument to ResourceScatterNdAdd.
+type ResourceScatterNdAddAttr func(optionalAttr)
+
+// ResourceScatterNdAddUseLocking sets the optional use_locking attribute to value.
+//
+// value: An optional bool. Defaults to True. If True, the assignment will
+// be protected by a lock; otherwise the behavior is undefined,
+// but may exhibit less contention.
+// If not specified, defaults to true
+func ResourceScatterNdAddUseLocking(value bool) ResourceScatterNdAddAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Adds sparse `updates` to individual values or slices within a given
+//
+// variable according to `indices`.
+//
+// `ref` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
+//
+// `indices` must be integer tensor, containing indices into `ref`.
+// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+//
+// The innermost dimension of `indices` (with length `K`) corresponds to
+// indices into elements (if `K = P`) or slices (if `K < P`) along the `K`th
+// dimension of `ref`.
+//
+// `updates` is `Tensor` of rank `Q-1+P-K` with shape:
+//
+// ```
+// [d_0, ..., d_{Q-2}, ref.shape[K], ..., ref.shape[P-1]].
+// ```
+//
+// For example, say we want to update 4 scattered elements to a rank-1 tensor to
+// 8 elements. In Python, that update would look like this:
+//
+// ```python
+//     ref = tf.Variable([1, 2, 3, 4, 5, 6, 7, 8], use_resource=True)
+//     indices = tf.constant([[4], [3], [1] ,[7]])
+//     updates = tf.constant([9, 10, 11, 12])
+//     update = tf.scatter_nd_add(ref, indices, updates)
+//     with tf.Session() as sess:
+//       print sess.run(update)
+// ```
+//
+// The resulting update to ref would look like this:
+//
+//     [1, 12, 3, 14, 14, 6, 7, 20]
+//
+// See @{tf.scatter_nd} for more details about how to make updates to
+// slices.
+//
+// Arguments:
+//	ref: A resource handle. Must be from a VarHandleOp.
+//	indices: A Tensor. Must be one of the following types: int32, int64.
+// A tensor of indices into ref.
+//	updates: A Tensor. Must have the same type as ref. A tensor of
+// values to add to ref.
+//
+// Returns the created operation.
+func ResourceScatterNdAdd(scope *Scope, ref tf.Output, indices tf.Output, updates tf.Output, optional ...ResourceScatterNdAddAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "ResourceScatterNdAdd",
+		Input: []tf.Input{
+			ref, indices, updates,
+		},
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
+
 // Mutually reduces multiple tensors of identical type and shape.
 func CollectiveReduce(scope *Scope, input tf.Output, group_size int64, group_key int64, instance_key int64, merge_op string, final_op string, subdiv_offsets []int64) (data tf.Output) {
 	if scope.Err() != nil {
@@ -8914,6 +9840,68 @@ func StatelessRandomNormal(scope *Scope, shape tf.Output, seed tf.Output, option
 	return op.Output(0)
 }
 
+// StringSplitV2Attr is an optional argument to StringSplitV2.
+type StringSplitV2Attr func(optionalAttr)
+
+// StringSplitV2Maxsplit sets the optional maxsplit attribute to value.
+//
+// value: An `int`. If `maxsplit > 0`, limit of the split of the result.
+// If not specified, defaults to -1
+func StringSplitV2Maxsplit(value int64) StringSplitV2Attr {
+	return func(m optionalAttr) {
+		m["maxsplit"] = value
+	}
+}
+
+// Split elements of `source` based on `sep` into a `SparseTensor`.
+//
+// Let N be the size of source (typically N will be the batch size). Split each
+// element of `source` based on `sep` and return a `SparseTensor`
+// containing the split tokens. Empty tokens are ignored.
+//
+// For example, N = 2, source[0] is 'hello world' and source[1] is 'a b c',
+// then the output will be
+// ```
+// st.indices = [0, 0;
+//               0, 1;
+//               1, 0;
+//               1, 1;
+//               1, 2]
+// st.shape = [2, 3]
+// st.values = ['hello', 'world', 'a', 'b', 'c']
+// ```
+//
+// If `sep` is given, consecutive delimiters are not grouped together and are
+// deemed to delimit empty strings. For example, source of `"1<>2<><>3"` and
+// sep of `"<>"` returns `["1", "2", "", "3"]`. If `sep` is None or an empty
+// string, consecutive whitespace are regarded as a single separator, and the
+// result will contain no empty strings at the startor end if the string has
+// leading or trailing whitespace.
+//
+// Note that the above mentioned behavior matches python's str.split.
+//
+// Arguments:
+//	input: `1-D` string `Tensor`, the strings to split.
+//	sep: `0-D` string `Tensor`, the delimiter character.
+func StringSplitV2(scope *Scope, input tf.Output, sep tf.Output, optional ...StringSplitV2Attr) (indices tf.Output, values tf.Output, shape tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "StringSplitV2",
+		Input: []tf.Input{
+			input, sep,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
+}
+
 // MaxPoolAttr is an optional argument to MaxPool.
 type MaxPoolAttr func(optionalAttr)
 
@@ -8998,9 +9986,11 @@ func SparseMatMulBIsSparse(value bool) SparseMatMulAttr {
 // Multiply matrix "a" by matrix "b".
 //
 // The inputs must be two-dimensional matrices and the inner dimension of "a" must
-// match the outer dimension of "b". This op is optimized for the case where at
-// least one of "a" or "b" is sparse. The breakeven for using this versus a dense
-// matrix multiply on one platform was 30% zero values in the sparse matrix.
+// match the outer dimension of "b". Both "a" and "b" must be `Tensor`s not
+// `SparseTensor`s.  This op is optimized for the case where at least one of "a" or
+// "b" is sparse, in the sense that they have a large proportion of zero values.
+// The breakeven for using this versus a dense matrix multiply on one platform was
+// 30% zero values in the sparse matrix.
 //
 // The gradient computation of this operation will only take advantage of sparsity
 // in the input gradient when that gradient comes from a Relu.
@@ -9631,6 +10621,51 @@ func AvgPoolGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize
 	return op.Output(0)
 }
 
+// Greedily selects a subset of bounding boxes in descending order of score,
+//
+// pruning away boxes that have high overlaps
+// with previously selected boxes.  Bounding boxes with score less than
+// `score_threshold` are removed. N-by-n overlap values are supplied as square matrix,
+// which allows for defining a custom overlap criterium (eg. intersection over union,
+// intersection over area, etc.).
+//
+// The output of this operation is a set of integers indexing into the input
+// collection of bounding boxes representing the selected boxes.  The bounding
+// box coordinates corresponding to the selected indices can then be obtained
+// using the `tf.gather operation`.  For example:
+//
+//   selected_indices = tf.image.non_max_suppression_with_overlaps(
+//       overlaps, scores, max_output_size, overlap_threshold, score_threshold)
+//   selected_boxes = tf.gather(boxes, selected_indices)
+//
+// Arguments:
+//	overlaps: A 2-D float tensor of shape `[num_boxes, num_boxes]` representing
+// the n-by-n box overlap values.
+//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
+// score corresponding to each box (each row of boxes).
+//	max_output_size: A scalar integer tensor representing the maximum number of
+// boxes to be selected by non max suppression.
+//	overlap_threshold: A 0-D float tensor representing the threshold for deciding whether
+// boxes overlap too.
+//	score_threshold: A 0-D float tensor representing the threshold for deciding when to remove
+// boxes based on score.
+//
+// Returns A 1-D integer tensor of shape `[M]` representing the selected
+// indices from the boxes tensor, where `M <= max_output_size`.
+func NonMaxSuppressionWithOverlaps(scope *Scope, overlaps tf.Output, scores tf.Output, max_output_size tf.Output, overlap_threshold tf.Output, score_threshold tf.Output) (selected_indices tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "NonMaxSuppressionWithOverlaps",
+		Input: []tf.Input{
+			overlaps, scores, max_output_size, overlap_threshold, score_threshold,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // StageClearAttr is an optional argument to StageClear.
 type StageClearAttr func(optionalAttr)
 
@@ -9908,50 +10943,6 @@ func SparseAddGrad(scope *Scope, backprop_val_grad tf.Output, a_indices tf.Outpu
 	return op.Output(0), op.Output(1)
 }
 
-// Computes atan of x element-wise.
-func Atan(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "Atan",
-		Input: []tf.Input{
-			x,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Encode audio data using the WAV file format.
-//
-// This operation will generate a string suitable to be saved out to create a .wav
-// audio file. It will be encoded in the 16-bit PCM format. It takes in float
-// values in the range -1.0f to 1.0f, and any outside that value will be clamped to
-// that range.
-//
-// `audio` is a 2-D float Tensor of shape `[length, channels]`.
-// `sample_rate` is a scalar Tensor holding the rate to use (e.g. 44100).
-//
-// Arguments:
-//	audio: 2-D with shape `[length, channels]`.
-//	sample_rate: Scalar containing the sample frequency.
-//
-// Returns 0-D. WAV-encoded file contents.
-func EncodeWav(scope *Scope, audio tf.Output, sample_rate tf.Output) (contents tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "EncodeWav",
-		Input: []tf.Input{
-			audio, sample_rate,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Converts each string in the input Tensor to its hash mod by a number of buckets.
 //
 // The hash function is deterministic on the content of the string within the
@@ -10531,6 +11522,120 @@ func ResourceApplyPowerSign(scope *Scope, var_ tf.Output, m tf.Output, lr tf.Out
 	return scope.AddOperation(opspec)
 }
 
+// CudnnRNNBackpropV2Attr is an optional argument to CudnnRNNBackpropV2.
+type CudnnRNNBackpropV2Attr func(optionalAttr)
+
+// CudnnRNNBackpropV2RnnMode sets the optional rnn_mode attribute to value.
+// If not specified, defaults to "lstm"
+func CudnnRNNBackpropV2RnnMode(value string) CudnnRNNBackpropV2Attr {
+	return func(m optionalAttr) {
+		m["rnn_mode"] = value
+	}
+}
+
+// CudnnRNNBackpropV2InputMode sets the optional input_mode attribute to value.
+// If not specified, defaults to "linear_input"
+func CudnnRNNBackpropV2InputMode(value string) CudnnRNNBackpropV2Attr {
+	return func(m optionalAttr) {
+		m["input_mode"] = value
+	}
+}
+
+// CudnnRNNBackpropV2Direction sets the optional direction attribute to value.
+// If not specified, defaults to "unidirectional"
+func CudnnRNNBackpropV2Direction(value string) CudnnRNNBackpropV2Attr {
+	return func(m optionalAttr) {
+		m["direction"] = value
+	}
+}
+
+// CudnnRNNBackpropV2Dropout sets the optional dropout attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNBackpropV2Dropout(value float32) CudnnRNNBackpropV2Attr {
+	return func(m optionalAttr) {
+		m["dropout"] = value
+	}
+}
+
+// CudnnRNNBackpropV2Seed sets the optional seed attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNBackpropV2Seed(value int64) CudnnRNNBackpropV2Attr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// CudnnRNNBackpropV2Seed2 sets the optional seed2 attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNBackpropV2Seed2(value int64) CudnnRNNBackpropV2Attr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// Backprop step of CudnnRNN.
+//
+// Compute the backprop of both data and weights in a RNN. Takes an extra
+//     "host_reserved" inupt than CudnnRNNBackprop, which is used to determine RNN
+//     cudnnRNNAlgo_t and cudnnMathType_t.
+//
+// rnn_mode: Indicates the type of the RNN model.
+// input_mode: Indicates whether there is a linear projection between the input and
+//     the actual computation before the first layer. 'skip_input' is only allowed
+//     when input_size == num_units; 'auto_select' implies 'skip_input' when
+//     input_size == num_units; otherwise, it implies 'linear_input'.
+// direction: Indicates whether a bidirectional model will be used. Should be
+//   "unidirectional" or "bidirectional".
+// dropout: Dropout probability. When set to 0., dropout is disabled.
+// seed: The 1st part of a seed to initialize dropout.
+// seed2: The 2nd part of a seed to initialize dropout.
+// input: A 3-D tensor with the shape of [seq_length, batch_size, input_size].
+// input_h: A 3-D tensor with the shape of [num_layer * dir, batch_size,
+//     num_units].
+// input_c: For LSTM, a 3-D tensor with the shape of
+//     [num_layer * dir, batch, num_units]. For other models, it is ignored.
+// params: A 1-D tensor that contains the weights and biases in an opaque layout.
+//     The size must be created through CudnnRNNParamsSize, and initialized
+//     separately. Note that they might not be compatible across different
+//     generations. So it is a good idea to save and restore
+// output: A 3-D tensor with the shape of [seq_length, batch_size,
+//     dir * num_units].
+// output_h: The same shape has input_h.
+// output_c: The same shape as input_c for LSTM. An empty tensor for other models.
+// output_backprop: A 3-D tensor with the same shape as output in the forward pass.
+// output_h_backprop: A 3-D tensor with the same shape as output_h in the forward
+//     pass.
+// output_c_backprop: A 3-D tensor with the same shape as output_c in the forward
+//     pass.
+// reserve_space: The same reserve_space produced in the forward operation.
+// host_reserved: The same host_reserved produced in the forward operation.
+// input_backprop: The backprop to input in the forward pass. Has the same shape
+//     as input.
+// input_h_backprop: The backprop to input_h in the forward pass. Has the same
+//     shape as input_h.
+// input_c_backprop: The backprop to input_c in the forward pass. Has the same
+//     shape as input_c.
+// params_backprop: The backprop to the params buffer in the forward pass. Has the
+//     same shape as params.
+func CudnnRNNBackpropV2(scope *Scope, input tf.Output, input_h tf.Output, input_c tf.Output, params tf.Output, output tf.Output, output_h tf.Output, output_c tf.Output, output_backprop tf.Output, output_h_backprop tf.Output, output_c_backprop tf.Output, reserve_space tf.Output, host_reserved tf.Output, optional ...CudnnRNNBackpropV2Attr) (input_backprop tf.Output, input_h_backprop tf.Output, input_c_backprop tf.Output, params_backprop tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "CudnnRNNBackpropV2",
+		Input: []tf.Input{
+			input, input_h, input_c, params, output, output_h, output_c, output_backprop, output_h_backprop, output_c_backprop, reserve_space, host_reserved,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
+}
+
 // Locks a mutex resource.  The output is the lock.  So long as the lock tensor
 //
 // is alive, any other request to use `MutexLock` with this mutex will wait.
@@ -10718,6 +11823,34 @@ func BatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, o
 	return op.Output(0)
 }
 
+// Check if the input matches the regex pattern.
+//
+// The input is a string tensor of any shape. The pattern is a scalar
+// string tensor which is applied to every element of the input tensor.
+// The boolean values (True or False) of the output tensor indicate
+// if the input matches the regex pattern provided.
+//
+// The pattern follows the re2 syntax (https://github.com/google/re2/wiki/Syntax)
+//
+// Arguments:
+//	input: A string tensor of the text to be processed.
+//	pattern: A 1-D string tensor of the regular expression to match the input.
+//
+// Returns A bool tensor with the same shape as `input`.
+func RegexFullMatch(scope *Scope, input tf.Output, pattern tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "RegexFullMatch",
+		Input: []tf.Input{
+			input, pattern,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Says whether the targets are in the top `K` predictions.
 //
 // This outputs a `batch_size` bool array, an entry `out[i]` is `true` if the
@@ -11657,7 +12790,7 @@ func ResourceScatterNdUpdateUseLocking(value bool) ResourceScatterNdUpdateAttr {
 // 8 elements. In Python, that update would look like this:
 //
 // ```python
-//     ref = tfe.Variable([1, 2, 3, 4, 5, 6, 7, 8])
+//     ref = tf.Variable([1, 2, 3, 4, 5, 6, 7, 8])
 //     indices = tf.constant([[4], [3], [1] ,[7]])
 //     updates = tf.constant([9, 10, 11, 12])
 //     update = tf.scatter_nd_update(ref, indices, updates)
@@ -11982,6 +13115,7 @@ func RFFT2D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Outp
 //                       [0, 0, 2, 2, 0, 0]
 //                       [0, 0, 0, 0, 0, 0]]
 // ```
+//
 func Pad(scope *Scope, input tf.Output, paddings tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
@@ -13013,122 +14147,6 @@ func Conv3DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_ba
 	return op.Output(0)
 }
 
-// ResourceApplyProximalAdagradAttr is an optional argument to ResourceApplyProximalAdagrad.
-type ResourceApplyProximalAdagradAttr func(optionalAttr)
-
-// ResourceApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
-//
-// value: If True, updating of the var and accum tensors will be protected by
-// a lock; otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceApplyProximalAdagradUseLocking(value bool) ResourceApplyProximalAdagradAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Update '*var' and '*accum' according to FOBOS with Adagrad learning rate.
-//
-// accum += grad * grad
-// prox_v = var - lr * grad * (1 / sqrt(accum))
-// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
-//
-// Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 regularization. Must be a scalar.
-//	grad: The gradient.
-//
-// Returns the created operation.
-func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, optional ...ResourceApplyProximalAdagradAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ResourceApplyProximalAdagrad",
-		Input: []tf.Input{
-			var_, accum, lr, l1, l2, grad,
-		},
-		Attrs: attrs,
-	}
-	return scope.AddOperation(opspec)
-}
-
-// MutableHashTableOfTensorsV2Attr is an optional argument to MutableHashTableOfTensorsV2.
-type MutableHashTableOfTensorsV2Attr func(optionalAttr)
-
-// MutableHashTableOfTensorsV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func MutableHashTableOfTensorsV2Container(value string) MutableHashTableOfTensorsV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MutableHashTableOfTensorsV2SharedName sets the optional shared_name attribute to value.
-//
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func MutableHashTableOfTensorsV2SharedName(value string) MutableHashTableOfTensorsV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// MutableHashTableOfTensorsV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
-// If not specified, defaults to false
-func MutableHashTableOfTensorsV2UseNodeNameSharing(value bool) MutableHashTableOfTensorsV2Attr {
-	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
-	}
-}
-
-// MutableHashTableOfTensorsV2ValueShape sets the optional value_shape attribute to value.
-// If not specified, defaults to <>
-func MutableHashTableOfTensorsV2ValueShape(value tf.Shape) MutableHashTableOfTensorsV2Attr {
-	return func(m optionalAttr) {
-		m["value_shape"] = value
-	}
-}
-
-// Creates an empty hash table.
-//
-// This op creates a mutable hash table, specifying the type of its keys and
-// values. Each value must be a vector. Data can be inserted into the table using
-// the insert operations. It does not support the initialization operation.
-//
-// Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
-//
-// Returns Handle to a table.
-func MutableHashTableOfTensorsV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableOfTensorsV2Attr) (table_handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "MutableHashTableOfTensorsV2",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Subtracts sparse updates from the variable referenced by `resource`.
 //
 // This operation computes
@@ -13416,9 +14434,11 @@ func ReduceJoinSeparator(value string) ReduceJoinAttr {
 // Joins a string Tensor across the given dimensions.
 //
 // Computes the string join across dimensions in the given string Tensor of shape
-// `[d_0, d_1, ..., d_n-1]`.  Returns a new Tensor created by joining the input
+// `[\\(d_0, d_1, ..., d_{n-1}\\)]`.  Returns a new Tensor created by joining the input
 // strings with the given separator (default: empty string).  Negative indices are
-// counted backwards from the end, with `-1` being equivalent to `n - 1`.
+// counted backwards from the end, with `-1` being equivalent to `n - 1`.  If
+// indices are not specified, joins across all dimensions beginning from `n - 1`
+// through `0`.
 //
 // For example:
 //
@@ -13431,9 +14451,10 @@ func ReduceJoinSeparator(value string) ReduceJoinAttr {
 // tf.reduce_join(a, 0, keep_dims=True) ==> [["ac", "bd"]]
 // tf.reduce_join(a, 1, keep_dims=True) ==> [["ab"], ["cd"]]
 // tf.reduce_join(a, 0, separator=".") ==> ["a.c", "b.d"]
-// tf.reduce_join(a, [0, 1]) ==> ["acbd"]
-// tf.reduce_join(a, [1, 0]) ==> ["abcd"]
-// tf.reduce_join(a, []) ==> ["abcd"]
+// tf.reduce_join(a, [0, 1]) ==> "acbd"
+// tf.reduce_join(a, [1, 0]) ==> "abcd"
+// tf.reduce_join(a, []) ==> [["a", "b"], ["c", "d"]]
+// tf.reduce_join(a) = tf.reduce_join(a, [1, 0]) ==> "abcd"
 // ```
 //
 // Arguments:
@@ -13874,6 +14895,83 @@ func Minimum(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
+// MfccAttr is an optional argument to Mfcc.
+type MfccAttr func(optionalAttr)
+
+// MfccUpperFrequencyLimit sets the optional upper_frequency_limit attribute to value.
+//
+// value: The highest frequency to use when calculating the
+// ceptstrum.
+// If not specified, defaults to 4000
+func MfccUpperFrequencyLimit(value float32) MfccAttr {
+	return func(m optionalAttr) {
+		m["upper_frequency_limit"] = value
+	}
+}
+
+// MfccLowerFrequencyLimit sets the optional lower_frequency_limit attribute to value.
+//
+// value: The lowest frequency to use when calculating the
+// ceptstrum.
+// If not specified, defaults to 20
+func MfccLowerFrequencyLimit(value float32) MfccAttr {
+	return func(m optionalAttr) {
+		m["lower_frequency_limit"] = value
+	}
+}
+
+// MfccFilterbankChannelCount sets the optional filterbank_channel_count attribute to value.
+//
+// value: Resolution of the Mel bank used internally.
+// If not specified, defaults to 40
+func MfccFilterbankChannelCount(value int64) MfccAttr {
+	return func(m optionalAttr) {
+		m["filterbank_channel_count"] = value
+	}
+}
+
+// MfccDctCoefficientCount sets the optional dct_coefficient_count attribute to value.
+//
+// value: How many output channels to produce per time slice.
+// If not specified, defaults to 13
+func MfccDctCoefficientCount(value int64) MfccAttr {
+	return func(m optionalAttr) {
+		m["dct_coefficient_count"] = value
+	}
+}
+
+// Transforms a spectrogram into a form that's useful for speech recognition.
+//
+// Mel Frequency Cepstral Coefficients are a way of representing audio data that's
+// been effective as an input feature for machine learning. They are created by
+// taking the spectrum of a spectrogram (a 'cepstrum'), and discarding some of the
+// higher frequencies that are less significant to the human ear. They have a long
+// history in the speech recognition world, and https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
+// is a good resource to learn more.
+//
+// Arguments:
+//	spectrogram: Typically produced by the Spectrogram op, with magnitude_squared
+// set to true.
+//	sample_rate: How many samples per second the source audio used.
+func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional ...MfccAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Mfcc",
+		Input: []tf.Input{
+			spectrogram, sample_rate,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // AudioSummaryAttr is an optional argument to AudioSummary.
 type AudioSummaryAttr func(optionalAttr)
 
@@ -14292,65 +15390,6 @@ func TensorArraySplitV2(scope *Scope, handle tf.Output, value tf.Output, lengths
 	return op.Output(0)
 }
 
-// PackAttr is an optional argument to Pack.
-type PackAttr func(optionalAttr)
-
-// PackAxis sets the optional axis attribute to value.
-//
-// value: Dimension along which to pack.  Negative values wrap around, so the
-// valid range is `[-(R+1), R+1)`.
-// If not specified, defaults to 0
-func PackAxis(value int64) PackAttr {
-	return func(m optionalAttr) {
-		m["axis"] = value
-	}
-}
-
-// Packs a list of `N` rank-`R` tensors into one rank-`(R+1)` tensor.
-//
-// Packs the `N` tensors in `values` into a tensor with rank one higher than each
-// tensor in `values`, by packing them along the `axis` dimension.
-// Given a list of tensors of shape `(A, B, C)`;
-//
-// if `axis == 0` then the `output` tensor will have the shape `(N, A, B, C)`.
-// if `axis == 1` then the `output` tensor will have the shape `(A, N, B, C)`.
-// Etc.
-//
-// For example:
-//
-// ```
-// # 'x' is [1, 4]
-// # 'y' is [2, 5]
-// # 'z' is [3, 6]
-// pack([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
-// pack([x, y, z], axis=1) => [[1, 2, 3], [4, 5, 6]]
-// ```
-//
-// This is the opposite of `unpack`.
-//
-// Arguments:
-//	values: Must be of same shape and type.
-//
-// Returns The packed tensor.
-func Pack(scope *Scope, values []tf.Output, optional ...PackAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Pack",
-		Input: []tf.Input{
-			tf.OutputList(values),
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Reorders a SparseTensor into the canonical, row-major ordering.
 //
 // Note that by convention, all sparse ops preserve the canonical ordering along
@@ -14505,27 +15544,27 @@ func CudnnRNNBackpropSeed2(value int64) CudnnRNNBackpropAttr {
 //
 // rnn_mode: Indicates the type of the RNN model.
 // input_mode: Indicate whether there is a linear projection between the input and
-//     The actual computation before the first layer. 'skip_input' is only allowed
+//     the actual computation before the first layer. 'skip_input' is only allowed
 //     when input_size == num_units; 'auto_select' implies 'skip_input' when
 //     input_size == num_units; otherwise, it implies 'linear_input'.
-// direction: Indicates whether a bidirectional model will be used.
-//     dir = (direction == bidirectional) ? 2 : 1
-// dropout: dropout probability. When set to 0., dropout is disabled.
-// seed: the 1st part of a seed to initialize dropout.
-// seed2: the 2nd part of a seed to initialize dropout.
-// input: a 3-D tensor with the shape of [seq_length, batch_size, input_size].
-// input_h: a 3-D tensor with the shape of [num_layer * dir, batch_size,
+// direction: Indicates whether a bidirectional model will be used. Should be
+//   "unidirectional" or "bidirectional".
+// dropout: Dropout probability. When set to 0., dropout is disabled.
+// seed: The 1st part of a seed to initialize dropout.
+// seed2: The 2nd part of a seed to initialize dropout.
+// input: A 3-D tensor with the shape of [seq_length, batch_size, input_size].
+// input_h: A 3-D tensor with the shape of [num_layer * dir, batch_size,
 //     num_units].
 // input_c: For LSTM, a 3-D tensor with the shape of
 //     [num_layer * dir, batch, num_units]. For other models, it is ignored.
-// params: a 1-D tensor that contains the weights and biases in an opaque layout.
+// params: A 1-D tensor that contains the weights and biases in an opaque layout.
 //     The size must be created through CudnnRNNParamsSize, and initialized
 //     separately. Note that they might not be compatible across different
 //     generations. So it is a good idea to save and restore
-// output: a 3-D tensor with the shape of [seq_length, batch_size,
+// output: A 3-D tensor with the shape of [seq_length, batch_size,
 //     dir * num_units].
-// output_h: the same shape has input_h.
-// output_c: the same shape as input_c for LSTM. An empty tensor for other models.
+// output_h: The same shape has input_h.
+// output_c: The same shape as input_c for LSTM. An empty tensor for other models.
 // output_backprop: A 3-D tensor with the same shape as output in the forward pass.
 // output_h_backprop: A 3-D tensor with the same shape as output_h in the forward
 //     pass.
@@ -15010,30 +16049,6 @@ func Sigmoid(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
-//     Updates specified rows with values in `v`.
-//
-//     Computes `x[i, :] = v; return x`.
-//
-// Arguments:
-//	x: A tensor of type `T`.
-//	i: A vector. Indices into the left-most dimension of `x`.
-//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
-//
-// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
-func InplaceUpdate(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "InplaceUpdate",
-		Input: []tf.Input{
-			x, i, v,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // FusedBatchNormAttr is an optional argument to FusedBatchNorm.
 type FusedBatchNormAttr func(optionalAttr)
 
@@ -15510,6 +16525,30 @@ func OrderedMapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataTyp
 	return key, values
 }
 
+// Calculates the prior from the training data (the bias) and fills in the first node with the logits' prior. Returns a boolean indicating whether to continue centering.
+//
+// Arguments:
+//	tree_ensemble_handle: Handle to the tree ensemble.
+//	mean_gradients: A tensor with shape=[logits_dimension] with mean of gradients for a first node.
+//	mean_hessians: A tensor with shape=[logits_dimension] mean of hessians for a first node.
+//	l1: l1 regularization factor on leaf weights, per instance based.
+//	l2: l2 regularization factor on leaf weights, per instance based.
+//
+// Returns Bool, whether to continue bias centering.
+func BoostedTreesCenterBias(scope *Scope, tree_ensemble_handle tf.Output, mean_gradients tf.Output, mean_hessians tf.Output, l1 tf.Output, l2 tf.Output) (continue_centering tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "BoostedTreesCenterBias",
+		Input: []tf.Input{
+			tree_ensemble_handle, mean_gradients, mean_hessians, l1, l2,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // SerializeManySparseAttr is an optional argument to SerializeManySparse.
 type SerializeManySparseAttr func(optionalAttr)
 
@@ -17078,6 +18117,7 @@ func QuantizeV2RoundMode(value string) QuantizeV2Attr {
 // out[i] = (in[i] - min_range) * range(T) / (max_range - min_range)
 // if T == qint8, out[i] -= (range(T) + 1) / 2.0
 // ```
+//
 // here `range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()`
 //
 // *MIN_COMBINED Mode Example*
@@ -17121,6 +18161,7 @@ func QuantizeV2RoundMode(value string) QuantizeV2Attr {
 //
 // We first find the range of values in our tensor. The
 // range we use is always centered on 0, so we find m such that
+//
 // ```c++
 //   m = max(abs(input_min), abs(input_max))
 // ```
@@ -17129,6 +18170,7 @@ func QuantizeV2RoundMode(value string) QuantizeV2Attr {
 //
 // Next, we choose our fixed-point quantization buckets, `[min_fixed, max_fixed]`.
 // If T is signed, this is
+//
 // ```
 //   num_bits = sizeof(T) * 8
 //   [min_fixed, max_fixed] =
@@ -17136,16 +18178,19 @@ func QuantizeV2RoundMode(value string) QuantizeV2Attr {
 // ```
 //
 // Otherwise, if T is unsigned, the fixed-point range is
+//
 // ```
 //   [min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
 // ```
 //
 // From this we compute our scaling factor, s:
+//
 // ```c++
 //   s = (max_fixed - min_fixed) / (2 * m)
 // ```
 //
 // Now we can quantize the elements of our tensor:
+//
 // ```c++
 // result = round(input * s)
 // ```
@@ -17242,6 +18287,31 @@ func QuantizedReluX(scope *Scope, features tf.Output, max_value tf.Output, min_f
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
+// Creates a dataset that batches `batch_size` elements from `input_dataset`.
+//
+// Arguments:
+//
+//	batch_size: A scalar representing the number of elements to accumulate in a batch.
+//	drop_remainder: A scalar representing whether the last batch should be dropped in case its size
+// is smaller than desired.
+//
+//
+func BatchDatasetV2(scope *Scope, input_dataset tf.Output, batch_size tf.Output, drop_remainder tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "BatchDatasetV2",
+		Input: []tf.Input{
+			input_dataset, batch_size, drop_remainder,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // QuantizedConv2DAttr is an optional argument to QuantizedConv2D.
 type QuantizedConv2DAttr func(optionalAttr)
 
@@ -17765,6 +18835,240 @@ func SparseCross(scope *Scope, indices []tf.Output, values []tf.Output, shapes [
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
+// ResourceApplyProximalAdagradAttr is an optional argument to ResourceApplyProximalAdagrad.
+type ResourceApplyProximalAdagradAttr func(optionalAttr)
+
+// ResourceApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
+//
+// value: If True, updating of the var and accum tensors will be protected by
+// a lock; otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceApplyProximalAdagradUseLocking(value bool) ResourceApplyProximalAdagradAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update '*var' and '*accum' according to FOBOS with Adagrad learning rate.
+//
+// accum += grad * grad
+// prox_v = var - lr * grad * (1 / sqrt(accum))
+// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
+//	grad: The gradient.
+//
+// Returns the created operation.
+func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, optional ...ResourceApplyProximalAdagradAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "ResourceApplyProximalAdagrad",
+		Input: []tf.Input{
+			var_, accum, lr, l1, l2, grad,
+		},
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
+
+// MutableHashTableOfTensorsV2Attr is an optional argument to MutableHashTableOfTensorsV2.
+type MutableHashTableOfTensorsV2Attr func(optionalAttr)
+
+// MutableHashTableOfTensorsV2Container sets the optional container attribute to value.
+//
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func MutableHashTableOfTensorsV2Container(value string) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MutableHashTableOfTensorsV2SharedName sets the optional shared_name attribute to value.
+//
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func MutableHashTableOfTensorsV2SharedName(value string) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// MutableHashTableOfTensorsV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+// If not specified, defaults to false
+func MutableHashTableOfTensorsV2UseNodeNameSharing(value bool) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["use_node_name_sharing"] = value
+	}
+}
+
+// MutableHashTableOfTensorsV2ValueShape sets the optional value_shape attribute to value.
+// If not specified, defaults to <>
+func MutableHashTableOfTensorsV2ValueShape(value tf.Shape) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["value_shape"] = value
+	}
+}
+
+// Creates an empty hash table.
+//
+// This op creates a mutable hash table, specifying the type of its keys and
+// values. Each value must be a vector. Data can be inserted into the table using
+// the insert operations. It does not support the initialization operation.
+//
+// Arguments:
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
+//
+// Returns Handle to a table.
+func MutableHashTableOfTensorsV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableOfTensorsV2Attr) (table_handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "MutableHashTableOfTensorsV2",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// The gradient operator for the SparseSlice op.
+//
+// This op takes in the upstream gradient w.r.t. non-empty values of
+// the sliced `SparseTensor`, and outputs the gradients w.r.t.
+// the non-empty values of input `SparseTensor`.
+//
+// Arguments:
+//	backprop_val_grad: 1-D. The gradient with respect to
+// the non-empty values of the sliced `SparseTensor`.
+//	input_indices: 2-D.  The `indices` of the input `SparseTensor`.
+//	input_start: 1-D. tensor represents the start of the slice.
+//	output_indices: 2-D.  The `indices` of the sliced `SparseTensor`.
+//
+// Returns 1-D. The gradient with respect to the non-empty values of input `SparseTensor`.
+func SparseSliceGrad(scope *Scope, backprop_val_grad tf.Output, input_indices tf.Output, input_start tf.Output, output_indices tf.Output) (val_grad tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSliceGrad",
+		Input: []tf.Input{
+			backprop_val_grad, input_indices, input_start, output_indices,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes the gradient of the sigmoid of `x` wrt its input.
+//
+// Specifically, `grad = dy * y * (1 - y)`, where `y = sigmoid(x)`, and
+// `dy` is the corresponding input gradient.
+func SigmoidGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SigmoidGrad",
+		Input: []tf.Input{
+			y, dy,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Convert one or more images from HSV to RGB.
+//
+// Outputs a tensor of the same shape as the `images` tensor, containing the RGB
+// value of the pixels. The output is only well defined if the value in `images`
+// are in `[0,1]`.
+//
+// See `rgb_to_hsv` for a description of the HSV encoding.
+//
+// Arguments:
+//	images: 1-D or higher rank. HSV data to convert. Last dimension must be size 3.
+//
+// Returns `images` converted to RGB.
+func HSVToRGB(scope *Scope, images tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "HSVToRGB",
+		Input: []tf.Input{
+			images,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Creates a dataset by applying optimizations to `input_dataset`.
+//
+// Creates a dataset by applying optimizations to `input_dataset`.
+//
+// Arguments:
+//	input_dataset: A variant tensor representing the input dataset.
+//	optimizations: A `tf.string` vector `tf.Tensor` identifying optimizations to use.
+//
+//
+func OptimizeDataset(scope *Scope, input_dataset tf.Output, optimizations tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "OptimizeDataset",
+		Input: []tf.Input{
+			input_dataset, optimizations,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Retrieves the tree ensemble resource stamp token, number of trees and growing statistics.
+//
+// Arguments:
+//	tree_ensemble_handle: Handle to the tree ensemble.
+//
+// Returns Stamp token of the tree ensemble resource.The number of trees in the tree ensemble resource.The number of trees that were finished successfully.The number of layers we attempted to build (but not necessarily succeeded).Rank size 2 tensor that contains start and end ids of the nodes in the latest
+// layer.
+func BoostedTreesGetEnsembleStates(scope *Scope, tree_ensemble_handle tf.Output) (stamp_token tf.Output, num_trees tf.Output, num_finalized_trees tf.Output, num_attempted_layers tf.Output, last_layer_nodes_range tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "BoostedTreesGetEnsembleStates",
+		Input: []tf.Input{
+			tree_ensemble_handle,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+}
+
 // Returns the element-wise min of two SparseTensors.
 //
 // Assumes the two SparseTensors have the same shape, i.e., no broadcasting.
@@ -17918,6 +19222,26 @@ func AssignVariableOp(scope *Scope, resource tf.Output, value tf.Output) (o *tf.
 	return scope.AddOperation(opspec)
 }
 
+// Strip leading and trailing whitespaces from the Tensor.
+//
+// Arguments:
+//	input: A string `Tensor` of any shape.
+//
+// Returns A string `Tensor` of the same shape as the input.
+func StringStrip(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "StringStrip",
+		Input: []tf.Input{
+			input,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Returns a tensor of ones with the same shape and type as x.
 //
 // Arguments:
@@ -17969,9 +19293,12 @@ func SparseFillEmptyRowsGrad(scope *Scope, reverse_index_map tf.Output, grad_val
 }
 
 // Computes scaled exponential linear: `scale * alpha * (exp(features) - 1)`
+//
 // if < 0, `scale * features` otherwise.
 //
-// Assumes weights to have zero mean and variance 1.0 / fan_in.
+// To be used together with
+// `initializer = tf.variance_scaling_initializer(factor=1.0, mode='FAN_IN')`.
+// For correct dropout, use `tf.contrib.nn.alpha_dropout`.
 //
 // See [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
 func Selu(scope *Scope, features tf.Output) (activations tf.Output) {
@@ -18655,7 +19982,7 @@ func MatrixTriangularSolveLower(value bool) MatrixTriangularSolveAttr {
 //          adjoint.
 //
 // @compatibility(numpy)
-// Equivalent to np.linalg.triangular_solve
+// Equivalent to scipy.linalg.solve_triangular
 // @end_compatibility
 // If not specified, defaults to false
 func MatrixTriangularSolveAdjoint(value bool) MatrixTriangularSolveAttr {
@@ -19204,119 +20531,25 @@ func RandomUniformInt(scope *Scope, shape tf.Output, minval tf.Output, maxval tf
 	return op.Output(0)
 }
 
-// RandomShuffleAttr is an optional argument to RandomShuffle.
-type RandomShuffleAttr func(optionalAttr)
-
-// RandomShuffleSeed sets the optional seed attribute to value.
-//
-// value: If either `seed` or `seed2` are set to be non-zero, the random number
-// generator is seeded by the given seed.  Otherwise, it is seeded by a
-// random seed.
-// If not specified, defaults to 0
-func RandomShuffleSeed(value int64) RandomShuffleAttr {
-	return func(m optionalAttr) {
-		m["seed"] = value
-	}
-}
-
-// RandomShuffleSeed2 sets the optional seed2 attribute to value.
-//
-// value: A second seed to avoid seed collision.
-// If not specified, defaults to 0
-func RandomShuffleSeed2(value int64) RandomShuffleAttr {
-	return func(m optionalAttr) {
-		m["seed2"] = value
-	}
-}
-
-// Randomly shuffles a tensor along its first dimension.
-//
-//   The tensor is shuffled along dimension 0, such that each `value[j]` is mapped
-//   to one and only one `output[i]`. For example, a mapping that might occur for a
-//   3x2 tensor is:
+// Computes gradients for SparseSegmentSqrtN.
 //
-// ```
-// [[1, 2],       [[5, 6],
-//  [3, 4],  ==>   [1, 2],
-//  [5, 6]]        [3, 4]]
-// ```
+// Returns tensor "output" with same shape as grad, except for dimension 0 whose
+// value is output_dim0.
 //
 // Arguments:
-//	value: The tensor to be shuffled.
-//
-// Returns A tensor of same shape and type as `value`, shuffled along its first
-// dimension.
-func RandomShuffle(scope *Scope, value tf.Output, optional ...RandomShuffleAttr) (output tf.Output) {
+//	grad: gradient propagated to the SparseSegmentSqrtN op.
+//	indices: indices passed to the corresponding SparseSegmentSqrtN op.
+//	segment_ids: segment_ids passed to the corresponding SparseSegmentSqrtN op.
+//	output_dim0: dimension 0 of "data" passed to SparseSegmentSqrtN op.
+func SparseSegmentSqrtNGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "RandomShuffle",
+		Type: "SparseSegmentSqrtNGrad",
 		Input: []tf.Input{
-			value,
+			grad, indices, segment_ids, output_dim0,
 		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// OrderedMapIncompleteSizeAttr is an optional argument to OrderedMapIncompleteSize.
-type OrderedMapIncompleteSizeAttr func(optionalAttr)
-
-// OrderedMapIncompleteSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func OrderedMapIncompleteSizeCapacity(value int64) OrderedMapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// OrderedMapIncompleteSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func OrderedMapIncompleteSizeMemoryLimit(value int64) OrderedMapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// OrderedMapIncompleteSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func OrderedMapIncompleteSizeContainer(value string) OrderedMapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// OrderedMapIncompleteSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func OrderedMapIncompleteSizeSharedName(value string) OrderedMapIncompleteSizeAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op returns the number of incomplete elements in the underlying container.
-func OrderedMapIncompleteSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapIncompleteSizeAttr) (size tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "OrderedMapIncompleteSize",
-
-		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
@@ -19525,9 +20758,9 @@ func DestroyResourceOp(scope *Scope, resource tf.Output, optional ...DestroyReso
 // ```
 //
 // Arguments:
-//	start: First entry in the range.
-//	stop: Last entry in the range.
-//	num: Number of values to generate.
+//	start: 0-D tensor. First entry in the range.
+//	stop: 0-D tensor. Last entry in the range.
+//	num: 0-D tensor. Number of values to generate.
 //
 // Returns 1-D. The generated values.
 func LinSpace(scope *Scope, start tf.Output, stop tf.Output, num tf.Output) (output tf.Output) {
@@ -20357,83 +21590,6 @@ func QuantizedAdd(scope *Scope, x tf.Output, y tf.Output, min_x tf.Output, max_x
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// MfccAttr is an optional argument to Mfcc.
-type MfccAttr func(optionalAttr)
-
-// MfccUpperFrequencyLimit sets the optional upper_frequency_limit attribute to value.
-//
-// value: The highest frequency to use when calculating the
-// ceptstrum.
-// If not specified, defaults to 4000
-func MfccUpperFrequencyLimit(value float32) MfccAttr {
-	return func(m optionalAttr) {
-		m["upper_frequency_limit"] = value
-	}
-}
-
-// MfccLowerFrequencyLimit sets the optional lower_frequency_limit attribute to value.
-//
-// value: The lowest frequency to use when calculating the
-// ceptstrum.
-// If not specified, defaults to 20
-func MfccLowerFrequencyLimit(value float32) MfccAttr {
-	return func(m optionalAttr) {
-		m["lower_frequency_limit"] = value
-	}
-}
-
-// MfccFilterbankChannelCount sets the optional filterbank_channel_count attribute to value.
-//
-// value: Resolution of the Mel bank used internally.
-// If not specified, defaults to 40
-func MfccFilterbankChannelCount(value int64) MfccAttr {
-	return func(m optionalAttr) {
-		m["filterbank_channel_count"] = value
-	}
-}
-
-// MfccDctCoefficientCount sets the optional dct_coefficient_count attribute to value.
-//
-// value: How many output channels to produce per time slice.
-// If not specified, defaults to 13
-func MfccDctCoefficientCount(value int64) MfccAttr {
-	return func(m optionalAttr) {
-		m["dct_coefficient_count"] = value
-	}
-}
-
-// Transforms a spectrogram into a form that's useful for speech recognition.
-//
-// Mel Frequency Cepstral Coefficients are a way of representing audio data that's
-// been effective as an input feature for machine learning. They are created by
-// taking the spectrum of a spectrogram (a 'cepstrum'), and discarding some of the
-// higher frequencies that are less significant to the human ear. They have a long
-// history in the speech recognition world, and https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
-// is a good resource to learn more.
-//
-// Arguments:
-//	spectrogram: Typically produced by the Spectrogram op, with magnitude_squared
-// set to true.
-//	sample_rate: How many samples per second the source audio used.
-func Mfcc(scope *Scope, spectrogram tf.Output, sample_rate tf.Output, optional ...MfccAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Mfcc",
-		Input: []tf.Input{
-			spectrogram, sample_rate,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Given a quantized tensor described by (input, input_min, input_max), outputs a
 //
 // range that covers the actual values present in that tensor.  This op is
@@ -20785,7 +21941,38 @@ func LookupTableInsertV2(scope *Scope, table_handle tf.Output, keys tf.Output, v
 	return scope.AddOperation(opspec)
 }
 
-// Returns element-wise smallest integer in not less than x.
+// Creates a dataset that batches and pads `batch_size` elements from the input.
+//
+// Arguments:
+//
+//	batch_size: A scalar representing the number of elements to accumulate in a
+// batch.
+//	padded_shapes: A list of int64 tensors representing the desired padded shapes
+// of the corresponding output components. These shapes may be partially
+// specified, using `-1` to indicate that a particular dimension should be
+// padded to the maximum size of all batch elements.
+//	padding_values: A list of scalars containing the padding value to use for
+// each of the outputs.
+//	drop_remainder: A scalar representing whether the last batch should be dropped in case its size
+// is smaller than desired.
+//
+func PaddedBatchDatasetV2(scope *Scope, input_dataset tf.Output, batch_size tf.Output, padded_shapes []tf.Output, padding_values []tf.Output, drop_remainder tf.Output, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "PaddedBatchDatasetV2",
+		Input: []tf.Input{
+			input_dataset, batch_size, tf.OutputList(padded_shapes), tf.OutputList(padding_values), drop_remainder,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns element-wise smallest integer not less than x.
 func Ceil(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
 		return
@@ -22114,7 +23301,7 @@ func TensorListSetItem(scope *Scope, input_handle tf.Output, index tf.Output, it
 
 // Computes the matrix exponential of one or more square matrices:
 //
-// exp(A) = \sum_{n=0}^\infty A^n/n!
+// \\(exp(A) = \sum_{n=0}^\infty A^n/n!\\)
 //
 // The exponential is computed using a combination of the scaling and squaring
 // method and the Pade approximation. Details can be founds in:
@@ -22494,6 +23681,28 @@ func MatrixSolve(scope *Scope, matrix tf.Output, rhs tf.Output, optional ...Matr
 	return op.Output(0)
 }
 
+// Returns a serialized GraphDef representing `input_dataset`.
+//
+// Returns a graph representation for `input_dataset`.
+//
+// Arguments:
+//	input_dataset: A variant tensor representing the dataset to return the graph representation for.
+//
+// Returns The graph representation of the dataset (as serialized GraphDef).
+func DatasetToGraph(scope *Scope, input_dataset tf.Output) (graph tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "DatasetToGraph",
+		Input: []tf.Input{
+			input_dataset,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // SvdAttr is an optional argument to Svd.
 type SvdAttr func(optionalAttr)
 
@@ -23490,6 +24699,145 @@ func ReaderReadUpToV2(scope *Scope, reader_handle tf.Output, queue_handle tf.Out
 	return op.Output(0), op.Output(1)
 }
 
+// BatchAttr is an optional argument to Batch.
+type BatchAttr func(optionalAttr)
+
+// BatchMaxEnqueuedBatches sets the optional max_enqueued_batches attribute to value.
+// If not specified, defaults to 10
+func BatchMaxEnqueuedBatches(value int64) BatchAttr {
+	return func(m optionalAttr) {
+		m["max_enqueued_batches"] = value
+	}
+}
+
+// BatchAllowedBatchSizes sets the optional allowed_batch_sizes attribute to value.
+// If not specified, defaults to <>
+func BatchAllowedBatchSizes(value []int64) BatchAttr {
+	return func(m optionalAttr) {
+		m["allowed_batch_sizes"] = value
+	}
+}
+
+// BatchContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func BatchContainer(value string) BatchAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// BatchSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func BatchSharedName(value string) BatchAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// BatchBatchingQueue sets the optional batching_queue attribute to value.
+// If not specified, defaults to ""
+func BatchBatchingQueue(value string) BatchAttr {
+	return func(m optionalAttr) {
+		m["batching_queue"] = value
+	}
+}
+
+// Batches all input tensors nondeterministically.
+//
+// When many instances of this Op are being run concurrently with the same
+// container/shared_name in the same device, some will output zero-shaped Tensors
+// and others will output Tensors of size up to max_batch_size.
+//
+// All Tensors in in_tensors are batched together (so, for example, labels and
+// features should be batched with a single instance of this operation.
+//
+// Each invocation of batch emits an `id` scalar which will be used to identify
+// this particular invocation when doing unbatch or its gradient.
+//
+// Each op which emits a non-empty batch will also emit a non-empty batch_index
+// Tensor, which, is a [K, 3] matrix where each row contains the invocation's id,
+// start, and length of elements of each set of Tensors present in batched_tensors.
+//
+// Batched tensors are concatenated along the first dimension, and all tensors in
+// in_tensors must have the first dimension of the same size.
+//
+// in_tensors: The tensors to be batched.
+// num_batch_threads: Number of scheduling threads for processing batches of work.
+//  Determines the number of batches processed in parallel.
+// max_batch_size: Batch sizes will never be bigger than this.
+// batch_timeout_micros: Maximum number of microseconds to wait before outputting
+//  an incomplete batch.
+// allowed_batch_sizes: Optional list of allowed batch sizes. If left empty, does
+//  nothing. Otherwise, supplies a list of batch sizes, causing the op to pad
+//  batches up to one of those sizes. The entries must increase monotonically, and
+//  the final entry must equal max_batch_size.
+// grad_timeout_micros: The timeout to use for the gradient. See Unbatch.
+// batched_tensors: Either empty tensors or a batch of concatenated Tensors.
+// batch_index: If out_tensors is non-empty, has information to invert it.
+// container: Controls the scope of sharing of this batch.
+// id: always contains a scalar with a unique ID for this invocation of Batch.
+// shared_name: Concurrently running instances of batch in the same device with the
+//  same container and shared_name will batch their elements together. If left
+//  empty, the op name will be used as the shared name.
+// T: the types of tensors to be batched.
+func Batch(scope *Scope, in_tensors []tf.Output, num_batch_threads int64, max_batch_size int64, batch_timeout_micros int64, grad_timeout_micros int64, optional ...BatchAttr) (batched_tensors []tf.Output, batch_index tf.Output, id tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"num_batch_threads": num_batch_threads, "max_batch_size": max_batch_size, "batch_timeout_micros": batch_timeout_micros, "grad_timeout_micros": grad_timeout_micros}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "Batch",
+		Input: []tf.Input{
+			tf.OutputList(in_tensors),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if batched_tensors, idx, err = makeOutputList(op, idx, "batched_tensors"); err != nil {
+		scope.UpdateErr("Batch", err)
+		return
+	}
+	batch_index = op.Output(idx)
+	id = op.Output(idx)
+	return batched_tensors, batch_index, id
+}
+
+// Adjust the hue of one or more images.
+//
+// `images` is a tensor of at least 3 dimensions.  The last dimension is
+// interpretted as channels, and must be three.
+//
+// The input image is considered in the RGB colorspace. Conceptually, the RGB
+// colors are first mapped into HSV. A delta is then applied all the hue values,
+// and then remapped back to RGB colorspace.
+//
+// Arguments:
+//	images: Images to adjust.  At least 3-D.
+//	delta: A float delta to add to the hue.
+//
+// Returns The hue-adjusted image or images.
+func AdjustHue(scope *Scope, images tf.Output, delta tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "AdjustHue",
+		Input: []tf.Input{
+			images, delta,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // ResourceApplyAdamAttr is an optional argument to ResourceApplyAdam.
 type ResourceApplyAdamAttr func(optionalAttr)
 
@@ -23517,10 +24865,10 @@ func ResourceApplyAdamUseNesterov(value bool) ResourceApplyAdamAttr {
 
 // Update '*var' according to the Adam algorithm.
 //
-// lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
-// m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t
-// v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t * g_t
-// variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
+// $$lr_t := \text{learning_rate} * \sqrt{(1 - beta_2^t) / (1 - beta_1^t)}$$
+// $$m_t := beta_1 * m_{t-1} + (1 - beta_1) * g$$
+// $$v_t := beta_2 * v_{t-1} + (1 - beta_2) * g * g$$
+// $$variable := variable - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$
 //
 // Arguments:
 //	var_: Should be from a Variable().
@@ -23914,71 +25262,6 @@ func DecodeGif(scope *Scope, contents tf.Output) (image tf.Output) {
 	return op.Output(0)
 }
 
-// Computes the gradient of the sigmoid of `x` wrt its input.
-//
-// Specifically, `grad = dy * y * (1 - y)`, where `y = sigmoid(x)`, and
-// `dy` is the corresponding input gradient.
-func SigmoidGrad(scope *Scope, y tf.Output, dy tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SigmoidGrad",
-		Input: []tf.Input{
-			y, dy,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Convert one or more images from HSV to RGB.
-//
-// Outputs a tensor of the same shape as the `images` tensor, containing the RGB
-// value of the pixels. The output is only well defined if the value in `images`
-// are in `[0,1]`.
-//
-// See `rgb_to_hsv` for a description of the HSV encoding.
-//
-// Arguments:
-//	images: 1-D or higher rank. HSV data to convert. Last dimension must be size 3.
-//
-// Returns `images` converted to RGB.
-func HSVToRGB(scope *Scope, images tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "HSVToRGB",
-		Input: []tf.Input{
-			images,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Retrieves the tree ensemble resource stamp token, number of trees and growing statistics.
-//
-// Arguments:
-//	tree_ensemble_handle: Handle to the tree ensemble.
-//
-// Returns Stamp token of the tree ensemble resource.The number of trees in the tree ensemble resource.The number of trees that were finished successfully.The number of layers we attempted to build (but not necessarily succeeded).Rank size 2 tensor that contains start and end ids of the nodes in the latest
-// layer.
-func BoostedTreesGetEnsembleStates(scope *Scope, tree_ensemble_handle tf.Output) (stamp_token tf.Output, num_trees tf.Output, num_finalized_trees tf.Output, num_attempted_layers tf.Output, last_layer_nodes_range tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "BoostedTreesGetEnsembleStates",
-		Input: []tf.Input{
-			tree_ensemble_handle,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
-}
-
 // Gets the next output from the given iterator.
 //
 // This operation is a synchronous version IteratorGetNext. It should only be used
@@ -24558,10 +25841,124 @@ func NonMaxSuppressionV2(scope *Scope, boxes tf.Output, scores tf.Output, max_ou
 	return op.Output(0)
 }
 
+// Greedily selects a subset of bounding boxes in descending order of score,
+//
+// pruning away boxes that have high intersection-over-union (IOU) overlap
+// with previously selected boxes.  Bounding boxes with score less than
+// `score_threshold` are removed.  Bounding boxes are supplied as
+// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
+// diagonal pair of box corners and the coordinates can be provided as normalized
+// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
+// is agnostic to where the origin is in the coordinate system and more
+// generally is invariant to orthogonal transformations and translations
+// of the coordinate system; thus translating or reflections of the coordinate
+// system result in the same boxes being selected by the algorithm.
+// The output of this operation is a set of integers indexing into the input
+// collection of bounding boxes representing the selected boxes.  The bounding
+// box coordinates corresponding to the selected indices can then be obtained
+// using the `tf.gather operation`.  For example:
+//   selected_indices = tf.image.non_max_suppression_v2(
+//       boxes, scores, max_output_size, iou_threshold, score_threshold)
+//   selected_boxes = tf.gather(boxes, selected_indices)
+//
+// Arguments:
+//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
+//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
+// score corresponding to each box (each row of boxes).
+//	max_output_size: A scalar integer tensor representing the maximum number of
+// boxes to be selected by non max suppression.
+//	iou_threshold: A 0-D float tensor representing the threshold for deciding whether
+// boxes overlap too much with respect to IOU.
+//	score_threshold: A 0-D float tensor representing the threshold for deciding when to remove
+// boxes based on score.
+//
+// Returns A 1-D integer tensor of shape `[M]` representing the selected
+// indices from the boxes tensor, where `M <= max_output_size`.
+func NonMaxSuppressionV3(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, iou_threshold tf.Output, score_threshold tf.Output) (selected_indices tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "NonMaxSuppressionV3",
+		Input: []tf.Input{
+			boxes, scores, max_output_size, iou_threshold, score_threshold,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// NonMaxSuppressionV4Attr is an optional argument to NonMaxSuppressionV4.
+type NonMaxSuppressionV4Attr func(optionalAttr)
+
+// NonMaxSuppressionV4PadToMaxOutputSize sets the optional pad_to_max_output_size attribute to value.
+//
+// value: If true, the output `selected_indices` is padded to be of length
+// `max_output_size`. Defaults to false.
+// If not specified, defaults to false
+func NonMaxSuppressionV4PadToMaxOutputSize(value bool) NonMaxSuppressionV4Attr {
+	return func(m optionalAttr) {
+		m["pad_to_max_output_size"] = value
+	}
+}
+
+// Greedily selects a subset of bounding boxes in descending order of score,
+//
+// pruning away boxes that have high intersection-over-union (IOU) overlap
+// with previously selected boxes.  Bounding boxes with score less than
+// `score_threshold` are removed.  Bounding boxes are supplied as
+// [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any
+// diagonal pair of box corners and the coordinates can be provided as normalized
+// (i.e., lying in the interval [0, 1]) or absolute.  Note that this algorithm
+// is agnostic to where the origin is in the coordinate system and more
+// generally is invariant to orthogonal transformations and translations
+// of the coordinate system; thus translating or reflections of the coordinate
+// system result in the same boxes being selected by the algorithm.
+// The output of this operation is a set of integers indexing into the input
+// collection of bounding boxes representing the selected boxes.  The bounding
+// box coordinates corresponding to the selected indices can then be obtained
+// using the `tf.gather operation`.  For example:
+//   selected_indices = tf.image.non_max_suppression_v2(
+//       boxes, scores, max_output_size, iou_threshold, score_threshold)
+//   selected_boxes = tf.gather(boxes, selected_indices)
+//
+// Arguments:
+//	boxes: A 2-D float tensor of shape `[num_boxes, 4]`.
+//	scores: A 1-D float tensor of shape `[num_boxes]` representing a single
+// score corresponding to each box (each row of boxes).
+//	max_output_size: A scalar integer tensor representing the maximum number of
+// boxes to be selected by non max suppression.
+//	iou_threshold: A 0-D float tensor representing the threshold for deciding whether
+// boxes overlap too much with respect to IOU.
+//	score_threshold: A 0-D float tensor representing the threshold for deciding when to remove
+// boxes based on score.
+//
+// Returns A 1-D integer tensor of shape `[M]` representing the selected
+// indices from the boxes tensor, where `M <= max_output_size`.A 0-D integer tensor representing the number of valid elements in
+// `selected_indices`, with the valid elements appearing first.
+func NonMaxSuppressionV4(scope *Scope, boxes tf.Output, scores tf.Output, max_output_size tf.Output, iou_threshold tf.Output, score_threshold tf.Output, optional ...NonMaxSuppressionV4Attr) (selected_indices tf.Output, valid_outputs tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "NonMaxSuppressionV4",
+		Input: []tf.Input{
+			boxes, scores, max_output_size, iou_threshold, score_threshold,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
 // Computes the matrix logarithm of one or more square matrices:
 //
 //
-// log(exp(A)) = A
+// \\(log(exp(A)) = A\\)
 //
 // This op is only defined for complex matrices. If A is positive-definite and
 // real, then casting to a complex matrix, taking the logarithm and casting back
@@ -24598,6 +25995,31 @@ func MatrixLogarithm(scope *Scope, input tf.Output) (output tf.Output) {
 	return op.Output(0)
 }
 
+//   This op is used as a placeholder in If branch functions. It doesn't provide a
+//   valid output when run, so must either be removed (e.g. replaced with a
+//   function input) or guaranteed not to be used (e.g. if mirroring an
+//   intermediate output needed for the gradient computation of the other branch).
+//
+// Arguments:
+//	dtype: The type of the output.
+//	shape:     The purported shape of the output. This is only used for shape inference;
+//     the output will not necessarily have this shape. Can be a partial shape.
+//
+// Returns     \"Fake\" output value. This should not be consumed by another op.
+func FakeParam(scope *Scope, dtype tf.DataType, shape tf.Shape) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtype": dtype, "shape": shape}
+	opspec := tf.OpSpec{
+		Type: "FakeParam",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // EncodeProtoAttr is an optional argument to EncodeProto.
 type EncodeProtoAttr func(optionalAttr)
 
@@ -24735,131 +26157,6 @@ func TensorArrayGradV3(scope *Scope, handle tf.Output, flow_in tf.Output, source
 	return op.Output(0), op.Output(1)
 }
 
-// DecodeProtoV2Attr is an optional argument to DecodeProtoV2.
-type DecodeProtoV2Attr func(optionalAttr)
-
-// DecodeProtoV2DescriptorSource sets the optional descriptor_source attribute to value.
-//
-// value: Either the special value `local://` or a path to a file containing
-// a serialized `FileDescriptorSet`.
-// If not specified, defaults to "local://"
-func DecodeProtoV2DescriptorSource(value string) DecodeProtoV2Attr {
-	return func(m optionalAttr) {
-		m["descriptor_source"] = value	}
-}
-
-// DecodeProtoV2MessageFormat sets the optional message_format attribute to value.
-//
-// value: Either `binary` or `text`.
-// If not specified, defaults to "binary"
-func DecodeProtoV2MessageFormat(value string) DecodeProtoV2Attr {
-	return func(m optionalAttr) {
-		m["message_format"] = value
-	}
-}
-
-// DecodeProtoV2Sanitize sets the optional sanitize attribute to value.
-//
-// value: Whether to sanitize the result or not.
-// If not specified, defaults to false
-func DecodeProtoV2Sanitize(value bool) DecodeProtoV2Attr {
-	return func(m optionalAttr) {
-		m["sanitize"] = value
-	}
-}
-
-// The op extracts fields from a serialized protocol buffers message into tensors.
-//
-// The `decode_proto` op extracts fields from a serialized protocol buffers
-// message into tensors.  The fields in `field_names` are decoded and converted
-// to the corresponding `output_types` if possible.
-//
-// A `message_type` name must be provided to give context for the field
-// names. The actual message descriptor can be looked up either in the
-// linked-in descriptor pool or a filename provided by the caller using
-// the `descriptor_source` attribute.
-//
-// Each output tensor is a dense tensor. This means that it is padded to
-// hold the largest number of repeated elements seen in the input
-// minibatch. (The shape is also padded by one to prevent zero-sized
-// dimensions). The actual repeat counts for each example in the
-// minibatch can be found in the `sizes` output. In many cases the output
-// of `decode_proto` is fed immediately into tf.squeeze if missing values
-// are not a concern. When using tf.squeeze, always pass the squeeze
-// dimension explicitly to avoid surprises.
-//
-// For the most part, the mapping between Proto field types and
-// TensorFlow dtypes is straightforward. However, there are a few
-// special cases:
-//
-// - A proto field that contains a submessage or group can only be converted
-// to `DT_STRING` (the serialized submessage). This is to reduce the
-// complexity of the API. The resulting string can be used as input
-// to another instance of the decode_proto op.
-//
-// - TensorFlow lacks support for unsigned integers. The ops represent uint64
-// types as a `DT_INT64` with the same twos-complement bit pattern
-// (the obvious way). Unsigned int32 values can be represented exactly by
-// specifying type `DT_INT64`, or using twos-complement if the caller
-// specifies `DT_INT32` in the `output_types` attribute.
-//
-// The `descriptor_source` attribute selects a source of protocol
-// descriptors to consult when looking up `message_type`. This may be a
-// filename containing a serialized `FileDescriptorSet` message,
-// or the special value `local://`, in which case only descriptors linked
-// into the code will be searched; the filename can be on any filesystem
-// accessible to TensorFlow.
-//
-// You can build a `descriptor_source` file using the `--descriptor_set_out`
-// and `--include_imports` options to the protocol compiler `protoc`.
-//
-// The `local://` database only covers descriptors linked into the
-// code via C++ libraries, not Python imports. You can link in a proto descriptor
-// by creating a cc_library target with alwayslink=1.
-//
-// Both binary and text proto serializations are supported, and can be
-// chosen using the `format` attribute.
-//
-// Arguments:
-//	bytes: Tensor of serialized protos with shape `batch_shape`.
-//	message_type: Name of the proto message type to decode.
-//	field_names: List of strings containing proto field names.
-//	output_types: List of TF types to use for the respective field in field_names.
-//
-// Returns Tensor of int32 with shape `[batch_shape, len(field_names)]`.
-// Each entry is the number of values found for the corresponding field.
-// Optional fields may have 0 or 1 values.List of tensors containing values for the corresponding field.
-// `values[i]` has datatype `output_types[i]`
-// and shape `[batch_shape, max(sizes[...,i])]`.
-func DecodeProtoV2(scope *Scope, bytes tf.Output, message_type string, field_names []string, output_types []tf.DataType, optional ...DecodeProtoV2Attr) (sizes tf.Output, values []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"message_type": message_type, "field_names": field_names, "output_types": output_types}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "DecodeProtoV2",
-		Input: []tf.Input{
-			bytes,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	sizes = op.Output(idx)
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("DecodeProtoV2", err)
-		return
-	}
-	return sizes, values
-}
-
 // Creates a dataset that splits a SparseTensor into elements row-wise.
 func SparseTensorSliceDataset(scope *Scope, indices tf.Output, values tf.Output, dense_shape tf.Output) (handle tf.Output) {
 	if scope.Err() != nil {
@@ -24938,6 +26235,23 @@ func ReaderResetV2(scope *Scope, reader_handle tf.Output) (o *tf.Operation) {
 	return scope.AddOperation(opspec)
 }
 
+// A dataset that splits the elements of its input into multiple elements.
+func UnbatchDataset(scope *Scope, input_dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "UnbatchDataset",
+		Input: []tf.Input{
+			input_dataset,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // RpcAttr is an optional argument to Rpc.
 type RpcAttr func(optionalAttr)
 
@@ -25190,6 +26504,36 @@ func ConcatenateDataset(scope *Scope, input_dataset tf.Output, another_dataset t
 	return op.Output(0)
 }
 
+// Debugging/model interpretability outputs for each example.
+//
+// It traverses all the trees and computes debug metrics for individual examples,
+// such as getting split feature ids and logits after each split along the decision
+// path used to compute directional feature contributions.
+//
+// Arguments:
+//
+//	bucketized_features: A list of rank 1 Tensors containing bucket id for each
+// feature.
+//	logits_dimension: scalar, dimension of the logits, to be used for constructing the protos in
+// examples_debug_outputs_serialized.
+//
+// Returns Output rank 1 Tensor containing a proto serialized as a string for each example.
+func BoostedTreesExampleDebugOutputs(scope *Scope, tree_ensemble_handle tf.Output, bucketized_features []tf.Output, logits_dimension int64) (examples_debug_outputs_serialized tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"logits_dimension": logits_dimension}
+	opspec := tf.OpSpec{
+		Type: "BoostedTreesExampleDebugOutputs",
+		Input: []tf.Input{
+			tree_ensemble_handle, tf.OutputList(bucketized_features),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Adds a value to the current value of a variable.
 //
 // Any ReadVariableOp with a control dependency on this op is guaranteed to
@@ -25778,81 +27122,6 @@ func CacheDataset(scope *Scope, input_dataset tf.Output, filename tf.Output, out
 	return op.Output(0)
 }
 
-// Computes the sum along sparse segments of a tensor.
-//
-// Like `SparseSegmentSum`, but allows missing ids in `segment_ids`. If an id is
-// misisng, the `output` tensor at that position will be zeroed.
-//
-// Read @{$math_ops#Segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// For example:
-//
-// ```python
-// c = tf.constant([[1,2,3,4], [-1,-2,-3,-4], [5,6,7,8]])
-//
-// tf.sparse_segment_sum_with_num_segments(
-//     c, tf.constant([0, 1]), tf.constant([0, 0]), num_segments=3)
-// # => [[0 0 0 0]
-// #     [0 0 0 0]
-// #     [0 0 0 0]]
-//
-// tf.sparse_segment_sum_with_num_segments(c,
-//                                         tf.constant([0, 1]),
-//                                         tf.constant([0, 2],
-//                                         num_segments=4))
-// # => [[ 1  2  3  4]
-// #     [ 0  0  0  0]
-// #     [-1 -2 -3 -4]
-// #     [ 0  0  0  0]]
-// ```
-//
-// Arguments:
-//
-//	indices: A 1-D tensor. Has same rank as `segment_ids`.
-//	segment_ids: A 1-D tensor. Values should be sorted and can be repeated.
-//	num_segments: Should equal the number of distinct segment IDs.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `num_segments`.
-func SparseSegmentSumWithNumSegments(scope *Scope, data tf.Output, indices tf.Output, segment_ids tf.Output, num_segments tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseSegmentSumWithNumSegments",
-		Input: []tf.Input{
-			data, indices, segment_ids, num_segments,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Creates a dataset that executes a SQL query and emits rows of the result set.
-//
-// Arguments:
-//	driver_name: The database type. Currently, the only supported type is 'sqlite'.
-//	data_source_name: A connection string to connect to the database.
-//	query: A SQL query to execute.
-//
-//
-func SqlDataset(scope *Scope, driver_name tf.Output, data_source_name tf.Output, query tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "SqlDataset",
-		Input: []tf.Input{
-			driver_name, data_source_name, query,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Creates a dataset that emits the records from one or more binary files.
 //
 // Arguments:
@@ -25940,6 +27209,26 @@ func TFRecordDataset(scope *Scope, filenames tf.Output, compression_type tf.Outp
 	return op.Output(0)
 }
 
+// A container for an iterator resource.
+//
+// Returns A handle to the iterator that can be passed to a "MakeIterator" or
+// "IteratorGetNext" op. In contrast to Iterator, AnonymousIterator prevents
+// resource sharing by name, and does not keep a reference to the resource
+// container.
+func AnonymousIterator(scope *Scope, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "AnonymousIterator",
+
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // BatchToSpace for 4-D tensors of type T.
 //
 // This is a legacy version of the more general BatchToSpaceND.
@@ -26124,7 +27413,7 @@ func AdjustContrastv2(scope *Scope, images tf.Output, contrast_factor tf.Output)
 	return op.Output(0)
 }
 
-// Gets the next output from the given iterator.
+// Gets the next output from the given iterator .
 func IteratorGetNext(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
 	if scope.Err() != nil {
 		return
@@ -26443,6 +27732,330 @@ func Cross(scope *Scope, a tf.Output, b tf.Output) (product tf.Output) {
 	return op.Output(0)
 }
 
+// Writes the given dataset to the given file using the TFRecord format.
+//
+// Arguments:
+//	input_dataset: A variant tensor representing the dataset to write.
+//	filename: A scalar string tensor representing the filename to use.
+//	compression_type: A scalar string tensor containing either (i) the empty string (no
+// compression), (ii) "ZLIB", or (iii) "GZIP".
+//
+// Returns the created operation.
+func DatasetToTFRecord(scope *Scope, input_dataset tf.Output, filename tf.Output, compression_type tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "DatasetToTFRecord",
+		Input: []tf.Input{
+			input_dataset, filename, compression_type,
+		},
+	}
+	return scope.AddOperation(opspec)
+}
+
+// AvgPool3DAttr is an optional argument to AvgPool3D.
+type AvgPool3DAttr func(optionalAttr)
+
+// AvgPool3DDataFormat sets the optional data_format attribute to value.
+//
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func AvgPool3DDataFormat(value string) AvgPool3DAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Performs 3D average pooling on the input.
+//
+// Arguments:
+//	input: Shape `[batch, depth, rows, cols, channels]` tensor to pool over.
+//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
+// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
+//
+// Returns The average pooled output tensor.
+func AvgPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "AvgPool3D",
+		Input: []tf.Input{
+			input,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// A placeholder for input pipeline graph optimizations.
+//
+// A placeholder for input pipeline graph optimizations.
+//
+// Arguments:
+//	input_dataset: A variant tensor representing the input dataset.
+func SinkDataset(scope *Scope, input_dataset tf.Output) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SinkDataset",
+		Input: []tf.Input{
+			input_dataset,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Constructs an Optional variant from a tuple of tensors.
+func OptionalFromValue(scope *Scope, components []tf.Output) (optional tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "OptionalFromValue",
+		Input: []tf.Input{
+			tf.OutputList(components),
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// DecodeProtoV2Attr is an optional argument to DecodeProtoV2.
+type DecodeProtoV2Attr func(optionalAttr)
+
+// DecodeProtoV2DescriptorSource sets the optional descriptor_source attribute to value.
+//
+// value: Either the special value `local://` or a path to a file containing
+// a serialized `FileDescriptorSet`.
+// If not specified, defaults to "local://"
+func DecodeProtoV2DescriptorSource(value string) DecodeProtoV2Attr {
+	return func(m optionalAttr) {
+		m["descriptor_source"] = value
+	}
+}
+
+// DecodeProtoV2MessageFormat sets the optional message_format attribute to value.
+//
+// value: Either `binary` or `text`.
+// If not specified, defaults to "binary"
+func DecodeProtoV2MessageFormat(value string) DecodeProtoV2Attr {
+	return func(m optionalAttr) {
+		m["message_format"] = value
+	}
+}
+
+// DecodeProtoV2Sanitize sets the optional sanitize attribute to value.
+//
+// value: Whether to sanitize the result or not.
+// If not specified, defaults to false
+func DecodeProtoV2Sanitize(value bool) DecodeProtoV2Attr {
+	return func(m optionalAttr) {
+		m["sanitize"] = value
+	}
+}
+
+// The op extracts fields from a serialized protocol buffers message into tensors.
+//
+// The `decode_proto` op extracts fields from a serialized protocol buffers
+// message into tensors.  The fields in `field_names` are decoded and converted
+// to the corresponding `output_types` if possible.
+//
+// A `message_type` name must be provided to give context for the field
+// names. The actual message descriptor can be looked up either in the
+// linked-in descriptor pool or a filename provided by the caller using
+// the `descriptor_source` attribute.
+//
+// Each output tensor is a dense tensor. This means that it is padded to
+// hold the largest number of repeated elements seen in the input
+// minibatch. (The shape is also padded by one to prevent zero-sized
+// dimensions). The actual repeat counts for each example in the
+// minibatch can be found in the `sizes` output. In many cases the output
+// of `decode_proto` is fed immediately into tf.squeeze if missing values
+// are not a concern. When using tf.squeeze, always pass the squeeze
+// dimension explicitly to avoid surprises.
+//
+// For the most part, the mapping between Proto field types and
+// TensorFlow dtypes is straightforward. However, there are a few
+// special cases:
+//
+// - A proto field that contains a submessage or group can only be converted
+// to `DT_STRING` (the serialized submessage). This is to reduce the
+// complexity of the API. The resulting string can be used as input
+// to another instance of the decode_proto op.
+//
+// - TensorFlow lacks support for unsigned integers. The ops represent uint64
+// types as a `DT_INT64` with the same twos-complement bit pattern
+// (the obvious way). Unsigned int32 values can be represented exactly by
+// specifying type `DT_INT64`, or using twos-complement if the caller
+// specifies `DT_INT32` in the `output_types` attribute.
+//
+// The `descriptor_source` attribute selects a source of protocol
+// descriptors to consult when looking up `message_type`. This may be a
+// filename containing a serialized `FileDescriptorSet` message,
+// or the special value `local://`, in which case only descriptors linked
+// into the code will be searched; the filename can be on any filesystem
+// accessible to TensorFlow.
+//
+// You can build a `descriptor_source` file using the `--descriptor_set_out`
+// and `--include_imports` options to the protocol compiler `protoc`.
+//
+// The `local://` database only covers descriptors linked into the
+// code via C++ libraries, not Python imports. You can link in a proto descriptor
+// by creating a cc_library target with alwayslink=1.
+//
+// Both binary and text proto serializations are supported, and can be
+// chosen using the `format` attribute.
+//
+// Arguments:
+//	bytes: Tensor of serialized protos with shape `batch_shape`.
+//	message_type: Name of the proto message type to decode.
+//	field_names: List of strings containing proto field names.
+//	output_types: List of TF types to use for the respective field in field_names.
+//
+// Returns Tensor of int32 with shape `[batch_shape, len(field_names)]`.
+// Each entry is the number of values found for the corresponding field.
+// Optional fields may have 0 or 1 values.List of tensors containing values for the corresponding field.
+// `values[i]` has datatype `output_types[i]`
+// and shape `[batch_shape, max(sizes[...,i])]`.
+func DecodeProtoV2(scope *Scope, bytes tf.Output, message_type string, field_names []string, output_types []tf.DataType, optional ...DecodeProtoV2Attr) (sizes tf.Output, values []tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"message_type": message_type, "field_names": field_names, "output_types": output_types}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "DecodeProtoV2",
+		Input: []tf.Input{
+			bytes,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	sizes = op.Output(idx)
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("DecodeProtoV2", err)
+		return
+	}
+	return sizes, values
+}
+
+// Creates an Optional variant with no value.
+func OptionalNone(scope *Scope) (optional tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "OptionalNone",
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns true if and only if the given Optional variant has a value.
+func OptionalHasValue(scope *Scope, optional tf.Output) (has_value tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "OptionalHasValue",
+		Input: []tf.Input{
+			optional,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Creates a dataset that executes a SQL query and emits rows of the result set.
+//
+// Arguments:
+//	driver_name: The database type. Currently, the only supported type is 'sqlite'.
+//	data_source_name: A connection string to connect to the database.
+//	query: A SQL query to execute.
+//
+//
+func SqlDataset(scope *Scope, driver_name tf.Output, data_source_name tf.Output, query tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "SqlDataset",
+		Input: []tf.Input{
+			driver_name, data_source_name, query,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Returns the value stored in an Optional variant or raises an error if none exists.
+func OptionalGetValue(scope *Scope, optional tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (components []tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "OptionalGetValue",
+		Input: []tf.Input{
+			optional,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if components, idx, err = makeOutputList(op, idx, "components"); err != nil {
+		scope.UpdateErr("OptionalGetValue", err)
+		return
+	}
+	return components
+}
+
+// Gets the next output from the given iterator as an Optional variant.
+func IteratorGetNextAsOptional(scope *Scope, iterator tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (optional tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "IteratorGetNextAsOptional",
+		Input: []tf.Input{
+			iterator,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Performs a padding as a preprocess during a convolution.
 //
 // Similar to FusedResizeAndPadConv2d, this op allows for an optimized
@@ -26998,6 +28611,26 @@ func QueueEnqueueV2(scope *Scope, handle tf.Output, components []tf.Output, opti
 	return scope.AddOperation(opspec)
 }
 
+// Computes the Bessel i0e function of `x` element-wise.
+//
+// Exponentially scaled modified Bessel function of order 0 defined as
+// `bessel_i0e(x) = exp(-abs(x)) bessel_i0(x)`.
+//
+// This function is faster and numerically stabler than `bessel_i0(x)`.
+func BesselI0e(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "BesselI0e",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // QueueDequeueManyV2Attr is an optional argument to QueueDequeueManyV2.
 type QueueDequeueManyV2Attr func(optionalAttr)
 
@@ -27108,6 +28741,29 @@ func EncodeBase64(scope *Scope, input tf.Output, optional ...EncodeBase64Attr) (
 	return op.Output(0)
 }
 
+// A dataset that creates window datasets from the input dataset.
+//
+// Arguments:
+//
+//	window_size: A scalar representing the number of elements to accumulate in a window.
+//
+//
+func WindowDataset(scope *Scope, input_dataset tf.Output, window_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "WindowDataset",
+		Input: []tf.Input{
+			input_dataset, window_size,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Deprecated. Use TensorArrayCloseV3
 //
 // DEPRECATED at GraphDef version 26: Use TensorArrayCloseV3
@@ -27480,30 +29136,30 @@ func CudnnRNNIsTraining(value bool) CudnnRNNAttr {
 //
 // rnn_mode: Indicates the type of the RNN model.
 // input_mode: Indicate whether there is a linear projection between the input and
-//   The actual computation before the first layer. 'skip_input' is only allowed
+//   the actual computation before the first layer. 'skip_input' is only allowed
 //   when input_size == num_units; 'auto_select' implies 'skip_input' when
 //   input_size == num_units; otherwise, it implies 'linear_input'.
-// direction: Indicates whether a bidirectional model will be used.
-//   dir = (direction == bidirectional) ? 2 : 1
-// dropout: dropout probability. When set to 0., dropout is disabled.
-// seed: the 1st part of a seed to initialize dropout.
-// seed2: the 2nd part of a seed to initialize dropout.
-// input: a 3-D tensor with the shape of [seq_length, batch_size, input_size].
-// input_h: a 3-D tensor with the shape of [num_layer * dir, batch_size,
+// direction: Indicates whether a bidirectional model will be used. Should be
+//   "unidirectional" or "bidirectional".
+// dropout: Dropout probability. When set to 0., dropout is disabled.
+// seed: The 1st part of a seed to initialize dropout.
+// seed2: The 2nd part of a seed to initialize dropout.
+// input: A 3-D tensor with the shape of [seq_length, batch_size, input_size].
+// input_h: A 3-D tensor with the shape of [num_layer * dir, batch_size,
 //     num_units].
 // input_c: For LSTM, a 3-D tensor with the shape of
 //     [num_layer * dir, batch, num_units]. For other models, it is ignored.
-// params: a 1-D tensor that contains the weights and biases in an opaque layout.
+// params: A 1-D tensor that contains the weights and biases in an opaque layout.
 //     The size must be created through CudnnRNNParamsSize, and initialized
 //     separately. Note that they might not be compatible across different
 //     generations. So it is a good idea to save and restore
-// output: a 3-D tensor with the shape of [seq_length, batch_size,
+// output: A 3-D tensor with the shape of [seq_length, batch_size,
 //     dir * num_units].
-// output_h: the same shape has input_h.
-// output_c: the same shape as input_c for LSTM. An empty tensor for other models.
+// output_h: The same shape has input_h.
+// output_c: The same shape as input_c for LSTM. An empty tensor for other models.
 // is_training: Indicates whether this operation is used for inferenece or
 //   training.
-// reserve_space: an opaque tensor that can be used in backprop calculation. It
+// reserve_space: An opaque tensor that can be used in backprop calculation. It
 //   is only produced if is_training is false.
 func CudnnRNN(scope *Scope, input tf.Output, input_h tf.Output, input_c tf.Output, params tf.Output, optional ...CudnnRNNAttr) (output tf.Output, output_h tf.Output, output_c tf.Output, reserve_space tf.Output) {
 	if scope.Err() != nil {
@@ -27524,6 +29180,37 @@ func CudnnRNN(scope *Scope, input tf.Output, input_h tf.Output, input_c tf.Outpu
 	return op.Output(0), op.Output(1), op.Output(2), op.Output(3)
 }
 
+// Creates a TensorArray for storing multiple gradients of values in the given handle.
+//
+// Similar to TensorArrayGradV3. However it creates an accumulator with an
+// expanded shape compared to the input TensorArray whose gradient is being
+// computed. This enables multiple gradients for the same TensorArray to be
+// calculated using the same accumulator.
+//
+// Arguments:
+//	handle: The handle to the forward TensorArray.
+//	flow_in: A float scalar that enforces proper chaining of operations.
+//	shape_to_prepend: An int32 vector representing a shape. Elements in the gradient accumulator will
+// have shape which is this shape_to_prepend value concatenated with shape of the
+// elements in the TensorArray corresponding to the input handle.
+//	source: The gradient source string, used to decide which gradient TensorArray
+// to return.
+func TensorArrayGradWithShape(scope *Scope, handle tf.Output, flow_in tf.Output, shape_to_prepend tf.Output, source string) (grad_handle tf.Output, flow_out tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"source": source}
+	opspec := tf.OpSpec{
+		Type: "TensorArrayGradWithShape",
+		Input: []tf.Input{
+			handle, flow_in, shape_to_prepend,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
 // Compare values of `input` to `threshold` and pack resulting bits into a `uint8`.
 //
 // Each comparison returns a boolean `true` (if `input_value > threshold`)
@@ -27914,7 +29601,7 @@ func RandomShuffleQueueV2(scope *Scope, component_types []tf.DataType, optional
 //
 // For example, if an image is 100 x 200 pixels (height x width) and the bounding
 // box is `[0.1, 0.2, 0.5, 0.9]`, the upper-left and bottom-right coordinates of
-// the bounding box will be `(40, 10)` to `(100, 50)` (in (x,y) coordinates).
+// the bounding box will be `(40, 10)` to `(180, 50)` (in (x,y) coordinates).
 //
 // Parts of the bounding box may fall outside the image.
 //
@@ -28255,7 +29942,7 @@ func BoostedTreesCreateEnsemble(scope *Scope, tree_ensemble_handle tf.Output, st
 // `input` is a `Tensor` with rank `P` and `indices` is a `Tensor` of rank `Q`.
 //
 // `indices` must be integer tensor, containing indices into `input`.
-// It must be shape `[d_0, ..., d_{Q-2}, K]` where `0 < K <= P`.
+// It must be shape \\([d_0, ..., d_{Q-2}, K]\\) where `0 < K <= P`.
 //
 // The innermost dimension of `indices` (with length `K`) corresponds to
 // indices into elements (if `K = P`) or `(P-K)`-dimensional slices
@@ -28263,9 +29950,7 @@ func BoostedTreesCreateEnsemble(scope *Scope, tree_ensemble_handle tf.Output, st
 //
 // `updates` is `Tensor` of rank `Q-1+P-K` with shape:
 //
-// ```
-// [d_0, ..., d_{Q-2}, input.shape[K], ..., input.shape[P-1]].
-// ```
+// $$[d_0, ..., d_{Q-2}, input.shape[K], ..., input.shape[P-1]].$$
 //
 // For example, say we want to add 4 scattered elements to a rank-1 tensor to 8
 // elements. In Python, that addition would look like this:
@@ -29026,6 +30711,119 @@ func OrderedMapSize(scope *Scope, dtypes []tf.DataType, optional ...OrderedMapSi
 	return op.Output(0)
 }
 
+// CudnnRNNV2Attr is an optional argument to CudnnRNNV2.
+type CudnnRNNV2Attr func(optionalAttr)
+
+// CudnnRNNV2RnnMode sets the optional rnn_mode attribute to value.
+// If not specified, defaults to "lstm"
+func CudnnRNNV2RnnMode(value string) CudnnRNNV2Attr {
+	return func(m optionalAttr) {
+		m["rnn_mode"] = value
+	}
+}
+
+// CudnnRNNV2InputMode sets the optional input_mode attribute to value.
+// If not specified, defaults to "linear_input"
+func CudnnRNNV2InputMode(value string) CudnnRNNV2Attr {
+	return func(m optionalAttr) {
+		m["input_mode"] = value
+	}
+}
+
+// CudnnRNNV2Direction sets the optional direction attribute to value.
+// If not specified, defaults to "unidirectional"
+func CudnnRNNV2Direction(value string) CudnnRNNV2Attr {
+	return func(m optionalAttr) {
+		m["direction"] = value
+	}
+}
+
+// CudnnRNNV2Dropout sets the optional dropout attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNV2Dropout(value float32) CudnnRNNV2Attr {
+	return func(m optionalAttr) {
+		m["dropout"] = value
+	}
+}
+
+// CudnnRNNV2Seed sets the optional seed attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNV2Seed(value int64) CudnnRNNV2Attr {
+	return func(m optionalAttr) {
+		m["seed"] = value
+	}
+}
+
+// CudnnRNNV2Seed2 sets the optional seed2 attribute to value.
+// If not specified, defaults to 0
+func CudnnRNNV2Seed2(value int64) CudnnRNNV2Attr {
+	return func(m optionalAttr) {
+		m["seed2"] = value
+	}
+}
+
+// CudnnRNNV2IsTraining sets the optional is_training attribute to value.
+// If not specified, defaults to true
+func CudnnRNNV2IsTraining(value bool) CudnnRNNV2Attr {
+	return func(m optionalAttr) {
+		m["is_training"] = value
+	}
+}
+
+// A RNN backed by cuDNN.
+//
+// Computes the RNN from the input and initial states, with respect to the params
+// buffer. Produces one extra output "host_reserved" than CudnnRNN.
+//
+// rnn_mode: Indicates the type of the RNN model.
+// input_mode: Indicates whether there is a linear projection between the input and
+//   the actual computation before the first layer. 'skip_input' is only allowed
+//   when input_size == num_units; 'auto_select' implies 'skip_input' when
+//   input_size == num_units; otherwise, it implies 'linear_input'.
+// direction: Indicates whether a bidirectional model will be used. Should be
+//   "unidirectional" or "bidirectional".
+// dropout: Dropout probability. When set to 0., dropout is disabled.
+// seed: The 1st part of a seed to initialize dropout.
+// seed2: The 2nd part of a seed to initialize dropout.
+// input: A 3-D tensor with the shape of [seq_length, batch_size, input_size].
+// input_h: A 3-D tensor with the shape of [num_layer * dir, batch_size,
+//     num_units].
+// input_c: For LSTM, a 3-D tensor with the shape of
+//     [num_layer * dir, batch, num_units]. For other models, it is ignored.
+// params: A 1-D tensor that contains the weights and biases in an opaque layout.
+//     The size must be created through CudnnRNNParamsSize, and initialized
+//     separately. Note that they might not be compatible across different
+//     generations. So it is a good idea to save and restore
+// output: A 3-D tensor with the shape of [seq_length, batch_size,
+//     dir * num_units].
+// output_h: The same shape has input_h.
+// output_c: The same shape as input_c for LSTM. An empty tensor for other models.
+// is_training: Indicates whether this operation is used for inferenece or
+//   training.
+// reserve_space: An opaque tensor that can be used in backprop calculation. It
+//   is only produced if is_training is true.
+// host_reserved: An opaque tensor that can be used in backprop calculation. It is
+//   only produced if is_training is true. It is output on host memory rather than
+//   device memory.
+func CudnnRNNV2(scope *Scope, input tf.Output, input_h tf.Output, input_c tf.Output, params tf.Output, optional ...CudnnRNNV2Attr) (output tf.Output, output_h tf.Output, output_c tf.Output, reserve_space tf.Output, host_reserved tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "CudnnRNNV2",
+		Input: []tf.Input{
+			input, input_h, input_c, params,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2), op.Output(3), op.Output(4)
+}
+
 // ShapeNAttr is an optional argument to ShapeN.
 type ShapeNAttr func(optionalAttr)
 
@@ -30118,595 +31916,3 @@ func RightShift(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
-
-// Adjust the hue of one or more images.
-//
-// `images` is a tensor of at least 3 dimensions.  The last dimension is
-// interpretted as channels, and must be three.
-//
-// The input image is considered in the RGB colorspace. Conceptually, the RGB
-// colors are first mapped into HSV. A delta is then applied all the hue values,
-// and then remapped back to RGB colorspace.
-//
-// Arguments:
-//	images: Images to adjust.  At least 3-D.
-//	delta: A float delta to add to the hue.
-//
-// Returns The hue-adjusted image or images.
-func AdjustHue(scope *Scope, images tf.Output, delta tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "AdjustHue",
-		Input: []tf.Input{
-			images, delta,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// BatchAttr is an optional argument to Batch.
-type BatchAttr func(optionalAttr)
-
-// BatchMaxEnqueuedBatches sets the optional max_enqueued_batches attribute to value.
-// If not specified, defaults to 10
-func BatchMaxEnqueuedBatches(value int64) BatchAttr {
-	return func(m optionalAttr) {
-		m["max_enqueued_batches"] = value
-	}
-}
-
-// BatchAllowedBatchSizes sets the optional allowed_batch_sizes attribute to value.
-// If not specified, defaults to <>
-func BatchAllowedBatchSizes(value []int64) BatchAttr {
-	return func(m optionalAttr) {
-		m["allowed_batch_sizes"] = value
-	}
-}
-
-// BatchContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func BatchContainer(value string) BatchAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// BatchSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func BatchSharedName(value string) BatchAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// BatchBatchingQueue sets the optional batching_queue attribute to value.
-// If not specified, defaults to ""
-func BatchBatchingQueue(value string) BatchAttr {
-	return func(m optionalAttr) {
-		m["batching_queue"] = value
-	}
-}
-
-// Batches all input tensors nondeterministically.
-//
-// When many instances of this Op are being run concurrently with the same
-// container/shared_name in the same device, some will output zero-shaped Tensors
-// and others will output Tensors of size up to max_batch_size.
-//
-// All Tensors in in_tensors are batched together (so, for example, labels and
-// features should be batched with a single instance of this operation.
-//
-// Each invocation of batch emits an `id` scalar which will be used to identify
-// this particular invocation when doing unbatch or its gradient.
-//
-// Each op which emits a non-empty batch will also emit a non-empty batch_index
-// Tensor, which, is a [K, 3] matrix where each row contains the invocation's id,
-// start, and length of elements of each set of Tensors present in batched_tensors.
-//
-// Batched tensors are concatenated along the first dimension, and all tensors in
-// in_tensors must have the first dimension of the same size.
-//
-// in_tensors: The tensors to be batched.
-// num_batch_threads: Number of scheduling threads for processing batches of work.
-//  Determines the number of batches processed in parallel.
-// max_batch_size: Batch sizes will never be bigger than this.
-// batch_timeout_micros: Maximum number of microseconds to wait before outputting
-//  an incomplete batch.
-// allowed_batch_sizes: Optional list of allowed batch sizes. If left empty, does
-//  nothing. Otherwise, supplies a list of batch sizes, causing the op to pad
-//  batches up to one of those sizes. The entries must increase monotonically, and
-//  the final entry must equal max_batch_size.
-// grad_timeout_micros: The timeout to use for the gradient. See Unbatch.
-// batched_tensors: Either empty tensors or a batch of concatenated Tensors.
-// batch_index: If out_tensors is non-empty, has information to invert it.
-// container: Controls the scope of sharing of this batch.
-// id: always contains a scalar with a unique ID for this invocation of Batch.
-// shared_name: Concurrently running instances of batch in the same device with the
-//  same container and shared_name will batch their elements together. If left
-//  empty, the op name will be used as the shared name.
-// T: the types of tensors to be batched.
-func Batch(scope *Scope, in_tensors []tf.Output, num_batch_threads int64, max_batch_size int64, batch_timeout_micros int64, grad_timeout_micros int64, optional ...BatchAttr) (batched_tensors []tf.Output, batch_index tf.Output, id tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"num_batch_threads": num_batch_threads, "max_batch_size": max_batch_size, "batch_timeout_micros": batch_timeout_micros, "grad_timeout_micros": grad_timeout_micros}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Batch",
-		Input: []tf.Input{
-			tf.OutputList(in_tensors),
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if batched_tensors, idx, err = makeOutputList(op, idx, "batched_tensors"); err != nil {
-		scope.UpdateErr("Batch", err)
-		return
-	}
-	batch_index = op.Output(idx)
-	id = op.Output(idx)
-	return batched_tensors, batch_index, id
-}
-
-// UnbatchAttr is an optional argument to Unbatch.
-type UnbatchAttr func(optionalAttr)
-
-// UnbatchContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func UnbatchContainer(value string) UnbatchAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// UnbatchSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func UnbatchSharedName(value string) UnbatchAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Reverses the operation of Batch for a single output Tensor.
-//
-// An instance of Unbatch either receives an empty batched_tensor, in which case it
-// asynchronously waits until the values become available from a concurrently
-// running instance of Unbatch with the same container and shared_name, or receives
-// a non-empty batched_tensor in which case it finalizes all other concurrently
-// running instances and outputs its own element from the batch.
-//
-// batched_tensor: The possibly transformed output of Batch. The size of the first
-//  dimension should remain unchanged by the transformations for the operation to
-//  work.
-// batch_index: The matching batch_index obtained from Batch.
-// id: The id scalar emitted by Batch.
-// unbatched_tensor: The Tensor corresponding to this execution.
-// timeout_micros: Maximum amount of time (in microseconds) to wait to receive the
-//  batched input tensor associated with a given invocation of the op.
-// container: Container to control resource sharing.
-// shared_name: Instances of Unbatch with the same container and shared_name are
-//  assumed to possibly belong to the same batch. If left empty, the op name will
-//  be used as the shared name.
-func Unbatch(scope *Scope, batched_tensor tf.Output, batch_index tf.Output, id tf.Output, timeout_micros int64, optional ...UnbatchAttr) (unbatched_tensor tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"timeout_micros": timeout_micros}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "Unbatch",
-		Input: []tf.Input{
-			batched_tensor, batch_index, id,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// AvgPool3DGradAttr is an optional argument to AvgPool3DGrad.
-type AvgPool3DGradAttr func(optionalAttr)
-
-// AvgPool3DGradDataFormat sets the optional data_format attribute to value.
-//
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func AvgPool3DGradDataFormat(value string) AvgPool3DGradAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Computes gradients of average pooling function.
-//
-// Arguments:
-//	orig_input_shape: The original input dimensions.
-//	grad: Output backprop of shape `[batch, depth, rows, cols, channels]`.
-//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
-// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
-//
-// Returns The backprop for input.
-func AvgPool3DGrad(scope *Scope, orig_input_shape tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...AvgPool3DGradAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "AvgPool3DGrad",
-		Input: []tf.Input{
-			orig_input_shape, grad,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// ParseSingleSequenceExampleAttr is an optional argument to ParseSingleSequenceExample.
-type ParseSingleSequenceExampleAttr func(optionalAttr)
-
-// ParseSingleSequenceExampleContextSparseTypes sets the optional context_sparse_types attribute to value.
-//
-// value: A list of Ncontext_sparse types; the data types of data in
-// each context Feature given in context_sparse_keys.
-// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleContextSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["context_sparse_types"] = value
-	}
-}
-
-// ParseSingleSequenceExampleFeatureListDenseTypes sets the optional feature_list_dense_types attribute to value.
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListDenseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["feature_list_dense_types"] = value
-	}
-}
-
-// ParseSingleSequenceExampleContextDenseShapes sets the optional context_dense_shapes attribute to value.
-//
-// value: A list of Ncontext_dense shapes; the shapes of data in
-// each context Feature given in context_dense_keys.
-// The number of elements in the Feature corresponding to context_dense_key[j]
-// must always equal context_dense_shapes[j].NumEntries().
-// The shape of context_dense_values[j] will match context_dense_shapes[j].
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleContextDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["context_dense_shapes"] = value
-	}
-}
-
-// ParseSingleSequenceExampleFeatureListSparseTypes sets the optional feature_list_sparse_types attribute to value.
-//
-// value: A list of Nfeature_list_sparse types; the data types
-// of data in each FeatureList given in feature_list_sparse_keys.
-// Currently the ParseSingleSequenceExample supports DT_FLOAT (FloatList),
-// DT_INT64 (Int64List), and DT_STRING (BytesList).
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListSparseTypes(value []tf.DataType) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["feature_list_sparse_types"] = value
-	}
-}
-
-// ParseSingleSequenceExampleFeatureListDenseShapes sets the optional feature_list_dense_shapes attribute to value.
-//
-// value: A list of Nfeature_list_dense shapes; the shapes of
-// data in each FeatureList given in feature_list_dense_keys.
-// The shape of each Feature in the FeatureList corresponding to
-// feature_list_dense_key[j] must always equal
-// feature_list_dense_shapes[j].NumEntries().
-// If not specified, defaults to <>
-//
-// REQUIRES: len(value) >= 0
-func ParseSingleSequenceExampleFeatureListDenseShapes(value []tf.Shape) ParseSingleSequenceExampleAttr {
-	return func(m optionalAttr) {
-		m["feature_list_dense_shapes"] = value
-	}
-}
-
-// Transforms a scalar brain.SequenceExample proto (as strings) into typed tensors.
-//
-// Arguments:
-//	serialized: A scalar containing a binary serialized SequenceExample proto.
-//	feature_list_dense_missing_assumed_empty: A vector listing the
-// FeatureList keys which may be missing from the SequenceExample.  If the
-// associated FeatureList is missing, it is treated as empty.  By default,
-// any FeatureList not listed in this vector must exist in the SequenceExample.
-//	context_sparse_keys: A list of Ncontext_sparse string Tensors (scalars).
-// The keys expected in the Examples' features associated with context_sparse
-// values.
-//	context_dense_keys: A list of Ncontext_dense string Tensors (scalars).
-// The keys expected in the SequenceExamples' context features associated with
-// dense values.
-//	feature_list_sparse_keys: A list of Nfeature_list_sparse string Tensors
-// (scalars).  The keys expected in the FeatureLists associated with sparse
-// values.
-//	feature_list_dense_keys: A list of Nfeature_list_dense string Tensors (scalars).
-// The keys expected in the SequenceExamples' feature_lists associated
-// with lists of dense values.
-//	context_dense_defaults: A list of Ncontext_dense Tensors (some may be empty).
-// context_dense_defaults[j] provides default values
-// when the SequenceExample's context map lacks context_dense_key[j].
-// If an empty Tensor is provided for context_dense_defaults[j],
-// then the Feature context_dense_keys[j] is required.
-// The input type is inferred from context_dense_defaults[j], even when it's
-// empty.  If context_dense_defaults[j] is not empty, its shape must match
-// context_dense_shapes[j].
-//	debug_name: A scalar containing the name of the serialized proto.
-// May contain, for example, table key (descriptive) name for the
-// corresponding serialized proto.  This is purely useful for debugging
-// purposes, and the presence of values here has no effect on the output.
-// May also be an empty scalar if no name is available.
-func ParseSingleSequenceExample(scope *Scope, serialized tf.Output, feature_list_dense_missing_assumed_empty tf.Output, context_sparse_keys []tf.Output, context_dense_keys []tf.Output, feature_list_sparse_keys []tf.Output, feature_list_dense_keys []tf.Output, context_dense_defaults []tf.Output, debug_name tf.Output, optional ...ParseSingleSequenceExampleAttr) (context_sparse_indices []tf.Output, context_sparse_values []tf.Output, context_sparse_shapes []tf.Output, context_dense_values []tf.Output, feature_list_sparse_indices []tf.Output, feature_list_sparse_values []tf.Output, feature_list_sparse_shapes []tf.Output, feature_list_dense_values []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ParseSingleSequenceExample",
-		Input: []tf.Input{
-			serialized, feature_list_dense_missing_assumed_empty, tf.OutputList(context_sparse_keys), tf.OutputList(context_dense_keys), tf.OutputList(feature_list_sparse_keys), tf.OutputList(feature_list_dense_keys), tf.OutputList(context_dense_defaults), debug_name,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if context_sparse_indices, idx, err = makeOutputList(op, idx, "context_sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_sparse_values, idx, err = makeOutputList(op, idx, "context_sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_sparse_shapes, idx, err = makeOutputList(op, idx, "context_sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if context_dense_values, idx, err = makeOutputList(op, idx, "context_dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_sparse_indices, idx, err = makeOutputList(op, idx, "feature_list_sparse_indices"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_sparse_values, idx, err = makeOutputList(op, idx, "feature_list_sparse_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_sparse_shapes, idx, err = makeOutputList(op, idx, "feature_list_sparse_shapes"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	if feature_list_dense_values, idx, err = makeOutputList(op, idx, "feature_list_dense_values"); err != nil {
-		scope.UpdateErr("ParseSingleSequenceExample", err)
-		return
-	}
-	return context_sparse_indices, context_sparse_values, context_sparse_shapes, context_dense_values, feature_list_sparse_indices, feature_list_sparse_values, feature_list_sparse_shapes, feature_list_dense_values
-}
-
-// UnbatchGradAttr is an optional argument to UnbatchGrad.
-type UnbatchGradAttr func(optionalAttr)
-
-// UnbatchGradContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func UnbatchGradContainer(value string) UnbatchGradAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// UnbatchGradSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func UnbatchGradSharedName(value string) UnbatchGradAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Gradient of Unbatch.
-//
-// Acts like Batch but using the given batch_index index of batching things as they
-// become available. This ensures that the gradients are propagated back in the
-// same session which did the forward pass.
-//
-// original_input: The input to the Unbatch operation this is the gradient of.
-// batch_index: The batch_index given to the Unbatch operation this is the gradient
-// of.
-// grad: The downstream gradient.
-// id: The id scalar emitted by Batch.
-// batched_grad: The return value, either an empty tensor or the batched gradient.
-// container: Container to control resource sharing.
-// shared_name: Instances of UnbatchGrad with the same container and shared_name
-//  are assumed to possibly belong to the same batch. If left empty, the op name
-//  will be used as the shared name.
-func UnbatchGrad(scope *Scope, original_input tf.Output, batch_index tf.Output, grad tf.Output, id tf.Output, optional ...UnbatchGradAttr) (batched_grad tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "UnbatchGrad",
-		Input: []tf.Input{
-			original_input, batch_index, grad, id,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// DecodeWavAttr is an optional argument to DecodeWav.
-type DecodeWavAttr func(optionalAttr)
-
-// DecodeWavDesiredChannels sets the optional desired_channels attribute to value.
-//
-// value: Number of sample channels wanted.
-// If not specified, defaults to -1
-func DecodeWavDesiredChannels(value int64) DecodeWavAttr {
-	return func(m optionalAttr) {
-		m["desired_channels"] = value
-	}
-}
-
-// DecodeWavDesiredSamples sets the optional desired_samples attribute to value.
-//
-// value: Length of audio requested.
-// If not specified, defaults to -1
-func DecodeWavDesiredSamples(value int64) DecodeWavAttr {
-	return func(m optionalAttr) {
-		m["desired_samples"] = value
-	}
-}
-
-// Decode a 16-bit PCM WAV file to a float tensor.
-//
-// The -32768 to 32767 signed 16-bit values will be scaled to -1.0 to 1.0 in float.
-//
-// When desired_channels is set, if the input contains fewer channels than this
-// then the last channel will be duplicated to give the requested number, else if
-// the input has more channels than requested then the additional channels will be
-// ignored.
-//
-// If desired_samples is set, then the audio will be cropped or padded with zeroes
-// to the requested length.
-//
-// The first output contains a Tensor with the content of the audio samples. The
-// lowest dimension will be the number of channels, and the second will be the
-// number of samples. For example, a ten-sample-long stereo WAV file should give an
-// output shape of [10, 2].
-//
-// Arguments:
-//	contents: The WAV-encoded audio, usually from a file.
-//
-// Returns 2-D with shape `[length, channels]`.Scalar holding the sample rate found in the WAV header.
-func DecodeWav(scope *Scope, contents tf.Output, optional ...DecodeWavAttr) (audio tf.Output, sample_rate tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "DecodeWav",
-		Input: []tf.Input{
-			contents,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
-}
-
-// Concatenates a list of `N` tensors along the first dimension.
-//
-// The input tensors are all required to have size 1 in the first dimension.
-//
-// For example:
-//
-// ```
-// # 'x' is [[1, 4]]
-// # 'y' is [[2, 5]]
-// # 'z' is [[3, 6]]
-// parallel_concat([x, y, z]) => [[1, 4], [2, 5], [3, 6]]  # Pack along first dim.
-// ```
-//
-// The difference between concat and parallel_concat is that concat requires all
-// of the inputs be computed before the operation will begin but doesn't require
-// that the input shapes be known during graph construction.  Parallel concat
-// will copy pieces of the input into the output as they become available, in
-// some situations this can provide a performance benefit.
-//
-// Arguments:
-//	values: Tensors to be concatenated. All must have size 1 in the first dimension
-// and same shape.
-//	shape: the final shape of the result; should be equal to the shapes of any input
-// but with the number of input values in the first dimension.
-//
-// Returns The concatenated tensor.
-func ParallelConcat(scope *Scope, values []tf.Output, shape tf.Shape) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"shape": shape}
-	opspec := tf.OpSpec{
-		Type: "ParallelConcat",
-		Input: []tf.Input{
-			tf.OutputList(values),
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-//     Subtracts `v` into specified rows of `x`.
-//
-//     Computes y = x; y[i, :] -= v; return y.
-//
-// Arguments:
-//	x: A `Tensor` of type T.
-//	i: A vector. Indices into the left-most dimension of `x`.
-//	v: A `Tensor` of type T. Same dimension sizes as x except the first dimension, which must be the same as i's size.
-//
-// Returns A `Tensor` of type T. An alias of `x`. The content of `y` is undefined if there are duplicates in `i`.
-func InplaceSub(scope *Scope, x tf.Output, i tf.Output, v tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "InplaceSub",
-		Input: []tf.Input{
-			x, i, v,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
diff --git a/tensorflow/go/operation.go b/tensorflow/go/operation.go
index 25ec718703..d6a37e0a86 100644
--- a/tensorflow/go/operation.go
+++ b/tensorflow/go/operation.go
@@ -45,6 +45,12 @@ func (op *Operation) NumOutputs() int {
 	return int(C.TF_OperationNumOutputs(op.c))
 }
 
+// Device returns a specification of the device on which this operation
+// will be executed, or the empty string if there is no such specification.
+func (op *Operation) Device() string {
+	return C.GoString(C.TF_OperationDevice(op.c))
+}
+
 // OutputListSize returns the size of the list of Outputs that is produced by a
 // named output of op.
 //
diff --git a/tensorflow/go/operation_test.go b/tensorflow/go/operation_test.go
index 06b65bdfb7..4af9e33ad0 100644
--- a/tensorflow/go/operation_test.go
+++ b/tensorflow/go/operation_test.go
@@ -228,6 +228,29 @@ func TestOperationConsumers(t *testing.T) {
 	}
 }
 
+func TestOperationDevice(t *testing.T) {
+	graph := NewGraph()
+	v, err := NewTensor(float32(1.0))
+	if err != nil {
+		t.Fatal(err)
+	}
+	op, err := graph.AddOperation(OpSpec{
+		Type: "Const",
+		Name: "Const",
+		Attrs: map[string]interface{}{
+			"dtype": v.DataType(),
+			"value": v,
+		},
+		Device: "/device:GPU:0",
+	})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if got, want := op.Device(), "/device:GPU:0"; got != want {
+		t.Errorf("Got %q, want %q", got, want)
+	}
+}
+
 func forceGC() {
 	var mem runtime.MemStats
 	runtime.ReadMemStats(&mem)
diff --git a/tensorflow/java/BUILD b/tensorflow/java/BUILD
index 73e210fae0..7ceba3903d 100644
--- a/tensorflow/java/BUILD
+++ b/tensorflow/java/BUILD
@@ -292,6 +292,19 @@ tf_java_test(
     ],
 )
 
+tf_java_test(
+    name = "GradientsTest",
+    size = "small",
+    srcs = ["src/test/java/org/tensorflow/op/core/GradientsTest.java"],
+    javacopts = JAVACOPTS,
+    test_class = "org.tensorflow.op.core.GradientsTest",
+    deps = [
+        ":tensorflow",
+        ":testutil",
+        "@junit",
+    ],
+)
+
 filegroup(
     name = "processor_test_resources",
     srcs = glob([
diff --git a/tensorflow/java/maven/README.md b/tensorflow/java/maven/README.md
index 3e030dcd09..cbc64a284f 100644
--- a/tensorflow/java/maven/README.md
+++ b/tensorflow/java/maven/README.md
@@ -151,16 +151,6 @@ conducted in a [Docker](https://www.docker.com) container.
 7.  Upon successful release, commit changes to all the `pom.xml` files
     (which should have the updated version number).
 
-### Snapshots
-
-If the `TF_VERSION` provided to the `release.sh` script ends in `-SNAPSHOT`,
-then instead of using official release files, the nightly build artifacts from
-https://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/,
-https://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow-windows/ and
-https://ci.tensorflow.org/view/Nightly/job/nightly-android
-will be used to upload to the Maven Central snapshots repository. (Note that
-snapshots are only uploaded to Maven Central, not Bintray.)
-
 ### Skip deploying to a repository
 
 Should you need, setting environment variables `DEPLOY_OSSRH=0` or
@@ -173,12 +163,12 @@ cannot skip deploying to OSSRH for a `-SNAPSHOT` version.
 This section provides some pointers around how artifacts are currently
 assembled.
 
-All native and java code is first built and tested on
-a [Tensorflow Jenkins server](https://ci.tensorflow.org/) which run various
-scripts under the [`tools/ci_build`](../../tools/ci_build/) directory. Of
-particular interest may be `tools/ci_build/builds/libtensorflow.sh` which
-bundles Java-related build sources and outputs into archives, and
-`tools/ci_build/builds/android_full.sh` which produces an Android AAR package.
+All native and java code is first built and tested by the release process
+which run various scripts under the [`tools/ci_build`](../../tools/ci_build/)
+directory. Of particular interest may be
+`tools/ci_build/builds/libtensorflow.sh` which bundles Java-related build
+sources and outputs into archives, and `tools/ci_build/builds/android_full.sh`
+which produces an Android AAR package.
 
 Maven artifacts however are not created in Jenkins. Instead, artifacts are
 created and deployed externally on-demand, when a maintainer runs the
diff --git a/tensorflow/java/maven/hadoop/pom.xml b/tensorflow/java/maven/hadoop/pom.xml
index 0642be06fa..7fa751a46a 100644
--- a/tensorflow/java/maven/hadoop/pom.xml
+++ b/tensorflow/java/maven/hadoop/pom.xml
@@ -1,12 +1,30 @@
-<project
-        xmlns="http://maven.apache.org/POM/4.0.0"
-        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
-    <!-- Placeholder pom which is replaced by TensorFlow ecosystem Hadoop pom during build -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
     <modelVersion>4.0.0</modelVersion>
-    <description>TensorFlow TFRecord InputFormat/OutputFormat for Apache Hadoop</description>
+    <groupId>org.tensorflow</groupId>
     <artifactId>hadoop</artifactId>
     <packaging>jar</packaging>
+    <version>1.10.0-rc1</version>
+    <name>tensorflow-hadoop</name>
+    <url>https://www.tensorflow.org</url>
+    <description>TensorFlow TFRecord InputFormat/OutputFormat for Apache Hadoop</description>
+
+    <properties>
+        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+        <maven.compiler.source>1.6</maven.compiler.source>
+        <maven.compiler.target>1.6</maven.compiler.target>
+        <hadoop.version>2.6.0</hadoop.version>
+        <protobuf.version>3.3.1</protobuf.version>
+        <junit.version>4.11</junit.version>
+    </properties>
+
+    <licenses>
+        <license>
+            <name>Apache License Version 2.0</name>
+            <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
+        </license>
+    </licenses>
 
     <scm>
         <url>https://github.com/tensorflow/ecosystem.git</url>
@@ -14,11 +32,161 @@
         <developerConnection>scm:git:https://github.com/tensorflow/ecosystem.git</developerConnection>
     </scm>
 
-    <url>https://github.com/tensorflow/ecosystem/</url>
-    <parent>
-        <groupId>org.tensorflow</groupId>
-        <artifactId>parentpom</artifactId>
-        <version>1.9.0-rc0</version>
-        <relativePath>../</relativePath>
-    </parent>
-</project>
-\ No newline at end of file
+    <build>
+        <pluginManagement>
+            <plugins>
+                <plugin>
+                    <groupId>org.apache.maven.plugins</groupId>
+                    <artifactId>maven-gpg-plugin</artifactId>
+                    <version>1.5</version>
+                    <executions>
+                        <execution>
+                            <id>sign-artifacts</id>
+                            <phase>verify</phase>
+                            <goals>
+                                <goal>sign</goal>
+                            </goals>
+                        </execution>
+                    </executions>
+                </plugin>
+            </plugins>
+        </pluginManagement>
+        <plugins>
+          <plugin>
+            <groupId>org.apache.maven.plugins</groupId>
+            <artifactId>maven-source-plugin</artifactId>
+            <version>2.2.1</version>
+            <executions>
+              <execution>
+                <id>attach-sources</id>
+                <goals>
+                  <goal>jar-no-fork</goal>
+                </goals>
+              </execution>
+            </executions>
+          </plugin>
+          <plugin>
+            <groupId>org.apache.maven.plugins</groupId>
+            <artifactId>maven-javadoc-plugin</artifactId>
+            <version>2.9.1</version>
+            <executions>
+              <execution>
+                <id>attach-javadocs</id>
+                <goals>
+                  <goal>jar</goal>
+                </goals>
+              </execution>
+            </executions>
+          </plugin>
+        </plugins>
+    </build>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.tensorflow</groupId>
+            <artifactId>proto</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-common</artifactId>
+            <version>${hadoop.version}</version>
+            <exclusions>
+                <exclusion>
+                    <groupId>com.google.protobuf</groupId>
+                    <artifactId>protobuf-java</artifactId>
+                </exclusion>
+            </exclusions>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-mapreduce-client-core</artifactId>
+            <version>${hadoop.version}</version>
+            <exclusions>
+                <exclusion>
+                    <groupId>com.google.protobuf</groupId>
+                    <artifactId>protobuf-java</artifactId>
+                </exclusion>
+            </exclusions>
+        </dependency>
+        <dependency>
+            <groupId>com.google.protobuf</groupId>
+            <artifactId>protobuf-java</artifactId>
+            <version>${protobuf.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>junit</groupId>
+            <artifactId>junit</artifactId>
+            <version>${junit.version}</version>
+            <scope>test</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
+            <version>${hadoop.version}</version>
+            <type>test-jar</type>
+            <optional>true</optional>
+            <scope>test</scope>
+            <exclusions>
+                <exclusion>
+                    <groupId>com.google.protobuf</groupId>
+                    <artifactId>protobuf-java</artifactId>
+                </exclusion>
+            </exclusions>
+        </dependency>
+    </dependencies>
+
+    <!-- Two profiles are used:
+         ossrh - deploys to ossrh/maven central
+         bintray - deploys to bintray/jcenter. -->
+    <profiles>
+        <profile>
+            <id>ossrh</id>
+            <distributionManagement>
+                <!-- Sonatype requirements from http://central.sonatype.org/pages/apache-maven.html -->
+                <snapshotRepository>
+                    <id>ossrh</id>
+                    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
+                </snapshotRepository>
+                <repository>
+                    <id>ossrh</id>
+                    <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
+                </repository>
+            </distributionManagement>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-gpg-plugin</artifactId>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+        <profile>
+            <id>bintray</id>
+            <distributionManagement>
+                <!-- https://blog.bintray.com/2015/09/17/publishing-your-maven-project-to-bintray/ -->
+                <repository>
+                    <id>bintray</id>
+                    <url>https://api.bintray.com/maven/google/tensorflow/tensorflow/;publish=0</url>
+                </repository>
+            </distributionManagement>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-gpg-plugin</artifactId>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+    </profiles>
+
+    <developers>
+        <developer>
+            <name>TensorFlowers</name>
+            <organization>TensorFlow</organization>
+            <organizationUrl>http://www.tensorflow.org</organizationUrl>
+        </developer>
+    </developers>
+</project>
diff --git a/tensorflow/java/maven/libtensorflow/pom.xml b/tensorflow/java/maven/libtensorflow/pom.xml
index a7fa9ea5cc..8ecabfd399 100644
--- a/tensorflow/java/maven/libtensorflow/pom.xml
+++ b/tensorflow/java/maven/libtensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.9.0-rc1</version>
+    <version>1.10.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni/pom.xml b/tensorflow/java/maven/libtensorflow_jni/pom.xml
index 83aae29f1e..e03ce32216 100644
--- a/tensorflow/java/maven/libtensorflow_jni/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.9.0-rc1</version>
+    <version>1.10.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
index 50bd8ee5f9..fee840f547 100644
--- a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.9.0-rc1</version>
+    <version>1.10.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni_gpu</artifactId>
diff --git a/tensorflow/java/maven/pom.xml b/tensorflow/java/maven/pom.xml
index b4746794ea..0c33819b2b 100644
--- a/tensorflow/java/maven/pom.xml
+++ b/tensorflow/java/maven/pom.xml
@@ -6,7 +6,7 @@
   <modelVersion>4.0.0</modelVersion>
   <groupId>org.tensorflow</groupId>
   <artifactId>parentpom</artifactId>
-  <version>1.9.0-rc1</version>
+  <version>1.10.0-rc1</version>
   <packaging>pom</packaging>
 
   <url>https://www.tensorflow.org</url>
diff --git a/tensorflow/java/maven/proto/pom.xml b/tensorflow/java/maven/proto/pom.xml
index 618a2a124c..2af7a5cd2e 100644
--- a/tensorflow/java/maven/proto/pom.xml
+++ b/tensorflow/java/maven/proto/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.9.0-rc1</version>
+    <version>1.10.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>proto</artifactId>
diff --git a/tensorflow/java/maven/run_inside_container.sh b/tensorflow/java/maven/run_inside_container.sh
index 2e771064e4..f4794d68a9 100644
--- a/tensorflow/java/maven/run_inside_container.sh
+++ b/tensorflow/java/maven/run_inside_container.sh
@@ -26,12 +26,6 @@ TF_ECOSYSTEM_URL="https://github.com/tensorflow/ecosystem.git"
 DEPLOY_BINTRAY="${DEPLOY_BINTRAY:-true}"
 DEPLOY_OSSRH="${DEPLOY_OSSRH:-true}"
 
-IS_SNAPSHOT="false"
-if [[ "${TF_VERSION}" == *"-SNAPSHOT" ]]; then
-  IS_SNAPSHOT="true"
-  # Bintray does not allow snapshots.
-  DEPLOY_BINTRAY="false"
-fi
 PROTOC_RELEASE_URL="https://github.com/google/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip"
 if [[ "${DEPLOY_BINTRAY}" != "true" && "${DEPLOY_OSSRH}" != "true" ]]; then
   echo "Must deploy to at least one of Bintray or OSSRH" >&2
@@ -69,11 +63,7 @@ mvn_property() {
 }
 
 download_libtensorflow() {
-  if [[ "${IS_SNAPSHOT}" == "true" ]]; then
-    URL="http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/TYPE=cpu-slave/lastSuccessfulBuild/artifact/lib_package/libtensorflow-src.jar"
-  else
-    URL="${RELEASE_URL_PREFIX}/libtensorflow-src-${TF_VERSION}.jar"
-  fi
+  URL="${RELEASE_URL_PREFIX}/libtensorflow-src-${TF_VERSION}.jar"
   curl -L "${URL}" -o /tmp/src.jar
   cd "${DIR}/libtensorflow"
   jar -xvf /tmp/src.jar
@@ -101,17 +91,9 @@ download_libtensorflow_jni() {
   mkdir windows-x86_64
   mkdir darwin-x86_64
 
-  if [[ "${IS_SNAPSHOT}" == "true" ]]; then
-    # Nightly builds from http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/
-    # and http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow-windows/
-    curl -L "http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/TYPE=cpu-slave/lastSuccessfulBuild/artifact/lib_package/libtensorflow_jni-cpu-linux-x86_64.tar.gz" | tar -xvz -C linux-x86_64
-    curl -L "http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/TYPE=mac-slave/lastSuccessfulBuild/artifact/lib_package/libtensorflow_jni-cpu-darwin-x86_64.tar.gz" | tar -xvz -C darwin-x86_64
-    curl -L "http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow-windows/lastSuccessfulBuild/artifact/lib_package/libtensorflow_jni-cpu-windows-x86_64.zip" -o /tmp/windows.zip
-  else
-    curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-cpu-linux-x86_64-${TF_VERSION}.tar.gz" | tar -xvz -C linux-x86_64
-    curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-cpu-darwin-x86_64-${TF_VERSION}.tar.gz" | tar -xvz -C darwin-x86_64
-    curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-cpu-windows-x86_64-${TF_VERSION}.zip" -o /tmp/windows.zip
-  fi
+  curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-cpu-linux-x86_64-${TF_VERSION}.tar.gz" | tar -xvz -C linux-x86_64
+  curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-cpu-darwin-x86_64-${TF_VERSION}.tar.gz" | tar -xvz -C darwin-x86_64
+  curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-cpu-windows-x86_64-${TF_VERSION}.zip" -o /tmp/windows.zip
 
   unzip /tmp/windows.zip -d windows-x86_64
   rm -f /tmp/windows.zip
@@ -129,13 +111,7 @@ download_libtensorflow_jni_gpu() {
 
   mkdir linux-x86_64
 
-  if [[ "${IS_SNAPSHOT}" == "true" ]]; then
-    # Nightly builds from http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/
-    # and http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow-windows/
-    curl -L "http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/TYPE=gpu-linux/lastSuccessfulBuild/artifact/lib_package/libtensorflow_jni-gpu-linux-x86_64.tar.gz" | tar -xvz -C linux-x86_64
-  else
-    curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-gpu-linux-x86_64-${TF_VERSION}.tar.gz" | tar -xvz -C linux-x86_64
-  fi
+  curl -L "${RELEASE_URL_PREFIX}/libtensorflow_jni-gpu-linux-x86_64-${TF_VERSION}.tar.gz" | tar -xvz -C linux-x86_64
 
   # Updated timestamps seem to be required to get Maven to pick up the file.
   touch linux-x86_64/*
@@ -165,11 +141,7 @@ generate_java_protos() {
   rm -f "/tmp/protoc.zip"
 
   # Download the release archive of TensorFlow protos.
-  if [[ "${IS_SNAPSHOT}" == "true" ]]; then
-    URL="http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow/TYPE=cpu-slave/lastSuccessfulBuild/artifact/lib_package/libtensorflow_proto.zip"
-  else
-    URL="${RELEASE_URL_PREFIX}/libtensorflow_proto-${TF_VERSION}.zip"
-  fi
+  URL="${RELEASE_URL_PREFIX}/libtensorflow_proto-${TF_VERSION}.zip"
   curl -L "${URL}" -o /tmp/libtensorflow_proto.zip
   mkdir -p "${DIR}/proto/tmp/src"
   unzip -d "${DIR}/proto/tmp/src" "/tmp/libtensorflow_proto.zip"
@@ -203,7 +175,10 @@ download_tf_ecosystem() {
   cd "${ECOSYSTEM_DIR}"
   git clone "${TF_ECOSYSTEM_URL}"
   cd ecosystem
-  git checkout r${TF_VERSION}
+  # TF_VERSION is a semver string (<major>.<minor>.<patch>[-suffix])
+  # but the branch is just (r<major>.<minor>).
+  RELEASE_BRANCH=$(echo "${TF_VERSION}" | sed -e 's/\([0-9]\+\.[0-9]\+\)\.[0-9]\+.*/\1/')
+  git checkout r${RELEASE_BRANCH}
 
   # Copy the TensorFlow Hadoop source
   cp -r "${ECOSYSTEM_DIR}/ecosystem/hadoop/src" "${HADOOP_DIR}"
@@ -235,11 +210,7 @@ deploy_profile() {
   # Determine the correct pom file property to use
   # for the repository url.
   local rtype
-  if [[ "${IS_SNAPSHOT}" == "true" ]]; then
-    rtype='snapshotRepository'
-  else
-    rtype='repository'
-  fi
+  rtype='repository'
   local url=$(mvn_property "${profile}" "project.distributionManagement.${rtype}.url")
   local repositoryId=$(mvn_property "${profile}" "project.distributionManagement.${rtype}.id")
   mvn gpg:sign-and-deploy-file \
@@ -297,17 +268,13 @@ mvn verify
 deploy_artifacts
 
 set +ex
-if [[ "${IS_SNAPSHOT}" == "false" ]]; then
-  echo "Uploaded to the staging repository"
-  echo "After validating the release: "
-  if [[ "${DEPLOY_OSSRH}" == "true" ]]; then
-    echo "* Login to https://oss.sonatype.org/#stagingRepositories"
-    echo "* Find the 'org.tensorflow' staging release and click either 'Release' to release or 'Drop' to abort"
-  fi
-  if [[ "${DEPLOY_BINTRAY}" == "true" ]]; then
-    echo "* Login to https://bintray.com/google/tensorflow/tensorflow"
-    echo "* Either 'Publish' unpublished items to release, or 'Discard' to abort"
-  fi
-else
-  echo "Uploaded to the snapshot repository"
+echo "Uploaded to the staging repository"
+echo "After validating the release: "
+if [[ "${DEPLOY_OSSRH}" == "true" ]]; then
+  echo "* Login to https://oss.sonatype.org/#stagingRepositories"
+  echo "* Find the 'org.tensorflow' staging release and click either 'Release' to release or 'Drop' to abort"
+fi
+if [[ "${DEPLOY_BINTRAY}" == "true" ]]; then
+  echo "* Login to https://bintray.com/google/tensorflow/tensorflow"
+  echo "* Either 'Publish' unpublished items to release, or 'Discard' to abort"
 fi
diff --git a/tensorflow/java/maven/spark-connector/pom.xml b/tensorflow/java/maven/spark-connector/pom.xml
index 19c752d08b..27d9b54c6c 100644
--- a/tensorflow/java/maven/spark-connector/pom.xml
+++ b/tensorflow/java/maven/spark-connector/pom.xml
@@ -1,12 +1,23 @@
-<project
-        xmlns="http://maven.apache.org/POM/4.0.0"
-        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
-    <!-- Placeholder pom which is replaced by TensorFlow ecosystem Spark pom during build -->
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
-    <description>TensorFlow TFRecord connector for Apache Spark DataFrames</description>
-    <artifactId>spark-connector</artifactId>
+    <groupId>org.tensorflow</groupId>
+    <artifactId>spark-connector_2.11</artifactId>
     <packaging>jar</packaging>
+    <version>1.10.0-rc1</version>
+    <name>spark-tensorflow-connector</name>
+    <url>https://www.tensorflow.org</url>
+    <description>TensorFlow TFRecord connector for Apache Spark DataFrames</description>
+
+    <licenses>
+        <license>
+            <name>The Apache Software License, Version 2.0</name>
+            <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
+            <distribution>repo</distribution>
+        </license>
+    </licenses>
 
     <scm>
         <url>https://github.com/tensorflow/ecosystem.git</url>
@@ -14,11 +25,325 @@
         <developerConnection>scm:git:https://github.com/tensorflow/ecosystem.git</developerConnection>
     </scm>
 
-    <url>https://github.com/tensorflow/ecosystem/</url>
-    <parent>
-        <groupId>org.tensorflow</groupId>
-        <artifactId>parentpom</artifactId>
-        <version>1.9.0-rc0</version>
-        <relativePath>../</relativePath>
-    </parent>
-</project>
-\ No newline at end of file
+    <properties>
+        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+        <scala.maven.version>3.2.2</scala.maven.version>
+        <scala.binary.version>2.11</scala.binary.version>
+        <scalatest.maven.version>1.0</scalatest.maven.version>
+        <scala.test.version>2.2.6</scala.test.version>
+        <maven.compiler.version>3.0</maven.compiler.version>
+        <java.version>1.8</java.version>
+        <spark.version>2.3.0</spark.version>
+        <yarn.api.version>2.7.3</yarn.api.version>
+        <junit.version>4.11</junit.version>
+    </properties>
+
+    <build>
+        <pluginManagement>
+            <plugins>
+                <plugin>
+                    <inherited>true</inherited>
+                    <groupId>net.alchim31.maven</groupId>
+                    <artifactId>scala-maven-plugin</artifactId>
+                    <version>${scala.maven.version}</version>
+                    <executions>
+                        <execution>
+                            <id>compile</id>
+                            <goals>
+                                <goal>add-source</goal>
+                                <goal>compile</goal>
+                            </goals>
+                            <configuration>
+                                <jvmArgs>
+                                    <jvmArg>-Xms256m</jvmArg>
+                                    <jvmArg>-Xmx512m</jvmArg>
+                                </jvmArgs>
+                                <args>
+                                    <arg>-g:vars</arg>
+                                    <arg>-deprecation</arg>
+                                    <arg>-feature</arg>
+                                    <arg>-unchecked</arg>
+                                    <arg>-Xfatal-warnings</arg>
+                                    <arg>-language:implicitConversions</arg>
+                                    <arg>-language:existentials</arg>
+                                </args>
+                            </configuration>
+                        </execution>
+                        <execution>
+                            <id>test</id>
+                            <goals>
+                                <goal>add-source</goal>
+                                <goal>testCompile</goal>
+                            </goals>
+                        </execution>
+                        <execution>
+                          <id>attach-javadocs</id>
+                          <goals>
+                            <goal>doc-jar</goal>
+                          </goals>
+                        </execution>
+                    </executions>
+                    <configuration>
+                        <recompileMode>incremental</recompileMode>
+                        <useZincServer>true</useZincServer>
+                        <scalaVersion>${scala.binary.version}</scalaVersion>
+                        <checkMultipleScalaVersions>false</checkMultipleScalaVersions>
+                    </configuration>
+                </plugin>
+                <plugin>
+                    <inherited>true</inherited>
+                    <groupId>org.scalatest</groupId>
+                    <artifactId>scalatest-maven-plugin</artifactId>
+                    <version>${scalatest.maven.version}</version>
+                    <executions>
+                        <execution>
+                            <id>scalaTest</id>
+                            <phase>test</phase>
+                            <goals>
+                                <goal>test</goal>
+                            </goals>
+                        </execution>
+                    </executions>
+                </plugin>
+                <!-- Shade protobuf dependency. -->
+                <plugin>
+                    <artifactId>maven-shade-plugin</artifactId>
+                    <version>3.1.0</version>
+                    <executions>
+                        <execution>
+                            <phase>package</phase>
+                            <goals>
+                                <goal>shade</goal>
+                            </goals>
+                            <configuration>
+                                <minimizeJar>true</minimizeJar>
+                                <artifactSet>
+                                    <includes>
+                                        <include>com.google.protobuf:protobuf-java</include>
+                                        <include>org.tensorflow:hadoop</include>
+                                        <include>org.tensorflow:proto</include>
+                                    </includes>
+                                </artifactSet>
+                                <filters>
+                                    <filter>
+                                        <!-- Remove the source to keep the result smaller. -->
+                                        <artifact>com.google.protobuf:protobuf-java</artifact>
+                                        <excludes>
+                                            <exclude>**/*.java</exclude>
+                                        </excludes>
+                                    </filter>
+                                </filters>
+                                <relocations>
+                                    <relocation>
+                                        <pattern>com.google.protobuf</pattern>
+                                        <shadedPattern>
+                                            org.tensorflow.spark.shaded.com.google.protobuf
+                                        </shadedPattern>
+                                    </relocation>
+                                </relocations>
+                            </configuration>
+                        </execution>
+                    </executions>
+                </plugin>
+                <!-- GPG signed components: http://central.sonatype.org/pages/apache-maven.html#gpg-signed-components -->
+                <plugin>
+                    <groupId>org.apache.maven.plugins</groupId>
+                    <artifactId>maven-gpg-plugin</artifactId>
+                    <version>1.5</version>
+                    <executions>
+                        <execution>
+                            <id>sign-artifacts</id>
+                            <phase>verify</phase>
+                            <goals>
+                                <goal>sign</goal>
+                            </goals>
+                        </execution>
+                    </executions>
+                </plugin>
+            </plugins>
+        </pluginManagement>
+        <plugins>
+            <plugin>
+                <groupId>net.alchim31.maven</groupId>
+                <artifactId>scala-maven-plugin</artifactId>
+            </plugin>
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-shade-plugin</artifactId>
+            </plugin>
+            <plugin>
+                <groupId>org.scalatest</groupId>
+                <artifactId>scalatest-maven-plugin</artifactId>
+            </plugin>
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-compiler-plugin</artifactId>
+                <version>${maven.compiler.version}</version>
+                <configuration>
+                    <source>${java.version}</source>
+                    <target>${java.version}</target>
+                </configuration>
+            </plugin>
+            <plugin>
+              <groupId>org.apache.maven.plugins</groupId>
+              <artifactId>maven-source-plugin</artifactId>
+              <version>2.2.1</version>
+              <executions>
+                <execution>
+                  <id>attach-sources</id>
+                  <goals>
+                    <goal>jar-no-fork</goal>
+                  </goals>
+                </execution>
+              </executions>
+            </plugin>
+            <plugin>
+              <groupId>org.apache.maven.plugins</groupId>
+              <artifactId>maven-javadoc-plugin</artifactId>
+              <version>2.9.1</version>
+              <executions>
+                <execution>
+                  <id>attach-javadocs</id>
+                  <goals>
+                    <goal>jar</goal>
+                  </goals>
+                </execution>
+              </executions>
+            </plugin>
+        </plugins>
+    </build>
+
+    <profiles>
+        <profile>
+            <id>test</id>
+            <activation>
+                <activeByDefault>true</activeByDefault>
+                <property>
+                    <name>!NEVERSETME</name>
+                </property>
+            </activation>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>net.alchim31.maven</groupId>
+                        <artifactId>scala-maven-plugin</artifactId>
+                    </plugin>
+                </plugins>
+            </build>
+            <dependencyManagement>
+                <dependencies>
+                    <dependency>
+                        <groupId>org.scalatest</groupId>
+                        <artifactId>scalatest_${scala.binary.version}</artifactId>
+                        <version>${scala.test.version}</version>
+                        <scope>test</scope>
+                    </dependency>
+                </dependencies>
+            </dependencyManagement>
+            <dependencies>
+                <dependency>
+                    <groupId>org.scalatest</groupId>
+                    <artifactId>scalatest_${scala.binary.version}</artifactId>
+                    <scope>test</scope>
+                </dependency>
+            </dependencies>
+        </profile>
+
+        <!-- Two profiles are used:
+             ossrh - deploys to ossrh/maven central
+             bintray - deploys to bintray/jcenter. -->
+        <profile>
+            <id>ossrh</id>
+            <distributionManagement>
+                <!-- Sonatype requirements from http://central.sonatype.org/pages/apache-maven.html -->
+                <snapshotRepository>
+                    <id>ossrh</id>
+                    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
+                </snapshotRepository>
+                <repository>
+                    <id>ossrh</id>
+                    <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
+                </repository>
+            </distributionManagement>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-gpg-plugin</artifactId>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+        <profile>
+            <id>bintray</id>
+            <distributionManagement>
+                <!-- https://blog.bintray.com/2015/09/17/publishing-your-maven-project-to-bintray/ -->
+                <repository>
+                    <id>bintray</id>
+                    <url>https://api.bintray.com/maven/google/tensorflow/tensorflow/;publish=0</url>
+                </repository>
+            </distributionManagement>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-gpg-plugin</artifactId>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
+    </profiles>
+
+    <developers>
+        <developer>
+            <name>TensorFlowers</name>
+            <organization>TensorFlow</organization>
+            <organizationUrl>http://www.tensorflow.org</organizationUrl>
+        </developer>
+    </developers>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.tensorflow</groupId>
+            <artifactId>hadoop</artifactId>
+            <version>${project.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.spark</groupId>
+            <artifactId>spark-core_${scala.binary.version}</artifactId>
+            <version>${spark.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.spark</groupId>
+            <artifactId>spark-sql_${scala.binary.version}</artifactId>
+            <version>${spark.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.spark</groupId>
+            <artifactId>spark-mllib_${scala.binary.version}</artifactId>
+            <version>${spark.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-yarn-api</artifactId>
+            <version>${yarn.api.version}</version>
+            <scope>provided</scope>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.spark</groupId>
+            <artifactId>spark-mllib_${scala.binary.version}</artifactId>
+            <version>${spark.version}</version>
+            <type>test-jar</type>
+            <scope>test</scope>
+        </dependency>
+        <dependency>
+            <groupId>junit</groupId>
+            <artifactId>junit</artifactId>
+            <version>${junit.version}</version>
+            <scope>test</scope>
+        </dependency>
+    </dependencies>
+</project>
diff --git a/tensorflow/java/maven/tensorflow-android/update.py b/tensorflow/java/maven/tensorflow-android/update.py
index 2206d800ca..c620564072 100644
--- a/tensorflow/java/maven/tensorflow-android/update.py
+++ b/tensorflow/java/maven/tensorflow-android/update.py
@@ -86,19 +86,10 @@ def read_template(path):
 def main():
   args = get_args()
 
-  # Artifacts are downloaded from the ci build. A SNAPSHOT release is
-  # associated with artifacts from the last successful nightly build. Otherwise,
-  # it comes from the officially blessed release artifacts.
-  if args.version.endswith('SNAPSHOT'):
-    info_url = ('https://ci.tensorflow.org/view/Nightly/job/nightly-android'
-                '/lastSuccessfulBuild/api/json')
-    aar_url = None
-    build_type = 'nightly-android'
-  else:
-    release_prefix = 'https://storage.googleapis.com/tensorflow/libtensorflow'
-    info_url = '%s/android_buildinfo-%s.json' % (release_prefix, args.version)
-    aar_url = '%s/tensorflow-%s.aar' % (release_prefix, args.version)
-    build_type = 'release-android'
+  release_prefix = 'https://storage.googleapis.com/tensorflow/libtensorflow'
+  info_url = '%s/android_buildinfo-%s.json' % (release_prefix, args.version)
+  aar_url = '%s/tensorflow-%s.aar' % (release_prefix, args.version)
+  build_type = 'release-android'
 
   # Retrieve build information
   build_info = get_json(info_url)
diff --git a/tensorflow/java/maven/tensorflow/pom.xml b/tensorflow/java/maven/tensorflow/pom.xml
index 157c4b8e82..c952545bc6 100644
--- a/tensorflow/java/maven/tensorflow/pom.xml
+++ b/tensorflow/java/maven/tensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.9.0-rc1</version>
+    <version>1.10.0-rc1</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>tensorflow</artifactId>
diff --git a/tensorflow/java/src/gen/cc/java_defs.h b/tensorflow/java/src/gen/cc/java_defs.h
index f5f54bf4d3..d9d6f8adc8 100644
--- a/tensorflow/java/src/gen/cc/java_defs.h
+++ b/tensorflow/java/src/gen/cc/java_defs.h
@@ -16,9 +16,9 @@ limitations under the License.
 #ifndef TENSORFLOW_JAVA_SRC_GEN_CC_JAVA_DEFS_H_
 #define TENSORFLOW_JAVA_SRC_GEN_CC_JAVA_DEFS_H_
 
-#include <string>
 #include <list>
 #include <map>
+#include <string>
 #include <utility>
 
 namespace tensorflow {
diff --git a/tensorflow/java/src/gen/cc/op_generator.h b/tensorflow/java/src/gen/cc/op_generator.h
index 759d800ecf..05decd6b54 100644
--- a/tensorflow/java/src/gen/cc/op_generator.h
+++ b/tensorflow/java/src/gen/cc/op_generator.h
@@ -19,10 +19,10 @@ limitations under the License.
 #include <string>
 #include <vector>
 
-#include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/framework/api_def.pb.h"
 #include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/env.h"
 #include "tensorflow/java/src/gen/cc/op_specs.h"
 
 namespace tensorflow {
diff --git a/tensorflow/java/src/gen/cc/op_specs.cc b/tensorflow/java/src/gen/cc/op_specs.cc
index 63e99fbb04..941ab2699c 100644
--- a/tensorflow/java/src/gen/cc/op_specs.cc
+++ b/tensorflow/java/src/gen/cc/op_specs.cc
@@ -14,9 +14,9 @@ limitations under the License.
 ==============================================================================*/
 
 #include <map>
-#include <vector>
 #include <string>
 #include <utility>
+#include <vector>
 
 #include "re2/re2.h"
 #include "tensorflow/core/framework/op.h"
@@ -50,7 +50,7 @@ class TypeResolver {
   // For example, if the argument's datatype is DT_STRING, this method will
   // return "java.lang.String", so the argument can become "Operand<String>"
   // in the Ops API
-  Type TypeOf(const OpDef_ArgDef& arg_def, bool *iterable_out);
+  Type TypeOf(const OpDef_ArgDef& arg_def, bool* iterable_out);
 
   // Returns types of an input attribute
   //
@@ -62,7 +62,7 @@ class TypeResolver {
   // <java.lang.Float, float>, so the attribute can be used as a "Float" object
   // in the Ops API and casted to a "float" when passing through the JNI layer.
   std::pair<Type, Type> TypesOf(const OpDef_AttrDef& attr_def,
-      bool *iterable_out);
+                                bool* iterable_out);
 
   // Returns true if the type of this attribute has already been resolved
   bool IsAttributeVisited(const string& attr_name) {
@@ -89,8 +89,7 @@ class TypeResolver {
   }
 };
 
-Type TypeResolver::TypeOf(const OpDef_ArgDef& arg_def,
-    bool* iterable_out) {
+Type TypeResolver::TypeOf(const OpDef_ArgDef& arg_def, bool* iterable_out) {
   *iterable_out = false;
   if (!arg_def.number_attr().empty()) {
     // when number_attr is set, argument has to be a list of tensors
@@ -154,13 +153,13 @@ Type TypeResolver::TypeOf(const OpDef_ArgDef& arg_def,
 
   } else {
     LOG(FATAL) << "Cannot resolve data type of argument \"" << arg_def.name()
-        << "\" in operation \"" << op_def_.name() << "\"";
+               << "\" in operation \"" << op_def_.name() << "\"";
   }
   return type;
 }
 
 std::pair<Type, Type> TypeResolver::TypesOf(const OpDef_AttrDef& attr_def,
-    bool* iterable_out) {
+                                            bool* iterable_out) {
   std::pair<Type, Type> types = MakeTypePair(Type::Wildcard());
   *iterable_out = false;
   StringPiece attr_type = attr_def.type();
@@ -185,7 +184,7 @@ std::pair<Type, Type> TypeResolver::TypesOf(const OpDef_AttrDef& attr_def,
 
   } else if (attr_type == "tensor") {
     types = MakeTypePair(Type::Class("Tensor", "org.tensorflow")
-        .add_parameter(Type::Wildcard()));
+                             .add_parameter(Type::Wildcard()));
 
   } else if (attr_type == "type") {
     Type type = *iterable_out ? Type::Wildcard() : NextGeneric();
@@ -196,7 +195,7 @@ std::pair<Type, Type> TypeResolver::TypesOf(const OpDef_AttrDef& attr_def,
 
   } else {
     LOG(FATAL) << "Cannot resolve data type for attribute \"" << attr_type
-        << "\" in operation \"" << op_def_.name() << "\"";
+               << "\" in operation \"" << op_def_.name() << "\"";
   }
   visited_attrs_.insert(std::make_pair(attr_def.name(), types.first));
   return types;
@@ -219,47 +218,43 @@ string SnakeToCamelCase(const string& str, bool upper = false) {
   return result;
 }
 
-bool FindAndCut(re2::StringPiece* input, const RE2& expr,
-    re2::StringPiece* before_match, re2::StringPiece* ret_match = nullptr) {
-  re2::StringPiece match;
-  if (!expr.Match(*input, 0, input->size(), RE2::UNANCHORED, &match, 1)) {
-    return false;
-  }
-  before_match->set(input->data(), match.begin() - input->begin());
-  input->remove_prefix(match.end() - before_match->begin());
-  if (ret_match != nullptr) {
-    *ret_match = match;
-  }
+bool FindAndCut(string* input, const RE2& expr, string* before_match,
+                string* ret_match = nullptr) {
+  string match;
+  if (!RE2::PartialMatch(*input, expr, &match)) return false;
+  *before_match = input->substr(0, input->find(match));
+  *input = input->substr(before_match->size() + match.size());
+  if (ret_match != nullptr) *ret_match = match;
   return true;
 }
 
-string ParseDocumentation(re2::StringPiece input) {
+string ParseDocumentation(const string& inp) {
   std::stringstream javadoc_text;
 
   // TODO(karllessard) This is a very minimalist utility method for converting
   // markdown syntax, as found in ops descriptions, to Javadoc/html tags. Check
   // for alternatives to increase the level of support for markups.
   std::vector<string> markups_subexpr;
-  markups_subexpr.push_back("\n+\\*\\s+");  // lists
-  markups_subexpr.push_back("\n{2,}");  // paragraphs
+  markups_subexpr.push_back("\n+\\*\\s+");                // lists
+  markups_subexpr.push_back("\n{2,}");                    // paragraphs
   markups_subexpr.push_back("`{3,}\\s*[^\\s\n]*\\s*\n");  // code blocks
-  markups_subexpr.push_back("`+");  // inlined code and code blocks
+  markups_subexpr.push_back("`+");           // inlined code and code blocks
   markups_subexpr.push_back("\\*{1,2}\\b");  // text emphasis
-  markups_subexpr.push_back("\\[");  // hyperlinks
-  const RE2 markup_expr(str_util::Join(markups_subexpr, "|"));
+  markups_subexpr.push_back("\\[");          // hyperlinks
+  const RE2 markup_expr("(" + str_util::Join(markups_subexpr, "|") + ")");
 
   bool in_list = false;
+  string input = inp;
   while (true) {
-    re2::StringPiece text;
-    re2::StringPiece markup;
+    string text, markup;
     if (!FindAndCut(&input, markup_expr, &text, &markup)) {
       javadoc_text << input;
       break;  // end of loop
     }
     javadoc_text << text;
-    if (markup.starts_with("\n")) {
+    if (str_util::StartsWith(markup, "\n")) {
       javadoc_text << "\n";
-      if (markup.contains("*")) {
+      if (str_util::StrContains(markup, "*")) {
         // new list item
         javadoc_text << (in_list ? "</li>\n" : "<ul>\n") << "<li>\n";
         in_list = true;
@@ -267,18 +262,18 @@ string ParseDocumentation(re2::StringPiece input) {
         // end of list
         javadoc_text << "</li>\n</ul>\n";
         in_list = false;
-      } else if (!input.starts_with("```")) {
+      } else if (!str_util::StartsWith(input, "```")) {
         // new paragraph (not required if a <pre> block follows)
         javadoc_text << "<p>\n";
       }
-    } else if (markup.starts_with("```")) {
+    } else if (str_util::StartsWith(markup, "```")) {
       // code blocks
-      if (FindAndCut(&input, "```\\s*\n*", &text)) {
+      if (FindAndCut(&input, "(```\\s*\n*)", &text)) {
         javadoc_text << "<pre>{@code\n" << text << "}</pre>\n";
       } else {
         javadoc_text << markup;
       }
-    } else if (markup.starts_with("`")) {
+    } else if (str_util::StartsWith("(" + markup + ")", "`")) {
       // inlined code
       if (FindAndCut(&input, markup, &text)) {
         javadoc_text << "{@code " << text << "}";
@@ -287,26 +282,28 @@ string ParseDocumentation(re2::StringPiece input) {
       }
     } else if (markup == "**") {
       // text emphasis (strong)
-      if (FindAndCut(&input, "\\b\\*{2}", &text)) {
+      if (FindAndCut(&input, "(\\b\\*{2})", &text)) {
         javadoc_text << "<b>" << ParseDocumentation(text) << "</b>";
       } else {
         javadoc_text << markup;
       }
     } else if (markup == "*") {
       // text emphasis (normal)
-      if (FindAndCut(&input, "\\b\\*{1}", &text)) {
+      if (FindAndCut(&input, "(\\b\\*{1})", &text)) {
         javadoc_text << "<i>" << ParseDocumentation(text) << "</i>";
       } else {
         javadoc_text << markup;
       }
-    } else if (markup.starts_with("[")) {
+    } else if (str_util::StartsWith(markup, "[")) {
       // hyperlinks
       string label;
       string link;
-      if (RE2::Consume(&input, "([^\\[]+)\\]\\((http.+)\\)", &label, &link)) {
+      if (RE2::PartialMatch(input, "([^\\[]+)\\]\\((http.+)\\)", &label,
+                            &link) &&
+          str_util::StartsWith(input, label + link)) {
+        input = input.substr(label.size() + link.size());
         javadoc_text << "<a href=\"" << link << "\">"
-            << ParseDocumentation(label)
-            << "</a>";
+                     << ParseDocumentation(label) << "</a>";
       } else {
         javadoc_text << markup;
       }
@@ -319,57 +316,56 @@ string ParseDocumentation(re2::StringPiece input) {
 }
 
 ArgumentSpec CreateInput(const OpDef_ArgDef& input_def,
-    const ApiDef::Arg& input_api_def, TypeResolver* type_resolver) {
+                         const ApiDef::Arg& input_api_def,
+                         TypeResolver* type_resolver) {
   bool iterable = false;
   Type type = type_resolver->TypeOf(input_def, &iterable);
-  Type var_type = Type::Interface("Operand", "org.tensorflow")
-    .add_parameter(type);
+  Type var_type =
+      Type::Interface("Operand", "org.tensorflow").add_parameter(type);
   if (iterable) {
     var_type = Type::IterableOf(var_type);
   }
-  return ArgumentSpec(input_api_def.name(),
+  return ArgumentSpec(
+      input_api_def.name(),
       Variable::Create(SnakeToCamelCase(input_api_def.rename_to()), var_type),
-      type,
-      ParseDocumentation(input_api_def.description()),
-      iterable);
+      type, ParseDocumentation(input_api_def.description()), iterable);
 }
 
 AttributeSpec CreateAttribute(const OpDef_AttrDef& attr_def,
-    const ApiDef::Attr& attr_api_def, TypeResolver* type_resolver) {
+                              const ApiDef::Attr& attr_api_def,
+                              TypeResolver* type_resolver) {
   bool iterable = false;
   std::pair<Type, Type> types = type_resolver->TypesOf(attr_def, &iterable);
-  Type var_type = types.first.kind() == Type::GENERIC ?
-      Type::Class("Class").add_parameter(types.first) : types.first;
+  Type var_type = types.first.kind() == Type::GENERIC
+                      ? Type::Class("Class").add_parameter(types.first)
+                      : types.first;
   if (iterable) {
     var_type = Type::ListOf(var_type);
   }
-  return AttributeSpec(attr_api_def.name(),
+  return AttributeSpec(
+      attr_api_def.name(),
       Variable::Create(SnakeToCamelCase(attr_api_def.rename_to()), var_type),
-      types.first,
-      types.second,
-      ParseDocumentation(attr_api_def.description()),
-      iterable,
-      attr_api_def.has_default_value());
+      types.first, types.second, ParseDocumentation(attr_api_def.description()),
+      iterable, attr_api_def.has_default_value());
 }
 
 ArgumentSpec CreateOutput(const OpDef_ArgDef& output_def,
-    const ApiDef::Arg& output_api, TypeResolver* type_resolver) {
+                          const ApiDef::Arg& output_api,
+                          TypeResolver* type_resolver) {
   bool iterable = false;
   Type type = type_resolver->TypeOf(output_def, &iterable);
-  Type var_type = Type::Class("Output", "org.tensorflow")
-    .add_parameter(type);
+  Type var_type = Type::Class("Output", "org.tensorflow").add_parameter(type);
   if (iterable) {
     var_type = Type::ListOf(var_type);
   }
-  return ArgumentSpec(output_api.name(),
+  return ArgumentSpec(
+      output_api.name(),
       Variable::Create(SnakeToCamelCase(output_api.rename_to()), var_type),
-      type,
-      ParseDocumentation(output_api.description()),
-      iterable);
+      type, ParseDocumentation(output_api.description()), iterable);
 }
 
 EndpointSpec CreateEndpoint(const OpDef& op_def, const ApiDef& api_def,
-    const ApiDef_Endpoint& endpoint_def) {
+                            const ApiDef_Endpoint& endpoint_def) {
   std::vector<string> name_tokens = str_util::Split(endpoint_def.name(), ".");
   string package;
   string name;
@@ -377,27 +373,25 @@ EndpointSpec CreateEndpoint(const OpDef& op_def, const ApiDef& api_def,
     package = name_tokens.at(0);
     name = name_tokens.at(1);
   } else {
-    package = kDefaultEndpointPackage;
+    package = "core";  // generate unclassified ops in the 'core' package
     name = name_tokens.at(0);
   }
-  return EndpointSpec(package,
-      name,
-      Javadoc::Create(ParseDocumentation(api_def.summary()))
-          .details(ParseDocumentation(api_def.description())));
+  return EndpointSpec(package, name,
+                      Javadoc::Create(ParseDocumentation(api_def.summary()))
+                          .details(ParseDocumentation(api_def.description())));
 }
 
 }  // namespace
 
 OpSpec OpSpec::Create(const OpDef& op_def, const ApiDef& api_def) {
-  OpSpec op(api_def.graph_op_name(),
-      api_def.visibility() == ApiDef::HIDDEN,
-      op_def.deprecation().explanation());
+  OpSpec op(api_def.graph_op_name(), api_def.visibility() == ApiDef::HIDDEN,
+            op_def.deprecation().explanation());
   TypeResolver type_resolver(op_def);
   for (const string& next_input_name : api_def.arg_order()) {
     for (int i = 0; i < op_def.input_arg().size(); ++i) {
       if (op_def.input_arg(i).name() == next_input_name) {
         op.inputs_.push_back(CreateInput(op_def.input_arg(i), api_def.in_arg(i),
-            &type_resolver));
+                                         &type_resolver));
         break;
       }
     }
@@ -406,8 +400,8 @@ OpSpec OpSpec::Create(const OpDef& op_def, const ApiDef& api_def) {
     // do not parse attributes already visited, they have probably been inferred
     // before as an input argument type
     if (!type_resolver.IsAttributeVisited(op_def.attr(i).name())) {
-      AttributeSpec attr = CreateAttribute(op_def.attr(i), api_def.attr(i),
-          &type_resolver);
+      AttributeSpec attr =
+          CreateAttribute(op_def.attr(i), api_def.attr(i), &type_resolver);
       // attributes with a default value are optional
       if (attr.has_default_value() && attr.type().kind() != Type::GENERIC) {
         op.optional_attributes_.push_back(attr);
@@ -417,8 +411,8 @@ OpSpec OpSpec::Create(const OpDef& op_def, const ApiDef& api_def) {
     }
   }
   for (int i = 0; i < op_def.output_arg().size(); ++i) {
-    op.outputs_.push_back(CreateOutput(op_def.output_arg(i), api_def.out_arg(i),
-        &type_resolver));
+    op.outputs_.push_back(
+        CreateOutput(op_def.output_arg(i), api_def.out_arg(i), &type_resolver));
   }
   for (const auto& endpoint_def : api_def.endpoint()) {
     op.endpoints_.push_back(CreateEndpoint(op_def, api_def, endpoint_def));
diff --git a/tensorflow/java/src/gen/cc/op_specs.h b/tensorflow/java/src/gen/cc/op_specs.h
index 3b53c730df..30ecb8ce53 100644
--- a/tensorflow/java/src/gen/cc/op_specs.h
+++ b/tensorflow/java/src/gen/cc/op_specs.h
@@ -19,9 +19,9 @@ limitations under the License.
 #include <string>
 #include <vector>
 
-#include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/framework/api_def.pb.h"
 #include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/java/src/gen/cc/java_defs.h"
 
 namespace tensorflow {
@@ -38,9 +38,8 @@ class EndpointSpec {
   // javadoc: the endpoint class documentation
   // TODO(annarev): hardcode depcreated to false until deprecated is possible
   EndpointSpec(const string& package, const string& name,
-      const Javadoc& javadoc)
-    : package_(package), name_(name), javadoc_(javadoc),
-      deprecated_(false) {}
+               const Javadoc& javadoc)
+      : package_(package), name_(name), javadoc_(javadoc), deprecated_(false) {}
 
   const string& package() const { return package_; }
   const string& name() const { return name_; }
@@ -63,10 +62,13 @@ class ArgumentSpec {
   // type: the tensor type of this argument
   // description: a description of this argument, in javadoc
   // iterable: true if this argument is a list
-  ArgumentSpec(const string& op_def_name, const Variable& var,
-      const Type& type, const string& description, bool iterable)
-    : op_def_name_(op_def_name), var_(var), type_(type),
-      description_(description), iterable_(iterable) {}
+  ArgumentSpec(const string& op_def_name, const Variable& var, const Type& type,
+               const string& description, bool iterable)
+      : op_def_name_(op_def_name),
+        var_(var),
+        type_(type),
+        description_(description),
+        iterable_(iterable) {}
 
   const string& op_def_name() const { return op_def_name_; }
   const Variable& var() const { return var_; }
@@ -94,11 +96,16 @@ class AttributeSpec {
   // iterable: true if this attribute is a list
   // has_default_value: true if this attribute has a default value if not set
   AttributeSpec(const string& op_def_name, const Variable& var,
-      const Type& type, const Type& jni_type, const string& description,
-      bool iterable, bool has_default_value)
-    : op_def_name_(op_def_name), var_(var), type_(type),
-      description_(description), iterable_(iterable),
-      jni_type_(jni_type), has_default_value_(has_default_value) {}
+                const Type& type, const Type& jni_type,
+                const string& description, bool iterable,
+                bool has_default_value)
+      : op_def_name_(op_def_name),
+        var_(var),
+        type_(type),
+        description_(description),
+        iterable_(iterable),
+        jni_type_(jni_type),
+        has_default_value_(has_default_value) {}
 
   const string& op_def_name() const { return op_def_name_; }
   const Variable& var() const { return var_; }
@@ -147,9 +154,10 @@ class OpSpec {
   // hidden: true if this op should not be visible through the Graph Ops API
   // deprecation_explanation: message to show if all endpoints are deprecated
   explicit OpSpec(const string& graph_op_name, bool hidden,
-      const string& deprecation_explanation)
-    : graph_op_name_(graph_op_name), hidden_(hidden),
-      deprecation_explanation_(deprecation_explanation) {}
+                  const string& deprecation_explanation)
+      : graph_op_name_(graph_op_name),
+        hidden_(hidden),
+        deprecation_explanation_(deprecation_explanation) {}
 
   const string graph_op_name_;
   const bool hidden_;
diff --git a/tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java b/tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java
index 796d6a62dc..1b7bcdab35 100644
--- a/tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java
+++ b/tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java
@@ -290,7 +290,7 @@ public final class OperatorProcessor extends AbstractProcessor {
         javadoc.append(tag).append('\n');
       }
     }
-    javadoc.append("@see {@link ").append(opClassName).append("}\n");
+    javadoc.append("@see ").append(opClassName).append("\n");
 
     return javadoc.toString();
   }
diff --git a/tensorflow/java/src/main/java/org/tensorflow/Graph.java b/tensorflow/java/src/main/java/org/tensorflow/Graph.java
index 7d19696749..752b49af04 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/Graph.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/Graph.java
@@ -144,21 +144,29 @@ public final class Graph implements AutoCloseable {
   }
 
   /**
-   * Adds operations to compute the partial derivatives of sum of {@code y}s w.r.t {@code x}s,
-   * i.e., {@code d(y_1 + y_2 + ...)/dx_1, d(y_1 + y_2 + ...)/dx_2...}
-   * <p> 
-   * {@code dx} are used as initial gradients (which represent the symbolic partial derivatives of some loss function 
-   * {@code L} w.r.t. {@code y}). {@code dx} must be null or have size of {@code y}.
-   * <p>
-   * If {@code dx} is null, the implementation will use dx of {@link org.tensorflow.op.core.OnesLike OnesLike} for all
-   * shapes in {@code y}.
-   * 
+   * Adds operations to compute the partial derivatives of sum of {@code y}s w.r.t {@code x}s, i.e.,
+   * {@code d(y_1 + y_2 + ...)/dx_1, d(y_1 + y_2 + ...)/dx_2...}
+   *
+   * <p>{@code dx} are used as initial gradients (which represent the symbolic partial derivatives
+   * of some loss function {@code L} w.r.t. {@code y}). {@code dx} must be null or have size of
+   * {@code y}.
+   *
+   * <p>If {@code dx} is null, the implementation will use dx of {@link
+   * org.tensorflow.op.core.OnesLike OnesLike} for all shapes in {@code y}.
+   *
+   * <p>{@code prefix} is used as the name prefix applied to all nodes added to the graph to compute
+   * gradients. It must be unique within the provided graph or the operation will fail.
+   *
+   * <p>If {@code prefix} is null, then one will be chosen automatically.
+   *
+   * @param prefix unique string prefix applied before the names of nodes added to the graph to
+   *     compute gradients. If null, a default one will be chosen.
    * @param y output of the function to derive
    * @param x inputs of the function for which partial derivatives are computed
    * @param dx if not null, the partial derivatives of some loss function {@code L} w.r.t. {@code y}
    * @return the partial derivatives {@code dy} with the size of {@code x}
    */
-  public Output<?>[] addGradients(Output<?>[] y, Output<?>[] x, Output<?>[] dx) {
+  public Output<?>[] addGradients(String prefix, Output<?>[] y, Output<?>[] x, Output<?>[] dx) {
     Output<?>[] dy = new Output<?>[x.length];
     final long[] yHandles = new long[y.length];
     final int[] yIndices = new int[y.length];
@@ -185,12 +193,21 @@ public final class Graph implements AutoCloseable {
           dxIndices[i] = dx[i].index();
         }
       }
-      // Gradient outputs are returned in two continuous arrays concatenated into one. The first holds the native handles 
-      // of the gradient operations while the second holds the index of their output
-      // e.g. given xHandles = [x0Handle, x1Handle, ...] and xIndices = [x0Index, x1Index, ..], we obtain 
+      // Gradient outputs are returned in two continuous arrays concatenated into one. The first
+      // holds the native handles of the gradient operations while the second holds the index of
+      // their output e.g. given
+      // xHandles = [x0Handle, x1Handle, ...] and xIndices = [x0Index, x1Index, ..], we obtain
       // dy = [dy0Handle, dy1Handle, ..., dy0Index, dy1Index, ...]
       long[] dyHandlesAndIndices =
-            addGradients(ref.nativeHandle(), yHandles, yIndices, xHandles, xIndices, dxHandles, dxIndices);
+          addGradients(
+              ref.nativeHandle(),
+              prefix,
+              yHandles,
+              yIndices,
+              xHandles,
+              xIndices,
+              dxHandles,
+              dxIndices);
       int ndy = dyHandlesAndIndices.length >> 1;
       if (ndy != dy.length) {
         throw new IllegalStateException(String.valueOf(ndy) + " gradients were added to the graph when " + dy.length
@@ -207,16 +224,16 @@ public final class Graph implements AutoCloseable {
   /**
    * Adds operations to compute the partial derivatives of sum of {@code y}s w.r.t {@code x}s,
    * i.e., {@code dy/dx_1, dy/dx_2...}
-   * <p> 
+   * <p>
    * This is a simplified version of {@link #addGradients(Output[], Output[], Output[]) where {@code y} is
-   * a single output and {@code dx} is null.
-   * 
+   * a single output, {@code dx} is null and {@code prefix} is null.
+   *
    * @param y output of the function to derive
    * @param x inputs of the function for which partial derivatives are computed
    * @return the partial derivatives {@code dy} with the size of {@code x}
    */
   public Output<?>[] addGradients(Output<?> y, Output<?>[] x) {
-    return addGradients(new Output<?>[]{y}, x, null);
+    return addGradients(null, new Output<?>[] {y}, x, null);
   }
   
   private final Object nativeHandleLock = new Object();
@@ -330,8 +347,15 @@ public final class Graph implements AutoCloseable {
 
   private static native byte[] toGraphDef(long handle);
 
-  private static native long[] addGradients(long handle, long[] inputHandles, int[] inputIndices,
-      long[] outputHandles, int[] outputIndices, long[] gradInputHandles, int[] gradInputIndices);
+  private static native long[] addGradients(
+      long handle,
+      String prefix,
+      long[] inputHandles,
+      int[] inputIndices,
+      long[] outputHandles,
+      int[] outputIndices,
+      long[] gradInputHandles,
+      int[] gradInputIndices);
 
   static {
     TensorFlow.init();
diff --git a/tensorflow/java/src/main/java/org/tensorflow/Input.java b/tensorflow/java/src/main/java/org/tensorflow/Input.java
new file mode 100644
index 0000000000..13bc463e7d
--- /dev/null
+++ b/tensorflow/java/src/main/java/org/tensorflow/Input.java
@@ -0,0 +1,48 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+package org.tensorflow;
+
+/**
+ * Interface implemented by operands of a TensorFlow operation.
+ *
+ * <p>Example usage:
+ *
+ * <pre>{@code
+ * // The "decodeJpeg" operation can be used as input to the "cast" operation
+ * Input decodeJpeg = ops.image().decodeJpeg(...);
+ * ops.math().cast(decodeJpeg, DataType.FLOAT);
+ *
+ * // The output "y" of the "unique" operation can be used as input to the "cast" operation
+ * Output y = ops.array().unique(...).y();
+ * ops.math().cast(y, DataType.FLOAT);
+ *
+ * // The "split" operation can be used as input list to the "concat" operation
+ * Iterable<? extends Input> split = ops.array().split(...);
+ * ops.array().concat(0, split);
+ * }</pre>
+ */
+public interface Input<T> {
+
+  /**
+   * Returns the symbolic handle of a tensor.
+   *
+   * <p>Inputs to TensorFlow operations are outputs of another TensorFlow operation. This method is
+   * used to obtain a symbolic handle that represents the computation of the input.
+   *
+   * @see OperationBuilder#addInput(Output)
+   */
+  Output<T> asOutput();
+}
diff --git a/tensorflow/java/src/main/java/org/tensorflow/SavedModelBundle.java b/tensorflow/java/src/main/java/org/tensorflow/SavedModelBundle.java
index c8b9126f03..49594e6b47 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/SavedModelBundle.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/SavedModelBundle.java
@@ -25,18 +25,86 @@ package org.tensorflow;
  * protocol buffer</a>).
  */
 public class SavedModelBundle implements AutoCloseable {
+  /** Options for loading a SavedModel. */
+  public static final class Loader {
+    /** Load a <code>SavedModelBundle</code> with the configured options. */
+    public SavedModelBundle load() {
+      return SavedModelBundle.load(exportDir, tags, configProto, runOptions);
+    }
+
+    /**
+     * Sets options to use when executing model initialization operations.
+     *
+     * @param options Serialized <a
+     *     href="https://www.tensorflow.org/code/tensorflow/core/protobuf/config.proto">RunOptions
+     *     protocol buffer</a>.
+     */
+    public Loader withRunOptions(byte[] options) {
+      this.runOptions = options;
+      return this;
+    }
+
+    /**
+     * Set configuration of the <code>Session</code> object created when loading the model.
+     *
+     * @param configProto Serialized <a
+     *     href="https://www.tensorflow.org/code/tensorflow/core/protobuf/config.proto">ConfigProto
+     *     protocol buffer</a>.
+     */
+    public Loader withConfigProto(byte[] configProto) {
+      this.configProto = configProto;
+      return this;
+    }
+
+    /**
+     * Sets the set of tags that identify the specific graph in the saved model to load.
+     *
+     * @param tags the tags identifying the specific MetaGraphDef to load.
+     */
+    public Loader withTags(String... tags) {
+      this.tags = tags;
+      return this;
+    }
+
+    private Loader(String exportDir) {
+      this.exportDir = exportDir;
+    }
+
+    private String exportDir = null;
+    private String[] tags = null;
+    private byte[] configProto = null;
+    private byte[] runOptions = null;
+  }
 
   /**
    * Load a saved model from an export directory. The model that is being loaded should be created
    * using the <a href="https://www.tensorflow.org/api_docs/python/tf/saved_model">Saved Model
    * API</a>.
    *
+   * <p>This method is a shorthand for:
+   *
+   * <pre>{@code
+   * SavedModelBundler.loader().withTags(tags).load();
+   * }</pre>
+   *
    * @param exportDir the directory path containing a saved model.
    * @param tags the tags identifying the specific metagraphdef to load.
    * @return a bundle containing the graph and associated session.
    */
   public static SavedModelBundle load(String exportDir, String... tags) {
-    return load(exportDir, tags, null);
+    return loader(exportDir).withTags(tags).load();
+  }
+
+  /**
+   * Load a saved model.
+   *
+   * <p/>Returns a <code>Loader</code> object that can set configuration options before actually
+   * loading the model,
+   *
+   * @param exportDir the directory path containing a saved model.
+   */
+  public static Loader loader(String exportDir) {
+    return new Loader(exportDir);
   }
 
   /**
@@ -95,7 +163,8 @@ public class SavedModelBundle implements AutoCloseable {
     return new SavedModelBundle(graph, session, metaGraphDef);
   }
 
-  private static native SavedModelBundle load(String exportDir, String[] tags, byte[] runOptions);
+  private static native SavedModelBundle load(
+      String exportDir, String[] tags, byte[] config, byte[] runOptions);
 
   static {
     TensorFlow.init();
diff --git a/tensorflow/java/src/main/java/org/tensorflow/op/Scope.java b/tensorflow/java/src/main/java/org/tensorflow/op/Scope.java
index 8de2eaeb79..5a233bcc98 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/op/Scope.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/op/Scope.java
@@ -135,17 +135,8 @@ public final class Scope {
    * }</pre>
    *
    * <p><b>Note:</b> if you provide a composite operator building class (i.e, a class that adds a
-   * set of related operations to the graph by calling other operator building code) you should also
-   * create a {@link #withSubScope(String)} scope for the underlying operators to group them under a
-   * meaningful name.
-   *
-   * <pre>{@code
-   * public static Stddev create(Scope scope, ...) {
-   *   // group sub-operations under a common name
-   *   Scope group = scope.withSubScope("stddev");
-   *   ... Sqrt.create(group, Mean.create(group, ...))
-   * }
-   * }</pre>
+   * set of related operations to the graph by calling other operator building code), the provided
+   * name will act as a subscope to all underlying operators.
    *
    * @param defaultName name for the underlying operator.
    * @return unique name for the operator.
diff --git a/tensorflow/java/src/main/java/org/tensorflow/op/core/Gradients.java b/tensorflow/java/src/main/java/org/tensorflow/op/core/Gradients.java
index f4671c8af9..eea9dc1c47 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/op/core/Gradients.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/op/core/Gradients.java
@@ -18,7 +18,6 @@ package org.tensorflow.op.core;
 import java.util.Arrays;
 import java.util.Iterator;
 import java.util.List;
-
 import org.tensorflow.Operand;
 import org.tensorflow.Output;
 import org.tensorflow.op.Op;
@@ -54,32 +53,36 @@ public class Gradients implements Op, Iterable<Operand<?>> {
    * Optional attributes for {@link Gradients}
    */
   public static class Options {
-    
+
     /**
      * @param dx partial derivatives of some loss function {@code L} w.r.t. {@code y}
      * @return this option builder
      */
-    public Options dx(Iterable<Operand<?>> dx) {
+    public Options dx(Iterable<? extends Operand<?>> dx) {
       this.dx = dx;
       return this;
     }
-    
-    private Iterable<Operand<?>> dx;
-    
+
+    private Iterable<? extends Operand<?>> dx;
+
     private Options() {
     }
   }
 
   /**
    * Adds gradients computation ops to the graph according to scope.
-   * 
+   *
    * @param scope current graph scope
    * @param y outputs of the function to derive
    * @param x inputs of the function for which partial derivatives are computed
    * @param options carries optional attributes values
    * @return a new instance of {@code Gradients}
    */
-  public static Gradients create(Scope scope, Iterable<Operand<?>> y, Iterable<Operand<?>> x, Options... options) {
+  public static Gradients create(
+      Scope scope,
+      Iterable<? extends Operand<?>> y,
+      Iterable<? extends Operand<?>> x,
+      Options... options) {
     Output<?>[] dx = null;
     if (options != null) {
       for (Options opts : options) {
@@ -88,16 +91,20 @@ public class Gradients implements Op, Iterable<Operand<?>> {
         }
       }
     }
-    Output<?>[] gradOutputs = scope.graph().addGradients(Operands.asOutputs(y), Operands.asOutputs(x), dx);
-    return new Gradients(Arrays.asList(gradOutputs));
+    Output<?>[] dy =
+        scope
+            .graph()
+            .addGradients(
+                scope.makeOpName("Gradients"), Operands.asOutputs(y), Operands.asOutputs(x), dx);
+    return new Gradients(Arrays.asList(dy));
   }
 
   /**
    * Adds gradients computation ops to the graph according to scope.
-   * 
-   * This is a simplified version of {@link #create(Scope, Iterable, Iterable, Options...)} where {@code y} is
-   * a single output.
-   * 
+   *
+   * <p>This is a simplified version of {@link #create(Scope, Iterable, Iterable, Options...)} where
+   * {@code y} is a single output.
+   *
    * @param scope current graph scope
    * @param y output of the function to derive
    * @param x inputs of the function for which partial derivatives are computed
@@ -105,7 +112,8 @@ public class Gradients implements Op, Iterable<Operand<?>> {
    * @return a new instance of {@code Gradients}
    */
   @SuppressWarnings({"unchecked", "rawtypes"})
-  public static Gradients create(Scope scope, Operand<?> y, Iterable<Operand<?>> x, Options... options) {
+  public static Gradients create(
+      Scope scope, Operand<?> y, Iterable<? extends Operand<?>> x, Options... options) {
     return create(scope, (Iterable) Arrays.asList(y), x, options);
   }
 
@@ -113,7 +121,7 @@ public class Gradients implements Op, Iterable<Operand<?>> {
    * @param dx partial derivatives of some loss function {@code L} w.r.t. {@code y}
    * @return builder to add more options to this operation
    */
-  public Options dx(Iterable<Operand<?>> dx) {
+  public static Options dx(Iterable<? extends Operand<?>> dx) {
     return new Options().dx(dx);
   }
 
@@ -129,13 +137,13 @@ public class Gradients implements Op, Iterable<Operand<?>> {
   public List<Output<?>> dy() {
     return dy;
   }
-  
+
   /**
    * Returns a symbolic handle to one of the gradient operation output
-   * <p>
-   * Warning: Does not check that the type of the tensor matches T. It is recommended to call
+   *
+   * <p>Warning: Does not check that the type of the tensor matches T. It is recommended to call
    * this method with an explicit type parameter rather than letting it be inferred, e.g. {@code
-   * gradients.<Integer>dy(0)}
+   * gradients.<Float>dy(0)}
    *
    * @param <T> The expected element type of the tensors produced by this output.
    * @param index The index of the output among the gradients added by this operation
diff --git a/tensorflow/java/src/main/native/graph_jni.cc b/tensorflow/java/src/main/native/graph_jni.cc
index dac6a345e9..f1744d8769 100644
--- a/tensorflow/java/src/main/native/graph_jni.cc
+++ b/tensorflow/java/src/main/native/graph_jni.cc
@@ -133,12 +133,10 @@ Java_org_tensorflow_Graph_toGraphDef(JNIEnv* env, jclass clazz, jlong handle) {
   return ret;
 }
 
-JNIEXPORT jlongArray JNICALL
-Java_org_tensorflow_Graph_addGradients(JNIEnv* env, jclass clazz, jlong handle,
-    jlongArray y_handles, jintArray y_indices,
-    jlongArray x_handles, jintArray x_indices,
-    jlongArray dx_handles, jintArray dx_indices) {
-
+JNIEXPORT jlongArray JNICALL Java_org_tensorflow_Graph_addGradients(
+    JNIEnv* env, jclass clazz, jlong handle, jstring prefix,
+    jlongArray y_handles, jintArray y_indices, jlongArray x_handles,
+    jintArray x_indices, jlongArray dx_handles, jintArray dx_indices) {
   TF_Graph* g = requireHandle(env, handle);
   if (g == nullptr) return nullptr;
 
@@ -163,9 +161,16 @@ Java_org_tensorflow_Graph_addGradients(JNIEnv* env, jclass clazz, jlong handle,
   }
   if (env->ExceptionCheck()) return nullptr;
 
+  const char* cprefix = nullptr;
+  if (prefix != nullptr) {
+    cprefix = env->GetStringUTFChars(prefix, nullptr);
+  }
   TF_Status* status = TF_NewStatus();
-  TF_AddGradients(g, y.get(), ny, x.get(), nx, dx.get(), status, dy.get());
-
+  TF_AddGradientsWithPrefix(g, cprefix, y.get(), ny, x.get(), nx, dx.get(),
+                            status, dy.get());
+  if (prefix != nullptr) {
+    env->ReleaseStringUTFChars(prefix, cprefix);
+  }
   if (!throwExceptionIfNotOK(env, status)) {
     TF_DeleteStatus(status);
     return nullptr;
diff --git a/tensorflow/java/src/main/native/graph_jni.h b/tensorflow/java/src/main/native/graph_jni.h
index 4f87e8d5a7..215695cdfd 100644
--- a/tensorflow/java/src/main/native/graph_jni.h
+++ b/tensorflow/java/src/main/native/graph_jni.h
@@ -76,11 +76,11 @@ JNIEXPORT jbyteArray JNICALL Java_org_tensorflow_Graph_toGraphDef(JNIEnv *,
 /*
  * Class:     org_tensorflow_Graph
  * Method:    name
- * Signature: (J[J[I[J[I[J[I)[J
+ * Signature: (JLjava/lang/String;[J[I[J[I[J[I)[J
  */
-JNIEXPORT jlongArray JNICALL Java_org_tensorflow_Graph_addGradients(JNIEnv *,
-    jclass, jlong, jlongArray, jintArray, jlongArray, jintArray, jlongArray,
-    jintArray);
+JNIEXPORT jlongArray JNICALL Java_org_tensorflow_Graph_addGradients(
+    JNIEnv *, jclass, jlong, jstring, jlongArray, jintArray, jlongArray,
+    jintArray, jlongArray, jintArray);
 
 #ifdef __cplusplus
 }  // extern "C"
diff --git a/tensorflow/java/src/main/native/saved_model_bundle_jni.cc b/tensorflow/java/src/main/native/saved_model_bundle_jni.cc
index de6382a79c..68999fb2da 100644
--- a/tensorflow/java/src/main/native/saved_model_bundle_jni.cc
+++ b/tensorflow/java/src/main/native/saved_model_bundle_jni.cc
@@ -22,12 +22,25 @@ limitations under the License.
 
 JNIEXPORT jobject JNICALL Java_org_tensorflow_SavedModelBundle_load(
     JNIEnv* env, jclass clazz, jstring export_dir, jobjectArray tags,
-    jbyteArray run_options) {
+    jbyteArray config, jbyteArray run_options) {
   TF_Status* status = TF_NewStatus();
   jobject bundle = nullptr;
 
   // allocate parameters for TF_LoadSessionFromSavedModel
   TF_SessionOptions* opts = TF_NewSessionOptions();
+  if (config != nullptr) {
+    size_t sz = env->GetArrayLength(config);
+    if (sz > 0) {
+      jbyte* config_data = env->GetByteArrayElements(config, nullptr);
+      TF_SetConfig(opts, static_cast<void*>(config_data), sz, status);
+      env->ReleaseByteArrayElements(config, config_data, JNI_ABORT);
+      if (!throwExceptionIfNotOK(env, status)) {
+        TF_DeleteSessionOptions(opts);
+        TF_DeleteStatus(status);
+        return nullptr;
+      }
+    }
+  }
   TF_Buffer* crun_options = nullptr;
   if (run_options != nullptr) {
     size_t sz = env->GetArrayLength(run_options);
diff --git a/tensorflow/java/src/main/native/saved_model_bundle_jni.h b/tensorflow/java/src/main/native/saved_model_bundle_jni.h
index 6cce6a81bd..a4b05d0409 100644
--- a/tensorflow/java/src/main/native/saved_model_bundle_jni.h
+++ b/tensorflow/java/src/main/native/saved_model_bundle_jni.h
@@ -26,10 +26,10 @@ extern "C" {
  * Class:     org_tensorflow_SavedModelBundle
  * Method:    load
  * Signature:
- * (Ljava/lang/String;[Ljava/lang/String;[B)Lorg/tensorflow/SavedModelBundle;
+ * (Ljava/lang/String;[Ljava/lang/String;[B;[B)Lorg/tensorflow/SavedModelBundle;
  */
 JNIEXPORT jobject JNICALL Java_org_tensorflow_SavedModelBundle_load(
-    JNIEnv *, jclass, jstring, jobjectArray, jbyteArray);
+    JNIEnv *, jclass, jstring, jobjectArray, jbyteArray, jbyteArray);
 
 #ifdef __cplusplus
 }  // extern "C"
diff --git a/tensorflow/java/src/main/native/session_jni.cc b/tensorflow/java/src/main/native/session_jni.cc
index cb54daf137..8b11525785 100644
--- a/tensorflow/java/src/main/native/session_jni.cc
+++ b/tensorflow/java/src/main/native/session_jni.cc
@@ -86,20 +86,22 @@ JNIEXPORT jlong JNICALL Java_org_tensorflow_Session_allocate2(
   TF_Graph* graph = reinterpret_cast<TF_Graph*>(graph_handle);
   TF_Status* status = TF_NewStatus();
   TF_SessionOptions* opts = TF_NewSessionOptions();
-  const char* ctarget = nullptr;
   jbyte* cconfig = nullptr;
-  if (target != nullptr) {
-    ctarget = env->GetStringUTFChars(target, nullptr);
-  }
   if (config != nullptr) {
     cconfig = env->GetByteArrayElements(config, nullptr);
     TF_SetConfig(opts, cconfig,
                  static_cast<size_t>(env->GetArrayLength(config)), status);
     if (!throwExceptionIfNotOK(env, status)) {
       env->ReleaseByteArrayElements(config, cconfig, JNI_ABORT);
+      TF_DeleteSessionOptions(opts);
+      TF_DeleteStatus(status);
       return 0;
     }
   }
+  const char* ctarget = nullptr;
+  if (target != nullptr) {
+    ctarget = env->GetStringUTFChars(target, nullptr);
+  }
   TF_Session* session = TF_NewSession(graph, opts, status);
   if (config != nullptr) {
     env->ReleaseByteArrayElements(config, cconfig, JNI_ABORT);
diff --git a/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java b/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java
index c2e52c22c6..7c05c1deaf 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java
@@ -22,7 +22,6 @@ import static org.junit.Assert.assertTrue;
 
 import java.util.HashSet;
 import java.util.Iterator;
-
 import org.junit.Test;
 import org.junit.runner.RunWith;
 import org.junit.runners.JUnit4;
@@ -180,8 +179,8 @@ public class GraphTest {
       Output<Float> x = TestUtil.placeholder(g, "x", Float.class);
       Output<Float> y0 = TestUtil.square(g, "y0", x);
       Output<Float> y1 = TestUtil.square(g, "y1", y0);
-      
-      Output<?>[] grad = g.addGradients(toArray(y0, y1), toArray(x), null);
+
+      Output<?>[] grad = g.addGradients(null, toArray(y0, y1), toArray(x), null);
       assertNotNull(grad);
       assertEquals(1, grad.length);
       assertEquals(DataType.FLOAT, grad[0].dataType());
@@ -212,7 +211,7 @@ public class GraphTest {
       assertEquals(1, grad0.length);
       assertEquals(DataType.FLOAT, grad0[0].dataType());
 
-      Output<?>[] grad1 = g.addGradients(toArray(y0), toArray(x), toArray(grad0[0]));
+      Output<?>[] grad1 = g.addGradients(null, toArray(y0), toArray(x), toArray(grad0[0]));
       assertNotNull(grad1);
       assertEquals(1, grad1.length);
       assertEquals(DataType.FLOAT, grad1[0].dataType());
@@ -228,6 +227,33 @@ public class GraphTest {
       }
     }
   }
+
+  @Test
+  public void validateGradientsNames() {
+    try (Graph g = new Graph()) {
+
+      Output<Float> x = TestUtil.placeholder(g, "x", Float.class);
+      Output<Float> y0 = TestUtil.square(g, "y0", x);
+
+      Output<?>[] grad0 = g.addGradients(null, toArray(y0), toArray(x), null);
+      assertTrue(grad0[0].op().name().startsWith("gradients/"));
+
+      Output<?>[] grad1 = g.addGradients(null, toArray(y0), toArray(x), null);
+      assertTrue(grad1[0].op().name().startsWith("gradients_1/"));
+
+      Output<?>[] grad2 = g.addGradients("more_gradients", toArray(y0), toArray(x), null);
+      assertTrue(grad2[0].op().name().startsWith("more_gradients/"));
+
+      Output<?>[] grad3 = g.addGradients("even_more_gradients", toArray(y0), toArray(x), null);
+      assertTrue(grad3[0].op().name().startsWith("even_more_gradients/"));
+
+      try {
+        g.addGradients("even_more_gradients", toArray(y0), toArray(x), null);
+      } catch (IllegalArgumentException e) {
+        // expected exception
+      }
+    }
+  }
   
   private static Output<?>[] toArray(Output<?>... outputs) {
     return outputs;
diff --git a/tensorflow/java/src/test/java/org/tensorflow/SavedModelBundleTest.java b/tensorflow/java/src/test/java/org/tensorflow/SavedModelBundleTest.java
index 7922f3329c..7d936867a7 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/SavedModelBundleTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/SavedModelBundleTest.java
@@ -47,7 +47,61 @@ public class SavedModelBundleTest {
       fail("not expected");
     } catch (org.tensorflow.TensorFlowException e) {
       // expected exception
-      assertTrue(e.getMessage().contains("SavedModel not found"));
+      assertTrue(e.getMessage().contains("Could not find SavedModel"));
     }
   }
+
+  @Test
+  public void loader() {
+    try (SavedModelBundle bundle = SavedModelBundle.loader(SAVED_MODEL_PATH)
+        .withTags("serve")
+        .withConfigProto(sillyConfigProto())
+        .withRunOptions(sillyRunOptions())
+        .load()) {
+      assertNotNull(bundle.session());
+      assertNotNull(bundle.graph());
+      assertNotNull(bundle.metaGraphDef());
+    }
+  }
+
+  private static byte[] sillyRunOptions() {
+    // Ideally this would use the generated Java sources for protocol buffers
+    // and end up with something like the snippet below. However, generating
+    // the Java files for the .proto files in tensorflow/core:protos_all is
+    // a bit cumbersome in bazel until the proto_library rule is setup.
+    //
+    // See https://github.com/bazelbuild/bazel/issues/52#issuecomment-194341866
+    // https://github.com/bazelbuild/rules_go/pull/121#issuecomment-251515362
+    // https://github.com/bazelbuild/rules_go/pull/121#issuecomment-251692558
+    //
+    // For this test, for now, the use of specific bytes suffices.
+    return new byte[] {0x08, 0x03};
+    /*
+    return org.tensorflow.framework.RunOptions.newBuilder()
+        .setTraceLevel(RunOptions.TraceLevel.FULL_TRACE)
+        .build()
+        .toByteArray();
+    */
+  }
+
+  public static byte[] sillyConfigProto() {
+    // Ideally this would use the generated Java sources for protocol buffers
+    // and end up with something like the snippet below. However, generating
+    // the Java files for the .proto files in tensorflow/core:protos_all is
+    // a bit cumbersome in bazel until the proto_library rule is setup.
+    //
+    // See https://github.com/bazelbuild/bazel/issues/52#issuecomment-194341866
+    // https://github.com/bazelbuild/rules_go/pull/121#issuecomment-251515362
+    // https://github.com/bazelbuild/rules_go/pull/121#issuecomment-251692558
+    //
+    // For this test, for now, the use of specific bytes suffices.
+    return new byte[] {0x10, 0x01, 0x28, 0x01};
+    /*
+    return org.tensorflow.framework.ConfigProto.newBuilder()
+        .setInterOpParallelismThreads(1)
+        .setIntraOpParallelismThreads(1)
+        .build()
+        .toByteArray();
+     */
+  }
 }
diff --git a/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java b/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java
index 4e84886416..f984c508ee 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java
@@ -24,7 +24,7 @@ public class TestUtil {
 
   public static final class AutoCloseableList<E extends AutoCloseable> extends ArrayList<E>
       implements AutoCloseable {
-    AutoCloseableList(Collection<? extends E> c) {
+    public AutoCloseableList(Collection<? extends E> c) {
       super(c);
     }
 
diff --git a/tensorflow/java/src/test/java/org/tensorflow/op/core/GradientsTest.java b/tensorflow/java/src/test/java/org/tensorflow/op/core/GradientsTest.java
new file mode 100644
index 0000000000..3f49790b29
--- /dev/null
+++ b/tensorflow/java/src/test/java/org/tensorflow/op/core/GradientsTest.java
@@ -0,0 +1,131 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+package org.tensorflow.op.core;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertTrue;
+
+import java.util.Arrays;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+import org.tensorflow.Graph;
+import org.tensorflow.Output;
+import org.tensorflow.Session;
+import org.tensorflow.Tensor;
+import org.tensorflow.Tensors;
+import org.tensorflow.TestUtil;
+import org.tensorflow.op.Scope;
+
+@RunWith(JUnit4.class)
+public class GradientsTest {
+
+  @Test
+  public void createGradients() {
+    try (Graph g = new Graph();
+        Session sess = new Session(g)) {
+      Scope scope = new Scope(g);
+
+      Output<Float> x = TestUtil.placeholder(g, "x1", Float.class);
+      Output<Float> y0 = TestUtil.square(g, "y0", x);
+      Output<Float> y1 = TestUtil.square(g, "y1", y0);
+
+      Gradients grads = Gradients.create(scope, y1, Arrays.asList(x, y0));
+
+      assertNotNull(grads);
+      assertNotNull(grads.dy());
+      assertEquals(2, grads.dy().size());
+
+      try (Tensor<Float> c = Tensors.create(3.0f);
+          TestUtil.AutoCloseableList<Tensor<?>> outputs =
+              new TestUtil.AutoCloseableList<>(
+                  sess.runner().feed(x, c).fetch(grads.dy(0)).fetch(grads.dy(1)).run())) {
+
+        assertEquals(108.0f, outputs.get(0).floatValue(), 0.0f);
+        assertEquals(18.0f, outputs.get(1).floatValue(), 0.0f);
+      }
+    }
+  }
+
+  @Test
+  public void createGradientsWithSum() {
+    try (Graph g = new Graph();
+        Session sess = new Session(g)) {
+      Scope scope = new Scope(g);
+
+      Output<Float> x = TestUtil.placeholder(g, "x1", Float.class);
+      Output<Float> y0 = TestUtil.square(g, "y0", x);
+      Output<Float> y1 = TestUtil.square(g, "y1", y0);
+
+      Gradients grads = Gradients.create(scope, Arrays.asList(y0, y1), Arrays.asList(x));
+
+      assertNotNull(grads);
+      assertNotNull(grads.dy());
+      assertEquals(1, grads.dy().size());
+
+      try (Tensor<Float> c = Tensors.create(3.0f);
+          TestUtil.AutoCloseableList<Tensor<?>> outputs =
+              new TestUtil.AutoCloseableList<>(sess.runner().feed(x, c).fetch(grads.dy(0)).run())) {
+
+        assertEquals(114.0f, outputs.get(0).floatValue(), 0.0f);
+      }
+    }
+  }
+
+  @Test
+  public void createGradientsWithInitialValues() {
+    try (Graph g = new Graph();
+        Session sess = new Session(g)) {
+      Scope scope = new Scope(g);
+
+      Output<Float> x = TestUtil.placeholder(g, "x1", Float.class);
+      Output<Float> y0 = TestUtil.square(g, "y0", x);
+      Output<Float> y1 = TestUtil.square(g, "y1", y0);
+
+      Gradients grads0 = Gradients.create(scope, y1, Arrays.asList(y0));
+      Gradients grads1 = Gradients.create(scope, y0, Arrays.asList(x), Gradients.dx(grads0.dy()));
+
+      assertNotNull(grads1);
+      assertNotNull(grads1.dy());
+      assertEquals(1, grads1.dy().size());
+
+      try (Tensor<Float> c = Tensors.create(3.0f);
+          TestUtil.AutoCloseableList<Tensor<?>> outputs =
+              new TestUtil.AutoCloseableList<>(
+                  sess.runner().feed(x, c).fetch(grads1.dy(0)).run())) {
+
+        assertEquals(108.0f, outputs.get(0).floatValue(), 0.0f);
+      }
+    }
+  }
+
+  @Test
+  public void validateGradientsNames() {
+    try (Graph g = new Graph()) {
+      Scope scope = new Scope(g).withSubScope("sub");
+
+      Output<Float> x = TestUtil.placeholder(g, "x1", Float.class);
+      Output<Float> y = TestUtil.square(g, "y", x);
+
+      Gradients grad0 = Gradients.create(scope, y, Arrays.asList(x));
+      assertTrue(grad0.dy(0).op().name().startsWith("sub/Gradients/"));
+
+      Gradients grad1 = Gradients.create(scope.withName("MyGradients"), y, Arrays.asList(x));
+      assertTrue(grad1.dy(0).op().name().startsWith("sub/MyGradients/"));
+    }
+  }
+}
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index ebfcfff4a5..614499793b 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -73,7 +73,7 @@ py_library(
     visibility = [
         "//tensorflow:__pkg__",
         "//tensorflow/python/tools:__pkg__",
-        "//tensorflow/tools/api/generator:__pkg__",
+        "//tensorflow/python/tools/api/generator:__pkg__",
     ],
     deps = [
         ":array_ops",
@@ -96,6 +96,7 @@ py_library(
         ":image_ops",
         ":initializers_ns",
         ":io_ops",
+        ":kernels",
         ":layers",
         ":lib",
         ":list_ops",
@@ -705,7 +706,9 @@ py_library(
         "framework/error_interpolation.py",
     ],
     srcs_version = "PY2AND3",
-    deps = [],
+    deps = [
+        ":util",
+    ],
 )
 
 py_library(
@@ -743,8 +746,8 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":framework",
+        ":framework_ops",
         ":function",
-        ":op_def_registry",
         ":tensor_shape",
         ":versions",
         "//tensorflow/core:protos_all_py",
@@ -760,8 +763,10 @@ py_test(
     deps = [
         ":array_ops",
         ":client_testlib",
+        ":constant_op",
         ":dtypes",
         ":framework_ops",
+        ":function",
         ":function_def_to_graph",
         ":graph_to_function_def",
         ":math_ops",
@@ -786,6 +791,19 @@ py_library(
 )
 
 py_library(
+    name = "kernels",
+    srcs = [
+        "framework/kernels.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":pywrap_tensorflow",
+        ":util",
+        "//tensorflow/core:protos_all_py",
+    ],
+)
+
+py_library(
     name = "op_def_library",
     srcs = ["framework/op_def_library.py"],
     srcs_version = "PY2AND3",
@@ -816,12 +834,15 @@ py_library(
     deps = [
         ":c_api_util",
         ":control_flow_util",
+        ":cpp_shape_inference_proto_py",
         ":device",
         ":dtypes",
+        ":error_interpolation",
         ":op_def_registry",
         ":platform",
         ":registry",
         ":tensor_shape",
+        ":traceable_stack",
         ":util",
         ":versions",
         "//tensorflow/core:protos_all_py",
@@ -887,6 +908,17 @@ py_library(
     ],
 )
 
+# This target is maintained separately from :util to provide separate visibility
+# for legacy users who were granted visibility when the functions were private
+# members of ops.Graph.
+py_library(
+    name = "tf_stack",
+    srcs = ["util/tf_stack.py"],
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [],
+)
+
 py_library(
     name = "tensor_shape",
     srcs = ["framework/tensor_shape.py"],
@@ -922,6 +954,16 @@ py_library(
 )
 
 py_library(
+    name = "traceable_stack",
+    srcs = ["framework/traceable_stack.py"],
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [
+        ":util",
+    ],
+)
+
+py_library(
     name = "versions",
     srcs = ["framework/versions.py"],
     srcs_version = "PY2AND3",
@@ -1018,7 +1060,9 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":client_testlib",
+        ":constant_op",
         ":error_interpolation",
+        ":traceable_stack",
     ],
 )
 
@@ -1207,6 +1251,21 @@ py_test(
     ],
 )
 
+py_test(
+    name = "framework_traceable_stack_test",
+    size = "small",
+    srcs = ["framework/traceable_stack_test.py"],
+    main = "framework/traceable_stack_test.py",
+    srcs_version = "PY2AND3",
+    deps = [
+        ":framework_test_lib",
+        ":platform_test",
+        ":test_ops",
+        ":traceable_stack",
+        ":util",
+    ],
+)
+
 tf_gen_op_wrapper_py(
     name = "test_ops",
     out = "framework/test_ops.py",
@@ -1439,6 +1498,20 @@ py_test(
     ],
 )
 
+py_test(
+    name = "framework_kernels_test",
+    size = "small",
+    srcs = ["framework/kernels_test.py"],
+    main = "framework/kernels_test.py",
+    srcs_version = "PY2AND3",
+    deps = [
+        ":framework_test_lib",
+        ":kernels",
+        ":platform_test",
+        ":test_ops",
+    ],
+)
+
 tf_gen_op_wrapper_private_py(
     name = "array_ops_gen",
     visibility = [
@@ -2096,8 +2169,8 @@ py_library(
         ":linalg_ops_gen",
         ":linalg_ops_impl",
         ":math_ops",
-        ":nn_ops",
         ":random_ops",
+        ":util",
         "//third_party/py/numpy",
     ],
 )
@@ -3006,6 +3079,20 @@ cuda_py_test(
 )
 
 cuda_py_test(
+    name = "init_ops_test",
+    size = "small",
+    srcs = ["ops/init_ops_test.py"],
+    additional_deps = [
+        ":client_testlib",
+        ":init_ops",
+        ":framework_ops",
+        ":resource_variable_ops",
+        "//third_party/py/numpy",
+        "//tensorflow/python/eager:context",
+    ],
+)
+
+cuda_py_test(
     name = "math_grad_test",
     size = "small",
     srcs = ["ops/math_grad_test.py"],
@@ -3130,14 +3217,18 @@ py_library(
             "training/checkpointable/**/*.py",
             # The following targets have their own build rules (same name as the
             # file):
+            "training/checkpoint_management.py",
             "training/saveable_object.py",
+            "training/saver.py",
             "training/training_util.py",
         ],
     ),
     srcs_version = "PY2AND3",
     deps = [
+        "saver",
         ":array_ops",
         ":array_ops_gen",
+        ":checkpoint_management",
         ":checkpoint_ops_gen",
         ":client",
         ":control_flow_ops",
@@ -3149,24 +3240,20 @@ py_library(
         ":framework_ops",
         ":gradients",
         ":init_ops",
-        ":distribute",
         ":io_ops",
-        ":io_ops_gen",
         ":layers_base",
-        ":lib",
         ":lookup_ops",
         ":math_ops",
         ":platform",
-        ":protos_all_py",
         ":pywrap_tensorflow",
         ":random_ops",
         ":resource_variable_ops",
         ":resources",
-        ":saveable_object",
         ":sdca_ops",
+        ":session",
         ":sparse_ops",
+        ":sparse_tensor",
         ":state_ops",
-        ":string_ops",
         ":summary",
         ":training_ops_gen",
         ":training_util",
@@ -3176,6 +3263,7 @@ py_library(
         "//third_party/py/numpy",
         "@six_archive//:six",
         "//tensorflow/core:protos_all_py",
+        "//tensorflow/python/data/ops:dataset_ops",
         "//tensorflow/python/eager:backprop",
         "//tensorflow/python/eager:context",
         # `layers` dependency only exists due to the use of a small utility.
@@ -3193,6 +3281,52 @@ py_library(
 )
 
 py_library(
+    name = "checkpoint_management",
+    srcs = ["training/checkpoint_management.py"],
+    deps = [
+        ":errors",
+        ":lib",
+        ":platform",
+        ":protos_all_py",
+        ":util",
+        "//tensorflow/core:protos_all_py",
+    ],
+)
+
+py_library(
+    name = "saver",
+    srcs = ["training/saver.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":array_ops",
+        ":checkpoint_management",
+        ":constant_op",
+        ":control_flow_ops",
+        ":device",
+        ":errors",
+        ":framework",
+        ":framework_ops",
+        ":io_ops",
+        ":io_ops_gen",
+        ":platform",
+        ":pywrap_tensorflow",
+        ":resource_variable_ops",
+        ":saveable_object",
+        ":session",
+        ":state_ops",
+        ":string_ops",
+        ":training_util",
+        ":util",
+        ":variables",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/training/checkpointable:base",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+    ],
+)
+
+py_library(
     name = "device_util",
     srcs = ["training/device_util.py"],
     srcs_version = "PY2AND3",
@@ -3297,6 +3431,9 @@ py_library(
         ],
     ),
     srcs_version = "PY2AND3",
+    visibility = visibility + [
+        "//tensorflow:__pkg__",
+    ],
     deps = [
         "//third_party/py/numpy",
         "@org_python_pypi_backports_weakref",
@@ -3319,6 +3456,7 @@ py_test(
         ":math_ops",
         ":util",
         "//third_party/py/numpy",
+        "@absl_py//absl/testing:parameterized",
     ],
 )
 
@@ -3569,6 +3707,7 @@ tf_cuda_library(
         "//tensorflow/core:graph",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:session_ref",
         "//third_party/py/numpy:headers",
         "//third_party/python_runtime:headers",
     ],
@@ -4079,6 +4218,7 @@ cuda_py_test(
         ":math_ops",
         "//tensorflow/core:protos_all_py",
     ],
+    tags = ["no_windows_gpu"],
 )
 
 py_test(
@@ -4295,6 +4435,42 @@ cuda_py_test(
     tags = ["multi_gpu"],
 )
 
+cuda_py_test(
+    name = "checkpoint_management_test",
+    size = "small",
+    srcs = [
+        "training/checkpoint_management_test.py",
+    ],
+    additional_deps = [
+        ":array_ops",
+        ":client_testlib",
+        ":control_flow_ops",
+        ":data_flow_ops",
+        ":errors",
+        ":gradients",
+        ":math_ops",
+        ":nn_grad",
+        ":nn_ops",
+        ":saver_test_utils",
+        ":partitioned_variables",
+        ":platform",
+        ":platform_test",
+        ":pywrap_tensorflow",
+        ":random_ops",
+        ":resource_variable_ops",
+        ":sparse_ops",
+        ":summary",
+        ":training",
+        ":util",
+        ":variable_scope",
+        ":variables",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python/data/ops:dataset_ops",
+    ],
+)
+
 py_test(
     name = "saver_large_variable_test",
     size = "medium",
@@ -4361,6 +4537,7 @@ tf_py_test(
     srcs = ["training/supervisor_test.py"],
     additional_deps = [
         ":array_ops",
+        ":checkpoint_management",
         ":client_testlib",
         ":errors",
         ":framework",
@@ -4368,6 +4545,7 @@ tf_py_test(
         ":io_ops",
         ":parsing_ops",
         ":platform",
+        ":saver",
         ":summary",
         ":training",
         ":variables",
@@ -4481,10 +4659,13 @@ py_test(
     tags = ["notsan"],  # b/67945581
     deps = [
         ":array_ops",
+        ":checkpoint_management",
         ":client_testlib",
         ":control_flow_ops",
         ":errors",
         ":framework_for_generated_wrappers",
+        ":resource_variable_ops",
+        ":saver",
         ":session",
         ":state_ops",
         ":summary",
diff --git a/tensorflow/python/client/session.py b/tensorflow/python/client/session.py
index e037925961..58a002c776 100644
--- a/tensorflow/python/client/session.py
+++ b/tensorflow/python/client/session.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
 import functools
 import re
 import threading
@@ -28,6 +29,7 @@ import numpy as np
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python import pywrap_tensorflow as tf_session
 from tensorflow.python.framework import device
+from tensorflow.python.framework import error_interpolation
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
@@ -243,7 +245,7 @@ class _FetchMapper(object):
     elif isinstance(fetch, (list, tuple)):
       # NOTE(touts): This is also the code path for namedtuples.
       return _ListFetchMapper(fetch)
-    elif isinstance(fetch, dict):
+    elif isinstance(fetch, collections.Mapping):
       return _DictFetchMapper(fetch)
     else:
       # Look for a handler in the registered expansions.
@@ -540,10 +542,11 @@ class _DeviceAttributes(object):
         (in bytes).
   """
 
-  def __init__(self, name, device_type, memory_limit_bytes):
+  def __init__(self, name, device_type, memory_limit_bytes, incarnation):
     self._name = device.canonical_name(name)
     self._device_type = device_type
     self._memory_limit_bytes = memory_limit_bytes
+    self._incarnation = incarnation
 
   @property
   def name(self):
@@ -557,11 +560,16 @@ class _DeviceAttributes(object):
   def memory_limit_bytes(self):
     return self._memory_limit_bytes
 
+  @property
+  def incarnation(self):
+    return self._incarnation
+
   def __repr__(self):
-    return '_DeviceAttributes(%s, %s, %d)' % (
+    return '_DeviceAttributes(%s, %s, %d, %d)' % (
         self.name,
         self.device_type,
         self.memory_limit_bytes,
+        self.incarnation,
     )
 
 
@@ -623,7 +631,7 @@ class BaseSession(SessionInterface):
     opts = tf_session.TF_NewSessionOptions(target=self._target, config=config)
     try:
       # pylint: disable=protected-access
-      self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
+      self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
       # pylint: enable=protected-access
     finally:
       tf_session.TF_DeleteSessionOptions(opts)
@@ -658,7 +666,9 @@ class BaseSession(SessionInterface):
       name = tf_session.TF_DeviceListName(raw_device_list, i)
       device_type = tf_session.TF_DeviceListType(raw_device_list, i)
       memory = tf_session.TF_DeviceListMemoryBytes(raw_device_list, i)
-      device_list.append(_DeviceAttributes(name, device_type, memory))
+      incarnation = tf_session.TF_DeviceListIncarnation(raw_device_list, i)
+      device_list.append(
+          _DeviceAttributes(name, device_type, memory, incarnation))
     tf_session.TF_DeleteDeviceList(raw_device_list)
     return device_list
 
@@ -1226,8 +1236,12 @@ class BaseSession(SessionInterface):
 
       return _fetch_handler_run
 
-  # Captures the name of a node in an error status.
-  _NODEDEF_NAME_RE = re.compile(r'\[\[Node: ([^ ]*?) =')
+  # Captures the name of a node in an error status. The regex below matches
+  # both the old and the new formats:
+  # Old format: [[Node: <node_name> = ...]]
+  # New format: [[{{node <node_name>}} = ...]]
+  _NODEDEF_NAME_RE = re.compile(
+      r'\[\[(Node: )?(\{\{node )?([^\} ]*)(\}\})?\s*=')
 
   def _do_run(self, handle, target_list, fetch_list, feed_dict, options,
               run_metadata):
@@ -1282,12 +1296,15 @@ class BaseSession(SessionInterface):
       node_def = None
       op = None
       if m is not None:
-        node_name = m.group(1)
+        node_name = m.group(3)
         try:
           op = self._graph.get_operation_by_name(node_name)
           node_def = op.node_def
         except KeyError:
           pass
+      if (self._config is not None and
+          self._config.experimental.client_handles_error_formatting):
+        message = error_interpolation.interpolate(message, self._graph)
       raise type(e)(node_def, op, message)
 
   def _extend_graph(self):
diff --git a/tensorflow/python/client/session_list_devices_test.py b/tensorflow/python/client/session_list_devices_test.py
index c5d82c213a..dd381c689f 100644
--- a/tensorflow/python/client/session_list_devices_test.py
+++ b/tensorflow/python/client/session_list_devices_test.py
@@ -37,6 +37,8 @@ class SessionListDevicesTest(test_util.TensorFlowTestCase):
       devices = sess.list_devices()
       self.assertTrue('/job:localhost/replica:0/task:0/device:CPU:0' in set(
           [d.name for d in devices]), devices)
+      # All valid device incarnations must be non-zero.
+      self.assertTrue(all(d.incarnation != 0 for d in devices))
 
   def testInvalidDeviceNumber(self):
     opts = tf_session.TF_NewSessionOptions()
@@ -54,6 +56,8 @@ class SessionListDevicesTest(test_util.TensorFlowTestCase):
       devices = sess.list_devices()
       self.assertTrue('/job:local/replica:0/task:0/device:CPU:0' in set(
           [d.name for d in devices]), devices)
+      # All valid device incarnations must be non-zero.
+      self.assertTrue(all(d.incarnation != 0 for d in devices))
 
   def testListDevicesClusterSpecPropagation(self):
     server1 = server_lib.Server.create_local_server()
@@ -67,11 +71,13 @@ class SessionListDevicesTest(test_util.TensorFlowTestCase):
     config = config_pb2.ConfigProto(cluster_def=cluster_def)
     with session.Session(server1.target, config=config) as sess:
       devices = sess.list_devices()
-      device_names = set([d.name for d in devices])
+      device_names = set(d.name for d in devices)
       self.assertTrue(
           '/job:worker/replica:0/task:0/device:CPU:0' in device_names)
       self.assertTrue(
           '/job:worker/replica:0/task:1/device:CPU:0' in device_names)
+      # All valid device incarnations must be non-zero.
+      self.assertTrue(all(d.incarnation != 0 for d in devices))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/client/session_test.py b/tensorflow/python/client/session_test.py
index b72e029d1c..052be68385 100644
--- a/tensorflow/python/client/session_test.py
+++ b/tensorflow/python/client/session_test.py
@@ -35,6 +35,7 @@ from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.client import session
 from tensorflow.python.framework import common_shapes
 from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import device as framework_device_lib
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import function
@@ -104,18 +105,20 @@ class SessionTest(test_util.TensorFlowTestCase):
           copy_val)
 
   def testManyCPUs(self):
-    # TODO(keveman): Implement ListDevices and test for the number of
-    # devices returned by ListDevices.
     with session.Session(
         config=config_pb2.ConfigProto(device_count={
-            'CPU': 2
-        })):
+            'CPU': 2, 'GPU': 0
+        })) as sess:
       inp = constant_op.constant(10.0, name='W1')
       self.assertAllEqual(inp.eval(), 10.0)
 
+      devices = sess.list_devices()
+      self.assertEqual(2, len(devices))
+      for device in devices:
+        self.assertEqual('CPU', framework_device_lib.DeviceSpec.from_string(
+            device.name).device_type)
+
   def testPerSessionThreads(self):
-    # TODO(keveman): Implement ListDevices and test for the number of
-    # devices returned by ListDevices.
     with session.Session(
         config=config_pb2.ConfigProto(use_per_session_threads=True)):
       inp = constant_op.constant(10.0, name='W1')
@@ -1868,19 +1871,21 @@ class SessionTest(test_util.TensorFlowTestCase):
 
   def testDeviceAttributes(self):
     attrs = session._DeviceAttributes(
-        '/job:worker/replica:0/task:3/device:CPU:2', 'TYPE', 1337)
+        '/job:worker/replica:0/task:3/device:CPU:2', 'TYPE', 1337, 1000000)
     self.assertEqual(1337, attrs.memory_limit_bytes)
     self.assertEqual('/job:worker/replica:0/task:3/device:CPU:2', attrs.name)
     self.assertEqual('TYPE', attrs.device_type)
+    self.assertEqual(1000000, attrs.incarnation)
     str_repr = '%s' % attrs
     self.assertTrue(str_repr.startswith('_DeviceAttributes'), str_repr)
 
   def testDeviceAttributesCanonicalization(self):
     attrs = session._DeviceAttributes('/job:worker/replica:0/task:3/cpu:1',
-                                      'TYPE', 1337)
+                                      'TYPE', 1337, 1000000)
     self.assertEqual(1337, attrs.memory_limit_bytes)
     self.assertEqual('/job:worker/replica:0/task:3/device:CPU:1', attrs.name)
     self.assertEqual('TYPE', attrs.device_type)
+    self.assertEqual(1000000, attrs.incarnation)
     str_repr = '%s' % attrs
     self.assertTrue(str_repr.startswith('_DeviceAttributes'), str_repr)
 
diff --git a/tensorflow/python/client/tf_session.i b/tensorflow/python/client/tf_session.i
index 985cb90436..39a2922ac0 100644
--- a/tensorflow/python/client/tf_session.i
+++ b/tensorflow/python/client/tf_session.i
@@ -138,6 +138,11 @@ tensorflow::ImportNumpy();
   $result = PyLong_FromLongLong($1);
 }
 
+// Convert TF_DeviceListIncarnation uint64_t output to Python integer
+%typemap(out) uint64_t {
+  $result = PyLong_FromUnsignedLongLong($1);
+}
+
 // We use TF_OperationGetControlInputs_wrapper instead of
 // TF_OperationGetControlInputs
 %ignore TF_OperationGetControlInputs;
@@ -772,6 +777,7 @@ def TF_Reset(target, containers=None, config=None):
   $1 = &types_local;
 }
 
+%unignore TF_NewSessionRef;
 %unignore SetRequireShapeInferenceFns;
 %unignore TF_TryEvaluateConstant_wrapper;
 %noexception TF_TryEvaluateConstant_wrapper;
diff --git a/tensorflow/python/client/tf_session_helper.cc b/tensorflow/python/client/tf_session_helper.cc
index b6481e7e29..bcd4af2912 100644
--- a/tensorflow/python/client/tf_session_helper.cc
+++ b/tensorflow/python/client/tf_session_helper.cc
@@ -20,6 +20,7 @@ limitations under the License.
 #include "tensorflow/c/c_api.h"
 #include "tensorflow/c/c_api_internal.h"
 #include "tensorflow/c/tf_status_helper.h"
+#include "tensorflow/core/common_runtime/session_ref.h"
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/attr_value_util.h"
@@ -42,6 +43,19 @@ static const char* kFeedDictErrorMsg =
     "feed_dict must be a dictionary mapping strings to NumPy arrays.";
 }  // end namespace
 
+TF_Session* TF_NewSessionRef(TF_Graph* graph, const TF_SessionOptions* opts,
+                             TF_Status* status) {
+  TF_Session* tf_session = TF_NewSession(graph, opts, status);
+  if (tf_session == nullptr) {
+    return nullptr;
+  }
+
+  Session* session = reinterpret_cast<Session*>(tf_session->session);
+  SessionRef* session_ref = new SessionRef(session);
+  tf_session->session = session_ref;
+  return tf_session;
+}
+
 void TF_Run_wrapper_helper(TF_DeprecatedSession* session, const char* handle,
                            const TF_Buffer* run_options, PyObject* feed_dict,
                            const NameVector& output_names,
diff --git a/tensorflow/python/client/tf_session_helper.h b/tensorflow/python/client/tf_session_helper.h
index cfd27c2bee..dab7e71aac 100644
--- a/tensorflow/python/client/tf_session_helper.h
+++ b/tensorflow/python/client/tf_session_helper.h
@@ -40,6 +40,9 @@ typedef tensorflow::gtl::InlinedVector<PyObject*, 8> PyObjectVector;
 // A TF_TensorVector is a vector of borrowed pointers to TF_Tensors.
 typedef gtl::InlinedVector<TF_Tensor*, 8> TF_TensorVector;
 
+TF_Session* TF_NewSessionRef(TF_Graph* graph, const TF_SessionOptions* opts,
+                             TF_Status* status);
+
 // Run the graph associated with the session starting with the
 // supplied inputs[].  Regardless of success or failure, inputs[] are
 // stolen by the implementation (i.e. the implementation will
diff --git a/tensorflow/python/compat/BUILD b/tensorflow/python/compat/BUILD
index 58ceafca06..e0a1c8e057 100644
--- a/tensorflow/python/compat/BUILD
+++ b/tensorflow/python/compat/BUILD
@@ -9,6 +9,7 @@ py_library(
     srcs = ["compat.py"],
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:internal"],
+    deps = ["//tensorflow/python:util"],
 )
 
 tf_py_test(
diff --git a/tensorflow/python/compat/compat.py b/tensorflow/python/compat/compat.py
index 68a6421c2c..c7b0cde7ed 100644
--- a/tensorflow/python/compat/compat.py
+++ b/tensorflow/python/compat/compat.py
@@ -24,13 +24,17 @@ from __future__ import print_function
 
 import datetime
 from tensorflow.python.util import tf_contextlib
+from tensorflow.python.util.tf_export import tf_export
 
-_FORWARD_COMPATIBILITY_HORIZON = datetime.date(2018, 8, 1)
+_FORWARD_COMPATIBILITY_HORIZON = datetime.date(2018, 8, 3)
 
 
+@tf_export("compat.forward_compatible")
 def forward_compatible(year, month, day):
   """Return true if the forward compatibility window has expired.
 
+  See @{$guide/version_compat#backward_and_partial_forward_compatibility}.
+
   Forward-compatibility refers to scenarios where the producer of a TensorFlow
   model (a GraphDef or SavedModel) is compiled against a version of the
   TensorFlow library newer than what the consumer was compiled against. The
@@ -82,10 +86,13 @@ def forward_compatible(year, month, day):
   return _FORWARD_COMPATIBILITY_HORIZON > datetime.date(year, month, day)
 
 
+@tf_export("compat.forward_compatibility_horizon")
 @tf_contextlib.contextmanager
 def forward_compatibility_horizon(year, month, day):
   """Context manager for testing forward compatibility of generated graphs.
 
+  See @{$guide/version_compat#backward_and_partial_forward_compatibility}.
+
   To ensure forward compatibility of generated graphs (see `forward_compatible`)
   with older binaries, new features can be gated with:
 
diff --git a/tensorflow/python/data/kernel_tests/BUILD b/tensorflow/python/data/kernel_tests/BUILD
index 3bde62fa1d..eeaae88bca 100644
--- a/tensorflow/python/data/kernel_tests/BUILD
+++ b/tensorflow/python/data/kernel_tests/BUILD
@@ -318,7 +318,7 @@ tf_py_test(
     ],
 )
 
-tf_py_test(
+cuda_py_test(
     name = "iterator_ops_test",
     size = "small",
     srcs = ["iterator_ops_test.py"],
@@ -329,6 +329,8 @@ tf_py_test(
         "//tensorflow/python/data/ops:dataset_ops",
         "//tensorflow/python/data/ops:iterator_ops",
         "//tensorflow/python/data/util:sparse",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/training/checkpointable:util",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
@@ -349,6 +351,9 @@ tf_py_test(
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:tensor_shape",
         "//tensorflow/python:training",
+        "//tensorflow/python/compat:compat",
+        "//tensorflow/python:util",
+        "//tensorflow/python:variables",
     ],
     grpc_enabled = True,
 )
@@ -380,3 +385,21 @@ tf_py_test(
         "no_windows",
     ],
 )
+
+tf_py_test(
+    name = "optional_ops_test",
+    size = "small",
+    srcs = ["optional_ops_test.py"],
+    additional_deps = [
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/data/ops:iterator_ops",
+        "//tensorflow/python/data/ops:optional_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:tensor_shape",
+    ],
+)
diff --git a/tensorflow/python/data/kernel_tests/iterator_ops_test.py b/tensorflow/python/data/kernel_tests/iterator_ops_test.py
index 820c167b6b..352424514e 100644
--- a/tensorflow/python/data/kernel_tests/iterator_ops_test.py
+++ b/tensorflow/python/data/kernel_tests/iterator_ops_test.py
@@ -17,6 +17,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import functools
 import os
 import warnings
 
@@ -25,6 +26,7 @@ import numpy as np
 from tensorflow.core.protobuf import cluster_pb2
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.client import session
+from tensorflow.python.compat import compat as forward_compat
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.ops import readers
@@ -45,7 +47,9 @@ from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import script_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import server_lib
+from tensorflow.python.training.checkpointable import util as checkpointable_utils
 from tensorflow.python.util import compat
 
 
@@ -415,6 +419,69 @@ class IteratorTest(test.TestCase):
         sess.run(
             next_element, feed_dict={handle_placeholder: iterator_4_handle})
 
+  def testIteratorStringHandleFuture(self):
+    with forward_compat.forward_compatibility_horizon(2018, 8, 4):
+      dataset_3 = dataset_ops.Dataset.from_tensor_slices([1, 2, 3])
+      dataset_4 = dataset_ops.Dataset.from_tensor_slices([10, 20, 30, 40])
+
+      iterator_3 = dataset_3.make_one_shot_iterator()
+      iterator_4 = dataset_4.make_one_shot_iterator()
+
+      handle_placeholder = array_ops.placeholder(dtypes.string, shape=[])
+      feedable_iterator = iterator_ops.Iterator.from_string_handle(
+          handle_placeholder, dataset_3.output_types, dataset_3.output_shapes)
+      next_element = feedable_iterator.get_next()
+
+      self.assertEqual(dataset_3.output_types, feedable_iterator.output_types)
+      self.assertEqual(dataset_4.output_types, feedable_iterator.output_types)
+      self.assertEqual([], feedable_iterator.output_shapes)
+
+      with self.test_session() as sess:
+        iterator_3_handle = sess.run(iterator_3.string_handle())
+        iterator_4_handle = sess.run(iterator_4.string_handle())
+
+        self.assertEqual(
+            10,
+            sess.run(
+                next_element,
+                feed_dict={handle_placeholder: iterator_4_handle}))
+        self.assertEqual(
+            1,
+            sess.run(
+                next_element,
+                feed_dict={handle_placeholder: iterator_3_handle}))
+        self.assertEqual(
+            20,
+            sess.run(
+                next_element,
+                feed_dict={handle_placeholder: iterator_4_handle}))
+        self.assertEqual(
+            2,
+            sess.run(
+                next_element,
+                feed_dict={handle_placeholder: iterator_3_handle}))
+        self.assertEqual(
+            30,
+            sess.run(
+                next_element,
+                feed_dict={handle_placeholder: iterator_4_handle}))
+        self.assertEqual(
+            3,
+            sess.run(
+                next_element,
+                feed_dict={handle_placeholder: iterator_3_handle}))
+        self.assertEqual(
+            40,
+            sess.run(
+                next_element,
+                feed_dict={handle_placeholder: iterator_4_handle}))
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(
+              next_element, feed_dict={handle_placeholder: iterator_3_handle})
+        with self.assertRaises(errors.OutOfRangeError):
+          sess.run(
+              next_element, feed_dict={handle_placeholder: iterator_4_handle})
+
   def testIteratorStringHandleReuseTensorObject(self):
     dataset = dataset_ops.Dataset.from_tensor_slices([1, 2, 3])
     one_shot_iterator = dataset.make_one_shot_iterator()
@@ -724,5 +791,98 @@ class IteratorTest(test.TestCase):
         val += 1
 
 
+class IteratorCheckpointingTest(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes
+  def testSaveRestoreOneShotIterator(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    dataset = dataset_ops.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6]).map(
+        math_ops.square).batch(2)
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next if context.executing_eagerly(
+    ) else functools.partial(self.evaluate, iterator.get_next())
+    checkpoint = checkpointable_utils.Checkpoint(iterator=iterator)
+    with self.test_session() as sess:
+      self.assertAllEqual([1, 4], get_next())
+      save_path = checkpoint.save(checkpoint_prefix)
+      self.assertAllEqual([9, 16], get_next())
+      self.assertAllEqual([25, 36], get_next())
+      checkpoint.restore(save_path).run_restore_ops(sess)
+      self.assertAllEqual([9, 16], get_next())
+      self.assertAllEqual([25, 36], get_next())
+      with self.assertRaises(errors.OutOfRangeError):
+        get_next()
+
+  @test_util.run_in_graph_and_eager_modes
+  def testSaveRestoreMultipleIterator(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    dataset = dataset_ops.Dataset.from_tensor_slices(
+        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
+    dataset = dataset.map(math_ops.square).batch(2)
+    iterator_1 = dataset.make_one_shot_iterator()
+    get_next_1 = iterator_1.get_next if context.executing_eagerly(
+    ) else functools.partial(self.evaluate, iterator_1.get_next())
+    iterator_2 = dataset.make_one_shot_iterator()
+    get_next_2 = iterator_2.get_next if context.executing_eagerly(
+    ) else functools.partial(self.evaluate, iterator_2.get_next())
+    dataset_2 = dataset_ops.Dataset.range(10)
+    iterator_3 = dataset_2.make_one_shot_iterator()
+    get_next_3 = iterator_3.get_next if context.executing_eagerly(
+    ) else functools.partial(self.evaluate, iterator_3.get_next())
+    checkpoint = checkpointable_utils.Checkpoint(
+        iterator_1=iterator_1, iterator_2=iterator_2, iterator_3=iterator_3)
+    with self.test_session() as sess:
+      self.assertAllEqual([1, 4], get_next_1())
+      self.assertAllEqual(0, get_next_3())
+      self.assertAllEqual(1, get_next_3())
+      self.assertAllEqual(2, get_next_3())
+      save_path = checkpoint.save(checkpoint_prefix)
+      self.assertAllEqual([1, 4], get_next_2())
+      self.assertAllEqual([9, 16], get_next_2())
+      self.assertAllEqual(3, get_next_3())
+      checkpoint.restore(save_path).run_restore_ops(sess)
+      self.assertAllEqual([9, 16], get_next_1())
+      self.assertAllEqual([1, 4], get_next_2())
+      self.assertAllEqual(3, get_next_3())
+
+  @test_util.run_in_graph_and_eager_modes
+  def testRestoreExhaustedIterator(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    dataset = dataset_ops.Dataset.range(3)
+    iterator = dataset.make_one_shot_iterator()
+    get_next = iterator.get_next if context.executing_eagerly(
+    ) else functools.partial(self.evaluate, iterator.get_next())
+    checkpoint = checkpointable_utils.Checkpoint(iterator=iterator)
+    with self.test_session() as sess:
+      self.assertAllEqual(0, get_next())
+      self.assertAllEqual(1, get_next())
+      save_path = checkpoint.save(checkpoint_prefix)
+      self.assertAllEqual(2, get_next())
+      checkpoint.restore(save_path).run_restore_ops(sess)
+      self.assertAllEqual(2, get_next())
+      save_path = checkpoint.save(checkpoint_prefix)
+      checkpoint.restore(save_path).run_restore_ops(sess)
+      with self.assertRaises(errors.OutOfRangeError):
+        get_next()
+
+  def testRestoreInReconstructedIteratorInitializable(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    dataset = dataset_ops.Dataset.range(10)
+    iterator = dataset.make_initializable_iterator()
+    get_next = iterator.get_next()
+    checkpoint = checkpointable_utils.Checkpoint(iterator=iterator)
+    for i in range(5):
+      with self.test_session() as sess:
+        checkpoint.restore(checkpoint_management.latest_checkpoint(
+            checkpoint_directory)).initialize_or_restore(sess)
+        for j in range(2):
+          self.assertEqual(i * 2 + j, sess.run(get_next))
+        checkpoint.save(file_prefix=checkpoint_prefix)
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/data/kernel_tests/map_dataset_op_test.py b/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
index 0ecd821e9e..637bde9ae4 100644
--- a/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
@@ -666,6 +666,13 @@ class MapDatasetTest(test.TestCase):
         "currently support nested datasets as outputs."):
       _ = dataset.map(dataset_ops.Dataset.from_tensor_slices)
 
+  def testReturnValueError(self):
+    dataset = dataset_ops.Dataset.from_tensors([1.0, 2.0, 3.0])
+    with self.assertRaisesRegexp(
+        TypeError, r"Unsupported return value from function passed to "
+        r"Dataset.map\(\): None."):
+      _ = dataset.map(lambda x: None)
+
 
 class MapDatasetBenchmark(test.Benchmark):
 
diff --git a/tensorflow/python/data/kernel_tests/optional_ops_test.py b/tensorflow/python/data/kernel_tests/optional_ops_test.py
new file mode 100644
index 0000000000..d6991fdd35
--- /dev/null
+++ b/tensorflow/python/data/kernel_tests/optional_ops_test.py
@@ -0,0 +1,153 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the Optional data type wrapper."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.ops import iterator_ops
+from tensorflow.python.data.ops import optional_ops
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test
+
+
+class OptionalTest(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes
+  def testFromValue(self):
+    opt = optional_ops.Optional.from_value(constant_op.constant(37.0))
+    self.assertEqual(dtypes.float32, opt.output_types)
+    self.assertEqual([], opt.output_shapes)
+    self.assertEqual(ops.Tensor, opt.output_classes)
+    self.assertTrue(self.evaluate(opt.has_value()))
+    self.assertEqual(37.0, self.evaluate(opt.get_value()))
+
+  @test_util.run_in_graph_and_eager_modes
+  def testFromStructuredValue(self):
+    opt = optional_ops.Optional.from_value({
+        "a": constant_op.constant(37.0),
+        "b": (constant_op.constant(["Foo"]), constant_op.constant("Bar"))
+    })
+    self.assertEqual({
+        "a": dtypes.float32,
+        "b": (dtypes.string, dtypes.string)
+    }, opt.output_types)
+    self.assertEqual({"a": [], "b": ([1], [])}, opt.output_shapes)
+    self.assertEqual({
+        "a": ops.Tensor,
+        "b": (ops.Tensor, ops.Tensor)
+    }, opt.output_classes)
+    self.assertTrue(self.evaluate(opt.has_value()))
+    self.assertEqual({
+        "a": 37.0,
+        "b": ([b"Foo"], b"Bar")
+    }, self.evaluate(opt.get_value()))
+
+  @test_util.run_in_graph_and_eager_modes
+  def testFromSparseTensor(self):
+    st_0 = sparse_tensor.SparseTensorValue(
+        indices=np.array([[0]]),
+        values=np.array([0], dtype=np.int64),
+        dense_shape=np.array([1]))
+    st_1 = sparse_tensor.SparseTensorValue(
+        indices=np.array([[0, 0], [1, 1]]),
+        values=np.array([-1., 1.], dtype=np.float32),
+        dense_shape=np.array([2, 2]))
+    opt = optional_ops.Optional.from_value((st_0, st_1))
+    self.assertEqual((dtypes.int64, dtypes.float32), opt.output_types)
+    self.assertEqual(([1], [2, 2]), opt.output_shapes)
+    self.assertEqual((sparse_tensor.SparseTensor, sparse_tensor.SparseTensor),
+                     opt.output_classes)
+
+  @test_util.run_in_graph_and_eager_modes
+  def testFromNone(self):
+    opt = optional_ops.Optional.none_from_structure(tensor_shape.scalar(),
+                                                    dtypes.float32, ops.Tensor)
+    self.assertEqual(dtypes.float32, opt.output_types)
+    self.assertEqual([], opt.output_shapes)
+    self.assertEqual(ops.Tensor, opt.output_classes)
+    self.assertFalse(self.evaluate(opt.has_value()))
+    with self.assertRaises(errors.InvalidArgumentError):
+      self.evaluate(opt.get_value())
+
+  def testStructureMismatchError(self):
+    tuple_output_shapes = (tensor_shape.scalar(), tensor_shape.scalar())
+    tuple_output_types = (dtypes.float32, dtypes.float32)
+    tuple_output_classes = (ops.Tensor, ops.Tensor)
+
+    dict_output_shapes = {
+        "a": tensor_shape.scalar(),
+        "b": tensor_shape.scalar()
+    }
+    dict_output_types = {"a": dtypes.float32, "b": dtypes.float32}
+    dict_output_classes = {"a": ops.Tensor, "b": ops.Tensor}
+
+    with self.assertRaises(TypeError):
+      optional_ops.Optional.none_from_structure(
+          tuple_output_shapes, tuple_output_types, dict_output_classes)
+
+    with self.assertRaises(TypeError):
+      optional_ops.Optional.none_from_structure(
+          tuple_output_shapes, dict_output_types, tuple_output_classes)
+
+    with self.assertRaises(TypeError):
+      optional_ops.Optional.none_from_structure(
+          dict_output_shapes, tuple_output_types, tuple_output_classes)
+
+  def testIteratorGetNextAsOptional(self):
+    ds = dataset_ops.Dataset.range(3)
+    iterator = ds.make_initializable_iterator()
+    next_elem = iterator_ops.get_next_as_optional(iterator)
+    self.assertTrue(isinstance(next_elem, optional_ops.Optional))
+    self.assertEqual(ds.output_types, next_elem.output_types)
+    self.assertEqual(ds.output_shapes, next_elem.output_shapes)
+    self.assertEqual(ds.output_classes, next_elem.output_classes)
+    elem_has_value_t = next_elem.has_value()
+    elem_value_t = next_elem.get_value()
+    with self.test_session() as sess:
+      # Before initializing the iterator, evaluating the optional fails with
+      # a FailedPreconditionError.
+      with self.assertRaises(errors.FailedPreconditionError):
+        sess.run(elem_has_value_t)
+      with self.assertRaises(errors.FailedPreconditionError):
+        sess.run(elem_value_t)
+
+      # For each element of the dataset, assert that the optional evaluates to
+      # the expected value.
+      sess.run(iterator.initializer)
+      for i in range(3):
+        elem_has_value, elem_value = sess.run([elem_has_value_t, elem_value_t])
+        self.assertTrue(elem_has_value)
+        self.assertEqual(i, elem_value)
+
+      # After exhausting the iterator, `next_elem.has_value()` will evaluate to
+      # false, and attempting to get the value will fail.
+      for _ in range(2):
+        self.assertFalse(sess.run(elem_has_value_t))
+        with self.assertRaises(errors.InvalidArgumentError):
+          sess.run(elem_value_t)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/data/ops/BUILD b/tensorflow/python/data/ops/BUILD
index fa2e86eab1..61bf9783ab 100644
--- a/tensorflow/python/data/ops/BUILD
+++ b/tensorflow/python/data/ops/BUILD
@@ -40,6 +40,7 @@ py_library(
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:tensor_shape",
+        "//tensorflow/python/compat",
         "//tensorflow/python/data/util:convert",
     ],
 )
@@ -49,13 +50,33 @@ py_library(
     srcs = ["iterator_ops.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":optional_ops",
         "//tensorflow/python:dataset_ops_gen",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:saver",
         "//tensorflow/python:tensor_shape",
+        "//tensorflow/python/compat",
         "//tensorflow/python/data/util:nest",
         "//tensorflow/python/data/util:sparse",
         "//tensorflow/python/eager:context",
+        "//tensorflow/python/training/checkpointable:base",
+    ],
+)
+
+py_library(
+    name = "optional_ops",
+    srcs = ["optional_ops.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/python:dataset_ops_gen",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:sparse_tensor",
+        "//tensorflow/python:tensor_shape",
+        "//tensorflow/python/data/util:nest",
+        "//tensorflow/python/data/util:sparse",
     ],
 )
diff --git a/tensorflow/python/data/ops/dataset_ops.py b/tensorflow/python/data/ops/dataset_ops.py
index 89265d9575..88de4b588c 100644
--- a/tensorflow/python/data/ops/dataset_ops.py
+++ b/tensorflow/python/data/ops/dataset_ops.py
@@ -24,6 +24,7 @@ import warnings
 import numpy as np
 import six
 
+from tensorflow.python.compat import compat
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import random_seed
@@ -107,8 +108,12 @@ class Dataset(object):
           "execution is enabled.")
     if shared_name is None:
       shared_name = ""
-    iterator_resource = gen_dataset_ops.iterator(
-        container="", shared_name=shared_name, **flat_structure(self))
+    if compat.forward_compatible(2018, 8, 3):
+      iterator_resource = gen_dataset_ops.iterator_v2(
+          container="", shared_name=shared_name, **flat_structure(self))
+    else:
+      iterator_resource = gen_dataset_ops.iterator(
+          container="", shared_name=shared_name, **flat_structure(self))
     with ops.colocate_with(iterator_resource):
       initializer = gen_dataset_ops.make_iterator(self._as_variant_tensor(),
                                                   iterator_resource)
@@ -1244,10 +1249,29 @@ class _NestedDatasetComponent(object):
   custom component types.
   """
 
-  def __init__(self, dataset):
-    self._output_classes = dataset.output_classes
-    self._output_shapes = dataset.output_shapes
-    self._output_types = dataset.output_types
+  def __init__(self,
+               dataset=None,
+               output_shapes=None,
+               output_types=None,
+               output_classes=None):
+    if dataset is None:
+      if (output_classes is None or output_shapes is None or
+          output_types is None):
+        raise ValueError(
+            "Either `dataset`, or all of `output_classes`, "
+            "`output_shapes`, and `output_types` must be specified.")
+      self._output_classes = output_classes
+      self._output_shapes = output_shapes
+      self._output_types = output_types
+    else:
+      if not (output_classes is None and output_shapes is None and
+              output_types is None):
+        raise ValueError(
+            "Either `dataset`, or all of `output_classes`, "
+            "`output_shapes`, and `output_types` must be specified.")
+      self._output_classes = dataset.output_classes
+      self._output_shapes = dataset.output_shapes
+      self._output_types = dataset.output_types
 
   @property
   def output_classes(self):
@@ -1406,7 +1430,11 @@ class StructuredFunctionWrapper(object):
           flat_shapes.append(component)
           flat_types.append(component)
         else:
-          t = ops.convert_to_tensor(t)
+          try:
+            t = ops.convert_to_tensor(t)
+          except (ValueError, TypeError):
+            raise TypeError("Unsupported return value from function passed to "
+                            "%s: %s." % (transformation_name, t))
           flat_ret.append(t)
           flat_classes.append(ops.Tensor)
           flat_shapes.append(t.get_shape())
@@ -1482,11 +1510,30 @@ def flat_structure(dataset):
     A dictionary of keyword arguments that can be passed to many Dataset op
     constructors.
   """
+  output_classes = []
+  output_shapes = []
+  output_types = []
+  for output_class, output_shape, output_type in zip(
+      nest.flatten(dataset.output_classes), nest.flatten(dataset.output_shapes),
+      nest.flatten(dataset.output_types)):
+    if isinstance(output_class, _NestedDatasetComponent):
+      output_classes.append(output_class.output_classes)
+      output_shapes.append(output_shape.output_shapes)
+      output_types.append(output_type.output_types)
+    else:
+      output_classes.append(output_class)
+      output_shapes.append(output_shape)
+      output_types.append(output_type)
+
+  output_classes = nest.pack_sequence_as(dataset.output_classes, output_classes)
+  output_shapes = nest.pack_sequence_as(dataset.output_shapes, output_shapes)
+  output_types = nest.pack_sequence_as(dataset.output_types, output_types)
+
   return {
-      "output_shapes": nest.flatten(sparse.as_dense_shapes(
-          dataset.output_shapes, dataset.output_classes)),
-      "output_types": nest.flatten(sparse.as_dense_types(
-          dataset.output_types, dataset.output_classes)),
+      "output_shapes":
+          nest.flatten(sparse.as_dense_shapes(output_shapes, output_classes)),
+      "output_types":
+          nest.flatten(sparse.as_dense_types(output_types, output_classes)),
   }
 
 
diff --git a/tensorflow/python/data/ops/iterator_ops.py b/tensorflow/python/data/ops/iterator_ops.py
index b6dba4e3ca..83c541c2f7 100644
--- a/tensorflow/python/data/ops/iterator_ops.py
+++ b/tensorflow/python/data/ops/iterator_ops.py
@@ -20,6 +20,8 @@ from __future__ import print_function
 import threading
 import warnings
 
+from tensorflow.python.compat import compat
+from tensorflow.python.data.ops import optional_ops
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
 from tensorflow.python.eager import context
@@ -29,6 +31,8 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.training.checkpointable import base as checkpointable
+from tensorflow.python.training.saver import BaseSaverBuilder
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -56,8 +60,15 @@ GET_NEXT_CALL_WARNING_MESSAGE = (
 GLOBAL_ITERATORS = "iterators"
 
 
+def _device_stack_is_empty():
+  # pylint: disable=protected-access
+  device_stack = ops.get_default_graph()._device_functions_outer_to_inner
+  # pylint: enable=protected-access
+  return not bool(device_stack)
+
+
 @tf_export("data.Iterator")
-class Iterator(object):
+class Iterator(checkpointable.CheckpointableBase):
   """Represents the state of iterating through a `Dataset`."""
 
   def __init__(self, iterator_resource, initializer, output_types,
@@ -172,13 +183,32 @@ class Iterator(object):
     nest.assert_same_structure(output_types, output_shapes)
     if shared_name is None:
       shared_name = ""
-    iterator_resource = gen_dataset_ops.iterator(
-        container="",
-        shared_name=shared_name,
-        output_types=nest.flatten(
-            sparse.as_dense_types(output_types, output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(output_shapes, output_classes)))
+    if compat.forward_compatible(2018, 8, 3):
+      if _device_stack_is_empty():
+        with ops.device("/cpu:0"):
+          iterator_resource = gen_dataset_ops.iterator_v2(
+              container="",
+              shared_name=shared_name,
+              output_types=nest.flatten(
+                  sparse.as_dense_types(output_types, output_classes)),
+              output_shapes=nest.flatten(
+                  sparse.as_dense_shapes(output_shapes, output_classes)))
+      else:
+        iterator_resource = gen_dataset_ops.iterator_v2(
+            container="",
+            shared_name=shared_name,
+            output_types=nest.flatten(
+                sparse.as_dense_types(output_types, output_classes)),
+            output_shapes=nest.flatten(
+                sparse.as_dense_shapes(output_shapes, output_classes)))
+    else:
+      iterator_resource = gen_dataset_ops.iterator(
+          container="",
+          shared_name=shared_name,
+          output_types=nest.flatten(
+              sparse.as_dense_types(output_types, output_classes)),
+          output_shapes=nest.flatten(
+              sparse.as_dense_shapes(output_shapes, output_classes)))
     return Iterator(iterator_resource, None, output_types, output_shapes,
                     output_classes)
 
@@ -242,12 +272,29 @@ class Iterator(object):
       output_classes = nest.map_structure(lambda _: ops.Tensor, output_types)
     nest.assert_same_structure(output_types, output_shapes)
     string_handle = ops.convert_to_tensor(string_handle, dtype=dtypes.string)
-    iterator_resource = gen_dataset_ops.iterator_from_string_handle(
-        string_handle,
-        output_types=nest.flatten(
-            sparse.as_dense_types(output_types, output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(output_shapes, output_classes)))
+    if compat.forward_compatible(2018, 8, 3):
+      if _device_stack_is_empty():
+        with ops.device("/cpu:0"):
+          iterator_resource = gen_dataset_ops.iterator_from_string_handle_v2(
+              string_handle,
+              output_types=nest.flatten(
+                  sparse.as_dense_types(output_types, output_classes)),
+              output_shapes=nest.flatten(
+                  sparse.as_dense_shapes(output_shapes, output_classes)))
+      else:
+        iterator_resource = gen_dataset_ops.iterator_from_string_handle_v2(
+            string_handle,
+            output_types=nest.flatten(
+                sparse.as_dense_types(output_types, output_classes)),
+            output_shapes=nest.flatten(
+                sparse.as_dense_shapes(output_shapes, output_classes)))
+    else:
+      iterator_resource = gen_dataset_ops.iterator_from_string_handle(
+          string_handle,
+          output_types=nest.flatten(
+              sparse.as_dense_types(output_types, output_classes)),
+          output_shapes=nest.flatten(
+              sparse.as_dense_shapes(output_shapes, output_classes)))
     return Iterator(iterator_resource, None, output_types, output_shapes,
                     output_classes)
 
@@ -420,6 +467,13 @@ class Iterator(object):
     """
     return self._output_types
 
+  def _gather_saveables_for_checkpoint(self):
+
+    def _saveable_factory(name):
+      return _IteratorSaveable(self._iterator_resource, name)
+
+    return {"ITERATOR": _saveable_factory}
+
 
 _uid_counter = 0
 _uid_lock = threading.Lock()
@@ -433,7 +487,7 @@ def _generate_shared_name(prefix):
   return "{}{}".format(prefix, uid)
 
 
-class EagerIterator(object):
+class EagerIterator(checkpointable.CheckpointableBase):
   """An iterator producing tf.Tensor objects from a tf.data.Dataset."""
 
   def __init__(self, dataset):
@@ -462,7 +516,8 @@ class EagerIterator(object):
           "tf.data.Dataset.make_initializable_iterator or "
           "tf.data.Dataset.make_one_shot_iterator for graph construction".
           format(type(self)))
-    with ops.device("/device:CPU:0"):
+    self._device = context.context().device_name
+    with ops.device("/cpu:0"):
       ds_variant = dataset._as_variant_tensor()  # pylint: disable=protected-access
       self._output_classes = dataset.output_classes
       self._output_types = dataset.output_types
@@ -471,14 +526,14 @@ class EagerIterator(object):
           sparse.as_dense_types(self._output_types, self._output_classes))
       self._flat_output_shapes = nest.flatten(
           sparse.as_dense_shapes(self._output_shapes, self._output_classes))
-      self._resource = gen_dataset_ops.anonymous_iterator(
-          output_types=self._flat_output_types,
-          output_shapes=self._flat_output_shapes)
-      gen_dataset_ops.make_iterator(ds_variant, self._resource)
-      # Delete the resource when this object is deleted
-      self._resource_deleter = resource_variable_ops.EagerResourceDeleter(
-          handle=self._resource, handle_device="/device:CPU:0")
-    self._device = context.context().device_name
+      with ops.colocate_with(ds_variant):
+        self._resource = gen_dataset_ops.anonymous_iterator(
+            output_types=self._flat_output_types,
+            output_shapes=self._flat_output_shapes)
+        gen_dataset_ops.make_iterator(ds_variant, self._resource)
+        # Delete the resource when this object is deleted
+        self._resource_deleter = resource_variable_ops.EagerResourceDeleter(
+            handle=self._resource, handle_device=self._device)
 
   def __iter__(self):
     return self
@@ -565,3 +620,56 @@ class EagerIterator(object):
     """
     del name
     return self._next_internal()
+
+  def _gather_saveables_for_checkpoint(self):
+
+    def _saveable_factory(name):
+      return _IteratorSaveable(self._resource, name)
+
+    return {"ITERATOR": _saveable_factory}
+
+
+# TODO(b/71645805): Expose checkpointable stateful objects from dataset
+# attributes(potential).
+class _IteratorSaveable(BaseSaverBuilder.SaveableObject):
+  """SaveableObject for saving/restoring iterator state."""
+
+  def __init__(self, iterator_resource, name):
+    serialized_iterator = gen_dataset_ops.serialize_iterator(iterator_resource)
+    specs = [
+        BaseSaverBuilder.SaveSpec(serialized_iterator, "", name + "_STATE")
+    ]
+    # pylint: disable=protected-access
+    super(_IteratorSaveable, self).__init__(iterator_resource, specs, name)
+
+  def restore(self, restored_tensors, restored_shapes):
+    with ops.colocate_with(self.op):
+      return gen_dataset_ops.deserialize_iterator(self.op, restored_tensors[0])
+
+
+def get_next_as_optional(iterator):
+  """Returns an `Optional` that contains the next value from the iterator.
+
+  If `iterator` has reached the end of the sequence, the returned `Optional`
+  will have no value.
+
+  Args:
+    iterator: A `tf.data.Iterator` object.
+
+  Returns:
+    An `Optional` object representing the next value from the iterator (if it
+    has one) or no value.
+  """
+  # pylint: disable=protected-access
+  return optional_ops._OptionalImpl(
+      gen_dataset_ops.iterator_get_next_as_optional(
+          iterator._iterator_resource,
+          output_types=nest.flatten(
+              sparse.as_dense_types(iterator.output_types,
+                                    iterator.output_classes)),
+          output_shapes=nest.flatten(
+              sparse.as_dense_shapes(iterator.output_shapes,
+                                     iterator.output_classes))),
+      output_shapes=iterator.output_shapes,
+      output_types=iterator.output_types,
+      output_classes=iterator.output_classes)
diff --git a/tensorflow/python/data/ops/optional_ops.py b/tensorflow/python/data/ops/optional_ops.py
new file mode 100644
index 0000000000..1d3007ef76
--- /dev/null
+++ b/tensorflow/python/data/ops/optional_ops.py
@@ -0,0 +1,209 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""An Optional type for representing potentially missing values."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import abc
+
+from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import sparse
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import gen_dataset_ops
+
+
+class Optional(object):
+  """Wraps a nested structure of tensors that may/may not be present at runtime.
+
+  An `Optional` can represent the result of an operation that may fail as a
+  value, rather than raising an exception and halting execution. For example,
+  @{tf.contrib.data.get_next_as_optional} returns an `Optional` that either
+  contains the next value from a @{tf.data.Iterator} if one exists, or a "none"
+  value that indicates the end of the sequence has been reached.
+  """
+
+  @abc.abstractmethod
+  def has_value(self, name=None):
+    """Returns a tensor that evaluates to `True` if this optional has a value.
+
+    Args:
+      name: (Optional.) A name for the created operation.
+
+    Returns:
+      A scalar `tf.Tensor` of type `tf.bool`.
+    """
+    raise NotImplementedError("Optional.has_value()")
+
+  @abc.abstractmethod
+  def get_value(self, name=None):
+    """Returns a nested structure of values wrapped by this optional.
+
+    If this optional does not have a value (i.e. `self.has_value()` evaluates
+    to `False`), this operation will raise @{tf.errors.InvalidArgumentError}
+    at runtime.
+
+    Args:
+      name: (Optional.) A name for the created operation.
+
+    Returns:
+      A nested structure of `tf.Tensor` and/or `tf.SparseTensor` objects.
+    """
+    raise NotImplementedError("Optional.get_value()")
+
+  @abc.abstractproperty
+  def output_classes(self):
+    """Returns the class of each component of this optional.
+
+    The expected values are `tf.Tensor` and `tf.SparseTensor`.
+
+    Returns:
+      A nested structure of Python `type` objects corresponding to each
+      component of this optional.
+    """
+    raise NotImplementedError("Optional.output_classes")
+
+  @abc.abstractproperty
+  def output_shapes(self):
+    """Returns the shape of each component of this optional.
+
+    Returns:
+      A nested structure of `tf.TensorShape` objects corresponding to each
+      component of this optional.
+    """
+    raise NotImplementedError("Optional.output_shapes")
+
+  @abc.abstractproperty
+  def output_types(self):
+    """Returns the type of each component of this optional.
+
+    Returns:
+      A nested structure of `tf.DType` objects corresponding to each component
+      of this optional.
+    """
+    raise NotImplementedError("Optional.output_types")
+
+  @staticmethod
+  def from_value(value):
+    """Returns an `Optional` that wraps the given value.
+
+    Args:
+      value: A nested structure of `tf.Tensor` and/or `tf.SparseTensor` objects.
+
+    Returns:
+      An `Optional` that wraps `value`.
+    """
+    # TODO(b/110122868): Consolidate this destructuring logic with the
+    # similar code in `Dataset.from_tensors()`.
+    with ops.name_scope("optional") as scope:
+      with ops.name_scope("value"):
+        value = nest.pack_sequence_as(value, [
+            sparse_tensor_lib.SparseTensor.from_value(t)
+            if sparse_tensor_lib.is_sparse(t) else ops.convert_to_tensor(
+                t, name="component_%d" % i)
+            for i, t in enumerate(nest.flatten(value))
+        ])
+
+      encoded_value = nest.flatten(sparse.serialize_sparse_tensors(value))
+      output_classes = sparse.get_classes(value)
+      output_shapes = nest.pack_sequence_as(
+          value, [t.get_shape() for t in nest.flatten(value)])
+      output_types = nest.pack_sequence_as(
+          value, [t.dtype for t in nest.flatten(value)])
+
+    return _OptionalImpl(
+        gen_dataset_ops.optional_from_value(encoded_value, name=scope),
+        output_shapes, output_types, output_classes)
+
+  @staticmethod
+  def none_from_structure(output_shapes, output_types, output_classes):
+    """Returns an `Optional` that has no value.
+
+    NOTE: This method takes arguments that define the structure of the value
+    that would be contained in the returned `Optional` if it had a value.
+
+    Args:
+      output_shapes: A nested structure of `tf.TensorShape` objects
+        corresponding to each component of this optional.
+      output_types: A nested structure of `tf.DType` objects corresponding to
+        each component of this optional.
+      output_classes: A nested structure of Python `type` objects corresponding
+        to each component of this optional.
+
+    Returns:
+      An `Optional` that has no value.
+    """
+    return _OptionalImpl(gen_dataset_ops.optional_none(), output_shapes,
+                         output_types, output_classes)
+
+
+class _OptionalImpl(Optional):
+  """Concrete implementation of `tf.contrib.data.Optional`.
+
+  NOTE(mrry): This implementation is kept private, to avoid defining
+  `Optional.__init__()` in the public API.
+  """
+
+  def __init__(self, variant_tensor, output_shapes, output_types,
+               output_classes):
+    # TODO(b/110122868): Consolidate the structure validation logic with the
+    # similar logic in `Iterator.from_structure()` and
+    # `Dataset.from_generator()`.
+    output_types = nest.map_structure(dtypes.as_dtype, output_types)
+    output_shapes = nest.map_structure_up_to(
+        output_types, tensor_shape.as_shape, output_shapes)
+    nest.assert_same_structure(output_types, output_shapes)
+    nest.assert_same_structure(output_types, output_classes)
+    self._variant_tensor = variant_tensor
+    self._output_shapes = output_shapes
+    self._output_types = output_types
+    self._output_classes = output_classes
+
+  def has_value(self, name=None):
+    return gen_dataset_ops.optional_has_value(self._variant_tensor, name=name)
+
+  def get_value(self, name=None):
+    # TODO(b/110122868): Consolidate the restructuring logic with similar logic
+    # in `Iterator.get_next()` and `StructuredFunctionWrapper`.
+    with ops.name_scope(name, "OptionalGetValue",
+                        [self._variant_tensor]) as scope:
+      return sparse.deserialize_sparse_tensors(
+          nest.pack_sequence_as(
+              self._output_types,
+              gen_dataset_ops.optional_get_value(
+                  self._variant_tensor,
+                  name=scope,
+                  output_types=nest.flatten(
+                      sparse.as_dense_types(self._output_types,
+                                            self._output_classes)),
+                  output_shapes=nest.flatten(
+                      sparse.as_dense_shapes(self._output_shapes,
+                                             self._output_classes)))),
+          self._output_types, self._output_shapes, self._output_classes)
+
+  @property
+  def output_classes(self):
+    return self._output_classes
+
+  @property
+  def output_shapes(self):
+    return self._output_shapes
+
+  @property
+  def output_types(self):
+    return self._output_types
diff --git a/tensorflow/python/data/util/nest.py b/tensorflow/python/data/util/nest.py
index 32e08021dc..1b596bdfc0 100644
--- a/tensorflow/python/data/util/nest.py
+++ b/tensorflow/python/data/util/nest.py
@@ -13,7 +13,6 @@
 # limitations under the License.
 # ==============================================================================
 
-# TODO(shivaniagrawal): Merge with core nest
 """## Functions for working with arbitrarily nested sequences of elements.
 
 NOTE(mrry): This fork of the `tensorflow.python.util.nest` module
diff --git a/tensorflow/python/debug/BUILD b/tensorflow/python/debug/BUILD
index c025dc8aa5..27b8ebd362 100644
--- a/tensorflow/python/debug/BUILD
+++ b/tensorflow/python/debug/BUILD
@@ -404,6 +404,7 @@ py_library(
     deps = [
         ":debug_errors",
         ":debug_fibonacci",
+        ":debug_keras",
         ":debug_mnist",
         ":debug_tflearn_iris",
     ],
@@ -802,6 +803,7 @@ cuda_py_test(
         "//tensorflow/python:platform_test",
         "//tensorflow/python:variables",
     ],
+    tags = ["no_windows_gpu"],
 )
 
 py_test(
diff --git a/tensorflow/python/debug/examples/examples_test.sh b/tensorflow/python/debug/examples/examples_test.sh
index 2d35b2d8bb..f7d597c8c0 100755
--- a/tensorflow/python/debug/examples/examples_test.sh
+++ b/tensorflow/python/debug/examples/examples_test.sh
@@ -99,7 +99,7 @@ if [[ -d "${CUSTOM_DUMP_ROOT}" ]]; then
 fi
 
 # Test debugging of tf.keras.
-cat << EOF | "${DEBUG_KERAS_BIN}" --debug --ui_type=readline
+cat << EOF | ${DEBUG_KERAS_BIN} --debug --ui_type=readline
 run -f has_inf_or_nan
 EOF
 
diff --git a/tensorflow/python/distribute/BUILD b/tensorflow/python/distribute/BUILD
new file mode 100644
index 0000000000..2bd0b4320a
--- /dev/null
+++ b/tensorflow/python/distribute/BUILD
@@ -0,0 +1,43 @@
+package(
+    default_visibility = ["//tensorflow:internal"],
+)
+
+licenses(["notice"])  # Apache 2.0
+
+exports_files(["LICENSE"])
+
+load("//tensorflow:tensorflow.bzl", "py_test")
+
+py_library(
+    name = "distribute_coordinator",
+    srcs = [
+        "distribute_coordinator.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:training",
+    ],
+)
+
+py_test(
+    name = "distribute_coordinator_test",
+    size = "small",
+    srcs = ["distribute_coordinator_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_pip"],
+    deps = [
+        ":distribute_coordinator",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:distributed_framework_test_lib",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:session",
+        "//tensorflow/python:training",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python:variables",
+    ],
+)
diff --git a/tensorflow/python/distribute/distribute_coordinator.py b/tensorflow/python/distribute/distribute_coordinator.py
new file mode 100644
index 0000000000..dab1ed43ca
--- /dev/null
+++ b/tensorflow/python/distribute/distribute_coordinator.py
@@ -0,0 +1,361 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""A unified and split coordinator for distributed TensorFlow."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import copy
+import json
+import os
+import threading
+
+from tensorflow.core.protobuf import cluster_pb2
+from tensorflow.python.training import server_lib
+
+
+class _TaskType(object):
+  PS = "ps"
+  WORKER = "worker"
+  CHIEF = "chief"
+  EVALUATOR = "evaluator"
+
+
+_worker_context = threading.local()
+
+
+def get_current_worker_context():
+  """Returns the current task context."""
+  try:
+    return _worker_context.current
+  except AttributeError:
+    return None
+
+
+class _Barrier(object):
+  """A reusable barrier class for worker synchronization."""
+
+  def __init__(self, num_participants):
+    """Initializes the barrier object.
+
+    Args:
+      num_participants: an integer which is the expected number of calls of
+        `wait` pass to through this barrier.
+    """
+    self._num_participants = num_participants
+    self._counter = 0
+    self._flag = False
+    self._local_sense = threading.local()
+    self._lock = threading.Lock()
+    self._condition = threading.Condition()
+
+  def wait(self):
+    """Waits until all other callers reach the same wait call."""
+    if not hasattr(self._local_sense, "value"):
+      self._local_sense.value = False
+    self._local_sense.value = not self._flag
+    with self._lock:
+      self._counter += 1
+      if self._counter == self._num_participants:
+        self._counter = 0
+        self._flag = self._local_sense.value
+    with self._condition:
+      while self._flag != self._local_sense.value:
+        self._condition.wait()
+      self._condition.notify_all()
+
+
+def _get_num_workers(cluster_spec):
+  """Gets number of workers including chief."""
+  if not cluster_spec:
+    return 0
+  return len(cluster_spec.as_dict().get(_TaskType.WORKER, [])) + len(
+      cluster_spec.as_dict().get(_TaskType.CHIEF, []))
+
+
+class _WorkerContext(object):
+  """The worker context class.
+
+  This context object provides configuration information for each task. One
+  context manager with a worker context object will be created per
+  invocation to the `worker_fn` where `get_current_worker_context` can be called
+  to access the worker context object.
+  """
+
+  def __init__(self,
+               cluster_spec,
+               task_type,
+               task_id,
+               between_graph=False,
+               rpc_layer="grpc",
+               worker_barrier=None):
+    """Initialize the worker context object.
+
+    Args:
+      cluster_spec: a ClusterSpec object. It can be empty or None in the local
+        training case.
+      task_type: a string indicating the role of the corresponding task, such as
+        "worker" or "ps". It can be None if it is local training or
+        `between_graph` is False.
+      task_id: an integer indicating id of the corresponding task. It can be
+        None if it is local training or `between_graph` is False.
+      between_graph: whether it is between-graph replication or not.
+      rpc_layer: optional string specifying the RPC protocol for communication
+        with worker masters. If None or empty, hosts in the `cluster_spec` will
+        be used directly.
+      worker_barrier: optional, the barrier object for worker synchronization.
+
+    Raises:
+      ValueError: if task_type or task_id is Node or empty and it is distributed
+        between-graph replicated training.
+    """
+    if cluster_spec and between_graph:
+      if not task_type or task_id is None:
+        raise ValueError("`task_type` and `task_id` must be set in the "
+                         "distributed between-graph replicated training.")
+      if task_type not in cluster_spec.jobs:
+        raise ValueError("`task_type` %r not found in the `cluster_spec` %r" %
+                         (task_type, cluster_spec))
+    self._cluster_spec = cluster_spec
+    self._task_type = task_type
+    self._task_id = task_id
+    self._worker_barrier = worker_barrier
+    self._rpc_layer = rpc_layer
+    self._master_target = self._get_master_target()
+    self._num_workers = _get_num_workers(cluster_spec)
+    self._is_chief_node = self._is_chief()
+
+  def __enter__(self):
+    old_context = get_current_worker_context()
+    if old_context:
+      raise ValueError(
+          "You cannot run distribute coordinator in a `worker_fn`.")
+    _worker_context.current = self
+
+  def __exit__(self, unused_exception_type, unused_exception_value,
+               unused_traceback):
+    _worker_context.current = None
+
+  def _get_master_target(self):
+    """Return the master target for a task."""
+    # If cluster_spec is None or empty, we use local master.
+    if not self._cluster_spec:
+      return "local"
+
+    # If task_type is None, then it is in-graph replicated training. In this
+    # case we use the chief or first worker's master target.
+    if not self._task_type:
+      if _TaskType.CHIEF in self._cluster_spec.jobs:
+        assert not self.between_graph
+        task_type = _TaskType.CHIEF
+        task_id = 0
+      else:
+        assert _TaskType.WORKER in self._cluster_spec.jobs
+        task_type = _TaskType.WORKER
+        task_id = 0
+    else:
+      task_type = self._task_type
+      task_id = self._task_id
+
+    prefix = ""
+    if self._rpc_layer:
+      prefix = self._rpc_layer + "://"
+    return prefix + self._cluster_spec.job_tasks(task_type)[task_id or 0]
+
+  def _is_chief(self):
+    """Return whether the task is the chief worker."""
+    if (not self._cluster_spec or self._task_type in [_TaskType.CHIEF, None]):
+      return True
+
+    # If not local and chief not in the cluster_spec, use the first worker as
+    # chief.
+    if (_TaskType.CHIEF not in self._cluster_spec.jobs and
+        self._task_type == _TaskType.WORKER and self._task_id == 0):
+      return True
+    return False
+
+  def wait_for_other_workers(self):
+    """Waits for other workers to reach the same call to this method.
+
+    Raises:
+      ValueError: if `worker_barrier` is not passed to the __init__ method.
+    """
+    if not self._worker_barrier:
+      raise ValueError(
+          "`worker_barrier is not set in the worker context.`")
+    self._worker_barrier.wait()
+
+  @property
+  def distributed_mode(self):
+    """Whether it is distributed training or not."""
+    return bool(self._cluster_spec)
+
+  @property
+  def cluster_spec(self):
+    """Returns a copy of the cluster_spec object."""
+    return copy.deepcopy(self._cluster_spec)
+
+  @property
+  def task_type(self):
+    """Returns the role of the corresponing task."""
+    return self._task_type
+
+  @property
+  def task_id(self):
+    """Returns the id or index of the corresponing task."""
+    return self._task_id
+
+  @property
+  def master_target(self):
+    """Returns the session master for the corresponding task to connect to."""
+    return self._master_target
+
+  @property
+  def is_chief(self):
+    """Returns whether the task is a chief node."""
+    return self._is_chief_node
+
+  @property
+  def num_workers(self):
+    """Returns number of workers in the cluster, including chief."""
+    return self._num_workers
+
+
+def _run(worker_fn, cluster_spec, task_type, task_id, between_graph, rpc_layer,
+         worker_barrier):
+  with _WorkerContext(cluster_spec, task_type, task_id, between_graph,
+                      rpc_layer, worker_barrier):
+    worker_fn()
+
+
+def run_distribute_coordinator(worker_fn,
+                               cluster_spec=None,
+                               between_graph=False,
+                               rpc_layer=None):
+  """Run the coordinator for distributed TensorFlow.
+
+  This function runs a unified and split coordinator for distributed TensorFlow.
+  Given a `cluster_spec` specifying server addresses and their roles in a
+  cluster, this coordinator will figure out how to set them up, give the
+  underlying function the right targets for master sessions and coordinate their
+  training.
+
+  In addition to be the distribute coordinator, this is also the source of
+  configurations for each job in the distributed training. As there are multiple
+  ways to configure a distributed TensorFlow cluster, its context object
+  provides these configurations so that users or higher-level APIs don't have to
+  figure out the configuration for each job by themselves.
+
+  In the between-graph replicated training, this coordinator will create
+  multiple threads and each calls the `worker_fn` which is supposed to create
+  its own graph and connect to one worker master given by its coordinator
+  context. In the in-graph replicated training, it has only one thread calling
+  this `worker_fn`.
+
+  The `worker_fn` defines the training logic and is called under a its own
+  worker context which can be accessed to via `get_current_worker_context`. A
+  worker context provides access to configurations for each task, e.g. the
+  task_type, task_id, master target and so on. Since `worker_fn` will be called
+  in a thread and possibly multiple times, caller should be careful when it
+  accesses global data. For example, it is unsafe to define flags in a
+  `worker_fn` or to define different environment variables for different
+  `worker_fn`s.
+
+  The `worker_fn` for the between-graph replication is defined as if there are
+  only one worker corresponding to the `worker_fn` and possibly ps jobs. It
+  assigns variables to parameter servers and all other operations to that
+  worker. In the in-graph replication case, the `worker_fn` has to define
+  operations for all worker jobs. Using a distribution strategy can simplify the
+  `worker_fn` by not having to worry about the replication and device assignment
+  of variables and operations.
+
+  This method is intended to be invoked by high-level APIs so that users don't
+  have to explictly call it to run this coordinator. For those who don't use
+  high-level APIs, to change a program to use this coordinator, wrap everything
+  in a the program after global data definitions such as commandline flag
+  definition into the `worker_fn` and get task-specific configurations from
+  the worker context.
+
+  The `cluster_spec` can be either passed by the argument or parsed from the
+  "TF_CONFIG" envrionment variable. Example of a TF_CONFIG:
+  ```
+    cluster = {'chief': ['host0:2222'],
+               'ps': ['host1:2222', 'host2:2222'],
+               'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
+    os.environ['TF_CONFIG'] = json.dumps({'cluster': cluster})
+  ```
+
+  If `cluster_spec` is not given in any format, it becomes local training and
+  this coordinator will connect to a local session.
+
+  For evaluation, if "evaluator" exist in the cluster_spec, a separate thread
+  will be created with its `task_type` set to "evaluator". If "evaluator" is not
+  set in the cluster_spec, it entirely depends on the `worker_fn` for how to do
+  evaluation.
+
+  Args:
+    worker_fn: the function to be called and given the access to a coordinator
+      context object.
+    cluster_spec: a dict, ClusterDef or ClusterSpec specifying servers and roles
+      in a cluster. If not set or empty, fall back to local training.
+    between_graph: a boolean. It is only useful when `cluster_spec` is set and
+      not empty. If true, it will use between-graph replicated training;
+      otherwise it will use in-graph replicated training.
+    rpc_layer: optional string, the protocol for RPC, e.g. "grpc".
+
+  Raises:
+    ValueError: if `cluster_spec` is supplied but not a dict or a ClusterDef or
+      a ClusterSpec.
+  """
+  if not cluster_spec:
+    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
+    cluster_spec = tf_config.get("cluster", {})
+
+  if cluster_spec:
+    if isinstance(cluster_spec, (dict, cluster_pb2.ClusterDef)):
+      cluster_spec = server_lib.ClusterSpec(cluster_spec)
+    elif not isinstance(cluster_spec, server_lib.ClusterSpec):
+      raise ValueError(
+          "`cluster_spec' should be dict or a `tf.train.ClusterSpec` or a "
+          "`tf.train.ClusterDef` object")
+    # TODO(yuefengz): validate cluster_spec.
+
+  threads = []
+  if cluster_spec and _TaskType.EVALUATOR in cluster_spec.jobs:
+    t = threading.Thread(
+        target=_run,
+        args=(worker_fn, cluster_spec, _TaskType.EVALUATOR, 0, between_graph,
+              rpc_layer, None))
+    t.start()
+    threads.append(t)
+
+  if cluster_spec and between_graph:
+    worker_barrier = _Barrier(_get_num_workers(cluster_spec))
+    for task_type in [_TaskType.CHIEF, _TaskType.WORKER]:
+      for task_id in range(len(cluster_spec.as_dict().get(task_type, []))):
+        t = threading.Thread(
+            target=_run,
+            args=(worker_fn, cluster_spec, task_type, task_id, between_graph,
+                  rpc_layer, worker_barrier))
+        t.start()
+        threads.append(t)
+  else:
+    # Local or in-graph replicated training.
+    _run(worker_fn, cluster_spec, None, None, between_graph, rpc_layer, None)
+
+  # TODO(yuefengz): wrapper threads into thread coordinator?
+  for t in threads:
+    t.join()
diff --git a/tensorflow/python/distribute/distribute_coordinator_test.py b/tensorflow/python/distribute/distribute_coordinator_test.py
new file mode 100644
index 0000000000..d7ffeb56a5
--- /dev/null
+++ b/tensorflow/python/distribute/distribute_coordinator_test.py
@@ -0,0 +1,293 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for distribute coordinator."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import contextlib
+import copy
+import threading
+import six
+
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.python.client import session
+from tensorflow.python.distribute import distribute_coordinator
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+
+CHIEF = distribute_coordinator._TaskType.CHIEF
+WORKER = distribute_coordinator._TaskType.WORKER
+PS = distribute_coordinator._TaskType.PS
+EVALUATOR = distribute_coordinator._TaskType.EVALUATOR
+
+NUM_WORKERS = 3
+NUM_PS = 2
+
+
+def _bytes_to_str(maybe_bytes):
+  if isinstance(maybe_bytes, six.string_types):
+    return maybe_bytes
+  else:
+    return str(maybe_bytes, "utf-8")
+
+
+class DistributeCoordinatorTest(test.TestCase):
+
+  @classmethod
+  def setUpClass(cls):
+    # We have to create a global in-process cluster because once an in-process
+    # tensorflow server is created, there is no way to terminate it. Please see
+    # multi_worker_test_base.py for more details.
+    cls._workers, cls._ps = test_util.create_local_cluster(
+        NUM_WORKERS, num_ps=NUM_PS)
+    cls._cluster_spec = {
+        WORKER: [_bytes_to_str(w.target) for w in cls._workers],
+        PS: [_bytes_to_str(ps.target) for ps in cls._ps]
+    }
+
+  def setUp(self):
+    self._result_correct = 0
+    self._lock = threading.Lock()
+    self._worker_context = {}
+
+  @contextlib.contextmanager
+  def _test_session(self, target):
+    config = config_pb2.ConfigProto(allow_soft_placement=True)
+    config.graph_options.optimizer_options.opt_level = -1
+    with session.Session(graph=None, config=config, target=target) as sess:
+      yield sess
+
+  def _in_graph_worker_fn(self):
+    context = distribute_coordinator.get_current_worker_context()
+    self.assertTrue(context is not None)
+    with self._test_session(target=context.master_target) as sess:
+      xs = []
+      expected = 0.0
+      for i in range(context.num_workers):
+        with ops.device("/job:worker/task:%d" % i):
+          x = variable_scope.get_variable("x_%d" % i, initializer=10.0)
+          x_add = x.assign_add(float(i))
+          xs.append(x_add)
+          expected += i + 10.0
+
+      with ops.device("/job:worker/task:0"):
+        result = math_ops.add_n(xs)
+
+      variables.global_variables_initializer().run()
+      result_value = sess.run(result)
+    self.assertEqual(result_value, expected)
+    if result_value == expected:
+      self._result_correct += 1
+
+  def testInGraph(self):
+    """Test it runs in-graph replicated training correctly."""
+    distribute_coordinator.run_distribute_coordinator(
+        self._in_graph_worker_fn,
+        cluster_spec=self._cluster_spec,
+        between_graph=False)
+    self.assertEqual(self._result_correct, 1)
+
+  def _between_graph_worker_fn(self):
+    context = distribute_coordinator.get_current_worker_context()
+    self.assertTrue(context is not None)
+    with self._test_session(target=context.master_target) as sess:
+      with ops.device("/job:ps/task:0"):
+        # TODO(yuefengz): investigate why not using resource variable will make
+        # the test flaky.
+        x = variable_scope.get_variable(
+            "x", initializer=10.0, use_resource=True)
+      with ops.device("/job:ps/task:1"):
+        y = variable_scope.get_variable(
+            "y", initializer=20.0, use_resource=True)
+
+      x_add = x.assign_add(2.0)
+      y_sub = y.assign_sub(2.0)
+      train_op = control_flow_ops.group([x_add, y_sub])
+
+      if context.is_chief:
+        variables.global_variables_initializer().run()
+
+      # Synchronize workers after initializaton.
+      context.wait_for_other_workers()
+
+      sess.run(train_op)
+
+      # Synchronize workers after one step to make sure they all have finished
+      # training.
+      context.wait_for_other_workers()
+
+      x_val, y_val = sess.run([x, y])
+
+      self.assertEqual(x_val, 16.0)
+      self.assertEqual(y_val, 14.0)
+      if x_val == 16.0 and y_val == 14.0:
+        with self._lock:
+          self._result_correct += 1
+
+  def testBetweenGraph(self):
+    """Test it runs between-graph replicated training correctly."""
+    distribute_coordinator.run_distribute_coordinator(
+        self._between_graph_worker_fn,
+        cluster_spec=self._cluster_spec,
+        between_graph=True)
+
+    # Each finished worker will increment self._result_correct.
+    self.assertEqual(self._result_correct, NUM_WORKERS)
+
+  def _dump_worker_context(self):
+    """Dumps the propoerties of each worker context.
+
+    It dumps the context properties to a dict mapping from task_type to a list
+    of tuples of master_target, num_workers, is_chief and distribute_mode, where
+    the list is indexed by the task_id.
+    """
+    context = distribute_coordinator.get_current_worker_context()
+    self.assertTrue(context is not None)
+    task_type = str(context.task_type)
+    task_id = context.task_id or 0
+    with self._lock:
+      if task_type not in self._worker_context:
+        self._worker_context[task_type] = []
+      while len(self._worker_context[task_type]) <= task_id:
+        self._worker_context[task_type].append(None)
+      self._worker_context[task_type][task_id] = (context.master_target,
+                                                  context.num_workers,
+                                                  context.is_chief,
+                                                  context.distributed_mode)
+
+  def testBetweenGraphContext(self):
+    # Dumps the task contexts to the self._worker_context dict.
+    distribute_coordinator.run_distribute_coordinator(
+        self._dump_worker_context,
+        cluster_spec=self._cluster_spec,
+        between_graph=True)
+
+    # There is only one type of task and there three such tasks.
+    self.assertEqual(len(self._worker_context), 1)
+    self.assertTrue(WORKER in self._worker_context)
+    self.assertEqual(len(self._worker_context[WORKER]), NUM_WORKERS)
+
+    # Check whether each task has the right master_target, num_workers, is_chief
+    # and distributed_mode.
+    self.assertEqual(
+        self._worker_context[WORKER][0],
+        (_bytes_to_str(self._workers[0].target), NUM_WORKERS, True, True))
+    self.assertEqual(
+        self._worker_context[WORKER][1],
+        (_bytes_to_str(self._workers[1].target), NUM_WORKERS, False, True))
+    self.assertEqual(
+        self._worker_context[WORKER][2],
+        (_bytes_to_str(self._workers[2].target), NUM_WORKERS, False, True))
+
+  def testInGraphContext(self):
+    # Dumps the task contexts to the self._worker_context dict.
+    distribute_coordinator.run_distribute_coordinator(
+        self._dump_worker_context,
+        cluster_spec=self._cluster_spec,
+        between_graph=False)
+
+    # There is only a "None" task in the dumped task context.
+    self.assertEqual(len(self._worker_context), 1)
+    self.assertTrue("None" in self._worker_context)
+    self.assertEqual(len(self._worker_context["None"]), 1)
+
+    # Check whether each task has the right master_target, num_workers, is_chief
+    # and distributed_mode.
+    self.assertEqual(
+        self._worker_context["None"][0],
+        (_bytes_to_str(self._workers[0].target), NUM_WORKERS, True, True))
+
+  def testLocalContext(self):
+    # Dumps the task contexts to the self._worker_context dict.
+    distribute_coordinator.run_distribute_coordinator(
+        self._dump_worker_context, cluster_spec=None, between_graph=True)
+
+    # There is only a "None" task.
+    self.assertEqual(len(self._worker_context), 1)
+    self.assertTrue("None" in self._worker_context)
+    self.assertEqual(len(self._worker_context["None"]), 1)
+
+    # Check whether each task has the right master_target, num_workers, is_chief
+    # and distributed_mode.
+    self.assertEqual(self._worker_context["None"][0], ("local", 0, True, False))
+
+  def testBetweenGraphContextWithChief(self):
+    # Adds a chief node, so there are NUM_WORKERS + 1 workers in total.
+    cluster_spec = copy.deepcopy(self._cluster_spec)
+    cluster_spec[CHIEF] = ["fake_chief"]
+
+    # Dumps the task contexts to the self._worker_context dict.
+    distribute_coordinator.run_distribute_coordinator(
+        self._dump_worker_context,
+        cluster_spec=cluster_spec,
+        between_graph=True,
+        rpc_layer="grpc")
+
+    # There are one CHIEF and three workers.
+    self.assertEqual(len(self._worker_context), 2)
+    self.assertTrue(CHIEF in self._worker_context)
+    self.assertTrue(WORKER in self._worker_context)
+    self.assertEqual(len(self._worker_context[CHIEF]), 1)
+    self.assertEqual(len(self._worker_context[WORKER]), NUM_WORKERS)
+
+    # Check whether each task has the right master_target, num_workers, is_chief
+    # and distributed_mode.
+    self.assertEqual(self._worker_context[CHIEF][0],
+                     ("grpc://fake_chief", 4, True, True))
+    self.assertEqual(self._worker_context[WORKER][0],
+                     ("grpc://" + _bytes_to_str(self._workers[0].target),
+                      NUM_WORKERS + 1, False, True))
+    self.assertEqual(self._worker_context[WORKER][1],
+                     ("grpc://" + _bytes_to_str(self._workers[1].target),
+                      NUM_WORKERS + 1, False, True))
+    self.assertEqual(self._worker_context[WORKER][2],
+                     ("grpc://" + _bytes_to_str(self._workers[2].target),
+                      NUM_WORKERS + 1, False, True))
+
+  def testInGraphContextWithEval(self):
+    # Adds a EVALUATOR job.
+    cluster_spec = copy.deepcopy(self._cluster_spec)
+    cluster_spec[EVALUATOR] = ["fake_evaluator"]
+
+    # Dumps the task contexts to the self._worker_context dict.
+    distribute_coordinator.run_distribute_coordinator(
+        self._dump_worker_context,
+        cluster_spec=cluster_spec,
+        between_graph=False)
+
+    # There are one "None" task and one EVALUATOR task.
+    self.assertEqual(len(self._worker_context), 2)
+    self.assertTrue("None" in self._worker_context)
+    self.assertTrue(EVALUATOR in self._worker_context)
+    self.assertEqual(len(self._worker_context["None"]), 1)
+    self.assertEqual(len(self._worker_context[EVALUATOR]), 1)
+
+    # Check whether each task has the right master_target, num_workers, is_chief
+    # and distributed_mode.
+    self.assertEqual(self._worker_context["None"][0],
+                     (_bytes_to_str(self._workers[0].target), 3, True, True))
+    self.assertEqual(self._worker_context[EVALUATOR][0],
+                     ("fake_evaluator", 3, False, True))
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/eager/BUILD b/tensorflow/python/eager/BUILD
index 6ede8e4f4d..32a8452f62 100644
--- a/tensorflow/python/eager/BUILD
+++ b/tensorflow/python/eager/BUILD
@@ -322,6 +322,7 @@ cuda_py_test(
         "//tensorflow/python:math_ops",
         "//tensorflow/python:pywrap_tensorflow",
         "//tensorflow/python:random_ops",
+        "//tensorflow/python/keras",
     ],
 )
 
@@ -404,6 +405,7 @@ cuda_py_test(
         "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:framework_test_lib",
+        "@six_archive//:six",
     ],
     tags = [
         "optonly",  # The test is too slow in non-opt mode
diff --git a/tensorflow/python/eager/backprop.py b/tensorflow/python/eager/backprop.py
index 3e3c82e56a..9cd39d02da 100644
--- a/tensorflow/python/eager/backprop.py
+++ b/tensorflow/python/eager/backprop.py
@@ -276,7 +276,7 @@ def implicit_grad(f):
 def _get_arg_spec(f, params, param_args):
   """The positions of the parameters of f to be differentiated in param_args."""
   try:
-    args = tf_inspect.getargspec(f).args
+    args = tf_inspect.getfullargspec(f).args
   except TypeError as e:
     # TypeError can happen when f is a callable object.
     if params is None:
@@ -599,15 +599,18 @@ def _fast_fill(value, shape, dtype):
 
 
 def _zeros(shape, dtype):
-  """Wraps array_ops.zeros to cache last zero for a given shape and dtype."""
-  device = context.context().device_name
+  """Helper to return (possibly cached) zero tensors in eager mode."""
   if dtype == dtypes.variant:
     # TODO(apassos): need to save enough information about variant tensors to do
     # a zeros
     return None
-  # pylint: disable=protected-access
-  cache_key = shape, dtype, device, context.context()._eager_context.mode
-  # pylint: enable=protected-access
+
+  ctx = context.context()
+  if not ctx.executing_eagerly():
+    return array_ops.zeros(shape, dtype)
+
+  device = ctx.device_name
+  cache_key = shape, dtype, device
   cached = _zeros_cache.get(cache_key)
   if cached is None:
     cached = _fast_fill(0, shape, dtype)
@@ -616,6 +619,9 @@ def _zeros(shape, dtype):
 
 
 def _ones(shape, dtype):
+  if not context.context().executing_eagerly():
+    return array_ops.ones(shape, dtype)
+
   if shape == ():  # pylint: disable=g-explicit-bool-comparison
     return constant_op.constant(1, dtype=dtype)
   return _fast_fill(1, shape, dtype)
@@ -643,10 +649,10 @@ class GradientTape(object):
   Operations are recorded if they are executed within this context manager and
   at least one of their inputs is being "watched".
 
-  Trainable variables (created by `tf.contrib.eager.Variable` or
-  @{tf.get_variable}, trainable=True is default in both cases) are automatically
-  watched. Tensors can be manually watched by invoking the `watch` method on
-  this context manager.
+  Trainable variables (created by `tf.Variable` or @{tf.get_variable},
+  trainable=True is default in both cases) are automatically watched. Tensors
+  can be manually watched by invoking the `watch` method on this context
+  manager.
 
   For example, consider the function `y = x * x`. The gradient at `x = 3.0` can
   be computed as:
@@ -713,10 +719,15 @@ class GradientTape(object):
     if self._recording:
       self._pop_tape()
 
-  def _push_tape(self):
+  def _push_tape(self, existing_tape=False):
     if self._recording:
       raise ValueError("Tape is already recording.")
-    self._tape = tape.push_new_tape(persistent=self._persistent)
+    if existing_tape:
+      if self._tape is None:
+        raise ValueError("There is no existing tape.")
+      tape.push_tape(self._tape)
+    else:
+      self._tape = tape.push_new_tape(persistent=self._persistent)
     self._recording = True
 
   def _pop_tape(self):
@@ -764,7 +775,7 @@ class GradientTape(object):
     try:
       yield
     finally:
-      self._push_tape()
+      self._push_tape(existing_tape=True)
 
   def reset(self):
     """Clears all information stored in this tape.
diff --git a/tensorflow/python/eager/backprop_test.py b/tensorflow/python/eager/backprop_test.py
index ebbd3cd98e..3d3f54b9c4 100644
--- a/tensorflow/python/eager/backprop_test.py
+++ b/tensorflow/python/eager/backprop_test.py
@@ -96,6 +96,19 @@ class BackpropTest(test.TestCase):
     self.assertAllEqual(grads_and_vars[0][0], 1.0)
     self.assertAllEqual(id(grads_and_vars[0][1]), id(x))
 
+  def testGradientInsideLoop(self):
+    with ops.Graph().as_default():
+      v = resource_variable_ops.ResourceVariable(1.0)
+
+      def body(_):
+        _ = v + 1.0  # This reads the variable inside the loop context
+        with backprop.GradientTape() as t:
+          result = v * 2
+        self.assertTrue(t.gradient(result, v) is not None)
+        return 1.0
+
+      control_flow_ops.while_loop(lambda i: False, body, [1.0])
+
   def testWhereGradient(self):
     # Note: where is special because only some of its arguments are of
     # differentiable dtypes.
@@ -223,11 +236,23 @@ class BackpropTest(test.TestCase):
 
   def testTapeStopRecording(self):
     with backprop.GradientTape() as t:
-      x = constant_op.constant(1.0)
+      x = resource_variable_ops.ResourceVariable(1.0)
       with t.stop_recording():
         y = x * x
     self.assertEqual(t.gradient(y, x), None)
 
+  def testTapeStopStartRecording(self):
+    with backprop.GradientTape(persistent=True) as t:
+      x = resource_variable_ops.ResourceVariable(1.0)
+      x2 = x * 2  # This should be differentiated through.
+      with t.stop_recording():
+        y = x2 * x2
+      z = x2 * x2
+    self.assertEqual(t.gradient(y, x2), None)
+
+    # If the x*2 was not differentiated through, this would be 2.0, not 4.0
+    self.assertEqual(t.gradient(z, x2).numpy(), 4.0)
+
   def testTapeReset(self):
     with backprop.GradientTape() as t:
       v = resource_variable_ops.ResourceVariable(1.0)
@@ -900,32 +925,23 @@ class BackpropTest(test.TestCase):
         'did you forget to return a value from fn?'):
       val_and_grads_fn(x, y)
 
-  def testZerosCacheDoesntLeakAcrossModes(self):
-    with ops.Graph().as_default():
-      t = random_ops.random_normal(shape=[100, 2])
-      x = random_ops.random_normal(shape=[100, 4])
-      dy = random_ops.random_normal(shape=[100, 4])
-      with backprop.GradientTape() as gradient_tape:
-        gradient_tape.watch(x)
-        x1, _ = array_ops.split(x, num_or_size_splits=2, axis=1)
-        y1 = x1 ** 2.
-        y = array_ops.concat([y1, t], axis=1)
-
-      dx = gradient_tape.gradient(y, x, output_gradients=dy)
-      with self.test_session() as sess:
-        sess.run(variables.global_variables_initializer())
-        sess.run(dx)
-
-    t = random_ops.random_normal(shape=[100, 2])
-    x = random_ops.random_normal(shape=[100, 4])
-    dy = random_ops.random_normal(shape=[100, 4])
-    with backprop.GradientTape() as gradient_tape:
-      gradient_tape.watch(x)
-      x1, _ = array_ops.split(x, num_or_size_splits=2, axis=1)
-      y1 = x1 ** 2.
-      y = array_ops.concat([y1, t], axis=1)
-
-    dx = gradient_tape.gradient(y, x, output_gradients=dy)
+  def testZerosCacheDoesntLeakAcrossGraphs(self):
+    with context.graph_mode():
+      def get_grad():
+        with ops.Graph().as_default(), self.test_session():
+          t = constant_op.constant(1, dtype=dtypes.float32, shape=(10, 4))
+          x = constant_op.constant(2, dtype=dtypes.float32, shape=(10, 4))
+          with backprop.GradientTape() as gt:
+            tape.watch(x)
+            x1, _ = array_ops.split(x, num_or_size_splits=2, axis=1)
+            y1 = x1**2
+            y = array_ops.concat([y1, t], axis=1)
+          return self.evaluate(gt.gradient(y, x))
+
+      grad1 = get_grad()
+      grad2 = get_grad()
+
+      self.assertAllEqual(grad1, grad2)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/eager/benchmarks_test.py b/tensorflow/python/eager/benchmarks_test.py
index 3aad4a114a..afc4bf0066 100644
--- a/tensorflow/python/eager/benchmarks_test.py
+++ b/tensorflow/python/eager/benchmarks_test.py
@@ -31,6 +31,7 @@ import numpy as np
 import six
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
+from tensorflow.python import keras
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import backprop  # pylint: disable=unused-import
 from tensorflow.python.eager import context
@@ -70,6 +71,25 @@ def c_tfe_py_fastpath_execute(a,
     six.raise_from(core._status_to_exception(e.code, message), None)
 
 
+class SubclassedKerasModel(keras.Model):
+
+  def __init__(self):
+    super(SubclassedKerasModel, self).__init__()
+    self.layer = keras.layers.Dense(
+        10, kernel_initializer="ones", bias_initializer="zeros")
+
+  def call(self, x):
+    return self.layer(x)
+
+
+def make_keras_model():
+  x = keras.Input(shape=(10,))
+  y = keras.layers.Dense(
+      10, kernel_initializer="ones", bias_initializer="zeros")(
+          x)
+  return keras.Model(inputs=x, outputs=y)
+
+
 class MicroBenchmarks(test.Benchmark):
 
   def __init__(self):
@@ -115,6 +135,7 @@ class MicroBenchmarks(test.Benchmark):
 
     def func():
       ops.EagerTensor(value, context=handle, device=device, dtype=dtype)
+
     self._run(func, 30000)
 
   def benchmark_create_float_tensor_from_list_CPU(self):
@@ -211,8 +232,8 @@ class MicroBenchmarks(test.Benchmark):
     inputs = [m]
 
     def f():
-      pywrap_tensorflow.TFE_Py_Execute(
-          ctx_handle, None, "Identity", inputs, attrs, 1)
+      pywrap_tensorflow.TFE_Py_Execute(ctx_handle, None, "Identity", inputs,
+                                       attrs, 1)
 
     self._run(f, 30000)
 
@@ -234,14 +255,13 @@ class MicroBenchmarks(test.Benchmark):
     def f():
       with backprop.GradientTape():
         pass
+
     self._run(f, 30000)
 
   def benchmark_tf_gradient_function_no_op(self):
     with context.device(CPU):
       m = gen_array_ops.identity(self._m_2)
-      self._run(
-          lambda: backprop.gradients_function(lambda x: x, [0])(m),
-          30000)
+      self._run(lambda: backprop.gradients_function(lambda x: x, [0])(m), 30000)
 
   def _benchmark_np_matmul(self, m, transpose_b, num_iters):
     a = m.cpu().numpy()
@@ -255,6 +275,7 @@ class MicroBenchmarks(test.Benchmark):
     self._run(func, num_iters, execution_mode=execution_mode)
 
   def _benchmark_gen_math_ops_matmul(self, m, transpose_b, num_iters):
+
     def func():
       gen_math_ops.mat_mul(m, m, transpose_b=transpose_b)
 
@@ -276,9 +297,10 @@ class MicroBenchmarks(test.Benchmark):
     device = context.context().device_name
     attrs = ("transpose_a", False, "transpose_b", transpose_b, "T",
              m.dtype.as_datatype_enum)
+
     def func():
-      pywrap_tensorflow.TFE_Py_Execute(ctx_handle, device, "MatMul",
-                                       inputs, attrs, 1)
+      pywrap_tensorflow.TFE_Py_Execute(ctx_handle, device, "MatMul", inputs,
+                                       attrs, 1)
 
     self._run(func, num_iters)
 
@@ -542,6 +564,30 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_read_variable_with_tape(
           m, num_iters=self._num_iters_2_by_2)
 
+  def benchmark_keras_model_subclassed(self):
+    model = SubclassedKerasModel()
+    data = random_ops.random_uniform((10, 10))
+
+    func = lambda: model(data)
+    # First call is more expensive (creates variables etc.), discount that.
+    func()
+
+    # The whole point of this test is to contrast subclassing with
+    # the functional style of keras model building, so validate that
+    # the models are equivalent.
+    assert np.equal(func(), make_keras_model()(data)).all()
+
+    self._run(func, 30000)
+
+  def benchmark_keras_model_functional(self):
+    model = make_keras_model()
+    data = random_ops.random_uniform((10, 10))
+    func = lambda: model(data)
+    # Symmetry with benchmark_keras_model_subclassed
+    func()
+    assert np.equal(func(), SubclassedKerasModel()(data)).all()
+    self._run(func, 30000)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/eager/context.py b/tensorflow/python/eager/context.py
index 85b9491903..495a674526 100644
--- a/tensorflow/python/eager/context.py
+++ b/tensorflow/python/eager/context.py
@@ -177,6 +177,11 @@ class Context(object):
         - tf.contrib.eager.SYNC: executes each operation synchronously.
         - tf.contrib.eager.ASYNC: executes each operation asynchronously. These
           operations may return "non-ready" handles.
+      server_def: (Optional.) A tensorflow::ServerDef proto.
+        Enables execution on remote devices. GrpcServers need to be started by
+        creating an identical server_def to this, and setting the appropriate
+        task_indexes, so that the servers can communicate. It will then be
+        possible to execute operations on remote devices.
 
     Raises:
      ValueError: If execution_mode is not valid.
diff --git a/tensorflow/python/eager/core_test.py b/tensorflow/python/eager/core_test.py
index 3fabe7060e..cc765725a4 100644
--- a/tensorflow/python/eager/core_test.py
+++ b/tensorflow/python/eager/core_test.py
@@ -610,6 +610,14 @@ class TFETest(test_util.TensorFlowTestCase):
       self.assertEquals(typ, dtypes.float32)
       self.assertIsInstance(t, ops.EagerTensor)
 
+  def testConvertMixedEagerTensorsWithVariables(self):
+    var = resource_variable_ops.ResourceVariable(1.0)
+    types, tensors = execute_lib.convert_to_mixed_eager_tensors(
+        ['foo', var], context.context())
+    self.assertAllEqual([dtypes.string, dtypes.float32], types)
+    for t in tensors:
+      self.assertIsInstance(t, ops.EagerTensor)
+
 
 class SendRecvTest(test_util.TensorFlowTestCase):
 
diff --git a/tensorflow/python/eager/execute.py b/tensorflow/python/eager/execute.py
index 2ff5b8d8f4..f9b8d2cb5d 100644
--- a/tensorflow/python/eager/execute.py
+++ b/tensorflow/python/eager/execute.py
@@ -198,11 +198,7 @@ def args_to_matching_eager(l, ctx, default_dtype=None):
 
 
 def convert_to_mixed_eager_tensors(values, ctx):
-  v = [
-      t if isinstance(t, ops.EagerTensor) else ops.EagerTensor(
-          t, context=ctx._handle, device=ctx.device_name)  # pylint: disable=protected-access
-      for t in values
-  ]
+  v = [ops.internal_convert_to_tensor(t, ctx=ctx) for t in values]
   types = [t._datatype_enum() for t in v]  # pylint: disable=protected-access
   return types, v
 
diff --git a/tensorflow/python/eager/function.py b/tensorflow/python/eager/function.py
index 08470f65b0..51ebcd65b3 100644
--- a/tensorflow/python/eager/function.py
+++ b/tensorflow/python/eager/function.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 
 import collections
 import functools
+import threading
 
 import numpy as np
 
@@ -36,6 +37,7 @@ from tensorflow.python.framework import dtypes as dtypes_module
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import functional_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.util import compat
@@ -91,10 +93,11 @@ def capture_value(tensor_map, value, dtype, name):
 class CapturingGraph(ops.Graph):
   """Graph used when constructing eager functions."""
 
-  def __init__(self, captures):
+  def __init__(self):
     super(CapturingGraph, self).__init__()
     self._building_function = True
-    self.captures = captures
+    # Maps external tensor id -> internal tensor (e.g. input placeholder).
+    self.captures = {}
     # Map from resource tensor name to last op (in program order) which uses
     # this tensor. Used to enforce that execution order matches program order
     # for resource tensors.
@@ -129,14 +132,26 @@ class CapturingGraph(ops.Graph):
       op_def=None,
       compute_shapes=True,
       compute_device=True):
-    # TODO(apassos) this should do some form of alias analysis as ops which
-    # forward the resources such as Identity and Switch can cause serialization
-    # to fail.
+    # This capturing logic interacts poorly with control flow contexts which
+    # want to replace inputs of ops far too late in the process. This can lead
+    # the context to get confused and try to create an Enter for an Enter. We
+    # can detect this here and skip the additional Enter which can confuse loop
+    # validation logic.
+    if op_type == "Enter" and inputs[0].op.type == "Enter":
+      if inputs[0].op.get_attr("frame_name") == attrs["frame_name"].s:
+        return inputs[0].op
+    # Calling AddValue on the control flow contexts to force creation of the
+    # backward accumulators in the original graph before we create placeholders
+    # to capture the inputs.
+    ctxt = ops.get_default_graph()._control_flow_context  # pylint: disable=protected-access
     for i, inp in enumerate(inputs):
-      inputs[i] = self.capture(inp)
+      if ctxt is not None and hasattr(ctxt, "AddValue"):
+        inp = ctxt.AddValue(inp)
+      inp = self.capture(inp)
+      inputs[i] = inp
     return super(CapturingGraph, self).create_op(
         op_type, inputs, dtypes, input_types, name, attrs, op_def,
-        compute_shapes, compute_device)
+        compute_device=compute_device)
 
 
 # pylint: disable=invalid-name
@@ -231,11 +246,20 @@ def _register(fn):
   context.context().add_function(fn)
 
 
+_xla_compile_attr = "_XlaCompile"
+
+
 # TODO(apassos) get rid of this by splitting framework.function._DefinedFunction
 # so it doesn't have the definition-generating logic and is just a container for
 # an already-defined function.
 class _EagerDefinedFunction(object):
-  """Function object with the interface of tf _DefinedFunction."""
+  """Callable with the interface of `framework.function._DefinedFunction.`
+
+  `_EagerDefinedFunction` encapsulates a function definition and its properties,
+  and it provides a method for calling the encapsulated function. Some Ops
+  take functions as attributes, which have type `func`; an instance of this
+  class may be provided as the value of these `func` attributes.
+  """
 
   def __init__(self, name, graph, operations, inputs, outputs, attrs):
     """Initializes an eager defined function.
@@ -266,6 +290,7 @@ class _EagerDefinedFunction(object):
       # It might be worth creating a convenient way to re-use status.
       pywrap_tensorflow.TF_FunctionSetAttrValueProto(
           fn, compat.as_str(name), serialized)
+    self._xla_compile = _xla_compile_attr in attrs
 
     # TODO(apassos) avoid creating a FunctionDef (specially to grab the
     # signature, but also in general it's nice not to depend on it.
@@ -277,12 +302,92 @@ class _EagerDefinedFunction(object):
     if context.executing_eagerly():
       _register(fn)
     self.definition = function_def
-    self.name = function_def.signature.name
+    self.name = compat.as_bytes(function_def.signature.name)
     self.signature = function_def.signature
+    self._num_outputs = len(self.signature.output_arg)
+    self._output_types = [o.type for o in self.signature.output_arg]
     self.grad_func_name = None
     self.python_grad_func = None
     self._c_func = c_api_util.ScopedTFFunction(fn)
     self._grad_func = None
+    self._graph = graph
+    self._stateful_ops = tuple(op for op in operations if op.op_def.is_stateful)
+
+  def add_to_graph(self, g):
+    # pylint: disable=protected-access
+    if self.name not in g._functions:
+      g._add_function(self)
+    for f in self._graph._functions.values():
+      if f.name not in g._functions:
+        g._add_function(f)
+    # pylint: enable=protected-access
+
+  @property
+  def stateful_ops(self):
+    return self._stateful_ops
+
+  def call(self, ctx, args, output_shapes):
+    """Calls this function with `args` as inputs.
+
+    Function execution respects device annotations only if the function won't
+    be compiled with xla.
+
+    Args:
+      ctx: a Context object
+      args: a list of arguments to supply this function with.
+      output_shapes: shapes to which outputs should be set; ignored when
+        executing eagerly.
+
+    Returns:
+      The outputs of the function call.
+    """
+
+    executing_eagerly = ctx.executing_eagerly()
+
+    xla_compile = self._xla_compile or (executing_eagerly and
+                                        ctx.device_spec.device_type == "TPU")
+
+    if xla_compile:
+      # XLA compilation relies upon a custom kernel creator to run functions.
+      signature = self.signature
+      if executing_eagerly:
+        outputs = execute.execute(
+            str(signature.name),
+            num_outputs=self._num_outputs,
+            inputs=args,
+            attrs=None,
+            ctx=ctx)
+      else:
+        g = ops.get_default_graph()
+        self.add_to_graph(g)
+        op = g.create_op(
+            signature.name,
+            [ops.internal_convert_to_tensor(x, ctx=ctx) for x in args],
+            tuple(dtypes_module.DType(x.type) for x in signature.output_arg),
+            op_def=signature,
+            name="FunctionCall",
+            compute_shapes=False)
+        outputs = op.outputs
+        if not outputs:
+          return op
+        outputs = [outputs] if isinstance(
+            outputs, (ops.Tensor, type(None))) else list(outputs)
+    else:
+      # TODO(akshayka): Either remove this if the FunctionLibraryRuntime
+      # creates `PartitionedCallOp` kernels by default, or remove the previous
+      # branch if a TPU kernel is registered for `PartitionedCall`.
+      outputs = functional_ops.partitioned_call(
+          args=args,
+          f=self,
+          tout=self._output_types,
+          executing_eagerly=executing_eagerly)
+
+    if executing_eagerly:
+      return outputs
+    else:
+      for i, shape in enumerate(output_shapes):
+        outputs[i].set_shape(shape)
+      return outputs
 
 
 def _map_sequence_obj_to_idx(sequence):
@@ -306,8 +411,12 @@ def _flatten(sequence):
   return outputs
 
 
+# TODO(akshayka): Perhaps rename to something more appropriate.
 class GraphModeFunction(object):
-  """Callable object representing a graph-mode function.
+  """Callable object encapsulating a function definition and its gradient.
+
+  `GraphModeFunction` is a callable that encapsulates a function definition and
+  is differentiable under `tf.GradientTape` objects.
   """
 
   def __init__(self,
@@ -360,7 +469,6 @@ class GraphModeFunction(object):
     self._func_name = name
     self._function_def = defined_function
     self._num_outputs = len(defined_function.signature.output_arg)
-    self._ops = operations
     self._python_func_outputs = python_func_outputs
     self._python_returns = [python_func_outputs] if isinstance(
         python_func_outputs,
@@ -374,37 +482,45 @@ class GraphModeFunction(object):
 
   def _construct_backprop_function(self):
     """Constructs the backprop function object for this function."""
-    with self._graph.as_default(), context.graph_mode():
-      c_known_ops = set()
-      c_captured_tensors = set()
-
-      existing_op_len = len(self._graph.get_operations())
-      filtered_outputs = [x for x in self._python_returns if x is not None]
+    filtered_outputs = [x for x in self._python_returns if x is not None]
+    backwards_graph = CapturingGraph()
+    backwards_graph._graph_key = self._graph._graph_key  # pylint: disable=protected-access
+    for collection in self._graph.collections:
+      backwards_graph.get_collection_ref(
+          collection)[:] = self._graph.get_collection(collection)
+    backwards_graph.seed = self._graph.seed
+    with backwards_graph.as_default():
       self._out_grad_placeholders = [
           graph_placeholder(x.dtype, x.shape) for x in filtered_outputs]
-      in_gradients = gradients_impl.gradients(
+      in_gradients = gradients_impl._GradientsHelper(  # pylint: disable=protected-access
           filtered_outputs,
           self._input_placeholders,
-          grad_ys=self._out_grad_placeholders)
-      for op in self._graph.get_operations()[existing_op_len:]:
-        if op.type in ["Variable", "VariableV2", "VarHandleOp"]:
-          raise ValueError("tfe.defun cannot capture variables created without "
-                           "using tf.get_variable. Op: %s" % op)
-        c_known_ops.add(op)
-        for i in op.inputs:
-          if i.op not in c_known_ops:
-            c_captured_tensors.add(i)
+          grad_ys=self._out_grad_placeholders,
+          src_graph=self._graph)
 
     backward_outputs = tuple(
         grad for grad in _flatten(in_gradients) if grad is not None)
     output_shapes = tuple(grad.shape for grad in backward_outputs)
 
-    captures = list(sorted(c_captured_tensors, key=lambda x: x.name))
+    captures = backwards_graph.captures
+    ids = list(sorted(captures.keys()))
+    if ids:
+      extra_inputs, extra_placeholders = zip(*[captures[x] for x in ids])
+    else:
+      extra_inputs = []
+      extra_placeholders = []
+
     forward_name = _forward_name(self._func_name)
+    # Note: we cannot have placeholder ops in the graph or the TPU compilation
+    # pass fails.
+    placeholder_ops = set([y.op for y in self._input_placeholders])
+    function_ops = [x for x in self._graph.get_operations()
+                    if x not in placeholder_ops]
     self._forward_fdef = _EagerDefinedFunction(
-        forward_name, self._graph, self._ops, self._input_placeholders,
-        filtered_outputs + captures, self._attrs)
-    all_inputs = self._out_grad_placeholders + captures
+        forward_name, self._graph, function_ops,
+        self._input_placeholders, filtered_outputs + list(extra_inputs),
+        self._attrs)
+    all_inputs = self._out_grad_placeholders + list(extra_placeholders)
     # Excluding input ops from the body as we do not intend to execute these
     # operations when the function is executed.
     all_ignored_ops = frozenset(x.op for x in all_inputs)
@@ -412,11 +528,12 @@ class GraphModeFunction(object):
     # means rerunning the function-defining code will always define the same
     # function, which is useful if we serialize this etc.
     function_def_ops = tuple(x
-                             for x in sorted(c_known_ops, key=lambda x: x.name)
+                             for x in sorted(backwards_graph.get_operations(),
+                                             key=lambda x: x.name)
                              if x not in all_ignored_ops)
     bname = _backward_name(self._func_name)
     self._backward_function = GraphModeFunction(
-        bname, all_inputs, [], self._graph, function_def_ops,
+        bname, all_inputs, [], backwards_graph, function_def_ops,
         backward_outputs, in_gradients, output_shapes, attrs=self._attrs)
 
   def _backprop_call(self, args):
@@ -430,35 +547,10 @@ class GraphModeFunction(object):
       The call output.
     """
     all_args = args + self._extra_inputs
-    signature = self._forward_fdef.signature
     ctx = context.context()
-    if ctx.executing_eagerly():
-      outputs = execute.execute(
-          str(signature.name),
-          num_outputs=len(signature.output_arg),
-          inputs=all_args,
-          attrs=None,
-          ctx=ctx)
-      if not outputs:
-        return None
-    else:
-      g = ops.get_default_graph()
-      g._add_function(self._forward_fdef)  # pylint: disable=protected-access
-      op = g.create_op(
-          signature.name,
-          [ops.internal_convert_to_tensor(x, ctx=ctx) for x in all_args],
-          tuple(dtypes_module.DType(x.type) for x in signature.output_arg),
-          op_def=signature,
-          name="FunctionCall",
-          compute_shapes=False)
-      outputs = op.outputs
-      if not outputs:
-        return op
-      outputs = [outputs] if isinstance(outputs, ops.Tensor) else list(outputs)
-
-      shapes = [shape for shape in self._output_shapes if shape is not None]
-      for i, shape in enumerate(shapes):
-        outputs[i].set_shape(shape)
+    outputs = self._forward_fdef.call(ctx, all_args, self._output_shapes)
+    if isinstance(outputs, ops.Operation) or outputs is None:
+      return outputs
 
     # `real_outputs` are the actual outputs of the inference graph function;
     # `side_outputs` are the intermediate Tensors that were added as outputs to
@@ -470,7 +562,7 @@ class GraphModeFunction(object):
       return self._backward_function(*(list(args) + side_outputs))  # pylint: disable=not-callable
 
     tape.record_operation(
-        signature.name,
+        self._forward_fdef.signature.name,
         real_outputs,
         (args + self._extra_inputs),
         backward_function)
@@ -512,13 +604,6 @@ class GraphModeFunction(object):
     """Returns the name of the function in Eager-compatible format."""
     return self._function_def.name.encode("utf-8")
 
-  def add_to_graph(self, g):
-    if self._function_def.name not in g._functions:  # pylint: disable=protected-access
-      g._add_function(self._function_def)  # pylint: disable=protected-access
-    for f in self._graph._functions.values():  # pylint: disable=protected-access
-      if f.name not in g._functions:  # pylint: disable=protected-access
-        g._add_function(f)  # pylint: disable=protected-access
-
   def __call__(self, *args):
     """Executes the passed function in eager mode."""
     for v in self._variables:
@@ -533,34 +618,9 @@ class GraphModeFunction(object):
       return self._backprop_call(tensor_inputs)
 
     ctx = context.context()
-    if ctx.executing_eagerly():
-      result = execute.execute(
-          str(self._func_name),
-          num_outputs=self._num_outputs,
-          inputs=tensor_inputs + self._extra_inputs,
-          attrs=None,
-          ctx=ctx)
-    else:
-      g = ops.get_default_graph()
-      self.add_to_graph(g)
-      signature = self._function_def.definition.signature
-      args = list(tensor_inputs) + self._extra_inputs
-      op = g.create_op(
-          signature.name,
-          [ops.internal_convert_to_tensor(x, ctx=ctx) for x in args],
-          tuple(dtypes_module.DType(x.type) for x in signature.output_arg),
-          op_def=signature,
-          name="FunctionCall",
-          compute_shapes=False)
-      result = op.outputs
-      if not result:
-        return op
-
-      shapes = [shape for shape in self._output_shapes if shape is not None]
-      for i, shape in enumerate(shapes):
-        result[i].set_shape(shape)
-
-    return self._build_call_outputs(result)
+    args = tensor_inputs + self._extra_inputs
+    outputs = self._function_def.call(ctx, args, self._output_shapes)
+    return self._build_call_outputs(outputs)
 
   def _build_call_outputs(self, result):
     """Maps the fdef output list to actual output structure.
@@ -571,7 +631,8 @@ class GraphModeFunction(object):
       The actual call output.
     """
     if self._python_func_outputs is None:
-      return None
+      return result
+
     # Use `nest.flatten` instead of `_flatten` in order to preserve any
     # IndexedSlices in `self._python_func_outputs`.
     outputs_list = nest.flatten(self._python_func_outputs)
@@ -617,55 +678,82 @@ def _deterministic_dict_values(kwds):
 def _trace_and_define_function(name, func, compiled, args, kwds):
   """Defines and returns graph-mode version of func."""
   graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
-  with context.graph_mode():
-    captures = {}
-    tmp_graph = CapturingGraph(captures)
-    # Inherit the graph key, since this is used for matching variables in
-    # optimizers.
-    tmp_graph._graph_key = graph_key  # pylint: disable=protected-access
-    # Copy the graph collections to ensure summaries and other things work. This
-    # lets the function access (but not mutate) collections of the containing
-    # graph, such as the global step and the summary writer collections.
-    curr_graph = ops.get_default_graph()
-    for collection in curr_graph.collections:
-      tmp_graph.get_collection_ref(collection)[:] = curr_graph.get_collection(
-          collection)
-    with tmp_graph.as_default(), AutomaticControlDependencies() as a:
-      func_args = _get_defun_inputs(args)
-      func_kwds = _get_defun_inputs(kwds)
-
-      def convert(x):
-        if x is None:
-          return None
-        x = ops.convert_to_tensor_or_indexed_slices(x)
-        x = a.mark_as_return(x)
-        return x
+  tmp_graph = CapturingGraph()
+  # Inherit the graph key, since this is used for matching variables in
+  # optimizers.
+  tmp_graph._graph_key = graph_key  # pylint: disable=protected-access
+  # Copy the graph collections to ensure summaries and other things work. This
+  # lets the function access (but not mutate) collections of the containing
+  # graph, such as the global step and the summary writer collections.
+  curr_graph = ops.get_default_graph()
+  for collection in curr_graph.collections:
+    tmp_graph.get_collection_ref(collection)[:] = curr_graph.get_collection(
+        collection)
+  if context.executing_eagerly():
+    tmp_graph.seed = context.global_seed()
+  else:
+    tmp_graph.seed = curr_graph.seed
+  with tmp_graph.as_default(), AutomaticControlDependencies() as a:
+    func_args = _get_defun_inputs(args)
+    func_kwds = _get_defun_inputs(kwds)
+
+    # Variables to help check whether mutation happens in calling the function
+    # Copy the recursive list, tuple and map structure, but not base objects
+    func_args_before = nest.pack_sequence_as(func_args, nest.flatten(func_args))
+    func_kwds_before = nest.pack_sequence_as(func_kwds, nest.flatten(func_kwds))
+
+    def convert(x):
+      if x is None:
+        return None
+      x = ops.convert_to_tensor_or_indexed_slices(x)
+      x = a.mark_as_return(x)
+      return x
 
-      this_tape = tape.push_new_tape()
-      try:
-        func_outputs = func(*func_args, **func_kwds)
-        func_outputs = nest.map_structure(convert, func_outputs)
-      finally:
-        tape.pop_tape(this_tape)
-      variables = this_tape.watched_variables()
-
-      # Returning a closed-over tensor as an output does not trigger a
-      # call to convert_to_tensor, so we manually capture all such tensors.
-      outputs_list = _flatten(func_outputs)
-      func_def_outputs = [
-          tmp_graph.capture(x) for x in outputs_list
-          if x is not None
-      ]
-
-      ids = list(sorted(captures.keys()))
-      if ids:
-        extra_inputs, extra_placeholders = zip(* [captures[x] for x in ids])
-      else:
-        extra_inputs = []
-        extra_placeholders = []
-      output_shapes = tuple(
-          x.shape if isinstance(x, ops.Tensor) else None
-          for x in outputs_list)
+    this_tape = tape.push_new_tape()
+    try:
+      func_outputs = func(*func_args, **func_kwds)
+      func_outputs = nest.map_structure(convert, func_outputs)
+
+      def check_mutation(n1, n2):
+        """Check if two list of arguments are exactly the same."""
+        errmsg = ("Function to be traced should not modify structure of input "
+                  "arguments. Check if your function has list and dictionary "
+                  "operations that alter input arguments, "
+                  "such as `list.pop`, `list.append`")
+        try:
+          nest.assert_same_structure(n1, n2)
+        except ValueError:
+          raise ValueError(errmsg)
+
+        for arg1, arg2 in zip(nest.flatten(n1), nest.flatten(n2)):
+          if arg1 is not arg2:
+            raise ValueError(errmsg)
+
+      check_mutation(func_args_before, func_args)
+      check_mutation(func_kwds_before, func_kwds)
+
+    finally:
+      tape.pop_tape(this_tape)
+    variables = this_tape.watched_variables()
+
+    # Returning a closed-over tensor as an output does not trigger a
+    # call to convert_to_tensor, so we manually capture all such tensors.
+    outputs_list = _flatten(func_outputs)
+    func_def_outputs = [
+        tmp_graph.capture(x) for x in outputs_list
+        if x is not None
+    ]
+
+    captures = tmp_graph.captures
+    ids = list(sorted(captures.keys()))
+    if ids:
+      extra_inputs, extra_placeholders = zip(* [captures[x] for x in ids])
+    else:
+      extra_inputs = []
+      extra_placeholders = []
+    output_shapes = tuple(
+        x.shape if isinstance(x, ops.Tensor) else None
+        for x in func_def_outputs)
 
   func_kwds_values = _deterministic_dict_values(func_kwds)
   flat_inputs = [
@@ -686,7 +774,7 @@ def _trace_and_define_function(name, func, compiled, args, kwds):
 
   attrs = {}
   if compiled:
-    attrs["_XlaCompile"] = attr_value_pb2.AttrValue(b=True)
+    attrs[_xla_compile_attr] = attr_value_pb2.AttrValue(b=True)
 
   return GraphModeFunction(
       fname, all_inputs, extra_inputs, tmp_graph, operations, func_def_outputs,
@@ -731,6 +819,11 @@ class _PolymorphicFunction(object):
 
   See the documentation for `defun` for more information on the semantics of
   defined functions.
+
+  _PolymorphicFunction class is thread-compatible meaning that minimal
+  usage of defuns (defining and calling) is thread-safe, but if users call other
+  methods or invoke the base `python_function` themselves, external
+  synchronization is necessary.
   """
 
   def __init__(self, python_function, name, compiled=False):
@@ -748,6 +841,8 @@ class _PolymorphicFunction(object):
     self._arguments_to_functions = {}
     self._variables = []
 
+    self._lock = threading.Lock()
+
   def __get__(self, instance, owner):
     """Makes it possible to defun instance methods."""
     del owner
@@ -786,15 +881,16 @@ class _PolymorphicFunction(object):
     # signature so we don't improperly capture tensors such as variables.
     signature += tuple([context.executing_eagerly() or ops.get_default_graph()])
 
-    if signature not in self._arguments_to_functions:
-      graph_function = _trace_and_define_function(
-          self._name, self._python_function, self._compiled, args, kwds)
-      self._arguments_to_functions[signature] = graph_function
-      self._variables.extend(
-          [v for v in graph_function.variables if v not in self._variables])
-      return graph_function, inputs
-    else:
-      return self._arguments_to_functions[signature], inputs
+    with self._lock:
+      if signature not in self._arguments_to_functions:
+        graph_function = _trace_and_define_function(
+            self._name, self._python_function, self._compiled, args, kwds)
+        self._arguments_to_functions[signature] = graph_function
+        self._variables.extend(
+            [v for v in graph_function.variables if v not in self._variables])
+        return graph_function, inputs
+      else:
+        return self._arguments_to_functions[signature], inputs
 
   def __call__(self, *args, **kwds):
     """Calls a graph function specialized for this input signature."""
@@ -839,8 +935,14 @@ def defun(func=None, compiled=False):
   `defun`-generated graphs.
 
   For a Python function to be compatible with `defun`, all of its arguments must
-  be hashable Python objects or lists thereof. Additionally, it must return zero
-  or more @{tf.Tensor} objects.
+  be hashable Python objects or lists thereof. The function itself may not
+  modify the list/map structure of its arguments. Additionally, it must return
+  zero or more @{tf.Tensor} objects.
+
+  Executing a graph generated by `defun` respects device annotations (i.e.,
+  all `with tf.device` directives present in a Python function will also be
+  present in its corresponding graph), but it is not yet possible to execute the
+  generated graphs across multiple machines.
 
   _Example Usage_
 
@@ -1021,7 +1123,7 @@ def defun(func=None, compiled=False):
   tf.enable_eager_execution()
 
   def fn():
-    x = tf.contrib.eager.Variable(0.0)
+    x = tf.Variable(0.0)
     x.assign_add(1.0)
     return x.read_value()
 
@@ -1038,19 +1140,18 @@ def defun(func=None, compiled=False):
   ```
 
   Finally, because each input signature is bound to a unique graph, if your
-  Python function constructs `tf.contrib.eager.Variable` objects, then each
-  graph constructed for that Python function will reference a unique set of
-  variables. To circumvent this problem, we recommend against compiling Python
-  functions that create `tf.contrib.eager.Variable` objects. Instead, Python
-  functions should either lexically close over `tf.contrib.eager.Variable`
-  objects or accept them as arguments, preferably encapsulated in an
-  object-oriented container. If you must create variables inside your Python
-  function and you want each graph generated for it to reference the same set of
-  variables, add logic to your Python function that ensures that variables are
-  only created the first time it is called and are reused for every subsequent
-  invocation; note that this is precisely what @{tf.keras.layers.Layer} objects
-  do, so we recommend using them to represent variable-bearing computations
-  whenever possible.
+  Python function constructs `tf.Variable` objects, then each graph constructed
+  for that Python function will reference a unique set of variables. To
+  circumvent this problem, we recommend against compiling Python functions that
+  create `tf.Variable` objects. Instead, Python functions should either
+  lexically close over `tf.Variable` objects or accept them as arguments,
+  preferably encapsulated in an object-oriented container. If you must create
+  variables inside your Python function and you want each graph generated for it
+  to reference the same set of variables, add logic to your Python function that
+  ensures that variables are only created the first time it is called and are
+  reused for every subsequent invocation; note that this is precisely what
+  @{tf.keras.layers.Layer} objects do, so we recommend using them to represent
+  variable-bearing computations whenever possible.
 
   Args:
     func: function to be compiled. If `func` is None, returns a
@@ -1252,7 +1353,7 @@ class AutomaticControlDependencies(object):
     # Ensures the merge always runs
     ops_which_must_run.add(new_merge[0].op)
     if inp in last_op_using_resource_tensor:
-      # Ensures the switch exectutes after the previous op using the resource.
+      # Ensures the switch executes after the previous op using the resource.
       switch_op._add_control_input(last_op_using_resource_tensor[inp])  # pylint: disable=protected-access
     # Ensure the next op outside the cond happens after the merge.
     last_op_using_resource_tensor[inp] = new_merge[0].op
diff --git a/tensorflow/python/eager/function_test.py b/tensorflow/python/eager/function_test.py
index e1801b7ec6..5efdecdbc6 100644
--- a/tensorflow/python/eager/function_test.py
+++ b/tensorflow/python/eager/function_test.py
@@ -18,16 +18,20 @@ from __future__ import division
 from __future__ import print_function
 
 import collections
+import sys
 
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
 from tensorflow.python.eager import function
 from tensorflow.python.eager import tape
-from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import function as tf_function
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util
 from tensorflow.python.layers import convolutional
@@ -37,10 +41,15 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
-from tensorflow.python.training import gradient_descent
+from tensorflow.python.platform import test
+from tensorflow.python.training import adam
+from tensorflow.python.training import momentum
+from tensorflow.python.training import training_ops
+from tensorflow.python.util import compat
 
 
 @test_util.with_c_shapes
@@ -131,6 +140,18 @@ class FunctionTest(test.TestCase):
     out = sq_op(t)
     self.assertAllEqual(out, math_ops.matmul(t, t).numpy())
 
+  def disabled_testRandomSeed(self):
+
+    @function.defun
+    def f():
+      return random_ops.random_normal(())
+
+    random_seed.set_random_seed(1)
+    x = f()
+    self.assertNotEqual(x, f())
+    random_seed.set_random_seed(1)
+    self.assertAllEqual(f(), x)
+
   def testNestedInputsDefunOpGraphMode(self):
     matmul = function.defun(math_ops.matmul)
 
@@ -193,6 +214,45 @@ class FunctionTest(test.TestCase):
     self.assertEqual(fn_op.output_shapes, None)
     self.assertAllEqual(fn_op(x, x), None)
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testDefunCondGradient(self):
+
+    @function.defun
+    def f(x):
+      return control_flow_ops.cond(x > 0.5, lambda: 2 * x, lambda: 3 * x)
+
+    with backprop.GradientTape() as t:
+      x = constant_op.constant(1.0)
+      t.watch(x)
+      y = f(x)
+    self.assertAllEqual(self.evaluate(t.gradient(y, x)), 2.0)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testGraphLoopGradient(self):
+    if context.executing_eagerly():
+      self.skipTest('TODO(apassos): support loops in defuns in eager')
+
+    @function.defun
+    def f(x):
+      return control_flow_ops.while_loop(lambda _, i: i < 2,
+                                         lambda x, i: (2*x, i + 1),
+                                         [x, 0])[0]
+
+    with backprop.GradientTape() as t:
+      x = constant_op.constant(1.0)
+      t.watch(x)
+      y = f(x)
+    self.assertAllEqual(self.evaluate(t.gradient(y, x)), 4.0)
+
+  def testDefunCapturedInt32(self):
+    x = constant_op.constant(1, dtype=dtypes.int32)
+
+    @function.defun
+    def add_int32s():
+      return x + x
+
+    self.assertEqual(2, int(add_int32s()))
+
   def testDefunReadVariable(self):
     v = resource_variable_ops.ResourceVariable(1.0)
 
@@ -204,13 +264,14 @@ class FunctionTest(test.TestCase):
 
   def testDefunAssignAddVariable(self):
     v = resource_variable_ops.ResourceVariable(1.0)
+    x = constant_op.constant(2.0)
 
     @function.defun
-    def f():
-      v.assign_add(2.0)
+    def test_assign_add():
+      v.assign_add(x)
       return v.read_value()
 
-    self.assertEqual(3.0, float(f()))
+    self.assertEqual(3.0, float(test_assign_add()))
 
   def testDefunShapeInferenceWithCapturedResourceVariable(self):
     v = resource_variable_ops.ResourceVariable([[1, 2], [3, 4]])
@@ -440,24 +501,33 @@ class FunctionTest(test.TestCase):
 
     self.assertAllEqual(f(constant_op.constant(1.0)), 2.0)
 
-  def testGradientOfGatherWithDefun(self):
+  def testGatherResourceWithDefun(self):
     with ops.device('cpu:0'):
       v = resource_variable_ops.ResourceVariable([0.0, 1.0, 2.0])
 
-      def sum_gather():
-        return math_ops.reduce_sum(array_ops.gather(v, [1, 2]))
+    def sum_gather():
+      return math_ops.reduce_sum(array_ops.gather(v, [1, 2]))
+
+    defined = function.defun(sum_gather)
+    self.assertAllEqual(sum_gather(), defined())
+
+  def testGradientOfGatherWithDefun(self):
+    v = resource_variable_ops.ResourceVariable([0.0, 1.0, 2.0])
+
+    def sum_gather():
+      return math_ops.reduce_sum(array_ops.gather(v, [1, 2]))
 
-      grad_fn = backprop.implicit_grad(sum_gather)
-      gradient = grad_fn()
-      defun_grad_fn = backprop.implicit_grad(function.defun(sum_gather))
-      defun_gradient = defun_grad_fn()
-      self.assertEqual(len(gradient), len(defun_gradient))
+    grad_fn = backprop.implicit_grad(sum_gather)
+    gradient = grad_fn()
+    defun_grad_fn = backprop.implicit_grad(function.defun(sum_gather))
+    defun_gradient = defun_grad_fn()
+    self.assertEqual(len(gradient), len(defun_gradient))
 
-      gradient = gradient[0][0]
-      defun_gradient = defun_gradient[0][0]
-      self.assertAllEqual(gradient.values, defun_gradient.values)
-      self.assertAllEqual(gradient.indices, defun_gradient.indices)
-      self.assertAllEqual(gradient.dense_shape, defun_gradient.dense_shape)
+    gradient = gradient[0][0]
+    defun_gradient = defun_gradient[0][0]
+    self.assertAllEqual(gradient.values, defun_gradient.values)
+    self.assertAllEqual(gradient.indices, defun_gradient.indices)
+    self.assertAllEqual(gradient.dense_shape, defun_gradient.dense_shape)
 
   def testReturningIndexedSlicesWithDefun(self):
 
@@ -521,6 +591,66 @@ class FunctionTest(test.TestCase):
     y = f(x, x).cpu()
     self.assertAllEqual(y, [2.])
 
+  @test_util.run_in_graph_and_eager_modes
+  def testFunctionWithResourcesOnDifferentDevices(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found.')
+
+    with ops.device('/cpu:0'):
+      v_cpu = resource_variable_ops.ResourceVariable([0.0, 1.0, 2.0])
+
+    with ops.device('/gpu:0'):
+      v_gpu = resource_variable_ops.ResourceVariable([0.0, 1.0, 2.0])
+
+    def sum_gather():
+      cpu_result = math_ops.reduce_sum(array_ops.gather(v_cpu, [1, 2]))
+      gpu_result = math_ops.reduce_sum(array_ops.gather(v_gpu, [1, 2]))
+      return cpu_result, gpu_result
+
+    defined = function.defun(sum_gather)
+    if not context.executing_eagerly():
+      self.evaluate(variables.global_variables_initializer())
+    expected = self.evaluate(sum_gather())
+    self.assertAllEqual(expected, self.evaluate(defined()))
+
+  @test_util.run_in_graph_and_eager_modes
+  def testOpInFunctionWithConflictingResourceInputs(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found.')
+
+    with ops.device('/cpu:0'):
+      v_cpu = resource_variable_ops.ResourceVariable(
+          [0.0, 1.0, 2.0], name='cpu')
+      v_also_cpu = resource_variable_ops.ResourceVariable(
+          [0.0, 1.0, 2.0], name='also_cpu')
+
+    with ops.device('/gpu:0'):
+      v_gpu = resource_variable_ops.ResourceVariable(
+          [0.0, 1.0, 2.0], name='gpu')
+
+    @function.defun
+    def resource_apply_adam():
+      training_ops.resource_apply_adam(
+          v_cpu.handle,
+          v_gpu.handle,
+          v_also_cpu.handle,
+          1.0,  # beta1_power
+          1.0,  # beta2_power
+          1.0,  # learning_rate
+          1.0,  # beta1
+          1.0,  # beta2
+          1.0,  # epsilon,
+          [1.0, 1.0, 1.0],  # grad
+          False)  # use_locking
+      return None
+
+    with self.assertRaisesRegexp(
+        errors.InvalidArgumentError, 'Could not colocate node with its '
+        'resource and reference inputs.*'):
+      if not context.executing_eagerly():
+        self.evaluate(variables.global_variables_initializer())
+      self.evaluate(resource_apply_adam())
+
   def testFunctionHandlesInputsOnDifferentDevices(self):
     if not context.context().num_gpus():
       self.skipTest('No GPUs found')
@@ -570,17 +700,17 @@ class FunctionTest(test.TestCase):
 
   def testNestedDifferentiableFunction(self):
     @function.defun
-    def foo(a, b):
+    def inner_fn(a, b):
       return a * math_ops.add(a, b)
 
     @function.defun
-    def bar(x):
-      return foo(x, 1.0)
+    def outer_fn(x):
+      return inner_fn(x, 1.0)
 
     x = constant_op.constant(5.0)
     with backprop.GradientTape() as tp:
       tp.watch(x)
-      result = bar(x)
+      result = outer_fn(x)
     grad = tp.gradient(result, x)
 
     self.assertAllEqual(grad, 2 * 5.0 + 1.0)
@@ -630,15 +760,15 @@ class FunctionTest(test.TestCase):
     self.assertAllEqual(3, add_one(constant_op.constant(2)))
 
   def testVariableCaptureInNestedFunctions(self):
-    v = resource_variable_ops.ResourceVariable(1)
+    v = resource_variable_ops.ResourceVariable(1, dtype=dtypes.int32)
 
     @function.defun
-    def read():
+    def inner_read():
       return v.read_value()
 
     @function.defun
     def outer():
-      return read()
+      return inner_read()
 
     self.assertEqual(1, int(outer()))
 
@@ -729,6 +859,27 @@ class FunctionTest(test.TestCase):
     y = model(x)
     self.assertAllEqual([[[[4.0]]]], y.numpy())
 
+  @test_util.run_in_graph_and_eager_modes(
+      config=config_pb2.ConfigProto(device_count={'CPU': 3}))
+  def testDeviceAnnotationsRespected(self):
+    @function.defun
+    def multi_device_fn():
+      with ops.device('/cpu:0'):
+        s1 = iterator_ops.Iterator.from_structure(
+            (dtypes.float32,)).string_handle()
+      with ops.device('/cpu:1'):
+        s2 = iterator_ops.Iterator.from_structure(
+            (dtypes.float32,)).string_handle()
+      with ops.device('/cpu:2'):
+        s3 = iterator_ops.Iterator.from_structure(
+            (dtypes.float32,)).string_handle()
+      return s1, s2, s3
+
+    outputs = multi_device_fn()
+    self.assertTrue(compat.as_bytes('CPU:0') in self.evaluate(outputs[0]))
+    self.assertTrue(compat.as_bytes('CPU:1') in self.evaluate(outputs[1]))
+    self.assertTrue(compat.as_bytes('CPU:2') in self.evaluate(outputs[2]))
+
   def testVariablesAreTracked(self):
     v = resource_variable_ops.ResourceVariable(1.0)
 
@@ -1035,7 +1186,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
     def loss(v):
       return v**2
 
-    optimizer = gradient_descent.GradientDescentOptimizer(learning_rate=1.0)
+    optimizer = momentum.MomentumOptimizer(learning_rate=1.0, momentum=1.0)
 
     @function.defun
     def train():
@@ -1047,12 +1198,29 @@ class AutomaticControlDependenciesTest(test.TestCase):
     value = train()
     self.assertEqual(value.numpy(), -1.0)
 
+  # TODO(b/111663004): This should work when the outer context is graph
+  # building.
+  def testOptimizerNonSlotVarsInDefunNoError(self):
+    def loss(v):
+      return v**2
+
+    optimizer = adam.AdamOptimizer(learning_rate=1.0)
+
+    @function.defun
+    def train():
+      v = resource_variable_ops.ResourceVariable(1.0)
+      grad = backprop.implicit_grad(loss)(v)
+      optimizer.apply_gradients(grad)
+      return v.read_value()
+
+    train()
+
   def testOptimizerInDefunWithCapturedVariable(self):
     v = resource_variable_ops.ResourceVariable(1.0)
     def loss():
       return v**2
 
-    optimizer = gradient_descent.GradientDescentOptimizer(learning_rate=1.0)
+    optimizer = momentum.MomentumOptimizer(learning_rate=1.0, momentum=1.0)
 
     @function.defun
     def train():
@@ -1062,6 +1230,176 @@ class AutomaticControlDependenciesTest(test.TestCase):
     train()
     self.assertEqual(v.numpy(), -1.0)
 
+  def testFunctionModifiesInputList(self):
+    # Tests on `list` methods that do in place modification, except `list.sort`
+    # since it cannot even be "defunned" in the first place
+
+    def get_list():
+      return [constant_op.constant(0.), constant_op.constant(1.)]
+
+    expected_msg = (
+        'Function to be traced should not modify structure of input '
+        'arguments. Check if your function has list and dictionary '
+        'operations that alter input arguments, '
+        'such as `list.pop`, `list.append`')
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def append(l):
+        l.append(constant_op.constant(0.))
+
+      append(get_list())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def extend(l):
+        l.extend([constant_op.constant(0.)])
+
+      extend(get_list())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def insert(l):
+        l.insert(0, constant_op.constant(0.))
+
+      insert(get_list())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def pop(l):
+        l.pop()
+
+      pop(get_list())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def reverse(l):
+        l.reverse()
+
+      reverse(get_list())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def remove(l):
+        l.remove(l[0])
+
+      remove(get_list())
+
+    # `list.clear` is a method that is in Py3 but not Py2
+    if sys.version.startswith('3'):
+
+      with self.assertRaisesRegexp(ValueError, expected_msg):
+
+        @function.defun
+        def clear(l):
+          l.clear()
+
+        clear(get_list())
+
+    # One last test for keyword arguments
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def kwdappend(**kwargs):
+        l = kwargs['l']
+        l.append(constant_op.constant(0.))
+
+      kwdappend(l=get_list())
+
+  def testFunctionModifiesInputDict(self):
+
+    def get_dict():
+      return {'t1': constant_op.constant(0.), 't2': constant_op.constant(1.)}
+
+    expected_msg = (
+        'Function to be traced should not modify structure of input '
+        'arguments. Check if your function has list and dictionary '
+        'operations that alter input arguments, '
+        'such as `list.pop`, `list.append`')
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def clear(m):
+        m.clear()
+
+      clear(get_dict())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def pop(m):
+        m.pop('t1')
+
+      pop(get_dict())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def popitem(m):
+        m.popitem()
+
+      popitem(get_dict())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def update(m):
+        m.update({'t1': constant_op.constant(3.)})
+
+      update(get_dict())
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def setdefault(m):
+        m.setdefault('t3', constant_op.constant(3.))
+
+      setdefault(get_dict())
+
+  def testFunctionModifiesInputNest(self):
+    # Test on functions that modify structure of nested input arguments
+    expected_msg = (
+        'Function to be traced should not modify structure of input '
+        'arguments. Check if your function has list and dictionary '
+        'operations that alter input arguments, '
+        'such as `list.pop`, `list.append`')
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      @function.defun
+      def modify(n):
+        n[0]['t1'].append(constant_op.constant(1.))
+
+      nested_input = [{
+          't1': [constant_op.constant(0.),
+                 constant_op.constant(1.)],
+      },
+                      constant_op.constant(2.)]
+
+      modify(nested_input)
+
+    with self.assertRaisesRegexp(ValueError, expected_msg):
+
+      # The flat list doesn't change whereas the true structure changes
+      @function.defun
+      def modify_same_flat(n):
+        n[0].append(n[1].pop(0))
+
+      nested_input = [[constant_op.constant(0.)],
+                      [constant_op.constant(1.),
+                       constant_op.constant(2.)]]
+
+      modify_same_flat(nested_input)
+
 
 if __name__ == '__main__':
+  ops.enable_eager_execution(
+      config=config_pb2.ConfigProto(device_count={'CPU': 3}))
   test.main()
diff --git a/tensorflow/python/eager/graph_callable.py b/tensorflow/python/eager/graph_callable.py
index 848adf4fd3..9200396c8a 100644
--- a/tensorflow/python/eager/graph_callable.py
+++ b/tensorflow/python/eager/graph_callable.py
@@ -118,7 +118,7 @@ class _VariableCapturingScope(object):
         initializer=None,
         regularizer=None,
         reuse=None,
-        trainable=True,
+        trainable=None,
         collections=None,
         caching_device=None,  # pylint: disable=redefined-outer-name
         partitioner=None,
@@ -156,7 +156,7 @@ class _VariableCapturingScope(object):
         initializer=None,
         regularizer=None,
         reuse=None,
-        trainable=True,
+        trainable=None,
         collections=None,
         caching_device=None,  # pylint: disable=redefined-outer-name
         partitioner=None,
@@ -280,8 +280,7 @@ def _graph_callable_internal(func, shape_and_dtypes):
     # This graph will store both the initialization and the call version of the
     # wrapped function. It will later be used by the backprop code to build the
     # backprop graph, if necessary.
-    captures = {}
-    tmp_graph = function.CapturingGraph(captures)
+    tmp_graph = function.CapturingGraph()
     # Inherit the graph key from the original graph to ensure optimizers don't
     # misbehave.
     tmp_graph._container = container  # pylint: disable=protected-access
@@ -289,7 +288,7 @@ def _graph_callable_internal(func, shape_and_dtypes):
     with tmp_graph.as_default():
       # Placeholders for the non-variable inputs.
       func_inputs = _get_graph_callable_inputs(shape_and_dtypes)
-      func_num_args = len(tf_inspect.getargspec(func).args)
+      func_num_args = len(tf_inspect.getfullargspec(func).args)
       if len(func_inputs) != func_num_args:
         raise TypeError("The number of arguments accepted by the decorated "
                         "function `%s` (%d) must match the number of "
@@ -331,6 +330,7 @@ def _graph_callable_internal(func, shape_and_dtypes):
 
   sorted_variables = sorted(variable_captures.variables.values(),
                             key=lambda x: x.name)
+  captures = tmp_graph.captures
   ids = list(sorted(captures.keys()))
   if ids:
     extra_inputs, extra_placeholders = zip(*[captures[x] for x in ids])
diff --git a/tensorflow/python/eager/memory_test.py b/tensorflow/python/eager/memory_test.py
index 74c6cbdd31..a1a59d511f 100644
--- a/tensorflow/python/eager/memory_test.py
+++ b/tensorflow/python/eager/memory_test.py
@@ -24,6 +24,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import six
+
 from tensorflow.python import keras
 from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
@@ -63,7 +65,7 @@ class MemoryTest(test.TestCase):
 
       initial = memory_profiler.memory_usage(-1)[0]
 
-      for _ in xrange(num_iters):
+      for _ in six.moves.range(num_iters):
         f()
 
       increase = memory_profiler.memory_usage(-1)[0] - initial
diff --git a/tensorflow/python/eager/ops_test.py b/tensorflow/python/eager/ops_test.py
index fc76ede4c5..17a090d526 100644
--- a/tensorflow/python/eager/ops_test.py
+++ b/tensorflow/python/eager/ops_test.py
@@ -370,6 +370,10 @@ class OpsTest(test_util.TensorFlowTestCase):
     with self.assertRaises(TypeError):
       float(x)
 
+  def testRange(self):
+    x = constant_op.constant(2)
+    self.assertEqual([0, 1], list(range(x)))
+
   def testFormatString(self):
     x = constant_op.constant(3.1415)
     self.assertEqual('3.14', '{:.2f}'.format(x))
diff --git a/tensorflow/python/eager/pywrap_tensor.cc b/tensorflow/python/eager/pywrap_tensor.cc
index ea604647fa..15d2ccf9d2 100644
--- a/tensorflow/python/eager/pywrap_tensor.cc
+++ b/tensorflow/python/eager/pywrap_tensor.cc
@@ -154,6 +154,7 @@ TFE_TensorHandle* EagerCast(TFE_Context* ctx, TFE_TensorHandle* handle,
   if (TF_GetCode(out_status) != TF_OK) RETURN_ERROR
   TFE_OpSetAttrType(op, "SrcT", src_type_enum);
   TFE_OpSetAttrType(op, "DstT", dst_type_enum);
+  TFE_OpSetAttrBool(op, "Truncate", false);
   TFE_TensorHandle* output = nullptr;
   int num_outputs = 1;
   TFE_Execute(op, &output, &num_outputs, out_status);
@@ -620,10 +621,6 @@ static PyType_Slot EagerTensor_Type_slots[] = {
     {Py_tp_init, reinterpret_cast<void*>(EagerTensor_init)},
     {0, nullptr},
 };
-
-PyType_Spec EagerTensor_Type_spec = {"EagerTensor", sizeof(EagerTensor), 0,
-                                     Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HEAPTYPE,
-                                     EagerTensor_Type_slots};
 #else
 // TODO(agarwal): support active_trace.
 static PyTypeObject _EagerTensorType = {
@@ -754,6 +751,34 @@ PyObject* TFE_Py_InitEagerTensor(PyObject* base_class) {
 #if PY_MAJOR_VERSION >= 3
   PyObject* bases = PyTuple_New(1);
   PyTuple_SET_ITEM(bases, 0, base_class);
+
+  tensorflow::Safe_PyObjectPtr base_class_module(
+      PyObject_GetAttrString(base_class, "__module__"));
+  const char* module = nullptr;
+  if (PyErr_Occurred()) {
+    PyErr_Clear();
+    module = "__builtin__";
+  } else {
+    module = PyBytes_AsString(base_class_module.get());
+    if (module == nullptr) {
+      PyErr_Clear();
+      module = PyUnicode_AsUTF8(base_class_module.get());
+      if (module == nullptr) {
+        PyErr_Clear();
+        module = "__builtin__";
+      }
+    }
+  }
+
+  // NOTE: The c_str from this string needs to outlast the function, hence is
+  // static.
+  static tensorflow::string fully_qualified_name =
+      tensorflow::strings::StrCat(module, ".EagerTensor");
+
+  static PyType_Spec EagerTensor_Type_spec = {
+      fully_qualified_name.c_str(), sizeof(EagerTensor), 0,
+      Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HEAPTYPE, EagerTensor_Type_slots};
+
   EagerTensorType = reinterpret_cast<PyTypeObject*>(
       PyType_FromSpecWithBases(&EagerTensor_Type_spec, bases));
   if (PyErr_Occurred()) {
diff --git a/tensorflow/python/eager/pywrap_tfe_src.cc b/tensorflow/python/eager/pywrap_tfe_src.cc
index 57b4dab51c..0eabea321c 100644
--- a/tensorflow/python/eager/pywrap_tfe_src.cc
+++ b/tensorflow/python/eager/pywrap_tfe_src.cc
@@ -845,11 +845,9 @@ int64_t get_uid() {
 PyObject* TFE_Py_UID() { return PyLong_FromLongLong(get_uid()); }
 
 void TFE_DeleteContextCapsule(PyObject* context) {
-  TF_Status* status = TF_NewStatus();
   TFE_Context* ctx =
       reinterpret_cast<TFE_Context*>(PyCapsule_GetPointer(context, nullptr));
-  TFE_DeleteContext(ctx, status);
-  TF_DeleteStatus(status);
+  TFE_DeleteContext(ctx);
 }
 
 static tensorflow::int64 MakeInt(PyObject* integer) {
@@ -1173,14 +1171,14 @@ static tensorflow::eager::TapeTensor TapeTensorFromTensor(PyObject* tensor) {
   if (EagerTensor_CheckExact(tensor)) {
     TFE_TensorHandle* t = EagerTensor_Handle(tensor);
     tensorflow::int64 id = EagerTensor_id(tensor);
-    const tensorflow::Tensor* tensor = nullptr;
-    const tensorflow::Status status = t->handle->Tensor(&tensor);
+    tensorflow::TensorShape tensor_shape;
+    const tensorflow::Status status = t->handle->Shape(&tensor_shape);
+
     if (MaybeRaiseExceptionFromStatus(status, nullptr)) {
       return tensorflow::eager::TapeTensor{id, t->handle->dtype,
                                            tensorflow::TensorShape({})};
     } else {
-      return tensorflow::eager::TapeTensor{id, t->handle->dtype,
-                                           tensor->shape()};
+      return tensorflow::eager::TapeTensor{id, t->handle->dtype, tensor_shape};
     }
   }
   tensorflow::int64 id = FastTensorId(tensor);
@@ -1898,14 +1896,39 @@ PyObject* RecordGradient(PyObject* op_name, PyObject* inputs, PyObject* attrs,
 
 void MaybeWatchVariable(PyObject* input) {
   DCHECK(CheckResourceVariable(input));
-  DCHECK(PyObject_HasAttrString(input, "trainable"));
+  DCHECK(PyObject_HasAttrString(input, "_trainable"));
 
   tensorflow::Safe_PyObjectPtr trainable(
-      PyObject_GetAttrString(input, "trainable"));
+      PyObject_GetAttrString(input, "_trainable"));
   if (trainable.get() == Py_False) return;
   TFE_Py_TapeSetWatchVariable(input);
 }
 
+bool CastTensor(const FastPathOpExecInfo& op_exec_info,
+                const TF_DataType& desired_dtype,
+                tensorflow::Safe_TFE_TensorHandlePtr* handle,
+                TF_Status* status) {
+  TF_DataType input_dtype = TFE_TensorHandleDataType(handle->get());
+  TF_DataType output_dtype = input_dtype;
+
+  if (desired_dtype >= 0 && desired_dtype != input_dtype) {
+    *handle = tensorflow::make_safe(
+        tensorflow::EagerCast(op_exec_info.ctx, handle->get(), input_dtype,
+                              static_cast<TF_DataType>(desired_dtype), status));
+    if (!status->status.ok()) return false;
+    output_dtype = desired_dtype;
+  }
+
+  if (output_dtype != TF_INT32) {
+    // Note that this is a shallow copy and will share the underlying buffer
+    // if copying to the same device.
+    *handle = tensorflow::make_safe(TFE_TensorHandleCopyToDevice(
+        handle->get(), op_exec_info.ctx, op_exec_info.device_name, status));
+    if (!status->status.ok()) return false;
+  }
+  return true;
+}
+
 bool ReadVariableOp(const FastPathOpExecInfo& parent_op_exec_info,
                     PyObject* input, tensorflow::Safe_PyObjectPtr* output,
                     TF_Status* status) {
@@ -1938,9 +1961,31 @@ bool ReadVariableOp(const FastPathOpExecInfo& parent_op_exec_info,
   TFE_Execute(op, &output_handle, &num_retvals, status);
   if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
 
-  // Always create the py object (and correctly DECREF it) from the returned
-  // value, else the data will leak.
-  output->reset(EagerTensorFromHandle(output_handle));
+  if (!PyObject_HasAttrString(input, "_read_dtype")) {
+    // Always create the py object (and correctly DECREF it) from the returned
+    // value, else the data will leak.
+    output->reset(EagerTensorFromHandle(output_handle));
+  } else {
+    // This is a _MixedPrecisionVariable which potentially does casting when
+    // being read.
+    tensorflow::Safe_PyObjectPtr read_dtype(
+        PyObject_GetAttrString(input, "_read_dtype"));
+    int desired_dtype = -1;
+    if (!ParseTypeValue("_read_dtype", read_dtype.get(), status,
+                        &desired_dtype)) {
+      return false;
+    }
+
+    auto safe_output_handle = tensorflow::make_safe(output_handle);
+    // Retires output_handle in the future.
+    output_handle = nullptr;
+    if (!CastTensor(parent_op_exec_info,
+                    static_cast<TF_DataType>(desired_dtype),
+                    &safe_output_handle, status)) {
+      return false;
+    }
+    output->reset(EagerTensorFromHandle(safe_output_handle.release()));
+  }
 
   // TODO(nareshmodi): Should we run post exec callbacks here?
   if (parent_op_exec_info.run_gradient_callback) {
@@ -2010,27 +2055,13 @@ bool ConvertToTensor(
     }
   }
 
-  TF_DataType handle_dtype = TFE_TensorHandleDataType(handle.get());
-  if (desired_dtype >= 0 && desired_dtype != handle_dtype) {
-    handle = tensorflow::make_safe(
-        tensorflow::EagerCast(op_exec_info.ctx, handle.get(), handle_dtype,
-                              static_cast<TF_DataType>(desired_dtype), status));
-    if (!status->status.ok()) return false;
-
-    handle_dtype = TFE_TensorHandleDataType(handle.get());
-  }
-
-  if (handle_dtype != TF_INT32) {
-    // Note that this is a shallow copy and will share the underlying buffer
-    // if copying to the same device.
-    handle = tensorflow::make_safe(TFE_TensorHandleCopyToDevice(
-        handle.get(), op_exec_info.ctx, op_exec_info.device_name, status));
-    if (!status->status.ok()) return false;
+  if (!CastTensor(op_exec_info, static_cast<TF_DataType>(desired_dtype),
+                  &handle, status)) {
+    return false;
   }
-
+  TF_DataType output_dtype = TFE_TensorHandleDataType(handle.get());
   output_handle->reset(EagerTensorFromHandle(handle.release()));
-
-  dtype_setter(handle_dtype);
+  dtype_setter(output_dtype);
 
   return true;
 }
diff --git a/tensorflow/python/eager/pywrap_tfe_test.py b/tensorflow/python/eager/pywrap_tfe_test.py
index faaae40b3f..fd8ab695b8 100644
--- a/tensorflow/python/eager/pywrap_tfe_test.py
+++ b/tensorflow/python/eager/pywrap_tfe_test.py
@@ -23,6 +23,7 @@ from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
 from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
@@ -71,6 +72,25 @@ class Tests(test.TestCase):
 
   @test_util.assert_no_new_tensors
   @test_util.assert_no_garbage_created
+  def testFastpathExecute_MixedPrecisionVariableMatMulCorrectResponse(self):
+    ctx = context.context()
+    a_2_by_2 = constant_op.constant(1.0, shape=[2, 2])
+    a_2_by_2_fp16 = math_ops.cast(a_2_by_2, dtype=dtypes.float16)
+    m = resource_variable_ops.ResourceVariable(a_2_by_2)
+    m = resource_variable_ops._MixedPrecisionVariable(
+        m, read_dtype=dtypes.float16)
+    x = pywrap_tensorflow.TFE_Py_FastPathExecute(
+        ctx._handle, ctx.device_name, "MatMul", None, None, m, m, "transpose_a",
+        False, "transpose_b", False)
+    y = pywrap_tensorflow.TFE_Py_FastPathExecute(
+        ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2_fp16,
+        a_2_by_2_fp16, "transpose_a", False, "transpose_b", False)
+
+    self.assertEqual(x.dtype, dtypes.float16)
+    self.assertAllEqual(x, y)
+
+  @test_util.assert_no_new_tensors
+  @test_util.assert_no_garbage_created
   def testFastpathExecute_TapeWrite(self):
     ctx = context.context()
     with backprop.GradientTape(persistent=True) as tape:
@@ -98,6 +118,29 @@ class Tests(test.TestCase):
     self.assertAllEqual(dz_dy.numpy(),
                         constant_op.constant(4.0, shape=[2, 2]).numpy())
 
+  @test_util.assert_no_new_tensors
+  @test_util.assert_no_garbage_created
+  def testFastpathExecute_MixedPrecisionVariableTapeWrite(self):
+    ctx = context.context()
+    with backprop.GradientTape(persistent=True) as tape:
+      a_2_by_2 = constant_op.constant(
+          [[1.0, 2.0], [3.0, 4.0]], dtype=dtypes.float32)
+      a_2_by_2_fp16 = math_ops.cast(a_2_by_2, dtype=dtypes.float16)
+      m1 = resource_variable_ops.ResourceVariable(a_2_by_2)
+      m2 = resource_variable_ops._MixedPrecisionVariable(
+          m1, read_dtype=dtypes.float16)
+      tape.watch(m2)
+      z = pywrap_tensorflow.TFE_Py_FastPathExecute(
+          ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2_fp16, m2,
+          "transpose_a", False, "transpose_b", False)
+    dz_dy = tape.gradient(z, [m2])[0]
+    self.assertEqual(dz_dy.dtype, dtypes.float16)
+
+    expected_grads = math_ops.matmul(
+        array_ops.transpose(a_2_by_2_fp16),
+        constant_op.constant(1., shape=[2, 2], dtype=dtypes.float16)).numpy()
+    self.assertAllEqual(dz_dy.numpy(), expected_grads)
+
   # Tests homogeneous list op
   @test_util.assert_no_new_tensors
   @test_util.assert_no_garbage_created
diff --git a/tensorflow/python/eager/tensor_test.py b/tensorflow/python/eager/tensor_test.py
index 626a4eb1ee..871136e2c8 100644
--- a/tensorflow/python/eager/tensor_test.py
+++ b/tensorflow/python/eager/tensor_test.py
@@ -278,7 +278,7 @@ class TFETensorUtilTest(test_util.TensorFlowTestCase):
 
     with self.assertRaisesRegexp(
         TypeError,
-        r"tensors argument must be a list or a tuple. Got \"EagerTensor\""):
+        r"tensors argument must be a list or a tuple. Got.*EagerTensor"):
       pywrap_tensorflow.TFE_Py_TensorShapeSlice(t1, -2)
 
   def testNegativeSliceDim(self):
diff --git a/tensorflow/python/eager/test.py b/tensorflow/python/eager/test.py
index f6a46e7eb3..33ee797678 100644
--- a/tensorflow/python/eager/test.py
+++ b/tensorflow/python/eager/test.py
@@ -23,6 +23,7 @@ from tensorflow.python.platform import test as _test
 from tensorflow.python.platform.test import *  # pylint: disable=wildcard-import
 
 
+# TODO(akshayka): Do away with this file.
 def main(argv=None):
   _ops.enable_eager_execution()
   _test.main(argv)
diff --git a/tensorflow/python/estimator/BUILD b/tensorflow/python/estimator/BUILD
index 8ee38d35cc..817c8e6848 100644
--- a/tensorflow/python/estimator/BUILD
+++ b/tensorflow/python/estimator/BUILD
@@ -40,9 +40,9 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":gc",
+        ":metric_keys",
+        ":util",
         "//tensorflow:tensorflow_py_no_contrib",
-        "//tensorflow/python/estimator:metric_keys",
-        "//tensorflow/python/estimator:util",
     ],
 )
 
@@ -171,6 +171,7 @@ py_test(
     name = "baseline_test",
     size = "medium",
     srcs = ["canned/baseline_test.py"],
+    shard_count = 4,
     srcs_version = "PY2AND3",
     tags = [
         "no_pip",
@@ -207,6 +208,7 @@ py_test(
     name = "boosted_trees_test",
     size = "medium",
     srcs = ["canned/boosted_trees_test.py"],
+    shard_count = 2,
     srcs_version = "PY2AND3",
     tags = [
         "optonly",
@@ -676,6 +678,7 @@ py_test(
     name = "keras_test",
     size = "large",
     srcs = ["keras_test.py"],
+    shard_count = 4,
     srcs_version = "PY2AND3",
     tags = [
         "no_windows",
@@ -683,9 +686,9 @@ py_test(
     ],
     deps = [
         ":keras",
+        ":numpy_io",
+        ":run_config",
         "//tensorflow:tensorflow_py_no_contrib",
-        "//tensorflow/python/estimator:numpy_io",
-        "//tensorflow/python/estimator:run_config",
         "//third_party/py/numpy",
     ],
 )
@@ -707,6 +710,14 @@ py_library(
 )
 
 py_library(
+    name = "expect_h5py_installed",
+    # This is a dummy rule used as a numpy dependency in open-source.
+    # We expect h5py to already be installed on the system, e.g. via
+    # `pip install h5py'
+    visibility = ["//visibility:public"],
+)
+
+py_library(
     name = "expect_six_installed",
     # This is a dummy rule used as a numpy dependency in open-source.
     # We expect six to already be installed on the system, e.g. via
diff --git a/tensorflow/python/estimator/api/BUILD b/tensorflow/python/estimator/api/BUILD
index aa5a29e6dd..a75fa7d0ae 100644
--- a/tensorflow/python/estimator/api/BUILD
+++ b/tensorflow/python/estimator/api/BUILD
@@ -6,13 +6,14 @@ package(
 
 licenses(["notice"])  # Apache 2.0
 
-load("//tensorflow/tools/api/generator:api_gen.bzl", "gen_api_init_files")
-load("//tensorflow/tools/api/generator:api_gen.bzl", "ESTIMATOR_API_INIT_FILES")
+load("//tensorflow/python/tools/api/generator:api_gen.bzl", "gen_api_init_files")
+load("//tensorflow/python/tools/api/generator:api_gen.bzl", "ESTIMATOR_API_INIT_FILES")
 
 gen_api_init_files(
     name = "estimator_python_api_gen",
     api_name = "estimator",
     output_files = ESTIMATOR_API_INIT_FILES,
+    output_package = "tensorflow.python.estimator.api",
     package = "tensorflow.python.estimator",
     package_dep = "//tensorflow/python/estimator:estimator_py",
 )
diff --git a/tensorflow/python/estimator/canned/baseline_test.py b/tensorflow/python/estimator/canned/baseline_test.py
index 7bf2e62da9..e46a3a156d 100644
--- a/tensorflow/python/estimator/canned/baseline_test.py
+++ b/tensorflow/python/estimator/canned/baseline_test.py
@@ -154,6 +154,8 @@ class BaselineRegressorEvaluationTest(test.TestCase):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 9.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -176,6 +178,8 @@ class BaselineRegressorEvaluationTest(test.TestCase):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 18.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -204,6 +208,8 @@ class BaselineRegressorEvaluationTest(test.TestCase):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 27.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -229,7 +235,9 @@ class BaselineRegressorEvaluationTest(test.TestCase):
 
     self.assertItemsEqual(
         (metric_keys.MetricKeys.LOSS, metric_keys.MetricKeys.LOSS_MEAN,
-         ops.GraphKeys.GLOBAL_STEP), eval_metrics.keys())
+         metric_keys.MetricKeys.PREDICTION_MEAN,
+         metric_keys.MetricKeys.LABEL_MEAN, ops.GraphKeys.GLOBAL_STEP),
+        eval_metrics.keys())
 
     # Logit is bias which is [46, 58]
     self.assertAlmostEqual(0, eval_metrics[metric_keys.MetricKeys.LOSS])
diff --git a/tensorflow/python/estimator/canned/boosted_trees.py b/tensorflow/python/estimator/canned/boosted_trees.py
index 8afef1b65a..8b423f76de 100644
--- a/tensorflow/python/estimator/canned/boosted_trees.py
+++ b/tensorflow/python/estimator/canned/boosted_trees.py
@@ -17,7 +17,9 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import abc
 import collections
+import functools
 
 from tensorflow.python.estimator import estimator
 from tensorflow.python.estimator import model_fn
@@ -44,12 +46,13 @@ from tensorflow.python.util.tf_export import estimator_export
 # TODO(nponomareva): Reveal pruning params here.
 _TreeHParams = collections.namedtuple('TreeHParams', [
     'n_trees', 'max_depth', 'learning_rate', 'l1', 'l2', 'tree_complexity',
-    'min_node_weight'
+    'min_node_weight', 'center_bias', 'pruning_mode'
 ])
 
 _HOLD_FOR_MULTI_CLASS_SUPPORT = object()
 _HOLD_FOR_MULTI_DIM_SUPPORT = object()
 _DUMMY_NUM_BUCKETS = -1
+_DUMMY_NODE_ID = -1
 
 
 def _get_transformed_features(features, sorted_feature_columns):
@@ -279,7 +282,9 @@ class _CacheTrainingStatesUsingHashTable(object):
     """Returns cached_tree_ids, cached_node_ids, cached_logits."""
     cached_tree_ids, cached_node_ids, cached_logits = array_ops.split(
         lookup_ops.lookup_table_find_v2(
-            self._table_ref, self._example_ids, default_value=[0.0, 0.0, 0.0]),
+            self._table_ref,
+            self._example_ids,
+            default_value=[0.0, _DUMMY_NODE_ID, 0.0]),
         [1, 1, self._logits_dimension],
         axis=1)
     cached_tree_ids = array_ops.squeeze(
@@ -330,7 +335,7 @@ class _CacheTrainingStatesUsingVariables(object):
         array_ops.zeros([batch_size], dtype=dtypes.int32),
         name='tree_ids_cache')
     self._node_ids = _local_variable(
-        array_ops.zeros([batch_size], dtype=dtypes.int32),
+        _DUMMY_NODE_ID*array_ops.ones([batch_size], dtype=dtypes.int32),
         name='node_ids_cache')
     self._logits = _local_variable(
         array_ops.zeros([batch_size, logits_dimension], dtype=dtypes.float32),
@@ -380,6 +385,260 @@ class _StopAtAttemptsHook(session_run_hook.SessionRunHook):
       run_context.request_stop()
 
 
+def _get_max_splits(tree_hparams):
+  """Calculates the max possible number of splits based on tree params."""
+  # maximum number of splits possible in the whole tree =2^(D-1)-1
+  max_splits = (1 << tree_hparams.max_depth) - 1
+  return max_splits
+
+
+class _EnsembleGrower(object):
+  """Abstract base class for different types of ensemble growers.
+
+  Use it to receive training ops for growing and centering bias, depending
+  on the implementation (for example, in memory or accumulator-based
+  distributed):
+    grower = ...create subclass grower(tree_ensemble, tree_hparams)
+    grow_op = grower.grow_tree(stats_summaries_list, feature_ids_list,
+                               last_layer_nodes_range)
+    training_ops.append(grow_op)
+  """
+
+  def __init__(self, tree_ensemble, tree_hparams):
+    """Initializes a grower object.
+
+    Args:
+      tree_ensemble: A TreeEnsemble variable.
+      tree_hparams: TODO. collections.namedtuple for hyper parameters.
+    Raises:
+      ValueError: when pruning mode is invalid or pruning is used and no tree
+      complexity is set.
+    """
+    self._tree_ensemble = tree_ensemble
+    self._tree_hparams = tree_hparams
+    # pylint: disable=protected-access
+    self._pruning_mode_parsed = boosted_trees_ops.PruningMode.from_str(
+        tree_hparams.pruning_mode)
+
+    if (self._pruning_mode_parsed != boosted_trees_ops.PruningMode.NO_PRUNING
+        and tree_hparams.tree_complexity <= 0):
+      raise ValueError('For pruning, tree_complexity must be positive.')
+    # pylint: enable=protected-access
+
+  @abc.abstractmethod
+  def center_bias(self, center_bias_var, gradients, hessians):
+    """Centers bias, if ready, based on statistics.
+
+    Args:
+      center_bias_var: A variable that will be updated when bias centering
+        finished.
+      gradients: A rank 2 tensor of gradients.
+      hessians: A rank 2 tensor of hessians.
+
+    Returns:
+      An operation for centering bias.
+    """
+
+  @abc.abstractmethod
+  def grow_tree(self, stats_summaries_list, feature_ids_list,
+                last_layer_nodes_range):
+    """Grows a tree, if ready, based on provided statistics.
+
+    Args:
+      stats_summaries_list: List of stats summary tensors, representing sums of
+        gradients and hessians for each feature bucket.
+      feature_ids_list: a list of lists of feature ids for each bucket size.
+      last_layer_nodes_range: A tensor representing ids of the nodes in the
+        current layer, to be split.
+
+    Returns:
+      An op for growing a tree.
+    """
+
+  #  ============= Helper methods ===========
+
+  def _center_bias_fn(self, center_bias_var, mean_gradients, mean_hessians):
+    """Updates the ensembles and cache (if needed) with logits prior."""
+    continue_centering = boosted_trees_ops.center_bias(
+        self._tree_ensemble.resource_handle,
+        mean_gradients=mean_gradients,
+        mean_hessians=mean_hessians,
+        l1=self._tree_hparams.l1,
+        l2=self._tree_hparams.l2)
+    return center_bias_var.assign(continue_centering)
+
+  def _grow_tree_from_stats_summaries(self, stats_summaries_list,
+                                      feature_ids_list, last_layer_nodes_range):
+    """Updates ensemble based on the best gains from stats summaries."""
+    node_ids_per_feature = []
+    gains_list = []
+    thresholds_list = []
+    left_node_contribs_list = []
+    right_node_contribs_list = []
+    all_feature_ids = []
+    assert len(stats_summaries_list) == len(feature_ids_list)
+
+    max_splits = _get_max_splits(self._tree_hparams)
+
+    for i, feature_ids in enumerate(feature_ids_list):
+      (numeric_node_ids_per_feature, numeric_gains_list,
+       numeric_thresholds_list, numeric_left_node_contribs_list,
+       numeric_right_node_contribs_list) = (
+           boosted_trees_ops.calculate_best_gains_per_feature(
+               node_id_range=last_layer_nodes_range,
+               stats_summary_list=stats_summaries_list[i],
+               l1=self._tree_hparams.l1,
+               l2=self._tree_hparams.l2,
+               tree_complexity=self._tree_hparams.tree_complexity,
+               min_node_weight=self._tree_hparams.min_node_weight,
+               max_splits=max_splits))
+
+      all_feature_ids += feature_ids
+      node_ids_per_feature += numeric_node_ids_per_feature
+      gains_list += numeric_gains_list
+      thresholds_list += numeric_thresholds_list
+      left_node_contribs_list += numeric_left_node_contribs_list
+      right_node_contribs_list += numeric_right_node_contribs_list
+
+    grow_op = boosted_trees_ops.update_ensemble(
+        # Confirm if local_tree_ensemble or tree_ensemble should be used.
+        self._tree_ensemble.resource_handle,
+        feature_ids=all_feature_ids,
+        node_ids=node_ids_per_feature,
+        gains=gains_list,
+        thresholds=thresholds_list,
+        left_node_contribs=left_node_contribs_list,
+        right_node_contribs=right_node_contribs_list,
+        learning_rate=self._tree_hparams.learning_rate,
+        max_depth=self._tree_hparams.max_depth,
+        pruning_mode=self._pruning_mode_parsed)
+    return grow_op
+
+
+class _InMemoryEnsembleGrower(_EnsembleGrower):
+  """A base class for ensemble growers."""
+
+  def __init__(self, tree_ensemble, tree_hparams):
+
+    super(_InMemoryEnsembleGrower, self).__init__(
+        tree_ensemble=tree_ensemble, tree_hparams=tree_hparams)
+
+  def center_bias(self, center_bias_var, gradients, hessians):
+    # For in memory, we already have a full batch of gradients and hessians,
+    # so just take a mean and proceed with centering.
+    mean_gradients = array_ops.expand_dims(
+        math_ops.reduce_mean(gradients, 0), 0)
+    mean_heassians = array_ops.expand_dims(math_ops.reduce_mean(hessians, 0), 0)
+    return self._center_bias_fn(center_bias_var, mean_gradients, mean_heassians)
+
+  def grow_tree(self, stats_summaries_list, feature_ids_list,
+                last_layer_nodes_range):
+    # For in memory, we already have full data in one batch, so we can grow the
+    # tree immediately.
+    return self._grow_tree_from_stats_summaries(
+        stats_summaries_list, feature_ids_list, last_layer_nodes_range)
+
+
+class _AccumulatorEnsembleGrower(_EnsembleGrower):
+  """A base class for ensemble growers."""
+
+  def __init__(self, tree_ensemble, tree_hparams, stamp_token,
+               n_batches_per_layer, bucket_size_list, is_chief):
+    super(_AccumulatorEnsembleGrower, self).__init__(
+        tree_ensemble=tree_ensemble, tree_hparams=tree_hparams)
+    self._stamp_token = stamp_token
+    self._n_batches_per_layer = n_batches_per_layer
+    self._bucket_size_list = bucket_size_list
+    self._is_chief = is_chief
+
+  def center_bias(self, center_bias_var, gradients, hessians):
+    # For not in memory situation, we need to accumulate enough of batches first
+    # before proceeding with centering bias.
+
+    # Create an accumulator.
+    bias_dependencies = []
+    bias_accumulator = data_flow_ops.ConditionalAccumulator(
+        dtype=dtypes.float32,
+        # The stats consist of grads and hessians means only.
+        # TODO(nponomareva): this will change for a multiclass
+        shape=[2, 1],
+        shared_name='bias_accumulator')
+
+    grads_and_hess = array_ops.stack([gradients, hessians], axis=0)
+    grads_and_hess = math_ops.reduce_mean(grads_and_hess, axis=1)
+
+    apply_grad = bias_accumulator.apply_grad(grads_and_hess, self._stamp_token)
+    bias_dependencies.append(apply_grad)
+
+    # Center bias if enough batches were processed.
+    with ops.control_dependencies(bias_dependencies):
+      if not self._is_chief:
+        return control_flow_ops.no_op()
+
+      def center_bias_from_accumulator():
+        accumulated = array_ops.unstack(bias_accumulator.take_grad(1), axis=0)
+        return self._center_bias_fn(center_bias_var,
+                                    array_ops.expand_dims(accumulated[0], 0),
+                                    array_ops.expand_dims(accumulated[1], 0))
+
+      center_bias_op = control_flow_ops.cond(
+          math_ops.greater_equal(bias_accumulator.num_accumulated(),
+                                 self._n_batches_per_layer),
+          center_bias_from_accumulator,
+          control_flow_ops.no_op,
+          name='wait_until_n_batches_for_bias_accumulated')
+      return center_bias_op
+
+  def grow_tree(self, stats_summaries_list, feature_ids_list,
+                last_layer_nodes_range):
+    # For not in memory situation, we need to accumulate enough of batches first
+    # before proceeding with building a tree layer.
+    max_splits = _get_max_splits(self._tree_hparams)
+
+    # Prepare accumulators.
+    accumulators = []
+    dependencies = []
+    for i, feature_ids in enumerate(feature_ids_list):
+      stats_summaries = stats_summaries_list[i]
+      accumulator = data_flow_ops.ConditionalAccumulator(
+          dtype=dtypes.float32,
+          # The stats consist of grads and hessians (the last dimension).
+          shape=[len(feature_ids), max_splits, self._bucket_size_list[i], 2],
+          shared_name='numeric_stats_summary_accumulator_' + str(i))
+      accumulators.append(accumulator)
+
+      apply_grad = accumulator.apply_grad(
+          array_ops.stack(stats_summaries, axis=0), self._stamp_token)
+      dependencies.append(apply_grad)
+
+    # Grow the tree if enough batches is accumulated.
+    with ops.control_dependencies(dependencies):
+      if not self._is_chief:
+        return control_flow_ops.no_op()
+
+      min_accumulated = math_ops.reduce_min(
+          array_ops.stack([acc.num_accumulated() for acc in accumulators]))
+
+      def grow_tree_from_accumulated_summaries_fn():
+        """Updates tree with the best layer from accumulated summaries."""
+        # Take out the accumulated summaries from the accumulator and grow.
+        stats_summaries_list = []
+        stats_summaries_list = [
+            array_ops.unstack(accumulator.take_grad(1), axis=0)
+            for accumulator in accumulators
+        ]
+        grow_op = self._grow_tree_from_stats_summaries(
+            stats_summaries_list, feature_ids_list, last_layer_nodes_range)
+        return grow_op
+
+      grow_model = control_flow_ops.cond(
+          math_ops.greater_equal(min_accumulated, self._n_batches_per_layer),
+          grow_tree_from_accumulated_summaries_fn,
+          control_flow_ops.no_op,
+          name='wait_until_n_batches_accumulated')
+      return grow_model
+
+
 def _bt_model_fn(
     features,
     labels,
@@ -425,8 +684,9 @@ def _bt_model_fn(
     ValueError: mode or params are invalid, or features has the wrong type.
   """
   is_single_machine = (config.num_worker_replicas <= 1)
-
   sorted_feature_columns = sorted(feature_columns, key=lambda tc: tc.name)
+  center_bias = tree_hparams.center_bias
+
   if train_in_memory:
     assert n_batches_per_layer == 1, (
         'When train_in_memory is enabled, input_fn should return the entire '
@@ -437,11 +697,6 @@ def _bt_model_fn(
       raise ValueError('train_in_memory is supported only for '
                        'non-distributed training.')
   worker_device = control_flow_ops.no_op().device
-  # maximum number of splits possible in the whole tree =2^(D-1)-1
-  # TODO(youngheek): perhaps storage could be optimized by storing stats with
-  # the dimension max_splits_per_layer, instead of max_splits (for the entire
-  # tree).
-  max_splits = (1 << tree_hparams.max_depth) - 1
   train_op = []
   with ops.name_scope(name) as name:
     # Prepare.
@@ -469,6 +724,9 @@ def _bt_model_fn(
 
     # Create Ensemble resources.
     tree_ensemble = boosted_trees_ops.TreeEnsemble(name=name)
+    # Variable that determines whether bias centering is needed.
+    center_bias_var = variable_scope.variable(
+        initial_value=center_bias, name='center_bias_needed', trainable=False)
     # Create logits.
     if mode != model_fn.ModeKeys.TRAIN:
       logits = boosted_trees_ops.predict(
@@ -489,6 +747,7 @@ def _bt_model_fn(
         # TODO(soroush): Do partial updates if this becomes a bottleneck.
         ensemble_reload = local_tree_ensemble.deserialize(
             *tree_ensemble.serialize())
+
       if training_state_cache:
         cached_tree_ids, cached_node_ids, cached_logits = (
             training_state_cache.lookup())
@@ -497,9 +756,10 @@ def _bt_model_fn(
         batch_size = array_ops.shape(labels)[0]
         cached_tree_ids, cached_node_ids, cached_logits = (
             array_ops.zeros([batch_size], dtype=dtypes.int32),
-            array_ops.zeros([batch_size], dtype=dtypes.int32),
+            _DUMMY_NODE_ID * array_ops.ones([batch_size], dtype=dtypes.int32),
             array_ops.zeros(
                 [batch_size, head.logits_dimension], dtype=dtypes.float32))
+
       with ops.control_dependencies([ensemble_reload]):
         (stamp_token, num_trees, num_finalized_trees, num_attempted_layers,
          last_layer_nodes_range) = local_tree_ensemble.get_states()
@@ -513,13 +773,20 @@ def _bt_model_fn(
             cached_node_ids=cached_node_ids,
             bucketized_features=input_feature_list,
             logits_dimension=head.logits_dimension)
+
       logits = cached_logits + partial_logits
 
     # Create training graph.
     def _train_op_fn(loss):
       """Run one training iteration."""
       if training_state_cache:
-        train_op.append(training_state_cache.insert(tree_ids, node_ids, logits))
+        # Cache logits only after center_bias is complete, if it's in progress.
+        train_op.append(
+            control_flow_ops.cond(
+                center_bias_var, control_flow_ops.no_op,
+                lambda: training_state_cache.insert(tree_ids, node_ids, logits))
+        )
+
       if closed_form_grad_and_hess_fn:
         gradients, hessians = closed_form_grad_and_hess_fn(logits, labels)
       else:
@@ -527,6 +794,11 @@ def _bt_model_fn(
         hessians = gradients_impl.gradients(
             gradients, logits, name='Hessians')[0]
 
+      # TODO(youngheek): perhaps storage could be optimized by storing stats
+      # with the dimension max_splits_per_layer, instead of max_splits (for the
+      # entire tree).
+      max_splits = _get_max_splits(tree_hparams)
+
       stats_summaries_list = []
       for i, feature_ids in enumerate(feature_ids_list):
         num_buckets = bucket_size_list[i]
@@ -543,103 +815,28 @@ def _bt_model_fn(
         ]
         stats_summaries_list.append(summaries)
 
-      accumulators = []
-
-      def grow_tree_from_stats_summaries(stats_summaries_list,
-                                         feature_ids_list):
-        """Updates ensemble based on the best gains from stats summaries."""
-        node_ids_per_feature = []
-        gains_list = []
-        thresholds_list = []
-        left_node_contribs_list = []
-        right_node_contribs_list = []
-        all_feature_ids = []
-
-        assert len(stats_summaries_list) == len(feature_ids_list)
-
-        for i, feature_ids in enumerate(feature_ids_list):
-          (numeric_node_ids_per_feature, numeric_gains_list,
-           numeric_thresholds_list, numeric_left_node_contribs_list,
-           numeric_right_node_contribs_list) = (
-               boosted_trees_ops.calculate_best_gains_per_feature(
-                   node_id_range=last_layer_nodes_range,
-                   stats_summary_list=stats_summaries_list[i],
-                   l1=tree_hparams.l1,
-                   l2=tree_hparams.l2,
-                   tree_complexity=tree_hparams.tree_complexity,
-                   min_node_weight=tree_hparams.min_node_weight,
-                   max_splits=max_splits))
-
-          all_feature_ids += feature_ids
-          node_ids_per_feature += numeric_node_ids_per_feature
-          gains_list += numeric_gains_list
-          thresholds_list += numeric_thresholds_list
-          left_node_contribs_list += numeric_left_node_contribs_list
-          right_node_contribs_list += numeric_right_node_contribs_list
-
-        grow_op = boosted_trees_ops.update_ensemble(
-            # Confirm if local_tree_ensemble or tree_ensemble should be used.
-            tree_ensemble.resource_handle,
-            feature_ids=all_feature_ids,
-            node_ids=node_ids_per_feature,
-            gains=gains_list,
-            thresholds=thresholds_list,
-            left_node_contribs=left_node_contribs_list,
-            right_node_contribs=right_node_contribs_list,
-            learning_rate=tree_hparams.learning_rate,
-            max_depth=tree_hparams.max_depth,
-            pruning_mode=boosted_trees_ops.PruningMode.NO_PRUNING)
-        return grow_op
-
       if train_in_memory and is_single_machine:
-        train_op.append(distribute_lib.increment_var(global_step))
-        train_op.append(
-            grow_tree_from_stats_summaries(stats_summaries_list,
-                                           feature_ids_list))
+        grower = _InMemoryEnsembleGrower(tree_ensemble, tree_hparams)
       else:
-        dependencies = []
-
-        for i, feature_ids in enumerate(feature_ids_list):
-          stats_summaries = stats_summaries_list[i]
-          accumulator = data_flow_ops.ConditionalAccumulator(
-              dtype=dtypes.float32,
-              # The stats consist of grads and hessians (the last dimension).
-              shape=[len(feature_ids), max_splits, bucket_size_list[i], 2],
-              shared_name='numeric_stats_summary_accumulator_' + str(i))
-          accumulators.append(accumulator)
-
-          apply_grad = accumulator.apply_grad(
-              array_ops.stack(stats_summaries, axis=0), stamp_token)
-          dependencies.append(apply_grad)
-
-        def grow_tree_from_accumulated_summaries_fn():
-          """Updates the tree with the best layer from accumulated summaries."""
-          # Take out the accumulated summaries from the accumulator and grow.
-          stats_summaries_list = []
-
-          stats_summaries_list = [
-              array_ops.unstack(accumulator.take_grad(1), axis=0)
-              for accumulator in accumulators
-          ]
-
-          grow_op = grow_tree_from_stats_summaries(stats_summaries_list,
-                                                   feature_ids_list)
-          return grow_op
-
-        with ops.control_dependencies(dependencies):
-          train_op.append(distribute_lib.increment_var(global_step))
-          if config.is_chief:
-            min_accumulated = math_ops.reduce_min(
-                array_ops.stack(
-                    [acc.num_accumulated() for acc in accumulators]))
-
-            train_op.append(
-                control_flow_ops.cond(
-                    math_ops.greater_equal(min_accumulated,
-                                           n_batches_per_layer),
-                    grow_tree_from_accumulated_summaries_fn,
-                    control_flow_ops.no_op,
-                    name='wait_until_n_batches_accumulated'))
+        grower = _AccumulatorEnsembleGrower(tree_ensemble, tree_hparams,
+                                            stamp_token, n_batches_per_layer,
+                                            bucket_size_list, config.is_chief)
+
+      update_model = control_flow_ops.cond(
+          center_bias_var,
+          functools.partial(
+              grower.center_bias,
+              center_bias_var,
+              gradients,
+              hessians,
+          ),
+          functools.partial(grower.grow_tree, stats_summaries_list,
+                            feature_ids_list, last_layer_nodes_range))
+      train_op.append(update_model)
+
+      with ops.control_dependencies([update_model]):
+        increment_global = distribute_lib.increment_var(global_step)
+        train_op.append(increment_global)
 
       return control_flow_ops.group(train_op, name='train_op')
 
@@ -739,7 +936,9 @@ class BoostedTreesClassifier(estimator.Estimator):
                l2_regularization=0.,
                tree_complexity=0.,
                min_node_weight=0.,
-               config=None):
+               config=None,
+               center_bias=False,
+               pruning_mode='none'):
     """Initializes a `BoostedTreesClassifier` instance.
 
     Example:
@@ -807,6 +1006,17 @@ class BoostedTreesClassifier(estimator.Estimator):
         split to be considered. The value will be compared with
         sum(leaf_hessian)/(batch_size * n_batches_per_layer).
       config: `RunConfig` object to configure the runtime settings.
+      center_bias: Whether bias centering needs to occur. Bias centering refers
+        to the first node in the very first tree returning the prediction that
+        is aligned with the original labels distribution. For example, for
+        regression problems, the first node will return the mean of the labels.
+        For binary classification problems, it will return a logit for a prior
+        probability of label 1.
+      pruning_mode: one of 'none', 'pre', 'post' to indicate no pruning, pre-
+        pruning (do not split a node if not enough gain is observed) and post
+        pruning (build the tree up to a max depth and then prune branches with
+        negative gain). For pre and post pruning, you MUST provide
+        tree_complexity >0.
 
     Raises:
       ValueError: when wrong arguments are given or unsupported functionalities
@@ -819,9 +1029,9 @@ class BoostedTreesClassifier(estimator.Estimator):
         n_classes, weight_column, label_vocabulary=label_vocabulary)
 
     # HParams for the model.
-    tree_hparams = _TreeHParams(n_trees, max_depth, learning_rate,
-                                l1_regularization, l2_regularization,
-                                tree_complexity, min_node_weight)
+    tree_hparams = _TreeHParams(
+        n_trees, max_depth, learning_rate, l1_regularization, l2_regularization,
+        tree_complexity, min_node_weight, center_bias, pruning_mode)
 
     def _model_fn(features, labels, mode, config):
       return _bt_model_fn(  # pylint: disable=protected-access
@@ -864,7 +1074,9 @@ class BoostedTreesRegressor(estimator.Estimator):
                l2_regularization=0.,
                tree_complexity=0.,
                min_node_weight=0.,
-               config=None):
+               config=None,
+               center_bias=False,
+               pruning_mode='none'):
     """Initializes a `BoostedTreesRegressor` instance.
 
     Example:
@@ -925,6 +1137,17 @@ class BoostedTreesRegressor(estimator.Estimator):
         split to be considered. The value will be compared with
         sum(leaf_hessian)/(batch_size * n_batches_per_layer).
       config: `RunConfig` object to configure the runtime settings.
+      center_bias: Whether bias centering needs to occur. Bias centering refers
+        to the first node in the very first tree returning the prediction that
+        is aligned with the original labels distribution. For example, for
+        regression problems, the first node will return the mean of the labels.
+        For binary classification problems, it will return a logit for a prior
+        probability of label 1.
+      pruning_mode: one of 'none', 'pre', 'post' to indicate no pruning, pre-
+        pruning (do not split a node if not enough gain is observed) and post
+        pruning (build the tree up to a max depth and then prune branches with
+        negative gain). For pre and post pruning, you MUST provide
+        tree_complexity >0.
 
     Raises:
       ValueError: when wrong arguments are given or unsupported functionalities
@@ -936,9 +1159,9 @@ class BoostedTreesRegressor(estimator.Estimator):
     head = _create_regression_head(label_dimension, weight_column)
 
     # HParams for the model.
-    tree_hparams = _TreeHParams(n_trees, max_depth, learning_rate,
-                                l1_regularization, l2_regularization,
-                                tree_complexity, min_node_weight)
+    tree_hparams = _TreeHParams(
+        n_trees, max_depth, learning_rate, l1_regularization, l2_regularization,
+        tree_complexity, min_node_weight, center_bias, pruning_mode)
 
     def _model_fn(features, labels, mode, config):
       return _bt_model_fn(  # pylint: disable=protected-access
diff --git a/tensorflow/python/estimator/canned/boosted_trees_test.py b/tensorflow/python/estimator/canned/boosted_trees_test.py
index 33e9e69b04..ec597e4686 100644
--- a/tensorflow/python/estimator/canned/boosted_trees_test.py
+++ b/tensorflow/python/estimator/canned/boosted_trees_test.py
@@ -554,14 +554,6 @@ class ModelFnTests(test_util.TensorFlowTestCase):
             feature_column.numeric_column('f_%d' % i, dtype=dtypes.float32),
             BUCKET_BOUNDARIES) for i in range(NUM_FEATURES)
     }
-    self._tree_hparams = boosted_trees._TreeHParams(  # pylint:disable=protected-access
-        n_trees=2,
-        max_depth=2,
-        learning_rate=0.1,
-        l1=0.,
-        l2=0.01,
-        tree_complexity=0.,
-        min_node_weight=0.)
 
   def _get_expected_ensembles_for_classification(self):
     first_round = """
@@ -790,6 +782,245 @@ class ModelFnTests(test_util.TensorFlowTestCase):
         """
     return (first_round, second_round, third_round)
 
+  def _get_expected_ensembles_for_classification_with_bias(self):
+    first_round = """
+        trees {
+          nodes {
+            leaf {
+              scalar: -0.405086
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_metadata {
+        }
+        """
+    second_round = """
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 2
+              threshold: 2
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 0.407711
+              original_leaf {
+                scalar: -0.405086
+              }
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.556054
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.301233
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_metadata {
+          num_layers_grown: 1
+          is_finalized: false
+        }
+        growing_metadata {
+          num_trees_attempted: 1
+          num_layers_attempted: 1
+          last_layer_node_start: 1
+          last_layer_node_end: 3
+        }
+        """
+    third_round = """
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 2
+              threshold: 2
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 0.407711
+              original_leaf {
+                scalar: -0.405086
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              feature_id: 0
+              threshold: 3
+              left_id: 3
+              right_id: 4
+            }
+            metadata {
+              original_leaf {
+                scalar: -0.556054
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              feature_id: 0
+              threshold: 0
+              left_id: 5
+              right_id: 6
+            }
+            metadata {
+              gain: 0.09876
+              original_leaf {
+                scalar: -0.301233
+              }
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.698072
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.556054
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.106016
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.27349
+            }
+          }
+        }
+        trees {
+          nodes {
+            leaf {
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_weights: 1.0
+        tree_metadata {
+          num_layers_grown: 2
+          is_finalized: true
+        }
+        tree_metadata {
+        }
+        growing_metadata {
+          num_trees_attempted: 1
+          num_layers_attempted: 2
+          last_layer_node_end: 1
+        }
+        """
+    forth_round = """
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 2
+              threshold: 2
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 0.4077113
+              original_leaf {
+                scalar: -0.405086
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              threshold: 3
+              left_id: 3
+              right_id: 4
+            }
+            metadata {
+              original_leaf {
+                scalar: -0.556054
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              threshold: 0
+              left_id: 5
+              right_id: 6
+            }
+            metadata {
+              gain: 0.09876
+              original_leaf {
+                scalar: -0.301233
+              }
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.698072
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.556054
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.106016
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.27349
+            }
+          }
+        }
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 2
+              threshold: 2
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 0.289927
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.134588
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 0.083838            
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_weights: 1.0
+        tree_metadata {
+          num_layers_grown: 2
+          is_finalized: true
+        }
+        tree_metadata {
+          num_layers_grown: 1
+        }
+        growing_metadata {
+          num_trees_attempted: 2
+          num_layers_attempted: 3
+          last_layer_node_start: 1
+          last_layer_node_end: 3
+        }
+        """
+    return (first_round, second_round, third_round, forth_round)
+
   def _get_expected_ensembles_for_regression(self):
     first_round = """
         trees {
@@ -1017,17 +1248,276 @@ class ModelFnTests(test_util.TensorFlowTestCase):
         """
     return (first_round, second_round, third_round)
 
-  def _get_train_op_and_ensemble(self, head, config, is_classification,
-                                 train_in_memory):
+  def _get_expected_ensembles_for_regression_with_bias(self):
+    first_round = """
+        trees {
+          nodes {
+            leaf {
+              scalar: 1.799974
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_metadata {
+        }
+        """
+    second_round = """
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 1
+              threshold: 1
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 1.190442
+              original_leaf {
+                scalar: 1.799974
+              }
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.862786
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.706149
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_metadata {
+          num_layers_grown: 1
+          is_finalized: false
+        }
+        growing_metadata {
+          num_trees_attempted: 1
+          num_layers_attempted: 1
+          last_layer_node_start: 1
+          last_layer_node_end: 3
+        }
+        """
+    third_round = """
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 1
+              threshold: 1
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 1.190442
+              original_leaf {
+                scalar: 1.799974
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              feature_id: 0
+              threshold: 1
+              left_id: 3
+              right_id: 4
+            }
+            metadata {
+              gain: 2.683594
+              original_leaf {
+                scalar: 1.862786
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              feature_id: 0
+              threshold: 0
+              left_id: 5
+              right_id: 6
+            }
+            metadata {
+              gain: 0.322693
+              original_leaf {
+                scalar: 1.706149
+              }
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 2.024487
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.710319
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.559208
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.686037
+            }
+          }
+        }
+        trees {
+          nodes {
+            leaf {
+              scalar: 0.0
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_weights: 1.0
+        tree_metadata {
+          num_layers_grown: 2
+          is_finalized: true
+        }
+        tree_metadata {
+          num_layers_grown: 0
+          is_finalized: false
+        }
+        growing_metadata {
+          num_trees_attempted: 1
+          num_layers_attempted: 2
+          last_layer_node_start: 0
+          last_layer_node_end: 1
+        }
+        """
+    forth_round = """
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 1
+              threshold: 1
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 1.190442
+              original_leaf {
+                scalar:  1.799974
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              threshold: 1
+              left_id: 3
+              right_id: 4
+            }
+            metadata {
+              gain: 2.683594
+              original_leaf {
+                scalar: 1.8627863
+              }
+            }
+          }
+          nodes {
+            bucketized_split {
+              left_id: 5
+              right_id: 6
+            }
+            metadata {
+              gain: 0.322693
+              original_leaf {
+                scalar: 1.706149
+              }
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 2.024487
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.710319
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.5592078
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 1.686037
+            }
+          }
+        }
+        trees {
+          nodes {
+            bucketized_split {
+              feature_id: 1
+              left_id: 1
+              right_id: 2
+            }
+            metadata {
+              gain: 0.972589
+            }
+          }
+          nodes {
+            leaf {
+              scalar: -0.137592
+            }
+          }
+          nodes {
+            leaf {
+              scalar: 0.034926
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_weights: 1.0
+        tree_metadata {
+          num_layers_grown: 2
+          is_finalized: true
+        }
+        tree_metadata {
+          num_layers_grown: 1
+        }
+        growing_metadata {
+          num_trees_attempted: 2
+          num_layers_attempted: 3
+          last_layer_node_start: 1
+          last_layer_node_end: 3
+        }
+        """
+    return (first_round, second_round, third_round, forth_round)
+
+  def _get_train_op_and_ensemble(self,
+                                 head,
+                                 config,
+                                 is_classification,
+                                 train_in_memory,
+                                 center_bias=False):
     """Calls bt_model_fn() and returns the train_op and ensemble_serialzed."""
     features, labels = _make_train_input_fn(is_classification)()
+
+    tree_hparams = boosted_trees._TreeHParams(  # pylint:disable=protected-access
+        n_trees=2,
+        max_depth=2,
+        learning_rate=0.1,
+        l1=0.,
+        l2=0.01,
+        tree_complexity=0.,
+        min_node_weight=0.,
+        center_bias=center_bias,
+        pruning_mode='none')
+
     estimator_spec = boosted_trees._bt_model_fn(  # pylint:disable=protected-access
         features=features,
         labels=labels,
         mode=model_fn.ModeKeys.TRAIN,
         head=head,
         feature_columns=self._feature_columns,
-        tree_hparams=self._tree_hparams,
+        tree_hparams=tree_hparams,
         example_id_column_name=EXAMPLE_ID_COLUMN,
         n_batches_per_layer=1,
         config=config,
@@ -1076,6 +1566,49 @@ class ModelFnTests(test_util.TensorFlowTestCase):
       ensemble_proto.ParseFromString(serialized)
       self.assertProtoEquals(expected_third, ensemble_proto)
 
+  def testTrainClassifierWithCenterBiasInMemory(self):
+    ops.reset_default_graph()
+
+    # When bias centering is on, we expect the very first node to have the
+    expected_first, expected_second, expected_third, expected_forth = (
+        self._get_expected_ensembles_for_classification_with_bias())
+
+    with self.test_session() as sess:
+      with sess.graph.as_default():
+        train_op, ensemble_serialized = self._get_train_op_and_ensemble(
+            boosted_trees._create_classification_head(n_classes=2),
+            run_config.RunConfig(),
+            is_classification=True,
+            train_in_memory=True,
+            center_bias=True)
+
+      # 4 iterations to center bias.
+      for _ in range(4):
+        _, serialized = sess.run([train_op, ensemble_serialized])
+
+      # Validate the trained ensemble.
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_first, ensemble_proto)
+
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_second, ensemble_proto)
+
+      # Third round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_third, ensemble_proto)
+
+      # Forth round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+
+      self.assertProtoEquals(expected_forth, ensemble_proto)
+
   def testTrainClassifierNonInMemory(self):
     ops.reset_default_graph()
     expected_first, expected_second, expected_third = (
@@ -1106,6 +1639,47 @@ class ModelFnTests(test_util.TensorFlowTestCase):
       ensemble_proto.ParseFromString(serialized)
       self.assertProtoEquals(expected_third, ensemble_proto)
 
+  def testTrainClassifierWithCenterBiasNonInMemory(self):
+    ops.reset_default_graph()
+
+    # When bias centering is on, we expect the very first node to have the
+    expected_first, expected_second, expected_third, expected_forth = (
+        self._get_expected_ensembles_for_classification_with_bias())
+
+    with self.test_session() as sess:
+      with sess.graph.as_default():
+        train_op, ensemble_serialized = self._get_train_op_and_ensemble(
+            boosted_trees._create_classification_head(n_classes=2),
+            run_config.RunConfig(),
+            is_classification=True,
+            train_in_memory=False,
+            center_bias=True)
+      # 4 iterations to center bias.
+      for _ in range(4):
+        _, serialized = sess.run([train_op, ensemble_serialized])
+      # Validate the trained ensemble.
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_first, ensemble_proto)
+
+      # Run one more time and validate the trained ensemble.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_second, ensemble_proto)
+
+      # Third round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_third, ensemble_proto)
+
+      # Forth round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_forth, ensemble_proto)
+
   def testTrainRegressorInMemory(self):
     ops.reset_default_graph()
     expected_first, expected_second, expected_third = (
@@ -1136,6 +1710,46 @@ class ModelFnTests(test_util.TensorFlowTestCase):
       ensemble_proto.ParseFromString(serialized)
       self.assertProtoEquals(expected_third, ensemble_proto)
 
+  def testTrainRegressorInMemoryWithCenterBias(self):
+    ops.reset_default_graph()
+    expected_first, expected_second, expected_third, expected_forth = (
+        self._get_expected_ensembles_for_regression_with_bias())
+    with self.test_session() as sess:
+      # Train with train_in_memory mode.
+      with sess.graph.as_default():
+        train_op, ensemble_serialized = self._get_train_op_and_ensemble(
+            boosted_trees._create_regression_head(label_dimension=1),
+            run_config.RunConfig(),
+            is_classification=False,
+            train_in_memory=True,
+            center_bias=True)
+      # 3 iterations to center bias.
+      for _ in range(3):
+        _, serialized = sess.run([train_op, ensemble_serialized])
+      # Validate the trained ensemble.
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+
+      self.assertProtoEquals(expected_first, ensemble_proto)
+
+      # Run one more time and validate the trained ensemble.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_second, ensemble_proto)
+
+      # Third round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_third, ensemble_proto)
+
+      # Forth round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_forth, ensemble_proto)
+
   def testTrainRegressorNonInMemory(self):
     ops.reset_default_graph()
     expected_first, expected_second, expected_third = (
@@ -1166,6 +1780,46 @@ class ModelFnTests(test_util.TensorFlowTestCase):
       ensemble_proto.ParseFromString(serialized)
       self.assertProtoEquals(expected_third, ensemble_proto)
 
+  def testTrainRegressorNotInMemoryWithCenterBias(self):
+    ops.reset_default_graph()
+    expected_first, expected_second, expected_third, expected_forth = (
+        self._get_expected_ensembles_for_regression_with_bias())
+    with self.test_session() as sess:
+      # Train with train_in_memory mode.
+      with sess.graph.as_default():
+        train_op, ensemble_serialized = self._get_train_op_and_ensemble(
+            boosted_trees._create_regression_head(label_dimension=1),
+            run_config.RunConfig(),
+            is_classification=False,
+            train_in_memory=False,
+            center_bias=True)
+      # 3 iterations to center the bias (because we are using regularization).
+      for _ in range(3):
+        _, serialized = sess.run([train_op, ensemble_serialized])
+
+      # Validate the trained ensemble.
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_first, ensemble_proto)
+
+      # Run one more time and validate the trained ensemble.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_second, ensemble_proto)
+
+      # Third round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_third, ensemble_proto)
+
+      # Forth round training and validation.
+      _, serialized = sess.run([train_op, ensemble_serialized])
+      ensemble_proto = boosted_trees_pb2.TreeEnsemble()
+      ensemble_proto.ParseFromString(serialized)
+      self.assertProtoEquals(expected_forth, ensemble_proto)
+
 
 if __name__ == '__main__':
   googletest.main()
diff --git a/tensorflow/python/estimator/canned/dnn_linear_combined.py b/tensorflow/python/estimator/canned/dnn_linear_combined.py
index 5f453d6fe8..efa7812452 100644
--- a/tensorflow/python/estimator/canned/dnn_linear_combined.py
+++ b/tensorflow/python/estimator/canned/dnn_linear_combined.py
@@ -89,7 +89,8 @@ def _dnn_linear_combined_model_fn(features,
                                   dnn_dropout=None,
                                   input_layer_partitioner=None,
                                   config=None,
-                                  batch_norm=False):
+                                  batch_norm=False,
+                                  linear_sparse_combiner='sum'):
   """Deep Neural Net and Linear combined model_fn.
 
   Args:
@@ -117,7 +118,9 @@ def _dnn_linear_combined_model_fn(features,
     input_layer_partitioner: Partitioner for input layer.
     config: `RunConfig` object to configure the runtime settings.
     batch_norm: Whether to use batch normalization after each hidden layer.
-
+    linear_sparse_combiner: A string specifying how to reduce the linear model
+      if a categorical column is multivalent.  One of "mean", "sqrtn", and
+      "sum".
   Returns:
     An `EstimatorSpec` instance.
 
@@ -185,7 +188,8 @@ def _dnn_linear_combined_model_fn(features,
         partitioner=input_layer_partitioner) as scope:
       logit_fn = linear._linear_logit_fn_builder(  # pylint: disable=protected-access
           units=head.logits_dimension,
-          feature_columns=linear_feature_columns)
+          feature_columns=linear_feature_columns,
+          sparse_combiner=linear_sparse_combiner)
       linear_logits = logit_fn(features=features)
       _add_layer_summary(linear_logits, scope.name)
 
@@ -325,7 +329,8 @@ class DNNLinearCombinedClassifier(estimator.Estimator):
                config=None,
                warm_start_from=None,
                loss_reduction=losses.Reduction.SUM,
-               batch_norm=False):
+               batch_norm=False,
+               linear_sparse_combiner='sum'):
     """Initializes a DNNLinearCombinedClassifier instance.
 
     Args:
@@ -379,6 +384,11 @@ class DNNLinearCombinedClassifier(estimator.Estimator):
       loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how
         to reduce training loss over batch. Defaults to `SUM`.
       batch_norm: Whether to use batch normalization after each hidden layer.
+      linear_sparse_combiner: A string specifying how to reduce the linear model
+        if a categorical column is multivalent.  One of "mean", "sqrtn", and
+        "sum" -- these are effectively different ways to do example-level
+        normalization, which can be useful for bag-of-words features.  For more
+        details, see @{tf.feature_column.linear_model$linear_model}.
 
     Raises:
       ValueError: If both linear_feature_columns and dnn_features_columns are
@@ -419,7 +429,8 @@ class DNNLinearCombinedClassifier(estimator.Estimator):
           dnn_dropout=dnn_dropout,
           input_layer_partitioner=input_layer_partitioner,
           config=config,
-          batch_norm=batch_norm)
+          batch_norm=batch_norm,
+          linear_sparse_combiner=linear_sparse_combiner)
 
     super(DNNLinearCombinedClassifier, self).__init__(
         model_fn=_model_fn, model_dir=model_dir, config=config,
@@ -522,7 +533,8 @@ class DNNLinearCombinedRegressor(estimator.Estimator):
                config=None,
                warm_start_from=None,
                loss_reduction=losses.Reduction.SUM,
-               batch_norm=False):
+               batch_norm=False,
+               linear_sparse_combiner='sum'):
     """Initializes a DNNLinearCombinedRegressor instance.
 
     Args:
@@ -570,6 +582,11 @@ class DNNLinearCombinedRegressor(estimator.Estimator):
       loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how
         to reduce training loss over batch. Defaults to `SUM`.
       batch_norm: Whether to use batch normalization after each hidden layer.
+      linear_sparse_combiner: A string specifying how to reduce the linear model
+        if a categorical column is multivalent.  One of "mean", "sqrtn", and
+        "sum" -- these are effectively different ways to do example-level
+        normalization, which can be useful for bag-of-words features.  For more
+        details, see @{tf.feature_column.linear_model$linear_model}.
 
     Raises:
       ValueError: If both linear_feature_columns and dnn_features_columns are
@@ -601,7 +618,8 @@ class DNNLinearCombinedRegressor(estimator.Estimator):
           dnn_dropout=dnn_dropout,
           input_layer_partitioner=input_layer_partitioner,
           config=config,
-          batch_norm=batch_norm)
+          batch_norm=batch_norm,
+          linear_sparse_combiner=linear_sparse_combiner)
 
     super(DNNLinearCombinedRegressor, self).__init__(
         model_fn=_model_fn, model_dir=model_dir, config=config,
diff --git a/tensorflow/python/estimator/canned/dnn_linear_combined_test.py b/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
index d275695eb3..d16318659b 100644
--- a/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
+++ b/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
@@ -100,7 +100,8 @@ def _linear_regressor_fn(feature_columns,
                          weight_column=None,
                          optimizer='Ftrl',
                          config=None,
-                         partitioner=None):
+                         partitioner=None,
+                         sparse_combiner='sum'):
   return dnn_linear_combined.DNNLinearCombinedRegressor(
       model_dir=model_dir,
       linear_feature_columns=feature_columns,
@@ -108,7 +109,8 @@ def _linear_regressor_fn(feature_columns,
       label_dimension=label_dimension,
       weight_column=weight_column,
       input_layer_partitioner=partitioner,
-      config=config)
+      config=config,
+      linear_sparse_combiner=sparse_combiner)
 
 
 class LinearOnlyRegressorPartitionerTest(
@@ -163,7 +165,8 @@ def _linear_classifier_fn(feature_columns,
                           label_vocabulary=None,
                           optimizer='Ftrl',
                           config=None,
-                          partitioner=None):
+                          partitioner=None,
+                          sparse_combiner='sum'):
   return dnn_linear_combined.DNNLinearCombinedClassifier(
       model_dir=model_dir,
       linear_feature_columns=feature_columns,
@@ -172,7 +175,8 @@ def _linear_classifier_fn(feature_columns,
       weight_column=weight_column,
       label_vocabulary=label_vocabulary,
       input_layer_partitioner=partitioner,
-      config=config)
+      config=config,
+      linear_sparse_combiner=sparse_combiner)
 
 
 class LinearOnlyClassifierTrainingTest(
diff --git a/tensorflow/python/estimator/canned/dnn_testing_utils.py b/tensorflow/python/estimator/canned/dnn_testing_utils.py
index ba17821259..de226ed0ef 100644
--- a/tensorflow/python/estimator/canned/dnn_testing_utils.py
+++ b/tensorflow/python/estimator/canned/dnn_testing_utils.py
@@ -1271,6 +1271,8 @@ class BaseDNNRegressorEvaluateTest(object):
     self.assertAllClose({
         metric_keys.MetricKeys.LOSS: expected_loss,
         metric_keys.MetricKeys.LOSS_MEAN: expected_loss,
+        metric_keys.MetricKeys.PREDICTION_MEAN: -2.08,
+        metric_keys.MetricKeys.LABEL_MEAN: 1.0,
         ops.GraphKeys.GLOBAL_STEP: global_step
     }, dnn_regressor.evaluate(input_fn=_input_fn, steps=1))
 
@@ -1301,6 +1303,8 @@ class BaseDNNRegressorEvaluateTest(object):
     self.assertAllClose({
         metric_keys.MetricKeys.LOSS: expected_loss,
         metric_keys.MetricKeys.LOSS_MEAN: expected_loss / label_dimension,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 0.39 / 3.0,
+        metric_keys.MetricKeys.LABEL_MEAN: 0.5 / 3.0,
         ops.GraphKeys.GLOBAL_STEP: global_step
     }, dnn_regressor.evaluate(input_fn=_input_fn, steps=1))
 
diff --git a/tensorflow/python/estimator/canned/head.py b/tensorflow/python/estimator/canned/head.py
index b74ef1015c..da9a64c2bc 100644
--- a/tensorflow/python/estimator/canned/head.py
+++ b/tensorflow/python/estimator/canned/head.py
@@ -1398,15 +1398,21 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
         weights=weights,
         processed_labels=labels)
 
-  def _eval_metric_ops(self, weights, unreduced_loss, regularization_loss):
+  def _eval_metric_ops(self, predicted_value, labels, weights, unreduced_loss,
+                       regularization_loss):
     """Returns the Eval metric ops."""
     keys = metric_keys.MetricKeys
     # Estimator already adds a metric for loss.
     eval_metric_ops = {
         _summary_key(self._name, keys.LOSS_MEAN):
-            metrics_lib.mean(
-                values=unreduced_loss,
-                weights=weights)
+            metrics_lib.mean(values=unreduced_loss, weights=weights),
+        _summary_key(self._name, keys.PREDICTION_MEAN):
+            _predictions_mean(
+                predictions=predicted_value,
+                weights=weights,
+                name=keys.PREDICTION_MEAN),
+        _summary_key(self._name, keys.LABEL_MEAN):
+            metrics_lib.mean(values=labels, weights=weights)
     }
     if regularization_loss is not None:
       regularization_loss_key = _summary_key(
@@ -1489,13 +1495,13 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
             predictions=predictions,
             loss=regularized_training_loss,
             eval_metrics=_create_eval_metrics_tuple(
-                self._eval_metric_ops,
-                {
+                self._eval_metric_ops, {
+                    'predicted_value': predicted_value,
+                    'labels': labels,
                     'weights': weights,
                     'unreduced_loss': unreduced_loss,
                     'regularization_loss': regularization_loss,
-                }
-            ))
+                }))
 
       # Train.
       if optimizer is not None:
diff --git a/tensorflow/python/estimator/canned/head_test.py b/tensorflow/python/estimator/canned/head_test.py
index 08ce5ca8e8..bd2e0ae943 100644
--- a/tensorflow/python/estimator/canned/head_test.py
+++ b/tensorflow/python/estimator/canned/head_test.py
@@ -3103,8 +3103,10 @@ class RegressionHead(test.TestCase):
     self.assertItemsEqual((prediction_key,), spec.predictions.keys())
     self.assertEqual(dtypes.float32, spec.predictions[prediction_key].dtype)
     self.assertEqual(dtypes.float32, spec.loss.dtype)
-    self.assertItemsEqual(
-        (metric_keys.MetricKeys.LOSS_MEAN,), spec.eval_metric_ops.keys())
+    self.assertItemsEqual((metric_keys.MetricKeys.LOSS_MEAN,
+                           metric_keys.MetricKeys.PREDICTION_MEAN,
+                           metric_keys.MetricKeys.LABEL_MEAN),
+                          spec.eval_metric_ops.keys())
     self.assertIsNone(spec.train_op)
     self.assertIsNone(spec.export_outputs)
     _assert_no_hooks(self, spec)
@@ -3140,6 +3142,9 @@ class RegressionHead(test.TestCase):
 
     expected_metric_keys = [
         '{}/some_regression_head'.format(metric_keys.MetricKeys.LOSS_MEAN),
+        '{}/some_regression_head'.format(
+            metric_keys.MetricKeys.PREDICTION_MEAN),
+        '{}/some_regression_head'.format(metric_keys.MetricKeys.LABEL_MEAN),
     ]
     self.assertItemsEqual(expected_metric_keys, spec.eval_metric_ops.keys())
 
@@ -3170,6 +3175,8 @@ class RegressionHead(test.TestCase):
     expected_metrics = {
         keys.LOSS_MEAN: expected_unregularized_loss,
         keys.LOSS_REGULARIZATION: expected_regularization_loss,
+        keys.PREDICTION_MEAN: (45 + 41) / 2.0,
+        keys.LABEL_MEAN: (43 + 44) / 2.0,
     }
 
     # Assert predictions, loss, and metrics.
@@ -3471,8 +3478,10 @@ class RegressionHead(test.TestCase):
     self.assertItemsEqual((prediction_key,), spec.predictions.keys())
     self.assertEqual(dtypes.float32, spec.predictions[prediction_key].dtype)
     self.assertEqual(dtypes.float32, spec.loss.dtype)
-    self.assertItemsEqual(
-        (metric_keys.MetricKeys.LOSS_MEAN,), spec.eval_metric_ops.keys())
+    self.assertItemsEqual((metric_keys.MetricKeys.LOSS_MEAN,
+                           metric_keys.MetricKeys.PREDICTION_MEAN,
+                           metric_keys.MetricKeys.LABEL_MEAN),
+                          spec.eval_metric_ops.keys())
     self.assertIsNone(spec.train_op)
     self.assertIsNone(spec.export_outputs)
     _assert_no_hooks(self, spec)
@@ -3700,8 +3709,10 @@ class RegressionHead(test.TestCase):
     self.assertItemsEqual((prediction_key,), spec.predictions.keys())
     self.assertEqual(dtypes.float32, spec.predictions[prediction_key].dtype)
     self.assertEqual(dtypes.float32, spec.loss.dtype)
-    self.assertItemsEqual(
-        (metric_keys.MetricKeys.LOSS_MEAN,), spec.eval_metric_ops.keys())
+    self.assertItemsEqual((metric_keys.MetricKeys.LOSS_MEAN,
+                           metric_keys.MetricKeys.PREDICTION_MEAN,
+                           metric_keys.MetricKeys.LABEL_MEAN),
+                          spec.eval_metric_ops.keys())
     self.assertIsNone(spec.train_op)
     self.assertIsNone(spec.export_outputs)
     _assert_no_hooks(self, spec)
@@ -3832,7 +3843,13 @@ class RegressionHead(test.TestCase):
     # losses = [1*(35-45)^2, .1*(42-41)^2, 1.5*(45-44)^2] = [100, .1, 1.5]
     # loss = sum(losses) = 100+.1+1.5 = 101.6
     # loss_mean = loss/(1+.1+1.5) = 101.6/2.6 = 39.076923
-    expected_metrics = {metric_keys.MetricKeys.LOSS_MEAN: 39.076923}
+    expected_metrics = {
+        metric_keys.MetricKeys.LOSS_MEAN:
+            39.076923,
+        metric_keys.MetricKeys.PREDICTION_MEAN:
+            (45 + 41 * 0.1 + 44 * 1.5) / 2.6,
+        metric_keys.MetricKeys.LABEL_MEAN: (35 + 42 * 0.1 + 45 * 1.5) / 2.6,
+    }
 
     # Assert spec contains expected tensors.
     self.assertEqual(dtypes.float32, spec.loss.dtype)
diff --git a/tensorflow/python/estimator/canned/linear.py b/tensorflow/python/estimator/canned/linear.py
index e22df849e5..58a7160348 100644
--- a/tensorflow/python/estimator/canned/linear.py
+++ b/tensorflow/python/estimator/canned/linear.py
@@ -66,13 +66,15 @@ def _compute_fraction_of_zero(cols_to_vars):
   return nn.zero_fraction(array_ops.concat(all_weight_vars, axis=0))
 
 
-def _linear_logit_fn_builder(units, feature_columns):
+def _linear_logit_fn_builder(units, feature_columns, sparse_combiner='sum'):
   """Function builder for a linear logit_fn.
 
   Args:
     units: An int indicating the dimension of the logit layer.
     feature_columns: An iterable containing all the feature columns used by
       the model.
+    sparse_combiner: A string specifying how to reduce if a categorical column
+      is multivalent.  One of "mean", "sqrtn", and "sum".
 
   Returns:
     A logit_fn (see below).
@@ -95,6 +97,7 @@ def _linear_logit_fn_builder(units, feature_columns):
         features=features,
         feature_columns=feature_columns,
         units=units,
+        sparse_combiner=sparse_combiner,
         cols_to_vars=cols_to_vars)
     bias = cols_to_vars.pop('bias')
     if units > 1:
@@ -111,7 +114,7 @@ def _linear_logit_fn_builder(units, feature_columns):
 
 
 def _linear_model_fn(features, labels, mode, head, feature_columns, optimizer,
-                     partitioner, config):
+                     partitioner, config, sparse_combiner='sum'):
   """A model_fn for linear models that use a gradient-based optimizer.
 
   Args:
@@ -126,6 +129,8 @@ def _linear_model_fn(features, labels, mode, head, feature_columns, optimizer,
       optimizer to use for training. If `None`, will use a FTRL optimizer.
     partitioner: Partitioner for variables.
     config: `RunConfig` object to configure the runtime settings.
+    sparse_combiner: A string specifying how to reduce if a categorical column
+      is multivalent.  One of "mean", "sqrtn", and "sum".
 
   Returns:
     An `EstimatorSpec` instance.
@@ -153,7 +158,8 @@ def _linear_model_fn(features, labels, mode, head, feature_columns, optimizer,
       partitioner=partitioner):
 
     logit_fn = _linear_logit_fn_builder(
-        units=head.logits_dimension, feature_columns=feature_columns)
+        units=head.logits_dimension, feature_columns=feature_columns,
+        sparse_combiner=sparse_combiner)
     logits = logit_fn(features=features)
 
     return head.create_estimator_spec(
@@ -255,7 +261,8 @@ class LinearClassifier(estimator.Estimator):
                config=None,
                partitioner=None,
                warm_start_from=None,
-               loss_reduction=losses.Reduction.SUM):
+               loss_reduction=losses.Reduction.SUM,
+               sparse_combiner='sum'):
     """Construct a `LinearClassifier` estimator object.
 
     Args:
@@ -295,6 +302,11 @@ class LinearClassifier(estimator.Estimator):
         and Tensor names are unchanged.
       loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how
         to reduce training loss over batch. Defaults to `SUM`.
+      sparse_combiner: A string specifying how to reduce if a categorical column
+        is multivalent.  One of "mean", "sqrtn", and "sum" -- these are
+        effectively different ways to do example-level normalization, which can
+        be useful for bag-of-words features. for more details, see
+        @{tf.feature_column.linear_model$linear_model}.
 
     Returns:
       A `LinearClassifier` estimator.
@@ -323,7 +335,8 @@ class LinearClassifier(estimator.Estimator):
           feature_columns=tuple(feature_columns or []),
           optimizer=optimizer,
           partitioner=partitioner,
-          config=config)
+          config=config,
+          sparse_combiner=sparse_combiner)
 
     super(LinearClassifier, self).__init__(
         model_fn=_model_fn,
@@ -422,7 +435,8 @@ class LinearRegressor(estimator.Estimator):
                config=None,
                partitioner=None,
                warm_start_from=None,
-               loss_reduction=losses.Reduction.SUM):
+               loss_reduction=losses.Reduction.SUM,
+               sparse_combiner='sum'):
     """Initializes a `LinearRegressor` instance.
 
     Args:
@@ -454,6 +468,11 @@ class LinearRegressor(estimator.Estimator):
         and Tensor names are unchanged.
       loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how
         to reduce training loss over batch. Defaults to `SUM`.
+      sparse_combiner: A string specifying how to reduce if a categorical column
+        is multivalent.  One of "mean", "sqrtn", and "sum" -- these are
+        effectively different ways to do example-level normalization, which can
+        be useful for bag-of-words features. for more details, see
+        @{tf.feature_column.linear_model$linear_model}.
     """
     head = head_lib._regression_head(  # pylint: disable=protected-access
         label_dimension=label_dimension, weight_column=weight_column,
@@ -469,7 +488,8 @@ class LinearRegressor(estimator.Estimator):
           feature_columns=tuple(feature_columns or []),
           optimizer=optimizer,
           partitioner=partitioner,
-          config=config)
+          config=config,
+          sparse_combiner=sparse_combiner)
 
     super(LinearRegressor, self).__init__(
         model_fn=_model_fn,
diff --git a/tensorflow/python/estimator/canned/linear_testing_utils.py b/tensorflow/python/estimator/canned/linear_testing_utils.py
index 0e6436b421..c3934c7a80 100644
--- a/tensorflow/python/estimator/canned/linear_testing_utils.py
+++ b/tensorflow/python/estimator/canned/linear_testing_utils.py
@@ -29,6 +29,7 @@ import six
 from tensorflow.core.example import example_pb2
 from tensorflow.core.example import feature_pb2
 from tensorflow.python.client import session as tf_session
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.estimator import estimator
 from tensorflow.python.estimator import run_config
 from tensorflow.python.estimator.canned import linear
@@ -260,6 +261,8 @@ class BaseLinearRegressorEvaluationTest(object):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 9.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -285,6 +288,8 @@ class BaseLinearRegressorEvaluationTest(object):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 18.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -315,6 +320,8 @@ class BaseLinearRegressorEvaluationTest(object):
     self.assertDictEqual({
         metric_keys.MetricKeys.LOSS: 27.,
         metric_keys.MetricKeys.LOSS_MEAN: 9.,
+        metric_keys.MetricKeys.PREDICTION_MEAN: 13.,
+        metric_keys.MetricKeys.LABEL_MEAN: 10.,
         ops.GraphKeys.GLOBAL_STEP: 100
     }, eval_metrics)
 
@@ -345,7 +352,9 @@ class BaseLinearRegressorEvaluationTest(object):
 
     self.assertItemsEqual(
         (metric_keys.MetricKeys.LOSS, metric_keys.MetricKeys.LOSS_MEAN,
-         ops.GraphKeys.GLOBAL_STEP), eval_metrics.keys())
+         metric_keys.MetricKeys.PREDICTION_MEAN,
+         metric_keys.MetricKeys.LABEL_MEAN, ops.GraphKeys.GLOBAL_STEP),
+        eval_metrics.keys())
 
     # Logit is
     #   [2., 4., 5.] * [1.0, 2.0] + [7.0, 8.0] = [39, 50] + [7.0, 8.0]
@@ -382,7 +391,9 @@ class BaseLinearRegressorEvaluationTest(object):
     eval_metrics = est.evaluate(input_fn=input_fn, steps=1)
     self.assertItemsEqual(
         (metric_keys.MetricKeys.LOSS, metric_keys.MetricKeys.LOSS_MEAN,
-         ops.GraphKeys.GLOBAL_STEP), eval_metrics.keys())
+         metric_keys.MetricKeys.PREDICTION_MEAN,
+         metric_keys.MetricKeys.LABEL_MEAN, ops.GraphKeys.GLOBAL_STEP),
+        eval_metrics.keys())
 
     # Logit is [(20. * 10.0 + 4 * 2.0 + 5.0), (40. * 10.0 + 8 * 2.0 + 5.0)] =
     # [213.0, 421.0], while label is [213., 421.]. Loss = 0.
@@ -484,6 +495,69 @@ class BaseLinearRegressorPredictTest(object):
     # x0 * weight0 + x1 * weight1 + bias = 2. * 10. + 3. * 20 + .2 = 80.2
     self.assertAllClose([[80.2]], predicted_scores)
 
+  def testSparseCombiner(self):
+    w_a = 2.0
+    w_b = 3.0
+    w_c = 5.0
+    bias = 5.0
+    with ops.Graph().as_default():
+      variables_lib.Variable([[w_a], [w_b], [w_c]], name=LANGUAGE_WEIGHT_NAME)
+      variables_lib.Variable([bias], name=BIAS_NAME)
+      variables_lib.Variable(1, name=ops.GraphKeys.GLOBAL_STEP,
+                             dtype=dtypes.int64)
+      save_variables_to_ckpt(self._model_dir)
+
+    def _input_fn():
+      return dataset_ops.Dataset.from_tensors({
+          'language': sparse_tensor.SparseTensor(
+              values=['a', 'c', 'b', 'c'],
+              indices=[[0, 0], [0, 1], [1, 0], [1, 1]],
+              dense_shape=[2, 2]),
+      })
+
+    feature_columns = (
+        feature_column_lib.categorical_column_with_vocabulary_list(
+            'language', vocabulary_list=['a', 'b', 'c']),)
+
+    # Check prediction for each sparse_combiner.
+    # With sparse_combiner = 'sum', we have
+    # logits_1 = w_a + w_c + bias
+    #          = 2.0 + 5.0 + 5.0 = 12.0
+    # logits_2 = w_b + w_c + bias
+    #          = 3.0 + 5.0 + 5.0 = 13.0
+    linear_regressor = self._linear_regressor_fn(
+        feature_columns=feature_columns,
+        model_dir=self._model_dir)
+    predictions = linear_regressor.predict(input_fn=_input_fn)
+    predicted_scores = list([x['predictions'] for x in predictions])
+    self.assertAllClose([[12.0], [13.0]], predicted_scores)
+
+    # With sparse_combiner = 'mean', we have
+    # logits_1 = 1/2 * (w_a + w_c) + bias
+    #          = 1/2 * (2.0 + 5.0) + 5.0 = 8.5
+    # logits_2 = 1/2 * (w_b + w_c) + bias
+    #          = 1/2 * (3.0 + 5.0) + 5.0 = 9.0
+    linear_regressor = self._linear_regressor_fn(
+        feature_columns=feature_columns,
+        model_dir=self._model_dir,
+        sparse_combiner='mean')
+    predictions = linear_regressor.predict(input_fn=_input_fn)
+    predicted_scores = list([x['predictions'] for x in predictions])
+    self.assertAllClose([[8.5], [9.0]], predicted_scores)
+
+    # With sparse_combiner = 'sqrtn', we have
+    # logits_1 = sqrt(2)/2 * (w_a + w_c) + bias
+    #          = sqrt(2)/2 * (2.0 + 5.0) + 5.0 = 9.94974
+    # logits_2 = sqrt(2)/2 * (w_b + w_c) + bias
+    #          = sqrt(2)/2 * (3.0 + 5.0) + 5.0 = 10.65685
+    linear_regressor = self._linear_regressor_fn(
+        feature_columns=feature_columns,
+        model_dir=self._model_dir,
+        sparse_combiner='sqrtn')
+    predictions = linear_regressor.predict(input_fn=_input_fn)
+    predicted_scores = list([x['predictions'] for x in predictions])
+    self.assertAllClose([[9.94974], [10.65685]], predicted_scores)
+
 
 class BaseLinearRegressorIntegrationTest(object):
 
@@ -1636,6 +1710,69 @@ class BaseLinearClassifierPredictTest(object):
                           for i in range(n_classes)],
         label_output_fn=lambda x: ('class_vocab_%s' % x).encode())
 
+  def testSparseCombiner(self):
+    w_a = 2.0
+    w_b = 3.0
+    w_c = 5.0
+    bias = 5.0
+    with ops.Graph().as_default():
+      variables_lib.Variable([[w_a], [w_b], [w_c]], name=LANGUAGE_WEIGHT_NAME)
+      variables_lib.Variable([bias], name=BIAS_NAME)
+      variables_lib.Variable(1, name=ops.GraphKeys.GLOBAL_STEP,
+                             dtype=dtypes.int64)
+      save_variables_to_ckpt(self._model_dir)
+
+    def _input_fn():
+      return dataset_ops.Dataset.from_tensors({
+          'language': sparse_tensor.SparseTensor(
+              values=['a', 'c', 'b', 'c'],
+              indices=[[0, 0], [0, 1], [1, 0], [1, 1]],
+              dense_shape=[2, 2]),
+      })
+
+    feature_columns = (
+        feature_column_lib.categorical_column_with_vocabulary_list(
+            'language', vocabulary_list=['a', 'b', 'c']),)
+
+    # Check prediction for each sparse_combiner.
+    # With sparse_combiner = 'sum', we have
+    # logits_1 = w_a + w_c + bias
+    #          = 2.0 + 5.0 + 5.0 = 12.0
+    # logits_2 = w_b + w_c + bias
+    #          = 3.0 + 5.0 + 5.0 = 13.0
+    linear_classifier = self._linear_classifier_fn(
+        feature_columns=feature_columns,
+        model_dir=self._model_dir)
+    predictions = linear_classifier.predict(input_fn=_input_fn)
+    predicted_scores = list([x['logits'] for x in predictions])
+    self.assertAllClose([[12.0], [13.0]], predicted_scores)
+
+    # With sparse_combiner = 'mean', we have
+    # logits_1 = 1/2 * (w_a + w_c) + bias
+    #          = 1/2 * (2.0 + 5.0) + 5.0 = 8.5
+    # logits_2 = 1/2 * (w_b + w_c) + bias
+    #          = 1/2 * (3.0 + 5.0) + 5.0 = 9.0
+    linear_classifier = self._linear_classifier_fn(
+        feature_columns=feature_columns,
+        model_dir=self._model_dir,
+        sparse_combiner='mean')
+    predictions = linear_classifier.predict(input_fn=_input_fn)
+    predicted_scores = list([x['logits'] for x in predictions])
+    self.assertAllClose([[8.5], [9.0]], predicted_scores)
+
+    # With sparse_combiner = 'sqrtn', we have
+    # logits_1 = sqrt(2)/2 * (w_a + w_c) + bias
+    #          = sqrt(2)/2 * (2.0 + 5.0) + 5.0 = 9.94974
+    # logits_2 = sqrt(2)/2 * (w_b + w_c) + bias
+    #          = sqrt(2)/2 * (3.0 + 5.0) + 5.0 = 10.65685
+    linear_classifier = self._linear_classifier_fn(
+        feature_columns=feature_columns,
+        model_dir=self._model_dir,
+        sparse_combiner='sqrtn')
+    predictions = linear_classifier.predict(input_fn=_input_fn)
+    predicted_scores = list([x['logits'] for x in predictions])
+    self.assertAllClose([[9.94974], [10.65685]], predicted_scores)
+
 
 class BaseLinearClassifierIntegrationTest(object):
 
diff --git a/tensorflow/python/estimator/canned/metric_keys.py b/tensorflow/python/estimator/canned/metric_keys.py
index 4f7c849ba4..9d49240fea 100644
--- a/tensorflow/python/estimator/canned/metric_keys.py
+++ b/tensorflow/python/estimator/canned/metric_keys.py
@@ -47,3 +47,8 @@ class MetricKeys(object):
   PROBABILITY_MEAN_AT_CLASS = 'probability_mean/class%d'
   AUC_AT_CLASS = 'auc/class%d'
   AUC_PR_AT_CLASS = 'auc_precision_recall/class%d'
+
+  # The following require a class name applied.
+  PROBABILITY_MEAN_AT_NAME = 'probability_mean/%s'
+  AUC_AT_NAME = 'auc/%s'
+  AUC_PR_AT_NAME = 'auc_precision_recall/%s'
diff --git a/tensorflow/python/estimator/estimator.py b/tensorflow/python/estimator/estimator.py
index 350a95eea1..43deb8bc6c 100644
--- a/tensorflow/python/estimator/estimator.py
+++ b/tensorflow/python/estimator/estimator.py
@@ -29,8 +29,6 @@ import six
 
 from google.protobuf import message
 from tensorflow.core.framework import summary_pb2
-from tensorflow.core.protobuf import config_pb2
-from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.client import session as tf_session
 from tensorflow.python.eager import context
 from tensorflow.python.estimator import model_fn as model_fn_lib
@@ -55,6 +53,7 @@ from tensorflow.python.saved_model import builder as saved_model_builder
 from tensorflow.python.saved_model import constants
 from tensorflow.python.summary import summary
 from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import device_setter
 from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.training import evaluation
@@ -181,49 +180,17 @@ class Estimator(object):
     """
     Estimator._assert_members_are_not_overridden(self)
 
-    if config is None:
-      self._config = run_config.RunConfig()
-      logging.info('Using default config.')
-    else:
-      if not isinstance(config, run_config.RunConfig):
-        raise ValueError(
-            'config must be an instance of RunConfig, but provided %s.' %
-            config)
-      self._config = config
+    config = maybe_overwrite_model_dir_and_session_config(config, model_dir)
+    self._config = config
 
     # The distribute field contains an instance of DistributionStrategy.
-    self._distribution = self._config.train_distribute
-
+    self._train_distribution = self._config.train_distribute
+    self._eval_distribution = self._config.eval_distribute
     # Model directory.
-    model_dir = compat_internal.path_to_str(model_dir)
-    if (model_dir is not None) and (self._config.model_dir is not None):
-      if model_dir != self._config.model_dir:
-        # TODO(alanyee): remove this suppression after it is no longer needed
-        # pylint: disable=g-doc-exception
-        raise ValueError(
-            "model_dir are set both in constructor and RunConfig, but with "
-            "different values. In constructor: '{}', in RunConfig: "
-            "'{}' ".format(model_dir, self._config.model_dir))
-        # pylint: enable=g-doc-exception
-
-    self._model_dir = model_dir or self._config.model_dir
-    if self._model_dir is None:
-      self._model_dir = tempfile.mkdtemp()
-      logging.warning('Using temporary folder as model directory: %s',
-                      self._model_dir)
-    if self._config.model_dir is None:
-      self._config = self._config.replace(model_dir=self._model_dir)
+    self._model_dir = self._config.model_dir
+    self._session_config = self._config.session_config
     logging.info('Using config: %s', str(vars(self._config)))
 
-    if self._config.session_config is None:
-      rewrite_opts = rewriter_config_pb2.RewriterConfig(
-          meta_optimizer_iterations=rewriter_config_pb2.RewriterConfig.ONE)
-      graph_opts = config_pb2.GraphOptions(rewrite_options=rewrite_opts)
-      self._session_config = config_pb2.ConfigProto(
-          allow_soft_placement=True, graph_options=graph_opts)
-    else:
-      self._session_config = self._config.session_config
-
     self._device_fn = (
         self._config.device_fn or _get_replica_device_setter(self._config))
 
@@ -302,7 +269,7 @@ class Estimator(object):
       found.
     """
     with context.graph_mode():
-      return saver.latest_checkpoint(self.model_dir)
+      return checkpoint_management.latest_checkpoint(self.model_dir)
 
   def train(self,
             input_fn,
@@ -451,16 +418,15 @@ class Estimator(object):
 
       # Check that model has been trained (if nothing has been set explicitly).
       if not checkpoint_path:
-        latest_path = saver.latest_checkpoint(self._model_dir)
+        latest_path = checkpoint_management.latest_checkpoint(self._model_dir)
         if not latest_path:
           logging.info('Could not find trained model in model_dir: {}, running '
                        'initialization to evaluate.'.format(self._model_dir))
         checkpoint_path = latest_path
 
-      with ops.Graph().as_default():
-        (scaffold, update_op,
-         eval_dict, all_hooks) = self._evaluate_build_graph(
-             input_fn, hooks, checkpoint_path)
+      def _evaluate():
+        (scaffold, update_op, eval_dict, all_hooks) = (
+            self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
         return self._evaluate_run(
             checkpoint_path=checkpoint_path,
             scaffold=scaffold,
@@ -469,6 +435,15 @@ class Estimator(object):
             all_hooks=all_hooks,
             output_dir=self.eval_dir(name))
 
+      with ops.Graph().as_default():
+        # TODO(priyag): Support distributed eval on TPUs.
+        if (self._eval_distribution
+            and self._eval_distribution.__class__.__name__ != 'TPUStrategy'):
+          with self._eval_distribution.scope():
+            return _evaluate()
+        else:
+          return _evaluate()
+
   def _convert_eval_steps_to_hooks(self, steps):
     if steps is None:
       return []
@@ -530,7 +505,8 @@ class Estimator(object):
       hooks = _check_hooks_type(hooks)
       # Check that model has been trained.
       if not checkpoint_path:
-        checkpoint_path = saver.latest_checkpoint(self._model_dir)
+        checkpoint_path = checkpoint_management.latest_checkpoint(
+            self._model_dir)
       if not checkpoint_path:
         logging.info('Could not find trained model in model_dir: {}, running '
                      'initialization to predict.'.format(self._model_dir))
@@ -573,12 +549,19 @@ class Estimator(object):
 
   def _assert_members_are_not_overridden(self):
     """Asserts members of `Estimator` are not overridden."""
+    # TPUEstimator is special cased (owned by TF).
+    if self.__class__.__name__ == 'TPUEstimator':
+      return
+
     allowed_overrides = set([
-        '_call_input_fn', '_create_global_step',
+        '_call_input_fn', '_call_model_fn',
         '_convert_train_steps_to_hooks', '_convert_eval_steps_to_hooks',
-        '_tf_api_names', '_estimator_api_names', '_estimator_api_constants',
+        '_create_global_step', '_create_and_assert_global_step',
+        '_tf_api_names', '_tf_api_names_v1', '_estimator_api_names',
+        '_estimator_api_names_v1', '_estimator_api_constants',
+        '_estimator_api_constants_v1',
         '_validate_features_in_predict_input',
-        '_call_model_fn', '_add_meta_graph_for_mode'
+        '_add_meta_graph_for_mode'
     ])
     estimator_members = set([m for m in Estimator.__dict__.keys()
                              if not m.startswith('__')])
@@ -788,7 +771,8 @@ class Estimator(object):
     with context.graph_mode():
       if not checkpoint_path:
         # Locate the latest checkpoint
-        checkpoint_path = saver.latest_checkpoint(self._model_dir)
+        checkpoint_path = checkpoint_management.latest_checkpoint(
+            self._model_dir)
       if not checkpoint_path:
         raise ValueError("Couldn't find trained model at %s." % self._model_dir)
 
@@ -905,9 +889,10 @@ class Estimator(object):
 
       with tf_session.Session(config=self._session_config) as session:
 
-        local_init_op = (
-            estimator_spec.scaffold.local_init_op or
-            monitored_session.Scaffold.default_local_init_op())
+        if estimator_spec.scaffold.local_init_op is not None:
+          local_init_op = estimator_spec.scaffold.local_init_op
+        else:
+          local_init_op = monitored_session.Scaffold.default_local_init_op()
 
         # This saver will be used both for restoring variables now,
         # and in saving out the metagraph below. This ensures that any
@@ -1000,10 +985,11 @@ class Estimator(object):
                       'QueueRunner. That means predict yields forever. '
                       'This is probably a mistake.')
 
-  def _get_features_and_labels_from_input_fn(self, input_fn, mode):
+  def _get_features_and_labels_from_input_fn(self, input_fn, mode,
+                                             distribution=None):
     """Extracts the `features` and labels from return values of `input_fn`."""
-    if self._distribution is not None and mode == model_fn_lib.ModeKeys.TRAIN:
-      result = self._distribution.distribute_dataset(
+    if distribution is not None and mode == model_fn_lib.ModeKeys.TRAIN:
+      result = distribution.distribute_dataset(
           lambda: self._call_input_fn(input_fn, mode))
     else:
       result = self._call_input_fn(input_fn, mode)
@@ -1137,7 +1123,7 @@ class Estimator(object):
     return model_fn_results
 
   def _train_model(self, input_fn, hooks, saving_listeners):
-    if self._distribution:
+    if self._train_distribution:
       return self._train_model_distributed(input_fn, hooks, saving_listeners)
     else:
       return self._train_model_default(input_fn, hooks, saving_listeners)
@@ -1159,13 +1145,19 @@ class Estimator(object):
     with ops.Graph().as_default() as g, g.device(self._device_fn):
       random_seed.set_random_seed(self._config.tf_random_seed)
       global_step_tensor = self._create_and_assert_global_step(g)
-      training_util._get_or_create_global_step_read()  # pylint: disable=protected-access
+
+      # Skip creating a read variable if _create_and_assert_global_step
+      # returns None (e.g. tf.contrib.estimator.SavedModelEstimator).
+      if global_step_tensor is not None:
+        training_util._get_or_create_global_step_read(g)  # pylint: disable=protected-access
+
       features, labels, input_hooks = (
           self._get_features_and_labels_from_input_fn(
               input_fn, model_fn_lib.ModeKeys.TRAIN))
       worker_hooks.extend(input_hooks)
       estimator_spec = self._call_model_fn(
           features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
+      global_step_tensor = training_util.get_global_step(g)
       return self._train_with_estimator_spec(estimator_spec, worker_hooks,
                                              hooks, global_step_tensor,
                                              saving_listeners)
@@ -1183,22 +1175,23 @@ class Estimator(object):
     Returns:
       Loss from training
     """
-    self._distribution.configure(self._session_config)
+    self._train_distribution.configure(self._session_config)
 
     # TODO(sourabhbajaj): Remove this hack once we migrate the other strategies
     # to use the new API
-    is_tpu_strategy = self._distribution.__class__.__name__ == 'TPUStrategy'
+    is_tpu_strategy = (
+        self._train_distribution.__class__.__name__ == 'TPUStrategy')
 
     worker_hooks = []
     with ops.Graph().as_default() as g:
-      with self._distribution.scope():
+      with self._train_distribution.scope():
         random_seed.set_random_seed(self._config.tf_random_seed)
 
         if is_tpu_strategy:
           # Create the iterator for run_on_dataset function
           # TODO(sourabhbajaj): refactor this out to call a function on the
           # strategy
-          dataset = self._distribution.distribute_dataset(
+          dataset = self._train_distribution.distribute_dataset(
               lambda: self._call_input_fn(input_fn,  # pylint: disable=g-long-lambda
                                           model_fn_lib.ModeKeys.TRAIN))
           iterator = dataset.make_initializable_iterator()
@@ -1208,14 +1201,15 @@ class Estimator(object):
           global_step_tensor = self._create_and_assert_global_step(g)
           # we want to add to the global collection in the main thread not the
           # tower threads.
-          ops.add_to_collection(training_util.GLOBAL_STEP_READ_KEY,
-                                self._distribution.read_var(global_step_tensor))
+          ops.add_to_collection(
+              training_util.GLOBAL_STEP_READ_KEY,
+              self._train_distribution.read_var(global_step_tensor))
 
           # Create a step_fn from the train_op of grouped_estimator_spec
           def step_fn(ctx, inputs):
             """A single step that is passed to run_on_dataset."""
             features, labels = inputs
-            estimator_spec = self._distribution.call_for_each_tower(
+            estimator_spec = self._train_distribution.call_for_each_tower(
                 self._call_model_fn,
                 features,
                 labels,
@@ -1231,103 +1225,34 @@ class Estimator(object):
           # work correctly. Currently hardcoded at 2
           initial_training_loss = constant_op.constant(1e7)
           distributed_train_op, tpu_result, ctx = \
-              self._distribution._run_steps_on_dataset(  # pylint: disable=protected-access
+              self._train_distribution._run_steps_on_dataset(  # pylint: disable=protected-access
                   step_fn, iterator, iterations=2,
                   initial_loop_values=initial_training_loss)
           grouped_estimator_spec = ctx.non_tensor_outputs['estimator_spec']
         else:
           features, labels, input_hooks = (
               self._get_features_and_labels_from_input_fn(
-                  input_fn, model_fn_lib.ModeKeys.TRAIN))
+                  input_fn, model_fn_lib.ModeKeys.TRAIN,
+                  self._train_distribution))
           worker_hooks.extend(input_hooks)
           global_step_tensor = self._create_and_assert_global_step(g)
           # we want to add to the global collection in the main thread not the
           # tower threads.
-          ops.add_to_collection(training_util.GLOBAL_STEP_READ_KEY,
-                                self._distribution.read_var(global_step_tensor))
-          grouped_estimator_spec = self._distribution.call_for_each_tower(
+          ops.add_to_collection(
+              training_util.GLOBAL_STEP_READ_KEY,
+              self._train_distribution.read_var(global_step_tensor))
+          grouped_estimator_spec = self._train_distribution.call_for_each_tower(
               self._call_model_fn,
               features,
               labels,  # although this will be None it seems
               model_fn_lib.ModeKeys.TRAIN,
               self.config)
 
-        # TODO(anjalisridhar): Figure out how to resolve the following scaffold
-        # parameters: init_feed_dict, init_fn.
-        scaffold_list = self._distribution.unwrap(
-            grouped_estimator_spec.scaffold)
-        init_feed_dict = [
-            s.init_feed_dict
-            for s in scaffold_list
-            if s.init_feed_dict is not None
-        ]
-        if init_feed_dict:
-          init_feed_dict = self._distribution.group(init_feed_dict)
-        else:
-          init_feed_dict = None
-
-        init_fn = [s.init_fn for s in scaffold_list if s.init_fn is not None]
-        if init_fn:
-          init_fn = self._distribution.group(init_fn)
-        else:
-          init_fn = None
-
-        init_op = [s.init_op for s in scaffold_list if s.init_op is not None]
-        if init_op:
-          init_op = self._distribution.group(init_op)
-        else:
-          init_op = None
-
-        def _unwrap_and_concat(value):
-          value = nest.flatten(self._distribution.unwrap(value))
-          if len(value) != 1:
-            return array_ops.concat(value)
-          return value[0]
-
-        ready_op = self._distribution.call_for_each_tower(
-            create_per_tower_ready_op, grouped_estimator_spec.scaffold)
-        if ready_op is not None:
-          ready_op = _unwrap_and_concat(ready_op)
-        else:
-          ready_op = None
-
-        ready_for_local_init_op = self._distribution.call_for_each_tower(
-            create_per_tower_ready_for_local_init_op,
-            grouped_estimator_spec.scaffold)
-        if ready_for_local_init_op is not None:
-          ready_for_local_init_op = _unwrap_and_concat(ready_for_local_init_op)
-        else:
-          ready_for_local_init_op = None
-
-        local_init_op = [
-            s.local_init_op
-            for s in scaffold_list
-            if s.local_init_op is not None
-        ]
-        if local_init_op:
-          local_init_op = self._distribution.group(local_init_op)
-        else:
-          local_init_op = None
-
-        summary_op = [
-            s.summary_op for s in scaffold_list if s.summary_op is not None
-        ]
-        if summary_op:
-          summary_op = self._distribution.group(summary_op)
-        else:
-          summary_op = None
-
-        scaffold = monitored_session.Scaffold(
-            init_op=init_op,
-            ready_op=ready_op,
-            ready_for_local_init_op=ready_for_local_init_op,
-            local_init_op=local_init_op,
-            summary_op=summary_op,
-            init_feed_dict=init_feed_dict,
-            init_fn=init_fn)
+        scaffold = _combine_distributed_scaffold(
+            grouped_estimator_spec.scaffold, self._train_distribution)
 
         def get_hooks_from_the_first_device(per_device_hooks):
-          hooks_list = self._distribution.unwrap(per_device_hooks)
+          hooks_list = self._train_distribution.unwrap(per_device_hooks)
           assert hooks_list
           return hooks_list[0]
 
@@ -1336,28 +1261,25 @@ class Estimator(object):
         training_chief_hooks = get_hooks_from_the_first_device(
             grouped_estimator_spec.training_chief_hooks)
 
-        # TODO(sourabhbajaj): Merge the two code paths once we can
-        # handle per device variables correctly in reduce and can output
-        # the loss scaler.
+        # TODO(sourabhbajaj): Merge the two code paths and clean up the code
         if is_tpu_strategy:
-          loss = self._distribution.unwrap(
-              self._distribution.reduce(distribute_lib.get_loss_reduction(),
-                                        tpu_result)[0])[0]
+          distributed_loss = tpu_result
           worker_hooks.append(
               estimator_util.StrategyInitFinalizeHook(
-                  self._distribution.get_initialization_ops,
-                  self._distribution.get_finalize_ops))
+                  self._train_distribution.get_initialization_ops,
+                  self._train_distribution.get_finalize_ops))
         else:
-          loss = self._distribution.unwrap(
-              self._distribution.reduce(distribute_lib.get_loss_reduction(),
-                                        grouped_estimator_spec.loss,
-                                        destinations='/device:CPU:0'))[0]
+          distributed_loss = grouped_estimator_spec.loss
           distributed_train_op = grouped_estimator_spec.train_op
 
         estimator_spec = model_fn_lib.EstimatorSpec(
             mode=grouped_estimator_spec.mode,
-            loss=loss,
-            train_op=self._distribution.group(distributed_train_op),
+            loss=self._train_distribution.unwrap(
+                self._train_distribution.reduce(
+                    distribute_lib.get_loss_reduction(),
+                    distributed_loss,
+                    destinations='/device:CPU:0'))[0],
+            train_op=self._train_distribution.group(distributed_train_op),
             training_hooks=training_hooks,
             training_chief_hooks=training_chief_hooks,
             scaffold=scaffold)
@@ -1452,27 +1374,31 @@ class Estimator(object):
   def _evaluate_build_graph(self, input_fn, hooks=None, checkpoint_path=None):
     """Builds the graph and related hooks to run evaluation."""
     random_seed.set_random_seed(self._config.tf_random_seed)
-    global_step_tensor = self._create_and_assert_global_step(
-        ops.get_default_graph())
+    self._create_and_assert_global_step(ops.get_default_graph())
     features, labels, input_hooks = (
-        self._get_features_and_labels_from_input_fn(input_fn,
-                                                    model_fn_lib.ModeKeys.EVAL))
-    estimator_spec = self._call_model_fn(
-        features, labels, model_fn_lib.ModeKeys.EVAL, self.config)
+        self._get_features_and_labels_from_input_fn(
+            input_fn, model_fn_lib.ModeKeys.EVAL, self._eval_distribution))
 
+    if self._eval_distribution:
+      (loss_metric, scaffold, evaluation_hooks, eval_metric_ops) = (
+          self._call_model_fn_eval_distributed(features, labels, self.config))
+    else:
+      (loss_metric, scaffold, evaluation_hooks, eval_metric_ops) = (
+          self._call_model_fn_eval(features, labels, self.config))
+
+    global_step_tensor = training_util.get_global_step(ops.get_default_graph())
     # Call to warm_start has to be after model_fn is called.
     self._maybe_warm_start(checkpoint_path)
 
-    if model_fn_lib.LOSS_METRIC_KEY in estimator_spec.eval_metric_ops:
+    if model_fn_lib.LOSS_METRIC_KEY in eval_metric_ops:
       raise ValueError(
           'Metric with name "%s" is not allowed, because Estimator ' %
           (model_fn_lib.LOSS_METRIC_KEY) +
           'already defines a default metric with the same name.')
-    estimator_spec.eval_metric_ops[
-        model_fn_lib.LOSS_METRIC_KEY] = metrics_lib.mean(estimator_spec.loss)
+    eval_metric_ops[model_fn_lib.LOSS_METRIC_KEY] = loss_metric
 
-    update_op, eval_dict = _extract_metric_update_ops(
-        estimator_spec.eval_metric_ops)
+    update_op, eval_dict = _extract_metric_update_ops(eval_metric_ops,
+                                                      self._eval_distribution)
 
     if ops.GraphKeys.GLOBAL_STEP in eval_dict:
       raise ValueError(
@@ -1482,9 +1408,42 @@ class Estimator(object):
 
     all_hooks = list(input_hooks)
     all_hooks.extend(hooks)
-    all_hooks.extend(list(estimator_spec.evaluation_hooks or []))
+    all_hooks.extend(list(evaluation_hooks or []))
+    # New local variables have been added, so update the estimator spec's
+    # local init op if it was defined.
+    if scaffold and scaffold.local_init_op:
+      # Ensure that eval step has been created before updating local init op.
+      evaluation._get_or_create_eval_step()  # pylint: disable=protected-access
+
+      scaffold = monitored_session.Scaffold(
+          local_init_op=control_flow_ops.group(
+              scaffold.local_init_op,
+              monitored_session.Scaffold.default_local_init_op()),
+          copy_from_scaffold=scaffold
+      )
 
-    return estimator_spec.scaffold, update_op, eval_dict, all_hooks
+    return scaffold, update_op, eval_dict, all_hooks
+
+  def _call_model_fn_eval(self, features, labels, config):
+    estimator_spec = self._call_model_fn(
+        features, labels, model_fn_lib.ModeKeys.EVAL, config)
+    loss_metric = metrics_lib.mean(estimator_spec.loss)
+    return (loss_metric, estimator_spec.scaffold,
+            estimator_spec.evaluation_hooks, estimator_spec.eval_metric_ops)
+
+  def _call_model_fn_eval_distributed(self, features, labels, config):
+    """Call model_fn in distribution mode and handle return values."""
+    grouped_estimator_spec = self._eval_distribution.call_for_each_tower(
+        self._call_model_fn, features, labels,
+        model_fn_lib.ModeKeys.EVAL, config)
+    scaffold = _combine_distributed_scaffold(
+        grouped_estimator_spec.scaffold, self._eval_distribution)
+    evaluation_hooks = self._eval_distribution.unwrap(
+        grouped_estimator_spec.evaluation_hooks)[0]
+    loss_metric = self._eval_distribution.call_for_each_tower(
+        metrics_lib.mean, grouped_estimator_spec.loss)
+    return (loss_metric, scaffold,
+            evaluation_hooks, grouped_estimator_spec.eval_metric_ops)
 
   def _evaluate_run(self, checkpoint_path, scaffold, update_op, eval_dict,
                     all_hooks, output_dir):
@@ -1520,6 +1479,49 @@ class Estimator(object):
       warm_starting_util.warm_start(*self._warm_start_settings)
 
 
+def maybe_overwrite_model_dir_and_session_config(config, model_dir):
+  """Overwrite estimator config by `model_dir` and `session_config` if needed.
+
+  Args:
+    config: Original estimator config.
+    model_dir: Estimator model checkpoint directory.
+
+  Returns:
+    Overwritten estimator config.
+
+  Raises:
+    ValueError: Model directory inconsistent between `model_dir` and `config`.
+  """
+
+  if config is None:
+    config = run_config.RunConfig()
+    logging.info('Using default config.')
+  if not isinstance(config, run_config.RunConfig):
+    raise ValueError(
+        'config must be an instance of `RunConfig`, but provided %s.' % config)
+
+  if config.session_config is None:
+    session_config = run_config.get_default_session_config()
+    config = run_config.RunConfig.replace(config, session_config=session_config)
+
+  model_dir = compat_internal.path_to_str(model_dir)
+  if model_dir is not None:
+    if (getattr(config, 'model_dir', None) is not None and
+        config.model_dir != model_dir):
+      raise ValueError(
+          "`model_dir` are set both in constructor and `RunConfig`, but with "
+          "different values. In constructor: '{}', in `RunConfig`: "
+          "'{}' ".format(model_dir, config.model_dir))
+    if model_dir:
+      config = run_config.RunConfig.replace(config, model_dir=model_dir)
+  if getattr(config, 'model_dir', None) is None:
+    model_dir = tempfile.mkdtemp()
+    logging.warning('Using temporary folder as model directory: %s', model_dir)
+    config = run_config.RunConfig.replace(config, model_dir=model_dir)
+
+  return config
+
+
 def create_per_tower_ready_op(scaffold):
   """Create a Scaffold.ready_op inside a tower."""
   if scaffold.ready_op:
@@ -1549,8 +1551,85 @@ def create_per_tower_ready_for_local_init_op(scaffold):
       default_ready_for_local_init_op)
 
 
+def _combine_distributed_scaffold(grouped_scaffold, distribution):
+  """Combines scaffold(s) returned from `distribution.call_for_each_tower`."""
+
+  # TODO(anjalisridhar): Figure out how to resolve the following scaffold
+  # parameters: init_feed_dict, init_fn.
+  scaffold_list = distribution.unwrap(grouped_scaffold)
+  init_feed_dict = [
+      s.init_feed_dict
+      for s in scaffold_list
+      if s.init_feed_dict is not None
+  ]
+  if init_feed_dict:
+    init_feed_dict = distribution.group(init_feed_dict)
+  else:
+    init_feed_dict = None
+
+  init_fn = [s.init_fn for s in scaffold_list if s.init_fn is not None]
+  if init_fn:
+    init_fn = distribution.group(init_fn)
+  else:
+    init_fn = None
+
+  init_op = [s.init_op for s in scaffold_list if s.init_op is not None]
+  if init_op:
+    init_op = distribution.group(init_op)
+  else:
+    init_op = None
+
+  def _unwrap_and_concat(value):
+    value = nest.flatten(distribution.unwrap(value))
+    if len(value) != 1:
+      return array_ops.concat(value)
+    return value[0]
+
+  ready_op = distribution.call_for_each_tower(
+      create_per_tower_ready_op, grouped_scaffold)
+  if ready_op is not None:
+    ready_op = _unwrap_and_concat(ready_op)
+  else:
+    ready_op = None
+
+  ready_for_local_init_op = distribution.call_for_each_tower(
+      create_per_tower_ready_for_local_init_op, grouped_scaffold)
+  if ready_for_local_init_op is not None:
+    ready_for_local_init_op = _unwrap_and_concat(ready_for_local_init_op)
+  else:
+    ready_for_local_init_op = None
+
+  local_init_op = [
+      s.local_init_op
+      for s in scaffold_list
+      if s.local_init_op is not None
+  ]
+  if local_init_op:
+    local_init_op = distribution.group(local_init_op)
+  else:
+    local_init_op = None
+
+  summary_op = [
+      s.summary_op for s in scaffold_list if s.summary_op is not None
+  ]
+  if summary_op:
+    summary_op = distribution.group(summary_op)
+  else:
+    summary_op = None
+
+  scaffold = monitored_session.Scaffold(
+      init_op=init_op,
+      ready_op=ready_op,
+      ready_for_local_init_op=ready_for_local_init_op,
+      local_init_op=local_init_op,
+      summary_op=summary_op,
+      init_feed_dict=init_feed_dict,
+      init_fn=init_fn)
+  return scaffold
+
+
 def _check_checkpoint_available(model_dir):
-  latest_path = saver.latest_checkpoint(model_dir)
+  latest_path = checkpoint_management.latest_checkpoint(model_dir)
   if not latest_path:
     raise ValueError(
         'Could not find trained model in model_dir: {}.'.format(model_dir))
@@ -1633,14 +1712,18 @@ def _load_global_step_from_checkpoint_dir(checkpoint_dir):
     return 0
 
 
-def _extract_metric_update_ops(eval_dict):
+def _extract_metric_update_ops(eval_dict, distribution=None):
   """Separate update operations from metric value operations."""
   update_ops = []
   value_ops = {}
   # Sort metrics lexicographically so graph is identical every time.
   for name, metric_ops in sorted(six.iteritems(eval_dict)):
     value_ops[name] = metric_ops[0]
-    update_ops.append(metric_ops[1])
+    if distribution:
+      update_op = distribution.group(metric_ops[1])
+    else:
+      update_op = metric_ops[1]
+    update_ops.append(update_op)
 
   if update_ops:
     update_op = control_flow_ops.group(*update_ops)
@@ -1915,6 +1998,19 @@ class WarmStartSettings(
     )
 
 
+def _get_saved_model_ckpt(saved_model_dir):
+  """Return path to variables checkpoint in a SavedModel directory."""
+  if not gfile.Exists(
+      os.path.join(compat.as_bytes(saved_model_dir),
+                   compat.as_bytes('variables/variables.index'))):
+    raise ValueError('Directory provided has an invalid SavedModel format: %s'
+                     % saved_model_dir)
+  return os.path.join(
+      compat.as_bytes(saved_model_dir),
+      compat.as_bytes('{}/{}'.format(constants.VARIABLES_DIRECTORY,
+                                     constants.VARIABLES_FILENAME)))
+
+
 def _get_default_warm_start_settings(warm_start_from):
   """Returns default WarmStartSettings.
 
@@ -1938,10 +2034,8 @@ def _get_default_warm_start_settings(warm_start_from):
     if gfile.Exists(os.path.join(compat.as_bytes(warm_start_from),
                                  compat.as_bytes('variables/variables.index'))):
       logging.info('Warm-starting from a SavedModel')
-      return WarmStartSettings(ckpt_to_initialize_from=os.path.join(
-          compat.as_bytes(warm_start_from),
-          compat.as_bytes('{}/{}'.format(constants.VARIABLES_DIRECTORY,
-                                         constants.VARIABLES_FILENAME))))
+      return WarmStartSettings(
+          ckpt_to_initialize_from=_get_saved_model_ckpt(warm_start_from))
     return WarmStartSettings(ckpt_to_initialize_from=warm_start_from)
   elif isinstance(warm_start_from, WarmStartSettings):
     return warm_start_from
diff --git a/tensorflow/python/estimator/estimator_test.py b/tensorflow/python/estimator/estimator_test.py
index 2a0e4e7617..e8552092e0 100644
--- a/tensorflow/python/estimator/estimator_test.py
+++ b/tensorflow/python/estimator/estimator_test.py
@@ -28,6 +28,7 @@ import six
 
 from google.protobuf import text_format
 
+from tensorflow.core.protobuf import rewriter_config_pb2
 from tensorflow.python.client import session
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.estimator import estimator
@@ -68,6 +69,7 @@ from tensorflow.python.summary import summary
 from tensorflow.python.summary import summary_iterator
 from tensorflow.python.summary.writer import writer_cache
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import checkpoint_state_pb2
 from tensorflow.python.training import saver
 from tensorflow.python.training import saver_test_utils
@@ -174,7 +176,7 @@ class EstimatorInheritanceConstraintTest(test.TestCase):
 class EstimatorConstructorTest(test.TestCase):
 
   def test_config_must_be_a_run_config(self):
-    with self.assertRaisesRegexp(ValueError, 'an instance of RunConfig'):
+    with self.assertRaisesRegexp(ValueError, 'an instance of `RunConfig`'):
       estimator.Estimator(model_fn=None, config='NotARunConfig')
 
   def test_model_fn_must_be_provided(self):
@@ -203,6 +205,10 @@ class EstimatorConstructorTest(test.TestCase):
 
     est = estimator.Estimator(model_fn=model_fn)
     self.assertTrue(isinstance(est.config, run_config.RunConfig))
+    self.assertTrue(est._session_config.allow_soft_placement)
+    rewrite_options = est._session_config.graph_options.rewrite_options
+    self.assertEqual(rewrite_options.meta_optimizer_iterations,
+                     rewriter_config_pb2.RewriterConfig.ONE)
 
   def test_default_model_dir(self):
 
@@ -223,6 +229,15 @@ class EstimatorConstructorTest(test.TestCase):
     self.assertEqual(_TMP_DIR, est.config.model_dir)
     self.assertEqual(_TMP_DIR, est.model_dir)
 
+  def test_empty_model_dir(self):
+    def model_fn(features, labels):
+      _, _ = features, labels
+
+    with test.mock.patch.object(tempfile, 'mkdtemp', return_value=_TMP_DIR):
+      est = estimator.Estimator(model_fn=model_fn, model_dir='')
+      self.assertEqual(_TMP_DIR, est.config.model_dir)
+      self.assertEqual(_TMP_DIR, est.model_dir)
+
   def test_model_dir_in_run_config(self):
 
     class FakeConfig(run_config.RunConfig):
@@ -267,7 +282,7 @@ class EstimatorConstructorTest(test.TestCase):
 
     with self.assertRaisesRegexp(
         ValueError,
-        'model_dir are set both in constructor and RunConfig, but '
+        '`model_dir` are set both in constructor and `RunConfig`, but '
         'with different values'):
       estimator.Estimator(
           model_fn=model_fn, config=FakeConfig(), model_dir=_ANOTHER_TMP_DIR)
@@ -1534,7 +1549,8 @@ class EstimatorPredictTest(test.TestCase):
       next(
           est.predict(
               dummy_input_fn,
-              checkpoint_path=saver.latest_checkpoint('fakedir')))
+              checkpoint_path=
+              checkpoint_management.latest_checkpoint('fakedir')))
 
   def test_tensor_predictions(self):
 
@@ -2304,6 +2320,43 @@ class EstimatorExportTest(test.TestCase):
     with self.assertRaisesRegexp(ValueError, err_regex):
       est._export_all_saved_models(export_dir_base, input_receiver_fn_map)
 
+  def test_export_all_saved_models_metric_operation(self):
+    """Ensures metrics ops.Operations can be expoerted (b/109740581)."""
+
+    def _model_fn(features, labels, mode):
+      del features, labels  # Unused
+      metrics = {'metrics': (constant_op.constant([0]),
+                             control_flow_ops.no_op())}
+      return model_fn_lib.EstimatorSpec(
+          mode,
+          predictions=constant_op.constant(10.),
+          loss=constant_op.constant(1.),
+          train_op=state_ops.assign_add(training.get_global_step(), 1),
+          eval_metric_ops=metrics)
+
+    tmpdir = tempfile.mkdtemp()
+    est = estimator.Estimator(model_fn=_model_fn)
+    est.train(input_fn=dummy_input_fn, steps=1)
+
+    # Perform the export.
+    export_dir_base = os.path.join(
+        compat.as_bytes(tmpdir), compat.as_bytes('metric_operation_export'))
+
+    input_receiver_fn_map = {
+        model_fn_lib.ModeKeys.EVAL: _get_supervised_input_receiver_fn()}
+
+    export_dir = est._export_all_saved_models(
+        export_dir_base, input_receiver_fn_map)
+
+    # Restore, to validate that the export was well-formed.
+    with ops.Graph().as_default() as graph:
+      with session.Session(graph=graph) as sess:
+        meta_graph = loader.load(sess, [tag_constants.EVAL], export_dir)
+        sig_outputs = meta_graph.signature_def[
+            model_fn_lib.ModeKeys.EVAL].outputs
+        self.assertEqual(
+            sig_outputs['metrics/update_op'].name, 'metric_op_wrapper:0')
+
   def test_export_savedmodel_with_saveables_proto_roundtrip(self):
     tmpdir = tempfile.mkdtemp()
     est = estimator.Estimator(
diff --git a/tensorflow/python/estimator/export/export_output.py b/tensorflow/python/estimator/export/export_output.py
index 6c26d29985..20382a58d8 100644
--- a/tensorflow/python/estimator/export/export_output.py
+++ b/tensorflow/python/estimator/export/export_output.py
@@ -23,6 +23,7 @@ import abc
 import six
 
 
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.saved_model import signature_def_utils
@@ -338,8 +339,16 @@ class _SupervisedOutput(ExportOutput):
         raise ValueError(
             '{} update_op must be a Tensor or Operation; got {}.'.format(
                 key, metric_op))
+
+      # We must wrap any ops in a Tensor before export, as the SignatureDef
+      # proto expects tensors only. See b/109740581
+      metric_op_tensor = metric_op
+      if isinstance(metric_op, ops.Operation):
+        with ops.control_dependencies([metric_op]):
+          metric_op_tensor = constant_op.constant([], name='metric_op_wrapper')
+
       outputs[val_name] = metric_val
-      outputs[op_name] = metric_op
+      outputs[op_name] = metric_op_tensor
 
     return outputs
 
diff --git a/tensorflow/python/estimator/export/export_output_test.py b/tensorflow/python/estimator/export/export_output_test.py
index b21ba91b0f..d94c764fd7 100644
--- a/tensorflow/python/estimator/export/export_output_test.py
+++ b/tensorflow/python/estimator/export/export_output_test.py
@@ -24,8 +24,10 @@ from tensorflow.core.protobuf import meta_graph_pb2
 from tensorflow.python.estimator.export import export_output as export_output_lib
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.platform import test
 from tensorflow.python.saved_model import signature_constants
 
@@ -335,5 +337,18 @@ class SupervisedOutputTest(test.TestCase):
     self.assertTrue("predictions/output1" in sig_def.outputs)
     self.assertTrue("features" in sig_def.inputs)
 
+  def test_metric_op_is_operation(self):
+    """Tests that ops.Operation is wrapped by a tensor for metric_ops."""
+    loss = {"my_loss": constant_op.constant([0])}
+    predictions = {u"output1": constant_op.constant(["foo"])}
+    metrics = {"metrics": (constant_op.constant([0]), control_flow_ops.no_op())}
+
+    outputter = MockSupervisedOutput(loss, predictions, metrics)
+    self.assertEqual(outputter.metrics["metrics/value"], metrics["metrics"][0])
+    self.assertEqual(
+        outputter.metrics["metrics/update_op"].name, "metric_op_wrapper:0")
+    self.assertTrue(
+        isinstance(outputter.metrics["metrics/update_op"], ops.Tensor))
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/estimator/keras.py b/tensorflow/python/estimator/keras.py
index cb37f99704..c63deb8f4d 100644
--- a/tensorflow/python/estimator/keras.py
+++ b/tensorflow/python/estimator/keras.py
@@ -21,11 +21,11 @@ from __future__ import print_function
 
 import os
 import re
+
 from tensorflow.python.client import session
 from tensorflow.python.estimator import estimator as estimator_lib
 from tensorflow.python.estimator import export as export_lib
 from tensorflow.python.estimator import model_fn as model_fn_lib
-from tensorflow.python.estimator import run_config as run_config_lib
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
@@ -39,9 +39,10 @@ from tensorflow.python.keras.utils.generic_utils import CustomObjectScope
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_module
-from tensorflow.python.ops import variables as variables_module
+from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.saved_model import signature_constants
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.training import training_util
@@ -71,16 +72,22 @@ def _convert_tensor(x):
   return x
 
 
-def _any_variable_initialized():
-  """Check if any variable has been initialized in the Keras model.
+def _any_weight_initialized(keras_model):
+  """Check if any weights has been initialized in the Keras model.
+
+  Args:
+    keras_model: An instance of compiled keras model.
 
   Returns:
-    boolean, True if at least one variable has been initialized, else False.
+    boolean, True if at least one weight has been initialized, else False.
+    Currently keras initialize all weights at get_session().
   """
-  variables = variables_module.global_variables()
-  for v in variables:
-    if getattr(v, '_keras_initialized', False):
-      return True
+  if keras_model is None:
+    return False
+  for layer in keras_model.layers:
+    for weight in layer.weights:
+      if hasattr(weight, '_keras_initialized'):
+        return True
   return False
 
 
@@ -175,7 +182,7 @@ def _in_place_subclassed_model_reset(model):
   # Replace layers on the model with fresh layers
   layers_to_names = {value: key for key, value in attributes_cache.items()}
   original_layers = model._layers[:]
-  model._layers = []
+  model._layers = data_structures.NoDependency([])
   for layer in original_layers:  # We preserve layer order.
     config = layer.get_config()
     # This will not work for nested subclassed models used as layers.
@@ -223,7 +230,8 @@ def _in_place_subclassed_model_reset(model):
       ]
       for name in attributes_to_cache:
         attributes_cache[name] = getattr(model, name)
-  model._original_attributes_cache = attributes_cache
+  model._original_attributes_cache = data_structures.NoDependency(
+      attributes_cache)
   # Reset built state
   model.built = False
   model.inputs = None
@@ -421,29 +429,34 @@ def _create_keras_model_fn(keras_model, custom_objects=None):
   return model_fn
 
 
-def _save_first_checkpoint(keras_model, estimator, custom_objects,
-                           keras_weights):
+def _save_first_checkpoint(keras_model, custom_objects, config):
   """Save first checkpoint for the keras Estimator.
 
   Args:
     keras_model: an instance of compiled keras model.
-    estimator: keras estimator.
     custom_objects: Dictionary for custom objects.
-    keras_weights: A flat list of Numpy arrays for weights of given keras_model.
+    config: Estimator config.
 
   Returns:
-    The model_fn for a keras Estimator.
+    The path where keras model checkpoint is saved.
   """
+  # save checkpoint into subdirectory to allow warm start
+  keras_model_dir = os.path.join(config.model_dir, 'keras')
   # Load weights and save to checkpoint if there is no checkpoint
-  latest_path = saver_lib.latest_checkpoint(estimator.model_dir)
+  latest_path = checkpoint_management.latest_checkpoint(keras_model_dir)
   if not latest_path:
+    keras_weights = None
+    if _any_weight_initialized(keras_model):
+      keras_weights = keras_model.get_weights()
+    if not gfile.IsDirectory(keras_model_dir):
+      gfile.MakeDirs(keras_model_dir)
     with ops.Graph().as_default():
-      random_seed.set_random_seed(estimator.config.tf_random_seed)
+      random_seed.set_random_seed(config.tf_random_seed)
       training_util.create_global_step()
       model = _clone_and_build_model(model_fn_lib.ModeKeys.TRAIN, keras_model,
                                      custom_objects)
       # save to checkpoint
-      with session.Session(config=estimator._session_config) as sess:
+      with session.Session(config=config.session_config) as sess:
         if keras_weights:
           model.set_weights(keras_weights)
         # Make update ops and initialize all variables.
@@ -453,7 +466,9 @@ def _save_first_checkpoint(keras_model, estimator, custom_objects,
           K._initialize_variables(sess)
           # pylint: enable=protected-access
         saver = saver_lib.Saver()
-        saver.save(sess, os.path.join(estimator.model_dir, 'keras_model.ckpt'))
+        latest_path = os.path.join(keras_model_dir, 'keras_model.ckpt')
+        saver.save(sess, latest_path)
+  return latest_path
 
 
 def model_to_estimator(keras_model=None,
@@ -473,9 +488,9 @@ def model_to_estimator(keras_model=None,
       format, which can be generated with the `save()` method of a Keras model.
       This argument is mutually exclusive with `keras_model`.
     custom_objects: Dictionary for custom objects.
-    model_dir: Directory to save Estimator model parameters, graph, summary
+    model_dir: Directory to save `Estimator` model parameters, graph, summary
       files for TensorBoard, etc.
-    config: Configuration object.
+    config: `RunConfig` to config `Estimator`.
 
   Returns:
     An Estimator from given keras model.
@@ -512,45 +527,40 @@ def model_to_estimator(keras_model=None,
         'Please compile the model with `model.compile()` '
         'before calling `model_to_estimator()`.')
 
-  if isinstance(config, dict):
-    config = run_config_lib.RunConfig(**config)
+  config = estimator_lib.maybe_overwrite_model_dir_and_session_config(config,
+                                                                      model_dir)
 
   keras_model_fn = _create_keras_model_fn(keras_model, custom_objects)
-  estimator = estimator_lib.Estimator(
-      keras_model_fn, model_dir=model_dir, config=config)
-
-  # Check if we need to call get_weights:
-  if _any_variable_initialized():
-    keras_weights = keras_model.get_weights()
+  if _any_weight_initialized(keras_model):
     # Warn if config passed to estimator tries to update GPUOptions. If a
     # session has already been created, the GPUOptions passed to the first
     # session sticks.
-    if estimator._session_config.HasField('gpu_options'):
+    if config.session_config.HasField('gpu_options'):
       logging.warning(
           'The Keras backend session has already been set. '
           'The _session_config passed to model_to_estimator will not be used.')
   else:
     # Pass the config into keras backend's default session.
-    sess = session.Session(config=estimator._session_config)
+    sess = session.Session(config=config.session_config)
     K.set_session(sess)
-    keras_weights = None
 
+  warm_start_path = None
   if keras_model._is_graph_network:
-    # TODO(yifeif): move checkpoint initialization to scaffold.init_fn
-    _save_first_checkpoint(keras_model,
-                           estimator,
-                           custom_objects,
-                           keras_weights)
+    warm_start_path = _save_first_checkpoint(keras_model, custom_objects,
+                                             config)
   elif keras_model.built:
-    logging.warning('You are creating an Estimator from a Keras model '
-                    'manually subclassed from `Model`, that was '
-                    'already called on some inputs (and thus already had '
-                    'weights). We are currently unable to preserve '
-                    'the model\'s state (its weights) '
-                    'as part of the estimator '
-                    'in this case. Be warned that the estimator '
-                    'has been created using '
-                    'a freshly initialized version of your model.\n'
-                    'Note that this doesn\'t affect the state of the '
-                    'model instance you passed as `keras_model` argument.')
+    logging.warning('You are creating an Estimator from a Keras model manually '
+                    'subclassed from `Model`, that was already called on some '
+                    'inputs (and thus already had weights). We are currently '
+                    'unable to preserve the model\'s state (its weights) as '
+                    'part of the estimator in this case. Be warned that the '
+                    'estimator has been created using a freshly initialized '
+                    'version of your model.\n'
+                    'Note that this doesn\'t affect the state of the model '
+                    'instance you passed as `keras_model` argument.')
+
+  estimator = estimator_lib.Estimator(keras_model_fn,
+                                      config=config,
+                                      warm_start_from=warm_start_path)
+
   return estimator
diff --git a/tensorflow/python/estimator/keras_test.py b/tensorflow/python/estimator/keras_test.py
index 5e094ae92b..332e385726 100644
--- a/tensorflow/python/estimator/keras_test.py
+++ b/tensorflow/python/estimator/keras_test.py
@@ -32,13 +32,14 @@ from tensorflow.python.estimator.inputs import numpy_io
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.keras import testing_utils
-from tensorflow.python.keras.applications import mobilenet
 from tensorflow.python.keras.optimizers import SGD
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops.parsing_ops import gen_parsing_ops
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.summary.writer import writer_cache
 from tensorflow.python.training import rmsprop
+from tensorflow.python.training import session_run_hook
 
 
 try:
@@ -51,6 +52,8 @@ _TRAIN_SIZE = 200
 _INPUT_SIZE = (10,)
 _NUM_CLASS = 2
 
+_TMP_DIR = '/tmp'
+
 
 def simple_sequential_model():
   model = keras.models.Sequential()
@@ -60,9 +63,9 @@ def simple_sequential_model():
   return model
 
 
-def simple_functional_model():
+def simple_functional_model(activation='relu'):
   a = keras.layers.Input(shape=_INPUT_SIZE)
-  b = keras.layers.Dense(16, activation='relu')(a)
+  b = keras.layers.Dense(16, activation=activation)(a)
   b = keras.layers.Dropout(0.1)(b)
   b = keras.layers.Dense(_NUM_CLASS, activation='softmax')(b)
   model = keras.models.Model(inputs=[a], outputs=[b])
@@ -168,6 +171,12 @@ def multi_inputs_multi_outputs_model():
   return model
 
 
+class MyHook(session_run_hook.SessionRunHook):
+
+  def begin(self):
+    _ = variable_scope.get_variable('temp', [1])
+
+
 class TestKerasEstimator(test_util.TensorFlowTestCase):
 
   def setUp(self):
@@ -204,6 +213,55 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
       writer_cache.FileWriterCache.clear()
       gfile.DeleteRecursively(self._config.model_dir)
 
+  # see b/109935364
+  @test_util.run_in_graph_and_eager_modes
+  def test_train_with_hooks(self):
+    for model_type in ['sequential', 'functional']:
+      keras_model, (_, _), (
+          _, _), train_input_fn, eval_input_fn = get_resource_for_simple_model(
+              model_type=model_type, is_evaluate=True)
+      keras_model.compile(
+          loss='categorical_crossentropy',
+          optimizer=rmsprop.RMSPropOptimizer(1e-3),
+          metrics=['mse', keras.metrics.categorical_accuracy])
+
+      my_hook = MyHook()
+      with self.test_session():
+        est_keras = keras_lib.model_to_estimator(
+            keras_model=keras_model, config=self._config)
+        before_eval_results = est_keras.evaluate(
+            input_fn=eval_input_fn, steps=1)
+        est_keras.train(input_fn=train_input_fn, hooks=[my_hook],
+                        steps=_TRAIN_SIZE / 16)
+        after_eval_results = est_keras.evaluate(input_fn=eval_input_fn, steps=1)
+        self.assertLess(after_eval_results['loss'], before_eval_results['loss'])
+
+      writer_cache.FileWriterCache.clear()
+      gfile.DeleteRecursively(self._config.model_dir)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_train_with_model_fit_and_hooks(self):
+    keras_model, (x_train, y_train), _, \
+      train_input_fn, eval_input_fn = get_resource_for_simple_model(
+          model_type='sequential', is_evaluate=True)
+
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer=rmsprop.RMSPropOptimizer(1e-3),
+        metrics=['mse', keras.metrics.categorical_accuracy])
+    my_hook = MyHook()
+    with self.test_session():
+      keras_model.fit(x_train, y_train, epochs=1)
+
+      keras_est = keras_lib.model_to_estimator(
+          keras_model=keras_model, config=self._config)
+      before_eval_results = keras_est.evaluate(input_fn=eval_input_fn)
+      keras_est.train(input_fn=train_input_fn, hooks=[my_hook],
+                      steps=_TRAIN_SIZE / 16)
+      after_eval_results = keras_est.evaluate(input_fn=eval_input_fn, steps=1)
+      self.assertLess(after_eval_results['loss'], before_eval_results['loss'])
+
+  @test_util.run_in_graph_and_eager_modes
   def test_train_with_tf_optimizer(self):
     for model_type in ['sequential', 'functional']:
       keras_model, (_, _), (
@@ -217,11 +275,7 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
       with self.test_session():
         est_keras = keras_lib.model_to_estimator(
             keras_model=keras_model,
-            # Also use dict config argument to get test coverage for that line.
-            config={
-                'tf_random_seed': _RANDOM_SEED,
-                'model_dir': self._base_dir,
-            })
+            config=self._config)
         before_eval_results = est_keras.evaluate(
             input_fn=eval_input_fn, steps=1)
         est_keras.train(input_fn=train_input_fn, steps=_TRAIN_SIZE / 16)
@@ -231,6 +285,7 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
       writer_cache.FileWriterCache.clear()
       gfile.DeleteRecursively(self._config.model_dir)
 
+  @test_util.run_in_graph_and_eager_modes
   def test_train_with_subclassed_model(self):
     keras_model, (_, _), (
         _, _), train_input_fn, eval_input_fn = get_resource_for_simple_model(
@@ -472,23 +527,43 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
         est_keras.train(input_fn=invald_output_name_input_fn, steps=100)
 
   def test_custom_objects(self):
-    keras_mobile = mobilenet.MobileNet(weights=None)
-    keras_mobile.compile(loss='categorical_crossentropy', optimizer='adam')
+
+    def relu6(x):
+      return keras.backend.relu(x, max_value=6)
+
+    keras_model = simple_functional_model(activation=relu6)
+    keras_model.compile(loss='categorical_crossentropy', optimizer='adam')
     custom_objects = {
-        'relu6': mobilenet.relu6,
-        'DepthwiseConv2D': mobilenet.DepthwiseConv2D
+        'relu6': relu6
     }
+
+    (x_train, y_train), _ = testing_utils.get_test_data(
+        train_samples=_TRAIN_SIZE,
+        test_samples=50,
+        input_shape=(10,),
+        num_classes=2)
+    y_train = keras.utils.to_categorical(y_train, 2)
+    input_name = keras_model.input_names[0]
+    output_name = keras_model.output_names[0]
+    train_input_fn = numpy_io.numpy_input_fn(
+        x=randomize_io_type(x_train, input_name),
+        y=randomize_io_type(y_train, output_name),
+        shuffle=False,
+        num_epochs=None,
+        batch_size=16)
     with self.assertRaisesRegexp(ValueError, 'relu6'):
       with self.test_session():
-        keras_lib.model_to_estimator(
-            keras_model=keras_mobile,
+        est = keras_lib.model_to_estimator(
+            keras_model=keras_model,
             model_dir=tempfile.mkdtemp(dir=self._base_dir))
+        est.train(input_fn=train_input_fn, steps=1)
 
     with self.test_session():
-      keras_lib.model_to_estimator(
-          keras_model=keras_mobile,
+      est = keras_lib.model_to_estimator(
+          keras_model=keras_model,
           model_dir=tempfile.mkdtemp(dir=self._base_dir),
           custom_objects=custom_objects)
+      est.train(input_fn=train_input_fn, steps=1)
 
   def test_tf_config(self):
     keras_model, (_, _), (_, _), _, _ = get_resource_for_simple_model()
@@ -525,12 +600,73 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
       gpu_options = config_pb2.GPUOptions(per_process_gpu_memory_fraction=0.3)
       sess_config = config_pb2.ConfigProto(gpu_options=gpu_options)
       self._config._session_config = sess_config
-      keras_lib.model_to_estimator(
-          keras_model=keras_model, config=self._config)
-      self.assertEqual(
-          keras.backend.get_session()
-          ._config.gpu_options.per_process_gpu_memory_fraction,
-          gpu_options.per_process_gpu_memory_fraction)
+      with self.test_session():
+        keras_lib.model_to_estimator(
+            keras_model=keras_model, config=self._config)
+        self.assertEqual(
+            keras.backend.get_session()
+            ._config.gpu_options.per_process_gpu_memory_fraction,
+            gpu_options.per_process_gpu_memory_fraction)
+
+  def test_with_empty_config(self):
+    keras_model, _, _, _, _ = get_resource_for_simple_model(
+        model_type='sequential', is_evaluate=True)
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer='rmsprop',
+        metrics=['mse', keras.metrics.categorical_accuracy])
+
+    with self.test_session():
+      est_keras = keras_lib.model_to_estimator(
+          keras_model=keras_model, model_dir=self._base_dir,
+          config=run_config_lib.RunConfig())
+      self.assertEqual(run_config_lib.get_default_session_config(),
+                       est_keras._session_config)
+      self.assertEqual(est_keras._session_config,
+                       est_keras._config.session_config)
+      self.assertEqual(self._base_dir, est_keras._config.model_dir)
+      self.assertEqual(self._base_dir, est_keras._model_dir)
+
+    with self.test_session():
+      est_keras = keras_lib.model_to_estimator(
+          keras_model=keras_model, model_dir=self._base_dir,
+          config=None)
+      self.assertEqual(run_config_lib.get_default_session_config(),
+                       est_keras._session_config)
+      self.assertEqual(est_keras._session_config,
+                       est_keras._config.session_config)
+      self.assertEqual(self._base_dir, est_keras._config.model_dir)
+      self.assertEqual(self._base_dir, est_keras._model_dir)
+
+  def test_with_empty_config_and_empty_model_dir(self):
+    keras_model, _, _, _, _ = get_resource_for_simple_model(
+        model_type='sequential', is_evaluate=True)
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer='rmsprop',
+        metrics=['mse', keras.metrics.categorical_accuracy])
+
+    with self.test_session():
+      with test.mock.patch.object(tempfile, 'mkdtemp', return_value=_TMP_DIR):
+        est_keras = keras_lib.model_to_estimator(
+            keras_model=keras_model,
+            config=run_config_lib.RunConfig())
+        self.assertEqual(est_keras._model_dir, _TMP_DIR)
+
+  def test_with_conflicting_model_dir_and_config(self):
+    keras_model, _, _, _, _ = get_resource_for_simple_model(
+        model_type='sequential', is_evaluate=True)
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer='rmsprop',
+        metrics=['mse', keras.metrics.categorical_accuracy])
+
+    with self.test_session():
+      with self.assertRaisesRegexp(ValueError, '`model_dir` are set both in '
+                                   'constructor and `RunConfig`'):
+        keras_lib.model_to_estimator(
+            keras_model=keras_model, model_dir=self._base_dir,
+            config=run_config_lib.RunConfig(model_dir=_TMP_DIR))
 
   def test_pretrained_weights(self):
     keras_model, (_, _), (_, _), _, _ = get_resource_for_simple_model()
diff --git a/tensorflow/python/estimator/model_fn.py b/tensorflow/python/estimator/model_fn.py
index a9fd8f8e1a..9db9ccd01d 100644
--- a/tensorflow/python/estimator/model_fn.py
+++ b/tensorflow/python/estimator/model_fn.py
@@ -380,15 +380,12 @@ def _maybe_add_default_serving_output(export_outputs):
   return export_outputs
 
 
-class _TPUEstimatorSpec(collections.namedtuple('TPUEstimatorSpec', [
-    'mode',
-    'predictions',
-    'loss',
-    'train_op',
-    'eval_metrics',
-    'export_outputs',
-    'scaffold_fn',
-    'host_call'])):
+class _TPUEstimatorSpec(
+    collections.namedtuple('TPUEstimatorSpec', [
+        'mode', 'predictions', 'loss', 'train_op', 'eval_metrics',
+        'export_outputs', 'scaffold_fn', 'host_call', 'training_hooks',
+        'evaluation_hooks', 'prediction_hooks'
+    ])):
   """Ops and objects returned from a `model_fn` and passed to `TPUEstimator`.
 
   This is a simplified implementation of `tf.contrib.tpu.EstimatorSpec`. See
@@ -404,17 +401,24 @@ class _TPUEstimatorSpec(collections.namedtuple('TPUEstimatorSpec', [
               eval_metrics=None,
               export_outputs=None,
               scaffold_fn=None,
-              host_call=None):
+              host_call=None,
+              training_hooks=None,
+              evaluation_hooks=None,
+              prediction_hooks=None):
     """Creates a `_TPUEstimatorSpec` instance."""
-    return super(_TPUEstimatorSpec, cls).__new__(cls,
-                                                 mode=mode,
-                                                 predictions=predictions,
-                                                 loss=loss,
-                                                 train_op=train_op,
-                                                 eval_metrics=eval_metrics,
-                                                 export_outputs=export_outputs,
-                                                 scaffold_fn=scaffold_fn,
-                                                 host_call=host_call)
+    return super(_TPUEstimatorSpec, cls).__new__(
+        cls,
+        mode=mode,
+        predictions=predictions,
+        loss=loss,
+        train_op=train_op,
+        eval_metrics=eval_metrics,
+        export_outputs=export_outputs,
+        scaffold_fn=scaffold_fn,
+        host_call=host_call,
+        training_hooks=training_hooks,
+        evaluation_hooks=evaluation_hooks,
+        prediction_hooks=prediction_hooks)
 
   def as_estimator_spec(self):
     """Creates an equivalent `EstimatorSpec` used by CPU train/eval."""
@@ -423,12 +427,16 @@ class _TPUEstimatorSpec(collections.namedtuple('TPUEstimatorSpec', [
     else:
       metric_fn, tensors = self.eval_metrics
       eval_metric_ops = metric_fn(**tensors)
-    return EstimatorSpec(mode=self.mode,
-                         predictions=self.predictions,
-                         loss=self.loss,
-                         train_op=self.train_op,
-                         eval_metric_ops=eval_metric_ops,
-                         export_outputs=self.export_outputs)
+    return EstimatorSpec(
+        mode=self.mode,
+        predictions=self.predictions,
+        loss=self.loss,
+        train_op=self.train_op,
+        eval_metric_ops=eval_metric_ops,
+        export_outputs=self.export_outputs,
+        training_hooks=self.training_hooks,
+        evaluation_hooks=self.evaluation_hooks,
+        prediction_hooks=self.prediction_hooks)
 
 
 def _check_is_tensor_or_operation(x, name):
diff --git a/tensorflow/python/estimator/run_config.py b/tensorflow/python/estimator/run_config.py
index 3d60c63b68..220c3e58ca 100644
--- a/tensorflow/python/estimator/run_config.py
+++ b/tensorflow/python/estimator/run_config.py
@@ -48,7 +48,9 @@ _DEFAULT_REPLACEABLE_LIST = [
     'keep_checkpoint_every_n_hours',
     'log_step_count_steps',
     'train_distribute',
-    'device_fn'
+    'device_fn',
+    'protocol',
+    'eval_distribute',
 ]
 
 _SAVE_CKPT_ERR = (
@@ -288,6 +290,21 @@ def _validate_properties(run_config):
             message='device_fn must be callable with exactly'
                     ' one argument "op".')
 
+  _validate('protocol',
+            lambda protocol: protocol in (None, "grpc", "grpc+verbs"),
+            message='protocol should be grpc or grpc+verbs')
+
+
+def get_default_session_config():
+  """Returns tf.ConfigProto instance."""
+
+  rewrite_opts = rewriter_config_pb2.RewriterConfig(
+      meta_optimizer_iterations=rewriter_config_pb2.RewriterConfig.ONE)
+  graph_opts = config_pb2.GraphOptions(rewrite_options=rewrite_opts)
+
+  return config_pb2.ConfigProto(allow_soft_placement=True,
+                                graph_options=graph_opts)
+
 
 class TaskType(object):
   MASTER = 'master'
@@ -312,7 +329,9 @@ class RunConfig(object):
                keep_checkpoint_every_n_hours=10000,
                log_step_count_steps=100,
                train_distribute=None,
-               device_fn=None):
+               device_fn=None,
+               protocol=None,
+               eval_distribute=None):
     """Constructs a RunConfig.
 
     All distributed training related properties `cluster_spec`, `is_chief`,
@@ -436,7 +455,7 @@ class RunConfig(object):
         the feature.
       log_step_count_steps: The frequency, in number of global steps, that the
         global step/sec and the loss will be logged during training.
-      train_distribute: an optional instance of
+      train_distribute: An optional instance of
         `tf.contrib.distribute.DistributionStrategy`. If specified,
         then Estimator will distribute the user's model during training,
         according to the policy specified by that strategy.
@@ -444,6 +463,12 @@ class RunConfig(object):
         `Operation` and returns the device string. If `None`, defaults to
         the device function returned by `tf.train.replica_device_setter`
         with round-robin strategy.
+      protocol: An optional argument which specifies the protocol used when
+        starting server. None means default to grpc.
+      eval_distribute: An optional instance of
+        `tf.contrib.distribute.DistributionStrategy`. If specified,
+        then Estimator will distribute the user's model during evaluation,
+        according to the policy specified by that strategy.
 
     Raises:
       ValueError: If both `save_checkpoints_steps` and `save_checkpoints_secs`
@@ -481,18 +506,29 @@ class RunConfig(object):
         keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
         log_step_count_steps=log_step_count_steps,
         train_distribute=train_distribute,
-        device_fn=device_fn)
+        device_fn=device_fn,
+        protocol=protocol,
+        eval_distribute=eval_distribute)
 
     self._init_distributed_setting_from_environment_var(tf_config)
 
-    # Get session_config only for distributed mode (cluster_spec is present).
+    self._maybe_overwrite_session_config_for_distributed_training()
+
+  def _maybe_overwrite_session_config_for_distributed_training(self):
+    """Overwrites the session_config for distributed training.
+
+    The default overwrite is optimized for between-graph training. Subclass
+    should override this method if necessary.
+    """
+    # Get session_config only for between-graph distributed mode (cluster_spec
+    # is present).
     if not self._session_config and self._cluster_spec:
       RunConfig._replace(
           self,
           allowed_properties_list=_DEFAULT_REPLACEABLE_LIST,
-          session_config=self._get_default_session_config())
+          session_config=self._get_default_session_config_distributed())
 
-  def _get_default_session_config(self):
+  def _get_default_session_config_distributed(self):
     """Returns None or tf.ConfigProto instance with default device_filters set.
 
     Device filters are set such that chief/master and worker communicates with
@@ -741,10 +777,21 @@ class RunConfig(object):
 
   @property
   def train_distribute(self):
-    """Returns the optional `tf.contrib.distribute.DistributionStrategy` object.
+    """Optional `tf.contrib.distribute.DistributionStrategy` for training.
     """
     return self._train_distribute
 
+  @property
+  def eval_distribute(self):
+    """Optional `tf.contrib.distribute.DistributionStrategy` for evaluation.
+    """
+    return self._eval_distribute
+
+  @property
+  def protocol(self):
+    """Returns the optional protocol value."""
+    return self._protocol
+
   def replace(self, **kwargs):
     """Returns a new instance of `RunConfig` replacing specified properties.
 
@@ -760,7 +807,9 @@ class RunConfig(object):
       - `keep_checkpoint_every_n_hours`,
       - `log_step_count_steps`,
       - `train_distribute`,
-      - `device_fn`.
+      - `device_fn`,
+      - `protocol`.
+      - `eval_distribute`,
 
     In addition, either `save_checkpoints_steps` or `save_checkpoints_secs`
     can be set (should not be both).
diff --git a/tensorflow/python/estimator/training.py b/tensorflow/python/estimator/training.py
index 5730101092..a01b2300dd 100644
--- a/tensorflow/python/estimator/training.py
+++ b/tensorflow/python/estimator/training.py
@@ -312,10 +312,10 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
   #       hidden_units=[1024, 512, 256])
 
   # Input pipeline for train and evaluate.
-  def train_input_fn: # returns x, y
+  def train_input_fn(): # returns x, y
     # please shuffle the data.
     pass
-  def eval_input_fn_eval: # returns x, y
+  def eval_input_fn(): # returns x, y
     pass
 
   train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
@@ -732,7 +732,8 @@ class _TrainingExecutor(object):
         job_name=config.task_type,
         task_index=config.task_id,
         config=session_config,
-        start=False)
+        start=False,
+        protocol=config.protocol)
     server.start()
     return server
 
diff --git a/tensorflow/python/estimator/training_test.py b/tensorflow/python/estimator/training_test.py
index 6bee7cbe83..dc106c7d3b 100644
--- a/tensorflow/python/estimator/training_test.py
+++ b/tensorflow/python/estimator/training_test.py
@@ -472,6 +472,7 @@ class _TrainingExecutorTrainingTest(object):
         job_name=mock_est.config.task_type,
         task_index=mock_est.config.task_id,
         config=test.mock.ANY,
+        protocol=None,
         start=False)
 
     self.assertTrue(mock_server_instance.start.called)
@@ -502,6 +503,7 @@ class _TrainingExecutorTrainingTest(object):
         job_name=mock_est.config.task_type,
         task_index=mock_est.config.task_id,
         config=test.mock.ANY,
+        protocol=None,
         start=False)
 
     self.assertTrue(mock_server_instance.start.called)
@@ -729,6 +731,7 @@ class TrainingExecutorRunMasterTest(test.TestCase):
         job_name=mock_est.config.task_type,
         task_index=mock_est.config.task_id,
         config=test.mock.ANY,
+        protocol=None,
         start=False)
 
     self.assertTrue(mock_server_instance.start.called)
@@ -1481,6 +1484,7 @@ class TrainingExecutorRunPsTest(test.TestCase):
         job_name=mock_est.config.task_type,
         task_index=mock_est.config.task_id,
         config=test.mock.ANY,
+        protocol=None,
         start=False)
 
     self.assertTrue(mock_server_instance.start.called)
diff --git a/tensorflow/python/feature_column/BUILD b/tensorflow/python/feature_column/BUILD
index 295d4ca094..80707030e6 100644
--- a/tensorflow/python/feature_column/BUILD
+++ b/tensorflow/python/feature_column/BUILD
@@ -48,6 +48,39 @@ py_library(
     ],
 )
 
+py_library(
+    name = "feature_column_v2",
+    srcs = ["feature_column_v2.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:check_ops",
+        "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:embedding_ops",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:init_ops",
+        "//tensorflow/python:lookup_ops",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:nn_ops",
+        "//tensorflow/python:parsing_ops",
+        "//tensorflow/python:platform",
+        "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:sparse_ops",
+        "//tensorflow/python:sparse_tensor",
+        "//tensorflow/python:string_ops",
+        "//tensorflow/python:template",
+        "//tensorflow/python:tensor_shape",
+        "//tensorflow/python:training",
+        "//tensorflow/python:util",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python:variables",
+        "//tensorflow/python/keras",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+    ],
+)
+
 filegroup(
     name = "vocabulary_testdata",
     srcs = [
@@ -92,3 +125,38 @@ py_test(
         "//tensorflow/python/estimator:numpy_io",
     ],
 )
+
+py_test(
+    name = "feature_column_v2_test",
+    srcs = ["feature_column_v2_test.py"],
+    data = [":vocabulary_testdata"],
+    srcs_version = "PY2AND3",
+    tags = [
+        "no_cuda_on_cpu_tap",
+        "no_pip",
+    ],
+    deps = [
+        ":feature_column_py",
+        ":feature_column_v2",
+        "//tensorflow/core:protos_all_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:lookup_ops",
+        "//tensorflow/python:parsing_ops",
+        "//tensorflow/python:partitioned_variables",
+        "//tensorflow/python:session",
+        "//tensorflow/python:sparse_tensor",
+        "//tensorflow/python:training",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python:variables",
+        "//tensorflow/python/eager:backprop",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/estimator:numpy_io",
+        "//third_party/py/numpy",
+    ],
+)
diff --git a/tensorflow/python/feature_column/feature_column_v2.py b/tensorflow/python/feature_column/feature_column_v2.py
new file mode 100644
index 0000000000..b4dd23f58d
--- /dev/null
+++ b/tensorflow/python/feature_column/feature_column_v2.py
@@ -0,0 +1,3600 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""This API defines FeatureColumn abstraction.
+
+FeatureColumns provide a high level abstraction for ingesting and representing
+features. FeatureColumns are also the primary way of encoding features for
+canned @{tf.estimator.Estimator}s.
+
+When using FeatureColumns with `Estimators`, the type of feature column you
+should choose depends on (1) the feature type and (2) the model type.
+
+1. Feature type:
+
+  * Continuous features can be represented by `numeric_column`.
+  * Categorical features can be represented by any `categorical_column_with_*`
+  column:
+    - `categorical_column_with_vocabulary_list`
+    - `categorical_column_with_vocabulary_file`
+    - `categorical_column_with_hash_bucket`
+    - `categorical_column_with_identity`
+    - `weighted_categorical_column`
+
+2. Model type:
+
+  * Deep neural network models (`DNNClassifier`, `DNNRegressor`).
+
+    Continuous features can be directly fed into deep neural network models.
+
+      age_column = numeric_column("age")
+
+    To feed sparse features into DNN models, wrap the column with
+    `embedding_column` or `indicator_column`. `indicator_column` is recommended
+    for features with only a few possible values. For features with many
+    possible values, to reduce the size of your model, `embedding_column` is
+    recommended.
+
+      embedded_dept_column = embedding_column(
+          categorical_column_with_vocabulary_list(
+              "department", ["math", "philosophy", ...]), dimension=10)
+
+  * Wide (aka linear) models (`LinearClassifier`, `LinearRegressor`).
+
+    Sparse features can be fed directly into linear models. They behave like an
+    indicator column but with an efficient implementation.
+
+      dept_column = categorical_column_with_vocabulary_list("department",
+          ["math", "philosophy", "english"])
+
+    It is recommended that continuous features be bucketized before being
+    fed into linear models.
+
+      bucketized_age_column = bucketized_column(
+          source_column=age_column,
+          boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
+
+    Sparse features can be crossed (also known as conjuncted or combined) in
+    order to form non-linearities, and then fed into linear models.
+
+      cross_dept_age_column = crossed_column(
+          columns=["department", bucketized_age_column],
+          hash_bucket_size=1000)
+
+Example of building canned `Estimator`s using FeatureColumns:
+
+  ```python
+  # Define features and transformations
+  deep_feature_columns = [age_column, embedded_dept_column]
+  wide_feature_columns = [dept_column, bucketized_age_column,
+      cross_dept_age_column]
+
+  # Build deep model
+  estimator = DNNClassifier(
+      feature_columns=deep_feature_columns,
+      hidden_units=[500, 250, 50])
+  estimator.train(...)
+
+  # Or build a wide model
+  estimator = LinearClassifier(
+      feature_columns=wide_feature_columns)
+  estimator.train(...)
+
+  # Or build a wide and deep model!
+  estimator = DNNLinearCombinedClassifier(
+      linear_feature_columns=wide_feature_columns,
+      dnn_feature_columns=deep_feature_columns,
+      dnn_hidden_units=[500, 250, 50])
+  estimator.train(...)
+  ```
+
+
+FeatureColumns can also be transformed into a generic input layer for
+custom models using `input_layer`.
+
+Example of building model using FeatureColumns, this can be used in a
+`model_fn` which is given to the {tf.estimator.Estimator}:
+
+  ```python
+  # Building model via layers
+
+  deep_feature_columns = [age_column, embedded_dept_column]
+  columns_to_tensor = parse_feature_columns_from_examples(
+      serialized=my_data,
+      feature_columns=deep_feature_columns)
+  first_layer = input_layer(
+      features=columns_to_tensor,
+      feature_columns=deep_feature_columns)
+  second_layer = fully_connected(first_layer, ...)
+  ```
+
+NOTE: Functions prefixed with "_" indicate experimental or private parts of
+the API subject to change, and should not be relied upon!
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import abc
+import collections
+import math
+
+import numpy as np
+import six
+
+
+from tensorflow.python.eager import context
+from tensorflow.python.feature_column import feature_column as fc_old
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.keras.engine import training
+from tensorflow.python.layers import base
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import embedding_ops
+from tensorflow.python.ops import init_ops
+from tensorflow.python.ops import lookup_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import sparse_ops
+from tensorflow.python.ops import string_ops
+from tensorflow.python.ops import template
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import gfile
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpoint_utils
+from tensorflow.python.util import nest
+
+
+def _internal_input_layer(features,
+                          feature_columns,
+                          weight_collections=None,
+                          trainable=True,
+                          cols_to_vars=None,
+                          scope=None):
+  """See input_layer. `scope` is a name or variable scope to use."""
+
+  feature_columns = fc_old._normalize_feature_columns(feature_columns)  # pylint: disable=protected-access
+  for column in feature_columns:
+    if not isinstance(column, fc_old._DenseColumn):  # pylint: disable=protected-access
+      raise ValueError(
+          'Items of feature_columns must be a _DenseColumn. '
+          'You can wrap a categorical column with an '
+          'embedding_column or indicator_column. Given: {}'.format(column))
+  weight_collections = list(weight_collections or [])
+  if ops.GraphKeys.GLOBAL_VARIABLES not in weight_collections:
+    weight_collections.append(ops.GraphKeys.GLOBAL_VARIABLES)
+  if ops.GraphKeys.MODEL_VARIABLES not in weight_collections:
+    weight_collections.append(ops.GraphKeys.MODEL_VARIABLES)
+
+  # a non-None `scope` can allow for variable reuse, when, e.g., this function
+  # is wrapped by a `make_template`.
+  with variable_scope.variable_scope(
+      scope, default_name='input_layer', values=features.values()):
+    builder = fc_old._LazyBuilder(features)  # pylint: disable=protected-access
+    output_tensors = []
+    ordered_columns = []
+    for column in sorted(feature_columns, key=lambda x: x.name):
+      ordered_columns.append(column)
+      with variable_scope.variable_scope(
+          None, default_name=column._var_scope_name):  # pylint: disable=protected-access
+        tensor = column._get_dense_tensor(  # pylint: disable=protected-access
+            builder,
+            weight_collections=weight_collections,
+            trainable=trainable)
+        num_elements = column._variable_shape.num_elements()  # pylint: disable=protected-access
+        batch_size = array_ops.shape(tensor)[0]
+        output_tensors.append(
+            array_ops.reshape(tensor, shape=(batch_size, num_elements)))
+        if cols_to_vars is not None:
+          # Retrieve any variables created (some _DenseColumn's don't create
+          # variables, in which case an empty list is returned).
+          cols_to_vars[column] = ops.get_collection(
+              ops.GraphKeys.GLOBAL_VARIABLES,
+              scope=variable_scope.get_variable_scope().name)
+    _verify_static_batch_size_equality(output_tensors, ordered_columns)
+    return array_ops.concat(output_tensors, 1)
+
+
+def input_layer(features,
+                feature_columns,
+                weight_collections=None,
+                trainable=True,
+                cols_to_vars=None):
+  """Returns a dense `Tensor` as input layer based on given `feature_columns`.
+
+  Generally a single example in training data is described with FeatureColumns.
+  At the first layer of the model, this column oriented data should be converted
+  to a single `Tensor`.
+
+  Example:
+
+  ```python
+  price = numeric_column('price')
+  keywords_embedded = embedding_column(
+      categorical_column_with_hash_bucket("keywords", 10K), dimensions=16)
+  columns = [price, keywords_embedded, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+  for units in [128, 64, 32]:
+    dense_tensor = tf.layers.dense(dense_tensor, units, tf.nn.relu)
+  prediction = tf.layers.dense(dense_tensor, 1)
+  ```
+
+  Args:
+    features: A mapping from key to tensors. `_FeatureColumn`s look up via these
+      keys. For example `numeric_column('price')` will look at 'price' key in
+      this dict. Values can be a `SparseTensor` or a `Tensor` depends on
+      corresponding `_FeatureColumn`.
+    feature_columns: An iterable containing the FeatureColumns to use as inputs
+      to your model. All items should be instances of classes derived from
+      `_DenseColumn` such as `numeric_column`, `embedding_column`,
+      `bucketized_column`, `indicator_column`. If you have categorical features,
+      you can wrap them with an `embedding_column` or `indicator_column`.
+    weight_collections: A list of collection names to which the Variable will be
+      added. Note that variables will also be added to collections
+      `tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
+    trainable: If `True` also add the variable to the graph collection
+      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
+    cols_to_vars: If not `None`, must be a dictionary that will be filled with a
+      mapping from `_FeatureColumn` to list of `Variable`s.  For example, after
+      the call, we might have cols_to_vars =
+      {_EmbeddingColumn(
+        categorical_column=_HashedCategoricalColumn(
+          key='sparse_feature', hash_bucket_size=5, dtype=tf.string),
+        dimension=10): [<tf.Variable 'some_variable:0' shape=(5, 10),
+                        <tf.Variable 'some_variable:1' shape=(5, 10)]}
+      If a column creates no variables, its value will be an empty list.
+
+  Returns:
+    A `Tensor` which represents input layer of a model. Its shape
+    is (batch_size, first_layer_dimension) and its dtype is `float32`.
+    first_layer_dimension is determined based on given `feature_columns`.
+
+  Raises:
+    ValueError: if an item in `feature_columns` is not a `_DenseColumn`.
+  """
+  return _internal_input_layer(features, feature_columns, weight_collections,
+                               trainable, cols_to_vars)
+
+
+# TODO(akshayka): InputLayer should be a subclass of Layer, and it
+# should implement the logic in input_layer using Layer's build-and-call
+# paradigm; input_layer should create an instance of InputLayer and
+# return the result of invoking its apply method, just as functional layers do.
+class InputLayer(object):
+  """An object-oriented version of `input_layer` that reuses variables."""
+
+  def __init__(self,
+               feature_columns,
+               weight_collections=None,
+               trainable=True,
+               cols_to_vars=None):
+    """See `input_layer`."""
+
+    self._feature_columns = feature_columns
+    self._weight_collections = weight_collections
+    self._trainable = trainable
+    self._cols_to_vars = cols_to_vars
+    self._input_layer_template = template.make_template(
+        'feature_column_input_layer',
+        _internal_input_layer,
+        create_scope_now_=True)
+    self._scope = self._input_layer_template.variable_scope
+
+  def __call__(self, features):
+    return self._input_layer_template(
+        features=features,
+        feature_columns=self._feature_columns,
+        weight_collections=self._weight_collections,
+        trainable=self._trainable,
+        cols_to_vars=None,
+        scope=self._scope)
+
+  @property
+  def non_trainable_variables(self):
+    return self._input_layer_template.non_trainable_variables
+
+  @property
+  def non_trainable_weights(self):
+    return self._input_layer_template.non_trainable_weights
+
+  @property
+  def trainable_variables(self):
+    return self._input_layer_template.trainable_variables
+
+  @property
+  def trainable_weights(self):
+    return self._input_layer_template.trainable_weights
+
+  @property
+  def variables(self):
+    return self._input_layer_template.variables
+
+  @property
+  def weights(self):
+    return self._input_layer_template.weights
+
+
+def linear_model(features,
+                 feature_columns,
+                 units=1,
+                 sparse_combiner='sum',
+                 weight_collections=None,
+                 trainable=True,
+                 cols_to_vars=None):
+  """Returns a linear prediction `Tensor` based on given `feature_columns`.
+
+  This function generates a weighted sum based on output dimension `units`.
+  Weighted sum refers to logits in classification problems. It refers to the
+  prediction itself for linear regression problems.
+
+  Note on supported columns: `linear_model` treats categorical columns as
+  `indicator_column`s. To be specific, assume the input as `SparseTensor` looks
+  like:
+
+  ```python
+    shape = [2, 2]
+    {
+        [0, 0]: "a"
+        [1, 0]: "b"
+        [1, 1]: "c"
+    }
+  ```
+  `linear_model` assigns weights for the presence of "a", "b", "c' implicitly,
+  just like `indicator_column`, while `input_layer` explicitly requires wrapping
+  each of categorical columns with an `embedding_column` or an
+  `indicator_column`.
+
+  Example of usage:
+
+  ```python
+  price = numeric_column('price')
+  price_buckets = bucketized_column(price, boundaries=[0., 10., 100., 1000.])
+  keywords = categorical_column_with_hash_bucket("keywords", 10K)
+  keywords_price = crossed_column('keywords', price_buckets, ...)
+  columns = [price_buckets, keywords, keywords_price ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  prediction = linear_model(features, columns)
+  ```
+
+  Args:
+    features: A mapping from key to tensors. `_FeatureColumn`s look up via these
+      keys. For example `numeric_column('price')` will look at 'price' key in
+      this dict. Values are `Tensor` or `SparseTensor` depending on
+      corresponding `_FeatureColumn`.
+    feature_columns: An iterable containing the FeatureColumns to use as inputs
+      to your model. All items should be instances of classes derived from
+      `_FeatureColumn`s.
+    units: An integer, dimensionality of the output space. Default value is 1.
+    sparse_combiner: A string specifying how to reduce if a categorical column
+      is multivalent. Except `numeric_column`, almost all columns passed to
+      `linear_model` are considered as categorical columns.  It combines each
+      categorical column independently. Currently "mean", "sqrtn" and "sum" are
+      supported, with "sum" the default for linear model. "sqrtn" often achieves
+      good accuracy, in particular with bag-of-words columns.
+        * "sum": do not normalize features in the column
+        * "mean": do l1 normalization on features in the column
+        * "sqrtn": do l2 normalization on features in the column
+      For example, for two features represented as the categorical columns:
+
+      ```python
+        # Feature 1
+
+        shape = [2, 2]
+        {
+            [0, 0]: "a"
+            [0, 1]: "b"
+            [1, 0]: "c"
+        }
+
+        # Feature 2
+
+        shape = [2, 3]
+        {
+            [0, 0]: "d"
+            [1, 0]: "e"
+            [1, 1]: "f"
+            [1, 2]: "g"
+        }
+      ```
+      with `sparse_combiner` as "mean", the linear model outputs conceptly are:
+      ```
+        y_0 = 1.0 / 2.0 * ( w_a + w_ b) + w_c + b_0
+        y_1 = w_d + 1.0 / 3.0 * ( w_e + w_ f + w_g) + b_1
+      ```
+      where `y_i` is the output, `b_i` is the bias, and `w_x` is the weight
+      assigned to the presence of `x` in the input features.
+    weight_collections: A list of collection names to which the Variable will be
+      added. Note that, variables will also be added to collections
+      `tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
+    trainable: If `True` also add the variable to the graph collection
+      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
+    cols_to_vars: If not `None`, must be a dictionary that will be filled with a
+      mapping from `_FeatureColumn` to associated list of `Variable`s.  For
+      example, after the call, we might have cols_to_vars = {
+        _NumericColumn(
+          key='numeric_feature1', shape=(1,):
+        [<tf.Variable 'linear_model/price2/weights:0' shape=(1, 1)>],
+        'bias': [<tf.Variable 'linear_model/bias_weights:0' shape=(1,)>],
+        _NumericColumn(
+          key='numeric_feature2', shape=(2,)):
+        [<tf.Variable 'linear_model/price1/weights:0' shape=(2, 1)>]}
+      If a column creates no variables, its value will be an empty list. Note
+      that cols_to_vars will also contain a string key 'bias' that maps to a
+      list of Variables.
+
+  Returns:
+    A `Tensor` which represents predictions/logits of a linear model. Its shape
+    is (batch_size, units) and its dtype is `float32`.
+
+  Raises:
+    ValueError: if an item in `feature_columns` is neither a `_DenseColumn`
+      nor `_CategoricalColumn`.
+  """
+  with variable_scope.variable_scope(None, 'linear_model') as vs:
+    model_name = _strip_leading_slashes(vs.name)
+  linear_model_layer = _LinearModel(
+      feature_columns=feature_columns,
+      units=units,
+      sparse_combiner=sparse_combiner,
+      weight_collections=weight_collections,
+      trainable=trainable,
+      name=model_name)
+  retval = linear_model_layer(features)  # pylint: disable=not-callable
+  if cols_to_vars is not None:
+    cols_to_vars.update(linear_model_layer.cols_to_vars())
+  return retval
+
+
+def _add_to_collections(var, weight_collections):
+  """Adds a var to the list of weight_collections provided.
+
+  Handles the case for partitioned and non-partitioned variables.
+
+  Args:
+    var: A variable or Partitioned Variable.
+    weight_collections: List of collections to add variable to.
+  """
+  for weight_collection in weight_collections:
+    # The layer self.add_variable call already adds it to GLOBAL_VARIABLES.
+    if weight_collection == ops.GraphKeys.GLOBAL_VARIABLES:
+      continue
+    # TODO(rohanj): Explore adding a _get_variable_list method on `Variable`
+    # so that we don't have to do this check.
+    if isinstance(var, variables.PartitionedVariable):
+      for constituent_var in list(var):
+        ops.add_to_collection(weight_collection, constituent_var)
+    else:
+      ops.add_to_collection(weight_collection, var)
+
+
+class _FCLinearWrapper(base.Layer):
+  """Wraps a _FeatureColumn in a layer for use in a linear model.
+
+  See `linear_model` above.
+  """
+
+  def __init__(self,
+               feature_column,
+               units=1,
+               sparse_combiner='sum',
+               weight_collections=None,
+               trainable=True,
+               name=None,
+               **kwargs):
+    super(_FCLinearWrapper, self).__init__(
+        trainable=trainable, name=name, **kwargs)
+    self._feature_column = feature_column
+    self._units = units
+    self._sparse_combiner = sparse_combiner
+    self._weight_collections = weight_collections
+
+  def build(self, _):
+    if isinstance(self._feature_column, fc_old._CategoricalColumn):  # pylint: disable=protected-access
+      weight = self.add_variable(
+          name='weights',
+          shape=(self._feature_column._num_buckets, self._units),  # pylint: disable=protected-access
+          initializer=init_ops.zeros_initializer(),
+          trainable=self.trainable)
+    else:
+      num_elements = self._feature_column._variable_shape.num_elements()  # pylint: disable=protected-access
+      weight = self.add_variable(
+          name='weights',
+          shape=[num_elements, self._units],
+          initializer=init_ops.zeros_initializer(),
+          trainable=self.trainable)
+    _add_to_collections(weight, self._weight_collections)
+    self._weight_var = weight
+    self.built = True
+
+  def call(self, builder):
+    weighted_sum = fc_old._create_weighted_sum(  # pylint: disable=protected-access
+        column=self._feature_column,
+        builder=builder,
+        units=self._units,
+        sparse_combiner=self._sparse_combiner,
+        weight_collections=self._weight_collections,
+        trainable=self.trainable,
+        weight_var=self._weight_var)
+    return weighted_sum
+
+
+class _BiasLayer(base.Layer):
+  """A layer for the bias term.
+  """
+
+  def __init__(self,
+               units=1,
+               trainable=True,
+               weight_collections=None,
+               name=None,
+               **kwargs):
+    super(_BiasLayer, self).__init__(trainable=trainable, name=name, **kwargs)
+    self._units = units
+    self._weight_collections = weight_collections
+
+  def build(self, _):
+    self._bias_variable = self.add_variable(
+        'bias_weights',
+        shape=[self._units],
+        initializer=init_ops.zeros_initializer(),
+        trainable=self.trainable)
+    _add_to_collections(self._bias_variable, self._weight_collections)
+    self.built = True
+
+  def call(self, _):
+    return self._bias_variable
+
+
+def _get_expanded_variable_list(variable):
+  if (isinstance(variable, variables.Variable) or
+      resource_variable_ops.is_resource_variable(variable)):
+    return [variable]  # Single variable case.
+  else:  # Must be a PartitionedVariable, so convert into a list.
+    return list(variable)
+
+
+def _strip_leading_slashes(name):
+  return name.rsplit('/', 1)[-1]
+
+
+class _LinearModel(training.Model):
+  """Creates a linear model using feature columns.
+
+  See `linear_model` for details.
+  """
+
+  def __init__(self,
+               feature_columns,
+               units=1,
+               sparse_combiner='sum',
+               weight_collections=None,
+               trainable=True,
+               name=None,
+               **kwargs):
+    super(_LinearModel, self).__init__(name=name, **kwargs)
+    self._feature_columns = fc_old._normalize_feature_columns(  # pylint: disable=protected-access
+        feature_columns)
+    self._weight_collections = list(weight_collections or [])
+    if ops.GraphKeys.GLOBAL_VARIABLES not in self._weight_collections:
+      self._weight_collections.append(ops.GraphKeys.GLOBAL_VARIABLES)
+    if ops.GraphKeys.MODEL_VARIABLES not in self._weight_collections:
+      self._weight_collections.append(ops.GraphKeys.MODEL_VARIABLES)
+
+    column_layers = {}
+    for column in sorted(self._feature_columns, key=lambda x: x.name):
+      with variable_scope.variable_scope(
+          None, default_name=column._var_scope_name) as vs:  # pylint: disable=protected-access
+        # Having the fully expressed variable scope name ends up doubly
+        # expressing the outer scope (scope with which this method was called)
+        # in the name of the variable that would get created.
+        column_name = _strip_leading_slashes(vs.name)
+      column_layer = _FCLinearWrapper(column, units, sparse_combiner,
+                                      self._weight_collections, trainable,
+                                      column_name, **kwargs)
+      column_layers[column_name] = column_layer
+    self._column_layers = self._add_layers(column_layers)
+    self._bias_layer = _BiasLayer(
+        units=units,
+        trainable=trainable,
+        weight_collections=self._weight_collections,
+        name='bias_layer',
+        **kwargs)
+    self._cols_to_vars = {}
+
+  def cols_to_vars(self):
+    """Returns a dict mapping _FeatureColumns to variables.
+
+    See `linear_model` for more information.
+    This is not populated till `call` is called i.e. layer is built.
+    """
+    return self._cols_to_vars
+
+  def call(self, features):
+    with variable_scope.variable_scope(self.name):
+      for column in self._feature_columns:
+        if not isinstance(
+            column,
+            (
+                fc_old._DenseColumn,  # pylint: disable=protected-access
+                fc_old._CategoricalColumn)):  # pylint: disable=protected-access
+          raise ValueError(
+              'Items of feature_columns must be either a '
+              '_DenseColumn or _CategoricalColumn. Given: {}'.format(column))
+      weighted_sums = []
+      ordered_columns = []
+      builder = fc_old._LazyBuilder(features)  # pylint: disable=protected-access
+      for layer in sorted(self._column_layers.values(), key=lambda x: x.name):
+        column = layer._feature_column  # pylint: disable=protected-access
+        ordered_columns.append(column)
+        weighted_sum = layer(builder)
+        weighted_sums.append(weighted_sum)
+        self._cols_to_vars[column] = ops.get_collection(
+            ops.GraphKeys.GLOBAL_VARIABLES, scope=layer.scope_name)
+
+      _verify_static_batch_size_equality(weighted_sums, ordered_columns)
+      predictions_no_bias = math_ops.add_n(
+          weighted_sums, name='weighted_sum_no_bias')
+      predictions = nn_ops.bias_add(
+          predictions_no_bias,
+          self._bias_layer(  # pylint: disable=not-callable
+              builder,
+              scope=variable_scope.get_variable_scope()),  # pylint: disable=not-callable
+          name='weighted_sum')
+      bias = self._bias_layer.variables[0]
+      self._cols_to_vars['bias'] = _get_expanded_variable_list(bias)
+    return predictions
+
+  def _add_layers(self, layers):
+    # "Magic" required for keras.Model classes to track all the variables in
+    # a list of layers.Layer objects.
+    # TODO(ashankar): Figure out API so user code doesn't have to do this.
+    for name, layer in layers.items():
+      setattr(self, 'layer-%s' % name, layer)
+    return layers
+
+
+def _transform_features(features, feature_columns, state_manager):
+  """Returns transformed features based on features columns passed in.
+
+  Please note that most probably you would not need to use this function. Please
+  check `input_layer` and `linear_model` to see whether they will
+  satisfy your use case or not.
+
+  Example:
+
+  ```python
+  # Define features and transformations
+  crosses_a_x_b = crossed_column(
+      columns=["sparse_feature_a", "sparse_feature_b"], hash_bucket_size=10000)
+  price_buckets = bucketized_column(
+      source_column=numeric_column("price"), boundaries=[...])
+
+  columns = [crosses_a_x_b, price_buckets]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  transformed = transform_features(features=features, feature_columns=columns)
+
+  assertCountEqual(columns, transformed.keys())
+  ```
+
+  Args:
+    features: A mapping from key to tensors. `FeatureColumn`s look up via these
+      keys. For example `numeric_column('price')` will look at 'price' key in
+      this dict. Values can be a `SparseTensor` or a `Tensor` depends on
+      corresponding `FeatureColumn`.
+    feature_columns: An iterable containing all the `FeatureColumn`s.
+    state_manager: A StateManager object that holds the FeatureColumn state.
+
+  Returns:
+    A `dict` mapping `FeatureColumn` to `Tensor` and `SparseTensor` values.
+  """
+  feature_columns = _normalize_feature_columns(feature_columns)
+  outputs = {}
+  with ops.name_scope(
+      None, default_name='transform_features', values=features.values()):
+    transformation_cache = FeatureTransformationCache(features)
+    for column in sorted(feature_columns, key=lambda x: x.name):
+      with ops.name_scope(None, default_name=column.name):
+        outputs[column] = transformation_cache.get(column, state_manager)
+  return outputs
+
+
+def make_parse_example_spec(feature_columns):
+  """Creates parsing spec dictionary from input feature_columns.
+
+  The returned dictionary can be used as arg 'features' in `tf.parse_example`.
+
+  Typical usage example:
+
+  ```python
+  # Define features and transformations
+  feature_a = categorical_column_with_vocabulary_file(...)
+  feature_b = numeric_column(...)
+  feature_c_bucketized = bucketized_column(numeric_column("feature_c"), ...)
+  feature_a_x_feature_c = crossed_column(
+      columns=["feature_a", feature_c_bucketized], ...)
+
+  feature_columns = set(
+      [feature_b, feature_c_bucketized, feature_a_x_feature_c])
+  features = tf.parse_example(
+      serialized=serialized_examples,
+      features=make_parse_example_spec(feature_columns))
+  ```
+
+  For the above example, make_parse_example_spec would return the dict:
+
+  ```python
+  {
+      "feature_a": parsing_ops.VarLenFeature(tf.string),
+      "feature_b": parsing_ops.FixedLenFeature([1], dtype=tf.float32),
+      "feature_c": parsing_ops.FixedLenFeature([1], dtype=tf.float32)
+  }
+  ```
+
+  Args:
+    feature_columns: An iterable containing all feature columns. All items
+      should be instances of classes derived from `FeatureColumn`.
+
+  Returns:
+    A dict mapping each feature key to a `FixedLenFeature` or `VarLenFeature`
+    value.
+
+  Raises:
+    ValueError: If any of the given `feature_columns` is not a `FeatureColumn`
+      instance.
+  """
+  result = {}
+  for column in feature_columns:
+    if not isinstance(column, FeatureColumn):
+      raise ValueError('All feature_columns must be FeatureColumn instances. '
+                       'Given: {}'.format(column))
+    config = column.parse_example_spec
+    for key, value in six.iteritems(config):
+      if key in result and value != result[key]:
+        raise ValueError(
+            'feature_columns contain different parse_spec for key '
+            '{}. Given {} and {}'.format(key, value, result[key]))
+    result.update(config)
+  return result
+
+
+def embedding_column(
+    categorical_column, dimension, combiner='mean', initializer=None,
+    ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None,
+    trainable=True):
+  """`_DenseColumn` that converts from sparse, categorical input.
+
+  Use this when your inputs are sparse, but you want to convert them to a dense
+  representation (e.g., to feed to a DNN).
+
+  Inputs must be a `_CategoricalColumn` created by any of the
+  `categorical_column_*` function. Here is an example of using
+  `embedding_column` with `DNNClassifier`:
+
+  ```python
+  video_id = categorical_column_with_identity(
+      key='video_id', num_buckets=1000000, default_value=0)
+  columns = [embedding_column(video_id, 9),...]
+
+  estimator = tf.estimator.DNNClassifier(feature_columns=columns, ...)
+
+  label_column = ...
+  def input_fn():
+    features = tf.parse_example(
+        ..., features=make_parse_example_spec(columns + [label_column]))
+    labels = features.pop(label_column.name)
+    return features, labels
+
+  estimator.train(input_fn=input_fn, steps=100)
+  ```
+
+  Here is an example using `embedding_column` with model_fn:
+
+  ```python
+  def model_fn(features, ...):
+    video_id = categorical_column_with_identity(
+        key='video_id', num_buckets=1000000, default_value=0)
+    columns = [embedding_column(video_id, 9),...]
+    dense_tensor = input_layer(features, columns)
+    # Form DNN layers, calculate loss, and return EstimatorSpec.
+    ...
+  ```
+
+  Args:
+    categorical_column: A `_CategoricalColumn` created by a
+      `categorical_column_with_*` function. This column produces the sparse IDs
+      that are inputs to the embedding lookup.
+    dimension: An integer specifying dimension of the embedding, must be > 0.
+    combiner: A string specifying how to reduce if there are multiple entries
+      in a single row. Currently 'mean', 'sqrtn' and 'sum' are supported, with
+      'mean' the default. 'sqrtn' often achieves good accuracy, in particular
+      with bag-of-words columns. Each of this can be thought as example level
+      normalizations on the column. For more information, see
+      `tf.embedding_lookup_sparse`.
+    initializer: A variable initializer function to be used in embedding
+      variable initialization. If not specified, defaults to
+      `tf.truncated_normal_initializer` with mean `0.0` and standard deviation
+      `1/sqrt(dimension)`.
+    ckpt_to_load_from: String representing checkpoint name/pattern from which to
+      restore column weights. Required if `tensor_name_in_ckpt` is not `None`.
+    tensor_name_in_ckpt: Name of the `Tensor` in `ckpt_to_load_from` from
+      which to restore the column weights. Required if `ckpt_to_load_from` is
+      not `None`.
+    max_norm: If not `None`, embedding values are l2-normalized to this value.
+    trainable: Whether or not the embedding is trainable. Default is True.
+
+  Returns:
+    `_DenseColumn` that converts from sparse input.
+
+  Raises:
+    ValueError: if `dimension` not > 0.
+    ValueError: if exactly one of `ckpt_to_load_from` and `tensor_name_in_ckpt`
+      is specified.
+    ValueError: if `initializer` is specified and is not callable.
+    RuntimeError: If eager execution is enabled.
+  """
+  if (dimension is None) or (dimension < 1):
+    raise ValueError('Invalid dimension {}.'.format(dimension))
+  if (ckpt_to_load_from is None) != (tensor_name_in_ckpt is None):
+    raise ValueError('Must specify both `ckpt_to_load_from` and '
+                     '`tensor_name_in_ckpt` or none of them.')
+
+  if (initializer is not None) and (not callable(initializer)):
+    raise ValueError('initializer must be callable if specified. '
+                     'Embedding of column_name: {}'.format(
+                         categorical_column.name))
+  if initializer is None:
+    initializer = init_ops.truncated_normal_initializer(
+        mean=0.0, stddev=1 / math.sqrt(dimension))
+
+  return EmbeddingColumn(
+      categorical_column=categorical_column,
+      dimension=dimension,
+      combiner=combiner,
+      initializer=initializer,
+      ckpt_to_load_from=ckpt_to_load_from,
+      tensor_name_in_ckpt=tensor_name_in_ckpt,
+      max_norm=max_norm,
+      trainable=trainable)
+
+
+def shared_embedding_columns(
+    categorical_columns, dimension, combiner='mean', initializer=None,
+    shared_embedding_collection_name=None, ckpt_to_load_from=None,
+    tensor_name_in_ckpt=None, max_norm=None, trainable=True):
+  """List of dense columns that convert from sparse, categorical input.
+
+  This is similar to `embedding_column`, except that it produces a list of
+  embedding columns that share the same embedding weights.
+
+  Use this when your inputs are sparse and of the same type (e.g. watched and
+  impression video IDs that share the same vocabulary), and you want to convert
+  them to a dense representation (e.g., to feed to a DNN).
+
+  Inputs must be a list of categorical columns created by any of the
+  `categorical_column_*` function. They must all be of the same type and have
+  the same arguments except `key`. E.g. they can be
+  categorical_column_with_vocabulary_file with the same vocabulary_file. Some or
+  all columns could also be weighted_categorical_column.
+
+  Here is an example embedding of two features for a DNNClassifier model:
+
+  ```python
+  watched_video_id = categorical_column_with_vocabulary_file(
+      'watched_video_id', video_vocabulary_file, video_vocabulary_size)
+  impression_video_id = categorical_column_with_vocabulary_file(
+      'impression_video_id', video_vocabulary_file, video_vocabulary_size)
+  columns = shared_embedding_columns(
+      [watched_video_id, impression_video_id], dimension=10)
+
+  estimator = tf.estimator.DNNClassifier(feature_columns=columns, ...)
+
+  label_column = ...
+  def input_fn():
+    features = tf.parse_example(
+        ..., features=make_parse_example_spec(columns + [label_column]))
+    labels = features.pop(label_column.name)
+    return features, labels
+
+  estimator.train(input_fn=input_fn, steps=100)
+  ```
+
+  Here is an example using `shared_embedding_columns` with model_fn:
+
+  ```python
+  def model_fn(features, ...):
+    watched_video_id = categorical_column_with_vocabulary_file(
+        'watched_video_id', video_vocabulary_file, video_vocabulary_size)
+    impression_video_id = categorical_column_with_vocabulary_file(
+        'impression_video_id', video_vocabulary_file, video_vocabulary_size)
+    columns = shared_embedding_columns(
+        [watched_video_id, impression_video_id], dimension=10)
+    dense_tensor = input_layer(features, columns)
+    # Form DNN layers, calculate loss, and return EstimatorSpec.
+    ...
+  ```
+
+  Args:
+    categorical_columns: List of categorical columns created by a
+      `categorical_column_with_*` function. These columns produce the sparse IDs
+      that are inputs to the embedding lookup. All columns must be of the same
+      type and have the same arguments except `key`. E.g. they can be
+      categorical_column_with_vocabulary_file with the same vocabulary_file.
+      Some or all columns could also be weighted_categorical_column.
+    dimension: An integer specifying dimension of the embedding, must be > 0.
+    combiner: A string specifying how to reduce if there are multiple entries
+      in a single row. Currently 'mean', 'sqrtn' and 'sum' are supported, with
+      'mean' the default. 'sqrtn' often achieves good accuracy, in particular
+      with bag-of-words columns. Each of this can be thought as example level
+      normalizations on the column. For more information, see
+      `tf.embedding_lookup_sparse`.
+    initializer: A variable initializer function to be used in embedding
+      variable initialization. If not specified, defaults to
+      `tf.truncated_normal_initializer` with mean `0.0` and standard deviation
+      `1/sqrt(dimension)`.
+    shared_embedding_collection_name: Optional collective name of these columns.
+      If not given, a reasonable name will be chosen based on the names of
+      `categorical_columns`.
+    ckpt_to_load_from: String representing checkpoint name/pattern from which to
+      restore column weights. Required if `tensor_name_in_ckpt` is not `None`.
+    tensor_name_in_ckpt: Name of the `Tensor` in `ckpt_to_load_from` from
+      which to restore the column weights. Required if `ckpt_to_load_from` is
+      not `None`.
+    max_norm: If not `None`, each embedding is clipped if its l2-norm is
+      larger than this value, before combining.
+    trainable: Whether or not the embedding is trainable. Default is True.
+
+  Returns:
+    A list of dense columns that converts from sparse input. The order of
+    results follows the ordering of `categorical_columns`.
+
+  Raises:
+    ValueError: if `dimension` not > 0.
+    ValueError: if any of the given `categorical_columns` is of different type
+      or has different arguments than the others.
+    ValueError: if exactly one of `ckpt_to_load_from` and `tensor_name_in_ckpt`
+      is specified.
+    ValueError: if `initializer` is specified and is not callable.
+    RuntimeError: if eager execution is enabled.
+  """
+  if context.executing_eagerly():
+    raise RuntimeError('shared_embedding_columns are not supported when eager '
+                       'execution is enabled.')
+
+  if (dimension is None) or (dimension < 1):
+    raise ValueError('Invalid dimension {}.'.format(dimension))
+  if (ckpt_to_load_from is None) != (tensor_name_in_ckpt is None):
+    raise ValueError('Must specify both `ckpt_to_load_from` and '
+                     '`tensor_name_in_ckpt` or none of them.')
+
+  if (initializer is not None) and (not callable(initializer)):
+    raise ValueError('initializer must be callable if specified.')
+  if initializer is None:
+    initializer = init_ops.truncated_normal_initializer(
+        mean=0.0, stddev=1. / math.sqrt(dimension))
+
+  # Sort the columns so the default collection name is deterministic even if the
+  # user passes columns from an unsorted collection, such as dict.values().
+  sorted_columns = sorted(categorical_columns, key=lambda x: x.name)
+
+  c0 = sorted_columns[0]
+  num_buckets = c0.num_buckets
+  if not isinstance(c0, CategoricalColumn):
+    raise ValueError(
+        'All categorical_columns must be subclasses of CategoricalColumn. '
+        'Given: {}, of type: {}'.format(c0, type(c0)))
+  if isinstance(c0, WeightedCategoricalColumn):
+    c0 = c0.categorical_column
+  for c in sorted_columns[1:]:
+    if isinstance(c, WeightedCategoricalColumn):
+      c = c.categorical_column
+    if not isinstance(c, type(c0)):
+      raise ValueError(
+          'To use shared_embedding_column, all categorical_columns must have '
+          'the same type, or be weighted_categorical_column of the same type. '
+          'Given column: {} of type: {} does not match given column: {} of '
+          'type: {}'.format(c0, type(c0), c, type(c)))
+    if num_buckets != c.num_buckets:
+      raise ValueError(
+          'To use shared_embedding_column, all categorical_columns must have '
+          'the same number of buckets. Given column: {} with buckets: {} does  '
+          'not match column: {} with buckets: {}'.format(
+              c0, num_buckets, c, c.num_buckets))
+
+  if not shared_embedding_collection_name:
+    shared_embedding_collection_name = '_'.join(c.name for c in sorted_columns)
+    shared_embedding_collection_name += '_shared_embedding'
+
+  result = []
+  for column in categorical_columns:
+    result.append(
+        SharedEmbeddingColumn(
+            categorical_column=column,
+            initializer=initializer,
+            dimension=dimension,
+            combiner=combiner,
+            shared_embedding_collection_name=shared_embedding_collection_name,
+            ckpt_to_load_from=ckpt_to_load_from,
+            tensor_name_in_ckpt=tensor_name_in_ckpt,
+            max_norm=max_norm,
+            trainable=trainable))
+
+  return result
+
+
+def numeric_column(key,
+                   shape=(1,),
+                   default_value=None,
+                   dtype=dtypes.float32,
+                   normalizer_fn=None):
+  """Represents real valued or numerical features.
+
+  Example:
+
+  ```python
+  price = numeric_column('price')
+  columns = [price, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+
+  # or
+  bucketized_price = bucketized_column(price, boundaries=[...])
+  columns = [bucketized_price, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature. It is used as the
+      column name and the dictionary key for feature parsing configs, feature
+      `Tensor` objects, and feature columns.
+    shape: An iterable of integers specifies the shape of the `Tensor`. An
+      integer can be given which means a single dimension `Tensor` with given
+      width. The `Tensor` representing the column will have the shape of
+      [batch_size] + `shape`.
+    default_value: A single value compatible with `dtype` or an iterable of
+      values compatible with `dtype` which the column takes on during
+      `tf.Example` parsing if data is missing. A default value of `None` will
+      cause `tf.parse_example` to fail if an example does not contain this
+      column. If a single value is provided, the same value will be applied as
+      the default value for every item. If an iterable of values is provided,
+      the shape of the `default_value` should be equal to the given `shape`.
+    dtype: defines the type of values. Default value is `tf.float32`. Must be a
+      non-quantized, real integer or floating point type.
+    normalizer_fn: If not `None`, a function that can be used to normalize the
+      value of the tensor after `default_value` is applied for parsing.
+      Normalizer function takes the input `Tensor` as its argument, and returns
+      the output `Tensor`. (e.g. lambda x: (x - 3.0) / 4.2). Please note that
+      even though the most common use case of this function is normalization, it
+      can be used for any kind of Tensorflow transformations.
+
+  Returns:
+    A `NumericColumn`.
+
+  Raises:
+    TypeError: if any dimension in shape is not an int
+    ValueError: if any dimension in shape is not a positive integer
+    TypeError: if `default_value` is an iterable but not compatible with `shape`
+    TypeError: if `default_value` is not compatible with `dtype`.
+    ValueError: if `dtype` is not convertible to `tf.float32`.
+  """
+  shape = _check_shape(shape, key)
+  if not (dtype.is_integer or dtype.is_floating):
+    raise ValueError('dtype must be convertible to float. '
+                     'dtype: {}, key: {}'.format(dtype, key))
+  default_value = _check_default_value(shape, default_value, dtype, key)
+
+  if normalizer_fn is not None and not callable(normalizer_fn):
+    raise TypeError(
+        'normalizer_fn must be a callable. Given: {}'.format(normalizer_fn))
+
+  _assert_key_is_string(key)
+  return NumericColumn(
+      key,
+      shape=shape,
+      default_value=default_value,
+      dtype=dtype,
+      normalizer_fn=normalizer_fn)
+
+
+def bucketized_column(source_column, boundaries):
+  """Represents discretized dense input.
+
+  Buckets include the left boundary, and exclude the right boundary. Namely,
+  `boundaries=[0., 1., 2.]` generates buckets `(-inf, 0.)`, `[0., 1.)`,
+  `[1., 2.)`, and `[2., +inf)`.
+
+  For example, if the inputs are
+
+  ```python
+  boundaries = [0, 10, 100]
+  input tensor = [[-5, 10000]
+                  [150,   10]
+                  [5,    100]]
+  ```
+
+  then the output will be
+
+  ```python
+  output = [[0, 3]
+            [3, 2]
+            [1, 3]]
+  ```
+
+  Example:
+
+  ```python
+  price = numeric_column('price')
+  bucketized_price = bucketized_column(price, boundaries=[...])
+  columns = [bucketized_price, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+
+  # or
+  columns = [bucketized_price, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+  ```
+
+  `bucketized_column` can also be crossed with another categorical column using
+  `crossed_column`:
+
+  ```python
+  price = numeric_column('price')
+  # bucketized_column converts numerical feature to a categorical one.
+  bucketized_price = bucketized_column(price, boundaries=[...])
+  # 'keywords' is a string feature.
+  price_x_keywords = crossed_column([bucketized_price, 'keywords'], 50K)
+  columns = [price_x_keywords, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+  ```
+
+  Args:
+    source_column: A one-dimensional dense column which is generated with
+      `numeric_column`.
+    boundaries: A sorted list or tuple of floats specifying the boundaries.
+
+  Returns:
+    A `BucketizedColumn`.
+
+  Raises:
+    ValueError: If `source_column` is not a numeric column, or if it is not
+      one-dimensional.
+    ValueError: If `boundaries` is not a sorted list or tuple.
+  """
+  if not isinstance(source_column, NumericColumn):
+    raise ValueError(
+        'source_column must be a column generated with numeric_column(). '
+        'Given: {}'.format(source_column))
+  if len(source_column.shape) > 1:
+    raise ValueError(
+        'source_column must be one-dimensional column. '
+        'Given: {}'.format(source_column))
+  if (not boundaries or
+      not (isinstance(boundaries, list) or isinstance(boundaries, tuple))):
+    raise ValueError('boundaries must be a sorted list.')
+  for i in range(len(boundaries) - 1):
+    if boundaries[i] >= boundaries[i + 1]:
+      raise ValueError('boundaries must be a sorted list.')
+  return BucketizedColumn(source_column, tuple(boundaries))
+
+
+def _assert_string_or_int(dtype, prefix):
+  if (dtype != dtypes.string) and (not dtype.is_integer):
+    raise ValueError(
+        '{} dtype must be string or integer. dtype: {}.'.format(prefix, dtype))
+
+
+def _assert_key_is_string(key):
+  if not isinstance(key, six.string_types):
+    raise ValueError(
+        'key must be a string. Got: type {}. Given key: {}.'.format(
+            type(key), key))
+
+
+def categorical_column_with_hash_bucket(key,
+                                        hash_bucket_size,
+                                        dtype=dtypes.string):
+  """Represents sparse feature where ids are set by hashing.
+
+  Use this when your sparse features are in string or integer format, and you
+  want to distribute your inputs into a finite number of buckets by hashing.
+  output_id = Hash(input_feature_string) % bucket_size for string type input.
+  For int type input, the value is converted to its string representation first
+  and then hashed by the same formula.
+
+  For input dictionary `features`, `features[key]` is either `Tensor` or
+  `SparseTensor`. If `Tensor`, missing values can be represented by `-1` for int
+  and `''` for string, which will be dropped by this feature column.
+
+  Example:
+
+  ```python
+  keywords = categorical_column_with_hash_bucket("keywords", 10K)
+  columns = [keywords, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+
+  # or
+  keywords_embedded = embedding_column(keywords, 16)
+  columns = [keywords_embedded, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature. It is used as the
+      column name and the dictionary key for feature parsing configs, feature
+      `Tensor` objects, and feature columns.
+    hash_bucket_size: An int > 1. The number of buckets.
+    dtype: The type of features. Only string and integer types are supported.
+
+  Returns:
+    A `HashedCategoricalColumn`.
+
+  Raises:
+    ValueError: `hash_bucket_size` is not greater than 1.
+    ValueError: `dtype` is neither string nor integer.
+  """
+  if hash_bucket_size is None:
+    raise ValueError('hash_bucket_size must be set. ' 'key: {}'.format(key))
+
+  if hash_bucket_size < 1:
+    raise ValueError('hash_bucket_size must be at least 1. '
+                     'hash_bucket_size: {}, key: {}'.format(
+                         hash_bucket_size, key))
+
+  _assert_key_is_string(key)
+  _assert_string_or_int(dtype, prefix='column_name: {}'.format(key))
+
+  return HashedCategoricalColumn(key, hash_bucket_size, dtype)
+
+
+def categorical_column_with_vocabulary_file(key,
+                                            vocabulary_file,
+                                            vocabulary_size=None,
+                                            num_oov_buckets=0,
+                                            default_value=None,
+                                            dtype=dtypes.string):
+  """A `CategoricalColumn` with a vocabulary file.
+
+  Use this when your inputs are in string or integer format, and you have a
+  vocabulary file that maps each value to an integer ID. By default,
+  out-of-vocabulary values are ignored. Use either (but not both) of
+  `num_oov_buckets` and `default_value` to specify how to include
+  out-of-vocabulary values.
+
+  For input dictionary `features`, `features[key]` is either `Tensor` or
+  `SparseTensor`. If `Tensor`, missing values can be represented by `-1` for int
+  and `''` for string, which will be dropped by this feature column.
+
+  Example with `num_oov_buckets`:
+  File '/us/states.txt' contains 50 lines, each with a 2-character U.S. state
+  abbreviation. All inputs with values in that file are assigned an ID 0-49,
+  corresponding to its line number. All other values are hashed and assigned an
+  ID 50-54.
+
+  ```python
+  states = categorical_column_with_vocabulary_file(
+      key='states', vocabulary_file='/us/states.txt', vocabulary_size=50,
+      num_oov_buckets=5)
+  columns = [states, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+  ```
+
+  Example with `default_value`:
+  File '/us/states.txt' contains 51 lines - the first line is 'XX', and the
+  other 50 each have a 2-character U.S. state abbreviation. Both a literal 'XX'
+  in input, and other values missing from the file, will be assigned ID 0. All
+  others are assigned the corresponding line number 1-50.
+
+  ```python
+  states = categorical_column_with_vocabulary_file(
+      key='states', vocabulary_file='/us/states.txt', vocabulary_size=51,
+      default_value=0)
+  columns = [states, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction, _, _ = linear_model(features, columns)
+  ```
+
+  And to make an embedding with either:
+
+  ```python
+  columns = [embedding_column(states, 3),...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature. It is used as the
+      column name and the dictionary key for feature parsing configs, feature
+      `Tensor` objects, and feature columns.
+    vocabulary_file: The vocabulary file name.
+    vocabulary_size: Number of the elements in the vocabulary. This must be no
+      greater than length of `vocabulary_file`, if less than length, later
+      values are ignored. If None, it is set to the length of `vocabulary_file`.
+    num_oov_buckets: Non-negative integer, the number of out-of-vocabulary
+      buckets. All out-of-vocabulary inputs will be assigned IDs in the range
+      `[vocabulary_size, vocabulary_size+num_oov_buckets)` based on a hash of
+      the input value. A positive `num_oov_buckets` can not be specified with
+      `default_value`.
+    default_value: The integer ID value to return for out-of-vocabulary feature
+      values, defaults to `-1`. This can not be specified with a positive
+      `num_oov_buckets`.
+    dtype: The type of features. Only string and integer types are supported.
+
+  Returns:
+    A `CategoricalColumn` with a vocabulary file.
+
+  Raises:
+    ValueError: `vocabulary_file` is missing or cannot be opened.
+    ValueError: `vocabulary_size` is missing or < 1.
+    ValueError: `num_oov_buckets` is a negative integer.
+    ValueError: `num_oov_buckets` and `default_value` are both specified.
+    ValueError: `dtype` is neither string nor integer.
+  """
+  if not vocabulary_file:
+    raise ValueError('Missing vocabulary_file in {}.'.format(key))
+
+  if vocabulary_size is None:
+    if not gfile.Exists(vocabulary_file):
+      raise ValueError('vocabulary_file in {} does not exist.'.format(key))
+
+    with gfile.GFile(vocabulary_file) as f:
+      vocabulary_size = sum(1 for _ in f)
+    logging.info(
+        'vocabulary_size = %d in %s is inferred from the number of elements '
+        'in the vocabulary_file %s.', vocabulary_size, key, vocabulary_file)
+
+  # `vocabulary_size` isn't required for lookup, but it is for `_num_buckets`.
+  if vocabulary_size < 1:
+    raise ValueError('Invalid vocabulary_size in {}.'.format(key))
+  if num_oov_buckets:
+    if default_value is not None:
+      raise ValueError(
+          'Can\'t specify both num_oov_buckets and default_value in {}.'.format(
+              key))
+    if num_oov_buckets < 0:
+      raise ValueError('Invalid num_oov_buckets {} in {}.'.format(
+          num_oov_buckets, key))
+  _assert_string_or_int(dtype, prefix='column_name: {}'.format(key))
+  _assert_key_is_string(key)
+  return VocabularyFileCategoricalColumn(
+      key=key,
+      vocabulary_file=vocabulary_file,
+      vocabulary_size=vocabulary_size,
+      num_oov_buckets=0 if num_oov_buckets is None else num_oov_buckets,
+      default_value=-1 if default_value is None else default_value,
+      dtype=dtype)
+
+
+def categorical_column_with_vocabulary_list(
+    key, vocabulary_list, dtype=None, default_value=-1, num_oov_buckets=0):
+  """A `_CategoricalColumn` with in-memory vocabulary.
+
+  Use this when your inputs are in string or integer format, and you have an
+  in-memory vocabulary mapping each value to an integer ID. By default,
+  out-of-vocabulary values are ignored. Use either (but not both) of
+  `num_oov_buckets` and `default_value` to specify how to include
+  out-of-vocabulary values.
+
+  For input dictionary `features`, `features[key]` is either `Tensor` or
+  `SparseTensor`. If `Tensor`, missing values can be represented by `-1` for int
+  and `''` for string, which will be dropped by this feature column.
+
+  Example with `num_oov_buckets`:
+  In the following example, each input in `vocabulary_list` is assigned an ID
+  0-3 corresponding to its index (e.g., input 'B' produces output 2). All other
+  inputs are hashed and assigned an ID 4-5.
+
+  ```python
+  colors = categorical_column_with_vocabulary_list(
+      key='colors', vocabulary_list=('R', 'G', 'B', 'Y'),
+      num_oov_buckets=2)
+  columns = [colors, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction, _, _ = linear_model(features, columns)
+  ```
+
+  Example with `default_value`:
+  In the following example, each input in `vocabulary_list` is assigned an ID
+  0-4 corresponding to its index (e.g., input 'B' produces output 3). All other
+  inputs are assigned `default_value` 0.
+
+
+  ```python
+  colors = categorical_column_with_vocabulary_list(
+      key='colors', vocabulary_list=('X', 'R', 'G', 'B', 'Y'), default_value=0)
+  columns = [colors, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction, _, _ = linear_model(features, columns)
+  ```
+
+  And to make an embedding with either:
+
+  ```python
+  columns = [embedding_column(colors, 3),...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature. It is used as the
+      column name and the dictionary key for feature parsing configs, feature
+      `Tensor` objects, and feature columns.
+    vocabulary_list: An ordered iterable defining the vocabulary. Each feature
+      is mapped to the index of its value (if present) in `vocabulary_list`.
+      Must be castable to `dtype`.
+    dtype: The type of features. Only string and integer types are supported.
+      If `None`, it will be inferred from `vocabulary_list`.
+    default_value: The integer ID value to return for out-of-vocabulary feature
+      values, defaults to `-1`. This can not be specified with a positive
+      `num_oov_buckets`.
+    num_oov_buckets: Non-negative integer, the number of out-of-vocabulary
+      buckets. All out-of-vocabulary inputs will be assigned IDs in the range
+      `[len(vocabulary_list), len(vocabulary_list)+num_oov_buckets)` based on a
+      hash of the input value. A positive `num_oov_buckets` can not be specified
+      with `default_value`.
+
+  Returns:
+    A `CategoricalColumn` with in-memory vocabulary.
+
+  Raises:
+    ValueError: if `vocabulary_list` is empty, or contains duplicate keys.
+    ValueError: `num_oov_buckets` is a negative integer.
+    ValueError: `num_oov_buckets` and `default_value` are both specified.
+    ValueError: if `dtype` is not integer or string.
+  """
+  if (vocabulary_list is None) or (len(vocabulary_list) < 1):
+    raise ValueError(
+        'vocabulary_list {} must be non-empty, column_name: {}'.format(
+            vocabulary_list, key))
+  if len(set(vocabulary_list)) != len(vocabulary_list):
+    raise ValueError(
+        'Duplicate keys in vocabulary_list {}, column_name: {}'.format(
+            vocabulary_list, key))
+  vocabulary_dtype = dtypes.as_dtype(np.array(vocabulary_list).dtype)
+  if num_oov_buckets:
+    if default_value != -1:
+      raise ValueError(
+          'Can\'t specify both num_oov_buckets and default_value in {}.'.format(
+              key))
+    if num_oov_buckets < 0:
+      raise ValueError('Invalid num_oov_buckets {} in {}.'.format(
+          num_oov_buckets, key))
+  _assert_string_or_int(
+      vocabulary_dtype, prefix='column_name: {} vocabulary'.format(key))
+  if dtype is None:
+    dtype = vocabulary_dtype
+  elif dtype.is_integer != vocabulary_dtype.is_integer:
+    raise ValueError(
+        'dtype {} and vocabulary dtype {} do not match, column_name: {}'.format(
+            dtype, vocabulary_dtype, key))
+  _assert_string_or_int(dtype, prefix='column_name: {}'.format(key))
+  _assert_key_is_string(key)
+
+  return VocabularyListCategoricalColumn(
+      key=key,
+      vocabulary_list=tuple(vocabulary_list),
+      dtype=dtype,
+      default_value=default_value,
+      num_oov_buckets=num_oov_buckets)
+
+
+def categorical_column_with_identity(key, num_buckets, default_value=None):
+  """A `CategoricalColumn` that returns identity values.
+
+  Use this when your inputs are integers in the range `[0, num_buckets)`, and
+  you want to use the input value itself as the categorical ID. Values outside
+  this range will result in `default_value` if specified, otherwise it will
+  fail.
+
+  Typically, this is used for contiguous ranges of integer indexes, but
+  it doesn't have to be. This might be inefficient, however, if many of IDs
+  are unused. Consider `categorical_column_with_hash_bucket` in that case.
+
+  For input dictionary `features`, `features[key]` is either `Tensor` or
+  `SparseTensor`. If `Tensor`, missing values can be represented by `-1` for int
+  and `''` for string, which will be dropped by this feature column.
+
+  In the following examples, each input in the range `[0, 1000000)` is assigned
+  the same value. All other inputs are assigned `default_value` 0. Note that a
+  literal 0 in inputs will result in the same default ID.
+
+  Linear model:
+
+  ```python
+  video_id = categorical_column_with_identity(
+      key='video_id', num_buckets=1000000, default_value=0)
+  columns = [video_id, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction, _, _ = linear_model(features, columns)
+  ```
+
+  Embedding for a DNN model:
+
+  ```python
+  columns = [embedding_column(video_id, 9),...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature. It is used as the
+      column name and the dictionary key for feature parsing configs, feature
+      `Tensor` objects, and feature columns.
+    num_buckets: Range of inputs and outputs is `[0, num_buckets)`.
+    default_value: If `None`, this column's graph operations will fail for
+      out-of-range inputs. Otherwise, this value must be in the range
+      `[0, num_buckets)`, and will replace inputs in that range.
+
+  Returns:
+    A `CategoricalColumn` that returns identity values.
+
+  Raises:
+    ValueError: if `num_buckets` is less than one.
+    ValueError: if `default_value` is not in range `[0, num_buckets)`.
+  """
+  if num_buckets < 1:
+    raise ValueError(
+        'num_buckets {} < 1, column_name {}'.format(num_buckets, key))
+  if (default_value is not None) and (
+      (default_value < 0) or (default_value >= num_buckets)):
+    raise ValueError(
+        'default_value {} not in range [0, {}), column_name {}'.format(
+            default_value, num_buckets, key))
+  _assert_key_is_string(key)
+  return IdentityCategoricalColumn(
+      key=key, number_buckets=num_buckets, default_value=default_value)
+
+
+def indicator_column(categorical_column):
+  """Represents multi-hot representation of given categorical column.
+
+  - For DNN model, `indicator_column` can be used to wrap any
+    `categorical_column_*` (e.g., to feed to DNN). Consider to Use
+    `embedding_column` if the number of buckets/unique(values) are large.
+
+  - For Wide (aka linear) model, `indicator_column` is the internal
+    representation for categorical column when passing categorical column
+    directly (as any element in feature_columns) to `linear_model`. See
+    `linear_model` for details.
+
+  ```python
+  name = indicator_column(categorical_column_with_vocabulary_list(
+      'name', ['bob', 'george', 'wanda'])
+  columns = [name, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  dense_tensor = input_layer(features, columns)
+
+  dense_tensor == [[1, 0, 0]]  # If "name" bytes_list is ["bob"]
+  dense_tensor == [[1, 0, 1]]  # If "name" bytes_list is ["bob", "wanda"]
+  dense_tensor == [[2, 0, 0]]  # If "name" bytes_list is ["bob", "bob"]
+  ```
+
+  Args:
+    categorical_column: A `CategoricalColumn` which is created by
+      `categorical_column_with_*` or `crossed_column` functions.
+
+  Returns:
+    An `IndicatorColumn`.
+  """
+  return IndicatorColumn(categorical_column)
+
+
+def weighted_categorical_column(
+    categorical_column, weight_feature_key, dtype=dtypes.float32):
+  """Applies weight values to a `_CategoricalColumn`.
+
+  Use this when each of your sparse inputs has both an ID and a value. For
+  example, if you're representing text documents as a collection of word
+  frequencies, you can provide 2 parallel sparse input features ('terms' and
+  'frequencies' below).
+
+  Example:
+
+  Input `tf.Example` objects:
+
+  ```proto
+  [
+    features {
+      feature {
+        key: "terms"
+        value {bytes_list {value: "very" value: "model"}}
+      }
+      feature {
+        key: "frequencies"
+        value {float_list {value: 0.3 value: 0.1}}
+      }
+    },
+    features {
+      feature {
+        key: "terms"
+        value {bytes_list {value: "when" value: "course" value: "human"}}
+      }
+      feature {
+        key: "frequencies"
+        value {float_list {value: 0.4 value: 0.1 value: 0.2}}
+      }
+    }
+  ]
+  ```
+
+  ```python
+  categorical_column = categorical_column_with_hash_bucket(
+      column_name='terms', hash_bucket_size=1000)
+  weighted_column = weighted_categorical_column(
+      categorical_column=categorical_column, weight_feature_key='frequencies')
+  columns = [weighted_column, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction, _, _ = linear_model(features, columns)
+  ```
+
+  This assumes the input dictionary contains a `SparseTensor` for key
+  'terms', and a `SparseTensor` for key 'frequencies'. These 2 tensors must have
+  the same indices and dense shape.
+
+  Args:
+    categorical_column: A `_CategoricalColumn` created by
+      `categorical_column_with_*` functions.
+    weight_feature_key: String key for weight values.
+    dtype: Type of weights, such as `tf.float32`. Only float and integer weights
+      are supported.
+
+  Returns:
+    A `CategoricalColumn` composed of two sparse features: one represents id,
+    the other represents weight (value) of the id feature in that example.
+
+  Raises:
+    ValueError: if `dtype` is not convertible to float.
+  """
+  if (dtype is None) or not (dtype.is_integer or dtype.is_floating):
+    raise ValueError('dtype {} is not convertible to float.'.format(dtype))
+  return WeightedCategoricalColumn(
+      categorical_column=categorical_column,
+      weight_feature_key=weight_feature_key,
+      dtype=dtype)
+
+
+def crossed_column(keys, hash_bucket_size, hash_key=None):
+  """Returns a column for performing crosses of categorical features.
+
+  Crossed features will be hashed according to `hash_bucket_size`. Conceptually,
+  the transformation can be thought of as:
+    Hash(cartesian product of features) % `hash_bucket_size`
+
+  For example, if the input features are:
+
+  * SparseTensor referred by first key:
+
+    ```python
+    shape = [2, 2]
+    {
+        [0, 0]: "a"
+        [1, 0]: "b"
+        [1, 1]: "c"
+    }
+    ```
+
+  * SparseTensor referred by second key:
+
+    ```python
+    shape = [2, 1]
+    {
+        [0, 0]: "d"
+        [1, 0]: "e"
+    }
+    ```
+
+  then crossed feature will look like:
+
+  ```python
+   shape = [2, 2]
+  {
+      [0, 0]: Hash64("d", Hash64("a")) % hash_bucket_size
+      [1, 0]: Hash64("e", Hash64("b")) % hash_bucket_size
+      [1, 1]: Hash64("e", Hash64("c")) % hash_bucket_size
+  }
+  ```
+
+  Here is an example to create a linear model with crosses of string features:
+
+  ```python
+  keywords_x_doc_terms = crossed_column(['keywords', 'doc_terms'], 50K)
+  columns = [keywords_x_doc_terms, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+  ```
+
+  You could also use vocabulary lookup before crossing:
+
+  ```python
+  keywords = categorical_column_with_vocabulary_file(
+      'keywords', '/path/to/vocabulary/file', vocabulary_size=1K)
+  keywords_x_doc_terms = crossed_column([keywords, 'doc_terms'], 50K)
+  columns = [keywords_x_doc_terms, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+  ```
+
+  If an input feature is of numeric type, you can use
+  `categorical_column_with_identity`, or `bucketized_column`, as in the example:
+
+  ```python
+  # vertical_id is an integer categorical feature.
+  vertical_id = categorical_column_with_identity('vertical_id', 10K)
+  price = numeric_column('price')
+  # bucketized_column converts numerical feature to a categorical one.
+  bucketized_price = bucketized_column(price, boundaries=[...])
+  vertical_id_x_price = crossed_column([vertical_id, bucketized_price], 50K)
+  columns = [vertical_id_x_price, ...]
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  linear_prediction = linear_model(features, columns)
+  ```
+
+  To use crossed column in DNN model, you need to add it in an embedding column
+  as in this example:
+
+  ```python
+  vertical_id_x_price = crossed_column([vertical_id, bucketized_price], 50K)
+  vertical_id_x_price_embedded = embedding_column(vertical_id_x_price, 10)
+  dense_tensor = input_layer(features, [vertical_id_x_price_embedded, ...])
+  ```
+
+  Args:
+    keys: An iterable identifying the features to be crossed. Each element can
+      be either:
+      * string: Will use the corresponding feature which must be of string type.
+      * `CategoricalColumn`: Will use the transformed tensor produced by this
+        column. Does not support hashed categorical column.
+    hash_bucket_size: An int > 1. The number of buckets.
+    hash_key: Specify the hash_key that will be used by the `FingerprintCat64`
+      function to combine the crosses fingerprints on SparseCrossOp (optional).
+
+  Returns:
+    A `CrossedColumn`.
+
+  Raises:
+    ValueError: If `len(keys) < 2`.
+    ValueError: If any of the keys is neither a string nor `CategoricalColumn`.
+    ValueError: If any of the keys is `HashedCategoricalColumn`.
+    ValueError: If `hash_bucket_size < 1`.
+  """
+  if not hash_bucket_size or hash_bucket_size < 1:
+    raise ValueError('hash_bucket_size must be > 1. '
+                     'hash_bucket_size: {}'.format(hash_bucket_size))
+  if not keys or len(keys) < 2:
+    raise ValueError(
+        'keys must be a list with length > 1. Given: {}'.format(keys))
+  for key in keys:
+    if (not isinstance(key, six.string_types) and
+        not isinstance(key, CategoricalColumn)):
+      raise ValueError(
+          'Unsupported key type. All keys must be either string, or '
+          'categorical column except HashedCategoricalColumn. '
+          'Given: {}'.format(key))
+    if isinstance(key, HashedCategoricalColumn):
+      raise ValueError(
+          'categorical_column_with_hash_bucket is not supported for crossing. '
+          'Hashing before crossing will increase probability of collision. '
+          'Instead, use the feature name as a string. Given: {}'.format(key))
+  return CrossedColumn(
+      keys=tuple(keys), hash_bucket_size=hash_bucket_size, hash_key=hash_key)
+
+
+class StateManager(object):
+  """Manages the state associated with FeatureColumns.
+
+  Some `FeatureColumn`s create variables or resources to assist their
+  computation. The `StateManager` is responsible for creating and storing these
+  objects since `FeatureColumn`s are supposed to be stateless configuration
+  only.
+  """
+
+  def get_variable(self,
+                   feature_column,
+                   name,
+                   shape,
+                   dtype=None,
+                   initializer=None):
+    """Creates a new variable or returns an existing one.
+
+    Args:
+      feature_column: A `FeatureColumn` object this variable corresponds to.
+      name: variable name.
+      shape: variable shape.
+      dtype: The type of the variable. Defaults to `self.dtype` or `float32`.
+      initializer: initializer instance (callable).
+
+    Returns:
+      The variable.
+    """
+    raise NotImplementedError('StateManager.get_variable')
+
+  def get_resource(self, feature_column, name, resource_creator):
+    """Creates a new resource or returns an existing one.
+
+    Resources can be things such as tables etc.
+
+    Args:
+      feature_column: A `FeatureColumn` object this variable corresponds to.
+      name: Name of the resource.
+      resource_creator: A callable that can create the resource.
+
+    Returns:
+      The resource.
+    """
+    raise NotImplementedError('StateManager.get_resource')
+
+
+class FeatureColumn(object):
+  """Represents a feature column abstraction.
+
+  WARNING: Do not subclass this layer unless you know what you are doing:
+  the API is subject to future changes.
+
+  To distinguish between the concept of a feature family and a specific binary
+  feature within a family, we refer to a feature family like "country" as a
+  feature column. For example, we can have a feature in a `tf.Example` format:
+    {key: "country",  value: [ "US" ]}
+  In this example the value of feature is "US" and "country" refers to the
+  column of the feature.
+
+  This class is an abstract class. Users should not create instances of this.
+  """
+  __metaclass__ = abc.ABCMeta
+
+  @abc.abstractproperty
+  def name(self):
+    """Returns string. Used for naming."""
+    pass
+
+  @abc.abstractmethod
+  def transform_feature(self, transformation_cache, state_manager):
+    """Returns intermediate representation (usually a `Tensor`).
+
+    Uses `transformation_cache` to create an intermediate representation
+    (usually a `Tensor`) that other feature columns can use.
+
+    Example usage of `transformation_cache`:
+    Let's say a Feature column depends on raw feature ('raw') and another
+    `FeatureColumn` (input_fc). To access corresponding `Tensor`s,
+    transformation_cache will be used as follows:
+
+    ```python
+    raw_tensor = transformation_cache.get('raw', state_manager)
+    fc_tensor = transformation_cache.get(input_fc, state_manager)
+    ```
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+
+    Returns:
+      Transformed feature `Tensor`.
+    """
+    pass
+
+  @abc.abstractproperty
+  def parse_example_spec(self):
+    """Returns a `tf.Example` parsing spec as dict.
+
+    It is used for get_parsing_spec for `tf.parse_example`. Returned spec is a
+    dict from keys ('string') to `VarLenFeature`, `FixedLenFeature`, and other
+    supported objects. Please check documentation of @{tf.parse_example} for all
+    supported spec objects.
+
+    Let's say a Feature column depends on raw feature ('raw') and another
+    `FeatureColumn` (input_fc). One possible implementation of
+    parse_example_spec is as follows:
+
+    ```python
+    spec = {'raw': tf.FixedLenFeature(...)}
+    spec.update(input_fc.parse_example_spec)
+    return spec
+    ```
+    """
+    pass
+
+  def create_state(self, state_manager):
+    """Uses the `state_manager` to create state for the FeatureColumn.
+
+    Args:
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables and variables.
+    """
+    pass
+
+
+class DenseColumn(FeatureColumn):
+  """Represents a column which can be represented as `Tensor`.
+
+  Some examples of this type are: numeric_column, embedding_column,
+  indicator_column.
+  """
+
+  __metaclass__ = abc.ABCMeta
+
+  @abc.abstractproperty
+  def variable_shape(self):
+    """`TensorShape` of `get_dense_tensor`, without batch dimension."""
+    pass
+
+  @abc.abstractmethod
+  def get_dense_tensor(self, transformation_cache, state_manager):
+    """Returns a `Tensor`.
+
+    The output of this function will be used by model-builder-functions. For
+    example the pseudo code of `input_layer` will be like:
+
+    ```python
+    def input_layer(features, feature_columns, ...):
+      outputs = [fc.get_dense_tensor(...) for fc in feature_columns]
+      return tf.concat(outputs)
+    ```
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+
+    Returns:
+      `Tensor` of shape [batch_size] + `variable_shape`.
+    """
+    pass
+
+
+def _create_weighted_sum(column,
+                         transformation_cache,
+                         state_manager,
+                         units,
+                         sparse_combiner,
+                         weight_collections,
+                         trainable,
+                         weight_var=None):
+  """Creates a weighted sum for a dense/categorical column for linear_model."""
+  if isinstance(column, CategoricalColumn):
+    return _create_categorical_column_weighted_sum(
+        column=column,
+        transformation_cache=transformation_cache,
+        state_manager=state_manager,
+        units=units,
+        sparse_combiner=sparse_combiner,
+        weight_collections=weight_collections,
+        trainable=trainable,
+        weight_var=weight_var)
+  else:
+    return _create_dense_column_weighted_sum(
+        column=column,
+        transformation_cache=transformation_cache,
+        state_manager=state_manager,
+        units=units,
+        weight_collections=weight_collections,
+        trainable=trainable,
+        weight_var=weight_var)
+
+
+def _create_dense_column_weighted_sum(column,
+                                      transformation_cache,
+                                      state_manager,
+                                      units,
+                                      weight_collections,
+                                      trainable,
+                                      weight_var=None):
+  """Create a weighted sum of a dense column for linear_model."""
+  tensor = column.get_dense_tensor(transformation_cache, state_manager)
+  num_elements = column.variable_shape.num_elements()
+  batch_size = array_ops.shape(tensor)[0]
+  tensor = array_ops.reshape(tensor, shape=(batch_size, num_elements))
+  if weight_var is not None:
+    weight = weight_var
+  else:
+    weight = variable_scope.get_variable(
+        name='weights',
+        shape=[num_elements, units],
+        initializer=init_ops.zeros_initializer(),
+        trainable=trainable,
+        collections=weight_collections)
+  return math_ops.matmul(tensor, weight, name='weighted_sum')
+
+
+class CategoricalColumn(FeatureColumn):
+  """Represents a categorical feature.
+
+  A categorical feature typically handled with a @{tf.SparseTensor} of IDs.
+  """
+  __metaclass__ = abc.ABCMeta
+
+  IdWeightPair = collections.namedtuple(  # pylint: disable=invalid-name
+      'IdWeightPair', ('id_tensor', 'weight_tensor'))
+
+  @abc.abstractproperty
+  def num_buckets(self):
+    """Returns number of buckets in this sparse feature."""
+    pass
+
+  @abc.abstractmethod
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """Returns an IdWeightPair.
+
+    `IdWeightPair` is a pair of `SparseTensor`s which represents ids and
+    weights.
+
+    `IdWeightPair.id_tensor` is typically a `batch_size` x `num_buckets`
+    `SparseTensor` of `int64`. `IdWeightPair.weight_tensor` is either a
+    `SparseTensor` of `float` or `None` to indicate all weights should be
+    taken to be 1. If specified, `weight_tensor` must have exactly the same
+    shape and indices as `sp_ids`. Expected `SparseTensor` is same as parsing
+    output of a `VarLenFeature` which is a ragged matrix.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+    """
+    pass
+
+
+def _create_categorical_column_weighted_sum(column,
+                                            transformation_cache,
+                                            state_manager,
+                                            units,
+                                            sparse_combiner,
+                                            weight_collections,
+                                            trainable,
+                                            weight_var=None):
+  # pylint: disable=g-doc-return-or-yield,g-doc-args
+  """Create a weighted sum of a categorical column for linear_model.
+
+  Note to maintainer: As implementation details, the weighted sum is
+  implemented via embedding_lookup_sparse toward efficiency. Mathematically,
+  they are the same.
+
+  To be specific, conceptually, categorical column can be treated as multi-hot
+  vector. Say:
+
+  ```python
+    x = [0 0 1]  # categorical column input
+    w = [a b c]  # weights
+  ```
+  The weighted sum is `c` in this case, which is same as `w[2]`.
+
+  Another example is
+
+  ```python
+    x = [0 1 1]  # categorical column input
+    w = [a b c]  # weights
+  ```
+  The weighted sum is `b + c` in this case, which is same as `w[2] + w[3]`.
+
+  For both cases, we can implement weighted sum via embedding_lookup with
+  sparse_combiner = "sum".
+  """
+
+  sparse_tensors = column.get_sparse_tensors(transformation_cache,
+                                             state_manager)
+  id_tensor = sparse_ops.sparse_reshape(sparse_tensors.id_tensor, [
+      array_ops.shape(sparse_tensors.id_tensor)[0], -1
+  ])
+  weight_tensor = sparse_tensors.weight_tensor
+  if weight_tensor is not None:
+    weight_tensor = sparse_ops.sparse_reshape(
+        weight_tensor, [array_ops.shape(weight_tensor)[0], -1])
+
+  if weight_var is not None:
+    weight = weight_var
+  else:
+    weight = variable_scope.get_variable(
+        name='weights',
+        shape=(column.num_buckets, units),
+        initializer=init_ops.zeros_initializer(),
+        trainable=trainable,
+        collections=weight_collections)
+  return _safe_embedding_lookup_sparse(
+      weight,
+      id_tensor,
+      sparse_weights=weight_tensor,
+      combiner=sparse_combiner,
+      name='weighted_sum')
+
+
+class SequenceDenseColumn(FeatureColumn):
+  """Represents dense sequence data."""
+
+  __metaclass__ = abc.ABCMeta
+
+  TensorSequenceLengthPair = collections.namedtuple(  # pylint: disable=invalid-name
+      'TensorSequenceLengthPair', ('dense_tensor', 'sequence_length'))
+
+  @abc.abstractmethod
+  def get_sequence_dense_tensor(self, transformation_cache, state_manager):
+    """Returns a `TensorSequenceLengthPair`.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+    """
+    pass
+
+
+class FeatureTransformationCache(object):
+  """Handles caching of transformations while building the model.
+
+  `FeatureColumn` specifies how to digest an input column to the network. Some
+  feature columns require data transformations. This class caches those
+  transformations.
+
+  Some features may be used in more than one place. For example, one can use a
+  bucketized feature by itself and a cross with it. In that case we
+  should create only one bucketization op instead of creating ops for each
+  feature column separately. To handle re-use of transformed columns,
+  `FeatureTransformationCache` caches all previously transformed columns.
+
+  Example:
+  We're trying to use the following `FeatureColumn`s:
+
+  ```python
+  bucketized_age = fc.bucketized_column(fc.numeric_column("age"), ...)
+  keywords = fc.categorical_column_with_hash_buckets("keywords", ...)
+  age_X_keywords = fc.crossed_column([bucketized_age, "keywords"])
+  ... = linear_model(features,
+                          [bucketized_age, keywords, age_X_keywords]
+  ```
+
+  If we transform each column independently, then we'll get duplication of
+  bucketization (one for cross, one for bucketization itself).
+  The `FeatureTransformationCache` eliminates this duplication.
+  """
+
+  def __init__(self, features):
+    """Creates a `FeatureTransformationCache`.
+
+    Args:
+      features: A mapping from feature column to objects that are `Tensor` or
+        `SparseTensor`, or can be converted to same via
+        `sparse_tensor.convert_to_tensor_or_sparse_tensor`. A `string` key
+        signifies a base feature (not-transformed). A `FeatureColumn` key
+        means that this `Tensor` is the output of an existing `FeatureColumn`
+        which can be reused.
+    """
+    self._features = features.copy()
+    self._feature_tensors = {}
+
+  def get(self, key, state_manager):
+    """Returns a `Tensor` for the given key.
+
+    A `str` key is used to access a base feature (not-transformed). When a
+    `FeatureColumn` is passed, the transformed feature is returned if it
+    already exists, otherwise the given `FeatureColumn` is asked to provide its
+    transformed output, which is then cached.
+
+    Args:
+      key: a `str` or a `FeatureColumn`.
+      state_manager: A StateManager object that holds the FeatureColumn state.
+
+    Returns:
+      The transformed `Tensor` corresponding to the `key`.
+
+    Raises:
+      ValueError: if key is not found or a transformed `Tensor` cannot be
+        computed.
+    """
+    if key in self._feature_tensors:
+      # FeatureColumn is already transformed or converted.
+      return self._feature_tensors[key]
+
+    if key in self._features:
+      feature_tensor = self._get_raw_feature_as_tensor(key)
+      self._feature_tensors[key] = feature_tensor
+      return feature_tensor
+
+    if isinstance(key, six.string_types):
+      raise ValueError('Feature {} is not in features dictionary.'.format(key))
+
+    if not isinstance(key, FeatureColumn):
+      raise TypeError('"key" must be either a "str" or "FeatureColumn". '
+                      'Provided: {}'.format(key))
+
+    column = key
+    logging.debug('Transforming feature_column %s.', column)
+    transformed = column.transform_feature(self, state_manager)
+    if transformed is None:
+      raise ValueError('Column {} is not supported.'.format(column.name))
+    self._feature_tensors[column] = transformed
+    return transformed
+
+  def _get_raw_feature_as_tensor(self, key):
+    """Gets the raw_feature (keyed by `key`) as `tensor`.
+
+    The raw feature is converted to (sparse) tensor and maybe expand dim.
+
+    For both `Tensor` and `SparseTensor`, the rank will be expanded (to 2) if
+    the rank is 1. This supports dynamic rank also. For rank 0 raw feature, will
+    error out as it is not supported.
+
+    Args:
+      key: A `str` key to access the raw feature.
+
+    Returns:
+      A `Tensor` or `SparseTensor`.
+
+    Raises:
+      ValueError: if the raw feature has rank 0.
+    """
+    raw_feature = self._features[key]
+    feature_tensor = sparse_tensor_lib.convert_to_tensor_or_sparse_tensor(
+        raw_feature)
+
+    def expand_dims(input_tensor):
+      # Input_tensor must have rank 1.
+      if isinstance(input_tensor, sparse_tensor_lib.SparseTensor):
+        return sparse_ops.sparse_reshape(
+            input_tensor, [array_ops.shape(input_tensor)[0], -1])
+      else:
+        return array_ops.expand_dims(input_tensor, -1)
+
+    rank = feature_tensor.get_shape().ndims
+    if rank is not None:
+      if rank == 0:
+        raise ValueError(
+            'Feature (key: {}) cannot have rank 0. Give: {}'.format(
+                key, feature_tensor))
+      return feature_tensor if rank != 1 else expand_dims(feature_tensor)
+
+    # Handle dynamic rank.
+    with ops.control_dependencies([
+        check_ops.assert_positive(
+            array_ops.rank(feature_tensor),
+            message='Feature (key: {}) cannot have rank 0. Given: {}'.format(
+                key, feature_tensor))]):
+      return control_flow_ops.cond(
+          math_ops.equal(1, array_ops.rank(feature_tensor)),
+          lambda: expand_dims(feature_tensor),
+          lambda: feature_tensor)
+
+
+# TODO(ptucker): Move to third_party/tensorflow/python/ops/sparse_ops.py
+def _shape_offsets(shape):
+  """Returns moving offset for each dimension given shape."""
+  offsets = []
+  for dim in reversed(shape):
+    if offsets:
+      offsets.append(dim * offsets[-1])
+    else:
+      offsets.append(dim)
+  offsets.reverse()
+  return offsets
+
+
+# TODO(ptucker): Move to third_party/tensorflow/python/ops/sparse_ops.py
+def _to_sparse_input_and_drop_ignore_values(input_tensor, ignore_value=None):
+  """Converts a `Tensor` to a `SparseTensor`, dropping ignore_value cells.
+
+  If `input_tensor` is already a `SparseTensor`, just return it.
+
+  Args:
+    input_tensor: A string or integer `Tensor`.
+    ignore_value: Entries in `dense_tensor` equal to this value will be
+      absent from the resulting `SparseTensor`. If `None`, default value of
+      `dense_tensor`'s dtype will be used ('' for `str`, -1 for `int`).
+
+  Returns:
+    A `SparseTensor` with the same shape as `input_tensor`.
+
+  Raises:
+    ValueError: when `input_tensor`'s rank is `None`.
+  """
+  input_tensor = sparse_tensor_lib.convert_to_tensor_or_sparse_tensor(
+      input_tensor)
+  if isinstance(input_tensor, sparse_tensor_lib.SparseTensor):
+    return input_tensor
+  with ops.name_scope(None, 'to_sparse_input', (input_tensor, ignore_value,)):
+    if ignore_value is None:
+      if input_tensor.dtype == dtypes.string:
+        # Exception due to TF strings are converted to numpy objects by default.
+        ignore_value = ''
+      elif input_tensor.dtype.is_integer:
+        ignore_value = -1  # -1 has a special meaning of missing feature
+      else:
+        # NOTE: `as_numpy_dtype` is a property, so with the parentheses this is
+        # constructing a new numpy object of the given type, which yields the
+        # default value for that type.
+        ignore_value = input_tensor.dtype.as_numpy_dtype()
+    ignore_value = math_ops.cast(
+        ignore_value, input_tensor.dtype, name='ignore_value')
+    indices = array_ops.where(
+        math_ops.not_equal(input_tensor, ignore_value), name='indices')
+    return sparse_tensor_lib.SparseTensor(
+        indices=indices,
+        values=array_ops.gather_nd(input_tensor, indices, name='values'),
+        dense_shape=array_ops.shape(
+            input_tensor, out_type=dtypes.int64, name='dense_shape'))
+
+
+def _normalize_feature_columns(feature_columns):
+  """Normalizes the `feature_columns` input.
+
+  This method converts the `feature_columns` to list type as best as it can. In
+  addition, verifies the type and other parts of feature_columns, required by
+  downstream library.
+
+  Args:
+    feature_columns: The raw feature columns, usually passed by users.
+
+  Returns:
+    The normalized feature column list.
+
+  Raises:
+    ValueError: for any invalid inputs, such as empty, duplicated names, etc.
+  """
+  if isinstance(feature_columns, FeatureColumn):
+    feature_columns = [feature_columns]
+
+  if isinstance(feature_columns, collections.Iterator):
+    feature_columns = list(feature_columns)
+
+  if isinstance(feature_columns, dict):
+    raise ValueError('Expected feature_columns to be iterable, found dict.')
+
+  for column in feature_columns:
+    if not isinstance(column, FeatureColumn):
+      raise ValueError('Items of feature_columns must be a FeatureColumn. '
+                       'Given (type {}): {}.'.format(type(column), column))
+  if not feature_columns:
+    raise ValueError('feature_columns must not be empty.')
+  name_to_column = dict()
+  for column in feature_columns:
+    if column.name in name_to_column:
+      raise ValueError('Duplicate feature column name found for columns: {} '
+                       'and {}. This usually means that these columns refer to '
+                       'same base feature. Either one must be discarded or a '
+                       'duplicated but renamed item must be inserted in '
+                       'features dict.'.format(column,
+                                               name_to_column[column.name]))
+    name_to_column[column.name] = column
+
+  return feature_columns
+
+
+class NumericColumn(
+    DenseColumn,
+    collections.namedtuple(
+        'NumericColumn',
+        ('key', 'shape', 'default_value', 'dtype', 'normalizer_fn'))):
+  """see `numeric_column`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return self.key
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return {
+        self.key:
+            parsing_ops.FixedLenFeature(self.shape, self.dtype,
+                                        self.default_value)
+    }
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """See `FeatureColumn` base class.
+
+    In this case, we apply the `normalizer_fn` to the input tensor.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+
+    Returns:
+      Normalized input tensor.
+    Raises:
+      ValueError: If a SparseTensor is passed in.
+    """
+    input_tensor = transformation_cache.get(self.key, state_manager)
+    if isinstance(input_tensor, sparse_tensor_lib.SparseTensor):
+      raise ValueError(
+          'The corresponding Tensor of numerical column must be a Tensor. '
+          'SparseTensor is not supported. key: {}'.format(self.key))
+    if self.normalizer_fn is not None:
+      input_tensor = self.normalizer_fn(input_tensor)
+    return math_ops.to_float(input_tensor)
+
+  @property
+  def variable_shape(self):
+    """See `DenseColumn` base class."""
+    return tensor_shape.TensorShape(self.shape)
+
+  def get_dense_tensor(self, transformation_cache, state_manager):
+    """Returns dense `Tensor` representing numeric feature.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+
+    Returns:
+      Dense `Tensor` created within `transform_feature`.
+    """
+    # Feature has been already transformed. Return the intermediate
+    # representation created by _transform_feature.
+    return transformation_cache.get(self, state_manager)
+
+
+class BucketizedColumn(DenseColumn, CategoricalColumn,
+                       collections.namedtuple('BucketizedColumn',
+                                              ('source_column', 'boundaries'))):
+  """See `bucketized_column`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return '{}_bucketized'.format(self.source_column.name)
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return self.source_column.parse_example_spec
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Returns bucketized categorical `source_column` tensor."""
+    source_tensor = transformation_cache.get(self.source_column, state_manager)
+    return math_ops._bucketize(  # pylint: disable=protected-access
+        source_tensor,
+        boundaries=self.boundaries)
+
+  @property
+  def variable_shape(self):
+    """See `DenseColumn` base class."""
+    return tensor_shape.TensorShape(
+        tuple(self.source_column.shape) + (len(self.boundaries) + 1,))
+
+  def get_dense_tensor(self, transformation_cache, state_manager):
+    """Returns one hot encoded dense `Tensor`."""
+    input_tensor = transformation_cache.get(self, state_manager)
+    return array_ops.one_hot(
+        indices=math_ops.to_int64(input_tensor),
+        depth=len(self.boundaries) + 1,
+        on_value=1.,
+        off_value=0.)
+
+  @property
+  def num_buckets(self):
+    """See `CategoricalColumn` base class."""
+    # By construction, source_column is always one-dimensional.
+    return (len(self.boundaries) + 1) * self.source_column.shape[0]
+
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """Converts dense inputs to SparseTensor so downstream code can use it."""
+    input_tensor = transformation_cache.get(self, state_manager)
+    batch_size = array_ops.shape(input_tensor)[0]
+    # By construction, source_column is always one-dimensional.
+    source_dimension = self.source_column.shape[0]
+
+    i1 = array_ops.reshape(
+        array_ops.tile(
+            array_ops.expand_dims(math_ops.range(0, batch_size), 1),
+            [1, source_dimension]),
+        (-1,))
+    i2 = array_ops.tile(math_ops.range(0, source_dimension), [batch_size])
+    # Flatten the bucket indices and unique them across dimensions
+    # E.g. 2nd dimension indices will range from k to 2*k-1 with k buckets
+    bucket_indices = (
+        array_ops.reshape(input_tensor, (-1,)) +
+        (len(self.boundaries) + 1) * i2)
+
+    indices = math_ops.to_int64(array_ops.transpose(array_ops.stack((i1, i2))))
+    dense_shape = math_ops.to_int64(array_ops.stack(
+        [batch_size, source_dimension]))
+    sparse_tensor = sparse_tensor_lib.SparseTensor(
+        indices=indices,
+        values=bucket_indices,
+        dense_shape=dense_shape)
+    return CategoricalColumn.IdWeightPair(sparse_tensor, None)
+
+
+class EmbeddingColumn(
+    DenseColumn, SequenceDenseColumn,
+    collections.namedtuple(
+        'EmbeddingColumn',
+        ('categorical_column', 'dimension', 'combiner', 'initializer',
+         'ckpt_to_load_from', 'tensor_name_in_ckpt', 'max_norm', 'trainable'))):
+  """See `embedding_column`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return '{}_embedding'.format(self.categorical_column.name)
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return self.categorical_column.parse_example_spec
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Transforms underlying `categorical_column`."""
+    return transformation_cache.get(self.categorical_column, state_manager)
+
+  @property
+  def variable_shape(self):
+    """See `DenseColumn` base class."""
+    return tensor_shape.vector(self.dimension)
+
+  def _get_dense_tensor_internal(self, transformation_cache, state_manager):
+    """Private method that follows the signature of _get_dense_tensor."""
+    # Get sparse IDs and weights.
+    sparse_tensors = self.categorical_column.get_sparse_tensors(
+        transformation_cache, state_manager)
+    sparse_ids = sparse_tensors.id_tensor
+    sparse_weights = sparse_tensors.weight_tensor
+
+    embedding_shape = (self.categorical_column.num_buckets, self.dimension)
+    embedding_weights = state_manager.get_variable(
+        self,
+        name='embedding_weights',
+        shape=embedding_shape,
+        dtype=dtypes.float32,
+        initializer=self.initializer)
+
+    if self.ckpt_to_load_from is not None:
+      to_restore = embedding_weights
+      if isinstance(to_restore, variables.PartitionedVariable):
+        to_restore = to_restore._get_variable_list()  # pylint: disable=protected-access
+      checkpoint_utils.init_from_checkpoint(self.ckpt_to_load_from, {
+          self.tensor_name_in_ckpt: to_restore
+      })
+
+    # Return embedding lookup result.
+    return _safe_embedding_lookup_sparse(
+        embedding_weights=embedding_weights,
+        sparse_ids=sparse_ids,
+        sparse_weights=sparse_weights,
+        combiner=self.combiner,
+        name='%s_weights' % self.name,
+        max_norm=self.max_norm)
+
+  def get_dense_tensor(self, transformation_cache, state_manager):
+    """Returns tensor after doing the embedding lookup.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+
+    Returns:
+      Embedding lookup tensor.
+
+    Raises:
+      ValueError: `categorical_column` is SequenceCategoricalColumn.
+    """
+    if isinstance(self.categorical_column, SequenceCategoricalColumn):
+      raise ValueError(
+          'In embedding_column: {}. '
+          'categorical_column must not be of type SequenceCategoricalColumn. '
+          'Suggested fix A: If you wish to use input_layer, use a '
+          'non-sequence categorical_column_with_*. '
+          'Suggested fix B: If you wish to create sequence input, use '
+          'sequence_input_layer instead of input_layer. '
+          'Given (type {}): {}'.format(self.name, type(self.categorical_column),
+                                       self.categorical_column))
+    return self._get_dense_tensor_internal(transformation_cache, state_manager)
+
+  def get_sequence_dense_tensor(self, transformation_cache, state_manager):
+    """See `SequenceDenseColumn` base class."""
+    if not isinstance(self.categorical_column, SequenceCategoricalColumn):
+      raise ValueError(
+          'In embedding_column: {}. '
+          'categorical_column must be of type SequenceCategoricalColumn '
+          'to use sequence_input_layer. '
+          'Suggested fix: Use one of sequence_categorical_column_with_*. '
+          'Given (type {}): {}'.format(self.name, type(self.categorical_column),
+                                       self.categorical_column))
+    dense_tensor = self._get_dense_tensor_internal(  # pylint: disable=protected-access
+        transformation_cache, state_manager)
+    sparse_tensors = self.categorical_column.get_sparse_tensors(
+        transformation_cache, state_manager)
+    sequence_length = _sequence_length_from_sparse_tensor(
+        sparse_tensors.id_tensor)
+    return SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+
+def _get_graph_for_variable(var):
+  if isinstance(var, variables.PartitionedVariable):
+    return list(var)[0].graph
+  else:
+    return var.graph
+
+
+class SharedEmbeddingColumn(
+    DenseColumn, SequenceDenseColumn,
+    collections.namedtuple(
+        'SharedEmbeddingColumn',
+        ('categorical_column', 'dimension', 'combiner', 'initializer',
+         'shared_embedding_collection_name', 'ckpt_to_load_from',
+         'tensor_name_in_ckpt', 'max_norm', 'trainable'))):
+  """See `embedding_column`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return '{}_shared_embedding'.format(self.categorical_column.name)
+
+  @property
+  def shared_collection_name(self):
+    """Returns the shared name of this column.
+
+    A group of columns share an embedding. Each one of those columns would have
+    the same `shared_collection_name` by which they could be collectively
+    referred to.
+    """
+    return self.shared_embedding_collection_name
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return self.categorical_column.parse_example_spec
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """See `FeatureColumn` base class."""
+    return transformation_cache.get(self.categorical_column, state_manager)
+
+  @property
+  def variable_shape(self):
+    """See `DenseColumn` base class."""
+    return tensor_shape.vector(self.dimension)
+
+  def _get_dense_tensor_internal(self, transformation_cache, state_manager):
+    """Private method that follows the signature of _get_dense_tensor."""
+    # This method is called from a variable_scope with name _var_scope_name,
+    # which is shared among all shared embeddings. Open a name_scope here, so
+    # that the ops for different columns have distinct names.
+    with ops.name_scope(None, default_name=self.name):
+      # Get sparse IDs and weights.
+      sparse_tensors = self.categorical_column.get_sparse_tensors(
+          transformation_cache, state_manager)
+      sparse_ids = sparse_tensors.id_tensor
+      sparse_weights = sparse_tensors.weight_tensor
+
+      embedding_shape = (self.categorical_column.num_buckets, self.dimension)
+      embedding_weights = state_manager.get_variable(
+          self,
+          name='embedding_weights',
+          shape=embedding_shape,
+          dtype=dtypes.float32,
+          initializer=self.initializer)
+
+      if self.ckpt_to_load_from is not None:
+        to_restore = embedding_weights
+        if isinstance(to_restore, variables.PartitionedVariable):
+          to_restore = to_restore._get_variable_list()  # pylint: disable=protected-access
+        checkpoint_utils.init_from_checkpoint(self.ckpt_to_load_from, {
+            self.tensor_name_in_ckpt: to_restore
+        })
+
+      # Return embedding lookup result.
+      return _safe_embedding_lookup_sparse(
+          embedding_weights=embedding_weights,
+          sparse_ids=sparse_ids,
+          sparse_weights=sparse_weights,
+          combiner=self.combiner,
+          name='%s_weights' % self.name,
+          max_norm=self.max_norm)
+
+  def get_dense_tensor(self, transformation_cache, state_manager):
+    """Returns the embedding lookup result."""
+    if isinstance(self.categorical_column, SequenceCategoricalColumn):
+      raise ValueError(
+          'In embedding_column: {}. '
+          'categorical_column must not be of type SequenceCategoricalColumn. '
+          'Suggested fix A: If you wish to use input_layer, use a '
+          'non-sequence categorical_column_with_*. '
+          'Suggested fix B: If you wish to create sequence input, use '
+          'sequence_input_layer instead of input_layer. '
+          'Given (type {}): {}'.format(self.name, type(self.categorical_column),
+                                       self.categorical_column))
+    return self._get_dense_tensor_internal(transformation_cache, state_manager)
+
+  def get_sequence_dense_tensor(self, transformation_cache, state_manager):
+    """See `SequenceDenseColumn` base class."""
+    if not isinstance(self.categorical_column, SequenceCategoricalColumn):
+      raise ValueError(
+          'In embedding_column: {}. '
+          'categorical_column must be of type SequenceCategoricalColumn '
+          'to use sequence_input_layer. '
+          'Suggested fix: Use one of sequence_categorical_column_with_*. '
+          'Given (type {}): {}'.format(self.name, type(self.categorical_column),
+                                       self.categorical_column))
+    dense_tensor = self.get_dense_tensor_internal(transformation_cache,
+                                                  state_manager)
+    sparse_tensors = self.categorical_column.get_sparse_tensors(
+        transformation_cache, state_manager)
+    sequence_length = _sequence_length_from_sparse_tensor(
+        sparse_tensors.id_tensor)
+    return SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+
+def _create_tuple(shape, value):
+  """Returns a tuple with given shape and filled with value."""
+  if shape:
+    return tuple([_create_tuple(shape[1:], value) for _ in range(shape[0])])
+  return value
+
+
+def _as_tuple(value):
+  if not nest.is_sequence(value):
+    return value
+  return tuple([_as_tuple(v) for v in value])
+
+
+def _check_shape(shape, key):
+  """Returns shape if it's valid, raises error otherwise."""
+  assert shape is not None
+  if not nest.is_sequence(shape):
+    shape = [shape]
+  shape = tuple(shape)
+  for dimension in shape:
+    if not isinstance(dimension, int):
+      raise TypeError('shape dimensions must be integer. '
+                      'shape: {}, key: {}'.format(shape, key))
+    if dimension < 1:
+      raise ValueError('shape dimensions must be greater than 0. '
+                       'shape: {}, key: {}'.format(shape, key))
+  return shape
+
+
+def _is_shape_and_default_value_compatible(default_value, shape):
+  """Verifies compatibility of shape and default_value."""
+  # Invalid condition:
+  #  * if default_value is not a scalar and shape is empty
+  #  * or if default_value is an iterable and shape is not empty
+  if nest.is_sequence(default_value) != bool(shape):
+    return False
+  if not shape:
+    return True
+  if len(default_value) != shape[0]:
+    return False
+  for i in range(shape[0]):
+    if not _is_shape_and_default_value_compatible(default_value[i], shape[1:]):
+      return False
+  return True
+
+
+def _check_default_value(shape, default_value, dtype, key):
+  """Returns default value as tuple if it's valid, otherwise raises errors.
+
+  This function verifies that `default_value` is compatible with both `shape`
+  and `dtype`. If it is not compatible, it raises an error. If it is compatible,
+  it casts default_value to a tuple and returns it. `key` is used only
+  for error message.
+
+  Args:
+    shape: An iterable of integers specifies the shape of the `Tensor`.
+    default_value: If a single value is provided, the same value will be applied
+      as the default value for every item. If an iterable of values is
+      provided, the shape of the `default_value` should be equal to the given
+      `shape`.
+    dtype: defines the type of values. Default value is `tf.float32`. Must be a
+      non-quantized, real integer or floating point type.
+    key: Column name, used only for error messages.
+
+  Returns:
+    A tuple which will be used as default value.
+
+  Raises:
+    TypeError: if `default_value` is an iterable but not compatible with `shape`
+    TypeError: if `default_value` is not compatible with `dtype`.
+    ValueError: if `dtype` is not convertible to `tf.float32`.
+  """
+  if default_value is None:
+    return None
+
+  if isinstance(default_value, int):
+    return _create_tuple(shape, default_value)
+
+  if isinstance(default_value, float) and dtype.is_floating:
+    return _create_tuple(shape, default_value)
+
+  if callable(getattr(default_value, 'tolist', None)):  # Handles numpy arrays
+    default_value = default_value.tolist()
+
+  if nest.is_sequence(default_value):
+    if not _is_shape_and_default_value_compatible(default_value, shape):
+      raise ValueError(
+          'The shape of default_value must be equal to given shape. '
+          'default_value: {}, shape: {}, key: {}'.format(
+              default_value, shape, key))
+    # Check if the values in the list are all integers or are convertible to
+    # floats.
+    is_list_all_int = all(
+        isinstance(v, int) for v in nest.flatten(default_value))
+    is_list_has_float = any(
+        isinstance(v, float) for v in nest.flatten(default_value))
+    if is_list_all_int:
+      return _as_tuple(default_value)
+    if is_list_has_float and dtype.is_floating:
+      return _as_tuple(default_value)
+  raise TypeError('default_value must be compatible with dtype. '
+                  'default_value: {}, dtype: {}, key: {}'.format(
+                      default_value, dtype, key))
+
+
+class HashedCategoricalColumn(
+    CategoricalColumn,
+    collections.namedtuple('HashedCategoricalColumn',
+                           ('key', 'hash_bucket_size', 'dtype'))):
+  """see `categorical_column_with_hash_bucket`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return self.key
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return {self.key: parsing_ops.VarLenFeature(self.dtype)}
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Hashes the values in the feature_column."""
+    input_tensor = _to_sparse_input_and_drop_ignore_values(
+        transformation_cache.get(self.key, state_manager))
+    if not isinstance(input_tensor, sparse_tensor_lib.SparseTensor):
+      raise ValueError('SparseColumn input must be a SparseTensor.')
+
+    _assert_string_or_int(
+        input_tensor.dtype,
+        prefix='column_name: {} input_tensor'.format(self.key))
+
+    if self.dtype.is_integer != input_tensor.dtype.is_integer:
+      raise ValueError(
+          'Column dtype and SparseTensors dtype must be compatible. '
+          'key: {}, column dtype: {}, tensor dtype: {}'.format(
+              self.key, self.dtype, input_tensor.dtype))
+
+    if self.dtype == dtypes.string:
+      sparse_values = input_tensor.values
+    else:
+      sparse_values = string_ops.as_string(input_tensor.values)
+
+    sparse_id_values = string_ops.string_to_hash_bucket_fast(
+        sparse_values, self.hash_bucket_size, name='lookup')
+    return sparse_tensor_lib.SparseTensor(
+        input_tensor.indices, sparse_id_values, input_tensor.dense_shape)
+
+  @property
+  def num_buckets(self):
+    """Returns number of buckets in this sparse feature."""
+    return self.hash_bucket_size
+
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """See `CategoricalColumn` base class."""
+    return CategoricalColumn.IdWeightPair(
+        transformation_cache.get(self, state_manager), None)
+
+
+class VocabularyFileCategoricalColumn(
+    CategoricalColumn,
+    collections.namedtuple('VocabularyFileCategoricalColumn',
+                           ('key', 'vocabulary_file', 'vocabulary_size',
+                            'num_oov_buckets', 'dtype', 'default_value'))):
+  """See `categorical_column_with_vocabulary_file`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return self.key
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return {self.key: parsing_ops.VarLenFeature(self.dtype)}
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Creates a lookup table for the vocabulary."""
+    input_tensor = _to_sparse_input_and_drop_ignore_values(
+        transformation_cache.get(self.key, state_manager))
+
+    if self.dtype.is_integer != input_tensor.dtype.is_integer:
+      raise ValueError(
+          'Column dtype and SparseTensors dtype must be compatible. '
+          'key: {}, column dtype: {}, tensor dtype: {}'.format(
+              self.key, self.dtype, input_tensor.dtype))
+
+    _assert_string_or_int(
+        input_tensor.dtype,
+        prefix='column_name: {} input_tensor'.format(self.key))
+
+    key_dtype = self.dtype
+    if input_tensor.dtype.is_integer:
+      # `index_table_from_file` requires 64-bit integer keys.
+      key_dtype = dtypes.int64
+      input_tensor = math_ops.to_int64(input_tensor)
+
+    # TODO(rohanj): Use state manager to manage the index table creation.
+    return lookup_ops.index_table_from_file(
+        vocabulary_file=self.vocabulary_file,
+        num_oov_buckets=self.num_oov_buckets,
+        vocab_size=self.vocabulary_size,
+        default_value=self.default_value,
+        key_dtype=key_dtype,
+        name='{}_lookup'.format(self.key)).lookup(input_tensor)
+
+  @property
+  def num_buckets(self):
+    """Returns number of buckets in this sparse feature."""
+    return self.vocabulary_size + self.num_oov_buckets
+
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """See `CategoricalColumn` base class."""
+    return CategoricalColumn.IdWeightPair(
+        transformation_cache.get(self, state_manager), None)
+
+
+class VocabularyListCategoricalColumn(
+    CategoricalColumn,
+    collections.namedtuple(
+        'VocabularyListCategoricalColumn',
+        ('key', 'vocabulary_list', 'dtype', 'default_value', 'num_oov_buckets'))
+):
+  """See `categorical_column_with_vocabulary_list`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return self.key
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return {self.key: parsing_ops.VarLenFeature(self.dtype)}
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Creates a lookup table for the vocabulary list."""
+    input_tensor = _to_sparse_input_and_drop_ignore_values(
+        transformation_cache.get(self.key, state_manager))
+
+    if self.dtype.is_integer != input_tensor.dtype.is_integer:
+      raise ValueError(
+          'Column dtype and SparseTensors dtype must be compatible. '
+          'key: {}, column dtype: {}, tensor dtype: {}'.format(
+              self.key, self.dtype, input_tensor.dtype))
+
+    _assert_string_or_int(
+        input_tensor.dtype,
+        prefix='column_name: {} input_tensor'.format(self.key))
+
+    key_dtype = self.dtype
+    if input_tensor.dtype.is_integer:
+      # `index_table_from_tensor` requires 64-bit integer keys.
+      key_dtype = dtypes.int64
+      input_tensor = math_ops.to_int64(input_tensor)
+
+    # TODO(rohanj): Use state manager to manage the index table creation.
+    return lookup_ops.index_table_from_tensor(
+        vocabulary_list=tuple(self.vocabulary_list),
+        default_value=self.default_value,
+        num_oov_buckets=self.num_oov_buckets,
+        dtype=key_dtype,
+        name='{}_lookup'.format(self.key)).lookup(input_tensor)
+
+  @property
+  def num_buckets(self):
+    """Returns number of buckets in this sparse feature."""
+    return len(self.vocabulary_list) + self.num_oov_buckets
+
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """See `CategoricalColumn` base class."""
+    return CategoricalColumn.IdWeightPair(
+        transformation_cache.get(self, state_manager), None)
+
+
+class IdentityCategoricalColumn(
+    CategoricalColumn,
+    collections.namedtuple('IdentityCategoricalColumn',
+                           ('key', 'number_buckets', 'default_value'))):
+
+  """See `categorical_column_with_identity`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return self.key
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return {self.key: parsing_ops.VarLenFeature(dtypes.int64)}
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Returns a SparseTensor with identity values."""
+    input_tensor = _to_sparse_input_and_drop_ignore_values(
+        transformation_cache.get(self.key, state_manager))
+
+    if not input_tensor.dtype.is_integer:
+      raise ValueError(
+          'Invalid input, not integer. key: {} dtype: {}'.format(
+              self.key, input_tensor.dtype))
+
+    values = math_ops.to_int64(input_tensor.values, name='values')
+    num_buckets = math_ops.to_int64(self.num_buckets, name='num_buckets')
+    zero = math_ops.to_int64(0, name='zero')
+    if self.default_value is None:
+      # Fail if values are out-of-range.
+      assert_less = check_ops.assert_less(
+          values, num_buckets, data=(values, num_buckets),
+          name='assert_less_than_num_buckets')
+      assert_greater = check_ops.assert_greater_equal(
+          values, zero, data=(values,),
+          name='assert_greater_or_equal_0')
+      with ops.control_dependencies((assert_less, assert_greater)):
+        values = array_ops.identity(values)
+    else:
+      # Assign default for out-of-range values.
+      values = array_ops.where(
+          math_ops.logical_or(
+              values < zero, values >= num_buckets, name='out_of_range'),
+          array_ops.fill(
+              dims=array_ops.shape(values),
+              value=math_ops.to_int64(self.default_value),
+              name='default_values'),
+          values)
+
+    return sparse_tensor_lib.SparseTensor(
+        indices=input_tensor.indices,
+        values=values,
+        dense_shape=input_tensor.dense_shape)
+
+  @property
+  def num_buckets(self):
+    """Returns number of buckets in this sparse feature."""
+    return self.number_buckets
+
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """See `CategoricalColumn` base class."""
+    return CategoricalColumn.IdWeightPair(
+        transformation_cache.get(self, state_manager), None)
+
+
+class WeightedCategoricalColumn(
+    CategoricalColumn,
+    collections.namedtuple(
+        'WeightedCategoricalColumn',
+        ('categorical_column', 'weight_feature_key', 'dtype'))):
+  """See `weighted_categorical_column`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return '{}_weighted_by_{}'.format(
+        self.categorical_column.name, self.weight_feature_key)
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    config = self.categorical_column.parse_example_spec
+    if self.weight_feature_key in config:
+      raise ValueError('Parse config {} already exists for {}.'.format(
+          config[self.weight_feature_key], self.weight_feature_key))
+    config[self.weight_feature_key] = parsing_ops.VarLenFeature(self.dtype)
+    return config
+
+  @property
+  def num_buckets(self):
+    """See `DenseColumn` base class."""
+    return self.categorical_column.num_buckets
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Applies weights to tensor generated from `categorical_column`'."""
+    weight_tensor = transformation_cache.get(self.weight_feature_key,
+                                             state_manager)
+    if weight_tensor is None:
+      raise ValueError('Missing weights {}.'.format(self.weight_feature_key))
+    weight_tensor = sparse_tensor_lib.convert_to_tensor_or_sparse_tensor(
+        weight_tensor)
+    if self.dtype != weight_tensor.dtype.base_dtype:
+      raise ValueError('Bad dtype, expected {}, but got {}.'.format(
+          self.dtype, weight_tensor.dtype))
+    if not isinstance(weight_tensor, sparse_tensor_lib.SparseTensor):
+      # The weight tensor can be a regular Tensor. In this case, sparsify it.
+      weight_tensor = _to_sparse_input_and_drop_ignore_values(
+          weight_tensor, ignore_value=0.0)
+    if not weight_tensor.dtype.is_floating:
+      weight_tensor = math_ops.to_float(weight_tensor)
+    return (transformation_cache.get(self.categorical_column, state_manager),
+            weight_tensor)
+
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """See `CategoricalColumn` base class."""
+    tensors = transformation_cache.get(self, state_manager)
+    return CategoricalColumn.IdWeightPair(tensors[0], tensors[1])
+
+
+class CrossedColumn(
+    CategoricalColumn,
+    collections.namedtuple('CrossedColumn',
+                           ('keys', 'hash_bucket_size', 'hash_key'))):
+  """See `crossed_column`."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    feature_names = []
+    for key in _collect_leaf_level_keys(self):
+      if isinstance(key, FeatureColumn):
+        feature_names.append(key.name)
+      else:  # key must be a string
+        feature_names.append(key)
+    return '_X_'.join(sorted(feature_names))
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    config = {}
+    for key in self.keys:
+      if isinstance(key, FeatureColumn):
+        config.update(key.parse_example_spec)
+      else:  # key must be a string
+        config.update({key: parsing_ops.VarLenFeature(dtypes.string)})
+    return config
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Generates a hashed sparse cross from the input tensors."""
+    feature_tensors = []
+    for key in _collect_leaf_level_keys(self):
+      if isinstance(key, six.string_types):
+        feature_tensors.append(transformation_cache.get(key, state_manager))
+      elif isinstance(key, CategoricalColumn):
+        ids_and_weights = key.get_sparse_tensors(transformation_cache,
+                                                 state_manager)
+        if ids_and_weights.weight_tensor is not None:
+          raise ValueError(
+              'crossed_column does not support weight_tensor, but the given '
+              'column populates weight_tensor. '
+              'Given column: {}'.format(key.name))
+        feature_tensors.append(ids_and_weights.id_tensor)
+      else:
+        raise ValueError('Unsupported column type. Given: {}'.format(key))
+    return sparse_ops.sparse_cross_hashed(
+        inputs=feature_tensors,
+        num_buckets=self.hash_bucket_size,
+        hash_key=self.hash_key)
+
+  @property
+  def num_buckets(self):
+    """Returns number of buckets in this sparse feature."""
+    return self.hash_bucket_size
+
+  def get_sparse_tensors(self, transformation_cache, state_manager):
+    """See `CategoricalColumn` base class."""
+    return CategoricalColumn.IdWeightPair(
+        transformation_cache.get(self, state_manager), None)
+
+
+def _collect_leaf_level_keys(cross):
+  """Collects base keys by expanding all nested crosses.
+
+  Args:
+    cross: A `CrossedColumn`.
+
+  Returns:
+    A list of strings or `CategoricalColumn` instances.
+  """
+  leaf_level_keys = []
+  for k in cross.keys:
+    if isinstance(k, CrossedColumn):
+      leaf_level_keys.extend(_collect_leaf_level_keys(k))
+    else:
+      leaf_level_keys.append(k)
+  return leaf_level_keys
+
+
+# TODO(zakaria): Move this to embedding_ops and make it public.
+def _safe_embedding_lookup_sparse(embedding_weights,
+                                  sparse_ids,
+                                  sparse_weights=None,
+                                  combiner='mean',
+                                  default_id=None,
+                                  name=None,
+                                  partition_strategy='div',
+                                  max_norm=None):
+  """Lookup embedding results, accounting for invalid IDs and empty features.
+
+  The partitioned embedding in `embedding_weights` must all be the same shape
+  except for the first dimension. The first dimension is allowed to vary as the
+  vocabulary size is not necessarily a multiple of `P`.  `embedding_weights`
+  may be a `PartitionedVariable` as returned by using `tf.get_variable()` with a
+  partitioner.
+
+  Invalid IDs (< 0) are pruned from input IDs and weights, as well as any IDs
+  with non-positive weight. For an entry with no features, the embedding vector
+  for `default_id` is returned, or the 0-vector if `default_id` is not supplied.
+
+  The ids and weights may be multi-dimensional. Embeddings are always aggregated
+  along the last dimension.
+
+  Args:
+    embedding_weights:  A list of `P` float `Tensor`s or values representing
+        partitioned embedding `Tensor`s.  Alternatively, a `PartitionedVariable`
+        created by partitioning along dimension 0.  The total unpartitioned
+        shape should be `[e_0, e_1, ..., e_m]`, where `e_0` represents the
+        vocab size and `e_1, ..., e_m` are the embedding dimensions.
+    sparse_ids: `SparseTensor` of shape `[d_0, d_1, ..., d_n]` containing the
+        ids. `d_0` is typically batch size.
+    sparse_weights: `SparseTensor` of same shape as `sparse_ids`, containing
+        float weights corresponding to `sparse_ids`, or `None` if all weights
+        are be assumed to be 1.0.
+    combiner: A string specifying how to combine embedding results for each
+        entry. Currently "mean", "sqrtn" and "sum" are supported, with "mean"
+        the default.
+    default_id: The id to use for an entry with no features.
+    name: A name for this operation (optional).
+    partition_strategy: A string specifying the partitioning strategy.
+        Currently `"div"` and `"mod"` are supported. Default is `"div"`.
+    max_norm: If not `None`, all embeddings are l2-normalized to max_norm before
+        combining.
+
+
+  Returns:
+    Dense `Tensor` of shape `[d_0, d_1, ..., d_{n-1}, e_1, ..., e_m]`.
+
+  Raises:
+    ValueError: if `embedding_weights` is empty.
+  """
+  if embedding_weights is None:
+    raise ValueError('Missing embedding_weights %s.' % embedding_weights)
+  if isinstance(embedding_weights, variables.PartitionedVariable):
+    embedding_weights = list(embedding_weights)  # get underlying Variables.
+  if not isinstance(embedding_weights, list):
+    embedding_weights = [embedding_weights]
+  if len(embedding_weights) < 1:
+    raise ValueError('Missing embedding_weights %s.' % embedding_weights)
+
+  dtype = sparse_weights.dtype if sparse_weights is not None else None
+  embedding_weights = [
+      ops.convert_to_tensor(w, dtype=dtype) for w in embedding_weights
+  ]
+
+  with ops.name_scope(name, 'embedding_lookup',
+                      embedding_weights + [sparse_ids,
+                                           sparse_weights]) as scope:
+    # Reshape higher-rank sparse ids and weights to linear segment ids.
+    original_shape = sparse_ids.dense_shape
+    original_rank_dim = sparse_ids.dense_shape.get_shape()[0]
+    original_rank = (
+        array_ops.size(original_shape)
+        if original_rank_dim.value is None
+        else original_rank_dim.value)
+    sparse_ids = sparse_ops.sparse_reshape(sparse_ids, [
+        math_ops.reduce_prod(
+            array_ops.slice(original_shape, [0], [original_rank - 1])),
+        array_ops.gather(original_shape, original_rank - 1)])
+    if sparse_weights is not None:
+      sparse_weights = sparse_tensor_lib.SparseTensor(
+          sparse_ids.indices,
+          sparse_weights.values, sparse_ids.dense_shape)
+
+    # Prune invalid ids and weights.
+    sparse_ids, sparse_weights = _prune_invalid_ids(sparse_ids, sparse_weights)
+    if combiner != 'sum':
+      sparse_ids, sparse_weights = _prune_invalid_weights(
+          sparse_ids, sparse_weights)
+
+    # Fill in dummy values for empty features, if necessary.
+    sparse_ids, is_row_empty = sparse_ops.sparse_fill_empty_rows(sparse_ids,
+                                                                 default_id or
+                                                                 0)
+    if sparse_weights is not None:
+      sparse_weights, _ = sparse_ops.sparse_fill_empty_rows(sparse_weights, 1.0)
+
+    result = embedding_ops.embedding_lookup_sparse(
+        embedding_weights,
+        sparse_ids,
+        sparse_weights,
+        combiner=combiner,
+        partition_strategy=partition_strategy,
+        name=None if default_id is None else scope,
+        max_norm=max_norm)
+
+    if default_id is None:
+      # Broadcast is_row_empty to the same shape as embedding_lookup_result,
+      # for use in Select.
+      is_row_empty = array_ops.tile(
+          array_ops.reshape(is_row_empty, [-1, 1]),
+          array_ops.stack([1, array_ops.shape(result)[1]]))
+
+      result = array_ops.where(is_row_empty,
+                               array_ops.zeros_like(result),
+                               result,
+                               name=scope)
+
+    # Reshape back from linear ids back into higher-dimensional dense result.
+    final_result = array_ops.reshape(
+        result,
+        array_ops.concat([
+            array_ops.slice(
+                math_ops.cast(original_shape, dtypes.int32), [0],
+                [original_rank - 1]),
+            array_ops.slice(array_ops.shape(result), [1], [-1])
+        ], 0))
+    final_result.set_shape(tensor_shape.unknown_shape(
+        (original_rank_dim - 1).value).concatenate(result.get_shape()[1:]))
+    return final_result
+
+
+def _prune_invalid_ids(sparse_ids, sparse_weights):
+  """Prune invalid IDs (< 0) from the input ids and weights."""
+  is_id_valid = math_ops.greater_equal(sparse_ids.values, 0)
+  if sparse_weights is not None:
+    is_id_valid = math_ops.logical_and(
+        is_id_valid,
+        array_ops.ones_like(sparse_weights.values, dtype=dtypes.bool))
+  sparse_ids = sparse_ops.sparse_retain(sparse_ids, is_id_valid)
+  if sparse_weights is not None:
+    sparse_weights = sparse_ops.sparse_retain(sparse_weights, is_id_valid)
+  return sparse_ids, sparse_weights
+
+
+def _prune_invalid_weights(sparse_ids, sparse_weights):
+  """Prune invalid weights (< 0) from the input ids and weights."""
+  if sparse_weights is not None:
+    is_weights_valid = math_ops.greater(sparse_weights.values, 0)
+    sparse_ids = sparse_ops.sparse_retain(sparse_ids, is_weights_valid)
+    sparse_weights = sparse_ops.sparse_retain(sparse_weights, is_weights_valid)
+  return sparse_ids, sparse_weights
+
+
+class IndicatorColumn(DenseColumn, SequenceDenseColumn,
+                      collections.namedtuple('IndicatorColumn',
+                                             ('categorical_column'))):
+  """Represents a one-hot column for use in deep networks.
+
+  Args:
+    categorical_column: A `CategoricalColumn` which is created by
+      `categorical_column_with_*` function.
+  """
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return '{}_indicator'.format(self.categorical_column.name)
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """Returns dense `Tensor` representing feature.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+
+    Returns:
+      Transformed feature `Tensor`.
+
+    Raises:
+      ValueError: if input rank is not known at graph building time.
+    """
+    id_weight_pair = self.categorical_column.get_sparse_tensors(
+        transformation_cache, state_manager)
+    id_tensor = id_weight_pair.id_tensor
+    weight_tensor = id_weight_pair.weight_tensor
+
+    # If the underlying column is weighted, return the input as a dense tensor.
+    if weight_tensor is not None:
+      weighted_column = sparse_ops.sparse_merge(
+          sp_ids=id_tensor,
+          sp_values=weight_tensor,
+          vocab_size=int(self.variable_shape[-1]))
+      # Remove (?, -1) index
+      weighted_column = sparse_ops.sparse_slice(weighted_column, [0, 0],
+                                                weighted_column.dense_shape)
+      return sparse_ops.sparse_tensor_to_dense(weighted_column)
+
+    dense_id_tensor = sparse_ops.sparse_tensor_to_dense(
+        id_tensor, default_value=-1)
+
+    # One hot must be float for tf.concat reasons since all other inputs to
+    # input_layer are float32.
+    one_hot_id_tensor = array_ops.one_hot(
+        dense_id_tensor,
+        depth=self.variable_shape[-1],
+        on_value=1.0,
+        off_value=0.0)
+
+    # Reduce to get a multi-hot per example.
+    return math_ops.reduce_sum(one_hot_id_tensor, axis=[-2])
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return self.categorical_column.parse_example_spec
+
+  @property
+  def variable_shape(self):
+    """Returns a `TensorShape` representing the shape of the dense `Tensor`."""
+    return tensor_shape.TensorShape([1, self.categorical_column.num_buckets])
+
+  def get_dense_tensor(self, transformation_cache, state_manager):
+    """Returns dense `Tensor` representing feature.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+
+    Returns:
+      Dense `Tensor` created within `transform_feature`.
+
+    Raises:
+      ValueError: If `categorical_column` is a `SequenceCategoricalColumn`.
+    """
+    if isinstance(self.categorical_column, SequenceCategoricalColumn):
+      raise ValueError(
+          'In indicator_column: {}. '
+          'categorical_column must not be of type SequenceCategoricalColumn. '
+          'Suggested fix A: If you wish to use input_layer, use a '
+          'non-sequence categorical_column_with_*. '
+          'Suggested fix B: If you wish to create sequence input, use '
+          'sequence_input_layer instead of input_layer. '
+          'Given (type {}): {}'.format(self.name, type(self.categorical_column),
+                                       self.categorical_column))
+    # Feature has been already transformed. Return the intermediate
+    # representation created by transform_feature.
+    return transformation_cache.get(self, state_manager)
+
+  def get_sequence_dense_tensor(self, transformation_cache, state_manager):
+    """See `SequenceDenseColumn` base class."""
+    if not isinstance(self.categorical_column, SequenceCategoricalColumn):
+      raise ValueError(
+          'In indicator_column: {}. '
+          'categorical_column must be of type SequenceCategoricalColumn '
+          'to use sequence_input_layer. '
+          'Suggested fix: Use one of sequence_categorical_column_with_*. '
+          'Given (type {}): {}'.format(self.name, type(self.categorical_column),
+                                       self.categorical_column))
+    # Feature has been already transformed. Return the intermediate
+    # representation created by transform_feature.
+    dense_tensor = transformation_cache.get(self, state_manager)
+    sparse_tensors = self.categorical_column.get_sparse_tensors(
+        transformation_cache, state_manager)
+    sequence_length = _sequence_length_from_sparse_tensor(
+        sparse_tensors.id_tensor)
+    return SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+
+def _verify_static_batch_size_equality(tensors, columns):
+  # bath_size is a tf.Dimension object.
+  expected_batch_size = None
+  for i in range(0, len(tensors)):
+    if tensors[i].shape[0].value is not None:
+      if expected_batch_size is None:
+        bath_size_column_index = i
+        expected_batch_size = tensors[i].shape[0]
+      elif not expected_batch_size.is_compatible_with(tensors[i].shape[0]):
+        raise ValueError(
+            'Batch size (first dimension) of each feature must be same. '
+            'Batch size of columns ({}, {}): ({}, {})'.format(
+                columns[bath_size_column_index].name, columns[i].name,
+                expected_batch_size, tensors[i].shape[0]))
+
+
+def _sequence_length_from_sparse_tensor(sp_tensor, num_elements=1):
+  """Returns a [batch_size] Tensor with per-example sequence length."""
+  with ops.name_scope(None, 'sequence_length') as name_scope:
+    row_ids = sp_tensor.indices[:, 0]
+    column_ids = sp_tensor.indices[:, 1]
+    column_ids += array_ops.ones_like(column_ids)
+    seq_length = math_ops.to_int64(
+        math_ops.segment_max(column_ids, segment_ids=row_ids) / num_elements)
+    # If the last n rows do not have ids, seq_length will have shape
+    # [batch_size - n]. Pad the remaining values with zeros.
+    n_pad = array_ops.shape(sp_tensor)[:1] - array_ops.shape(seq_length)[:1]
+    padding = array_ops.zeros(n_pad, dtype=seq_length.dtype)
+    return array_ops.concat([seq_length, padding], axis=0, name=name_scope)
+
+
+class SequenceCategoricalColumn(FeatureColumn,
+                                collections.namedtuple(
+                                    'SequenceCategoricalColumn',
+                                    ('categorical_column'))):
+  """Represents sequences of categorical data."""
+
+  @property
+  def name(self):
+    """See `FeatureColumn` base class."""
+    return self.categorical_column.name
+
+  @property
+  def parse_example_spec(self):
+    """See `FeatureColumn` base class."""
+    return self.categorical_column.parse_example_spec
+
+  def transform_feature(self, transformation_cache, state_manager):
+    """See `FeatureColumn` base class."""
+    return self.categorical_column.transform_feature(transformation_cache,
+                                                     state_manager)
+
+  @property
+  def num_buckets(self):
+    """Returns number of buckets in this sparse feature."""
+    return self.categorical_column.num_buckets
+
+  def get_sequence_sparse_tensors(self, transformation_cache, state_manager):
+    """Returns an IdWeightPair.
+
+    `IdWeightPair` is a pair of `SparseTensor`s which represents ids and
+    weights.
+
+    `IdWeightPair.id_tensor` is typically a `batch_size` x `num_buckets`
+    `SparseTensor` of `int64`. `IdWeightPair.weight_tensor` is either a
+    `SparseTensor` of `float` or `None` to indicate all weights should be
+    taken to be 1. If specified, `weight_tensor` must have exactly the same
+    shape and indices as `sp_ids`. Expected `SparseTensor` is same as parsing
+    output of a `VarLenFeature` which is a ragged matrix.
+
+    Args:
+      transformation_cache: A `FeatureTransformationCache` object to access
+        features.
+      state_manager: A `StateManager` to create / access resources such as
+        lookup tables.
+    """
+    sparse_tensors = self.categorical_column.get_sparse_tensors(
+        transformation_cache, state_manager)
+    id_tensor = sparse_tensors.id_tensor
+    weight_tensor = sparse_tensors.weight_tensor
+    # Expands final dimension, so that embeddings are not combined during
+    # embedding lookup.
+    check_id_rank = check_ops.assert_equal(
+        array_ops.rank(id_tensor), 2,
+        data=[
+            'Column {} expected ID tensor of rank 2. '.format(self.name),
+            'id_tensor shape: ', array_ops.shape(id_tensor)])
+    with ops.control_dependencies([check_id_rank]):
+      id_tensor = sparse_ops.sparse_reshape(
+          id_tensor,
+          shape=array_ops.concat([id_tensor.dense_shape, [1]], axis=0))
+    if weight_tensor is not None:
+      check_weight_rank = check_ops.assert_equal(
+          array_ops.rank(weight_tensor), 2,
+          data=[
+              'Column {} expected weight tensor of rank 2.'.format(self.name),
+              'weight_tensor shape:', array_ops.shape(weight_tensor)])
+      with ops.control_dependencies([check_weight_rank]):
+        weight_tensor = sparse_ops.sparse_reshape(
+            weight_tensor,
+            shape=array_ops.concat([weight_tensor.dense_shape, [1]], axis=0))
+    return CategoricalColumn.IdWeightPair(id_tensor, weight_tensor)
diff --git a/tensorflow/python/feature_column/feature_column_v2_test.py b/tensorflow/python/feature_column/feature_column_v2_test.py
new file mode 100644
index 0000000000..80a9d5d40e
--- /dev/null
+++ b/tensorflow/python/feature_column/feature_column_v2_test.py
@@ -0,0 +1,6583 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for feature_column."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import collections
+import copy
+
+import numpy as np
+
+from tensorflow.core.example import example_pb2
+from tensorflow.core.example import feature_pb2
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.core.protobuf import rewriter_config_pb2
+from tensorflow.python.client import session
+from tensorflow.python.eager import backprop
+from tensorflow.python.eager import context
+from tensorflow.python.estimator.inputs import numpy_io
+from tensorflow.python.feature_column import feature_column as fc_old
+from tensorflow.python.feature_column import feature_column_v2 as fc
+from tensorflow.python.feature_column.feature_column_v2 import FeatureColumn
+from tensorflow.python.feature_column.feature_column_v2 import FeatureTransformationCache
+from tensorflow.python.feature_column.feature_column_v2 import InputLayer
+from tensorflow.python.feature_column.feature_column_v2 import StateManager
+from tensorflow.python.feature_column.feature_column_v2 import _LinearModel
+from tensorflow.python.feature_column.feature_column_v2 import _transform_features
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import lookup_ops
+from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import partitioned_variables
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.ops import variables as variables_lib
+from tensorflow.python.platform import test
+from tensorflow.python.training import coordinator
+from tensorflow.python.training import queue_runner_impl
+
+
+def _initialized_session(config=None):
+  sess = session.Session(config=config)
+  sess.run(variables_lib.global_variables_initializer())
+  sess.run(lookup_ops.tables_initializer())
+  return sess
+
+
+class LazyColumnTest(test.TestCase):
+
+  def test_transformations_called_once(self):
+
+    class TransformCounter(FeatureColumn):
+
+      def __init__(self):
+        self.num_transform = 0
+
+      @property
+      def name(self):
+        return 'TransformCounter'
+
+      def transform_feature(self, transformation_cache, state_manager):
+        self.num_transform += 1  # Count transform calls.
+        return transformation_cache.get('a', state_manager)
+
+      @property
+      def parse_example_spec(self):
+        pass
+
+    transformation_cache = FeatureTransformationCache(
+        features={'a': [[2], [3.]]})
+    column = TransformCounter()
+    self.assertEqual(0, column.num_transform)
+    transformation_cache.get(column, None)
+    self.assertEqual(1, column.num_transform)
+    transformation_cache.get(column, None)
+    self.assertEqual(1, column.num_transform)
+
+  def test_returns_transform_output(self):
+
+    class Transformer(FeatureColumn):
+
+      @property
+      def name(self):
+        return 'Transformer'
+
+      def transform_feature(self, transformation_cache, state_manager):
+        return 'Output'
+
+      @property
+      def parse_example_spec(self):
+        pass
+
+    transformation_cache = FeatureTransformationCache(
+        features={'a': [[2], [3.]]})
+    column = Transformer()
+    self.assertEqual('Output', transformation_cache.get(column, None))
+    self.assertEqual('Output', transformation_cache.get(column, None))
+
+  def test_does_not_pollute_given_features_dict(self):
+
+    class Transformer(FeatureColumn):
+
+      @property
+      def name(self):
+        return 'Transformer'
+
+      def transform_feature(self, transformation_cache, state_manager):
+        return 'Output'
+
+      @property
+      def parse_example_spec(self):
+        pass
+
+    features = {'a': [[2], [3.]]}
+    transformation_cache = FeatureTransformationCache(features=features)
+    transformation_cache.get(Transformer(), None)
+    self.assertEqual(['a'], list(features.keys()))
+
+  def test_error_if_feature_is_not_found(self):
+    transformation_cache = FeatureTransformationCache(
+        features={'a': [[2], [3.]]})
+    with self.assertRaisesRegexp(ValueError,
+                                 'bbb is not in features dictionary'):
+      transformation_cache.get('bbb', None)
+    with self.assertRaisesRegexp(ValueError,
+                                 'bbb is not in features dictionary'):
+      transformation_cache.get(u'bbb', None)
+
+  def test_not_supported_feature_column(self):
+
+    class NotAProperColumn(FeatureColumn):
+
+      @property
+      def name(self):
+        return 'NotAProperColumn'
+
+      def transform_feature(self, transformation_cache, state_manager):
+        # It should return not None.
+        pass
+
+      @property
+      def parse_example_spec(self):
+        pass
+
+    transformation_cache = FeatureTransformationCache(
+        features={'a': [[2], [3.]]})
+    with self.assertRaisesRegexp(ValueError,
+                                 'NotAProperColumn is not supported'):
+      transformation_cache.get(NotAProperColumn(), None)
+
+  def test_key_should_be_string_or_feature_colum(self):
+
+    class NotAFeatureColumn(object):
+      pass
+
+    transformation_cache = FeatureTransformationCache(
+        features={'a': [[2], [3.]]})
+    with self.assertRaisesRegexp(
+        TypeError, '"key" must be either a "str" or "FeatureColumn".'):
+      transformation_cache.get(NotAFeatureColumn(), None)
+
+
+class NumericColumnTest(test.TestCase):
+
+  def test_defaults(self):
+    a = fc.numeric_column('aaa')
+    self.assertEqual('aaa', a.key)
+    self.assertEqual('aaa', a.name)
+    self.assertEqual((1,), a.shape)
+    self.assertIsNone(a.default_value)
+    self.assertEqual(dtypes.float32, a.dtype)
+    self.assertIsNone(a.normalizer_fn)
+
+  def test_key_should_be_string(self):
+    with self.assertRaisesRegexp(ValueError, 'key must be a string.'):
+      fc.numeric_column(key=('aaa',))
+
+  def test_shape_saved_as_tuple(self):
+    a = fc.numeric_column('aaa', shape=[1, 2], default_value=[[3, 2.]])
+    self.assertEqual((1, 2), a.shape)
+
+  def test_default_value_saved_as_tuple(self):
+    a = fc.numeric_column('aaa', default_value=4.)
+    self.assertEqual((4.,), a.default_value)
+    a = fc.numeric_column('aaa', shape=[1, 2], default_value=[[3, 2.]])
+    self.assertEqual(((3., 2.),), a.default_value)
+
+  def test_shape_and_default_value_compatibility(self):
+    fc.numeric_column('aaa', shape=[2], default_value=[1, 2.])
+    with self.assertRaisesRegexp(ValueError, 'The shape of default_value'):
+      fc.numeric_column('aaa', shape=[2], default_value=[1, 2, 3.])
+    fc.numeric_column(
+        'aaa', shape=[3, 2], default_value=[[2, 3], [1, 2], [2, 3.]])
+    with self.assertRaisesRegexp(ValueError, 'The shape of default_value'):
+      fc.numeric_column(
+          'aaa', shape=[3, 1], default_value=[[2, 3], [1, 2], [2, 3.]])
+    with self.assertRaisesRegexp(ValueError, 'The shape of default_value'):
+      fc.numeric_column(
+          'aaa', shape=[3, 3], default_value=[[2, 3], [1, 2], [2, 3.]])
+
+  def test_default_value_type_check(self):
+    fc.numeric_column(
+        'aaa', shape=[2], default_value=[1, 2.], dtype=dtypes.float32)
+    fc.numeric_column(
+        'aaa', shape=[2], default_value=[1, 2], dtype=dtypes.int32)
+    with self.assertRaisesRegexp(TypeError, 'must be compatible with dtype'):
+      fc.numeric_column(
+          'aaa', shape=[2], default_value=[1, 2.], dtype=dtypes.int32)
+    with self.assertRaisesRegexp(TypeError,
+                                 'default_value must be compatible with dtype'):
+      fc.numeric_column('aaa', default_value=['string'])
+
+  def test_shape_must_be_positive_integer(self):
+    with self.assertRaisesRegexp(TypeError, 'shape dimensions must be integer'):
+      fc.numeric_column(
+          'aaa', shape=[
+              1.0,
+          ])
+
+    with self.assertRaisesRegexp(ValueError,
+                                 'shape dimensions must be greater than 0'):
+      fc.numeric_column(
+          'aaa', shape=[
+              0,
+          ])
+
+  def test_dtype_is_convertible_to_float(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 'dtype must be convertible to float'):
+      fc.numeric_column('aaa', dtype=dtypes.string)
+
+  def test_scalar_default_value_fills_the_shape(self):
+    a = fc.numeric_column('aaa', shape=[2, 3], default_value=2.)
+    self.assertEqual(((2., 2., 2.), (2., 2., 2.)), a.default_value)
+
+  def test_parse_spec(self):
+    a = fc.numeric_column('aaa', shape=[2, 3], dtype=dtypes.int32)
+    self.assertEqual({
+        'aaa': parsing_ops.FixedLenFeature((2, 3), dtype=dtypes.int32)
+    }, a.parse_example_spec)
+
+  def test_parse_example_no_default_value(self):
+    price = fc.numeric_column('price', shape=[2])
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'price':
+                feature_pb2.Feature(float_list=feature_pb2.FloatList(
+                    value=[20., 110.]))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([price]))
+    self.assertIn('price', features)
+    with self.test_session():
+      self.assertAllEqual([[20., 110.]], features['price'].eval())
+
+  def test_parse_example_with_default_value(self):
+    price = fc.numeric_column('price', shape=[2], default_value=11.)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'price':
+                feature_pb2.Feature(float_list=feature_pb2.FloatList(
+                    value=[20., 110.]))
+        }))
+    no_data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'something_else':
+                feature_pb2.Feature(float_list=feature_pb2.FloatList(
+                    value=[20., 110.]))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString(),
+                    no_data.SerializeToString()],
+        features=fc.make_parse_example_spec([price]))
+    self.assertIn('price', features)
+    with self.test_session():
+      self.assertAllEqual([[20., 110.], [11., 11.]], features['price'].eval())
+
+  def test_normalizer_fn_must_be_callable(self):
+    with self.assertRaisesRegexp(TypeError, 'must be a callable'):
+      fc.numeric_column('price', normalizer_fn='NotACallable')
+
+  def test_normalizer_fn_transform_feature(self):
+
+    def _increment_two(input_tensor):
+      return input_tensor + 2.
+
+    price = fc.numeric_column('price', shape=[2], normalizer_fn=_increment_two)
+    output = _transform_features({'price': [[1., 2.], [5., 6.]]}, [price], None)
+    with self.test_session():
+      self.assertAllEqual([[3., 4.], [7., 8.]], output[price].eval())
+
+  def test_get_dense_tensor(self):
+
+    def _increment_two(input_tensor):
+      return input_tensor + 2.
+
+    price = fc.numeric_column('price', shape=[2], normalizer_fn=_increment_two)
+    transformation_cache = FeatureTransformationCache({
+        'price': [[1., 2.], [5., 6.]]
+    })
+    self.assertEqual(
+        transformation_cache.get(price, None),
+        price.get_dense_tensor(transformation_cache, None))
+
+  def test_sparse_tensor_not_supported(self):
+    price = fc.numeric_column('price')
+    transformation_cache = FeatureTransformationCache({
+        'price':
+            sparse_tensor.SparseTensor(
+                indices=[[0, 0]], values=[0.3], dense_shape=[1, 1])
+    })
+    with self.assertRaisesRegexp(ValueError, 'must be a Tensor'):
+      price.transform_feature(transformation_cache, None)
+
+  def test_deep_copy(self):
+    a = fc.numeric_column('aaa', shape=[1, 2], default_value=[[3., 2.]])
+    a_copy = copy.deepcopy(a)
+    self.assertEqual(a_copy.name, 'aaa')
+    self.assertEqual(a_copy.shape, (1, 2))
+    self.assertEqual(a_copy.default_value, ((3., 2.),))
+
+  def test_numpy_default_value(self):
+    a = fc.numeric_column(
+        'aaa', shape=[1, 2], default_value=np.array([[3., 2.]]))
+    self.assertEqual(a.default_value, ((3., 2.),))
+
+  def test_linear_model(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      predictions = fc.linear_model(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.]], price_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(price_var.assign([[10.]]))
+        self.assertAllClose([[10.], [50.]], predictions.eval())
+
+  def test_keras_linear_model(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      predictions = get_keras_linear_model_predictions(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.]], price_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(price_var.assign([[10.]]))
+        self.assertAllClose([[10.], [50.]], predictions.eval())
+
+
+class BucketizedColumnTest(test.TestCase):
+
+  def test_invalid_source_column_type(self):
+    a = fc.categorical_column_with_hash_bucket('aaa', hash_bucket_size=10)
+    with self.assertRaisesRegexp(
+        ValueError,
+        'source_column must be a column generated with numeric_column'):
+      fc.bucketized_column(a, boundaries=[0, 1])
+
+  def test_invalid_source_column_shape(self):
+    a = fc.numeric_column('aaa', shape=[2, 3])
+    with self.assertRaisesRegexp(
+        ValueError, 'source_column must be one-dimensional column'):
+      fc.bucketized_column(a, boundaries=[0, 1])
+
+  def test_invalid_boundaries(self):
+    a = fc.numeric_column('aaa')
+    with self.assertRaisesRegexp(
+        ValueError, 'boundaries must be a sorted list'):
+      fc.bucketized_column(a, boundaries=None)
+    with self.assertRaisesRegexp(
+        ValueError, 'boundaries must be a sorted list'):
+      fc.bucketized_column(a, boundaries=1.)
+    with self.assertRaisesRegexp(
+        ValueError, 'boundaries must be a sorted list'):
+      fc.bucketized_column(a, boundaries=[1, 0])
+    with self.assertRaisesRegexp(
+        ValueError, 'boundaries must be a sorted list'):
+      fc.bucketized_column(a, boundaries=[1, 1])
+
+  def test_name(self):
+    a = fc.numeric_column('aaa', dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    self.assertEqual('aaa_bucketized', b.name)
+
+  def test_parse_spec(self):
+    a = fc.numeric_column('aaa', shape=[2], dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    self.assertEqual({
+        'aaa': parsing_ops.FixedLenFeature((2,), dtype=dtypes.int32)
+    }, b.parse_example_spec)
+
+  def test_variable_shape(self):
+    a = fc.numeric_column('aaa', shape=[2], dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    # Column 'aaa` has shape [2] times three buckets -> variable_shape=[2, 3].
+    self.assertAllEqual((2, 3), b.variable_shape)
+
+  def test_num_buckets(self):
+    a = fc.numeric_column('aaa', shape=[2], dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    # Column 'aaa` has shape [2] times three buckets -> num_buckets=6.
+    self.assertEqual(6, b.num_buckets)
+
+  def test_parse_example(self):
+    price = fc.numeric_column('price', shape=[2])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 50])
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'price':
+                feature_pb2.Feature(float_list=feature_pb2.FloatList(
+                    value=[20., 110.]))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([bucketized_price]))
+    self.assertIn('price', features)
+    with self.test_session():
+      self.assertAllEqual([[20., 110.]], features['price'].eval())
+
+  def test_transform_feature(self):
+    price = fc.numeric_column('price', shape=[2])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      transformed_tensor = _transform_features({
+          'price': [[-1., 1.], [5., 6.]]
+      }, [bucketized_price], None)
+      with _initialized_session():
+        self.assertAllEqual([[0, 1], [3, 4]],
+                            transformed_tensor[bucketized_price].eval())
+
+  def test_get_dense_tensor_one_input_value(self):
+    """Tests _get_dense_tensor() for input with shape=[1]."""
+    price = fc.numeric_column('price', shape=[1])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      transformation_cache = FeatureTransformationCache({
+          'price': [[-1.], [1.], [5.], [6.]]
+      })
+      with _initialized_session():
+        bucketized_price_tensor = bucketized_price.get_dense_tensor(
+            transformation_cache, None)
+        self.assertAllClose(
+            # One-hot tensor.
+            [[[1., 0., 0., 0., 0.]],
+             [[0., 1., 0., 0., 0.]],
+             [[0., 0., 0., 1., 0.]],
+             [[0., 0., 0., 0., 1.]]],
+            bucketized_price_tensor.eval())
+
+  def test_get_dense_tensor_two_input_values(self):
+    """Tests _get_dense_tensor() for input with shape=[2]."""
+    price = fc.numeric_column('price', shape=[2])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      transformation_cache = FeatureTransformationCache({
+          'price': [[-1., 1.], [5., 6.]]
+      })
+      with _initialized_session():
+        bucketized_price_tensor = bucketized_price.get_dense_tensor(
+            transformation_cache, None)
+        self.assertAllClose(
+            # One-hot tensor.
+            [[[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.]],
+             [[0., 0., 0., 1., 0.], [0., 0., 0., 0., 1.]]],
+            bucketized_price_tensor.eval())
+
+  def test_get_sparse_tensors_one_input_value(self):
+    """Tests _get_sparse_tensors() for input with shape=[1]."""
+    price = fc.numeric_column('price', shape=[1])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      transformation_cache = FeatureTransformationCache({
+          'price': [[-1.], [1.], [5.], [6.]]
+      })
+      with _initialized_session() as sess:
+        id_weight_pair = bucketized_price.get_sparse_tensors(
+            transformation_cache, None)
+        self.assertIsNone(id_weight_pair.weight_tensor)
+        id_tensor_value = sess.run(id_weight_pair.id_tensor)
+        self.assertAllEqual(
+            [[0, 0], [1, 0], [2, 0], [3, 0]], id_tensor_value.indices)
+        self.assertAllEqual([0, 1, 3, 4], id_tensor_value.values)
+        self.assertAllEqual([4, 1], id_tensor_value.dense_shape)
+
+  def test_get_sparse_tensors_two_input_values(self):
+    """Tests _get_sparse_tensors() for input with shape=[2]."""
+    price = fc.numeric_column('price', shape=[2])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      transformation_cache = FeatureTransformationCache({
+          'price': [[-1., 1.], [5., 6.]]
+      })
+      with _initialized_session() as sess:
+        id_weight_pair = bucketized_price.get_sparse_tensors(
+            transformation_cache, None)
+        self.assertIsNone(id_weight_pair.weight_tensor)
+        id_tensor_value = sess.run(id_weight_pair.id_tensor)
+        self.assertAllEqual(
+            [[0, 0], [0, 1], [1, 0], [1, 1]], id_tensor_value.indices)
+        # Values 0-4 correspond to the first column of the input price.
+        # Values 5-9 correspond to the second column of the input price.
+        self.assertAllEqual([0, 6, 3, 9], id_tensor_value.values)
+        self.assertAllEqual([2, 2], id_tensor_value.dense_shape)
+
+  def test_sparse_tensor_input_not_supported(self):
+    price = fc.numeric_column('price')
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 1])
+    transformation_cache = FeatureTransformationCache({
+        'price':
+            sparse_tensor.SparseTensor(
+                indices=[[0, 0]], values=[0.3], dense_shape=[1, 1])
+    })
+    with self.assertRaisesRegexp(ValueError, 'must be a Tensor'):
+      bucketized_price.transform_feature(transformation_cache, None)
+
+  def test_deep_copy(self):
+    a = fc.numeric_column('aaa', shape=[2])
+    a_bucketized = fc.bucketized_column(a, boundaries=[0, 1])
+    a_bucketized_copy = copy.deepcopy(a_bucketized)
+    self.assertEqual(a_bucketized_copy.name, 'aaa_bucketized')
+    self.assertAllEqual(a_bucketized_copy.variable_shape, (2, 3))
+    self.assertEqual(a_bucketized_copy.boundaries, (0, 1))
+
+  def test_linear_model_one_input_value(self):
+    """Tests linear_model() for input with shape=[1]."""
+    price = fc_old.numeric_column('price', shape=[1])
+    bucketized_price = fc_old.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      features = {'price': [[-1.], [1.], [5.], [6.]]}
+      predictions = fc.linear_model(features, [bucketized_price])
+      bias = get_linear_model_bias()
+      bucketized_price_var = get_linear_model_column_var(bucketized_price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        # One weight variable per bucket, all initialized to zero.
+        self.assertAllClose(
+            [[0.], [0.], [0.], [0.], [0.]], bucketized_price_var.eval())
+        self.assertAllClose([[0.], [0.], [0.], [0.]], predictions.eval())
+        sess.run(bucketized_price_var.assign(
+            [[10.], [20.], [30.], [40.], [50.]]))
+        # price -1. is in the 0th bucket, whose weight is 10.
+        # price 1. is in the 1st bucket, whose weight is 20.
+        # price 5. is in the 3rd bucket, whose weight is 40.
+        # price 6. is in the 4th bucket, whose weight is 50.
+        self.assertAllClose([[10.], [20.], [40.], [50.]], predictions.eval())
+        sess.run(bias.assign([1.]))
+        self.assertAllClose([[11.], [21.], [41.], [51.]], predictions.eval())
+
+  def test_linear_model_two_input_values(self):
+    """Tests linear_model() for input with shape=[2]."""
+    price = fc_old.numeric_column('price', shape=[2])
+    bucketized_price = fc_old.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      features = {'price': [[-1., 1.], [5., 6.]]}
+      predictions = fc.linear_model(features, [bucketized_price])
+      bias = get_linear_model_bias()
+      bucketized_price_var = get_linear_model_column_var(bucketized_price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        # One weight per bucket per input column, all initialized to zero.
+        self.assertAllClose(
+            [[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]],
+            bucketized_price_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(bucketized_price_var.assign(
+            [[10.], [20.], [30.], [40.], [50.],
+             [60.], [70.], [80.], [90.], [100.]]))
+        # 1st example:
+        #   price -1. is in the 0th bucket, whose weight is 10.
+        #   price 1. is in the 6th bucket, whose weight is 70.
+        # 2nd example:
+        #   price 5. is in the 3rd bucket, whose weight is 40.
+        #   price 6. is in the 9th bucket, whose weight is 100.
+        self.assertAllClose([[80.], [140.]], predictions.eval())
+        sess.run(bias.assign([1.]))
+        self.assertAllClose([[81.], [141.]], predictions.eval())
+
+  def test_keras_linear_model_one_input_value(self):
+    """Tests _LinearModel for input with shape=[1]."""
+    price = fc_old.numeric_column('price', shape=[1])
+    bucketized_price = fc_old.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      features = {'price': [[-1.], [1.], [5.], [6.]]}
+      predictions = get_keras_linear_model_predictions(features,
+                                                       [bucketized_price])
+      bias = get_linear_model_bias()
+      bucketized_price_var = get_linear_model_column_var(bucketized_price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        # One weight variable per bucket, all initialized to zero.
+        self.assertAllClose([[0.], [0.], [0.], [0.], [0.]],
+                            bucketized_price_var.eval())
+        self.assertAllClose([[0.], [0.], [0.], [0.]], predictions.eval())
+        sess.run(
+            bucketized_price_var.assign([[10.], [20.], [30.], [40.], [50.]]))
+        # price -1. is in the 0th bucket, whose weight is 10.
+        # price 1. is in the 1st bucket, whose weight is 20.
+        # price 5. is in the 3rd bucket, whose weight is 40.
+        # price 6. is in the 4th bucket, whose weight is 50.
+        self.assertAllClose([[10.], [20.], [40.], [50.]], predictions.eval())
+        sess.run(bias.assign([1.]))
+        self.assertAllClose([[11.], [21.], [41.], [51.]], predictions.eval())
+
+  def test_keras_linear_model_two_input_values(self):
+    """Tests _LinearModel for input with shape=[2]."""
+    price = fc_old.numeric_column('price', shape=[2])
+    bucketized_price = fc_old.bucketized_column(price, boundaries=[0, 2, 4, 6])
+    with ops.Graph().as_default():
+      features = {'price': [[-1., 1.], [5., 6.]]}
+      predictions = get_keras_linear_model_predictions(features,
+                                                       [bucketized_price])
+      bias = get_linear_model_bias()
+      bucketized_price_var = get_linear_model_column_var(bucketized_price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        # One weight per bucket per input column, all initialized to zero.
+        self.assertAllClose(
+            [[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]],
+            bucketized_price_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(
+            bucketized_price_var.assign([[10.], [20.], [30.], [40.], [50.],
+                                         [60.], [70.], [80.], [90.], [100.]]))
+        # 1st example:
+        #   price -1. is in the 0th bucket, whose weight is 10.
+        #   price 1. is in the 6th bucket, whose weight is 70.
+        # 2nd example:
+        #   price 5. is in the 3rd bucket, whose weight is 40.
+        #   price 6. is in the 9th bucket, whose weight is 100.
+        self.assertAllClose([[80.], [140.]], predictions.eval())
+        sess.run(bias.assign([1.]))
+        self.assertAllClose([[81.], [141.]], predictions.eval())
+
+
+class HashedCategoricalColumnTest(test.TestCase):
+
+  def test_defaults(self):
+    a = fc.categorical_column_with_hash_bucket('aaa', 10)
+    self.assertEqual('aaa', a.name)
+    self.assertEqual('aaa', a.key)
+    self.assertEqual(10, a.hash_bucket_size)
+    self.assertEqual(dtypes.string, a.dtype)
+
+  def test_key_should_be_string(self):
+    with self.assertRaisesRegexp(ValueError, 'key must be a string.'):
+      fc.categorical_column_with_hash_bucket(('key',), 10)
+
+  def test_bucket_size_should_be_given(self):
+    with self.assertRaisesRegexp(ValueError, 'hash_bucket_size must be set.'):
+      fc.categorical_column_with_hash_bucket('aaa', None)
+
+  def test_bucket_size_should_be_positive(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 'hash_bucket_size must be at least 1'):
+      fc.categorical_column_with_hash_bucket('aaa', 0)
+
+  def test_dtype_should_be_string_or_integer(self):
+    fc.categorical_column_with_hash_bucket('aaa', 10, dtype=dtypes.string)
+    fc.categorical_column_with_hash_bucket('aaa', 10, dtype=dtypes.int32)
+    with self.assertRaisesRegexp(ValueError, 'dtype must be string or integer'):
+      fc.categorical_column_with_hash_bucket('aaa', 10, dtype=dtypes.float32)
+
+  def test_deep_copy(self):
+    original = fc.categorical_column_with_hash_bucket('aaa', 10)
+    for column in (original, copy.deepcopy(original)):
+      self.assertEqual('aaa', column.name)
+      self.assertEqual(10, column.hash_bucket_size)
+      self.assertEqual(10, column.num_buckets)
+      self.assertEqual(dtypes.string, column.dtype)
+
+  def test_parse_spec_string(self):
+    a = fc.categorical_column_with_hash_bucket('aaa', 10)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.string)
+    }, a.parse_example_spec)
+
+  def test_parse_spec_int(self):
+    a = fc.categorical_column_with_hash_bucket('aaa', 10, dtype=dtypes.int32)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int32)
+    }, a.parse_example_spec)
+
+  def test_parse_example(self):
+    a = fc.categorical_column_with_hash_bucket('aaa', 10)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer']))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a]))
+    self.assertIn('aaa', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'omar', b'stringer'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+
+  def test_strings_should_be_hashed(self):
+    hashed_sparse = fc.categorical_column_with_hash_bucket('wire', 10)
+    wire_tensor = sparse_tensor.SparseTensor(
+        values=['omar', 'stringer', 'marlo'],
+        indices=[[0, 0], [1, 0], [1, 1]],
+        dense_shape=[2, 2])
+    outputs = _transform_features({'wire': wire_tensor}, [hashed_sparse], None)
+    output = outputs[hashed_sparse]
+    # Check exact hashed output. If hashing changes this test will break.
+    expected_values = [6, 4, 1]
+    with self.test_session():
+      self.assertEqual(dtypes.int64, output.values.dtype)
+      self.assertAllEqual(expected_values, output.values.eval())
+      self.assertAllEqual(wire_tensor.indices.eval(), output.indices.eval())
+      self.assertAllEqual(wire_tensor.dense_shape.eval(),
+                          output.dense_shape.eval())
+
+  def test_tensor_dtype_should_be_string_or_integer(self):
+    string_fc = fc.categorical_column_with_hash_bucket(
+        'a_string', 10, dtype=dtypes.string)
+    int_fc = fc.categorical_column_with_hash_bucket(
+        'a_int', 10, dtype=dtypes.int32)
+    float_fc = fc.categorical_column_with_hash_bucket(
+        'a_float', 10, dtype=dtypes.string)
+    int_tensor = sparse_tensor.SparseTensor(
+        values=[101],
+        indices=[[0, 0]],
+        dense_shape=[1, 1])
+    string_tensor = sparse_tensor.SparseTensor(
+        values=['101'],
+        indices=[[0, 0]],
+        dense_shape=[1, 1])
+    float_tensor = sparse_tensor.SparseTensor(
+        values=[101.],
+        indices=[[0, 0]],
+        dense_shape=[1, 1])
+    transformation_cache = FeatureTransformationCache({
+        'a_int': int_tensor,
+        'a_string': string_tensor,
+        'a_float': float_tensor
+    })
+    transformation_cache.get(string_fc, None)
+    transformation_cache.get(int_fc, None)
+    with self.assertRaisesRegexp(ValueError, 'dtype must be string or integer'):
+      transformation_cache.get(float_fc, None)
+
+  def test_dtype_should_match_with_tensor(self):
+    hashed_sparse = fc.categorical_column_with_hash_bucket(
+        'wire', 10, dtype=dtypes.int64)
+    wire_tensor = sparse_tensor.SparseTensor(
+        values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+    transformation_cache = FeatureTransformationCache({'wire': wire_tensor})
+    with self.assertRaisesRegexp(ValueError, 'dtype must be compatible'):
+      transformation_cache.get(hashed_sparse, None)
+
+  def test_ints_should_be_hashed(self):
+    hashed_sparse = fc.categorical_column_with_hash_bucket(
+        'wire', 10, dtype=dtypes.int64)
+    wire_tensor = sparse_tensor.SparseTensor(
+        values=[101, 201, 301],
+        indices=[[0, 0], [1, 0], [1, 1]],
+        dense_shape=[2, 2])
+    transformation_cache = FeatureTransformationCache({'wire': wire_tensor})
+    output = transformation_cache.get(hashed_sparse, None)
+    # Check exact hashed output. If hashing changes this test will break.
+    expected_values = [3, 7, 5]
+    with self.test_session():
+      self.assertAllEqual(expected_values, output.values.eval())
+
+  def test_int32_64_is_compatible(self):
+    hashed_sparse = fc.categorical_column_with_hash_bucket(
+        'wire', 10, dtype=dtypes.int64)
+    wire_tensor = sparse_tensor.SparseTensor(
+        values=constant_op.constant([101, 201, 301], dtype=dtypes.int32),
+        indices=[[0, 0], [1, 0], [1, 1]],
+        dense_shape=[2, 2])
+    transformation_cache = FeatureTransformationCache({'wire': wire_tensor})
+    output = transformation_cache.get(hashed_sparse, None)
+    # Check exact hashed output. If hashing changes this test will break.
+    expected_values = [3, 7, 5]
+    with self.test_session():
+      self.assertAllEqual(expected_values, output.values.eval())
+
+  def test_get_sparse_tensors(self):
+    hashed_sparse = fc.categorical_column_with_hash_bucket('wire', 10)
+    transformation_cache = FeatureTransformationCache({
+        'wire':
+            sparse_tensor.SparseTensor(
+                values=['omar', 'stringer', 'marlo'],
+                indices=[[0, 0], [1, 0], [1, 1]],
+                dense_shape=[2, 2])
+    })
+    id_weight_pair = hashed_sparse.get_sparse_tensors(transformation_cache,
+                                                      None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    self.assertEqual(
+        transformation_cache.get(hashed_sparse, None), id_weight_pair.id_tensor)
+
+  def DISABLED_test_get_sparse_tensors_weight_collections(self):
+    column = fc.categorical_column_with_hash_bucket('aaa', 10)
+    inputs = sparse_tensor.SparseTensor(
+        values=['omar', 'stringer', 'marlo'],
+        indices=[[0, 0], [1, 0], [1, 1]],
+        dense_shape=[2, 2])
+    column._get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }),
+        weight_collections=('my_weights',))
+
+    self.assertItemsEqual(
+        [], ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES))
+    self.assertItemsEqual([], ops.get_collection('my_weights'))
+
+  def test_get_sparse_tensors_dense_input(self):
+    hashed_sparse = fc.categorical_column_with_hash_bucket('wire', 10)
+    transformation_cache = FeatureTransformationCache({
+        'wire': (('omar', ''), ('stringer', 'marlo'))
+    })
+    id_weight_pair = hashed_sparse.get_sparse_tensors(transformation_cache,
+                                                      None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    self.assertEqual(
+        transformation_cache.get(hashed_sparse, None), id_weight_pair.id_tensor)
+
+  def test_linear_model(self):
+    wire_column = fc_old.categorical_column_with_hash_bucket('wire', 4)
+    self.assertEqual(4, wire_column._num_buckets)
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          wire_column.name: sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=('marlo', 'skywalker', 'omar'),
+              dense_shape=(2, 2))
+      }, (wire_column,))
+      bias = get_linear_model_bias()
+      wire_var = get_linear_model_column_var(wire_column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,), (0.,)), wire_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        wire_var.assign(((1.,), (2.,), (3.,), (4.,))).eval()
+        # 'marlo' -> 3: wire_var[3] = 4
+        # 'skywalker' -> 2, 'omar' -> 2: wire_var[2] + wire_var[2] = 3+3 = 6
+        self.assertAllClose(((4.,), (6.,)), predictions.eval())
+
+  def test_keras_linear_model(self):
+    wire_column = fc_old.categorical_column_with_hash_bucket('wire', 4)
+    self.assertEqual(4, wire_column._num_buckets)
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          wire_column.name:
+              sparse_tensor.SparseTensorValue(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=('marlo', 'skywalker', 'omar'),
+                  dense_shape=(2, 2))
+      }, (wire_column,))
+      bias = get_linear_model_bias()
+      wire_var = get_linear_model_column_var(wire_column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,), (0.,)), wire_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        wire_var.assign(((1.,), (2.,), (3.,), (4.,))).eval()
+        # 'marlo' -> 3: wire_var[3] = 4
+        # 'skywalker' -> 2, 'omar' -> 2: wire_var[2] + wire_var[2] = 3+3 = 6
+        self.assertAllClose(((4.,), (6.,)), predictions.eval())
+
+
+class CrossedColumnTest(test.TestCase):
+
+  def test_keys_empty(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'keys must be a list with length > 1'):
+      fc.crossed_column([], 10)
+
+  def test_keys_length_one(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'keys must be a list with length > 1'):
+      fc.crossed_column(['a'], 10)
+
+  def test_key_type_unsupported(self):
+    with self.assertRaisesRegexp(ValueError, 'Unsupported key type'):
+      fc.crossed_column(['a', fc.numeric_column('c')], 10)
+
+    with self.assertRaisesRegexp(
+        ValueError, 'categorical_column_with_hash_bucket is not supported'):
+      fc.crossed_column(
+          ['a', fc.categorical_column_with_hash_bucket('c', 10)], 10)
+
+  def test_hash_bucket_size_negative(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'hash_bucket_size must be > 1'):
+      fc.crossed_column(['a', 'c'], -1)
+
+  def test_hash_bucket_size_zero(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'hash_bucket_size must be > 1'):
+      fc.crossed_column(['a', 'c'], 0)
+
+  def test_hash_bucket_size_none(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'hash_bucket_size must be > 1'):
+      fc.crossed_column(['a', 'c'], None)
+
+  def test_name(self):
+    a = fc.numeric_column('a', dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    crossed1 = fc.crossed_column(['d1', 'd2'], 10)
+
+    crossed2 = fc.crossed_column([b, 'c', crossed1], 10)
+    self.assertEqual('a_bucketized_X_c_X_d1_X_d2', crossed2.name)
+
+  def test_name_ordered_alphabetically(self):
+    """Tests that the name does not depend on the order of given columns."""
+    a = fc.numeric_column('a', dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    crossed1 = fc.crossed_column(['d1', 'd2'], 10)
+
+    crossed2 = fc.crossed_column([crossed1, 'c', b], 10)
+    self.assertEqual('a_bucketized_X_c_X_d1_X_d2', crossed2.name)
+
+  def test_name_leaf_keys_ordered_alphabetically(self):
+    """Tests that the name does not depend on the order of given columns."""
+    a = fc.numeric_column('a', dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    crossed1 = fc.crossed_column(['d2', 'c'], 10)
+
+    crossed2 = fc.crossed_column([crossed1, 'd1', b], 10)
+    self.assertEqual('a_bucketized_X_c_X_d1_X_d2', crossed2.name)
+
+  def test_parse_spec(self):
+    a = fc.numeric_column('a', shape=[2], dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    crossed = fc.crossed_column([b, 'c'], 10)
+    self.assertEqual({
+        'a': parsing_ops.FixedLenFeature((2,), dtype=dtypes.int32),
+        'c': parsing_ops.VarLenFeature(dtypes.string),
+    }, crossed.parse_example_spec)
+
+  def test_num_buckets(self):
+    a = fc.numeric_column('a', shape=[2], dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    crossed = fc.crossed_column([b, 'c'], 15)
+    self.assertEqual(15, crossed.num_buckets)
+
+  def test_deep_copy(self):
+    a = fc.numeric_column('a', dtype=dtypes.int32)
+    b = fc.bucketized_column(a, boundaries=[0, 1])
+    crossed1 = fc.crossed_column(['d1', 'd2'], 10)
+    crossed2 = fc.crossed_column([b, 'c', crossed1], 15, hash_key=5)
+    crossed2_copy = copy.deepcopy(crossed2)
+    self.assertEqual('a_bucketized_X_c_X_d1_X_d2', crossed2_copy.name,)
+    self.assertEqual(15, crossed2_copy.hash_bucket_size)
+    self.assertEqual(5, crossed2_copy.hash_key)
+
+  def test_parse_example(self):
+    price = fc.numeric_column('price', shape=[2])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 50])
+    price_cross_wire = fc.crossed_column([bucketized_price, 'wire'], 10)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'price':
+                feature_pb2.Feature(float_list=feature_pb2.FloatList(
+                    value=[20., 110.])),
+            'wire':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer'])),
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([price_cross_wire]))
+    self.assertIn('price', features)
+    self.assertIn('wire', features)
+    with self.test_session():
+      self.assertAllEqual([[20., 110.]], features['price'].eval())
+      wire_sparse = features['wire']
+      self.assertAllEqual([[0, 0], [0, 1]], wire_sparse.indices.eval())
+      # Use byte constants to pass the open-source test.
+      self.assertAllEqual([b'omar', b'stringer'], wire_sparse.values.eval())
+      self.assertAllEqual([1, 2], wire_sparse.dense_shape.eval())
+
+  def test_transform_feature(self):
+    price = fc.numeric_column('price', shape=[2])
+    bucketized_price = fc.bucketized_column(price, boundaries=[0, 50])
+    hash_bucket_size = 10
+    price_cross_wire = fc.crossed_column(
+        [bucketized_price, 'wire'], hash_bucket_size)
+    features = {
+        'price': constant_op.constant([[1., 2.], [5., 6.]]),
+        'wire': sparse_tensor.SparseTensor(
+            values=['omar', 'stringer', 'marlo'],
+            indices=[[0, 0], [1, 0], [1, 1]],
+            dense_shape=[2, 2]),
+    }
+    outputs = _transform_features(features, [price_cross_wire], None)
+    output = outputs[price_cross_wire]
+    with self.test_session() as sess:
+      output_val = sess.run(output)
+      self.assertAllEqual(
+          [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [1, 3]], output_val.indices)
+      for val in output_val.values:
+        self.assertIn(val, list(range(hash_bucket_size)))
+      self.assertAllEqual([2, 4], output_val.dense_shape)
+
+  def test_get_sparse_tensors(self):
+    a = fc.numeric_column('a', dtype=dtypes.int32, shape=(2,))
+    b = fc.bucketized_column(a, boundaries=(0, 1))
+    crossed1 = fc.crossed_column(['d1', 'd2'], 10)
+    crossed2 = fc.crossed_column([b, 'c', crossed1], 15, hash_key=5)
+    with ops.Graph().as_default():
+      transformation_cache = FeatureTransformationCache({
+          'a':
+              constant_op.constant(((-1., .5), (.5, 1.))),
+          'c':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=['cA', 'cB', 'cC'],
+                  dense_shape=(2, 2)),
+          'd1':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=['d1A', 'd1B', 'd1C'],
+                  dense_shape=(2, 2)),
+          'd2':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=['d2A', 'd2B', 'd2C'],
+                  dense_shape=(2, 2)),
+      })
+      id_weight_pair = crossed2.get_sparse_tensors(transformation_cache, None)
+      with _initialized_session():
+        id_tensor_eval = id_weight_pair.id_tensor.eval()
+        self.assertAllEqual(
+            ((0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
+             (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11), (1, 12), (1, 13),
+             (1, 14), (1, 15)),
+            id_tensor_eval.indices)
+        # Check exact hashed output. If hashing changes this test will break.
+        # All values are within [0, hash_bucket_size).
+        expected_values = (
+            6, 14, 0, 13, 8, 8, 10, 12, 2, 0, 1, 9, 8, 12, 2, 0, 10, 11)
+        self.assertAllEqual(expected_values, id_tensor_eval.values)
+        self.assertAllEqual((2, 16), id_tensor_eval.dense_shape)
+
+  def test_get_sparse_tensors_simple(self):
+    """Same as test_get_sparse_tensors, but with simpler values."""
+    a = fc.numeric_column('a', dtype=dtypes.int32, shape=(2,))
+    b = fc.bucketized_column(a, boundaries=(0, 1))
+    crossed = fc.crossed_column([b, 'c'], hash_bucket_size=5, hash_key=5)
+    with ops.Graph().as_default():
+      transformation_cache = FeatureTransformationCache({
+          'a':
+              constant_op.constant(((-1., .5), (.5, 1.))),
+          'c':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=['cA', 'cB', 'cC'],
+                  dense_shape=(2, 2)),
+      })
+      id_weight_pair = crossed.get_sparse_tensors(transformation_cache, None)
+      with _initialized_session():
+        id_tensor_eval = id_weight_pair.id_tensor.eval()
+        self.assertAllEqual(
+            ((0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (1, 3)),
+            id_tensor_eval.indices)
+        # Check exact hashed output. If hashing changes this test will break.
+        # All values are within [0, hash_bucket_size).
+        expected_values = (1, 0, 1, 3, 4, 2)
+        self.assertAllEqual(expected_values, id_tensor_eval.values)
+        self.assertAllEqual((2, 4), id_tensor_eval.dense_shape)
+
+  def test_linear_model(self):
+    """Tests linear_model.
+
+    Uses data from test_get_sparse_tesnsors_simple.
+    """
+    a = fc_old.numeric_column('a', dtype=dtypes.int32, shape=(2,))
+    b = fc_old.bucketized_column(a, boundaries=(0, 1))
+    crossed = fc_old.crossed_column([b, 'c'], hash_bucket_size=5, hash_key=5)
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          'a': constant_op.constant(((-1., .5), (.5, 1.))),
+          'c': sparse_tensor.SparseTensor(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=['cA', 'cB', 'cC'],
+              dense_shape=(2, 2)),
+      }, (crossed,))
+      bias = get_linear_model_bias()
+      crossed_var = get_linear_model_column_var(crossed)
+      with _initialized_session() as sess:
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(
+            ((0.,), (0.,), (0.,), (0.,), (0.,)), crossed_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        sess.run(crossed_var.assign(((1.,), (2.,), (3.,), (4.,), (5.,))))
+        # Expected ids after cross = (1, 0, 1, 3, 4, 2)
+        self.assertAllClose(((3.,), (14.,)), predictions.eval())
+        sess.run(bias.assign((.1,)))
+        self.assertAllClose(((3.1,), (14.1,)), predictions.eval())
+
+  def test_linear_model_with_weights(self):
+
+    class _TestColumnWithWeights(fc_old._CategoricalColumn):
+      """Produces sparse IDs and sparse weights."""
+
+      @property
+      def name(self):
+        return 'test_column'
+
+      @property
+      def _parse_example_spec(self):
+        return {
+            self.name: parsing_ops.VarLenFeature(dtypes.int32),
+            '{}_weights'.format(self.name): parsing_ops.VarLenFeature(
+                dtypes.float32),
+            }
+
+      @property
+      def _num_buckets(self):
+        return 5
+
+      def _transform_feature(self, inputs):
+        return (inputs.get(self.name),
+                inputs.get('{}_weights'.format(self.name)))
+
+      def _get_sparse_tensors(self, inputs, weight_collections=None,
+                              trainable=None):
+        """Populates both id_tensor and weight_tensor."""
+        ids_and_weights = inputs.get(self)
+        return fc_old._CategoricalColumn.IdWeightPair(
+            id_tensor=ids_and_weights[0], weight_tensor=ids_and_weights[1])
+
+    t = _TestColumnWithWeights()
+    crossed = fc_old.crossed_column([t, 'c'], hash_bucket_size=5, hash_key=5)
+    with ops.Graph().as_default():
+      with self.assertRaisesRegexp(
+          ValueError,
+          'crossed_column does not support weight_tensor.*{}'.format(t.name)):
+        fc.linear_model({
+            t.name: sparse_tensor.SparseTensor(
+                indices=((0, 0), (1, 0), (1, 1)),
+                values=[0, 1, 2],
+                dense_shape=(2, 2)),
+            '{}_weights'.format(t.name): sparse_tensor.SparseTensor(
+                indices=((0, 0), (1, 0), (1, 1)),
+                values=[1., 10., 2.],
+                dense_shape=(2, 2)),
+            'c': sparse_tensor.SparseTensor(
+                indices=((0, 0), (1, 0), (1, 1)),
+                values=['cA', 'cB', 'cC'],
+                dense_shape=(2, 2)),
+        }, (crossed,))
+
+  def test_keras_linear_model(self):
+    """Tests _LinearModel.
+
+    Uses data from test_get_sparse_tesnsors_simple.
+    """
+    a = fc_old.numeric_column('a', dtype=dtypes.int32, shape=(2,))
+    b = fc_old.bucketized_column(a, boundaries=(0, 1))
+    crossed = fc_old.crossed_column([b, 'c'], hash_bucket_size=5, hash_key=5)
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          'a':
+              constant_op.constant(((-1., .5), (.5, 1.))),
+          'c':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=['cA', 'cB', 'cC'],
+                  dense_shape=(2, 2)),
+      }, (crossed,))
+      bias = get_linear_model_bias()
+      crossed_var = get_linear_model_column_var(crossed)
+      with _initialized_session() as sess:
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,), (0.,), (0.,)),
+                            crossed_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        sess.run(crossed_var.assign(((1.,), (2.,), (3.,), (4.,), (5.,))))
+        # Expected ids after cross = (1, 0, 1, 3, 4, 2)
+        self.assertAllClose(((3.,), (14.,)), predictions.eval())
+        sess.run(bias.assign((.1,)))
+        self.assertAllClose(((3.1,), (14.1,)), predictions.eval())
+
+  def test_keras_linear_model_with_weights(self):
+
+    class _TestColumnWithWeights(fc_old._CategoricalColumn):
+      """Produces sparse IDs and sparse weights."""
+
+      @property
+      def name(self):
+        return 'test_column'
+
+      @property
+      def _parse_example_spec(self):
+        return {
+            self.name:
+                parsing_ops.VarLenFeature(dtypes.int32),
+            '{}_weights'.format(self.name):
+                parsing_ops.VarLenFeature(dtypes.float32),
+        }
+
+      @property
+      def _num_buckets(self):
+        return 5
+
+      def _transform_feature(self, inputs):
+        return (inputs.get(self.name),
+                inputs.get('{}_weights'.format(self.name)))
+
+      def _get_sparse_tensors(self,
+                              inputs,
+                              weight_collections=None,
+                              trainable=None):
+        """Populates both id_tensor and weight_tensor."""
+        ids_and_weights = inputs.get(self)
+        return fc_old._CategoricalColumn.IdWeightPair(
+            id_tensor=ids_and_weights[0], weight_tensor=ids_and_weights[1])
+
+    t = _TestColumnWithWeights()
+    crossed = fc_old.crossed_column([t, 'c'], hash_bucket_size=5, hash_key=5)
+    with ops.Graph().as_default():
+      with self.assertRaisesRegexp(
+          ValueError,
+          'crossed_column does not support weight_tensor.*{}'.format(t.name)):
+        get_keras_linear_model_predictions({
+            t.name:
+                sparse_tensor.SparseTensor(
+                    indices=((0, 0), (1, 0), (1, 1)),
+                    values=[0, 1, 2],
+                    dense_shape=(2, 2)),
+            '{}_weights'.format(t.name):
+                sparse_tensor.SparseTensor(
+                    indices=((0, 0), (1, 0), (1, 1)),
+                    values=[1., 10., 2.],
+                    dense_shape=(2, 2)),
+            'c':
+                sparse_tensor.SparseTensor(
+                    indices=((0, 0), (1, 0), (1, 1)),
+                    values=['cA', 'cB', 'cC'],
+                    dense_shape=(2, 2)),
+        }, (crossed,))
+
+
+def get_linear_model_bias(name='linear_model'):
+  with variable_scope.variable_scope(name, reuse=True):
+    return variable_scope.get_variable('bias_weights')
+
+
+def get_linear_model_column_var(column, name='linear_model'):
+  return ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES,
+                            name + '/' + column.name)[0]
+
+
+def get_keras_linear_model_predictions(features,
+                                       feature_columns,
+                                       units=1,
+                                       sparse_combiner='sum',
+                                       weight_collections=None,
+                                       trainable=True,
+                                       cols_to_vars=None):
+  keras_linear_model = _LinearModel(
+      feature_columns,
+      units,
+      sparse_combiner,
+      weight_collections,
+      trainable,
+      name='linear_model')
+  retval = keras_linear_model(features)  # pylint: disable=not-callable
+  if cols_to_vars is not None:
+    cols_to_vars.update(keras_linear_model.cols_to_vars())
+  return retval
+
+
+class LinearModelTest(test.TestCase):
+
+  def test_raises_if_empty_feature_columns(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 'feature_columns must not be empty'):
+      fc.linear_model(features={}, feature_columns=[])
+
+  def test_should_be_feature_column(self):
+    with self.assertRaisesRegexp(ValueError, 'must be a _FeatureColumn'):
+      fc.linear_model(features={'a': [[0]]}, feature_columns='NotSupported')
+
+  def test_should_be_dense_or_categorical_column(self):
+
+    class NotSupportedColumn(fc_old._FeatureColumn):
+
+      @property
+      def name(self):
+        return 'NotSupportedColumn'
+
+      def _transform_feature(self, cache):
+        pass
+
+      @property
+      def _parse_example_spec(self):
+        pass
+
+    with self.assertRaisesRegexp(
+        ValueError, 'must be either a _DenseColumn or _CategoricalColumn'):
+      fc.linear_model(
+          features={'a': [[0]]}, feature_columns=[NotSupportedColumn()])
+
+  def test_does_not_support_dict_columns(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'Expected feature_columns to be iterable, found dict.'):
+      fc.linear_model(
+          features={'a': [[0]]},
+          feature_columns={'a': fc_old.numeric_column('a')})
+
+  def test_raises_if_duplicate_name(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'Duplicate feature column name found for columns'):
+      fc.linear_model(
+          features={'a': [[0]]},
+          feature_columns=[
+              fc_old.numeric_column('a'),
+              fc_old.numeric_column('a')
+          ])
+
+  def test_dense_bias(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      predictions = fc.linear_model(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        sess.run(price_var.assign([[10.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[15.], [55.]], predictions.eval())
+
+  def test_sparse_bias(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = fc.linear_model(features, [wire_cast])
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.], [0.], [0.], [0.]], wire_cast_var.eval())
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[1005.], [10015.]], predictions.eval())
+
+  def test_dense_and_sparse_bias(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor, 'price': [[1.], [5.]]}
+      predictions = fc.linear_model(features, [wire_cast, price])
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        sess.run(price_var.assign([[10.]]))
+        self.assertAllClose([[1015.], [10065.]], predictions.eval())
+
+  def test_dense_and_sparse_column(self):
+    """When the column is both dense and sparse, uses sparse tensors."""
+
+    class _DenseAndSparseColumn(fc_old._DenseColumn, fc_old._CategoricalColumn):
+
+      @property
+      def name(self):
+        return 'dense_and_sparse_column'
+
+      @property
+      def _parse_example_spec(self):
+        return {self.name: parsing_ops.VarLenFeature(self.dtype)}
+
+      def _transform_feature(self, inputs):
+        return inputs.get(self.name)
+
+      @property
+      def _variable_shape(self):
+        raise ValueError('Should not use this method.')
+
+      def _get_dense_tensor(self, inputs, weight_collections=None,
+                            trainable=None):
+        raise ValueError('Should not use this method.')
+
+      @property
+      def _num_buckets(self):
+        return 4
+
+      def _get_sparse_tensors(self, inputs, weight_collections=None,
+                              trainable=None):
+        sp_tensor = sparse_tensor.SparseTensor(
+            indices=[[0, 0], [1, 0], [1, 1]],
+            values=[2, 0, 3],
+            dense_shape=[2, 2])
+        return fc_old._CategoricalColumn.IdWeightPair(sp_tensor, None)
+
+    dense_and_sparse_column = _DenseAndSparseColumn()
+    with ops.Graph().as_default():
+      sp_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {dense_and_sparse_column.name: sp_tensor}
+      predictions = fc.linear_model(features, [dense_and_sparse_column])
+      bias = get_linear_model_bias()
+      dense_and_sparse_column_var = get_linear_model_column_var(
+          dense_and_sparse_column)
+      with _initialized_session() as sess:
+        sess.run(dense_and_sparse_column_var.assign(
+            [[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[1005.], [10015.]], predictions.eval())
+
+  def test_dense_multi_output(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      predictions = fc.linear_model(features, [price], units=3)
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((3,)), bias.eval())
+        self.assertAllClose(np.zeros((1, 3)), price_var.eval())
+        sess.run(price_var.assign([[10., 100., 1000.]]))
+        sess.run(bias.assign([5., 6., 7.]))
+        self.assertAllClose([[15., 106., 1007.], [55., 506., 5007.]],
+                            predictions.eval())
+
+  def test_sparse_multi_output(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = fc.linear_model(features, [wire_cast], units=3)
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((3,)), bias.eval())
+        self.assertAllClose(np.zeros((4, 3)), wire_cast_var.eval())
+        sess.run(
+            wire_cast_var.assign([[10., 11., 12.], [100., 110., 120.], [
+                1000., 1100., 1200.
+            ], [10000., 11000., 12000.]]))
+        sess.run(bias.assign([5., 6., 7.]))
+        self.assertAllClose([[1005., 1106., 1207.], [10015., 11017., 12019.]],
+                            predictions.eval())
+
+  def test_dense_multi_dimension(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1., 2.], [5., 6.]]}
+      predictions = fc.linear_model(features, [price])
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([[0.], [0.]], price_var.eval())
+        sess.run(price_var.assign([[10.], [100.]]))
+        self.assertAllClose([[210.], [650.]], predictions.eval())
+
+  def test_sparse_multi_rank(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = array_ops.sparse_placeholder(dtypes.string)
+      wire_value = sparse_tensor.SparseTensorValue(
+          values=['omar', 'stringer', 'marlo', 'omar'],  # hashed = [2, 0, 3, 2]
+          indices=[[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 1]],
+          dense_shape=[2, 2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = fc.linear_model(features, [wire_cast])
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((4, 1)), wire_cast_var.eval())
+        self.assertAllClose(
+            np.zeros((2, 1)),
+            predictions.eval(feed_dict={wire_tensor: wire_value}))
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        self.assertAllClose(
+            [[1010.], [11000.]],
+            predictions.eval(feed_dict={wire_tensor: wire_value}))
+
+  def test_sparse_combiner(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = fc.linear_model(
+          features, [wire_cast], sparse_combiner='mean')
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[1005.], [5010.]], predictions.eval())
+
+  def test_sparse_combiner_with_negative_weights(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    wire_cast_weights = fc_old.weighted_categorical_column(wire_cast, 'weights')
+
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {
+          'wire_cast': wire_tensor,
+          'weights': constant_op.constant([[1., 1., -1.0]])
+      }
+      predictions = fc.linear_model(
+          features, [wire_cast_weights], sparse_combiner='sum')
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[1005.], [-9985.]], predictions.eval())
+
+  def test_dense_multi_dimension_multi_output(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1., 2.], [5., 6.]]}
+      predictions = fc.linear_model(features, [price], units=3)
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((3,)), bias.eval())
+        self.assertAllClose(np.zeros((2, 3)), price_var.eval())
+        sess.run(price_var.assign([[1., 2., 3.], [10., 100., 1000.]]))
+        sess.run(bias.assign([2., 3., 4.]))
+        self.assertAllClose([[23., 205., 2007.], [67., 613., 6019.]],
+                            predictions.eval())
+
+  def test_raises_if_shape_mismatch(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      with self.assertRaisesRegexp(
+          Exception,
+          r'Cannot reshape a tensor with 2 elements to shape \[2,2\]'):
+        fc.linear_model(features, [price])
+
+  def test_dense_reshaping(self):
+    price = fc_old.numeric_column('price', shape=[1, 2])
+    with ops.Graph().as_default():
+      features = {'price': [[[1., 2.]], [[5., 6.]]]}
+      predictions = fc.linear_model(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.], [0.]], price_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(price_var.assign([[10.], [100.]]))
+        self.assertAllClose([[210.], [650.]], predictions.eval())
+
+  def test_dense_multi_column(self):
+    price1 = fc_old.numeric_column('price1', shape=2)
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[1., 2.], [5., 6.]],
+          'price2': [[3.], [4.]]
+      }
+      predictions = fc.linear_model(features, [price1, price2])
+      bias = get_linear_model_bias()
+      price1_var = get_linear_model_column_var(price1)
+      price2_var = get_linear_model_column_var(price2)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.], [0.]], price1_var.eval())
+        self.assertAllClose([[0.]], price2_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(price1_var.assign([[10.], [100.]]))
+        sess.run(price2_var.assign([[1000.]]))
+        sess.run(bias.assign([7.]))
+        self.assertAllClose([[3217.], [4657.]], predictions.eval())
+
+  def test_fills_cols_to_vars(self):
+    price1 = fc_old.numeric_column('price1', shape=2)
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {'price1': [[1., 2.], [5., 6.]], 'price2': [[3.], [4.]]}
+      cols_to_vars = {}
+      fc.linear_model(features, [price1, price2], cols_to_vars=cols_to_vars)
+      bias = get_linear_model_bias()
+      price1_var = get_linear_model_column_var(price1)
+      price2_var = get_linear_model_column_var(price2)
+      self.assertAllEqual(cols_to_vars['bias'], [bias])
+      self.assertAllEqual(cols_to_vars[price1], [price1_var])
+      self.assertAllEqual(cols_to_vars[price2], [price2_var])
+
+  def test_fills_cols_to_vars_partitioned_variables(self):
+    price1 = fc_old.numeric_column('price1', shape=2)
+    price2 = fc_old.numeric_column('price2', shape=3)
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[1., 2.], [6., 7.]],
+          'price2': [[3., 4., 5.], [8., 9., 10.]]
+      }
+      cols_to_vars = {}
+      with variable_scope.variable_scope(
+          'linear',
+          partitioner=partitioned_variables.fixed_size_partitioner(2, axis=0)):
+        fc.linear_model(features, [price1, price2], cols_to_vars=cols_to_vars)
+      with _initialized_session():
+        self.assertEqual([0.], cols_to_vars['bias'][0].eval())
+        # Partitioning shards the [2, 1] price1 var into 2 [1, 1] Variables.
+        self.assertAllEqual([[0.]], cols_to_vars[price1][0].eval())
+        self.assertAllEqual([[0.]], cols_to_vars[price1][1].eval())
+        # Partitioning shards the [3, 1] price2 var into a [2, 1] Variable and
+        # a [1, 1] Variable.
+        self.assertAllEqual([[0.], [0.]], cols_to_vars[price2][0].eval())
+        self.assertAllEqual([[0.]], cols_to_vars[price2][1].eval())
+
+  def test_dense_collection(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default() as g:
+      features = {'price': [[1.], [5.]]}
+      fc.linear_model(features, [price], weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      self.assertIn(bias, my_vars)
+      self.assertIn(price_var, my_vars)
+
+  def test_sparse_collection(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      features = {'wire_cast': wire_tensor}
+      fc.linear_model(
+          features, [wire_cast], weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      self.assertIn(bias, my_vars)
+      self.assertIn(wire_cast_var, my_vars)
+
+  def test_dense_trainable_default(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default() as g:
+      features = {'price': [[1.], [5.]]}
+      fc.linear_model(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertIn(bias, trainable_vars)
+      self.assertIn(price_var, trainable_vars)
+
+  def test_sparse_trainable_default(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      features = {'wire_cast': wire_tensor}
+      fc.linear_model(features, [wire_cast])
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      self.assertIn(bias, trainable_vars)
+      self.assertIn(wire_cast_var, trainable_vars)
+
+  def test_dense_trainable_false(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default() as g:
+      features = {'price': [[1.], [5.]]}
+      fc.linear_model(features, [price], trainable=False)
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertEqual([], trainable_vars)
+
+  def test_sparse_trainable_false(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      features = {'wire_cast': wire_tensor}
+      fc.linear_model(features, [wire_cast], trainable=False)
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertEqual([], trainable_vars)
+
+  def test_column_order(self):
+    price_a = fc_old.numeric_column('price_a')
+    price_b = fc_old.numeric_column('price_b')
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      features = {
+          'price_a': [[1.]],
+          'price_b': [[3.]],
+          'wire_cast':
+              sparse_tensor.SparseTensor(
+                  values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      }
+      fc.linear_model(
+          features, [price_a, wire_cast, price_b],
+          weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      self.assertIn('price_a', my_vars[0].name)
+      self.assertIn('price_b', my_vars[1].name)
+      self.assertIn('wire_cast', my_vars[2].name)
+
+    with ops.Graph().as_default() as g:
+      features = {
+          'price_a': [[1.]],
+          'price_b': [[3.]],
+          'wire_cast':
+              sparse_tensor.SparseTensor(
+                  values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      }
+      fc.linear_model(
+          features, [wire_cast, price_b, price_a],
+          weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      self.assertIn('price_a', my_vars[0].name)
+      self.assertIn('price_b', my_vars[1].name)
+      self.assertIn('wire_cast', my_vars[2].name)
+
+  def test_static_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[1.], [5.], [7.]],  # batchsize = 3
+          'price2': [[3.], [4.]]  # batchsize = 2
+      }
+    with self.assertRaisesRegexp(
+        ValueError,
+        'Batch size \(first dimension\) of each feature must be same.'):  # pylint: disable=anomalous-backslash-in-string
+      fc.linear_model(features, [price1, price2])
+
+  def test_subset_of_static_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    price3 = fc_old.numeric_column('price3')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 3
+          'price2': [[3.], [4.]],  # batchsize = 2
+          'price3': [[3.], [4.], [5.]]  # batchsize = 3
+      }
+      with self.assertRaisesRegexp(
+          ValueError,
+          'Batch size \(first dimension\) of each feature must be same.'):  # pylint: disable=anomalous-backslash-in-string
+        fc.linear_model(features, [price1, price2, price3])
+
+  def test_runtime_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 3
+          'price2': [[3.], [4.]]  # batchsize = 2
+      }
+      predictions = fc.linear_model(features, [price1, price2])
+      with _initialized_session() as sess:
+        with self.assertRaisesRegexp(errors.OpError,
+                                     'must have the same size and shape'):
+          sess.run(
+              predictions, feed_dict={features['price1']: [[1.], [5.], [7.]]})
+
+  def test_runtime_batch_size_matches(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 2
+          'price2': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 2
+      }
+      predictions = fc.linear_model(features, [price1, price2])
+      with _initialized_session() as sess:
+        sess.run(
+            predictions,
+            feed_dict={
+                features['price1']: [[1.], [5.]],
+                features['price2']: [[1.], [5.]],
+            })
+
+  def test_with_numpy_input_fn(self):
+    price = fc_old.numeric_column('price')
+    price_buckets = fc_old.bucketized_column(
+        price, boundaries=[
+            0.,
+            10.,
+            100.,
+        ])
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+
+    input_fn = numpy_io.numpy_input_fn(
+        x={
+            'price': np.array([-1., 2., 13., 104.]),
+            'body-style': np.array(['sedan', 'hardtop', 'wagon', 'sedan']),
+        },
+        batch_size=2,
+        shuffle=False)
+    features = input_fn()
+    net = fc.linear_model(features, [price_buckets, body_style])
+    # self.assertEqual(1 + 3 + 5, net.shape[1])
+    with _initialized_session() as sess:
+      coord = coordinator.Coordinator()
+      threads = queue_runner_impl.start_queue_runners(sess, coord=coord)
+
+      bias = get_linear_model_bias()
+      price_buckets_var = get_linear_model_column_var(price_buckets)
+      body_style_var = get_linear_model_column_var(body_style)
+
+      sess.run(price_buckets_var.assign([[10.], [100.], [1000.], [10000.]]))
+      sess.run(body_style_var.assign([[-10.], [-100.], [-1000.]]))
+      sess.run(bias.assign([5.]))
+
+      self.assertAllClose([[10 - 1000 + 5.], [100 - 10 + 5.]], sess.run(net))
+
+      coord.request_stop()
+      coord.join(threads)
+
+  def test_with_1d_sparse_tensor(self):
+    price = fc_old.numeric_column('price')
+    price_buckets = fc_old.bucketized_column(
+        price, boundaries=[
+            0.,
+            10.,
+            100.,
+        ])
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+
+    # Provides 1-dim tensor and dense tensor.
+    features = {
+        'price': constant_op.constant([-1., 12.,]),
+        'body-style': sparse_tensor.SparseTensor(
+            indices=((0,), (1,)),
+            values=('sedan', 'hardtop'),
+            dense_shape=(2,)),
+    }
+    self.assertEqual(1, features['price'].shape.ndims)
+    self.assertEqual(1, features['body-style'].dense_shape.get_shape()[0])
+
+    net = fc.linear_model(features, [price_buckets, body_style])
+    with _initialized_session() as sess:
+      bias = get_linear_model_bias()
+      price_buckets_var = get_linear_model_column_var(price_buckets)
+      body_style_var = get_linear_model_column_var(body_style)
+
+      sess.run(price_buckets_var.assign([[10.], [100.], [1000.], [10000.]]))
+      sess.run(body_style_var.assign([[-10.], [-100.], [-1000.]]))
+      sess.run(bias.assign([5.]))
+
+      self.assertAllClose([[10 - 1000 + 5.], [1000 - 10 + 5.]], sess.run(net))
+
+  def test_with_1d_unknown_shape_sparse_tensor(self):
+    price = fc_old.numeric_column('price')
+    price_buckets = fc_old.bucketized_column(
+        price, boundaries=[
+            0.,
+            10.,
+            100.,
+        ])
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+    country = fc_old.categorical_column_with_vocabulary_list(
+        'country', vocabulary_list=['US', 'JP', 'CA'])
+
+    # Provides 1-dim tensor and dense tensor.
+    features = {
+        'price': array_ops.placeholder(dtypes.float32),
+        'body-style': array_ops.sparse_placeholder(dtypes.string),
+        'country': array_ops.placeholder(dtypes.string),
+    }
+    self.assertIsNone(features['price'].shape.ndims)
+    self.assertIsNone(features['body-style'].get_shape().ndims)
+
+    price_data = np.array([-1., 12.])
+    body_style_data = sparse_tensor.SparseTensorValue(
+        indices=((0,), (1,)),
+        values=('sedan', 'hardtop'),
+        dense_shape=(2,))
+    country_data = np.array(['US', 'CA'])
+
+    net = fc.linear_model(features, [price_buckets, body_style, country])
+    bias = get_linear_model_bias()
+    price_buckets_var = get_linear_model_column_var(price_buckets)
+    body_style_var = get_linear_model_column_var(body_style)
+    with _initialized_session() as sess:
+      sess.run(price_buckets_var.assign([[10.], [100.], [1000.], [10000.]]))
+      sess.run(body_style_var.assign([[-10.], [-100.], [-1000.]]))
+      sess.run(bias.assign([5.]))
+
+      self.assertAllClose([[10 - 1000 + 5.], [1000 - 10 + 5.]],
+                          sess.run(
+                              net,
+                              feed_dict={
+                                  features['price']: price_data,
+                                  features['body-style']: body_style_data,
+                                  features['country']: country_data
+                              }))
+
+  def test_with_rank_0_feature(self):
+    price = fc_old.numeric_column('price')
+    features = {
+        'price': constant_op.constant(0),
+    }
+    self.assertEqual(0, features['price'].shape.ndims)
+
+    # Static rank 0 should fail
+    with self.assertRaisesRegexp(ValueError, 'Feature .* cannot have rank 0'):
+      fc.linear_model(features, [price])
+
+    # Dynamic rank 0 should fail
+    features = {
+        'price': array_ops.placeholder(dtypes.float32),
+    }
+    net = fc.linear_model(features, [price])
+    self.assertEqual(1, net.shape[1])
+    with _initialized_session() as sess:
+      with self.assertRaisesOpError('Feature .* cannot have rank 0'):
+        sess.run(net, feed_dict={features['price']: np.array(1)})
+
+  def test_multiple_linear_models(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features1 = {'price': [[1.], [5.]]}
+      features2 = {'price': [[2.], [10.]]}
+      predictions1 = fc.linear_model(features1, [price])
+      predictions2 = fc.linear_model(features2, [price])
+      bias1 = get_linear_model_bias(name='linear_model')
+      bias2 = get_linear_model_bias(name='linear_model_1')
+      price_var1 = get_linear_model_column_var(price, name='linear_model')
+      price_var2 = get_linear_model_column_var(price, name='linear_model_1')
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias1.eval())
+        sess.run(price_var1.assign([[10.]]))
+        sess.run(bias1.assign([5.]))
+        self.assertAllClose([[15.], [55.]], predictions1.eval())
+        self.assertAllClose([0.], bias2.eval())
+        sess.run(price_var2.assign([[10.]]))
+        sess.run(bias2.assign([5.]))
+        self.assertAllClose([[25.], [105.]], predictions2.eval())
+
+
+class _LinearModelTest(test.TestCase):
+
+  def test_raises_if_empty_feature_columns(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 'feature_columns must not be empty'):
+      get_keras_linear_model_predictions(features={}, feature_columns=[])
+
+  def test_should_be_feature_column(self):
+    with self.assertRaisesRegexp(ValueError, 'must be a _FeatureColumn'):
+      get_keras_linear_model_predictions(
+          features={'a': [[0]]}, feature_columns='NotSupported')
+
+  def test_should_be_dense_or_categorical_column(self):
+
+    class NotSupportedColumn(fc_old._FeatureColumn):
+
+      @property
+      def name(self):
+        return 'NotSupportedColumn'
+
+      def _transform_feature(self, cache):
+        pass
+
+      @property
+      def _parse_example_spec(self):
+        pass
+
+    with self.assertRaisesRegexp(
+        ValueError, 'must be either a _DenseColumn or _CategoricalColumn'):
+      get_keras_linear_model_predictions(
+          features={'a': [[0]]}, feature_columns=[NotSupportedColumn()])
+
+  def test_does_not_support_dict_columns(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'Expected feature_columns to be iterable, found dict.'):
+      fc.linear_model(
+          features={'a': [[0]]},
+          feature_columns={'a': fc_old.numeric_column('a')})
+
+  def test_raises_if_duplicate_name(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'Duplicate feature column name found for columns'):
+      get_keras_linear_model_predictions(
+          features={'a': [[0]]},
+          feature_columns=[
+              fc_old.numeric_column('a'),
+              fc_old.numeric_column('a')
+          ])
+
+  def test_dense_bias(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      predictions = get_keras_linear_model_predictions(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        sess.run(price_var.assign([[10.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[15.], [55.]], predictions.eval())
+
+  def test_sparse_bias(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = get_keras_linear_model_predictions(features, [wire_cast])
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.], [0.], [0.], [0.]], wire_cast_var.eval())
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[1005.], [10015.]], predictions.eval())
+
+  def test_dense_and_sparse_bias(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor, 'price': [[1.], [5.]]}
+      predictions = get_keras_linear_model_predictions(features,
+                                                       [wire_cast, price])
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        sess.run(price_var.assign([[10.]]))
+        self.assertAllClose([[1015.], [10065.]], predictions.eval())
+
+  def test_dense_and_sparse_column(self):
+    """When the column is both dense and sparse, uses sparse tensors."""
+
+    class _DenseAndSparseColumn(fc_old._DenseColumn, fc_old._CategoricalColumn):
+
+      @property
+      def name(self):
+        return 'dense_and_sparse_column'
+
+      @property
+      def _parse_example_spec(self):
+        return {self.name: parsing_ops.VarLenFeature(self.dtype)}
+
+      def _transform_feature(self, inputs):
+        return inputs.get(self.name)
+
+      @property
+      def _variable_shape(self):
+        raise ValueError('Should not use this method.')
+
+      def _get_dense_tensor(self,
+                            inputs,
+                            weight_collections=None,
+                            trainable=None):
+        raise ValueError('Should not use this method.')
+
+      @property
+      def _num_buckets(self):
+        return 4
+
+      def _get_sparse_tensors(self,
+                              inputs,
+                              weight_collections=None,
+                              trainable=None):
+        sp_tensor = sparse_tensor.SparseTensor(
+            indices=[[0, 0], [1, 0], [1, 1]],
+            values=[2, 0, 3],
+            dense_shape=[2, 2])
+        return fc_old._CategoricalColumn.IdWeightPair(sp_tensor, None)
+
+    dense_and_sparse_column = _DenseAndSparseColumn()
+    with ops.Graph().as_default():
+      sp_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {dense_and_sparse_column.name: sp_tensor}
+      predictions = get_keras_linear_model_predictions(
+          features, [dense_and_sparse_column])
+      bias = get_linear_model_bias()
+      dense_and_sparse_column_var = get_linear_model_column_var(
+          dense_and_sparse_column)
+      with _initialized_session() as sess:
+        sess.run(
+            dense_and_sparse_column_var.assign([[10.], [100.], [1000.],
+                                                [10000.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[1005.], [10015.]], predictions.eval())
+
+  def test_dense_multi_output(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      predictions = get_keras_linear_model_predictions(
+          features, [price], units=3)
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((3,)), bias.eval())
+        self.assertAllClose(np.zeros((1, 3)), price_var.eval())
+        sess.run(price_var.assign([[10., 100., 1000.]]))
+        sess.run(bias.assign([5., 6., 7.]))
+        self.assertAllClose([[15., 106., 1007.], [55., 506., 5007.]],
+                            predictions.eval())
+
+  def test_sparse_multi_output(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = get_keras_linear_model_predictions(
+          features, [wire_cast], units=3)
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((3,)), bias.eval())
+        self.assertAllClose(np.zeros((4, 3)), wire_cast_var.eval())
+        sess.run(
+            wire_cast_var.assign([[10., 11., 12.], [100., 110., 120.],
+                                  [1000., 1100.,
+                                   1200.], [10000., 11000., 12000.]]))
+        sess.run(bias.assign([5., 6., 7.]))
+        self.assertAllClose([[1005., 1106., 1207.], [10015., 11017., 12019.]],
+                            predictions.eval())
+
+  def test_dense_multi_dimension(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1., 2.], [5., 6.]]}
+      predictions = get_keras_linear_model_predictions(features, [price])
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([[0.], [0.]], price_var.eval())
+        sess.run(price_var.assign([[10.], [100.]]))
+        self.assertAllClose([[210.], [650.]], predictions.eval())
+
+  def test_sparse_multi_rank(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = array_ops.sparse_placeholder(dtypes.string)
+      wire_value = sparse_tensor.SparseTensorValue(
+          values=['omar', 'stringer', 'marlo', 'omar'],  # hashed = [2, 0, 3, 2]
+          indices=[[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 1]],
+          dense_shape=[2, 2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = get_keras_linear_model_predictions(features, [wire_cast])
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((4, 1)), wire_cast_var.eval())
+        self.assertAllClose(
+            np.zeros((2, 1)),
+            predictions.eval(feed_dict={wire_tensor: wire_value}))
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        self.assertAllClose(
+            [[1010.], [11000.]],
+            predictions.eval(feed_dict={wire_tensor: wire_value}))
+
+  def test_sparse_combiner(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default():
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar', 'stringer', 'marlo'],  # hashed to = [2, 0, 3]
+          indices=[[0, 0], [1, 0], [1, 1]],
+          dense_shape=[2, 2])
+      features = {'wire_cast': wire_tensor}
+      predictions = get_keras_linear_model_predictions(
+          features, [wire_cast], sparse_combiner='mean')
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      with _initialized_session() as sess:
+        sess.run(wire_cast_var.assign([[10.], [100.], [1000.], [10000.]]))
+        sess.run(bias.assign([5.]))
+        self.assertAllClose([[1005.], [5010.]], predictions.eval())
+
+  def test_dense_multi_dimension_multi_output(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1., 2.], [5., 6.]]}
+      predictions = get_keras_linear_model_predictions(
+          features, [price], units=3)
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose(np.zeros((3,)), bias.eval())
+        self.assertAllClose(np.zeros((2, 3)), price_var.eval())
+        sess.run(price_var.assign([[1., 2., 3.], [10., 100., 1000.]]))
+        sess.run(bias.assign([2., 3., 4.]))
+        self.assertAllClose([[23., 205., 2007.], [67., 613., 6019.]],
+                            predictions.eval())
+
+  def test_raises_if_shape_mismatch(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      with self.assertRaisesRegexp(
+          Exception,
+          r'Cannot reshape a tensor with 2 elements to shape \[2,2\]'):
+        get_keras_linear_model_predictions(features, [price])
+
+  def test_dense_reshaping(self):
+    price = fc_old.numeric_column('price', shape=[1, 2])
+    with ops.Graph().as_default():
+      features = {'price': [[[1., 2.]], [[5., 6.]]]}
+      predictions = get_keras_linear_model_predictions(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.], [0.]], price_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(price_var.assign([[10.], [100.]]))
+        self.assertAllClose([[210.], [650.]], predictions.eval())
+
+  def test_dense_multi_column(self):
+    price1 = fc_old.numeric_column('price1', shape=2)
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {'price1': [[1., 2.], [5., 6.]], 'price2': [[3.], [4.]]}
+      predictions = get_keras_linear_model_predictions(features,
+                                                       [price1, price2])
+      bias = get_linear_model_bias()
+      price1_var = get_linear_model_column_var(price1)
+      price2_var = get_linear_model_column_var(price2)
+      with _initialized_session() as sess:
+        self.assertAllClose([0.], bias.eval())
+        self.assertAllClose([[0.], [0.]], price1_var.eval())
+        self.assertAllClose([[0.]], price2_var.eval())
+        self.assertAllClose([[0.], [0.]], predictions.eval())
+        sess.run(price1_var.assign([[10.], [100.]]))
+        sess.run(price2_var.assign([[1000.]]))
+        sess.run(bias.assign([7.]))
+        self.assertAllClose([[3217.], [4657.]], predictions.eval())
+
+  def test_fills_cols_to_vars(self):
+    price1 = fc_old.numeric_column('price1', shape=2)
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {'price1': [[1., 2.], [5., 6.]], 'price2': [[3.], [4.]]}
+      cols_to_vars = {}
+      get_keras_linear_model_predictions(
+          features, [price1, price2], cols_to_vars=cols_to_vars)
+      bias = get_linear_model_bias()
+      price1_var = get_linear_model_column_var(price1)
+      price2_var = get_linear_model_column_var(price2)
+      self.assertAllEqual(cols_to_vars['bias'], [bias])
+      self.assertAllEqual(cols_to_vars[price1], [price1_var])
+      self.assertAllEqual(cols_to_vars[price2], [price2_var])
+
+  def test_fills_cols_to_vars_partitioned_variables(self):
+    price1 = fc_old.numeric_column('price1', shape=2)
+    price2 = fc_old.numeric_column('price2', shape=3)
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[1., 2.], [6., 7.]],
+          'price2': [[3., 4., 5.], [8., 9., 10.]]
+      }
+      cols_to_vars = {}
+      with variable_scope.variable_scope(
+          'linear',
+          partitioner=partitioned_variables.fixed_size_partitioner(2, axis=0)):
+        get_keras_linear_model_predictions(
+            features, [price1, price2], cols_to_vars=cols_to_vars)
+      with _initialized_session():
+        self.assertEqual([0.], cols_to_vars['bias'][0].eval())
+        # Partitioning shards the [2, 1] price1 var into 2 [1, 1] Variables.
+        self.assertAllEqual([[0.]], cols_to_vars[price1][0].eval())
+        self.assertAllEqual([[0.]], cols_to_vars[price1][1].eval())
+        # Partitioning shards the [3, 1] price2 var into a [2, 1] Variable and
+        # a [1, 1] Variable.
+        self.assertAllEqual([[0.], [0.]], cols_to_vars[price2][0].eval())
+        self.assertAllEqual([[0.]], cols_to_vars[price2][1].eval())
+
+  def test_dense_collection(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default() as g:
+      features = {'price': [[1.], [5.]]}
+      get_keras_linear_model_predictions(
+          features, [price], weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      self.assertIn(bias, my_vars)
+      self.assertIn(price_var, my_vars)
+
+  def test_sparse_collection(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      features = {'wire_cast': wire_tensor}
+      get_keras_linear_model_predictions(
+          features, [wire_cast], weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      self.assertIn(bias, my_vars)
+      self.assertIn(wire_cast_var, my_vars)
+
+  def test_dense_trainable_default(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default() as g:
+      features = {'price': [[1.], [5.]]}
+      get_keras_linear_model_predictions(features, [price])
+      bias = get_linear_model_bias()
+      price_var = get_linear_model_column_var(price)
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertIn(bias, trainable_vars)
+      self.assertIn(price_var, trainable_vars)
+
+  def test_sparse_trainable_default(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      features = {'wire_cast': wire_tensor}
+      get_keras_linear_model_predictions(features, [wire_cast])
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      bias = get_linear_model_bias()
+      wire_cast_var = get_linear_model_column_var(wire_cast)
+      self.assertIn(bias, trainable_vars)
+      self.assertIn(wire_cast_var, trainable_vars)
+
+  def test_dense_trainable_false(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default() as g:
+      features = {'price': [[1.], [5.]]}
+      get_keras_linear_model_predictions(features, [price], trainable=False)
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertEqual([], trainable_vars)
+
+  def test_sparse_trainable_false(self):
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      wire_tensor = sparse_tensor.SparseTensor(
+          values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      features = {'wire_cast': wire_tensor}
+      get_keras_linear_model_predictions(features, [wire_cast], trainable=False)
+      trainable_vars = g.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertEqual([], trainable_vars)
+
+  def test_column_order(self):
+    price_a = fc_old.numeric_column('price_a')
+    price_b = fc_old.numeric_column('price_b')
+    wire_cast = fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+    with ops.Graph().as_default() as g:
+      features = {
+          'price_a': [[1.]],
+          'price_b': [[3.]],
+          'wire_cast':
+              sparse_tensor.SparseTensor(
+                  values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      }
+      get_keras_linear_model_predictions(
+          features, [price_a, wire_cast, price_b],
+          weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      self.assertIn('price_a', my_vars[0].name)
+      self.assertIn('price_b', my_vars[1].name)
+      self.assertIn('wire_cast', my_vars[2].name)
+
+    with ops.Graph().as_default() as g:
+      features = {
+          'price_a': [[1.]],
+          'price_b': [[3.]],
+          'wire_cast':
+              sparse_tensor.SparseTensor(
+                  values=['omar'], indices=[[0, 0]], dense_shape=[1, 1])
+      }
+      get_keras_linear_model_predictions(
+          features, [wire_cast, price_b, price_a],
+          weight_collections=['my-vars'])
+      my_vars = g.get_collection('my-vars')
+      self.assertIn('price_a', my_vars[0].name)
+      self.assertIn('price_b', my_vars[1].name)
+      self.assertIn('wire_cast', my_vars[2].name)
+
+  def test_static_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[1.], [5.], [7.]],  # batchsize = 3
+          'price2': [[3.], [4.]]  # batchsize = 2
+      }
+    with self.assertRaisesRegexp(
+        ValueError,
+        'Batch size \(first dimension\) of each feature must be same.'):  # pylint: disable=anomalous-backslash-in-string
+      get_keras_linear_model_predictions(features, [price1, price2])
+
+  def test_subset_of_static_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    price3 = fc_old.numeric_column('price3')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 3
+          'price2': [[3.], [4.]],  # batchsize = 2
+          'price3': [[3.], [4.], [5.]]  # batchsize = 3
+      }
+      with self.assertRaisesRegexp(
+          ValueError,
+          'Batch size \(first dimension\) of each feature must be same.'):  # pylint: disable=anomalous-backslash-in-string
+        get_keras_linear_model_predictions(features, [price1, price2, price3])
+
+  def test_runtime_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 3
+          'price2': [[3.], [4.]]  # batchsize = 2
+      }
+      predictions = get_keras_linear_model_predictions(features,
+                                                       [price1, price2])
+      with _initialized_session() as sess:
+        with self.assertRaisesRegexp(errors.OpError,
+                                     'must have the same size and shape'):
+          sess.run(
+              predictions, feed_dict={features['price1']: [[1.], [5.], [7.]]})
+
+  def test_runtime_batch_size_matches(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 2
+          'price2': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 2
+      }
+      predictions = get_keras_linear_model_predictions(features,
+                                                       [price1, price2])
+      with _initialized_session() as sess:
+        sess.run(
+            predictions,
+            feed_dict={
+                features['price1']: [[1.], [5.]],
+                features['price2']: [[1.], [5.]],
+            })
+
+  def test_with_numpy_input_fn(self):
+    price = fc_old.numeric_column('price')
+    price_buckets = fc_old.bucketized_column(
+        price, boundaries=[
+            0.,
+            10.,
+            100.,
+        ])
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+
+    input_fn = numpy_io.numpy_input_fn(
+        x={
+            'price': np.array([-1., 2., 13., 104.]),
+            'body-style': np.array(['sedan', 'hardtop', 'wagon', 'sedan']),
+        },
+        batch_size=2,
+        shuffle=False)
+    features = input_fn()
+    net = get_keras_linear_model_predictions(features,
+                                             [price_buckets, body_style])
+    # self.assertEqual(1 + 3 + 5, net.shape[1])
+    with _initialized_session() as sess:
+      coord = coordinator.Coordinator()
+      threads = queue_runner_impl.start_queue_runners(sess, coord=coord)
+
+      bias = get_linear_model_bias()
+      price_buckets_var = get_linear_model_column_var(price_buckets)
+      body_style_var = get_linear_model_column_var(body_style)
+
+      sess.run(price_buckets_var.assign([[10.], [100.], [1000.], [10000.]]))
+      sess.run(body_style_var.assign([[-10.], [-100.], [-1000.]]))
+      sess.run(bias.assign([5.]))
+
+      self.assertAllClose([[10 - 1000 + 5.], [100 - 10 + 5.]], sess.run(net))
+
+      coord.request_stop()
+      coord.join(threads)
+
+  def test_with_1d_sparse_tensor(self):
+    price = fc_old.numeric_column('price')
+    price_buckets = fc_old.bucketized_column(
+        price, boundaries=[
+            0.,
+            10.,
+            100.,
+        ])
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+
+    # Provides 1-dim tensor and dense tensor.
+    features = {
+        'price':
+            constant_op.constant([
+                -1.,
+                12.,
+            ]),
+        'body-style':
+            sparse_tensor.SparseTensor(
+                indices=((0,), (1,)),
+                values=('sedan', 'hardtop'),
+                dense_shape=(2,)),
+    }
+    self.assertEqual(1, features['price'].shape.ndims)
+    self.assertEqual(1, features['body-style'].dense_shape.get_shape()[0])
+
+    net = get_keras_linear_model_predictions(features,
+                                             [price_buckets, body_style])
+    with _initialized_session() as sess:
+      bias = get_linear_model_bias()
+      price_buckets_var = get_linear_model_column_var(price_buckets)
+      body_style_var = get_linear_model_column_var(body_style)
+
+      sess.run(price_buckets_var.assign([[10.], [100.], [1000.], [10000.]]))
+      sess.run(body_style_var.assign([[-10.], [-100.], [-1000.]]))
+      sess.run(bias.assign([5.]))
+
+      self.assertAllClose([[10 - 1000 + 5.], [1000 - 10 + 5.]], sess.run(net))
+
+  def test_with_1d_unknown_shape_sparse_tensor(self):
+    price = fc_old.numeric_column('price')
+    price_buckets = fc_old.bucketized_column(
+        price, boundaries=[
+            0.,
+            10.,
+            100.,
+        ])
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+    country = fc_old.categorical_column_with_vocabulary_list(
+        'country', vocabulary_list=['US', 'JP', 'CA'])
+
+    # Provides 1-dim tensor and dense tensor.
+    features = {
+        'price': array_ops.placeholder(dtypes.float32),
+        'body-style': array_ops.sparse_placeholder(dtypes.string),
+        'country': array_ops.placeholder(dtypes.string),
+    }
+    self.assertIsNone(features['price'].shape.ndims)
+    self.assertIsNone(features['body-style'].get_shape().ndims)
+
+    price_data = np.array([-1., 12.])
+    body_style_data = sparse_tensor.SparseTensorValue(
+        indices=((0,), (1,)), values=('sedan', 'hardtop'), dense_shape=(2,))
+    country_data = np.array(['US', 'CA'])
+
+    net = get_keras_linear_model_predictions(
+        features, [price_buckets, body_style, country])
+    bias = get_linear_model_bias()
+    price_buckets_var = get_linear_model_column_var(price_buckets)
+    body_style_var = get_linear_model_column_var(body_style)
+    with _initialized_session() as sess:
+      sess.run(price_buckets_var.assign([[10.], [100.], [1000.], [10000.]]))
+      sess.run(body_style_var.assign([[-10.], [-100.], [-1000.]]))
+      sess.run(bias.assign([5.]))
+
+      self.assertAllClose([[10 - 1000 + 5.], [1000 - 10 + 5.]],
+                          sess.run(
+                              net,
+                              feed_dict={
+                                  features['price']: price_data,
+                                  features['body-style']: body_style_data,
+                                  features['country']: country_data
+                              }))
+
+  def test_with_rank_0_feature(self):
+    price = fc_old.numeric_column('price')
+    features = {
+        'price': constant_op.constant(0),
+    }
+    self.assertEqual(0, features['price'].shape.ndims)
+
+    # Static rank 0 should fail
+    with self.assertRaisesRegexp(ValueError, 'Feature .* cannot have rank 0'):
+      get_keras_linear_model_predictions(features, [price])
+
+    # Dynamic rank 0 should fail
+    features = {
+        'price': array_ops.placeholder(dtypes.float32),
+    }
+    net = get_keras_linear_model_predictions(features, [price])
+    self.assertEqual(1, net.shape[1])
+    with _initialized_session() as sess:
+      with self.assertRaisesOpError('Feature .* cannot have rank 0'):
+        sess.run(net, feed_dict={features['price']: np.array(1)})
+
+
+class InputLayerTest(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def test_retrieving_input(self):
+    features = {'a': [0.]}
+    input_layer = InputLayer(fc_old.numeric_column('a'))
+    inputs = self.evaluate(input_layer(features))
+    self.assertAllClose([[0.]], inputs)
+
+  def test_reuses_variables(self):
+    with context.eager_mode():
+      sparse_input = sparse_tensor.SparseTensor(
+          indices=((0, 0), (1, 0), (2, 0)),
+          values=(0, 1, 2),
+          dense_shape=(3, 3))
+
+      # Create feature columns (categorical and embedding).
+      categorical_column = fc_old.categorical_column_with_identity(
+          key='a', num_buckets=3)
+      embedding_dimension = 2
+      def _embedding_column_initializer(shape, dtype, partition_info):
+        del shape  # unused
+        del dtype  # unused
+        del partition_info  # unused
+        embedding_values = (
+            (1, 0),  # id 0
+            (0, 1),  # id 1
+            (1, 1))  # id 2
+        return embedding_values
+
+      embedding_column = fc_old.embedding_column(
+          categorical_column,
+          dimension=embedding_dimension,
+          initializer=_embedding_column_initializer)
+
+      input_layer = InputLayer([embedding_column])
+      features = {'a': sparse_input}
+
+      inputs = input_layer(features)
+      variables = input_layer.variables
+
+      # Sanity check: test that the inputs are correct.
+      self.assertAllEqual([[1, 0], [0, 1], [1, 1]], inputs)
+
+      # Check that only one variable was created.
+      self.assertEqual(1, len(variables))
+
+      # Check that invoking input_layer on the same features does not create
+      # additional variables
+      _ = input_layer(features)
+      self.assertEqual(1, len(variables))
+      self.assertEqual(variables[0], input_layer.variables[0])
+
+  def test_feature_column_input_layer_gradient(self):
+    with context.eager_mode():
+      sparse_input = sparse_tensor.SparseTensor(
+          indices=((0, 0), (1, 0), (2, 0)),
+          values=(0, 1, 2),
+          dense_shape=(3, 3))
+
+      # Create feature columns (categorical and embedding).
+      categorical_column = fc_old.categorical_column_with_identity(
+          key='a', num_buckets=3)
+      embedding_dimension = 2
+
+      def _embedding_column_initializer(shape, dtype, partition_info):
+        del shape  # unused
+        del dtype  # unused
+        del partition_info  # unused
+        embedding_values = (
+            (1, 0),  # id 0
+            (0, 1),  # id 1
+            (1, 1))  # id 2
+        return embedding_values
+
+      embedding_column = fc_old.embedding_column(
+          categorical_column,
+          dimension=embedding_dimension,
+          initializer=_embedding_column_initializer)
+
+      input_layer = InputLayer([embedding_column])
+      features = {'a': sparse_input}
+
+      def scale_matrix():
+        matrix = input_layer(features)
+        return 2 * matrix
+
+      # Sanity check: Verify that scale_matrix returns the correct output.
+      self.assertAllEqual([[2, 0], [0, 2], [2, 2]], scale_matrix())
+
+      # Check that the returned gradient is correct.
+      grad_function = backprop.implicit_grad(scale_matrix)
+      grads_and_vars = grad_function()
+      indexed_slice = grads_and_vars[0][0]
+      gradient = grads_and_vars[0][0].values
+
+      self.assertAllEqual([0, 1, 2], indexed_slice.indices)
+      self.assertAllEqual([[2, 2], [2, 2], [2, 2]], gradient)
+
+
+class FunctionalInputLayerTest(test.TestCase):
+
+  def test_raises_if_empty_feature_columns(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 'feature_columns must not be empty'):
+      fc.input_layer(features={}, feature_columns=[])
+
+  def test_should_be_dense_column(self):
+    with self.assertRaisesRegexp(ValueError, 'must be a _DenseColumn'):
+      fc.input_layer(
+          features={'a': [[0]]},
+          feature_columns=[
+              fc_old.categorical_column_with_hash_bucket('wire_cast', 4)
+          ])
+
+  def test_does_not_support_dict_columns(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'Expected feature_columns to be iterable, found dict.'):
+      fc.input_layer(
+          features={'a': [[0]]},
+          feature_columns={'a': fc_old.numeric_column('a')})
+
+  def test_bare_column(self):
+    with ops.Graph().as_default():
+      features = features = {'a': [0.]}
+      net = fc.input_layer(features, fc_old.numeric_column('a'))
+      with _initialized_session():
+        self.assertAllClose([[0.]], net.eval())
+
+  def test_column_generator(self):
+    with ops.Graph().as_default():
+      features = features = {'a': [0.], 'b': [1.]}
+      columns = (fc_old.numeric_column(key) for key in features)
+      net = fc.input_layer(features, columns)
+      with _initialized_session():
+        self.assertAllClose([[0., 1.]], net.eval())
+
+  def test_raises_if_duplicate_name(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'Duplicate feature column name found for columns'):
+      fc.input_layer(
+          features={'a': [[0]]},
+          feature_columns=[
+              fc_old.numeric_column('a'),
+              fc_old.numeric_column('a')
+          ])
+
+  def test_one_column(self):
+    price = fc_old.numeric_column('price')
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      net = fc.input_layer(features, [price])
+      with _initialized_session():
+        self.assertAllClose([[1.], [5.]], net.eval())
+
+  def test_multi_dimension(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1., 2.], [5., 6.]]}
+      net = fc.input_layer(features, [price])
+      with _initialized_session():
+        self.assertAllClose([[1., 2.], [5., 6.]], net.eval())
+
+  def test_raises_if_shape_mismatch(self):
+    price = fc_old.numeric_column('price', shape=2)
+    with ops.Graph().as_default():
+      features = {'price': [[1.], [5.]]}
+      with self.assertRaisesRegexp(
+          Exception,
+          r'Cannot reshape a tensor with 2 elements to shape \[2,2\]'):
+        fc.input_layer(features, [price])
+
+  def test_reshaping(self):
+    price = fc_old.numeric_column('price', shape=[1, 2])
+    with ops.Graph().as_default():
+      features = {'price': [[[1., 2.]], [[5., 6.]]]}
+      net = fc.input_layer(features, [price])
+      with _initialized_session():
+        self.assertAllClose([[1., 2.], [5., 6.]], net.eval())
+
+  def test_multi_column(self):
+    price1 = fc_old.numeric_column('price1', shape=2)
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[1., 2.], [5., 6.]],
+          'price2': [[3.], [4.]]
+      }
+      net = fc.input_layer(features, [price1, price2])
+      with _initialized_session():
+        self.assertAllClose([[1., 2., 3.], [5., 6., 4.]], net.eval())
+
+  def test_fills_cols_to_vars(self):
+    # Provide three _DenseColumn's to input_layer: a _NumericColumn, a
+    # _BucketizedColumn, and an _EmbeddingColumn.  Only the _EmbeddingColumn
+    # creates a Variable.
+    price1 = fc_old.numeric_column('price1')
+    dense_feature = fc_old.numeric_column('dense_feature')
+    dense_feature_bucketized = fc_old.bucketized_column(
+        dense_feature, boundaries=[0.])
+    some_sparse_column = fc_old.categorical_column_with_hash_bucket(
+        'sparse_feature', hash_bucket_size=5)
+    some_embedding_column = fc_old.embedding_column(
+        some_sparse_column, dimension=10)
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[3.], [4.]],
+          'dense_feature': [[-1.], [4.]],
+          'sparse_feature': [['a'], ['x']],
+      }
+      cols_to_vars = {}
+      all_cols = [price1, dense_feature_bucketized, some_embedding_column]
+      fc.input_layer(features, all_cols, cols_to_vars=cols_to_vars)
+      self.assertItemsEqual(list(cols_to_vars.keys()), all_cols)
+      self.assertEqual(0, len(cols_to_vars[price1]))
+      self.assertEqual(0, len(cols_to_vars[dense_feature_bucketized]))
+      self.assertEqual(1, len(cols_to_vars[some_embedding_column]))
+      self.assertIsInstance(cols_to_vars[some_embedding_column][0],
+                            variables_lib.Variable)
+      self.assertAllEqual(cols_to_vars[some_embedding_column][0].shape, [5, 10])
+
+  def test_fills_cols_to_vars_partitioned_variables(self):
+    price1 = fc_old.numeric_column('price1')
+    dense_feature = fc_old.numeric_column('dense_feature')
+    dense_feature_bucketized = fc_old.bucketized_column(
+        dense_feature, boundaries=[0.])
+    some_sparse_column = fc_old.categorical_column_with_hash_bucket(
+        'sparse_feature', hash_bucket_size=5)
+    some_embedding_column = fc_old.embedding_column(
+        some_sparse_column, dimension=10)
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[3.], [4.]],
+          'dense_feature': [[-1.], [4.]],
+          'sparse_feature': [['a'], ['x']],
+      }
+      cols_to_vars = {}
+      all_cols = [price1, dense_feature_bucketized, some_embedding_column]
+      with variable_scope.variable_scope(
+          'input_from_feature_columns',
+          partitioner=partitioned_variables.fixed_size_partitioner(3, axis=0)):
+        fc.input_layer(features, all_cols, cols_to_vars=cols_to_vars)
+      self.assertItemsEqual(list(cols_to_vars.keys()), all_cols)
+      self.assertEqual(0, len(cols_to_vars[price1]))
+      self.assertEqual(0, len(cols_to_vars[dense_feature_bucketized]))
+      self.assertEqual(3, len(cols_to_vars[some_embedding_column]))
+      self.assertAllEqual(cols_to_vars[some_embedding_column][0].shape, [2, 10])
+      self.assertAllEqual(cols_to_vars[some_embedding_column][1].shape, [2, 10])
+      self.assertAllEqual(cols_to_vars[some_embedding_column][2].shape, [1, 10])
+
+  def test_column_order(self):
+    price_a = fc_old.numeric_column('price_a')
+    price_b = fc_old.numeric_column('price_b')
+    with ops.Graph().as_default():
+      features = {
+          'price_a': [[1.]],
+          'price_b': [[3.]],
+      }
+      net1 = fc.input_layer(features, [price_a, price_b])
+      net2 = fc.input_layer(features, [price_b, price_a])
+      with _initialized_session():
+        self.assertAllClose([[1., 3.]], net1.eval())
+        self.assertAllClose([[1., 3.]], net2.eval())
+
+  def test_fails_for_categorical_column(self):
+    animal = fc_old.categorical_column_with_identity('animal', num_buckets=4)
+    with ops.Graph().as_default():
+      features = {
+          'animal':
+              sparse_tensor.SparseTensor(
+                  indices=[[0, 0], [0, 1]], values=[1, 2], dense_shape=[1, 2])
+      }
+      with self.assertRaisesRegexp(Exception, 'must be a _DenseColumn'):
+        fc.input_layer(features, [animal])
+
+  def test_static_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': [[1.], [5.], [7.]],  # batchsize = 3
+          'price2': [[3.], [4.]]  # batchsize = 2
+      }
+      with self.assertRaisesRegexp(
+          ValueError,
+          'Batch size \(first dimension\) of each feature must be same.'):  # pylint: disable=anomalous-backslash-in-string
+        fc.input_layer(features, [price1, price2])
+
+  def test_subset_of_static_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    price3 = fc_old.numeric_column('price3')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 3
+          'price2': [[3.], [4.]],  # batchsize = 2
+          'price3': [[3.], [4.], [5.]]  # batchsize = 3
+      }
+      with self.assertRaisesRegexp(
+          ValueError,
+          'Batch size \(first dimension\) of each feature must be same.'):  # pylint: disable=anomalous-backslash-in-string
+        fc.input_layer(features, [price1, price2, price3])
+
+  def test_runtime_batch_size_mismatch(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 3
+          'price2': [[3.], [4.]]  # batchsize = 2
+      }
+      net = fc.input_layer(features, [price1, price2])
+      with _initialized_session() as sess:
+        with self.assertRaisesRegexp(errors.OpError,
+                                     'Dimensions of inputs should match'):
+          sess.run(net, feed_dict={features['price1']: [[1.], [5.], [7.]]})
+
+  def test_runtime_batch_size_matches(self):
+    price1 = fc_old.numeric_column('price1')
+    price2 = fc_old.numeric_column('price2')
+    with ops.Graph().as_default():
+      features = {
+          'price1': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 2
+          'price2': array_ops.placeholder(dtype=dtypes.int64),  # batchsize = 2
+      }
+      net = fc.input_layer(features, [price1, price2])
+      with _initialized_session() as sess:
+        sess.run(
+            net,
+            feed_dict={
+                features['price1']: [[1.], [5.]],
+                features['price2']: [[1.], [5.]],
+            })
+
+  def test_multiple_layers_with_same_embedding_column(self):
+    some_sparse_column = fc_old.categorical_column_with_hash_bucket(
+        'sparse_feature', hash_bucket_size=5)
+    some_embedding_column = fc_old.embedding_column(
+        some_sparse_column, dimension=10)
+
+    with ops.Graph().as_default():
+      features = {
+          'sparse_feature': [['a'], ['x']],
+      }
+      all_cols = [some_embedding_column]
+      fc.input_layer(features, all_cols)
+      fc.input_layer(features, all_cols)
+      # Make sure that 2 variables get created in this case.
+      self.assertEqual(2, len(
+          ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)))
+      expected_var_names = [
+          'input_layer/sparse_feature_embedding/embedding_weights:0',
+          'input_layer_1/sparse_feature_embedding/embedding_weights:0'
+      ]
+      self.assertItemsEqual(
+          expected_var_names,
+          [v.name for v in ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)])
+
+  def test_multiple_layers_with_same_shared_embedding_column(self):
+    categorical_column_a = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    categorical_column_b = fc_old.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    embedding_dimension = 2
+    embedding_column_b, embedding_column_a = fc_old.shared_embedding_columns(
+        [categorical_column_b, categorical_column_a],
+        dimension=embedding_dimension)
+
+    with ops.Graph().as_default():
+      features = {
+          'aaa':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(0, 1, 0),
+                  dense_shape=(2, 2)),
+          'bbb':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(1, 2, 1),
+                  dense_shape=(2, 2)),
+      }
+      all_cols = [embedding_column_a, embedding_column_b]
+      fc.input_layer(features, all_cols)
+      fc.input_layer(features, all_cols)
+      # Make sure that only 1 variable gets created in this case.
+      self.assertEqual(1, len(
+          ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)))
+      self.assertItemsEqual(
+          ['input_layer/aaa_bbb_shared_embedding/embedding_weights:0'],
+          [v.name for v in ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)])
+
+  def test_multiple_layers_with_same_shared_embedding_column_diff_graphs(self):
+    categorical_column_a = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    categorical_column_b = fc_old.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    embedding_dimension = 2
+    embedding_column_b, embedding_column_a = fc_old.shared_embedding_columns(
+        [categorical_column_b, categorical_column_a],
+        dimension=embedding_dimension)
+    all_cols = [embedding_column_a, embedding_column_b]
+
+    with ops.Graph().as_default():
+      features = {
+          'aaa':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(0, 1, 0),
+                  dense_shape=(2, 2)),
+          'bbb':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(1, 2, 1),
+                  dense_shape=(2, 2)),
+      }
+      fc.input_layer(features, all_cols)
+      # Make sure that only 1 variable gets created in this case.
+      self.assertEqual(1, len(
+          ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)))
+
+    with ops.Graph().as_default():
+      features1 = {
+          'aaa':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(0, 1, 0),
+                  dense_shape=(2, 2)),
+          'bbb':
+              sparse_tensor.SparseTensor(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(1, 2, 1),
+                  dense_shape=(2, 2)),
+      }
+
+      fc.input_layer(features1, all_cols)
+      # Make sure that only 1 variable gets created in this case.
+      self.assertEqual(1, len(
+          ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)))
+      self.assertItemsEqual(
+          ['input_layer/aaa_bbb_shared_embedding/embedding_weights:0'],
+          [v.name for v in ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)])
+
+  def test_with_numpy_input_fn(self):
+    embedding_values = (
+        (1., 2., 3., 4., 5.),  # id 0
+        (6., 7., 8., 9., 10.),  # id 1
+        (11., 12., 13., 14., 15.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      del shape, dtype, partition_info
+      return embedding_values
+
+    # price has 1 dimension in input_layer
+    price = fc_old.numeric_column('price')
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+    # one_hot_body_style has 3 dims in input_layer.
+    one_hot_body_style = fc_old.indicator_column(body_style)
+    # embedded_body_style has 5 dims in input_layer.
+    embedded_body_style = fc_old.embedding_column(
+        body_style, dimension=5, initializer=_initializer)
+
+    input_fn = numpy_io.numpy_input_fn(
+        x={
+            'price': np.array([11., 12., 13., 14.]),
+            'body-style': np.array(['sedan', 'hardtop', 'wagon', 'sedan']),
+        },
+        batch_size=2,
+        shuffle=False)
+    features = input_fn()
+    net = fc.input_layer(features,
+                         [price, one_hot_body_style, embedded_body_style])
+    self.assertEqual(1 + 3 + 5, net.shape[1])
+    with _initialized_session() as sess:
+      coord = coordinator.Coordinator()
+      threads = queue_runner_impl.start_queue_runners(sess, coord=coord)
+
+      # Each row is formed by concatenating `embedded_body_style`,
+      # `one_hot_body_style`, and `price` in order.
+      self.assertAllEqual(
+          [[11., 12., 13., 14., 15., 0., 0., 1., 11.],
+           [1., 2., 3., 4., 5., 1., 0., 0., 12]],
+          sess.run(net))
+
+      coord.request_stop()
+      coord.join(threads)
+
+  def test_with_1d_sparse_tensor(self):
+    embedding_values = (
+        (1., 2., 3., 4., 5.),  # id 0
+        (6., 7., 8., 9., 10.),  # id 1
+        (11., 12., 13., 14., 15.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      del shape, dtype, partition_info
+      return embedding_values
+
+    # price has 1 dimension in input_layer
+    price = fc_old.numeric_column('price')
+
+    # one_hot_body_style has 3 dims in input_layer.
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+    one_hot_body_style = fc_old.indicator_column(body_style)
+
+    # embedded_body_style has 5 dims in input_layer.
+    country = fc_old.categorical_column_with_vocabulary_list(
+        'country', vocabulary_list=['US', 'JP', 'CA'])
+    embedded_country = fc_old.embedding_column(
+        country, dimension=5, initializer=_initializer)
+
+    # Provides 1-dim tensor and dense tensor.
+    features = {
+        'price': constant_op.constant([11., 12.,]),
+        'body-style': sparse_tensor.SparseTensor(
+            indices=((0,), (1,)),
+            values=('sedan', 'hardtop'),
+            dense_shape=(2,)),
+        # This is dense tensor for the categorical_column.
+        'country': constant_op.constant(['CA', 'US']),
+    }
+    self.assertEqual(1, features['price'].shape.ndims)
+    self.assertEqual(1, features['body-style'].dense_shape.get_shape()[0])
+    self.assertEqual(1, features['country'].shape.ndims)
+
+    net = fc.input_layer(features,
+                         [price, one_hot_body_style, embedded_country])
+    self.assertEqual(1 + 3 + 5, net.shape[1])
+    with _initialized_session() as sess:
+
+      # Each row is formed by concatenating `embedded_body_style`,
+      # `one_hot_body_style`, and `price` in order.
+      self.assertAllEqual(
+          [[0., 0., 1., 11., 12., 13., 14., 15., 11.],
+           [1., 0., 0., 1., 2., 3., 4., 5., 12.]],
+          sess.run(net))
+
+  def test_with_1d_unknown_shape_sparse_tensor(self):
+    embedding_values = (
+        (1., 2.),  # id 0
+        (6., 7.),  # id 1
+        (11., 12.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      del shape, dtype, partition_info
+      return embedding_values
+
+    # price has 1 dimension in input_layer
+    price = fc_old.numeric_column('price')
+
+    # one_hot_body_style has 3 dims in input_layer.
+    body_style = fc_old.categorical_column_with_vocabulary_list(
+        'body-style', vocabulary_list=['hardtop', 'wagon', 'sedan'])
+    one_hot_body_style = fc_old.indicator_column(body_style)
+
+    # embedded_body_style has 5 dims in input_layer.
+    country = fc_old.categorical_column_with_vocabulary_list(
+        'country', vocabulary_list=['US', 'JP', 'CA'])
+    embedded_country = fc_old.embedding_column(
+        country, dimension=2, initializer=_initializer)
+
+    # Provides 1-dim tensor and dense tensor.
+    features = {
+        'price': array_ops.placeholder(dtypes.float32),
+        'body-style': array_ops.sparse_placeholder(dtypes.string),
+        # This is dense tensor for the categorical_column.
+        'country': array_ops.placeholder(dtypes.string),
+    }
+    self.assertIsNone(features['price'].shape.ndims)
+    self.assertIsNone(features['body-style'].get_shape().ndims)
+    self.assertIsNone(features['country'].shape.ndims)
+
+    price_data = np.array([11., 12.])
+    body_style_data = sparse_tensor.SparseTensorValue(
+        indices=((0,), (1,)),
+        values=('sedan', 'hardtop'),
+        dense_shape=(2,))
+    country_data = np.array([['US'], ['CA']])
+
+    net = fc.input_layer(features,
+                         [price, one_hot_body_style, embedded_country])
+    self.assertEqual(1 + 3 + 2, net.shape[1])
+    with _initialized_session() as sess:
+
+      # Each row is formed by concatenating `embedded_body_style`,
+      # `one_hot_body_style`, and `price` in order.
+      self.assertAllEqual(
+          [[0., 0., 1., 1., 2., 11.], [1., 0., 0., 11., 12., 12.]],
+          sess.run(
+              net,
+              feed_dict={
+                  features['price']: price_data,
+                  features['body-style']: body_style_data,
+                  features['country']: country_data
+              }))
+
+  def test_with_rank_0_feature(self):
+    # price has 1 dimension in input_layer
+    price = fc_old.numeric_column('price')
+    features = {
+        'price': constant_op.constant(0),
+    }
+    self.assertEqual(0, features['price'].shape.ndims)
+
+    # Static rank 0 should fail
+    with self.assertRaisesRegexp(ValueError, 'Feature .* cannot have rank 0'):
+      fc.input_layer(features, [price])
+
+    # Dynamic rank 0 should fail
+    features = {
+        'price': array_ops.placeholder(dtypes.float32),
+    }
+    net = fc.input_layer(features, [price])
+    self.assertEqual(1, net.shape[1])
+    with _initialized_session() as sess:
+      with self.assertRaisesOpError('Feature .* cannot have rank 0'):
+        sess.run(net, feed_dict={features['price']: np.array(1)})
+
+
+class MakeParseExampleSpecTest(test.TestCase):
+
+  class _TestFeatureColumn(FeatureColumn,
+                           collections.namedtuple('_TestFeatureColumn',
+                                                  ('parse_spec'))):
+
+    @property
+    def name(self):
+      return "_TestFeatureColumn"
+
+    def transform_feature(self, transformation_cache, state_manager):
+      pass
+
+    @property
+    def parse_example_spec(self):
+      return self.parse_spec
+
+  def test_no_feature_columns(self):
+    actual = fc.make_parse_example_spec([])
+    self.assertDictEqual({}, actual)
+
+  def test_invalid_type(self):
+    key1 = 'key1'
+    parse_spec1 = parsing_ops.FixedLenFeature(
+        shape=(2,), dtype=dtypes.float32, default_value=0.)
+    with self.assertRaisesRegexp(
+        ValueError,
+        'All feature_columns must be FeatureColumn instances.*invalid_column'):
+      fc.make_parse_example_spec(
+          (self._TestFeatureColumn({key1: parse_spec1}), 'invalid_column'))
+
+  def test_one_feature_column(self):
+    key1 = 'key1'
+    parse_spec1 = parsing_ops.FixedLenFeature(
+        shape=(2,), dtype=dtypes.float32, default_value=0.)
+    actual = fc.make_parse_example_spec(
+        (self._TestFeatureColumn({key1: parse_spec1}),))
+    self.assertDictEqual({key1: parse_spec1}, actual)
+
+  def test_two_feature_columns(self):
+    key1 = 'key1'
+    parse_spec1 = parsing_ops.FixedLenFeature(
+        shape=(2,), dtype=dtypes.float32, default_value=0.)
+    key2 = 'key2'
+    parse_spec2 = parsing_ops.VarLenFeature(dtype=dtypes.string)
+    actual = fc.make_parse_example_spec(
+        (self._TestFeatureColumn({key1: parse_spec1}),
+         self._TestFeatureColumn({key2: parse_spec2})))
+    self.assertDictEqual({key1: parse_spec1, key2: parse_spec2}, actual)
+
+  def test_equal_keys_different_parse_spec(self):
+    key1 = 'key1'
+    parse_spec1 = parsing_ops.FixedLenFeature(
+        shape=(2,), dtype=dtypes.float32, default_value=0.)
+    parse_spec2 = parsing_ops.VarLenFeature(dtype=dtypes.string)
+    with self.assertRaisesRegexp(
+        ValueError,
+        'feature_columns contain different parse_spec for key key1'):
+      fc.make_parse_example_spec(
+          (self._TestFeatureColumn({key1: parse_spec1}),
+           self._TestFeatureColumn({key1: parse_spec2})))
+
+  def test_equal_keys_equal_parse_spec(self):
+    key1 = 'key1'
+    parse_spec1 = parsing_ops.FixedLenFeature(
+        shape=(2,), dtype=dtypes.float32, default_value=0.)
+    actual = fc.make_parse_example_spec(
+        (self._TestFeatureColumn({key1: parse_spec1}),
+         self._TestFeatureColumn({key1: parse_spec1})))
+    self.assertDictEqual({key1: parse_spec1}, actual)
+
+  def test_multiple_features_dict(self):
+    """parse_spc for one column is a dict with length > 1."""
+    key1 = 'key1'
+    parse_spec1 = parsing_ops.FixedLenFeature(
+        shape=(2,), dtype=dtypes.float32, default_value=0.)
+    key2 = 'key2'
+    parse_spec2 = parsing_ops.VarLenFeature(dtype=dtypes.string)
+    key3 = 'key3'
+    parse_spec3 = parsing_ops.VarLenFeature(dtype=dtypes.int32)
+    actual = fc.make_parse_example_spec(
+        (self._TestFeatureColumn({key1: parse_spec1}),
+         self._TestFeatureColumn({key2: parse_spec2, key3: parse_spec3})))
+    self.assertDictEqual(
+        {key1: parse_spec1, key2: parse_spec2, key3: parse_spec3}, actual)
+
+
+def _assert_sparse_tensor_value(test_case, expected, actual):
+  test_case.assertEqual(np.int64, np.array(actual.indices).dtype)
+  test_case.assertAllEqual(expected.indices, actual.indices)
+
+  test_case.assertEqual(
+      np.array(expected.values).dtype, np.array(actual.values).dtype)
+  test_case.assertAllEqual(expected.values, actual.values)
+
+  test_case.assertEqual(np.int64, np.array(actual.dense_shape).dtype)
+  test_case.assertAllEqual(expected.dense_shape, actual.dense_shape)
+
+
+class VocabularyFileCategoricalColumnTest(test.TestCase):
+
+  def setUp(self):
+    super(VocabularyFileCategoricalColumnTest, self).setUp()
+
+    # Contains ints, Golden State Warriors jersey numbers: 30, 35, 11, 23, 22
+    self._warriors_vocabulary_file_name = test.test_src_dir_path(
+        'python/feature_column/testdata/warriors_vocabulary.txt')
+    self._warriors_vocabulary_size = 5
+
+    # Contains strings, character names from 'The Wire': omar, stringer, marlo
+    self._wire_vocabulary_file_name = test.test_src_dir_path(
+        'python/feature_column/testdata/wire_vocabulary.txt')
+    self._wire_vocabulary_size = 3
+
+  def test_defaults(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa', vocabulary_file='path_to_file', vocabulary_size=3)
+    self.assertEqual('aaa', column.name)
+    self.assertEqual('aaa', column.key)
+    self.assertEqual(3, column.num_buckets)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.string)
+    }, column.parse_example_spec)
+
+  def test_key_should_be_string(self):
+    with self.assertRaisesRegexp(ValueError, 'key must be a string.'):
+      fc.categorical_column_with_vocabulary_file(
+          key=('aaa',), vocabulary_file='path_to_file', vocabulary_size=3)
+
+  def test_all_constructor_args(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa', vocabulary_file='path_to_file', vocabulary_size=3,
+        num_oov_buckets=4, dtype=dtypes.int32)
+    self.assertEqual(7, column.num_buckets)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int32)
+    }, column.parse_example_spec)
+
+  def test_deep_copy(self):
+    original = fc.categorical_column_with_vocabulary_file(
+        key='aaa', vocabulary_file='path_to_file', vocabulary_size=3,
+        num_oov_buckets=4, dtype=dtypes.int32)
+    for column in (original, copy.deepcopy(original)):
+      self.assertEqual('aaa', column.name)
+      self.assertEqual(7, column.num_buckets)
+      self.assertEqual({
+          'aaa': parsing_ops.VarLenFeature(dtypes.int32)
+      }, column.parse_example_spec)
+
+  def test_vocabulary_file_none(self):
+    with self.assertRaisesRegexp(ValueError, 'Missing vocabulary_file'):
+      fc.categorical_column_with_vocabulary_file(
+          key='aaa', vocabulary_file=None, vocabulary_size=3)
+
+  def test_vocabulary_file_empty_string(self):
+    with self.assertRaisesRegexp(ValueError, 'Missing vocabulary_file'):
+      fc.categorical_column_with_vocabulary_file(
+          key='aaa', vocabulary_file='', vocabulary_size=3)
+
+  def test_invalid_vocabulary_file(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa', vocabulary_file='file_does_not_exist', vocabulary_size=10)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    column.get_sparse_tensors(FeatureTransformationCache({'aaa': inputs}), None)
+    with self.assertRaisesRegexp(errors.OpError, 'file_does_not_exist'):
+      with self.test_session():
+        lookup_ops.tables_initializer().run()
+
+  def test_invalid_vocabulary_size(self):
+    with self.assertRaisesRegexp(ValueError, 'Invalid vocabulary_size'):
+      fc.categorical_column_with_vocabulary_file(
+          key='aaa', vocabulary_file=self._wire_vocabulary_file_name,
+          vocabulary_size=-1)
+    with self.assertRaisesRegexp(ValueError, 'Invalid vocabulary_size'):
+      fc.categorical_column_with_vocabulary_file(
+          key='aaa', vocabulary_file=self._wire_vocabulary_file_name,
+          vocabulary_size=0)
+
+  def test_too_large_vocabulary_size(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size + 1)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    column.get_sparse_tensors(FeatureTransformationCache({'aaa': inputs}), None)
+    with self.assertRaisesRegexp(errors.OpError, 'Invalid vocab_size'):
+      with self.test_session():
+        lookup_ops.tables_initializer().run()
+
+  def test_invalid_num_oov_buckets(self):
+    with self.assertRaisesRegexp(ValueError, 'Invalid num_oov_buckets'):
+      fc.categorical_column_with_vocabulary_file(
+          key='aaa', vocabulary_file='path', vocabulary_size=3,
+          num_oov_buckets=-1)
+
+  def test_invalid_dtype(self):
+    with self.assertRaisesRegexp(ValueError, 'dtype must be string or integer'):
+      fc.categorical_column_with_vocabulary_file(
+          key='aaa', vocabulary_file='path', vocabulary_size=3,
+          dtype=dtypes.float64)
+
+  def test_invalid_buckets_and_default_value(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'both num_oov_buckets and default_value'):
+      fc.categorical_column_with_vocabulary_file(
+          key='aaa',
+          vocabulary_file=self._wire_vocabulary_file_name,
+          vocabulary_size=self._wire_vocabulary_size,
+          num_oov_buckets=100,
+          default_value=2)
+
+  def test_invalid_input_dtype_int32(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size,
+        dtype=dtypes.string)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(12, 24, 36),
+        dense_shape=(2, 2))
+    with self.assertRaisesRegexp(ValueError, 'dtype must be compatible'):
+      column.get_sparse_tensors(
+          FeatureTransformationCache({
+              'aaa': inputs
+          }), None)
+
+  def test_invalid_input_dtype_string(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._warriors_vocabulary_file_name,
+        vocabulary_size=self._warriors_vocabulary_size,
+        dtype=dtypes.int32)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('omar', 'stringer', 'marlo'),
+        dense_shape=(2, 2))
+    with self.assertRaisesRegexp(ValueError, 'dtype must be compatible'):
+      column.get_sparse_tensors(
+          FeatureTransformationCache({
+              'aaa': inputs
+          }), None)
+
+  def test_parse_example(self):
+    a = fc.categorical_column_with_vocabulary_file(
+        key='aaa', vocabulary_file='path_to_file', vocabulary_size=3)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer']))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a]))
+    self.assertIn('aaa', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'omar', b'stringer'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+
+  def test_get_sparse_tensors(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, -1, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_none_vocabulary_size(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa', vocabulary_file=self._wire_vocabulary_file_name)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(self,
+                                  sparse_tensor.SparseTensorValue(
+                                      indices=inputs.indices,
+                                      values=np.array(
+                                          (2, -1, 0), dtype=np.int64),
+                                      dense_shape=inputs.dense_shape),
+                                  id_weight_pair.id_tensor.eval())
+
+  def test_transform_feature(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_tensor = _transform_features({'aaa': inputs}, [column], None)[column]
+    with _initialized_session():
+      _assert_sparse_tensor_value(self,
+                                  sparse_tensor.SparseTensorValue(
+                                      indices=inputs.indices,
+                                      values=np.array(
+                                          (2, -1, 0), dtype=np.int64),
+                                      dense_shape=inputs.dense_shape),
+                                  id_tensor.eval())
+
+  def DISABLED_test_get_sparse_tensors_weight_collections(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size)
+    inputs = sparse_tensor.SparseTensor(
+        values=['omar', 'stringer', 'marlo'],
+        indices=[[0, 0], [1, 0], [1, 1]],
+        dense_shape=[2, 2])
+    column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }),
+        weight_collections=('my_weights',))
+
+    self.assertItemsEqual(
+        [], ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES))
+    self.assertItemsEqual([], ops.get_collection('my_weights'))
+
+  def test_get_sparse_tensors_dense_input(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size)
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': (('marlo', ''), ('skywalker', 'omar'))
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=np.array((2, -1, 0), dtype=np.int64),
+              dense_shape=(2, 2)),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_default_value_in_vocabulary(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size,
+        default_value=2)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, 2, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_with_oov_buckets(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size,
+        num_oov_buckets=100)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1), (1, 2)),
+        values=('marlo', 'skywalker', 'omar', 'heisenberg'),
+        dense_shape=(2, 3))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, 33, 0, 62), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_small_vocabulary_size(self):
+    # 'marlo' is the last entry in our vocabulary file, so be setting
+    # `vocabulary_size` to 1 less than number of entries in file, we take
+    # 'marlo' out of the vocabulary.
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size - 1)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((-1, -1, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_int32(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._warriors_vocabulary_file_name,
+        vocabulary_size=self._warriors_vocabulary_size,
+        dtype=dtypes.int32)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1), (2, 2)),
+        values=(11, 100, 30, 22),
+        dense_shape=(3, 3))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, -1, 0, 4), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_int32_dense_input(self):
+    default_value = -100
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._warriors_vocabulary_file_name,
+        vocabulary_size=self._warriors_vocabulary_size,
+        dtype=dtypes.int32,
+        default_value=default_value)
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': ((11, -1, -1), (100, 30, -1), (-1, -1, 22))
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1), (2, 2)),
+              values=np.array((2, default_value, 0, 4), dtype=np.int64),
+              dense_shape=(3, 3)),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_int32_with_oov_buckets(self):
+    column = fc.categorical_column_with_vocabulary_file(
+        key='aaa',
+        vocabulary_file=self._warriors_vocabulary_file_name,
+        vocabulary_size=self._warriors_vocabulary_size,
+        dtype=dtypes.int32,
+        num_oov_buckets=100)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1), (2, 2)),
+        values=(11, 100, 30, 22),
+        dense_shape=(3, 3))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, 60, 0, 4), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_linear_model(self):
+    wire_column = fc_old.categorical_column_with_vocabulary_file(
+        key='wire',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size,
+        num_oov_buckets=1)
+    self.assertEqual(4, wire_column._num_buckets)
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          wire_column.name: sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=('marlo', 'skywalker', 'omar'),
+              dense_shape=(2, 2))
+      }, (wire_column,))
+      bias = get_linear_model_bias()
+      wire_var = get_linear_model_column_var(wire_column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,), (0.,)), wire_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        wire_var.assign(((1.,), (2.,), (3.,), (4.,))).eval()
+        # 'marlo' -> 2: wire_var[2] = 3
+        # 'skywalker' -> 3, 'omar' -> 0: wire_var[3] + wire_var[0] = 4+1 = 5
+        self.assertAllClose(((3.,), (5.,)), predictions.eval())
+
+  def test_keras_linear_model(self):
+    wire_column = fc_old.categorical_column_with_vocabulary_file(
+        key='wire',
+        vocabulary_file=self._wire_vocabulary_file_name,
+        vocabulary_size=self._wire_vocabulary_size,
+        num_oov_buckets=1)
+    self.assertEqual(4, wire_column._num_buckets)
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          wire_column.name:
+              sparse_tensor.SparseTensorValue(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=('marlo', 'skywalker', 'omar'),
+                  dense_shape=(2, 2))
+      }, (wire_column,))
+      bias = get_linear_model_bias()
+      wire_var = get_linear_model_column_var(wire_column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,), (0.,)), wire_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        wire_var.assign(((1.,), (2.,), (3.,), (4.,))).eval()
+        # 'marlo' -> 2: wire_var[2] = 3
+        # 'skywalker' -> 3, 'omar' -> 0: wire_var[3] + wire_var[0] = 4+1 = 5
+        self.assertAllClose(((3.,), (5.,)), predictions.eval())
+
+
+class VocabularyListCategoricalColumnTest(test.TestCase):
+
+  def test_defaults_string(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'))
+    self.assertEqual('aaa', column.name)
+    self.assertEqual('aaa', column.key)
+    self.assertEqual(3, column.num_buckets)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.string)
+    }, column.parse_example_spec)
+
+  def test_key_should_be_string(self):
+    with self.assertRaisesRegexp(ValueError, 'key must be a string.'):
+      fc.categorical_column_with_vocabulary_list(
+          key=('aaa',), vocabulary_list=('omar', 'stringer', 'marlo'))
+
+  def test_defaults_int(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=(12, 24, 36))
+    self.assertEqual('aaa', column.name)
+    self.assertEqual('aaa', column.key)
+    self.assertEqual(3, column.num_buckets)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+    }, column.parse_example_spec)
+
+  def test_all_constructor_args(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=(12, 24, 36), dtype=dtypes.int32,
+        default_value=-99)
+    self.assertEqual(3, column.num_buckets)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int32)
+    }, column.parse_example_spec)
+
+  def test_deep_copy(self):
+    original = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=(12, 24, 36), dtype=dtypes.int32)
+    for column in (original, copy.deepcopy(original)):
+      self.assertEqual('aaa', column.name)
+      self.assertEqual(3, column.num_buckets)
+      self.assertEqual({
+          'aaa': parsing_ops.VarLenFeature(dtypes.int32)
+      }, column.parse_example_spec)
+
+  def test_invalid_dtype(self):
+    with self.assertRaisesRegexp(ValueError, 'dtype must be string or integer'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'),
+          dtype=dtypes.float32)
+
+  def test_invalid_mapping_dtype(self):
+    with self.assertRaisesRegexp(
+        ValueError, r'vocabulary dtype must be string or integer'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=(12., 24., 36.))
+
+  def test_mismatched_int_dtype(self):
+    with self.assertRaisesRegexp(
+        ValueError, r'dtype.*and vocabulary dtype.*do not match'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'),
+          dtype=dtypes.int32)
+
+  def test_mismatched_string_dtype(self):
+    with self.assertRaisesRegexp(
+        ValueError, r'dtype.*and vocabulary dtype.*do not match'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=(12, 24, 36), dtype=dtypes.string)
+
+  def test_none_mapping(self):
+    with self.assertRaisesRegexp(
+        ValueError, r'vocabulary_list.*must be non-empty'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=None)
+
+  def test_empty_mapping(self):
+    with self.assertRaisesRegexp(
+        ValueError, r'vocabulary_list.*must be non-empty'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=tuple([]))
+
+  def test_duplicate_mapping(self):
+    with self.assertRaisesRegexp(ValueError, 'Duplicate keys'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=(12, 24, 12))
+
+  def test_invalid_num_oov_buckets(self):
+    with self.assertRaisesRegexp(ValueError, 'Invalid num_oov_buckets'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa', vocabulary_list=(12, 24, 36),
+          num_oov_buckets=-1)
+
+  def test_invalid_buckets_and_default_value(self):
+    with self.assertRaisesRegexp(
+        ValueError, 'both num_oov_buckets and default_value'):
+      fc.categorical_column_with_vocabulary_list(
+          key='aaa',
+          vocabulary_list=(12, 24, 36),
+          num_oov_buckets=100,
+          default_value=2)
+
+  def test_invalid_input_dtype_int32(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'))
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(12, 24, 36),
+        dense_shape=(2, 2))
+    with self.assertRaisesRegexp(ValueError, 'dtype must be compatible'):
+      column.get_sparse_tensors(
+          FeatureTransformationCache({
+              'aaa': inputs
+          }), None)
+
+  def test_invalid_input_dtype_string(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=(12, 24, 36))
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('omar', 'stringer', 'marlo'),
+        dense_shape=(2, 2))
+    with self.assertRaisesRegexp(ValueError, 'dtype must be compatible'):
+      column.get_sparse_tensors(
+          FeatureTransformationCache({
+              'aaa': inputs
+          }), None)
+
+  def test_parse_example_string(self):
+    a = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'))
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer']))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a]))
+    self.assertIn('aaa', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'omar', b'stringer'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+
+  def test_parse_example_int(self):
+    a = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=(11, 21, 31))
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(int64_list=feature_pb2.Int64List(
+                    value=[11, 21]))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a]))
+    self.assertIn('aaa', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=[11, 21],
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+
+  def test_get_sparse_tensors(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'))
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, -1, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_transform_feature(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'))
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_tensor = _transform_features({'aaa': inputs}, [column], None)[column]
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, -1, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_tensor.eval())
+
+  def DISABLED_test_get_sparse_tensors_weight_collections(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'))
+    inputs = sparse_tensor.SparseTensor(
+        values=['omar', 'stringer', 'marlo'],
+        indices=[[0, 0], [1, 0], [1, 1]],
+        dense_shape=[2, 2])
+    column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }),
+        weight_collections=('my_weights',))
+
+    self.assertItemsEqual(
+        [], ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES))
+    self.assertItemsEqual([], ops.get_collection('my_weights'))
+
+  def test_get_sparse_tensors_dense_input(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': (('marlo', ''), ('skywalker', 'omar'))
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=np.array((2, -1, 0), dtype=np.int64),
+              dense_shape=(2, 2)),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_default_value_in_vocabulary(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'),
+        default_value=2)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('marlo', 'skywalker', 'omar'),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, 2, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_with_oov_buckets(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'),
+        num_oov_buckets=100)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1), (1, 2)),
+        values=('marlo', 'skywalker', 'omar', 'heisenberg'),
+        dense_shape=(2, 3))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, 33, 0, 62), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_int32(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=np.array((30, 35, 11, 23, 22), dtype=np.int32),
+        dtype=dtypes.int32)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1), (2, 2)),
+        values=np.array((11, 100, 30, 22), dtype=np.int32),
+        dense_shape=(3, 3))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, -1, 0, 4), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_int32_dense_input(self):
+    default_value = -100
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=np.array((30, 35, 11, 23, 22), dtype=np.int32),
+        dtype=dtypes.int32,
+        default_value=default_value)
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa':
+                np.array(
+                    ((11, -1, -1), (100, 30, -1), (-1, -1, 22)), dtype=np.int32)
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1), (2, 2)),
+              values=np.array((2, default_value, 0, 4), dtype=np.int64),
+              dense_shape=(3, 3)),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_int32_with_oov_buckets(self):
+    column = fc.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=np.array((30, 35, 11, 23, 22), dtype=np.int32),
+        dtype=dtypes.int32,
+        num_oov_buckets=100)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1), (2, 2)),
+        values=(11, 100, 30, 22),
+        dense_shape=(3, 3))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((2, 60, 0, 4), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_linear_model(self):
+    wire_column = fc_old.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'),
+        num_oov_buckets=1)
+    self.assertEqual(4, wire_column._num_buckets)
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          wire_column.name: sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=('marlo', 'skywalker', 'omar'),
+              dense_shape=(2, 2))
+      }, (wire_column,))
+      bias = get_linear_model_bias()
+      wire_var = get_linear_model_column_var(wire_column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,), (0.,)), wire_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        wire_var.assign(((1.,), (2.,), (3.,), (4.,))).eval()
+        # 'marlo' -> 2: wire_var[2] = 3
+        # 'skywalker' -> 3, 'omar' -> 0: wire_var[3] + wire_var[0] = 4+1 = 5
+        self.assertAllClose(((3.,), (5.,)), predictions.eval())
+
+  def test_keras_linear_model(self):
+    wire_column = fc_old.categorical_column_with_vocabulary_list(
+        key='aaa',
+        vocabulary_list=('omar', 'stringer', 'marlo'),
+        num_oov_buckets=1)
+    self.assertEqual(4, wire_column._num_buckets)
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          wire_column.name:
+              sparse_tensor.SparseTensorValue(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=('marlo', 'skywalker', 'omar'),
+                  dense_shape=(2, 2))
+      }, (wire_column,))
+      bias = get_linear_model_bias()
+      wire_var = get_linear_model_column_var(wire_column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,), (0.,)), wire_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        wire_var.assign(((1.,), (2.,), (3.,), (4.,))).eval()
+        # 'marlo' -> 2: wire_var[2] = 3
+        # 'skywalker' -> 3, 'omar' -> 0: wire_var[3] + wire_var[0] = 4+1 = 5
+        self.assertAllClose(((3.,), (5.,)), predictions.eval())
+
+
+class IdentityCategoricalColumnTest(test.TestCase):
+
+  def test_constructor(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    self.assertEqual('aaa', column.name)
+    self.assertEqual('aaa', column.key)
+    self.assertEqual(3, column.num_buckets)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+    }, column.parse_example_spec)
+
+  def test_key_should_be_string(self):
+    with self.assertRaisesRegexp(ValueError, 'key must be a string.'):
+      fc.categorical_column_with_identity(key=('aaa',), num_buckets=3)
+
+  def test_deep_copy(self):
+    original = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    for column in (original, copy.deepcopy(original)):
+      self.assertEqual('aaa', column.name)
+      self.assertEqual(3, column.num_buckets)
+      self.assertEqual({
+          'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+      }, column.parse_example_spec)
+
+  def test_invalid_num_buckets_zero(self):
+    with self.assertRaisesRegexp(ValueError, 'num_buckets 0 < 1'):
+      fc.categorical_column_with_identity(key='aaa', num_buckets=0)
+
+  def test_invalid_num_buckets_negative(self):
+    with self.assertRaisesRegexp(ValueError, 'num_buckets -1 < 1'):
+      fc.categorical_column_with_identity(key='aaa', num_buckets=-1)
+
+  def test_invalid_default_value_too_small(self):
+    with self.assertRaisesRegexp(ValueError, 'default_value -1 not in range'):
+      fc.categorical_column_with_identity(
+          key='aaa', num_buckets=3, default_value=-1)
+
+  def test_invalid_default_value_too_big(self):
+    with self.assertRaisesRegexp(ValueError, 'default_value 3 not in range'):
+      fc.categorical_column_with_identity(
+          key='aaa', num_buckets=3, default_value=3)
+
+  def test_invalid_input_dtype(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('omar', 'stringer', 'marlo'),
+        dense_shape=(2, 2))
+    with self.assertRaisesRegexp(ValueError, 'Invalid input, not integer'):
+      column.get_sparse_tensors(
+          FeatureTransformationCache({
+              'aaa': inputs
+          }), None)
+
+  def test_parse_example(self):
+    a = fc.categorical_column_with_identity(key='aaa', num_buckets=30)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(int64_list=feature_pb2.Int64List(
+                    value=[11, 21]))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a]))
+    self.assertIn('aaa', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([11, 21], dtype=np.int64),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+
+  def test_get_sparse_tensors(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(0, 1, 0),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((0, 1, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_transform_feature(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(0, 1, 0),
+        dense_shape=(2, 2))
+    id_tensor = _transform_features({'aaa': inputs}, [column], None)[column]
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((0, 1, 0), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_tensor.eval())
+
+  def DISABLED_test_get_sparse_tensors_weight_collections(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(0, 1, 0),
+        dense_shape=(2, 2))
+    column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }),
+        weight_collections=('my_weights',))
+
+    self.assertItemsEqual(
+        [], ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES))
+    self.assertItemsEqual([], ops.get_collection('my_weights'))
+
+  def test_get_sparse_tensors_dense_input(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': ((0, -1), (1, 0))
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=np.array((0, 1, 0), dtype=np.int64),
+              dense_shape=(2, 2)),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_with_inputs_too_small(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, -1, 0),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      with self.assertRaisesRegexp(
+          errors.OpError, 'assert_greater_or_equal_0'):
+        id_weight_pair.id_tensor.eval()
+
+  def test_get_sparse_tensors_with_inputs_too_big(self):
+    column = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 99, 0),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      with self.assertRaisesRegexp(
+          errors.OpError, 'assert_less_than_num_buckets'):
+        id_weight_pair.id_tensor.eval()
+
+  def test_get_sparse_tensors_with_default_value(self):
+    column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=4, default_value=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, -1, 99),
+        dense_shape=(2, 2))
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array((1, 3, 3), dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_weight_pair.id_tensor.eval())
+
+  def test_get_sparse_tensors_with_default_value_and_placeholder_inputs(self):
+    column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=4, default_value=3)
+    input_indices = array_ops.placeholder(dtype=dtypes.int64)
+    input_values = array_ops.placeholder(dtype=dtypes.int32)
+    input_shape = array_ops.placeholder(dtype=dtypes.int64)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=input_indices,
+        values=input_values,
+        dense_shape=input_shape)
+    id_weight_pair = column.get_sparse_tensors(
+        FeatureTransformationCache({
+            'aaa': inputs
+        }), None)
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=np.array(((0, 0), (1, 0), (1, 1)), dtype=np.int64),
+              values=np.array((1, 3, 3), dtype=np.int64),
+              dense_shape=np.array((2, 2), dtype=np.int64)),
+          id_weight_pair.id_tensor.eval(feed_dict={
+              input_indices: ((0, 0), (1, 0), (1, 1)),
+              input_values: (1, -1, 99),
+              input_shape: (2, 2),
+          }))
+
+  def test_linear_model(self):
+    column = fc_old.categorical_column_with_identity(key='aaa', num_buckets=3)
+    self.assertEqual(3, column.num_buckets)
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          column.name: sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=(0, 2, 1),
+              dense_shape=(2, 2))
+      }, (column,))
+      bias = get_linear_model_bias()
+      weight_var = get_linear_model_column_var(column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,)), weight_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        weight_var.assign(((1.,), (2.,), (3.,))).eval()
+        # weight_var[0] = 1
+        # weight_var[2] + weight_var[1] = 3+2 = 5
+        self.assertAllClose(((1.,), (5.,)), predictions.eval())
+
+  def test_keras_linear_model(self):
+    column = fc_old.categorical_column_with_identity(key='aaa', num_buckets=3)
+    self.assertEqual(3, column.num_buckets)
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          column.name:
+              sparse_tensor.SparseTensorValue(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(0, 2, 1),
+                  dense_shape=(2, 2))
+      }, (column,))
+      bias = get_linear_model_bias()
+      weight_var = get_linear_model_column_var(column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,)), weight_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        weight_var.assign(((1.,), (2.,), (3.,))).eval()
+        # weight_var[0] = 1
+        # weight_var[2] + weight_var[1] = 3+2 = 5
+        self.assertAllClose(((1.,), (5.,)), predictions.eval())
+
+
+class TransformFeaturesTest(test.TestCase):
+
+  # All transform tests are distributed in column test.
+  # Here we only test multi column case and naming
+  def transform_multi_column(self):
+    bucketized_price = fc.bucketized_column(
+        fc.numeric_column('price'), boundaries=[0, 2, 4, 6])
+    hashed_sparse = fc.categorical_column_with_hash_bucket('wire', 10)
+    with ops.Graph().as_default():
+      features = {
+          'price': [[-1.], [5.]],
+          'wire':
+              sparse_tensor.SparseTensor(
+                  values=['omar', 'stringer', 'marlo'],
+                  indices=[[0, 0], [1, 0], [1, 1]],
+                  dense_shape=[2, 2])
+      }
+      transformed = _transform_features(features,
+                                        [bucketized_price, hashed_sparse], None)
+      with _initialized_session():
+        self.assertIn(bucketized_price.name, transformed[bucketized_price].name)
+        self.assertAllEqual([[0], [3]], transformed[bucketized_price].eval())
+        self.assertIn(hashed_sparse.name, transformed[hashed_sparse].name)
+        self.assertAllEqual([6, 4, 1], transformed[hashed_sparse].values.eval())
+
+  def test_column_order(self):
+    """When the column is both dense and sparse, uses sparse tensors."""
+
+    class _LoggerColumn(FeatureColumn):
+
+      def __init__(self, name):
+        self._name = name
+
+      @property
+      def name(self):
+        return self._name
+
+      def transform_feature(self, transformation_cache, state_manager):
+        self.call_order = call_logger['count']
+        call_logger['count'] += 1
+        return 'Anything'
+
+      @property
+      def parse_example_spec(self):
+        pass
+
+    with ops.Graph().as_default():
+      column1 = _LoggerColumn('1')
+      column2 = _LoggerColumn('2')
+      call_logger = {'count': 0}
+      _transform_features({}, [column1, column2], None)
+      self.assertEqual(0, column1.call_order)
+      self.assertEqual(1, column2.call_order)
+
+      call_logger = {'count': 0}
+      _transform_features({}, [column2, column1], None)
+      self.assertEqual(0, column1.call_order)
+      self.assertEqual(1, column2.call_order)
+
+
+class IndicatorColumnTest(test.TestCase):
+
+  def test_indicator_column(self):
+    a = fc.categorical_column_with_hash_bucket('a', 4)
+    indicator_a = fc.indicator_column(a)
+    self.assertEqual(indicator_a.categorical_column.name, 'a')
+    self.assertEqual(indicator_a.name, 'a_indicator')
+    self.assertEqual(indicator_a.variable_shape, [1, 4])
+
+    b = fc.categorical_column_with_hash_bucket('b', hash_bucket_size=100)
+    indicator_b = fc.indicator_column(b)
+    self.assertEqual(indicator_b.categorical_column.name, 'b')
+    self.assertEqual(indicator_b.name, 'b_indicator')
+    self.assertEqual(indicator_b.variable_shape, [1, 100])
+
+  def test_1D_shape_succeeds(self):
+    animal = fc.indicator_column(
+        fc.categorical_column_with_hash_bucket('animal', 4))
+    transformation_cache = FeatureTransformationCache({
+        'animal': ['fox', 'fox']
+    })
+    output = transformation_cache.get(animal, None)
+    with self.test_session():
+      self.assertAllEqual([[0., 0., 1., 0.], [0., 0., 1., 0.]], output.eval())
+
+  def test_2D_shape_succeeds(self):
+    # TODO(ispir/cassandrax): Swith to categorical_column_with_keys when ready.
+    animal = fc.indicator_column(
+        fc.categorical_column_with_hash_bucket('animal', 4))
+    transformation_cache = FeatureTransformationCache({
+        'animal':
+            sparse_tensor.SparseTensor(
+                indices=[[0, 0], [1, 0]],
+                values=['fox', 'fox'],
+                dense_shape=[2, 1])
+    })
+    output = transformation_cache.get(animal, None)
+    with self.test_session():
+      self.assertAllEqual([[0., 0., 1., 0.], [0., 0., 1., 0.]], output.eval())
+
+  def test_multi_hot(self):
+    animal = fc.indicator_column(
+        fc.categorical_column_with_identity('animal', num_buckets=4))
+
+    transformation_cache = FeatureTransformationCache({
+        'animal':
+            sparse_tensor.SparseTensor(
+                indices=[[0, 0], [0, 1]], values=[1, 1], dense_shape=[1, 2])
+    })
+    output = transformation_cache.get(animal, None)
+    with self.test_session():
+      self.assertAllEqual([[0., 2., 0., 0.]], output.eval())
+
+  def test_multi_hot2(self):
+    animal = fc.indicator_column(
+        fc.categorical_column_with_identity('animal', num_buckets=4))
+    transformation_cache = FeatureTransformationCache({
+        'animal':
+            sparse_tensor.SparseTensor(
+                indices=[[0, 0], [0, 1]], values=[1, 2], dense_shape=[1, 2])
+    })
+    output = transformation_cache.get(animal, None)
+    with self.test_session():
+      self.assertAllEqual([[0., 1., 1., 0.]], output.eval())
+
+  def test_deep_copy(self):
+    a = fc.categorical_column_with_hash_bucket('a', 4)
+    column = fc.indicator_column(a)
+    column_copy = copy.deepcopy(column)
+    self.assertEqual(column_copy.categorical_column.name, 'a')
+    self.assertEqual(column.name, 'a_indicator')
+    self.assertEqual(column.variable_shape, [1, 4])
+
+  def test_parse_example(self):
+    a = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'))
+    a_indicator = fc.indicator_column(a)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer']))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a_indicator]))
+    self.assertIn('aaa', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'omar', b'stringer'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+
+  def test_transform(self):
+    a = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'))
+    a_indicator = fc.indicator_column(a)
+    features = {
+        'aaa': sparse_tensor.SparseTensorValue(
+            indices=((0, 0), (1, 0), (1, 1)),
+            values=('marlo', 'skywalker', 'omar'),
+            dense_shape=(2, 2))
+    }
+    indicator_tensor = _transform_features(features, [a_indicator],
+                                           None)[a_indicator]
+    with _initialized_session():
+      self.assertAllEqual([[0, 0, 1], [1, 0, 0]], indicator_tensor.eval())
+
+  def test_transform_with_weighted_column(self):
+    # Github issue 12557
+    ids = fc.categorical_column_with_vocabulary_list(
+        key='ids', vocabulary_list=('a', 'b', 'c'))
+    weights = fc.weighted_categorical_column(ids, 'weights')
+    indicator = fc.indicator_column(weights)
+    features = {
+        'ids': constant_op.constant([['c', 'b', 'a']]),
+        'weights': constant_op.constant([[2., 4., 6.]])
+    }
+    indicator_tensor = _transform_features(features, [indicator],
+                                           None)[indicator]
+    with _initialized_session():
+      self.assertAllEqual([[6., 4., 2.]], indicator_tensor.eval())
+
+  def test_transform_with_missing_value_in_weighted_column(self):
+    # Github issue 12583
+    ids = fc.categorical_column_with_vocabulary_list(
+        key='ids', vocabulary_list=('a', 'b', 'c'))
+    weights = fc.weighted_categorical_column(ids, 'weights')
+    indicator = fc.indicator_column(weights)
+    features = {
+        'ids': constant_op.constant([['c', 'b', 'unknown']]),
+        'weights': constant_op.constant([[2., 4., 6.]])
+    }
+    indicator_tensor = _transform_features(features, [indicator],
+                                           None)[indicator]
+    with _initialized_session():
+      self.assertAllEqual([[0., 4., 2.]], indicator_tensor.eval())
+
+  def test_transform_with_missing_value_in_categorical_column(self):
+    # Github issue 12583
+    ids = fc.categorical_column_with_vocabulary_list(
+        key='ids', vocabulary_list=('a', 'b', 'c'))
+    indicator = fc.indicator_column(ids)
+    features = {
+        'ids': constant_op.constant([['c', 'b', 'unknown']]),
+    }
+    indicator_tensor = _transform_features(features, [indicator],
+                                           None)[indicator]
+    with _initialized_session():
+      self.assertAllEqual([[0., 1., 1.]], indicator_tensor.eval())
+
+  def test_linear_model(self):
+    animal = fc_old.indicator_column(
+        fc_old.categorical_column_with_identity('animal', num_buckets=4))
+    with ops.Graph().as_default():
+      features = {
+          'animal':
+              sparse_tensor.SparseTensor(
+                  indices=[[0, 0], [0, 1]], values=[1, 2], dense_shape=[1, 2])
+      }
+
+      predictions = fc.linear_model(features, [animal])
+      weight_var = get_linear_model_column_var(animal)
+      with _initialized_session():
+        # All should be zero-initialized.
+        self.assertAllClose([[0.], [0.], [0.], [0.]], weight_var.eval())
+        self.assertAllClose([[0.]], predictions.eval())
+        weight_var.assign([[1.], [2.], [3.], [4.]]).eval()
+        self.assertAllClose([[2. + 3.]], predictions.eval())
+
+  def test_keras_linear_model(self):
+    animal = fc_old.indicator_column(
+        fc_old.categorical_column_with_identity('animal', num_buckets=4))
+    with ops.Graph().as_default():
+      features = {
+          'animal':
+              sparse_tensor.SparseTensor(
+                  indices=[[0, 0], [0, 1]], values=[1, 2], dense_shape=[1, 2])
+      }
+
+      predictions = get_keras_linear_model_predictions(features, [animal])
+      weight_var = get_linear_model_column_var(animal)
+      with _initialized_session():
+        # All should be zero-initialized.
+        self.assertAllClose([[0.], [0.], [0.], [0.]], weight_var.eval())
+        self.assertAllClose([[0.]], predictions.eval())
+        weight_var.assign([[1.], [2.], [3.], [4.]]).eval()
+        self.assertAllClose([[2. + 3.]], predictions.eval())
+
+  def test_input_layer(self):
+    animal = fc_old.indicator_column(
+        fc_old.categorical_column_with_identity('animal', num_buckets=4))
+    with ops.Graph().as_default():
+      features = {
+          'animal':
+              sparse_tensor.SparseTensor(
+                  indices=[[0, 0], [0, 1]], values=[1, 2], dense_shape=[1, 2])
+      }
+      net = fc.input_layer(features, [animal])
+      with _initialized_session():
+        self.assertAllClose([[0., 1., 1., 0.]], net.eval())
+
+
+class _TestStateManager(StateManager):
+
+  def __init__(self, trainable=True):
+    # Dict of feature_column to a dict of variables.
+    self._all_variables = {}
+    self._trainable = trainable
+
+  def get_variable(self,
+                   feature_column,
+                   name,
+                   shape,
+                   dtype=None,
+                   initializer=None):
+    if feature_column not in self._all_variables:
+      self._all_variables[feature_column] = {}
+    var_dict = self._all_variables[feature_column]
+    if name in var_dict:
+      return var_dict[name]
+    else:
+      var = variable_scope.get_variable(
+          name=name,
+          shape=shape,
+          initializer=initializer,
+          trainable=self._trainable)
+      var_dict[name] = var
+      return var
+
+
+class EmbeddingColumnTest(test.TestCase):
+
+  def test_defaults(self):
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    embedding_dimension = 2
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension)
+    self.assertIs(categorical_column, embedding_column.categorical_column)
+    self.assertEqual(embedding_dimension, embedding_column.dimension)
+    self.assertEqual('mean', embedding_column.combiner)
+    self.assertIsNone(embedding_column.ckpt_to_load_from)
+    self.assertIsNone(embedding_column.tensor_name_in_ckpt)
+    self.assertIsNone(embedding_column.max_norm)
+    self.assertTrue(embedding_column.trainable)
+    self.assertEqual('aaa_embedding', embedding_column.name)
+    self.assertEqual((embedding_dimension,), embedding_column.variable_shape)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+    }, embedding_column.parse_example_spec)
+
+  def test_all_constructor_args(self):
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    embedding_dimension = 2
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        combiner='my_combiner', initializer=lambda: 'my_initializer',
+        ckpt_to_load_from='my_ckpt', tensor_name_in_ckpt='my_ckpt_tensor',
+        max_norm=42., trainable=False)
+    self.assertIs(categorical_column, embedding_column.categorical_column)
+    self.assertEqual(embedding_dimension, embedding_column.dimension)
+    self.assertEqual('my_combiner', embedding_column.combiner)
+    self.assertEqual('my_ckpt', embedding_column.ckpt_to_load_from)
+    self.assertEqual('my_ckpt_tensor', embedding_column.tensor_name_in_ckpt)
+    self.assertEqual(42., embedding_column.max_norm)
+    self.assertFalse(embedding_column.trainable)
+    self.assertEqual('aaa_embedding', embedding_column.name)
+    self.assertEqual((embedding_dimension,), embedding_column.variable_shape)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+    }, embedding_column.parse_example_spec)
+
+  def test_deep_copy(self):
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    embedding_dimension = 2
+    original = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        combiner='my_combiner', initializer=lambda: 'my_initializer',
+        ckpt_to_load_from='my_ckpt', tensor_name_in_ckpt='my_ckpt_tensor',
+        max_norm=42., trainable=False)
+    for embedding_column in (original, copy.deepcopy(original)):
+      self.assertEqual('aaa', embedding_column.categorical_column.name)
+      self.assertEqual(3, embedding_column.categorical_column.num_buckets)
+      self.assertEqual({
+          'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+      }, embedding_column.categorical_column.parse_example_spec)
+
+      self.assertEqual(embedding_dimension, embedding_column.dimension)
+      self.assertEqual('my_combiner', embedding_column.combiner)
+      self.assertEqual('my_ckpt', embedding_column.ckpt_to_load_from)
+      self.assertEqual('my_ckpt_tensor', embedding_column.tensor_name_in_ckpt)
+      self.assertEqual(42., embedding_column.max_norm)
+      self.assertFalse(embedding_column.trainable)
+      self.assertEqual('aaa_embedding', embedding_column.name)
+      self.assertEqual((embedding_dimension,), embedding_column.variable_shape)
+      self.assertEqual({
+          'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+      }, embedding_column.parse_example_spec)
+
+  def test_invalid_initializer(self):
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    with self.assertRaisesRegexp(ValueError, 'initializer must be callable'):
+      fc.embedding_column(categorical_column, dimension=2, initializer='not_fn')
+
+  def test_parse_example(self):
+    a = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'))
+    a_embedded = fc.embedding_column(a, dimension=2)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer']))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a_embedded]))
+    self.assertIn('aaa', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'omar', b'stringer'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+
+  def test_transform_feature(self):
+    a = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    a_embedded = fc.embedding_column(a, dimension=2)
+    features = {
+        'aaa': sparse_tensor.SparseTensor(
+            indices=((0, 0), (1, 0), (1, 1)),
+            values=(0, 1, 0),
+            dense_shape=(2, 2))
+    }
+    outputs = _transform_features(features, [a, a_embedded], None)
+    output_a = outputs[a]
+    output_embedded = outputs[a_embedded]
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self, output_a.eval(), output_embedded.eval())
+
+  def test_get_dense_tensor(self):
+    # Inputs.
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 5))
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups = (
+        # example 0, ids [2], embedding = [7, 11]
+        (7., 11.),
+        # example 1, ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+        (2., 3.5),
+        # example 2, ids [], embedding = [0, 0]
+        (0., 0.),
+        # example 3, ids [1], embedding = [3, 5]
+        (3., 5.),
+    )
+
+    # Build columns.
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        initializer=_initializer)
+    state_manager = _TestStateManager()
+
+    # Provide sparse input and get dense result.
+    embedding_lookup = embedding_column.get_dense_tensor(
+        FeatureTransformationCache({
+            'aaa': sparse_input
+        }), state_manager)
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(('embedding_weights:0',),
+                          tuple([v.name for v in global_vars]))
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, global_vars[0].eval())
+      self.assertAllEqual(expected_lookups, embedding_lookup.eval())
+
+  def test_get_dense_tensor_3d(self):
+    # Inputs.
+    vocabulary_size = 4
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0, 0), (1, 1, 0), (1, 1, 4), (3, 0, 0), (3, 1, 2)),
+        values=(2, 0, 1, 1, 2),
+        dense_shape=(4, 2, 5))
+
+    # Embedding variable.
+    embedding_dimension = 3
+    embedding_values = (
+        (1., 2., 4.),   # id 0
+        (3., 5., 1.),   # id 1
+        (7., 11., 2.),  # id 2
+        (2., 7., 12.)   # id 3
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups = (
+        # example 0, ids [[2], []], embedding = [[7, 11, 2], [0, 0, 0]]
+        ((7., 11., 2.), (0., 0., 0.)),
+        # example 1, ids [[], [0, 1]], embedding
+        # = mean([[], [1, 2, 4] + [3, 5, 1]]) = [[0, 0, 0], [2, 3.5, 2.5]]
+        ((0., 0., 0.), (2., 3.5, 2.5)),
+        # example 2, ids [[], []], embedding = [[0, 0, 0], [0, 0, 0]]
+        ((0., 0., 0.), (0., 0., 0.)),
+        # example 3, ids [[1], [2]], embedding = [[3, 5, 1], [7, 11, 2]]
+        ((3., 5., 1.), (7., 11., 2.)),
+    )
+
+    # Build columns.
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        initializer=_initializer)
+    state_manager = _TestStateManager()
+
+    # Provide sparse input and get dense result.
+    embedding_lookup = embedding_column.get_dense_tensor(
+        FeatureTransformationCache({
+            'aaa': sparse_input
+        }), state_manager)
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(('embedding_weights:0',),
+                          tuple([v.name for v in global_vars]))
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, global_vars[0].eval())
+      self.assertAllEqual(expected_lookups, embedding_lookup.eval())
+
+  def DISABLED_test_get_dense_tensor_weight_collections(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 5))
+
+    # Build columns.
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    embedding_column = fc.embedding_column(categorical_column, dimension=2)
+
+    # Provide sparse input and get dense result.
+    embedding_column.get_dense_tensor(
+        FeatureTransformationCache({
+            'aaa': sparse_input
+        }),
+        weight_collections=('my_vars',))
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(('embedding_weights:0',),
+                          tuple([v.name for v in global_vars]))
+    my_vars = ops.get_collection('my_vars')
+    self.assertItemsEqual(
+        ('embedding_weights:0',), tuple([v.name for v in my_vars]))
+
+  def test_get_dense_tensor_placeholder_inputs(self):
+    # Inputs.
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 5))
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups = (
+        # example 0, ids [2], embedding = [7, 11]
+        (7., 11.),
+        # example 1, ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+        (2., 3.5),
+        # example 2, ids [], embedding = [0, 0]
+        (0., 0.),
+        # example 3, ids [1], embedding = [3, 5]
+        (3., 5.),
+    )
+
+    # Build columns.
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        initializer=_initializer)
+    state_manager = _TestStateManager()
+
+    # Provide sparse input and get dense result.
+    input_indices = array_ops.placeholder(dtype=dtypes.int64)
+    input_values = array_ops.placeholder(dtype=dtypes.int64)
+    input_shape = array_ops.placeholder(dtype=dtypes.int64)
+    embedding_lookup = embedding_column.get_dense_tensor(
+        FeatureTransformationCache({
+            'aaa':
+                sparse_tensor.SparseTensorValue(
+                    indices=input_indices,
+                    values=input_values,
+                    dense_shape=input_shape)
+        }), state_manager)
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('embedding_weights:0',), tuple([v.name for v in global_vars]))
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, global_vars[0].eval())
+      self.assertAllEqual(expected_lookups, embedding_lookup.eval(
+          feed_dict={
+              input_indices: sparse_input.indices,
+              input_values: sparse_input.values,
+              input_shape: sparse_input.dense_shape,
+          }))
+
+  def test_get_dense_tensor_restore_from_ckpt(self):
+    # Inputs.
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 5))
+
+    # Embedding variable. The checkpoint file contains _embedding_values.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    ckpt_path = test.test_src_dir_path(
+        'python/feature_column/testdata/embedding.ckpt')
+    ckpt_tensor = 'my_embedding'
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups = (
+        # example 0, ids [2], embedding = [7, 11]
+        (7., 11.),
+        # example 1, ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+        (2., 3.5),
+        # example 2, ids [], embedding = [0, 0]
+        (0., 0.),
+        # example 3, ids [1], embedding = [3, 5]
+        (3., 5.),
+    )
+
+    # Build columns.
+    categorical_column = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc.embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        ckpt_to_load_from=ckpt_path,
+        tensor_name_in_ckpt=ckpt_tensor)
+    state_manager = _TestStateManager()
+
+    # Provide sparse input and get dense result.
+    embedding_lookup = embedding_column.get_dense_tensor(
+        FeatureTransformationCache({
+            'aaa': sparse_input
+        }), state_manager)
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('embedding_weights:0',), tuple([v.name for v in global_vars]))
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, global_vars[0].eval())
+      self.assertAllEqual(expected_lookups, embedding_lookup.eval())
+
+  def test_linear_model(self):
+    # Inputs.
+    batch_size = 4
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(batch_size, 5))
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_shape = (vocabulary_size, embedding_dimension)
+    zeros_embedding_values = np.zeros(embedding_shape)
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual(embedding_shape, shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return zeros_embedding_values
+
+    # Build columns.
+    categorical_column = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc_old.embedding_column(
+        categorical_column,
+        dimension=embedding_dimension,
+        initializer=_initializer)
+
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          categorical_column.name: sparse_input
+      }, (embedding_column,))
+      expected_var_names = (
+          'linear_model/bias_weights:0',
+          'linear_model/aaa_embedding/weights:0',
+          'linear_model/aaa_embedding/embedding_weights:0',
+      )
+      self.assertItemsEqual(
+          expected_var_names,
+          [v.name for v in ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)])
+      trainable_vars = {
+          v.name: v for v in ops.get_collection(
+              ops.GraphKeys.TRAINABLE_VARIABLES)
+      }
+      self.assertItemsEqual(expected_var_names, trainable_vars.keys())
+      bias = trainable_vars['linear_model/bias_weights:0']
+      embedding_weights = trainable_vars[
+          'linear_model/aaa_embedding/embedding_weights:0']
+      linear_weights = trainable_vars[
+          'linear_model/aaa_embedding/weights:0']
+      with _initialized_session():
+        # Predictions with all zero weights.
+        self.assertAllClose(np.zeros((1,)), bias.eval())
+        self.assertAllClose(zeros_embedding_values, embedding_weights.eval())
+        self.assertAllClose(
+            np.zeros((embedding_dimension, 1)), linear_weights.eval())
+        self.assertAllClose(np.zeros((batch_size, 1)), predictions.eval())
+
+        # Predictions with all non-zero weights.
+        embedding_weights.assign((
+            (1., 2.),  # id 0
+            (3., 5.),  # id 1
+            (7., 11.)  # id 2
+        )).eval()
+        linear_weights.assign(((4.,), (6.,))).eval()
+        # example 0, ids [2], embedding[0] = [7, 11]
+        # example 1, ids [0, 1], embedding[1] = mean([1, 2] + [3, 5]) = [2, 3.5]
+        # example 2, ids [], embedding[2] = [0, 0]
+        # example 3, ids [1], embedding[3] = [3, 5]
+        # sum(embeddings * linear_weights)
+        # = [4*7 + 6*11, 4*2 + 6*3.5, 4*0 + 6*0, 4*3 + 6*5] = [94, 29, 0, 42]
+        self.assertAllClose(((94.,), (29.,), (0.,), (42.,)), predictions.eval())
+
+  def test_keras_linear_model(self):
+    # Inputs.
+    batch_size = 4
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(batch_size, 5))
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_shape = (vocabulary_size, embedding_dimension)
+    zeros_embedding_values = np.zeros(embedding_shape)
+
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual(embedding_shape, shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return zeros_embedding_values
+
+    # Build columns.
+    categorical_column = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc_old.embedding_column(
+        categorical_column,
+        dimension=embedding_dimension,
+        initializer=_initializer)
+
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          categorical_column.name: sparse_input
+      }, (embedding_column,))
+      expected_var_names = (
+          'linear_model/bias_weights:0',
+          'linear_model/aaa_embedding/weights:0',
+          'linear_model/aaa_embedding/embedding_weights:0',
+      )
+      self.assertItemsEqual(
+          expected_var_names,
+          [v.name for v in ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)])
+      trainable_vars = {
+          v.name: v
+          for v in ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      }
+      self.assertItemsEqual(expected_var_names, trainable_vars.keys())
+      bias = trainable_vars['linear_model/bias_weights:0']
+      embedding_weights = trainable_vars[
+          'linear_model/aaa_embedding/embedding_weights:0']
+      linear_weights = trainable_vars['linear_model/aaa_embedding/weights:0']
+      with _initialized_session():
+        # Predictions with all zero weights.
+        self.assertAllClose(np.zeros((1,)), bias.eval())
+        self.assertAllClose(zeros_embedding_values, embedding_weights.eval())
+        self.assertAllClose(
+            np.zeros((embedding_dimension, 1)), linear_weights.eval())
+        self.assertAllClose(np.zeros((batch_size, 1)), predictions.eval())
+
+        # Predictions with all non-zero weights.
+        embedding_weights.assign((
+            (1., 2.),  # id 0
+            (3., 5.),  # id 1
+            (7., 11.)  # id 2
+        )).eval()
+        linear_weights.assign(((4.,), (6.,))).eval()
+        # example 0, ids [2], embedding[0] = [7, 11]
+        # example 1, ids [0, 1], embedding[1] = mean([1, 2] + [3, 5]) = [2, 3.5]
+        # example 2, ids [], embedding[2] = [0, 0]
+        # example 3, ids [1], embedding[3] = [3, 5]
+        # sum(embeddings * linear_weights)
+        # = [4*7 + 6*11, 4*2 + 6*3.5, 4*0 + 6*0, 4*3 + 6*5] = [94, 29, 0, 42]
+        self.assertAllClose(((94.,), (29.,), (0.,), (42.,)), predictions.eval())
+
+  def test_input_layer(self):
+    # Inputs.
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 5))
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups = (
+        # example 0, ids [2], embedding = [7, 11]
+        (7., 11.),
+        # example 1, ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+        (2., 3.5),
+        # example 2, ids [], embedding = [0, 0]
+        (0., 0.),
+        # example 3, ids [1], embedding = [3, 5]
+        (3., 5.),
+    )
+
+    # Build columns.
+    categorical_column = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc_old.embedding_column(
+        categorical_column,
+        dimension=embedding_dimension,
+        initializer=_initializer)
+
+    # Provide sparse input and get dense result.
+    input_layer = fc.input_layer({'aaa': sparse_input}, (embedding_column,))
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('input_layer/aaa_embedding/embedding_weights:0',),
+        tuple([v.name for v in global_vars]))
+    trainable_vars = ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+    self.assertItemsEqual(
+        ('input_layer/aaa_embedding/embedding_weights:0',),
+        tuple([v.name for v in trainable_vars]))
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, trainable_vars[0].eval())
+      self.assertAllEqual(expected_lookups, input_layer.eval())
+
+  def test_input_layer_not_trainable(self):
+    # Inputs.
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 4), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 5))
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups = (
+        # example 0, ids [2], embedding = [7, 11]
+        (7., 11.),
+        # example 1, ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+        (2., 3.5),
+        # example 2, ids [], embedding = [0, 0]
+        (0., 0.),
+        # example 3, ids [1], embedding = [3, 5]
+        (3., 5.),
+    )
+
+    # Build columns.
+    categorical_column = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = fc_old.embedding_column(
+        categorical_column,
+        dimension=embedding_dimension,
+        initializer=_initializer,
+        trainable=False)
+
+    # Provide sparse input and get dense result.
+    input_layer = fc.input_layer({'aaa': sparse_input}, (embedding_column,))
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('input_layer/aaa_embedding/embedding_weights:0',),
+        tuple([v.name for v in global_vars]))
+    self.assertItemsEqual(
+        [], ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES))
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, global_vars[0].eval())
+      self.assertAllEqual(expected_lookups, input_layer.eval())
+
+
+class _TestSharedEmbeddingStateManager(StateManager):
+  """Manages the state for shared embedding columns.
+
+  This can handle multiple groups of shared embedding columns.
+  """
+
+  def __init__(self, trainable=True):
+    # Dict of shared_embedding_collection_name to a dict of variables.
+    self._all_variables = {}
+    self._trainable = trainable
+
+  def get_variable(self,
+                   feature_column,
+                   name,
+                   shape,
+                   dtype=None,
+                   initializer=None):
+    if not isinstance(feature_column, fc.SharedEmbeddingColumn):
+      raise ValueError(
+          'SharedEmbeddingStateManager can only handle SharedEmbeddingColumns. '
+          'Given type: {} '.format(type(feature_column)))
+
+    collection_name = feature_column.shared_collection_name
+    if collection_name not in self._all_variables:
+      self._all_variables[collection_name] = {}
+    var_dict = self._all_variables[collection_name]
+    if name in var_dict:
+      return var_dict[name]
+    else:
+      var = variable_scope.get_variable(
+          name=name,
+          shape=shape,
+          initializer=initializer,
+          trainable=self._trainable)
+      var_dict[name] = var
+      return var
+
+
+class SharedEmbeddingColumnTest(test.TestCase):
+
+  def test_defaults(self):
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    embedding_dimension = 2
+    embedding_column_b, embedding_column_a = fc.shared_embedding_columns(
+        [categorical_column_b, categorical_column_a],
+        dimension=embedding_dimension)
+    self.assertIs(categorical_column_a, embedding_column_a.categorical_column)
+    self.assertIs(categorical_column_b, embedding_column_b.categorical_column)
+    self.assertEqual(embedding_dimension, embedding_column_a.dimension)
+    self.assertEqual(embedding_dimension, embedding_column_b.dimension)
+    self.assertEqual('mean', embedding_column_a.combiner)
+    self.assertEqual('mean', embedding_column_b.combiner)
+    self.assertIsNone(embedding_column_a.ckpt_to_load_from)
+    self.assertIsNone(embedding_column_b.ckpt_to_load_from)
+    self.assertEqual('aaa_bbb_shared_embedding',
+                     embedding_column_a.shared_collection_name)
+    self.assertEqual('aaa_bbb_shared_embedding',
+                     embedding_column_b.shared_collection_name)
+    self.assertIsNone(embedding_column_a.tensor_name_in_ckpt)
+    self.assertIsNone(embedding_column_b.tensor_name_in_ckpt)
+    self.assertIsNone(embedding_column_a.max_norm)
+    self.assertIsNone(embedding_column_b.max_norm)
+    self.assertTrue(embedding_column_a.trainable)
+    self.assertTrue(embedding_column_b.trainable)
+    self.assertEqual('aaa_shared_embedding', embedding_column_a.name)
+    self.assertEqual('bbb_shared_embedding', embedding_column_b.name)
+    self.assertEqual((embedding_dimension,), embedding_column_a.variable_shape)
+    self.assertEqual((embedding_dimension,), embedding_column_b.variable_shape)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+    }, embedding_column_a.parse_example_spec)
+    self.assertEqual({
+        'bbb': parsing_ops.VarLenFeature(dtypes.int64)
+    }, embedding_column_b.parse_example_spec)
+
+  def test_all_constructor_args(self):
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    embedding_dimension = 2
+    embedding_column_a, embedding_column_b = fc.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension,
+        combiner='my_combiner',
+        initializer=lambda: 'my_initializer',
+        shared_embedding_collection_name='shared_embedding_collection_name',
+        ckpt_to_load_from='my_ckpt',
+        tensor_name_in_ckpt='my_ckpt_tensor',
+        max_norm=42.,
+        trainable=False)
+    self.assertIs(categorical_column_a, embedding_column_a.categorical_column)
+    self.assertIs(categorical_column_b, embedding_column_b.categorical_column)
+    self.assertEqual(embedding_dimension, embedding_column_a.dimension)
+    self.assertEqual(embedding_dimension, embedding_column_b.dimension)
+    self.assertEqual('my_combiner', embedding_column_a.combiner)
+    self.assertEqual('my_combiner', embedding_column_b.combiner)
+    self.assertEqual('shared_embedding_collection_name',
+                     embedding_column_a.shared_collection_name)
+    self.assertEqual('shared_embedding_collection_name',
+                     embedding_column_b.shared_collection_name)
+    self.assertEqual('my_ckpt', embedding_column_a.ckpt_to_load_from)
+    self.assertEqual('my_ckpt', embedding_column_b.ckpt_to_load_from)
+    self.assertEqual('my_ckpt_tensor', embedding_column_a.tensor_name_in_ckpt)
+    self.assertEqual('my_ckpt_tensor', embedding_column_b.tensor_name_in_ckpt)
+    self.assertEqual(42., embedding_column_a.max_norm)
+    self.assertEqual(42., embedding_column_b.max_norm)
+    self.assertFalse(embedding_column_a.trainable)
+    self.assertFalse(embedding_column_b.trainable)
+    self.assertEqual('aaa_shared_embedding', embedding_column_a.name)
+    self.assertEqual('bbb_shared_embedding', embedding_column_b.name)
+    self.assertEqual((embedding_dimension,), embedding_column_a.variable_shape)
+    self.assertEqual((embedding_dimension,), embedding_column_b.variable_shape)
+    self.assertEqual({
+        'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+    }, embedding_column_a.parse_example_spec)
+    self.assertEqual({
+        'bbb': parsing_ops.VarLenFeature(dtypes.int64)
+    }, embedding_column_b.parse_example_spec)
+
+  def test_deep_copy(self):
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    embedding_dimension = 2
+    original_a, _ = fc.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension,
+        combiner='my_combiner',
+        initializer=lambda: 'my_initializer',
+        shared_embedding_collection_name='shared_embedding_collection_name',
+        ckpt_to_load_from='my_ckpt',
+        tensor_name_in_ckpt='my_ckpt_tensor',
+        max_norm=42., trainable=False)
+    for embedding_column_a in (original_a, copy.deepcopy(original_a)):
+      self.assertEqual('aaa', embedding_column_a.categorical_column.name)
+      self.assertEqual(3, embedding_column_a.categorical_column.num_buckets)
+      self.assertEqual({
+          'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+      }, embedding_column_a.categorical_column.parse_example_spec)
+
+      self.assertEqual(embedding_dimension, embedding_column_a.dimension)
+      self.assertEqual('my_combiner', embedding_column_a.combiner)
+      self.assertEqual('shared_embedding_collection_name',
+                       embedding_column_a.shared_collection_name)
+      self.assertEqual('my_ckpt', embedding_column_a.ckpt_to_load_from)
+      self.assertEqual('my_ckpt_tensor', embedding_column_a.tensor_name_in_ckpt)
+      self.assertEqual(42., embedding_column_a.max_norm)
+      self.assertFalse(embedding_column_a.trainable)
+      self.assertEqual('aaa_shared_embedding', embedding_column_a.name)
+      self.assertEqual((embedding_dimension,),
+                       embedding_column_a.variable_shape)
+      self.assertEqual({
+          'aaa': parsing_ops.VarLenFeature(dtypes.int64)
+      }, embedding_column_a.parse_example_spec)
+
+  def test_invalid_initializer(self):
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    with self.assertRaisesRegexp(ValueError, 'initializer must be callable'):
+      fc.shared_embedding_columns(
+          [categorical_column_a, categorical_column_b], dimension=2,
+          initializer='not_fn')
+
+  def test_incompatible_column_type(self):
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    categorical_column_c = fc.categorical_column_with_hash_bucket(
+        key='ccc', hash_bucket_size=3)
+    with self.assertRaisesRegexp(
+        ValueError, 'all categorical_columns must have the same type.*'
+        'IdentityCategoricalColumn.*HashedCategoricalColumn'):
+      fc.shared_embedding_columns(
+          [categorical_column_a, categorical_column_b, categorical_column_c],
+          dimension=2)
+
+  def test_weighted_categorical_column_ok(self):
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=3)
+    weighted_categorical_column_a = fc.weighted_categorical_column(
+        categorical_column_a, weight_feature_key='aaa_weights')
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=3)
+    weighted_categorical_column_b = fc.weighted_categorical_column(
+        categorical_column_b, weight_feature_key='bbb_weights')
+    fc.shared_embedding_columns(
+        [weighted_categorical_column_a, categorical_column_b], dimension=2)
+    fc.shared_embedding_columns(
+        [categorical_column_a, weighted_categorical_column_b], dimension=2)
+    fc.shared_embedding_columns(
+        [weighted_categorical_column_a, weighted_categorical_column_b],
+        dimension=2)
+
+  def test_parse_example(self):
+    a = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'))
+    b = fc.categorical_column_with_vocabulary_list(
+        key='bbb', vocabulary_list=('omar', 'stringer', 'marlo'))
+    a_embedded, b_embedded = fc.shared_embedding_columns(
+        [a, b], dimension=2)
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer'])),
+            'bbb':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'stringer', b'marlo'])),
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a_embedded, b_embedded]))
+    self.assertIn('aaa', features)
+    self.assertIn('bbb', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'omar', b'stringer'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'stringer', b'marlo'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['bbb'].eval())
+
+  def test_transform_feature(self):
+    a = fc.categorical_column_with_identity(key='aaa', num_buckets=3)
+    b = fc.categorical_column_with_identity(key='bbb', num_buckets=3)
+    a_embedded, b_embedded = fc.shared_embedding_columns(
+        [a, b], dimension=2)
+    features = {
+        'aaa': sparse_tensor.SparseTensor(
+            indices=((0, 0), (1, 0), (1, 1)),
+            values=(0, 1, 0),
+            dense_shape=(2, 2)),
+        'bbb': sparse_tensor.SparseTensor(
+            indices=((0, 0), (1, 0), (1, 1)),
+            values=(1, 2, 1),
+            dense_shape=(2, 2)),
+    }
+    outputs = _transform_features(features, [a, a_embedded, b, b_embedded],
+                                  None)
+    output_a = outputs[a]
+    output_a_embedded = outputs[a_embedded]
+    output_b = outputs[b]
+    output_b_embedded = outputs[b_embedded]
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self, output_a.eval(), output_a_embedded.eval())
+      _assert_sparse_tensor_value(
+          self, output_b.eval(), output_b_embedded.eval())
+
+  def test_get_dense_tensor(self):
+    # Inputs.
+    vocabulary_size = 3
+    # -1 values are ignored.
+    input_a = np.array(
+        [[2, -1, -1],  # example 0, ids [2]
+         [0, 1, -1]])  # example 1, ids [0, 1]
+    input_b = np.array(
+        [[0, -1, -1],  # example 0, ids [0]
+         [-1, -1, -1]])  # example 1, ids []
+    input_features = {
+        'aaa': input_a,
+        'bbb': input_b
+    }
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups_a = (
+        # example 0:
+        (7., 11.),  # ids [2], embedding = [7, 11]
+        # example 1:
+        (2., 3.5),  # ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+    )
+    expected_lookups_b = (
+        # example 0:
+        (1., 2.),  # ids [0], embedding = [1, 2]
+        # example 1:
+        (0., 0.),  # ids [], embedding = [0, 0]
+    )
+
+    # Build columns.
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_a, embedding_column_b = fc.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension, initializer=_initializer)
+    state_manager = _TestSharedEmbeddingStateManager()
+
+    # Provide sparse input and get dense result.
+    embedding_lookup_a = embedding_column_a.get_dense_tensor(
+        FeatureTransformationCache(input_features), state_manager)
+    embedding_lookup_b = embedding_column_b.get_dense_tensor(
+        FeatureTransformationCache(input_features), state_manager)
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(('embedding_weights:0',),
+                          tuple([v.name for v in global_vars]))
+    embedding_var = global_vars[0]
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, embedding_var.eval())
+      self.assertAllEqual(expected_lookups_a, embedding_lookup_a.eval())
+      self.assertAllEqual(expected_lookups_b, embedding_lookup_b.eval())
+
+  def DISABLED_test_get_dense_tensor_weight_collections(self):
+    # Inputs.
+    vocabulary_size = 3
+    # -1 values are ignored.
+    input_a = np.array([
+        [2, -1, -1],  # example 0, ids [2]
+        [0, 1, -1]
+    ])  # example 1, ids [0, 1]
+    input_b = np.array([
+        [0, -1, -1],  # example 0, ids [0]
+        [-1, -1, -1]
+    ])  # example 1, ids []
+    input_features = {'aaa': input_a, 'bbb': input_b}
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Build columns.
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_a, embedding_column_b = fc.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension,
+        initializer=_initializer)
+
+    fc.input_layer(
+        input_features, [embedding_column_a, embedding_column_b],
+        weight_collections=('my_vars',))
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('input_layer/aaa_bbb_shared_embedding/embedding_weights:0',),
+        tuple(v.name for v in global_vars))
+    my_vars = ops.get_collection('my_vars')
+    self.assertItemsEqual(
+        ('input_layer/aaa_bbb_shared_embedding/embedding_weights:0',),
+        tuple(v.name for v in my_vars))
+
+  def test_get_dense_tensor_placeholder_inputs(self):
+    # Inputs.
+    vocabulary_size = 3
+    # -1 values are ignored.
+    input_a = np.array(
+        [[2, -1, -1],  # example 0, ids [2]
+         [0, 1, -1]])  # example 1, ids [0, 1]
+    input_b = np.array(
+        [[0, -1, -1],  # example 0, ids [0]
+         [-1, -1, -1]])  # example 1, ids []
+    # Specify shape, because dense input must have rank specified.
+    input_a_placeholder = array_ops.placeholder(
+        dtype=dtypes.int64, shape=[None, 3])
+    input_b_placeholder = array_ops.placeholder(
+        dtype=dtypes.int64, shape=[None, 3])
+    input_features = {
+        'aaa': input_a_placeholder,
+        'bbb': input_b_placeholder,
+    }
+    feed_dict = {
+        input_a_placeholder: input_a,
+        input_b_placeholder: input_b,
+    }
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Build columns.
+    categorical_column_a = fc.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    categorical_column_b = fc.categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_a, embedding_column_b = fc.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension, initializer=_initializer)
+    state_manager = _TestSharedEmbeddingStateManager()
+
+    # Provide sparse input and get dense result.
+    embedding_lookup_a = embedding_column_a.get_dense_tensor(
+        FeatureTransformationCache(input_features), state_manager)
+    embedding_lookup_b = embedding_column_b.get_dense_tensor(
+        FeatureTransformationCache(input_features), state_manager)
+
+    with _initialized_session() as sess:
+      sess.run([embedding_lookup_a, embedding_lookup_b], feed_dict=feed_dict)
+
+  def test_linear_model(self):
+    # Inputs.
+    batch_size = 2
+    vocabulary_size = 3
+    # -1 values are ignored.
+    input_a = np.array(
+        [[2, -1, -1],  # example 0, ids [2]
+         [0, 1, -1]])  # example 1, ids [0, 1]
+    input_b = np.array(
+        [[0, -1, -1],  # example 0, ids [0]
+         [-1, -1, -1]])  # example 1, ids []
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_shape = (vocabulary_size, embedding_dimension)
+    zeros_embedding_values = np.zeros(embedding_shape)
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual(embedding_shape, shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return zeros_embedding_values
+
+    # Build columns.
+    categorical_column_a = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    categorical_column_b = fc_old.categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_a, embedding_column_b = fc_old.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension,
+        initializer=_initializer)
+
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          categorical_column_a.name: input_a,
+          categorical_column_b.name: input_b,
+      }, (embedding_column_a, embedding_column_b))
+      # Linear weights do not follow the column name. But this is a rare use
+      # case, and fixing it would add too much complexity to the code.
+      expected_var_names = (
+          'linear_model/bias_weights:0',
+          'linear_model/aaa_bbb_shared_embedding/weights:0',
+          'linear_model/aaa_bbb_shared_embedding/embedding_weights:0',
+          'linear_model/aaa_bbb_shared_embedding_1/weights:0',
+      )
+      self.assertItemsEqual(
+          expected_var_names,
+          [v.name for v in ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)])
+      trainable_vars = {
+          v.name: v for v in ops.get_collection(
+              ops.GraphKeys.TRAINABLE_VARIABLES)
+      }
+      self.assertItemsEqual(expected_var_names, trainable_vars.keys())
+      bias = trainable_vars['linear_model/bias_weights:0']
+      embedding_weights = trainable_vars[
+          'linear_model/aaa_bbb_shared_embedding/embedding_weights:0']
+      linear_weights_a = trainable_vars[
+          'linear_model/aaa_bbb_shared_embedding/weights:0']
+      linear_weights_b = trainable_vars[
+          'linear_model/aaa_bbb_shared_embedding_1/weights:0']
+      with _initialized_session():
+        # Predictions with all zero weights.
+        self.assertAllClose(np.zeros((1,)), bias.eval())
+        self.assertAllClose(zeros_embedding_values, embedding_weights.eval())
+        self.assertAllClose(
+            np.zeros((embedding_dimension, 1)), linear_weights_a.eval())
+        self.assertAllClose(
+            np.zeros((embedding_dimension, 1)), linear_weights_b.eval())
+        self.assertAllClose(np.zeros((batch_size, 1)), predictions.eval())
+
+        # Predictions with all non-zero weights.
+        embedding_weights.assign((
+            (1., 2.),  # id 0
+            (3., 5.),  # id 1
+            (7., 11.)  # id 2
+        )).eval()
+        linear_weights_a.assign(((4.,), (6.,))).eval()
+        # example 0, ids [2], embedding[0] = [7, 11]
+        # example 1, ids [0, 1], embedding[1] = mean([1, 2] + [3, 5]) = [2, 3.5]
+        # sum(embeddings * linear_weights)
+        # = [4*7 + 6*11, 4*2 + 6*3.5] = [94, 29]
+        linear_weights_b.assign(((3.,), (5.,))).eval()
+        # example 0, ids [0], embedding[0] = [1, 2]
+        # example 1, ids [], embedding[1] = 0, 0]
+        # sum(embeddings * linear_weights)
+        # = [3*1 + 5*2, 3*0 +5*0] = [13, 0]
+        self.assertAllClose([[94. + 13.], [29.]], predictions.eval())
+
+  def test_keras_linear_model(self):
+    # Inputs.
+    batch_size = 2
+    vocabulary_size = 3
+    # -1 values are ignored.
+    input_a = np.array([
+        [2, -1, -1],  # example 0, ids [2]
+        [0, 1, -1]
+    ])  # example 1, ids [0, 1]
+    input_b = np.array([
+        [0, -1, -1],  # example 0, ids [0]
+        [-1, -1, -1]
+    ])  # example 1, ids []
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_shape = (vocabulary_size, embedding_dimension)
+    zeros_embedding_values = np.zeros(embedding_shape)
+
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual(embedding_shape, shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return zeros_embedding_values
+
+    # Build columns.
+    categorical_column_a = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    categorical_column_b = fc_old.categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_a, embedding_column_b = fc_old.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension,
+        initializer=_initializer)
+
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          categorical_column_a.name: input_a,
+          categorical_column_b.name: input_b,
+      }, (embedding_column_a, embedding_column_b))
+      # Linear weights do not follow the column name. But this is a rare use
+      # case, and fixing it would add too much complexity to the code.
+      expected_var_names = (
+          'linear_model/bias_weights:0',
+          'linear_model/aaa_bbb_shared_embedding/weights:0',
+          'linear_model/aaa_bbb_shared_embedding/embedding_weights:0',
+          'linear_model/aaa_bbb_shared_embedding_1/weights:0',
+      )
+      self.assertItemsEqual(
+          expected_var_names,
+          [v.name for v in ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)])
+      trainable_vars = {
+          v.name: v
+          for v in ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      }
+      self.assertItemsEqual(expected_var_names, trainable_vars.keys())
+      bias = trainable_vars['linear_model/bias_weights:0']
+      embedding_weights = trainable_vars[
+          'linear_model/aaa_bbb_shared_embedding/embedding_weights:0']
+      linear_weights_a = trainable_vars[
+          'linear_model/aaa_bbb_shared_embedding/weights:0']
+      linear_weights_b = trainable_vars[
+          'linear_model/aaa_bbb_shared_embedding_1/weights:0']
+      with _initialized_session():
+        # Predictions with all zero weights.
+        self.assertAllClose(np.zeros((1,)), bias.eval())
+        self.assertAllClose(zeros_embedding_values, embedding_weights.eval())
+        self.assertAllClose(
+            np.zeros((embedding_dimension, 1)), linear_weights_a.eval())
+        self.assertAllClose(
+            np.zeros((embedding_dimension, 1)), linear_weights_b.eval())
+        self.assertAllClose(np.zeros((batch_size, 1)), predictions.eval())
+
+        # Predictions with all non-zero weights.
+        embedding_weights.assign((
+            (1., 2.),  # id 0
+            (3., 5.),  # id 1
+            (7., 11.)  # id 2
+        )).eval()
+        linear_weights_a.assign(((4.,), (6.,))).eval()
+        # example 0, ids [2], embedding[0] = [7, 11]
+        # example 1, ids [0, 1], embedding[1] = mean([1, 2] + [3, 5]) = [2, 3.5]
+        # sum(embeddings * linear_weights)
+        # = [4*7 + 6*11, 4*2 + 6*3.5] = [94, 29]
+        linear_weights_b.assign(((3.,), (5.,))).eval()
+        # example 0, ids [0], embedding[0] = [1, 2]
+        # example 1, ids [], embedding[1] = 0, 0]
+        # sum(embeddings * linear_weights)
+        # = [3*1 + 5*2, 3*0 +5*0] = [13, 0]
+        self.assertAllClose([[94. + 13.], [29.]], predictions.eval())
+
+  def _test_input_layer(self, trainable=True):
+    # Inputs.
+    vocabulary_size = 3
+    sparse_input_a = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 4)),
+        values=(2, 0, 1),
+        dense_shape=(2, 5))
+    sparse_input_b = sparse_tensor.SparseTensorValue(
+        # example 0, ids [0]
+        # example 1, ids []
+        indices=((0, 0),),
+        values=(0,),
+        dense_shape=(2, 5))
+
+    # Embedding variable.
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    # Expected lookup result, using combiner='mean'.
+    expected_lookups = (
+        # example 0:
+        # A ids [2], embedding = [7, 11]
+        # B ids [0], embedding = [1, 2]
+        (7., 11., 1., 2.),
+        # example 1:
+        # A ids [0, 1], embedding = mean([1, 2] + [3, 5]) = [2, 3.5]
+        # B ids [], embedding = [0, 0]
+        (2., 3.5, 0., 0.),
+    )
+
+    # Build columns.
+    categorical_column_a = fc_old.categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    categorical_column_b = fc_old.categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_a, embedding_column_b = fc_old.shared_embedding_columns(
+        [categorical_column_a, categorical_column_b],
+        dimension=embedding_dimension,
+        initializer=_initializer,
+        trainable=trainable)
+
+    # Provide sparse input and get dense result.
+    input_layer = fc.input_layer(
+        features={'aaa': sparse_input_a, 'bbb': sparse_input_b},
+        feature_columns=(embedding_column_b, embedding_column_a))
+
+    # Assert expected embedding variable and lookups.
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ['input_layer/aaa_bbb_shared_embedding/embedding_weights:0'],
+        tuple([v.name for v in global_vars]))
+    trainable_vars = ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+    if trainable:
+      self.assertItemsEqual(
+          ['input_layer/aaa_bbb_shared_embedding/embedding_weights:0'],
+          tuple([v.name for v in trainable_vars]))
+    else:
+      self.assertItemsEqual([], tuple([v.name for v in trainable_vars]))
+    shared_embedding_vars = global_vars
+    with _initialized_session():
+      self.assertAllEqual(embedding_values, shared_embedding_vars[0].eval())
+      self.assertAllEqual(expected_lookups, input_layer.eval())
+
+  def test_input_layer(self):
+    self._test_input_layer()
+
+  def test_input_layer_no_trainable(self):
+    self._test_input_layer(trainable=False)
+
+
+class WeightedCategoricalColumnTest(test.TestCase):
+
+  def test_defaults(self):
+    column = fc.weighted_categorical_column(
+        categorical_column=fc.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    self.assertEqual('ids_weighted_by_values', column.name)
+    self.assertEqual(3, column.num_buckets)
+    self.assertEqual({
+        'ids': parsing_ops.VarLenFeature(dtypes.int64),
+        'values': parsing_ops.VarLenFeature(dtypes.float32)
+    }, column.parse_example_spec)
+
+  def test_deep_copy(self):
+    """Tests deepcopy of categorical_column_with_hash_bucket."""
+    original = fc.weighted_categorical_column(
+        categorical_column=fc.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    for column in (original, copy.deepcopy(original)):
+      self.assertEqual('ids_weighted_by_values', column.name)
+      self.assertEqual(3, column.num_buckets)
+      self.assertEqual({
+          'ids': parsing_ops.VarLenFeature(dtypes.int64),
+          'values': parsing_ops.VarLenFeature(dtypes.float32)
+      }, column.parse_example_spec)
+
+  def test_invalid_dtype_none(self):
+    with self.assertRaisesRegexp(ValueError, 'is not convertible to float'):
+      fc.weighted_categorical_column(
+          categorical_column=fc.categorical_column_with_identity(
+              key='ids', num_buckets=3),
+          weight_feature_key='values',
+          dtype=None)
+
+  def test_invalid_dtype_string(self):
+    with self.assertRaisesRegexp(ValueError, 'is not convertible to float'):
+      fc.weighted_categorical_column(
+          categorical_column=fc.categorical_column_with_identity(
+              key='ids', num_buckets=3),
+          weight_feature_key='values',
+          dtype=dtypes.string)
+
+  def test_invalid_input_dtype(self):
+    column = fc.weighted_categorical_column(
+        categorical_column=fc.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    strings = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('omar', 'stringer', 'marlo'),
+        dense_shape=(2, 2))
+    with self.assertRaisesRegexp(ValueError, 'Bad dtype'):
+      _transform_features({'ids': strings, 'values': strings}, (column,), None)
+
+  def test_column_name_collision(self):
+    with self.assertRaisesRegexp(ValueError, r'Parse config.*already exists'):
+      fc.weighted_categorical_column(
+          categorical_column=fc.categorical_column_with_identity(
+              key='aaa', num_buckets=3),
+          weight_feature_key='aaa').parse_example_spec()
+
+  def test_missing_weights(self):
+    column = fc.weighted_categorical_column(
+        categorical_column=fc.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=('omar', 'stringer', 'marlo'),
+        dense_shape=(2, 2))
+    with self.assertRaisesRegexp(
+        ValueError, 'values is not in features dictionary'):
+      _transform_features({'ids': inputs}, (column,), None)
+
+  def test_parse_example(self):
+    a = fc.categorical_column_with_vocabulary_list(
+        key='aaa', vocabulary_list=('omar', 'stringer', 'marlo'))
+    a_weighted = fc.weighted_categorical_column(a, weight_feature_key='weights')
+    data = example_pb2.Example(features=feature_pb2.Features(
+        feature={
+            'aaa':
+                feature_pb2.Feature(bytes_list=feature_pb2.BytesList(
+                    value=[b'omar', b'stringer'])),
+            'weights':
+                feature_pb2.Feature(float_list=feature_pb2.FloatList(
+                    value=[1., 10.]))
+        }))
+    features = parsing_ops.parse_example(
+        serialized=[data.SerializeToString()],
+        features=fc.make_parse_example_spec([a_weighted]))
+    self.assertIn('aaa', features)
+    self.assertIn('weights', features)
+    with self.test_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([b'omar', b'stringer'], dtype=np.object_),
+              dense_shape=[1, 2]),
+          features['aaa'].eval())
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=[[0, 0], [0, 1]],
+              values=np.array([1., 10.], dtype=np.float32),
+              dense_shape=[1, 2]),
+          features['weights'].eval())
+
+  def test_transform_features(self):
+    column = fc.weighted_categorical_column(
+        categorical_column=fc.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(0, 1, 0),
+        dense_shape=(2, 2))
+    weights = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(0.5, 1.0, 0.1),
+        dense_shape=(2, 2))
+    id_tensor, weight_tensor = _transform_features({
+        'ids': inputs,
+        'values': weights,
+    }, (column,), None)[column]
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array(inputs.values, dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_tensor.eval())
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=weights.indices,
+              values=np.array(weights.values, dtype=np.float32),
+              dense_shape=weights.dense_shape),
+          weight_tensor.eval())
+
+  def test_transform_features_dense_input(self):
+    column = fc.weighted_categorical_column(
+        categorical_column=fc.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    weights = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(0.5, 1.0, 0.1),
+        dense_shape=(2, 2))
+    id_tensor, weight_tensor = _transform_features({
+        'ids': ((0, -1), (1, 0)),
+        'values': weights,
+    }, (column,), None)[column]
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=np.array((0, 1, 0), dtype=np.int64),
+              dense_shape=(2, 2)),
+          id_tensor.eval())
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=weights.indices,
+              values=np.array(weights.values, dtype=np.float32),
+              dense_shape=weights.dense_shape),
+          weight_tensor.eval())
+
+  def test_transform_features_dense_weights(self):
+    column = fc.weighted_categorical_column(
+        categorical_column=fc.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 1, 0),
+        dense_shape=(2, 2))
+    id_tensor, weight_tensor = _transform_features({
+        'ids': inputs,
+        'values': ((.5, 0.), (1., .1)),
+    }, (column,), None)[column]
+    with _initialized_session():
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=inputs.indices,
+              values=np.array(inputs.values, dtype=np.int64),
+              dense_shape=inputs.dense_shape),
+          id_tensor.eval())
+      _assert_sparse_tensor_value(
+          self,
+          sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=np.array((.5, 1., .1), dtype=np.float32),
+              dense_shape=(2, 2)),
+          weight_tensor.eval())
+
+  def test_keras_linear_model(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          'ids':
+              sparse_tensor.SparseTensorValue(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(0, 2, 1),
+                  dense_shape=(2, 2)),
+          'values':
+              sparse_tensor.SparseTensorValue(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(.5, 1., .1),
+                  dense_shape=(2, 2))
+      }, (column,))
+      bias = get_linear_model_bias()
+      weight_var = get_linear_model_column_var(column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,)), weight_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        weight_var.assign(((1.,), (2.,), (3.,))).eval()
+        # weight_var[0] * weights[0, 0] = 1 * .5 = .5
+        # weight_var[2] * weights[1, 0] + weight_var[1] * weights[1, 1]
+        # = 3*1 + 2*.1 = 3+.2 = 3.2
+        self.assertAllClose(((.5,), (3.2,)), predictions.eval())
+
+  def test_keras_linear_model_mismatched_shape(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      with self.assertRaisesRegexp(ValueError,
+                                   r'Dimensions.*are not compatible'):
+        get_keras_linear_model_predictions({
+            'ids':
+                sparse_tensor.SparseTensorValue(
+                    indices=((0, 0), (1, 0), (1, 1)),
+                    values=(0, 2, 1),
+                    dense_shape=(2, 2)),
+            'values':
+                sparse_tensor.SparseTensorValue(
+                    indices=((0, 0), (0, 1), (1, 0), (1, 1)),
+                    values=(.5, 11., 1., .1),
+                    dense_shape=(2, 2))
+        }, (column,))
+
+  def test_keras_linear_model_mismatched_dense_values(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions(
+          {
+              'ids':
+                  sparse_tensor.SparseTensorValue(
+                      indices=((0, 0), (1, 0), (1, 1)),
+                      values=(0, 2, 1),
+                      dense_shape=(2, 2)),
+              'values': ((.5,), (1.,))
+          }, (column,),
+          sparse_combiner='mean')
+      # Disabling the constant folding optimizer here since it changes the
+      # error message differently on CPU and GPU.
+      config = config_pb2.ConfigProto()
+      config.graph_options.rewrite_options.constant_folding = (
+          rewriter_config_pb2.RewriterConfig.OFF)
+      with _initialized_session(config):
+        with self.assertRaisesRegexp(errors.OpError, 'Incompatible shapes'):
+          predictions.eval()
+
+  def test_keras_linear_model_mismatched_dense_shape(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      predictions = get_keras_linear_model_predictions({
+          'ids':
+              sparse_tensor.SparseTensorValue(
+                  indices=((0, 0), (1, 0), (1, 1)),
+                  values=(0, 2, 1),
+                  dense_shape=(2, 2)),
+          'values': ((.5,), (1.,), (.1,))
+      }, (column,))
+      bias = get_linear_model_bias()
+      weight_var = get_linear_model_column_var(column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,)), weight_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        weight_var.assign(((1.,), (2.,), (3.,))).eval()
+        # weight_var[0] * weights[0, 0] = 1 * .5 = .5
+        # weight_var[2] * weights[1, 0] + weight_var[1] * weights[1, 1]
+        # = 3*1 + 2*.1 = 3+.2 = 3.2
+        self.assertAllClose(((.5,), (3.2,)), predictions.eval())
+
+  def test_linear_model(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          'ids': sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=(0, 2, 1),
+              dense_shape=(2, 2)),
+          'values': sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=(.5, 1., .1),
+              dense_shape=(2, 2))
+      }, (column,))
+      bias = get_linear_model_bias()
+      weight_var = get_linear_model_column_var(column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,)), weight_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        weight_var.assign(((1.,), (2.,), (3.,))).eval()
+        # weight_var[0] * weights[0, 0] = 1 * .5 = .5
+        # weight_var[2] * weights[1, 0] + weight_var[1] * weights[1, 1]
+        # = 3*1 + 2*.1 = 3+.2 = 3.2
+        self.assertAllClose(((.5,), (3.2,)), predictions.eval())
+
+  def test_linear_model_mismatched_shape(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      with self.assertRaisesRegexp(
+          ValueError, r'Dimensions.*are not compatible'):
+        fc.linear_model({
+            'ids': sparse_tensor.SparseTensorValue(
+                indices=((0, 0), (1, 0), (1, 1)),
+                values=(0, 2, 1),
+                dense_shape=(2, 2)),
+            'values': sparse_tensor.SparseTensorValue(
+                indices=((0, 0), (0, 1), (1, 0), (1, 1)),
+                values=(.5, 11., 1., .1),
+                dense_shape=(2, 2))
+        }, (column,))
+
+  def test_linear_model_mismatched_dense_values(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      predictions = fc.linear_model(
+          {
+              'ids':
+                  sparse_tensor.SparseTensorValue(
+                      indices=((0, 0), (1, 0), (1, 1)),
+                      values=(0, 2, 1),
+                      dense_shape=(2, 2)),
+              'values': ((.5,), (1.,))
+          }, (column,),
+          sparse_combiner='mean')
+      # Disabling the constant folding optimizer here since it changes the
+      # error message differently on CPU and GPU.
+      config = config_pb2.ConfigProto()
+      config.graph_options.rewrite_options.constant_folding = (
+          rewriter_config_pb2.RewriterConfig.OFF)
+      with _initialized_session(config):
+        with self.assertRaisesRegexp(errors.OpError, 'Incompatible shapes'):
+          predictions.eval()
+
+  def test_linear_model_mismatched_dense_shape(self):
+    column = fc_old.weighted_categorical_column(
+        categorical_column=fc_old.categorical_column_with_identity(
+            key='ids', num_buckets=3),
+        weight_feature_key='values')
+    with ops.Graph().as_default():
+      predictions = fc.linear_model({
+          'ids': sparse_tensor.SparseTensorValue(
+              indices=((0, 0), (1, 0), (1, 1)),
+              values=(0, 2, 1),
+              dense_shape=(2, 2)),
+          'values': ((.5,), (1.,), (.1,))
+      }, (column,))
+      bias = get_linear_model_bias()
+      weight_var = get_linear_model_column_var(column)
+      with _initialized_session():
+        self.assertAllClose((0.,), bias.eval())
+        self.assertAllClose(((0.,), (0.,), (0.,)), weight_var.eval())
+        self.assertAllClose(((0.,), (0.,)), predictions.eval())
+        weight_var.assign(((1.,), (2.,), (3.,))).eval()
+        # weight_var[0] * weights[0, 0] = 1 * .5 = .5
+        # weight_var[2] * weights[1, 0] + weight_var[1] * weights[1, 1]
+        # = 3*1 + 2*.1 = 3+.2 = 3.2
+        self.assertAllClose(((.5,), (3.2,)), predictions.eval())
+
+  # TODO(ptucker): Add test with embedding of weighted categorical.
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/framework/common_shapes.py b/tensorflow/python/framework/common_shapes.py
index 3c5aebbce8..40788e24c4 100644
--- a/tensorflow/python/framework/common_shapes.py
+++ b/tensorflow/python/framework/common_shapes.py
@@ -28,6 +28,18 @@ from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_util
 
 
+def has_fully_defined_shape(tensor):
+  """Returns true if tensor has a fully defined shape."""
+  return isinstance(tensor, ops.EagerTensor) or tensor.shape.is_fully_defined()
+
+
+def rank(tensor):
+  """Return a rank if it is a tensor, else return None."""
+  if isinstance(tensor, ops.Tensor):
+    return tensor._rank()  # pylint: disable=protected-access
+  return None
+
+
 def scalar_shape(unused_op):
   """Shape function for ops that output a scalar value."""
   return [tensor_shape.scalar()]
diff --git a/tensorflow/python/framework/error_interpolation.py b/tensorflow/python/framework/error_interpolation.py
index 9ccae76147..6e844e14b9 100644
--- a/tensorflow/python/framework/error_interpolation.py
+++ b/tensorflow/python/framework/error_interpolation.py
@@ -24,11 +24,15 @@ from __future__ import print_function
 
 import collections
 import itertools
+import os
 import re
 import string
 
 import six
 
+from tensorflow.python.util import tf_stack
+
+
 _NAME_REGEX = r"[A-Za-z0-9.][A-Za-z0-9_.\-/]*?"
 _FORMAT_REGEX = r"[A-Za-z0-9_.\-/${}:]+"
 _TAG_REGEX = r"\^\^({name}):({name}):({fmt})\^\^".format(
@@ -38,6 +42,11 @@ _INTERPOLATION_PATTERN = re.compile(_INTERPOLATION_REGEX)
 
 _ParseTag = collections.namedtuple("_ParseTag", ["type", "name", "format"])
 
+_BAD_FILE_SUBSTRINGS = [
+    os.path.join("tensorflow", "python"),
+    "<embedded",
+]
+
 
 def _parse_message(message):
   """Parses the message.
@@ -48,6 +57,12 @@ def _parse_message(message):
   "123^^node:Foo:${file}^^456^^node:Bar:${line}^^789", there are two tags and
   three separators. The separators are the numeric characters.
 
+  Supported tags after node:<node_name>
+    file: Replaced with the filename in which the node was defined.
+    line: Replaced by the line number at which the node was defined.
+    colocations: Replaced by a multi-line message describing the file and
+        line numbers at which this node was colocated with other nodes.
+
   Args:
     message: String to parse
 
@@ -72,9 +87,193 @@ def _parse_message(message):
   return seps, tags
 
 
-# TODO(jtkeeling): Modify to actually interpolate format strings rather than
-# echoing them.
-def interpolate(error_message):
+def _compute_device_summary_from_list(name, device_assignment_list, prefix=""):
+  """Return a summary of an op's device function stack.
+
+  Args:
+    name: The name of the op.
+    device_assignment_list: The op._device_assignments list.
+    prefix:  An optional string prefix used before each line of the multi-
+        line string returned by this function.
+
+  Returns:
+    A multi-line string similar to:
+        Device assignments active during op 'foo' creation:
+          with tf.device(/cpu:0): <test_1.py:27>
+          with tf.device(some_func<foo.py, 123>): <test_2.py:38>
+    The first line will have no padding to its left by default.  Subsequent
+    lines will have two spaces of left-padding.  Use the prefix argument
+    to increase indentation.
+  """
+  if not device_assignment_list:
+    message = "No device assignments were active during op '%s' creation."
+    message %= name
+    return prefix + message
+
+  str_list = []
+  str_list.append("%sDevice assignments active during op '%s' creation:"
+                  % (prefix, name))
+
+  for traceable_obj in device_assignment_list:
+    location_summary = "<{file}:{line}>".format(file=traceable_obj.filename,
+                                                line=traceable_obj.lineno)
+    subs = {
+        "prefix": prefix,
+        "indent": "  ",
+        "dev_name": traceable_obj.obj,
+        "loc": location_summary,
+    }
+    str_list.append(
+        "{prefix}{indent}with tf.device({dev_name}): {loc}".format(**subs))
+
+  return "\n".join(str_list)
+
+
+def _compute_device_assignment_summary_from_op(op, prefix=""):
+  # pylint: disable=protected-access
+  return _compute_device_summary_from_list(op.name, op._device_assignments,
+                                           prefix)
+  # pylint: enable=protected-access
+
+
+def _compute_colocation_summary_from_dict(name, colocation_dict, prefix=""):
+  """Return a summary of an op's colocation stack.
+
+  Args:
+    name: The op name.
+    colocation_dict: The op._colocation_dict.
+    prefix:  An optional string prefix used before each line of the multi-
+        line string returned by this function.
+
+  Returns:
+    A multi-line string similar to:
+        Node-device colocations active during op creation:
+          with tf.colocate_with(test_node_1): <test_1.py:27>
+          with tf.colocate_with(test_node_2): <test_2.py:38>
+    The first line will have no padding to its left by default.  Subsequent
+    lines will have two spaces of left-padding.  Use the prefix argument
+    to increase indentation.
+  """
+  if not colocation_dict:
+    message = "No node-device colocations were active during op '%s' creation."
+    message %= name
+    return prefix + message
+
+  str_list = []
+  str_list.append("%sNode-device colocations active during op '%s' creation:"
+                  % (prefix, name))
+
+  for coloc_name, location in colocation_dict.items():
+    location_summary = "<{file}:{line}>".format(file=location.filename,
+                                                line=location.lineno)
+    subs = {
+        "prefix": prefix,
+        "indent": "  ",
+        "name": coloc_name,
+        "loc": location_summary,
+    }
+    str_list.append(
+        "{prefix}{indent}with tf.colocate_with({name}): {loc}".format(**subs))
+
+  return "\n".join(str_list)
+
+
+def _compute_colocation_summary_from_op(op, prefix=""):
+  """Fetch colocation file, line, and nesting and return a summary string."""
+  return _compute_colocation_summary_from_dict(
+      op.name, op._colocation_dict, prefix)  # pylint: disable=protected-access
+
+
+def _find_index_of_defining_frame_for_op(op):
+  """Return index in op._traceback with first 'useful' frame.
+
+  This method reads through the stack stored in op._traceback looking for the
+  innermost frame which (hopefully) belongs to the caller.  It accomplishes this
+  by rejecting frames whose filename appears to come from TensorFlow (see
+  error_interpolation._BAD_FILE_SUBSTRINGS for the list of rejected substrings).
+
+  Args:
+    op: the Operation object for which we would like to find the defining
+        location.
+
+  Returns:
+    Integer index into op._traceback where the first non-TF file was found
+    (innermost to outermost), or 0 (for the outermost stack frame) if all files
+    came from TensorFlow.
+  """
+  # pylint: disable=protected-access
+  # Index 0 of tf_traceback is the outermost frame.
+  tf_traceback = tf_stack.convert_stack(op._traceback)
+  size = len(tf_traceback)
+  # pylint: enable=protected-access
+  filenames = [frame[tf_stack.TB_FILENAME] for frame in tf_traceback]
+  # We process the filenames from the innermost frame to outermost.
+  for idx, filename in enumerate(reversed(filenames)):
+    contains_bad_substrings = [ss in filename for ss in _BAD_FILE_SUBSTRINGS]
+    if not any(contains_bad_substrings):
+      return size - idx - 1
+  return 0
+
+
+def _get_defining_frame_from_op(op):
+  """Find and return stack frame where op was defined."""
+  frame_index = _find_index_of_defining_frame_for_op(op)
+  # pylint: disable=protected-access
+  frame = op._traceback[frame_index]
+  # pylint: enable=protected-access
+  return frame
+
+
+def compute_field_dict(op):
+  """Return a dictionary mapping interpolation tokens to values.
+
+  Args:
+    op: op.Operation object having a _traceback member.
+
+  Returns:
+    A dictionary mapping string tokens to string values.  The keys are shown
+    below along with example values.
+    {
+      "file": "tool_utils.py",
+      "line": "124",
+      "defined_at": " (defined at tool_utils.py:124)",
+      "colocations":
+          '''Node-device colocations active during op creation:
+               with tf.colocate_with(test_node_1): <test_1.py:27>
+               with tf.colocate_with(test_node_2): <test_2.py:38>'''
+      "devices":
+          '''Device assignments active during op 'foo' creation:
+               with tf.device(/cpu:0): <test_1.py:27>
+               with tf.device(some_func<foo.py, 123>): <test_2.py:38>'''
+      "devs_and_colocs": A concatenation of colocations and devices, e.g.
+          '''Node-device colocations active during op creation:
+               with tf.colocate_with(test_node_1): <test_1.py:27>
+               with tf.colocate_with(test_node_2): <test_2.py:38>'''
+             Device assignments active during op 'foo' creation:
+               with tf.device(/cpu:0): <test_1.py:27>
+               with tf.device(some_func<foo.py, 123>): <test_2.py:38>'''
+    }
+  """
+  frame = _get_defining_frame_from_op(op)
+  filename = frame[tf_stack.TB_FILENAME]
+  lineno = frame[tf_stack.TB_LINENO]
+  defined_at = " (defined at %s:%d)" % (filename, lineno)
+  colocation_summary = _compute_colocation_summary_from_op(op)
+  device_summary = _compute_device_assignment_summary_from_op(op)
+  combined_summary = "\n".join([colocation_summary, device_summary])
+
+  field_dict = {
+      "file": filename,
+      "line": lineno,
+      "defined_at": defined_at,
+      "colocations": colocation_summary,
+      "devices": device_summary,
+      "devs_and_colocs": combined_summary,
+  }
+  return field_dict
+
+
+def interpolate(error_message, graph):
   """Interpolates an error message.
 
   The error message can contain tags of the form ^^type:name:format^^ which will
@@ -82,11 +281,33 @@ def interpolate(error_message):
 
   Args:
     error_message: A string to interpolate.
+    graph: ops.Graph object containing all nodes referenced in the error
+        message.
 
   Returns:
     The string with tags of the form ^^type:name:format^^ interpolated.
   """
   seps, tags = _parse_message(error_message)
-  subs = [string.Template(tag.format).safe_substitute({}) for tag in tags]
+
+  node_name_to_substitution_dict = {}
+  for name in [t.name for t in tags]:
+    if name in node_name_to_substitution_dict:
+      continue
+    try:
+      op = graph.get_operation_by_name(name)
+    except KeyError:
+      op = None
+
+    if op is not None:
+      field_dict = compute_field_dict(op)
+    else:
+      msg = "<NA>"
+      field_dict = collections.defaultdict(lambda s=msg: s)
+    node_name_to_substitution_dict[name] = field_dict
+
+  subs = [
+      string.Template(tag.format).safe_substitute(
+          node_name_to_substitution_dict[tag.name]) for tag in tags
+  ]
   return "".join(
       itertools.chain(*six.moves.zip_longest(seps, subs, fillvalue="")))
diff --git a/tensorflow/python/framework/error_interpolation_test.py b/tensorflow/python/framework/error_interpolation_test.py
index ad448deb62..0427156b2b 100644
--- a/tensorflow/python/framework/error_interpolation_test.py
+++ b/tensorflow/python/framework/error_interpolation_test.py
@@ -18,31 +18,280 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
+
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import error_interpolation
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import traceable_stack
 from tensorflow.python.platform import test
+from tensorflow.python.util import tf_stack
+
+
+def _make_frame_with_filename(op, idx, filename):
+  """Return a copy of an existing stack frame with a new filename."""
+  stack_frame = list(op._traceback[idx])
+  stack_frame[tf_stack.TB_FILENAME] = filename
+  return tuple(stack_frame)
+
+
+def _modify_op_stack_with_filenames(op, num_user_frames, user_filename,
+                                    num_inner_tf_frames):
+  """Replace op._traceback with a new traceback using special filenames."""
+  tf_filename = "%d" + error_interpolation._BAD_FILE_SUBSTRINGS[0]
+  user_filename = os.path.join("%d", "my_favorite_file.py")
+
+  num_requested_frames = num_user_frames + num_inner_tf_frames
+  num_actual_frames = len(op._traceback)
+  num_outer_frames = num_actual_frames - num_requested_frames
+  assert num_requested_frames <= num_actual_frames, "Too few real frames."
+
+  # The op's traceback has outermost frame at index 0.
+  stack = []
+  for idx in range(0, num_outer_frames):
+    stack.append(op._traceback[idx])
+  for idx in range(len(stack), len(stack)+num_user_frames):
+    stack.append(_make_frame_with_filename(op, idx, user_filename % idx))
+  for idx in range(len(stack), len(stack)+num_inner_tf_frames):
+    stack.append(_make_frame_with_filename(op, idx, tf_filename % idx))
+  op._traceback = stack
+
+
+class ComputeDeviceSummaryFromOpTest(test.TestCase):
+
+  def testCorrectFormatWithActiveDeviceAssignments(self):
+    assignments = []
+    assignments.append(
+        traceable_stack.TraceableObject("/cpu:0",
+                                        filename="hope.py",
+                                        lineno=24))
+    assignments.append(
+        traceable_stack.TraceableObject("/gpu:2",
+                                        filename="please.py",
+                                        lineno=42))
+
+    summary = error_interpolation._compute_device_summary_from_list(
+        "nodename", assignments, prefix="  ")
+
+    self.assertIn("nodename", summary)
+    self.assertIn("tf.device(/cpu:0)", summary)
+    self.assertIn("<hope.py:24>", summary)
+    self.assertIn("tf.device(/gpu:2)", summary)
+    self.assertIn("<please.py:42>", summary)
+
+  def testCorrectFormatWhenNoColocationsWereActive(self):
+    device_assignment_list = []
+    summary = error_interpolation._compute_device_summary_from_list(
+        "nodename", device_assignment_list, prefix="  ")
+    self.assertIn("nodename", summary)
+    self.assertIn("No device assignments", summary)
+
+
+class ComputeColocationSummaryFromOpTest(test.TestCase):
+
+  def testCorrectFormatWithActiveColocations(self):
+    t_obj_1 = traceable_stack.TraceableObject(None,
+                                              filename="test_1.py",
+                                              lineno=27)
+    t_obj_2 = traceable_stack.TraceableObject(None,
+                                              filename="test_2.py",
+                                              lineno=38)
+    colocation_dict = {
+        "test_node_1": t_obj_1,
+        "test_node_2": t_obj_2,
+    }
+    summary = error_interpolation._compute_colocation_summary_from_dict(
+        "node_name", colocation_dict, prefix="  ")
+    self.assertIn("node_name", summary)
+    self.assertIn("colocate_with(test_node_1)", summary)
+    self.assertIn("<test_1.py:27>", summary)
+    self.assertIn("colocate_with(test_node_2)", summary)
+    self.assertIn("<test_2.py:38>", summary)
+
+  def testCorrectFormatWhenNoColocationsWereActive(self):
+    colocation_dict = {}
+    summary = error_interpolation._compute_colocation_summary_from_dict(
+        "node_name", colocation_dict, prefix="  ")
+    self.assertIn("node_name", summary)
+    self.assertIn("No node-device colocations", summary)
+
+
+class InterpolateFilenamesAndLineNumbersTest(test.TestCase):
+
+  def setUp(self):
+    ops.reset_default_graph()
+    # Add nodes to the graph for retrieval by name later.
+    constant_op.constant(1, name="One")
+    constant_op.constant(2, name="Two")
+    three = constant_op.constant(3, name="Three")
+    self.graph = three.graph
+
+    # Change the list of bad file substrings so that constant_op.py is chosen
+    # as the defining stack frame for constant_op.constant ops.
+    self.old_bad_strings = error_interpolation._BAD_FILE_SUBSTRINGS
+    error_interpolation._BAD_FILE_SUBSTRINGS = [
+        "%sops.py" % os.sep,
+        "%sutil" % os.sep,
+    ]
+
+  def tearDown(self):
+    error_interpolation._BAD_FILE_SUBSTRINGS = self.old_bad_strings
 
+  def testFindIndexOfDefiningFrameForOp(self):
+    local_op = constant_op.constant(42).op
+    user_filename = "hope.py"
+    _modify_op_stack_with_filenames(local_op,
+                                    num_user_frames=3,
+                                    user_filename=user_filename,
+                                    num_inner_tf_frames=5)
+    idx = error_interpolation._find_index_of_defining_frame_for_op(local_op)
+    # Expected frame is 6th from the end because there are 5 inner frames witih
+    # TF filenames.
+    expected_frame = len(local_op._traceback) - 6
+    self.assertEqual(expected_frame, idx)
 
-class InterpolateTest(test.TestCase):
+  def testFindIndexOfDefiningFrameForOpReturnsZeroOnError(self):
+    local_op = constant_op.constant(43).op
+    # Truncate stack to known length.
+    local_op._traceback = local_op._traceback[:7]
+    # Ensure all frames look like TF frames.
+    _modify_op_stack_with_filenames(local_op,
+                                    num_user_frames=0,
+                                    user_filename="user_file.py",
+                                    num_inner_tf_frames=7)
+    idx = error_interpolation._find_index_of_defining_frame_for_op(local_op)
+    self.assertEqual(0, idx)
 
   def testNothingToDo(self):
     normal_string = "This is just a normal string"
-    interpolated_string = error_interpolation.interpolate(normal_string)
+    interpolated_string = error_interpolation.interpolate(normal_string,
+                                                          self.graph)
     self.assertEqual(interpolated_string, normal_string)
 
   def testOneTag(self):
-    one_tag_string = "^^node:Foo:${file}^^"
-    interpolated_string = error_interpolation.interpolate(one_tag_string)
-    self.assertEqual(interpolated_string, "${file}")
+    one_tag_string = "^^node:Two:${file}^^"
+    interpolated_string = error_interpolation.interpolate(one_tag_string,
+                                                          self.graph)
+    self.assertTrue(interpolated_string.endswith("constant_op.py"),
+                    "interpolated_string '%s' did not end with constant_op.py"
+                    % interpolated_string)
+
+  def testOneTagWithAFakeNameResultsInPlaceholders(self):
+    one_tag_string = "^^node:MinusOne:${file}^^"
+    interpolated_string = error_interpolation.interpolate(one_tag_string,
+                                                          self.graph)
+    self.assertEqual("<NA>", interpolated_string)
 
   def testTwoTagsNoSeps(self):
-    two_tags_no_seps = "^^node:Foo:${file}^^^^node:Bar:${line}^^"
-    interpolated_string = error_interpolation.interpolate(two_tags_no_seps)
-    self.assertEqual(interpolated_string, "${file}${line}")
+    two_tags_no_seps = "^^node:One:${file}^^^^node:Three:${line}^^"
+    interpolated_string = error_interpolation.interpolate(two_tags_no_seps,
+                                                          self.graph)
+    self.assertRegexpMatches(interpolated_string, "constant_op.py[0-9]+")
 
   def testTwoTagsWithSeps(self):
-    two_tags_with_seps = "123^^node:Foo:${file}^^456^^node:Bar:${line}^^789"
-    interpolated_string = error_interpolation.interpolate(two_tags_with_seps)
-    self.assertEqual(interpolated_string, "123${file}456${line}789")
+    two_tags_with_seps = ";;;^^node:Two:${file}^^,,,^^node:Three:${line}^^;;;"
+    interpolated_string = error_interpolation.interpolate(two_tags_with_seps,
+                                                          self.graph)
+    expected_regex = "^;;;.*constant_op.py,,,[0-9]*;;;$"
+    self.assertRegexpMatches(interpolated_string, expected_regex)
+
+
+class InterpolateDeviceSummaryTest(test.TestCase):
+
+  def _fancy_device_function(self, unused_op):
+    return "/cpu:*"
+
+  def setUp(self):
+    ops.reset_default_graph()
+    self.zero = constant_op.constant([0.0], name="zero")
+    with ops.device("/cpu"):
+      self.one = constant_op.constant([1.0], name="one")
+      with ops.device("/cpu:0"):
+        self.two = constant_op.constant([2.0], name="two")
+    with ops.device(self._fancy_device_function):
+      self.three = constant_op.constant(3.0, name="three")
+
+    self.graph = self.three.graph
+
+  def testNodeZeroHasNoDeviceSummaryInfo(self):
+    message = "^^node:zero:${devices}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    self.assertIn("No device assignments were active", result)
+
+  def testNodeOneHasExactlyOneInterpolatedDevice(self):
+    message = "^^node:one:${devices}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    num_devices = result.count("tf.device")
+    self.assertEqual(1, num_devices)
+    self.assertIn("tf.device(/cpu)", result)
+
+  def testNodeTwoHasTwoInterpolatedDevice(self):
+    message = "^^node:two:${devices}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    num_devices = result.count("tf.device")
+    self.assertEqual(2, num_devices)
+    self.assertIn("tf.device(/cpu)", result)
+    self.assertIn("tf.device(/cpu:0)", result)
+
+  def testNodeThreeHasFancyFunctionDisplayNameForInterpolatedDevice(self):
+    message = "^^node:three:${devices}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    num_devices = result.count("tf.device")
+    self.assertEqual(1, num_devices)
+    name_re = r"_fancy_device_function<.*error_interpolation_test.py, [0-9]+>"
+    expected_re = r"with tf.device\(.*%s\)" % name_re
+    self.assertRegexpMatches(result, expected_re)
+
+
+class InterpolateColocationSummaryTest(test.TestCase):
+
+  def setUp(self):
+    ops.reset_default_graph()
+    # Add nodes to the graph for retrieval by name later.
+    node_one = constant_op.constant(1, name="One")
+    node_two = constant_op.constant(2, name="Two")
+
+    # node_three has one colocation group, obviously.
+    with ops.colocate_with(node_one):
+      node_three = constant_op.constant(3, name="Three_with_one")
+
+    # node_four has one colocation group even though three is (transitively)
+    # colocated with one.
+    with ops.colocate_with(node_three):
+      constant_op.constant(4, name="Four_with_three")
+
+    # node_five has two colocation groups because one and two are not colocated.
+    with ops.colocate_with(node_two):
+      with ops.colocate_with(node_one):
+        constant_op.constant(5, name="Five_with_one_with_two")
+
+    self.graph = node_three.graph
+
+  def testNodeThreeHasColocationInterpolation(self):
+    message = "^^node:Three_with_one:${colocations}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    self.assertIn("colocate_with(One)", result)
+
+  def testNodeFourHasColocationInterpolationForNodeThreeOnly(self):
+    message = "^^node:Four_with_three:${colocations}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    self.assertIn("colocate_with(Three_with_one)", result)
+    self.assertNotIn(
+        "One", result,
+        "Node One should not appear in Four_with_three's summary:\n%s"
+        % result)
+
+  def testNodeFiveHasColocationInterpolationForNodeOneAndTwo(self):
+    message = "^^node:Five_with_one_with_two:${colocations}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    self.assertIn("colocate_with(One)", result)
+    self.assertIn("colocate_with(Two)", result)
+
+  def testColocationInterpolationForNodeLackingColocation(self):
+    message = "^^node:One:${colocations}^^"
+    result = error_interpolation.interpolate(message, self.graph)
+    self.assertIn("No node-device colocations", result)
+    self.assertNotIn("Two", result)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/framework/fast_tensor_util.pyx b/tensorflow/python/framework/fast_tensor_util.pyx
index 17d112a1ec..2e3e15f53a 100644
--- a/tensorflow/python/framework/fast_tensor_util.pyx
+++ b/tensorflow/python/framework/fast_tensor_util.pyx
@@ -6,6 +6,13 @@ cimport numpy as np
 
 from tensorflow.python.util import compat
 
+def AppendBFloat16ArrayToTensorProto(
+    tensor_proto, np.ndarray[np.uint16_t, ndim=1] nparray):
+  cdef long i, n
+  n = nparray.size
+  for i in range(n):
+    tensor_proto.half_val.append(nparray[i])
+
 
 def AppendFloat16ArrayToTensorProto(
     # For numpy, npy_half is a typedef for npy_uint16,
diff --git a/tensorflow/python/framework/function.py b/tensorflow/python/framework/function.py
index 6525607fae..12bf03c5fa 100644
--- a/tensorflow/python/framework/function.py
+++ b/tensorflow/python/framework/function.py
@@ -38,8 +38,8 @@ from tensorflow.python.ops import cond_v2_impl
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.util import compat
+from tensorflow.python.util import function_utils
 from tensorflow.python.util import tf_contextlib
-from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_inspect
 
 # This is to avoid a circular dependency with cond_v2_impl.
@@ -255,9 +255,12 @@ class _DefinedFunction(object):
     # Constructed only when C API is enabled, lazily
     self._c_func = None
     self._sub_functions = dict()  # Constructed with _definition or _c_func
-    device_stack = ops.get_default_graph()._device_function_stack  # pylint: disable=protected-access
+    # pylint: disable=protected-access
+    device_funcs = ops.get_default_graph()._device_functions_outer_to_inner
+    # pylint: enable=protected-access
+
     # Get the innermost device if possbile.
-    self._caller_device = device_stack[-1] if device_stack else None
+    self._caller_device = device_funcs[-1] if device_funcs else None
 
     # Cached OpDef for this function. When C API is enabled, this is
     # the only part of FunctionDef that we cache in Python. When C API
@@ -354,7 +357,7 @@ class _DefinedFunction(object):
     if self._func_name:
       base_func_name = self._func_name
     else:
-      base_func_name = _get_func_name(self._func)
+      base_func_name = function_utils.get_func_name(self._func)
       if self._grad_func:
         base_func_name += ("_%s" % self._grad_func.name)
     kwargs_attr = _parse_kwargs_as_attrs(base_func_name, **self._extra_kwargs)
@@ -816,7 +819,7 @@ class _FuncGraph(ops.Graph):
 def func_graph_from_py_func(func, arg_names, arg_types, name=None,
                             capture_by_value=False, device=None,
                             colocation_stack=None, container=None,
-                            collections_ref=None):
+                            collections_ref=None, arg_shapes=None):
   """Returns a _FuncGraph generated from `func`.
 
   Args:
@@ -833,6 +836,7 @@ def func_graph_from_py_func(func, arg_names, arg_types, name=None,
     container: A container name the _FuncGraph should start with.
     collections_ref: A reference to a collections dict the _FuncGraph should
       use internally.
+    arg_shapes: A sequence of the function's argument shapes.
 
   Returns:
     A _FuncGraph.
@@ -841,7 +845,7 @@ def func_graph_from_py_func(func, arg_names, arg_types, name=None,
     ValueError: if func returns None.
   """
   if not name:
-    name = _get_func_name(func)
+    name = function_utils.get_func_name(func)
   func_graph = _FuncGraph(name, capture_by_value)
 
   with func_graph.as_default(), ops.device(device):
@@ -854,9 +858,12 @@ def func_graph_from_py_func(func, arg_names, arg_types, name=None,
       func_graph._colocation_stack = colocation_stack
     # pylint: enable=protected-access
 
+    if arg_shapes is None:
+      arg_shapes = [None] * len(arg_types)
+
     # Create placeholders for the function arguments.
-    for (argname, argtype) in zip(arg_names, arg_types):
-      argholder = array_ops.placeholder(argtype, name=argname)
+    for (argname, argtype, argshape) in zip(arg_names, arg_types, arg_shapes):
+      argholder = array_ops.placeholder(argtype, shape=argshape, name=argname)
       func_graph.inputs.append(argholder)
     # Call func and gather the output tensors.
     with vs.variable_scope("", custom_getter=func_graph.getvar):
@@ -1139,19 +1146,6 @@ def _parse_kwargs_as_attrs(func_name, **kwargs):
   return attrs
 
 
-def _get_func_name(func):
-  _, func = tf_decorator.unwrap(func)
-  if callable(func):
-    if tf_inspect.isfunction(func):
-      return func.__name__
-    elif tf_inspect.ismethod(func):
-      return "%s.%s" % (func.__self__.__name__, func.__name__)
-    else:  # Probably a class instance with __call__
-      return type(func)
-  else:
-    raise ValueError("Argument must be callable")
-
-
 def get_extra_vars():
   """Returns the captured variables by the function.
 
diff --git a/tensorflow/python/framework/function_def_to_graph.py b/tensorflow/python/framework/function_def_to_graph.py
index 46c9c4c14a..1b09506662 100644
--- a/tensorflow/python/framework/function_def_to_graph.py
+++ b/tensorflow/python/framework/function_def_to_graph.py
@@ -25,7 +25,7 @@ from tensorflow.core.framework import types_pb2
 from tensorflow.core.framework import versions_pb2
 from tensorflow.python.framework import function
 from tensorflow.python.framework import importer
-from tensorflow.python.framework import op_def_registry
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import versions
 from tensorflow.python.ops import cond_v2_impl
 
@@ -114,6 +114,10 @@ def function_def_to_graph_def(fdef, input_shapes=None):
           producer=versions.GRAPH_DEF_VERSION,
           min_consumer=versions.GRAPH_DEF_VERSION_MIN_CONSUMER))
 
+  # Copy *all* functions from outer graph to `graph_def` so that both direct
+  # and indirect references are safely handled.
+  ops.get_default_graph()._copy_functions_to_graph_def(graph_def, 0)  # pylint: disable=protected-access
+
   if input_shapes and len(input_shapes) != len(fdef.signature.input_arg):
     raise ValueError("Length of input_shapes must match the number of " +
                      "input_args. len(input_shapes): {} len(input_arg): {}".
@@ -142,24 +146,18 @@ def function_def_to_graph_def(fdef, input_shapes=None):
     nested_to_flat_tensor_name[arg_def.name] = "{}:0".format(arg_def.name)
 
   for node_def in fdef.node_def:
-    op_def = op_def_registry.get_registered_ops().get(node_def.op)
-    if not op_def:
-      # TODO(b/80470245): Support functions which refer other functions.
-      raise NotImplementedError(
-          "No op registered for {},".format(node_def.op) +
-          " it may be a function. function_def_to_graph_def " +
-          "currently does not support converting functions with " +
-          "references to other graph functions.")
+    op_def = ops.get_default_graph()._get_op_def(node_def.op)  # pylint: disable=protected-access
 
     for attr in op_def.attr:
-      if attr.type in ("func", "list(func)"):
-        # TODO(b/80470245): Support functions which refer other functions.
-        raise NotImplementedError("Unsupported attr {} ".format(attr.name) +
-                                  " with type {}".format(attr.type) +
-                                  " in op {}. ".format(op_def.name) +
-                                  "function_def_to_graph_def currently does " +
-                                  "not support converting functions with " +
-                                  "references to other graph functions.")
+      if attr.type == "func":
+        fname = node_def.attr[attr.name].func.name
+        if not ops.get_default_graph()._is_function(fname):  # pylint: disable=protected-access
+          raise ValueError("%s function not found." % fname)
+      elif attr.type == "list(func)":
+        for fn in node_def.attr[attr.name].list.func:
+          fname = fn.name
+          if not ops.get_default_graph()._is_function(fname):  # pylint: disable=protected-access
+            raise ValueError("%s function not found." % fname)
 
     # Iterate over output_args in op_def to build the map.
     # Index of the output tensor in the flattened list of *all* output
diff --git a/tensorflow/python/framework/function_def_to_graph_test.py b/tensorflow/python/framework/function_def_to_graph_test.py
index 0f4e6ef54f..cd2a16ed5a 100644
--- a/tensorflow/python/framework/function_def_to_graph_test.py
+++ b/tensorflow/python/framework/function_def_to_graph_test.py
@@ -18,7 +18,9 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import function
 from tensorflow.python.framework import function_def_to_graph
 from tensorflow.python.framework import graph_to_function_def
 from tensorflow.python.framework import ops
@@ -79,7 +81,6 @@ class FunctionDefToGraphTest(test.TestCase):
 
     g = function_def_to_graph.function_def_to_graph(
         fdef, input_shapes=[None, tensor_shape.matrix(5, 7)])
-    print(g.as_graph_def())
     self.assertIsNone(g.inputs[0].shape.dims)
     self.assertSequenceEqual(g.inputs[1].shape.dims, [5, 7])
     self.assertSequenceEqual(g.outputs[0].shape.dims, [5, 7])
@@ -179,6 +180,37 @@ class FunctionDefToGraphDefTest(test.TestCase):
     self.assertEqual(g.node[0].attr["shape"].shape.unknown_rank, False)
     self.assertFalse("shape" in g.node[2].attr)
 
+  def testFunctionCallsFromFunction(self):
+    x = constant_op.constant(5.0)
+    y = constant_op.constant(10.0)
+
+    @function.Defun()
+    def fn():
+
+      @function.Defun()
+      def inner_fn():
+        return x + y
+
+      return inner_fn()
+
+    # Instantiate the function in this graph so that
+    # `function_def_to_graph` can find it.
+    fn()
+
+    def fn2():
+      return 2 * fn()
+
+    fdef = function._DefinedFunction(fn2, [], []).definition
+    func_graph = function_def_to_graph.function_def_to_graph(fdef)
+    with func_graph.as_default():
+      x_ph, y_ph = func_graph.inputs
+      with self.test_session(graph=func_graph) as sess:
+        self.assertEqual(
+            sess.run(func_graph.outputs[0], feed_dict={
+                x_ph: 5.0,
+                y_ph: 10.0
+            }), 30.0)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/framework/function_test.py b/tensorflow/python/framework/function_test.py
index 15e41ba91f..1707f929b8 100644
--- a/tensorflow/python/framework/function_test.py
+++ b/tensorflow/python/framework/function_test.py
@@ -537,19 +537,25 @@ class FunctionTest(test.TestCase):
   def testResourceVarAsImplicitInput(self):
     g = ops.Graph()
     with g.as_default(), ops.device("cpu:0"):
+      expected_type = dtypes.float32
+      expected_shape = tensor_shape.TensorShape((4, 4))
       v = variable_scope.get_variable(
-          "var", (4, 4), dtypes.float32, use_resource=True)
+          "var", expected_shape, expected_type, use_resource=True)
 
       @function.Defun()
       def Foo():
-        return array_ops.identity(v)
+        captured = array_ops.identity(v)
+        self.assertEqual(expected_type, captured.dtype)
+        self.assertEqual(expected_shape, captured.shape)
+        return captured, array_ops.shape(captured)
 
-      y = v.value()
-      z = Foo()
+      expected_val = v.value()
+      actual_val, actual_shape = Foo()
 
     with self.test_session(graph=g):
       v.initializer.run()
-      self.assertAllEqual(y.eval(), z.eval())
+      self.assertAllEqual(expected_val.eval(), actual_val.eval())
+      self.assertAllEqual(expected_shape, actual_shape.eval())
 
   def testDefineErrors(self):
     with ops.Graph().as_default():
diff --git a/tensorflow/python/framework/importer.py b/tensorflow/python/framework/importer.py
index 699d2b70d1..687bfebd43 100644
--- a/tensorflow/python/framework/importer.py
+++ b/tensorflow/python/framework/importer.py
@@ -205,7 +205,7 @@ def _PopulateTFImportGraphDefOptions(options, prefix, input_map,
   for input_src, input_dst in input_map.items():
     input_src = compat.as_str(input_src)
     if input_src.startswith('^'):
-      src_name = compat.as_bytes(input_src[1:])
+      src_name = compat.as_str(input_src[1:])
       dst_op = input_dst._as_tf_output().oper  # pylint: disable=protected-access
       c_api.TF_ImportGraphDefOptionsRemapControlDependency(
           options, src_name, dst_op)
diff --git a/tensorflow/python/framework/importer_test.py b/tensorflow/python/framework/importer_test.py
index c5a54470d2..7182c28666 100644
--- a/tensorflow/python/framework/importer_test.py
+++ b/tensorflow/python/framework/importer_test.py
@@ -30,6 +30,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import importer
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_ops  # pylint: disable=unused-import
 from tensorflow.python.framework import versions
 from tensorflow.python.ops import array_ops
@@ -419,6 +420,46 @@ class ImportGraphDefTest(test.TestCase):
       with self.test_session() as sess:
         self.assertEqual(sess.run(imported_r), 10)
 
+  def testImportWhileLoopInCond(self):
+    # Produce GraphDef containing while loop.
+    graph = ops.Graph()
+    with graph.as_default():
+      r = control_flow_ops.while_loop(lambda i: i < 10, lambda i: i + 1, [0])
+    graph_def = graph.as_graph_def()
+
+    # Import the GraphDef inside a cond and make sure it runs.
+    with ops.Graph().as_default():
+
+      def ImportFn():
+        return importer.import_graph_def(graph_def, return_elements=[r.name])[0]
+
+      pred = array_ops.placeholder(dtypes.bool)
+      out = control_flow_ops.cond(pred, ImportFn,
+                                  lambda: constant_op.constant(1))
+      with self.test_session() as sess:
+        self.assertEqual(sess.run(out, {pred: True}), 10)
+        self.assertEqual(sess.run(out, {pred: False}), 1)
+
+  def testImportWhileLoopInWhileLoop(self):
+    self.skipTest("b/111757448")
+    # Produce GraphDef containing while loop.
+    graph = ops.Graph()
+    with graph.as_default():
+      r = control_flow_ops.while_loop(lambda i: i < 10, lambda i: i + 1, [0])
+    graph_def = graph.as_graph_def()
+
+    # Import the GraphDef inside another loop and make sure it runs.
+    with ops.Graph().as_default():
+
+      def ImportFn(_):
+        return importer.import_graph_def(graph_def, return_elements=[r.name])[0]
+
+      out = control_flow_ops.while_loop(
+          lambda i: i < 2, ImportFn, [0],
+          shape_invariants=[tensor_shape.TensorShape(None)])
+      with self.test_session() as sess:
+        self.assertEqual(sess.run(out), 10)
+
   def testTypeMismatchInGraphDef(self):
     # TODO(skyewm): improve error message
     error_msg = ("Input 0 of node import/B was passed int32 from import/A:0 "
diff --git a/tensorflow/python/framework/kernels.py b/tensorflow/python/framework/kernels.py
new file mode 100644
index 0000000000..f7641f3442
--- /dev/null
+++ b/tensorflow/python/framework/kernels.py
@@ -0,0 +1,46 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Functions for querying registered kernels."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.core.framework import kernel_def_pb2
+from tensorflow.python import pywrap_tensorflow as c_api
+from tensorflow.python.util import compat
+
+
+def get_all_registered_kernels():
+  """Returns a KernelList proto of all registered kernels.
+  """
+  buf = c_api.TF_GetAllRegisteredKernels()
+  data = c_api.TF_GetBuffer(buf)
+  kernel_list = kernel_def_pb2.KernelList()
+  kernel_list.ParseFromString(compat.as_bytes(data))
+  return kernel_list
+
+
+def get_registered_kernels_for_op(name):
+  """Returns a KernelList proto of registered kernels for a given op.
+
+  Args:
+    name: A string representing the name of the op whose kernels to retrieve.
+  """
+  buf = c_api.TF_GetRegisteredKernelsForOp(name)
+  data = c_api.TF_GetBuffer(buf)
+  kernel_list = kernel_def_pb2.KernelList()
+  kernel_list.ParseFromString(compat.as_bytes(data))
+  return kernel_list
diff --git a/tensorflow/python/framework/kernels_test.py b/tensorflow/python/framework/kernels_test.py
new file mode 100644
index 0000000000..c53500be73
--- /dev/null
+++ b/tensorflow/python/framework/kernels_test.py
@@ -0,0 +1,41 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for querying registered kernels."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import kernels
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import googletest
+
+
+class GetAllRegisteredKernelsTest(test_util.TensorFlowTestCase):
+
+  def testFindsAtLeastOneKernel(self):
+    kernel_list = kernels.get_all_registered_kernels()
+    self.assertGreater(len(kernel_list.kernel), 0)
+
+
+class GetRegisteredKernelsForOp(test_util.TensorFlowTestCase):
+
+  def testFindsAtLeastOneKernel(self):
+    kernel_list = kernels.get_registered_kernels_for_op("KernelLabel")
+    self.assertGreater(len(kernel_list.kernel), 0)
+    self.assertEqual(kernel_list.kernel[0].op, "KernelLabel")
+
+
+if __name__ == "__main__":
+  googletest.main()
diff --git a/tensorflow/python/framework/meta_graph.py b/tensorflow/python/framework/meta_graph.py
index 923e76fc9c..33631282bd 100644
--- a/tensorflow/python/framework/meta_graph.py
+++ b/tensorflow/python/framework/meta_graph.py
@@ -696,6 +696,67 @@ def import_scoped_meta_graph(meta_graph_or_file,
   Raises:
     ValueError: If the graph_def contains unbound inputs.
   """
+  return import_scoped_meta_graph_with_return_elements(
+      meta_graph_or_file, clear_devices, graph, import_scope, input_map,
+      unbound_inputs_col_name, restore_collections_predicate)[0]
+
+
+def import_scoped_meta_graph_with_return_elements(
+    meta_graph_or_file,
+    clear_devices=False,
+    graph=None,
+    import_scope=None,
+    input_map=None,
+    unbound_inputs_col_name="unbound_inputs",
+    restore_collections_predicate=(lambda key: True),
+    return_elements=None):
+  """Imports graph from `MetaGraphDef` and returns vars and return elements.
+
+  This function takes a `MetaGraphDef` protocol buffer as input. If
+  the argument is a file containing a `MetaGraphDef` protocol buffer ,
+  it constructs a protocol buffer from the file content. The function
+  then adds all the nodes from the `graph_def` field to the
+  current graph, recreates the desired collections, and returns a dictionary of
+  all the Variables imported into the name scope.
+
+  In combination with `export_scoped_meta_graph()`, this function can be used to
+
+  * Serialize a graph along with other Python objects such as `QueueRunner`,
+    `Variable` into a `MetaGraphDef`.
+
+  * Restart training from a saved graph and checkpoints.
+
+  * Run inference from a saved graph and checkpoints.
+
+  Args:
+    meta_graph_or_file: `MetaGraphDef` protocol buffer or filename (including
+      the path) containing a `MetaGraphDef`.
+    clear_devices: Boolean which controls whether to clear device information
+      from graph_def. Default false.
+    graph: The `Graph` to import into. If `None`, use the default graph.
+    import_scope: Optional `string`. Name scope into which to import the
+      subgraph. If `None`, the graph is imported to the root name scope.
+    input_map: A dictionary mapping input names (as strings) in `graph_def` to
+      `Tensor` objects. The values of the named input tensors in the imported
+      graph will be re-mapped to the respective `Tensor` values.
+    unbound_inputs_col_name: Collection name for looking up unbound inputs.
+    restore_collections_predicate: a predicate on collection names. A collection
+      named c (i.e whose key is c) will be restored iff
+      1) `restore_collections_predicate(c)` is True, and
+      2) `c != unbound_inputs_col_name`.
+    return_elements:  A list of strings containing operation names in the
+      `MetaGraphDef` that will be returned as `Operation` objects; and/or
+      tensor names in `MetaGraphDef` that will be returned as `Tensor` objects.
+
+  Returns:
+    A tuple of (
+      dictionary of all the `Variables` imported into the name scope,
+      list of `Operation` or `Tensor` objects from the `return_elements` list).
+
+  Raises:
+    ValueError: If the graph_def contains unbound inputs.
+
+  """
   if context.executing_eagerly():
     raise ValueError("Exporting/importing meta graphs is not supported when "
                      "eager execution is enabled.")
@@ -737,11 +798,12 @@ def import_scoped_meta_graph(meta_graph_or_file,
     scope_to_prepend_to_names = graph.unique_name(
         import_scope or "", mark_as_used=False)
 
-    importer.import_graph_def(
+    imported_return_elements = importer.import_graph_def(
         input_graph_def,
         name=(import_scope or scope_to_prepend_to_names),
         input_map=input_map,
-        producer_op_list=producer_op_list)
+        producer_op_list=producer_op_list,
+        return_elements=return_elements)
 
     # Restores all the other collections.
     variable_objects = {}
@@ -806,7 +868,7 @@ def import_scoped_meta_graph(meta_graph_or_file,
     for v in variables:
       var_list[ops.strip_name_scope(v.name, scope_to_prepend_to_names)] = v
 
-  return var_list
+  return var_list, imported_return_elements
 
 
 def export_scoped_meta_graph(filename=None,
diff --git a/tensorflow/python/framework/ops.py b/tensorflow/python/framework/ops.py
index cf0b1e36fb..cf8689895f 100644
--- a/tensorflow/python/framework/ops.py
+++ b/tensorflow/python/framework/ops.py
@@ -20,7 +20,6 @@ from __future__ import print_function
 
 import collections
 import copy
-import linecache
 import os
 import re
 import sys
@@ -45,18 +44,22 @@ from tensorflow.python.framework import c_api_util
 from tensorflow.python.framework import cpp_shape_inference_pb2
 from tensorflow.python.framework import device as pydev
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import error_interpolation
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import op_def_registry
 from tensorflow.python.framework import registry
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import traceable_stack
 from tensorflow.python.framework import versions
 from tensorflow.python.ops import control_flow_util
 from tensorflow.python.platform import app
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import compat
 from tensorflow.python.util import decorator_utils
+from tensorflow.python.util import function_utils
 from tensorflow.python.util import lock_util
 from tensorflow.python.util import tf_contextlib
+from tensorflow.python.util import tf_stack
 from tensorflow.python.util.deprecation import deprecated_args
 from tensorflow.python.util.tf_export import tf_export
 
@@ -72,6 +75,31 @@ def tensor_id(tensor):
   return tensor._id  # pylint: disable=protected-access
 
 
+class _UserDeviceSpec(object):
+  """Store user-specified device and provide computation of merged device."""
+
+  def __init__(self, device_name_or_function):
+    self._device_name_or_function = device_name_or_function
+
+    self.display_name = str(self._device_name_or_function)
+    if callable(self._device_name_or_function):
+      dev_func = self._device_name_or_function
+      func_name = function_utils.get_func_name(dev_func)
+      func_code = function_utils.get_func_code(dev_func)
+      if func_code:
+        fname = func_code.co_filename
+        lineno = func_code.co_firstlineno
+      else:
+        fname = "unknown"
+        lineno = -1
+      self.display_name = "%s<%s, %d>" % (func_name, fname, lineno)
+
+    self.function = self._device_name_or_function
+    if not (self._device_name_or_function is None or
+            callable(self._device_name_or_function)):
+      self.function = pydev.merge_device(self._device_name_or_function)
+
+
 class _NullContextmanager(object):
 
   def __enter__(self):
@@ -706,9 +734,9 @@ class _EagerTensorBase(Tensor):
     """
     if self.dtype == dtypes.resource:
       raise ValueError("Resource handles are not convertible to numpy.")
-    return self.cpu()._numpy()  # pylint: disable=protected-access
+    return self._cpu_nograd()._numpy()  # pylint: disable=protected-access
 
-  # __int__ and  __float__ may copy the tensor to CPU and
+  # __int__, __float__ and __index__ may copy the tensor to CPU and
   # only work for scalars; values are cast as per numpy.
   def __int__(self):
     return int(self.numpy())
@@ -716,6 +744,9 @@ class _EagerTensorBase(Tensor):
   def __float__(self):
     return float(self.numpy())
 
+  def __index__(self):
+    return int(self.numpy())
+
   def __array__(self, dtype=None):
     return np.array(self.numpy(), dtype=dtype)
 
@@ -780,8 +811,8 @@ class _EagerTensorBase(Tensor):
   def _override_operator(name, func):
     setattr(_EagerTensorBase, name, func)
 
-  def _copy(self, ctx=None, device_name=None):
-    """Copies tensor to dest device."""
+  def _copy_nograd(self, ctx=None, device_name=None):
+    """Copies tensor to dest device, but doesn't record the operation."""
     # pylint: disable=protected-access
     # Creates a new tensor on the dest device.
     if ctx is None:
@@ -793,7 +824,11 @@ class _EagerTensorBase(Tensor):
       new_tensor = self._copy_to_device(context=ctx._handle, device=device_name)
     except core._NotOkStatusException as e:
       six.raise_from(core._status_to_exception(e.code, e.message), None)
+    return new_tensor
 
+  def _copy(self, ctx=None, device_name=None):
+    """Copies tensor to dest device."""
+    new_tensor = self._copy_nograd(ctx, device_name)
     # Record the copy on tape and define backprop copy as well.
     if context.executing_eagerly():
       self_device = self.device
@@ -824,6 +859,16 @@ class _EagerTensorBase(Tensor):
     """Returns the number of Tensor dimensions."""
     return self.shape.ndims
 
+  def _cpu_nograd(self):
+    """A copy of this Tensor with contents backed by host memory.
+
+    The copy cannot be differentiated through.
+
+    Returns:
+      A CPU-memory backed Tensor object with the same contents as this Tensor.
+    """
+    return self._copy_nograd(context.context(), "CPU:0")
+
   def cpu(self):
     """A copy of this Tensor with contents backed by host memory."""
     return self._copy(context.context(), "CPU:0")
@@ -1697,10 +1742,19 @@ class Operation(object):
     # This will be set by self.inputs.
     self._inputs_val = None
 
-    self._id_value = self._graph._next_id()  # pylint: disable=protected-access
+    # pylint: disable=protected-access
+    self._id_value = self._graph._next_id()
     self._original_op = original_op
-    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
-    self._control_flow_context = self.graph._get_control_flow_context()  # pylint: disable=protected-access
+    self._traceback = tf_stack.extract_stack()
+
+    # List of _UserDevSpecs holding code location of device context manager
+    # invocations and the users original argument to them.
+    self._device_code_locations = None
+    # Dict mapping op name to file and line information for op colocation
+    # context managers.
+    self._colocation_code_locations = None
+    self._control_flow_context = self.graph._get_control_flow_context()
+    # pylint: enable=protected-access
 
     # Initialize self._c_op.
     if c_op:
@@ -1839,6 +1893,72 @@ class Operation(object):
     return c_api.TF_OperationDevice(self._c_op)
 
   @property
+  def _device_assignments(self):
+    """Code locations for device context managers active at op creation.
+
+    This property will return a list of traceable_stack.TraceableObject
+    instances where .obj is a string representing the assigned device
+    (or information about the function that would be applied to this op
+    to compute the desired device) and the filename and lineno members
+    record the location of the relevant device context manager.
+
+    For example, suppose file_a contained these lines:
+
+      file_a.py:
+        15: with tf.device('/gpu:0'):
+        16:   node_b = tf.constant(4, name='NODE_B')
+
+    Then a TraceableObject t_obj representing the device context manager
+    would have these member values:
+
+      t_obj.obj -> '/gpu:0'
+      t_obj.filename = 'file_a.py'
+      t_obj.lineno = 15
+
+    and node_b.op._device_assignments would return the list [t_obj].
+
+    Returns:
+      [str: traceable_stack.TraceableObject, ...] as per this method's
+      description, above.
+    """
+    return self._device_code_locations or []
+
+  @property
+  def _colocation_dict(self):
+    """Code locations for colocation context managers active at op creation.
+
+    This property will return a dictionary for which the keys are nodes with
+    which this Operation is colocated, and for which the values are
+    traceable_stack.TraceableObject instances.  The TraceableObject instances
+    record the location of the relevant colocation context manager but have the
+    "obj" field set to None to prevent leaking private data.
+
+    For example, suppose file_a contained these lines:
+
+      file_a.py:
+        14: node_a = tf.constant(3, name='NODE_A')
+        15: with tf.colocate_with(node_a):
+        16:   node_b = tf.constant(4, name='NODE_B')
+
+    Then a TraceableObject t_obj representing the colocation context manager
+    would have these member values:
+
+      t_obj.obj -> None
+      t_obj.filename = 'file_a.py'
+      t_obj.lineno = 15
+
+    and node_b.op._colocation_dict would return the dictionary
+
+      { 'NODE_A': t_obj }
+
+    Returns:
+      {str: traceable_stack.TraceableObject} as per this method's description,
+      above.
+    """
+    locations_dict = self._colocation_code_locations or {}
+    return locations_dict.copy()
+
+  @property
   def _output_types(self):
     """List this operation's output types.
 
@@ -2140,7 +2260,7 @@ class Operation(object):
   @property
   def traceback(self):
     """Returns the call stack from when this operation was constructed."""
-    return self._graph._convert_stack(self._traceback)  # pylint: disable=protected-access
+    return tf_stack.convert_stack(self._traceback)
 
   @property
   def traceback_with_start_lines(self):
@@ -2149,9 +2269,8 @@ class Operation(object):
     Returns:
       A list of 5-tuples (filename, lineno, name, code, func_start_lineno).
     """
-    return self._graph._convert_stack(  # pylint: disable=protected-access
-        self._traceback,
-        include_func_start_lineno=True)
+    return tf_stack.convert_stack(self._traceback,
+                                  include_func_start_lineno=True)
 
   def _set_attr(self, attr_name, attr_value):
     """Private method used to set an attribute in the node_def."""
@@ -2603,7 +2722,6 @@ def _name_from_scope_name(name):
 _MUTATION_LOCK_GROUP = 0
 _SESSION_RUN_LOCK_GROUP = 1
 
-
 @tf_export("Graph")
 class Graph(object):
   """A TensorFlow computation, represented as a dataflow graph.
@@ -2679,7 +2797,7 @@ class Graph(object):
     # Functions that will be applied to choose a device if none is specified.
     # After switch_to_thread_local(), self._thread_local._device_function_stack
     # is used instead.
-    self._graph_device_function_stack = []
+    self._graph_device_function_stack = traceable_stack.TraceableStack()
     # Default original_op applied to new ops.
     self._default_original_op = None
     # Current control flow context. It could be either CondContext or
@@ -2712,7 +2830,7 @@ class Graph(object):
     self._building_function = False
     # Stack of colocate_with ops. After switch_to_thread_local(),
     # self._thread_local._colocation_stack is used instead.
-    self._graph_colocation_stack = []
+    self._graph_colocation_stack = traceable_stack.TraceableStack()
     # Set of tensors that are dangerous to feed!
     self._unfeedable_tensors = set()
     # Set of operations that are dangerous to fetch!
@@ -2752,36 +2870,6 @@ class Graph(object):
     """Temporary hack; can be overridden to force C API usage."""
     return _USE_C_API
 
-  def _convert_stack(self, stack, include_func_start_lineno=False):
-    """Converts a stack extracted using _extract_stack() to a traceback stack.
-
-    Args:
-      stack: A list of n 5-tuples,
-        (filename, lineno, name, frame_globals, func_start_lineno).
-      include_func_start_lineno: True if function start line number should be
-        included as the 5th entry in return tuples.
-
-    Returns:
-      A list of n 4-tuples or 5-tuples
-      (filename, lineno, name, code, [optional: func_start_lineno]), where the
-      code tuple element is calculated from the corresponding elements of the
-      input tuple.
-    """
-    ret = []
-    for (filename, lineno, name, frame_globals, func_start_lineno,
-         unused_frame_info) in stack:
-      linecache.checkcache(filename)
-      line = linecache.getline(filename, lineno, frame_globals)
-      if line:
-        line = line.strip()
-      else:
-        line = None
-      if include_func_start_lineno:
-        ret.append((filename, lineno, name, line, func_start_lineno))
-      else:
-        ret.append((filename, lineno, name, line))
-    return ret
-
   # Note: this method is private because the API of tf.Graph() is public and
   # frozen, and this functionality is still not ready for public visibility.
   @tf_contextlib.contextmanager
@@ -2789,63 +2877,23 @@ class Graph(object):
     # This step makes a copy of the existing stack, and it also initializes
     # self._thread_local._variable_creator_stack if it doesn't exist yet.
     old = list(self._variable_creator_stack)
-    self._thread_local._variable_creator_stack.append(creator)
+    self._thread_local._variable_creator_stack.append(creator)  # pylint: disable=protected-access
     try:
       yield
     finally:
-      self._thread_local._variable_creator_stack = old
+      self._thread_local._variable_creator_stack = old  # pylint: disable=protected-access
 
   # Note: this method is private because the API of tf.Graph() is public and
   # frozen, and this functionality is still not ready for public visibility.
   @property
   def _variable_creator_stack(self):
     if not hasattr(self._thread_local, "_variable_creator_stack"):
-      self._thread_local._variable_creator_stack = []
-    return list(self._thread_local._variable_creator_stack)
+      self._thread_local._variable_creator_stack = []  # pylint: disable=protected-access
+    return list(self._thread_local._variable_creator_stack)  # pylint: disable=protected-access
 
   @_variable_creator_stack.setter
   def _variable_creator_stack(self, variable_creator_stack):
-    self._thread_local._variable_creator_stack = variable_creator_stack
-
-  def _extract_stack(self):
-    """A lightweight, extensible re-implementation of traceback.extract_stack.
-
-    NOTE(mrry): traceback.extract_stack eagerly retrieves the line of code for
-      each stack frame using linecache, which results in an abundance of stat()
-      calls. This implementation does not retrieve the code, and any consumer
-      should apply _convert_stack to the result to obtain a traceback that can
-      be formatted etc. using traceback methods.
-
-    Derived classes can implement _extract_frame_info() to add extra information
-    to the traceback.
-
-    Returns:
-      A list of 6-tuples
-      (filename, lineno, name, frame_globals, func_start_lineno, custom_info)
-      corresponding to the call stack of the current thread.
-    """
-    try:
-      raise ZeroDivisionError
-    except ZeroDivisionError:
-      f = sys.exc_info()[2].tb_frame.f_back
-    ret = []
-    while f is not None:
-      lineno = f.f_lineno
-      co = f.f_code
-      filename = co.co_filename
-      name = co.co_name
-      frame_globals = f.f_globals
-      func_start_lineno = co.co_firstlineno
-      frame_info = self._extract_frame_info(f)
-      ret.append((filename, lineno, name, frame_globals, func_start_lineno,
-                  frame_info))
-      f = f.f_back
-    ret.reverse()
-    return ret
-
-  def _extract_frame_info(self, frame):  # pylint: disable=unused-argument
-    """Extracts custom information from a frame in an op traceback."""
-    return None
+    self._thread_local._variable_creator_stack = variable_creator_stack  # pylint: disable=protected-access
 
   def _check_not_finalized(self):
     """Check if the graph is finalized.
@@ -3245,6 +3293,36 @@ class Graph(object):
     self._create_op_helper(ret, compute_device=compute_device)
     return ret
 
+  def _make_colocation_conflict_message(self, op, colocation_op):
+    """Return detailed error message about device conflict due to colocation."""
+    # Example error message:
+    #   Tried to colocate op 'a' (defined at file1.py:149) having device
+    #   '/device:GPU:0' with op 'b' (defined at file2:96) which had an
+    #   incompatible device '/device:CPU:0'.
+    #
+    #   No node-device colocations were active during op 'a' creation.
+    #   Device assignments active during op 'a' creation:
+    #     with tf.device(/device:GPU:0): file1.py:148>
+    #
+    #   Node-device colocations active during op 'b' creation:
+    #     with tf.colocate_with(a): file2.py:93>
+    #   Device assignments active during op 'b' creation:
+    #     with tf.device(/cpu:0): file2.py:94
+    op_info = error_interpolation.compute_field_dict(op)
+    coloc_op_info = error_interpolation.compute_field_dict(colocation_op)
+    msg = ("Tried to colocate op '{op_name}'{op_loc} having device '{op_dev}' "
+           "with op '{coloc_op_name}'{coloc_op_loc} which had an incompatible "
+           "device '{coloc_op_dev}'.\n\n{op_summary}\n\n{coloc_op_summary}"
+           .format(op_name=op.name,
+                   op_loc=op_info["defined_at"],
+                   op_dev=op.device,
+                   op_summary=op_info["devs_and_colocs"],
+                   coloc_op_name=colocation_op.name,
+                   coloc_op_loc=coloc_op_info["defined_at"],
+                   coloc_op_dev=colocation_op.device,
+                   coloc_op_summary=coloc_op_info["devs_and_colocs"]))
+    return msg
+
   def _create_op_helper(self, op, compute_device=True):
     """Common logic for creating an op in this graph."""
     # Apply any additional attributes requested. Do not overwrite any existing
@@ -3285,20 +3363,22 @@ class Graph(object):
     if compute_device:
       self._apply_device_functions(op)
 
+    # Snapshot the colocation stack metadata before we might generate error
+    # messages using it.  Note that this snapshot depends on the actual stack
+    # and is independent of the op's _class attribute.
+    # pylint: disable=protected-access
+    op._colocation_code_locations = self._snapshot_colocation_stack_metadata()
+    # pylint: enable=protected-access
+
     if self._colocation_stack:
       all_colocation_groups = []
-      for colocation_op in self._colocation_stack:
+      for colocation_op in self._colocation_stack.peek_objs():
         all_colocation_groups.extend(colocation_op.colocation_groups())
         if colocation_op.device:
-          # Make this device match the device of the colocated op, to provide
-          # consistency between the device and the colocation property.
           if (op.device and pydev.canonical_name(op.device) !=
               pydev.canonical_name(colocation_op.device)):
-            logging.warning("Tried to colocate %s with an op %s that had "
-                            "a different device: %s vs %s. Postponing "
-                            "error-checking until all devices are assigned.",
-                            op.name, colocation_op.name, op.device,
-                            colocation_op.device)
+            msg = self._make_colocation_conflict_message(op, colocation_op)
+            logging.warning(msg)
           else:
             op._set_device(colocation_op.device)  # pylint: disable=protected-access
 
@@ -3615,9 +3695,13 @@ class Graph(object):
     This method should be used if you want to create multiple graphs
     in the same process. For convenience, a global default graph is
     provided, and all ops will be added to this graph if you do not
-    create a new graph explicitly. Use this method with the `with` keyword
-    to specify that ops created within the scope of a block should be
-    added to this graph.
+    create a new graph explicitly.
+
+    Use this method with the `with` keyword to specify that ops created within
+    the scope of a block should be added to this graph. In this case, once
+    the scope of the `with` is exited, the previous default graph is set again
+    as default. There is a stack, so it's ok to have multiple nested levels
+    of `as_default` calls.
 
     The default graph is a property of the current thread. If you
     create a new thread, and wish to use the default graph in that
@@ -3788,8 +3872,8 @@ class Graph(object):
       Nothing.
     """
     old_original_op = self._default_original_op
+    self._default_original_op = op
     try:
-      self._default_original_op = op
       yield
     finally:
       self._default_original_op = old_original_op
@@ -3906,15 +3990,15 @@ class Graph(object):
         # op name regex, which constrains the initial character.
         if not _VALID_OP_NAME_REGEX.match(name):
           raise ValueError("'%s' is not a valid scope name" % name)
+    old_stack = self._name_stack
+    if not name:  # Both for name=None and name="" we re-set to empty scope.
+      new_stack = None
+    elif name[-1] == "/":
+      new_stack = _name_from_scope_name(name)
+    else:
+      new_stack = self.unique_name(name)
+    self._name_stack = new_stack
     try:
-      old_stack = self._name_stack
-      if not name:  # Both for name=None and name="" we re-set to empty scope.
-        new_stack = None
-      elif name[-1] == "/":
-        new_stack = _name_from_scope_name(name)
-      else:
-        new_stack = self.unique_name(name)
-      self._name_stack = new_stack
       yield "" if new_stack is None else new_stack + "/"
     finally:
       self._name_stack = old_stack
@@ -3995,8 +4079,8 @@ class Graph(object):
                                   ignore_existing=False):
     with self.colocate_with(op, ignore_existing):
       if gradient_uid is not None and self._control_flow_context is not None:
+        self._control_flow_context.EnterGradientColocation(op, gradient_uid)
         try:
-          self._control_flow_context.EnterGradientColocation(op, gradient_uid)
           yield
         finally:
           self._control_flow_context.ExitGradientColocation(op, gradient_uid)
@@ -4038,7 +4122,6 @@ class Graph(object):
     Yields:
       A context manager that specifies the op with which to colocate
       newly created ops.
-
     """
     if op is None and not ignore_existing:
       raise ValueError("Trying to reset colocation (op is None) but "
@@ -4056,14 +4139,17 @@ class Graph(object):
     # In the future, a caller may specify that device_functions win
     # over colocation, in which case we can add support.
     device_fn_tmp = self._device_function_stack
-    self._device_function_stack = []
+    self._device_function_stack = traceable_stack.TraceableStack()
 
     if ignore_existing:
       current_stack = self._colocation_stack
-      self._colocation_stack = []
+      self._colocation_stack = traceable_stack.TraceableStack()
 
     if op is not None:
-      self._colocation_stack.append(op)
+      # offset refers to the stack frame used for storing code location.
+      # We use 4, the sum of 1 to use our caller's stack frame and 3
+      # to jump over layers of context managers above us.
+      self._colocation_stack.push_obj(op, offset=4)
 
     try:
       yield
@@ -4071,12 +4157,19 @@ class Graph(object):
       # Restore device function stack
       self._device_function_stack = device_fn_tmp
       if op is not None:
-        self._colocation_stack.pop()
+        self._colocation_stack.pop_obj()
 
       # Reset the colocation stack if requested.
       if ignore_existing:
         self._colocation_stack = current_stack
 
+  def _add_device_to_stack(self, device_name_or_function, offset=0):
+    """Add device to stack manually, separate from a context manager."""
+    total_offset = 1 + offset
+    spec = _UserDeviceSpec(device_name_or_function)
+    self._device_function_stack.push_obj(spec, offset=total_offset)
+    return spec
+
   @tf_contextlib.contextmanager
   def device(self, device_name_or_function):
     # pylint: disable=line-too-long
@@ -4134,31 +4227,26 @@ class Graph(object):
     Yields:
       A context manager that specifies the default device to use for newly
       created ops.
-
     """
-    # pylint: enable=line-too-long
-    if (device_name_or_function is not None and
-        not callable(device_name_or_function)):
-      device_function = pydev.merge_device(device_name_or_function)
-    else:
-      device_function = device_name_or_function
-
+    self._add_device_to_stack(device_name_or_function, offset=2)
     try:
-      self._device_function_stack.append(device_function)
       yield
     finally:
-      self._device_function_stack.pop()
+      self._device_function_stack.pop_obj()
 
   def _apply_device_functions(self, op):
     """Applies the current device function stack to the given operation."""
-    # Apply any device functions in reverse order, so that the most recently
+    # Apply any device functions in LIFO order, so that the most recently
     # pushed function has the first chance to apply a device to the op.
     # We apply here because the result can depend on the Operation's
     # signature, which is computed in the Operation constructor.
-    for device_function in reversed(self._device_function_stack):
-      if device_function is None:
+    # pylint: disable=protected-access
+    for device_spec in self._device_function_stack.peek_objs():
+      if device_spec.function is None:
         break
-      op._set_device(device_function(op))  # pylint: disable=protected-access
+      op._set_device(device_spec.function(op))
+    op._device_code_locations = self._snapshot_device_function_stack_metadata()
+    # pylint: enable=protected-access
 
   # pylint: disable=g-doc-return-or-yield
   @tf_contextlib.contextmanager
@@ -4207,8 +4295,8 @@ class Graph(object):
         yields the container name.
     """
     original_container = self._container
+    self._container = container_name
     try:
-      self._container = container_name
       yield self._container
     finally:
       self._container = original_container
@@ -4682,35 +4770,74 @@ class Graph(object):
     if self._stack_state_is_thread_local:
       # This may be called from a thread where device_function_stack doesn't yet
       # exist.
+      # pylint: disable=protected-access
       if not hasattr(self._thread_local, "_device_function_stack"):
-        self._thread_local._device_function_stack = (
-            self._graph_device_function_stack[:])
+        stack_copy_for_this_thread = self._graph_device_function_stack.copy()
+        self._thread_local._device_function_stack = stack_copy_for_this_thread
       return self._thread_local._device_function_stack
+      # pylint: enable=protected-access
     else:
       return self._graph_device_function_stack
 
+  @property
+  def _device_functions_outer_to_inner(self):
+    user_device_specs = self._device_function_stack.peek_objs()
+    device_functions = [spec.function for spec in user_device_specs]
+    device_functions_outer_to_inner = list(reversed(device_functions))
+    return device_functions_outer_to_inner
+
+  def _snapshot_device_function_stack_metadata(self):
+    """Return device function stack as a list of TraceableObjects.
+
+    Returns:
+      [traceable_stack.TraceableObject, ...] where each TraceableObject's .obj
+      member is a displayable name for the user's argument to Graph.device, and
+      the filename and lineno members point to the code location where
+      Graph.device was called directly or indirectly by the user.
+    """
+    traceable_objects = self._device_function_stack.peek_traceable_objs()
+    snapshot = []
+    for obj in traceable_objects:
+      obj_copy = obj.copy_metadata()
+      obj_copy.obj = obj.obj.display_name
+      snapshot.append(obj_copy)
+    return snapshot
+
   @_device_function_stack.setter
   def _device_function_stack(self, device_function_stack):
     if self._stack_state_is_thread_local:
+      # pylint: disable=protected-access
       self._thread_local._device_function_stack = device_function_stack
+      # pylint: enable=protected-access
     else:
       self._graph_device_function_stack = device_function_stack
 
   @property
   def _colocation_stack(self):
+    """Return thread-local copy of colocation stack."""
     if self._stack_state_is_thread_local:
       # This may be called from a thread where colocation_stack doesn't yet
       # exist.
+      # pylint: disable=protected-access
       if not hasattr(self._thread_local, "_colocation_stack"):
-        self._thread_local._colocation_stack = self._graph_colocation_stack[:]
+        stack_copy_for_this_thread = self._graph_colocation_stack.copy()
+        self._thread_local._colocation_stack = stack_copy_for_this_thread
       return self._thread_local._colocation_stack
+      # pylint: enable=protected-access
     else:
       return self._graph_colocation_stack
 
+  def _snapshot_colocation_stack_metadata(self):
+    """Return colocation stack metadata as a dictionary."""
+    traceable_objects = self._colocation_stack.peek_traceable_objs()
+    return {obj.obj.name: obj.copy_metadata() for obj in traceable_objects}
+
   @_colocation_stack.setter
   def _colocation_stack(self, colocation_stack):
     if self._stack_state_is_thread_local:
+      # pylint: disable=protected-access
       self._thread_local._colocation_stack = colocation_stack
+      # pylint: enable=protected-access
     else:
       self._graph_colocation_stack = colocation_stack
 
@@ -4879,8 +5006,8 @@ class _DefaultStack(threading.local):
   @tf_contextlib.contextmanager
   def get_controller(self, default):
     """A context manager for manipulating a default stack."""
+    self.stack.append(default)
     try:
-      self.stack.append(default)
       yield default
     finally:
       # stack may be empty if reset() was called
@@ -5068,13 +5195,15 @@ class _DefaultGraphStack(_DefaultStack):  # pylint: disable=protected-access
 
   @tf_contextlib.contextmanager
   def get_controller(self, default):
+    context.context().context_switches.push(
+        default.building_function, default.as_default)
     try:
-      context.context().context_switches.push(
-          default.building_function, default.as_default)
       with super(_DefaultGraphStack, self).get_controller(
           default) as g, context.graph_mode():
         yield g
     finally:
+      # If an exception is raised here it may be hiding a related exception in
+      # the try-block (just above).
       context.context().context_switches.pop()
 
 
@@ -5110,6 +5239,9 @@ def init_scope():
         `init_scope` will simply install a fresh graph as the default one.
 
     (3) The gradient tape is paused while the scope is active.
+
+  Raises:
+    RuntimeError: if graph state is incompatible with this initialization.
   """
   # pylint: enable=g-doc-return-or-yield,line-too-long
 
@@ -5122,10 +5254,10 @@ def init_scope():
     # the name scope of the current context.
     default_graph = get_default_graph()
     scope = default_graph.get_name_scope()
-    if scope and scope[-1] != '/':
+    if scope and scope[-1] != "/":
       # Names that end with trailing slashes are treated by `name_scope` as
       # absolute.
-      scope = scope + '/'
+      scope = scope + "/"
     inner_device_stack = default_graph._device_function_stack  # pylint: disable=protected-access
 
     outer_context = None
@@ -5170,6 +5302,8 @@ def init_scope():
           outer_graph._device_function_stack = inner_device_stack  # pylint: disable=protected-access
         yield
     finally:
+      # If an exception is raised here it may be hiding a related exception in
+      # try-block (just above).
       if outer_graph is not None:
         outer_graph._device_function_stack = outer_device_stack  # pylint: disable=protected-access
 
@@ -5237,7 +5371,10 @@ def enable_eager_execution(config=None,
      to this function.
   """
   return enable_eager_execution_internal(
-      config, device_policy, execution_mode, None)
+      config=config,
+      device_policy=device_policy,
+      execution_mode=execution_mode,
+      server_def=None)
 
 
 def enable_eager_execution_internal(config=None,
diff --git a/tensorflow/python/framework/ops_test.py b/tensorflow/python/framework/ops_test.py
index 150100d771..318387c61b 100644
--- a/tensorflow/python/framework/ops_test.py
+++ b/tensorflow/python/framework/ops_test.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 import gc
+import os
 import threading
 import weakref
 
@@ -2542,6 +2543,56 @@ class StatisticsTest(test_util.TensorFlowTestCase):
     self.assertEqual(3, flops_total.value)
 
 
+class DeviceStackTest(test_util.TensorFlowTestCase):
+
+  def testBasicDeviceAssignmentMetadata(self):
+
+    def device_func(unused_op):
+      return "/cpu:*"
+
+    const_zero = constant_op.constant([0.0], name="zero")
+    with ops.device("/cpu"):
+      const_one = constant_op.constant([1.0], name="one")
+      with ops.device("/cpu:0"):
+        const_two = constant_op.constant([2.0], name="two")
+    with ops.device(device_func):
+      const_three = constant_op.constant(3.0, name="three")
+
+    self.assertEqual(0, len(const_zero.op._device_assignments))
+
+    one_list = const_one.op._device_assignments
+    self.assertEqual(1, len(one_list))
+    self.assertEqual("/cpu", one_list[0].obj)
+    self.assertEqual("ops_test.py", os.path.basename(one_list[0].filename))
+
+    two_list = const_two.op._device_assignments
+    self.assertEqual(2, len(two_list))
+    devices = [t.obj for t in two_list]
+    self.assertEqual(set(["/cpu", "/cpu:0"]), set(devices))
+
+    three_list = const_three.op._device_assignments
+    self.assertEqual(1, len(three_list))
+    func_description = three_list[0].obj
+    expected_regex = r"device_func<.*ops_test.py, [0-9]+"
+    self.assertRegexpMatches(func_description, expected_regex)
+
+  def testDeviceAssignmentMetadataForGraphDeviceAndTfDeviceFunctions(self):
+
+    with ops.device("/cpu"):
+      const_one = constant_op.constant([1.0], name="one")
+    with ops.get_default_graph().device("/cpu"):
+      const_two = constant_op.constant([2.0], name="two")
+
+    one_metadata = const_one.op._device_assignments[0]
+    two_metadata = const_two.op._device_assignments[0]
+
+    # Verify both types of device assignment return the right stack info.
+    self.assertRegexpMatches("ops_test.py",
+                             os.path.basename(one_metadata.filename))
+    self.assertEqual(one_metadata.filename, two_metadata.filename)
+    self.assertEqual(one_metadata.lineno + 2, two_metadata.lineno)
+
+
 class ColocationGroupTest(test_util.TensorFlowTestCase):
 
   def testBasic(self):
@@ -2554,6 +2605,18 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
     with self.assertRaises(ValueError):
       c.op.get_attr("_class")
 
+  def testBasicColocationMetadata(self):
+    const_two = constant_op.constant([2.0], name="two")
+    with ops.colocate_with(const_two.op):
+      const_three = constant_op.constant(3.0, name="three")
+    locations_dict = const_three.op._colocation_dict
+    self.assertIn("two", locations_dict)
+    metadata = locations_dict["two"]
+    self.assertIsNone(metadata.obj)
+    # Check that this test's filename is recorded as the file containing the
+    # colocation statement.
+    self.assertEqual("ops_test.py", os.path.basename(metadata.filename))
+
   def testColocationDeviceInteraction(self):
     with ops.device("/cpu:0"):
       with ops.device("/device:GPU:0"):
@@ -2665,6 +2728,28 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
 
     self.assertEqual("/device:CPU:0", b.device)
 
+  def testMakeColocationConflictMessage(self):
+    """Test that provides an example of a complicated error message."""
+    # We could test the message with any ops, but this test will be more
+    # instructive with a real colocation conflict.
+    with ops.device("/device:GPU:0"):
+      a = constant_op.constant([2.0], name="a")
+      with ops.colocate_with(a.op):
+        with ops.device("/cpu:0"):
+          b = constant_op.constant([3.0], name="b")
+    # The definition-location of the nodes will be wrong because of running
+    # from within a TF unittest.  The rest of the info should be correct.
+    message = ops.get_default_graph()._make_colocation_conflict_message(a.op,
+                                                                        b.op)
+    self.assertRegexpMatches(message,
+                             r"Tried to colocate op 'a' \(defined at.*\)")
+    self.assertRegexpMatches(message, "No node-device.*'a'")
+    self.assertRegexpMatches(message, "Device assignments active.*'a'")
+    self.assertRegexpMatches(message, "GPU:0")
+    self.assertRegexpMatches(message, "Node-device colocations active.*'b'")
+    self.assertRegexpMatches(message, "Device assignments active.*'b'")
+    self.assertRegexpMatches(message, "cpu:0")
+
 
 class DeprecatedTest(test_util.TensorFlowTestCase):
 
diff --git a/tensorflow/python/framework/python_op_gen.cc b/tensorflow/python/framework/python_op_gen.cc
index ec3748b40e..76d4c2017c 100644
--- a/tensorflow/python/framework/python_op_gen.cc
+++ b/tensorflow/python/framework/python_op_gen.cc
@@ -943,6 +943,7 @@ from tensorflow.python.framework import common_shapes as _common_shapes
 from tensorflow.python.framework import op_def_registry as _op_def_registry
 from tensorflow.python.framework import ops as _ops
 from tensorflow.python.framework import op_def_library as _op_def_library
+from tensorflow.python.util.deprecation import deprecated_endpoints
 from tensorflow.python.util.tf_export import tf_export
 
 )");
diff --git a/tensorflow/python/framework/python_op_gen_internal.cc b/tensorflow/python/framework/python_op_gen_internal.cc
index 940bffb906..031b4a384e 100644
--- a/tensorflow/python/framework/python_op_gen_internal.cc
+++ b/tensorflow/python/framework/python_op_gen_internal.cc
@@ -588,10 +588,12 @@ void GenPythonOp::AddExport() {
     return;
   }
 
+  // Add @tf_export decorator.
   strings::StrAppend(&result_, "@tf_export(");
 
   // Add all endpoint names to tf_export.
   bool first_endpoint = true;
+  std::vector<string> deprecated_endpoints;
   for (const auto& endpoint : api_def_.endpoint()) {
     if (!first_endpoint) {
       strings::StrAppend(&result_, ", ");
@@ -601,9 +603,32 @@ void GenPythonOp::AddExport() {
     string endpoint_name;
     python_op_gen_internal::GenerateLowerCaseOpName(endpoint.name(),
                                                     &endpoint_name);
+    if (endpoint.deprecated()) {
+      deprecated_endpoints.push_back(endpoint_name);
+    }
     strings::StrAppend(&result_, "'", endpoint_name, "'");
   }
   strings::StrAppend(&result_, ")\n");
+
+  // If all endpoints are deprecated, add @deprecated decorator.
+  if (!api_def_.deprecation_message().empty()) {
+    const string instructions = api_def_.deprecation_message();
+    strings::StrAppend(&result_, "@deprecated(None, '", instructions, "')\n");
+  }
+  // Add @deprecated_endpoints decorator.
+  if (!deprecated_endpoints.empty()) {
+    strings::StrAppend(&result_, "@deprecated_endpoints(");
+    bool first_endpoint = true;
+    for (auto& endpoint_name : deprecated_endpoints) {
+      if (first_endpoint) {
+        first_endpoint = false;
+      } else {
+        strings::StrAppend(&result_, ", ");
+      }
+      strings::StrAppend(&result_, "'", endpoint_name, "'");
+    }
+    strings::StrAppend(&result_, ")\n");
+  }
 }
 
 void GenPythonOp::AddDefLine(const string& function_name,
diff --git a/tensorflow/python/framework/subscribe.py b/tensorflow/python/framework/subscribe.py
index 7797d991da..cee7398974 100644
--- a/tensorflow/python/framework/subscribe.py
+++ b/tensorflow/python/framework/subscribe.py
@@ -47,7 +47,7 @@ def _recursive_apply(tensors, apply_fn):
   tensors_type = type(tensors)
   if tensors_type is ops.Tensor:
     return apply_fn(tensors)
-  elif tensors_type is variables.Variable:
+  elif isinstance(tensors, variables.Variable):
     return apply_fn(tensors.value())
   elif isinstance(tensors, (list, tuple)):
     tensors = [_recursive_apply(t, apply_fn) for t in tensors]
diff --git a/tensorflow/python/framework/tensor_util.py b/tensorflow/python/framework/tensor_util.py
index ca63efbc84..9a0f34fad2 100644
--- a/tensorflow/python/framework/tensor_util.py
+++ b/tensorflow/python/framework/tensor_util.py
@@ -67,10 +67,16 @@ def SlowAppendBFloat16ArrayToTensorProto(tensor_proto, proto_values):
       [ExtractBitsFromBFloat16(x) for x in proto_values])
 
 
+def FastAppendBFloat16ArrayToTensorProto(tensor_proto, proto_values):
+  fast_tensor_util.AppendBFloat16ArrayToTensorProto(
+      tensor_proto, np.asarray(
+          proto_values, dtype=dtypes.bfloat16.as_numpy_dtype).view(np.uint16))
+
+
 if _FAST_TENSOR_UTIL_AVAILABLE:
   _NP_TO_APPEND_FN = {
       dtypes.bfloat16.as_numpy_dtype:
-          SlowAppendBFloat16ArrayToTensorProto,
+          FastAppendBFloat16ArrayToTensorProto,
       np.float16:
           _MediumAppendFloat16ArrayToTensorProto,
       np.float32:
@@ -935,8 +941,10 @@ def constant_value_as_shape(tensor):  # pylint: disable=invalid-name
 def is_tensor(x):  # pylint: disable=invalid-name
   """Check whether `x` is of tensor type.
 
-  Check whether an object is a tensor. Equivalent to
-  `isinstance(x, [tf.Tensor, tf.SparseTensor, tf.Variable])`.
+  Check whether an object is a tensor. This check is equivalent to calling
+  `isinstance(x, [tf.Tensor, tf.SparseTensor, tf.Variable])` and also checks
+  if all the component variables of a MirroredVariable or a TowerLocalVariable
+  are tensors.
 
   Args:
     x: A python object to check.
@@ -944,4 +952,5 @@ def is_tensor(x):  # pylint: disable=invalid-name
   Returns:
     `True` if `x` is a tensor, `False` if not.
   """
-  return isinstance(x, ops._TensorLike) or ops.is_dense_tensor_like(x)  # pylint: disable=protected-access
+  return (isinstance(x, ops._TensorLike) or ops.is_dense_tensor_like(x) or  # pylint: disable=protected-access
+          (hasattr(x, "is_tensor_like") and x.is_tensor_like))
diff --git a/tensorflow/python/framework/tensor_util_test.py b/tensorflow/python/framework/tensor_util_test.py
index d6edc13643..395cf43b3f 100644
--- a/tensorflow/python/framework/tensor_util_test.py
+++ b/tensorflow/python/framework/tensor_util_test.py
@@ -50,13 +50,13 @@ class TensorUtilTest(test.TestCase):
   def testFloatN(self):
     t = tensor_util.make_tensor_proto([10.0, 20.0, 30.0])
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "A \000\000A\240\000\000A\360\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\000 A\000\000\240A\000\000\360A"
@@ -68,13 +68,13 @@ class TensorUtilTest(test.TestCase):
   def testFloatTyped(self):
     t = tensor_util.make_tensor_proto([10.0, 20.0, 30.0], dtype=dtypes.float32)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "A \000\000A\240\000\000A\360\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\000 A\000\000\240A\000\000\360A"
@@ -86,13 +86,13 @@ class TensorUtilTest(test.TestCase):
   def testFloatTypeCoerce(self):
     t = tensor_util.make_tensor_proto([10, 20, 30], dtype=dtypes.float32)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "A \000\000A\240\000\000A\360\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\000 A\000\000\240A\000\000\360A"
@@ -105,13 +105,13 @@ class TensorUtilTest(test.TestCase):
     arr = np.asarray([10, 20, 30], dtype="int")
     t = tensor_util.make_tensor_proto(arr, dtype=dtypes.float32)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "A \000\000A\240\000\000A\360\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\000 A\000\000\240A\000\000\360A"
@@ -123,13 +123,13 @@ class TensorUtilTest(test.TestCase):
   def testFloatSizes(self):
     t = tensor_util.make_tensor_proto([10.0, 20.0, 30.0], shape=[1, 3])
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 1 } dim { size: 3 } }
         tensor_content: "A \000\000A\240\000\000A\360\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 1 } dim { size: 3 } }
         tensor_content: "\000\000 A\000\000\240A\000\000\360A"
@@ -141,13 +141,13 @@ class TensorUtilTest(test.TestCase):
   def testFloatSizes2(self):
     t = tensor_util.make_tensor_proto([10.0, 20.0, 30.0], shape=[3, 1])
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } dim { size: 1 } }
         tensor_content: "A \000\000A\240\000\000A\360\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } dim { size: 1 } }
         tensor_content: "\000\000 A\000\000\240A\000\000\360A"
@@ -169,13 +169,13 @@ class TensorUtilTest(test.TestCase):
     t = tensor_util.make_tensor_proto(
         np.array([[10.0, 20.0, 30.0]], dtype=np.float64))
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_DOUBLE
         tensor_shape { dim { size: 1 } dim { size: 3 } }
         tensor_content: "@$\000\000\000\000\000\000@4\000\000\000\000\000\000@>\000\000\000\000\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_DOUBLE
         tensor_shape { dim { size: 1 } dim { size: 3 } }
         tensor_content: "\000\000\000\000\000\000$@\000\000\000\000\000\0004@\000\000\000\000\000\000>@"
@@ -206,13 +206,13 @@ class TensorUtilTest(test.TestCase):
     self.assertEquals(np.float32, a.dtype)
     self.assertAllClose(np.array([5.0, 20.0, 30.0], dtype=np.float32), a)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "A \000\000A\240\000\000A\360\000\000"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_FLOAT
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\000 A\000\000\240A\000\000\360A"
@@ -299,16 +299,16 @@ class TensorUtilTest(test.TestCase):
   def testIntNDefaultType(self):
     t = tensor_util.make_tensor_proto([10, 20, 30, 40], shape=[2, 2])
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_INT32
         tensor_shape { dim { size: 2 } dim { size: 2 } }
-        tensor_content: "\000\000\000\\n\000\000\000\024\000\000\000\036\000\000\000("
+        tensor_content: "\000\000\000\n\000\000\000\024\000\000\000\036\000\000\000("
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_INT32
         tensor_shape { dim { size: 2 } dim { size: 2 } }
-        tensor_content: "\\n\000\000\000\024\000\000\000\036\000\000\000(\000\000\000"
+        tensor_content: "\n\000\000\000\024\000\000\000\036\000\000\000(\000\000\000"
         """, t)
     a = tensor_util.MakeNdarray(t)
     self.assertEquals(np.int32, a.dtype)
@@ -380,16 +380,16 @@ class TensorUtilTest(test.TestCase):
     t = tensor_util.make_tensor_proto(
         [10, 20, 30], shape=[1, 3], dtype=dtypes.int64)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_INT64
         tensor_shape { dim { size: 1 } dim { size: 3 } }
-        tensor_content: "\000\000\000\000\000\000\000\\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036"
+        tensor_content: "\000\000\000\000\000\000\000\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_INT64
         tensor_shape { dim { size: 1 } dim { size: 3 } }
-        tensor_content: "\\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036\000\000\000\000\000\000\000"
+        tensor_content: "\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036\000\000\000\000\000\000\000"
         """, t)
     a = tensor_util.MakeNdarray(t)
     self.assertEquals(np.int64, a.dtype)
@@ -398,16 +398,16 @@ class TensorUtilTest(test.TestCase):
   def testLongNpArray(self):
     t = tensor_util.make_tensor_proto(np.array([10, 20, 30]))
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_INT64
         tensor_shape { dim { size: 3 } }
-        tensor_content: "\000\000\000\000\000\000\000\\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036"
+        tensor_content: "\000\000\000\000\000\000\000\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_INT64
         tensor_shape { dim { size: 3 } }
-        tensor_content: "\\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036\000\000\000\000\000\000\000"
+        tensor_content: "\n\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\036\000\000\000\000\000\000\000"
         """, t)
     a = tensor_util.MakeNdarray(t)
     self.assertEquals(np.int64, a.dtype)
@@ -419,13 +419,13 @@ class TensorUtilTest(test.TestCase):
 
     t = tensor_util.make_tensor_proto(data, dtype=dtypes.qint32)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_QINT32
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\000\000\025\000\000\000\026\000\000\000\027"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_QINT32
         tensor_shape { dim { size: 3 } }
         tensor_content: "\025\000\000\000\026\000\000\000\027\000\000\000"
@@ -435,7 +435,7 @@ class TensorUtilTest(test.TestCase):
     self.assertAllEqual(np.array(data, dtype=a.dtype), a)
 
     t = tensor_util.make_tensor_proto(data, dtype=dtypes.quint8)
-    self.assertProtoEquals("""
+    self.assertProtoEquals(r"""
       dtype: DT_QUINT8
       tensor_shape { dim { size: 3 } }
       tensor_content: "\025\026\027"
@@ -445,7 +445,7 @@ class TensorUtilTest(test.TestCase):
     self.assertAllEqual(np.array(data, dtype=a.dtype), a)
 
     t = tensor_util.make_tensor_proto(data, dtype=dtypes.qint8)
-    self.assertProtoEquals("""
+    self.assertProtoEquals(r"""
       dtype: DT_QINT8
       tensor_shape { dim { size: 3 } }
       tensor_content: "\025\026\027"
@@ -456,13 +456,13 @@ class TensorUtilTest(test.TestCase):
 
     t = tensor_util.make_tensor_proto(data, dtype=dtypes.quint16)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_QUINT16
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\025\000\026\000\027"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_QUINT16
         tensor_shape { dim { size: 3 } }
         tensor_content: "\025\000\026\000\027\000"
@@ -473,13 +473,13 @@ class TensorUtilTest(test.TestCase):
 
     t = tensor_util.make_tensor_proto(data, dtype=dtypes.qint16)
     if sys.byteorder == "big":
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_QINT16
         tensor_shape { dim { size: 3 } }
         tensor_content: "\000\025\000\026\000\027"
         """, t)
     else:
-      self.assertProtoEquals("""
+      self.assertProtoEquals(r"""
         dtype: DT_QINT16
         tensor_shape { dim { size: 3 } }
         tensor_content: "\025\000\026\000\027\000"
diff --git a/tensorflow/python/framework/test_util.py b/tensorflow/python/framework/test_util.py
index 2bc2a189fa..fc47b1cca5 100644
--- a/tensorflow/python/framework/test_util.py
+++ b/tensorflow/python/framework/test_util.py
@@ -19,6 +19,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
+from collections import OrderedDict
 import contextlib
 import gc
 import itertools
@@ -571,6 +573,78 @@ def assert_no_garbage_created(f):
   return decorator
 
 
+def _combine_named_parameters(**kwargs):
+  """Generate combinations based on its keyword arguments.
+
+  Two sets of returned combinations can be concatenated using +.  Their product
+  can be computed using `times()`.
+
+  Args:
+    **kwargs: keyword arguments of form `option=[possibilities, ...]`
+         or `option=the_only_possibility`.
+
+  Returns:
+    a list of dictionaries for each combination. Keys in the dictionaries are
+    the keyword argument names.  Each key has one value - one of the
+    corresponding keyword argument values.
+  """
+  if not kwargs:
+    return [OrderedDict()]
+
+  sort_by_key = lambda k: k[0][0]
+  kwargs = OrderedDict(sorted(kwargs.items(), key=sort_by_key))
+  first = list(kwargs.items())[0]
+
+  rest = dict(list(kwargs.items())[1:])
+  rest_combined = _combine_named_parameters(**rest)
+
+  key = first[0]
+  values = first[1]
+  if not isinstance(values, list):
+    values = [values]
+
+  combinations = [
+      OrderedDict(sorted(list(combined.items()) + [(key, v)], key=sort_by_key))
+      for v in values
+      for combined in rest_combined
+  ]
+  return combinations
+
+
+def generate_combinations_with_testcase_name(**kwargs):
+  """Generate combinations based on its keyword arguments using combine().
+
+  This function calls combine() and appends a testcase name to the list of
+  dictionaries returned. The 'testcase_name' key is a required for named
+  parameterized tests.
+
+  Args:
+    **kwargs: keyword arguments of form `option=[possibilities, ...]`
+         or `option=the_only_possibility`.
+
+  Returns:
+    a list of dictionaries for each combination. Keys in the dictionaries are
+    the keyword argument names.  Each key has one value - one of the
+    corresponding keyword argument values.
+  """
+  combinations = _combine_named_parameters(**kwargs)
+  named_combinations = []
+  for combination in combinations:
+    assert isinstance(combination, OrderedDict)
+    name = "".join([
+        "_{}_{}".format(
+            "".join(filter(str.isalnum, key)),
+            "".join(filter(str.isalnum, str(value))))
+        for key, value in combination.items()
+    ])
+    named_combinations.append(
+        OrderedDict(
+            list(combination.items()) + [("testcase_name",
+                                          "_test{}".format(name))]))
+
+  return named_combinations
+
+
 def run_all_in_graph_and_eager_modes(cls):
   """Execute all test methods in the given class with and without eager."""
   base_decorator = run_in_graph_and_eager_modes
@@ -1227,8 +1301,8 @@ class TensorFlowTestCase(googletest.TestCase):
       a = a._asdict()
     if hasattr(b, "_asdict"):
       b = b._asdict()
-    a_is_dict = isinstance(a, dict)
-    if a_is_dict != isinstance(b, dict):
+    a_is_dict = isinstance(a, collections.Mapping)
+    if a_is_dict != isinstance(b, collections.Mapping):
       raise ValueError("Can't compare dict to non-dict, a%s vs b%s. %s" %
                        (path_str, path_str, msg))
     if a_is_dict:
diff --git a/tensorflow/python/framework/test_util_test.py b/tensorflow/python/framework/test_util_test.py
index 122c14c847..f983cbef04 100644
--- a/tensorflow/python/framework/test_util_test.py
+++ b/tensorflow/python/framework/test_util_test.py
@@ -73,7 +73,7 @@ class TestUtilTest(test_util.TensorFlowTestCase):
     test_util.assert_equal_graph_def(def_57, def_75)
     # Compare two unequal graphs
     with self.assertRaisesRegexp(AssertionError,
-                                 r"^Found unexpected node 'seven"):
+                                 r"^Found unexpected node '{{node seven}}"):
       test_util.assert_equal_graph_def(def_57, def_empty)
 
   def testIsGoogleCudaEnabled(self):
diff --git a/tensorflow/python/framework/traceable_stack.py b/tensorflow/python/framework/traceable_stack.py
new file mode 100644
index 0000000000..7f4d28237f
--- /dev/null
+++ b/tensorflow/python/framework/traceable_stack.py
@@ -0,0 +1,132 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""A simple stack that associates filename and line numbers with each object."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.util import tf_stack
+
+
+class TraceableObject(object):
+  """Wrap an object together with its the code definition location."""
+
+  # Return codes for the set_filename_and_line_from_caller() method.
+  SUCCESS, HEURISTIC_USED, FAILURE = (0, 1, 2)
+
+  def __init__(self, obj, filename=None, lineno=None):
+    self.obj = obj
+    self.filename = filename
+    self.lineno = lineno
+
+  def set_filename_and_line_from_caller(self, offset=0):
+    """Set filename and line using the caller's stack frame.
+
+    If the requested stack information is not available, a heuristic may
+    be applied and self.HEURISTIC USED will be returned.  If the heuristic
+    fails then no change will be made to the filename and lineno members
+    (None by default) and self.FAILURE will be returned.
+
+    Args:
+      offset: Integer.  If 0, the caller's stack frame is used.  If 1,
+          the caller's caller's stack frame is used.  Larger values are
+          permissible but if out-of-range (larger than the number of stack
+          frames available) the outermost stack frame will be used.
+
+    Returns:
+      TraceableObject.SUCCESS if appropriate stack information was found,
+      TraceableObject.HEURISTIC_USED if the offset was larger than the stack,
+      and TraceableObject.FAILURE if the stack was empty.
+    """
+    # Offset is defined in "Args" as relative to the caller.  We are one frame
+    # beyond the caller.
+    local_offset = offset + 1
+
+    frame_records = tf_stack.extract_stack()
+    if not frame_records:
+      return self.FAILURE
+    if len(frame_records) >= local_offset:
+      # Negative indexing is one-indexed instead of zero-indexed.
+      negative_offset = -(local_offset + 1)
+      self.filename, self.lineno = frame_records[negative_offset][:2]
+      return self.SUCCESS
+    else:
+      # If the offset is too large then we use the largest offset possible,
+      # meaning we use the outermost stack frame at index 0.
+      self.filename, self.lineno = frame_records[0][:2]
+      return self.HEURISTIC_USED
+
+  def copy_metadata(self):
+    """Return a TraceableObject like this one, but without the object."""
+    return self.__class__(None, filename=self.filename, lineno=self.lineno)
+
+
+class TraceableStack(object):
+  """A stack of TraceableObjects."""
+
+  def __init__(self, existing_stack=None):
+    """Constructor.
+
+    Args:
+      existing_stack: [TraceableObject, ...] If provided, this object will
+        set its new stack to a SHALLOW COPY of existing_stack.
+    """
+    self._stack = existing_stack[:] if existing_stack else []
+
+  def push_obj(self, obj, offset=0):
+    """Add object to the stack and record its filename and line information.
+
+    Args:
+      obj: An object to store on the stack.
+      offset: Integer.  If 0, the caller's stack frame is used.  If 1,
+          the caller's caller's stack frame is used.
+
+    Returns:
+      TraceableObject.SUCCESS if appropriate stack information was found,
+      TraceableObject.HEURISTIC_USED if the stack was smaller than expected,
+      and TraceableObject.FAILURE if the stack was empty.
+    """
+    traceable_obj = TraceableObject(obj)
+    self._stack.append(traceable_obj)
+    # Offset is defined in "Args" as relative to the caller.  We are 1 frame
+    # beyond the caller and need to compensate.
+    return traceable_obj.set_filename_and_line_from_caller(offset + 1)
+
+  def pop_obj(self):
+    """Remove last-inserted object and return it, without filename/line info."""
+    return self._stack.pop().obj
+
+  def peek_objs(self):
+    """Return list of stored objects ordered newest to oldest."""
+    return [t_obj.obj for t_obj in reversed(self._stack)]
+
+  def peek_traceable_objs(self):
+    """Return list of stored TraceableObjects ordered newest to oldest."""
+    return list(reversed(self._stack))
+
+  def __len__(self):
+    """Return number of items on the stack, and used for truth-value testing."""
+    return len(self._stack)
+
+  def copy(self):
+    """Return a copy of self referencing the same objects but in a new list.
+
+    This method is implemented to support thread-local stacks.
+
+    Returns:
+      TraceableStack with a new list that holds existing objects.
+    """
+    return TraceableStack(self._stack)
diff --git a/tensorflow/python/framework/traceable_stack_test.py b/tensorflow/python/framework/traceable_stack_test.py
new file mode 100644
index 0000000000..3e7876f631
--- /dev/null
+++ b/tensorflow/python/framework/traceable_stack_test.py
@@ -0,0 +1,133 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for tensorflow.python.framework.traceable_stack."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import test_util
+from tensorflow.python.framework import traceable_stack
+from tensorflow.python.platform import googletest
+from tensorflow.python.util import tf_inspect as inspect
+
+_LOCAL_OBJECT = lambda x: x
+_THIS_FILENAME = inspect.getsourcefile(_LOCAL_OBJECT)
+
+
+class TraceableObjectTest(test_util.TensorFlowTestCase):
+
+  def testSetFilenameAndLineFromCallerUsesCallersStack(self):
+    t_obj = traceable_stack.TraceableObject(17)
+
+    # Do not separate placeholder from the set_filename_and_line_from_caller()
+    # call one line below it as it is used to calculate the latter's line
+    # number.
+    placeholder = lambda x: x
+    result = t_obj.set_filename_and_line_from_caller()
+
+    expected_lineno = inspect.getsourcelines(placeholder)[1] + 1
+    self.assertEqual(expected_lineno, t_obj.lineno)
+    self.assertEqual(_THIS_FILENAME, t_obj.filename)
+    self.assertEqual(t_obj.SUCCESS, result)
+
+  def testSetFilenameAndLineFromCallerRespectsOffset(self):
+
+    def call_set_filename_and_line_from_caller(t_obj):
+      # We expect to retrieve the line number from _our_ caller.
+      return t_obj.set_filename_and_line_from_caller(offset=1)
+
+    t_obj = traceable_stack.TraceableObject(None)
+    # Do not separate placeholder from the
+    # call_set_filename_and_line_from_caller() call one line below it as it is
+    # used to calculate the latter's line number.
+    placeholder = lambda x: x
+    result = call_set_filename_and_line_from_caller(t_obj)
+
+    expected_lineno = inspect.getsourcelines(placeholder)[1] + 1
+    self.assertEqual(expected_lineno, t_obj.lineno)
+    self.assertEqual(t_obj.SUCCESS, result)
+
+  def testSetFilenameAndLineFromCallerHandlesRidiculousOffset(self):
+    t_obj = traceable_stack.TraceableObject('The quick brown fox.')
+    # This line shouldn't die.
+    result = t_obj.set_filename_and_line_from_caller(offset=300)
+
+    # We expect a heuristic to be used because we are not currently 300 frames
+    # down on the stack.  The filename and lineno of the outermost frame are not
+    # predictable -- in some environments the filename is this test file, but in
+    # other environments it is not (e.g. due to a test runner calling this
+    # file).  Therefore we only test that the called function knows it applied a
+    # heuristic for the ridiculous stack offset.
+    self.assertEqual(t_obj.HEURISTIC_USED, result)
+
+
+class TraceableStackTest(test_util.TensorFlowTestCase):
+
+  def testPushPeekPopObj(self):
+    t_stack = traceable_stack.TraceableStack()
+    t_stack.push_obj(42.0)
+    t_stack.push_obj('hope')
+
+    expected_lifo_peek = ['hope', 42.0]
+    self.assertEqual(expected_lifo_peek, t_stack.peek_objs())
+
+    self.assertEqual('hope', t_stack.pop_obj())
+    self.assertEqual(42.0, t_stack.pop_obj())
+
+  def testPushPopPreserveLifoOrdering(self):
+    t_stack = traceable_stack.TraceableStack()
+    t_stack.push_obj(0)
+    t_stack.push_obj(1)
+    t_stack.push_obj(2)
+    t_stack.push_obj(3)
+
+    obj_3 = t_stack.pop_obj()
+    obj_2 = t_stack.pop_obj()
+    obj_1 = t_stack.pop_obj()
+    obj_0 = t_stack.pop_obj()
+
+    self.assertEqual(3, obj_3)
+    self.assertEqual(2, obj_2)
+    self.assertEqual(1, obj_1)
+    self.assertEqual(0, obj_0)
+
+  def testPushObjSetsFilenameAndLineInfoForCaller(self):
+    t_stack = traceable_stack.TraceableStack()
+
+    # We expect that the line number recorded for the 1-object will come from
+    # the call to t_stack.push_obj(1).  Do not separate the next two lines!
+    placeholder_1 = lambda x: x
+    t_stack.push_obj(1)
+
+    # We expect that the line number recorded for the 2-object will come from
+    # the call to call_push_obj() and _not_ the call to t_stack.push_obj().
+    def call_push_obj(obj):
+      t_stack.push_obj(obj, offset=1)
+
+    # Do not separate the next two lines!
+    placeholder_2 = lambda x: x
+    call_push_obj(2)
+
+    expected_lineno_1 = inspect.getsourcelines(placeholder_1)[1] + 1
+    expected_lineno_2 = inspect.getsourcelines(placeholder_2)[1] + 1
+
+    t_obj_2, t_obj_1 = t_stack.peek_traceable_objs()
+    self.assertEqual(expected_lineno_2, t_obj_2.lineno)
+    self.assertEqual(expected_lineno_1, t_obj_1.lineno)
+
+
+if __name__ == '__main__':
+  googletest.main()
diff --git a/tensorflow/python/grappler/layout_optimizer_test.py b/tensorflow/python/grappler/layout_optimizer_test.py
index 7d07c77c79..8cc971c61d 100644
--- a/tensorflow/python/grappler/layout_optimizer_test.py
+++ b/tensorflow/python/grappler/layout_optimizer_test.py
@@ -1340,7 +1340,7 @@ class LayoutOptimizerTest(test.TestCase):
       expected_num_transposes = 2
       self.assertEqual(expected_num_transposes, num_transposes)
       self._assert_trans_nhwc_to_nchw('Conv2D-0', nodes)
-      self.assertAllEqual(output_val_ref, output_val)
+      self.assertAllClose(output_val_ref, output_val, atol=1e-3)
 
   def testLoop(self):
     if test.is_gpu_available(cuda_only=True):
diff --git a/tensorflow/python/keras/BUILD b/tensorflow/python/keras/BUILD
index 8b6b28bc77..87ceef052c 100755
--- a/tensorflow/python/keras/BUILD
+++ b/tensorflow/python/keras/BUILD
@@ -114,12 +114,14 @@ py_library(
         "constraints.py",
         "engine/__init__.py",
         "engine/base_layer.py",
+        "engine/distributed_training_utils.py",
         "engine/input_layer.py",
         "engine/network.py",
         "engine/saving.py",
         "engine/sequential.py",
         "engine/training.py",
         "engine/training_arrays.py",
+        "engine/training_distributed.py",
         "engine/training_eager.py",
         "engine/training_generator.py",
         "engine/training_utils.py",
@@ -451,6 +453,7 @@ cuda_py_test(
         "//tensorflow/python:client_testlib",
     ],
     shard_count = 2,
+    tags = ["no_windows_gpu"],
 )
 
 py_test(
@@ -703,6 +706,17 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "training_gpu_test",
+    size = "small",
+    srcs = ["engine/training_gpu_test.py"],
+    additional_deps = [
+        ":keras",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
 py_test(
     name = "imagenet_utils_test",
     size = "small",
@@ -766,7 +780,7 @@ py_test(
 
 py_test(
     name = "training_test",
-    size = "medium",
+    size = "large",
     srcs = ["engine/training_test.py"],
     srcs_version = "PY2AND3",
     tags = ["notsan"],
@@ -791,6 +805,19 @@ py_test(
 )
 
 py_test(
+    name = "training_utils_test",
+    size = "medium",
+    srcs = ["engine/training_utils_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["notsan"],
+    deps = [
+        ":keras",
+        "//tensorflow/python:client_testlib",
+        "//third_party/py/numpy",
+    ],
+)
+
+py_test(
     name = "model_subclassing_test",
     size = "medium",
     srcs = ["model_subclassing_test.py"],
diff --git a/tensorflow/python/keras/activations.py b/tensorflow/python/keras/activations.py
index f608dea430..99645de736 100644
--- a/tensorflow/python/keras/activations.py
+++ b/tensorflow/python/keras/activations.py
@@ -128,20 +128,26 @@ def softsign(x):
 
 
 @tf_export('keras.activations.relu')
-def relu(x, alpha=0., max_value=None):
+def relu(x, alpha=0., max_value=None, threshold=0):
   """Rectified Linear Unit.
 
+  With default values, it returns element-wise `max(x, 0)`.
+
+  Otherwise, it follows:
+  `f(x) = max_value` for `x >= max_value`,
+  `f(x) = x` for `threshold <= x < max_value`,
+  `f(x) = alpha * (x - threshold)` otherwise.
+
   Arguments:
-      x: Input tensor.
-      alpha: Slope of the negative part. Defaults to zero.
-      max_value: Maximum value for the output.
+      x: A tensor or variable.
+      alpha: A scalar, slope of negative section (default=`0.`).
+      max_value: float. Saturation threshold.
+      threshold: float. Threshold value for thresholded activation.
 
   Returns:
-      The (leaky) rectified linear unit activation: `x` if `x > 0`,
-        `alpha * x` if `x < 0`. If `max_value` is defined, the result
-        is truncated to this value.
+      A tensor.
   """
-  return K.relu(x, alpha=alpha, max_value=max_value)
+  return K.relu(x, alpha=alpha, max_value=max_value, threshold=threshold)
 
 
 @tf_export('keras.activations.tanh')
diff --git a/tensorflow/python/keras/applications/mobilenet.py b/tensorflow/python/keras/applications/mobilenet.py
index e56c695a28..7285e03963 100644
--- a/tensorflow/python/keras/applications/mobilenet.py
+++ b/tensorflow/python/keras/applications/mobilenet.py
@@ -72,13 +72,9 @@ from __future__ import print_function
 import os
 
 from tensorflow.python.keras import backend as K
-from tensorflow.python.keras import constraints
-from tensorflow.python.keras import initializers
-from tensorflow.python.keras import regularizers
 from tensorflow.python.keras.applications import imagenet_utils
 from tensorflow.python.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras.applications.imagenet_utils import decode_predictions
-from tensorflow.python.keras.engine.base_layer import InputSpec
 from tensorflow.python.keras.layers import Activation
 from tensorflow.python.keras.layers import BatchNormalization
 from tensorflow.python.keras.layers import Conv2D
@@ -87,10 +83,10 @@ from tensorflow.python.keras.layers import Dropout
 from tensorflow.python.keras.layers import GlobalAveragePooling2D
 from tensorflow.python.keras.layers import GlobalMaxPooling2D
 from tensorflow.python.keras.layers import Input
+from tensorflow.python.keras.layers import ReLU
 from tensorflow.python.keras.layers import Reshape
 from tensorflow.python.keras.layers import ZeroPadding2D
 from tensorflow.python.keras.models import Model
-from tensorflow.python.keras.utils import conv_utils
 from tensorflow.python.keras.utils import layer_utils
 from tensorflow.python.keras.utils.data_utils import get_file
 from tensorflow.python.platform import tf_logging as logging
@@ -100,10 +96,6 @@ from tensorflow.python.util.tf_export import tf_export
 BASE_WEIGHT_PATH = 'https://github.com/fchollet/deep-learning-models/releases/download/v0.6/'
 
 
-def relu6(x):
-  return K.relu(x, max_value=6)
-
-
 @tf_export('keras.applications.mobilenet.preprocess_input')
 def preprocess_input(x):
   """Preprocesses a numpy array encoding a batch of images.
@@ -130,12 +122,6 @@ def MobileNet(input_shape=None,
               classes=1000):
   """Instantiates the MobileNet architecture.
 
-  To load a MobileNet model via `load_model`, import the custom
-  objects `relu6` and pass them to the `custom_objects` parameter.
-  E.g.
-  model = load_model('mobilenet.h5', custom_objects={
-                     'relu6': mobilenet.relu6})
-
   Arguments:
       input_shape: optional shape tuple, only to be specified
           if `include_top` is False (otherwise the input shape
@@ -412,7 +398,7 @@ def _conv_block(inputs, filters, alpha, kernel=(3, 3), strides=(1, 1)):
       strides=strides,
       name='conv1')(x)
   x = BatchNormalization(axis=channel_axis, name='conv1_bn')(x)
-  return Activation(relu6, name='conv1_relu')(x)
+  return ReLU(6, name='conv1_relu')(x)
 
 
 def _depthwise_conv_block(inputs,
@@ -479,7 +465,7 @@ def _depthwise_conv_block(inputs,
       use_bias=False,
       name='conv_dw_%d' % block_id)(x)
   x = BatchNormalization(axis=channel_axis, name='conv_dw_%d_bn' % block_id)(x)
-  x = Activation(relu6, name='conv_dw_%d_relu' % block_id)(x)
+  x = ReLU(6, name='conv_dw_%d_relu' % block_id)(x)
 
   x = Conv2D(
       pointwise_conv_filters, (1, 1),
@@ -489,4 +475,4 @@ def _depthwise_conv_block(inputs,
       name='conv_pw_%d' % block_id)(
           x)
   x = BatchNormalization(axis=channel_axis, name='conv_pw_%d_bn' % block_id)(x)
-  return Activation(relu6, name='conv_pw_%d_relu' % block_id)(x)
+  return ReLU(6, name='conv_pw_%d_relu' % block_id)(x)
diff --git a/tensorflow/python/keras/backend.py b/tensorflow/python/keras/backend.py
index 824513dce0..bf36ab42e3 100644
--- a/tensorflow/python/keras/backend.py
+++ b/tensorflow/python/keras/backend.py
@@ -736,9 +736,10 @@ def is_keras_tensor(x):
       True
   ```
   """
-  if not isinstance(x, (ops.Tensor,
-                        variables_module.Variable,
-                        sparse_tensor.SparseTensor)):
+  if (not isinstance(x, (ops.Tensor,
+                         variables_module.Variable,
+                         sparse_tensor.SparseTensor)) and
+      x.__class__.__name__ != 'DeferredTensor'):
     raise ValueError('Unexpectedly found an instance of type `' + str(type(x)) +
                      '`. Expected a symbolic tensor instance.')
   return hasattr(x, '_keras_history')
@@ -963,13 +964,14 @@ def zeros(shape, dtype=None, name=None):
              [ 0.,  0.,  0.,  0.]], dtype=float32)
   ```
   """
-  if dtype is None:
-    dtype = floatx()
-  tf_dtype = dtypes_module.as_dtype(dtype)
-  v = array_ops.zeros(shape=shape, dtype=tf_dtype, name=name)
-  if py_all(v.get_shape().as_list()):
-    return variable(v, dtype=dtype, name=name)
-  return v
+  with ops.init_scope():
+    if dtype is None:
+      dtype = floatx()
+    tf_dtype = dtypes_module.as_dtype(dtype)
+    v = array_ops.zeros(shape=shape, dtype=tf_dtype, name=name)
+    if py_all(v.get_shape().as_list()):
+      return variable(v, dtype=dtype, name=name)
+    return v
 
 
 @tf_export('keras.backend.ones')
@@ -996,13 +998,14 @@ def ones(shape, dtype=None, name=None):
              [ 1.,  1.,  1.,  1.]], dtype=float32)
   ```
   """
-  if dtype is None:
-    dtype = floatx()
-  tf_dtype = dtypes_module.as_dtype(dtype)
-  v = array_ops.ones(shape=shape, dtype=tf_dtype, name=name)
-  if py_all(v.get_shape().as_list()):
-    return variable(v, dtype=dtype, name=name)
-  return v
+  with ops.init_scope():
+    if dtype is None:
+      dtype = floatx()
+    tf_dtype = dtypes_module.as_dtype(dtype)
+    v = array_ops.ones(shape=shape, dtype=tf_dtype, name=name)
+    if py_all(v.get_shape().as_list()):
+      return variable(v, dtype=dtype, name=name)
+    return v
 
 
 @tf_export('keras.backend.eye')
@@ -2932,8 +2935,8 @@ def function(inputs, outputs, updates=None, **kwargs):
   """
   if kwargs:
     for key in kwargs:
-      if (key not in tf_inspect.getargspec(session_module.Session.run)[0] and
-          key not in tf_inspect.getargspec(Function.__init__)[0]):
+      if (key not in tf_inspect.getfullargspec(session_module.Session.run)[0]
+          and key not in tf_inspect.getfullargspec(Function.__init__)[0]):
         msg = ('Invalid argument "%s" passed to K.function with TensorFlow '
                'backend') % key
         raise ValueError(msg)
@@ -3370,26 +3373,48 @@ def in_test_phase(x, alt, training=None):
 
 
 @tf_export('keras.backend.relu')
-def relu(x, alpha=0., max_value=None):
+def relu(x, alpha=0., max_value=None, threshold=0):
   """Rectified linear unit.
 
   With default values, it returns element-wise `max(x, 0)`.
 
+  Otherwise, it follows:
+  `f(x) = max_value` for `x >= max_value`,
+  `f(x) = x` for `threshold <= x < max_value`,
+  `f(x) = alpha * (x - threshold)` otherwise.
+
   Arguments:
       x: A tensor or variable.
       alpha: A scalar, slope of negative section (default=`0.`).
-      max_value: Saturation threshold.
+      max_value: float. Saturation threshold.
+      threshold: float. Threshold value for thresholded activation.
 
   Returns:
       A tensor.
   """
+  clip_max = max_value is not None
+
   if alpha != 0.:
-    negative_part = nn.relu(-x)
-  x = nn.relu(x)
-  if max_value is not None:
+    if threshold != 0:
+      negative_part = nn.relu(-x + threshold)
+    else:
+      negative_part = nn.relu(-x)
+
+  if threshold != 0:
+    # computes x for x > threshold else 0
+    x = x * math_ops.cast(math_ops.greater(x, threshold), floatx())
+  elif max_value == 6:
+    # if no threshold, then can use nn.relu6 native TF op for performance
+    x = nn.relu6(x)
+    clip_max = False
+  else:
+    x = nn.relu(x)
+
+  if clip_max:
     max_value = _to_tensor(max_value, x.dtype.base_dtype)
     zero = _to_tensor(0., x.dtype.base_dtype)
     x = clip_ops.clip_by_value(x, zero, max_value)
+
   if alpha != 0.:
     alpha = _to_tensor(alpha, x.dtype.base_dtype)
     x -= alpha * negative_part
@@ -3456,7 +3481,7 @@ def softsign(x):
 
 
 @tf_export('keras.backend.categorical_crossentropy')
-def categorical_crossentropy(target, output, from_logits=False):
+def categorical_crossentropy(target, output, from_logits=False, axis=-1):
   """Categorical crossentropy between an output tensor and a target tensor.
 
   Arguments:
@@ -3466,28 +3491,33 @@ def categorical_crossentropy(target, output, from_logits=False):
           case `output` is expected to be the logits).
       from_logits: Boolean, whether `output` is the
           result of a softmax, or is a tensor of logits.
+      axis: Int specifying the channels axis. `axis=-1` corresponds to data
+          format `channels_last', and `axis=1` corresponds to data format
+          `channels_first`.
 
   Returns:
       Output tensor.
+
+  Raises:
+      ValueError: if `axis` is neither -1 nor one of the axes of `output`.
   """
+  rank = len(output.get_shape())
+  axis = axis % rank
   # Note: nn.softmax_cross_entropy_with_logits_v2
   # expects logits, Keras expects probabilities.
   if not from_logits:
     # scale preds so that the class probas of each sample sum to 1
-    output = output / math_ops.reduce_sum(  # pylint: disable=g-no-augmented-assignment
-        output, len(output.get_shape()) - 1, True)
+    output = output / math_ops.reduce_sum(output, axis, True)
     # manual computation of crossentropy
     epsilon_ = _to_tensor(epsilon(), output.dtype.base_dtype)
     output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
-    return -math_ops.reduce_sum(
-        target * math_ops.log(output),
-        axis=len(output.get_shape()) - 1)
+    return -math_ops.reduce_sum(target * math_ops.log(output), axis)
   else:
     return nn.softmax_cross_entropy_with_logits_v2(labels=target, logits=output)
 
 
 @tf_export('keras.backend.sparse_categorical_crossentropy')
-def sparse_categorical_crossentropy(target, output, from_logits=False):
+def sparse_categorical_crossentropy(target, output, from_logits=False, axis=-1):
   """Categorical crossentropy with integer targets.
 
   Arguments:
@@ -3497,10 +3527,22 @@ def sparse_categorical_crossentropy(target, output, from_logits=False):
           case `output` is expected to be the logits).
       from_logits: Boolean, whether `output` is the
           result of a softmax, or is a tensor of logits.
+      axis: Int specifying the channels axis. `axis=-1` corresponds to data
+          format `channels_last', and `axis=1` corresponds to data format
+          `channels_first`.
 
   Returns:
       Output tensor.
+
+  Raises:
+      ValueError: if `axis` is neither -1 nor one of the axes of `output`.
   """
+  rank = len(output.get_shape())
+  axis = axis % rank
+  if axis != rank - 1:
+    permutation = list(range(axis)) + list(range(axis + 1, rank)) + [axis]
+    output = array_ops.transpose(output, perm=permutation)
+
   # Note: nn.sparse_softmax_cross_entropy_with_logits
   # expects logits, Keras expects probabilities.
   if not from_logits:
diff --git a/tensorflow/python/keras/backend_test.py b/tensorflow/python/keras/backend_test.py
index 36478ea089..40e7910061 100644
--- a/tensorflow/python/keras/backend_test.py
+++ b/tensorflow/python/keras/backend_test.py
@@ -23,6 +23,7 @@ import scipy.sparse
 
 from tensorflow.python import keras
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
@@ -490,6 +491,66 @@ class BackendLinearAlgebraTest(test.TestCase):
                                        input_shape_a=(4, 7),
                                        input_shape_b=(4, 7))
 
+  def test_relu(self):
+    x = ops.convert_to_tensor([[-4, 0], [2, 7]], 'float32')
+    with self.test_session():
+      # standard relu
+      relu_op = keras.backend.relu(x)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [2, 7]])
+
+      # alpha
+      relu_op = keras.backend.relu(x, alpha=0.5)
+      self.assertAllClose(keras.backend.eval(relu_op), [[-2, 0], [2, 7]])
+
+      # max_value < some elements
+      relu_op = keras.backend.relu(x, max_value=5)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [2, 5]])
+
+      # nn.relu6 used
+      relu_op = keras.backend.relu(x, max_value=6)
+      self.assertTrue('Relu6' in relu_op.name)  # uses tf.nn.relu6
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [2, 6]])
+
+      # max value > 6
+      relu_op = keras.backend.relu(x, max_value=10)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [2, 7]])
+
+      # max value is float
+      relu_op = keras.backend.relu(x, max_value=4.3)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [2, 4.3]])
+
+      # max value == 0
+      relu_op = keras.backend.relu(x, max_value=0)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [0, 0]])
+
+      # alpha and max_value
+      relu_op = keras.backend.relu(x, alpha=0.25, max_value=3)
+      self.assertAllClose(keras.backend.eval(relu_op), [[-1, 0], [2, 3]])
+
+      # threshold
+      relu_op = keras.backend.relu(x, threshold=3)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [0, 7]])
+
+      # threshold is float
+      relu_op = keras.backend.relu(x, threshold=1.5)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [2, 7]])
+
+      # threshold is negative
+      relu_op = keras.backend.relu(x, threshold=-5)
+      self.assertAllClose(keras.backend.eval(relu_op), [[-4, 0], [2, 7]])
+
+      # threshold and max_value
+      relu_op = keras.backend.relu(x, threshold=3, max_value=5)
+      self.assertAllClose(keras.backend.eval(relu_op), [[0, 0], [0, 5]])
+
+      # threshold and alpha
+      relu_op = keras.backend.relu(x, alpha=0.25, threshold=4)
+      self.assertAllClose(keras.backend.eval(relu_op), [[-2, -1], [-0.5, 7]])
+
+      # threshold, alpha, and max_value
+      relu_op = keras.backend.relu(x, alpha=0.25, threshold=4, max_value=5)
+      self.assertAllClose(keras.backend.eval(relu_op), [[-2, -1], [-0.5, 5]])
+
 
 class BackendShapeOpsTest(test.TestCase):
 
diff --git a/tensorflow/python/keras/callbacks.py b/tensorflow/python/keras/callbacks.py
index 3ae06d7ab8..070d41147d 100644
--- a/tensorflow/python/keras/callbacks.py
+++ b/tensorflow/python/keras/callbacks.py
@@ -31,11 +31,18 @@ import time
 import numpy as np
 import six
 
+from tensorflow.python.eager import context
+from tensorflow.python.framework import dtypes
 from tensorflow.python.keras import backend as K
+from tensorflow.python.keras.engine.training_utils import standardize_input_data
 from tensorflow.python.keras.utils.generic_utils import Progbar
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import summary_ops_v2
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.summary import summary as tf_summary
+from tensorflow.python.training import saver
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -697,7 +704,9 @@ class TensorBoard(Callback):
       write_images: whether to write model weights to visualize as
           image in TensorBoard.
       embeddings_freq: frequency (in epochs) at which selected embedding
-          layers will be saved.
+          layers will be saved. If set to 0, embeddings won't be computed.
+          Data to be visualized in TensorBoard's Embedding tab must be passed
+          as `embeddings_data`.
       embeddings_layer_names: a list of names of layers to keep eye on. If
           None or empty list all the embedding layer will be watched.
       embeddings_metadata: a dictionary which maps layer name to a file name
@@ -705,6 +714,19 @@ class TensorBoard(Callback):
           [details](https://www.tensorflow.org/how_tos/embedding_viz/#metadata_optional)
           about metadata files format. In case if the same metadata file is
           used for all embedding layers, string can be passed.
+      embeddings_data: data to be embedded at layers specified in
+          `embeddings_layer_names`. Numpy array (if the model has a single
+          input) or list of Numpy arrays (if the model has multiple inputs).
+          Learn [more about embeddings](https://www.tensorflow.org/programmers_guide/embedding)
+
+  Raises:
+      ValueError: If histogram_freq is set and no validation data is provided.
+
+  @compatbility(eager)
+  Using `Tensorboard` callback will work while eager execution is enabled,
+  however outputting histogram summaries of weights and gradients is not
+  supported, and thus `histogram_freq` will be ignored.
+  @end_compatibility
   """
 
   # pylint: enable=line-too-long
@@ -715,24 +737,43 @@ class TensorBoard(Callback):
                batch_size=32,
                write_graph=True,
                write_grads=False,
-               write_images=False):
+               write_images=False,
+               embeddings_freq=0,
+               embeddings_layer_names=None,
+               embeddings_metadata=None,
+               embeddings_data=None):
     super(TensorBoard, self).__init__()
     self.log_dir = log_dir
     self.histogram_freq = histogram_freq
+    if self.histogram_freq and context.executing_eagerly():
+      logging.warning(
+          UserWarning('Weight and gradient histograms not supported for eager'
+                      'execution, setting `histogram_freq` to `0`.'))
+      self.histogram_freq = 0
     self.merged = None
     self.write_graph = write_graph
     self.write_grads = write_grads
     self.write_images = write_images
     self.batch_size = batch_size
     self._current_batch = 0
-    # abstracted writer class to be able to stub for testing
-    self._writer_class = tf_summary.FileWriter
-
-  def set_model(self, model):
-    """Sets Keras model and creates summary ops."""
+    self._total_batches_seen = 0
+    self.embeddings_freq = embeddings_freq
+    self.embeddings_layer_names = embeddings_layer_names
+    self.embeddings_metadata = embeddings_metadata
+    self.embeddings_data = embeddings_data
+
+  def _init_writer(self):
+    """Sets file writer."""
+    if context.executing_eagerly():
+      self.writer = summary_ops_v2.create_file_writer(self.log_dir)
+    elif self.write_graph:
+      self.writer = tf_summary.FileWriter(self.log_dir, K.get_session().graph)
+    else:
+      self.writer = tf_summary.FileWriter(self.log_dir)
 
-    self.model = model
-    self.sess = K.get_session()
+  def _make_histogram_ops(self, model):
+    """Defines histogram ops when histogram_freq > 0."""
+    # only make histogram summary op if it hasn't already been made
     if self.histogram_freq and self.merged is None:
       for layer in self.model.layers:
         for weight in layer.weights:
@@ -772,35 +813,162 @@ class TensorBoard(Callback):
             def is_indexed_slices(grad):
               return type(grad).__name__ == 'IndexedSlices'
 
-            grads = [grad.values if is_indexed_slices(grad) else grad
-                     for grad in grads]
+            grads = [
+                grad.values if is_indexed_slices(grad) else grad
+                for grad in grads
+            ]
             tf_summary.histogram('{}_grad'.format(mapped_weight_name), grads)
 
         if hasattr(layer, 'output'):
-          tf_summary.histogram('{}_out'.format(layer.name), layer.output)
-    self.merged = tf_summary.merge_all()
+          if isinstance(layer.output, list):
+            for i, output in enumerate(layer.output):
+              tf_summary.histogram('{}_out_{}'.format(layer.name, i), output)
+          else:
+            tf_summary.histogram('{}_out'.format(layer.name), layer.output)
 
-    if self.write_graph:
-      self.writer = self._writer_class(self.log_dir, self.sess.graph)
-    else:
-      self.writer = self._writer_class(self.log_dir)
+  def set_model(self, model):
+    """Sets Keras model and creates summary ops."""
+
+    self.model = model
+    self._init_writer()
+    # histogram summaries only enabled in graph mode
+    if not context.executing_eagerly():
+      self._make_histogram_ops(model)
+      self.merged = tf_summary.merge_all()
+
+    # If both embedding_freq and embeddings_data are available, we will
+    # visualize embeddings.
+    if self.embeddings_freq and self.embeddings_data is not None:
+      self.embeddings_data = standardize_input_data(self.embeddings_data,
+                                                    model.input_names)
+
+      # If embedding_layer_names are not provided, get all of the embedding
+      # layers from the model.
+      embeddings_layer_names = self.embeddings_layer_names
+      if not embeddings_layer_names:
+        embeddings_layer_names = [
+            layer.name
+            for layer in self.model.layers
+            if type(layer).__name__ == 'Embedding'
+        ]
+
+      self.assign_embeddings = []
+      embeddings_vars = {}
+
+      self.batch_id = batch_id = array_ops.placeholder(dtypes.int32)
+      self.step = step = array_ops.placeholder(dtypes.int32)
+
+      for layer in self.model.layers:
+        if layer.name in embeddings_layer_names:
+          embedding_input = self.model.get_layer(layer.name).output
+          embedding_size = np.prod(embedding_input.shape[1:])
+          embedding_input = array_ops.reshape(embedding_input,
+                                              (step, int(embedding_size)))
+          shape = (self.embeddings_data[0].shape[0], int(embedding_size))
+          embedding = variables.Variable(
+              array_ops.zeros(shape), name=layer.name + '_embedding')
+          embeddings_vars[layer.name] = embedding
+          batch = state_ops.assign(embedding[batch_id:batch_id + step],
+                                   embedding_input)
+          self.assign_embeddings.append(batch)
+
+      self.saver = saver.Saver(list(embeddings_vars.values()))
+
+      # Create embeddings_metadata dictionary
+      if isinstance(self.embeddings_metadata, str):
+        embeddings_metadata = {
+            layer_name: self.embeddings_metadata
+            for layer_name in embeddings_vars.keys()
+        }
+      else:
+        # If embedding_metadata is already a dictionary
+        embeddings_metadata = self.embeddings_metadata
+
+      try:
+        from tensorboard.plugins import projector
+      except ImportError:
+        raise ImportError('Failed to import TensorBoard. Please make sure that '
+                          'TensorBoard integration is complete."')
+
+      # TODO(psv): Add integration tests to test embedding visualization
+      # with TensorBoard callback. We are unable to write a unit test for this
+      # because TensorBoard dependency assumes TensorFlow package is installed.
+      config = projector.ProjectorConfig()
+      for layer_name, tensor in embeddings_vars.items():
+        embedding = config.embeddings.add()
+        embedding.tensor_name = tensor.name
+
+        if (embeddings_metadata is not None and
+            layer_name in embeddings_metadata):
+          embedding.metadata_path = embeddings_metadata[layer_name]
+
+      projector.visualize_embeddings(self.writer, config)
 
   def _fetch_callback(self, summary):
     self.writer.add_summary(
-        summary, self._epoch + self._current_batch / self._batches_per_epoch)
-    self._current_batch += 1
+        summary,
+        self._epoch + self._current_val_batch / self._validation_batches)
+    self._current_val_batch += 1
+
+  def _write_custom_summaries(self, step, logs=None):
+    """Writes metrics out as custom scalar summaries.
+
+    Arguments:
+        step: the global step to use for Tensorboard.
+        logs: dict. Keys are scalar summary names, values are
+            NumPy scalars.
+
+    """
+    logs = logs or {}
+    if context.executing_eagerly():
+      # use v2 summary ops
+      with self.writer.as_default(), summary_ops_v2.always_record_summaries():
+        for name, value in logs.items():
+          summary_ops_v2.scalar(name, value.item(), step=step)
+    else:
+      # use FileWriter from v1 summary
+      for name, value in logs.items():
+        summary = tf_summary.Summary()
+        summary_value = summary.value.add()
+        summary_value.simple_value = value.item()
+        summary_value.tag = name
+        self.writer.add_summary(summary, step)
+    self.writer.flush()
+
+  def on_train_begin(self, logs=None):
+    """Checks if histogram summaries can be run."""
+    # will never be set when in eager
+    if self.histogram_freq:
+      if 'validation_steps' in self.params:
+        self._validation_batches = self.params['validation_steps']
+      elif self.validation_data:
+        self._validation_batches = math.ceil(
+            self.validation_data[0].shape[0] / self.batch_size)
+      else:
+        raise ValueError('If printing histograms, validation data must be '
+                         'provided.')
+      if self._validation_batches == 0:
+        raise ValueError(
+            'If printing histograms, validation data must have length > 0.')
+
+  def on_batch_end(self, batch, logs=None):
+    """Writes scalar summaries for metrics on every training batch."""
+    # Don't output batch_size and batch number as Tensorboard summaries
+    logs = logs or {}
+    batch_logs = {('batch_' + k): v
+                  for k, v in logs.items()
+                  if k not in ['batch', 'size']}
+    self._write_custom_summaries(self._total_batches_seen, batch_logs)
+    self._total_batches_seen += 1
 
   def on_epoch_begin(self, epoch, logs=None):
     """Add histogram op to Model test_function callbacks, reset batch count."""
 
-    if not self.validation_data and self.histogram_freq:
-      raise ValueError('If printing histograms, validation_data must be '
-                       'provided, and cannot be a generator.')
+    # check if histogram summary should be run for this epoch
     if self.histogram_freq and epoch % self.histogram_freq == 0:
       self._epoch = epoch
-      self._current_batch = 0
-      self._batches_per_epoch = math.ceil(
-          self.validation_data[0].shape[0] / self.batch_size)
+      self._current_val_batch = 0
+      # add the histogram summary op if it should run this epoch
       if self.merged not in self.model.test_function.fetches:
         self.model.test_function.fetches.append(self.merged)
         self.model.test_function.fetch_callbacks[
@@ -809,23 +977,59 @@ class TensorBoard(Callback):
   def on_epoch_end(self, epoch, logs=None):
     """Checks if summary ops should run next epoch, logs scalar summaries."""
 
-    logs = logs or {}
+    # don't output batch_size and
+    # batch number as Tensorboard summaries
+    logs = {('epoch_' + k): v
+            for k, v in logs.items()
+            if k not in ['batch', 'size']}
+    self._write_custom_summaries(epoch, logs)
 
-    if self.histogram_freq and self.histogram_freq > 1:
+    # pop the histogram summary op after each epoch
+    if self.histogram_freq:
       if self.merged in self.model.test_function.fetches:
         self.model.test_function.fetches.remove(self.merged)
       if self.merged in self.model.test_function.fetch_callbacks:
         self.model.test_function.fetch_callbacks.pop(self.merged)
 
-    for name, value in logs.items():
-      if name in ['batch', 'size']:
-        continue
-      summary = tf_summary.Summary()
-      summary_value = summary.value.add()
-      summary_value.simple_value = value.item()
-      summary_value.tag = name
-      self.writer.add_summary(summary, epoch)
-    self.writer.flush()
+    if self.embeddings_data is None and self.embeddings_freq:
+      raise ValueError('To visualize embeddings, embeddings_data must '
+                       'be provided.')
+
+    if self.embeddings_freq and self.embeddings_data is not None:
+      if epoch % self.embeddings_freq == 0:
+        # We need a second forward-pass here because we're passing
+        # the `embeddings_data` explicitly. This design allows to pass
+        # arbitrary data as `embeddings_data` and results from the fact
+        # that we need to know the size of the `tf.Variable`s which
+        # hold the embeddings in `set_model`. At this point, however,
+        # the `validation_data` is not yet set.
+
+        embeddings_data = self.embeddings_data
+        n_samples = embeddings_data[0].shape[0]
+        i = 0
+        while i < n_samples:
+          step = min(self.batch_size, n_samples - i)
+          batch = slice(i, i + step)
+
+          if isinstance(self.model.input, list):
+            feed_dict = {
+                model_input: embeddings_data[idx][batch]
+                for idx, model_input in enumerate(self.model.input)
+            }
+          else:
+            feed_dict = {self.model.input: embeddings_data[0][batch]}
+
+          feed_dict.update({self.batch_id: i, self.step: step})
+
+          if self.model.uses_learning_phase:
+            feed_dict[K.learning_phase()] = False
+
+          self.sess.run(self.assign_embeddings, feed_dict=feed_dict)
+          self.saver.save(self.sess,
+                          os.path.join(self.log_dir, 'keras_embedding.ckpt'),
+                          epoch)
+
+          i += self.batch_size
 
   def on_train_end(self, logs=None):
     self.writer.close()
diff --git a/tensorflow/python/keras/callbacks_test.py b/tensorflow/python/keras/callbacks_test.py
index d56f2f5bfc..d38a753263 100644
--- a/tensorflow/python/keras/callbacks_test.py
+++ b/tensorflow/python/keras/callbacks_test.py
@@ -22,6 +22,7 @@ import csv
 import os
 import re
 import shutil
+import tempfile
 import threading
 import unittest
 
@@ -29,10 +30,12 @@ import numpy as np
 
 from tensorflow.core.framework import summary_pb2
 from tensorflow.python import keras
+from tensorflow.python.framework import test_util
 from tensorflow.python.keras import testing_utils
 from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import adam
 
 try:
   import h5py  # pylint:disable=g-import-not-at-top
@@ -63,7 +66,7 @@ class KerasCallbacksTest(test.TestCase):
       np.random.seed(1337)
 
       temp_dir = self.get_temp_dir()
-      self.addCleanup(shutil.rmtree, temp_dir)
+      self.addCleanup(shutil.rmtree, temp_dir, ignore_errors=True)
 
       filepath = os.path.join(temp_dir, 'checkpoint.h5')
       (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
@@ -479,7 +482,7 @@ class KerasCallbacksTest(test.TestCase):
     with self.test_session():
       np.random.seed(1337)
       temp_dir = self.get_temp_dir()
-      self.addCleanup(shutil.rmtree, temp_dir)
+      self.addCleanup(shutil.rmtree, temp_dir, ignore_errors=True)
       filepath = os.path.join(temp_dir, 'log.tsv')
 
       sep = '\t'
@@ -557,7 +560,7 @@ class KerasCallbacksTest(test.TestCase):
     # does not result in invalid CSVs.
     np.random.seed(1337)
     tmpdir = self.get_temp_dir()
-    self.addCleanup(shutil.rmtree, tmpdir)
+    self.addCleanup(shutil.rmtree, tmpdir, ignore_errors=True)
 
     with self.test_session():
       fp = os.path.join(tmpdir, 'test.csv')
@@ -649,7 +652,7 @@ class KerasCallbacksTest(test.TestCase):
     np.random.seed(1337)
 
     temp_dir = self.get_temp_dir()
-    self.addCleanup(shutil.rmtree, temp_dir)
+    self.addCleanup(shutil.rmtree, temp_dir, ignore_errors=True)
 
     (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
         train_samples=TRAIN_SAMPLES,
@@ -747,7 +750,7 @@ class KerasCallbacksTest(test.TestCase):
   def test_TensorBoard_histogram_freq_must_have_validation_data(self):
     np.random.seed(1337)
     tmpdir = self.get_temp_dir()
-    self.addCleanup(shutil.rmtree, tmpdir)
+    self.addCleanup(shutil.rmtree, tmpdir, ignore_errors=True)
 
     with self.test_session():
       filepath = os.path.join(tmpdir, 'logs')
@@ -813,28 +816,13 @@ class KerasCallbacksTest(test.TestCase):
       for cb in cbs:
         cb.on_train_end()
 
-      # fit generator with validation data generator should raise ValueError if
-      # histogram_freq > 0
-      cbs = callbacks_factory(histogram_freq=1)
-      with self.assertRaises(ValueError):
-        model.fit_generator(
-            data_generator(True),
-            len(x_train),
-            epochs=2,
-            validation_data=data_generator(False),
-            validation_steps=1,
-            callbacks=cbs)
-
-      for cb in cbs:
-        cb.on_train_end()
-
       # Make sure file writer cache is clear to avoid failures during cleanup.
       writer_cache.FileWriterCache.clear()
 
   def test_TensorBoard_multi_input_output(self):
     np.random.seed(1337)
     tmpdir = self.get_temp_dir()
-    self.addCleanup(shutil.rmtree, tmpdir)
+    self.addCleanup(shutil.rmtree, tmpdir, ignore_errors=True)
 
     with self.test_session():
       filepath = os.path.join(tmpdir, 'logs')
@@ -932,9 +920,12 @@ class KerasCallbacksTest(test.TestCase):
       def close(self):
         pass
 
+    def _init_writer(obj):
+      obj.writer = FileWriterStub(obj.log_dir)
+
     np.random.seed(1337)
     tmpdir = self.get_temp_dir()
-    self.addCleanup(shutil.rmtree, tmpdir)
+    self.addCleanup(shutil.rmtree, tmpdir, ignore_errors=True)
     (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
         train_samples=TRAIN_SAMPLES,
         test_samples=TEST_SAMPLES,
@@ -955,13 +946,13 @@ class KerasCallbacksTest(test.TestCase):
           loss='categorical_crossentropy',
           optimizer='sgd',
           metrics=['accuracy'])
+      keras.callbacks.TensorBoard._init_writer = _init_writer
       tsb = keras.callbacks.TensorBoard(
           log_dir=tmpdir,
           histogram_freq=1,
           write_images=True,
           write_grads=True,
           batch_size=5)
-      tsb._writer_class = FileWriterStub
       cbks = [tsb]
 
       # fit with validation data
@@ -976,6 +967,56 @@ class KerasCallbacksTest(test.TestCase):
 
       self.assertAllEqual(tsb.writer.steps_seen, [0, 0.5, 1, 1.5, 2, 2.5])
 
+  def test_Tensorboard_histogram_summaries_with_generator(self):
+    np.random.seed(1337)
+    tmpdir = self.get_temp_dir()
+    self.addCleanup(shutil.rmtree, tmpdir, ignore_errors=True)
+
+    def generator():
+      x = np.random.randn(10, 100).astype(np.float32)
+      y = np.random.randn(10, 10).astype(np.float32)
+      while True:
+        yield x, y
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(10, input_dim=100, activation='relu'))
+      model.add(keras.layers.Dense(10, activation='softmax'))
+      model.compile(
+          loss='categorical_crossentropy',
+          optimizer='sgd',
+          metrics=['accuracy'])
+      tsb = keras.callbacks.TensorBoard(
+          log_dir=tmpdir,
+          histogram_freq=1,
+          write_images=True,
+          write_grads=True,
+          batch_size=5)
+      cbks = [tsb]
+
+      # fit with validation generator
+      model.fit_generator(
+          generator(),
+          steps_per_epoch=2,
+          epochs=2,
+          validation_data=generator(),
+          validation_steps=2,
+          callbacks=cbks,
+          verbose=0)
+
+      with self.assertRaises(ValueError):
+        # fit with validation generator but no
+        # validation_steps
+        model.fit_generator(
+            generator(),
+            steps_per_epoch=2,
+            epochs=2,
+            validation_data=generator(),
+            callbacks=cbks,
+            verbose=0)
+
+      self.assertTrue(os.path.exists(tmpdir))
+
   @unittest.skipIf(
       os.name == 'nt',
       'use_multiprocessing=True does not work on windows properly.')
@@ -1026,7 +1067,7 @@ class KerasCallbacksTest(test.TestCase):
   def test_TensorBoard_with_ReduceLROnPlateau(self):
     with self.test_session():
       temp_dir = self.get_temp_dir()
-      self.addCleanup(shutil.rmtree, temp_dir)
+      self.addCleanup(shutil.rmtree, temp_dir, ignore_errors=True)
 
       (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
           train_samples=TRAIN_SAMPLES,
@@ -1061,6 +1102,112 @@ class KerasCallbacksTest(test.TestCase):
 
       assert os.path.exists(temp_dir)
 
+  def test_Tensorboard_batch_logging(self):
+
+    class FileWriterStub(object):
+
+      def __init__(self, logdir, graph=None):
+        self.logdir = logdir
+        self.graph = graph
+        self.batches_logged = []
+        self.summary_values = []
+        self.summary_tags = []
+
+      def add_summary(self, summary, step):
+        self.summary_values.append(summary.value[0].simple_value)
+        self.summary_tags.append(summary.value[0].tag)
+        self.batches_logged.append(step)
+
+      def flush(self):
+        pass
+
+      def close(self):
+        pass
+
+    temp_dir = self.get_temp_dir()
+    self.addCleanup(shutil.rmtree, temp_dir, ignore_errors=True)
+
+    tb_cbk = keras.callbacks.TensorBoard(temp_dir)
+    tb_cbk.writer = FileWriterStub(temp_dir)
+
+    for batch in range(5):
+      tb_cbk.on_batch_end(batch, {'acc': np.float32(batch)})
+    self.assertEqual(tb_cbk.writer.batches_logged, [0, 1, 2, 3, 4])
+    self.assertEqual(tb_cbk.writer.summary_values, [0., 1., 2., 3., 4.])
+    self.assertEqual(tb_cbk.writer.summary_tags, ['batch_acc'] * 5)
+
+  def test_Tensorboard_epoch_and_batch_logging(self):
+
+    class FileWriterStub(object):
+
+      def __init__(self, logdir, graph=None):
+        self.logdir = logdir
+        self.graph = graph
+
+      def add_summary(self, summary, step):
+        if 'batch_' in summary.value[0].tag:
+          self.batch_summary = (step, summary)
+        elif 'epoch_' in summary.value[0].tag:
+          self.epoch_summary = (step, summary)
+
+      def flush(self):
+        pass
+
+      def close(self):
+        pass
+
+    temp_dir = self.get_temp_dir()
+    self.addCleanup(shutil.rmtree, temp_dir, ignore_errors=True)
+
+    tb_cbk = keras.callbacks.TensorBoard(temp_dir)
+    tb_cbk.writer = FileWriterStub(temp_dir)
+
+    tb_cbk.on_batch_end(0, {'acc': np.float32(5.0)})
+    tb_cbk.on_epoch_end(0, {'acc': np.float32(10.0)})
+    batch_step, batch_summary = tb_cbk.writer.batch_summary
+    self.assertEqual(batch_step, 0)
+    self.assertEqual(batch_summary.value[0].simple_value, 5.0)
+    epoch_step, epoch_summary = tb_cbk.writer.epoch_summary
+    self.assertEqual(epoch_step, 0)
+    self.assertEqual(epoch_summary.value[0].simple_value, 10.0)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_Tensorboard_eager(self):
+    with self.test_session():
+      temp_dir = tempfile.mkdtemp(dir=self.get_temp_dir())
+      self.addCleanup(shutil.rmtree, temp_dir, ignore_errors=True)
+
+      (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
+          train_samples=TRAIN_SAMPLES,
+          test_samples=TEST_SAMPLES,
+          input_shape=(INPUT_DIM,),
+          num_classes=NUM_CLASSES)
+      y_test = keras.utils.to_categorical(y_test)
+      y_train = keras.utils.to_categorical(y_train)
+
+      model = keras.models.Sequential()
+      model.add(
+          keras.layers.Dense(
+              NUM_HIDDEN, input_dim=INPUT_DIM, activation='relu'))
+      model.add(keras.layers.Dense(NUM_CLASSES, activation='softmax'))
+      model.compile(
+          loss='binary_crossentropy',
+          optimizer=adam.AdamOptimizer(0.01),
+          metrics=['accuracy'])
+
+      cbks = [keras.callbacks.TensorBoard(log_dir=temp_dir)]
+
+      model.fit(
+          x_train,
+          y_train,
+          batch_size=BATCH_SIZE,
+          validation_data=(x_test, y_test),
+          callbacks=cbks,
+          epochs=2,
+          verbose=0)
+
+      self.assertTrue(os.path.exists(temp_dir))
+
   def test_RemoteMonitorWithJsonPayload(self):
     if requests is None:
       self.skipTest('`requests` required to run this test')
diff --git a/tensorflow/python/keras/engine/base_layer.py b/tensorflow/python/keras/engine/base_layer.py
index 361778570b..c35b878cc9 100644
--- a/tensorflow/python/keras/engine/base_layer.py
+++ b/tensorflow/python/keras/engine/base_layer.py
@@ -26,6 +26,7 @@ import numpy as np
 from six.moves import zip  # pylint: disable=redefined-builtin
 
 from tensorflow.python.eager import context
+from tensorflow.python.eager import function as eager_function
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
@@ -174,6 +175,12 @@ class Layer(checkpointable.CheckpointableBase):
 
     self.supports_masking = False
 
+    call_argspec = tf_inspect.getfullargspec(self.call)
+    if 'training' in call_argspec.args:
+      self._expects_training_arg = True
+    else:
+      self._expects_training_arg = False
+
     # Manage input shape information if passed.
     if 'input_shape' in kwargs or 'batch_input_shape' in kwargs:
       # In this case we will later create an input layer
@@ -460,14 +467,18 @@ class Layer(checkpointable.CheckpointableBase):
     """Alias for `add_weight`."""
     return self.add_weight(*args, **kwargs)
 
-  def add_weight(self, name, shape,
+  def add_weight(self,
+                 name,
+                 shape,
                  dtype=None,
                  initializer=None,
                  regularizer=None,
-                 trainable=True,
+                 trainable=None,
                  constraint=None,
                  partitioner=None,
                  use_resource=None,
+                 synchronization=vs.VariableSynchronization.AUTO,
+                 aggregation=vs.VariableAggregation.NONE,
                  getter=None):
     """Adds a new variable to the layer, or gets an existing one; returns it.
 
@@ -482,10 +493,20 @@ class Layer(checkpointable.CheckpointableBase):
         or "non_trainable_variables" (e.g. BatchNorm mean, stddev).
         Note, if the current variable scope is marked as non-trainable
         then this parameter is ignored and any added variables are also
-        marked as non-trainable.
+        marked as non-trainable. `trainable` defaults to `True` unless
+        `synchronization` is set to `ON_READ`.
       constraint: constraint instance (callable).
       partitioner: Partitioner to be passed to the `Checkpointable` API.
       use_resource: Whether to use `ResourceVariable`.
+      synchronization: Indicates when a distributed a variable will be
+        aggregated. Accepted values are constants defined in the class
+        @{tf.VariableSynchronization}. By default the synchronization is set to
+        `AUTO` and the current `DistributionStrategy` chooses
+        when to synchronize. If `synchronization` is set to `ON_READ`,
+        `trainable` must not be set to `True`.
+      aggregation: Indicates how a distributed variable will be aggregated.
+        Accepted values are constants defined in the class
+        @{tf.VariableAggregation}.
       getter: Variable getter argument to be passed to the `Checkpointable` API.
 
     Returns:
@@ -496,7 +517,8 @@ class Layer(checkpointable.CheckpointableBase):
     Raises:
       RuntimeError: If called with partioned variable regularization and
         eager execution is enabled.
-      ValueError: When giving unsupported dtype and no initializer.
+      ValueError: When giving unsupported dtype and no initializer or when
+        trainable has been set to True with synchronization set as `ON_READ`.
     """
     if dtype is None:
       dtype = self.dtype or backend.floatx()
@@ -505,6 +527,19 @@ class Layer(checkpointable.CheckpointableBase):
     regularizer = regularizers.get(regularizer)
     constraint = constraints.get(constraint)
 
+    if synchronization == vs.VariableSynchronization.ON_READ:
+      if trainable:
+        raise ValueError(
+            'Synchronization value can be set to '
+            'VariableSynchronization.ON_READ only for non-trainable variables. '
+            'You have specified trainable=True and '
+            'synchronization=VariableSynchronization.ON_READ.')
+      else:
+        # Set trainable to be false when variable is to be synced on read.
+        trainable = False
+    elif trainable is None:
+      trainable = True
+
     # Initialize variable when no initializer provided
     if initializer is None:
       # If dtype is DT_FLOAT, provide a uniform unit scaling initializer
@@ -532,7 +567,9 @@ class Layer(checkpointable.CheckpointableBase):
         constraint=constraint,
         trainable=trainable and self.trainable,
         partitioner=partitioner,
-        use_resource=use_resource)
+        use_resource=use_resource,
+        synchronization=synchronization,
+        aggregation=aggregation)
 
     if regularizer is not None:
       # TODO(fchollet): in the future, this should be handled at the
@@ -655,8 +692,8 @@ class Layer(checkpointable.CheckpointableBase):
 
     # Handle Keras mask propagation from previous layer to current layer.
     previous_mask = None
-    if (not hasattr(self, '_compute_previous_mask') or
-        self._compute_previous_mask):
+    if build_graph and (not hasattr(self, '_compute_previous_mask') or
+                        self._compute_previous_mask):
       previous_mask = collect_previous_mask(inputs)
       if not hasattr(self, '_call_fn_args'):
         self._call_fn_args = self._no_dependency(
@@ -693,9 +730,20 @@ class Layer(checkpointable.CheckpointableBase):
             self._dtype = input_list[0].dtype.base_dtype.name
           except AttributeError:
             pass
-        if all(hasattr(x, 'get_shape') for x in input_list):
-          input_shapes = nest.map_structure(lambda x: x.get_shape(), inputs)
-        self.build(input_shapes)
+
+        if all(hasattr(x, 'shape') for x in input_list):
+          input_shapes = nest.map_structure(lambda x: x.shape, inputs)
+
+        if (not hasattr(self, '_is_graph_network') or
+            self.__class__.__name__ == 'Sequential' or
+            not hasattr(self.build, '_is_default')):
+          # Only if self is a layer, an instance of a sequential model, or
+          # the user has manually overwritten the build method do we need to
+          # build it.
+          self.build(input_shapes)
+        # We must set self.built since user defined build functions are not
+        # constrained to set self.built.
+        self.built = True
 
       # Check input assumptions set after layer building, e.g. input shape.
       if build_graph or in_deferred_mode:
@@ -711,7 +759,7 @@ class Layer(checkpointable.CheckpointableBase):
         # Deferred mode behavior: use `compute_output_shape` to
         # infer the number of outputs of the layer and their shapes.
         if input_shapes is None:
-          input_shapes = nest.map_structure(lambda x: x.get_shape(), inputs)
+          input_shapes = nest.map_structure(lambda x: x.shape, inputs)
 
         output_shapes = self.compute_output_shape(input_shapes)
         output_shapes = nest.flatten(output_shapes)
@@ -731,8 +779,6 @@ class Layer(checkpointable.CheckpointableBase):
       if in_deferred_mode or build_graph and have_all_keras_metadata(inputs):
         inputs, outputs = self._set_connectivity_metadata_(
             inputs, outputs, args, kwargs)
-
-      self.built = True
       if context.executing_eagerly():
         return outputs
 
@@ -749,17 +795,8 @@ class Layer(checkpointable.CheckpointableBase):
     if hasattr(self, '_initial_weights') and self._initial_weights is not None:
       self.set_weights(self._initial_weights)
       del self._initial_weights
-    self._post_build_cleanup()
     return outputs
 
-  def _post_build_cleanup(self):
-    """Hooks to run after all sub-Layers are built."""
-    # Note that in addition to Layer.__call__, this method is called by Model
-    # after building a graph network (which skips __call__). It should be called
-    # when possible if self.built may have switched from False to True, and is
-    # idempotent.
-    pass  # No-op for Layers which don't override this method.
-
   def apply(self, inputs, *args, **kwargs):
     """Apply the layer on a input.
 
@@ -869,7 +906,7 @@ class Layer(checkpointable.CheckpointableBase):
       assert len(call_args) == 1  # TypeError raised earlier in __call__.
       return call_args[0], call_kwargs
     else:
-      call_arg_spec = tf_inspect.getargspec(self.call)
+      call_arg_spec = tf_inspect.getfullargspec(self.call)
       # There is no explicit "inputs" argument expected or provided to
       # call(). Arguments which have default values are considered non-inputs,
       # and arguments without are considered inputs.
@@ -889,8 +926,8 @@ class Layer(checkpointable.CheckpointableBase):
       _, unwrapped_call = tf_decorator.unwrap(self.call)
       bound_args = inspect.getcallargs(
           unwrapped_call, *call_args, **call_kwargs)
-      if call_arg_spec.keywords is not None:
-        var_kwargs = bound_args.pop(call_arg_spec.keywords)
+      if call_arg_spec.varkw is not None:
+        var_kwargs = bound_args.pop(call_arg_spec.varkw)
         bound_args.update(var_kwargs)
         keyword_arg_names = keyword_arg_names.union(var_kwargs.keys())
       all_args = call_arg_spec.args
@@ -933,6 +970,39 @@ class Layer(checkpointable.CheckpointableBase):
     Returns:
         An input shape tuple.
     """
+    if context.executing_eagerly():
+      # In this case we build the model first in order to do shape inference.
+      # This is acceptable because the framework only calls
+      # `compute_output_shape` on shape values that the layer would later be
+      # built for. It would however cause issues in case a user attempts to
+      # use `compute_output_shape` manually (these users will have to
+      # implement `compute_output_shape` themselves).
+      self.build(input_shape)
+
+      with context.graph_mode():
+        graph = eager_function.CapturingGraph()
+        with graph.as_default():
+          if isinstance(input_shape, list):
+            inputs = [generate_placeholders_from_shape(shape)
+                      for shape in input_shape]
+          else:
+            inputs = generate_placeholders_from_shape(input_shape)
+
+          try:
+            if self._expects_training_arg:
+              outputs = self(inputs, training=False)
+            else:
+              outputs = self(inputs)
+          except TypeError:
+            raise NotImplementedError('We could not automatically infer '
+                                      'the static shape of the layer\'s output.'
+                                      ' Please implement the '
+                                      '`compute_output_shape` method on your '
+                                      'layer (%s).' % self.__class__.__name__)
+      if isinstance(outputs, list):
+        return [output.shape for output in outputs]
+      else:
+        return outputs.shape
     raise NotImplementedError
 
   def compute_mask(self, inputs, mask=None):  # pylint: disable=unused-argument
@@ -1295,7 +1365,7 @@ class Layer(checkpointable.CheckpointableBase):
                          ', but the layer isn\'t built. '
                          'You can build it manually via: `' + self.name +
                          '.build(batch_input_shape)`.')
-    weight_shapes = [w.get_shape().as_list() for w in self.weights]
+    weight_shapes = [w.shape.as_list() for w in self.weights]
     return int(sum([np.prod(w) for w in weight_shapes]))
 
   @property
@@ -1378,7 +1448,7 @@ class Layer(checkpointable.CheckpointableBase):
       if (spec.ndim is not None or
           spec.min_ndim is not None or
           spec.max_ndim is not None):
-        if x.get_shape().ndims is None:
+        if x.shape.ndims is None:
           raise ValueError('Input ' + str(input_index) + ' of layer ' +
                            self.name + ' is incompatible with the layer: '
                            'its rank is undefined, but the layer requires a '
@@ -1386,29 +1456,29 @@ class Layer(checkpointable.CheckpointableBase):
 
       # Check ndim.
       if spec.ndim is not None:
-        ndim = x.get_shape().ndims
+        ndim = x.shape.ndims
         if ndim != spec.ndim:
           raise ValueError('Input ' + str(input_index) + ' of layer ' +
                            self.name + ' is incompatible with the layer: '
                            'expected ndim=' + str(spec.ndim) + ', found ndim=' +
                            str(ndim) + '. Full shape received: ' +
-                           str(x.get_shape().as_list()))
+                           str(x.shape.as_list()))
       if spec.max_ndim is not None:
-        ndim = x.get_shape().ndims
+        ndim = x.shape.ndims
         if ndim is not None and ndim > spec.max_ndim:
           raise ValueError('Input ' + str(input_index) + ' of layer ' +
                            self.name + ' is incompatible with the layer: '
                            'expected max_ndim=' + str(spec.max_ndim) +
                            ', found ndim=' + str(ndim))
       if spec.min_ndim is not None:
-        ndim = x.get_shape().ndims
+        ndim = x.shape.ndims
         if ndim is not None and ndim < spec.min_ndim:
           raise ValueError('Input ' + str(input_index) + ' of layer ' +
                            self.name + ' is incompatible with the layer: '
                            ': expected min_ndim=' + str(spec.min_ndim) +
                            ', found ndim=' + str(ndim) +
                            '. Full shape received: ' +
-                           str(x.get_shape().as_list()))
+                           str(x.shape.as_list()))
       # Check dtype.
       if spec.dtype is not None:
         if x.dtype != spec.dtype:
@@ -1418,7 +1488,7 @@ class Layer(checkpointable.CheckpointableBase):
                            ', found dtype=' + str(x.dtype))
       # Check specific shape axes.
       if spec.axes:
-        shape = x.get_shape().as_list()
+        shape = x.shape.as_list()
         if shape is not None:
           for axis, value in spec.axes.items():
             if hasattr(value, 'value'):
@@ -1431,7 +1501,7 @@ class Layer(checkpointable.CheckpointableBase):
                   ' but received input with shape ' + str(shape))
       # Check shape.
       if spec.shape is not None:
-        shape = x.get_shape().as_list()
+        shape = x.shape.as_list()
         if shape is not None:
           for spec_dim, dim in zip(spec.shape, shape):
             if spec_dim is not None and dim is not None:
@@ -1706,12 +1776,12 @@ class DeferredTensor(object):
 
   def __str__(self):
     return "DeferredTensor('%s', shape=%s, dtype=%s)" % (self.name,
-                                                         self.get_shape(),
+                                                         self.shape,
                                                          self.dtype.name)
 
   def __repr__(self):
     return "<DeferredTensor '%s' shape=%s dtype=%s>" % (self.name,
-                                                        self.get_shape(),
+                                                        self.shape,
                                                         self.dtype.name)
 
 
@@ -1806,11 +1876,13 @@ def make_variable(name,
                   dtype=dtypes.float32,
                   initializer=None,
                   partition_info=None,
-                  trainable=True,
+                  trainable=None,
                   caching_device=None,
                   validate_shape=True,
                   constraint=None,
                   use_resource=None,
+                  synchronization=vs.VariableSynchronization.AUTO,
+                  aggregation=vs.VariableAggregation.NONE,
                   partitioner=None):  # pylint: disable=unused-argument
   """Temporary util to create a variable (relies on `variable_scope.variable`).
 
@@ -1836,11 +1908,21 @@ def make_variable(name,
       or "non_trainable_variables" (e.g. BatchNorm mean, stddev).
       Note, if the current variable scope is marked as non-trainable
       then this parameter is ignored and any added variables are also
-      marked as non-trainable.
+      marked as non-trainable. `trainable` defaults to `True` unless
+      `synchronization` is set to `ON_READ`.
     caching_device: Passed to `vs.variable`.
     validate_shape: Passed to `vs.variable`.
     constraint: Constraint instance (callable).
     use_resource: Whether to use a `ResourceVariable`.
+    synchronization: Indicates when a distributed a variable will be
+      aggregated. Accepted values are constants defined in the class
+      @{tf.VariableSynchronization}. By default the synchronization is set to
+      `AUTO` and the current `DistributionStrategy` chooses
+      when to synchronize. If `synchronization` is set to `ON_READ`,
+      `trainable` must not be set to `True`.
+    aggregation: Indicates how a distributed variable will be aggregated.
+      Accepted values are constants defined in the class
+      @{tf.VariableAggregation}.
     partitioner: Not handled at this time.
 
   Returns:
@@ -1872,5 +1954,17 @@ def make_variable(name,
       dtype=variable_dtype,
       validate_shape=validate_shape,
       constraint=constraint,
-      use_resource=use_resource)
+      use_resource=use_resource,
+      synchronization=synchronization,
+      aggregation=aggregation)
   return v
+
+
+def default(method):
+  """Decorates a method to detect overrides in subclasses."""
+  method._is_default = True
+  return method
+
+
+def generate_placeholders_from_shape(shape):
+  return array_ops.placeholder(shape=shape, dtype=backend.floatx())
diff --git a/tensorflow/python/keras/engine/distributed_training_utils.py b/tensorflow/python/keras/engine/distributed_training_utils.py
new file mode 100644
index 0000000000..c78e6fe9ec
--- /dev/null
+++ b/tensorflow/python/keras/engine/distributed_training_utils.py
@@ -0,0 +1,249 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utilities related to distributed training."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.keras import backend
+from tensorflow.python.keras import callbacks
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import distribute as distribute_lib
+from tensorflow.python.util import nest
+
+
+def set_weights(distribution_strategy, dist_model, weights):
+  """Sets the weights of the replicated models.
+
+  The weights of the replicated models are set to the weights of the original
+  model. The weights of the replicated model are Mirrored variables and hence
+  we need to use the `update` call within a DistributionStrategy scope.
+
+  Args:
+    distribution_strategy: DistributionStrategy used to distribute training
+        and validation.
+    dist_model: The replicated models on the different devices.
+    weights: The weights of the original model.
+  """
+  assign_ops = []
+  for layer in dist_model.layers:
+    num_param = len(layer.weights)
+    layer_weights = weights[:num_param]
+    for sw, w in zip(layer.weights, layer_weights):
+      assign_ops.append(distribution_strategy.unwrap(sw.assign(w)))
+
+    weights = weights[num_param:]
+  backend.get_session().run(assign_ops)
+
+
+def unwrap_values(distribution_strategy, grouped_inputs, grouped_outputs,
+                  grouped_updates, grouped_session_args,
+                  with_loss_tensor=False):
+  """Unwrap and return the list of values contained in the PerDevice parameters.
+
+  This function calls `flatten_perdevice_values` to parse each of the input
+  parameters into a list of values on the different devices. If we set
+  `with_loss_tensor` to be True, we also call `reduce` on the list of losses on
+  the different devices to give us one loss tensor.
+
+  Args:
+    distribution_strategy: DistributionStrategy used to distribute training and
+        validation.
+    grouped_inputs: PerDevice inputs returned from the train or test function
+        that we ran on each device.
+    grouped_outputs: PerDevice outputs returned from the train or test function
+        that we ran on each device.
+    grouped_updates: PerDevice updates returned from the train or test function
+        that we ran on each device.
+    grouped_session_args: PerDevice session args returned from the train or
+        test function that we ran on each device.
+    with_loss_tensor: Boolean that indicates if we need to add the reduced loss
+        tensor as one of the outputs.
+
+  Returns:
+    Values of each of the PerDevice parameters.
+
+  """
+  # Unwrap per device values returned from each model's train function.
+  # This will be used to construct the main train function.
+  all_inputs = flatten_perdevice_values(distribution_strategy,
+                                        grouped_inputs)
+  if with_loss_tensor:
+    # reduce loss tensor before adding it to the list of fetches
+    loss = distribution_strategy.unwrap(
+        distribution_strategy.reduce(distribute_lib.get_loss_reduction(),
+                                     grouped_outputs[0],
+                                     destinations='/device:CPU:0'))[0]
+
+    all_outputs = flatten_perdevice_values(distribution_strategy,
+                                           grouped_outputs[1:])
+    all_outputs = [loss] + all_outputs
+  else:
+    all_outputs = flatten_perdevice_values(distribution_strategy,
+                                           grouped_outputs)
+
+  all_updates = flatten_perdevice_values(distribution_strategy,
+                                         grouped_updates)
+
+  all_session_args = {}
+  grouped_feed_dict = grouped_session_args.get('feed_dict')
+  if grouped_feed_dict:
+    all_session_args['feed_dict'] = flatten_perdevice_values(
+        distribution_strategy, grouped_feed_dict)
+
+  grouped_fetches = grouped_session_args.get('fetches')
+  if grouped_fetches:
+    all_session_args['fetches'] = flatten_perdevice_values(
+        distribution_strategy, grouped_fetches)
+
+  return all_inputs, all_outputs, all_updates, all_session_args
+
+
+def flatten_perdevice_values(distribution_strategy, perdevice_values):
+  """Unwraps and flattens a nest of PerDevice parameters.
+
+  PerDevice values have one value associated with each device. Each entry in
+  the PerDevice dict has a device `key` and the corresponding value on the
+  device as the `value`. In this function we take a PerDevice value or a list of
+  PerDevice values and return all the values in the PerDevice dict.
+
+  Args:
+    distribution_strategy: DistributionStrategy used to distribute training and
+        validation.
+    perdevice_values: List of PerDevice object or a single PerDevice object.
+
+  Returns:
+    List of values of all the PerDevice objects.
+
+  """
+  # This function takes a PerDevice object or a list of PerDevice objects and
+  # returns all the values associated with it.
+  return [e for flattened in nest.flatten(perdevice_values)
+          for e in distribution_strategy.unwrap(flattened)]
+
+
+def validate_callbacks(input_callbacks):
+  """Validate whether given callbacks are supported by DistributionStrategy.
+
+  Args:
+    input_callbacks: List of callbacks passed by the user to fit.
+
+  Raises:
+    ValueError: If `LearningRateScheduler` or `ReduceLROnPlateau` is one of the
+        callbacks passed.
+    ValueError: If `histogram_freq` or `write_grads` is one of the parameters
+        passed as part of the TensorBoard callback.
+  """
+  if input_callbacks:
+    for callback in input_callbacks:
+      if callback not in [callbacks.TensorBoard, callbacks.ReduceLROnPlateau,
+                          callbacks.LearningRateScheduler, callbacks.CSVLogger,
+                          callbacks.EarlyStopping, callbacks.ModelCheckpoint,
+                          callbacks.TerminateOnNaN, callbacks.ProgbarLogger,
+                          callbacks.History, callbacks.RemoteMonitor]:
+        logging.warning('Your input callback is not one of the predefined '
+                        'Callbacks that supports DistributionStrategy. You '
+                        'might encounter an error if you access one of the '
+                        'model\'s attributes as part of the callback since '
+                        'these attributes are not set. You can access each of '
+                        'the individual distributed models using the '
+                        '`_grouped_model` attribute of your original model.')
+      if isinstance(callback, callbacks.LearningRateScheduler):
+        raise ValueError('LearningRateScheduler callback is not supported with '
+                         'DistributionStrategy.')
+      if isinstance(callback, callbacks.ReduceLROnPlateau):
+        raise ValueError('ReduceLROnPlateau callback is not supported with '
+                         'DistributionStrategy.')
+
+      # If users want to use the TensorBoard callback they cannot use certain
+      # features of the callback that involve accessing model attributes and
+      # running ops.
+      if isinstance(callback, callbacks.TensorBoard):
+        if callback.__getattribute__('histogram_freq'):
+          raise ValueError('histogram_freq in the TensorBoard callback is not '
+                           'supported when using DistributionStrategy.')
+        if callback.__getattribute__('write_grads'):
+          raise ValueError('write_grads in the TensorBoard callback is not '
+                           'supported when using DistributionStrategy.')
+
+
+def validate_distributed_dataset_inputs(distribution_strategy, x, y):
+  """Validate all the components of a DistributedValue Dataset input.
+
+  Args:
+    distribution_strategy: The current DistributionStrategy using to call
+        `fit`/`evaluate`.
+    x: Input Dataset DistributedValue object. For example, when we use
+        `MirroredStrategy` this is a PerDevice object with a tensor for each
+        device set in the dict.
+    y: Target Dataset DistributedValue object. For example, when we use
+        `MirroredStrategy` this is a PerDevice object with a tensor for each
+        device set in the dict.
+
+  Returns:
+    The unwrapped values list of the x and y DistributedValues inputs.
+
+  Raises:
+    ValueError: If x and y do not have support for being evaluated as tensors.
+        or if x and y contain elements that are not tensors or if x and y
+        contain elements that have a shape or dtype mismatch.
+  """
+  # If the input and target used to call the model are not dataset tensors,
+  # we need to raise an error. When using a DistributionStrategy, the input
+  # and targets to a model should be from a `tf.data.Dataset`.
+
+  # If each element of x and y are not tensors, we cannot standardize and
+  # validate the input and targets.`
+  if not tensor_util.is_tensor(x):
+    raise ValueError('Dataset input to the model should be tensors instead they'
+                     ' are of type {}'.format(type(x)))
+
+  if not tensor_util.is_tensor(y):
+    raise ValueError('Dataset input to the model should be tensors instead they'
+                     ' are of type {}'.format(type(y)))
+
+  # At this point both x and y contain tensors in the `DistributedValues`
+  # structure.
+  x_values = distribution_strategy.unwrap(x)
+  y_values = distribution_strategy.unwrap(y)
+
+  # Validate that the shape and dtype of all the elements in x are the same.
+  validate_all_tensor_shapes(x, x_values)
+  validate_all_tensor_types(x, x_values)
+
+  # Similarly for y, we perform the same validation
+  validate_all_tensor_shapes(y, y_values)
+  validate_all_tensor_types(y, y_values)
+
+  # Return the unwrapped values to avoid calling `unwrap` a second time.
+  return x_values, y_values
+
+
+def validate_all_tensor_types(x, x_values):
+  x_dtype = x_values[0].dtype
+  for i in range(1, len(x_values)):
+    if x_dtype != x_values[i].dtype:
+      raise ValueError('Input tensor dtypes do not match for distributed tensor'
+                       ' inputs {}'.format(x))
+
+
+def validate_all_tensor_shapes(x, x_values):
+  # Validate that the shape of all the elements in x have the same shape
+  x_shape = x_values[0].get_shape().as_list()
+  for i in range(1, len(x_values)):
+    if x_shape != x_values[i].get_shape().as_list():
+      raise ValueError('Input tensor shapes do not match for distributed tensor'
+                       ' inputs {}'.format(x))
diff --git a/tensorflow/python/keras/engine/network.py b/tensorflow/python/keras/engine/network.py
index a4d96de74f..210c61a726 100644
--- a/tensorflow/python/keras/engine/network.py
+++ b/tensorflow/python/keras/engine/network.py
@@ -20,7 +20,6 @@ from __future__ import division
 from __future__ import print_function
 
 import copy
-import functools
 import json
 import os
 import weakref
@@ -30,6 +29,8 @@ from six.moves import zip  # pylint: disable=redefined-builtin
 
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import context
+from tensorflow.python.eager import function as eager_function
+from tensorflow.python.framework import errors
 from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
@@ -144,10 +145,6 @@ class Network(base_layer.Layer):
 
     self._checkpointable_saver = checkpointable_utils.CheckpointableSaver(
         weakref.ref(self))
-    # A zero-argument function which should be called and set back to None as
-    # soon as the network is built (only applicable to subclassed Models). Runs
-    # restore operations when graph building.
-    self._in_progress_restore_finalizer = None
 
   @checkpointable.no_automatic_dependency_tracking
   def _init_graph_network(self, inputs, outputs, name=None):
@@ -218,7 +215,7 @@ class Network(base_layer.Layer):
 
     self._base_init(name=name)
     self._compute_previous_mask = (
-        'mask' in tf_inspect.getargspec(self.call).args or
+        'mask' in tf_inspect.getfullargspec(self.call).args or
         hasattr(self, 'compute_mask'))
     # A Network does not create weights of its own, thus it is already
     # built.
@@ -312,14 +309,14 @@ class Network(base_layer.Layer):
   def _init_subclassed_network(self, name=None):
     self._base_init(name=name)
     self._is_graph_network = False
-    call_argspec = tf_inspect.getargspec(self.call)
+    call_argspec = tf_inspect.getfullargspec(self.call)
     if 'training' in call_argspec.args:
       self._expects_training_arg = True
     else:
       self._expects_training_arg = False
     self._call_convention = self._determine_call_convention(call_argspec)
-    self.outputs = None
-    self.inputs = None
+    self.outputs = []
+    self.inputs = []
     self.built = False
 
   def _determine_call_convention(self, call_argspec):
@@ -739,6 +736,93 @@ class Network(base_layer.Layer):
       return specs[0]
     return specs
 
+  @base_layer.default
+  def build(self, input_shape):
+    """Builds the model based on input shapes received.
+
+    This is to be used for subclassed models, which do not know at instantiation
+    time what their inputs look like.
+
+    Args:
+     input_shape: Single tuple, TensorShape, or list of shapes, where shapes
+         are tuples, integers, or TensorShapes.
+
+    Raises:
+      ValueError:
+        1. In case of invalid user-provided data (not of type tuple,
+           list, or TensorShape).
+        2. If the model requires call arguments that are agnostic
+           to the input shapes (positional or kwarg in call signature).
+        3. If not all layers were properly built.
+        4. If float type inputs are not supported within the layers.
+
+      In each of these cases, the user should build their model by calling it
+      on real tensor data.
+    """
+    if self._is_graph_network:
+      self.built = True
+      return
+
+    # If subclass network
+    if input_shape is None:
+      raise ValueError('Input shape must be defined when calling build on a '
+                       'model subclass network.')
+    valid_types = (tuple, list, tensor_shape.TensorShape)
+    if not isinstance(input_shape, valid_types):
+      raise ValueError('Specified input shape is not one of the valid types. '
+                       'Please specify a batch input shape of type tuple or '
+                       'list of input shapes. User provided '
+                       'input type: {}'.format(type(input_shape)))
+
+    if input_shape and not self.inputs:
+      # We create placeholders for the `None`s in the shape and build the model
+      # in a Graph. Since tf.Variable is compatible with both eager execution
+      # and graph building, the variables created after building the model in
+      # a Graph are still valid when executing eagerly.
+      with context.graph_mode():
+        graph = eager_function.CapturingGraph()
+        with graph.as_default():
+          if isinstance(input_shape, list):
+            x = [base_layer.generate_placeholders_from_shape(shape)
+                 for shape in input_shape]
+          else:
+            x = base_layer.generate_placeholders_from_shape(input_shape)
+
+          kwargs = {}
+          num_call_args = len(tf_inspect.getfullargspec(self.call).args)
+          if self._expects_training_arg and num_call_args == 3:
+            # Has call signature of call(self, input, training)
+            kwargs['training'] = False
+          elif num_call_args > 2:
+            # Has invalid call signature of call(self, input, *args, **kwargs)
+            raise ValueError('Currently, you cannot build your model if it has '
+                             'positional or keyword arguments that are not '
+                             'inputs to the model, but are required for its '
+                             '`call` method. Instead, in order to instantiate '
+                             'and build your model, `call` your model on real '
+                             'tensor data with all expected call arguments.')
+
+          try:
+            self.call(x, **kwargs)
+          except (errors.InvalidArgumentError, TypeError):
+            raise ValueError('You cannot build your model by calling `build` '
+                             'if your layers do not support float type inputs. '
+                             'Instead, in order to instantiate and build your '
+                             'model, `call` your model on real tensor data (of '
+                             'the correct dtype).')
+
+    if self._layers:
+      self._track_layers(self._layers)
+    if self.layers:
+      for layer in self.layers:
+        if not layer.built:
+          raise ValueError('Layer: {} was not built in your model. Calling '
+                           '`build` manually on a subclassed model is only '
+                           'allowed for models with a static topology. '
+                           'In this case, you can build your model by '
+                           'calling it on real tensor data.'.format(layer))
+    self.built = True
+
   def call(self, inputs, training=None, mask=None):
     """Calls the model on new inputs.
 
@@ -757,11 +841,11 @@ class Network(base_layer.Layer):
         A tensor if there is a single output, or
         a list of tensors if there are more than one outputs.
     """
-    inputs = nest.flatten(inputs)
+    inputs = generic_utils.to_list(inputs)
     if mask is None:
       masks = [None for _ in range(len(inputs))]
     else:
-      masks = nest.flatten(mask)
+      masks = generic_utils.to_list(mask)
 
     if not context.executing_eagerly():
       # Try to retrieve cached outputs if the layer has already been called
@@ -777,8 +861,21 @@ class Network(base_layer.Layer):
                                           mask=masks)
     return outputs
 
+  def _call_and_compute_mask(self, inputs, training=None, mask=None):
+    assert context.executing_eagerly()
+    inputs = generic_utils.to_list(inputs)
+    if mask is None:
+      masks = [None for _ in range(len(inputs))]
+    else:
+      masks = generic_utils.to_list(mask)
+    return self._run_internal_graph(inputs,
+                                    training=training,
+                                    mask=masks)
+
   def compute_output_shape(self, input_shape):
     if not self._is_graph_network:
+      if context.executing_eagerly():
+        return super(Network, self).compute_output_shape(input_shape)
       raise NotImplementedError
 
     if isinstance(input_shape, list):
@@ -889,7 +986,7 @@ class Network(base_layer.Layer):
         mask: List of masks (tensors or None).
 
     Returns:
-        Three lists: output_tensors, output_masks, output_shapes
+        Two lists: output_tensors, output_masks
     """
     # Note: masking support is relevant mainly for Keras.
     # It cannot be factored out without having the fully reimplement the network
@@ -939,9 +1036,9 @@ class Network(base_layer.Layer):
             if len(computed_data) == 1:
               computed_tensor, computed_mask = computed_data[0]
               # Ensure mask propagation if applicable.
-              if 'mask' in tf_inspect.getargspec(layer.call).args:
+              if 'mask' in tf_inspect.getfullargspec(layer.call).args:
                 kwargs.setdefault('mask', computed_mask)
-              if 'training' in tf_inspect.getargspec(layer.call).args:
+              if 'training' in tf_inspect.getfullargspec(layer.call).args:
                 kwargs.setdefault('training', training)
 
               output_tensors = nest.flatten(
@@ -956,13 +1053,12 @@ class Network(base_layer.Layer):
               else:
                 output_masks = [None for _ in output_tensors]
               computed_tensors = [computed_tensor]
-              computed_masks = [computed_mask]
             else:
               computed_tensors = [x[0] for x in computed_data]
               computed_masks = [x[1] for x in computed_data]
-              if 'mask' in tf_inspect.getargspec(layer.call).args:
+              if 'mask' in tf_inspect.getfullargspec(layer.call).args:
                 kwargs.setdefault('mask', computed_masks)
-              if 'training' in tf_inspect.getargspec(layer.call).args:
+              if 'training' in tf_inspect.getfullargspec(layer.call).args:
                 kwargs.setdefault('training', training)
 
               output_tensors = nest.flatten(
@@ -1362,6 +1458,16 @@ class Network(base_layer.Layer):
         session = None
       else:
         session = backend.get_session()
+      optimizer = getattr(self, 'optimizer', None)
+      if (optimizer
+          and not isinstance(optimizer, checkpointable.CheckpointableBase)):
+        logging.warning(
+            ('This model was compiled with a Keras optimizer (%s) but is being '
+             'saved in TensorFlow format with `save_weights`. The model\'s '
+             'weights will be saved, but unlike with TensorFlow optimizers in '
+             'the TensorFlow format the optimizer\'s state will not be '
+             'saved.\n\nConsider using a TensorFlow optimizer from `tf.train`.')
+            % (optimizer,))
       self._checkpointable_saver.save(filepath, session=session)
 
   def load_weights(self, filepath, by_name=False):
@@ -1423,13 +1529,9 @@ class Network(base_layer.Layer):
             'load_weights).')
       if not context.executing_eagerly():
         session = backend.get_session()
-        finalizer = functools.partial(status.run_restore_ops, session=session)
-        if self.built:
-          finalizer()
-        else:
-          # Hold on to this status object until the network is built (for
-          # subclassed Models). Then we'll run restore ops if necessary.
-          self._in_progress_restore_finalizer = finalizer
+        # Restore existing variables (if any) immediately, and set up a
+        # streaming restore for any variables created in the future.
+        checkpointable_utils.streaming_restore(status=status, session=session)
       return status
     if h5py is None:
       raise ImportError(
@@ -1447,14 +1549,6 @@ class Network(base_layer.Layer):
       else:
         saving.load_weights_from_hdf5_group(f, self.layers)
 
-  def _post_build_cleanup(self):
-    super(Network, self)._post_build_cleanup()
-    if self._in_progress_restore_finalizer is not None:
-      # Runs queued restore operations left over from load_weights when graph
-      # building.
-      self._in_progress_restore_finalizer()
-      self._in_progress_restore_finalizer = None
-
   def _updated_config(self):
     """Util shared between different serialization methods.
 
diff --git a/tensorflow/python/keras/engine/saving_test.py b/tensorflow/python/keras/engine/saving_test.py
index 030328f2a6..f2f8a27b76 100644
--- a/tensorflow/python/keras/engine/saving_test.py
+++ b/tensorflow/python/keras/engine/saving_test.py
@@ -35,6 +35,7 @@ from tensorflow.python.keras.engine import training
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.platform import test
+from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import training as training_module
 
 try:
@@ -663,6 +664,22 @@ class SubclassedModel(training.Model):
 
 class TestWeightSavingAndLoadingTFFormat(test.TestCase):
 
+  def test_keras_optimizer_warning(self):
+    graph = ops.Graph()
+    with graph.as_default(), self.test_session(graph):
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.Dense(3))
+      model.compile(loss='mse', optimizer='adam', metrics=['acc'])
+      model._make_train_function()
+      temp_dir = self.get_temp_dir()
+      prefix = os.path.join(temp_dir, 'ckpt')
+      with test.mock.patch.object(logging, 'warning') as mock_log:
+        model.save_weights(prefix)
+        self.assertRegexpMatches(
+            str(mock_log.call_args),
+            'Keras optimizer')
+
   @test_util.run_in_graph_and_eager_modes
   def test_tensorflow_format_overwrite(self):
     with self.test_session() as session:
@@ -722,18 +739,23 @@ class TestWeightSavingAndLoadingTFFormat(test.TestCase):
         self.assertEqual(len(graph.get_operations()), op_count)
 
   def _weight_loading_test_template(self, make_model_fn):
-    with self.test_session() as session:
+    with self.test_session():
       model = make_model_fn()
+      model.compile(
+          loss='mse',
+          optimizer=training_module.RMSPropOptimizer(0.1),
+          metrics=['acc'])
       temp_dir = self.get_temp_dir()
       prefix = os.path.join(temp_dir, 'ckpt')
+      train_x = np.random.random((3, 2))
+      train_y = np.random.random((3,))
+      x = constant_op.constant(train_x, dtype=dtypes.float32)
 
-      x = constant_op.constant(np.random.random((3, 2)), dtype=dtypes.float32)
-      executing_eagerly = context.executing_eagerly()
-      ref_y_tensor = model(x)
-      if not executing_eagerly:
-        session.run([v.initializer for v in model.variables])
-      ref_y = self.evaluate(ref_y_tensor)
+      model.train_on_batch(train_x, train_y)
       model.save_weights(prefix, save_format='tf')
+      ref_y_before_train = model.predict(train_x)
+      model.train_on_batch(train_x, train_y)
+      ref_y_after_train = model.predict(train_x)
       for v in model.variables:
         self.evaluate(
             v.assign(random_ops.random_normal(shape=array_ops.shape(v))))
@@ -741,16 +763,27 @@ class TestWeightSavingAndLoadingTFFormat(test.TestCase):
       self.addCleanup(shutil.rmtree, temp_dir)
 
       model.load_weights(prefix)
-      y = self.evaluate(model(x))
-      self.assertAllClose(ref_y, y)
+      self.assertAllClose(ref_y_before_train, self.evaluate(model(x)))
 
       # Test restore-on-create if this is a subclassed Model (graph Networks
       # will have already created their variables).
       load_model = make_model_fn()
       load_model.load_weights(prefix)
-      restore_on_create_y_tensor = load_model(x)
-      restore_on_create_y = self.evaluate(restore_on_create_y_tensor)
-      self.assertAllClose(ref_y, restore_on_create_y)
+      self.assertAllClose(
+          ref_y_before_train,
+          self.evaluate(load_model(x)))
+      load_model = make_model_fn()
+      load_model.load_weights(prefix)
+      # We need to run some of the restore ops for predict(), but not all
+      # variables have been created yet (optimizer slot variables). Tests
+      # incremental restore.
+      load_model.predict(train_x)
+      load_model.compile(
+          loss='mse',
+          optimizer=training_module.RMSPropOptimizer(0.1),
+          metrics=['acc'])
+      load_model.train_on_batch(train_x, train_y)
+      self.assertAllClose(ref_y_after_train, self.evaluate(load_model(x)))
 
   @test_util.run_in_graph_and_eager_modes
   def test_weight_loading_graph_model(self):
@@ -858,5 +891,6 @@ class TestWeightSavingAndLoadingTFFormat(test.TestCase):
         SubclassedModel, SubclassedModelRestore,
         _restore_init_fn)
 
+
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/engine/sequential.py b/tensorflow/python/keras/engine/sequential.py
index 371504a503..41cdfda660 100644
--- a/tensorflow/python/keras/engine/sequential.py
+++ b/tensorflow/python/keras/engine/sequential.py
@@ -213,13 +213,31 @@ class Sequential(Model):
       self.outputs = [self.layers[-1].output]
       self.build()
 
-  @checkpointable.no_automatic_dependency_tracking
   def build(self, input_shape=None):
-    if input_shape and not self.inputs:
-      batch_shape = tuple(input_shape)
+    self._set_inputs_and_outputs(input_shape=input_shape)
+
+  def symbolic_set_inputs(self, inputs):
+    self._set_inputs_and_outputs(tensor=inputs)
+
+  @checkpointable.no_automatic_dependency_tracking
+  def _set_inputs_and_outputs(self, input_shape=None, tensor=None):
+    """Set model's input and output specs based on the input received.
+
+    If `tensor` is provided, `input_shape` is not required.
+
+    Args:
+      input_shape: Optional shape of input.
+      tensor: Optional existing tensor to wrap into the `Input` layer.
+    """
+    if not self.inputs:
       dtype = K.floatx()
-      x = Input(
-          batch_shape=batch_shape, dtype=dtype, name=self.name + '_input')
+      if tensor is not None:
+        batch_shape = (None,) + tuple(tensor.get_shape().as_list()[1:])
+        x = Input(dtype=dtype, name=self.name + '_input', tensor=tensor)
+      elif input_shape is not None:
+        batch_shape = tuple(input_shape)
+        x = Input(
+            batch_shape=batch_shape, dtype=dtype, name=self.name + '_input')
       self.inputs = [x]
       for layer in self._layers:
         x = layer(x)
diff --git a/tensorflow/python/keras/engine/sequential_test.py b/tensorflow/python/keras/engine/sequential_test.py
index 0f54e29cee..4f4adca333 100644
--- a/tensorflow/python/keras/engine/sequential_test.py
+++ b/tensorflow/python/keras/engine/sequential_test.py
@@ -22,7 +22,6 @@ import numpy as np
 
 from tensorflow.python import keras
 from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.eager import context
 from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import test
@@ -104,9 +103,6 @@ class TestSequential(test.TestCase):
 
   @tf_test_util.run_in_graph_and_eager_modes
   def test_sequential_deferred_build_with_dataset_iterators(self):
-    if not context.executing_eagerly():
-      # TODO(psv/fchollet): Add support for this use case in graph mode.
-      return
     num_hidden = 5
     input_dim = 3
     num_classes = 2
@@ -136,6 +132,48 @@ class TestSequential(test.TestCase):
                      [None, num_classes])
     self.assertEqual(len(model.weights), 2 * 2)
 
+  def test_training_and_eval_methods_on_symbolic_tensors(self):
+    with self.test_session():
+
+      def create_model():
+        model = keras.Sequential()
+        model.add(keras.layers.Dense(10, activation='relu'))
+        model.add(keras.layers.Dense(4, activation='softmax'))
+
+        model.compile(
+            optimizer=rmsprop.RMSPropOptimizer(1e-3),
+            loss='categorical_crossentropy',
+            metrics=['accuracy'])
+        return model
+
+      inputs = keras.backend.zeros(shape=(10, 3))
+      targets = keras.backend.zeros(shape=(10, 4))
+
+      model = create_model()
+      model.fit(inputs, targets, epochs=10, steps_per_epoch=30)
+
+      model = create_model()
+      model.evaluate(inputs, targets, steps=2, verbose=0)
+
+      model = create_model()
+      model.predict(inputs, steps=2)
+
+      model = create_model()
+      model.train_on_batch(inputs, targets)
+
+      model = create_model()
+      model.test_on_batch(inputs, targets)
+
+      model = create_model()
+      model.fit(
+          inputs,
+          targets,
+          epochs=1,
+          steps_per_epoch=2,
+          verbose=0,
+          validation_data=(inputs, targets),
+          validation_steps=2)
+
   @tf_test_util.run_in_graph_and_eager_modes
   def test_invalid_use_cases(self):
     # Added objects must be layer instances
diff --git a/tensorflow/python/keras/engine/topology_test.py b/tensorflow/python/keras/engine/topology_test.py
index 3eb69bd7f3..079c8dae71 100644
--- a/tensorflow/python/keras/engine/topology_test.py
+++ b/tensorflow/python/keras/engine/topology_test.py
@@ -24,6 +24,7 @@ from tensorflow.python import keras
 from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util
 from tensorflow.python.keras.engine import base_layer
 from tensorflow.python.keras.engine import input_layer as input_layer_lib
@@ -110,7 +111,6 @@ class TopologyConstructionTest(test.TestCase):
     layer = keras.layers.BatchNormalization()
     _ = layer.apply(x1)
 
-    print('BN updates', layer._updates)
     self.assertEqual(len(layer.updates), 2)
     self.assertEqual(len(layer.get_updates_for(x1)), 2)
     self.assertEqual(len(layer.get_updates_for(None)), 0)
@@ -960,9 +960,6 @@ class DeferredModeTest(test.TestCase):
       def call(self, inputs):
         return inputs[0] + inputs[1]
 
-      def compute_output_shape(self, input_shape):
-        return input_shape[0]
-
     c = AddLayer()([a, input_b])  # pylint: disable=not-callable
     c = keras.layers.Dense(2)(c)
 
@@ -978,6 +975,196 @@ class DeferredModeTest(test.TestCase):
       self.assertEqual(outputs[1].shape.as_list(), [10, 2])
 
 
+class DefaultShapeInferenceBehaviorTest(test.TestCase):
+
+  def _testShapeInference(self, model, input_shape, expected_output_shape):
+    input_value = np.random.random(input_shape)
+    output_value = model.predict(input_value)
+    self.assertEqual(output_value.shape, expected_output_shape)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testSingleInputCase(self):
+
+    class LayerWithOneInput(keras.layers.Layer):
+
+      def build(self, input_shape):
+        self.w = array_ops.ones(shape=(3, 4))
+
+      def call(self, inputs):
+        return keras.backend.dot(inputs, self.w)
+
+    inputs = input_layer_lib.Input(shape=(3,))
+    layer = LayerWithOneInput()
+
+    if context.executing_eagerly():
+      self.assertEqual(
+          layer.compute_output_shape((None, 3)).as_list(), [None, 4])
+      # As a side-effect, compute_output_shape builds the layer.
+      self.assertTrue(layer.built)
+      # We can still query the layer's compute_output_shape with compatible
+      # input shapes.
+      self.assertEqual(
+          layer.compute_output_shape((6, 3)).as_list(), [6, 4])
+
+    outputs = layer(inputs)
+    model = keras.Model(inputs, outputs)
+    self._testShapeInference(model, (2, 3), (2, 4))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testMultiInputOutputCase(self):
+
+    class MultiInputOutputLayer(keras.layers.Layer):
+
+      def build(self, input_shape):
+        self.w = array_ops.ones(shape=(3, 4))
+
+      def call(self, inputs):
+        a = keras.backend.dot(inputs[0], self.w)
+        b = a + inputs[1]
+        return [a, b]
+
+    input_a = input_layer_lib.Input(shape=(3,))
+    input_b = input_layer_lib.Input(shape=(4,))
+    output_a, output_b = MultiInputOutputLayer()([input_a, input_b])
+    model = keras.Model([input_a, input_b], [output_a, output_b])
+    output_a_val, output_b_val = model.predict(
+        [np.random.random((2, 3)), np.random.random((2, 4))])
+    self.assertEqual(output_a_val.shape, (2, 4))
+    self.assertEqual(output_b_val.shape, (2, 4))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testTrainingArgument(self):
+
+    class LayerWithTrainingArg(keras.layers.Layer):
+
+      def build(self, input_shape):
+        self.w = array_ops.ones(shape=(3, 4))
+
+      def call(self, inputs, training):
+        return keras.backend.dot(inputs, self.w)
+
+    inputs = input_layer_lib.Input(shape=(3,))
+    outputs = LayerWithTrainingArg()(inputs, training=False)
+    model = keras.Model(inputs, outputs)
+    self._testShapeInference(model, (2, 3), (2, 4))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testUnsupportedSignature(self):
+
+    class LayerWithAdditionalArg(keras.layers.Layer):
+
+      def build(self, input_shape):
+        self.w = array_ops.ones(shape=(3, 4))
+
+      def call(self, inputs, some_arg):
+        return keras.backend.dot(inputs, self.w) + some_arg
+
+    inputs = input_layer_lib.Input(shape=(3,))
+    if context.executing_eagerly():
+      with self.assertRaises(NotImplementedError):
+        outputs = LayerWithAdditionalArg()(inputs, some_arg=0)
+    else:
+      # Works with graph mode because the graph of ops is built together with
+      # the graph of layers.
+      outputs = LayerWithAdditionalArg()(inputs, some_arg=0)
+      _ = keras.Model(inputs, outputs)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testNoneInShape(self):
+
+    class Model(keras.Model):
+
+      def __init__(self):
+        super(Model, self).__init__()
+        self.conv1 = keras.layers.Conv2D(8, 3)
+        self.pool = keras.layers.GlobalAveragePooling2D()
+        self.fc = keras.layers.Dense(3)
+
+      def call(self, x):
+        x = self.conv1(x)
+        x = self.pool(x)
+        x = self.fc(x)
+        return x
+
+    model = Model()
+    model.build(tensor_shape.TensorShape((None, None, None, 1)))
+    self.assertTrue(model.built, 'Model should be built')
+    self.assertTrue(model.weights,
+                    'Model should have its weights created as it '
+                    'has been built')
+    sample_input = array_ops.ones((1, 10, 10, 1))
+    output = model(sample_input)
+    self.assertEqual(output.shape, (1, 3))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testNoneInShapeWithCompoundModel(self):
+
+    class BasicBlock(keras.Model):
+
+      def __init__(self):
+        super(BasicBlock, self).__init__()
+        self.conv1 = keras.layers.Conv2D(8, 3)
+        self.pool = keras.layers.GlobalAveragePooling2D()
+        self.dense = keras.layers.Dense(3)
+
+      def call(self, x):
+        x = self.conv1(x)
+        x = self.pool(x)
+        x = self.dense(x)
+        return x
+
+    class CompoundModel(keras.Model):
+
+      def __init__(self):
+        super(CompoundModel, self).__init__()
+        self.block = BasicBlock()
+
+      def call(self, x):
+        x = self.block(x)  # pylint: disable=not-callable
+        return x
+
+    model = CompoundModel()
+    model.build(tensor_shape.TensorShape((None, None, None, 1)))
+    self.assertTrue(model.built, 'Model should be built')
+    self.assertTrue(model.weights,
+                    'Model should have its weights created as it '
+                    'has been built')
+    sample_input = array_ops.ones((1, 10, 10, 1))
+    output = model(sample_input)  # pylint: disable=not-callable
+    self.assertEqual(output.shape, (1, 3))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testNoneInShapeWithFunctinalAPI(self):
+
+    class BasicBlock(keras.Model):
+      # Inherting from keras.layers.Layer since we are calling this layer
+      # inside a model created using functional API.
+
+      def __init__(self):
+        super(BasicBlock, self).__init__()
+        self.conv1 = keras.layers.Conv2D(8, 3)
+
+      def call(self, x):
+        x = self.conv1(x)
+        return x
+
+    input_layer = keras.layers.Input(shape=(None, None, 1))
+    x = BasicBlock()(input_layer)
+    x = keras.layers.GlobalAveragePooling2D()(x)
+    output_layer = keras.layers.Dense(3)(x)
+
+    model = keras.Model(inputs=input_layer, outputs=output_layer)
+
+    model.build(tensor_shape.TensorShape((None, None, None, 1)))
+    self.assertTrue(model.built, 'Model should be built')
+    self.assertTrue(model.weights,
+                    'Model should have its weights created as it '
+                    'has been built')
+    sample_input = array_ops.ones((1, 10, 10, 1))
+    output = model(sample_input)
+    self.assertEqual(output.shape, (1, 3))
+
+
 class GraphUtilsTest(test.TestCase):
 
   def testGetReachableFromInputs(self):
diff --git a/tensorflow/python/keras/engine/training.py b/tensorflow/python/keras/engine/training.py
index 8e632651fa..d470651a64 100644
--- a/tensorflow/python/keras/engine/training.py
+++ b/tensorflow/python/keras/engine/training.py
@@ -24,22 +24,21 @@ import numpy as np
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.eager import context
-from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.keras import backend as K
 from tensorflow.python.keras import losses
-from tensorflow.python.keras import metrics as metrics_module
 from tensorflow.python.keras import optimizers
 from tensorflow.python.keras.engine import base_layer
+from tensorflow.python.keras.engine import distributed_training_utils
 from tensorflow.python.keras.engine import training_arrays
+from tensorflow.python.keras.engine import training_distributed
 from tensorflow.python.keras.engine import training_eager
 from tensorflow.python.keras.engine import training_generator
 from tensorflow.python.keras.engine import training_utils
 from tensorflow.python.keras.engine.network import Network
 from tensorflow.python.keras.utils.generic_utils import slice_arrays
-from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import optimizer as tf_optimizer_module
 from tensorflow.python.training.checkpointable import base as checkpointable
@@ -115,6 +114,27 @@ class Model(Network):
     self._iterator_get_next = weakref.WeakKeyDictionary()
     # Create a cache for dataset - uninitialized iterators
     self._dataset_iterator_cache = weakref.WeakKeyDictionary()
+    # initializing _distribution_strategy here since it is possible to call
+    # predict on a model without compiling it.
+    self._distribution_strategy = None
+
+  def _set_sample_weight_attributes(self, sample_weight_mode,
+                                    skip_target_weighing_indices):
+    """Sets sample weight related attributes on the model."""
+    sample_weights, sample_weight_modes = training_utils.prepare_sample_weights(
+        self.output_names, sample_weight_mode, skip_target_weighing_indices)
+    self.sample_weights = sample_weights
+    self.sample_weight_modes = sample_weight_modes
+    self._feed_sample_weight_modes = [
+        sample_weight_modes[i]
+        for i in range(len(self.outputs))
+        if i not in skip_target_weighing_indices
+    ]
+    self._feed_sample_weights = [
+        sample_weights[i]
+        for i in range(len(sample_weights))
+        if i not in skip_target_weighing_indices
+    ]
 
   @checkpointable.no_automatic_dependency_tracking
   def compile(self,
@@ -125,6 +145,7 @@ class Model(Network):
               sample_weight_mode=None,
               weighted_metrics=None,
               target_tensors=None,
+              distribute=None,
               **kwargs):
     """Configures the model for training.
 
@@ -168,12 +189,32 @@ class Model(Network):
             can specify them via the `target_tensors` argument. It can be
             a single tensor (for a single-output model), a list of tensors,
             or a dict mapping output names to target tensors.
+        distribute: The DistributionStrategy instance that we want to use to
+            distribute the training of the model.
         **kwargs: These arguments are passed to `tf.Session.run`.
 
     Raises:
         ValueError: In case of invalid arguments for
             `optimizer`, `loss`, `metrics` or `sample_weight_mode`.
     """
+    # Validate that arguments passed by the user to `compile` are supported by
+    # DistributionStrategy.
+    if distribute and not isinstance(
+        optimizer, (tf_optimizer_module.Optimizer, optimizers.TFOptimizer)):
+      raise ValueError('Only TF native optimizers are supported with '
+                       'DistributionStrategy.')
+    if distribute and context.executing_eagerly():
+      raise ValueError('DistributionStrategy is not supported in Eager mode.')
+    if distribute and sample_weight_mode:
+      raise ValueError('sample_weight_mode is not supported with '
+                       'DistributionStrategy.')
+    if distribute and weighted_metrics:
+      raise ValueError('weighted_metrics is not supported with '
+                       'DistributionStrategy.')
+    if distribute and target_tensors:
+      raise ValueError('target_tensors is not supported with '
+                       'DistributionStrategy.')
+
     loss = loss or {}
     if context.executing_eagerly() and not isinstance(
         optimizer, (tf_optimizer_module.Optimizer, optimizers.TFOptimizer)):
@@ -188,8 +229,6 @@ class Model(Network):
     self.loss = loss
     self.metrics = metrics or []
     self.loss_weights = loss_weights
-    if context.executing_eagerly() and sample_weight_mode is not None:
-      raise ValueError('sample_weight_mode is not supported in Eager mode.')
     self.sample_weight_mode = sample_weight_mode
     if context.executing_eagerly() and weighted_metrics is not None:
       raise ValueError('weighted_metrics is not supported in Eager mode.')
@@ -198,6 +237,17 @@ class Model(Network):
       raise ValueError('target_tensors is not supported in Eager mode.')
     self.target_tensors = target_tensors
 
+    # Set DistributionStrategy specific parameters.
+    self._distribution_strategy = distribute
+    if self._distribution_strategy is not None:
+      self._grouped_model = self._compile_distributed_model(
+          self._distribution_strategy)
+      with self._distribution_strategy.scope():
+        first_replicated_model = self._distribution_strategy.unwrap(
+            self._grouped_model)[0]
+      # We initialize the callback model with the first replicated model.
+      self._replicated_model = DistributedCallbackModel(first_replicated_model)
+      self._replicated_model.set_original_model(self)
     if not self.built:
       # Model is not compilable because it does not know its number of inputs
       # and outputs, nor their shapes and names. We will compile after the first
@@ -217,10 +267,9 @@ class Model(Network):
       for name in self.output_names:
         if name not in loss:
           logging.warning(
-              'Output "' + name + '" missing from loss dictionary. '
-              'We assume this was done on purpose, '
-              'and we will not be expecting '
-              'any data to be passed to "' + name + '" during training.')
+              'Output "' + name + '" missing from loss dictionary. We assume '
+              'this was done on purpose. The fit and evaluate APIs will not be '
+              'expecting any data to be passed to "' + name + '".')
         loss_functions.append(losses.get(loss.get(name)))
     elif isinstance(loss, list):
       if len(loss) != len(self.outputs):
@@ -281,8 +330,12 @@ class Model(Network):
                       str(loss_weights) + ' - expected a list of dicts.')
     self.loss_weights_list = loss_weights_list
 
-    # initialization for Eager mode execution
+    # Initialization for Eager mode execution.
     if context.executing_eagerly():
+      # Prepare sample weights.
+      self._set_sample_weight_attributes(sample_weight_mode,
+                                         skip_target_weighing_indices)
+
       if target_tensors is not None:
         raise ValueError('target_tensors are not currently supported in Eager '
                          'mode.')
@@ -300,10 +353,6 @@ class Model(Network):
 
       with K.name_scope('metrics'):
         training_utils.populate_metric_names(self)
-      self._feed_sample_weight_modes = []
-      for i in range(len(self.outputs)):
-        self._feed_sample_weight_modes.append(None)
-      self.sample_weights = []
       self.targets = []
       for i in range(len(self.outputs)):
         self._feed_output_names.append(self.output_names[i])
@@ -363,73 +412,8 @@ class Model(Network):
         self.targets.append(target)
 
     # Prepare sample weights.
-    sample_weights = []
-    sample_weight_modes = []
-    if isinstance(sample_weight_mode, dict):
-      for name in sample_weight_mode:
-        if name not in self.output_names:
-          raise ValueError(
-              'Unknown entry in '
-              'sample_weight_mode dictionary: "' + name + '". '
-              'Only expected the following keys: ' + str(self.output_names))
-      for i, name in enumerate(self.output_names):
-        if i in skip_target_weighing_indices:
-          weight = None
-          sample_weight_modes.append(None)
-        else:
-          if name not in sample_weight_mode:
-            raise ValueError(
-                'Output "' + name + '" missing from sample_weight_modes '
-                'dictionary')
-          if sample_weight_mode.get(name) == 'temporal':
-            weight = K.placeholder(ndim=2, name=name + '_sample_weights')
-            sample_weight_modes.append('temporal')
-          else:
-            weight = K.placeholder(ndim=1, name=name + 'sample_weights')
-            sample_weight_modes.append(None)
-        sample_weights.append(weight)
-    elif isinstance(sample_weight_mode, list):
-      if len(sample_weight_mode) != len(self.outputs):
-        raise ValueError('When passing a list as sample_weight_mode, '
-                         'it should have one entry per model output. '
-                         'The model has ' + str(len(self.outputs)) +
-                         ' outputs, but you passed '
-                         'sample_weight_mode=' + str(sample_weight_mode))
-      for i in range(len(self.output_names)):
-        if i in skip_target_weighing_indices:
-          weight = None
-          sample_weight_modes.append(None)
-        else:
-          mode = sample_weight_mode[i]
-          name = self.output_names[i]
-          if mode == 'temporal':
-            weight = K.placeholder(ndim=2, name=name + '_sample_weights')
-            sample_weight_modes.append('temporal')
-          else:
-            weight = K.placeholder(ndim=1, name=name + '_sample_weights')
-            sample_weight_modes.append(None)
-        sample_weights.append(weight)
-    else:
-      for i, name in enumerate(self.output_names):
-        if i in skip_target_weighing_indices:
-          sample_weight_modes.append(None)
-          sample_weights.append(None)
-        else:
-          if sample_weight_mode == 'temporal':
-            sample_weights.append(array_ops.placeholder_with_default(
-                constant_op.constant([[1.]], dtype=K.floatx()),
-                shape=[None, None], name=name + '_sample_weights'))
-            sample_weight_modes.append('temporal')
-          else:
-            sample_weights.append(array_ops.placeholder_with_default(
-                constant_op.constant([1.], dtype=K.floatx()),
-                shape=[None], name=name + '_sample_weights'))
-            sample_weight_modes.append(None)
-    self.sample_weight_modes = sample_weight_modes
-    self._feed_sample_weight_modes = []
-    for i in range(len(self.outputs)):
-      if i not in skip_target_weighing_indices:
-        self._feed_sample_weight_modes.append(self.sample_weight_modes[i])
+    self._set_sample_weight_attributes(sample_weight_mode,
+                                       skip_target_weighing_indices)
 
     # Prepare metrics.
     self.weighted_metrics = weighted_metrics
@@ -445,7 +429,7 @@ class Model(Network):
         y_true = self.targets[i]
         y_pred = self.outputs[i]
         weighted_loss = weighted_losses[i]
-        sample_weight = sample_weights[i]
+        sample_weight = self.sample_weights[i]
         mask = masks[i]
         loss_weight = loss_weights_list[i]
         with K.name_scope(self.output_names[i] + '_loss'):
@@ -484,50 +468,28 @@ class Model(Network):
 
         y_true = self.targets[i]
         y_pred = self.outputs[i]
-        weights = sample_weights[i]
+        weights = self.sample_weights[i]
         output_metrics = nested_metrics[i]
         output_weighted_metrics = nested_weighted_metrics[i]
+        output_shape = self.outputs[i].get_shape().as_list()
+        loss_fn = self.loss_functions[i]
 
-        def handle_metrics(metrics, weights=None):
+        def handle_metrics(metrics, output_shape, loss_fn, weights=None):
+          """Invokes metric functions for the output."""
 
           for metric in metrics:
-            if metric in ('accuracy', 'acc', 'crossentropy', 'ce'):
-              # custom handling of accuracy/crossentropy
-              # (because of class mode duality)
-              output_shape = self.outputs[i].get_shape().as_list()
-              if (output_shape[-1] == 1 or
-                  self.loss_functions[i] == losses.binary_crossentropy):
-                # case: binary accuracy/crossentropy
-                if metric in ('accuracy', 'acc'):
-                  metric_fn = metrics_module.binary_accuracy
-                elif metric in ('crossentropy', 'ce'):
-                  metric_fn = metrics_module.binary_crossentropy
-              elif self.loss_functions[
-                  i] == losses.sparse_categorical_crossentropy:
-                # case: categorical accuracy/crossentropy with sparse targets
-                if metric in ('accuracy', 'acc'):
-                  metric_fn = metrics_module.sparse_categorical_accuracy
-                elif metric in ('crossentropy', 'ce'):
-                  metric_fn = metrics_module.sparse_categorical_crossentropy
-              else:
-                # case: categorical accuracy/crossentropy
-                if metric in ('accuracy', 'acc'):
-                  metric_fn = metrics_module.categorical_accuracy
-                elif metric in ('crossentropy', 'ce'):
-                  metric_fn = metrics_module.categorical_crossentropy
-              weighted_metric_fn = training_utils.weighted_masked_objective(
-                  metric_fn)
-            else:
-              metric_fn = metrics_module.get(metric)
-              weighted_metric_fn = training_utils.weighted_masked_objective(
-                  metric_fn)
-            metric_name = training_utils.get_base_metric_name(
+            metric_fn = training_utils.get_metric_function(
+                metric, output_shape=output_shape, loss_fn=loss_fn)
+            metric_name = training_utils.get_metric_name(
                 metric, weighted=weights is not None)
+
             with K.name_scope(metric_name):
+              weighted_metric_fn = training_utils.weighted_masked_objective(
+                  metric_fn)
               metric_result = weighted_metric_fn(
-                  y_true, y_pred, weights=weights, mask=masks[i])
+                  y_true, y_pred, weights=weights, mask=masks[i])  # pylint: disable=undefined-loop-variable
 
-            training_utils.add_metric_name(self, metric_name, i)
+            metric_name = training_utils.add_metric_name(self, metric_name, i)  # pylint: disable=undefined-loop-variable
             self.metrics_tensors.append(metric_result)
 
             # Keep track of state updates created by
@@ -537,16 +499,12 @@ class Model(Network):
               self.stateful_metric_functions.append(metric_fn)
               self.metrics_updates += metric_fn.updates
 
-        handle_metrics(output_metrics)
-        handle_metrics(output_weighted_metrics, weights=weights)
+        handle_metrics(output_metrics, output_shape, loss_fn)
+        handle_metrics(
+            output_weighted_metrics, output_shape, loss_fn, weights=weights)
 
     # Prepare gradient updates and state updates.
     self.total_loss = total_loss
-    self.sample_weights = sample_weights
-    self._feed_sample_weights = []
-    for i in range(len(self.sample_weights)):
-      if i not in skip_target_weighing_indices:
-        self._feed_sample_weights.append(self.sample_weights[i])
 
     # Functions for train, test and predict will
     # be compiled lazily when required.
@@ -561,6 +519,19 @@ class Model(Network):
     trainable_weights = self.trainable_weights
     self._collected_trainable_weights = trainable_weights
 
+  def _compile_distributed_model(self, distribution_strategy):
+    # TODO(anjalisridhar): Can we move the clone_and_build_model to outside the
+    # model?
+    def _clone_model_per_tower(model):
+      new_model = training_distributed.clone_and_build_model(model)
+      return new_model
+
+    with distribution_strategy.scope():
+      # Create a copy of this model on each of the devices.
+      grouped_models = distribution_strategy.call_for_each_tower(
+          _clone_model_per_tower, self)
+    return grouped_models
+
   def _check_trainable_weights_consistency(self):
     """Check trainable weights count consistency.
 
@@ -599,7 +570,7 @@ class Model(Network):
         # Unconditional updates
         updates += self.get_updates_for(None)
         # Conditional updates relevant to this model
-        updates += self.get_updates_for(self._feed_inputs)
+        updates += self.get_updates_for(self.inputs)
         # Stateful metrics updates
         updates += self.metrics_updates
         # Gets loss and metrics. Updates weights at each call.
@@ -608,7 +579,6 @@ class Model(Network):
             updates=updates,
             name='train_function',
             **self._function_kwargs)
-    self._post_build_cleanup()
 
   def _make_test_function(self):
     if not hasattr(self, 'test_function'):
@@ -626,7 +596,6 @@ class Model(Network):
           updates=self.state_updates + self.metrics_updates,
           name='test_function',
           **self._function_kwargs)
-    self._post_build_cleanup()
 
   def _make_predict_function(self):
     if not hasattr(self, 'predict_function'):
@@ -645,7 +614,6 @@ class Model(Network):
           updates=self.state_updates,
           name='predict_function',
           **kwargs)
-    self._post_build_cleanup()
 
   def _get_iterator_get_next_tensors(self, iterator):
     get_next_op = self._iterator_get_next.get(iterator, None)
@@ -654,6 +622,103 @@ class Model(Network):
       self._iterator_get_next[iterator] = get_next_op
     return get_next_op
 
+  def _distribution_standardize_user_data(self,
+                                          x,
+                                          y=None,
+                                          sample_weight=None,
+                                          class_weight=None,
+                                          batch_size=None,
+                                          check_steps=False,
+                                          steps_name='steps',
+                                          steps=None,
+                                          validation_split=0):
+    """Runs validation checks on input and target data passed by the user.
+
+    This is called when using DistributionStrategy to train, evaluate or serve
+    the model.
+
+    Args:
+      x: Input data. A `tf.data` dataset.
+      y: Since `x` is a dataset, `y` should not be specified
+        (since targets will be obtained from the iterator).
+      sample_weight: An optional sample-weight array passed by the user to
+        weight the importance of each sample in `x`.
+      class_weight: An optional class-weight array by the user to
+        weight the importance of samples in `x` based on the class they belong
+        to, as conveyed by `y`.
+      batch_size: Integer batch size. If provided, it is used to run additional
+        validation checks on stateful models.
+      check_steps: boolean, True if we want to check for validity of `steps` and
+        False, otherwise.
+      steps_name: The public API's parameter name for `steps`.
+      steps: Integer or `None`. Total number of steps (batches of samples) to
+        execute.
+      validation_split: Float between 0 and 1.
+        Fraction of the training data to be used as validation data.
+
+    Returns:
+      A tuple of 3 lists: input arrays, target arrays, sample-weight arrays.
+      If the model's input and targets are symbolic, these lists are empty
+      (since the model takes no user-provided data, instead the data comes
+      from the symbolic inputs/targets).
+
+    Raises:
+      ValueError: In case of invalid user-provided data.
+      RuntimeError: If the model was never compiled.
+    """
+    if sample_weight is not None and sample_weight.all():
+      raise ValueError('sample_weight is currently not supported when using '
+                       'DistributionStrategy.')
+    if class_weight:
+      raise ValueError('class_weight is currently not supported when using '
+                       'DistributionStrategy.')
+
+    # TODO(anjalisridhar): Can we use the iterator and getnext op cache?
+    # We require users to pass Datasets since we distribute the dataset across
+    # multiple devices.
+    if not isinstance(x, dataset_ops.Dataset):
+      raise ValueError('When using DistributionStrategy you must specify a '
+                       'Dataset object instead of a %s.' % type(x))
+    # TODO(anjalisridhar): We want distribute_dataset() to accept a Dataset or a
+    # function which returns a Dataset. Currently distribute_dataset() only
+    # accepts a function that returns a Dataset. Once we add support for being
+    # able to clone a Dataset on multiple workers we can remove this lambda.
+    result = self._distribution_strategy.distribute_dataset(lambda: x)
+    iterator = result.make_initializable_iterator()
+    K.get_session().run(iterator.initializer)
+    # Validates `steps` argument based on x's type.
+    if check_steps:
+      if steps is None:
+        raise ValueError('When using a Dataset instance as input to a model, '
+                         'you should specify the `{steps_name}` argument.'
+                         .format(steps_name=steps_name))
+
+    training_utils.validate_iterator_input(x, y, sample_weight,
+                                           validation_split)
+    # x an y may be PerDevice objects with an input and output tensor
+    # corresponding to each device. For example, x could be
+    # PerDevice:{device: get_next tensor,...}.
+    next_element = iterator.get_next()
+
+    if not isinstance(next_element, (list, tuple)) or len(next_element) != 2:
+      raise ValueError('Please provide data as a list or tuple of 2 elements '
+                       ' - input and target pair. Received %s' % next_element)
+    x, y = next_element
+    # Validate that all the elements in x and y are of the same type and shape.
+    # We can then pass the first element of x and y to `_standardize_weights`
+    # below and be confident of the output. We need to reopen the scope since
+    # we unwrap values when we validate x and y.
+    with self._distribution_strategy.scope():
+      x_values, y_values = distributed_training_utils.\
+        validate_distributed_dataset_inputs(self._distribution_strategy, x, y)
+
+    _, _, sample_weights = self._standardize_weights(x_values[0],
+                                                     y_values[0],
+                                                     sample_weight,
+                                                     class_weight,
+                                                     batch_size)
+    return x, y, sample_weights
+
   def _standardize_user_data(self,
                              x,
                              y=None,
@@ -716,6 +781,18 @@ class Model(Network):
       ValueError: In case of invalid user-provided data.
       RuntimeError: If the model was never compiled.
     """
+    if self._distribution_strategy:
+      return self._distribution_standardize_user_data(
+          x,
+          y,
+          sample_weight=sample_weight,
+          class_weight=class_weight,
+          batch_size=batch_size,
+          check_steps=check_steps,
+          steps_name=steps_name,
+          steps=steps,
+          validation_split=validation_split)
+
     if isinstance(x, dataset_ops.Dataset):
       if context.executing_eagerly():
         x = x.make_one_shot_iterator()
@@ -764,7 +841,12 @@ class Model(Network):
         raise ValueError('Please provide data as a list or tuple of 2 elements '
                          ' - input and target pair. Received %s' % next_element)
       x, y = next_element
+    x, y, sample_weights = self._standardize_weights(x, y, sample_weight,
+                                                     class_weight, batch_size)
+    return x, y, sample_weights
 
+  def _standardize_weights(self, x, y, sample_weight=None, class_weight=None,
+                           batch_size=None,):
     # First, we build/compile the model on the fly if necessary.
     all_inputs = []
     is_build_called = False
@@ -878,13 +960,7 @@ class Model(Network):
         exception_prefix='input')
 
     if y is not None:
-      if context.executing_eagerly():
-        feed_output_names = self.output_names
-        feed_output_shapes = None
-        # Sample weighting not supported in this case.
-        # TODO(fchollet): consider supporting it.
-        feed_sample_weight_modes = [None for _ in self.outputs]
-      elif not self._is_graph_network:
+      if not self._is_graph_network:
         feed_output_names = self._feed_output_names
         feed_output_shapes = None
         # Sample weighting not supported in this case.
@@ -897,7 +973,11 @@ class Model(Network):
         for output_shape, loss_fn in zip(self._feed_output_shapes,
                                          self._feed_loss_fns):
           if loss_fn is losses.sparse_categorical_crossentropy:
-            feed_output_shapes.append(output_shape[:-1] + (1,))
+            if K.image_data_format() == 'channels_first':
+              feed_output_shapes.append(
+                  (output_shape[0], 1) + output_shape[2:])
+            else:
+              feed_output_shapes.append(output_shape[:-1] + (1,))
           elif (not hasattr(loss_fn, '__name__') or
                 getattr(losses, loss_fn.__name__, None) is None):
             # If `loss_fn` is not a function (e.g. callable class)
@@ -928,11 +1008,12 @@ class Model(Network):
                                          feed_sample_weight_modes)
       ]
       # Check that all arrays have the same length.
-      training_utils.check_array_lengths(x, y, sample_weights)
-      if self._is_graph_network and not context.executing_eagerly():
-        # Additional checks to avoid users mistakenly using improper loss fns.
-        training_utils.check_loss_and_target_compatibility(
-            y, self._feed_loss_fns, feed_output_shapes)
+      if not self._distribution_strategy:
+        training_utils.check_array_lengths(x, y, sample_weights)
+        if self._is_graph_network and not context.executing_eagerly():
+          # Additional checks to avoid users mistakenly using improper loss fns.
+          training_utils.check_loss_and_target_compatibility(
+              y, self._feed_loss_fns, feed_output_shapes)
     else:
       y = []
       sample_weights = []
@@ -988,10 +1069,14 @@ class Model(Network):
         inputs = inputs[0]
 
       if tensor_util.is_tensor(inputs):
-        input_shape = (None,) + tuple(inputs.get_shape().as_list()[1:])
+        if context.executing_eagerly():
+          input_shape = (None,) + tuple(inputs.get_shape().as_list()[1:])
+          self.build(input_shape=input_shape)
+        else:
+          self.symbolic_set_inputs(inputs)
       else:
         input_shape = (None,) + inputs.shape[1:]
-      self.build(input_shape=input_shape)
+        self.build(input_shape=input_shape)
     elif context.executing_eagerly():
       self._eager_set_inputs(inputs)
     else:
@@ -1266,6 +1351,9 @@ class Model(Network):
       raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))
 
     # Validate and standardize user data.
+    if self._distribution_strategy:
+      distributed_training_utils.validate_callbacks(callbacks)
+
     x, y, sample_weights = self._standardize_user_data(
         x,
         y,
@@ -1346,6 +1434,17 @@ class Model(Network):
           initial_epoch=initial_epoch,
           steps_per_epoch=steps_per_epoch,
           validation_steps=validation_steps)
+    elif self._distribution_strategy:
+      return training_distributed.fit_loop(
+          self, x, y,
+          epochs=epochs,
+          verbose=verbose,
+          callbacks=callbacks,
+          val_inputs=val_x,
+          val_targets=val_y,
+          initial_epoch=initial_epoch,
+          steps_per_epoch=steps_per_epoch,
+          validation_steps=validation_steps)
     else:
       return training_arrays.fit_loop(
           self, x, y,
@@ -1438,12 +1537,29 @@ class Model(Network):
 
     if context.executing_eagerly():
       return training_eager.test_loop(
-          self, inputs=x, targets=y, sample_weights=sample_weights,
-          batch_size=batch_size, verbose=verbose, steps=steps)
+          self,
+          inputs=x,
+          targets=y,
+          sample_weights=sample_weights,
+          batch_size=batch_size,
+          verbose=verbose,
+          steps=steps)
+    elif self._distribution_strategy:
+      return training_distributed.test_loop(
+          self,
+          inputs=x,
+          targets=y,
+          verbose=verbose,
+          steps=steps)
     else:
       return training_arrays.test_loop(
-          self, inputs=x, targets=y, sample_weights=sample_weights,
-          batch_size=batch_size, verbose=verbose, steps=steps)
+          self,
+          inputs=x,
+          targets=y,
+          sample_weights=sample_weights,
+          batch_size=batch_size,
+          verbose=verbose,
+          steps=steps)
 
   def predict(self, x, batch_size=None, verbose=0, steps=None):
     """Generates output predictions for the input samples.
@@ -1488,6 +1604,9 @@ class Model(Network):
     if context.executing_eagerly():
       return training_eager.predict_loop(
           self, x, batch_size=batch_size, verbose=verbose, steps=steps)
+    elif self._distribution_strategy:
+      return training_distributed.predict_loop(
+          self, x, verbose=verbose, steps=steps)
     else:
       return training_arrays.predict_loop(
           self, x, batch_size=batch_size, verbose=verbose, steps=steps)
@@ -1535,6 +1654,9 @@ class Model(Network):
     Raises:
       ValueError: In case of invalid user-provided arguments.
     """
+    if self._distribution_strategy:
+      raise ValueError('`train_on_batch` is not supported for models '
+                       'compiled with DistributionStrategy.')
     # Validate and standardize user data.
     x, y, sample_weights = self._standardize_user_data(
         x, y, sample_weight=sample_weight, class_weight=class_weight)
@@ -1591,6 +1713,9 @@ class Model(Network):
     Raises:
         ValueError: In case of invalid user-provided arguments.
     """
+    if self._distribution_strategy:
+      raise ValueError('`test_on_batch` is not supported for models '
+                       'compiled with DistributionStrategy.')
     # Validate and standardize user data.
     x, y, sample_weights = self._standardize_user_data(
         x, y, sample_weight=sample_weight)
@@ -1628,6 +1753,9 @@ class Model(Network):
         ValueError: In case of mismatch between given number of inputs and
           expectations of the model.
     """
+    if self._distribution_strategy:
+      raise ValueError('`predict_on_batch` is not supported for models '
+                       'compiled with DistributionStrategy.')
     # Validate and standardize user data.
     inputs, _, _ = self._standardize_user_data(x)
     if context.executing_eagerly():
@@ -1891,3 +2019,45 @@ class Model(Network):
         workers=workers,
         use_multiprocessing=use_multiprocessing,
         verbose=verbose)
+
+
+class DistributedCallbackModel(Model):
+  """Model that is used for callbacks with DistributionStrategy."""
+
+  def __init__(self, model):
+    super(DistributedCallbackModel, self).__init__()
+    # TODO(anjalisridhar): Right now the only attributes set are the layer and
+    # weights. We may need to set additional attributes as needed since we have
+    # not called compile on this model.
+
+  def set_original_model(self, orig_model):
+    self._original_model = orig_model
+
+  def save_weights(self, filepath, overwrite=True, save_format=None):
+    self._replicated_model.save_weights(filepath, overwrite=overwrite,
+                                        save_format=save_format)
+
+  def save(self, filepath, overwrite=True, include_optimizer=True):
+    # save weights from the distributed model to the original model
+    distributed_model_weights = self.get_weights()
+    self._original_model.set_weights(distributed_model_weights)
+    # TODO(anjalisridhar): Do we need to save the original model here?
+    # Saving the first replicated model works as well.
+    self._original_model.save(filepath, overwrite=True, include_optimizer=False)
+
+  def load_weights(self, filepath, by_name=False):
+    self._original_model.load_weights(filepath, by_name=False)
+    # Copy the weights from the original model to each of the replicated models.
+    orig_model_weights = self._original_model.get_weights()
+    distributed_training_utils.set_weights(
+        self._original_model._distribution_strategy, self,  # pylint: disable=protected-access
+        orig_model_weights)
+
+  def __getattr__(self, item):
+    # Whitelisted atttributes of the model that can be accessed by the user
+    # during a callback.
+    if item not in ['_setattr_tracking']:
+      logging.warning('You are accessing attribute ' + item + 'of the'
+                      'DistributedCallbackModel that may not have been set'
+                      'correctly.')
+
diff --git a/tensorflow/python/keras/engine/training_arrays.py b/tensorflow/python/keras/engine/training_arrays.py
index e82f5c0332..d24f4b64b9 100644
--- a/tensorflow/python/keras/engine/training_arrays.py
+++ b/tensorflow/python/keras/engine/training_arrays.py
@@ -50,7 +50,6 @@ def fit_loop(model,
              val_targets=None,
              val_sample_weights=None,
              shuffle=True,
-             callback_metrics=None,
              initial_epoch=0,
              steps_per_epoch=None,
              validation_steps=None):
@@ -69,8 +68,6 @@ def fit_loop(model,
       val_targets: List of target arrays.
       val_sample_weights: Optional list of sample weight arrays.
       shuffle: Whether to shuffle the data at the beginning of each epoch
-      callback_metrics: List of strings, the display names of the metrics
-          passed to the callbacks. They should be the
           concatenation of list the display names of the outputs of
            `f` and the list of display names of the outputs of `f_val`.
       initial_epoch: Epoch at which to start training
@@ -121,15 +118,11 @@ def fit_loop(model,
 
   out_labels = model.metrics_names
   if do_validation:
-    callback_metrics = copy.copy(out_labels) + [
-        'val_' + n for n in out_labels
-    ]
-    if callbacks is not None and any(
-        [isinstance(callback, cbks.TensorBoard) for callback in callbacks]):
-      # need to create the test_function before start of the first epoch
-      # because TensorBoard callback on_epoch_begin adds summary to the
-      # list of fetches of the test_function
-      model._make_test_function()
+    callback_metrics = copy.copy(out_labels) + ['val_' + n for n in out_labels]
+    # need to create the test_function before start of the first epoch
+    # because TensorBoard callback on_epoch_begin adds summary to the
+    # list of fetches of the test_function
+    model._make_test_function()
   else:
     callback_metrics = copy.copy(out_labels)
 
@@ -162,7 +155,7 @@ def fit_loop(model,
 
   callbacks.set_model(callback_model)
 
-  callbacks.set_params({
+  callback_params = {
       'batch_size': batch_size,
       'epochs': epochs,
       'steps': steps_per_epoch,
@@ -170,11 +163,17 @@ def fit_loop(model,
       'verbose': verbose,
       'do_validation': do_validation,
       'metrics': callback_metrics or [],
-  })
-  callbacks.on_train_begin()
-  callback_model.stop_training = False
+  }
+  if validation_steps:
+    callback_params.update({'validation_steps': validation_steps})
+  callbacks.set_params(callback_params)
+
   for cbk in callbacks:
     cbk.validation_data = val_ins
+  # validation_data must be set before on_train_begin() is called
+  # so that TensorboardCallback can validate its input
+  callbacks.on_train_begin()
+  callback_model.stop_training = False
 
   # To prevent a slowdown, we find beforehand the arrays that need conversion.
   feed = model._feed_inputs + model._feed_targets + model._feed_sample_weights
@@ -193,9 +192,7 @@ def fit_loop(model,
     if steps_per_epoch is not None:
       # Step-wise fit loop.
       for step_index in range(steps_per_epoch):
-        batch_logs = {}
-        batch_logs['batch'] = step_index
-        batch_logs['size'] = 1
+        batch_logs = {'batch': step_index, 'size': 1}
         callbacks.on_batch_begin(step_index, batch_logs)
         try:
           outs = f(ins)
@@ -203,7 +200,9 @@ def fit_loop(model,
           logging.warning('Your dataset iterator ran out of data; '
                           'interrupting training. Make sure that your dataset '
                           'can generate at least `steps_per_epoch * epochs` '
-                          'batches (in this case, %d batches).' %
+                          'batches (in this case, %d batches). You may need to'
+                          'use the repeat() function when building your '
+                          'dataset.' %
                           steps_per_epoch * epochs)
           break
 
@@ -384,7 +383,9 @@ def predict_loop(model, inputs, batch_size=32, verbose=0, steps=None):
     return outs
 
 
-def test_loop(model, inputs, targets,
+def test_loop(model,
+              inputs,
+              targets,
               sample_weights=None,
               batch_size=None,
               verbose=0,
@@ -481,8 +482,7 @@ def test_loop(model, inputs, targets,
 
       if isinstance(batch_outs, list):
         if batch_index == 0:
-          for batch_out in enumerate(batch_outs):
-            outs.append(0.)
+          outs.extend([0.] * len(batch_outs))
         for i, batch_out in enumerate(batch_outs):
           if i in stateful_metric_indices:
             outs[i] = batch_out
diff --git a/tensorflow/python/keras/engine/training_distributed.py b/tensorflow/python/keras/engine/training_distributed.py
new file mode 100644
index 0000000000..75e466d593
--- /dev/null
+++ b/tensorflow/python/keras/engine/training_distributed.py
@@ -0,0 +1,455 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Part of the Keras training engine related to distributed training.
+"""
+# pylint: disable=protected-access
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import copy
+import numpy as np
+from tensorflow.python.framework import errors
+from tensorflow.python.keras import backend as K
+from tensorflow.python.keras import callbacks as cbks
+from tensorflow.python.keras import optimizers
+from tensorflow.python.keras.engine import distributed_training_utils
+from tensorflow.python.keras.utils.generic_utils import Progbar
+from tensorflow.python.platform import tf_logging as logging
+
+
+def fit_loop(
+    model,
+    inputs,
+    targets,
+    epochs=100,
+    verbose=1,
+    callbacks=None,
+    val_inputs=None,
+    val_targets=None,
+    callback_metrics=None,
+    initial_epoch=0,
+    steps_per_epoch=None,
+    validation_steps=None):
+  """fit function when using DistributionStrategy for training.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      epochs: Number of times to iterate over the data
+      verbose: Verbosity mode, 0, 1 or 2
+      callbacks: List of callbacks to be called during training
+      val_inputs: List of input arrays.
+      val_targets: List of target arrays.
+      callback_metrics: List of strings, the display names of the metrics
+          passed to the callbacks. They should be the
+          concatenation of list the display names of the outputs of
+           `f` and the list of display names of the outputs of `f_val`.
+      initial_epoch: Epoch at which to start training
+          (useful for resuming a previous training run)
+      steps_per_epoch: Total number of steps (batches of samples)
+          before declaring one epoch finished and starting the
+          next epoch. Ignored with the default value of `None`.
+      validation_steps: Number of steps to run validation for
+          (only if doing validation from data tensors).
+          Ignored with the default value of `None`.
+
+  Returns:
+      `History` object.
+
+  Raises:
+      ValueError: in case of invalid arguments.
+  """
+  current_strategy = model._distribution_strategy
+  def _per_device_train_function(model):
+    model._make_train_function()
+    return (model.train_function.inputs,
+            model.train_function.outputs,
+            model.train_function.updates_op,
+            model.train_function.session_kwargs)
+
+  with current_strategy.scope():
+    # Create train ops on each of the devices when we call
+    # `_per_device_train_function`.
+    (grouped_inputs, grouped_outputs, grouped_updates,
+     grouped_session_args) = current_strategy.call_for_each_tower(
+         _per_device_train_function, model._grouped_model)
+    # Unwrap all the per device values returned from `call_for_each_tower`.
+    # Unwrapping per device values gives you a list of values that can be
+    # used to construct a new train function that is composed of update ops on
+    # all the devices over which the model is distributed.
+    (all_inputs, all_outputs, all_updates,
+     all_session_args) = distributed_training_utils.unwrap_values(
+         current_strategy, grouped_inputs, grouped_outputs,
+         grouped_updates, grouped_session_args, with_loss_tensor=True)
+
+    # Dataset inputs and targets are also per devices values that need to be
+    # unwrapped.
+    dataset_inputs = distributed_training_utils.flatten_perdevice_values(
+        current_strategy, inputs)
+    dataset_targets = distributed_training_utils.flatten_perdevice_values(
+        current_strategy, targets)
+
+  # Create a train function that is composed of all the parameters above.
+  distributed_train_function = K.Function(
+      all_inputs, all_outputs,
+      updates=all_updates,
+      name='distributed_train_function',
+      **all_session_args)
+
+  # We need to set sample_weights to None since there are sample weight
+  # placeholders that are created with default values.
+  sample_weights = [None for _ in range(len(model.outputs) *
+                                        current_strategy.num_towers)]
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = dataset_inputs + dataset_targets + sample_weights + [1]
+  else:
+    ins = dataset_inputs + dataset_targets
+
+  do_validation = False
+  if validation_steps:
+    do_validation = True
+    if steps_per_epoch is None:
+      raise ValueError('Can only use `validation_steps` '
+                       'when doing step-wise '
+                       'training, i.e. `steps_per_epoch` '
+                       'must be set.')
+  out_labels = model.metrics_names
+  if do_validation:
+    callback_metrics = copy.copy(out_labels) + [
+        'val_' + n for n in out_labels
+    ]
+  else:
+    callback_metrics = copy.copy(out_labels)
+
+  model.history = cbks.History()
+  all_callbacks = [cbks.BaseLogger(
+      stateful_metrics=model.stateful_metric_names)]
+  if verbose:
+    # We assume that `steps_per_epoch` is always set since we have to use
+    # Datasets.
+    count_mode = 'steps'
+
+    all_callbacks.append(
+        cbks.ProgbarLogger(
+            count_mode, stateful_metrics=model.stateful_metric_names))
+  all_callbacks += (callbacks or []) + [model.history]
+  callbacks = cbks.CallbackList(all_callbacks)
+  out_labels = out_labels or []
+
+  # We set the callback model to an instance of the `DistributedModel` that we
+  # create in the  `compile` call. The `DistributedModel` is initialized with
+  # the first replicated model. We need to set the callback model to a
+  # DistributedModel to allow us to override saving and loading weights when
+  # we checkpoint the model during training.
+  callback_model = model._replicated_model
+
+  callbacks.set_model(callback_model)
+
+  callbacks.set_params({
+      'epochs': epochs,
+      'steps': steps_per_epoch,
+      'samples': None,
+      'verbose': verbose,
+      'do_validation': do_validation,
+      'metrics': callback_metrics or [],
+  })
+  callbacks.on_train_begin()
+  callback_model.stop_training = False
+
+  out_labels = out_labels or []
+
+  # Copy the weights from the original model to each of the replicated models.
+  orig_model_weights = model.get_weights()
+  with current_strategy.scope():
+    distributed_model = current_strategy.unwrap(model._grouped_model)[0]
+    distributed_training_utils.set_weights(
+        current_strategy, distributed_model, orig_model_weights)
+
+  for epoch in range(initial_epoch, epochs):
+    callbacks.on_epoch_begin(epoch)
+    if steps_per_epoch is not None:
+      epoch_logs = {}
+      for step_index in range(steps_per_epoch):
+        batch_logs = {}
+        batch_logs['batch'] = step_index
+        batch_logs['size'] = 1
+        callbacks.on_batch_begin(step_index, batch_logs)
+        try:
+          outs = distributed_train_function(ins)
+        except errors.OutOfRangeError:
+          logging.warning('Your dataset iterator ran out of data; '
+                          'interrupting training. Make sure that your dataset '
+                          'can generate at least `steps_per_epoch * epochs` '
+                          'batches (in this case, %d batches).' %
+                          steps_per_epoch * epochs)
+          break
+
+        if not isinstance(outs, list):
+          outs = [outs]
+
+        # TODO(anjalisridhar): Temporary workaround for aggregating metrics
+        # across towers. Replace with the new metrics module eventually.
+        merged_output = []
+        # The first output is the total loss.
+        merged_output.append(outs[0])
+        current_index = 1
+        num_devices = len(current_strategy._devices)
+        # Each label in `out_labels` corresponds to one set of metrics. The
+        # number of metric values corresponds to the number of devices. We
+        # currently take the mean of the values.
+        for _ in out_labels[1:]:
+          m = np.mean(outs[current_index:current_index + num_devices])
+          merged_output.append(m)
+          current_index += num_devices
+
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+        callbacks.on_batch_end(step_index, batch_logs)
+        if callback_model.stop_training:
+          break
+      if do_validation:
+        val_outs = test_loop(
+            model,
+            val_inputs,
+            val_targets,
+            steps=validation_steps,
+            verbose=0)
+        if not isinstance(val_outs, list):
+          val_outs = [val_outs]
+        # Same labels assumed.
+        for l, o in zip(out_labels, val_outs):
+          epoch_logs['val_' + l] = o
+
+    callbacks.on_epoch_end(epoch, epoch_logs)
+    if callback_model.stop_training:
+      break
+  callbacks.on_train_end()
+
+  # Copy the weights back from the replicated model to the original model.
+  with current_strategy.scope():
+    updated_weights = current_strategy.unwrap(
+        model._grouped_model)[0].get_weights()
+    model.set_weights(updated_weights)
+  return model.history
+
+
+def test_loop(model, inputs, targets, verbose=0, steps=None):
+  """evaluate method to validate a model that uses DistributionStrategy.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      verbose: verbosity mode.
+      steps: Total number of steps (batches of samples)
+          before declaring predictions finished.
+          Ignored with the default value of `None`.
+
+  Returns:
+      Scalar loss (if the model has a single output and no metrics)
+      or list of scalars (if the model has multiple outputs
+      and/or metrics). The attribute `model.metrics_names` will give you
+      the display labels for the scalar outputs.
+  """
+  current_strategy = model._distribution_strategy
+  def _per_device_test_function(model):
+    model._make_test_function()
+    return (model.test_function.inputs,
+            model.test_function.outputs,
+            model.test_function.updates_op,
+            model.test_function.session_kwargs)
+
+  with current_strategy.scope():
+    (grouped_inputs, grouped_outputs, grouped_updates,
+     grouped_session_args) = current_strategy.call_for_each_tower(
+         _per_device_test_function, model._grouped_model)
+
+    (all_inputs, all_outputs, all_updates,
+     all_session_args) = distributed_training_utils.unwrap_values(
+         current_strategy, grouped_inputs, grouped_outputs, grouped_updates,
+         grouped_session_args, with_loss_tensor=True)
+
+    dataset_inputs = distributed_training_utils.flatten_perdevice_values(
+        current_strategy, inputs)
+    dataset_targets = distributed_training_utils.flatten_perdevice_values(
+        current_strategy, targets)
+
+  distributed_test_function = K.Function(
+      all_inputs, all_outputs,
+      updates=all_updates,
+      name='distributed_test_function',
+      **all_session_args)
+
+  # We need to set sample_weights to None since there are sample weight
+  # placeholders that are created with default values.
+  sample_weights = [None for _ in range(len(model.outputs) *
+                                        current_strategy.num_towers)]
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = dataset_inputs + dataset_targets + sample_weights + [0]
+  else:
+    ins = dataset_inputs + dataset_targets
+
+  if hasattr(model, 'metrics'):
+    for m in model.stateful_metric_functions:
+      m.reset_states()
+    stateful_metric_indices = [
+        i for i, name in enumerate(model.metrics_names)
+        if str(name) in model.stateful_metric_names
+    ]
+  else:
+    stateful_metric_indices = []
+
+  outs = []
+  if verbose == 1:
+    progbar = Progbar(target=steps)
+
+  # Copy the weights from the original model to each of the replicated models.
+  orig_model_weights = model.get_weights()
+  with current_strategy.scope():
+    distributed_model = current_strategy.unwrap(model._grouped_model)[0]
+    distributed_training_utils.set_weights(
+        current_strategy, distributed_model, orig_model_weights)
+
+  if steps is not None:
+    for step in range(steps):
+      batch_outs = distributed_test_function(ins)
+      if isinstance(batch_outs, list):
+        if step == 0:
+          for _ in enumerate(batch_outs):
+            outs.append(0.)
+        for i, batch_out in enumerate(batch_outs):
+          if i in stateful_metric_indices:
+            outs[i] = batch_out
+          else:
+            outs[i] += batch_out
+      else:
+        if step == 0:
+          outs.append(0.)
+        outs[0] += batch_outs
+      if verbose == 1:
+        progbar.update(step + 1)
+    for i in range(len(outs)):
+      if i not in stateful_metric_indices:
+        outs[i] /= steps
+  if len(outs) == 1:
+    return outs[0]
+  return outs
+
+
+def predict_loop(model, inputs, verbose=0, steps=None):
+  """Abstract method to loop over some data in batches.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: list of tensors to be fed to `f`.
+      verbose: verbosity mode.
+      steps: Total number of steps (batches of samples)
+          before declaring `_predict_loop` finished.
+          Ignored with the default value of `None`.
+
+  Returns:
+      Array of predictions (if the model has a single output)
+      or list of arrays of predictions
+      (if the model has multiple outputs).
+  """
+  current_strategy = model._distribution_strategy
+  def _per_device_predict_function(model):
+    model._make_predict_function()
+    return (model.predict_function.inputs,
+            model.predict_function.outputs,
+            model.predict_function.updates_op,
+            model.predict_function.session_kwargs)
+
+  with current_strategy.scope():
+    (grouped_inputs, grouped_outputs, grouped_updates,
+     grouped_session_args) = current_strategy.call_for_each_tower(
+         _per_device_predict_function, model._grouped_model)
+
+    (all_inputs, all_outputs, all_updates,
+     all_session_args) = distributed_training_utils.unwrap_values(
+         current_strategy, grouped_inputs, grouped_outputs, grouped_updates,
+         grouped_session_args)
+
+    dataset_inputs = distributed_training_utils.flatten_perdevice_values(
+        current_strategy, inputs)
+
+  distributed_predict_function = K.Function(
+      all_inputs, all_outputs,
+      updates=all_updates,
+      name='distributed_predict_function',
+      **all_session_args)
+
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = dataset_inputs + [0]
+  else:
+    ins = dataset_inputs
+
+  if verbose == 1:
+    progbar = Progbar(target=steps)
+
+  # Copy the weights from the original model to each of the replicated models.
+  orig_model_weights = model.get_weights()
+  with current_strategy.scope():
+    distributed_model = current_strategy.unwrap(model._grouped_model)[0]
+    distributed_training_utils.set_weights(
+        current_strategy, distributed_model, orig_model_weights)
+
+  if steps is not None:
+    # Since we do not know how many samples we will see, we cannot pre-allocate
+    # the returned Numpy arrays. Instead, we store one array per batch seen
+    # and concatenate them upon returning.
+    unconcatenated_outs = []
+    for step in range(steps):
+      batch_outs = distributed_predict_function(ins)
+      if not isinstance(batch_outs, list):
+        batch_outs = [batch_outs]
+      if step == 0:
+        for _ in batch_outs:
+          unconcatenated_outs.append([])
+      for i, batch_out in enumerate(batch_outs):
+        unconcatenated_outs[i].append(batch_out)
+      if verbose == 1:
+        progbar.update(step + 1)
+    if len(unconcatenated_outs) == 1:
+      return np.concatenate(unconcatenated_outs[0], axis=0)
+    return [
+        np.concatenate(unconcatenated_outs[i], axis=0)
+        for i in range(len(unconcatenated_outs))
+    ]
+
+
+def clone_and_build_model(model):
+  """Clone and build the given keras_model."""
+  # We need to set the import here since we run into a circular dependency
+  # error.
+  from tensorflow.python.keras import models  # pylint: disable=g-import-not-at-top
+  cloned_model = models.clone_model(model, input_tensors=None)
+
+  # Compile and build model.
+  if isinstance(model.optimizer, optimizers.TFOptimizer):
+    optimizer = model.optimizer
+  else:
+    optimizer_config = model.optimizer.get_config()
+    optimizer = model.optimizer.__class__.from_config(optimizer_config)
+
+  cloned_model.compile(
+      optimizer,
+      model.loss,
+      metrics=model.metrics,
+      loss_weights=model.loss_weights,
+      sample_weight_mode=model.sample_weight_mode,
+      weighted_metrics=model.weighted_metrics)
+  return cloned_model
diff --git a/tensorflow/python/keras/engine/training_eager.py b/tensorflow/python/keras/engine/training_eager.py
index e8838cd3bc..774d2e44f3 100644
--- a/tensorflow/python/keras/engine/training_eager.py
+++ b/tensorflow/python/keras/engine/training_eager.py
@@ -30,36 +30,11 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.keras import backend
 from tensorflow.python.keras import callbacks as cbks
-from tensorflow.python.keras import losses
-from tensorflow.python.keras import metrics as metrics_module
 from tensorflow.python.keras.engine import training_utils
 from tensorflow.python.keras.utils import generic_utils
-from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
 
 
-def _get_metrics_info(metric, internal_output_shapes=None, loss_func=None):
-  if metric == 'accuracy' or metric == 'acc':
-    # custom handling of accuracy
-    # (because of class mode duality)
-    output_shape = internal_output_shapes
-    if output_shape[-1] == 1 or loss_func == losses.binary_crossentropy:
-      # case: binary accuracy
-      acc_fn = metrics_module.binary_accuracy
-    elif loss_func == losses.sparse_categorical_crossentropy:
-      # case: categorical accuracy with sparse targets
-      acc_fn = metrics_module.sparse_categorical_accuracy
-    else:
-      acc_fn = metrics_module.categorical_accuracy
-
-    metric_name = 'acc'
-    return metric_name, acc_fn
-  else:
-    metric_fn = metrics_module.get(metric)
-    metric_name = metric_fn.__name__
-    return metric_name, metric_fn
-
-
 def _eager_loss_fn(outputs, targets, loss_fn, output_name):
   with backend.name_scope(output_name + '_loss'):
     loss = loss_fn(targets, outputs)
@@ -75,9 +50,8 @@ def _eager_metrics_fn(model, outputs, targets):
       targets: The predictions or targets of the given model.
 
   Returns:
-      Returns the metric names and metric results for each output of the model.
+      Returns the metric results for each output of the model.
   """
-  metric_names = []
   metric_results = []
   if not isinstance(outputs, list):
     outputs = [outputs]
@@ -88,18 +62,15 @@ def _eager_metrics_fn(model, outputs, targets):
   for i in range(len(model.outputs)):
     output_metrics = model.nested_metrics[i]
     for nested_output_metric in output_metrics:
-      metric_name, metric_fn = _get_metrics_info(
+      metric_fn = training_utils.get_metric_function(
           nested_output_metric, backend.int_shape(model.outputs[i]),
           model.loss_functions[i])
-
-      if len(model.output_names) > 1:
-        metric_name = model.output_names[i] + '_' + metric_name
-        if metric_name not in model.metrics_names:
-          model.metrics_names.append(metric_name)
+      # weighted metrics are not supported in eager mode
+      metric_name = training_utils.get_metric_name(
+          nested_output_metric, weighted=False)
 
       with backend.name_scope(metric_name):
         metric_result = metric_fn(targets[i], outputs[i])
-        metric_names.append(metric_name)
         metric_results.append(backend.mean(metric_result))
 
   return metric_results
@@ -121,21 +92,23 @@ def _model_loss(model, inputs, targets, sample_weights=None, training=False):
      applies masking and sample weighting to the loss value.
   """
   total_loss = 0
+  kwargs = {}
+  if model._expects_training_arg:
+    kwargs['training'] = training
   if len(inputs) == 1:
-    if model._expects_training_arg:
-      outs = model.call(inputs[0], training=training)
-    else:
-      outs = model.call(inputs[0])
+    inputs = inputs[0]
+
+  if model._is_graph_network:
+    outs, masks = model._call_and_compute_mask(inputs, **kwargs)
+    masks = generic_utils.to_list(masks)
   else:
-    if model._expects_training_arg:
-      outs = model.call(inputs, training=training)
-    else:
-      outs = model.call(inputs)
-  if not isinstance(outs, list):
-    outs = [outs]
+    outs = model.call(inputs, **kwargs)
+    masks = None
 
-  if not isinstance(targets, list):
-    targets = [targets]
+  outs = generic_utils.to_list(outs)
+  if masks is None:
+    masks = [None for _ in outs]
+  targets = generic_utils.to_list(targets)
 
   loss_metrics = []
   with backend.name_scope('loss'):
@@ -144,10 +117,7 @@ def _model_loss(model, inputs, targets, sample_weights=None, training=False):
         weights = sample_weights[i]
       else:
         weights = None
-
-      # TODO(fchollet): support masking; in practice `_keras_mask` is never
-      # set in this context currently.
-      mask = outs[i]._keras_mask
+      mask = masks[i]
 
       weighted_masked_fn = training_utils.weighted_masked_objective(loss_fn)
       with backend.name_scope(model.output_names[i] + '_loss'):
@@ -194,7 +164,8 @@ def iterator_fit_loop(model,
                       callbacks=None,
                       callback_metrics=None,
                       validation_steps=None,
-                      do_validation=False):
+                      do_validation=False,
+                      batch_size=None):
   """Fit function for eager execution when input is given as dataset iterator.
 
   Updates the given epoch logs.
@@ -224,16 +195,23 @@ def iterator_fit_loop(model,
       validation_steps: Number of steps to run validation for (only if doing
         validation from data tensors). Ignored with default value of `None`.
       do_validation: Boolean value indicating whether we should do validation.
+      batch_size: int, val_inputs and val_targets will be evaled batch by
+        batch with size batch_size if they are array.
 
   Raises:
       ValueError: In case of mismatch between given number of inputs and
         expectations of the model.
   """
   assert isinstance(inputs, iterator_ops.EagerIterator)
+
+  # make sure either x,y or x,y,sample_weights is provided
+  if (not isinstance(inputs.output_shapes, (list, tuple)) or
+      len(inputs.output_shapes) not in (2, 3)):
+    raise ValueError('Please provide either inputs and targets'
+                     'or inputs, targets, and sample_weights')
+
   for step_index in range(steps_per_epoch):
-    batch_logs = {}
-    batch_logs['batch'] = step_index
-    batch_logs['size'] = 1
+    batch_logs = {'batch': step_index, 'size': 1}
     callbacks.on_batch_begin(step_index, batch_logs)
 
     # Get data from the iterator.
@@ -241,25 +219,28 @@ def iterator_fit_loop(model,
       next_element = inputs.get_next()
     except errors.OutOfRangeError:
       logging.warning(
-          'Your dataset iterator ran out of data; '
-          'interrupting training. Make sure that your dataset'
-          ' can generate at least `steps_per_epoch * epochs` '
-          'batches (in this case, %d batches).' % steps_per_epoch * epochs)
+          'Your dataset iterator ran out of data; interrupting training. Make '
+          'sure that your dataset can generate at least '
+          '`steps_per_epoch * epochs` batches (in this case, %d batches). You '
+          'may need to use the repeat() function when building your '
+          'dataset.' % steps_per_epoch * epochs)
       break
 
-    if not isinstance(next_element, (list, tuple)) or len(next_element) != 2:
-      raise ValueError('Please provide data as a list or tuple of 2 elements '
-                       ' - input and target pair. Received %s' % next_element)
-    x, y = next_element
+    if len(inputs.output_shapes) == 2:
+      x, y = next_element
+      sample_weights = None
+    else:
+      x, y, sample_weights = next_element
 
     # Validate and standardize data.
     x, y, sample_weights = model._standardize_user_data(
-        x, y, class_weight=class_weight)
+        x, y, sample_weight=sample_weights, class_weight=class_weight)
     x = training_utils.cast_if_floating_dtype(x)
     y = training_utils.cast_if_floating_dtype(y)
     if sample_weights:
       sample_weights = [
-          ops.convert_to_tensor(val, dtype=backend.floatx())
+          training_utils.cast_if_floating_dtype(
+              ops.convert_to_tensor(val, dtype=backend.floatx()))
           if val is not None else None for val in sample_weights
       ]
 
@@ -307,122 +288,8 @@ def iterator_fit_loop(model,
             val_targets,
             sample_weights=val_sample_weights,
             steps=validation_steps,
-            verbose=0)
-        if not isinstance(val_outs, list):
-          val_outs = [val_outs]
-        # Same labels assumed.
-        for l, o in zip(out_labels, val_outs):
-          epoch_logs['val_' + l] = o
-
-
-def batch_fit_loop(model,
-                   inputs,
-                   targets,
-                   epoch_logs,
-                   index_array,
-                   out_labels,
-                   callback_model,
-                   batch_size,
-                   sample_weights=None,
-                   val_inputs=None,
-                   val_targets=None,
-                   val_sample_weights=None,
-                   callbacks=None,
-                   shuffle=True,
-                   num_train_samples=None,
-                   do_validation=False):
-  """Fit function for eager execution when input is given as arrays or tensors.
-
-  Updates the given epoch logs.
-
-  Arguments:
-      model: Instance of the `Model`.
-      inputs: List of input arrays.
-      targets: List of target arrays.
-      epoch_logs: Dictionary of logs from every epoch.
-      index_array: Index array generated from number of training samples.
-      out_labels: Output labels generated from model metric names.
-      callback_model: Instance of `Model` to callback.
-      batch_size: Integer batch size or None if unknown.
-      sample_weights: Optional list of sample weight arrays.
-      val_inputs: Input data for validation.
-      val_targets: Target data for validation.
-      val_sample_weights: Sample weight data for validation.
-      callbacks: List of callbacks to be called during training.
-      shuffle: Whether to shuffle the data at the beginning of each epoch.
-      num_train_samples: Integer number of training samples.
-      do_validation: Boolean value indicating whether we should do validation.
-  """
-  # TODO(psv): Create a dataset iterator instead of manually creating batches
-  # here and in batch_test_loop, batch_predict_loop.
-  if shuffle == 'batch':
-    index_array = model._batch_shuffle(index_array, batch_size)
-  elif shuffle:
-    np.random.shuffle(index_array)
-
-  batches = generic_utils.make_batches(num_train_samples, batch_size)
-
-  for batch_index, (batch_start, batch_end) in enumerate(batches):
-    batch_ids = index_array[batch_start:batch_end]
-    inputs_batch = slice_arrays(inputs, batch_ids, contiguous=not shuffle)
-    targets_batch = slice_arrays(targets, batch_ids, contiguous=not shuffle)
-    if sample_weights:
-      sample_weights_batch = slice_arrays(
-          sample_weights, batch_ids, contiguous=not shuffle)
-    else:
-      sample_weights_batch = None
-    batch_logs = {}
-    batch_logs['batch'] = batch_index
-    batch_logs['size'] = len(batch_ids)
-
-    callbacks.on_batch_begin(batch_index, batch_logs)
-
-    inputs_batch = [
-        ops.convert_to_tensor(val, dtype=backend.floatx())
-        for val in inputs_batch
-    ]
-    targets_batch = [
-        ops.convert_to_tensor(val, dtype=backend.floatx())
-        for val in targets_batch
-    ]
-    if sample_weights:
-      sample_weights_batch = [
-          ops.convert_to_tensor(val, dtype=backend.floatx())
-          if val is not None else None for val in sample_weights_batch
-      ]
-
-    outs, loss, loss_metrics = _process_single_batch(
-        model,
-        inputs_batch,
-        targets_batch,
-        sample_weights=sample_weights_batch,
-        training=True)
-
-    if not isinstance(outs, list):
-      outs = [outs]
-
-    for l, o in zip(out_labels, outs):
-      batch_logs[l] = o
-    # Required for eager execution
-    metrics_results = _eager_metrics_fn(model, outs, targets_batch)
-    batch_logs['loss'] = tensor_util.constant_value(backend.mean(loss))
-
-    for k, v in zip(model.metrics_names,
-                    [backend.mean(loss)] + loss_metrics + metrics_results):
-      batch_logs[k] = tensor_util.constant_value(v)
-    callbacks.on_batch_end(batch_index, batch_logs)
-    if callback_model.stop_training:
-      break
-
-    if batch_index == len(batches) - 1:  # Last batch.
-      if do_validation:
-        val_outs = test_loop(
-            model,
-            val_inputs,
-            val_targets,
-            sample_weights=val_sample_weights,
-            batch_size=batch_size,
-            verbose=0)
+            verbose=0,
+            batch_size=batch_size)
         if not isinstance(val_outs, list):
           val_outs = [val_outs]
         # Same labels assumed.
@@ -451,6 +318,11 @@ def iterator_test_loop(model, inputs, steps, verbose=0):
         expectations of the model.
   """
   assert isinstance(inputs, iterator_ops.EagerIterator)
+  # make sure either x,y or x,y,sample_weights is provided
+  if (not isinstance(inputs.output_shapes, (list, tuple)) or
+      len(inputs.output_shapes) < 2 or len(inputs.output_shapes) > 3):
+    raise ValueError('Please provide either inputs and targets'
+                     'or inputs, targets, and sample_weights')
   outs = []
   num_samples = 0
   if verbose == 1:
@@ -463,18 +335,27 @@ def iterator_test_loop(model, inputs, steps, verbose=0):
       logging.warning(
           'Your dataset iterator ran out of data interrupting testing. '
           'Make sure that your dataset can generate at least `steps` batches '
-          '(in this case, %d batches).', steps)
+          '(in this case, %d batches). You may need to use the repeat() '
+          'function when building your dataset.', steps)
       break
 
-    if not isinstance(next_element, (list, tuple)) or len(next_element) != 2:
-      raise ValueError('Please provide data as a list or tuple of 2 elements '
-                       ' - input and target pair. Received %s' % next_element)
-    x, y = next_element
+    if len(inputs.output_shapes) == 2:
+      x, y = next_element
+      sample_weights = None
+    else:
+      x, y, sample_weights = next_element
 
     # Validate and standardize data.
-    x, y, sample_weights = model._standardize_user_data(x, y)
+    x, y, sample_weights = model._standardize_user_data(
+        x, y, sample_weight=sample_weights)
     x = training_utils.cast_if_floating_dtype(x)
     y = training_utils.cast_if_floating_dtype(y)
+    if sample_weights:
+      sample_weights = [
+          training_utils.cast_if_floating_dtype(
+              ops.convert_to_tensor(val, dtype=backend.floatx()))
+          if val is not None else None for val in sample_weights
+      ]
 
     # Calculate model output, loss values.
     loss_outs, loss, loss_metrics = _model_loss(
@@ -512,94 +393,6 @@ def iterator_test_loop(model, inputs, steps, verbose=0):
   return outs
 
 
-def batch_test_loop(model,
-                    inputs,
-                    targets,
-                    batch_size,
-                    sample_weights=None,
-                    verbose=0):
-  """Test function for eager execution when input is given as arrays or tensors.
-
-  Arguments:
-      model: Model instance that is being evaluated in Eager mode.
-      inputs: List of input arrays.
-      targets: List of target arrays.
-      batch_size: Integer batch size.
-      sample_weights: Optional list of sample weight arrays.
-      verbose: Verbosity mode.
-
-  Returns:
-      Scalar loss (if the model has a single output and no metrics)
-      or list of scalars (if the model has multiple outputs
-      and/or metrics). The attribute `model.metrics_names` will give you
-      the display labels for the scalar outputs.
-  """
-  outs = []
-  feed_data = inputs + targets
-  if sample_weights:
-    feed_data += sample_weights
-  num_samples = training_utils.check_num_samples(
-      feed_data, batch_size=batch_size)
-  if verbose == 1:
-    progbar = generic_utils.Progbar(target=num_samples)
-  batches = generic_utils.make_batches(num_samples, batch_size)
-  index_array = np.arange(num_samples)
-  for batch_index, (batch_start, batch_end) in enumerate(batches):
-    batch_ids = index_array[batch_start:batch_end]
-    inputs_batch = slice_arrays(inputs, batch_ids)
-    targets_batch = slice_arrays(targets, batch_ids)
-    if sample_weights:
-      sample_weights_batch = slice_arrays(sample_weights, batch_ids)
-    else:
-      sample_weights_batch = None
-
-    inputs_batch = [
-        ops.convert_to_tensor(val, dtype=backend.floatx())
-        for val in inputs_batch
-    ]
-    targets_batch = [
-        ops.convert_to_tensor(val, dtype=backend.floatx())
-        for val in targets_batch
-    ]
-    if sample_weights:
-      sample_weights_batch = [
-          ops.convert_to_tensor(val, dtype=backend.floatx())
-          if val is not None else None for val in sample_weights_batch
-      ]
-
-    loss_outs, loss, loss_metrics = _model_loss(
-        model,
-        inputs_batch,
-        targets_batch,
-        sample_weights=sample_weights_batch,
-        training=False)
-    metrics_results = _eager_metrics_fn(model, loss_outs, targets_batch)
-    batch_outs = []
-    for _, v in zip(model.metrics_names,
-                    [backend.mean(loss)] + loss_metrics + metrics_results):
-      batch_outs.append(tensor_util.constant_value(v))
-
-    if isinstance(batch_outs, list):
-      if batch_index == 0:
-        for _ in enumerate(batch_outs):
-          outs.append(0.)
-      for i, batch_out in enumerate(batch_outs):
-        outs[i] += batch_out * len(batch_ids)
-    else:
-      if batch_index == 0:
-        outs.append(0.)
-      outs[0] += batch_outs * len(batch_ids)
-
-    if verbose == 1:
-      progbar.update(batch_end)
-
-  for i in range(len(outs)):
-    outs[i] /= num_samples
-  if len(outs) == 1:
-    return outs[0]
-  return outs
-
-
 def iterator_predict_loop(model, inputs, steps, verbose=0):
   """Predict function for eager execution when input is dataset iterator.
 
@@ -619,6 +412,12 @@ def iterator_predict_loop(model, inputs, steps, verbose=0):
         expectations of the model.
   """
   assert isinstance(inputs, iterator_ops.EagerIterator)
+  if not isinstance(inputs.output_shapes,
+                    (list, tuple)) or len(inputs.output_shapes) > 2:
+    raise ValueError(
+        'Please provide data as a list or tuple of 1 or 2 elements '
+        ' - input or input and target pair. Received %s. We do not use the '
+        '`target` value here.' % inputs.output_shapes)
   outs = []
   if verbose == 1:
     progbar = generic_utils.Progbar(target=steps)
@@ -628,18 +427,14 @@ def iterator_predict_loop(model, inputs, steps, verbose=0):
       next_element = inputs.get_next()
     except errors.OutOfRangeError:
       logging.warning(
-          'Your dataset iterator ran out of data; '
-          'interrupting prediction. Make sure that your '
-          'dataset can generate at least `steps` '
-          'batches (in this case, %d batches).', steps)
+          'Your dataset iterator ran out of data; interrupting prediction. '
+          'Make sure that your dataset can generate at least `steps` batches '
+          '(in this case, %d batches). You may need to use the repeat() '
+          'function when building your dataset.', steps)
       break
 
-    if not isinstance(next_element, (list, tuple)) or len(next_element) != 2:
-      raise ValueError(
-          'Please provide data as a list or tuple of 2 elements '
-          ' - input and target pair. Received %s. We do not use the '
-          '`target` value here.' % next_element)
-    x, _ = next_element
+    # expects a tuple, where first element of tuple represents inputs
+    x = next_element[0]
 
     # Validate and standardize data.
     x, _, _ = model._standardize_user_data(x)
@@ -670,99 +465,6 @@ def iterator_predict_loop(model, inputs, steps, verbose=0):
   return outs
 
 
-def batch_predict_loop(model, inputs, batch_size, verbose=0):
-  """Predict function for eager execution when input is arrays or tensors.
-
-  Arguments:
-      model: Instance of `Model`.
-      inputs: List of input arrays.
-      batch_size: Integer batch size.
-      verbose: Verbosity mode.
-
-  Returns:
-      Array of predictions (if the model has a single output)
-      or list of arrays of predictions (if the model has multiple outputs).
-  """
-  outs = []
-  num_samples = training_utils.check_num_samples(inputs, batch_size)
-  if verbose == 1:
-    progbar = generic_utils.Progbar(target=num_samples)
-  batches = generic_utils.make_batches(num_samples, batch_size)
-  index_array = np.arange(num_samples)
-  for batch_index, (batch_start, batch_end) in enumerate(batches):
-    batch_ids = index_array[batch_start:batch_end]
-    inputs_batch = slice_arrays(inputs, batch_ids)
-
-    inputs_batch = [
-        ops.convert_to_tensor(val, dtype=backend.floatx())
-        for val in inputs_batch
-    ]
-
-    if len(inputs_batch) == 1:
-      if model._expects_training_arg:
-        batch_outs = model.call(inputs_batch[0], training=False)
-      else:
-        batch_outs = model.call(inputs_batch[0])
-    else:
-      if model._expects_training_arg:
-        batch_outs = model.call(inputs_batch, training=False)
-      else:
-        batch_outs = model.call(inputs_batch)
-
-    if not isinstance(batch_outs, list):
-      batch_outs = [batch_outs]
-    if batch_index == 0:
-      # Pre-allocate the results arrays.
-      for batch_out in batch_outs:
-        dims = batch_out.shape[1:].dims
-        dims_list = [d.value for d in dims]
-        shape = (num_samples,) + tuple(dims_list)
-        outs.append(np.zeros(shape, dtype=batch_out.dtype.as_numpy_dtype))
-    for i, batch_out in enumerate(batch_outs):
-      outs[i][batch_start:batch_end] = batch_out
-    if verbose == 1:
-      progbar.update(batch_end)
-
-  if len(outs) == 1:
-    return outs[0]
-  return outs
-
-
-def slice_arrays(arrays, indices, contiguous=True):
-  """Slices batches out of provided arrays (workaround for eager tensors).
-
-  Unfortunately eager tensors don't have the same slicing behavior as
-  Numpy arrays (they follow the same slicing behavior as symbolic TF tensors),
-  hence we cannot use `generic_utils.slice_arrays` directly
-  and we have to implement this workaround based on `concat`. This has a
-  performance cost.
-
-  Arguments:
-    arrays: Single array or list of arrays.
-    indices: List of indices in the array that should be included in the output
-      batch.
-    contiguous: Boolean flag indicating whether the indices are contiguous.
-
-  Returns:
-    Slice of data (either single array or list of arrays).
-  """
-  if any(tensor_util.is_tensor(x) for x in arrays):
-    converted_to_list = False
-    if not isinstance(arrays, list):
-      converted_to_list = True
-      arrays = [arrays]
-    if not contiguous:
-      entries = [[x[i:i + 1] for i in indices] for x in arrays]
-      slices = [array_ops.concat(x, axis=0) for x in entries]
-    else:
-      slices = [x[indices[0]:indices[-1] + 1] for x in arrays]
-    if converted_to_list:
-      slices = slices[0]
-    return slices
-  else:
-    return generic_utils.slice_arrays(arrays, indices)
-
-
 def _process_single_batch(model,
                           inputs,
                           targets,
@@ -895,7 +597,6 @@ def fit_loop(model,
              verbose=1,
              callbacks=None,
              shuffle=True,
-             callback_metrics=None,
              initial_epoch=0,
              steps_per_epoch=None,
              validation_steps=None):
@@ -917,10 +618,6 @@ def fit_loop(model,
       verbose: Verbosity mode, 0, 1 or 2
       callbacks: List of callbacks to be called during training
       shuffle: Whether to shuffle the data at the beginning of each epoch
-      callback_metrics: List of strings, the display names of the metrics
-          passed to the callbacks. They should be the
-          concatenation of list the display names of the outputs of
-           `f` and the list of display names of the outputs of `f_val`.
       initial_epoch: Epoch at which to start training
           (useful for resuming a previous training run)
       steps_per_epoch: Total number of steps (batches of samples)
@@ -935,19 +632,25 @@ def fit_loop(model,
   Raises:
     ValueError: In case of invalid argument values.
   """
+  # Convert training inputs to an EagerIterator
+  inputs, steps_per_epoch = training_utils.convert_to_iterator(
+      x=inputs,
+      y=targets,
+      sample_weights=sample_weights,
+      batch_size=batch_size,
+      steps_per_epoch=steps_per_epoch,
+      epochs=epochs,
+      shuffle=shuffle)
   # Required for eager execution
   with backend.learning_phase_scope(1):
     do_validation = False
     if val_inputs:
       do_validation = True
-      if (steps_per_epoch is None and verbose and inputs and
-          hasattr(inputs[0], 'shape') and hasattr(val_inputs[0], 'shape')):
-        print('Train on %d samples, validate on %d samples' %
-              (inputs[0].shape[0], val_inputs[0].shape[0]))
 
     num_train_samples = None
     out_labels = None
-    if steps_per_epoch is None or model._is_compiled:
+    callback_metrics = None
+    if model._is_compiled:
       out_labels = model.metrics_names
       if do_validation:
         callback_metrics = copy.copy(out_labels) + [
@@ -956,28 +659,10 @@ def fit_loop(model,
       else:
         callback_metrics = copy.copy(out_labels)
 
-    if steps_per_epoch is None:
-      if sample_weights:
-        feed_data = inputs + targets + sample_weights
-      else:
-        feed_data = inputs + targets
-      num_train_samples = training_utils.check_num_samples(
-          feed_data,
-          batch_size=batch_size,
-          steps=steps_per_epoch,
-          steps_name='steps_per_epoch')
-
-      if num_train_samples is not None:
-        index_array = np.arange(num_train_samples)
-
     model.history = cbks.History()
     callbacks = [cbks.BaseLogger()] + (callbacks or []) + [model.history]
     if verbose:
-      if steps_per_epoch is not None:
-        count_mode = 'steps'
-      else:
-        count_mode = 'samples'
-      callbacks += [cbks.ProgbarLogger(count_mode)]
+      callbacks += [cbks.ProgbarLogger('steps')]
     callbacks = cbks.CallbackList(callbacks)
 
     # it's possible to callback a different model than self
@@ -989,7 +674,7 @@ def fit_loop(model,
 
     callbacks.set_model(callback_model)
 
-    callbacks.set_params({
+    callback_params = {
         'batch_size': batch_size,
         'epochs': epochs,
         'steps': steps_per_epoch,
@@ -997,9 +682,11 @@ def fit_loop(model,
         'verbose': verbose,
         'do_validation': do_validation,
         'metrics': callback_metrics or [],
-    })
-    callbacks.on_train_begin()
-    callback_model.stop_training = False
+    }
+    if validation_steps:
+      callback_params.update({'validation_steps': validation_steps})
+    callbacks.set_params(callback_params)
+
     for cbk in callbacks:
       if not val_inputs:
         cbk.validation_data = []
@@ -1009,47 +696,32 @@ def fit_loop(model,
         cbk.validation_data = val_inputs + val_targets + val_sample_weights
       else:
         cbk.validation_data = val_inputs + val_targets
+    # validation_data must be set before on_train_begin() is called
+    # so that TensorboardCallback can validate its input
+    callbacks.on_train_begin()
+    callback_model.stop_training = False
 
     for epoch in range(initial_epoch, epochs):
       callbacks.on_epoch_begin(epoch)
       epoch_logs = {}
-
-      if steps_per_epoch is not None:
-        iterator_fit_loop(
-            model,
-            inputs,
-            class_weight,
-            steps_per_epoch=steps_per_epoch,
-            callback_model=callback_model,
-            out_labels=out_labels,
-            epoch_logs=epoch_logs,
-            val_inputs=val_inputs,
-            val_targets=val_targets,
-            val_sample_weights=val_sample_weights,
-            epochs=epochs,
-            verbose=verbose,
-            callbacks=callbacks,
-            callback_metrics=callback_metrics,
-            validation_steps=validation_steps,
-            do_validation=do_validation)
-      else:
-        batch_fit_loop(
-            model,
-            inputs,
-            targets,
-            epoch_logs=epoch_logs,
-            index_array=index_array,
-            out_labels=out_labels,
-            callback_model=callback_model,
-            batch_size=batch_size,
-            sample_weights=sample_weights,
-            val_inputs=val_inputs,
-            val_targets=val_targets,
-            val_sample_weights=val_sample_weights,
-            callbacks=callbacks,
-            shuffle=shuffle,
-            num_train_samples=num_train_samples,
-            do_validation=do_validation)
+      iterator_fit_loop(
+          model,
+          inputs,
+          class_weight,
+          steps_per_epoch=steps_per_epoch,
+          callback_model=callback_model,
+          out_labels=out_labels,
+          epoch_logs=epoch_logs,
+          val_inputs=val_inputs,
+          val_targets=val_targets,
+          val_sample_weights=val_sample_weights,
+          epochs=epochs,
+          verbose=verbose,
+          callbacks=callbacks,
+          callback_metrics=callback_metrics,
+          validation_steps=validation_steps,
+          do_validation=do_validation,
+          batch_size=batch_size)
       callbacks.on_epoch_end(epoch, epoch_logs)
       if callback_model.stop_training:
         break
@@ -1081,17 +753,14 @@ def test_loop(model, inputs, targets,
       and/or metrics). The attribute `model.metrics_names` will give you
       the display labels for the scalar outputs.
   """
+  inputs, steps = training_utils.convert_to_iterator(
+      x=inputs,
+      y=targets,
+      sample_weights=sample_weights,
+      batch_size=batch_size,
+      steps_per_epoch=steps)
   with backend.learning_phase_scope(0):
-    if steps is not None:
-      return iterator_test_loop(model, inputs, steps, verbose=verbose)
-    else:
-      return batch_test_loop(
-          model,
-          inputs,
-          targets,
-          batch_size=batch_size,
-          sample_weights=sample_weights,
-          verbose=verbose)
+    return iterator_test_loop(model, inputs, steps, verbose=verbose)
 
 
 def predict_loop(model, inputs,
@@ -1115,8 +784,6 @@ def predict_loop(model, inputs,
       (if the model has multiple outputs).
   """
   with backend.learning_phase_scope(0):
-    if steps is not None:
-      return iterator_predict_loop(model, inputs, steps, verbose=verbose)
-    else:
-      return batch_predict_loop(
-          model, inputs, batch_size=batch_size, verbose=verbose)
+    inputs, steps = training_utils.convert_to_iterator(
+        x=inputs, batch_size=batch_size, steps_per_epoch=steps)
+    return iterator_predict_loop(model, inputs, steps, verbose=verbose)
diff --git a/tensorflow/python/keras/engine/training_eager_test.py b/tensorflow/python/keras/engine/training_eager_test.py
index bdb3035129..605c1935a5 100644
--- a/tensorflow/python/keras/engine/training_eager_test.py
+++ b/tensorflow/python/keras/engine/training_eager_test.py
@@ -24,291 +24,12 @@ from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python import keras
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util as tf_test_util
-from tensorflow.python.keras import testing_utils
 from tensorflow.python.platform import test
 from tensorflow.python.training.rmsprop import RMSPropOptimizer
 
 
 class TrainingTest(test.TestCase):
 
-  def test_fit_on_arrays(self):
-    a = keras.layers.Input(shape=(3,), name='input_a')
-    b = keras.layers.Input(shape=(3,), name='input_b')
-
-    dense = keras.layers.Dense(4, name='dense')
-    c = dense(a)
-    d = dense(b)
-    e = keras.layers.Dropout(0.5, name='dropout')(c)
-
-    model = keras.models.Model([a, b], [d, e])
-
-    optimizer = RMSPropOptimizer(learning_rate=0.001)
-    loss = 'mse'
-    loss_weights = [1., 0.5]
-    metrics = ['mae']
-    model.compile(optimizer, loss, metrics=metrics, loss_weights=loss_weights)
-
-    input_a_np = np.random.random((10, 3))
-    input_b_np = np.random.random((10, 3))
-
-    output_d_np = np.random.random((10, 4))
-    output_e_np = np.random.random((10, 4))
-
-    # Test fit at different verbosity
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        epochs=1,
-        batch_size=5,
-        verbose=0)
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        epochs=1,
-        batch_size=5,
-        verbose=1)
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        epochs=2,
-        batch_size=5,
-        verbose=2)
-
-    # Test with validation data
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        validation_data=([input_a_np, input_b_np], [output_d_np,
-                                                    output_e_np]),
-        epochs=1,
-        batch_size=5,
-        verbose=0)
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        validation_data=([input_a_np, input_b_np], [output_d_np,
-                                                    output_e_np]),
-        epochs=2,
-        batch_size=5,
-        verbose=1)
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        validation_data=([input_a_np, input_b_np], [output_d_np,
-                                                    output_e_np]),
-        epochs=2,
-        batch_size=5,
-        verbose=2)
-    model.train_on_batch([input_a_np, input_b_np], [output_d_np, output_e_np])
-
-    # Test with validation split
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        epochs=2,
-        batch_size=5,
-        verbose=0,
-        validation_split=0.2)
-
-    # Test with dictionary inputs
-    model.fit(
-        {
-            'input_a': input_a_np,
-            'input_b': input_b_np
-        }, {'dense': output_d_np,
-            'dropout': output_e_np},
-        epochs=1,
-        batch_size=5,
-        verbose=0)
-    model.fit(
-        {
-            'input_a': input_a_np,
-            'input_b': input_b_np
-        }, {'dense': output_d_np,
-            'dropout': output_e_np},
-        epochs=1,
-        batch_size=5,
-        verbose=1)
-    model.fit(
-        {
-            'input_a': input_a_np,
-            'input_b': input_b_np
-        }, {'dense': output_d_np,
-            'dropout': output_e_np},
-        validation_data=({'input_a': input_a_np,
-                          'input_b': input_b_np
-                         },
-                         {
-                             'dense': output_d_np,
-                             'dropout': output_e_np
-                         }),
-        epochs=1,
-        batch_size=5,
-        verbose=0)
-    model.train_on_batch({
-        'input_a': input_a_np,
-        'input_b': input_b_np
-    }, {'dense': output_d_np,
-        'dropout': output_e_np})
-    # Test with lists for loss, metrics
-    loss = ['mae', 'mse']
-    metrics = ['acc', 'mae']
-    model.compile(optimizer, loss, metrics=metrics)
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        epochs=1,
-        batch_size=5,
-        verbose=0)
-
-    # Test with dictionaries for loss, metrics, loss weights
-    loss = {'dense': 'mse', 'dropout': 'mae'}
-    loss_weights = {'dense': 1., 'dropout': 0.5}
-    metrics = {'dense': 'mse', 'dropout': 'mae'}
-    model.compile(optimizer, loss, metrics=metrics, loss_weights=loss_weights)
-    model.fit(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        epochs=1,
-        batch_size=5,
-        verbose=0)
-
-    # Invalid use cases
-    with self.assertRaises(AttributeError):
-      model.fit(
-          [input_a_np, input_b_np], [output_d_np, output_e_np],
-          epochs=1,
-          validation_data=([input_a_np, input_b_np], 0, 0),
-          verbose=0)
-    with self.assertRaises(ValueError):
-      model.train_on_batch({'input_a': input_a_np},
-                           [output_d_np, output_e_np])
-    with self.assertRaises(ValueError):
-      model.train_on_batch([input_a_np], [output_d_np, output_e_np])
-    with self.assertRaises(AttributeError):
-      model.train_on_batch(1, [output_d_np, output_e_np])
-    with self.assertRaises(ValueError):
-      model.train_on_batch(input_a_np, [output_d_np, output_e_np])
-    with self.assertRaises(ValueError):
-      bad_input = np.random.random((11, 3))
-      model.train_on_batch([bad_input, input_b_np],
-                           [output_d_np, output_e_np])
-    with self.assertRaises(ValueError):
-      bad_target = np.random.random((11, 4))
-      model.train_on_batch([input_a_np, input_b_np],
-                           [bad_target, output_e_np])
-
-    # Build single-input model
-    x = keras.layers.Input(shape=(3,), name='input_a')
-    y = keras.layers.Dense(4)(x)
-    model = keras.models.Model(x, y)
-    model.compile(optimizer=RMSPropOptimizer(learning_rate=0.001), loss='mse')
-    # This will work
-    model.fit([input_a_np], output_d_np, epochs=1)
-    with self.assertRaises(ValueError):
-      model.fit([input_a_np, input_a_np], output_d_np, epochs=1)
-
-  def test_evaluate_predict_on_arrays(self):
-    a = keras.layers.Input(shape=(3,), name='input_a')
-    b = keras.layers.Input(shape=(3,), name='input_b')
-
-    dense = keras.layers.Dense(4, name='dense')
-    c = dense(a)
-    d = dense(b)
-    e = keras.layers.Dropout(0.5, name='dropout')(c)
-
-    model = keras.models.Model([a, b], [d, e])
-
-    optimizer = RMSPropOptimizer(learning_rate=0.001)
-    loss = 'mse'
-    loss_weights = [1., 0.5]
-    metrics = ['acc', 'mae']
-    model.compile(
-        optimizer,
-        loss,
-        metrics=metrics,
-        loss_weights=loss_weights,
-        sample_weight_mode=None)
-
-    input_a_np = np.random.random((10, 3))
-    input_b_np = np.random.random((10, 3))
-
-    output_d_np = np.random.random((10, 4))
-    output_e_np = np.random.random((10, 4))
-
-    # Test evaluate at different verbosity
-    out = model.evaluate(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        batch_size=5,
-        verbose=0)
-    self.assertEqual(len(out), 7)
-    out = model.evaluate(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        batch_size=5,
-        verbose=1)
-    self.assertEqual(len(out), 7)
-    out = model.evaluate(
-        [input_a_np, input_b_np], [output_d_np, output_e_np],
-        batch_size=5,
-        verbose=2)
-    self.assertEqual(len(out), 7)
-    out = model.test_on_batch([input_a_np, input_b_np],
-                              [output_d_np, output_e_np])
-    self.assertEqual(len(out), 7)
-
-    # Test evaluate with dictionary inputs
-    model.evaluate(
-        {
-            'input_a': input_a_np,
-            'input_b': input_b_np
-        }, {'dense': output_d_np,
-            'dropout': output_e_np},
-        batch_size=5,
-        verbose=0)
-    model.evaluate(
-        {
-            'input_a': input_a_np,
-            'input_b': input_b_np
-        }, {'dense': output_d_np,
-            'dropout': output_e_np},
-        batch_size=5,
-        verbose=1)
-
-    # Test predict
-    out = model.predict([input_a_np, input_b_np], batch_size=5)
-    self.assertEqual(len(out), 2)
-    out = model.predict({'input_a': input_a_np, 'input_b': input_b_np})
-    self.assertEqual(len(out), 2)
-    out = model.predict_on_batch({
-        'input_a': input_a_np,
-        'input_b': input_b_np
-    })
-    self.assertEqual(len(out), 2)
-
-  def test_invalid_loss_or_metrics(self):
-    num_classes = 5
-    train_samples = 1000
-    test_samples = 1000
-    input_dim = 5
-
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(10, input_shape=(input_dim,)))
-    model.add(keras.layers.Activation('relu'))
-    model.add(keras.layers.Dense(num_classes))
-    model.add(keras.layers.Activation('softmax'))
-    model.compile(loss='categorical_crossentropy',
-                  optimizer=RMSPropOptimizer(learning_rate=0.001))
-    np.random.seed(1337)
-
-    (x_train, y_train), (_, _) = testing_utils.get_test_data(
-        train_samples=train_samples,
-        test_samples=test_samples,
-        input_shape=(input_dim,),
-        num_classes=num_classes)
-
-    with self.assertRaises(ValueError):
-      model.fit(x_train, np.concatenate([y_train, y_train], axis=-1))
-
-    with self.assertRaises(TypeError):
-      model.compile(loss='categorical_crossentropy',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=set(0))
-
-    with self.assertRaises(ValueError):
-      model.compile(loss=None,
-                    optimizer='rms')
-
   def test_model_methods_with_eager_tensors_multi_io(self):
     a = keras.layers.Input(shape=(3,), name='input_a')
     b = keras.layers.Input(shape=(3,), name='input_b')
@@ -422,229 +143,6 @@ class TrainingTest(test.TestCase):
     self.assertEqual(out.shape, (30, 4))
 
 
-class LossWeightingTest(test.TestCase):
-
-  def test_class_weights(self):
-    num_classes = 5
-    batch_size = 5
-    weighted_class = 3
-    train_samples = 300
-    test_samples = 300
-    input_dim = 5
-
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(10, input_shape=(input_dim,)))
-    model.add(keras.layers.Activation('relu'))
-    model.add(keras.layers.Dense(num_classes))
-    model.add(keras.layers.Activation('softmax'))
-    model.compile(loss='categorical_crossentropy',
-                  optimizer=RMSPropOptimizer(learning_rate=0.001))
-
-    np.random.seed(1337)
-    (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
-        train_samples=train_samples,
-        test_samples=test_samples,
-        input_shape=(input_dim,),
-        num_classes=num_classes)
-    int_y_test = y_test.copy()
-    int_y_train = y_train.copy()
-    # convert class vectors to binary class matrices
-    y_train = keras.utils.to_categorical(y_train, num_classes)
-    y_test = keras.utils.to_categorical(y_test, num_classes)
-    test_ids = np.where(int_y_test == np.array(weighted_class))[0]
-
-    class_weight = dict([(i, 1.) for i in range(num_classes)])
-    class_weight[weighted_class] = 4.
-
-    sample_weight = np.ones((y_train.shape[0]))
-    sample_weight[int_y_train == weighted_class] = 4.
-
-    model.fit(
-        x_train,
-        y_train,
-        batch_size=batch_size,
-        epochs=2,
-        verbose=0,
-        class_weight=class_weight,
-        validation_data=(x_train, y_train, sample_weight))
-    model.fit(
-        x_train,
-        y_train,
-        batch_size=batch_size,
-        epochs=2,
-        verbose=0,
-        class_weight=class_weight)
-    model.fit(
-        x_train,
-        y_train,
-        batch_size=batch_size,
-        epochs=2,
-        verbose=0,
-        class_weight=class_weight,
-        validation_split=0.1)
-
-    model.train_on_batch(
-        x_train[:batch_size], y_train[:batch_size], class_weight=class_weight)
-    ref_score = model.evaluate(x_test, y_test, verbose=0)
-    score = model.evaluate(
-        x_test[test_ids, :], y_test[test_ids, :], verbose=0)
-    self.assertLess(score, ref_score)
-
-  def test_sample_weights(self):
-    num_classes = 5
-    batch_size = 5
-    weighted_class = 3
-    train_samples = 300
-    test_samples = 300
-    input_dim = 5
-
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(10, input_shape=(input_dim,)))
-    model.add(keras.layers.Activation('relu'))
-    model.add(keras.layers.Dense(num_classes))
-    model.add(keras.layers.Activation('softmax'))
-    model.compile(loss='categorical_crossentropy',
-                  optimizer=RMSPropOptimizer(learning_rate=0.001))
-
-    np.random.seed(43)
-    (x_train, y_train), _ = testing_utils.get_test_data(
-        train_samples=train_samples,
-        test_samples=test_samples,
-        input_shape=(input_dim,),
-        num_classes=num_classes)
-    int_y_train = y_train.copy()
-    y_train = keras.utils.to_categorical(y_train, num_classes)
-
-    class_weight = dict([(i, 1.) for i in range(num_classes)])
-    class_weight[weighted_class] = 4.
-
-    sample_weight = np.ones((y_train.shape[0]))
-    sample_weight[int_y_train == weighted_class] = 4.
-
-    model.fit(
-        x_train,
-        y_train,
-        batch_size=batch_size,
-        epochs=2,
-        verbose=0,
-        sample_weight=sample_weight)
-    model.fit(
-        x_train,
-        y_train,
-        batch_size=batch_size,
-        epochs=2,
-        verbose=0,
-        sample_weight=sample_weight,
-        validation_split=0.1)
-    model.train_on_batch(
-        x_train[:batch_size],
-        y_train[:batch_size],
-        sample_weight=sample_weight[:batch_size])
-    model.test_on_batch(
-        x_train[:batch_size],
-        y_train[:batch_size],
-        sample_weight=sample_weight[:batch_size])
-
-  def test_temporal_sample_weights(self):
-    num_classes = 5
-    weighted_class = 3
-    train_samples = 1000
-    test_samples = 1000
-    input_dim = 5
-    timesteps = 3
-
-    model = keras.models.Sequential()
-    model.add(
-        keras.layers.TimeDistributed(
-            keras.layers.Dense(num_classes),
-            input_shape=(timesteps, input_dim)))
-    model.add(keras.layers.Activation('softmax'))
-
-    np.random.seed(1337)
-    (_, y_train), _ = testing_utils.get_test_data(
-        train_samples=train_samples,
-        test_samples=test_samples,
-        input_shape=(input_dim,),
-        num_classes=num_classes)
-    int_y_train = y_train.copy()
-    # convert class vectors to binary class matrices
-    y_train = keras.utils.to_categorical(y_train, num_classes)
-
-    class_weight = dict([(i, 1.) for i in range(num_classes)])
-    class_weight[weighted_class] = 2.
-
-    sample_weight = np.ones((y_train.shape[0]))
-    sample_weight[int_y_train == weighted_class] = 2.
-    with self.assertRaises(ValueError):
-      model.compile(
-          loss='binary_crossentropy',
-          optimizer=RMSPropOptimizer(learning_rate=0.001),
-          sample_weight_mode='temporal')
-
-  def test_class_weight_invalid_use_case(self):
-    num_classes = 5
-    train_samples = 1000
-    test_samples = 1000
-    input_dim = 5
-    timesteps = 3
-
-    model = keras.models.Sequential()
-    model.add(
-        keras.layers.TimeDistributed(
-            keras.layers.Dense(num_classes),
-            input_shape=(timesteps, input_dim)))
-    model.add(keras.layers.Activation('softmax'))
-    model.compile(
-        loss='binary_crossentropy',
-        optimizer=RMSPropOptimizer(learning_rate=0.001))
-
-    (x_train, y_train), _ = testing_utils.get_test_data(
-        train_samples=train_samples,
-        test_samples=test_samples,
-        input_shape=(input_dim,),
-        num_classes=num_classes)
-    # convert class vectors to binary class matrices
-    y_train = keras.utils.to_categorical(y_train, num_classes)
-    class_weight = dict([(i, 1.) for i in range(num_classes)])
-
-    del class_weight[1]
-    with self.assertRaises(ValueError):
-      model.fit(x_train, y_train,
-                epochs=0, verbose=0, class_weight=class_weight)
-
-    with self.assertRaises(ValueError):
-      model.compile(
-          loss='binary_crossentropy',
-          optimizer=RMSPropOptimizer(learning_rate=0.001),
-          sample_weight_mode=[])
-
-    # Build multi-output model
-    x = keras.Input((3,))
-    y1 = keras.layers.Dense(4, name='1')(x)
-    y2 = keras.layers.Dense(4, name='2')(x)
-    model = keras.models.Model(x, [y1, y2])
-    model.compile(optimizer=RMSPropOptimizer(learning_rate=0.001), loss='mse')
-    x_np = np.random.random((10, 3))
-    y_np = np.random.random((10, 4))
-    w_np = np.random.random((10,))
-    # This will work
-    model.fit(x_np, [y_np, y_np], epochs=1, sample_weight={'1': w_np})
-    # These will not
-    with self.assertRaises(ValueError):
-      model.fit(x_np, [y_np, y_np], epochs=1, sample_weight=[w_np])
-    with self.assertRaises(TypeError):
-      model.fit(x_np, [y_np, y_np], epochs=1, sample_weight=w_np)
-    with self.assertRaises(ValueError):
-      bad_w_np = np.random.random((11,))
-      model.fit(x_np, [y_np, y_np], epochs=1, sample_weight={'1': bad_w_np})
-    with self.assertRaises(ValueError):
-      bad_w_np = np.random.random((10, 2))
-      model.fit(x_np, [y_np, y_np], epochs=1, sample_weight={'1': bad_w_np})
-    with self.assertRaises(ValueError):
-      bad_w_np = np.random.random((10, 2, 2))
-      model.fit(x_np, [y_np, y_np], epochs=1, sample_weight={'1': bad_w_np})
-
-
 class CorrectnessTest(test.TestCase):
 
   @tf_test_util.run_in_graph_and_eager_modes
diff --git a/tensorflow/python/keras/engine/training_generator.py b/tensorflow/python/keras/engine/training_generator.py
index d81b384f0e..432cf2bddd 100644
--- a/tensorflow/python/keras/engine/training_generator.py
+++ b/tensorflow/python/keras/engine/training_generator.py
@@ -96,14 +96,25 @@ def fit_generator(model,
   else:
     callback_model = model
   callbacks.set_model(callback_model)
-  callbacks.set_params({
+
+  callback_params = {
       'epochs': epochs,
       'steps': steps_per_epoch,
       'verbose': verbose,
       'do_validation': do_validation,
       'metrics': callback_metrics,
-  })
-  callbacks.on_train_begin()
+  }
+  if do_validation:
+    # need to create the test_function before start of the first epoch
+    # because TensorBoard callback on_epoch_begin adds summary to the
+    # list of fetches of the test_function
+    model._make_test_function()
+    # determine the number of validation batches given a generator
+    if validation_steps:
+      callback_params.update({'validation_steps': validation_steps})
+    elif isinstance(validation_data, Sequence):
+      callback_params.update({'validation_steps': len(validation_data)})
+  callbacks.set_params(callback_params)
 
   enqueuer = None
   val_enqueuer = None
@@ -149,6 +160,9 @@ def fit_generator(model,
         output_generator = generator
 
     callback_model.stop_training = False
+    # validation_data must be set before on_train_begin() is called
+    # so that TensorboardCallback can validate its input
+    callbacks.on_train_begin()
     # Construct epoch logs.
     epoch_logs = {}
     while epoch < epochs:
diff --git a/tensorflow/python/keras/engine/training_gpu_test.py b/tensorflow/python/keras/engine/training_gpu_test.py
new file mode 100644
index 0000000000..5825ce814f
--- /dev/null
+++ b/tensorflow/python/keras/engine/training_gpu_test.py
@@ -0,0 +1,125 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for training routines."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python import keras
+from tensorflow.python.framework import test_util
+from tensorflow.python.keras import backend as K
+from tensorflow.python.keras.layers.convolutional import Conv2D
+from tensorflow.python.platform import test
+from tensorflow.python.training import rmsprop
+
+
+class TrainingGPUTest(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_model_with_crossentropy_losses_channels_first(self):
+    """Tests use of all crossentropy losses with `channels_first`.
+
+    Tests `sparse_categorical_crossentropy`, `categorical_crossentropy`,
+    and `binary_crossentropy`.
+    Verifies that evaluate gives the same result with either `channels_first`
+    or `channels_last` image_data_format.
+    """
+    def prepare_simple_model(input_tensor, loss_name, target):
+      axis = 1 if K.image_data_format() == 'channels_first' else -1
+      loss = None
+      num_channels = None
+      activation = None
+      if loss_name == 'sparse_categorical_crossentropy':
+        loss = lambda y_true, y_pred: K.sparse_categorical_crossentropy(  # pylint: disable=g-long-lambda
+            y_true, y_pred, axis=axis)
+        num_channels = np.amax(target) + 1
+        activation = 'softmax'
+      elif loss_name == 'categorical_crossentropy':
+        loss = lambda y_true, y_pred: K.categorical_crossentropy(  # pylint: disable=g-long-lambda
+            y_true, y_pred, axis=axis)
+        num_channels = target.shape[axis]
+        activation = 'softmax'
+      elif loss_name == 'binary_crossentropy':
+        loss = lambda y_true, y_pred: K.binary_crossentropy(y_true, y_pred)  # pylint: disable=unnecessary-lambda
+        num_channels = target.shape[axis]
+        activation = 'sigmoid'
+      predictions = Conv2D(num_channels,
+                           1,
+                           activation=activation,
+                           kernel_initializer='ones',
+                           bias_initializer='ones')(input_tensor)
+      simple_model = keras.models.Model(inputs=input_tensor,
+                                        outputs=predictions)
+      simple_model.compile(optimizer=rmsprop.RMSPropOptimizer(1e-3), loss=loss)
+      return simple_model
+
+    if test.is_gpu_available(cuda_only=True):
+      with self.test_session(use_gpu=True):
+        losses_to_test = ['sparse_categorical_crossentropy',
+                          'categorical_crossentropy', 'binary_crossentropy']
+
+        data_channels_first = np.array([[[[8., 7.1, 0.], [4.5, 2.6, 0.55],
+                                          [0.9, 4.2, 11.2]]]], dtype=np.float32)
+        # Labels for testing 4-class sparse_categorical_crossentropy, 4-class
+        # categorical_crossentropy, and 2-class binary_crossentropy:
+        labels_channels_first = [np.array([[[[0, 1, 3], [2, 1, 0], [2, 2, 1]]]], dtype=np.float32),  # pylint: disable=line-too-long
+                                 np.array([[[[0, 1, 0], [0, 1, 0], [0, 0, 0]],
+                                            [[1, 0, 0], [0, 0, 1], [0, 1, 0]],
+                                            [[0, 0, 0], [1, 0, 0], [0, 0, 1]],
+                                            [[0, 0, 1], [0, 0, 0], [1, 0, 0]]]], dtype=np.float32),  # pylint: disable=line-too-long
+                                 np.array([[[[0, 1, 0], [0, 1, 0], [0, 0, 1]],
+                                            [[1, 0, 1], [1, 0, 1], [1, 1, 0]]]], dtype=np.float32)]  # pylint: disable=line-too-long
+        # Compute one loss for each loss function in the list `losses_to_test`:
+        loss_channels_last = [0., 0., 0.]
+        loss_channels_first = [0., 0., 0.]
+
+        old_data_format = K.image_data_format()
+
+        # Evaluate a simple network with channels last, with all three loss
+        # functions:
+        K.set_image_data_format('channels_last')
+        data = np.moveaxis(data_channels_first, 1, -1)
+        for index, loss_function in enumerate(losses_to_test):
+          labels = np.moveaxis(labels_channels_first[index], 1, -1)
+          inputs = keras.Input(shape=(3, 3, 1))
+          model = prepare_simple_model(inputs, loss_function, labels)
+          loss_channels_last[index] = model.evaluate(x=data, y=labels,
+                                                     batch_size=1, verbose=0)
+
+        # Evaluate the same network with channels first, with all three loss
+        # functions:
+        K.set_image_data_format('channels_first')
+        data = data_channels_first
+        for index, loss_function in enumerate(losses_to_test):
+          labels = labels_channels_first[index]
+          inputs = keras.Input(shape=(1, 3, 3))
+          model = prepare_simple_model(inputs, loss_function, labels)
+          loss_channels_first[index] = model.evaluate(x=data, y=labels,
+                                                      batch_size=1, verbose=0)
+
+        K.set_image_data_format(old_data_format)
+
+        np.testing.assert_allclose(loss_channels_first,
+                                   loss_channels_last,
+                                   err_msg='{}{}'.format(
+                                       'Computed different losses for ',
+                                       'channels_first and channels_last'))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/keras/engine/training_test.py b/tensorflow/python/keras/engine/training_test.py
index d9e548f01f..650bc0bce7 100644
--- a/tensorflow/python/keras/engine/training_test.py
+++ b/tensorflow/python/keras/engine/training_test.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import logging
 import os
 import unittest
 
@@ -25,6 +26,7 @@ import numpy as np
 
 from tensorflow.python import keras
 from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util as tf_test_util
@@ -44,6 +46,7 @@ except ImportError:
 
 class TrainingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_fit_on_arrays(self):
     with self.test_session():
       a = keras.layers.Input(shape=(3,), name='input_a')
@@ -56,7 +59,7 @@ class TrainingTest(test.TestCase):
 
       model = keras.models.Model([a, b], [d, e])
 
-      optimizer = 'rmsprop'
+      optimizer = RMSPropOptimizer(learning_rate=0.001)
       loss = 'mse'
       loss_weights = [1., 0.5]
       metrics = ['mae']
@@ -223,7 +226,7 @@ class TrainingTest(test.TestCase):
       x = keras.layers.Input(shape=(3,), name='input_a')
       y = keras.layers.Dense(4)(x)
       model = keras.models.Model(x, y)
-      model.compile(optimizer='rmsprop', loss='mse')
+      model.compile(optimizer, loss='mse')
       # This will work
       model.fit([input_a_np], output_d_np, epochs=1)
       with self.assertRaises(ValueError):
@@ -239,6 +242,7 @@ class TrainingTest(test.TestCase):
                 batch_size=5,
                 verbose=2)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_evaluate_predict_on_arrays(self):
     with self.test_session():
       a = keras.layers.Input(shape=(3,), name='input_a')
@@ -251,7 +255,7 @@ class TrainingTest(test.TestCase):
 
       model = keras.models.Model([a, b], [d, e])
 
-      optimizer = 'rmsprop'
+      optimizer = RMSPropOptimizer(learning_rate=0.001)
       loss = 'mse'
       loss_weights = [1., 0.5]
       metrics = ['mae']
@@ -321,6 +325,7 @@ class TrainingTest(test.TestCase):
       })
       self.assertEqual(len(out), 2)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_invalid_loss_or_metrics(self):
     num_classes = 5
     train_samples = 1000
@@ -333,27 +338,29 @@ class TrainingTest(test.TestCase):
       model.add(keras.layers.Activation('relu'))
       model.add(keras.layers.Dense(num_classes))
       model.add(keras.layers.Activation('softmax'))
-      model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
+      optimizer = RMSPropOptimizer(learning_rate=0.001)
+      model.compile(optimizer, loss='categorical_crossentropy')
       np.random.seed(1337)
       (x_train, y_train), (_, _) = testing_utils.get_test_data(
           train_samples=train_samples,
           test_samples=test_samples,
           input_shape=(input_dim,),
           num_classes=num_classes)
-      with self.assertRaises(ValueError):
-        model.fit(x_train, y_train)
 
       with self.assertRaises(ValueError):
         model.fit(x_train, np.concatenate([y_train, y_train], axis=-1))
 
       with self.assertRaises(TypeError):
-        model.compile(loss='categorical_crossentropy',
-                      optimizer='rmsprop',
-                      metrics=set(0))
+        model.compile(
+            optimizer, loss='categorical_crossentropy', metrics=set(0))
 
-      with self.assertRaises(ValueError):
-        model.compile(loss=None,
-                      optimizer='rmsprop')
+      if not context.executing_eagerly():
+        # TODO(psv): Investigate these use cases in eager mode.
+        with self.assertRaises(ValueError):
+          model.fit(x_train, y_train)
+
+        with self.assertRaises(ValueError):
+          model.compile(optimizer, loss=None)
 
   def test_training_on_sparse_data_with_dense_placeholders(self):
     if scipy_sparse is None:
@@ -415,9 +422,32 @@ class TrainingTest(test.TestCase):
       x2 = model.predict(val_a)
       self.assertAllClose(x1, x2, atol=1e-7)
 
+  def test_compile_warning_for_loss_missing_output(self):
+    with self.test_session():
+      inp = keras.layers.Input(shape=(16,), name='input_a')
+      out_1 = keras.layers.Dense(8, name='dense_1')(inp)
+      out_2 = keras.layers.Dense(3, activation='softmax', name='dense_2')(out_1)
+      model = keras.models.Model(inputs=[inp], outputs=[out_1, out_2])
+
+      with test.mock.patch.object(logging, 'warning') as mock_log:
+        model.compile(
+            loss={
+                'dense_2': 'categorical_crossentropy',
+            },
+            optimizer='rmsprop',
+            metrics={
+                'dense_2': 'categorical_accuracy',
+                'dense_1': 'categorical_accuracy',
+            })
+        msg = ('Output "dense_1" missing from loss dictionary. We assume this '
+               'was done on purpose. The fit and evaluate APIs will not be '
+               'expecting any data to be passed to "dense_1".')
+        self.assertRegexpMatches(str(mock_log.call_args), msg)
+
 
 class LossWeightingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_class_weights(self):
     num_classes = 5
     batch_size = 5
@@ -426,6 +456,7 @@ class LossWeightingTest(test.TestCase):
     train_samples = 1000
     test_samples = 1000
     input_dim = 5
+    learning_rate = 0.001
 
     with self.test_session():
       model = keras.models.Sequential()
@@ -433,7 +464,9 @@ class LossWeightingTest(test.TestCase):
       model.add(keras.layers.Activation('relu'))
       model.add(keras.layers.Dense(num_classes))
       model.add(keras.layers.Activation('softmax'))
-      model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
+      model.compile(
+          loss='categorical_crossentropy',
+          optimizer=RMSPropOptimizer(learning_rate=learning_rate))
 
       np.random.seed(1337)
       (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
@@ -485,6 +518,7 @@ class LossWeightingTest(test.TestCase):
           x_test[test_ids, :], y_test[test_ids, :], verbose=0)
       self.assertLess(score, ref_score)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_sample_weights(self):
     num_classes = 5
     batch_size = 5
@@ -493,6 +527,7 @@ class LossWeightingTest(test.TestCase):
     train_samples = 1000
     test_samples = 1000
     input_dim = 5
+    learning_rate = 0.001
 
     with self.test_session():
       model = keras.models.Sequential()
@@ -500,7 +535,9 @@ class LossWeightingTest(test.TestCase):
       model.add(keras.layers.Activation('relu'))
       model.add(keras.layers.Dense(num_classes))
       model.add(keras.layers.Activation('softmax'))
-      model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
+      model.compile(
+          RMSPropOptimizer(learning_rate=learning_rate),
+          loss='categorical_crossentropy')
 
       np.random.seed(43)
       (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
@@ -515,9 +552,6 @@ class LossWeightingTest(test.TestCase):
       y_test = keras.utils.to_categorical(y_test, num_classes)
       test_ids = np.where(int_y_test == np.array(weighted_class))[0]
 
-      class_weight = dict([(i, 1.) for i in range(num_classes)])
-      class_weight[weighted_class] = 2.
-
       sample_weight = np.ones((y_train.shape[0]))
       sample_weight[int_y_train == weighted_class] = 2.
 
@@ -546,10 +580,12 @@ class LossWeightingTest(test.TestCase):
           y_train[:batch_size],
           sample_weight=sample_weight[:batch_size])
       ref_score = model.evaluate(x_test, y_test, verbose=0)
-      score = model.evaluate(
-          x_test[test_ids, :], y_test[test_ids, :], verbose=0)
-      self.assertLess(score, ref_score)
+      if not context.executing_eagerly():
+        score = model.evaluate(
+            x_test[test_ids, :], y_test[test_ids, :], verbose=0)
+        self.assertLess(score, ref_score)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_temporal_sample_weights(self):
     num_classes = 5
     batch_size = 5
@@ -559,6 +595,7 @@ class LossWeightingTest(test.TestCase):
     test_samples = 1000
     input_dim = 5
     timesteps = 3
+    learning_rate = 0.001
 
     with self.test_session():
       model = keras.models.Sequential()
@@ -581,9 +618,6 @@ class LossWeightingTest(test.TestCase):
       y_test = keras.utils.to_categorical(y_test, num_classes)
       test_ids = np.where(int_y_test == np.array(weighted_class))[0]
 
-      class_weight = dict([(i, 1.) for i in range(num_classes)])
-      class_weight[weighted_class] = 2.
-
       sample_weight = np.ones((y_train.shape[0]))
       sample_weight[int_y_train == weighted_class] = 2.
 
@@ -605,8 +639,8 @@ class LossWeightingTest(test.TestCase):
           temporal_sample_weight, timesteps, axis=1)
 
       model.compile(
+          RMSPropOptimizer(learning_rate=learning_rate),
           loss='binary_crossentropy',
-          optimizer='rmsprop',
           sample_weight_mode='temporal')
 
       model.fit(
@@ -634,16 +668,19 @@ class LossWeightingTest(test.TestCase):
           temporal_y_train[:batch_size],
           sample_weight=temporal_sample_weight[:batch_size])
       ref_score = model.evaluate(temporal_x_test, temporal_y_test, verbose=0)
-      score = model.evaluate(
-          temporal_x_test[test_ids], temporal_y_test[test_ids], verbose=0)
-      self.assertLess(score, ref_score)
+      if not context.executing_eagerly():
+        score = model.evaluate(
+            temporal_x_test[test_ids], temporal_y_test[test_ids], verbose=0)
+        self.assertLess(score, ref_score)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_class_weight_invalid_use_case(self):
     num_classes = 5
     train_samples = 1000
     test_samples = 1000
     input_dim = 5
     timesteps = 3
+    learning_rate = 0.001
 
     with self.test_session():
       model = keras.models.Sequential()
@@ -652,9 +689,8 @@ class LossWeightingTest(test.TestCase):
               keras.layers.Dense(num_classes),
               input_shape=(timesteps, input_dim)))
       model.add(keras.layers.Activation('softmax'))
-      model.compile(
-          loss='binary_crossentropy',
-          optimizer='rmsprop')
+      optimizer = RMSPropOptimizer(learning_rate=learning_rate)
+      model.compile(optimizer, loss='binary_crossentropy')
 
       (x_train, y_train), _ = testing_utils.get_test_data(
           train_samples=train_samples,
@@ -672,16 +708,14 @@ class LossWeightingTest(test.TestCase):
 
       with self.assertRaises(ValueError):
         model.compile(
-            loss='binary_crossentropy',
-            optimizer='rmsprop',
-            sample_weight_mode=[])
+            optimizer, loss='binary_crossentropy', sample_weight_mode=[])
 
       # Build multi-output model
       x = keras.Input((3,))
       y1 = keras.layers.Dense(4, name='1')(x)
       y2 = keras.layers.Dense(4, name='2')(x)
       model = keras.models.Model(x, [y1, y2])
-      model.compile(optimizer='rmsprop', loss='mse')
+      model.compile(optimizer, loss='mse')
       x_np = np.random.random((10, 3))
       y_np = np.random.random((10, 4))
       w_np = np.random.random((10,))
@@ -708,22 +742,99 @@ class LossWeightingTest(test.TestCase):
         model.fit(x_np, [y_np, y_np], epochs=1,
                   sample_weight={'1': bad_w_np})
 
+  @tf_test_util.run_in_graph_and_eager_modes
+  def test_default_sample_weight(self):
+    """Verifies that fit works without having to set sample_weight."""
+
+    num_classes = 5
+    input_dim = 5
+    timesteps = 3
+    learning_rate = 0.001
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(
+          keras.layers.TimeDistributed(
+              keras.layers.Dense(num_classes),
+              input_shape=(timesteps, input_dim)))
+
+      x = np.random.random((10, timesteps, input_dim))
+      y = np.random.random((10, timesteps, num_classes))
+      optimizer = RMSPropOptimizer(learning_rate=learning_rate)
+
+      # sample_weight_mode is a list and mode value is None
+      model.compile(optimizer, loss='mse', sample_weight_mode=[None])
+      model.fit(x, y, epochs=1, batch_size=10)
+
+      # sample_weight_mode is a list and mode value is `temporal`
+      model.compile(optimizer, loss='mse', sample_weight_mode=['temporal'])
+      model.fit(x, y, epochs=1, batch_size=10)
+
+      # sample_weight_mode is a dict and mode value is None
+      model.compile(
+          optimizer, loss='mse', sample_weight_mode={'time_distributed': None})
+      model.fit(x, y, epochs=1, batch_size=10)
+
+      # sample_weight_mode is a dict and mode value is `temporal`
+      model.compile(
+          optimizer,
+          loss='mse',
+          sample_weight_mode={'time_distributed': 'temporal'})
+      model.fit(x, y, epochs=1, batch_size=10)
+
+      # sample_weight_mode is a not a list/dict and mode value is None
+      model.compile(optimizer, loss='mse', sample_weight_mode=None)
+      model.fit(x, y, epochs=1, batch_size=10)
+
+      # sample_weight_mode is a not a list/dict and mode value is `temporal`
+      model.compile(optimizer, loss='mse', sample_weight_mode='temporal')
+      model.fit(x, y, epochs=1, batch_size=10)
+
 
 class LossMaskingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_masking(self):
     with self.test_session():
-      np.random.seed(1337)
       x = np.array([[[1], [1]], [[0], [0]]])
       model = keras.models.Sequential()
       model.add(keras.layers.Masking(mask_value=0, input_shape=(2, 1)))
       model.add(
           keras.layers.TimeDistributed(
               keras.layers.Dense(1, kernel_initializer='one')))
-      model.compile(loss='mse', optimizer='sgd')
+      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
       y = np.array([[[1], [1]], [[1], [1]]])
       loss = model.train_on_batch(x, y)
-      self.assertEqual(loss, 0)
+      self.assertEqual(float(loss), 0.)
+
+  @tf_test_util.run_in_graph_and_eager_modes
+  def test_mask_argument_in_layer(self):
+    # Test that the mask argument gets correctly passed to a layer in the
+    # functional API.
+
+    class CustomMaskedLayer(keras.layers.Layer):
+
+      def __init__(self):
+        super(CustomMaskedLayer, self).__init__()
+        self.supports_masking = True
+
+      def call(self, inputs, mask=None):
+        assert mask is not None
+        return inputs
+
+      def compute_output_shape(self, input_shape):
+        return input_shape
+
+    with self.test_session():
+      x = np.random.random((5, 3))
+      inputs = keras.layers.Input((3,))
+      masked = keras.layers.Masking(mask_value=0)(inputs)
+      outputs = CustomMaskedLayer()(masked)
+
+      model = keras.Model(inputs, outputs)
+      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+      y = np.random.random((5, 3))
+      model.train_on_batch(x, y)
 
   def test_loss_masking(self):
     with self.test_session():
@@ -744,6 +855,22 @@ class LossMaskingTest(test.TestCase):
               keras.backend.variable(weights), keras.backend.variable(mask)))
 
 
+class LearningPhaseTest(test.TestCase):
+
+  def test_empty_model_no_learning_phase(self):
+    with self.test_session():
+      model = keras.models.Sequential()
+      self.assertFalse(model.uses_learning_phase)
+
+  def test_dropout_has_learning_phase(self):
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_dim=3))
+      model.add(keras.layers.Dropout(0.5))
+      model.add(keras.layers.Dense(2))
+      self.assertTrue(model.uses_learning_phase)
+
+
 class TestDynamicTrainability(test.TestCase):
 
   def test_trainable_warning(self):
diff --git a/tensorflow/python/keras/engine/training_utils.py b/tensorflow/python/keras/engine/training_utils.py
index 728a2b493b..b304cb9093 100644
--- a/tensorflow/python/keras/engine/training_utils.py
+++ b/tensorflow/python/keras/engine/training_utils.py
@@ -19,18 +19,151 @@ from __future__ import division
 from __future__ import print_function
 
 import copy
+import math
 
 import numpy as np
 
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.eager import context
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.keras import backend as K
 from tensorflow.python.keras import losses
 from tensorflow.python.keras import metrics as metrics_module
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 
 
+def _map_nested(data, func):
+  """Maps each nested element using func."""
+  if isinstance(data, list):
+    return [_map_nested(nested_data, func) for nested_data in data]
+  elif isinstance(data, tuple):
+    return tuple(_map_nested(nested_data, func) for nested_data in data)
+  elif isinstance(data, dict):
+    return {
+        k: _map_nested(nested_data, func) for k, nested_data in data.items()
+    }
+  else:
+    return func(data)
+
+
+def _nested_all(data, cond_func):
+  """Checks if all elements in a nested structure satisfy cond_func."""
+  if isinstance(data, (tuple, list)):
+    return all([_nested_all(nested_data, cond_func) for nested_data in data])
+  elif isinstance(data, dict):
+    return all(
+        [_nested_all(nested_data, cond_func) for nested_data in data.values()])
+  else:
+    return cond_func(data)
+
+
+def _nested_any(data, cond_func):
+  """Checks if any nested_elements in a nested structure satisfy cond_func."""
+  if isinstance(data, (tuple, list)):
+    return any([_nested_any(nested_data, cond_func) for nested_data in data])
+  elif isinstance(data, dict):
+    return any(
+        [_nested_any(nested_data, cond_func) for nested_data in data.values()])
+  else:
+    return cond_func(data)
+
+
+def _convert_lists_to_tuples(data):
+  """Converts all lists to tuples, since Datasets expect tuples."""
+  if isinstance(data, (tuple, list)):
+    return tuple(_convert_lists_to_tuples(nested_data) for nested_data in data)
+  elif isinstance(data, dict):
+    return {
+        k: _convert_lists_to_tuples(nested_data)
+        for k, nested_data in data.items()
+    }
+  else:
+    return data
+
+
+def _get_batch_axis_size(data):
+  """Returns batch axis shape for nested data."""
+  if isinstance(data, (tuple, list)):
+    return _get_batch_axis_size(data[0])
+  elif isinstance(data, dict):
+    return _get_batch_axis_size(list(data.values()))
+  else:
+    return int(data.shape[0])
+
+
+def convert_to_iterator(x=None,
+                        y=None,
+                        sample_weights=None,
+                        batch_size=None,
+                        steps_per_epoch=None,
+                        epochs=1,
+                        shuffle=False):
+  """Converts NumPy arrays or EagerTensors to an EagerIterator.
+
+  Combines all provided data into a single EagerIterator.
+
+  Arguments:
+      x: NumPy array or EagerTensor,  or list of Numpy arrays or EagerTensors
+        representing inputs to a model.
+      y: Optional. NumPy array or EagerTensor, or list of Numpy arrays or
+        EagerTensors representing targets of a model.
+      sample_weights: Optional NumPy array or EagerTensor representing sample
+        weights.
+      batch_size: Used to batch data and calculate how many steps EagerIterator
+        should take per epoch.
+      steps_per_epoch: If provided, how many steps EagerIterator should take per
+        epoch.
+      epochs: Epochs to repeat iterator for.
+      shuffle: Whether to shuffle data after each epoch.
+
+  Raises:
+      ValueError: if steps_per_epoch cannot be calculated from the data
+      provided.
+
+  Returns:
+      (Iterator, steps_per_epoch).
+
+  """
+  if isinstance(x, iterator_ops.EagerIterator):
+    return x, steps_per_epoch
+
+  if not _nested_any(sample_weights, lambda x: x is None):
+    data = (x, y, sample_weights)
+  elif not _nested_any(y, lambda x: x is None):
+    data = (x, y)
+  else:
+    # always wrap in a tuple, so we know y, sample_weights weren't set
+    # even when x has multiple elements
+    data = (x,)
+
+  data = _convert_lists_to_tuples(data)
+  if steps_per_epoch is None and batch_size is not None:
+    num_samples = _get_batch_axis_size(data)
+    steps_per_epoch = int(math.ceil(num_samples / batch_size))
+
+  if steps_per_epoch is None:
+    raise ValueError('Could not determine steps_per_epoch.'
+                     'Please provide either batch_size or'
+                     'steps_per_epoch.')
+
+  # TODO(omalleyt) for NumPy arrays in graph mode
+  # placeholder ops should be used
+  # this is only ideal for eager mode
+  dataset = dataset_ops.Dataset.from_tensor_slices(data)
+
+  if batch_size is not None:
+    dataset = dataset.batch(batch_size)
+  if shuffle:
+    dataset = dataset.shuffle(buffer_size=10000)
+  dataset = dataset.repeat(epochs)
+  iterator = dataset.make_one_shot_iterator()
+
+  return iterator, steps_per_epoch
+
+
 def check_num_samples(ins,
                       batch_size=None,
                       steps=None,
@@ -128,8 +261,8 @@ def standardize_input_data(data,
     except KeyError as e:
       raise ValueError('No data provided for "' + e.args[0] + '". Need data '
                        'for each key in: ' + str(names))
-  elif isinstance(data, list):
-    if isinstance(data[0], list):
+  elif isinstance(data, (list, tuple)):
+    if isinstance(data[0], (list, tuple)):
       data = [np.asarray(d) for d in data]
     elif len(names) == 1 and isinstance(data[0], (float, int)):
       data = [np.asarray(data)]
@@ -482,6 +615,9 @@ def standardize_weights(y,
   Raises:
       ValueError: In case of invalid user-provided arguments.
   """
+  # Iterator may return sample_weight as 1-tuple
+  if isinstance(sample_weight, tuple):
+    sample_weight = sample_weight[0]
   if sample_weight_mode is not None:
     if sample_weight_mode != 'temporal':
       raise ValueError('"sample_weight_mode '
@@ -566,17 +702,16 @@ def populate_metric_names(model):
   for i in range(len(model.outputs)):
     metrics = model.nested_metrics[i]
     for metric in metrics:
-      base_metric_name = get_base_metric_name(metric)
+      base_metric_name = get_metric_name(metric)
       add_metric_name(model, base_metric_name, i)
 
 
-def get_base_metric_name(metric, weighted=False):
-  """Returns the metric name given the metric function.
+def get_metric_name(metric, weighted=False):
+  """Returns the metric name corresponding to the given metric input.
 
   Arguments:
       metric: Metric function name or reference.
-      weighted: Boolean indicating if the metric for which we are adding
-          names is weighted.
+      weighted: Boolean indicating if the given metric is weighted.
 
   Returns:
       a metric name.
@@ -600,6 +735,36 @@ def get_base_metric_name(metric, weighted=False):
   return metric_name
 
 
+def get_metric_function(metric, output_shape=None, loss_fn=None):
+  """Returns the metric function corresponding to the given metric input.
+
+  Arguments:
+      metric: Metric function name or reference.
+      output_shape: The shape of the output that this metric
+          will be calculated for.
+      loss_fn: The loss function used.
+
+  Returns:
+      The metric function.
+  """
+  if metric in ['accuracy', 'acc']:
+    if output_shape[-1] == 1 or loss_fn == losses.binary_crossentropy:
+      return metrics_module.binary_accuracy  # case: binary accuracy
+    elif loss_fn == losses.sparse_categorical_crossentropy:
+      # case: categorical accuracy with sparse targets
+      return metrics_module.sparse_categorical_accuracy
+    return metrics_module.categorical_accuracy  # case: categorical accuracy
+  elif metric in ['crossentropy', 'ce']:
+    if output_shape[-1] == 1 or loss_fn == losses.binary_crossentropy:
+      return metrics_module.binary_crossentropy  # case: binary cross-entropy
+    elif loss_fn == losses.sparse_categorical_crossentropy:
+      # case: categorical cross-entropy with sparse targets
+      return metrics_module.sparse_categorical_crossentropy
+    # case: categorical cross-entropy
+    return metrics_module.categorical_crossentropy
+  return metrics_module.get(metric)
+
+
 def add_metric_name(model, metric_name, index):
   """Makes the metric name unique and adds it to the model's metric name list.
 
@@ -612,6 +777,9 @@ def add_metric_name(model, metric_name, index):
         user. For example: 'acc'
     index: The index of the model output for which the metric name is being
         added.
+
+  Returns:
+    string, name of the model's unique metric name
   """
   if len(model.output_names) > 1:
     metric_name = '%s_%s' % (model.output_names[index], metric_name)
@@ -621,6 +789,7 @@ def add_metric_name(model, metric_name, index):
     metric_name = '%s_%d' % (base_metric_name, j)
     j += 1
   model.metrics_names.append(metric_name)
+  return metric_name
 
 
 def validate_iterator_input(x, y, sample_weight, validation_split=None):
@@ -722,3 +891,83 @@ def cast_if_floating_dtype(x):
         for val in x
     ]
   return math_ops.cast(x, dtype=K.floatx()) if x.dtype.is_floating else x
+
+
+def get_output_sample_weight_and_mode(skip_target_weighing_indices,
+                                      sample_weight_mode, output_name,
+                                      output_index):
+  """Returns the sample weight and weight mode for a single output."""
+  if output_index in skip_target_weighing_indices:
+    return None, None
+
+  if sample_weight_mode == 'temporal':
+    default_value = [[1.]]
+    shape = [None, None]
+    mode = 'temporal'
+  else:
+    default_value = [1.]
+    shape = [None]
+    mode = None
+  if context.executing_eagerly():
+    weight = None
+  else:
+    weight = array_ops.placeholder_with_default(
+        constant_op.constant(default_value, dtype=K.floatx()),
+        shape=shape,
+        name=output_name + '_sample_weights')
+  return weight, mode
+
+
+def prepare_sample_weights(output_names, sample_weight_mode,
+                           skip_target_weighing_indices):
+  """Prepares sample weights for the model.
+
+  Args:
+    output_names: List of model output names.
+    sample_weight_mode: sample weight mode user input passed from compile API.
+    skip_target_weighing_indices: Indices of output for which sample weights
+      should be skipped.
+
+  Returns:
+    A pair of list of sample weights and sample weight modes
+      (one for each output).
+
+  Raises:
+    ValueError: In case of invalid `sample_weight_mode` input.
+  """
+  sample_weights = []
+  sample_weight_modes = []
+  if isinstance(sample_weight_mode, dict):
+    unknown_output = set(sample_weight_mode.keys()) - set(output_names)
+    if unknown_output:
+      raise ValueError('Unknown entry in '
+                       'sample_weight_mode dictionary: "' + unknown_output +
+                       '". Only expected the following keys: ' +
+                       str(output_names))
+    for i, name in enumerate(output_names):
+      if (i not in skip_target_weighing_indices and
+          name not in sample_weight_mode):
+        raise ValueError('Output missing from sample_weight_modes dictionary')
+      weight, mode = get_output_sample_weight_and_mode(
+          skip_target_weighing_indices, sample_weight_mode.get(name), name, i)
+      sample_weights.append(weight)
+      sample_weight_modes.append(mode)
+  elif isinstance(sample_weight_mode, list):
+    if len(sample_weight_mode) != len(output_names):
+      raise ValueError('When passing a list as sample_weight_mode, '
+                       'it should have one entry per model output. '
+                       'The model has ' + str(len(output_names)) +
+                       ' outputs, but you passed ' +
+                       str(len(sample_weight_mode)) + 'sample_weight_modes')
+    for i, name in enumerate(output_names):
+      weight, mode = get_output_sample_weight_and_mode(
+          skip_target_weighing_indices, sample_weight_mode[i], name, i)
+      sample_weights.append(weight)
+      sample_weight_modes.append(mode)
+  else:
+    for i, name in enumerate(output_names):
+      weight, mode = get_output_sample_weight_and_mode(
+          skip_target_weighing_indices, sample_weight_mode, name, i)
+      sample_weights.append(weight)
+      sample_weight_modes.append(mode)
+  return sample_weights, sample_weight_modes
diff --git a/tensorflow/python/keras/engine/training_utils_test.py b/tensorflow/python/keras/engine/training_utils_test.py
new file mode 100644
index 0000000000..297a1ae494
--- /dev/null
+++ b/tensorflow/python/keras/engine/training_utils_test.py
@@ -0,0 +1,150 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for training utility functions."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.keras.engine import training_utils
+from tensorflow.python.platform import test
+
+
+class TrainingUtilTest(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_convert_to_iterator_single_numpy(self):
+    batch_size = 2
+    a = np.ones([10, 10])
+    iterator, steps_per_epoch = training_utils.convert_to_iterator(
+        x=a, batch_size=batch_size)
+    self.assertEquals(steps_per_epoch, 5)
+
+    expected_batch = a[:batch_size, :]
+    actual_batch, = iterator.get_next()
+    self.assertAllEqual(expected_batch, actual_batch)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_convert_to_iterator_single_tensor(self):
+    batch_size = 2
+    a = ops.convert_to_tensor(np.ones([10, 10]))
+    iterator, steps_per_epoch = training_utils.convert_to_iterator(
+        x=a, batch_size=batch_size)
+    self.assertEquals(steps_per_epoch, 5)
+
+    expected_batch = a[:batch_size, :]
+    actual_batch, = iterator.get_next()
+    self.assertAllEqual(expected_batch, actual_batch)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_convert_to_iterator_y(self):
+    batch_size = 2
+    a = np.ones([10, 100])
+    b = np.ones([10, 10])
+    iterator, steps_per_epoch = training_utils.convert_to_iterator(
+        x=a, y=b, batch_size=batch_size)
+    self.assertEquals(steps_per_epoch, 5)
+
+    expected_x = a[:batch_size, :]
+    expected_y = b[:batch_size, :]
+    actual_x, actual_y = iterator.get_next()
+    self.assertAllEqual(expected_x, actual_x)
+    self.assertAllEqual(expected_y, actual_y)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_convert_to_iterator_sample_weights(self):
+    batch_size = 2
+    a = ops.convert_to_tensor(np.ones([10, 100]))
+    b = ops.convert_to_tensor(np.ones([10, 10]))
+    sw = ops.convert_to_tensor(np.ones([10]))
+    iterator, steps_per_epoch = training_utils.convert_to_iterator(
+        x=a, y=b, sample_weights=sw, batch_size=batch_size)
+    self.assertEquals(steps_per_epoch, 5)
+
+    expected_x = a[:batch_size, :]
+    expected_y = b[:batch_size, :]
+    expected_sw = sw[:batch_size]
+    actual_x, actual_y, actual_sw = iterator.get_next()
+    self.assertAllEqual(expected_x, actual_x)
+    self.assertAllEqual(expected_y, actual_y)
+    self.assertAllEqual(expected_sw, actual_sw)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_convert_to_iterator_nested(self):
+    batch_size = 2
+    x = {'1': np.ones([10, 100]), '2': [np.zeros([10, 10]), np.ones([10, 20])]}
+    iterator, steps_per_epoch = training_utils.convert_to_iterator(
+        x=x, batch_size=batch_size)
+    self.assertEquals(steps_per_epoch, 5)
+
+    expected_x1 = x['1'][:batch_size, :]
+    expected_x2_0 = x['2'][0][:batch_size, :]
+    expected_x2_1 = x['2'][1][:batch_size, :]
+
+    actual_x, = iterator.get_next()
+    actual_x1 = actual_x['1'][:batch_size, :]
+    actual_x2_0 = actual_x['2'][0][:batch_size, :]
+    actual_x2_1 = actual_x['2'][1][:batch_size, :]
+
+    self.assertAllEqual(expected_x1, actual_x1)
+    self.assertAllEqual(expected_x2_0, actual_x2_0)
+    self.assertAllEqual(expected_x2_1, actual_x2_1)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_convert_to_iterator_epochs(self):
+    batch_size = 2
+    a = np.ones([10, 10])
+    iterator, steps_per_epoch = training_utils.convert_to_iterator(
+        x=a, batch_size=batch_size, epochs=2)
+    self.assertEquals(steps_per_epoch, 5)
+
+    expected_batch = a[:batch_size, :]
+    # loop through one whole epoch
+    for _ in range(6):
+      actual_batch, = iterator.get_next()
+    self.assertAllEqual(expected_batch, actual_batch)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_convert_to_iterator_insufficient_info(self):
+    # with batch_size and steps_per_epoch not set
+    with self.assertRaises(ValueError):
+      a = np.ones([10, 10])
+      _ = training_utils.convert_to_iterator(x=a)
+
+  def test_nested_all(self):
+    nested_data = {'a': True, 'b': [True, True, (False, True)]}
+    all_true = training_utils._nested_all(nested_data, lambda x: x)
+    self.assertEquals(all_true, False)
+
+    nested_data = {'a': True, 'b': [True, True, (True, True)]}
+    all_true = training_utils._nested_all(nested_data, lambda x: x)
+    self.assertEquals(all_true, True)
+
+  def test_nested_any(self):
+    nested_data = [False, {'a': False, 'b': (False, True)}]
+    any_true = training_utils._nested_any(nested_data, lambda x: x)
+    self.assertEquals(any_true, True)
+
+    nested_data = [False, {'a': False, 'b': (False, False)}]
+    any_true = training_utils._nested_any(nested_data, lambda x: x)
+    self.assertEquals(any_true, False)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/keras/initializers.py b/tensorflow/python/keras/initializers.py
index b9b2e9ad59..b9d856efa8 100644
--- a/tensorflow/python/keras/initializers.py
+++ b/tensorflow/python/keras/initializers.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Keras initializer classes (soon to be replaced with core TF initializers).
+"""Keras initializer serialization / deserialization.
 """
 from __future__ import absolute_import
 from __future__ import division
@@ -22,150 +22,27 @@ import six
 
 from tensorflow.python.keras.utils.generic_utils import deserialize_keras_object
 from tensorflow.python.keras.utils.generic_utils import serialize_keras_object
+
+# These imports are brought in so that keras.initializers.deserialize
+# has them available in module_objects.
 from tensorflow.python.ops.init_ops import Constant
+from tensorflow.python.ops.init_ops import glorot_normal_initializer
+from tensorflow.python.ops.init_ops import glorot_uniform_initializer
+from tensorflow.python.ops.init_ops import he_normal  # pylint: disable=unused-import
+from tensorflow.python.ops.init_ops import he_uniform  # pylint: disable=unused-import
 from tensorflow.python.ops.init_ops import Identity
 from tensorflow.python.ops.init_ops import Initializer  # pylint: disable=unused-import
+from tensorflow.python.ops.init_ops import lecun_normal  # pylint: disable=unused-import
+from tensorflow.python.ops.init_ops import lecun_uniform  # pylint: disable=unused-import
 from tensorflow.python.ops.init_ops import Ones
 from tensorflow.python.ops.init_ops import Orthogonal
 from tensorflow.python.ops.init_ops import RandomNormal
 from tensorflow.python.ops.init_ops import RandomUniform
 from tensorflow.python.ops.init_ops import TruncatedNormal
-from tensorflow.python.ops.init_ops import VarianceScaling
+from tensorflow.python.ops.init_ops import VarianceScaling  # pylint: disable=unused-import
 from tensorflow.python.ops.init_ops import Zeros
-from tensorflow.python.util.tf_export import tf_export
-
-
-@tf_export('keras.initializers.lecun_normal')
-def lecun_normal(seed=None):
-  """LeCun normal initializer.
-
-  It draws samples from a truncated normal distribution centered on 0
-  with `stddev = sqrt(1 / fan_in)`
-  where `fan_in` is the number of input units in the weight tensor.
-
-  Arguments:
-      seed: A Python integer. Used to seed the random generator.
-
-  Returns:
-      An initializer.
-
-  References:
-      - [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
-      - [Efficient
-      Backprop](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)
-  """
-  return VarianceScaling(
-      scale=1., mode='fan_in', distribution='normal', seed=seed)
-
-
-@tf_export('keras.initializers.lecun_uniform')
-def lecun_uniform(seed=None):
-  """LeCun uniform initializer.
-
-  It draws samples from a uniform distribution within [-limit, limit]
-  where `limit` is `sqrt(3 / fan_in)`
-  where `fan_in` is the number of input units in the weight tensor.
-
-  Arguments:
-      seed: A Python integer. Used to seed the random generator.
-
-  Returns:
-      An initializer.
-
-  References:
-      LeCun 98, Efficient Backprop,
-      http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
-  """
-  return VarianceScaling(
-      scale=1., mode='fan_in', distribution='uniform', seed=seed)
-
-
-@tf_export('keras.initializers.glorot_normal')
-def glorot_normal(seed=None):
-  """Glorot normal initializer, also called Xavier normal initializer.
-
-  It draws samples from a truncated normal distribution centered on 0
-  with `stddev = sqrt(2 / (fan_in + fan_out))`
-  where `fan_in` is the number of input units in the weight tensor
-  and `fan_out` is the number of output units in the weight tensor.
-
-  Arguments:
-      seed: A Python integer. Used to seed the random generator.
-
-  Returns:
-      An initializer.
 
-  References:
-      Glorot & Bengio, AISTATS 2010
-      http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
-  """
-  return VarianceScaling(
-      scale=1., mode='fan_avg', distribution='normal', seed=seed)
-
-
-@tf_export('keras.initializers.glorot_uniform')
-def glorot_uniform(seed=None):
-  """Glorot uniform initializer, also called Xavier uniform initializer.
-
-  It draws samples from a uniform distribution within [-limit, limit]
-  where `limit` is `sqrt(6 / (fan_in + fan_out))`
-  where `fan_in` is the number of input units in the weight tensor
-  and `fan_out` is the number of output units in the weight tensor.
-
-  Arguments:
-      seed: A Python integer. Used to seed the random generator.
-
-  Returns:
-      An initializer.
-
-  References:
-      Glorot & Bengio, AISTATS 2010
-      http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
-  """
-  return VarianceScaling(
-      scale=1., mode='fan_avg', distribution='uniform', seed=seed)
-
-
-@tf_export('keras.initializers.he_normal')
-def he_normal(seed=None):
-  """He normal initializer.
-
-  It draws samples from a truncated normal distribution centered on 0
-  with `stddev = sqrt(2 / fan_in)`
-  where `fan_in` is the number of input units in the weight tensor.
-
-  Arguments:
-      seed: A Python integer. Used to seed the random generator.
-
-  Returns:
-      An initializer.
-
-  References:
-      He et al., http://arxiv.org/abs/1502.01852
-  """
-  return VarianceScaling(
-      scale=2., mode='fan_in', distribution='normal', seed=seed)
-
-
-@tf_export('keras.initializers.he_uniform')
-def he_uniform(seed=None):
-  """He uniform variance scaling initializer.
-
-  It draws samples from a uniform distribution within [-limit, limit]
-  where `limit` is `sqrt(6 / fan_in)`
-  where `fan_in` is the number of input units in the weight tensor.
-
-  Arguments:
-      seed: A Python integer. Used to seed the random generator.
-
-  Returns:
-      An initializer.
-
-  References:
-      He et al., http://arxiv.org/abs/1502.01852
-  """
-  return VarianceScaling(
-      scale=2., mode='fan_in', distribution='uniform', seed=seed)
+from tensorflow.python.util.tf_export import tf_export
 
 
 # Compatibility aliases
@@ -179,6 +56,8 @@ normal = random_normal = RandomNormal
 truncated_normal = TruncatedNormal
 identity = Identity
 orthogonal = Orthogonal
+glorot_normal = glorot_normal_initializer
+glorot_uniform = glorot_uniform_initializer
 
 # pylint: enable=invalid-name
 
diff --git a/tensorflow/python/keras/initializers_test.py b/tensorflow/python/keras/initializers_test.py
index c519e194bd..51725e03f2 100644
--- a/tensorflow/python/keras/initializers_test.py
+++ b/tensorflow/python/keras/initializers_test.py
@@ -31,16 +31,6 @@ class KerasInitializersTest(test.TestCase):
               target_max=None, target_min=None):
     variable = keras.backend.variable(init(shape))
     output = keras.backend.get_value(variable)
-    lim = 3e-2
-    if target_std is not None:
-      self.assertGreater(lim, abs(output.std() - target_std))
-    if target_mean is not None:
-      self.assertGreater(lim, abs(output.mean() - target_mean))
-    if target_max is not None:
-      self.assertGreater(lim, abs(output.max() - target_max))
-    if target_min is not None:
-      self.assertGreater(lim, abs(output.min() - target_min))
-
     # Test serialization (assumes deterministic behavior).
     config = init.get_config()
     reconstructed_init = init.__class__.from_config(config)
diff --git a/tensorflow/python/keras/layers/advanced_activations.py b/tensorflow/python/keras/layers/advanced_activations.py
index eba10da6f3..61ab69c16f 100644
--- a/tensorflow/python/keras/layers/advanced_activations.py
+++ b/tensorflow/python/keras/layers/advanced_activations.py
@@ -284,6 +284,13 @@ class Softmax(Layer):
 class ReLU(Layer):
   """Rectified Linear Unit activation function.
 
+  With default values, it returns element-wise `max(x, 0)`.
+
+  Otherwise, it follows:
+  `f(x) = max_value` for `x >= max_value`,
+  `f(x) = x` for `threshold <= x < max_value`,
+  `f(x) = negative_slope * (x - threshold)` otherwise.
+
   Input shape:
       Arbitrary. Use the keyword argument `input_shape`
       (tuple of integers, does not include the samples axis)
@@ -294,21 +301,39 @@ class ReLU(Layer):
 
   Arguments:
       max_value: float >= 0. Maximum activation value.
+      negative_slope: float >= 0. Negative slope coefficient.
+      threshold: float. Threshold value for thresholded activation.
   """
 
-  def __init__(self, max_value=None, **kwargs):
+  def __init__(self, max_value=None, negative_slope=0, threshold=0, **kwargs):
     super(ReLU, self).__init__(**kwargs)
-    self.support_masking = True
-    self.max_value = K.cast_to_floatx(max_value)
-    if self.max_value < 0.:
+    if max_value is not None and max_value < 0.:
       raise ValueError('max_value of Relu layer '
                        'cannot be negative value: ' + str(max_value))
+    if negative_slope < 0.:
+      raise ValueError('negative_slope of Relu layer '
+                       'cannot be negative value: ' + str(negative_slope))
+
+    self.support_masking = True
+    self.max_value = K.cast_to_floatx(max_value)
+    self.negative_slope = K.cast_to_floatx(negative_slope)
+    self.threshold = K.cast_to_floatx(threshold)
 
   def call(self, inputs):
-    return activations.relu(inputs, max_value=self.max_value)
+    # alpha is used for leaky relu slope in activations instead of
+    # negative_slope.
+    return activations.relu(
+        inputs,
+        alpha=self.negative_slope,
+        max_value=self.max_value,
+        threshold=self.threshold)
 
   def get_config(self):
-    config = {'max_value': self.max_value}
+    config = {
+        'max_value': self.max_value,
+        'negative_slope': self.negative_slope,
+        'threshold': self.threshold
+    }
     base_config = super(ReLU, self).get_config()
     return dict(list(base_config.items()) + list(config.items()))
 
diff --git a/tensorflow/python/keras/layers/advanced_activations_test.py b/tensorflow/python/keras/layers/advanced_activations_test.py
index 9e1f15b1bc..53c1baa2bb 100644
--- a/tensorflow/python/keras/layers/advanced_activations_test.py
+++ b/tensorflow/python/keras/layers/advanced_activations_test.py
@@ -75,6 +75,14 @@ class AdvancedActivationsTest(test.TestCase):
         testing_utils.layer_test(keras.layers.ReLU,
                                  kwargs={'max_value': -10},
                                  input_shape=(2, 3, 4))
+    with self.assertRaisesRegexp(
+        ValueError,
+        'negative_slope of Relu layer cannot be negative value: -2'):
+      with self.test_session():
+        testing_utils.layer_test(
+            keras.layers.ReLU,
+            kwargs={'negative_slope': -2},
+            input_shape=(2, 3, 4))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/layers/convolutional_recurrent.py b/tensorflow/python/keras/layers/convolutional_recurrent.py
index 84d794cada..e61dd3043d 100644
--- a/tensorflow/python/keras/layers/convolutional_recurrent.py
+++ b/tensorflow/python/keras/layers/convolutional_recurrent.py
@@ -788,7 +788,7 @@ class ConvLSTM2D(ConvRNN2D):
 
   Arguments:
     filters: Integer, the dimensionality of the output space
-        (i.e. the number output of filters in the convolution).
+        (i.e. the number of output filters in the convolution).
     kernel_size: An integer or tuple/list of n integers, specifying the
         dimensions of the convolution window.
     strides: An integer or tuple/list of n integers,
diff --git a/tensorflow/python/keras/layers/core.py b/tensorflow/python/keras/layers/core.py
index 2bf6229ccb..f28cade474 100644
--- a/tensorflow/python/keras/layers/core.py
+++ b/tensorflow/python/keras/layers/core.py
@@ -26,6 +26,7 @@ import warnings
 import numpy as np
 
 from tensorflow.python.eager import context
+from tensorflow.python.framework import common_shapes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.keras import activations
@@ -929,13 +930,13 @@ class Dense(Layer):
 
   def call(self, inputs):
     inputs = ops.convert_to_tensor(inputs, dtype=self.dtype)
-    shape = inputs.get_shape().as_list()
-    if len(shape) > 2:
+    rank = common_shapes.rank(inputs)
+    if rank > 2:
       # Broadcasting is required for the inputs.
-      outputs = standard_ops.tensordot(inputs, self.kernel, [[len(shape) - 1],
-                                                             [0]])
+      outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
       # Reshape the output back to the original ndim of the input.
       if not context.executing_eagerly():
+        shape = inputs.get_shape().as_list()
         output_shape = shape[:-1] + [self.units]
         outputs.set_shape(output_shape)
     else:
diff --git a/tensorflow/python/keras/layers/cudnn_recurrent_test.py b/tensorflow/python/keras/layers/cudnn_recurrent_test.py
index 8fd970239f..2ed0aa8f26 100644
--- a/tensorflow/python/keras/layers/cudnn_recurrent_test.py
+++ b/tensorflow/python/keras/layers/cudnn_recurrent_test.py
@@ -220,7 +220,7 @@ class CuDNNTest(test.TestCase, parameterized.TestCase):
         self.assertNotEqual(out4.max(), out5.max())
 
   @parameterized.named_parameters(
-      *testing_utils.generate_combinations_with_testcase_name(
+      *test_util.generate_combinations_with_testcase_name(
           rnn_type=['LSTM', 'GRU'], to_cudnn=[True, False],
           bidirectional=[True, False], implementation=[1, 2],
           model_nest_level=[1, 2], model_type=['seq', 'func']))
@@ -301,7 +301,7 @@ class CuDNNTest(test.TestCase, parameterized.TestCase):
     os.remove(fname)
 
   @parameterized.named_parameters(
-      *testing_utils.generate_combinations_with_testcase_name(
+      *test_util.generate_combinations_with_testcase_name(
           rnn_type=['LSTM', 'GRU'], to_cudnn=[True, False]))
   def test_load_weights_between_noncudnn_rnn_time_distributed(self, rnn_type,
                                                               to_cudnn):
diff --git a/tensorflow/python/keras/layers/embeddings.py b/tensorflow/python/keras/layers/embeddings.py
index 910fff720f..629a9ec9a1 100644
--- a/tensorflow/python/keras/layers/embeddings.py
+++ b/tensorflow/python/keras/layers/embeddings.py
@@ -112,6 +112,7 @@ class Embedding(Layer):
     self.activity_regularizer = regularizers.get(activity_regularizer)
     self.embeddings_constraint = constraints.get(embeddings_constraint)
     self.mask_zero = mask_zero
+    self.supports_masking = mask_zero
     self.input_length = input_length
 
   @tf_utils.shape_type_conversion
@@ -127,8 +128,8 @@ class Embedding(Layer):
   def compute_mask(self, inputs, mask=None):
     if not self.mask_zero:
       return None
-    else:
-      return math_ops.not_equal(inputs, 0)
+
+    return math_ops.not_equal(inputs, 0)
 
   @tf_utils.shape_type_conversion
   def compute_output_shape(self, input_shape):
diff --git a/tensorflow/python/keras/layers/gru_test.py b/tensorflow/python/keras/layers/gru_test.py
index 57f660b6d5..afef997b00 100644
--- a/tensorflow/python/keras/layers/gru_test.py
+++ b/tensorflow/python/keras/layers/gru_test.py
@@ -183,6 +183,7 @@ class GRULayerTest(test.TestCase):
       self.assertEqual(layer.cell.recurrent_kernel.constraint, r_constraint)
       self.assertEqual(layer.cell.bias.constraint, b_constraint)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_with_masking_layer_GRU(self):
     layer_class = keras.layers.GRU
     with self.test_session():
@@ -192,7 +193,8 @@ class GRULayerTest(test.TestCase):
       model = keras.models.Sequential()
       model.add(keras.layers.Masking(input_shape=(3, 4)))
       model.add(layer_class(units=5, return_sequences=True, unroll=False))
-      model.compile(loss='categorical_crossentropy', optimizer='adam')
+      model.compile(loss='categorical_crossentropy',
+                    optimizer=RMSPropOptimizer(0.01))
       model.fit(inputs, targets, epochs=1, batch_size=2, verbose=1)
 
   def test_from_config_GRU(self):
diff --git a/tensorflow/python/keras/layers/lstm_test.py b/tensorflow/python/keras/layers/lstm_test.py
index ae381f5955..9802820fd0 100644
--- a/tensorflow/python/keras/layers/lstm_test.py
+++ b/tensorflow/python/keras/layers/lstm_test.py
@@ -197,6 +197,7 @@ class LSTMLayerTest(test.TestCase):
       self.assertEqual(layer.cell.recurrent_kernel.constraint, r_constraint)
       self.assertEqual(layer.cell.bias.constraint, b_constraint)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_with_masking_layer_LSTM(self):
     layer_class = keras.layers.LSTM
     with self.test_session():
@@ -206,7 +207,8 @@ class LSTMLayerTest(test.TestCase):
       model = keras.models.Sequential()
       model.add(keras.layers.Masking(input_shape=(3, 4)))
       model.add(layer_class(units=5, return_sequences=True, unroll=False))
-      model.compile(loss='categorical_crossentropy', optimizer='adam')
+      model.compile(loss='categorical_crossentropy',
+                    optimizer=RMSPropOptimizer(0.01))
       model.fit(inputs, targets, epochs=1, batch_size=2, verbose=1)
 
   def test_from_config_LSTM(self):
@@ -311,7 +313,8 @@ class LSTMLayerTest(test.TestCase):
       output = keras.layers.LSTM(units)(inputs, initial_state=initial_state)
 
       model = keras.models.Model([inputs] + initial_state, output)
-      model.compile(loss='categorical_crossentropy', optimizer='adam')
+      model.compile(loss='categorical_crossentropy',
+                    optimizer=RMSPropOptimizer(0.01))
 
       inputs = np.random.random((num_samples, timesteps, embedding_dim))
       initial_state = [np.random.random((num_samples, units))
diff --git a/tensorflow/python/keras/layers/normalization.py b/tensorflow/python/keras/layers/normalization.py
index 8b894ca6b1..a7835bc0a2 100644
--- a/tensorflow/python/keras/layers/normalization.py
+++ b/tensorflow/python/keras/layers/normalization.py
@@ -181,12 +181,6 @@ class BatchNormalization(Layer):
       self.renorm_clipping = renorm_clipping
       self.renorm_momentum = renorm_momentum
 
-  def _add_tower_local_variable(self, *args, **kwargs):
-    tower_context = distribute_lib.get_tower_context()
-    with tower_context.tower_local_var_scope(
-        variable_scope.VariableAggregation.MEAN):
-      return self.add_weight(*args, **kwargs)
-
   def build(self, input_shape):
     input_shape = tensor_shape.TensorShape(input_shape)
     if not input_shape.ndims:
@@ -314,19 +308,23 @@ class BatchNormalization(Layer):
         self._scope.set_partitioner(None)
       else:
         partitioner = None
-      self.moving_mean = self._add_tower_local_variable(
+      self.moving_mean = self.add_weight(
           name='moving_mean',
           shape=param_shape,
           dtype=param_dtype,
           initializer=self.moving_mean_initializer,
-          trainable=False)
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          trainable=False,
+          aggregation=variable_scope.VariableAggregation.MEAN)
 
-      self.moving_variance = self._add_tower_local_variable(
+      self.moving_variance = self.add_weight(
           name='moving_variance',
           shape=param_shape,
           dtype=param_dtype,
           initializer=self.moving_variance_initializer,
-          trainable=False)
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          trainable=False,
+          aggregation=variable_scope.VariableAggregation.MEAN)
 
       if self.renorm:
         # Create variables to maintain the moving mean and standard deviation.
@@ -337,12 +335,14 @@ class BatchNormalization(Layer):
         # stack to be cleared. The nested ones use a `lambda` to set the desired
         # device and ignore any devices that may be set by the custom getter.
         def _renorm_variable(name, shape):
-          var = self._add_tower_local_variable(
+          var = self.add_weight(
               name=name,
               shape=shape,
               dtype=param_dtype,
               initializer=init_ops.zeros_initializer(),
-              trainable=False)
+              synchronization=variable_scope.VariableSynchronization.ON_READ,
+              trainable=False,
+              aggregation=variable_scope.VariableAggregation.MEAN)
           return var
 
         with distribute_lib.get_distribution_strategy().colocate_vars_with(
@@ -370,7 +370,7 @@ class BatchNormalization(Layer):
         decay = ops.convert_to_tensor(1.0 - momentum, name='decay')
         if decay.dtype != variable.dtype.base_dtype:
           decay = math_ops.cast(decay, variable.dtype.base_dtype)
-        update_delta = (variable - value) * decay
+        update_delta = (variable - math_ops.cast(value, variable.dtype)) * decay
         return state_ops.assign_sub(variable, update_delta, name=scope)
 
   def _fused_batch_norm(self, inputs, training):
@@ -619,6 +619,10 @@ class BatchNormalization(Layer):
     else:
       mean, variance = self.moving_mean, self.moving_variance
 
+    mean = math_ops.cast(mean, inputs.dtype)
+    variance = math_ops.cast(variance, inputs.dtype)
+    if offset is not None:
+      offset = math_ops.cast(offset, inputs.dtype)
     outputs = nn.batch_normalization(inputs,
                                      _broadcast(mean),
                                      _broadcast(variance),
diff --git a/tensorflow/python/keras/layers/normalization_test.py b/tensorflow/python/keras/layers/normalization_test.py
index b22f3bd152..a97b4cac46 100644
--- a/tensorflow/python/keras/layers/normalization_test.py
+++ b/tensorflow/python/keras/layers/normalization_test.py
@@ -95,6 +95,24 @@ class NormalizationLayersTest(test.TestCase):
       np.testing.assert_allclose(out.mean(), 0.0, atol=1e-1)
       np.testing.assert_allclose(out.std(), 1.0, atol=1e-1)
 
+  def test_batchnorm_mixed_precision(self):
+    with self.test_session():
+      model = keras.models.Sequential()
+      norm = keras.layers.BatchNormalization(input_shape=(10,), momentum=0.8)
+      model.add(norm)
+      model.compile(loss='mse', optimizer='sgd')
+
+      # centered on 5.0, variance 10.0
+      x = np.random.normal(
+          loc=5.0, scale=10.0, size=(1000, 10)).astype(np.float16)
+      model.fit(x, x, epochs=4, verbose=0)
+      out = model.predict(x)
+      out -= keras.backend.eval(norm.beta)
+      out /= keras.backend.eval(norm.gamma)
+
+      np.testing.assert_allclose(out.mean(), 0.0, atol=1e-1)
+      np.testing.assert_allclose(out.std(), 1.0, atol=1e-1)
+
   def test_batchnorm_convnet(self):
     if test.is_gpu_available(cuda_only=True):
       with self.test_session(use_gpu=True):
diff --git a/tensorflow/python/keras/layers/recurrent.py b/tensorflow/python/keras/layers/recurrent.py
index 32d25c5a65..a8bfdf25f2 100644
--- a/tensorflow/python/keras/layers/recurrent.py
+++ b/tensorflow/python/keras/layers/recurrent.py
@@ -23,7 +23,6 @@ import numbers
 import numpy as np
 
 from tensorflow.python.eager import context
-from tensorflow.python.framework import tensor_shape
 from tensorflow.python.keras import activations
 from tensorflow.python.keras import backend as K
 from tensorflow.python.keras import constraints
@@ -37,6 +36,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training.checkpointable import base as checkpointable
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -235,7 +235,8 @@ class RNN(Layer):
   """Base class for recurrent layers.
 
   Arguments:
-      cell: A RNN cell instance. A RNN cell is a class that has:
+      cell: A RNN cell instance or a list of RNN cell instances.
+          A RNN cell is a class that has:
           - a `call(input_at_t, states_at_t)` method, returning
               `(output_at_t, states_at_t_plus_1)`. The call method of the
               cell can also take the optional argument `constants`, see
@@ -248,9 +249,9 @@ class RNN(Layer):
               (one size per state). In this case, the first entry
               (`state_size[0]`) should be the same as
               the size of the cell output.
-          It is also possible for `cell` to be a list of RNN cell instances,
-          in which cases the cells get stacked on after the other in the RNN,
-          implementing an efficient stacked RNN.
+          In the case that `cell` is a list of RNN cell instances, the cells
+          will be stacked on after the other in the RNN, implementing an
+          efficient stacked RNN.
       return_sequences: Boolean. Whether to return the last output
           in the output sequence, or the full sequence.
       return_state: Boolean. Whether to return the last state
@@ -402,6 +403,8 @@ class RNN(Layer):
                        'one integer per RNN state).')
     super(RNN, self).__init__(**kwargs)
     self.cell = cell
+    if isinstance(cell, checkpointable.CheckpointableBase):
+      self._track_checkpointable(self.cell, name='cell')
     self.return_sequences = return_sequences
     self.return_state = return_state
     self.go_backwards = go_backwards
@@ -2227,342 +2230,6 @@ def _generate_dropout_mask(ones, rate, training=None, count=1):
   return K.in_train_phase(dropped_inputs, ones, training=training)
 
 
-class Recurrent(Layer):
-  """Deprecated abstract base class for recurrent layers.
-
-  It still exists because it is leveraged by the convolutional-recurrent layers.
-  It will be removed entirely in the future.
-  It was never part of the public API.
-  Do not use.
-
-  Arguments:
-      weights: list of Numpy arrays to set as initial weights.
-          The list should have 3 elements, of shapes:
-          `[(input_dim, output_dim), (output_dim, output_dim), (output_dim,)]`.
-      return_sequences: Boolean. Whether to return the last output
-          in the output sequence, or the full sequence.
-      return_state: Boolean. Whether to return the last state
-          in addition to the output.
-      go_backwards: Boolean (default False).
-          If True, process the input sequence backwards and return the
-          reversed sequence.
-      stateful: Boolean (default False). If True, the last state
-          for each sample at index i in a batch will be used as initial
-          state for the sample of index i in the following batch.
-      unroll: Boolean (default False).
-          If True, the network will be unrolled,
-          else a symbolic loop will be used.
-          Unrolling can speed-up a RNN,
-          although it tends to be more memory-intensive.
-          Unrolling is only suitable for short sequences.
-      implementation: one of {0, 1, or 2}.
-          If set to 0, the RNN will use
-          an implementation that uses fewer, larger matrix products,
-          thus running faster on CPU but consuming more memory.
-          If set to 1, the RNN will use more matrix products,
-          but smaller ones, thus running slower
-          (may actually be faster on GPU) while consuming less memory.
-          If set to 2 (LSTM/GRU only),
-          the RNN will combine the input gate,
-          the forget gate and the output gate into a single matrix,
-          enabling more time-efficient parallelization on the GPU.
-          Note: RNN dropout must be shared for all gates,
-          resulting in a slightly reduced regularization.
-      input_dim: dimensionality of the input (integer).
-          This argument (or alternatively, the keyword argument `input_shape`)
-          is required when using this layer as the first layer in a model.
-      input_length: Length of input sequences, to be specified
-          when it is constant.
-          This argument is required if you are going to connect
-          `Flatten` then `Dense` layers upstream
-          (without it, the shape of the dense outputs cannot be computed).
-          Note that if the recurrent layer is not the first layer
-          in your model, you would need to specify the input length
-          at the level of the first layer
-          (e.g. via the `input_shape` argument)
-
-  Input shape:
-      3D tensor with shape `(batch_size, timesteps, input_dim)`,
-      (Optional) 2D tensors with shape `(batch_size, output_dim)`.
-
-  Output shape:
-      - if `return_state`: a list of tensors. The first tensor is
-          the output. The remaining tensors are the last states,
-          each with shape `(batch_size, units)`.
-      - if `return_sequences`: 3D tensor with shape
-          `(batch_size, timesteps, units)`.
-      - else, 2D tensor with shape `(batch_size, units)`.
-
-  # Masking
-      This layer supports masking for input data with a variable number
-      of timesteps. To introduce masks to your data,
-      use an `Embedding` layer with the `mask_zero` parameter
-      set to `True`.
-
-  # Note on using statefulness in RNNs
-      You can set RNN layers to be 'stateful', which means that the states
-      computed for the samples in one batch will be reused as initial states
-      for the samples in the next batch. This assumes a one-to-one mapping
-      between samples in different successive batches.
-
-      To enable statefulness:
-          - specify `stateful=True` in the layer constructor.
-          - specify a fixed batch size for your model, by passing
-              if sequential model:
-                `batch_input_shape=(...)` to the first layer in your model.
-              else for functional model with 1 or more Input layers:
-                `batch_shape=(...)` to all the first layers in your model.
-              This is the expected shape of your inputs
-              *including the batch size*.
-              It should be a tuple of integers, e.g. `(32, 10, 100)`.
-          - specify `shuffle=False` when calling fit().
-
-      To reset the states of your model, call `.reset_states()` on either
-      a specific layer, or on your entire model.
-
-  # Note on specifying the initial state of RNNs
-      You can specify the initial state of RNN layers symbolically by
-      calling them with the keyword argument `initial_state`. The value of
-      `initial_state` should be a tensor or list of tensors representing
-      the initial state of the RNN layer.
-
-      You can specify the initial state of RNN layers numerically by
-      calling `reset_states` with the keyword argument `states`. The value of
-      `states` should be a numpy array or list of numpy arrays representing
-      the initial state of the RNN layer.
-  """
-
-  def __init__(self,
-               return_sequences=False,
-               return_state=False,
-               go_backwards=False,
-               stateful=False,
-               unroll=False,
-               implementation=0,
-               **kwargs):
-    super(Recurrent, self).__init__(**kwargs)
-    self.return_sequences = return_sequences
-    self.return_state = return_state
-    self.go_backwards = go_backwards
-    self.stateful = stateful
-    self.unroll = unroll
-    self.implementation = implementation
-    self.supports_masking = True
-    self.input_spec = [InputSpec(ndim=3)]
-    self.state_spec = None
-    self.dropout = 0
-    self.recurrent_dropout = 0
-
-  @tf_utils.shape_type_conversion
-  def compute_output_shape(self, input_shape):
-    if isinstance(input_shape, list):
-      input_shape = input_shape[0]
-    input_shape = tensor_shape.TensorShape(input_shape).as_list()
-    if self.return_sequences:
-      output_shape = (input_shape[0], input_shape[1], self.units)
-    else:
-      output_shape = (input_shape[0], self.units)
-
-    if self.return_state:
-      state_shape = [tensor_shape.TensorShape(
-          (input_shape[0], self.units)) for _ in self.states]
-      return [tensor_shape.TensorShape(output_shape)] + state_shape
-    return tensor_shape.TensorShape(output_shape)
-
-  def compute_mask(self, inputs, mask):
-    if isinstance(mask, list):
-      mask = mask[0]
-    output_mask = mask if self.return_sequences else None
-    if self.return_state:
-      state_mask = [None for _ in self.states]
-      return [output_mask] + state_mask
-    return output_mask
-
-  def step(self, inputs, states):
-    raise NotImplementedError
-
-  def get_constants(self, inputs, training=None):
-    return []
-
-  def get_initial_state(self, inputs):
-    # build an all-zero tensor of shape (samples, output_dim)
-    initial_state = array_ops.zeros_like(inputs)
-    # shape of initial_state = (samples, timesteps, input_dim)
-    initial_state = math_ops.reduce_sum(initial_state, axis=(1, 2))
-    # shape of initial_state = (samples,)
-    initial_state = array_ops.expand_dims(initial_state, axis=-1)
-    # shape of initial_state = (samples, 1)
-    initial_state = K.tile(initial_state, [1,
-                                           self.units])  # (samples, output_dim)
-    initial_state = [initial_state for _ in range(len(self.states))]
-    return initial_state
-
-  def preprocess_input(self, inputs, training=None):
-    return inputs
-
-  def __call__(self, inputs, initial_state=None, **kwargs):
-    if (isinstance(inputs, (list, tuple)) and
-        len(inputs) > 1
-        and initial_state is None):
-      initial_state = inputs[1:]
-      inputs = inputs[0]
-
-    # If `initial_state` is specified,
-    # and if it a Keras tensor,
-    # then add it to the inputs and temporarily
-    # modify the input spec to include the state.
-    if initial_state is None:
-      return super(Recurrent, self).__call__(inputs, **kwargs)
-
-    if not isinstance(initial_state, (list, tuple)):
-      initial_state = [initial_state]
-
-    is_keras_tensor = hasattr(initial_state[0], '_keras_history')
-    for tensor in initial_state:
-      if hasattr(tensor, '_keras_history') != is_keras_tensor:
-        raise ValueError('The initial state of an RNN layer cannot be'
-                         ' specified with a mix of Keras tensors and'
-                         ' non-Keras tensors')
-
-    if is_keras_tensor:
-      # Compute the full input spec, including state
-      input_spec = self.input_spec
-      state_spec = self.state_spec
-      if not isinstance(input_spec, list):
-        input_spec = [input_spec]
-      if not isinstance(state_spec, list):
-        state_spec = [state_spec]
-      self.input_spec = input_spec + state_spec
-
-      # Compute the full inputs, including state
-      inputs = [inputs] + list(initial_state)
-
-      # Perform the call
-      output = super(Recurrent, self).__call__(inputs, **kwargs)
-
-      # Restore original input spec
-      self.input_spec = input_spec
-      return output
-    else:
-      kwargs['initial_state'] = initial_state
-      return super(Recurrent, self).__call__(inputs, **kwargs)
-
-  def call(self, inputs, mask=None, training=None, initial_state=None):
-    # input shape: `(samples, time (padded with zeros), input_dim)`
-    # note that the .build() method of subclasses MUST define
-    # self.input_spec and self.state_spec with complete input shapes.
-    if isinstance(inputs, list):
-      initial_state = inputs[1:]
-      inputs = inputs[0]
-    elif initial_state is not None:
-      pass
-    elif self.stateful:
-      initial_state = self.states
-    else:
-      initial_state = self.get_initial_state(inputs)
-
-    if isinstance(mask, list):
-      mask = mask[0]
-
-    if len(initial_state) != len(self.states):
-      raise ValueError('Layer has ' + str(len(self.states)) +
-                       ' states but was passed ' + str(len(initial_state)) +
-                       ' initial states.')
-    input_shape = K.int_shape(inputs)
-    if self.unroll and input_shape[1] is None:
-      raise ValueError('Cannot unroll a RNN if the '
-                       'time dimension is undefined. \n'
-                       '- If using a Sequential model, '
-                       'specify the time dimension by passing '
-                       'an `input_shape` or `batch_input_shape` '
-                       'argument to your first layer. If your '
-                       'first layer is an Embedding, you can '
-                       'also use the `input_length` argument.\n'
-                       '- If using the functional API, specify '
-                       'the time dimension by passing a `shape` '
-                       'or `batch_shape` argument to your Input layer.')
-    constants = self.get_constants(inputs, training=None)
-    preprocessed_input = self.preprocess_input(inputs, training=None)
-    last_output, outputs, states = K.rnn(
-        self.step,
-        preprocessed_input,
-        initial_state,
-        go_backwards=self.go_backwards,
-        mask=mask,
-        constants=constants,
-        unroll=self.unroll)
-    if self.stateful:
-      updates = []
-      for i in range(len(states)):
-        updates.append(state_ops.assign(self.states[i], states[i]))
-      self.add_update(updates, inputs)
-
-    # Properly set learning phase
-    if 0 < self.dropout + self.recurrent_dropout:
-      last_output._uses_learning_phase = True
-      outputs._uses_learning_phase = True
-
-    if not self.return_sequences:
-      outputs = last_output
-
-    if self.return_state:
-      if not isinstance(states, (list, tuple)):
-        states = [states]
-      else:
-        states = list(states)
-      return [outputs] + states
-    return outputs
-
-  def reset_states(self, states=None):
-    if not self.stateful:
-      raise AttributeError('Layer must be stateful.')
-    batch_size = self.input_spec[0].shape[0]
-    if not batch_size:
-      raise ValueError('If a RNN is stateful, it needs to know '
-                       'its batch size. Specify the batch size '
-                       'of your input tensors: \n'
-                       '- If using a Sequential model, '
-                       'specify the batch size by passing '
-                       'a `batch_input_shape` '
-                       'argument to your first layer.\n'
-                       '- If using the functional API, specify '
-                       'the time dimension by passing a '
-                       '`batch_shape` argument to your Input layer.')
-    # initialize state if None
-    if self.states[0] is None:
-      self.states = [K.zeros((batch_size, self.units)) for _ in self.states]
-    elif states is None:
-      for state in self.states:
-        K.set_value(state, np.zeros((batch_size, self.units)))
-    else:
-      if not isinstance(states, (list, tuple)):
-        states = [states]
-      if len(states) != len(self.states):
-        raise ValueError('Layer ' + self.name + ' expects ' +
-                         str(len(self.states)) + ' states, '
-                         'but it received ' + str(len(states)) +
-                         ' state values. Input received: ' + str(states))
-      for index, (value, state) in enumerate(zip(states, self.states)):
-        if value.shape != (batch_size, self.units):
-          raise ValueError('State ' + str(index) +
-                           ' is incompatible with layer ' + self.name +
-                           ': expected shape=' + str((batch_size, self.units)) +
-                           ', found shape=' + str(value.shape))
-        K.set_value(state, value)
-
-  def get_config(self):
-    config = {
-        'return_sequences': self.return_sequences,
-        'return_state': self.return_state,
-        'go_backwards': self.go_backwards,
-        'stateful': self.stateful,
-        'unroll': self.unroll,
-        'implementation': self.implementation
-    }
-    base_config = super(Recurrent, self).get_config()
-    return dict(list(base_config.items()) + list(config.items()))
-
-
 def _standardize_args(inputs, initial_state, constants, num_constants):
   """Standardizes `__call__` to a single list of tensor inputs.
 
diff --git a/tensorflow/python/keras/layers/recurrent_test.py b/tensorflow/python/keras/layers/recurrent_test.py
index 802374d2d2..fefb92826b 100644
--- a/tensorflow/python/keras/layers/recurrent_test.py
+++ b/tensorflow/python/keras/layers/recurrent_test.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.platform import test
+from tensorflow.python.training.checkpointable import util as checkpointable_util
 
 
 class RNNTest(test.TestCase):
@@ -556,5 +557,22 @@ class RNNTest(test.TestCase):
         [tuple(o.as_list()) for o in output_shape],
         expected_output_shape)
 
+  def test_checkpointable_dependencies(self):
+    rnn = keras.layers.SimpleRNN
+    with self.test_session():
+      x = np.random.random((2, 2, 2))
+      y = np.random.random((2, 2))
+      model = keras.models.Sequential()
+      model.add(rnn(2))
+      model.compile(optimizer='rmsprop', loss='mse')
+      model.fit(x, y, epochs=1, batch_size=1)
+
+      # check whether the model variables are present in the
+      # checkpointable list of objects
+      checkpointed_objects = set(checkpointable_util.list_objects(model))
+      for v in model.variables:
+        self.assertIn(v, checkpointed_objects)
+
+
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/layers/simplernn_test.py b/tensorflow/python/keras/layers/simplernn_test.py
index 18fefbe84f..1429537648 100644
--- a/tensorflow/python/keras/layers/simplernn_test.py
+++ b/tensorflow/python/keras/layers/simplernn_test.py
@@ -183,6 +183,7 @@ class SimpleRNNLayerTest(test.TestCase):
       self.assertEqual(layer.cell.recurrent_kernel.constraint, r_constraint)
       self.assertEqual(layer.cell.bias.constraint, b_constraint)
 
+  @tf_test_util.run_in_graph_and_eager_modes
   def test_with_masking_layer_SimpleRNN(self):
     layer_class = keras.layers.SimpleRNN
     with self.test_session():
@@ -192,7 +193,8 @@ class SimpleRNNLayerTest(test.TestCase):
       model = keras.models.Sequential()
       model.add(keras.layers.Masking(input_shape=(3, 4)))
       model.add(layer_class(units=5, return_sequences=True, unroll=False))
-      model.compile(loss='categorical_crossentropy', optimizer='adam')
+      model.compile(loss='categorical_crossentropy',
+                    optimizer=RMSPropOptimizer(0.01))
       model.fit(inputs, targets, epochs=1, batch_size=2, verbose=1)
 
   def test_from_config_SimpleRNN(self):
diff --git a/tensorflow/python/keras/layers/wrappers.py b/tensorflow/python/keras/layers/wrappers.py
index e61acf8e77..f0c1e76156 100644
--- a/tensorflow/python/keras/layers/wrappers.py
+++ b/tensorflow/python/keras/layers/wrappers.py
@@ -47,7 +47,6 @@ class Wrapper(Layer):
   def __init__(self, layer, **kwargs):
     assert isinstance(layer, Layer)
     self.layer = layer
-    self._track_checkpointable(layer, name='layer')
     # Tracks mapping of Wrapper inputs to inner layer inputs. Useful when
     # the inner layer has update ops that depend on its inputs (as opposed
     # to the inputs to the Wrapper layer).
@@ -168,6 +167,39 @@ class TimeDistributed(Wrapper):
           '`Layer` instance. You passed: {input}'.format(input=layer))
     super(TimeDistributed, self).__init__(layer, **kwargs)
     self.supports_masking = True
+    self._track_checkpointable(layer, name='layer')
+
+  def _get_shape_tuple(self, init_tuple, tensor, start_idx, int_shape=None):
+    """Finds non-specific dimensions in the static shapes.
+
+    The static shapes are replaced with the corresponding dynamic shapes of the
+    tensor.
+
+    Arguments:
+        init_tuple: a tuple, the first part of the output shape
+        tensor: the tensor from which to get the (static and dynamic) shapes
+            as the last part of the output shape
+        start_idx: int, which indicate the first dimension to take from
+            the static shape of the tensor
+        int_shape: an alternative static shape to take as the last part
+            of the output shape
+    Returns:
+        The new int_shape with the first part from init_tuple
+        and the last part from either `int_shape` (if provided)
+        or `tensor.shape`, where every `None` is replaced by
+        the corresponding dimension from `tf.shape(tensor)`.
+    """
+    # replace all None in int_shape by K.shape
+    if int_shape is None:
+      int_shape = K.int_shape(tensor)[start_idx:]
+    if not any(not s for s in int_shape):
+      return init_tuple + tuple(int_shape)
+    shape = K.shape(tensor)
+    int_shape = list(int_shape)
+    for i, s in enumerate(int_shape):
+      if not s:
+        int_shape[i] = shape[start_idx + i]
+    return init_tuple + tuple(int_shape)
 
   def build(self, input_shape):
     input_shape = tensor_shape.TensorShape(input_shape).as_list()
@@ -224,18 +256,24 @@ class TimeDistributed(Wrapper):
       input_length = input_shape[1]
       if not input_length:
         input_length = array_ops.shape(inputs)[1]
+      inner_input_shape = self._get_shape_tuple((-1,), inputs, 2)
       # Shape: (num_samples * timesteps, ...). And track the
       # transformation in self._input_map.
       input_uid = generic_utils.object_list_uid(inputs)
-      inputs = array_ops.reshape(inputs, (-1,) + input_shape[2:])
+      inputs = array_ops.reshape(inputs, inner_input_shape)
       self._input_map[input_uid] = inputs
       # (num_samples * timesteps, ...)
+      if generic_utils.has_arg(self.layer.call, 'mask') and mask is not None:
+        inner_mask_shape = self._get_shape_tuple((-1,), mask, 2)
+        kwargs['mask'] = K.reshape(mask, inner_mask_shape)
       y = self.layer.call(inputs, **kwargs)
       if hasattr(y, '_uses_learning_phase'):
         uses_learning_phase = y._uses_learning_phase
       # Shape: (num_samples, timesteps, ...)
       output_shape = self.compute_output_shape(input_shape).as_list()
-      y = array_ops.reshape(y, (-1, input_length) + tuple(output_shape[2:]))
+      output_shape = self._get_shape_tuple(
+          (-1, input_length), y, 1, output_shape[2:])
+      y = array_ops.reshape(y, output_shape)
 
     # Apply activity regularizer if any:
     if (hasattr(self.layer, 'activity_regularizer') and
@@ -247,6 +285,80 @@ class TimeDistributed(Wrapper):
       y._uses_learning_phase = True
     return y
 
+  def compute_mask(self, inputs, mask=None):
+    """Computes an output mask tensor for Embedding layer.
+
+    This is based on the inputs, mask, and the inner layer.
+    If batch size is specified:
+    Simply return the input `mask`. (An rnn-based implementation with
+    more than one rnn inputs is required but not supported in tf.keras yet.)
+    Otherwise we call `compute_mask` of the inner layer at each time step.
+    If the output mask at each time step is not `None`:
+    (E.g., inner layer is Masking or RNN)
+    Concatenate all of them and return the concatenation.
+    If the output mask at each time step is `None` and the input mask is not
+    `None`:(E.g., inner layer is Dense)
+    Reduce the input_mask to 2 dimensions and return it.
+    Otherwise (both the output mask and the input mask are `None`):
+    (E.g., `mask` is not used at all)
+    Return `None`.
+
+    Arguments:
+      inputs: Tensor with shape [batch size, timesteps, ...] indicating the
+          input to TimeDistributed. If static shape information is available for
+          "batch size", `mask` is returned unmodified.
+      mask: Either None (indicating no masking) or a Tensor indicating the
+          input mask for TimeDistributed. The shape can be static or dynamic.
+
+    Returns:
+      Either None (no masking), or a [batch size, timesteps, ...] Tensor with
+      an output mask for the TimeDistributed layer with the shape beyond the
+      second dimension being the value of the input mask shape(if the computed
+      output mask is none), an output mask with the shape beyond the first
+      dimension being the value of the mask shape(if mask is not None) or
+      output mask with the shape beyond the first dimension being the
+      value of the computed output shape.
+
+    """
+    # cases need to call the layer.compute_mask when input_mask is None:
+    # Masking layer and Embedding layer with mask_zero
+    input_shape = K.int_shape(inputs)
+    if input_shape[0]:
+      # batch size matters, we currently do not handle mask explicitly
+      return mask
+    inner_mask = mask
+    if inner_mask is not None:
+      inner_mask_shape = self._get_shape_tuple((-1,), mask, 2)
+      inner_mask = K.reshape(inner_mask, inner_mask_shape)
+    input_uid = generic_utils.object_list_uid(inputs)
+    inner_inputs = self._input_map[input_uid]
+    output_mask = self.layer.compute_mask(inner_inputs, inner_mask)
+    if output_mask is None:
+      if mask is None:
+        return None
+      # input_mask is not None, and output_mask is None:
+      # we should return a not-None mask
+      output_mask = mask
+      for _ in range(2, len(K.int_shape(mask))):
+        output_mask = K.any(output_mask, axis=-1)
+    else:
+      # output_mask is not None. We need to reshape it
+      input_length = input_shape[1]
+      if not input_length:
+        input_length = K.shape(inputs)[1]
+      output_mask_int_shape = K.int_shape(output_mask)
+      if output_mask_int_shape is None:
+        # if the output_mask does not have a static shape,
+        # its shape must be the same as mask's
+        if mask is not None:
+          output_mask_int_shape = K.int_shape(mask)
+        else:
+          output_mask_int_shape = K.compute_output_shape(input_shape)[:-1]
+      output_mask_shape = self._get_shape_tuple(
+          (-1, input_length), output_mask, 1, output_mask_int_shape[1:])
+      output_mask = K.reshape(output_mask, output_mask_shape)
+    return output_mask
+
 
 @tf_export('keras.layers.Bidirectional')
 class Bidirectional(Wrapper):
@@ -305,6 +417,8 @@ class Bidirectional(Wrapper):
     self._num_constants = None
     super(Bidirectional, self).__init__(layer, **kwargs)
     self.input_spec = layer.input_spec
+    self._track_checkpointable(self.forward_layer, name='forward_layer')
+    self._track_checkpointable(self.backward_layer, name='backward_layer')
 
   @property
   def trainable(self):
@@ -414,7 +528,8 @@ class Bidirectional(Wrapper):
     else:
       return super(Bidirectional, self).__call__(inputs, **kwargs)
 
-  def call(self, inputs,
+  def call(self,
+           inputs,
            training=None,
            mask=None,
            initial_state=None,
diff --git a/tensorflow/python/keras/layers/wrappers_test.py b/tensorflow/python/keras/layers/wrappers_test.py
index c8f0d216e6..0cd774ef0f 100644
--- a/tensorflow/python/keras/layers/wrappers_test.py
+++ b/tensorflow/python/keras/layers/wrappers_test.py
@@ -87,6 +87,8 @@ class TimeDistributedTest(test.TestCase):
     # test config
     model.get_config()
 
+    # check whether the model variables are present in the
+    # checkpointable list of objects
     checkpointed_objects = set(checkpointable_util.list_objects(model))
     for v in model.variables:
       self.assertIn(v, checkpointed_objects)
@@ -190,8 +192,8 @@ class TimeDistributedTest(test.TestCase):
     x = keras.layers.Input(shape=(3, 2))
     layer = keras.layers.TimeDistributed(keras.layers.BatchNormalization())
     _ = layer(x)
-    assert len(layer.updates) == 2
-    assert len(layer.trainable_weights) == 2
+    self.assertEquals(len(layer.updates), 2)
+    self.assertEquals(len(layer.trainable_weights), 2)
     layer.trainable = False
     assert not layer.updates
     assert not layer.trainable_weights
@@ -199,6 +201,62 @@ class TimeDistributedTest(test.TestCase):
     assert len(layer.updates) == 2
     assert len(layer.trainable_weights) == 2
 
+  def test_TimeDistributed_with_masked_embedding_and_unspecified_shape(self):
+    with self.test_session():
+      # test with unspecified shape and Embeddings with mask_zero
+      model = keras.models.Sequential()
+      model.add(keras.layers.TimeDistributed(
+          keras.layers.Embedding(5, 6, mask_zero=True),
+          input_shape=(None, None)))  # N by t_1 by t_2 by 6
+      model.add(keras.layers.TimeDistributed(
+          keras.layers.SimpleRNN(7, return_sequences=True)))
+      model.add(keras.layers.TimeDistributed(
+          keras.layers.SimpleRNN(8, return_sequences=False)))
+      model.add(keras.layers.SimpleRNN(1, return_sequences=False))
+      model.compile(optimizer='rmsprop', loss='mse')
+      model_input = np.random.randint(low=1, high=5, size=(10, 3, 4),
+                                      dtype='int32')
+      for i in range(4):
+        model_input[i, i:, i:] = 0
+      model.fit(model_input,
+                np.random.random((10, 1)), epochs=1, batch_size=10)
+      mask_outputs = [model.layers[0].compute_mask(model.input)]
+      for layer in model.layers[1:]:
+        mask_outputs.append(layer.compute_mask(layer.input, mask_outputs[-1]))
+      func = keras.backend.function([model.input], mask_outputs[:-1])
+      mask_outputs_val = func([model_input])
+      ref_mask_val_0 = model_input > 0         # embedding layer
+      ref_mask_val_1 = ref_mask_val_0          # first RNN layer
+      ref_mask_val_2 = np.any(ref_mask_val_1, axis=-1)     # second RNN layer
+      ref_mask_val = [ref_mask_val_0, ref_mask_val_1, ref_mask_val_2]
+      for i in range(3):
+        self.assertAllEqual(mask_outputs_val[i], ref_mask_val[i])
+      self.assertIs(mask_outputs[-1], None)  # final layer
+
+  def test_TimeDistributed_with_masking_layer(self):
+    with self.test_session():
+      # test with Masking layer
+      model = keras.models.Sequential()
+      model.add(keras.layers.TimeDistributed(keras.layers.Masking(
+          mask_value=0.,), input_shape=(None, 4)))
+      model.add(keras.layers.TimeDistributed(keras.layers.Dense(5)))
+      model.compile(optimizer='rmsprop', loss='mse')
+      model_input = np.random.randint(low=1, high=5, size=(10, 3, 4))
+      for i in range(4):
+        model_input[i, i:, :] = 0.
+      model.compile(optimizer='rmsprop', loss='mse')
+      model.fit(model_input,
+                np.random.random((10, 3, 5)), epochs=1, batch_size=6)
+      mask_outputs = [model.layers[0].compute_mask(model.input)]
+      mask_outputs += [model.layers[1].compute_mask(model.layers[1].input,
+                                                    mask_outputs[-1])]
+      func = keras.backend.function([model.input], mask_outputs)
+      mask_outputs_val = func([model_input])
+      self.assertEqual((mask_outputs_val[0]).all(),
+                       model_input.all())
+      self.assertEqual((mask_outputs_val[1]).all(),
+                       model_input.all())
+
 
 class BidirectionalTest(test.TestCase):
 
@@ -222,6 +280,12 @@ class BidirectionalTest(test.TestCase):
         model.compile(optimizer=RMSPropOptimizer(0.01), loss='mse')
         model.fit(x, y, epochs=1, batch_size=1)
 
+        # check whether the model variables are present in the
+        # checkpointable list of objects
+        checkpointed_objects = set(checkpointable_util.list_objects(model))
+        for v in model.variables:
+          self.assertIn(v, checkpointed_objects)
+
         # test compute output shape
         ref_shape = model.layers[-1].output.get_shape()
         shape = model.layers[-1].compute_output_shape(
diff --git a/tensorflow/python/keras/metrics.py b/tensorflow/python/keras/metrics.py
index e03d7dfe93..7d8b1fec45 100644
--- a/tensorflow/python/keras/metrics.py
+++ b/tensorflow/python/keras/metrics.py
@@ -19,9 +19,18 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from abc import ABCMeta
+from abc import abstractmethod
+
+import types
 import six
 
+from tensorflow.python.eager import context
+from tensorflow.python.eager import function
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.keras import backend as K
+from tensorflow.python.keras.engine.base_layer import Layer
 from tensorflow.python.keras.losses import binary_crossentropy
 from tensorflow.python.keras.losses import categorical_crossentropy
 from tensorflow.python.keras.losses import cosine_proximity
@@ -37,14 +46,471 @@ from tensorflow.python.keras.losses import sparse_categorical_crossentropy
 from tensorflow.python.keras.losses import squared_hinge
 from tensorflow.python.keras.utils.generic_utils import deserialize_keras_object
 from tensorflow.python.keras.utils.generic_utils import serialize_keras_object
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import confusion_matrix
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import variable_scope as vs
+from tensorflow.python.ops import weights_broadcast_ops
+from tensorflow.python.training import distribute as distribute_lib
+from tensorflow.python.util import tf_decorator
 from tensorflow.python.util.tf_export import tf_export
 
 
+def check_is_tensor_or_operation(x, name):
+  """Raises type error if the given input is not a tensor or operation."""
+  if not (isinstance(x, ops.Tensor) or isinstance(x, ops.Operation)):
+    raise TypeError('{0} must be a Tensor or Operation, given: {1}'.format(
+        name, x))
+
+
+def update_state_wrapper(update_state_fn):
+  """Decorator to wrap metric `update_state()` with `defun()`, `add_update()`.
+
+  Args:
+    update_state_fn: function that accumulates metric statistics.
+
+  Returns:
+    If eager execution is enabled, returns None.
+    If graph execution is enabled, returns an update op. This op should be
+      executed to update the metric state with the given inputs.
+  """
+
+  def decorated(metric_obj, *args, **kwargs):
+    """Decorated function with `defun()` and `add_update()`."""
+
+    # Converting update_state_fn() into a graph function, so that
+    # we can return a single op that performs all of the variable updates.
+    # Assigning to a different method name to avoid reference cycle.
+    defuned_update_state_fn = function.defun(update_state_fn)
+    update_op = defuned_update_state_fn(*args, **kwargs)
+    if update_op is not None:  # update_op will be None in eager execution.
+      metric_obj.add_update(update_op, inputs=True)
+      check_is_tensor_or_operation(
+          update_op, 'Metric {0}\'s update'.format(metric_obj.name))
+    return update_op
+
+  return tf_decorator.make_decorator(update_state_fn, decorated)
+
+
+def result_wrapper(result_fn):
+  """Decorator to wrap metric `result()` function in `merge_call()`.
+
+  Result computation is an idempotent operation that simply calculates the
+  metric value using the state variables.
+
+  If metric state variables are distributed across towers/devices and
+  `result()` is requested from the context of one device - This function wraps
+  `result()` in a distribution strategy `merge_call()`. With this,
+  the metric state variables will be aggregated across devices.
+
+  Args:
+    result_fn: function that computes the metric result.
+
+  Returns:
+    The metric result tensor.
+  """
+
+  def decorated(metric_obj, *args):
+    """Decorated function with merge_call."""
+    tower_context = distribute_lib.get_tower_context()
+    if tower_context is None:  # if in cross tower context already
+      result_t = result_fn(*args)
+    else:
+      # TODO(psv): Test distribution of metrics using different distribution
+      # strategies.
+
+      # Creating a wrapper for merge_fn. merge_call invokes the given merge_fn
+      # with distribution object as the first parameter. We create a wrapper
+      # here so that the result function need not have that parameter.
+      def merge_fn_wrapper(distribution, merge_fn, *args):
+        # We will get `PerDevice` merge function. Taking the first one as all
+        # are identical copies of the function that we had passed below.
+        return distribution.unwrap(merge_fn)[0](*args)
+
+      # Wrapping result in merge_call. merge_call is used when we want to leave
+      # tower mode and compute a value in cross tower mode.
+      result_t = tower_context.merge_call(merge_fn_wrapper, result_fn, *args)
+    check_is_tensor_or_operation(result_t,
+                                 'Metric {0}\'s result'.format(metric_obj.name))
+    return result_t
+
+  return tf_decorator.make_decorator(result_fn, decorated)
+
+
+def _safe_div(numerator, denominator):
+  """Divides two tensors element-wise, returning 0 if the denominator is <= 0.
+
+  Args:
+    numerator: A `Tensor`.
+    denominator: A `Tensor`, with dtype matching `numerator`.
+
+  Returns:
+    0 if `denominator` <= 0, else `numerator` / `denominator`
+  """
+  t = math_ops.truediv(numerator, denominator)
+  zero = array_ops.zeros_like(t, dtype=denominator.dtype)
+  condition = math_ops.greater(denominator, zero)
+  zero = math_ops.cast(zero, t.dtype)
+  return array_ops.where(condition, t, zero)
+
+
+def _squeeze_or_expand_dimensions(y_pred, y_true, sample_weight):
+  """Squeeze or expand last dimension if needed.
+
+  1. Squeezes last dim of `y_pred` or `y_true` if their rank differs by 1
+  (using `confusion_matrix.remove_squeezable_dimensions`).
+  2. Squeezes or expands last dim of `sample_weight` if its rank differs by 1
+  from the new rank of `y_pred`.
+  If `sample_weight` is scalar, it is kept scalar.
+
+  This will use static shape if available. Otherwise, it will add graph
+  operations, which could result in a performance hit.
+
+  Args:
+    y_pred: Predicted values, a `Tensor` of arbitrary dimensions.
+    y_true: Optional label `Tensor` whose dimensions match `y_pred`.
+    sample_weight: Optional weight scalar or `Tensor` whose dimensions match
+      `y_pred`.
+
+  Returns:
+    Tuple of `y_pred`, `y_true` and `sample_weight`. Each of them possibly has
+    the last dimension squeezed,
+    `sample_weight` could be extended by one dimension.
+  """
+  if y_true is not None:
+    # squeeze last dim of `y_pred` or `y_true` if their rank differs by 1
+    y_true, y_pred = confusion_matrix.remove_squeezable_dimensions(
+        y_true, y_pred)
+    y_pred.get_shape().assert_is_compatible_with(y_true.get_shape())
+
+  if sample_weight is None:
+    return y_pred, y_true, None
+
+  sample_weight = ops.convert_to_tensor(sample_weight)
+  weights_shape = sample_weight.get_shape()
+  weights_rank = weights_shape.ndims
+  if weights_rank == 0:  # If weights is scalar, do nothing.
+    return y_pred, y_true, sample_weight
+
+  y_pred_shape = y_pred.get_shape()
+  y_pred_rank = y_pred_shape.ndims
+  if (y_pred_rank is not None) and (weights_rank is not None):
+    # Use static rank.
+    if weights_rank - y_pred_rank == 1:
+      sample_weight = array_ops.squeeze(sample_weight, [-1])
+    elif y_pred_rank - weights_rank == 1:
+      sample_weight = array_ops.expand_dims(sample_weight, [-1])
+    return y_pred, y_true, sample_weight
+
+  # Use dynamic rank.
+  weights_rank_tensor = array_ops.rank(sample_weight)
+  rank_diff = weights_rank_tensor - array_ops.rank(y_pred)
+  maybe_squeeze_weights = lambda: array_ops.squeeze(sample_weight, [-1])
+
+  def _maybe_expand_weights():
+    return control_flow_ops.cond(
+        math_ops.equal(rank_diff,
+                       -1), lambda: array_ops.expand_dims(sample_weight, [-1]),
+        lambda: sample_weight)
+
+  def _maybe_adjust_weights():
+    return control_flow_ops.cond(
+        math_ops.equal(rank_diff, 1), maybe_squeeze_weights,
+        _maybe_expand_weights)
+
+  # squeeze or expand last dim of `sample_weight` if its rank differs by 1
+  # from the new rank of `y_pred`.
+  sample_weight = control_flow_ops.cond(
+      math_ops.equal(weights_rank_tensor, 0), lambda: sample_weight,
+      _maybe_adjust_weights)
+  return y_pred, y_true, sample_weight
+
+
+class Metric(Layer):
+  """Encapsulates metric logic and state.
+
+  Usage with eager execution:
+
+  ```python
+  m = SomeMetric(...)
+  for input in ...:
+    m.update_state(input)
+  print('Final result: ', m.result().numpy())
+  ```
+
+  Usage with graph execution:
+
+  ```python
+  m = SomeMetric(...)
+  init_op = tf.global_variables_initializer()  # Initialize variables
+  with tf.Session() as sess:
+    sess.run(init_op)
+    for input in ...:
+      update_op = m.update_state(input)
+      sess.run(update_op)
+    print('Final result: ', sess.run(m.result()))
+  ```
+
+  To be implemented by subclasses:
+  * `__init__()`: All state variables should be created in this method by
+    calling `self.add_weight()` like: `self.var = self.add_weight(...)`
+  * `update_state()`: Has all updates to the state variables like:
+    self.var.assign_add(...).
+  * `result()`: Computes and returns a value for the metric
+    from the state variables.
+
+  Example subclass implementation:
+
+  ```
+  class BinaryTruePositives(Metric):
+    def __init__(self, name='binary-true-positives', dtype=None):
+      super(BinaryTruePositives, self).__init__(name=name, dtype=dtype)
+      self.true_positives = self.add_weight(
+          'true_positives', initializer=init_ops.zeros_initializer)
+
+    def update_state(self, y_true, y_pred, sample_weight=None):
+      y_true = math_ops.cast(y_true, dtypes.bool)
+      y_pred = math_ops.cast(y_pred, dtypes.bool)
+      y_pred, y_true, sample_weight = _squeeze_or_expand_dimensions(
+          y_pred, y_true, sample_weight)
+
+      values = math_ops.logical_and(
+          math_ops.equal(y_true, True), math_ops.equal(y_pred, True))
+      values = math_ops.cast(values, self._dtype)
+      if sample_weight is not None:
+        sample_weight = math_ops.cast(sample_weight, self._dtype)
+        values = math_ops.multiply(values, sample_weight)
+      state_ops.assign_add(self.true_positives, math_ops.reduce_sum(values))
+
+    def result(self):
+      return array_ops.identity(self.true_positives)
+  ```
+  """
+  __metaclass__ = ABCMeta
+
+  def __init__(self, name=None, dtype=None):
+    super(Metric, self).__init__(name=name, dtype=dtype)
+    self.stateful = True  # All metric layers are stateful.
+    self.built = True
+    self._dtype = K.floatx() if dtype is None else dtypes.as_dtype(dtype).name
+
+  def __new__(cls, *args, **kwargs):
+    obj = super(Metric, cls).__new__(cls, *args, **kwargs)
+    obj.update_state = types.MethodType(
+        update_state_wrapper(obj.update_state), obj)
+    obj.result = types.MethodType(result_wrapper(obj.result), obj)
+    return obj
+
+  def __call__(self, *args, **kwargs):
+    """Accumulates statistics and then computes metric result value.
+
+    Args:
+      *args:
+      **kwargs: A mini-batch of inputs to the Metric,
+        passed on to `update_state()`.
+
+    Returns:
+      The metric value tensor.
+    """
+    update_op = self.update_state(*args, **kwargs)  # pylint: disable=not-callable
+    with ops.control_dependencies([update_op]):
+      return self.result()  # pylint: disable=not-callable
+
+  def reset_states(self):
+    """Resets all of the metric state variables.
+
+    This function is called between epochs/steps,
+    when a metric is evaluated during training.
+    """
+    for v in self.variables:
+      K.set_value(v, 0)
+
+  @abstractmethod
+  def update_state(self, *args, **kwargs):
+    """Accumulates statistics for the metric.
+
+    Note: This function is executed as a graph function in graph mode.
+    This means:
+      a) Operations on the same resource are executed in textual order.
+         This should make it easier to do things like add the updated
+         value of a variable to another, for example.
+      b) You don't need to worry about collecting the update ops to execute.
+         All update ops added to the graph by this function will be executed.
+      As a result, code should generally work the same way with graph or
+      eager execution.
+    and adds the update op to the metric layer.
+
+    Args:
+      *args:
+      **kwargs: A mini-batch of inputs to the Metric.
+    """
+    NotImplementedError('Must be implemented in subclasses.')
+
+  @abstractmethod
+  def result(self):
+    """Computes and returns the metric value tensor.
+
+    Result computation is an idempotent operation that simply calculates the
+    metric value using the state variables.
+    """
+    NotImplementedError('Must be implemented in subclasses.')
+
+  ### For use by subclasses ###
+  def add_weight(self,
+                 name,
+                 shape=(),
+                 aggregation=vs.VariableAggregation.SUM,
+                 synchronization=vs.VariableSynchronization.ON_READ,
+                 initializer=None):
+    """Adds state variable. Only for use by subclasses."""
+    return super(Metric, self).add_weight(
+        name=name,
+        shape=shape,
+        dtype=self._dtype,
+        trainable=False,
+        initializer=initializer,
+        synchronization=synchronization,
+        aggregation=aggregation)
+
+  ### End: For use by subclasses ###
+
+
+class Mean(Metric):
+  """Computes the (weighted) mean of the given values.
+
+  This metric creates two variables, `total` and `count` that are used to
+  compute the average of `values`. This average is ultimately returned as `mean`
+  which is an idempotent operation that simply divides `total` by `count`.
+
+  If `sample_weight` is `None`, weights default to 1.
+  Use `sample_weight` of 0 to mask values.
+  """
+
+  def __init__(self, name='mean', dtype=None):
+    """Creates a `Mean` instance.
+
+    Args:
+      name: (Optional) string name of the metric instance.
+      dtype: (Optional) data type of the metric result.
+    """
+    super(Mean, self).__init__(name=name, dtype=dtype)
+    # Create new state variables
+    self.total = self.add_weight(
+        'total', initializer=init_ops.zeros_initializer)
+    self.count = self.add_weight(
+        'count', initializer=init_ops.zeros_initializer)
+
+  def update_state(self, values, sample_weight=None):
+    """Accumulates statistics for computing the mean.
+
+    For example, if `values` is [1, 3, 5, 7] then the mean is 4. If
+    the `sample_weight` is specified as [1, 1, 0, 0] then the mean would be 2.
+
+    Args:
+      values: Per-example value.
+      sample_weight: Optional weighting of each example. Defaults to 1.
+    """
+    values = math_ops.cast(values, self._dtype)
+    if sample_weight is None:
+      num_values = math_ops.cast(array_ops.size(values), self._dtype)
+    else:
+      sample_weight = math_ops.cast(sample_weight, self._dtype)
+
+      # Update dimensions of weights to match with values.
+      values, _, sample_weight = _squeeze_or_expand_dimensions(
+          values, None, sample_weight)
+      sample_weight = weights_broadcast_ops.broadcast_weights(
+          sample_weight, values)
+      num_values = math_ops.reduce_sum(sample_weight)
+      values = math_ops.multiply(values, sample_weight)
+    values = math_ops.reduce_sum(values)
+
+    # Update state variables
+    state_ops.assign_add(self.total, values)
+    state_ops.assign_add(self.count, num_values)
+
+  def result(self):
+    return _safe_div(self.total, self.count)
+
+
+class MeanMetricWrapper(Mean):
+  """Wraps a stateless metric function with the Mean metric."""
+
+  def __init__(self, fn, name=None, dtype=None, **kwargs):
+    """Creates a `MeanMetricWrapper` instance.
+
+    Args:
+      fn: The metric function to wrap, with signature
+        `fn(y_true, y_pred, **kwargs)`.
+      name: (Optional) string name of the metric instance.
+      dtype: (Optional) data type of the metric result.
+      **kwargs: The keyword arguments that are passed on to `fn`.
+    """
+    super(MeanMetricWrapper, self).__init__(name=name, dtype=dtype)
+    self._fn = fn
+    self._fn_kwargs = kwargs
+
+  def update_state(self, y_true, y_pred, sample_weight=None):
+    """Accumulates metric statistics.
+
+    `y_true` and `y_pred` should have the same shape.
+
+    Args:
+      y_true: The ground truth values.
+      y_pred: The predicted values.
+      sample_weight: Optional weighting of each example. Defaults to 1. Can be
+        a `Tensor` whose rank is either 0, or the same rank as `y_true`,
+        and must be broadcastable to `y_true`.
+    """
+    y_true = math_ops.cast(y_true, self._dtype)
+    y_pred = math_ops.cast(y_pred, self._dtype)
+    y_pred, y_true, sample_weight = _squeeze_or_expand_dimensions(
+        y_pred, y_true, sample_weight)
+
+    matches = self._fn(y_true, y_pred, **self._fn_kwargs)
+    super(MeanMetricWrapper, self).update_state(
+        matches, sample_weight=sample_weight)
+
+  def get_config(self):
+    config = self._fn_kwargs
+    base_config = super(MeanMetricWrapper, self).get_config()
+    return dict(list(base_config.items()) + list(config.items()))
+
+
+class BinaryAccuracy(MeanMetricWrapper):
+  """Calculates how often predictions matches labels.
+
+  This metric creates two local variables, `total` and `count` that are used to
+  compute the frequency with which `y_pred` matches `y_true`. This frequency is
+  ultimately returned as `binary accuracy`: an idempotent operation that simply
+  divides `total` by `count`.
+
+  If `sample_weight` is `None`, weights default to 1.
+  Use `sample_weight` of 0 to mask values.
+  """
+
+  def __init__(self, name='binary-accuracy', dtype=None, threshold=0.5):
+    """Creates a `BinaryAccuracy` instance.
+
+    Args:
+      name: (Optional) string name of the metric instance.
+      dtype: (Optional) data type of the metric result.
+      threshold: (Optional) Float representing the threshold for deciding
+      whether prediction values are 1 or 0.
+    """
+    super(BinaryAccuracy, self).__init__(
+        binary_accuracy, name, dtype=dtype, threshold=threshold)
+
+
 @tf_export('keras.metrics.binary_accuracy')
-def binary_accuracy(y_true, y_pred):
-  return K.mean(math_ops.equal(y_true, math_ops.round(y_pred)), axis=-1)
+def binary_accuracy(y_true, y_pred, threshold=0.5):
+  threshold = math_ops.cast(threshold, y_pred.dtype)
+  y_pred = math_ops.cast(y_pred > threshold, y_pred.dtype)
+  return K.mean(math_ops.equal(y_true, y_pred), axis=-1)
 
 
 @tf_export('keras.metrics.categorical_accuracy')
diff --git a/tensorflow/python/keras/metrics_test.py b/tensorflow/python/keras/metrics_test.py
index 15e793f5fc..d583379708 100644
--- a/tensorflow/python/keras/metrics_test.py
+++ b/tensorflow/python/keras/metrics_test.py
@@ -18,67 +18,72 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
 import numpy as np
 
-from tensorflow.python import keras
+from tensorflow.python.eager import context
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.keras import backend as K
+from tensorflow.python.keras import layers
+from tensorflow.python.keras import metrics
+from tensorflow.python.keras.engine.training import Model
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
+from tensorflow.python.training.checkpointable import util as checkpointable_utils
 
 
 class KerasMetricsTest(test.TestCase):
 
   def test_metrics(self):
     with self.test_session():
-      y_a = keras.backend.variable(np.random.random((6, 7)))
-      y_b = keras.backend.variable(np.random.random((6, 7)))
-      for metric in [keras.metrics.binary_accuracy,
-                     keras.metrics.categorical_accuracy]:
+      y_a = K.variable(np.random.random((6, 7)))
+      y_b = K.variable(np.random.random((6, 7)))
+      for metric in [metrics.binary_accuracy, metrics.categorical_accuracy]:
         output = metric(y_a, y_b)
-        self.assertEqual(keras.backend.eval(output).shape, (6,))
+        self.assertEqual(K.eval(output).shape, (6,))
 
   def test_sparse_categorical_accuracy(self):
     with self.test_session():
-      metric = keras.metrics.sparse_categorical_accuracy
-      y_a = keras.backend.variable(np.random.randint(0, 7, (6,)))
-      y_b = keras.backend.variable(np.random.random((6, 7)))
-      self.assertEqual(keras.backend.eval(metric(y_a, y_b)).shape, (6,))
+      metric = metrics.sparse_categorical_accuracy
+      y_a = K.variable(np.random.randint(0, 7, (6,)))
+      y_b = K.variable(np.random.random((6, 7)))
+      self.assertEqual(K.eval(metric(y_a, y_b)).shape, (6,))
 
   def test_sparse_top_k_categorical_accuracy(self):
     with self.test_session():
-      y_pred = keras.backend.variable(np.array([[0.3, 0.2, 0.1],
-                                                [0.1, 0.2, 0.7]]))
-      y_true = keras.backend.variable(np.array([[1], [0]]))
-      result = keras.backend.eval(
-          keras.metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=3))
+      y_pred = K.variable(np.array([[0.3, 0.2, 0.1], [0.1, 0.2, 0.7]]))
+      y_true = K.variable(np.array([[1], [0]]))
+      result = K.eval(
+          metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=3))
       self.assertEqual(result, 1)
-      result = keras.backend.eval(
-          keras.metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=2))
+      result = K.eval(
+          metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=2))
       self.assertEqual(result, 0.5)
-      result = keras.backend.eval(
-          keras.metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=1))
+      result = K.eval(
+          metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=1))
       self.assertEqual(result, 0.)
 
   def test_top_k_categorical_accuracy(self):
     with self.test_session():
-      y_pred = keras.backend.variable(np.array([[0.3, 0.2, 0.1],
-                                                [0.1, 0.2, 0.7]]))
-      y_true = keras.backend.variable(np.array([[0, 1, 0], [1, 0, 0]]))
-      result = keras.backend.eval(
-          keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=3))
+      y_pred = K.variable(np.array([[0.3, 0.2, 0.1], [0.1, 0.2, 0.7]]))
+      y_true = K.variable(np.array([[0, 1, 0], [1, 0, 0]]))
+      result = K.eval(metrics.top_k_categorical_accuracy(y_true, y_pred, k=3))
       self.assertEqual(result, 1)
-      result = keras.backend.eval(
-          keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=2))
+      result = K.eval(metrics.top_k_categorical_accuracy(y_true, y_pred, k=2))
       self.assertEqual(result, 0.5)
-      result = keras.backend.eval(
-          keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=1))
+      result = K.eval(metrics.top_k_categorical_accuracy(y_true, y_pred, k=1))
       self.assertEqual(result, 0.)
 
   def test_stateful_metrics(self):
     with self.test_session():
       np.random.seed(1334)
 
-      class BinaryTruePositives(keras.layers.Layer):
+      class BinaryTruePositives(layers.Layer):
         """Stateful Metric to count the total true positives over all batches.
 
         Assumes predictions and targets of shape `(samples, 1)`.
@@ -91,11 +96,11 @@ class KerasMetricsTest(test.TestCase):
 
         def __init__(self, name='true_positives', **kwargs):
           super(BinaryTruePositives, self).__init__(name=name, **kwargs)
-          self.true_positives = keras.backend.variable(value=0, dtype='int32')
+          self.true_positives = K.variable(value=0, dtype='int32')
           self.stateful = True
 
         def reset_states(self):
-          keras.backend.set_value(self.true_positives, 0)
+          K.set_value(self.true_positives, 0)
 
         def __call__(self, y_true, y_pred):
           """Computes the number of true positives in a batch.
@@ -120,14 +125,14 @@ class KerasMetricsTest(test.TestCase):
           return current_true_pos + true_pos
 
       metric_fn = BinaryTruePositives()
-      config = keras.metrics.serialize(metric_fn)
-      metric_fn = keras.metrics.deserialize(
+      config = metrics.serialize(metric_fn)
+      metric_fn = metrics.deserialize(
           config, custom_objects={'BinaryTruePositives': BinaryTruePositives})
 
       # Test on simple model
-      inputs = keras.Input(shape=(2,))
-      outputs = keras.layers.Dense(1, activation='sigmoid')(inputs)
-      model = keras.Model(inputs, outputs)
+      inputs = layers.Input(shape=(2,))
+      outputs = layers.Dense(1, activation='sigmoid')(inputs)
+      model = Model(inputs, outputs)
       model.compile(optimizer='sgd',
                     loss='binary_crossentropy',
                     metrics=['acc', metric_fn])
@@ -184,6 +189,214 @@ class KerasMetricsTest(test.TestCase):
       self.assertAllClose(
           val_outs[2], history.history['val_true_positives'][-1], atol=1e-5)
 
+  @test_util.run_in_graph_and_eager_modes
+  def test_mean(self):
+    m = metrics.Mean(name='my_mean')
+
+    # check config
+    self.assertEqual(m.name, 'my_mean')
+    self.assertTrue(m.stateful)
+    self.assertEqual(m.dtype, dtypes.float32)
+    self.assertEqual(len(m.variables), 2)
+    self.evaluate(variables.global_variables_initializer())
+
+    # check initial state
+    self.assertEqual(self.evaluate(m.total), 0)
+    self.assertEqual(self.evaluate(m.count), 0)
+
+    # check __call__()
+    self.assertEqual(self.evaluate(m(100)), 100)
+    self.assertEqual(self.evaluate(m.total), 100)
+    self.assertEqual(self.evaluate(m.count), 1)
+
+    # check update_state() and result() + state accumulation + tensor input
+    update_op = m.update_state(ops.convert_n_to_tensor([1, 5]))
+    self.evaluate(update_op)
+    self.assertAlmostEqual(self.evaluate(m.result()), 106 / 3, 2)
+    self.assertEqual(self.evaluate(m.total), 106)  # 100 + 1 + 5
+    self.assertEqual(self.evaluate(m.count), 3)
+
+    # check reset_states()
+    m.reset_states()
+    self.assertEqual(self.evaluate(m.total), 0)
+    self.assertEqual(self.evaluate(m.count), 0)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_mean_with_sample_weight(self):
+    m = metrics.Mean(dtype=dtypes.float64)
+    self.assertEqual(m.dtype, dtypes.float64)
+    self.evaluate(variables.global_variables_initializer())
+
+    # check scalar weight
+    result_t = m(100, sample_weight=0.5)
+    self.assertEqual(self.evaluate(result_t), 50 / 0.5)
+    self.assertEqual(self.evaluate(m.total), 50)
+    self.assertEqual(self.evaluate(m.count), 0.5)
+
+    # check weights not scalar and weights rank matches values rank
+    result_t = m([1, 5], sample_weight=[1, 0.2])
+    result = self.evaluate(result_t)
+    self.assertAlmostEqual(result, 52 / 1.7, 2)
+    self.assertAlmostEqual(self.evaluate(m.total), 52, 2)  # 50 + 1 + 5 * 0.2
+    self.assertAlmostEqual(self.evaluate(m.count), 1.7, 2)  # 0.5 + 1.2
+
+    # check weights broadcast
+    result_t = m([1, 2], sample_weight=0.5)
+    self.assertAlmostEqual(self.evaluate(result_t), 53.5 / 2.7, 2)
+    self.assertAlmostEqual(self.evaluate(m.total), 53.5, 2)  # 52 + 0.5 + 1
+    self.assertAlmostEqual(self.evaluate(m.count), 2.7, 2)  # 1.7 + 0.5 + 0.5
+
+    # check weights squeeze
+    result_t = m([1, 5], sample_weight=[[1], [0.2]])
+    self.assertAlmostEqual(self.evaluate(result_t), 55.5 / 3.9, 2)
+    self.assertAlmostEqual(self.evaluate(m.total), 55.5, 2)  # 53.5 + 1 + 1
+    self.assertAlmostEqual(self.evaluate(m.count), 3.9, 2)  # 2.7 + 1.2
+
+    # check weights expand
+    result_t = m([[1], [5]], sample_weight=[1, 0.2])
+    self.assertAlmostEqual(self.evaluate(result_t), 57.5 / 5.1, 2)
+    self.assertAlmostEqual(self.evaluate(m.total), 57.5, 2)  # 55.5 + 1 + 1
+    self.assertAlmostEqual(self.evaluate(m.count), 5.1, 2)  # 3.9 + 1.2
+
+  def test_mean_graph_with_placeholder(self):
+    with context.graph_mode(), self.test_session() as sess:
+      m = metrics.Mean()
+      v = array_ops.placeholder(dtypes.float32)
+      w = array_ops.placeholder(dtypes.float32)
+      sess.run(variables.global_variables_initializer())
+
+      # check __call__()
+      result_t = m(v, sample_weight=w)
+      result = sess.run(result_t, feed_dict=({v: 100, w: 0.5}))
+      self.assertEqual(sess.run(m.total), 50)
+      self.assertEqual(sess.run(m.count), 0.5)
+      self.assertEqual(result, 50 / 0.5)
+
+      # check update_state() and result()
+      result = sess.run(result_t, feed_dict=({v: [1, 5], w: [1, 0.2]}))
+      self.assertAlmostEqual(sess.run(m.total), 52, 2)  # 50 + 1 + 5 * 0.2
+      self.assertAlmostEqual(sess.run(m.count), 1.7, 2)  # 0.5 + 1.2
+      self.assertAlmostEqual(result, 52 / 1.7, 2)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_save_restore(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, 'ckpt')
+    m = metrics.Mean()
+    checkpoint = checkpointable_utils.Checkpoint(mean=m)
+    self.evaluate(variables.global_variables_initializer())
+
+    # update state
+    self.evaluate(m(100.))
+    self.evaluate(m(200.))
+
+    # save checkpoint and then add an update
+    save_path = checkpoint.save(checkpoint_prefix)
+    self.evaluate(m(1000.))
+
+    # restore to the same checkpoint mean object
+    checkpoint.restore(save_path).assert_consumed().run_restore_ops()
+    self.evaluate(m(300.))
+    self.assertEqual(200., self.evaluate(m.result()))
+
+    # restore to a different checkpoint mean object
+    restore_mean = metrics.Mean()
+    restore_checkpoint = checkpointable_utils.Checkpoint(mean=restore_mean)
+    status = restore_checkpoint.restore(save_path)
+    restore_update = restore_mean(300.)
+    status.assert_consumed().run_restore_ops()
+    self.evaluate(restore_update)
+    self.assertEqual(200., self.evaluate(restore_mean.result()))
+    self.assertEqual(3, self.evaluate(restore_mean.count))
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_binary_accuracy(self):
+    acc_obj = metrics.BinaryAccuracy(name='my acc')
+
+    # check config
+    self.assertEqual(acc_obj.name, 'my acc')
+    self.assertTrue(acc_obj.stateful)
+    self.assertEqual(len(acc_obj.variables), 2)
+    self.assertEqual(acc_obj.dtype, dtypes.float32)
+    self.evaluate(variables.global_variables_initializer())
+
+    # verify that correct value is returned
+    update_op = acc_obj.update_state([[1], [0]], [[1], [0]])
+    self.evaluate(update_op)
+    result = self.evaluate(acc_obj.result())
+    self.assertEqual(result, 1)  # 2/2
+
+    # check y_pred squeeze
+    update_op = acc_obj.update_state([[1], [1]], [[[1]], [[0]]])
+    self.evaluate(update_op)
+    result = self.evaluate(acc_obj.result())
+    self.assertAlmostEqual(result, 0.75, 2)  # 3/4
+
+    # check y_true squeeze
+    result_t = acc_obj([[[1]], [[1]]], [[1], [0]])
+    result = self.evaluate(result_t)
+    self.assertAlmostEqual(result, 0.67, 2)  # 4/6
+
+    # check with sample_weight
+    result_t = acc_obj([[1], [1]], [[1], [0]], [[0.5], [0.2]])
+    result = self.evaluate(result_t)
+    self.assertAlmostEqual(result, 0.67, 2)  # 4.5/6.7
+
+    # check incompatible shapes
+    with self.assertRaisesRegexp(ValueError,
+                                 r'Shapes \(1,\) and \(2,\) are incompatible'):
+      acc_obj.update_state([1, 1], [1])
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_binary_accuracy_threshold(self):
+    acc_obj = metrics.BinaryAccuracy(threshold=0.7)
+    self.evaluate(variables.global_variables_initializer())
+    result_t = acc_obj([[1], [1], [0], [0]], [[0.9], [0.6], [0.4], [0.8]])
+    result = self.evaluate(result_t)
+    self.assertAlmostEqual(result, 0.5, 2)
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_invalid_result(self):
+
+    class InvalidResult(metrics.Metric):
+
+      def __init__(self, name='invalid-result', dtype=dtypes.float64):
+        super(InvalidResult, self).__init__(name=name, dtype=dtype)
+
+      def update_state(self, *args, **kwargs):
+        pass
+
+      def result(self):
+        return 1
+
+    invalid_result_obj = InvalidResult()
+    with self.assertRaisesRegexp(
+        TypeError,
+        'Metric invalid-result\'s result must be a Tensor or Operation, given:'
+    ):
+      invalid_result_obj.result()
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_invalid_update(self):
+
+    class InvalidUpdate(metrics.Metric):
+
+      def __init__(self, name='invalid-update', dtype=dtypes.float64):
+        super(InvalidUpdate, self).__init__(name=name, dtype=dtype)
+
+      def update_state(self, *args, **kwargs):
+        return [1]
+
+      def result(self):
+        pass
+
+    invalid_update_obj = InvalidUpdate()
+    with self.assertRaisesRegexp(
+        TypeError,
+        'Metric invalid-update\'s update must be a Tensor or Operation, given:'
+    ):
+      invalid_update_obj.update_state()
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/model_subclassing_test.py b/tensorflow/python/keras/model_subclassing_test.py
index 3ac4852eff..6cbea45bd5 100644
--- a/tensorflow/python/keras/model_subclassing_test.py
+++ b/tensorflow/python/keras/model_subclassing_test.py
@@ -29,6 +29,8 @@ from tensorflow.python.eager import context
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import embedding_ops
+from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import test
 from tensorflow.python.training.checkpointable import data_structures
@@ -65,6 +67,22 @@ class SimpleTestModel(keras.Model):
     return self.dense2(x)
 
 
+class SimpleConvTestModel(keras.Model):
+
+  def __init__(self, num_classes=10):
+    super(SimpleConvTestModel, self).__init__(name='test_model')
+    self.num_classes = num_classes
+
+    self.conv1 = keras.layers.Conv2D(32, (3, 3), activation='relu')
+    self.flatten = keras.layers.Flatten()
+    self.dense1 = keras.layers.Dense(num_classes, activation='softmax')
+
+  def call(self, x):
+    x = self.conv1(x)
+    x = self.flatten(x)
+    return self.dense1(x)
+
+
 class MultiIOTestModel(keras.Model):
 
   def __init__(self, use_bn=False, use_dp=False, num_classes=(2, 3)):
@@ -162,9 +180,6 @@ def get_nested_model_3(input_dim, num_classes):
       x = self.dense2(x)
       return self.bn(x)
 
-    def compute_output_shape(self, input_shape):
-      return tensor_shape.TensorShape((input_shape[0], 5))
-
   test_model = Inner()
   x = test_model(x)
   outputs = keras.layers.Dense(num_classes)(x)
@@ -174,6 +189,234 @@ def get_nested_model_3(input_dim, num_classes):
 class ModelSubclassingTest(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes
+  def test_custom_build(self):
+    class DummyModel(keras.Model):
+
+      def __init__(self):
+        super(DummyModel, self).__init__()
+        self.dense1 = keras.layers.Dense(32, activation='relu')
+        self.uses_custom_build = False
+
+      def call(self, inputs):
+        return self.dense1(inputs)
+
+      def build(self, input_shape):
+        self.uses_custom_build = True
+
+    test_model = DummyModel()
+    dummy_data = array_ops.ones((32, 50))
+    test_model(dummy_data)
+    self.assertTrue(test_model.uses_custom_build, 'Model should use user '
+                                                  'defined build when called.')
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_invalid_input_shape_build(self):
+    num_classes = 2
+    input_dim = 50
+
+    model = SimpleTestModel(num_classes=num_classes,
+                            use_dp=True,
+                            use_bn=True)
+
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    with self.assertRaisesRegexp(
+        ValueError, 'input shape is not one of the valid types'):
+      model.build(input_shape=tensor_shape.Dimension(input_dim))
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_embed_dtype_with_subclass_build(self):
+    class Embedding(keras.layers.Layer):
+      """An Embedding layer."""
+
+      def __init__(self, vocab_size, embedding_dim, **kwargs):
+        super(Embedding, self).__init__(**kwargs)
+        self.vocab_size = vocab_size
+        self.embedding_dim = embedding_dim
+
+      def build(self, _):
+        self.embedding = self.add_variable(
+            'embedding_kernel',
+            shape=[self.vocab_size, self.embedding_dim],
+            dtype=np.float32,
+            initializer=init_ops.random_uniform_initializer(-0.1, 0.1),
+            trainable=True)
+
+      def call(self, x):
+        return embedding_ops.embedding_lookup(self.embedding, x)
+
+    class EmbedModel(keras.Model):
+
+      def __init__(self, vocab_size, embed_size):
+        super(EmbedModel, self).__init__()
+        self.embed1 = Embedding(vocab_size, embed_size)
+
+      def call(self, inputs):
+        return self.embed1(inputs)
+
+    model = EmbedModel(100, 20)
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    with self.assertRaisesRegexp(
+        ValueError, 'if your layers do not support float type inputs'):
+      model.build(input_shape=(35, 20))
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_single_time_step_rnn_build(self):
+    dim = 4
+    timesteps = 1
+    batch_input_shape = (None, timesteps, dim)
+    units = 3
+
+    class SimpleRNNModel(keras.Model):
+
+      def __init__(self):
+        super(SimpleRNNModel, self).__init__()
+        self.lstm = keras.layers.LSTM(units)
+
+      def call(self, inputs):
+        return self.lstm(inputs)
+
+    model = SimpleRNNModel()
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    model.build(batch_input_shape)
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+    model(array_ops.ones((32, timesteps, dim)))
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_single_io_subclass_build(self):
+    num_classes = 2
+    input_dim = 50
+    batch_size = None
+
+    model = SimpleTestModel(num_classes=num_classes,
+                            use_dp=True,
+                            use_bn=True)
+
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    model.build(input_shape=(batch_size, input_dim))
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+    model(array_ops.ones((32, input_dim)))
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_single_io_dimension_subclass_build(self):
+    num_classes = 2
+    input_dim = tensor_shape.Dimension(50)
+    batch_size = tensor_shape.Dimension(None)
+
+    model = SimpleTestModel(num_classes=num_classes,
+                            use_dp=True,
+                            use_bn=True)
+
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    model.build(input_shape=(batch_size, input_dim))
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+    model(array_ops.ones((32, input_dim)))
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_multidim_io_subclass_build(self):
+    num_classes = 10
+    # Input size, e.g. image
+    batch_size = 32
+    input_shape = (32, 32, 3)
+
+    model = SimpleConvTestModel(num_classes)
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    batch_input_shape = (batch_size,) + input_shape
+    model.build(input_shape=batch_input_shape)
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+
+    model(array_ops.ones(batch_input_shape))
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_tensorshape_io_subclass_build(self):
+    num_classes = 10
+    # Input size, e.g. image
+    batch_size = None
+    input_shape = (32, 32, 3)
+
+    model = SimpleConvTestModel(num_classes)
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    model.build(
+        input_shape=tensor_shape.TensorShape((batch_size,) + input_shape))
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+
+    model(array_ops.ones((32,) + input_shape))
+
+  def test_subclass_save_model(self):
+    num_classes = 10
+    # Input size, e.g. image
+    batch_size = None
+    input_shape = (32, 32, 3)
+
+    model = SimpleConvTestModel(num_classes)
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    model.build(
+        input_shape=tensor_shape.TensorShape((batch_size,) + input_shape))
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+    weights = model.get_weights()
+
+    tf_format_name = os.path.join(self.get_temp_dir(), 'ckpt')
+    model.save_weights(tf_format_name)
+    if h5py is not None:
+      hdf5_format_name = os.path.join(self.get_temp_dir(), 'weights.h5')
+      model.save_weights(hdf5_format_name)
+
+    model = SimpleConvTestModel(num_classes)
+    model.build(
+        input_shape=tensor_shape.TensorShape((batch_size,) + input_shape))
+    if h5py is not None:
+      model.load_weights(hdf5_format_name)
+      self.assertAllClose(weights, model.get_weights())
+    model.load_weights(tf_format_name)
+    self.assertAllClose(weights, model.get_weights())
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_multi_io_subclass_build(self):
+    batch_size = None
+    num_samples = 1000
+    input_dim = 50
+    model = MultiIOTestModel()
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    batch_input_shape = tensor_shape.TensorShape((batch_size, input_dim))
+    model.build(
+        input_shape=[batch_input_shape, batch_input_shape])
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+    x1 = array_ops.ones((num_samples, input_dim))
+    x2 = array_ops.ones((num_samples, input_dim))
+    model([x1, x2])
+
+  @test_util.run_in_graph_and_eager_modes
   def test_single_io_workflow_with_np_arrays(self):
     num_classes = 2
     num_samples = 100
@@ -750,6 +993,16 @@ class CustomCallModel(keras.Model):
       return combined
 
 
+class TrainingNoDefaultModel(keras.Model):
+
+  def __init__(self):
+    super(TrainingNoDefaultModel, self).__init__()
+    self.dense1 = keras.layers.Dense(1)
+
+  def call(self, x, training):
+    return self.dense1(x)
+
+
 class CustomCallSignatureTests(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes
@@ -767,6 +1020,32 @@ class CustomCallSignatureTests(test.TestCase):
     self.assertAllClose(expected_output, self.evaluate(output))
 
   @test_util.run_in_graph_and_eager_modes
+  def test_training_args_call_build(self):
+    input_dim = 2
+
+    model = TrainingNoDefaultModel()
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    model.build((None, input_dim))
+    self.assertTrue(model.weights, ('Model should have weights now that it '
+                                    'has been properly built.'))
+    self.assertTrue(model.built, 'Model should be built after calling `build`.')
+
+  @test_util.run_in_graph_and_eager_modes
+  def test_custom_call_kwargs_and_build(self):
+    first_input_shape = (2, 3)
+    second_input_shape = (2, 5)
+
+    model = CustomCallModel()
+    self.assertFalse(model.built, 'Model should not have been built')
+    self.assertFalse(model.weights, ('Model should have no weights since it '
+                                     'has not been built.'))
+    with self.assertRaisesRegexp(
+        ValueError, 'cannot build your model if it has positional'):
+      model.build(input_shape=[first_input_shape, second_input_shape])
+
+  @test_util.run_in_graph_and_eager_modes
   def test_inputs_in_signature(self):
 
     class HasInputsAndOtherPositional(keras.Model):
@@ -829,14 +1108,9 @@ class CustomCallSignatureTests(test.TestCase):
 
   def test_training_no_default(self):
 
-    class TrainingNoDefault(keras.Model):
-
-      def call(self, x, training):
-        return x
-
     with context.graph_mode():
-      model = TrainingNoDefault()
-      arg = array_ops.ones([])
+      model = TrainingNoDefaultModel()
+      arg = array_ops.ones([1, 1])
       model(arg, True)
       six.assertCountEqual(self, [arg], model.inputs)
 
diff --git a/tensorflow/python/keras/models_test.py b/tensorflow/python/keras/models_test.py
index ad3819e6e7..1525104ac9 100644
--- a/tensorflow/python/keras/models_test.py
+++ b/tensorflow/python/keras/models_test.py
@@ -37,6 +37,7 @@ class TestModelCloning(test.TestCase):
 
       model = keras.models.Sequential()
       model.add(keras.layers.Dense(4, input_shape=(4,)))
+      model.add(keras.layers.BatchNormalization())
       model.add(keras.layers.Dropout(0.5))
       model.add(keras.layers.Dense(4))
 
@@ -46,6 +47,8 @@ class TestModelCloning(test.TestCase):
     with self.test_session():
       # With placeholder creation
       new_model = keras.models.clone_model(model)
+      # update ops from batch norm needs to be included
+      self.assertEquals(len(new_model.get_updates_for(new_model.inputs)), 2)
       new_model.compile('rmsprop', 'mse')
       new_model.train_on_batch(val_a, val_out)
 
@@ -53,6 +56,7 @@ class TestModelCloning(test.TestCase):
       input_a = keras.Input(shape=(4,))
       new_model = keras.models.clone_model(
           model, input_tensors=input_a)
+      self.assertEquals(len(new_model.get_updates_for(new_model.inputs)), 2)
       new_model.compile('rmsprop', 'mse')
       new_model.train_on_batch(val_a, val_out)
 
@@ -60,6 +64,7 @@ class TestModelCloning(test.TestCase):
       input_a = keras.backend.variable(val_a)
       new_model = keras.models.clone_model(
           model, input_tensors=input_a)
+      self.assertEquals(len(new_model.get_updates_for(new_model.inputs)), 2)
       new_model.compile('rmsprop', 'mse')
       new_model.train_on_batch(None, val_out)
 
@@ -76,6 +81,7 @@ class TestModelCloning(test.TestCase):
 
       x_a = dense_1(input_a)
       x_a = keras.layers.Dropout(0.5)(x_a)
+      x_a = keras.layers.BatchNormalization()(x_a)
       x_b = dense_1(input_b)
       x_a = dense_2(x_a)
       outputs = keras.layers.add([x_a, x_b])
@@ -87,6 +93,7 @@ class TestModelCloning(test.TestCase):
     with self.test_session():
       # With placeholder creation
       new_model = keras.models.clone_model(model)
+      self.assertEquals(len(new_model.get_updates_for(new_model.inputs)), 2)
       new_model.compile('rmsprop', 'mse')
       new_model.train_on_batch([val_a, val_b], val_out)
 
@@ -95,6 +102,7 @@ class TestModelCloning(test.TestCase):
       input_b = keras.Input(shape=(4,), name='b')
       new_model = keras.models.clone_model(
           model, input_tensors=[input_a, input_b])
+      self.assertEquals(len(new_model.get_updates_for(new_model.inputs)), 2)
       new_model.compile('rmsprop', 'mse')
       new_model.train_on_batch([val_a, val_b], val_out)
 
@@ -103,6 +111,7 @@ class TestModelCloning(test.TestCase):
       input_b = keras.backend.variable(val_b)
       new_model = keras.models.clone_model(
           model, input_tensors=[input_a, input_b])
+      self.assertEquals(len(new_model.get_updates_for(new_model.inputs)), 2)
       new_model.compile('rmsprop', 'mse')
       new_model.train_on_batch(None, val_out)
 
diff --git a/tensorflow/python/keras/testing_utils.py b/tensorflow/python/keras/testing_utils.py
index 17aba7d86c..6e8ee06ff5 100644
--- a/tensorflow/python/keras/testing_utils.py
+++ b/tensorflow/python/keras/testing_utils.py
@@ -18,7 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from collections import OrderedDict
 import numpy as np
 
 from tensorflow.python import keras
@@ -185,75 +184,3 @@ def layer_test(layer_cls, kwargs=None, input_shape=None, input_dtype=None,
   # for further checks in the caller function
   return actual_output
 
-
-def _combine_named_parameters(**kwargs):
-  """Generate combinations based on its keyword arguments.
-
-  Two sets of returned combinations can be concatenated using +.  Their product
-  can be computed using `times()`.
-
-  Args:
-    **kwargs: keyword arguments of form `option=[possibilities, ...]`
-         or `option=the_only_possibility`.
-
-  Returns:
-    a list of dictionaries for each combination. Keys in the dictionaries are
-    the keyword argument names.  Each key has one value - one of the
-    corresponding keyword argument values.
-  """
-  if not kwargs:
-    return [OrderedDict()]
-
-  sort_by_key = lambda k: k[0][0]
-  kwargs = OrderedDict(sorted(kwargs.items(), key=sort_by_key))
-  first = list(kwargs.items())[0]
-
-  rest = dict(list(kwargs.items())[1:])
-  rest_combined = _combine_named_parameters(**rest)
-
-  key = first[0]
-  values = first[1]
-  if not isinstance(values, list):
-    values = [values]
-
-  combinations = [
-      OrderedDict(sorted(list(combined.items()) + [(key, v)], key=sort_by_key))
-      for v in values
-      for combined in rest_combined
-  ]
-  return combinations
-
-
-def generate_combinations_with_testcase_name(**kwargs):
-  """Generate combinations based on its keyword arguments using combine().
-
-  This function calls combine() and appends a testcase name to the list of
-  dictionaries returned. The 'testcase_name' key is a required for named
-  parameterized tests.
-
-  Args:
-    **kwargs: keyword arguments of form `option=[possibilities, ...]`
-         or `option=the_only_possibility`.
-
-  Returns:
-    a list of dictionaries for each combination. Keys in the dictionaries are
-    the keyword argument names.  Each key has one value - one of the
-    corresponding keyword argument values.
-  """
-  combinations = _combine_named_parameters(**kwargs)
-  named_combinations = []
-  for combination in combinations:
-    assert isinstance(combination, OrderedDict)
-    name = ''.join([
-        '_{}_{}'.format(
-            ''.join(filter(str.isalnum, key)),
-            ''.join(filter(str.isalnum, str(value))))
-        for key, value in combination.items()
-    ])
-    named_combinations.append(
-        OrderedDict(
-            list(combination.items()) + [('testcase_name',
-                                          '_test{}'.format(name))]))
-
-  return named_combinations
-
diff --git a/tensorflow/python/keras/utils/generic_utils.py b/tensorflow/python/keras/utils/generic_utils.py
index a69893955f..2e56fa2dc5 100644
--- a/tensorflow/python/keras/utils/generic_utils.py
+++ b/tensorflow/python/keras/utils/generic_utils.py
@@ -162,7 +162,7 @@ def deserialize_keras_object(identifier,
       if cls is None:
         raise ValueError('Unknown ' + printable_module_name + ': ' + class_name)
     if hasattr(cls, 'from_config'):
-      arg_spec = tf_inspect.getargspec(cls.from_config)
+      arg_spec = tf_inspect.getfullargspec(cls.from_config)
       custom_objects = custom_objects or {}
 
       if 'custom_objects' in arg_spec.args:
@@ -281,8 +281,8 @@ def has_arg(fn, name, accept_all=False):
   Returns:
       bool, whether `fn` accepts a `name` keyword argument.
   """
-  arg_spec = tf_inspect.getargspec(fn)
-  if accept_all and arg_spec.keywords is not None:
+  arg_spec = tf_inspect.getfullargspec(fn)
+  if accept_all and arg_spec.varkw is not None:
     return True
   return name in arg_spec.args
 
diff --git a/tensorflow/python/keras/utils/np_utils.py b/tensorflow/python/keras/utils/np_utils.py
index 9d9c72b162..c24e87308b 100644
--- a/tensorflow/python/keras/utils/np_utils.py
+++ b/tensorflow/python/keras/utils/np_utils.py
@@ -33,7 +33,8 @@ def to_categorical(y, num_classes=None):
       num_classes: total number of classes.
 
   Returns:
-      A binary matrix representation of the input.
+      A binary matrix representation of the input. The classes axis is placed
+      last.
   """
   y = np.array(y, dtype='int')
   input_shape = y.shape
diff --git a/tensorflow/python/kernel_tests/BUILD b/tensorflow/python/kernel_tests/BUILD
index 6bfd1936e3..2451dc7257 100644
--- a/tensorflow/python/kernel_tests/BUILD
+++ b/tensorflow/python/kernel_tests/BUILD
@@ -566,6 +566,7 @@ tf_py_test(
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:linalg_ops",
     ],
+    shard_count = 16,
 )
 
 tf_py_test(
@@ -701,7 +702,7 @@ tf_py_test(
 
 tf_py_test(
     name = "priority_queue_test",
-    size = "small",
+    size = "medium",
     srcs = ["priority_queue_test.py"],
     additional_deps = [
         "//third_party/py/numpy",
@@ -1525,6 +1526,7 @@ cuda_py_test(
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:math_ops",
     ],
+    tags = ["no_windows_gpu"],
 )
 
 cuda_py_test(
@@ -1717,7 +1719,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "matmul_op_test",
-    size = "small",
+    size = "medium",
     srcs = ["matmul_op_test.py"],
     additional_deps = [
         "//third_party/py/numpy",
@@ -2057,6 +2059,7 @@ cuda_py_test(
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:math_ops",
     ],
+    tags = ["no_windows_gpu"],
 )
 
 tf_py_test(
@@ -3092,7 +3095,7 @@ tf_py_test(
 
 tf_py_test(
     name = "cond_v2_test",
-    size = "small",
+    size = "medium",
     srcs = ["cond_v2_test.py"],
     additional_deps = [
         "//tensorflow/python:array_ops",
@@ -3107,4 +3110,5 @@ tf_py_test(
         "//tensorflow/python:training",
     ],
     grpc_enabled = True,
+    tags = ["no_gpu"],  # TODO(b/111656070)
 )
diff --git a/tensorflow/python/kernel_tests/argmax_op_test.py b/tensorflow/python/kernel_tests/argmax_op_test.py
index ce06769902..1202c463e8 100644
--- a/tensorflow/python/kernel_tests/argmax_op_test.py
+++ b/tensorflow/python/kernel_tests/argmax_op_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.platform import test
 
@@ -115,6 +116,12 @@ class ArgMaxTest(test.TestCase):
         ans = op([1]).eval()
         self.assertAllEqual(ans, 0)
 
+  def testOutputEmpty(self):
+    with self.test_session():
+      for op in math_ops.argmin, math_ops.argmax:
+        ret = op(array_ops.zeros(shape=[1, 0, 2]), axis=-1).eval()
+        self.assertEqual(ret.shape, (1, 0))
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/bitcast_op_test.py b/tensorflow/python/kernel_tests/bitcast_op_test.py
index a535468b05..a2c6b54273 100644
--- a/tensorflow/python/kernel_tests/bitcast_op_test.py
+++ b/tensorflow/python/kernel_tests/bitcast_op_test.py
@@ -76,12 +76,18 @@ class BitcastTest(test.TestCase):
     datatype = dtypes.int8
     array_ops.bitcast(x, datatype, None)
 
-  def testQuantizeType(self):
+  def testQuantizedType(self):
     shape = [3, 4]
     x = np.zeros(shape, np.uint16)
     datatype = dtypes.quint16
     self._testBitcast(x, datatype, shape)
 
+  def testUnsignedType(self):
+    shape = [3, 4]
+    x = np.zeros(shape, np.int64)
+    datatype = dtypes.uint64
+    self._testBitcast(x, datatype, shape)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/boosted_trees/training_ops_test.py b/tensorflow/python/kernel_tests/boosted_trees/training_ops_test.py
index 13b804875e..d55240297a 100644
--- a/tensorflow/python/kernel_tests/boosted_trees/training_ops_test.py
+++ b/tensorflow/python/kernel_tests/boosted_trees/training_ops_test.py
@@ -139,6 +139,49 @@ class UpdateTreeEnsembleOpTest(test_util.TensorFlowTestCase):
       self.assertEqual(new_stamp, 1)
       self.assertProtoEquals(expected_result, tree_ensemble)
 
+  def testBiasCenteringOnEmptyEnsemble(self):
+    """Test growing with bias centering on an empty ensemble."""
+    with self.test_session() as session:
+      # Create empty ensemble.
+      tree_ensemble = boosted_trees_ops.TreeEnsemble('ensemble')
+      tree_ensemble_handle = tree_ensemble.resource_handle
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      gradients = np.array([[5.]], dtype=np.float32)
+      hessians = np.array([[24.]], dtype=np.float32)
+
+      # Grow tree ensemble.
+      grow_op = boosted_trees_ops.center_bias(
+          tree_ensemble_handle,
+          mean_gradients=gradients,
+          mean_hessians=hessians,
+          l1=0.0,
+          l2=1.0
+      )
+      session.run(grow_op)
+
+      new_stamp, serialized = session.run(tree_ensemble.serialize())
+
+      tree_ensemble = boosted_trees_pb2.TreeEnsemble()
+      tree_ensemble.ParseFromString(serialized)
+
+      expected_result = """
+        trees {
+         nodes {
+            leaf {
+              scalar: -0.2
+            }
+          }
+        }
+        tree_weights: 1.0
+        tree_metadata {
+          num_layers_grown: 0
+          is_finalized: false
+        }
+      """
+      self.assertEqual(new_stamp, 1)
+      self.assertProtoEquals(expected_result, tree_ensemble)
+
   def testGrowExistingEnsembleTreeNotFinalized(self):
     """Test growing an existing ensemble with the last tree not finalized."""
     with self.test_session() as session:
@@ -666,7 +709,6 @@ class UpdateTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           num_layers_attempted: 1
           last_layer_node_start: 1
           last_layer_node_end: 3
-
         }
       """, tree_ensemble_config)
 
diff --git a/tensorflow/python/kernel_tests/cond_v2_test.py b/tensorflow/python/kernel_tests/cond_v2_test.py
index 759db5d5f4..97ce245fc8 100644
--- a/tensorflow/python/kernel_tests/cond_v2_test.py
+++ b/tensorflow/python/kernel_tests/cond_v2_test.py
@@ -22,6 +22,7 @@ from __future__ import print_function
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import cond_v2
@@ -35,10 +36,12 @@ from tensorflow.python.training import saver
 from tensorflow.python.util import compat
 
 
-class NewCondTest(test.TestCase):
+class CondV2Test(test.TestCase):
 
-  def _testCond(self, true_fn, false_fn, train_vals):
-    with self.test_session() as sess:
+  def _testCond(self, true_fn, false_fn, train_vals, feed_dict=None):
+    if not feed_dict:
+      feed_dict = {}
+    with self.test_session(graph=ops.get_default_graph()) as sess:
       pred = array_ops.placeholder(dtypes.bool, name="pred")
 
       expected = control_flow_ops.cond(pred, true_fn, false_fn, name="expected")
@@ -47,13 +50,17 @@ class NewCondTest(test.TestCase):
       expected_grad = gradients_impl.gradients(expected, train_vals)
       actual_grad = gradients_impl.gradients(actual, train_vals)
 
+      sess_run_args = {pred: True}
+      sess_run_args.update(feed_dict)
       expected_val, actual_val, expected_grad_val, actual_grad_val = sess.run(
-          (expected, actual, expected_grad, actual_grad), {pred: True})
+          (expected, actual, expected_grad, actual_grad), sess_run_args)
       self.assertEqual(expected_val, actual_val)
       self.assertEqual(expected_grad_val, actual_grad_val)
 
+      sess_run_args = {pred: False}
+      sess_run_args.update(feed_dict)
       expected_val, actual_val, expected_grad_val, actual_grad_val = sess.run(
-          (expected, actual, expected_grad, actual_grad), {pred: False})
+          (expected, actual, expected_grad, actual_grad), sess_run_args)
       self.assertEqual(expected_val, actual_val)
       self.assertEqual(expected_grad_val, actual_grad_val)
 
@@ -131,6 +138,349 @@ class NewCondTest(test.TestCase):
         self.assertIn("foo_cond_1_true", ops.get_default_graph()._functions)
         self.assertIn("foo_cond_1_false", ops.get_default_graph()._functions)
 
+  def testDefunInCond(self):
+    x = constant_op.constant(1.0, name="x")
+    y = constant_op.constant(2.0, name="y")
+
+    def true_fn():
+
+      @function.Defun()
+      def fn():
+        return x * y * 2.0
+
+      return fn()
+
+    def false_fn():
+      return 2.0
+
+    self._testCond(true_fn, false_fn, [x])
+    self._testCond(true_fn, false_fn, [x, y])
+    self._testCond(true_fn, false_fn, [y])
+
+  def testNestedDefunInCond(self):
+    x = constant_op.constant(1.0, name="x")
+    y = constant_op.constant(2.0, name="y")
+
+    def true_fn():
+      return 2.0
+
+    def false_fn():
+
+      @function.Defun()
+      def fn():
+
+        @function.Defun()
+        def nested_fn():
+          return x * y * 2.0
+
+        return nested_fn()
+
+      return fn()
+
+    self._testCond(true_fn, false_fn, [x])
+    self._testCond(true_fn, false_fn, [x, y])
+    self._testCond(true_fn, false_fn, [y])
+
+  def testDoubleNestedDefunInCond(self):
+    x = constant_op.constant(1.0, name="x")
+    y = constant_op.constant(2.0, name="y")
+
+    def true_fn():
+
+      @function.Defun()
+      def fn():
+
+        @function.Defun()
+        def nested_fn():
+
+          @function.Defun()
+          def nested_nested_fn():
+            return x * y * 2.0
+
+          return nested_nested_fn()
+
+        return nested_fn()
+
+      return fn()
+
+    def false_fn():
+      return 2.0
+
+    self._testCond(true_fn, false_fn, [x])
+    self._testCond(true_fn, false_fn, [x, y])
+    self._testCond(true_fn, false_fn, [y])
+
+  def testNestedCond(self):
+
+    def run_test(pred_value):
+
+      def build_graph():
+        pred = array_ops.placeholder(dtypes.bool, name="pred")
+        x = constant_op.constant(1.0, name="x")
+        y = constant_op.constant(2.0, name="y")
+
+        def true_fn():
+          return 2.0
+
+        def false_fn():
+
+          def false_true_fn():
+            return x * y * 2.0
+
+          def false_false_fn():
+            return x * 5.0
+
+          return _cond(pred, false_true_fn, false_false_fn, "inside_false_fn")
+
+        return x, y, pred, true_fn, false_fn
+
+      with ops.Graph().as_default():
+        x, y, pred, true_fn, false_fn = build_graph()
+        self._testCond(true_fn, false_fn, [x, y], {pred: pred_value})
+        self._testCond(true_fn, false_fn, [x], {pred: pred_value})
+        self._testCond(true_fn, false_fn, [y], {pred: pred_value})
+
+    run_test(True)
+    run_test(False)
+
+  def testDoubleNestedCond(self):
+
+    def run_test(pred1_value, pred2_value):
+
+      def build_graph():
+        pred1 = array_ops.placeholder(dtypes.bool, name="pred1")
+        pred2 = array_ops.placeholder(dtypes.bool, name="pred2")
+        x = constant_op.constant(1.0, name="x")
+        y = constant_op.constant(2.0, name="y")
+
+        def true_fn():
+          return 2.0
+
+        def false_fn():
+
+          def false_true_fn():
+
+            def false_true_true_fn():
+              return x * y * 2.0
+
+            def false_true_false_fn():
+              return x * 10.0
+
+            return _cond(
+                pred1,
+                false_true_true_fn,
+                false_true_false_fn,
+                name="inside_false_true_fn")
+
+          def false_false_fn():
+            return x * 5.0
+
+          return _cond(
+              pred2, false_true_fn, false_false_fn, name="inside_false_fn")
+
+        return x, y, pred1, pred2, true_fn, false_fn
+
+      with ops.Graph().as_default():
+        x, y, pred1, pred2, true_fn, false_fn = build_graph()
+        self._testCond(true_fn, false_fn, [x, y], {
+            pred1: pred1_value,
+            pred2: pred2_value
+        })
+        x, y, pred1, pred2, true_fn, false_fn = build_graph()
+        self._testCond(true_fn, false_fn, [x], {
+            pred1: pred1_value,
+            pred2: pred2_value
+        })
+        x, y, pred1, pred2, true_fn, false_fn = build_graph()
+        self._testCond(true_fn, false_fn, [y], {
+            pred1: pred1_value,
+            pred2: pred2_value
+        })
+
+    run_test(True, True)
+    run_test(True, False)
+    run_test(False, False)
+    run_test(False, True)
+
+  def testGradientFromInsideDefun(self):
+
+    def build_graph():
+      pred_outer = array_ops.placeholder(dtypes.bool, name="pred_outer")
+      pred_inner = array_ops.placeholder(dtypes.bool, name="pred_inner")
+      x = constant_op.constant(1.0, name="x")
+      y = constant_op.constant(2.0, name="y")
+
+      def true_fn():
+        return 2.0
+
+      def false_fn():
+
+        def inner_true_fn():
+          return x * y * 2.0
+
+        def inner_false_fn():
+          return x * 5.0
+
+        return cond_v2.cond_v2(
+            pred_inner, inner_true_fn, inner_false_fn, name="inner_cond")
+
+      cond_outer = cond_v2.cond_v2(
+          pred_outer, true_fn, false_fn, name="outer_cond")
+
+      # Compute grads inside a Defun.
+      @function.Defun()
+      def nesting_fn():
+        return gradients_impl.gradients(cond_outer, [x, y])
+
+      grads = nesting_fn()
+
+      return grads, pred_outer, pred_inner
+
+    with ops.Graph().as_default():
+      grads, pred_outer, pred_inner = build_graph()
+      with self.test_session(graph=ops.get_default_graph()) as sess:
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: True,
+                pred_inner: True
+            }), [0., 0.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: True,
+                pred_inner: False
+            }), [0., 0.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: False,
+                pred_inner: True
+            }), [4., 2.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: False,
+                pred_inner: False
+            }), [5., 0.])
+
+  def testGradientFromInsideNestedDefun(self):
+
+    def build_graph():
+      pred_outer = array_ops.placeholder(dtypes.bool, name="pred_outer")
+      pred_inner = array_ops.placeholder(dtypes.bool, name="pred_inner")
+      x = constant_op.constant(1.0, name="x")
+      y = constant_op.constant(2.0, name="y")
+
+      def true_fn():
+        return 2.0
+
+      def false_fn():
+
+        def inner_true_fn():
+          return x * y * 2.0
+
+        def inner_false_fn():
+          return x * 5.0
+
+        return cond_v2.cond_v2(
+            pred_inner, inner_true_fn, inner_false_fn, name="inner_cond")
+
+      cond_outer = cond_v2.cond_v2(
+          pred_outer, true_fn, false_fn, name="outer_cond")
+
+      # Compute grads inside a Defun.
+      @function.Defun()
+      def nesting_fn():
+
+        @function.Defun()
+        def inner_nesting_fn():
+          return gradients_impl.gradients(cond_outer, [x, y])
+
+        return inner_nesting_fn()
+
+      grads = nesting_fn()
+
+      return grads, pred_outer, pred_inner
+
+    with ops.Graph().as_default():
+      grads, pred_outer, pred_inner = build_graph()
+      with self.test_session(graph=ops.get_default_graph()) as sess:
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: True,
+                pred_inner: True
+            }), [0., 0.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: True,
+                pred_inner: False
+            }), [0., 0.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: False,
+                pred_inner: True
+            }), [4., 2.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: False,
+                pred_inner: False
+            }), [5., 0.])
+
+  def testBuildCondAndGradientInsideDefun(self):
+
+    def build_graph():
+      pred_outer = array_ops.placeholder(dtypes.bool, name="pred_outer")
+      pred_inner = array_ops.placeholder(dtypes.bool, name="pred_inner")
+      x = constant_op.constant(1.0, name="x")
+      y = constant_op.constant(2.0, name="y")
+
+      # Build cond and its gradient inside a Defun.
+      @function.Defun()
+      def fn():
+
+        def true_fn():
+          return 2.0
+
+        def false_fn():
+
+          def inner_true_fn():
+            return x * y * 2.0
+
+          def inner_false_fn():
+            return x * 5.0
+
+          return cond_v2.cond_v2(
+              pred_inner, inner_true_fn, inner_false_fn, name="inner_cond")
+
+        cond_outer = cond_v2.cond_v2(
+            pred_outer, true_fn, false_fn, name="outer_cond")
+        return gradients_impl.gradients(cond_outer, [x, y])
+
+      grads = fn()
+
+      return grads, pred_outer, pred_inner
+
+    with ops.Graph().as_default():
+      grads, pred_outer, pred_inner = build_graph()
+      with self.test_session(graph=ops.get_default_graph()) as sess:
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: True,
+                pred_inner: True
+            }), [0., 0.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: True,
+                pred_inner: False
+            }), [0., 0.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: False,
+                pred_inner: True
+            }), [4., 2.])
+        self.assertSequenceEqual(
+            sess.run(grads, {
+                pred_outer: False,
+                pred_inner: False
+            }), [5., 0.])
+
   def testSecondDerivative(self):
     with self.test_session() as sess:
       pred = array_ops.placeholder(dtypes.bool, name="pred")
@@ -532,5 +882,17 @@ class CondV2ColocationGroupAndDeviceTest(test.TestCase):
         self.assertTrue(len(run_metadata.partition_graphs) >= 2)
 
 
+def _cond(pred, true_fn, false_fn, name):
+  if _is_old_cond():
+    return control_flow_ops.cond(pred, true_fn, false_fn, name=name)
+  else:
+    return cond_v2.cond_v2(pred, true_fn, false_fn, name=name)
+
+
+def _is_old_cond():
+  return isinstance(ops.get_default_graph()._get_control_flow_context(),
+                    control_flow_ops.CondContext)
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
index 68873df97e..b567b71424 100644
--- a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
+++ b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
@@ -734,11 +734,11 @@ class ControlFlowTest(test.TestCase):
 
       def body_fn(i):
         with ops.control_dependencies([increment]):
-          return i + i
+          return i + 1
 
-      result = control_flow_ops.while_loop(cond=lambda i: i < 1,
+      result = control_flow_ops.while_loop(cond=lambda i: i < 2,
                                            body=body_fn, loop_vars=[1])
-      result.eval()
+      self.assertAllEqual(result.eval(), 2)
       self.assertAllEqual(v.eval(), 1.0)
 
   def testWhileExternalControlDependenciesNoInput(self):
diff --git a/tensorflow/python/kernel_tests/conv_ops_test.py b/tensorflow/python/kernel_tests/conv_ops_test.py
index 474d06b8f3..00de94f004 100644
--- a/tensorflow/python/kernel_tests/conv_ops_test.py
+++ b/tensorflow/python/kernel_tests/conv_ops_test.py
@@ -1706,7 +1706,7 @@ class SeparableConv2DTest(test.TestCase):
   def testSeparableConv2D(self):
     self._testSeparableConv2D("NHWC")
 
-  def testSeparableConv2DNCHW(self):
+  def disabledtestSeparableConv2DNCHW(self):
     if not test.is_gpu_available():
       return
     self._testSeparableConv2D("NCHW")
diff --git a/tensorflow/python/kernel_tests/decode_jpeg_op_test.py b/tensorflow/python/kernel_tests/decode_jpeg_op_test.py
index 510daf79dc..66b3e0f22f 100644
--- a/tensorflow/python/kernel_tests/decode_jpeg_op_test.py
+++ b/tensorflow/python/kernel_tests/decode_jpeg_op_test.py
@@ -110,7 +110,8 @@ class DecodeJpegBenchmark(test.Benchmark):
       start_time = time.time()
       for _ in xrange(num_iters):
         sess.run(r)
-    return time.time() - start_time
+      end_time = time.time()
+    return end_time - start_time
 
   def benchmarkDecodeJpegSmall(self):
     """Evaluate single DecodeImageOp for small size image."""
diff --git a/tensorflow/python/kernel_tests/depthwise_conv_op_test.py b/tensorflow/python/kernel_tests/depthwise_conv_op_test.py
index 7134e02c34..58845552db 100644
--- a/tensorflow/python/kernel_tests/depthwise_conv_op_test.py
+++ b/tensorflow/python/kernel_tests/depthwise_conv_op_test.py
@@ -90,7 +90,7 @@ def CheckGradConfigsToTest():
 class DepthwiseConv2DTest(test.TestCase):
 
   # This is testing that depthwise_conv2d and depthwise_conv2d_native
-  # produce the same results.  It also tests that NCHW and NWHC
+  # produce the same results.  It also tests that NCHW and NHWC
   # formats agree, by comparing the depthwise_conv2d_native with
   # 'NCHW' format (with transposition) matches the 'NHWC' format using
   # the higher level interface.
@@ -142,7 +142,7 @@ class DepthwiseConv2DTest(test.TestCase):
       native_t1 = t1
       strides = [1, stride, stride, 1]
       if data_format == "NCHW":
-        # Transpose from NWHC input to NCHW
+        # Transpose from NHWC input to NCHW
         # Ex. [4, 5, 5, 48] to [4, 48, 5, 5]
         native_t1 = array_ops.transpose(t1, [0, 3, 1, 2])
         strides = [1, 1, stride, stride]
@@ -368,7 +368,7 @@ class DepthwiseConv2DTest(test.TestCase):
       native_input = input_tensor
       strides = [1, stride, stride, 1]
       if data_format == "NCHW":
-        # Transpose from NWHC input to NCHW
+        # Transpose from NHWC input to NCHW
         # Ex. [4, 5, 5, 48] to [4, 48, 5, 5]
         native_input = array_ops.transpose(input_tensor, [0, 3, 1, 2])
         input_shape = [
diff --git a/tensorflow/python/kernel_tests/distributions/util_test.py b/tensorflow/python/kernel_tests/distributions/util_test.py
index 9d38ffcb4a..61faa8466e 100644
--- a/tensorflow/python/kernel_tests/distributions/util_test.py
+++ b/tensorflow/python/kernel_tests/distributions/util_test.py
@@ -311,8 +311,10 @@ class EmbedCheckCategoricalEventShapeTest(test.TestCase):
   @test_util.run_in_graph_and_eager_modes
   def testUnsupportedDtype(self):
     with self.test_session():
+      param = ops.convert_to_tensor(
+          np.ones([2**11 + 1]).astype(dtypes.qint16.as_numpy_dtype),
+          dtype=dtypes.qint16)
       with self.assertRaises(TypeError):
-        param = array_ops.ones([int(2**11+1)], dtype=dtypes.qint16)
         du.embed_check_categorical_event_shape(param)
 
 
diff --git a/tensorflow/python/kernel_tests/functional_ops_test.py b/tensorflow/python/kernel_tests/functional_ops_test.py
index bfd4a8fd49..24800d2b7a 100644
--- a/tensorflow/python/kernel_tests/functional_ops_test.py
+++ b/tensorflow/python/kernel_tests/functional_ops_test.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import functional_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 import tensorflow.python.ops.tensor_array_grad  # pylint: disable=unused-import
@@ -1080,6 +1081,56 @@ class PartitionedCallTest(test.TestCase):
       self.assertTrue(compat.as_bytes("CPU:1") in outputs[1].eval())
       self.assertTrue(compat.as_bytes("CPU:2") in outputs[2].eval())
 
+  def testAssignAddResourceVariable(self):
+
+    v = resource_variable_ops.ResourceVariable(1.0)
+
+    @function.Defun()
+    def AssignAdd():
+      v.assign_add(1.0)
+
+    op = functional_ops.partitioned_call(
+        args=AssignAdd.captured_inputs, f=AssignAdd)
+    _ = self.evaluate(variables.global_variables_initializer())
+    _ = self.evaluate(op)
+    value = self.evaluate(v.read_value())
+    self.assertEqual(value, 2.0)
+
+  def testFunctionWithResourcesOnDifferentDevices(self):
+    if not test_util.is_gpu_available():
+      self.skipTest("No GPUs available.")
+
+    with ops.device("/cpu:0"):
+      v_cpu_zero = resource_variable_ops.ResourceVariable(
+          [0.0, 1.0, 2.0], name="v_cpu_zero")
+
+    with ops.device("/cpu:1"):
+      v_cpu_one = resource_variable_ops.ResourceVariable(
+          [0.0, 1.0, 2.0], name="v_cpu_one")
+
+    with ops.device("/gpu:0"):
+      v_gpu = resource_variable_ops.ResourceVariable(
+          [0.0, 1.0, 2.0], name="v_gpu")
+
+    def sum_gather():
+      cpu_result = math_ops.reduce_sum(array_ops.gather(v_cpu_zero, [1, 2]))
+      also_cpu_result = math_ops.reduce_sum(array_ops.gather(v_cpu_one, [1, 2]))
+      gpu_result = math_ops.reduce_sum(array_ops.gather(v_gpu, [1, 2]))
+      return cpu_result, also_cpu_result, gpu_result
+
+    defined = function.Defun()(sum_gather)
+    with self.test_session(
+        config=config_pb2.ConfigProto(
+            allow_soft_placement=False,
+            log_device_placement=True,
+            device_count={"CPU": 2})) as sess:
+      sess.run(variables.global_variables_initializer())
+      expected = sess.run(sum_gather())
+      result = sess.run(
+          functional_ops.partitioned_call(
+              args=defined.captured_inputs, f=defined))
+      self.assertAllEqual(expected, result)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/gather_nd_op_test.py b/tensorflow/python/kernel_tests/gather_nd_op_test.py
index 58e2a8ac2a..c0b419e1d1 100644
--- a/tensorflow/python/kernel_tests/gather_nd_op_test.py
+++ b/tensorflow/python/kernel_tests/gather_nd_op_test.py
@@ -203,8 +203,7 @@ class GatherNdTest(test.TestCase):
       indices = [[[0], [7]]]  # Make this one higher rank
       gather_nd = array_ops.gather_nd(params, indices)
       with self.assertRaisesOpError(
-          r"flat indices\[1, :\] = \[7\] does not index into param "
-          r"\(shape: \[3\]\)"):
+          r"indices\[0,1\] = \[7\] does not index into param shape \[3\]"):
         gather_nd.eval()
 
   def _disabledTestBadIndicesGPU(self):
@@ -217,8 +216,7 @@ class GatherNdTest(test.TestCase):
       indices = [[[0], [7]]]  # Make this one higher rank
       gather_nd = array_ops.gather_nd(params, indices)
       with self.assertRaisesOpError(
-          r"flat indices\[1, :\] = \[7\] does not index into param "
-          r"\(shape: \[3\]\)"):
+          r"indices\[0,1\] = \[7\] does not index into param shape \[3\]"):
         gather_nd.eval()
 
   def testBadIndicesWithSlicesCPU(self):
@@ -227,8 +225,7 @@ class GatherNdTest(test.TestCase):
       indices = [[[0], [0], [1]]]  # Make this one higher rank
       gather_nd = array_ops.gather_nd(params, indices)
       with self.assertRaisesOpError(
-          r"flat indices\[2, :\] = \[1\] does not index into param "
-          r"\(shape: \[1,3\]\)"):
+          r"indices\[0,2\] = \[1\] does not index into param shape \[1,3\]"):
         gather_nd.eval()
 
   def _disabledTestBadIndicesWithSlicesGPU(self):
@@ -241,8 +238,7 @@ class GatherNdTest(test.TestCase):
       indices = [[[0], [0], [1]]]  # Make this one higher rank
       gather_nd = array_ops.gather_nd(params, indices)
       with self.assertRaisesOpError(
-          r"flat indices\[2, :\] = \[1\] does not index into param "
-          r"\(shape: \[1,3\]\)"):
+          r"indices\[0,2\] = \[1\] does not index into param shape \[1,3\]"):
         gather_nd.eval()
 
   def testGradientsRank2Elements(self):
diff --git a/tensorflow/python/kernel_tests/init_ops_test.py b/tensorflow/python/kernel_tests/init_ops_test.py
index 927ca012ae..f6097ad489 100644
--- a/tensorflow/python/kernel_tests/init_ops_test.py
+++ b/tensorflow/python/kernel_tests/init_ops_test.py
@@ -830,7 +830,7 @@ class ConvolutionOrthogonal1dInitializerTest(test.TestCase):
     tol = 1e-3
     gain = 3.14
     # Check orthogonality/isometry by computing the ratio between
-    # the 2-norms of the inputs and ouputs.
+    # the 2-norms of the inputs and outputs.
     for kernel_size in [[1], [2], [3], [4], [5], [6]]:
       convolution = convolutional.conv1d
       inputs = random_ops.random_normal(shape, dtype=dtype)
@@ -925,7 +925,7 @@ class ConvolutionOrthogonal2dInitializerTest(test.TestCase):
     tol = 1e-3
     gain = 3.14
     # Check orthogonality/isometry by computing the ratio between
-    # the 2-norms of the inputs and ouputs.
+    # the 2-norms of the inputs and outputs.
     for kernel_size in [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]:
       convolution = convolutional.conv2d
       inputs = random_ops.random_normal(shape, dtype=dtype)
@@ -1050,7 +1050,7 @@ class ConvolutionOrthogonal3dInitializerTest(test.TestCase):
     tol = 1e-3
     gain = 3.14
     # Check orthogonality/isometry by computing the ratio between
-    # the 2-norms of the inputs and ouputs.
+    # the 2-norms of the inputs and outputs.
     for kernel_size in [[1, 1, 1], [2, 2, 2], [3, 3, 3]]:
       convolution = convolutional.conv3d
       inputs = random_ops.random_normal(shape, dtype=dtype)
diff --git a/tensorflow/python/kernel_tests/linalg/BUILD b/tensorflow/python/kernel_tests/linalg/BUILD
index 69d3aa4017..f4ec3e3996 100644
--- a/tensorflow/python/kernel_tests/linalg/BUILD
+++ b/tensorflow/python/kernel_tests/linalg/BUILD
@@ -197,7 +197,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "linear_operator_low_rank_update_test",
-    size = "medium",
+    size = "large",
     srcs = ["linear_operator_low_rank_update_test.py"],
     additional_deps = [
         "//tensorflow/python/ops/linalg",
@@ -234,3 +234,21 @@ cuda_py_test(
         "optonly",
     ],
 )
+
+cuda_py_test(
+    name = "linear_operator_zeros_test",
+    size = "medium",
+    srcs = ["linear_operator_zeros_test.py"],
+    additional_deps = [
+        "//tensorflow/python/ops/linalg",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:linalg_ops",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:random_ops",
+    ],
+    shard_count = 5,
+    tags = ["optonly"],  # Test is flaky without optimization.
+)
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py
index 34b35a4ffb..0e38dbd48d 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_low_rank_update_test.py
@@ -49,12 +49,6 @@ class BaseLinearOperatorLowRankUpdatetest(object):
   _use_v = None
 
   @property
-  def _dtypes_to_test(self):
-    # TODO(langmore) Test complex types once cholesky works with them.
-    # See comment in LinearOperatorLowRankUpdate.__init__.
-    return [dtypes.float32, dtypes.float64]
-
-  @property
   def _operator_build_infos(self):
     build_info = linear_operator_test_util.OperatorBuildInfo
     # Previously we had a (2, 10, 10) shape at the end.  We did this to test the
@@ -68,6 +62,15 @@ class BaseLinearOperatorLowRankUpdatetest(object):
         build_info((3, 4, 4)),
         build_info((2, 1, 4, 4))]
 
+  def _gen_positive_diag(self, dtype, diag_shape):
+    if dtype.is_complex:
+      diag = linear_operator_test_util.random_uniform(
+          diag_shape, minval=1e-4, maxval=1., dtype=dtypes.float32)
+      return math_ops.cast(diag, dtype=dtype)
+
+    return linear_operator_test_util.random_uniform(
+        diag_shape, minval=1e-4, maxval=1., dtype=dtype)
+
   def _operator_and_matrix(self, build_info, dtype, use_placeholder):
     # Recall A = L + UDV^H
     shape = list(build_info.shape)
@@ -78,8 +81,7 @@ class BaseLinearOperatorLowRankUpdatetest(object):
 
     # base_operator L will be a symmetric positive definite diagonal linear
     # operator, with condition number as high as 1e4.
-    base_diag = linear_operator_test_util.random_uniform(
-        diag_shape, minval=1e-4, maxval=1., dtype=dtype)
+    base_diag = self._gen_positive_diag(dtype, diag_shape)
     lin_op_base_diag = base_diag
 
     # U
@@ -94,8 +96,7 @@ class BaseLinearOperatorLowRankUpdatetest(object):
 
     # D
     if self._is_diag_update_positive:
-      diag_update = linear_operator_test_util.random_uniform(
-          diag_update_shape, minval=1e-4, maxval=1., dtype=dtype)
+      diag_update = self._gen_positive_diag(dtype, diag_update_shape)
     else:
       diag_update = linear_operator_test_util.random_normal(
           diag_update_shape, stddev=1e-4, dtype=dtype)
@@ -110,7 +111,9 @@ class BaseLinearOperatorLowRankUpdatetest(object):
           diag_update, shape=None)
 
     base_operator = linalg.LinearOperatorDiag(
-        lin_op_base_diag, is_positive_definite=True)
+        lin_op_base_diag,
+        is_positive_definite=True,
+        is_self_adjoint=True)
 
     operator = linalg.LinearOperatorLowRankUpdate(
         base_operator,
@@ -169,6 +172,7 @@ class LinearOperatorLowRankUpdatetestWithDiagUseCholesky(
     self._rtol[dtypes.float32] = 1e-5
     self._atol[dtypes.float64] = 1e-10
     self._rtol[dtypes.float64] = 1e-10
+    self._rtol[dtypes.complex64] = 1e-4
 
 
 class LinearOperatorLowRankUpdatetestWithDiagCannotUseCholesky(
@@ -188,6 +192,7 @@ class LinearOperatorLowRankUpdatetestWithDiagCannotUseCholesky(
     self._rtol[dtypes.float32] = 1e-4
     self._atol[dtypes.float64] = 1e-9
     self._rtol[dtypes.float64] = 1e-9
+    self._rtol[dtypes.complex64] = 1e-4
 
 
 class LinearOperatorLowRankUpdatetestNoDiagUseCholesky(
@@ -206,6 +211,7 @@ class LinearOperatorLowRankUpdatetestNoDiagUseCholesky(
     self._rtol[dtypes.float32] = 1e-5
     self._atol[dtypes.float64] = 1e-10
     self._rtol[dtypes.float64] = 1e-10
+    self._rtol[dtypes.complex64] = 1e-4
 
 
 class LinearOperatorLowRankUpdatetestNoDiagCannotUseCholesky(
@@ -225,6 +231,7 @@ class LinearOperatorLowRankUpdatetestNoDiagCannotUseCholesky(
     self._rtol[dtypes.float32] = 1e-4
     self._atol[dtypes.float64] = 1e-9
     self._rtol[dtypes.float64] = 1e-9
+    self._rtol[dtypes.complex64] = 1e-4
 
 
 class LinearOperatorLowRankUpdatetestWithDiagNotSquare(
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py
index 167c6cacd1..b389e0cbdf 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_lower_triangular_test.py
@@ -17,7 +17,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import random_seed
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops.linalg import linalg as linalg_lib
@@ -32,12 +31,6 @@ class LinearOperatorLowerTriangularTest(
     linear_operator_test_util.SquareLinearOperatorDerivedClassTest):
   """Most tests done in the base class LinearOperatorDerivedClassTest."""
 
-  @property
-  def _dtypes_to_test(self):
-    # TODO(langmore) Test complex types once supported by
-    # matrix_triangular_solve.
-    return [dtypes.float32, dtypes.float64]
-
   def _operator_and_matrix(self, build_info, dtype, use_placeholder):
     shape = list(build_info.shape)
     # Upper triangle will be nonzero, but ignored.
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_zeros_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_zeros_test.py
new file mode 100644
index 0000000000..8f60b55e0a
--- /dev/null
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_zeros_test.py
@@ -0,0 +1,192 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import random_seed
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops.linalg import linalg as linalg_lib
+from tensorflow.python.ops.linalg import linear_operator_test_util
+from tensorflow.python.platform import test
+
+
+random_seed.set_random_seed(23)
+rng = np.random.RandomState(2016)
+
+
+class LinearOperatorZerosTest(
+    linear_operator_test_util.SquareLinearOperatorDerivedClassTest):
+  """Most tests done in the base class LinearOperatorDerivedClassTest."""
+
+  @property
+  def _tests_to_skip(self):
+    return ["log_abs_det", "solve", "solve_with_broadcast"]
+
+  @property
+  def _operator_build_infos(self):
+    build_info = linear_operator_test_util.OperatorBuildInfo
+    return [
+        build_info((1, 1)),
+        build_info((1, 3, 3)),
+        build_info((3, 4, 4)),
+        build_info((2, 1, 4, 4))]
+
+  def _operator_and_matrix(self, build_info, dtype, use_placeholder):
+    del use_placeholder
+    shape = list(build_info.shape)
+    assert shape[-1] == shape[-2]
+
+    batch_shape = shape[:-2]
+    num_rows = shape[-1]
+
+    operator = linalg_lib.LinearOperatorZeros(
+        num_rows, batch_shape=batch_shape, dtype=dtype)
+    matrix = array_ops.zeros(shape=shape, dtype=dtype)
+
+    return operator, matrix
+
+  def test_assert_positive_definite(self):
+    operator = linalg_lib.LinearOperatorZeros(num_rows=2)
+    with self.assertRaisesOpError("non-positive definite"):
+      operator.assert_positive_definite()
+
+  def test_assert_non_singular(self):
+    with self.assertRaisesOpError("non-invertible"):
+      operator = linalg_lib.LinearOperatorZeros(num_rows=2)
+      operator.assert_non_singular()
+
+  def test_assert_self_adjoint(self):
+    with self.test_session():
+      operator = linalg_lib.LinearOperatorZeros(num_rows=2)
+      operator.assert_self_adjoint().run()  # Should not fail
+
+  def test_non_scalar_num_rows_raises_static(self):
+    with self.assertRaisesRegexp(ValueError, "must be a 0-D Tensor"):
+      linalg_lib.LinearOperatorZeros(num_rows=[2])
+    with self.assertRaisesRegexp(ValueError, "must be a 0-D Tensor"):
+      linalg_lib.LinearOperatorZeros(num_rows=2, num_columns=[2])
+
+  def test_non_integer_num_rows_raises_static(self):
+    with self.assertRaisesRegexp(TypeError, "must be integer"):
+      linalg_lib.LinearOperatorZeros(num_rows=2.)
+    with self.assertRaisesRegexp(TypeError, "must be integer"):
+      linalg_lib.LinearOperatorZeros(num_rows=2, num_columns=2.)
+
+  def test_negative_num_rows_raises_static(self):
+    with self.assertRaisesRegexp(ValueError, "must be non-negative"):
+      linalg_lib.LinearOperatorZeros(num_rows=-2)
+    with self.assertRaisesRegexp(ValueError, "must be non-negative"):
+      linalg_lib.LinearOperatorZeros(num_rows=2, num_columns=-2)
+
+  def test_non_1d_batch_shape_raises_static(self):
+    with self.assertRaisesRegexp(ValueError, "must be a 1-D"):
+      linalg_lib.LinearOperatorZeros(num_rows=2, batch_shape=2)
+
+  def test_non_integer_batch_shape_raises_static(self):
+    with self.assertRaisesRegexp(TypeError, "must be integer"):
+      linalg_lib.LinearOperatorZeros(num_rows=2, batch_shape=[2.])
+
+  def test_negative_batch_shape_raises_static(self):
+    with self.assertRaisesRegexp(ValueError, "must be non-negative"):
+      linalg_lib.LinearOperatorZeros(num_rows=2, batch_shape=[-2])
+
+  def test_non_scalar_num_rows_raises_dynamic(self):
+    with self.test_session():
+      num_rows = array_ops.placeholder(dtypes.int32)
+      operator = linalg_lib.LinearOperatorZeros(
+          num_rows, assert_proper_shapes=True)
+      with self.assertRaisesOpError("must be a 0-D Tensor"):
+        operator.to_dense().eval(feed_dict={num_rows: [2]})
+
+  def test_negative_num_rows_raises_dynamic(self):
+    with self.test_session():
+      n = array_ops.placeholder(dtypes.int32)
+      operator = linalg_lib.LinearOperatorZeros(
+          num_rows=n, assert_proper_shapes=True)
+      with self.assertRaisesOpError("must be non-negative"):
+        operator.to_dense().eval(feed_dict={n: -2})
+
+      operator = linalg_lib.LinearOperatorZeros(
+          num_rows=2, num_columns=n, assert_proper_shapes=True)
+      with self.assertRaisesOpError("must be non-negative"):
+        operator.to_dense().eval(feed_dict={n: -2})
+
+  def test_non_1d_batch_shape_raises_dynamic(self):
+    with self.test_session():
+      batch_shape = array_ops.placeholder(dtypes.int32)
+      operator = linalg_lib.LinearOperatorZeros(
+          num_rows=2, batch_shape=batch_shape, assert_proper_shapes=True)
+      with self.assertRaisesOpError("must be a 1-D"):
+        operator.to_dense().eval(feed_dict={batch_shape: 2})
+
+  def test_negative_batch_shape_raises_dynamic(self):
+    with self.test_session():
+      batch_shape = array_ops.placeholder(dtypes.int32)
+      operator = linalg_lib.LinearOperatorZeros(
+          num_rows=2, batch_shape=batch_shape, assert_proper_shapes=True)
+      with self.assertRaisesOpError("must be non-negative"):
+        operator.to_dense().eval(feed_dict={batch_shape: [-2]})
+
+  def test_wrong_matrix_dimensions_raises_static(self):
+    operator = linalg_lib.LinearOperatorZeros(num_rows=2)
+    x = rng.randn(3, 3).astype(np.float32)
+    with self.assertRaisesRegexp(ValueError, "Dimensions.*not compatible"):
+      operator.matmul(x)
+
+  def test_wrong_matrix_dimensions_raises_dynamic(self):
+    num_rows = array_ops.placeholder(dtypes.int32)
+    x = array_ops.placeholder(dtypes.float32)
+
+    with self.test_session():
+      operator = linalg_lib.LinearOperatorZeros(
+          num_rows, assert_proper_shapes=True)
+      y = operator.matmul(x)
+      with self.assertRaisesOpError("Incompatible.*dimensions"):
+        y.eval(feed_dict={num_rows: 2, x: rng.rand(3, 3)})
+
+  def test_is_x_flags(self):
+    # The is_x flags are by default all True.
+    operator = linalg_lib.LinearOperatorZeros(num_rows=2)
+    self.assertFalse(operator.is_positive_definite)
+    self.assertFalse(operator.is_non_singular)
+    self.assertTrue(operator.is_self_adjoint)
+
+
+class LinearOperatorZerosNotSquareTest(
+    linear_operator_test_util.NonSquareLinearOperatorDerivedClassTest):
+
+  def _operator_and_matrix(self, build_info, dtype, use_placeholder):
+    del use_placeholder
+    shape = list(build_info.shape)
+
+    batch_shape = shape[:-2]
+    num_rows = shape[-2]
+    num_columns = shape[-1]
+
+    operator = linalg_lib.LinearOperatorZeros(
+        num_rows, num_columns, is_square=False, is_self_adjoint=False,
+        batch_shape=batch_shape, dtype=dtype)
+    matrix = array_ops.zeros(shape=shape, dtype=dtype)
+
+    return operator, matrix
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/kernel_tests/linalg_grad_test.py b/tensorflow/python/kernel_tests/linalg_grad_test.py
index 6f401358a2..0e4e58409e 100644
--- a/tensorflow/python/kernel_tests/linalg_grad_test.py
+++ b/tensorflow/python/kernel_tests/linalg_grad_test.py
@@ -26,6 +26,7 @@ from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.linalg import linalg_impl
 from tensorflow.python.platform import test as test_lib
 
 
@@ -173,6 +174,10 @@ if __name__ == '__main__':
         _AddTest(MatrixUnaryFunctorGradientTest, 'MatrixInverseGradient', name,
                  _GetMatrixUnaryFunctorGradientTest(linalg_ops.matrix_inverse,
                                                     dtype, shape))
+        _AddTest(MatrixUnaryFunctorGradientTest, 'MatrixExponentialGradient',
+                 name,
+                 _GetMatrixUnaryFunctorGradientTest(
+                     linalg_impl.matrix_exponential, dtype, shape))
         _AddTest(
             MatrixUnaryFunctorGradientTest, 'MatrixDeterminantGradient', name,
             _GetMatrixUnaryFunctorGradientTest(linalg_ops.matrix_determinant,
diff --git a/tensorflow/python/kernel_tests/matrix_exponential_op_test.py b/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
index a0c66c77d8..0386e91276 100644
--- a/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
+++ b/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
@@ -12,33 +12,35 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for tensorflow.ops.gen_linalg_ops.matrix_exponential."""
+"""Tests for tensorflow.ops.linalg.linalg_impl.matrix_exponential."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 import itertools
-import math
 
 import numpy as np
 
 from tensorflow.python.client import session
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
-from tensorflow.python.ops import gen_linalg_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variables
+from tensorflow.python.ops.linalg import linalg_impl
 from tensorflow.python.platform import test
 
 
-def np_expm(x):
+def np_expm(x):  # pylint: disable=invalid-name
   """Slow but accurate Taylor series matrix exponential."""
   y = np.zeros(x.shape, dtype=x.dtype)
   xn = np.eye(x.shape[0], dtype=x.dtype)
   for n in range(40):
-    y += xn / float(math.factorial(n))
+    if n > 0:
+      xn /= float(n)
+    y += xn
     xn = np.dot(xn, x)
   return y
 
@@ -48,7 +50,7 @@ class ExponentialOpTest(test.TestCase):
   def _verifyExponential(self, x, np_type):
     inp = x.astype(np_type)
     with self.test_session(use_gpu=True):
-      tf_ans = gen_linalg_ops.matrix_exponential(inp)
+      tf_ans = linalg_impl.matrix_exponential(inp)
       if x.size == 0:
         np_ans = np.empty(x.shape, dtype=np_type)
       else:
@@ -76,7 +78,7 @@ class ExponentialOpTest(test.TestCase):
     matrix_batch = np.tile(matrix_batch, [2, 3, 1, 1])
     return matrix_batch
 
-  def testNonsymmetric(self):
+  def testNonsymmetricReal(self):
     # 2x2 matrices
     matrix1 = np.array([[1., 2.], [3., 4.]])
     matrix2 = np.array([[1., 3.], [3., 5.]])
@@ -84,7 +86,10 @@ class ExponentialOpTest(test.TestCase):
     self._verifyExponentialReal(matrix2)
     # A multidimensional batch of 2x2 matrices
     self._verifyExponentialReal(self._makeBatch(matrix1, matrix2))
-    # Complex
+
+  def testNonsymmetricComplex(self):
+    matrix1 = np.array([[1., 2.], [3., 4.]])
+    matrix2 = np.array([[1., 3.], [3., 5.]])
     matrix1 = matrix1.astype(np.complex64)
     matrix1 += 1j * matrix1
     matrix2 = matrix2.astype(np.complex64)
@@ -94,7 +99,7 @@ class ExponentialOpTest(test.TestCase):
     # Complex batch
     self._verifyExponentialComplex(self._makeBatch(matrix1, matrix2))
 
-  def testSymmetricPositiveDefinite(self):
+  def testSymmetricPositiveDefiniteReal(self):
     # 2x2 matrices
     matrix1 = np.array([[2., 1.], [1., 2.]])
     matrix2 = np.array([[3., -1.], [-1., 3.]])
@@ -102,7 +107,10 @@ class ExponentialOpTest(test.TestCase):
     self._verifyExponentialReal(matrix2)
     # A multidimensional batch of 2x2 matrices
     self._verifyExponentialReal(self._makeBatch(matrix1, matrix2))
-    # Complex
+
+  def testSymmetricPositiveDefiniteComplex(self):
+    matrix1 = np.array([[2., 1.], [1., 2.]])
+    matrix2 = np.array([[3., -1.], [-1., 3.]])
     matrix1 = matrix1.astype(np.complex64)
     matrix1 += 1j * matrix1
     matrix2 = matrix2.astype(np.complex64)
@@ -116,35 +124,31 @@ class ExponentialOpTest(test.TestCase):
     # When the exponential of a non-square matrix is attempted we should return
     # an error
     with self.assertRaises(ValueError):
-      gen_linalg_ops.matrix_exponential(np.array([[1., 2., 3.], [3., 4., 5.]]))
+      linalg_impl.matrix_exponential(np.array([[1., 2., 3.], [3., 4., 5.]]))
 
   def testWrongDimensions(self):
     # The input to the exponential should be at least a 2-dimensional tensor.
     tensor3 = constant_op.constant([1., 2.])
     with self.assertRaises(ValueError):
-      gen_linalg_ops.matrix_exponential(tensor3)
+      linalg_impl.matrix_exponential(tensor3)
 
   def testEmpty(self):
     self._verifyExponentialReal(np.empty([0, 2, 2]))
     self._verifyExponentialReal(np.empty([2, 0, 0]))
 
-  def testRandomSmallAndLarge(self):
-    np.random.seed(42)
-    for dtype in np.float32, np.float64, np.complex64, np.complex128:
-      for batch_dims in [(), (1,), (3,), (2, 2)]:
-        for size in 8, 31, 32:
-          shape = batch_dims + (size, size)
-          matrix = np.random.uniform(
-              low=-1.0, high=1.0,
-              size=np.prod(shape)).reshape(shape).astype(dtype)
-          self._verifyExponentialReal(matrix)
+  def testDynamic(self):
+    with self.test_session(use_gpu=True) as sess:
+      inp = array_ops.placeholder(ops.dtypes.float32)
+      expm = linalg_impl.matrix_exponential(inp)
+      matrix = np.array([[1., 2.], [3., 4.]])
+      sess.run(expm, feed_dict={inp: matrix})
 
   def testConcurrentExecutesWithoutError(self):
     with self.test_session(use_gpu=True) as sess:
       matrix1 = random_ops.random_normal([5, 5], seed=42)
       matrix2 = random_ops.random_normal([5, 5], seed=42)
-      expm1 = gen_linalg_ops.matrix_exponential(matrix1)
-      expm2 = gen_linalg_ops.matrix_exponential(matrix2)
+      expm1 = linalg_impl.matrix_exponential(matrix1)
+      expm2 = linalg_impl.matrix_exponential(matrix2)
       expm = sess.run([expm1, expm2])
       self.assertAllEqual(expm[0], expm[1])
 
@@ -180,7 +184,7 @@ class MatrixExponentialBenchmark(test.Benchmark):
           session.Session() as sess, \
           ops.device("/cpu:0"):
         matrix = self._GenerateMatrix(shape)
-        expm = gen_linalg_ops.matrix_exponential(matrix)
+        expm = linalg_impl.matrix_exponential(matrix)
         variables.global_variables_initializer().run()
         self.run_op_benchmark(
             sess,
@@ -189,6 +193,66 @@ class MatrixExponentialBenchmark(test.Benchmark):
             name="matrix_exponential_cpu_{shape}".format(
                 shape=shape))
 
+      if test.is_gpu_available(True):
+        with ops.Graph().as_default(), \
+            session.Session() as sess, \
+            ops.device("/gpu:0"):
+          matrix = self._GenerateMatrix(shape)
+          expm = linalg_impl.matrix_exponential(matrix)
+          variables.global_variables_initializer().run()
+          self.run_op_benchmark(
+              sess,
+              control_flow_ops.group(expm),
+              min_iters=25,
+              name="matrix_exponential_gpu_{shape}".format(
+                  shape=shape))
+
+
+def _TestRandomSmall(dtype, batch_dims, size):
+
+  def Test(self):
+    np.random.seed(42)
+    shape = batch_dims + (size, size)
+    matrix = np.random.uniform(
+        low=-1.0, high=1.0,
+        size=shape).astype(dtype)
+    self._verifyExponentialReal(matrix)
+
+  return Test
+
+
+def _TestL1Norms(dtype, shape, scale):
+
+  def Test(self):
+    np.random.seed(42)
+    matrix = np.random.uniform(
+        low=-1.0, high=1.0,
+        size=np.prod(shape)).reshape(shape).astype(dtype)
+    print(dtype, shape, scale, matrix)
+    l1_norm = np.max(np.sum(np.abs(matrix), axis=matrix.ndim-2))
+    matrix /= l1_norm
+    self._verifyExponentialReal(scale * matrix)
+
+  return Test
+
 
 if __name__ == "__main__":
+  for dtype_ in [np.float32, np.float64, np.complex64, np.complex128]:
+    for batch_ in [(), (2,), (2, 2)]:
+      for size_ in [4, 7]:
+        name = "%s_%d_%d" % (dtype_.__name__, len(batch_), size_)
+        setattr(ExponentialOpTest, "testL1Norms_" + name,
+                _TestRandomSmall(dtype_, batch_, size_))
+
+  for shape_ in [(3, 3), (2, 3, 3)]:
+    for dtype_ in [np.float32, np.complex64]:
+      for scale_ in [0.1, 1.5, 5.0, 20.0]:
+        name = "%s_%d_%d" % (dtype_.__name__, len(shape_), int(scale_*10))
+        setattr(ExponentialOpTest, "testL1Norms_" + name,
+                _TestL1Norms(dtype_, shape_, scale_))
+    for dtype_ in [np.float64, np.complex128]:
+      for scale_ in [0.01, 0.2, 0.5, 1.5, 6.0, 25.0]:
+        name = "%s_%d_%d" % (dtype_.__name__, len(shape_), int(scale_*100))
+        setattr(ExponentialOpTest, "testL1Norms_" + name,
+                _TestL1Norms(dtype_, shape_, scale_))
   test.main()
diff --git a/tensorflow/python/kernel_tests/neon_depthwise_conv_op_test.py b/tensorflow/python/kernel_tests/neon_depthwise_conv_op_test.py
index d8ce9fffbd..3cbbd48c8c 100644
--- a/tensorflow/python/kernel_tests/neon_depthwise_conv_op_test.py
+++ b/tensorflow/python/kernel_tests/neon_depthwise_conv_op_test.py
@@ -82,7 +82,7 @@ def CheckGradConfigsToTest():
 class DepthwiseConv2DTest(test.TestCase):
 
   # This is testing that depthwise_conv2d and depthwise_conv2d_native
-  # produce the same results.  It also tests that NCHW and NWHC
+  # produce the same results.  It also tests that NCHW and NHWC
   # formats agree, by comparing the depthwise_conv2d_native with
   # 'NCHW' format (with transposition) matches the 'NHWC' format using
   # the higher level interface.
@@ -123,7 +123,7 @@ class DepthwiseConv2DTest(test.TestCase):
       native_t1 = t1
       strides = [1, stride, stride, 1]
       if data_format == "NCHW":
-        # Transpose from NWHC input to NCHW
+        # Transpose from NHWC input to NCHW
         # Ex. [4, 5, 5, 48] to [4, 48, 5, 5]
         native_t1 = array_ops.transpose(t1, [0, 3, 1, 2])
         strides = [1, 1, stride, stride]
diff --git a/tensorflow/python/kernel_tests/random/random_ops_test.py b/tensorflow/python/kernel_tests/random/random_ops_test.py
index e4b5c3832a..0ef6a95cfc 100644
--- a/tensorflow/python/kernel_tests/random/random_ops_test.py
+++ b/tensorflow/python/kernel_tests/random/random_ops_test.py
@@ -24,13 +24,42 @@ from six.moves import xrange  # pylint: disable=redefined-builtin
 from tensorflow.python.eager import context
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import random_seed
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
 
-class RandomNormalTest(test.TestCase):
+class RandomOpTestCommon(test.TestCase):
+
+  # Checks that executing the same rng_func multiple times rarely produces the
+  # same result.
+  def _testSingleSessionNotConstant(self,
+                                    rng_func,
+                                    num,
+                                    dtype,
+                                    min_or_mean,
+                                    max_or_stddev,
+                                    use_gpu,
+                                    op_seed=None,
+                                    graph_seed=None):
+    with self.test_session(use_gpu=use_gpu, graph=ops.Graph()) as sess:
+      if graph_seed is not None:
+        random_seed.set_random_seed(graph_seed)
+      x = rng_func([num], min_or_mean, max_or_stddev, dtype=dtype, seed=op_seed)
+
+      y = sess.run(x)
+      z = sess.run(x)
+      w = sess.run(x)
+
+      # We use exact equality here. If the random-number generator is producing
+      # the same output, all three outputs will be bitwise identical.
+      self.assertTrue((not np.array_equal(y, z)) or
+                      (not np.array_equal(z, w)) or (not np.array_equal(y, w)))
+
+
+class RandomNormalTest(RandomOpTestCommon):
 
   def _Sampler(self, num, mu, sigma, dtype, use_gpu, seed=None):
 
@@ -90,6 +119,36 @@ class RandomNormalTest(test.TestCase):
         diff = rnd2 - rnd1
         self.assertTrue(np.linalg.norm(diff.eval()) > 0.1)
 
+  def testSingleSessionNotConstant(self):
+    for use_gpu in [False, True]:
+      for dt in dtypes.float16, dtypes.float32, dtypes.float64:
+        self._testSingleSessionNotConstant(
+            random_ops.random_normal, 100, dt, 0.0, 1.0, use_gpu=use_gpu)
+
+  def testSingleSessionOpSeedNotConstant(self):
+    for use_gpu in [False, True]:
+      for dt in dtypes.float16, dtypes.float32, dtypes.float64:
+        self._testSingleSessionNotConstant(
+            random_ops.random_normal,
+            100,
+            dt,
+            0.0,
+            1.0,
+            use_gpu=use_gpu,
+            op_seed=1345)
+
+  def testSingleSessionGraphSeedNotConstant(self):
+    for use_gpu in [False, True]:
+      for dt in dtypes.float16, dtypes.float32, dtypes.float64:
+        self._testSingleSessionNotConstant(
+            random_ops.random_normal,
+            100,
+            dt,
+            0.0,
+            1.0,
+            use_gpu=use_gpu,
+            graph_seed=965)
+
 
 class TruncatedNormalTest(test.TestCase):
 
@@ -187,7 +246,7 @@ class TruncatedNormalTest(test.TestCase):
       self.assertAllEqual(rnd1, rnd2)
 
 
-class RandomUniformTest(test.TestCase):
+class RandomUniformTest(RandomOpTestCommon):
 
   def _Sampler(self, num, minv, maxv, dtype, use_gpu, seed=None):
 
@@ -291,6 +350,39 @@ class RandomUniformTest(test.TestCase):
         diff = (rnd2 - rnd1).eval()
         self.assertTrue(np.linalg.norm(diff) > 0.1)
 
+  def testSingleSessionNotConstant(self):
+    for use_gpu in [False, True]:
+      for dt in (dtypes.float16, dtypes.float32, dtypes.float64, dtypes.int32,
+                 dtypes.int64):
+        self._testSingleSessionNotConstant(
+            random_ops.random_uniform, 100, dt, 0, 17, use_gpu=use_gpu)
+
+  def testSingleSessionOpSeedNotConstant(self):
+    for use_gpu in [False, True]:
+      for dt in (dtypes.float16, dtypes.float32, dtypes.float64, dtypes.int32,
+                 dtypes.int64):
+        self._testSingleSessionNotConstant(
+            random_ops.random_uniform,
+            100,
+            dt,
+            10,
+            20,
+            use_gpu=use_gpu,
+            op_seed=1345)
+
+  def testSingleSessionGraphSeedNotConstant(self):
+    for use_gpu in [False, True]:
+      for dt in (dtypes.float16, dtypes.float32, dtypes.float64, dtypes.int32,
+                 dtypes.int64):
+        self._testSingleSessionNotConstant(
+            random_ops.random_uniform,
+            100,
+            dt,
+            20,
+            200,
+            use_gpu=use_gpu,
+            graph_seed=965)
+
 
 class RandomShapeTest(test.TestCase):
 
diff --git a/tensorflow/python/kernel_tests/resource_variable_ops_test.py b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
index 0fb0b8895c..c739cd2c0d 100644
--- a/tensorflow/python/kernel_tests/resource_variable_ops_test.py
+++ b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
@@ -246,6 +246,15 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     read = resource_variable_ops.read_variable_op(handle, dtype=dtypes.int32)
     self.assertEqual(self.evaluate(read), [[2]])
 
+  def testUseResource(self):
+    v = variables.Variable(1.0, use_resource=True)
+    self.assertTrue(isinstance(v, resource_variable_ops.ResourceVariable))
+
+  def testEagerNoUseResource(self):
+    with context.eager_mode():
+      v = variables.Variable(1.0)
+      self.assertTrue(isinstance(v, resource_variable_ops.ResourceVariable))
+
   @test_util.run_in_graph_and_eager_modes
   def testScatterMin(self):
     with ops.device("cpu:0"):
@@ -852,5 +861,62 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       state_ops.scatter_update(v, [0, 1], [0, 1, 2])
 
 
+class _MixedPrecisionVariableTest(test_util.TensorFlowTestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def test_dense_var_to_tensor_read_dtype_same_as_var_dtype(self):
+    # read_dtype is same as dtype
+    v = resource_variable_ops.ResourceVariable(1.0, dtype=dtypes.float32)
+    v = resource_variable_ops._MixedPrecisionVariable(v, dtypes.float32)
+    if not context.executing_eagerly():
+      v.initializer.run()
+
+    # dtype is not read_dtype, return NotImplemented
+    self.assertEqual(
+        NotImplemented, v._dense_var_to_tensor(dtype=dtypes.float16))
+    self.assertEqual(NotImplemented,
+                     v._dense_var_to_tensor(dtype=dtypes.float16, as_ref=True))
+
+    # as_ref is False
+    t = v._dense_var_to_tensor(as_ref=False)
+    self.assertTrue(isinstance(t, ops.Tensor))
+    self.assertEqual(t.dtype, dtypes.float32)
+    self.assertEqual(self.evaluate(t), 1.0)
+
+    t = v._dense_var_to_tensor(dtype=dtypes.float32, as_ref=False)
+    self.assertTrue(isinstance(t, ops.Tensor))
+    self.assertEqual(t.dtype, dtypes.float32)
+    self.assertEqual(self.evaluate(t), 1.0)
+
+    # as_ref is True
+    self.assertEqual(NotImplemented, v._dense_var_to_tensor(as_ref=True))
+    self.assertEqual(NotImplemented,
+                     v._dense_var_to_tensor(dtype=dtypes.float32, as_ref=True))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def test_dense_var_to_tensor_read_dtype_different_from_var_dtype(self):
+    # read_dtype is different from dtype
+    v = resource_variable_ops.ResourceVariable(1.0, dtype=dtypes.float32)
+    v = resource_variable_ops._MixedPrecisionVariable(v, dtypes.float16)
+    if not context.executing_eagerly():
+      v.initializer.run()
+
+    # as_ref is False
+    t = v._dense_var_to_tensor(as_ref=False)
+    self.assertTrue(isinstance(t, ops.Tensor))
+    self.assertEqual(t.dtype, dtypes.float16)
+    self.assertEqual(self.evaluate(t), 1.0)
+
+    t = v._dense_var_to_tensor(dtype=dtypes.float16, as_ref=False)
+    self.assertTrue(isinstance(t, ops.Tensor))
+    self.assertEqual(t.dtype, dtypes.float16)
+    self.assertEqual(self.evaluate(t), 1.0)
+
+    # as_ref is True
+    self.assertEqual(NotImplemented, v._dense_var_to_tensor(as_ref=True))
+    self.assertEqual(NotImplemented,
+                     v._dense_var_to_tensor(dtype=dtypes.float16, as_ref=True))
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/rnn_test.py b/tensorflow/python/kernel_tests/rnn_test.py
index 957baf8c60..acee180a6c 100644
--- a/tensorflow/python/kernel_tests/rnn_test.py
+++ b/tensorflow/python/kernel_tests/rnn_test.py
@@ -268,6 +268,12 @@ class RNNTest(test.TestCase):
     self._assert_cell_builds(rnn_cell_impl.GRUCell, f64, 5, 7, 3)
     self._assert_cell_builds(rnn_cell_impl.LSTMCell, f32, 5, 7, 3)
     self._assert_cell_builds(rnn_cell_impl.LSTMCell, f64, 5, 7, 3)
+    self._assert_cell_builds(contrib_rnn.IndRNNCell, f32, 5, 7, 3)
+    self._assert_cell_builds(contrib_rnn.IndRNNCell, f64, 5, 7, 3)
+    self._assert_cell_builds(contrib_rnn.IndyGRUCell, f32, 5, 7, 3)
+    self._assert_cell_builds(contrib_rnn.IndyGRUCell, f64, 5, 7, 3)
+    self._assert_cell_builds(contrib_rnn.IndyLSTMCell, f32, 5, 7, 3)
+    self._assert_cell_builds(contrib_rnn.IndyLSTMCell, f64, 5, 7, 3)
 
 
 ######### Benchmarking RNN code
diff --git a/tensorflow/python/kernel_tests/scatter_nd_ops_test.py b/tensorflow/python/kernel_tests/scatter_nd_ops_test.py
index f9b9c77bbf..f2f3023469 100644
--- a/tensorflow/python/kernel_tests/scatter_nd_ops_test.py
+++ b/tensorflow/python/kernel_tests/scatter_nd_ops_test.py
@@ -268,12 +268,12 @@ class StatefulScatterNdTest(test.TestCase):
         # Test some out of range errors.
         indices = np.array([[-1], [0], [5]])
         with self.assertRaisesOpError(
-            r"Invalid indices: \[0,0\] = \[-1\] does not index into \[6\]"):
+            r"indices\[0\] = \[-1\] does not index into shape \[6\]"):
           op(ref, indices, updates).eval()
 
         indices = np.array([[2], [0], [6]])
         with self.assertRaisesOpError(
-            r"Invalid indices: \[2,0\] = \[6\] does not index into \[6\]"):
+            r"indices\[2\] = \[6\] does not index into shape \[6\]"):
           op(ref, indices, updates).eval()
 
   def testRank3ValidShape(self):
@@ -370,6 +370,29 @@ class ScatterNdTest(test.TestCase):
     return array_ops.scatter_nd(indices, updates, shape)
 
   @test_util.run_in_graph_and_eager_modes
+  def testBool(self):
+    indices = constant_op.constant(
+        [[4], [3], [1], [7]], dtype=dtypes.int32)
+    updates = constant_op.constant(
+        [False, True, False, True], dtype=dtypes.bool)
+    expected = np.array(
+        [False, False, False, True, False, False, False, True])
+    scatter = self.scatter_nd(indices, updates, shape=(8,))
+    result = self.evaluate(scatter)
+    self.assertAllEqual(expected, result)
+
+    # Same indice is updated twice by same value.
+    indices = constant_op.constant(
+        [[4], [3], [3], [7]], dtype=dtypes.int32)
+    updates = constant_op.constant(
+        [False, True, True, True], dtype=dtypes.bool)
+    expected = np.array([
+        False, False, False, True, False, False, False, True])
+    scatter = self.scatter_nd(indices, updates, shape=(8,))
+    result = self.evaluate(scatter)
+    self.assertAllEqual(expected, result)
+
+  @test_util.run_in_graph_and_eager_modes
   def testInvalidShape(self):
     # TODO(apassos) figure out how to unify these errors
     with self.assertRaises(errors.InvalidArgumentError
diff --git a/tensorflow/python/kernel_tests/variable_scope_test.py b/tensorflow/python/kernel_tests/variable_scope_test.py
index 054c6f9dd7..ae2a0ab29a 100644
--- a/tensorflow/python/kernel_tests/variable_scope_test.py
+++ b/tensorflow/python/kernel_tests/variable_scope_test.py
@@ -1054,7 +1054,7 @@ class VariableScopeTest(test.TestCase):
           "testGetCollection_foo/testGetCollection_a:0"
       ])
 
-  def testGetTrainableVariables(self):
+  def testGetTrainableVariablesWithGetVariable(self):
     with self.test_session():
       _ = variable_scope.get_variable("testGetTrainableVariables_a", [])
       with variable_scope.variable_scope(
@@ -1062,10 +1062,72 @@ class VariableScopeTest(test.TestCase):
         _ = variable_scope.get_variable("testGetTrainableVariables_b", [])
         _ = variable_scope.get_variable(
             "testGetTrainableVariables_c", [], trainable=False)
+
+        # sync `ON_READ` sets trainable=False
+        _ = variable_scope.get_variable(
+            "testGetTrainableVariables_d", [],
+            synchronization=variable_scope.VariableSynchronization.ON_READ)
         self.assertEqual(
             [v.name for v in scope.trainable_variables()],
-            ["testGetTrainableVariables_foo/"
-             "testGetTrainableVariables_b:0"])
+            ["testGetTrainableVariables_foo/testGetTrainableVariables_b:0"])
+
+        # All other sync values sets trainable=True
+        _ = variable_scope.get_variable(
+            "testGetTrainableVariables_e", [],
+            synchronization=variable_scope.VariableSynchronization.ON_WRITE)
+        self.assertEqual([v.name for v in scope.trainable_variables()], [
+            "testGetTrainableVariables_foo/testGetTrainableVariables_b:0",
+            "testGetTrainableVariables_foo/testGetTrainableVariables_e:0"
+        ])
+
+      with self.assertRaisesRegexp(
+          ValueError, "Synchronization value can be set to "
+          "VariableSynchronization.ON_READ only for non-trainable variables. "
+          "You have specified trainable=True and "
+          "synchronization=VariableSynchronization.ON_READ."):
+        _ = variable_scope.get_variable(
+            "testGetTrainableVariables_e", [],
+            synchronization=variable_scope.VariableSynchronization.ON_READ,
+            trainable=True)
+
+  def testGetTrainableVariablesWithVariable(self):
+    with self.test_session():
+      _ = variable_scope.variable(1.0, name="testGetTrainableVariables_a")
+      with variable_scope.variable_scope(
+          "testGetTrainableVariables_foo") as scope:
+        _ = variable_scope.variable(1.0, name="testGetTrainableVariables_b")
+        _ = variable_scope.variable(
+            1.0, name="testGetTrainableVariables_c", trainable=False)
+
+        # sync `ON_READ` sets trainable=False
+        _ = variable_scope.variable(
+            1.0,
+            name="testGetTrainableVariables_d",
+            synchronization=variable_scope.VariableSynchronization.ON_READ)
+        self.assertEqual(
+            [v.name for v in scope.trainable_variables()],
+            ["testGetTrainableVariables_foo/testGetTrainableVariables_b:0"])
+
+        # All other sync values sets trainable=True
+        _ = variable_scope.variable(
+            1.0,
+            name="testGetTrainableVariables_e",
+            synchronization=variable_scope.VariableSynchronization.ON_WRITE)
+        self.assertEqual([v.name for v in scope.trainable_variables()], [
+            "testGetTrainableVariables_foo/testGetTrainableVariables_b:0",
+            "testGetTrainableVariables_foo/testGetTrainableVariables_e:0"
+        ])
+
+      with self.assertRaisesRegexp(
+          ValueError, "Synchronization value can be set to "
+          "VariableSynchronization.ON_READ only for non-trainable variables. "
+          "You have specified trainable=True and "
+          "synchronization=VariableSynchronization.ON_READ."):
+        _ = variable_scope.variable(
+            1.0,
+            name="testGetTrainableVariables_e",
+            synchronization=variable_scope.VariableSynchronization.ON_READ,
+            trainable=True)
 
   def testGetGlobalVariables(self):
     with self.test_session():
diff --git a/tensorflow/python/kernel_tests/variables_test.py b/tensorflow/python/kernel_tests/variables_test.py
index 62d596da91..2b9c62ad6f 100644
--- a/tensorflow/python/kernel_tests/variables_test.py
+++ b/tensorflow/python/kernel_tests/variables_test.py
@@ -642,6 +642,8 @@ class PartitionedVariableTest(test.TestCase):
       iterated_partitions = list(partitioned_variable)
       self.assertEqual(2, num_partitions)
       self.assertEqual([v0, v1], iterated_partitions)
+      self.assertEqual([2], partitioned_variable.get_shape())
+      self.assertEqual([2], partitioned_variable.shape)
       self.assertEqual([2], concatenated.get_shape())
       self.assertEqual([2], concatenated.shape)
 
diff --git a/tensorflow/python/layers/base.py b/tensorflow/python/layers/base.py
index b8969a41ab..cf13b52617 100644
--- a/tensorflow/python/layers/base.py
+++ b/tensorflow/python/layers/base.py
@@ -152,10 +152,17 @@ class Layer(base_layer.Layer):
             scope, default_name=self._base_name) as captured_scope:
           self._scope = captured_scope
 
-  def add_weight(self, name, shape, dtype=None,
-                 initializer=None, regularizer=None,
-                 trainable=True, constraint=None,
+  def add_weight(self,
+                 name,
+                 shape,
+                 dtype=None,
+                 initializer=None,
+                 regularizer=None,
+                 trainable=None,
+                 constraint=None,
                  use_resource=None,
+                 synchronization=vs.VariableSynchronization.AUTO,
+                 aggregation=vs.VariableAggregation.NONE,
                  partitioner=None):
     """Adds a new variable to the layer, or gets an existing one; returns it.
 
@@ -170,9 +177,19 @@ class Layer(base_layer.Layer):
         or "non_trainable_variables" (e.g. BatchNorm mean, stddev).
         Note, if the current variable scope is marked as non-trainable
         then this parameter is ignored and any added variables are also
-        marked as non-trainable.
+        marked as non-trainable. `trainable` defaults to `True` unless
+        `synchronization` is set to `ON_READ`.
       constraint: constraint instance (callable).
       use_resource: Whether to use `ResourceVariable`.
+      synchronization: Indicates when a distributed a variable will be
+        aggregated. Accepted values are constants defined in the class
+        @{tf.VariableSynchronization}. By default the synchronization is set to
+        `AUTO` and the current `DistributionStrategy` chooses
+        when to synchronize. If `synchronization` is set to `ON_READ`,
+        `trainable` must not be set to `True`.
+      aggregation: Indicates how a distributed variable will be aggregated.
+        Accepted values are constants defined in the class
+        @{tf.VariableAggregation}.
       partitioner: (optional) partitioner instance (callable).  If
         provided, when the requested variable is created it will be split
         into multiple partitions according to `partitioner`.  In this case,
@@ -190,7 +207,21 @@ class Layer(base_layer.Layer):
     Raises:
       RuntimeError: If called with partioned variable regularization and
         eager execution is enabled.
+      ValueError: When trainable has been set to True with synchronization
+        set as `ON_READ`.
     """
+    if synchronization == vs.VariableSynchronization.ON_READ:
+      if trainable:
+        raise ValueError(
+            'Synchronization value can be set to '
+            'VariableSynchronization.ON_READ only for non-trainable variables. '
+            'You have specified trainable=True and '
+            'synchronization=VariableSynchronization.ON_READ.')
+      else:
+        # Set trainable to be false when variable is to be synced on read.
+        trainable = False
+    elif trainable is None:
+      trainable = True
 
     def _should_add_regularizer(variable, existing_variable_set):
       if isinstance(variable, tf_variables.PartitionedVariable):
@@ -240,6 +271,8 @@ class Layer(base_layer.Layer):
             constraint=constraint,
             partitioner=partitioner,
             use_resource=use_resource,
+            synchronization=synchronization,
+            aggregation=aggregation,
             getter=vs.get_variable)
 
         if regularizer:
diff --git a/tensorflow/python/layers/base_test.py b/tensorflow/python/layers/base_test.py
index 298e96e711..d2443db665 100644
--- a/tensorflow/python/layers/base_test.py
+++ b/tensorflow/python/layers/base_test.py
@@ -90,12 +90,34 @@ class BaseLayerTest(test.TestCase):
 
       # regularizers only supported in GRAPH mode.
       regularizer = lambda x: math_ops.reduce_sum(x) * 1e-3
-      variable = layer.add_variable(
+      _ = layer.add_variable(
           'reg_var', [2, 2],
           initializer=init_ops.zeros_initializer(),
           regularizer=regularizer)
       self.assertEqual(len(layer.losses), 1)
 
+    # Test that sync `ON_READ` variables are defaulted to be non-trainable.
+    variable_3 = layer.add_variable(
+        'sync_on_read_var', [2, 2],
+        initializer=init_ops.zeros_initializer(),
+        synchronization=variable_scope.VariableSynchronization.ON_READ,
+        aggregation=variable_scope.VariableAggregation.SUM)
+    self.assertEqual(layer.non_trainable_variables, [variable_2, variable_3])
+
+  def testInvalidTrainableSynchronizationCombination(self):
+    layer = base_layers.Layer(name='my_layer')
+
+    with self.assertRaisesRegexp(
+        ValueError, 'Synchronization value can be set to '
+        'VariableSynchronization.ON_READ only for non-trainable variables. '
+        'You have specified trainable=True and '
+        'synchronization=VariableSynchronization.ON_READ.'):
+      _ = layer.add_variable(
+          'v', [2, 2],
+          initializer=init_ops.zeros_initializer(),
+          synchronization=variable_scope.VariableSynchronization.ON_READ,
+          trainable=True)
+
   def testReusePartitionedVaraiblesAndRegularizers(self):
     regularizer = lambda x: math_ops.reduce_sum(x) * 1e-3
     partitioner = partitioned_variables.fixed_size_partitioner(3)
@@ -104,7 +126,7 @@ class BaseLayerTest(test.TestCase):
                                          partitioner=partitioner,
                                          reuse=reuse):
         layer = base_layers.Layer(name='my_layer')
-        variable = layer.add_variable(
+        _ = layer.add_variable(
             'reg_part_var', [4, 4],
             initializer=init_ops.zeros_initializer(),
             regularizer=regularizer)
diff --git a/tensorflow/python/lib/core/ndarray_tensor.cc b/tensorflow/python/lib/core/ndarray_tensor.cc
index ec1ba7b8f7..5765b17594 100644
--- a/tensorflow/python/lib/core/ndarray_tensor.cc
+++ b/tensorflow/python/lib/core/ndarray_tensor.cc
@@ -136,6 +136,33 @@ Status PyArray_TYPE_to_TF_DataType(PyArrayObject* array,
   return Status::OK();
 }
 
+Status PyObjectToString(PyObject* obj, const char** ptr, Py_ssize_t* len,
+                        PyObject** ptr_owner) {
+  *ptr_owner = nullptr;
+  if (!PyUnicode_Check(obj)) {
+    char* buf;
+    if (PyBytes_AsStringAndSize(obj, &buf, len) != 0) {
+      return errors::Internal("Unable to get element as bytes.");
+    }
+    *ptr = buf;
+    return Status::OK();
+  }
+#if (PY_MAJOR_VERSION > 3 || (PY_MAJOR_VERSION == 3 && PY_MINOR_VERSION >= 3))
+  *ptr = PyUnicode_AsUTF8AndSize(obj, len);
+  if (*ptr != nullptr) return Status::OK();
+#else
+  PyObject* utemp = PyUnicode_AsUTF8String(obj);
+  char* buf;
+  if (utemp != nullptr && PyBytes_AsStringAndSize(utemp, &buf, len) != -1) {
+    *ptr = buf;
+    *ptr_owner = utemp;
+    return Status::OK();
+  }
+  Py_XDECREF(utemp);
+#endif
+  return errors::Internal("Unable to convert element to UTF-8.");
+}
+
 // Iterate over the string array 'array', extract the ptr and len of each string
 // element and call f(ptr, len).
 template <typename F>
@@ -148,33 +175,12 @@ Status PyBytesArrayMap(PyArrayObject* array, F f) {
     if (!item) {
       return errors::Internal("Unable to get element from the feed - no item.");
     }
-    char* ptr;
     Py_ssize_t len;
-
-    if (PyUnicode_Check(item.get())) {
-#if PY_VERSION_HEX >= 0x03030000
-      // Accept unicode by converting to UTF-8 bytes.
-      ptr = PyUnicode_AsUTF8AndSize(item.get(), &len);
-      if (!ptr) {
-        return errors::Internal("Unable to get element as UTF-8.");
-      }
-      f(ptr, len);
-#else
-      PyObject* utemp = PyUnicode_AsUTF8String(item.get());
-      if (!utemp || PyBytes_AsStringAndSize(utemp, &ptr, &len) == -1) {
-        Py_XDECREF(utemp);
-        return errors::Internal("Unable to convert element to UTF-8.");
-      }
-      f(ptr, len);
-      Py_DECREF(utemp);
-#endif
-    } else {
-      int success = PyBytes_AsStringAndSize(item.get(), &ptr, &len);
-      if (success != 0) {
-        return errors::Internal("Unable to get element as bytes.");
-      }
-      f(ptr, len);
-    }
+    const char* ptr;
+    PyObject* ptr_owner;
+    TF_RETURN_IF_ERROR(PyObjectToString(item.get(), &ptr, &len, &ptr_owner));
+    f(ptr, len);
+    Py_XDECREF(ptr_owner);
     PyArray_ITER_NEXT(iter.get());
   }
   return Status::OK();
@@ -186,10 +192,11 @@ Status EncodePyBytesArray(PyArrayObject* array, tensorflow::int64 nelems,
                           size_t* size, void** buffer) {
   // Compute bytes needed for encoding.
   *size = 0;
-  TF_RETURN_IF_ERROR(PyBytesArrayMap(array, [&size](char* ptr, Py_ssize_t len) {
-    *size +=
-        sizeof(tensorflow::uint64) + tensorflow::core::VarintLength(len) + len;
-  }));
+  TF_RETURN_IF_ERROR(
+      PyBytesArrayMap(array, [&size](const char* ptr, Py_ssize_t len) {
+        *size += sizeof(tensorflow::uint64) +
+                 tensorflow::core::VarintLength(len) + len;
+      }));
   // Encode all strings.
   std::unique_ptr<char[]> base_ptr(new char[*size]);
   char* base = base_ptr.get();
@@ -198,7 +205,7 @@ Status EncodePyBytesArray(PyArrayObject* array, tensorflow::int64 nelems,
   tensorflow::uint64* offsets = reinterpret_cast<tensorflow::uint64*>(base);
 
   TF_RETURN_IF_ERROR(PyBytesArrayMap(
-      array, [&base, &data_start, &dst, &offsets](char* ptr, Py_ssize_t len) {
+      array, [&data_start, &dst, &offsets](const char* ptr, Py_ssize_t len) {
         *offsets = (dst - data_start);
         offsets++;
         dst = tensorflow::core::EncodeVarint64(dst, len);
diff --git a/tensorflow/python/lib/core/numpy.h b/tensorflow/python/lib/core/numpy.h
index d4621d61ee..0098d938a0 100644
--- a/tensorflow/python/lib/core/numpy.h
+++ b/tensorflow/python/lib/core/numpy.h
@@ -30,9 +30,10 @@ limitations under the License.
 #endif
 
 // Place `<locale>` before <Python.h> to avoid build failure in macOS.
-#include <Python.h>
 #include <locale>
 
+#include <Python.h>
+
 #include "numpy/arrayobject.h"
 #include "numpy/ufuncobject.h"
 
diff --git a/tensorflow/python/lib/core/py_func.cc b/tensorflow/python/lib/core/py_func.cc
index 57139986af..7c107138be 100644
--- a/tensorflow/python/lib/core/py_func.cc
+++ b/tensorflow/python/lib/core/py_func.cc
@@ -333,6 +333,35 @@ class NumpyTensorBuffer : public TensorBuffer {
   void* data_;
 };
 
+Status PyObjectToString(PyObject* obj, string* str) {
+  char* py_bytes;
+  Py_ssize_t size;
+  if (PyBytes_AsStringAndSize(obj, &py_bytes, &size) != -1) {
+    str->assign(py_bytes, size);
+    return Status::OK();
+  }
+#if PY_MAJOR_VERSION >= 3
+  const char* ptr = PyUnicode_AsUTF8AndSize(obj, &size);
+  if (ptr != nullptr) {
+    str->assign(ptr, size);
+    return Status::OK();
+  }
+#else
+  if (PyUnicode_Check(obj)) {
+    PyObject* unicode = PyUnicode_AsUTF8String(obj);
+    char* ptr;
+    if (unicode && PyString_AsStringAndSize(unicode, &ptr, &size) != -1) {
+      str->assign(ptr, size);
+      Py_DECREF(unicode);
+      return Status::OK();
+    }
+    Py_XDECREF(unicode);
+  }
+#endif
+  return errors::Unimplemented("Unsupported object type ",
+                               obj->ob_type->tp_name);
+}
+
 Status ConvertNdarrayToTensor(PyObject* obj, Tensor* ret) {
   PyArrayObject* input = reinterpret_cast<PyArrayObject*>(obj);
   DataType dtype = DT_INVALID;
@@ -348,29 +377,7 @@ Status ConvertNdarrayToTensor(PyObject* obj, Tensor* ret) {
       auto tflat = t.flat<string>();
       PyObject** input_data = reinterpret_cast<PyObject**>(PyArray_DATA(input));
       for (int i = 0; i < tflat.dimension(0); ++i) {
-        char* el;
-        Py_ssize_t el_size;
-        if (PyBytes_AsStringAndSize(input_data[i], &el, &el_size) == -1) {
-#if PY_MAJOR_VERSION >= 3
-          el = PyUnicode_AsUTF8AndSize(input_data[i], &el_size);
-#else
-          el = nullptr;
-          if (PyUnicode_Check(input_data[i])) {
-            PyObject* unicode = PyUnicode_AsUTF8String(input_data[i]);
-            if (unicode) {
-              if (PyString_AsStringAndSize(unicode, &el, &el_size) == -1) {
-                Py_DECREF(unicode);
-                el = nullptr;
-              }
-            }
-          }
-#endif
-          if (!el) {
-            return errors::Unimplemented("Unsupported object type ",
-                                         input_data[i]->ob_type->tp_name);
-          }
-        }
-        tflat(i) = string(el, el_size);
+        TF_RETURN_IF_ERROR(PyObjectToString(input_data[i], &tflat(i)));
       }
       *ret = t;
       break;
diff --git a/tensorflow/python/lib/core/py_util.cc b/tensorflow/python/lib/core/py_util.cc
index 6b6c82015f..2ee898ea1d 100644
--- a/tensorflow/python/lib/core/py_util.cc
+++ b/tensorflow/python/lib/core/py_util.cc
@@ -16,9 +16,10 @@ limitations under the License.
 #include "tensorflow/python/lib/core/py_util.h"
 
 // Place `<locale>` before <Python.h> to avoid build failure in macOS.
-#include <Python.h>
 #include <locale>
 
+#include <Python.h>
+
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 
diff --git a/tensorflow/python/ops/array_grad.py b/tensorflow/python/ops/array_grad.py
index fe459a96b9..a2b5f77f91 100644
--- a/tensorflow/python/ops/array_grad.py
+++ b/tensorflow/python/ops/array_grad.py
@@ -790,7 +790,7 @@ def _ExtractImagePatchesGrad(op, grad):
 
   sp_mat = sparse_tensor.SparseTensor(
       array_ops.constant(idx, dtype=ops.dtypes.int64),
-      array_ops.ones((len(idx),), dtype=ops.dtypes.float32), sp_shape)
+      array_ops.ones((len(idx),), dtype=grad.dtype), sp_shape)
 
   jac = sparse_ops.sparse_tensor_dense_matmul(sp_mat, grad_flat)
 
diff --git a/tensorflow/python/ops/array_ops.py b/tensorflow/python/ops/array_ops.py
index 361667ec49..ec6488ea63 100644
--- a/tensorflow/python/ops/array_ops.py
+++ b/tensorflow/python/ops/array_ops.py
@@ -636,10 +636,10 @@ def strided_slice(input_,
   `foo[:4, tf.newaxis, :2]` would produce a shape `(4, 1, 2)` tensor.
 
   If the ith bit of `shrink_axis_mask` is set, it implies that the ith
-  specification shrinks the dimensionality by 1. `begin[i]`, `end[i]` and
-  `strides[i]` must imply a slice of size 1 in the dimension. For example in
-  Python one might do `foo[:, 3, :]` which would result in
-  `shrink_axis_mask` equal to 2.
+  specification shrinks the dimensionality by 1, taking on the value at index
+  `begin[i]`. `end[i]` and `strides[i]` are ignored in this case. For example in
+  Python one might do `foo[:, 3, :]` which would result in `shrink_axis_mask`
+  equal to 2.
 
 
   NOTE: `begin` and `end` are zero-indexed.
diff --git a/tensorflow/python/ops/boosted_trees_ops.py b/tensorflow/python/ops/boosted_trees_ops.py
index 9ebb607c47..f7cbfe0312 100644
--- a/tensorflow/python/ops/boosted_trees_ops.py
+++ b/tensorflow/python/ops/boosted_trees_ops.py
@@ -25,6 +25,7 @@ from tensorflow.python.ops import resources
 # Re-exporting ops used by other modules.
 # pylint: disable=unused-import
 from tensorflow.python.ops.gen_boosted_trees_ops import boosted_trees_calculate_best_gains_per_feature as calculate_best_gains_per_feature
+from tensorflow.python.ops.gen_boosted_trees_ops import boosted_trees_center_bias as center_bias
 from tensorflow.python.ops.gen_boosted_trees_ops import boosted_trees_example_debug_outputs as example_debug_outputs
 from tensorflow.python.ops.gen_boosted_trees_ops import boosted_trees_make_stats_summary as make_stats_summary
 from tensorflow.python.ops.gen_boosted_trees_ops import boosted_trees_predict as predict
@@ -36,8 +37,19 @@ from tensorflow.python.training import saver
 
 
 class PruningMode(object):
+  """Class for working with Pruning modes."""
   NO_PRUNING, PRE_PRUNING, POST_PRUNING = range(0, 3)
 
+  _map = {'none': NO_PRUNING, 'pre': PRE_PRUNING, 'post': POST_PRUNING}
+
+  @classmethod
+  def from_str(cls, mode):
+    if mode in cls._map:
+      return cls._map[mode]
+    else:
+      raise ValueError('pruning_mode mode must be one of: {}'.format(', '.join(
+          sorted(cls._map))))
+
 
 class _TreeEnsembleSavable(saver.BaseSaverBuilder.SaveableObject):
   """SaveableObject implementation for TreeEnsemble."""
diff --git a/tensorflow/python/ops/cond_v2_impl.py b/tensorflow/python/ops/cond_v2_impl.py
index d310f83dca..44c5c050c0 100644
--- a/tensorflow/python/ops/cond_v2_impl.py
+++ b/tensorflow/python/ops/cond_v2_impl.py
@@ -58,12 +58,14 @@ def cond_v2(pred, true_fn, false_fn, name="cond"):
 
   with ops.name_scope(name) as scope:
     # Identify if there is a caller device, & get the innermost if possible.
-    device_stack = ops.get_default_graph()._device_function_stack
-    caller_device = device_stack[-1] if device_stack else None
+    # pylint: disable=protected-access
+    device_funcs = ops.get_default_graph()._device_functions_outer_to_inner
+    caller_device = device_funcs[-1] if device_funcs else None
 
     caller_colocation_stack = ops.get_default_graph()._colocation_stack
     caller_container = ops.get_default_graph()._container
     caller_collection_ref = ops.get_default_graph()._collections
+    # pylint: enable=protected-access
 
     func_name_prefix = scope.replace("/", "_")
 
@@ -106,7 +108,7 @@ def cond_v2(pred, true_fn, false_fn, name="cond"):
     false_graph.outputs.extend(extra_false_outputs)
 
     # Create the If op.
-    tensors = gen_functional_ops._if(
+    tensors = gen_functional_ops._if(  # pylint: disable=protected-access
         pred, cond_inputs, [t.dtype for t in true_graph.outputs],
         _create_new_tf_function(true_graph),
         _create_new_tf_function(false_graph),
@@ -125,8 +127,10 @@ def cond_v2(pred, true_fn, false_fn, name="cond"):
     # TODO(b/110167197) this approach requires cond_v2 to have at least 1 output
     if_op = tensors[0].op
     if not control_flow_util.IsInXLAContext(if_op):
+      # pylint: disable=protected-access
       if_op._set_attr("_lower_using_switch_merge",
                       attr_value_pb2.AttrValue(b=True))
+      # pylint: enable=protected-access
 
     return tensors[:num_cond_outputs]
 
@@ -135,6 +139,10 @@ def cond_v2(pred, true_fn, false_fn, name="cond"):
 def _IfGrad(op, *grads):  # pylint: disable=invalid-name
   """The gradient of an If op produced by cond_v2."""
   true_graph, false_graph = _get_func_graphs(op)
+  # Note: op.graph != ops.get_default_graph() when we are computing the gradient
+  # of a nested cond.
+  assert true_graph._outer_graph == op.graph
+  assert false_graph._outer_graph == op.graph
 
   # Create grad functions that compute the gradient of the true/false forward
   # graphs. These functions will capture tensors from the forward pass
@@ -147,15 +155,16 @@ def _IfGrad(op, *grads):  # pylint: disable=invalid-name
   assert ([t.dtype for t in true_grad_graph.outputs] ==
           [t.dtype for t in false_grad_graph.outputs])
 
-  # Match up the captured grad function inputs with outputs of 'op' and other
-  # external tensors.
-  true_grad_inputs = _get_grad_inputs(op, true_graph, true_grad_graph)
-  false_grad_inputs = _get_grad_inputs(op, false_graph, false_grad_graph)
+  # Resolve references to forward graph tensors in grad graphs and ensure
+  # they are in-scope, i.e., belong to one of outer graphs of the grad graph.
+  true_grad_extra_inputs = _resolve_grad_inputs(true_graph, true_grad_graph)
+  false_grad_extra_inputs = _resolve_grad_inputs(false_graph, false_grad_graph)
 
   # Make the inputs to true_grad_graph and false_grad_graph match. Note that
   # this modifies true_grad_graph and false_grad_graph.
   grad_inputs = _make_inputs_match(true_grad_graph, false_grad_graph,
-                                   true_grad_inputs, false_grad_inputs)
+                                   true_grad_extra_inputs,
+                                   false_grad_extra_inputs)
 
   # Add all intermediate tensors as function outputs so they're available for
   # higher-order gradient computations.
@@ -199,11 +208,20 @@ def _get_func_graphs(if_op):
     input_shapes = [t.shape for t in extra_inputs]
     func_name = if_op.get_attr(branch_name).name
     fdef = if_op.graph._get_function(func_name).definition
-    func_graph = _function_def_to_graph.function_def_to_graph(
-        fdef, input_shapes)
+    # `if_op.graph` may not be the same as `ops.get_default_graph()` e.g.
+    # in the case of nested if ops or when the gradient is being computed
+    # from inside a Defun. We build the `func_graph` with `if_op.graph` as its
+    # `outer_graph`. This resembles how the `_FuncGraph` was built in the
+    # forward pass. We need this so that we can resolve references to tensors
+    # in `func_graph` from its gradient graph in `_resolve_grad_inputs`.
+    with if_op.graph.as_default():
+      func_graph = _function_def_to_graph.function_def_to_graph(
+          fdef, input_shapes)
     func_graph.extra_inputs = extra_inputs
     func_graph.extra_args = func_graph.inputs
     func_graph._captured = dict(zip(extra_inputs, func_graph.inputs))
+    # Set the if op so that the gradient code can use it.
+    func_graph._if = if_op
     return func_graph
 
   return (_get_func_graph_for_branch("then_branch"),
@@ -240,7 +258,7 @@ def _grad_fn(func_graph, grads):
   # Build the gradient graph. Note that this builds the gradient computation of
   # func_graph in the current graph, which requires capturing tensors from
   # func_graph. The captured func_graph tensors are resolved to external tensors
-  # in _get_grad_inputs.
+  # in _resolve_grad_inputs.
   result = _gradients_impl._GradientsHelper(
       ys, func_graph.inputs, grad_ys=grad_ys,
       src_graph=func_graph)
@@ -261,43 +279,49 @@ def _create_grad_func(func_graph, grads, name):
                                            [], [], name)
 
 
-def _get_grad_inputs(if_op, cond_graph, grad_graph):
-  """Returns the tensors we should pass to grad_graph.
+def _resolve_grad_inputs(cond_graph, grad_graph):
+  """Returns the tensors to pass as `extra_inputs` to `grad_graph`.
 
-  This method handles tensors captured from cond_graph in grad_graph. It
-  converts these to suitable input tensors from the outer graph.
+  The `grad_graph` may have external references to
+  1. Its outer graph containing the input gradients. These references are kept
+     as is.
+  2. Tensors in the forward pass graph. These tensors may not be "live"
+     when the gradient is being computed. We replace such references by their
+     corresponding tensor in the least common ancestor graph of `grad_graph` and
+     `cond_graph`. Since we export intermediate tensors for all branch
+     functions, this is always possible.
 
   Args:
-    if_op: Operation. The forward-pass If op that uses cond_graph.
     cond_graph: function._FuncGraph. The forward-pass function.
     grad_graph: function._FuncGraph. The gradients function.
 
   Returns:
     A list of inputs tensors to be passed to grad_graph.
   """
-  inputs = []
-
-  # Maps placeholders in cond_graph -> input tensor in outer graph.
-  forward_input_map = {v: k for k, v in cond_graph._captured.items()}
+  new_extra_inputs = []
 
   for t in grad_graph.extra_inputs:
-    if t.graph == ops.get_default_graph():
-      # t is in the outer graph (e.g. one of the input gradients).
-      inputs.append(t)
-    elif t in forward_input_map:
-      # t is an input placeholder in cond_graph. Get the corresponding input
-      # tensor in the outer graph.
-      assert t.graph == cond_graph
-      assert forward_input_map[t].graph == ops.get_default_graph()
-      inputs.append(forward_input_map[t])
-    else:
-      # t is an intermediate value in cond_graph. Get the corresponding output
-      # of 'if_op' (note that all intermediate values are outputs).
-      assert t.graph == cond_graph
-      output_idx = cond_graph.outputs.index(t)
-      inputs.append(if_op.outputs[output_idx])
-
-  return inputs
+    if t.graph != grad_graph._outer_graph:
+      # `t` is a tensor in `cond_graph` or one of its ancestors. We bubble this
+      # tensor to the least common ancestor of the `cond_graph` and
+      # `grad_graph` so that it is "in-scope" for `grad_graph`.
+      # TODO(srbs): `_is_ancestor` calls may be expensive. Compute the least
+      # common ancestor once and re-use.
+      assert _is_ancestor(cond_graph, t.graph)
+      while not _is_ancestor(grad_graph, t.graph):
+        assert isinstance(t.graph, _function._FuncGraph)
+        if t in t.graph.extra_args:
+          # TODO(srbs): Consider building a map of extra_args -> extra_inputs.
+          # instead of searching for `t` twice.
+          t = t.graph.extra_inputs[t.graph.extra_args.index(t)]
+        else:
+          # Note: All intermediate tensors are output by the If op.
+          # TODO(srbs): .index() calls may be expensive. Optimize.
+          t = t.graph._if.outputs[t.graph.outputs.index(t)]
+      assert _is_ancestor(grad_graph, t.graph)
+    new_extra_inputs.append(t)
+
+  return new_extra_inputs
 
 
 def _create_new_tf_function(func_graph):
@@ -326,7 +350,8 @@ def _create_new_tf_function(func_graph):
   # a new TF_Function that we add to the graph.
   fdef = _function.function_def_from_tf_function(c_func)
   defined_func = _function._from_definition(fdef)
-  defined_func.add_to_graph(ops.get_default_graph())
+  defined_func._sub_functions = func_graph._functions
+  defined_func.add_to_graph(func_graph._outer_graph)
 
   return func_graph.name
 
@@ -389,7 +414,8 @@ def _pad_params(true_graph, false_graph, true_params, false_params):
   return new_true_params, new_false_inputs
 
 
-def _make_inputs_match(true_graph, false_graph, true_inputs, false_inputs):
+def _make_inputs_match(true_graph, false_graph, true_extra_inputs,
+                       false_extra_inputs):
   """Modifies true_graph and false_graph so they have the same input signature.
 
   This method reorders and/or adds parameters to true_graph and false_graph so
@@ -400,9 +426,9 @@ def _make_inputs_match(true_graph, false_graph, true_inputs, false_inputs):
   Args:
     true_graph: function._FuncGraph
     false_graph: function._FuncGraph
-    true_inputs: a list of Tensors in the outer graph. The inputs for
+    true_extra_inputs: a list of Tensors in the outer graph. The inputs for
       true_graph.
-    false_inputs: a list of Tensors in the outer graph. The inputs for
+    false_extra_inputs: a list of Tensors in the outer graph. The inputs for
       false_graph.
 
   Returns:
@@ -411,12 +437,12 @@ def _make_inputs_match(true_graph, false_graph, true_inputs, false_inputs):
     false_inputs.
   """
   shared_inputs, true_only_inputs, false_only_inputs = _separate_unique_inputs(
-      true_inputs, false_inputs)
+      true_extra_inputs, false_extra_inputs)
 
   new_inputs = shared_inputs + true_only_inputs + false_only_inputs
 
-  true_input_to_param = dict(zip(true_inputs, true_graph.inputs))
-  false_input_to_param = dict(zip(false_inputs, false_graph.inputs))
+  true_input_to_param = dict(zip(true_extra_inputs, true_graph.inputs))
+  false_input_to_param = dict(zip(false_extra_inputs, false_graph.inputs))
 
   true_graph.inputs = (
       [true_input_to_param[t] for t in shared_inputs] +
@@ -432,6 +458,9 @@ def _make_inputs_match(true_graph, false_graph, true_inputs, false_inputs):
   true_graph.extra_inputs = new_inputs
   false_graph.extra_inputs = new_inputs
 
+  true_graph.extra_args = true_graph.inputs
+  false_graph.extra_args = false_graph.inputs
+
   true_graph._captured = dict(zip(new_inputs, true_graph.inputs))
   false_graph._captured = dict(zip(new_inputs, false_graph.inputs))
 
@@ -454,14 +483,30 @@ def _create_dummy_params(func_graph, template_tensors):
 
 
 def _get_grad_fn_name(func_graph):
-  """Returns a unique name to use for the grad function of `func_graph`."""
+  """Returns a unique name to use for the grad function of `func_graph`.
+
+  Ensures this name is unique in the entire hierarchy.
+
+  Args:
+    func_graph: The _FuncGraph.
+
+  Returns:
+    A string, the name to use for the gradient function.
+  """
   name = "%s_grad" % func_graph.name
 
   base_name = name
   counter = 1
-  if ops.get_default_graph()._is_function(name):
-    name = "%s_%s" % (base_name, counter)
-    counter += 1
+  has_conflict = True
+  while has_conflict:
+    curr_graph = func_graph._outer_graph
+    has_conflict = curr_graph._is_function(name)
+    while not has_conflict and isinstance(curr_graph, _function._FuncGraph):
+      curr_graph = curr_graph._outer_graph
+      has_conflict = curr_graph._is_function(name)
+    if has_conflict:
+      name = "%s_%s" % (base_name, counter)
+      counter += 1
 
   return name
 
@@ -477,3 +522,11 @@ def _check_same_outputs(true_graph, false_graph):
         "arguments, got:\n"
         "  true_fn: %s\n"
         "  false_fn: %s" % (true_output_types, false_output_types))
+
+
+def _is_ancestor(graph, maybe_ancestor):
+  if maybe_ancestor == graph:
+    return True
+  if isinstance(graph, _function._FuncGraph):
+    return _is_ancestor(graph._outer_graph, maybe_ancestor)
+  return False
diff --git a/tensorflow/python/ops/control_flow_ops.py b/tensorflow/python/ops/control_flow_ops.py
index fc37805c79..c7061b36dd 100644
--- a/tensorflow/python/ops/control_flow_ops.py
+++ b/tensorflow/python/ops/control_flow_ops.py
@@ -817,11 +817,12 @@ class GradLoopState(object):
       outer_forward_ctxt = forward_ctxt.outer_context
 
     # Add the forward loop counter.
-    if outer_forward_ctxt:
-      outer_forward_ctxt.Enter()
-    cnt, forward_index = forward_ctxt.AddForwardLoopCounter(outer_grad_state)
-    if outer_forward_ctxt:
-      outer_forward_ctxt.Exit()
+    with forward_ctxt._graph.as_default():  # pylint: disable=protected-access
+      if outer_forward_ctxt:
+        outer_forward_ctxt.Enter()
+      cnt, forward_index = forward_ctxt.AddForwardLoopCounter(outer_grad_state)
+      if outer_forward_ctxt:
+        outer_forward_ctxt.Exit()
     self._forward_context = forward_ctxt
     self._forward_index = forward_index
 
@@ -984,60 +985,61 @@ class GradLoopState(object):
         for the stack can't be found.
     """
     # curr_ctxt is the context that tf.gradients was called in.
-    curr_ctxt = ops.get_default_graph()._get_control_flow_context()  # pylint: disable=protected-access
-    with ops.control_dependencies(None):
-      if curr_ctxt:
-        curr_ctxt.Enter()
-      with ops.colocate_with(value):
-        # We only need to pass maximum_iterations to the stack if
-        # we're inside an XLA context.
-        if not util.IsInXLAContext(value.op):
-          max_size = constant_op.constant(-1, dtypes.int32)
-        else:
-          max_size = GetMaxSizeFromNestedMaximumIterations(
-              value, self.forward_context)
-        acc = gen_data_flow_ops.stack_v2(
-            max_size=max_size, elem_type=value.dtype.base_dtype, name="f_acc")
-      if curr_ctxt:
-        curr_ctxt.Exit()
-
-      # Make acc available in the forward context.
-      enter_acc = self.forward_context.AddValue(acc)
-
-      # Add the stack_push op in the context of value.op.
-      swap_enabled = self.forward_context.swap_memory
-      value_ctxt = util.GetOutputContext(value.op)
-      if value_ctxt == self.forward_context:
-        # value is not nested in the forward context.
-        self.forward_context.Enter()
-        push = gen_data_flow_ops.stack_push_v2(
-            enter_acc, value, swap_memory=swap_enabled)
-        self.forward_context.Exit()
-        # Protect stack push and order it before forward_index.
-        self.forward_index.op._add_control_input(push.op)
-      else:
-        # value is in a cond context within the forward context.
-        if not isinstance(value_ctxt, CondContext):
-          raise TypeError("value_ctxt is not a CondContext: %s" % value_ctxt)
-        if dead_branch:
-          # The special case for creating a zero tensor for a dead
-          # branch of a switch. See ControlFlowState.ZerosLike().
-          value_ctxt.outer_context.Enter()
+    with self._forward_index.graph.as_default():
+      curr_ctxt = ops.get_default_graph()._get_control_flow_context()  # pylint: disable=protected-access
+      with ops.control_dependencies(None):
+        if curr_ctxt:
+          curr_ctxt.Enter()
+        with ops.colocate_with(value):
+          # We only need to pass maximum_iterations to the stack if
+          # we're inside an XLA context.
+          if not util.IsInXLAContext(value.op):
+            max_size = constant_op.constant(-1, dtypes.int32)
+          else:
+            max_size = GetMaxSizeFromNestedMaximumIterations(
+                value, self.forward_context)
+          acc = gen_data_flow_ops.stack_v2(
+              max_size=max_size, elem_type=value.dtype.base_dtype, name="f_acc")
+        if curr_ctxt:
+          curr_ctxt.Exit()
+
+        # Make acc available in the forward context.
+        enter_acc = self.forward_context.AddValue(acc)
+
+        # Add the stack_push op in the context of value.op.
+        swap_enabled = self.forward_context.swap_memory
+        value_ctxt = util.GetOutputContext(value.op)
+        if value_ctxt == self.forward_context:
+          # value is not nested in the forward context.
+          self.forward_context.Enter()
           push = gen_data_flow_ops.stack_push_v2(
               enter_acc, value, swap_memory=swap_enabled)
-          value_ctxt.outer_context.Exit()
-          push.op._set_control_flow_context(value_ctxt)
+          self.forward_context.Exit()
+          # Protect stack push and order it before forward_index.
+          self.forward_index.op._add_control_input(push.op)
         else:
-          value_ctxt.Enter()
-          push = gen_data_flow_ops.stack_push_v2(
-              enter_acc, value, swap_memory=swap_enabled)
-          value_ctxt.Exit()
-        # Protect stack push and order it before forward_sync.
-        self.forward_sync._add_control_input(push.op)
-      # Order stack push after the successor of forward_index
-      add_op = self.forward_index.op.inputs[0].op
-      push.op._add_control_input(add_op)
-      return acc
+          # value is in a cond context within the forward context.
+          if not isinstance(value_ctxt, CondContext):
+            raise TypeError("value_ctxt is not a CondContext: %s" % value_ctxt)
+          if dead_branch:
+            # The special case for creating a zero tensor for a dead
+            # branch of a switch. See ControlFlowState.ZerosLike().
+            value_ctxt.outer_context.Enter()
+            push = gen_data_flow_ops.stack_push_v2(
+                enter_acc, value, swap_memory=swap_enabled)
+            value_ctxt.outer_context.Exit()
+            push.op._set_control_flow_context(value_ctxt)
+          else:
+            value_ctxt.Enter()
+            push = gen_data_flow_ops.stack_push_v2(
+                enter_acc, value, swap_memory=swap_enabled)
+            value_ctxt.Exit()
+          # Protect stack push and order it before forward_sync.
+          self.forward_sync._add_control_input(push.op)
+        # Order stack push after the successor of forward_index
+        add_op = self.forward_index.op.inputs[0].op
+        push.op._add_control_input(add_op)
+        return acc
 
   def AddBackpropAccumulatedValue(self, history_value, value,
                                   dead_branch=False):
@@ -1817,15 +1819,34 @@ class CondContext(ControlFlowContext):
   def _AddOpInternal(self, op):
     """Add `op` to the current context."""
     if not op.inputs:
-      # Remove any external control dependency on this op
+      # If we're in a while loop, remove any control inputs from outside the
+      # loop.
       self._RemoveExternalControlEdges(op)
-      # pylint: disable=protected-access
-      op._add_control_input(self._pivot.op)
-      # pylint: enable=protected-access
+
+      if not any(util.OpInContext(input_op, self)
+                 for input_op in op.control_inputs):
+        # pylint: disable=protected-access
+        op._add_control_input(self._pivot.op)
+        # pylint: enable=protected-access
     else:
+      # Make each input to 'op' available in this CondContext. If an input is
+      # already part of this context there's nothing to do, but if it's
+      # external, AddValue() will handle adding the appropriate Switch node and
+      # other bookkeeping.
       for index in range(len(op.inputs)):
         x = op.inputs[index]
-        real_x = self.AddValue(x)
+        if op.type == "Merge" and x.op.type == "NextIteration":
+          # Edge case: if we're importing a while loop inside this CondContext,
+          # AddValue() will not correctly handle the NextIteration inputs to
+          # Merge node. The problem is that the NextIteration should also be
+          # part of this context, but if we're importing it won't have been
+          # processed and added to the context yet, so AddValue() will try to
+          # add a Switch which results in an invalid graph. Instead, we use the
+          # NextIteration input as-is here, and it will eventually be added to
+          # the context via AddOp().
+          real_x = x
+        else:
+          real_x = self.AddValue(x)
         if real_x != x:
           # pylint: disable=protected-access
           op._update_input(index, real_x)
@@ -2196,6 +2217,7 @@ class WhileContext(ControlFlowContext):
     self._loop_exits = []
     # The list of enter tensors for loop variables.
     self._loop_enters = []
+    self._graph = ops.get_default_graph()
 
   def _init_from_proto(self, context_def, import_scope=None):
     """Creates a new `WhileContext` from protocol buffer.
@@ -2249,6 +2271,7 @@ class WhileContext(ControlFlowContext):
           op._set_attr("frame_name",
                        attr_value_pb2.AttrValue(s=compat.as_bytes(self.name)))
           # pylint: enable=protected-access
+    self._graph = ops.get_default_graph()
 
   @property
   def maximum_iterations(self):
@@ -2573,7 +2596,14 @@ class WhileContext(ControlFlowContext):
     Returns:
       The loop index.
     """
-    one = constant_op.constant(1, name="b_count")
+    in_separate_functions = count.graph is not ops.get_default_graph()
+    if in_separate_functions:
+      # Brings the count into this graph
+      count = array_ops.identity(count)
+    else:
+      # TODO(apassos) XLA expects this constant to be created outside the loop,
+      # so doing that for now.
+      one = constant_op.constant(1, name="b_count")
 
     self.Enter()
     self.AddName(count.name)
@@ -2588,6 +2618,8 @@ class WhileContext(ControlFlowContext):
     merge_count = merge([enter_count, enter_count])[0]
     self._pivot_for_pred = merge_count
 
+    if in_separate_functions:
+      one = constant_op.constant(1, name="b_count")
     pred = math_ops.greater_equal(merge_count, one)
     self._pivot = loop_cond(pred, name="b_count")
     switch_count = switch(merge_count, self._pivot)
@@ -2932,7 +2964,8 @@ class WhileContext(ControlFlowContext):
 
     return original_body_result, exit_vars
 
-  def BuildLoop(self, pred, body, loop_vars, shape_invariants):
+  def BuildLoop(self, pred, body, loop_vars, shape_invariants,
+                return_same_structure):
     """Add the loop termination condition and body to the graph."""
 
     # Keep original_loop_vars to identify which are TensorArrays
@@ -2960,7 +2993,11 @@ class WhileContext(ControlFlowContext):
     packed_exit_vars = nest.pack_sequence_as(
         structure=original_body_result,
         flat_sequence=exit_vars_with_tensor_arrays)
-    return packed_exit_vars[0] if len(exit_vars) == 1 else packed_exit_vars
+
+    if return_same_structure:
+      return packed_exit_vars
+    else:
+      return packed_exit_vars[0] if len(exit_vars) == 1 else packed_exit_vars
 
   def _FixControlInputsAndContext(self, enters):
     graph = ops.get_default_graph()
@@ -3000,7 +3037,8 @@ def while_loop(cond,
                back_prop=True,
                swap_memory=False,
                name=None,
-               maximum_iterations=None):
+               maximum_iterations=None,
+               return_same_structure=False):
   """Repeat `body` while the condition `cond` is true.
 
   `cond` is a callable returning a boolean scalar tensor. `body` is a callable
@@ -3076,11 +3114,16 @@ def while_loop(cond,
       to run.  If provided, the `cond` output is AND-ed with an additional
       condition ensuring the number of iterations executed is no greater than
       `maximum_iterations`.
+    return_same_structure: If True, output has same structure as `loop_vars`. If
+      eager execution is enabled, this is ignored (and always treated as True).
 
   Returns:
-    The output tensors for the loop variables after the loop. When the length
-    of `loop_vars` is 1 this is a Tensor, TensorArray or IndexedSlice and when
-    the length of `loop_vars` is greater than 1 it returns a list.
+    The output tensors for the loop variables after the loop.
+     If `return_same_structure` is True, the return value has the same
+     structure as `loop_vars`.
+     If `return_same_structure` is False, the return value is a Tensor,
+     TensorArray or IndexedSlice if the length of `loop_vars` is 1, or a list
+     otherwise.
 
   Raises:
     TypeError: if `cond` or `body` is not callable.
@@ -3135,7 +3178,7 @@ def while_loop(cond,
   happen is that the thread updating `x` can never get ahead of the
   counter thread because the thread incrementing `x` depends on the value
   of the counter.
-  
+
   ```python
   import tensorflow as tf
 
@@ -3217,7 +3260,8 @@ def while_loop(cond,
     # be encapsulated in the root context.
     if loop_context.outer_context is None:
       ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, loop_context)
-    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
+    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants,
+                                    return_same_structure)
     if maximum_iterations is not None:
       return result[1]
     else:
diff --git a/tensorflow/python/ops/control_flow_ops_test.py b/tensorflow/python/ops/control_flow_ops_test.py
index 43fe045bcb..153548ae92 100644
--- a/tensorflow/python/ops/control_flow_ops_test.py
+++ b/tensorflow/python/ops/control_flow_ops_test.py
@@ -958,6 +958,28 @@ class WhileLoopTestCase(test_util.TensorFlowTestCase):
     # Expect a tuple since that is what the body returns.
     self.assertEqual(self.evaluate(r), (10,))
 
+  def testWhileLoopSameReturnShape_False(self):
+    i = constant_op.constant(0)
+    c = lambda i, _: math_ops.less(i, 10)
+
+    # Body returns a [tensor, []]
+    b = lambda i, _: [math_ops.add(i, 1), []]
+
+    # Should only return the tensor.
+    r = control_flow_ops.while_loop(c, b, [i, []])
+    self.assertEqual(self.evaluate(r), 10)
+
+  def testWhileLoopSameReturnShape_True(self):
+    i = constant_op.constant(0)
+    c = lambda i, _: math_ops.less(i, 10)
+
+    # Body returns a [tensor, []]
+    b = lambda i, _: [math_ops.add(i, 1), []]
+
+    # Should only return the original structure.
+    r = control_flow_ops.while_loop(c, b, [i, []], return_same_structure=True)
+    self.assertEqual(self.evaluate(r), [10, []])
+
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/python/ops/control_flow_util.py b/tensorflow/python/ops/control_flow_util.py
index 7a18986c5b..72c074ed1a 100644
--- a/tensorflow/python/ops/control_flow_util.py
+++ b/tensorflow/python/ops/control_flow_util.py
@@ -214,6 +214,14 @@ def IsContainingContext(ctxt, maybe_containing_ctxt):
   return True
 
 
+def OpInContext(op, ctxt):
+  return IsContainingContext(op._get_control_flow_context(), ctxt)  # pylint: disable=protected-access
+
+
+def TensorInContext(tensor, ctxt):
+  return OpInContext(tensor.op, ctxt)
+
+
 def CheckInputFromValidContext(op, input_op):
   """Returns whether `input_op` can be used from `op`s context.
 
diff --git a/tensorflow/python/ops/conv2d_benchmark.py b/tensorflow/python/ops/conv2d_benchmark.py
index aacdaa7ad0..28111c2730 100644
--- a/tensorflow/python/ops/conv2d_benchmark.py
+++ b/tensorflow/python/ops/conv2d_benchmark.py
@@ -175,7 +175,8 @@ class Conv2DBenchmark(test.Benchmark):
 
     data_types = [dtypes.float32, dtypes.float16]
     data_formats = ["NHWC", "NCHW"]
-    in_channels = list(range(3, 16))
+    in_channels = list(range(1, 10)) + list(range(10, 20, 2)) + list(
+        range(20, 33, 4))
     out_channels = [4, 16, 32]
     hw_strides = [[2, 2]]
     paddings = ["VALID", "SAME"]
diff --git a/tensorflow/python/ops/custom_gradient.py b/tensorflow/python/ops/custom_gradient.py
index ca24f11054..9f77a6cca1 100644
--- a/tensorflow/python/ops/custom_gradient.py
+++ b/tensorflow/python/ops/custom_gradient.py
@@ -142,9 +142,9 @@ def _graph_mode_decorator(f, *args, **kwargs):
   # The variables that grad_fn needs to return gradients for are the set of
   # variables used that are *not* part of the inputs.
   variables = list(set(tape.watched_variables()) - set(args))
-  grad_argspec = tf_inspect.getargspec(grad_fn)
+  grad_argspec = tf_inspect.getfullargspec(grad_fn)
   variables_in_signature = ("variables" in grad_argspec.args or
-                            grad_argspec.keywords)
+                            grad_argspec.varkw)
   if variables and not variables_in_signature:
     raise TypeError("If using @custom_gradient with a function that "
                     "uses variables, then grad_fn must accept a keyword "
@@ -194,9 +194,9 @@ def _eager_mode_decorator(f, *args, **kwargs):
   # The variables that grad_fn needs to return gradients for are the set of
   # variables used that are *not* part of the inputs.
   variables = [v for v in set(tape.watched_variables()) if v not in all_inputs]
-  grad_argspec = tf_inspect.getargspec(grad_fn)
-  if (variables and
-      not ("variables" in grad_argspec.args or grad_argspec.keywords)):
+  grad_argspec = tf_inspect.getfullargspec(grad_fn)
+  if (variables and ("variables" not in grad_argspec.args) and
+      not grad_argspec.varkw):
     raise TypeError("If using @custom_gradient with a function that "
                     "uses variables, then grad_fn must accept a keyword "
                     "argument 'variables'.")
diff --git a/tensorflow/python/ops/functional_ops.py b/tensorflow/python/ops/functional_ops.py
index 30413f289a..4ecc74675a 100644
--- a/tensorflow/python/ops/functional_ops.py
+++ b/tensorflow/python/ops/functional_ops.py
@@ -775,7 +775,7 @@ def While(input_, cond, body, name=None, hostmem=None):
       a string, non-empty means True and empty means False. If the
       tensor is not a scalar, non-emptiness means True and False
       otherwise.
-    body: . A funcion takes a list of tensors and returns another
+    body: . A function takes a list of tensors and returns another
       list tensors. Both lists have the same types as specified
       by T.
     name: A name for the operation (optional).
@@ -945,6 +945,61 @@ def For(start,
 # pylint: enable=invalid-name,protected-access
 
 
-def partitioned_call(args, f):
-  return gen_functional_ops.partitioned_call(
-      args=args, Tout=[o.type for o in f.definition.signature.output_arg], f=f)
+def partitioned_call(args, f, tout=None, executing_eagerly=None):
+  """Executes a function while respecting device annotations.
+
+  Currently, only those functions that execute within the same address space
+  can be executed.
+
+  Args:
+    args: The arguments of the function, including captured inputs.
+    f: The function to execute; an instance of `_DefinedFunction` or
+      `_EagerDefinedFunction`.
+    tout: a list containing the output dtypes enums; if `None`, inferred from
+      the signature of `f`.
+    executing_eagerly: (Optional) A boolean indicating whether the context is
+      executing eagerly. If `None`, fetched from the global context.
+
+  Returns:
+    The list of `Tensor`s returned by invoking `f(args)`. If the function does
+    not return anything, then returns `None` if eager execution is enabled, or
+    the `Operation` if not.
+  """
+
+  if tout is None:
+    tout = tuple(x.type for x in f.definition.signature.output_arg)
+
+  if executing_eagerly is None:
+    executing_eagerly = context.executing_eagerly()
+
+  if executing_eagerly or len(tout):
+    if f.stateful_ops:
+      outputs = gen_functional_ops.stateful_partitioned_call(
+          args=args, Tout=tout, f=f)
+    else:
+      outputs = gen_functional_ops.partitioned_call(args=args, Tout=tout, f=f)
+    return outputs if outputs else None
+
+  # The generated binding returns an empty list for functions that don't
+  # return any Tensors, hence the need to use `create_op` directly.
+  args = [ops.internal_convert_to_tensor(x) for x in args]
+  tin_attr = attr_value_pb2.AttrValue(
+      list=attr_value_pb2.AttrValue.ListValue(
+          type=[x.dtype.as_datatype_enum for x in args]))
+  tout_attr = attr_value_pb2.AttrValue(
+      list=attr_value_pb2.AttrValue.ListValue(type=tout))
+  func_attr = attr_value_pb2.AttrValue(
+      func=attr_value_pb2.NameAttrList(name=f.name))
+
+  graph = ops.get_default_graph()
+  f.add_to_graph(graph)
+  op_name = "StatefulPartitionedCall" if f.stateful_ops else "PartitionedCall"
+  op = graph.create_op(
+      op_name,
+      args,
+      tout,
+      compute_shapes=False,
+      name="PartitionedFunctionCall",
+      attrs={"Tin": tin_attr, "Tout": tout_attr, "f": func_attr})
+  outputs = op.outputs
+  return outputs if outputs else op
diff --git a/tensorflow/python/ops/gradients_impl.py b/tensorflow/python/ops/gradients_impl.py
index 713a8ab2cc..b64a66be03 100644
--- a/tensorflow/python/ops/gradients_impl.py
+++ b/tensorflow/python/ops/gradients_impl.py
@@ -55,6 +55,7 @@ from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import spectral_grad  # pylint: disable=unused-import
 from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util import compat
 from tensorflow.python.util.tf_export import tf_export
 
 # This is to avoid a circular dependency with cond_v2_impl.
@@ -356,12 +357,19 @@ def _maybe_colocate_with(op, gradient_uid, colocate_gradients_with_ops):  # pyli
     yield
 
 
+def _IsPartitionedCall(op):
+  return op.type == "PartitionedCall" or op.type == "StatefulPartitionedCall"
+
+
 def _SymGrad(op, out_grads, xs):
   """Backprop through a function call node op given its outputs' gradients."""
   f_in = [x for x in _Inputs(op, xs)] + out_grads
   f_types = [x.dtype for x in _Inputs(op, xs)]
   f = attr_value_pb2.NameAttrList()
-  f.name = op.type
+  if _IsPartitionedCall(op):
+    f.name = op.get_attr("f").name
+  else:
+    f.name = op.type
   for k in op.node_def.attr:
     f.attr[k].CopyFrom(op.node_def.attr[k])
   # TODO(apassos) use a better dtype here
@@ -700,13 +708,19 @@ def _GradientsHelper(ys,
 
         grad_fn = None
         func_call = None
+        is_partitioned_call = _IsPartitionedCall(op)
         # pylint: disable=protected-access
-        is_func_call = src_graph._is_function(op.type)
+        is_func_call = (
+            src_graph._is_function(op.type) or is_partitioned_call)
         # pylint: enable=protected-access
         has_out_grads = any(isinstance(g, ops.Tensor) or g for g in out_grads)
         if has_out_grads and (op not in stop_ops):
           if is_func_call:
-            func_call = src_graph._get_function(op.type)  # pylint: disable=protected-access
+            if is_partitioned_call:
+              func_call = src_graph._get_function(  # pylint: disable=protected-access
+                  compat.as_bytes(op.get_attr("f").name))
+            else:
+              func_call = src_graph._get_function(op.type)  # pylint: disable=protected-access
             # Note that __defun is not set if the graph is
             # imported. If it's set, we prefer to access the original
             # defun.
diff --git a/tensorflow/python/ops/histogram_ops_test.py b/tensorflow/python/ops/histogram_ops_test.py
index a226ac81bb..2e57ae8a2d 100644
--- a/tensorflow/python/ops/histogram_ops_test.py
+++ b/tensorflow/python/ops/histogram_ops_test.py
@@ -84,6 +84,23 @@ class HistogramFixedWidthTest(test.TestCase):
   def setUp(self):
     self.rng = np.random.RandomState(0)
 
+  def test_with_invalid_value_range(self):
+    values = [-1.0, 0.0, 1.5, 2.0, 5.0, 15]
+    with self.assertRaisesRegexp(
+        ValueError, "Shape must be rank 1 but is rank 0"):
+      histogram_ops.histogram_fixed_width(values, 1.0)
+    with self.assertRaisesRegexp(ValueError, "Dimension must be 2 but is 3"):
+      histogram_ops.histogram_fixed_width(values, [1.0, 2.0, 3.0])
+
+  def test_with_invalid_nbins(self):
+    values = [-1.0, 0.0, 1.5, 2.0, 5.0, 15]
+    with self.assertRaisesRegexp(
+        ValueError, "Shape must be rank 0 but is rank 1"):
+      histogram_ops.histogram_fixed_width(values, [1.0, 5.0], nbins=[1, 2])
+    with self.assertRaisesRegexp(
+        ValueError, "Requires nbins > 0"):
+      histogram_ops.histogram_fixed_width(values, [1.0, 5.0], nbins=-5)
+
   def test_empty_input_gives_all_zero_counts(self):
     # Bins will be:
     #   (-inf, 1), [1, 2), [2, 3), [3, 4), [4, inf)
diff --git a/tensorflow/python/ops/image_ops_impl.py b/tensorflow/python/ops/image_ops_impl.py
index a2eae452ae..855a4d0c33 100644
--- a/tensorflow/python/ops/image_ops_impl.py
+++ b/tensorflow/python/ops/image_ops_impl.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.compat import compat
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -55,6 +56,7 @@ ops.NotDifferentiable('SampleDistortedBoundingBoxV2')
 ops.NotDifferentiable('ExtractGlimpse')
 ops.NotDifferentiable('NonMaxSuppression')
 ops.NotDifferentiable('NonMaxSuppressionV2')
+ops.NotDifferentiable('NonMaxSuppressionWithOverlaps')
 
 
 # pylint: disable=invalid-name
@@ -1752,6 +1754,22 @@ def is_jpeg(contents, name=None):
     return math_ops.equal(substr, b'\xff\xd8\xff', name=name)
 
 
+def _is_png(contents, name=None):
+  r"""Convenience function to check if the 'contents' encodes a PNG image.
+
+  Args:
+    contents: 0-D `string`. The encoded image bytes.
+    name: A name for the operation (optional)
+
+  Returns:
+     A scalar boolean tensor indicating if 'contents' may be a PNG image.
+     is_png is susceptible to false positives.
+  """
+  with ops.name_scope(name, 'is_png'):
+    substr = string_ops.substr(contents, 0, 3)
+    return math_ops.equal(substr, b'\211PN', name=name)
+
+
 @tf_export('image.decode_image')
 def decode_image(contents, channels=None, dtype=dtypes.uint8, name=None):
   """Convenience function for `decode_bmp`, `decode_gif`, `decode_jpeg`,
@@ -1829,8 +1847,8 @@ def decode_image(contents, channels=None, dtype=dtypes.uint8, name=None):
 
     def check_png():
       """Checks if an image is PNG."""
-      is_png = math_ops.equal(substr, b'\211PN', name='is_png')
-      return control_flow_ops.cond(is_png, _png, check_gif, name='cond_png')
+      return control_flow_ops.cond(
+          _is_png(contents), _png, check_gif, name='cond_png')
 
     def _jpeg():
       """Decodes a jpeg image."""
@@ -2093,6 +2111,108 @@ def non_max_suppression(boxes,
                                                 iou_threshold, score_threshold)
 
 
+@tf_export('image.non_max_suppression_padded')
+def non_max_suppression_padded(boxes,
+                               scores,
+                               max_output_size,
+                               iou_threshold=0.5,
+                               score_threshold=float('-inf'),
+                               pad_to_max_output_size=False,
+                               name=None):
+  """Greedily selects a subset of bounding boxes in descending order of score.
+
+  Performs algorithmically equivalent operation to tf.image.non_max_suppression,
+  with the addition of an optional parameter which zero-pads the output to
+  be of size `max_output_size`.
+  The output of this operation is a tuple containing the set of integers
+  indexing into the input collection of bounding boxes representing the selected
+  boxes and the number of valid indices in the index set.  The bounding box
+  coordinates corresponding to the selected indices can then be obtained using
+  the `tf.slice` and `tf.gather` operations.  For example:
+    selected_indices_padded, num_valid = tf.image.non_max_suppression_padded(
+        boxes, scores, max_output_size, iou_threshold,
+        score_threshold, pad_to_max_output_size=True)
+    selected_indices = tf.slice(
+        selected_indices_padded, tf.constant([0]), num_valid)
+    selected_boxes = tf.gather(boxes, selected_indices)
+
+  Args:
+    boxes: A 2-D float `Tensor` of shape `[num_boxes, 4]`.
+    scores: A 1-D float `Tensor` of shape `[num_boxes]` representing a single
+      score corresponding to each box (each row of boxes).
+    max_output_size: A scalar integer `Tensor` representing the maximum number
+      of boxes to be selected by non max suppression.
+    iou_threshold: A float representing the threshold for deciding whether boxes
+      overlap too much with respect to IOU.
+    score_threshold: A float representing the threshold for deciding when to
+      remove boxes based on score.
+    pad_to_max_output_size: bool.  If True, size of `selected_indices` output
+      is padded to `max_output_size`.
+    name: A name for the operation (optional).
+
+  Returns:
+    selected_indices: A 1-D integer `Tensor` of shape `[M]` representing the
+      selected indices from the boxes tensor, where `M <= max_output_size`.
+    valid_outputs: A scalar integer `Tensor` denoting how many elements in
+    `selected_indices` are valid.  Valid elements occur first, then padding.
+  """
+  with ops.name_scope(name, 'non_max_suppression_padded'):
+    iou_threshold = ops.convert_to_tensor(iou_threshold, name='iou_threshold')
+    score_threshold = ops.convert_to_tensor(
+        score_threshold, name='score_threshold')
+    if compat.forward_compatible(2018, 8, 7) or pad_to_max_output_size:
+      return gen_image_ops.non_max_suppression_v4(
+          boxes, scores, max_output_size, iou_threshold, score_threshold,
+          pad_to_max_output_size)
+    else:
+      return gen_image_ops.non_max_suppression_v3(
+          boxes, scores, max_output_size, iou_threshold, score_threshold)
+
+
+@tf_export('image.non_max_suppression_overlaps')
+def non_max_suppression_with_overlaps(overlaps,
+                                      scores,
+                                      max_output_size,
+                                      overlap_threshold=0.5,
+                                      score_threshold=float('-inf'),
+                                      name=None):
+  """Greedily selects a subset of bounding boxes in descending order of score.
+
+  Prunes away boxes that have high overlap with previously selected boxes.
+  N-by-n overlap values are supplied as square matrix.
+  The output of this operation is a set of integers indexing into the input
+  collection of bounding boxes representing the selected boxes.  The bounding
+  box coordinates corresponding to the selected indices can then be obtained
+  using the `tf.gather operation`.  For example:
+    selected_indices = tf.image.non_max_suppression_overlaps(
+        overlaps, scores, max_output_size, iou_threshold)
+    selected_boxes = tf.gather(boxes, selected_indices)
+
+  Args:
+    overlaps: A 2-D float `Tensor` of shape `[num_boxes, num_boxes]`.
+    scores: A 1-D float `Tensor` of shape `[num_boxes]` representing a single
+      score corresponding to each box (each row of boxes).
+    max_output_size: A scalar integer `Tensor` representing the maximum number
+      of boxes to be selected by non max suppression.
+    overlap_threshold: A float representing the threshold for deciding whether
+      boxes overlap too much with respect to the provided overlap values.
+    score_threshold: A float representing the threshold for deciding when to
+      remove boxes based on score.
+    name: A name for the operation (optional).
+
+  Returns:
+    selected_indices: A 1-D integer `Tensor` of shape `[M]` representing the
+      selected indices from the overlaps tensor, where `M <= max_output_size`.
+  """
+  with ops.name_scope(name, 'non_max_suppression_overlaps'):
+    overlap_threshold = ops.convert_to_tensor(
+        overlap_threshold, name='overlap_threshold')
+    # pylint: disable=protected-access
+    return gen_image_ops._non_max_suppression_v3(
+        overlaps, scores, max_output_size, overlap_threshold, score_threshold)
+    # pylint: enable=protected-access
+
+
 _rgb_to_yiq_kernel = [[0.299, 0.59590059,
                        0.2115], [0.587, -0.27455667, -0.52273617],
                       [0.114, -0.32134392, 0.31119955]]
diff --git a/tensorflow/python/ops/init_ops.py b/tensorflow/python/ops/init_ops.py
index 5bfc5ce2a7..c315722b6b 100644
--- a/tensorflow/python/ops/init_ops.py
+++ b/tensorflow/python/ops/init_ops.py
@@ -1136,7 +1136,8 @@ convolutional_orthogonal_3d = ConvolutionOrthogonal3D
 # pylint: enable=invalid-name
 
 
-@tf_export("glorot_uniform_initializer")
+@tf_export("glorot_uniform_initializer", "keras.initializers.glorot_uniform",
+           "initializers.glorot_uniform")
 def glorot_uniform_initializer(seed=None, dtype=dtypes.float32):
   """The Glorot uniform initializer, also called Xavier uniform initializer.
 
@@ -1160,7 +1161,8 @@ def glorot_uniform_initializer(seed=None, dtype=dtypes.float32):
       scale=1.0, mode="fan_avg", distribution="uniform", seed=seed, dtype=dtype)
 
 
-@tf_export("glorot_normal_initializer")
+@tf_export("glorot_normal_initializer", "keras.initializers.glorot_normal",
+           "initializers.glorot_normal")
 def glorot_normal_initializer(seed=None, dtype=dtypes.float32):
   """The Glorot normal initializer, also called Xavier normal initializer.
 
@@ -1181,7 +1183,98 @@ def glorot_normal_initializer(seed=None, dtype=dtypes.float32):
     An initializer.
   """
   return variance_scaling_initializer(
-      scale=1.0, mode="fan_avg", distribution="normal", seed=seed, dtype=dtype)
+      scale=1.0,
+      mode="fan_avg",
+      distribution="truncated_normal",
+      seed=seed,
+      dtype=dtype)
+
+
+@tf_export("keras.initializers.lecun_normal", "initializers.lecun_normal")
+def lecun_normal(seed=None):
+  """LeCun normal initializer.
+
+  It draws samples from a truncated normal distribution centered on 0
+  with `stddev = sqrt(1 / fan_in)`
+  where `fan_in` is the number of input units in the weight tensor.
+
+  Arguments:
+      seed: A Python integer. Used to seed the random generator.
+
+  Returns:
+      An initializer.
+
+  References:
+      - [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
+      - [Efficient
+      Backprop](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)
+  """
+  return VarianceScaling(
+      scale=1., mode="fan_in", distribution="truncated_normal", seed=seed)
+
+
+@tf_export("keras.initializers.lecun_uniform", "initializers.lecun_uniform")
+def lecun_uniform(seed=None):
+  """LeCun uniform initializer.
+
+  It draws samples from a uniform distribution within [-limit, limit]
+  where `limit` is `sqrt(3 / fan_in)`
+  where `fan_in` is the number of input units in the weight tensor.
+
+  Arguments:
+      seed: A Python integer. Used to seed the random generator.
+
+  Returns:
+      An initializer.
+
+  References:
+      LeCun 98, Efficient Backprop,
+      http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
+  """
+  return VarianceScaling(
+      scale=1., mode="fan_in", distribution="uniform", seed=seed)
+
+
+@tf_export("keras.initializers.he_normal", "initializers.he_normal")
+def he_normal(seed=None):
+  """He normal initializer.
+
+  It draws samples from a truncated normal distribution centered on 0
+  with `stddev = sqrt(2 / fan_in)`
+  where `fan_in` is the number of input units in the weight tensor.
+
+  Arguments:
+      seed: A Python integer. Used to seed the random generator.
+
+  Returns:
+      An initializer.
+
+  References:
+      He et al., http://arxiv.org/abs/1502.01852
+  """
+  return VarianceScaling(
+      scale=2., mode="fan_in", distribution="truncated_normal", seed=seed)
+
+
+@tf_export("keras.initializers.he_uniform", "initializers.he_uniform")
+def he_uniform(seed=None):
+  """He uniform variance scaling initializer.
+
+  It draws samples from a uniform distribution within [-limit, limit]
+  where `limit` is `sqrt(6 / fan_in)`
+  where `fan_in` is the number of input units in the weight tensor.
+
+  Arguments:
+      seed: A Python integer. Used to seed the random generator.
+
+  Returns:
+      An initializer.
+
+  References:
+      He et al., http://arxiv.org/abs/1502.01852
+  """
+  return VarianceScaling(
+      scale=2., mode="fan_in", distribution="uniform", seed=seed)
 
 
 # Utility functions.
diff --git a/tensorflow/python/ops/init_ops_test.py b/tensorflow/python/ops/init_ops_test.py
new file mode 100644
index 0000000000..f6fffa9079
--- /dev/null
+++ b/tensorflow/python/ops/init_ops_test.py
@@ -0,0 +1,196 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for initializers in init_ops."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.eager import context
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import init_ops
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.platform import test
+
+
+class InitializersTest(test.TestCase):
+
+  def _runner(self,
+              init,
+              shape,
+              target_mean=None,
+              target_std=None,
+              target_max=None,
+              target_min=None):
+    variable = resource_variable_ops.ResourceVariable(init(shape))
+    if context.executing_eagerly():
+      output = variable.numpy()
+    else:
+      sess = ops.get_default_session()
+      sess.run(variable.initializer)
+      output = sess.run(variable)
+    lim = 3e-2
+    if target_std is not None:
+      self.assertGreater(lim, abs(output.std() - target_std))
+    if target_mean is not None:
+      self.assertGreater(lim, abs(output.mean() - target_mean))
+    if target_max is not None:
+      self.assertGreater(lim, abs(output.max() - target_max))
+    if target_min is not None:
+      self.assertGreater(lim, abs(output.min() - target_min))
+
+  def test_uniform(self):
+    tensor_shape = (9, 6, 7)
+    with self.test_session():
+      self._runner(
+          init_ops.RandomUniform(minval=-1, maxval=1, seed=124),
+          tensor_shape,
+          target_mean=0.,
+          target_max=1,
+          target_min=-1)
+
+  def test_normal(self):
+    tensor_shape = (8, 12, 99)
+    with self.test_session():
+      self._runner(
+          init_ops.RandomNormal(mean=0, stddev=1, seed=153),
+          tensor_shape,
+          target_mean=0.,
+          target_std=1)
+
+  def test_truncated_normal(self):
+    tensor_shape = (12, 99, 7)
+    with self.test_session():
+      self._runner(
+          init_ops.TruncatedNormal(mean=0, stddev=1, seed=126),
+          tensor_shape,
+          target_mean=0.,
+          target_max=2,
+          target_min=-2)
+
+  def test_constant(self):
+    tensor_shape = (5, 6, 4)
+    with self.test_session():
+      self._runner(
+          init_ops.Constant(2),
+          tensor_shape,
+          target_mean=2,
+          target_max=2,
+          target_min=2)
+
+  def test_lecun_uniform(self):
+    tensor_shape = (5, 6, 4, 2)
+    with self.test_session():
+      fan_in, _ = init_ops._compute_fans(tensor_shape)
+      std = np.sqrt(1. / fan_in)
+      self._runner(
+          init_ops.lecun_uniform(seed=123),
+          tensor_shape,
+          target_mean=0.,
+          target_std=std)
+
+  def test_glorot_uniform_initializer(self):
+    tensor_shape = (5, 6, 4, 2)
+    with self.test_session():
+      fan_in, fan_out = init_ops._compute_fans(tensor_shape)
+      std = np.sqrt(2. / (fan_in + fan_out))
+      self._runner(
+          init_ops.glorot_uniform_initializer(seed=123),
+          tensor_shape,
+          target_mean=0.,
+          target_std=std)
+
+  def test_he_uniform(self):
+    tensor_shape = (5, 6, 4, 2)
+    with self.test_session():
+      fan_in, _ = init_ops._compute_fans(tensor_shape)
+      std = np.sqrt(2. / fan_in)
+      self._runner(
+          init_ops.he_uniform(seed=123),
+          tensor_shape,
+          target_mean=0.,
+          target_std=std)
+
+  def test_lecun_normal(self):
+    tensor_shape = (5, 6, 4, 2)
+    with self.test_session():
+      fan_in, _ = init_ops._compute_fans(tensor_shape)
+      std = np.sqrt(1. / fan_in)
+      self._runner(
+          init_ops.lecun_normal(seed=123),
+          tensor_shape,
+          target_mean=0.,
+          target_std=std)
+
+  def test_glorot_normal_initializer(self):
+    tensor_shape = (5, 6, 4, 2)
+    with self.test_session():
+      fan_in, fan_out = init_ops._compute_fans(tensor_shape)
+      std = np.sqrt(2. / (fan_in + fan_out))
+      self._runner(
+          init_ops.glorot_normal_initializer(seed=123),
+          tensor_shape,
+          target_mean=0.,
+          target_std=std)
+
+  def test_he_normal(self):
+    tensor_shape = (5, 6, 4, 2)
+    with self.test_session():
+      fan_in, _ = init_ops._compute_fans(tensor_shape)
+      std = np.sqrt(2. / fan_in)
+      self._runner(
+          init_ops.he_normal(seed=123),
+          tensor_shape,
+          target_mean=0.,
+          target_std=std)
+
+  def test_Orthogonal(self):
+    tensor_shape = (20, 20)
+    with self.test_session():
+      self._runner(init_ops.Orthogonal(seed=123), tensor_shape, target_mean=0.)
+
+  def test_Identity(self):
+    with self.test_session():
+      tensor_shape = (3, 4, 5)
+      with self.assertRaises(ValueError):
+        self._runner(
+            init_ops.Identity(),
+            tensor_shape,
+            target_mean=1. / tensor_shape[0],
+            target_max=1.)
+
+      tensor_shape = (3, 3)
+      self._runner(
+          init_ops.Identity(),
+          tensor_shape,
+          target_mean=1. / tensor_shape[0],
+          target_max=1.)
+
+  def test_Zeros(self):
+    tensor_shape = (4, 5)
+    with self.test_session():
+      self._runner(
+          init_ops.Zeros(), tensor_shape, target_mean=0., target_max=0.)
+
+  def test_Ones(self):
+    tensor_shape = (4, 5)
+    with self.test_session():
+      self._runner(init_ops.Ones(), tensor_shape, target_mean=1., target_max=1.)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/ops/linalg/BUILD b/tensorflow/python/ops/linalg/BUILD
index 07659ef44c..c7314d7774 100644
--- a/tensorflow/python/ops/linalg/BUILD
+++ b/tensorflow/python/ops/linalg/BUILD
@@ -29,6 +29,7 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         "//tensorflow/python:array_ops",
+        "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:linalg_ops",
         "//tensorflow/python:math_ops",
         "//tensorflow/python:special_math_ops",
diff --git a/tensorflow/python/ops/linalg/linalg.py b/tensorflow/python/ops/linalg/linalg.py
index a7ba0bbe9c..c29b5033bb 100644
--- a/tensorflow/python/ops/linalg/linalg.py
+++ b/tensorflow/python/ops/linalg/linalg.py
@@ -31,6 +31,7 @@ from tensorflow.python.ops.linalg.linear_operator_identity import *
 from tensorflow.python.ops.linalg.linear_operator_kronecker import *
 from tensorflow.python.ops.linalg.linear_operator_low_rank_update import *
 from tensorflow.python.ops.linalg.linear_operator_lower_triangular import *
+from tensorflow.python.ops.linalg.linear_operator_zeros import *
 # pylint: enable=wildcard-import
 
 # Seal API.
diff --git a/tensorflow/python/ops/linalg/linalg_impl.py b/tensorflow/python/ops/linalg/linalg_impl.py
index 8343c62816..1e3d817980 100644
--- a/tensorflow/python/ops/linalg/linalg_impl.py
+++ b/tensorflow/python/ops/linalg/linalg_impl.py
@@ -18,8 +18,11 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_linalg_ops
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
@@ -38,8 +41,6 @@ diag_part = array_ops.matrix_diag_part
 eigh = linalg_ops.self_adjoint_eig
 eigvalsh = linalg_ops.self_adjoint_eigvals
 einsum = special_math_ops.einsum
-expm = gen_linalg_ops.matrix_exponential
-tf_export('linalg.expm')(expm)
 eye = linalg_ops.eye
 inv = linalg_ops.matrix_inverse
 logm = gen_linalg_ops.matrix_logarithm
@@ -114,3 +115,214 @@ def adjoint(matrix, name=None):
   with ops.name_scope(name, 'adjoint', [matrix]):
     matrix = ops.convert_to_tensor(matrix, name='matrix')
     return array_ops.matrix_transpose(matrix, conjugate=True)
+
+
+# This section is ported nearly verbatim from Eigen's implementation:
+# https://eigen.tuxfamily.org/dox/unsupported/MatrixExponential_8h_source.html
+def _matrix_exp_pade3(matrix):
+  """3rd-order Pade approximant for matrix exponential."""
+  b = [120.0, 60.0, 12.0]
+  b = [constant_op.constant(x, matrix.dtype) for x in b]
+  ident = linalg_ops.eye(array_ops.shape(matrix)[-2],
+                         batch_shape=array_ops.shape(matrix)[:-2],
+                         dtype=matrix.dtype)
+  matrix_2 = math_ops.matmul(matrix, matrix)
+  tmp = matrix_2 + b[1] * ident
+  matrix_u = math_ops.matmul(matrix, tmp)
+  matrix_v = b[2] * matrix_2 + b[0] * ident
+  return matrix_u, matrix_v
+
+
+def _matrix_exp_pade5(matrix):
+  """5th-order Pade approximant for matrix exponential."""
+  b = [30240.0, 15120.0, 3360.0, 420.0, 30.0]
+  b = [constant_op.constant(x, matrix.dtype) for x in b]
+  ident = linalg_ops.eye(array_ops.shape(matrix)[-2],
+                         batch_shape=array_ops.shape(matrix)[:-2],
+                         dtype=matrix.dtype)
+  matrix_2 = math_ops.matmul(matrix, matrix)
+  matrix_4 = math_ops.matmul(matrix_2, matrix_2)
+  tmp = matrix_4 + b[3] * matrix_2 + b[1] * ident
+  matrix_u = math_ops.matmul(matrix, tmp)
+  matrix_v = b[4] * matrix_4 + b[2] * matrix_2 + b[0] * ident
+  return matrix_u, matrix_v
+
+
+def _matrix_exp_pade7(matrix):
+  """7th-order Pade approximant for matrix exponential."""
+  b = [17297280.0, 8648640.0, 1995840.0, 277200.0, 25200.0, 1512.0, 56.0]
+  b = [constant_op.constant(x, matrix.dtype) for x in b]
+  ident = linalg_ops.eye(array_ops.shape(matrix)[-2],
+                         batch_shape=array_ops.shape(matrix)[:-2],
+                         dtype=matrix.dtype)
+  matrix_2 = math_ops.matmul(matrix, matrix)
+  matrix_4 = math_ops.matmul(matrix_2, matrix_2)
+  matrix_6 = math_ops.matmul(matrix_4, matrix_2)
+  tmp = matrix_6 + b[5] * matrix_4 + b[3] * matrix_2 + b[1] * ident
+  matrix_u = math_ops.matmul(matrix, tmp)
+  matrix_v = b[6] * matrix_6 + b[4] * matrix_4 + b[2] * matrix_2 + b[0] * ident
+  return matrix_u, matrix_v
+
+
+def _matrix_exp_pade9(matrix):
+  """9th-order Pade approximant for matrix exponential."""
+  b = [
+      17643225600.0, 8821612800.0, 2075673600.0, 302702400.0, 30270240.0,
+      2162160.0, 110880.0, 3960.0, 90.0
+  ]
+  b = [constant_op.constant(x, matrix.dtype) for x in b]
+  ident = linalg_ops.eye(array_ops.shape(matrix)[-2],
+                         batch_shape=array_ops.shape(matrix)[:-2],
+                         dtype=matrix.dtype)
+  matrix_2 = math_ops.matmul(matrix, matrix)
+  matrix_4 = math_ops.matmul(matrix_2, matrix_2)
+  matrix_6 = math_ops.matmul(matrix_4, matrix_2)
+  matrix_8 = math_ops.matmul(matrix_6, matrix_2)
+  tmp = (
+      matrix_8 + b[7] * matrix_6 + b[5] * matrix_4 + b[3] * matrix_2 +
+      b[1] * ident)
+  matrix_u = math_ops.matmul(matrix, tmp)
+  matrix_v = (
+      b[8] * matrix_8 + b[6] * matrix_6 + b[4] * matrix_4 + b[2] * matrix_2 +
+      b[0] * ident)
+  return matrix_u, matrix_v
+
+
+def _matrix_exp_pade13(matrix):
+  """13th-order Pade approximant for matrix exponential."""
+  b = [
+      64764752532480000.0, 32382376266240000.0, 7771770303897600.0,
+      1187353796428800.0, 129060195264000.0, 10559470521600.0, 670442572800.0,
+      33522128640.0, 1323241920.0, 40840800.0, 960960.0, 16380.0, 182.0
+  ]
+  b = [constant_op.constant(x, matrix.dtype) for x in b]
+  ident = linalg_ops.eye(array_ops.shape(matrix)[-2],
+                         batch_shape=array_ops.shape(matrix)[:-2],
+                         dtype=matrix.dtype)
+  matrix_2 = math_ops.matmul(matrix, matrix)
+  matrix_4 = math_ops.matmul(matrix_2, matrix_2)
+  matrix_6 = math_ops.matmul(matrix_4, matrix_2)
+  tmp_u = (
+      math_ops.matmul(matrix_6,
+                      matrix_6 + b[11] * matrix_4 + b[9] * matrix_2) +
+      b[7] * matrix_6 + b[5] * matrix_4 + b[3] * matrix_2 + b[1] * ident)
+  matrix_u = math_ops.matmul(matrix, tmp_u)
+  tmp_v = b[12] * matrix_6 + b[10] * matrix_4 + b[8] * matrix_2
+  matrix_v = (
+      math_ops.matmul(matrix_6, tmp_v) + b[6] * matrix_6 + b[4] * matrix_4 +
+      b[2] * matrix_2 + b[0] * ident)
+  return matrix_u, matrix_v
+
+
+@tf_export('linalg.expm')
+def matrix_exponential(input, name=None):  # pylint: disable=redefined-builtin
+  r"""Computes the matrix exponential of one or more square matrices.
+
+  exp(A) = \sum_{n=0}^\infty A^n/n!
+
+  The exponential is computed using a combination of the scaling and squaring
+  method and the Pade approximation. Details can be found in:
+  Nicholas J. Higham, "The scaling and squaring method for the matrix
+  exponential revisited," SIAM J. Matrix Anal. Applic., 26:1179-1193, 2005.
+
+  The input is a tensor of shape `[..., M, M]` whose inner-most 2 dimensions
+  form square matrices. The output is a tensor of the same shape as the input
+  containing the exponential for all input submatrices `[..., :, :]`.
+
+  Args:
+    input: A `Tensor`. Must be `float16`, `float32`, `float64`, `complex64`,
+      or `complex128` with shape `[..., M, M]`.
+    name:  A name to give this `Op` (optional).
+
+  Returns:
+    the matrix exponential of the input.
+
+  Raises:
+    ValueError: An unsupported type is provided as input.
+
+  @compatibility(scipy)
+  Equivalent to scipy.linalg.expm
+  @end_compatibility
+  """
+  with ops.name_scope(name, 'matrix_exponential', [input]):
+    matrix = ops.convert_to_tensor(input, name='input')
+    if matrix.shape[-2:] == [0, 0]:
+      return matrix
+    batch_shape = matrix.shape[:-2]
+    if not batch_shape.is_fully_defined():
+      batch_shape = array_ops.shape(matrix)[:-2]
+
+    # reshaping the batch makes the where statements work better
+    matrix = array_ops.reshape(
+        matrix, array_ops.concat(([-1], array_ops.shape(matrix)[-2:]), axis=0))
+    l1_norm = math_ops.reduce_max(
+        math_ops.reduce_sum(math_ops.abs(matrix),
+                            axis=array_ops.size(array_ops.shape(matrix)) - 2),
+        axis=-1)
+    const = lambda x: constant_op.constant(x, l1_norm.dtype)
+    def _nest_where(vals, cases):
+      assert len(vals) == len(cases) - 1
+      if len(vals) == 1:
+        return array_ops.where(
+            math_ops.less(l1_norm, const(vals[0])), cases[0], cases[1])
+      else:
+        return array_ops.where(
+            math_ops.less(l1_norm, const(vals[0])), cases[0],
+            _nest_where(vals[1:], cases[1:]))
+
+    if matrix.dtype in [dtypes.float16, dtypes.float32, dtypes.complex64]:
+      maxnorm = const(3.925724783138660)
+      squarings = math_ops.maximum(
+          math_ops.floor(
+              math_ops.log(l1_norm / maxnorm) / math_ops.log(const(2.0))), 0)
+      u3, v3 = _matrix_exp_pade3(matrix)
+      u5, v5 = _matrix_exp_pade5(matrix)
+      u7, v7 = _matrix_exp_pade7(
+          matrix / math_ops.pow(
+              constant_op.constant(2.0, dtype=matrix.dtype),
+              math_ops.cast(squarings, matrix.dtype))[...,
+                                                      array_ops.newaxis,
+                                                      array_ops.newaxis])
+      conds = (4.258730016922831e-001, 1.880152677804762e+000)
+      u = _nest_where(conds, (u3, u5, u7))
+      v = _nest_where(conds, (v3, v5, v7))
+    elif matrix.dtype in [dtypes.float64, dtypes.complex128]:
+      maxnorm = const(5.371920351148152)
+      squarings = math_ops.maximum(
+          math_ops.floor(
+              math_ops.log(l1_norm / maxnorm) / math_ops.log(const(2.0))), 0)
+      u3, v3 = _matrix_exp_pade3(matrix)
+      u5, v5 = _matrix_exp_pade5(matrix)
+      u7, v7 = _matrix_exp_pade7(matrix)
+      u9, v9 = _matrix_exp_pade9(matrix)
+      u13, v13 = _matrix_exp_pade13(
+          matrix / math_ops.pow(
+              constant_op.constant(2.0, dtype=matrix.dtype),
+              math_ops.cast(squarings, matrix.dtype))[...,
+                                                      array_ops.newaxis,
+                                                      array_ops.newaxis])
+      conds = (1.495585217958292e-002,
+               2.539398330063230e-001,
+               9.504178996162932e-001,
+               2.097847961257068e+000)
+      u = _nest_where(conds, (u3, u5, u7, u9, u13))
+      v = _nest_where(conds, (v3, v5, v7, v9, v13))
+    else:
+      raise ValueError(
+          'tf.linalg.expm does not support matrices of type %s' % matrix.dtype)
+    numer = u + v
+    denom = -u + v
+    result = linalg_ops.matrix_solve(denom, numer)
+    max_squarings = math_ops.reduce_max(squarings)
+
+    i = const(0.0)
+    c = lambda i, r: math_ops.less(i, max_squarings)
+    def b(i, r):
+      return i+1, array_ops.where(math_ops.less(i, squarings),
+                                  math_ops.matmul(r, r), r)
+    _, result = control_flow_ops.while_loop(c, b, [i, result])
+    if not matrix.shape.is_fully_defined():
+      return array_ops.reshape(
+          result,
+          array_ops.concat((batch_shape, array_ops.shape(result)[-2:]), axis=0))
+    return array_ops.reshape(result, batch_shape.concatenate(result.shape[-2:]))
diff --git a/tensorflow/python/ops/linalg/linear_operator.py b/tensorflow/python/ops/linalg/linear_operator.py
index 8cfe964b1c..20c46fbb82 100644
--- a/tensorflow/python/ops/linalg/linear_operator.py
+++ b/tensorflow/python/ops/linalg/linear_operator.py
@@ -42,7 +42,7 @@ __all__ = ["LinearOperator"]
 class LinearOperator(object):
   """Base class defining a [batch of] linear operator[s].
 
-  Subclasses of `LinearOperator` provide a access to common methods on a
+  Subclasses of `LinearOperator` provide access to common methods on a
   (batch) matrix, without the need to materialize the matrix.  This allows:
 
   * Matrix free computations
@@ -69,11 +69,11 @@ class LinearOperator(object):
 
   #### Shape compatibility
 
-  `LinearOperator` sub classes should operate on a [batch] matrix with
+  `LinearOperator` subclasses should operate on a [batch] matrix with
   compatible shape.  Class docstrings should define what is meant by compatible
-  shape.  Some sub-classes may not support batching.
+  shape.  Some subclasses may not support batching.
 
-  An example is:
+  Examples:
 
   `x` is a batch matrix with compatible shape for `matmul` if
 
diff --git a/tensorflow/python/ops/linalg/linear_operator_diag.py b/tensorflow/python/ops/linalg/linear_operator_diag.py
index 5beaea65a5..ed53decc00 100644
--- a/tensorflow/python/ops/linalg/linear_operator_diag.py
+++ b/tensorflow/python/ops/linalg/linear_operator_diag.py
@@ -231,8 +231,11 @@ class LinearOperatorDiag(linear_operator.LinearOperator):
     return math_ops.reduce_prod(self._diag, reduction_indices=[-1])
 
   def _log_abs_determinant(self):
-    return math_ops.reduce_sum(
+    log_det = math_ops.reduce_sum(
         math_ops.log(math_ops.abs(self._diag)), reduction_indices=[-1])
+    if self.dtype.is_complex:
+      log_det = math_ops.cast(log_det, dtype=self.dtype)
+    return log_det
 
   def _solve(self, rhs, adjoint=False, adjoint_arg=False):
     diag_term = math_ops.conj(self._diag) if adjoint else self._diag
diff --git a/tensorflow/python/ops/linalg/linear_operator_low_rank_update.py b/tensorflow/python/ops/linalg/linear_operator_low_rank_update.py
index 08e5896e10..2b2bf80f27 100644
--- a/tensorflow/python/ops/linalg/linear_operator_low_rank_update.py
+++ b/tensorflow/python/ops/linalg/linear_operator_low_rank_update.py
@@ -18,16 +18,15 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops.linalg import linear_operator
 from tensorflow.python.ops.linalg import linear_operator_diag
 from tensorflow.python.ops.linalg import linear_operator_identity
 from tensorflow.python.ops.linalg import linear_operator_util
+from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util.tf_export import tf_export
 
 __all__ = [
@@ -153,8 +152,7 @@ class LinearOperatorLowRankUpdate(linear_operator.LinearOperator):
     `is_X` matrix property hints, which will trigger the appropriate code path.
 
     Args:
-      base_operator:  Shape `[B1,...,Bb, M, N]` real `float16`, `float32` or
-        `float64` `LinearOperator`.  This is `L` above.
+      base_operator:  Shape `[B1,...,Bb, M, N]`.
       u:  Shape `[B1,...,Bb, M, K]` `Tensor` of same `dtype` as `base_operator`.
         This is `U` above.
       diag_update:  Optional shape `[B1,...,Bb, K]` `Tensor` with same `dtype`
@@ -183,23 +181,12 @@ class LinearOperatorLowRankUpdate(linear_operator.LinearOperator):
     Raises:
       ValueError:  If `is_X` flags are set in an inconsistent way.
     """
-    # TODO(langmore) support complex types.
-    # Complex types are not allowed due to tf.cholesky() requiring float.
-    # If complex dtypes are allowed, we update the following
-    # 1. is_diag_update_positive should still imply that `diag > 0`, but we need
-    #    to remind the user that this implies diag is real.  This is needed
-    #    because if diag has non-zero imaginary part, it will not be
-    #    self-adjoint positive definite.
     dtype = base_operator.dtype
-    allowed_dtypes = [
-        dtypes.float16,
-        dtypes.float32,
-        dtypes.float64,
-    ]
-    if dtype not in allowed_dtypes:
-      raise TypeError(
-          "Argument matrix must have dtype in %s.  Found: %s"
-          % (allowed_dtypes, dtype))
+
+    if diag_update is not None:
+      if is_diag_update_positive and dtype.is_complex:
+        logging.warn("Note: setting is_diag_update_positive with a complex "
+                     "dtype means that diagonal is real and positive.")
 
     if diag_update is None:
       if is_diag_update_positive is False:
@@ -271,8 +258,6 @@ class LinearOperatorLowRankUpdate(linear_operator.LinearOperator):
       self._set_diag_operators(diag_update, is_diag_update_positive)
       self._is_diag_update_positive = is_diag_update_positive
 
-      check_ops.assert_same_float_dtype((base_operator, self.u, self.v,
-                                         self._diag_update))
       self._check_shapes()
 
       # Pre-compute the so-called "capacitance" matrix
@@ -407,6 +392,8 @@ class LinearOperatorLowRankUpdate(linear_operator.LinearOperator):
     else:
       det_c = linalg_ops.matrix_determinant(self._capacitance)
       log_abs_det_c = math_ops.log(math_ops.abs(det_c))
+      if self.dtype.is_complex:
+        log_abs_det_c = math_ops.cast(log_abs_det_c, dtype=self.dtype)
 
     return log_abs_det_c + log_abs_det_d + log_abs_det_l
 
diff --git a/tensorflow/python/ops/linalg/linear_operator_lower_triangular.py b/tensorflow/python/ops/linalg/linear_operator_lower_triangular.py
index fb1eb2fedb..ca6d3f5405 100644
--- a/tensorflow/python/ops/linalg/linear_operator_lower_triangular.py
+++ b/tensorflow/python/ops/linalg/linear_operator_lower_triangular.py
@@ -119,8 +119,7 @@ class LinearOperatorLowerTriangular(linear_operator.LinearOperator):
     Args:
       tril:  Shape `[B1,...,Bb, N, N]` with `b >= 0`, `N >= 0`.
         The lower triangular part of `tril` defines this operator.  The strictly
-        upper triangle is ignored.  Allowed dtypes: `float16`, `float32`,
-        `float64`.
+        upper triangle is ignored.
       is_non_singular:  Expect that this operator is non-singular.
         This operator is non-singular if and only if its diagonal elements are
         all non-zero.
@@ -137,7 +136,6 @@ class LinearOperatorLowerTriangular(linear_operator.LinearOperator):
       name: A name for this `LinearOperator`.
 
     Raises:
-      TypeError:  If `diag.dtype` is not an allowed type.
       ValueError:  If `is_square` is `False`.
     """
 
@@ -163,12 +161,12 @@ class LinearOperatorLowerTriangular(linear_operator.LinearOperator):
 
   def _check_tril(self, tril):
     """Static check of the `tril` argument."""
-    # TODO(langmore) Add complex types once matrix_triangular_solve works for
-    # them.
     allowed_dtypes = [
         dtypes.float16,
         dtypes.float32,
         dtypes.float64,
+        dtypes.complex64,
+        dtypes.complex128,
     ]
     dtype = tril.dtype
     if dtype not in allowed_dtypes:
diff --git a/tensorflow/python/ops/linalg/linear_operator_zeros.py b/tensorflow/python/ops/linalg/linear_operator_zeros.py
new file mode 100644
index 0000000000..b8a79c065b
--- /dev/null
+++ b/tensorflow/python/ops/linalg/linear_operator_zeros.py
@@ -0,0 +1,452 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""`LinearOperator` acting like a zero matrix."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.linalg import linalg_impl as linalg
+from tensorflow.python.ops.linalg import linear_operator
+from tensorflow.python.ops.linalg import linear_operator_util
+from tensorflow.python.util.tf_export import tf_export
+
+__all__ = [
+    "LinearOperatorZeros",
+]
+
+
+@tf_export("linalg.LinearOperatorZeros")
+class LinearOperatorZeros(linear_operator.LinearOperator):
+  """`LinearOperator` acting like a [batch] zero matrix.
+
+  This operator acts like a [batch] zero matrix `A` with shape
+  `[B1,...,Bb, N, M]` for some `b >= 0`.  The first `b` indices index a
+  batch member.  For every batch index `(i1,...,ib)`, `A[i1,...,ib, : :]` is
+  an `N x M` matrix.  This matrix `A` is not materialized, but for
+  purposes of broadcasting this shape will be relevant.
+
+  `LinearOperatorZeros` is initialized with `num_rows`, and optionally
+  `num_columns, `batch_shape`, and `dtype` arguments.  If `num_columns` is
+  `None`, then this operator will be initialized as a square matrix. If
+  `batch_shape` is `None`, this operator efficiently passes through all
+  arguments.  If `batch_shape` is provided, broadcasting may occur, which will
+  require making copies.
+
+  ```python
+  # Create a 2 x 2 zero matrix.
+  operator = LinearOperatorZero(num_rows=2, dtype=tf.float32)
+
+  operator.to_dense()
+  ==> [[0., 0.]
+       [0., 0.]]
+
+  operator.shape
+  ==> [2, 2]
+
+  operator.determinant()
+  ==> 0.
+
+  x = ... Shape [2, 4] Tensor
+  operator.matmul(x)
+  ==> Shape [2, 4] Tensor, same as x.
+
+  # Create a 2-batch of 2x2 zero matrices
+  operator = LinearOperatorZeros(num_rows=2, batch_shape=[2])
+  operator.to_dense()
+  ==> [[[0., 0.]
+        [0., 0.]],
+       [[0., 0.]
+        [0., 0.]]]
+
+  # Here, even though the operator has a batch shape, the input is the same as
+  # the output, so x can be passed through without a copy.  The operator is able
+  # to detect that no broadcast is necessary because both x and the operator
+  # have statically defined shape.
+  x = ... Shape [2, 2, 3]
+  operator.matmul(x)
+  ==> Shape [2, 2, 3] Tensor, same as tf.zeros_like(x)
+
+  # Here the operator and x have different batch_shape, and are broadcast.
+  # This requires a copy, since the output is different size than the input.
+  x = ... Shape [1, 2, 3]
+  operator.matmul(x)
+  ==> Shape [2, 2, 3] Tensor, equal to tf.zeros_like([x, x])
+  ```
+
+  ### Shape compatibility
+
+  This operator acts on [batch] matrix with compatible shape.
+  `x` is a batch matrix with compatible shape for `matmul` and `solve` if
+
+  ```
+  operator.shape = [B1,...,Bb] + [N, M],  with b >= 0
+  x.shape =   [C1,...,Cc] + [M, R],
+  and [C1,...,Cc] broadcasts with [B1,...,Bb] to [D1,...,Dd]
+  ```
+
+  #### Matrix property hints
+
+  This `LinearOperator` is initialized with boolean flags of the form `is_X`,
+  for `X = non_singular, self_adjoint, positive_definite, square`.
+  These have the following meaning:
+
+  * If `is_X == True`, callers should expect the operator to have the
+    property `X`.  This is a promise that should be fulfilled, but is *not* a
+    runtime assert.  For example, finite floating point precision may result
+    in these promises being violated.
+  * If `is_X == False`, callers should expect the operator to not have `X`.
+  * If `is_X == None` (the default), callers should have no expectation either
+    way.
+  """
+
+  def __init__(self,
+               num_rows,
+               num_columns=None,
+               batch_shape=None,
+               dtype=None,
+               is_non_singular=False,
+               is_self_adjoint=True,
+               is_positive_definite=False,
+               is_square=True,
+               assert_proper_shapes=False,
+               name="LinearOperatorZeros"):
+    r"""Initialize a `LinearOperatorZeros`.
+
+    The `LinearOperatorZeros` is initialized with arguments defining `dtype`
+    and shape.
+
+    This operator is able to broadcast the leading (batch) dimensions, which
+    sometimes requires copying data.  If `batch_shape` is `None`, the operator
+    can take arguments of any batch shape without copying.  See examples.
+
+    Args:
+      num_rows:  Scalar non-negative integer `Tensor`.  Number of rows in the
+        corresponding zero matrix.
+      num_columns:  Scalar non-negative integer `Tensor`.  Number of columns in
+        the corresponding zero matrix. If `None`, defaults to the value of
+        `num_rows`.
+      batch_shape:  Optional `1-D` integer `Tensor`.  The shape of the leading
+        dimensions.  If `None`, this operator has no leading dimensions.
+      dtype:  Data type of the matrix that this operator represents.
+      is_non_singular:  Expect that this operator is non-singular.
+      is_self_adjoint:  Expect that this operator is equal to its hermitian
+        transpose.
+      is_positive_definite:  Expect that this operator is positive definite,
+        meaning the quadratic form `x^H A x` has positive real part for all
+        nonzero `x`.  Note that we do not require the operator to be
+        self-adjoint to be positive-definite.  See:
+        https://en.wikipedia.org/wiki/Positive-definite_matrix#Extension_for_non-symmetric_matrices
+      is_square:  Expect that this operator acts like square [batch] matrices.
+      assert_proper_shapes:  Python `bool`.  If `False`, only perform static
+        checks that initialization and method arguments have proper shape.
+        If `True`, and static checks are inconclusive, add asserts to the graph.
+      name: A name for this `LinearOperator`
+
+    Raises:
+      ValueError:  If `num_rows` is determined statically to be non-scalar, or
+        negative.
+      ValueError:  If `num_columns` is determined statically to be non-scalar,
+        or negative.
+      ValueError:  If `batch_shape` is determined statically to not be 1-D, or
+        negative.
+      ValueError:  If any of the following is not `True`:
+        `{is_self_adjoint, is_non_singular, is_positive_definite}`.
+    """
+    dtype = dtype or dtypes.float32
+    self._assert_proper_shapes = assert_proper_shapes
+
+    with ops.name_scope(name):
+      dtype = dtypes.as_dtype(dtype)
+      if not is_self_adjoint and is_square:
+        raise ValueError("A zero operator is always self adjoint.")
+      if is_non_singular:
+        raise ValueError("A zero operator is always singular.")
+      if is_positive_definite:
+        raise ValueError("A zero operator is always not positive-definite.")
+
+      super(LinearOperatorZeros, self).__init__(
+          dtype=dtype,
+          is_non_singular=is_non_singular,
+          is_self_adjoint=is_self_adjoint,
+          is_positive_definite=is_positive_definite,
+          is_square=is_square,
+          name=name)
+
+      self._num_rows = linear_operator_util.shape_tensor(
+          num_rows, name="num_rows")
+      self._num_rows_static = tensor_util.constant_value(self._num_rows)
+
+      if num_columns is None:
+        num_columns = num_rows
+
+      self._num_columns = linear_operator_util.shape_tensor(
+          num_columns, name="num_columns")
+      self._num_columns_static = tensor_util.constant_value(self._num_columns)
+
+      self._check_domain_range_possibly_add_asserts()
+
+      if (self._num_rows_static is not None and
+          self._num_columns_static is not None):
+        if is_square and self._num_rows_static != self._num_columns_static:
+          raise ValueError(
+              "LinearOperatorZeros initialized as is_square=True, but got "
+              "num_rows({}) != num_columns({})".format(
+                  self._num_rows_static,
+                  self._num_columns_static))
+
+      if batch_shape is None:
+        self._batch_shape_arg = None
+      else:
+        self._batch_shape_arg = linear_operator_util.shape_tensor(
+            batch_shape, name="batch_shape_arg")
+        self._batch_shape_static = tensor_util.constant_value(
+            self._batch_shape_arg)
+        self._check_batch_shape_possibly_add_asserts()
+
+  def _shape(self):
+    matrix_shape = tensor_shape.TensorShape((self._num_rows_static,
+                                             self._num_columns_static))
+    if self._batch_shape_arg is None:
+      return matrix_shape
+
+    batch_shape = tensor_shape.TensorShape(self._batch_shape_static)
+    return batch_shape.concatenate(matrix_shape)
+
+  def _shape_tensor(self):
+    matrix_shape = array_ops.stack((self._num_rows, self._num_columns), axis=0)
+    if self._batch_shape_arg is None:
+      return matrix_shape
+
+    return array_ops.concat((self._batch_shape_arg, matrix_shape), 0)
+
+  def _assert_non_singular(self):
+    raise errors.InvalidArgumentError(
+        node_def=None, op=None, message="Zero operators are always "
+        "non-invertible.")
+
+  def _assert_positive_definite(self):
+    raise errors.InvalidArgumentError(
+        node_def=None, op=None, message="Zero operators are always "
+        "non-positive definite.")
+
+  def _assert_self_adjoint(self):
+    return control_flow_ops.no_op("assert_self_adjoint")
+
+  def _possibly_broadcast_batch_shape(self, x):
+    """Return 'x', possibly after broadcasting the leading dimensions."""
+    # If we have no batch shape, our batch shape broadcasts with everything!
+    if self._batch_shape_arg is None:
+      return x
+
+    # Static attempt:
+    #   If we determine that no broadcast is necessary, pass x through
+    #   If we need a broadcast, add to an array of zeros.
+    #
+    # special_shape is the shape that, when broadcast with x's shape, will give
+    # the correct broadcast_shape.  Note that
+    #   We have already verified the second to last dimension of self.shape
+    #   matches x's shape in assert_compatible_matrix_dimensions.
+    #   Also, the final dimension of 'x' can have any shape.
+    #   Therefore, the final two dimensions of special_shape are 1's.
+    special_shape = self.batch_shape.concatenate([1, 1])
+    bshape = array_ops.broadcast_static_shape(x.get_shape(), special_shape)
+    if special_shape.is_fully_defined():
+      # bshape.is_fully_defined iff special_shape.is_fully_defined.
+      if bshape == x.get_shape():
+        return x
+      # Use the built in broadcasting of addition.
+      zeros = array_ops.zeros(shape=special_shape, dtype=self.dtype)
+      return x + zeros
+
+    # Dynamic broadcast:
+    #   Always add to an array of zeros, rather than using a "cond", since a
+    #   cond would require copying data from GPU --> CPU.
+    special_shape = array_ops.concat((self.batch_shape_tensor(), [1, 1]), 0)
+    zeros = array_ops.zeros(shape=special_shape, dtype=self.dtype)
+    return x + zeros
+
+  def _matmul(self, x, adjoint=False, adjoint_arg=False):
+    if self._assert_proper_shapes:
+      x = linalg.adjoint(x) if adjoint_arg else x
+      aps = linear_operator_util.assert_compatible_matrix_dimensions(self, x)
+      x = control_flow_ops.with_dependencies([aps], x)
+    if self.is_square:
+      # Note that adjoint has no effect since this matrix is self-adjoint.
+      if adjoint_arg:
+        output_shape = array_ops.concat([
+            array_ops.shape(x)[:-2],
+            [array_ops.shape(x)[-1], array_ops.shape(x)[-2]]], axis=0)
+      else:
+        output_shape = array_ops.shape(x)
+
+      return self._possibly_broadcast_batch_shape(
+          array_ops.zeros(shape=output_shape, dtype=x.dtype))
+
+    x_shape = array_ops.shape(x)
+    n = self._num_columns if adjoint else self._num_rows
+    m = x_shape[-2] if adjoint_arg else x_shape[-1]
+
+    output_shape = array_ops.concat([x_shape[:-2], [n, m]], axis=0)
+
+    zeros = array_ops.zeros(shape=output_shape, dtype=x.dtype)
+    return self._possibly_broadcast_batch_shape(zeros)
+
+  def _determinant(self):
+    if self.batch_shape.is_fully_defined():
+      return array_ops.zeros(shape=self.batch_shape, dtype=self.dtype)
+    else:
+      return array_ops.zeros(shape=self.batch_shape_tensor(), dtype=self.dtype)
+
+  def _trace(self):
+    # Get Tensor of all zeros of same shape as self.batch_shape.
+    if self.batch_shape.is_fully_defined():
+      return array_ops.zeros(shape=self.batch_shape, dtype=self.dtype)
+    else:
+      return array_ops.zeros(shape=self.batch_shape_tensor(), dtype=self.dtype)
+
+  def _diag_part(self):
+    return self._zeros_diag()
+
+  def add_to_tensor(self, mat, name="add_to_tensor"):
+    """Add matrix represented by this operator to `mat`.  Equiv to `I + mat`.
+
+    Args:
+      mat:  `Tensor` with same `dtype` and shape broadcastable to `self`.
+      name:  A name to give this `Op`.
+
+    Returns:
+      A `Tensor` with broadcast shape and same `dtype` as `self`.
+    """
+    return self._possibly_broadcast_batch_shape(mat)
+
+  def _check_domain_range_possibly_add_asserts(self):
+    """Static check of init arg `num_rows`, possibly add asserts."""
+    # Possibly add asserts.
+    if self._assert_proper_shapes:
+      self._num_rows = control_flow_ops.with_dependencies([
+          check_ops.assert_rank(
+              self._num_rows,
+              0,
+              message="Argument num_rows must be a 0-D Tensor."),
+          check_ops.assert_non_negative(
+              self._num_rows,
+              message="Argument num_rows must be non-negative."),
+      ], self._num_rows)
+      self._num_columns = control_flow_ops.with_dependencies([
+          check_ops.assert_rank(
+              self._num_columns,
+              0,
+              message="Argument num_columns must be a 0-D Tensor."),
+          check_ops.assert_non_negative(
+              self._num_columns,
+              message="Argument num_columns must be non-negative."),
+      ], self._num_columns)
+
+    # Static checks.
+    if not self._num_rows.dtype.is_integer:
+      raise TypeError("Argument num_rows must be integer type.  Found:"
+                      " %s" % self._num_rows)
+
+    if not self._num_columns.dtype.is_integer:
+      raise TypeError("Argument num_columns must be integer type.  Found:"
+                      " %s" % self._num_columns)
+
+    num_rows_static = self._num_rows_static
+    num_columns_static = self._num_columns_static
+
+    if num_rows_static is not None:
+      if num_rows_static.ndim != 0:
+        raise ValueError("Argument num_rows must be a 0-D Tensor.  Found:"
+                         " %s" % num_rows_static)
+
+      if num_rows_static < 0:
+        raise ValueError("Argument num_rows must be non-negative.  Found:"
+                         " %s" % num_rows_static)
+    if num_columns_static is not None:
+      if num_columns_static.ndim != 0:
+        raise ValueError("Argument num_columns must be a 0-D Tensor.  Found:"
+                         " %s" % num_columns_static)
+
+      if num_columns_static < 0:
+        raise ValueError("Argument num_columns must be non-negative.  Found:"
+                         " %s" % num_columns_static)
+
+  def _check_batch_shape_possibly_add_asserts(self):
+    """Static check of init arg `batch_shape`, possibly add asserts."""
+    if self._batch_shape_arg is None:
+      return
+
+    # Possibly add asserts
+    if self._assert_proper_shapes:
+      self._batch_shape_arg = control_flow_ops.with_dependencies([
+          check_ops.assert_rank(
+              self._batch_shape_arg,
+              1,
+              message="Argument batch_shape must be a 1-D Tensor."),
+          check_ops.assert_non_negative(
+              self._batch_shape_arg,
+              message="Argument batch_shape must be non-negative."),
+      ], self._batch_shape_arg)
+
+    # Static checks
+    if not self._batch_shape_arg.dtype.is_integer:
+      raise TypeError("Argument batch_shape must be integer type.  Found:"
+                      " %s" % self._batch_shape_arg)
+
+    if self._batch_shape_static is None:
+      return  # Cannot do any other static checks.
+
+    if self._batch_shape_static.ndim != 1:
+      raise ValueError("Argument batch_shape must be a 1-D Tensor.  Found:"
+                       " %s" % self._batch_shape_static)
+
+    if np.any(self._batch_shape_static < 0):
+      raise ValueError("Argument batch_shape must be non-negative.  Found:"
+                       "%s" % self._batch_shape_static)
+
+  def _min_matrix_dim(self):
+    """Minimum of domain/range dimension, if statically available, else None."""
+    domain_dim = self.domain_dimension.value
+    range_dim = self.range_dimension.value
+    if domain_dim is None or range_dim is None:
+      return None
+    return min(domain_dim, range_dim)
+
+  def _min_matrix_dim_tensor(self):
+    """Minimum of domain/range dimension, as a tensor."""
+    return math_ops.reduce_min(self.shape_tensor()[-2:])
+
+  def _zeros_diag(self):
+    """Returns the diagonal of this operator as all zeros."""
+    if self.shape.is_fully_defined():
+      d_shape = self.batch_shape.concatenate([self._min_matrix_dim()])
+    else:
+      d_shape = array_ops.concat(
+          [self.batch_shape_tensor(),
+           [self._min_matrix_dim_tensor()]], axis=0)
+
+    return array_ops.zeros(shape=d_shape, dtype=self.dtype)
diff --git a/tensorflow/python/ops/linalg_ops.py b/tensorflow/python/ops/linalg_ops.py
index a0dfa543f9..f4a93560be 100644
--- a/tensorflow/python/ops/linalg_ops.py
+++ b/tensorflow/python/ops/linalg_ops.py
@@ -401,7 +401,7 @@ def svd(tensor, full_matrices=False, compute_uv=True, name=None):
   import tensorflow as tf
   import numpy as np
   s, u, v = tf.linalg.svd(a)
-  tf_a_approx = tf.matmul(u, tf.matmul(tf.linalg.diag(s), v, adjoint_v=True))
+  tf_a_approx = tf.matmul(u, tf.matmul(tf.linalg.diag(s), v, adjoint_b=True))
   u, s, v_adj = np.linalg.svd(a, full_matrices=False)
   np_a_approx = np.dot(u, np.dot(np.diag(s), v_adj))
   # tf_a_approx and np_a_approx should be numerically close.
diff --git a/tensorflow/python/ops/logging_ops.py b/tensorflow/python/ops/logging_ops.py
index 8276047cb6..df41933f8a 100644
--- a/tensorflow/python/ops/logging_ops.py
+++ b/tensorflow/python/ops/logging_ops.py
@@ -35,9 +35,12 @@ from tensorflow.python.util.tf_export import tf_export
 
 
 # Assert and Print are special symbols in python, so we must
-# have an upper-case version of them.  For users with Python 3 or Python 2.7
-# with `from __future__ import print_function`, we also allow lowercase.
-@tf_export("Print", "print")
+# have an upper-case version of them.
+#
+# For users with Python 3 or Python 2.7
+# with `from __future__ import print_function`, we could also allow lowercase.
+# See https://github.com/tensorflow/tensorflow/issues/18053
+@tf_export("Print")
 def Print(input_, data, message=None, first_n=None, summarize=None,
           name=None):
   """Prints a list of tensors.
diff --git a/tensorflow/python/ops/math_ops.py b/tensorflow/python/ops/math_ops.py
index cdb6dc8f22..fbe6b62302 100644
--- a/tensorflow/python/ops/math_ops.py
+++ b/tensorflow/python/ops/math_ops.py
@@ -37,11 +37,11 @@ from tensorflow.python.ops import gen_math_ops
 from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import gen_sparse_ops
 from tensorflow.python.ops import gen_spectral_ops
-from tensorflow.python.platform import tf_logging as logging
 # go/tf-wildcard-import
 # pylint: disable=wildcard-import
 from tensorflow.python.ops.gen_math_ops import *
 # pylint: enable=wildcard-import
+from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import compat
 from tensorflow.python.util import deprecation
 from tensorflow.python.util import nest
@@ -628,16 +628,17 @@ def cast(x, dtype, name=None):
   ```
 
   The operation supports data types (for `x` and `dtype`) of
-  `uint8`, `int8`, `uint16`, `int16`, `int32`, `int64`, `float16`, `float32`,
-  `float64`, `complex64`, `complex128`, `bfloat16`. In case of casting from
-  complex types (`complex64`, `complex128`) to real types, only the real part
-  of `x` is returned. In case of casting from real types to complex types
-  (`complex64`, `complex128`), the imaginary part of the returned value is set
-  to `0`. The handling of complex types here matches the behavior of numpy.
+  `uint8`, `uint16`, `uint32`, `uint64`, `int8`, `int16`, `int32`, `int64`,
+  `float16`, `float32`, `float64`, `complex64`, `complex128`, `bfloat16`.
+  In case of casting from complex types (`complex64`, `complex128`) to real
+  types, only the real part of `x` is returned. In case of casting from real
+  types to complex types (`complex64`, `complex128`), the imaginary part of the
+  returned value is set to `0`. The handling of complex types here matches the
+  behavior of numpy.
 
   Args:
     x: A `Tensor` or `SparseTensor` of numeric type. It could be
-      `uint8`, `int8`, `uint16`, `int16`, `int32`, `int64`,
+      `uint8`, `uint16`, `uint32`, `uint64`, `int8`, `int16`, `int32`, `int64`,
       `float16`, `float32`, `float64`, `complex64`, `complex128`, `bfloat16`.
     dtype: The destination type. The list of supported dtypes is the same
       as `x`.
@@ -651,6 +652,9 @@ def cast(x, dtype, name=None):
     TypeError: If `x` cannot be cast to the `dtype`.
   """
   base_type = dtypes.as_dtype(dtype).base_dtype
+  if isinstance(x,
+                (ops.Tensor, _resource_variable_type)) and base_type == x.dtype:
+    return x
   with ops.name_scope(name, "Cast", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
       values_cast = cast(x.values, base_type, name=name)
@@ -1222,8 +1226,9 @@ def _ReductionDims(x, axis, reduction_indices):
     return axis
   else:
     # Fast path: avoid creating Rank and Range ops if ndims is known.
-    if isinstance(x, ops.Tensor) and x._rank() is not None:  # pylint: disable=protected-access
-      return constant_op.constant(np.arange(x._rank()), dtype=dtypes.int32)  # pylint: disable=protected-access
+    rank = common_shapes.rank(x)
+    if rank is not None:
+      return constant_op.constant(np.arange(rank), dtype=dtypes.int32)
     if (isinstance(x, sparse_tensor.SparseTensor) and
         x.dense_shape.get_shape().is_fully_defined()):
       rank = x.dense_shape.get_shape()[0].value  # sparse.dense_shape is 1-D.
@@ -1234,8 +1239,8 @@ def _ReductionDims(x, axis, reduction_indices):
 
 
 def _may_reduce_to_scalar(keepdims, axis, reduction_indices, output):
-  """Set a reduction's output's shape to be a scalar if we are certain."""
-  if (not output.shape.is_fully_defined()) and (not keepdims) and (
+  """Set a reduction's output shape to be a scalar if we are certain."""
+  if not common_shapes.has_fully_defined_shape(output) and (not keepdims) and (
       axis is None) and (reduction_indices is None):
     output.set_shape(())
   return output
diff --git a/tensorflow/python/ops/metrics_impl.py b/tensorflow/python/ops/metrics_impl.py
index bfd225b0d8..3aedeb6acd 100644
--- a/tensorflow/python/ops/metrics_impl.py
+++ b/tensorflow/python/ops/metrics_impl.py
@@ -73,16 +73,16 @@ def metric_variable(shape, dtype, validate_shape=True, name=None):
     A (non-trainable) variable initialized to zero, or if inside a
     `DistributionStrategy` scope a tower-local variable container.
   """
-  with distribute_lib.get_tower_context().tower_local_var_scope(
-      variable_scope.VariableAggregation.SUM):
-    # Note that "tower local" implies trainable=False.
-    return variable_scope.variable(
-        lambda: array_ops.zeros(shape, dtype),
-        collections=[
-            ops.GraphKeys.LOCAL_VARIABLES, ops.GraphKeys.METRIC_VARIABLES
-        ],
-        validate_shape=validate_shape,
-        name=name)
+  # Note that synchronization "ON_READ" implies trainable=False.
+  return variable_scope.variable(
+      lambda: array_ops.zeros(shape, dtype),
+      collections=[
+          ops.GraphKeys.LOCAL_VARIABLES, ops.GraphKeys.METRIC_VARIABLES
+      ],
+      validate_shape=validate_shape,
+      synchronization=variable_scope.VariableSynchronization.ON_READ,
+      aggregation=variable_scope.VariableAggregation.SUM,
+      name=name)
 
 
 def _remove_squeezable_dimensions(predictions, labels, weights):
diff --git a/tensorflow/python/ops/parallel_for/BUILD b/tensorflow/python/ops/parallel_for/BUILD
index 065c2caedc..6c804a50e7 100644
--- a/tensorflow/python/ops/parallel_for/BUILD
+++ b/tensorflow/python/ops/parallel_for/BUILD
@@ -125,5 +125,4 @@ cuda_py_test(
         "//tensorflow/python:random_ops",
         "//tensorflow/python/ops/losses",
     ],
-    tags = ["no_gpu"],  # TODO(b/80127739): test is flaky
 )
diff --git a/tensorflow/python/ops/parallel_for/__init__.py b/tensorflow/python/ops/parallel_for/__init__.py
index b49d865968..dd8bc6d487 100644
--- a/tensorflow/python/ops/parallel_for/__init__.py
+++ b/tensorflow/python/ops/parallel_for/__init__.py
@@ -23,13 +23,3 @@ from tensorflow.python.ops.parallel_for.control_flow_ops import for_loop
 from tensorflow.python.ops.parallel_for.control_flow_ops import pfor
 from tensorflow.python.ops.parallel_for.gradients import batch_jacobian
 from tensorflow.python.ops.parallel_for.gradients import jacobian
-from tensorflow.python.util.all_util import remove_undocumented
-
-_allowed_symbols = [
-    'pfor',
-    'for_loop',
-    'jacobian',
-    'batch_jacobian',
-]
-
-remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/python/ops/parallel_for/control_flow_ops_test.py b/tensorflow/python/ops/parallel_for/control_flow_ops_test.py
index 1134a6c4a6..c0e66cb0b8 100644
--- a/tensorflow/python/ops/parallel_for/control_flow_ops_test.py
+++ b/tensorflow/python/ops/parallel_for/control_flow_ops_test.py
@@ -23,6 +23,8 @@ import time
 from absl import flags
 import numpy as np
 
+from tensorflow.core.example import example_pb2
+from tensorflow.core.example import feature_pb2
 from tensorflow.python.client import session
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
@@ -35,6 +37,7 @@ from tensorflow.python.ops import gradients as gradient_ops
 from tensorflow.python.ops import logging_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
+from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import rnn
 from tensorflow.python.ops import rnn_cell
@@ -1347,5 +1350,55 @@ class SparseTest(PForTest):
     self.run_and_assert_equal(pfor, manual)
 
 
+class ParsingTest(PForTest):
+
+  def test_decode_csv(self):
+    csv_tensor = constant_op.constant([["1:2:3"], ["::"], ["7:8:9"]])
+    kwargs = {"record_defaults": [[10], [20], [30]], "field_delim": ":"}
+
+    def loop_fn(i):
+      line = array_ops.gather(csv_tensor, i)
+      return parsing_ops.decode_csv(line, **kwargs)
+
+    self._test_loop_fn(loop_fn, iters=3, loop_fn_dtypes=[dtypes.int32] * 3)
+
+  def test_parse_single_example(self):
+
+    def _int64_feature(*values):
+      return feature_pb2.Feature(int64_list=feature_pb2.Int64List(value=values))
+
+    def _bytes_feature(*values):
+      return feature_pb2.Feature(
+          bytes_list=feature_pb2.BytesList(
+              value=[v.encode("utf-8") for v in values]))
+
+    examples = constant_op.constant([
+        example_pb2.Example(
+            features=feature_pb2.Features(
+                feature={
+                    "dense_int": _int64_feature(i),
+                    "dense_str": _bytes_feature(str(i)),
+                    "sparse_int": _int64_feature(i, i * 2, i * 4, i * 8),
+                    "sparse_str": _bytes_feature(*["abc"] * i)
+                })).SerializeToString() for i in range(10)
+    ])
+
+    features = {
+        "dense_int": parsing_ops.FixedLenFeature((), dtypes.int64, 0),
+        "dense_str": parsing_ops.FixedLenFeature((), dtypes.string, ""),
+        "sparse_int": parsing_ops.VarLenFeature(dtypes.int64),
+        "sparse_str": parsing_ops.VarLenFeature(dtypes.string),
+    }
+
+    def loop_fn(i):
+      example_proto = array_ops.gather(examples, i)
+      f = parsing_ops.parse_single_example(example_proto, features)
+      return f
+
+    pfor = pfor_control_flow_ops.pfor(loop_fn, iters=10)
+    manual = parsing_ops.parse_example(examples, features)
+    self.run_and_assert_equal(pfor, manual)
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/ops/parallel_for/gradients_test.py b/tensorflow/python/ops/parallel_for/gradients_test.py
index 310a2154f7..3a6d9149ad 100644
--- a/tensorflow/python/ops/parallel_for/gradients_test.py
+++ b/tensorflow/python/ops/parallel_for/gradients_test.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 import functools
+import os
 import time
 
 import numpy as np
@@ -444,6 +445,10 @@ class GradientsTest(test.TestCase):
     self.run_and_assert_equal(pfor_outputs, while_outputs)
 
   def test_mnist_per_eg_grad(self):
+    # It looks like CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED
+    # configuration of Winograd can cause low precision output resulting in
+    # tests failing. So we disable that here.
+    os.environ["TF_ENABLE_WINOGRAD_NONFUSED"] = "0"
     data_format = ("channels_first"
                    if test.is_gpu_available() else "channels_last")
     # Note that we we are setting training=False here so that dropout produces
@@ -451,8 +456,13 @@ class GradientsTest(test.TestCase):
     pfor_outputs, while_outputs = create_mnist_per_eg_grad(
         4, data_format, training=False)
     self.run_and_assert_equal(pfor_outputs, while_outputs, rtol=1e-3)
+    os.environ.pop("TF_ENABLE_WINOGRAD_NONFUSED", None)
 
   def test_mnist_per_eg_jacobian(self):
+    # It looks like CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD_NONFUSED
+    # configuration of Winograd can cause low precision output resulting in
+    # tests failing. So we disable that here.
+    os.environ["TF_ENABLE_WINOGRAD_NONFUSED"] = "0"
     data_format = ("channels_first"
                    if test.is_gpu_available() else "channels_last")
     # Note that we we are setting training=False here so that dropout produces
@@ -460,6 +470,7 @@ class GradientsTest(test.TestCase):
     pfor_outputs, while_outputs = create_mnist_per_eg_jacobian(
         2, data_format, training=False)
     self.run_and_assert_equal(pfor_outputs, while_outputs, rtol=1e-3)
+    os.environ.pop("TF_ENABLE_WINOGRAD_NONFUSED", None)
 
   def test_fc_jacobian(self):
     jacobians, per_eg_jacobians_pfor, per_eg_jacobians_while = (
diff --git a/tensorflow/python/ops/parallel_for/pfor.py b/tensorflow/python/ops/parallel_for/pfor.py
index 1c0ef24040..77ec3bc0d4 100644
--- a/tensorflow/python/ops/parallel_for/pfor.py
+++ b/tensorflow/python/ops/parallel_for/pfor.py
@@ -34,12 +34,15 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import data_flow_ops
 from tensorflow.python.ops import functional_ops
+from tensorflow.python.ops import gen_parsing_ops
 from tensorflow.python.ops import gen_sparse_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import sparse_ops
 from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util import nest
 
 flags.DEFINE_bool(
     "op_conversion_fallback_to_while_loop", False,
@@ -152,8 +155,14 @@ class WhileOp(object):
     # `loop_vars` and are not returned from the `while_loop`. In Python code,
     # they are usually captured by the body lambda. We collect them below by
     # iterating over all the ops in the graph. They are appended to the end of
-    # self._enters and don't correspond to any outputs in self._outputs.
+    # self._enters or self._direct_enters, and don't correspond to any outputs
+    # in self._outputs. Note that we keep the resource/variant Enter nodes in
+    # self._direct_enters and the constructed while_loop's body uses them
+    # directly as opposed to passing them as loop variables. This is done
+    # because the while_body cannot partition the resource/variant Tensors, so
+    # it has to leave them unchanged.
     self._enters = []
+    self._direct_enters = []
 
     for e in self._while_context.loop_exits:
       self._outputs.append(e.op.outputs[0])
@@ -185,7 +194,11 @@ class WhileOp(object):
         if op.type == "Enter":
           output = op.outputs[0]
           if output not in self._enters:
-            self._enters.append(output)
+            if output.dtype in (dtypes.resource, dtypes.variant):
+              if output not in self._direct_enters:
+                self._direct_enters.append(output)
+            else:
+              self._enters.append(output)
 
   def __str__(self):
     """String representation."""
@@ -194,13 +207,13 @@ class WhileOp(object):
   @property
   def inputs(self):
     """Input to all the Enter nodes."""
-    return [x.op.inputs[0] for x in self._enters]
+    return [x.op.inputs[0] for x in self._enters + self._direct_enters]
 
   @property
   def control_inputs(self):
     """Control input to all the Enter nodes."""
     control_inputs = []
-    for x in self._enters:
+    for x in self._enters + self._direct_enters:
       control_inputs.extend(x.op.control_inputs)
     return control_inputs
 
@@ -268,6 +281,16 @@ class WhileOp(object):
         pfor_ops=self._pfor_ops,
         all_indices=indices,
         all_indices_partitioned=cond_stacked)
+    # Map all inputs of Enter nodes in self._direct_enters to their converted
+    # values.
+    for enter in self._direct_enters:
+      enter_input = enter.op.inputs[0]
+      converted_enter, stacked, is_sparse_stacked = parent_pfor._convert_helper(
+          enter_input)
+      # Since these are resources / variants, they should be unstacked.
+      assert not stacked and not is_sparse_stacked, (enter, converted_enter)
+      pfor._add_conversion(enter, wrap(converted_enter, False))
+
     # Map all Enter nodes to the inputs.
     for enter, inp, stacked in zip(self._enters, inputs, inputs_stacked):
       pfor._add_conversion(enter, wrap(inp, stacked))
@@ -569,7 +592,7 @@ class WhileOp(object):
       inputs = args[:num_enters]
       output_tas = args[num_enters:]
       # TODO(agarwal): see which outputs have consumers and only populate the
-      # TensorArrays corresonding to those. Or do those paths get trimmed out
+      # TensorArrays corresponding to those. Or do those paths get trimmed out
       # from inside the while_loop body?
       assert len(inputs) >= len(output_tas)
       assert len(inputs) == len(inputs_stacked)
@@ -2482,3 +2505,48 @@ def _convert_stack_pop_v2(pfor_input):
   elem_type = pfor_input.get_attr("elem_type")
   out = data_flow_ops.stack_pop_v2(handle, elem_type)
   return wrap(out, stacked)
+
+
+# parsing_ops
+
+
+@RegisterPFor("DecodeCSV")
+def _convert_decode_csv(pfor_input):
+  lines = pfor_input.stacked_input(0)
+  record_defaults = [
+      pfor_input.unstacked_input(i) for i in range(1, pfor_input.num_inputs)
+  ]
+  field_delim = pfor_input.get_attr("field_delim")
+  use_quote_delim = pfor_input.get_attr("use_quote_delim")
+  select_cols = pfor_input.get_attr("select_cols")
+  if not select_cols:
+    select_cols = None
+  return [
+      wrap(t, True) for t in parsing_ops.decode_csv(
+          lines,
+          record_defaults,
+          field_delim=field_delim,
+          use_quote_delim=use_quote_delim,
+          select_cols=select_cols)
+  ]
+
+
+@RegisterPFor("ParseSingleExample")
+def _convert_parse_single_example(pfor_input):
+  serialized = pfor_input.stacked_input(0)
+  dense_defaults = [
+      pfor_input.unstacked_input(i) for i in range(1, pfor_input.num_inputs)
+  ]
+  sparse_keys = pfor_input.get_attr("sparse_keys")
+  dense_keys = pfor_input.get_attr("dense_keys")
+  sparse_types = pfor_input.get_attr("sparse_types")
+  dense_shapes = pfor_input.get_attr("dense_shapes")
+  output = gen_parsing_ops.parse_example(
+      serialized=serialized,
+      names=[],
+      dense_defaults=dense_defaults,
+      sparse_keys=sparse_keys,
+      dense_keys=dense_keys,
+      sparse_types=sparse_types,
+      dense_shapes=dense_shapes)
+  return [wrap(t, True, True) for t in nest.flatten(output)]
diff --git a/tensorflow/python/ops/resource_variable_ops.py b/tensorflow/python/ops/resource_variable_ops.py
index 15cafbbde5..8b259b6b6b 100644
--- a/tensorflow/python/ops/resource_variable_ops.py
+++ b/tensorflow/python/ops/resource_variable_ops.py
@@ -181,7 +181,8 @@ def shape_safe_assign_variable_handle(handle, shape, value, name=None):
                                                       name=name)
 
 
-class ResourceVariable(variables.Variable):
+# TODO(apassos) make this be variables.Variable
+class ResourceVariable(variables.RefVariable):
   """Variable based on resource handles.
 
   See the @{$variables$Variables How To} for a high level overview.
@@ -195,15 +196,16 @@ class ResourceVariable(variables.Variable):
   the variable are fixed. The value can be changed using one of the assign
   methods.
 
-  Just like any `Tensor`, variables created with `ResourceVariable()` can be
-  used as inputs for other Ops in the graph. Additionally, all the operators
-  overloaded for the `Tensor` class are carried over to variables, so you can
-  also add nodes to the graph by just doing arithmetic on variables.
+  Just like any `Tensor`, variables created with
+  `tf.Variable(use_resource=True)` can be used as inputs for other Ops in the
+  graph. Additionally, all the operators overloaded for the `Tensor` class are
+  carried over to variables, so you can also add nodes to the graph by just
+  doing arithmetic on variables.
 
-  Unlike tf.Variable, a tf.ResourceVariable has well-defined semantics. Each
+  Unlike ref-based variable, a ResourceVariable has well-defined semantics. Each
   usage of a ResourceVariable in a TensorFlow graph adds a read_value operation
-  to the graph. The Tensors returned by a read_value operation are guaranteed
-  to see all modifications to the value of the variable which happen in any
+  to the graph. The Tensors returned by a read_value operation are guaranteed to
+  see all modifications to the value of the variable which happen in any
   operation on which the read_value depends on (either directly, indirectly, or
   via a control dependency) and guaranteed to not see any modification to the
   value of the variable from operations that depend on the read_value operation.
@@ -217,7 +219,7 @@ class ResourceVariable(variables.Variable):
   can cause tf.Variable and tf.ResourceVariable to behave differently:
 
   ```python
-  a = tf.ResourceVariable(1.0)
+  a = tf.Variable(1.0, use_resource=True)
   a.initializer.run()
 
   assign = a.assign(2.0)
@@ -741,8 +743,14 @@ class ResourceVariable(variables.Variable):
   def _read_variable_op(self):
     if self.trainable:
       tape.watch_variable(self)
-    return gen_resource_variable_ops.read_variable_op(self._handle,
-                                                      self._dtype)
+    result = gen_resource_variable_ops.read_variable_op(self._handle,
+                                                        self._dtype)
+    if not context.executing_eagerly():
+      # Note that if a control flow context is active the input of the read op
+      # might not actually be the handle. This line bypasses it.
+      tape.record_operation(
+          "ReadVariableOp", [result], [self._handle], lambda x: [x])
+    return result
 
   def read_value(self):
     """Constructs an op which reads the value of this variable.
@@ -867,6 +875,19 @@ class ResourceVariable(variables.Variable):
 
   __array_priority__ = 100
 
+  def is_initialized(self, name=None):
+    """Checks whether a resource variable has been initialized.
+
+    Outputs boolean scalar indicating whether the tensor has been initialized.
+
+    Args:
+      name: A name for the operation (optional).
+
+    Returns:
+      A `Tensor` of type `bool`.
+    """
+    return gen_resource_variable_ops.var_is_initialized_op(self.handle, name)
+
   def assign_sub(self, delta, use_locking=None, name=None, read_value=True):
     """Subtracts a value from this variable.
 
@@ -1091,6 +1112,113 @@ class _UnreadVariable(ResourceVariable):
 ops.register_tensor_conversion_function(_UnreadVariable, _dense_var_to_tensor)
 ops.register_dense_tensor_like_type(_UnreadVariable)
 
+
+class _MixedPrecisionVariable(ResourceVariable):
+  """Represents a variable that can return in desired dtype when read.
+
+  In mixed precision training, it is usually desirable to use different dtypes
+  for variables and computation. This class will be used to wrap created
+  ResourceVariable when mixed precision training is enabled. It allows layers to
+  perform computation in a different dtype than their variable dtypes, in order
+  to achieve higher performance without causing quality loss.
+  """
+
+  def __init__(self, var, read_dtype):
+    """Creates a MixedPrecisionVariable.
+
+    Args:
+      var: A ResourceVariable instance.
+      read_dtype: A tf.DType, the returned dtype when read, default to None.
+        Casting is performed if read_dtype is not None and differs from
+        var.dtype.
+    Returns:
+      An MixedPrecisionVariable instance.
+    Raises:
+      ValueError: if var is not a ResourceVariable instance, or read_dtype is
+        not a tf.DType instance.
+    """
+    # pylint: disable=super-init-not-called
+    # We do not call super init on purpose.
+    if not isinstance(var, ResourceVariable):
+      raise ValueError("InvalidArgument: var must be a ResourceVariable type.")
+    if not isinstance(read_dtype, dtypes.DType):
+      raise ValueError("InvalidArgument: read_dtype must be a tf.DType type.")
+
+    self._var = var
+    self._trainable = var.trainable
+    self._save_slice_info = None
+    self._graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
+    self._in_graph_mode = var._in_graph_mode  # pylint: disable=protected-access
+    self._handle = var.handle
+    self._shape = var.shape
+    self._initial_value = None
+    if isinstance(self.handle, ops.EagerTensor):
+      self._handle_name = ""
+    else:
+      self._handle_name = self.handle.name
+    self._unique_id = var._unique_id  # pylint: disable=protected-access
+    self._dtype = var.dtype
+    self._constraint = None
+    self._cached_value = None
+    self._is_initialized_op = var._is_initialized_op  # pylint: disable=protected-access
+    self._initializer_op = var._initializer_op  # pylint: disable=protected-access
+    # This needs to be set before read_value() is called.
+    self._read_dtype = read_dtype
+    if context.executing_eagerly():
+      self._graph_element = None
+    else:
+      self._graph_element = self.read_value()
+    self._handle_deleter = (
+        var._handle_deleter if not self._in_graph_mode  # pylint: disable=protected-access
+        else None)
+    # pylint: enable=super-init-not-called
+
+  @property
+  def name(self):
+    return self._var.name
+
+  def value(self):
+    return self._read_variable_op()
+
+  def read_value(self):
+    return self._read_variable_op()
+
+  def _read_variable_op(self):
+    with ops.colocate_with(self._handle):
+      res = gen_resource_variable_ops.read_variable_op(self._handle,
+                                                       self._dtype)
+      if self._read_dtype != self._dtype:
+        return math_ops.cast(res, self._read_dtype)
+      else:
+        return res
+
+  def set_shape(self, shape):
+    self._shape = shape
+    self._cached_shape_as_list = None
+
+  @property
+  def op(self):
+    """The op for this variable."""
+    return self._var.op
+
+  @property
+  def read_dtype(self):
+    """The dtype of the returned tensor when reading the var."""
+    return self._read_dtype
+
+  def _dense_var_to_tensor(self, dtype=None, name=None, as_ref=False):
+    del name
+    dtype = dtype or self.read_dtype
+    if dtype != self.read_dtype or as_ref:
+      return NotImplemented
+    else:
+      res = self.value()
+    return res
+
+  def _should_act_as_resource_variable(self):
+    """To pass resource_variable_ops.is_resource_variable check."""
+    pass
+
 # Register a conversion function which reads the value of the variable,
 # allowing instances of the class to be used as tensors.
 
diff --git a/tensorflow/python/ops/rnn.py b/tensorflow/python/ops/rnn.py
index deba133fb9..7b6ab20975 100644
--- a/tensorflow/python/ops/rnn.py
+++ b/tensorflow/python/ops/rnn.py
@@ -417,24 +417,30 @@ def bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs, sequence_length=None,
 
     # Backward direction
     if not time_major:
-      time_dim = 1
-      batch_dim = 0
+      time_axis = 1
+      batch_axis = 0
     else:
-      time_dim = 0
-      batch_dim = 1
+      time_axis = 0
+      batch_axis = 1
 
-    def _reverse(input_, seq_lengths, seq_dim, batch_dim):
+    def _reverse(input_, seq_lengths, seq_axis, batch_axis):
       if seq_lengths is not None:
         return array_ops.reverse_sequence(
             input=input_, seq_lengths=seq_lengths,
-            seq_dim=seq_dim, batch_dim=batch_dim)
+            seq_axis=seq_axis, batch_axis=batch_axis)
       else:
-        return array_ops.reverse(input_, axis=[seq_dim])
+        return array_ops.reverse(input_, axis=[seq_axis])
 
     with vs.variable_scope("bw") as bw_scope:
-      inputs_reverse = _reverse(
-          inputs, seq_lengths=sequence_length,
-          seq_dim=time_dim, batch_dim=batch_dim)
+
+      def _map_reverse(inp):
+        return _reverse(
+            inp,
+            seq_lengths=sequence_length,
+            seq_axis=time_axis,
+            batch_axis=batch_axis)
+
+      inputs_reverse = nest.map_structure(_map_reverse, inputs)
       tmp, output_state_bw = dynamic_rnn(
           cell=cell_bw, inputs=inputs_reverse, sequence_length=sequence_length,
           initial_state=initial_state_bw, dtype=dtype,
@@ -443,7 +449,7 @@ def bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs, sequence_length=None,
 
   output_bw = _reverse(
       tmp, seq_lengths=sequence_length,
-      seq_dim=time_dim, batch_dim=batch_dim)
+      seq_axis=time_axis, batch_axis=batch_axis)
 
   outputs = (output_fw, output_bw)
   output_states = (output_state_fw, output_state_bw)
diff --git a/tensorflow/python/ops/rnn_cell_impl.py b/tensorflow/python/ops/rnn_cell_impl.py
index 82a044a0d4..42806ba6ec 100644
--- a/tensorflow/python/ops/rnn_cell_impl.py
+++ b/tensorflow/python/ops/rnn_cell_impl.py
@@ -47,7 +47,6 @@ from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training.checkpointable import base as checkpointable
-from tensorflow.python.training.checkpointable import tracking as checkpointable_tracking
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
@@ -55,16 +54,6 @@ from tensorflow.python.util.tf_export import tf_export
 _BIAS_VARIABLE_NAME = "bias"
 _WEIGHTS_VARIABLE_NAME = "kernel"
 
-
-# TODO(jblespiau): Remove this function when we are sure there are no longer
-# any usage (even if protected, it is being used). Prefer assert_like_rnncell.
-def _like_rnncell(cell):
-  """Checks that a given object is an RNNCell by using duck typing."""
-  conditions = [hasattr(cell, "output_size"), hasattr(cell, "state_size"),
-                hasattr(cell, "zero_state"), callable(cell)]
-  return all(conditions)
-
-
 # This can be used with self.assertRaisesRegexp for assert_like_rnncell.
 ASSERT_LIKE_RNNCELL_ERROR_REGEXP = "is not an RNNCell"
 
@@ -1272,6 +1261,11 @@ class MultiRNNCell(RNNCell):
       raise TypeError(
           "cells must be a list or tuple, but saw: %s." % cells)
 
+    if len(set([id(cell) for cell in cells])) < len(cells):
+      logging.log_first_n(logging.WARN,
+                          "At least two cells provided to MultiRNNCell "
+                          "are the same object and will share weights.", 1)
+
     self._cells = cells
     for cell_number, cell in enumerate(self._cells):
       # Add Checkpointable dependencies on these cells so their variables get
@@ -1330,48 +1324,3 @@ class MultiRNNCell(RNNCell):
                   array_ops.concat(new_states, 1))
 
     return cur_inp, new_states
-
-
-class _SlimRNNCell(RNNCell, checkpointable_tracking.NotCheckpointable):
-  """A simple wrapper for slim.rnn_cells."""
-
-  def __init__(self, cell_fn):
-    """Create a SlimRNNCell from a cell_fn.
-
-    Args:
-      cell_fn: a function which takes (inputs, state, scope) and produces the
-        outputs and the new_state. Additionally when called with inputs=None and
-        state=None it should return (initial_outputs, initial_state).
-
-    Raises:
-      TypeError: if cell_fn is not callable
-      ValueError: if cell_fn cannot produce a valid initial state.
-    """
-    if not callable(cell_fn):
-      raise TypeError("cell_fn %s needs to be callable", cell_fn)
-    self._cell_fn = cell_fn
-    self._cell_name = cell_fn.func.__name__
-    init_output, init_state = self._cell_fn(None, None)
-    output_shape = init_output.get_shape()
-    state_shape = init_state.get_shape()
-    self._output_size = output_shape.with_rank(2)[1].value
-    self._state_size = state_shape.with_rank(2)[1].value
-    if self._output_size is None:
-      raise ValueError("Initial output created by %s has invalid shape %s" %
-                       (self._cell_name, output_shape))
-    if self._state_size is None:
-      raise ValueError("Initial state created by %s has invalid shape %s" %
-                       (self._cell_name, state_shape))
-
-  @property
-  def state_size(self):
-    return self._state_size
-
-  @property
-  def output_size(self):
-    return self._output_size
-
-  def __call__(self, inputs, state, scope=None):
-    scope = scope or self._cell_name
-    output, state = self._cell_fn(inputs, state, scope=scope)
-    return output, state
diff --git a/tensorflow/python/ops/script_ops.py b/tensorflow/python/ops/script_ops.py
index 1e3f662ff3..af103d3cc7 100644
--- a/tensorflow/python/ops/script_ops.py
+++ b/tensorflow/python/ops/script_ops.py
@@ -130,7 +130,7 @@ class FuncRegistry(object):
   def __init__(self):
     self._lock = threading.Lock()
     self._unique_id = 0  # GUARDED_BY(self._lock)
-    # Only store weakrefs to the funtions. The strong reference is stored in
+    # Only store weakrefs to the functions. The strong reference is stored in
     # the graph.
     self._funcs = weakref.WeakValueDictionary()
 
diff --git a/tensorflow/python/ops/state_ops.py b/tensorflow/python/ops/state_ops.py
index 8cb6a0537e..2c93cf72c7 100644
--- a/tensorflow/python/ops/state_ops.py
+++ b/tensorflow/python/ops/state_ops.py
@@ -19,7 +19,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import gen_resource_variable_ops
@@ -124,9 +123,7 @@ def is_variable_initialized(ref, name=None):
   if ref.dtype._is_ref_dtype:
     return gen_state_ops.is_variable_initialized(ref=ref, name=name)
   # Handle resource variables.
-  if context.executing_eagerly() or ref.op.type == "VarHandleOp":
-    return gen_resource_variable_ops.var_is_initialized_op(ref.handle,
-                                                           name=name)
+  return ref.is_initialized(name=name)
 
 
 @tf_export("assign_sub")
diff --git a/tensorflow/python/ops/tensor_array_ops.py b/tensorflow/python/ops/tensor_array_ops.py
index cc92da4fd7..f86dfb3527 100644
--- a/tensorflow/python/ops/tensor_array_ops.py
+++ b/tensorflow/python/ops/tensor_array_ops.py
@@ -554,7 +554,7 @@ class _EagerTensorArray(object):
       self._tensor_array.extend([None for _ in range(index - size + 1)])
 
     if not isinstance(value, ops.EagerTensor):
-      value = constant_op.constant(value)
+      value = ops.convert_to_tensor(value)
 
     if self._infer_shape:
       if self._element_shape is None:
@@ -633,8 +633,8 @@ class _EagerTensorArray(object):
   def split(self, value, lengths, name=None):
     """See TensorArray."""
     # error checking to match graph-mode errors
-    value = constant_op.constant(value)
-    lengths = constant_op.constant(lengths)
+    value = ops.convert_to_tensor(value)
+    lengths = ops.convert_to_tensor(lengths)
     sum_lengths = math_ops.reduce_sum(lengths)
     if lengths.shape.ndims != 1:
       raise errors_impl.InvalidArgumentError(
diff --git a/tensorflow/python/ops/variable_scope.py b/tensorflow/python/ops/variable_scope.py
index 1e06bf07d5..aca44bcd44 100644
--- a/tensorflow/python/ops/variable_scope.py
+++ b/tensorflow/python/ops/variable_scope.py
@@ -191,36 +191,9 @@ class _ReuseMode(enum.Enum):
   # REUSE_TRUE = 3
 
 
-@tf_export("VariableSynchronization")
-class VariableSynchronization(enum.Enum):
-  """Indicates when a distributed variable will be synced."""
-
-  # Indicates that the synchronization will be determined by the current
-  # `DistributionStrategy` (eg. With `MirroredStrategy` this would be
-  # `ON_WRITE`).
-  AUTO = 0
-
-  # Indicates that there will only be one copy of the variable, so there is no
-  # need to sync.
-  NONE = 1
-
-  # Indicates that the variable will be aggregated across devices
-  # every time it is updated.
-  ON_WRITE = 2
-
-  # Indicates that the variable will be aggregated across devices
-  # when it is read (eg. when checkpointing or when evaluating an op that uses
-  # the variable).
-  ON_READ = 3
-
-
-@tf_export("VariableAggregation")
-class VariableAggregation(enum.Enum):
-  """Indicates how a distributed variable will be aggregated."""
-  NONE = 0
-  SUM = 1
-  MEAN = 2
-
+# TODO(apassos) remove these forwarding symbols.
+VariableSynchronization = variables.VariableSynchronization  # pylint: disable=invalid-name
+VariableAggregation = variables.VariableAggregation  # pylint: disable=invalid-name
 
 AUTO_REUSE = _ReuseMode.AUTO_REUSE
 tf_export("AUTO_REUSE").export_constant(__name__, "AUTO_REUSE")
@@ -255,7 +228,7 @@ class _VariableStore(object):
                    initializer=None,
                    regularizer=None,
                    reuse=None,
-                   trainable=True,
+                   trainable=None,
                    collections=None,
                    caching_device=None,
                    partitioner=None,
@@ -300,6 +273,8 @@ class _VariableStore(object):
         forced to be False.
       trainable: If `True` also add the variable to the graph collection
         `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
+        `trainable` defaults to `True` unless `synchronization` is
+        set to `ON_READ`.
       collections: List of graph collections keys to add the `Variable` to.
         Defaults to `[GraphKeys.GLOBAL_VARIABLES]` (see `tf.Variable`).
       caching_device: Optional device string or function describing where the
@@ -341,7 +316,8 @@ class _VariableStore(object):
         aggregated. Accepted values are constants defined in the class
         @{tf.VariableSynchronization}. By default the synchronization is set to
         `AUTO` and the current `DistributionStrategy` chooses
-        when to synchronize.
+        when to synchronize. If `synchronization` is set to `ON_READ`,
+        `trainable` must not be set to `True`.
       aggregation: Indicates how a distributed variable will be aggregated.
         Accepted values are constants defined in the class
         @{tf.VariableAggregation}.
@@ -404,7 +380,7 @@ class _VariableStore(object):
         initializer=None,
         regularizer=None,
         reuse=None,
-        trainable=True,
+        trainable=None,
         collections=None,
         caching_device=None,
         partitioner=None,
@@ -477,6 +453,10 @@ class _VariableStore(object):
           synchronization=synchronization,
           aggregation=aggregation)
 
+    # Set trainable value based on synchronization value.
+    trainable = _get_trainable_value(
+        synchronization=synchronization, trainable=trainable)
+
     if custom_getter is not None:
       # Handle backwards compatibility with getter arguments that were added
       # to the API after users started writing custom getters.
@@ -519,11 +499,20 @@ class _VariableStore(object):
           synchronization=synchronization,
           aggregation=aggregation)
 
-  def _get_partitioned_variable(
-      self, name, partitioner, shape=None, dtype=dtypes.float32,
-      initializer=None, regularizer=None, reuse=None,
-      trainable=True, collections=None, caching_device=None,
-      validate_shape=True, use_resource=None, constraint=None):
+  def _get_partitioned_variable(self,
+                                name,
+                                partitioner,
+                                shape=None,
+                                dtype=dtypes.float32,
+                                initializer=None,
+                                regularizer=None,
+                                reuse=None,
+                                trainable=None,
+                                collections=None,
+                                caching_device=None,
+                                validate_shape=True,
+                                use_resource=None,
+                                constraint=None):
     """Gets or creates a sharded variable list with these parameters.
 
     The `partitioner` must be a callable that accepts a fully defined
@@ -773,7 +762,7 @@ class _VariableStore(object):
                            regularizer=None,
                            partition_info=None,
                            reuse=None,
-                           trainable=True,
+                           trainable=None,
                            collections=None,
                            caching_device=None,
                            validate_shape=True,
@@ -1136,7 +1125,7 @@ class VariableScope(object):
                    initializer=None,
                    regularizer=None,
                    reuse=None,
-                   trainable=True,
+                   trainable=None,
                    collections=None,
                    caching_device=None,
                    partitioner=None,
@@ -1207,7 +1196,7 @@ class VariableScope(object):
                                 dtype=None,
                                 initializer=None,
                                 regularizer=None,
-                                trainable=True,
+                                trainable=None,
                                 collections=None,
                                 caching_device=None,
                                 partitioner=None,
@@ -1422,7 +1411,7 @@ def get_variable(name,
                  dtype=None,
                  initializer=None,
                  regularizer=None,
-                 trainable=True,
+                 trainable=None,
                  collections=None,
                  caching_device=None,
                  partitioner=None,
@@ -2334,43 +2323,64 @@ def _compute_slice_dim_and_shape(full_shape, slicing):
   return slice_dim, slice_shape
 
 
+def _get_trainable_value(synchronization, trainable):
+  """Computes the trainable value based on the given arguments."""
+  if synchronization == VariableSynchronization.ON_READ:
+    if trainable:
+      raise ValueError(
+          "Synchronization value can be set to "
+          "VariableSynchronization.ON_READ only for non-trainable variables. "
+          "You have specified trainable=True and "
+          "synchronization=VariableSynchronization.ON_READ.")
+    else:
+      # Set trainable to be false when variable is to be synced on read.
+      trainable = False
+  elif trainable is None:
+    trainable = True
+  return trainable
+
+
 def default_variable_creator(next_creator=None, **kwargs):
   """Default variable creator."""
   assert next_creator is None
   initial_value = kwargs.get("initial_value", None)
-  trainable = kwargs.get("trainable", True)
+  trainable = kwargs.get("trainable", None)
   collections = kwargs.get("collections", None)
   validate_shape = kwargs.get("validate_shape", True)
   caching_device = kwargs.get("caching_device", None)
   name = kwargs.get("name", None)
+  variable_def = kwargs.get("variable_def", None)
   dtype = kwargs.get("dtype", None)
+  expected_shape = kwargs.get("expected_shape", None)
+  import_scope = kwargs.get("import_scope", None)
   constraint = kwargs.get("constraint", None)
   use_resource = kwargs.get("use_resource", None)
 
-  # Enforce `ON_READ` variables to be not trainable.
+  # Set trainable value based on synchronization value.
   synchronization = kwargs.get("synchronization", VariableSynchronization.AUTO)
-  if synchronization == VariableSynchronization.ON_READ:
-    trainable = False
+  trainable = _get_trainable_value(
+      synchronization=synchronization, trainable=trainable)
 
   if use_resource is None:
     use_resource = get_variable_scope().use_resource
-  if use_resource or (use_resource is None and context.executing_eagerly()):
+  use_resource = use_resource or context.executing_eagerly()
+  if use_resource:
     return resource_variable_ops.ResourceVariable(
         initial_value=initial_value, trainable=trainable,
         collections=collections, validate_shape=validate_shape,
         caching_device=caching_device, name=name, dtype=dtype,
-        constraint=constraint)
-  elif not use_resource and context.executing_eagerly():
-    raise RuntimeError(
-        "VariableScope should use resource variable when eager execution is"
-        " enabled, but use_resource is False."
-    )
+        constraint=constraint, variable_def=variable_def,
+        import_scope=import_scope)
   else:
-    return variables.Variable(
+    return variables.RefVariable(
         initial_value=initial_value, trainable=trainable,
         collections=collections, validate_shape=validate_shape,
         caching_device=caching_device, name=name, dtype=dtype,
-        constraint=constraint)
+        constraint=constraint, variable_def=variable_def,
+        expected_shape=expected_shape, import_scope=import_scope)
+
+
+variables.default_variable_creator = default_variable_creator
 
 
 def _make_getter(captured_getter, captured_previous):
@@ -2378,36 +2388,8 @@ def _make_getter(captured_getter, captured_previous):
   return lambda **kwargs: captured_getter(captured_previous, **kwargs)
 
 
-def variable(initial_value=None,
-             trainable=True,
-             collections=None,
-             validate_shape=True,
-             caching_device=None,
-             name=None,
-             dtype=None,
-             constraint=None,
-             use_resource=None,
-             synchronization=VariableSynchronization.AUTO,
-             aggregation=VariableAggregation.NONE):
-  previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
-  for getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
-    previous_getter = _make_getter(getter, previous_getter)
-
-  # Reset `aggregation` that is explicitly set as `None` to the enum None value.
-  if aggregation is None:
-    aggregation = VariableAggregation.NONE
-  return previous_getter(
-      initial_value=initial_value,
-      trainable=trainable,
-      collections=collections,
-      validate_shape=validate_shape,
-      caching_device=caching_device,
-      name=name,
-      dtype=dtype,
-      constraint=constraint,
-      use_resource=use_resource,
-      synchronization=synchronization,
-      aggregation=aggregation)
+# TODO(apassos) remove forwarding symbol
+variable = variables.Variable
 
 
 @tf_contextlib.contextmanager
@@ -2441,6 +2423,8 @@ def variable_creator_scope(variable_creator):
       trainable: If `True`, the default, also adds the variable to the graph
         collection `GraphKeys.TRAINABLE_VARIABLES`. This collection is used as
         the default list of variables to use by the `Optimizer` classes.
+        `trainable` defaults to `True` unless `synchronization` is
+        set to `ON_READ`.
       collections: List of graph collections keys. The new variable is added to
         these collections. Defaults to `[GraphKeys.GLOBAL_VARIABLES]`.
       validate_shape: If `False`, allows the variable to be initialized with a
@@ -2463,7 +2447,8 @@ def variable_creator_scope(variable_creator):
         aggregated. Accepted values are constants defined in the class
         @{tf.VariableSynchronization}. By default the synchronization is set to
         `AUTO` and the current `DistributionStrategy` chooses
-        when to synchronize.
+        when to synchronize. If `synchronization` is set to `ON_READ`,
+        `trainable` must not be set to `True`.
       aggregation: Indicates how a distributed variable will be aggregated.
         Accepted values are constants defined in the class
         @{tf.VariableAggregation}.
diff --git a/tensorflow/python/ops/variables.py b/tensorflow/python/ops/variables.py
index 9a09cdaa52..fc00ce68ae 100644
--- a/tensorflow/python/ops/variables.py
+++ b/tensorflow/python/ops/variables.py
@@ -17,6 +17,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import enum  # pylint: disable=g-bad-import-order
+
+import six
+
 from tensorflow.core.framework import attr_value_pb2
 from tensorflow.core.framework import variable_pb2
 from tensorflow.python.eager import context
@@ -36,8 +40,101 @@ from tensorflow.python.util.deprecation import deprecated
 from tensorflow.python.util.tf_export import tf_export
 
 
+def default_variable_creator(_, **kwds):
+  del kwds
+  raise NotImplementedError("variable_scope needs to be imported")
+
+
+def _make_getter(captured_getter, captured_previous):
+  """To avoid capturing loop variables."""
+  def getter(**kwargs):
+    return captured_getter(captured_previous, **kwargs)
+  return getter
+
+
+@tf_export("VariableSynchronization")
+class VariableSynchronization(enum.Enum):
+  """Indicates when a distributed variable will be synced."""
+
+  # Indicates that the synchronization will be determined by the current
+  # `DistributionStrategy` (eg. With `MirroredStrategy` this would be
+  # `ON_WRITE`).
+  AUTO = 0
+
+  # Indicates that there will only be one copy of the variable, so there is no
+  # need to sync.
+  NONE = 1
+
+  # Indicates that the variable will be aggregated across devices
+  # every time it is updated.
+  ON_WRITE = 2
+
+  # Indicates that the variable will be aggregated across devices
+  # when it is read (eg. when checkpointing or when evaluating an op that uses
+  # the variable).
+  ON_READ = 3
+
+
+@tf_export("VariableAggregation")
+class VariableAggregation(enum.Enum):
+  """Indicates how a distributed variable will be aggregated."""
+  NONE = 0
+  SUM = 1
+  MEAN = 2
+
+
+class VariableMetaclass(type):
+  """Metaclass to allow construction of tf.Variable to be overridden."""
+
+  def _variable_call(cls,
+                     initial_value=None,
+                     trainable=None,
+                     collections=None,
+                     validate_shape=True,
+                     caching_device=None,
+                     name=None,
+                     variable_def=None,
+                     dtype=None,
+                     expected_shape=None,
+                     import_scope=None,
+                     constraint=None,
+                     use_resource=None,
+                     synchronization=VariableSynchronization.AUTO,
+                     aggregation=VariableAggregation.NONE):
+    """Call on Variable class. Useful to force the signature."""
+    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
+    for getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
+      previous_getter = _make_getter(getter, previous_getter)
+
+    # Reset `aggregation` that is explicitly set as `None` to the enum NONE.
+    if aggregation is None:
+      aggregation = VariableAggregation.NONE
+    return previous_getter(
+        initial_value=initial_value,
+        trainable=trainable,
+        collections=collections,
+        validate_shape=validate_shape,
+        caching_device=caching_device,
+        name=name,
+        variable_def=variable_def,
+        dtype=dtype,
+        expected_shape=expected_shape,
+        import_scope=import_scope,
+        constraint=constraint,
+        use_resource=use_resource,
+        synchronization=synchronization,
+        aggregation=aggregation)
+
+  def __call__(cls, *args, **kwargs):
+    if cls is Variable:
+      return cls._variable_call(*args, **kwargs)
+    else:
+      return super(VariableMetaclass, cls).__call__(*args, **kwargs)
+
+
 @tf_export("Variable")
-class Variable(checkpointable.CheckpointableBase):
+class Variable(six.with_metaclass(VariableMetaclass,
+                                  checkpointable.CheckpointableBase)):
   """See the @{$variables$Variables How To} for a high level overview.
 
   A variable maintains state in the graph across calls to `run()`. You add a
@@ -123,37 +220,33 @@ class Variable(checkpointable.CheckpointableBase):
   various `Optimizer` classes use this collection as the default list of
   variables to optimize.
 
-  WARNING: tf.Variable objects have a non-intuitive memory model. A Variable is
-  represented internally as a mutable Tensor which can non-deterministically
-  alias other Tensors in a graph. The set of operations which consume a Variable
-  and can lead to aliasing is undetermined and can change across TensorFlow
-  versions. Avoid writing code which relies on the value of a Variable either
-  changing or not changing as other operations happen. For example, using
-  Variable objects or simple functions thereof as predicates in a `tf.cond` is
-  dangerous and error-prone:
+  WARNING: tf.Variable objects by default have a non-intuitive memory model. A
+  Variable is represented internally as a mutable Tensor which can
+  non-deterministically alias other Tensors in a graph. The set of operations
+  which consume a Variable and can lead to aliasing is undetermined and can
+  change across TensorFlow versions. Avoid writing code which relies on the
+  value of a Variable either changing or not changing as other operations
+  happen. For example, using Variable objects or simple functions thereof as
+  predicates in a `tf.cond` is dangerous and error-prone:
 
   ```
   v = tf.Variable(True)
   tf.cond(v, lambda: v.assign(False), my_false_fn)  # Note: this is broken.
   ```
 
-  Here replacing tf.Variable with tf.contrib.eager.Variable will fix any
-  nondeterminism issues.
+  Here replacing adding `use_resource=True` when constructing the variable will
+  fix any nondeterminism issues:
+  ```
+  v = tf.Variable(True, use_resource=True)
+  tf.cond(v, lambda: v.assign(False), my_false_fn)
+  ```
 
   To use the replacement for variables which does
   not have these issues:
 
-  * Replace `tf.Variable` with `tf.contrib.eager.Variable`;
+  * Add `use_resource=True` when constructing `tf.Variable`;
   * Call `tf.get_variable_scope().set_use_resource(True)` inside a
     `tf.variable_scope` before the `tf.get_variable()` call.
-
-  @compatibility(eager)
-  `tf.Variable` is not compatible with eager execution.  Use
-  `tf.contrib.eager.Variable` instead which is compatible with both eager
-  execution and graph construction.  See [the TensorFlow Eager Execution
-  guide](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/g3doc/guide.md#variables-and-optimizers)
-  for details on how variables work in eager execution.
-  @end_compatibility
   """
 
   def __init__(self,
@@ -167,7 +260,10 @@ class Variable(checkpointable.CheckpointableBase):
                dtype=None,
                expected_shape=None,
                import_scope=None,
-               constraint=None):
+               constraint=None,
+               use_resource=None,
+               synchronization=VariableSynchronization.AUTO,
+               aggregation=VariableAggregation.NONE):
     """Creates a new variable with value `initial_value`.
 
     The new variable is added to the graph collections listed in `collections`,
@@ -219,25 +315,565 @@ class Variable(checkpointable.CheckpointableBase):
         variable and return the Tensor for the projected value
         (which must have the same shape). Constraints are not safe to
         use when doing asynchronous distributed training.
+      use_resource: if True, a ResourceVariable is created; otherwise an
+       old-style ref-based variable is created. When eager execution is enabled
+       a resource variable is always created.
+      synchronization: Indicates when a distributed a variable will be
+        aggregated. Accepted values are constants defined in the class
+        @{tf.VariableSynchronization}. By default the synchronization is set to
+        `AUTO` and the current `DistributionStrategy` chooses
+        when to synchronize. If `synchronization` is set to `ON_READ`,
+        `trainable` must not be set to `True`.
+      aggregation: Indicates how a distributed variable will be aggregated.
+        Accepted values are constants defined in the class
+        @{tf.VariableAggregation}.
 
     Raises:
       ValueError: If both `variable_def` and initial_value are specified.
       ValueError: If the initial value is not specified, or does not have a
         shape and `validate_shape` is `True`.
       RuntimeError: If eager execution is enabled.
+    """
+    raise NotImplementedError
+
+  def __repr__(self):
+    raise NotImplementedError
+
+  def value(self):
+    """Returns the last snapshot of this variable.
+
+    You usually do not need to call this method as all ops that need the value
+    of the variable call it automatically through a `convert_to_tensor()` call.
+
+    Returns a `Tensor` which holds the value of the variable.  You can not
+    assign a new value to this tensor as it is not a reference to the variable.
+
+    To avoid copies, if the consumer of the returned value is on the same device
+    as the variable, this actually returns the live value of the variable, not
+    a copy.  Updates to the variable are seen by the consumer.  If the consumer
+    is on a different device it will get a copy of the variable.
 
-    @compatibility(eager)
-    `tf.Variable` is not compatible with eager execution.  Use
-    `tfe.Variable` instead which is compatible with both eager execution
-    and graph construction.  See [the TensorFlow Eager Execution
-    guide](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/g3doc/guide.md#variables-and-optimizers)
-    for details on how variables work in eager execution.
-    @end_compatibility
+    Returns:
+      A `Tensor` containing the value of the variable.
+    """
+    raise NotImplementedError
+
+  def read_value(self):
+    """Returns the value of this variable, read in the current context.
+
+    Can be different from value() if it's on another device, with control
+    dependencies, etc.
+
+    Returns:
+      A `Tensor` containing the value of the variable.
+    """
+    raise NotImplementedError
+
+  def set_shape(self, shape):
+    """Overrides the shape for this variable.
+
+    Args:
+      shape: the `TensorShape` representing the overridden shape.
+    """
+    raise NotImplementedError
+
+  @property
+  def trainable(self):
+    raise NotImplementedError
+
+  def eval(self, session=None):
+    """In a session, computes and returns the value of this variable.
+
+    This is not a graph construction method, it does not add ops to the graph.
+
+    This convenience method requires a session where the graph
+    containing this variable has been launched. If no session is
+    passed, the default session is used.  See @{tf.Session} for more
+    information on launching a graph and on sessions.
+
+    ```python
+    v = tf.Variable([1, 2])
+    init = tf.global_variables_initializer()
+
+    with tf.Session() as sess:
+        sess.run(init)
+        # Usage passing the session explicitly.
+        print(v.eval(sess))
+        # Usage with the default session.  The 'with' block
+        # above makes 'sess' the default session.
+        print(v.eval())
+    ```
+
+    Args:
+      session: The session to use to evaluate this variable. If
+        none, the default session is used.
+
+    Returns:
+      A numpy `ndarray` with a copy of the value of this variable.
+    """
+    raise NotImplementedError
+
+  def initialized_value(self):
+    """Returns the value of the initialized variable.
+
+    You should use this instead of the variable itself to initialize another
+    variable with a value that depends on the value of this variable.
+
+    ```python
+    # Initialize 'v' with a random tensor.
+    v = tf.Variable(tf.truncated_normal([10, 40]))
+    # Use `initialized_value` to guarantee that `v` has been
+    # initialized before its value is used to initialize `w`.
+    # The random values are picked only once.
+    w = tf.Variable(v.initialized_value() * 2.0)
+    ```
+
+    Returns:
+      A `Tensor` holding the value of this variable after its initializer
+      has run.
+    """
+    raise NotImplementedError
+
+  @property
+  def initial_value(self):
+    """Returns the Tensor used as the initial value for the variable.
+
+    Note that this is different from `initialized_value()` which runs
+    the op that initializes the variable before returning its value.
+    This method returns the tensor that is used by the op that initializes
+    the variable.
+
+    Returns:
+      A `Tensor`.
+    """
+    raise NotImplementedError
+
+  @property
+  def constraint(self):
+    """Returns the constraint function associated with this variable.
+
+    Returns:
+      The constraint function that was passed to the variable constructor.
+      Can be `None` if no constraint was passed.
+    """
+    raise NotImplementedError
+
+  def assign(self, value, use_locking=False):
+    """Assigns a new value to the variable.
+
+    This is essentially a shortcut for `assign(self, value)`.
+
+    Args:
+      value: A `Tensor`. The new value for this variable.
+      use_locking: If `True`, use locking during the assignment.
+
+    Returns:
+      A `Tensor` that will hold the new value of this variable after
+      the assignment has completed.
+    """
+    raise NotImplementedError
+
+  def assign_add(self, delta, use_locking=False):
+    """Adds a value to this variable.
+
+     This is essentially a shortcut for `assign_add(self, delta)`.
+
+    Args:
+      delta: A `Tensor`. The value to add to this variable.
+      use_locking: If `True`, use locking during the operation.
+
+    Returns:
+      A `Tensor` that will hold the new value of this variable after
+      the addition has completed.
+    """
+    raise NotImplementedError
+
+  def assign_sub(self, delta, use_locking=False):
+    """Subtracts a value from this variable.
+
+    This is essentially a shortcut for `assign_sub(self, delta)`.
+
+    Args:
+      delta: A `Tensor`. The value to subtract from this variable.
+      use_locking: If `True`, use locking during the operation.
+
+    Returns:
+      A `Tensor` that will hold the new value of this variable after
+      the subtraction has completed.
+    """
+    raise NotImplementedError
+
+  def scatter_sub(self, sparse_delta, use_locking=False):
+    """Subtracts `IndexedSlices` from this variable.
+
+    This is essentially a shortcut for `scatter_sub(self, sparse_delta.indices,
+    sparse_delta.values)`.
+
+    Args:
+      sparse_delta: `IndexedSlices` to be subtracted from this variable.
+      use_locking: If `True`, use locking during the operation.
+
+    Returns:
+      A `Tensor` that will hold the new value of this variable after
+      the scattered subtraction has completed.
+
+    Raises:
+      ValueError: if `sparse_delta` is not an `IndexedSlices`.
+    """
+    raise NotImplementedError
+
+  def count_up_to(self, limit):
+    """Increments this variable until it reaches `limit`.
+
+    When that Op is run it tries to increment the variable by `1`. If
+    incrementing the variable would bring it above `limit` then the Op raises
+    the exception `OutOfRangeError`.
+
+    If no error is raised, the Op outputs the value of the variable before
+    the increment.
+
+    This is essentially a shortcut for `count_up_to(self, limit)`.
+
+    Args:
+      limit: value at which incrementing the variable raises an error.
+
+    Returns:
+      A `Tensor` that will hold the variable value before the increment. If no
+      other Op modifies this variable, the values produced will all be
+      distinct.
+    """
+    raise NotImplementedError
+
+  def load(self, value, session=None):
+    """Load new value into this variable.
+
+    Writes new value to variable's memory. Doesn't add ops to the graph.
+
+    This convenience method requires a session where the graph
+    containing this variable has been launched. If no session is
+    passed, the default session is used.  See @{tf.Session} for more
+    information on launching a graph and on sessions.
+
+    ```python
+    v = tf.Variable([1, 2])
+    init = tf.global_variables_initializer()
+
+    with tf.Session() as sess:
+        sess.run(init)
+        # Usage passing the session explicitly.
+        v.load([2, 3], sess)
+        print(v.eval(sess)) # prints [2 3]
+        # Usage with the default session.  The 'with' block
+        # above makes 'sess' the default session.
+        v.load([3, 4], sess)
+        print(v.eval()) # prints [3 4]
+    ```
+
+    Args:
+        value: New variable value
+        session: The session to use to evaluate this variable. If
+          none, the default session is used.
+
+    Raises:
+        ValueError: Session is not passed and no default session
+    """
+    raise NotImplementedError
+
+  # Conversion to tensor.
+  @staticmethod
+  def _TensorConversionFunction(v, dtype=None, name=None, as_ref=False):  # pylint: disable=invalid-name
+    """Utility function for converting a Variable to a Tensor."""
+    _ = name
+    if dtype and not dtype.is_compatible_with(v.dtype):
+      raise ValueError(
+          "Incompatible type conversion requested to type '%s' for variable "
+          "of type '%s'" % (dtype.name, v.dtype.name))
+    if as_ref:
+      return v._ref()  # pylint: disable=protected-access
+    else:
+      return v.value()
+
+  @staticmethod
+  def _OverloadAllOperators():  # pylint: disable=invalid-name
+    """Register overloads for all operators."""
+    for operator in ops.Tensor.OVERLOADABLE_OPERATORS:
+      Variable._OverloadOperator(operator)
+    # For slicing, bind getitem differently than a tensor (use SliceHelperVar
+    # instead)
+    # pylint: disable=protected-access
+    setattr(Variable, "__getitem__", array_ops._SliceHelperVar)
+
+  @staticmethod
+  def _OverloadOperator(operator):  # pylint: disable=invalid-name
+    """Defer an operator overload to `ops.Tensor`.
+
+    We pull the operator out of ops.Tensor dynamically to avoid ordering issues.
+
+    Args:
+      operator: string. The operator name.
+    """
+
+    def _run_op(a, *args):
+      # pylint: disable=protected-access
+      return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
+    # Propagate __doc__ to wrapper
+    try:
+      _run_op.__doc__ = getattr(ops.Tensor, operator).__doc__
+    except AttributeError:
+      pass
+
+    setattr(Variable, operator, _run_op)
+
+  # NOTE(mrry): This enables the Variable's overloaded "right" binary
+  # operators to run when the left operand is an ndarray, because it
+  # accords the Variable class higher priority than an ndarray, or a
+  # numpy matrix.
+  # TODO(mrry): Convert this to using numpy's __numpy_ufunc__
+  # mechanism, which allows more control over how Variables interact
+  # with ndarrays.
+  __array_priority__ = 100
+
+  @property
+  def name(self):
+    """The name of this variable."""
+    raise NotImplementedError
+
+  @property
+  def initializer(self):
+    """The initializer operation for this variable."""
+    raise NotImplementedError
+
+  @property
+  def device(self):
+    """The device of this variable."""
+    raise NotImplementedError
+
+  @property
+  def dtype(self):
+    """The `DType` of this variable."""
+    raise NotImplementedError
+
+  @property
+  def op(self):
+    """The `Operation` of this variable."""
+    raise NotImplementedError
+
+  @property
+  def graph(self):
+    """The `Graph` of this variable."""
+    raise NotImplementedError
+
+  @property
+  def shape(self):
+    """The `TensorShape` of this variable.
+
+    Returns:
+      A `TensorShape`.
+    """
+    raise NotImplementedError
+
+  def get_shape(self):
+    """Alias of Variable.shape."""
+    raise NotImplementedError
+
+  def to_proto(self, export_scope=None):
+    """Converts a `Variable` to a `VariableDef` protocol buffer.
+
+    Args:
+      export_scope: Optional `string`. Name scope to remove.
+
+    Returns:
+      A `VariableDef` protocol buffer, or `None` if the `Variable` is not
+      in the specified name scope.
+    """
+    raise NotImplementedError
+
+  @staticmethod
+  def from_proto(variable_def, import_scope=None):
+    """Returns a `Variable` object created from `variable_def`."""
+    return RefVariable(variable_def=variable_def,
+                       import_scope=import_scope)
+
+  class SaveSliceInfo(object):
+    """Information on how to save this Variable as a slice.
+
+    Provides internal support for saving variables as slices of a larger
+    variable.  This API is not public and is subject to change.
+
+    Available properties:
+
+    * full_name
+    * full_shape
+    * var_offset
+    * var_shape
+    """
+
+    def __init__(self,
+                 full_name=None,
+                 full_shape=None,
+                 var_offset=None,
+                 var_shape=None,
+                 save_slice_info_def=None,
+                 import_scope=None):
+      """Create a `SaveSliceInfo`.
+
+      Args:
+        full_name: Name of the full variable of which this `Variable` is a
+            slice.
+        full_shape: Shape of the full variable, as a list of int.
+        var_offset: Offset of this `Variable` into the full variable, as a
+            list of int.
+        var_shape: Shape of this `Variable`, as a list of int.
+        save_slice_info_def: `SaveSliceInfoDef` protocol buffer. If not `None`,
+          recreates the SaveSliceInfo object its contents.
+          `save_slice_info_def` and other arguments are mutually
+          exclusive.
+        import_scope: Optional `string`. Name scope to add. Only used
+          when initializing from protocol buffer.
+      """
+      if save_slice_info_def:
+        assert isinstance(save_slice_info_def, variable_pb2.SaveSliceInfoDef)
+        self.full_name = ops.prepend_name_scope(
+            save_slice_info_def.full_name, import_scope=import_scope)
+        self.full_shape = [i for i in save_slice_info_def.full_shape]
+        self.var_offset = [i for i in save_slice_info_def.var_offset]
+        self.var_shape = [i for i in save_slice_info_def.var_shape]
+      else:
+        self.full_name = full_name
+        self.full_shape = full_shape
+        self.var_offset = var_offset
+        self.var_shape = var_shape
+
+    @property
+    def spec(self):
+      """Computes the spec string used for saving."""
+      full_shape_str = " ".join(["%d" % d for d in self.full_shape]) + " "
+      sl_spec = ":".join([
+          "%d,%d" % (o, s) for o, s in zip(self.var_offset, self.var_shape)
+      ])
+      return full_shape_str + sl_spec
+
+    def to_proto(self, export_scope=None):
+      """Returns a SaveSliceInfoDef() proto.
+
+      Args:
+        export_scope: Optional `string`. Name scope to remove.
+
+      Returns:
+        A `SaveSliceInfoDef` protocol buffer, or None if the `Variable` is not
+        in the specified name scope.
+      """
+      if (export_scope is None or
+          self.full_name.startswith(export_scope)):
+        save_slice_info_def = variable_pb2.SaveSliceInfoDef()
+        save_slice_info_def.full_name = ops.strip_name_scope(
+            self.full_name, export_scope)
+        for i in self.full_shape:
+          save_slice_info_def.full_shape.append(i)
+        for i in self.var_offset:
+          save_slice_info_def.var_offset.append(i)
+        for i in self.var_shape:
+          save_slice_info_def.var_shape.append(i)
+        return save_slice_info_def
+      else:
+        return None
+
+  def __iadd__(self, other):
+    raise NotImplementedError
+
+  def __isub__(self, other):
+    raise NotImplementedError
+
+  def __imul__(self, other):
+    raise NotImplementedError
+
+  def __idiv__(self, other):
+    raise NotImplementedError
+
+  def __itruediv__(self, other):
+    raise NotImplementedError
+
+  def __irealdiv__(self, other):
+    raise NotImplementedError
+
+  def __ipow__(self, other):
+    raise NotImplementedError
+
+
+# TODO(apassos): do not repeat all comments here
+class RefVariable(Variable):
+  """Ref-based implementation of variables."""
+
+  def __init__(self,
+               initial_value=None,
+               trainable=True,
+               collections=None,
+               validate_shape=True,
+               caching_device=None,
+               name=None,
+               variable_def=None,
+               dtype=None,
+               expected_shape=None,
+               import_scope=None,
+               constraint=None):
+    """Creates a new variable with value `initial_value`.
+
+    The new variable is added to the graph collections listed in `collections`,
+    which defaults to `[GraphKeys.GLOBAL_VARIABLES]`.
+
+    If `trainable` is `True` the variable is also added to the graph collection
+    `GraphKeys.TRAINABLE_VARIABLES`.
+
+    This constructor creates both a `variable` Op and an `assign` Op to set the
+    variable to its initial value.
+
+    Args:
+      initial_value: A `Tensor`, or Python object convertible to a `Tensor`,
+        which is the initial value for the Variable. The initial value must have
+        a shape specified unless `validate_shape` is set to False. Can also be a
+        callable with no argument that returns the initial value when called. In
+        that case, `dtype` must be specified. (Note that initializer functions
+        from init_ops.py must first be bound to a shape before being used here.)
+      trainable: If `True`, the default, also adds the variable to the graph
+        collection `GraphKeys.TRAINABLE_VARIABLES`. This collection is used as
+        the default list of variables to use by the `Optimizer` classes.
+      collections: List of graph collections keys. The new variable is added to
+        these collections. Defaults to `[GraphKeys.GLOBAL_VARIABLES]`.
+      validate_shape: If `False`, allows the variable to be initialized with a
+        value of unknown shape. If `True`, the default, the shape of
+        `initial_value` must be known.
+      caching_device: Optional device string describing where the Variable
+        should be cached for reading.  Defaults to the Variable's device.
+        If not `None`, caches on another device.  Typical use is to cache
+        on the device where the Ops using the Variable reside, to deduplicate
+        copying through `Switch` and other conditional statements.
+      name: Optional name for the variable. Defaults to `'Variable'` and gets
+        uniquified automatically.
+      variable_def: `VariableDef` protocol buffer. If not `None`, recreates
+        the Variable object with its contents, referencing the variable's nodes
+        in the graph, which must already exist. The graph is not changed.
+        `variable_def` and the other arguments are mutually exclusive.
+      dtype: If set, initial_value will be converted to the given type.
+        If `None`, either the datatype will be kept (if `initial_value` is
+        a Tensor), or `convert_to_tensor` will decide.
+      expected_shape: A TensorShape. If set, initial_value is expected
+        to have this shape.
+      import_scope: Optional `string`. Name scope to add to the
+        `Variable.` Only used when initializing from protocol buffer.
+      constraint: An optional projection function to be applied to the variable
+        after being updated by an `Optimizer` (e.g. used to implement norm
+        constraints or value constraints for layer weights). The function must
+        take as input the unprojected Tensor representing the value of the
+        variable and return the Tensor for the projected value
+        (which must have the same shape). Constraints are not safe to
+        use when doing asynchronous distributed training.
+
+    Raises:
+      ValueError: If both `variable_def` and initial_value are specified.
+      ValueError: If the initial value is not specified, or does not have a
+        shape and `validate_shape` is `True`.
+      RuntimeError: If eager execution is enabled.
     """
-    if context.executing_eagerly():
-      raise RuntimeError(
-          "tf.Variable not supported when eager execution is enabled. "
-          "Please use tf.contrib.eager.Variable instead")
     self._in_graph_mode = True
     if variable_def:
       # If variable_def is provided, recreates the variable from its fields.
@@ -348,8 +984,7 @@ class Variable(checkpointable.CheckpointableBase):
       # Ensure that we weren't lifted into the eager context.
       if context.executing_eagerly():
         raise RuntimeError(
-            "tf.Variable not supported when eager execution is enabled. "
-            "Please use tf.contrib.eager.Variable instead")
+            "RefVariable not supported when eager execution is enabled. ")
       with ops.name_scope(name, "Variable", [] if init_from_fn else
                           [initial_value]) as name:
 
@@ -1068,12 +1703,6 @@ class Variable(checkpointable.CheckpointableBase):
     else:
       return None
 
-  @staticmethod
-  def from_proto(variable_def, import_scope=None):
-    """Returns a `Variable` object created from `variable_def`."""
-    return Variable(variable_def=variable_def,
-                    import_scope=import_scope)
-
   def __iadd__(self, other):
     logging.log_first_n(
         logging.WARN,
@@ -1130,90 +1759,6 @@ class Variable(checkpointable.CheckpointableBase):
         " if you want a new python Tensor object.", 1)
     return self ** other
 
-  class SaveSliceInfo(object):
-    """Information on how to save this Variable as a slice.
-
-    Provides internal support for saving variables as slices of a larger
-    variable.  This API is not public and is subject to change.
-
-    Available properties:
-
-    * full_name
-    * full_shape
-    * var_offset
-    * var_shape
-    """
-
-    def __init__(self,
-                 full_name=None,
-                 full_shape=None,
-                 var_offset=None,
-                 var_shape=None,
-                 save_slice_info_def=None,
-                 import_scope=None):
-      """Create a `SaveSliceInfo`.
-
-      Args:
-        full_name: Name of the full variable of which this `Variable` is a
-            slice.
-        full_shape: Shape of the full variable, as a list of int.
-        var_offset: Offset of this `Variable` into the full variable, as a
-            list of int.
-        var_shape: Shape of this `Variable`, as a list of int.
-        save_slice_info_def: `SaveSliceInfoDef` protocol buffer. If not `None`,
-          recreates the SaveSliceInfo object its contents.
-          `save_slice_info_def` and other arguments are mutually
-          exclusive.
-        import_scope: Optional `string`. Name scope to add. Only used
-          when initializing from protocol buffer.
-      """
-      if save_slice_info_def:
-        assert isinstance(save_slice_info_def, variable_pb2.SaveSliceInfoDef)
-        self.full_name = ops.prepend_name_scope(
-            save_slice_info_def.full_name, import_scope=import_scope)
-        self.full_shape = [i for i in save_slice_info_def.full_shape]
-        self.var_offset = [i for i in save_slice_info_def.var_offset]
-        self.var_shape = [i for i in save_slice_info_def.var_shape]
-      else:
-        self.full_name = full_name
-        self.full_shape = full_shape
-        self.var_offset = var_offset
-        self.var_shape = var_shape
-
-    @property
-    def spec(self):
-      """Computes the spec string used for saving."""
-      full_shape_str = " ".join(["%d" % d for d in self.full_shape]) + " "
-      sl_spec = ":".join([
-          "%d,%d" % (o, s) for o, s in zip(self.var_offset, self.var_shape)
-      ])
-      return full_shape_str + sl_spec
-
-    def to_proto(self, export_scope=None):
-      """Returns a SaveSliceInfoDef() proto.
-
-      Args:
-        export_scope: Optional `string`. Name scope to remove.
-
-      Returns:
-        A `SaveSliceInfoDef` protocol buffer, or None if the `Variable` is not
-        in the specified name scope.
-      """
-      if (export_scope is None or
-          self.full_name.startswith(export_scope)):
-        save_slice_info_def = variable_pb2.SaveSliceInfoDef()
-        save_slice_info_def.full_name = ops.strip_name_scope(
-            self.full_name, export_scope)
-        for i in self.full_shape:
-          save_slice_info_def.full_shape.append(i)
-        for i in self.var_offset:
-          save_slice_info_def.var_offset.append(i)
-        for i in self.var_shape:
-          save_slice_info_def.var_shape.append(i)
-        return save_slice_info_def
-      else:
-        return None
-
   def _set_save_slice_info(self, save_slice_info):
     """Sets the slice info for this `Variable`.
 
@@ -1230,7 +1775,7 @@ class PartitionedVariable(object):
   """A container for partitioned `Variable` objects.
 
   @compatibility(eager) `tf.PartitionedVariable` is not compatible with
-  eager execution.  Use `tfe.Variable` instead which is compatible
+  eager execution.  Use `tf.Variable` instead which is compatible
   with both eager execution and graph construction.  See [the
   TensorFlow Eager Execution
   guide](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/g3doc/guide.md#variables-and-optimizers)
@@ -1404,6 +1949,10 @@ class PartitionedVariable(object):
   def dtype(self):
     return self._dtype
 
+  @property
+  def shape(self):
+    return self.get_shape()
+
   def get_shape(self):
     return self._shape
 
diff --git a/tensorflow/python/platform/benchmark.py b/tensorflow/python/platform/benchmark.py
index eba2baaf6f..fa17b17d10 100644
--- a/tensorflow/python/platform/benchmark.py
+++ b/tensorflow/python/platform/benchmark.py
@@ -66,11 +66,11 @@ def _global_report_benchmark(
     if not isinstance(extras, dict):
       raise TypeError("extras must be a dict")
 
-    logging.info("Benchmark [%s] iters: %d, wall_time: %g, cpu_time: %g,"
-                 "throughput: %g %s", name, iters if iters is not None else -1,
-                 wall_time if wall_time is not None else -1, cpu_time if
-                 cpu_time is not None else -1, throughput if
-                 throughput is not None else -1, str(extras) if extras else "")
+  logging.info("Benchmark [%s] iters: %d, wall_time: %g, cpu_time: %g,"
+               "throughput: %g %s", name, iters if iters is not None else -1,
+               wall_time if wall_time is not None else -1, cpu_time if
+               cpu_time is not None else -1, throughput if
+               throughput is not None else -1, str(extras) if extras else "")
 
   entries = test_log_pb2.BenchmarkEntries()
   entry = entries.entry.add()
diff --git a/tensorflow/python/platform/gfile.py b/tensorflow/python/platform/gfile.py
index fd697d70bf..45de047894 100644
--- a/tensorflow/python/platform/gfile.py
+++ b/tensorflow/python/platform/gfile.py
@@ -38,7 +38,14 @@ from tensorflow.python.util.tf_export import tf_export
 
 @tf_export('gfile.GFile', 'gfile.Open')
 class GFile(_FileIO):
-  """File I/O wrappers without thread locking."""
+  """File I/O wrappers without thread locking.
+
+  Note, that this  is somewhat like builtin Python  file I/O, but
+  there are  semantic differences to  make it more  efficient for
+  some backing filesystems.  For example, a write  mode file will
+  not  be opened  until the  first  write call  (to minimize  RPC
+  invocations in network filesystems).
+  """
 
   def __init__(self, name, mode='r'):
     super(GFile, self).__init__(name=name, mode=mode)
@@ -46,7 +53,14 @@ class GFile(_FileIO):
 
 @tf_export('gfile.FastGFile')
 class FastGFile(_FileIO):
-  """File I/O wrappers without thread locking."""
+  """File I/O wrappers without thread locking.
+
+  Note, that this  is somewhat like builtin Python  file I/O, but
+  there are  semantic differences to  make it more  efficient for
+  some backing filesystems.  For example, a write  mode file will
+  not  be opened  until the  first  write call  (to minimize  RPC
+  invocations in network filesystems).
+  """
 
   def __init__(self, name, mode='r'):
     super(FastGFile, self).__init__(name=name, mode=mode)
diff --git a/tensorflow/python/platform/self_check.py b/tensorflow/python/platform/self_check.py
index 966a094e55..844ae99918 100644
--- a/tensorflow/python/platform/self_check.py
+++ b/tensorflow/python/platform/self_check.py
@@ -78,7 +78,7 @@ def preload_check():
               "Could not find %r. TensorFlow requires that this DLL be "
               "installed in a directory that is named in your %%PATH%% "
               "environment variable. Download and install CUDA %s from "
-              "this URL: https://developer.nvidia.com/cuda-toolkit"
+              "this URL: https://developer.nvidia.com/cuda-90-download-archive"
               % (build_info.cudart_dll_name, build_info.cuda_version_number))
 
       if hasattr(build_info, "cudnn_dll_name") and hasattr(
diff --git a/tensorflow/python/profiler/model_analyzer_test.py b/tensorflow/python/profiler/model_analyzer_test.py
index f9891f3b1e..c0e16ca536 100644
--- a/tensorflow/python/profiler/model_analyzer_test.py
+++ b/tensorflow/python/profiler/model_analyzer_test.py
@@ -106,7 +106,7 @@ class PrintModelAnalysisTest(test.TestCase):
               # Make sure time is profiled.
               gap = 1 if test.is_gpu_available() else 2
               for i in range(3, 6, gap):
-                mat = re.search('(.*)[um]s/(.*)[um]s', metrics[i])
+                mat = re.search('(.*)(?:us|ms|sec)/(.*)(?:us|ms|sec)', metrics[i])
                 self.assertGreater(float(mat.group(1)), 0.0)
                 self.assertGreater(float(mat.group(2)), 0.0)
               # Make sure device is profiled.
diff --git a/tensorflow/python/profiler/profile_context.py b/tensorflow/python/profiler/profile_context.py
index 18eb66ef98..fa4260a712 100644
--- a/tensorflow/python/profiler/profile_context.py
+++ b/tensorflow/python/profiler/profile_context.py
@@ -88,16 +88,19 @@ def _profiled_run(self,
       to_profiles = self.profile_context._profile_candidates()
       for to_prof in to_profiles:
         cmd, opts, _ = to_prof
+        saved_views = self.profile_context._views.setdefault(cmd, {})
         if self.profile_context._debug:
           sys.stderr.write('debug: profiling %s step: %d\n' % (cmd, step))
         if cmd == 'graph':
-          self.profile_context.profiler.profile_graph(opts)
+          saved_views[step] = self.profile_context.profiler.profile_graph(opts)
         elif cmd == 'scope':
-          self.profile_context.profiler.profile_name_scope(opts)
+          saved_views[step] = self.profile_context.profiler.profile_name_scope(
+              opts)
         elif cmd == 'op':
-          self.profile_context.profiler.profile_operations(opts)
+          saved_views[step] = self.profile_context.profiler.profile_operations(
+              opts)
         elif cmd == 'code':
-          self.profile_context.profiler.profile_python(opts)
+          saved_views[step] = self.profile_context.profiler.profile_python(opts)
         else:
           raise ValueError('Unknown cmd: %s\n' % cmd)
       return ret
@@ -185,8 +188,30 @@ class ProfileContext(object):
     self._traced_steps = 0
     self._auto_profiles = []
     self._profiler = None
+    self._views = {}
     self._lock = threading.Lock()
 
+  def get_profiles(self, cmd):
+    """Returns profiling results for each step at which `cmd` was run.
+
+    Args:
+      cmd: string, profiling command used in an `add_auto_profiling` call.
+
+    Returns:
+      dict[int: (MultiGraphNodeProto | GraphNodeProto)]. Keys are steps at which
+      the profiling command was run. Values are the outputs of profiling.
+      For "code" and "op" commands this will be a `MultiGraphNodeProto`, for
+      "scope" and "graph" commands this will be a `GraphNodeProto.
+
+    Raises:
+      ValueError: if `cmd` was never run (either because no session.run call was
+      made or because there was no `add_auto_profiling` call with the specified
+      `cmd`.
+    """
+    if cmd not in self._views:
+      raise ValueError('No autoprofiler for command: {}, was run'.format(cmd))
+    return self._views[cmd]
+
   def add_auto_profiling(self, cmd, options, profile_steps):
     """Traces and profiles at some session run steps.
 
diff --git a/tensorflow/python/profiler/profile_context_test.py b/tensorflow/python/profiler/profile_context_test.py
index a623beee23..107ad443c3 100644
--- a/tensorflow/python/profiler/profile_context_test.py
+++ b/tensorflow/python/profiler/profile_context_test.py
@@ -61,6 +61,8 @@ class ProfilerContextTest(test.TestCase):
               profile_str = f.read()
             gfile.Remove(outfile)
 
+      self.assertEqual(set([15, 50, 100]), set(pctx.get_profiles("op").keys()))
+
     with lib.ProfilerFromFile(
         os.path.join(test.get_temp_dir(), "profile_100")) as profiler:
       profiler.profile_operations(options=opts)
diff --git a/tensorflow/python/saved_model/builder_impl.py b/tensorflow/python/saved_model/builder_impl.py
index e58be804c2..8c985a7c2f 100644
--- a/tensorflow/python/saved_model/builder_impl.py
+++ b/tensorflow/python/saved_model/builder_impl.py
@@ -34,6 +34,7 @@ from tensorflow.python.platform import tf_logging
 from tensorflow.python.saved_model import constants
 from tensorflow.python.training import saver as tf_saver
 from tensorflow.python.util import compat
+from tensorflow.python.util.deprecation import deprecated_args
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -133,39 +134,32 @@ class SavedModelBuilder(object):
     tf_logging.info("Assets written to: %s",
                     compat.as_text(assets_destination_dir))
 
-  def _maybe_add_legacy_init_op(self, legacy_init_op=None):
-    """Add legacy init op to the SavedModel.
+  def _maybe_add_main_op(self, main_op):
+    """Adds main op to the SavedModel.
 
     Args:
-      legacy_init_op: Optional legacy init op to support backward compatibility.
+      main_op: Main op to run as part of graph initialization. If None, no
+        main op will be added to the graph.
 
     Raises:
-      TypeError if legacy init op is not of type `Operation`.
-      AssertionError if the graph already contains one or more legacy init ops.
+      TypeError: if main op is provided but is not of type `Operation`.
+      ValueError: if the Graph already contains an init op.
     """
-    if legacy_init_op is not None:
-      if not isinstance(legacy_init_op, ops.Operation):
-        raise TypeError("legacy_init_op needs to be an Operation: %r" %
-                        legacy_init_op)
-      if ops.get_collection(constants.LEGACY_INIT_OP_KEY):
-        raise AssertionError(
-            "graph already contains one or more legacy init ops under the "
-            "collection {}.".format(constants.LEGACY_INIT_OP_KEY))
-      ops.add_to_collection(constants.LEGACY_INIT_OP_KEY, legacy_init_op)
-
-  def _add_main_op(self, main_op):
-    """Add main op to the SavedModel.
+    if main_op is None:
+      return
 
-    Args:
-      main_op: Main op to run as part of graph initialization.
+    if not isinstance(main_op, ops.Operation):
+      raise TypeError("main_op needs to be an Operation: %r" % main_op)
 
-    Raises:
-      TypeError if main op is not of type `Operation`.
-    """
-    if main_op is not None:
-      if not isinstance(main_op, ops.Operation):
-        raise TypeError("main_op needs to be an Operation: %r" % main_op)
-      ops.add_to_collection(constants.MAIN_OP_KEY, main_op)
+    # Validate that no other init ops have been added to this graph already.
+    # We check main_op and legacy_init_op for thoroughness and explicitness.
+    for init_op_key in (constants.MAIN_OP_KEY, constants.LEGACY_INIT_OP_KEY):
+      if ops.get_collection(init_op_key):
+        raise ValueError(
+            "Graph already contains one or more main ops under the "
+            "collection {}.".format(init_op_key))
+
+    ops.add_to_collection(constants.MAIN_OP_KEY, main_op)
 
   def _add_train_op(self, train_op):
     """Add train op to the SavedModel.
@@ -257,16 +251,12 @@ class SavedModelBuilder(object):
           self._validate_tensor_info(outputs[outputs_key])
 
   def _add_collections(
-      self, assets_collection, legacy_init_op, main_op, train_op):
+      self, assets_collection, main_op, train_op):
     """Add asset and op collections to be saved."""
     # Save asset files and write them to disk, if any.
     self._save_and_write_assets(assets_collection)
 
-    if main_op is None:
-      # Add legacy init op to the SavedModel.
-      self._maybe_add_legacy_init_op(legacy_init_op)
-    else:
-      self._add_main_op(main_op)
+    self._maybe_add_main_op(main_op)
 
     self._add_train_op(train_op)
 
@@ -282,6 +272,9 @@ class SavedModelBuilder(object):
           allow_empty=True)
     return saver
 
+  @deprecated_args(None,
+                   "Pass your op to the equivalent parameter main_op instead.",
+                   "legacy_init_op")
   def add_meta_graph(self,
                      tags,
                      signature_def_map=None,
@@ -306,7 +299,7 @@ class SavedModelBuilder(object):
           that this collection should be a subset of the assets saved as part of
           the first meta graph in the SavedModel.
       legacy_init_op: Legacy support for op or group of ops to execute after the
-          restore op upon a load.
+          restore op upon a load. Deprecated; please use main_op instead.
       clear_devices: Set to true if the device info on the default graph should
           be cleared.
       main_op: Op or group of ops to execute when the graph is loaded. Note
@@ -333,8 +326,12 @@ class SavedModelBuilder(object):
     # properly populated.
     self._validate_signature_def_map(signature_def_map)
 
+    # legacy_init_op is deprecated, and going away in TF 2.0.
+    # Re-mapping to main_op, as treatment is identical regardless.
+    main_op = main_op or legacy_init_op
+
     # Add assets and ops
-    self._add_collections(assets_collection, legacy_init_op, main_op, None)
+    self._add_collections(assets_collection, main_op, None)
 
     saver = self._maybe_create_saver(saver)
 
@@ -351,6 +348,9 @@ class SavedModelBuilder(object):
     # Tag the meta graph def and add it to the SavedModel.
     self._tag_and_add_meta_graph(meta_graph_def, tags, signature_def_map)
 
+  @deprecated_args(None,
+                   "Pass your op to the equivalent parameter main_op instead.",
+                   "legacy_init_op")
   def add_meta_graph_and_variables(self,
                                    sess,
                                    tags,
@@ -378,7 +378,7 @@ class SavedModelBuilder(object):
         def.
       assets_collection: Assets collection to be saved with SavedModel.
       legacy_init_op: Legacy support for op or group of ops to execute after the
-          restore op upon a load.
+          restore op upon a load. Deprecated; please use main_op instead.
       clear_devices: Set to true if the device info on the default graph should
           be cleared.
       main_op: Op or group of ops to execute when the graph is loaded. Note
@@ -402,8 +402,12 @@ class SavedModelBuilder(object):
     # properly populated.
     self._validate_signature_def_map(signature_def_map)
 
+    # legacy_init_op is deprecated, and going away in TF 2.0.
+    # Re-mapping to main_op, as treatment is identical regardless.
+    main_op = main_op or legacy_init_op
+
     # Add assets and ops
-    self._add_collections(assets_collection, legacy_init_op, main_op, None)
+    self._add_collections(assets_collection, main_op, None)
 
     # Create the variables sub-directory, if it does not exist.
     variables_dir = os.path.join(
diff --git a/tensorflow/python/saved_model/constants.py b/tensorflow/python/saved_model/constants.py
index 61c6ffbd0d..cb251f08bb 100644
--- a/tensorflow/python/saved_model/constants.py
+++ b/tensorflow/python/saved_model/constants.py
@@ -60,6 +60,10 @@ SAVED_MODEL_FILENAME_PBTXT = "saved_model.pbtxt"
 tf_export("saved_model.constants.SAVED_MODEL_FILENAME_PBTXT").export_constant(
     __name__, "SAVED_MODEL_FILENAME_PBTXT")
 
+# File name for json format of SavedModel.
+# Not exported while keras_saved_model is in contrib.
+SAVED_MODEL_FILENAME_JSON = "saved_model.json"
+
 # Subdirectory name containing the variables/checkpoint files.
 VARIABLES_DIRECTORY = "variables"
 tf_export("saved_model.constants.VARIABLES_DIRECTORY").export_constant(
@@ -69,5 +73,3 @@ tf_export("saved_model.constants.VARIABLES_DIRECTORY").export_constant(
 VARIABLES_FILENAME = "variables"
 tf_export("saved_model.constants.VARIABLES_FILENAME").export_constant(
     __name__, "VARIABLES_FILENAME")
-
-
diff --git a/tensorflow/python/saved_model/loader_impl.py b/tensorflow/python/saved_model/loader_impl.py
index e5f649fdab..16077f52fa 100644
--- a/tensorflow/python/saved_model/loader_impl.py
+++ b/tensorflow/python/saved_model/loader_impl.py
@@ -116,11 +116,14 @@ def _get_asset_tensors(export_dir, meta_graph_def_to_load, import_scope=None):
   return asset_tensor_dict
 
 
-def _get_main_op_tensor(meta_graph_def_to_load):
+def _get_main_op_tensor(
+    meta_graph_def_to_load, init_op_key=constants.MAIN_OP_KEY):
   """Gets the main op tensor, if one exists.
 
   Args:
     meta_graph_def_to_load: The meta graph def from the SavedModel to be loaded.
+    init_op_key: name of collection to check; should be one of MAIN_OP_KEY
+      or the deprecated LEGACY_INIT_OP_KEY
 
   Returns:
     The main op tensor, if it exists and `None` otherwise.
@@ -131,38 +134,15 @@ def _get_main_op_tensor(meta_graph_def_to_load):
   """
   collection_def = meta_graph_def_to_load.collection_def
   main_op_tensor = None
-  if constants.MAIN_OP_KEY in collection_def:
-    main_ops = collection_def[constants.MAIN_OP_KEY].node_list.value
+  if init_op_key in collection_def:
+    main_ops = collection_def[init_op_key].node_list.value
     if len(main_ops) != 1:
-      raise RuntimeError("Expected exactly one SavedModel main op.")
-    main_op_tensor = ops.get_collection(constants.MAIN_OP_KEY)[0]
+      raise RuntimeError("Expected exactly one SavedModel main op. "
+                         "Found: {}".format(main_ops))
+    main_op_tensor = ops.get_collection(init_op_key)[0]
   return main_op_tensor
 
 
-def _get_legacy_init_op_tensor(meta_graph_def_to_load):
-  """Gets the legacy init op tensor, if one exists.
-
-  Args:
-    meta_graph_def_to_load: The meta graph def from the SavedModel to be loaded.
-
-  Returns:
-    The legacy init op tensor, if it exists and `None` otherwise.
-
-  Raises:
-    RuntimeError: If the collection def corresponding to the legacy init op key
-        has other than exactly one tensor.
-  """
-  collection_def = meta_graph_def_to_load.collection_def
-  legacy_init_op_tensor = None
-  if constants.LEGACY_INIT_OP_KEY in collection_def:
-    legacy_init_ops = collection_def[
-        constants.LEGACY_INIT_OP_KEY].node_list.value
-    if len(legacy_init_ops) != 1:
-      raise RuntimeError("Expected exactly one legacy serving init op.")
-    legacy_init_op_tensor = ops.get_collection(constants.LEGACY_INIT_OP_KEY)[0]
-  return legacy_init_op_tensor
-
-
 @tf_export("saved_model.loader.maybe_saved_model_directory")
 def maybe_saved_model_directory(export_dir):
   """Checks whether the provided export directory could contain a SavedModel.
@@ -284,12 +264,15 @@ class SavedModelLoader(object):
       **saver_kwargs: keyword arguments to pass to tf.train.import_meta_graph.
 
     Returns:
-      Saver defined by the MetaGraph, which can be used to restore the variable
-      values.
+      A tuple of
+        * Saver defined by the MetaGraph, which can be used to restore the
+          variable values.
+        * List of `Operation`/`Tensor` objects returned from
+          `tf.import_graph_def` (may be `None`).
     """
     meta_graph_def = self.get_meta_graph_def_from_tags(tags)
     with graph.as_default():
-      return tf_saver.import_meta_graph(
+      return tf_saver._import_meta_graph_with_return_elements(  # pylint: disable=protected-access
           meta_graph_def, import_scope=import_scope, **saver_kwargs)
 
   def restore_variables(self, sess, saver, import_scope=None):
@@ -340,8 +323,8 @@ class SavedModelLoader(object):
           self._export_dir, meta_graph_def, import_scope=import_scope)
 
       main_op_tensor = (
-          _get_main_op_tensor(meta_graph_def) or
-          (_get_legacy_init_op_tensor(meta_graph_def)))
+          _get_main_op_tensor(meta_graph_def, constants.MAIN_OP_KEY) or
+          _get_main_op_tensor(meta_graph_def, constants.LEGACY_INIT_OP_KEY))
       if main_op_tensor is not None:
         sess.run(fetches=[main_op_tensor], feed_dict=asset_tensors_dictionary)
 
@@ -361,8 +344,8 @@ class SavedModelLoader(object):
       `MetagraphDef` proto of the graph that was loaded.
     """
     with sess.graph.as_default():
-      saver = self.load_graph(sess.graph, tags, import_scope,
-                              **saver_kwargs)
+      saver, _ = self.load_graph(sess.graph, tags, import_scope,
+                                 **saver_kwargs)
       self.restore_variables(sess, saver, import_scope)
       self.run_init_ops(sess, tags, import_scope)
     return self.get_meta_graph_def_from_tags(tags)
diff --git a/tensorflow/python/saved_model/loader_test.py b/tensorflow/python/saved_model/loader_test.py
index ce18859f6b..9a0b276a4b 100644
--- a/tensorflow/python/saved_model/loader_test.py
+++ b/tensorflow/python/saved_model/loader_test.py
@@ -111,7 +111,8 @@ class SavedModelLoaderTest(test.TestCase):
   def test_load_with_import_scope(self):
     loader = loader_impl.SavedModelLoader(SAVED_MODEL_WITH_MAIN_OP)
     with self.test_session(graph=ops.Graph()) as sess:
-      saver = loader.load_graph(sess.graph, ["foo_graph"], import_scope="baz")
+      saver, _ = loader.load_graph(
+          sess.graph, ["foo_graph"], import_scope="baz")
 
       # The default saver should not work when the import scope is set.
       with self.assertRaises(errors.NotFoundError):
@@ -149,7 +150,7 @@ class SavedModelLoaderTest(test.TestCase):
   def test_run_init_op(self):
     loader = loader_impl.SavedModelLoader(SAVED_MODEL_WITH_MAIN_OP)
     graph = ops.Graph()
-    saver = loader.load_graph(graph, ["foo_graph"])
+    saver, _ = loader.load_graph(graph, ["foo_graph"])
     with self.test_session(graph=graph) as sess:
       loader.restore_variables(sess, saver)
       self.assertEqual(5, sess.graph.get_tensor_by_name("x:0").eval())
@@ -203,7 +204,7 @@ class SavedModelLoaderTest(test.TestCase):
 
     loader = loader_impl.SavedModelLoader(path)
     with self.test_session(graph=ops.Graph()) as sess:
-      saver = loader.load_graph(sess.graph, ["foo_graph"])
+      saver, _ = loader.load_graph(sess.graph, ["foo_graph"])
       self.assertFalse(variables._all_saveable_objects())
       self.assertIsNotNone(saver)
 
@@ -212,6 +213,18 @@ class SavedModelLoaderTest(test.TestCase):
       self.assertEqual(5, sess.graph.get_tensor_by_name("x:0").eval())
       self.assertEqual(11, sess.graph.get_tensor_by_name("y:0").eval())
 
+  def test_load_saved_model_graph_with_return_elements(self):
+    """Ensure that the correct elements are returned."""
+    loader = loader_impl.SavedModelLoader(SIMPLE_ADD_SAVED_MODEL)
+    graph = ops.Graph()
+    _, ret = loader.load_graph(graph, ["foo_graph"],
+                               return_elements=["y:0", "x:0"])
+
+    self.assertEqual(graph.get_tensor_by_name("y:0"), ret[0])
+    self.assertEqual(graph.get_tensor_by_name("x:0"), ret[1])
+
+    with self.assertRaisesRegexp(ValueError, "not found in graph"):
+      loader.load_graph(graph, ["foo_graph"], return_elements=["z:0"])
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/saved_model/saved_model_test.py b/tensorflow/python/saved_model/saved_model_test.py
index fb4732aca2..00b669fc97 100644
--- a/tensorflow/python/saved_model/saved_model_test.py
+++ b/tensorflow/python/saved_model/saved_model_test.py
@@ -846,9 +846,19 @@ class SavedModelTest(test.TestCase):
   def testLegacyInitOpWithNonEmptyCollection(self):
     export_dir = self._get_export_dir(
         "test_legacy_init_op_with_non_empty_collection")
+    self._testInitOpsWithNonEmptyCollection(
+        export_dir, constants.LEGACY_INIT_OP_KEY)
+
+  def testMainOpWithNonEmptyCollection(self):
+    export_dir = self._get_export_dir(
+        "test_main_op_with_non_empty_collection")
+    self._testInitOpsWithNonEmptyCollection(export_dir, constants.MAIN_OP_KEY)
+
+  def _testInitOpsWithNonEmptyCollection(self, export_dir, key):
     builder = saved_model_builder.SavedModelBuilder(export_dir)
 
-    with self.test_session(graph=ops.Graph()) as sess:
+    g = ops.Graph()
+    with self.test_session(graph=g) as sess:
       # Initialize variable `v1` to 1.
       v1 = variables.Variable(1, name="v1")
       ops.add_to_collection("v", v1)
@@ -857,19 +867,21 @@ class SavedModelTest(test.TestCase):
       v2 = variables.Variable(42, name="v2", trainable=False, collections=[])
       ops.add_to_collection("v", v2)
 
-      # Set up an assignment op to be run as part of the legacy_init_op.
+      # Set up an assignment op to be run as part of the init op.
       assign_v2 = state_ops.assign(v2, v1)
-      legacy_init_op = control_flow_ops.group(assign_v2, name="legacy_init_op")
+      init_op = control_flow_ops.group(assign_v2, name="init_op")
 
       sess.run(variables.global_variables_initializer())
 
-      ops.add_to_collection(constants.LEGACY_INIT_OP_KEY,
-                            control_flow_ops.no_op())
-      # AssertionError should be raised since the LEGACY_INIT_OP_KEY collection
+      ops.add_to_collection(key, control_flow_ops.no_op())
+      # ValueError should be raised since the LEGACY_INIT_OP_KEY collection
       # is not empty and we don't support multiple init ops.
-      with self.assertRaises(AssertionError):
+      with self.assertRaisesRegexp(ValueError, "Graph already contains"):
         builder.add_meta_graph_and_variables(
-            sess, ["foo"], legacy_init_op=legacy_init_op)
+            sess, ["foo"], legacy_init_op=init_op)
+      # We shouldn't be able to add as MAIN_OP, either.
+      with self.assertRaisesRegexp(ValueError, "Graph already contains"):
+        builder.add_meta_graph_and_variables(sess, ["foo"], main_op=init_op)
 
   def testTrainOp(self):
     export_dir = self._get_export_dir("test_train_op")
diff --git a/tensorflow/python/summary/writer/writer.py b/tensorflow/python/summary/writer/writer.py
index aca084fc91..60e96ee947 100644
--- a/tensorflow/python/summary/writer/writer.py
+++ b/tensorflow/python/summary/writer/writer.py
@@ -325,7 +325,7 @@ class FileWriter(SummaryToEventTransformer):
     ```
 
     The `session` argument to the constructor makes the returned `FileWriter` a
-    a compatibility layer over new graph-based summaries (`tf.contrib.summary`).
+    compatibility layer over new graph-based summaries (`tf.contrib.summary`).
     Crucially, this means the underlying writer resource and events file will
     be shared with any other `FileWriter` using the same `session` and `logdir`,
     and with any `tf.contrib.summary.SummaryWriter` in this session using the
diff --git a/tensorflow/tools/api/generator/BUILD b/tensorflow/python/tools/api/generator/BUILD
index 8c760e6f52..223d1281ba 100644
--- a/tensorflow/tools/api/generator/BUILD
+++ b/tensorflow/python/tools/api/generator/BUILD
@@ -3,8 +3,9 @@
 
 licenses(["notice"])  # Apache 2.0
 
-load("//tensorflow/tools/api/generator:api_gen.bzl", "ESTIMATOR_API_INIT_FILES")
-load("//tensorflow/tools/api/generator:api_gen.bzl", "TENSORFLOW_API_INIT_FILES")
+load("//tensorflow:tensorflow.bzl", "py_test")
+load("//tensorflow/python/tools/api/generator:api_gen.bzl", "ESTIMATOR_API_INIT_FILES")
+load("//tensorflow/python/tools/api/generator:api_gen.bzl", "TENSORFLOW_API_INIT_FILES")
 
 exports_files(
     [
@@ -13,6 +14,18 @@ exports_files(
     ],
 )
 
+py_binary(
+    name = "create_python_api",
+    srcs = ["//tensorflow/python/tools/api/generator:create_python_api.py"],
+    main = "//tensorflow/python/tools/api/generator:create_python_api.py",
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/python:no_contrib",
+        "//tensorflow/python/tools/api/generator:doc_srcs",
+    ],
+)
+
 py_library(
     name = "doc_srcs",
     srcs = ["doc_srcs.py"],
diff --git a/tensorflow/tools/api/generator/api_gen.bzl b/tensorflow/python/tools/api/generator/api_gen.bzl
index d746b5d3e4..00e1c4e199 100644
--- a/tensorflow/tools/api/generator/api_gen.bzl
+++ b/tensorflow/python/tools/api/generator/api_gen.bzl
@@ -102,36 +102,41 @@ ESTIMATOR_API_INIT_FILES = [
     # END GENERATED ESTIMATOR FILES
 ]
 
-# Creates a genrule that generates a directory structure with __init__.py
-# files that import all exported modules (i.e. modules with tf_export
-# decorators).
-#
-# Args:
-#   name: name of genrule to create.
-#   output_files: List of __init__.py files that should be generated.
-#     This list should include file name for every module exported using
-#     tf_export. For e.g. if an op is decorated with
-#     @tf_export('module1.module2', 'module3'). Then, output_files should
-#     include module1/module2/__init__.py and module3/__init__.py.
-#   root_init_template: Python init file that should be used as template for
-#     root __init__.py file. "# API IMPORTS PLACEHOLDER" comment inside this
-#     template will be replaced with root imports collected by this genrule.
-#   srcs: genrule sources. If passing root_init_template, the template file
-#     must be included in sources.
-#   api_name: Name of the project that you want to generate API files for
-#     (e.g. "tensorflow" or "estimator").
-#   package: Python package containing the @tf_export decorators you want to
-#     process
-#   package_dep: Python library target containing your package.
-
 def gen_api_init_files(
         name,
         output_files = TENSORFLOW_API_INIT_FILES,
         root_init_template = None,
         srcs = [],
         api_name = "tensorflow",
+        api_version = 2,
         package = "tensorflow.python",
-        package_dep = "//tensorflow/python:no_contrib"):
+        package_dep = "//tensorflow/python:no_contrib",
+        output_package = "tensorflow"):
+    """Creates API directory structure and __init__.py files.
+
+    Creates a genrule that generates a directory structure with __init__.py
+    files that import all exported modules (i.e. modules with tf_export
+    decorators).
+
+    Args:
+      name: name of genrule to create.
+      output_files: List of __init__.py files that should be generated.
+        This list should include file name for every module exported using
+        tf_export. For e.g. if an op is decorated with
+        @tf_export('module1.module2', 'module3'). Then, output_files should
+        include module1/module2/__init__.py and module3/__init__.py.
+      root_init_template: Python init file that should be used as template for
+        root __init__.py file. "# API IMPORTS PLACEHOLDER" comment inside this
+        template will be replaced with root imports collected by this genrule.
+      srcs: genrule sources. If passing root_init_template, the template file
+        must be included in sources.
+      api_name: Name of the project that you want to generate API files for
+        (e.g. "tensorflow" or "estimator").
+      api_version: TensorFlow API version to generate. Must be either 1 or 2.
+      package: Python package containing the @tf_export decorators you want to
+        process
+      package_dep: Python library target containing your package.
+    """
     root_init_template_flag = ""
     if root_init_template:
       root_init_template_flag = "--root_init_template=$(location " + root_init_template + ")"
@@ -139,13 +144,14 @@ def gen_api_init_files(
     api_gen_binary_target = "create_" + package + "_api"
     native.py_binary(
         name = "create_" + package + "_api",
-        srcs = ["//tensorflow/tools/api/generator:create_python_api.py"],
-        main = "//tensorflow/tools/api/generator:create_python_api.py",
+        srcs = ["//tensorflow/python/tools/api/generator:create_python_api.py"],
+        main = "//tensorflow/python/tools/api/generator:create_python_api.py",
         srcs_version = "PY2AND3",
         visibility = ["//visibility:public"],
         deps = [
             package_dep,
-            "//tensorflow/tools/api/generator:doc_srcs",
+            "//tensorflow/python:util",
+            "//tensorflow/python/tools/api/generator:doc_srcs",
         ],
     )
 
@@ -154,7 +160,9 @@ def gen_api_init_files(
         outs = output_files,
         cmd = (
             "$(location :" + api_gen_binary_target + ") " +
-            root_init_template_flag + " --apidir=$(@D) --apiname=" + api_name + " --package=" + package + " $(OUTS)"),
+            root_init_template_flag + " --apidir=$(@D) --apiname=" +
+            api_name + " --apiversion=" + str(api_version) + " --package=" + package +
+            " --output_package=" + output_package + " $(OUTS)"),
         srcs = srcs,
         tools = [":" + api_gen_binary_target ],
         visibility = ["//tensorflow:__pkg__"],
diff --git a/tensorflow/tools/api/generator/create_python_api.py b/tensorflow/python/tools/api/generator/create_python_api.py
index 48d7dcd09e..863c922216 100644
--- a/tensorflow/tools/api/generator/create_python_api.py
+++ b/tensorflow/python/tools/api/generator/create_python_api.py
@@ -24,11 +24,12 @@ import importlib
 import os
 import sys
 
+from tensorflow.python.tools.api.generator import doc_srcs
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_export
-from tensorflow.tools.api.generator import doc_srcs
 
 API_ATTRS = tf_export.API_ATTRS
+API_ATTRS_V1 = tf_export.API_ATTRS_V1
 
 _DEFAULT_PACKAGE = 'tensorflow.python'
 _GENFILES_DIR_SUFFIX = 'genfiles/'
@@ -38,14 +39,14 @@ _SYMBOLS_TO_SKIP_EXPLICITLY = {
     'tensorflow.python.platform.flags.FLAGS'
 }
 _GENERATED_FILE_HEADER = """# This file is MACHINE GENERATED! Do not edit.
-# Generated by: tensorflow/tools/api/generator/create_python_api.py script.
+# Generated by: tensorflow/python/tools/api/generator/create_python_api.py script.
 \"\"\"%s
 \"\"\"
 
 from __future__ import print_function
 
 """
-_GENERATED_FILE_FOOTER = "\n\ndel print_function\n"
+_GENERATED_FILE_FOOTER = '\n\ndel print_function\n'
 
 
 class SymbolExposedTwiceError(Exception):
@@ -159,13 +160,16 @@ __all__.remove('print_function')
     return module_text_map
 
 
-def get_api_init_text(package, api_name):
+def get_api_init_text(package, output_package, api_name, api_version):
   """Get a map from destination module to __init__.py code for that module.
 
   Args:
     package: Base python package containing python with target tf_export
       decorators.
+    output_package: Base output python package where generated API will
+      be added.
     api_name: API you want to generate (e.g. `tensorflow` or `estimator`).
+    api_version: API version you want to generate (`v1` or `v2`).
 
   Returns:
     A dictionary where
@@ -173,6 +177,12 @@ def get_api_init_text(package, api_name):
       value: (string) text that should be in __init__.py files for
         corresponding modules.
   """
+  if api_version == 1:
+    names_attr = API_ATTRS_V1[api_name].names
+    constants_attr = API_ATTRS_V1[api_name].constants
+  else:
+    names_attr = API_ATTRS[api_name].names
+    constants_attr = API_ATTRS[api_name].constants
   module_code_builder = _ModuleInitCodeBuilder()
 
   # Traverse over everything imported above. Specifically,
@@ -193,7 +203,7 @@ def get_api_init_text(package, api_name):
       attr = getattr(module, module_contents_name)
 
       # If attr is _tf_api_constants attribute, then add the constants.
-      if module_contents_name == API_ATTRS[api_name].constants:
+      if module_contents_name == constants_attr:
         for exports, value in attr:
           for export in exports:
             names = export.split('.')
@@ -205,9 +215,8 @@ def get_api_init_text(package, api_name):
       _, attr = tf_decorator.unwrap(attr)
       # If attr is a symbol with _tf_api_names attribute, then
       # add import for it.
-      if (hasattr(attr, '__dict__') and
-          API_ATTRS[api_name].names in attr.__dict__):
-        for export in getattr(attr, API_ATTRS[api_name].names):  # pylint: disable=protected-access
+      if (hasattr(attr, '__dict__') and names_attr in attr.__dict__):
+        for export in getattr(attr, names_attr):  # pylint: disable=protected-access
           names = export.split('.')
           dest_module = '.'.join(names[:-1])
           module_code_builder.add_import(
@@ -218,7 +227,6 @@ def get_api_init_text(package, api_name):
   # For e.g. if we import 'foo.bar.Value'. Then, we also
   # import 'bar' in 'foo'.
   imported_modules = set(module_code_builder.module_imports.keys())
-  import_from = '.'
   for module in imported_modules:
     if not module:
       continue
@@ -229,6 +237,9 @@ def get_api_init_text(package, api_name):
       if submodule_index > 0:
         parent_module += ('.' + module_split[submodule_index-1] if parent_module
                           else module_split[submodule_index-1])
+      import_from = output_package
+      if submodule_index > 0:
+        import_from += '.' + '.'.join(module_split[:submodule_index])
       module_code_builder.add_import(
           -1, parent_module, import_from,
           module_split[submodule_index], module_split[submodule_index])
@@ -294,7 +305,8 @@ def get_module_docstring(module_name, package, api_name):
 
 
 def create_api_files(
-    output_files, package, root_init_template, output_dir, api_name):
+    output_files, package, root_init_template, output_dir, output_package,
+    api_name, api_version):
   """Creates __init__.py files for the Python API.
 
   Args:
@@ -306,7 +318,9 @@ def create_api_files(
       "#API IMPORTS PLACEHOLDER" comment in the template file will be replaced
       with imports.
     output_dir: output API root directory.
+    output_package: Base output package where generated API will be added.
     api_name: API you want to generate (e.g. `tensorflow` or `estimator`).
+    api_version: API version to generate (`v1` or `v2`).
 
   Raises:
     ValueError: if an output file is not under api/ directory,
@@ -323,7 +337,8 @@ def create_api_files(
       os.makedirs(os.path.dirname(file_path))
     open(file_path, 'a').close()
 
-  module_text_map = get_api_init_text(package, api_name)
+  module_text_map = get_api_init_text(
+      package, output_package, api_name, api_version)
 
   # Add imports to output files.
   missing_output_files = []
@@ -381,6 +396,13 @@ def main():
       '--apiname', required=True, type=str,
       choices=API_ATTRS.keys(),
       help='The API you want to generate.')
+  parser.add_argument(
+      '--apiversion', default=2, type=int,
+      choices=[1, 2],
+      help='The API version you want to generate.')
+  parser.add_argument(
+      '--output_package', default='tensorflow', type=str,
+      help='Root output package.')
 
   args = parser.parse_args()
 
@@ -395,7 +417,8 @@ def main():
   # Populate `sys.modules` with modules containing tf_export().
   importlib.import_module(args.package)
   create_api_files(outputs, args.package, args.root_init_template,
-                   args.apidir, args.apiname)
+                   args.apidir, args.output_package, args.apiname,
+                   args.apiversion)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/tools/api/generator/create_python_api_test.py b/tensorflow/python/tools/api/generator/create_python_api_test.py
index 651ec9d040..a565a49d96 100644
--- a/tensorflow/tools/api/generator/create_python_api_test.py
+++ b/tensorflow/python/tools/api/generator/create_python_api_test.py
@@ -22,8 +22,8 @@ import imp
 import sys
 
 from tensorflow.python.platform import test
+from tensorflow.python.tools.api.generator import create_python_api
 from tensorflow.python.util.tf_export import tf_export
-from tensorflow.tools.api.generator import create_python_api
 
 
 @tf_export('test_op', 'test_op1')
@@ -58,7 +58,8 @@ class CreatePythonApiTest(test.TestCase):
   def testFunctionImportIsAdded(self):
     imports = create_python_api.get_api_init_text(
         package=create_python_api._DEFAULT_PACKAGE,
-        api_name='tensorflow')
+        output_package='tensorflow',
+        api_name='tensorflow', api_version=1)
     expected_import = (
         'from tensorflow.python.test_module '
         'import test_op as test_op1')
@@ -75,7 +76,8 @@ class CreatePythonApiTest(test.TestCase):
   def testClassImportIsAdded(self):
     imports = create_python_api.get_api_init_text(
         package=create_python_api._DEFAULT_PACKAGE,
-        api_name='tensorflow')
+        output_package='tensorflow',
+        api_name='tensorflow', api_version=2)
     expected_import = ('from tensorflow.python.test_module '
                        'import TestClass')
     self.assertTrue(
@@ -85,7 +87,8 @@ class CreatePythonApiTest(test.TestCase):
   def testConstantIsAdded(self):
     imports = create_python_api.get_api_init_text(
         package=create_python_api._DEFAULT_PACKAGE,
-        api_name='tensorflow')
+        output_package='tensorflow',
+        api_name='tensorflow', api_version=1)
     expected = ('from tensorflow.python.test_module '
                 'import _TEST_CONSTANT')
     self.assertTrue(expected in str(imports),
diff --git a/tensorflow/tools/api/generator/doc_srcs.py b/tensorflow/python/tools/api/generator/doc_srcs.py
index ad1988494d..ad1988494d 100644
--- a/tensorflow/tools/api/generator/doc_srcs.py
+++ b/tensorflow/python/tools/api/generator/doc_srcs.py
diff --git a/tensorflow/tools/api/generator/doc_srcs_test.py b/tensorflow/python/tools/api/generator/doc_srcs_test.py
index dbff904abe..481d9874a4 100644
--- a/tensorflow/tools/api/generator/doc_srcs_test.py
+++ b/tensorflow/python/tools/api/generator/doc_srcs_test.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # =============================================================================
-"""Tests for tensorflow.tools.api.generator.doc_srcs."""
+"""Tests for tensorflow.python.tools.api.generator.doc_srcs."""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -23,7 +23,7 @@ import importlib
 import sys
 
 from tensorflow.python.platform import test
-from tensorflow.tools.api.generator import doc_srcs
+from tensorflow.python.tools.api.generator import doc_srcs
 
 
 FLAGS = None
diff --git a/tensorflow/python/tools/freeze_graph.py b/tensorflow/python/tools/freeze_graph.py
index e9f1def48c..130fe70beb 100644
--- a/tensorflow/python/tools/freeze_graph.py
+++ b/tensorflow/python/tools/freeze_graph.py
@@ -38,6 +38,7 @@ from __future__ import division
 from __future__ import print_function
 
 import argparse
+import re
 import sys
 
 from google.protobuf import text_format
@@ -54,6 +55,7 @@ from tensorflow.python.platform import gfile
 from tensorflow.python.saved_model import loader
 from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.tools import saved_model_utils
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as saver_lib
 
 
@@ -77,7 +79,7 @@ def freeze_graph_with_def_protos(input_graph_def,
 
   # 'input_checkpoint' may be a prefix if we're using Saver V2 format
   if (not input_saved_model_dir and
-      not saver_lib.checkpoint_exists(input_checkpoint)):
+      not checkpoint_management.checkpoint_exists(input_checkpoint)):
     print("Input checkpoint '" + input_checkpoint + "' doesn't exist!")
     return -1
 
@@ -116,16 +118,43 @@ def freeze_graph_with_def_protos(input_graph_def,
       var_list = {}
       reader = pywrap_tensorflow.NewCheckpointReader(input_checkpoint)
       var_to_shape_map = reader.get_variable_to_shape_map()
+
+      # List of all partition variables. Because the condition is heuristic
+      # based, the list could include false positives.
+      all_parition_variable_names = [
+          tensor.name.split(":")[0]
+          for op in sess.graph.get_operations()
+          for tensor in op.values()
+          if re.search(r"/part_\d+/", tensor.name)
+      ]
+      has_partition_var = False
+
       for key in var_to_shape_map:
         try:
           tensor = sess.graph.get_tensor_by_name(key + ":0")
+          if any(key in name for name in all_parition_variable_names):
+            has_partition_var = True
         except KeyError:
           # This tensor doesn't exist in the graph (for example it's
           # 'global_step' or a similar housekeeping element) so skip it.
           continue
         var_list[key] = tensor
-      saver = saver_lib.Saver(
-          var_list=var_list, write_version=checkpoint_version)
+
+      try:
+        saver = saver_lib.Saver(
+            var_list=var_list, write_version=checkpoint_version)
+      except TypeError as e:
+        # `var_list` is required to be a map of variable names to Variable
+        # tensors. Partition variables are Identity tensors that cannot be
+        # handled by Saver.
+        if has_partition_var:
+          print("Models containing partition variables cannot be converted "
+                "from checkpoint files. Please pass in a SavedModel using "
+                "the flag --input_saved_model_dir.")
+          return -1
+        else:
+          raise e
+
       saver.restore(sess, input_checkpoint)
       if initializer_nodes:
         sess.run(initializer_nodes.replace(" ", "").split(","))
diff --git a/tensorflow/python/tools/freeze_graph_test.py b/tensorflow/python/tools/freeze_graph_test.py
index 91f0061ebc..e38945fabc 100644
--- a/tensorflow/python/tools/freeze_graph_test.py
+++ b/tensorflow/python/tools/freeze_graph_test.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 import os
+import re
 
 from tensorflow.core.example import example_pb2
 from tensorflow.core.framework import graph_pb2
@@ -31,7 +32,10 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn
 from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import partitioned_variables
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.saved_model import builder as saved_model_builder
@@ -262,6 +266,69 @@ class FreezeGraphTest(test_util.TensorFlowTestCase):
         output = sess.run(output_node, feed_dict={input_node: [example]})
         self.assertNear(feature_value, output, 0.00001)
 
+  def testSinglePartitionedVariable(self):
+    """Ensures partitioned variables fail cleanly with freeze graph."""
+    checkpoint_prefix = os.path.join(self.get_temp_dir(), "saved_checkpoint")
+    checkpoint_state_name = "checkpoint_state"
+    input_graph_name = "input_graph.pb"
+    output_graph_name = "output_graph.pb"
+
+    # Create a graph with partition variables. When weights are partitioned into
+    # a single partition, the weights variable is followed by a identity ->
+    # identity (an additional identity node).
+    partitioner = partitioned_variables.fixed_size_partitioner(1)
+    with ops.Graph().as_default():
+      with variable_scope.variable_scope("part", partitioner=partitioner):
+        batch_size, height, width, depth = 5, 128, 128, 3
+        input1 = array_ops.zeros(
+            (batch_size, height, width, depth), name="input1")
+        input2 = array_ops.zeros(
+            (batch_size, height, width, depth), name="input2")
+
+        num_nodes = depth
+        filter1 = variable_scope.get_variable("filter", [num_nodes, num_nodes])
+        filter2 = array_ops.reshape(filter1, [1, 1, num_nodes, num_nodes])
+        conv = nn.conv2d(
+            input=input1, filter=filter2, strides=[1, 1, 1, 1], padding="SAME")
+        node = math_ops.add(conv, input2, name="test/add")
+        node = nn.relu6(node, name="test/relu6")
+
+      # Save graph and checkpoints.
+      sess = session.Session()
+      sess.run(variables.global_variables_initializer())
+
+      saver = saver_lib.Saver()
+      checkpoint_path = saver.save(
+          sess,
+          checkpoint_prefix,
+          global_step=0,
+          latest_filename=checkpoint_state_name)
+      graph_io.write_graph(sess.graph, self.get_temp_dir(), input_graph_name)
+
+      # Ensure this graph has partition variables.
+      self.assertTrue([
+          tensor.name.split(":")[0]
+          for op in sess.graph.get_operations()
+          for tensor in op.values()
+          if re.search(r"/part_\d+/", tensor.name)
+      ])
+
+    # Test freezing graph doesn't make it crash.
+    output_node_names = "save/restore_all"
+    output_graph_path = os.path.join(self.get_temp_dir(), output_graph_name)
+
+    return_value = freeze_graph.freeze_graph_with_def_protos(
+        input_graph_def=sess.graph_def,
+        input_saver_def=None,
+        input_checkpoint=checkpoint_path,
+        output_node_names=output_node_names,
+        restore_op_name="save/restore_all",  # default value
+        filename_tensor_name="save/Const:0",  # default value
+        output_graph=output_graph_path,
+        clear_devices=False,
+        initializer_nodes="")
+    self.assertTrue(return_value, -1)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/adam.py b/tensorflow/python/training/adam.py
index b65c88e972..bcbe5907d6 100644
--- a/tensorflow/python/training/adam.py
+++ b/tensorflow/python/training/adam.py
@@ -109,12 +109,13 @@ class AdamOptimizer(optimizer.Optimizer):
     self._updated_lr = None
 
   def _get_beta_accumulators(self):
-    if context.executing_eagerly():
-      graph = None
-    else:
-      graph = ops.get_default_graph()
-    return (self._get_non_slot_variable("beta1_power", graph=graph),
-            self._get_non_slot_variable("beta2_power", graph=graph))
+    with ops.init_scope():
+      if context.executing_eagerly():
+        graph = None
+      else:
+        graph = ops.get_default_graph()
+      return (self._get_non_slot_variable("beta1_power", graph=graph),
+              self._get_non_slot_variable("beta2_power", graph=graph))
 
   def _create_slots(self, var_list):
     # Create the beta1 and beta2 accumulators on the same device as the first
diff --git a/tensorflow/python/training/adam_test.py b/tensorflow/python/training/adam_test.py
index ccdc7e384d..8f84427654 100644
--- a/tensorflow/python/training/adam_test.py
+++ b/tensorflow/python/training/adam_test.py
@@ -315,6 +315,12 @@ class AdamOptimizerTest(test.TestCase):
 
   def testTwoSessions(self):
     optimizer = adam.AdamOptimizer()
+
+    with context.eager_mode():
+      var0 = variables.Variable(np.array([1.0, 2.0]), name="v0")
+      grads0 = constant_op.constant(np.array([0.1, 0.1]))
+      optimizer.apply_gradients([(grads0, var0)])
+
     g = ops.Graph()
     with g.as_default():
       with session.Session():
diff --git a/tensorflow/python/training/checkpoint_management.py b/tensorflow/python/training/checkpoint_management.py
new file mode 100644
index 0000000000..aaddc015ed
--- /dev/null
+++ b/tensorflow/python/training/checkpoint_management.py
@@ -0,0 +1,406 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+# pylint: disable=invalid-name
+"""Save and restore variables."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os.path
+import re
+
+from google.protobuf import text_format
+
+from tensorflow.core.protobuf import saver_pb2
+from tensorflow.python.framework import errors
+from tensorflow.python.lib.io import file_io
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training.checkpoint_state_pb2 import CheckpointState
+from tensorflow.python.util.tf_export import tf_export
+
+
+def _GetCheckpointFilename(save_dir, latest_filename):
+  """Returns a filename for storing the CheckpointState.
+
+  Args:
+    save_dir: The directory for saving and restoring checkpoints.
+    latest_filename: Name of the file in 'save_dir' that is used
+      to store the CheckpointState.
+
+  Returns:
+    The path of the file that contains the CheckpointState proto.
+  """
+  if latest_filename is None:
+    latest_filename = "checkpoint"
+  return os.path.join(save_dir, latest_filename)
+
+
+@tf_export("train.generate_checkpoint_state_proto")
+def generate_checkpoint_state_proto(save_dir,
+                                    model_checkpoint_path,
+                                    all_model_checkpoint_paths=None):
+  """Generates a checkpoint state proto.
+
+  Args:
+    save_dir: Directory where the model was saved.
+    model_checkpoint_path: The checkpoint file.
+    all_model_checkpoint_paths: List of strings.  Paths to all not-yet-deleted
+      checkpoints, sorted from oldest to newest.  If this is a non-empty list,
+      the last element must be equal to model_checkpoint_path.  These paths
+      are also saved in the CheckpointState proto.
+
+  Returns:
+    CheckpointState proto with model_checkpoint_path and
+    all_model_checkpoint_paths updated to either absolute paths or
+    relative paths to the current save_dir.
+  """
+  if all_model_checkpoint_paths is None:
+    all_model_checkpoint_paths = []
+
+  if (not all_model_checkpoint_paths or
+      all_model_checkpoint_paths[-1] != model_checkpoint_path):
+    logging.info("%s is not in all_model_checkpoint_paths. Manually adding it.",
+                 model_checkpoint_path)
+    all_model_checkpoint_paths.append(model_checkpoint_path)
+
+  # Relative paths need to be rewritten to be relative to the "save_dir"
+  # if model_checkpoint_path already contains "save_dir".
+  if not os.path.isabs(save_dir):
+    if not os.path.isabs(model_checkpoint_path):
+      model_checkpoint_path = os.path.relpath(model_checkpoint_path, save_dir)
+    for i in range(len(all_model_checkpoint_paths)):
+      p = all_model_checkpoint_paths[i]
+      if not os.path.isabs(p):
+        all_model_checkpoint_paths[i] = os.path.relpath(p, save_dir)
+
+  coord_checkpoint_proto = CheckpointState(
+      model_checkpoint_path=model_checkpoint_path,
+      all_model_checkpoint_paths=all_model_checkpoint_paths)
+
+  return coord_checkpoint_proto
+
+
+@tf_export("train.update_checkpoint_state")
+def update_checkpoint_state(save_dir,
+                            model_checkpoint_path,
+                            all_model_checkpoint_paths=None,
+                            latest_filename=None):
+  """Updates the content of the 'checkpoint' file.
+
+  This updates the checkpoint file containing a CheckpointState
+  proto.
+
+  Args:
+    save_dir: Directory where the model was saved.
+    model_checkpoint_path: The checkpoint file.
+    all_model_checkpoint_paths: List of strings.  Paths to all not-yet-deleted
+      checkpoints, sorted from oldest to newest.  If this is a non-empty list,
+      the last element must be equal to model_checkpoint_path.  These paths
+      are also saved in the CheckpointState proto.
+    latest_filename: Optional name of the checkpoint file.  Default to
+      'checkpoint'.
+
+  Raises:
+    RuntimeError: If any of the model checkpoint paths conflict with the file
+      containing CheckpointSate.
+  """
+  update_checkpoint_state_internal(
+      save_dir=save_dir,
+      model_checkpoint_path=model_checkpoint_path,
+      all_model_checkpoint_paths=all_model_checkpoint_paths,
+      latest_filename=latest_filename,
+      save_relative_paths=False)
+
+
+def update_checkpoint_state_internal(save_dir,
+                                     model_checkpoint_path,
+                                     all_model_checkpoint_paths=None,
+                                     latest_filename=None,
+                                     save_relative_paths=False):
+  """Updates the content of the 'checkpoint' file.
+
+  This updates the checkpoint file containing a CheckpointState
+  proto.
+
+  Args:
+    save_dir: Directory where the model was saved.
+    model_checkpoint_path: The checkpoint file.
+    all_model_checkpoint_paths: List of strings.  Paths to all not-yet-deleted
+      checkpoints, sorted from oldest to newest.  If this is a non-empty list,
+      the last element must be equal to model_checkpoint_path.  These paths
+      are also saved in the CheckpointState proto.
+    latest_filename: Optional name of the checkpoint file.  Default to
+      'checkpoint'.
+    save_relative_paths: If `True`, will write relative paths to the checkpoint
+      state file.
+
+  Raises:
+    RuntimeError: If any of the model checkpoint paths conflict with the file
+      containing CheckpointSate.
+  """
+  # Writes the "checkpoint" file for the coordinator for later restoration.
+  coord_checkpoint_filename = _GetCheckpointFilename(save_dir, latest_filename)
+  if save_relative_paths:
+    if os.path.isabs(model_checkpoint_path):
+      rel_model_checkpoint_path = os.path.relpath(
+          model_checkpoint_path, save_dir)
+    else:
+      rel_model_checkpoint_path = model_checkpoint_path
+    rel_all_model_checkpoint_paths = []
+    for p in all_model_checkpoint_paths:
+      if os.path.isabs(p):
+        rel_all_model_checkpoint_paths.append(os.path.relpath(p, save_dir))
+      else:
+        rel_all_model_checkpoint_paths.append(p)
+    ckpt = generate_checkpoint_state_proto(
+        save_dir,
+        rel_model_checkpoint_path,
+        all_model_checkpoint_paths=rel_all_model_checkpoint_paths)
+  else:
+    ckpt = generate_checkpoint_state_proto(
+        save_dir,
+        model_checkpoint_path,
+        all_model_checkpoint_paths=all_model_checkpoint_paths)
+
+  if coord_checkpoint_filename == ckpt.model_checkpoint_path:
+    raise RuntimeError("Save path '%s' conflicts with path used for "
+                       "checkpoint state.  Please use a different save path." %
+                       model_checkpoint_path)
+
+  # Preventing potential read/write race condition by *atomically* writing to a
+  # file.
+  file_io.atomic_write_string_to_file(coord_checkpoint_filename,
+                                      text_format.MessageToString(ckpt))
+
+
+@tf_export("train.get_checkpoint_state")
+def get_checkpoint_state(checkpoint_dir, latest_filename=None):
+  """Returns CheckpointState proto from the "checkpoint" file.
+
+  If the "checkpoint" file contains a valid CheckpointState
+  proto, returns it.
+
+  Args:
+    checkpoint_dir: The directory of checkpoints.
+    latest_filename: Optional name of the checkpoint file.  Default to
+      'checkpoint'.
+
+  Returns:
+    A CheckpointState if the state was available, None
+    otherwise.
+
+  Raises:
+    ValueError: if the checkpoint read doesn't have model_checkpoint_path set.
+  """
+  ckpt = None
+  coord_checkpoint_filename = _GetCheckpointFilename(checkpoint_dir,
+                                                     latest_filename)
+  f = None
+  try:
+    # Check that the file exists before opening it to avoid
+    # many lines of errors from colossus in the logs.
+    if file_io.file_exists(coord_checkpoint_filename):
+      file_content = file_io.read_file_to_string(
+          coord_checkpoint_filename)
+      ckpt = CheckpointState()
+      text_format.Merge(file_content, ckpt)
+      if not ckpt.model_checkpoint_path:
+        raise ValueError("Invalid checkpoint state loaded from "
+                         + checkpoint_dir)
+      # For relative model_checkpoint_path and all_model_checkpoint_paths,
+      # prepend checkpoint_dir.
+      if not os.path.isabs(ckpt.model_checkpoint_path):
+        ckpt.model_checkpoint_path = os.path.join(checkpoint_dir,
+                                                  ckpt.model_checkpoint_path)
+      for i in range(len(ckpt.all_model_checkpoint_paths)):
+        p = ckpt.all_model_checkpoint_paths[i]
+        if not os.path.isabs(p):
+          ckpt.all_model_checkpoint_paths[i] = os.path.join(checkpoint_dir, p)
+  except errors.OpError as e:
+    # It's ok if the file cannot be read
+    logging.warning("%s: %s", type(e).__name__, e)
+    logging.warning("%s: Checkpoint ignored", coord_checkpoint_filename)
+    return None
+  except text_format.ParseError as e:
+    logging.warning("%s: %s", type(e).__name__, e)
+    logging.warning("%s: Checkpoint ignored", coord_checkpoint_filename)
+    return None
+  finally:
+    if f:
+      f.close()
+  return ckpt
+
+
+def _prefix_to_checkpoint_path(prefix, format_version):
+  """Returns the pathname of a checkpoint file, given the checkpoint prefix.
+
+  For V1 checkpoint, simply returns the prefix itself (the data file).  For V2,
+  returns the pathname to the index file.
+
+  Args:
+    prefix: a string, the prefix of a checkpoint.
+    format_version: the checkpoint format version that corresponds to the
+      prefix.
+  Returns:
+    The pathname of a checkpoint file, taking into account the checkpoint
+      format version.
+  """
+  if format_version == saver_pb2.SaverDef.V2:
+    return prefix + ".index"  # The index file identifies a checkpoint.
+  return prefix  # Just the data file.
+
+
+@tf_export("train.latest_checkpoint")
+def latest_checkpoint(checkpoint_dir, latest_filename=None):
+  """Finds the filename of latest saved checkpoint file.
+
+  Args:
+    checkpoint_dir: Directory where the variables were saved.
+    latest_filename: Optional name for the protocol buffer file that
+      contains the list of most recent checkpoint filenames.
+      See the corresponding argument to `Saver.save()`.
+
+  Returns:
+    The full path to the latest checkpoint or `None` if no checkpoint was found.
+  """
+  # Pick the latest checkpoint based on checkpoint state.
+  ckpt = get_checkpoint_state(checkpoint_dir, latest_filename)
+  if ckpt and ckpt.model_checkpoint_path:
+    # Look for either a V2 path or a V1 path, with priority for V2.
+    v2_path = _prefix_to_checkpoint_path(ckpt.model_checkpoint_path,
+                                         saver_pb2.SaverDef.V2)
+    v1_path = _prefix_to_checkpoint_path(ckpt.model_checkpoint_path,
+                                         saver_pb2.SaverDef.V1)
+    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
+        v1_path):
+      return ckpt.model_checkpoint_path
+    else:
+      logging.error("Couldn't match files for checkpoint %s",
+                    ckpt.model_checkpoint_path)
+  return None
+
+
+@tf_export("train.checkpoint_exists")
+def checkpoint_exists(checkpoint_prefix):
+  """Checks whether a V1 or V2 checkpoint exists with the specified prefix.
+
+  This is the recommended way to check if a checkpoint exists, since it takes
+  into account the naming difference between V1 and V2 formats.
+
+  Args:
+    checkpoint_prefix: the prefix of a V1 or V2 checkpoint, with V2 taking
+      priority.  Typically the result of `Saver.save()` or that of
+      `tf.train.latest_checkpoint()`, regardless of sharded/non-sharded or
+      V1/V2.
+  Returns:
+    A bool, true iff a checkpoint referred to by `checkpoint_prefix` exists.
+  """
+  pathname = _prefix_to_checkpoint_path(checkpoint_prefix,
+                                        saver_pb2.SaverDef.V2)
+  if file_io.get_matching_files(pathname):
+    return True
+  elif file_io.get_matching_files(checkpoint_prefix):
+    return True
+  else:
+    return False
+
+
+@tf_export("train.get_checkpoint_mtimes")
+def get_checkpoint_mtimes(checkpoint_prefixes):
+  """Returns the mtimes (modification timestamps) of the checkpoints.
+
+  Globs for the checkpoints pointed to by `checkpoint_prefixes`.  If the files
+  exist, collect their mtime.  Both V2 and V1 checkpoints are considered, in
+  that priority.
+
+  This is the recommended way to get the mtimes, since it takes into account
+  the naming difference between V1 and V2 formats.
+
+  Args:
+    checkpoint_prefixes: a list of checkpoint paths, typically the results of
+      `Saver.save()` or those of `tf.train.latest_checkpoint()`, regardless of
+      sharded/non-sharded or V1/V2.
+  Returns:
+    A list of mtimes (in microseconds) of the found checkpoints.
+  """
+  mtimes = []
+
+  def match_maybe_append(pathname):
+    fnames = file_io.get_matching_files(pathname)
+    if fnames:
+      mtimes.append(file_io.stat(fnames[0]).mtime_nsec / 1e9)
+      return True
+    return False
+
+  for checkpoint_prefix in checkpoint_prefixes:
+    # Tries V2's metadata file first.
+    pathname = _prefix_to_checkpoint_path(checkpoint_prefix,
+                                          saver_pb2.SaverDef.V2)
+    if match_maybe_append(pathname):
+      continue
+    # Otherwise, tries V1, where the prefix is the complete pathname.
+    match_maybe_append(checkpoint_prefix)
+
+  return mtimes
+
+
+@tf_export("train.remove_checkpoint")
+def remove_checkpoint(checkpoint_prefix,
+                      checkpoint_format_version=saver_pb2.SaverDef.V2,
+                      meta_graph_suffix="meta"):
+  """Removes a checkpoint given by `checkpoint_prefix`.
+
+  Args:
+    checkpoint_prefix: The prefix of a V1 or V2 checkpoint. Typically the result
+      of `Saver.save()` or that of `tf.train.latest_checkpoint()`, regardless of
+      sharded/non-sharded or V1/V2.
+    checkpoint_format_version: `SaverDef.CheckpointFormatVersion`, defaults to
+      `SaverDef.V2`.
+    meta_graph_suffix: Suffix for `MetaGraphDef` file. Defaults to 'meta'.
+  """
+  _delete_file_if_exists(
+      meta_graph_filename(checkpoint_prefix, meta_graph_suffix))
+  if checkpoint_format_version == saver_pb2.SaverDef.V2:
+    # V2 has a metadata file and some data files.
+    _delete_file_if_exists(checkpoint_prefix + ".index")
+    _delete_file_if_exists(checkpoint_prefix + ".data-?????-of-?????")
+  else:
+    # V1, Legacy.  Exact match on the data file.
+    _delete_file_if_exists(checkpoint_prefix)
+
+
+def _delete_file_if_exists(filespec):
+  """Deletes files matching `filespec`."""
+  for pathname in file_io.get_matching_files(filespec):
+    file_io.delete_file(pathname)
+
+
+def meta_graph_filename(checkpoint_filename, meta_graph_suffix="meta"):
+  """Returns the meta graph filename.
+
+  Args:
+    checkpoint_filename: Name of the checkpoint file.
+    meta_graph_suffix: Suffix for `MetaGraphDef` file. Defaults to 'meta'.
+
+  Returns:
+    MetaGraph file name.
+  """
+  # If the checkpoint_filename is sharded, the checkpoint_filename could
+  # be of format model.ckpt-step#-?????-of-shard#. For example,
+  # model.ckpt-123456-?????-of-00005, or model.ckpt-123456-00001-of-00002.
+  basename = re.sub(r"-[\d\?]+-of-\d+$", "", checkpoint_filename)
+  suffixed_filename = ".".join([basename, meta_graph_suffix])
+  return suffixed_filename
diff --git a/tensorflow/python/training/checkpoint_management_test.py b/tensorflow/python/training/checkpoint_management_test.py
new file mode 100644
index 0000000000..4b31d0c613
--- /dev/null
+++ b/tensorflow/python/training/checkpoint_management_test.py
@@ -0,0 +1,316 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+"""Tests for tensorflow.python.training.saver.py."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import contextlib
+import os
+import shutil
+import tempfile
+
+from google.protobuf import text_format
+
+from tensorflow.core.protobuf import saver_pb2
+from tensorflow.python.framework import ops as ops_lib
+from tensorflow.python.lib.io import file_io
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import gfile
+from tensorflow.python.platform import test
+from tensorflow.python.training import checkpoint_management
+from tensorflow.python.training import saver as saver_module
+from tensorflow.python.training.checkpoint_state_pb2 import CheckpointState
+
+
+class LatestCheckpointWithRelativePaths(test.TestCase):
+
+  @staticmethod
+  @contextlib.contextmanager
+  def tempWorkingDir(temppath):
+    cwd = os.getcwd()
+    os.chdir(temppath)
+    try:
+      yield
+    finally:
+      os.chdir(cwd)
+
+  @staticmethod
+  @contextlib.contextmanager
+  def tempDir():
+    tempdir = tempfile.mkdtemp()
+    try:
+      yield tempdir
+    finally:
+      shutil.rmtree(tempdir)
+
+  def testNameCollision(self):
+    # Make sure we have a clean directory to work in.
+    with self.tempDir() as tempdir:
+      # Jump to that directory until this test is done.
+      with self.tempWorkingDir(tempdir):
+        # Save training snapshots to a relative path.
+        traindir = "train/"
+        os.mkdir(traindir)
+        # Collides with the default name of the checkpoint state file.
+        filepath = os.path.join(traindir, "checkpoint")
+
+        with self.test_session() as sess:
+          unused_a = variables.Variable(0.0)  # So that Saver saves something.
+          variables.global_variables_initializer().run()
+
+          # Should fail.
+          saver = saver_module.Saver(sharded=False)
+          with self.assertRaisesRegexp(ValueError, "collides with"):
+            saver.save(sess, filepath)
+
+          # Succeeds: the file will be named "checkpoint-<step>".
+          saver.save(sess, filepath, global_step=1)
+          self.assertIsNotNone(
+              checkpoint_management.latest_checkpoint(traindir))
+
+          # Succeeds: the file will be named "checkpoint-<i>-of-<n>".
+          saver = saver_module.Saver(sharded=True)
+          saver.save(sess, filepath)
+          self.assertIsNotNone(
+              checkpoint_management.latest_checkpoint(traindir))
+
+          # Succeeds: the file will be named "checkpoint-<step>-<i>-of-<n>".
+          saver = saver_module.Saver(sharded=True)
+          saver.save(sess, filepath, global_step=1)
+          self.assertIsNotNone(
+              checkpoint_management.latest_checkpoint(traindir))
+
+  def testRelativePath(self):
+    # Make sure we have a clean directory to work in.
+    with self.tempDir() as tempdir:
+
+      # Jump to that directory until this test is done.
+      with self.tempWorkingDir(tempdir):
+
+        # Save training snapshots to a relative path.
+        traindir = "train/"
+        os.mkdir(traindir)
+
+        filename = "snapshot"
+        filepath = os.path.join(traindir, filename)
+
+        with self.test_session() as sess:
+          # Build a simple graph.
+          v0 = variables.Variable(0.0)
+          inc = v0.assign_add(1.0)
+
+          save = saver_module.Saver({"v0": v0})
+
+          # Record a short training history.
+          variables.global_variables_initializer().run()
+          save.save(sess, filepath, global_step=0)
+          inc.eval()
+          save.save(sess, filepath, global_step=1)
+          inc.eval()
+          save.save(sess, filepath, global_step=2)
+
+        with self.test_session() as sess:
+          # Build a new graph with different initialization.
+          v0 = variables.Variable(-1.0)
+
+          # Create a new saver.
+          save = saver_module.Saver({"v0": v0})
+          variables.global_variables_initializer().run()
+
+          # Get the most recent checkpoint name from the training history file.
+          name = checkpoint_management.latest_checkpoint(traindir)
+          self.assertIsNotNone(name)
+
+          # Restore "v0" from that checkpoint.
+          save.restore(sess, name)
+          self.assertEqual(v0.eval(), 2.0)
+
+
+class CheckpointStateTest(test.TestCase):
+
+  def _get_test_dir(self, dirname):
+    test_dir = os.path.join(self.get_temp_dir(), dirname)
+    gfile.MakeDirs(test_dir)
+    return test_dir
+
+  def testAbsPath(self):
+    save_dir = self._get_test_dir("abs_paths")
+    abs_path = os.path.join(save_dir, "model-0")
+    ckpt = checkpoint_management.generate_checkpoint_state_proto(
+        save_dir, abs_path)
+    self.assertEqual(ckpt.model_checkpoint_path, abs_path)
+    self.assertTrue(os.path.isabs(ckpt.model_checkpoint_path))
+    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 1)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], abs_path)
+
+  def testRelPath(self):
+    train_dir = "train"
+    model = os.path.join(train_dir, "model-0")
+    # model_checkpoint_path should have no "train" directory part.
+    new_rel_path = "model-0"
+    ckpt = checkpoint_management.generate_checkpoint_state_proto(
+        train_dir, model)
+    self.assertEqual(ckpt.model_checkpoint_path, new_rel_path)
+    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 1)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], new_rel_path)
+
+  def testAllModelCheckpointPaths(self):
+    save_dir = self._get_test_dir("all_models_test")
+    abs_path = os.path.join(save_dir, "model-0")
+    for paths in [None, [], ["model-2"]]:
+      ckpt = checkpoint_management.generate_checkpoint_state_proto(
+          save_dir, abs_path, all_model_checkpoint_paths=paths)
+      self.assertEqual(ckpt.model_checkpoint_path, abs_path)
+      self.assertTrue(os.path.isabs(ckpt.model_checkpoint_path))
+      self.assertEqual(
+          len(ckpt.all_model_checkpoint_paths), len(paths) if paths else 1)
+      self.assertEqual(ckpt.all_model_checkpoint_paths[-1], abs_path)
+
+  def testUpdateCheckpointState(self):
+    save_dir = self._get_test_dir("update_checkpoint_state")
+    os.chdir(save_dir)
+    # Make a temporary train directory.
+    train_dir = "train"
+    os.mkdir(train_dir)
+    abs_path = os.path.join(save_dir, "model-0")
+    rel_path = os.path.join("train", "model-2")
+    checkpoint_management.update_checkpoint_state(
+        train_dir, rel_path, all_model_checkpoint_paths=[abs_path, rel_path])
+    ckpt = checkpoint_management.get_checkpoint_state(train_dir)
+    self.assertEqual(ckpt.model_checkpoint_path, rel_path)
+    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 2)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], rel_path)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[0], abs_path)
+
+  def testUpdateCheckpointStateSaveRelativePaths(self):
+    save_dir = self._get_test_dir("update_checkpoint_state")
+    os.chdir(save_dir)
+    abs_path2 = os.path.join(save_dir, "model-2")
+    rel_path2 = "model-2"
+    abs_path0 = os.path.join(save_dir, "model-0")
+    rel_path0 = "model-0"
+    checkpoint_management.update_checkpoint_state_internal(
+        save_dir=save_dir,
+        model_checkpoint_path=abs_path2,
+        all_model_checkpoint_paths=[rel_path0, abs_path2],
+        save_relative_paths=True)
+
+    # File should contain relative paths.
+    file_content = file_io.read_file_to_string(
+        os.path.join(save_dir, "checkpoint"))
+    ckpt = CheckpointState()
+    text_format.Merge(file_content, ckpt)
+    self.assertEqual(ckpt.model_checkpoint_path, rel_path2)
+    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 2)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], rel_path2)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[0], rel_path0)
+
+    # get_checkpoint_state should return absolute paths.
+    ckpt = checkpoint_management.get_checkpoint_state(save_dir)
+    self.assertEqual(ckpt.model_checkpoint_path, abs_path2)
+    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 2)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], abs_path2)
+    self.assertEqual(ckpt.all_model_checkpoint_paths[0], abs_path0)
+
+  def testCheckPointStateFailsWhenIncomplete(self):
+    save_dir = self._get_test_dir("checkpoint_state_fails_when_incomplete")
+    os.chdir(save_dir)
+    ckpt_path = os.path.join(save_dir, "checkpoint")
+    ckpt_file = open(ckpt_path, "w")
+    ckpt_file.write("")
+    ckpt_file.close()
+    with self.assertRaises(ValueError):
+      checkpoint_management.get_checkpoint_state(save_dir)
+
+  def testCheckPointCompletesRelativePaths(self):
+    save_dir = self._get_test_dir("checkpoint_completes_relative_paths")
+    os.chdir(save_dir)
+    ckpt_path = os.path.join(save_dir, "checkpoint")
+    ckpt_file = open(ckpt_path, "w")
+    ckpt_file.write("""
+        model_checkpoint_path: "./model.ckpt-687529"
+        all_model_checkpoint_paths: "./model.ckpt-687500"
+        all_model_checkpoint_paths: "./model.ckpt-687529"
+        """)
+    ckpt_file.close()
+    ckpt = checkpoint_management.get_checkpoint_state(save_dir)
+    self.assertEqual(ckpt.model_checkpoint_path,
+                     os.path.join(save_dir, "./model.ckpt-687529"))
+    self.assertEqual(ckpt.all_model_checkpoint_paths[0],
+                     os.path.join(save_dir, "./model.ckpt-687500"))
+    self.assertEqual(ckpt.all_model_checkpoint_paths[1],
+                     os.path.join(save_dir, "./model.ckpt-687529"))
+
+
+class SaverUtilsTest(test.TestCase):
+
+  def setUp(self):
+    self._base_dir = os.path.join(self.get_temp_dir(), "saver_utils_test")
+    gfile.MakeDirs(self._base_dir)
+
+  def tearDown(self):
+    gfile.DeleteRecursively(self._base_dir)
+
+  def testCheckpointExists(self):
+    for sharded in (False, True):
+      for version in (saver_pb2.SaverDef.V2, saver_pb2.SaverDef.V1):
+        with self.test_session(graph=ops_lib.Graph()) as sess:
+          unused_v = variables.Variable(1.0, name="v")
+          variables.global_variables_initializer().run()
+          saver = saver_module.Saver(sharded=sharded, write_version=version)
+
+          path = os.path.join(self._base_dir, "%s-%s" % (sharded, version))
+          self.assertFalse(
+              checkpoint_management.checkpoint_exists(path))  # Not saved yet.
+
+          ckpt_prefix = saver.save(sess, path)
+          self.assertTrue(checkpoint_management.checkpoint_exists(ckpt_prefix))
+
+          ckpt_prefix = checkpoint_management.latest_checkpoint(self._base_dir)
+          self.assertTrue(checkpoint_management.checkpoint_exists(ckpt_prefix))
+
+  def testGetCheckpointMtimes(self):
+    prefixes = []
+    for version in (saver_pb2.SaverDef.V2, saver_pb2.SaverDef.V1):
+      with self.test_session(graph=ops_lib.Graph()) as sess:
+        unused_v = variables.Variable(1.0, name="v")
+        variables.global_variables_initializer().run()
+        saver = saver_module.Saver(write_version=version)
+        prefixes.append(
+            saver.save(sess, os.path.join(self._base_dir, str(version))))
+
+    mtimes = checkpoint_management.get_checkpoint_mtimes(prefixes)
+    self.assertEqual(2, len(mtimes))
+    self.assertTrue(mtimes[1] >= mtimes[0])
+
+  def testRemoveCheckpoint(self):
+    for sharded in (False, True):
+      for version in (saver_pb2.SaverDef.V2, saver_pb2.SaverDef.V1):
+        with self.test_session(graph=ops_lib.Graph()) as sess:
+          unused_v = variables.Variable(1.0, name="v")
+          variables.global_variables_initializer().run()
+          saver = saver_module.Saver(sharded=sharded, write_version=version)
+
+          path = os.path.join(self._base_dir, "%s-%s" % (sharded, version))
+          ckpt_prefix = saver.save(sess, path)
+          self.assertTrue(checkpoint_management.checkpoint_exists(ckpt_prefix))
+          checkpoint_management.remove_checkpoint(ckpt_prefix, version)
+          self.assertFalse(checkpoint_management.checkpoint_exists(ckpt_prefix))
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/training/checkpoint_utils.py b/tensorflow/python/training/checkpoint_utils.py
index 5b372e82b3..9b72b09f08 100644
--- a/tensorflow/python/training/checkpoint_utils.py
+++ b/tensorflow/python/training/checkpoint_utils.py
@@ -24,11 +24,12 @@ from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import io_ops
 from tensorflow.python.ops import resource_variable_ops
-from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpoint_management
+from tensorflow.python.training import distribute as distribute_lib
 from tensorflow.python.training import saver
 from tensorflow.python.util.tf_export import tf_export
 
@@ -179,6 +180,16 @@ def init_from_checkpoint(ckpt_dir_or_file, assignment_map):
     tf.errors.OpError: If missing checkpoints or tensors in checkpoints.
     ValueError: If missing variables in current graph.
   """
+  if distribute_lib.get_cross_tower_context():
+    _init_from_checkpoint(None, ckpt_dir_or_file, assignment_map)
+  else:
+    distribute_lib.get_tower_context().merge_call(
+        _init_from_checkpoint, ckpt_dir_or_file, assignment_map)
+
+
+def _init_from_checkpoint(_, ckpt_dir_or_file, assignment_map):
+  """See `init_from_checkpoint` for documentation."""
+
   ckpt_file = _get_checkpoint_filename(ckpt_dir_or_file)
   reader = load_checkpoint(ckpt_dir_or_file)
   variable_map = reader.get_variable_to_shape_map()
@@ -187,10 +198,9 @@ def init_from_checkpoint(ckpt_dir_or_file, assignment_map):
     var = None
     # Check if this is Variable object or list of Variable objects (in case of
     # partitioned variables).
-    is_var = lambda x: isinstance(x, variables.Variable)
-    if is_var(current_var_or_name) or (
+    if _is_variable(current_var_or_name) or (
         isinstance(current_var_or_name, list)
-        and all(is_var(v) for v in current_var_or_name)):
+        and all(_is_variable(v) for v in current_var_or_name)):
       var = current_var_or_name
     else:
       store_vars = vs._get_default_variable_store()._vars  # pylint:disable=protected-access
@@ -205,7 +215,7 @@ def init_from_checkpoint(ckpt_dir_or_file, assignment_map):
         raise ValueError("Tensor %s is not found in %s checkpoint %s" % (
             tensor_name_in_ckpt, ckpt_dir_or_file, variable_map
         ))
-      if is_var(var):
+      if _is_variable(var):
         # Additional at-call-time checks.
         if not var.get_shape().is_compatible_with(
             variable_map[tensor_name_in_ckpt]):
@@ -268,7 +278,7 @@ def init_from_checkpoint(ckpt_dir_or_file, assignment_map):
 def _get_checkpoint_filename(ckpt_dir_or_file):
   """Returns checkpoint filename given directory or specific checkpoint file."""
   if gfile.IsDirectory(ckpt_dir_or_file):
-    return saver.latest_checkpoint(ckpt_dir_or_file)
+    return checkpoint_management.latest_checkpoint(ckpt_dir_or_file)
   return ckpt_dir_or_file
 
 
@@ -297,13 +307,21 @@ def _set_checkpoint_initializer(variable,
   with ops.device(variable.device), ops.device("/cpu:0"):
     restore_op = io_ops.restore_v2(
         ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
-    if isinstance(variable, resource_variable_ops.ResourceVariable):
-      init_op = variable.assign(restore_op, read_value=False)
-    else:
-      init_op = state_ops.assign(variable, restore_op)
-    variable._initializer_op = init_op  # pylint:disable=protected-access
+
+    names_to_saveables = saver.BaseSaverBuilder.OpListToDict([variable])
+    saveable_objects = []
+    for name, op in names_to_saveables.items():
+      for s in saver.BaseSaverBuilder.SaveableObjectsForOp(op, name):
+        saveable_objects.append(s)
+
+    assert len(saveable_objects) == 1  # Should be only one variable.
+    init_op = saveable_objects[0].restore([restore_op], restored_shapes=None)
+
+    # pylint:disable=protected-access
+    variable._initializer_op = init_op
     restore_op.set_shape(variable.shape)
-    variable._initial_value = restore_op  # pylint:disable=protected-access
+    variable._initial_value = restore_op
+    # pylint:enable=protected-access
 
 
 def _set_variable_or_list_initializer(variable_or_list, ckpt_file,
@@ -337,6 +355,11 @@ def _set_variable_or_list_initializer(variable_or_list, ckpt_file,
     _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
 
 
+def _is_variable(x):
+  return (isinstance(x, variables.Variable) or
+          resource_variable_ops.is_resource_variable(x))
+
+
 def _collect_partitioned_variable(name, all_vars):
   """Returns list of `tf.Variable` that comprise the partitioned variable."""
   if name + "/part_0" in all_vars:
diff --git a/tensorflow/python/training/checkpoint_utils_test.py b/tensorflow/python/training/checkpoint_utils_test.py
index 4e08a1c859..1c1f126ce9 100644
--- a/tensorflow/python/training/checkpoint_utils_test.py
+++ b/tensorflow/python/training/checkpoint_utils_test.py
@@ -386,7 +386,9 @@ class CheckpointsTest(test.TestCase):
         op for op in g.get_operations()
         if (op.name.startswith("init_from_checkpoint/") and
             not op.name.startswith("init_from_checkpoint/checkpoint_initializer"
-                                  ) and op.type != "AssignVariableOp")
+                                  ) and
+            op.type != "AssignVariableOp" and
+            op.type != "Identity")
     ]
     self.assertEqual(ops_in_init_from_checkpoint_scope, [])
 
diff --git a/tensorflow/python/training/checkpointable/BUILD b/tensorflow/python/training/checkpointable/BUILD
index 35007653a0..8a289b31b5 100644
--- a/tensorflow/python/training/checkpointable/BUILD
+++ b/tensorflow/python/training/checkpointable/BUILD
@@ -124,14 +124,18 @@ py_test(
     ],
     deps = [
         ":base",
+        ":tracking",
         ":util",
+        "//tensorflow/python:checkpoint_management",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:framework_test_lib",
         "//tensorflow/python:init_ops",
+        "//tensorflow/python:pywrap_tensorflow",
         "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python:saver",
         "//tensorflow/python:session",
         "//tensorflow/python:state_ops",
         "//tensorflow/python:template",
diff --git a/tensorflow/python/training/checkpointable/base.py b/tensorflow/python/training/checkpointable/base.py
index e9c8c21905..66837ee52f 100644
--- a/tensorflow/python/training/checkpointable/base.py
+++ b/tensorflow/python/training/checkpointable/base.py
@@ -144,7 +144,7 @@ class _CheckpointPosition(object):
         # process deferred restorations for it and its dependencies.
         restore_ops = checkpointable._restore_from_checkpoint_position(self)  # pylint: disable=protected-access
         if restore_ops:
-          self._checkpoint.restore_ops.extend(restore_ops)
+          self._checkpoint.new_restore_ops(restore_ops)
 
   def bind_object(self, checkpointable):
     """Set a checkpoint<->object correspondence and process slot variables.
@@ -501,12 +501,6 @@ class CheckpointableBase(object):
       ValueError: If the variable name is not unique.
     """
     self._maybe_initialize_checkpointable()
-    if not overwrite and self._lookup_dependency(name) is not None:
-      raise ValueError(
-          ("A variable named '%s' already exists in this Checkpointable, but "
-           "Checkpointable._add_variable called to create another with "
-           "that name. Variable names must be unique within a Checkpointable "
-           "object.") % (name,))
     with ops.init_scope():
       if context.executing_eagerly():
         # If this is a variable with a single Tensor stored in the checkpoint,
@@ -631,9 +625,9 @@ class CheckpointableBase(object):
           self._unconditional_checkpoint_dependencies[index] = new_reference
     elif current_object is None:
       self._unconditional_checkpoint_dependencies.append(new_reference)
-      self._unconditional_dependency_names[name] = checkpointable
       self._handle_deferred_dependencies(
           name=name, checkpointable=checkpointable)
+    self._unconditional_dependency_names[name] = checkpointable
     return checkpointable
 
   def _handle_deferred_dependencies(self, name, checkpointable):
diff --git a/tensorflow/python/training/checkpointable/base_test.py b/tensorflow/python/training/checkpointable/base_test.py
index 950e9c5b53..fd935ac559 100644
--- a/tensorflow/python/training/checkpointable/base_test.py
+++ b/tensorflow/python/training/checkpointable/base_test.py
@@ -16,8 +16,11 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import test
 from tensorflow.python.training.checkpointable import base
+from tensorflow.python.training.checkpointable import util
 
 
 class InterfaceTests(test.TestCase):
@@ -37,5 +40,22 @@ class InterfaceTests(test.TestCase):
     self.assertIs(duplicate_name_dep, current_dependency)
     self.assertEqual("leaf", current_name)
 
+  def testAddVariableOverwrite(self):
+    root = base.CheckpointableBase()
+    a = root._add_variable_with_custom_getter(
+        name="v", shape=[], getter=variable_scope.get_variable)
+    self.assertEqual([root, a], util.list_objects(root))
+    with ops.Graph().as_default():
+      b = root._add_variable_with_custom_getter(
+          name="v", shape=[], overwrite=True,
+          getter=variable_scope.get_variable)
+      self.assertEqual([root, b], util.list_objects(root))
+    with ops.Graph().as_default():
+      with self.assertRaisesRegexp(
+          ValueError, "already declared as a dependency"):
+        root._add_variable_with_custom_getter(
+            name="v", shape=[], overwrite=False,
+            getter=variable_scope.get_variable)
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/checkpointable/data_structures.py b/tensorflow/python/training/checkpointable/data_structures.py
index 019d43f09c..507cda8734 100644
--- a/tensorflow/python/training/checkpointable/data_structures.py
+++ b/tensorflow/python/training/checkpointable/data_structures.py
@@ -57,6 +57,8 @@ def _wrap_or_unwrap(value):
     return value.value
   if isinstance(value, base.CheckpointableBase):
     return value  # Skip conversion for already checkpointable objects.
+  elif isinstance(value, dict):
+    return _DictWrapper(value)
   elif isinstance(value, list):
     return _ListWrapper(value)
   else:
@@ -438,12 +440,15 @@ class Mapping(CheckpointableDataStructure, collections.Mapping):
   def __init__(self, *args, **kwargs):
     """Construct a new sequence. Arguments are passed to `dict()`."""
     super(Mapping, self).__init__()
-    self._storage = dict(*args, **kwargs)
+    self._storage = self._make_storage(*args, **kwargs)
     self._storage.update(
         {key: self._track_value(
             value, name=self._name_element(key))
          for key, value in self._storage.items()})
 
+  def _make_storage(self, *args, **kwargs):
+    return dict(*args, **kwargs)
+
   def _name_element(self, key):
     if not isinstance(key, six.string_types):
       raise TypeError(
@@ -476,3 +481,185 @@ class Mapping(CheckpointableDataStructure, collections.Mapping):
 
   def __iter__(self):
     return iter(self._storage)
+
+
+# Unlike _ListWrapper, having _DictWrapper inherit from dict and pass isinstance
+# checks seems infeasible. CPython will not call Python methods/properties on
+# dictionary subclasses when running e.g. {}.update(dict_subclass), and instead
+# collects elements directly from dict_subclass's C structs. So subclassing dict
+# implies that the storage has to be "self" (i.e. the C structs for the object
+# must be updated correctly), but we also need that storage to be the wrapped
+# dictionary to avoid synchronization bugs (un-tracked external modifications
+# should still show up when the dict is accessed through the wrapper). Monkey
+# patching all of the "wrapped" dict's methods instead of creating a wrapper
+# object is an option, but not a very attractive one (replacing methods without
+# creating reference cycles is difficult, and then dicts would need to be
+# special cased everywhere as being checkpointable).
+class _DictWrapper(Mapping, collections.MutableMapping):
+  """Wraps built-in dicts to support restore-on-create for variables.
+
+  _DictWrapper is to Mapping as _ListWrapper is to List. Unlike Mapping,
+  _DictWrapper allows non-string keys and values and arbitrary mutations (delete
+  keys, reassign values). Like _ListWrapper, these mutations mean that
+  _DictWrapper will raise an exception on save.
+  """
+
+  def __new__(cls, *args):
+    if len(args) == 1 and isinstance(args[0], dict):
+      return super(_DictWrapper, cls).__new__(cls)
+    else:
+      # Allow construction from a sequence, e.g. for nest.pack_sequence_as. In
+      # this case there's nothing to wrap, so we make a normal dictionary. Also
+      # allows constructing empty instances of the _DictWrapper type, as Session
+      # is wont to do (and again there's nothing to wrap, so a normal dictionary
+      # makes more sense).
+      return dict(*args)
+
+  def __init__(self, wrapped_dict):
+    self._non_string_key = False
+    self._non_append_mutation = False
+    self._external_modification = False
+    super(_DictWrapper, self).__init__(wrapped_dict)
+    self._update_snapshot()
+
+  def _make_storage(self, wrapped_dict):
+    """Re-use the wrapped dict for storage (to force them to be in sync)."""
+    return wrapped_dict
+
+  @property
+  def _checkpoint_dependencies(self):
+    """Check that the object is saveable before listing its dependencies."""
+    self._check_external_modification()
+    if self._non_string_key:
+      raise ValueError(
+          "Unable to save the object %s (a dictionary wrapper constructed "
+          "automatically on attribute assignment). The wrapped dictionary "
+          "contains a non-string key which maps to a checkpointable object or "
+          "mutable data structure.\n\nIf you don't need this dictionary "
+          "checkpointed, wrap it in a tf.contrib.checkpoint.NoDependency "
+          "object; it will be automatically un-wrapped and subsequently "
+          "ignored." % (self,))
+    if self._non_append_mutation:
+      raise ValueError(
+          "Unable to save the object %s (a dictionary wrapper constructed "
+          "automatically on attribute assignment). A key mapping to a "
+          "checkpointable object was overwritten or deleted, which would "
+          "cause problems for restoration.\n\nIf you don't need this "
+          "dictionary checkpointed, wrap it in a "
+          "tf.contrib.checkpoint.NoDependency object; it will be automatically "
+          "un-wrapped and subsequently ignored." % (self,))
+    if self._external_modification:
+      raise ValueError(
+          "Unable to save the object %s (a dictionary wrapper constructed "
+          "automatically on attribute assignment). The wrapped dictionary was "
+          "modified outside the wrapper (its final value was %s, its value "
+          "when a checkpoint dependency was added was %s), which breaks "
+          "restoration on object creation.\n\nIf you don't need this "
+          "dictionary checkpointed, wrap it in a "
+          "tf.contrib.checkpoint.NoDependency object; it will be automatically "
+          "un-wrapped and subsequently ignored." % (
+              self, self, self._last_wrapped_dict_snapshot))
+    assert not self._dirty  # Any reason for dirtiness should have an exception.
+    return super(_DictWrapper, self)._checkpoint_dependencies
+
+  @property
+  def _dirty(self):
+    """Check if there has already been a mutation which prevents saving."""
+    return (self._external_modification
+            or self._non_append_mutation
+            or self._non_string_key)
+
+  def _check_external_modification(self):
+    """Checks for any changes to the wrapped dict not through the wrapper."""
+    if self._dirty:
+      return
+    if self != self._last_wrapped_dict_snapshot:
+      self._external_modification = True
+      self._last_wrapped_dict_snapshot = None
+
+  def _update_snapshot(self):
+    """Acknowledges tracked changes to the wrapped dict."""
+    if self._dirty:
+      return
+    self._last_wrapped_dict_snapshot = dict(self)
+
+  def _track_value(self, value, name):
+    """Allows storage of non-checkpointable objects."""
+    if isinstance(name, six.string_types):
+      string_key = True
+    else:
+      name = "-non_string_key"
+      string_key = False
+    try:
+      no_dependency = isinstance(value, NoDependency)
+      value = super(_DictWrapper, self)._track_value(value=value, name=name)
+      if not (string_key or no_dependency):
+        # A non-string key maps to a checkpointable value. This data structure
+        # is not saveable.
+        self._non_string_key = True
+      return value
+    except ValueError:
+      # Even if this value isn't checkpointable, we need to make sure
+      # NoDependency objects get unwrapped.
+      return sticky_attribute_assignment(
+          checkpointable=self, value=value, name=name)
+
+  def _name_element(self, key):
+    """Don't throw errors for non-string keys."""
+    if isinstance(key, six.string_types):
+      return super(_DictWrapper, self)._name_element(key)
+    else:
+      return key
+
+  def __setitem__(self, key, value):
+    """Allow any modifications, but possibly mark the wrapper as unsaveable."""
+    self._check_external_modification()
+    no_dep = isinstance(value, NoDependency)
+    if isinstance(key, six.string_types):
+      existing_dependency = self._lookup_dependency(key)
+      value = self._track_value(value, name=key)
+    else:
+      value = _wrap_or_unwrap(value)
+      existing_dependency = None
+      if not no_dep and isinstance(value, base.CheckpointableBase):
+        # Non-string keys are OK as long as we have no reason to add a
+        # dependency on the value (either because the value is not
+        # checkpointable, or because it was wrapped in a NoDependency object).
+        self._non_string_key = True
+    current_value = self._storage.setdefault(key, value)
+    if current_value is not value:
+      if ((not no_dep and isinstance(value, base.CheckpointableBase))
+          # We don't want to just check that the existing object is
+          # checkpointable, since it may have been wrapped in a NoDependency
+          # object.
+          or existing_dependency is not None):
+        # A checkpointable object was replaced under the same key; this means
+        # that restoring would be error-prone, so we'll throw an exception on
+        # save.
+        self._non_append_mutation = True
+      self._storage[key] = value
+
+    self._update_snapshot()
+
+  def __delitem__(self, key):
+    self._check_external_modification()
+    existing_value = self[key]
+    if isinstance(existing_value, base.CheckpointableBase):
+      # Deleting tracked checkpointable values means restoring is problematic,
+      # so we'll throw an exception on save.
+      self._non_append_mutation = True
+    del self._storage[key]
+    self._update_snapshot()
+
+  def __repr__(self):
+    return "DictWrapper(%s)" % (repr(self._storage),)
+
+  def __hash__(self):
+    raise TypeError("unhashable type: 'DictWrapper'")
+
+  def __eq__(self, other):
+    return self._storage == getattr(other, "_storage", other)
+
+  def update(self, *args, **kwargs):
+    for key, value in dict(*args, **kwargs).items():
+      self[key] = value
diff --git a/tensorflow/python/training/checkpointable/data_structures_test.py b/tensorflow/python/training/checkpointable/data_structures_test.py
index ec8c9da809..472b7c32b4 100644
--- a/tensorflow/python/training/checkpointable/data_structures_test.py
+++ b/tensorflow/python/training/checkpointable/data_structures_test.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 import os
 
 import numpy
+import six
 
 from tensorflow.python.eager import context
 from tensorflow.python.eager import test
@@ -32,6 +33,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.training.checkpointable import data_structures
 from tensorflow.python.training.checkpointable import tracking
+from tensorflow.python.training.checkpointable import util
 
 
 class HasList(training.Model):
@@ -72,11 +74,14 @@ class ListTests(test.TestCase):
     model = HasList()
     output = model(array_ops.ones([32, 2]))
     self.assertAllEqual([32, 12], output.shape)
-    self.assertEqual(2, len(model.layers))
-    self.assertIs(model.layer_list, model.layers[0])
-    self.assertEqual(10, len(model.layers[0].layers))
+    self.assertEqual(11, len(model.layers))
+    self.assertEqual(10, len(model.layer_list.layers))
+    six.assertCountEqual(
+        self,
+        model.layers,
+        model.layer_list.layers + model.layers_with_updates)
     for index in range(10):
-      self.assertEqual(3 + index, model.layers[0].layers[index].units)
+      self.assertEqual(3 + index, model.layer_list.layers[index].units)
     self.assertEqual(2, len(model._checkpoint_dependencies))
     self.assertIs(model.layer_list, model._checkpoint_dependencies[0].ref)
     self.assertIs(model.layers_with_updates,
@@ -123,9 +128,11 @@ class ListTests(test.TestCase):
         self.l2 = []
 
     model = HasEqualContainers()
-    model.l1.append(HasEqualContainers())
-    model.l2.append(HasEqualContainers())
-    self.assertEqual([model.l1, model.l2], model.layers)
+    first_layer = HasEqualContainers()
+    model.l1.append(first_layer)
+    second_layer = HasEqualContainers()
+    model.l2.append(second_layer)
+    self.assertEqual([first_layer, second_layer], model.layers)
 
   def testNotCheckpointable(self):
     class NotCheckpointable(object):
@@ -260,9 +267,8 @@ class MappingTests(test.TestCase):
     model = HasMapping()
     output = model(array_ops.ones([32, 2]))
     self.assertAllEqual([32, 7], output.shape)
-    self.assertEqual(1, len(model.layers))
-    self.assertIs(model.layer_dict, model.layers[0])
-    self.assertEqual(3, len(model.layers[0].layers))
+    self.assertEqual(5, len(model.layers))
+    six.assertCountEqual(self, model.layers, model.layer_dict.layers)
     self.assertEqual(1, len(model._checkpoint_dependencies))
     self.assertIs(model.layer_dict, model._checkpoint_dependencies[0].ref)
     self.evaluate([v.initializer for v in model.variables])
@@ -298,6 +304,124 @@ class MappingTests(test.TestCase):
                         data_structures.Mapping()])
     self.assertEqual(2, len(has_mappings))
     self.assertNotIn(data_structures.Mapping(), has_mappings)
+    # In contrast to Mapping, dict wrappers are not hashable
+    a = tracking.Checkpointable()
+    a.d = {}
+    self.assertEqual({}, a.d)
+    self.assertFalse({} != a.d)  # pylint: disable=g-explicit-bool-comparison
+    self.assertNotEqual({1: 2}, a.d)
+    with self.assertRaisesRegexp(TypeError, "unhashable"):
+      set([a.d])
+
+  def testDictWrapperBadKeys(self):
+    a = tracking.Checkpointable()
+    a.d = {}
+    a.d[1] = data_structures.List()
+    model = training.Model()
+    model.sub = a
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    with self.assertRaisesRegexp(ValueError, "non-string key"):
+      model.save_weights(save_path)
+
+  def testDictWrapperNoDependency(self):
+    a = tracking.Checkpointable()
+    a.d = data_structures.NoDependency({})
+    a.d[1] = [3]
+    self.assertEqual([a], util.list_objects(a))
+    model = training.Model()
+    model.sub = a
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    model.save_weights(save_path)
+    model.load_weights(save_path)
+
+  def testNonStringKeyNotCheckpointableValue(self):
+    a = tracking.Checkpointable()
+    a.d = {}
+    a.d["a"] = [3]
+    a.d[1] = data_structures.NoDependency([3])
+    self.assertEqual([a, a.d, a.d["a"]], util.list_objects(a))
+    model = training.Model()
+    model.sub = a
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    model.save_weights(save_path)
+    model.load_weights(save_path)
+
+  def testNonAppendNotCheckpointable(self):
+    # Non-append mutations (deleting or overwriting values) are OK when the
+    # values aren't tracked.
+    a = tracking.Checkpointable()
+    a.d = {}
+    a.d["a"] = [3]
+    a.d[1] = 3
+    a.d[1] = 2
+    self.assertEqual(2, a.d[1])
+    del a.d[1]
+    a.d[2] = data_structures.NoDependency(tracking.Checkpointable())
+    second = tracking.Checkpointable()
+    a.d[2] = data_structures.NoDependency(second)
+    self.assertIs(second, a.d[2])
+    self.assertEqual([a, a.d, a.d["a"]], util.list_objects(a))
+    model = training.Model()
+    model.sub = a
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    model.save_weights(save_path)
+    model.load_weights(save_path)
+
+  def testDelNoSave(self):
+    model = training.Model()
+    model.d = {}
+    model.d["a"] = []
+    del model.d["a"]
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    with self.assertRaisesRegexp(ValueError, "overwritten or deleted"):
+      model.save_weights(save_path)
+
+  def testPopNoSave(self):
+    model = training.Model()
+    model.d = {}
+    model.d["a"] = []
+    model.d.pop("a")
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    with self.assertRaisesRegexp(ValueError, "overwritten or deleted"):
+      model.save_weights(save_path)
+
+  def testExternalModificationNoSave(self):
+    model = training.Model()
+    external_reference = {}
+    model.d = external_reference
+    external_reference["a"] = []
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    with self.assertRaisesRegexp(ValueError, "modified outside the wrapper"):
+      model.save_weights(save_path)
+
+  def testOverwriteNoSave(self):
+    model = training.Model()
+    model.d = {}
+    model.d["a"] = {}
+    model.d["a"] = {}
+    save_path = os.path.join(self.get_temp_dir(), "ckpt")
+    with self.assertRaisesRegexp(ValueError, "overwritten or deleted"):
+      model.save_weights(save_path)
+
+  def testIter(self):
+    model = training.Model()
+    model.d = {1: 3}
+    model.d[1] = 3
+    self.assertEqual([1], list(model.d))
+    new_dict = {}
+    # This update() is super tricky. If the dict wrapper subclasses dict,
+    # CPython will access its storage directly instead of calling any
+    # methods/properties on the object. So the options are either not to
+    # subclass dict (in which case update will call normal iter methods, but the
+    # object won't pass isinstance checks) or to subclass dict and keep that
+    # storage updated (no shadowing all its methods like _ListWrapper).
+    new_dict.update(model.d)
+    self.assertEqual({1: 3}, new_dict)
+
+  def testConstructableFromSequence(self):
+    result = data_structures._DictWrapper([(1, 2), (3, 4)])
+    self.assertIsInstance(result, dict)
+    self.assertEqual({1: 2, 3: 4}, result)
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/checkpointable/layer_utils.py b/tensorflow/python/training/checkpointable/layer_utils.py
index 978fcb2252..d65b631fe9 100644
--- a/tensorflow/python/training/checkpointable/layer_utils.py
+++ b/tensorflow/python/training/checkpointable/layer_utils.py
@@ -32,10 +32,15 @@ def is_layer(obj):
 
 def filter_empty_layer_containers(layer_list):
   """Filter out empty Layer-like containers."""
-  return [layer for layer in layer_list
-          # Filter out only empty Checkpointable data structures. Empty Networks
-          # will still show up in Model.layers.
-          if is_layer(layer) or getattr(layer, "layers", True)]
+  filtered = []
+  for obj in layer_list:
+    if is_layer(obj):
+      filtered.append(obj)
+    else:
+      # Checkpointable data structures will not show up in ".layers" lists, but
+      # the layers they contain will.
+      filtered.extend(obj.layers)
+  return filtered
 
 
 def gather_trainable_weights(trainable, sub_layers, extra_variables):
diff --git a/tensorflow/python/training/checkpointable/tracking_test.py b/tensorflow/python/training/checkpointable/tracking_test.py
index f0178b074d..e85f812ce2 100644
--- a/tensorflow/python/training/checkpointable/tracking_test.py
+++ b/tensorflow/python/training/checkpointable/tracking_test.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 import os
 
 import numpy
+import six
 
 from tensorflow.python.framework import test_util
 from tensorflow.python.keras.engine import training
@@ -44,6 +45,9 @@ class InterfaceTests(test.TestCase):
     # from doing this while maintaining backward compatibility.
     root.leaf = duplicate_name_dep
     root._track_checkpointable(duplicate_name_dep, name="leaf", overwrite=True)
+    self.assertIs(duplicate_name_dep, root._lookup_dependency("leaf"))
+    (_, dep_object), = root._checkpoint_dependencies
+    self.assertIs(duplicate_name_dep, dep_object)
 
   def testNoDependency(self):
     root = tracking.Checkpointable()
@@ -141,6 +145,30 @@ class InterfaceTests(test.TestCase):
       checkpoint.save(os.path.join(self.get_temp_dir(), "ckpt"))
 
   @test_util.run_in_graph_and_eager_modes
+  def testDictionariesBasic(self):
+    a = training.Model()
+    b = training.Model()
+    a.attribute = {"b": b}
+    c = training.Model()
+    a.attribute["c"] = []
+    a.attribute["c"].append(c)
+    a_deps = util.list_objects(a)
+    self.assertIn(b, a_deps)
+    self.assertIn(c, a_deps)
+    self.assertIs(b, a.attribute["b"])
+    six.assertCountEqual(
+        self,
+        ["b", "c"],
+        [dep.name for dep in a.attribute._checkpoint_dependencies])
+    self.assertEqual([b, c], a.layers)
+    self.assertEqual([b, c], a.attribute.layers)
+    self.assertEqual([c], a.attribute["c"].layers)
+    checkpoint = util.Checkpoint(a=a)
+    save_path = checkpoint.save(os.path.join(self.get_temp_dir(), "ckpt"))
+    with self.test_session():
+      checkpoint.restore(save_path).assert_consumed().initialize_or_restore()
+
+  @test_util.run_in_graph_and_eager_modes
   def testNoDepList(self):
     a = training.Model()
     a.l1 = data_structures.NoDependency([])
@@ -156,12 +184,13 @@ class InterfaceTests(test.TestCase):
   @test_util.run_in_graph_and_eager_modes
   def testAssertions(self):
     a = tracking.Checkpointable()
-    a.l = [numpy.zeros([2, 2])]
-    self.assertAllEqual([numpy.zeros([2, 2])], a.l)
-    self.assertAllClose([numpy.zeros([2, 2])], a.l)
-    nest.map_structure(self.assertAllClose, a.l, [numpy.zeros([2, 2])])
-    a.tensors = [array_ops.ones([2, 2]), array_ops.zeros([3, 3])]
-    self.assertAllClose([numpy.ones([2, 2]), numpy.zeros([3, 3])],
+    a.l = {"k": [numpy.zeros([2, 2])]}
+    self.assertAllEqual(nest.flatten({"k": [numpy.zeros([2, 2])]}),
+                        nest.flatten(a.l))
+    self.assertAllClose({"k": [numpy.zeros([2, 2])]}, a.l)
+    nest.map_structure(self.assertAllClose, a.l, {"k": [numpy.zeros([2, 2])]})
+    a.tensors = {"k": [array_ops.ones([2, 2]), array_ops.zeros([3, 3])]}
+    self.assertAllClose({"k": [numpy.ones([2, 2]), numpy.zeros([3, 3])]},
                         self.evaluate(a.tensors))
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/training/checkpointable/util.py b/tensorflow/python/training/checkpointable/util.py
index 6ae5765b13..3cdaedce98 100644
--- a/tensorflow/python/training/checkpointable/util.py
+++ b/tensorflow/python/training/checkpointable/util.py
@@ -101,6 +101,7 @@ class _CheckpointRestoreCoordinator(object):
     # this checkpoint.
     self.restore_ops = []
     self.restore_ops_by_name = {}
+    self.new_restore_ops_callback = None
     # A mapping from optimizer proto ids to lists of slot variables to be
     # restored when the optimizer is tracked. Only includes slot variables whose
     # regular variables have already been created, and only for optimizer
@@ -121,6 +122,11 @@ class _CheckpointRestoreCoordinator(object):
                     slot_variable_id=slot_reference.slot_variable_node_id,
                     slot_name=slot_reference.slot_name))
 
+  def new_restore_ops(self, new_ops):
+    self.restore_ops.extend(new_ops)
+    if self.new_restore_ops_callback:
+      self.new_restore_ops_callback(new_ops)  # pylint: disable=not-callable
+
 
 class _NameBasedRestoreCoordinator(object):
   """Keeps the status of a name-based checkpoint restore."""
@@ -361,24 +367,42 @@ class _ObjectIdentityWeakKeyDictionary(_ObjectIdentityDictionary):
         yield unwrapped
 
 
-class _ObjectIdentityWeakSet(collections.MutableSet):
-  """Like weakref.WeakSet, but compares objects with "is"."""
+class _ObjectIdentitySet(collections.MutableSet):
+  """Like the built-in set, but compares objects with "is"."""
 
-  def __init__(self):
-    self._storage = set()
+  def __init__(self, *args):
+    self._storage = set([self._wrap_key(obj) for obj in list(*args)])
+
+  def _wrap_key(self, key):
+    return _ObjectIdentityWrapper(key)
 
   def __contains__(self, key):
-    return _WeakObjectIdentityWrapper(key) in self._storage
+    return self._wrap_key(key) in self._storage
 
   def discard(self, key):
-    self._storage.discard(_WeakObjectIdentityWrapper(key))
+    self._storage.discard(self._wrap_key(key))
 
   def add(self, key):
-    self._storage.add(_WeakObjectIdentityWrapper(key))
+    self._storage.add(self._wrap_key(key))
+
+  def __len__(self):
+    return len(self._storage)
+
+  def __iter__(self):
+    keys = list(self._storage)
+    for key in keys:
+      yield key.unwrapped
+
+
+class _ObjectIdentityWeakSet(_ObjectIdentitySet):
+  """Like weakref.WeakSet, but compares objects with "is"."""
+
+  def _wrap_key(self, key):
+    return _WeakObjectIdentityWrapper(key)
 
   def __len__(self):
     # Iterate, discarding old weak refs
-    return len(list(self))
+    return len([_ for _ in self])
 
   def __iter__(self):
     keys = list(self._storage)
@@ -747,7 +771,7 @@ def capture_dependencies(template):
           initial_value=initializer,
           name=name,
           **inner_kwargs)
-    if name.startswith(name_prefix):
+    if name is not None and name.startswith(name_prefix):
       scope_stripped_name = name[len(name_prefix) + 1:]
       if not checkpointable_parent:
         return template._add_variable_with_custom_getter(  # pylint: disable=protected-access
@@ -803,6 +827,31 @@ class _LoadStatus(object):
     pass
 
 
+def streaming_restore(status, session=None):
+  """When graph building, runs restore ops as soon as they come in.
+
+  Args:
+    status: A _LoadStatus objects from an object-based saver's
+      restore(). Streaming restore from name-based checkpoints is not currently
+      supported.
+    session: A session to run new restore ops in.
+  """
+  if context.executing_eagerly():
+    # Streaming restore is the default/only behavior when executing eagerly.
+    return
+  if session is None:
+    session = ops.get_default_session()
+  if isinstance(status, NameBasedSaverStatus):
+    raise NotImplementedError(
+        "Streaming restore not supported from name-based checkpoints. File a "
+        "feature request if this limitation bothers you.")
+  status.run_restore_ops(session=session)
+  # pylint: disable=protected-access
+  status._checkpoint.new_restore_ops_callback = (
+      lambda ops: session.run(ops, feed_dict=status._feed_dict))
+  # pylint: enable=protected-access
+
+
 class CheckpointLoadStatus(_LoadStatus):
   """Checks the status of checkpoint loading and manages restore ops.
 
@@ -857,8 +906,8 @@ class CheckpointLoadStatus(_LoadStatus):
     for checkpointable_object in list_objects(self._root_checkpointable):
       self._checkpoint.all_python_objects.add(checkpointable_object)
     unused_python_objects = (
-        set(self._checkpoint.all_python_objects)
-        - set(self._checkpoint.object_by_proto_id.values()))
+        _ObjectIdentitySet(self._checkpoint.all_python_objects)
+        - _ObjectIdentitySet(self._checkpoint.object_by_proto_id.values()))
     if unused_python_objects:
       raise AssertionError(
           ("Some Python objects were not bound to checkpointed values, likely "
@@ -894,7 +943,7 @@ class CheckpointLoadStatus(_LoadStatus):
     if session is None:
       session = ops.get_default_session()
     all_objects = list_objects(self._root_checkpointable)
-    already_initialized_objects = set(
+    already_initialized_objects = _ObjectIdentitySet(
         self._checkpoint.object_by_proto_id.values())
     initializers_for_non_restored_variables = [
         c.initializer for c in all_objects
@@ -974,11 +1023,13 @@ _DEPRECATED_RESTORE_INSTRUCTIONS = (
     "one this message is coming from) and use that checkpoint in the future.")
 
 
-@deprecation.deprecated(
-    date=None, instructions=_DEPRECATED_RESTORE_INSTRUCTIONS)
 class NameBasedSaverStatus(_LoadStatus):
   """Status for loading a name-based training checkpoint."""
 
+  # Ideally this deprecation decorator would be on the class, but that
+  # interferes with isinstance checks.
+  @deprecation.deprecated(
+      date=None, instructions=_DEPRECATED_RESTORE_INSTRUCTIONS)
   def __init__(self, checkpoint, root_checkpointable):
     self._checkpoint = checkpoint
     self._root_checkpointable = root_checkpointable
diff --git a/tensorflow/python/training/checkpointable/util_test.py b/tensorflow/python/training/checkpointable/util_test.py
index 896ea47b97..5506e6bc4e 100644
--- a/tensorflow/python/training/checkpointable/util_test.py
+++ b/tensorflow/python/training/checkpointable/util_test.py
@@ -42,6 +42,7 @@ from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import template
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.training import adam
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.training import training_util
 from tensorflow.python.training.checkpointable import base
@@ -102,7 +103,7 @@ class InterfaceTests(test.TestCase):
         name="duplicate", initial_value=1.)
     duplicate = checkpointable_utils.add_variable(
         obj, name="duplicate", shape=[])
-    with self.assertRaisesRegexp(ValueError, "'duplicate' already exists"):
+    with self.assertRaisesRegexp(ValueError, "'duplicate'.*already declared"):
       checkpointable_utils.add_variable(obj, name="duplicate", shape=[])
 
     self.evaluate(checkpointable_utils.gather_initializers(obj))
@@ -467,7 +468,8 @@ class CheckpointingTests(test.TestCase):
       root = checkpointable_utils.Checkpoint(
           optimizer=optimizer, model=model,
           optimizer_step=training_util.get_or_create_global_step())
-      root.restore(saver_lib.latest_checkpoint(checkpoint_directory))
+      root.restore(checkpoint_management.latest_checkpoint(
+          checkpoint_directory))
       for _ in range(num_training_steps):
         # TODO(allenl): Use a Dataset and serialize/checkpoint it.
         input_value = constant_op.constant([[3.]])
@@ -495,7 +497,8 @@ class CheckpointingTests(test.TestCase):
           train_op = optimizer.minimize(
               model(input_value),
               global_step=root.global_step)
-          checkpoint_path = saver_lib.latest_checkpoint(checkpoint_directory)
+          checkpoint_path = checkpoint_management.latest_checkpoint(
+              checkpoint_directory)
           with self.test_session(graph=ops.get_default_graph()) as session:
             status = root.restore(save_path=checkpoint_path)
             status.initialize_or_restore(session=session)
@@ -528,7 +531,8 @@ class CheckpointingTests(test.TestCase):
         root = checkpointable_utils.Checkpoint(
             optimizer=optimizer, model=model,
             global_step=training_util.get_or_create_global_step())
-        checkpoint_path = saver_lib.latest_checkpoint(checkpoint_directory)
+        checkpoint_path = checkpoint_management.latest_checkpoint(
+            checkpoint_directory)
         status = root.restore(save_path=checkpoint_path)
         input_value = constant_op.constant([[3.]])
         train_fn = functools.partial(
@@ -561,7 +565,8 @@ class CheckpointingTests(test.TestCase):
         root = checkpointable_utils.Checkpoint(
             optimizer=optimizer, model=model,
             global_step=training_util.get_or_create_global_step())
-        checkpoint_path = saver_lib.latest_checkpoint(checkpoint_directory)
+        checkpoint_path = checkpoint_management.latest_checkpoint(
+            checkpoint_directory)
         status = root.restore(save_path=checkpoint_path)
         def train_fn():
           @function.defun
@@ -1180,7 +1185,8 @@ class CheckpointingTests(test.TestCase):
       optimizer_checkpoint = checkpointable_utils.Checkpoint(
           optimizer=optimizer)
 
-      checkpoint_path = saver_lib.latest_checkpoint(checkpoint_directory)
+      checkpoint_path = checkpoint_management.latest_checkpoint(
+          checkpoint_directory)
       status = root.restore(save_path=checkpoint_path)
       input_value = constant_op.constant([[3.]])
       train_fn = functools.partial(
diff --git a/tensorflow/python/training/distribute.py b/tensorflow/python/training/distribute.py
index d33fd7376a..c719045c7f 100644
--- a/tensorflow/python/training/distribute.py
+++ b/tensorflow/python/training/distribute.py
@@ -614,48 +614,6 @@ class DistributionStrategy(object):
     # Note: should support "colocate_with" argument.
     raise NotImplementedError("must be implemented in descendants")
 
-  def tower_local_var_scope(self, aggregation):
-    """Inside this scope, new variables will not be mirrored.
-
-    There will still be one component variable per tower, but there is
-    no requirement that they stay in sync. Instead, when saving them
-    or calling `read_var()`, we use the value that results when
-    calling `reduce()` on all the towers' variables.
-
-    Note: tower-local implies not trainable. Instead, it is expected
-    that each tower will directly update (using `assign_add()` or
-    whatever) its local variable instance but only the aggregated
-    value (accessible using `read_var()`) will be exported from the
-    model. When it is acceptable to only aggregate on export, we
-    greatly reduce communication overhead by using tower-local
-    variables.
-
-    Note: All component variables will be initialized to the same
-    value, using the initialization expression from the first tower.
-    The values will match even if the initialization expression uses
-    random numbers.
-
-    Args:
-      aggregation: Indicates how a variable will be aggregated. Accepted values
-        are @{tf.VariableAggregation.SUM}, @{tf.VariableAggregation.MEAN}.
-
-    Returns:
-      A context manager.
-    """
-    # TODO(psv): Remove this after adding support for synchronization and
-    # aggregation parameters in get_variable() and mirrored strategy.
-    def create_tower_local_variable(next_creator, *args, **kwargs):
-      _require_distribution_strategy_scope(self)
-      kwargs["use_resource"] = True
-
-      # Set synchronization to be ON_READ for tower local variables.
-      kwargs["synchronization"] = variable_scope.VariableSynchronization.ON_READ
-      kwargs["aggregation"] = aggregation
-      return next_creator(*args, **kwargs)
-
-    _require_distribution_strategy_scope(self)
-    return variable_scope.variable_creator_scope(create_tower_local_variable)
-
   def read_var(self, v):
     """Reads the value of a variable.
 
@@ -1103,10 +1061,6 @@ class TowerContext(object):
     finally:
       _pop_per_thread_mode()
 
-  def tower_local_var_scope(self, aggregation):
-    """Alias for distribution_strategy.tower_local_var_scope()."""
-    return self._distribution_strategy.tower_local_var_scope(aggregation)
-
   @property
   def is_single_tower(self):
     """Returns whether there is a single tower or multiple."""
@@ -1158,16 +1112,6 @@ class _DefaultDistributionStrategy(DistributionStrategy):
     return _CurrentDistributionContext(
         self, variable_scope.variable_creator_scope(creator))
 
-  def tower_local_var_scope(self, aggregation):
-    """Does not set to resource variables."""
-    def create_tower_local_variable(next_creator, *args, **kwargs):
-      _require_distribution_strategy_scope(self)
-      kwargs["trainable"] = False
-      return next_creator(*args, **kwargs)
-
-    _require_distribution_strategy_scope(self)
-    return variable_scope.variable_creator_scope(create_tower_local_variable)
-
   def colocate_vars_with(self, colocate_with_variable):
     """Does not require `self.scope`."""
     _require_distribution_strategy_scope(self)
diff --git a/tensorflow/python/training/learning_rate_decay.py b/tensorflow/python/training/learning_rate_decay.py
index 51190264e8..fd195a7965 100644
--- a/tensorflow/python/training/learning_rate_decay.py
+++ b/tensorflow/python/training/learning_rate_decay.py
@@ -356,7 +356,15 @@ def natural_exp_decay(learning_rate,
   The function returns the decayed learning rate.  It is computed as:
 
   ```python
-  decayed_learning_rate = learning_rate * exp(-decay_rate * global_step)
+  decayed_learning_rate = learning_rate * exp(-decay_rate * global_step /
+  decay_step)
+  ```
+
+  or, if `staircase` is `True`, as:
+
+  ```python
+  decayed_learning_rate = learning_rate * exp(-decay_rate * floor(global_step /
+  decay_step))
   ```
 
   Example: decay exponentially with a base of 0.96:
@@ -365,8 +373,10 @@ def natural_exp_decay(learning_rate,
   ...
   global_step = tf.Variable(0, trainable=False)
   learning_rate = 0.1
+  decay_steps = 5
   k = 0.5
-  learning_rate = tf.train.exponential_time_decay(learning_rate, global_step, k)
+  learning_rate = tf.train.natural_exp_decay(learning_rate, global_step,
+                                             decay_steps, k)
 
   # Passing global_step to minimize() will increment it at each step.
   learning_step = (
diff --git a/tensorflow/python/training/monitored_session_test.py b/tensorflow/python/training/monitored_session_test.py
index 3806056f01..92533ca4f3 100644
--- a/tensorflow/python/training/monitored_session_test.py
+++ b/tensorflow/python/training/monitored_session_test.py
@@ -44,6 +44,7 @@ from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.summary import summary
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import coordinator
 from tensorflow.python.training import monitored_session
 from tensorflow.python.training import saver as saver_lib
@@ -1364,8 +1365,8 @@ class MonitoredSessionTest(test.TestCase):
       with monitored_session.MonitoredSession(
           session_creator=monitored_session.ChiefSessionCreator(
               scaffold,
-              checkpoint_filename_with_path=saver_lib.latest_checkpoint(
-                  logdir))) as session:
+              checkpoint_filename_with_path=
+              checkpoint_management.latest_checkpoint(logdir))) as session:
         self.assertEqual(2, session.run(gstep))
 
   def test_retry_initialization_on_aborted_error(self):
diff --git a/tensorflow/python/training/optimizer.py b/tensorflow/python/training/optimizer.py
index 971ed5c8b5..f75db08059 100644
--- a/tensorflow/python/training/optimizer.py
+++ b/tensorflow/python/training/optimizer.py
@@ -77,9 +77,10 @@ def _deduplicate_indexed_slices(values, indices):
 
 
 def _var_key(var):
-  if context.executing_eagerly():
-    return var._unique_id  # pylint: disable=protected-access
-  return (var.op.graph, var.op.name)
+  # TODO(ashankar): Consolidate handling for eager and graph
+  if hasattr(var, "op"):
+    return (var.op.graph, var.op.name)
+  return var._unique_id  # pylint: disable=protected-access
 
 
 class _OptimizableVariable(object):
diff --git a/tensorflow/python/training/quantize_training.i b/tensorflow/python/training/quantize_training.i
index fb5e47efa0..54d6789616 100644
--- a/tensorflow/python/training/quantize_training.i
+++ b/tensorflow/python/training/quantize_training.i
@@ -73,6 +73,8 @@ def do_quantize_training_on_graphdef(input_graph, num_bits):
 
 do_quantize_training_on_graphdef._tf_api_names = [
     'train.do_quantize_training_on_graphdef']
+do_quantize_training_on_graphdef._tf_api_names_v1 = [
+    'train.do_quantize_training_on_graphdef']
 %}
 
 %unignoreall
diff --git a/tensorflow/python/training/saver.py b/tensorflow/python/training/saver.py
index 1ee975fbe4..213c11c50d 100644
--- a/tensorflow/python/training/saver.py
+++ b/tensorflow/python/training/saver.py
@@ -21,15 +21,12 @@ from __future__ import print_function
 
 import collections
 import os.path
-import re
 import time
 import uuid
 
 import numpy as np
 import six
 
-from google.protobuf import text_format
-
 from tensorflow.core.protobuf import checkpointable_object_graph_pb2
 from tensorflow.core.protobuf import meta_graph_pb2
 from tensorflow.core.protobuf import saver_pb2
@@ -41,7 +38,6 @@ from tensorflow.python.framework import device as pydev
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import meta_graph
 from tensorflow.python.framework import ops
-from tensorflow.python.lib.io import file_io
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_io_ops
@@ -52,14 +48,25 @@ from tensorflow.python.ops import string_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saveable_object
 from tensorflow.python.training import training_util
-from tensorflow.python.training.checkpoint_state_pb2 import CheckpointState
 from tensorflow.python.training.checkpointable import base as checkpointable
 from tensorflow.python.util import compat
 from tensorflow.python.util.tf_export import tf_export
 
 
+# TODO(allenl): Remove these aliases once all users are migrated off.
+get_checkpoint_state = checkpoint_management.get_checkpoint_state
+update_checkpoint_state = checkpoint_management.update_checkpoint_state
+generate_checkpoint_state_proto = (
+    checkpoint_management.generate_checkpoint_state_proto)
+latest_checkpoint = checkpoint_management.latest_checkpoint
+checkpoint_exists = checkpoint_management.checkpoint_exists
+get_checkpoint_mtimes = checkpoint_management.get_checkpoint_mtimes
+remove_checkpoint = checkpoint_management.remove_checkpoint
+
+
 # Op names which identify variable reads which should be saved.
 _VARIABLE_OPS = set(["Variable",
                      "VariableV2",
@@ -126,8 +133,10 @@ class BaseSaverBuilder(object):
           def f():
             with ops.device(v.device):
               x = v.read_value()
-            with ops.device("/device:CPU:0"):
-              return array_ops.identity(x)
+              # To allow variables placed on non-CPU devices to be checkpointed,
+              # we copy them to CPU on the same machine first.
+              with ops.device("/device:CPU:0"):
+                return array_ops.identity(x)
           return f
 
         self.handle_op = var.handle
@@ -856,218 +865,6 @@ def _get_saver_or_default():
   return saver
 
 
-def _GetCheckpointFilename(save_dir, latest_filename):
-  """Returns a filename for storing the CheckpointState.
-
-  Args:
-    save_dir: The directory for saving and restoring checkpoints.
-    latest_filename: Name of the file in 'save_dir' that is used
-      to store the CheckpointState.
-
-  Returns:
-    The path of the file that contains the CheckpointState proto.
-  """
-  if latest_filename is None:
-    latest_filename = "checkpoint"
-  return os.path.join(save_dir, latest_filename)
-
-
-@tf_export("train.generate_checkpoint_state_proto")
-def generate_checkpoint_state_proto(save_dir,
-                                    model_checkpoint_path,
-                                    all_model_checkpoint_paths=None):
-  """Generates a checkpoint state proto.
-
-  Args:
-    save_dir: Directory where the model was saved.
-    model_checkpoint_path: The checkpoint file.
-    all_model_checkpoint_paths: List of strings.  Paths to all not-yet-deleted
-      checkpoints, sorted from oldest to newest.  If this is a non-empty list,
-      the last element must be equal to model_checkpoint_path.  These paths
-      are also saved in the CheckpointState proto.
-
-  Returns:
-    CheckpointState proto with model_checkpoint_path and
-    all_model_checkpoint_paths updated to either absolute paths or
-    relative paths to the current save_dir.
-  """
-  if all_model_checkpoint_paths is None:
-    all_model_checkpoint_paths = []
-
-  if (not all_model_checkpoint_paths or
-      all_model_checkpoint_paths[-1] != model_checkpoint_path):
-    logging.info("%s is not in all_model_checkpoint_paths. Manually adding it.",
-                 model_checkpoint_path)
-    all_model_checkpoint_paths.append(model_checkpoint_path)
-
-  # Relative paths need to be rewritten to be relative to the "save_dir"
-  # if model_checkpoint_path already contains "save_dir".
-  if not os.path.isabs(save_dir):
-    if not os.path.isabs(model_checkpoint_path):
-      model_checkpoint_path = os.path.relpath(model_checkpoint_path, save_dir)
-    for i in range(len(all_model_checkpoint_paths)):
-      p = all_model_checkpoint_paths[i]
-      if not os.path.isabs(p):
-        all_model_checkpoint_paths[i] = os.path.relpath(p, save_dir)
-
-  coord_checkpoint_proto = CheckpointState(
-      model_checkpoint_path=model_checkpoint_path,
-      all_model_checkpoint_paths=all_model_checkpoint_paths)
-
-  return coord_checkpoint_proto
-
-
-@tf_export("train.update_checkpoint_state")
-def update_checkpoint_state(save_dir,
-                            model_checkpoint_path,
-                            all_model_checkpoint_paths=None,
-                            latest_filename=None):
-  """Updates the content of the 'checkpoint' file.
-
-  This updates the checkpoint file containing a CheckpointState
-  proto.
-
-  Args:
-    save_dir: Directory where the model was saved.
-    model_checkpoint_path: The checkpoint file.
-    all_model_checkpoint_paths: List of strings.  Paths to all not-yet-deleted
-      checkpoints, sorted from oldest to newest.  If this is a non-empty list,
-      the last element must be equal to model_checkpoint_path.  These paths
-      are also saved in the CheckpointState proto.
-    latest_filename: Optional name of the checkpoint file.  Default to
-      'checkpoint'.
-
-  Raises:
-    RuntimeError: If any of the model checkpoint paths conflict with the file
-      containing CheckpointSate.
-  """
-  _update_checkpoint_state(
-      save_dir=save_dir,
-      model_checkpoint_path=model_checkpoint_path,
-      all_model_checkpoint_paths=all_model_checkpoint_paths,
-      latest_filename=latest_filename,
-      save_relative_paths=False)
-
-
-def _update_checkpoint_state(save_dir,
-                             model_checkpoint_path,
-                             all_model_checkpoint_paths=None,
-                             latest_filename=None,
-                             save_relative_paths=False):
-  """Updates the content of the 'checkpoint' file.
-
-  This updates the checkpoint file containing a CheckpointState
-  proto.
-
-  Args:
-    save_dir: Directory where the model was saved.
-    model_checkpoint_path: The checkpoint file.
-    all_model_checkpoint_paths: List of strings.  Paths to all not-yet-deleted
-      checkpoints, sorted from oldest to newest.  If this is a non-empty list,
-      the last element must be equal to model_checkpoint_path.  These paths
-      are also saved in the CheckpointState proto.
-    latest_filename: Optional name of the checkpoint file.  Default to
-      'checkpoint'.
-    save_relative_paths: If `True`, will write relative paths to the checkpoint
-      state file.
-
-  Raises:
-    RuntimeError: If any of the model checkpoint paths conflict with the file
-      containing CheckpointSate.
-  """
-  # Writes the "checkpoint" file for the coordinator for later restoration.
-  coord_checkpoint_filename = _GetCheckpointFilename(save_dir, latest_filename)
-  if save_relative_paths:
-    if os.path.isabs(model_checkpoint_path):
-      rel_model_checkpoint_path = os.path.relpath(
-          model_checkpoint_path, save_dir)
-    else:
-      rel_model_checkpoint_path = model_checkpoint_path
-    rel_all_model_checkpoint_paths = []
-    for p in all_model_checkpoint_paths:
-      if os.path.isabs(p):
-        rel_all_model_checkpoint_paths.append(os.path.relpath(p, save_dir))
-      else:
-        rel_all_model_checkpoint_paths.append(p)
-    ckpt = generate_checkpoint_state_proto(
-        save_dir,
-        rel_model_checkpoint_path,
-        all_model_checkpoint_paths=rel_all_model_checkpoint_paths)
-  else:
-    ckpt = generate_checkpoint_state_proto(
-        save_dir,
-        model_checkpoint_path,
-        all_model_checkpoint_paths=all_model_checkpoint_paths)
-
-  if coord_checkpoint_filename == ckpt.model_checkpoint_path:
-    raise RuntimeError("Save path '%s' conflicts with path used for "
-                       "checkpoint state.  Please use a different save path." %
-                       model_checkpoint_path)
-
-  # Preventing potential read/write race condition by *atomically* writing to a
-  # file.
-  file_io.atomic_write_string_to_file(coord_checkpoint_filename,
-                                      text_format.MessageToString(ckpt))
-
-
-@tf_export("train.get_checkpoint_state")
-def get_checkpoint_state(checkpoint_dir, latest_filename=None):
-  """Returns CheckpointState proto from the "checkpoint" file.
-
-  If the "checkpoint" file contains a valid CheckpointState
-  proto, returns it.
-
-  Args:
-    checkpoint_dir: The directory of checkpoints.
-    latest_filename: Optional name of the checkpoint file.  Default to
-      'checkpoint'.
-
-  Returns:
-    A CheckpointState if the state was available, None
-    otherwise.
-
-  Raises:
-    ValueError: if the checkpoint read doesn't have model_checkpoint_path set.
-  """
-  ckpt = None
-  coord_checkpoint_filename = _GetCheckpointFilename(checkpoint_dir,
-                                                     latest_filename)
-  f = None
-  try:
-    # Check that the file exists before opening it to avoid
-    # many lines of errors from colossus in the logs.
-    if file_io.file_exists(coord_checkpoint_filename):
-      file_content = file_io.read_file_to_string(
-          coord_checkpoint_filename)
-      ckpt = CheckpointState()
-      text_format.Merge(file_content, ckpt)
-      if not ckpt.model_checkpoint_path:
-        raise ValueError("Invalid checkpoint state loaded from "
-                         + checkpoint_dir)
-      # For relative model_checkpoint_path and all_model_checkpoint_paths,
-      # prepend checkpoint_dir.
-      if not os.path.isabs(ckpt.model_checkpoint_path):
-        ckpt.model_checkpoint_path = os.path.join(checkpoint_dir,
-                                                  ckpt.model_checkpoint_path)
-      for i in range(len(ckpt.all_model_checkpoint_paths)):
-        p = ckpt.all_model_checkpoint_paths[i]
-        if not os.path.isabs(p):
-          ckpt.all_model_checkpoint_paths[i] = os.path.join(checkpoint_dir, p)
-  except errors.OpError as e:
-    # It's ok if the file cannot be read
-    logging.warning("%s: %s", type(e).__name__, e)
-    logging.warning("%s: Checkpoint ignored", coord_checkpoint_filename)
-    return None
-  except text_format.ParseError as e:
-    logging.warning("%s: %s", type(e).__name__, e)
-    logging.warning("%s: Checkpoint ignored", coord_checkpoint_filename)
-    return None
-  finally:
-    if f:
-      f.close()
-  return ckpt
-
-
 @tf_export("train.Saver")
 class Saver(object):
   """Saves and restores variables.
@@ -1410,7 +1207,7 @@ class Saver(object):
 
       # Otherwise delete the files.
       try:
-        remove_checkpoint(
+        checkpoint_management.remove_checkpoint(
             self._CheckpointFilename(p), self.saver_def.version,
             meta_graph_suffix)
       except Exception as e:  # pylint: disable=broad-except
@@ -1516,7 +1313,7 @@ class Saver(object):
     Args:
       checkpoint_paths: a list of checkpoint paths.
     """
-    mtimes = get_checkpoint_mtimes(checkpoint_paths)
+    mtimes = checkpoint_management.get_checkpoint_mtimes(checkpoint_paths)
     self.set_last_checkpoints_with_time(list(zip(checkpoint_paths, mtimes)))
 
   def save(self,
@@ -1622,7 +1419,7 @@ class Saver(object):
         model_checkpoint_path = compat.as_str(model_checkpoint_path)
         if write_state:
           self._RecordLastCheckpoint(model_checkpoint_path)
-          _update_checkpoint_state(
+          checkpoint_management.update_checkpoint_state_internal(
               save_dir=save_path_parent,
               model_checkpoint_path=model_checkpoint_path,
               all_model_checkpoint_paths=self.last_checkpoints,
@@ -1637,7 +1434,7 @@ class Saver(object):
         raise exc
 
     if write_meta_graph:
-      meta_graph_filename = _meta_graph_filename(
+      meta_graph_filename = checkpoint_management.meta_graph_filename(
           checkpoint_file, meta_graph_suffix=meta_graph_suffix)
       if not context.executing_eagerly():
         with sess.graph.as_default():
@@ -1712,7 +1509,7 @@ class Saver(object):
     if save_path is None:
       raise ValueError("Can't load save_path when it is None.")
 
-    if not checkpoint_exists(compat.as_text(save_path)):
+    if not checkpoint_management.checkpoint_exists(compat.as_text(save_path)):
       raise ValueError("The passed save_path is not a valid checkpoint: "
                        + compat.as_text(save_path))
 
@@ -1798,55 +1595,6 @@ class Saver(object):
                                   export_scope=export_scope)
 
 
-def _prefix_to_checkpoint_path(prefix, format_version):
-  """Returns the pathname of a checkpoint file, given the checkpoint prefix.
-
-  For V1 checkpoint, simply returns the prefix itself (the data file).  For V2,
-  returns the pathname to the index file.
-
-  Args:
-    prefix: a string, the prefix of a checkpoint.
-    format_version: the checkpoint format version that corresponds to the
-      prefix.
-  Returns:
-    The pathname of a checkpoint file, taking into account the checkpoint
-      format version.
-  """
-  if format_version == saver_pb2.SaverDef.V2:
-    return prefix + ".index"  # The index file identifies a checkpoint.
-  return prefix  # Just the data file.
-
-
-@tf_export("train.latest_checkpoint")
-def latest_checkpoint(checkpoint_dir, latest_filename=None):
-  """Finds the filename of latest saved checkpoint file.
-
-  Args:
-    checkpoint_dir: Directory where the variables were saved.
-    latest_filename: Optional name for the protocol buffer file that
-      contains the list of most recent checkpoint filenames.
-      See the corresponding argument to `Saver.save()`.
-
-  Returns:
-    The full path to the latest checkpoint or `None` if no checkpoint was found.
-  """
-  # Pick the latest checkpoint based on checkpoint state.
-  ckpt = get_checkpoint_state(checkpoint_dir, latest_filename)
-  if ckpt and ckpt.model_checkpoint_path:
-    # Look for either a V2 path or a V1 path, with priority for V2.
-    v2_path = _prefix_to_checkpoint_path(ckpt.model_checkpoint_path,
-                                         saver_pb2.SaverDef.V2)
-    v1_path = _prefix_to_checkpoint_path(ckpt.model_checkpoint_path,
-                                         saver_pb2.SaverDef.V1)
-    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
-        v1_path):
-      return ckpt.model_checkpoint_path
-    else:
-      logging.error("Couldn't match files for checkpoint %s",
-                    ckpt.model_checkpoint_path)
-  return None
-
-
 @tf_export("train.import_meta_graph")
 def import_meta_graph(meta_graph_or_file, clear_devices=False,
                       import_scope=None, **kwargs):
@@ -1923,6 +1671,14 @@ def import_meta_graph(meta_graph_or_file, clear_devices=False,
   execution is enabled.
   @end_compatibility
   """  # pylint: disable=g-doc-exception
+  return _import_meta_graph_with_return_elements(
+      meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
+
+
+def _import_meta_graph_with_return_elements(
+    meta_graph_or_file, clear_devices=False, import_scope=None,
+    return_elements=None, **kwargs):
+  """Import MetaGraph, and return both a saver and returned elements."""
   if context.executing_eagerly():
     raise RuntimeError("Exporting/importing meta graphs is not supported when "
                        "eager execution is enabled. No graph exists when eager "
@@ -1932,12 +1688,22 @@ def import_meta_graph(meta_graph_or_file, clear_devices=False,
   else:
     meta_graph_def = meta_graph_or_file
 
-  imported_vars = meta_graph.import_scoped_meta_graph(
-      meta_graph_def,
-      clear_devices=clear_devices,
-      import_scope=import_scope,
-      **kwargs)
+  imported_vars, imported_return_elements = (
+      meta_graph.import_scoped_meta_graph_with_return_elements(
+          meta_graph_def,
+          clear_devices=clear_devices,
+          import_scope=import_scope,
+          return_elements=return_elements,
+          **kwargs))
+
+  saver = _create_saver_from_imported_meta_graph(
+      meta_graph_def, import_scope, imported_vars)
+  return saver, imported_return_elements
+
 
+def _create_saver_from_imported_meta_graph(
+    meta_graph_def, import_scope, imported_vars):
+  """Return a saver for restoring variable values to an imported MetaGraph."""
   if meta_graph_def.HasField("saver_def"):
     # Infer the scope that is prepended by `import_scoped_meta_graph`.
     scope = import_scope
@@ -2036,119 +1802,6 @@ def export_meta_graph(filename=None,
   return meta_graph_def
 
 
-@tf_export("train.checkpoint_exists")
-def checkpoint_exists(checkpoint_prefix):
-  """Checks whether a V1 or V2 checkpoint exists with the specified prefix.
-
-  This is the recommended way to check if a checkpoint exists, since it takes
-  into account the naming difference between V1 and V2 formats.
-
-  Args:
-    checkpoint_prefix: the prefix of a V1 or V2 checkpoint, with V2 taking
-      priority.  Typically the result of `Saver.save()` or that of
-      `tf.train.latest_checkpoint()`, regardless of sharded/non-sharded or
-      V1/V2.
-  Returns:
-    A bool, true iff a checkpoint referred to by `checkpoint_prefix` exists.
-  """
-  pathname = _prefix_to_checkpoint_path(checkpoint_prefix,
-                                        saver_pb2.SaverDef.V2)
-  if file_io.get_matching_files(pathname):
-    return True
-  elif file_io.get_matching_files(checkpoint_prefix):
-    return True
-  else:
-    return False
-
-
-@tf_export("train.get_checkpoint_mtimes")
-def get_checkpoint_mtimes(checkpoint_prefixes):
-  """Returns the mtimes (modification timestamps) of the checkpoints.
-
-  Globs for the checkpoints pointed to by `checkpoint_prefixes`.  If the files
-  exist, collect their mtime.  Both V2 and V1 checkpoints are considered, in
-  that priority.
-
-  This is the recommended way to get the mtimes, since it takes into account
-  the naming difference between V1 and V2 formats.
-
-  Args:
-    checkpoint_prefixes: a list of checkpoint paths, typically the results of
-      `Saver.save()` or those of `tf.train.latest_checkpoint()`, regardless of
-      sharded/non-sharded or V1/V2.
-  Returns:
-    A list of mtimes (in microseconds) of the found checkpoints.
-  """
-  mtimes = []
-
-  def match_maybe_append(pathname):
-    fnames = file_io.get_matching_files(pathname)
-    if fnames:
-      mtimes.append(file_io.stat(fnames[0]).mtime_nsec / 1e9)
-      return True
-    return False
-
-  for checkpoint_prefix in checkpoint_prefixes:
-    # Tries V2's metadata file first.
-    pathname = _prefix_to_checkpoint_path(checkpoint_prefix,
-                                          saver_pb2.SaverDef.V2)
-    if match_maybe_append(pathname):
-      continue
-    # Otherwise, tries V1, where the prefix is the complete pathname.
-    match_maybe_append(checkpoint_prefix)
-
-  return mtimes
-
-
-@tf_export("train.remove_checkpoint")
-def remove_checkpoint(checkpoint_prefix,
-                      checkpoint_format_version=saver_pb2.SaverDef.V2,
-                      meta_graph_suffix="meta"):
-  """Removes a checkpoint given by `checkpoint_prefix`.
-
-  Args:
-    checkpoint_prefix: The prefix of a V1 or V2 checkpoint. Typically the result
-      of `Saver.save()` or that of `tf.train.latest_checkpoint()`, regardless of
-      sharded/non-sharded or V1/V2.
-    checkpoint_format_version: `SaverDef.CheckpointFormatVersion`, defaults to
-      `SaverDef.V2`.
-    meta_graph_suffix: Suffix for `MetaGraphDef` file. Defaults to 'meta'.
-  """
-  _delete_file_if_exists(
-      _meta_graph_filename(checkpoint_prefix, meta_graph_suffix))
-  if checkpoint_format_version == saver_pb2.SaverDef.V2:
-    # V2 has a metadata file and some data files.
-    _delete_file_if_exists(checkpoint_prefix + ".index")
-    _delete_file_if_exists(checkpoint_prefix + ".data-?????-of-?????")
-  else:
-    # V1, Legacy.  Exact match on the data file.
-    _delete_file_if_exists(checkpoint_prefix)
-
-
-def _delete_file_if_exists(filespec):
-  """Deletes files matching `filespec`."""
-  for pathname in file_io.get_matching_files(filespec):
-    file_io.delete_file(pathname)
-
-
-def _meta_graph_filename(checkpoint_filename, meta_graph_suffix="meta"):
-  """Returns the meta graph filename.
-
-  Args:
-    checkpoint_filename: Name of the checkpoint file.
-    meta_graph_suffix: Suffix for `MetaGraphDef` file. Defaults to 'meta'.
-
-  Returns:
-    MetaGraph file name.
-  """
-  # If the checkpoint_filename is sharded, the checkpoint_filename could
-  # be of format model.ckpt-step#-?????-of-shard#. For example,
-  # model.ckpt-123456-?????-of-00005, or model.ckpt-123456-00001-of-00002.
-  basename = re.sub(r"-[\d\?]+-of-\d+$", "", checkpoint_filename)
-  meta_graph_filename = ".".join([basename, meta_graph_suffix])
-  return meta_graph_filename
-
-
 def _wrap_restore_error_with_msg(err, extra_verbiage):
   err_msg = ("Restoring from checkpoint failed. This is most likely "
              "due to {} from the checkpoint. Please ensure that you "
diff --git a/tensorflow/python/training/saver_test.py b/tensorflow/python/training/saver_test.py
index ae9c244aaf..941aafc780 100644
--- a/tensorflow/python/training/saver_test.py
+++ b/tensorflow/python/training/saver_test.py
@@ -18,20 +18,16 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import contextlib
 import functools
 import math
 import os
 import random
-import shutil
-import tempfile
 import time
 
 import numpy as np
 import six
 
 from google.protobuf.any_pb2 import Any
-from google.protobuf import text_format
 
 from tensorflow.core.protobuf import config_pb2
 from tensorflow.core.protobuf import meta_graph_pb2
@@ -71,12 +67,12 @@ from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.summary import summary
 from tensorflow.python.training import adam
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import gradient_descent
 from tensorflow.python.training import queue_runner_impl
 from tensorflow.python.training import saver as saver_module
 from tensorflow.python.training import saver_test_utils
 from tensorflow.python.training import training_util
-from tensorflow.python.training.checkpoint_state_pb2 import CheckpointState
 from tensorflow.python.training.checkpointable import base as checkpointable_base
 from tensorflow.python.training.checkpointable import tracking as checkpointable_tracking
 from tensorflow.python.training.checkpointable import util as checkpointable_utils
@@ -174,6 +170,24 @@ class SaverTest(test.TestCase):
   def testResourceBasic(self):
     self.basicSaveRestore(resource_variable_ops.ResourceVariable)
 
+  def testResourceColocation(self):
+    partitioner = partitioned_variables.fixed_size_partitioner(num_shards=2)
+    with ops_lib.device("/job:ps/device:GPU:0"):
+      v = variable_scope.get_variable("v0",
+                                      shape=[10, 2],
+                                      partitioner=partitioner,
+                                      use_resource=True)
+    saver_module.Saver({"v0": v}).build()
+    save_op = None
+    for op in ops_lib.get_default_graph().get_operations():
+      if op.type == "SaveV2":
+        save_op = op
+        break
+    assert save_op is not None
+    for save_inp in save_op.inputs[3:]:
+      # Input to SaveV2 op is placed on CPU of the same device as the Variable.
+      self.assertEqual("/job:ps/device:CPU:0", save_inp.device)
+
   def testResourceVariableReadOpsAddedDeterministically(self):
     graph_defs = []
     num_graphs = 10
@@ -325,11 +339,13 @@ class SaverTest(test.TestCase):
       self.assertTrue(isinstance(val, six.string_types))
       self.assertEqual(save_path1, val)
 
-    self.assertEqual(saver_module.latest_checkpoint(save_dir1), save_path1)
+    self.assertEqual(
+        checkpoint_management.latest_checkpoint(save_dir1), save_path1)
     save_dir2 = os.path.join(self.get_temp_dir(), "save_dir2")
     os.renames(save_dir1, save_dir2)
     save_path2 = os.path.join(save_dir2, "save_copy_restore")
-    self.assertEqual(saver_module.latest_checkpoint(save_dir2), save_path2)
+    self.assertEqual(
+        checkpoint_management.latest_checkpoint(save_dir2), save_path2)
 
     # Start a second session.  In that session the parameter nodes
     # have not been initialized either.
@@ -768,6 +784,37 @@ class SaverTest(test.TestCase):
       self.assertEqual(20.0, v1.eval())
       save.save(sess, save_path)
 
+  # Test restoring large tensors (triggers a thread pool)
+  def testRestoreLargeTensors(self):
+    save_dir = self.get_temp_dir()
+    def _model():
+      small_v = [variable_scope.get_variable(
+          "small%d" % i, shape=[10, 2], use_resource=True) for i in range(5)]
+      large_v = [variable_scope.get_variable(
+          "large%d" % i, shape=[32000, 1000], use_resource=True)
+                 for i in range(3)]
+      return small_v + large_v
+
+    save_graph = ops_lib.Graph()
+    with save_graph.as_default(), self.test_session(graph=save_graph) as sess:
+      orig_vars = _model()
+      sess.run(variables.global_variables_initializer())
+      save = saver_module.Saver(max_to_keep=1)
+      variables.global_variables_initializer().run()
+      save.save(sess, save_dir)
+      orig_vals = sess.run(orig_vars)
+
+    restore_graph = ops_lib.Graph()
+    with restore_graph.as_default(), self.test_session(
+        graph=restore_graph) as sess:
+      restored_vars = _model()
+      save = saver_module.Saver(max_to_keep=1)
+      save.restore(sess, save_dir)
+      restored_vals = sess.run(restored_vars)
+
+    for orig, restored in zip(orig_vals, restored_vals):
+      self.assertAllEqual(orig, restored)
+
 
 class SaveRestoreShardedTest(test.TestCase):
 
@@ -808,7 +855,7 @@ class SaveRestoreShardedTest(test.TestCase):
         self.assertEqual(save_path + "-?????-of-00002", val)
       else:
         self.assertEqual(save_path, val)
-      meta_graph_filename = saver_module._meta_graph_filename(val)
+      meta_graph_filename = checkpoint_management.meta_graph_filename(val)
       self.assertEqual(save_path + ".meta", meta_graph_filename)
 
     if save._write_version is saver_pb2.SaverDef.V1:
@@ -902,11 +949,11 @@ class SaveRestoreShardedTest(test.TestCase):
 
     if save._write_version is saver_pb2.SaverDef.V1:
       self.assertEqual(
-          saver_module.latest_checkpoint(self.get_temp_dir()),
+          checkpoint_management.latest_checkpoint(self.get_temp_dir()),
           os.path.join(self.get_temp_dir(), "sharded_basics-?????-of-00002"))
     else:
       self.assertEqual(
-          saver_module.latest_checkpoint(self.get_temp_dir()),
+          checkpoint_management.latest_checkpoint(self.get_temp_dir()),
           os.path.join(self.get_temp_dir(), "sharded_basics"))
 
   def testSaverDef(self):
@@ -1056,7 +1103,7 @@ class MaxToKeepTest(test.TestCase):
 
   def assertCheckpointState(self, model_checkpoint_path,
                             all_model_checkpoint_paths, save_dir):
-    checkpoint_state = saver_module.get_checkpoint_state(save_dir)
+    checkpoint_state = checkpoint_management.get_checkpoint_state(save_dir)
     self.assertEqual(checkpoint_state.model_checkpoint_path,
                      model_checkpoint_path)
     self.assertEqual(checkpoint_state.all_model_checkpoint_paths,
@@ -1064,7 +1111,7 @@ class MaxToKeepTest(test.TestCase):
 
   def testMaxToKeepEager(self):
     with context.eager_mode():
-      save_dir = self._get_test_dir("max_to_keep_non_sharded")
+      save_dir = self._get_test_dir("max_to_keep_eager")
 
       v = variable_scope.variable(10.0, name="v")
       save = saver_module.Saver({"v": v}, max_to_keep=2)
@@ -1074,7 +1121,7 @@ class MaxToKeepTest(test.TestCase):
 
       s1 = save.save(None, os.path.join(save_dir, "s1"))
       self.assertEqual([s1], save.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s1],
@@ -1082,8 +1129,8 @@ class MaxToKeepTest(test.TestCase):
 
       s2 = save.save(None, os.path.join(save_dir, "s2"))
       self.assertEqual([s1, s2], save.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s1))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertCheckpointState(
           model_checkpoint_path=s2,
           all_model_checkpoint_paths=[s1, s2],
@@ -1091,9 +1138,9 @@ class MaxToKeepTest(test.TestCase):
 
       s3 = save.save(None, os.path.join(save_dir, "s3"))
       self.assertEqual([s2, s3], save.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s1))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
-      self.assertTrue(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s3))
       self.assertCheckpointState(
           model_checkpoint_path=s3,
           all_model_checkpoint_paths=[s2, s3],
@@ -1108,9 +1155,9 @@ class MaxToKeepTest(test.TestCase):
       # Adding s2 again (old s2 is removed first, then new s2 appended)
       s2 = save.save(None, os.path.join(save_dir, "s2"))
       self.assertEqual([s3, s2], save.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s1))
-      self.assertTrue(saver_module.checkpoint_exists(s3))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s3))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertCheckpointState(
           model_checkpoint_path=s2,
           all_model_checkpoint_paths=[s3, s2],
@@ -1119,8 +1166,8 @@ class MaxToKeepTest(test.TestCase):
       # Adding s1 (s3 should now be deleted as oldest in list)
       s1 = save.save(None, os.path.join(save_dir, "s1"))
       self.assertEqual([s2, s1], save.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s3))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s3))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s2, s1],
@@ -1129,9 +1176,9 @@ class MaxToKeepTest(test.TestCase):
       s2 = save2.save(None, os.path.join(save_dir, "s2"))
       self.assertEqual([s3, s2], save2.last_checkpoints)
       # Created by the first helper.
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       # Deleted by the first helper.
-      self.assertFalse(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s3))
 
   def testNonSharded(self):
     save_dir = self._get_test_dir("max_to_keep_non_sharded")
@@ -1144,7 +1191,7 @@ class MaxToKeepTest(test.TestCase):
 
       s1 = save.save(sess, os.path.join(save_dir, "s1"))
       self.assertEqual([s1], save.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s1],
@@ -1152,8 +1199,8 @@ class MaxToKeepTest(test.TestCase):
 
       s2 = save.save(sess, os.path.join(save_dir, "s2"))
       self.assertEqual([s1, s2], save.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s1))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertCheckpointState(
           model_checkpoint_path=s2,
           all_model_checkpoint_paths=[s1, s2],
@@ -1161,9 +1208,9 @@ class MaxToKeepTest(test.TestCase):
 
       s3 = save.save(sess, os.path.join(save_dir, "s3"))
       self.assertEqual([s2, s3], save.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s1))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
-      self.assertTrue(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s3))
       self.assertCheckpointState(
           model_checkpoint_path=s3,
           all_model_checkpoint_paths=[s2, s3],
@@ -1182,15 +1229,18 @@ class MaxToKeepTest(test.TestCase):
       # Adding s2 again (old s2 is removed first, then new s2 appended)
       s2 = save.save(sess, os.path.join(save_dir, "s2"))
       self.assertEqual([s3, s2], save.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s1))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s1))
       self.assertFalse(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
-      self.assertTrue(saver_module.checkpoint_exists(s3))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s1)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s3))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s3)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s2)))
       self.assertCheckpointState(
           model_checkpoint_path=s2,
           all_model_checkpoint_paths=[s3, s2],
@@ -1199,15 +1249,18 @@ class MaxToKeepTest(test.TestCase):
       # Adding s1 (s3 should now be deleted as oldest in list)
       s1 = save.save(sess, os.path.join(save_dir, "s1"))
       self.assertEqual([s2, s1], save.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s3)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s2)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s1)))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s2, s1],
@@ -1219,16 +1272,19 @@ class MaxToKeepTest(test.TestCase):
       s2 = save2.save(sess, os.path.join(save_dir, "s2"))
       self.assertEqual([s3, s2], save2.last_checkpoints)
       # Created by the first helper.
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s1)))
       # Deleted by the first helper.
-      self.assertFalse(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s3)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s2)))
       self.assertCheckpointState(
           model_checkpoint_path=s2,
           all_model_checkpoint_paths=[s3, s2],
@@ -1237,15 +1293,18 @@ class MaxToKeepTest(test.TestCase):
       # Adding s1 (s3 should now be deleted as oldest in list)
       s1 = save2.save(sess, os.path.join(save_dir, "s1"))
       self.assertEqual([s2, s1], save2.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s3)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s2)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s1)))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s2, s1],
@@ -1257,16 +1316,19 @@ class MaxToKeepTest(test.TestCase):
       s2 = save3.save(sess, os.path.join(save_dir, "s2"))
       self.assertEqual([s2], save3.last_checkpoints)
       # Created by the first helper.
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s1)))
       # Deleted by the first helper.
-      self.assertFalse(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s3)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s2)))
       # Even though the file for s1 exists, this saver isn't aware of it, which
       # is why it doesn't end up in the checkpoint state.
       self.assertCheckpointState(
@@ -1277,15 +1339,18 @@ class MaxToKeepTest(test.TestCase):
       # Adding s1 (s3 should not be deleted because helper is unaware of it)
       s1 = save3.save(sess, os.path.join(save_dir, "s1"))
       self.assertEqual([s2, s1], save3.last_checkpoints)
-      self.assertFalse(saver_module.checkpoint_exists(s3))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s3))
       self.assertFalse(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s3)))
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s3)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s2)))
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s2)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       self.assertTrue(
-          saver_module.checkpoint_exists(saver_module._meta_graph_filename(s1)))
+          checkpoint_management.checkpoint_exists(
+              checkpoint_management.meta_graph_filename(s1)))
       self.assertCheckpointState(
           model_checkpoint_path=s1,
           all_model_checkpoint_paths=[s2, s1],
@@ -1316,7 +1381,8 @@ class MaxToKeepTest(test.TestCase):
       else:
         self.assertEqual(4, len(gfile.Glob(s1 + "*")))
 
-      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s1)))
+      self.assertTrue(
+          gfile.Exists(checkpoint_management.meta_graph_filename(s1)))
 
       s2 = save.save(sess, os.path.join(save_dir, "s2"))
       self.assertEqual([s1, s2], save.last_checkpoints)
@@ -1324,27 +1390,32 @@ class MaxToKeepTest(test.TestCase):
         self.assertEqual(2, len(gfile.Glob(s1)))
       else:
         self.assertEqual(4, len(gfile.Glob(s1 + "*")))
-      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s1)))
+      self.assertTrue(
+          gfile.Exists(checkpoint_management.meta_graph_filename(s1)))
       if save._write_version is saver_pb2.SaverDef.V1:
         self.assertEqual(2, len(gfile.Glob(s2)))
       else:
         self.assertEqual(4, len(gfile.Glob(s2 + "*")))
-      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s2)))
+      self.assertTrue(
+          gfile.Exists(checkpoint_management.meta_graph_filename(s2)))
 
       s3 = save.save(sess, os.path.join(save_dir, "s3"))
       self.assertEqual([s2, s3], save.last_checkpoints)
       self.assertEqual(0, len(gfile.Glob(s1 + "*")))
-      self.assertFalse(gfile.Exists(saver_module._meta_graph_filename(s1)))
+      self.assertFalse(
+          gfile.Exists(checkpoint_management.meta_graph_filename(s1)))
       if save._write_version is saver_pb2.SaverDef.V1:
         self.assertEqual(2, len(gfile.Glob(s2)))
       else:
         self.assertEqual(4, len(gfile.Glob(s2 + "*")))
-      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s2)))
+      self.assertTrue(
+          gfile.Exists(checkpoint_management.meta_graph_filename(s2)))
       if save._write_version is saver_pb2.SaverDef.V1:
         self.assertEqual(2, len(gfile.Glob(s3)))
       else:
         self.assertEqual(4, len(gfile.Glob(s3 + "*")))
-      self.assertTrue(gfile.Exists(saver_module._meta_graph_filename(s3)))
+      self.assertTrue(
+          gfile.Exists(checkpoint_management.meta_graph_filename(s3)))
 
   def testNoMaxToKeep(self):
     save_dir = self._get_test_dir("no_max_to_keep")
@@ -1359,20 +1430,20 @@ class MaxToKeepTest(test.TestCase):
       self.assertEqual([], save.last_checkpoints)
       s1 = save.save(sess, os.path.join(save_dir, "s1"))
       self.assertEqual([], save.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       s2 = save.save(sess, os.path.join(save_dir, "s2"))
       self.assertEqual([], save.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
 
       # Test max_to_keep being 0.
       save2 = saver_module.Saver({"v": v}, max_to_keep=0)
       self.assertEqual([], save2.last_checkpoints)
       s1 = save2.save(sess, os.path.join(save_dir2, "s1"))
       self.assertEqual([], save2.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s1))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
       s2 = save2.save(sess, os.path.join(save_dir2, "s2"))
       self.assertEqual([], save2.last_checkpoints)
-      self.assertTrue(saver_module.checkpoint_exists(s2))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s2))
 
   def testNoMetaGraph(self):
     save_dir = self._get_test_dir("no_meta_graph")
@@ -1383,8 +1454,9 @@ class MaxToKeepTest(test.TestCase):
       variables.global_variables_initializer().run()
 
       s1 = save.save(sess, os.path.join(save_dir, "s1"), write_meta_graph=False)
-      self.assertTrue(saver_module.checkpoint_exists(s1))
-      self.assertFalse(gfile.Exists(saver_module._meta_graph_filename(s1)))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
+      self.assertFalse(
+          gfile.Exists(checkpoint_management.meta_graph_filename(s1)))
 
 
 class KeepCheckpointEveryNHoursTest(test.TestCase):
@@ -1440,10 +1512,10 @@ class KeepCheckpointEveryNHoursTest(test.TestCase):
       self.assertEqual([s3, s4], save.last_checkpoints)
 
       # Check that s1 is still here, but s2 is gone.
-      self.assertTrue(saver_module.checkpoint_exists(s1))
-      self.assertFalse(saver_module.checkpoint_exists(s2))
-      self.assertTrue(saver_module.checkpoint_exists(s3))
-      self.assertTrue(saver_module.checkpoint_exists(s4))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s1))
+      self.assertFalse(checkpoint_management.checkpoint_exists(s2))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s3))
+      self.assertTrue(checkpoint_management.checkpoint_exists(s4))
 
 
 class SaveRestoreWithVariableNameMap(test.TestCase):
@@ -1522,221 +1594,6 @@ class SaveRestoreWithVariableNameMap(test.TestCase):
     self._testNonReshape(variables.Variable)
 
 
-class LatestCheckpointWithRelativePaths(test.TestCase):
-
-  @staticmethod
-  @contextlib.contextmanager
-  def tempWorkingDir(temppath):
-    cwd = os.getcwd()
-    os.chdir(temppath)
-    try:
-      yield
-    finally:
-      os.chdir(cwd)
-
-  @staticmethod
-  @contextlib.contextmanager
-  def tempDir():
-    tempdir = tempfile.mkdtemp()
-    try:
-      yield tempdir
-    finally:
-      shutil.rmtree(tempdir)
-
-  def testNameCollision(self):
-    # Make sure we have a clean directory to work in.
-    with self.tempDir() as tempdir:
-      # Jump to that directory until this test is done.
-      with self.tempWorkingDir(tempdir):
-        # Save training snapshots to a relative path.
-        traindir = "train/"
-        os.mkdir(traindir)
-        # Collides with the default name of the checkpoint state file.
-        filepath = os.path.join(traindir, "checkpoint")
-
-        with self.test_session() as sess:
-          unused_a = variables.Variable(0.0)  # So that Saver saves something.
-          variables.global_variables_initializer().run()
-
-          # Should fail.
-          saver = saver_module.Saver(sharded=False)
-          with self.assertRaisesRegexp(ValueError, "collides with"):
-            saver.save(sess, filepath)
-
-          # Succeeds: the file will be named "checkpoint-<step>".
-          saver.save(sess, filepath, global_step=1)
-          self.assertIsNotNone(saver_module.latest_checkpoint(traindir))
-
-          # Succeeds: the file will be named "checkpoint-<i>-of-<n>".
-          saver = saver_module.Saver(sharded=True)
-          saver.save(sess, filepath)
-          self.assertIsNotNone(saver_module.latest_checkpoint(traindir))
-
-          # Succeeds: the file will be named "checkpoint-<step>-<i>-of-<n>".
-          saver = saver_module.Saver(sharded=True)
-          saver.save(sess, filepath, global_step=1)
-          self.assertIsNotNone(saver_module.latest_checkpoint(traindir))
-
-  def testRelativePath(self):
-    # Make sure we have a clean directory to work in.
-    with self.tempDir() as tempdir:
-
-      # Jump to that directory until this test is done.
-      with self.tempWorkingDir(tempdir):
-
-        # Save training snapshots to a relative path.
-        traindir = "train/"
-        os.mkdir(traindir)
-
-        filename = "snapshot"
-        filepath = os.path.join(traindir, filename)
-
-        with self.test_session() as sess:
-          # Build a simple graph.
-          v0 = variables.Variable(0.0)
-          inc = v0.assign_add(1.0)
-
-          save = saver_module.Saver({"v0": v0})
-
-          # Record a short training history.
-          variables.global_variables_initializer().run()
-          save.save(sess, filepath, global_step=0)
-          inc.eval()
-          save.save(sess, filepath, global_step=1)
-          inc.eval()
-          save.save(sess, filepath, global_step=2)
-
-        with self.test_session() as sess:
-          # Build a new graph with different initialization.
-          v0 = variables.Variable(-1.0)
-
-          # Create a new saver.
-          save = saver_module.Saver({"v0": v0})
-          variables.global_variables_initializer().run()
-
-          # Get the most recent checkpoint name from the training history file.
-          name = saver_module.latest_checkpoint(traindir)
-          self.assertIsNotNone(name)
-
-          # Restore "v0" from that checkpoint.
-          save.restore(sess, name)
-          self.assertEqual(v0.eval(), 2.0)
-
-
-class CheckpointStateTest(test.TestCase):
-
-  def _get_test_dir(self, dirname):
-    test_dir = os.path.join(self.get_temp_dir(), dirname)
-    gfile.MakeDirs(test_dir)
-    return test_dir
-
-  def testAbsPath(self):
-    save_dir = self._get_test_dir("abs_paths")
-    abs_path = os.path.join(save_dir, "model-0")
-    ckpt = saver_module.generate_checkpoint_state_proto(save_dir, abs_path)
-    self.assertEqual(ckpt.model_checkpoint_path, abs_path)
-    self.assertTrue(os.path.isabs(ckpt.model_checkpoint_path))
-    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 1)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], abs_path)
-
-  def testRelPath(self):
-    train_dir = "train"
-    model = os.path.join(train_dir, "model-0")
-    # model_checkpoint_path should have no "train" directory part.
-    new_rel_path = "model-0"
-    ckpt = saver_module.generate_checkpoint_state_proto(train_dir, model)
-    self.assertEqual(ckpt.model_checkpoint_path, new_rel_path)
-    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 1)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], new_rel_path)
-
-  def testAllModelCheckpointPaths(self):
-    save_dir = self._get_test_dir("all_models_test")
-    abs_path = os.path.join(save_dir, "model-0")
-    for paths in [None, [], ["model-2"]]:
-      ckpt = saver_module.generate_checkpoint_state_proto(
-          save_dir, abs_path, all_model_checkpoint_paths=paths)
-      self.assertEqual(ckpt.model_checkpoint_path, abs_path)
-      self.assertTrue(os.path.isabs(ckpt.model_checkpoint_path))
-      self.assertEqual(
-          len(ckpt.all_model_checkpoint_paths), len(paths) if paths else 1)
-      self.assertEqual(ckpt.all_model_checkpoint_paths[-1], abs_path)
-
-  def testUpdateCheckpointState(self):
-    save_dir = self._get_test_dir("update_checkpoint_state")
-    os.chdir(save_dir)
-    # Make a temporary train directory.
-    train_dir = "train"
-    os.mkdir(train_dir)
-    abs_path = os.path.join(save_dir, "model-0")
-    rel_path = os.path.join("train", "model-2")
-    saver_module.update_checkpoint_state(
-        train_dir, rel_path, all_model_checkpoint_paths=[abs_path, rel_path])
-    ckpt = saver_module.get_checkpoint_state(train_dir)
-    self.assertEqual(ckpt.model_checkpoint_path, rel_path)
-    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 2)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], rel_path)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[0], abs_path)
-
-  def testUpdateCheckpointStateSaveRelativePaths(self):
-    save_dir = self._get_test_dir("update_checkpoint_state")
-    os.chdir(save_dir)
-    abs_path2 = os.path.join(save_dir, "model-2")
-    rel_path2 = "model-2"
-    abs_path0 = os.path.join(save_dir, "model-0")
-    rel_path0 = "model-0"
-    saver_module._update_checkpoint_state(  # pylint: disable=protected-access
-        save_dir=save_dir,
-        model_checkpoint_path=abs_path2,
-        all_model_checkpoint_paths=[rel_path0, abs_path2],
-        save_relative_paths=True)
-
-    # File should contain relative paths.
-    file_content = file_io.read_file_to_string(
-        os.path.join(save_dir, "checkpoint"))
-    ckpt = CheckpointState()
-    text_format.Merge(file_content, ckpt)
-    self.assertEqual(ckpt.model_checkpoint_path, rel_path2)
-    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 2)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], rel_path2)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[0], rel_path0)
-
-    # get_checkpoint_state should return absolute paths.
-    ckpt = saver_module.get_checkpoint_state(save_dir)
-    self.assertEqual(ckpt.model_checkpoint_path, abs_path2)
-    self.assertEqual(len(ckpt.all_model_checkpoint_paths), 2)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[-1], abs_path2)
-    self.assertEqual(ckpt.all_model_checkpoint_paths[0], abs_path0)
-
-  def testCheckPointStateFailsWhenIncomplete(self):
-    save_dir = self._get_test_dir("checkpoint_state_fails_when_incomplete")
-    os.chdir(save_dir)
-    ckpt_path = os.path.join(save_dir, "checkpoint")
-    ckpt_file = open(ckpt_path, "w")
-    ckpt_file.write("")
-    ckpt_file.close()
-    with self.assertRaises(ValueError):
-      saver_module.get_checkpoint_state(save_dir)
-
-  def testCheckPointCompletesRelativePaths(self):
-    save_dir = self._get_test_dir("checkpoint_completes_relative_paths")
-    os.chdir(save_dir)
-    ckpt_path = os.path.join(save_dir, "checkpoint")
-    ckpt_file = open(ckpt_path, "w")
-    ckpt_file.write("""
-        model_checkpoint_path: "./model.ckpt-687529"
-        all_model_checkpoint_paths: "./model.ckpt-687500"
-        all_model_checkpoint_paths: "./model.ckpt-687529"
-        """)
-    ckpt_file.close()
-    ckpt = saver_module.get_checkpoint_state(save_dir)
-    self.assertEqual(ckpt.model_checkpoint_path,
-                     os.path.join(save_dir, "./model.ckpt-687529"))
-    self.assertEqual(ckpt.all_model_checkpoint_paths[0],
-                     os.path.join(save_dir, "./model.ckpt-687500"))
-    self.assertEqual(ckpt.all_model_checkpoint_paths[1],
-                     os.path.join(save_dir, "./model.ckpt-687529"))
-
-
 class MetaGraphTest(test.TestCase):
 
   def _get_test_dir(self, dirname):
@@ -2579,62 +2436,6 @@ class WriteGraphTest(test.TestCase):
     self.assertTrue(os.path.exists(path))
 
 
-class SaverUtilsTest(test.TestCase):
-
-  def setUp(self):
-    self._base_dir = os.path.join(self.get_temp_dir(), "saver_utils_test")
-    gfile.MakeDirs(self._base_dir)
-
-  def tearDown(self):
-    gfile.DeleteRecursively(self._base_dir)
-
-  def testCheckpointExists(self):
-    for sharded in (False, True):
-      for version in (saver_pb2.SaverDef.V2, saver_pb2.SaverDef.V1):
-        with self.test_session(graph=ops_lib.Graph()) as sess:
-          unused_v = variables.Variable(1.0, name="v")
-          variables.global_variables_initializer().run()
-          saver = saver_module.Saver(sharded=sharded, write_version=version)
-
-          path = os.path.join(self._base_dir, "%s-%s" % (sharded, version))
-          self.assertFalse(
-              saver_module.checkpoint_exists(path))  # Not saved yet.
-
-          ckpt_prefix = saver.save(sess, path)
-          self.assertTrue(saver_module.checkpoint_exists(ckpt_prefix))
-
-          ckpt_prefix = saver_module.latest_checkpoint(self._base_dir)
-          self.assertTrue(saver_module.checkpoint_exists(ckpt_prefix))
-
-  def testGetCheckpointMtimes(self):
-    prefixes = []
-    for version in (saver_pb2.SaverDef.V2, saver_pb2.SaverDef.V1):
-      with self.test_session(graph=ops_lib.Graph()) as sess:
-        unused_v = variables.Variable(1.0, name="v")
-        variables.global_variables_initializer().run()
-        saver = saver_module.Saver(write_version=version)
-        prefixes.append(
-            saver.save(sess, os.path.join(self._base_dir, str(version))))
-
-    mtimes = saver_module.get_checkpoint_mtimes(prefixes)
-    self.assertEqual(2, len(mtimes))
-    self.assertTrue(mtimes[1] >= mtimes[0])
-
-  def testRemoveCheckpoint(self):
-    for sharded in (False, True):
-      for version in (saver_pb2.SaverDef.V2, saver_pb2.SaverDef.V1):
-        with self.test_session(graph=ops_lib.Graph()) as sess:
-          unused_v = variables.Variable(1.0, name="v")
-          variables.global_variables_initializer().run()
-          saver = saver_module.Saver(sharded=sharded, write_version=version)
-
-          path = os.path.join(self._base_dir, "%s-%s" % (sharded, version))
-          ckpt_prefix = saver.save(sess, path)
-          self.assertTrue(saver_module.checkpoint_exists(ckpt_prefix))
-          saver_module.remove_checkpoint(ckpt_prefix, version)
-          self.assertFalse(saver_module.checkpoint_exists(ckpt_prefix))
-
-
 class ScopedGraphTest(test.TestCase):
 
   def _get_test_dir(self, dirname):
diff --git a/tensorflow/python/training/server_lib.py b/tensorflow/python/training/server_lib.py
index 2f421d1cc0..58cf5277fe 100644
--- a/tensorflow/python/training/server_lib.py
+++ b/tensorflow/python/training/server_lib.py
@@ -42,8 +42,8 @@ def _make_server_def(server_or_cluster_def, job_name, task_index, protocol,
       Defaults to the value in `server_or_cluster_def`, if specified. Otherwise
       defaults to 0 if the server's job has only one task.
     protocol: (Optional.) Specifies the protocol to be used by the server.
-      Acceptable values include `"grpc"`. Defaults to the value in
-      `server_or_cluster_def`, if specified. Otherwise defaults to `"grpc"`.
+      Acceptable values include `"grpc", "grpc+verbs"`. Defaults to the value
+      in `server_or_cluster_def`, if specified. Otherwise defaults to `"grpc"`.
     config: (Options.) A `tf.ConfigProto` that specifies default configuration
       options for all sessions that run on this server.
 
@@ -129,8 +129,9 @@ class Server(object):
         job. Defaults to the value in `server_or_cluster_def`, if specified.
         Otherwise defaults to 0 if the server's job has only one task.
       protocol: (Optional.) Specifies the protocol to be used by the server.
-        Acceptable values include `"grpc"`. Defaults to the value in
-        `server_or_cluster_def`, if specified. Otherwise defaults to `"grpc"`.
+        Acceptable values include `"grpc", "grpc+verbs"`. Defaults to the
+        value in `server_or_cluster_def`, if specified. Otherwise defaults to
+        `"grpc"`.
       config: (Options.) A `tf.ConfigProto` that specifies default
         configuration options for all sessions that run on this server.
       start: (Optional.) Boolean, indicating whether to start the server
diff --git a/tensorflow/python/training/session_manager.py b/tensorflow/python/training/session_manager.py
index 974f75777f..a2e0645ba8 100644
--- a/tensorflow/python/training/session_manager.py
+++ b/tensorflow/python/training/session_manager.py
@@ -24,7 +24,7 @@ from tensorflow.python.client import session
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.platform import tf_logging as logging
-from tensorflow.python.training import saver as saver_mod
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -197,13 +197,13 @@ class SessionManager(object):
 
     # Waits up until max_wait_secs for checkpoint to become available.
     wait_time = 0
-    ckpt = saver_mod.get_checkpoint_state(checkpoint_dir)
+    ckpt = checkpoint_management.get_checkpoint_state(checkpoint_dir)
     while not ckpt or not ckpt.model_checkpoint_path:
       if wait_for_checkpoint and wait_time < max_wait_secs:
         logging.info("Waiting for checkpoint to be available.")
         time.sleep(self._recovery_wait_secs)
         wait_time += self._recovery_wait_secs
-        ckpt = saver_mod.get_checkpoint_state(checkpoint_dir)
+        ckpt = checkpoint_management.get_checkpoint_state(checkpoint_dir)
       else:
         return sess, False
 
diff --git a/tensorflow/python/training/session_manager_test.py b/tensorflow/python/training/session_manager_test.py
index 6670d9365f..d7e6dac95b 100644
--- a/tensorflow/python/training/session_manager_test.py
+++ b/tensorflow/python/training/session_manager_test.py
@@ -30,6 +30,7 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.training import server_lib
 from tensorflow.python.training import session_manager
@@ -174,13 +175,13 @@ class SessionManagerTest(test.TestCase):
                  os.path.join(checkpoint_dir, "recover_session_checkpoint"))
     self._test_recovered_variable(checkpoint_dir=checkpoint_dir)
     self._test_recovered_variable(
-        checkpoint_filename_with_path=saver_lib.latest_checkpoint(
+        checkpoint_filename_with_path=checkpoint_management.latest_checkpoint(
             checkpoint_dir))
     # Cannot set both checkpoint_dir and checkpoint_filename_with_path.
     with self.assertRaises(ValueError):
       self._test_recovered_variable(
           checkpoint_dir=checkpoint_dir,
-          checkpoint_filename_with_path=saver_lib.latest_checkpoint(
+          checkpoint_filename_with_path=checkpoint_management.latest_checkpoint(
               checkpoint_dir))
 
   def testWaitForSessionReturnsNoneAfterTimeout(self):
diff --git a/tensorflow/python/training/supervisor_test.py b/tensorflow/python/training/supervisor_test.py
index 4abce85852..71ed88093a 100644
--- a/tensorflow/python/training/supervisor_test.py
+++ b/tensorflow/python/training/supervisor_test.py
@@ -44,6 +44,7 @@ from tensorflow.python.platform import test
 from tensorflow.python.summary import summary
 from tensorflow.python.summary import summary_iterator
 from tensorflow.python.summary.writer import writer
+from tensorflow.python.training import checkpoint_management
 from tensorflow.python.training import input as input_lib
 from tensorflow.python.training import saver as saver_lib
 from tensorflow.python.training import server_lib
@@ -83,7 +84,7 @@ class SupervisorTest(test.TestCase):
     end_time = time.time() + timeout_secs
     while time.time() < end_time:
       if for_checkpoint:
-        if saver_lib.checkpoint_exists(pattern):
+        if checkpoint_management.checkpoint_exists(pattern):
           return
       else:
         if len(gfile.Glob(pattern)) >= 1:
diff --git a/tensorflow/python/training/training.py b/tensorflow/python/training/training.py
index 3f2dc67976..544010afbe 100644
--- a/tensorflow/python/training/training.py
+++ b/tensorflow/python/training/training.py
@@ -82,12 +82,12 @@ from tensorflow.python.training.monitored_session import WorkerSessionCreator
 from tensorflow.python.training.monitored_session import MonitoredSession
 from tensorflow.python.training.monitored_session import SingularMonitoredSession
 from tensorflow.python.training.saver import Saver
-from tensorflow.python.training.saver import checkpoint_exists
-from tensorflow.python.training.saver import generate_checkpoint_state_proto
-from tensorflow.python.training.saver import get_checkpoint_mtimes
-from tensorflow.python.training.saver import get_checkpoint_state
-from tensorflow.python.training.saver import latest_checkpoint
-from tensorflow.python.training.saver import update_checkpoint_state
+from tensorflow.python.training.checkpoint_management import checkpoint_exists
+from tensorflow.python.training.checkpoint_management import generate_checkpoint_state_proto
+from tensorflow.python.training.checkpoint_management import get_checkpoint_mtimes
+from tensorflow.python.training.checkpoint_management import get_checkpoint_state
+from tensorflow.python.training.checkpoint_management import latest_checkpoint
+from tensorflow.python.training.checkpoint_management import update_checkpoint_state
 from tensorflow.python.training.saver import export_meta_graph
 from tensorflow.python.training.saver import import_meta_graph
 from tensorflow.python.training.session_run_hook import SessionRunHook
diff --git a/tensorflow/python/training/training_util.py b/tensorflow/python/training/training_util.py
index 0877b2a8a2..2ff3eeb153 100644
--- a/tensorflow/python/training/training_util.py
+++ b/tensorflow/python/training/training_util.py
@@ -44,11 +44,13 @@ def global_step(sess, global_step_tensor):
   """Small helper to get the global step.
 
   ```python
-  # Creates a variable to hold the global_step.
+  # Create a variable to hold the global_step.
   global_step_tensor = tf.Variable(10, trainable=False, name='global_step')
-  # Creates a session.
+  # Create a session.
   sess = tf.Session()
-  # Initializes the variable.
+  # Initialize the variable
+  sess.run(global_step_tensor.initializer)
+  # Get the variable value.
   print('global_step: %s' % tf.train.global_step(sess, global_step_tensor))
 
   global_step: 10
diff --git a/tensorflow/python/training/warm_starting_util.py b/tensorflow/python/training/warm_starting_util.py
index ec740abdd1..b1a7cfab83 100644
--- a/tensorflow/python/training/warm_starting_util.py
+++ b/tensorflow/python/training/warm_starting_util.py
@@ -22,7 +22,6 @@ import collections
 import six
 
 from tensorflow.python.framework import ops
-from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables as variables_lib
@@ -83,11 +82,6 @@ class VocabInfo(
     )
 
 
-def _is_variable(x):
-  return (isinstance(x, variables_lib.Variable) or
-          isinstance(x, resource_variable_ops.ResourceVariable))
-
-
 def _infer_var_name(var):
   """Returns name of the `var`.
 
@@ -126,9 +120,10 @@ def _warm_start_var(var, prev_ckpt, prev_tensor_name=None):
     prev_tensor_name: Name of the tensor to lookup in provided `prev_ckpt`. If
       None, we lookup tensor with same name as given `var`.
   """
-  if _is_variable(var):
+  if checkpoint_utils._is_variable(var):  # pylint: disable=protected-access
     current_var_name = _infer_var_name([var])
-  elif isinstance(var, list) and all(_is_variable(v) for v in var):
+  elif (isinstance(var, list) and
+        all(checkpoint_utils._is_variable(v) for v in var)):  # pylint: disable=protected-access
     current_var_name = _infer_var_name(var)
   elif isinstance(var, variables_lib.PartitionedVariable):
     current_var_name = _infer_var_name([var])
@@ -193,9 +188,10 @@ def _warm_start_var_with_vocab(var,
           prev_vocab_path):
     raise ValueError("Invalid args: Must provide all of [current_vocab_path, "
                      "current_vocab_size, prev_ckpt, prev_vocab_path}.")
-  if _is_variable(var):
+  if checkpoint_utils._is_variable(var):
     var = [var]
-  elif isinstance(var, list) and all(_is_variable(v) for v in var):
+  elif (isinstance(var, list) and
+        all(checkpoint_utils._is_variable(v) for v in var)):
     var = var
   elif isinstance(var, variables_lib.PartitionedVariable):
     var = var._get_variable_list()
@@ -271,7 +267,7 @@ def _get_grouped_variables(vars_to_warm_start):
       for v in vars_to_warm_start:
         list_of_vars += ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES,
                                            scope=v)
-    elif all([_is_variable(v) for v in vars_to_warm_start]):
+    elif all([checkpoint_utils._is_variable(v) for v in vars_to_warm_start]):  # pylint: disable=protected-access
       list_of_vars = vars_to_warm_start
     else:
       raise ValueError("If `vars_to_warm_start` is a list, it must be all "
diff --git a/tensorflow/python/util/deprecation.py b/tensorflow/python/util/deprecation.py
index 376be39978..74e1fb227f 100644
--- a/tensorflow/python/util/deprecation.py
+++ b/tensorflow/python/util/deprecation.py
@@ -37,6 +37,11 @@ _PRINT_DEPRECATION_WARNINGS = True
 _PRINTED_WARNING = {}
 
 
+class DeprecatedNamesAlreadySet(Exception):
+  """Raised when setting deprecated names multiple times for the same symbol."""
+  pass
+
+
 def _add_deprecated_function_notice_to_docstring(doc, date, instructions):
   """Adds a deprecation notice to a docstring for deprecated functions."""
   main_text = ['THIS FUNCTION IS DEPRECATED. It will be removed %s.' %
@@ -87,6 +92,27 @@ def _call_location(outer=False):
     return '%s:%d' % (entry[1], entry[2])
 
 
+def _wrap_decorator(wrapped_function):
+  """Indicate that one function wraps another.
+
+  This decorator wraps a function using `tf_decorator.make_decorator`
+  so that doc generation scripts can pick up original function
+  signature.
+  It would be better to use @functools.wrap decorator, but it would
+  not update function signature to match wrapped function in Python 2.
+
+  Args:
+    wrapped_function: The function that decorated function wraps.
+
+  Returns:
+    Function that accepts wrapper function as an argument and returns
+    `TFDecorator` instance.
+  """
+  def wrapper(wrapper_func):
+    return tf_decorator.make_decorator(wrapped_function, wrapper_func)
+  return wrapper
+
+
 def deprecated_alias(deprecated_name, name, func_or_class, warn_once=True):
   """Deprecate a symbol in favor of a new name with identical semantics.
 
@@ -144,7 +170,7 @@ def deprecated_alias(deprecated_name, name, func_or_class, warn_once=True):
   if tf_inspect.isclass(func_or_class):
 
     # Make a new class with __init__ wrapped in a warning.
-    class NewClass(func_or_class):  # pylint: disable=missing-docstring
+    class _NewClass(func_or_class):  # pylint: disable=missing-docstring
       __doc__ = decorator_utils.add_notice_to_docstring(
           func_or_class.__doc__, 'Please use %s instead.' % name,
           'DEPRECATED CLASS',
@@ -153,27 +179,28 @@ def deprecated_alias(deprecated_name, name, func_or_class, warn_once=True):
       __name__ = func_or_class.__name__
       __module__ = _call_location(outer=True)
 
+      @_wrap_decorator(func_or_class.__init__)
       def __init__(self, *args, **kwargs):
-        if hasattr(NewClass.__init__, '__func__'):
+        if hasattr(_NewClass.__init__, '__func__'):
           # Python 2
-          NewClass.__init__.__func__.__doc__ = func_or_class.__init__.__doc__
+          _NewClass.__init__.__func__.__doc__ = func_or_class.__init__.__doc__
         else:
           # Python 3
-          NewClass.__init__.__doc__ = func_or_class.__init__.__doc__
+          _NewClass.__init__.__doc__ = func_or_class.__init__.__doc__
 
         if _PRINT_DEPRECATION_WARNINGS:
           # We're making the alias as we speak. The original may have other
           # aliases, so we cannot use it to check for whether it's already been
           # warned about.
-          if NewClass.__init__ not in _PRINTED_WARNING:
+          if _NewClass.__init__ not in _PRINTED_WARNING:
             if warn_once:
-              _PRINTED_WARNING[NewClass.__init__] = True
+              _PRINTED_WARNING[_NewClass.__init__] = True
             logging.warning(
                 'From %s: The name %s is deprecated. Please use %s instead.\n',
                 _call_location(), deprecated_name, name)
-        super(NewClass, self).__init__(*args, **kwargs)
+        super(_NewClass, self).__init__(*args, **kwargs)
 
-    return NewClass
+    return _NewClass
   else:
     decorator_utils.validate_callable(func_or_class, 'deprecated')
 
@@ -197,6 +224,35 @@ def deprecated_alias(deprecated_name, name, func_or_class, warn_once=True):
             func_or_class.__doc__, None, 'Please use %s instead.' % name))
 
 
+def deprecated_endpoints(*args):
+  """Decorator for marking endpoints deprecated.
+
+  This decorator does not print deprecation messages.
+  TODO(annarev): eventually start printing deprecation warnings when
+  @deprecation_endpoints decorator is added.
+
+  Args:
+    *args: Deprecated endpoint names.
+
+  Returns:
+    A function that takes symbol as an argument and adds
+    _tf_deprecated_api_names to that symbol.
+    _tf_deprecated_api_names would be set to a list of deprecated
+    endpoint names for the symbol.
+  """
+  def deprecated_wrapper(func):
+    # pylint: disable=protected-access
+    if '_tf_deprecated_api_names' in func.__dict__:
+      raise DeprecatedNamesAlreadySet(
+          'Cannot set deprecated names for %s to %s. '
+          'Deprecated names are already set to %s.' % (
+              func.__name__, str(args), str(func._tf_deprecated_api_names)))
+    func._tf_deprecated_api_names = args
+    # pylint: disable=protected-access
+    return func
+  return deprecated_wrapper
+
+
 def deprecated(date, instructions, warn_once=True):
   """Decorator for marking functions or methods deprecated.
 
@@ -332,7 +388,7 @@ def deprecated_args(date, instructions, *deprecated_arg_names_or_tuples,
     Args:
       names_to_ok_vals: dict from string arg_name to a list of values,
         possibly empty, which should not elicit a warning.
-      arg_spec: Output from tf_inspect.getargspec on the called function.
+      arg_spec: Output from tf_inspect.getfullargspec on the called function.
 
     Returns:
       Dictionary from arg_name to DeprecatedArgSpec.
@@ -352,16 +408,16 @@ def deprecated_args(date, instructions, *deprecated_arg_names_or_tuples,
     decorator_utils.validate_callable(func, 'deprecated_args')
     deprecated_arg_names = _get_arg_names_to_ok_vals()
 
-    arg_spec = tf_inspect.getargspec(func)
+    arg_spec = tf_inspect.getfullargspec(func)
     deprecated_positions = _get_deprecated_positional_arguments(
         deprecated_arg_names, arg_spec)
 
     is_varargs_deprecated = arg_spec.varargs in deprecated_arg_names
-    is_kwargs_deprecated = arg_spec.keywords in deprecated_arg_names
+    is_kwargs_deprecated = arg_spec.varkw in deprecated_arg_names
 
     if (len(deprecated_positions) + is_varargs_deprecated + is_kwargs_deprecated
         != len(deprecated_arg_names_or_tuples)):
-      known_args = arg_spec.args + [arg_spec.varargs, arg_spec.keywords]
+      known_args = arg_spec.args + [arg_spec.varargs, arg_spec.varkw]
       missing_args = [arg_name for arg_name in deprecated_arg_names
                       if arg_name not in known_args]
       raise ValueError('The following deprecated arguments are not present '
@@ -411,7 +467,7 @@ def deprecated_args(date, instructions, *deprecated_arg_names_or_tuples,
         if is_varargs_deprecated and len(args) > len(arg_spec.args):
           invalid_args.append(arg_spec.varargs)
         if is_kwargs_deprecated and kwargs:
-          invalid_args.append(arg_spec.keywords)
+          invalid_args.append(arg_spec.varkw)
         for arg_name in deprecated_arg_names:
           if (arg_name in kwargs and
               not (deprecated_positions[arg_name].has_ok_value and
diff --git a/tensorflow/python/util/deprecation_test.py b/tensorflow/python/util/deprecation_test.py
index bdd0bc48d2..90c73a0a58 100644
--- a/tensorflow/python/util/deprecation_test.py
+++ b/tensorflow/python/util/deprecation_test.py
@@ -22,6 +22,7 @@ from __future__ import print_function
 from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import deprecation
+from tensorflow.python.util import tf_inspect
 
 
 class DeprecatedAliasTest(test.TestCase):
@@ -73,6 +74,11 @@ class DeprecatedAliasTest(test.TestCase):
     self.assertEqual(["test", "deprecated", "deprecated again"],
                      MyClass.init_args)
 
+    # Check __init__ signature matches for doc generation.
+    self.assertEqual(
+        tf_inspect.getfullargspec(MyClass.__init__),
+        tf_inspect.getfullargspec(deprecated_cls.__init__))
+
 
 class DeprecationTest(test.TestCase):
 
@@ -929,5 +935,27 @@ class DeprecationArgumentsTest(test.TestCase):
     self.assertEqual(new_docs, new_docs_ref)
 
 
+class DeprecatedEndpointsTest(test.TestCase):
+
+  def testSingleDeprecatedEndpoint(self):
+    @deprecation.deprecated_endpoints("foo1")
+    def foo():
+      pass
+    self.assertEqual(("foo1",), foo._tf_deprecated_api_names)
+
+  def testMultipleDeprecatedEndpoint(self):
+    @deprecation.deprecated_endpoints("foo1", "foo2")
+    def foo():
+      pass
+    self.assertEqual(("foo1", "foo2"), foo._tf_deprecated_api_names)
+
+  def testCannotSetDeprecatedEndpointsTwice(self):
+    with self.assertRaises(deprecation.DeprecatedNamesAlreadySet):
+      @deprecation.deprecated_endpoints("foo1")
+      @deprecation.deprecated_endpoints("foo2")
+      def foo():  # pylint: disable=unused-variable
+        pass
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/util/function_utils.py b/tensorflow/python/util/function_utils.py
index 7bbbde3cd2..4e9b07e20a 100644
--- a/tensorflow/python/util/function_utils.py
+++ b/tensorflow/python/util/function_utils.py
@@ -20,6 +20,8 @@ from __future__ import print_function
 
 import functools
 
+import six
+
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_inspect
 
@@ -55,3 +57,36 @@ def fn_args(fn):
     if _is_bounded_method(fn):
       args.remove('self')
   return tuple(args)
+
+
+def get_func_name(func):
+  """Returns name of passed callable."""
+  _, func = tf_decorator.unwrap(func)
+  if callable(func):
+    if tf_inspect.isfunction(func):
+      return func.__name__
+    elif tf_inspect.ismethod(func):
+      return '%s.%s' % (six.get_method_self(func).__class__.__name__,
+                        six.get_method_function(func).__name__)
+    else:  # Probably a class instance with __call__
+      return str(type(func))
+  else:
+    raise ValueError('Argument must be callable')
+
+
+def get_func_code(func):
+  """Returns func_code of passed callable, or None if not available."""
+  _, func = tf_decorator.unwrap(func)
+  if callable(func):
+    if tf_inspect.isfunction(func) or tf_inspect.ismethod(func):
+      return six.get_function_code(func)
+    # Since the object is not a function or method, but is a callable, we will
+    # try to access the __call__method as a function.  This works with callable
+    # classes but fails with functool.partial objects despite their __call__
+    # attribute.
+    try:
+      return six.get_function_code(func.__call__)
+    except AttributeError:
+      return None
+  else:
+    raise ValueError('Argument must be callable')
diff --git a/tensorflow/python/util/function_utils_test.py b/tensorflow/python/util/function_utils_test.py
index e78cf6a5b0..1588328c26 100644
--- a/tensorflow/python/util/function_utils_test.py
+++ b/tensorflow/python/util/function_utils_test.py
@@ -24,6 +24,16 @@ from tensorflow.python.platform import test
 from tensorflow.python.util import function_utils
 
 
+def silly_example_function():
+  pass
+
+
+class SillyCallableClass(object):
+
+  def __call__(self):
+    pass
+
+
 class FnArgsTest(test.TestCase):
 
   def test_simple_function(self):
@@ -124,5 +134,73 @@ class FnArgsTest(test.TestCase):
     self.assertEqual(3, double_wrapped_fn(3))
     self.assertEqual(3, double_wrapped_fn(a=3))
 
+
+class GetFuncNameTest(test.TestCase):
+
+  def testWithSimpleFunction(self):
+    self.assertEqual(
+        'silly_example_function',
+        function_utils.get_func_name(silly_example_function))
+
+  def testWithClassMethod(self):
+    self.assertEqual(
+        'GetFuncNameTest.testWithClassMethod',
+        function_utils.get_func_name(self.testWithClassMethod))
+
+  def testWithCallableClass(self):
+    callable_instance = SillyCallableClass()
+    self.assertRegexpMatches(
+        function_utils.get_func_name(callable_instance),
+        '<.*SillyCallableClass.*>')
+
+  def testWithFunctoolsPartial(self):
+    partial = functools.partial(silly_example_function)
+    self.assertRegexpMatches(
+        function_utils.get_func_name(partial),
+        '<.*functools.partial.*>')
+
+  def testWithLambda(self):
+    anon_fn = lambda x: x
+    self.assertEqual('<lambda>', function_utils.get_func_name(anon_fn))
+
+  def testRaisesWithNonCallableObject(self):
+    with self.assertRaises(ValueError):
+      function_utils.get_func_name(None)
+
+
+class GetFuncCodeTest(test.TestCase):
+
+  def testWithSimpleFunction(self):
+    code = function_utils.get_func_code(silly_example_function)
+    self.assertIsNotNone(code)
+    self.assertRegexpMatches(code.co_filename, 'function_utils_test.py')
+
+  def testWithClassMethod(self):
+    code = function_utils.get_func_code(self.testWithClassMethod)
+    self.assertIsNotNone(code)
+    self.assertRegexpMatches(code.co_filename, 'function_utils_test.py')
+
+  def testWithCallableClass(self):
+    callable_instance = SillyCallableClass()
+    code = function_utils.get_func_code(callable_instance)
+    self.assertIsNotNone(code)
+    self.assertRegexpMatches(code.co_filename, 'function_utils_test.py')
+
+  def testWithLambda(self):
+    anon_fn = lambda x: x
+    code = function_utils.get_func_code(anon_fn)
+    self.assertIsNotNone(code)
+    self.assertRegexpMatches(code.co_filename, 'function_utils_test.py')
+
+  def testWithFunctoolsPartial(self):
+    partial = functools.partial(silly_example_function)
+    code = function_utils.get_func_code(partial)
+    self.assertIsNone(code)
+
+  def testRaisesWithNonCallableObject(self):
+    with self.assertRaises(ValueError):
+      function_utils.get_func_code(None)
+
+
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/util/nest.py b/tensorflow/python/util/nest.py
index d63f59a8c8..faae0d89c3 100644
--- a/tensorflow/python/util/nest.py
+++ b/tensorflow/python/util/nest.py
@@ -73,7 +73,7 @@ def _sequence_like(instance, args):
   Returns:
     `args` with the type of `instance`.
   """
-  if isinstance(instance, dict):
+  if isinstance(instance, (dict, _collections.Mapping)):
     # Pack dictionaries in a deterministic order by sorting the keys.
     # Notice this means that we ignore the original order of `OrderedDict`
     # instances. This is intentional, to avoid potential bugs caused by mixing
@@ -89,7 +89,7 @@ def _sequence_like(instance, args):
 
 
 def _yield_value(iterable):
-  if isinstance(iterable, dict):
+  if isinstance(iterable, (dict, _collections.Mapping)):
     # Iterate through dictionaries in a deterministic order by sorting the
     # keys. Notice this means that we ignore the original order of `OrderedDict`
     # instances. This is intentional, to avoid potential bugs caused by mixing
@@ -215,7 +215,7 @@ def flatten_dict_items(dictionary):
     ValueError: If any key and value have not the same structure, or if keys are
       not unique.
   """
-  if not isinstance(dictionary, dict):
+  if not isinstance(dictionary, (dict, _collections.Mapping)):
     raise TypeError("input must be a dictionary")
   flat_dictionary = {}
   for i, v in _six.iteritems(dictionary):
@@ -377,6 +377,62 @@ def map_structure(func, *structure, **check_types_dict):
       structure[0], [func(*x) for x in entries])
 
 
+def map_structure_with_paths(func, *structure, **kwargs):
+  """Applies `func` to each entry in `structure` and returns a new structure.
+
+  Applies `func(path, x[0], x[1], ..., **kwargs)` where x[i] is an entry in
+  `structure[i]` and `path` is the common path to x[i] in the structures.  All
+  structures in `structure` must have the same arity, and the return value will
+  contain the results in the same structure. Special kwarg `check_types`
+  determines whether the types of iterables within the structure must be the
+  same-- see **kwargs definition below.
+
+  Args:
+    func: A callable with the signature func(path, *values, **kwargs) that is
+      evaluated on the leaves of the structure.
+    *structure: A variable number of compatible structures to process.
+    **kwargs: Optional kwargs to be passed through to func. Special kwarg
+      `check_types` is not passed to func, but instead determines whether the
+      types of iterables within the structures have to be same (e.g.,
+      `map_structure(func, [1], (1,))` raises a `TypeError` exception). By
+      default, the types must match. To allow iteration over structures of
+      different types (but common arity), set this kwarg to `False`.
+
+  Returns:
+    A structure of the same form as the input structures whose leaves are the
+    result of evaluating func on corresponding leaves of the input structures.
+
+  Raises:
+    TypeError: If `func` is not callable or if the structures do not match
+      each other by depth tree.
+    TypeError: If `check_types` is not `False` and the two structures differ in
+      the type of sequence in any of their substructures.
+    ValueError: If no structures are provided.
+  """
+  if not callable(func):
+    raise TypeError("func must be callable, got: %s" % func)
+  if not structure:
+    raise ValueError("Must provide at least one structure")
+
+  check_types = kwargs.pop("check_types", True)
+  for other in structure[1:]:
+    assert_same_structure(structure[0], other, check_types=check_types)
+
+  # First set paths_and_values to:
+  # [[(p11, v11), ... (p1n, v1n)], ... [(pm1, vm1), ... (pmn, vmn)]]
+  paths_and_values = [flatten_with_joined_string_paths(s) for s in structure]
+
+  # Now zip(*paths_and_values) would be:
+  # [((p11, v11), ... (pm1, vm1)), ... ((p1n, v1n), ... (pmn, vmn))]
+  # so grouped_by_path is set to:
+  # [[(p11, ... pm1), (v11, ... vm1)], ... [(p1n, ... pmn), (v1n, ... vmn)]]
+  # Note that p1i, ... pmi must all be equal since the structures are the same.
+  grouped_by_path = [zip(*p_v) for p_v in zip(*paths_and_values)]
+
+  return pack_sequence_as(structure[0], [
+      func(paths[0], *values, **kwargs) for paths, values in grouped_by_path])
+
+
 def _yield_flat_up_to(shallow_tree, input_tree):
   """Yields elements `input_tree` partially flattened up to `shallow_tree`."""
   if is_sequence(shallow_tree):
@@ -455,7 +511,7 @@ def assert_shallow_structure(shallow_tree, input_tree, check_types=True):
           "structure has length %s, while shallow structure has length %s."
           % (len(input_tree), len(shallow_tree)))
 
-    if check_types and isinstance(shallow_tree, dict):
+    if check_types and isinstance(shallow_tree, (dict, _collections.Mapping)):
       if set(input_tree) != set(shallow_tree):
         raise ValueError(
             "The two structures don't have the same keys. Input "
@@ -716,7 +772,7 @@ def yield_flat_paths(nest):
 
   # The _maybe_add_final_path_element function is used below in order to avoid
   # adding trailing slashes when the sub-element recursed into is a leaf.
-  if isinstance(nest, dict):
+  if isinstance(nest, (dict, _collections.Mapping)):
     for key in _sorted(nest):
       value = nest[key]
       for sub_path in yield_flat_paths(value):
@@ -760,3 +816,4 @@ def flatten_with_joined_string_paths(structure, separator="/"):
 
 
 _pywrap_tensorflow.RegisterSequenceClass(_collections.Sequence)
+_pywrap_tensorflow.RegisterMappingClass(_collections.Mapping)
diff --git a/tensorflow/python/util/nest_test.py b/tensorflow/python/util/nest_test.py
index 2f12b25354..2369eb610e 100644
--- a/tensorflow/python/util/nest_test.py
+++ b/tensorflow/python/util/nest_test.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 import collections
 import time
 
+from absl.testing import parameterized
 import numpy as np
 from six.moves import xrange  # pylint: disable=redefined-builtin
 
@@ -33,7 +34,22 @@ from tensorflow.python.platform import test
 from tensorflow.python.util import nest
 
 
-class NestTest(test.TestCase):
+class _CustomMapping(collections.Mapping):
+
+  def __init__(self, *args, **kwargs):
+    self._wrapped = dict(*args, **kwargs)
+
+  def __getitem__(self, key):
+    return self._wrapped[key]
+
+  def __iter__(self):
+    return iter(self._wrapped)
+
+  def __len__(self):
+    return len(self._wrapped)
+
+
+class NestTest(parameterized.TestCase, test.TestCase):
 
   PointXY = collections.namedtuple("Point", ["x", "y"])  # pylint: disable=invalid-name
 
@@ -72,26 +88,32 @@ class NestTest(test.TestCase):
     with self.assertRaises(ValueError):
       nest.pack_sequence_as([5, 6, [7, 8]], ["a", "b", "c"])
 
+  @parameterized.parameters({"mapping_type": collections.OrderedDict},
+                            {"mapping_type": _CustomMapping})
   @test_util.assert_no_new_pyobjects_executing_eagerly
-  def testFlattenDictOrder(self):
+  def testFlattenDictOrder(self, mapping_type):
     """`flatten` orders dicts by key, including OrderedDicts."""
-    ordered = collections.OrderedDict([("d", 3), ("b", 1), ("a", 0), ("c", 2)])
+    ordered = mapping_type([("d", 3), ("b", 1), ("a", 0), ("c", 2)])
     plain = {"d": 3, "b": 1, "a": 0, "c": 2}
     ordered_flat = nest.flatten(ordered)
     plain_flat = nest.flatten(plain)
     self.assertEqual([0, 1, 2, 3], ordered_flat)
     self.assertEqual([0, 1, 2, 3], plain_flat)
 
-  def testPackDictOrder(self):
+  @parameterized.parameters({"mapping_type": collections.OrderedDict},
+                            {"mapping_type": _CustomMapping})
+  def testPackDictOrder(self, mapping_type):
     """Packing orders dicts by key, including OrderedDicts."""
-    ordered = collections.OrderedDict([("d", 0), ("b", 0), ("a", 0), ("c", 0)])
+    custom = mapping_type([("d", 0), ("b", 0), ("a", 0), ("c", 0)])
     plain = {"d": 0, "b": 0, "a": 0, "c": 0}
     seq = [0, 1, 2, 3]
-    ordered_reconstruction = nest.pack_sequence_as(ordered, seq)
+    custom_reconstruction = nest.pack_sequence_as(custom, seq)
     plain_reconstruction = nest.pack_sequence_as(plain, seq)
+    self.assertIsInstance(custom_reconstruction, mapping_type)
+    self.assertIsInstance(plain_reconstruction, dict)
     self.assertEqual(
-        collections.OrderedDict([("d", 3), ("b", 1), ("a", 0), ("c", 2)]),
-        ordered_reconstruction)
+        mapping_type([("d", 3), ("b", 1), ("a", 0), ("c", 2)]),
+        custom_reconstruction)
     self.assertEqual({"d": 3, "b": 1, "a": 0, "c": 2}, plain_reconstruction)
 
   Abc = collections.namedtuple("A", ("b", "c"))  # pylint: disable=invalid-name
@@ -101,8 +123,10 @@ class NestTest(test.TestCase):
     # A nice messy mix of tuples, lists, dicts, and `OrderedDict`s.
     mess = [
         "z",
-        NestTest.Abc(3, 4),
-        {
+        NestTest.Abc(3, 4), {
+            "d": _CustomMapping({
+                41: 4
+            }),
             "c": [
                 1,
                 collections.OrderedDict([
@@ -111,17 +135,19 @@ class NestTest(test.TestCase):
                 ]),
             ],
             "b": 5
-        },
-        17
+        }, 17
     ]
 
     flattened = nest.flatten(mess)
-    self.assertEqual(flattened, ["z", 3, 4, 5, 1, 2, 3, 17])
+    self.assertEqual(flattened, ["z", 3, 4, 5, 1, 2, 3, 4, 17])
 
     structure_of_mess = [
         14,
         NestTest.Abc("a", True),
         {
+            "d": _CustomMapping({
+                41: 42
+            }),
             "c": [
                 0,
                 collections.OrderedDict([
@@ -142,6 +168,10 @@ class NestTest(test.TestCase):
     self.assertIsInstance(unflattened_ordered_dict, collections.OrderedDict)
     self.assertEqual(list(unflattened_ordered_dict.keys()), ["b", "a"])
 
+    unflattened_custom_mapping = unflattened[2]["d"]
+    self.assertIsInstance(unflattened_custom_mapping, _CustomMapping)
+    self.assertEqual(list(unflattened_custom_mapping.keys()), [41])
+
   def testFlatten_numpyIsNotFlattened(self):
     structure = np.array([1, 2, 3])
     flattened = nest.flatten(structure)
@@ -179,19 +209,23 @@ class NestTest(test.TestCase):
     self.assertFalse(nest.is_sequence(math_ops.tanh(ones)))
     self.assertFalse(nest.is_sequence(np.ones((4, 5))))
 
-  def testFlattenDictItems(self):
-    dictionary = {(4, 5, (6, 8)): ("a", "b", ("c", "d"))}
+  @parameterized.parameters({"mapping_type": _CustomMapping},
+                            {"mapping_type": dict})
+  def testFlattenDictItems(self, mapping_type):
+    dictionary = mapping_type({(4, 5, (6, 8)): ("a", "b", ("c", "d"))})
     flat = {4: "a", 5: "b", 6: "c", 8: "d"}
     self.assertEqual(nest.flatten_dict_items(dictionary), flat)
 
     with self.assertRaises(TypeError):
       nest.flatten_dict_items(4)
 
-    bad_dictionary = {(4, 5, (4, 8)): ("a", "b", ("c", "d"))}
+    bad_dictionary = mapping_type({(4, 5, (4, 8)): ("a", "b", ("c", "d"))})
     with self.assertRaisesRegexp(ValueError, "not unique"):
       nest.flatten_dict_items(bad_dictionary)
 
-    another_bad_dictionary = {(4, 5, (6, 8)): ("a", "b", ("c", ("d", "e")))}
+    another_bad_dictionary = mapping_type({
+        (4, 5, (6, 8)): ("a", "b", ("c", ("d", "e")))
+    })
     with self.assertRaisesRegexp(
         ValueError, "Key had [0-9]* elements, but value had [0-9]* elements"):
       nest.flatten_dict_items(another_bad_dictionary)
@@ -320,6 +354,10 @@ class NestTest(test.TestCase):
 
   EmptyNT = collections.namedtuple("empty_nt", "")  # pylint: disable=invalid-name
 
+  def testHeterogeneousComparison(self):
+    nest.assert_same_structure({"a": 4}, _CustomMapping(a=3))
+    nest.assert_same_structure(_CustomMapping(b=3), {"b": 4})
+
   @test_util.assert_no_new_pyobjects_executing_eagerly
   def testMapStructure(self):
     structure1 = (((1, 2), 3), 4, (5, 6))
@@ -712,6 +750,35 @@ class NestTest(test.TestCase):
       self.assertEqual(
           list(nest.flatten_with_joined_string_paths(inputs)), expected)
 
+  @parameterized.named_parameters(
+      ("tuples", (1, 2), (3, 4), True, (("0", 4), ("1", 6))),
+      ("dicts", {"a": 1, "b": 2}, {"b": 4, "a": 3}, True,
+       {"a": ("a", 4), "b": ("b", 6)}),
+      ("mixed", (1, 2), [3, 4], False, (("0", 4), ("1", 6))),
+      ("nested",
+       {"a": [2, 3], "b": [1, 2, 3]}, {"b": [5, 6, 7], "a": [8, 9]}, True,
+       {"a": [("a/0", 10), ("a/1", 12)],
+        "b": [("b/0", 6), ("b/1", 8), ("b/2", 10)]}))
+  def testMapWithPathsCompatibleStructures(self, s1, s2, check_types, expected):
+    def format_sum(path, *values):
+      return (path, sum(values))
+    result = nest.map_structure_with_paths(format_sum, s1, s2,
+                                           check_types=check_types)
+    self.assertEqual(expected, result)
+
+  @parameterized.named_parameters(
+      ("tuples", (1, 2), (3, 4, 5), ValueError),
+      ("dicts", {"a": 1}, {"b": 2}, ValueError),
+      ("mixed", (1, 2), [3, 4], TypeError),
+      ("nested",
+       {"a": [2, 3], "b": [1, 3]},
+       {"b": [5, 6, 7], "a": [8, 9]},
+       ValueError
+      ))
+  def testMapWithPathsIncompatibleStructures(self, s1, s2, error_type):
+    with self.assertRaises(error_type):
+      nest.map_structure_with_paths(lambda path, *s: 0, s1, s2)
+
 
 class NestBenchmark(test.Benchmark):
 
diff --git a/tensorflow/python/util/py_checkpoint_reader.i b/tensorflow/python/util/py_checkpoint_reader.i
index 8004898cbc..1c73f7f06f 100644
--- a/tensorflow/python/util/py_checkpoint_reader.i
+++ b/tensorflow/python/util/py_checkpoint_reader.i
@@ -166,6 +166,7 @@ def NewCheckpointReader(filepattern):
     return CheckpointReader(compat.as_bytes(filepattern), status)
 
 NewCheckpointReader._tf_api_names = ['train.NewCheckpointReader']
+NewCheckpointReader._tf_api_names_v1 = ['train.NewCheckpointReader']
 %}
 
 %include "tensorflow/c/checkpoint_reader.h"
diff --git a/tensorflow/python/util/stat_summarizer.i b/tensorflow/python/util/stat_summarizer.i
index 73fa85494b..a5a7984d91 100644
--- a/tensorflow/python/util/stat_summarizer.i
+++ b/tensorflow/python/util/stat_summarizer.i
@@ -27,8 +27,8 @@ limitations under the License.
 
 %ignoreall
 
-%unignore _NewStatSummarizer;
-%unignore _DeleteStatSummarizer;
+%unignore NewStatSummarizer;
+%unignore DeleteStatSummarizer;
 %unignore tensorflow;
 %unignore tensorflow::StatSummarizer;
 %unignore tensorflow::StatSummarizer::StatSummarizer;
@@ -43,20 +43,20 @@ limitations under the License.
 
 // TODO(ashankar): Remove the unused argument from the API.
 %{
-tensorflow::StatSummarizer* _NewStatSummarizer(
+tensorflow::StatSummarizer* NewStatSummarizer(
       const string& unused) {
   return new tensorflow::StatSummarizer(tensorflow::StatSummarizerOptions());
 }
 %}
 
 %{
-void _DeleteStatSummarizer(tensorflow::StatSummarizer* ss) {
+void DeleteStatSummarizer(tensorflow::StatSummarizer* ss) {
   delete ss;
 }
 %}
 
-tensorflow::StatSummarizer* _NewStatSummarizer(const string& unused);
-void _DeleteStatSummarizer(tensorflow::StatSummarizer* ss);
+tensorflow::StatSummarizer* NewStatSummarizer(const string& unused);
+void DeleteStatSummarizer(tensorflow::StatSummarizer* ss);
 
 %extend tensorflow::StatSummarizer {
   void ProcessStepStatsStr(const string& step_stats_str) {
@@ -76,16 +76,3 @@ void _DeleteStatSummarizer(tensorflow::StatSummarizer* ss);
 %include "tensorflow/core/util/stat_summarizer_options.h"
 %include "tensorflow/core/util/stat_summarizer.h"
 %unignoreall
-
-%insert("python") %{
-
-# Wrapping NewStatSummarizer and DeletStatSummarizer because
-# SWIG-generated functions are built-in functions and do not support
-# setting _tf_api_names attribute.
-
-def NewStatSummarizer(unused):
-  return _NewStatSummarizer(unused)
-
-def DeleteStatSummarizer(stat_summarizer):
-  _DeleteStatSummarizer(stat_summarizer)
-%}
diff --git a/tensorflow/python/util/tf_export.py b/tensorflow/python/util/tf_export.py
index e154ffb68a..274f32c21f 100644
--- a/tensorflow/python/util/tf_export.py
+++ b/tensorflow/python/util/tf_export.py
@@ -63,12 +63,63 @@ API_ATTRS = {
         '_estimator_api_constants')
 }
 
+API_ATTRS_V1 = {
+    TENSORFLOW_API_NAME: _Attributes(
+        '_tf_api_names_v1',
+        '_tf_api_constants_v1'),
+    ESTIMATOR_API_NAME: _Attributes(
+        '_estimator_api_names_v1',
+        '_estimator_api_constants_v1')
+}
+
 
 class SymbolAlreadyExposedError(Exception):
   """Raised when adding API names to symbol that already has API names."""
   pass
 
 
+def get_canonical_name_for_symbol(symbol, api_name=TENSORFLOW_API_NAME):
+  """Get canonical name for the API symbol.
+
+  Canonical name is the first non-deprecated endpoint name.
+
+  Args:
+    symbol: API function or class.
+    api_name: API name (tensorflow or estimator).
+
+  Returns:
+    Canonical name for the API symbol (for e.g. initializers.zeros) if
+    canonical name could be determined. Otherwise, returns None.
+  """
+  if not hasattr(symbol, '__dict__'):
+    return None
+  api_names_attr = API_ATTRS[api_name].names
+  _, undecorated_symbol = tf_decorator.unwrap(symbol)
+  if api_names_attr not in undecorated_symbol.__dict__:
+    return None
+  api_names = getattr(undecorated_symbol, api_names_attr)
+  # TODO(annarev): may be add a separate deprecated attribute
+  # for estimator names.
+  deprecated_api_names = undecorated_symbol.__dict__.get(
+      '_tf_deprecated_api_names', [])
+  return get_canonical_name(api_names, deprecated_api_names)
+
+
+def get_canonical_name(api_names, deprecated_api_names):
+  """Get first non-deprecated endpoint name.
+
+  Args:
+    api_names: API names iterable.
+    deprecated_api_names: Deprecated API names iterable.
+  Returns:
+    Canonical name if there is at least one non-deprecated endpoint.
+    Otherwise returns None.
+  """
+  return next(
+      (name for name in api_names if name not in deprecated_api_names),
+      None)
+
+
 class api_export(object):  # pylint: disable=invalid-name
   """Provides ways to export symbols to the TensorFlow API."""
 
@@ -78,13 +129,16 @@ class api_export(object):  # pylint: disable=invalid-name
     Args:
       *args: API names in dot delimited format.
       **kwargs: Optional keyed arguments.
-          overrides: List of symbols that this is overriding
+        v1: Names for the TensorFlow V1 API. If not set, we will use V2 API
+          names both for TensorFlow V1 and V2 APIs.
+        overrides: List of symbols that this is overriding
           (those overrided api exports will be removed). Note: passing overrides
           has no effect on exporting a constant.
-          api_name: Name of the API you want to generate (e.g. `tensorflow` or
+        api_name: Name of the API you want to generate (e.g. `tensorflow` or
           `estimator`). Default is `tensorflow`.
     """
     self._names = args
+    self._names_v1 = kwargs.get('v1', args)
     self._api_name = kwargs.get('api_name', TENSORFLOW_API_NAME)
     self._overrides = kwargs.get('overrides', [])
 
@@ -102,24 +156,27 @@ class api_export(object):  # pylint: disable=invalid-name
         and kwarg `allow_multiple_exports` not set.
     """
     api_names_attr = API_ATTRS[self._api_name].names
-
+    api_names_attr_v1 = API_ATTRS_V1[self._api_name].names
     # Undecorate overridden names
     for f in self._overrides:
       _, undecorated_f = tf_decorator.unwrap(f)
       delattr(undecorated_f, api_names_attr)
+      delattr(undecorated_f, api_names_attr_v1)
 
     _, undecorated_func = tf_decorator.unwrap(func)
+    self.set_attr(undecorated_func, api_names_attr, self._names)
+    self.set_attr(undecorated_func, api_names_attr_v1, self._names_v1)
+    return func
 
+  def set_attr(self, func, api_names_attr, names):
     # Check for an existing api. We check if attribute name is in
     # __dict__ instead of using hasattr to verify that subclasses have
     # their own _tf_api_names as opposed to just inheriting it.
-    if api_names_attr in undecorated_func.__dict__:
+    if api_names_attr in func.__dict__:
       raise SymbolAlreadyExposedError(
           'Symbol %s is already exposed as %s.' %
-          (undecorated_func.__name__, getattr(
-              undecorated_func, api_names_attr)))  # pylint: disable=protected-access
-    setattr(undecorated_func, api_names_attr, self._names)
-    return func
+          (func.__name__, getattr(func, api_names_attr)))  # pylint: disable=protected-access
+    setattr(func, api_names_attr, names)
 
   def export_constant(self, module_name, name):
     """Store export information for constants/string literals.
@@ -140,12 +197,20 @@ class api_export(object):  # pylint: disable=invalid-name
       name: (string) Current constant name.
     """
     module = sys.modules[module_name]
-    if not hasattr(module, API_ATTRS[self._api_name].constants):
-      setattr(module, API_ATTRS[self._api_name].constants, [])
+    api_constants_attr = API_ATTRS[self._api_name].constants
+    api_constants_attr_v1 = API_ATTRS_V1[self._api_name].constants
+
+    if not hasattr(module, api_constants_attr):
+      setattr(module, api_constants_attr, [])
     # pylint: disable=protected-access
-    getattr(module, API_ATTRS[self._api_name].constants).append(
+    getattr(module, api_constants_attr).append(
         (self._names, name))
 
+    if not hasattr(module, api_constants_attr_v1):
+      setattr(module, api_constants_attr_v1, [])
+    getattr(module, api_constants_attr_v1).append(
+        (self._names_v1, name))
+
 
 tf_export = functools.partial(api_export, api_name=TENSORFLOW_API_NAME)
 estimator_export = functools.partial(tf_export, api_name=ESTIMATOR_API_NAME)
diff --git a/tensorflow/python/util/tf_export_test.py b/tensorflow/python/util/tf_export_test.py
index b9e26ecb33..4ae1dc55e0 100644
--- a/tensorflow/python/util/tf_export_test.py
+++ b/tensorflow/python/util/tf_export_test.py
@@ -60,6 +60,8 @@ class ValidateExportTest(test.TestCase):
     for symbol in [_test_function, _test_function, TestClassA, TestClassB]:
       if hasattr(symbol, '_tf_api_names'):
         del symbol._tf_api_names
+      if hasattr(symbol, '_tf_api_names_v1'):
+        del symbol._tf_api_names_v1
 
   def _CreateMockModule(self, name):
     mock_module = self.MockModule(name)
diff --git a/tensorflow/python/util/tf_inspect.py b/tensorflow/python/util/tf_inspect.py
index fbd6561767..ec20998bdd 100644
--- a/tensorflow/python/util/tf_inspect.py
+++ b/tensorflow/python/util/tf_inspect.py
@@ -300,6 +300,16 @@ def getsource(object):  # pylint: disable=redefined-builtin
   return _inspect.getsource(tf_decorator.unwrap(object)[1])
 
 
+def getsourcefile(object):  # pylint: disable=redefined-builtin
+  """TFDecorator-aware replacement for inspect.getsourcefile."""
+  return _inspect.getsourcefile(tf_decorator.unwrap(object)[1])
+
+
+def getsourcelines(object):  # pylint: disable=redefined-builtin
+  """TFDecorator-aware replacement for inspect.getsourcelines."""
+  return _inspect.getsourcelines(tf_decorator.unwrap(object)[1])
+
+
 def isbuiltin(object):  # pylint: disable=redefined-builtin
   """TFDecorator-aware replacement for inspect.isbuiltin."""
   return _inspect.isbuiltin(tf_decorator.unwrap(object)[1])
diff --git a/tensorflow/python/util/tf_inspect_test.py b/tensorflow/python/util/tf_inspect_test.py
index beaf350de1..2f6021c7d8 100644
--- a/tensorflow/python/util/tf_inspect_test.py
+++ b/tensorflow/python/util/tf_inspect_test.py
@@ -326,6 +326,18 @@ def test_decorated_function_with_defaults(a, b=2, c='Hello'):
     self.assertEqual(
         expected, tf_inspect.getsource(test_decorated_function_with_defaults))
 
+  def testGetSourceFile(self):
+    self.assertEqual(
+        __file__,
+        tf_inspect.getsourcefile(test_decorated_function_with_defaults))
+
+  def testGetSourceLines(self):
+    expected = inspect.getsourcelines(
+        test_decorated_function_with_defaults.decorated_target)
+    self.assertEqual(
+        expected,
+        tf_inspect.getsourcelines(test_decorated_function_with_defaults))
+
   def testIsBuiltin(self):
     self.assertEqual(
         tf_inspect.isbuiltin(TestDecoratedClass),
diff --git a/tensorflow/python/util/tf_stack.py b/tensorflow/python/util/tf_stack.py
new file mode 100644
index 0000000000..fe4f4a63eb
--- /dev/null
+++ b/tensorflow/python/util/tf_stack.py
@@ -0,0 +1,103 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Functions used to extract and analyze stacks.  Faster than Python libs."""
+# pylint: disable=g-bad-name
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import linecache
+import sys
+
+# Names for indices into TF traceback tuples.
+TB_FILENAME = 0
+TB_LINENO = 1
+TB_FUNCNAME = 2
+TB_CODEDICT = 3  # Dictionary of Python interpreter state.
+
+
+def extract_stack(extract_frame_info_fn=None):
+  """A lightweight, extensible re-implementation of traceback.extract_stack.
+
+  NOTE(mrry): traceback.extract_stack eagerly retrieves the line of code for
+      each stack frame using linecache, which results in an abundance of stat()
+      calls. This implementation does not retrieve the code, and any consumer
+      should apply _convert_stack to the result to obtain a traceback that can
+      be formatted etc. using traceback methods.
+
+  Args:
+    extract_frame_info_fn: Optional callable fn(stack_frame) applied to each
+        stack frame.  This callable's return value is stored as the sixth (last)
+        element of the returned tuples.  If not provided, the returned tuples
+        will have None as their sixth value.
+
+  Returns:
+    A list of 6-tuples
+        (filename, lineno, name, frame_globals, func_start_lineno, custom_info)
+    corresponding to the call stack of the current thread.  The returned tuples
+    have the innermost stack frame at the end, unlike the Python inspect
+    module's stack() function.
+  """
+  default_fn = lambda f: None
+  extract_frame_info_fn = extract_frame_info_fn or default_fn
+  try:
+    raise ZeroDivisionError
+  except ZeroDivisionError:
+    f = sys.exc_info()[2].tb_frame.f_back
+  ret = []
+  while f is not None:
+    lineno = f.f_lineno
+    co = f.f_code
+    filename = co.co_filename
+    name = co.co_name
+    frame_globals = f.f_globals
+    func_start_lineno = co.co_firstlineno
+    frame_info = extract_frame_info_fn(f)
+    ret.append((filename, lineno, name, frame_globals, func_start_lineno,
+                frame_info))
+    f = f.f_back
+  ret.reverse()
+  return ret
+
+
+def convert_stack(stack, include_func_start_lineno=False):
+  """Converts a stack extracted using extract_stack() to a traceback stack.
+
+  Args:
+    stack: A list of n 5-tuples,
+      (filename, lineno, name, frame_globals, func_start_lineno).
+    include_func_start_lineno: True if function start line number should be
+      included as the 5th entry in return tuples.
+
+  Returns:
+    A list of n 4-tuples or 5-tuples
+    (filename, lineno, name, code, [optional: func_start_lineno]), where the
+    code tuple element is calculated from the corresponding elements of the
+    input tuple.
+  """
+  ret = []
+  for (filename, lineno, name, frame_globals, func_start_lineno,
+       unused_frame_info) in stack:
+    linecache.checkcache(filename)
+    line = linecache.getline(filename, lineno, frame_globals)
+    if line:
+      line = line.strip()
+    else:
+      line = None
+    if include_func_start_lineno:
+      ret.append((filename, lineno, name, line, func_start_lineno))
+    else:
+      ret.append((filename, lineno, name, line))
+  return ret
diff --git a/tensorflow/python/util/util.cc b/tensorflow/python/util/util.cc
index 366f8a0deb..ebb72079ef 100644
--- a/tensorflow/python/util/util.cc
+++ b/tensorflow/python/util/util.cc
@@ -31,6 +31,8 @@ namespace {
 
 // Type object for collections.Sequence. This is set by RegisterSequenceClass.
 PyObject* CollectionsSequenceType = nullptr;
+// Type object for collections.Mapping, set by RegisterMappingClass.
+PyObject* CollectionsMappingType = nullptr;
 PyTypeObject* SparseTensorValueType = nullptr;
 
 const int kMaxItemsInCache = 1024;
@@ -45,6 +47,28 @@ bool IsString(PyObject* o) {
          PyUnicode_Check(o);
 }
 
+// Work around a writable-strings warning with Python 2's PyMapping_Keys macro,
+// and while we're at it give them consistent behavior by making sure the
+// returned value is a list.
+//
+// As with PyMapping_Keys, returns a new reference.
+//
+// On failure, returns nullptr.
+PyObject* MappingKeys(PyObject* o) {
+#if PY_MAJOR_VERSION >= 3
+  return PyMapping_Keys(o);
+#else
+  static char key_method_name[] = "keys";
+  Safe_PyObjectPtr raw_result(PyObject_CallMethod(o, key_method_name, nullptr));
+  if (PyErr_Occurred() || raw_result.get() == nullptr) {
+    return nullptr;
+  }
+  return PySequence_Fast(
+      raw_result.get(),
+      "The '.keys()' method of a custom mapping returned a non-sequence.");
+#endif
+}
+
 // Equivalent to Python's 'o.__class__.__name__'
 // Note that '__class__' attribute is set only in new-style classes.
 // A lot of tensorflow code uses __class__ without checks, so it seems like
@@ -85,6 +109,119 @@ string PyObjectToString(PyObject* o) {
   }
 }
 
+class CachedTypeCheck {
+ public:
+  explicit CachedTypeCheck(std::function<int(PyObject*)> ternary_predicate)
+      : ternary_predicate_(std::move(ternary_predicate)) {}
+
+  ~CachedTypeCheck() {
+    mutex_lock l(type_to_sequence_map_mu_);
+    for (const auto& pair : type_to_sequence_map_) {
+      Py_DECREF(pair.first);
+    }
+  }
+
+  // Caches successful executions of the one-argument (PyObject*) callable
+  // "ternary_predicate" based on the type of "o". -1 from the callable
+  // indicates an unsuccessful check (not cached), 0 indicates that "o"'s type
+  // does not match the predicate, and 1 indicates that it does. Used to avoid
+  // calling back into Python for expensive isinstance checks.
+  int CachedLookup(PyObject* o) {
+    // Try not to return to Python - see if the type has already been seen
+    // before.
+
+    auto* type = Py_TYPE(o);
+
+    {
+      mutex_lock l(type_to_sequence_map_mu_);
+      auto it = type_to_sequence_map_.find(type);
+      if (it != type_to_sequence_map_.end()) {
+        return it->second;
+      }
+    }
+
+    int check_result = ternary_predicate_(o);
+
+    if (check_result == -1) {
+      return -1;  // Type check error, not cached.
+    }
+
+    // NOTE: This is never decref'd as long as the object lives, which is likely
+    // forever, but we don't want the type to get deleted as long as it is in
+    // the map. This should not be too much of a leak, as there should only be a
+    // relatively small number of types in the map, and an even smaller number
+    // that are eligible for decref. As a precaution, we limit the size of the
+    // map to 1024.
+    {
+      mutex_lock l(type_to_sequence_map_mu_);
+      if (type_to_sequence_map_.size() < kMaxItemsInCache) {
+        Py_INCREF(type);
+        type_to_sequence_map_.insert({type, check_result});
+      }
+    }
+
+    return check_result;
+  }
+
+ private:
+  std::function<int(PyObject*)> ternary_predicate_;
+  mutex type_to_sequence_map_mu_;
+  std::unordered_map<PyTypeObject*, bool> type_to_sequence_map_
+      GUARDED_BY(type_to_sequence_map_mu_);
+};
+
+// Returns 1 if `o` is considered a mapping for the purposes of Flatten().
+// Returns 0 otherwise.
+// Returns -1 if an error occurred.
+int IsMappingHelper(PyObject* o) {
+  static auto* const check_cache = new CachedTypeCheck([](PyObject* to_check) {
+    return PyObject_IsInstance(to_check, CollectionsMappingType);
+  });
+  if (PyDict_Check(o)) return true;
+  if (TF_PREDICT_FALSE(CollectionsMappingType == nullptr)) {
+    PyErr_SetString(
+        PyExc_RuntimeError,
+        tensorflow::strings::StrCat(
+            "collections.Mapping type has not been set. "
+            "Please call RegisterMappingClass before using this module")
+            .c_str());
+    return -1;
+  }
+  return check_cache->CachedLookup(o);
+}
+
+// Returns 1 if `o` is considered a sequence for the purposes of Flatten().
+// Returns 0 otherwise.
+// Returns -1 if an error occurred.
+int IsSequenceHelper(PyObject* o) {
+  static auto* const check_cache = new CachedTypeCheck([](PyObject* to_check) {
+    int is_instance = PyObject_IsInstance(to_check, CollectionsSequenceType);
+
+    // Don't cache a failed is_instance check.
+    if (is_instance == -1) return -1;
+
+    return static_cast<int>(is_instance != 0 && !IsString(to_check));
+  });
+  // We treat dicts and other mappings as special cases of sequences.
+  if (IsMappingHelper(o)) return true;
+  if (PySet_Check(o) && !WarnedThatSetIsNotSequence) {
+    LOG(WARNING) << "Sets are not currently considered sequences, "
+                    "but this may change in the future, "
+                    "so consider avoiding using them.";
+    WarnedThatSetIsNotSequence = true;
+  }
+  if (TF_PREDICT_FALSE(CollectionsSequenceType == nullptr)) {
+    PyErr_SetString(
+        PyExc_RuntimeError,
+        tensorflow::strings::StrCat(
+            "collections.Sequence type has not been set. "
+            "Please call RegisterSequenceClass before using this module")
+            .c_str());
+    return -1;
+  }
+  return check_cache->CachedLookup(o);
+}
+
 // Implements the same idea as tensorflow.util.nest._yield_value
 // During construction we check if the iterable is a dictionary.
 // If so, we construct a sequence from its sorted keys that will be used
@@ -96,7 +233,12 @@ string PyObjectToString(PyObject* o) {
 // 'iterable' must not be modified while ValIterator is used.
 class ValIterator {
  public:
-  explicit ValIterator(PyObject* iterable) : dict_(nullptr), index_(0) {
+  explicit ValIterator(PyObject* iterable)
+      : dict_(nullptr),
+        mapping_(nullptr),
+        last_mapping_element_(nullptr),
+        seq_(nullptr),
+        index_(0) {
     if (PyDict_Check(iterable)) {
       dict_ = iterable;
       // PyDict_Keys returns a list, which can be used with
@@ -108,6 +250,10 @@ class ValIterator {
       // bugs caused by mixing ordered and plain dicts (e.g., flattening
       // a dict but using a corresponding `OrderedDict` to pack it back).
       PyList_Sort(seq_);
+    } else if (IsMappingHelper(iterable)) {
+      mapping_ = iterable;
+      seq_ = MappingKeys(iterable);
+      PyList_Sort(seq_);
     } else {
       seq_ = PySequence_Fast(iterable, "");
     }
@@ -119,10 +265,15 @@ class ValIterator {
   // Return a borrowed reference to the next element from iterable.
   // Return nullptr when iteration is over.
   PyObject* next() {
+    if (TF_PREDICT_FALSE(seq_ == nullptr)) {
+      return nullptr;
+    }
     PyObject* element = nullptr;
     if (index_ < size_) {
       // Both PySequence_Fast_GET_ITEM and PyDict_GetItem return borrowed
-      // references.
+      // references. For general mappings, ValIterator keeps a reference to the
+      // last retrieved element (and decrefs it before producing the next
+      // element) to abstract away the borrowed/new difference.
       element = PySequence_Fast_GET_ITEM(seq_, index_);
       ++index_;
       if (dict_ != nullptr) {
@@ -132,85 +283,32 @@ class ValIterator {
                           "Dictionary was modified during iteration over it");
           return nullptr;
         }
+      } else if (mapping_ != nullptr) {
+        element = PyObject_GetItem(mapping_, element);
+        if (element == nullptr) {
+          PyErr_SetString(PyExc_RuntimeError,
+                          "Mapping was modified during iteration over it");
+          return nullptr;
+        }
+        last_mapping_element_.reset(element);
       }
     }
     return element;
   }
 
  private:
-  PyObject* seq_;
+  // Special casing for things that pass PyDict_Check (faster, no Python calls)
   PyObject* dict_;
+
+  // General mappings which have custom Python logic
+  PyObject* mapping_;
+  Safe_PyObjectPtr last_mapping_element_;
+
+  PyObject* seq_;
   Py_ssize_t size_;
   Py_ssize_t index_;
 };
 
-mutex g_type_to_sequence_map(LINKER_INITIALIZED);
-std::unordered_map<PyTypeObject*, bool>* IsTypeSequenceMap() {
-  static auto* const m = new std::unordered_map<PyTypeObject*, bool>;
-  return m;
-}
-
-// Returns 1 if `o` is considered a sequence for the purposes of Flatten().
-// Returns 0 otherwise.
-// Returns -1 if an error occurred.
-int IsSequenceHelper(PyObject* o) {
-  if (PyDict_Check(o)) return true;
-  if (PySet_Check(o) && !WarnedThatSetIsNotSequence) {
-    LOG(WARNING) << "Sets are not currently considered sequences, "
-                    "but this may change in the future, "
-                    "so consider avoiding using them.";
-    WarnedThatSetIsNotSequence = true;
-  }
-  if (TF_PREDICT_FALSE(CollectionsSequenceType == nullptr)) {
-    PyErr_SetString(
-        PyExc_RuntimeError,
-        tensorflow::strings::StrCat(
-            "collections.Sequence type has not been set. "
-            "Please call RegisterSequenceClass before using this module")
-            .c_str());
-    return -1;
-  }
-
-  // Try not to return to Python - see if the type has already been seen
-  // before.
-
-  auto* type_to_sequence_map = IsTypeSequenceMap();
-  auto* type = Py_TYPE(o);
-
-  {
-    mutex_lock l(g_type_to_sequence_map);
-    auto it = type_to_sequence_map->find(type);
-    if (it != type_to_sequence_map->end()) {
-      return it->second;
-    }
-  }
-
-  // NOTE: We explicitly release the g_type_to_sequence_map mutex,
-  // because PyObject_IsInstance() may release the GIL, allowing another thread
-  // concurrent entry to this function.
-  int is_instance = PyObject_IsInstance(o, CollectionsSequenceType);
-
-  // Don't cache a failed is_instance check.
-  if (is_instance == -1) return -1;
-
-  bool is_sequence = static_cast<int>(is_instance != 0 && !IsString(o));
-
-  // NOTE: This is never decref'd, but we don't want the type to get deleted
-  // as long as it is in the map. This should not be too much of a
-  // leak, as there should only be a relatively small number of types in the
-  // map, and an even smaller number that are eligible for decref. As a
-  // precaution, we limit the size of the map to 1024.
-  {
-    mutex_lock l(g_type_to_sequence_map);
-    if (type_to_sequence_map->size() < kMaxItemsInCache) {
-      Py_INCREF(type);
-      type_to_sequence_map->insert({type, is_sequence});
-    }
-  }
-
-  return is_sequence;
-}
-
 bool IsSparseTensorValueType(PyObject* o) {
   if (TF_PREDICT_FALSE(SparseTensorValueType == nullptr)) {
     return false;
@@ -226,21 +324,35 @@ int IsSequenceForDataHelper(PyObject* o) {
 
 bool GetNextValuesForDict(PyObject* nested,
                           std::vector<Safe_PyObjectPtr>* next_values) {
-  std::vector<PyObject*> result;
-
-  PyObject* keys = PyDict_Keys(nested);
-  if (PyList_Sort(keys) == -1) return false;
-  Py_ssize_t size = PyList_Size(keys);
+  Safe_PyObjectPtr keys(PyDict_Keys(nested));
+  if (PyList_Sort(keys.get()) == -1) return false;
+  Py_ssize_t size = PyList_Size(keys.get());
   for (Py_ssize_t i = 0; i < size; ++i) {
     // We know that key and item will not be deleted because nested owns
     // a reference to them and callers of flatten must not modify nested
     // while the method is running.
-    PyObject* key = PyList_GET_ITEM(keys, i);
+    PyObject* key = PyList_GET_ITEM(keys.get(), i);
     PyObject* item = PyDict_GetItem(nested, key);
     Py_INCREF(item);
     next_values->emplace_back(item);
   }
-  Py_DECREF(keys);
+  return true;
+}
+
+bool GetNextValuesForMapping(PyObject* nested,
+                             std::vector<Safe_PyObjectPtr>* next_values) {
+  Safe_PyObjectPtr keys(MappingKeys(nested));
+  if (keys.get() == nullptr) {
+    return false;
+  }
+  if (PyList_Sort(keys.get()) == -1) return false;
+  Py_ssize_t size = PyList_Size(keys.get());
+  for (Py_ssize_t i = 0; i < size; ++i) {
+    PyObject* key = PyList_GET_ITEM(keys.get(), i);
+    // Unlike PyDict_GetItem, PyObject_GetItem returns a new reference.
+    PyObject* item = PyObject_GetItem(nested, key);
+    next_values->emplace_back(item);
+  }
   return true;
 }
 
@@ -265,6 +377,9 @@ bool GetNextValues(PyObject* nested,
   if (PyDict_Check(nested)) {
     // if nested is dictionary, sort it by key and recurse on each value
     return GetNextValuesForDict(nested, next_values);
+  } else if (IsMappingHelper(nested)) {
+    // same treatment as dictionaries, but for custom mapping types
+    return GetNextValuesForMapping(nested, next_values);
   }
   // iterate and recurse
   return GetNextValuesForIterable(nested, next_values);
@@ -276,6 +391,9 @@ bool GetNextValuesForData(PyObject* nested,
   if (PyDict_Check(nested)) {
     // if nested is dictionary, sort it by key and recurse on each value
     return GetNextValuesForDict(nested, next_values);
+  } else if (IsMappingHelper(nested)) {
+    // same treatment as dictionaries, but for custom mapping types
+    return GetNextValuesForMapping(nested, next_values);
   } else if (IsSparseTensorValueType(nested)) {
     // if nested is a SparseTensorValue, just return itself as a single item
     Py_INCREF(nested);
@@ -320,16 +438,26 @@ bool FlattenHelper(
 // 'dict1' and 'dict2' are assumed to be Python dictionaries.
 void SetDifferentKeysError(PyObject* dict1, PyObject* dict2, string* error_msg,
                            bool* is_type_error) {
-  PyObject* k1 = PyDict_Keys(dict1);
-  PyObject* k2 = PyDict_Keys(dict2);
+  Safe_PyObjectPtr k1(MappingKeys(dict1));
+  if (PyErr_Occurred() || k1.get() == nullptr) {
+    *error_msg =
+        ("The two dictionaries don't have the same set of keys. Failed to "
+         "fetch keys.");
+    return;
+  }
+  Safe_PyObjectPtr k2(MappingKeys(dict2));
+  if (PyErr_Occurred() || k2.get() == nullptr) {
+    *error_msg =
+        ("The two dictionaries don't have the same set of keys. Failed to "
+         "fetch keys.");
+    return;
+  }
   *is_type_error = false;
   *error_msg = tensorflow::strings::StrCat(
       "The two dictionaries don't have the same set of keys. "
       "First structure has keys ",
-      PyObjectToString(k1), ", while second structure has keys ",
-      PyObjectToString(k2));
-  Py_DECREF(k1);
-  Py_DECREF(k2);
+      PyObjectToString(k1.get()), ", while second structure has keys ",
+      PyObjectToString(k2.get()));
 }
 
 // Returns true iff there were no "internal" errors. In other words,
@@ -396,9 +524,12 @@ bool AssertSameStructureHelper(PyObject* o1, PyObject* o2, bool check_types,
       }
     } else if (type1 != type2
                /* If both sequences are list types, don't complain. This allows
-               one to be a list subclass (e.g. _ListWrapper used for automatic
-               dependency tracking.) */
-               && !(PyList_Check(o1) && PyList_Check(o2))) {
+                  one to be a list subclass (e.g. _ListWrapper used for
+                  automatic dependency tracking.) */
+               && !(PyList_Check(o1) && PyList_Check(o2))
+               /* Two mapping types will also compare equal, making _DictWrapper
+                  and dict compare equal. */
+               && !(IsMappingHelper(o1) && IsMappingHelper(o2))) {
       *is_type_error = true;
       *error_msg = tensorflow::strings::StrCat(
           "The two namedtuples don't have the same sequence type. "
@@ -409,7 +540,7 @@ bool AssertSameStructureHelper(PyObject* o1, PyObject* o2, bool check_types,
       return true;
     }
 
-    if (PyDict_Check(o1)) {
+    if (PyDict_Check(o1) && PyDict_Check(o2)) {
       if (PyDict_Size(o1) != PyDict_Size(o2)) {
         SetDifferentKeysError(o1, o2, error_msg, is_type_error);
         return true;
@@ -423,6 +554,24 @@ bool AssertSameStructureHelper(PyObject* o1, PyObject* o2, bool check_types,
           return true;
         }
       }
+    } else if (IsMappingHelper(o1)) {
+      // Fallback for custom mapping types. Instead of using PyDict methods
+      // which stay in C, we call iter(o1).
+      if (PyMapping_Size(o1) != PyMapping_Size(o2)) {
+        SetDifferentKeysError(o1, o2, error_msg, is_type_error);
+        return true;
+      }
+
+      Safe_PyObjectPtr iter(PyObject_GetIter(o1));
+      PyObject* key;
+      while ((key = PyIter_Next(iter.get())) != nullptr) {
+        if (!PyMapping_HasKey(o2, key)) {
+          SetDifferentKeysError(o1, o2, error_msg, is_type_error);
+          Py_DECREF(key);
+          return true;
+        }
+        Py_DECREF(key);
+      }
     }
   }
 
@@ -470,6 +619,19 @@ void RegisterSequenceClass(PyObject* sequence_class) {
   CollectionsSequenceType = sequence_class;
 }
 
+void RegisterMappingClass(PyObject* mapping_class) {
+  if (!PyType_Check(mapping_class)) {
+    PyErr_SetString(
+        PyExc_TypeError,
+        tensorflow::strings::StrCat(
+            "Expecting a class definition for `collections.Mapping`. Got ",
+            Py_TYPE(mapping_class)->tp_name)
+            .c_str());
+    return;
+  }
+  CollectionsMappingType = mapping_class;
+}
+
 void RegisterSparseTensorValueClass(PyObject* sparse_tensor_value_class) {
   if (!PyType_Check(sparse_tensor_value_class)) {
     PyErr_SetString(
@@ -597,6 +759,11 @@ PyObject* AssertSameStructure(PyObject* o1, PyObject* o2, bool check_types) {
   string error_msg;
   bool is_type_error = false;
   AssertSameStructureHelper(o1, o2, check_types, &error_msg, &is_type_error);
+  if (PyErr_Occurred()) {
+    // Don't hide Python exceptions while checking (e.g. errors fetching keys
+    // from custom mappings).
+    return nullptr;
+  }
   if (!error_msg.empty()) {
     PyErr_SetString(
         is_type_error ? PyExc_TypeError : PyExc_ValueError,
diff --git a/tensorflow/python/util/util.h b/tensorflow/python/util/util.h
index 70efc10c9a..41dcc969f8 100644
--- a/tensorflow/python/util/util.h
+++ b/tensorflow/python/util/util.h
@@ -118,7 +118,9 @@ PyObject* Flatten(PyObject* nested);
 // the type from the module. This approach also requires some trigger from
 // Python so that we know that Python interpreter had been initialzied.
 void RegisterSequenceClass(PyObject* sequence_class);
-// Similar to the above function, except for the
+// Like RegisterSequenceClass, but for collections.Mapping.
+void RegisterMappingClass(PyObject* mapping_class);
+// Similar to the above functions, except for the
 // sparse_tensor.SparseTensorValue class.
 void RegisterSparseTensorValueClass(PyObject* sparse_tensor_value_class);
 
diff --git a/tensorflow/python/util/util.i b/tensorflow/python/util/util.i
index 9f3b11b982..6ad1484295 100644
--- a/tensorflow/python/util/util.i
+++ b/tensorflow/python/util/util.i
@@ -31,6 +31,9 @@ limitations under the License.
 %unignore tensorflow::swig::RegisterSequenceClass;
 %noexception tensorflow::swig::RegisterSequenceClass;
 
+%unignore tensorflow::swig::RegisterMappingClass;
+%noexception tensorflow::swig::RegisterMappingClass;
+
 %unignore tensorflow::swig::RegisterSparseTensorValueClass;
 %noexception tensorflow::swig::RegisterSparseTensorValueClass;
 
diff --git a/tensorflow/security/advisory/tfsa-2018-001.md b/tensorflow/security/advisory/tfsa-2018-001.md
index bb97543a21..1966789c84 100644
--- a/tensorflow/security/advisory/tfsa-2018-001.md
+++ b/tensorflow/security/advisory/tfsa-2018-001.md
@@ -22,7 +22,7 @@ TensorFlow 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0
 ### Mitigation
 
 We have patched the vulnerability in GitHub commit
-[49f73c55](https://github.com/tensorflow/tensorflow/commit/49f73c55d56edffebde4bca4a407ad69c1cae4333c55).
+[49f73c55](https://github.com/tensorflow/tensorflow/commit/49f73c55d56edffebde4bca4a407ad69c1cae433).
 If users are running TensorFlow in production or on untrusted data, they are
 encouraged to apply this patch.
 
diff --git a/tensorflow/security/index.md b/tensorflow/security/index.md
index ea39e17ab2..0f176151c2 100644
--- a/tensorflow/security/index.md
+++ b/tensorflow/security/index.md
@@ -4,7 +4,7 @@ We regularly publish security advisories about using TensorFlow.
 
 *Note*: In conjunction with these security advisories, we strongly encourage
 TensorFlow users to read and understand TensorFlow's security model as outlined
-in (https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md)[SECURITY.md].
+in [SECURITY.md](https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md).
 
 | Advisory Number | Type               | Versions affected | Reported by           | Additional Information      |
 |-----------------|--------------------|:-----------------:|-----------------------|-----------------------------|
diff --git a/tensorflow/stream_executor/blas.h b/tensorflow/stream_executor/blas.h
index ea87744b22..7f851e3646 100644
--- a/tensorflow/stream_executor/blas.h
+++ b/tensorflow/stream_executor/blas.h
@@ -1121,6 +1121,40 @@ class BlasSupport {
       const port::ArraySlice<DeviceMemory<std::complex<double>> *> &c, int ldc,
       int batch_count, ScratchAllocator *scratch_allocator) = 0;
 
+  // Batched gemm with strides instead of pointer arrays.
+  virtual bool DoBlasGemmStridedBatched(
+      Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+      uint64 n, uint64 k, float alpha, const DeviceMemory<Eigen::half> &a,
+      int lda, int64 stride_a, const DeviceMemory<Eigen::half> &b, int ldb,
+      int64 stride_b, float beta, DeviceMemory<Eigen::half> *c, int ldc,
+      int64 stride_c, int batch_count) = 0;
+  virtual bool DoBlasGemmStridedBatched(
+      Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+      uint64 n, uint64 k, float alpha, const DeviceMemory<float> &a, int lda,
+      int64 stride_a, const DeviceMemory<float> &b, int ldb, int64 stride_b,
+      float beta, DeviceMemory<float> *c, int ldc, int64 stride_c,
+      int batch_count) = 0;
+  virtual bool DoBlasGemmStridedBatched(
+      Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+      uint64 n, uint64 k, double alpha, const DeviceMemory<double> &a, int lda,
+      int64 stride_a, const DeviceMemory<double> &b, int ldb, int64 stride_b,
+      double beta, DeviceMemory<double> *c, int ldc, int64 stride_c,
+      int batch_count) = 0;
+  virtual bool DoBlasGemmStridedBatched(
+      Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+      uint64 n, uint64 k, std::complex<float> alpha,
+      const DeviceMemory<std::complex<float>> &a, int lda, int64 stride_a,
+      const DeviceMemory<std::complex<float>> &b, int ldb, int64 stride_b,
+      std::complex<float> beta, DeviceMemory<std::complex<float>> *c, int ldc,
+      int64 stride_c, int batch_count) = 0;
+  virtual bool DoBlasGemmStridedBatched(
+      Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+      uint64 n, uint64 k, std::complex<double> alpha,
+      const DeviceMemory<std::complex<double>> &a, int lda, int64 stride_a,
+      const DeviceMemory<std::complex<double>> &b, int ldb, int64 stride_b,
+      std::complex<double> beta, DeviceMemory<std::complex<double>> *c, int ldc,
+      int64 stride_c, int batch_count) = 0;
+
   // Computes a matrix-matrix product where one input matrix is Hermitian:
   //
   //     c <- alpha * a * b + beta * c,
@@ -1990,6 +2024,38 @@ class BlasSupport {
       int ldb, std::complex<double> beta,                                      \
       const port::ArraySlice<DeviceMemory<std::complex<double>> *> &c,         \
       int ldc, int batch_count, ScratchAllocator *scratch_allocator) override; \
+  bool DoBlasGemmStridedBatched(                                               \
+      Stream *stream, blas::Transpose transa, blas::Transpose transb,          \
+      uint64 m, uint64 n, uint64 k, float alpha,                               \
+      const DeviceMemory<Eigen::half> &a, int lda, int64 stride_a,             \
+      const DeviceMemory<Eigen::half> &b, int ldb, int64 stride_b, float beta, \
+      DeviceMemory<Eigen::half> *c, int ldc, int64 stride_c, int batch_count); \
+  bool DoBlasGemmStridedBatched(                                               \
+      Stream *stream, blas::Transpose transa, blas::Transpose transb,          \
+      uint64 m, uint64 n, uint64 k, float alpha, const DeviceMemory<float> &a, \
+      int lda, int64 stride_a, const DeviceMemory<float> &b, int ldb,          \
+      int64 stride_b, float beta, DeviceMemory<float> *c, int ldc,             \
+      int64 stride_c, int batch_count);                                        \
+  bool DoBlasGemmStridedBatched(                                               \
+      Stream *stream, blas::Transpose transa, blas::Transpose transb,          \
+      uint64 m, uint64 n, uint64 k, double alpha,                              \
+      const DeviceMemory<double> &a, int lda, int64 stride_a,                  \
+      const DeviceMemory<double> &b, int ldb, int64 stride_b, double beta,     \
+      DeviceMemory<double> *c, int ldc, int64 stride_c, int batch_count);      \
+  bool DoBlasGemmStridedBatched(                                               \
+      Stream *stream, blas::Transpose transa, blas::Transpose transb,          \
+      uint64 m, uint64 n, uint64 k, std::complex<float> alpha,                 \
+      const DeviceMemory<std::complex<float>> &a, int lda, int64 stride_a,     \
+      const DeviceMemory<std::complex<float>> &b, int ldb, int64 stride_b,     \
+      std::complex<float> beta, DeviceMemory<std::complex<float>> *c, int ldc, \
+      int64 stride_c, int batch_count);                                        \
+  bool DoBlasGemmStridedBatched(                                               \
+      Stream *stream, blas::Transpose transa, blas::Transpose transb,          \
+      uint64 m, uint64 n, uint64 k, std::complex<double> alpha,                \
+      const DeviceMemory<std::complex<double>> &a, int lda, int64 stride_a,    \
+      const DeviceMemory<std::complex<double>> &b, int ldb, int64 stride_b,    \
+      std::complex<double> beta, DeviceMemory<std::complex<double>> *c,        \
+      int ldc, int64 stride_c, int batch_count);                               \
   bool DoBlasHemm(Stream *stream, blas::Side side, blas::UpperLower uplo,      \
                   uint64 m, uint64 n, std::complex<float> alpha,               \
                   const DeviceMemory<std::complex<float>> &a, int lda,         \
diff --git a/tensorflow/stream_executor/cuda/cuda_blas.cc b/tensorflow/stream_executor/cuda/cuda_blas.cc
index 874bf0e8cb..ab7091b3f5 100644
--- a/tensorflow/stream_executor/cuda/cuda_blas.cc
+++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
@@ -279,6 +279,10 @@ STREAM_EXECUTOR_CUBLAS_WRAP(cublasSgemmEx)
 
 #if CUDA_VERSION >= 8000
 STREAM_EXECUTOR_CUBLAS_WRAP(cublasGemmEx)
+STREAM_EXECUTOR_CUBLAS_WRAP(cublasSgemmStridedBatched)
+STREAM_EXECUTOR_CUBLAS_WRAP(cublasDgemmStridedBatched)
+STREAM_EXECUTOR_CUBLAS_WRAP(cublasCgemmStridedBatched)
+STREAM_EXECUTOR_CUBLAS_WRAP(cublasZgemmStridedBatched)
 #endif
 
 #if CUDA_VERSION >= 9000
@@ -288,6 +292,7 @@ STREAM_EXECUTOR_CUBLAS_WRAP(cublasSetMathMode)
 
 #if CUDA_VERSION >= 9010
 STREAM_EXECUTOR_CUBLAS_WRAP(cublasGemmBatchedEx)
+STREAM_EXECUTOR_CUBLAS_WRAP(cublasGemmStridedBatchedEx)
 #endif
 
 }  // namespace wrap
@@ -643,7 +648,7 @@ bool CUDABlas::DoBlasInternalImpl(FuncT cublas_func, Stream *stream,
   }
 #endif
   cublasStatus_t ret = cublas_func(parent_, blas_, args...);
-  if (err_on_failure && ret != CUBLAS_STATUS_SUCCESS) {
+  if ((err_on_failure || VLOG_IS_ON(3)) && ret != CUBLAS_STATUS_SUCCESS) {
     LOG(ERROR) << "failed to run cuBLAS routine " << cublas_func.kName << ": "
                << ToString(ret);
   }
@@ -1865,7 +1870,7 @@ bool CUDABlas::DoBlasGemm(
   stream->parent()->GetDeviceDescription().cuda_compute_capability(&cc_major,
                                                                    &cc_minor);
 
-  // GPUs < sm_70 don't support Volta hardware.
+  // GPUs < sm_70 don't support tensor ops.
   if (cc_major >= 7 && TensorOpMathEnabled()) {
     use_tensor_ops = true;
   }
@@ -2139,6 +2144,10 @@ static bool UsesTensorOps(blas::AlgorithmType algo) {
 template <typename InType>
 static bool TensorOpsAvailable(int cc_major) {
 #if CUDA_VERSION >= 9000
+  // cublas *does* allow tensor ops on inputs that are not fp16, so this is not
+  // strictly correct.  We can't simply enable it, though, as that would change
+  // clients' behavior significantly: Using tensor ops on fp32 inputs cause them
+  // to be rounded to fp16.
   if (cc_major >= 7 && TensorOpMathEnabled() &&
       std::is_same<InType, Eigen::half>::value) {
     return true;
@@ -2160,16 +2169,30 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
   if (stream->parent()->GetDeviceDescription().cuda_compute_capability(
           &cc_major, &cc_minor) &&
       cc_major < 5) {
+    VLOG(2) << "DoBlasGemmWithAlgorithm returning false because sm" << cc_major
+            << cc_minor << " devices don't support explicit gemm algorithms.";
     return false;
   }
 
   if (UsesTensorOps(algorithm) && !TensorOpsAvailable<InT>(cc_major)) {
+    if (std::is_same<InT, Eigen::half>::value) {
+      VLOG(2) << "DoBlasGemmWithAlgorithm returning false because algorithm "
+              << algorithm
+              << " uses tensor ops, but tensor ops are not available in sm"
+              << cc_major << "X devices.";
+    } else {
+      VLOG(2) << "DoBlasGemmWithAlgorithm returning false because algorithm "
+              << algorithm
+              << " uses tensor ops, but the input data type is not fp16.";
+    }
     return false;
   }
 
   // Either both 'alpha' and 'beta' need to be pointers to device memory, or
   // they need to be both host scalars.
   if (alpha.is_pointer() != beta.is_pointer()) {
+    VLOG(2) << "DoBlasGemmWithAlgorithm returning false because one of `alpha` "
+               "and `beta` is a pointer, but the other is not.";
     return false;
   }
 
@@ -2177,6 +2200,9 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
   if (output_profile_result != nullptr) {
     timer.reset(new CUDATimer(parent_));
     if (!timer->Init() || !timer->Start(AsCUDAStream(stream))) {
+      VLOG(2) << "DoBlasGemmWithAlgorithm returning false because "
+                 "output_profile_result was given, but we were unable to "
+                 "create a CUDATimer.";
       return false;
     }
   }
@@ -2186,6 +2212,8 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
 #if CUDA_VERSION >= 9000 && CUDA_VERSION < 9020
   if ((algorithm == CUBLAS_GEMM_DEFAULT || algorithm >= CUBLAS_GEMM_ALGO13) &&
       std::max({m, n, k}) >= 2097153 && cc_major < 7) {
+    VLOG(2) << "DoBlasGemmWithAlgorithm returning false to work around cudnn "
+               "<9.2 bug with m, n, or k >= 2097153.  See b/79126339.";
     return false;
   }
 #endif
@@ -2211,6 +2239,8 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
     // CUDATimer will CHECK-fail if we Stop() it while the stream is in an error
     // state.
     if (!timer->Stop(AsCUDAStream(stream))) {
+      VLOG(2) << "DoBlasGemmWithAlgorithm returning false; unable to stop "
+                 "CUDATimer.";
       return false;
     }
     output_profile_result->set_is_valid(true);
@@ -2223,26 +2253,60 @@ bool CUDABlas::DoBlasGemmWithAlgorithmImpl(
 
 bool CUDABlas::GetBlasGemmAlgorithms(
     std::vector<blas::AlgorithmType> *out_algorithms) {
-// cublasGemmAlgo_t (and the function that accepts this type, cublasGemmEx)
-// were first introduced in CUDA 8.
-// Note that when CUDA version and compute capability is not sufficient, we
-// still return the out_algorithms. Caller needs to make sure that in this case,
-// the returned vector is empty.
-  for (cublasGemmAlgo_t algo : {
-         CUBLAS_GEMM_DFALT, CUBLAS_GEMM_ALGO0, CUBLAS_GEMM_ALGO1,
-             CUBLAS_GEMM_ALGO2, CUBLAS_GEMM_ALGO3, CUBLAS_GEMM_ALGO4,
-             CUBLAS_GEMM_ALGO5, CUBLAS_GEMM_ALGO6, CUBLAS_GEMM_ALGO7,
+  // cublasGemmAlgo_t (and the function that accepts this type, cublasGemmEx)
+  // were first introduced in CUDA 8.
+  //
+  // Note that when CUDA version and compute capability is not sufficient, we
+  // still return the out_algorithms. Caller needs to make sure that in this
+  // case, the returned vector is empty.
+  *out_algorithms = {
+    CUBLAS_GEMM_DFALT,
+    CUBLAS_GEMM_ALGO0,
+    CUBLAS_GEMM_ALGO1,
+    CUBLAS_GEMM_ALGO2,
+    CUBLAS_GEMM_ALGO3,
+    CUBLAS_GEMM_ALGO4,
+    CUBLAS_GEMM_ALGO5,
+    CUBLAS_GEMM_ALGO6,
+    CUBLAS_GEMM_ALGO7,
 #if CUDA_VERSION >= 9000
-             CUBLAS_GEMM_ALGO8, CUBLAS_GEMM_ALGO9, CUBLAS_GEMM_ALGO10,
-             CUBLAS_GEMM_ALGO11, CUBLAS_GEMM_ALGO12, CUBLAS_GEMM_ALGO13,
-             CUBLAS_GEMM_ALGO14, CUBLAS_GEMM_ALGO15, CUBLAS_GEMM_ALGO16,
-             CUBLAS_GEMM_ALGO17, CUBLAS_GEMM_DFALT_TENSOR_OP,
-             CUBLAS_GEMM_ALGO0_TENSOR_OP, CUBLAS_GEMM_ALGO1_TENSOR_OP,
-             CUBLAS_GEMM_ALGO2_TENSOR_OP
+    CUBLAS_GEMM_ALGO8,
+    CUBLAS_GEMM_ALGO9,
+    CUBLAS_GEMM_ALGO10,
+    CUBLAS_GEMM_ALGO11,
+    CUBLAS_GEMM_ALGO12,
+    CUBLAS_GEMM_ALGO13,
+    CUBLAS_GEMM_ALGO14,
+    CUBLAS_GEMM_ALGO15,
+    CUBLAS_GEMM_ALGO16,
+    CUBLAS_GEMM_ALGO17,
+    CUBLAS_GEMM_DFALT_TENSOR_OP,
+    CUBLAS_GEMM_ALGO0_TENSOR_OP,
+    CUBLAS_GEMM_ALGO1_TENSOR_OP,
+    CUBLAS_GEMM_ALGO2_TENSOR_OP,
+    CUBLAS_GEMM_ALGO3_TENSOR_OP,
+    CUBLAS_GEMM_ALGO4_TENSOR_OP,
 #endif
-       }) {
-    out_algorithms->push_back(algo);
-  }
+#if CUDA_VERSION >= 9200
+    CUBLAS_GEMM_ALGO18,
+    CUBLAS_GEMM_ALGO19,
+    CUBLAS_GEMM_ALGO20,
+    CUBLAS_GEMM_ALGO21,
+    CUBLAS_GEMM_ALGO22,
+    CUBLAS_GEMM_ALGO23,
+    CUBLAS_GEMM_ALGO5_TENSOR_OP,
+    CUBLAS_GEMM_ALGO6_TENSOR_OP,
+    CUBLAS_GEMM_ALGO7_TENSOR_OP,
+    CUBLAS_GEMM_ALGO8_TENSOR_OP,
+    CUBLAS_GEMM_ALGO9_TENSOR_OP,
+    CUBLAS_GEMM_ALGO10_TENSOR_OP,
+    CUBLAS_GEMM_ALGO11_TENSOR_OP,
+    CUBLAS_GEMM_ALGO12_TENSOR_OP,
+    CUBLAS_GEMM_ALGO13_TENSOR_OP,
+    CUBLAS_GEMM_ALGO14_TENSOR_OP,
+    CUBLAS_GEMM_ALGO15_TENSOR_OP,
+#endif
+  };
   return true;
 }
 
@@ -2564,6 +2628,119 @@ bool CUDABlas::DoBlasGemmBatched(
   return status.ok();
 }
 
+bool CUDABlas::DoBlasGemmStridedBatched(
+    Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+    uint64 n, uint64 k, float alpha, const DeviceMemory<Eigen::half> &a,
+    int lda, int64 stride_a, const DeviceMemory<Eigen::half> &b, int ldb,
+    int64 stride_b, float beta, DeviceMemory<Eigen::half> *c, int ldc,
+    int64 stride_c, int batch_count) {
+  bool use_tensor_ops = false;
+#if CUDA_VERSION >= 9000
+  int cc_major, cc_minor;
+  if (stream->parent()->GetDeviceDescription().cuda_compute_capability(
+          &cc_major, &cc_minor)) {
+    // GPUs < sm_70 don't support tensor ops.
+    if (cc_major >= 7 && TensorOpMathEnabled()) {
+      use_tensor_ops = true;
+    }
+#if CUDA_VERSION >= 9010
+    if (cc_major >= 5) {
+      cublasGemmAlgo_t algo =
+          (use_tensor_ops ? CUBLAS_GEMM_DFALT_TENSOR_OP : CUBLAS_GEMM_DFALT);
+      bool ok = DoBlasInternalImpl(
+          wrap::cublasGemmStridedBatchedEx, stream,
+          true /* = pointer_mode_host */, true /* = err_on_failure */,
+          use_tensor_ops, CUDABlasTranspose(transa), CUDABlasTranspose(transb),
+          m, n, k, &alpha, CUDAMemory(a), CUDA_R_16F, lda, stride_a,
+          CUDAMemory(b), CUDA_R_16F, ldb, stride_b, &beta, CUDAMemoryMutable(c),
+          CUDA_R_16F, ldc, stride_c, batch_count, CUDA_R_32F, algo);
+      if (ok) {
+        return true;
+      }
+      LOG(ERROR) << "failed BLAS call, see log for details";
+      return false;
+    }
+#endif
+  }
+#endif
+  // Either CUDA_VERSION < 9.1 or SM < 5.0. Fall back to a loop.
+  for (int batch = 0; batch < batch_count; ++batch) {
+    const auto *a_matrix =
+        reinterpret_cast<const __half *>(CUDAMemory(a) + batch * stride_a);
+    const auto *b_matrix =
+        reinterpret_cast<const __half *>(CUDAMemory(b) + batch * stride_b);
+    auto *c_matrix =
+        reinterpret_cast<__half *>(CUDAMemoryMutable(c) + batch * stride_c);
+    bool ok = DoBlasInternalImpl(
+        wrap::cublasSgemmEx, stream, true /* = pointer_mode_host */,
+        true /* = err_on_failure= */, use_tensor_ops, CUDABlasTranspose(transa),
+        CUDABlasTranspose(transb), m, n, k, &alpha, a_matrix, SE_CUDA_DATA_HALF,
+        lda, b_matrix, SE_CUDA_DATA_HALF, ldb, &beta, c_matrix,
+        SE_CUDA_DATA_HALF, ldc);
+    if (!ok) {
+      LOG(ERROR) << "failed BLAS call, see log for details";
+      return false;
+    }
+  }
+  return true;
+}
+
+bool CUDABlas::DoBlasGemmStridedBatched(
+    Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+    uint64 n, uint64 k, float alpha, const DeviceMemory<float> &a, int lda,
+    int64 stride_a, const DeviceMemory<float> &b, int ldb, int64 stride_b,
+    float beta, DeviceMemory<float> *c, int ldc, int64 stride_c,
+    int batch_count) {
+  return DoBlasInternal(
+      wrap::cublasSgemmStridedBatched, stream, true /* = pointer_mode_host */,
+      CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
+      CUDAMemory(a), lda, stride_a, CUDAMemory(b), ldb, stride_b, &beta,
+      CUDAMemoryMutable(c), ldc, stride_c, batch_count);
+}
+
+bool CUDABlas::DoBlasGemmStridedBatched(
+    Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+    uint64 n, uint64 k, double alpha, const DeviceMemory<double> &a, int lda,
+    int64 stride_a, const DeviceMemory<double> &b, int ldb, int64 stride_b,
+    double beta, DeviceMemory<double> *c, int ldc, int64 stride_c,
+    int batch_count) {
+  return DoBlasInternal(
+      wrap::cublasDgemmStridedBatched, stream, true /* = pointer_mode_host */,
+      CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
+      CUDAMemory(a), lda, stride_a, CUDAMemory(b), ldb, stride_b, &beta,
+      CUDAMemoryMutable(c), ldc, stride_c, batch_count);
+}
+
+bool CUDABlas::DoBlasGemmStridedBatched(
+    Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+    uint64 n, uint64 k, std::complex<float> alpha,
+    const DeviceMemory<std::complex<float>> &a, int lda, int64 stride_a,
+    const DeviceMemory<std::complex<float>> &b, int ldb, int64 stride_b,
+    std::complex<float> beta, DeviceMemory<std::complex<float>> *c, int ldc,
+    int64 stride_c, int batch_count) {
+  return DoBlasInternal(
+      wrap::cublasCgemmStridedBatched, stream, true /* = pointer_mode_host */,
+      CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k,
+      CUDAComplex(&alpha), CUDAComplex(CUDAMemory(a)), lda, stride_a,
+      CUDAComplex(CUDAMemory(b)), ldb, stride_b, CUDAComplex(&beta),
+      CUDAComplex(CUDAMemoryMutable(c)), ldc, stride_c, batch_count);
+}
+
+bool CUDABlas::DoBlasGemmStridedBatched(
+    Stream *stream, blas::Transpose transa, blas::Transpose transb, uint64 m,
+    uint64 n, uint64 k, std::complex<double> alpha,
+    const DeviceMemory<std::complex<double>> &a, int lda, int64 stride_a,
+    const DeviceMemory<std::complex<double>> &b, int ldb, int64 stride_b,
+    std::complex<double> beta, DeviceMemory<std::complex<double>> *c, int ldc,
+    int64 stride_c, int batch_count) {
+  return DoBlasInternal(
+      wrap::cublasZgemmStridedBatched, stream, true /* = pointer_mode_host */,
+      CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k,
+      CUDAComplex(&alpha), CUDAComplex(CUDAMemory(a)), lda, stride_a,
+      CUDAComplex(CUDAMemory(b)), ldb, stride_b, CUDAComplex(&beta),
+      CUDAComplex(CUDAMemoryMutable(c)), ldc, stride_c, batch_count);
+}
+
 bool CUDABlas::DoBlasHemm(Stream *stream, blas::Side side,
                           blas::UpperLower uplo, uint64 m, uint64 n,
                           std::complex<float> alpha,
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.cc b/tensorflow/stream_executor/cuda/cuda_dnn.cc
index 84916385a8..725f6aeaa4 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.cc
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.cc
@@ -322,6 +322,7 @@ port::Status GetLoadedCudnnVersion(CudnnVersion* version) {
 CudnnSupport::CudnnSupport(CUDAExecutor* parent) : parent_(parent) {}
 
 port::Status CudnnSupport::Init() {
+  ScopedActivateExecutorContext context(parent_);
   cudnnHandle_t cudnn_handle = nullptr;
   auto status = cudnnCreate(&cudnn_handle);
   if (status == CUDNN_STATUS_SUCCESS) {
@@ -791,6 +792,11 @@ class CudnnActivationDescriptor {
     double relu_ceiling = 0.0;
     cudnnActivationMode_t mode;
     switch (activation_mode) {
+#if CUDNN_VERSION >= 7100
+      case dnn::ActivationMode::kNone:
+        mode = CUDNN_ACTIVATION_IDENTITY;
+        break;
+#endif
       case dnn::ActivationMode::kRelu6:
         relu_ceiling = 6.0;
         mode = CUDNN_ACTIVATION_CLIPPED_RELU;
@@ -2480,10 +2486,11 @@ port::Status CudnnSupport::DoFusedConvolveImpl(
     DeviceMemory<Type>* output_data, ScratchAllocator* scratch_allocator,
     const dnn::AlgorithmConfig& algorithm_config,
     dnn::ProfileResult* output_profile_result) {
-  if (activation_mode != dnn::ActivationMode::kRelu) {
+  if (activation_mode != dnn::ActivationMode::kRelu &&
+      activation_mode != dnn::ActivationMode::kNone) {
     return port::Status(port::error::INVALID_ARGUMENT,
                         "cudnnConvolutionBiasActivationForward() only supports "
-                        "Relu activation.");
+                        "Relu or None activation.");
   }
 
   CudnnTensorDescriptor conv_input_nd(
@@ -3075,8 +3082,7 @@ port::Status CudnnSupport::DoConvolveBackwardDataImpl(
   }
 
   // Cudnn 7.1.4 has a bug if the workspace of the following convolution is not
-  // zero-initialized.
-  // TODO(timshen): Add an nvbugs/ link.
+  // zero-initialized, nvbugs/2254619.
   if (CUDNN_VERSION >= 7000 &&
       algorithm_config.algorithm().algo_id() ==
           CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 &&
@@ -3603,7 +3609,7 @@ bool CudnnSupport::DoPoolForward(
     const dnn::BatchDescriptor& input_dimensions,
     const DeviceMemory<double>& input_data,
     const dnn::BatchDescriptor& output_dimensions,
-    DeviceMemory<double>* output_data) {
+    DeviceMemory<double>* output_data, ScratchAllocator* workspace_allocator) {
   // Alpha is the scaling factor for input.
   double alpha = 1.0;
   // Beta is the scaling factor for output.
@@ -3628,7 +3634,7 @@ bool CudnnSupport::DoPoolForward(
     const dnn::BatchDescriptor& input_dimensions,
     const DeviceMemory<float>& input_data,
     const dnn::BatchDescriptor& output_dimensions,
-    DeviceMemory<float>* output_data) {
+    DeviceMemory<float>* output_data, ScratchAllocator* workspace_allocator) {
   // Alpha is the scaling factor for input.
   float alpha = 1.0;
   // Beta is the scaling factor for output.
@@ -3653,7 +3659,8 @@ bool CudnnSupport::DoPoolForward(
     const dnn::BatchDescriptor& input_dimensions,
     const DeviceMemory<Eigen::half>& input_data,
     const dnn::BatchDescriptor& output_dimensions,
-    DeviceMemory<Eigen::half>* output_data) {
+    DeviceMemory<Eigen::half>* output_data,
+    ScratchAllocator* workspace_allocator) {
   // Alpha is the scaling factor for input.
   float alpha = 1.0;
   // Beta is the scaling factor for output.
@@ -3679,7 +3686,8 @@ bool CudnnSupport::DoPoolBackward(
     const dnn::BatchDescriptor& output_dimensions,
     const DeviceMemory<double>& output_data,
     const DeviceMemory<double>& input_diff_data,
-    DeviceMemory<double>* output_diff_data) {
+    DeviceMemory<double>* output_diff_data,
+    ScratchAllocator* workspace_allocator) {
   // Alpha is the scaling factor for input.
   double alpha = 1.0;
   // Beta is the scaling factor for output.
@@ -3708,7 +3716,8 @@ bool CudnnSupport::DoPoolBackward(
     const dnn::BatchDescriptor& output_dimensions,
     const DeviceMemory<float>& output_data,
     const DeviceMemory<float>& input_diff_data,
-    DeviceMemory<float>* output_diff_data) {
+    DeviceMemory<float>* output_diff_data,
+    ScratchAllocator* workspace_allocator) {
   // Alpha is the scaling factor for input.
   float alpha = 1.0;
   // Beta is the scaling factor for output.
@@ -3737,7 +3746,8 @@ bool CudnnSupport::DoPoolBackward(
     const dnn::BatchDescriptor& output_dimensions,
     const DeviceMemory<Eigen::half>& output_data,
     const DeviceMemory<Eigen::half>& input_diff_data,
-    DeviceMemory<Eigen::half>* output_diff_data) {
+    DeviceMemory<Eigen::half>* output_diff_data,
+    ScratchAllocator* workspace_allocator) {
   // Alpha is the scaling factor for input.
   float alpha = 1.0;
   // Beta is the scaling factor for output.
@@ -3806,7 +3816,8 @@ bool CudnnSupport::DoNormalizeBackwardWithDimensions(
     const dnn::BatchDescriptor& dimensions, const DeviceMemory<float>& raw_data,
     const DeviceMemory<float>& normalized_data,
     const DeviceMemory<float>& normalized_variable_gradient,
-    DeviceMemory<float>* raw_variable_gradient) {
+    DeviceMemory<float>* raw_variable_gradient,
+    ScratchAllocator* workspace_allocator) {
   // Check for unsupported modes.
   if (normalize_descriptor.wrap_around()) {
     LOG(ERROR) << "CUDA LRN does not support cudnn-around mode";
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.h b/tensorflow/stream_executor/cuda/cuda_dnn.h
index c924d41cb5..9d88f971bb 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.h
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.h
@@ -515,21 +515,24 @@ class CudnnSupport : public dnn::DnnSupport {
                      const dnn::BatchDescriptor& input_dimensions,
                      const DeviceMemory<double>& input_data,
                      const dnn::BatchDescriptor& output_dimensions,
-                     DeviceMemory<double>* output_data) override;
+                     DeviceMemory<double>* output_data,
+                     ScratchAllocator* workspace_allocator) override;
 
   bool DoPoolForward(Stream* stream,
                      const dnn::PoolingDescriptor& pooling_dimensions,
                      const dnn::BatchDescriptor& input_dimensions,
                      const DeviceMemory<float>& input_data,
                      const dnn::BatchDescriptor& output_dimensions,
-                     DeviceMemory<float>* output_data) override;
+                     DeviceMemory<float>* output_data,
+                     ScratchAllocator* workspace_allocator) override;
 
   bool DoPoolForward(Stream* stream,
                      const dnn::PoolingDescriptor& pooling_dimensions,
                      const dnn::BatchDescriptor& input_dimensions,
                      const DeviceMemory<Eigen::half>& input_data,
                      const dnn::BatchDescriptor& output_dimensions,
-                     DeviceMemory<Eigen::half>* output_data) override;
+                     DeviceMemory<Eigen::half>* output_data,
+                     ScratchAllocator* workspace_allocator) override;
 
   bool DoPoolBackward(Stream* stream,
                       const dnn::PoolingDescriptor& pooling_dimensions,
@@ -538,7 +541,8 @@ class CudnnSupport : public dnn::DnnSupport {
                       const dnn::BatchDescriptor& output_dimensions,
                       const DeviceMemory<double>& output_data,
                       const DeviceMemory<double>& input_diff_data,
-                      DeviceMemory<double>* output_diff_data) override;
+                      DeviceMemory<double>* output_diff_data,
+                      ScratchAllocator* workspace_allocator) override;
 
   bool DoPoolBackward(Stream* stream,
                       const dnn::PoolingDescriptor& pooling_dimensions,
@@ -547,7 +551,8 @@ class CudnnSupport : public dnn::DnnSupport {
                       const dnn::BatchDescriptor& output_dimensions,
                       const DeviceMemory<float>& output_data,
                       const DeviceMemory<float>& input_diff_data,
-                      DeviceMemory<float>* output_diff_data) override;
+                      DeviceMemory<float>* output_diff_data,
+                      ScratchAllocator* workspace_allocator) override;
 
   bool DoPoolBackward(Stream* stream,
                       const dnn::PoolingDescriptor& pooling_dimensions,
@@ -556,7 +561,8 @@ class CudnnSupport : public dnn::DnnSupport {
                       const dnn::BatchDescriptor& output_dimensions,
                       const DeviceMemory<Eigen::half>& output_data,
                       const DeviceMemory<Eigen::half>& input_diff_data,
-                      DeviceMemory<Eigen::half>* output_diff_data) override;
+                      DeviceMemory<Eigen::half>* output_diff_data,
+                      ScratchAllocator* workspace_allocator) override;
 
   bool DoNormalize(Stream* stream,
                    const dnn::NormalizeDescriptor& normalize_descriptor,
@@ -575,7 +581,8 @@ class CudnnSupport : public dnn::DnnSupport {
       const DeviceMemory<float>& raw_data,
       const DeviceMemory<float>& normalized_data,
       const DeviceMemory<float>& normalized_variable_gradient,
-      DeviceMemory<float>* raw_variable_gradient) override;
+      DeviceMemory<float>* raw_variable_gradient,
+      ScratchAllocator* workspace_allocator) override;
 
   bool DoDepthConcatenate(
       Stream* stream, port::ArraySlice<dnn::BatchDescriptor> input_dimensions,
diff --git a/tensorflow/stream_executor/cuda/cuda_driver.cc b/tensorflow/stream_executor/cuda/cuda_driver.cc
index d508f6594a..f982f34b98 100644
--- a/tensorflow/stream_executor/cuda/cuda_driver.cc
+++ b/tensorflow/stream_executor/cuda/cuda_driver.cc
@@ -28,6 +28,7 @@ limitations under the License.
 #include "tensorflow/stream_executor/lib/human_readable.h"
 #include "tensorflow/stream_executor/lib/inlined_vector.h"
 #include "tensorflow/stream_executor/lib/notification.h"
+#include "tensorflow/stream_executor/lib/ptr_util.h"
 #include "tensorflow/stream_executor/lib/stacktrace.h"
 #include "tensorflow/stream_executor/lib/static_threadlocal.h"
 #include "tensorflow/stream_executor/lib/strcat.h"
@@ -66,14 +67,17 @@ class CreatedContexts {
     return Live()->find(context) != Live()->end();
   }
 
-  // Adds context to the live set.
+  // Adds context to the live set, or returns it if it's already present.
   static CudaContext* Add(CUcontext context) {
     CHECK(context != nullptr);
     mutex_lock lock(mu_);
-    auto cuda_context = new CudaContext(context, next_id_++);
-    Live()->insert(
-        std::make_pair(context, std::unique_ptr<CudaContext>(cuda_context)));
-    return cuda_context;
+    auto insert_result = Live()->insert(std::make_pair(context, nullptr));
+    auto it = insert_result.first;
+    if (insert_result.second) {
+      // context was not present in the map.  Add it.
+      it->second = MakeUnique<CudaContext>(context, next_id_++);
+    }
+    return it->second.get();
   }
 
   // Removes context from the live set.
@@ -102,117 +106,16 @@ class CreatedContexts {
 /* static */ int64 CreatedContexts::next_id_ = 1;  // 0 means "no context"
 
 // Formats CUresult to output prettified values into a log stream.
-// Error summaries taken from:
-// http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gc6c391505e117393cc2558fff6bfc2e9
-//
-// TODO(leary) switch to cuGetErrorName when updated cuda.h is available.
 string ToString(CUresult result) {
-#define OSTREAM_CUDA_ERROR(__name) \
-  case CUDA_ERROR_##__name:        \
-    return "CUDA_ERROR_" #__name;
-
-///////////////
-// NOTE: here we specify return code values outside of the enum explicitly
-// because our in-tree cuda.h is from the CUDA 5.5 SDK, but CUDA 6.0+ driver
-// libraries are deployed in the fleet these error codes are backwards
-// compatible, but if we see a "new" one, we want to be able to identify it in
-// the logs.
-//
-// Once we get a cuda.h that has cuGetErrorName (TODO is above) we can
-// eliminate this function and just rely on the driver to provide us these
-// strings.
-//
-// NOTE: "Must reboot all context" below is shorthand for, "must
-// destroy/recreate the offending context and any allocation which come from
-// it if you are to continue using CUDA."
-#pragma GCC diagnostic push
-#pragma GCC diagnostic ignored "-Wswitch"
-  switch (result) {
-    OSTREAM_CUDA_ERROR(INVALID_VALUE)
-    OSTREAM_CUDA_ERROR(OUT_OF_MEMORY)
-    OSTREAM_CUDA_ERROR(NOT_INITIALIZED)
-    OSTREAM_CUDA_ERROR(DEINITIALIZED)
-    OSTREAM_CUDA_ERROR(NO_DEVICE)
-    OSTREAM_CUDA_ERROR(INVALID_DEVICE)
-    OSTREAM_CUDA_ERROR(INVALID_IMAGE)
-    OSTREAM_CUDA_ERROR(INVALID_CONTEXT)
-    OSTREAM_CUDA_ERROR(INVALID_HANDLE)
-    OSTREAM_CUDA_ERROR(NOT_FOUND)
-    OSTREAM_CUDA_ERROR(NOT_READY)
-    OSTREAM_CUDA_ERROR(NO_BINARY_FOR_GPU)
-
-    // Encountered an uncorrectable ECC error during execution.
-    OSTREAM_CUDA_ERROR(ECC_UNCORRECTABLE)
-
-    // Load/store on an invalid address. Must reboot all context.
-    case 700:
-      return "CUDA_ERROR_ILLEGAL_ADDRESS";
-    // Passed too many / wrong arguments, too many threads for register count.
-    case 701:
-      return "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES";
-    // Kernel took too long to execute.
-    case 702:
-      return "CUDA_ERROR_LAUNCH_TIMEOUT";
-    // Kernel launch uses an incompatible texturing mode.
-    case 703:
-      return "CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING";
-    // Trying to re-enable peer access that already has it enabled.
-    case 704:
-      return "CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED";
-    // Trying to disable peer access that has not yet been enabled.
-    case 705:
-      return "CUDA_ERROR_PEER_ACCESS_NOT_ENABLED";
-    // Primary context for the specified device has already been initialized.
-    case 708:
-      return "CUDA_ERROR_PRIMARY_CONTEXT_ACTIVE";
-    // Context current to calling thread has been destroyed or is a primary
-    // context that has not yet been initialized.
-    case 709:
-      return "CUDA_ERROR_CONTEXT_IS_DESTROYED";
-    // Device-side assert triggered during kernel execution. Must reboot all
-    // context.
-    case 710:
-      return "CUDA_ERROR_ASSERT";
-    // Hardware resources to enable peer access have been exhausted.
-    case 711:
-      return "CUDA_ERROR_TOO_MANY_PEERS";
-    // Memory range has already been registered.
-    case 712:
-      return "CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED";
-    // Pointer does not correspond to any currently registered memory region.
-    case 713:
-      return "CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED";
-    // Due to stack corruption or exceeding stack size limit. Must reboot all
-    // context.
-    case 714:
-      return "CUDA_ERROR_HARDWARE_STACK_ERROR";
-    case 715:
-      return "CUDA_ERROR_ILLEGAL_INSTRUCTION";
-    // Load/store on an unaligned memory address. Must reboot all context.
-    case 716:
-      return "CUDA_ERROR_MISALIGNED_ADDRESS";
-    // Device instruction with specific address space given address not
-    // belonging to allowed address space. Must reboot all context.
-    case 717:
-      return "CUDA_ERROR_INVALID_ADDRESS_SPACE";
-    // Device program counter wrapped its address space. Must reboot all
-    // context.
-    case 718:
-      return "CUDA_ERROR_INVALID_PC";
-    // Exception on device while executing a kernel; e.g. deref invalid device
-    // pointer, accessing OOB shared memory. Must reboot all context.
-    case 719:
-      return "CUDA_ERROR_LAUNCH_FAILED";
-
-      OSTREAM_CUDA_ERROR(CONTEXT_ALREADY_IN_USE)
-      OSTREAM_CUDA_ERROR(PEER_ACCESS_UNSUPPORTED)
-      OSTREAM_CUDA_ERROR(NOT_PERMITTED)
-      OSTREAM_CUDA_ERROR(NOT_SUPPORTED)
-      OSTREAM_CUDA_ERROR(UNKNOWN)  // Unknown internal error to CUDA.
-    default:
-      return port::StrCat("CUresult(", static_cast<int>(result), ")");
+  const char *error_name;
+  if (cuGetErrorName(result, &error_name)) {
+    return port::StrCat("UNKNOWN ERROR (", static_cast<int>(result), ")");
+  }
+  const char *error_string;
+  if (cuGetErrorString(result, &error_string)) {
+    return error_name;
   }
-#pragma GCC diagnostic pop
+  return port::StrCat(error_name, ": ", error_string);
 }
 
 // Returns the current context and checks that it is in the set of CUDA contexts
@@ -528,7 +431,7 @@ bool DeviceOptionsToContextFlags(const DeviceOptions &device_options,
     *context = CreatedContexts::Add(new_context);
     CHECK(*context != nullptr)
         << "success in this call must entail non-null result";
-    VLOG(2) << "created context " << context << " for this thread";
+    VLOG(2) << "created or reused context " << context << " for this thread";
     return port::Status::OK();
   }
 
diff --git a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
index f11022ef1d..73f05b94db 100644
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -206,6 +206,48 @@ static string GetBinaryDir(bool strip_exe) {
   return exe_path;
 }
 
+bool CUDAExecutor::LoadModuleFromCuBin(const char *cubin, CUmodule *module) {
+  uint64_t module_refcount;
+  std::tie(*module, module_refcount) = gpu_binary_to_module_[cubin];
+
+  if (*module == nullptr) {
+    auto load_status = CUDADriver::LoadCubin(context_, cubin, module);
+    if (!load_status.ok()) {
+      LOG(ERROR) << "failed to load CUBIN: " << load_status;
+      return false;
+    }
+    module_refcount = 1;
+    VLOG(3) << "Loaded CUBIN " << static_cast<const void *>(cubin)
+            << " as module " << *module;
+  } else {
+    ++module_refcount;
+    VLOG(3) << "CUBIN " << static_cast<const void *>(cubin)
+            << " is already loaded as module " << *module;
+  }
+  gpu_binary_to_module_[cubin] = {*module, module_refcount};
+  return true;
+}
+
+bool CUDAExecutor::LoadModuleFromPtx(const char *ptx, CUmodule *module) {
+  uint64_t module_refcount;
+  std::tie(*module, module_refcount) = gpu_binary_to_module_[ptx];
+
+  if (*module == nullptr) {
+    if (!CUDADriver::LoadPtx(context_, ptx, module)) {
+      return false;
+    }
+    VLOG(3) << "Loaded PTX " << static_cast<const void *>(ptx) << " as module "
+            << *module;
+    module_refcount = 1;
+  } else {
+    ++module_refcount;
+    VLOG(3) << "PTX " << static_cast<const void *>(ptx)
+            << " is already loaded as module " << module;
+  }
+  gpu_binary_to_module_[ptx] = {*module, module_refcount};
+  return true;
+}
+
 bool CUDAExecutor::GetKernel(const MultiKernelLoaderSpec &spec,
                              KernelBase *kernel) {
   CUDAKernel *cuda_kernel = AsCUDAKernel(kernel);
@@ -215,28 +257,13 @@ bool CUDAExecutor::GetKernel(const MultiKernelLoaderSpec &spec,
   VLOG(3) << "GetKernel on kernel " << kernel << " : " << kernel->name();
 
   if (spec.has_cuda_cubin_in_memory()) {
+    mutex_lock lock{in_memory_modules_mu_};
     kernelname = &spec.cuda_cubin_in_memory().kernelname();
     const char *cubin = spec.cuda_cubin_in_memory().bytes();
-    mutex_lock lock{in_memory_modules_mu_};
-    uint64_t module_refcount;
-    std::tie(module, module_refcount) = gpu_binary_to_module_[cubin];
-
-    if (module == nullptr) {
-      auto load_status = CUDADriver::LoadCubin(context_, cubin, &module);
-      if (!load_status.ok()) {
-        LOG(ERROR) << "failed to load CUBIN: " << load_status;
-        return false;
-      }
-      module_refcount = 1;
-      VLOG(3) << "Loaded CUBIN " << static_cast<const void *>(cubin)
-              << " as module " << module;
-    } else {
-      ++module_refcount;
-      VLOG(3) << "CUBIN " << static_cast<const void *>(cubin)
-              << " is already loaded as module " << module;
+    if (!LoadModuleFromCuBin(cubin, &module)) {
+      return false;
     }
     kernel_to_gpu_binary_[kernel] = cubin;
-    gpu_binary_to_module_[cubin] = {module, module_refcount};
   } else if (spec.has_cuda_ptx_in_memory()) {
     kernelname = &spec.cuda_ptx_in_memory().kernelname();
 
@@ -254,24 +281,10 @@ bool CUDAExecutor::GetKernel(const MultiKernelLoaderSpec &spec,
     }
 
     mutex_lock lock{in_memory_modules_mu_};
-    uint64_t module_refcount;
-    std::tie(module, module_refcount) = gpu_binary_to_module_[ptx];
-
-    if (module == nullptr) {
-      if (!CUDADriver::LoadPtx(context_, ptx, &module)) {
-        LOG(ERROR) << "failed to load PTX for kernel " << *kernelname;
-        return false;
-      }
-      VLOG(3) << "Loaded PTX " << static_cast<const void *>(ptx)
-              << " as module " << module;
-      module_refcount = 1;
-    } else {
-      ++module_refcount;
-      VLOG(3) << "PTX " << static_cast<const void *>(ptx)
-              << " is already loaded as module " << module;
+    if (!LoadModuleFromPtx(ptx, &module)) {
+      return false;
     }
     kernel_to_gpu_binary_[kernel] = ptx;
-    gpu_binary_to_module_[ptx] = {module, module_refcount};
   } else {
     LOG(WARNING) << "no method of loading CUDA kernel provided";
     return false;
@@ -295,6 +308,23 @@ bool CUDAExecutor::GetKernel(const MultiKernelLoaderSpec &spec,
   return true;
 }
 
+bool CUDAExecutor::UnloadGpuBinary(const void *gpu_binary) {
+  auto module_it = gpu_binary_to_module_.find(gpu_binary);
+  if (gpu_binary_to_module_.end() == module_it) {
+    VLOG(3) << "No loaded CUDA module for " << gpu_binary;
+    return false;
+  }
+  auto &module = module_it->second.first;
+  auto &refcount = module_it->second.second;
+  VLOG(3) << "Found CUDA module " << module << " with refcount " << refcount;
+  if (--refcount == 0) {
+    VLOG(3) << "Unloading CUDA module " << module;
+    CUDADriver::UnloadModule(context_, module);
+    gpu_binary_to_module_.erase(module_it);
+  }
+  return true;
+}
+
 void CUDAExecutor::UnloadKernel(const KernelBase *kernel) {
   VLOG(3) << "Unloading kernel " << kernel << " : " << kernel->name();
 
@@ -307,25 +337,52 @@ void CUDAExecutor::UnloadKernel(const KernelBase *kernel) {
   }
   VLOG(3) << "Kernel " << kernel << " : " << kernel->name()
           << " has loaded GPU code " << gpu_binary_it->second;
-  auto module_it = gpu_binary_to_module_.find(gpu_binary_it->second);
-  if (gpu_binary_to_module_.end() == module_it) {
-    VLOG(3) << "Kernel " << kernel << " : " << kernel->name()
-            << " has no loaded CUDA module.";
-    return;  // This kernel never loaded any modules
-  }
-  auto &module = module_it->second.first;
-  auto &refcount = module_it->second.second;
-  VLOG(3) << "Kernel " << kernel << " : " << kernel->name()
-          << " has loaded GPU code " << gpu_binary_it->second
-          << " into CUDA module " << module << " with refcount " << refcount;
-  if (--refcount == 0) {
-    VLOG(3) << "Unloading CUDA module " << module;
-    CUDADriver::UnloadModule(context_, module);
-    gpu_binary_to_module_.erase(module_it);
-  }
+  UnloadGpuBinary(gpu_binary_it->second);
   kernel_to_gpu_binary_.erase(gpu_binary_it);
 }
 
+bool CUDAExecutor::LoadModule(const MultiModuleLoaderSpec &spec,
+                              ModuleHandle *module_handle) {
+  // In CUDAExecutor we store the pointer to the GPU binary (PTX or CUBIN) as
+  // ModuleHandle::id().
+  CUmodule cu_module;
+  if (spec.has_cuda_cubin_in_memory()) {
+    mutex_lock lock{in_memory_modules_mu_};
+    if (!LoadModuleFromCuBin(
+            reinterpret_cast<const char *>(spec.cuda_cubin_in_memory().data()),
+            &cu_module)) {
+      return false;
+    }
+    *module_handle = ModuleHandle(const_cast<void *>(
+        static_cast<const void *>(spec.cuda_cubin_in_memory().data())));
+    return true;
+  } else if (spec.has_cuda_ptx_in_memory()) {
+    if (cc_major_ == 0 && cc_minor_ == 0) {
+      return false;
+    }
+
+    if (!spec.cuda_ptx_in_memory()) {
+      return false;
+    }
+
+    mutex_lock lock{in_memory_modules_mu_};
+    if (!LoadModuleFromPtx(spec.cuda_ptx_in_memory(), &cu_module)) {
+      return false;
+    }
+    *module_handle = ModuleHandle(const_cast<void *>(
+        static_cast<const void *>(spec.cuda_ptx_in_memory())));
+    return true;
+  }
+  LOG(WARNING) << "no method of loading CUDA module provided";
+  return false;
+}
+
+bool CUDAExecutor::UnloadModule(ModuleHandle module_handle) {
+  const char *gpu_binary = reinterpret_cast<const char *>(module_handle.id());
+  mutex_lock lock{in_memory_modules_mu_};
+  return UnloadGpuBinary(gpu_binary);
+}
+
 bool CUDAExecutor::GetKernelMetadata(CUDAKernel *cuda_kernel,
                                      KernelMetadata *kernel_metadata) {
   int value;
@@ -783,16 +840,26 @@ bool CUDAExecutor::DeviceMemoryUsage(int64 *free, int64 *total) const {
   return CUDADriver::GetDeviceMemoryInfo(context_, free, total);
 }
 
-bool CUDAExecutor::GetSymbol(const string& symbol_name, void **mem,
+bool CUDAExecutor::GetSymbol(const string &symbol_name,
+                             ModuleHandle module_handle, void **mem,
                              size_t *bytes) {
+  auto lookup_in_module = [&](CUmodule module) {
+    CHECK(module != nullptr);
+    return CUDADriver::GetModuleSymbol(context_, module, symbol_name.c_str(),
+                                       reinterpret_cast<CUdeviceptr *>(mem),
+                                       bytes);
+  };
+
   {  // give limited scope to mutex_lock
     mutex_lock lock{in_memory_modules_mu_};
+    if (static_cast<bool>(module_handle)) {
+      auto it = gpu_binary_to_module_.find(module_handle.id());
+      CHECK(it != gpu_binary_to_module_.end());
+      return lookup_in_module(it->second.first);
+    }
+
     for (auto &it : gpu_binary_to_module_) {
-      CUmodule module = it.second.first;
-      CHECK(module != nullptr);
-      if (CUDADriver::GetModuleSymbol(context_, module, symbol_name.c_str(),
-                                      reinterpret_cast<CUdeviceptr *>(mem),
-                                      bytes)) {
+      if (lookup_in_module(it.second.first)) {
         return true;
       }
     }
@@ -844,7 +911,7 @@ CUDAExecutor::GetTimerImplementation() {
   return std::unique_ptr<internal::TimerInterface>(new CUDATimer(this));
 }
 
-void *CUDAExecutor::CudaContextHack() { return context_; }
+void *CUDAExecutor::GpuContextHack() { return context_; }
 
 CudaContext* CUDAExecutor::cuda_context() { return context_; }
 
diff --git a/tensorflow/stream_executor/cuda/cuda_gpu_executor.h b/tensorflow/stream_executor/cuda/cuda_gpu_executor.h
index 773cbfb8a1..8a954d5461 100644
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.h
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.h
@@ -62,6 +62,9 @@ class CUDAExecutor : public internal::StreamExecutorInterface {
   bool GetKernel(const MultiKernelLoaderSpec &spec,
                  KernelBase *kernel) override;
   void UnloadKernel(const KernelBase *kernel) override;
+  bool LoadModule(const MultiModuleLoaderSpec &spec,
+                  ModuleHandle *module_handle) override;
+  bool UnloadModule(ModuleHandle module_handle) override;
 
   bool Launch(Stream *stream, const ThreadDim &thread_dims,
               const BlockDim &block_dims, const KernelBase &k,
@@ -175,7 +178,8 @@ class CUDAExecutor : public internal::StreamExecutorInterface {
 
   // Search for the symbol and returns a device pointer and size.
   // Returns false if symbol does not exist.
-  bool GetSymbol(const string& symbol_name, void **mem, size_t *bytes) override;
+  bool GetSymbol(const string &symbol_name, ModuleHandle module_handle,
+                 void **mem, size_t *bytes) override;
 
   DeviceDescription *PopulateDeviceDescription() const override;
 
@@ -210,7 +214,7 @@ class CUDAExecutor : public internal::StreamExecutorInterface {
 
   std::unique_ptr<internal::TimerInterface> GetTimerImplementation() override;
 
-  void *CudaContextHack() override;
+  void *GpuContextHack() override;
 
   CudaContext* cuda_context();
 
@@ -239,6 +243,16 @@ class CUDAExecutor : public internal::StreamExecutorInterface {
   void VlogOccupancyInfo(const KernelBase &kernel, const ThreadDim &thread_dims,
                          const BlockDim &block_dims);
 
+  bool LoadModuleFromCuBin(const char *cubin, CUmodule *module)
+      EXCLUSIVE_LOCKS_REQUIRED(in_memory_modules_mu_);
+
+  // Loads the PTX text `ptx` as a CUDA module.  `ptx` must be null terminated.
+  bool LoadModuleFromPtx(const char *ptx, CUmodule *module)
+      EXCLUSIVE_LOCKS_REQUIRED(in_memory_modules_mu_);
+
+  bool UnloadGpuBinary(const void *gpu_binary)
+      EXCLUSIVE_LOCKS_REQUIRED(in_memory_modules_mu_);
+
   // Guards the in-memory-module mapping.
   mutex in_memory_modules_mu_;
 
diff --git a/tensorflow/stream_executor/cuda/cuda_stream.h b/tensorflow/stream_executor/cuda/cuda_stream.h
index 02edff6431..bb8bda4755 100644
--- a/tensorflow/stream_executor/cuda/cuda_stream.h
+++ b/tensorflow/stream_executor/cuda/cuda_stream.h
@@ -40,8 +40,8 @@ class CUDAStream : public internal::StreamInterface {
   // Note: teardown is handled by a parent's call to DeallocateStream.
   ~CUDAStream() override {}
 
-  void *CudaStreamHack() override { return cuda_stream_; }
-  void **CudaStreamMemberHack() override {
+  void *GpuStreamHack() override { return cuda_stream_; }
+  void **GpuStreamMemberHack() override {
     return reinterpret_cast<void **>(&cuda_stream_);
   }
 
diff --git a/tensorflow/stream_executor/dnn.cc b/tensorflow/stream_executor/dnn.cc
index 82aa8ceb32..2a30f922bc 100644
--- a/tensorflow/stream_executor/dnn.cc
+++ b/tensorflow/stream_executor/dnn.cc
@@ -117,6 +117,8 @@ string FilterLayoutString(FilterLayout layout) {
   switch (layout) {
     case FilterLayout::kOutputInputYX:
       return "OutputInputYX";
+    case FilterLayout::kOutputYXInput:
+      return "OutputYXInput";
     case FilterLayout::kOutputInputYX4:
       return "OutputInputYX4";
     case FilterLayout::kInputYXOutput:
diff --git a/tensorflow/stream_executor/dnn.h b/tensorflow/stream_executor/dnn.h
index 9eca5abe1a..a7449c2df4 100644
--- a/tensorflow/stream_executor/dnn.h
+++ b/tensorflow/stream_executor/dnn.h
@@ -1552,14 +1552,16 @@ class DnnSupport {
                              const dnn::BatchDescriptor& input_dimensions,
                              const DeviceMemory<float>& input_data,
                              const dnn::BatchDescriptor& output_dimensions,
-                             DeviceMemory<float>* output_data) = 0;
+                             DeviceMemory<float>* output_data,
+                             ScratchAllocator* workspace_allocator) = 0;
 
   virtual bool DoPoolForward(Stream* stream,
                              const dnn::PoolingDescriptor& pooling_dimensions,
                              const dnn::BatchDescriptor& input_dimensions,
                              const DeviceMemory<double>& input_data,
                              const dnn::BatchDescriptor& output_dimensions,
-                             DeviceMemory<double>* output_data) {
+                             DeviceMemory<double>* output_data,
+                             ScratchAllocator* workspace_allocator) {
     LOG(FATAL) << "DoPoolForward not implemented for double.";
     return false;
   }
@@ -1569,7 +1571,8 @@ class DnnSupport {
                              const dnn::BatchDescriptor& input_dimensions,
                              const DeviceMemory<Eigen::half>& input_data,
                              const dnn::BatchDescriptor& output_dimensions,
-                             DeviceMemory<Eigen::half>* output_data) {
+                             DeviceMemory<Eigen::half>* output_data,
+                             ScratchAllocator* workspace_allocator) {
     LOG(FATAL) << "DoPoolForward not implemented for float16.";
     return false;
   }
@@ -1582,7 +1585,8 @@ class DnnSupport {
                               const dnn::BatchDescriptor& output_dimensions,
                               const DeviceMemory<double>& output_data,
                               const DeviceMemory<double>& input_diff_data,
-                              DeviceMemory<double>* output_diff_data) {
+                              DeviceMemory<double>* output_diff_data,
+                              ScratchAllocator* workspace_allocator) {
     LOG(FATAL) << "DoPoolBackward not implemented.";
     return false;
   }
@@ -1594,7 +1598,8 @@ class DnnSupport {
                               const dnn::BatchDescriptor& output_dimensions,
                               const DeviceMemory<float>& output_data,
                               const DeviceMemory<float>& input_diff_data,
-                              DeviceMemory<float>* output_diff_data) {
+                              DeviceMemory<float>* output_diff_data,
+                              ScratchAllocator* workspace_allocator) {
     LOG(FATAL) << "DoPoolBackward not implemented.";
     return false;
   }
@@ -1606,7 +1611,8 @@ class DnnSupport {
                               const dnn::BatchDescriptor& output_dimensions,
                               const DeviceMemory<Eigen::half>& output_data,
                               const DeviceMemory<Eigen::half>& input_diff_data,
-                              DeviceMemory<Eigen::half>* output_diff_data) {
+                              DeviceMemory<Eigen::half>* output_diff_data,
+                              ScratchAllocator* workspace_allocator) {
     LOG(FATAL) << "DoPoolBackward not implemented.";
     return false;
   }
@@ -1653,7 +1659,8 @@ class DnnSupport {
       const DeviceMemory<float>& raw_data,
       const DeviceMemory<float>& normalized_data,
       const DeviceMemory<float>& normalized_variable_gradient,
-      DeviceMemory<float>* raw_variable_gradient) {
+      DeviceMemory<float>* raw_variable_gradient,
+      ScratchAllocator* workspace_allocator) {
     return false;
   }
 
diff --git a/tensorflow/stream_executor/event.cc b/tensorflow/stream_executor/event.cc
index 50a6edd80b..52efe771bc 100644
--- a/tensorflow/stream_executor/event.cc
+++ b/tensorflow/stream_executor/event.cc
@@ -15,9 +15,9 @@ limitations under the License.
 
 #include "tensorflow/stream_executor/event.h"
 
+#include "tensorflow/stream_executor/stream.h"
 #include "tensorflow/stream_executor/stream_executor_internal.h"
 #include "tensorflow/stream_executor/stream_executor_pimpl.h"
-#include "tensorflow/stream_executor/stream.h"
 
 namespace stream_executor {
 
@@ -27,9 +27,12 @@ Event::Event(StreamExecutor* stream_exec)
           stream_exec_->implementation()->CreateEventImplementation()) {}
 
 Event::~Event() {
-  auto status = stream_exec_->DeallocateEvent(this);
-  if (!status.ok()) {
-    LOG(ERROR) << status.error_message();
+  // Deal with nullptr implementation_, as this event may have been std::moved.
+  if (stream_exec_ && implementation_) {
+    auto status = stream_exec_->DeallocateEvent(this);
+    if (!status.ok()) {
+      LOG(ERROR) << status.error_message();
+    }
   }
 }
 
diff --git a/tensorflow/stream_executor/event.h b/tensorflow/stream_executor/event.h
index 1f37262c78..9cc87a7c12 100644
--- a/tensorflow/stream_executor/event.h
+++ b/tensorflow/stream_executor/event.h
@@ -61,6 +61,9 @@ class Event {
   // Returns a pointer to the underlying platform-specific implementation.
   internal::EventInterface* implementation() { return implementation_.get(); }
 
+  Event(Event&&) = default;
+  Event& operator=(Event&&) = default;
+
  private:
   friend class Stream;
 
diff --git a/tensorflow/stream_executor/host/host_gpu_executor.cc b/tensorflow/stream_executor/host/host_gpu_executor.cc
index c8a6297330..8adf739b17 100644
--- a/tensorflow/stream_executor/host/host_gpu_executor.cc
+++ b/tensorflow/stream_executor/host/host_gpu_executor.cc
@@ -26,8 +26,6 @@ limitations under the License.
 #include "tensorflow/stream_executor/lib/statusor.h"
 #include "tensorflow/stream_executor/plugin_registry.h"
 
-bool FLAGS_stream_executor_cpu_real_clock_rate = false;
-
 namespace stream_executor {
 namespace host {
 
@@ -190,11 +188,8 @@ DeviceDescription *HostExecutor::PopulateDeviceDescription() const {
   // doesn't result in thrashing or other badness? 4GiB chosen arbitrarily.
   builder.set_device_memory_size(static_cast<uint64>(4) * 1024 * 1024 * 1024);
 
-  float cycle_counter_frequency = 1e9;
-  if (FLAGS_stream_executor_cpu_real_clock_rate) {
-    cycle_counter_frequency = static_cast<float>(
-        tensorflow::profile_utils::CpuUtils::GetCycleCounterFrequency());
-  }
+  float cycle_counter_frequency = static_cast<float>(
+      tensorflow::profile_utils::CpuUtils::GetCycleCounterFrequency());
   builder.set_clock_rate_ghz(cycle_counter_frequency / 1e9);
 
   auto built = builder.Build();
diff --git a/tensorflow/stream_executor/host/host_gpu_executor.h b/tensorflow/stream_executor/host/host_gpu_executor.h
index e82f57569f..858396ef96 100644
--- a/tensorflow/stream_executor/host/host_gpu_executor.h
+++ b/tensorflow/stream_executor/host/host_gpu_executor.h
@@ -202,7 +202,7 @@ class HostExecutor : public internal::StreamExecutorInterface {
     return std::unique_ptr<internal::TimerInterface>(new HostTimer());
   }
 
-  void *CudaContextHack() override { return nullptr; }
+  void *GpuContextHack() override { return nullptr; }
 
  private:
   const PluginConfig plugin_config_;
diff --git a/tensorflow/stream_executor/host/host_stream.h b/tensorflow/stream_executor/host/host_stream.h
index 5d7b8a3782..be88f074cf 100644
--- a/tensorflow/stream_executor/host/host_stream.h
+++ b/tensorflow/stream_executor/host/host_stream.h
@@ -34,8 +34,8 @@ class HostStream : public internal::StreamInterface {
 
   bool EnqueueTask(std::function<void()> task);
 
-  void *CudaStreamHack() override { return nullptr; }
-  void **CudaStreamMemberHack() override { return nullptr; }
+  void *GpuStreamHack() override { return nullptr; }
+  void **GpuStreamMemberHack() override { return nullptr; }
 
   void BlockUntilDone();
 
diff --git a/tensorflow/stream_executor/module_spec.h b/tensorflow/stream_executor/module_spec.h
new file mode 100644
index 0000000000..75bdfed2d7
--- /dev/null
+++ b/tensorflow/stream_executor/module_spec.h
@@ -0,0 +1,66 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_STREAM_EXECUTOR_MODULE_SPEC_H_
+#define TENSORFLOW_STREAM_EXECUTOR_MODULE_SPEC_H_
+
+#include "tensorflow/stream_executor/lib/array_slice.h"
+#include "tensorflow/stream_executor/lib/stringpiece.h"
+#include "tensorflow/stream_executor/platform/logging.h"
+#include "tensorflow/stream_executor/platform/port.h"
+
+namespace stream_executor {
+
+// Describes how to load a module on a target platform.
+//
+// The exact meaning of a "module" may differ from platform to platform but
+// loosely speaking a module a collection of kernels and global variables.  It
+// corresponds to CUmodule when running on CUDA.
+class MultiModuleLoaderSpec {
+ public:
+  bool has_cuda_cubin_in_memory() const { return has_cuda_cubin_in_memory_; }
+  port::ArraySlice<const uint8> cuda_cubin_in_memory() const {
+    CHECK(has_cuda_cubin_in_memory());
+    return {cuda_cubin_in_memory_.data(), cuda_cubin_in_memory_.size()};
+  }
+
+  bool has_cuda_ptx_in_memory() const { return has_cuda_ptx_in_memory_; }
+  const char* cuda_ptx_in_memory() const {
+    CHECK(has_cuda_ptx_in_memory());
+    return cuda_ptx_in_memory_;
+  }
+
+  void AddCudaCubinInMemory(port::ArraySlice<const uint8> cubin_bytes) {
+    CHECK(!cubin_bytes.empty());
+    has_cuda_cubin_in_memory_ = true;
+    cuda_cubin_in_memory_ = cubin_bytes;
+  }
+
+  void AddCudaPtxInMemory(const char* ptx) {
+    has_cuda_ptx_in_memory_ = true;
+    // The CUDA driver does not like getting an empty string as PTX.
+    cuda_ptx_in_memory_ = *ptx ? ptx : nullptr;
+  }
+
+ private:
+  port::ArraySlice<const uint8> cuda_cubin_in_memory_;
+  bool has_cuda_cubin_in_memory_ = false;
+  const char* cuda_ptx_in_memory_;
+  bool has_cuda_ptx_in_memory_ = false;
+};
+
+}  // namespace stream_executor
+
+#endif  // TENSORFLOW_STREAM_EXECUTOR_MODULE_SPEC_H_
diff --git a/tensorflow/stream_executor/stream.cc b/tensorflow/stream_executor/stream.cc
index 0cd0790a72..6439e3992d 100644
--- a/tensorflow/stream_executor/stream.cc
+++ b/tensorflow/stream_executor/stream.cc
@@ -267,6 +267,12 @@ Stream::Stream(StreamExecutor *parent,
 Stream::~Stream() {
   VLOG_CALL();
 
+  // Ensure the stream is completed.
+  auto status = BlockHostUntilDone();
+  if (!status.ok()) {
+    LOG(WARNING) << "Error blocking host until done in stream destructor: "
+                 << status;
+  }
   temporary_memory_manager_.ForceDeallocateAll();
 
   if (allocated_) {
@@ -1377,15 +1383,16 @@ Stream &Stream::ThenPoolForward(
     const dnn::BatchDescriptor &input_dimensions,
     const DeviceMemory<double> &input_data,
     const dnn::BatchDescriptor &output_dimensions,
-    DeviceMemory<double> *output_data) {
+    DeviceMemory<double> *output_data, ScratchAllocator *workspace_allocator) {
   VLOG_CALL(PARAM(pooling_dimensions), PARAM(input_dimensions),
-            PARAM(input_data), PARAM(output_dimensions), PARAM(output_data));
+            PARAM(input_data), PARAM(output_dimensions), PARAM(output_data),
+            PARAM(workspace_allocator));
 
   if (ok()) {
     if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
       CheckError(dnn->DoPoolForward(this, pooling_dimensions, input_dimensions,
-                                    input_data, output_dimensions,
-                                    output_data));
+                                    input_data, output_dimensions, output_data,
+                                    workspace_allocator));
     } else {
       SetError();
       LOG(WARNING)
@@ -1401,15 +1408,16 @@ Stream &Stream::ThenPoolForward(
     const dnn::BatchDescriptor &input_dimensions,
     const DeviceMemory<float> &input_data,
     const dnn::BatchDescriptor &output_dimensions,
-    DeviceMemory<float> *output_data) {
+    DeviceMemory<float> *output_data, ScratchAllocator *workspace_allocator) {
   VLOG_CALL(PARAM(pooling_dimensions), PARAM(input_dimensions),
-            PARAM(input_data), PARAM(output_dimensions), PARAM(output_data));
+            PARAM(input_data), PARAM(output_dimensions), PARAM(output_data),
+            PARAM(workspace_allocator));
 
   if (ok()) {
     if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
       CheckError(dnn->DoPoolForward(this, pooling_dimensions, input_dimensions,
-                                    input_data, output_dimensions,
-                                    output_data));
+                                    input_data, output_dimensions, output_data,
+                                    workspace_allocator));
     } else {
       SetErrorAndLogNoDnnSupport();
     }
@@ -1422,15 +1430,17 @@ Stream &Stream::ThenPoolForward(
     const dnn::BatchDescriptor &input_dimensions,
     const DeviceMemory<Eigen::half> &input_data,
     const dnn::BatchDescriptor &output_dimensions,
-    DeviceMemory<Eigen::half> *output_data) {
+    DeviceMemory<Eigen::half> *output_data,
+    ScratchAllocator *workspace_allocator) {
   VLOG_CALL(PARAM(pooling_dimensions), PARAM(input_dimensions),
-            PARAM(input_data), PARAM(output_dimensions), PARAM(output_data));
+            PARAM(input_data), PARAM(output_dimensions), PARAM(output_data),
+            PARAM(workspace_allocator));
 
   if (ok()) {
     if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
       CheckError(dnn->DoPoolForward(this, pooling_dimensions, input_dimensions,
-                                    input_data, output_dimensions,
-                                    output_data));
+                                    input_data, output_dimensions, output_data,
+                                    workspace_allocator));
     } else {
       SetErrorAndLogNoDnnSupport();
     }
@@ -1445,16 +1455,19 @@ Stream &Stream::ThenPoolBackward(
     const dnn::BatchDescriptor &output_dimensions,
     const DeviceMemory<double> &output_data,
     const DeviceMemory<double> &input_diff_data,
-    DeviceMemory<double> *output_diff_data) {
+    DeviceMemory<double> *output_diff_data,
+    ScratchAllocator *workspace_allocator) {
   VLOG_CALL(PARAM(pooling_dimensions), PARAM(input_dimensions),
             PARAM(input_data), PARAM(output_dimensions), PARAM(output_data),
-            PARAM(input_diff_data), PARAM(output_diff_data));
+            PARAM(input_diff_data), PARAM(output_diff_data),
+            PARAM(workspace_allocator));
 
   if (ok()) {
     if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
       CheckError(dnn->DoPoolBackward(this, pooling_dimensions, input_dimensions,
                                      input_data, output_dimensions, output_data,
-                                     input_diff_data, output_diff_data));
+                                     input_diff_data, output_diff_data,
+                                     workspace_allocator));
     } else {
       SetError();
       LOG(WARNING)
@@ -1472,16 +1485,19 @@ Stream &Stream::ThenPoolBackward(
     const dnn::BatchDescriptor &output_dimensions,
     const DeviceMemory<float> &output_data,
     const DeviceMemory<float> &input_diff_data,
-    DeviceMemory<float> *output_diff_data) {
+    DeviceMemory<float> *output_diff_data,
+    ScratchAllocator *workspace_allocator) {
   VLOG_CALL(PARAM(pooling_dimensions), PARAM(input_dimensions),
             PARAM(input_data), PARAM(output_dimensions), PARAM(output_data),
-            PARAM(input_diff_data), PARAM(output_diff_data));
+            PARAM(input_diff_data), PARAM(output_diff_data),
+            PARAM(workspace_allocator));
 
   if (ok()) {
     if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
       CheckError(dnn->DoPoolBackward(this, pooling_dimensions, input_dimensions,
                                      input_data, output_dimensions, output_data,
-                                     input_diff_data, output_diff_data));
+                                     input_diff_data, output_diff_data,
+                                     workspace_allocator));
     } else {
       SetErrorAndLogNoDnnSupport();
     }
@@ -1496,16 +1512,19 @@ Stream &Stream::ThenPoolBackward(
     const dnn::BatchDescriptor &output_dimensions,
     const DeviceMemory<Eigen::half> &output_data,
     const DeviceMemory<Eigen::half> &input_diff_data,
-    DeviceMemory<Eigen::half> *output_diff_data) {
+    DeviceMemory<Eigen::half> *output_diff_data,
+    ScratchAllocator *workspace_allocator) {
   VLOG_CALL(PARAM(pooling_dimensions), PARAM(input_dimensions),
             PARAM(input_data), PARAM(output_dimensions), PARAM(output_data),
-            PARAM(input_diff_data), PARAM(output_diff_data));
+            PARAM(input_diff_data), PARAM(output_diff_data),
+            PARAM(workspace_allocator));
 
   if (ok()) {
     if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
       CheckError(dnn->DoPoolBackward(this, pooling_dimensions, input_dimensions,
                                      input_data, output_dimensions, output_data,
-                                     input_diff_data, output_diff_data));
+                                     input_diff_data, output_diff_data,
+                                     workspace_allocator));
     } else {
       SetErrorAndLogNoDnnSupport();
     }
@@ -1552,16 +1571,18 @@ Stream &Stream::ThenNormalizeBackwardWithDimensions(
     const dnn::BatchDescriptor &dimensions, const DeviceMemory<float> &raw_data,
     const DeviceMemory<float> &normalized_data,
     const DeviceMemory<float> &normalized_variable_gradient,
-    DeviceMemory<float> *raw_variable_gradient) {
+    DeviceMemory<float> *raw_variable_gradient,
+    ScratchAllocator *workspace_allocator) {
   VLOG_CALL(PARAM(normalize_descriptor), PARAM(dimensions), PARAM(raw_data),
             PARAM(normalized_data), PARAM(normalized_variable_gradient),
-            PARAM(raw_variable_gradient));
+            PARAM(raw_variable_gradient), PARAM(workspace_allocator));
 
   if (ok()) {
     if (dnn::DnnSupport *dnn = parent_->AsDnn()) {
       CheckError(dnn->DoNormalizeBackwardWithDimensions(
           this, normalize_descriptor, dimensions, raw_data, normalized_data,
-          normalized_variable_gradient, raw_variable_gradient));
+          normalized_variable_gradient, raw_variable_gradient,
+          workspace_allocator));
     } else {
       SetErrorAndLogNoDnnSupport();
     }
@@ -1920,7 +1941,14 @@ void Stream::ReturnSubStream(Stream *sub_stream) {
   mutex_lock lock(mu_);
   for (auto &stream : sub_streams_) {
     if (stream.first.get() == sub_stream) {
-      stream.second = true;
+      // Streams have a monotonic state machine; if a stream
+      // encounters an error, it will remain in an error state
+      // forever. Only allow re-use of ok streams.
+      //
+      // TODO(toddw): Improve this mechanism, if necessary, to drop
+      // failed streams completely.
+      const bool ready_to_reuse = sub_stream->ok();
+      stream.second = ready_to_reuse;
       return;
     }
   }
@@ -4657,6 +4685,115 @@ Stream &Stream::ThenBlasGemmBatchedWithScratch(
               scratch_allocator);
 }
 
+Stream &Stream::ThenBlasGemmStridedBatched(
+    blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+    uint64 k, float alpha, const DeviceMemory<Eigen::half> &a, int lda,
+    int64 stride_a, const DeviceMemory<Eigen::half> &b, int ldb, int64 stride_b,
+    float beta, DeviceMemory<Eigen::half> *c, int ldc, int64 stride_c,
+    int batch_count) {
+  VLOG_CALL(PARAM(transa), PARAM(transb), PARAM(m), PARAM(n), PARAM(k),
+            PARAM(alpha), PARAM(a), PARAM(lda), PARAM(stride_a), PARAM(b),
+            PARAM(ldb), PARAM(stride_b), PARAM(beta), PARAM(c), PARAM(ldc),
+            PARAM(stride_c), PARAM(batch_count));
+
+  ThenBlasImpl<blas::Transpose, blas::Transpose, uint64, uint64, uint64, float,
+               const DeviceMemory<Eigen::half> &, int, int64,
+               const DeviceMemory<Eigen::half> &, int, int64, float,
+               DeviceMemory<Eigen::half> *, int, int64, int>
+      impl;
+  return impl(this, &blas::BlasSupport::DoBlasGemmStridedBatched, transa,
+              transb, m, n, k, alpha, a, lda, stride_a, b, ldb, stride_b, beta,
+              c, ldc, stride_c, batch_count);
+}
+
+Stream &Stream::ThenBlasGemmStridedBatched(
+    blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+    uint64 k, float alpha, const DeviceMemory<float> &a, int lda,
+    int64 stride_a, const DeviceMemory<float> &b, int ldb, int64 stride_b,
+    float beta, DeviceMemory<float> *c, int ldc, int64 stride_c,
+    int batch_count) {
+  VLOG_CALL(PARAM(transa), PARAM(transb), PARAM(m), PARAM(n), PARAM(k),
+            PARAM(alpha), PARAM(a), PARAM(lda), PARAM(stride_a), PARAM(b),
+            PARAM(ldb), PARAM(stride_b), PARAM(beta), PARAM(c), PARAM(ldc),
+            PARAM(stride_c), PARAM(batch_count));
+
+  ThenBlasImpl<blas::Transpose, blas::Transpose, uint64, uint64, uint64, float,
+               const DeviceMemory<float> &, int, int64,
+               const DeviceMemory<float> &, int, int64, float,
+               DeviceMemory<float> *, int, int64, int>
+      impl;
+  return impl(this, &blas::BlasSupport::DoBlasGemmStridedBatched, transa,
+              transb, m, n, k, alpha, a, lda, stride_a, b, ldb, stride_b, beta,
+              c, ldc, stride_c, batch_count);
+}
+
+Stream &Stream::ThenBlasGemmStridedBatched(
+    blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+    uint64 k, double alpha, const DeviceMemory<double> &a, int lda,
+    int64 stride_a, const DeviceMemory<double> &b, int ldb, int64 stride_b,
+    double beta, DeviceMemory<double> *c, int ldc, int64 stride_c,
+    int batch_count) {
+  VLOG_CALL(PARAM(transa), PARAM(transb), PARAM(m), PARAM(n), PARAM(k),
+            PARAM(alpha), PARAM(a), PARAM(lda), PARAM(stride_a), PARAM(b),
+            PARAM(ldb), PARAM(stride_b), PARAM(beta), PARAM(c), PARAM(ldc),
+            PARAM(stride_c), PARAM(batch_count));
+
+  ThenBlasImpl<blas::Transpose, blas::Transpose, uint64, uint64, uint64, double,
+               const DeviceMemory<double> &, int, int64,
+               const DeviceMemory<double> &, int, int64, double,
+               DeviceMemory<double> *, int, int64, int>
+      impl;
+  return impl(this, &blas::BlasSupport::DoBlasGemmStridedBatched, transa,
+              transb, m, n, k, alpha, a, lda, stride_a, b, ldb, stride_b, beta,
+              c, ldc, stride_c, batch_count);
+}
+
+Stream &Stream::ThenBlasGemmStridedBatched(
+    blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+    uint64 k, std::complex<float> alpha,
+    const DeviceMemory<std::complex<float>> &a, int lda, int64 stride_a,
+    const DeviceMemory<std::complex<float>> &b, int ldb, int64 stride_b,
+    std::complex<float> beta, DeviceMemory<std::complex<float>> *c, int ldc,
+    int64 stride_c, int batch_count) {
+  VLOG_CALL(PARAM(transa), PARAM(transb), PARAM(m), PARAM(n), PARAM(k),
+            PARAM(alpha), PARAM(a), PARAM(lda), PARAM(stride_a), PARAM(b),
+            PARAM(ldb), PARAM(stride_b), PARAM(beta), PARAM(c), PARAM(ldc),
+            PARAM(stride_c), PARAM(batch_count));
+
+  ThenBlasImpl<blas::Transpose, blas::Transpose, uint64, uint64, uint64,
+               std::complex<float>, const DeviceMemory<std::complex<float>> &,
+               int, int64, const DeviceMemory<std::complex<float>> &, int,
+               int64, std::complex<float>, DeviceMemory<std::complex<float>> *,
+               int, int64, int>
+      impl;
+  return impl(this, &blas::BlasSupport::DoBlasGemmStridedBatched, transa,
+              transb, m, n, k, alpha, a, lda, stride_a, b, ldb, stride_b, beta,
+              c, ldc, stride_c, batch_count);
+}
+
+Stream &Stream::ThenBlasGemmStridedBatched(
+    blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+    uint64 k, std::complex<double> alpha,
+    const DeviceMemory<std::complex<double>> &a, int lda, int64 stride_a,
+    const DeviceMemory<std::complex<double>> &b, int ldb, int64 stride_b,
+    std::complex<double> beta, DeviceMemory<std::complex<double>> *c, int ldc,
+    int64 stride_c, int batch_count) {
+  VLOG_CALL(PARAM(transa), PARAM(transb), PARAM(m), PARAM(n), PARAM(k),
+            PARAM(alpha), PARAM(a), PARAM(lda), PARAM(stride_a), PARAM(b),
+            PARAM(ldb), PARAM(stride_b), PARAM(beta), PARAM(c), PARAM(ldc),
+            PARAM(stride_c), PARAM(batch_count));
+
+  ThenBlasImpl<blas::Transpose, blas::Transpose, uint64, uint64, uint64,
+               std::complex<double>, const DeviceMemory<std::complex<double>> &,
+               int, int64, const DeviceMemory<std::complex<double>> &, int,
+               int64, std::complex<double>,
+               DeviceMemory<std::complex<double>> *, int, int64, int>
+      impl;
+  return impl(this, &blas::BlasSupport::DoBlasGemmStridedBatched, transa,
+              transb, m, n, k, alpha, a, lda, stride_a, b, ldb, stride_b, beta,
+              c, ldc, stride_c, batch_count);
+}
+
 Stream &Stream::ThenSetRngSeed(const uint8 *seed, uint64 seed_bytes) {
   VLOG_CALL(PARAM(seed), PARAM(seed_bytes));
 
@@ -5228,24 +5365,11 @@ port::Status Stream::BlockHostUntilDone() {
     return status;
   }
 
-  port::Status first_error;
-  {
-    // Wait until all active sub-streams have done their tasks.
-    mutex_lock lock(mu_);
-    for (auto &stream : sub_streams_) {
-      if (!stream.second) {
-        first_error.Update(stream.first->BlockHostUntilDone());
-        // Set this sub-stream as available.
-        stream.second = true;
-      }
-    }
-  }
-
   temporary_memory_manager_.DeallocateFinalizedTemporaries();
 
-  first_error.Update(parent_->BlockHostUntilDone(this));
-  CheckError(first_error.ok());
-  return first_error;
+  port::Status error = parent_->BlockHostUntilDone(this);
+  CheckError(error.ok());
+  return error;
 }
 
 }  // namespace stream_executor
diff --git a/tensorflow/stream_executor/stream.h b/tensorflow/stream_executor/stream.h
index e8885e1eb6..62d0a2062d 100644
--- a/tensorflow/stream_executor/stream.h
+++ b/tensorflow/stream_executor/stream.h
@@ -125,7 +125,7 @@ class Stream {
   Stream *GetOrCreateSubStream() LOCKS_EXCLUDED(mu_);
 
   // Return the sub-stream back to the host stream so that it can be reused
-  // later.
+  // later. Sub-streams that are !ok() will not be reused.
   void ReturnSubStream(Stream *sub_stream) LOCKS_EXCLUDED(mu_);
 
   // Allocate temporary memories. The stream will deallocate them when blocked
@@ -629,19 +629,22 @@ class Stream {
                           const dnn::BatchDescriptor &input_dimensions,
                           const DeviceMemory<double> &input_data,
                           const dnn::BatchDescriptor &output_dimensions,
-                          DeviceMemory<double> *output_data);
+                          DeviceMemory<double> *output_data,
+                          ScratchAllocator *workspace_allocator = nullptr);
 
   Stream &ThenPoolForward(const dnn::PoolingDescriptor &pooling_dimensions,
                           const dnn::BatchDescriptor &input_dimensions,
                           const DeviceMemory<float> &input_data,
                           const dnn::BatchDescriptor &output_dimensions,
-                          DeviceMemory<float> *output_data);
+                          DeviceMemory<float> *output_data,
+                          ScratchAllocator *workspace_allocator = nullptr);
 
   Stream &ThenPoolForward(const dnn::PoolingDescriptor &pooling_dimensions,
                           const dnn::BatchDescriptor &input_dimensions,
                           const DeviceMemory<Eigen::half> &input_data,
                           const dnn::BatchDescriptor &output_dimensions,
-                          DeviceMemory<Eigen::half> *output_data);
+                          DeviceMemory<Eigen::half> *output_data,
+                          ScratchAllocator *workspace_allocator = nullptr);
 
   Stream &ThenPoolBackward(const dnn::PoolingDescriptor &pooling_dimensions,
                            const dnn::BatchDescriptor &input_dimensions,
@@ -649,7 +652,8 @@ class Stream {
                            const dnn::BatchDescriptor &output_dimensions,
                            const DeviceMemory<double> &output_data,
                            const DeviceMemory<double> &input_diff_data,
-                           DeviceMemory<double> *output_diff_data);
+                           DeviceMemory<double> *output_diff_data,
+                           ScratchAllocator *workspace_allocator = nullptr);
 
   Stream &ThenPoolBackward(const dnn::PoolingDescriptor &pooling_dimensions,
                            const dnn::BatchDescriptor &input_dimensions,
@@ -657,7 +661,8 @@ class Stream {
                            const dnn::BatchDescriptor &output_dimensions,
                            const DeviceMemory<float> &output_data,
                            const DeviceMemory<float> &input_diff_data,
-                           DeviceMemory<float> *output_diff_data);
+                           DeviceMemory<float> *output_diff_data,
+                           ScratchAllocator *workspace_allocator = nullptr);
 
   Stream &ThenPoolBackward(const dnn::PoolingDescriptor &pooling_dimensions,
                            const dnn::BatchDescriptor &input_dimensions,
@@ -665,7 +670,8 @@ class Stream {
                            const dnn::BatchDescriptor &output_dimensions,
                            const DeviceMemory<Eigen::half> &output_data,
                            const DeviceMemory<Eigen::half> &input_diff_data,
-                           DeviceMemory<Eigen::half> *output_diff_data);
+                           DeviceMemory<Eigen::half> *output_diff_data,
+                           ScratchAllocator *workspace_allocator = nullptr);
 
   Stream &ThenNormalize(const dnn::NormalizeDescriptor &normalize_descriptor,
                         const DeviceMemory<float> &input_data,
@@ -684,7 +690,8 @@ class Stream {
       const DeviceMemory<float> &raw_data,
       const DeviceMemory<float> &normalized_data,
       const DeviceMemory<float> &normalized_variable_gradient,
-      DeviceMemory<float> *raw_variable_gradient);
+      DeviceMemory<float> *raw_variable_gradient,
+      ScratchAllocator *workspace_allocator = nullptr);
 
   Stream &ThenActivate(dnn::ActivationMode activation_mode,
                        const dnn::BatchDescriptor &dimensions,
@@ -1550,6 +1557,38 @@ class Stream {
       std::complex<double> beta,
       const port::ArraySlice<DeviceMemory<std::complex<double>> *> &c, int ldc,
       int batch_count, ScratchAllocator *scratch_allocator);
+  Stream &ThenBlasGemmStridedBatched(
+      blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+      uint64 k, float alpha, const DeviceMemory<Eigen::half> &a, int lda,
+      int64 stride_a, const DeviceMemory<Eigen::half> &b, int ldb,
+      int64 stride_b, float beta, DeviceMemory<Eigen::half> *c, int ldc,
+      int64 stride_c, int batch_count);
+  Stream &ThenBlasGemmStridedBatched(
+      blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+      uint64 k, float alpha, const DeviceMemory<float> &a, int lda,
+      int64 stride_a, const DeviceMemory<float> &b, int ldb, int64 stride_b,
+      float beta, DeviceMemory<float> *c, int ldc, int64 stride_c,
+      int batch_count);
+  Stream &ThenBlasGemmStridedBatched(
+      blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+      uint64 k, double alpha, const DeviceMemory<double> &a, int lda,
+      int64 stride_a, const DeviceMemory<double> &b, int ldb, int64 stride_b,
+      double beta, DeviceMemory<double> *c, int ldc, int64 stride_c,
+      int batch_count);
+  Stream &ThenBlasGemmStridedBatched(
+      blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+      uint64 k, std::complex<float> alpha,
+      const DeviceMemory<std::complex<float>> &a, int lda, int64 stride_a,
+      const DeviceMemory<std::complex<float>> &b, int ldb, int64 stride_b,
+      std::complex<float> beta, DeviceMemory<std::complex<float>> *c, int ldc,
+      int64 stride_c, int batch_count);
+  Stream &ThenBlasGemmStridedBatched(
+      blas::Transpose transa, blas::Transpose transb, uint64 m, uint64 n,
+      uint64 k, std::complex<double> alpha,
+      const DeviceMemory<std::complex<double>> &a, int lda, int64 stride_a,
+      const DeviceMemory<std::complex<double>> &b, int ldb, int64 stride_b,
+      std::complex<double> beta, DeviceMemory<std::complex<double>> *c, int ldc,
+      int64 stride_c, int batch_count);
 
   // See BlasSupport::DoBlasHemm.
   Stream &ThenBlasHemm(blas::Side side, blas::UpperLower uplo, uint64 m,
diff --git a/tensorflow/stream_executor/stream_executor_internal.h b/tensorflow/stream_executor/stream_executor_internal.h
index 9c989b971d..f34b1fc083 100644
--- a/tensorflow/stream_executor/stream_executor_internal.h
+++ b/tensorflow/stream_executor/stream_executor_internal.h
@@ -36,20 +36,38 @@ limitations under the License.
 #include "tensorflow/stream_executor/kernel_cache_config.h"
 #include "tensorflow/stream_executor/kernel_spec.h"
 #include "tensorflow/stream_executor/launch_dim.h"
+#include "tensorflow/stream_executor/lib/inlined_vector.h"
 #include "tensorflow/stream_executor/lib/status.h"
 #include "tensorflow/stream_executor/lib/statusor.h"
+#include "tensorflow/stream_executor/module_spec.h"
 #include "tensorflow/stream_executor/platform.h"
 #include "tensorflow/stream_executor/platform/port.h"
 #include "tensorflow/stream_executor/plugin_registry.h"
 #include "tensorflow/stream_executor/shared_memory_config.h"
 #include "tensorflow/stream_executor/trace_listener.h"
-#include "tensorflow/stream_executor/lib/inlined_vector.h"
 
 namespace stream_executor {
 
 class Stream;
 class Timer;
 
+// An opaque handle to a loaded module.
+//
+// An instance of this is returned from StreamExecutor::GetModule.
+class ModuleHandle {
+ public:
+  /*implicit*/ ModuleHandle(void *id = nullptr) : id_(id) {}
+
+  // A ModuleHandle with id() == nullptr is an invalid module handle, akin to a
+  // null pointer.
+  void *id() const { return id_; }
+
+  explicit operator bool() const { return id() != nullptr; }
+
+ private:
+  void *id_;
+};
+
 namespace internal {
 
 // Platform-dependent interface class for the generic Events interface, in
@@ -100,19 +118,20 @@ class StreamInterface {
   // Default destructor for the abstract interface.
   virtual ~StreamInterface() {}
 
-  // Returns the CUDA stream associated with this platform's stream
+  // Returns the GPU stream associated with this platform's stream
   // implementation.
   //
-  // WARNING: checks that the underlying platform is, in fact, CUDA, causing a
-  // fatal error if it is not. This hack is made available solely for use from
-  // distbelief code, which temporarily has strong ties to CUDA as a platform.
-  virtual void *CudaStreamHack() { return nullptr; }
-
-  // See the above comment on CudaStreamHack -- this further breaks abstraction
-  // for Eigen within distbelief, which has strong ties to CUDA as a platform,
-  // and a historical attachment to a programming model which takes a
+  // WARNING: checks that the underlying platform is, in fact, CUDA or ROCm,
+  // causing a fatal error if it is not. This hack is made available solely for
+  // use from distbelief code, which temporarily has strong ties to CUDA or
+  // ROCm as a platform.
+  virtual void *GpuStreamHack() { return nullptr; }
+
+  // See the above comment on GpuStreamHack -- this further breaks abstraction
+  // for Eigen within distbelief, which has strong ties to CUDA or ROCm as a
+  // platform, and a historical attachment to a programming model which takes a
   // stream-slot rather than a stream-value.
-  virtual void **CudaStreamMemberHack() { return nullptr; }
+  virtual void **GpuStreamMemberHack() { return nullptr; }
 
  private:
   SE_DISALLOW_COPY_AND_ASSIGN(StreamInterface);
@@ -163,6 +182,11 @@ class StreamExecutorInterface {
                          KernelBase *kernel) {
     return false;
   }
+  virtual bool LoadModule(const MultiModuleLoaderSpec &spec,
+                          ModuleHandle *module_handle) {
+    return false;
+  }
+  virtual bool UnloadModule(ModuleHandle module_handle) { return false; }
   virtual bool Launch(Stream *stream, const ThreadDim &thread_dims,
                       const BlockDim &block_dims, const KernelBase &k,
                       const KernelArgsArrayBase &args) {
@@ -246,7 +270,12 @@ class StreamExecutorInterface {
   // null, however, both of them cannot be null at the same time. To use
   // constant memory in CUDA, GetSymbol has to be used. Returns true if symbol
   // is found.
-  virtual bool GetSymbol(const string& symbol_name, void **mem, size_t *bytes) {
+  //
+  // If ModuleHandle is set then we search for `symbol_name` only within the
+  // module corresponding to `module_handle`.  Otherwise all loaded modules are
+  // searched.
+  virtual bool GetSymbol(const string &symbol_name, ModuleHandle module_handle,
+                         void **mem, size_t *bytes) {
     return false;
   }
 
@@ -324,13 +353,14 @@ class StreamExecutorInterface {
   virtual std::unique_ptr<StreamInterface> GetStreamImplementation() = 0;
   virtual std::unique_ptr<TimerInterface> GetTimerImplementation() = 0;
 
-  // Returns the CUDA context associated with this StreamExecutor platform
-  // implementation.
+  // Returns the CUDA or ROCm context associated with this StreamExecutor
+  // platform implementation.
   //
-  // WARNING: checks that the underlying platform is, in fact, CUDA, causing a
-  // fatal error if it is not. This hack is made available solely for use from
-  // distbelief code, which temporarily has strong ties to CUDA as a platform.
-  virtual void *CudaContextHack() { return nullptr; }
+  // WARNING: checks that the underlying platform is, in fact, CUDA or ROCm,
+  // causing a fatal error if it is not. This hack is made available solely for
+  // use from distbelief code, which temporarily has strong ties to CUDA or ROCm
+  // as a platform.
+  virtual void *GpuContextHack() { return nullptr; }
 
  private:
   SE_DISALLOW_COPY_AND_ASSIGN(StreamExecutorInterface);
diff --git a/tensorflow/stream_executor/stream_executor_pimpl.cc b/tensorflow/stream_executor/stream_executor_pimpl.cc
index 000795ff00..2e0137a485 100644
--- a/tensorflow/stream_executor/stream_executor_pimpl.cc
+++ b/tensorflow/stream_executor/stream_executor_pimpl.cc
@@ -220,6 +220,15 @@ void StreamExecutor::UnloadKernel(const KernelBase *kernel) {
   implementation_->UnloadKernel(kernel);
 }
 
+bool StreamExecutor::LoadModule(const MultiModuleLoaderSpec &spec,
+                                ModuleHandle *module_handle) {
+  return implementation_->LoadModule(spec, module_handle);
+}
+
+bool StreamExecutor::UnloadModule(ModuleHandle module_handle) {
+  return implementation_->UnloadModule(module_handle);
+}
+
 void StreamExecutor::Deallocate(DeviceMemoryBase *mem) {
   VLOG(1) << "Called StreamExecutor::Deallocate(mem=" << mem->opaque()
           << ") mem->size()=" << mem->size() << StackTraceIfVLOG10();
@@ -459,9 +468,34 @@ void *StreamExecutor::Allocate(uint64 size) {
   return buf;
 }
 
-bool StreamExecutor::GetSymbol(const string &symbol_name, void **mem,
+port::StatusOr<DeviceMemoryBase> StreamExecutor::GetUntypedSymbol(
+    const string &symbol_name, ModuleHandle module_handle) {
+  // If failed to get the symbol, opaque/bytes are unchanged. Initialize them to
+  // be nullptr/0 for consistency with DeviceMemory semantics.
+  void *opaque = nullptr;
+  size_t bytes = 0;
+  if (GetSymbol(symbol_name, module_handle, &opaque, &bytes)) {
+    return DeviceMemoryBase(opaque, bytes);
+  }
+
+  if (static_cast<bool>(module_handle)) {
+    return port::Status(
+        port::error::NOT_FOUND,
+        port::StrCat("Check if module containing symbol ", symbol_name,
+                     " is loaded (module_handle = ",
+                     reinterpret_cast<uintptr_t>(module_handle.id()), ")"));
+  } else {
+    return port::Status(
+        port::error::NOT_FOUND,
+        port::StrCat("Check if kernel using the symbol is loaded: ",
+                     symbol_name));
+  }
+}
+
+bool StreamExecutor::GetSymbol(const string &symbol_name,
+                               ModuleHandle module_handle, void **mem,
                                size_t *bytes) {
-  return implementation_->GetSymbol(symbol_name, mem, bytes);
+  return implementation_->GetSymbol(symbol_name, module_handle, mem, bytes);
 }
 
 void *StreamExecutor::UnifiedMemoryAllocate(uint64 bytes) {
diff --git a/tensorflow/stream_executor/stream_executor_pimpl.h b/tensorflow/stream_executor/stream_executor_pimpl.h
index ad80a1ba25..47b3a2b030 100644
--- a/tensorflow/stream_executor/stream_executor_pimpl.h
+++ b/tensorflow/stream_executor/stream_executor_pimpl.h
@@ -106,6 +106,16 @@ class StreamExecutor {
   // Releases any state associated with the previously loaded kernel.
   void UnloadKernel(const KernelBase *kernel);
 
+  // Loads a module for the platform this StreamExecutor is acting upon.
+  //
+  // `spec` describes the module to be loaded.  On success writes the handle for
+  // the loaded module to `module_handle` and returns true.  Else returns false.
+  bool LoadModule(const MultiModuleLoaderSpec &spec,
+                  ModuleHandle *module_handle);
+
+  // Unloads the module with handle `module_handle`.
+  bool UnloadModule(ModuleHandle module_handle);
+
   // Synchronously allocates an array on the device of type T with element_count
   // elements.
   template <typename T>
@@ -169,8 +179,16 @@ class StreamExecutor {
   // type of symbol and T match.
   // - Note: symbol_name should include its namespace as well. For example,
   //         pass "nms0::symbol" if referring to nms0::symbol.
+  //
+  // If `module_handle` is set then searches only within the module
+  // corresponding to `module_handle`.
   template <typename T>
-  port::StatusOr<DeviceMemory<T>> GetSymbol(const string &symbol_name);
+  port::StatusOr<DeviceMemory<T>> GetSymbol(const string &symbol_name,
+                                            ModuleHandle module_handle = {});
+
+  // An untyped version of GetSymbol.
+  port::StatusOr<DeviceMemoryBase> GetUntypedSymbol(
+      const string &symbol_name, ModuleHandle module_handle = {});
 
   // Deallocate the DeviceMemory previously allocated via this interface.
   // Deallocation of a nullptr-representative value is permitted.
@@ -507,7 +525,8 @@ class StreamExecutor {
 
   // Finds and retrieves device memory for the symbol on the underlying
   // platform.
-  bool GetSymbol(const string& symbol_name, void **mem, size_t *bytes);
+  bool GetSymbol(const string &symbol_name, ModuleHandle module_handle,
+                 void **mem, size_t *bytes);
 
   // Entrains a memcpy operation onto stream, with a host destination location
   // host_dst and a device memory source, with target size size.
@@ -678,6 +697,41 @@ class StreamExecutor {
   SE_DISALLOW_COPY_AND_ASSIGN(StreamExecutor);
 };
 
+// A wrapper around ModuleHandle that uses RAII to manage its lifetime.
+class ScopedModuleHandle {
+ public:
+  explicit ScopedModuleHandle(StreamExecutor *executor,
+                              ModuleHandle module_handle)
+      : executor_(executor), module_handle_(module_handle) {}
+
+  ScopedModuleHandle(ScopedModuleHandle &&other) {
+    executor_ = other.executor_;
+    module_handle_ = other.module_handle_;
+    other.executor_ = nullptr;
+    other.module_handle_ = ModuleHandle();
+  }
+
+  ScopedModuleHandle &operator=(ScopedModuleHandle &&other) {
+    executor_ = other.executor_;
+    module_handle_ = other.module_handle_;
+    other.executor_ = nullptr;
+    other.module_handle_ = ModuleHandle();
+    return *this;
+  }
+
+  ~ScopedModuleHandle() {
+    if (static_cast<bool>(module_handle_)) {
+      CHECK(executor_->UnloadModule(module_handle_));
+    }
+  }
+
+ private:
+  StreamExecutor *executor_;
+  ModuleHandle module_handle_;
+
+  TF_DISALLOW_COPY_AND_ASSIGN(ScopedModuleHandle);
+};
+
 ////////////
 // Inlines
 
@@ -690,19 +744,13 @@ inline DeviceMemory<T> StreamExecutor::AllocateArray(uint64 element_count) {
 
 template <typename T>
 inline port::StatusOr<DeviceMemory<T>> StreamExecutor::GetSymbol(
-    const string &symbol_name) {
-  // If failed to get the symbol, opaque/bytes are unchanged. Initialize them to
-  // be nullptr/0 for consistency with DeviceMemory semantics.
-  void *opaque = nullptr;
-  size_t bytes = 0;
-  if (GetSymbol(symbol_name, &opaque, &bytes)) {
-    CHECK_EQ(bytes % sizeof(T), 0);
-    return DeviceMemory<T>::MakeFromByteSize(opaque, bytes);
+    const string &symbol_name, ModuleHandle module_handle) {
+  port::StatusOr<DeviceMemoryBase> untyped_symbol =
+      GetUntypedSymbol(symbol_name, module_handle);
+  if (!untyped_symbol.ok()) {
+    return untyped_symbol.status();
   }
-  return port::Status(
-      port::error::NOT_FOUND,
-      port::StrCat("Check if kernel using the symbol is loaded: ",
-                   symbol_name));
+  return DeviceMemory<T>(untyped_symbol.ValueOrDie());
 }
 
 template <typename ElemT>
diff --git a/tensorflow/stream_executor/stream_test.cc b/tensorflow/stream_executor/stream_test.cc
new file mode 100644
index 0000000000..47dd675834
--- /dev/null
+++ b/tensorflow/stream_executor/stream_test.cc
@@ -0,0 +1,139 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/stream_executor/stream_executor.h"
+
+#include "tensorflow/core/platform/test.h"
+
+namespace stream_executor {
+namespace {
+
+class StreamTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<StreamExecutor> NewStreamExecutor() {
+    Platform* platform =
+        MultiPlatformManager::PlatformWithName("Host").ConsumeValueOrDie();
+    StreamExecutorConfig config(/*ordinal=*/0);
+    return platform->GetUncachedExecutor(config).ConsumeValueOrDie();
+  }
+};
+
+TEST_F(StreamTest, NoInitNotOk) {
+  std::unique_ptr<StreamExecutor> executor = NewStreamExecutor();
+  Stream stream(executor.get());
+  EXPECT_FALSE(stream.ok());
+}
+
+TEST_F(StreamTest, InitOk) {
+  std::unique_ptr<StreamExecutor> executor = NewStreamExecutor();
+  Stream stream(executor.get());
+  stream.Init();
+  EXPECT_TRUE(stream.ok());
+}
+
+TEST_F(StreamTest, OneSubStream) {
+  std::unique_ptr<StreamExecutor> executor = NewStreamExecutor();
+  Stream stream(executor.get());
+  stream.Init();
+  EXPECT_TRUE(stream.ok());
+
+  // Get and return a sub-stream. Sub-streams are always initialized.
+  Stream* sub_stream1 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream1->ok());
+  stream.ReturnSubStream(sub_stream1);
+
+  // Get and return another sub-stream.
+  Stream* sub_stream2 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream2->ok());
+  stream.ReturnSubStream(sub_stream1);
+
+  // The underlying sub-streams should be the same, since sub_stream1
+  // was returned before we tried to get sub_stream2.
+  EXPECT_EQ(sub_stream1, sub_stream2);
+}
+
+TEST_F(StreamTest, TwoSubStreams) {
+  std::unique_ptr<StreamExecutor> executor = NewStreamExecutor();
+  Stream stream(executor.get());
+  stream.Init();
+  EXPECT_TRUE(stream.ok());
+
+  // Get two sub-streams.
+  Stream* sub_stream1 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream1->ok());
+  Stream* sub_stream2 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream2->ok());
+
+  // The underlying sub-streams should be different, since neither
+  // sub-stream has been returned.
+  EXPECT_NE(sub_stream1, sub_stream2);
+
+  // Return sub_stream1 and get sub_stream3, which should be the same.
+  stream.ReturnSubStream(sub_stream1);
+  Stream* sub_stream3 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream3->ok());
+  EXPECT_EQ(sub_stream1, sub_stream3);
+  EXPECT_NE(sub_stream2, sub_stream3);
+
+  // Return sub_stream2 and get sub_stream4, which should be the same.
+  stream.ReturnSubStream(sub_stream2);
+  Stream* sub_stream4 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream4->ok());
+  EXPECT_EQ(sub_stream2, sub_stream4);
+  EXPECT_NE(sub_stream3, sub_stream4);
+}
+
+TEST_F(StreamTest, FailedSubStreamNotReused) {
+  std::unique_ptr<StreamExecutor> executor = NewStreamExecutor();
+  Stream stream(executor.get());
+  stream.Init();
+  EXPECT_TRUE(stream.ok());
+
+  // Get a sub-stream.
+  Stream* sub_stream1 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream1->ok());
+
+  // Force an error on the stream; here we call a method that requires
+  // DNN support, which we know the Host platform doesn't support.
+  sub_stream1->ThenDepthConcatenate({}, {}, nullptr);
+  EXPECT_FALSE(sub_stream1->ok());
+
+  // Return sub_stream1 and get sub_stream2.
+  stream.ReturnSubStream(sub_stream1);
+  Stream* sub_stream2 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream2->ok());
+
+  // The underlying streams should be different. They would have been
+  // the same, but since we forced an error on sub_stream1, it will
+  // not be re-used. Sadly we can't just check:
+  //   EXPECT_NE(sub_stream1, sub_stream2);
+  //
+  // The above should hold logically, but it may fail if the new
+  // stream instance allocated for sub_stream2 happens to reside in
+  // the same memory address as sub_stream1.
+  //
+  // The check that sub_stream2->ok() serves as a good-enough check.
+
+  // Return sub_stream2 and get sub_stream3. The previous error on
+  // sub_stream1 has no effect on these streams, and they are the
+  // same.
+  stream.ReturnSubStream(sub_stream2);
+  Stream* sub_stream3 = stream.GetOrCreateSubStream();
+  EXPECT_TRUE(sub_stream3->ok());
+  EXPECT_EQ(sub_stream2, sub_stream3);
+}
+
+}  // namespace
+}  // namespace stream_executor
diff --git a/tensorflow/tensorflow.bzl b/tensorflow/tensorflow.bzl
index e4632c4811..39db840884 100644
--- a/tensorflow/tensorflow.bzl
+++ b/tensorflow/tensorflow.bzl
@@ -9,6 +9,7 @@ load(
     "tf_additional_grpc_deps_py",
     "tf_additional_xla_deps_py",
     "if_static",
+    "if_dynamic_kernels",
 )
 load(
     "@local_config_tensorrt//:build_defs.bzl",
@@ -24,7 +25,10 @@ load(
     "if_mkl",
     "if_mkl_lnx_x64"
 )
-
+load(
+    "//third_party/mkl_dnn:build_defs.bzl",
+    "if_mkl_open_source_only",
+)
 def register_extension_info(**kwargs):
     pass
 
@@ -134,6 +138,14 @@ def if_not_mobile(a):
       "//conditions:default": a,
   })
 
+# Config setting selector used when building for products
+# which requires restricted licenses to be avoided.
+def if_not_lgpl_restricted(a):
+  _ = (a,)
+  return select({
+      "//conditions:default": [],
+  })
+
 def if_not_windows(a):
   return select({
       clean_dep("//tensorflow:windows"): [],
@@ -180,9 +192,13 @@ def get_win_copts(is_external=False):
         "/DEIGEN_AVOID_STL_ARRAY",
         "/Iexternal/gemmlowp",
         "/wd4018",  # -Wno-sign-compare
-        "/U_HAS_EXCEPTIONS",
-        "/D_HAS_EXCEPTIONS=1",
-        "/EHsc",  # -fno-exceptions
+        # Bazel's CROSSTOOL currently pass /EHsc to enable exception by
+        # default. We can't pass /EHs-c- to disable exception, otherwise
+        # we will get a waterfall of flag conflict warnings. Wait for
+        # Bazel to fix this.
+        # "/D_HAS_EXCEPTIONS=0",
+        # "/EHs-c-",
+        "/wd4577",
         "/DNOGDI",
     ]
     if is_external:
@@ -214,6 +230,7 @@ def tf_copts(android_optimization_level_override="-O2", is_external=False):
       + if_cuda(["-DGOOGLE_CUDA=1"])
       + if_tensorrt(["-DGOOGLE_TENSORRT=1"])
       + if_mkl(["-DINTEL_MKL=1", "-DEIGEN_USE_VML"])
+      + if_mkl_open_source_only(["-DDO_NOT_USE_ML"])
       + if_mkl_lnx_x64(["-fopenmp"])
       + if_android_arm(["-mfpu=neon"])
       + if_linux_x86_64(["-msse3"])
@@ -228,6 +245,7 @@ def tf_copts(android_optimization_level_override="-O2", is_external=False):
             clean_dep("//tensorflow:windows"): get_win_copts(is_external),
             clean_dep("//tensorflow:windows_msvc"): get_win_copts(is_external),
             clean_dep("//tensorflow:ios"): ["-std=c++11"],
+            clean_dep("//tensorflow:no_lgpl_deps"): ["-D__TENSORFLOW_NO_LGPL_DEPS__", "-pthread"],
             "//conditions:default": ["-pthread"]
       }))
 
@@ -301,18 +319,36 @@ def tf_binary_additional_srcs():
           clean_dep("//tensorflow:libtensorflow_framework.so"),
       ])
 
+
+# Helper functions to add kernel dependencies to tf binaries when using dynamic
+# kernel linking.
+def tf_binary_dynamic_kernel_dsos(kernels):
+  return if_dynamic_kernels(
+      extra_deps=["libtfkernel_%s.so" % clean_dep(k) for k in kernels],
+      otherwise=[])
+
+# Helper functions to add kernel dependencies to tf binaries when using static
+# kernel linking.
+def tf_binary_dynamic_kernel_deps(kernels):
+  return if_dynamic_kernels(
+      extra_deps=[],
+      otherwise=kernels)
+
 def tf_cc_shared_object(
     name,
     srcs=[],
     deps=[],
+    data=[],
     linkopts=[],
     framework_so=tf_binary_additional_srcs(),
+    kernels=[],
     **kwargs):
   native.cc_binary(
       name=name,
       srcs=srcs + framework_so,
-      deps=deps,
+      deps=deps + tf_binary_dynamic_kernel_deps(kernels),
       linkshared = 1,
+      data = data + tf_binary_dynamic_kernel_dsos(kernels),
       linkopts=linkopts + _rpath_linkopts(name) + select({
           clean_dep("//tensorflow:darwin"): [
               "-Wl,-install_name,@rpath/" + name.split("/")[-1],
@@ -336,18 +372,21 @@ register_extension_info(
 def tf_cc_binary(name,
                  srcs=[],
                  deps=[],
+                 data=[],
                  linkopts=[],
                  copts=tf_copts(),
+                 kernels=[],
                  **kwargs):
   native.cc_binary(
       name=name,
       copts=copts,
       srcs=srcs + tf_binary_additional_srcs(),
-      deps=deps + if_mkl(
+      deps=deps + tf_binary_dynamic_kernel_deps(kernels) + if_mkl(
           [
               "//third_party/mkl:intel_binary_blob",
           ],
       ),
+      data=data +  tf_binary_dynamic_kernel_dsos(kernels),
       linkopts=linkopts + _rpath_linkopts(name),
       **kwargs)
 
@@ -532,9 +571,6 @@ def tf_gen_op_wrappers_cc(name,
 #     is invalid to specify both "hidden" and "op_whitelist".
 #   cc_linkopts: Optional linkopts to be added to tf_cc_binary that contains the
 #     specified ops.
-#   gen_locally: if True, the genrule to generate the Python library will be run
-#     without sandboxing. This would help when the genrule depends on symlinks
-#     which may not be supported in the sandbox.
 def tf_gen_op_wrapper_py(name,
                          out=None,
                          hidden=None,
@@ -545,8 +581,7 @@ def tf_gen_op_wrapper_py(name,
                          generated_target_name=None,
                          op_whitelist=[],
                          cc_linkopts=[],
-                         api_def_srcs=[],
-                         gen_locally=False):
+                         api_def_srcs=[]):
   if (hidden or hidden_file) and op_whitelist:
     fail('Cannot pass specify both hidden and op_whitelist.')
 
@@ -601,7 +636,6 @@ def tf_gen_op_wrapper_py(name,
         outs=[out],
         srcs=api_def_srcs + [hidden_file],
         tools=[tool_name] + tf_binary_additional_srcs(),
-        local = (1 if gen_locally else 0),
         cmd=("$(location " + tool_name + ") " + api_def_args_str +
              " @$(location " + hidden_file + ") " +
              ("1" if require_shape_functions else "0") + " > $@"))
@@ -611,7 +645,6 @@ def tf_gen_op_wrapper_py(name,
         outs=[out],
         srcs=api_def_srcs,
         tools=[tool_name] + tf_binary_additional_srcs(),
-        local = (1 if gen_locally else 0),
         cmd=("$(location " + tool_name + ") " + api_def_args_str + " " +
              op_list_arg + " " +
              ("1" if require_shape_functions else "0") + " " +
@@ -641,11 +674,13 @@ def tf_gen_op_wrapper_py(name,
 def tf_cc_test(name,
                srcs,
                deps,
+               data=[],
                linkstatic=0,
                extra_copts=[],
                suffix="",
                linkopts=[],
                nocopts=None,
+               kernels=[],
                **kwargs):
   native.cc_test(
       name="%s%s" % (name, suffix),
@@ -665,11 +700,12 @@ def tf_cc_test(name,
             "-lm"
         ],
       }) + linkopts + _rpath_linkopts(name),
-      deps=deps + if_mkl(
+      deps=deps + tf_binary_dynamic_kernel_deps(kernels) + if_mkl(
           [
               "//third_party/mkl:intel_binary_blob",
           ],
       ),
+      data=data + tf_binary_dynamic_kernel_dsos(kernels),
       # Nested select() statements seem not to be supported when passed to
       # linkstatic, and we already have a cuda select() passed in to this
       # function.
@@ -770,6 +806,7 @@ def tf_cuda_only_cc_test(name,
                     size="medium",
                     linkstatic=0,
                     args=[],
+                    kernels=[],
                     linkopts=[]):
   native.cc_test(
       name="%s%s" % (name, "_gpu"),
@@ -777,8 +814,8 @@ def tf_cuda_only_cc_test(name,
       size=size,
       args=args,
       copts= _cuda_copts() + tf_copts(),
-      data=data,
-      deps=deps + if_cuda([
+      data=data + tf_binary_dynamic_kernel_dsos(kernels),
+      deps=deps + tf_binary_dynamic_kernel_deps(kernels) + if_cuda([
           clean_dep("//tensorflow/core:cuda"),
           clean_dep("//tensorflow/core:gpu_lib")]),
       linkopts=if_not_windows(["-lpthread", "-lm"]) + linkopts + _rpath_linkopts(name),
@@ -821,10 +858,15 @@ def tf_cc_tests(srcs,
 def tf_cc_test_mkl(srcs,
                    deps,
                    name="",
+                   data=[],
                    linkstatic=0,
                    tags=[],
                    size="medium",
+                   kernels=[],
                    args=None):
+  # -fno-exceptions in nocopts breaks compilation if header modules are enabled.
+  disable_header_modules = ["-use_header_modules"]
+
   for src in srcs:
     native.cc_test(
       name=src_to_test_name(src),
@@ -841,15 +883,17 @@ def tf_cc_test_mkl(srcs,
             "-lm"
         ],
       }) + _rpath_linkopts(src_to_test_name(src)),
-      deps=deps + if_mkl(
+      deps=deps + tf_binary_dynamic_kernel_deps(kernels) + if_mkl(
           [
               "//third_party/mkl:intel_binary_blob",
           ],
       ),
+      data=data + tf_binary_dynamic_kernel_dsos(kernels),
       linkstatic=linkstatic,
       tags=tags,
       size=size,
       args=args,
+      features=disable_header_modules,
       nocopts="-fno-exceptions")
 
 
@@ -884,12 +928,13 @@ def tf_cuda_cc_tests(srcs,
 def tf_java_test(name,
                  srcs=[],
                  deps=[],
+                 kernels=[],
                  *args,
                  **kwargs):
   native.java_test(
       name=name,
       srcs=srcs,
-      deps=deps + tf_binary_additional_srcs(),
+      deps=deps + tf_binary_additional_srcs() + tf_binary_dynamic_kernel_dsos(kernels) + tf_binary_dynamic_kernel_deps(kernels),
       *args,
       **kwargs)
 
@@ -984,16 +1029,17 @@ register_extension_info(
     label_regex_for_dep = "{extension_name}",
 )
 
-def tf_kernel_library(name,
-                      prefix=None,
-                      srcs=None,
-                      gpu_srcs=None,
-                      hdrs=None,
-                      deps=None,
-                      alwayslink=1,
-                      copts=None,
-                      is_external=False,
-                      **kwargs):
+def tf_kernel_library(
+        name,
+        prefix = None,
+        srcs = None,
+        gpu_srcs = None,
+        hdrs = None,
+        deps = None,
+        alwayslink = 1,
+        copts = None,
+        is_external = False,
+        **kwargs):
   """A rule to build a TensorFlow OpKernel.
 
   May either specify srcs/hdrs or prefix.  Similar to tf_cuda_library,
@@ -1023,6 +1069,7 @@ def tf_kernel_library(name,
     deps = []
   if not copts:
     copts = []
+  textual_hdrs = []
   copts = copts + tf_copts(is_external=is_external)
   if prefix:
     if native.glob([prefix + "*.cu.cc"], exclude=["*test*"]):
@@ -1033,8 +1080,13 @@ def tf_kernel_library(name,
     srcs = srcs + native.glob(
         [prefix + "*.cc"], exclude=[prefix + "*test*", prefix + "*.cu.cc"])
     hdrs = hdrs + native.glob(
-        [prefix + "*.h"], exclude=[prefix + "*test*", prefix + "*.cu.h"])
-
+            [prefix + "*.h"],
+            exclude = [prefix + "*test*", prefix + "*.cu.h", prefix + "*impl.h"],
+        )
+    textual_hdrs = native.glob(
+            [prefix + "*impl.h"],
+            exclude = [prefix + "*test*", prefix + "*.cu.h"],
+        )
   cuda_deps = [clean_dep("//tensorflow/core:gpu_lib")]
   if gpu_srcs:
     for gpu_src in gpu_srcs:
@@ -1044,10 +1096,15 @@ def tf_kernel_library(name,
     tf_gpu_kernel_library(
         name=name + "_gpu", srcs=gpu_srcs, deps=deps, **kwargs)
     cuda_deps.extend([":" + name + "_gpu"])
+  kwargs["tags"] = kwargs.get("tags", []) + [
+      "req_dep=%s" % clean_dep("//tensorflow/core:gpu_lib"),
+      "req_dep=@local_config_cuda//cuda:cuda_headers",
+  ]
   tf_cuda_library(
       name=name,
       srcs=srcs,
       hdrs=hdrs,
+      textual_hdrs = textual_hdrs,
       copts=copts,
       cuda_deps=cuda_deps,
       linkstatic=1,  # Needed since alwayslink is broken in bazel b/27630669
@@ -1055,6 +1112,15 @@ def tf_kernel_library(name,
       deps=deps,
       **kwargs)
 
+  # TODO(gunan): CUDA dependency not clear here. Fix it.
+  tf_cc_shared_object(
+      name="libtfkernel_%s.so" % name,
+      srcs=srcs + hdrs,
+      copts=copts,
+      deps=deps,
+      tags=["manual", "notap"])
+
+
 register_extension_info(
     extension_name = "tf_kernel_library",
     label_regex_for_dep = "{extension_name}(_gpu)?",
@@ -1081,6 +1147,9 @@ def tf_mkl_kernel_library(name,
     hdrs = hdrs + native.glob(
         [prefix + "*.h"])
 
+  # -fno-exceptions in nocopts breaks compilation if header modules are enabled.
+  disable_header_modules = ["-use_header_modules"]
+
   native.cc_library(
       name=name,
       srcs=if_mkl(srcs),
@@ -1088,7 +1157,8 @@ def tf_mkl_kernel_library(name,
       deps=deps,
       alwayslink=alwayslink,
       copts=copts,
-      nocopts=nocopts
+      nocopts=nocopts,
+      features = disable_header_modules
   )
 
 register_extension_info(
@@ -1135,7 +1205,6 @@ _py_wrap_cc = rule(
             allow_files = True,
         ),
         "swig_includes": attr.label_list(
-            cfg = "data",
             allow_files = True,
         ),
         "deps": attr.label_list(
@@ -1327,7 +1396,7 @@ def tf_custom_op_library(name, srcs=[], gpu_srcs=[], deps=[], linkopts=[]):
       name=name,
       srcs=srcs,
       deps=deps + if_cuda(cuda_deps),
-      data=[name + "_check_deps"],
+      data=if_static([name + "_check_deps"]),
       copts=tf_copts(is_external=True),
       features = ["windows_export_all_symbols"],
       linkopts=linkopts + select({
@@ -1423,7 +1492,7 @@ def tf_py_wrap_cc(name,
       srcs=srcs,
       swig_includes=swig_includes,
       deps=deps + extra_deps,
-      toolchain_deps=["//tools/defaults:crosstool"],
+      toolchain_deps=["@bazel_tools//tools/cpp:current_cc_toolchain"],
       module_name=module_name,
       py_module_name=name)
   vscriptname=name+"_versionscript"
diff --git a/tensorflow/tools/api/golden/tensorflow.-config-proto.-experimental.pbtxt b/tensorflow/tools/api/golden/tensorflow.-config-proto.-experimental.pbtxt
index 9e09a8d48e..eb41deee13 100644
--- a/tensorflow/tools/api/golden/tensorflow.-config-proto.-experimental.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-config-proto.-experimental.pbtxt
@@ -8,5 +8,17 @@ tf_proto {
       label: LABEL_OPTIONAL
       type: TYPE_STRING
     }
+    field {
+      name: "client_handles_error_formatting"
+      number: 2
+      label: LABEL_OPTIONAL
+      type: TYPE_BOOL
+    }
+    field {
+      name: "executor_type"
+      number: 3
+      label: LABEL_OPTIONAL
+      type: TYPE_STRING
+    }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt b/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt
index 4af4ed70ef..e565b903d2 100644
--- a/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-config-proto.pbtxt
@@ -131,6 +131,18 @@ tf_proto {
         label: LABEL_OPTIONAL
         type: TYPE_STRING
       }
+      field {
+        name: "client_handles_error_formatting"
+        number: 2
+        label: LABEL_OPTIONAL
+        type: TYPE_BOOL
+      }
+      field {
+        name: "executor_type"
+        number: 3
+        label: LABEL_OPTIONAL
+        type: TYPE_STRING
+      }
     }
   }
 }
diff --git a/tensorflow/tools/api/golden/tensorflow.-variable-scope.pbtxt b/tensorflow/tools/api/golden/tensorflow.-variable-scope.pbtxt
index ec1f72453f..c13eb7b8bb 100644
--- a/tensorflow/tools/api/golden/tensorflow.-variable-scope.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-variable-scope.pbtxt
@@ -56,7 +56,7 @@ tf_class {
   }
   member_method {
     name: "get_variable"
-    argspec: "args=[\'self\', \'var_store\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'reuse\', \'trainable\', \'collections\', \'caching_device\', \'partitioner\', \'validate_shape\', \'use_resource\', \'custom_getter\', \'constraint\', \'synchronization\', \'aggregation\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\'], "
+    argspec: "args=[\'self\', \'var_store\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'reuse\', \'trainable\', \'collections\', \'caching_device\', \'partitioner\', \'validate_shape\', \'use_resource\', \'custom_getter\', \'constraint\', \'synchronization\', \'aggregation\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\'], "
   }
   member_method {
     name: "global_variables"
diff --git a/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt b/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
index 23b552cc38..e841c4ad89 100644
--- a/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
@@ -49,7 +49,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'initial_value\', \'trainable\', \'collections\', \'validate_shape\', \'caching_device\', \'name\', \'variable_def\', \'dtype\', \'expected_shape\', \'import_scope\', \'constraint\'], varargs=None, keywords=None, defaults=[\'None\', \'True\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'initial_value\', \'trainable\', \'collections\', \'validate_shape\', \'caching_device\', \'name\', \'variable_def\', \'dtype\', \'expected_shape\', \'import_scope\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\'], varargs=None, keywords=None, defaults=[\'None\', \'True\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\'], "
   }
   member_method {
     name: "assign"
diff --git a/tensorflow/tools/api/golden/tensorflow.compat.pbtxt b/tensorflow/tools/api/golden/tensorflow.compat.pbtxt
index bab480ff9b..f1d760603e 100644
--- a/tensorflow/tools/api/golden/tensorflow.compat.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.compat.pbtxt
@@ -33,6 +33,14 @@ tf_module {
     argspec: "args=[\'bytes_or_text\', \'encoding\'], varargs=None, keywords=None, defaults=[\'utf-8\'], "
   }
   member_method {
+    name: "forward_compatibility_horizon"
+    argspec: "args=[\'year\', \'month\', \'day\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
+    name: "forward_compatible"
+    argspec: "args=[\'year\', \'month\', \'day\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
     name: "path_to_str"
     argspec: "args=[\'path\'], varargs=None, keywords=None, defaults=None"
   }
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-iterator.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-iterator.pbtxt
index 1f9aeb6ad6..4f0147a523 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-iterator.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-iterator.pbtxt
@@ -1,6 +1,7 @@
 path: "tensorflow.data.Iterator"
 tf_class {
   is_instance: "<class \'tensorflow.python.data.ops.iterator_ops.Iterator\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.base.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "initializer"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-classifier.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-classifier.pbtxt
index 099838fa65..c23b04b4ef 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-classifier.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-classifier.pbtxt
@@ -21,7 +21,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'feature_columns\', \'n_batches_per_layer\', \'model_dir\', \'n_classes\', \'weight_column\', \'label_vocabulary\', \'n_trees\', \'max_depth\', \'learning_rate\', \'l1_regularization\', \'l2_regularization\', \'tree_complexity\', \'min_node_weight\', \'config\'], varargs=None, keywords=None, defaults=[\'None\', \'<object object instance>\', \'None\', \'None\', \'100\', \'6\', \'0.1\', \'0.0\', \'0.0\', \'0.0\', \'0.0\', \'None\'], "
+    argspec: "args=[\'self\', \'feature_columns\', \'n_batches_per_layer\', \'model_dir\', \'n_classes\', \'weight_column\', \'label_vocabulary\', \'n_trees\', \'max_depth\', \'learning_rate\', \'l1_regularization\', \'l2_regularization\', \'tree_complexity\', \'min_node_weight\', \'config\', \'center_bias\', \'pruning_mode\'], varargs=None, keywords=None, defaults=[\'None\', \'<object object instance>\', \'None\', \'None\', \'100\', \'6\', \'0.1\', \'0.0\', \'0.0\', \'0.0\', \'0.0\', \'None\', \'False\', \'none\'], "
   }
   member_method {
     name: "eval_dir"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-regressor.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-regressor.pbtxt
index 87bd19a23a..6878d28fff 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-regressor.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-boosted-trees-regressor.pbtxt
@@ -21,7 +21,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'feature_columns\', \'n_batches_per_layer\', \'model_dir\', \'label_dimension\', \'weight_column\', \'n_trees\', \'max_depth\', \'learning_rate\', \'l1_regularization\', \'l2_regularization\', \'tree_complexity\', \'min_node_weight\', \'config\'], varargs=None, keywords=None, defaults=[\'None\', \'<object object instance>\', \'None\', \'100\', \'6\', \'0.1\', \'0.0\', \'0.0\', \'0.0\', \'0.0\', \'None\'], "
+    argspec: "args=[\'self\', \'feature_columns\', \'n_batches_per_layer\', \'model_dir\', \'label_dimension\', \'weight_column\', \'n_trees\', \'max_depth\', \'learning_rate\', \'l1_regularization\', \'l2_regularization\', \'tree_complexity\', \'min_node_weight\', \'config\', \'center_bias\', \'pruning_mode\'], varargs=None, keywords=None, defaults=[\'None\', \'<object object instance>\', \'None\', \'100\', \'6\', \'0.1\', \'0.0\', \'0.0\', \'0.0\', \'0.0\', \'None\', \'False\', \'none\'], "
   }
   member_method {
     name: "eval_dir"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-classifier.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-classifier.pbtxt
index 49a3e898c5..9c1c072124 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-classifier.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-classifier.pbtxt
@@ -21,7 +21,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'model_dir\', \'linear_feature_columns\', \'linear_optimizer\', \'dnn_feature_columns\', \'dnn_optimizer\', \'dnn_hidden_units\', \'dnn_activation_fn\', \'dnn_dropout\', \'n_classes\', \'weight_column\', \'label_vocabulary\', \'input_layer_partitioner\', \'config\', \'warm_start_from\', \'loss_reduction\', \'batch_norm\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'Ftrl\', \'None\', \'Adagrad\', \'None\', \'<function relu instance>\', \'None\', \'2\', \'None\', \'None\', \'None\', \'None\', \'None\', \'weighted_sum\', \'False\'], "
+    argspec: "args=[\'self\', \'model_dir\', \'linear_feature_columns\', \'linear_optimizer\', \'dnn_feature_columns\', \'dnn_optimizer\', \'dnn_hidden_units\', \'dnn_activation_fn\', \'dnn_dropout\', \'n_classes\', \'weight_column\', \'label_vocabulary\', \'input_layer_partitioner\', \'config\', \'warm_start_from\', \'loss_reduction\', \'batch_norm\', \'linear_sparse_combiner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'Ftrl\', \'None\', \'Adagrad\', \'None\', \'<function relu instance>\', \'None\', \'2\', \'None\', \'None\', \'None\', \'None\', \'None\', \'weighted_sum\', \'False\', \'sum\'], "
   }
   member_method {
     name: "eval_dir"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-regressor.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-regressor.pbtxt
index 4b81613c92..7391d4b07a 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-regressor.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-d-n-n-linear-combined-regressor.pbtxt
@@ -21,7 +21,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'model_dir\', \'linear_feature_columns\', \'linear_optimizer\', \'dnn_feature_columns\', \'dnn_optimizer\', \'dnn_hidden_units\', \'dnn_activation_fn\', \'dnn_dropout\', \'label_dimension\', \'weight_column\', \'input_layer_partitioner\', \'config\', \'warm_start_from\', \'loss_reduction\', \'batch_norm\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'Ftrl\', \'None\', \'Adagrad\', \'None\', \'<function relu instance>\', \'None\', \'1\', \'None\', \'None\', \'None\', \'None\', \'weighted_sum\', \'False\'], "
+    argspec: "args=[\'self\', \'model_dir\', \'linear_feature_columns\', \'linear_optimizer\', \'dnn_feature_columns\', \'dnn_optimizer\', \'dnn_hidden_units\', \'dnn_activation_fn\', \'dnn_dropout\', \'label_dimension\', \'weight_column\', \'input_layer_partitioner\', \'config\', \'warm_start_from\', \'loss_reduction\', \'batch_norm\', \'linear_sparse_combiner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'Ftrl\', \'None\', \'Adagrad\', \'None\', \'<function relu instance>\', \'None\', \'1\', \'None\', \'None\', \'None\', \'None\', \'weighted_sum\', \'False\', \'sum\'], "
   }
   member_method {
     name: "eval_dir"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-linear-classifier.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-linear-classifier.pbtxt
index cb578759ee..154f171e89 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-linear-classifier.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-linear-classifier.pbtxt
@@ -21,7 +21,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'feature_columns\', \'model_dir\', \'n_classes\', \'weight_column\', \'label_vocabulary\', \'optimizer\', \'config\', \'partitioner\', \'warm_start_from\', \'loss_reduction\'], varargs=None, keywords=None, defaults=[\'None\', \'2\', \'None\', \'None\', \'Ftrl\', \'None\', \'None\', \'None\', \'weighted_sum\'], "
+    argspec: "args=[\'self\', \'feature_columns\', \'model_dir\', \'n_classes\', \'weight_column\', \'label_vocabulary\', \'optimizer\', \'config\', \'partitioner\', \'warm_start_from\', \'loss_reduction\', \'sparse_combiner\'], varargs=None, keywords=None, defaults=[\'None\', \'2\', \'None\', \'None\', \'Ftrl\', \'None\', \'None\', \'None\', \'weighted_sum\', \'sum\'], "
   }
   member_method {
     name: "eval_dir"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-linear-regressor.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-linear-regressor.pbtxt
index fcd01bb663..4d46d1e6b6 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-linear-regressor.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-linear-regressor.pbtxt
@@ -21,7 +21,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'feature_columns\', \'model_dir\', \'label_dimension\', \'weight_column\', \'optimizer\', \'config\', \'partitioner\', \'warm_start_from\', \'loss_reduction\'], varargs=None, keywords=None, defaults=[\'None\', \'1\', \'None\', \'Ftrl\', \'None\', \'None\', \'None\', \'weighted_sum\'], "
+    argspec: "args=[\'self\', \'feature_columns\', \'model_dir\', \'label_dimension\', \'weight_column\', \'optimizer\', \'config\', \'partitioner\', \'warm_start_from\', \'loss_reduction\', \'sparse_combiner\'], varargs=None, keywords=None, defaults=[\'None\', \'1\', \'None\', \'Ftrl\', \'None\', \'None\', \'None\', \'weighted_sum\', \'sum\'], "
   }
   member_method {
     name: "eval_dir"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.-run-config.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.-run-config.pbtxt
index c8da55d802..bf1f94b6ae 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.-run-config.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.-run-config.pbtxt
@@ -11,6 +11,10 @@ tf_class {
     mtype: "<type \'property\'>"
   }
   member {
+    name: "eval_distribute"
+    mtype: "<type \'property\'>"
+  }
+  member {
     name: "evaluation_master"
     mtype: "<type \'property\'>"
   }
@@ -51,6 +55,10 @@ tf_class {
     mtype: "<type \'property\'>"
   }
   member {
+    name: "protocol"
+    mtype: "<type \'property\'>"
+  }
+  member {
     name: "save_checkpoints_secs"
     mtype: "<type \'property\'>"
   }
@@ -88,7 +96,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'model_dir\', \'tf_random_seed\', \'save_summary_steps\', \'save_checkpoints_steps\', \'save_checkpoints_secs\', \'session_config\', \'keep_checkpoint_max\', \'keep_checkpoint_every_n_hours\', \'log_step_count_steps\', \'train_distribute\', \'device_fn\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'100\', \'<object object instance>\', \'<object object instance>\', \'None\', \'5\', \'10000\', \'100\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'model_dir\', \'tf_random_seed\', \'save_summary_steps\', \'save_checkpoints_steps\', \'save_checkpoints_secs\', \'session_config\', \'keep_checkpoint_max\', \'keep_checkpoint_every_n_hours\', \'log_step_count_steps\', \'train_distribute\', \'device_fn\', \'protocol\', \'eval_distribute\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'100\', \'<object object instance>\', \'<object object instance>\', \'None\', \'5\', \'10000\', \'100\', \'None\', \'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "replace"
diff --git a/tensorflow/tools/api/golden/tensorflow.image.pbtxt b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
index e89b4dbffd..5c46dc5ee7 100644
--- a/tensorflow/tools/api/golden/tensorflow.image.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
@@ -121,6 +121,14 @@ tf_module {
     argspec: "args=[\'boxes\', \'scores\', \'max_output_size\', \'iou_threshold\', \'score_threshold\', \'name\'], varargs=None, keywords=None, defaults=[\'0.5\', \'-inf\', \'None\'], "
   }
   member_method {
+    name: "non_max_suppression_overlaps"
+    argspec: "args=[\'overlaps\', \'scores\', \'max_output_size\', \'overlap_threshold\', \'score_threshold\', \'name\'], varargs=None, keywords=None, defaults=[\'0.5\', \'-inf\', \'None\'], "
+  }
+  member_method {
+    name: "non_max_suppression_padded"
+    argspec: "args=[\'boxes\', \'scores\', \'max_output_size\', \'iou_threshold\', \'score_threshold\', \'pad_to_max_output_size\', \'name\'], varargs=None, keywords=None, defaults=[\'0.5\', \'-inf\', \'False\', \'None\'], "
+  }
+  member_method {
     name: "pad_to_bounding_box"
     argspec: "args=[\'image\', \'offset_height\', \'offset_width\', \'target_height\', \'target_width\'], varargs=None, keywords=None, defaults=None"
   }
diff --git a/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt b/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt
index eaf0036cac..bc0426f2f1 100644
--- a/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.initializers.pbtxt
@@ -45,6 +45,30 @@ tf_module {
     argspec: "args=[], varargs=None, keywords=None, defaults=None"
   }
   member_method {
+    name: "glorot_normal"
+    argspec: "args=[\'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "glorot_uniform"
+    argspec: "args=[\'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'None\', \"<dtype: \'float32\'>\"], "
+  }
+  member_method {
+    name: "he_normal"
+    argspec: "args=[\'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "he_uniform"
+    argspec: "args=[\'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "lecun_normal"
+    argspec: "args=[\'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
+    name: "lecun_uniform"
+    argspec: "args=[\'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
+  member_method {
     name: "local_variables"
     argspec: "args=[], varargs=None, keywords=None, defaults=None"
   }
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
index 11cdd6f0b5..e579fe6a1a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
@@ -119,7 +119,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
@@ -135,7 +135,7 @@ tf_class {
   }
   member_method {
     name: "compile"
-    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\', \'distribute\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "compute_mask"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
index 4afad3e4df..6f05cdd093 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
@@ -124,7 +124,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
@@ -140,7 +140,7 @@ tf_class {
   }
   member_method {
     name: "compile"
-    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\', \'distribute\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "compute_mask"
@@ -267,6 +267,10 @@ tf_class {
     argspec: "args=[\'self\', \'line_length\', \'positions\', \'print_fn\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
   }
   member_method {
+    name: "symbolic_set_inputs"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
     name: "test_on_batch"
     argspec: "args=[\'self\', \'x\', \'y\', \'sample_weight\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
   }
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.activations.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.activations.pbtxt
index 2cd83baf65..2e9de9ebb2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.activations.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.activations.pbtxt
@@ -22,7 +22,7 @@ tf_module {
   }
   member_method {
     name: "relu"
-    argspec: "args=[\'x\', \'alpha\', \'max_value\'], varargs=None, keywords=None, defaults=[\'0.0\', \'None\'], "
+    argspec: "args=[\'x\', \'alpha\', \'max_value\', \'threshold\'], varargs=None, keywords=None, defaults=[\'0.0\', \'None\', \'0\'], "
   }
   member_method {
     name: "selu"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.backend.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.backend.pbtxt
index c6149e8aa7..126ce8db6a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.backend.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.backend.pbtxt
@@ -70,7 +70,7 @@ tf_module {
   }
   member_method {
     name: "categorical_crossentropy"
-    argspec: "args=[\'target\', \'output\', \'from_logits\'], varargs=None, keywords=None, defaults=[\'False\'], "
+    argspec: "args=[\'target\', \'output\', \'from_logits\', \'axis\'], varargs=None, keywords=None, defaults=[\'False\', \'-1\'], "
   }
   member_method {
     name: "clear_session"
@@ -366,7 +366,7 @@ tf_module {
   }
   member_method {
     name: "relu"
-    argspec: "args=[\'x\', \'alpha\', \'max_value\'], varargs=None, keywords=None, defaults=[\'0.0\', \'None\'], "
+    argspec: "args=[\'x\', \'alpha\', \'max_value\', \'threshold\'], varargs=None, keywords=None, defaults=[\'0.0\', \'None\', \'0\'], "
   }
   member_method {
     name: "repeat"
@@ -462,7 +462,7 @@ tf_module {
   }
   member_method {
     name: "sparse_categorical_crossentropy"
-    argspec: "args=[\'target\', \'output\', \'from_logits\'], varargs=None, keywords=None, defaults=[\'False\'], "
+    argspec: "args=[\'target\', \'output\', \'from_logits\', \'axis\'], varargs=None, keywords=None, defaults=[\'False\', \'-1\'], "
   }
   member_method {
     name: "spatial_2d_padding"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.callbacks.-tensor-board.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.callbacks.-tensor-board.pbtxt
index 2f52464315..e58ba18c1c 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.callbacks.-tensor-board.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.callbacks.-tensor-board.pbtxt
@@ -5,7 +5,7 @@ tf_class {
   is_instance: "<type \'object\'>"
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'log_dir\', \'histogram_freq\', \'batch_size\', \'write_graph\', \'write_grads\', \'write_images\'], varargs=None, keywords=None, defaults=[\'./logs\', \'0\', \'32\', \'True\', \'False\', \'False\'], "
+    argspec: "args=[\'self\', \'log_dir\', \'histogram_freq\', \'batch_size\', \'write_graph\', \'write_grads\', \'write_images\', \'embeddings_freq\', \'embeddings_layer_names\', \'embeddings_metadata\', \'embeddings_data\'], varargs=None, keywords=None, defaults=[\'./logs\', \'0\', \'32\', \'True\', \'False\', \'False\', \'0\', \'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "on_batch_begin"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt
index 14a667870d..8645e54302 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.initializers.pbtxt
@@ -90,11 +90,11 @@ tf_module {
   }
   member_method {
     name: "glorot_normal"
-    argspec: "args=[\'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'None\', \"<dtype: \'float32\'>\"], "
   }
   member_method {
     name: "glorot_uniform"
-    argspec: "args=[\'seed\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'seed\', \'dtype\'], varargs=None, keywords=None, defaults=[\'None\', \"<dtype: \'float32\'>\"], "
   }
   member_method {
     name: "he_normal"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
index 2bf973debb..86e328888e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
index 03f20e72c2..b0ed545781 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
index 4b46b8d15a..42f98ed03d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
index d8a1c76fd0..000898a4be 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
index 622926bc4b..380b49f99c 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
index 82100d8e09..82db5e6137 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
index 408061077c..b6ff688ec3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
index a3c8031104..b41290f8b0 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
index e2dfaca29f..88a033e61f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
index 4f068d2066..c1b9b96044 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
index b8c261a743..f59f7727a3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
index 4ccd6cace6..7d3744ed92 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
index 2790e5fd85..3fd4ccdab2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
@@ -107,7 +107,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
index b1326bd0e6..ba21b50be4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
index e3ac3dbf28..46f9fa2bbb 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
@@ -188,7 +188,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
index 1117a695a3..c3ad326589 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
index b9de142142..fd9eb43066 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
index deb535e06e..40d61688f2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
index 9a9a223fba..b8c227d725 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
index 1c59b0bdf6..095d35e574 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
index 30cf5489f4..8f99961198 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
index 0ec69508d5..96d522a016 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
index 4cd8928403..de2824dab4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
index 4b4912496d..1d563241d8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
index d0ad9cf567..c87e52c537 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
index 98cff95a7f..dccf5523e3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
index 2357498b46..7ac4116d92 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
index 3324cbff30..024f72705d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-g-r-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-g-r-u.pbtxt
index 6c81823654..4e0233331b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-g-r-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-g-r-u.pbtxt
@@ -108,7 +108,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-l-s-t-m.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-l-s-t-m.pbtxt
index 487e04fd07..32d46ce8f3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-l-s-t-m.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cu-d-n-n-l-s-t-m.pbtxt
@@ -108,7 +108,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
index 137e7cced4..858486c725 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-depthwise-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-depthwise-conv2-d.pbtxt
index 7161665d25..f65d750926 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-depthwise-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-depthwise-conv2-d.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
index 24affa2481..2e71ef503d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
index 7ba19a4269..42533bcd21 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
index 503aa9162c..b5df169417 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
index 1737e590a2..0ea17919a9 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
index 021d024dc2..a33248bc00 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
index 65387008bf..4ba21a25cd 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
index 4f791acf05..a7a570418e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
@@ -171,7 +171,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
index abc30e54e0..763bc23113 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
index 20791bb448..3c50a3d7f2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
index 449a91d873..ac78bdafad 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
index bb361e1297..275282d9d2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
index e564bf3216..0e31e6058b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
index 4cb9cc3ec8..aacd0b1791 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
index 5ed52b88ae..c236548663 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
index f4559d29d7..6b9c0290aa 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
index 64e2d061e2..0d7b2211e6 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
index 3372ad6453..d080ad6aed 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
index 08a6860bcd..fcb0a109da 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
index 22c9eab64f..1d0e22abd0 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
index 74c405ba9b..653c9f547b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
index 39f6f98193..cdbaf82cf6 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
index 7b25e80b6b..230c5e9034 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
index 3619b8bfc4..511456e740 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
index 8ef3d71dd8..4a3492ebd6 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
@@ -171,7 +171,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
index ecbaa9ce2c..5d05cf689f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
index 9b90db1e5e..7efa29be77 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
@@ -97,7 +97,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
index 3c60eaab7f..0ca8e0b52c 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
index 3dac1ff342..f754fa1da8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
index 7f1b5db4d3..c9516b8f07 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
index b3e31000f3..850ecff974 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
index bbd9d1b0dc..7c69e31f9a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
index fe72beea80..fba42642d7 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
index e9bf57b2b0..9c277411ea 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
index 0eecc58a2b..7c2f6ccc8a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
index 96785a7d85..802178dba6 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
index 42c46cccb3..e870dfe9ad 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
index ac816f68d4..c1337ce0cb 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-minimum.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-minimum.pbtxt
index 56e32e9d36..ed27a62765 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-minimum.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-minimum.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
index 9ae99563e9..b9f05cb3e5 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
index 815f3bc2d1..336d9f76fb 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
index e704992b4a..46282217e0 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
index b3a58fa11e..42cd7e87ee 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
@@ -102,7 +102,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-re-l-u.pbtxt
index f3a96ab895..4d3de58bd1 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-re-l-u.pbtxt
@@ -82,7 +82,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'max_value\'], varargs=None, keywords=kwargs, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'max_value\', \'negative_slope\', \'threshold\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'0\', \'0\'], "
   }
   member_method {
     name: "add_loss"
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
index 78f464583b..9f094a877a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
index 222344fd04..2f519a2438 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
index 55fddf576c..6b93116ba0 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
index 96314ce498..fd17115e27 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
index 88bdf99566..4b37a94478 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
index 6eeea7a8d1..5bdadca74a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
@@ -100,7 +100,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
index 3050d46249..9dfda96fc8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
index dda4c9358b..7b7684ccd2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
@@ -159,7 +159,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
index cc6275158b..3b15407fca 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
index 5eb7e75047..6d04415267 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
index 500cb8c14e..04950654d5 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
index 1113a7634f..c424e6dcc8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
index c4b9f93561..1160d2840f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
@@ -102,7 +102,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-subtract.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-subtract.pbtxt
index 35ad87ad5d..740a03367b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-subtract.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-subtract.pbtxt
@@ -99,7 +99,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
index 282c98d79a..a08c583adb 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
index acab93706b..c1294fed0f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
@@ -103,7 +103,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
index a5ec228a07..dc401d3ed0 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
index d8d8e0bfe9..4b5165ae97 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
index 97d6dc06fb..789af15fea 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
index ea9bb41b99..0536a7cee7 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
@@ -102,7 +102,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
index e6d1d2e089..8915353ec3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
index f62017305f..6efb5ef15a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
index 07a1fde5bd..4c33c5d0bf 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
@@ -98,7 +98,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt
index a97a9b5758..73b577da37 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.metrics.pbtxt
@@ -22,7 +22,7 @@ tf_module {
   }
   member_method {
     name: "binary_accuracy"
-    argspec: "args=[\'y_true\', \'y_pred\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'y_true\', \'y_pred\', \'threshold\'], varargs=None, keywords=None, defaults=[\'0.5\'], "
   }
   member_method {
     name: "binary_crossentropy"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
index 62aa929d32..56914e1746 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
@@ -119,7 +119,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
@@ -135,7 +135,7 @@ tf_class {
   }
   member_method {
     name: "compile"
-    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\', \'distribute\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "compute_mask"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
index 93ecbbce9b..4c1c54001d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
@@ -124,7 +124,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'partitioner\', \'use_resource\', \'synchronization\', \'aggregation\', \'getter\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
@@ -140,7 +140,7 @@ tf_class {
   }
   member_method {
     name: "compile"
-    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'optimizer\', \'loss\', \'metrics\', \'loss_weights\', \'sample_weight_mode\', \'weighted_metrics\', \'target_tensors\', \'distribute\'], varargs=None, keywords=kwargs, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "compute_mask"
@@ -267,6 +267,10 @@ tf_class {
     argspec: "args=[\'self\', \'line_length\', \'positions\', \'print_fn\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
   }
   member_method {
+    name: "symbolic_set_inputs"
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=None, defaults=None"
+  }
+  member_method {
     name: "test_on_batch"
     argspec: "args=[\'self\', \'x\', \'y\', \'sample_weight\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
   }
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
index 11067058d5..c82e67526b 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
index 3259e706d7..1d031cb5f8 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
index e561f2f415..a8dda6655d 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
index 3124a35c78..97f65ed894 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
@@ -108,7 +108,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
index b5ec61255a..ccd9578f0d 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
index b2c89ae66f..9cbb58d721 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
@@ -110,7 +110,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
index 9e4f4969dc..c75ea3911e 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
index 9850e6d765..5dc834e514 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
@@ -110,7 +110,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
index be113826cc..96ab209874 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
index 0d951bf633..7e9656b352 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
@@ -108,7 +108,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
index f1beeed9ef..e9a2269a6e 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
@@ -108,7 +108,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
index b75a012811..7d2eaaab2a 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
@@ -108,7 +108,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
index 80e0fb228b..8bc3eb26e9 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
@@ -106,7 +106,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
index 50ff484d73..6a0dcce56a 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
index cea809744c..b6c84edf2a 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
index ab9e89554c..062a02fa59 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
@@ -109,7 +109,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
index 4362568445..eaad0fb23e 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
@@ -110,7 +110,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
index 3cad824cd3..ece28a8ce9 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
@@ -110,7 +110,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.linalg.-linear-operator-zeros.__metaclass__.pbtxt b/tensorflow/tools/api/golden/tensorflow.linalg.-linear-operator-zeros.__metaclass__.pbtxt
new file mode 100644
index 0000000000..49ff85728f
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.linalg.-linear-operator-zeros.__metaclass__.pbtxt
@@ -0,0 +1,14 @@
+path: "tensorflow.linalg.LinearOperatorZeros.__metaclass__"
+tf_class {
+  is_instance: "<class \'abc.ABCMeta\'>"
+  member_method {
+    name: "__init__"
+  }
+  member_method {
+    name: "mro"
+  }
+  member_method {
+    name: "register"
+    argspec: "args=[\'cls\', \'subclass\'], varargs=None, keywords=None, defaults=None"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.linalg.-linear-operator-zeros.pbtxt b/tensorflow/tools/api/golden/tensorflow.linalg.-linear-operator-zeros.pbtxt
new file mode 100644
index 0000000000..a1b0e06b47
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.linalg.-linear-operator-zeros.pbtxt
@@ -0,0 +1,130 @@
+path: "tensorflow.linalg.LinearOperatorZeros"
+tf_class {
+  is_instance: "<class \'tensorflow.python.ops.linalg.linear_operator_zeros.LinearOperatorZeros\'>"
+  is_instance: "<class \'tensorflow.python.ops.linalg.linear_operator.LinearOperator\'>"
+  is_instance: "<type \'object\'>"
+  member {
+    name: "batch_shape"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "domain_dimension"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "dtype"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "graph_parents"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "is_non_singular"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "is_positive_definite"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "is_self_adjoint"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "is_square"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "name"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "range_dimension"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "shape"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "tensor_rank"
+    mtype: "<type \'property\'>"
+  }
+  member_method {
+    name: "__init__"
+    argspec: "args=[\'self\', \'num_rows\', \'num_columns\', \'batch_shape\', \'dtype\', \'is_non_singular\', \'is_self_adjoint\', \'is_positive_definite\', \'is_square\', \'assert_proper_shapes\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'False\', \'True\', \'False\', \'True\', \'False\', \'LinearOperatorZeros\'], "
+  }
+  member_method {
+    name: "add_to_tensor"
+    argspec: "args=[\'self\', \'mat\', \'name\'], varargs=None, keywords=None, defaults=[\'add_to_tensor\'], "
+  }
+  member_method {
+    name: "assert_non_singular"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'assert_non_singular\'], "
+  }
+  member_method {
+    name: "assert_positive_definite"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'assert_positive_definite\'], "
+  }
+  member_method {
+    name: "assert_self_adjoint"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'assert_self_adjoint\'], "
+  }
+  member_method {
+    name: "batch_shape_tensor"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'batch_shape_tensor\'], "
+  }
+  member_method {
+    name: "determinant"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'det\'], "
+  }
+  member_method {
+    name: "diag_part"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'diag_part\'], "
+  }
+  member_method {
+    name: "domain_dimension_tensor"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'domain_dimension_tensor\'], "
+  }
+  member_method {
+    name: "log_abs_determinant"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'log_abs_det\'], "
+  }
+  member_method {
+    name: "matmul"
+    argspec: "args=[\'self\', \'x\', \'adjoint\', \'adjoint_arg\', \'name\'], varargs=None, keywords=None, defaults=[\'False\', \'False\', \'matmul\'], "
+  }
+  member_method {
+    name: "matvec"
+    argspec: "args=[\'self\', \'x\', \'adjoint\', \'name\'], varargs=None, keywords=None, defaults=[\'False\', \'matvec\'], "
+  }
+  member_method {
+    name: "range_dimension_tensor"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'range_dimension_tensor\'], "
+  }
+  member_method {
+    name: "shape_tensor"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'shape_tensor\'], "
+  }
+  member_method {
+    name: "solve"
+    argspec: "args=[\'self\', \'rhs\', \'adjoint\', \'adjoint_arg\', \'name\'], varargs=None, keywords=None, defaults=[\'False\', \'False\', \'solve\'], "
+  }
+  member_method {
+    name: "solvevec"
+    argspec: "args=[\'self\', \'rhs\', \'adjoint\', \'name\'], varargs=None, keywords=None, defaults=[\'False\', \'solve\'], "
+  }
+  member_method {
+    name: "tensor_rank_tensor"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'tensor_rank_tensor\'], "
+  }
+  member_method {
+    name: "to_dense"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'to_dense\'], "
+  }
+  member_method {
+    name: "trace"
+    argspec: "args=[\'self\', \'name\'], varargs=None, keywords=None, defaults=[\'trace\'], "
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt b/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt
index 3b5845f99a..d979116887 100644
--- a/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt
@@ -52,6 +52,10 @@ tf_module {
     name: "LinearOperatorScaledIdentity"
     mtype: "<class \'abc.ABCMeta\'>"
   }
+  member {
+    name: "LinearOperatorZeros"
+    mtype: "<class \'abc.ABCMeta\'>"
+  }
   member_method {
     name: "adjoint"
     argspec: "args=[\'matrix\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
index a8d9e120cb..c74773000a 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
@@ -117,7 +117,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
index c039890e1f..d251f54806 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
@@ -117,7 +117,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
index 62c393de34..8a63b49180 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
@@ -116,7 +116,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
index f121ba7939..db1aae2757 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
@@ -120,7 +120,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
index 4583dc32b2..d76eab7eb8 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
@@ -117,7 +117,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
index 5016b6ac30..944db6ac93 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
@@ -117,7 +117,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
index 59623fc983..72b40cc9f7 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
@@ -116,7 +116,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
index e2ab5aaee9..a5c2b4aefd 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
@@ -115,7 +115,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
index bd2a6d61f8..61d5f04b22 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
@@ -116,7 +116,7 @@ tf_class {
   }
   member_method {
     name: "add_weight"
-    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\'], "
+    argspec: "args=[\'self\', \'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'constraint\', \'use_resource\', \'synchronization\', \'aggregation\', \'partitioner\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\', \'None\'], "
   }
   member_method {
     name: "apply"
diff --git a/tensorflow/tools/api/golden/tensorflow.pbtxt b/tensorflow/tools/api/golden/tensorflow.pbtxt
index 9ec20f0955..5eb42b4db3 100644
--- a/tensorflow/tools/api/golden/tensorflow.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.pbtxt
@@ -258,7 +258,7 @@ tf_module {
   }
   member {
     name: "Variable"
-    mtype: "<type \'type\'>"
+    mtype: "<class \'tensorflow.python.ops.variables.VariableMetaclass\'>"
   }
   member {
     name: "VariableAggregation"
@@ -1174,7 +1174,7 @@ tf_module {
   }
   member_method {
     name: "get_variable"
-    argspec: "args=[\'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'collections\', \'caching_device\', \'partitioner\', \'validate_shape\', \'use_resource\', \'custom_getter\', \'constraint\', \'synchronization\', \'aggregation\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\'], "
+    argspec: "args=[\'name\', \'shape\', \'dtype\', \'initializer\', \'regularizer\', \'trainable\', \'collections\', \'caching_device\', \'partitioner\', \'validate_shape\', \'use_resource\', \'custom_getter\', \'constraint\', \'synchronization\', \'aggregation\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'None\', \'True\', \'None\', \'None\', \'None\', \'VariableSynchronization.AUTO\', \'VariableAggregation.NONE\'], "
   }
   member_method {
     name: "get_variable_scope"
@@ -1561,10 +1561,6 @@ tf_module {
     argspec: "args=[\'x\', \'y\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
-    name: "print"
-    argspec: "args=[\'input_\', \'data\', \'message\', \'first_n\', \'summarize\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\'], "
-  }
-  member_method {
     name: "py_func"
     argspec: "args=[\'func\', \'inp\', \'Tout\', \'stateful\', \'name\'], varargs=None, keywords=None, defaults=[\'True\', \'None\'], "
   }
@@ -2198,7 +2194,7 @@ tf_module {
   }
   member_method {
     name: "while_loop"
-    argspec: "args=[\'cond\', \'body\', \'loop_vars\', \'shape_invariants\', \'parallel_iterations\', \'back_prop\', \'swap_memory\', \'name\', \'maximum_iterations\'], varargs=None, keywords=None, defaults=[\'None\', \'10\', \'True\', \'False\', \'None\', \'None\'], "
+    argspec: "args=[\'cond\', \'body\', \'loop_vars\', \'shape_invariants\', \'parallel_iterations\', \'back_prop\', \'swap_memory\', \'name\', \'maximum_iterations\', \'return_same_structure\'], varargs=None, keywords=None, defaults=[\'None\', \'10\', \'True\', \'False\', \'None\', \'None\', \'False\'], "
   }
   member_method {
     name: "write_file"
diff --git a/tensorflow/tools/api/lib/python_object_to_proto_visitor.py b/tensorflow/tools/api/lib/python_object_to_proto_visitor.py
index 1cf330e702..3a48cf683c 100644
--- a/tensorflow/tools/api/lib/python_object_to_proto_visitor.py
+++ b/tensorflow/tools/api/lib/python_object_to_proto_visitor.py
@@ -88,6 +88,9 @@ def _SanitizedMRO(obj):
   """
   return_list = []
   for cls in tf_inspect.getmro(obj):
+    if cls.__name__ == '_NewClass':
+      # Ignore class created by @deprecated_alias decorator.
+      continue
     str_repr = str(cls)
     return_list.append(str_repr)
     if 'tensorflow' not in str_repr:
diff --git a/tensorflow/tools/api/tests/api_compatibility_test.py b/tensorflow/tools/api/tests/api_compatibility_test.py
index 90375a794f..d1b34fb242 100644
--- a/tensorflow/tools/api/tests/api_compatibility_test.py
+++ b/tensorflow/tools/api/tests/api_compatibility_test.py
@@ -34,6 +34,13 @@ import sys
 import unittest
 
 import tensorflow as tf
+# pylint: disable=g-import-not-at-top
+try:
+  from tensorflow.compat import v1 as tf_v1
+  # We import compat.v1 as tf_v1 instead.
+  del tf.compat.v1
+except ImportError:
+  tf_v1 = None
 
 from google.protobuf import message
 from google.protobuf import text_format
@@ -46,6 +53,7 @@ from tensorflow.tools.api.lib import api_objects_pb2
 from tensorflow.tools.api.lib import python_object_to_proto_visitor
 from tensorflow.tools.common import public_api
 from tensorflow.tools.common import traverse
+# pylint: enable=g-import-not-at-top
 
 
 # FLAGS defined at the bottom:
@@ -215,25 +223,19 @@ class ApiCompatibilityTest(test.TestCase):
     visitor.do_not_descend_map['tf'].append('contrib')
     traverse.traverse(tf, visitor)
 
-  @unittest.skipUnless(
-      sys.version_info.major == 2,
-      'API compabitility test goldens are generated using python2.')
-  def testAPIBackwardsCompatibility(self):
-    # Extract all API stuff.
+  def checkBackwardsCompatibility(self, root, golden_file_pattern):
+     # Extract all API stuff.
     visitor = python_object_to_proto_visitor.PythonObjectToProtoVisitor()
 
     public_api_visitor = public_api.PublicAPIVisitor(visitor)
     public_api_visitor.do_not_descend_map['tf'].append('contrib')
     public_api_visitor.do_not_descend_map['tf.GPUOptions'] = ['Experimental']
-    traverse.traverse(tf, public_api_visitor)
+    traverse.traverse(root, public_api_visitor)
 
     proto_dict = visitor.GetProtos()
 
     # Read all golden files.
-    expression = os.path.join(
-        resource_loader.get_root_dir_with_all_resources(),
-        _KeyToFilePath('*'))
-    golden_file_list = file_io.get_matching_files(expression)
+    golden_file_list = file_io.get_matching_files(golden_file_pattern)
 
     def _ReadFileToProto(filename):
       """Read a filename, create a protobuf from its contents."""
@@ -254,6 +256,26 @@ class ApiCompatibilityTest(test.TestCase):
         verbose=FLAGS.verbose_diffs,
         update_goldens=FLAGS.update_goldens)
 
+  @unittest.skipUnless(
+      sys.version_info.major == 2,
+      'API compabitility test goldens are generated using python2.')
+  def testAPIBackwardsCompatibility(self):
+    golden_file_pattern = os.path.join(
+        resource_loader.get_root_dir_with_all_resources(),
+        _KeyToFilePath('*'))
+    self.checkBackwardsCompatibility(tf, golden_file_pattern)
+
+  @unittest.skipUnless(
+      sys.version_info.major == 2,
+      'API compabitility test goldens are generated using python2.')
+  def testAPIBackwardsCompatibilityV1(self):
+    if not tf_v1:
+      return
+    golden_file_pattern = os.path.join(
+        resource_loader.get_root_dir_with_all_resources(),
+        _KeyToFilePath('*'))
+    self.checkBackwardsCompatibility(tf_v1, golden_file_pattern)
+
 
 if __name__ == '__main__':
   parser = argparse.ArgumentParser()
diff --git a/tensorflow/tools/ci_build/Dockerfile.cpu.ppc64le b/tensorflow/tools/ci_build/Dockerfile.cpu.ppc64le
index e879c34bbd..ada2c63880 100644
--- a/tensorflow/tools/ci_build/Dockerfile.cpu.ppc64le
+++ b/tensorflow/tools/ci_build/Dockerfile.cpu.ppc64le
@@ -7,7 +7,7 @@ COPY install/*.sh /install/
 RUN /install/install_bootstrap_deb_packages.sh
 RUN add-apt-repository -y ppa:openjdk-r/ppa
 RUN /install/install_deb_packages.sh
-RUN apt-get update && apt-get install -y libopenblas-dev
+RUN /install/install_openblas_ppc64le.sh
 RUN /install/install_hdf5_ppc64le.sh
 RUN /install/install_pip_packages.sh
 RUN /install/install_bazel_from_source.sh
diff --git a/tensorflow/tools/ci_build/Dockerfile.gpu b/tensorflow/tools/ci_build/Dockerfile.gpu
index 7591ecc04e..383f9545c9 100644
--- a/tensorflow/tools/ci_build/Dockerfile.gpu
+++ b/tensorflow/tools/ci_build/Dockerfile.gpu
@@ -14,6 +14,7 @@ RUN /install/install_bootstrap_deb_packages.sh
 RUN add-apt-repository -y ppa:openjdk-r/ppa && \
     add-apt-repository -y ppa:george-edison55/cmake-3.x
 RUN /install/install_deb_packages.sh
+
 RUN /install/install_pip_packages.sh
 RUN /install/install_bazel.sh
 RUN /install/install_golang.sh
@@ -22,6 +23,11 @@ RUN /install/install_golang.sh
 COPY install/.bazelrc /etc/bazel.bazelrc
 ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
 
+# Link NCCL libray and header where the build script expects them.
+RUN mkdir /usr/local/cuda-9.0/lib &&  \
+    ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/lib/libnccl.so.2 && \
+    ln -s /usr/include/nccl.h /usr/local/cuda/include/nccl.h
+
 # Configure the build for our CUDA configuration.
 ENV TF_NEED_CUDA 1
 ENV TF_CUDA_COMPUTE_CAPABILITIES 3.0
diff --git a/tensorflow/tools/ci_build/Dockerfile.gpu.ppc64le b/tensorflow/tools/ci_build/Dockerfile.gpu.ppc64le
index 8967138747..a404f129ab 100644
--- a/tensorflow/tools/ci_build/Dockerfile.gpu.ppc64le
+++ b/tensorflow/tools/ci_build/Dockerfile.gpu.ppc64le
@@ -13,7 +13,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 RUN /install/install_bootstrap_deb_packages.sh
 RUN add-apt-repository -y ppa:openjdk-r/ppa
 RUN /install/install_deb_packages.sh
-RUN apt-get update && apt-get install -y libopenblas-dev
+RUN /install/install_openblas_ppc64le.sh 
 RUN /install/install_hdf5_ppc64le.sh
 RUN /install/install_pip_packages.sh
 RUN /install/install_bazel_from_source.sh
diff --git a/tensorflow/tools/ci_build/builds/android.sh b/tensorflow/tools/ci_build/builds/android.sh
index d81793efe0..7c3e308229 100755
--- a/tensorflow/tools/ci_build/builds/android.sh
+++ b/tensorflow/tools/ci_build/builds/android.sh
@@ -26,13 +26,19 @@ configure_android_workspace
 # android_full.sh
 
 echo "========== TensorFlow Demo Build Test =========="
+TARGETS=
+TARGETS+=" //tensorflow/examples/android:tensorflow_demo"
+# Also build the Eager Runtime so it remains compatible with Android for the
+# benefits of clients like TensorFlow Lite. For now it is enough to build only
+# :execute, which what TF Lite needs.
+TARGETS+=" //tensorflow/core/common_runtime/eager:execute"
 # Enable sandboxing so that zip archives don't get incorrectly packaged
 # in assets/ dir (see https://github.com/bazelbuild/bazel/issues/2334)
 # TODO(gunan): remove extra flags once sandboxing is enabled for all builds.
 bazel --bazelrc=/dev/null build \
     --compilation_mode=opt --cxxopt=-std=c++11 --fat_apk_cpu=x86_64 \
     --spawn_strategy=sandboxed --genrule_strategy=sandboxed \
-    //tensorflow/examples/android:tensorflow_demo
+    ${TARGETS}
 
 echo "========== Makefile Build Test =========="
 # Test Makefile build just to make sure it still works.
diff --git a/tensorflow/tools/ci_build/ci_build.sh b/tensorflow/tools/ci_build/ci_build.sh
index f6a50d3d4c..77265e0f50 100755
--- a/tensorflow/tools/ci_build/ci_build.sh
+++ b/tensorflow/tools/ci_build/ci_build.sh
@@ -115,6 +115,7 @@ DOCKER_IMG_NAME=$(echo "${DOCKER_IMG_NAME}" | tr '[:upper:]' '[:lower:]')
 
 # Print arguments.
 echo "WORKSPACE: ${WORKSPACE}"
+echo "CI_DOCKER_BUILD_EXTRA_PARAMS: ${CI_DOCKER_BUILD_EXTRA_PARAMS[*]}"
 echo "CI_DOCKER_EXTRA_PARAMS: ${CI_DOCKER_EXTRA_PARAMS[*]}"
 echo "COMMAND: ${COMMAND[*]}"
 echo "CI_COMMAND_PREFIX: ${CI_COMMAND_PREFIX[*]}"
@@ -126,7 +127,7 @@ echo ""
 
 # Build the docker container.
 echo "Building container (${DOCKER_IMG_NAME})..."
-docker build -t ${DOCKER_IMG_NAME} \
+docker build -t ${DOCKER_IMG_NAME} ${CI_DOCKER_BUILD_EXTRA_PARAMS[@]} \
     -f "${DOCKERFILE_PATH}" "${DOCKER_CONTEXT_PATH}"
 
 # Check docker build status
diff --git a/tensorflow/tools/ci_build/ci_parameterized_build.sh b/tensorflow/tools/ci_build/ci_parameterized_build.sh
index 08e2c3edd2..993894d658 100755
--- a/tensorflow/tools/ci_build/ci_parameterized_build.sh
+++ b/tensorflow/tools/ci_build/ci_parameterized_build.sh
@@ -150,36 +150,7 @@ BAZEL_TARGET="//tensorflow/... -//tensorflow/compiler/..."
 if [[ -n "$TF_SKIP_CONTRIB_TESTS" ]]; then
   BAZEL_TARGET="$BAZEL_TARGET -//tensorflow/contrib/..."
 else
-  BAZEL_TARGET="${BAZEL_TARGET} -//tensorflow/contrib/lite/..."
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite:context_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite:framework"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite:interpreter_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite:model_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/toco:toco"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite:simple_memory_arena_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite:string_util_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:activations_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:add_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:basic_rnn_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:concatenation_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:conv_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:depthwise_conv_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:embedding_lookup_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:embedding_lookup_sparse_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:fully_connected_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:hashtable_lookup_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:local_response_norm_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:lsh_projection_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:lstm_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:l2norm_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:mul_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:pooling_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:reshape_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:resize_bilinear_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:skip_gram_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:softmax_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:space_to_depth_test"
-  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:svdf_test"
+  BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/..."
 fi
 
 TUT_TEST_DATA_DIR="/tmp/tf_tutorial_test_data"
@@ -570,33 +541,35 @@ echo ""
 
 TMP_DIR=""
 DOCKERFILE_FLAG=""
-if [[ "${TF_BUILD_PYTHON_VERSION}" == "python3.5" ]] ||
-  [[ "${TF_BUILD_PYTHON_VERSION}" == "python3.6" ]]; then
-  # Modify Dockerfile for Python3.5 | Python3.6 build
-  TMP_DIR=$(mktemp -d)
-  echo "Docker build will occur in temporary directory: ${TMP_DIR}"
-
-  # Copy the files required for the docker build
-  SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-  cp -r "${SCRIPT_DIR}/install" "${TMP_DIR}/install" || \
-      die "ERROR: Failed to copy directory ${SCRIPT_DIR}/install"
-
-  DOCKERFILE="${SCRIPT_DIR}/Dockerfile.${TF_BUILD_CONTAINER_TYPE}"
-  cp "${DOCKERFILE}" "${TMP_DIR}/" || \
-      die "ERROR: Failed to copy Dockerfile at ${DOCKERFILE}"
-  DOCKERFILE="${TMP_DIR}/Dockerfile.${TF_BUILD_CONTAINER_TYPE}"
-
-  # Replace a line in the Dockerfile
-  if sed -i \
-      "s/RUN \/install\/install_pip_packages.sh/RUN \/install\/install_${TF_BUILD_PYTHON_VERSION}_pip_packages.sh/g" \
-      "${DOCKERFILE}"
-  then
-    echo "Copied and modified Dockerfile for ${TF_BUILD_PYTHON_VERSION} build: ${DOCKERFILE}"
-  else
-    die "ERROR: Faild to copy and modify Dockerfile: ${DOCKERFILE}"
-  fi
+if [[ "${DO_DOCKER}" == "1" ]]; then
+  if [[ "${TF_BUILD_PYTHON_VERSION}" == "python3.5" ]] ||
+    [[ "${TF_BUILD_PYTHON_VERSION}" == "python3.6" ]]; then
+    # Modify Dockerfile for Python3.5 | Python3.6 build
+    TMP_DIR=$(mktemp -d)
+    echo "Docker build will occur in temporary directory: ${TMP_DIR}"
+
+    # Copy the files required for the docker build
+    SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+    cp -r "${SCRIPT_DIR}/install" "${TMP_DIR}/install" || \
+        die "ERROR: Failed to copy directory ${SCRIPT_DIR}/install"
+
+    DOCKERFILE="${SCRIPT_DIR}/Dockerfile.${TF_BUILD_CONTAINER_TYPE}"
+    cp "${DOCKERFILE}" "${TMP_DIR}/" || \
+        die "ERROR: Failed to copy Dockerfile at ${DOCKERFILE}"
+    DOCKERFILE="${TMP_DIR}/Dockerfile.${TF_BUILD_CONTAINER_TYPE}"
+
+    # Replace a line in the Dockerfile
+    if sed -i \
+        "s/RUN \/install\/install_pip_packages.sh/RUN \/install\/install_${TF_BUILD_PYTHON_VERSION}_pip_packages.sh/g" \
+        "${DOCKERFILE}"
+    then
+      echo "Copied and modified Dockerfile for ${TF_BUILD_PYTHON_VERSION} build: ${DOCKERFILE}"
+    else
+      die "ERROR: Faild to copy and modify Dockerfile: ${DOCKERFILE}"
+    fi
 
-  DOCKERFILE_FLAG="--dockerfile ${DOCKERFILE}"
+    DOCKERFILE_FLAG="--dockerfile ${DOCKERFILE}"
+  fi
 fi
 
 chmod +x ${TMP_SCRIPT}
diff --git a/tensorflow/tools/ci_build/ci_sanity.sh b/tensorflow/tools/ci_build/ci_sanity.sh
index db37edf809..866fe95d2b 100755
--- a/tensorflow/tools/ci_build/ci_sanity.sh
+++ b/tensorflow/tools/ci_build/ci_sanity.sh
@@ -354,7 +354,7 @@ do_external_licenses_check(){
 
   # Whitelist
   echo ${EXTRA_LICENSE_FILE}
-  grep -e "@bazel_tools//src" -e "@bazel_tools//tools/" -e "@com_google_absl//" -e "//external" -e "@local" -e "@com_github_googlecloudplatform_google_cloud_cpp//" -v ${EXTRA_LICENSES_FILE} > temp.txt
+  grep -e "@bazel_tools//src" -e "@bazel_tools//tools/" -e "@com_google_absl//" -e "//external" -e "@local" -e "@com_github_googlecloudplatform_google_cloud_cpp//" -e "@embedded_jdk//" -v ${EXTRA_LICENSES_FILE} > temp.txt
   mv temp.txt ${EXTRA_LICENSES_FILE}
 
 
diff --git a/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh b/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
index d0816c92b7..75da9bb835 100755
--- a/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
+++ b/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
@@ -35,6 +35,30 @@ elif [[ ${BASH_VER_MAJOR} -eq 4 ]] && [[ ${BASH_VER_MINOR} -lt 2 ]]; then
   exit 1
 fi
 
+function is_absolute {
+  [[ "$1" = /* ]] || [[ "$1" =~ ^[a-zA-Z]:[/\\].* ]]
+}
+
+RUNFILES_MANIFEST_FILE="${TEST_SRCDIR}/MANIFEST"
+function rlocation() {
+  if is_absolute "$1" ; then
+    # If the file path is already fully specified, simply return it.
+    echo "$1"
+  elif [[ -e "$TEST_SRCDIR/$1" ]]; then
+    # If the file exists in the $TEST_SRCDIR then just use it.
+    echo "$TEST_SRCDIR/$1"
+  elif [[ -e "$RUNFILES_MANIFEST_FILE" ]]; then
+    # If a runfiles manifest file exists then use it.
+    echo "$(grep "^$1 " "$RUNFILES_MANIFEST_FILE" | sed 's/[^ ]* //')"
+  fi
+}
+
+TEST_BINARY="$(rlocation $TEST_WORKSPACE/${1#./})"
+shift
+
+# Make sure /var/lock exists, this may not be true under MSYS
+mkdir -p /var/lock
+
 TF_GPU_COUNT=${TF_GPU_COUNT:-8}
 
 for i in `seq 0 $((TF_GPU_COUNT-1))`; do
@@ -45,8 +69,8 @@ for i in `seq 0 $((TF_GPU_COUNT-1))`; do
       # This export only works within the brackets, so it is isolated to one
       # single command.
       export CUDA_VISIBLE_DEVICES=$i
-      echo "Running test $* on GPU $CUDA_VISIBLE_DEVICES"
-      $@
+      echo "Running test $TEST_BINARY $* on GPU $CUDA_VISIBLE_DEVICES"
+      "$TEST_BINARY" $@
     )
     return_code=$?
     flock -u "$lock_fd"
diff --git a/tensorflow/tools/ci_build/install/install_bazel.sh b/tensorflow/tools/ci_build/install/install_bazel.sh
index 3e27a94cf2..e284401b8a 100755
--- a/tensorflow/tools/ci_build/install/install_bazel.sh
+++ b/tensorflow/tools/ci_build/install/install_bazel.sh
@@ -15,7 +15,7 @@
 # ==============================================================================
 
 # Select bazel version.
-BAZEL_VERSION="0.11.0"
+BAZEL_VERSION="0.15.0"
 
 set +e
 local_bazel_ver=$(bazel version 2>&1 | grep -i label | awk '{print $3}')
diff --git a/tensorflow/tools/ci_build/install/install_bazel_from_source.sh b/tensorflow/tools/ci_build/install/install_bazel_from_source.sh
index ddad00c5f0..87be81577d 100755
--- a/tensorflow/tools/ci_build/install/install_bazel_from_source.sh
+++ b/tensorflow/tools/ci_build/install/install_bazel_from_source.sh
@@ -18,7 +18,7 @@
 # It will compile bazel from source and install it in /usr/local/bin
 
 # Select bazel version.
-BAZEL_VERSION="0.11.0"
+BAZEL_VERSION="0.15.0"
 
 set +e
 local_bazel_ver=$(bazel version 2>&1 | grep -i label | awk '{print $3}')
diff --git a/tensorflow/tools/ci_build/install/install_openblas_ppc64le.sh b/tensorflow/tools/ci_build/install/install_openblas_ppc64le.sh
new file mode 100755
index 0000000000..107cc61ff5
--- /dev/null
+++ b/tensorflow/tools/ci_build/install/install_openblas_ppc64le.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+OPENBLAS_SRC_PATH=/tmp/openblas_src/
+POWER="POWER8"
+USE_OPENMP="USE_OPENMP=1"
+OPENBLAS_INSTALL_PATH="/usr"
+apt-get update
+apt-get install -y gfortran gfortran-5
+rm -rf ${OPENBLAS_SRC_PATH}
+git clone -b release-0.3.0 https://github.com/xianyi/OpenBLAS ${OPENBLAS_SRC_PATH}
+cd ${OPENBLAS_SRC_PATH}
+# Pick up fix for OpenBLAS issue 1571
+git cherry-pick -X theirs 961d25e9c7e4a1758adb1dbeaa15187de69dd052
+make TARGET=${POWER} ${USE_OPENMP} FC=gfortran
+make PREFIX=${OPENBLAS_INSTALL_PATH} install
diff --git a/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh b/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh
index 2b68de3c5b..f6fa9251d4 100755
--- a/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh
+++ b/tensorflow/tools/ci_build/linux/cpu/run_py3_contrib.sh
@@ -34,35 +34,4 @@ yes "" | $PYTHON_BIN_PATH configure.py
 bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test -k \
     --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 --config=opt \
     --test_size_filters=small,medium --test_output=errors -- \
-    //tensorflow/contrib/... \
-    -//tensorflow/contrib/lite/... \
-    //tensorflow/contrib/lite:context_test \
-    //tensorflow/contrib/lite:framework \
-    //tensorflow/contrib/lite:interpreter_test \
-    //tensorflow/contrib/lite:model_test \
-    //tensorflow/contrib/lite/toco:toco \
-    //tensorflow/contrib/lite:simple_memory_arena_test \
-    //tensorflow/contrib/lite:string_util_test \
-    //tensorflow/contrib/lite/kernels:activations_test \
-    //tensorflow/contrib/lite/kernels:add_test \
-    //tensorflow/contrib/lite/kernels:basic_rnn_test \
-    //tensorflow/contrib/lite/kernels:concatenation_test \
-    //tensorflow/contrib/lite/kernels:conv_test \
-    //tensorflow/contrib/lite/kernels:depthwise_conv_test \
-    //tensorflow/contrib/lite/kernels:embedding_lookup_test \
-    //tensorflow/contrib/lite/kernels:embedding_lookup_sparse_test \
-    //tensorflow/contrib/lite/kernels:fully_connected_test \
-    //tensorflow/contrib/lite/testing:generated_zip_tests \
-    //tensorflow/contrib/lite/kernels:hashtable_lookup_test \
-    //tensorflow/contrib/lite/kernels:local_response_norm_test \
-    //tensorflow/contrib/lite/kernels:lsh_projection_test \
-    //tensorflow/contrib/lite/kernels:lstm_test \
-    //tensorflow/contrib/lite/kernels:l2norm_test \
-    //tensorflow/contrib/lite/kernels:mul_test \
-    //tensorflow/contrib/lite/kernels:pooling_test \
-    //tensorflow/contrib/lite/kernels:reshape_test \
-    //tensorflow/contrib/lite/kernels:resize_bilinear_test \
-    //tensorflow/contrib/lite/kernels:skip_gram_test \
-    //tensorflow/contrib/lite/kernels:softmax_test \
-    //tensorflow/contrib/lite/kernels:space_to_depth_test \
-    //tensorflow/contrib/lite/kernels:svdf_test
+    //tensorflow/contrib/...
diff --git a/tensorflow/tools/ci_build/linux/mkl/build-dev-container.sh b/tensorflow/tools/ci_build/linux/mkl/build-dev-container.sh
index ad22ebe4eb..a1d91a6123 100755
--- a/tensorflow/tools/ci_build/linux/mkl/build-dev-container.sh
+++ b/tensorflow/tools/ci_build/linux/mkl/build-dev-container.sh
@@ -34,12 +34,17 @@ echo "TF_DOCKER_BUILD_DEVEL_BRANCH=${TF_DOCKER_BUILD_DEVEL_BRANCH}"
 echo "TF_DOCKER_BUILD_IMAGE_NAME=${TF_DOCKER_BUILD_IMAGE_NAME}"
 echo "TF_DOCKER_BUILD_VERSION=${TF_DOCKER_BUILD_VERSION}"
 
+# Build containers for AVX
+# Include the instructions for sandybridge and later, but tune for ivybridge
+TF_BAZEL_BUILD_OPTIONS="--config=mkl --copt=-march=sandybridge --copt=-mtune=ivybridge --copt=-O3 --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0"
+
 # build the python 2 container and whl
 TF_DOCKER_BUILD_TYPE="MKL" \
   TF_DOCKER_BUILD_IS_DEVEL="YES" \
   TF_DOCKER_BUILD_DEVEL_BRANCH="${TF_DOCKER_BUILD_DEVEL_BRANCH}" \
   TF_DOCKER_BUILD_IMAGE_NAME="${TF_DOCKER_BUILD_IMAGE_NAME}" \
   TF_DOCKER_BUILD_VERSION="${TF_DOCKER_BUILD_VERSION}" \
+  TF_BAZEL_BUILD_OPTIONS="${TF_BAZEL_BUILD_OPTIONS}" \
   ${WORKSPACE}/tensorflow/tools/docker/parameterized_docker_build.sh 
 
 # build the python 3 container and whl
@@ -49,5 +54,29 @@ TF_DOCKER_BUILD_TYPE="MKL" \
   TF_DOCKER_BUILD_IMAGE_NAME="${TF_DOCKER_BUILD_IMAGE_NAME}" \
   TF_DOCKER_BUILD_VERSION="${TF_DOCKER_BUILD_VERSION}" \
   TF_DOCKER_BUILD_PYTHON_VERSION="PYTHON3" \
+  TF_BAZEL_BUILD_OPTIONS="${TF_BAZEL_BUILD_OPTIONS}" \
+  ${WORKSPACE}/tensorflow/tools/docker/parameterized_docker_build.sh
+
+# Build containers for AVX2
+# Include the instructions for haswell and later, but tune for broadwell
+TF_BAZEL_BUILD_OPTIONS="--config=mkl --copt=-march=haswell --copt=-mtune=broadwell --copt=-O3 --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0"
+
+# build the python 2 container and whl
+TF_DOCKER_BUILD_TYPE="MKL" \
+  TF_DOCKER_BUILD_IS_DEVEL="YES" \
+  TF_DOCKER_BUILD_DEVEL_BRANCH="${TF_DOCKER_BUILD_DEVEL_BRANCH}" \
+  TF_DOCKER_BUILD_IMAGE_NAME="${TF_DOCKER_BUILD_IMAGE_NAME}" \
+  TF_DOCKER_BUILD_VERSION="${TF_DOCKER_BUILD_VERSION}-avx2" \
+  TF_BAZEL_BUILD_OPTIONS="${TF_BAZEL_BUILD_OPTIONS}" \
   ${WORKSPACE}/tensorflow/tools/docker/parameterized_docker_build.sh 
 
+# build the python 3 container and whl
+TF_DOCKER_BUILD_TYPE="MKL" \
+  TF_DOCKER_BUILD_IS_DEVEL="YES" \
+  TF_DOCKER_BUILD_DEVEL_BRANCH="${TF_DOCKER_BUILD_DEVEL_BRANCH}" \
+  TF_DOCKER_BUILD_IMAGE_NAME="${TF_DOCKER_BUILD_IMAGE_NAME}" \
+  TF_DOCKER_BUILD_VERSION="${TF_DOCKER_BUILD_VERSION}-avx2" \
+  TF_DOCKER_BUILD_PYTHON_VERSION="PYTHON3" \
+  TF_BAZEL_BUILD_OPTIONS="${TF_BAZEL_BUILD_OPTIONS}" \
+  ${WORKSPACE}/tensorflow/tools/docker/parameterized_docker_build.sh
+
diff --git a/tensorflow/tools/ci_build/update_version.py b/tensorflow/tools/ci_build/update_version.py
index 642dde36a7..30c318a58f 100755
--- a/tensorflow/tools/ci_build/update_version.py
+++ b/tensorflow/tools/ci_build/update_version.py
@@ -248,16 +248,6 @@ def update_md_files(old_version, new_version):
     replace_string_in_line(r"<version>%s<\/version>" % old_version,
                            "<version>%s</version>" % new_version, filepath)
 
-  # Update any links to colab notebooks.
-  def colab_url(version):
-    version_string = "%s.%s.%s" % (version.major, version.minor, version.patch)
-    prefix = "https://colab.research.google.com/github/tensorflow/models/blob/r"
-    return prefix + version_string + "/"
-
-  replace_string_in_line(
-      colab_url(old_version), colab_url(new_version),
-      "%s/docs_src/get_started/eager.md" % TF_SRC_DIR)
-
 
 def major_minor_change(old_version, new_version):
   """Check if a major or minor change occurred."""
diff --git a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
index e10483e7fd..0482cf619a 100644
--- a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
@@ -23,10 +23,6 @@ function run_configure_for_gpu_build {
   # Enable CUDA support
   export TF_NEED_CUDA=1
 
-  # TODO(pcloudy): Remove this after TensorFlow uses its own CRSOOTOOL
-  # for GPU build on Windows
-  export USE_MSVC_WRAPPER=1
-
   yes "" | ./configure
 }
 
@@ -37,10 +33,10 @@ function set_remote_cache_options {
   echo "build --tls_enabled=true" >> "${TMP_BAZELRC}"
   echo "build --remote_timeout=3600" >> "${TMP_BAZELRC}"
   echo "build --auth_enabled=true" >> "${TMP_BAZELRC}"
-  echo "build --spawn_strategy=remote" >> "${TMP_BAZELRC}"
-  echo "build --strategy=Javac=remote" >> "${TMP_BAZELRC}"
-  echo "build --strategy=Closure=remote" >> "${TMP_BAZELRC}"
-  echo "build --genrule_strategy=remote" >> "${TMP_BAZELRC}"
+  echo "build --spawn_strategy=standalone" >> "${TMP_BAZELRC}"
+  echo "build --strategy=Javac=standalone" >> "${TMP_BAZELRC}"
+  echo "build --strategy=Closure=standalone" >> "${TMP_BAZELRC}"
+  echo "build --genrule_strategy=standalone" >> "${TMP_BAZELRC}"
   echo "build --google_credentials=$GOOGLE_CLOUD_CREDENTIAL" >> "${TMP_BAZELRC}"
 }
 
diff --git a/tensorflow/tools/ci_build/windows/bazel/common_env.sh b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
index 8a237e4e28..333a89d3f5 100644
--- a/tensorflow/tools/ci_build/windows/bazel/common_env.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
@@ -26,7 +26,8 @@
 # * Bazel windows executable copied as "bazel.exe" and included in PATH.
 
 # Use a temporary directory with a short name.
-export TMPDIR="C:/tmp"
+export TMPDIR=${TMPDIR:-"C:/tmp"}
+export TMPDIR=$(cygpath -m "$TMPDIR")
 mkdir -p "$TMPDIR"
 
 # Set bash path
@@ -54,10 +55,10 @@ export PATH="/c/${PYTHON_BASE_PATH}/Scripts:$PATH"
 export TF_CUDA_VERSION=${TF_CUDA_VERSION:-9.0}
 export TF_CUDNN_VERSION=${TF_CUDNN_VERSION:-7.0}
 export TF_CUDA_COMPUTE_CAPABILITIES=${TF_CUDA_COMPUTE_CAPABILITIES:-3.7}
-export CUDA_INSTALL_PATH=${CUDA_INSTALL_PATH:-"C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${TF_CUDA_VERSION}"}
+export CUDA_TOOLKIT_PATH=${CUDA_TOOLKIT_PATH:-"C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${TF_CUDA_VERSION}"}
 export CUDNN_INSTALL_PATH=${CUDNN_INSTALL_PATH:-"C:/tools/cuda"}
 
 # Add Cuda and Cudnn dll directories into PATH
-export PATH="$(cygpath -u "${CUDA_INSTALL_PATH}")/bin:$PATH"
-export PATH="$(cygpath -u "${CUDA_INSTALL_PATH}")/extras/CUPTI/libx64:$PATH"
+export PATH="$(cygpath -u "${CUDA_TOOLKIT_PATH}")/bin:$PATH"
+export PATH="$(cygpath -u "${CUDA_TOOLKIT_PATH}")/extras/CUPTI/libx64:$PATH"
 export PATH="$(cygpath -u "${CUDNN_INSTALL_PATH}")/bin:$PATH"
diff --git a/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh b/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
index ed73401467..47e0e5dd59 100644
--- a/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
+++ b/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
@@ -53,30 +53,39 @@ function cleanup {
 }
 trap cleanup EXIT
 
-skip_test=0
-release_build=0
+PY_TEST_DIR="py_test_dir"
 
+SKIP_TEST=0
+RELEASE_BUILD=0
+TEST_TARGET="//${PY_TEST_DIR}/tensorflow/python/... \
+    //${PY_TEST_DIR}/tensorflow/contrib/... "
+
+# --skip_test            Skip running tests
+# --enable_remote_cache  Add options to enable remote cache for build and test
+# --release_build        Build for release, compilation time will be longer to
+#                        ensure performance
+# --test_core_only       Use tensorflow/python/... as test target
+# --test_contrib_only    Use tensorflow/contrib/... as test target
 for ARG in "$@"; do
-  if [[ "$ARG" == --skip_test ]]; then
-    skip_test=1
-  elif [[ "$ARG" == --enable_remote_cache ]]; then
-    set_remote_cache_options
-  elif [[ "$ARG" == --release_build ]]; then
-    release_build=1
-  fi
+  case "$ARG" in
+    --skip_test) SKIP_TEST=1 ;;
+    --enable_remote_cache) set_remote_cache_options ;;
+    --release_build) RELEASE_BUILD=1 ;;
+    --test_core_only) TEST_TARGET="//${PY_TEST_DIR}/tensorflow/python/..." ;;
+    --test_contrib_only) TEST_TARGET="//${PY_TEST_DIR}/tensorflow/contrib/..." ;;
+    *)
+  esac
 done
 
-if [[ "$release_build" != 1 ]]; then
-  # --define=override_eigen_strong_inline=true speeds up the compiling of conv_grad_ops_3d.cc and conv_ops_3d.cc
+if [[ "$RELEASE_BUILD" == 1 ]]; then
+  # Overriding eigen strong inline speeds up the compiling of conv_grad_ops_3d.cc and conv_ops_3d.cc
   # by 20 minutes. See https://github.com/tensorflow/tensorflow/issues/10521
-  # Because this hurts the performance of TF, we don't enable it in release build.
-  echo "build --define=override_eigen_strong_inline=true" >> "${TMP_BAZELRC}"
+  # Because this hurts the performance of TF, we don't override it in release build.
+  export TF_OVERRIDE_EIGEN_STRONG_INLINE=0
+else
+  export TF_OVERRIDE_EIGEN_STRONG_INLINE=1
 fi
 
-# The host and target platforms are the same in Windows build. So we don't have
-# to distinct them. This helps avoid building the same targets twice.
-echo "build --distinct_host_configuration=false" >> "${TMP_BAZELRC}"
-
 # Enable short object file path to avoid long path issue on Windows.
 echo "startup --output_user_root=${TMPDIR}" >> "${TMP_BAZELRC}"
 
@@ -88,12 +97,11 @@ run_configure_for_cpu_build
 
 bazel build --announce_rc --config=opt tensorflow/tools/pip_package:build_pip_package || exit $?
 
-if [[ "$skip_test" == 1 ]]; then
+if [[ "$SKIP_TEST" == 1 ]]; then
   exit 0
 fi
 
 # Create a python test directory to avoid package name conflict
-PY_TEST_DIR="py_test_dir"
 create_python_test_dir "${PY_TEST_DIR}"
 
 ./bazel-bin/tensorflow/tools/pip_package/build_pip_package "$PWD/${PY_TEST_DIR}"
@@ -111,7 +119,7 @@ bazel test --announce_rc --config=opt -k --test_output=errors \
   --define=no_tensorflow_py_deps=true --test_lang_filters=py \
   --test_tag_filters=-no_pip,-no_windows,-no_oss \
   --build_tag_filters=-no_pip,-no_windows,-no_oss --build_tests_only \
+  --test_size_filters=small,medium \
   --jobs="${N_JOBS}" --test_timeout="300,450,1200,3600" \
   --flaky_test_attempts=3 \
-  //${PY_TEST_DIR}/tensorflow/python/... \
-  //${PY_TEST_DIR}/tensorflow/contrib/...
+  ${TEST_TARGET}
diff --git a/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh b/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh
index fe3bce428f..e3eee11080 100644
--- a/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh
+++ b/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh
@@ -53,30 +53,39 @@ function cleanup {
 }
 trap cleanup EXIT
 
-skip_test=0
-release_build=0
+PY_TEST_DIR="py_test_dir"
 
+SKIP_TEST=0
+RELEASE_BUILD=0
+TEST_TARGET="//${PY_TEST_DIR}/tensorflow/python/... \
+  //${PY_TEST_DIR}/tensorflow/contrib/... "
+
+# --skip_test            Skip running tests
+# --enable_remote_cache  Add options to enable remote cache for build and test
+# --release_build        Build for release, compilation time will be longer to
+#                        ensure performance
+# --test_core_only       Use tensorflow/python/... as test target
+# --test_contrib_only    Use tensorflow/contrib/... as test target
 for ARG in "$@"; do
-  if [[ "$ARG" == --skip_test ]]; then
-    skip_test=1
-  elif [[ "$ARG" == --enable_remote_cache ]]; then
-    set_remote_cache_options
-  elif [[ "$ARG" == --release_build ]]; then
-    release_build=1
-  fi
+  case "$ARG" in
+    --skip_test) SKIP_TEST=1 ;;
+    --enable_remote_cache) set_remote_cache_options ;;
+    --release_build) RELEASE_BUILD=1 ;;
+    --test_core_only) TEST_TARGET="//${PY_TEST_DIR}/tensorflow/python/..." ;;
+    --test_contrib_only) TEST_TARGET="//${PY_TEST_DIR}/tensorflow/contrib/..." ;;
+    *)
+  esac
 done
 
-if [[ "$release_build" != 1 ]]; then
-  # --define=override_eigen_strong_inline=true speeds up the compiling of conv_grad_ops_3d.cc and conv_ops_3d.cc
+if [[ "$RELEASE_BUILD" == 1 ]]; then
+  # Overriding eigen strong inline speeds up the compiling of conv_grad_ops_3d.cc and conv_ops_3d.cc
   # by 20 minutes. See https://github.com/tensorflow/tensorflow/issues/10521
-  # Because this hurts the performance of TF, we don't enable it in release build.
-  echo "build --define=override_eigen_strong_inline=true" >> "${TMP_BAZELRC}"
+  # Because this hurts the performance of TF, we don't override it in release build.
+  export TF_OVERRIDE_EIGEN_STRONG_INLINE=0
+else
+  export TF_OVERRIDE_EIGEN_STRONG_INLINE=1
 fi
 
-# The host and target platforms are the same in Windows build. So we don't have
-# to distinct them. This helps avoid building the same targets twice.
-echo "build --distinct_host_configuration=false" >> "${TMP_BAZELRC}"
-
 # Enable short object file path to avoid long path issue on Windows.
 echo "startup --output_user_root=${TMPDIR}" >> "${TMP_BAZELRC}"
 
@@ -91,12 +100,11 @@ run_configure_for_gpu_build
 
 bazel build --announce_rc --config=opt tensorflow/tools/pip_package:build_pip_package || exit $?
 
-if [[ "$skip_test" == 1 ]]; then
+if [[ "$SKIP_TEST" == 1 ]]; then
   exit 0
 fi
 
 # Create a python test directory to avoid package name conflict
-PY_TEST_DIR="py_test_dir"
 create_python_test_dir "${PY_TEST_DIR}"
 
 ./bazel-bin/tensorflow/tools/pip_package/build_pip_package "$PWD/${PY_TEST_DIR}"
@@ -105,14 +113,18 @@ create_python_test_dir "${PY_TEST_DIR}"
 PIP_NAME=$(ls ${PY_TEST_DIR}/tensorflow-*.whl)
 reinstall_tensorflow_pip ${PIP_NAME}
 
+TF_GPU_COUNT=${TF_GPU_COUNT:-8}
+
 # Define no_tensorflow_py_deps=true so that every py_test has no deps anymore,
 # which will result testing system installed tensorflow
 # GPU tests are very flaky when running concurrently, so set local_test_jobs=1
 bazel test --announce_rc --config=opt -k --test_output=errors \
+  --test_env=TF_GPU_COUNT \
+  --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute \
   --define=no_tensorflow_py_deps=true --test_lang_filters=py \
   --test_tag_filters=-no_pip,-no_windows,-no_windows_gpu,-no_gpu,-no_pip_gpu,-no_oss \
   --build_tag_filters=-no_pip,-no_windows,-no_windows_gpu,-no_gpu,-no_pip_gpu,-no_oss --build_tests_only \
-  --local_test_jobs=1 --test_timeout="300,450,1200,3600" \
+  --test_size_filters=small,medium \
+  --local_test_jobs=$TF_GPU_COUNT --test_timeout="300,450,1200,3600" \
   --flaky_test_attempts=3 \
-  //${PY_TEST_DIR}/tensorflow/python/... \
-  //${PY_TEST_DIR}/tensorflow/contrib/...
+  ${TEST_TARGET}
diff --git a/tensorflow/tools/compatibility/BUILD b/tensorflow/tools/compatibility/BUILD
index b7bfb29aae..55792c51fe 100644
--- a/tensorflow/tools/compatibility/BUILD
+++ b/tensorflow/tools/compatibility/BUILD
@@ -8,10 +8,17 @@ load(
     "tf_cc_test",  # @unused
 )
 
+py_library(
+    name = "ast_edits",
+    srcs = ["ast_edits.py"],
+    srcs_version = "PY2AND3",
+)
+
 py_binary(
     name = "tf_upgrade",
     srcs = ["tf_upgrade.py"],
     srcs_version = "PY2AND3",
+    deps = [":ast_edits"],
 )
 
 py_test(
@@ -26,6 +33,28 @@ py_test(
     ],
 )
 
+py_binary(
+    name = "tf_upgrade_v2",
+    srcs = [
+        "renames_v2.py",
+        "tf_upgrade_v2.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [":ast_edits"],
+)
+
+py_test(
+    name = "tf_upgrade_v2_test",
+    srcs = ["tf_upgrade_v2_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":tf_upgrade_v2",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_test_lib",
+        "@six_archive//:six",
+    ],
+)
+
 # Keep for reference, this test will succeed in 0.11 but fail in 1.0
 # py_test(
 #     name = "test_file_v0_11",
@@ -62,9 +91,37 @@ py_test(
     ],
 )
 
+genrule(
+    name = "generate_upgraded_file_v2",
+    testonly = 1,
+    srcs = ["testdata/test_file_v1_10.py"],
+    outs = [
+        "test_file_v2_0.py",
+        "report_v2.txt",
+    ],
+    cmd = ("$(location :tf_upgrade_v2)" +
+           " --infile $(location testdata/test_file_v1_10.py)" +
+           " --outfile $(location test_file_v2_0.py)" +
+           " --reportfile $(location report_v2.txt)"),
+    tools = [":tf_upgrade_v2"],
+)
+
+py_test(
+    name = "test_file_v2_0",
+    size = "small",
+    srcs = ["test_file_v2_0.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow:tensorflow_py",
+    ],
+)
+
 exports_files(
     [
+        "ast_edits.py",
         "tf_upgrade.py",
+        "renames_v2.py",
         "testdata/test_file_v0_11.py",
+        "testdata/test_file_v1_10.py",
     ],
 )
diff --git a/tensorflow/tools/compatibility/ast_edits.py b/tensorflow/tools/compatibility/ast_edits.py
new file mode 100644
index 0000000000..23cc4a21a9
--- /dev/null
+++ b/tensorflow/tools/compatibility/ast_edits.py
@@ -0,0 +1,502 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Upgrader for Python scripts according to an API change specification."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import ast
+import collections
+import os
+import shutil
+import sys
+import tempfile
+import traceback
+
+
+class APIChangeSpec(object):
+  """This class defines the transformations that need to happen.
+
+  This class must provide the following fields:
+
+  * `function_keyword_renames`: maps function names to a map of old -> new
+    argument names
+  * `function_renames`: maps function names to new function names
+  * `change_to_function`: a set of function names that have changed (for
+    notifications)
+  * `function_reorders`: maps functions whose argument order has changed to the
+    list of arguments in the new order
+  * `function_handle`: maps function names to custom handlers for the function
+
+  For an example, see `TFAPIChangeSpec`.
+  """
+
+
+class _FileEditTuple(
+    collections.namedtuple("_FileEditTuple",
+                           ["comment", "line", "start", "old", "new"])):
+  """Each edit that is recorded by a _FileEditRecorder.
+
+  Fields:
+    comment: A description of the edit and why it was made.
+    line: The line number in the file where the edit occurs (1-indexed).
+    start: The line number in the file where the edit occurs (0-indexed).
+    old: text string to remove (this must match what was in file).
+    new: text string to add in place of `old`.
+  """
+
+  __slots__ = ()
+
+
+class _FileEditRecorder(object):
+  """Record changes that need to be done to the file."""
+
+  def __init__(self, filename):
+    # all edits are lists of chars
+    self._filename = filename
+
+    self._line_to_edit = collections.defaultdict(list)
+    self._errors = []
+
+  def process(self, text):
+    """Process a list of strings, each corresponding to the recorded changes.
+
+    Args:
+      text: A list of lines of text (assumed to contain newlines)
+    Returns:
+      A tuple of the modified text and a textual description of what is done.
+    Raises:
+      ValueError: if substitution source location does not have expected text.
+    """
+
+    change_report = ""
+
+    # Iterate of each line
+    for line, edits in self._line_to_edit.items():
+      offset = 0
+      # sort by column so that edits are processed in order in order to make
+      # indexing adjustments cumulative for changes that change the string
+      # length
+      edits.sort(key=lambda x: x.start)
+
+      # Extract each line to a list of characters, because mutable lists
+      # are editable, unlike immutable strings.
+      char_array = list(text[line - 1])
+
+      # Record a description of the change
+      change_report += "%r Line %d\n" % (self._filename, line)
+      change_report += "-" * 80 + "\n\n"
+      for e in edits:
+        change_report += "%s\n" % e.comment
+      change_report += "\n    Old: %s" % (text[line - 1])
+
+      # Make underscore buffers for underlining where in the line the edit was
+      change_list = [" "] * len(text[line - 1])
+      change_list_new = [" "] * len(text[line - 1])
+
+      # Iterate for each edit
+      for e in edits:
+        # Create effective start, end by accounting for change in length due
+        # to previous edits
+        start_eff = e.start + offset
+        end_eff = start_eff + len(e.old)
+
+        # Make sure the edit is changing what it should be changing
+        old_actual = "".join(char_array[start_eff:end_eff])
+        if old_actual != e.old:
+          raise ValueError("Expected text %r but got %r" %
+                           ("".join(e.old), "".join(old_actual)))
+        # Make the edit
+        char_array[start_eff:end_eff] = list(e.new)
+
+        # Create the underline highlighting of the before and after
+        change_list[e.start:e.start + len(e.old)] = "~" * len(e.old)
+        change_list_new[start_eff:end_eff] = "~" * len(e.new)
+
+        # Keep track of how to generate effective ranges
+        offset += len(e.new) - len(e.old)
+
+      # Finish the report comment
+      change_report += "         %s\n" % "".join(change_list)
+      text[line - 1] = "".join(char_array)
+      change_report += "    New: %s" % (text[line - 1])
+      change_report += "         %s\n\n" % "".join(change_list_new)
+    return "".join(text), change_report, self._errors
+
+  def add(self, comment, line, start, old, new, error=None):
+    """Add a new change that is needed.
+
+    Args:
+      comment: A description of what was changed
+      line: Line number (1 indexed)
+      start: Column offset (0 indexed)
+      old: old text
+      new: new text
+      error: this "edit" is something that cannot be fixed automatically
+    Returns:
+      None
+    """
+
+    self._line_to_edit[line].append(
+        _FileEditTuple(comment, line, start, old, new))
+    if error:
+      self._errors.append("%s:%d: %s" % (self._filename, line, error))
+
+
+class _ASTCallVisitor(ast.NodeVisitor):
+  """AST Visitor that processes function calls.
+
+  Updates function calls from old API version to new API version using a given
+  change spec.
+  """
+
+  def __init__(self, filename, lines, api_change_spec):
+    self._filename = filename
+    self._file_edit = _FileEditRecorder(filename)
+    self._lines = lines
+    self._api_change_spec = api_change_spec
+
+  def process(self, lines):
+    return self._file_edit.process(lines)
+
+  def generic_visit(self, node):
+    ast.NodeVisitor.generic_visit(self, node)
+
+  def _rename_functions(self, node, full_name):
+    function_renames = self._api_change_spec.function_renames
+    try:
+      new_name = function_renames[full_name]
+      self._file_edit.add("Renamed function %r to %r" % (full_name, new_name),
+                          node.lineno, node.col_offset, full_name, new_name)
+    except KeyError:
+      pass
+
+  def _get_attribute_full_path(self, node):
+    """Traverse an attribute to generate a full name e.g. tf.foo.bar.
+
+    Args:
+      node: A Node of type Attribute.
+
+    Returns:
+      a '.'-delimited full-name or None if the tree was not a simple form.
+      i.e. `foo()+b).bar` returns None, while `a.b.c` would return "a.b.c".
+    """
+    curr = node
+    items = []
+    while not isinstance(curr, ast.Name):
+      if not isinstance(curr, ast.Attribute):
+        return None
+      items.append(curr.attr)
+      curr = curr.value
+    items.append(curr.id)
+    return ".".join(reversed(items))
+
+  def _find_true_position(self, node):
+    """Return correct line number and column offset for a given node.
+
+    This is necessary mainly because ListComp's location reporting reports
+    the next token after the list comprehension list opening.
+
+    Args:
+      node: Node for which we wish to know the lineno and col_offset
+    """
+    import re
+    find_open = re.compile("^\s*(\\[).*$")
+    find_string_chars = re.compile("['\"]")
+
+    if isinstance(node, ast.ListComp):
+      # Strangely, ast.ListComp returns the col_offset of the first token
+      # after the '[' token which appears to be a bug. Workaround by
+      # explicitly finding the real start of the list comprehension.
+      line = node.lineno
+      col = node.col_offset
+      # loop over lines
+      while 1:
+        # Reverse the text to and regular expression search for whitespace
+        text = self._lines[line - 1]
+        reversed_preceding_text = text[:col][::-1]
+        # First find if a [ can be found with only whitespace between it and
+        # col.
+        m = find_open.match(reversed_preceding_text)
+        if m:
+          new_col_offset = col - m.start(1) - 1
+          return line, new_col_offset
+        else:
+          if (reversed_preceding_text == "" or
+              reversed_preceding_text.isspace()):
+            line = line - 1
+            prev_line = self._lines[line - 1]
+            # TODO(aselle):
+            # this is poor comment detection, but it is good enough for
+            # cases where the comment does not contain string literal starting/
+            # ending characters. If ast gave us start and end locations of the
+            # ast nodes rather than just start, we could use string literal
+            # node ranges to filter out spurious #'s that appear in string
+            # literals.
+            comment_start = prev_line.find("#")
+            if comment_start == -1:
+              col = len(prev_line) - 1
+            elif find_string_chars.search(prev_line[comment_start:]) is None:
+              col = comment_start
+            else:
+              return None, None
+          else:
+            return None, None
+    # Most other nodes return proper locations (with notably does not), but
+    # it is not possible to use that in an argument.
+    return node.lineno, node.col_offset
+
+  def visit_Call(self, node):  # pylint: disable=invalid-name
+    """Handle visiting a call node in the AST.
+
+    Args:
+      node: Current Node
+    """
+
+    # Find a simple attribute name path e.g. "tf.foo.bar"
+    full_name = self._get_attribute_full_path(node.func)
+
+    # Make sure the func is marked as being part of a call
+    node.func.is_function_for_call = True
+
+    if full_name:
+      # Call special handlers
+      function_handles = self._api_change_spec.function_handle
+      if full_name in function_handles:
+        function_handles[full_name](self._file_edit, node)
+
+      # Examine any non-keyword argument and make it into a keyword argument
+      # if reordering required.
+      function_reorders = self._api_change_spec.function_reorders
+      function_keyword_renames = (
+          self._api_change_spec.function_keyword_renames)
+
+      if full_name in function_reorders:
+        reordered = function_reorders[full_name]
+        for idx, arg in enumerate(node.args):
+          lineno, col_offset = self._find_true_position(arg)
+          if lineno is None or col_offset is None:
+            self._file_edit.add(
+                "Failed to add keyword %r to reordered function %r" %
+                (reordered[idx], full_name),
+                arg.lineno,
+                arg.col_offset,
+                "",
+                "",
+                error="A necessary keyword argument failed to be inserted.")
+          else:
+            keyword_arg = reordered[idx]
+            if (full_name in function_keyword_renames and
+                keyword_arg in function_keyword_renames[full_name]):
+              keyword_arg = function_keyword_renames[full_name][keyword_arg]
+            self._file_edit.add("Added keyword %r to reordered function %r" %
+                                (reordered[idx], full_name), lineno, col_offset,
+                                "", keyword_arg + "=")
+
+      # Examine each keyword argument and convert it to the final renamed form
+      renamed_keywords = ({} if full_name not in function_keyword_renames else
+                          function_keyword_renames[full_name])
+      for keyword in node.keywords:
+        argkey = keyword.arg
+        argval = keyword.value
+
+        if argkey in renamed_keywords:
+          argval_lineno, argval_col_offset = self._find_true_position(argval)
+          if argval_lineno is not None and argval_col_offset is not None:
+            # TODO(aselle): We should scan backward to find the start of the
+            # keyword key. Unfortunately ast does not give you the location of
+            # keyword keys, so we are forced to infer it from the keyword arg
+            # value.
+            key_start = argval_col_offset - len(argkey) - 1
+            key_end = key_start + len(argkey) + 1
+            if (self._lines[argval_lineno - 1][key_start:key_end] == argkey +
+                "="):
+              self._file_edit.add("Renamed keyword argument from %r to %r" %
+                                  (argkey,
+                                   renamed_keywords[argkey]), argval_lineno,
+                                  argval_col_offset - len(argkey) - 1,
+                                  argkey + "=", renamed_keywords[argkey] + "=")
+              continue
+          self._file_edit.add(
+              "Failed to rename keyword argument from %r to %r" %
+              (argkey, renamed_keywords[argkey]),
+              argval.lineno,
+              argval.col_offset - len(argkey) - 1,
+              "",
+              "",
+              error="Failed to find keyword lexographically. Fix manually.")
+
+    ast.NodeVisitor.generic_visit(self, node)
+
+  def visit_Attribute(self, node):  # pylint: disable=invalid-name
+    """Handle bare Attributes i.e. [tf.foo, tf.bar].
+
+    Args:
+      node: Node that is of type ast.Attribute
+    """
+    full_name = self._get_attribute_full_path(node)
+    if full_name:
+      self._rename_functions(node, full_name)
+    if full_name in self._api_change_spec.change_to_function:
+      if not hasattr(node, "is_function_for_call"):
+        new_text = full_name + "()"
+        self._file_edit.add("Changed %r to %r" % (full_name, new_text),
+                            node.lineno, node.col_offset, full_name, new_text)
+
+    ast.NodeVisitor.generic_visit(self, node)
+
+
+class ASTCodeUpgrader(object):
+  """Handles upgrading a set of Python files using a given API change spec."""
+
+  def __init__(self, api_change_spec):
+    if not isinstance(api_change_spec, APIChangeSpec):
+      raise TypeError("Must pass APIChangeSpec to ASTCodeUpgrader, got %s" %
+                      type(api_change_spec))
+    self._api_change_spec = api_change_spec
+
+  def process_file(self, in_filename, out_filename):
+    """Process the given python file for incompatible changes.
+
+    Args:
+      in_filename: filename to parse
+      out_filename: output file to write to
+    Returns:
+      A tuple representing number of files processed, log of actions, errors
+    """
+
+    # Write to a temporary file, just in case we are doing an implace modify.
+    with open(in_filename, "r") as in_file, \
+        tempfile.NamedTemporaryFile("w", delete=False) as temp_file:
+      ret = self.process_opened_file(in_filename, in_file, out_filename,
+                                     temp_file)
+
+    shutil.move(temp_file.name, out_filename)
+    return ret
+
+  # Broad exceptions are required here because ast throws whatever it wants.
+  # pylint: disable=broad-except
+  def process_opened_file(self, in_filename, in_file, out_filename, out_file):
+    """Process the given python file for incompatible changes.
+
+    This function is split out to facilitate StringIO testing from
+    tf_upgrade_test.py.
+
+    Args:
+      in_filename: filename to parse
+      in_file: opened file (or StringIO)
+      out_filename: output file to write to
+      out_file: opened file (or StringIO)
+    Returns:
+      A tuple representing number of files processed, log of actions, errors
+    """
+    process_errors = []
+    text = "-" * 80 + "\n"
+    text += "Processing file %r\n outputting to %r\n" % (in_filename,
+                                                         out_filename)
+    text += "-" * 80 + "\n\n"
+
+    parsed_ast = None
+    lines = in_file.readlines()
+    try:
+      parsed_ast = ast.parse("".join(lines))
+    except Exception:
+      text += "Failed to parse %r\n\n" % in_filename
+      text += traceback.format_exc()
+    if parsed_ast:
+      visitor = _ASTCallVisitor(in_filename, lines, self._api_change_spec)
+      visitor.visit(parsed_ast)
+      out_text, new_text, process_errors = visitor.process(lines)
+      text += new_text
+      if out_file:
+        out_file.write(out_text)
+    text += "\n"
+    return 1, text, process_errors
+
+  # pylint: enable=broad-except
+
+  def process_tree(self, root_directory, output_root_directory,
+                   copy_other_files):
+    """Processes upgrades on an entire tree of python files in place.
+
+    Note that only Python files. If you have custom code in other languages,
+    you will need to manually upgrade those.
+
+    Args:
+      root_directory: Directory to walk and process.
+      output_root_directory: Directory to use as base.
+      copy_other_files: Copy files that are not touched by this converter.
+
+    Returns:
+      A tuple of files processed, the report string ofr all files, and errors
+    """
+
+    # make sure output directory doesn't exist
+    if output_root_directory and os.path.exists(output_root_directory):
+      print("Output directory %r must not already exist." %
+            (output_root_directory))
+      sys.exit(1)
+
+    # make sure output directory does not overlap with root_directory
+    norm_root = os.path.split(os.path.normpath(root_directory))
+    norm_output = os.path.split(os.path.normpath(output_root_directory))
+    if norm_root == norm_output:
+      print("Output directory %r same as input directory %r" %
+            (root_directory, output_root_directory))
+      sys.exit(1)
+
+    # Collect list of files to process (we do this to correctly handle if the
+    # user puts the output directory in some sub directory of the input dir)
+    files_to_process = []
+    files_to_copy = []
+    for dir_name, _, file_list in os.walk(root_directory):
+      py_files = [f for f in file_list if f.endswith(".py")]
+      copy_files = [f for f in file_list if not f.endswith(".py")]
+      for filename in py_files:
+        fullpath = os.path.join(dir_name, filename)
+        fullpath_output = os.path.join(output_root_directory,
+                                       os.path.relpath(fullpath,
+                                                       root_directory))
+        files_to_process.append((fullpath, fullpath_output))
+      if copy_other_files:
+        for filename in copy_files:
+          fullpath = os.path.join(dir_name, filename)
+          fullpath_output = os.path.join(output_root_directory,
+                                         os.path.relpath(
+                                             fullpath, root_directory))
+          files_to_copy.append((fullpath, fullpath_output))
+
+    file_count = 0
+    tree_errors = []
+    report = ""
+    report += ("=" * 80) + "\n"
+    report += "Input tree: %r\n" % root_directory
+    report += ("=" * 80) + "\n"
+
+    for input_path, output_path in files_to_process:
+      output_directory = os.path.dirname(output_path)
+      if not os.path.isdir(output_directory):
+        os.makedirs(output_directory)
+      file_count += 1
+      _, l_report, l_errors = self.process_file(input_path, output_path)
+      tree_errors += l_errors
+      report += l_report
+    for input_path, output_path in files_to_copy:
+      output_directory = os.path.dirname(output_path)
+      if not os.path.isdir(output_directory):
+        os.makedirs(output_directory)
+      shutil.copy(input_path, output_path)
+    return file_count, report, tree_errors
diff --git a/tensorflow/tools/compatibility/renames_v2.py b/tensorflow/tools/compatibility/renames_v2.py
new file mode 100644
index 0000000000..216aa41b60
--- /dev/null
+++ b/tensorflow/tools/compatibility/renames_v2.py
@@ -0,0 +1,134 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=line-too-long
+"""List of renames to apply when converting from TF 1.0 to TF 2.0.
+
+THIS FILE IS AUTOGENERATED: To update, please run:
+  bazel build tensorflow/tools/compatibility/update:generate_v2_renames_map
+  bazel-bin/tensorflow/tools/compatibility/update/generate_v2_renames_map
+This file should be updated whenever endpoints are deprecated.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+renames = {
+    'tf.acos': 'tf.math.acos',
+    'tf.acosh': 'tf.math.acosh',
+    'tf.add': 'tf.math.add',
+    'tf.as_string': 'tf.dtypes.as_string',
+    'tf.asin': 'tf.math.asin',
+    'tf.asinh': 'tf.math.asinh',
+    'tf.atan': 'tf.math.atan',
+    'tf.atan2': 'tf.math.atan2',
+    'tf.atanh': 'tf.math.atanh',
+    'tf.batch_to_space_nd': 'tf.manip.batch_to_space_nd',
+    'tf.betainc': 'tf.math.betainc',
+    'tf.ceil': 'tf.math.ceil',
+    'tf.check_numerics': 'tf.debugging.check_numerics',
+    'tf.cholesky': 'tf.linalg.cholesky',
+    'tf.cos': 'tf.math.cos',
+    'tf.cosh': 'tf.math.cosh',
+    'tf.cross': 'tf.linalg.cross',
+    'tf.decode_base64': 'tf.io.decode_base64',
+    'tf.decode_compressed': 'tf.io.decode_compressed',
+    'tf.decode_json_example': 'tf.io.decode_json_example',
+    'tf.decode_raw': 'tf.io.decode_raw',
+    'tf.dequantize': 'tf.quantization.dequantize',
+    'tf.diag': 'tf.linalg.tensor_diag',
+    'tf.diag_part': 'tf.linalg.tensor_diag_part',
+    'tf.digamma': 'tf.math.digamma',
+    'tf.encode_base64': 'tf.io.encode_base64',
+    'tf.equal': 'tf.math.equal',
+    'tf.erfc': 'tf.math.erfc',
+    'tf.exp': 'tf.math.exp',
+    'tf.expm1': 'tf.math.expm1',
+    'tf.extract_image_patches': 'tf.image.extract_image_patches',
+    'tf.fake_quant_with_min_max_args': 'tf.quantization.fake_quant_with_min_max_args',
+    'tf.fake_quant_with_min_max_args_gradient': 'tf.quantization.fake_quant_with_min_max_args_gradient',
+    'tf.fake_quant_with_min_max_vars': 'tf.quantization.fake_quant_with_min_max_vars',
+    'tf.fake_quant_with_min_max_vars_gradient': 'tf.quantization.fake_quant_with_min_max_vars_gradient',
+    'tf.fake_quant_with_min_max_vars_per_channel': 'tf.quantization.fake_quant_with_min_max_vars_per_channel',
+    'tf.fake_quant_with_min_max_vars_per_channel_gradient': 'tf.quantization.fake_quant_with_min_max_vars_per_channel_gradient',
+    'tf.fft': 'tf.spectral.fft',
+    'tf.floor': 'tf.math.floor',
+    'tf.gather_nd': 'tf.manip.gather_nd',
+    'tf.greater': 'tf.math.greater',
+    'tf.greater_equal': 'tf.math.greater_equal',
+    'tf.ifft': 'tf.spectral.ifft',
+    'tf.igamma': 'tf.math.igamma',
+    'tf.igammac': 'tf.math.igammac',
+    'tf.invert_permutation': 'tf.math.invert_permutation',
+    'tf.is_finite': 'tf.debugging.is_finite',
+    'tf.is_inf': 'tf.debugging.is_inf',
+    'tf.is_nan': 'tf.debugging.is_nan',
+    'tf.less': 'tf.math.less',
+    'tf.less_equal': 'tf.math.less_equal',
+    'tf.lgamma': 'tf.math.lgamma',
+    'tf.log': 'tf.math.log',
+    'tf.log1p': 'tf.math.log1p',
+    'tf.logical_and': 'tf.math.logical_and',
+    'tf.logical_not': 'tf.math.logical_not',
+    'tf.logical_or': 'tf.math.logical_or',
+    'tf.matching_files': 'tf.io.matching_files',
+    'tf.matrix_band_part': 'tf.linalg.band_part',
+    'tf.matrix_determinant': 'tf.linalg.det',
+    'tf.matrix_diag': 'tf.linalg.diag',
+    'tf.matrix_diag_part': 'tf.linalg.diag_part',
+    'tf.matrix_inverse': 'tf.linalg.inv',
+    'tf.matrix_set_diag': 'tf.linalg.set_diag',
+    'tf.matrix_solve': 'tf.linalg.solve',
+    'tf.matrix_triangular_solve': 'tf.linalg.triangular_solve',
+    'tf.maximum': 'tf.math.maximum',
+    'tf.minimum': 'tf.math.minimum',
+    'tf.not_equal': 'tf.math.not_equal',
+    'tf.parse_tensor': 'tf.io.parse_tensor',
+    'tf.polygamma': 'tf.math.polygamma',
+    'tf.qr': 'tf.linalg.qr',
+    'tf.quantized_concat': 'tf.quantization.quantized_concat',
+    'tf.read_file': 'tf.io.read_file',
+    'tf.reciprocal': 'tf.math.reciprocal',
+    'tf.regex_replace': 'tf.strings.regex_replace',
+    'tf.reshape': 'tf.manip.reshape',
+    'tf.reverse': 'tf.manip.reverse',
+    'tf.reverse_v2': 'tf.manip.reverse',
+    'tf.rint': 'tf.math.rint',
+    'tf.rsqrt': 'tf.math.rsqrt',
+    'tf.scatter_nd': 'tf.manip.scatter_nd',
+    'tf.segment_max': 'tf.math.segment_max',
+    'tf.segment_mean': 'tf.math.segment_mean',
+    'tf.segment_min': 'tf.math.segment_min',
+    'tf.segment_prod': 'tf.math.segment_prod',
+    'tf.segment_sum': 'tf.math.segment_sum',
+    'tf.sin': 'tf.math.sin',
+    'tf.sinh': 'tf.math.sinh',
+    'tf.space_to_batch_nd': 'tf.manip.space_to_batch_nd',
+    'tf.squared_difference': 'tf.math.squared_difference',
+    'tf.string_join': 'tf.strings.join',
+    'tf.string_strip': 'tf.strings.strip',
+    'tf.string_to_hash_bucket': 'tf.strings.to_hash_bucket',
+    'tf.string_to_hash_bucket_fast': 'tf.strings.to_hash_bucket_fast',
+    'tf.string_to_hash_bucket_strong': 'tf.strings.to_hash_bucket_strong',
+    'tf.string_to_number': 'tf.strings.to_number',
+    'tf.substr': 'tf.strings.substr',
+    'tf.tan': 'tf.math.tan',
+    'tf.tile': 'tf.manip.tile',
+    'tf.unsorted_segment_max': 'tf.math.unsorted_segment_max',
+    'tf.unsorted_segment_min': 'tf.math.unsorted_segment_min',
+    'tf.unsorted_segment_prod': 'tf.math.unsorted_segment_prod',
+    'tf.unsorted_segment_sum': 'tf.math.unsorted_segment_sum',
+    'tf.write_file': 'tf.io.write_file',
+    'tf.zeta': 'tf.math.zeta'
+}
diff --git a/tensorflow/tools/compatibility/testdata/test_file_v1_10.py b/tensorflow/tools/compatibility/testdata/test_file_v1_10.py
new file mode 100644
index 0000000000..a49035a1a0
--- /dev/null
+++ b/tensorflow/tools/compatibility/testdata/test_file_v1_10.py
@@ -0,0 +1,34 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for tf upgrader."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import tensorflow as tf
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test as test_lib
+
+
+class TestUpgrade(test_util.TensorFlowTestCase):
+  """Test various APIs that have been changed in 2.0."""
+
+  def testRenames(self):
+    with self.test_session():
+      self.assertAllClose(1.04719755, tf.acos(0.5).eval())
+      self.assertAllClose(0.5, tf.rsqrt(4.0).eval())
+
+if __name__ == "__main__":
+  test_lib.main()
diff --git a/tensorflow/tools/compatibility/tf_upgrade.py b/tensorflow/tools/compatibility/tf_upgrade.py
index 1f8833582a..96705b1a4c 100644
--- a/tensorflow/tools/compatibility/tf_upgrade.py
+++ b/tensorflow/tools/compatibility/tf_upgrade.py
@@ -19,491 +19,11 @@ from __future__ import division
 from __future__ import print_function
 
 import argparse
-import ast
-import collections
-import os
-import shutil
-import sys
-import tempfile
-import traceback
 
+from tensorflow.tools.compatibility import ast_edits
 
-class APIChangeSpec(object):
-  """This class defines the transformations that need to happen.
 
-  This class must provide the following fields:
-
-  * `function_keyword_renames`: maps function names to a map of old -> new
-    argument names
-  * `function_renames`: maps function names to new function names
-  * `change_to_function`: a set of function names that have changed (for
-    notifications)
-  * `function_reorders`: maps functions whose argument order has changed to the
-    list of arguments in the new order
-  * `function_handle`: maps function names to custom handlers for the function
-
-  For an example, see `TFAPIChangeSpec`.
-  """
-
-
-class _FileEditTuple(
-    collections.namedtuple("_FileEditTuple",
-                           ["comment", "line", "start", "old", "new"])):
-  """Each edit that is recorded by a _FileEditRecorder.
-
-  Fields:
-    comment: A description of the edit and why it was made.
-    line: The line number in the file where the edit occurs (1-indexed).
-    start: The line number in the file where the edit occurs (0-indexed).
-    old: text string to remove (this must match what was in file).
-    new: text string to add in place of `old`.
-  """
-
-  __slots__ = ()
-
-
-class _FileEditRecorder(object):
-  """Record changes that need to be done to the file."""
-
-  def __init__(self, filename):
-    # all edits are lists of chars
-    self._filename = filename
-
-    self._line_to_edit = collections.defaultdict(list)
-    self._errors = []
-
-  def process(self, text):
-    """Process a list of strings, each corresponding to the recorded changes.
-
-    Args:
-      text: A list of lines of text (assumed to contain newlines)
-    Returns:
-      A tuple of the modified text and a textual description of what is done.
-    Raises:
-      ValueError: if substitution source location does not have expected text.
-    """
-
-    change_report = ""
-
-    # Iterate of each line
-    for line, edits in self._line_to_edit.items():
-      offset = 0
-      # sort by column so that edits are processed in order in order to make
-      # indexing adjustments cumulative for changes that change the string
-      # length
-      edits.sort(key=lambda x: x.start)
-
-      # Extract each line to a list of characters, because mutable lists
-      # are editable, unlike immutable strings.
-      char_array = list(text[line - 1])
-
-      # Record a description of the change
-      change_report += "%r Line %d\n" % (self._filename, line)
-      change_report += "-" * 80 + "\n\n"
-      for e in edits:
-        change_report += "%s\n" % e.comment
-      change_report += "\n    Old: %s" % (text[line - 1])
-
-      # Make underscore buffers for underlining where in the line the edit was
-      change_list = [" "] * len(text[line - 1])
-      change_list_new = [" "] * len(text[line - 1])
-
-      # Iterate for each edit
-      for e in edits:
-        # Create effective start, end by accounting for change in length due
-        # to previous edits
-        start_eff = e.start + offset
-        end_eff = start_eff + len(e.old)
-
-        # Make sure the edit is changing what it should be changing
-        old_actual = "".join(char_array[start_eff:end_eff])
-        if old_actual != e.old:
-          raise ValueError("Expected text %r but got %r" %
-                           ("".join(e.old), "".join(old_actual)))
-        # Make the edit
-        char_array[start_eff:end_eff] = list(e.new)
-
-        # Create the underline highlighting of the before and after
-        change_list[e.start:e.start + len(e.old)] = "~" * len(e.old)
-        change_list_new[start_eff:end_eff] = "~" * len(e.new)
-
-        # Keep track of how to generate effective ranges
-        offset += len(e.new) - len(e.old)
-
-      # Finish the report comment
-      change_report += "         %s\n" % "".join(change_list)
-      text[line - 1] = "".join(char_array)
-      change_report += "    New: %s" % (text[line - 1])
-      change_report += "         %s\n\n" % "".join(change_list_new)
-    return "".join(text), change_report, self._errors
-
-  def add(self, comment, line, start, old, new, error=None):
-    """Add a new change that is needed.
-
-    Args:
-      comment: A description of what was changed
-      line: Line number (1 indexed)
-      start: Column offset (0 indexed)
-      old: old text
-      new: new text
-      error: this "edit" is something that cannot be fixed automatically
-    Returns:
-      None
-    """
-
-    self._line_to_edit[line].append(
-        _FileEditTuple(comment, line, start, old, new))
-    if error:
-      self._errors.append("%s:%d: %s" % (self._filename, line, error))
-
-
-class _ASTCallVisitor(ast.NodeVisitor):
-  """AST Visitor that processes function calls.
-
-  Updates function calls from old API version to new API version using a given
-  change spec.
-  """
-
-  def __init__(self, filename, lines, api_change_spec):
-    self._filename = filename
-    self._file_edit = _FileEditRecorder(filename)
-    self._lines = lines
-    self._api_change_spec = api_change_spec
-
-  def process(self, lines):
-    return self._file_edit.process(lines)
-
-  def generic_visit(self, node):
-    ast.NodeVisitor.generic_visit(self, node)
-
-  def _rename_functions(self, node, full_name):
-    function_renames = self._api_change_spec.function_renames
-    try:
-      new_name = function_renames[full_name]
-      self._file_edit.add("Renamed function %r to %r" % (full_name, new_name),
-                          node.lineno, node.col_offset, full_name, new_name)
-    except KeyError:
-      pass
-
-  def _get_attribute_full_path(self, node):
-    """Traverse an attribute to generate a full name e.g. tf.foo.bar.
-
-    Args:
-      node: A Node of type Attribute.
-
-    Returns:
-      a '.'-delimited full-name or None if the tree was not a simple form.
-      i.e. `foo()+b).bar` returns None, while `a.b.c` would return "a.b.c".
-    """
-    curr = node
-    items = []
-    while not isinstance(curr, ast.Name):
-      if not isinstance(curr, ast.Attribute):
-        return None
-      items.append(curr.attr)
-      curr = curr.value
-    items.append(curr.id)
-    return ".".join(reversed(items))
-
-  def _find_true_position(self, node):
-    """Return correct line number and column offset for a given node.
-
-    This is necessary mainly because ListComp's location reporting reports
-    the next token after the list comprehension list opening.
-
-    Args:
-      node: Node for which we wish to know the lineno and col_offset
-    """
-    import re
-    find_open = re.compile("^\s*(\\[).*$")
-    find_string_chars = re.compile("['\"]")
-
-    if isinstance(node, ast.ListComp):
-      # Strangely, ast.ListComp returns the col_offset of the first token
-      # after the '[' token which appears to be a bug. Workaround by
-      # explicitly finding the real start of the list comprehension.
-      line = node.lineno
-      col = node.col_offset
-      # loop over lines
-      while 1:
-        # Reverse the text to and regular expression search for whitespace
-        text = self._lines[line - 1]
-        reversed_preceding_text = text[:col][::-1]
-        # First find if a [ can be found with only whitespace between it and
-        # col.
-        m = find_open.match(reversed_preceding_text)
-        if m:
-          new_col_offset = col - m.start(1) - 1
-          return line, new_col_offset
-        else:
-          if (reversed_preceding_text == "" or
-              reversed_preceding_text.isspace()):
-            line = line - 1
-            prev_line = self._lines[line - 1]
-            # TODO(aselle):
-            # this is poor comment detection, but it is good enough for
-            # cases where the comment does not contain string literal starting/
-            # ending characters. If ast gave us start and end locations of the
-            # ast nodes rather than just start, we could use string literal
-            # node ranges to filter out spurious #'s that appear in string
-            # literals.
-            comment_start = prev_line.find("#")
-            if comment_start == -1:
-              col = len(prev_line) - 1
-            elif find_string_chars.search(prev_line[comment_start:]) is None:
-              col = comment_start
-            else:
-              return None, None
-          else:
-            return None, None
-    # Most other nodes return proper locations (with notably does not), but
-    # it is not possible to use that in an argument.
-    return node.lineno, node.col_offset
-
-  def visit_Call(self, node):  # pylint: disable=invalid-name
-    """Handle visiting a call node in the AST.
-
-    Args:
-      node: Current Node
-    """
-
-    # Find a simple attribute name path e.g. "tf.foo.bar"
-    full_name = self._get_attribute_full_path(node.func)
-
-    # Make sure the func is marked as being part of a call
-    node.func.is_function_for_call = True
-
-    if full_name:
-      # Call special handlers
-      function_handles = self._api_change_spec.function_handle
-      if full_name in function_handles:
-        function_handles[full_name](self._file_edit, node)
-
-      # Examine any non-keyword argument and make it into a keyword argument
-      # if reordering required.
-      function_reorders = self._api_change_spec.function_reorders
-      function_keyword_renames = (
-          self._api_change_spec.function_keyword_renames)
-
-      if full_name in function_reorders:
-        reordered = function_reorders[full_name]
-        for idx, arg in enumerate(node.args):
-          lineno, col_offset = self._find_true_position(arg)
-          if lineno is None or col_offset is None:
-            self._file_edit.add(
-                "Failed to add keyword %r to reordered function %r" %
-                (reordered[idx], full_name),
-                arg.lineno,
-                arg.col_offset,
-                "",
-                "",
-                error="A necessary keyword argument failed to be inserted.")
-          else:
-            keyword_arg = reordered[idx]
-            if (full_name in function_keyword_renames and
-                keyword_arg in function_keyword_renames[full_name]):
-              keyword_arg = function_keyword_renames[full_name][keyword_arg]
-            self._file_edit.add("Added keyword %r to reordered function %r" %
-                                (reordered[idx], full_name), lineno, col_offset,
-                                "", keyword_arg + "=")
-
-      # Examine each keyword argument and convert it to the final renamed form
-      renamed_keywords = ({} if full_name not in function_keyword_renames else
-                          function_keyword_renames[full_name])
-      for keyword in node.keywords:
-        argkey = keyword.arg
-        argval = keyword.value
-
-        if argkey in renamed_keywords:
-          argval_lineno, argval_col_offset = self._find_true_position(argval)
-          if argval_lineno is not None and argval_col_offset is not None:
-            # TODO(aselle): We should scan backward to find the start of the
-            # keyword key. Unfortunately ast does not give you the location of
-            # keyword keys, so we are forced to infer it from the keyword arg
-            # value.
-            key_start = argval_col_offset - len(argkey) - 1
-            key_end = key_start + len(argkey) + 1
-            if (self._lines[argval_lineno - 1][key_start:key_end] == argkey +
-                "="):
-              self._file_edit.add("Renamed keyword argument from %r to %r" %
-                                  (argkey,
-                                   renamed_keywords[argkey]), argval_lineno,
-                                  argval_col_offset - len(argkey) - 1,
-                                  argkey + "=", renamed_keywords[argkey] + "=")
-              continue
-          self._file_edit.add(
-              "Failed to rename keyword argument from %r to %r" %
-              (argkey, renamed_keywords[argkey]),
-              argval.lineno,
-              argval.col_offset - len(argkey) - 1,
-              "",
-              "",
-              error="Failed to find keyword lexographically. Fix manually.")
-
-    ast.NodeVisitor.generic_visit(self, node)
-
-  def visit_Attribute(self, node):  # pylint: disable=invalid-name
-    """Handle bare Attributes i.e. [tf.foo, tf.bar].
-
-    Args:
-      node: Node that is of type ast.Attribute
-    """
-    full_name = self._get_attribute_full_path(node)
-    if full_name:
-      self._rename_functions(node, full_name)
-    if full_name in self._api_change_spec.change_to_function:
-      if not hasattr(node, "is_function_for_call"):
-        new_text = full_name + "()"
-        self._file_edit.add("Changed %r to %r" % (full_name, new_text),
-                            node.lineno, node.col_offset, full_name, new_text)
-
-    ast.NodeVisitor.generic_visit(self, node)
-
-
-class ASTCodeUpgrader(object):
-  """Handles upgrading a set of Python files using a given API change spec."""
-
-  def __init__(self, api_change_spec):
-    if not isinstance(api_change_spec, APIChangeSpec):
-      raise TypeError("Must pass APIChangeSpec to ASTCodeUpgrader, got %s" %
-                      type(api_change_spec))
-    self._api_change_spec = api_change_spec
-
-  def process_file(self, in_filename, out_filename):
-    """Process the given python file for incompatible changes.
-
-    Args:
-      in_filename: filename to parse
-      out_filename: output file to write to
-    Returns:
-      A tuple representing number of files processed, log of actions, errors
-    """
-
-    # Write to a temporary file, just in case we are doing an implace modify.
-    with open(in_filename, "r") as in_file, \
-        tempfile.NamedTemporaryFile("w", delete=False) as temp_file:
-      ret = self.process_opened_file(in_filename, in_file, out_filename,
-                                     temp_file)
-
-    shutil.move(temp_file.name, out_filename)
-    return ret
-
-  # Broad exceptions are required here because ast throws whatever it wants.
-  # pylint: disable=broad-except
-  def process_opened_file(self, in_filename, in_file, out_filename, out_file):
-    """Process the given python file for incompatible changes.
-
-    This function is split out to facilitate StringIO testing from
-    tf_upgrade_test.py.
-
-    Args:
-      in_filename: filename to parse
-      in_file: opened file (or StringIO)
-      out_filename: output file to write to
-      out_file: opened file (or StringIO)
-    Returns:
-      A tuple representing number of files processed, log of actions, errors
-    """
-    process_errors = []
-    text = "-" * 80 + "\n"
-    text += "Processing file %r\n outputting to %r\n" % (in_filename,
-                                                         out_filename)
-    text += "-" * 80 + "\n\n"
-
-    parsed_ast = None
-    lines = in_file.readlines()
-    try:
-      parsed_ast = ast.parse("".join(lines))
-    except Exception:
-      text += "Failed to parse %r\n\n" % in_filename
-      text += traceback.format_exc()
-    if parsed_ast:
-      visitor = _ASTCallVisitor(in_filename, lines, self._api_change_spec)
-      visitor.visit(parsed_ast)
-      out_text, new_text, process_errors = visitor.process(lines)
-      text += new_text
-      if out_file:
-        out_file.write(out_text)
-    text += "\n"
-    return 1, text, process_errors
-
-  # pylint: enable=broad-except
-
-  def process_tree(self, root_directory, output_root_directory,
-                   copy_other_files):
-    """Processes upgrades on an entire tree of python files in place.
-
-    Note that only Python files. If you have custom code in other languages,
-    you will need to manually upgrade those.
-
-    Args:
-      root_directory: Directory to walk and process.
-      output_root_directory: Directory to use as base.
-      copy_other_files: Copy files that are not touched by this converter.
-
-    Returns:
-      A tuple of files processed, the report string ofr all files, and errors
-    """
-
-    # make sure output directory doesn't exist
-    if output_root_directory and os.path.exists(output_root_directory):
-      print("Output directory %r must not already exist." %
-            (output_root_directory))
-      sys.exit(1)
-
-    # make sure output directory does not overlap with root_directory
-    norm_root = os.path.split(os.path.normpath(root_directory))
-    norm_output = os.path.split(os.path.normpath(output_root_directory))
-    if norm_root == norm_output:
-      print("Output directory %r same as input directory %r" %
-            (root_directory, output_root_directory))
-      sys.exit(1)
-
-    # Collect list of files to process (we do this to correctly handle if the
-    # user puts the output directory in some sub directory of the input dir)
-    files_to_process = []
-    files_to_copy = []
-    for dir_name, _, file_list in os.walk(root_directory):
-      py_files = [f for f in file_list if f.endswith(".py")]
-      copy_files = [f for f in file_list if not f.endswith(".py")]
-      for filename in py_files:
-        fullpath = os.path.join(dir_name, filename)
-        fullpath_output = os.path.join(output_root_directory,
-                                       os.path.relpath(fullpath,
-                                                       root_directory))
-        files_to_process.append((fullpath, fullpath_output))
-      if copy_other_files:
-        for filename in copy_files:
-          fullpath = os.path.join(dir_name, filename)
-          fullpath_output = os.path.join(output_root_directory,
-                                         os.path.relpath(
-                                             fullpath, root_directory))
-          files_to_copy.append((fullpath, fullpath_output))
-
-    file_count = 0
-    tree_errors = []
-    report = ""
-    report += ("=" * 80) + "\n"
-    report += "Input tree: %r\n" % root_directory
-    report += ("=" * 80) + "\n"
-
-    for input_path, output_path in files_to_process:
-      output_directory = os.path.dirname(output_path)
-      if not os.path.isdir(output_directory):
-        os.makedirs(output_directory)
-      file_count += 1
-      _, l_report, l_errors = self.process_file(input_path, output_path)
-      tree_errors += l_errors
-      report += l_report
-    for input_path, output_path in files_to_copy:
-      output_directory = os.path.dirname(output_path)
-      if not os.path.isdir(output_directory):
-        os.makedirs(output_directory)
-      shutil.copy(input_path, output_path)
-    return file_count, report, tree_errors
-
-
-class TFAPIChangeSpec(APIChangeSpec):
+class TFAPIChangeSpec(ast_edits.APIChangeSpec):
   """List of maps that describe what changed in the API."""
 
   def __init__(self):
@@ -718,7 +238,7 @@ Simple usage:
       default="report.txt")
   args = parser.parse_args()
 
-  upgrade = ASTCodeUpgrader(TFAPIChangeSpec())
+  upgrade = ast_edits.ASTCodeUpgrader(TFAPIChangeSpec())
   report_text = None
   report_filename = args.report_filename
   files_processed = 0
diff --git a/tensorflow/tools/compatibility/tf_upgrade_test.py b/tensorflow/tools/compatibility/tf_upgrade_test.py
index 3d02eacba6..66325ea2ad 100644
--- a/tensorflow/tools/compatibility/tf_upgrade_test.py
+++ b/tensorflow/tools/compatibility/tf_upgrade_test.py
@@ -22,6 +22,7 @@ import tempfile
 import six
 from tensorflow.python.framework import test_util
 from tensorflow.python.platform import test as test_lib
+from tensorflow.tools.compatibility import ast_edits
 from tensorflow.tools.compatibility import tf_upgrade
 
 
@@ -36,7 +37,7 @@ class TestUpgrade(test_util.TensorFlowTestCase):
   def _upgrade(self, old_file_text):
     in_file = six.StringIO(old_file_text)
     out_file = six.StringIO()
-    upgrader = tf_upgrade.ASTCodeUpgrader(tf_upgrade.TFAPIChangeSpec())
+    upgrader = ast_edits.ASTCodeUpgrader(tf_upgrade.TFAPIChangeSpec())
     count, report, errors = (
         upgrader.process_opened_file("test.py", in_file,
                                      "test_out.py", out_file))
@@ -139,7 +140,7 @@ class TestUpgradeFiles(test_util.TensorFlowTestCase):
     upgraded = "tf.multiply(a, b)\n"
     temp_file.write(original)
     temp_file.close()
-    upgrader = tf_upgrade.ASTCodeUpgrader(tf_upgrade.TFAPIChangeSpec())
+    upgrader = ast_edits.ASTCodeUpgrader(tf_upgrade.TFAPIChangeSpec())
     upgrader.process_file(temp_file.name, temp_file.name)
     self.assertAllEqual(open(temp_file.name).read(), upgraded)
     os.unlink(temp_file.name)
diff --git a/tensorflow/tools/compatibility/tf_upgrade_v2.py b/tensorflow/tools/compatibility/tf_upgrade_v2.py
new file mode 100644
index 0000000000..9702430a12
--- /dev/null
+++ b/tensorflow/tools/compatibility/tf_upgrade_v2.py
@@ -0,0 +1,115 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Upgrader for Python scripts from 1.* TensorFlow to 2.0 TensorFlow."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import argparse
+
+from tensorflow.tools.compatibility import ast_edits
+from tensorflow.tools.compatibility import renames_v2
+
+
+class TFAPIChangeSpec(ast_edits.APIChangeSpec):
+  """List of maps that describe what changed in the API."""
+
+  def __init__(self):
+    # Maps from a function name to a dictionary that describes how to
+    # map from an old argument keyword to the new argument keyword.
+    self.function_keyword_renames = {}
+
+    # Mapping from function to the new name of the function
+    self.function_renames = renames_v2.renames
+
+    # Variables that should be changed to functions.
+    self.change_to_function = {}
+
+    # Functions that were reordered should be changed to the new keyword args
+    # for safety, if positional arguments are used. If you have reversed the
+    # positional arguments yourself, this could do the wrong thing.
+    self.function_reorders = {}
+
+    # Specially handled functions.
+    self.function_handle = {}
+
+
+if __name__ == "__main__":
+  parser = argparse.ArgumentParser(
+      formatter_class=argparse.RawDescriptionHelpFormatter,
+      description="""Convert a TensorFlow Python file to 2.0
+
+Simple usage:
+  tf_convert_v2.py --infile foo.py --outfile bar.py
+  tf_convert_v2.py --intree ~/code/old --outtree ~/code/new
+""")
+  parser.add_argument(
+      "--infile",
+      dest="input_file",
+      help="If converting a single file, the name of the file "
+      "to convert")
+  parser.add_argument(
+      "--outfile",
+      dest="output_file",
+      help="If converting a single file, the output filename.")
+  parser.add_argument(
+      "--intree",
+      dest="input_tree",
+      help="If converting a whole tree of files, the directory "
+      "to read from (relative or absolute).")
+  parser.add_argument(
+      "--outtree",
+      dest="output_tree",
+      help="If converting a whole tree of files, the output "
+      "directory (relative or absolute).")
+  parser.add_argument(
+      "--copyotherfiles",
+      dest="copy_other_files",
+      help=("If converting a whole tree of files, whether to "
+            "copy the other files."),
+      type=bool,
+      default=False)
+  parser.add_argument(
+      "--reportfile",
+      dest="report_filename",
+      help=("The name of the file where the report log is "
+            "stored."
+            "(default: %(default)s)"),
+      default="report.txt")
+  args = parser.parse_args()
+
+  upgrade = ast_edits.ASTCodeUpgrader(TFAPIChangeSpec())
+  report_text = None
+  report_filename = args.report_filename
+  files_processed = 0
+  if args.input_file:
+    files_processed, report_text, errors = upgrade.process_file(
+        args.input_file, args.output_file)
+    files_processed = 1
+  elif args.input_tree:
+    files_processed, report_text, errors = upgrade.process_tree(
+        args.input_tree, args.output_tree, args.copy_other_files)
+  else:
+    parser.print_help()
+  if report_text:
+    open(report_filename, "w").write(report_text)
+    print("TensorFlow 2.0 Upgrade Script")
+    print("-----------------------------")
+    print("Converted %d files\n" % files_processed)
+    print("Detected %d errors that require attention" % len(errors))
+    print("-" * 80)
+    print("\n".join(errors))
+    print("\nMake sure to read the detailed log %r\n" % report_filename)
diff --git a/tensorflow/tools/compatibility/tf_upgrade_v2_test.py b/tensorflow/tools/compatibility/tf_upgrade_v2_test.py
new file mode 100644
index 0000000000..57ac04de06
--- /dev/null
+++ b/tensorflow/tools/compatibility/tf_upgrade_v2_test.py
@@ -0,0 +1,83 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for tf 2.0 upgrader."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import tempfile
+import six
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test as test_lib
+from tensorflow.tools.compatibility import ast_edits
+from tensorflow.tools.compatibility import tf_upgrade_v2
+
+
+class TestUpgrade(test_util.TensorFlowTestCase):
+  """Test various APIs that have been changed in 2.0.
+
+  We also test whether a converted file is executable. test_file_v1_10.py
+  aims to exhaustively test that API changes are convertible and actually
+  work when run with current TensorFlow.
+  """
+
+  def _upgrade(self, old_file_text):
+    in_file = six.StringIO(old_file_text)
+    out_file = six.StringIO()
+    upgrader = ast_edits.ASTCodeUpgrader(tf_upgrade_v2.TFAPIChangeSpec())
+    count, report, errors = (
+        upgrader.process_opened_file("test.py", in_file,
+                                     "test_out.py", out_file))
+    return count, report, errors, out_file.getvalue()
+
+  def testParseError(self):
+    _, report, unused_errors, unused_new_text = self._upgrade(
+        "import tensorflow as tf\na + \n")
+    self.assertTrue(report.find("Failed to parse") != -1)
+
+  def testReport(self):
+    text = "tf.acos(a)\n"
+    _, report, unused_errors, unused_new_text = self._upgrade(text)
+    # This is not a complete test, but it is a sanity test that a report
+    # is generating information.
+    self.assertTrue(report.find("Renamed function `tf.acos` to `tf.math.acos`"))
+
+  def testRename(self):
+    text = "tf.acos(a)\n"
+    _, unused_report, unused_errors, new_text = self._upgrade(text)
+    self.assertEqual(new_text, "tf.math.acos(a)\n")
+    text = "tf.rsqrt(tf.log(3.8))\n"
+    _, unused_report, unused_errors, new_text = self._upgrade(text)
+    self.assertEqual(new_text, "tf.math.rsqrt(tf.math.log(3.8))\n")
+
+
+class TestUpgradeFiles(test_util.TensorFlowTestCase):
+
+  def testInplace(self):
+    """Check to make sure we don't have a file system race."""
+    temp_file = tempfile.NamedTemporaryFile("w", delete=False)
+    original = "tf.acos(a, b)\n"
+    upgraded = "tf.math.acos(a, b)\n"
+    temp_file.write(original)
+    temp_file.close()
+    upgrader = ast_edits.ASTCodeUpgrader(tf_upgrade_v2.TFAPIChangeSpec())
+    upgrader.process_file(temp_file.name, temp_file.name)
+    self.assertAllEqual(open(temp_file.name).read(), upgraded)
+    os.unlink(temp_file.name)
+
+
+if __name__ == "__main__":
+  test_lib.main()
diff --git a/tensorflow/tools/compatibility/update/BUILD b/tensorflow/tools/compatibility/update/BUILD
new file mode 100644
index 0000000000..feb37c902e
--- /dev/null
+++ b/tensorflow/tools/compatibility/update/BUILD
@@ -0,0 +1,15 @@
+licenses(["notice"])  # Apache 2.0
+
+package(default_visibility = ["//visibility:private"])
+
+py_binary(
+    name = "generate_v2_renames_map",
+    srcs = ["generate_v2_renames_map.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow:tensorflow_py",
+        "//tensorflow/python:lib",
+        "//tensorflow/tools/common:public_api",
+        "//tensorflow/tools/common:traverse",
+    ],
+)
diff --git a/tensorflow/tools/compatibility/update/generate_v2_renames_map.py b/tensorflow/tools/compatibility/update/generate_v2_renames_map.py
new file mode 100644
index 0000000000..567eceb0b6
--- /dev/null
+++ b/tensorflow/tools/compatibility/update/generate_v2_renames_map.py
@@ -0,0 +1,103 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=line-too-long
+"""Script for updating tensorflow/tools/compatibility/renames_v2.py.
+
+To update renames_v2.py, run:
+  bazel build tensorflow/tools/compatibility/update:generate_v2_renames_map
+  bazel-bin/tensorflow/tools/compatibility/update/generate_v2_renames_map
+"""
+# pylint: enable=line-too-long
+
+import tensorflow as tf
+
+from tensorflow.python.lib.io import file_io
+from tensorflow.python.util import tf_decorator
+from tensorflow.python.util import tf_export
+from tensorflow.tools.common import public_api
+from tensorflow.tools.common import traverse
+
+
+_OUTPUT_FILE_PATH = 'third_party/tensorflow/tools/compatibility/renames_v2.py'
+_FILE_HEADER = """# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=line-too-long
+\"\"\"List of renames to apply when converting from TF 1.0 to TF 2.0.
+
+THIS FILE IS AUTOGENERATED: To update, please run:
+  bazel build tensorflow/tools/compatibility/update:generate_v2_renames_map
+  bazel-bin/tensorflow/tools/compatibility/update/generate_v2_renames_map
+This file should be updated whenever endpoints are deprecated.
+\"\"\"
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+"""
+
+
+def update_renames_v2(output_file_path):
+  """Writes a Python dictionary mapping deprecated to canonical API names.
+
+  Args:
+    output_file_path: File path to write output to. Any existing contents
+      would be replaced.
+  """
+  # Set of rename lines to write to output file in the form:
+  #   'tf.deprecated_name': 'tf.canonical_name'
+  rename_line_set = set()
+  # _tf_api_names attribute name
+  tensorflow_api_attr = tf_export.API_ATTRS[tf_export.TENSORFLOW_API_NAME].names
+
+  def visit(unused_path, unused_parent, children):
+    """Visitor that collects rename strings to add to rename_line_set."""
+    for child in children:
+      _, attr = tf_decorator.unwrap(child[1])
+      if not hasattr(attr, '__dict__'):
+        continue
+      api_names = attr.__dict__.get(tensorflow_api_attr, [])
+      deprecated_api_names = attr.__dict__.get('_tf_deprecated_api_names', [])
+      canonical_name = tf_export.get_canonical_name(
+          api_names, deprecated_api_names)
+      for name in deprecated_api_names:
+        rename_line_set.add('    \'tf.%s\': \'tf.%s\'' % (name, canonical_name))
+
+  visitor = public_api.PublicAPIVisitor(visit)
+  visitor.do_not_descend_map['tf'].append('contrib')
+  traverse.traverse(tf, visitor)
+
+  renames_file_text = '%srenames = {\n%s\n}\n' % (
+      _FILE_HEADER, ',\n'.join(sorted(rename_line_set)))
+  file_io.write_string_to_file(output_file_path, renames_file_text)
+
+
+def main(unused_argv):
+  update_renames_v2(_OUTPUT_FILE_PATH)
+
+
+if __name__ == '__main__':
+  tf.app.run(main=main)
diff --git a/tensorflow/tools/docker/Dockerfile.devel b/tensorflow/tools/docker/Dockerfile.devel
index 57a491255e..f7fe4119da 100644
--- a/tensorflow/tools/docker/Dockerfile.devel
+++ b/tensorflow/tools/docker/Dockerfile.devel
@@ -63,7 +63,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
 RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
     >>/etc/bazel.bazelrc
 # Install the most recent bazel release.
-ENV BAZEL_VERSION 0.11.0
+ENV BAZEL_VERSION 0.15.0
 WORKDIR /
 RUN mkdir /bazel && \
     cd /bazel && \
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index 204b5b4dba..340f96df48 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -15,6 +15,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
         git \
         libcudnn7=7.1.4.18-1+cuda9.0 \
         libcudnn7-dev=7.1.4.18-1+cuda9.0 \
+        libnccl2=2.2.13-1+cuda9.0 \
+        libnccl-dev=2.2.13-1+cuda9.0 \
         libcurl3-dev \
         libfreetype6-dev \
         libhdf5-serial-dev \
@@ -33,6 +35,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     find /usr/local/cuda-9.0/lib64/ -type f -name 'lib*_static.a' -not -name 'libcudart_static.a' -delete && \
     rm /usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
 
+# Link NCCL libray and header where the build script expects them.
+RUN mkdir /usr/local/cuda-9.0/lib &&  \
+    ln -s /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda/lib/libnccl.so.2 && \
+    ln -s /usr/include/nccl.h /usr/local/cuda/include/nccl.h
+
 RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
     python get-pip.py && \
     rm get-pip.py
@@ -72,7 +79,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
 RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
     >>/etc/bazel.bazelrc
 # Install the most recent bazel release.
-ENV BAZEL_VERSION 0.11.0
+ENV BAZEL_VERSION 0.15.0
 WORKDIR /
 RUN mkdir /bazel && \
     cd /bazel && \
@@ -91,10 +98,13 @@ RUN git clone --branch=r1.9 --depth=1 https://github.com/tensorflow/tensorflow.g
 ENV CI_BUILD_PYTHON python
 ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
 ENV TF_NEED_CUDA 1
-ENV TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2,6.0,6.1
+ENV TF_CUDA_COMPUTE_CAPABILITIES=3.5,5.2,6.0,6.1,7.0
 ENV TF_CUDA_VERSION=9.0
 ENV TF_CUDNN_VERSION=7
 
+# NCCL 2.x
+ENV TF_NCCL_VERSION=2
+
 RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
     LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH} \
     tensorflow/tools/ci_build/builds/configured GPU \
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7 b/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
new file mode 100644
index 0000000000..30bc2d2806
--- /dev/null
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
@@ -0,0 +1,115 @@
+FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
+
+LABEL maintainer="Gunhan Gulsoy <gunan@google.com>"
+
+# It is possible to override these for releases.
+ARG TF_BRANCH=master
+ARG BAZEL_VERSION=0.15.0
+ARG TF_AVAILABLE_CPUS=32
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        build-essential \
+        curl \
+        git \
+        golang \
+        libcurl3-dev \
+        libfreetype6-dev \
+        libpng12-dev \
+        libzmq3-dev \
+        pkg-config \
+        python-dev \
+        python-pip \
+        rsync \
+        software-properties-common \
+        unzip \
+        zip \
+        zlib1g-dev \
+        openjdk-8-jdk \
+        openjdk-8-jre-headless \
+        wget \
+        && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN pip --no-cache-dir install --upgrade \
+        pip setuptools
+
+RUN pip --no-cache-dir install \
+        ipykernel \
+        jupyter \
+        matplotlib \
+        numpy \
+        scipy \
+        sklearn \
+        pandas \
+        wheel \
+        && \
+    python -m ipykernel.kernelspec
+
+# Set up our notebook config.
+COPY jupyter_notebook_config.py /root/.jupyter/
+
+# Jupyter has issues with being run directly:
+#   https://github.com/ipython/ipython/issues/7062
+# We just add a little wrapper script.
+COPY run_jupyter.sh /
+
+# Set up Bazel.
+
+# Running bazel inside a `docker build` command causes trouble, cf:
+#   https://github.com/bazelbuild/bazel/issues/134
+# The easiest solution is to set up a bazelrc file forcing --batch.
+RUN echo "startup --batch" >>/etc/bazel.bazelrc
+# Similarly, we need to workaround sandboxing issues:
+#   https://github.com/bazelbuild/bazel/issues/418
+RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
+    >>/etc/bazel.bazelrc
+WORKDIR /
+RUN mkdir /bazel && \
+    cd /bazel && \
+    wget --quiet https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
+    wget --quiet https://raw.githubusercontent.com/bazelbuild/bazel/master/LICENSE && \
+    chmod +x bazel-*.sh && \
+    ./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
+    rm -f /bazel/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
+
+# Download and build TensorFlow.
+WORKDIR /
+RUN git clone https://github.com/tensorflow/tensorflow.git && \
+    cd tensorflow && \
+    git checkout ${TF_BRANCH}
+WORKDIR /tensorflow
+
+# Configure the build for our CUDA configuration.
+ENV CI_BUILD_PYTHON=python \
+    LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:${LD_LIBRARY_PATH} \
+    CUDNN_INSTALL_PATH=/usr/lib/x86_64-linux-gnu \
+    PYTHON_BIN_PATH=/usr/bin/python \
+    PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages \
+    TF_NEED_CUDA=1 \
+    TF_CUDA_VERSION=9.0 \
+    TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2,6.0,6.1,7.0 \
+    TF_CUDNN_VERSION=7
+RUN ./configure
+
+# Build and Install TensorFlow.
+RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
+    LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH} \
+    bazel build -c opt \
+                --config=cuda \
+                --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
+                --jobs=${TF_AVAILABLE_CPUS} \
+                tensorflow/tools/pip_package:build_pip_package && \
+    mkdir /pip_pkg && \
+    bazel-bin/tensorflow/tools/pip_package/build_pip_package /pip_pkg && \
+    pip --no-cache-dir install --upgrade /pip_pkg/tensorflow-*.whl && \
+    rm -rf /pip_pkg && \
+    rm -rf /root/.cache
+# Clean up pip wheel and Bazel cache when done.
+
+WORKDIR /root
+
+# TensorBoard
+EXPOSE 6006
+# IPython
+EXPOSE 8888
diff --git a/tensorflow/tools/docker/Dockerfile.devel-mkl b/tensorflow/tools/docker/Dockerfile.devel-mkl
index 6dca0e393f..5b0051c1d4 100755
--- a/tensorflow/tools/docker/Dockerfile.devel-mkl
+++ b/tensorflow/tools/docker/Dockerfile.devel-mkl
@@ -73,7 +73,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
 RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
     >>/etc/bazel.bazelrc
 # Install the most recent bazel release.
-ENV BAZEL_VERSION 0.11.0
+ENV BAZEL_VERSION 0.15.0
 WORKDIR /
 RUN mkdir /bazel && \
     cd /bazel && \
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index 9197651ff4..28d4371da3 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -13,6 +13,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
         cuda-cusparse-9-0 \
         curl \
         libcudnn7=7.1.4.18-1+cuda9.0 \
+        libnccl2=2.2.13-1+cuda9.0 \
         libfreetype6-dev \
         libhdf5-serial-dev \
         libpng12-dev \
diff --git a/tensorflow/tools/docker/README.md b/tensorflow/tools/docker/README.md
index 525f2995ce..a286e8a212 100644
--- a/tensorflow/tools/docker/README.md
+++ b/tensorflow/tools/docker/README.md
@@ -87,8 +87,10 @@ export TF_DOCKER_BUILD_IS_DEVEL=NO
 export TF_DOCKER_BUILD_TYPE=CPU
 export TF_DOCKER_BUILD_PYTHON_VERSION=PYTHON2
 
-export NIGHTLY_VERSION="1.head"
-export TF_DOCKER_BUILD_CENTRAL_PIP=$(echo ${TF_DOCKER_BUILD_PYTHON_VERSION} | sed s^PYTHON2^http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=${TF_DOCKER_BUILD_PYTHON_VERSION},label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-${NIGHTLY_VERSION}-cp27-cp27mu-manylinux1_x86_64.whl^ | sed s^PYTHON3^http://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-${NIGHTLY_VERSION}-cp35-cp35m-manylinux1_x86_64.whl^)
+pip download --no-deps tf-nightly
+
+export TF_DOCKER_BUILD_CENTRAL_PIP=$(ls tf_nightly*.whl)
+export TF_DOCKER_BUILD_CENTRAL_PIP_IS_LOCAL=1
 
 tensorflow/tools/docker/parameterized_docker_build.sh
 ```
diff --git a/tensorflow/tools/docker/notebooks/1_hello_tensorflow.ipynb b/tensorflow/tools/docker/notebooks/1_hello_tensorflow.ipynb
index 0633b03259..8fa871ef77 100644
--- a/tensorflow/tools/docker/notebooks/1_hello_tensorflow.ipynb
+++ b/tensorflow/tools/docker/notebooks/1_hello_tensorflow.ipynb
@@ -665,7 +665,7 @@
       "source": [
         "## What's next?\n",
         "\n",
-        "This has been a gentle introduction to TensorFlow, focused on what TensorFlow is and the very basics of doing anything in TensorFlow. If you'd like more, the next tutorial in the series is Getting Started with TensorFlow, also available in the [notebooks directory](..)."
+        "This has been a gentle introduction to TensorFlow, focused on what TensorFlow is and the very basics of doing anything in TensorFlow. If you'd like more, the next tutorial in the series is Getting Started with TensorFlow, also available in the [notebooks directory](../notebooks)."
       ]
     }
   ],
diff --git a/tensorflow/tools/docs/doc_generator_visitor.py b/tensorflow/tools/docs/doc_generator_visitor.py
index 259a4694fd..c090dbd8da 100644
--- a/tensorflow/tools/docs/doc_generator_visitor.py
+++ b/tensorflow/tools/docs/doc_generator_visitor.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import six
 
+from tensorflow.python.util import tf_export
 from tensorflow.python.util import tf_inspect
 
 
@@ -201,7 +202,6 @@ class DocGeneratorVisitor(object):
             raw_duplicates[master_name] = [master_name, full_name]
         else:
           reverse_index[object_id] = full_name
-
     # Decide on master names, rewire duplicates and make a duplicate_of map
     # mapping all non-master duplicates to the master name. The master symbol
     # does not have an entry in this map.
@@ -211,10 +211,15 @@ class DocGeneratorVisitor(object):
     duplicates = {}
     for names in raw_duplicates.values():
       names = sorted(names)
-
-      # Choose the lexicographically first name with the minimum number of
-      # submodules. This will prefer highest level namespace for any symbol.
-      master_name = min(names, key=lambda name: name.count('.'))
+      master_name = (
+          tf_export.get_canonical_name_for_symbol(self._index[names[0]])
+          if names else None)
+      if master_name:
+        master_name = 'tf.%s' % master_name
+      else:
+        # Choose the lexicographically first name with the minimum number of
+        # submodules. This will prefer highest level namespace for any symbol.
+        master_name = min(names, key=lambda name: name.count('.'))
 
       duplicates[master_name] = names
       for name in names:
diff --git a/tensorflow/tools/docs/generate.py b/tensorflow/tools/docs/generate.py
index fc93085e3e..f96887e4c7 100644
--- a/tensorflow/tools/docs/generate.py
+++ b/tensorflow/tools/docs/generate.py
@@ -31,6 +31,11 @@ if __name__ == '__main__':
   doc_generator = generate_lib.DocGenerator()
   doc_generator.add_output_dir_argument()
   doc_generator.add_src_dir_argument()
+  doc_generator.argument_parser.add_argument(
+      '--site_api_path',
+      type=str, default='api_docs/python',
+      help='The path from the site-root to api_docs'
+           'directory for this project')
 
   # This doc generator works on the TensorFlow codebase. Since this script lives
   # at tensorflow/tools/docs, and all code is defined somewhere inside
diff --git a/tensorflow/tools/docs/generate_lib.py b/tensorflow/tools/docs/generate_lib.py
index e7634cd5dc..4f70a69364 100644
--- a/tensorflow/tools/docs/generate_lib.py
+++ b/tensorflow/tools/docs/generate_lib.py
@@ -55,7 +55,8 @@ def write_docs(output_dir,
                parser_config,
                yaml_toc,
                root_title='TensorFlow',
-               search_hints=True):
+               search_hints=True,
+               site_api_path=None):
   """Write previously extracted docs to disk.
 
   Write a docs page for each symbol included in the indices of parser_config to
@@ -73,6 +74,8 @@ def write_docs(output_dir,
     root_title: The title name for the root level index.md.
     search_hints: (bool) include meta-data search hints at the top of each
       output file.
+    site_api_path: Used to write the api-duplicates _redirects.yaml file. if
+      None (the default) the file is not generated.
 
   Raises:
     ValueError: if `output_dir` is not an absolute path
@@ -92,6 +95,9 @@ def write_docs(output_dir,
   #  - symbol name(string):pathname (string)
   symbol_to_file = {}
 
+  # Collect redirects for an api _redirects.yaml file.
+  redirects = ['redirects:\n']
+
   # Parse and write Markdown pages, resolving cross-links (@{symbol}).
   for full_name, py_object in six.iteritems(parser_config.index):
     parser_config.reference_resolver.current_doc_full_name = full_name
@@ -150,6 +156,25 @@ def write_docs(output_dir,
       raise OSError(
           'Cannot write documentation for %s to %s' % (full_name, directory))
 
+    if site_api_path:
+      duplicates = parser_config.duplicates.get(full_name, [])
+      if not duplicates:
+        continue
+
+      duplicates = [item for item in duplicates if item != full_name]
+      template = ('- from: /{}\n'
+                  '  to: /{}\n')
+      for dup in duplicates:
+        from_path = os.path.join(site_api_path, dup.replace('.', '/'))
+        to_path = os.path.join(site_api_path, full_name.replace('.', '/'))
+        redirects.append(
+            template.format(from_path, to_path))
+
+  if site_api_path:
+    api_redirects_path = os.path.join(output_dir, '_redirects.yaml')
+    with open(api_redirects_path, 'w') as redirect_file:
+      redirect_file.write(''.join(redirects))
+
   if yaml_toc:
     # Generate table of contents
 
@@ -608,7 +633,8 @@ class DocGenerator(object):
         parser_config,
         yaml_toc=self.yaml_toc,
         root_title=root_title,
-        search_hints=getattr(flags, 'search_hints', True))
+        search_hints=getattr(flags, 'search_hints', True),
+        site_api_path=getattr(flags, 'site_api_path', None))
 
     # Replace all the @{} references in files under `FLAGS.src_dir`
     replace_refs(flags.src_dir, flags.output_dir, reference_resolver, '*.md')
diff --git a/tensorflow/tools/docs/generate_lib_test.py b/tensorflow/tools/docs/generate_lib_test.py
index 7a6f9fd9f7..de18b13254 100644
--- a/tensorflow/tools/docs/generate_lib_test.py
+++ b/tensorflow/tools/docs/generate_lib_test.py
@@ -107,7 +107,18 @@ class GenerateTest(googletest.TestCase):
 
     output_dir = googletest.GetTempDir()
 
-    generate_lib.write_docs(output_dir, parser_config, yaml_toc=True)
+    generate_lib.write_docs(output_dir, parser_config, yaml_toc=True,
+                            site_api_path='api_docs/python')
+
+    # Check redirects
+    redirects_file = os.path.join(output_dir, '_redirects.yaml')
+    self.assertTrue(os.path.exists(redirects_file))
+    with open(redirects_file) as f:
+      redirects = f.read()
+    self.assertEqual(redirects.split(), [
+        'redirects:', '-', 'from:', '/api_docs/python/tf/test_function', 'to:',
+        '/api_docs/python/tf/TestModule/test_function'
+    ])
 
     # Make sure that the right files are written to disk.
     self.assertTrue(os.path.exists(os.path.join(output_dir, 'index.md')))
diff --git a/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc b/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
index f1d361e07d..156636ab82 100644
--- a/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
+++ b/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
@@ -159,7 +159,7 @@ Status FuseScaleOffsetToConvWeights(const std::vector<float>& scale_values,
   NodeDef bias_add_node;
   bias_add_node.set_op("BiasAdd");
   bias_add_node.set_name(conv_output_name);
-  if (!conv_node.attr().count("data_format")) {
+  if (conv_node.attr().count("data_format") > 0) {
     CopyNodeAttr(conv_node, "data_format", "data_format", &bias_add_node);
   }
   CopyNodeAttr(conv_node, "T", "T", &bias_add_node);
diff --git a/tensorflow/tools/graph_transforms/transform_utils.cc b/tensorflow/tools/graph_transforms/transform_utils.cc
index af17fd75bc..cb084e49b7 100644
--- a/tensorflow/tools/graph_transforms/transform_utils.cc
+++ b/tensorflow/tools/graph_transforms/transform_utils.cc
@@ -247,9 +247,16 @@ Status SortByExecutionOrder(const GraphDef& input_graph_def,
     }
   }
 
-  if (processed < input_graph_def.node_size()) {
-    return errors::InvalidArgument(input_graph_def.node_size() - processed,
-                                   " nodes in a cycle");
+  if (processed < num_nodes) {
+    LOG(WARNING) << "IN " << __func__ << (num_nodes - processed)
+                 << " NODES IN A CYCLE";
+    for (int64 i = 0; i < num_nodes; i++) {
+      if (pending_count[i] != 0) {
+        LOG(WARNING) << "PENDING: " << SummarizeNodeDef(input_graph_def.node(i))
+                     << "WITH PENDING COUNT = " << pending_count[i];
+      }
+    }
+    return errors::InvalidArgument(num_nodes - processed, " nodes in a cycle");
   }
   return Status::OK();
 }
diff --git a/tensorflow/tools/lib_package/BUILD b/tensorflow/tools/lib_package/BUILD
index 173f418dc8..44d8a37a8f 100644
--- a/tensorflow/tools/lib_package/BUILD
+++ b/tensorflow/tools/lib_package/BUILD
@@ -143,6 +143,7 @@ genrule(
         "@zlib_archive//:zlib.h",
     ] + if_mkl([
         "//third_party/mkl:LICENSE",
+        "//third_party/mkl_dnn:LICENSE",
     ]),
     outs = ["include/tensorflow/c/LICENSE"],
     cmd = "$(location :concat_licenses.sh) $(SRCS) >$@",
@@ -182,6 +183,7 @@ genrule(
         "@zlib_archive//:zlib.h",
     ] + if_mkl([
         "//third_party/mkl:LICENSE",
+        "//third_party/mkl_dnn:LICENSE",
     ]),
     outs = ["include/tensorflow/jni/LICENSE"],
     cmd = "$(location :concat_licenses.sh) $(SRCS) >$@",
diff --git a/tensorflow/tools/pip_package/BUILD b/tensorflow/tools/pip_package/BUILD
index c9d53f46c3..06ee2307e5 100644
--- a/tensorflow/tools/pip_package/BUILD
+++ b/tensorflow/tools/pip_package/BUILD
@@ -11,7 +11,7 @@ load(
 )
 load("//third_party/mkl:build_defs.bzl", "if_mkl")
 load("//tensorflow:tensorflow.bzl", "if_cuda")
-load("@local_config_tensorrt//:build_defs.bzl", "if_tensorrt")
+load("@local_config_syslibs//:build_defs.bzl", "if_not_system_lib")
 load("//tensorflow/core:platform/default/build_config_root.bzl", "tf_additional_license_deps")
 
 # This returns a list of headers of all public header libraries (e.g.,
@@ -63,12 +63,14 @@ COMMON_PIP_DEPS = [
     "//tensorflow/contrib/autograph/lang:lang",
     "//tensorflow/contrib/autograph/operators:operators",
     "//tensorflow/contrib/autograph/pyct:pyct",
+    "//tensorflow/contrib/autograph/pyct/testing:testing",
     "//tensorflow/contrib/autograph/pyct/static_analysis:static_analysis",
     "//tensorflow/contrib/autograph/pyct/common_transformers:common_transformers",
     "//tensorflow/contrib/boosted_trees:boosted_trees_pip",
     "//tensorflow/contrib/cluster_resolver:cluster_resolver_pip",
     "//tensorflow/contrib/constrained_optimization:constrained_optimization_pip",
     "//tensorflow/contrib/data/python/kernel_tests/serialization:dataset_serialization_test_base",
+    "//tensorflow/contrib/data/python/kernel_tests:stats_dataset_test_base",
     "//tensorflow/contrib/data/python/ops:contrib_op_loader",
     "//tensorflow/contrib/eager/python/examples:examples_pip",
     "//tensorflow/contrib/eager/python:evaluator",
@@ -78,7 +80,7 @@ COMMON_PIP_DEPS = [
     "//tensorflow/contrib/labeled_tensor:labeled_tensor_pip",
     "//tensorflow/contrib/nn:nn_py",
     "//tensorflow/contrib/predictor:predictor_pip",
-    "//tensorflow/contrib/proto:proto_pip",
+    "//tensorflow/contrib/proto:proto",
     "//tensorflow/contrib/receptive_field:receptive_field_pip",
     "//tensorflow/contrib/rpc:rpc_pip",
     "//tensorflow/contrib/session_bundle:session_bundle_pip",
@@ -104,6 +106,7 @@ COMMON_PIP_DEPS = [
     "//tensorflow/python/kernel_tests/testdata:self_adjoint_eig_op_test_files",
     "//tensorflow/python/saved_model:saved_model",
     "//tensorflow/python/tools:tools_pip",
+    "//tensorflow/python/tools/api/generator:create_python_api",
     "//tensorflow/python:test_ops",
     "//tensorflow/tools/dist_test/server:grpc_tensorflow_server",
 ]
@@ -144,7 +147,6 @@ filegroup(
         "@gast_archive//:PKG-INFO",
         "@gemmlowp//:LICENSE",
         "@gif_archive//:COPYING",
-        "@grpc//:LICENSE",
         "@highwayhash//:LICENSE",
         "@jemalloc//:COPYING",
         "@jpeg//:LICENSE.md",
@@ -153,8 +155,6 @@ filegroup(
         "@lmdb//:LICENSE",
         "@local_config_nccl//:LICENSE",
         "@local_config_sycl//sycl:LICENSE.text",
-        "@grpc//third_party/nanopb:LICENSE.txt",
-        "@grpc//third_party/address_sorting:LICENSE",
         "@nasm//:LICENSE",
         "@nsync//:LICENSE",
         "@pcre//:LICENCE",
@@ -168,7 +168,15 @@ filegroup(
         "@org_python_pypi_backports_weakref//:LICENSE",
     ] + if_mkl([
         "//third_party/mkl:LICENSE",
-    ]) + tf_additional_license_deps(),
+        "//third_party/mkl_dnn:LICENSE",
+    ]) + if_not_system_lib(
+        "grpc",
+        [
+            "@grpc//:LICENSE",
+            "@grpc//third_party/nanopb:LICENSE.txt",
+            "@grpc//third_party/address_sorting:LICENSE",
+        ],
+    ) + tf_additional_license_deps(),
 )
 
 sh_binary(
@@ -183,9 +191,7 @@ sh_binary(
             "//tensorflow/contrib/lite/python:tflite_convert",
             "//tensorflow/contrib/lite/toco/python:toco_from_protos",
         ],
-    }) + if_mkl(["//third_party/mkl:intel_binary_blob"]) + if_tensorrt([
-        "//tensorflow/contrib/tensorrt:init_py",
-    ]),
+    }) + if_mkl(["//third_party/mkl:intel_binary_blob"]),
 )
 
 # A genrule for generating a marker file for the pip package on Windows
diff --git a/tensorflow/tools/pip_package/build_pip_package.sh b/tensorflow/tools/pip_package/build_pip_package.sh
index 9e41514cfa..ca40f2eaa8 100755
--- a/tensorflow/tools/pip_package/build_pip_package.sh
+++ b/tensorflow/tools/pip_package/build_pip_package.sh
@@ -17,8 +17,12 @@
 
 set -e
 
+function is_absolute {
+  [[ "$1" = /* ]] || [[ "$1" =~ ^[a-zA-Z]:[/\\].* ]]
+}
+
 function real_path() {
-  [[ $1 = /* ]] && echo "$1" || echo "$PWD/${1#./}"
+  is_absolute "$1" && echo "$1" || echo "$PWD/${1#./}"
 }
 
 function cp_external() {
@@ -27,7 +31,7 @@ function cp_external() {
 
   pushd .
   cd "$src_dir"
-  for f in `find . ! -type d ! -name '*.py' ! -name '*local_config_cuda*' ! -name '*local_config_tensorrt*' ! -name '*org_tensorflow*'`; do
+  for f in `find . ! -type d ! -name '*.py' ! -path '*local_config_cuda*' ! -path '*local_config_tensorrt*' ! -path '*local_config_syslibs*' ! -path '*org_tensorflow*'`; do
     mkdir -p "${dest_dir}/$(dirname ${f})"
     cp "${f}" "${dest_dir}/$(dirname ${f})/"
   done
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index c630ca04b8..1f4c3d47bf 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -45,7 +45,7 @@ DOCLINES = __doc__.split('\n')
 # This version string is semver compatible, but incompatible with pip.
 # For pip, we will remove all '-' characters from this string, and use the
 # result for pip.
-_VERSION = '1.9.0-rc0'
+_VERSION = '1.9.0'
 
 REQUIRED_PACKAGES = [
     'absl-py >= 0.1.6',
@@ -55,7 +55,7 @@ REQUIRED_PACKAGES = [
     'six >= 1.10.0',
     'protobuf >= 3.6.0',
     'setuptools <= 39.1.0',
-    'tensorboard >= 1.8.0, < 1.9.0',
+    'tensorboard >= 1.10.0, < 1.11.0',
     'termcolor >= 1.1.0',
 ]
 
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index c4ae21b645..2932195b45 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -8,881 +8,902 @@ load("//third_party/git:git_configure.bzl", "git_configure")
 load("//third_party/py:python_configure.bzl", "python_configure")
 
 load("//third_party/sycl:sycl_configure.bzl", "sycl_configure")
+load("//third_party/systemlibs:syslibs_configure.bzl", "syslibs_configure")
 load("//third_party/toolchains/clang6:repo.bzl", "clang6_configure")
 load("//third_party/toolchains/cpus/arm:arm_compiler_configure.bzl", "arm_compiler_configure")
 load("//third_party:repo.bzl", "tf_http_archive")
 load("//third_party/clang_toolchain:cc_configure_clang.bzl", "cc_download_clang_toolchain")
 load("@io_bazel_rules_closure//closure/private:java_import_external.bzl", "java_import_external")
 load("@io_bazel_rules_closure//closure:defs.bzl", "filegroup_external")
-load("//tensorflow/tools/def_file_filter:def_file_filter_configure.bzl",
-     "def_file_filter_configure")
-
+load(
+    "//tensorflow/tools/def_file_filter:def_file_filter_configure.bzl",
+    "def_file_filter_configure",
+)
 
 # Sanitize a dependency so that it works correctly from code that includes
 # TensorFlow as a submodule.
 def clean_dep(dep):
-  return str(Label(dep))
+    return str(Label(dep))
 
 # If TensorFlow is linked as a submodule.
 # path_prefix is no longer used.
 # tf_repo_name is thought to be under consideration.
-def tf_workspace(path_prefix="", tf_repo_name=""):
-  # Note that we check the minimum bazel version in WORKSPACE.
-  clang6_configure(name="local_config_clang6")
-  cc_download_clang_toolchain(name="local_config_download_clang")
-  cuda_configure(name="local_config_cuda")
-  tensorrt_configure(name="local_config_tensorrt")
-  nccl_configure(name="local_config_nccl")
-  git_configure(name="local_config_git")
-  sycl_configure(name="local_config_sycl")
-  python_configure(name="local_config_python")
-
-  # For windows bazel build
-  # TODO: Remove def file filter when TensorFlow can export symbols properly on Windows.
-  def_file_filter_configure(name = "local_config_def_file_filter")
-
-  # Point //external/local_config_arm_compiler to //external/arm_compiler
-  arm_compiler_configure(
-      name="local_config_arm_compiler",
-      remote_config_repo="../arm_compiler",
-      build_file = clean_dep("//third_party/toolchains/cpus/arm:BUILD"))
-
-  mkl_repository(
-      name = "mkl_linux",
-      urls = [
-          "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.14/mklml_lnx_2018.0.3.20180406.tgz",
-          "https://github.com/intel/mkl-dnn/releases/download/v0.14/mklml_lnx_2018.0.3.20180406.tgz"
-      ],
-      sha256 = "d2305244fdc9b87db7426ed4496e87a4b3977ad3374d73b8000e8b7a5b7aa725",
-      strip_prefix = "mklml_lnx_2018.0.3.20180406",
-      build_file = clean_dep("//third_party/mkl:mkl.BUILD")
-  )
-  mkl_repository(
-      name = "mkl_windows",
-      urls = [
-          "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.14/mklml_win_2018.0.3.20180406.zip",
-          "https://github.com/intel/mkl-dnn/releases/download/v0.14/mklml_win_2018.0.3.20180406.zip"
-      ],
-      sha256 = "a584a5bf1c8d2ad70b90d12b52652030e9a338217719064fdb84b7ad0d693694",
-      strip_prefix = "mklml_win_2018.0.3.20180406",
-      build_file = clean_dep("//third_party/mkl:mkl.BUILD")
-  )
-  mkl_repository(
-      name = "mkl_darwin",
-      urls = [
-          "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.14/mklml_mac_2018.0.3.20180406.tgz",
-          "https://github.com/intel/mkl-dnn/releases/download/v0.14/mklml_mac_2018.0.3.20180406.tgz"
-      ],
-      sha256 = "094e3dfd61c816136dc8d12a45cc611ce26c5f4828176a3644cd0b0efa15a25b",
-      strip_prefix = "mklml_mac_2018.0.3.20180406",
-      build_file = clean_dep("//third_party/mkl:mkl.BUILD")
-  )
-
-  if path_prefix:
-    print("path_prefix was specified to tf_workspace but is no longer used " +
-          "and will be removed in the future.")
-
-  tf_http_archive(
-      name = "mkl_dnn",
-      urls = [
-          "https://mirror.bazel.build/github.com/intel/mkl-dnn/archive/v0.14.tar.gz",
-          "https://github.com/intel/mkl-dnn/archive/v0.14.tar.gz",
-      ],
-      sha256 = "efebc53882856afec86457a2da644693f5d59c68772d41d640d6b60a8efc4eb0",
-      strip_prefix = "mkl-dnn-0.14",
-      build_file = clean_dep("//third_party/mkl_dnn:mkldnn.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "com_google_absl",
-      urls = [
-          "https://mirror.bazel.build/github.com/abseil/abseil-cpp/archive/9613678332c976568272c8f4a78631a29159271d.tar.gz",
-          "https://github.com/abseil/abseil-cpp/archive/9613678332c976568272c8f4a78631a29159271d.tar.gz",
-      ],
-     sha256 = "1273a1434ced93bc3e703a48c5dced058c95e995c8c009e9bdcb24a69e2180e9",
-     strip_prefix = "abseil-cpp-9613678332c976568272c8f4a78631a29159271d",
-     build_file = clean_dep("//third_party:com_google_absl.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "eigen_archive",
-      urls = [
-          "https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/fd6845384b86.tar.gz",
-          "https://bitbucket.org/eigen/eigen/get/fd6845384b86.tar.gz",
-      ],
-      sha256 = "d956415d784fa4e42b6a2a45c32556d6aec9d0a3d8ef48baee2522ab762556a9",
-      strip_prefix = "eigen-eigen-fd6845384b86",
-      build_file = clean_dep("//third_party:eigen.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "arm_compiler",
-      sha256 = "970285762565c7890c6c087d262b0a18286e7d0384f13a37786d8521773bc969",
-      strip_prefix = "tools-0e906ebc527eab1cdbf7adabff5b474da9562e9f/arm-bcm2708/arm-rpi-4.9.3-linux-gnueabihf",
-      urls = [
-          "https://mirror.bazel.build/github.com/raspberrypi/tools/archive/0e906ebc527eab1cdbf7adabff5b474da9562e9f.tar.gz",
-          # Please uncomment me, when the next upgrade happens. Then
-          # remove the whitelist entry in third_party/repo.bzl.
-          # "https://github.com/raspberrypi/tools/archive/0e906ebc527eab1cdbf7adabff5b474da9562e9f.tar.gz",
-      ],
-      build_file = clean_dep("//:arm_compiler.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "libxsmm_archive",
-      urls = [
-          "https://mirror.bazel.build/github.com/hfp/libxsmm/archive/1.9.tar.gz",
-          "https://github.com/hfp/libxsmm/archive/1.9.tar.gz",
-      ],
-      sha256 = "cd8532021352b4a0290d209f7f9bfd7c2411e08286a893af3577a43457287bfa",
-      strip_prefix = "libxsmm-1.9",
-      build_file = clean_dep("//third_party:libxsmm.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "ortools_archive",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/or-tools/archive/253f7955c6a1fd805408fba2e42ac6d45b312d15.tar.gz",
-          # Please uncomment me, when the next upgrade happens. Then
-          # remove the whitelist entry in third_party/repo.bzl.
-          # "https://github.com/google/or-tools/archive/253f7955c6a1fd805408fba2e42ac6d45b312d15.tar.gz",
-      ],
-      sha256 = "932075525642b04ac6f1b50589f1df5cd72ec2f448b721fd32234cf183f0e755",
-      strip_prefix = "or-tools-253f7955c6a1fd805408fba2e42ac6d45b312d15/src",
-      build_file = clean_dep("//third_party:ortools.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "com_googlesource_code_re2",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/re2/archive/2018-04-01.tar.gz",
-          "https://github.com/google/re2/archive/2018-04-01.tar.gz",
-
-      ],
-      sha256 = "2f945446b71336e7f5a2bcace1abcf0b23fbba368266c6a1be33de3de3b3c912",
-      strip_prefix = "re2-2018-04-01",
-  )
-
-  tf_http_archive(
-      name = "com_github_googlecloudplatform_google_cloud_cpp",
-      urls = [
-          "https://mirror.bazel.build/github.com/GoogleCloudPlatform/google-cloud-cpp/archive/53f822805e77ea7715f5b52c592a162c515c7219.tar.gz",
-          "https://github.com/GoogleCloudPlatform/google-cloud-cpp/archive/53f822805e77ea7715f5b52c592a162c515c7219.tar.gz",
-      ],
-      sha256 = "06853bfca77ef4aec09db5ab48c548f68ef2e18f17404cbce61f8d9b820f951b",
-      strip_prefix = "google-cloud-cpp-53f822805e77ea7715f5b52c592a162c515c7219",
-  )
-
-  tf_http_archive(
-      name = "com_github_googleapis_googleapis",
-      urls = [
-          "https://mirror.bazel.build/github.com/googleapis/googleapis/archive/f81082ea1e2f85c43649bee26e0d9871d4b41cdb.zip",
-          "https://github.com/googleapis/googleapis/archive/f81082ea1e2f85c43649bee26e0d9871d4b41cdb.zip",
-      ],
-      sha256 = "824870d87a176f26bcef663e92051f532fac756d1a06b404055dc078425f4378",
-      strip_prefix="googleapis-f81082ea1e2f85c43649bee26e0d9871d4b41cdb",
-      build_file = clean_dep("//third_party:googleapis.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "gemmlowp",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/gemmlowp/archive/38ebac7b059e84692f53e5938f97a9943c120d98.zip",
-          "https://github.com/google/gemmlowp/archive/38ebac7b059e84692f53e5938f97a9943c120d98.zip",
-      ],
-      sha256 = "b87faa7294dfcc5d678f22a59d2c01ca94ea1e2a3b488c38a95a67889ed0a658",
-      strip_prefix = "gemmlowp-38ebac7b059e84692f53e5938f97a9943c120d98",
-  )
-
-  tf_http_archive(
-      name = "farmhash_archive",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/farmhash/archive/816a4ae622e964763ca0862d9dbd19324a1eaf45.tar.gz",
-          "https://github.com/google/farmhash/archive/816a4ae622e964763ca0862d9dbd19324a1eaf45.tar.gz",
-      ],
-      sha256 = "6560547c63e4af82b0f202cb710ceabb3f21347a4b996db565a411da5b17aba0",
-      strip_prefix = "farmhash-816a4ae622e964763ca0862d9dbd19324a1eaf45",
-      build_file = clean_dep("//third_party:farmhash.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "highwayhash",
-      urls = [
-          "http://mirror.bazel.build/github.com/google/highwayhash/archive/fd3d9af80465e4383162e4a7c5e2f406e82dd968.tar.gz",
-          "https://github.com/google/highwayhash/archive/fd3d9af80465e4383162e4a7c5e2f406e82dd968.tar.gz",
-      ],
-      sha256 = "9c3e0e87d581feeb0c18d814d98f170ff23e62967a2bd6855847f0b2fe598a37",
-      strip_prefix = "highwayhash-fd3d9af80465e4383162e4a7c5e2f406e82dd968",
-      build_file = clean_dep("//third_party:highwayhash.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "nasm",
-      urls = [
-          "https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
-          "http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.12.02.tar.bz2/d15843c3fb7db39af80571ee27ec6fad/nasm-2.12.02.tar.bz2",
-          "http://www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
-      ],
-      sha256 = "00b0891c678c065446ca59bcee64719d0096d54d6886e6e472aeee2e170ae324",
-      strip_prefix = "nasm-2.12.02",
-      build_file = clean_dep("//third_party:nasm.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "jpeg",
-      urls = [
-          "https://mirror.bazel.build/github.com/libjpeg-turbo/libjpeg-turbo/archive/1.5.3.tar.gz",
-          "https://github.com/libjpeg-turbo/libjpeg-turbo/archive/1.5.3.tar.gz",
-      ],
-      sha256 = "1a17020f859cb12711175a67eab5c71fc1904e04b587046218e36106e07eabde",
-      strip_prefix = "libjpeg-turbo-1.5.3",
-      build_file = clean_dep("//third_party/jpeg:jpeg.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "png_archive",
-      urls = [
-          "https://mirror.bazel.build/github.com/glennrp/libpng/archive/v1.6.34.tar.gz",
-          "https://github.com/glennrp/libpng/archive/v1.6.34.tar.gz",
-      ],
-      sha256 = "e45ce5f68b1d80e2cb9a2b601605b374bdf51e1798ef1c2c2bd62131dfcf9eef",
-      strip_prefix = "libpng-1.6.34",
-      build_file = clean_dep("//third_party:png.BUILD"),
-      patch_file = clean_dep("//third_party:png_fix_rpi.patch"),
-  )
-
-  tf_http_archive(
-      name = "org_sqlite",
-      urls = [
-          "https://mirror.bazel.build/www.sqlite.org/2018/sqlite-amalgamation-3230100.zip",
-          "https://www.sqlite.org/2018/sqlite-amalgamation-3230100.zip",
-      ],
-      sha256 = "4239a1f69e5721d07d9a374eb84d594225229e54be4ee628da2995f4315d8dfc",
-      strip_prefix = "sqlite-amalgamation-3230100",
-      build_file = clean_dep("//third_party:sqlite.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "gif_archive",
-      urls = [
-          "https://mirror.bazel.build/ufpr.dl.sourceforge.net/project/giflib/giflib-5.1.4.tar.gz",
-          "http://pilotfiber.dl.sourceforge.net/project/giflib/giflib-5.1.4.tar.gz",
-      ],
-      sha256 = "34a7377ba834397db019e8eb122e551a49c98f49df75ec3fcc92b9a794a4f6d1",
-      strip_prefix = "giflib-5.1.4",
-      build_file = clean_dep("//third_party:gif.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "six_archive",
-      urls = [
-          "https://mirror.bazel.build/pypi.python.org/packages/source/s/six/six-1.10.0.tar.gz",
-          "https://pypi.python.org/packages/source/s/six/six-1.10.0.tar.gz",
-      ],
-      sha256 = "105f8d68616f8248e24bf0e9372ef04d3cc10104f1980f54d57b2ce73a5ad56a",
-      strip_prefix = "six-1.10.0",
-      build_file = clean_dep("//third_party:six.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "astor_archive",
-      urls = [
-          "https://mirror.bazel.build/pypi.python.org/packages/d8/be/c4276b3199ec3feee2a88bc64810fbea8f26d961e0a4cd9c68387a9f35de/astor-0.6.2.tar.gz",
-          "https://pypi.python.org/packages/d8/be/c4276b3199ec3feee2a88bc64810fbea8f26d961e0a4cd9c68387a9f35de/astor-0.6.2.tar.gz",
-      ],
-      sha256 = "ff6d2e2962d834acb125cc4dcc80c54a8c17c253f4cc9d9c43b5102a560bb75d",
-      strip_prefix = "astor-0.6.2",
-      build_file = clean_dep("//third_party:astor.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "gast_archive",
-      urls = [
-          "https://mirror.bazel.build/pypi.python.org/packages/5c/78/ff794fcae2ce8aa6323e789d1f8b3b7765f601e7702726f430e814822b96/gast-0.2.0.tar.gz",
-          "https://pypi.python.org/packages/5c/78/ff794fcae2ce8aa6323e789d1f8b3b7765f601e7702726f430e814822b96/gast-0.2.0.tar.gz",
-      ],
-      sha256 = "7068908321ecd2774f145193c4b34a11305bd104b4551b09273dfd1d6a374930",
-      strip_prefix = "gast-0.2.0",
-      build_file = clean_dep("//third_party:gast.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "termcolor_archive",
-      urls = [
-          "https://mirror.bazel.build/pypi.python.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz",
-          "https://pypi.python.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz",
-      ],
-      sha256 = "1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b",
-      strip_prefix = "termcolor-1.1.0",
-      build_file = clean_dep("//third_party:termcolor.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "absl_py",
-      urls = [
-          "https://mirror.bazel.build/github.com/abseil/abseil-py/archive/pypi-v0.2.2.tar.gz",
-          "https://github.com/abseil/abseil-py/archive/pypi-v0.2.2.tar.gz",
-      ],
-      sha256 = "95160f778a62c7a60ddeadc7bf2d83f85a23a27359814aca12cf949e896fa82c",
-      strip_prefix = "abseil-py-pypi-v0.2.2",
-  )
-
-  tf_http_archive(
-      name = "org_python_pypi_backports_weakref",
-      urls = [
-          "https://mirror.bazel.build/pypi.python.org/packages/bc/cc/3cdb0a02e7e96f6c70bd971bc8a90b8463fda83e264fa9c5c1c98ceabd81/backports.weakref-1.0rc1.tar.gz",
-          "https://pypi.python.org/packages/bc/cc/3cdb0a02e7e96f6c70bd971bc8a90b8463fda83e264fa9c5c1c98ceabd81/backports.weakref-1.0rc1.tar.gz",
-      ],
-      sha256 = "8813bf712a66b3d8b85dc289e1104ed220f1878cf981e2fe756dfaabe9a82892",
-      strip_prefix = "backports.weakref-1.0rc1/src",
-      build_file = clean_dep("//third_party:backports_weakref.BUILD"),
-  )
-
-  filegroup_external(
-      name = "org_python_license",
-      licenses = ["notice"],  # Python 2.0
-      sha256_urls = {
-          "b5556e921715ddb9242c076cae3963f483aa47266c5e37ea4c187f77cc79501c": [
-              "https://mirror.bazel.build/docs.python.org/2.7/_sources/license.txt",
-              "https://docs.python.org/2.7/_sources/license.txt",
-          ],
-      },
-  )
-
-  tf_http_archive(
-      name = "protobuf_archive",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/protobuf/archive/v3.6.0.tar.gz",
-          "https://github.com/google/protobuf/archive/v3.6.0.tar.gz",
-      ],
-      sha256 = "50a5753995b3142627ac55cfd496cebc418a2e575ca0236e29033c67bd5665f4",
-      strip_prefix = "protobuf-3.6.0",
-  )
-
-  # We need to import the protobuf library under the names com_google_protobuf
-  # and com_google_protobuf_cc to enable proto_library support in bazel.
-  # Unfortunately there is no way to alias http_archives at the moment.
-  tf_http_archive(
-      name = "com_google_protobuf",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/protobuf/archive/v3.6.0.tar.gz",
-          "https://github.com/google/protobuf/archive/v3.6.0.tar.gz",
-      ],
-      sha256 = "50a5753995b3142627ac55cfd496cebc418a2e575ca0236e29033c67bd5665f4",
-      strip_prefix = "protobuf-3.6.0",
-  )
-
-  tf_http_archive(
-      name = "com_google_protobuf_cc",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/protobuf/archive/v3.6.0.tar.gz",
-          "https://github.com/google/protobuf/archive/v3.6.0.tar.gz",
-      ],
-      sha256 = "50a5753995b3142627ac55cfd496cebc418a2e575ca0236e29033c67bd5665f4",
-      strip_prefix = "protobuf-3.6.0",
-  )
-
-  tf_http_archive(
-      name = "nsync",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/nsync/archive/1.20.0.tar.gz",
-          "https://github.com/google/nsync/archive/1.20.0.tar.gz",
-      ],
-      sha256 = "0c1b03962b2f8450f21e74a5a46116bf2d6009a807c57eb4207e974a8c4bb7dd",
-      strip_prefix = "nsync-1.20.0",
-  )
-
-  tf_http_archive(
-      name = "com_google_googletest",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/googletest/archive/9816b96a6ddc0430671693df90192bbee57108b6.zip",
-          "https://github.com/google/googletest/archive/9816b96a6ddc0430671693df90192bbee57108b6.zip",
-      ],
-      sha256 = "9cbca84c4256bed17df2c8f4d00c912c19d247c11c9ba6647cd6dd5b5c996b8d",
-      strip_prefix = "googletest-9816b96a6ddc0430671693df90192bbee57108b6",
-  )
-
-  tf_http_archive(
-      name = "com_github_gflags_gflags",
-      urls = [
-          "https://mirror.bazel.build/github.com/gflags/gflags/archive/f8a0efe03aa69b3336d8e228b37d4ccb17324b88.tar.gz",
-          "https://github.com/gflags/gflags/archive/f8a0efe03aa69b3336d8e228b37d4ccb17324b88.tar.gz",
-      ],
-      sha256 = "4d222fab8f1ede4709cdff417d15a1336f862d7334a81abf76d09c15ecf9acd1",
-      strip_prefix = "gflags-f8a0efe03aa69b3336d8e228b37d4ccb17324b88",
-  )
-
-  tf_http_archive(
-      name = "pcre",
-      sha256 = "69acbc2fbdefb955d42a4c606dfde800c2885711d2979e356c0636efde9ec3b5",
-      urls = [
-          "https://mirror.bazel.build/ftp.exim.org/pub/pcre/pcre-8.42.tar.gz",
-          "http://ftp.exim.org/pub/pcre/pcre-8.42.tar.gz",
-      ],
-      strip_prefix = "pcre-8.42",
-      build_file = clean_dep("//third_party:pcre.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "swig",
-      sha256 = "58a475dbbd4a4d7075e5fe86d4e54c9edde39847cdb96a3053d87cb64a23a453",
-      urls = [
-          "https://mirror.bazel.build/ufpr.dl.sourceforge.net/project/swig/swig/swig-3.0.8/swig-3.0.8.tar.gz",
-          "http://ufpr.dl.sourceforge.net/project/swig/swig/swig-3.0.8/swig-3.0.8.tar.gz",
-          "http://pilotfiber.dl.sourceforge.net/project/swig/swig/swig-3.0.8/swig-3.0.8.tar.gz",
-      ],
-      strip_prefix = "swig-3.0.8",
-      build_file = clean_dep("//third_party:swig.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "curl",
-      sha256 = "e9c37986337743f37fd14fe8737f246e97aec94b39d1b71e8a5973f72a9fc4f5",
-      urls = [
-          "https://mirror.bazel.build/curl.haxx.se/download/curl-7.60.0.tar.gz",
-          "https://curl.haxx.se/download/curl-7.60.0.tar.gz",
-      ],
-      strip_prefix = "curl-7.60.0",
-      build_file = clean_dep("//third_party:curl.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "grpc",
-      urls = [
-          "https://mirror.bazel.build/github.com/grpc/grpc/archive/d184fa229d75d336aedea0041bd59cb93e7e267f.tar.gz",
-          "https://github.com/grpc/grpc/archive/d184fa229d75d336aedea0041bd59cb93e7e267f.tar.gz",
-      ],
-      sha256 = "895b31310e718a61f7335759a778c068a6edde1c089883598a0830cbb7075673",
-      strip_prefix = "grpc-d184fa229d75d336aedea0041bd59cb93e7e267f",
-  )
-
-
-  tf_http_archive(
-      name = "linenoise",
-      sha256 = "7f51f45887a3d31b4ce4fa5965210a5e64637ceac12720cfce7954d6a2e812f7",
-      urls = [
-          "https://mirror.bazel.build/github.com/antirez/linenoise/archive/c894b9e59f02203dbe4e2be657572cf88c4230c3.tar.gz",
-          "https://github.com/antirez/linenoise/archive/c894b9e59f02203dbe4e2be657572cf88c4230c3.tar.gz",
-      ],
-      strip_prefix = "linenoise-c894b9e59f02203dbe4e2be657572cf88c4230c3",
-      build_file = clean_dep("//third_party:linenoise.BUILD"),
-  )
-
-  # TODO(phawkins): currently, this rule uses an unofficial LLVM mirror.
-  # Switch to an official source of snapshots if/when possible.
-  tf_http_archive(
-      name = "llvm",
-      urls = [
-          "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/fe1e7736763a8577ac081eca525e05d3b52de414.tar.gz",
-	  "https://github.com/llvm-mirror/llvm/archive/fe1e7736763a8577ac081eca525e05d3b52de414.tar.gz",
-      ],
-      sha256 = "77b9a98d3c0be94561fed32f44a7a8c78421e01a74bad009964d8bbaf066ed6c",
-      strip_prefix = "llvm-fe1e7736763a8577ac081eca525e05d3b52de414",
-      build_file = clean_dep("//third_party/llvm:llvm.autogenerated.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "lmdb",
-      urls = [
-          "https://mirror.bazel.build/github.com/LMDB/lmdb/archive/LMDB_0.9.22.tar.gz",
-          "https://github.com/LMDB/lmdb/archive/LMDB_0.9.22.tar.gz",
-      ],
-      sha256 = "f3927859882eb608868c8c31586bb7eb84562a40a6bf5cc3e13b6b564641ea28",
-      strip_prefix = "lmdb-LMDB_0.9.22/libraries/liblmdb",
-      build_file = clean_dep("//third_party:lmdb.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "jsoncpp_git",
-      urls = [
-          "https://mirror.bazel.build/github.com/open-source-parsers/jsoncpp/archive/1.8.4.tar.gz",
-          "https://github.com/open-source-parsers/jsoncpp/archive/1.8.4.tar.gz",
-      ],
-      sha256 = "c49deac9e0933bcb7044f08516861a2d560988540b23de2ac1ad443b219afdb6",
-      strip_prefix = "jsoncpp-1.8.4",
-      build_file = clean_dep("//third_party:jsoncpp.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "boringssl",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/boringssl/archive/a0fb951d2a26a8ee746b52f3ba81ab011a0af778.tar.gz",
-          "https://github.com/google/boringssl/archive/a0fb951d2a26a8ee746b52f3ba81ab011a0af778.tar.gz",
-      ],
-      sha256 = "524ba98a56300149696481b4cb9ddebd0c7b7ac9b9f6edee81da2d2d7e5d2bb3",
-      strip_prefix = "boringssl-a0fb951d2a26a8ee746b52f3ba81ab011a0af778",
-  )
-
-  tf_http_archive(
-      name = "zlib_archive",
-      urls = [
-          "https://mirror.bazel.build/zlib.net/zlib-1.2.11.tar.gz",
-          "https://zlib.net/zlib-1.2.11.tar.gz",
-      ],
-      sha256 = "c3e5e9fdd5004dcb542feda5ee4f0ff0744628baf8ed2dd5d66f8ca1197cb1a1",
-      strip_prefix = "zlib-1.2.11",
-      build_file = clean_dep("//third_party:zlib.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "fft2d",
-      urls = [
-          "https://mirror.bazel.build/www.kurims.kyoto-u.ac.jp/~ooura/fft.tgz",
-          "http://www.kurims.kyoto-u.ac.jp/~ooura/fft.tgz",
-      ],
-      sha256 = "52bb637c70b971958ec79c9c8752b1df5ff0218a4db4510e60826e0cb79b5296",
-      build_file = clean_dep("//third_party/fft2d:fft2d.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "snappy",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/snappy/archive/1.1.7.tar.gz",
-          "https://github.com/google/snappy/archive/1.1.7.tar.gz",
-      ],
-      sha256 = "3dfa02e873ff51a11ee02b9ca391807f0c8ea0529a4924afa645fbf97163f9d4",
-      strip_prefix = "snappy-1.1.7",
-      build_file = clean_dep("//third_party:snappy.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "nccl_archive",
-      urls = [
-          "https://mirror.bazel.build/github.com/nvidia/nccl/archive/03d856977ecbaac87e598c0c4bafca96761b9ac7.tar.gz",
-          "https://github.com/nvidia/nccl/archive/03d856977ecbaac87e598c0c4bafca96761b9ac7.tar.gz",
-      ],
-      sha256 = "2ca86fb6179ecbff789cc67c836139c1bbc0324ed8c04643405a30bf26325176",
-      strip_prefix = "nccl-03d856977ecbaac87e598c0c4bafca96761b9ac7",
-      build_file = clean_dep("//third_party:nccl/nccl_archive.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "kafka",
-      urls = [
-          "https://mirror.bazel.build/github.com/edenhill/librdkafka/archive/v0.11.4.tar.gz",
-          "https://github.com/edenhill/librdkafka/archive/v0.11.4.tar.gz",
-      ],
-      sha256 = "9d8f1eb7b0e29e9ab1168347c939cb7ae5dff00a39cef99e7ef033fd8f92737c",
-      strip_prefix = "librdkafka-0.11.4",
-      build_file = clean_dep("//third_party:kafka/BUILD"),
-      patch_file = clean_dep("//third_party/kafka:config.patch"),
-  )
-
-  tf_http_archive(
-      name = "aws",
-      urls = [
-          "https://mirror.bazel.build/github.com/aws/aws-sdk-cpp/archive/1.3.15.tar.gz",
-          "https://github.com/aws/aws-sdk-cpp/archive/1.3.15.tar.gz",
-      ],
-      sha256 = "b888d8ce5fc10254c3dd6c9020c7764dd53cf39cf011249d0b4deda895de1b7c",
-      strip_prefix = "aws-sdk-cpp-1.3.15",
-      build_file = clean_dep("//third_party:aws.BUILD"),
-  )
-
-  java_import_external(
-      name = "junit",
-      jar_sha256 = "59721f0805e223d84b90677887d9ff567dc534d7c502ca903c0c2b17f05c116a",
-      jar_urls = [
-          "https://mirror.bazel.build/repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.jar",
-          "http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.jar",
-          "http://maven.ibiblio.org/maven2/junit/junit/4.12/junit-4.12.jar",
-      ],
-      licenses = ["reciprocal"],  # Common Public License Version 1.0
-      testonly_ = True,
-      deps = ["@org_hamcrest_core"],
-  )
-
-  java_import_external(
-      name = "org_hamcrest_core",
-      jar_sha256 = "66fdef91e9739348df7a096aa384a5685f4e875584cce89386a7a47251c4d8e9",
-      jar_urls = [
-          "https://mirror.bazel.build/repo1.maven.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar",
-          "http://repo1.maven.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar",
-          "http://maven.ibiblio.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar",
-      ],
-      licenses = ["notice"],  # New BSD License
-      testonly_ = True,
-  )
-
-  tf_http_archive(
-      name = "jemalloc",
-      urls = [
-          "https://mirror.bazel.build/github.com/jemalloc/jemalloc/archive/4.4.0.tar.gz",
-          "https://github.com/jemalloc/jemalloc/archive/4.4.0.tar.gz",
-      ],
-      sha256 = "3c8f25c02e806c3ce0ab5fb7da1817f89fc9732709024e2a81b6b82f7cc792a8",
-      strip_prefix = "jemalloc-4.4.0",
-      build_file = clean_dep("//third_party:jemalloc.BUILD"),
-  )
-
-  java_import_external(
-      name = "com_google_testing_compile",
-      jar_sha256 = "edc180fdcd9f740240da1a7a45673f46f59c5578d8cd3fbc912161f74b5aebb8",
-      jar_urls = [
-          "http://mirror.bazel.build/repo1.maven.org/maven2/com/google/testing/compile/compile-testing/0.11/compile-testing-0.11.jar",
-          "http://repo1.maven.org/maven2/com/google/testing/compile/compile-testing/0.11/compile-testing-0.11.jar",
-      ],
-      licenses = ["notice"],  # New BSD License
-      testonly_ = True,
-      deps = ["@com_google_guava", "@com_google_truth"],
-  )
-
-  java_import_external(
-      name = "com_google_truth",
-      jar_sha256 = "032eddc69652b0a1f8d458f999b4a9534965c646b8b5de0eba48ee69407051df",
-      jar_urls = [
-          "http://mirror.bazel.build/repo1.maven.org/maven2/com/google/truth/truth/0.32/truth-0.32.jar",
-          "http://repo1.maven.org/maven2/com/google/truth/truth/0.32/truth-0.32.jar",
-      ],
-      licenses = ["notice"],  # Apache 2.0
-      testonly_ = True,
-      deps = ["@com_google_guava"],
-  )
-
-  java_import_external(
-      name = "org_checkerframework_qual",
-      jar_sha256 = "a17501717ef7c8dda4dba73ded50c0d7cde440fd721acfeacbf19786ceac1ed6",
-      jar_urls = [
-          "http://mirror.bazel.build/repo1.maven.org/maven2/org/checkerframework/checker-qual/2.4.0/checker-qual-2.4.0.jar",
-          "http://repo1.maven.org/maven2/org/checkerframework/checker-qual/2.4.0/checker-qual-2.4.0.jar",
-      ],
-      licenses = ["notice"],  # Apache 2.0
-  )
-
-  java_import_external(
-      name = "com_squareup_javapoet",
-      jar_sha256 = "5bb5abdfe4366c15c0da3332c57d484e238bd48260d6f9d6acf2b08fdde1efea",
-      jar_urls = [
-          "http://mirror.bazel.build/repo1.maven.org/maven2/com/squareup/javapoet/1.9.0/javapoet-1.9.0.jar",
-          "http://repo1.maven.org/maven2/com/squareup/javapoet/1.9.0/javapoet-1.9.0.jar",
-      ],
-      licenses = ["notice"],  # Apache 2.0
-  )
-
-  tf_http_archive(
-      name = "com_google_pprof",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/pprof/archive/c0fb62ec88c411cc91194465e54db2632845b650.tar.gz",
-          "https://github.com/google/pprof/archive/c0fb62ec88c411cc91194465e54db2632845b650.tar.gz",
-      ],
-      sha256 = "e0928ca4aa10ea1e0551e2d7ce4d1d7ea2d84b2abbdef082b0da84268791d0c4",
-      strip_prefix = "pprof-c0fb62ec88c411cc91194465e54db2632845b650",
-      build_file = clean_dep("//third_party:pprof.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "cub_archive",
-      urls = [
-          "https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.8.0.zip",
-          "https://github.com/NVlabs/cub/archive/1.8.0.zip",
-      ],
-      sha256 = "6bfa06ab52a650ae7ee6963143a0bbc667d6504822cbd9670369b598f18c58c3",
-      strip_prefix = "cub-1.8.0",
-      build_file = clean_dep("//third_party:cub.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "cython",
-      sha256 = "6dcd30b5ceb887b2b965ee7ceb82ea3acb5f0642fe2206c7636b45acea4798e5",
-      urls = [
-          "https://mirror.bazel.build/github.com/cython/cython/archive/3732784c45cfb040a5b0936951d196f83a12ea17.tar.gz",
-          "https://github.com/cython/cython/archive/3732784c45cfb040a5b0936951d196f83a12ea17.tar.gz",
-      ],
-      strip_prefix = "cython-3732784c45cfb040a5b0936951d196f83a12ea17",
-      build_file = clean_dep("//third_party:cython.BUILD"),
-      delete = ["BUILD.bazel"],
-  )
-
-  tf_http_archive(
-      name = "bazel_toolchains",
-      urls = [
-          "https://mirror.bazel.build/github.com/bazelbuild/bazel-toolchains/archive/44200e0c026d86c53470d107b3697a3e46469c43.tar.gz",
-          "https://github.com/bazelbuild/bazel-toolchains/archive/44200e0c026d86c53470d107b3697a3e46469c43.tar.gz",
-      ],
-      strip_prefix = "bazel-toolchains-44200e0c026d86c53470d107b3697a3e46469c43",
-      sha256 = "699b55a6916c687f4b7dc092dbbf5f64672cde0dc965f79717735ec4e5416556",
-  )
-
-  tf_http_archive(
-      name = "arm_neon_2_x86_sse",
-      sha256 = "c8d90aa4357f8079d427e87a6f4c493da1fa4140aee926c05902d7ec1533d9a5",
-      strip_prefix = "ARM_NEON_2_x86_SSE-0f77d9d182265259b135dad949230ecbf1a2633d",
-      urls = [
-          "https://mirror.bazel.build/github.com/intel/ARM_NEON_2_x86_SSE/archive/0f77d9d182265259b135dad949230ecbf1a2633d.tar.gz",
-          "https://github.com/intel/ARM_NEON_2_x86_SSE/archive/0f77d9d182265259b135dad949230ecbf1a2633d.tar.gz",
-      ],
-      build_file = clean_dep("//third_party:arm_neon_2_x86_sse.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "flatbuffers",
-      strip_prefix = "flatbuffers-1.9.0",
-      sha256 = "5ca5491e4260cacae30f1a5786d109230db3f3a6e5a0eb45d0d0608293d247e3",
-      urls = [
-          "https://mirror.bazel.build/github.com/google/flatbuffers/archive/v1.9.0.tar.gz",
-          "https://github.com/google/flatbuffers/archive/v1.9.0.tar.gz",
-      ],
-      build_file = clean_dep("//third_party/flatbuffers:flatbuffers.BUILD"),
-  )
-
-  native.new_http_archive(
-      name = "double_conversion",
-      urls = [
-          "https://github.com/google/double-conversion/archive/3992066a95b823efc8ccc1baf82a1cfc73f6e9b8.zip",
-      ],
-      sha256 = "2f7fbffac0d98d201ad0586f686034371a6d152ca67508ab611adc2386ad30de",
-      strip_prefix = "double-conversion-3992066a95b823efc8ccc1baf82a1cfc73f6e9b8",
-      build_file = clean_dep("//third_party:double_conversion.BUILD")
-  )
-
-  tf_http_archive(
-      name = "tflite_mobilenet",
-      sha256 = "23f814d1c076bdf03715dfb6cab3713aa4fbdf040fd5448c43196bd2e97a4c1b",
-      urls = [
-          "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip",
-          "https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip",
-      ],
-      build_file = clean_dep("//third_party:tflite_mobilenet.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "tflite_mobilenet_ssd",
-      sha256 = "767057f2837a46d97882734b03428e8dd640b93236052b312b2f0e45613c1cf0",
-      urls = [
-          "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_ssd_tflite_v1.zip",
-          "https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_ssd_tflite_v1.zip",
-      ],
-      build_file = str(Label("//third_party:tflite_mobilenet.BUILD")),
-  )
-  tf_http_archive(
-      name = "tflite_mobilenet_ssd_quant",
-      sha256 = "a809cd290b4d6a2e8a9d5dad076e0bd695b8091974e0eed1052b480b2f21b6dc",
-      urls = ["https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/coco_ssd_mobilenet_v1_0.75_quant_2018_06_29.zip",
-          "https://storage.googleapis.com/download.tensorflow.org/models/tflite/coco_ssd_mobilenet_v1_0.75_quant_2018_06_29.zip",
-      ],
-      build_file = str(Label("//third_party:tflite_mobilenet.BUILD")),
-  )
-
-  tf_http_archive(
-      name = "tflite_conv_actions_frozen",
-      sha256 = "d947b38cba389b5e2d0bfc3ea6cc49c784e187b41a071387b3742d1acac7691e",
-      urls = [
-          "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/conv_actions_tflite.zip",
-          "https://storage.googleapis.com/download.tensorflow.org/models/tflite/conv_actions_tflite.zip",
-      ],
-      build_file = str(Label("//third_party:tflite_mobilenet.BUILD")),
-  )
-
-  tf_http_archive(
-      name = "tflite_smartreply",
-      sha256 = "8980151b85a87a9c1a3bb1ed4748119e4a85abd3cb5744d83da4d4bd0fbeef7c",
-      urls = [
-          "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/smartreply_1.0_2017_11_01.zip",
-          "https://storage.googleapis.com/download.tensorflow.org/models/tflite/smartreply_1.0_2017_11_01.zip"
-      ],
-      build_file = clean_dep("//third_party:tflite_smartreply.BUILD"),
-  )
-
-  tf_http_archive(
-      name = "tflite_ovic_testdata",
-      sha256 = "a9a705d8d519220178e2e65d383fdb21da37fdb31d1e909b0a1acdac46479e9c",
-      urls = [
-          "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/data/ovic.zip",
-          "https://storage.googleapis.com/download.tensorflow.org/data/ovic.zip",
-      ],
-      build_file = clean_dep("//third_party:tflite_ovic_testdata.BUILD"),
-      strip_prefix = "ovic",
-  )
-
-  tf_http_archive(
-      name = "build_bazel_rules_android",
-      sha256 = "cd06d15dd8bb59926e4d65f9003bfc20f9da4b2519985c27e190cddc8b7a7806",
-      urls = [
-          "https://mirror.bazel.build/github.com/bazelbuild/rules_android/archive/v0.1.1.zip",
-          "https://github.com/bazelbuild/rules_android/archive/v0.1.1.zip",
-      ],
-      strip_prefix = "rules_android-0.1.1",
-  )
-
-  ##############################################################################
-  # BIND DEFINITIONS
-  #
-  # Please do not add bind() definitions unless we have no other choice.
-  # If that ends up being the case, please leave a comment explaining
-  # why we can't depend on the canonical build target.
-
-  # gRPC wants a cares dependency but its contents is not actually
-  # important since we have set GRPC_ARES=0 in tools/bazel.rc
-  native.bind(
-      name = "cares",
-      actual = "@grpc//third_party/nanopb:nanopb",
-  )
-
-  # Needed by Protobuf
-  native.bind(
-      name = "grpc_cpp_plugin",
-      actual = "@grpc//:grpc_cpp_plugin",
-  )
-  native.bind(
-      name = "grpc_python_plugin",
-      actual = "@grpc//:grpc_python_plugin",
-  )
-
-  native.bind(
-      name = "grpc_lib",
-      actual = "@grpc//:grpc++",
-  )
-
-  native.bind(
-      name = "grpc_lib_unsecure",
-      actual = "@grpc//:grpc++_unsecure",
-  )
-
-  # Needed by gRPC
-  native.bind(
-      name = "libssl",
-      actual = "@boringssl//:ssl",
-  )
-
-  # Needed by gRPC
-  native.bind(
-      name = "nanopb",
-      actual = "@grpc//third_party/nanopb:nanopb",
-  )
-
-  # Needed by gRPC
-  native.bind(
-      name = "protobuf",
-      actual = "@protobuf_archive//:protobuf",
-  )
-
-  # gRPC expects //external:protobuf_clib and //external:protobuf_compiler
-  # to point to Protobuf's compiler library.
-  native.bind(
-      name = "protobuf_clib",
-      actual = "@protobuf_archive//:protoc_lib",
-  )
-
-  # Needed by gRPC
-  native.bind(
-      name = "protobuf_headers",
-      actual = "@protobuf_archive//:protobuf_headers",
-  )
-
-  # Needed by Protobuf
-  native.bind(
-      name = "python_headers",
-      actual = clean_dep("//third_party/python_runtime:headers"),
-  )
-
-  # Needed by Protobuf
-  native.bind(
-      name = "six",
-      actual = "@six_archive//:six",
-  )
-
-  # Needed by gRPC
-  native.bind(
-      name = "zlib",
-      actual = "@zlib_archive//:zlib",
-  )
+def tf_workspace(path_prefix = "", tf_repo_name = ""):
+    # Note that we check the minimum bazel version in WORKSPACE.
+    clang6_configure(name = "local_config_clang6")
+    cc_download_clang_toolchain(name = "local_config_download_clang")
+    cuda_configure(name = "local_config_cuda")
+    tensorrt_configure(name = "local_config_tensorrt")
+    nccl_configure(name = "local_config_nccl")
+    git_configure(name = "local_config_git")
+    sycl_configure(name = "local_config_sycl")
+    syslibs_configure(name = "local_config_syslibs")
+    python_configure(name = "local_config_python")
+
+    # For windows bazel build
+    # TODO: Remove def file filter when TensorFlow can export symbols properly on Windows.
+    def_file_filter_configure(name = "local_config_def_file_filter")
+
+    # Point //external/local_config_arm_compiler to //external/arm_compiler
+    arm_compiler_configure(
+        name = "local_config_arm_compiler",
+        remote_config_repo = "../arm_compiler",
+        build_file = clean_dep("//third_party/toolchains/cpus/arm:BUILD"),
+    )
+
+    mkl_repository(
+        name = "mkl_linux",
+        urls = [
+            "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.15/mklml_lnx_2018.0.3.20180406.tgz",
+            "https://github.com/intel/mkl-dnn/releases/download/v0.15/mklml_lnx_2018.0.3.20180406.tgz",
+        ],
+        sha256 = "d2305244fdc9b87db7426ed4496e87a4b3977ad3374d73b8000e8b7a5b7aa725",
+        strip_prefix = "mklml_lnx_2018.0.3.20180406",
+        build_file = clean_dep("//third_party/mkl:mkl.BUILD"),
+    )
+    mkl_repository(
+        name = "mkl_windows",
+        urls = [
+            "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.15/mklml_win_2018.0.3.20180406.zip",
+            "https://github.com/intel/mkl-dnn/releases/download/v0.15/mklml_win_2018.0.3.20180406.zip",
+        ],
+        sha256 = "a584a5bf1c8d2ad70b90d12b52652030e9a338217719064fdb84b7ad0d693694",
+        strip_prefix = "mklml_win_2018.0.3.20180406",
+        build_file = clean_dep("//third_party/mkl:mkl.BUILD"),
+    )
+    mkl_repository(
+        name = "mkl_darwin",
+        urls = [
+            "https://mirror.bazel.build/github.com/intel/mkl-dnn/releases/download/v0.15/mklml_mac_2018.0.3.20180406.tgz",
+            "https://github.com/intel/mkl-dnn/releases/download/v0.15/mklml_mac_2018.0.3.20180406.tgz",
+        ],
+        sha256 = "094e3dfd61c816136dc8d12a45cc611ce26c5f4828176a3644cd0b0efa15a25b",
+        strip_prefix = "mklml_mac_2018.0.3.20180406",
+        build_file = clean_dep("//third_party/mkl:mkl.BUILD"),
+    )
+
+    if path_prefix:
+        print("path_prefix was specified to tf_workspace but is no longer used " +
+              "and will be removed in the future.")
+
+    tf_http_archive(
+        name = "mkl_dnn",
+        urls = [
+            "https://mirror.bazel.build/github.com/intel/mkl-dnn/archive/0c1cf54b63732e5a723c5670f66f6dfb19b64d20.tar.gz",
+            "https://github.com/intel/mkl-dnn/archive/0c1cf54b63732e5a723c5670f66f6dfb19b64d20.tar.gz",
+        ],
+        sha256 = "da1f27f92453a65331197dd8e4992e810fb7b1c4e0b902a1da5611592df2b633",
+        strip_prefix = "mkl-dnn-0c1cf54b63732e5a723c5670f66f6dfb19b64d20",
+        build_file = clean_dep("//third_party/mkl_dnn:mkldnn.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "com_google_absl",
+        urls = [
+            "https://mirror.bazel.build/github.com/abseil/abseil-cpp/archive/9613678332c976568272c8f4a78631a29159271d.tar.gz",
+            "https://github.com/abseil/abseil-cpp/archive/9613678332c976568272c8f4a78631a29159271d.tar.gz",
+        ],
+        sha256 = "1273a1434ced93bc3e703a48c5dced058c95e995c8c009e9bdcb24a69e2180e9",
+        strip_prefix = "abseil-cpp-9613678332c976568272c8f4a78631a29159271d",
+        build_file = clean_dep("//third_party:com_google_absl.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "eigen_archive",
+        urls = [
+            "https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/fd6845384b86.tar.gz",
+            "https://bitbucket.org/eigen/eigen/get/fd6845384b86.tar.gz",
+        ],
+        sha256 = "d956415d784fa4e42b6a2a45c32556d6aec9d0a3d8ef48baee2522ab762556a9",
+        strip_prefix = "eigen-eigen-fd6845384b86",
+        build_file = clean_dep("//third_party:eigen.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "arm_compiler",
+        sha256 = "970285762565c7890c6c087d262b0a18286e7d0384f13a37786d8521773bc969",
+        strip_prefix = "tools-0e906ebc527eab1cdbf7adabff5b474da9562e9f/arm-bcm2708/arm-rpi-4.9.3-linux-gnueabihf",
+        urls = [
+            "https://mirror.bazel.build/github.com/raspberrypi/tools/archive/0e906ebc527eab1cdbf7adabff5b474da9562e9f.tar.gz",
+            # Please uncomment me, when the next upgrade happens. Then
+            # remove the whitelist entry in third_party/repo.bzl.
+            # "https://github.com/raspberrypi/tools/archive/0e906ebc527eab1cdbf7adabff5b474da9562e9f.tar.gz",
+        ],
+        build_file = clean_dep("//:arm_compiler.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "libxsmm_archive",
+        urls = [
+            "https://mirror.bazel.build/github.com/hfp/libxsmm/archive/1.9.tar.gz",
+            "https://github.com/hfp/libxsmm/archive/1.9.tar.gz",
+        ],
+        sha256 = "cd8532021352b4a0290d209f7f9bfd7c2411e08286a893af3577a43457287bfa",
+        strip_prefix = "libxsmm-1.9",
+        build_file = clean_dep("//third_party:libxsmm.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "ortools_archive",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/or-tools/archive/v6.7.2.tar.gz",
+            "https://github.com/google/or-tools/archive/v6.7.2.tar.gz",
+        ],
+        sha256 = "d025a95f78b5fc5eaa4da5f395f23d11c23cf7dbd5069f1f627f002de87b86b9",
+        strip_prefix = "or-tools-6.7.2/src",
+        build_file = clean_dep("//third_party:ortools.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "com_googlesource_code_re2",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/re2/archive/2018-04-01.tar.gz",
+            "https://github.com/google/re2/archive/2018-04-01.tar.gz",
+        ],
+        sha256 = "2f945446b71336e7f5a2bcace1abcf0b23fbba368266c6a1be33de3de3b3c912",
+        strip_prefix = "re2-2018-04-01",
+        system_build_file = clean_dep("//third_party/systemlibs:re2.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "com_github_googlecloudplatform_google_cloud_cpp",
+        urls = [
+            "https://mirror.bazel.build/github.com/GoogleCloudPlatform/google-cloud-cpp/archive/f875700a023bdd706333cde45aee8758b272c357.tar.gz",
+            "https://github.com/GoogleCloudPlatform/google-cloud-cpp/archive/f875700a023bdd706333cde45aee8758b272c357.tar.gz",
+        ],
+        sha256 = "a34f3c50b237686dc870b13baaa6a5836ce3473f2f2a02717299f0ff318372db",
+        strip_prefix = "google-cloud-cpp-f875700a023bdd706333cde45aee8758b272c357",
+    )
+
+    tf_http_archive(
+        name = "com_github_googleapis_googleapis",
+        urls = [
+            "https://mirror.bazel.build/github.com/googleapis/googleapis/archive/f81082ea1e2f85c43649bee26e0d9871d4b41cdb.zip",
+            "https://github.com/googleapis/googleapis/archive/f81082ea1e2f85c43649bee26e0d9871d4b41cdb.zip",
+        ],
+        sha256 = "824870d87a176f26bcef663e92051f532fac756d1a06b404055dc078425f4378",
+        strip_prefix = "googleapis-f81082ea1e2f85c43649bee26e0d9871d4b41cdb",
+        build_file = clean_dep("//third_party:googleapis.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "gemmlowp",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/gemmlowp/archive/38ebac7b059e84692f53e5938f97a9943c120d98.zip",
+            "https://github.com/google/gemmlowp/archive/38ebac7b059e84692f53e5938f97a9943c120d98.zip",
+        ],
+        sha256 = "b87faa7294dfcc5d678f22a59d2c01ca94ea1e2a3b488c38a95a67889ed0a658",
+        strip_prefix = "gemmlowp-38ebac7b059e84692f53e5938f97a9943c120d98",
+    )
+
+    tf_http_archive(
+        name = "farmhash_archive",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/farmhash/archive/816a4ae622e964763ca0862d9dbd19324a1eaf45.tar.gz",
+            "https://github.com/google/farmhash/archive/816a4ae622e964763ca0862d9dbd19324a1eaf45.tar.gz",
+        ],
+        sha256 = "6560547c63e4af82b0f202cb710ceabb3f21347a4b996db565a411da5b17aba0",
+        strip_prefix = "farmhash-816a4ae622e964763ca0862d9dbd19324a1eaf45",
+        build_file = clean_dep("//third_party:farmhash.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "highwayhash",
+        urls = [
+            "http://mirror.bazel.build/github.com/google/highwayhash/archive/fd3d9af80465e4383162e4a7c5e2f406e82dd968.tar.gz",
+            "https://github.com/google/highwayhash/archive/fd3d9af80465e4383162e4a7c5e2f406e82dd968.tar.gz",
+        ],
+        sha256 = "9c3e0e87d581feeb0c18d814d98f170ff23e62967a2bd6855847f0b2fe598a37",
+        strip_prefix = "highwayhash-fd3d9af80465e4383162e4a7c5e2f406e82dd968",
+        build_file = clean_dep("//third_party:highwayhash.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "nasm",
+        urls = [
+            "https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.13.03/nasm-2.13.03.tar.bz2",
+            "http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.13.03.tar.bz2/sha512/d7a6b4cee8dfd603d8d4c976e5287b5cc542fa0b466ff989b743276a6e28114e64289bf02a7819eca63142a5278aa6eed57773007e5f589e15768e6456a8919d/nasm-2.13.03.tar.bz2",
+            "http://www.nasm.us/pub/nasm/releasebuilds/2.13.03/nasm-2.13.03.tar.bz2",
+        ],
+        sha256 = "63ec86477ad3f0f6292325fd89e1d93aea2e2fd490070863f17d48f7cd387011",
+        strip_prefix = "nasm-2.13.03",
+        build_file = clean_dep("//third_party:nasm.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:nasm.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "jpeg",
+        urls = [
+            "https://mirror.bazel.build/github.com/libjpeg-turbo/libjpeg-turbo/archive/1.5.3.tar.gz",
+            "https://github.com/libjpeg-turbo/libjpeg-turbo/archive/1.5.3.tar.gz",
+        ],
+        sha256 = "1a17020f859cb12711175a67eab5c71fc1904e04b587046218e36106e07eabde",
+        strip_prefix = "libjpeg-turbo-1.5.3",
+        build_file = clean_dep("//third_party/jpeg:jpeg.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:jpeg.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "png_archive",
+        urls = [
+            "https://mirror.bazel.build/github.com/glennrp/libpng/archive/v1.6.34.tar.gz",
+            "https://github.com/glennrp/libpng/archive/v1.6.34.tar.gz",
+        ],
+        sha256 = "e45ce5f68b1d80e2cb9a2b601605b374bdf51e1798ef1c2c2bd62131dfcf9eef",
+        strip_prefix = "libpng-1.6.34",
+        build_file = clean_dep("//third_party:png.BUILD"),
+        patch_file = clean_dep("//third_party:png_fix_rpi.patch"),
+        system_build_file = clean_dep("//third_party/systemlibs:png.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "org_sqlite",
+        urls = [
+            "https://mirror.bazel.build/www.sqlite.org/2018/sqlite-amalgamation-3240000.zip",
+            "https://www.sqlite.org/2018/sqlite-amalgamation-3240000.zip",
+        ],
+        sha256 = "ad68c1216c3a474cf360c7581a4001e952515b3649342100f2d7ca7c8e313da6",
+        strip_prefix = "sqlite-amalgamation-3240000",
+        build_file = clean_dep("//third_party:sqlite.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:sqlite.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "gif_archive",
+        urls = [
+            "https://mirror.bazel.build/ufpr.dl.sourceforge.net/project/giflib/giflib-5.1.4.tar.gz",
+            "http://pilotfiber.dl.sourceforge.net/project/giflib/giflib-5.1.4.tar.gz",
+        ],
+        sha256 = "34a7377ba834397db019e8eb122e551a49c98f49df75ec3fcc92b9a794a4f6d1",
+        strip_prefix = "giflib-5.1.4",
+        build_file = clean_dep("//third_party:gif.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:gif.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "six_archive",
+        urls = [
+            "https://mirror.bazel.build/pypi.python.org/packages/source/s/six/six-1.10.0.tar.gz",
+            "https://pypi.python.org/packages/source/s/six/six-1.10.0.tar.gz",
+        ],
+        sha256 = "105f8d68616f8248e24bf0e9372ef04d3cc10104f1980f54d57b2ce73a5ad56a",
+        strip_prefix = "six-1.10.0",
+        build_file = clean_dep("//third_party:six.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:six.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "astor_archive",
+        urls = [
+            "https://mirror.bazel.build/pypi.python.org/packages/d8/be/c4276b3199ec3feee2a88bc64810fbea8f26d961e0a4cd9c68387a9f35de/astor-0.6.2.tar.gz",
+            "https://pypi.python.org/packages/d8/be/c4276b3199ec3feee2a88bc64810fbea8f26d961e0a4cd9c68387a9f35de/astor-0.6.2.tar.gz",
+        ],
+        sha256 = "ff6d2e2962d834acb125cc4dcc80c54a8c17c253f4cc9d9c43b5102a560bb75d",
+        strip_prefix = "astor-0.6.2",
+        build_file = clean_dep("//third_party:astor.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:astor.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "gast_archive",
+        urls = [
+            "https://mirror.bazel.build/pypi.python.org/packages/5c/78/ff794fcae2ce8aa6323e789d1f8b3b7765f601e7702726f430e814822b96/gast-0.2.0.tar.gz",
+            "https://pypi.python.org/packages/5c/78/ff794fcae2ce8aa6323e789d1f8b3b7765f601e7702726f430e814822b96/gast-0.2.0.tar.gz",
+        ],
+        sha256 = "7068908321ecd2774f145193c4b34a11305bd104b4551b09273dfd1d6a374930",
+        strip_prefix = "gast-0.2.0",
+        build_file = clean_dep("//third_party:gast.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "termcolor_archive",
+        urls = [
+            "https://mirror.bazel.build/pypi.python.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz",
+            "https://pypi.python.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz",
+        ],
+        sha256 = "1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b",
+        strip_prefix = "termcolor-1.1.0",
+        build_file = clean_dep("//third_party:termcolor.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:termcolor.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "absl_py",
+        urls = [
+            "https://mirror.bazel.build/github.com/abseil/abseil-py/archive/pypi-v0.2.2.tar.gz",
+            "https://github.com/abseil/abseil-py/archive/pypi-v0.2.2.tar.gz",
+        ],
+        sha256 = "95160f778a62c7a60ddeadc7bf2d83f85a23a27359814aca12cf949e896fa82c",
+        strip_prefix = "abseil-py-pypi-v0.2.2",
+    )
+
+    tf_http_archive(
+        name = "org_python_pypi_backports_weakref",
+        urls = [
+            "https://mirror.bazel.build/pypi.python.org/packages/bc/cc/3cdb0a02e7e96f6c70bd971bc8a90b8463fda83e264fa9c5c1c98ceabd81/backports.weakref-1.0rc1.tar.gz",
+            "https://pypi.python.org/packages/bc/cc/3cdb0a02e7e96f6c70bd971bc8a90b8463fda83e264fa9c5c1c98ceabd81/backports.weakref-1.0rc1.tar.gz",
+        ],
+        sha256 = "8813bf712a66b3d8b85dc289e1104ed220f1878cf981e2fe756dfaabe9a82892",
+        strip_prefix = "backports.weakref-1.0rc1/src",
+        build_file = clean_dep("//third_party:backports_weakref.BUILD"),
+    )
+
+    filegroup_external(
+        name = "org_python_license",
+        licenses = ["notice"],  # Python 2.0
+        sha256_urls = {
+            "b5556e921715ddb9242c076cae3963f483aa47266c5e37ea4c187f77cc79501c": [
+                "https://mirror.bazel.build/docs.python.org/2.7/_sources/license.txt",
+                "https://docs.python.org/2.7/_sources/license.txt",
+            ],
+        },
+    )
+
+    tf_http_archive(
+        name = "protobuf_archive",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/protobuf/archive/v3.6.0.tar.gz",
+            "https://github.com/google/protobuf/archive/v3.6.0.tar.gz",
+        ],
+        sha256 = "50a5753995b3142627ac55cfd496cebc418a2e575ca0236e29033c67bd5665f4",
+        strip_prefix = "protobuf-3.6.0",
+    )
+
+    # We need to import the protobuf library under the names com_google_protobuf
+    # and com_google_protobuf_cc to enable proto_library support in bazel.
+    # Unfortunately there is no way to alias http_archives at the moment.
+    tf_http_archive(
+        name = "com_google_protobuf",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/protobuf/archive/v3.6.0.tar.gz",
+            "https://github.com/google/protobuf/archive/v3.6.0.tar.gz",
+        ],
+        sha256 = "50a5753995b3142627ac55cfd496cebc418a2e575ca0236e29033c67bd5665f4",
+        strip_prefix = "protobuf-3.6.0",
+    )
+
+    tf_http_archive(
+        name = "com_google_protobuf_cc",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/protobuf/archive/v3.6.0.tar.gz",
+            "https://github.com/google/protobuf/archive/v3.6.0.tar.gz",
+        ],
+        sha256 = "50a5753995b3142627ac55cfd496cebc418a2e575ca0236e29033c67bd5665f4",
+        strip_prefix = "protobuf-3.6.0",
+    )
+
+    tf_http_archive(
+        name = "nsync",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/nsync/archive/1.20.0.tar.gz",
+            "https://github.com/google/nsync/archive/1.20.0.tar.gz",
+        ],
+        sha256 = "0c1b03962b2f8450f21e74a5a46116bf2d6009a807c57eb4207e974a8c4bb7dd",
+        strip_prefix = "nsync-1.20.0",
+    )
+
+    tf_http_archive(
+        name = "com_google_googletest",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/googletest/archive/9816b96a6ddc0430671693df90192bbee57108b6.zip",
+            "https://github.com/google/googletest/archive/9816b96a6ddc0430671693df90192bbee57108b6.zip",
+        ],
+        sha256 = "9cbca84c4256bed17df2c8f4d00c912c19d247c11c9ba6647cd6dd5b5c996b8d",
+        strip_prefix = "googletest-9816b96a6ddc0430671693df90192bbee57108b6",
+    )
+
+    tf_http_archive(
+        name = "com_github_gflags_gflags",
+        urls = [
+            "https://mirror.bazel.build/github.com/gflags/gflags/archive/v2.2.1.tar.gz",
+            "https://github.com/gflags/gflags/archive/v2.2.1.tar.gz",
+        ],
+        sha256 = "ae27cdbcd6a2f935baa78e4f21f675649271634c092b1be01469440495609d0e",
+        strip_prefix = "gflags-2.2.1",
+    )
+
+    tf_http_archive(
+        name = "pcre",
+        sha256 = "69acbc2fbdefb955d42a4c606dfde800c2885711d2979e356c0636efde9ec3b5",
+        urls = [
+            "https://mirror.bazel.build/ftp.exim.org/pub/pcre/pcre-8.42.tar.gz",
+            "http://ftp.exim.org/pub/pcre/pcre-8.42.tar.gz",
+        ],
+        strip_prefix = "pcre-8.42",
+        build_file = clean_dep("//third_party:pcre.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:pcre.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "swig",
+        sha256 = "58a475dbbd4a4d7075e5fe86d4e54c9edde39847cdb96a3053d87cb64a23a453",
+        urls = [
+            "https://mirror.bazel.build/ufpr.dl.sourceforge.net/project/swig/swig/swig-3.0.8/swig-3.0.8.tar.gz",
+            "http://ufpr.dl.sourceforge.net/project/swig/swig/swig-3.0.8/swig-3.0.8.tar.gz",
+            "http://pilotfiber.dl.sourceforge.net/project/swig/swig/swig-3.0.8/swig-3.0.8.tar.gz",
+        ],
+        strip_prefix = "swig-3.0.8",
+        build_file = clean_dep("//third_party:swig.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:swig.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "curl",
+        sha256 = "e9c37986337743f37fd14fe8737f246e97aec94b39d1b71e8a5973f72a9fc4f5",
+        urls = [
+            "https://mirror.bazel.build/curl.haxx.se/download/curl-7.60.0.tar.gz",
+            "https://curl.haxx.se/download/curl-7.60.0.tar.gz",
+        ],
+        strip_prefix = "curl-7.60.0",
+        build_file = clean_dep("//third_party:curl.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:curl.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "grpc",
+        urls = [
+            "https://mirror.bazel.build/github.com/grpc/grpc/archive/v1.13.0.tar.gz",
+            "https://github.com/grpc/grpc/archive/v1.13.0.tar.gz",
+        ],
+        sha256 = "50db9cf2221354485eb7c3bd55a4c27190caef7048a2a1a15fbe60a498f98b44",
+        strip_prefix = "grpc-1.13.0",
+        system_build_file = clean_dep("//third_party/systemlibs:grpc.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "linenoise",
+        sha256 = "7f51f45887a3d31b4ce4fa5965210a5e64637ceac12720cfce7954d6a2e812f7",
+        urls = [
+            "https://mirror.bazel.build/github.com/antirez/linenoise/archive/c894b9e59f02203dbe4e2be657572cf88c4230c3.tar.gz",
+            "https://github.com/antirez/linenoise/archive/c894b9e59f02203dbe4e2be657572cf88c4230c3.tar.gz",
+        ],
+        strip_prefix = "linenoise-c894b9e59f02203dbe4e2be657572cf88c4230c3",
+        build_file = clean_dep("//third_party:linenoise.BUILD"),
+    )
+
+    # TODO(phawkins): currently, this rule uses an unofficial LLVM mirror.
+    # Switch to an official source of snapshots if/when possible.
+    tf_http_archive(
+        name = "llvm",
+        urls = [
+            "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/7b3bfc8151f3a6bcd9642c49c1f86f66cc43a428.tar.gz",
+            "https://github.com/llvm-mirror/llvm/archive/7b3bfc8151f3a6bcd9642c49c1f86f66cc43a428.tar.gz",
+        ],
+        sha256 = "c6cbb21acd46e3e00faa8c379595ecffb99ef77622da17f29371db2bfad1d3d3",
+        strip_prefix = "llvm-7b3bfc8151f3a6bcd9642c49c1f86f66cc43a428",
+        build_file = clean_dep("//third_party/llvm:llvm.autogenerated.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "lmdb",
+        urls = [
+            "https://mirror.bazel.build/github.com/LMDB/lmdb/archive/LMDB_0.9.22.tar.gz",
+            "https://github.com/LMDB/lmdb/archive/LMDB_0.9.22.tar.gz",
+        ],
+        sha256 = "f3927859882eb608868c8c31586bb7eb84562a40a6bf5cc3e13b6b564641ea28",
+        strip_prefix = "lmdb-LMDB_0.9.22/libraries/liblmdb",
+        build_file = clean_dep("//third_party:lmdb.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:lmdb.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "jsoncpp_git",
+        urls = [
+            "https://mirror.bazel.build/github.com/open-source-parsers/jsoncpp/archive/1.8.4.tar.gz",
+            "https://github.com/open-source-parsers/jsoncpp/archive/1.8.4.tar.gz",
+        ],
+        sha256 = "c49deac9e0933bcb7044f08516861a2d560988540b23de2ac1ad443b219afdb6",
+        strip_prefix = "jsoncpp-1.8.4",
+        build_file = clean_dep("//third_party:jsoncpp.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:jsoncpp.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "boringssl",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/boringssl/archive/a0fb951d2a26a8ee746b52f3ba81ab011a0af778.tar.gz",
+            "https://github.com/google/boringssl/archive/a0fb951d2a26a8ee746b52f3ba81ab011a0af778.tar.gz",
+        ],
+        sha256 = "524ba98a56300149696481b4cb9ddebd0c7b7ac9b9f6edee81da2d2d7e5d2bb3",
+        strip_prefix = "boringssl-a0fb951d2a26a8ee746b52f3ba81ab011a0af778",
+    )
+
+    tf_http_archive(
+        name = "zlib_archive",
+        urls = [
+            "https://mirror.bazel.build/zlib.net/zlib-1.2.11.tar.gz",
+            "https://zlib.net/zlib-1.2.11.tar.gz",
+        ],
+        sha256 = "c3e5e9fdd5004dcb542feda5ee4f0ff0744628baf8ed2dd5d66f8ca1197cb1a1",
+        strip_prefix = "zlib-1.2.11",
+        build_file = clean_dep("//third_party:zlib.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:zlib.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "fft2d",
+        urls = [
+            "https://mirror.bazel.build/www.kurims.kyoto-u.ac.jp/~ooura/fft.tgz",
+            "http://www.kurims.kyoto-u.ac.jp/~ooura/fft.tgz",
+        ],
+        sha256 = "52bb637c70b971958ec79c9c8752b1df5ff0218a4db4510e60826e0cb79b5296",
+        build_file = clean_dep("//third_party/fft2d:fft2d.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "snappy",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/snappy/archive/1.1.7.tar.gz",
+            "https://github.com/google/snappy/archive/1.1.7.tar.gz",
+        ],
+        sha256 = "3dfa02e873ff51a11ee02b9ca391807f0c8ea0529a4924afa645fbf97163f9d4",
+        strip_prefix = "snappy-1.1.7",
+        build_file = clean_dep("//third_party:snappy.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:snappy.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "nccl_archive",
+        urls = [
+            "https://mirror.bazel.build/github.com/nvidia/nccl/archive/03d856977ecbaac87e598c0c4bafca96761b9ac7.tar.gz",
+            "https://github.com/nvidia/nccl/archive/03d856977ecbaac87e598c0c4bafca96761b9ac7.tar.gz",
+        ],
+        sha256 = "2ca86fb6179ecbff789cc67c836139c1bbc0324ed8c04643405a30bf26325176",
+        strip_prefix = "nccl-03d856977ecbaac87e598c0c4bafca96761b9ac7",
+        build_file = clean_dep("//third_party:nccl/nccl_archive.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "kafka",
+        urls = [
+            "https://mirror.bazel.build/github.com/edenhill/librdkafka/archive/v0.11.4.tar.gz",
+            "https://github.com/edenhill/librdkafka/archive/v0.11.4.tar.gz",
+        ],
+        sha256 = "9d8f1eb7b0e29e9ab1168347c939cb7ae5dff00a39cef99e7ef033fd8f92737c",
+        strip_prefix = "librdkafka-0.11.4",
+        build_file = clean_dep("//third_party:kafka/BUILD"),
+        patch_file = clean_dep("//third_party/kafka:config.patch"),
+    )
+
+    tf_http_archive(
+        name = "aws",
+        urls = [
+            "https://mirror.bazel.build/github.com/aws/aws-sdk-cpp/archive/1.3.15.tar.gz",
+            "https://github.com/aws/aws-sdk-cpp/archive/1.3.15.tar.gz",
+        ],
+        sha256 = "b888d8ce5fc10254c3dd6c9020c7764dd53cf39cf011249d0b4deda895de1b7c",
+        strip_prefix = "aws-sdk-cpp-1.3.15",
+        build_file = clean_dep("//third_party:aws.BUILD"),
+    )
+
+    java_import_external(
+        name = "junit",
+        jar_sha256 = "59721f0805e223d84b90677887d9ff567dc534d7c502ca903c0c2b17f05c116a",
+        jar_urls = [
+            "https://mirror.bazel.build/repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.jar",
+            "http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.jar",
+            "http://maven.ibiblio.org/maven2/junit/junit/4.12/junit-4.12.jar",
+        ],
+        licenses = ["reciprocal"],  # Common Public License Version 1.0
+        testonly_ = True,
+        deps = ["@org_hamcrest_core"],
+    )
+
+    java_import_external(
+        name = "org_hamcrest_core",
+        jar_sha256 = "66fdef91e9739348df7a096aa384a5685f4e875584cce89386a7a47251c4d8e9",
+        jar_urls = [
+            "https://mirror.bazel.build/repo1.maven.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar",
+            "http://repo1.maven.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar",
+            "http://maven.ibiblio.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar",
+        ],
+        licenses = ["notice"],  # New BSD License
+        testonly_ = True,
+    )
+
+    tf_http_archive(
+        name = "jemalloc",
+        urls = [
+            "https://mirror.bazel.build/github.com/jemalloc/jemalloc/archive/4.4.0.tar.gz",
+            "https://github.com/jemalloc/jemalloc/archive/4.4.0.tar.gz",
+        ],
+        sha256 = "3c8f25c02e806c3ce0ab5fb7da1817f89fc9732709024e2a81b6b82f7cc792a8",
+        strip_prefix = "jemalloc-4.4.0",
+        build_file = clean_dep("//third_party:jemalloc.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:jemalloc.BUILD"),
+    )
+
+    java_import_external(
+        name = "com_google_testing_compile",
+        jar_sha256 = "edc180fdcd9f740240da1a7a45673f46f59c5578d8cd3fbc912161f74b5aebb8",
+        jar_urls = [
+            "http://mirror.bazel.build/repo1.maven.org/maven2/com/google/testing/compile/compile-testing/0.11/compile-testing-0.11.jar",
+            "http://repo1.maven.org/maven2/com/google/testing/compile/compile-testing/0.11/compile-testing-0.11.jar",
+        ],
+        licenses = ["notice"],  # New BSD License
+        testonly_ = True,
+        deps = ["@com_google_guava", "@com_google_truth"],
+    )
+
+    java_import_external(
+        name = "com_google_truth",
+        jar_sha256 = "032eddc69652b0a1f8d458f999b4a9534965c646b8b5de0eba48ee69407051df",
+        jar_urls = [
+            "http://mirror.bazel.build/repo1.maven.org/maven2/com/google/truth/truth/0.32/truth-0.32.jar",
+            "http://repo1.maven.org/maven2/com/google/truth/truth/0.32/truth-0.32.jar",
+        ],
+        licenses = ["notice"],  # Apache 2.0
+        testonly_ = True,
+        deps = ["@com_google_guava"],
+    )
+
+    java_import_external(
+        name = "org_checkerframework_qual",
+        jar_sha256 = "a17501717ef7c8dda4dba73ded50c0d7cde440fd721acfeacbf19786ceac1ed6",
+        jar_urls = [
+            "http://mirror.bazel.build/repo1.maven.org/maven2/org/checkerframework/checker-qual/2.4.0/checker-qual-2.4.0.jar",
+            "http://repo1.maven.org/maven2/org/checkerframework/checker-qual/2.4.0/checker-qual-2.4.0.jar",
+        ],
+        licenses = ["notice"],  # Apache 2.0
+    )
+
+    java_import_external(
+        name = "com_squareup_javapoet",
+        jar_sha256 = "5bb5abdfe4366c15c0da3332c57d484e238bd48260d6f9d6acf2b08fdde1efea",
+        jar_urls = [
+            "http://mirror.bazel.build/repo1.maven.org/maven2/com/squareup/javapoet/1.9.0/javapoet-1.9.0.jar",
+            "http://repo1.maven.org/maven2/com/squareup/javapoet/1.9.0/javapoet-1.9.0.jar",
+        ],
+        licenses = ["notice"],  # Apache 2.0
+    )
+
+    tf_http_archive(
+        name = "com_google_pprof",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/pprof/archive/c0fb62ec88c411cc91194465e54db2632845b650.tar.gz",
+            "https://github.com/google/pprof/archive/c0fb62ec88c411cc91194465e54db2632845b650.tar.gz",
+        ],
+        sha256 = "e0928ca4aa10ea1e0551e2d7ce4d1d7ea2d84b2abbdef082b0da84268791d0c4",
+        strip_prefix = "pprof-c0fb62ec88c411cc91194465e54db2632845b650",
+        build_file = clean_dep("//third_party:pprof.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "cub_archive",
+        urls = [
+            "https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.8.0.zip",
+            "https://github.com/NVlabs/cub/archive/1.8.0.zip",
+        ],
+        sha256 = "6bfa06ab52a650ae7ee6963143a0bbc667d6504822cbd9670369b598f18c58c3",
+        strip_prefix = "cub-1.8.0",
+        build_file = clean_dep("//third_party:cub.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "cython",
+        sha256 = "bccc9aa050ea02595b2440188813b936eaf345e85fb9692790cecfe095cf91aa",
+        urls = [
+            "https://mirror.bazel.build/github.com/cython/cython/archive/0.28.4.tar.gz",
+            "https://github.com/cython/cython/archive/0.28.4.tar.gz",
+        ],
+        strip_prefix = "cython-0.28.4",
+        build_file = clean_dep("//third_party:cython.BUILD"),
+        delete = ["BUILD.bazel"],
+        system_build_file = clean_dep("//third_party/systemlibs:cython.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "bazel_toolchains",
+        urls = [
+            "https://mirror.bazel.build/github.com/bazelbuild/bazel-toolchains/archive/37acf1841ab1475c98a152cb9e446460c8ae29e1.tar.gz",
+            "https://github.com/bazelbuild/bazel-toolchains/archive/37acf1841ab1475c98a152cb9e446460c8ae29e1.tar.gz",
+        ],
+        strip_prefix = "bazel-toolchains-37acf1841ab1475c98a152cb9e446460c8ae29e1",
+        sha256 = "3b604699685c5c65dd3f6f17425570a4b2f00ddba2f750db15acc72e55bb098b",
+    )
+
+    tf_http_archive(
+        name = "arm_neon_2_x86_sse",
+        sha256 = "c8d90aa4357f8079d427e87a6f4c493da1fa4140aee926c05902d7ec1533d9a5",
+        strip_prefix = "ARM_NEON_2_x86_SSE-0f77d9d182265259b135dad949230ecbf1a2633d",
+        urls = [
+            "https://mirror.bazel.build/github.com/intel/ARM_NEON_2_x86_SSE/archive/0f77d9d182265259b135dad949230ecbf1a2633d.tar.gz",
+            "https://github.com/intel/ARM_NEON_2_x86_SSE/archive/0f77d9d182265259b135dad949230ecbf1a2633d.tar.gz",
+        ],
+        build_file = clean_dep("//third_party:arm_neon_2_x86_sse.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "flatbuffers",
+        strip_prefix = "flatbuffers-1.9.0",
+        sha256 = "5ca5491e4260cacae30f1a5786d109230db3f3a6e5a0eb45d0d0608293d247e3",
+        urls = [
+            "https://mirror.bazel.build/github.com/google/flatbuffers/archive/v1.9.0.tar.gz",
+            "https://github.com/google/flatbuffers/archive/v1.9.0.tar.gz",
+        ],
+        build_file = clean_dep("//third_party/flatbuffers:flatbuffers.BUILD"),
+        system_build_file = clean_dep("//third_party/systemlibs:flatbuffers.BUILD"),
+    )
+
+    native.new_http_archive(
+        name = "double_conversion",
+        urls = [
+            "https://github.com/google/double-conversion/archive/3992066a95b823efc8ccc1baf82a1cfc73f6e9b8.zip",
+        ],
+        sha256 = "2f7fbffac0d98d201ad0586f686034371a6d152ca67508ab611adc2386ad30de",
+        strip_prefix = "double-conversion-3992066a95b823efc8ccc1baf82a1cfc73f6e9b8",
+        build_file = clean_dep("//third_party:double_conversion.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "tflite_mobilenet",
+        sha256 = "23f814d1c076bdf03715dfb6cab3713aa4fbdf040fd5448c43196bd2e97a4c1b",
+        urls = [
+            "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip",
+            "https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip",
+        ],
+        build_file = clean_dep("//third_party:tflite_mobilenet.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "tflite_mobilenet_ssd",
+        sha256 = "767057f2837a46d97882734b03428e8dd640b93236052b312b2f0e45613c1cf0",
+        urls = [
+            "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_ssd_tflite_v1.zip",
+            "https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_ssd_tflite_v1.zip",
+        ],
+        build_file = str(Label("//third_party:tflite_mobilenet.BUILD")),
+    )
+    tf_http_archive(
+        name = "tflite_mobilenet_ssd_quant",
+        sha256 = "a809cd290b4d6a2e8a9d5dad076e0bd695b8091974e0eed1052b480b2f21b6dc",
+        urls = [
+            "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/coco_ssd_mobilenet_v1_0.75_quant_2018_06_29.zip",
+            "https://storage.googleapis.com/download.tensorflow.org/models/tflite/coco_ssd_mobilenet_v1_0.75_quant_2018_06_29.zip",
+        ],
+        build_file = str(Label("//third_party:tflite_mobilenet.BUILD")),
+    )
+
+    tf_http_archive(
+        name = "tflite_conv_actions_frozen",
+        sha256 = "d947b38cba389b5e2d0bfc3ea6cc49c784e187b41a071387b3742d1acac7691e",
+        urls = [
+            "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/conv_actions_tflite.zip",
+            "https://storage.googleapis.com/download.tensorflow.org/models/tflite/conv_actions_tflite.zip",
+        ],
+        build_file = str(Label("//third_party:tflite_mobilenet.BUILD")),
+    )
+
+    tf_http_archive(
+        name = "tflite_smartreply",
+        sha256 = "8980151b85a87a9c1a3bb1ed4748119e4a85abd3cb5744d83da4d4bd0fbeef7c",
+        urls = [
+            "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/models/tflite/smartreply_1.0_2017_11_01.zip",
+            "https://storage.googleapis.com/download.tensorflow.org/models/tflite/smartreply_1.0_2017_11_01.zip",
+        ],
+        build_file = clean_dep("//third_party:tflite_smartreply.BUILD"),
+    )
+
+    tf_http_archive(
+        name = "tflite_ovic_testdata",
+        sha256 = "a9a705d8d519220178e2e65d383fdb21da37fdb31d1e909b0a1acdac46479e9c",
+        urls = [
+            "https://mirror.bazel.build/storage.googleapis.com/download.tensorflow.org/data/ovic.zip",
+            "https://storage.googleapis.com/download.tensorflow.org/data/ovic.zip",
+        ],
+        build_file = clean_dep("//third_party:tflite_ovic_testdata.BUILD"),
+        strip_prefix = "ovic",
+    )
+
+    tf_http_archive(
+        name = "build_bazel_rules_android",
+        sha256 = "cd06d15dd8bb59926e4d65f9003bfc20f9da4b2519985c27e190cddc8b7a7806",
+        urls = [
+            "https://mirror.bazel.build/github.com/bazelbuild/rules_android/archive/v0.1.1.zip",
+            "https://github.com/bazelbuild/rules_android/archive/v0.1.1.zip",
+        ],
+        strip_prefix = "rules_android-0.1.1",
+    )
+
+    ##############################################################################
+    # BIND DEFINITIONS
+    #
+    # Please do not add bind() definitions unless we have no other choice.
+    # If that ends up being the case, please leave a comment explaining
+    # why we can't depend on the canonical build target.
+
+    # gRPC wants a cares dependency but its contents is not actually
+    # important since we have set GRPC_ARES=0 in tools/bazel.rc
+    native.bind(
+        name = "cares",
+        actual = "@grpc//third_party/nanopb:nanopb",
+    )
+
+    # Needed by Protobuf
+    native.bind(
+        name = "grpc_cpp_plugin",
+        actual = "@grpc//:grpc_cpp_plugin",
+    )
+    native.bind(
+        name = "grpc_python_plugin",
+        actual = "@grpc//:grpc_python_plugin",
+    )
+
+    native.bind(
+        name = "grpc_lib",
+        actual = "@grpc//:grpc++",
+    )
+
+    native.bind(
+        name = "grpc_lib_unsecure",
+        actual = "@grpc//:grpc++_unsecure",
+    )
+
+    # Needed by gRPC
+    native.bind(
+        name = "libssl",
+        actual = "@boringssl//:ssl",
+    )
+
+    # Needed by gRPC
+    native.bind(
+        name = "nanopb",
+        actual = "@grpc//third_party/nanopb:nanopb",
+    )
+
+    # Needed by gRPC
+    native.bind(
+        name = "protobuf",
+        actual = "@protobuf_archive//:protobuf",
+    )
+
+    # gRPC expects //external:protobuf_clib and //external:protobuf_compiler
+    # to point to Protobuf's compiler library.
+    native.bind(
+        name = "protobuf_clib",
+        actual = "@protobuf_archive//:protoc_lib",
+    )
+
+    # Needed by gRPC
+    native.bind(
+        name = "protobuf_headers",
+        actual = "@protobuf_archive//:protobuf_headers",
+    )
+
+    # Needed by Protobuf
+    native.bind(
+        name = "python_headers",
+        actual = clean_dep("//third_party/python_runtime:headers"),
+    )
+
+    # Needed by Protobuf
+    native.bind(
+        name = "six",
+        actual = "@six_archive//:six",
+    )
+
+    # Needed by gRPC
+    native.bind(
+        name = "zlib",
+        actual = "@zlib_archive//:zlib",
+    )
diff --git a/third_party/clang_toolchain/cc_configure_clang.bzl b/third_party/clang_toolchain/cc_configure_clang.bzl
index 1181110ea9..0778c43c53 100644
--- a/third_party/clang_toolchain/cc_configure_clang.bzl
+++ b/third_party/clang_toolchain/cc_configure_clang.bzl
@@ -7,16 +7,16 @@ _TF_DOWNLOAD_CLANG = "TF_DOWNLOAD_CLANG"
 _TF_NEED_CUDA = "TF_NEED_CUDA"
 
 def _cc_clang_autoconf(repo_ctx):
-  if repo_ctx.os.environ.get(_TF_DOWNLOAD_CLANG) != "1":
-    return
-  if repo_ctx.os.environ.get(_TF_NEED_CUDA) == "1":
-    # Clang is handled separately for CUDA configs.
-    # See cuda_configure.bzl for more details.
-    return
+    if repo_ctx.os.environ.get(_TF_DOWNLOAD_CLANG) != "1":
+        return
+    if repo_ctx.os.environ.get(_TF_NEED_CUDA) == "1":
+        # Clang is handled separately for CUDA configs.
+        # See cuda_configure.bzl for more details.
+        return
 
-  download_clang(repo_ctx, out_folder='extra_tools')
-  overriden_tools = {'gcc': 'extra_tools/bin/clang'}
-  cc_autoconf_impl(repo_ctx, overriden_tools)
+    download_clang(repo_ctx, out_folder = "extra_tools")
+    overriden_tools = {"gcc": "extra_tools/bin/clang"}
+    cc_autoconf_impl(repo_ctx, overriden_tools)
 
 cc_download_clang_toolchain = repository_rule(
     environ = [
diff --git a/third_party/clang_toolchain/download_clang.bzl b/third_party/clang_toolchain/download_clang.bzl
index a014a806a6..5ef47cdd0d 100644
--- a/third_party/clang_toolchain/download_clang.bzl
+++ b/third_party/clang_toolchain/download_clang.bzl
@@ -1,54 +1,60 @@
 """ Helpers to download a recent clang release."""
 
 def _get_platform_folder(os_name):
-  os_name = os_name.lower()
-  if os_name.startswith('windows'):
-    return 'Win'
-  if os_name.startswith('mac os'):
-    return 'Mac'
-  if not os_name.startswith('linux'):
-    fail('Unknown platform')
-  return 'Linux_x64'
-
-def _download_chromium_clang(repo_ctx, platform_folder, package_version, sha256,
-                             out_folder):
-  cds_url = 'https://commondatastorage.googleapis.com/chromium-browser-clang'
-  cds_file = 'clang-%s.tgz' % package_version
-  cds_full_url = '{0}/{1}/{2}'.format(cds_url, platform_folder, cds_file)
-  repo_ctx.download_and_extract(cds_full_url, output=out_folder, sha256=sha256)
+    os_name = os_name.lower()
+    if os_name.startswith("windows"):
+        return "Win"
+    if os_name.startswith("mac os"):
+        return "Mac"
+    if not os_name.startswith("linux"):
+        fail("Unknown platform")
+    return "Linux_x64"
+
+def _download_chromium_clang(
+        repo_ctx,
+        platform_folder,
+        package_version,
+        sha256,
+        out_folder):
+    cds_url = "https://commondatastorage.googleapis.com/chromium-browser-clang"
+    cds_file = "clang-%s.tgz" % package_version
+    cds_full_url = "{0}/{1}/{2}".format(cds_url, platform_folder, cds_file)
+    repo_ctx.download_and_extract(cds_full_url, output = out_folder, sha256 = sha256)
 
 def download_clang(repo_ctx, out_folder):
-  """ Download a fresh clang release and put it into out_folder.
-
-  Clang itself will be located in 'out_folder/bin/clang'.
-  We currently download one of the latest releases of clang by the
-  Chromium project (see
-  https://chromium.googlesource.com/chromium/src/+/master/docs/clang.md).
-
-  Args:
-    repo_ctx: An instance of repository_context object.
-    out_folder: A folder to extract the compiler into.
-  """
-  # TODO(ibiryukov): we currently download and extract some extra tools in the
-  # clang release (e.g., sanitizers). We should probably remove the ones
-  # we don't need and document the ones we want provide in addition to clang.
-
-  # Latest CLANG_REVISION and CLANG_SUB_REVISION of the Chromiums's release
-  # can be found in https://chromium.googlesource.com/chromium/src/tools/clang/+/master/scripts/update.py
-  CLANG_REVISION = '335091'
-  CLANG_SUB_REVISION = 1
-
-  package_version = '%s-%s' % (CLANG_REVISION, CLANG_SUB_REVISION)
-
-  checksums = {
-      'Linux_x64':
-          '17002b75293fccfdd175eacdc9ee47d97b58d7e98fef343384fbbef1b68ce99f',
-      'Mac':
-          '9351e46d28315daaa06a1eb55bd0370ed4aaeb693a2a3e82e48d2737d7723468',
-      'Win':
-          'e78a1e469224d6f6751b4df4374bf58893ac03900ec924e4c8264888ba4aeb1e',
-  }
-
-  platform_folder = _get_platform_folder(repo_ctx.os.name)
-  _download_chromium_clang(repo_ctx, platform_folder, package_version,
-                           checksums[platform_folder], out_folder)
+    """ Download a fresh clang release and put it into out_folder.
+
+    Clang itself will be located in 'out_folder/bin/clang'.
+    We currently download one of the latest releases of clang by the
+    Chromium project (see
+    https://chromium.googlesource.com/chromium/src/+/master/docs/clang.md).
+
+    Args:
+      repo_ctx: An instance of repository_context object.
+      out_folder: A folder to extract the compiler into.
+    """
+    # TODO(ibiryukov): we currently download and extract some extra tools in the
+    # clang release (e.g., sanitizers). We should probably remove the ones
+    # we don't need and document the ones we want provide in addition to clang.
+
+    # Latest CLANG_REVISION and CLANG_SUB_REVISION of the Chromiums's release
+    # can be found in https://chromium.googlesource.com/chromium/src/tools/clang/+/master/scripts/update.py
+    CLANG_REVISION = "338452"
+    CLANG_SUB_REVISION = 1
+
+    package_version = "%s-%s" % (CLANG_REVISION, CLANG_SUB_REVISION)
+
+    checksums = {
+        "Linux_x64": "213ba23a0a9855ede5041f66661caa9c5c59a573ec60b82a31839f9a97f397bf",
+        "Mac": "4267774201f8cb50c25e081375e87038d58db80064a20a0d9d7fe57ea4357ece",
+        "Win": "a8a5d5b25443c099e2c20d1a0cdce2f1d17e2dba84de66a6dc6a239ce3e78c34",
+    }
+
+    platform_folder = _get_platform_folder(repo_ctx.os.name)
+    _download_chromium_clang(
+        repo_ctx,
+        platform_folder,
+        package_version,
+        checksums[platform_folder],
+        out_folder,
+    )
diff --git a/third_party/codegen.BUILD b/third_party/codegen.BUILD
new file mode 100644
index 0000000000..df436c8163
--- /dev/null
+++ b/third_party/codegen.BUILD
@@ -0,0 +1,16 @@
+# -*- mode: python; -*-
+#
+# Description:
+#   Extension to ast that allow ast -> python code generation.
+
+package(default_visibility = ["//visibility:public"])
+
+licenses(["notice"])  # New BSD
+
+exports_files(["LICENSE"])
+
+py_library(
+    name = "com_github_andreif_codegen",
+    srcs = glob(["codegen.py"]),
+    srcs_version = "PY2AND3",
+)
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/Core b/third_party/eigen3/unsupported/Eigen/CXX11/Core
deleted file mode 100644
index 1b3690716c..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/Core
+++ /dev/null
@@ -1,46 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2013 Christian Seiler <christian@iwakd.de>
-// Copyright (C) 2014 Benoit Steiner <benoit.steiner.goog@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-
-#ifndef EIGEN_CXX11_CORE_MODULE
-#define EIGEN_CXX11_CORE_MODULE
-
-#include <Eigen/Core>
-
-#include <Eigen/src/Core/util/DisableStupidWarnings.h>
-
-/** \defgroup CXX11_Core_Module C++11 Core Module
-  *
-  * This module provides common core features for all modules that
-  * explicitly depend on C++11. Currently, this is only the Tensor
-  * module. Note that at this stage, you should not need to include
-  * this module directly.
-  *
-  * It also provides a limited fallback for compilers that don't support
-  * CXX11 yet, such as nvcc.
-  *
-  * \code
-  * #include <Eigen/CXX11/Core>
-  * \endcode
-  */
-
-// Only a subset of cxx11 is allowed at Google, so we default to emulate the
-// cxx11 functionality that we need.
-#include "src/Core/util/FixedSizeVector.h"
-#if 1
-#include <vector>
-#include "src/Core/util/EmulateCXX11Meta.h"
-#else
-#include "src/Core/util/CXX11Workarounds.h"
-#include "src/Core/util/CXX11Meta.h"
-#endif
-#include <Eigen/src/Core/util/ReenableStupidWarnings.h>
-
-#endif // EIGEN_CXX11_CORE_MODULE
-
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/NeuralNetworks b/third_party/eigen3/unsupported/Eigen/CXX11/NeuralNetworks
deleted file mode 100644
index 7741b68d8a..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/NeuralNetworks
+++ /dev/null
@@ -1,35 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2014 Benoit Steiner <benoit.steiner.goog@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_MODULE
-#define EIGEN_CXX11_NEURAL_NETWORKS_MODULE
-
-#include "unsupported/Eigen/CXX11/Tensor"
-
-/** \defgroup CXX11_NeuralNetworks_Module Neural Networks Module
-  *
-  * This module provides an efficient implementation of the common primitives
-  * used by neural networks.
-  * The primitives are  built on top of the tensor library.
-  *
-  * \code
-  * #include <Eigen/CXX11/NeuralNetworks>
-  * \endcode
-  */
-
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/Activations.h"
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/Attention.h"
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/Pooling.h"
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/SoftMax.h"
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardCuboidConvolutions.h"
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/CuboidConvolution.h"
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardSpatialConvolutions.h"
-#include "unsupported/Eigen/CXX11/src/NeuralNetworks/SpatialConvolutions.h"
-
-#endif  // EIGEN_CXX11_NEURAL_NETWORKS_MODULE
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/FixedPointTypes.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/FixedPointTypes.h
index 6b625abc3e..5ab3664918 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/FixedPointTypes.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/FixedPointTypes.h
@@ -7,8 +7,8 @@
 // Public License v. 2.0. If a copy of the MPL was not distributed
 // with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
 
-#ifndef EIGEN_CXX11_FIXED_POINT_TYPES_H
-#define EIGEN_CXX11_FIXED_POINT_TYPES_H
+#ifndef CXX11_SRC_FIXEDPOINT_FIXEDPOINTTYPES_H_
+#define CXX11_SRC_FIXEDPOINT_FIXEDPOINTTYPES_H_
 
 #include <cmath>
 #include <iostream>
@@ -339,4 +339,4 @@ EIGEN_STRONG_INLINE std::ostream& operator<<(std::ostream& os, QInt32 a) {
 
 }  // namespace Eigen
 
-#endif  // EIGEN_CXX11_FIXED_POINT_TYPES_H
+#endif  // CXX11_SRC_FIXEDPOINT_FIXEDPOINTTYPES_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProduct.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProduct.h
index 4d0dca07df..e6f4080ae1 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProduct.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProduct.h
@@ -7,9 +7,8 @@
 // Public License v. 2.0. If a copy of the MPL was not distributed
 // with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
 
-#ifndef EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_H
-#define EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_H
-
+#ifndef CXX11_SRC_FIXEDPOINT_MATMATPRODUCT_H_
+#define CXX11_SRC_FIXEDPOINT_MATMATPRODUCT_H_
 
 namespace Eigen {
 namespace internal {
@@ -24,6 +23,14 @@ template<> struct scalar_product_traits<QInt8, QInt8>
   typedef QInt32 ReturnType;
 };
 
+// Accumulate the product of 2 QInt16 inputs on 32 bits to prevent
+// overflows
+template <>
+struct scalar_product_traits<QInt16, QInt16> {
+  enum { Defined = 1 };
+  typedef QInt32 ReturnType;
+};
+
 // Accumulate the product of QInt8 inputs with QUint8 inputs on 32 bits
 // to prevent overflows
 template<> struct scalar_product_traits<QInt8, QUInt8>
@@ -247,9 +254,76 @@ void gebp_kernel<QUInt8, QInt8, Index, DataMapper, mr, nr, ConjugateLhs, Conjuga
 }
 #endif
 
-}  // namespace internal
-}  // namespace Eigen
+#ifndef EIGEN_USE_OPTIMIZED_INT16_INT16_MAT_MAT_PRODUCT
+
+template <bool _ConjLhs, bool _ConjRhs>
+class gebp_traits<QInt16, QInt16, _ConjLhs, _ConjRhs> {
+ public:
+  typedef QInt16 LhsScalar;
+  typedef QInt16 RhsScalar;
+  typedef QInt32 ResScalar;
+
+  enum {
+    // register block size along the M and N directions
+    // One for the current implementation
+    nr = 1,
+    mr = 1,
+    // Progress made at each iteration of the product loop
+    // also 1 for the current implementation
+    LhsProgress = 1,
+    RhsProgress = 1
+  };
+};
+
+// The signed 16bit Mat-Mat product itself.
+template <typename Index, typename DataMapper, int mr, int nr,
+          bool ConjugateLhs, bool ConjugateRhs>
+struct gebp_kernel<QInt16, QInt16, Index, DataMapper, mr, nr, ConjugateLhs,
+                   ConjugateRhs> {
+  EIGEN_DONT_INLINE
+  void operator()(const DataMapper& res, const QInt16* blockA,
+                  const QInt16* blockB, Index rows, Index depth, Index cols,
+                  QInt32 alpha, Index strideA = -1, Index strideB = -1,
+                  Index offsetA = 0, Index offsetB = 0);
+};
+
+template <typename Index, typename DataMapper, int mr, int nr,
+          bool ConjugateLhs, bool ConjugateRhs>
+EIGEN_DONT_INLINE void gebp_kernel<QInt16, QInt16, Index, DataMapper, mr, nr,
+                                   ConjugateLhs, ConjugateRhs>::
+operator()(const DataMapper& res, const QInt16* blockA, const QInt16* blockB,
+           Index rows, Index depth, Index cols, QInt32 alpha, Index strideA,
+           Index strideB, Index offsetA, Index offsetB) {
+  EIGEN_STATIC_ASSERT(!ConjugateLhs, YOU_MADE_A_PROGRAMMING_MISTAKE);
+  EIGEN_STATIC_ASSERT(!ConjugateRhs, YOU_MADE_A_PROGRAMMING_MISTAKE);
 
+  eigen_assert(alpha.value == 1);
+  eigen_assert(strideA == -1);
+  eigen_assert(strideB == -1);
+  eigen_assert(offsetA == 0);
+  eigen_assert(offsetB == 0);
+
+  eigen_assert(rows > 0);
+  eigen_assert(cols > 0);
+  eigen_assert(depth > 0);
+  eigen_assert(blockA);
+  eigen_assert(blockB);
+
+  for (Index j = 0; j < cols; ++j) {
+    Index startB = j * depth;
 
+    for (Index i = 0; i < rows; ++i) {
+      Index startA = i * depth;
+
+      for (Index k = 0; k < depth; ++k) {
+        res(i, j) += blockA[startA + k] * blockB[startB + k];
+      }
+    }
+  }
+}
+#endif
+
+}  // namespace internal
+}  // namespace Eigen
 
-#endif  // EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_H
+#endif  // CXX11_SRC_FIXEDPOINT_MATMATPRODUCT_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductAVX2.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductAVX2.h
index 6b4b0edcfb..66532fb600 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductAVX2.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductAVX2.h
@@ -3,18 +3,494 @@
 //
 // Copyright (C) 2015 Benoit Steiner <benoit.steiner.goog@gmail.com>
 // Copyright (C) 2015 Matthew Sarett <msarett@google.com>
+// Copyright (C) 2016 Nishant Patil <nishantpatil@google.com>
 //
 // This Source Code Form is subject to the terms of the Mozilla
 // Public License v. 2.0. If a copy of the MPL was not distributed
 // with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
 
-#ifndef EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_AVX2_H
-#define EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_AVX2_H
+#ifndef CXX11_SRC_FIXEDPOINT_MATMATPRODUCTAVX2_H_
+#define CXX11_SRC_FIXEDPOINT_MATMATPRODUCTAVX2_H_
 
 namespace Eigen {
 namespace internal {
 
 // AVX2 optimized implementation of Mat-Mat product.
+// LHS is encoded using signed 16-bit integers.
+// RHS is encoded using signed 16-bit integers.
+#ifdef EIGEN_USE_OPTIMIZED_INT16_INT16_MAT_MAT_PRODUCT
+
+// Define quantized traits
+template <bool _ConjLhs, bool _ConjRhs>
+class gebp_traits<QInt16, QInt16, _ConjLhs, _ConjRhs> {
+ public:
+  typedef QInt16 LhsScalar;
+  typedef QInt16 RhsScalar;
+  typedef QInt32 ResScalar;
+
+  enum {
+    // Define register blocking scheme.
+    nr = 16,
+    mr = 16,
+    kr = 4,
+    // Ignore progress tracking per loop iteration.
+    LhsProgress = -1,
+    RhsProgress = -1
+  };
+};
+
+// Specialized blocking for quantized implementations.
+// Used by TensorContractionThreadPool, inputs must have dimensions that are
+// multiples of 32.
+template <typename Index, int ShardingType>
+class TensorContractionBlocking<QInt16, QInt16, Index, ShardingType> {
+ public:
+  TensorContractionBlocking(Index k, Index m, Index n, Index num_threads = 1)
+      : kc_(((k + 15) / 16) * 16),
+        mc_(((m + 15) / 16) * 16),
+        nc_(((n + 15) / 16) * 16) {
+    eigen_assert(mc_ % 16 == 0);
+    eigen_assert(kc_ % 16 == 0);
+    if (!k || !m || !n) {
+      return;
+    }
+
+    if (ShardingType == ShardByCol) {
+      eigen_assert(nc_ % 16 == 0);
+      nc_ = (((nc_ / num_threads) + 15) / 16) * 16;
+    } else {
+      eigen_assert(nc_ % 16 == 0);
+      mc_ = (((mc_ / num_threads) + 15) / 16) * 16;
+    }
+  }
+
+  EIGEN_ALWAYS_INLINE Index kc() const { return kc_; }
+  EIGEN_ALWAYS_INLINE Index mc() const { return mc_; }
+  EIGEN_ALWAYS_INLINE Index nc() const { return nc_; }
+
+ private:
+  Index kc_;
+  Index mc_;
+  Index nc_;
+};
+
+// Specialized blocking for quantized implementations.
+// Used by TensorContraction and GeneralMatrixMatrix, inputs are padded to
+// multiples of 32.
+template <int MaxRows, int MaxCols, int MaxDepth, int KcFactor>
+class gemm_blocking_space<ColMajor, QInt16, QInt16, MaxRows, MaxCols, MaxDepth,
+                          KcFactor, false>
+    : public level3_blocking<QInt16, QInt16> {
+  DenseIndex m_sizeA;
+  DenseIndex m_sizeB;
+
+ public:
+  gemm_blocking_space(DenseIndex rows, DenseIndex cols, DenseIndex depth,
+                      DenseIndex /*num_threads*/, bool /*l3_blocking*/) {
+    this->m_mc = ((rows + 15) / 16) * 16;
+    this->m_nc = ((cols + 15) / 16) * 16;
+    this->m_kc = ((depth + 15) / 16) * 16;
+    m_sizeA = this->m_mc * this->m_kc;
+    m_sizeB = this->m_kc * this->m_nc;
+  }
+  void allocateA() {
+    if (this->m_blockA == 0) this->m_blockA = aligned_new<QInt16>(m_sizeA);
+  }
+  void allocateB() {
+    if (this->m_blockB == 0) this->m_blockB = aligned_new<QInt16>(m_sizeB);
+  }
+  void allocateAll() {
+    allocateA();
+    allocateB();
+  }
+  ~gemm_blocking_space() {
+    aligned_delete(this->m_blockA, m_sizeA);
+    aligned_delete(this->m_blockB, m_sizeB);
+  }
+};
+
+// Below are the fully optimized versions that are correct only for sizes that
+// are multiple of 16.  It is about a 10% performance benefit to keep these
+// implementations separate.
+
+// Arrange a block of the left input matrix in contiguous memory.
+//
+// Given column major input (A0 beside A1 in memory):
+// A0 B0 C0 D0 E0 F0 G0 H0 ...
+// A1 B1 C1 D1 E1 F1 G1 H1 ...
+// A2 B2 C2 D2 E2 F2 G2 H2 ...
+// A3 B3 C3 D3 E3 F3 G3 H3 ...
+// A4 B4 C4 D4 E4 F4 G4 H4 ...
+// A5 B5 C5 D5 E5 F5 G5 H5 ...
+// A6 B6 C6 D6 E6 F6 G6 H6 ...
+// A7 B7 C7 D7 E7 F7 G7 H7 ...
+// A8 ...
+// ...
+//
+// Packing with m = 8 yields row major output (A0 beside B0 in memory):
+// A0 B0
+// A1 B1
+// A2 B2
+// A3 B3
+// A4 B4
+// A5 B5
+// A6 B6
+// A7 B7
+// ...
+//
+// The purpose is to collect m rows of size k.  Two elements of the same
+// row are arranged contiguously because madd performs an adjacent addition
+// in the kernel.
+
+template <typename Index, typename DataMapper, int Pack1, int Pack2,
+          bool Conjugate, bool PanelMode>
+struct gemm_pack_lhs<QInt16, Index, DataMapper, Pack1, Pack2, ColMajor,
+                     Conjugate, PanelMode> {
+  EIGEN_DONT_INLINE void operator()(QInt16* blockA, const DataMapper& lhs,
+                                    Index depth, Index rows, Index stride = 0,
+                                    Index offset = 0);
+};
+
+template <typename Index, typename DataMapper, int Pack1, int Pack2,
+          bool Conjugate, bool PanelMode>
+EIGEN_DONT_INLINE void gemm_pack_lhs<QInt16, Index, DataMapper, Pack1, Pack2,
+                                     ColMajor, Conjugate, PanelMode>::
+operator()(QInt16* blockA, const DataMapper& lhs, Index depth, Index rows,
+           Index stride, Index offset) {
+  eigen_assert(stride == 0);
+  eigen_assert(offset == 0);
+
+  // Use alternate function for weird sizes
+  if (rows % 16 != 0 || depth % 16 != 0) {
+    assert(false &&
+           "only depths and rows that are a multiple of 16 are currently "
+           "supported");
+    // gemm_pack_lhs_any<QInt16, Index, DataMapper, Pack1, Pack2, ColMajor,
+    // Conjugate, PanelMode> lhs_pack;
+    // return lhs_pack(blockA, lhs, depth, rows, stride, offset);
+  }
+
+  // Get vector pointer
+  __m256i* blockA_256 = reinterpret_cast<__m256i*>(blockA);
+
+  // Pack rows in sets of 16
+  for (Index m = 0; m < rows; m += 16) {
+    // Pack depth in sets of 4
+    for (Index k = 0; k < depth; k += 4) {
+      // Load vectors
+      __m256i L_A = lhs.loadPacket(m, k);
+      __m256i L_B = lhs.loadPacket(m, k + 1);
+      __m256i L_C = lhs.loadPacket(m, k + 2);
+      __m256i L_D = lhs.loadPacket(m, k + 3);
+
+      // Rearrange the inputs as required by the kernel
+      __m256i L_AB0_AB7 = _mm256_unpacklo_epi16(L_A, L_B);
+      __m256i L_AB8_AB15 = _mm256_unpackhi_epi16(L_A, L_B);
+      __m256i L_CD0_CD7 = _mm256_unpacklo_epi16(L_C, L_D);
+      __m256i L_CD8_CD15 = _mm256_unpackhi_epi16(L_C, L_D);
+
+      __m256i L_AD0 = _mm256_permute2x128_si256(L_AB0_AB7, L_AB8_AB15, 0x20);
+      _mm256_store_si256(blockA_256++, L_AD0);
+      __m256i L_AD8 = _mm256_permute2x128_si256(L_CD0_CD7, L_CD8_CD15, 0x20);
+      _mm256_store_si256(blockA_256++, L_AD8);
+      __m256i L_AD16 = _mm256_permute2x128_si256(L_AB0_AB7, L_AB8_AB15, 0x31);
+      _mm256_store_si256(blockA_256++, L_AD16);
+      __m256i L_AD24 = _mm256_permute2x128_si256(L_CD0_CD7, L_CD8_CD15, 0x31);
+      _mm256_store_si256(blockA_256++, L_AD24);
+    }
+  }
+}
+
+// Arrange a block of the right input matrix in contiguous memory.
+//
+// Given column major input (A0 beside A1 in memory):
+// A0 B0 C0 D0 E0 F0 G0 H0 ...
+// A1 B1 C1 D1 E1 F1 G1 H1 ...
+// A2 B2 C2 D2 E2 F2 G2 H2 ...
+// A3 B3 C3 D3 E3 F3 G3 H3 ...
+// A4 B4 C4 D4 E4 F4 G4 H4 ...
+// A5 B5 C5 D5 E5 F5 G5 H5 ...
+// A6 B6 C6 D6 E6 F6 G6 H6 ...
+// A7 B7 C7 D7 E7 F7 G7 H7 ...
+// A8 ...
+// ...
+// Packing yields row major output (A0 beside A1 in memory):
+// A0 A1 A2 A3 A4 A5 A6 A7
+// B0 B1 B2 B3 B4 B5 B6 B7
+// ...
+//
+// At least two elements of the same col are arranged contiguously because
+// maddubs and madd both perform an adjacent addition in the kernel.  We can
+// save work by leaving 4 adjacent elements because kr = 4.
+// The purpose is to collect n cols of size k.  Two elements of the same
+// col are arranged contiguously because madd performs an adjacent addition
+// in the kernel.
+template <typename Index, typename DataMapper, int nr, bool Conjugate,
+          bool PanelMode>
+struct gemm_pack_rhs<QInt16, Index, DataMapper, nr, ColMajor, Conjugate,
+                     PanelMode> {
+  EIGEN_DONT_INLINE void operator()(QInt16* blockB, const DataMapper& rhs,
+                                    Index depth, Index cols, Index stride = 0,
+                                    Index offset = 0);
+};
+
+template <typename Index, typename DataMapper, int nr, bool Conjugate,
+          bool PanelMode>
+EIGEN_DONT_INLINE void
+gemm_pack_rhs<QInt16, Index, DataMapper, nr, ColMajor, Conjugate, PanelMode>::
+operator()(QInt16* blockB, const DataMapper& rhs, Index depth, Index cols,
+           Index stride, Index offset) {
+  eigen_assert(stride == 0);
+  eigen_assert(offset == 0);
+
+  // Use alternate function for weird sizes
+  if (cols % 16 != 0 || depth % 16 != 0) {
+    assert(false &&
+           "only depths and cols that are a multiple of 16 are currently "
+           "supported");
+    // gemm_pack_rhs_any<QInt16, Index, DataMapper, nr, ColMajor, Conjugate,
+    // PanelMode> rhs_pack;
+    // return rhs_pack(blockB, rhs, depth, cols, stride, offset);
+  }
+
+  // Get vector pointer
+  __m256i* blockB_256 = reinterpret_cast<__m256i*>(blockB);
+
+  // Perform a step of the packing for 4 columns
+  __m256i R_AB_L, R_AB_H, R_CD_L, R_CD_H, R_AD_0, R_AD_4, R_AD_8, R_AD_12;
+#define PACK_STEP                                            \
+  R_AB_L = _mm256_unpacklo_epi64(R_A, R_B);                  \
+  R_CD_L = _mm256_unpacklo_epi64(R_C, R_D);                  \
+  R_AB_H = _mm256_unpackhi_epi64(R_A, R_B);                  \
+  R_CD_H = _mm256_unpackhi_epi64(R_C, R_D);                  \
+  R_AD_0 = _mm256_permute2x128_si256(R_AB_L, R_CD_L, 0x20);  \
+  R_AD_8 = _mm256_permute2x128_si256(R_AB_L, R_CD_L, 0x31);  \
+  R_AD_4 = _mm256_permute2x128_si256(R_AB_H, R_CD_H, 0x20);  \
+  R_AD_12 = _mm256_permute2x128_si256(R_AB_H, R_CD_H, 0x31); \
+  _mm256_store_si256(blockB_256, R_AD_0);                    \
+  _mm256_store_si256(blockB_256 + 4, R_AD_4);                \
+  _mm256_store_si256(blockB_256 + 8, R_AD_8);                \
+  _mm256_store_si256(blockB_256 + 12, R_AD_12);              \
+  blockB_256++;
+
+  // Pack cols in sets of 16
+  for (Index n = 0; n < cols; n += 16) {
+    // Pack depth in sets of 16
+    for (Index k = 0; k < depth; k += 16) {
+      __m256i R_A = rhs.loadPacket(k, n);
+      __m256i R_B = rhs.loadPacket(k, n + 1);
+      __m256i R_C = rhs.loadPacket(k, n + 2);
+      __m256i R_D = rhs.loadPacket(k, n + 3);
+      PACK_STEP;
+
+      R_A = rhs.loadPacket(k, n + 4);
+      R_B = rhs.loadPacket(k, n + 5);
+      R_C = rhs.loadPacket(k, n + 6);
+      R_D = rhs.loadPacket(k, n + 7);
+      PACK_STEP;
+
+      R_A = rhs.loadPacket(k, n + 8);
+      R_B = rhs.loadPacket(k, n + 9);
+      R_C = rhs.loadPacket(k, n + 10);
+      R_D = rhs.loadPacket(k, n + 11);
+      PACK_STEP;
+
+      R_A = rhs.loadPacket(k, n + 12);
+      R_B = rhs.loadPacket(k, n + 13);
+      R_C = rhs.loadPacket(k, n + 14);
+      R_D = rhs.loadPacket(k, n + 15);
+      PACK_STEP;
+
+      blockB_256 += 12;
+    }
+  }
+#undef PACK_STEP
+}
+
+// Perform the actual multiplication on packed inputs
+template <typename Index, typename DataMapper, int mr, int nr,
+          bool ConjugateLhs, bool ConjugateRhs>
+struct gebp_kernel<QInt16, QInt16, Index, DataMapper, mr, nr, ConjugateLhs,
+                   ConjugateRhs> {
+  typedef typename DataMapper::LinearMapper LinearMapper;
+
+  EIGEN_DONT_INLINE
+  void operator()(const DataMapper& res, const QInt16* blockA,
+                  const QInt16* blockB, Index rows, Index depth, Index cols,
+                  QInt32 alpha, Index strideA = -1, Index strideB = -1,
+                  Index offsetA = 0, Index offsetB = 0);
+};
+
+template <typename Index, typename DataMapper, int mr, int nr,
+          bool ConjugateLhs, bool ConjugateRhs>
+EIGEN_DONT_INLINE void gebp_kernel<QInt16, QInt16, Index, DataMapper, mr, nr,
+                                   ConjugateLhs, ConjugateRhs>::
+operator()(const DataMapper& res, const QInt16* blockA, const QInt16* blockB,
+           Index rows, Index depth, Index cols, QInt32 alpha, Index strideA,
+           Index strideB, Index offsetA, Index offsetB) {
+  EIGEN_STATIC_ASSERT(!ConjugateLhs, YOU_MADE_A_PROGRAMMING_MISTAKE);
+  EIGEN_STATIC_ASSERT(!ConjugateRhs, YOU_MADE_A_PROGRAMMING_MISTAKE);
+  eigen_assert(alpha.value == 1);
+  eigen_assert(strideA == -1);
+  eigen_assert(strideB == -1);
+  eigen_assert(offsetA == 0);
+  eigen_assert(offsetB == 0);
+  eigen_assert(rows > 0);
+  eigen_assert(cols > 0);
+  eigen_assert(depth > 0);
+  eigen_assert(blockA);
+  eigen_assert(blockB);
+
+  // Use alternate function for weird sizes
+  if (rows % 16 != 0 || cols % 16 != 0 || depth % 16 != 0) {
+    assert(false &&
+           "only depths, cols and rows that are a multiple of 16 are currently "
+           "supported");
+    // gebp_kernel_any<QInt16, QInt16, Index, DataMapper, mr, nr, ConjugateLhs,
+    // ConjugateRhs> gebp;
+    // return gebp(res, blockA, blockB, rows, depth, cols, alpha, strideA,
+    // strideB, offsetA, offsetB);
+  }
+
+  // Create result block
+  QInt32* blockO = aligned_new<QInt32>(16 * 16);
+  memset(blockO, 0, 16 * 16 * sizeof(QInt32));
+
+  // Get vectorized pointers
+  __m256i* blockO_256 = reinterpret_cast<__m256i*>(blockO);
+  const __m256i* blockA_256 = reinterpret_cast<const __m256i*>(blockA);
+  const __m256i* blockB_256 = reinterpret_cast<const __m256i*>(blockB);
+
+  // Loop over blocks of 16 columns
+  for (Index n = 0; n < cols; n += 16) {
+    // Reset index into blockA
+    Index indexL = 0;
+    // Loop over blocks of 16 rows
+    for (Index m = 0; m < rows; m += 16) {
+      // Reset index into blockB
+      Index indexR = n / 16 * depth;
+      // Loop over blocks of 4 on depth
+      for (Index k = 0; k < depth; k += 4) {
+        // Load inputs
+        __m256i L_AD0 = blockA_256[indexL++];
+        __m256i L_AD8 = blockA_256[indexL++];
+        __m256i L_EH0 = blockA_256[indexL++];
+        __m256i L_EH8 = blockA_256[indexL++];
+
+        __m256i R_AH0 = blockB_256[indexR++];
+        __m256i R_AH4 = blockB_256[indexR++];
+        __m256i R_AH8 = blockB_256[indexR++];
+        __m256i R_AH12 = blockB_256[indexR++];
+
+        // Declare variables used in COMPUTE_STEP
+        __m256i P_32_A, P_32_B, P_32;
+
+#define COMPUTE_STEP(R_INPUT_A, R_INPUT_B, OFFSET)                         \
+  P_32_A = _mm256_madd_epi16(R_INPUT_A, L_AD0);                            \
+  P_32_B = _mm256_madd_epi16(R_INPUT_B, L_AD8);                            \
+  P_32 = _mm256_add_epi32(P_32_A, P_32_B);                                 \
+  _mm256_store_si256(                                                      \
+      blockO_256 + 2 * OFFSET,                                             \
+      _mm256_add_epi32(_mm256_load_si256(blockO_256 + 2 * OFFSET), P_32)); \
+                                                                           \
+  P_32_A = _mm256_madd_epi16(R_INPUT_A, L_EH0);                            \
+  P_32_B = _mm256_madd_epi16(R_INPUT_B, L_EH8);                            \
+  P_32 = _mm256_add_epi32(P_32_A, P_32_B);                                 \
+  _mm256_store_si256(                                                      \
+      blockO_256 + 2 * OFFSET + 1,                                         \
+      _mm256_add_epi32(_mm256_load_si256(blockO_256 + 2 * OFFSET + 1), P_32));
+
+        // Permute and shuffle to copy a single value across the entire vector
+        // Then compute the multiplication
+        // Replicate lower 128-bits of R_AH0 across both lanes
+        __m256i R_AH0_ = _mm256_permute2x128_si256(R_AH0, R_AH0, 0x00);
+        // Copy first two elements of R_AH0 across entire vector
+        __m256i R_AD0 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        // Copy second two elements of R_AH0 across entire vector
+        __m256i R_EH0 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+
+        COMPUTE_STEP(R_AD0, R_EH0, 0);
+        __m256i R_AD1 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        __m256i R_EH1 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD1, R_EH1, 1);
+
+        // Replicate upper 128-bits of R_AH0 across both lanes
+        R_AH0_ = _mm256_permute2x128_si256(R_AH0, R_AH0, 0x11);
+        __m256i R_AD2 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        __m256i R_EH2 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+        COMPUTE_STEP(R_AD2, R_EH2, 2);
+        __m256i R_AD3 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        __m256i R_EH3 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD3, R_EH3, 3);
+
+        R_AH0_ = _mm256_permute2x128_si256(R_AH4, R_AH4, 0x00);
+        R_AD0 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        R_EH0 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+        COMPUTE_STEP(R_AD0, R_EH0, 4);
+        R_AD1 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        R_EH1 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD1, R_EH1, 5);
+        R_AH0_ = _mm256_permute2x128_si256(R_AH4, R_AH4, 0x11);
+        R_AD2 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        R_EH2 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+        COMPUTE_STEP(R_AD2, R_EH2, 6);
+        R_AD3 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        R_EH3 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD3, R_EH3, 7);
+
+        R_AH0_ = _mm256_permute2x128_si256(R_AH8, R_AH8, 0x00);
+        R_AD0 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        R_EH0 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+        COMPUTE_STEP(R_AD0, R_EH0, 8);
+        R_AD1 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        R_EH1 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD1, R_EH1, 9);
+        R_AH0_ = _mm256_permute2x128_si256(R_AH8, R_AH8, 0x11);
+        R_AD2 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        R_EH2 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+        COMPUTE_STEP(R_AD2, R_EH2, 10);
+        R_AD3 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        R_EH3 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD3, R_EH3, 11);
+
+        R_AH0_ = _mm256_permute2x128_si256(R_AH12, R_AH12, 0x00);
+        R_AD0 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        R_EH0 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+        COMPUTE_STEP(R_AD0, R_EH0, 12);
+        R_AD1 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        R_EH1 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD1, R_EH1, 13);
+        R_AH0_ = _mm256_permute2x128_si256(R_AH12, R_AH12, 0x11);
+        R_AD2 = _mm256_shuffle_epi32(R_AH0_, 0x00);
+        R_EH2 = _mm256_shuffle_epi32(R_AH0_, 0x55);
+        COMPUTE_STEP(R_AD2, R_EH2, 14);
+        R_AD3 = _mm256_shuffle_epi32(R_AH0_, 0xAA);
+        R_EH3 = _mm256_shuffle_epi32(R_AH0_, 0xFF);
+        COMPUTE_STEP(R_AD3, R_EH3, 15);
+
+#undef COMPUTE_STEP
+      }
+
+      // Transfer the results to the result matrix
+      Index i = 0;
+      for (Index j = n; j < n + 16; j++) {
+        LinearMapper r0 = res.getLinearMapper(m, j);
+        LinearMapper r1 = res.getLinearMapper(m + 8, j);
+
+        r0.storePacket(0, _mm256_add_epi32(blockO_256[i++], r0.loadPacket(0)));
+        r1.storePacket(0, _mm256_add_epi32(blockO_256[i++], r1.loadPacket(0)));
+      }
+
+      // Zero the result block so it can be reused
+      memset(blockO, 0, 16 * 16 * sizeof(QInt32));
+    }
+  }
+  aligned_delete(blockO, 16 * 16);
+}
+
+#endif
+
+// AVX2 optimized implementation of Mat-Mat product.
 // LHS is encoded using signed 8-bit integers.
 // RHS is encoded using unsigned 8-bit integers.
 #ifdef EIGEN_USE_OPTIMIZED_INT8_UINT8_MAT_MAT_PRODUCT
@@ -1751,4 +2227,4 @@ void gebp_kernel<QInt8, QUInt8, Index, DataMapper, mr, nr, ConjugateLhs, Conjuga
 }  // namespace internal
 }  // namespace Eigen
 
-#endif  // EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_AVX2_H
+#endif  // CXX11_SRC_FIXEDPOINT_MATMATPRODUCTAVX2_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductNEON.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductNEON.h
index 99894cafb5..9cd3157023 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductNEON.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductNEON.h
@@ -8,9 +8,8 @@
 // Public License v. 2.0. If a copy of the MPL was not distributed
 // with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
 
-#ifndef EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_NEON_H
-#define EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_NEON_H
-
+#ifndef CXX11_SRC_FIXEDPOINT_MATMATPRODUCTNEON_H_
+#define CXX11_SRC_FIXEDPOINT_MATMATPRODUCTNEON_H_
 
 namespace Eigen {
 namespace internal {
@@ -90,6 +89,4 @@ void gebp_kernel<QInt8, QUInt8, Index, DataMapper, mr, nr, ConjugateLhs, Conjuga
 }  // namespace internal
 }  // namespace Eigen
 
-
-
-#endif  // EIGEN_CXX11_FIXED_POINT_MAT_MAT_PRODUCT_NEON_H
+#endif  // CXX11_SRC_FIXEDPOINT_MATMATPRODUCTNEON_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatVecProduct.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatVecProduct.h
index 18b5085b89..ad11d3d44b 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatVecProduct.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatVecProduct.h
@@ -7,9 +7,8 @@
 // Public License v. 2.0. If a copy of the MPL was not distributed
 // with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
 
-#ifndef EIGEN_CXX11_FIXED_POINT_MAT_VEC_PRODUCT_H
-#define EIGEN_CXX11_FIXED_POINT_MAT_VEC_PRODUCT_H
-
+#ifndef CXX11_SRC_FIXEDPOINT_MATVECPRODUCT_H_
+#define CXX11_SRC_FIXEDPOINT_MATVECPRODUCT_H_
 
 namespace Eigen {
 namespace internal {
@@ -47,6 +46,36 @@ EIGEN_DONT_INLINE void general_matrix_vector_product<Index,QInt8,LhsMapper,ColMa
   }
 }
 
+// Mat-Vec product
+// Both lhs and rhs are encoded as 16bit signed integers
+template <typename Index, typename LhsMapper, bool ConjugateLhs,
+          typename RhsMapper, bool ConjugateRhs, int Version>
+struct general_matrix_vector_product<Index, QInt16, LhsMapper, ColMajor,
+                                     ConjugateLhs, QInt16, RhsMapper,
+                                     ConjugateRhs, Version> {
+  EIGEN_DONT_INLINE static void run(Index rows, Index cols,
+                                    const LhsMapper& lhs, const RhsMapper& rhs,
+                                    QInt32* res, Index resIncr, QInt16 alpha);
+};
+
+template <typename Index, typename LhsMapper, bool ConjugateLhs,
+          typename RhsMapper, bool ConjugateRhs, int Version>
+EIGEN_DONT_INLINE void general_matrix_vector_product<
+    Index, QInt16, LhsMapper, ColMajor, ConjugateLhs, QInt16, RhsMapper,
+    ConjugateRhs, Version>::run(Index rows, Index cols, const LhsMapper& lhs,
+                                const RhsMapper& rhs, QInt32* res,
+                                Index resIncr, QInt16 alpha) {
+  eigen_assert(alpha.value == 1);
+  eigen_assert(resIncr == 1);
+  eigen_assert(rows > 0);
+  eigen_assert(cols > 0);
+
+  for (Index i = 0; i < rows; ++i) {
+    for (Index j = 0; j < cols; ++j) {
+      res[i] += lhs(i, j) * rhs(j, 0);
+    }
+  }
+}
 
 // Mat-Vec product
 // The lhs is encoded using 8bit signed integers, the rhs using 8bit unsigned integers
@@ -118,6 +147,4 @@ EIGEN_DONT_INLINE void general_matrix_vector_product<Index,QUInt8,LhsMapper,ColM
 }  // namespace internal
 }  // namespace Eigen
 
-
-
-#endif  // EIGEN_CXX11_FIXED_POINT_MAT_VEC_PRODUCT_H
+#endif  // CXX11_SRC_FIXEDPOINT_MATVECPRODUCT_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h
index cb1636256d..3abd4ee49c 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h
@@ -1,6 +1,5 @@
-#ifndef EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
-#define EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
-
+#ifndef CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
+#define CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
 #ifdef _MSC_VER
 
 #include <immintrin.h>
@@ -29,7 +28,6 @@ inline int _mm256_extract_epi8_N1(const __m256i X)
 	return _mm_extract_epi8(_mm256_extractf128_si256((X), 1 >> 4), 1 % 16);
 }
 
-
 namespace Eigen {
 namespace internal {
 
@@ -502,4 +500,4 @@ struct functor_traits<scalar_product_op<QInt32, double>> {
 }  // end namespace internal
 }  // end namespace Eigen
 
-#endif  // EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
+#endif  // CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h
index 8f9906dbf9..2092ce1d4c 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX512.h
@@ -1,5 +1,5 @@
-#ifndef EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX512_H_
-#define EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX512_H_
+#ifndef CXX11_SRC_FIXEDPOINT_PACKETMATHAVX512_H_
+#define CXX11_SRC_FIXEDPOINT_PACKETMATHAVX512_H_
 
 #include "PacketMathAVX2.h"
 
@@ -542,4 +542,4 @@ EIGEN_STRONG_INLINE QInt8 predux_max<Packet64q8i>(const Packet64q8i& a) {
 }  // end namespace internal
 }  // end namespace Eigen
 
-#endif  // EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX512_H_
+#endif  // CXX11_SRC_FIXEDPOINT_PACKETMATHAVX512_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX2.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX2.h
index 7b4ecc752f..9561d6a338 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX2.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX2.h
@@ -1,5 +1,5 @@
-#ifndef EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX2_H_
-#define EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX2_H_
+#ifndef CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX2_H_
+#define CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX2_H_
 
 namespace Eigen {
 namespace internal {
@@ -52,8 +52,16 @@ template <>
 EIGEN_STRONG_INLINE Packet32q8u
 pcast<Packet8q32i, Packet32q8u>(const Packet8q32i& a, const Packet8q32i& b,
                                 const Packet8q32i& c, const Packet8q32i& d) {
+  // _mm256_packus_epi32 trims negative numbers to 0 but we can't allow numbers
+  // that are too large because _mm256_packus_epi16 expects signed input
+  // (example of problem input: 0x11111111, which saturates to 0xffff = -1,
+  // which saturates to 0).
+  const __m256i a_clip = _mm256_min_epi32(a, _mm256_set1_epi32(255));
+  const __m256i b_clip = _mm256_min_epi32(b, _mm256_set1_epi32(255));
+  const __m256i c_clip = _mm256_min_epi32(c, _mm256_set1_epi32(255));
+  const __m256i d_clip = _mm256_min_epi32(d, _mm256_set1_epi32(255));
   const __m256i converted = _mm256_packus_epi16(
-      _mm256_packs_epi32(a.val, b.val), _mm256_packs_epi32(c.val, d.val));
+      _mm256_packus_epi32(a_clip, b_clip), _mm256_packus_epi32(c_clip, d_clip));
   // Since packus does not cross 128 bit lane boundaries,
   // we have to permute to properly order the final result.
   const __m256i permute_mask = _mm256_set_epi32(7, 3, 6, 2, 5, 1, 4, 0);
@@ -63,4 +71,4 @@ pcast<Packet8q32i, Packet32q8u>(const Packet8q32i& a, const Packet8q32i& b,
 }  // end namespace internal
 }  // end namespace Eigen
 
-#endif  // EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX2_H_
+#endif  // CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX2_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX512.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX512.h
index 26735743d4..a09eac6707 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX512.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/TypeCastingAVX512.h
@@ -1,5 +1,5 @@
-#ifndef EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX512_H_
-#define EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX512_H_
+#ifndef CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX512_H_
+#define CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX512_H_
 
 namespace Eigen {
 namespace internal {
@@ -132,8 +132,15 @@ pcast<Packet16q32i, Packet64q8i>(const Packet16q32i& a,
                                  const Packet16q32i& b,
                                  const Packet16q32i& c,
                                  const Packet16q32i& d) {
-  __m512i converted = _mm512_packs_epi16(_mm512_packs_epi32(a.val, b.val),
-                                         _mm512_packs_epi32(c.val, d.val));
+  __m128i a_part = _mm512_cvtsepi32_epi8(a);
+  __m128i b_part = _mm512_cvtsepi32_epi8(b);
+  __m128i c_part = _mm512_cvtsepi32_epi8(c);
+  __m128i d_part = _mm512_cvtsepi32_epi8(d);
+  __m256i ab =
+      _mm256_inserti128_si256(_mm256_castsi128_si256(a_part), b_part, 1);
+  __m256i cd =
+      _mm256_inserti128_si256(_mm256_castsi128_si256(c_part), d_part, 1);
+  __m512i converted = _mm512_inserti64x4(_mm512_castsi256_si512(ab), cd, 1);
   return converted;
 }
 
@@ -141,7 +148,10 @@ template <>
 EIGEN_STRONG_INLINE Packet32q16i
 pcast<Packet16q32i, Packet32q16i>(const Packet16q32i& a,
                                   const Packet16q32i& b) {
-  __m512i converted = _mm512_packs_epi32(a.val, b.val);
+  __m256i a_part = _mm512_cvtsepi32_epi16(a);
+  __m256i b_part = _mm512_cvtsepi32_epi16(b);
+  __m512i converted =
+      _mm512_inserti64x4(_mm512_castsi256_si512(a_part), b_part, 1);
   return converted;
 }
 
@@ -154,22 +164,45 @@ template <>
 EIGEN_STRONG_INLINE Packet64q8u
 pcast<Packet16q32i, Packet64q8u>(const Packet16q32i& a, const Packet16q32i& b,
                                  const Packet16q32i& c, const Packet16q32i& d) {
-  const __m512i converted = _mm512_packus_epi16(
-      _mm512_packus_epi32(a.val, b.val), _mm512_packus_epi32(c.val, d.val));
+  // Brute-force saturation since there isn't a pack operation for unsigned
+  // numbers that keeps the elements in order.
+  __m128i a_part = _mm512_cvtepi32_epi8(_mm512_max_epi32(
+      _mm512_min_epi32(a, _mm512_set1_epi32(255)), _mm512_setzero_si512()));
+  __m128i b_part = _mm512_cvtepi32_epi8(_mm512_max_epi32(
+      _mm512_min_epi32(b, _mm512_set1_epi32(255)), _mm512_setzero_si512()));
+  __m128i c_part = _mm512_cvtepi32_epi8(_mm512_max_epi32(
+      _mm512_min_epi32(c, _mm512_set1_epi32(255)), _mm512_setzero_si512()));
+  __m128i d_part = _mm512_cvtepi32_epi8(_mm512_max_epi32(
+      _mm512_min_epi32(d, _mm512_set1_epi32(255)), _mm512_setzero_si512()));
+  __m256i ab =
+      _mm256_inserti128_si256(_mm256_castsi128_si256(a_part), b_part, 1);
+  __m256i cd =
+      _mm256_inserti128_si256(_mm256_castsi128_si256(c_part), d_part, 1);
+  __m512i converted = _mm512_inserti64x4(_mm512_castsi256_si512(ab), cd, 1);
   return converted;
 }
 
+#if 0
+// The type Packet32q16u does not exist for AVX-512 yet
 template <>
 struct type_casting_traits<QInt32, QUInt16> {
   enum { VectorizedCast = 1, SrcCoeffRatio = 2, TgtCoeffRatio = 1 };
 };
 
-#if 0
 template <>
 EIGEN_STRONG_INLINE Packet32q16u
 pcast<Packet16q32i, Packet32q16u>(const Packet16q32i& a,
                                   const Packet16q32i& b) {
-  const __m512i converted = _mm512_packus_epi32(a.val, b.val);
+  // Brute-force saturation since there isn't a pack operation for unsigned
+  // numbers that keeps the elements in order.
+  __m256i a_part =
+      _mm512_cvtepi32_epi16(_mm512_max_epi32(
+        _mm512_min_epi32(a, _mm512_set1_epi32(65535)), _mm512_setzero_si512()));
+  __m256i b_part = _mm512_cvtepi32_epi16(
+    _mm512_max_epi32(_mm512_min_epi32(b, _mm512_set1_epi32(65535)),
+                     _mm512_setzero_si512()));
+  __m512i converted =
+      _mm512_inserti64x4(_mm512_castsi256_si512(a_part), b_part, 1);
   return converted;
 }
 #endif
@@ -177,4 +210,4 @@ pcast<Packet16q32i, Packet32q16u>(const Packet16q32i& a,
 }  // end namespace internal
 }  // end namespace Eigen
 
-#endif  // EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX512_H_
+#endif  // CXX11_SRC_FIXEDPOINT_TYPECASTINGAVX512_H_
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Activations.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Activations.h
deleted file mode 100644
index cbcce9e282..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Activations.h
+++ /dev/null
@@ -1,116 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2015 Benoit Steiner <benoit.steiner.goog@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_ACTIVATIONS_H
-#define EIGEN_CXX11_NEURAL_NETWORKS_ACTIVATIONS_H
-
-namespace Eigen {
-
-/** scalar_sigmoid_fast_derivative_op
-  * \ingroup CXX11_NeuralNetworks_Module
-  * \brief Template functor to compute the fast derivative of a sigmoid
-  *
-  * Input should be the backpropagated gradient.
-  *
-  * \sa class CwiseUnaryOp, Cwise::sigmoid_fast_derivative()
-  */
-template <typename T>
-struct scalar_sigmoid_fast_derivative_op {
-  EIGEN_EMPTY_STRUCT_CTOR(scalar_sigmoid_fast_derivative_op)
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE T operator()(const T& y) const {
-    const T one = T(1);
-    return (one - y) * y;
-  }
-
-  template <typename Packet>
-  inline Packet packetOp(const Packet& y) const {
-    const Packet one = internal::pset1<Packet>(1);
-    return internal::pmul(internal::psub(one, y), y);
-  }
-};
-
-namespace internal {
-template <typename T>
-struct functor_traits<scalar_sigmoid_fast_derivative_op<T> > {
-  enum {
-    Cost = NumTraits<T>::AddCost * 2 + NumTraits<T>::MulCost,
-    PacketAccess = packet_traits<T>::HasAdd && packet_traits<T>::HasMul &&
-                   packet_traits<T>::HasNegate
-  };
-};
-}  // namespace internal
-
-/** scalar_tanh_fast_derivative_op
-  * \ingroup CXX11_NeuralNetworks_Module
-  * \brief Template functor to compute the fast derivative of a tanh
-  *
-  * Input should be the backpropagated gradient.
-  *
-  * \sa class CwiseUnaryOp, Cwise::tanh_fast_derivative()
-  */
-template <typename T>
-struct scalar_tanh_fast_derivative_op {
-  EIGEN_EMPTY_STRUCT_CTOR(scalar_tanh_fast_derivative_op)
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE T operator()(const T& y) const {
-    const T one = T(1);
-    return one - (y * y);
-  }
-
-  template <typename Packet>
-  inline Packet packetOp(const Packet& y) const {
-    const Packet one = internal::pset1<Packet>(1);
-    return internal::psub(one, internal::pmul(y, y));
-  }
-};
-
-namespace internal {
-template <typename T>
-struct functor_traits<scalar_tanh_fast_derivative_op<T> > {
-  enum {
-    Cost = NumTraits<T>::AddCost * 2 + NumTraits<T>::MulCost * 1,
-    PacketAccess = packet_traits<T>::HasAdd && packet_traits<T>::HasMul &&
-                   packet_traits<T>::HasNegate
-  };
-};
-}  // namespace internal
-
-/**
- * \ingroup CXX11_NeuralNetworks_Module
- * \brief Template functor to clip the magnitude of the first scalar.
- *
- * \sa class CwiseBinaryOp, MatrixBase::Clip
- */
-template <typename Scalar>
-struct scalar_clip_op {
-  EIGEN_EMPTY_STRUCT_CTOR(scalar_clip_op)
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Scalar
-  operator()(const Scalar& a, const Scalar& b) const {
-    return numext::mini(numext::maxi(a, -b), b);
-  }
-  template <typename Packet>
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Packet
-  packetOp(const Packet& a, const Packet& b) const {
-    return internal::pmin(internal::pmax(a, internal::pnegate(b)), b);
-  }
-};
-
-namespace internal {
-template <typename Scalar>
-struct functor_traits<scalar_clip_op<Scalar> > {
-  enum {
-    Cost = NumTraits<Scalar>::AddCost * 3,
-    PacketAccess = packet_traits<Scalar>::HasMax &&
-                   packet_traits<Scalar>::HasMin &&
-                   packet_traits<Scalar>::HasNegate
-  };
-};
-}  // namespace internal
-
-}  // end namespace Eigen
-
-#endif  // EIGEN_CXX11_NEURAL_NETWORKS_ACTIVATIONS_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Attention.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Attention.h
deleted file mode 100644
index d4bc7a3515..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Attention.h
+++ /dev/null
@@ -1,209 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2015 Benoit Steiner <benoit.steiner.goog@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_ATTENTION_H
-#define EIGEN_CXX11_NEURAL_NETWORKS_ATTENTION_H
-
-namespace Eigen {
-
-/** ExtractGlimpses
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Extract glimpses from an input tensor.
-  *
-  * The input parameter is expected to be a col-major tensor with a rank of 4 (depth, x, y, and batch).
-  * The width and height parameters specify the extension of the returned glimpses.
-  * The offsets parameter specifies the x, y locations of the center of the glimpses relative to the center of the input image. The vector is expected to contain one IndexPair for each image in the batch dimension.
-  * The normalized boolean indicates if incoming coordinates are normalized so that 0.0 and 1.0 correspond to the minimum and maximum of each height and width dimension.
-  * The centered boolean indicates if incoming coordinates are centered relative to the image, in which case -1.0 and 1.0 correspond to minimum and maximum of each dimension while 0.0 corresponds to the center.
-  *
-  * The result can be assigned to a tensor of rank equal to that of the input. The result will be laid out in col-major order (depth, x, y, batch).
-  * The dimensions of the result will be equal to the dimensions of the input except for width and height which will be equal to the requested glimpse size.
-  */
-namespace {
-template <typename Index>
-struct GlimpseExtractionOp {
-  GlimpseExtractionOp(const Index width, const Index height,
-                      const std::vector<IndexPair<float> >& offsets,
-                      const bool normalized,
-                      const bool centered,
-                      const bool uniform_noise) :
-      width_(width), height_(height), offsets_(offsets),
-      normalized_(normalized), centered_(centered), uniform_noise_(uniform_noise) { }
-
-  template <typename Input>
-  DSizes<Index, 4> dimensions(const Input& input) const {
-    typedef typename internal::traits<Input>::Index IndexType;
-    typedef TensorRef<Tensor<typename internal::traits<Input>::Scalar, 4,
-                             internal::traits<Input>::Layout, IndexType> > Ref;
-    Ref in(input);
-
-    DSizes<Index, 4> dims = in.dimensions();
-
-    dims[0] = in.dimension(0);
-    dims[1] = width_;
-    dims[2] = height_;
-    dims[3] = in.dimension(3);
-    return dims;
-  }
-
-  template <typename Input, typename Output, typename Device>
-  EIGEN_DEVICE_FUNC
-  void eval(const Input& input, Output& output, const Device& device) const
-  {
-    typedef typename internal::traits<Input>::Index IndexType;
-    typedef TensorRef<Tensor<typename internal::traits<Input>::Scalar, 4,
-                             internal::traits<Input>::Layout, IndexType> > Ref;
-    Ref in(input);
-
-    const Index num_channels = in.dimension(0);
-    const Index input_width = in.dimension(1);
-    const Index input_height = in.dimension(2);
-    const Index batch_size = in.dimension(3);
-    eigen_assert(input_width > 0);
-    eigen_assert(input_height > 0);
-
-    for (Index i = 0; i < batch_size; ++i) {
-      float x = offsets_[i].first, y = offsets_[i].second;
-
-      // Un-normalize coordinates back to pixel space if normalized.
-      if (normalized_) {
-        x *= input_width;
-        y *= input_height;
-      }
-      // Un-center if coordinates are centered on the image center.
-      if (centered_) {
-        x /= 2.0f;
-        y /= 2.0f;
-        x += input_width / 2.0f;
-        y += input_height / 2.0f;
-      }
-      // Remove half of the glimpse window.
-      x -= width_ / 2.0f;
-      y -= height_ / 2.0f;
-
-      const Index offset_x = (Index) x;
-      const Index offset_y = (Index) y;
-      Index glimpse_width = width_;
-      Index glimpse_height = height_;
-      bool partial_overlap = false;
-      DSizes<Index, 3> slice_offset(0, offset_x, offset_y);
-      DSizes<Index, 3> slice_extent(num_channels, width_, height_);
-      DSizes<Index, 3> base_offset(0, 0, 0);
-
-      if (offset_x < 0) {
-        slice_offset[1] = 0;
-        glimpse_width = (std::max<Index>)(0, width_ + offset_x);
-        slice_extent[1] = glimpse_width;
-        base_offset[1] = width_ - glimpse_width;
-        partial_overlap = true;
-      } else if (offset_x + width_ >= input_width) {
-        glimpse_width = (std::max<Index>)(0, input_width - offset_x);
-        slice_extent[1] = glimpse_width;
-        partial_overlap = true;
-      }
-      if (offset_y < 0) {
-        slice_offset[2] = 0;
-        glimpse_height = (std::max<Index>)(0, height_ + offset_y);
-        slice_extent[2] = glimpse_height;
-        base_offset[2] = height_ - glimpse_height;
-        partial_overlap = true;
-      } else if (offset_y + height_ >= input_height) {
-        glimpse_height = (std::max<Index>)(0, input_height - offset_y);
-        slice_extent[2] = glimpse_height;
-        partial_overlap = true;
-      }
-      slice_extent[1] = std::min<Index>(input_width, slice_extent[1]);
-      slice_extent[2] = std::min<Index>(input_height, slice_extent[2]);
-
-      if (partial_overlap) {
-        if (uniform_noise_) {
-          // Initialize the glimpse with uniform noise.
-          typedef typename internal::remove_const<
-            typename internal::traits<Input>::Scalar>::type Scalar;
-          TensorFixedSize<Scalar, Sizes<> > mini;
-          mini.device(device) = input.template chip<3>(i).minimum();
-          TensorFixedSize<float, Sizes<> > range;
-          range.device(device) =
-              (input.template chip<3>(i).maximum() - mini).template cast<float>();
-
-          DSizes<Index, 3> glimpse_size(num_channels, width_, height_);
-          TensorMap<Tensor<float, 3> > tmp(NULL, glimpse_size);
-          output.template chip<3>(i).device(device) =
-              mini.reshape(Sizes<1,1,1>()).broadcast(glimpse_size) +
-              (tmp.random() * range.reshape(Sizes<1,1,1>()).broadcast(glimpse_size)).template cast<Scalar>();
-        } else {
-          // Initialize the glimpse with white noise: compute the mean and sigma
-          // of each channel, and use them to shape the gaussian.
-          DSizes<Index, 2> glimpse_size(width_, height_);
-          DSizes<Index, 2> input_size(input_width, input_height);
-          typedef typename internal::remove_const<
-            typename internal::traits<Input>::Scalar>::type Scalar;
-
-          for (int j = 0; j < num_channels; ++j) {
-            TensorFixedSize<Scalar, Sizes<> > mean;
-            mean.device(device) = input.template chip<3>(i).template chip<0>(j).template cast<float>().mean();
-            TensorFixedSize<float, Sizes<> > sigma;
-            sigma.device(device) =
-                (input.template chip<3>(i).template chip<0>(j).template cast<float>() - mean.reshape(Sizes<1,1>()).broadcast(input_size)).square().mean().sqrt();
-            TensorFixedSize<Scalar, Sizes<> > mini;
-            mini.device(device) = input.template chip<3>(i).template chip<0>(j).minimum();
-            TensorFixedSize<float, Sizes<> > maxi;
-            maxi.device(device) = input.template chip<3>(i).template chip<0>(j).maximum();
-
-            TensorMap<Tensor<float, 2> > tmp(NULL, glimpse_size);
-            output.template chip<3>(i).template chip<0>(j).device(device) =
-                (mean.reshape(Sizes<1,1>()).broadcast(glimpse_size) +
-                 (tmp.random(internal::NormalRandomGenerator<float>()) * sigma.reshape(Sizes<1,1>()).broadcast(glimpse_size)).template cast<Scalar>()).cwiseMin(maxi.reshape(Sizes<1,1>()).broadcast(glimpse_size)).cwiseMax(mini.reshape(Sizes<1,1>()).broadcast(glimpse_size));
-          }
-        }
-
-        // Copy the part of the glimpse that cover the input image if any.
-        if (glimpse_width == 0 || glimpse_height == 0) {
-          continue;
-        }
-        output.template chip<3>(i).slice(base_offset, slice_extent).device(device) = input.template chip<3>(i).slice(slice_offset, slice_extent);
-      } else {
-        output.template chip<3>(i).device(device) = input.template chip<3>(i).slice(slice_offset, slice_extent);
-      }
-    }
-  }
-
- private:
-  const Index width_;
-  const Index height_;
-  const std::vector<IndexPair<float> > offsets_;
-  const bool normalized_;
-  const bool centered_;
-  const bool uniform_noise_;
-};
-}
-
-
-template <typename Input>
-EIGEN_ALWAYS_INLINE
-static const TensorCustomUnaryOp<const GlimpseExtractionOp<typename internal::traits<Input>::Index>, const Input>
-ExtractGlimpses(const Input& input,
-                const typename internal::traits<Input>::Index width,
-                const typename internal::traits<Input>::Index height,
-                const std::vector<IndexPair<float> >& offsets,
-                const bool normalized = true, const bool centered = true,
-                const bool uniform_noise = true)
-{
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::Layout == ColMajor, YOU_MADE_A_PROGRAMMING_MISTAKE);
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == 4, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  typedef typename internal::traits<Input>::Index Index;
-  const GlimpseExtractionOp<Index> op(width, height, offsets, normalized,
-                                      centered, uniform_noise);
-  return input.customOp(op);
-}
-
-} // end namespace Eigen
-
-#endif // EIGEN_CXX11_NEURAL_NETWORKS_ATTENTION_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardCuboidConvolutions.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardCuboidConvolutions.h
deleted file mode 100644
index 12ce23444c..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardCuboidConvolutions.h
+++ /dev/null
@@ -1,523 +0,0 @@
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_BACKWARD_CUBOID_CONVOLUTIONS_H
-#define EIGEN_CXX11_NEURAL_NETWORKS_BACKWARD_CUBOID_CONVOLUTIONS_H
-
-#include "Patch3d.h"
-
-namespace Eigen {
-
-/** CuboidConvolutionBackwardInput
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Computes the backprop for the input of a 3D convolution.
-  *
-  * The output_backward parameter is expected to be a tensor with a rank of 4 or more (channels, depth, height, width, and optionally others)
-  * The kernel parameter is expected to be a 5D tensor (filters, channels, kernel_depth, kernel_height, kernel_width)
-  * output_backward and kernel have to be in the same layout.
-  *
-  * The dimensions of the result will be filters, depth, height, width (and others if applicable).
-  *
-  * It is possible to swap the order of the depth, width and height dimensions provided that the same order is used in the input, the kernel, and the output.
-  *
-  * All dimension orders above are given for col-major, and should be reversed for row-major.
-  */
-
-template <typename OutputBackward, typename Kernel>
-EIGEN_ALWAYS_INLINE static const typename internal::conditional<
-    internal::traits<OutputBackward>::Layout == ColMajor,
-    TensorReshapingOp<
-        const DSizes<typename internal::traits<OutputBackward>::Index,
-                     internal::traits<OutputBackward>::NumDimensions>,
-        const TensorContractionOp<
-            const array< IndexPair<typename internal::traits<OutputBackward>::Index>, 2>,
-            const TensorReshapingOp<
-                const DSizes< typename internal::traits<OutputBackward>::Index, 3>,
-                const TensorReverseOp<const array<bool, 5>, const Kernel>
-            >,
-            const TensorReshapingOp<
-                const DSizes< typename internal::traits<OutputBackward>::Index, 3>,
-                const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const OutputBackward>
-            >
-        >
-    >,
-    TensorReshapingOp<
-        const DSizes<typename internal::traits<OutputBackward>::Index,
-                     internal::traits<OutputBackward>::NumDimensions>,
-        const TensorContractionOp<
-            const array< IndexPair<typename internal::traits<OutputBackward>::Index>, 2>,
-            const TensorReshapingOp<
-                const DSizes< typename internal::traits<OutputBackward>::Index, 3>,
-                const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const OutputBackward>
-            >,
-            const TensorReshapingOp<
-                const DSizes<typename internal::traits<OutputBackward>::Index, 3>,
-                const TensorReverseOp<const array<bool, 5>, const Kernel>
-            >
-        >
-    >
->::type
-CuboidConvolutionBackwardInput(
-    const Kernel& kernel, const OutputBackward& output_backward,
-    typename internal::traits<OutputBackward>::Index inputPlanes,
-    typename internal::traits<OutputBackward>::Index inputRows,
-    typename internal::traits<OutputBackward>::Index inputCols,
-    const DenseIndex stridePlanes = 1, const DenseIndex strideRows = 1,
-    const DenseIndex strideCols = 1) {
-  typedef typename internal::traits<OutputBackward>::Index TensorIndex;
-  const TensorRef<const Tensor<typename internal::traits<Kernel>::Scalar, internal::traits<Kernel>::NumDimensions, internal::traits<Kernel>::Layout, TensorIndex> > kern(kernel);
-  const TensorRef<const Tensor<typename internal::traits<OutputBackward>::Scalar, internal::traits<OutputBackward>::NumDimensions, internal::traits<OutputBackward>::Layout, TensorIndex> > out(output_backward);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Kernel>::Layout == internal::traits<OutputBackward>::Layout, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  static const bool isColMajor = (internal::traits<OutputBackward>::Layout == ColMajor);
-
-  static const int NumDims = internal::traits<OutputBackward>::NumDimensions;
-
-  // Number of filters to apply. This is the same as the output depth of the result
-  const TensorIndex kernelFilters = isColMajor ? kern.dimensions()[0] : kern.dimensions()[4];
-  // Number of channels. This is the same as the input depth.
-  const TensorIndex kernelChannels = isColMajor ? kern.dimensions()[1] : kern.dimensions()[3];
-  const TensorIndex kernelPlanes = isColMajor ? kern.dimensions()[2] : kern.dimensions()[2];
-  const TensorIndex kernelRows = isColMajor ? kern.dimensions()[3] : kern.dimensions()[1];
-  const TensorIndex kernelCols = isColMajor ? kern.dimensions()[4] : kern.dimensions()[0];
-
-  const TensorIndex outputPlanes = isColMajor ? out.dimensions()[1] : out.dimensions()[NumDims - 2];
-  const TensorIndex outputRows = isColMajor ? out.dimensions()[2] : out.dimensions()[NumDims - 3];
-  const TensorIndex outputCols = isColMajor ? out.dimensions()[3] : out.dimensions()[NumDims - 4];
-
-  TensorIndex forward_pad_z, forward_pad_y, forward_pad_x;
-  const TensorIndex size_z = ceil(inputPlanes / static_cast<float>(stridePlanes));
-  const TensorIndex size_y = ceil(inputRows / static_cast<float>(strideRows));
-  const TensorIndex size_x = ceil(inputCols / static_cast<float>(strideCols));
-
-  // Infer padding type.
-  if (size_z == outputPlanes && size_y == outputRows && size_x == outputCols) {
-    // SAME padding.
-    const TensorIndex dz = size_z * stridePlanes + kernelPlanes - 1 - inputPlanes;
-    const TensorIndex dy = size_y * strideRows + kernelRows - 1 - inputRows;
-    const TensorIndex dx = size_x * strideCols + kernelCols - 1 - inputCols;
-
-    forward_pad_z = dz - dz / 2;
-    forward_pad_y = dy - dy / 2;
-    forward_pad_x = dx - dx / 2;
-  } else {
-    // VALID padding.
-    forward_pad_z = 0;
-    forward_pad_y = 0;
-    forward_pad_x = 0;
-  }
-  const TensorIndex padding_ztop = kernelPlanes - 1 - forward_pad_z;
-  const TensorIndex padding_top = kernelRows - 1 - forward_pad_y;
-  const TensorIndex padding_left = kernelCols - 1 - forward_pad_x;
-
-  const TensorIndex padding_zbottom = inputPlanes + kernelPlanes - 1 - (outputPlanes - 1) * stridePlanes - 1 - padding_ztop;
-  const TensorIndex padding_bottom = inputRows + kernelRows - 1 - (outputRows - 1) * strideRows - 1 - padding_top;
-  const TensorIndex padding_right = inputCols + kernelCols - 1 - (outputCols - 1) * strideCols - 1 - padding_left;
-
-  eigen_assert(padding_ztop >= 0);
-  eigen_assert(padding_zbottom >= 0);
-  eigen_assert(padding_top >= 0);
-  eigen_assert(padding_left >= 0);
-  eigen_assert(padding_bottom >= 0);
-  eigen_assert(padding_right >= 0);
-
-  // The kernel has dimensions filters X channels X patch_planes X patch_rows X patch_cols.
-  // We need to reverse the kernel along the spatial dimensions.
-  array<bool, 5> kernel_reverse;
-  if (isColMajor) {
-    kernel_reverse[0] = false;
-    kernel_reverse[1] = false;
-    kernel_reverse[2] = true;
-    kernel_reverse[3] = true;
-    kernel_reverse[4] = true;
-  } else {
-    kernel_reverse[0] = true;
-    kernel_reverse[1] = true;
-    kernel_reverse[2] = true;
-    kernel_reverse[3] = false;
-    kernel_reverse[4] = false;
-  }
-
-  DSizes<TensorIndex, 3> kernel_dims;
-  if (isColMajor) {
-    kernel_dims[0] = kernelFilters;
-    kernel_dims[1] = kernelChannels;
-    kernel_dims[2] = kernelRows * kernelCols * kernelPlanes;
-  } else {
-    kernel_dims[0] = kernelRows * kernelCols * kernelPlanes;
-    kernel_dims[1] = kernelChannels;
-    kernel_dims[2] = kernelFilters;
-  }
-
-  // The output_backward has dimensions out_depth X out_planes X out_rows X out_cols X OTHERS
-  // When we extract the image patches from output_backward, it will have dimensions:
-  //   out_depth X (patch_planes * patch_rows * patch_cols) X (input_planes * input_rows * input_cols * OTHERS)
-  DSizes<TensorIndex, 3> pre_contract_dims;
-  if (isColMajor) {
-    pre_contract_dims[0] = kernelFilters;
-    pre_contract_dims[1] = kernelRows * kernelCols * kernelPlanes;
-    pre_contract_dims[2] = inputRows * inputCols * inputPlanes;
-    for (int i = 4; i < NumDims; ++i) {
-      pre_contract_dims[2] *= out.dimension(i);
-    }
-  } else {
-    pre_contract_dims[2] = kernelFilters;
-    pre_contract_dims[1] = kernelRows * kernelCols * kernelPlanes;
-    pre_contract_dims[0] = inputRows * inputCols * inputPlanes;
-    for (int i = 0; i < NumDims - 4; ++i) {
-      pre_contract_dims[0] *= out.dimension(i);
-    }
-  }
-
-  // We will contract along dimensions (0, 2) in kernel and (0, 1) in
-  // output_backward, if this is col-major, and
-  // dimensions (0, 2) in kernel and (1, 2) in output_backward, if this row-major.
-  array<IndexPair<TensorIndex>, 2> contract_dims;
-  if (isColMajor) {
-    // col-major: kernel.contract(output.patches)
-    contract_dims[0] = IndexPair<TensorIndex>(0, 0);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 1);
-  } else {
-    // row-major: output.patches.contract(kernel)
-    contract_dims[0] = IndexPair<TensorIndex>(1, 0);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 2);
-  }
-
-  // Post contraction, the dimensions of the input_backprop is
-  //  channels X input_planes X input_rows X input_cols X OTHERS
-  DSizes<TensorIndex, NumDims> post_contract_dims;
-  if (isColMajor) {
-    post_contract_dims[0] = kernelChannels;
-    post_contract_dims[1] = inputPlanes;
-    post_contract_dims[2] = inputRows;
-    post_contract_dims[3] = inputCols;
-    for (int i = 4; i < NumDims; ++i) {
-      post_contract_dims[i] = out.dimension(i);
-    }
-  } else {
-    post_contract_dims[NumDims - 1] = kernelChannels;
-    post_contract_dims[NumDims - 2] = inputPlanes;
-    post_contract_dims[NumDims - 3] = inputRows;
-    post_contract_dims[NumDims - 4] = inputCols;
-    for (int i = 0; i < NumDims - 4; ++i) {
-      post_contract_dims[i] = out.dimension(i);
-    }
-  }
-
-  DSizes<TensorIndex, NumDims> strides;
-  for (int i = 0; i < NumDims; i++) {
-    strides[i] = 1;
-  }
-  if (isColMajor) {
-    strides[1] = stridePlanes;
-    strides[2] = strideRows;
-    strides[3] = strideCols;
-  } else {
-    strides[NumDims - 2] = stridePlanes;
-    strides[NumDims - 3] = strideRows;
-    strides[NumDims - 4] = strideCols;
-  }
-
-  return choose(
-      Cond<internal::traits<OutputBackward>::Layout == ColMajor>(),
-      kernel.reverse(kernel_reverse)
-          .reshape(kernel_dims)
-          .contract(
-              output_backward.extract_volume_patches(kernelPlanes, kernelRows, kernelCols,
-                                                     1, 1, 1, stridePlanes, strideRows, strideCols,
-                               padding_ztop, padding_zbottom,
-                               padding_top, padding_bottom,
-                               padding_left, padding_right)
-                  .reshape(pre_contract_dims),
-              contract_dims)
-          .reshape(post_contract_dims),
-      output_backward.extract_volume_patches(kernelPlanes, kernelRows, kernelCols,
-                                             1, 1, 1, stridePlanes, strideRows, strideCols,
-                       padding_ztop, padding_zbottom,
-                       padding_top, padding_bottom,
-                       padding_left, padding_right)
-          .reshape(pre_contract_dims)
-          .contract(kernel.reverse(kernel_reverse).reshape(kernel_dims),
-                    contract_dims)
-          .reshape(post_contract_dims));
-}
-
-
-/** CuboidConvolutionBackwardKernel
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Computes the backprop for the filter of a 3D convolution.
-  *
-  * The output_backward parameter is expected to be a tensor with a rank of 4 or more (channels, depth, height, width, and optionally others)
-  * The kernel parameter is expected to be a 4D tensor (filters, channels, kernel_depth, kernel_height, kernel_width)
-  * output_backward and kernel have to be in the same layout.
-  *
-  * The dimensions of the result will be filters, depth, height, width (and others if applicable).
-  *
-  * It is possible to swap the order of the depth, width and height dimensions provided that the same order is used in the input, the kernel, and the output.
-  *
-  * All dimension orders above are given for col-major, and should be reversed for row-major.
-  */
-template <typename OutputBackward, typename Input>
-EIGEN_ALWAYS_INLINE static const typename internal::conditional<
-    internal::traits<OutputBackward>::Layout == ColMajor,
-    const TensorShufflingOp<
-        const array<typename internal::traits<OutputBackward>::Index, 5>,
-        const TensorReverseOp<
-            const array<bool, 5>,
-            const TensorReshapingOp<
-                const DSizes<typename internal::traits<OutputBackward>::Index, 5>,
-                const TensorContractionOp<
-                    const array< IndexPair<typename internal::traits<Input>::Index>, 2>,
-                    const TensorReshapingOp<
-                        const DSizes<typename internal::traits<Input>::Index, 3>,
-                        const Input>,
-                    const TensorReshapingOp<
-                        const DSizes< typename internal::traits<OutputBackward>::Index, 4>,
-                        const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const OutputBackward>
-                    >
-                >
-            >
-        >
-    >,
-    const TensorShufflingOp<
-        const array<typename internal::traits<OutputBackward>::Index, 5>,
-        const TensorReverseOp<
-            const array<bool, 5>,
-            const TensorReshapingOp<
-                const DSizes<typename internal::traits<OutputBackward>::Index, 5>,
-                const TensorContractionOp<
-                    const array< IndexPair<typename internal::traits<Input>::Index>, 2>,
-                    const TensorReshapingOp<
-                        const DSizes< typename internal::traits<OutputBackward>::Index, 4>,
-                        const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const OutputBackward>
-                    >,
-                    const TensorReshapingOp<
-                        const DSizes<typename internal::traits<Input>::Index, 3>,
-                        const Input
-                    >
-                >
-            >
-        >
-    >
->::type
-CuboidConvolutionBackwardKernel(
-    const Input& input, const OutputBackward& output_backward,
-    typename internal::traits<Input>::Index kernelPlanes,
-    typename internal::traits<Input>::Index kernelRows,
-    typename internal::traits<Input>::Index kernelCols,
-    const DenseIndex stridePlanes = 1,
-    const DenseIndex strideRows = 1,
-    const DenseIndex strideCols = 1) {
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-  TensorRef<Tensor<typename internal::traits<OutputBackward>::Scalar, internal::traits<OutputBackward>::NumDimensions, internal::traits<OutputBackward>::Layout, TensorIndex> > out(output_backward);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::Layout == internal::traits<OutputBackward>::Layout, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-
-  static const int NumDims = internal::traits<Input>::NumDimensions;
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == internal::traits<OutputBackward>::NumDimensions, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  const TensorIndex inputPlanes = isColMajor ? in.dimension(1) : in.dimension(NumDims - 2);
-  const TensorIndex inputRows = isColMajor ? in.dimension(2) : in.dimension(NumDims - 3);
-  const TensorIndex inputCols = isColMajor ? in.dimension(3) : in.dimension(NumDims - 4);
-
-  const TensorIndex outputPlanes = isColMajor ? out.dimension(1) : out.dimension(NumDims - 2);
-  const TensorIndex outputRows = isColMajor ? out.dimension(2) : out.dimension(NumDims - 3);
-  const TensorIndex outputCols = isColMajor ? out.dimension(3) : out.dimension(NumDims - 4);
-
-  const TensorIndex kernelFilters = isColMajor ? out.dimension(0) : out.dimension(NumDims - 1);
-  const TensorIndex kernelChannels = isColMajor ? in.dimension(0) : in.dimension(NumDims - 1);
-
-  TensorIndex forward_pad_z, forward_pad_y, forward_pad_x;
-  const TensorIndex size_z = ceil(inputPlanes / static_cast<float>(stridePlanes));
-  const TensorIndex size_y = ceil(inputRows / static_cast<float>(strideRows));
-  const TensorIndex size_x = ceil(inputCols / static_cast<float>(strideCols));
-
-  // Infer padding type.
-  if (size_z == outputPlanes && size_y == outputRows && size_x == outputCols) {
-    // SAME padding.
-    const TensorIndex dz = size_z * stridePlanes + kernelPlanes - 1 - inputPlanes;
-    const TensorIndex dy = size_y * strideRows + kernelRows - 1 - inputRows;
-    const TensorIndex dx = size_x * strideCols + kernelCols - 1 - inputCols;
-
-    forward_pad_z = dz - dz / 2;
-    forward_pad_y = dy - dy / 2;
-    forward_pad_x = dx - dx / 2;
-  } else {
-    // VALID padding.
-    forward_pad_z = 0;
-    forward_pad_y = 0;
-    forward_pad_x = 0;
-  }
-
-  const TensorIndex padding_ztop = kernelPlanes - 1 - forward_pad_z;
-  const TensorIndex padding_top = kernelRows - 1 - forward_pad_y;
-  const TensorIndex padding_left = kernelCols - 1 - forward_pad_x;
-
-  const TensorIndex padding_zbottom = inputPlanes + kernelPlanes - 1 - (outputPlanes - 1) * stridePlanes - 1 - padding_ztop;
-  const TensorIndex padding_bottom = inputRows + kernelRows - 1 - (outputRows - 1) * strideRows - 1 - padding_top;
-  const TensorIndex padding_right = inputCols + kernelCols - 1 - (outputCols - 1) * strideCols - 1 - padding_left;
-
-  eigen_assert(padding_ztop >= 0);
-  eigen_assert(padding_zbottom >= 0);
-  eigen_assert(padding_top >= 0);
-  eigen_assert(padding_left >= 0);
-  eigen_assert(padding_bottom >= 0);
-  eigen_assert(padding_right >= 0);
-
-  // The output_backward has dimensions out_depth X out_plaens X out_rows X out_cols X OTHERS
-  // When we extract the image patches from output_backward (with input as the
-  // kernel), it will have dimensions
-  //  (out_depth) X (input_planes * input_rows * input_cols) X (kernel_planes * kernel_rows * kernel_cols) X OTHERS
-  DSizes<TensorIndex, 4> pre_contract_dims;
-  if (isColMajor) {
-    pre_contract_dims[0] = kernelFilters;
-    pre_contract_dims[1] = inputRows * inputCols * inputPlanes;
-    pre_contract_dims[2] = kernelRows * kernelCols * kernelPlanes;
-    pre_contract_dims[3] = 1;
-    for (int i = 4; i < NumDims; ++i) {
-      pre_contract_dims[3] *= out.dimension(i);
-    }
-  } else {
-    pre_contract_dims[3] = kernelFilters;
-    pre_contract_dims[2] = inputRows * inputCols * inputPlanes;
-    pre_contract_dims[1] = kernelRows * kernelCols * kernelPlanes;
-    pre_contract_dims[0] = 1;
-    for (int i = 0; i < NumDims - 4; ++i) {
-      pre_contract_dims[0] *= out.dimension(i);
-    }
-  }
-
-  // The input has dimensions in_depth X (input_planes * input_rows * input_cols) X OTHERS
-  DSizes<TensorIndex, 3> input_dims;
-  if (isColMajor) {
-    input_dims[0] = kernelChannels;
-    input_dims[1] = inputRows * inputCols * inputPlanes;
-    input_dims[2] = 1;
-    for (int i = 4; i < NumDims; ++i) {
-      input_dims[2] *= in.dimension(i);
-    }
-    eigen_assert(input_dims[2] == pre_contract_dims[3]);
-  } else {
-    input_dims[2] = kernelChannels;
-    input_dims[1] = inputRows * inputCols * inputPlanes;
-    input_dims[0] = 1;
-    for (int i = 0; i < NumDims - 4; ++i) {
-      input_dims[0] *= in.dimension(i);
-    }
-    eigen_assert(input_dims[0] == pre_contract_dims[0]);
-  }
-
-  // We will contract along dimensions (1, 2) in in and (1, 3) in out, if
-  // this is col-major.
-  // For row-major, it's dimensions (0, 1) in in and (0, 2) in out.
-  array<IndexPair<TensorIndex>, 2> contract_dims;
-  if (isColMajor) {
-    // col-major: in.contract(output.patches)
-    contract_dims[0] = IndexPair<TensorIndex>(1, 1);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 3);
-  } else {
-    // row-major: output.patches.contract(in)
-    contract_dims[0] = IndexPair<TensorIndex>(0, 0);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 1);
-  }
-
-  // After the contraction, the kernel will have dimension
-  //   in_depth X out_depth X kernel_patches X kernel_rows X kernel_cols
-  // We will need to shuffle the first two dimensions and reverse the spatial dimensions.
-  // The end shape is:
-  //   out_depth X in_shape X kernel_planes X kernel_rows X kernel_cols
-
-  // This is the shape of the kernel *before* the shuffling.
-  DSizes<TensorIndex, 5> kernel_dims;
-  if (isColMajor) {
-    kernel_dims[0] = kernelChannels;
-    kernel_dims[1] = kernelFilters;
-    kernel_dims[2] = kernelPlanes;
-    kernel_dims[3] = kernelRows;
-    kernel_dims[4] = kernelCols;
-  } else {
-    kernel_dims[0] = kernelCols;
-    kernel_dims[1] = kernelRows;
-    kernel_dims[2] = kernelPlanes;
-    kernel_dims[3] = kernelFilters;
-    kernel_dims[4] = kernelChannels;
-  }
-
-  // Flip filters and channels.
-  array<TensorIndex, 5> kernel_shuffle;
-  if (isColMajor) {
-    kernel_shuffle[0] = 1;
-    kernel_shuffle[1] = 0;
-    kernel_shuffle[2] = 2;
-    kernel_shuffle[3] = 3;
-    kernel_shuffle[4] = 4;
-  } else {
-    kernel_shuffle[0] = 0;
-    kernel_shuffle[1] = 1;
-    kernel_shuffle[2] = 2;
-    kernel_shuffle[3] = 4;
-    kernel_shuffle[4] = 3;
-  }
-
-  // Reverse the spatial dimensions.
-  array<bool, 5> kernel_reverse;
-  if (isColMajor) {
-    kernel_reverse[0] = false;
-    kernel_reverse[1] = false;
-    kernel_reverse[2] = true;
-    kernel_reverse[3] = true;
-    kernel_reverse[4] = true;
-  } else {
-    kernel_reverse[0] = true;
-    kernel_reverse[1] = true;
-    kernel_reverse[2] = true;
-    kernel_reverse[3] = false;
-    kernel_reverse[4] = false;
-  }
-
-  DSizes<TensorIndex, NumDims> strides;
-  for (int i = 0; i < NumDims; i++) {
-    strides[i] = 1;
-  }
-  if (isColMajor) {
-    strides[1] = stridePlanes;
-    strides[2] = strideRows;
-    strides[3] = strideCols;
-  } else {
-    strides[NumDims - 2] = stridePlanes;
-    strides[NumDims - 3] = strideRows;
-    strides[NumDims - 4] = strideCols;
-  }
-  return choose(
-      Cond<internal::traits<Input>::Layout == ColMajor>(),
-      input.reshape(input_dims)
-          .contract(
-              output_backward.extract_volume_patches(
-                                 inputPlanes, inputRows, inputCols, 1,
-                                 1, 1, stridePlanes, strideRows, strideCols,
-
-                                 padding_ztop, padding_zbottom, padding_top,
-                                 padding_bottom, padding_left, padding_right)
-                  .reshape(pre_contract_dims),
-              contract_dims)
-          .reshape(kernel_dims)
-          .reverse(kernel_reverse)
-          .shuffle(kernel_shuffle),
-      output_backward.extract_volume_patches(
-                         inputPlanes, inputRows, inputCols, 1, 1, 1,
-                         stridePlanes, strideRows, strideCols, padding_ztop,
-                         padding_zbottom, padding_top, padding_bottom,
-                         padding_left, padding_right)
-          .reshape(pre_contract_dims)
-          .contract(input.reshape(input_dims), contract_dims)
-          .reshape(kernel_dims)
-          .reverse(kernel_reverse)
-          .shuffle(kernel_shuffle));
-}
-
-} // end namespace Eigen
-
-#endif // EIGEN_CXX11_NEURAL_NETWORKS_BACKWARD_CUBOID_CONVOLUTIONS_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardSpatialConvolutions.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardSpatialConvolutions.h
deleted file mode 100644
index 0f4ada246c..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/BackwardSpatialConvolutions.h
+++ /dev/null
@@ -1,351 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2015 Ke Yang <yangke@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_BACKWARD_SPATIAL_CONVOLUTIONS_H
-#define EIGEN_CXX11_NEURAL_NETWORKS_BACKWARD_SPATIAL_CONVOLUTIONS_H
-
-namespace Eigen {
-
-/** SpatialConvolutionBackwardInput
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Computes the backprop for the input of a 2D convolution.
-  *
-  * The output_backward parameter is expected to be a tensor with a rank of 3 or more (channels, height, width, and optionally others)
-  * The kernel parameter is expected to be a 4D tensor (filters, channels, kernel_height, kernel_width)
-  * The output_backward and the kernel must both be in col-major layout. The result will also be in col-major layout.
-  *
-  * If in_stride > 1, then applies convolution with holes (aka atrous convolution), sampling every in_stride input pixels.
-  *
-  * The result can be assigned to a tensor of rank equal to the rank of the output_backward. The dimensions of the result will be filters, height, width (and others if applicable).
-  *
-  * It is possible to swap the order of the width and height dimensions provided that the same order is used in the input, the kernel, and the output.
-  *
-  */
-
-template <typename OutputBackward, typename Kernel>
-EIGEN_ALWAYS_INLINE
-static const typename internal::conditional<
-  internal::traits<OutputBackward>::Layout == ColMajor,
-  TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, internal::traits<OutputBackward>::NumDimensions>, const TensorContractionOp<const array<IndexPair<typename internal::traits<OutputBackward>::Index>, 2>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 3>, const TensorReverseOp<const array<bool, 4>, const Kernel> >, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 3>, const TensorImagePatchOp<Dynamic, Dynamic, const OutputBackward> > > >,
-  TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, internal::traits<OutputBackward>::NumDimensions>, const TensorContractionOp<const array<IndexPair<typename internal::traits<OutputBackward>::Index>, 2>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 3>, const TensorImagePatchOp<Dynamic, Dynamic, const OutputBackward> >, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 3>, const TensorReverseOp<const array<bool, 4>, const Kernel> > > > >::type
-SpatialConvolutionBackwardInput(const Kernel& kernel, const OutputBackward& output_backward, typename internal::traits<OutputBackward>::Index inputRows, typename internal::traits<OutputBackward>::Index inputCols, const DenseIndex stride = 1, const DenseIndex in_stride = 1) {
-
-  typedef typename internal::traits<OutputBackward>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Kernel>::Scalar, internal::traits<Kernel>::NumDimensions, internal::traits<Kernel>::Layout, TensorIndex> > kern(kernel);
-  TensorRef<Tensor<typename internal::traits<OutputBackward>::Scalar, internal::traits<OutputBackward>::NumDimensions, internal::traits<OutputBackward>::Layout, TensorIndex> > out(output_backward);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Kernel>::Layout == internal::traits<OutputBackward>::Layout, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  static const bool isColMajor = (internal::traits<OutputBackward>::Layout == ColMajor);
-
-  static const int NumDims = internal::traits<OutputBackward>::NumDimensions;
-
-  // Number of filters to apply. This is the same as the output depth of the result
-  const TensorIndex kernelFilters = isColMajor ? kern.dimensions()[0] : kern.dimensions()[3];
-  // Number of channels. This is the same as the input depth.
-  const TensorIndex kernelChannels = isColMajor ? kern.dimensions()[1] : kern.dimensions()[2];
-  const TensorIndex kernelRows = isColMajor ? kern.dimensions()[2] : kern.dimensions()[1];
-  const TensorIndex kernelCols = isColMajor ? kern.dimensions()[3] : kern.dimensions()[0];
-
-  // This is the effective kernel size, taking into account the (in_stride - 1) zero-values
-  // inserted between consecutive kernel elements in atrous convolution
-  const TensorIndex kernelRowsEff = kernelRows + (kernelRows - 1) * (in_stride - 1);
-  const TensorIndex kernelColsEff = kernelCols + (kernelCols - 1) * (in_stride - 1);
-
-  const TensorIndex outputRows = isColMajor ? output_backward.dimension(1) : output_backward.dimension(NumDims - 2);
-  const TensorIndex outputCols = isColMajor ? output_backward.dimension(2) : output_backward.dimension(NumDims - 3);
-
-  // Computing the forward padding
-  const TensorIndex forward_pad_top = ((outputRows - 1) * stride + kernelRowsEff - inputRows) / 2;
-  const TensorIndex forward_pad_left = ((outputCols - 1) * stride + kernelColsEff - inputCols) / 2;
-
-  const TensorIndex padding_top = kernelRowsEff - 1 - forward_pad_top;
-  const TensorIndex padding_left = kernelColsEff - 1 - forward_pad_left;
-  const TensorIndex padding_bottom = inputRows + kernelRowsEff - 1 - (outputRows - 1) * stride - 1 - padding_top;
-  const TensorIndex padding_right = inputCols + kernelColsEff - 1 - (outputCols - 1) * stride - 1 - padding_left;
-
-  eigen_assert(padding_top >= 0);
-  eigen_assert(padding_left >= 0);
-  eigen_assert(padding_bottom >= 0);
-  eigen_assert(padding_right >= 0);
-
-  // The kernel has dimensions filters X channels X patch_rows X patch_cols
-  // We need to reverse the kernel along dimensions corresponding to rows and
-  // cols.
-  // TODO(yangke): we can make things slightly faster by collapsing the dimensions
-  // where we don't reverse. Try that once we have a faster compiler.
-  array<bool, 4> kernel_reverse;
-  if (isColMajor) {
-    kernel_reverse[0] = false;
-    kernel_reverse[1] = false;
-    kernel_reverse[2] = true;
-    kernel_reverse[3] = true;
-  } else {
-    kernel_reverse[0] = true;
-    kernel_reverse[1] = true;
-    kernel_reverse[2] = false;
-    kernel_reverse[3] = false;
-  }
-
-  DSizes<TensorIndex, 3> kernel_dims;
-  if (isColMajor) {
-    kernel_dims[0] = kernelFilters;
-    kernel_dims[1] = kernelChannels;
-    kernel_dims[2] = kernelRows * kernelCols;
-  } else {
-    kernel_dims[0] = kernelRows * kernelCols;
-    kernel_dims[1] = kernelChannels;
-    kernel_dims[2] = kernelFilters;
-  }
-
-  // The output_backward has dimensions out_depth X out_rows X out_cols X OTHERS
-  // When we extract the image patches from output_backward, it will have dimensions
-  //   out_depth X (patch_rows * patch_cols) X (input_rows * input_cols * OTHERS)
-  DSizes<TensorIndex, 3> pre_contract_dims;
-  if (isColMajor) {
-    pre_contract_dims[0] = kernelFilters;
-    pre_contract_dims[1] = kernelRows * kernelCols;
-    pre_contract_dims[2] = inputRows * inputCols;
-    for (int i = 3; i < NumDims; ++i) {
-      pre_contract_dims[2] *= out.dimension(i);
-    }
-  } else {
-    pre_contract_dims[2] = kernelFilters;
-    pre_contract_dims[1] = kernelRows * kernelCols;
-    pre_contract_dims[0] = inputRows * inputCols;
-    for (int i = 0; i < NumDims - 3; ++i) {
-      pre_contract_dims[0] *= out.dimension(i);
-    }
-  }
-
-  // We will contract along dimensions (0, 2) in kernel and (0, 1) in
-  // output_backward, if this is col-major, and
-  // dimensions (0, 2) in kernel and (1, 2) in output_backward, if this row-major.
-  array<IndexPair<TensorIndex>, 2> contract_dims;
-  if (isColMajor) {
-    // col-major: kernel.contract(output.patches)
-    contract_dims[0] = IndexPair<TensorIndex>(0, 0);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 1);
-  } else {
-    // row-major: output.patches.contract(kernel)
-    contract_dims[0] = IndexPair<TensorIndex>(1, 0);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 2);
-  }
-
-  // Post contraction, the dimensions of the input_backprop is
-  //  channels X input_rows X input_cols X OTHERS
-  DSizes<TensorIndex, NumDims> post_contract_dims;
-  if (isColMajor) {
-    post_contract_dims[0] = kernelChannels;
-    post_contract_dims[1] = inputRows;
-    post_contract_dims[2] = inputCols;
-    for (int i = 3; i < NumDims; ++i) {
-      post_contract_dims[i] = out.dimension(i);
-    }
-  } else {
-    post_contract_dims[NumDims - 1] = kernelChannels;
-    post_contract_dims[NumDims - 2] = inputRows;
-    post_contract_dims[NumDims - 3] = inputCols;
-    for (int i = 0; i < NumDims - 3; ++i) {
-      post_contract_dims[i] = out.dimension(i);
-    }
-  }
-
-  return choose(Cond<internal::traits<OutputBackward>::Layout == ColMajor>(),
-                kernel.reverse(kernel_reverse).reshape(kernel_dims).contract(output_backward.extract_image_patches(kernelRows, kernelCols, 1, 1, in_stride, in_stride, stride, stride, padding_top, padding_bottom, padding_left, padding_right, 0).reshape(pre_contract_dims), contract_dims).reshape(post_contract_dims),
-                output_backward.extract_image_patches(kernelRows, kernelCols, 1, 1, in_stride, in_stride, stride, stride, padding_top, padding_bottom, padding_left, padding_right, 0).reshape(pre_contract_dims).contract(kernel.reverse(kernel_reverse).reshape(kernel_dims), contract_dims).reshape(post_contract_dims));
-}
-
-
-/** SpatialConvolutionBackwardKernel
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Computes the backprop for the filter of a 2D convolution.
-  *
-  * The output_backward parameter is expected to be a tensor with a rank of 3 or more (channels, height, width, and optionally others)
-  * The kernel parameter is expected to be a 4D tensor (filters, channels, kernel_height, kernel_width)
-  * The output_backward and the kernel must both be in col-major layout. The result will also be in col-major layout.
-  *
-  * If in_stride > 1, then applies convolution with holes (aka atrous convolution), sampling every in_stride input pixels.
-  *
-  * The result can be assigned to a tensor of rank equal to the rank of the output_backward. The dimensions of the result will be filters, height, width (and others if applicable).
-  *
-  * It is possible to swap the order of the width and height dimensions provided that the same order is used in the input, the kernel, and the output.
-  *
-  */
-// TODO(gpapan): Resolve a bug in TensorContractionInputMapper at SpatialConvolutions.h that yangke circumvented by using .reshape().reshape().
-// This can significantly accelerate SpatialConvolutionBackwardKernel.
-
-template <typename OutputBackward, typename Input>
-EIGEN_ALWAYS_INLINE
-static const typename internal::conditional<
-  internal::traits<OutputBackward>::Layout == ColMajor,
-  const TensorShufflingOp<const array<typename internal::traits<OutputBackward>::Index, 4>, const TensorReverseOp<const array<bool, 4>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 4>, const TensorContractionOp<const array<IndexPair<typename internal::traits<Input>::Index>, 2>, const TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, 3>, const Input>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 4>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 4>, const TensorImagePatchOp<Dynamic, Dynamic, const OutputBackward> > > > > > >,
-  const TensorShufflingOp<const array<typename internal::traits<OutputBackward>::Index, 4>, const TensorReverseOp<const array<bool, 4>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 4>, const TensorContractionOp<const array<IndexPair<typename internal::traits<Input>::Index>, 2>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 4>, const TensorReshapingOp<const DSizes<typename internal::traits<OutputBackward>::Index, 4>, const TensorImagePatchOp<Dynamic, Dynamic, const OutputBackward> > >, const TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, 3>, const Input> > > > > >::type
-SpatialConvolutionBackwardKernel(const Input& input, const OutputBackward& output_backward, typename internal::traits<Input>::Index kernelRows, typename internal::traits<Input>::Index kernelCols, const DenseIndex stride = 1, const DenseIndex in_stride = 1) {
-
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-  TensorRef<Tensor<typename internal::traits<OutputBackward>::Scalar, internal::traits<OutputBackward>::NumDimensions, internal::traits<OutputBackward>::Layout, TensorIndex> > out(output_backward);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::Layout == internal::traits<OutputBackward>::Layout, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  // stride and in_stride cannot both be larger than 1
-  eigen_assert(!(stride > 1 && in_stride > 1));
-
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-
-  static const int NumDims = internal::traits<Input>::NumDimensions;
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == internal::traits<OutputBackward>::NumDimensions, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  const TensorIndex inputRows = isColMajor ? in.dimension(1) : in.dimension(NumDims - 2);
-  const TensorIndex inputCols = isColMajor ? in.dimension(2) : in.dimension(NumDims - 3);
-
-  const TensorIndex outputRows = isColMajor ? output_backward.dimension(1) : output_backward.dimension(NumDims - 2);
-  const TensorIndex outputCols = isColMajor ? output_backward.dimension(2) : output_backward.dimension(NumDims - 3);
-
-  // Number of filters to apply. This is the same as the output depth of the result
-  const TensorIndex kernelFilters = isColMajor ? out.dimensions()[0] : out.dimensions()[NumDims - 1];
-
-  // Number of channels. This is the same as the input depth.
-  const TensorIndex kernelChannels = isColMajor ? in.dimensions()[0] : in.dimensions()[NumDims - 1];
-
-  // This is the effective kernel size, taking into account the (in_stride - 1) zero-values
-  // inserted between consecutive kernel elements in atrous convolution
-  const TensorIndex kernelRowsEff = kernelRows + (kernelRows - 1) * (in_stride - 1);
-  const TensorIndex kernelColsEff = kernelCols + (kernelCols - 1) * (in_stride - 1);
-
-  // Computing the forward padding
-  const TensorIndex forward_pad_top = ((outputRows - 1) * stride + kernelRowsEff - inputRows) / 2;
-  const TensorIndex forward_pad_left = ((outputCols - 1) * stride + kernelColsEff - inputCols) / 2;
-
-  // TODO: factor out the padding computation.
-  const TensorIndex padding_top = kernelRowsEff - 1 - forward_pad_top;
-  const TensorIndex padding_left = kernelColsEff - 1 - forward_pad_left;
-  const TensorIndex padding_bottom = inputRows + kernelRowsEff - 1 - (outputRows - 1) * stride - 1 - padding_top;
-  const TensorIndex padding_right = inputCols + kernelColsEff - 1 - (outputCols - 1) * stride - 1 - padding_left;
-
-  eigen_assert(padding_top >= 0);
-  eigen_assert(padding_left >= 0);
-  eigen_assert(padding_bottom >= 0);
-  eigen_assert(padding_right >= 0);
-
-  // The output_backward has dimensions out_depth X out_rows X out_cols X OTHERS
-  // When we extract the image patches from output_backward (with input as the
-  // kernel), it will have dimensions
-  //  (out_depth) X (input_rows * input_cols) X (kernel_rows * kernel_cols) X OTHERS
-  DSizes<TensorIndex, 4> pre_contract_dims;
-  if (isColMajor) {
-    pre_contract_dims[0] = kernelFilters;
-    pre_contract_dims[1] = inputRows * inputCols;
-    pre_contract_dims[2] = kernelRows * kernelCols;
-    pre_contract_dims[3] = 1;
-    for (int i = 3; i < NumDims; ++i) {
-      pre_contract_dims[3] *= out.dimension(i);
-    }
-  } else {
-    pre_contract_dims[3] = kernelFilters;
-    pre_contract_dims[2] = inputRows * inputCols;
-    pre_contract_dims[1] = kernelRows * kernelCols;
-    pre_contract_dims[0] = 1;
-    for (int i = 0; i < NumDims - 3; ++i) {
-      pre_contract_dims[0] *= out.dimension(i);
-    }
-  }
-
-  // The input has dimensions in_depth X (input_rows * input_cols) X OTHERS
-  DSizes<TensorIndex, 3> input_dims;
-  if (isColMajor) {
-    input_dims[0] = kernelChannels;
-    input_dims[1] = inputRows * inputCols;
-    input_dims[2] = 1;
-    for (int i = 3; i < NumDims; ++i) {
-      input_dims[2] *= in.dimension(i);
-    }
-    eigen_assert(input_dims[2] == pre_contract_dims[3]);
-  } else {
-    input_dims[2] = kernelChannels;
-    input_dims[1] = inputRows * inputCols;
-    input_dims[0] = 1;
-    for (int i = 0; i < NumDims - 3; ++i) {
-      input_dims[0] *= in.dimension(i);
-    }
-    eigen_assert(input_dims[0] == pre_contract_dims[0]);
-  }
-
-  // We will contract along dimensions (1, 2) in and (1, 3) in out, if
-  // this is col-major.
-  // For row-major, it's dimensions (0, 1) in and (0, 2) in out.
-  array<IndexPair<TensorIndex>, 2> contract_dims;
-  if (isColMajor) {
-    // col-major: in.contract(output.patches)
-    contract_dims[0] = IndexPair<TensorIndex>(1, 1);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 3);
-  } else {
-    // row-major: output.patches.contract(in)
-    contract_dims[0] = IndexPair<TensorIndex>(0, 0);
-    contract_dims[1] = IndexPair<TensorIndex>(2, 1);
-  }
-
-  // After the contraction, the kernel will have dimension
-  // in_depth X out_depth X kernel_rows X kernel_cols
-  // We will need to shuffle the first two dimensions and reverse the latter
-  // two dimensions.
-  // The end shape is
-  // out_depth X in_shape X kernel_rows X kernel_cols
-
-  // This is the shape of the kernel *before* the shuffling.
-  DSizes<TensorIndex, 4> kernel_dims;
-  if (isColMajor) {
-    kernel_dims[0] = kernelChannels;
-    kernel_dims[1] = kernelFilters;
-    kernel_dims[2] = kernelRows;
-    kernel_dims[3] = kernelCols;
-  } else {
-    kernel_dims[0] = kernelCols;
-    kernel_dims[1] = kernelRows;
-    kernel_dims[2] = kernelFilters;
-    kernel_dims[3] = kernelChannels;
-  }
-
-  array<TensorIndex, 4> kernel_shuffle;
-  if (isColMajor) {
-    kernel_shuffle[0] = 1;
-    kernel_shuffle[1] = 0;
-    kernel_shuffle[2] = 2;
-    kernel_shuffle[3] = 3;
-  } else {
-    kernel_shuffle[0] = 0;
-    kernel_shuffle[1] = 1;
-    kernel_shuffle[2] = 3;
-    kernel_shuffle[3] = 2;
-  }
-
-  array<bool, 4> kernel_reverse;
-  if (isColMajor) {
-    kernel_reverse[0] = false;
-    kernel_reverse[1] = false;
-    kernel_reverse[2] = true;
-    kernel_reverse[3] = true;
-  } else {
-    kernel_reverse[0] = true;
-    kernel_reverse[1] = true;
-    kernel_reverse[2] = false;
-    kernel_reverse[3] = false;
-  }
-
-  return choose(Cond<internal::traits<Input>::Layout == ColMajor>(),
-                input.reshape(input_dims).contract(output_backward.extract_image_patches(inputRows, inputCols, in_stride, in_stride, 1, 1, stride, stride, padding_top, padding_bottom, padding_left, padding_right, 0).reshape(pre_contract_dims).reshape(pre_contract_dims), contract_dims).reshape(kernel_dims).reverse(kernel_reverse).shuffle(kernel_shuffle),
-                output_backward.extract_image_patches(inputRows, inputCols, in_stride, in_stride, 1, 1, stride, stride, padding_top, padding_bottom, padding_left, padding_right, 0).reshape(pre_contract_dims).reshape(pre_contract_dims).contract(input.reshape(input_dims), contract_dims).reshape(kernel_dims).reverse(kernel_reverse).shuffle(kernel_shuffle));
-}
-
-} // end namespace Eigen
-
-#endif // EIGEN_CXX11_NEURAL_NETWORKS_BACKWARD_SPATIAL_CONVOLUTIONS_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/CuboidConvolution.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/CuboidConvolution.h
deleted file mode 100644
index dfb9dcedba..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/CuboidConvolution.h
+++ /dev/null
@@ -1,179 +0,0 @@
-#ifndef EIGEN_CXX11_SRC_NEURAL_NETWORKS_CUBOID_CONVOLUTION_H
-#define EIGEN_CXX11_SRC_NEURAL_NETWORKS_CUBOID_CONVOLUTION_H
-
-#include "Patch3d.h"
-
-namespace Eigen {
-
-/** CuboidConvolution
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Applies a 3D convolution over a multichannel input voxel block.
-  *
-  * The input parameter is expected to be a tensor with a rank of 4 or more (channels, depth, height, width, and optionally others).
-  * The kernel parameter is expected to be a 5D tensor (filters, channels, kernel_depth, kernel_height, kernel_width).
-  * The result can be assigned to a tensor of rank equal to the rank of the input. The dimensions of the result will be filters, depth, height, width (and others if applicable).
-  *
-  * The input and kernel have to be in the same layout, and both row-major and
-  * col-major are supported. The shapes given above are for col-major layout.
-  * For row-major, all dimensions should be reversed.
-  *
-  * It is possible to swap the order of the depth, width, and height dimensions provided that the same order is used in the input, the kernel, and the output.
-  */
-template <typename Input, typename Kernel>
-EIGEN_ALWAYS_INLINE
-static const typename internal::conditional <
-    internal::traits<Input>::Layout == ColMajor,
-    TensorReshapingOp<
-        const DSizes<typename internal::traits<Input>::Index,
-                     internal::traits<Input>::NumDimensions>,
-        const TensorContractionOp<
-            const array<IndexPair<typename internal::traits<Input>::Index>, 1>,
-            const TensorReshapingOp<
-                const DSizes<typename internal::traits<Input>::Index, 2>,
-                const Kernel>,
-            const TensorReshapingOp<
-                const DSizes<typename internal::traits<Input>::Index, 2>,
-                const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic,
-                                          const Input> > > >,
-    TensorReshapingOp<
-        const DSizes<typename internal::traits<Input>::Index,
-                     internal::traits<Input>::NumDimensions>,
-        const TensorContractionOp<
-            const array<IndexPair<typename internal::traits<Input>::Index>, 1>,
-            const TensorReshapingOp<
-                const DSizes<typename internal::traits<Input>::Index, 2>,
-                const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic,
-                                          const Input> > ,
-                const TensorReshapingOp<
-                    const DSizes<typename internal::traits<Input>::Index, 2>,
-                    const Kernel> > > >::type
-CuboidConvolution(const Input& input, const Kernel& kernel,
-                  const DenseIndex stridePlanes = 1,
-                  const DenseIndex strideRows = 1,
-                  const DenseIndex strideCols = 1,
-                  const PaddingType padding_type = PADDING_SAME) {
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-  TensorRef<Tensor<typename internal::traits<Kernel>::Scalar, internal::traits<Kernel>::NumDimensions, internal::traits<Kernel>::Layout, TensorIndex> > kern(kernel);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::Layout == internal::traits<Kernel>::Layout, YOU_MADE_A_PROGRAMMING_MISTAKE);
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-  static const int NumDims = internal::traits<Input>::NumDimensions;
-
-  // Number of filters to apply. This is the same as the output depth of the result.
-  const TensorIndex kernelFilters = isColMajor ? kern.dimensions()[0] : kern.dimensions()[4];
-  const TensorIndex kernelChannels = isColMajor ? kern.dimensions()[1] : kern.dimensions()[3];
-
-  // Spatial size of the kernel.
-  const TensorIndex kernelDepth = isColMajor ? kern.dimensions()[2] : kern.dimensions()[2];
-  const TensorIndex kernelRows = isColMajor ? kern.dimensions()[3] : kern.dimensions()[1];
-  const TensorIndex kernelCols = isColMajor ? kern.dimensions()[4] : kern.dimensions()[0];
-
-  if (isColMajor) {
-    eigen_assert(kernelChannels == in.dimension(0));
-  } else {
-    eigen_assert(kernelChannels == in.dimension(NumDims - 1));
-  }
-
-  const TensorIndex inputPlanes = isColMajor ? in.dimension(1) : in.dimension(NumDims - 2);
-  const TensorIndex inputRows = isColMajor ? in.dimension(2) : in.dimension(NumDims - 3);
-  const TensorIndex inputCols = isColMajor ? in.dimension(3) : in.dimension(NumDims - 4);
-
-  const float stride_planes_f = static_cast<float>(stridePlanes);
-  const float stride_rows_f = static_cast<float>(strideRows);
-  const float stride_cols_f = static_cast<float>(strideCols);
-  TensorIndex out_depth;
-  TensorIndex out_height;
-  TensorIndex out_width;
-  switch (padding_type) {
-    case PADDING_VALID:
-      out_depth = ceil((inputPlanes - kernelDepth + 1.f) / stride_planes_f);
-      out_height = ceil((inputRows - kernelRows + 1.f) / stride_rows_f);
-      out_width = ceil((inputCols - kernelCols + 1.f) / stride_cols_f);
-      break;
-    case PADDING_SAME:
-      out_depth = ceil(inputPlanes / stride_planes_f);
-      out_height = ceil(inputRows / stride_rows_f);
-      out_width = ceil(inputCols / stride_cols_f);
-      break;
-    default:
-      eigen_assert(false && "unexpected padding");
-  }
-
-  DSizes<TensorIndex, 2> kernel_dims;
-  if (isColMajor) {
-    kernel_dims[0] = kernelFilters;
-    kernel_dims[1] = kernelChannels * kernelDepth * kernelRows * kernelCols;
-  } else {
-    kernel_dims[0] = kernelChannels * kernelDepth * kernelRows * kernelCols;
-    kernel_dims[1] = kernelFilters;
-  }
-
-  // Molds the output of the patch extraction result into a 2D tensor:
-  // - the first dimension (dims[0]): the patch values to be multiplied with the kernels
-  // - the second dimension (dims[1]): everything else
-  DSizes<TensorIndex, 2> pre_contract_dims;
-  if (isColMajor) {
-    pre_contract_dims[0] = kernelChannels * kernelDepth * kernelRows * kernelCols;
-    pre_contract_dims[1] = out_depth * out_height * out_width;
-    for (int i = 4; i < NumDims; ++i) {
-      pre_contract_dims[1] *= in.dimension(i);
-    }
-  } else {
-    pre_contract_dims[1] = kernelChannels * kernelDepth * kernelRows * kernelCols;
-    pre_contract_dims[0] = out_depth * out_height * out_width;
-    for (int i = 0; i < NumDims - 4; ++i) {
-      pre_contract_dims[0] *= in.dimension(i);
-    }
-  }
-
-  array<IndexPair<TensorIndex>, 1> contract_dims;
-  contract_dims[0] = IndexPair<TensorIndex>(1, 0);
-
-  // Molds the output of the contraction into the shape expected by the user
-  // (assuming ColMajor):
-  // - 1st dim: kernel filters
-  // - 2nd dim: output depth
-  // - 3nd dim: output height
-  // - 4rd dim: output width
-  // - 5th dim and beyond: everything else including batch size
-  DSizes<TensorIndex, NumDims> post_contract_dims;
-  if (isColMajor) {
-    post_contract_dims[0] = kernelFilters;
-    post_contract_dims[1] = out_depth;
-    post_contract_dims[2] = out_height;
-    post_contract_dims[3] = out_width;
-    for (int i = 4; i < NumDims; ++i) {
-      post_contract_dims[i] = in.dimension(i);
-    }
-  } else {
-    post_contract_dims[NumDims - 1] = kernelFilters;
-    post_contract_dims[NumDims - 2] = out_depth;
-    post_contract_dims[NumDims - 3] = out_height;
-    post_contract_dims[NumDims - 4] = out_width;
-    for (int i = 0; i < NumDims - 4; ++i) {
-      post_contract_dims[i] = in.dimension(i);
-    }
-  }
-
-  return choose(
-      Cond<internal::traits<Input>::Layout == ColMajor>(),
-      kernel.reshape(kernel_dims)
-          .contract(input.extract_volume_patches(
-                             kernelDepth, kernelRows, kernelCols, stridePlanes,
-                             strideRows, strideCols, padding_type)
-                        .reshape(pre_contract_dims),
-                    contract_dims)
-          .reshape(post_contract_dims),
-      input.extract_volume_patches(kernelDepth, kernelRows, kernelCols,
-                                   stridePlanes, strideRows, strideCols,
-                                   padding_type)
-          .reshape(pre_contract_dims)
-          .contract(kernel.reshape(kernel_dims), contract_dims)
-          .reshape(post_contract_dims));
-}
-
-} // end namespace Eigen
-
-#endif  // EIGEN_CXX11_SRC_NEURAL_NETWORKS_CUBOID_CONVOLUTION_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Patch3d.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Patch3d.h
deleted file mode 100644
index 2864f83299..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Patch3d.h
+++ /dev/null
@@ -1,240 +0,0 @@
-#ifndef EIGEN_CXX11_SRC_NEURAL_NETWORKS_PATCH3D_H
-#define EIGEN_CXX11_SRC_NEURAL_NETWORKS_PATCH3D_H
-
-#if not defined(__CUDACC__)
-#include <type_traits>
-#endif
-
-namespace Eigen {
-namespace internal {
-
-/** Extract3DPatches
- * \ingroup CXX11_NeuralNetworksModule
- *
- * \brief Extracts 3D patches from a multichannel input volume.
- *
- * The input parameter is expected to be a tensor with a rank of 4 or more
- * (channels, depth, height, width, optional others in col-major, and the
- * reverse order in row-major).
-
- * The return value will be a tensor of 3 more dimension than the input tensor.
- * In col-major, the first 4 dimensions of the result are: channels, patch_depth,
- * patch_height, patch_width. The next dimensions will identify the patch
- * position on the 3D grid of extracted patches: z, y, x. The remaining
- * dimensions, if any, will be the same as the 'other' dimensions of the input
- * tensor.
- */
-
-template <typename Input>
-EIGEN_ALWAYS_INLINE static const TensorStridingOp<
-    const array<typename internal::traits<Input>::Index,
-                internal::traits<Input>::NumDimensions + 3>,
-    const TensorReshapingOp<
-        const DSizes<typename internal::traits<Input>::Index,
-                     internal::traits<Input>::NumDimensions + 3>,
-        const TensorPatchOp<
-            const DSizes<typename internal::traits<Input>::Index,
-                         internal::traits<Input>::NumDimensions>,
-            const TensorPaddingOp<
-                const array<IndexPair<typename internal::traits<Input>::Index>,
-                            internal::traits<Input>::NumDimensions>,
-                const Input> > > >
-Extract3DPatches(
-    const Input& input, const DenseIndex patchPlanes,
-    const DenseIndex patchRows, const DenseIndex patchCols,
-    const DenseIndex stridePlanes, const DenseIndex strideRows,
-    const DenseIndex strideCols,
-    const DenseIndex paddingZTop, const DenseIndex paddingZBottom,
-    const DenseIndex paddingTop, const DenseIndex paddingBottom,
-    const DenseIndex paddingLeft, const DenseIndex paddingRight,
-    const typename internal::traits<Input>::Scalar padding_value = 0) {
-
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions >= 4, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-  static const int NumDims = internal::traits<Input>::NumDimensions;
-  static const int ExtDims = NumDims + 3;
-
-  // Tensor size after patch extraction. We add three dimensions to unpack the
-  // linear patch index into a 3D grid over which stride() can work.
-  DSizes<TensorIndex, ExtDims> pre_stride_dims;
-
-  if (isColMajor) {
-    pre_stride_dims[0] = in.dimension(0);
-    pre_stride_dims[1] = patchPlanes;
-    pre_stride_dims[2] = patchRows;
-    pre_stride_dims[3] = patchCols;
-  } else {
-    pre_stride_dims[ExtDims - 1] = in.dimension(NumDims - 1);
-    pre_stride_dims[ExtDims - 4] = patchCols;
-    pre_stride_dims[ExtDims - 3] = patchRows;
-    pre_stride_dims[ExtDims - 2] = patchPlanes;
-  }
-
-  const TensorIndex inputPlanes = isColMajor ? in.dimension(1) : in.dimension(NumDims - 2);
-  const TensorIndex inputRows = isColMajor ? in.dimension(2) : in.dimension(NumDims - 3);
-  const TensorIndex inputCols = isColMajor ? in.dimension(3) : in.dimension(NumDims - 4);
-
-  array<IndexPair<TensorIndex>, NumDims> paddings;
-  for (int i = 0; i < NumDims; ++i) {
-    paddings[i] = IndexPair<TensorIndex>(0, 0);
-  }
-
-  paddings[isColMajor ? 1 : (NumDims - 2)] = IndexPair<TensorIndex>(paddingZTop, paddingZBottom);
-  paddings[isColMajor ? 2 : (NumDims - 3)] = IndexPair<TensorIndex>(paddingTop, paddingBottom);
-  paddings[isColMajor ? 3 : (NumDims - 4)] = IndexPair<TensorIndex>(paddingLeft, paddingRight);
-
-  pre_stride_dims[isColMajor ? 4 : (ExtDims - 5)] = inputPlanes + paddingZBottom + paddingZTop - patchPlanes + 1;
-  pre_stride_dims[isColMajor ? 5 : (ExtDims - 6)] = inputRows + paddingTop + paddingBottom - patchRows + 1;
-  pre_stride_dims[isColMajor ? 6 : (ExtDims - 7)] = inputCols + paddingLeft + paddingRight - patchCols + 1;
-
-  if (isColMajor) {
-    for (int i = 7; i < NumDims + 3; ++i) {
-      pre_stride_dims[i] = in.dimension(i - 3);
-    }
-  } else {
-    for (int i = 0; i < NumDims - 4; ++i) {
-      pre_stride_dims[i] = in.dimension(i);
-    }
-  }
-
-  DSizes<TensorIndex, NumDims> patch_dims;
-  if (isColMajor) {
-    patch_dims[0] = in.dimension(0);
-    patch_dims[1] = patchPlanes;
-    patch_dims[2] = patchRows;
-    patch_dims[3] = patchCols;
-    for (int i = 4; i < NumDims; ++i) {
-      patch_dims[i] = 1;
-    }
-  } else {
-    patch_dims[NumDims - 1] = in.dimension(NumDims - 1);
-    patch_dims[NumDims - 4] = patchCols;
-    patch_dims[NumDims - 3] = patchRows;
-    patch_dims[NumDims - 2] = patchPlanes;
-    for (int i = 0; i < NumDims - 4; i++) {
-      patch_dims[i] = 1;
-    }
-  }
-
-  array<TensorIndex, NumDims + 3> strides;
-  if (isColMajor) {
-    // No striding within the patches.
-    for (int i = 0; i < 4; ++i) {
-      strides[i] = 1;
-    }
-    // Apply striding in the spatial patch grid dimensions only.
-    strides[4] = stridePlanes;
-    strides[5] = strideRows;
-    strides[6] = strideCols;
-    // No striding in the remaining dimensions (batches, ...).
-    for (int i = 7; i < NumDims + 3; i++) {
-      strides[i] = 1;
-    }
-  } else {
-    // No striding within the patches.
-    for (int i = 1; i <= 4; ++i) {
-      strides[ExtDims - i] = 1;
-    }
-    // Apply striding in the spatial patch grid dimensions only.
-    strides[ExtDims - 7] = strideCols;
-    strides[ExtDims - 6] = strideRows;
-    strides[ExtDims - 5] = stridePlanes;
-    // No striding in the remaining dimensions (batches, ...).
-    for (int i = 0; i < NumDims - 4; i++) {
-      strides[i] = 1;
-    }
-  }
-
-  // TODO(mjanusz): Consider getting rid of pad(), and stride() and extend
-  // extract_patches to take additional parameters for padding/striding,
-  // similarly to extract_image_patches.
-  return input.pad(paddings, padding_value).extract_patches(patch_dims).reshape(pre_stride_dims).stride(strides);
-}
-
-
-template <typename Input>
-EIGEN_ALWAYS_INLINE static const TensorStridingOp<
-    const array<typename internal::traits<Input>::Index,
-                internal::traits<Input>::NumDimensions + 3>,
-    const TensorReshapingOp<
-        const DSizes<typename internal::traits<Input>::Index,
-                     internal::traits<Input>::NumDimensions + 3>,
-        const TensorPatchOp<
-            const DSizes<typename internal::traits<Input>::Index,
-                         internal::traits<Input>::NumDimensions>,
-            const TensorPaddingOp<
-                const array<IndexPair<typename internal::traits<Input>::Index>,
-                            internal::traits<Input>::NumDimensions>,
-                const Input> > > >
-Extract3DPatches(
-    const Input& input, const DenseIndex patchPlanes,
-    const DenseIndex patchRows, const DenseIndex patchCols,
-    const DenseIndex stridePlanes, const DenseIndex strideRows,
-    const DenseIndex strideCols, const PaddingType padding_type,
-    const typename internal::traits<Input>::Scalar padding_value = 0) {
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions >= 4, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-  static const int NumDims = internal::traits<Input>::NumDimensions;
-
-  const TensorIndex inputPlanes = isColMajor ? in.dimension(1) : in.dimension(NumDims - 2);
-  const TensorIndex inputRows = isColMajor ? in.dimension(2) : in.dimension(NumDims - 3);
-  const TensorIndex inputCols = isColMajor ? in.dimension(3) : in.dimension(NumDims - 4);
-
-  switch (padding_type) {
-    case PADDING_VALID:
-      // No padding in any dimension.
-      return Extract3DPatches(input, patchPlanes, patchRows, patchCols,
-                              stridePlanes, strideRows, strideCols,
-                              0, 0, 0, 0, 0, 0, padding_value);
-    case PADDING_SAME: {
-      // The side of the tensor before striding should be just the expected
-      // output times the stride.
-      const TensorIndex size_z = ceil(inputPlanes / static_cast<float>(stridePlanes)) * stridePlanes;
-      const TensorIndex size_y = ceil(inputRows / static_cast<float>(strideRows)) * strideRows;
-      const TensorIndex size_x = ceil(inputCols / static_cast<float>(strideCols)) * strideCols;
-
-      // The size of the patch space is going to be: padded_input_size - patch_size + 1.
-      // This has to match the expected size before striding (pre_stride_dims).
-      // The deltas below extend the input to the expected size.
-      const TensorIndex dz = size_z + patchPlanes - 1 - inputPlanes;
-      const TensorIndex dy = size_y + patchRows - 1 - inputRows;
-      const TensorIndex dx = size_x + patchCols - 1 - inputCols;
-
-      return Extract3DPatches(input, patchPlanes, patchRows, patchCols,
-                              stridePlanes, strideRows, strideCols,
-                              dz - dz / 2, dz / 2,
-                              dy - dy / 2, dy / 2,
-                              dx - dx / 2, dx / 2,
-                              padding_value);
-    }
-    default:
-      eigen_assert(false && "unexpected padding");
-      // unreachable code to avoid missing return warning.
-      return Extract3DPatches(input, patchPlanes, patchRows, patchCols,
-                              stridePlanes, strideRows, strideCols,
-                              0, 0, 0, 0, 0, 0, padding_value);
-  }
-}
-
-// TODO(mjanusz): Switch this to a 'using' alias once CUDA supports C++11.
-template <typename Input>
-struct Extract3DPatchesType {
-  typedef const TensorStridingOp< const array<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions + 3>,
-      const TensorReshapingOp< const DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions + 3>,
-      const TensorPatchOp< const DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions>,
-      const TensorPaddingOp< const array< IndexPair<typename internal::traits<Input>::Index>, internal::traits<Input>::NumDimensions>,
-      const Input> > > > type;
-};
-
-}  // end namespace internal
-}  // end namespace Eigen
-
-#endif  // EIGEN_CXX11_SRC_NEURAL_NETWORKS_PATCH3D_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Pooling.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Pooling.h
deleted file mode 100644
index 942b060ba7..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/Pooling.h
+++ /dev/null
@@ -1,433 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2014 Benoit Steiner <benoit.steiner.goog@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_POOLING_H
-#define EIGEN_CXX11_NEURAL_NETWORKS_POOLING_H
-
-#include "Patch3d.h"
-
-namespace Eigen {
-
-/** SpatialMaxPooling
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Applies a max-pooling over a multichannel input image.
-  *
-  * The input parameter is expected to be a with a rank of 4 (channels, height, width, others in col-major, and the reverse of that in row-major).
-  *
-  * The result can be assigned to a tensor of rank equal to the rank of the input. The dimensions of the result will be channels, height, width, and others (in col-major, and the reverse of that if the input was row-major).
-  *
-  * The order of the width and height dimensions can be swapped if needed.
-  *
-*/
-#if !defined(EIGEN_HAS_INDEX_LIST)
-template <typename Input>
-EIGEN_ALWAYS_INLINE
-static const TensorReshapingOp<const Eigen::DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions>, const TensorReductionOp<internal::MaxReducer<typename internal::remove_const<typename internal::traits<Input>::Scalar>::type>, const Eigen::array<int, 2>, const TensorImagePatchOp<Dynamic, Dynamic, const Input> > >
-#else
-template <typename Input>
-EIGEN_ALWAYS_INLINE
-static const TensorReshapingOp<const Eigen::DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions>, const TensorReductionOp<internal::MaxReducer<typename internal::remove_const<typename internal::traits<Input>::Scalar>::type>, typename internal::conditional<internal::traits<Input>::Layout == ColMajor, const Eigen::IndexList<Eigen::type2index<1>, Eigen::type2index<2> >, const Eigen::IndexList<Eigen::type2index<2>, Eigen::type2index<3> > >::type, const TensorImagePatchOp<Dynamic, Dynamic, const Input> > >
-#endif
-SpatialMaxPooling(const Input& input, DenseIndex patchRows, DenseIndex patchCols,
-                  DenseIndex strideRows, DenseIndex strideCols, const PaddingType padding_type,
-                  DenseIndex in_strideRows = 1, DenseIndex in_strideCols = 1)
-{
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == 4, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-
-  const DenseIndex patchRowsEff = patchRows + (patchRows - 1) * (in_strideRows - 1);
-  const DenseIndex patchColsEff = patchCols + (patchCols - 1) * (in_strideCols - 1);
-
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-  static const int idxRows = isColMajor ? 1 : 2;
-  static const int idxCols = isColMajor ? 2 : 1;
-
-  // Molds the output of the reduction into the shape expected by the user.
-  // (assuming col-major):
-  // - 1st dim: channels
-  // - 2nd dim: output height
-  // - 3rd dim: output width
-  // - 4th dim and beyond: everything else including batch size
-  Eigen::DSizes<TensorIndex, internal::traits<Input>::NumDimensions> post_reduce_dims;
-  post_reduce_dims[0] = in.dimension(0);
-  if (padding_type == PADDING_VALID) {
-    post_reduce_dims[idxRows] = numext::ceil((in.dimension(idxRows) - patchRowsEff + 1.f) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil((in.dimension(idxCols) - patchColsEff + 1.f) / static_cast<float>(strideCols));
-  } else {
-    post_reduce_dims[idxRows] = numext::ceil(in.dimension(idxRows) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil(in.dimension(idxCols) / static_cast<float>(strideCols));
-  }
-  post_reduce_dims[3] = in.dimension(3);
-
-#if !defined(EIGEN_HAS_INDEX_LIST)
-  // nvcc doesn't support cxx11
-  Eigen::array<int, 2> reduction_dims;
-  if (isColMajor) {
-    reduction_dims[0] = 1;
-    reduction_dims[1] = 2;
-  } else {
-    reduction_dims[0] = 2;
-    reduction_dims[1] = 3;
-  }
-#else
-  // Take advantage of cxx11 to give the compiler information it can use to
-  // optimize the code.
-  typename internal::conditional<internal::traits<Input>::Layout == ColMajor, const Eigen::IndexList<Eigen::type2index<1>, Eigen::type2index<2> >, const Eigen::IndexList<Eigen::type2index<2>, Eigen::type2index<3> > >::type reduction_dims;
-#endif
-
-  return input.extract_image_patches(patchRows, patchCols, strideRows, strideCols, in_strideRows, in_strideCols, padding_type, -Eigen::NumTraits<typename internal::remove_const<typename internal::traits<Input>::Scalar>::type>::highest()).maximum(reduction_dims).reshape(post_reduce_dims);
-}
-
-/** CuboidMaxPooling
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Applies a max-pooling over a multichannel input volume.
-  *
-  * The input parameter is expected to be a tensor with a rank of 5 (channels, depth, height, width, others in col-major, and the reverse of that in row-major).
-  *
-  * The result can be assigned to a tensor of rank equal to the rank of the input. The dimensions of the result will be channels, depth, height, width, and others (in col-major, and the reverse of that if the input was row-major).
-  *
-  * The order of the depth, width and height dimensions can be swapped if needed.
-  *
-*/
-#if !defined(EIGEN_HAS_INDEX_LIST)
-template <typename Input>
-EIGEN_ALWAYS_INLINE static const TensorReshapingOp<
-    const Eigen::DSizes<DenseIndex, internal::traits<Input>::NumDimensions>,
-    const TensorReductionOp<
-        internal::MaxReducer<float>, const Eigen::array<int, 1>,
-        const TensorReshapingOp<
-            const Eigen::DSizes<DenseIndex, 3>,
-            const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const Input> > > >
-#else
-template <typename Input>
-EIGEN_ALWAYS_INLINE static const TensorReshapingOp<
-    const Eigen::DSizes<DenseIndex, internal::traits<Input>::NumDimensions>,
-    const TensorReductionOp<
-        internal::MaxReducer<float>,
-        const Eigen::IndexList<Eigen::type2index<1> >,
-        const TensorReshapingOp<
-            const Eigen::DSizes<DenseIndex, 3>,
-            const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const Input> > > >
-#endif
-CuboidMaxPooling(const Input& input, DenseIndex patchPlanes,
-                 DenseIndex patchRows, DenseIndex patchCols,
-                 DenseIndex stridePlanes, DenseIndex strideRows,
-                 DenseIndex strideCols, const PaddingType padding_type) {
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == 5, YOU_MADE_A_PROGRAMMING_MISTAKE);
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-
-  static const int idxPlanes = isColMajor ? 1 : 3;
-  static const int idxRows = 2;
-  static const int idxCols = isColMajor ? 3 : 1;
-
-  // Molds the output of the reduction into the shape expected by the used
-  // (assuming col-major):
-  // - 1st dim: channels
-  // - 2nd dim: output depth
-  // - 3rd dim: output height
-  // - 4th dim: output width
-  // - 5th dim and beyond: everything else including batch size
-  Eigen::DSizes<DenseIndex, internal::traits<Input>::NumDimensions> post_reduce_dims;
-  post_reduce_dims[0] = in.dimension(0);
-  if (padding_type == PADDING_VALID) {
-    post_reduce_dims[idxPlanes] = numext::ceil((in.dimension(idxPlanes) - patchPlanes + 1.f) / static_cast<float>(stridePlanes));
-    post_reduce_dims[idxRows] = numext::ceil((in.dimension(idxRows) - patchRows + 1.f) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil((in.dimension(idxCols) - patchCols + 1.f) / static_cast<float>(strideCols));
-  } else {
-    post_reduce_dims[idxPlanes] = numext::ceil(in.dimension(idxPlanes) / static_cast<float>(stridePlanes));
-    post_reduce_dims[idxRows] = numext::ceil(in.dimension(idxRows) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil(in.dimension(idxCols) / static_cast<float>(strideCols));
-  }
-  post_reduce_dims[4] = in.dimension(4);
-
-  Eigen::DSizes<DenseIndex, 3> pre_reduce_dims;
-  pre_reduce_dims[1] = patchRows * patchCols * patchPlanes;
-  if (isColMajor) {
-    pre_reduce_dims[0] = post_reduce_dims[0];
-    pre_reduce_dims[2] = post_reduce_dims[1] * post_reduce_dims[2] * post_reduce_dims[3] * post_reduce_dims[4];
-  } else {
-    pre_reduce_dims[0] = post_reduce_dims[0] * post_reduce_dims[1] * post_reduce_dims[2] * post_reduce_dims[3];
-    pre_reduce_dims[2] = post_reduce_dims[4];
-  }
-
-#if !defined(EIGEN_HAS_INDEX_LIST)
-  // nvcc doesn't support cxx11
-  Eigen::array<int, 1> reduction_dims;
-  reduction_dims[0] = 1;
-#else
-  // Take advantage of cxx11 to give the compiler information it can use to
-  // optimize the code.
-  Eigen::IndexList<Eigen::type2index<1> > reduction_dims;
-#endif
-  return input.extract_volume_patches(patchPlanes, patchRows, patchCols,
-                                      stridePlanes, strideRows, strideCols,
-                                      padding_type, -Eigen::NumTraits<float>::highest())
-      .reshape(pre_reduce_dims)
-      .maximum(reduction_dims)
-      .reshape(post_reduce_dims);
-}
-
-
-/** SpatialAvgPooling
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Applies an average pooling over a multichannel input image.
-  *
-  * The input parameter is expected to be a tensor with a rank of 4 (channels, height, width, others in col-major, and the reverse of that in row-major).
-  *
-  * The result can be assigned to a tensor of rank equal to the rank of the input. The dimensions of the result will be channels, height, width, and others (in col-major, and the reverse of that if the input was row-major).
-  *
-  * The order of the width and height dimensions can be swapped if needed.
-  *
-*/
-namespace internal {
-
-template <typename T> struct AvgPoolMeanReducer
-{
-#if (EIGEN_ARCH_i386 || EIGEN_ARCH_x86_64) && !defined(__CUDACC__)
-  // We only support packet access for floats.
-  static const bool PacketAccess = internal::is_same<T, float>::value;
-#else
-  static const bool PacketAccess = false;
-#endif
-  static const bool IsStateful = true;
-
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE AvgPoolMeanReducer() : scalarCount_(0) {
-    typedef typename packet_traits<T>::type Packet;
-    packetCount_ = pset1<Packet>(0.0);
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void reduce(const T t, T* accum) {
-    if (t != -Eigen::NumTraits<T>::highest()) {
-      (*accum) = (*accum) + t;
-      scalarCount_++;
-    }
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE T initialize() const {
-    return static_cast<T>(0);
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE T finalize(const T accum) const {
-    eigen_assert(scalarCount_ > 0);
-    return accum / scalarCount_;
-  }
-
-#if (EIGEN_ARCH_i386 || EIGEN_ARCH_x86_64) && !defined(__CUDACC__)
-#ifdef EIGEN_VECTORIZE_AVX
-#define pequal(a,b) _mm256_cmp_ps(a,b,_CMP_EQ_UQ)
-#define psel(a,b,false_mask) _mm256_blendv_ps(a,b,false_mask)
-#else
-#define pequal(a,b) _mm_cmpeq_ps(a,b)
-#define psel(a,b,false_mask) _mm_or_ps(_mm_andnot_ps(false_mask, a), _mm_and_ps(false_mask, b))
-#endif
-
-  template <typename Packet>
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void reducePacket(const Packet& p, Packet* accum) {
-    reducePacketWithType(static_cast<T>(0), p, accum);
-  }
-
-  template <typename Packet>
-  void reducePacketWithType(T, const Packet& p, Packet* accum) {
-    Packet skip_mask = pequal(p, pset1<Packet>(-Eigen::NumTraits<T>::highest()));
-    (*accum) = padd<Packet>(*accum, psel(p, pset1<Packet>(0), skip_mask));
-    packetCount_ = padd<Packet>(packetCount_, psel(pset1<Packet>(1), pset1<Packet>(0), skip_mask));
-  }
-
-  template <typename Packet>
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE Packet initializePacket() const {
-    return pset1<Packet>(0);
-  }
-
-  template <typename Packet>
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE Packet finalizePacket(const Packet& vaccum) const {
-    return pdiv(vaccum, packetCount_);
-  }
-  template <typename Packet>
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE T finalizeBoth(const T saccum, const Packet& vaccum) const {
-    return (saccum + predux(vaccum)) / (scalarCount_ + predux(packetCount_));
-  }
-#endif
-
- protected:
-    typedef typename packet_traits<T>::type Packet;
-    int scalarCount_;
-    Packet packetCount_;
-};
-
-}  // namespace internal
-
-#if !defined(EIGEN_HAS_INDEX_LIST)
-template <typename Input>
-EIGEN_ALWAYS_INLINE
-static const TensorReshapingOp<const Eigen::DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions>, const TensorReductionOp<internal::AvgPoolMeanReducer<typename internal::remove_const<typename internal::traits<Input>::Scalar>::type>, const Eigen::array<int, 2>, const TensorImagePatchOp<Dynamic, Dynamic, const Input> > >
-#else
-template <typename Input>
-EIGEN_ALWAYS_INLINE
-static const TensorReshapingOp<const Eigen::DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions>, const TensorReductionOp<internal::AvgPoolMeanReducer<typename internal::remove_const<typename internal::traits<Input>::Scalar>::type>, typename internal::conditional<internal::traits<Input>::Layout == ColMajor, const Eigen::IndexList<Eigen::type2index<1>, Eigen::type2index<2> >, const Eigen::IndexList<Eigen::type2index<2>, Eigen::type2index<3> > >::type, const TensorImagePatchOp<Dynamic, Dynamic, const Input> > >
-#endif
-SpatialAvgPooling(const Input& input, DenseIndex patchRows, DenseIndex patchCols,
-                  DenseIndex strideRows, DenseIndex strideCols, const PaddingType padding_type,
-                  DenseIndex in_strideRows = 1, DenseIndex in_strideCols = 1)
-{
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == 4, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-
-  const DenseIndex patchRowsEff = patchRows + (patchRows - 1) * (in_strideRows - 1);
-  const DenseIndex patchColsEff = patchCols + (patchCols - 1) * (in_strideCols - 1);
-
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-  static const int idxRows = isColMajor ? 1 : 2;
-  static const int idxCols = isColMajor ? 2 : 1;
-
-  // Molds the output of the reduction into the shape expected by the user.
-  // (assuming col-major):
-  // - 1st dim: channels
-  // - 2nd dim: output height
-  // - 3rd dim: output width
-  // - 4th dim and beyond: everything else including batch size
-  Eigen::DSizes<TensorIndex, internal::traits<Input>::NumDimensions> post_reduce_dims;
-  post_reduce_dims[0] = in.dimension(0);
-  if (padding_type == PADDING_VALID) {
-    post_reduce_dims[idxRows] = numext::ceil((in.dimension(idxRows) - patchRowsEff + 1.f) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil((in.dimension(idxCols) - patchColsEff + 1.f) / static_cast<float>(strideCols));
-  } else {
-    post_reduce_dims[idxRows] = numext::ceil(in.dimension(idxRows) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil(in.dimension(idxCols) / static_cast<float>(strideCols));
-  }
-  post_reduce_dims[3] = in.dimension(3);
-
-  typedef typename internal::remove_const<typename internal::traits<Input>::Scalar>::type CoeffReturnType;
-  internal::AvgPoolMeanReducer<CoeffReturnType> mean_with_nan;
-
-#if !defined(EIGEN_HAS_INDEX_LIST)
-  // nvcc doesn't support cxx11
-  Eigen::array<int, 2> reduction_dims;
-  if (isColMajor) {
-    reduction_dims[0] = 1;
-    reduction_dims[1] = 2;
-  } else {
-    reduction_dims[0] = 2;
-    reduction_dims[1] = 3;
-  }
-#else
-  // Take advantage of cxx11 to give the compiler information it can use to
-  // optimize the code.
-  typename internal::conditional<internal::traits<Input>::Layout == ColMajor, const Eigen::IndexList<Eigen::type2index<1>, Eigen::type2index<2> >, const Eigen::IndexList<Eigen::type2index<2>, Eigen::type2index<3> > >::type reduction_dims;
-#endif
-  return input.extract_image_patches(patchRows, patchCols, strideRows, strideCols, in_strideRows, in_strideCols, padding_type, -Eigen::NumTraits<typename internal::remove_const<typename internal::traits<Input>::Scalar>::type>::highest()).reduce(reduction_dims, mean_with_nan).reshape(post_reduce_dims);
-}
-
-
-/** CuboidAvgPooling
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Applies an average pooling over a multichannel input volume.
-  *
-  * The input parameter is expected to be a tensor with a rank of 5 (channels, depth, height, width, others, and the reverse of that in row-major).
-  *
-  * The result can be assigned to a tensor of rank equal to the rank of the input. The dimensions of the result will be channels, depth, width, and others (in col-major, and the reverse of that if the input was row-major).
-  *
-  * The order of the depth, width and height dimensions can be swapped if needed.
-  *
-*/
-#if !defined(EIGEN_HAS_INDEX_LIST)
-template <typename Input>
-EIGEN_ALWAYS_INLINE static const TensorReshapingOp<
-    const Eigen::DSizes<DenseIndex, internal::traits<Input>::NumDimensions>,
-    const TensorReductionOp<
-        internal::AvgPoolMeanReducer<float>, const Eigen::array<int, 1>,
-        const TensorReshapingOp<
-            const Eigen::DSizes<DenseIndex, 3>,
-            const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const Input> > > >
-#else
-template <typename Input>
-EIGEN_ALWAYS_INLINE static const TensorReshapingOp<
-      const Eigen::DSizes<DenseIndex, internal::traits<Input>::NumDimensions>,
-      const TensorReductionOp<
-          internal::AvgPoolMeanReducer<float>,
-          const Eigen::IndexList<Eigen::type2index<1> >,
-          const TensorReshapingOp<
-              const Eigen::DSizes<DenseIndex, 3>,
-              const TensorVolumePatchOp<Dynamic, Dynamic, Dynamic, const Input> > > >
-#endif
-CuboidAvgPooling(const Input& input, DenseIndex patchPlanes,
-                 DenseIndex patchRows, DenseIndex patchCols,
-                 DenseIndex stridePlanes, DenseIndex strideRows,
-                 DenseIndex strideCols, const PaddingType padding_type) {
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == 5, YOU_MADE_A_PROGRAMMING_MISTAKE);
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-
-  static const int idxPlanes = isColMajor ? 1 : 3;
-  static const int idxRows = 2;
-  static const int idxCols = isColMajor ? 3 : 1;
-  // Molds the output of the reduction into the shape expected by the used
-  // (assuming col-major):
-  // - 1st dim: channels
-  // - 2nd dim: outupt depth
-  // - 3rd dim: output height
-  // - 4th dim: output width
-  // - 5th dim and beyond: everything else including batch size
-  Eigen::DSizes<DenseIndex, internal::traits<Input>::NumDimensions> post_reduce_dims;
-  post_reduce_dims[0] = in.dimension(0);
-  if (padding_type == PADDING_VALID) {
-    post_reduce_dims[idxPlanes] = numext::ceil((in.dimension(idxPlanes) - patchPlanes + 1.f) / static_cast<float>(stridePlanes));
-    post_reduce_dims[idxRows] = numext::ceil((in.dimension(idxRows) - patchRows + 1.f) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil((in.dimension(idxCols) - patchCols + 1.f) / static_cast<float>(strideCols));
-  } else {
-    post_reduce_dims[idxPlanes] = numext::ceil(in.dimension(idxPlanes) / static_cast<float>(stridePlanes));
-    post_reduce_dims[idxRows] = numext::ceil(in.dimension(idxRows) / static_cast<float>(strideRows));
-    post_reduce_dims[idxCols] = numext::ceil(in.dimension(idxCols) / static_cast<float>(strideCols));
-  }
-  post_reduce_dims[4] = in.dimension(4);
-
-  Eigen::DSizes<DenseIndex, 3> pre_reduce_dims;
-  pre_reduce_dims[1] = patchRows * patchCols * patchPlanes;
-  if (isColMajor) {
-    pre_reduce_dims[0] = post_reduce_dims[0];
-    pre_reduce_dims[2] = post_reduce_dims[1] * post_reduce_dims[2] * post_reduce_dims[3] * post_reduce_dims[4];
-  } else {
-    pre_reduce_dims[0] = post_reduce_dims[0] * post_reduce_dims[1] * post_reduce_dims[2] * post_reduce_dims[3];
-    pre_reduce_dims[2] = post_reduce_dims[4];
-  }
-
-  typedef typename internal::remove_const<typename internal::traits<Input>::Scalar>::type CoeffReturnType;
-  internal::AvgPoolMeanReducer<CoeffReturnType> mean_with_nan;
-
-#if !defined(EIGEN_HAS_INDEX_LIST)
-  // nvcc doesn't support cxx11
-  Eigen::array<int, 1> reduction_dims;
-  reduction_dims[0] = 1;
-#else
-  // Take advantage of cxx11 to give the compiler information it can use to
-  // optimize the code.
-  Eigen::IndexList<Eigen::type2index<1> > reduction_dims;
-#endif
-  return input.extract_volume_patches(patchPlanes, patchRows, patchCols,
-                                      stridePlanes, strideRows, strideCols,
-                                      padding_type, -Eigen::NumTraits<float>::highest())
-      .reshape(pre_reduce_dims)
-      .reduce(reduction_dims, mean_with_nan)
-      .reshape(post_reduce_dims);
-}
-
-} // end namespace Eigen
-
-#endif // EIGEN_CXX11_NEURAL_NETWORKS_POOLING_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/SoftMax.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/SoftMax.h
deleted file mode 100644
index f0e21ab9c2..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/SoftMax.h
+++ /dev/null
@@ -1,83 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2014 Benoit Steiner <benoit.steiner.goog@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_SOFTMAX_H
-#define EIGEN_CXX11_NEURAL_NETWORKS_SOFTMAX_H
-
-namespace Eigen {
-
-/** SoftMax
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Applies a softmax
-  *
-  * The input parameter is expected to be a col-major tensor with a rank of 2 (depth and other).
-  *
-  * The result can be assigned to a tensor of rank and dimensions equal to that of the input. The result will be laid out in col-major order.
-  *
-*/
-
-namespace {
-class SoftmaxOp {
- public:
-  EIGEN_ALWAYS_INLINE SoftmaxOp(const float beta) : beta_(beta) { }
-
-  template <typename Input> EIGEN_ALWAYS_INLINE
-  typename Input::Dimensions dimensions(const Input& input) const {
-    return input.dimensions();
-  }
-
-  template <typename Input, typename Output, typename Device>
-  void eval(const Input& input, Output& output, const Device& device) const
-  {
-#if !defined(EIGEN_HAS_INDEX_LIST)
-    // nvcc doesn't support cxx11
-    Eigen::array<typename internal::traits<Input>::Index, 1> depth_dim;
-    depth_dim[0] = 0;
-    Eigen::array<typename internal::traits<Input>::Index, 2> bcast;
-    bcast[0] = dimensions(input)[0];
-    bcast[1] = 1;
-    DSizes<typename internal::traits<Input>::Index, 2> dims2d;
-    dims2d[0] = 1;
-    dims2d[1] = dimensions(input)[1];
-#else
-    // Take advantage of cxx11 to give the compiler information it can use to
-    // optimize the code.
-    Eigen::IndexList<Eigen::type2index<0>> depth_dim;
-    Eigen::IndexList<int, Eigen::type2index<1>> bcast;
-    bcast.set(0, dimensions(input)[0]);
-    Eigen::IndexList<Eigen::type2index<1>, typename internal::traits<Input>::Index> dims2d;
-    dims2d.set(1, dimensions(input)[1]);
-#endif
-
-    output.device(device) = ((input - input.maximum(depth_dim).eval().reshape(dims2d).broadcast(bcast)) * beta_).exp();
-    output.device(device) = output / (output.sum(depth_dim).eval().reshape(dims2d).broadcast(bcast));
-  }
-
- private:
-  const float beta_;
-};
-}
-
-
-template <typename Input>
-EIGEN_ALWAYS_INLINE
-static const TensorCustomUnaryOp<const SoftmaxOp, const Input>
-SoftMax(const Input& input, const float beta)
-{
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::Layout == ColMajor, YOU_MADE_A_PROGRAMMING_MISTAKE);
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::NumDimensions == 2, YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-  const SoftmaxOp op(beta);
-  return input.customOp(op);
-}
-
-
-} // end namespace Eigen
-
-#endif // EIGEN_CXX11_NEURAL_NETWORKS_SOFTMAX_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/SpatialConvolutions.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/SpatialConvolutions.h
deleted file mode 100644
index 8e2ddca6b5..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/SpatialConvolutions.h
+++ /dev/null
@@ -1,775 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2014 Benoit Steiner <benoit.steiner.goog@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-#ifndef EIGEN_CXX11_NEURAL_NETWORKS_SPATIAL_CONVOLUTIONS_H
-#define EIGEN_CXX11_NEURAL_NETWORKS_SPATIAL_CONVOLUTIONS_H
-
-namespace Eigen {
-
-namespace internal {
-
-// These optimizations require vector instructions
-#ifdef EIGEN_VECTORIZE
-
-// TODO: Consolidate this part of the code with the image patch extraction code
-// since they are both very similar.
-template <typename NewDimension, DenseIndex Rows, DenseIndex Cols, typename ArgType, typename Device,
-          typename Scalar_, typename Index,
-          typename nocontract_t, typename contract_t,
-          int Side, size_t packet_size,
-          bool inner_dim_contiguous, bool inner_dim_reordered, int Alignment>
-class TensorContractionInputMapper<Scalar_, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment>
-{
- public:
-  typedef TensorContractionInputMapper<Scalar_, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment> Self;
-  typedef TensorContractionSubMapper<Scalar_, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment> SubMapper;
-  typedef SubMapper VectorMapper;
-  typedef SubMapper LinearMapper;
-  typedef Scalar_ Scalar;
-  typedef typename packet_traits<Scalar>::type Packet;
-
-  TensorContractionInputMapper(const TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>& tensor,
-                               const nocontract_t&, const nocontract_t&,
-                               const contract_t&, const contract_t&)
-      : m_impl(tensor.impl().impl())
-  {
-    Index patch_rows;
-    Index patch_depth;
-    if (internal::traits<ArgType>::Layout == ColMajor) {
-      patch_depth = tensor.impl().dimensions()[0];
-      patch_rows = tensor.impl().dimensions()[1];
-      m_patch_cols = tensor.impl().dimensions()[2];
-      m_num_patches = tensor.impl().dimensions()[3];
-    } else {
-      static const int NumDims = tensor.impl().dimensions().size();
-      patch_depth = tensor.impl().dimensions()[NumDims - 1];
-      patch_rows = tensor.impl().dimensions()[NumDims - 2];
-      m_patch_cols = tensor.impl().dimensions()[NumDims - 3];
-      m_num_patches = tensor.impl().dimensions()[NumDims - 4];
-    }
-    m_patch_row_inflate_strides = tensor.impl().rowInflateStride();
-    m_patch_col_inflate_strides = tensor.impl().colInflateStride();
-
-    m_colStride = patch_rows;
-
-    m_outputRows = tensor.impl().outputRows();
-    m_row_strides = tensor.impl().userRowStride();
-    m_col_strides = tensor.impl().userColStride();
-
-    m_in_row_strides = tensor.impl().userInRowStride();
-    m_in_col_strides = tensor.impl().userInColStride();
-
-    if (internal::traits<ArgType>::Layout == ColMajor) {
-      m_inputRows = tensor.impl().impl().dimensions()[1];
-      m_inputCols = tensor.impl().impl().dimensions()[2];
-    } else {
-      static const int NumDims = tensor.impl().impl().dimensions().size();
-      m_inputRows = tensor.impl().impl().dimensions()[NumDims - 2];
-      m_inputCols = tensor.impl().impl().dimensions()[NumDims - 3];
-    }
-
-    m_rowInputStride = patch_depth;
-    m_colInputStride = patch_depth * m_inputRows;
-    m_patchInputStride = patch_depth * m_inputRows * m_inputCols;
-
-    m_rowPaddingTop = tensor.impl().rowPaddingTop();
-    m_colPaddingLeft = tensor.impl().colPaddingLeft();
-
-    m_fastInputRowStride = internal::TensorIntDivisor<Index>(m_patch_row_inflate_strides);
-    m_fastInputColStride = internal::TensorIntDivisor<Index>(m_patch_col_inflate_strides);
-    m_fastNumPatches = internal::TensorIntDivisor<Index>(m_num_patches);
-    m_fastColStride = internal::TensorIntDivisor<Index>(m_colStride);
-    m_fastOutputRows = internal::TensorIntDivisor<Index>(m_outputRows);
-    m_fastDimZero = internal::TensorIntDivisor<Index>(patch_depth);
-  }
-
-  TensorContractionInputMapper(const TensorContractionInputMapper& base_mapper) :
-      m_impl(base_mapper.m_impl) {
-    m_patch_cols = base_mapper.m_patch_cols;
-    m_num_patches = base_mapper.m_num_patches;
-    m_patch_row_inflate_strides = base_mapper.m_patch_row_inflate_strides;
-    m_patch_col_inflate_strides = base_mapper.m_patch_col_inflate_strides;
-
-    m_colStride = base_mapper.m_colStride;
-
-    m_rowInputStride = base_mapper.m_rowInputStride;
-    m_colInputStride = base_mapper.m_colInputStride;
-    m_patchInputStride = base_mapper.m_patchInputStride;
-
-    m_inputRows = base_mapper.m_inputRows;
-    m_inputCols = base_mapper.m_inputCols;
-
-    m_outputRows = base_mapper.m_outputRows;
-    m_row_strides = base_mapper.m_row_strides;
-    m_col_strides = base_mapper.m_col_strides;
-
-    m_in_row_strides = base_mapper.m_in_row_strides;
-    m_in_col_strides = base_mapper.m_in_col_strides;
-
-    m_rowPaddingTop = base_mapper.m_rowPaddingTop;
-    m_colPaddingLeft = base_mapper.m_colPaddingLeft;
-
-    m_fastInputRowStride = base_mapper.m_fastInputRowStride;
-    m_fastInputColStride = base_mapper.m_fastInputColStride;
-    m_fastNumPatches = base_mapper.m_fastNumPatches;
-    m_fastColStride = base_mapper.m_fastColStride;
-    m_fastOutputRows = base_mapper.m_fastOutputRows;
-    m_fastDimZero = base_mapper.m_fastDimZero;
-  }
-
- // If true, turns off some optimizations for loading packets since the image
-  // patches are "non-standard" such as there are non-trivial strides or
-  // inflations in the input.
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE bool nonStandardPatches() const {
-    return m_in_row_strides != 1 || m_in_col_strides != 1 || m_patch_row_inflate_strides != 1 || m_patch_col_inflate_strides != 1;
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_STRONG_INLINE SubMapper getSubMapper(Index i, Index j) const {
-    return SubMapper(*this, i, j);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_STRONG_INLINE LinearMapper getLinearMapper(Index i, Index j) const {
-    return LinearMapper(*this, i, j);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Scalar operator()(Index row) const {
-    Index rowIndex, colIndex, otherIndex;
-    computeBaseIndices(0, rowIndex, colIndex, otherIndex);
-    return loadCoeff(row, rowIndex, colIndex, otherIndex);
-  }
-
-  // Load the coefficient at the patchIndex location instead of the usual m_rowIndex,
-  // m_colIndex, m_otherIndex. This is currently only used by the gpu code.  EIGEN_DEVICE_FUNC
-  EIGEN_DEVICE_FUNC
-  EIGEN_STRONG_INLINE Scalar operator()(Index row, Index patchIndex) const {
-    Index rowIndex, colIndex, otherIndex;
-    computeBaseIndices(patchIndex, rowIndex, colIndex, otherIndex);
-    return loadCoeff(row, rowIndex, colIndex, otherIndex);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Packet loadPacket(Index row) const {
-    Index rowIndex, colIndex, otherIndex;
-    computeBaseIndices(0, rowIndex, colIndex, otherIndex);
-    return loadPacket(row, rowIndex, colIndex, otherIndex);
-  }
-
-  // Load the packet at the patchIndex location instead of the usual m_rowIndex,
-  // m_colIndex, m_otherIndex. This is currently only used by the gpu code.
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Packet loadPacket(Index row, Index patchIndex) const {
-    Index rowIndex, colIndex, otherIndex;
-    computeBaseIndices(patchIndex, rowIndex, colIndex, otherIndex);
-    return loadPacket(row, rowIndex, colIndex, otherIndex);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE const TensorEvaluator<ArgType, Device>& impl() const { return m_impl; }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index patchDepth() const { return m_rowInputStride; }
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index patchRows() const { return m_colStride; }
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index patchCols() const { return m_patch_cols; }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Packet packetNoPadding(const Index depth, const Index baseIndex) const {
-    const Index inputIndex = depth + baseIndex;
-    return m_impl.template packet<Unaligned>(inputIndex);
-  }
-
- private:
-  friend class TensorContractionSubMapper<Scalar, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment>;
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_STRONG_INLINE Scalar loadCoeff(Index patchId, Index rowIndex, Index colIndex, Index otherIndex) const {
-    // Find the offset of the element wrt the location of the first element.
-    const Index patchOffset = patchId / m_fastDimZero;
-
-    const Index colOffset = patchOffset / m_fastColStride;
-    const Index inputCol = colIndex + colOffset * m_in_col_strides;
-    const Index origInputCol = (m_patch_col_inflate_strides == 1) ? inputCol : ((inputCol >= 0) ? (inputCol / m_fastInputColStride) : 0);
-    const Index rowOffset = patchOffset - colOffset * m_colStride;
-    const Index inputRow = rowIndex + rowOffset * m_in_row_strides;
-    const Index origInputRow = (m_patch_row_inflate_strides == 1) ? inputRow : ((inputRow >= 0) ? (inputRow / m_fastInputRowStride) : 0);
-    if (origInputCol < 0 | origInputRow < 0 | origInputCol >= m_inputCols | origInputRow >= m_inputRows |
-        (inputCol != origInputCol * m_patch_col_inflate_strides) | (inputRow != origInputRow * m_patch_row_inflate_strides)) {
-      return Scalar(0);
-    }
-    const Index depth = patchId - patchOffset * patchDepth();
-    const Index inputIndex = depth + origInputRow * m_rowInputStride + origInputCol * m_colInputStride + otherIndex;
-    return m_impl.coeff(inputIndex);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_STRONG_INLINE Scalar loadCoeffStandard(Index patchId, Index rowIndex, Index colIndex, Index otherIndex) const {
-    eigen_assert(!nonStandardPatches());
-
-    // Find the offset of the element wrt the location of the first element.
-    const Index patchOffset = patchId / m_fastDimZero;
-
-    const Index colOffset = patchOffset / m_fastColStride;
-    const Index inputCol = colIndex + colOffset;
-    const Index rowOffset = patchOffset - colOffset * m_colStride;
-    const Index inputRow = rowIndex + rowOffset;
-    if (inputCol < 0 || inputCol >= m_inputCols || inputRow < 0 || inputRow >= m_inputRows) {
-      return Scalar(0);
-    }
-    const Index depth = patchId - patchOffset * patchDepth();
-    const Index inputIndex = depth + inputRow * m_rowInputStride + inputCol * m_colInputStride + otherIndex;
-    return m_impl.coeff(inputIndex);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Packet loadPacket(Index patchId, Index rowIndex, Index colIndex, Index otherIndex) const {
-    const Index packetSize = internal::unpacket_traits<Packet>::size;
-    EIGEN_STATIC_ASSERT(packetSize > 1, YOU_MADE_A_PROGRAMMING_MISTAKE)
-    eigen_assert(patchId < patchDepth()*patchRows()*m_patch_cols);
-
-    if (nonStandardPatches()) {
-      return packetWithPossibleZero(patchId, rowIndex, colIndex, otherIndex);
-    }
-    return loadPacketStandard(patchId, rowIndex, colIndex, otherIndex);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Packet loadPacketStandard(Index patchId, Index rowIndex, Index colIndex, Index otherIndex) const {
-    const Index packetSize = internal::unpacket_traits<Packet>::size;
-    EIGEN_STATIC_ASSERT(packetSize > 1, YOU_MADE_A_PROGRAMMING_MISTAKE)
-    eigen_assert(patchId < patchDepth()*patchRows()*m_patch_cols);
-
-    eigen_assert(!nonStandardPatches());
-
-    if ((patchDepth() % packetSize) == 0) {
-      return loadPacketFast(patchId, rowIndex, colIndex, otherIndex);
-    }
-    else {
-      const Index patchOffsets[2] = {patchId / m_fastDimZero, (patchId + packetSize - 1) / m_fastDimZero};
-
-      const Index colOffsets[2] = {patchOffsets[0] / m_fastColStride, patchOffsets[1] / m_fastColStride};
-
-      const Index inputCols[2] = {colIndex + colOffsets[0], colIndex + colOffsets[1]};
-      if (inputCols[0] >= m_inputCols | inputCols[1] < 0) {
-        // all zeros
-        return internal::pset1<Packet>(Scalar(0));
-      }
-
-      if (inputCols[0] == inputCols[1]) {
-        const Index rowOffsets[2] = {patchOffsets[0] - colOffsets[0]*m_colStride, patchOffsets[1] - colOffsets[1]*m_colStride};
-        eigen_assert(rowOffsets[0] <= rowOffsets[1]);
-        const Index inputRows[2] = {rowIndex + rowOffsets[0], rowIndex + rowOffsets[1]};
-
-        if (inputRows[0] >= m_inputRows | inputRows[1] < 0) {
-          // all zeros
-          return internal::pset1<Packet>(Scalar(0));
-        }
-
-        if (inputRows[0] >= 0 & inputRows[1] < m_inputRows) {
-          // no padding
-          const Index depth = patchId - patchOffsets[0] * patchDepth();
-          const Index inputIndex = depth + inputRows[0] * m_rowInputStride + inputCols[0] * m_colInputStride + otherIndex;
-          return m_impl.template packet<Unaligned>(inputIndex);
-        }
-      }
-    }
-    return packetWithPossibleZero(patchId, rowIndex, colIndex, otherIndex);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Packet loadPacketFast(Index patchId, Index rowIndex, Index colIndex, Index otherIndex) const {
-    const Index packetSize = internal::unpacket_traits<Packet>::size;
-    EIGEN_STATIC_ASSERT(packetSize > 1, YOU_MADE_A_PROGRAMMING_MISTAKE)
-    eigen_assert(patchId < patchDepth()*patchRows()*m_patch_cols);
-
-    eigen_assert(!nonStandardPatches());
-    eigen_assert((patchDepth() % packetSize) == 0);
-    // Find the offset of the element wrt the location of the first element.
-    const Index patchOffset = patchId / m_fastDimZero;
-    eigen_assert((patchId + packetSize - 1)  / m_fastDimZero == patchOffset);
-
-    const Index colOffset = patchOffset / m_fastColStride;
-    const Index inputCol = colIndex + colOffset;
-    const Index rowOffset = patchOffset - colOffset*m_colStride;
-    const Index inputRow = rowIndex + rowOffset;
-    if (inputCol < 0 | inputRow < 0 | inputCol >= m_inputCols | inputRow >= m_inputRows) {
-      // all zeros
-      return internal::pset1<Packet>(Scalar(0));
-    }
-    // no padding
-    const Index depth = patchId - patchOffset * patchDepth();
-    const Index inputIndex = depth + inputRow * m_rowInputStride + inputCol * m_colInputStride + otherIndex;
-    return m_impl.template packet<Unaligned>(inputIndex);
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Packet packetWithPossibleZero(Index patchId, Index rowIndex, Index colIndex, Index otherIndex) const
-  {
-    const int packetSize = internal::unpacket_traits<Packet>::size;
-    EIGEN_ALIGN_MAX typename internal::remove_const<Scalar>::type values[packetSize];
-    for (int i = 0; i < packetSize; ++i) {
-      values[i] = loadCoeff(patchId+i, rowIndex, colIndex, otherIndex);
-    }
-    Packet rslt = internal::pload<Packet>(values);
-    return rslt;
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void computeBaseIndices(Index patchIndex, Index& rowIndex, Index& colIndex, Index& otherIndex) const {
-    const int NumInputDims = array_size<typename TensorEvaluator<ArgType, Device>::Dimensions>::value;
-    otherIndex = (NumInputDims == 3) ? 0 : patchIndex / m_fastNumPatches;
-    const Index patch2DIndex = (NumInputDims == 3) ? patchIndex : (patchIndex - otherIndex * m_num_patches);
-    otherIndex *= m_patchInputStride;
-    colIndex = patch2DIndex / m_fastOutputRows;
-    rowIndex = patch2DIndex - colIndex * m_outputRows;
-    colIndex = colIndex * m_col_strides - m_colPaddingLeft;
-    rowIndex = rowIndex * m_row_strides - m_rowPaddingTop;
-  }
-
-  Index m_patch_cols;    // number of colums in the patch
-  Index m_num_patches;   // number of patches to extract.
-  Index m_patch_row_inflate_strides;  // the strides for row inflation in the image patch
-  Index m_patch_col_inflate_strides;  // the strides for col inflation in the image patch
-  // Fast representation of inflation strides.
-  internal::TensorIntDivisor<Index> m_fastInputRowStride;
-  internal::TensorIntDivisor<Index> m_fastInputColStride;
-
-  Index m_otherStride;
-  Index m_colStride;
-  internal::TensorIntDivisor<Index> m_fastNumPatches;
-  internal::TensorIntDivisor<Index> m_fastColStride;
-
-  Index m_rowInputStride;     // row stride in the input tensor
-  Index m_colInputStride;     // col stride in the input tensor
-  Index m_patchInputStride;   // patch stride in the input tensor
-
-  Index m_inputRows;     // Number of rows in the input tensor
-  Index m_inputCols;     // Number of cols in the input tensor
-
-  Index m_outputRows;    // Number of patch rows
-
-  Index m_row_strides;   // User specified row stride
-  Index m_col_strides;   // User specified col stride
-
-  Index m_in_row_strides;  // User specified input row stride
-  Index m_in_col_strides;  // User specified input col stride
-
-  Index m_rowPaddingTop;    // Row padding
-  Index m_colPaddingLeft;   // Column padding
-
-  internal::TensorIntDivisor<Index> m_fastOutputRows;
-  internal::TensorIntDivisor<Index> m_fastDimZero;
-
-  const TensorEvaluator<ArgType, Device> m_impl;
-};
-
-
-template <typename NewDimension, DenseIndex Rows, DenseIndex Cols, typename ArgType, typename Device,
-          typename Scalar_, typename Index,
-          typename nocontract_t, typename contract_t,
-          int Side, size_t packet_size,
-          bool inner_dim_contiguous, bool inner_dim_reordered, int Alignment>
-class TensorContractionSubMapper<Scalar_, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment>
-{
- public:
-  typedef Scalar_ Scalar;
-  typedef typename packet_traits<Scalar>::type Packet;
-  typedef typename packet_traits<Scalar>::half HalfPacket;
-
-  typedef TensorContractionInputMapper<Scalar, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment> ParentMapper;
-  typedef TensorContractionSubMapper<Scalar, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment> Self;
-  typedef Self LinearMapper;
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE TensorContractionSubMapper(const ParentMapper& base_mapper, Index vert_offset, Index horiz_offset)
-      : m_base_mapper(base_mapper), m_depth_offset(vert_offset), m_col_offset(horiz_offset) {
-    m_base_mapper.computeBaseIndices(m_col_offset, m_rowIndex, m_colIndex, m_otherIndex);
-  }
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE TensorContractionSubMapper(const Self& base_mapper, Index vert_offset, Index horiz_offset)
-      : m_base_mapper(base_mapper.m_base_mapper), m_depth_offset(vert_offset+base_mapper.m_depth_offset), m_col_offset(horiz_offset+base_mapper.m_col_offset) {
-    m_base_mapper.computeBaseIndices(m_col_offset, m_rowIndex, m_colIndex, m_otherIndex);
-  }
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Scalar operator()(Index i) const {
-    return m_base_mapper.loadCoeff(i + m_depth_offset, m_rowIndex, m_colIndex, m_otherIndex);
-  }
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Scalar operator()(Index i, Index j) const {
-    return m_base_mapper(i + m_depth_offset, j + m_col_offset);
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Packet loadPacket(Index i) const {
-   return m_base_mapper.loadPacket(i + m_depth_offset, m_rowIndex, m_colIndex, m_otherIndex);
-  }
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Packet loadPacket(Index i, Index j) const {
-    return m_base_mapper.template loadPacket(i + m_depth_offset, j + m_col_offset);
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Scalar loadCoeffStandard(Index i) const {
-    return m_base_mapper.loadCoeffStandard(i + m_depth_offset, m_rowIndex, m_colIndex, m_otherIndex);
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Packet loadPacketFast(Index i) const {
-   return m_base_mapper.loadPacketFast(i + m_depth_offset, m_rowIndex, m_colIndex, m_otherIndex);
-  }
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE Packet loadPacketStandard(Index i) const {
-   return m_base_mapper.loadPacketStandard(i + m_depth_offset, m_rowIndex, m_colIndex, m_otherIndex);
-  }
-  template <typename Packet>
-  EIGEN_DEVICE_FUNC bool aligned(Index) const {
-    return false;
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE bool nonStandardPatches() const {
-    return m_base_mapper.nonStandardPatches();
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index patchDepth() const { return m_base_mapper.m_rowInputStride; }
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index patchRows() const { return m_base_mapper.m_colStride; }
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index patchCols() const { return m_base_mapper.m_patch_cols; }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Packet packetNoPadding(const Index depth, const Index baseIndex) const {
-    const Index inputIndex = depth + baseIndex;
-    return m_base_mapper.m_impl.template packet<Unaligned>(inputIndex);
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE bool padRow(const Index row) const {
-    const Index r = m_rowIndex + row;
-    return r < 0 | r >= m_base_mapper.m_inputRows;
-  }
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE bool padCol(const Index col) const {
-    const Index c = m_colIndex + col;
-    return c < 0 | c >= m_base_mapper.m_inputCols;
-    }
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index baseIndex(const Index row, const Index col) const {
-    const Index r = m_rowIndex + row;
-    const Index c = m_colIndex + col;
-    return r * m_base_mapper.m_rowInputStride + c * m_base_mapper.m_colInputStride + m_otherIndex;
-  }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index rowOffset() const {
-    const Index patchOffset = m_depth_offset / m_base_mapper.m_fastDimZero;
-    const Index colOffset = patchOffset / m_base_mapper.m_fastColStride;
-    return patchOffset-colOffset*m_base_mapper.m_colStride;
-    }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index colOffset() const {
-    const Index patchOffset = m_depth_offset / m_base_mapper.m_fastDimZero;
-    const Index colOffset = patchOffset / m_base_mapper.m_fastColStride;
-    return colOffset;
-    }
-
-  EIGEN_DEVICE_FUNC
-  EIGEN_ALWAYS_INLINE Index depthOffset() const {
-    const Index patchOffset = m_depth_offset % m_base_mapper.patchDepth();
-    return patchOffset;
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE LinearMapper getLinearMapper(Index i, Index j) const {
-    return LinearMapper(m_base_mapper, i + m_depth_offset, j + m_col_offset);
- }
-
- private:
-  const ParentMapper& m_base_mapper;  // that was a reference before
-  Index m_depth_offset;  // First row in the input matrix
-  Index m_col_offset;    // First col in the input matrix
-
-  Index m_rowIndex;        // precomputed row index corresponding to the col offset
-  Index m_colIndex;        // precomputed col index corresponding to the col offset
-  Index m_otherIndex;      // precomputed other index corresponding to the col offset
-
-};
-
-
-template <typename NewDimension, DenseIndex Rows, DenseIndex Cols, typename ArgType, typename Device,
-          typename Scalar, typename Index,
-          typename nocontract_t, typename contract_t,
-          int Side, size_t packet_size,
-          bool inner_dim_contiguous, bool inner_dim_reordered, int Alignment, int nr>
-struct gemm_pack_rhs<Scalar, Index, TensorContractionSubMapper<Scalar, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment>, nr, ColMajor, false, false> {
-
-  typedef TensorContractionSubMapper<Scalar, Index, Side, TensorEvaluator<const TensorReshapingOp<NewDimension, const TensorImagePatchOp<Rows, Cols, ArgType> >, Device>, nocontract_t, contract_t, packet_size, inner_dim_contiguous, inner_dim_reordered, Alignment> SubMapper;
-  typedef SubMapper DataMapper;
-
-  static inline Index ceil_div(Index a, Index b) {
-    return (a + b - 1) / b;
-  }
-
-  EIGEN_DONT_INLINE void operator()(Scalar* block, const DataMapper& rhs, Index depth, Index cols, Index stride=0, Index offset=0) const {
-    eigen_assert(stride == 0);
-    eigen_assert(offset == 0);
-
-    EIGEN_STATIC_ASSERT((nr == 4), YOU_MADE_A_PROGRAMMING_MISTAKE);
-    typedef typename DataMapper::LinearMapper LinearMapper;
-    typedef typename packet_traits<Scalar>::type Packet;
-
-    const Index packet_cols4 = (cols/4) * 4;
-    const Index peeled_k = (depth/packet_size) * packet_size;
-    const bool non_standard_patches = rhs.nonStandardPatches();
-
-    for(Index j2=0; j2<packet_cols4; j2+=4)
-    {
-      const SubMapper dm0 = rhs.getLinearMapper(0, j2 + 0);
-      const SubMapper dm1 = rhs.getLinearMapper(0, j2 + 1);
-      const SubMapper dm2 = rhs.getLinearMapper(0, j2 + 2);
-      const SubMapper dm3 = rhs.getLinearMapper(0, j2 + 3);
-
-      Index k=0;
-      if((packet_size%4)==0 && !non_standard_patches)
-      {
-        const Index patch_depth = rhs.patchDepth();
-        if ((patch_depth % packet_size) == 0) {
-          const Index patch_cols = rhs.patchCols();
-          const Index patch_rows = rhs.patchRows();
-
-          const Index startCol = rhs.colOffset();
-          const Index max_cols = std::min<Index>(ceil_div(peeled_k, patch_rows*patch_depth)+startCol, patch_cols);
-
-          for (Index c = startCol; c < max_cols; ++c) {
-            eigen_assert(k < peeled_k);
-            const Index startRow = (c == startCol) ? rhs.rowOffset() : 0;
-            const Index max_rows = std::min<Index>(ceil_div(peeled_k-c*patch_rows*patch_depth, patch_depth)+startRow, patch_rows);
-
-            const bool pad_col0 = dm0.padCol(c);
-            const bool pad_col1 = dm1.padCol(c);
-            const bool pad_col2 = dm2.padCol(c);
-            const bool pad_col3 = dm3.padCol(c);
-            for (Index r = startRow; r < max_rows; ++r) {
-              eigen_assert(k < peeled_k);
-              const bool pad0 = pad_col0 || dm0.padRow(r);
-              const bool pad1 = pad_col1 || dm1.padRow(r);
-              const bool pad2 = pad_col2 || dm2.padRow(r);
-              const bool pad3 = pad_col3 || dm3.padRow(r);
-
-              const Index idx0 = dm0.baseIndex(r, c);
-              const Index idx1 = dm1.baseIndex(r, c);
-              const Index idx2 = dm2.baseIndex(r, c);
-              const Index idx3 = dm3.baseIndex(r, c);
-
-              const Index startDepth = ((c == startCol) && (r == startRow)) ? rhs.depthOffset() : 0;
-              const Index max_depth = std::min<Index>(peeled_k-c*patch_rows*patch_depth-r*patch_depth+startDepth, patch_depth);
-              eigen_assert(max_depth % packet_size == 0);
-              for (Index d = startDepth; d < max_depth; d += packet_size) {
-                eigen_assert(k < peeled_k);
-                PacketBlock<Packet, 4> kernel;
-                kernel.packet[0] = pad0 ? pset1<Packet>(0) : rhs.packetNoPadding(d, idx0);
-                kernel.packet[1] = pad1 ? pset1<Packet>(0) : rhs.packetNoPadding(d, idx1);
-                kernel.packet[2] = pad2 ? pset1<Packet>(0) : rhs.packetNoPadding(d, idx2);
-                kernel.packet[3] = pad3 ? pset1<Packet>(0) : rhs.packetNoPadding(d, idx3);
-                ptranspose(kernel);
-                pstoreu(block+0*packet_size, kernel.packet[0]);
-                pstoreu(block+1*packet_size, kernel.packet[1]);
-                pstoreu(block+2*packet_size, kernel.packet[2]);
-                pstoreu(block+3*packet_size, kernel.packet[3]);
-                block+=4*packet_size;
-                k += packet_size;
-              }
-            }
-          }
-
-          for(; k<peeled_k; k+=packet_size) {
-            PacketBlock<Packet, 4> kernel;
-            kernel.packet[0] = dm0.loadPacketFast(k);
-            kernel.packet[1] = dm1.loadPacketFast(k);
-            kernel.packet[2] = dm2.loadPacketFast(k);
-            kernel.packet[3] = dm3.loadPacketFast(k);
-            ptranspose(kernel);
-            pstoreu(block+0*packet_size, kernel.packet[0]);
-            pstoreu(block+1*packet_size, kernel.packet[1]);
-            pstoreu(block+2*packet_size, kernel.packet[2]);
-            pstoreu(block+3*packet_size, kernel.packet[3]);
-            block+=4*packet_size;
-          }
-        }
-        else {
-          for(; k<peeled_k; k+=packet_size) {
-            PacketBlock<Packet, 4> kernel;
-            kernel.packet[0] = dm0.loadPacketStandard(k);
-            kernel.packet[1] = dm1.loadPacketStandard(k);
-            kernel.packet[2] = dm2.loadPacketStandard(k);
-            kernel.packet[3] = dm3.loadPacketStandard(k);
-            ptranspose(kernel);
-            pstoreu(block+0*packet_size, kernel.packet[0]);
-            pstoreu(block+1*packet_size, kernel.packet[1]);
-            pstoreu(block+2*packet_size, kernel.packet[2]);
-            pstoreu(block+3*packet_size, kernel.packet[3]);
-            block+=4*packet_size;
-          }
-        }
-      }
-      if (!rhs.nonStandardPatches()) {
-        for(; k<depth; k++)
-        {
-          block[0] = dm0.loadCoeffStandard(k);
-          block[1] = dm1.loadCoeffStandard(k);
-          block[2] = dm2.loadCoeffStandard(k);
-          block[3] = dm3.loadCoeffStandard(k);
-          block += 4;
-        }
-      }
-      else {
-        for(; k<depth; k++)
-        {
-          block[0] = dm0(k);
-          block[1] = dm1(k);
-          block[2] = dm2(k);
-          block[3] = dm3(k);
-          block += 4;
-        }
-      }
-    }
-
-    // copy the remaining columns one at a time (nr==1)
-    for(Index j2=packet_cols4; j2<cols; ++j2)
-    {
-      const SubMapper dm0 = rhs.getLinearMapper(0, j2);
-      for(Index k=0; k<depth; k++)
-      {
-        *block = dm0(k);
-        block += 1;
-      }
-    }
-  }
-};
-
-#endif  // EIGEN_VECTORIZE
-}  // end namespace internal
-
-
-/** SpatialConvolution
-  * \ingroup CXX11_NeuralNetworks_Module
-  *
-  * \brief Applies a 2D convolution over a multichannel input image.
-  *
-  * The input parameter is expected to be a tensor with a rank of 3 or more (channels, height, width, and optionally others)
-  * The kernel parameter is expected to be a 4D tensor (filters, channels, kernel_height, kernel_width)
-  * The input and the kernel must both be in col-major layout. The result will also be in col-major layout.
-  *
-  * If in_stride > 1, then applies convolution with holes (aka atrous convolution), sampling every in_stride input pixels.
-  *
-  * The result can be assigned to a tensor of rank equal to the rank of the input. The dimensions of the result will be filters, height, width (and others if applicable).
-  *
-  * It is possible to swap the order of the width and height dimensions provided that the same order is used in the input, the kernel, and the output.
-  *
-  */
-template <typename Input, typename Kernel>
-EIGEN_ALWAYS_INLINE
-static const typename internal::conditional<
-  internal::traits<Input>::Layout == ColMajor,
-  TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions>, const TensorContractionOp<const array<IndexPair<typename internal::traits<Input>::Index>, 1>, const TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, 2>, const Kernel>, const TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, 2>, const TensorImagePatchOp<Dynamic, Dynamic, const Input> > > >,
-  TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, internal::traits<Input>::NumDimensions>, const TensorContractionOp<const array<IndexPair<typename internal::traits<Input>::Index>, 1>, const TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, 2>, const TensorImagePatchOp<Dynamic, Dynamic, const Input> >, const TensorReshapingOp<const DSizes<typename internal::traits<Input>::Index, 2>, const Kernel> > > >::type
-SpatialConvolution(const Input& input, const Kernel& kernel, const DenseIndex stride = 1, const PaddingType padding_type = PADDING_SAME, const DenseIndex in_stride = 1) {
-
-  typedef typename internal::traits<Input>::Index TensorIndex;
-  TensorRef<Tensor<typename internal::traits<Input>::Scalar, internal::traits<Input>::NumDimensions, internal::traits<Input>::Layout, TensorIndex> > in(input);
-  TensorRef<Tensor<typename internal::traits<Kernel>::Scalar, internal::traits<Kernel>::NumDimensions, internal::traits<Kernel>::Layout, TensorIndex> > kern(kernel);
-
-  EIGEN_STATIC_ASSERT(internal::traits<Input>::Layout == internal::traits<Kernel>::Layout, YOU_MADE_A_PROGRAMMING_MISTAKE);
-  static const bool isColMajor = (internal::traits<Input>::Layout == ColMajor);
-
-  static const int NumDims = internal::traits<Input>::NumDimensions;
-
-  // Number of filters to apply. This is the same as the output depth of the result
-  const TensorIndex kernelFilters = isColMajor ? kern.dimensions()[0] : kern.dimensions()[3];
-  // Number of channels. This is the same as the input depth.
-  const TensorIndex kernelChannels = isColMajor ? kern.dimensions()[1] : kern.dimensions()[2];
-  const TensorIndex kernelRows = isColMajor ? kern.dimensions()[2] : kern.dimensions()[1];
-  const TensorIndex kernelCols = isColMajor ? kern.dimensions()[3] : kern.dimensions()[0];
-
-  const DenseIndex kernelRowsEff = kernelRows + (kernelRows - 1) * (in_stride - 1);
-  const DenseIndex kernelColsEff = kernelCols + (kernelCols - 1) * (in_stride - 1);
-
-  array<IndexPair<TensorIndex>, 1> contract_dims;
-  contract_dims[0] = IndexPair<TensorIndex>(1, 0);
-
-  const TensorIndex InputRows = isColMajor ? in.dimension(1) : in.dimension(NumDims - 2);
-  const TensorIndex InputCols = isColMajor ? in.dimension(2) : in.dimension(NumDims - 3);
-
-  TensorIndex out_height;
-  TensorIndex out_width;
-  switch (padding_type) {
-    case PADDING_VALID:
-      out_height = numext::ceil((InputRows - kernelRowsEff + 1.f) / static_cast<float>(stride));
-      out_width = numext::ceil((InputCols - kernelColsEff + 1.f) / static_cast<float>(stride));
-      break;
-    case PADDING_SAME:
-      out_height = numext::ceil(InputRows / static_cast<float>(stride));
-      out_width = numext::ceil(InputCols / static_cast<float>(stride));
-      break;
-    default:
-      eigen_assert(false && "unexpected padding");
-  }
-
-  // Molds the output of the patch extraction code into a 2d tensor:
-  // - the first dimension (dims[0]): the patch values to be multiplied with the kernels
-  // - the second dimension (dims[1]): everything else
-  DSizes<TensorIndex, 2> pre_contract_dims;
-  if (isColMajor) {
-    pre_contract_dims[0] = kernelChannels * kernelRows * kernelCols;
-    pre_contract_dims[1] = out_height * out_width;
-    for (int i = 3; i < NumDims; ++i) {
-      pre_contract_dims[1] *= in.dimension(i);
-    }
-  } else {
-    pre_contract_dims[1] = kernelChannels * kernelRows * kernelCols;
-    pre_contract_dims[0] = out_height * out_width;
-    for (int i = 0; i < NumDims - 3; ++i) {
-      pre_contract_dims[0] *= in.dimension(i);
-    }
-  }
-
-  // Molds the output of the contraction into the shape expected by the used
-  // (assuming this is ColMajor):
-  // - 1st dim: kernel filters
-  // - 2nd dim: output height
-  // - 3rd dim: output width
-  // - 4th dim and beyond: everything else including batch size
-  DSizes<TensorIndex, NumDims> post_contract_dims;
-  if (isColMajor) {
-    post_contract_dims[0] = kernelFilters;
-    post_contract_dims[1] = out_height;
-    post_contract_dims[2] = out_width;
-    for (int i = 3; i < NumDims; ++i) {
-      post_contract_dims[i] = in.dimension(i);
-    }
-  } else {
-    post_contract_dims[NumDims - 1] = kernelFilters;
-    post_contract_dims[NumDims - 2] = out_height;
-    post_contract_dims[NumDims - 3] = out_width;
-    for (int i = 0; i < NumDims - 3; ++i) {
-      post_contract_dims[i] = in.dimension(i);
-    }
-  }
-
-  DSizes<TensorIndex, 2> kernel_dims;
-  if (isColMajor) {
-    kernel_dims[0] = kernelFilters;
-    kernel_dims[1] = kernelChannels * kernelRows * kernelCols;
-  } else {
-    kernel_dims[0] = kernelChannels * kernelRows * kernelCols;
-    kernel_dims[1] = kernelFilters;
-  }
-  // TODO(yangke): choose() is defined in TensorContraction.h -- consider
-  // moving it to somewhere more "common".
-  return choose(Cond<internal::traits<Input>::Layout == ColMajor>(),
-                kernel.reshape(kernel_dims).contract(input.extract_image_patches(kernelRows, kernelCols, stride, stride, in_stride, in_stride, padding_type).reshape(pre_contract_dims), contract_dims).reshape(post_contract_dims),
-                input.extract_image_patches(kernelRows, kernelCols, stride, stride, in_stride, in_stride, padding_type).reshape(pre_contract_dims).contract(kernel.reshape(kernel_dims), contract_dims).reshape(post_contract_dims));
-}
-
-} // end namespace Eigen
-
-#endif // EIGEN_CXX11_NEURAL_NETWORKS_SPATIAL_CONVOLUTIONS_H
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/TensorConvolutionByFFT.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/TensorConvolutionByFFT.h
deleted file mode 100644
index 0e72173536..0000000000
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/NeuralNetworks/TensorConvolutionByFFT.h
+++ /dev/null
@@ -1,289 +0,0 @@
-// This file is part of Eigen, a lightweight C++ template library
-// for linear algebra.
-//
-// Copyright (C) 2014 Benoit Steiner <benoit.steiner.goog@gmail.com>
-// Copyright (C) 2015 Jianwei Cui <thucjw@gmail.com>
-//
-// This Source Code Form is subject to the terms of the Mozilla
-// Public License v. 2.0. If a copy of the MPL was not distributed
-// with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
-
-#ifndef EIGEN_CXX11_TENSOR_TENSOR_CONVOLUTIONBYFFT_H
-#define EIGEN_CXX11_TENSOR_TENSOR_CONVOLUTIONBYFFT_H
-
-namespace Eigen {
-
-/** \class TensorConvolutionByFFT
-  * \ingroup CXX11_Tensor_Module
-  *
-  * \brief Tensor convolution class.
-  *
-  *
-  */
-namespace internal {
-
-
-template<typename Dimensions, typename InputXprType, typename KernelXprType>
-struct traits<TensorConvolutionByFFTOp<Dimensions, InputXprType, KernelXprType> >
-{
-  // Type promotion to handle the case where the types of the lhs and the rhs are different.
-  typedef typename promote_storage_type<typename InputXprType::Scalar,
-                                        typename KernelXprType::Scalar>::ret Scalar;
-  typedef typename packet_traits<Scalar>::type Packet;
-  typedef typename promote_storage_type<typename traits<InputXprType>::StorageKind,
-                                        typename traits<KernelXprType>::StorageKind>::ret StorageKind;
-  typedef typename promote_index_type<typename traits<InputXprType>::Index,
-                                      typename traits<KernelXprType>::Index>::type Index;
-  typedef typename InputXprType::Nested LhsNested;
-  typedef typename KernelXprType::Nested RhsNested;
-  typedef typename remove_reference<LhsNested>::type _LhsNested;
-  typedef typename remove_reference<RhsNested>::type _RhsNested;
-  static const int NumDimensions = traits<InputXprType>::NumDimensions;
-  static const int Layout = traits<InputXprType>::Layout;
-
-  enum {
-    Flags = 0,
-  };
-};
-
-template<typename Dimensions, typename InputXprType, typename KernelXprType>
-struct eval<TensorConvolutionByFFTOp<Dimensions, InputXprType, KernelXprType>, Eigen::Dense>
-{
-  typedef const TensorConvolutionByFFTOp<Dimensions, InputXprType, KernelXprType>& type;
-};
-
-template<typename Dimensions, typename InputXprType, typename KernelXprType>
-struct nested<TensorConvolutionByFFTOp<Dimensions, InputXprType, KernelXprType>, 1, typename eval<TensorConvolutionByFFTOp<Dimensions, InputXprType, KernelXprType> >::type>
-{
-  typedef TensorConvolutionByFFTOp<Dimensions, InputXprType, KernelXprType> type;
-};
-
-}  // end namespace internal
-
-
-
-template<typename Indices, typename InputXprType, typename KernelXprType>
-class TensorConvolutionByFFTOp : public TensorBase<TensorConvolutionByFFTOp<Indices, InputXprType, KernelXprType> >
-{
-  public:
-  typedef typename Eigen::internal::traits<TensorConvolutionByFFTOp>::Scalar Scalar;
-  typedef typename Eigen::internal::traits<TensorConvolutionByFFTOp>::Packet Packet;
-  typedef typename Eigen::NumTraits<Scalar>::Real RealScalar;
-  typedef typename internal::promote_storage_type<typename InputXprType::CoeffReturnType,
-                                                  typename KernelXprType::CoeffReturnType>::ret CoeffReturnType;
-  typedef typename internal::promote_storage_type<typename InputXprType::PacketReturnType,
-                                                  typename KernelXprType::PacketReturnType>::ret PacketReturnType;
-  typedef typename Eigen::internal::nested<TensorConvolutionByFFTOp>::type Nested;
-  typedef typename Eigen::internal::traits<TensorConvolutionByFFTOp>::StorageKind StorageKind;
-  typedef typename Eigen::internal::traits<TensorConvolutionByFFTOp>::Index Index;
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE TensorConvolutionByFFTOp(const InputXprType& input, const KernelXprType& kernel, const Indices& dims)
-      : m_input_xpr(input), m_kernel_xpr(kernel), m_indices(dims) {}
-
-    EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
-    const Indices& indices() const { return m_indices; }
-
-    /** \returns the nested expressions */
-    EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
-    const typename internal::remove_all<typename InputXprType::Nested>::type&
-    inputExpression() const { return m_input_xpr; }
-
-    EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
-    const typename internal::remove_all<typename KernelXprType::Nested>::type&
-    kernelExpression() const { return m_kernel_xpr; }
-
-  protected:
-    typename InputXprType::Nested m_input_xpr;
-    typename KernelXprType::Nested m_kernel_xpr;
-    const Indices m_indices;
-};
-
-
-template<typename Indices, typename InputArgType, typename KernelArgType, typename Device>
-struct TensorEvaluator<const TensorConvolutionByFFTOp<Indices, InputArgType, KernelArgType>, Device>
-{
-  typedef TensorConvolutionByFFTOp<Indices, InputArgType, KernelArgType> XprType;
-
-  typedef typename XprType::Scalar Scalar;
-  typedef typename XprType::CoeffReturnType CoeffReturnType;
-  typedef typename XprType::PacketReturnType PacketReturnType;
-
-  typedef typename Eigen::NumTraits<Scalar>::Real RealScalar;
-
-  static const int NumDims = internal::array_size<typename TensorEvaluator<InputArgType, Device>::Dimensions>::value;
-  static const int NumKernelDims = internal::array_size<Indices>::value;
-  typedef typename XprType::Index Index;
-  typedef DSizes<Index, NumDims> Dimensions;
-
-  enum {
-    IsAligned = TensorEvaluator<InputArgType, Device>::IsAligned &
-                TensorEvaluator<KernelArgType, Device>::IsAligned,
-    PacketAccess = false,
-    BlockAccess = false,
-    Layout = TensorEvaluator<InputArgType, Device>::Layout,
-    CoordAccess = false,  // to be implemented
-  };
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE TensorEvaluator(const XprType& op, const Device& device)
-      : m_inputImpl(op.inputExpression(), device), m_kernelImpl(op.kernelExpression(), device), m_kernelArg(op.kernelExpression()), m_kernel(NULL), m_local_kernel(false), m_device(device)
-  {
-    EIGEN_STATIC_ASSERT((static_cast<int>(TensorEvaluator<InputArgType, Device>::Layout) == static_cast<int>(TensorEvaluator<KernelArgType, Device>::Layout)), YOU_MADE_A_PROGRAMMING_MISTAKE);
-
-    const typename TensorEvaluator<InputArgType, Device>::Dimensions& input_dims = m_inputImpl.dimensions();
-    const typename TensorEvaluator<KernelArgType, Device>::Dimensions& kernel_dims = m_kernelImpl.dimensions();
-
-    if (static_cast<int>(Layout) == static_cast<int>(ColMajor)) {
-      m_inputStride[0] = 1;
-      for (int i = 1; i < NumDims; ++i) {
-        m_inputStride[i] = m_inputStride[i - 1] * input_dims[i - 1];
-      }
-    } else {
-      m_inputStride[NumDims - 1] = 1;
-      for (int i = NumDims - 2; i >= 0; --i) {
-        m_inputStride[i] = m_inputStride[i + 1] * input_dims[i + 1];
-      }
-    }
-
-    m_dimensions = m_inputImpl.dimensions();
-    if (static_cast<int>(Layout) == static_cast<int>(ColMajor)) {
-      for (int i = 0; i < NumKernelDims; ++i) {
-        const Index index = op.indices()[i];
-        const Index input_dim = input_dims[index];
-        const Index kernel_dim = kernel_dims[i];
-        const Index result_dim = input_dim - kernel_dim + 1;
-        m_dimensions[index] = result_dim;
-        if (i > 0) {
-          m_kernelStride[i] = m_kernelStride[i - 1] * kernel_dims[i - 1];
-        } else {
-          m_kernelStride[0] = 1;
-        }
-        m_indexStride[i] = m_inputStride[index];
-      }
-
-      m_outputStride[0] = 1;
-      for (int i = 1; i < NumDims; ++i) {
-        m_outputStride[i] = m_outputStride[i - 1] * m_dimensions[i - 1];
-      }
-    } else {
-      for (int i = NumKernelDims - 1; i >= 0; --i) {
-        const Index index = op.indices()[i];
-        const Index input_dim = input_dims[index];
-        const Index kernel_dim = kernel_dims[i];
-        const Index result_dim = input_dim - kernel_dim + 1;
-        m_dimensions[index] = result_dim;
-        if (i < NumKernelDims - 1) {
-          m_kernelStride[i] = m_kernelStride[i + 1] * kernel_dims[i + 1];
-        } else {
-          m_kernelStride[NumKernelDims - 1] = 1;
-        }
-        m_indexStride[i] = m_inputStride[index];
-      }
-
-      m_outputStride[NumDims - 1] = 1;
-      for (int i = NumDims - 2; i >= 0; --i) {
-        m_outputStride[i] = m_outputStride[i + 1] * m_dimensions[i + 1];
-      }
-    }
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const Dimensions& dimensions() const { return m_dimensions; }
-
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE bool evalSubExprsIfNeeded(Scalar* data) {
-    m_inputImpl.evalSubExprsIfNeeded(NULL);
-    m_kernelImpl.evalSubExprsIfNeeded(NULL);
-
-    typedef typename internal::traits<InputArgType>::Index TensorIndex;
-
-    Tensor<Scalar, NumDims, Layout, TensorIndex> input(m_inputImpl.dimensions());
-    for (int i = 0; i < m_inputImpl.dimensions().TotalSize(); ++i) {
-      input.data()[i] = m_inputImpl.coeff(i);
-    }
-
-    Tensor<Scalar, NumDims, Layout, TensorIndex> kernel(m_kernelImpl.dimensions());
-    for (int i = 0; i < m_kernelImpl.dimensions().TotalSize(); ++i) {
-      kernel.data()[i] = m_kernelImpl.coeff(i);
-    }
-
-    array<std::pair<ptrdiff_t, ptrdiff_t>, NumDims> paddings;
-    for (int i = 0; i < NumDims; ++i) {
-      paddings[i] = std::make_pair(0, m_inputImpl.dimensions()[i] - m_kernelImpl.dimensions()[i]);
-    }
-
-    Eigen::array<bool, NumKernelDims> reverse;
-    for (int i = 0; i < NumKernelDims; ++i) {
-      reverse[i] = true;
-    }
-
-    Eigen::array<bool, NumDims> fft;
-    for (int i = 0; i < NumDims; ++i) {
-      fft[i] = i;
-    }
-
-    Eigen::DSizes<TensorIndex, NumDims> slice_offsets;
-    for (int i = 0; i < NumDims; ++i) {
-      slice_offsets[i] = m_kernelImpl.dimensions()[i] - 1;
-    }
-
-    Eigen::DSizes<TensorIndex, NumDims> slice_extents;
-    for (int i = 0; i < NumDims; ++i) {
-      slice_extents[i] = m_inputImpl.dimensions()[i] - m_kernelImpl.dimensions()[i] + 1;
-    }
-
-    Tensor<Scalar, NumDims, Layout, TensorIndex> kernel_variant =  kernel.reverse(reverse).pad(paddings);
-    Tensor<std::complex<Scalar>, NumDims, Layout, TensorIndex> kernel_fft =  kernel_variant.template fft<Eigen::BothParts, FFT_FORWARD>(fft);
-    //Tensor<std::complex<Scalar>, NumDims, Layout|IndexType> kernel_fft =  kernel.reverse(reverse).pad(paddings).template fft<2>(fft);
-    Tensor<std::complex<Scalar>, NumDims, Layout, TensorIndex> input_fft = input.template fft<Eigen::BothParts, FFT_FORWARD>(fft);
-    Tensor<std::complex<Scalar>, NumDims, Layout, TensorIndex> prod = (input_fft * kernel_fft).template fft<Eigen::BothParts, FFT_REVERSE>(fft);
-    Tensor<std::complex<Scalar>, NumDims, Layout, TensorIndex> tensor_result = prod.slice(slice_offsets, slice_extents);
-
-    for (int i = 0; i < tensor_result.size(); ++i) {
-      data[i] = std::real(tensor_result.data()[i]);
-    }
-    return false;
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void cleanup() {
-    m_inputImpl.cleanup();
-    if (m_local_kernel) {
-      m_device.deallocate((void*)m_kernel);
-      m_local_kernel = false;
-    }
-    m_kernel = NULL;
-  }
-
-  void evalTo(typename XprType::Scalar* buffer) {
-    evalSubExprsIfNeeded(NULL);
-    for (int i = 0; i < dimensions().TotalSize(); ++i) {
-      buffer[i] += coeff(i);
-    }
-    cleanup();
-  }
-
-  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE CoeffReturnType coeff(Index index) const
-  {
-    CoeffReturnType result = CoeffReturnType(0);
-    return result;
-  }
-
-  EIGEN_DEVICE_FUNC Scalar* data() const { return NULL; }
-
- private:
-  array<Index, NumDims> m_inputStride;
-  array<Index, NumDims> m_outputStride;
-
-  array<Index, NumKernelDims> m_indexStride;
-  array<Index, NumKernelDims> m_kernelStride;
-  TensorEvaluator<InputArgType, Device> m_inputImpl;
-  TensorEvaluator<KernelArgType, Device> m_kernelImpl;
-  Dimensions m_dimensions;
-
-  KernelArgType m_kernelArg;
-  const Scalar* m_kernel;
-  bool m_local_kernel;
-  const Device& m_device;
-};
-
-} // end namespace Eigen
-
-#endif // EIGEN_CXX11_TENSOR_TENSOR_CONVOLUTIONBYFFT_H
diff --git a/third_party/examples/eager/spinn/spinn.py b/third_party/examples/eager/spinn/spinn.py
index 67456a5bdf..de63ebe9e6 100644
--- a/third_party/examples/eager/spinn/spinn.py
+++ b/third_party/examples/eager/spinn/spinn.py
@@ -419,7 +419,7 @@ class SNLIClassifierTrainer(tfe.Checkpointable):
     # Create a custom learning rate Variable for the RMSProp optimizer, because
     # the learning rate needs to be manually decayed later (see
     # decay_learning_rate()).
-    self._learning_rate = tfe.Variable(lr, name="learning_rate")
+    self._learning_rate = tf.Variable(lr, name="learning_rate")
     self._optimizer = tf.train.RMSPropOptimizer(self._learning_rate,
                                                 epsilon=1e-6)
 
@@ -626,7 +626,7 @@ def train_or_infer_spinn(embed,
     model = SNLIClassifier(config, embed)
     global_step = tf.train.get_or_create_global_step()
     trainer = SNLIClassifierTrainer(model, config.lr)
-    checkpoint = tfe.Checkpoint(trainer=trainer, global_step=global_step)
+    checkpoint = tf.train.Checkpoint(trainer=trainer, global_step=global_step)
     checkpoint.restore(tf.train.latest_checkpoint(config.logdir))
 
     if inference_sentence_pair:
diff --git a/third_party/gpus/crosstool/BUILD.tpl b/third_party/gpus/crosstool/BUILD.tpl
index 98cb326572..f638756d23 100644
--- a/third_party/gpus/crosstool/BUILD.tpl
+++ b/third_party/gpus/crosstool/BUILD.tpl
@@ -7,6 +7,7 @@ cc_toolchain_suite(
     toolchains = {
         "local|compiler": ":cc-compiler-local",
         "darwin|compiler": ":cc-compiler-darwin",
+        "x64_windows|msvc-cl": ":cc-compiler-windows",
     },
 )
 
@@ -42,6 +43,20 @@ cc_toolchain(
     supports_param_files = 0,
 )
 
+cc_toolchain(
+    name = "cc-compiler-windows",
+    all_files = "%{win_linker_files}",
+    compiler_files = ":empty",
+    cpu = "x64_windows",
+    dwp_files = ":empty",
+    dynamic_runtime_libs = [":empty"],
+    linker_files = "%{win_linker_files}",
+    objcopy_files = ":empty",
+    static_runtime_libs = [":empty"],
+    strip_files = ":empty",
+    supports_param_files = 1,
+)
+
 filegroup(
     name = "empty",
     srcs = [],
@@ -51,3 +66,8 @@ filegroup(
     name = "crosstool_wrapper_driver_is_not_gcc",
     srcs = ["clang/bin/crosstool_wrapper_driver_is_not_gcc"],
 )
+
+filegroup(
+    name = "windows_msvc_wrapper_files",
+    srcs = glob(["windows/msvc_*"]),
+)
diff --git a/third_party/gpus/crosstool/CROSSTOOL.tpl b/third_party/gpus/crosstool/CROSSTOOL.tpl
index 1424ff6511..3972c96a2f 100644
--- a/third_party/gpus/crosstool/CROSSTOOL.tpl
+++ b/third_party/gpus/crosstool/CROSSTOOL.tpl
@@ -22,6 +22,10 @@ default_toolchain {
   cpu: "ppc"
   toolchain_identifier: "local_linux"
 }
+default_toolchain {
+  cpu: "x64_windows"
+  toolchain_identifier: "local_windows"
+}
 
 toolchain {
   abi_version: "local"
@@ -537,3 +541,868 @@ toolchain {
 
 %{host_compiler_includes}
 }
+
+toolchain {
+  toolchain_identifier: "local_windows"
+  host_system_name: "local"
+  target_system_name: "local"
+
+  abi_version: "local"
+  abi_libc_version: "local"
+  target_cpu: "x64_windows"
+  compiler: "msvc-cl"
+  target_libc: "msvcrt"
+
+%{cxx_builtin_include_directory}
+
+  tool_path {
+    name: "ar"
+    path: "%{msvc_lib_path}"
+  }
+  tool_path {
+    name: "ml"
+    path: "%{msvc_ml_path}"
+  }
+  tool_path {
+    name: "cpp"
+    path: "%{msvc_cl_path}"
+  }
+  tool_path {
+    name: "gcc"
+    path: "%{msvc_cl_path}"
+  }
+  tool_path {
+    name: "gcov"
+    path: "wrapper/bin/msvc_nop.bat"
+  }
+  tool_path {
+    name: "ld"
+    path: "%{msvc_link_path}"
+  }
+  tool_path {
+    name: "nm"
+    path: "wrapper/bin/msvc_nop.bat"
+  }
+  tool_path {
+    name: "objcopy"
+    path: "wrapper/bin/msvc_nop.bat"
+  }
+  tool_path {
+    name: "objdump"
+    path: "wrapper/bin/msvc_nop.bat"
+  }
+  tool_path {
+    name: "strip"
+    path: "wrapper/bin/msvc_nop.bat"
+  }
+  supports_interface_shared_objects: true
+
+  # TODO(pcloudy): Review those flags below, they should be defined by cl.exe
+  compiler_flag: "/DCOMPILER_MSVC"
+
+  # Don't define min/max macros in windows.h.
+  compiler_flag: "/DNOMINMAX"
+
+  # Platform defines.
+  compiler_flag: "/D_WIN32_WINNT=0x0600"
+  # Turn off warning messages.
+  compiler_flag: "/D_CRT_SECURE_NO_DEPRECATE"
+  compiler_flag: "/D_CRT_SECURE_NO_WARNINGS"
+  compiler_flag: "/D_SILENCE_STDEXT_HASH_DEPRECATION_WARNINGS"
+
+  # Useful options to have on for compilation.
+  # Increase the capacity of object files to 2^32 sections.
+  compiler_flag: "/bigobj"
+  # Allocate 500MB for precomputed headers.
+  compiler_flag: "/Zm500"
+  # Use unsigned char by default.
+  compiler_flag: "/J"
+  # Use function level linking.
+  compiler_flag: "/Gy"
+  # Use string pooling.
+  compiler_flag: "/GF"
+  # Catch C++ exceptions only and tell the compiler to assume that functions declared
+  # as extern "C" never throw a C++ exception.
+  compiler_flag: "/EHsc"
+
+  # Globally disabled warnings.
+  # Don't warn about elements of array being be default initialized.
+  compiler_flag: "/wd4351"
+  # Don't warn about no matching delete found.
+  compiler_flag: "/wd4291"
+  # Don't warn about diamond inheritance patterns.
+  compiler_flag: "/wd4250"
+  # Don't warn about insecure functions (e.g. non _s functions).
+  compiler_flag: "/wd4996"
+
+  linker_flag: "/MACHINE:X64"
+
+  feature {
+    name: "no_legacy_features"
+  }
+
+  # Suppress startup banner.
+  feature {
+    name: "nologo"
+    flag_set {
+      action: "c-compile"
+      action: "c++-compile"
+      action: "c++-module-compile"
+      action: "c++-module-codegen"
+      action: "c++-header-parsing"
+      action: "assemble"
+      action: "preprocess-assemble"
+      action: "c++-link-executable"
+      action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
+      action: "c++-link-static-library"
+      flag_group {
+        flag: "/nologo"
+      }
+    }
+  }
+
+  feature {
+    name: 'has_configured_linker_path'
+  }
+
+  # This feature indicates strip is not supported, building stripped binary will just result a copy of orignial binary
+  feature {
+    name: 'no_stripping'
+  }
+
+  # This feature indicates this is a toolchain targeting Windows.
+  feature {
+    name: 'targets_windows'
+    implies: 'copy_dynamic_libraries_to_binary'
+    enabled: true
+  }
+
+  feature {
+    name: 'copy_dynamic_libraries_to_binary'
+  }
+
+  action_config {
+    config_name: 'assemble'
+    action_name: 'assemble'
+    tool {
+      tool_path: '%{msvc_ml_path}'
+    }
+    implies: 'compiler_input_flags'
+    implies: 'compiler_output_flags'
+    implies: 'nologo'
+    implies: 'msvc_env'
+    implies: 'sysroot'
+  }
+
+  action_config {
+    config_name: 'preprocess-assemble'
+    action_name: 'preprocess-assemble'
+    tool {
+      tool_path: '%{msvc_ml_path}'
+    }
+    implies: 'compiler_input_flags'
+    implies: 'compiler_output_flags'
+    implies: 'nologo'
+    implies: 'msvc_env'
+    implies: 'sysroot'
+  }
+
+  action_config {
+    config_name: 'c-compile'
+    action_name: 'c-compile'
+    tool {
+      tool_path: '%{msvc_cl_path}'
+    }
+    implies: 'compiler_input_flags'
+    implies: 'compiler_output_flags'
+    implies: 'legacy_compile_flags'
+    implies: 'nologo'
+    implies: 'msvc_env'
+    implies: 'parse_showincludes'
+    implies: 'user_compile_flags'
+    implies: 'sysroot'
+    implies: 'unfiltered_compile_flags'
+  }
+
+  action_config {
+    config_name: 'c++-compile'
+    action_name: 'c++-compile'
+    tool {
+      tool_path: '%{msvc_cl_path}'
+    }
+    implies: 'compiler_input_flags'
+    implies: 'compiler_output_flags'
+    implies: 'legacy_compile_flags'
+    implies: 'nologo'
+    implies: 'msvc_env'
+    implies: 'parse_showincludes'
+    implies: 'user_compile_flags'
+    implies: 'sysroot'
+    implies: 'unfiltered_compile_flags'
+  }
+
+  action_config {
+    config_name: 'c++-link-executable'
+    action_name: 'c++-link-executable'
+    tool {
+      tool_path: '%{msvc_link_path}'
+    }
+    implies: 'nologo'
+    implies: 'linkstamps'
+    implies: 'output_execpath_flags'
+    implies: 'input_param_flags'
+    implies: 'user_link_flags'
+    implies: 'legacy_link_flags'
+    implies: 'linker_subsystem_flag'
+    implies: 'linker_param_file'
+    implies: 'msvc_env'
+    implies: 'no_stripping'
+  }
+
+  action_config {
+    config_name: 'c++-link-dynamic-library'
+    action_name: 'c++-link-dynamic-library'
+    tool {
+      tool_path: '%{msvc_link_path}'
+    }
+    implies: 'nologo'
+    implies: 'shared_flag'
+    implies: 'linkstamps'
+    implies: 'output_execpath_flags'
+    implies: 'input_param_flags'
+    implies: 'user_link_flags'
+    implies: 'legacy_link_flags'
+    implies: 'linker_subsystem_flag'
+    implies: 'linker_param_file'
+    implies: 'msvc_env'
+    implies: 'no_stripping'
+    implies: 'has_configured_linker_path'
+    implies: 'def_file'
+  }
+
+  action_config {
+      config_name: 'c++-link-nodeps-dynamic-library'
+      action_name: 'c++-link-nodeps-dynamic-library'
+      tool {
+        tool_path: '%{msvc_link_path}'
+      }
+      implies: 'nologo'
+      implies: 'shared_flag'
+      implies: 'linkstamps'
+      implies: 'output_execpath_flags'
+      implies: 'input_param_flags'
+      implies: 'user_link_flags'
+      implies: 'legacy_link_flags'
+      implies: 'linker_subsystem_flag'
+      implies: 'linker_param_file'
+      implies: 'msvc_env'
+      implies: 'no_stripping'
+      implies: 'has_configured_linker_path'
+      implies: 'def_file'
+    }
+
+  action_config {
+    config_name: 'c++-link-static-library'
+    action_name: 'c++-link-static-library'
+    tool {
+      tool_path: '%{msvc_lib_path}'
+    }
+    implies: 'nologo'
+    implies: 'archiver_flags'
+    implies: 'input_param_flags'
+    implies: 'linker_param_file'
+    implies: 'msvc_env'
+  }
+
+  # TODO(b/65151735): Remove legacy_compile_flags feature when legacy fields are
+  # not used in this crosstool
+  feature {
+    name: 'legacy_compile_flags'
+    flag_set {
+      expand_if_all_available: 'legacy_compile_flags'
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-header-parsing'
+      action: 'c++-module-compile'
+      action: 'c++-module-codegen'
+      flag_group {
+        iterate_over: 'legacy_compile_flags'
+        flag: '%{legacy_compile_flags}'
+      }
+    }
+  }
+
+  feature {
+    name: "msvc_env"
+    env_set {
+      action: "c-compile"
+      action: "c++-compile"
+      action: "c++-module-compile"
+      action: "c++-module-codegen"
+      action: "c++-header-parsing"
+      action: "assemble"
+      action: "preprocess-assemble"
+      action: "c++-link-executable"
+      action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
+      action: "c++-link-static-library"
+      env_entry {
+        key: "PATH"
+        value: "%{msvc_env_path}"
+      }
+      env_entry {
+        key: "INCLUDE"
+        value: "%{msvc_env_include}"
+      }
+      env_entry {
+        key: "LIB"
+        value: "%{msvc_env_lib}"
+      }
+      env_entry {
+        key: "TMP"
+        value: "%{msvc_env_tmp}"
+      }
+      env_entry {
+        key: "TEMP"
+        value: "%{msvc_env_tmp}"
+      }
+    }
+  }
+
+  feature {
+    name: 'include_paths'
+    flag_set {
+      action: "assemble"
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-header-parsing'
+      action: 'c++-module-compile'
+      flag_group {
+        iterate_over: 'quote_include_paths'
+        flag: '/I%{quote_include_paths}'
+      }
+      flag_group {
+        iterate_over: 'include_paths'
+        flag: '/I%{include_paths}'
+      }
+      flag_group {
+        iterate_over: 'system_include_paths'
+        flag: '/I%{system_include_paths}'
+      }
+    }
+  }
+
+  feature {
+    name: "preprocessor_defines"
+    flag_set {
+      action: "assemble"
+      action: "preprocess-assemble"
+      action: "c-compile"
+      action: "c++-compile"
+      action: "c++-header-parsing"
+      action: "c++-module-compile"
+      flag_group {
+        flag: "/D%{preprocessor_defines}"
+        iterate_over: "preprocessor_defines"
+      }
+    }
+  }
+
+  # Tell Bazel to parse the output of /showIncludes
+  feature {
+    name: 'parse_showincludes'
+    flag_set {
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-module-compile'
+      action: 'c++-header-parsing'
+      flag_group {
+        flag: "/showIncludes"
+      }
+    }
+  }
+
+
+  feature {
+    name: 'generate_pdb_file'
+    requires: {
+      feature: 'dbg'
+    }
+    requires: {
+      feature: 'fastbuild'
+    }
+  }
+
+  feature {
+    name: 'shared_flag'
+    flag_set {
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: '/DLL'
+      }
+    }
+  }
+
+  feature {
+    name: 'linkstamps'
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      expand_if_all_available: 'linkstamp_paths'
+      flag_group {
+        iterate_over: 'linkstamp_paths'
+        flag: '%{linkstamp_paths}'
+      }
+    }
+  }
+
+  feature {
+    name: 'output_execpath_flags'
+    flag_set {
+      expand_if_all_available: 'output_execpath'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: '/OUT:%{output_execpath}'
+      }
+    }
+  }
+
+  feature {
+    name: 'archiver_flags'
+    flag_set {
+      expand_if_all_available: 'output_execpath'
+      action: 'c++-link-static-library'
+      flag_group {
+        flag: '/OUT:%{output_execpath}'
+      }
+    }
+  }
+
+  feature {
+    name: 'input_param_flags'
+    flag_set {
+      expand_if_all_available: 'interface_library_output_path'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/IMPLIB:%{interface_library_output_path}"
+      }
+    }
+    flag_set {
+      expand_if_all_available: 'libopts'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        iterate_over: 'libopts'
+        flag: '%{libopts}'
+      }
+    }
+    flag_set {
+      expand_if_all_available: 'libraries_to_link'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      action: 'c++-link-static-library'
+      flag_group {
+        iterate_over: 'libraries_to_link'
+        flag_group {
+          expand_if_equal: {
+            variable: 'libraries_to_link.type'
+            value: 'object_file_group'
+          }
+          iterate_over: 'libraries_to_link.object_files'
+          flag_group {
+            flag: '%{libraries_to_link.object_files}'
+          }
+        }
+        flag_group {
+          expand_if_equal: {
+            variable: 'libraries_to_link.type'
+            value: 'object_file'
+          }
+          flag_group {
+            flag: '%{libraries_to_link.name}'
+          }
+        }
+        flag_group {
+          expand_if_equal: {
+            variable: 'libraries_to_link.type'
+            value: 'interface_library'
+          }
+          flag_group {
+            flag: '%{libraries_to_link.name}'
+          }
+        }
+        flag_group {
+          expand_if_equal: {
+            variable: 'libraries_to_link.type'
+            value: 'static_library'
+          }
+          flag_group {
+            expand_if_false: 'libraries_to_link.is_whole_archive'
+            flag: '%{libraries_to_link.name}'
+          }
+          flag_group {
+            expand_if_true: 'libraries_to_link.is_whole_archive'
+            flag: '/WHOLEARCHIVE:%{libraries_to_link.name}'
+          }
+        }
+      }
+    }
+  }
+
+  # Since this feature is declared earlier in the CROSSTOOL than
+  # "user_link_flags", this feature will be applied prior to it anwyhere they
+  # are both implied. And since "user_link_flags" contains the linkopts from
+  # the build rule, this allows the user to override the /SUBSYSTEM in the BUILD
+  # file.
+  feature {
+    name: 'linker_subsystem_flag'
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: '/SUBSYSTEM:CONSOLE'
+      }
+    }
+  }
+
+  # The "user_link_flags" contains user-defined linkopts (from build rules)
+  # so it should be defined after features that declare user-overridable flags.
+  # For example the "linker_subsystem_flag" defines a default "/SUBSYSTEM" flag
+  # but we want to let the user override it, therefore "link_flag_subsystem" is
+  # defined earlier in the CROSSTOOL file than "user_link_flags".
+  feature {
+    name: 'user_link_flags'
+    flag_set {
+      expand_if_all_available: 'user_link_flags'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        iterate_over: 'user_link_flags'
+        flag: '%{user_link_flags}'
+      }
+    }
+  }
+  feature {
+    name: 'legacy_link_flags'
+    flag_set {
+      expand_if_all_available: 'legacy_link_flags'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        iterate_over: 'legacy_link_flags'
+        flag: '%{legacy_link_flags}'
+      }
+    }
+  }
+
+  feature {
+    name: 'linker_param_file'
+    flag_set {
+      expand_if_all_available: 'linker_param_file'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      action: 'c++-link-static-library'
+      flag_group {
+        flag: '@%{linker_param_file}'
+      }
+    }
+  }
+
+  feature {
+    name: 'static_link_msvcrt'
+  }
+
+  feature {
+    name: 'static_link_msvcrt_no_debug'
+    flag_set {
+      action: 'c-compile'
+      action: 'c++-compile'
+      flag_group {
+        flag: "/MT"
+      }
+    }
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/DEFAULTLIB:libcmt.lib"
+      }
+    }
+    requires: { feature: 'fastbuild'}
+    requires: { feature: 'opt'}
+  }
+
+  feature {
+    name: 'dynamic_link_msvcrt_no_debug'
+    flag_set {
+      action: 'c-compile'
+      action: 'c++-compile'
+      flag_group {
+        flag: "/MD"
+      }
+    }
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/DEFAULTLIB:msvcrt.lib"
+      }
+    }
+    requires: { feature: 'fastbuild'}
+    requires: { feature: 'opt'}
+  }
+
+  feature {
+    name: 'static_link_msvcrt_debug'
+    flag_set {
+      action: 'c-compile'
+      action: 'c++-compile'
+      flag_group {
+        flag: "/MTd"
+      }
+    }
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/DEFAULTLIB:libcmtd.lib"
+      }
+    }
+    requires: { feature: 'dbg'}
+  }
+
+  feature {
+    name: 'dynamic_link_msvcrt_debug'
+    flag_set {
+      action: 'c-compile'
+      action: 'c++-compile'
+      flag_group {
+        flag: "/MDd"
+      }
+    }
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/DEFAULTLIB:msvcrtd.lib"
+      }
+    }
+    requires: { feature: 'dbg'}
+  }
+
+  feature {
+    name: 'dbg'
+    flag_set {
+      action: 'c-compile'
+      action: 'c++-compile'
+      flag_group {
+        flag: "/Od"
+        flag: "/Z7"
+        flag: "/DDEBUG"
+      }
+    }
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/DEBUG:FULL"
+        flag: "/INCREMENTAL:NO"
+      }
+    }
+    implies: 'generate_pdb_file'
+  }
+
+  feature {
+    name: 'fastbuild'
+    flag_set {
+      action: 'c-compile'
+      action: 'c++-compile'
+      flag_group {
+        flag: "/Od"
+        flag: "/Z7"
+        flag: "/DDEBUG"
+      }
+    }
+    flag_set {
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/DEBUG:FASTLINK"
+        flag: "/INCREMENTAL:NO"
+      }
+    }
+    implies: 'generate_pdb_file'
+  }
+
+  feature {
+    name: 'opt'
+    flag_set {
+      action: 'c-compile'
+      action: 'c++-compile'
+      flag_group {
+        flag: "/O2"
+        flag: "/DNDEBUG"
+      }
+    }
+  }
+
+  feature {
+    name: 'user_compile_flags'
+    flag_set {
+      expand_if_all_available: 'user_compile_flags'
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-header-parsing'
+      action: 'c++-module-compile'
+      action: 'c++-module-codegen'
+      flag_group {
+        iterate_over: 'user_compile_flags'
+        flag: '%{user_compile_flags}'
+      }
+    }
+  }
+
+  feature {
+    name: 'sysroot'
+    flag_set {
+      expand_if_all_available: 'sysroot'
+      action: 'assemble'
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-header-parsing'
+      action: 'c++-module-compile'
+      action: 'c++-module-codegen'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        iterate_over: 'sysroot'
+        flag: '--sysroot=%{sysroot}'
+      }
+    }
+  }
+
+  feature {
+    name: 'unfiltered_compile_flags'
+    flag_set {
+      expand_if_all_available: 'unfiltered_compile_flags'
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-header-parsing'
+      action: 'c++-module-compile'
+      action: 'c++-module-codegen'
+      flag_group {
+        iterate_over: 'unfiltered_compile_flags'
+        flag: '%{unfiltered_compile_flags}'
+      }
+    }
+  }
+
+  feature {
+    name: 'compiler_output_flags'
+    flag_set {
+      action: 'assemble'
+      flag_group {
+        expand_if_all_available: 'output_file'
+        expand_if_none_available: 'output_assembly_file'
+        expand_if_none_available: 'output_preprocess_file'
+        flag: '/Fo%{output_file}'
+        flag: '/Zi'
+      }
+    }
+    flag_set {
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-header-parsing'
+      action: 'c++-module-compile'
+      action: 'c++-module-codegen'
+      flag_group {
+        expand_if_all_available: 'output_file'
+        expand_if_none_available: 'output_assembly_file'
+        expand_if_none_available: 'output_preprocess_file'
+        flag: '/Fo%{output_file}'
+      }
+      flag_group {
+        expand_if_all_available: 'output_file'
+        expand_if_all_available: 'output_assembly_file'
+        flag: '/Fa%{output_file}'
+      }
+      flag_group {
+        expand_if_all_available: 'output_file'
+        expand_if_all_available: 'output_preprocess_file'
+        flag: '/P'
+        flag: '/Fi%{output_file}'
+      }
+    }
+  }
+
+  feature {
+    name: 'compiler_input_flags'
+    flag_set {
+      action: 'assemble'
+      action: 'preprocess-assemble'
+      action: 'c-compile'
+      action: 'c++-compile'
+      action: 'c++-header-parsing'
+      action: 'c++-module-compile'
+      action: 'c++-module-codegen'
+      flag_group {
+        expand_if_all_available: 'source_file'
+        flag: '/c'
+        flag: '%{source_file}'
+      }
+    }
+  }
+
+  feature {
+    name : 'def_file',
+    flag_set {
+      expand_if_all_available: 'def_file_path'
+      action: 'c++-link-executable'
+      action: 'c++-link-dynamic-library'
+      action: "c++-link-nodeps-dynamic-library"
+      flag_group {
+        flag: "/DEF:%{def_file_path}"
+        # We can specify a different DLL name in DEF file, /ignore:4070 suppresses
+        # the warning message about DLL name doesn't match the default one.
+        # See https://msdn.microsoft.com/en-us/library/sfkk2fz7.aspx
+        flag: "/ignore:4070"
+      }
+    }
+  }
+
+  feature {
+    name: 'windows_export_all_symbols'
+  }
+
+  feature {
+    name: 'no_windows_export_all_symbols'
+  }
+
+  linking_mode_flags { mode: DYNAMIC }
+}
diff --git a/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl b/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
index 2558f46fd5..f4f4d0ee96 100755
--- a/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
+++ b/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
@@ -175,6 +175,11 @@ def InvokeNvcc(argv, log=False):
   # any other reliable way to just get the list of source files to be compiled.
   src_files = GetOptionValue(argv, 'c')
 
+  # Pass -w through from host to nvcc, but don't do anything fancier with
+  # warnings-related flags, since they're not necessarily the same across
+  # compilers.
+  warning_options = ' -w' if '-w' in argv else ''
+
   if len(src_files) == 0:
     return 1
   if len(out_file) != 1:
@@ -205,6 +210,7 @@ def InvokeNvcc(argv, log=False):
   nvccopts += defines
   nvccopts += std_options
   nvccopts += m_options
+  nvccopts += warning_options
 
   if depfiles:
     # Generate the dependency file
diff --git a/third_party/gpus/crosstool/windows/msvc_wrapper_for_nvcc.bat.tpl b/third_party/gpus/crosstool/windows/msvc_wrapper_for_nvcc.bat.tpl
new file mode 100644
index 0000000000..8f8fb3e423
--- /dev/null
+++ b/third_party/gpus/crosstool/windows/msvc_wrapper_for_nvcc.bat.tpl
@@ -0,0 +1,20 @@
+:: Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+::
+:: Licensed under the Apache License, Version 2.0 (the "License");
+:: you may not use this file except in compliance with the License.
+:: You may obtain a copy of the License at
+::
+::     http://www.apache.org/licenses/LICENSE-2.0
+::
+:: Unless required by applicable law or agreed to in writing, software
+:: distributed under the License is distributed on an "AS IS" BASIS,
+:: WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+:: See the License for the specific language governing permissions and
+:: limitations under the License.
+:: =============================================================================
+
+:: Invoke msvc_wrapper_for_nvcc.py, which is located in the same directory.
+@echo OFF
+set arg0=%~0
+for %%F in ("%arg0%") do set DRIVER_BIN=%%~dpF
+"%{python_binary}" -B "%DRIVER_BIN%\msvc_wrapper_for_nvcc.py" %*
diff --git a/third_party/gpus/crosstool/windows/msvc_wrapper_for_nvcc.py.tpl b/third_party/gpus/crosstool/windows/msvc_wrapper_for_nvcc.py.tpl
new file mode 100644
index 0000000000..1a09756813
--- /dev/null
+++ b/third_party/gpus/crosstool/windows/msvc_wrapper_for_nvcc.py.tpl
@@ -0,0 +1,192 @@
+#!/usr/bin/env python
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Crosstool wrapper for compiling CUDA programs with nvcc on Windows.
+
+DESCRIPTION:
+  This script is the Windows version of //third_party/gpus/crosstool/crosstool_wrapper_is_not_gcc
+"""
+
+from __future__ import print_function
+
+from argparse import ArgumentParser
+import os
+import subprocess
+import re
+import sys
+import pipes
+
+# Template values set by cuda_autoconf.
+CPU_COMPILER = ('%{cpu_compiler}')
+GCC_HOST_COMPILER_PATH = ('%{gcc_host_compiler_path}')
+
+NVCC_PATH = '%{nvcc_path}'
+NVCC_VERSION = '%{cuda_version}'
+NVCC_TEMP_DIR = "%{nvcc_tmp_dir}"
+supported_cuda_compute_capabilities = [ %{cuda_compute_capabilities} ]
+
+def Log(s):
+  print('gpus/crosstool: {0}'.format(s))
+
+
+def GetOptionValue(argv, option):
+  """Extract the list of values for option from options.
+
+  Args:
+    option: The option whose value to extract, without the leading '/'.
+
+  Returns:
+    1. A list of values, either directly following the option,
+    (eg., /opt val1 val2) or values collected from multiple occurrences of
+    the option (eg., /opt val1 /opt val2).
+    2. The leftover options.
+  """
+
+  parser = ArgumentParser(prefix_chars='/')
+  parser.add_argument('/' + option, nargs='*', action='append')
+  args, leftover = parser.parse_known_args(argv)
+  if args and vars(args)[option]:
+    return (sum(vars(args)[option], []), leftover)
+  return ([], leftover)
+
+def _update_options(nvcc_options):
+  if NVCC_VERSION in ("7.0",):
+    return nvcc_options
+
+  update_options = { "relaxed-constexpr" : "expt-relaxed-constexpr" }
+  return [ update_options[opt] if opt in update_options else opt
+                    for opt in nvcc_options ]
+
+def GetNvccOptions(argv):
+  """Collect the -nvcc_options values from argv.
+
+  Args:
+    argv: A list of strings, possibly the argv passed to main().
+
+  Returns:
+    1. The string that can be passed directly to nvcc.
+    2. The leftover options.
+  """
+
+  parser = ArgumentParser()
+  parser.add_argument('-nvcc_options', nargs='*', action='append')
+
+  args, leftover = parser.parse_known_args(argv)
+
+  if args.nvcc_options:
+    options = _update_options(sum(args.nvcc_options, []))
+    return (['--' + a for a in options], leftover)
+  return ([], leftover)
+
+
+def InvokeNvcc(argv, log=False):
+  """Call nvcc with arguments assembled from argv.
+
+  Args:
+    argv: A list of strings, possibly the argv passed to main().
+    log: True if logging is requested.
+
+  Returns:
+    The return value of calling os.system('nvcc ' + args)
+  """
+
+  src_files = [f for f in argv if
+               re.search('\.cpp$|\.cc$|\.c$|\.cxx$|\.C$', f)]
+  if len(src_files) == 0:
+    raise Error('No source files found for cuda compilation.')
+
+  out_file = [ f for f in argv if f.startswith('/Fo') ]
+  if len(out_file) != 1:
+    raise Error('Please sepecify exactly one output file for cuda compilation.')
+  out = ['-o', out_file[0][len('/Fo'):]]
+
+  nvcc_compiler_options, argv = GetNvccOptions(argv)
+
+  opt_option, argv = GetOptionValue(argv, 'O')
+  opt = ['-g', '-G']
+  if (len(opt_option) > 0 and opt_option[0] != 'd'):
+    opt = ['-O2']
+
+  include_options, argv = GetOptionValue(argv, 'I')
+  includes = ["-I " + include for include in include_options]
+
+  defines, argv = GetOptionValue(argv, 'D')
+  defines = ['-D' + define for define in defines]
+
+  undefines, argv = GetOptionValue(argv, 'U')
+  undefines = ['-U' + define for define in undefines]
+
+  # The rest of the unrecongized options should be passed to host compiler
+  host_compiler_options = [option for option in argv if option not in (src_files + out_file)]
+
+  m_options = ["-m64"]
+
+  nvccopts = ['-D_FORCE_INLINES']
+  for capability in supported_cuda_compute_capabilities:
+    capability = capability.replace('.', '')
+    nvccopts += [r'-gencode=arch=compute_%s,"code=sm_%s,compute_%s"' % (
+        capability, capability, capability)]
+  nvccopts += nvcc_compiler_options
+  nvccopts += undefines
+  nvccopts += defines
+  nvccopts += m_options
+  nvccopts += ['--compiler-options="' + " ".join(host_compiler_options) + '"']
+  nvccopts += ['-x', 'cu'] + opt + includes + out + ['-c'] + src_files
+  # If we don't specify --keep-dir, nvcc will generate intermediate files under TEMP
+  # Put them under NVCC_TEMP_DIR instead, then Bazel can ignore files under NVCC_TEMP_DIR during dependency check
+  # http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-guiding-compiler-driver
+  # Different actions are sharing NVCC_TEMP_DIR, so we cannot remove it if the directory already exists.
+  if os.path.isfile(NVCC_TEMP_DIR):
+    os.remove(NVCC_TEMP_DIR)
+  if not os.path.exists(NVCC_TEMP_DIR):
+    os.makedirs(NVCC_TEMP_DIR)
+  nvccopts += ['--keep', '--keep-dir', NVCC_TEMP_DIR]
+  cmd = [NVCC_PATH] + nvccopts
+  if log:
+    Log(cmd)
+  proc = subprocess.Popen(cmd,
+                          stdout=sys.stdout,
+                          stderr=sys.stderr,
+                          env=os.environ.copy(),
+                          shell=True)
+  proc.wait()
+  return proc.returncode
+
+def main():
+  parser = ArgumentParser()
+  parser.add_argument('-x', nargs=1)
+  parser.add_argument('--cuda_log', action='store_true')
+  args, leftover = parser.parse_known_args(sys.argv[1:])
+
+  if args.x and args.x[0] == 'cuda':
+    if args.cuda_log: Log('-x cuda')
+    leftover = [pipes.quote(s) for s in leftover]
+    if args.cuda_log: Log('using nvcc')
+    return InvokeNvcc(leftover, log=args.cuda_log)
+
+  # Strip our flags before passing through to the CPU compiler for files which
+  # are not -x cuda. We can't just pass 'leftover' because it also strips -x.
+  # We not only want to pass -x to the CPU compiler, but also keep it in its
+  # relative location in the argv list (the compiler is actually sensitive to
+  # this).
+  cpu_compiler_flags = [flag for flag in sys.argv[1:]
+                             if not flag.startswith(('--cuda_log'))
+                             and not flag.startswith(('-nvcc_options'))]
+
+  return subprocess.call([CPU_COMPILER] + cpu_compiler_flags)
+
+if __name__ == '__main__':
+  sys.exit(main())
diff --git a/third_party/gpus/cuda/BUILD.windows.tpl b/third_party/gpus/cuda/BUILD.windows.tpl
new file mode 100644
index 0000000000..ff6b3cc351
--- /dev/null
+++ b/third_party/gpus/cuda/BUILD.windows.tpl
@@ -0,0 +1,163 @@
+licenses(["restricted"])  # MPL2, portions GPL v3, LGPL v3, BSD-like
+
+package(default_visibility = ["//visibility:public"])
+
+config_setting(
+    name = "using_nvcc",
+    values = {
+        "define": "using_cuda_nvcc=true",
+    },
+)
+
+config_setting(
+    name = "using_clang",
+    values = {
+        "define": "using_cuda_clang=true",
+    },
+)
+
+# Equivalent to using_clang && -c opt.
+config_setting(
+    name = "using_clang_opt",
+    values = {
+        "define": "using_cuda_clang=true",
+        "compilation_mode": "opt",
+    },
+)
+
+config_setting(
+    name = "darwin",
+    values = {"cpu": "darwin"},
+    visibility = ["//visibility:public"],
+)
+
+config_setting(
+    name = "freebsd",
+    values = {"cpu": "freebsd"},
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "cuda_headers",
+    hdrs = [
+        "cuda/cuda_config.h",
+        %{cuda_headers}
+    ],
+    includes = [
+        ".",
+        "cuda/include",
+        "cuda/include/crt",
+    ],
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cudart_static",
+    # /WHOLEARCHIVE:cudart_static.lib will cause a
+    # "Internal error during CImplib::EmitThunk" error.
+    # Treat this library as interface library to avoid being whole archived when
+    # linking a DLL that depends on this.
+    # TODO(pcloudy): Remove this rule after b/111278841 is resolved.
+    interface_library = "cuda/lib/%{cudart_static_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cuda_driver",
+    interface_library = "cuda/lib/%{cuda_driver_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cudart",
+    interface_library = "cuda/lib/%{cudart_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cublas",
+    interface_library = "cuda/lib/%{cublas_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cusolver",
+    interface_library = "cuda/lib/%{cusolver_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cudnn",
+    interface_library = "cuda/lib/%{cudnn_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "cudnn_header",
+    includes = [
+        ".",
+        "cuda/include",
+    ],
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cufft",
+    interface_library = "cuda/lib/%{cufft_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "curand",
+    interface_library = "cuda/lib/%{curand_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "cuda",
+    visibility = ["//visibility:public"],
+    deps = [
+        ":cublas",
+        ":cuda_headers",
+        ":cudart",
+        ":cudnn",
+        ":cufft",
+        ":curand",
+    ],
+)
+
+cc_library(
+    name = "cupti_headers",
+    hdrs = [
+        "cuda/cuda_config.h",
+        ":cuda-extras",
+    ],
+    includes = [
+        ".",
+        "cuda/extras/CUPTI/include/",
+    ],
+    visibility = ["//visibility:public"],
+)
+
+cc_import(
+    name = "cupti_dsos",
+    interface_library = "cuda/lib/%{cupti_lib}",
+    system_provided = 1,
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "libdevice_root",
+    data = [":cuda-nvvm"],
+    visibility = ["//visibility:public"],
+)
+
+%{cuda_include_genrules}
diff --git a/third_party/gpus/cuda_configure.bzl b/third_party/gpus/cuda_configure.bzl
index c90c66912d..e848fa175c 100644
--- a/third_party/gpus/cuda_configure.bzl
+++ b/third_party/gpus/cuda_configure.bzl
@@ -20,6 +20,7 @@
     `/usr/local/cuda`.
   * `TF_CUDA_COMPUTE_CAPABILITIES`: The CUDA compute capabilities. Default is
     `3.5,5.2`.
+  * `PYTHON_BIN_PATH`: The python binary path
 """
 
 _GCC_HOST_COMPILER_PATH = "GCC_HOST_COMPILER_PATH"
@@ -31,6 +32,7 @@ _CUDNN_INSTALL_PATH = "CUDNN_INSTALL_PATH"
 _TF_CUDA_COMPUTE_CAPABILITIES = "TF_CUDA_COMPUTE_CAPABILITIES"
 _TF_CUDA_CONFIG_REPO = "TF_CUDA_CONFIG_REPO"
 _TF_DOWNLOAD_CLANG = "TF_DOWNLOAD_CLANG"
+_PYTHON_BIN_PATH = "PYTHON_BIN_PATH"
 
 _DEFAULT_CUDA_VERSION = ""
 _DEFAULT_CUDNN_VERSION = ""
@@ -44,12 +46,12 @@ _DEFAULT_CUDA_COMPUTE_CAPABILITIES = ["3.5", "5.2"]
 # will be used. For example, when looking for the cudart libraries, the first
 # attempt will be lib64/cudart inside the CUDA toolkit.
 CUDA_LIB_PATHS = [
-  "lib64/",
-  "lib64/stubs/",
-  "lib/x86_64-linux-gnu/",
-  "lib/x64/",
-  "lib/",
-  "",
+    "lib64/",
+    "lib64/stubs/",
+    "lib/x86_64-linux-gnu/",
+    "lib/x64/",
+    "lib/",
+    "",
 ]
 
 # Lookup paths for cupti.h, relative to the CUDA toolkit directory.
@@ -57,8 +59,8 @@ CUDA_LIB_PATHS = [
 # On most systems, the cupti library is not installed in the same directory as
 # the other CUDA libraries but rather in a special extras/CUPTI directory.
 CUPTI_HEADER_PATHS = [
-  "extras/CUPTI/include/",
-  "include/cuda/CUPTI/",
+    "extras/CUPTI/include/",
+    "include/cuda/CUPTI/",
 ]
 
 # Lookup paths for the cupti library, relative to the
@@ -66,25 +68,25 @@ CUPTI_HEADER_PATHS = [
 # On most systems, the cupti library is not installed in the same directory as
 # the other CUDA libraries but rather in a special extras/CUPTI directory.
 CUPTI_LIB_PATHS = [
-  "extras/CUPTI/lib64/",
-  "lib/x86_64-linux-gnu",
-  "lib64/",
-  "extras/CUPTI/libx64/",
-  "extras/CUPTI/lib/",
-  "lib/",
+    "extras/CUPTI/lib64/",
+    "lib/x86_64-linux-gnu",
+    "lib64/",
+    "extras/CUPTI/libx64/",
+    "extras/CUPTI/lib/",
+    "lib/",
 ]
 
 # Lookup paths for CUDA headers (cuda.h) relative to the CUDA toolkit directory.
 CUDA_INCLUDE_PATHS = [
-  "include/",
-  "include/cuda/"
+    "include/",
+    "include/cuda/",
 ]
 
 # Lookup paths for cudnn.h relative to the CUDNN install directory.
 CUDNN_INCLUDE_PATHS = [
-  "",
-  "include/",
-  "include/cuda/",
+    "",
+    "include/",
+    "include/cuda/",
 ]
 
 # Lookup paths for NVVM libdevice relative to the CUDA directory toolkit.
@@ -92,686 +94,841 @@ CUDNN_INCLUDE_PATHS = [
 # libdevice implements mathematical functions for GPU kernels, and is provided
 # in NVVM bitcode (a subset of LLVM bitcode).
 NVVM_LIBDEVICE_PATHS = [
-  "nvvm/libdevice/",
-  "share/cuda/",
+    "nvvm/libdevice/",
+    "share/cuda/",
+]
+
+# Files used to detect the NVVM libdevice path.
+NVVM_LIBDEVICE_FILES = [
+    # CUDA 9.0 has a single file.
+    "libdevice.10.bc",
+
+    # CUDA 8.0 has separate files for compute versions 2.0, 3.0, 3.5 and 5.0.
+    # Probing for one of them is sufficient.
+    "libdevice.compute_20.10.bc",
 ]
 
 load("//third_party/clang_toolchain:download_clang.bzl", "download_clang")
+load(
+    "@bazel_tools//tools/cpp:lib_cc_configure.bzl",
+    "escape_string",
+    "get_env_var",
+)
+load(
+    "@bazel_tools//tools/cpp:windows_cc_configure.bzl",
+    "find_msvc_tool",
+    "find_vc_path",
+    "setup_vc_env_vars",
+)
+
+def _get_python_bin(repository_ctx):
+    """Gets the python bin path."""
+    python_bin = repository_ctx.os.environ.get(_PYTHON_BIN_PATH)
+    if python_bin != None:
+        return python_bin
+    python_bin_name = "python.exe" if _is_windows(repository_ctx) else "python"
+    python_bin_path = repository_ctx.which(python_bin_name)
+    if python_bin_path != None:
+        return str(python_bin_path)
+    auto_configure_fail("Cannot find python in PATH, please make sure " +
+                        "python is installed and add its directory in PATH, or --define " +
+                        "%s='/something/else'.\nPATH=%s" % (
+                            _PYTHON_BIN_PATH,
+                            repository_ctx.os.environ.get("PATH", ""),
+                        ))
+
+def _get_nvcc_tmp_dir_for_windows(repository_ctx):
+    """Return the tmp directory for nvcc to generate intermediate source files."""
+    escaped_tmp_dir = escape_string(
+        get_env_var(repository_ctx, "TMP", "C:\\Windows\\Temp").replace("\\", "\\\\"),
+    )
+    return escaped_tmp_dir + "\\\\nvcc_inter_files_tmp_dir"
+
+def _get_msvc_compiler(repository_ctx):
+    vc_path = find_vc_path(repository_ctx)
+    return find_msvc_tool(repository_ctx, vc_path, "cl.exe").replace("\\", "/")
+
+def _get_win_cuda_defines(repository_ctx):
+    """Return CROSSTOOL defines for Windows"""
+
+    # If we are not on Windows, return empty vaules for Windows specific fields.
+    # This ensures the CROSSTOOL file parser is happy.
+    if not _is_windows(repository_ctx):
+        return {
+            "%{msvc_env_tmp}": "",
+            "%{msvc_env_path}": "",
+            "%{msvc_env_include}": "",
+            "%{msvc_env_lib}": "",
+            "%{msvc_cl_path}": "",
+            "%{msvc_ml_path}": "",
+            "%{msvc_link_path}": "",
+            "%{msvc_lib_path}": "",
+            "%{cxx_builtin_include_directory}": "",
+        }
+
+    vc_path = find_vc_path(repository_ctx)
+    if not vc_path:
+        auto_configure_fail("Visual C++ build tools not found on your machine." +
+                            "Please check your installation following https://docs.bazel.build/versions/master/windows.html#using")
+        return {}
+
+    env = setup_vc_env_vars(repository_ctx, vc_path)
+    escaped_paths = escape_string(env["PATH"])
+    escaped_include_paths = escape_string(env["INCLUDE"])
+    escaped_lib_paths = escape_string(env["LIB"])
+    escaped_tmp_dir = escape_string(
+        get_env_var(repository_ctx, "TMP", "C:\\Windows\\Temp").replace("\\", "\\\\"),
+    )
+
+    msvc_cl_path = "windows/msvc_wrapper_for_nvcc.bat"
+    msvc_ml_path = find_msvc_tool(repository_ctx, vc_path, "ml64.exe").replace("\\", "/")
+    msvc_link_path = find_msvc_tool(repository_ctx, vc_path, "link.exe").replace("\\", "/")
+    msvc_lib_path = find_msvc_tool(repository_ctx, vc_path, "lib.exe").replace("\\", "/")
+
+    # nvcc will generate some temporary source files under %{nvcc_tmp_dir}
+    # The generated files are guranteed to have unique name, so they can share the same tmp directory
+    escaped_cxx_include_directories = ["cxx_builtin_include_directory: \"%s\"" % _get_nvcc_tmp_dir_for_windows(repository_ctx)]
+    for path in escaped_include_paths.split(";"):
+        if path:
+            escaped_cxx_include_directories.append("cxx_builtin_include_directory: \"%s\"" % path)
+
+    return {
+        "%{msvc_env_tmp}": escaped_tmp_dir,
+        "%{msvc_env_path}": escaped_paths,
+        "%{msvc_env_include}": escaped_include_paths,
+        "%{msvc_env_lib}": escaped_lib_paths,
+        "%{msvc_cl_path}": msvc_cl_path,
+        "%{msvc_ml_path}": msvc_ml_path,
+        "%{msvc_link_path}": msvc_link_path,
+        "%{msvc_lib_path}": msvc_lib_path,
+        "%{cxx_builtin_include_directory}": "\n".join(escaped_cxx_include_directories),
+    }
 
 # TODO(dzc): Once these functions have been factored out of Bazel's
 # cc_configure.bzl, load them from @bazel_tools instead.
 # BEGIN cc_configure common functions.
 def find_cc(repository_ctx):
-  """Find the C++ compiler."""
-  # On Windows, we use Bazel's MSVC CROSSTOOL for GPU build
-  # Return a dummy value for GCC detection here to avoid error
-  if _is_windows(repository_ctx):
-    return "/use/--config=win-cuda --cpu=x64_windows_msvc/instead"
-
-  if _use_cuda_clang(repository_ctx):
-    target_cc_name = "clang"
-    cc_path_envvar = _CLANG_CUDA_COMPILER_PATH
-    if _flag_enabled(repository_ctx, _TF_DOWNLOAD_CLANG):
-      return "extra_tools/bin/clang"
-  else:
-    target_cc_name = "gcc"
-    cc_path_envvar = _GCC_HOST_COMPILER_PATH
-  cc_name = target_cc_name
-
-  if cc_path_envvar in repository_ctx.os.environ:
-    cc_name_from_env = repository_ctx.os.environ[cc_path_envvar].strip()
-    if cc_name_from_env:
-      cc_name = cc_name_from_env
-  if cc_name.startswith("/"):
-    # Absolute path, maybe we should make this supported by our which function.
-    return cc_name
-  cc = repository_ctx.which(cc_name)
-  if cc == None:
-    fail(("Cannot find {}, either correct your path or set the {}" +
-          " environment variable").format(target_cc_name, cc_path_envvar))
-  return cc
-
+    """Find the C++ compiler."""
+    if _is_windows(repository_ctx):
+        return _get_msvc_compiler(repository_ctx)
+
+    if _use_cuda_clang(repository_ctx):
+        target_cc_name = "clang"
+        cc_path_envvar = _CLANG_CUDA_COMPILER_PATH
+        if _flag_enabled(repository_ctx, _TF_DOWNLOAD_CLANG):
+            return "extra_tools/bin/clang"
+    else:
+        target_cc_name = "gcc"
+        cc_path_envvar = _GCC_HOST_COMPILER_PATH
+    cc_name = target_cc_name
+
+    if cc_path_envvar in repository_ctx.os.environ:
+        cc_name_from_env = repository_ctx.os.environ[cc_path_envvar].strip()
+        if cc_name_from_env:
+            cc_name = cc_name_from_env
+    if cc_name.startswith("/"):
+        # Absolute path, maybe we should make this supported by our which function.
+        return cc_name
+    cc = repository_ctx.which(cc_name)
+    if cc == None:
+        fail(("Cannot find {}, either correct your path or set the {}" +
+              " environment variable").format(target_cc_name, cc_path_envvar))
+    return cc
 
 _INC_DIR_MARKER_BEGIN = "#include <...>"
 
-
 # OSX add " (framework directory)" at the end of line, strip it.
 _OSX_FRAMEWORK_SUFFIX = " (framework directory)"
-_OSX_FRAMEWORK_SUFFIX_LEN =  len(_OSX_FRAMEWORK_SUFFIX)
-def _cxx_inc_convert(path):
-  """Convert path returned by cc -E xc++ in a complete path."""
-  path = path.strip()
-  if path.endswith(_OSX_FRAMEWORK_SUFFIX):
-    path = path[:-_OSX_FRAMEWORK_SUFFIX_LEN].strip()
-  return path
+_OSX_FRAMEWORK_SUFFIX_LEN = len(_OSX_FRAMEWORK_SUFFIX)
 
+def _cxx_inc_convert(path):
+    """Convert path returned by cc -E xc++ in a complete path."""
+    path = path.strip()
+    if path.endswith(_OSX_FRAMEWORK_SUFFIX):
+        path = path[:-_OSX_FRAMEWORK_SUFFIX_LEN].strip()
+    return path
 
 def _normalize_include_path(repository_ctx, path):
-  """Normalizes include paths before writing them to the crosstool.
+    """Normalizes include paths before writing them to the crosstool.
 
-  If path points inside the 'crosstool' folder of the repository, a relative
-  path is returned.
-  If path points outside the 'crosstool' folder, an absolute path is returned.
-  """
-  path = str(repository_ctx.path(path))
-  crosstool_folder = str(repository_ctx.path(".").get_child('crosstool'))
-
-  if path.startswith(crosstool_folder):
-    # We drop the path to "$REPO/crosstool" and a trailing path separator.
-    return path[len(crosstool_folder)+1:]
-  return path
+    If path points inside the 'crosstool' folder of the repository, a relative
+    path is returned.
+    If path points outside the 'crosstool' folder, an absolute path is returned.
+    """
+    path = str(repository_ctx.path(path))
+    crosstool_folder = str(repository_ctx.path(".").get_child("crosstool"))
 
+    if path.startswith(crosstool_folder):
+        # We drop the path to "$REPO/crosstool" and a trailing path separator.
+        return path[len(crosstool_folder) + 1:]
+    return path
 
 def _get_cxx_inc_directories_impl(repository_ctx, cc, lang_is_cpp):
-  """Compute the list of default C or C++ include directories."""
-  if lang_is_cpp:
-    lang = "c++"
-  else:
-    lang = "c"
-  result = repository_ctx.execute([cc, "-E", "-x" + lang, "-", "-v"])
-  index1 = result.stderr.find(_INC_DIR_MARKER_BEGIN)
-  if index1 == -1:
-    return []
-  index1 = result.stderr.find("\n", index1)
-  if index1 == -1:
-    return []
-  index2 = result.stderr.rfind("\n ")
-  if index2 == -1 or index2 < index1:
-    return []
-  index2 = result.stderr.find("\n", index2 + 1)
-  if index2 == -1:
-    inc_dirs = result.stderr[index1 + 1:]
-  else:
-    inc_dirs = result.stderr[index1 + 1:index2].strip()
-
-  return [
-      _normalize_include_path(repository_ctx, _cxx_inc_convert(p))
-      for p in inc_dirs.split("\n")
-  ]
+    """Compute the list of default C or C++ include directories."""
+    if lang_is_cpp:
+        lang = "c++"
+    else:
+        lang = "c"
+    result = repository_ctx.execute([cc, "-E", "-x" + lang, "-", "-v"])
+    index1 = result.stderr.find(_INC_DIR_MARKER_BEGIN)
+    if index1 == -1:
+        return []
+    index1 = result.stderr.find("\n", index1)
+    if index1 == -1:
+        return []
+    index2 = result.stderr.rfind("\n ")
+    if index2 == -1 or index2 < index1:
+        return []
+    index2 = result.stderr.find("\n", index2 + 1)
+    if index2 == -1:
+        inc_dirs = result.stderr[index1 + 1:]
+    else:
+        inc_dirs = result.stderr[index1 + 1:index2].strip()
 
+    return [
+        _normalize_include_path(repository_ctx, _cxx_inc_convert(p))
+        for p in inc_dirs.split("\n")
+    ]
 
 def get_cxx_inc_directories(repository_ctx, cc):
-  """Compute the list of default C and C++ include directories."""
-  # For some reason `clang -xc` sometimes returns include paths that are
-  # different from the ones from `clang -xc++`. (Symlink and a dir)
-  # So we run the compiler with both `-xc` and `-xc++` and merge resulting lists
-  includes_cpp = _get_cxx_inc_directories_impl(repository_ctx, cc, True)
-  includes_c = _get_cxx_inc_directories_impl(repository_ctx, cc, False)
+    """Compute the list of default C and C++ include directories."""
 
-  includes_cpp_set = depset(includes_cpp)
-  return includes_cpp + [inc for inc in includes_c
-                         if inc not in includes_cpp_set]
+    # For some reason `clang -xc` sometimes returns include paths that are
+    # different from the ones from `clang -xc++`. (Symlink and a dir)
+    # So we run the compiler with both `-xc` and `-xc++` and merge resulting lists
+    includes_cpp = _get_cxx_inc_directories_impl(repository_ctx, cc, True)
+    includes_c = _get_cxx_inc_directories_impl(repository_ctx, cc, False)
 
+    includes_cpp_set = depset(includes_cpp)
+    return includes_cpp + [
+        inc
+        for inc in includes_c
+        if inc not in includes_cpp_set
+    ]
 
 def auto_configure_fail(msg):
-  """Output failure message when cuda configuration fails."""
-  red = "\033[0;31m"
-  no_color = "\033[0m"
-  fail("\n%sCuda Configuration Error:%s %s\n" % (red, no_color, msg))
-# END cc_configure common functions (see TODO above).
+    """Output failure message when cuda configuration fails."""
+    red = "\033[0;31m"
+    no_color = "\033[0m"
+    fail("\n%sCuda Configuration Error:%s %s\n" % (red, no_color, msg))
 
+# END cc_configure common functions (see TODO above).
 
 def _host_compiler_includes(repository_ctx, cc):
-  """Generates the cxx_builtin_include_directory entries for gcc inc dirs.
-
-  Args:
-    repository_ctx: The repository context.
-    cc: The path to the gcc host compiler.
-
-  Returns:
-    A string containing the cxx_builtin_include_directory for each of the gcc
-    host compiler include directories, which can be added to the CROSSTOOL
-    file.
-  """
-  inc_dirs = get_cxx_inc_directories(repository_ctx, cc)
-  inc_entries = []
-  for inc_dir in inc_dirs:
-    inc_entries.append("  cxx_builtin_include_directory: \"%s\"" % inc_dir)
-  return "\n".join(inc_entries)
+    """Generates the cxx_builtin_include_directory entries for gcc inc dirs.
+
+    Args:
+      repository_ctx: The repository context.
+      cc: The path to the gcc host compiler.
+
+    Returns:
+      A string containing the cxx_builtin_include_directory for each of the gcc
+      host compiler include directories, which can be added to the CROSSTOOL
+      file.
+    """
+    inc_dirs = get_cxx_inc_directories(repository_ctx, cc)
+    inc_entries = []
+    for inc_dir in inc_dirs:
+        inc_entries.append("  cxx_builtin_include_directory: \"%s\"" % inc_dir)
+    return "\n".join(inc_entries)
 
 def _cuda_include_path(repository_ctx, cuda_config):
-  """Generates the cxx_builtin_include_directory entries for cuda inc dirs.
-
-  Args:
-    repository_ctx: The repository context.
-    cc: The path to the gcc host compiler.
-
-  Returns:
-    A string containing the cxx_builtin_include_directory for each of the gcc
-    host compiler include directories, which can be added to the CROSSTOOL
-    file.
-  """
-  nvcc_path = repository_ctx.path("%s/bin/nvcc%s" %
-                                  (cuda_config.cuda_toolkit_path,
-                                   ".exe" if cuda_config.cpu_value == "Windows" else ""))
-  result = repository_ctx.execute([nvcc_path, '-v',
-                                  '/dev/null', '-o', '/dev/null'])
-  target_dir = ""
-  for one_line in result.stderr.splitlines():
-    if one_line.startswith('#$ _TARGET_DIR_='):
-      target_dir = (cuda_config.cuda_toolkit_path + '/' +
-                    one_line.replace('#$ _TARGET_DIR_=', '') + "/include")
-  inc_entries = []
-  if target_dir != "":
-    inc_entries.append("  cxx_builtin_include_directory: \"%s\"" % target_dir)
-  default_include = cuda_config.cuda_toolkit_path + '/include'
-  inc_entries.append("  cxx_builtin_include_directory: \"%s\"" %
-                     default_include)
-  return "\n".join(inc_entries)
+    """Generates the cxx_builtin_include_directory entries for cuda inc dirs.
 
+    Args:
+      repository_ctx: The repository context.
+      cc: The path to the gcc host compiler.
 
-def _enable_cuda(repository_ctx):
-  if "TF_NEED_CUDA" in repository_ctx.os.environ:
-    enable_cuda = repository_ctx.os.environ["TF_NEED_CUDA"].strip()
-    return enable_cuda == "1"
-  return False
+    Returns:
+      A string containing the cxx_builtin_include_directory for each of the gcc
+      host compiler include directories, which can be added to the CROSSTOOL
+      file.
+    """
+    nvcc_path = repository_ctx.path("%s/bin/nvcc%s" %
+                                    (
+                                        cuda_config.cuda_toolkit_path,
+                                        ".exe" if cuda_config.cpu_value == "Windows" else "",
+                                    ))
+    result = repository_ctx.execute([
+        nvcc_path,
+        "-v",
+        "/dev/null",
+        "-o",
+        "/dev/null",
+    ])
+    target_dir = ""
+    for one_line in result.stderr.splitlines():
+        if one_line.startswith("#$ _TARGET_DIR_="):
+            target_dir = (cuda_config.cuda_toolkit_path + "/" +
+                          one_line.replace("#$ _TARGET_DIR_=", "") + "/include")
+    inc_entries = []
+    if target_dir != "":
+        inc_entries.append("  cxx_builtin_include_directory: \"%s\"" % target_dir)
+    default_include = cuda_config.cuda_toolkit_path + "/include"
+    inc_entries.append("  cxx_builtin_include_directory: \"%s\"" %
+                       default_include)
+    return "\n".join(inc_entries)
 
+def _enable_cuda(repository_ctx):
+    if "TF_NEED_CUDA" in repository_ctx.os.environ:
+        enable_cuda = repository_ctx.os.environ["TF_NEED_CUDA"].strip()
+        return enable_cuda == "1"
+    return False
 
 def _cuda_toolkit_path(repository_ctx):
-  """Finds the cuda toolkit directory.
-
-  Args:
-    repository_ctx: The repository context.
+    """Finds the cuda toolkit directory.
 
-  Returns:
-    A speculative real path of the cuda toolkit install directory.
-  """
-  cuda_toolkit_path = _DEFAULT_CUDA_TOOLKIT_PATH
-  if _CUDA_TOOLKIT_PATH in repository_ctx.os.environ:
-    cuda_toolkit_path = repository_ctx.os.environ[_CUDA_TOOLKIT_PATH].strip()
-  if not repository_ctx.path(cuda_toolkit_path).exists:
-    auto_configure_fail("Cannot find cuda toolkit path.")
-  return str(repository_ctx.path(cuda_toolkit_path).realpath)
+    Args:
+      repository_ctx: The repository context.
 
+    Returns:
+      A speculative real path of the cuda toolkit install directory.
+    """
+    cuda_toolkit_path = _DEFAULT_CUDA_TOOLKIT_PATH
+    if _CUDA_TOOLKIT_PATH in repository_ctx.os.environ:
+        cuda_toolkit_path = repository_ctx.os.environ[_CUDA_TOOLKIT_PATH].strip()
+    if not repository_ctx.path(cuda_toolkit_path).exists:
+        auto_configure_fail("Cannot find cuda toolkit path.")
+    return str(repository_ctx.path(cuda_toolkit_path).realpath)
 
 def _cudnn_install_basedir(repository_ctx):
-  """Finds the cudnn install directory."""
-  cudnn_install_path = _DEFAULT_CUDNN_INSTALL_PATH
-  if _CUDNN_INSTALL_PATH in repository_ctx.os.environ:
-    cudnn_install_path = repository_ctx.os.environ[_CUDNN_INSTALL_PATH].strip()
-  if not repository_ctx.path(cudnn_install_path).exists:
-    auto_configure_fail("Cannot find cudnn install path.")
-  return cudnn_install_path
-
+    """Finds the cudnn install directory."""
+    cudnn_install_path = _DEFAULT_CUDNN_INSTALL_PATH
+    if _CUDNN_INSTALL_PATH in repository_ctx.os.environ:
+        cudnn_install_path = repository_ctx.os.environ[_CUDNN_INSTALL_PATH].strip()
+    if not repository_ctx.path(cudnn_install_path).exists:
+        auto_configure_fail("Cannot find cudnn install path.")
+    return cudnn_install_path
 
 def matches_version(environ_version, detected_version):
-  """Checks whether the user-specified version matches the detected version.
-
-  This function performs a weak matching so that if the user specifies only the
-  major or major and minor versions, the versions are still considered matching
-  if the version parts match. To illustrate:
-
-      environ_version  detected_version  result
-      -----------------------------------------
-      5.1.3            5.1.3             True
-      5.1              5.1.3             True
-      5                5.1               True
-      5.1.3            5.1               False
-      5.2.3            5.1.3             False
-
-  Args:
-    environ_version: The version specified by the user via environment
-      variables.
-    detected_version: The version autodetected from the CUDA installation on
-      the system.
-
-  Returns: True if user-specified version matches detected version and False
-    otherwise.
-  """
-  environ_version_parts = environ_version.split(".")
-  detected_version_parts = detected_version.split(".")
-  if len(detected_version_parts) < len(environ_version_parts):
-    return False
-  for i, part in enumerate(detected_version_parts):
-    if i >= len(environ_version_parts):
-      break
-    if part != environ_version_parts[i]:
-      return False
-  return True
-
+    """Checks whether the user-specified version matches the detected version.
+
+    This function performs a weak matching so that if the user specifies only the
+    major or major and minor versions, the versions are still considered matching
+    if the version parts match. To illustrate:
+
+        environ_version  detected_version  result
+        -----------------------------------------
+        5.1.3            5.1.3             True
+        5.1              5.1.3             True
+        5                5.1               True
+        5.1.3            5.1               False
+        5.2.3            5.1.3             False
+
+    Args:
+      environ_version: The version specified by the user via environment
+        variables.
+      detected_version: The version autodetected from the CUDA installation on
+        the system.
+
+    Returns: True if user-specified version matches detected version and False
+      otherwise.
+    """
+    environ_version_parts = environ_version.split(".")
+    detected_version_parts = detected_version.split(".")
+    if len(detected_version_parts) < len(environ_version_parts):
+        return False
+    for i, part in enumerate(detected_version_parts):
+        if i >= len(environ_version_parts):
+            break
+        if part != environ_version_parts[i]:
+            return False
+    return True
 
 _NVCC_VERSION_PREFIX = "Cuda compilation tools, release "
 
-
 def _cuda_version(repository_ctx, cuda_toolkit_path, cpu_value):
-  """Detects the version of CUDA installed on the system.
-
-  Args:
-    repository_ctx: The repository context.
-    cuda_toolkit_path: The CUDA install directory.
-
-  Returns:
-    String containing the version of CUDA.
-  """
-  # Run nvcc --version and find the line containing the CUDA version.
-  nvcc_path = repository_ctx.path("%s/bin/nvcc%s" %
-                                  (cuda_toolkit_path,
-                                   ".exe" if cpu_value == "Windows" else ""))
-  if not nvcc_path.exists:
-    auto_configure_fail("Cannot find nvcc at %s" % str(nvcc_path))
-  result = repository_ctx.execute([str(nvcc_path), '--version'])
-  if result.stderr:
-    auto_configure_fail("Error running nvcc --version: %s" % result.stderr)
-  lines = result.stdout.splitlines()
-  version_line = lines[len(lines) - 1]
-  if version_line.find(_NVCC_VERSION_PREFIX) == -1:
-    auto_configure_fail(
-        "Could not parse CUDA version from nvcc --version. Got: %s" %
-        result.stdout)
-
-  # Parse the CUDA version from the line containing the CUDA version.
-  prefix_removed = version_line.replace(_NVCC_VERSION_PREFIX, '')
-  parts = prefix_removed.split(",")
-  if len(parts) != 2 or len(parts[0]) < 2:
-    auto_configure_fail(
-        "Could not parse CUDA version from nvcc --version. Got: %s" %
-        result.stdout)
-  full_version = parts[1].strip()
-  if full_version.startswith('V'):
-    full_version = full_version[1:]
-
-  # Check whether TF_CUDA_VERSION was set by the user and fail if it does not
-  # match the detected version.
-  environ_version = ""
-  if _TF_CUDA_VERSION in repository_ctx.os.environ:
-    environ_version = repository_ctx.os.environ[_TF_CUDA_VERSION].strip()
-  if environ_version and not matches_version(environ_version, full_version):
-    auto_configure_fail(
-        ("CUDA version detected from nvcc (%s) does not match " +
-         "TF_CUDA_VERSION (%s)") % (full_version, environ_version))
-
-  # We only use the version consisting of the major and minor version numbers.
-  version_parts = full_version.split('.')
-  if len(version_parts) < 2:
-    auto_configure_fail("CUDA version detected from nvcc (%s) is incomplete.")
-  if cpu_value == "Windows":
-    version = "64_%s%s" % (version_parts[0], version_parts[1])
-  else:
-    version = "%s.%s" % (version_parts[0], version_parts[1])
-  return version
+    """Detects the version of CUDA installed on the system.
+
+    Args:
+      repository_ctx: The repository context.
+      cuda_toolkit_path: The CUDA install directory.
+
+    Returns:
+      String containing the version of CUDA.
+    """
+
+    # Run nvcc --version and find the line containing the CUDA version.
+    nvcc_path = repository_ctx.path("%s/bin/nvcc%s" %
+                                    (
+                                        cuda_toolkit_path,
+                                        ".exe" if cpu_value == "Windows" else "",
+                                    ))
+    if not nvcc_path.exists:
+        auto_configure_fail("Cannot find nvcc at %s" % str(nvcc_path))
+    result = repository_ctx.execute([str(nvcc_path), "--version"])
+    if result.stderr:
+        auto_configure_fail("Error running nvcc --version: %s" % result.stderr)
+    lines = result.stdout.splitlines()
+    version_line = lines[len(lines) - 1]
+    if version_line.find(_NVCC_VERSION_PREFIX) == -1:
+        auto_configure_fail(
+            "Could not parse CUDA version from nvcc --version. Got: %s" %
+            result.stdout,
+        )
 
+    # Parse the CUDA version from the line containing the CUDA version.
+    prefix_removed = version_line.replace(_NVCC_VERSION_PREFIX, "")
+    parts = prefix_removed.split(",")
+    if len(parts) != 2 or len(parts[0]) < 2:
+        auto_configure_fail(
+            "Could not parse CUDA version from nvcc --version. Got: %s" %
+            result.stdout,
+        )
+    full_version = parts[1].strip()
+    if full_version.startswith("V"):
+        full_version = full_version[1:]
+
+    # Check whether TF_CUDA_VERSION was set by the user and fail if it does not
+    # match the detected version.
+    environ_version = ""
+    if _TF_CUDA_VERSION in repository_ctx.os.environ:
+        environ_version = repository_ctx.os.environ[_TF_CUDA_VERSION].strip()
+    if environ_version and not matches_version(environ_version, full_version):
+        auto_configure_fail(
+            ("CUDA version detected from nvcc (%s) does not match " +
+             "TF_CUDA_VERSION (%s)") % (full_version, environ_version),
+        )
+
+    # We only use the version consisting of the major and minor version numbers.
+    version_parts = full_version.split(".")
+    if len(version_parts) < 2:
+        auto_configure_fail("CUDA version detected from nvcc (%s) is incomplete.")
+    if cpu_value == "Windows":
+        version = "64_%s%s" % (version_parts[0], version_parts[1])
+    else:
+        version = "%s.%s" % (version_parts[0], version_parts[1])
+    return version
 
 _DEFINE_CUDNN_MAJOR = "#define CUDNN_MAJOR"
 _DEFINE_CUDNN_MINOR = "#define CUDNN_MINOR"
 _DEFINE_CUDNN_PATCHLEVEL = "#define CUDNN_PATCHLEVEL"
 
-
 def find_cuda_define(repository_ctx, header_dir, header_file, define):
-  """Returns the value of a #define in a header file.
-
-  Greps through a header file and returns the value of the specified #define.
-  If the #define is not found, then raise an error.
-
-  Args:
-    repository_ctx: The repository context.
-    header_dir: The directory containing the header file.
-    header_file: The header file name.
-    define: The #define to search for.
-
-  Returns:
-    The value of the #define found in the header.
-  """
-  # Confirm location of the header and grep for the line defining the macro.
-  h_path = repository_ctx.path("%s/%s" % (header_dir, header_file))
-  if not h_path.exists:
-    auto_configure_fail("Cannot find %s at %s" % (header_file, str(h_path)))
-  result = repository_ctx.execute(
-      # Grep one more lines as some #defines are splitted into two lines.
-      ["grep", "--color=never", "-A1", "-E", define, str(h_path)])
-  if result.stderr:
-    auto_configure_fail("Error reading %s: %s" % (str(h_path), result.stderr))
-
-  # Parse the version from the line defining the macro.
-  if result.stdout.find(define) == -1:
-    auto_configure_fail("Cannot find line containing '%s' in %s" %
-                        (define, h_path))
-  # Split results to lines
-  lines = result.stdout.split('\n')
-  num_lines = len(lines)
-  for l in range(num_lines):
-    line = lines[l]
-    if define in line:  # Find the line with define
-      version = line
-      if l != num_lines-1 and line[-1] == '\\':  # Add next line, if multiline
-        version = version[:-1] + lines[l+1]
-      break
-  # Remove any comments
-  version = version.split("//")[0]
-  # Remove define name
-  version = version.replace(define, "").strip()
-  # Remove the code after the version number.
-  version_end = version.find(" ")
-  if version_end != -1:
-    if version_end == 0:
-      auto_configure_fail(
-          "Cannot extract the version from line containing '%s' in %s" %
-          (define, str(h_path)))
-    version = version[:version_end].strip()
-  return version
+    """Returns the value of a #define in a header file.
+
+    Greps through a header file and returns the value of the specified #define.
+    If the #define is not found, then raise an error.
 
+    Args:
+      repository_ctx: The repository context.
+      header_dir: The directory containing the header file.
+      header_file: The header file name.
+      define: The #define to search for.
+
+    Returns:
+      The value of the #define found in the header.
+    """
+
+    # Confirm location of the header and grep for the line defining the macro.
+    h_path = repository_ctx.path("%s/%s" % (header_dir, header_file))
+    if not h_path.exists:
+        auto_configure_fail("Cannot find %s at %s" % (header_file, str(h_path)))
+    result = repository_ctx.execute(
+        # Grep one more lines as some #defines are splitted into two lines.
+        ["grep", "--color=never", "-A1", "-E", define, str(h_path)],
+    )
+    if result.stderr:
+        auto_configure_fail("Error reading %s: %s" % (str(h_path), result.stderr))
+
+    # Parse the version from the line defining the macro.
+    if result.stdout.find(define) == -1:
+        auto_configure_fail("Cannot find line containing '%s' in %s" %
+                            (define, h_path))
+
+    # Split results to lines
+    lines = result.stdout.split("\n")
+    num_lines = len(lines)
+    for l in range(num_lines):
+        line = lines[l]
+        if define in line:  # Find the line with define
+            version = line
+            if l != num_lines - 1 and line[-1] == "\\":  # Add next line, if multiline
+                version = version[:-1] + lines[l + 1]
+            break
+
+    # Remove any comments
+    version = version.split("//")[0]
+
+    # Remove define name
+    version = version.replace(define, "").strip()
+
+    # Remove the code after the version number.
+    version_end = version.find(" ")
+    if version_end != -1:
+        if version_end == 0:
+            auto_configure_fail(
+                "Cannot extract the version from line containing '%s' in %s" %
+                (define, str(h_path)),
+            )
+        version = version[:version_end].strip()
+    return version
 
 def _cudnn_version(repository_ctx, cudnn_install_basedir, cpu_value):
-  """Detects the version of cuDNN installed on the system.
-
-  Args:
-    repository_ctx: The repository context.
-    cpu_value: The name of the host operating system.
-    cudnn_install_basedir: The cuDNN install directory.
-
-  Returns:
-    A string containing the version of cuDNN.
-  """
-  cudnn_header_dir = _find_cudnn_header_dir(repository_ctx,
-                                            cudnn_install_basedir)
-  major_version = find_cuda_define(
-      repository_ctx, cudnn_header_dir, "cudnn.h", _DEFINE_CUDNN_MAJOR)
-  minor_version = find_cuda_define(
-      repository_ctx, cudnn_header_dir, "cudnn.h", _DEFINE_CUDNN_MINOR)
-  patch_version = find_cuda_define(
-      repository_ctx, cudnn_header_dir, "cudnn.h", _DEFINE_CUDNN_PATCHLEVEL)
-  full_version = "%s.%s.%s" % (major_version, minor_version, patch_version)
-
-  # Check whether TF_CUDNN_VERSION was set by the user and fail if it does not
-  # match the detected version.
-  environ_version = ""
-  if _TF_CUDNN_VERSION in repository_ctx.os.environ:
-    environ_version = repository_ctx.os.environ[_TF_CUDNN_VERSION].strip()
-  if environ_version and not matches_version(environ_version, full_version):
-    cudnn_h_path = repository_ctx.path("%s/include/cudnn.h" %
-                                       cudnn_install_basedir)
-    auto_configure_fail(
-        ("cuDNN version detected from %s (%s) does not match " +
-        "TF_CUDNN_VERSION (%s)") %
-        (str(cudnn_h_path), full_version, environ_version))
-
-  # We only use the major version since we use the libcudnn libraries that are
-  # only versioned with the major version (e.g. libcudnn.so.5).
-  version = major_version
-  if cpu_value == "Windows":
-    version = "64_" + version
-  return version
+    """Detects the version of cuDNN installed on the system.
 
+    Args:
+      repository_ctx: The repository context.
+      cpu_value: The name of the host operating system.
+      cudnn_install_basedir: The cuDNN install directory.
 
-def _compute_capabilities(repository_ctx):
-  """Returns a list of strings representing cuda compute capabilities."""
-  if _TF_CUDA_COMPUTE_CAPABILITIES not in repository_ctx.os.environ:
-    return _DEFAULT_CUDA_COMPUTE_CAPABILITIES
-  capabilities_str = repository_ctx.os.environ[_TF_CUDA_COMPUTE_CAPABILITIES]
-  capabilities = capabilities_str.split(",")
-  for capability in capabilities:
-    # Workaround for Skylark's lack of support for regex. This check should
-    # be equivalent to checking:
-    #     if re.match("[0-9]+.[0-9]+", capability) == None:
-    parts = capability.split(".")
-    if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
-      auto_configure_fail("Invalid compute capability: %s" % capability)
-  return capabilities
+    Returns:
+      A string containing the version of cuDNN.
+    """
+    cudnn_header_dir = _find_cudnn_header_dir(
+        repository_ctx,
+        cudnn_install_basedir,
+    )
+    major_version = find_cuda_define(
+        repository_ctx,
+        cudnn_header_dir,
+        "cudnn.h",
+        _DEFINE_CUDNN_MAJOR,
+    )
+    minor_version = find_cuda_define(
+        repository_ctx,
+        cudnn_header_dir,
+        "cudnn.h",
+        _DEFINE_CUDNN_MINOR,
+    )
+    patch_version = find_cuda_define(
+        repository_ctx,
+        cudnn_header_dir,
+        "cudnn.h",
+        _DEFINE_CUDNN_PATCHLEVEL,
+    )
+    full_version = "%s.%s.%s" % (major_version, minor_version, patch_version)
+
+    # Check whether TF_CUDNN_VERSION was set by the user and fail if it does not
+    # match the detected version.
+    environ_version = ""
+    if _TF_CUDNN_VERSION in repository_ctx.os.environ:
+        environ_version = repository_ctx.os.environ[_TF_CUDNN_VERSION].strip()
+    if environ_version and not matches_version(environ_version, full_version):
+        cudnn_h_path = repository_ctx.path("%s/include/cudnn.h" %
+                                           cudnn_install_basedir)
+        auto_configure_fail(
+            ("cuDNN version detected from %s (%s) does not match " +
+             "TF_CUDNN_VERSION (%s)") %
+            (str(cudnn_h_path), full_version, environ_version),
+        )
 
+    # We only use the major version since we use the libcudnn libraries that are
+    # only versioned with the major version (e.g. libcudnn.so.5).
+    version = major_version
+    if cpu_value == "Windows":
+        version = "64_" + version
+    return version
 
-def get_cpu_value(repository_ctx):
-  """Returns the name of the host operating system.
+def _compute_capabilities(repository_ctx):
+    """Returns a list of strings representing cuda compute capabilities."""
+    if _TF_CUDA_COMPUTE_CAPABILITIES not in repository_ctx.os.environ:
+        return _DEFAULT_CUDA_COMPUTE_CAPABILITIES
+    capabilities_str = repository_ctx.os.environ[_TF_CUDA_COMPUTE_CAPABILITIES]
+    capabilities = capabilities_str.split(",")
+    for capability in capabilities:
+        # Workaround for Skylark's lack of support for regex. This check should
+        # be equivalent to checking:
+        #     if re.match("[0-9]+.[0-9]+", capability) == None:
+        parts = capability.split(".")
+        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
+            auto_configure_fail("Invalid compute capability: %s" % capability)
+    return capabilities
 
-  Args:
-    repository_ctx: The repository context.
+def get_cpu_value(repository_ctx):
+    """Returns the name of the host operating system.
 
-  Returns:
-    A string containing the name of the host operating system.
-  """
-  os_name = repository_ctx.os.name.lower()
-  if os_name.startswith("mac os"):
-    return "Darwin"
-  if os_name.find("windows") != -1:
-    return "Windows"
-  result = repository_ctx.execute(["uname", "-s"])
-  return result.stdout.strip()
+    Args:
+      repository_ctx: The repository context.
 
+    Returns:
+      A string containing the name of the host operating system.
+    """
+    os_name = repository_ctx.os.name.lower()
+    if os_name.startswith("mac os"):
+        return "Darwin"
+    if os_name.find("windows") != -1:
+        return "Windows"
+    result = repository_ctx.execute(["uname", "-s"])
+    return result.stdout.strip()
 
 def _is_windows(repository_ctx):
-  """Returns true if the host operating system is windows."""
-  return get_cpu_value(repository_ctx) == "Windows"
-
-def _lib_name(lib, cpu_value, version="", static=False):
-  """Constructs the platform-specific name of a library.
-
-  Args:
-    lib: The name of the library, such as "cudart"
-    cpu_value: The name of the host operating system.
-    version: The version of the library.
-    static: True the library is static or False if it is a shared object.
-
-  Returns:
-    The platform-specific name of the library.
-  """
-  if cpu_value in ("Linux", "FreeBSD"):
-    if static:
-      return "lib%s.a" % lib
-    else:
-      if version:
-        version = ".%s" % version
-      return "lib%s.so%s" % (lib, version)
-  elif cpu_value == "Windows":
-    return "%s.lib" % lib
-  elif cpu_value == "Darwin":
-    if static:
-      return "lib%s.a" % lib
-    else:
-      if version:
-        version = ".%s" % version
-    return "lib%s%s.dylib" % (lib, version)
-  else:
-    auto_configure_fail("Invalid cpu_value: %s" % cpu_value)
-
-
-def _find_cuda_lib(lib, repository_ctx, cpu_value, basedir, version="",
-                   static=False):
-  """Finds the given CUDA or cuDNN library on the system.
-
-  Args:
-    lib: The name of the library, such as "cudart"
-    repository_ctx: The repository context.
-    cpu_value: The name of the host operating system.
-    basedir: The install directory of CUDA or cuDNN.
-    version: The version of the library.
-    static: True if static library, False if shared object.
-
-  Returns:
-    Returns a struct with the following fields:
-      file_name: The basename of the library found on the system.
-      path: The full path to the library.
-  """
-  file_name = _lib_name(lib, cpu_value, version, static)
-  for relative_path in CUDA_LIB_PATHS:
-    path = repository_ctx.path("%s/%s%s" % (basedir, relative_path, file_name))
-    if path.exists:
-      return struct(file_name=file_name, path=str(path.realpath))
-  auto_configure_fail("Cannot find cuda library %s" % file_name)
+    """Returns true if the host operating system is windows."""
+    return get_cpu_value(repository_ctx) == "Windows"
 
+def _lib_name(lib, cpu_value, version = "", static = False):
+    """Constructs the platform-specific name of a library.
 
-def _find_cupti_header_dir(repository_ctx, cuda_config):
-  """Returns the path to the directory containing cupti.h
+    Args:
+      lib: The name of the library, such as "cudart"
+      cpu_value: The name of the host operating system.
+      version: The version of the library.
+      static: True the library is static or False if it is a shared object.
+
+    Returns:
+      The platform-specific name of the library.
+    """
+    if cpu_value in ("Linux", "FreeBSD"):
+        if static:
+            return "lib%s.a" % lib
+        else:
+            if version:
+                version = ".%s" % version
+            return "lib%s.so%s" % (lib, version)
+    elif cpu_value == "Windows":
+        return "%s.lib" % lib
+    elif cpu_value == "Darwin":
+        if static:
+            return "lib%s.a" % lib
+        elif version:
+            version = ".%s" % version
+        return "lib%s%s.dylib" % (lib, version)
+    else:
+        auto_configure_fail("Invalid cpu_value: %s" % cpu_value)
+
+def _find_cuda_lib(
+        lib,
+        repository_ctx,
+        cpu_value,
+        basedir,
+        version = "",
+        static = False):
+    """Finds the given CUDA or cuDNN library on the system.
+
+    Args:
+      lib: The name of the library, such as "cudart"
+      repository_ctx: The repository context.
+      cpu_value: The name of the host operating system.
+      basedir: The install directory of CUDA or cuDNN.
+      version: The version of the library.
+      static: True if static library, False if shared object.
+
+    Returns:
+      Returns a struct with the following fields:
+        file_name: The basename of the library found on the system.
+        path: The full path to the library.
+    """
+    file_name = _lib_name(lib, cpu_value, version, static)
+    for relative_path in CUDA_LIB_PATHS:
+        path = repository_ctx.path("%s/%s%s" % (basedir, relative_path, file_name))
+        if path.exists:
+            return struct(file_name = file_name, path = str(path.realpath))
+    auto_configure_fail("Cannot find cuda library %s" % file_name)
 
-  On most systems, the cupti library is not installed in the same directory as
-  the other CUDA libraries but rather in a special extras/CUPTI directory.
+def _find_cupti_header_dir(repository_ctx, cuda_config):
+    """Returns the path to the directory containing cupti.h
 
-  Args:
-    repository_ctx: The repository context.
-    cuda_config: The CUDA config as returned by _get_cuda_config
+    On most systems, the cupti library is not installed in the same directory as
+    the other CUDA libraries but rather in a special extras/CUPTI directory.
 
-  Returns:
-    The path of the directory containing the cupti header.
-  """
-  cuda_toolkit_path = cuda_config.cuda_toolkit_path
-  for relative_path in CUPTI_HEADER_PATHS:
-    if repository_ctx.path("%s/%scupti.h" % (cuda_toolkit_path, relative_path)).exists:
-        return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
-  auto_configure_fail("Cannot find cupti.h under %s" % ", ".join([cuda_toolkit_path + "/" + s for s in CUPTI_HEADER_PATHS]))
+    Args:
+      repository_ctx: The repository context.
+      cuda_config: The CUDA config as returned by _get_cuda_config
 
+    Returns:
+      The path of the directory containing the cupti header.
+    """
+    cuda_toolkit_path = cuda_config.cuda_toolkit_path
+    for relative_path in CUPTI_HEADER_PATHS:
+        if repository_ctx.path("%s/%scupti.h" % (cuda_toolkit_path, relative_path)).exists:
+            return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
+    auto_configure_fail("Cannot find cupti.h under %s" % ", ".join([cuda_toolkit_path + "/" + s for s in CUPTI_HEADER_PATHS]))
 
 def _find_cupti_lib(repository_ctx, cuda_config):
-  """Finds the cupti library on the system.
-
-  On most systems, the cupti library is not installed in the same directory as
-  the other CUDA libraries but rather in a special extras/CUPTI directory.
-
-  Args:
-    repository_ctx: The repository context.
-    cuda_config: The cuda configuration as returned by _get_cuda_config.
-
-  Returns:
-    Returns a struct with the following fields:
-      file_name: The basename of the library found on the system.
-      path: The full path to the library.
-  """
-  file_name = _lib_name("cupti", cuda_config.cpu_value,
-                        cuda_config.cuda_version)
-  cuda_toolkit_path = cuda_config.cuda_toolkit_path
-  for relative_path in CUPTI_LIB_PATHS:
-    path = repository_ctx.path(
-        "%s/%s%s" % (cuda_toolkit_path, relative_path, file_name))
-    if path.exists:
-      return struct(file_name=file_name, path=str(path.realpath))
-
-  auto_configure_fail("Cannot find cupti library %s" % file_name)
+    """Finds the cupti library on the system.
+
+    On most systems, the cupti library is not installed in the same directory as
+    the other CUDA libraries but rather in a special extras/CUPTI directory.
+
+    Args:
+      repository_ctx: The repository context.
+      cuda_config: The cuda configuration as returned by _get_cuda_config.
+
+    Returns:
+      Returns a struct with the following fields:
+        file_name: The basename of the library found on the system.
+        path: The full path to the library.
+    """
+    file_name = _lib_name(
+        "cupti",
+        cuda_config.cpu_value,
+        cuda_config.cuda_version,
+    )
+    cuda_toolkit_path = cuda_config.cuda_toolkit_path
+    for relative_path in CUPTI_LIB_PATHS:
+        path = repository_ctx.path(
+            "%s/%s%s" % (cuda_toolkit_path, relative_path, file_name),
+        )
+        if path.exists:
+            return struct(file_name = file_name, path = str(path.realpath))
+
+    auto_configure_fail("Cannot find cupti library %s" % file_name)
 
 def _find_libs(repository_ctx, cuda_config):
-  """Returns the CUDA and cuDNN libraries on the system.
-
-  Args:
-    repository_ctx: The repository context.
-    cuda_config: The CUDA config as returned by _get_cuda_config
-
-  Returns:
-    Map of library names to structs of filename and path.
-  """
-  cpu_value = cuda_config.cpu_value
-  return {
-      "cuda": _find_cuda_lib("cuda", repository_ctx, cpu_value, cuda_config.cuda_toolkit_path),
-      "cudart": _find_cuda_lib(
-          "cudart", repository_ctx, cpu_value, cuda_config.cuda_toolkit_path,
-          cuda_config.cuda_version),
-      "cudart_static": _find_cuda_lib(
-          "cudart_static", repository_ctx, cpu_value,
-          cuda_config.cuda_toolkit_path, cuda_config.cuda_version, static=True),
-      "cublas": _find_cuda_lib(
-          "cublas", repository_ctx, cpu_value, cuda_config.cuda_toolkit_path,
-          cuda_config.cuda_version),
-      "cusolver": _find_cuda_lib(
-          "cusolver", repository_ctx, cpu_value, cuda_config.cuda_toolkit_path,
-          cuda_config.cuda_version),
-      "curand": _find_cuda_lib(
-          "curand", repository_ctx, cpu_value, cuda_config.cuda_toolkit_path,
-          cuda_config.cuda_version),
-      "cufft": _find_cuda_lib(
-          "cufft", repository_ctx, cpu_value, cuda_config.cuda_toolkit_path,
-          cuda_config.cuda_version),
-      "cudnn": _find_cuda_lib(
-          "cudnn", repository_ctx, cpu_value, cuda_config.cudnn_install_basedir,
-          cuda_config.cudnn_version),
-      "cupti": _find_cupti_lib(repository_ctx, cuda_config)
-  }
+    """Returns the CUDA and cuDNN libraries on the system.
 
+    Args:
+      repository_ctx: The repository context.
+      cuda_config: The CUDA config as returned by _get_cuda_config
 
-def _find_cuda_include_path(repository_ctx, cuda_config):
-  """Returns the path to the directory containing cuda.h
+    Returns:
+      Map of library names to structs of filename and path.
+    """
+    cpu_value = cuda_config.cpu_value
+    return {
+        "cuda": _find_cuda_lib("cuda", repository_ctx, cpu_value, cuda_config.cuda_toolkit_path),
+        "cudart": _find_cuda_lib(
+            "cudart",
+            repository_ctx,
+            cpu_value,
+            cuda_config.cuda_toolkit_path,
+            cuda_config.cuda_version,
+        ),
+        "cudart_static": _find_cuda_lib(
+            "cudart_static",
+            repository_ctx,
+            cpu_value,
+            cuda_config.cuda_toolkit_path,
+            cuda_config.cuda_version,
+            static = True,
+        ),
+        "cublas": _find_cuda_lib(
+            "cublas",
+            repository_ctx,
+            cpu_value,
+            cuda_config.cuda_toolkit_path,
+            cuda_config.cuda_version,
+        ),
+        "cusolver": _find_cuda_lib(
+            "cusolver",
+            repository_ctx,
+            cpu_value,
+            cuda_config.cuda_toolkit_path,
+            cuda_config.cuda_version,
+        ),
+        "curand": _find_cuda_lib(
+            "curand",
+            repository_ctx,
+            cpu_value,
+            cuda_config.cuda_toolkit_path,
+            cuda_config.cuda_version,
+        ),
+        "cufft": _find_cuda_lib(
+            "cufft",
+            repository_ctx,
+            cpu_value,
+            cuda_config.cuda_toolkit_path,
+            cuda_config.cuda_version,
+        ),
+        "cudnn": _find_cuda_lib(
+            "cudnn",
+            repository_ctx,
+            cpu_value,
+            cuda_config.cudnn_install_basedir,
+            cuda_config.cudnn_version,
+        ),
+        "cupti": _find_cupti_lib(repository_ctx, cuda_config),
+    }
 
-  Args:
-    repository_ctx: The repository context.
-    cuda_config: The CUDA config as returned by _get_cuda_config
+def _find_cuda_include_path(repository_ctx, cuda_config):
+    """Returns the path to the directory containing cuda.h
 
-  Returns:
-    The path of the directory containing the CUDA headers.
-  """
-  cuda_toolkit_path = cuda_config.cuda_toolkit_path
-  for relative_path in CUDA_INCLUDE_PATHS:
-    if repository_ctx.path("%s/%scuda.h" % (cuda_toolkit_path, relative_path)).exists:
-        return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
-  auto_configure_fail("Cannot find cuda.h under %s" % cuda_toolkit_path)
+    Args:
+      repository_ctx: The repository context.
+      cuda_config: The CUDA config as returned by _get_cuda_config
 
+    Returns:
+      The path of the directory containing the CUDA headers.
+    """
+    cuda_toolkit_path = cuda_config.cuda_toolkit_path
+    for relative_path in CUDA_INCLUDE_PATHS:
+        if repository_ctx.path("%s/%scuda.h" % (cuda_toolkit_path, relative_path)).exists:
+            return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
+    auto_configure_fail("Cannot find cuda.h under %s" % cuda_toolkit_path)
 
 def _find_cudnn_header_dir(repository_ctx, cudnn_install_basedir):
-  """Returns the path to the directory containing cudnn.h
-
-  Args:
-    repository_ctx: The repository context.
-    cudnn_install_basedir: The cudnn install directory as returned by
-      _cudnn_install_basedir.
+    """Returns the path to the directory containing cudnn.h
 
-  Returns:
-    The path of the directory containing the cudnn header.
-  """
-  for relative_path in CUDA_INCLUDE_PATHS:
-    if repository_ctx.path("%s/%scudnn.h" % (cudnn_install_basedir, relative_path)).exists:
-        return ("%s/%s" % (cudnn_install_basedir, relative_path))[:-1]
-  if repository_ctx.path("/usr/include/cudnn.h").exists:
-    return "/usr/include"
-  auto_configure_fail("Cannot find cudnn.h under %s" % cudnn_install_basedir)
+    Args:
+      repository_ctx: The repository context.
+      cudnn_install_basedir: The cudnn install directory as returned by
+        _cudnn_install_basedir.
 
+    Returns:
+      The path of the directory containing the cudnn header.
+    """
+    for relative_path in CUDA_INCLUDE_PATHS:
+        if repository_ctx.path("%s/%scudnn.h" % (cudnn_install_basedir, relative_path)).exists:
+            return ("%s/%s" % (cudnn_install_basedir, relative_path))[:-1]
+    if repository_ctx.path("/usr/include/cudnn.h").exists:
+        return "/usr/include"
+    auto_configure_fail("Cannot find cudnn.h under %s" % cudnn_install_basedir)
 
 def _find_nvvm_libdevice_dir(repository_ctx, cuda_config):
-  """Returns the path to the directory containing libdevice in bitcode format.
+    """Returns the path to the directory containing libdevice in bitcode format.
 
-  Args:
-    repository_ctx: The repository context.
-    cuda_config: The CUDA config as returned by _get_cuda_config
-
-  Returns:
-    The path of the directory containing the CUDA headers.
-  """
-  cuda_toolkit_path = cuda_config.cuda_toolkit_path
-  for relative_path in NVVM_LIBDEVICE_PATHS:
-    if repository_ctx.path("%s/%slibdevice.10.bc" % (cuda_toolkit_path, relative_path)).exists:
-      return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
-  auto_configure_fail("Cannot find libdevice.10.bc under %s" % cuda_toolkit_path)
+    Args:
+      repository_ctx: The repository context.
+      cuda_config: The CUDA config as returned by _get_cuda_config
 
+    Returns:
+      The path of the directory containing the CUDA headers.
+    """
+    cuda_toolkit_path = cuda_config.cuda_toolkit_path
+    for libdevice_file in NVVM_LIBDEVICE_FILES:
+        for relative_path in NVVM_LIBDEVICE_PATHS:
+            if repository_ctx.path("%s/%s%s" % (cuda_toolkit_path, relative_path, libdevice_file)).exists:
+                return ("%s/%s" % (cuda_toolkit_path, relative_path))[:-1]
+    auto_configure_fail("Cannot find libdevice*.bc files under %s" % cuda_toolkit_path)
 
 def _cudart_static_linkopt(cpu_value):
-  """Returns additional platform-specific linkopts for cudart."""
-  return "" if cpu_value == "Darwin" else "\"-lrt\","
+    """Returns additional platform-specific linkopts for cudart."""
+    return "" if cpu_value == "Darwin" else "\"-lrt\","
 
 def _get_cuda_config(repository_ctx):
-  """Detects and returns information about the CUDA installation on the system.
-
-  Args:
-    repository_ctx: The repository context.
-
-  Returns:
-    A struct containing the following fields:
-      cuda_toolkit_path: The CUDA toolkit installation directory.
-      cudnn_install_basedir: The cuDNN installation directory.
-      cuda_version: The version of CUDA on the system.
-      cudnn_version: The version of cuDNN on the system.
-      compute_capabilities: A list of the system's CUDA compute capabilities.
-      cpu_value: The name of the host operating system.
-  """
-  cpu_value = get_cpu_value(repository_ctx)
-  cuda_toolkit_path = _cuda_toolkit_path(repository_ctx)
-  cuda_version = _cuda_version(repository_ctx, cuda_toolkit_path, cpu_value)
-  cudnn_install_basedir = _cudnn_install_basedir(repository_ctx)
-  cudnn_version = _cudnn_version(repository_ctx, cudnn_install_basedir, cpu_value)
-  return struct(
-      cuda_toolkit_path = cuda_toolkit_path,
-      cudnn_install_basedir = cudnn_install_basedir,
-      cuda_version = cuda_version,
-      cudnn_version = cudnn_version,
-      compute_capabilities = _compute_capabilities(repository_ctx),
-      cpu_value = cpu_value)
-
-
-def _tpl(repository_ctx, tpl, substitutions={}, out=None):
-  if not out:
-    out = tpl.replace(":", "/")
-  repository_ctx.template(
-      out,
-      Label("//third_party/gpus/%s.tpl" % tpl),
-      substitutions)
-
+    """Detects and returns information about the CUDA installation on the system.
+
+    Args:
+      repository_ctx: The repository context.
+
+    Returns:
+      A struct containing the following fields:
+        cuda_toolkit_path: The CUDA toolkit installation directory.
+        cudnn_install_basedir: The cuDNN installation directory.
+        cuda_version: The version of CUDA on the system.
+        cudnn_version: The version of cuDNN on the system.
+        compute_capabilities: A list of the system's CUDA compute capabilities.
+        cpu_value: The name of the host operating system.
+    """
+    cpu_value = get_cpu_value(repository_ctx)
+    cuda_toolkit_path = _cuda_toolkit_path(repository_ctx)
+    cuda_version = _cuda_version(repository_ctx, cuda_toolkit_path, cpu_value)
+    cudnn_install_basedir = _cudnn_install_basedir(repository_ctx)
+    cudnn_version = _cudnn_version(repository_ctx, cudnn_install_basedir, cpu_value)
+    return struct(
+        cuda_toolkit_path = cuda_toolkit_path,
+        cudnn_install_basedir = cudnn_install_basedir,
+        cuda_version = cuda_version,
+        cudnn_version = cudnn_version,
+        compute_capabilities = _compute_capabilities(repository_ctx),
+        cpu_value = cpu_value,
+    )
+
+def _tpl(repository_ctx, tpl, substitutions = {}, out = None):
+    if not out:
+        out = tpl.replace(":", "/")
+    repository_ctx.template(
+        out,
+        Label("//third_party/gpus/%s.tpl" % tpl),
+        substitutions,
+    )
 
 def _file(repository_ctx, label):
-  repository_ctx.template(
-      label.replace(":", "/"),
-      Label("//third_party/gpus/%s.tpl" % label),
-      {})
-
+    repository_ctx.template(
+        label.replace(":", "/"),
+        Label("//third_party/gpus/%s.tpl" % label),
+        {},
+    )
 
 _DUMMY_CROSSTOOL_BZL_FILE = """
 def error_gpu_disabled():
@@ -792,379 +949,498 @@ def error_gpu_disabled():
   )
 """
 
-
 _DUMMY_CROSSTOOL_BUILD_FILE = """
 load("//crosstool:error_gpu_disabled.bzl", "error_gpu_disabled")
 
 error_gpu_disabled()
 """
 
-
 def _create_dummy_repository(repository_ctx):
-  cpu_value = get_cpu_value(repository_ctx)
-
-  # Set up BUILD file for cuda/.
-  _tpl(repository_ctx, "cuda:build_defs.bzl",
-       {
-           "%{cuda_is_configured}": "False",
-           "%{cuda_extra_copts}": "[]",
-       })
-  _tpl(repository_ctx, "cuda:BUILD",
-       {
-           "%{cuda_driver_lib}": _lib_name("cuda", cpu_value),
-           "%{cudart_static_lib}": _lib_name("cudart_static", cpu_value,
-                                             static=True),
-           "%{cudart_static_linkopt}": _cudart_static_linkopt(cpu_value),
-           "%{cudart_lib}": _lib_name("cudart", cpu_value),
-           "%{cublas_lib}": _lib_name("cublas", cpu_value),
-           "%{cusolver_lib}": _lib_name("cusolver", cpu_value),
-           "%{cudnn_lib}": _lib_name("cudnn", cpu_value),
-           "%{cufft_lib}": _lib_name("cufft", cpu_value),
-           "%{curand_lib}": _lib_name("curand", cpu_value),
-           "%{cupti_lib}": _lib_name("cupti", cpu_value),
-           "%{cuda_include_genrules}": '',
-           "%{cuda_headers}": '',
-       })
-
-  # Create dummy files for the CUDA toolkit since they are still required by
-  # tensorflow/core/platform/default/build_config:cuda.
-  repository_ctx.file("cuda/cuda/include/cuda.h", "")
-  repository_ctx.file("cuda/cuda/include/cublas.h", "")
-  repository_ctx.file("cuda/cuda/include/cudnn.h", "")
-  repository_ctx.file("cuda/cuda/extras/CUPTI/include/cupti.h", "")
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cuda", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cudart", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cudart_static", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cublas", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cusolver", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cudnn", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("curand", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cufft", cpu_value))
-  repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cupti", cpu_value))
-
-  # Set up cuda_config.h, which is used by
-  # tensorflow/stream_executor/dso_loader.cc.
-  _tpl(repository_ctx, "cuda:cuda_config.h",
-       {
-           "%{cuda_version}": _DEFAULT_CUDA_VERSION,
-           "%{cudnn_version}": _DEFAULT_CUDNN_VERSION,
-           "%{cuda_compute_capabilities}": ",".join([
-               "CudaVersion(\"%s\")" % c
-               for c in _DEFAULT_CUDA_COMPUTE_CAPABILITIES]),
-           "%{cuda_toolkit_path}": _DEFAULT_CUDA_TOOLKIT_PATH,
-       }, "cuda/cuda/cuda_config.h")
-
-  # If cuda_configure is not configured to build with GPU support, and the user
-  # attempts to build with --config=cuda, add a dummy build rule to intercept
-  # this and fail with an actionable error message.
-  repository_ctx.file("crosstool/error_gpu_disabled.bzl",
-                      _DUMMY_CROSSTOOL_BZL_FILE)
-  repository_ctx.file("crosstool/BUILD", _DUMMY_CROSSTOOL_BUILD_FILE)
-
-
-def _execute(repository_ctx, cmdline, error_msg=None, error_details=None,
-             empty_stdout_fine=False):
-  """Executes an arbitrary shell command.
-
-  Args:
-    repository_ctx: the repository_ctx object
-    cmdline: list of strings, the command to execute
-    error_msg: string, a summary of the error if the command fails
-    error_details: string, details about the error or steps to fix it
-    empty_stdout_fine: bool, if True, an empty stdout result is fine, otherwise
-      it's an error
-  Return:
-    the result of repository_ctx.execute(cmdline)
-  """
-  result = repository_ctx.execute(cmdline)
-  if result.stderr or not (empty_stdout_fine or result.stdout):
-    auto_configure_fail(
-        "\n".join([
-            error_msg.strip() if error_msg else "Repository command failed",
-            result.stderr.strip(),
-            error_details if error_details else ""]))
-  return result
-
+    cpu_value = get_cpu_value(repository_ctx)
+
+    # Set up BUILD file for cuda/.
+    _tpl(
+        repository_ctx,
+        "cuda:build_defs.bzl",
+        {
+            "%{cuda_is_configured}": "False",
+            "%{cuda_extra_copts}": "[]",
+        },
+    )
+    _tpl(
+        repository_ctx,
+        "cuda:BUILD",
+        {
+            "%{cuda_driver_lib}": _lib_name("cuda", cpu_value),
+            "%{cudart_static_lib}": _lib_name(
+                "cudart_static",
+                cpu_value,
+                static = True,
+            ),
+            "%{cudart_static_linkopt}": _cudart_static_linkopt(cpu_value),
+            "%{cudart_lib}": _lib_name("cudart", cpu_value),
+            "%{cublas_lib}": _lib_name("cublas", cpu_value),
+            "%{cusolver_lib}": _lib_name("cusolver", cpu_value),
+            "%{cudnn_lib}": _lib_name("cudnn", cpu_value),
+            "%{cufft_lib}": _lib_name("cufft", cpu_value),
+            "%{curand_lib}": _lib_name("curand", cpu_value),
+            "%{cupti_lib}": _lib_name("cupti", cpu_value),
+            "%{cuda_include_genrules}": "",
+            "%{cuda_headers}": "",
+        },
+    )
+
+    # Create dummy files for the CUDA toolkit since they are still required by
+    # tensorflow/core/platform/default/build_config:cuda.
+    repository_ctx.file("cuda/cuda/include/cuda.h", "")
+    repository_ctx.file("cuda/cuda/include/cublas.h", "")
+    repository_ctx.file("cuda/cuda/include/cudnn.h", "")
+    repository_ctx.file("cuda/cuda/extras/CUPTI/include/cupti.h", "")
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cuda", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cudart", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cudart_static", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cublas", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cusolver", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cudnn", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("curand", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cufft", cpu_value))
+    repository_ctx.file("cuda/cuda/lib/%s" % _lib_name("cupti", cpu_value))
+
+    # Set up cuda_config.h, which is used by
+    # tensorflow/stream_executor/dso_loader.cc.
+    _tpl(
+        repository_ctx,
+        "cuda:cuda_config.h",
+        {
+            "%{cuda_version}": _DEFAULT_CUDA_VERSION,
+            "%{cudnn_version}": _DEFAULT_CUDNN_VERSION,
+            "%{cuda_compute_capabilities}": ",".join([
+                "CudaVersion(\"%s\")" % c
+                for c in _DEFAULT_CUDA_COMPUTE_CAPABILITIES
+            ]),
+            "%{cuda_toolkit_path}": _DEFAULT_CUDA_TOOLKIT_PATH,
+        },
+        "cuda/cuda/cuda_config.h",
+    )
+
+    # If cuda_configure is not configured to build with GPU support, and the user
+    # attempts to build with --config=cuda, add a dummy build rule to intercept
+    # this and fail with an actionable error message.
+    repository_ctx.file(
+        "crosstool/error_gpu_disabled.bzl",
+        _DUMMY_CROSSTOOL_BZL_FILE,
+    )
+    repository_ctx.file("crosstool/BUILD", _DUMMY_CROSSTOOL_BUILD_FILE)
+
+def _execute(
+        repository_ctx,
+        cmdline,
+        error_msg = None,
+        error_details = None,
+        empty_stdout_fine = False):
+    """Executes an arbitrary shell command.
+
+    Args:
+      repository_ctx: the repository_ctx object
+      cmdline: list of strings, the command to execute
+      error_msg: string, a summary of the error if the command fails
+      error_details: string, details about the error or steps to fix it
+      empty_stdout_fine: bool, if True, an empty stdout result is fine, otherwise
+        it's an error
+    Return:
+      the result of repository_ctx.execute(cmdline)
+    """
+    result = repository_ctx.execute(cmdline)
+    if result.stderr or not (empty_stdout_fine or result.stdout):
+        auto_configure_fail(
+            "\n".join([
+                error_msg.strip() if error_msg else "Repository command failed",
+                result.stderr.strip(),
+                error_details if error_details else "",
+            ]),
+        )
+    return result
 
 def _norm_path(path):
-  """Returns a path with '/' and remove the trailing slash."""
-  path = path.replace("\\", "/")
-  if path[-1] == "/":
-    path = path[:-1]
-  return path
-
-
-def symlink_genrule_for_dir(repository_ctx, src_dir, dest_dir, genrule_name,
-                            src_files = [], dest_files = []):
-  """Returns a genrule to symlink(or copy if on Windows) a set of files.
-
-  If src_dir is passed, files will be read from the given directory; otherwise
-  we assume files are in src_files and dest_files
-  """
-  if src_dir != None:
-    src_dir = _norm_path(src_dir)
-    dest_dir = _norm_path(dest_dir)
-    files = '\n'.join(sorted(_read_dir(repository_ctx, src_dir).splitlines()))
-    # Create a list with the src_dir stripped to use for outputs.
-    dest_files = files.replace(src_dir, '').splitlines()
-    src_files = files.splitlines()
-  command = []
-  if not _is_windows(repository_ctx):
-    # We clear folders that might have been generated previously to avoid
-    # undesired inclusions
-    command.append('if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi')
-    command.append('if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi')
-    command.append('if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi')
-    command.append('if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi')
-  outs = []
-  for i in range(len(dest_files)):
-    if dest_files[i] != "":
-      # If we have only one file to link we do not want to use the dest_dir, as
-      # $(@D) will include the full path to the file.
-      dest = '$(@D)/' + dest_dir + dest_files[i] if len(dest_files) != 1 else '$(@D)/' + dest_files[i]
-      # On Windows, symlink is not supported, so we just copy all the files.
-      cmd = 'cp -f' if _is_windows(repository_ctx) else 'ln -s'
-      command.append(cmd + ' "%s" "%s"' % (src_files[i] , dest))
-      outs.append('        "' + dest_dir + dest_files[i] + '",')
-  genrule = _genrule(src_dir, genrule_name, " && ".join(command),
-                     "\n".join(outs))
-  return genrule
-
+    """Returns a path with '/' and remove the trailing slash."""
+    path = path.replace("\\", "/")
+    if path[-1] == "/":
+        path = path[:-1]
+    return path
+
+def symlink_genrule_for_dir(
+        repository_ctx,
+        src_dir,
+        dest_dir,
+        genrule_name,
+        src_files = [],
+        dest_files = []):
+    """Returns a genrule to symlink(or copy if on Windows) a set of files.
+
+    If src_dir is passed, files will be read from the given directory; otherwise
+    we assume files are in src_files and dest_files
+    """
+    if src_dir != None:
+        src_dir = _norm_path(src_dir)
+        dest_dir = _norm_path(dest_dir)
+        files = "\n".join(sorted(_read_dir(repository_ctx, src_dir).splitlines()))
+
+        # Create a list with the src_dir stripped to use for outputs.
+        dest_files = files.replace(src_dir, "").splitlines()
+        src_files = files.splitlines()
+    command = []
+    if not _is_windows(repository_ctx):
+        # We clear folders that might have been generated previously to avoid
+        # undesired inclusions
+        command.append('if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi')
+        command.append('if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi')
+        command.append('if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi')
+        command.append('if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi')
+    outs = []
+    for i in range(len(dest_files)):
+        if dest_files[i] != "":
+            # If we have only one file to link we do not want to use the dest_dir, as
+            # $(@D) will include the full path to the file.
+            dest = "$(@D)/" + dest_dir + dest_files[i] if len(dest_files) != 1 else "$(@D)/" + dest_files[i]
+
+            # On Windows, symlink is not supported, so we just copy all the files.
+            cmd = "cp -f" if _is_windows(repository_ctx) else "ln -s"
+            command.append(cmd + ' "%s" "%s"' % (src_files[i], dest))
+            outs.append('        "' + dest_dir + dest_files[i] + '",')
+    genrule = _genrule(
+        src_dir,
+        genrule_name,
+        " && ".join(command),
+        "\n".join(outs),
+    )
+    return genrule
 
 def _genrule(src_dir, genrule_name, command, outs):
-  """Returns a string with a genrule.
-
-  Genrule executes the given command and produces the given outputs.
-  """
-  return (
-      'genrule(\n' +
-      '    name = "' +
-      genrule_name + '",\n' +
-      '    outs = [\n' +
-      outs +
-      '\n    ],\n' +
-      '    cmd = """\n' +
-      command +
-      '\n   """,\n' +
-      ')\n'
-  )
+    """Returns a string with a genrule.
 
+    Genrule executes the given command and produces the given outputs.
+    """
+    return (
+        "genrule(\n" +
+        '    name = "' +
+        genrule_name + '",\n' +
+        "    outs = [\n" +
+        outs +
+        "\n    ],\n" +
+        '    cmd = """\n' +
+        command +
+        '\n   """,\n' +
+        ")\n"
+    )
 
 def _read_dir(repository_ctx, src_dir):
-  """Returns a string with all files in a directory.
-
-  Finds all files inside a directory, traversing subfolders and following
-  symlinks. The returned string contains the full path of all files
-  separated by line breaks.
-  """
-  if _is_windows(repository_ctx):
-    src_dir = src_dir.replace("/", "\\")
-    find_result = _execute(
-        repository_ctx, ["cmd.exe", "/c", "dir", src_dir, "/b", "/s", "/a-d"],
-        empty_stdout_fine=True)
-    # src_files will be used in genrule.outs where the paths must
-    # use forward slashes.
-    result = find_result.stdout.replace("\\", "/")
-  else:
-    find_result = _execute(
-        repository_ctx, ["find", src_dir, "-follow", "-type", "f"],
-        empty_stdout_fine=True)
-    result = find_result.stdout
-  return result
+    """Returns a string with all files in a directory.
+
+    Finds all files inside a directory, traversing subfolders and following
+    symlinks. The returned string contains the full path of all files
+    separated by line breaks.
+    """
+    if _is_windows(repository_ctx):
+        src_dir = src_dir.replace("/", "\\")
+        find_result = _execute(
+            repository_ctx,
+            ["cmd.exe", "/c", "dir", src_dir, "/b", "/s", "/a-d"],
+            empty_stdout_fine = True,
+        )
+
+        # src_files will be used in genrule.outs where the paths must
+        # use forward slashes.
+        result = find_result.stdout.replace("\\", "/")
+    else:
+        find_result = _execute(
+            repository_ctx,
+            ["find", src_dir, "-follow", "-type", "f"],
+            empty_stdout_fine = True,
+        )
+        result = find_result.stdout
+    return result
 
 def _flag_enabled(repository_ctx, flag_name):
-  if flag_name in repository_ctx.os.environ:
-    value = repository_ctx.os.environ[flag_name].strip()
-    return value == "1"
-  return False
+    if flag_name in repository_ctx.os.environ:
+        value = repository_ctx.os.environ[flag_name].strip()
+        return value == "1"
+    return False
 
 def _use_cuda_clang(repository_ctx):
-  return _flag_enabled(repository_ctx, "TF_CUDA_CLANG")
+    return _flag_enabled(repository_ctx, "TF_CUDA_CLANG")
 
 def _compute_cuda_extra_copts(repository_ctx, compute_capabilities):
-  if _use_cuda_clang(repository_ctx):
-    capability_flags = ["--cuda-gpu-arch=sm_" +
-        cap.replace(".", "") for cap in compute_capabilities]
-  else:
-    # Capabilities are handled in the "crosstool_wrapper_driver_is_not_gcc" for nvcc
-    capability_flags = []
-  return str(capability_flags)
+    if _use_cuda_clang(repository_ctx):
+        capability_flags = ["--cuda-gpu-arch=sm_" +
+                            cap.replace(".", "") for cap in compute_capabilities]
+    else:
+        # Capabilities are handled in the "crosstool_wrapper_driver_is_not_gcc" for nvcc
+        capability_flags = []
+    return str(capability_flags)
 
 def _create_local_cuda_repository(repository_ctx):
-  """Creates the repository containing files set up to build with CUDA."""
-  cuda_config = _get_cuda_config(repository_ctx)
-
-  cuda_include_path = _find_cuda_include_path(repository_ctx, cuda_config)
-  cudnn_header_dir = _find_cudnn_header_dir(repository_ctx,
-                                            cuda_config.cudnn_install_basedir)
-  cupti_header_dir = _find_cupti_header_dir(repository_ctx, cuda_config)
-  nvvm_libdevice_dir = _find_nvvm_libdevice_dir(repository_ctx, cuda_config)
-
-  # Set up symbolic links for the cuda toolkit by creating genrules to do
-  # symlinking. We create one genrule for each directory we want to track under
-  # cuda_toolkit_path
-  cuda_toolkit_path = cuda_config.cuda_toolkit_path
-  genrules = [symlink_genrule_for_dir(repository_ctx,
-      cuda_include_path, "cuda/include", "cuda-include")]
-  genrules.append(symlink_genrule_for_dir(repository_ctx,
-      nvvm_libdevice_dir, "cuda/nvvm/libdevice", "cuda-nvvm"))
-  genrules.append(symlink_genrule_for_dir(repository_ctx,
-      cupti_header_dir, "cuda/extras/CUPTI/include", "cuda-extras"))
-
-  cuda_libs = _find_libs(repository_ctx, cuda_config)
-  cuda_lib_src = []
-  cuda_lib_dest = []
-  for lib in cuda_libs.values():
-    cuda_lib_src.append(lib.path)
-    cuda_lib_dest.append("cuda/lib/" + lib.file_name)
-  genrules.append(symlink_genrule_for_dir(repository_ctx, None, "", "cuda-lib",
-                                          cuda_lib_src, cuda_lib_dest))
-
-  # Set up the symbolic links for cudnn if cndnn was not installed to
-  # CUDA_TOOLKIT_PATH.
-  included_files = _read_dir(repository_ctx, cuda_include_path).replace(
-      cuda_include_path, '').splitlines()
-  if '/cudnn.h' not in included_files:
-    genrules.append(symlink_genrule_for_dir(repository_ctx, None,
-        "cuda/include/", "cudnn-include", [cudnn_header_dir + "/cudnn.h"],
-        ["cudnn.h"]))
-  else:
-    genrules.append(
-            'filegroup(\n' +
+    """Creates the repository containing files set up to build with CUDA."""
+    cuda_config = _get_cuda_config(repository_ctx)
+
+    cuda_include_path = _find_cuda_include_path(repository_ctx, cuda_config)
+    cudnn_header_dir = _find_cudnn_header_dir(
+        repository_ctx,
+        cuda_config.cudnn_install_basedir,
+    )
+    cupti_header_dir = _find_cupti_header_dir(repository_ctx, cuda_config)
+    nvvm_libdevice_dir = _find_nvvm_libdevice_dir(repository_ctx, cuda_config)
+
+    # Set up symbolic links for the cuda toolkit by creating genrules to do
+    # symlinking. We create one genrule for each directory we want to track under
+    # cuda_toolkit_path
+    cuda_toolkit_path = cuda_config.cuda_toolkit_path
+    genrules = [symlink_genrule_for_dir(
+        repository_ctx,
+        cuda_include_path,
+        "cuda/include",
+        "cuda-include",
+    )]
+    genrules.append(symlink_genrule_for_dir(
+        repository_ctx,
+        nvvm_libdevice_dir,
+        "cuda/nvvm/libdevice",
+        "cuda-nvvm",
+    ))
+    genrules.append(symlink_genrule_for_dir(
+        repository_ctx,
+        cupti_header_dir,
+        "cuda/extras/CUPTI/include",
+        "cuda-extras",
+    ))
+
+    cuda_libs = _find_libs(repository_ctx, cuda_config)
+    cuda_lib_src = []
+    cuda_lib_dest = []
+    for lib in cuda_libs.values():
+        cuda_lib_src.append(lib.path)
+        cuda_lib_dest.append("cuda/lib/" + lib.file_name)
+    genrules.append(symlink_genrule_for_dir(
+        repository_ctx,
+        None,
+        "",
+        "cuda-lib",
+        cuda_lib_src,
+        cuda_lib_dest,
+    ))
+
+    # Set up the symbolic links for cudnn if cndnn was not installed to
+    # CUDA_TOOLKIT_PATH.
+    included_files = _read_dir(repository_ctx, cuda_include_path).replace(
+        cuda_include_path,
+        "",
+    ).splitlines()
+    if "/cudnn.h" not in included_files:
+        genrules.append(symlink_genrule_for_dir(
+            repository_ctx,
+            None,
+            "cuda/include/",
+            "cudnn-include",
+            [cudnn_header_dir + "/cudnn.h"],
+            ["cudnn.h"],
+        ))
+    else:
+        genrules.append(
+            "filegroup(\n" +
             '    name = "cudnn-include",\n' +
-            '    srcs = [],\n' +
-            ')\n'
+            "    srcs = [],\n" +
+            ")\n",
         )
 
-  # Set up BUILD file for cuda/
-  _tpl(repository_ctx, "cuda:build_defs.bzl",
-       {
-           "%{cuda_is_configured}": "True",
-           "%{cuda_extra_copts}": _compute_cuda_extra_copts(
-               repository_ctx, cuda_config.compute_capabilities),
-       })
-  _tpl(repository_ctx, "cuda:BUILD",
-       {
-           "%{cuda_driver_lib}": cuda_libs["cuda"].file_name,
-           "%{cudart_static_lib}": cuda_libs["cudart_static"].file_name,
-           "%{cudart_static_linkopt}": _cudart_static_linkopt(
-               cuda_config.cpu_value),
-           "%{cudart_lib}": cuda_libs["cudart"].file_name,
-           "%{cublas_lib}": cuda_libs["cublas"].file_name,
-           "%{cusolver_lib}": cuda_libs["cusolver"].file_name,
-           "%{cudnn_lib}": cuda_libs["cudnn"].file_name,
-           "%{cufft_lib}": cuda_libs["cufft"].file_name,
-           "%{curand_lib}": cuda_libs["curand"].file_name,
-           "%{cupti_lib}": cuda_libs["cupti"].file_name,
-           "%{cuda_include_genrules}": "\n".join(genrules),
-           "%{cuda_headers}": ('":cuda-include",\n' +
-                               '        ":cudnn-include",')
-       })
-
-  is_cuda_clang = _use_cuda_clang(repository_ctx)
-
-  should_download_clang = is_cuda_clang and _flag_enabled(
-      repository_ctx, _TF_DOWNLOAD_CLANG)
-  if should_download_clang:
-    download_clang(repository_ctx, "crosstool/extra_tools")
-
-  # Set up crosstool/
-  cc = find_cc(repository_ctx)
-  cc_fullpath = cc if not should_download_clang else "crosstool/" + cc
-
-  host_compiler_includes = _host_compiler_includes(repository_ctx, cc_fullpath)
-  cuda_defines = {}
-  if is_cuda_clang:
-    cuda_defines["%{host_compiler_path}"] = str(cc)
-    cuda_defines["%{host_compiler_warnings}"] = """
+    # Set up BUILD file for cuda/
+    _tpl(
+        repository_ctx,
+        "cuda:build_defs.bzl",
+        {
+            "%{cuda_is_configured}": "True",
+            "%{cuda_extra_copts}": _compute_cuda_extra_copts(
+                repository_ctx,
+                cuda_config.compute_capabilities,
+            ),
+        },
+    )
+    _tpl(
+        repository_ctx,
+        "cuda:BUILD.windows" if _is_windows(repository_ctx) else "cuda:BUILD",
+        {
+            "%{cuda_driver_lib}": cuda_libs["cuda"].file_name,
+            "%{cudart_static_lib}": cuda_libs["cudart_static"].file_name,
+            "%{cudart_static_linkopt}": _cudart_static_linkopt(
+                cuda_config.cpu_value,
+            ),
+            "%{cudart_lib}": cuda_libs["cudart"].file_name,
+            "%{cublas_lib}": cuda_libs["cublas"].file_name,
+            "%{cusolver_lib}": cuda_libs["cusolver"].file_name,
+            "%{cudnn_lib}": cuda_libs["cudnn"].file_name,
+            "%{cufft_lib}": cuda_libs["cufft"].file_name,
+            "%{curand_lib}": cuda_libs["curand"].file_name,
+            "%{cupti_lib}": cuda_libs["cupti"].file_name,
+            "%{cuda_include_genrules}": "\n".join(genrules),
+            "%{cuda_headers}": ('":cuda-include",\n' +
+                                '        ":cudnn-include",'),
+        },
+        "cuda/BUILD",
+    )
+
+    is_cuda_clang = _use_cuda_clang(repository_ctx)
+
+    should_download_clang = is_cuda_clang and _flag_enabled(
+        repository_ctx,
+        _TF_DOWNLOAD_CLANG,
+    )
+    if should_download_clang:
+        download_clang(repository_ctx, "crosstool/extra_tools")
+
+    # Set up crosstool/
+    cc = find_cc(repository_ctx)
+    cc_fullpath = cc if not should_download_clang else "crosstool/" + cc
+
+    host_compiler_includes = _host_compiler_includes(repository_ctx, cc_fullpath)
+    cuda_defines = {}
+    if is_cuda_clang:
+        cuda_defines["%{host_compiler_path}"] = str(cc)
+        cuda_defines["%{host_compiler_warnings}"] = """
         # Some parts of the codebase set -Werror and hit this warning, so
         # switch it off for now.
         flag: "-Wno-invalid-partial-specialization"
     """
-    cuda_defines["%{host_compiler_includes}"] = host_compiler_includes
-    _tpl(repository_ctx, "crosstool:BUILD", {"%{linker_files}": ":empty"})
-    repository_ctx.file("crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc", "")
-  else:
-    cuda_defines["%{host_compiler_path}"] = "clang/bin/crosstool_wrapper_driver_is_not_gcc"
-    cuda_defines["%{host_compiler_warnings}"] = ""
-    # TODO(klimek): We currently need to inject "/" as builtin directory path
-    # to disable bazel's dependency checks.
-    # The problem is that:
-    # - the python rules symlink the python headers into the bazel root
-    # - the rules use 'includes' in the BUILD file to redirect includes of the
-    #   python headers through those paths
-    # - bazel currently uses -isystem for include paths specified via 'includes'
-    # - gcc follows symlinks when resolving files via -isystem paths, and puts
-    #   the resolved paths into the .d file, which makes the dependency check
-    #   fail for bazel
-    # There are multiple possible ways to solve this:
-    # 1. make bazel not use -isystem for paths specified via 'includes'
-    # 2. cp the headers instead of symlinking them
-    #
-    # Once this is fixed, the right builtin directory path is:
-    # (host_compiler_includes +
-    #    "\n  cxx_builtin_include_directory: \"%s\"" % cuda_include_path)
-    # The cuda directory needs to be passed, as there is currently no rule
-    # providing the cuda headers in the same way the python headers are
-    # provided.
-    cuda_defines["%{host_compiler_includes}"] = "\n  cxx_builtin_include_directory: \"/\""
-    nvcc_path = str(repository_ctx.path("%s/bin/nvcc%s" %
-        (cuda_config.cuda_toolkit_path,
-        ".exe" if cuda_config.cpu_value == "Windows" else "")))
-    _tpl(repository_ctx, "crosstool:BUILD",
-         {"%{linker_files}": ":crosstool_wrapper_driver_is_not_gcc"})
-    _tpl(repository_ctx,
-         "crosstool:clang/bin/crosstool_wrapper_driver_is_not_gcc",
-         {
-             "%{cpu_compiler}": str(cc),
-             "%{cuda_version}": cuda_config.cuda_version,
-             "%{nvcc_path}": nvcc_path,
-             "%{gcc_host_compiler_path}": str(cc),
-             "%{cuda_compute_capabilities}": ", ".join(
-                 ["\"%s\"" % c for c in cuda_config.compute_capabilities]),
-         })
-  _tpl(repository_ctx, "crosstool:CROSSTOOL", cuda_defines, out="crosstool/CROSSTOOL")
-
-  # Set up cuda_config.h, which is used by
-  # tensorflow/stream_executor/dso_loader.cc.
-  _tpl(repository_ctx, "cuda:cuda_config.h",
-       {
-           "%{cuda_version}": cuda_config.cuda_version,
-           "%{cudnn_version}": cuda_config.cudnn_version,
-           "%{cuda_compute_capabilities}": ",".join(
-               ["CudaVersion(\"%s\")" % c
-                for c in cuda_config.compute_capabilities]),
-               "%{cuda_toolkit_path}": cuda_config.cuda_toolkit_path,
-       }, "cuda/cuda/cuda_config.h")
+        cuda_defines["%{host_compiler_includes}"] = host_compiler_includes
+        _tpl(repository_ctx, "crosstool:BUILD", {"%{linker_files}": ":empty", "%{win_linker_files}": ":empty"})
+        repository_ctx.file("crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc", "")
+        repository_ctx.file("crosstool/windows/msvc_wrapper_for_nvcc.py", "")
+        repository_ctx.file("crosstool/windows/msvc_wrapper_for_nvcc.bat", "")
+    else:
+        cuda_defines["%{host_compiler_path}"] = "clang/bin/crosstool_wrapper_driver_is_not_gcc"
+        cuda_defines["%{host_compiler_warnings}"] = ""
+
+        # TODO(klimek): We currently need to inject "/" as builtin directory path
+        # to disable bazel's dependency checks.
+        # The problem is that:
+        # - the python rules symlink the python headers into the bazel root
+        # - the rules use 'includes' in the BUILD file to redirect includes of the
+        #   python headers through those paths
+        # - bazel currently uses -isystem for include paths specified via 'includes'
+        # - gcc follows symlinks when resolving files via -isystem paths, and puts
+        #   the resolved paths into the .d file, which makes the dependency check
+        #   fail for bazel
+        # There are multiple possible ways to solve this:
+        # 1. make bazel not use -isystem for paths specified via 'includes'
+        # 2. cp the headers instead of symlinking them
+        #
+        # Once this is fixed, the right builtin directory path is:
+        # (host_compiler_includes +
+        #    "\n  cxx_builtin_include_directory: \"%s\"" % cuda_include_path)
+        # The cuda directory needs to be passed, as there is currently no rule
+        # providing the cuda headers in the same way the python headers are
+        # provided.
+        cuda_defines["%{host_compiler_includes}"] = "\n  cxx_builtin_include_directory: \"/\""
+        nvcc_path = str(repository_ctx.path("%s/bin/nvcc%s" %
+                                            (
+                                                cuda_config.cuda_toolkit_path,
+                                                ".exe" if _is_windows(repository_ctx) else "",
+                                            )))
+        _tpl(
+            repository_ctx,
+            "crosstool:BUILD",
+            {
+                "%{linker_files}": ":crosstool_wrapper_driver_is_not_gcc",
+                "%{win_linker_files}": ":windows_msvc_wrapper_files",
+            },
+        )
+        wrapper_defines = {
+            "%{cpu_compiler}": str(cc),
+            "%{cuda_version}": cuda_config.cuda_version,
+            "%{nvcc_path}": nvcc_path,
+            "%{gcc_host_compiler_path}": str(cc),
+            "%{cuda_compute_capabilities}": ", ".join(
+                ["\"%s\"" % c for c in cuda_config.compute_capabilities],
+            ),
+            "%{nvcc_tmp_dir}": _get_nvcc_tmp_dir_for_windows(repository_ctx),
+        }
+        _tpl(
+            repository_ctx,
+            "crosstool:clang/bin/crosstool_wrapper_driver_is_not_gcc",
+            wrapper_defines,
+        )
+        _tpl(
+            repository_ctx,
+            "crosstool:windows/msvc_wrapper_for_nvcc.py",
+            wrapper_defines,
+        )
+        _tpl(
+            repository_ctx,
+            "crosstool:windows/msvc_wrapper_for_nvcc.bat",
+            {
+                "%{python_binary}": _get_python_bin(repository_ctx),
+            },
+        )
+
+    _tpl(
+        repository_ctx,
+        "crosstool:CROSSTOOL",
+        cuda_defines + _get_win_cuda_defines(repository_ctx),
+        out = "crosstool/CROSSTOOL",
+    )
+
+    # Set up cuda_config.h, which is used by
+    # tensorflow/stream_executor/dso_loader.cc.
+    _tpl(
+        repository_ctx,
+        "cuda:cuda_config.h",
+        {
+            "%{cuda_version}": cuda_config.cuda_version,
+            "%{cudnn_version}": cuda_config.cudnn_version,
+            "%{cuda_compute_capabilities}": ",".join(
+                [
+                    "CudaVersion(\"%s\")" % c
+                    for c in cuda_config.compute_capabilities
+                ],
+            ),
+            "%{cuda_toolkit_path}": cuda_config.cuda_toolkit_path,
+        },
+        "cuda/cuda/cuda_config.h",
+    )
 
 def _create_remote_cuda_repository(repository_ctx, remote_config_repo):
-  """Creates pointers to a remotely configured repo set up to build with CUDA."""
-  _tpl(repository_ctx, "cuda:build_defs.bzl",
-       {
-           "%{cuda_is_configured}": "True",
-           "%{cuda_extra_copts}": _compute_cuda_extra_copts(
-               repository_ctx, _compute_capabilities(repository_ctx)),
-
-       })
-  _tpl(repository_ctx, "cuda:remote.BUILD",
-       {
-           "%{remote_cuda_repo}": remote_config_repo,
-       }, "cuda/BUILD")
-  _tpl(repository_ctx, "crosstool:remote.BUILD", {
-           "%{remote_cuda_repo}": remote_config_repo,
-       }, "crosstool/BUILD")
+    """Creates pointers to a remotely configured repo set up to build with CUDA."""
+    _tpl(
+        repository_ctx,
+        "cuda:build_defs.bzl",
+        {
+            "%{cuda_is_configured}": "True",
+            "%{cuda_extra_copts}": _compute_cuda_extra_copts(
+                repository_ctx,
+                _compute_capabilities(repository_ctx),
+            ),
+        },
+    )
+    _tpl(
+        repository_ctx,
+        "cuda:remote.BUILD",
+        {
+            "%{remote_cuda_repo}": remote_config_repo,
+        },
+        "cuda/BUILD",
+    )
+    _tpl(repository_ctx, "crosstool:remote.BUILD", {
+        "%{remote_cuda_repo}": remote_config_repo,
+    }, "crosstool/BUILD")
 
 def _cuda_autoconf_impl(repository_ctx):
-  """Implementation of the cuda_autoconf repository rule."""
-  if not _enable_cuda(repository_ctx):
-    _create_dummy_repository(repository_ctx)
-  else:
-    if _TF_CUDA_CONFIG_REPO in repository_ctx.os.environ:
-      _create_remote_cuda_repository(repository_ctx,
-          repository_ctx.os.environ[_TF_CUDA_CONFIG_REPO])
+    """Implementation of the cuda_autoconf repository rule."""
+    if not _enable_cuda(repository_ctx):
+        _create_dummy_repository(repository_ctx)
+    elif _TF_CUDA_CONFIG_REPO in repository_ctx.os.environ:
+        _create_remote_cuda_repository(
+            repository_ctx,
+            repository_ctx.os.environ[_TF_CUDA_CONFIG_REPO],
+        )
     else:
-      _create_local_cuda_repository(repository_ctx)
-
+        _create_local_cuda_repository(repository_ctx)
 
 cuda_configure = repository_rule(
     implementation = _cuda_autoconf_impl,
@@ -1181,6 +1457,7 @@ cuda_configure = repository_rule(
         _TF_CUDA_COMPUTE_CAPABILITIES,
         _TF_CUDA_CONFIG_REPO,
         "NVVMIR_LIBRARY_DIR",
+        _PYTHON_BIN_PATH,
     ],
 )
 
diff --git a/third_party/llvm/llvm.autogenerated.BUILD b/third_party/llvm/llvm.autogenerated.BUILD
index d931932d9d..0ac27e26a4 100644
--- a/third_party/llvm/llvm.autogenerated.BUILD
+++ b/third_party/llvm/llvm.autogenerated.BUILD
@@ -8,13 +8,14 @@ exports_files(["LICENSE.TXT"])
 
 load(
     "@org_tensorflow//third_party/llvm:llvm.bzl",
-    "LLVM_COPTS",
-    "LLVM_DEFINES",
-    "LLVM_LINKOPTS",
     "cmake_var_string",
     "expand_cmake_vars",
     "gentbl",
     "llvm_all_cmake_vars",
+    "llvm_copts",
+    "llvm_defines",
+    "llvm_linkopts",
+    "llvm_support_platform_specific_srcs_glob",
 )
 load(
     "@org_tensorflow//third_party:common.bzl",
@@ -27,9 +28,7 @@ llvm_host_triple = "x86_64-unknown-linux_gnu"
 
 llvm_targets = [
     "AArch64",
-    # Uncomment to enable the AMDGPU backend.
-    # TODO(phawkins): use a configure-time test.
-    # "AMDGPU",
+    "AMDGPU",
     "ARM",
     "NVPTX",
     "PowerPC",
@@ -121,7 +120,7 @@ cc_library(
         "include/llvm/Config/config.h",
         "include/llvm/Config/llvm-config.h",
     ],
-    defines = LLVM_DEFINES,
+    defines = llvm_defines,
     includes = ["include"],
 )
 
@@ -198,7 +197,8 @@ cc_binary(
         "utils/TableGen/*.cpp",
         "utils/TableGen/*.h",
     ]),
-    linkopts = LLVM_LINKOPTS,
+    copts = llvm_copts,
+    linkopts = llvm_linkopts,
     stamp = 0,
     deps = [
         ":config",
@@ -214,7 +214,8 @@ cc_binary(
         "utils/FileCheck/*.cpp",
         "utils/FileCheck/*.h",
     ]),
-    linkopts = LLVM_LINKOPTS,
+    copts = llvm_copts,
+    linkopts = llvm_linkopts,
     stamp = 0,
     deps = [":support"],
 )
@@ -253,13 +254,31 @@ llvm_target_list = [
             ("-gen-dag-isel", "lib/Target/AMDGPU/AMDGPUGenDAGISel.inc"),
             ("-gen-callingconv", "lib/Target/AMDGPU/AMDGPUGenCallingConv.inc"),
             ("-gen-subtarget", "lib/Target/AMDGPU/AMDGPUGenSubtargetInfo.inc"),
-            ("-gen-tgt-intrinsic", "lib/Target/AMDGPU/AMDGPUGenIntrinsics.inc"),
+            ("-gen-tgt-intrinsic-impl", "lib/Target/AMDGPU/AMDGPUGenIntrinsicImpl.inc"),
+            ("-gen-tgt-intrinsic-enums", "lib/Target/AMDGPU/AMDGPUGenIntrinsicEnums.inc"),
             ("-gen-emitter", "lib/Target/AMDGPU/AMDGPUGenMCCodeEmitter.inc"),
             ("-gen-dfa-packetizer", "lib/Target/AMDGPU/AMDGPUGenDFAPacketizer.inc"),
             ("-gen-asm-writer", "lib/Target/AMDGPU/AMDGPUGenAsmWriter.inc"),
             ("-gen-asm-matcher", "lib/Target/AMDGPU/AMDGPUGenAsmMatcher.inc"),
             ("-gen-disassembler", "lib/Target/AMDGPU/AMDGPUGenDisassemblerTables.inc"),
             ("-gen-pseudo-lowering", "lib/Target/AMDGPU/AMDGPUGenMCPseudoLowering.inc"),
+            ("-gen-searchable-tables", "lib/Target/AMDGPU/AMDGPUGenSearchableTables.inc"),
+            ("-gen-global-isel", "lib/Target/AMDGPU/AMDGPUGenGlobalISel.inc"),
+        ],
+    },
+    {
+        "name": "AMDGPU",
+        "lower_name": "amdgpu_r600",
+        "short_name": "R600",
+        "tbl_outs": [
+            ("-gen-asm-writer", "lib/Target/AMDGPU/R600GenAsmWriter.inc"),
+            ("-gen-callingconv", "lib/Target/AMDGPU/R600GenCallingConv.inc"),
+            ("-gen-dag-isel", "lib/Target/AMDGPU/R600GenDAGISel.inc"),
+            ("-gen-dfa-packetizer", "lib/Target/AMDGPU/R600GenDFAPacketizer.inc"),
+            ("-gen-instr-info", "lib/Target/AMDGPU/R600GenInstrInfo.inc"),
+            ("-gen-emitter", "lib/Target/AMDGPU/R600GenMCCodeEmitter.inc"),
+            ("-gen-register-info", "lib/Target/AMDGPU/R600GenRegisterInfo.inc"),
+            ("-gen-subtarget", "lib/Target/AMDGPU/R600GenSubtargetInfo.inc"),
         ],
     },
     {
@@ -385,8 +404,7 @@ cc_library(
         "include/llvm/Target/AArch64/AsmParser/*.inc",
         "lib/Target/AArch64/AsmParser/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AArch64"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AArch64"],
     deps = [
         ":aarch64_desc",
         ":aarch64_info",
@@ -411,8 +429,7 @@ cc_library(
         "include/llvm/Target/AArch64/InstPrinter/*.inc",
         "lib/Target/AArch64/InstPrinter/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AArch64"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AArch64"],
     deps = [
         ":aarch64_target_gen",
         ":aarch64_utils",
@@ -435,8 +452,7 @@ cc_library(
         "include/llvm/Target/AArch64/*.inc",
         "lib/Target/AArch64/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AArch64"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AArch64"],
     deps = [
         ":aarch64_asm_printer",
         ":aarch64_desc",
@@ -469,8 +485,7 @@ cc_library(
         "include/llvm/Target/AArch64/MCTargetDesc/*.inc",
         "lib/Target/AArch64/MCTargetDesc/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AArch64"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AArch64"],
     deps = [
         ":aarch64_asm_printer",
         ":aarch64_info",
@@ -497,8 +512,7 @@ cc_library(
         "include/llvm/Target/AArch64/Disassembler/*.inc",
         "lib/Target/AArch64/Disassembler/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AArch64"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AArch64"],
     deps = [
         ":aarch64_desc",
         ":aarch64_info",
@@ -526,8 +540,7 @@ cc_library(
         "lib/Target/AArch64/AArch64*.h",
         "lib/Target/AArch64/TargetInfo/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AArch64"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AArch64"],
     deps = [
         ":code_gen",
         ":config",
@@ -550,8 +563,7 @@ cc_library(
         "include/llvm/Target/AArch64/Utils/*.inc",
         "lib/Target/AArch64/Utils/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AArch64"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AArch64"],
     deps = [
         ":aarch64_target_gen",
         ":config",
@@ -573,8 +585,7 @@ cc_library(
         "include/llvm/Transforms/AggressiveInstCombine/*.def",
         "include/llvm/Transforms/AggressiveInstCombine/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -599,8 +610,7 @@ cc_library(
         "include/llvm/Analysis/*.def",
         "include/llvm/Analysis/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":binary_format",
         ":config",
@@ -624,8 +634,7 @@ cc_library(
         "include/llvm/Target/AMDGPU/MCTargetDesc/*.inc",
         "lib/Target/AMDGPU/MCTargetDesc/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AMDGPU"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AMDGPU"],
     deps = [
         ":amdgpu_asm_printer",
         ":amdgpu_info",
@@ -650,8 +659,7 @@ cc_library(
         "include/llvm/Target/AMDGPU/Disassembler/*.inc",
         "lib/Target/AMDGPU/Disassembler/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AMDGPU"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AMDGPU"],
     deps = [
         ":amdgpu_desc",
         ":amdgpu_info",
@@ -676,9 +684,9 @@ cc_library(
         "include/llvm/Target/AMDGPU/TargetInfo/*.inc",
         "lib/Target/AMDGPU/TargetInfo/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AMDGPU"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AMDGPU"],
     deps = [
+        ":amdgpu_r600_target_gen",
         ":amdgpu_target_gen",
         ":config",
         ":core",
@@ -699,9 +707,9 @@ cc_library(
         "include/llvm/Target/AMDGPU/Utils/*.inc",
         "lib/Target/AMDGPU/Utils/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AMDGPU"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AMDGPU"],
     deps = [
+        ":amdgpu_r600_target_gen",
         ":amdgpu_target_gen",
         ":config",
         ":core",
@@ -723,8 +731,7 @@ cc_library(
         "include/llvm/Target/AMDGPU/AsmParser/*.inc",
         "lib/Target/AMDGPU/AsmParser/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AMDGPU"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AMDGPU"],
     deps = [
         ":amdgpu_desc",
         ":amdgpu_info",
@@ -749,8 +756,7 @@ cc_library(
         "include/llvm/Target/AMDGPU/InstPrinter/*.inc",
         "lib/Target/AMDGPU/InstPrinter/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AMDGPU"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AMDGPU"],
     deps = [
         ":amdgpu_utils",
         ":config",
@@ -772,8 +778,7 @@ cc_library(
         "include/llvm/Target/AMDGPU/*.inc",
         "lib/Target/AMDGPU/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/AMDGPU"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/AMDGPU"],
     deps = [
         ":amdgpu_asm_printer",
         ":amdgpu_desc",
@@ -809,8 +814,7 @@ cc_library(
         "include/llvm/Target/ARM/AsmParser/*.inc",
         "lib/Target/ARM/AsmParser/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/ARM"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/ARM"],
     deps = [
         ":arm_desc",
         ":arm_info",
@@ -836,8 +840,7 @@ cc_library(
         "lib/Target/ARM/*.h",
         "lib/Target/ARM/InstPrinter/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/ARM"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/ARM"],
     deps = [
         ":arm_info",
         ":arm_target_gen",
@@ -861,8 +864,7 @@ cc_library(
         "include/llvm/Target/ARM/*.inc",
         "lib/Target/ARM/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/ARM"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/ARM"],
     deps = [
         ":analysis",
         ":arm_asm_printer",
@@ -879,6 +881,7 @@ cc_library(
         ":selection_dag",
         ":support",
         ":target",
+        ":transform_utils",
     ],
 )
 
@@ -897,8 +900,7 @@ cc_library(
         "include/llvm/Target/ARM/MCTargetDesc/*.inc",
         "lib/Target/ARM/MCTargetDesc/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/ARM"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/ARM"],
     deps = [
         ":arm_asm_printer",
         ":arm_info",
@@ -926,8 +928,7 @@ cc_library(
         "include/llvm/Target/ARM/Disassembler/*.inc",
         "lib/Target/ARM/Disassembler/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/ARM"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/ARM"],
     deps = [
         ":arm_desc",
         ":arm_info",
@@ -952,8 +953,7 @@ cc_library(
         "include/llvm/Target/ARM/TargetInfo/*.inc",
         "lib/Target/ARM/TargetInfo/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/ARM"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/ARM"],
     deps = [
         ":arm_target_gen",
         ":config",
@@ -976,8 +976,7 @@ cc_library(
         "include/llvm/Target/ARM/Utils/*.inc",
         "lib/Target/ARM/Utils/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/ARM"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/ARM"],
     deps = [
         ":arm_target_gen",
         ":config",
@@ -999,8 +998,7 @@ cc_library(
         "include/llvm/AsmParser/*.def",
         "include/llvm/AsmParser/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":binary_format",
         ":config",
@@ -1023,8 +1021,7 @@ cc_library(
         "include/llvm/CodeGen/AsmPrinter/*.inc",
         "lib/CodeGen/AsmPrinter/*.def",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":binary_format",
@@ -1055,8 +1052,7 @@ cc_library(
         "include/llvm/BinaryFormat/ELFRelocs/*.def",
         "include/llvm/BinaryFormat/WasmRelocs/*.def",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":support",
@@ -1077,8 +1073,7 @@ cc_library(
         "include/llvm/Bitcode/Reader/*.inc",
         "include/llvm/Bitcode/BitstreamReader.h",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":core",
@@ -1102,8 +1097,7 @@ cc_library(
         "include/llvm/Bitcode/BitcodeWriterPass.h",
         "include/llvm/Bitcode/BitstreamWriter.h",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -1128,8 +1122,7 @@ cc_library(
         "include/llvm/CodeGen/*.inc",
         "include/llvm/CodeGen/**/*.h",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":bit_reader",
@@ -1167,8 +1160,7 @@ cc_library(
         "include/llvm/*.h",
         "include/llvm/Analysis/*.def",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":attributes_compat_gen",
         ":attributes_gen",
@@ -1193,8 +1185,7 @@ cc_library(
         "include/llvm/DebugInfo/CodeView/*.def",
         "include/llvm/DebugInfo/CodeView/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":binary_format",
         ":config",
@@ -1216,8 +1207,7 @@ cc_library(
         "include/llvm/DebugInfo/MSF/*.def",
         "include/llvm/DebugInfo/MSF/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":support",
@@ -1237,8 +1227,7 @@ cc_library(
         "include/llvm/Demangle/*.def",
         "include/llvm/Demangle/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [":config"],
 )
 
@@ -1255,8 +1244,7 @@ cc_library(
         "include/llvm/ExecutionEngine/*.def",
         "include/llvm/ExecutionEngine/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":core",
@@ -1281,8 +1269,7 @@ cc_library(
         "include/llvm/CodeGen/GlobalISel/*.def",
         "include/llvm/CodeGen/GlobalISel/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":code_gen",
@@ -1312,8 +1299,7 @@ cc_library(
         "include/llvm/Transforms/InstrProfiling.h",
         "include/llvm/Transforms/PGOInstrumentation.h",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -1338,8 +1324,7 @@ cc_library(
         "include/llvm/Transforms/InstCombine/*.def",
         "include/llvm/Transforms/InstCombine/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -1366,8 +1351,7 @@ cc_library(
         "include/llvm/Transforms/IPO/*.def",
         "include/llvm/Transforms/IPO/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":aggressive_inst_combine",
         ":analysis",
@@ -1401,8 +1385,7 @@ cc_library(
         "include/llvm/IRReader/*.def",
         "include/llvm/IRReader/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":asm_parser",
         ":bit_reader",
@@ -1425,8 +1408,7 @@ cc_library(
         "include/llvm/Linker/*.def",
         "include/llvm/Linker/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":core",
@@ -1448,8 +1430,7 @@ cc_library(
         "include/llvm/MC/*.def",
         "include/llvm/MC/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":binary_format",
         ":config",
@@ -1471,8 +1452,7 @@ cc_library(
         "include/llvm/MC/MCDisassembler/*.def",
         "include/llvm/MC/MCDisassembler/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":mc",
@@ -1493,8 +1473,7 @@ cc_library(
         "include/llvm/MC/MCParser/*.def",
         "include/llvm/MC/MCParser/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":mc",
@@ -1515,8 +1494,7 @@ cc_library(
         "include/llvm/Target/NVPTX/InstPrinter/*.inc",
         "lib/Target/NVPTX/InstPrinter/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/NVPTX"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/NVPTX"],
     deps = [
         "nvptx_target_gen",
         ":attributes_gen",
@@ -1540,8 +1518,7 @@ cc_library(
         "include/llvm/Target/NVPTX/*.inc",
         "lib/Target/NVPTX/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/NVPTX"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/NVPTX"],
     deps = [
         ":analysis",
         ":asm_printer",
@@ -1575,8 +1552,7 @@ cc_library(
         "include/llvm/Target/NVPTX/MCTargetDesc/*.inc",
         "lib/Target/NVPTX/MCTargetDesc/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/NVPTX"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/NVPTX"],
     deps = [
         "nvptx_target_gen",
         ":config",
@@ -1602,8 +1578,7 @@ cc_library(
         "lib/Target/NVPTX/NVPTX.h",
         "lib/Target/NVPTX/TargetInfo/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/NVPTX"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/NVPTX"],
     deps = [
         "nvptx_target_gen",
         ":attributes_gen",
@@ -1627,8 +1602,7 @@ cc_library(
         "include/llvm/Object/*.def",
         "include/llvm/Object/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":binary_format",
         ":bit_reader",
@@ -1654,8 +1628,7 @@ cc_library(
         "include/llvm/Transforms/ObjCARC/*.def",
         "include/llvm/Transforms/ObjCARC/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -1678,8 +1651,7 @@ cc_library(
         "include/llvm/ExecutionEngine/Orc/*.def",
         "include/llvm/ExecutionEngine/Orc/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":core",
@@ -1706,8 +1678,7 @@ cc_library(
         "include/llvm/Target/PowerPC/AsmParser/*.inc",
         "lib/Target/PowerPC/AsmParser/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/PowerPC"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/PowerPC"],
     deps = [
         ":config",
         ":mc",
@@ -1731,8 +1702,7 @@ cc_library(
         "include/llvm/Target/PowerPC/InstPrinter/*.inc",
         "lib/Target/PowerPC/InstPrinter/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/PowerPC"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/PowerPC"],
     deps = [
         ":attributes_gen",
         ":config",
@@ -1758,8 +1728,7 @@ cc_library(
         "include/llvm/Target/PowerPC/*.inc",
         "lib/Target/PowerPC/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/PowerPC"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/PowerPC"],
     deps = [
         ":analysis",
         ":asm_printer",
@@ -1791,8 +1760,7 @@ cc_library(
         "include/llvm/Target/PowerPC/MCTargetDesc/*.inc",
         "lib/Target/PowerPC/MCTargetDesc/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/PowerPC"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/PowerPC"],
     deps = [
         ":attributes_gen",
         ":config",
@@ -1819,8 +1787,7 @@ cc_library(
         "include/llvm/Target/PowerPC/Disassembler/*.inc",
         "lib/Target/PowerPC/Disassembler/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/PowerPC"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/PowerPC"],
     deps = [
         ":config",
         ":mc_disassembler",
@@ -1844,8 +1811,7 @@ cc_library(
         "lib/Target/PowerPC/PPC*.h",
         "lib/Target/PowerPC/TargetInfo/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/PowerPC"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/PowerPC"],
     deps = [
         ":attributes_gen",
         ":config",
@@ -1869,8 +1835,7 @@ cc_library(
         "include/llvm/ProfileData/*.def",
         "include/llvm/ProfileData/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":core",
@@ -1899,8 +1864,7 @@ cc_library(
         "include/llvm/ExecutionEngine/RTDyldMemoryManager.h",
         "include/llvm/ExecutionEngine/RuntimeDyld*.h",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":mc",
@@ -1928,8 +1892,7 @@ cc_library(
         "include/llvm/Transforms/IPO.h",
         "include/llvm/Transforms/IPO/SCCP.h",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":aggressive_inst_combine",
         ":analysis",
@@ -1955,8 +1918,7 @@ cc_library(
         "include/llvm/CodeGen/SelectionDAG/*.def",
         "include/llvm/CodeGen/SelectionDAG/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":code_gen",
@@ -1975,14 +1937,12 @@ cc_library(
         "lib/Support/*.c",
         "lib/Support/*.cpp",
         "lib/Support/*.inc",
-        "lib/Support/Unix/*.inc",
-        "lib/Support/Unix/*.h",
         "include/llvm-c/*.h",
         "include/llvm/CodeGen/MachineValueType.h",
         "include/llvm/BinaryFormat/COFF.h",
         "include/llvm/BinaryFormat/MachO.h",
         "lib/Support/*.h",
-    ]),
+    ]) + llvm_support_platform_specific_srcs_glob(),
     hdrs = glob([
         "include/llvm/Support/*.h",
         "include/llvm/Support/*.def",
@@ -1994,8 +1954,7 @@ cc_library(
         "include/llvm/BinaryFormat/MachO.def",
         "include/llvm/Support/VCSRevision.h",
     ],
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":demangle",
@@ -2018,8 +1977,7 @@ cc_library(
         "include/llvm/TableGen/*.inc",
         "include/llvm/Target/*.def",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":config",
         ":mc",
@@ -2045,8 +2003,7 @@ cc_library(
         "include/llvm/CodeGen/*.def",
         "include/llvm/CodeGen/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -2071,8 +2028,7 @@ cc_library(
         "include/llvm/Transforms/Utils/*.def",
         "include/llvm/Transforms/Utils/*.inc",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -2096,8 +2052,7 @@ cc_library(
         "include/llvm/Transforms/Vectorize/*.inc",
         "include/llvm/Transforms/Vectorize.h",
     ]),
-    copts = LLVM_COPTS,
-    defines = LLVM_DEFINES,
+    copts = llvm_copts,
     deps = [
         ":analysis",
         ":config",
@@ -2121,8 +2076,7 @@ cc_library(
         "include/llvm/Target/X86/AsmParser/*.inc",
         "lib/Target/X86/AsmParser/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/X86"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/X86"],
     deps = [
         ":config",
         ":mc",
@@ -2147,8 +2101,7 @@ cc_library(
         "include/llvm/Target/X86/InstPrinter/*.inc",
         "lib/Target/X86/InstPrinter/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/X86"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/X86"],
     deps = [
         ":config",
         ":mc",
@@ -2172,8 +2125,7 @@ cc_library(
         "include/llvm/Target/X86/*.inc",
         "lib/Target/X86/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/X86"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/X86"],
     deps = [
         ":analysis",
         ":asm_printer",
@@ -2206,8 +2158,7 @@ cc_library(
         "include/llvm/Target/X86/MCTargetDesc/*.inc",
         "lib/Target/X86/MCTargetDesc/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/X86"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/X86"],
     deps = [
         ":config",
         ":mc",
@@ -2232,8 +2183,7 @@ cc_library(
         "include/llvm/Target/X86/Disassembler/*.inc",
         "lib/Target/X86/Disassembler/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/X86"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/X86"],
     deps = [
         ":config",
         ":mc_disassembler",
@@ -2256,8 +2206,7 @@ cc_library(
         "include/llvm/Target/X86/TargetInfo/*.inc",
         "lib/Target/X86/TargetInfo/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/X86"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/X86"],
     deps = [
         ":config",
         ":mc",
@@ -2279,8 +2228,7 @@ cc_library(
         "include/llvm/Target/X86/Utils/*.inc",
         "lib/Target/X86/Utils/*.h",
     ]),
-    copts = LLVM_COPTS + ["-Iexternal/llvm/lib/Target/X86"],
-    defines = LLVM_DEFINES,
+    copts = llvm_copts + ["-Iexternal/llvm/lib/Target/X86"],
     deps = [
         ":code_gen",
         ":config",
diff --git a/third_party/llvm/llvm.bzl b/third_party/llvm/llvm.bzl
index 2e809e5f14..d493a3c476 100644
--- a/third_party/llvm/llvm.bzl
+++ b/third_party/llvm/llvm.bzl
@@ -7,103 +7,143 @@ TODO(chandlerc): Currently this expresses include-based dependencies as
 correctly understood by the build system.
 """
 
+def _dict_add(*dictionaries):
+    """Returns a new `dict` that has all the entries of the given dictionaries.
+
+    If the same key is present in more than one of the input dictionaries, the
+    last of them in the argument list overrides any earlier ones.
+
+    This function is designed to take zero or one arguments as well as multiple
+    dictionaries, so that it follows arithmetic identities and callers can avoid
+    special cases for their inputs: the sum of zero dictionaries is the empty
+    dictionary, and the sum of a single dictionary is a copy of itself.
+
+    Re-implemented here to avoid adding a dependency on skylib.
+
+    Args:
+      *dictionaries: Zero or more dictionaries to be added.
+
+    Returns:
+      A new `dict` that has all the entries of the given dictionaries.
+    """
+    result = {}
+    for d in dictionaries:
+        result.update(d)
+    return result
+
 def gentbl(name, tblgen, td_file, td_srcs, tbl_outs, library = True, **kwargs):
-  """gentbl() generates tabular code from a table definition file.
-
-  Args:
-    name: The name of the build rule for use in dependencies.
-    tblgen: The binary used to produce the output.
-    td_file: The primary table definitions file.
-    td_srcs: A list of table definition files included transitively.
-    tbl_outs: A list of tuples (opts, out), where each opts is a string of
-      options passed to tblgen, and the out is the corresponding output file
-      produced.
-    library: Whether to bundle the generated files into a library.
-    **kwargs: Keyword arguments to pass to subsidiary cc_library() rule.
-  """
-  if td_file not in td_srcs:
-    td_srcs += [td_file]
-  includes = []
-  for (opts, out) in tbl_outs:
-    outdir = out[:out.rindex("/")]
-    if outdir not in includes:
-      includes.append(outdir)
-    rule_suffix = "_".join(opts.replace("-", "_").replace("=", "_").split(" "))
-    native.genrule(
-        name="%s_%s_genrule" % (name, rule_suffix),
-        srcs=td_srcs,
-        outs=[out],
-        tools=[tblgen],
-        message="Generating code from table: %s" % td_file,
-        cmd=(("$(location %s) " + "-I external/llvm/include " +
-              "-I external/llvm/tools/clang/include " +
-              "-I $$(dirname $(location %s)) " + "%s $(location %s) -o $@") % (
-                  tblgen, td_file, opts, td_file)))
-  # For now, all generated files can be assumed to comprise public interfaces.
-  # If this is not true, you should specify library = False
-  # and list the generated '.inc' files in "srcs".
-  if library:
-    native.cc_library(name=name, textual_hdrs=[f for (_, f) in tbl_outs],
-                      includes=includes,  **kwargs)
+    """gentbl() generates tabular code from a table definition file.
+
+    Args:
+      name: The name of the build rule for use in dependencies.
+      tblgen: The binary used to produce the output.
+      td_file: The primary table definitions file.
+      td_srcs: A list of table definition files included transitively.
+      tbl_outs: A list of tuples (opts, out), where each opts is a string of
+        options passed to tblgen, and the out is the corresponding output file
+        produced.
+      library: Whether to bundle the generated files into a library.
+      **kwargs: Keyword arguments to pass to subsidiary cc_library() rule.
+    """
+    if td_file not in td_srcs:
+        td_srcs += [td_file]
+    includes = []
+    for (opts, out) in tbl_outs:
+        outdir = out[:out.rindex("/")]
+        if outdir not in includes:
+            includes.append(outdir)
+        rule_suffix = "_".join(opts.replace("-", "_").replace("=", "_").split(" "))
+        native.genrule(
+            name = "%s_%s_genrule" % (name, rule_suffix),
+            srcs = td_srcs,
+            outs = [out],
+            tools = [tblgen],
+            message = "Generating code from table: %s" % td_file,
+            cmd = (("$(location %s) " + "-I external/llvm/include " +
+                    "-I external/llvm/tools/clang/include " +
+                    "-I $$(dirname $(location %s)) " + "%s $(location %s) -o $@") % (
+                tblgen,
+                td_file,
+                opts,
+                td_file,
+            )),
+        )
+
+    # For now, all generated files can be assumed to comprise public interfaces.
+    # If this is not true, you should specify library = False
+    # and list the generated '.inc' files in "srcs".
+    if library:
+        native.cc_library(
+            name = name,
+            textual_hdrs = [f for (_, f) in tbl_outs],
+            includes = includes,
+            **kwargs
+        )
 
 def llvm_target_cmake_vars(native_arch, target_triple):
-  return {
-      "LLVM_HOST_TRIPLE": target_triple,
-      "LLVM_DEFAULT_TARGET_TRIPLE": target_triple,
-      "LLVM_NATIVE_ARCH": native_arch,
-  }
+    return {
+        "LLVM_HOST_TRIPLE": target_triple,
+        "LLVM_DEFAULT_TARGET_TRIPLE": target_triple,
+        "LLVM_NATIVE_ARCH": native_arch,
+    }
 
 def _quote(s):
-  """Quotes the given string for use in a shell command.
-
-  This function double-quotes the given string (in case it contains spaces or
-  other special characters) and escapes any special characters (dollar signs,
-  double-quotes, and backslashes) that may be present.
-
-  Args:
-    s: The string to quote.
-  Returns:
-    An escaped and quoted version of the string that can be passed to a shell
-    command.
-  """
-  return ('"' +
-          s.replace("\\", "\\\\").replace("$", "\\$").replace('"', '\\"') +
-          '"')
+    """Quotes the given string for use in a shell command.
+
+    This function double-quotes the given string (in case it contains spaces or
+    other special characters) and escapes any special characters (dollar signs,
+    double-quotes, and backslashes) that may be present.
+
+    Args:
+      s: The string to quote.
+
+    Returns:
+      An escaped and quoted version of the string that can be passed to a shell
+      command.
+    """
+    return ('"' +
+            s.replace("\\", "\\\\").replace("$", "\\$").replace('"', '\\"') +
+            '"')
 
 def cmake_var_string(cmake_vars):
-  """Converts a dictionary to an input suitable for expand_cmake_vars.
+    """Converts a dictionary to an input suitable for expand_cmake_vars.
+
+    Ideally we would jist stringify in the expand_cmake_vars() rule, but select()
+    interacts badly with genrules.
 
-  Ideally we would jist stringify in the expand_cmake_vars() rule, but select()
-  interacts badly with genrules.
+    TODO(phawkins): replace the genrule() with native rule and delete this rule.
 
-  TODO(phawkins): replace the genrule() with native rule and delete this rule.
+    Args:
+      cmake_vars: a dictionary with string keys and values that are convertable to
+        strings.
 
-  Args:
-    cmake_vars: a dictionary with string keys and values that are convertable to
-      strings.
-  """
-  return " ".join([_quote("{}={}".format(k, str(v)))
-                   for (k, v) in cmake_vars.items()])
+    Returns:
+      cmake_vars in a form suitable for passing to expand_cmake_vars.
+    """
+    return " ".join([
+        _quote("{}={}".format(k, str(v)))
+        for (k, v) in cmake_vars.items()
+    ])
 
 def expand_cmake_vars(name, src, dst, cmake_vars):
-  """Expands #cmakedefine, #cmakedefine01, and CMake variables in a text file.
-
-  Args:
-    name: the name of the rule
-    src: the input of the rule
-    dst: the output of the rule
-    cmake_vars: a string containing the CMake variables, as generated by
-      cmake_var_string.
-  """
-  expand_cmake_vars_tool = Label("@org_tensorflow//third_party/llvm:expand_cmake_vars")
-  native.genrule(
-      name = name,
-      srcs = [src],
-      tools = [expand_cmake_vars_tool],
-      outs = [dst],
-      cmd = ("$(location {}) ".format(expand_cmake_vars_tool) + cmake_vars +
-             "< $< > $@")
-  )
+    """Expands #cmakedefine, #cmakedefine01, and CMake variables in a text file.
+
+    Args:
+      name: the name of the rule
+      src: the input of the rule
+      dst: the output of the rule
+      cmake_vars: a string containing the CMake variables, as generated by
+        cmake_var_string.
+    """
+    expand_cmake_vars_tool = Label("@org_tensorflow//third_party/llvm:expand_cmake_vars")
+    native.genrule(
+        name = name,
+        srcs = [src],
+        tools = [expand_cmake_vars_tool],
+        outs = [dst],
+        cmd = ("$(location {}) ".format(expand_cmake_vars_tool) + cmake_vars +
+               "< $< > $@"),
+    )
 
 # TODO(phawkins): the set of CMake variables was hardcoded for expediency.
 # However, we should really detect many of these via configure-time tests.
@@ -212,23 +252,31 @@ darwin_cmake_vars = {
 # than hardcoding x86_64.
 llvm_all_cmake_vars = select({
     "@org_tensorflow//tensorflow:darwin": cmake_var_string(
-        cmake_vars + llvm_target_cmake_vars("X86", "x86_64-apple-darwin") +
-        darwin_cmake_vars),
+        _dict_add(
+            cmake_vars,
+            llvm_target_cmake_vars("X86", "x86_64-apple-darwin"),
+            darwin_cmake_vars,
+        ),
+    ),
     "@org_tensorflow//tensorflow:linux_ppc64le": cmake_var_string(
-        cmake_vars +
-        llvm_target_cmake_vars("PowerPC", "powerpc64le-unknown-linux_gnu") +
-        linux_cmake_vars,
+        _dict_add(
+            cmake_vars,
+            llvm_target_cmake_vars("PowerPC", "powerpc64le-unknown-linux_gnu"),
+            linux_cmake_vars,
+        ),
     ),
     "//conditions:default": cmake_var_string(
-         cmake_vars +
-         llvm_target_cmake_vars("X86", "x86_64-unknown-linux_gnu") +
-         linux_cmake_vars),
-
+        _dict_add(
+            cmake_vars,
+            llvm_target_cmake_vars("X86", "x86_64-unknown-linux_gnu"),
+            linux_cmake_vars,
+        ),
+    ),
 })
 
-LLVM_LINKOPTS = ["-ldl", "-lm", "-lpthread"]
+llvm_linkopts = ["-ldl", "-lm", "-lpthread"]
 
-LLVM_DEFINES = [
+llvm_defines = [
     "LLVM_ENABLE_STATS",
     "__STDC_LIMIT_MACROS",
     "__STDC_CONSTANT_MACROS",
@@ -237,4 +285,14 @@ LLVM_DEFINES = [
     "LLVM_BUILD_GLOBAL_ISEL",
 ]
 
-LLVM_COPTS = []
+llvm_copts = []
+
+# Platform specific sources for libSupport.
+
+def llvm_support_platform_specific_srcs_glob():
+    return select({
+        "//conditions:default": native.glob([
+            "lib/Support/Unix/*.inc",
+            "lib/Support/Unix/*.h",
+        ]),
+    })
diff --git a/third_party/mkl/LICENSE b/third_party/mkl/LICENSE
new file mode 100644
index 0000000000..9c8f3ea087
--- /dev/null
+++ b/third_party/mkl/LICENSE
@@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "{}"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright {yyyy} {name of copyright owner}
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+\ No newline at end of file
diff --git a/third_party/mkl_dnn/BUILD b/third_party/mkl_dnn/BUILD
index 5b01f6e3e4..d075809ee9 100644
--- a/third_party/mkl_dnn/BUILD
+++ b/third_party/mkl_dnn/BUILD
@@ -1 +1,11 @@
 licenses(["notice"])
+
+exports_files(["LICENSE"])
+
+config_setting(
+    name = "using_mkl_dnn_only",
+    values = {
+        "define": "using_mkl_dnn_only=true",
+    },
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/mkl_dnn/LICENSE b/third_party/mkl_dnn/LICENSE
new file mode 100644
index 0000000000..8dada3edaf
--- /dev/null
+++ b/third_party/mkl_dnn/LICENSE
@@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "{}"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright {yyyy} {name of copyright owner}
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
diff --git a/third_party/mkl_dnn/build_defs.bzl b/third_party/mkl_dnn/build_defs.bzl
new file mode 100644
index 0000000000..7ce2a7d9b0
--- /dev/null
+++ b/third_party/mkl_dnn/build_defs.bzl
@@ -0,0 +1,13 @@
+def if_mkl_open_source_only(if_true, if_false = []):
+    """Shorthand for select()'ing on whether we're building with
+    MKL-DNN open source lib only, without depending on MKL binary form.
+
+    Returns a select statement which evaluates to if_true if we're building
+    with MKL-DNN open source lib only. Otherwise,
+    the select statement evaluates to if_false.
+
+    """
+    return select({
+        str(Label("//third_party/mkl_dnn:using_mkl_dnn_only")): if_true,
+        "//conditions:default": if_false,
+    })
diff --git a/third_party/mkl_dnn/mkldnn.BUILD b/third_party/mkl_dnn/mkldnn.BUILD
index 68f24aabae..597ac69e2f 100644
--- a/third_party/mkl_dnn/mkldnn.BUILD
+++ b/third_party/mkl_dnn/mkldnn.BUILD
@@ -1,5 +1,10 @@
 exports_files(["LICENSE"])
 
+load(
+    "@org_tensorflow//third_party/mkl_dnn:build_defs.bzl",
+    "if_mkl_open_source_only",
+)
+
 config_setting(
     name = "clang_linux_x86_64",
     values = {
@@ -13,9 +18,17 @@ cc_library(
     srcs = glob([
         "src/common/*.cpp",
         "src/cpu/*.cpp",
+        "src/cpu/gemm/*.cpp",
     ]),
     hdrs = glob(["include/*"]),
-    copts = ["-fexceptions"] + select({
+    copts = [
+        "-fexceptions",
+        "-DUSE_MKL",
+        "-DUSE_CBLAS",
+    ] + if_mkl_open_source_only([
+        "-UUSE_MKL",
+        "-UUSE_CBLAS",
+    ]) + select({
         "@org_tensorflow//tensorflow:linux_x86_64": [
             "-fopenmp",  # only works with gcc
         ],
@@ -30,7 +43,23 @@ cc_library(
         "src/common",
         "src/cpu",
         "src/cpu/xbyak",
+        "src/cpu/gemm",
     ],
     nocopts = "-fno-exceptions",
     visibility = ["//visibility:public"],
+    deps = select({
+        "@org_tensorflow//tensorflow:linux_x86_64": [
+            "@mkl_linux//:mkl_headers",
+            "@mkl_linux//:mkl_libs_linux",
+        ],
+        "@org_tensorflow//tensorflow:darwin": [
+            "@mkl_darwin//:mkl_headers",
+            "@mkl_darwin//:mkl_libs_darwin",
+        ],
+        "@org_tensorflow//tensorflow:windows": [
+            "@mkl_windows//:mkl_headers",
+            "@mkl_windows//:mkl_libs_windows",
+        ],
+        "//conditions:default": [],
+    }),
 )
diff --git a/third_party/nanopb.BUILD b/third_party/nanopb.BUILD
new file mode 100644
index 0000000000..d21866911b
--- /dev/null
+++ b/third_party/nanopb.BUILD
@@ -0,0 +1,23 @@
+# Description:
+#   Nanopb, a tiny ANSI C protobuf implementation for use on embedded devices.
+
+licenses(["notice"])  # zlib license
+
+exports_files(["LICENSE.txt"])
+
+cc_library(
+    name = "nanopb",
+    srcs = [
+        "pb_common.c",
+        "pb_decode.c",
+        "pb_encode.c",
+    ],
+    hdrs = [
+        "pb.h",
+        "pb_common.h",
+        "pb_decode.h",
+        "pb_encode.h",
+    ],
+    includes = ["."],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/nasm.BUILD b/third_party/nasm.BUILD
index 341d58068b..89330eac54 100644
--- a/third_party/nasm.BUILD
+++ b/third_party/nasm.BUILD
@@ -8,45 +8,93 @@ exports_files(["LICENSE"])
 cc_binary(
     name = "nasm",
     srcs = [
-        "assemble.c",
-        "assemble.h",
-        "compiler.h",
-        "crc64.c",
-        "directiv.c",
-        "directiv.h",
-        "disp8.c",
-        "disp8.h",
-        "eval.c",
-        "eval.h",
-        "exprlib.c",
-        "float.c",
-        "float.h",
-        "hashtbl.c",
-        "hashtbl.h",
-        "iflag.c",
-        "iflag.h",
-        "iflaggen.h",
-        "ilog2.c",
-        "insns.h",
-        "insnsa.c",
-        "insnsb.c",
-        "insnsi.h",
-        "labels.c",
-        "labels.h",
-        "lib/strlcpy.c",
-        "listing.c",
-        "listing.h",
-        "macros.c",
-        "md5.h",
-        "md5c.c",
-        "nasm.c",
-        "nasm.h",
-        "nasmlib.c",
-        "nasmlib.h",
-        "opflags.h",
+        "asm/assemble.c",
+        "asm/assemble.h",
+        "asm/directbl.c",
+        "asm/directiv.c",
+        "asm/directiv.h",
+        "asm/error.c",
+        "asm/eval.c",
+        "asm/eval.h",
+        "asm/exprdump.c",
+        "asm/exprlib.c",
+        "asm/float.c",
+        "asm/float.h",
+        "asm/labels.c",
+        "asm/listing.c",
+        "asm/listing.h",
+        "asm/nasm.c",
+        "asm/parser.c",
+        "asm/parser.h",
+        "asm/pptok.c",
+        "asm/pptok.h",
+        "asm/pragma.c",
+        "asm/preproc.c",
+        "asm/preproc.h",
+        "asm/preproc-nop.c",
+        "asm/quote.c",
+        "asm/quote.h",
+        "asm/rdstrnum.c",
+        "asm/segalloc.c",
+        "asm/stdscan.c",
+        "asm/stdscan.h",
+        "asm/strfunc.c",
+        "asm/tokens.h",
+        "asm/tokhash.c",
+        "common/common.c",
+        "config/unknown.h",
+        "disasm/disasm.c",
+        "disasm/disasm.h",
+        "disasm/sync.c",
+        "disasm/sync.h",
+        "include/compiler.h",
+        "include/disp8.h",
+        "include/error.h",
+        "include/hashtbl.h",
+        "include/iflag.h",
+        "include/insns.h",
+        "include/labels.h",
+        "include/md5.h",
+        "include/nasm.h",
+        "include/nasmint.h",
+        "include/nasmlib.h",
+        "include/opflags.h",
+        "include/perfhash.h",
+        "include/raa.h",
+        "include/rbtree.h",
+        "include/rdoff.h",
+        "include/saa.h",
+        "include/strlist.h",
+        "include/tables.h",
+        "include/ver.h",
+        "macros/macros.c",
+        "nasmlib/badenum.c",
+        "nasmlib/bsi.c",
+        "nasmlib/crc64.c",
+        "nasmlib/file.c",
+        "nasmlib/file.h",
+        "nasmlib/filename.c",
+        "nasmlib/hashtbl.c",
+        "nasmlib/ilog2.c",
+        "nasmlib/malloc.c",
+        "nasmlib/md5c.c",
+        "nasmlib/mmap.c",
+        "nasmlib/path.c",
+        "nasmlib/perfhash.c",
+        "nasmlib/raa.c",
+        "nasmlib/rbtree.c",
+        "nasmlib/readnum.c",
+        "nasmlib/realpath.c",
+        "nasmlib/saa.c",
+        "nasmlib/srcfile.c",
+        "nasmlib/string.c",
+        "nasmlib/strlist.c",
+        "nasmlib/ver.c",
+        "nasmlib/zerobuf.c",
         "output/codeview.c",
         "output/dwarf.h",
         "output/elf.h",
+        "output/legacy.c",
         "output/nulldbg.c",
         "output/nullout.c",
         "output/outaout.c",
@@ -56,9 +104,6 @@ cc_binary(
         "output/outdbg.c",
         "output/outelf.c",
         "output/outelf.h",
-        "output/outelf32.c",
-        "output/outelf64.c",
-        "output/outelfx32.c",
         "output/outform.c",
         "output/outform.h",
         "output/outieee.c",
@@ -69,35 +114,31 @@ cc_binary(
         "output/outrdf2.c",
         "output/pecoff.h",
         "output/stabs.h",
-        "parser.c",
-        "parser.h",
-        "pptok.c",
-        "pptok.h",
-        "preproc.c",
-        "preproc.h",
-        "preproc-nop.c",
-        "quote.c",
-        "quote.h",
-        "raa.c",
-        "raa.h",
-        "rbtree.c",
-        "rbtree.h",
-        "rdoff/rdoff.h",
-        "realpath.c",
-        "regflags.c",
-        "regs.h",
-        "regvals.c",
-        "saa.c",
-        "saa.h",
-        "srcfile.c",
-        "stdscan.c",
-        "stdscan.h",
-        "strfunc.c",
-        "tables.h",
-        "tokens.h",
-        "tokhash.c",
-        "ver.c",
+        "stdlib/snprintf.c",
+        "stdlib/strlcpy.c",
+        "stdlib/strnlen.c",
+        "stdlib/vsnprintf.c",
         "version.h",
+        "x86/disp8.c",
+        "x86/iflag.c",
+        "x86/iflaggen.h",
+        "x86/insnsa.c",
+        "x86/insnsb.c",
+        "x86/insnsd.c",
+        "x86/insnsi.h",
+        "x86/insnsn.c",
+        "x86/regdis.c",
+        "x86/regdis.h",
+        "x86/regflags.c",
+        "x86/regs.c",
+        "x86/regs.h",
+        "x86/regvals.c",
+    ],
+    includes = [
+        "asm",
+        "include",
+        "output",
+        "x86",
     ],
     copts = select({
         ":windows": [],
@@ -110,7 +151,10 @@ cc_binary(
     defines = select({
         ":windows": [],
         ":windows_msvc": [],
-        "//conditions:default": ["HAVE_SNPRINTF"],
+        "//conditions:default": [
+            "HAVE_SNPRINTF",
+            "HAVE_SYS_TYPES_H",
+        ],
     }),
     visibility = ["@jpeg//:__pkg__"],
 )
diff --git a/third_party/nccl/nccl_configure.bzl b/third_party/nccl/nccl_configure.bzl
index 9dfcb18369..5d1ebf0686 100644
--- a/third_party/nccl/nccl_configure.bzl
+++ b/third_party/nccl/nccl_configure.bzl
@@ -47,10 +47,10 @@ alias(
 )
 """
 
+# Local build results in dynamic link and the license should not be included.
 _NCCL_LOCAL_BUILD_TEMPLATE = """
 filegroup(
   name = "LICENSE",
-  data = ["nccl/NCCL-SLA.txt"],
   visibility = ["//visibility:public"],
 )
 
diff --git a/third_party/repo.bzl b/third_party/repo.bzl
index 9cee1fcc4b..5cb42691c5 100644
--- a/third_party/repo.bzl
+++ b/third_party/repo.bzl
@@ -35,6 +35,15 @@ def _get_env_var(ctx, name):
   else:
     return None
 
+# Checks if we should use the system lib instead of the bundled one
+def _use_system_lib(ctx, name):
+  syslibenv = _get_env_var(ctx, "TF_SYSTEM_LIBS")
+  if syslibenv:
+    for n in syslibenv.strip().split(","):
+      if n.strip() == name:
+        return True
+  return False
+
 # Executes specified command with arguments and calls 'fail' if it exited with
 # non-zero code
 def _execute_and_check_ret_code(repo_ctx, cmd_and_args):
@@ -75,17 +84,28 @@ def _tf_http_archive(ctx):
          "Even if you don't have permission to mirror the file, please " +
          "put the correctly formatted mirror URL there anyway, because " +
          "someone will come along shortly thereafter and mirror the file.")
-  ctx.download_and_extract(
-      ctx.attr.urls,
-      "",
-      ctx.attr.sha256,
-      ctx.attr.type,
-      ctx.attr.strip_prefix)
-  if ctx.attr.delete:
-    _apply_delete(ctx, ctx.attr.delete)
-  if ctx.attr.patch_file != None:
-    _apply_patch(ctx, ctx.attr.patch_file)
-  if ctx.attr.build_file != None:
+
+  use_syslib = _use_system_lib(ctx, ctx.attr.name)
+  if not use_syslib:
+    ctx.download_and_extract(
+        ctx.attr.urls,
+        "",
+        ctx.attr.sha256,
+        ctx.attr.type,
+        ctx.attr.strip_prefix)
+    if ctx.attr.delete:
+      _apply_delete(ctx, ctx.attr.delete)
+    if ctx.attr.patch_file != None:
+      _apply_patch(ctx, ctx.attr.patch_file)
+
+  if use_syslib and ctx.attr.system_build_file != None:
+    # Use BUILD.bazel to avoid conflict with third party projects with
+    # BUILD or build (directory) underneath.
+    ctx.template("BUILD.bazel", ctx.attr.system_build_file, {
+        "%prefix%": ".." if _repos_are_siblings() else "external",
+    }, False)
+
+  elif ctx.attr.build_file != None:
     # Use BUILD.bazel to avoid conflict with third party projects with
     # BUILD or build (directory) underneath.
     ctx.template("BUILD.bazel", ctx.attr.build_file, {
@@ -102,7 +122,11 @@ tf_http_archive = repository_rule(
         "delete": attr.string_list(),
         "patch_file": attr.label(),
         "build_file": attr.label(),
-    })
+        "system_build_file": attr.label(),
+    },
+    environ=[
+	"TF_SYSTEM_LIBS",
+    ])
 """Downloads and creates Bazel repos for dependencies.
 
 This is a swappable replacement for both http_archive() and
diff --git a/third_party/systemlibs/BUILD b/third_party/systemlibs/BUILD
new file mode 100644
index 0000000000..e69de29bb2
--- /dev/null
+++ b/third_party/systemlibs/BUILD
diff --git a/third_party/systemlibs/BUILD.tpl b/third_party/systemlibs/BUILD.tpl
new file mode 100644
index 0000000000..e69de29bb2
--- /dev/null
+++ b/third_party/systemlibs/BUILD.tpl
diff --git a/third_party/systemlibs/astor.BUILD b/third_party/systemlibs/astor.BUILD
new file mode 100644
index 0000000000..497ec4bcea
--- /dev/null
+++ b/third_party/systemlibs/astor.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # New BSD
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+py_library(
+    name = "astor",
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/build_defs.bzl.tpl b/third_party/systemlibs/build_defs.bzl.tpl
new file mode 100644
index 0000000000..3faa46c581
--- /dev/null
+++ b/third_party/systemlibs/build_defs.bzl.tpl
@@ -0,0 +1,32 @@
+# -*- Python -*-
+"""Skylark macros for system libraries.
+"""
+
+SYSTEM_LIBS_ENABLED = %{syslibs_enabled}
+
+SYSTEM_LIBS_LIST = [
+%{syslibs_list}
+]
+
+
+def if_any_system_libs(a, b=[]):
+  """Conditional which evaluates to 'a' if any system libraries are configured."""
+  if SYSTEM_LIBS_ENABLED:
+    return a
+  else:
+    return b
+
+
+def if_system_lib(lib, a, b=[]):
+  """Conditional which evaluates to 'a' if we're using the system version of lib"""
+
+  if SYSTEM_LIBS_ENABLED and lib in SYSTEM_LIBS_LIST:
+    return a
+  else:
+    return b
+
+
+def if_not_system_lib(lib, a, b=[]):
+  """Conditional which evaluates to 'a' if we're using the system version of lib"""
+
+  return if_system_lib(lib, b, a)
diff --git a/third_party/systemlibs/curl.BUILD b/third_party/systemlibs/curl.BUILD
new file mode 100644
index 0000000000..c5f125caa9
--- /dev/null
+++ b/third_party/systemlibs/curl.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # MIT/X derivative license
+
+filegroup(
+    name = "COPYING",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "curl",
+    linkopts = ["-lcurl"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/cython.BUILD b/third_party/systemlibs/cython.BUILD
new file mode 100644
index 0000000000..1d52587676
--- /dev/null
+++ b/third_party/systemlibs/cython.BUILD
@@ -0,0 +1,13 @@
+licenses(["notice"])  # Apache-2.0
+
+genrule(
+    name = "lncython",
+    outs = ["cython"],
+    cmd = "ln -s $$(which cython) $@",
+)
+
+sh_binary(
+    name = "cython_binary",
+    srcs = ["cython"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/flatbuffers.BUILD b/third_party/systemlibs/flatbuffers.BUILD
new file mode 100644
index 0000000000..14fceada82
--- /dev/null
+++ b/third_party/systemlibs/flatbuffers.BUILD
@@ -0,0 +1,38 @@
+licenses(["notice"])  # Apache 2.0
+
+filegroup(
+    name = "LICENSE.txt",
+    visibility = ["//visibility:public"],
+)
+
+# Public flatc library to compile flatbuffer files at runtime.
+cc_library(
+    name = "flatbuffers",
+    linkopts = ["-lflatbuffers"],
+    visibility = ["//visibility:public"],
+)
+
+# Public flatc compiler library.
+cc_library(
+    name = "flatc_library",
+    linkopts = ["-lflatbuffers"],
+    visibility = ["//visibility:public"],
+)
+
+genrule(
+    name = "lnflatc",
+    outs = ["flatc.bin"],
+    cmd = "ln -s $$(which flatc) $@",
+)
+
+# Public flatc compiler.
+sh_binary(
+    name = "flatc",
+    srcs = ["flatc.bin"],
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "runtime_cc",
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/gif.BUILD b/third_party/systemlibs/gif.BUILD
new file mode 100644
index 0000000000..5eb2c918ba
--- /dev/null
+++ b/third_party/systemlibs/gif.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # MIT
+
+filegroup(
+    name = "COPYING",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "gif",
+    linkopts = ["-lgif"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/grpc.BUILD b/third_party/systemlibs/grpc.BUILD
new file mode 100644
index 0000000000..fd90eb0dd3
--- /dev/null
+++ b/third_party/systemlibs/grpc.BUILD
@@ -0,0 +1,54 @@
+licenses(["notice"])  # Apache v2
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "grpc",
+    linkopts = ["-lgrpc"],
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "grpc++",
+    linkopts = ["-lgrpc++"],
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "grpc_unsecure",
+    linkopts = ["-lgrpc_unsecure"],
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "grpc++_unsecure",
+    linkopts = ["-lgrpc++_unsecure"],
+    visibility = ["//visibility:public"],
+)
+
+genrule(
+    name = "ln_grpc_cpp_plugin",
+    outs = ["grpc_cpp_plugin.bin"],
+    cmd = "ln -s $$(which grpc_cpp_plugin) $@",
+)
+
+sh_binary(
+    name = "grpc_cpp_plugin",
+    srcs = ["grpc_cpp_plugin.bin"],
+    visibility = ["//visibility:public"],
+)
+
+genrule(
+    name = "ln_grpc_python_plugin",
+    outs = ["grpc_python_plugin.bin"],
+    cmd = "ln -s $$(which grpc_python_plugin) $@",
+)
+
+sh_binary(
+    name = "grpc_python_plugin",
+    srcs = ["grpc_python_plugin.bin"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/jemalloc.BUILD b/third_party/systemlibs/jemalloc.BUILD
new file mode 100644
index 0000000000..6a48d582ba
--- /dev/null
+++ b/third_party/systemlibs/jemalloc.BUILD
@@ -0,0 +1,30 @@
+licenses(["notice"])  # BSD
+
+filegroup(
+    name = "COPYING",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "jemalloc_headers",
+    defines = [
+        "jemalloc_posix_memalign=posix_memalign",
+        "jemalloc_malloc=malloc",
+        "jemalloc_realloc=realloc",
+        "jemalloc_free=free",
+    ],
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "jemalloc_impl",
+    linkopts = ["-ljemalloc"],
+    defines = [
+        "jemalloc_posix_memalign=posix_memalign",
+        "jemalloc_malloc=malloc",
+        "jemalloc_realloc=realloc",
+        "jemalloc_free=free",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [":jemalloc_headers"],
+)
diff --git a/third_party/systemlibs/jpeg.BUILD b/third_party/systemlibs/jpeg.BUILD
new file mode 100644
index 0000000000..f4f52da9bd
--- /dev/null
+++ b/third_party/systemlibs/jpeg.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # custom notice-style license, see LICENSE.md
+
+filegroup(
+    name = "LICENSE.md",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "jpeg",
+    linkopts = ["-ljpeg"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/jsoncpp.BUILD b/third_party/systemlibs/jsoncpp.BUILD
new file mode 100644
index 0000000000..cf91917cfb
--- /dev/null
+++ b/third_party/systemlibs/jsoncpp.BUILD
@@ -0,0 +1,37 @@
+licenses(["unencumbered"])  # Public Domain or MIT
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+HEADERS = [
+    "include/json/autolink.h",
+    "include/json/config.h",
+    "include/json/features.h",
+    "include/json/forwards.h",
+    "include/json/json.h",
+    "include/json/reader.h",
+    "include/json/value.h",
+    "include/json/version.h",
+    "include/json/writer.h",
+]
+
+genrule(
+    name = "link_headers",
+    outs = HEADERS,
+    cmd = """
+      for i in $(OUTS); do
+        i=$${i##*/}
+        ln -vsf /usr/include/jsoncpp/json/$$i $(@D)/include/json/$$i
+      done
+    """,
+)
+
+cc_library(
+    name = "jsoncpp",
+    hdrs = HEADERS,
+    includes = ["."],
+    linkopts = ["-ljsoncpp"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/lmdb.BUILD b/third_party/systemlibs/lmdb.BUILD
new file mode 100644
index 0000000000..6177b095ec
--- /dev/null
+++ b/third_party/systemlibs/lmdb.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # OpenLDAP Public License
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "lmdb",
+    linkopts = ["-llmdb"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/nasm.BUILD b/third_party/systemlibs/nasm.BUILD
new file mode 100644
index 0000000000..10ef8d8832
--- /dev/null
+++ b/third_party/systemlibs/nasm.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # BSD 2-clause
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+sh_binary(
+    name = "nasm",
+    srcs = ["nasm"],
+    visibility = ["@jpeg//:__pkg__"],
+)
diff --git a/third_party/systemlibs/pcre.BUILD b/third_party/systemlibs/pcre.BUILD
new file mode 100644
index 0000000000..df74238847
--- /dev/null
+++ b/third_party/systemlibs/pcre.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # BSD
+
+filegroup(
+    name = "LICENCE",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "pcre",
+    linkopts = ["-lpcre"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/png.BUILD b/third_party/systemlibs/png.BUILD
new file mode 100644
index 0000000000..fc6b6f2d8b
--- /dev/null
+++ b/third_party/systemlibs/png.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # BSD/MIT-like license
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "png",
+    linkopts = ["-lpng"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/re2.BUILD b/third_party/systemlibs/re2.BUILD
new file mode 100644
index 0000000000..c18e252dbc
--- /dev/null
+++ b/third_party/systemlibs/re2.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # BSD/MIT-like license
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "re2",
+    linkopts = ["-lre2"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/six.BUILD b/third_party/systemlibs/six.BUILD
new file mode 100644
index 0000000000..ff9b1a540b
--- /dev/null
+++ b/third_party/systemlibs/six.BUILD
@@ -0,0 +1,11 @@
+licenses(["notice"])  # MIT
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+py_library(
+    name = "six",
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/snappy.BUILD b/third_party/systemlibs/snappy.BUILD
new file mode 100644
index 0000000000..fd2db9e2df
--- /dev/null
+++ b/third_party/systemlibs/snappy.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # BSD 3-Clause
+
+filegroup(
+    name = "COPYING",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "snappy",
+    linkopts = ["-lsnappy"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/sqlite.BUILD b/third_party/systemlibs/sqlite.BUILD
new file mode 100644
index 0000000000..20ee1ebbef
--- /dev/null
+++ b/third_party/systemlibs/sqlite.BUILD
@@ -0,0 +1,15 @@
+licenses(["unencumbered"])  # Public Domain
+
+# Production build of SQLite library that's baked into TensorFlow.
+cc_library(
+    name = "org_sqlite",
+    linkopts = ["-lsqlite3"],
+    visibility = ["//visibility:public"],
+)
+
+# This is a Copybara sync helper for Google.
+py_library(
+    name = "python",
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/swig.BUILD b/third_party/systemlibs/swig.BUILD
new file mode 100644
index 0000000000..4c9b74dadb
--- /dev/null
+++ b/third_party/systemlibs/swig.BUILD
@@ -0,0 +1,23 @@
+licenses(["restricted"])  # GPLv3
+
+filegroup(
+    name = "LICENSE",
+    visibility = ["//visibility:public"],
+)
+
+filegroup(
+    name = "templates",
+    visibility = ["//visibility:public"],
+)
+
+genrule(
+    name = "lnswiglink",
+    outs = ["swiglink"],
+    cmd = "ln -s $$(which swig) $@",
+)
+
+sh_binary(
+    name = "swig",
+    srcs = ["swiglink"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/syslibs_configure.bzl b/third_party/systemlibs/syslibs_configure.bzl
new file mode 100644
index 0000000000..07a44c317e
--- /dev/null
+++ b/third_party/systemlibs/syslibs_configure.bzl
@@ -0,0 +1,160 @@
+# -*- Python -*-
+"""Repository rule for system library autoconfiguration.
+
+`syslibs_configure` depends on the following environment variables:
+
+  * `TF_SYSTEM_LIBS`: list of third party dependencies that should use
+    the system version instead
+"""
+
+_TF_SYSTEM_LIBS="TF_SYSTEM_LIBS"
+
+VALID_LIBS=[
+    "astor_archive",
+    "com_googlesource_code_re2",
+    "curl",
+    "cython",
+    "flatbuffers",
+    "gif_archive",
+    "grpc",
+    "jemalloc",
+    "jpeg",
+    "jsoncpp_git",
+    "lmdb",
+    "nasm",
+    "org_sqlite",
+    "pcre",
+    "png_archive",
+    "six_archive",
+    "snappy",
+    "swig",
+    "termcolor_archive",
+    "zlib_archive",
+]
+
+
+def auto_configure_fail(msg):
+  """Output failure message when syslibs configuration fails."""
+  red = "\033[0;31m"
+  no_color = "\033[0m"
+  fail("\n%sSystem Library Configuration Error:%s %s\n" % (red, no_color, msg))
+
+
+def _is_windows(repository_ctx):
+  """Returns true if the host operating system is windows."""
+  os_name = repository_ctx.os.name.lower()
+  if os_name.find("windows") != -1:
+    return True
+  return False
+
+
+def _enable_syslibs(repository_ctx):
+  s = repository_ctx.os.environ.get(_TF_SYSTEM_LIBS, '').strip()
+  if not _is_windows(repository_ctx) and s != None and s != '':
+    return True
+  return False
+
+
+def _get_system_lib_list(repository_ctx):
+  """Gets the list of deps that should use the system lib.
+
+  Args:
+    repository_ctx: The repository context.
+
+  Returns:
+    A string version of a python list
+  """
+  if _TF_SYSTEM_LIBS not in repository_ctx.os.environ:
+    return []
+
+  libenv = repository_ctx.os.environ[_TF_SYSTEM_LIBS].strip()
+  libs = []
+
+  for lib in list(libenv.split(',')):
+    lib = lib.strip()
+    if lib == "":
+      continue
+    if lib not in VALID_LIBS:
+      auto_configure_fail("Invalid system lib set: %s" % lib)
+      return []
+    libs.append(lib)
+
+  return libs
+
+
+def _format_system_lib_list(repository_ctx):
+  """Formats the list of deps that should use the system lib.
+
+  Args:
+    repository_ctx: The repository context.
+
+  Returns:
+    A list of the names of deps that should use the system lib.
+  """
+  libs = _get_system_lib_list(repository_ctx)
+  ret = ''
+  for lib in libs:
+    ret += "'%s',\n" % lib
+
+  return ret
+
+
+def _tpl(repository_ctx, tpl, substitutions={}, out=None):
+  if not out:
+    out = tpl.replace(":", "")
+  repository_ctx.template(
+      out,
+      Label("//third_party/systemlibs%s.tpl" % tpl),
+      substitutions,
+      False)
+
+
+def _create_dummy_repository(repository_ctx):
+  """Creates the dummy repository to build with all bundled libraries."""
+
+  _tpl(repository_ctx, ":BUILD")
+  _tpl(repository_ctx, ":build_defs.bzl",
+    {
+      "%{syslibs_enabled}": 'False',
+      "%{syslibs_list}": '',
+    })
+
+
+def _create_local_repository(repository_ctx):
+  """Creates the repository to build with system libraries."""
+
+  _tpl(repository_ctx, ":BUILD")
+  _tpl(repository_ctx, ":build_defs.bzl",
+    {
+      "%{syslibs_enabled}": 'True',
+      "%{syslibs_list}": _format_system_lib_list(repository_ctx),
+    })
+
+
+def _syslibs_autoconf_impl(repository_ctx):
+  """Implementation of the syslibs_configure repository rule."""
+  if not _enable_syslibs(repository_ctx):
+    _create_dummy_repository(repository_ctx)
+  else:
+    _create_local_repository(repository_ctx)
+
+
+syslibs_configure = repository_rule(
+    implementation = _syslibs_autoconf_impl,
+    environ = [
+        _TF_SYSTEM_LIBS,
+    ],
+)
+
+"""Configures the build to link to system libraries
+instead of using bundled versions.
+
+Add the following to your WORKSPACE FILE:
+
+```python
+syslibs_configure(name = "local_config_syslibs")
+```
+
+Args:
+  name: A unique name for this workspace rule.
+"""
diff --git a/third_party/systemlibs/termcolor.BUILD b/third_party/systemlibs/termcolor.BUILD
new file mode 100644
index 0000000000..915eb621d5
--- /dev/null
+++ b/third_party/systemlibs/termcolor.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # MIT
+
+filegroup(
+    name = "COPYING.txt",
+    visibility = ["//visibility:public"],
+)
+
+py_library(
+    name = "termcolor",
+    srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/systemlibs/zlib.BUILD b/third_party/systemlibs/zlib.BUILD
new file mode 100644
index 0000000000..69462ae6cb
--- /dev/null
+++ b/third_party/systemlibs/zlib.BUILD
@@ -0,0 +1,12 @@
+licenses(["notice"])  # BSD/MIT-like license (for zlib)
+
+filegroup(
+    name = "zlib.h",
+    visibility = ["//visibility:public"],
+)
+
+cc_library(
+    name = "zlib",
+    linkopts = ["-lz"],
+    visibility = ["//visibility:public"],
+)
diff --git a/third_party/toolchains/BUILD b/third_party/toolchains/BUILD
index fc3183a754..ec1006fe23 100644
--- a/third_party/toolchains/BUILD
+++ b/third_party/toolchains/BUILD
@@ -17,6 +17,6 @@ platform(
     remote_execution_properties = """
         properties: {
             name: "container-image"
-            value:"docker://gcr.io/asci-toolchain/nosla-ubuntu16_04-tf@sha256:800a7b68cabef15419695c188ed33ed70adf678c2371b97b236f3ae26c38274d"
+            value:"docker://gcr.io/asci-toolchain/nosla-ubuntu16_04-tf@sha256:495a025ed5e273cfa5d53357ef93ac20500c008994e0be106c509f51555fb93c"
         }""",
 )
diff --git a/third_party/toolchains/cpus/py/BUILD b/third_party/toolchains/cpus/py/BUILD
index c175742cbf..1235988abb 100644
--- a/third_party/toolchains/cpus/py/BUILD
+++ b/third_party/toolchains/cpus/py/BUILD
@@ -6,18 +6,24 @@ licenses(["restricted"])
 
 package(default_visibility = ["//visibility:public"])
 
+# To build Python C/C++ extension on Windows, we need to link to python import library pythonXY.lib
+# See https://docs.python.org/3/extending/windows.html
+cc_import(
+    name = "python_lib",
+    interface_library = select({
+        ":windows": ":python_import_lib",
+        # A placeholder for Unix platforms which makes --no_build happy.
+        "//conditions:default": "not-existing.lib",
+    }),
+    system_provided = 1,
+)
+
 cc_library(
     name = "python_headers",
     hdrs = [":python_include"],
-    data = select({
-        ":windows": [":python_import_lib"],
-        "//conditions:default": [],
-    }),
     includes = ["python_include"],
-    linkopts = select({
-        # TODO(pcloudy): Ideally, this should just go into deps after resolving
-        # https://github.com/bazelbuild/bazel/issues/3237,
-        ":windows": ["$(locations :python_import_lib)"],
+    deps = select({
+        ":windows": [":python_lib"],
         "//conditions:default": [],
     }),
 )
@@ -37,161 +43,135 @@ config_setting(
 genrule(
     name = "python_include",
     outs = [
+        "python_include/Python-ast.h",
+        "python_include/Python.h",
+        "python_include/abstract.h",
+        "python_include/asdl.h",
+        "python_include/ast.h",
+        "python_include/bitset.h",
+        "python_include/boolobject.h",
+        "python_include/bufferobject.h",
+        "python_include/bytearrayobject.h",
+        "python_include/bytes_methods.h",
+        "python_include/bytesobject.h",
+        "python_include/cStringIO.h",
+        "python_include/cellobject.h",
+        "python_include/ceval.h",
+        "python_include/classobject.h",
+        "python_include/cobject.h",
         "python_include/code.h",
+        "python_include/codecs.h",
+        "python_include/compile.h",
+        "python_include/complexobject.h",
+        "python_include/datetime.h",
+        "python_include/descrobject.h",
+        "python_include/dictobject.h",
         "python_include/dtoa.h",
-        "python_include/tupleobject.h",
-        "python_include/object.h",
-        "python_include/ast.h",
-        "python_include/pymacconfig.h",
+        "python_include/enumobject.h",
         "python_include/errcode.h",
+        "python_include/eval.h",
+        "python_include/fileobject.h",
+        "python_include/floatobject.h",
         "python_include/frameobject.h",
-        "python_include/pgenheaders.h",
-        "python_include/cellobject.h",
+        "python_include/funcobject.h",
+        "python_include/genobject.h",
+        "python_include/graminit.h",
+        "python_include/grammar.h",
+        "python_include/import.h",
         "python_include/intobject.h",
-        "python_include/pythread.h",
-        "python_include/cStringIO.h",
-        "python_include/boolobject.h",
+        "python_include/intrcheck.h",
+        "python_include/iterobject.h",
+        "python_include/listobject.h",
+        "python_include/longintrepr.h",
+        "python_include/longobject.h",
+        "python_include/marshal.h",
+        "python_include/memoryobject.h",
+        "python_include/metagrammar.h",
+        "python_include/methodobject.h",
         "python_include/modsupport.h",
-        "python_include/import.h",
-        "python_include/pymath.h",
+        "python_include/moduleobject.h",
         "python_include/node.h",
-        "python_include/funcobject.h",
-        "python_include/eval.h",
-        "python_include/longintrepr.h",
-        "python_include/floatobject.h",
-        "python_include/rangeobject.h",
-        "python_include/pyfpe.h",
-        "python_include/pystrcmp.h",
-        "python_include/dictobject.h",
-        "python_include/pyarena.h",
+        "python_include/object.h",
         "python_include/objimpl.h",
-        "python_include/bitset.h",
-        "python_include/memoryobject.h",
-        "python_include/bytearrayobject.h",
+        "python_include/opcode.h",
+        "python_include/osdefs.h",
+        "python_include/parsetok.h",
+        "python_include/patchlevel.h",
+        "python_include/pgen.h",
+        "python_include/pgenheaders.h",
+        "python_include/py_curses.h",
+        "python_include/pyarena.h",
+        "python_include/pycapsule.h",
+        "python_include/pyconfig.h",
+        "python_include/pyctype.h",
         "python_include/pydebug.h",
         "python_include/pyerrors.h",
-        "python_include/weakrefobject.h",
-        "python_include/grammar.h",
-        "python_include/symtable.h",
-        "python_include/longobject.h",
-        "python_include/structmember.h",
-        "python_include/enumobject.h",
-        "python_include/classobject.h",
-        "python_include/unicodeobject.h",
-        "python_include/sliceobject.h",
-        "python_include/pystrtod.h",
-        "python_include/genobject.h",
-        "python_include/pymactoolbox.h",
-        "python_include/compile.h",
         "python_include/pyexpat.h",
-        "python_include/asdl.h",
-        "python_include/codecs.h",
-        "python_include/pyctype.h",
-        "python_include/sysmodule.h",
-        "python_include/methodobject.h",
-        "python_include/graminit.h",
-        "python_include/cobject.h",
-        "python_include/intrcheck.h",
-        "python_include/pyport.h",
-        "python_include/warnings.h",
-        "python_include/osdefs.h",
-        "python_include/fileobject.h",
-        "python_include/stringobject.h",
-        "python_include/timefuncs.h",
-        "python_include/traceback.h",
-        "python_include/ceval.h",
-        "python_include/bytes_methods.h",
-        "python_include/pyconfig.h",
-        "python_include/Python.h",
-        "python_include/moduleobject.h",
-        "python_include/pystate.h",
-        "python_include/descrobject.h",
-        "python_include/ucnhash.h",
+        "python_include/pyfpe.h",
         "python_include/pygetopt.h",
+        "python_include/pymacconfig.h",
+        "python_include/pymactoolbox.h",
+        "python_include/pymath.h",
         "python_include/pymem.h",
-        "python_include/complexobject.h",
-        "python_include/structseq.h",
-        "python_include/datetime.h",
+        "python_include/pyport.h",
+        "python_include/pystate.h",
+        "python_include/pystrcmp.h",
+        "python_include/pystrtod.h",
         "python_include/pythonrun.h",
-        "python_include/numpy/oldnumeric.h",
-        "python_include/numpy/npy_1_7_deprecated_api.h",
-        "python_include/numpy/ufunc_api.txt",
-        "python_include/numpy/multiarray_api.txt",
-        "python_include/numpy/halffloat.h",
-        "python_include/numpy/npy_common.h",
-        "python_include/numpy/utils.h",
-        "python_include/numpy/npy_interrupt.h",
-        "python_include/numpy/npy_endian.h",
-        "python_include/numpy/__ufunc_api.h",
-        "python_include/numpy/_neighborhood_iterator_imp.h",
-        "python_include/numpy/ufuncobject.h",
-        "python_include/numpy/ndarraytypes.h",
-        "python_include/numpy/npy_math.h",
-        "python_include/numpy/noprefix.h",
-        "python_include/numpy/npy_3kcompat.h",
-        "python_include/numpy/arrayscalars.h",
-        "python_include/numpy/npy_os.h",
-        "python_include/numpy/ndarrayobject.h",
-        "python_include/numpy/npy_no_deprecated_api.h",
-        "python_include/numpy/arrayobject.h",
-        "python_include/numpy/_numpyconfig.h",
-        "python_include/numpy/__multiarray_api.h",
-        "python_include/numpy/npy_cpu.h",
-        "python_include/numpy/old_defines.h",
-        "python_include/numpy/numpyconfig.h",
-        "python_include/pycapsule.h",
+        "python_include/pythread.h",
+        "python_include/rangeobject.h",
         "python_include/setobject.h",
-        "python_include/listobject.h",
-        "python_include/bytesobject.h",
-        "python_include/pgen.h",
-        "python_include/patchlevel.h",
-        "python_include/opcode.h",
-        "python_include/parsetok.h",
-        "python_include/marshal.h",
+        "python_include/sliceobject.h",
+        "python_include/stringobject.h",
+        "python_include/structmember.h",
+        "python_include/structseq.h",
+        "python_include/symtable.h",
+        "python_include/sysmodule.h",
+        "python_include/timefuncs.h",
         "python_include/token.h",
-        "python_include/iterobject.h",
-        "python_include/abstract.h",
-        "python_include/py_curses.h",
-        "python_include/metagrammar.h",
-        "python_include/bufferobject.h",
-        "python_include/Python-ast.h",
+        "python_include/traceback.h",
+        "python_include/tupleobject.h",
+        "python_include/ucnhash.h",
+        "python_include/unicodeobject.h",
+        "python_include/warnings.h",
+        "python_include/weakrefobject.h",
     ],
     cmd = """
-cp "/usr/include/python2.7/code.h" "$(@D)/python_include/code.h" && cp "/usr/include/python2.7/dtoa.h" "$(@D)/python_include/dtoa.h" && cp "/usr/include/python2.7/tupleobject.h" "$(@D)/python_include/tupleobject.h" && cp "/usr/include/python2.7/object.h" "$(@D)/python_include/object.h" && cp "/usr/include/python2.7/ast.h" "$(@D)/python_include/ast.h" && cp "/usr/include/python2.7/pymacconfig.h" "$(@D)/python_include/pymacconfig.h" && cp "/usr/include/python2.7/errcode.h" "$(@D)/python_include/errcode.h" && cp "/usr/include/python2.7/frameobject.h" "$(@D)/python_include/frameobject.h" && cp "/usr/include/python2.7/pgenheaders.h" "$(@D)/python_include/pgenheaders.h" && cp "/usr/include/python2.7/cellobject.h" "$(@D)/python_include/cellobject.h" && cp "/usr/include/python2.7/intobject.h" "$(@D)/python_include/intobject.h" && cp "/usr/include/python2.7/pythread.h" "$(@D)/python_include/pythread.h" && cp "/usr/include/python2.7/cStringIO.h" "$(@D)/python_include/cStringIO.h" && cp "/usr/include/python2.7/boolobject.h" "$(@D)/python_include/boolobject.h" && cp "/usr/include/python2.7/modsupport.h" "$(@D)/python_include/modsupport.h" && cp "/usr/include/python2.7/import.h" "$(@D)/python_include/import.h" && cp "/usr/include/python2.7/pymath.h" "$(@D)/python_include/pymath.h" && cp "/usr/include/python2.7/node.h" "$(@D)/python_include/node.h" && cp "/usr/include/python2.7/funcobject.h" "$(@D)/python_include/funcobject.h" && cp "/usr/include/python2.7/eval.h" "$(@D)/python_include/eval.h" && cp "/usr/include/python2.7/longintrepr.h" "$(@D)/python_include/longintrepr.h" && cp "/usr/include/python2.7/floatobject.h" "$(@D)/python_include/floatobject.h" && cp "/usr/include/python2.7/rangeobject.h" "$(@D)/python_include/rangeobject.h" && cp "/usr/include/python2.7/pyfpe.h" "$(@D)/python_include/pyfpe.h" && cp "/usr/include/python2.7/pystrcmp.h" "$(@D)/python_include/pystrcmp.h" && cp "/usr/include/python2.7/dictobject.h" "$(@D)/python_include/dictobject.h" && cp "/usr/include/python2.7/pyarena.h" "$(@D)/python_include/pyarena.h" && cp "/usr/include/python2.7/objimpl.h" "$(@D)/python_include/objimpl.h" && cp "/usr/include/python2.7/bitset.h" "$(@D)/python_include/bitset.h" && cp "/usr/include/python2.7/memoryobject.h" "$(@D)/python_include/memoryobject.h" && cp "/usr/include/python2.7/bytearrayobject.h" "$(@D)/python_include/bytearrayobject.h" && cp "/usr/include/python2.7/pydebug.h" "$(@D)/python_include/pydebug.h" && cp "/usr/include/python2.7/pyerrors.h" "$(@D)/python_include/pyerrors.h" && cp "/usr/include/python2.7/weakrefobject.h" "$(@D)/python_include/weakrefobject.h" && cp "/usr/include/python2.7/grammar.h" "$(@D)/python_include/grammar.h" && cp "/usr/include/python2.7/symtable.h" "$(@D)/python_include/symtable.h" && cp "/usr/include/python2.7/longobject.h" "$(@D)/python_include/longobject.h" && cp "/usr/include/python2.7/structmember.h" "$(@D)/python_include/structmember.h" && cp "/usr/include/python2.7/enumobject.h" "$(@D)/python_include/enumobject.h" && cp "/usr/include/python2.7/classobject.h" "$(@D)/python_include/classobject.h" && cp "/usr/include/python2.7/unicodeobject.h" "$(@D)/python_include/unicodeobject.h" && cp "/usr/include/python2.7/sliceobject.h" "$(@D)/python_include/sliceobject.h" && cp "/usr/include/python2.7/pystrtod.h" "$(@D)/python_include/pystrtod.h" && cp "/usr/include/python2.7/genobject.h" "$(@D)/python_include/genobject.h" && cp "/usr/include/python2.7/pymactoolbox.h" "$(@D)/python_include/pymactoolbox.h" && cp "/usr/include/python2.7/compile.h" "$(@D)/python_include/compile.h" && cp "/usr/include/python2.7/pyexpat.h" "$(@D)/python_include/pyexpat.h" && cp "/usr/include/python2.7/asdl.h" "$(@D)/python_include/asdl.h" && cp "/usr/include/python2.7/codecs.h" "$(@D)/python_include/codecs.h" && cp "/usr/include/python2.7/pyctype.h" "$(@D)/python_include/pyctype.h" && cp "/usr/include/python2.7/sysmodule.h" "$(@D)/python_include/sysmodule.h" && cp "/usr/include/python2.7/methodobject.h" "$(@D)/python_include/methodobject.h" && cp "/usr/include/python2.7/graminit.h" "$(@D)/python_include/graminit.h" && cp "/usr/include/python2.7/cobject.h" "$(@D)/python_include/cobject.h" && cp "/usr/include/python2.7/intrcheck.h" "$(@D)/python_include/intrcheck.h" && cp "/usr/include/python2.7/pyport.h" "$(@D)/python_include/pyport.h" && cp "/usr/include/python2.7/warnings.h" "$(@D)/python_include/warnings.h" && cp "/usr/include/python2.7/osdefs.h" "$(@D)/python_include/osdefs.h" && cp "/usr/include/python2.7/fileobject.h" "$(@D)/python_include/fileobject.h" && cp "/usr/include/python2.7/stringobject.h" "$(@D)/python_include/stringobject.h" && cp "/usr/include/python2.7/timefuncs.h" "$(@D)/python_include/timefuncs.h" && cp "/usr/include/python2.7/traceback.h" "$(@D)/python_include/traceback.h" && cp "/usr/include/python2.7/ceval.h" "$(@D)/python_include/ceval.h" && cp "/usr/include/python2.7/bytes_methods.h" "$(@D)/python_include/bytes_methods.h" && cp "/usr/include/python2.7/pyconfig.h" "$(@D)/python_include/pyconfig.h" && cp "/usr/include/python2.7/Python.h" "$(@D)/python_include/Python.h" && cp "/usr/include/python2.7/moduleobject.h" "$(@D)/python_include/moduleobject.h" && cp "/usr/include/python2.7/pystate.h" "$(@D)/python_include/pystate.h" && cp "/usr/include/python2.7/descrobject.h" "$(@D)/python_include/descrobject.h" && cp "/usr/include/python2.7/ucnhash.h" "$(@D)/python_include/ucnhash.h" && cp "/usr/include/python2.7/pygetopt.h" "$(@D)/python_include/pygetopt.h" && cp "/usr/include/python2.7/pymem.h" "$(@D)/python_include/pymem.h" && cp "/usr/include/python2.7/complexobject.h" "$(@D)/python_include/complexobject.h" && cp "/usr/include/python2.7/structseq.h" "$(@D)/python_include/structseq.h" && cp "/usr/include/python2.7/datetime.h" "$(@D)/python_include/datetime.h" && cp "/usr/include/python2.7/pythonrun.h" "$(@D)/python_include/pythonrun.h" && cp "/usr/include/python2.7/numpy/oldnumeric.h" "$(@D)/python_include/numpy/oldnumeric.h" && cp "/usr/include/python2.7/numpy/npy_1_7_deprecated_api.h" "$(@D)/python_include/numpy/npy_1_7_deprecated_api.h" && cp "/usr/include/python2.7/numpy/ufunc_api.txt" "$(@D)/python_include/numpy/ufunc_api.txt" && cp "/usr/include/python2.7/numpy/multiarray_api.txt" "$(@D)/python_include/numpy/multiarray_api.txt" && cp "/usr/include/python2.7/numpy/halffloat.h" "$(@D)/python_include/numpy/halffloat.h" && cp "/usr/include/python2.7/numpy/npy_common.h" "$(@D)/python_include/numpy/npy_common.h" && cp "/usr/include/python2.7/numpy/utils.h" "$(@D)/python_include/numpy/utils.h" && cp "/usr/include/python2.7/numpy/npy_interrupt.h" "$(@D)/python_include/numpy/npy_interrupt.h" && cp "/usr/include/python2.7/numpy/npy_endian.h" "$(@D)/python_include/numpy/npy_endian.h" && cp "/usr/include/python2.7/numpy/__ufunc_api.h" "$(@D)/python_include/numpy/__ufunc_api.h" && cp "/usr/include/python2.7/numpy/_neighborhood_iterator_imp.h" "$(@D)/python_include/numpy/_neighborhood_iterator_imp.h" && cp "/usr/include/python2.7/numpy/ufuncobject.h" "$(@D)/python_include/numpy/ufuncobject.h" && cp "/usr/include/python2.7/numpy/ndarraytypes.h" "$(@D)/python_include/numpy/ndarraytypes.h" && cp "/usr/include/python2.7/numpy/npy_math.h" "$(@D)/python_include/numpy/npy_math.h" && cp "/usr/include/python2.7/numpy/noprefix.h" "$(@D)/python_include/numpy/noprefix.h" && cp "/usr/include/python2.7/numpy/npy_3kcompat.h" "$(@D)/python_include/numpy/npy_3kcompat.h" && cp "/usr/include/python2.7/numpy/arrayscalars.h" "$(@D)/python_include/numpy/arrayscalars.h" && cp "/usr/include/python2.7/numpy/npy_os.h" "$(@D)/python_include/numpy/npy_os.h" && cp "/usr/include/python2.7/numpy/ndarrayobject.h" "$(@D)/python_include/numpy/ndarrayobject.h" && cp "/usr/include/python2.7/numpy/npy_no_deprecated_api.h" "$(@D)/python_include/numpy/npy_no_deprecated_api.h" && cp "/usr/include/python2.7/numpy/arrayobject.h" "$(@D)/python_include/numpy/arrayobject.h" && cp "/usr/include/python2.7/numpy/_numpyconfig.h" "$(@D)/python_include/numpy/_numpyconfig.h" && cp "/usr/include/python2.7/numpy/__multiarray_api.h" "$(@D)/python_include/numpy/__multiarray_api.h" && cp "/usr/include/python2.7/numpy/npy_cpu.h" "$(@D)/python_include/numpy/npy_cpu.h" && cp "/usr/include/python2.7/numpy/old_defines.h" "$(@D)/python_include/numpy/old_defines.h" && cp "/usr/include/python2.7/numpy/numpyconfig.h" "$(@D)/python_include/numpy/numpyconfig.h" && cp "/usr/include/python2.7/pycapsule.h" "$(@D)/python_include/pycapsule.h" && cp "/usr/include/python2.7/setobject.h" "$(@D)/python_include/setobject.h" && cp "/usr/include/python2.7/listobject.h" "$(@D)/python_include/listobject.h" && cp "/usr/include/python2.7/bytesobject.h" "$(@D)/python_include/bytesobject.h" && cp "/usr/include/python2.7/pgen.h" "$(@D)/python_include/pgen.h" && cp "/usr/include/python2.7/patchlevel.h" "$(@D)/python_include/patchlevel.h" && cp "/usr/include/python2.7/opcode.h" "$(@D)/python_include/opcode.h" && cp "/usr/include/python2.7/parsetok.h" "$(@D)/python_include/parsetok.h" && cp "/usr/include/python2.7/marshal.h" "$(@D)/python_include/marshal.h" && cp "/usr/include/python2.7/token.h" "$(@D)/python_include/token.h" && cp "/usr/include/python2.7/iterobject.h" "$(@D)/python_include/iterobject.h" && cp "/usr/include/python2.7/abstract.h" "$(@D)/python_include/abstract.h" && cp "/usr/include/python2.7/py_curses.h" "$(@D)/python_include/py_curses.h" && cp "/usr/include/python2.7/metagrammar.h" "$(@D)/python_include/metagrammar.h" && cp "/usr/include/python2.7/bufferobject.h" "$(@D)/python_include/bufferobject.h" && cp "/usr/include/python2.7/Python-ast.h" "$(@D)/python_include/Python-ast.h"
+cp "/usr/include/python2.7/Python-ast.h" "$(@D)/python_include/Python-ast.h" && cp "/usr/include/python2.7/Python.h" "$(@D)/python_include/Python.h" && cp "/usr/include/python2.7/abstract.h" "$(@D)/python_include/abstract.h" && cp "/usr/include/python2.7/asdl.h" "$(@D)/python_include/asdl.h" && cp "/usr/include/python2.7/ast.h" "$(@D)/python_include/ast.h" && cp "/usr/include/python2.7/bitset.h" "$(@D)/python_include/bitset.h" && cp "/usr/include/python2.7/boolobject.h" "$(@D)/python_include/boolobject.h" && cp "/usr/include/python2.7/bufferobject.h" "$(@D)/python_include/bufferobject.h" && cp "/usr/include/python2.7/bytearrayobject.h" "$(@D)/python_include/bytearrayobject.h" && cp "/usr/include/python2.7/bytes_methods.h" "$(@D)/python_include/bytes_methods.h" && cp "/usr/include/python2.7/bytesobject.h" "$(@D)/python_include/bytesobject.h" && cp "/usr/include/python2.7/cStringIO.h" "$(@D)/python_include/cStringIO.h" && cp "/usr/include/python2.7/cellobject.h" "$(@D)/python_include/cellobject.h" && cp "/usr/include/python2.7/ceval.h" "$(@D)/python_include/ceval.h" && cp "/usr/include/python2.7/classobject.h" "$(@D)/python_include/classobject.h" && cp "/usr/include/python2.7/cobject.h" "$(@D)/python_include/cobject.h" && cp "/usr/include/python2.7/code.h" "$(@D)/python_include/code.h" && cp "/usr/include/python2.7/codecs.h" "$(@D)/python_include/codecs.h" && cp "/usr/include/python2.7/compile.h" "$(@D)/python_include/compile.h" && cp "/usr/include/python2.7/complexobject.h" "$(@D)/python_include/complexobject.h" && cp "/usr/include/python2.7/datetime.h" "$(@D)/python_include/datetime.h" && cp "/usr/include/python2.7/descrobject.h" "$(@D)/python_include/descrobject.h" && cp "/usr/include/python2.7/dictobject.h" "$(@D)/python_include/dictobject.h" && cp "/usr/include/python2.7/dtoa.h" "$(@D)/python_include/dtoa.h" && cp "/usr/include/python2.7/enumobject.h" "$(@D)/python_include/enumobject.h" && cp "/usr/include/python2.7/errcode.h" "$(@D)/python_include/errcode.h" && cp "/usr/include/python2.7/eval.h" "$(@D)/python_include/eval.h" && cp "/usr/include/python2.7/fileobject.h" "$(@D)/python_include/fileobject.h" && cp "/usr/include/python2.7/floatobject.h" "$(@D)/python_include/floatobject.h" && cp "/usr/include/python2.7/frameobject.h" "$(@D)/python_include/frameobject.h" && cp "/usr/include/python2.7/funcobject.h" "$(@D)/python_include/funcobject.h" && cp "/usr/include/python2.7/genobject.h" "$(@D)/python_include/genobject.h" && cp "/usr/include/python2.7/graminit.h" "$(@D)/python_include/graminit.h" && cp "/usr/include/python2.7/grammar.h" "$(@D)/python_include/grammar.h" && cp "/usr/include/python2.7/import.h" "$(@D)/python_include/import.h" && cp "/usr/include/python2.7/intobject.h" "$(@D)/python_include/intobject.h" && cp "/usr/include/python2.7/intrcheck.h" "$(@D)/python_include/intrcheck.h" && cp "/usr/include/python2.7/iterobject.h" "$(@D)/python_include/iterobject.h" && cp "/usr/include/python2.7/listobject.h" "$(@D)/python_include/listobject.h" && cp "/usr/include/python2.7/longintrepr.h" "$(@D)/python_include/longintrepr.h" && cp "/usr/include/python2.7/longobject.h" "$(@D)/python_include/longobject.h" && cp "/usr/include/python2.7/marshal.h" "$(@D)/python_include/marshal.h" && cp "/usr/include/python2.7/memoryobject.h" "$(@D)/python_include/memoryobject.h" && cp "/usr/include/python2.7/metagrammar.h" "$(@D)/python_include/metagrammar.h" && cp "/usr/include/python2.7/methodobject.h" "$(@D)/python_include/methodobject.h" && cp "/usr/include/python2.7/modsupport.h" "$(@D)/python_include/modsupport.h" && cp "/usr/include/python2.7/moduleobject.h" "$(@D)/python_include/moduleobject.h" && cp "/usr/include/python2.7/node.h" "$(@D)/python_include/node.h" && cp "/usr/include/python2.7/object.h" "$(@D)/python_include/object.h" && cp "/usr/include/python2.7/objimpl.h" "$(@D)/python_include/objimpl.h" && cp "/usr/include/python2.7/opcode.h" "$(@D)/python_include/opcode.h" && cp "/usr/include/python2.7/osdefs.h" "$(@D)/python_include/osdefs.h" && cp "/usr/include/python2.7/parsetok.h" "$(@D)/python_include/parsetok.h" && cp "/usr/include/python2.7/patchlevel.h" "$(@D)/python_include/patchlevel.h" && cp "/usr/include/python2.7/pgen.h" "$(@D)/python_include/pgen.h" && cp "/usr/include/python2.7/pgenheaders.h" "$(@D)/python_include/pgenheaders.h" && cp "/usr/include/python2.7/py_curses.h" "$(@D)/python_include/py_curses.h" && cp "/usr/include/python2.7/pyarena.h" "$(@D)/python_include/pyarena.h" && cp "/usr/include/python2.7/pycapsule.h" "$(@D)/python_include/pycapsule.h" && cp "/usr/include/python2.7/pyconfig.h" "$(@D)/python_include/pyconfig.h" && cp "/usr/include/python2.7/pyctype.h" "$(@D)/python_include/pyctype.h" && cp "/usr/include/python2.7/pydebug.h" "$(@D)/python_include/pydebug.h" && cp "/usr/include/python2.7/pyerrors.h" "$(@D)/python_include/pyerrors.h" && cp "/usr/include/python2.7/pyexpat.h" "$(@D)/python_include/pyexpat.h" && cp "/usr/include/python2.7/pyfpe.h" "$(@D)/python_include/pyfpe.h" && cp "/usr/include/python2.7/pygetopt.h" "$(@D)/python_include/pygetopt.h" && cp "/usr/include/python2.7/pymacconfig.h" "$(@D)/python_include/pymacconfig.h" && cp "/usr/include/python2.7/pymactoolbox.h" "$(@D)/python_include/pymactoolbox.h" && cp "/usr/include/python2.7/pymath.h" "$(@D)/python_include/pymath.h" && cp "/usr/include/python2.7/pymem.h" "$(@D)/python_include/pymem.h" && cp "/usr/include/python2.7/pyport.h" "$(@D)/python_include/pyport.h" && cp "/usr/include/python2.7/pystate.h" "$(@D)/python_include/pystate.h" && cp "/usr/include/python2.7/pystrcmp.h" "$(@D)/python_include/pystrcmp.h" && cp "/usr/include/python2.7/pystrtod.h" "$(@D)/python_include/pystrtod.h" && cp "/usr/include/python2.7/pythonrun.h" "$(@D)/python_include/pythonrun.h" && cp "/usr/include/python2.7/pythread.h" "$(@D)/python_include/pythread.h" && cp "/usr/include/python2.7/rangeobject.h" "$(@D)/python_include/rangeobject.h" && cp "/usr/include/python2.7/setobject.h" "$(@D)/python_include/setobject.h" && cp "/usr/include/python2.7/sliceobject.h" "$(@D)/python_include/sliceobject.h" && cp "/usr/include/python2.7/stringobject.h" "$(@D)/python_include/stringobject.h" && cp "/usr/include/python2.7/structmember.h" "$(@D)/python_include/structmember.h" && cp "/usr/include/python2.7/structseq.h" "$(@D)/python_include/structseq.h" && cp "/usr/include/python2.7/symtable.h" "$(@D)/python_include/symtable.h" && cp "/usr/include/python2.7/sysmodule.h" "$(@D)/python_include/sysmodule.h" && cp "/usr/include/python2.7/timefuncs.h" "$(@D)/python_include/timefuncs.h" && cp "/usr/include/python2.7/token.h" "$(@D)/python_include/token.h" && cp "/usr/include/python2.7/traceback.h" "$(@D)/python_include/traceback.h" && cp "/usr/include/python2.7/tupleobject.h" "$(@D)/python_include/tupleobject.h" && cp "/usr/include/python2.7/ucnhash.h" "$(@D)/python_include/ucnhash.h" && cp "/usr/include/python2.7/unicodeobject.h" "$(@D)/python_include/unicodeobject.h" && cp "/usr/include/python2.7/warnings.h" "$(@D)/python_include/warnings.h" && cp "/usr/include/python2.7/weakrefobject.h" "$(@D)/python_include/weakrefobject.h"
    """,
 )
 
 genrule(
     name = "numpy_include",
     outs = [
-        "numpy_include/numpy/oldnumeric.h",
-        "numpy_include/numpy/npy_1_7_deprecated_api.h",
-        "numpy_include/numpy/ufunc_api.txt",
-        "numpy_include/numpy/multiarray_api.txt",
-        "numpy_include/numpy/halffloat.h",
-        "numpy_include/numpy/npy_common.h",
-        "numpy_include/numpy/utils.h",
-        "numpy_include/numpy/npy_interrupt.h",
-        "numpy_include/numpy/npy_endian.h",
+        "numpy_include/numpy/__multiarray_api.h",
         "numpy_include/numpy/__ufunc_api.h",
         "numpy_include/numpy/_neighborhood_iterator_imp.h",
-        "numpy_include/numpy/ufuncobject.h",
+        "numpy_include/numpy/_numpyconfig.h",
+        "numpy_include/numpy/arrayobject.h",
+        "numpy_include/numpy/arrayscalars.h",
+        "numpy_include/numpy/halffloat.h",
+        "numpy_include/numpy/multiarray_api.txt",
+        "numpy_include/numpy/ndarrayobject.h",
         "numpy_include/numpy/ndarraytypes.h",
-        "numpy_include/numpy/npy_math.h",
         "numpy_include/numpy/noprefix.h",
+        "numpy_include/numpy/npy_1_7_deprecated_api.h",
         "numpy_include/numpy/npy_3kcompat.h",
-        "numpy_include/numpy/arrayscalars.h",
-        "numpy_include/numpy/npy_os.h",
-        "numpy_include/numpy/ndarrayobject.h",
-        "numpy_include/numpy/npy_no_deprecated_api.h",
-        "numpy_include/numpy/arrayobject.h",
-        "numpy_include/numpy/_numpyconfig.h",
-        "numpy_include/numpy/__multiarray_api.h",
+        "numpy_include/numpy/npy_common.h",
         "numpy_include/numpy/npy_cpu.h",
-        "numpy_include/numpy/old_defines.h",
+        "numpy_include/numpy/npy_endian.h",
+        "numpy_include/numpy/npy_interrupt.h",
+        "numpy_include/numpy/npy_math.h",
+        "numpy_include/numpy/npy_no_deprecated_api.h",
+        "numpy_include/numpy/npy_os.h",
         "numpy_include/numpy/numpyconfig.h",
+        "numpy_include/numpy/old_defines.h",
+        "numpy_include/numpy/oldnumeric.h",
+        "numpy_include/numpy/ufunc_api.txt",
+        "numpy_include/numpy/ufuncobject.h",
+        "numpy_include/numpy/utils.h",
     ],
     cmd = """
-cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/oldnumeric.h" "$(@D)/numpy_include/numpy/oldnumeric.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_1_7_deprecated_api.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ufunc_api.txt" "$(@D)/numpy_include/numpy/ufunc_api.txt" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/multiarray_api.txt" "$(@D)/numpy_include/numpy/multiarray_api.txt" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/halffloat.h" "$(@D)/numpy_include/numpy/halffloat.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_common.h" "$(@D)/numpy_include/numpy/npy_common.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/utils.h" "$(@D)/numpy_include/numpy/utils.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_interrupt.h" "$(@D)/numpy_include/numpy/npy_interrupt.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_endian.h" "$(@D)/numpy_include/numpy/npy_endian.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/__ufunc_api.h" "$(@D)/numpy_include/numpy/__ufunc_api.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/_neighborhood_iterator_imp.h" "$(@D)/numpy_include/numpy/_neighborhood_iterator_imp.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ufuncobject.h" "$(@D)/numpy_include/numpy/ufuncobject.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h" "$(@D)/numpy_include/numpy/ndarraytypes.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_math.h" "$(@D)/numpy_include/numpy/npy_math.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/noprefix.h" "$(@D)/numpy_include/numpy/noprefix.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_3kcompat.h" "$(@D)/numpy_include/numpy/npy_3kcompat.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayscalars.h" "$(@D)/numpy_include/numpy/arrayscalars.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_os.h" "$(@D)/numpy_include/numpy/npy_os.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h" "$(@D)/numpy_include/numpy/ndarrayobject.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_no_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_no_deprecated_api.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h" "$(@D)/numpy_include/numpy/arrayobject.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/_numpyconfig.h" "$(@D)/numpy_include/numpy/_numpyconfig.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/__multiarray_api.h" "$(@D)/numpy_include/numpy/__multiarray_api.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_cpu.h" "$(@D)/numpy_include/numpy/npy_cpu.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/old_defines.h" "$(@D)/numpy_include/numpy/old_defines.h" && cp "/usr/lib/python2.7/dist-packages/numpy/core/include/numpy/numpyconfig.h" "$(@D)/numpy_include/numpy/numpyconfig.h"
+cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__multiarray_api.h" "$(@D)/numpy_include/numpy/__multiarray_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__ufunc_api.h" "$(@D)/numpy_include/numpy/__ufunc_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/_neighborhood_iterator_imp.h" "$(@D)/numpy_include/numpy/_neighborhood_iterator_imp.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/_numpyconfig.h" "$(@D)/numpy_include/numpy/_numpyconfig.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h" "$(@D)/numpy_include/numpy/arrayobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayscalars.h" "$(@D)/numpy_include/numpy/arrayscalars.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/halffloat.h" "$(@D)/numpy_include/numpy/halffloat.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/multiarray_api.txt" "$(@D)/numpy_include/numpy/multiarray_api.txt" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h" "$(@D)/numpy_include/numpy/ndarrayobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h" "$(@D)/numpy_include/numpy/ndarraytypes.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/noprefix.h" "$(@D)/numpy_include/numpy/noprefix.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_1_7_deprecated_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_3kcompat.h" "$(@D)/numpy_include/numpy/npy_3kcompat.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_common.h" "$(@D)/numpy_include/numpy/npy_common.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_cpu.h" "$(@D)/numpy_include/numpy/npy_cpu.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_endian.h" "$(@D)/numpy_include/numpy/npy_endian.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_interrupt.h" "$(@D)/numpy_include/numpy/npy_interrupt.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_math.h" "$(@D)/numpy_include/numpy/npy_math.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_no_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_no_deprecated_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_os.h" "$(@D)/numpy_include/numpy/npy_os.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/numpyconfig.h" "$(@D)/numpy_include/numpy/numpyconfig.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/old_defines.h" "$(@D)/numpy_include/numpy/old_defines.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/oldnumeric.h" "$(@D)/numpy_include/numpy/oldnumeric.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ufunc_api.txt" "$(@D)/numpy_include/numpy/ufunc_api.txt" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ufuncobject.h" "$(@D)/numpy_include/numpy/ufuncobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/utils.h" "$(@D)/numpy_include/numpy/utils.h"
    """,
 )
diff --git a/third_party/toolchains/cpus/py3/BUILD b/third_party/toolchains/cpus/py3/BUILD
index 932a25239f..d47256ebef 100644
--- a/third_party/toolchains/cpus/py3/BUILD
+++ b/third_party/toolchains/cpus/py3/BUILD
@@ -6,18 +6,24 @@ licenses(["restricted"])
 
 package(default_visibility = ["//visibility:public"])
 
+# To build Python C/C++ extension on Windows, we need to link to python import library pythonXY.lib
+# See https://docs.python.org/3/extending/windows.html
+cc_import(
+    name = "python_lib",
+    interface_library = select({
+        ":windows": ":python_import_lib",
+        # A placeholder for Unix platforms which makes --no_build happy.
+        "//conditions:default": "not-existing.lib",
+    }),
+    system_provided = 1,
+)
+
 cc_library(
     name = "python_headers",
     hdrs = [":python_include"],
-    data = select({
-        ":windows": [":python_import_lib"],
-        "//conditions:default": [],
-    }),
     includes = ["python_include"],
-    linkopts = select({
-        # TODO(pcloudy): Ideally, this should just go into deps after resolving
-        # https://github.com/bazelbuild/bazel/issues/3237,
-        ":windows": ["$(locations :python_import_lib)"],
+    deps = select({
+        ":windows": [":python_lib"],
         "//conditions:default": [],
     }),
 )
@@ -37,143 +43,143 @@ config_setting(
 genrule(
     name = "python_include",
     outs = [
-        "python_include/code.h",
-        "python_include/dtoa.h",
-        "python_include/tupleobject.h",
-        "python_include/object.h",
-        "python_include/ast.h",
-        "python_include/pymacconfig.h",
-        "python_include/errcode.h",
-        "python_include/frameobject.h",
-        "python_include/typeslots.h",
-        "python_include/pgenheaders.h",
-        "python_include/cellobject.h",
-        "python_include/pythread.h",
-        "python_include/boolobject.h",
+        "python_include/Python-ast.h",
+        "python_include/Python.h",
+        "python_include/abstract.h",
         "python_include/accu.h",
-        "python_include/modsupport.h",
-        "python_include/import.h",
-        "python_include/pymath.h",
-        "python_include/node.h",
-        "python_include/funcobject.h",
-        "python_include/eval.h",
-        "python_include/pyatomic.h",
-        "python_include/longintrepr.h",
-        "python_include/floatobject.h",
-        "python_include/rangeobject.h",
-        "python_include/pyfpe.h",
-        "python_include/pystrcmp.h",
-        "python_include/fileutils.h",
-        "python_include/dictobject.h",
-        "python_include/pyarena.h",
-        "python_include/osmodule.h",
-        "python_include/objimpl.h",
+        "python_include/asdl.h",
+        "python_include/ast.h",
         "python_include/bitset.h",
-        "python_include/memoryobject.h",
+        "python_include/bltinmodule.h",
+        "python_include/boolobject.h",
         "python_include/bytearrayobject.h",
-        "python_include/pydebug.h",
-        "python_include/pyerrors.h",
-        "python_include/weakrefobject.h",
-        "python_include/grammar.h",
-        "python_include/symtable.h",
-        "python_include/longobject.h",
-        "python_include/structmember.h",
-        "python_include/enumobject.h",
-        "python_include/pymacro.h",
+        "python_include/bytes_methods.h",
+        "python_include/bytesobject.h",
+        "python_include/cellobject.h",
+        "python_include/ceval.h",
         "python_include/classobject.h",
-        "python_include/unicodeobject.h",
-        "python_include/sliceobject.h",
-        "python_include/pystrtod.h",
-        "python_include/genobject.h",
-        "python_include/compile.h",
-        "python_include/pyexpat.h",
-        "python_include/asdl.h",
+        "python_include/code.h",
         "python_include/codecs.h",
+        "python_include/compile.h",
+        "python_include/complexobject.h",
+        "python_include/datetime.h",
+        "python_include/descrobject.h",
+        "python_include/dictobject.h",
+        "python_include/dtoa.h",
         "python_include/dynamic_annotations.h",
-        "python_include/pyctype.h",
-        "python_include/sysmodule.h",
-        "python_include/methodobject.h",
+        "python_include/enumobject.h",
+        "python_include/errcode.h",
+        "python_include/eval.h",
+        "python_include/fileobject.h",
+        "python_include/fileutils.h",
+        "python_include/floatobject.h",
+        "python_include/frameobject.h",
+        "python_include/funcobject.h",
+        "python_include/genobject.h",
         "python_include/graminit.h",
-        "python_include/bltinmodule.h",
+        "python_include/grammar.h",
+        "python_include/import.h",
         "python_include/intrcheck.h",
-        "python_include/pyport.h",
-        "python_include/warnings.h",
-        "python_include/osdefs.h",
-        "python_include/pydtrace.h",
-        "python_include/pylifecycle.h",
-        "python_include/fileobject.h",
-        "python_include/pytime.h",
-        "python_include/traceback.h",
-        "python_include/ceval.h",
-        "python_include/bytes_methods.h",
-        "python_include/namespaceobject.h",
-        "python_include/pyconfig.h",
-        "python_include/Python.h",
+        "python_include/iterobject.h",
+        "python_include/listobject.h",
+        "python_include/longintrepr.h",
+        "python_include/longobject.h",
+        "python_include/marshal.h",
+        "python_include/memoryobject.h",
+        "python_include/metagrammar.h",
+        "python_include/methodobject.h",
+        "python_include/modsupport.h",
         "python_include/moduleobject.h",
-        "python_include/pystate.h",
-        "python_include/descrobject.h",
+        "python_include/namespaceobject.h",
+        "python_include/node.h",
+        "python_include/object.h",
+        "python_include/objimpl.h",
         "python_include/odictobject.h",
-        "python_include/ucnhash.h",
+        "python_include/opcode.h",
+        "python_include/osdefs.h",
+        "python_include/osmodule.h",
+        "python_include/parsetok.h",
+        "python_include/patchlevel.h",
+        "python_include/pgen.h",
+        "python_include/pgenheaders.h",
+        "python_include/py_curses.h",
+        "python_include/pyarena.h",
+        "python_include/pyatomic.h",
+        "python_include/pycapsule.h",
+        "python_include/pyconfig.h",
+        "python_include/pyctype.h",
+        "python_include/pydebug.h",
+        "python_include/pydtrace.h",
+        "python_include/pyerrors.h",
+        "python_include/pyexpat.h",
+        "python_include/pyfpe.h",
         "python_include/pygetopt.h",
+        "python_include/pyhash.h",
+        "python_include/pylifecycle.h",
+        "python_include/pymacconfig.h",
+        "python_include/pymacro.h",
+        "python_include/pymath.h",
         "python_include/pymem.h",
-        "python_include/complexobject.h",
-        "python_include/structseq.h",
-        "python_include/datetime.h",
+        "python_include/pyport.h",
+        "python_include/pystate.h",
+        "python_include/pystrcmp.h",
+        "python_include/pystrhex.h",
+        "python_include/pystrtod.h",
         "python_include/pythonrun.h",
-        "python_include/pyhash.h",
-        "python_include/pycapsule.h",
+        "python_include/pythread.h",
+        "python_include/pytime.h",
+        "python_include/rangeobject.h",
         "python_include/setobject.h",
-        "python_include/listobject.h",
-        "python_include/bytesobject.h",
-        "python_include/pgen.h",
-        "python_include/patchlevel.h",
-        "python_include/opcode.h",
-        "python_include/parsetok.h",
-        "python_include/pystrhex.h",
-        "python_include/marshal.h",
+        "python_include/sliceobject.h",
+        "python_include/structmember.h",
+        "python_include/structseq.h",
+        "python_include/symtable.h",
+        "python_include/sysmodule.h",
         "python_include/token.h",
-        "python_include/iterobject.h",
-        "python_include/abstract.h",
-        "python_include/py_curses.h",
-        "python_include/metagrammar.h",
-        "python_include/Python-ast.h",
+        "python_include/traceback.h",
+        "python_include/tupleobject.h",
+        "python_include/typeslots.h",
+        "python_include/ucnhash.h",
+        "python_include/unicodeobject.h",
+        "python_include/warnings.h",
+        "python_include/weakrefobject.h",
     ],
     cmd = """
-cp "/opt/python3.6/include/python3.6m/code.h" "$(@D)/python_include/code.h" && cp "/opt/python3.6/include/python3.6m/dtoa.h" "$(@D)/python_include/dtoa.h" && cp "/opt/python3.6/include/python3.6m/tupleobject.h" "$(@D)/python_include/tupleobject.h" && cp "/opt/python3.6/include/python3.6m/object.h" "$(@D)/python_include/object.h" && cp "/opt/python3.6/include/python3.6m/ast.h" "$(@D)/python_include/ast.h" && cp "/opt/python3.6/include/python3.6m/pymacconfig.h" "$(@D)/python_include/pymacconfig.h" && cp "/opt/python3.6/include/python3.6m/errcode.h" "$(@D)/python_include/errcode.h" && cp "/opt/python3.6/include/python3.6m/frameobject.h" "$(@D)/python_include/frameobject.h" && cp "/opt/python3.6/include/python3.6m/typeslots.h" "$(@D)/python_include/typeslots.h" && cp "/opt/python3.6/include/python3.6m/pgenheaders.h" "$(@D)/python_include/pgenheaders.h" && cp "/opt/python3.6/include/python3.6m/cellobject.h" "$(@D)/python_include/cellobject.h" && cp "/opt/python3.6/include/python3.6m/pythread.h" "$(@D)/python_include/pythread.h" && cp "/opt/python3.6/include/python3.6m/boolobject.h" "$(@D)/python_include/boolobject.h" && cp "/opt/python3.6/include/python3.6m/accu.h" "$(@D)/python_include/accu.h" && cp "/opt/python3.6/include/python3.6m/modsupport.h" "$(@D)/python_include/modsupport.h" && cp "/opt/python3.6/include/python3.6m/import.h" "$(@D)/python_include/import.h" && cp "/opt/python3.6/include/python3.6m/pymath.h" "$(@D)/python_include/pymath.h" && cp "/opt/python3.6/include/python3.6m/node.h" "$(@D)/python_include/node.h" && cp "/opt/python3.6/include/python3.6m/funcobject.h" "$(@D)/python_include/funcobject.h" && cp "/opt/python3.6/include/python3.6m/eval.h" "$(@D)/python_include/eval.h" && cp "/opt/python3.6/include/python3.6m/pyatomic.h" "$(@D)/python_include/pyatomic.h" && cp "/opt/python3.6/include/python3.6m/longintrepr.h" "$(@D)/python_include/longintrepr.h" && cp "/opt/python3.6/include/python3.6m/floatobject.h" "$(@D)/python_include/floatobject.h" && cp "/opt/python3.6/include/python3.6m/rangeobject.h" "$(@D)/python_include/rangeobject.h" && cp "/opt/python3.6/include/python3.6m/pyfpe.h" "$(@D)/python_include/pyfpe.h" && cp "/opt/python3.6/include/python3.6m/pystrcmp.h" "$(@D)/python_include/pystrcmp.h" && cp "/opt/python3.6/include/python3.6m/fileutils.h" "$(@D)/python_include/fileutils.h" && cp "/opt/python3.6/include/python3.6m/dictobject.h" "$(@D)/python_include/dictobject.h" && cp "/opt/python3.6/include/python3.6m/pyarena.h" "$(@D)/python_include/pyarena.h" && cp "/opt/python3.6/include/python3.6m/osmodule.h" "$(@D)/python_include/osmodule.h" && cp "/opt/python3.6/include/python3.6m/objimpl.h" "$(@D)/python_include/objimpl.h" && cp "/opt/python3.6/include/python3.6m/bitset.h" "$(@D)/python_include/bitset.h" && cp "/opt/python3.6/include/python3.6m/memoryobject.h" "$(@D)/python_include/memoryobject.h" && cp "/opt/python3.6/include/python3.6m/bytearrayobject.h" "$(@D)/python_include/bytearrayobject.h" && cp "/opt/python3.6/include/python3.6m/pydebug.h" "$(@D)/python_include/pydebug.h" && cp "/opt/python3.6/include/python3.6m/pyerrors.h" "$(@D)/python_include/pyerrors.h" && cp "/opt/python3.6/include/python3.6m/weakrefobject.h" "$(@D)/python_include/weakrefobject.h" && cp "/opt/python3.6/include/python3.6m/grammar.h" "$(@D)/python_include/grammar.h" && cp "/opt/python3.6/include/python3.6m/symtable.h" "$(@D)/python_include/symtable.h" && cp "/opt/python3.6/include/python3.6m/longobject.h" "$(@D)/python_include/longobject.h" && cp "/opt/python3.6/include/python3.6m/structmember.h" "$(@D)/python_include/structmember.h" && cp "/opt/python3.6/include/python3.6m/enumobject.h" "$(@D)/python_include/enumobject.h" && cp "/opt/python3.6/include/python3.6m/pymacro.h" "$(@D)/python_include/pymacro.h" && cp "/opt/python3.6/include/python3.6m/classobject.h" "$(@D)/python_include/classobject.h" && cp "/opt/python3.6/include/python3.6m/unicodeobject.h" "$(@D)/python_include/unicodeobject.h" && cp "/opt/python3.6/include/python3.6m/sliceobject.h" "$(@D)/python_include/sliceobject.h" && cp "/opt/python3.6/include/python3.6m/pystrtod.h" "$(@D)/python_include/pystrtod.h" && cp "/opt/python3.6/include/python3.6m/genobject.h" "$(@D)/python_include/genobject.h" && cp "/opt/python3.6/include/python3.6m/compile.h" "$(@D)/python_include/compile.h" && cp "/opt/python3.6/include/python3.6m/pyexpat.h" "$(@D)/python_include/pyexpat.h" && cp "/opt/python3.6/include/python3.6m/asdl.h" "$(@D)/python_include/asdl.h" && cp "/opt/python3.6/include/python3.6m/codecs.h" "$(@D)/python_include/codecs.h" && cp "/opt/python3.6/include/python3.6m/dynamic_annotations.h" "$(@D)/python_include/dynamic_annotations.h" && cp "/opt/python3.6/include/python3.6m/pyctype.h" "$(@D)/python_include/pyctype.h" && cp "/opt/python3.6/include/python3.6m/sysmodule.h" "$(@D)/python_include/sysmodule.h" && cp "/opt/python3.6/include/python3.6m/methodobject.h" "$(@D)/python_include/methodobject.h" && cp "/opt/python3.6/include/python3.6m/graminit.h" "$(@D)/python_include/graminit.h" && cp "/opt/python3.6/include/python3.6m/bltinmodule.h" "$(@D)/python_include/bltinmodule.h" && cp "/opt/python3.6/include/python3.6m/intrcheck.h" "$(@D)/python_include/intrcheck.h" && cp "/opt/python3.6/include/python3.6m/pyport.h" "$(@D)/python_include/pyport.h" && cp "/opt/python3.6/include/python3.6m/warnings.h" "$(@D)/python_include/warnings.h" && cp "/opt/python3.6/include/python3.6m/osdefs.h" "$(@D)/python_include/osdefs.h" && cp "/opt/python3.6/include/python3.6m/pydtrace.h" "$(@D)/python_include/pydtrace.h" && cp "/opt/python3.6/include/python3.6m/pylifecycle.h" "$(@D)/python_include/pylifecycle.h" && cp "/opt/python3.6/include/python3.6m/fileobject.h" "$(@D)/python_include/fileobject.h" && cp "/opt/python3.6/include/python3.6m/pytime.h" "$(@D)/python_include/pytime.h" && cp "/opt/python3.6/include/python3.6m/traceback.h" "$(@D)/python_include/traceback.h" && cp "/opt/python3.6/include/python3.6m/ceval.h" "$(@D)/python_include/ceval.h" && cp "/opt/python3.6/include/python3.6m/bytes_methods.h" "$(@D)/python_include/bytes_methods.h" && cp "/opt/python3.6/include/python3.6m/namespaceobject.h" "$(@D)/python_include/namespaceobject.h" && cp "/opt/python3.6/include/python3.6m/pyconfig.h" "$(@D)/python_include/pyconfig.h" && cp "/opt/python3.6/include/python3.6m/Python.h" "$(@D)/python_include/Python.h" && cp "/opt/python3.6/include/python3.6m/moduleobject.h" "$(@D)/python_include/moduleobject.h" && cp "/opt/python3.6/include/python3.6m/pystate.h" "$(@D)/python_include/pystate.h" && cp "/opt/python3.6/include/python3.6m/descrobject.h" "$(@D)/python_include/descrobject.h" && cp "/opt/python3.6/include/python3.6m/odictobject.h" "$(@D)/python_include/odictobject.h" && cp "/opt/python3.6/include/python3.6m/ucnhash.h" "$(@D)/python_include/ucnhash.h" && cp "/opt/python3.6/include/python3.6m/pygetopt.h" "$(@D)/python_include/pygetopt.h" && cp "/opt/python3.6/include/python3.6m/pymem.h" "$(@D)/python_include/pymem.h" && cp "/opt/python3.6/include/python3.6m/complexobject.h" "$(@D)/python_include/complexobject.h" && cp "/opt/python3.6/include/python3.6m/structseq.h" "$(@D)/python_include/structseq.h" && cp "/opt/python3.6/include/python3.6m/datetime.h" "$(@D)/python_include/datetime.h" && cp "/opt/python3.6/include/python3.6m/pythonrun.h" "$(@D)/python_include/pythonrun.h" && cp "/opt/python3.6/include/python3.6m/pyhash.h" "$(@D)/python_include/pyhash.h" && cp "/opt/python3.6/include/python3.6m/pycapsule.h" "$(@D)/python_include/pycapsule.h" && cp "/opt/python3.6/include/python3.6m/setobject.h" "$(@D)/python_include/setobject.h" && cp "/opt/python3.6/include/python3.6m/listobject.h" "$(@D)/python_include/listobject.h" && cp "/opt/python3.6/include/python3.6m/bytesobject.h" "$(@D)/python_include/bytesobject.h" && cp "/opt/python3.6/include/python3.6m/pgen.h" "$(@D)/python_include/pgen.h" && cp "/opt/python3.6/include/python3.6m/patchlevel.h" "$(@D)/python_include/patchlevel.h" && cp "/opt/python3.6/include/python3.6m/opcode.h" "$(@D)/python_include/opcode.h" && cp "/opt/python3.6/include/python3.6m/parsetok.h" "$(@D)/python_include/parsetok.h" && cp "/opt/python3.6/include/python3.6m/pystrhex.h" "$(@D)/python_include/pystrhex.h" && cp "/opt/python3.6/include/python3.6m/marshal.h" "$(@D)/python_include/marshal.h" && cp "/opt/python3.6/include/python3.6m/token.h" "$(@D)/python_include/token.h" && cp "/opt/python3.6/include/python3.6m/iterobject.h" "$(@D)/python_include/iterobject.h" && cp "/opt/python3.6/include/python3.6m/abstract.h" "$(@D)/python_include/abstract.h" && cp "/opt/python3.6/include/python3.6m/py_curses.h" "$(@D)/python_include/py_curses.h" && cp "/opt/python3.6/include/python3.6m/metagrammar.h" "$(@D)/python_include/metagrammar.h" && cp "/opt/python3.6/include/python3.6m/Python-ast.h" "$(@D)/python_include/Python-ast.h"
+cp "/opt/python3.6/include/python3.6m/Python-ast.h" "$(@D)/python_include/Python-ast.h" && cp "/opt/python3.6/include/python3.6m/Python.h" "$(@D)/python_include/Python.h" && cp "/opt/python3.6/include/python3.6m/abstract.h" "$(@D)/python_include/abstract.h" && cp "/opt/python3.6/include/python3.6m/accu.h" "$(@D)/python_include/accu.h" && cp "/opt/python3.6/include/python3.6m/asdl.h" "$(@D)/python_include/asdl.h" && cp "/opt/python3.6/include/python3.6m/ast.h" "$(@D)/python_include/ast.h" && cp "/opt/python3.6/include/python3.6m/bitset.h" "$(@D)/python_include/bitset.h" && cp "/opt/python3.6/include/python3.6m/bltinmodule.h" "$(@D)/python_include/bltinmodule.h" && cp "/opt/python3.6/include/python3.6m/boolobject.h" "$(@D)/python_include/boolobject.h" && cp "/opt/python3.6/include/python3.6m/bytearrayobject.h" "$(@D)/python_include/bytearrayobject.h" && cp "/opt/python3.6/include/python3.6m/bytes_methods.h" "$(@D)/python_include/bytes_methods.h" && cp "/opt/python3.6/include/python3.6m/bytesobject.h" "$(@D)/python_include/bytesobject.h" && cp "/opt/python3.6/include/python3.6m/cellobject.h" "$(@D)/python_include/cellobject.h" && cp "/opt/python3.6/include/python3.6m/ceval.h" "$(@D)/python_include/ceval.h" && cp "/opt/python3.6/include/python3.6m/classobject.h" "$(@D)/python_include/classobject.h" && cp "/opt/python3.6/include/python3.6m/code.h" "$(@D)/python_include/code.h" && cp "/opt/python3.6/include/python3.6m/codecs.h" "$(@D)/python_include/codecs.h" && cp "/opt/python3.6/include/python3.6m/compile.h" "$(@D)/python_include/compile.h" && cp "/opt/python3.6/include/python3.6m/complexobject.h" "$(@D)/python_include/complexobject.h" && cp "/opt/python3.6/include/python3.6m/datetime.h" "$(@D)/python_include/datetime.h" && cp "/opt/python3.6/include/python3.6m/descrobject.h" "$(@D)/python_include/descrobject.h" && cp "/opt/python3.6/include/python3.6m/dictobject.h" "$(@D)/python_include/dictobject.h" && cp "/opt/python3.6/include/python3.6m/dtoa.h" "$(@D)/python_include/dtoa.h" && cp "/opt/python3.6/include/python3.6m/dynamic_annotations.h" "$(@D)/python_include/dynamic_annotations.h" && cp "/opt/python3.6/include/python3.6m/enumobject.h" "$(@D)/python_include/enumobject.h" && cp "/opt/python3.6/include/python3.6m/errcode.h" "$(@D)/python_include/errcode.h" && cp "/opt/python3.6/include/python3.6m/eval.h" "$(@D)/python_include/eval.h" && cp "/opt/python3.6/include/python3.6m/fileobject.h" "$(@D)/python_include/fileobject.h" && cp "/opt/python3.6/include/python3.6m/fileutils.h" "$(@D)/python_include/fileutils.h" && cp "/opt/python3.6/include/python3.6m/floatobject.h" "$(@D)/python_include/floatobject.h" && cp "/opt/python3.6/include/python3.6m/frameobject.h" "$(@D)/python_include/frameobject.h" && cp "/opt/python3.6/include/python3.6m/funcobject.h" "$(@D)/python_include/funcobject.h" && cp "/opt/python3.6/include/python3.6m/genobject.h" "$(@D)/python_include/genobject.h" && cp "/opt/python3.6/include/python3.6m/graminit.h" "$(@D)/python_include/graminit.h" && cp "/opt/python3.6/include/python3.6m/grammar.h" "$(@D)/python_include/grammar.h" && cp "/opt/python3.6/include/python3.6m/import.h" "$(@D)/python_include/import.h" && cp "/opt/python3.6/include/python3.6m/intrcheck.h" "$(@D)/python_include/intrcheck.h" && cp "/opt/python3.6/include/python3.6m/iterobject.h" "$(@D)/python_include/iterobject.h" && cp "/opt/python3.6/include/python3.6m/listobject.h" "$(@D)/python_include/listobject.h" && cp "/opt/python3.6/include/python3.6m/longintrepr.h" "$(@D)/python_include/longintrepr.h" && cp "/opt/python3.6/include/python3.6m/longobject.h" "$(@D)/python_include/longobject.h" && cp "/opt/python3.6/include/python3.6m/marshal.h" "$(@D)/python_include/marshal.h" && cp "/opt/python3.6/include/python3.6m/memoryobject.h" "$(@D)/python_include/memoryobject.h" && cp "/opt/python3.6/include/python3.6m/metagrammar.h" "$(@D)/python_include/metagrammar.h" && cp "/opt/python3.6/include/python3.6m/methodobject.h" "$(@D)/python_include/methodobject.h" && cp "/opt/python3.6/include/python3.6m/modsupport.h" "$(@D)/python_include/modsupport.h" && cp "/opt/python3.6/include/python3.6m/moduleobject.h" "$(@D)/python_include/moduleobject.h" && cp "/opt/python3.6/include/python3.6m/namespaceobject.h" "$(@D)/python_include/namespaceobject.h" && cp "/opt/python3.6/include/python3.6m/node.h" "$(@D)/python_include/node.h" && cp "/opt/python3.6/include/python3.6m/object.h" "$(@D)/python_include/object.h" && cp "/opt/python3.6/include/python3.6m/objimpl.h" "$(@D)/python_include/objimpl.h" && cp "/opt/python3.6/include/python3.6m/odictobject.h" "$(@D)/python_include/odictobject.h" && cp "/opt/python3.6/include/python3.6m/opcode.h" "$(@D)/python_include/opcode.h" && cp "/opt/python3.6/include/python3.6m/osdefs.h" "$(@D)/python_include/osdefs.h" && cp "/opt/python3.6/include/python3.6m/osmodule.h" "$(@D)/python_include/osmodule.h" && cp "/opt/python3.6/include/python3.6m/parsetok.h" "$(@D)/python_include/parsetok.h" && cp "/opt/python3.6/include/python3.6m/patchlevel.h" "$(@D)/python_include/patchlevel.h" && cp "/opt/python3.6/include/python3.6m/pgen.h" "$(@D)/python_include/pgen.h" && cp "/opt/python3.6/include/python3.6m/pgenheaders.h" "$(@D)/python_include/pgenheaders.h" && cp "/opt/python3.6/include/python3.6m/py_curses.h" "$(@D)/python_include/py_curses.h" && cp "/opt/python3.6/include/python3.6m/pyarena.h" "$(@D)/python_include/pyarena.h" && cp "/opt/python3.6/include/python3.6m/pyatomic.h" "$(@D)/python_include/pyatomic.h" && cp "/opt/python3.6/include/python3.6m/pycapsule.h" "$(@D)/python_include/pycapsule.h" && cp "/opt/python3.6/include/python3.6m/pyconfig.h" "$(@D)/python_include/pyconfig.h" && cp "/opt/python3.6/include/python3.6m/pyctype.h" "$(@D)/python_include/pyctype.h" && cp "/opt/python3.6/include/python3.6m/pydebug.h" "$(@D)/python_include/pydebug.h" && cp "/opt/python3.6/include/python3.6m/pydtrace.h" "$(@D)/python_include/pydtrace.h" && cp "/opt/python3.6/include/python3.6m/pyerrors.h" "$(@D)/python_include/pyerrors.h" && cp "/opt/python3.6/include/python3.6m/pyexpat.h" "$(@D)/python_include/pyexpat.h" && cp "/opt/python3.6/include/python3.6m/pyfpe.h" "$(@D)/python_include/pyfpe.h" && cp "/opt/python3.6/include/python3.6m/pygetopt.h" "$(@D)/python_include/pygetopt.h" && cp "/opt/python3.6/include/python3.6m/pyhash.h" "$(@D)/python_include/pyhash.h" && cp "/opt/python3.6/include/python3.6m/pylifecycle.h" "$(@D)/python_include/pylifecycle.h" && cp "/opt/python3.6/include/python3.6m/pymacconfig.h" "$(@D)/python_include/pymacconfig.h" && cp "/opt/python3.6/include/python3.6m/pymacro.h" "$(@D)/python_include/pymacro.h" && cp "/opt/python3.6/include/python3.6m/pymath.h" "$(@D)/python_include/pymath.h" && cp "/opt/python3.6/include/python3.6m/pymem.h" "$(@D)/python_include/pymem.h" && cp "/opt/python3.6/include/python3.6m/pyport.h" "$(@D)/python_include/pyport.h" && cp "/opt/python3.6/include/python3.6m/pystate.h" "$(@D)/python_include/pystate.h" && cp "/opt/python3.6/include/python3.6m/pystrcmp.h" "$(@D)/python_include/pystrcmp.h" && cp "/opt/python3.6/include/python3.6m/pystrhex.h" "$(@D)/python_include/pystrhex.h" && cp "/opt/python3.6/include/python3.6m/pystrtod.h" "$(@D)/python_include/pystrtod.h" && cp "/opt/python3.6/include/python3.6m/pythonrun.h" "$(@D)/python_include/pythonrun.h" && cp "/opt/python3.6/include/python3.6m/pythread.h" "$(@D)/python_include/pythread.h" && cp "/opt/python3.6/include/python3.6m/pytime.h" "$(@D)/python_include/pytime.h" && cp "/opt/python3.6/include/python3.6m/rangeobject.h" "$(@D)/python_include/rangeobject.h" && cp "/opt/python3.6/include/python3.6m/setobject.h" "$(@D)/python_include/setobject.h" && cp "/opt/python3.6/include/python3.6m/sliceobject.h" "$(@D)/python_include/sliceobject.h" && cp "/opt/python3.6/include/python3.6m/structmember.h" "$(@D)/python_include/structmember.h" && cp "/opt/python3.6/include/python3.6m/structseq.h" "$(@D)/python_include/structseq.h" && cp "/opt/python3.6/include/python3.6m/symtable.h" "$(@D)/python_include/symtable.h" && cp "/opt/python3.6/include/python3.6m/sysmodule.h" "$(@D)/python_include/sysmodule.h" && cp "/opt/python3.6/include/python3.6m/token.h" "$(@D)/python_include/token.h" && cp "/opt/python3.6/include/python3.6m/traceback.h" "$(@D)/python_include/traceback.h" && cp "/opt/python3.6/include/python3.6m/tupleobject.h" "$(@D)/python_include/tupleobject.h" && cp "/opt/python3.6/include/python3.6m/typeslots.h" "$(@D)/python_include/typeslots.h" && cp "/opt/python3.6/include/python3.6m/ucnhash.h" "$(@D)/python_include/ucnhash.h" && cp "/opt/python3.6/include/python3.6m/unicodeobject.h" "$(@D)/python_include/unicodeobject.h" && cp "/opt/python3.6/include/python3.6m/warnings.h" "$(@D)/python_include/warnings.h" && cp "/opt/python3.6/include/python3.6m/weakrefobject.h" "$(@D)/python_include/weakrefobject.h"
    """,
 )
 
 genrule(
     name = "numpy_include",
     outs = [
-        "numpy_include/numpy/oldnumeric.h",
-        "numpy_include/numpy/npy_1_7_deprecated_api.h",
-        "numpy_include/numpy/ufunc_api.txt",
-        "numpy_include/numpy/multiarray_api.txt",
-        "numpy_include/numpy/halffloat.h",
-        "numpy_include/numpy/npy_common.h",
-        "numpy_include/numpy/utils.h",
-        "numpy_include/numpy/npy_interrupt.h",
-        "numpy_include/numpy/npy_endian.h",
+        "numpy_include/numpy/__multiarray_api.h",
         "numpy_include/numpy/__ufunc_api.h",
         "numpy_include/numpy/_neighborhood_iterator_imp.h",
-        "numpy_include/numpy/ufuncobject.h",
+        "numpy_include/numpy/_numpyconfig.h",
+        "numpy_include/numpy/arrayobject.h",
+        "numpy_include/numpy/arrayscalars.h",
+        "numpy_include/numpy/halffloat.h",
+        "numpy_include/numpy/multiarray_api.txt",
+        "numpy_include/numpy/ndarrayobject.h",
         "numpy_include/numpy/ndarraytypes.h",
-        "numpy_include/numpy/npy_math.h",
         "numpy_include/numpy/noprefix.h",
+        "numpy_include/numpy/npy_1_7_deprecated_api.h",
         "numpy_include/numpy/npy_3kcompat.h",
-        "numpy_include/numpy/arrayscalars.h",
-        "numpy_include/numpy/npy_os.h",
-        "numpy_include/numpy/ndarrayobject.h",
-        "numpy_include/numpy/npy_no_deprecated_api.h",
-        "numpy_include/numpy/arrayobject.h",
-        "numpy_include/numpy/_numpyconfig.h",
-        "numpy_include/numpy/__multiarray_api.h",
+        "numpy_include/numpy/npy_common.h",
         "numpy_include/numpy/npy_cpu.h",
-        "numpy_include/numpy/old_defines.h",
+        "numpy_include/numpy/npy_endian.h",
+        "numpy_include/numpy/npy_interrupt.h",
+        "numpy_include/numpy/npy_math.h",
+        "numpy_include/numpy/npy_no_deprecated_api.h",
+        "numpy_include/numpy/npy_os.h",
         "numpy_include/numpy/numpyconfig.h",
+        "numpy_include/numpy/old_defines.h",
+        "numpy_include/numpy/oldnumeric.h",
+        "numpy_include/numpy/ufunc_api.txt",
+        "numpy_include/numpy/ufuncobject.h",
+        "numpy_include/numpy/utils.h",
     ],
     cmd = """
-cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/oldnumeric.h" "$(@D)/numpy_include/numpy/oldnumeric.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_1_7_deprecated_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ufunc_api.txt" "$(@D)/numpy_include/numpy/ufunc_api.txt" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/multiarray_api.txt" "$(@D)/numpy_include/numpy/multiarray_api.txt" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/halffloat.h" "$(@D)/numpy_include/numpy/halffloat.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_common.h" "$(@D)/numpy_include/numpy/npy_common.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/utils.h" "$(@D)/numpy_include/numpy/utils.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h" "$(@D)/numpy_include/numpy/npy_interrupt.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_endian.h" "$(@D)/numpy_include/numpy/npy_endian.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/__ufunc_api.h" "$(@D)/numpy_include/numpy/__ufunc_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/_neighborhood_iterator_imp.h" "$(@D)/numpy_include/numpy/_neighborhood_iterator_imp.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ufuncobject.h" "$(@D)/numpy_include/numpy/ufuncobject.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h" "$(@D)/numpy_include/numpy/ndarraytypes.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_math.h" "$(@D)/numpy_include/numpy/npy_math.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/noprefix.h" "$(@D)/numpy_include/numpy/noprefix.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_3kcompat.h" "$(@D)/numpy_include/numpy/npy_3kcompat.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/arrayscalars.h" "$(@D)/numpy_include/numpy/arrayscalars.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_os.h" "$(@D)/numpy_include/numpy/npy_os.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h" "$(@D)/numpy_include/numpy/ndarrayobject.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_no_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_no_deprecated_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h" "$(@D)/numpy_include/numpy/arrayobject.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/_numpyconfig.h" "$(@D)/numpy_include/numpy/_numpyconfig.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/__multiarray_api.h" "$(@D)/numpy_include/numpy/__multiarray_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_cpu.h" "$(@D)/numpy_include/numpy/npy_cpu.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/old_defines.h" "$(@D)/numpy_include/numpy/old_defines.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/numpyconfig.h" "$(@D)/numpy_include/numpy/numpyconfig.h"
+cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/__multiarray_api.h" "$(@D)/numpy_include/numpy/__multiarray_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/__ufunc_api.h" "$(@D)/numpy_include/numpy/__ufunc_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/_neighborhood_iterator_imp.h" "$(@D)/numpy_include/numpy/_neighborhood_iterator_imp.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/_numpyconfig.h" "$(@D)/numpy_include/numpy/_numpyconfig.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h" "$(@D)/numpy_include/numpy/arrayobject.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/arrayscalars.h" "$(@D)/numpy_include/numpy/arrayscalars.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/halffloat.h" "$(@D)/numpy_include/numpy/halffloat.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/multiarray_api.txt" "$(@D)/numpy_include/numpy/multiarray_api.txt" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h" "$(@D)/numpy_include/numpy/ndarrayobject.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h" "$(@D)/numpy_include/numpy/ndarraytypes.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/noprefix.h" "$(@D)/numpy_include/numpy/noprefix.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_1_7_deprecated_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_3kcompat.h" "$(@D)/numpy_include/numpy/npy_3kcompat.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_common.h" "$(@D)/numpy_include/numpy/npy_common.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_cpu.h" "$(@D)/numpy_include/numpy/npy_cpu.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_endian.h" "$(@D)/numpy_include/numpy/npy_endian.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h" "$(@D)/numpy_include/numpy/npy_interrupt.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_math.h" "$(@D)/numpy_include/numpy/npy_math.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_no_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_no_deprecated_api.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_os.h" "$(@D)/numpy_include/numpy/npy_os.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/numpyconfig.h" "$(@D)/numpy_include/numpy/numpyconfig.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/old_defines.h" "$(@D)/numpy_include/numpy/old_defines.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/oldnumeric.h" "$(@D)/numpy_include/numpy/oldnumeric.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ufunc_api.txt" "$(@D)/numpy_include/numpy/ufunc_api.txt" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ufuncobject.h" "$(@D)/numpy_include/numpy/ufuncobject.h" && cp "/opt/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/utils.h" "$(@D)/numpy_include/numpy/utils.h"
    """,
 )
diff --git a/tools/bazel.rc b/tools/bazel.rc
index 1c1e6afb65..913c4bc333 100644
--- a/tools/bazel.rc
+++ b/tools/bazel.rc
@@ -27,6 +27,10 @@ build --define framework_shared_object=true
 build:mkl --define=using_mkl=true
 build:mkl -c opt
 
+# This config option is used to enable MKL-DNN open source library only,
+# without depending on MKL binary version.
+build:mkl_open_source_only --define=using_mkl_dnn_only=true
+
 build:download_clang --crosstool_top=@local_config_download_clang//:toolchain
 build:download_clang --define=using_clang=true
 
@@ -36,8 +40,6 @@ build:cuda --define=using_cuda=true --define=using_cuda_nvcc=true
 build:cuda_clang --crosstool_top=@local_config_cuda//crosstool:toolchain
 build:cuda_clang --define=using_cuda=true --define=using_cuda_clang=true --define=using_clang=true
 
-build:win-cuda --define=using_cuda=true --define=using_cuda_nvcc=true
-
 build:mkl --define=using_mkl=true
 
 build:sycl --crosstool_top=@local_config_sycl//crosstool:toolchain
author	Madiyar Aitbayev <madiyar@google.com>	2018-08-03 20:26:48 +0100
committer	Madiyar Aitbayev <madiyar@google.com>	2018-08-03 20:26:48 +0100
commit	750ac07158c66839e675bacd30beb5b5ad0ea0f6 (patch)
tree	3d7476ce4481b43eb530fd50503cb8eb1dafb8a2
parent	3ce29e3b3341d8820442407fc260d3f3223b9ab8 (diff)
parent	dd45704d092dac87575b8ce39013f91f4f213dc0 (diff)